#databases
1 messages · Page 20 of 1
I googled it for you: https://www.freecodecamp.org/news/how-to-get-started-with-firebase-using-python/
This article a detailed guide that'll help you set up your Firebase database and perform simple CRUD operations on it using Python. Firebase, as you might know, is a platform provided by Google to accelerate app development. It offers BaaS or backend as a service, which means that Firebase
Read the data and print it. For example if that's SQLite then something like this https://stackoverflow.com/questions/17015980/python-how-to-print-sqlite-table
If that doesn't solve it for you try a help channel #❓|how-to-get-help
Much thanks!
can u help me with doing it with mysql.connector
Very similar but you'll need to look up the exact syntax. And like I said if you're stuck get a help channel
Mhm
Working with roles in postgres,
I dont fully understand the implication of one role being a member of another role, say I have something like this setup would this mean that public role inherits permissions from its parent role ?
I initially created a database through aditya then I attempted to create/insert tables into that database and always got a permission error. It ended up getting fixed when I declared projects was a member of aditya, I wanted to reduce the amount of permissions given to projects role since I would most likely be sharing it with a bunch of people
okay soo i made tests here and for some reason i don't know why it's not returning me the document, it's just sending me an empty list:
async def quick_search_default_character(self, *, user_id: int, name: str | None = None, prompt_prefix: str | None = None) -> None:
documents = list()
init = True
if name and prompt_prefix:
cursor = self.db.characters.find({'user_id': user_id, 'name': name,'prompt_prefix': prompt_prefix}, no_cursor_timeout = True)
while (await cursor.fetch_next):
data = cursor.next_object()
documents.append(data)
elif prompt_prefix:
cursor = self.db.characters.find({'user_id': user_id, 'prompt_prefix': prompt_prefix}, no_cursor_timeout = True)
while (await cursor.fetch_next):
data = cursor.next_object()
documents.append(data)
elif name:
cursor = self.db.characters.find({'user_id': user_id, 'name': name}, no_cursor_timeout = True)
while (await cursor.fetch_next):
data = cursor.next_object()
documents.append(data)
else:
cursor = self.db.characters.find({'user_id': user_id}, no_cursor_timeout = True)
while (await cursor.fetch_next):
data = cursor.next_object()
documents.append(data)
print(documents)
return documents if len(documents) > 0 else None
like, this code is suppose to make a quick search
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
okay i realize it was my stupid misstake
Hi, I'm trying to figure out a way to get a Text widget to auto update in Tkinter to show all the values in a SQLite3 database, and the code runs without errors, but the contents of the database aren't inserted into the Text widget. Here is the code concerning just the auto-update and the database:
import tkinter as tk
import sqlite3 as sql
root = tk.Tk()
conn = sql.connect('database_store.db')
c = conn.cursor()
c.execute("CREATE TABLE dataStore (data text)")
conn.commit()
conn.close()
data_show = tk.Text(root, wrap='word') #other params for height/width that i didn't add
data_show.grid(row=1, column=1, padx=(100, 0), pady=20, sticky='ew')
def update_textbox():
conn = sql.connect('database_store.db')
c = conn.cursor()
updated_db = c.fetchall()
for data in updated_db:
data_show.insert('end', f'{data}\n')
conn.commit()
conn.close()
data_show.after(1000, update_textbox)
update_textbox()
And here is the full code if that wasn't enough to find any issues: https://paste.pythondiscord.com/6GTA
Thank you in advance for helping me out!
Is this a db question or a tkinter question? I didn’t look at the code since it sounds like you’re saying the db part is fine?
But, now that I look at it, you have no query. You’re creating a cursor then calling c.fetchall() on an empty cursor
You should add some print statements so you can see / debug what’s happening. For example, add a print statement in the for loop to print the data.
I think its a db question since all the tkinter code would seem to work for inserting something like a list or string, but maybe its just how i called the database over
yeah I'll try that out, thanks
Well, you should do that in general. But in this case, you're not making a query: you need to query something. Like: c.execute('SELECT * FROM your_table')
Oh so the issue is that I am bringing in everything rather than specifically mentioning the table?
You’re bringing in nothing
You opened a cursor to the db, then did fetchall, but without a query, the cursor returns nothing.
ah alright. Thanks for explaining since I only recently started learning how to use databases for data handling while also applying it to a GUI
This seems to fix everything. Thank you again for helping me out and have a nice day/night 🙏
Also: .fetchall() will return a list of tuples (or an empty list).
for data in updated_db:
data_show.insert('end', f'{data}\n')
So data will be a tuple. I guess you want data[0] in that case.
Hello, has anyone had access setting up a connection to an MS access database? I have seen a few methods but struggling to get anything to work proper in the past.
I should mention I am not on windows
does it mean you are at MacOS, or at Linux, or at Android, or Chrome OS or at something else 😅
not recognized
I am on MacOS, but will potentially run on linux, but most importantly it means I dont have access to native MSaccess drivers.
I know UCanAccess JDBC works with DBeaver on Mac, but had issues getting this to be recognised in Python, I tried using JayDeBeApi but I dont know if it hit my skill ceiling or if there was an actual issue with thoes drivers.
i can suggest crazy alternative
What if u connect your MSaccess to sqlite3 as external source
and easily work in Macos/Linux through python with sqlite3?
which is natively supported in std libs
TLDR: selecting some more open source database engine as common denominator between your two environment
Postgresql, MySQL, Mariadb, Sqlite3
https://docs.devart.com/odbc/sqlite/access.htm quick googling says, Access supports sqlite3
ok thats interesting, thank you for the suggestion I will investigate
Makes sense, thanks for clarifying. I think the solution for selecting everything from the table is more readable though, so I might just keep using that
Hello everyone, i have a problem with "RETURNING" stantment in python, I use sqlite3 .3.43 and databases[aiosqlite] databases==0.7.0
Traceback (most recent call last):
File "/home/yosaki/.local/lib/python3.10/site-packages/discord/app_commands/commands.py", line 827, in _do_call
return await self._callback(self.binding, interaction, **params) # type: ignore
File "/home/yosaki/main/cogs/suggest.py", line 206, in suggest_
suggest_id = await db.add_suggest(i.user.display_name, consiglio)
File "/home/yosaki/main/utility/db.py", line 66, in add_suggest
return await database.execute(query=query, values=SuggestObject(member, 0, 0, suggest=suggest).to_dict())
File "/home/yosaki/.local/lib/python3.10/site-packages/databases/core.py", line 169, in execute
return await connection.execute(query, values)
File "/home/yosaki/.local/lib/python3.10/site-packages/databases/core.py", line 288, in execute
return await self._connection.execute(built_query)
File "/home/yosaki/.local/lib/python3.10/site-packages/databases/backends/sqlite.py", line 133, in execute
await cursor.execute(query_str, args)
File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/cursor.py", line 48, in execute
await self._execute(self._cursor.execute, sql, parameters)
File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/cursor.py", line 40, in _execute
return await self._conn._execute(fn, *args, **kwargs)
File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/core.py", line 133, in _execute
return await future
File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/core.py", line 106, in run
result = function()
sqlite3.OperationalError: near "RETURNING": syntax error
any idea? this error only appear in my remote VPS Ubuntu 20.04.6 LTS x86_64
query = "INSERT INTO suggest(member, suggest, positive_vote, negative_vote, status) VALUES (:member, :suggest, :positive_vote, :negative_vote, :status) RETURNING *"
return await database.execute(query=query, values=SuggestObject(member, 0, 0, suggest=suggest).to_dict())
I guess returning * is not a thing?
Why? but locally I don't have this problem....
Ah, it's supported 🤔
Not sure what's wrong then
it's creating but not displaying
why?
it was working normally in Alpha version
but then in beta version it's not working like intended
is sqlite 3.43.0 the version reported by python itself or something else? could you import sqlite3 and run print(sqlite3.sqlite_version)" to double check?
I installed sqlite3 not from pip but using the source code, and to manage the dbs and queries I use databases[aiosqlite]
well aiosqlite wraps the built-in sqlite3 library, and that links to some version of sqlite not necessarily the same as what is currently installed (ive had several issues with the wrong sqlite lib being used, and i dont know enough about the linking process to explain why they happened but i did eventually get it working)
so I have to update it?
iunno, what was the sqlite version shown by python?
'2.6.0'
not sqlite3.version, i mean sqlite3.sqlite_version
'3.31.1'
hm
Now the problem is, I uninstalled this version to install 3.43
you could try recompiling python? might be a bad guess though
but although, I have no idea where it gets the old version, maybe the sqlite library is compiled based on the one installed and made embed, now I try the recompilation
some time ago with my raspberrypi, after make installing my sqlite into /usr/local/lib, i had to fidget with /etc/ld.so.conf and sudo ldconfig as i believed the system sqlite was being prioritized over my local install
sqlite3.sqlite_version
still '3.31.1'
I don't know what to do really
how did you upgrade sqlite? where were the .so libraries placed?
Collecting pysqlite
Downloading pysqlite-2.8.3.tar.gz (80 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80.6/80.6 kB 10.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [1 lines of output]
pysqlite is not supported on Python 3. When using Python 3, use the sqlite3 module from the standard library.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details. @waxen finch
I completely reinstalled sqlite and python
but how did you install sqlite? are you talking about that pysqlite module? if so, thats the wrapper which links to the sqlite program and is not sqlite itself, plus its already built into python 3, hence the error
I'm talking about both
sqlite3 system and python sqlite
only that system sqlite is up to date, python's is not
i checked the ubuntu package index, 20.04 only comes with sqlite 3.31.1 (see libsqlite3-0)
personally i would compile sqlite 3.43.0 from source using the autoconf distribution
unpack it, enter the source directory, ./configure then sudo make install
already done nothing changes, I will update the whole VPS at this point.
if you want to then sure, jammy 22.04 has sqlite 3.37.2 and the minimum version for RETURNING syntax is 3.35
what db do you recommend for python?
anyone expert on peewee please message me im so stuck and noone seems to be able to help
for backend? Postgresql for is very good default for real work
Sqlite3 if it is just pet project level
Sorry for the ping, I solved it, I just forgot to run ./configure again after upgrading sqlite3
Hello! I have a question. I have an algorithm that gathers data on financial performance, keeps it and yields it to an AI. However, im facing problems feeding the ai bc the amount of data is too large. To solve this, I am trying with langchain models, but i dont know which one and how to use it. Does anyone know about it ? If not, do you know any alternatives?
langchain agents*
You may be better off asking in #data-science-and-ml
Hi everyone, i am trying for a scholar project to do a multi-tenancy application, i did a quite "good" architecture for my backend following the MVC structure (Model, Vue, Controller) and also DAOS to access the data to avoid having this in the model part.
I am now trying to migrate my tables from a single tenant structure to have a 1 PostgreSQL schema per "user" so he can have all his tables for himself.
My project is in FastApi, SqlModel and PostgreSQL. I need some help because i can't figure out a way to create schemas directly and i would love some help for this one, would be very thankful, thanks in advance!
Model, Vue, Controller.
That is wild
I recommend using some library meant for migrations. Alembic is meant to integrate with SqlAlchemy, may be it will work in standalone way too
Django ORM could be abused for library migrating purposes. Just using it's migrating module
I would recommend probably Django ORM if u will not find thing with better integration. It has quite comfy migrating paths
https://github.com/tiangolo/sqlmodel/issues/85
People wrote guide and examples how to do that with Alembic for SQLModel
I googled it for you 😂
https://dev.mysql.com/doc/refman/8.0/en/window-functions-frames.html
have a read here.
you want BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW probably
Wich datebase here use?
correct.
if you want to optimise, imo you should always refer to the query planner's plan.
potentially using recursive CTE if you want to short circuit it.
this is just a toy problem so i wouldn't have an issue with writing something like that, but if it's for work then i would really think twice and see if it's absolutely necessary.
also short circuiting it wouldn't necessarily mean it's faster in practice.
fwiw i think a "sufficiently smart" query optimizer should be able to recognize this 🙂
i don't know if any real optimizers actually do that
i wouldn't be surprised if some olap-oriented databases implemented optimizations like this
try it in duckdb, I guess :p
Having clause for speed
The DBA always gets mad at me when I do stuff like this without a having
I'm not proposing a code diff I'm answering the question of " how would I stop computing weights after hitting sum>1000" from ur first post.
Join on a.turn>=b.turn
Group by a.turn
Having sum(b.weight)<1000
Ur DBA may get mad that it's a quarter of the way to a cross join but he will be happy about the having clause and probaly overlook it
I don't know if it's better I'm just answering the question about how to quit counting after sum>1000
Write your MySQL query statement below
SELECT person_name FROM Queue
WHERE turn = (
SELECT MAX(turn) as turn FROM
(
SELECT q1.turn
FROM Queue q1 JOIN QUEUE q2
ON q1.turn >= q2.turn
GROUP BY q1.turn
HAVING SUM(q2.weight) <= 1000
) as table1
);
But there's the full working query if u wanna play around with it
And as for the prose " hello my name is Doug,
I don't use recursion,
Cuz it makes me need Tylenol;
Tylenol is bad for the liver
Having clause make ur DBA very very happy,
Greetings traveler "
I doubt this would be optimized out, but who knows
Is there not just a general chat or am I just overlooking that channel
In lobby, discussion
I think the idea was whether a database could shortcut a cumulative sum operation once it reached a certain threshold. I don't know of any that would. The recursive CTE would shortcut the evaluation, but I doubt a window function would/could be shortcutted.
i use notes as my database
If the cost of the cumulative calculation is prohibitively expensive, perhaps employ a search strategy to narrow the search space? @brazen lava
part of what i wonder about is whether the overhead of the CTE version is so great that it outweighs the "short-circuiting" benefit
e.g. you see this in pandas code a lot, where the algorithmically-efficient code is not any faster than, or significantly slower than, the inefficient but vectorized version
Recursive ctes are really iterative, there’s little room for optimization
i guess it depends a lot on what the query optimizer can do
right, but i don't know if internally they take a slower code path
Oh sorry, misread. Yah, agree.
Oh, I'd perhaps accumulate just from the first 1million rows, and if I haven't reached 10k, use the max cumulative and get the next million, and so on.
Nah, just separate queries
I don't know if I could do it in a recursive CTE: accumulating the next 1 million rows. I guess depends on whether the engine allows a dynamic limit/offset clause.
You'd check the last row of each search
import duckdb
target = 100000
last = 0
iteration = 0
batchsize = 10000
with duckdb.connect() as con:
results = []
con.execute("create table abc as select * from range(10000000) t(i)")
while last < target:
df = con.execute(f"select i, sum(i) over (order by i) as cumsum from abc limit {batchsize * (iteration+1)} offset {batchsize * iteration}").df()
results.append(df)
last = df.iloc[-1, -1]
iteration+=1
print(df.tail())
Yah, I don't think this approach can be used in a recursive cte... at least not in duckdb (limits/offset aren't allowed in recursive ctes). I could probably hack something together that used the max value from the last iteration and got the next n rows, but I can't imagine that performing very well
Good day guys it's my first time here!
My question is can someone suggest to me a video tutorial or a step by step guide on how to host twisted python to internet to work in real life not in local host
Thanks in advance
Hey guys, Im struggling with my database conception
player_close = sqlalchemy.Table('player-close', SqlAlchemyBase.metadata,
Column('player_id', Integer, ForeignKey('player.id')),
Column('close_id', Integer, ForeignKey('close.id'))
)
class Player(SqlAlchemyBase, SerializerMixin):
__tablename__ = 'player'
id = Column(Integer, primary_key=True, autoincrement=True)
coins = Column(Integer, nullable=True)
class Close(SqlAlchemyBase, SerializerMixin):
__tablename__ = 'close'
id = Column(Integer, primary_key=True, autoincrement=True)
winner = Column(Boolean)
team1 = relationship(Player, secondary=player_close, backref='closes')
team2 = relationship(Player, secondary=player_close, backref='closes')
timestamp = Column(DateTime)
I want to connect two column to the same backref
but getting property of that name exists
I don't think that's possible
But maybe I'm wrong : 🤔
You can make a third relationship with custom join
Do you want that relationship to be read only btw?
read and write
I was thinking about creating third table Team with just 5 columns of player and connect to Closers table
and dont flex on it
like this
but it look mess to be honest
1 sec
check dm
Maybe you can use a schema like this?
oh yeah
never thought about it...
thanks mate
To paraphrase Jamie Zawinski: “Programmer has database problem. Thinks: ‘I know, I’ll use an ORM.’ Now programmer has two problems.”
ORM is a tool, if you can't use it well - don't
Programmer has a problem, he thinks "I'll build SQL stringly", now he has three problems.
SQL is just another language. You learn to use it properly, just like you learn to use Python or JavaScript or HTML/CSS properly.
What's the problem with orms?
That was a total non-sequitur
it may help to also state what you think is the right approach to prevent common issues with people using SQL.
For instance relying on parametrized queries, ORMs and tooling (lint, etc.)
Yes. I did assume Lawrence actually knows about all that.
Show me an ORM that handles this:
def escape_sql_wild(s, escch) :
"escapes SQL pattern wildcards in s with escch. The same escch" \
" needs to be passed to the ESCAPE clause for the LIKE" \
" operator."
if not isinstance(s, str) :
raise TypeError("expecting s to be a string")
#end if
if not isinstance(escch, str) or len(escch) != 1 :
raise TypeError("expecting escch to be a single-character string")
#end if
result = []
for ch in s :
if ch == escch or ch == "%" or ch == "_" :
result.append(escch)
#end if
result.append(ch)
#end for
return "".join(result)
#end escape_sql_wild
Handles what? Show the problem, not the implementation of the supposed solution.
(And jesus, you really comment #end at the end of all the blocks?)
Question, does API is in the cloud not in your 'database'? Am I understanding it right? And if anyone knows your API Key, anyone can change it?
I don't know if this is related.
Sorry but that made no sense
from sqlalchemy import String, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import DeclarativeBase, mapped_column, Mapped
class Base(DeclarativeBase):
pass
class Book(Base):
__tablename__ = "book"
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(255), unique=True)
stmt = select(Book).where(Book.title.ilike("a;[]\\'/\""))
print(stmt.compile(dialect=postgresql.dialect(), compile_kwargs={"literal_binds": True}))
And you can pass escape parameter to ilike
SELECT book.id, book.title
FROM book
WHERE book.title ILIKE 'a;[]\\''/"'
He's asking if an ORM can do escaping properly? I hope not.
guys maybe this is too much to ask for but can anyone write a program for me to make a database for online flight reservation using mysql and tkinter(I need a UI for it) if possible maybe by tomorrow before 4:30 AM GMT?
Sorry, but nobody would do that for free, also with such a tight deadline
Also this server is not a place for paid requests
I understand. thx for responding
People would be more than happy to help you with your code, but they won't code for you
Umm, so where exactly do you supply the user-specified string to match?
What do you mean?
You’ll note that the function I posted took arguments, to be supplied by the caller. Your code does not.
...
Just pass it into ilike
OK, here’s the problem: I want to do a partial match on what the user enters. So I will use their input in, say “like '%input%'”. But I want the user input to always match literally, even if it contains “%” and "_” characters. That’s what my escaping function achieves. How do you do it with an ORM?
Oh, sorry, gotcha...
Like I said: “partial match”.
In other words, an ORM is no help. QED.
? You'd escape and still use an ORM for a partial match.
Becauese that's not what orm does 😂
Don't see why this is the ORMs problem.
So what is the ORM’s problem?
Are you asking how you'd do a partial match on the exact user input?
Yeah.
What database? PG?
Standard SQL.
msft sql server?
You’ll note that I did not use any DBMS-specific SQL syntax, just standard SQL.
If you don't want your user to fiddle with your query in some way (sql injections aren't possible btw) you could implement a functon like this: https://sqlalchemy-utils.readthedocs.io/en/latest/_modules/sqlalchemy_utils/functions/database.html#escape_like
If the user types “%”, and it happens to match an arbitrary substring instead of a literal “%”, then that’s a bug, whether you class it as “SQL injection” or not. Your ORM is of no help here.
Again, it's not an orm problem
Who hurt you?
Fine. That’s conceding my point, that ORMs are useless for solving this sort of problem.
Yeah, solving your specific problem
I'm not even sure, it's just: replace userinput's wildcard variables (% and _), append a '%' then pass that as a parameter to the like function.
def escape(text: str, escape_char: str) -> str:
return (
text
.replace(escape_char, escape_char * 2)
.replace('%', f"{escape_char}%")
.replace("_", f"{escape_char}_")
)
user_input = "a/b%c"
stmt = select(Book).where(Book.title.ilike("%{}%".format(escape(user_input, "/")), escape="/"))
print(stmt.compile(dialect=postgresql.dialect(), compile_kwargs={"literal_binds": True}))
SELECT book.id, book.title
FROM book
WHERE book.title ILIKE '%%a//b/%%c%%' ESCAPE '/'
hah, yes, that
It's still parameterized, even though there's some f-strings involved.
I don't know how I would have made anything remotely complex without an orm/query builder
But "orm bad"
I hate ORMs... they almost always made the complex stuff harder.
But, they made the easy/medium stuff super easy
It depends on the orm
but I'm a db person, so that didn't slow me down... I just use SQL for what SQL is good at.
What orms did you try? 
Oh, this is my past Java life... so mostly hibernate
You should take a look at sqlalchemy
Pretty much everything you can write in sql you can write in it
e.g. subqueries, ctes, aggregates, window functions, etc
Well, I do mostly analytic stuff, so there's no reason for an ORM
And, I'd rather write in SQL than in chained pythonic APIs
A single declarative statement is far more readable and debuggable, imo.
but I know people will fight me over that
Personally if I interact with DB from code I would use orm/query builder in 99% of cases
sqlalchemy is almost exactly like sql, but in python syntax, I don't think there's much to debug 🤔
stmt = (
select(Model.name, func.count(Model.id)
.where(Model.name.ilike("a%"))
.group_by(Model.name)
)
I think that's pretty similar to how SQL is written
There's dozens of APIs like that, which is part of the problem. It's yet another thing to learn, when all I need to know is SQL.
Yeah, in case of analytical queries it might now matter as much, but when you need to dynamically create queries, maintain them, etc. it's much easier 🤔
That's where dbt comes in
What's dbt?
E.g. I change/delete a model field, I would immediatelly get an error from my linter
Ah, that's not really would fit a usecase of a generic app that needs a DB imo
E.g. some kind of web application
no, not at all
Ah, yeh i had worked with someone that used hibernate that crushed the MSSQL server. this was because of threading and although i think that they just made a mistake, it was compounded by java philosophy that java is all you need to know. the actual result set should have been 100K rows. So i think you are more aware of a naked query result set than an ORM result set.
btw that was an interesting discussion you guys had and i agree with Billy that its sometimes simpler to bypass the ORM layer.
but ORM is nice too at times for the simple and medium stuff Billy said it well
fwiw dbt doesn't actually check if you change model fields, it doesn't parse your sql at all and is surprisingly "dumb"
also yeah their attitude to jinja templating is a little... cavalier
you'd think they'd at least offer a quote jinja macro, but nope. '{{ foo }}'
It’s data engineering, everything is cavalier:) a lot of duct tape and glue
it doesn't need to be. it's that way because all the tooling is designed that way
there's no reason dbt couldn't at least try to make inserting strings into sql slightly safe
it works because 99% of the time users are well-behaved
hey @harsh pulsar i know you think i don't know anything or perhaps should learn a bit more, if you dpon't mind ? would you see this https://paste.pythondiscord.com/QLHA
it's all the honor system
i actually meant to look at this, i got busy with work
thank you? its ok im not tripping but i do understand better then most, when you do have time, only because you are the only person that understand it.
let's move this back to #data-science-and-ml
sure
I'm a business intelligence person so I just use SQL for what SQL is bad at. I'm just in it for the witchcraft
It's fun that every time someone says they hate ORMs it's somehow always Hibernate they've worked with.
And when people like ORMs they've never used Hibernate :P
so i want to take data which is saved in database and show it on frontend
but this not working
import { useState, useEffect } from 'react';
import { Container, Table } from 'react-bootstrap';
export default function Admintable() {
const [ordersData, setOrdersData] = useState([]);
useEffect(() => {
const fetchData = async () => {
const response = await fetch('/api/admintable');
const data = await response.json();
setOrdersData(data);
};
fetchData();
}, []);
return (
<Container className=" container-fluid d-flex justify-content-center align-items-center">
<div className='m-5'>
<Table responsive>
<thead>
<tr>
<th>Email</th>
<th>Price</th>
<th>Date</th>
</tr>
</thead>
<tbody>
{ordersData.map(order => (
<tr key={order.email}>
<td>{order.email}</td>
<td>{order.totalprice}tk</td>
<td>{order.date}</td>
</tr>
))}
</tbody>
</Table>
</div>
</Container>
);
}
"Doesn't work" is not an error description
also, this is #databases but that looks like #web-development
router.post('/admintable', async (req, res) => {
try {
// Attempt to save a new Admintable object to the database
let { email, totalprice, date } = req.body;
const newAdmin = new Admintable({ email, totalprice, date });
await newAdmin.save();
res.json({ message: "Data saved successfully" });
} catch (error) {
// If an error occurs, send a 500 (Internal Server Error) response to the client
res.status(500).send("Server Error");
}
});
Failed to load resource: the server responded with a status of 404 (Not Found)
admin:1 Uncaught (in promise) SyntaxError: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
again: move this to #web-development
hi ppl
fwiw dbt doesn't actually check if you change model fields, it doesn't parse your sql at all and is surprisingly "dumb"
what does "change model fields" entail?
also yeah their attitude to jinja templating is a little... cavalier
ha yes indeed.
'{{ foo }}'
ha - personal taste i guess, i am completely fine with this.
(Windows)Can someone help? I'm trying install mysqlclient for my Django project but error pop up:
Failed to build mysqlclient
ERROR: Could not build wheels for mysqlclient, which is required to install pyproject.toml-based projects
by "change" i meant adding, removing, or renaming a column
at least as far as i know?
hmm yeah, the default behaviour is to ignore that.
that's an artifact of backward compat i guess - there are some documentation on this matter for incremental models at least (that's the model that matters the most in this case imo) https://docs.getdbt.com/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change
Question: Is there a way to use a well supported package to use schema definitions with mongodb with asyncio? Currently I don't think there is. The way I got to that conclusion was that mongoengine seems to be the standard way to do schemas for mongo, but mongoengine doesn't work with the motor package which provides mongodb compatibility with asyncio. there is a package called motorengine to get the same functionality. I looked up this package and it only supports Tornado which was the original use case of Motor before it was ported to asyncio. There is another package called AIOMotorEngine which has only 27 stars on GitHub and proudly states that it was ported from motorengine and that "All tests are rewritten too and seem to pass." this is pretty far from well supported in my view, and since this is for work, I want to use packages that I know have been extensively used and tested.
This seems a bit weird to me though because schemas, mongodb, and asyncio are all extremely popular and it seems like this combination would be better supported, so maybe there is an obvious approach that I'm missing.
Realistically I think if you want schemas with mongo... You should just use postgres
A large reason why nothing exists is because it's not really common for businesses and people who have the time to write those libraries to use mongo when they need a schema
hey i have got a question, should i store id's as bigint type or rather string and cast it to int in python?
model Guild {
id BigInt @id @unique
welcomeMessage String?
welcomeChannel BigInt?
leaveMessage String?
leaveChannel BigInt?
}
its postgres btw
by IDs you specifically mean discord snowflakes? if so bigints seem to be pretty common, despite being signed integers
yeah i mean snowflakes
fwiw 2^63-1 converted to a discord timestamp would be <t:3619093655>, or <t:3619093655:R>, so i probably wouldnt worry about it overflowing yet
is bigint 2^63?
the min value is -2^63, then another spot for 0, and the rest is 2^63-1
wiki: https://en.wikipedia.org/wiki/Integer_(computer_science)
psql numeric type ranges: https://www.postgresql.org/docs/15/datatype-numeric.html
yeah
i see, im still learning all the more "advanced" features
isn't that what odmantic is?
it's an "ODM", like sqlalchemy for mongodb, using motor for mongodb interaction and pydantic to define types and validate inputs/outputs
i've used it before, it's not bad at all
AsyncIO MongoDB ODM (Object Document Mapper) using python type hinting
I'm already using pydantic in this codebase for API interaction, so I will definitely look at this, thank you!
Hi does anyone knows if there is a database with company wise production data for agriculture sector (crops livestock dairy)? Preferably europe
At the individual company level? If such data exists I assume it will be expensive. If the data on Eurostat isn't detailed enough you can try https://www.reddit.com/r/datasets/
Ive created a login and registration system for a project but im unable to get a working database using SQLite, can anybody help me with this please?
Just post the problem you are having. Don't ask to ask.
Ill look into eurostats
async def run():
conn = await psy.connect(user='postgres', password='root',
database='postgres', host='localhost')
#values = await conn.fetch(
# 'SELECT * FROM mytable WHERE id = $1',
# 10,
#)
await conn.execute("CREATE TABLE IF NOT EXISTS test1(id SERIAL PRIMARY KEY, number INT)")
await conn.execute("CREATE TABLE IF NOT EXISTS test2(id SERIAL PRIMARY KEY, number INT)")
await conn.execute("CREATE TABLE IF NOT EXISTS test3(id SERIAL PRIMARY KEY, number INT)")
await add(conn)
async def add(co):
await asyncio.create_task(a1(co))
await asyncio.create_task(a2(co))
await asyncio.create_task(a3(co))
async def a1(con):
while True: # test1
n = random.randint(100000, 199999)
await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
async def a2(con):
while True: # test2
n = random.randint(200000, 299999)
await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
async def a3(con):
while True: # test3
n = random.randint(300000, 399999)
await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")```
How come this isn't running all three of them at once, it's only doing the first one. (AKA a1)
## async pg
it should run them concurrently, but your conn object might be holding a mutex or the database itself is limiting the concurrency if you are not seeing the values inserting
I wonder if anything is awaiting
If I'm not wrong postgres connections don't support concurrent operations
@torn sphinx Also you're not running your coroutines concurrently
And shouldn't a2() use test2 as table (same for 3)?
tasks = [a1(co), a2(co), a3(co)]
await asyncio.gather(*tasks)
Or in python 3.11:
async with asyncio.TaskGroup() as tg:
for fn in (a1, a2, a3):
tg.create_task(fn(co))
oh yeah i forgot about the immediate await lol
Ah so I need to group them then run all of them. In previous code I had two task and I did it pretty much the same as my other code and it worked, figured it worked again.
+ Exception Group Traceback (most recent call last):
File "c:\Users\m\Desktop\Test\V1\New\main.py", line 38, in <module>
| loop.run_until_complete(run())
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
| return future.result()
| ^^^^^^^^^^^^^^^
| File "c:\Users\m\Desktop\Test\V1\New\main.py", line 16, in run
| await add(conn)
| File "c:\Users\m\Desktop\Test\V1\New\main.py", line 19, in add
| async with asyncio.TaskGroup() as tg:
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\taskgroups.py", line 147, in __aexit__
| raise me from None
| ExceptionGroup: unhandled errors in a TaskGroup (2 sub-exceptions)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "c:\Users\m\Desktop\Test\V1\New\main.py", line 31, in a2
| await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\site-packages\asyncpg\connection.py", line 317, in execute
| return await self._protocol.query(query, timeout)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "asyncpg\protocol\protocol.pyx", line 323, in query
| File "asyncpg\protocol\protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
| asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
+---------------- 2 ----------------
| Traceback (most recent call last):
| File "c:\Users\m\Desktop\Test\V1\New\main.py", line 35, in a3
| await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\site-packages\asyncpg\connection.py", line 317, in execute
| return await self._protocol.query(query, timeout)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "asyncpg\protocol\protocol.pyx", line 323, in query
| File "asyncpg\protocol\protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
| asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
+------------------------------------```
Not sure what a mutex is
I have no idea what you're talking about
async def a2(con):
while True: # test2 <-------------------
n = random.randint(200000, 299999)
await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
^^^^^
i think without any further configuration this will all run inside one big transaction https://www.psycopg.org/psycopg3/docs/basic/transactions.html, but idk why that would cause this particular error
It's asyncpg
i see.. i saw psy in your code and assumed psycopg
lol yeah I always change it to psy when using libraries like this
Well mainly because I did use pyscopg
that's guaranteed to confuse people
Ah I see, fixed it. It still doesn't do both though
afaik one connection isnt capable of doing more than one query at a time, so you'd normally open multiple connections to your database to run each of your queries in separate transactions, e.g. ```py
async with asyncpg.create_pool(...) as pool, asyncio.TaskGroup() as tg:
tasks = [tg.create_task(run_query(pool)) for _ in range(3)]
async def run_query(pool):
async with pool.acquire() as conn:
await conn.execute("INSERT INTO my_table VALUES ('foo')")```
(by default create_pool() will immediately open 10 connections, but you can adjust that with min/max_size= if you want less)
googleapis has a way to pull youtube data
Is there a way to set unlimited? I plan on allowing alot of executions that could take 5 - 10 seconds to happen all the time. So It probably will go into the hundreds of executions
i dont think literally setting an unlimited number of connections would be a good idea because at some point your client/server won't have the resources to handle it, not to mention your database has a limit on the number of connections too (100 by default)
what specifically about your "executions" is taking up so much time? is it some kind of processing on the database, or on your client? if its the client, i would suggest batching your queries into a few transactions so you can release the connection during the heaviest parts, e.g. ```py
async with pool.acquire() as conn:
all_the_data = await conn.fetch("SELECT ...")
Work with data now...
async with pool.acquire() as conn:
await conn.execute("INSERT INTO ...")```
I have a user level column in my database. How do i select all users where the user level is greater tahn or equal to 3?
Please ping me if you respond.
I'll be storing billions of records in hundreds possibly thousands of tables. Each record will have to be found by going through each table and checking if there is a result that matches the query. This causes a lengthy process.
When I was doing it before with I think a couple hundred million records it took ~30 seconds
But that was using pyscopg so it could be faster with asyncpg
I fully coded the tool before but the only issue was it wasn't asynchronous so if one user called it, the next would have to wait the time left for that transaction + the time for the transaction they made
Also I'm confused on what's happeing here, for _ in range(3). So this is specifically if I have 3 concurrent executions?
if all that selecting/inserting needs to be done for each user request then i guess you would need to increase your pool size, but are you sure the schema is optimized for your use case? having thousands of tables sounds wrong to me
thats merely a list comprehension shorthand to create 3 different query tasks
you can't improve what you can't measure.
So I would suggest to monitor and dive into it so you understand where it spends the most time
The only other way to do it is if I had one table with a shit ton of different column names (since I'd have possibly thousands of different tables there would be a lot of different names). I'd then need to have a column name specifically for what that table would be, I suppose that could work but I don't think it's worth it. I'm not sure though.
How can I monitor it?
I'm not sure if it is slow due to the resources available or if it's just the way psycopg was. I'm going to have a similar testing situation as before so I can actually grasp what's taking majority of the time.
Based off of their testing I should be seeing somewhere near a 5x speed difference, this is a big difference. That means if it took 30 seconds before it would take ~6 seconds now.
there are monitoring libraries, there are also ways to understand your queries better like the explain keyword
If you wonder if it's slow due to the resources available, then you ought to monitor the resources available
be it resources on the db, the thread or connection pool from the app, etc.
Any library you'd recommend for this case? Also I'll look into the explain.
I would look into opentelemetry in case or some APM
Question, are these specifically for postgresql related things or is this a one in all type of library
it's applicable to everything and anything
Alright cool, I'll jot it down and use it once I have the same data set as I had before.
But overall, the answer is: science works, bitches
So make sure you can observe the variables at play, and don't hesitate to do load testing if needed
What's load testing? Also yeah with what you've guys taught me I'm much more knowledgeable now than I was 2 months ago when making the non asynchronous version of this. I have good hope for making this a bazillion times more efficient.
You may want to peruse through https://www.brendangregg.com/overview.html
But in a sentence, it's about creating test scenarios and making a bunch of requests against your service so that you understand when it starts falling apart and where it starts falling apart
At the end of the day, it's all about going through hypothesis and validating them
You should never have to guess
also, it's a very valuable and difficult skill to get
so it's great for your career too
Hey, can someone help me out with this question : I’m running 3 websites on a linux instance. For each website I have a sql database server running using docker container. So 2 postgresql docker containers & 1 mysql docker container. What I want to do is to migrate these db’s to a new postgres docker container in real-time, i.e copy the current data and keep copying the new data until its same. There are tools like pg_dump for posgres, but those are manual methods. Is there an automated approach to solve this problem, for eg : bash/ python scripts.
twice as fast as golang?? 👀
oh, I see, asyncpg is 25% cython. cool.
yeah most of the heavy lifting is done in cython
not that those numbers matter much most of the time, since you need a massive DB to start hitting those numbers
For pg to pg this looks like it might work out of box but I haven't tried it: https://github.com/ankane/pgsync
MySQL to pg will be a little more complicated I imagine.
Let's say you are in a position where you can't use Spark etc to process a huge chunk of data from a RDBMS (Postgres / MySQL) to calculate some value, but you need all the data from multiple tables to calculate it. What would your approach be? (stored procedures is a no-no)
what does this error mean from pymongo?
SSL handshake failed: ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992),SSL handshake failed: ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992),SSL handshake failed: ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992), Timeout: 30s, Topology Description: <TopologyDescription id: 64f388b2e820f7223e1abc5d, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>, <ServerDescription ('ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>, <ServerDescription ('ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>]>
it means it was unable to connect to the address given due to it failing to complete a handshake within the timeout
do you know what could be causing the problem
are you running some really old version of pymongo?
or is atlas having any service issues rn?
my pymongo version is 4.4.1
can you run python -c "import requests; print(requests.get('https://www.howsmyssl.com/a/check', verify=False).json()['tls_version'])" in your terminal
(assumes you have requests installed)
And/Or run openssl version
i ran this
from urllib.request import urlopen
print(urlopen('https://www.howsmyssl.com/a/check').read())
tls version is 1.3
openssl version gives me LibreSSL 2.8.3
I assume you're using atlas btw?
yes
What if you run python -c "import ssl; print(ssl.OPENSSL_VERSION)"
OpenSSL 1.1.1s 1 Nov 2022
right
so it looks like the openssl version python is using is outdated
well, outdated enough to not be able to connect to atlas
What python version are you using?
did you compile it from source?
i used pip
hmm?
oh shit
For the interpreter itself?
can you run python -c "import ssl; print(ssl.HAS_TLSv1_2, ssl.HAS_TLSv1_3)"
Just realised thats the openssl version not the tls version
true true
okay, well thats weird
you have the right TLS versions
So why is mongo using 1.1
what is your connection URI? (Remove the auth stuff if that's in there aka the mongodb://foo:bar@
"mongodb+srv://x:x@cluster0.2lrs6ai.mongodb.net/?retryWrites=true&w=majority"
can you add &tls=true to the end of the uri?
pymongo having some seriously poor error handling and no TLS upgrade support apparently
So it's not really a TLS version error, the error is just super miss-leading
pymongo just doesn't bother attempting to setting up TLS unless you explicitly tell it to
SeattleDataGuy:
https://www.linkedin.com/posts/benjaminrogojan_in-the-last-12-months-i-have-dealt-with-activity-7103463847377735680-MD3M
i worked on a system where mongodb was the only database, it was horrible
oof. and i guess you were on the analytics side of the house too
my condolences
🕯️
mongo: complex queries over ton of data -> falls over
Yes. You can use AND or OR to combine multiple conditions.
You only need one WHERE at the beginning.
mongo: any query which isn't a lookup on a hashtable -> falls over
How I can store data in mongo db?
Hey guys. Need a little suggestion. So I need create a nosql db (dynamoDb). The purpose is to store ads data in there. And the queries will be like filtering on a certain keyword and getting data based in a date range.
What would be the appropriate nosql schema for this task. The number of rows would be high likely over a billion.
Can you explain more about what ads detail is?
Google ads
When you search a keyword you will get multiple ads
But what about the ads? What kinda of information? Can’t design a schema without know the objects and attributes
Hmm title, description, datetime, keyword
4 attributes? Why nosql?
And I don’t really understand the question then, if just four attributes, the schema is just one table four attributes?
Yes I meant as in partition and sort key and all that kind of stuff
I have only worked with RDS so far
I think because of scale maybe. I didn't make that decision but that's what the team wants to use
If you’re only searching for keyword and date range, I’d think you’d partition in keyword with a secondary on date to support a range query
Yah, or if you are searching across entire, you could use a global secondary index… one sec find a ref
Ah, global secondary is like if you want ads in past 30 days irrespective keyword bin they belong to yeah?
Yah
So what will I put as a partition key in global index
in this case
Since we essentially just need a sort key and no partition
Personally, I find most people overthink this stuff early on before they know their query usage. Unless the team truly understands the queries they’ll need
I’d think you’d partition by keyword
Even for global secondary index?
Thinking more generally: I’m not an ad-guy, so I don’t know if this is optimal or even makes semantic sense (don’t searches have multiple keywords). Im just commenting on the schema you described
Gpt says make date the partition in gsi
You said you were searching keyword and date, not merely date.
Sure, you can put an index on any additional thing you want to filter on
Awesome. And do you see any use case of a local secondary index in this case?
Just relating to keywords and date
There are limits and costs. Locals are good when you’re doing an and search: keyword=Foo and date…
Are secondary indexes slower than normal ones?
Don’t think so, but they add overhead because they need to be updated during inserts
Cost wise?
No idea, been a while since I look at that
Aws charges you for everything so certainly there’s a cost for everything and every io
Overall tho, I’d really make sure you understand the queries and whether dynamodb makes sense here. If there’s queries are aggregate stuff like: average ad hits per day over the last year, I dunno if it’s the right solution
Secondary indexes don't have a read performance cost but it makes writes potentially slower/more expensive
As the cluster has to update the indexes across each node seperately
you dont want Dynamo for this though btw
It'll cost you an absolute fortune
Uniqueness: The combination of Partition Key and Sort Key (if present) must be unique for each item in the table.
Is this true?
No, at least not when I last checked
Silly gpt
partition keys and sort keys are not like primary keys
We doing serverless
Me, I’d probably dump the ad data into parquet files and do roll up analysis into a separate summary db
You have 3 fairly reasonable options for this depending on what latency you want:
- Clickhouse for the lowest latency possible, it'll be the most expensive option though and require manual deployment of clickhouse cloud.
- Redshift a bit cheaper than Dynamo but still expensive
- Athena, which is basically S3 datalake querying, your queries are in the seconds at best though, it is by far the cheapest option though
Ah dynamoDb even expensive that redshift
Both are pretty expensive for a lot of data
but redshift is more tailored for analytic queries
Depending on the queries you could also potentially use https://quickwit.io (Although disclaimer I used to work there and still work on the project) which will probably one of the cheapest options and still give you sub second latency ish
We are stuck with aws
Is this db part of customer facing app? Like interactive? Or for internal analytics?
Hmm, yes it will query ads both to show user straight away. But also to do something in the backend
All of the mentioned options run on AWS, just ClickHouse or QW require manual management to be deployed with something like a k8s cluster.
But queries are simple like select * from table where keyword='djsj' and dat in past 30 days
Or cassandra/scylla would work for the lowest latency possible
since these seem like single point click queries
Your primary issue is that you need to be able to search over billions of rows of text quickly, not necessarily NoSQL yeah?
yes
Talking to gpt, it said maintaining indexes at that scale is difficult in an rds
I kind of agree with GPT that a non-relational DB is probably well suited for this
What sort of queries per second are you expecting?
Not much I reckon, not many users
Internal uni usage by staff to post ads
And also to see competitor ads
and what is your budget for this roughly? On a monthly basis
That's a talk we haven't had. But ig not super low
If dynamo costs you per time then the queries shouldn't take long and that can't be a fortune like you said
It charges by just about everything, i.e. per op, per time, per GB, etc...
but if your read rate is very low (I.e. less than 1k QPS) and your data is like < 500GB then you are probably fine i guess
i think it'll be a couple hundred dollars a month ish
More or less repeating myself but: So why would you expect billions of hits? I’d aggregate, and create static roll up reports, rather than ad-hoc queries
that's no big problem
I am curious about this though
Since low QPS, and the queries seem to be very direct I.e. "Get n matches for keyword 'XY' in date range Z"
Aggregate what?
We need titles and descriptions
Of indivudal ads
How many results do you expect per query? Just one?
(And how many distinct ads would there be? Dozens? Hundreds? Thousands?)
I think they said upto around 1B rows
Yah but that’s events: ad hits
No probably more. Since the scrape for now is being done once per hour, for 20 regions. That's 480 per day
And say for past 30 days, that would be 480*30=About 15k
Your biggest issue is how the DB handles indexing and search over such a big dataset.
Actually 15k and each search might have returned 3 ads on average. So that will be about 45k rows
Like, could you just store two tables: ad and ad_hit? Ad_hit could be a rollup
Hang on a sec, they just want to get the ad title and description though, this isn't really an aggregation or searching thing
it's Keyword, daterange -> rows
Right?
yep
yeah Dynamo should be fine
So you index over time but then you still need something that searches over all the records in that timeframe somewhat efficiently
I'm lazy so my reflex would be to explore possible postgres plug-ins.
yeah making a Global secondary index on dat
Your partition key is the keyword
Your sort key would be the datetime
If you have multiple keywords per ad, you're better off duplicating the rows
How would the index over the text look like?
Yep they are duplicated already
one row per retrieved ad even though they came in same request
Hah, I’m biased towards my analytical world… I assumed hits per day was the type of metric they wanted.
Dynamo should be fine then, as long as you can target the specific partition key without having dynamo filter, it's fine
Awesome
The aim of the game with Dynamo/Cassandra/Scylla is you always want to filter to a specific partition (because the drivers can work out exactly what node the data lives on) so you're not scanning loads of data
So if you're doing that, should be fine
Yeah initial I was a bit confused coming from ad tech where we have petabytes of data doing queries of that 😅
And should I be worried about having a gsi on date. That query wouldn't be as common, that is to query with a keyword or a list of keywords. So will that gsi be a lot of cost for lot of rows?
Do you need a GSI on date?
like you want a list of keywords that appeared within a range?
You will probably be fine, it's a rare query and you don't have a huge amount of data or query rate.
If I remember right GSIs are built upon the materialized view system, so you do effectively double your data size though
I'd still look into how text is indexed with that GSI thing.
It's a hash
from a very high level view, a GSI effectively creates a new table with the field(s) you want to build the index on as the partition key
Which is fine for text cause you only need equality anyway.
might need to pass a list of keywords as well so that is also just a list of equalities, yeah?
yes
internally the DB hashes each key and then works out what partition the key lives in based on the hash
Wouldn't be equality in case dat though
When I hear requirements like this, I can only imagine the other requests that will come in post implementation: ‘tell me all the keywords that this ad hit for’, or ‘which ads with xyz in the title had the most hits’
https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html has a good overview of GSIs, even though it as a DB is different to Dynamo, they all use the same principle
ScyllaDB is an Apache Cassandra-compatible NoSQL data store that can handle 1 million transactions per second on a single server.
It would be a range on the gsi
may<dat<june
I haven't used Dynamo. Imo indexes in DBs are a bit of a leaky abstraction. You should read the documentation to verify if it's up to spec for you.
In those cases you go beyond regular DB systems and enter search engines
If you're doing things like which ads with xyz in the title had the most hits in your postgres or athena queries you're going to be there a while
Well, @coral wasp asked the right questions already "is this an operational or analytical system"
If it's really just searching for keywords my broke solution might've been to use an RDBMS and do it somewhat properly:
Before inserting split all the text into distinct words.
Have an ad table: ad(PK, date, title)
Have a bridge table: ad_keyword_bridge(FK_ad, FK_key)
Have a keyword table: keyword(Keyword). The keyword can just be the PK. You hash this.
Your query plan would look quite good; If you put a good ol' B-tree or similar index on date and a hash on the keyword you'll be fine. You're also storing way way less data this way because your keyword table contains a unique set of keywords.
Maybe I misunderstood a requirement but it aint that deep, no?
There are specialized indices that do this stuff in most databases, PostGres has two or three different indices, including full text and trigrams (for partial matching)… but agree this doesn’t scale so well
It's not just about full text search imo but also just storage size.
Yup, just pointing out you don’t necessarily need to split the words out.
Depending on how many ads there are per 30 days (if that's truly their timeframe) this is a trivial solution. It depends on the type of query they're doing. What I'm suggesting might scale gracefully and is conceptually very simple and DB-wise idiomatic, I'd benchmark it 🤷
I think you two might be slightly overengineering this 😅
Now we need to decide on a caching layer
I'm old school so I think No-SQL is overengineering

3 tables in a RDBMS is not overengineering
That is probably not a great way of going about things
I feel that way about mongo (or perhaps it’s under engineering), but dynamodb is really good at what it does
I'm exaggerating a bit but this is a very simple schema imo. I can't count the amount of times I've gotten Mongo from web devs that needed to become relational downstream.
That's just because people fan boy over using Mongo because it's easy to do something quick and dirty VS actually thinking about the schema.
I like nosql when it has a clear advantage but I'm pretty sure their schema is "fixed" and if they're not doing this on their entire DB it'll scale while likely having a lower storage cost so idk what the advantage of dynamo is.
This might be a me-problem, I'm curious to know why using 3 tables is worse than dynamo. I haven't used it because I'm squarely in an Azure country
Dynamodb isn’t nosql in the sense that mongo is
(Lacking the right words right now to express this, I’m going to get some coffee)
It is serverless point -> click access.
Dynamo is the same as Cassandra and Scylla if you've used either of those.
But the base concept is the partitioning of data and querying of data is done such that a query only ever needs to hit on database node/shard. Sometimes even down to what CPU core the data lives on.
So rather than your relational DB which might have several indexes and table to JOIN on which get expensive to scale, Dynamo and that go Keyword 'x' lives on partition 'Y' which is one node 'Z' which is a single round trip and much faster.
They operate on the "Many cheap queries per second is better than fewer expensive queries" I.e. issuing 40 queries for 40 keywords is cheaper than issuing one query, because the client can run all 40 queries concurrently and only talk to the DB node which actually has that keyword, rather than randomly picking one node, and then that node having to work out what keyword lives on what node and then sending the query to each node that has the data, which has a much higher number of round trips which increases latency.
On Azure you have MPP versions of SQL server where you determine your hash and storage strategy
Now for this amount of data, it doesn't matter that much because relational DBs can still realistically handle the data fairly easily, but in this case you're on serverless AWS instances where accessing dynamo is miles faster and easier than RDS for this workload
Like round robin over all nodes or "store everything of one node together" etc.
It's also serverless
I have some 😅 less than positive opinions of SQL server xD
On-prem or cloud?
Both
This is how it's actually behaving as well when I am pushing a df. It says contains duplicate rows
You might need to do PK: keyword, SK: (dt, ad_id)
oh okay, gpt suggests just a random id
not much point of it though. Should just allow to put this directly
It doesn't really have any use other than stopping the conflict
The rules and guidelines that apply to this community can be found on our rules page. We expect all members of the community to have read and understood these.
8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.
is there any incentive with using the cli client over the workbench? i mean the workbench is so much more intuitive
Use whatever works for you.
question from a novice here is mysql meant to be clunky , it seems unintuitive and inefficient to use ,excuse my ignorance as i'm only a 12th grader
or am i oblivious to it's greatness i mean there ought to be a reason for it's popularity
cause there's plenty of alternatives out there
Depending what you want to do sometimes CLI is the only option or can be just easier. But even advanced professionals generally spend a lot of time in DBeaver or another GUI tool
oh and quick question if we dont mention the size what value will it take by default
it's 1 right
My opinion is the same as @fading patrol - use whatever works for you
I use postgresql and rarely use it's cli directly, but I often write raw sql
got it mate
Specific tool shouldn't matter
well in this scenario it kinda does my school is forcing me to use the msql cli and it's really annoying cause i cant really move up lines and edit my text
that's why i've resorted to using the workbench for now
Yeah, that makes sense
If you're learning SQL - use a tool that allows you to write it more comfortably 
mind answering this question for me bud?
Don't ask to ask
i've heard this numerous times lmao
oh and is there a way to "lock" queries in the workbench for future references?
for example i would like to lock the output to another pane
I don't really use mysql and whatever tool you're using too
But you can just write them to files 🤔
it's the workbench which is installed alongside mysql
eh might just do that
@fading patrol i need help :/
Also, for school work and the like, you can also use: https://dbfiddle.uk/
a free online environment to experiment with SQL and Node.js
thanks bud
anyone here work with sql alchemy?
I don't really (rarely), but lots of people here do. Ask your question, someone might be able to answer it.
I don't either, but I think I just spaghettited a fix
ahh dead 💀
I do, also don't ask to ask 😛
@halcyon dew i saw your message about aiosqlite connection pooling yesterday #discord-bots message, and i was curious about it so ive written up a few benchmarks with asqlite to see how multiple connections perform
https://github.com/thegamecracks/asqlite-bench
performance apparently degrades very quickly: ```py
py -m asqlite_bench -c 1 queries\2-selects.toml
Finished in 0.512s
py -m asqlite_bench -c 2 queries\2-selects.toml
Finished in 0.969s
py -m asqlite_bench -c 10 queries\2-selects.toml
Finished in 7.106s
py -m asqlite_bench -c 100 queries\2-selects.toml
Finished in 7.643s``` the numbers suggest that asqlite performs better with a limited number of connections, though it doesnt make sense because SQLite is supposed to allow concurrent readers...
Ah thanks for testing this. I didn't realize that
similar results came out of manually running the select statements in different threads, although it was overall faster
https://paste.pythondiscord.com/YTLQ
1 conn: 0.212s
2 conns: 0.589s
10 conns: 6.600s
100 conns: 6.892s
what in particular are you stuck on? i think your suggested table is fine, although the ordering of user1 and user2 might be ambiguous
what will you use for the row's primary key? if its (cmd, user1, user2), then (cmd, user2, user1) would be considered a separate interaction
Sort the pks of user1 and user2.
^ although as a derivative question, how would the query be written if it was a many-interactions to many-users relationship? for example: sql CREATE TABLE interaction ( id INTEGER PRIMARY KEY, cmd TEXT NOT NULL, count INTEGER NOT NULL DEFAULT 0 ); CREATE TABLE interaction_user ( interaction_id INTEGER REFERENCES interaction (id), user_id INTEGER, PRIMARY KEY (interaction_id, user_id) ); im thinking of a self-referential join but im not sure if there's a better way: sql SELECT i.cmd, i.count FROM interaction_user iu1 JOIN interaction_user iu2 ON iu1.interaction_id = iu2.interaction_id JOIN interaction i ON i.id = iu1.interaction_id WHERE iu1.user_id = ? AND iu2.user_id = ?;
Need some help with cassandra, if anyone's around. Got a list<text> type, and when I call for cassandra to return it, it only returns the first ten items in the list, regardless of what I set the LIMIT to
What is your query
Also in general it's recommended to avoid the list datatype if you can, really you want a set unless you really need duplicates or sorting
Need duplicates
SELECT wallet FROM madeas LIMIT 1300;
returns only 10 items
take a screenshot of the rows returned
Only 1 row
Aight, that can be a bit risky though since you can create duplicates without knowing under normal usage
but thats expected because there is only 1 row
What are you using to query the data?
my problem is the fact that the row thats returned, the list has 12 items, but the query only returns 10
Python datastax driver
Can i see the code?
How do you know it has 12 items btw?
Proper weird. Now I only have 10 items. The table must not have updated in my client (TablePlus). Appreciate the help anyway
Had some help with this a couple of days ago but I'm still confused so I'm gonna post again.
import asyncpg as psy
import asyncio
tables = ('test1', 'test2', 'test3')
async def start():
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
async with asyncio.TaskGroup() as tg:
for fn in tables:
tg.create_task(table_check(fn))
async with asyncio.TaskGroup() as tg:
for fn in tables:
tg.create_task(search(fn))
async def table_check(i):
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
await conn.execute(f"CREATE TABLE IF NOT EXISTS {i} (value INT)")
async def search(i):
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
print(i)
a = await conn.execute(f"SELECT COUNT(id) FROM {i}")
print(a)
loop = asyncio.get_event_loop()
loop.run_until_complete(start())```
test1
test3
SELECT 1
SELECT 1
test2
SELECT 1```
^ Output
It should be printing how many records there are
@torn sphinx didn't work g
+ Exception Group Traceback (most recent call last):
| File "c:\Users\m\Desktop\Test\V1\New\search.py", line 27, in <module>
| loop.run_until_complete(start())
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line
653, in run_until_complete
| return future.result()
| ^^^^^^^^^^^^^^^
| File "c:\Users\m\Desktop\Test\V1\New\search.py", line 11, in start
| async with asyncio.TaskGroup() as tg:
| File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\taskgroups.py", line 147, in __aexit__
| raise me from None
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "c:\Users\m\Desktop\Test\V1\New\search.py", line 24, in search
| print(a[0]['count'])
| ~~~~^^^^^^^^^
| TypeError: string indices must be integers, not 'str'```
actually not in that particular case, but i was responsible for developing something akin to an analytics database... in mongodb 🫠 . lots of $lookup
admittedly that's because i had/have poor intuition for mongodb system design and i basically translated it to a relational model. because realistically it was relational data
i spent a long time thinking about how much i could precompute to take advantage of the "document" thing, but the computations all involved lookups across entities so the pre-computation upon insert would end up getting expensive, and i thought it was a lot less bad to have slightly slower lookups than to run the risk of accumulating backpressure due to a large number of updates/inserts, because the thing was being fed by a webhook triggered by updates in the cms
in hindsight i was probably too worried about that because the group of users who were actively editing things in the cms was small. with the benefit of hindsight i probably could have done without the giant pile of lookups in the main query
fortunately the lookups were all lookups on unique keys so we just added some indexes and the performance was good enough that we didnt need to pursue further optimization
sometimes its hard to figure this stuff out in the moment tbh
indeed
funny enough, one of the products in the portfolio atm only has a small number of users but theyre kinda optimizing too much for them as well but what do i know

yeah it takes discipline to not spend too long optimizing something that already has 50ms round trip response time and serves 10 concurrent users max per day
its taking up two dev teams but not like im in charge
damn ok
damn thats already good
i thought i was bad taking an extra week to polish my data science code for posterity
i would take 300ms
heck yknow how ML stuff is sometimes, sometimes we're looking at SECONDS in latency
python with asyncio and gunicorn+uvicorn is surprisingly fast
but usually that stuff is async or internal so its fine
for now
always aws vpc smh
but that's all we need, the current app im finishing up is for an internal api only
i like it that way
keep that frontend away from me
they will have you redo and redo until the wheels come off the wagon in that world

lol
i dont understand it
this is getting offtopic now but the frontend tooling churn is really insane
oh yeah i forgot we were in db. oops
anyway currently something like 50% of our latency is just database lookups. id actually be curious about this
our main query is something like
select *
from records
where group_id = :1 and created_time <= :2
order by created_time desc
limit 1
and the schema for the table is something like this in timescale:
create table records (
group_id uuid,
created_time timestamp,
-- other attributes ...
primary key (group_id, created_time)
);
select create_hypertable('records', 'created_time');
and im wondering when i know if i need to add an index on group_id, like what tea leaves should i be reading in the explain output and/or what kinds of metrics should i be collecting about postgres and timescale
or does postgres actually create an index for lookups because it's part of the primary key?
i can never remember how that works in different databases
my intuition was that within each chunk of the hypertable, table scans should be reasonably fast, so i didn't feel like we desperately needed anything special here. i don't have a query plan handy but it seemed pretty "normal"
hmm yeah if you ever need to optimize, those would be good places to look/investigate first. lmk if you end up doing so
sure. that or i might never touch it again 
I guess since group_id, created_time is a composite PK/Index it does not allow you to efficiently filter and sort by created_time
Could you try slapping an index on created_time?
import asyncpg as psy
import asyncio
tables = ('test1', 'test2', 'test3')
async def start():
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
async with asyncio.TaskGroup() as tg:
for fn in tables:
tg.create_task(table_check(fn))
async with asyncio.TaskGroup() as tg:
for fn in tables:
tg.create_task(search(fn))
async def table_check(i):
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
await conn.execute(f"CREATE TABLE IF NOT EXISTS {i} (value INT)")
async def search(i):
conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
print(i)
a = await conn.execute(f"SELECT COUNT(*) FROM {i}")
print(a)
loop = asyncio.get_event_loop()
loop.run_until_complete(start())```
test1
test3
SELECT 1
SELECT 1
test2
SELECT 1```
^ This is the output, I want it to be outputting the count, which I tried count(id) (mentioned by someone previously)
see the API reference:
https://magicstack.github.io/asyncpg/current/api/index.html#asyncpg.connection.Connection.execute
conn.execute() returns the status of the query (which is apparently an sql string?), but what you want is the row/values of that query, so you should use fetch() or fetchrow() or fetchval() instead: ```py
rows = await conn.fetch("SELECT ...")
row = rows[0]
count = row[0]
OR
row = await conn.fetchrow("SELECT ...")
count = row[0]
OR
count = await conn.fetchval("SELECT ...")```
Appreciate it mate
Do you know if you're able to add columns to a table?
Like after it's created
there's a lot of things you can alter about a table post-creation
https://www.postgresql.org/docs/current/sql-altertable.html
i mostly toy around with postgresql through a pgadmin container though, so i wouldnt be aware of specific details
Appreciate it g. Also some other questions, I'm storing a lot of data. I planned on doing in separate tables as some have similar data but others have a lot different, I could end up with about 50 different columns in total. Should I do this all in one table (storing no more than 50b records) or in hundreds possibly thousands of tables.
i think having that many tables would be cumbersome, but i don't know enough about the data you're modelling to give an appropriate answer (im no professional database designer either, so i probably wouldnt know anyway)
actually why would you have thousands of tables if the total number of columns is ~50?
or did you mean ~50 columns for each table
if you dont need to enforce any constraints on the data or access/modify individual columns (i.e. selecting the entire row), perhaps you'd be interested in denormalizing your schema with json[b] columns, arrays, etc.
If I do one table, I'd have a lot of columns do to the different types of data being stored. The issue is I'll need to do searches through all of the tables if I have multiple. meaning to find results (anything with related data needs to be pulled) I'd have to loop through each table.
I've never used constraints, not sure how they work.
er, a constraint generally defines some rule/assertion about the data, like a column being NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, etc.
I won't really have those, the data is all pretty much csv files I'm dumping. Most of the columns will probably be text aswell (to avoid issues)
oh, so you're debating between giving each csv file their own table vs making it available in one? if you again dont want to constrain the data or modify it, i imagine a table with an indexed filename and an array column would be okay for querying sql CREATE TABLE datasets (filename TEXT, "row" TEXT[]); CREATE INDEX ON datasets (filename); though the lack of a schema makes it hard to work with the COPY command, so perhaps that's not a good idea
What's an indexed filename?
Not sure what the copy command is either, I made something like what I'm trying to make like 2 months ago but it was using pyscopg
I used pandas to convert csv files into tables which would then dump the data, I typed in the columns that needed to be created and it would create the table based off of the csv file's name
i just mean using the filename of each CSV you're importing as a way to distinguish which rows come from which file
And by an array column do you mean I'd have one column which would be the entire row rather than having a column for each value?
That's what I planned to do
its there to import/export files on the server directly to/from tables, csv format included
yeah
amending the schema to store the filename in a single place (less duplication/smaller storage requirement), along with its columns: sql CREATE TABLE dataset ( id SERIAL PRIMARY KEY, filename TEXT UNIQUE, columns TEXT[] ); CREATE TABLE dataset_row ( dataset_id INTEGER REFERENCES dataset (id), "row" TEXT[] ); CREATE INDEX ON dataset_row (dataset_id);
What's the reason for creating two different tables?
less duplication, the filename is stored once and a number is used in-place of it instead of storing 5 billion filenames
same goes for the columns column
alternatively you can look at it and say one csv file has many rows, a.k.a. a one-to-many relationship
Basically relearning this so I'll lyk what I'm thinking is happening. You're creating a table which has a SERIAL PRIMARY KEY column under the name id. You then create a column called filename (text), which is unique. Meaning that only one of whatever text is inside of it can exist, not sure how it manages that though. Then you create another column which is test, I think the [] turns it into an array. Then you create another table which is linking a column it creates to dataset id, I think this means that its grabbing each unique row for each table name aka filename. Then a row with another column called text which is an array.
Not sure what the index thing is, also half certain about what's happening
i need some sleep now, but for the time being you can check out postgres's docs about indexing and foreign keys
11.1. Introduction Suppose we have a table similar to this: CREATE TABLE test1 ( id integer, content varchar ); and the …
each interaction gets its own identifier, and the interaction_user table associates many users with many interactions (in other words a user can be in many interactions, and an interaction can have many users)
Hey guys. I saw that dynamoDb returns paginated responses. Isn't that gonna slow down the data retrieval rate if there are a lot of rows
it's only giving 700 rows at a time. So could be 100 calls to the api
50/50 between here and #software-architecture but for those that use raw SQL, how do you guys keep it organised?
Having queries everywhere is not DRY whatsoever on the other hand stored procs or views place business logic in the DB.
I'm a bit "tired" of ORMs and buying into a unit of work and a repository pattern on top of MVC but a compromise would be a query builder a la SQlalchemy core.
Honestly I just use ORM where it's possible, using a query builder would essentially provide you with the same result as writing raw sql
And ORM really just maps your results to objects, something you'd probably do yourself anyway 
Yall know hwere can i find a massive list of websites?
Are you familiar with dbt? I primarily do analytical stuff, so almost all straight sql
Yes but it's a different use case
guys, i need help settling up my database:
i don't know why my username and password is not working
i'm trying to settup atlas
oh hey congrats on the helper role 
Yeah this makes the most sense for my use case. You told me last time and I've decided I'll do that
For analytics type stuff I don't mind managing raw SQL because code there doesn't really really have to be DRY because the same concept is used over and over. If it's reused I just version control it 🤷
It would still make sense to use core/orm becaues that way your query results are typed and if something changes in your model (e.g. column name) it would change in that query too
For me speed and memory usage are a big big requirement like I mentioned. I use https://github.com/sfu-db/connector-x so I don't go OOM.
For the web stuff I do this isn't so much of a requirement but honestly, using connector-x was an incredible speedup.
"If there are many rows in test1 and only a few rows (perhaps zero or one) that would be returned by such a query, this is clearly an inefficient method." Should I not use an index as I can expect multiple returns, possibly hundreds?
I think you already mentioned, and it really depends on how you used your orm
Nothing would be able to load a large dataset into memory for example
I do manage loading tens of millions of rows, which I all need at the same time so I can't do it in batch (or trust me I would). It's worth looking at the project because you never know, I don't think the current ORMs could handle this.
having a lot of rows to select does not necessarily mean that an index is inappropriate, its the ratio of rows you're looking for compared to the total rows being searched through - if you had a pool of 5 billion rows and you specifically want a million rows associated with a CSV file, not having an index would mean scanning all 5 billion rows to pick out the million rows you want
so relatively speaking, you still only want a few rows at a time
I see okay, so I should 100% be using indexes. What I don't get is how they work, and what they effect.
In what ways will it change?
lmao
why is this being posted here
wrong channel
ah okok
thought this was regular chat
there are several algorithms/data structures involved in indexing, but binary trees are a pretty common one so you can think about that first - if you were to convert this list of numbers, [1, 1, 2, 3, 5, 8, 13, 21], into a sorted binary tree, you end up with something like this: 3 / \ 1 8 / \ / \ 1 2 5 13 \ 21 how do you find 5 in the binary tree? you start from the root and work your way down: 3 is less than 5, so you move right, and 8 is more than 5, so you move left, and you find that 5 exists
this only required 3 comparisons, whereas if you were to find 5 in the list, assuming you didn't know the list was sorted, you would need to start from the beginning and check every number until you found 5
in the worst case that would take 8 comparisons, which would be 2.6x slower than the btree
when you create an index on dataset_row.dataset_id, every row gets a corresponding entry in your index containing dataset_id and a pointer to where the row is stored on disk, so out of 5 billion rows, you'd only need to make at most 32 comparisons to find a row matching a specific dataset_id
That makes sense, that's pretty cool thanks for writing this lol. But yeah I get the idea now, I'll use it because that is a huge difference, especially when dealing with billions of records.
though after the index match, postgres doesn't make any guarantees about how it orders rows with the same dataset_id, unless you specify a column with the ORDER BY clause - if you do care about the order of your data, you might want to add a position column and add it to your index so you can efficiently sort by (dataset_id, position)
Order won't really matter, It'll all be one table so as long as it matches the order isn't important
when you want to write down an interaction between two users, you insert three rows; first row goes into interaction, then one row for each user participating in that interaction sql INSERT INTO interaction (id, cmd, count) VALUES (1234, 'rock-paper-scissors', 1); INSERT INTO interaction_user (interaction_id, user_id) VALUES (1234, 9876), (1234, 5432);
There’s a tradeoff with indices: lookups become faster but writes/updates become slower. In most common cases, the tradeoff is worth it…. You don’t notice any slowdown? But at high throughput rates, the cost becomes an issue
Wdym by "high throughput rates"? Curious on this, wondering if it's worth it.
Usually the cost of an index is so small it’s not noticeable. Just be aware that the cost is not zero: so adding indexes everywhere is not a good idea. It’s somewhat hard to quantify.
Interesting okay, I'll keep that in mind thank you.
Doing what you said, having an array with the data, I'm not sure how I'd go about it. I plan on being able to let my co workers search through this data using filters. For example, first name, last name, middlename, age, email, occupation, phone number, experience, etc
Well that's what I was wondering, that seems to be kind of what timescale helps with
We're always filtering on both, never just one
The data is already chunked/partitioned on creation time, and then we do a primary key look up within each partition. seems like it can't get better than that in theory
There is an implicit return None too. Maybe you should look at an ORM like django or sqlalchemy.
But your pk order is "wrong" I'd say
Can you change it to time, group_id?
See if there's a difference
In the end your order by could be slow, did you try profiling that query?
hmm, with the current schema you can do that, for example given the dataset us-counties-2023.csv: ```sql
-- columns = {date,county,state,fips,cases,deaths}
-- To sort rows by highest cases:
SELECT dr."row" FROM dataset_row dr
JOIN dataset d ON d.id = dr.dataset_id
WHERE d.filename = 'us-counties-2023.csv'
ORDER BY dr."row"[array_position(d.columns, 'cases')]::integer DESC``` you can also consider adding a GIN index on row to improve comparisons on individual elements within it but only if you find your query's performance is too slow, as it'll be a very expensive index for your database to maintain when you have a lot of rows
is anybody willing to code review my some 250 line long sqlite abstraction?? 🙃
Where do I put attributes that two entities share?
To start, I had a many-to-many relationship between CARD and CARDPARTS. A card can be related to zero, one, or many card parts. A card part can be related to one or many cards. To solve this relationship I tried putting an intersection entity between them: a card to card part mapping. Where my question comes from is that cards and card parts have some unique attributes and some that they share. For example, booster is a unique attribute of card; component is a unique attribute of card parts; and they both have the attribute name. Do I put the shared attributes in the intersection entity? Do I keep them in the CARD and CARDPARTS entities? Do I do something else?
If you need more information, please don't hesitate to ask.
Thank you.
Timescaledb has some kind of storage / query planner optimization and this is how they recommend querying the "most recent" item that matches a criterion
I can definitely try some benchmarks with the PK order swapped, ill also have to look carefully at the explain output after the data is more populated and there are realistic table stats
I think someone mentioned a pg profiling tool above? Or maybe that was somewhere else. I can look into it too
the "name" on a card part seems like it's only coincidentally and superficially the same as the "name" on a card
The first thing you should look at in the explain output is whether the indexes are actually used. That’s 90% of what I look for in explains. Then the question is why not: was the index not usable for the query or simply not helpful (no benefit). Sometimes a missing compound index or something similar is the root problem. (Apologize if this is obv/basic, but most of the time I see people who are just missing the ‘correct’ compound index)
And if your main query is looking up the last record for each group, perhaps that could be denormalized and maintained separately (ie: on the group table). Or, perhaps a partitioning strategy
good advice, thank you
the partitioning i believe is what timescale does automatically with its "hypertable"
maintaining this as two separate tables would be kind of a nightmare, it would have to be kept up to date in real time
And could you use a last() aggregate, rather than a sort and limit?
An aggregate last() might be optimized diff than what you’re doing (filter-sort-limit)
I am getting the error caused by the foreign key being NOT NULL but I made the Foreign key NULLABLE. in flask-sqlalchemy. How do I fix this? Thanks for the help. Please ping on reply.
class User(UserMixin, db.Model):
id = db.Column(db.Integer, primary_key=True)
email = db.Column(db.String(120), unique=True)
rel_payments = db.relationship('Payments', backref='profileinfo', lazy=True)
class Payments(db.Model):
'''
One to many relationship
This is the Many relationship.
'''
id = db.Column(db.Integer, primary_key=True)
item_name = db.Column(db.String(80))
price_of_donation = db.Column(db.Integer)
# How do I turn email into the foreign key? todo.
email = db.Column(db.String(120))
fk_user_id = db.Column(db.Integer, db.ForeignKey('user.id'), nullable=True)
Here is the code in routes.py
https://paste.pythondiscord.com/MIDA
Here is the error.
https://paste.pythondiscord.com/37TQ
yeah the timescale docs actually specifically advise not to use last() because it doesn't use the hypertable chunking or indexing effectively, that's only meant for groupby aggreation. apparently order + limit is their recommended idiom here
Oh, interesting.
https://docs.timescale.com/api/latest/hyperfunctions/last/
The last and first commands do not use indexes, they perform a sequential scan through the group. They are primarily used for ordered selection within a GROUP BY aggregate, and not as an alternative to an ORDER BY time DESC LIMIT 1 clause to find the latest value, which uses indexes.
can you check the actual table schema and confirm that it's nullable in the database?
@harsh pulsar You were correct let me try again and see if the error still appears , if this is the case sorry for such a dumb question
Maybe reduce the search window? Instead of “before the end date”, add a start date. And, this could be computed / materialized infrequently. Like: once a day, compute the last event for each group.
can't hurt to consider using a migration framework here. alembic integrates with sqlalchemy for example
interesting. i wonder if that actually makes a difference. i have a feeling that internally they're able to do an efficient lookup on the timestamp alone. but i can mess around with it and benchmark some things. as for precomputing, no, this needs to happen on-line
(to be clear, the current performance is more than acceptable)
My suggestion was to precompute something that just narrows the space; even if it’s out of date, it might cut the query from ‘forever until today’ to a few days.
So, it’s online/real-time, with the crutch of a narrower range
The real test is: how well will this system respond to queries as the table gets larger: will the query remain responsive.
Im genuinely curious, haven’t worked with this in pg.
my limited intuition about how these datetime indexes work is that they don't really need that kind of coaxing, but i can definitely experiment with it. can also maybe reach out to timescale since we're enterprise customers
basically we just need the "most recent previous" record for a given entity, but that could be arbitrarily far in the past or half a second ago
Yah, if it’s an ~O(1) operation than yah, o worries
yeah no idea, would love to know
Me too! (Let me know if you do find out)
that's what i'm interested in too. my week coming up is mostly dedicated to QAing this app anyway, so i can hopefully work in a load test
won't be that realistic but at least blasting the system with a few hundred concurrent requests from my laptop should at least give us a rough sense of stability under load
the db is definitely the biggest perf bottleneck rn. the python app is maybe 1/3 of the time, and most of that is spent inside fastapi
ive been wanting to try pypy as well for this
@harsh pulsar I tried deleting the migration folder and app.db. Then recreating them by step 1 flask db init. Step 2 flask db migrate -m "Initial migration." . Step 3 flask db upgrade. Then I ran the code in the terminal. Here is the documentation on flask migrate. https://flask-migrate.readthedocs.io/en/latest/
Yah. My two cents is: consider data generation of longer histories… and maybe measure query performance vs db size.
good thinking on these. i was doing the former but not for the specific intention of load testing
I think since there's no lower bound it may be checking all of your partitions, if you often need to retrieve latest values I'd store them in a separate table
Sqlalchemy 1.4 and postgress, is there a way to load a row from different tables in a single query? With joins and such? Theh have foreingkeys within them to sort of mapping tables, no relationships, yet.
I mean like
Obj1, obj2, obj3 = db.query(table1).join(table2) etc
is there a way to edit same database from diffrent projects?
Query is legacy api
But yes
stmt = select(Table1, Table2).join(Table1.table2)
for table1, table2 in session.execute(stmt):
...
Also I'd rather just use relationships 
still lacking understanding what u describe
try to write example in terms of dictionaries/classes/pure python code here that provides 2 examples of data
Ye it's moving towards that for sure. It's a legacy system. Thanks for the info.
You can use 2.0 style querying with 1.4 version of sqlalchemy
And relationships should be the same too
Yeah but trying to keep the codebase consistent before actually updating stuff. Adding relations should fix what I need to happen for now should be fine.
Guys I am making a program which will connect with remote mysql server and fetch data. I am utilizing mysql-connector-python and sqlalchemy. Is there a way to add proxy from my side to connect with database
interesting, ill look into this. i would hope that it's wise enough to avoid that though. also i don't think the separate table is worth the maintenance burden, it also would place much stricter requirements on transaction isolation, no? possibly requiring full serializability. might be significantly slower under load unless there's some efficient design idiom for that
Really depedns on how you use it, if you often write to specific group - then probably not 🤔
I would still try adding lower bound, changing query itself (e.g. removing some of it's parts), adding indexes, changing pk order
hmmm good point. i think there is still somewhat of a concurrency concern within each group but maybe not so severe
Yeah, I understand
I had a similar situation, but with significantly lower database load - we were storing historing records in the same table, which becomes hard to query latest values from
my bigger concern would be that the cost of maintaining that extra table would add more latency than we would save by optimizing the query... but i love all the suggestions, this is giving me a lot of things i can try this week
e.g. status of some kind of object
Yeah, I'd just fiddle with the query, indexes and primary key 
Hey guys, I trying to create a csv file with pandas that import multiple csv and concatenate into one dataframe
I already tried every way I know, I dont know too much 🥲
Share what you've got? If it's complicated, maybe open a help thread #❓|how-to-get-help
@coral wasp @paper flower reporting back from the world of query optimization: my lookup is reduced to a single index scan, but it does appear to be scanning every hypertable. had to work on other things today so i didn't get a chance to experiment and see if pre-filtering reduced the number of hypertables to be scanned. i will have a much more realistic amount of data loaded into the database after i get through my load testing tomorrow, so i will revisit this then
What did you change though? 🤔
Hmm, kinda what i was expecting, I don't know hypertables but I do know snake oil 😉 (also known as tanstaafl)
What if I choose to insert the data from the csv file into the database, then retrieve the information?
you'll need to insert one row into dataset with your filename/column names, and then all of the rows from that CSV into dataset_row, where dataset_row.dataset_id matches whatever id was given for your dataset
I'm confused on how I should go about the column. I'm making one then how should I input the data inside? I'd need to know what the data is so wouldn't I need to store the format or something.
im not sure what else you need to know about the "data", assuming that refers to individual rows
once you've parsed the csv into a bunch of lists (see csv.reader()), you can insert each of them directly into the dataset.columns/dataset_row."row" column and asyncpg will convert it into the appropriate array type
I don't get how the data is being formatted. Am I taking the string and turning it into a list, if I am am I keeping it one string? Or am I taking each value and putting it in as a different item. If I do that how do I what the values belong to once I retrieve it from pg
#databases message
one column is storing the entire row as a TEXT[] array, so what you should be inserting is a list of values
if you did want to insert each value individually, you would need another column to identify the row number that value is associated with alongside the dataset_id (and also a column number to help define the order of those values)
Issue with this is I wouldn't know what values were what. I'd have to write a script that checks the csv file name then sets it up from there
you do, dataset_row.dataset_id lets you check what dataset its from
every row in dataset_row can be associated with a dataset because of that foreign key
So in dataset_row I'm storing the format name, age, occupation, blah blah2. I'm so confused rn,
the column names? no, you can store that once in the dataset.columns TEXT[] column
maybe we should walk through the data you're going to store and understand how to describe it in two tables
Okay I get it now
Yeah sure
Should I use pandas?
thats a really vague question so its hard to answer, but if its more specific someone else who uses pandas could help you out
To start, in a relational database you normally have a fixed number of tables, and those tables should be able to describe your data in a way that makes it hard to store duplicate or invalid information. For your data, you have a bunch of CSV files, where each file has a filename, a list of column names, and a list of rows. Every row should be associated with a filename and column names, otherwise your row would be meaningless, so you might think of putting them in one table: sql CREATE TABLE dataset_row ( filename TEXT, columns TEXT[], "row" TEXT[] ); However, when you want to insert a bunch of rows from one file, you have to duplicate the filename and columns for every row: sql INSERT INTO dataset_row VALUES ('my.csv', '{col1,col2,col3}', '{123,456,789}'); INSERT INTO dataset_row VALUES ('my.csv', '{col1,col2,col3}', '{321,654,987}'); ... That increases wasted storage space, and makes it possible to say, have two rows in the same file with entirely mismatched column names. Ideally you would define the filename and columns one time, and then have each row refer to that same piece of information, such as: sql INSERT INTO dataset (id, filename, columns) VALUES (1, 'my.csv', '{col1,col2,col3}'); INSERT INTO dataset_row (dataset_id, "row") VALUES (1, '{123,456,789}'); INSERT INTO dataset_row (dataset_id, "row") VALUES (1, '{321,654,987}'); Each row in dataset now represents a CSV file, and each row in dataset_row corresponds to one of your CSV files
So dataset will be a row with a unique filename and column. Then dataset_row will be all the rows
As a sidenote, not all relational databases have array types, and an array column can often be represented with another table, for example: sql CREATE TABLE dataset_column ( dataset_id INTEGER REFERENCES dataset (id), name TEXT ); This conforms to 1NF and is normally what you want to design, but arrays tend to be useful when you have a bunch of values that you will often work with as a single item
i probably should have explained that earlier because i initially thought you wouldnt need to access/sort by individual elements in each row, which in hindsight made it kind of pointless
however postgres provides various operators and functions for working with arrays, so its not a deadend to be using it here
What is the dataset_id INTERGER REFERENCES dataset (id) doing?
REFERENCES dataset (id) is a constraint, or more specifically a foreign key constraint, which says that dataset_id must refer to an existing row in the dataset table matching the id column
its how you associate tables with other tables, and is often needed to safely join two tables together: sql -- Select the rows of any file which has a 'date' column SELECT d.filename, dr."row" FROM dataset AS d JOIN dataset_column AS dc ON d.id = dc.dataset_id JOIN dataset_row AS dr ON d.id = dr.dataset_id WHERE dc.name = 'date';
I've never seen something like this before. Can you explain what's happening?
JOIN table lets you combine the columns of another table, and in the example ON d.id = dr.dataset_id specifies that rows should be joined when they have the same dataset ID
For example, if you have dataset (1, 'birth-rates.csv'), a dataset column (1, 'year'), and another dataset column (1, 'rate'), joining dataset with the dataset columns by ID will give you (1, 'birth-rates.csv', 'year') and (1, 'birth-rates.csv', 'rate')
see also this interactive tutorial on joins https://sqlbolt.com/lesson/select_queries_with_joins
Alright cool thanks. I'm gonna head to bed now when I'm free I'm gonna come back I'll probably respond as I have short term memory loss lol. But yeah I'm gonna get started tomorrow, thanks.
Hey where can I learn sqlite3?
On your computer. Just dive in. There are endless tutorials online. Which one is best depends on what you're trying to build and how you like to learn.
so how I edit a number in json file?
What’s the easiest way to offload a local database to a cloud database?
Use str() to change the number to a string.
Or use the json package to dump a dict into a json file.
depends on which database and which provider - most providers will have quite a few different options to import data though
the most straightforward and applicable to most might be just dumping and restoring (or exporting to something like csv/json and importing if not the exact same kind of database), but some may have more efficient options
both not working
I didn’t think about the database aspect of it. Just went on the assumption the data was already in a json.
how I can check if a json file is empty or no?
oh so you did follow up on this?

ooo. a timescaledb optimisation question - can't believe i missed that.
select create_hypertable('records', 'created_time');- this doesn't have space partition, worth a try after you have exhausted other options imo- what's the index on the chunks? is it a multi-column btree index? anyway - the ordering of
group_id, timemakes sense to me. - is there anything that uses JIT in the
explain analyzeplan? i recall i got bitten once by that, JIT'ing for absolutely no reason and wasting compute over and over.
You're specifying the names of the variables after they're being called right? idk if I'm tripping or it's just a sql thing I don't get
im not sure what you mean by variables, those are completely different things from that foreign key syntax
<new_column> <column_datatype> REFERENCES <referenced_table> (<referenced_column>)
nothing yet, but until i have the table fully populated there isn't much else to do. had to deal with some unrelated stuff the last couple days, going to actually do it tomorrow
be careful here, the syntax superficially looks like a python function call but it's not at all the same thing. likewise with column and table names, they kind of look like python variables but they are not really the same thing
Hey guys
Does have a good knowledge on pandas?
I'm struggling trying to merge CSV files.
Variables as in the names you're assigning to the actual value, for example, "from dataset as d". You call d before dataset was referenced as d. You can see that where you say "d.filename".
oh those are table aliases, its just to make the query easier to read
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-TABLE-ALIASES
when I do that it telling that unwritable
How can i use independent db with a discord bot I want to use sql?
SQLite?
Or postgres. Bc with postgres u host the server, make an role and password for it, and then create a database that is the owner of the new account you just created
ic
I think in json you can just put the numbers. I dont think you need the “” or ‘’
Just wanted to understand industry standards, what's the benchmark for getting 100K recrods from table and having it in-memory. I have to apply couple of functions on top of this tabular data. Right now it takes 15 Seconds for me to get the data from snowflake, fetch it as Pandas DataFrame.
Is it normal in industry to do this kinda of thing?
Any one who have previously done, can you elobrate your tech stack.
I wanted to do this in a request response cycle. I'll send very minimal data to the frontend after computation. (sub second performance)
Benchmark? Like, how long should it take? I'd guess that most of that 15 seconds is snowflake executing/processing the query, not anything on the client side. And, I doubt the snowflake database connector is the bottleneck here. What I'm assuming is: it takes nearly 15s for the query to get the 100k rows to complete. I'd first check that, such as doing the same operation on another system or snowflake interface or whatever.
overall I would say it's slower than i'd expect
But I wouldn't expect it to be less than a few seconds
I have checked the query performance , query is executing in 100ms. It's simple select query in snowflake.
Client side snowflake connector is what taking its time to give it back as pandas dataframe.
Just to give more context, the data has 15 cols, 5 integer, 5 string and 5 dates. that's it
I'd start with normal troubleshooting: check versions of connector, pyarrow and pandas. Test a smaller query (ie: 1 row), etc.
- 15 sec is obviously unexpectedly slow.
Got it. Python connector is in latest version, pandas is in 2.1.x
And pyarrow?
Pyarrow is in 10.0.1
yeah! but the snowflake python connector restricted it to v10. 😦
https://github.com/snowflakedb/snowflake-connector-python/blob/351b2dca61dc2a685a2ef2dd419b3e94c0c6e595/pyproject.toml#L9
pyproject.toml line 9
"pyarrow>=10.0.1,<10.1.0",```
I'm still not able to figure out this 10 to 15 seconds for 100K. when I checked the memory size of the pandas dataframe or list of tuple. It is just around 100MB
From your experience, what's the approx time it takes for this. Just converting the query results to native python type
<1sec on the client side.
Man! I don't know what am I even doing wrong! 15 Seconds it takes for me 😦
Maybe compare fetchall performance vs fetch_pandas?
Yes I did that. fetch_pandas_all takes 15 Seconds. fetchall() takes 10 to 15 seconds.
Reading a 100k row csv takes me 400ms: ```py
import pandas as pd
import numpy as np
Dataframe creation
df = pd.DataFrame({
'int_col1': np.random.randint(1, 1000, 100000),
'int_col2': np.random.randint(1, 1000, 100000),
'int_col3': np.random.randint(1, 1000, 100000),
'int_col4': np.random.randint(1, 1000, 100000),
'int_col5': np.random.randint(1, 1000, 100000),
'str_col1': np.random.choice(['A', 'B', 'C', 'D', 'E'], 100000),
'str_col2': np.random.choice(['F', 'G', 'H', 'I', 'J'], 100000),
'str_col3': np.random.choice(['K', 'L', 'M', 'N', 'O'], 100000),
'str_col4': np.random.choice(['P', 'Q', 'R', 'S', 'T'], 100000),
'str_col5': np.random.choice(['U', 'V', 'W', 'X', 'Y'], 100000),
'date_col1': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col2': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col3': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col4': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col5': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
})
df.to_csv("mytest.csv")
%timeit pd.read_csv("mytest.csv")
df2 = pd.read_csv("mytest.csv")
print(len(df2))
display(df2)
Um but is this right measure. I mean I'm getting data from snowflake table.
You asked about "converting the query results to native python type". Converting from CSV should be even slower than converting a database result.
You really shouldn't crosspost. And please just run a profiler.
ugh maybe because pandas sucks?
read #web-development for the thrilling conclusion
(it's not pandas)
does snowflake not have a dedicated command for exporting data to a flat file which you can pull much quicker?
in my expereience using these commands is much faster than whatever fetch_all is doing.
(also yes - there is no need to crosspost)
typically you just pull straight from the object store
how would I handle multiple updates to the db for something like likes in sqlite? The race condition would make it kind of tricky to aggregate the requests
speaking of sqlite locks, im getting some really poor performance when running concurrent selects with this benchmark using sqlite 3.43.0
https://github.com/thegamecracks/asqlite-bench ```py
py -m asqlite_bench queries\2-selects.toml --connections 10
Finished in 7.080s
py -m asqlite_bench queries\2-selects.toml --connections 1
Finished in 0.513s``` for context,asqliteruns each connection in its own thread and implicitly setsconn.isolation_level = NoneandPRAGMA journal_mode=WAL
looking at the profiling results, most of the time is spent in Cursor.fetchall() so i believe it is dealing with locks, but i dont know why running ten concurrent selects would make it perform worse
what happens if you try open the DB in read only mode?
you can also try setting nolock to 1
Or immutable to 1
tried ?immutable=1, ?nolock=1, and ?mode=ro (nolock required removing the WAL pragma from asqlite), and results were all around 7s, still waiting on fetchall()
oh i forgot to manually set WAL too, that made ?nolock=1 raise OperationalError: unable to open database file while the other two had no issue
so i guess that means my assumption was wrong, and something in the pysqlite wrapper is blocking...
cursor.execute('''
CREATE TABLE IF NOT EXISTS enlistments (
user_id INTEGER PRIMARY KEY,
discord_name TEXT,
application_name TEXT,
division TEXT,
rank TEXT,
gamertag TEXT
)
''')
I have already got my table created, how would I go about adding an additional row to the table once it has been created or is this not possible?
simple select query? just in case, if youre doing select *, youll want to avoid that. this is a columnar db so the way you write sql here vs. rbdms is similar but dif.
this also leads to increased compute and/or cost/snowflake bills btw
how do i know? we have dbas over here that have lived in sql server world for so long but now need to do things in snowflake
the result?

The cost of using a datalake for data which could just work on a regular machine
Whats a good way to scrub user input to prevent sql injection for a desktop app?
usually the library will take care of that for you as long as you specify the parameters separately from the query itself
(in which case the query itself should be hardcoded, and the user should only provide the parameters)
!sql-fstring
Don't use f-strings (f"") or other forms of "string interpolation" (%, +, .format) to inject data into a SQL query. It is an endless source of bugs and syntax errors. Additionally, in user-facing applications, it presents a major security risk via SQL injection.
Your database library should support "query parameters". A query parameter is a placeholder that you put in the SQL query. When the query is executed, you provide data to the database library, and the library inserts the data into the query for you, safely.
For example, the sqlite3 package supports using ? as a placeholder:
query = "SELECT * FROM stocks WHERE symbol = ?;"
params = ("RHAT",)
db.execute(query, params)
Note: Different database libraries support different placeholder styles, e.g. %s and $1. Consult your library's documentation for details.
See Also
• Python sqlite3 docs - How to use placeholders to bind values in SQL queries
• PEP-249 - A specification of how database libraries in Python should work
Yeah im using cur.execute(sql, (userdata,)) where sql is something like '''SELECT * FROM table WHERE column=?'''
quite much no risk of injection in that case
Alright, thank you. All i gotta do is check the data against the data tables then.
I never heard of asqlite but used aiosqlite for a bit, did you try it? 🤔
i looked at the implementations some time ago and to my understanding they both follow the same principles, but ill try to adapt the tests for it too
fwiw ive also done a simpler ThreadPoolExecutor version #databases message
So it's likely a sqlite issue? 
implemented a simple aiosqlite pool (source) and results were basically the same ```py
py -m asqlite_bench queries\2-selects.toml -c 1
Finished in 0.527s
py -m asqlite_bench queries\2-selects.toml -c 1 -m aiosqlite
Finished in 0.447s
py -m asqlite_bench queries\2-selects.toml -c 10
Finished in 6.786s
py -m asqlite_bench queries\2-selects.toml -c 10 -m aiosqlite
Finished in 6.968s```
I'm getting ready to start coding as I have some free time, looking now what would be the speed difference if I did something like having all the columns?
Rather than storing the row in a single column and trying to access it
im not sure of the performance difference between 100+ tables with custom columns vs. a few tables with the same schema, that you'll probably have to test yourself with some real data
So it's probably sqlite thing 
Not 100+ tables, but one table with every column I'd need.
VS
The two tables linked together and putting all the text into one column
as in a union of every column from every dataset you might store?
Indeed
hmm i would guess that might lead to inefficient queries if every row has to store a bunch of NULL columns
Maybe it has something to do with what query you're actually running?
Ah, nvm 😅
oh yeah i did test SELECT 1 instead of selecting 1000 single-column rows from a table, and more connections marginally improved the performance rather than degrading it
Do you still think the method you explained before is the best way now knowing I want to be able to select specific data out of rows?
here were the results for that: py -c 1: Finished in 0.160s -c 10: Finished in 0.117s i assumed this CTE alone wouldnt have done any IO either, but it performed similarly to creating the table beforehand: ```py
query = """
WITH RECURSIVE generate_series(value) AS (
SELECT 1
UNION ALL
SELECT value+1 FROM generate_series
WHERE value+1<=1000
)
SELECT value FROM generate_series
"""
repeat = 1000
-c 1: Finished in 0.690s
-c 10: Finished in 6.925s
the array columns could be further normalized, but i think it'll be fairly easy to use - try creating the tables yourself, inserting some data, and writing up queries to do what you need
Can you link me to documentation for arrays in columns
I think the issue is that it just becomes cpu bound? 
https://www.postgresql.org/docs/current/arrays.html
if you find the need to do really complicated queries, that can be an indication that the schema isnt suitable for your data model
What's considered a really complicated query?
now that you mention it my cpu usage gets fairly high, even with in-memory databases... seems weird for merely selecting 10K numbers
Runtime seems to scale fairly linearly for me with the amount of connections
Makes me think it's probably cpu bound
Adding a lock in there makes queries run faster though 🙂
async def run_query(iterations: int, lock: AbstractAsyncContextManager[None]):
stmt = select(Model).where(Model.c > 1000, Model.c < 2000).limit(100)
async with async_session_factory() as session:
for _ in range(iterations):
async with lock:
await session.execute(stmt)
async def main() -> None:
iterations = 100
start = time.perf_counter()
lock = asyncio.Semaphore(1)
async with asyncio.TaskGroup() as tg:
for _ in range(100):
tg.create_task(run_query(iterations=iterations, lock=lock))
print(time.perf_counter() - start)
About 16s without lock and 7s with lock
i dont think there's a definitive answer, but i'd consider a query too complicated if i need to write some verbose query like: sql -- Select all data from xyz 2021-2023 datasets with a population greater than 500 SELECT population FROM csv_xyz_2021 WHERE population > 500 UNION ALL SELECT population FROM csv_xyz_2022 WHERE population > 500 UNION ALL SELECT population FROM csv_xyz_2023 WHERE population > 500; or use esoteric features for something that should be simple
Possibly could be like this
wtf that worked for me too