#databases

1 messages · Page 20 of 1

foggy tartan
#

Dodge Charger
Type: Muscle
Seats: 5 Doors: 4

#

any idea

fading patrol
foggy tartan
fading patrol
foggy tartan
#

Mhm

silk raft
#

Working with roles in postgres,

I dont fully understand the implication of one role being a member of another role, say I have something like this setup would this mean that public role inherits permissions from its parent role ?

#

I initially created a database through aditya then I attempted to create/insert tables into that database and always got a permission error. It ended up getting fixed when I declared projects was a member of aditya, I wanted to reduce the amount of permissions given to projects role since I would most likely be sharing it with a bunch of people

pearl lodge
#

okay soo i made tests here and for some reason i don't know why it's not returning me the document, it's just sending me an empty list:

    async def quick_search_default_character(self, *, user_id: int, name: str | None = None, prompt_prefix: str | None = None) -> None:
        documents = list()
        init = True
        if name and prompt_prefix:
            cursor = self.db.characters.find({'user_id': user_id, 'name': name,'prompt_prefix': prompt_prefix}, no_cursor_timeout = True)
            while (await cursor.fetch_next):
                data = cursor.next_object()
                documents.append(data)
        elif prompt_prefix:
            cursor = self.db.characters.find({'user_id': user_id, 'prompt_prefix': prompt_prefix}, no_cursor_timeout = True)
            while (await cursor.fetch_next):
                data = cursor.next_object()
                documents.append(data)
        elif name:
            cursor = self.db.characters.find({'user_id': user_id, 'name': name}, no_cursor_timeout = True)
            while (await cursor.fetch_next):
                data = cursor.next_object()
                documents.append(data)
        else:
            cursor = self.db.characters.find({'user_id': user_id}, no_cursor_timeout = True)
            while (await cursor.fetch_next):
                data = cursor.next_object()
                documents.append(data)
        
        print(documents)

        return documents if len(documents) > 0 else None
#

like, this code is suppose to make a quick search

#

!paste

delicate fieldBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.

pearl lodge
pearl lodge
#

okay i realize it was my stupid misstake

tepid osprey
#

Hi, I'm trying to figure out a way to get a Text widget to auto update in Tkinter to show all the values in a SQLite3 database, and the code runs without errors, but the contents of the database aren't inserted into the Text widget. Here is the code concerning just the auto-update and the database:

import tkinter as tk
import sqlite3 as sql
root = tk.Tk()

conn = sql.connect('database_store.db')
c = conn.cursor()
c.execute("CREATE TABLE dataStore (data text)")
conn.commit()
conn.close()

data_show = tk.Text(root, wrap='word') #other params for height/width that i didn't add
data_show.grid(row=1, column=1, padx=(100, 0), pady=20, sticky='ew')

def update_textbox():
  conn = sql.connect('database_store.db')
  c = conn.cursor()
  updated_db = c.fetchall()
  for data in updated_db:
    data_show.insert('end', f'{data}\n')

  conn.commit()
  conn.close()
  data_show.after(1000, update_textbox)

update_textbox()

And here is the full code if that wasn't enough to find any issues: https://paste.pythondiscord.com/6GTA

Thank you in advance for helping me out!

coral wasp
#

But, now that I look at it, you have no query. You’re creating a cursor then calling c.fetchall() on an empty cursor

#

You should add some print statements so you can see / debug what’s happening. For example, add a print statement in the for loop to print the data.

tepid osprey
tepid osprey
coral wasp
tepid osprey
coral wasp
#

You’re bringing in nothing

#

You opened a cursor to the db, then did fetchall, but without a query, the cursor returns nothing.

tepid osprey
tepid osprey
grim vault
silver robin
#

Hello, has anyone had access setting up a connection to an MS access database? I have seen a few methods but struggling to get anything to work proper in the past.

#

I should mention I am not on windows

wise goblet
#

does it mean you are at MacOS, or at Linux, or at Android, or Chrome OS or at something else 😅

silver robin
#

Acorn

#

You remember Archimedes machines?

wise goblet
silver robin
#

I am on MacOS, but will potentially run on linux, but most importantly it means I dont have access to native MSaccess drivers.

#

I know UCanAccess JDBC works with DBeaver on Mac, but had issues getting this to be recognised in Python, I tried using JayDeBeApi but I dont know if it hit my skill ceiling or if there was an actual issue with thoes drivers.

wise goblet
#

What if u connect your MSaccess to sqlite3 as external source

#

and easily work in Macos/Linux through python with sqlite3?

#

which is natively supported in std libs

#

TLDR: selecting some more open source database engine as common denominator between your two environment

#

Postgresql, MySQL, Mariadb, Sqlite3

silver robin
#

ok thats interesting, thank you for the suggestion I will investigate

tepid osprey
pale sinew
#

Hello everyone, i have a problem with "RETURNING" stantment in python, I use sqlite3 .3.43 and databases[aiosqlite] databases==0.7.0

Traceback (most recent call last):
  File "/home/yosaki/.local/lib/python3.10/site-packages/discord/app_commands/commands.py", line 827, in _do_call
    return await self._callback(self.binding, interaction, **params)  # type: ignore
  File "/home/yosaki/main/cogs/suggest.py", line 206, in suggest_
    suggest_id = await db.add_suggest(i.user.display_name, consiglio)
  File "/home/yosaki/main/utility/db.py", line 66, in add_suggest
    return await database.execute(query=query, values=SuggestObject(member, 0, 0, suggest=suggest).to_dict())
  File "/home/yosaki/.local/lib/python3.10/site-packages/databases/core.py", line 169, in execute
    return await connection.execute(query, values)
  File "/home/yosaki/.local/lib/python3.10/site-packages/databases/core.py", line 288, in execute
    return await self._connection.execute(built_query)
  File "/home/yosaki/.local/lib/python3.10/site-packages/databases/backends/sqlite.py", line 133, in execute
    await cursor.execute(query_str, args)
  File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/cursor.py", line 48, in execute
    await self._execute(self._cursor.execute, sql, parameters)
  File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/cursor.py", line 40, in _execute
    return await self._conn._execute(fn, *args, **kwargs)
  File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/core.py", line 133, in _execute
    return await future
  File "/home/yosaki/.local/lib/python3.10/site-packages/aiosqlite/core.py", line 106, in run
    result = function()
sqlite3.OperationalError: near "RETURNING": syntax error

any idea? this error only appear in my remote VPS Ubuntu 20.04.6 LTS x86_64

#
query = "INSERT INTO suggest(member, suggest, positive_vote, negative_vote, status) VALUES (:member, :suggest, :positive_vote, :negative_vote, :status) RETURNING *"
return await database.execute(query=query, values=SuggestObject(member, 0, 0, suggest=suggest).to_dict())
paper flower
pale sinew
paper flower
#

Not sure what's wrong then

pearl lodge
#

it's creating but not displaying

#

why?

#

it was working normally in Alpha version

#

but then in beta version it's not working like intended

waxen finch
pale sinew
waxen finch
waxen finch
#

iunno, what was the sqlite version shown by python?

waxen finch
#

not sqlite3.version, i mean sqlite3.sqlite_version

pale sinew
#

hm

#

Now the problem is, I uninstalled this version to install 3.43

waxen finch
#

you could try recompiling python? might be a bad guess though

pale sinew
#

but although, I have no idea where it gets the old version, maybe the sqlite library is compiled based on the one installed and made embed, now I try the recompilation

waxen finch
#

some time ago with my raspberrypi, after make installing my sqlite into /usr/local/lib, i had to fidget with /etc/ld.so.conf and sudo ldconfig as i believed the system sqlite was being prioritized over my local install

pale sinew
#

I don't know what to do really

waxen finch
#

how did you upgrade sqlite? where were the .so libraries placed?

pale sinew
#

Collecting pysqlite
Downloading pysqlite-2.8.3.tar.gz (80 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80.6/80.6 kB 10.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [1 lines of output]
pysqlite is not supported on Python 3. When using Python 3, use the sqlite3 module from the standard library.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details. @waxen finch

pale sinew
waxen finch
#

but how did you install sqlite? are you talking about that pysqlite module? if so, thats the wrapper which links to the sqlite program and is not sqlite itself, plus its already built into python 3, hence the error

pale sinew
#

sqlite3 system and python sqlite

#

only that system sqlite is up to date, python's is not

waxen finch
#

i checked the ubuntu package index, 20.04 only comes with sqlite 3.31.1 (see libsqlite3-0)

#

personally i would compile sqlite 3.43.0 from source using the autoconf distribution

#

unpack it, enter the source directory, ./configure then sudo make install

pale sinew
#

already done nothing changes, I will update the whole VPS at this point.

waxen finch
#

if you want to then sure, jammy 22.04 has sqlite 3.37.2 and the minimum version for RETURNING syntax is 3.35

noble oar
#

what db do you recommend for python?

carmine tiger
#

anyone expert on peewee please message me im so stuck and noone seems to be able to help

wise goblet
#

Sqlite3 if it is just pet project level

pale sinew
ember badge
#

Hello! I have a question. I have an algorithm that gathers data on financial performance, keeps it and yields it to an AI. However, im facing problems feeding the ai bc the amount of data is too large. To solve this, I am trying with langchain models, but i dont know which one and how to use it. Does anyone know about it ? If not, do you know any alternatives?
langchain agents*

rich violet
#

Hi everyone, i am trying for a scholar project to do a multi-tenancy application, i did a quite "good" architecture for my backend following the MVC structure (Model, Vue, Controller) and also DAOS to access the data to avoid having this in the model part.

I am now trying to migrate my tables from a single tenant structure to have a 1 PostgreSQL schema per "user" so he can have all his tables for himself.

My project is in FastApi, SqlModel and PostgreSQL. I need some help because i can't figure out a way to create schemas directly and i would love some help for this one, would be very thankful, thanks in advance!

wise goblet
wise goblet
#

Django ORM could be abused for library migrating purposes. Just using it's migrating module

#

I would recommend probably Django ORM if u will not find thing with better integration. It has quite comfy migrating paths

wise goblet
#

I googled it for you 😂

hollow oar
turbid quarry
#

Wich datebase here use?

hollow oar
#

correct.

if you want to optimise, imo you should always refer to the query planner's plan.

hollow oar
#

potentially using recursive CTE if you want to short circuit it.

this is just a toy problem so i wouldn't have an issue with writing something like that, but if it's for work then i would really think twice and see if it's absolutely necessary.

also short circuiting it wouldn't necessarily mean it's faster in practice.

harsh pulsar
#

fwiw i think a "sufficiently smart" query optimizer should be able to recognize this 🙂

#

i don't know if any real optimizers actually do that

#

i wouldn't be surprised if some olap-oriented databases implemented optimizations like this

hexed estuary
coral wasp
#

(did someone summon me?)

#

Oh, it's a leetcode, can't use duckdb 😦

opal holly
#

Having clause for speed

#

The DBA always gets mad at me when I do stuff like this without a having

opal holly
#

I'm not proposing a code diff I'm answering the question of " how would I stop computing weights after hitting sum>1000" from ur first post.

#

Join on a.turn>=b.turn

#

Group by a.turn

#

Having sum(b.weight)<1000

#

Ur DBA may get mad that it's a quarter of the way to a cross join but he will be happy about the having clause and probaly overlook it

#

I don't know if it's better I'm just answering the question about how to quit counting after sum>1000

#

Write your MySQL query statement below

SELECT person_name FROM Queue
WHERE turn = (
SELECT MAX(turn) as turn FROM
(
SELECT q1.turn
FROM Queue q1 JOIN QUEUE q2
ON q1.turn >= q2.turn
GROUP BY q1.turn
HAVING SUM(q2.weight) <= 1000
) as table1
);

#

But there's the full working query if u wanna play around with it

#

And as for the prose " hello my name is Doug,
I don't use recursion,
Cuz it makes me need Tylenol;
Tylenol is bad for the liver
Having clause make ur DBA very very happy,
Greetings traveler "

coral wasp
opal holly
#

Me neither it's a quarter of a cross join

#

Cursed join

valid sentinel
#

Is there not just a general chat or am I just overlooking that channel

coral wasp
#

I think the idea was whether a database could shortcut a cumulative sum operation once it reached a certain threshold. I don't know of any that would. The recursive CTE would shortcut the evaluation, but I doubt a window function would/could be shortcutted.

verbal bridge
#

i use notes as my database

coral wasp
#

If the cost of the cumulative calculation is prohibitively expensive, perhaps employ a search strategy to narrow the search space? @brazen lava

harsh pulsar
#

e.g. you see this in pandas code a lot, where the algorithmically-efficient code is not any faster than, or significantly slower than, the inefficient but vectorized version

coral wasp
harsh pulsar
#

i guess it depends a lot on what the query optimizer can do

harsh pulsar
coral wasp
#

Oh, I'd perhaps accumulate just from the first 1million rows, and if I haven't reached 10k, use the max cumulative and get the next million, and so on.

#

Nah, just separate queries

#

I don't know if I could do it in a recursive CTE: accumulating the next 1 million rows. I guess depends on whether the engine allows a dynamic limit/offset clause.

#

You'd check the last row of each search

#
import duckdb
target = 100000
last = 0
iteration = 0
batchsize = 10000
with duckdb.connect() as con:
    results = []
    con.execute("create table abc as select * from range(10000000) t(i)")
    while last < target: 
        df = con.execute(f"select i, sum(i) over (order by i) as cumsum from abc limit {batchsize * (iteration+1)} offset {batchsize * iteration}").df()
        results.append(df)
        last = df.iloc[-1, -1]
        iteration+=1

print(df.tail())
coral wasp
#

Yah, I don't think this approach can be used in a recursive cte... at least not in duckdb (limits/offset aren't allowed in recursive ctes). I could probably hack something together that used the max value from the last iteration and got the next n rows, but I can't imagine that performing very well

fading owl
#

Good day guys it's my first time here!

My question is can someone suggest to me a video tutorial or a step by step guide on how to host twisted python to internet to work in real life not in local host

Thanks in advance

lunar cargo
#

Hey guys, Im struggling with my database conception

player_close = sqlalchemy.Table('player-close', SqlAlchemyBase.metadata,
                                Column('player_id', Integer, ForeignKey('player.id')),
                                Column('close_id', Integer, ForeignKey('close.id'))
                                )


class Player(SqlAlchemyBase, SerializerMixin):
    __tablename__ = 'player'
    id = Column(Integer, primary_key=True, autoincrement=True)
    coins = Column(Integer, nullable=True)


class Close(SqlAlchemyBase, SerializerMixin):
    __tablename__ = 'close'
    id = Column(Integer, primary_key=True, autoincrement=True)
    winner = Column(Boolean)
    team1 = relationship(Player, secondary=player_close, backref='closes')
    team2 = relationship(Player, secondary=player_close, backref='closes')
    timestamp = Column(DateTime)
#

I want to connect two column to the same backref

#

but getting property of that name exists

paper flower
#

But maybe I'm wrong : 🤔

#

You can make a third relationship with custom join

lunar cargo
#

?

#

Im little bit suck at sql

#

could you describe more

paper flower
#

Do you want that relationship to be read only btw?

lunar cargo
#

read and write

#

I was thinking about creating third table Team with just 5 columns of player and connect to Closers table

#

and dont flex on it

#

like this

#

but it look mess to be honest

paper flower
#

1 sec

lunar cargo
paper flower
lunar cargo
#

never thought about it...

lunar cargo
#

thanks mate

frosty mirage
#

To paraphrase Jamie Zawinski: “Programmer has database problem. Thinks: ‘I know, I’ll use an ORM.’ Now programmer has two problems.”

paper flower
shut tiger
#

Programmer has a problem, he thinks "I'll build SQL stringly", now he has three problems.

frosty mirage
keen minnow
# shut tiger That was a total non-sequitur

it may help to also state what you think is the right approach to prevent common issues with people using SQL.

For instance relying on parametrized queries, ORMs and tooling (lint, etc.)

shut tiger
frosty mirage
#

Show me an ORM that handles this:

    def escape_sql_wild(s, escch) :
        "escapes SQL pattern wildcards in s with escch. The same escch" \
        "  needs to be passed to the ESCAPE clause for the LIKE" \
        " operator."
        if not isinstance(s, str) :
            raise TypeError("expecting s to be a string")
        #end if
        if not isinstance(escch, str) or len(escch) != 1 :
            raise TypeError("expecting escch to be a single-character string")
        #end if
        result = []
        for ch in s :
            if ch == escch or ch == "%" or ch == "_" :
                result.append(escch)
            #end if
            result.append(ch)
        #end for
        return "".join(result)
    #end escape_sql_wild
shut tiger
#

(And jesus, you really comment #end at the end of all the blocks?)

steady creek
#

Question, does API is in the cloud not in your 'database'? Am I understanding it right? And if anyone knows your API Key, anyone can change it?

I don't know if this is related.

paper flower
# frosty mirage Show me an ORM that handles this: ``` def escape_sql_wild(s, escch) : ...
from sqlalchemy import String, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import DeclarativeBase, mapped_column, Mapped


class Base(DeclarativeBase):
    pass


class Book(Base):
    __tablename__ = "book"

    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column(String(255), unique=True)


stmt = select(Book).where(Book.title.ilike("a;[]\\'/\""))
print(stmt.compile(dialect=postgresql.dialect(), compile_kwargs={"literal_binds": True}))

And you can pass escape parameter to ilike

#
SELECT book.id, book.title 
FROM book 
WHERE book.title ILIKE 'a;[]\\''/"'
shut tiger
#

He's asking if an ORM can do escaping properly? I hope not.

quartz lava
#

guys maybe this is too much to ask for but can anyone write a program for me to make a database for online flight reservation using mysql and tkinter(I need a UI for it) if possible maybe by tomorrow before 4:30 AM GMT?

paper flower
quartz lava
paper flower
#

People would be more than happy to help you with your code, but they won't code for you

frosty mirage
frosty mirage
frosty mirage
# paper flower ...

OK, here’s the problem: I want to do a partial match on what the user enters. So I will use their input in, say “like '%input%'”. But I want the user input to always match literally, even if it contains “%” and "_” characters. That’s what my escaping function achieves. How do you do it with an ORM?

paper flower
#

It doesn't sound like a db problem at all

#

Just a part of your business logic

frosty mirage
frosty mirage
coral wasp
#

? You'd escape and still use an ORM for a partial match.

paper flower
coral wasp
#

Don't see why this is the ORMs problem.

frosty mirage
coral wasp
coral wasp
#

What database? PG?

frosty mirage
coral wasp
#

msft sql server?

frosty mirage
paper flower
frosty mirage
paper flower
#

Who hurt you?

frosty mirage
paper flower
coral wasp
#

I'm not even sure, it's just: replace userinput's wildcard variables (% and _), append a '%' then pass that as a parameter to the like function.

paper flower
#
def escape(text: str, escape_char: str) -> str:
    return (
        text
        .replace(escape_char, escape_char * 2)
        .replace('%', f"{escape_char}%")
        .replace("_", f"{escape_char}_")
    )


user_input = "a/b%c"
stmt = select(Book).where(Book.title.ilike("%{}%".format(escape(user_input, "/")), escape="/"))
print(stmt.compile(dialect=postgresql.dialect(), compile_kwargs={"literal_binds": True}))
SELECT book.id, book.title 
FROM book 
WHERE book.title ILIKE '%%a//b/%%c%%' ESCAPE '/'
coral wasp
#

hah, yes, that

#

It's still parameterized, even though there's some f-strings involved.

paper flower
#

But "orm bad"

coral wasp
#

I hate ORMs... they almost always made the complex stuff harder.

#

But, they made the easy/medium stuff super easy

paper flower
coral wasp
#

but I'm a db person, so that didn't slow me down... I just use SQL for what SQL is good at.

paper flower
#

What orms did you try? pithink

coral wasp
#

Oh, this is my past Java life... so mostly hibernate

paper flower
#

You should take a look at sqlalchemy

#

Pretty much everything you can write in sql you can write in it

#

e.g. subqueries, ctes, aggregates, window functions, etc

coral wasp
#

Well, I do mostly analytic stuff, so there's no reason for an ORM

#

And, I'd rather write in SQL than in chained pythonic APIs

#

A single declarative statement is far more readable and debuggable, imo.

#

but I know people will fight me over that

paper flower
paper flower
#
stmt = (
    select(Model.name, func.count(Model.id)
    .where(Model.name.ilike("a%"))
    .group_by(Model.name)
)

I think that's pretty similar to how SQL is written

coral wasp
#

There's dozens of APIs like that, which is part of the problem. It's yet another thing to learn, when all I need to know is SQL.

paper flower
coral wasp
#

That's where dbt comes in

paper flower
#

What's dbt?

paper flower
coral wasp
paper flower
#

Ah, that's not really would fit a usecase of a generic app that needs a DB imo

#

E.g. some kind of web application

coral wasp
#

no, not at all

glacial current
# coral wasp Oh, this is my past Java life... so mostly hibernate

Ah, yeh i had worked with someone that used hibernate that crushed the MSSQL server. this was because of threading and although i think that they just made a mistake, it was compounded by java philosophy that java is all you need to know. the actual result set should have been 100K rows. So i think you are more aware of a naked query result set than an ORM result set.

#

btw that was an interesting discussion you guys had and i agree with Billy that its sometimes simpler to bypass the ORM layer.

#

but ORM is nice too at times for the simple and medium stuff Billy said it well

harsh pulsar
#

also yeah their attitude to jinja templating is a little... cavalier

#

you'd think they'd at least offer a quote jinja macro, but nope. '{{ foo }}'

coral wasp
#

It’s data engineering, everything is cavalier:) a lot of duct tape and glue

harsh pulsar
#

it doesn't need to be. it's that way because all the tooling is designed that way

#

there's no reason dbt couldn't at least try to make inserting strings into sql slightly safe

#

it works because 99% of the time users are well-behaved

mild crow
harsh pulsar
#

it's all the honor system

harsh pulsar
mild crow
#

thank you? its ok im not tripping but i do understand better then most, when you do have time, only because you are the only person that understand it.

harsh pulsar
mild crow
#

sure

opal holly
shut tiger
#

And when people like ORMs they've never used Hibernate :P

tame parcel
#

so i want to take data which is saved in database and show it on frontend

#

but this not working

#
import { useState, useEffect } from 'react';
import { Container, Table } from 'react-bootstrap';

export default function Admintable() {
  const [ordersData, setOrdersData] = useState([]);

  useEffect(() => {
    const fetchData = async () => {
      const response = await fetch('/api/admintable');
      const data = await response.json();
      setOrdersData(data);
    };
    fetchData();
  }, []);

  return (
    <Container className=" container-fluid d-flex justify-content-center align-items-center">
      <div className='m-5'>
        <Table responsive>
          <thead>
            <tr>
              <th>Email</th>
              <th>Price</th>
              <th>Date</th>
            </tr>
          </thead>
          <tbody>
            {ordersData.map(order => (
              <tr key={order.email}>
                <td>{order.email}</td>
                <td>{order.totalprice}tk</td>
                <td>{order.date}</td>
              </tr>
            ))}
          </tbody>
        </Table>
      </div>
    </Container>
  );
}
shut tiger
tame parcel
#

router.post('/admintable', async (req, res) => {
try {
// Attempt to save a new Admintable object to the database
let { email, totalprice, date } = req.body;
const newAdmin = new Admintable({ email, totalprice, date });
await newAdmin.save();
res.json({ message: "Data saved successfully" });
} catch (error) {
// If an error occurs, send a 500 (Internal Server Error) response to the client
res.status(500).send("Server Error");
}
});

tame parcel
paper otter
#

hi ppl

hollow oar
floral basin
#

(Windows)Can someone help? I'm trying install mysqlclient for my Django project but error pop up:

Failed to build mysqlclient
ERROR: Could not build wheels for mysqlclient, which is required to install pyproject.toml-based projects
harsh pulsar
#

at least as far as i know?

hollow oar
fading river
#

Question: Is there a way to use a well supported package to use schema definitions with mongodb with asyncio? Currently I don't think there is. The way I got to that conclusion was that mongoengine seems to be the standard way to do schemas for mongo, but mongoengine doesn't work with the motor package which provides mongodb compatibility with asyncio. there is a package called motorengine to get the same functionality. I looked up this package and it only supports Tornado which was the original use case of Motor before it was ported to asyncio. There is another package called AIOMotorEngine which has only 27 stars on GitHub and proudly states that it was ported from motorengine and that "All tests are rewritten too and seem to pass." this is pretty far from well supported in my view, and since this is for work, I want to use packages that I know have been extensively used and tested.

#

This seems a bit weird to me though because schemas, mongodb, and asyncio are all extremely popular and it seems like this combination would be better supported, so maybe there is an obvious approach that I'm missing.

brazen charm
#

A large reason why nothing exists is because it's not really common for businesses and people who have the time to write those libraries to use mongo when they need a schema

torn sphinx
#

hey i have got a question, should i store id's as bigint type or rather string and cast it to int in python?

model Guild {
  id             BigInt  @id @unique
  welcomeMessage String?
  welcomeChannel BigInt?
  leaveMessage   String?
  leaveChannel   BigInt?
}
#

its postgres btw

waxen finch
waxen finch
#

fwiw 2^63-1 converted to a discord timestamp would be <t:3619093655>, or <t:3619093655:R>, so i probably wouldnt worry about it overflowing yet

torn sphinx
#

is bigint 2^63?

waxen finch
torn sphinx
#

yeah

harsh pulsar
harsh pulsar
#

it's an "ODM", like sqlalchemy for mongodb, using motor for mongodb interaction and pydantic to define types and validate inputs/outputs

#

i've used it before, it's not bad at all

fading river
broken charm
#

Hi does anyone knows if there is a database with company wise production data for agriculture sector (crops livestock dairy)? Preferably europe

fading patrol
pale canyon
#

Ive created a login and registration system for a project but im unable to get a working database using SQLite, can anybody help me with this please?

shut tiger
torn sphinx
#
async def run():
    conn = await psy.connect(user='postgres', password='root',
                                 database='postgres', host='localhost')
    #values = await conn.fetch(
    #    'SELECT * FROM mytable WHERE id = $1',
    #    10,
    #)
    await conn.execute("CREATE TABLE IF NOT EXISTS test1(id SERIAL PRIMARY KEY, number INT)")
    await conn.execute("CREATE TABLE IF NOT EXISTS test2(id SERIAL PRIMARY KEY, number INT)")
    await conn.execute("CREATE TABLE IF NOT EXISTS test3(id SERIAL PRIMARY KEY, number INT)")
    await add(conn)

async def add(co):
    await asyncio.create_task(a1(co))
    await asyncio.create_task(a2(co))
    await asyncio.create_task(a3(co))

async def a1(con):
    while True: # test1
        n = random.randint(100000, 199999)
        await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")

async def a2(con):
    while True: # test2
        n = random.randint(200000, 299999)
        await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
async def a3(con):
    while True: # test3
        n = random.randint(300000, 399999)
        await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")```

How come this isn't running all three of them at once, it's only doing the first one. (AKA a1)

## async pg
brazen charm
#

it should run them concurrently, but your conn object might be holding a mutex or the database itself is limiting the concurrency if you are not seeing the values inserting

shut tiger
#

I wonder if anything is awaiting

paper flower
#

@torn sphinx Also you're not running your coroutines concurrently

grim vault
#

And shouldn't a2() use test2 as table (same for 3)?

paper flower
#
tasks = [a1(co), a2(co), a3(co)]
await asyncio.gather(*tasks)

Or in python 3.11:

async with asyncio.TaskGroup() as tg:
    for fn in (a1, a2, a3):
        tg.create_task(fn(co))
brazen charm
#

oh yeah i forgot about the immediate await lol

torn sphinx
torn sphinx
# paper flower <@456226577798135808> Also you're not running your coroutines concurrently
  + Exception Group Traceback (most recent call last):
   File "c:\Users\m\Desktop\Test\V1\New\main.py", line 38, in <module>
  |     loop.run_until_complete(run())
  |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
  |     return future.result()
  |            ^^^^^^^^^^^^^^^
  |   File "c:\Users\m\Desktop\Test\V1\New\main.py", line 16, in run
  |     await add(conn)
  |   File "c:\Users\m\Desktop\Test\V1\New\main.py", line 19, in add
  |     async with asyncio.TaskGroup() as tg:
  |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\taskgroups.py", line 147, in __aexit__
  |     raise me from None
  | ExceptionGroup: unhandled errors in a TaskGroup (2 sub-exceptions)
#
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "c:\Users\m\Desktop\Test\V1\New\main.py", line 31, in a2
    |     await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
    |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\site-packages\asyncpg\connection.py", line 317, in execute
    |     return await self._protocol.query(query, timeout)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "asyncpg\protocol\protocol.pyx", line 323, in query
    |   File "asyncpg\protocol\protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
    | asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "c:\Users\m\Desktop\Test\V1\New\main.py", line 35, in a3
    |     await con.execute(f"INSERT INTO test1 (number) VALUES ({n})")
    |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\site-packages\asyncpg\connection.py", line 317, in execute
    |     return await self._protocol.query(query, timeout)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "asyncpg\protocol\protocol.pyx", line 323, in query
    |   File "asyncpg\protocol\protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
    | asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
    +------------------------------------```
torn sphinx
grim vault
harsh pulsar
harsh pulsar
torn sphinx
#

Well mainly because I did use pyscopg

harsh pulsar
#

that's guaranteed to confuse people

torn sphinx
#

Fair enough

#

I'll change it in the message

torn sphinx
waxen finch
# torn sphinx Ah I see, fixed it. It still doesn't do both though

afaik one connection isnt capable of doing more than one query at a time, so you'd normally open multiple connections to your database to run each of your queries in separate transactions, e.g. ```py
async with asyncpg.create_pool(...) as pool, asyncio.TaskGroup() as tg:
tasks = [tg.create_task(run_query(pool)) for _ in range(3)]

async def run_query(pool):
async with pool.acquire() as conn:
await conn.execute("INSERT INTO my_table VALUES ('foo')")```

#

(by default create_pool() will immediately open 10 connections, but you can adjust that with min/max_size= if you want less)

torn sphinx
#

googleapis has a way to pull youtube data

torn sphinx
waxen finch
# torn sphinx Is there a way to set unlimited? I plan on allowing alot of executions that coul...

i dont think literally setting an unlimited number of connections would be a good idea because at some point your client/server won't have the resources to handle it, not to mention your database has a limit on the number of connections too (100 by default)
what specifically about your "executions" is taking up so much time? is it some kind of processing on the database, or on your client? if its the client, i would suggest batching your queries into a few transactions so you can release the connection during the heaviest parts, e.g. ```py
async with pool.acquire() as conn:
all_the_data = await conn.fetch("SELECT ...")

Work with data now...

async with pool.acquire() as conn:
await conn.execute("INSERT INTO ...")```

fallen vault
#

I have a user level column in my database. How do i select all users where the user level is greater tahn or equal to 3?

#

Please ping me if you respond.

torn sphinx
#

When I was doing it before with I think a couple hundred million records it took ~30 seconds

#

But that was using pyscopg so it could be faster with asyncpg

#

I fully coded the tool before but the only issue was it wasn't asynchronous so if one user called it, the next would have to wait the time left for that transaction + the time for the transaction they made

torn sphinx
waxen finch
waxen finch
keen minnow
torn sphinx
torn sphinx
#

I'm not sure if it is slow due to the resources available or if it's just the way psycopg was. I'm going to have a similar testing situation as before so I can actually grasp what's taking majority of the time.

#

Based off of their testing I should be seeing somewhere near a 5x speed difference, this is a big difference. That means if it took 30 seconds before it would take ~6 seconds now.

keen minnow
#

If you wonder if it's slow due to the resources available, then you ought to monitor the resources available

#

be it resources on the db, the thread or connection pool from the app, etc.

torn sphinx
keen minnow
torn sphinx
keen minnow
torn sphinx
keen minnow
torn sphinx
keen minnow
torn sphinx
#

Ah okay that makes sense, appreciate it.

#

I'll look into it

keen minnow
#

At the end of the day, it's all about going through hypothesis and validating them

#

You should never have to guess

#

also, it's a very valuable and difficult skill to get

#

so it's great for your career too

elder sand
#

Hey, can someone help me out with this question : I’m running 3 websites on a linux instance. For each website I have a sql database server running using docker container. So 2 postgresql docker containers & 1 mysql docker container. What I want to do is to migrate these db’s to a new postgres docker container in real-time, i.e copy the current data and keep copying the new data until its same. There are tools like pg_dump for posgres, but those are manual methods. Is there an automated approach to solve this problem, for eg : bash/ python scripts.

hexed estuary
#

oh, I see, asyncpg is 25% cython. cool.

brazen charm
#

yeah most of the heavy lifting is done in cython

#

not that those numbers matter much most of the time, since you need a massive DB to start hitting those numbers

fading patrol
narrow shore
#

Let's say you are in a position where you can't use Spark etc to process a huge chunk of data from a RDBMS (Postgres / MySQL) to calculate some value, but you need all the data from multiple tables to calculate it. What would your approach be? (stored procedures is a no-no)

west sentinel
#

what does this error mean from pymongo?
SSL handshake failed: ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992),SSL handshake failed: ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992),SSL handshake failed: ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992), Timeout: 30s, Topology Description: <TopologyDescription id: 64f388b2e820f7223e1abc5d, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-00.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>, <ServerDescription ('ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-01.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>, <ServerDescription ('ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('SSL handshake failed: ac-qb0gg1h-shard-00-02.2lrs6ai.mongodb.net:27017: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:992)')>]>

brazen charm
#

it means it was unable to connect to the address given due to it failing to complete a handshake within the timeout

west sentinel
#

do you know what could be causing the problem

brazen charm
#

are you running some really old version of pymongo?

#

or is atlas having any service issues rn?

west sentinel
#

my pymongo version is 4.4.1

brazen charm
#

can you run python -c "import requests; print(requests.get('https://www.howsmyssl.com/a/check', verify=False).json()['tls_version'])" in your terminal

#

(assumes you have requests installed)

#

And/Or run openssl version

west sentinel
#

tls version is 1.3

brazen charm
#

hmm

#

That should be new enough, only v1.0 or 1.1 should really be disabled

west sentinel
#

openssl version gives me LibreSSL 2.8.3

brazen charm
#

I assume you're using atlas btw?

west sentinel
#

yes

brazen charm
#

What if you run python -c "import ssl; print(ssl.OPENSSL_VERSION)"

west sentinel
#

OpenSSL 1.1.1s 1 Nov 2022

brazen charm
#

right

#

so it looks like the openssl version python is using is outdated

#

well, outdated enough to not be able to connect to atlas

#

What python version are you using?

west sentinel
#

i think 3.11

#

3.11.2

brazen charm
#

did you compile it from source?

west sentinel
#

i used pip

brazen charm
#

hmm?

west sentinel
#

oh shit

brazen charm
#

For the interpreter itself?

west sentinel
#

nah im being stupid

#

no i didnt

brazen charm
#

can you run python -c "import ssl; print(ssl.HAS_TLSv1_2, ssl.HAS_TLSv1_3)"

#

Just realised thats the openssl version not the tls version

west sentinel
#

true true

brazen charm
#

okay, well thats weird

#

you have the right TLS versions

#

pithink So why is mongo using 1.1

#

what is your connection URI? (Remove the auth stuff if that's in there aka the mongodb://foo:bar@

west sentinel
#

"mongodb+srv://x:x@cluster0.2lrs6ai.mongodb.net/?retryWrites=true&w=majority"

brazen charm
#

can you add &tls=true to the end of the uri?

west sentinel
#

it worked

#

?? how

brazen charm
#

pymongo having some seriously poor error handling and no TLS upgrade support apparently

#

So it's not really a TLS version error, the error is just super miss-leading

#

pymongo just doesn't bother attempting to setting up TLS unless you explicitly tell it to

west sentinel
#

oh ok

#

thank you very much

pure mortar
harsh pulsar
pure mortar
#

my condolences

#

blobpray 🕯️

#

mongo: complex queries over ton of data -> falls over

deep jetty
#

Yes. You can use AND or OR to combine multiple conditions.
You only need one WHERE at the beginning.

brazen charm
potent spire
#

How I can store data in mongo db?

torn sphinx
#

Hey guys. Need a little suggestion. So I need create a nosql db (dynamoDb). The purpose is to store ads data in there. And the queries will be like filtering on a certain keyword and getting data based in a date range.

What would be the appropriate nosql schema for this task. The number of rows would be high likely over a billion.

coral wasp
#

Can you explain more about what ads detail is?

torn sphinx
#

When you search a keyword you will get multiple ads

coral wasp
#

But what about the ads? What kinda of information? Can’t design a schema without know the objects and attributes

torn sphinx
#

Hmm title, description, datetime, keyword

coral wasp
#

4 attributes? Why nosql?

#

And I don’t really understand the question then, if just four attributes, the schema is just one table four attributes?

torn sphinx
#

Yes I meant as in partition and sort key and all that kind of stuff

#

I have only worked with RDS so far

torn sphinx
coral wasp
#

If you’re only searching for keyword and date range, I’d think you’d partition in keyword with a secondary on date to support a range query

torn sphinx
#

So will that be a sort_column?

coral wasp
#

Yah, or if you are searching across entire, you could use a global secondary index… one sec find a ref

torn sphinx
#

Ah, global secondary is like if you want ads in past 30 days irrespective keyword bin they belong to yeah?

coral wasp
#

Yah

torn sphinx
#

So what will I put as a partition key in global index

#

in this case

#

Since we essentially just need a sort key and no partition

coral wasp
coral wasp
torn sphinx
coral wasp
#

Thinking more generally: I’m not an ad-guy, so I don’t know if this is optimal or even makes semantic sense (don’t searches have multiple keywords). Im just commenting on the schema you described

torn sphinx
#

Gpt says make date the partition in gsi

coral wasp
#

You said you were searching keyword and date, not merely date.

torn sphinx
#

Yeah but this is secondary index.

#

In the case where we only want to filter on date

coral wasp
#

Sure, you can put an index on any additional thing you want to filter on

torn sphinx
#

Awesome. And do you see any use case of a local secondary index in this case?

#

Just relating to keywords and date

coral wasp
#

There are limits and costs. Locals are good when you’re doing an and search: keyword=Foo and date…

torn sphinx
#

Are secondary indexes slower than normal ones?

coral wasp
#

Don’t think so, but they add overhead because they need to be updated during inserts

torn sphinx
#

Cost wise?

coral wasp
#

No idea, been a while since I look at that

#

Aws charges you for everything so certainly there’s a cost for everything and every io

#

Overall tho, I’d really make sure you understand the queries and whether dynamodb makes sense here. If there’s queries are aggregate stuff like: average ad hits per day over the last year, I dunno if it’s the right solution

brazen charm
#

Secondary indexes don't have a read performance cost but it makes writes potentially slower/more expensive

#

As the cluster has to update the indexes across each node seperately

#

you dont want Dynamo for this though btw

#

It'll cost you an absolute fortune

torn sphinx
#

Uniqueness: The combination of Partition Key and Sort Key (if present) must be unique for each item in the table.

Is this true?

brazen charm
torn sphinx
#

Silly gpt

brazen charm
#

partition keys and sort keys are not like primary keys

torn sphinx
coral wasp
#

Me, I’d probably dump the ad data into parquet files and do roll up analysis into a separate summary db

brazen charm
#

You have 3 fairly reasonable options for this depending on what latency you want:

  • Clickhouse for the lowest latency possible, it'll be the most expensive option though and require manual deployment of clickhouse cloud.
  • Redshift a bit cheaper than Dynamo but still expensive
  • Athena, which is basically S3 datalake querying, your queries are in the seconds at best though, it is by far the cheapest option though
torn sphinx
#

Ah dynamoDb even expensive that redshift

brazen charm
#

Both are pretty expensive for a lot of data

#

but redshift is more tailored for analytic queries

#

Depending on the queries you could also potentially use https://quickwit.io (Although disclaimer I used to work there and still work on the project) which will probably one of the cheapest options and still give you sub second latency ish

Sub-second search & analytics engine on cloud storage

torn sphinx
#

We are stuck with aws

coral wasp
#

Is this db part of customer facing app? Like interactive? Or for internal analytics?

torn sphinx
#

Hmm, yes it will query ads both to show user straight away. But also to do something in the backend

brazen charm
torn sphinx
#

But queries are simple like select * from table where keyword='djsj' and dat in past 30 days

brazen charm
#

Or cassandra/scylla would work for the lowest latency possible

#

since these seem like single point click queries

stiff radish
torn sphinx
#

yes

#

Talking to gpt, it said maintaining indexes at that scale is difficult in an rds

brazen charm
#

I kind of agree with GPT that a non-relational DB is probably well suited for this

#

What sort of queries per second are you expecting?

torn sphinx
#

Not much I reckon, not many users

#

Internal uni usage by staff to post ads

#

And also to see competitor ads

brazen charm
#

and what is your budget for this roughly? On a monthly basis

torn sphinx
#

That's a talk we haven't had. But ig not super low

#

If dynamo costs you per time then the queries shouldn't take long and that can't be a fortune like you said

brazen charm
#

It charges by just about everything, i.e. per op, per time, per GB, etc...

#

but if your read rate is very low (I.e. less than 1k QPS) and your data is like < 500GB then you are probably fine i guess

#

i think it'll be a couple hundred dollars a month ish

coral wasp
torn sphinx
brazen charm
#

I am curious about this though this Since low QPS, and the queries seem to be very direct I.e. "Get n matches for keyword 'XY' in date range Z"

torn sphinx
#

We need titles and descriptions

#

Of indivudal ads

brazen charm
#

How many results do you expect per query? Just one?

coral wasp
brazen charm
#

I think they said upto around 1B rows

coral wasp
#

Yah but that’s events: ad hits

torn sphinx
stiff radish
#

Your biggest issue is how the DB handles indexing and search over such a big dataset.

torn sphinx
#

Actually 15k and each search might have returned 3 ads on average. So that will be about 45k rows

coral wasp
#

Like, could you just store two tables: ad and ad_hit? Ad_hit could be a rollup

brazen charm
#

Hang on a sec, they just want to get the ad title and description though, this isn't really an aggregation or searching thing

#

it's Keyword, daterange -> rows

#

Right?

torn sphinx
#

yep

brazen charm
#

yeah Dynamo should be fine

stiff radish
#

So you index over time but then you still need something that searches over all the records in that timeframe somewhat efficiently

#

I'm lazy so my reflex would be to explore possible postgres plug-ins.

torn sphinx
brazen charm
#

Your partition key is the keyword
Your sort key would be the datetime

If you have multiple keywords per ad, you're better off duplicating the rows

stiff radish
#

How would the index over the text look like?

torn sphinx
#

Yep they are duplicated already

stiff radish
#

Do you have duplicates?

#

Ah ok.

torn sphinx
#

one row per retrieved ad even though they came in same request

coral wasp
brazen charm
#

Dynamo should be fine then, as long as you can target the specific partition key without having dynamo filter, it's fine

torn sphinx
#

Awesome

brazen charm
#

The aim of the game with Dynamo/Cassandra/Scylla is you always want to filter to a specific partition (because the drivers can work out exactly what node the data lives on) so you're not scanning loads of data

#

So if you're doing that, should be fine

brazen charm
torn sphinx
#

And should I be worried about having a gsi on date. That query wouldn't be as common, that is to query with a keyword or a list of keywords. So will that gsi be a lot of cost for lot of rows?

brazen charm
#

Do you need a GSI on date?

#

like you want a list of keywords that appeared within a range?

torn sphinx
#

Well ads* irrespective of the keywords they belong to

#

Though rare query

brazen charm
#

You will probably be fine, it's a rare query and you don't have a huge amount of data or query rate.

#

If I remember right GSIs are built upon the materialized view system, so you do effectively double your data size though

stiff radish
#

I'd still look into how text is indexed with that GSI thing.

brazen charm
#

It's a hash

#

from a very high level view, a GSI effectively creates a new table with the field(s) you want to build the index on as the partition key

stiff radish
#

Which is fine for text cause you only need equality anyway.

torn sphinx
brazen charm
#

yes

#

internally the DB hashes each key and then works out what partition the key lives in based on the hash

torn sphinx
coral wasp
brazen charm
torn sphinx
#

It would be a range on the gsi

may<dat<june

stiff radish
brazen charm
#

If you're doing things like which ads with xyz in the title had the most hits in your postgres or athena queries you're going to be there a while

stiff radish
#

Well, @coral wasp asked the right questions already "is this an operational or analytical system"

#

If it's really just searching for keywords my broke solution might've been to use an RDBMS and do it somewhat properly:

Before inserting split all the text into distinct words.
Have an ad table: ad(PK, date, title)
Have a bridge table: ad_keyword_bridge(FK_ad, FK_key)
Have a keyword table: keyword(Keyword). The keyword can just be the PK. You hash this.

#

Your query plan would look quite good; If you put a good ol' B-tree or similar index on date and a hash on the keyword you'll be fine. You're also storing way way less data this way because your keyword table contains a unique set of keywords.

Maybe I misunderstood a requirement but it aint that deep, no?

coral wasp
stiff radish
#

It's not just about full text search imo but also just storage size.

coral wasp
stiff radish
#

Depending on how many ads there are per 30 days (if that's truly their timeframe) this is a trivial solution. It depends on the type of query they're doing. What I'm suggesting might scale gracefully and is conceptually very simple and DB-wise idiomatic, I'd benchmark it 🤷

brazen charm
#

I think you two might be slightly overengineering this 😅

coral wasp
#

Now we need to decide on a caching layer

stiff radish
#

I'm old school so I think No-SQL is overengineering

torn sphinx
stiff radish
#

3 tables in a RDBMS is not overengineering

brazen charm
#

That is probably not a great way of going about things

coral wasp
stiff radish
#

I'm exaggerating a bit but this is a very simple schema imo. I can't count the amount of times I've gotten Mongo from web devs that needed to become relational downstream.

brazen charm
#

That's just because people fan boy over using Mongo because it's easy to do something quick and dirty VS actually thinking about the schema.

stiff radish
#

I like nosql when it has a clear advantage but I'm pretty sure their schema is "fixed" and if they're not doing this on their entire DB it'll scale while likely having a lower storage cost so idk what the advantage of dynamo is.

#

This might be a me-problem, I'm curious to know why using 3 tables is worse than dynamo. I haven't used it because I'm squarely in an Azure country

coral wasp
#

Dynamodb isn’t nosql in the sense that mongo is

#

(Lacking the right words right now to express this, I’m going to get some coffee)

brazen charm
#

It is serverless point -> click access.

Dynamo is the same as Cassandra and Scylla if you've used either of those.

But the base concept is the partitioning of data and querying of data is done such that a query only ever needs to hit on database node/shard. Sometimes even down to what CPU core the data lives on.

So rather than your relational DB which might have several indexes and table to JOIN on which get expensive to scale, Dynamo and that go Keyword 'x' lives on partition 'Y' which is one node 'Z' which is a single round trip and much faster.

They operate on the "Many cheap queries per second is better than fewer expensive queries" I.e. issuing 40 queries for 40 keywords is cheaper than issuing one query, because the client can run all 40 queries concurrently and only talk to the DB node which actually has that keyword, rather than randomly picking one node, and then that node having to work out what keyword lives on what node and then sending the query to each node that has the data, which has a much higher number of round trips which increases latency.

stiff radish
#

On Azure you have MPP versions of SQL server where you determine your hash and storage strategy

brazen charm
#

Now for this amount of data, it doesn't matter that much because relational DBs can still realistically handle the data fairly easily, but in this case you're on serverless AWS instances where accessing dynamo is miles faster and easier than RDS for this workload

stiff radish
#

Like round robin over all nodes or "store everything of one node together" etc.

#

It's also serverless

brazen charm
#

I have some 😅 less than positive opinions of SQL server xD

stiff radish
#

On-prem or cloud?

brazen charm
#

Both

torn sphinx
#

This is how it's actually behaving as well when I am pushing a df. It says contains duplicate rows

brazen charm
#

You might need to do PK: keyword, SK: (dt, ad_id)

torn sphinx
#

oh okay, gpt suggests just a random id

torn sphinx
brazen charm
#

It doesn't really have any use other than stopping the conflict

delicate fieldBOT
#

The rules and guidelines that apply to this community can be found on our rules page. We expect all members of the community to have read and understood these.

#

8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.

gentle zinc
#

is there any incentive with using the cli client over the workbench? i mean the workbench is so much more intuitive

gentle zinc
#

or am i oblivious to it's greatness i mean there ought to be a reason for it's popularity

#

cause there's plenty of alternatives out there

fading patrol
gentle zinc
#

oh and quick question if we dont mention the size what value will it take by default

#

it's 1 right

paper flower
paper flower
#

Specific tool shouldn't matter

gentle zinc
#

that's why i've resorted to using the workbench for now

paper flower
#

Yeah, that makes sense

#

If you're learning SQL - use a tool that allows you to write it more comfortably pithink

gentle zinc
paper flower
#

Don't ask to ask

gentle zinc
#

i've heard this numerous times lmao

#

oh and is there a way to "lock" queries in the workbench for future references?

#

for example i would like to lock the output to another pane

paper flower
#

I don't really use mysql and whatever tool you're using too

#

But you can just write them to files 🤔

gentle zinc
gentle zinc
#

@fading patrol i need help :/

coral wasp
drifting thunder
#

anyone here work with sql alchemy?

coral wasp
drifting thunder
paper flower
waxen finch
#

@halcyon dew i saw your message about aiosqlite connection pooling yesterday #discord-bots message, and i was curious about it so ive written up a few benchmarks with asqlite to see how multiple connections perform
https://github.com/thegamecracks/asqlite-bench

performance apparently degrades very quickly: ```py

py -m asqlite_bench -c 1 queries\2-selects.toml
Finished in 0.512s
py -m asqlite_bench -c 2 queries\2-selects.toml
Finished in 0.969s
py -m asqlite_bench -c 10 queries\2-selects.toml
Finished in 7.106s
py -m asqlite_bench -c 100 queries\2-selects.toml
Finished in 7.643s``` the numbers suggest that asqlite performs better with a limited number of connections, though it doesnt make sense because SQLite is supposed to allow concurrent readers...

halcyon dew
waxen finch
#

similar results came out of manually running the select statements in different threads, although it was overall faster
https://paste.pythondiscord.com/YTLQ
1 conn: 0.212s
2 conns: 0.589s
10 conns: 6.600s
100 conns: 6.892s

waxen finch
#

what in particular are you stuck on? i think your suggested table is fine, although the ordering of user1 and user2 might be ambiguous

#

what will you use for the row's primary key? if its (cmd, user1, user2), then (cmd, user2, user1) would be considered a separate interaction

shut tiger
#

Sort the pks of user1 and user2.

waxen finch
#

^ although as a derivative question, how would the query be written if it was a many-interactions to many-users relationship? for example: sql CREATE TABLE interaction ( id INTEGER PRIMARY KEY, cmd TEXT NOT NULL, count INTEGER NOT NULL DEFAULT 0 ); CREATE TABLE interaction_user ( interaction_id INTEGER REFERENCES interaction (id), user_id INTEGER, PRIMARY KEY (interaction_id, user_id) ); im thinking of a self-referential join but im not sure if there's a better way: sql SELECT i.cmd, i.count FROM interaction_user iu1 JOIN interaction_user iu2 ON iu1.interaction_id = iu2.interaction_id JOIN interaction i ON i.id = iu1.interaction_id WHERE iu1.user_id = ? AND iu2.user_id = ?;

hybrid marsh
#

Need some help with cassandra, if anyone's around. Got a list<text> type, and when I call for cassandra to return it, it only returns the first ten items in the list, regardless of what I set the LIMIT to

brazen charm
#

Also in general it's recommended to avoid the list datatype if you can, really you want a set unless you really need duplicates or sorting

hybrid marsh
#

Need duplicates

hybrid marsh
brazen charm
#

take a screenshot of the rows returned

hybrid marsh
#

Only 1 row

brazen charm
hybrid marsh
#

but thats expected because there is only 1 row

brazen charm
#

What are you using to query the data?

hybrid marsh
#

my problem is the fact that the row thats returned, the list has 12 items, but the query only returns 10

hybrid marsh
brazen charm
#

Can i see the code?

brazen charm
hybrid marsh
#

Proper weird. Now I only have 10 items. The table must not have updated in my client (TablePlus). Appreciate the help anyway

torn sphinx
#

Had some help with this a couple of days ago but I'm still confused so I'm gonna post again.

import asyncpg as psy
import asyncio

tables = ('test1', 'test2', 'test3')

async def start():
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    async with asyncio.TaskGroup() as tg:
        for fn in tables:
            tg.create_task(table_check(fn))
    async with asyncio.TaskGroup() as tg:
        for fn in tables:
            tg.create_task(search(fn))


async def table_check(i):
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    await conn.execute(f"CREATE TABLE IF NOT EXISTS {i} (value INT)")

async def search(i):
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    print(i)
    a = await conn.execute(f"SELECT COUNT(id) FROM {i}")
    print(a)

loop = asyncio.get_event_loop()
loop.run_until_complete(start())```
#
test1
test3
SELECT 1
SELECT 1
test2
SELECT 1```
#

^ Output

#

It should be printing how many records there are

torn sphinx
#

@torn sphinx didn't work g

#
+ Exception Group Traceback (most recent call last):
  |   File "c:\Users\m\Desktop\Test\V1\New\search.py", line 27, in <module>
  |     loop.run_until_complete(start())
  |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 
653, in run_until_complete
  |     return future.result()
  |            ^^^^^^^^^^^^^^^
  |   File "c:\Users\m\Desktop\Test\V1\New\search.py", line 11, in start
  |     async with asyncio.TaskGroup() as tg:
  |   File "C:\Users\m\AppData\Local\Programs\Python\Python311\Lib\asyncio\taskgroups.py", line 147, in __aexit__
  |     raise me from None
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "c:\Users\m\Desktop\Test\V1\New\search.py", line 24, in search
    |     print(a[0]['count'])
    |           ~~~~^^^^^^^^^
    | TypeError: string indices must be integers, not 'str'```
harsh pulsar
#

admittedly that's because i had/have poor intuition for mongodb system design and i basically translated it to a relational model. because realistically it was relational data

#

i spent a long time thinking about how much i could precompute to take advantage of the "document" thing, but the computations all involved lookups across entities so the pre-computation upon insert would end up getting expensive, and i thought it was a lot less bad to have slightly slower lookups than to run the risk of accumulating backpressure due to a large number of updates/inserts, because the thing was being fed by a webhook triggered by updates in the cms

#

in hindsight i was probably too worried about that because the group of users who were actively editing things in the cms was small. with the benefit of hindsight i probably could have done without the giant pile of lookups in the main query

#

fortunately the lookups were all lookups on unique keys so we just added some indexes and the performance was good enough that we didnt need to pursue further optimization

pure mortar
#

whew wow a happy ending with mongo

#

@brazen charm would maybe like to see this

pure mortar
harsh pulsar
#

indeed

pure mortar
#

funny enough, one of the products in the portfolio atm only has a small number of users but theyre kinda optimizing too much for them as well but what do i know

harsh pulsar
#

yeah it takes discipline to not spend too long optimizing something that already has 50ms round trip response time and serves 10 concurrent users max per day

pure mortar
#

its taking up two dev teams but not like im in charge

harsh pulsar
#

damn ok

harsh pulsar
#

i thought i was bad taking an extra week to polish my data science code for posterity

pure mortar
#

i would take 300ms

#

heck yknow how ML stuff is sometimes, sometimes we're looking at SECONDS in latency

harsh pulsar
pure mortar
#

for now

harsh pulsar
#

oh yeah for sure

#

and in my case the 50 ms is just within the AWS VPC

pure mortar
#

always aws vpc smh

harsh pulsar
#

but that's all we need, the current app im finishing up is for an internal api only

#

i like it that way

#

keep that frontend away from me

pure mortar
harsh pulsar
#

lol

#

i dont understand it

#

this is getting offtopic now but the frontend tooling churn is really insane

pure mortar
#

oh yeah i forgot we were in db. oops

harsh pulsar
#

anyway currently something like 50% of our latency is just database lookups. id actually be curious about this

#

our main query is something like

select *
from records
where group_id = :1 and created_time <= :2
order by created_time desc
limit 1

and the schema for the table is something like this in timescale:

create table records (
  group_id uuid,
  created_time timestamp,
  -- other attributes ...
  primary key (group_id, created_time)
);
select create_hypertable('records', 'created_time');

and im wondering when i know if i need to add an index on group_id, like what tea leaves should i be reading in the explain output and/or what kinds of metrics should i be collecting about postgres and timescale

#

or does postgres actually create an index for lookups because it's part of the primary key?

#

i can never remember how that works in different databases

#

my intuition was that within each chunk of the hypertable, table scans should be reasonably fast, so i didn't feel like we desperately needed anything special here. i don't have a query plan handy but it seemed pretty "normal"

pure mortar
#

hmm yeah if you ever need to optimize, those would be good places to look/investigate first. lmk if you end up doing so

harsh pulsar
pure mortar
#

||that may be for the best. go solve other problems, salt||

paper flower
#

Could you try slapping an index on created_time?

torn sphinx
#
import asyncpg as psy
import asyncio

tables = ('test1', 'test2', 'test3')

async def start():
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    async with asyncio.TaskGroup() as tg:
        for fn in tables:
            tg.create_task(table_check(fn))
    async with asyncio.TaskGroup() as tg:
        for fn in tables:
            tg.create_task(search(fn))


async def table_check(i):
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    await conn.execute(f"CREATE TABLE IF NOT EXISTS {i} (value INT)")

async def search(i):
    conn = await psy.connect(user='postgres', password='root', database='postgres', host='localhost')
    print(i)
    a = await conn.execute(f"SELECT COUNT(*) FROM {i}")
    print(a)

loop = asyncio.get_event_loop()
loop.run_until_complete(start())```
#
test1
test3
SELECT 1
SELECT 1
test2
SELECT 1```
#

^ This is the output, I want it to be outputting the count, which I tried count(id) (mentioned by someone previously)

waxen finch
# torn sphinx ^ This is the output, I want it to be outputting the count, which I tried `count...

see the API reference:
https://magicstack.github.io/asyncpg/current/api/index.html#asyncpg.connection.Connection.execute
conn.execute() returns the status of the query (which is apparently an sql string?), but what you want is the row/values of that query, so you should use fetch() or fetchrow() or fetchval() instead: ```py
rows = await conn.fetch("SELECT ...")
row = rows[0]
count = row[0]

OR

row = await conn.fetchrow("SELECT ...")
count = row[0]

OR

count = await conn.fetchval("SELECT ...")```

torn sphinx
#

Like after it's created

waxen finch
#

i mostly toy around with postgresql through a pgadmin container though, so i wouldnt be aware of specific details

torn sphinx
waxen finch
#

i think having that many tables would be cumbersome, but i don't know enough about the data you're modelling to give an appropriate answer (im no professional database designer either, so i probably wouldnt know anyway)

#

actually why would you have thousands of tables if the total number of columns is ~50?

#

or did you mean ~50 columns for each table

#

if you dont need to enforce any constraints on the data or access/modify individual columns (i.e. selecting the entire row), perhaps you'd be interested in denormalizing your schema with json[b] columns, arrays, etc.

torn sphinx
# waxen finch or did you mean ~50 columns for each table

If I do one table, I'd have a lot of columns do to the different types of data being stored. The issue is I'll need to do searches through all of the tables if I have multiple. meaning to find results (anything with related data needs to be pulled) I'd have to loop through each table.

torn sphinx
waxen finch
torn sphinx
waxen finch
# torn sphinx I won't really have those, the data is all pretty much csv files I'm dumping. Mo...

oh, so you're debating between giving each csv file their own table vs making it available in one? if you again dont want to constrain the data or modify it, i imagine a table with an indexed filename and an array column would be okay for querying sql CREATE TABLE datasets (filename TEXT, "row" TEXT[]); CREATE INDEX ON datasets (filename); though the lack of a schema makes it hard to work with the COPY command, so perhaps that's not a good idea

torn sphinx
#

I used pandas to convert csv files into tables which would then dump the data, I typed in the columns that needed to be created and it would create the table based off of the csv file's name

waxen finch
torn sphinx
#

And by an array column do you mean I'd have one column which would be the entire row rather than having a column for each value?

waxen finch
waxen finch
torn sphinx
waxen finch
#

less duplication, the filename is stored once and a number is used in-place of it instead of storing 5 billion filenames

#

same goes for the columns column

#

alternatively you can look at it and say one csv file has many rows, a.k.a. a one-to-many relationship

torn sphinx
# waxen finch less duplication, the filename is stored once and a number is used in-place of i...

Basically relearning this so I'll lyk what I'm thinking is happening. You're creating a table which has a SERIAL PRIMARY KEY column under the name id. You then create a column called filename (text), which is unique. Meaning that only one of whatever text is inside of it can exist, not sure how it manages that though. Then you create another column which is test, I think the [] turns it into an array. Then you create another table which is linking a column it creates to dataset id, I think this means that its grabbing each unique row for each table name aka filename. Then a row with another column called text which is an array.

#

Not sure what the index thing is, also half certain about what's happening

waxen finch
# torn sphinx Not sure what the index thing is, also half certain about what's happening

i need some sleep now, but for the time being you can check out postgres's docs about indexing and foreign keys

waxen finch
#

each interaction gets its own identifier, and the interaction_user table associates many users with many interactions (in other words a user can be in many interactions, and an interaction can have many users)

torn sphinx
#

Hey guys. I saw that dynamoDb returns paginated responses. Isn't that gonna slow down the data retrieval rate if there are a lot of rows

#

it's only giving 700 rows at a time. So could be 100 calls to the api

stiff radish
#

50/50 between here and #software-architecture but for those that use raw SQL, how do you guys keep it organised?

Having queries everywhere is not DRY whatsoever on the other hand stored procs or views place business logic in the DB.

I'm a bit "tired" of ORMs and buying into a unit of work and a repository pattern on top of MVC but a compromise would be a query builder a la SQlalchemy core.

paper flower
#

And ORM really just maps your results to objects, something you'd probably do yourself anyway pithink

hoary onyx
#

Yall know hwere can i find a massive list of websites?

coral wasp
stiff radish
pearl lodge
#

guys, i need help settling up my database:

#

i don't know why my username and password is not working

#

i'm trying to settup atlas

pure mortar
stiff radish
#

For analytics type stuff I don't mind managing raw SQL because code there doesn't really really have to be DRY because the same concept is used over and over. If it's reused I just version control it 🤷

paper flower
stiff radish
#

For the web stuff I do this isn't so much of a requirement but honestly, using connector-x was an incredible speedup.

paper flower
#

Nothing would be able to load a large dataset into memory for example

stiff radish
#

I do manage loading tens of millions of rows, which I all need at the same time so I can't do it in batch (or trust me I would). It's worth looking at the project because you never know, I don't think the current ORMs could handle this.

waxen finch
#

so relatively speaking, you still only want a few rows at a time

torn sphinx
#

I see okay, so I should 100% be using indexes. What I don't get is how they work, and what they effect.

#

In what ways will it change?

long leaf
#

lmao

torn sphinx
#

why is this being posted here

long leaf
#

wrong channel

torn sphinx
#

ah okok

long leaf
#

thought this was regular chat

waxen finch
# torn sphinx I see okay, so I should 100% be using indexes. What I don't get is how they work...

there are several algorithms/data structures involved in indexing, but binary trees are a pretty common one so you can think about that first - if you were to convert this list of numbers, [1, 1, 2, 3, 5, 8, 13, 21], into a sorted binary tree, you end up with something like this: 3 / \ 1 8 / \ / \ 1 2 5 13 \ 21 how do you find 5 in the binary tree? you start from the root and work your way down: 3 is less than 5, so you move right, and 8 is more than 5, so you move left, and you find that 5 exists
this only required 3 comparisons, whereas if you were to find 5 in the list, assuming you didn't know the list was sorted, you would need to start from the beginning and check every number until you found 5
in the worst case that would take 8 comparisons, which would be 2.6x slower than the btree

when you create an index on dataset_row.dataset_id, every row gets a corresponding entry in your index containing dataset_id and a pointer to where the row is stored on disk, so out of 5 billion rows, you'd only need to make at most 32 comparisons to find a row matching a specific dataset_id

torn sphinx
waxen finch
#

though after the index match, postgres doesn't make any guarantees about how it orders rows with the same dataset_id, unless you specify a column with the ORDER BY clause - if you do care about the order of your data, you might want to add a position column and add it to your index so you can efficiently sort by (dataset_id, position)

torn sphinx
waxen finch
#

when you want to write down an interaction between two users, you insert three rows; first row goes into interaction, then one row for each user participating in that interaction sql INSERT INTO interaction (id, cmd, count) VALUES (1234, 'rock-paper-scissors', 1); INSERT INTO interaction_user (interaction_id, user_id) VALUES (1234, 9876), (1234, 5432);

coral wasp
torn sphinx
coral wasp
torn sphinx
#

Interesting okay, I'll keep that in mind thank you.

torn sphinx
harsh pulsar
#

We're always filtering on both, never just one

#

The data is already chunked/partitioned on creation time, and then we do a primary key look up within each partition. seems like it can't get better than that in theory

shut tiger
#

There is an implicit return None too. Maybe you should look at an ORM like django or sqlalchemy.

paper flower
#

Can you change it to time, group_id?

#

See if there's a difference

#

In the end your order by could be slow, did you try profiling that query?

waxen finch
# torn sphinx Doing what you said, having an array with the data, I'm not sure how I'd go abou...

hmm, with the current schema you can do that, for example given the dataset us-counties-2023.csv: ```sql
-- columns = {date,county,state,fips,cases,deaths}

-- To sort rows by highest cases:
SELECT dr."row" FROM dataset_row dr
JOIN dataset d ON d.id = dr.dataset_id
WHERE d.filename = 'us-counties-2023.csv'
ORDER BY dr."row"[array_position(d.columns, 'cases')]::integer DESC``` you can also consider adding a GIN index on row to improve comparisons on individual elements within it but only if you find your query's performance is too slow, as it'll be a very expensive index for your database to maintain when you have a lot of rows

harsh schooner
#

is anybody willing to code review my some 250 line long sqlite abstraction?? 🙃

trail rune
#

Where do I put attributes that two entities share?
To start, I had a many-to-many relationship between CARD and CARDPARTS. A card can be related to zero, one, or many card parts. A card part can be related to one or many cards. To solve this relationship I tried putting an intersection entity between them: a card to card part mapping. Where my question comes from is that cards and card parts have some unique attributes and some that they share. For example, booster is a unique attribute of card; component is a unique attribute of card parts; and they both have the attribute name. Do I put the shared attributes in the intersection entity? Do I keep them in the CARD and CARDPARTS entities? Do I do something else?
If you need more information, please don't hesitate to ask.
Thank you.

harsh pulsar
#

I can definitely try some benchmarks with the PK order swapped, ill also have to look carefully at the explain output after the data is more populated and there are realistic table stats

#

I think someone mentioned a pg profiling tool above? Or maybe that was somewhere else. I can look into it too

harsh pulsar
coral wasp
# harsh pulsar our main query is something like ```sql select * from records where group_id = :...

The first thing you should look at in the explain output is whether the indexes are actually used. That’s 90% of what I look for in explains. Then the question is why not: was the index not usable for the query or simply not helpful (no benefit). Sometimes a missing compound index or something similar is the root problem. (Apologize if this is obv/basic, but most of the time I see people who are just missing the ‘correct’ compound index)

#

And if your main query is looking up the last record for each group, perhaps that could be denormalized and maintained separately (ie: on the group table). Or, perhaps a partitioning strategy

harsh pulsar
#

maintaining this as two separate tables would be kind of a nightmare, it would have to be kept up to date in real time

coral wasp
#

And could you use a last() aggregate, rather than a sort and limit?

coral wasp
obsidian basin
#

I am getting the error caused by the foreign key being NOT NULL but I made the Foreign key NULLABLE. in flask-sqlalchemy. How do I fix this? Thanks for the help. Please ping on reply.

class User(UserMixin, db.Model):
    id = db.Column(db.Integer, primary_key=True)
    email = db.Column(db.String(120), unique=True) 
    rel_payments = db.relationship('Payments', backref='profileinfo', lazy=True) 

class Payments(db.Model):
    '''
    One to many relationship
    This is the Many relationship. 
    '''

    id = db.Column(db.Integer, primary_key=True)
    item_name = db.Column(db.String(80))
    price_of_donation = db.Column(db.Integer)
    # How do I turn email into the foreign key? todo.
    email = db.Column(db.String(120))
    fk_user_id = db.Column(db.Integer, db.ForeignKey('user.id'), nullable=True)

Here is the code in routes.py
https://paste.pythondiscord.com/MIDA

Here is the error.
https://paste.pythondiscord.com/37TQ

harsh pulsar
coral wasp
#

Oh, interesting.

harsh pulsar
#

https://docs.timescale.com/api/latest/hyperfunctions/last/

The last and first commands do not use indexes, they perform a sequential scan through the group. They are primarily used for ordered selection within a GROUP BY aggregate, and not as an alternative to an ORDER BY time DESC LIMIT 1 clause to find the latest value, which uses indexes.

harsh pulsar
obsidian basin
#

@harsh pulsar You were correct let me try again and see if the error still appears , if this is the case sorry for such a dumb question

coral wasp
harsh pulsar
harsh pulsar
#

(to be clear, the current performance is more than acceptable)

coral wasp
#

My suggestion was to precompute something that just narrows the space; even if it’s out of date, it might cut the query from ‘forever until today’ to a few days.

#

So, it’s online/real-time, with the crutch of a narrower range

#

The real test is: how well will this system respond to queries as the table gets larger: will the query remain responsive.

#

Im genuinely curious, haven’t worked with this in pg.

harsh pulsar
#

basically we just need the "most recent previous" record for a given entity, but that could be arbitrarily far in the past or half a second ago

coral wasp
#

Yah, if it’s an ~O(1) operation than yah, o worries

harsh pulsar
#

yeah no idea, would love to know

coral wasp
#

Me too! (Let me know if you do find out)

harsh pulsar
#

won't be that realistic but at least blasting the system with a few hundred concurrent requests from my laptop should at least give us a rough sense of stability under load

#

the db is definitely the biggest perf bottleneck rn. the python app is maybe 1/3 of the time, and most of that is spent inside fastapi

#

ive been wanting to try pypy as well for this

obsidian basin
#

@harsh pulsar I tried deleting the migration folder and app.db. Then recreating them by step 1 flask db init. Step 2 flask db migrate -m "Initial migration." . Step 3 flask db upgrade. Then I ran the code in the terminal. Here is the documentation on flask migrate. https://flask-migrate.readthedocs.io/en/latest/

coral wasp
harsh pulsar
paper flower
craggy gust
#

Sqlalchemy 1.4 and postgress, is there a way to load a row from different tables in a single query? With joins and such? Theh have foreingkeys within them to sort of mapping tables, no relationships, yet.

#

I mean like
Obj1, obj2, obj3 = db.query(table1).join(table2) etc

potent spire
#

is there a way to edit same database from diffrent projects?

paper flower
#

But yes

#
stmt = select(Table1, Table2).join(Table1.table2)
for table1, table2 in session.execute(stmt):
    ...
#

Also I'd rather just use relationships pithink

wise goblet
craggy gust
paper flower
#

And relationships should be the same too

craggy gust
#

Yeah but trying to keep the codebase consistent before actually updating stuff. Adding relations should fix what I need to happen for now should be fine.

nocturne depot
#

Guys I am making a program which will connect with remote mysql server and fetch data. I am utilizing mysql-connector-python and sqlalchemy. Is there a way to add proxy from my side to connect with database

harsh pulsar
paper flower
#

I would still try adding lower bound, changing query itself (e.g. removing some of it's parts), adding indexes, changing pk order

harsh pulsar
paper flower
#

I had a similar situation, but with significantly lower database load - we were storing historing records in the same table, which becomes hard to query latest values from

harsh pulsar
#

my bigger concern would be that the cost of maintaining that extra table would add more latency than we would save by optimizing the query... but i love all the suggestions, this is giving me a lot of things i can try this week

paper flower
#

e.g. status of some kind of object

paper flower
slender hollow
#

Hey guys, I trying to create a csv file with pandas that import multiple csv and concatenate into one dataframe

I already tried every way I know, I dont know too much 🥲

coral wasp
harsh pulsar
#

@coral wasp @paper flower reporting back from the world of query optimization: my lookup is reduced to a single index scan, but it does appear to be scanning every hypertable. had to work on other things today so i didn't get a chance to experiment and see if pre-filtering reduced the number of hypertables to be scanned. i will have a much more realistic amount of data loaded into the database after i get through my load testing tomorrow, so i will revisit this then

paper flower
coral wasp
#

Hmm, kinda what i was expecting, I don't know hypertables but I do know snake oil 😉 (also known as tanstaafl)

torn sphinx
waxen finch
torn sphinx
waxen finch
#

im not sure what else you need to know about the "data", assuming that refers to individual rows

#

once you've parsed the csv into a bunch of lists (see csv.reader()), you can insert each of them directly into the dataset.columns/dataset_row."row" column and asyncpg will convert it into the appropriate array type

torn sphinx
waxen finch
#

#databases message
one column is storing the entire row as a TEXT[] array, so what you should be inserting is a list of values

#

if you did want to insert each value individually, you would need another column to identify the row number that value is associated with alongside the dataset_id (and also a column number to help define the order of those values)

torn sphinx
waxen finch
#

you do, dataset_row.dataset_id lets you check what dataset its from

#

every row in dataset_row can be associated with a dataset because of that foreign key

torn sphinx
#

So in dataset_row I'm storing the format name, age, occupation, blah blah2. I'm so confused rn,

waxen finch
#

the column names? no, you can store that once in the dataset.columns TEXT[] column

#

maybe we should walk through the data you're going to store and understand how to describe it in two tables

torn sphinx
#

Should I use pandas?

waxen finch
#

thats a really vague question so its hard to answer, but if its more specific someone else who uses pandas could help you out

waxen finch
# torn sphinx Yeah sure

To start, in a relational database you normally have a fixed number of tables, and those tables should be able to describe your data in a way that makes it hard to store duplicate or invalid information. For your data, you have a bunch of CSV files, where each file has a filename, a list of column names, and a list of rows. Every row should be associated with a filename and column names, otherwise your row would be meaningless, so you might think of putting them in one table: sql CREATE TABLE dataset_row ( filename TEXT, columns TEXT[], "row" TEXT[] ); However, when you want to insert a bunch of rows from one file, you have to duplicate the filename and columns for every row: sql INSERT INTO dataset_row VALUES ('my.csv', '{col1,col2,col3}', '{123,456,789}'); INSERT INTO dataset_row VALUES ('my.csv', '{col1,col2,col3}', '{321,654,987}'); ... That increases wasted storage space, and makes it possible to say, have two rows in the same file with entirely mismatched column names. Ideally you would define the filename and columns one time, and then have each row refer to that same piece of information, such as: sql INSERT INTO dataset (id, filename, columns) VALUES (1, 'my.csv', '{col1,col2,col3}'); INSERT INTO dataset_row (dataset_id, "row") VALUES (1, '{123,456,789}'); INSERT INTO dataset_row (dataset_id, "row") VALUES (1, '{321,654,987}'); Each row in dataset now represents a CSV file, and each row in dataset_row corresponds to one of your CSV files

torn sphinx
waxen finch
#

As a sidenote, not all relational databases have array types, and an array column can often be represented with another table, for example: sql CREATE TABLE dataset_column ( dataset_id INTEGER REFERENCES dataset (id), name TEXT ); This conforms to 1NF and is normally what you want to design, but arrays tend to be useful when you have a bunch of values that you will often work with as a single item

#

i probably should have explained that earlier because i initially thought you wouldnt need to access/sort by individual elements in each row, which in hindsight made it kind of pointless

#

however postgres provides various operators and functions for working with arrays, so its not a deadend to be using it here

torn sphinx
waxen finch
#

REFERENCES dataset (id) is a constraint, or more specifically a foreign key constraint, which says that dataset_id must refer to an existing row in the dataset table matching the id column

#

its how you associate tables with other tables, and is often needed to safely join two tables together: sql -- Select the rows of any file which has a 'date' column SELECT d.filename, dr."row" FROM dataset AS d JOIN dataset_column AS dc ON d.id = dc.dataset_id JOIN dataset_row AS dr ON d.id = dr.dataset_id WHERE dc.name = 'date';

torn sphinx
waxen finch
#

JOIN table lets you combine the columns of another table, and in the example ON d.id = dr.dataset_id specifies that rows should be joined when they have the same dataset ID
For example, if you have dataset (1, 'birth-rates.csv'), a dataset column (1, 'year'), and another dataset column (1, 'rate'), joining dataset with the dataset columns by ID will give you (1, 'birth-rates.csv', 'year') and (1, 'birth-rates.csv', 'rate')

torn sphinx
sonic plaza
#

Hey where can I learn sqlite3?

fading patrol
potent spire
#

so how I edit a number in json file?

fallen vault
#

What’s the easiest way to offload a local database to a cloud database?

fallen vault
#

Or use the json package to dump a dict into a json file.

storm mauve
#

the most straightforward and applicable to most might be just dumping and restoring (or exporting to something like csv/json and importing if not the exact same kind of database), but some may have more efficient options

potent spire
fallen vault
potent spire
#

how I can check if a json file is empty or no?

pure mortar
hollow oar
#

ooo. a timescaledb optimisation question - can't believe i missed that.

  • select create_hypertable('records', 'created_time'); - this doesn't have space partition, worth a try after you have exhausted other options imo
  • what's the index on the chunks? is it a multi-column btree index? anyway - the ordering of group_id, time makes sense to me.
  • is there anything that uses JIT in the explain analyze plan? i recall i got bitten once by that, JIT'ing for absolutely no reason and wasting compute over and over.
torn sphinx
waxen finch
#

<new_column> <column_datatype> REFERENCES <referenced_table> (<referenced_column>)

harsh pulsar
harsh pulsar
slender hollow
#

Hey guys

#

Does have a good knowledge on pandas?

#

I'm struggling trying to merge CSV files.

torn sphinx
fluid quail
#

Make

potent spire
slate basin
#

How can i use independent db with a discord bot I want to use sql?

halcyon dew
#

SQLite?

#

Or postgres. Bc with postgres u host the server, make an role and password for it, and then create a database that is the owner of the new account you just created

slate basin
#

ic

fluid quail
potent spire
#

I solved it

#

but I have a question

marsh otter
#

Just wanted to understand industry standards, what's the benchmark for getting 100K recrods from table and having it in-memory. I have to apply couple of functions on top of this tabular data. Right now it takes 15 Seconds for me to get the data from snowflake, fetch it as Pandas DataFrame.
Is it normal in industry to do this kinda of thing?
Any one who have previously done, can you elobrate your tech stack.
I wanted to do this in a request response cycle. I'll send very minimal data to the frontend after computation. (sub second performance)

coral wasp
# marsh otter Just wanted to understand industry standards, what's the benchmark for getting 1...

Benchmark? Like, how long should it take? I'd guess that most of that 15 seconds is snowflake executing/processing the query, not anything on the client side. And, I doubt the snowflake database connector is the bottleneck here. What I'm assuming is: it takes nearly 15s for the query to get the 100k rows to complete. I'd first check that, such as doing the same operation on another system or snowflake interface or whatever.

brazen charm
#

overall I would say it's slower than i'd expect

#

But I wouldn't expect it to be less than a few seconds

marsh otter
#

Just to give more context, the data has 15 cols, 5 integer, 5 string and 5 dates. that's it

coral wasp
#
  • 15 sec is obviously unexpectedly slow.
marsh otter
#

Got it. Python connector is in latest version, pandas is in 2.1.x

coral wasp
#

And pyarrow?

marsh otter
#

Pyarrow is in 10.0.1

coral wasp
#

Oh, that's crazy old

#

Sure not 12.0.1?

#

They just released 13

marsh otter
delicate fieldBOT
#

pyproject.toml line 9

"pyarrow>=10.0.1,<10.1.0",```
marsh otter
#

I'm still not able to figure out this 10 to 15 seconds for 100K. when I checked the memory size of the pandas dataframe or list of tuple. It is just around 100MB

marsh otter
coral wasp
#

<1sec on the client side.

marsh otter
#

Man! I don't know what am I even doing wrong! 15 Seconds it takes for me 😦

coral wasp
#

Maybe compare fetchall performance vs fetch_pandas?

marsh otter
coral wasp
# marsh otter Yes I did that. fetch_pandas_all takes 15 Seconds. fetchall() takes 10 to 15 sec...

Reading a 100k row csv takes me 400ms: ```py
import pandas as pd
import numpy as np

Dataframe creation

df = pd.DataFrame({
'int_col1': np.random.randint(1, 1000, 100000),
'int_col2': np.random.randint(1, 1000, 100000),
'int_col3': np.random.randint(1, 1000, 100000),
'int_col4': np.random.randint(1, 1000, 100000),
'int_col5': np.random.randint(1, 1000, 100000),
'str_col1': np.random.choice(['A', 'B', 'C', 'D', 'E'], 100000),
'str_col2': np.random.choice(['F', 'G', 'H', 'I', 'J'], 100000),
'str_col3': np.random.choice(['K', 'L', 'M', 'N', 'O'], 100000),
'str_col4': np.random.choice(['P', 'Q', 'R', 'S', 'T'], 100000),
'str_col5': np.random.choice(['U', 'V', 'W', 'X', 'Y'], 100000),
'date_col1': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col2': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col3': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col4': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
'date_col5': pd.date_range(start='1/1/2020', periods=100000, freq='H'),
})

df.to_csv("mytest.csv")

%timeit pd.read_csv("mytest.csv")

df2 = pd.read_csv("mytest.csv")
print(len(df2))
display(df2)

marsh otter
coral wasp
#

You asked about "converting the query results to native python type". Converting from CSV should be even slower than converting a database result.

shut tiger
narrow shore
shut tiger
#

(it's not pandas)

hollow oar
#

does snowflake not have a dedicated command for exporting data to a flat file which you can pull much quicker?
in my expereience using these commands is much faster than whatever fetch_all is doing.

(also yes - there is no need to crosspost)

brazen charm
#

typically you just pull straight from the object store

torn sphinx
#

how would I handle multiple updates to the db for something like likes in sqlite? The race condition would make it kind of tricky to aggregate the requests

brazen charm
#

a lock

#

although sqlite already holds a lock on writers so it should be fine

waxen finch
#

speaking of sqlite locks, im getting some really poor performance when running concurrent selects with this benchmark using sqlite 3.43.0
https://github.com/thegamecracks/asqlite-bench ```py

py -m asqlite_bench queries\2-selects.toml --connections 10
Finished in 7.080s
py -m asqlite_bench queries\2-selects.toml --connections 1
Finished in 0.513s``` for context, asqlite runs each connection in its own thread and implicitly sets conn.isolation_level = None and PRAGMA journal_mode=WAL

looking at the profiling results, most of the time is spent in Cursor.fetchall() so i believe it is dealing with locks, but i dont know why running ten concurrent selects would make it perform worse

brazen charm
#

what happens if you try open the DB in read only mode?

#

you can also try setting nolock to 1

#

Or immutable to 1

waxen finch
#

oh i forgot to manually set WAL too, that made ?nolock=1 raise OperationalError: unable to open database file while the other two had no issue

#

so i guess that means my assumption was wrong, and something in the pysqlite wrapper is blocking...

trail current
#
cursor.execute('''
    CREATE TABLE IF NOT EXISTS enlistments (
        user_id INTEGER PRIMARY KEY,
        discord_name TEXT,
        application_name TEXT,
        division TEXT,
        rank TEXT,
        gamertag TEXT
    )
''')

I have already got my table created, how would I go about adding an additional row to the table once it has been created or is this not possible?

pure mortar
#

this also leads to increased compute and/or cost/snowflake bills btw

#

how do i know? we have dbas over here that have lived in sql server world for so long but now need to do things in snowflake

#

the result?

brazen charm
#

The cost of using a datalake for data which could just work on a regular machine

fallen vault
#

Whats a good way to scrub user input to prevent sql injection for a desktop app?

storm mauve
#

usually the library will take care of that for you as long as you specify the parameters separately from the query itself
(in which case the query itself should be hardcoded, and the user should only provide the parameters)

#

!sql-fstring

delicate fieldBOT
#
SQL & f-strings

Don't use f-strings (f"") or other forms of "string interpolation" (%, +, .format) to inject data into a SQL query. It is an endless source of bugs and syntax errors. Additionally, in user-facing applications, it presents a major security risk via SQL injection.

Your database library should support "query parameters". A query parameter is a placeholder that you put in the SQL query. When the query is executed, you provide data to the database library, and the library inserts the data into the query for you, safely.

For example, the sqlite3 package supports using ? as a placeholder:

query = "SELECT * FROM stocks WHERE symbol = ?;"
params = ("RHAT",)
db.execute(query, params)

Note: Different database libraries support different placeholder styles, e.g. %s and $1. Consult your library's documentation for details.

See Also
Python sqlite3 docs - How to use placeholders to bind values in SQL queries
PEP-249 - A specification of how database libraries in Python should work

fallen vault
#

Yeah im using cur.execute(sql, (userdata,)) where sql is something like '''SELECT * FROM table WHERE column=?'''

storm mauve
#

quite much no risk of injection in that case

fallen vault
#

Alright, thank you. All i gotta do is check the data against the data tables then.

paper flower
waxen finch
paper flower
#

So it's likely a sqlite issue? pithink

waxen finch
torn sphinx
#

Rather than storing the row in a single column and trying to access it

waxen finch
torn sphinx
#

VS

#

The two tables linked together and putting all the text into one column

waxen finch
#

as in a union of every column from every dataset you might store?

torn sphinx
#

Indeed

waxen finch
#

hmm i would guess that might lead to inefficient queries if every row has to store a bunch of NULL columns

paper flower
#

Ah, nvm 😅

waxen finch
#

oh yeah i did test SELECT 1 instead of selecting 1000 single-column rows from a table, and more connections marginally improved the performance rather than degrading it

paper flower
#

select 1 wouldn't do disk IO though 🤔

#

in theory

torn sphinx
waxen finch
# paper flower `select 1` wouldn't do disk IO though 🤔

here were the results for that: py -c 1: Finished in 0.160s -c 10: Finished in 0.117s i assumed this CTE alone wouldnt have done any IO either, but it performed similarly to creating the table beforehand: ```py

query = """

WITH RECURSIVE generate_series(value) AS (

SELECT 1

UNION ALL

SELECT value+1 FROM generate_series

WHERE value+1<=1000

)

SELECT value FROM generate_series

"""

repeat = 1000

-c 1: Finished in 0.690s
-c 10: Finished in 6.925s

waxen finch
torn sphinx
paper flower
waxen finch
torn sphinx
waxen finch
paper flower
#

Runtime seems to scale fairly linearly for me with the amount of connections

#

Makes me think it's probably cpu bound

#

Adding a lock in there makes queries run faster though 🙂

#
async def run_query(iterations: int, lock: AbstractAsyncContextManager[None]):
    stmt = select(Model).where(Model.c > 1000, Model.c < 2000).limit(100)
    async with async_session_factory() as session:
        for _ in range(iterations):
            async with lock:
                await session.execute(stmt)


async def main() -> None:
    iterations = 100
    start = time.perf_counter()
    lock = asyncio.Semaphore(1)
    async with asyncio.TaskGroup() as tg:
        for _ in range(100):
            tg.create_task(run_query(iterations=iterations, lock=lock))
    print(time.perf_counter() - start)

About 16s without lock and 7s with lock

waxen finch
# torn sphinx What's considered a really complicated query?

i dont think there's a definitive answer, but i'd consider a query too complicated if i need to write some verbose query like: sql -- Select all data from xyz 2021-2023 datasets with a population greater than 500 SELECT population FROM csv_xyz_2021 WHERE population > 500 UNION ALL SELECT population FROM csv_xyz_2022 WHERE population > 500 UNION ALL SELECT population FROM csv_xyz_2023 WHERE population > 500; or use esoteric features for something that should be simple