#databases
1 messages · Page 38 of 1
If you have a web dev project, the Django ORM provides a completely different approach that is very popular.
Ill keep this in mind when I build the dashboard for my bot thanks
Best orm in all the lands
how do i make it so that when i change the name of a class on 1 file it changes the name on all other files?
What editor are you using? (I don't think this is #databases related)
Is this #databases related? You might try asking in #data-science-and-ml
so i have an sqlalchemy orm like this
in project/settings.py:
class BaseDatabase(AsyncAttrs, DeclarativeBase):
pass
in posts/models.py:
from sqlalchemy import String, Text
from sqlalchemy.orm import Mapped, mapped_column
from starlette_project.settings import BaseDatabase
class Post(BaseDatabase):
__tablename__ = "post"
id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str] = mapped_column(String(50))
body: Mapped[str] = mapped_column(Text)
but BaseDatabase.metadata.tables is empty
am i missing something?
You have to execute module containing your Post model
Usually via import 🤔
ah
makes sense
gotten lazy using django orm :p
I haven't used it, but Piccolo is a standalone ORM that is Django ORM-like and has support for starlette. https://github.com/piccolo-orm/piccolo/
don't remember seeing that around
I'll check it out, thanks!
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
anyone here know how to use pandas
You should just ask your actual question. #data-science-and-ml might be a better fit, or you can use a help channel #❓|how-to-get-help
trying to learn sqlqlchemy with alembic
i wan to open a session, how do i get the engine from alembic?
or do i have to make an engine for my uses?
meaning, alembic is creating an engine to do migration, right?
do i get that for my own uses? or make a different one?
u usually make a different one
alembic is mostly just for migration standalone scripts
fair enough
thanks
I'd love some feedback on loading a SQLAlchemy relationship. I'm basically doing this:
purchase = await purchase_repository.get_by_id(purchase_id, get_store=True)
return dict_from_schema(purchase, schemas.Purchase)
where Purchase has
class Purchase(Base):
// other fields
store_id: Mapped[int] = mapped_column(ForeignKey("stores.id"))
store: Mapped["Store"] = relationship(back_populates="purchases")
Because it has the linked store, but the store isn't accessed, it's not loaded and I get
Error extracting attribute: MissingGreenlet: greenlet_spawn has not been called; can't call await_only() here. Was IO attempted in an unexpected place? (Background on this error at: https://sqlalche.me/e/20/xd2s)
So I changed my query to do this:
statement = select(Purchase).filter(Purchase.id == id).limit(1)
if get_store:
statement.options(joinedload(Purchase.store))
I thought by making it a joinedload, I'd solve the problem? What have I missed?
So it's worth noting, the model is using AsyncAttrs, and if I add await purchase.awaitable_attrs.store before the return, it works since the data is loaded. Which makes it seem like the eager load doesn't work?
Of course, a SUPER dumb mistake
In the if block, I never stored the updated statement back into statement
I got an integrity error, which was expected:
sqlalchemy.exc.IntegrityError: (sqlalchemy.dialects.postgresql.asyncpg.IntegrityError) <class 'asyncpg.exceptions.ForeignKeyViolationError'>: insert or update on table "purchases" violates foreign key constraint "purchases_item_id_fkey"
DETAIL: Key (item_id)=(5) is not present in table "items".
Does anyone know how I can pull that "DETAIL" string? I see that IntegrityError has a detail value, but every time I print it, it comes back as empty.
Best I can find so far is `str(e.orig), which I then need to parse via regex
Hi are there any MS SQL resources in this channel?
How can I query multiple tables parallely
Databases support concurrent reads. Are you running into a particular error?
When connecting to databases with Python, is there a recommended way of connection within a function. An example below is how I currently think it should be done, but wondering if there's a way to take advantage of context manager's or some other way of doing it that's better.
def connect_to_snowflake():
conn = snowflake.connector.connect(
user=USER,
password=PASSWORD,
account=ACCOUNT)
return conn
conn = connect_to_snowflake()
cur = conn.cursor()
cur.execute("SELECT * FROM TABLE")
cur.close()
conn.close()```
A context manager is a good idea, here's how you could create one using the contextlib.
from contextlib import contextmanager
@contextmanager
def snowflake_cursor():
conn = connect_to_snowflake()
cur = conn.cursor()
try:
yield cur
finally:
cur.close()
conn.close()
with snowflake_cursor() as cur:
cur.execute("SELECT * FROM TABLE")
Thank you. I think that's what I need. Do you mind explaining what exactly the yield is doing? I'm not familiar with the contextmanager decorator or yield keyword
I'll try my best! 🙂 A function that uses yield instead of return, returns a generator. A generator can be iterated over. yield specifically returns an instance of the iteration. When with snowflake_cursor() as cur is called it loops over the first iteration of the generator. When the block is closed it also reaches the end of the iterator, because there is only one value. Then the finally block is executed.
Ah okay. I think I understand. So with a return it would exit the function, whereas with yield is sort of pausing the function, so once the with statement is finished, the function will resume and close the connection within the finally block?
I think that's fair to say!
I'm starting tests for a FastAPI project. With Postgres, should I create a separate db for the tests, or a schema?
im making a bot to pull blockchain data and predict rugpulls so im wondering if i should just pay for an api to store my database or go through the trouble of using flask
I think a separate database is easier, a little more separation and if you ever decide to use multiple schemas in your app it's less confusing. There's no downside to creating a new database vs a schema that I can think of.
Thanks. I'm still trying to understand where a schema comes in handy.
hi
With SQLAlchemy, anyone know if I get two db objects and modify them both, is there a way to only update one of them? If I do a flush/commit, both are sent.
If you don't need to modify the second one you can make a copy that's detached from the db which you can modify https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.make_transient_to_detached
Thanks
Damn
if youre working with fastapi you might want to consider working with sqlmodel instead. It mixes sqlalchemy and pydantic very smoothly
I looked at it. It's nice, but still a little feature light. Or I haven't figured out how to implent some stuff not explicitly mentioned in the docs.
Honestly, I'm considered dropping using ORMs entirely.
Afaik sqlmodel supports all features sqlalchemy supports as its an extension
Can be a little tricky building complex model structure with lots of relationships ill give you that
Yah, but the syntax is a bit different, obviously to work with both. So it's more a question of how to do it over can it do it.
There are some things that are not compatible/just dont work - look at the issues on github, also it generally promotes a bad practice of mixing your API schemas with db models
Unsure why that would be a bad practice
Im not particularly fan of sqlmodel, but the whole point of it is to merge API schemas and db models
what percent of android phones globally run on android 10 and lower vs android 10 and higher
not just the us and eu
saw doc here and forgot which server it was lol
I'm relatively new to python, is there a reason why you are redefining the variables?
ex: user=USER
not redefining, they are just passing the values
in this case USER will typically be a string constant, this value is passed to the function
ok
That is considered bad practice 🤷♂️ you're mixing API and Data layers
Yes, why?
I am reading up on some tools to implement some data quality checks and decided to give dbt a try. Is there some way to implement dbt testing in an existing database/projet?
So far it's all about creating a new project
Dbt isn't specifically around testing, it's around structuring your data pipeline in a particular way, which includes testing
yeah that's what i thought
maybe great expectations or soda would be more appropriate for the task
hey guys, so im currently working on a discord bot with some challenges.
bot overview:
tracks message activity across different "lounges" in a large discord server
currently handling around 700k members
tracks user message counts in real-time
needs to handle frequent writes (every second-ish)
current setup:
using json files for data storage
periodically saving message counts
im concerned about file i/o performance and potential data corruption
my main questions:
is json still a viable option for this scale?
would sqlite be a better alternative?
are there any performance concerns i should keep in mind with 700k members?
i've already added some optimizations like atomic writes and error handling, but i want to make sure i'm using the best possible approach
SQLite was designed for your use case. Would it be a lot of work to switch?
yeah it would, i’ve already fully implemented the json files
i also have to migrate the data
for some of the files
and completely rework my database file
a JSON file and SQLite do share the property that neither can safely perform parallel writes to the file/database, however, while you had to implement your own locks to ensure you didn't corrupt your file, SQLite has its own locking mechanisms built-in, and additionally can distinguish between read and write transactions to allow concurrent access, including reading while writing if write-ahead logging is used, aka WAL mode
im concerned about file i/o performance and potential data corruption
that is indeed an issue, there are benefits over performing full reads/writes to a JSON file in that SQLite does not need to read the entire database to retrieve information, nor does it need to re-write the entire database to change a value, though there are some caveats to that last point:
- by default, SQLite uses a "rollback journal" to maintain durability, which essentially means that every write transaction requires SQLite to copy parts of the database as a backup before modifying it, in case the program/system crashes during the write (see also Atomic Commit In SQLite)
- if you switch to WAL mode, writes are appended to a separate file instead of being written to the database directly, which is very fast for writing, but worsens read performance over time until a "checkpoint" is performed to move data from the WAL file to the database (see Performance Considerations)
- SQLite automatically checkpoints when the last connection closes, or during COMMITs if the WAL is large enough, at the cost of a writer taking a bit longer to commit
all of this info above is assuming you're going to use SQLite with multiple connections - others may suggest to use one connection for your entire application which, from my understanding, is fast because SQLite won't have to contend with transactions from other connections, but now you're back to JSON because you're now responsible for locking the connection every time you need to use it for reading/writing, so i'm not a fan of it (FWIW this was how i used SQLite for my first discord bot, but my newer ones have moved towards simply starting connections on demand and letting SQLite handle locking)
and last thing, for an async application, you probably don't want to use the standard library's sqlite3 module since it will block whenever it needs to wait on a pending transaction; it's possible to carefully workaround that, but it is easier to use an async wrapper which handles SQLite in a separate thread, like aiosqlite or asqlite
(the latter was written by the same author as discord.py and is my personal preference because i like its default configuration; it sets WAL for you, enables foreign key checks, and disables implicit transactions so you're always managing it)
SQLite does support concurrent writes, but at one write/second that's not really a big deal in this case.
There are JSON parsing functions in SQLite, so migrating might not be too painful.
writing requires a lock on the database (specifically EXCLUSIVE in rollback journal meaning readers must all stop beforehand, and... some other kind in WAL that doesn't block readers), so if you have multiple connnections attempting to write, they will queue up and complete writes one by one
if you say, made an API request while inside an write transaction, another writer can't butt in front of the line to commit their own changes, so that's what im referring to
(technically you can start a transaction but not actually acquire a lock until executing your first query, but im ignoring that)
Sure, but that's not fair to compare the disk level constraints of SQLite to the filesystem level constraints of JSON.
This comparison makes sense in the context of a client-server DBMS for example.
i do agree that one write/second is basically nothing compared to what SQLite can handle, or even just reading/writing to a flat file, so performance shouldn't be that important of a consideration
since OP was interested in performance though, i think comparing their writing capabilities is fair, as they're both just single files(-ish) on local disks
Fair enough. I would just note that SQLite is much faster than writing directly to a flat file, and if OP wants concurrency then there are relatively simple multi-thread solutions available for SQLite.
hey, thanks for this! ive switched to sqlite and im utilising WAL mode as well
thank you too @tepid basalt
Nice
guys any resources to learn SQL
Does making a column UNIQUE NOT NULL automatically make it a primary key (MySQL)? I didn't add a primary key explicitly
hey guys, could use some implementation advice for my discord bot. i'm currently storing user interaction stats (hugs, kisses, etc) in a json file that's around 2mb with a structure like {user_id: {"hug": 5, "kiss": 2}}. my concern is i/o performance, every time someone uses a social command, the bot needs to read the entire json, update a counter, then write the whole thing back to disk
i've already implemented sqlite for my lounge activity tracking (tracks every message in the lounge channels) and that's working great with proper indexing. wondering if it's worth migrating these interaction stats to sqlite too, even though the file size is relatively small? the server i'm implementing this to has 700k members, but the interaction commands are mainly reserved for donors
what's bugging me is the read/write cycle for json feels inefficient, especially when sqlite would just update the specific row. but is 2mb small enough that the migration effort outweighs the benefits? i know sqlite would be more "correct" for this use case, but im trying to be pragmatic here. has anyone done similar migrations or have thoughts on the json vs sqlite tradeoff at this scale? would appreciate any insights, thanks!
It does not make it a primary key automatically, it shares similar properties with the primary key but it is preferable to use a primary key in case you want to join with other tables as the db considers it a unique identifier for the row
You're right. This query would no return any results. The HAVING clause is equivalent to this WHERE clause:
WHERE salary > salary
Of course there are no rows where the salary is greater than itself.
Ty for the detailed answer
Is anyone here really good at using sql, pandas and python? I’m really struggling with doing etl stuff. I would post here but the files are too large
Hellow guys and gals. My 3 years old flask-sqalch-marshmallow backend is growing unmaintainable. Specific point of pain is marshmallow schemas. I have a lot of relations and tables, and as a result, by the time staff gets serialized for GET requests, a top level schema has few field.Nested members serialized, each with their own field.Nested, and the tree goes up to 5 nested deep, with some schemas in that member tree having their own inheritance tree. The mix of nesting and inheritance makes it hard to tune things for api views. I end up having to drill request params with context.args through the tree, or write out custom schemas for business requirement, just to tune output a little bit. This starts to look ugly, as some deeply nested children in their schema.py definition file have 6 different versions for different goals, even though some are very similar, and most are used only once.
How do I scale and future-proof that shit: Rewrite the thing with mixins (How do i then pass customization params down the tree)? Dump complete model from every model at every layer, and filter out garbage with only= or final, pre-response processing?
Related question: do you know any open source mature projects using marshmallow that I can learn from?
Kind regards ☕
json is used for?
it allows you to store structured information (like settings, user profiles, or application data) as plain text files
a lot of lightweight applications use json
Probably most frequently used as a standard format to pass data around. Think json API
o
Hello! Anybody need help in SQL
IMO that's more or less "classical" problem when you have deeply nested relationships (which is really fine on it's own) and need to serialize them. As you pointed out - mixins could be a way to remediate that somewhat, I'd also recommend trying GraphQL 🤔
What are you running into?
hey i'm using SQLite (learning it) and i'm wondering what are the limit of :memory: storage ? can i really store a lo ton it ?
Depends how much memory (RAM) you have. But just to be sure, you understand all the data is gone once you close the connection right?
absolutely, it is the objective
i mean let's suppose i could accept to allocate up to 4G of ram to my app, supposing the program would run on 1G how much would it left me in storage ?
let's say if i wanna load multiple discussion from a server...
I haven't pushed it that far myself. Apparently it will spill over to disk, so >3GB I suppose. https://stackoverflow.com/a/11061407/11080806
I wonder if DuckDB might be better for your use case?
i do not know, i'm gonna stick with SQLite for now, and i'll see
but what i was wondering is like how muc data that represent
not really i'm making a chat app
basicly
and i would like to store the data exchanged on the server and have them temporaly on the pc when loaded, i wonder how much it could hold up, as i could delete data when new are comming and the defined cache is full but i don't want this to be too recurent or limitating.
i want to be hable to store hundred of msg and they're metadata easily
I see, with gigabytes you can store millions
yeah, most data is pretty small
Does anyone have any recommendations for databases for write heavy workloads (lots of small writes to relatively few values) -- I want to maximize the throughput on a fairly low-end machine.
use WAL mode on SQlite
What is lots? <50,000/s? Then SQLite as barney said 🙂
Interesting, I'll test WAL.
As many as a 10-30$/month VPS can handle.
At that cost you can run something that can handle concurrent writes, like MySQL. But I'm not sure if it would be faster. SQLite is just so much less work to set up and maintain that I would start there.
Yeah, try it 🤔 Personally I mostly use postgres, I think you'd have to batch your writes if you really want to maximize throughput
batching writes is not possible here (outside of writing my own write-ahead log)
Could you explain why not? Also is data important? e.g. If a some parts of your writes would be missing would that be ok?
I'm working on a longer-term agent competition thing of sorts, and each agent only picks an action when it receives feedback on the previous one.
and each action has to write to the DB
it's not going to be too complex of an environment, so I am willing to sacrifice some comfort for cost savings
so agents pick up jobs, but what exactly do they write to the db?
an agent can say things like "pick up ball", "move left", "switch this valve" etc. And they receive the environment state/the change in the environment this caused.
normally, this would be entirely in-memory, maybe dumped to a flatfile periodically, but here I'd like it to be properly in a DB.
If you're not receiving literally thousands of rps it would probably be fine
Since it doesn't seem that would be a limiting factor
In practice, the limiting factor will probably be networking, a gigabit connection can't bring in enough data to saturate an SSD, but it is probably somewhat important to check that I'm not going to hit like two second pauses on 99.9% latencies
Again, I don't know what exactly you're doing so it's hard to judge what exactly may be the bottleneck in your architecture 😅
Generally dbs are relatively slow to write to, but I think sqlite may be better, also if you could use a stream-like structure (kafka/redis streams) for a job queue-like structure - redis streams can probably handle ~10k rps or more
There isn't a lot to it, receive request - read O(1) data from db - run logic (I can get this more or less arbitrarily fast, it's fairly basic number crunching) - write O(1) data to db.
and agents will be running roughly 24/7, so traffic should be pretty flat.
I don't think reads are O(1), but it depends
O(1) data as in a constant amount data
I thoughs you're talking about time
yea, but it's important that read size doesn't scale with rps/number of agents
tho it kind of isn't IG, it's still probably just going to be scylladb
You're right, SQLite makes some really interesting tradeoffs to favour writes. No type enforcement for example. A big one is that it doesn't allocate fixed column storage, much less writing overhead.
I wonder if a simple key-value db would be right for your use case. This is out of my wheelhouse though. A quick search reveals some interesting embedded options.
yeah, I was looking at LMDB and rocksdb as well
sqlite gets 0.4s worst case latency on an SSD, and processes 100M commands in ~73s (using the C API directly). So the only angle for improvement is the tail latency, which embedded key-value stores should do well (tho they do also need to do compaction).
hm, lmdb really sucks here, or I am using it wrong.
oh, these numbers are fake, I was doing all the commands in one transaction
1.2s worst case latency, 10M commands in 14m. Let's see if LMDB can be meaningfuly better.
no, 11.5m, but I am also probably just measuring the random write performance of my SSD to some extent
Go's badgerDB did it in 2m
I'll verify the results tomorrow, but it looks like I'll be writing some Go
Interesting description of db
very Funny Vid (:
Badger badger Badger , MUSHROOM MUSHROOM : D
hmmmm, cute. By default Badger can be embedded into my app binary file, no extra processes are needed
That's very powerful and interesting for me
may be a thing i needed to cache some data / unload some of my data from RAM
Hey i'm making my first database using SQLite for an chat app with account, i wanted to know if anyone had any tips about what is important to store from the user, i have the usual credential and id but after that i was wondering about other data i could need in the future, and same for my channel Table.
I would add created_at and updated_at fields. Along with created_by and updated_by. Maybe a last_login_at field. Beyond that just add new fields as you need them.
Also i plan on storing the message each user send to each other. Now there are two posibilities :
First i make one tab per user and store all his message... on his table, or i focus on a channel based table and i make a table per channel. I tend to prefer the channel based but the user based storing message certainly need to query from other user when reading a channel but would make less tables overall...
yea ok ty
do you also have any good tutorial for SQLite, bc the one from python feels lacking of a lot of information for someone that have never done any SQL and other website that i have found so far do not take the time to explain every step carefully as if i just wanted to do their exemple and not making my own.
Free code camp is good: https://www.freecodecamp.org/news/sql-and-databases-full-course/
nice, thanks
Neither of these options are ideal. One table per channel means multiple tables to get a user's messages and same problem if you have one table per user. Creating multiple tables to store the same concept also means a lot of code duplication. If you decide your messages require a new attribute, you don't want to have to update multiple tables. Instead create one table for all messages.
?
It's not a question of performance but ease of development.
ok
hey i have another question, what are the uses of primary key for SQLite ? apparently rowid is way more efficient than making an int primary key, so i'll use it for the unique identification of the accounts, but then should i use use primary key like username and password as primary key ?
A primary key is the unique identifier of a record, used in other tables to define a relationship between both tables.
what does that mean
That would be a foreign key
If you have an author table and a book table. Your book table might have an author_id field, this field has the value of the primary key of that book's author.
Thank, I'll clarify.
I recommend going through the course above. It will explain this more clearly than I am.
A primary key is a unique value, this can be a single column or multiple columns, that identifies a record in a table.
And no, I suggest not using username or password as the primary key. Instead stick with the sqlite default.
apparently rowid is way more efficient than making an int primary key
If your table has anINTEGER PRIMARY KEYcolumn, it will be an alias to rowid
https://sqlite.org/rowidtable.html
I always create an id field instead of having to refer to rowid directly. Also for more consistency between other relational dbs.
yeah, it can be potentially puzzling if you're looking at a table structure and it's not clear how a column is uniquely identified
also, when you do SELECT * FROM table;, rowid is excluded (generally SELECT * is not a great idea for production code, but it's kinda useful for inspection/debugging)
Primary keys are kinda confusing as a concept. At least in PostgreSQL, you don't have to reference a primary key in a foreign key constraint, it just needs a unique constraint.
From what I've gathered:
- a primary key cannot contain nullable columns
- a table cannot have more than one primary key
https://www.postgresql.org/docs/current/ddl-constraints.html
A table can have at most one primary key. (There can be any number of unique and not-null constraints, which are functionally almost the same thing, but only one can be identified as the primary key.) Relational database theory dictates that every table must have a primary key. This rule is not enforced by PostgreSQL, but it is usually best to follow it.
Primary keys are useful both for documentation purposes and for client applications. For example, a GUI application that allows modifying row values probably needs to know the primary key of a table to be able to identify rows uniquely. There are also various ways in which the database system makes use of a primary key if one has been declared; for example, the primary key defines the default target column(s) for foreign keys referencing its table.
I can't think of a good reason to reference something other than a primary key.
This rule is not enforced by PostgreSQL, but it is usually best to follow it.
+2
I work with a db where a lot of tables are missing PKs and it is a hassle
what I'm saying is, I'm not quite sure what the difference between just a unique non-null key and a primary key is
maybe it's just about semantics
Practically nothing, it's a theory thing as the docs you quoted mention.
Many tables might have unique non-null fields eligible to be a primary key and used for foreign key constraints. But it is confusing if relationships are defined using different fields.
Primary key ideally should never change, because otherwise you'd have to change allt the related foreign keys otherwise. It's much easier to make an integer/uuid primary key and make your user's username unique
yea it is what i planned first , but i was questionning the meaning of primary key. And also sinice rowid is so efficient i was wondering about using it as my uuid.
Technically no, indexing is different for both as the primary key is used to determine the store the rows sequentially in disk, if it's a column that requires frequent updates, use a unique key, if searches are more likely on that column, use a primary key
You can't reference rowid from a foreign key
The parent key is the column or set of columns in the parent table that the foreign key constraint refers to. This is normally, but not always, the primary key of the parent table. The parent key must be a named column or columns in the parent table, not the rowid.
This is true for SQLite, but not necessarily for other databases. Postgres for example has no fixed order.
That sounds like something that depends a lot on the specific DBMS
Ahh i see, I assumed since most implementations use physical order
PK does not determine disk location, unless it's a clustered index
I recently started reading the PostgreSQL 14 Internals book, it seems like storage is quite complicated in postgres
Oh cool, any interesting tidbits?
PostgreSQL assigns an ID for each transaction, but it only uses 32 bits (in storage) for that, so it has a mechanism to prevent wraparound
wow, it's free? 👀
Since it's linked on postgerspro.com I'll leave the link here too: https://edu.postgrespro.com/postgresql_internals-14_en.pdf
I've been reading this, it's quite useful too
oh yea ok
quick question about SQLite again, i want to have a row so for each user i can write each channel they belong to, but how could i for exemple when a channel is created and the user name is in this channel automaticly link it to the user page in question and also have multiple channel linked to one user. I would like to know if it's possible in SQLite or i do that in python even tho i would prefer it linked
To be clear, a channel has many users and a user can be in many channels. This is called a many-to-many relationship. It requires a third table that has a foreign key to both the user table and the channel table.
oh ok
what would you guys recommend for data quality checking tools?
i guess it is the same for my received friend request and sended friend request
Yep, you got it 😄
ok nice, just quick question how could i implement multiple value into one case only, like having multiple channel into the channel row of one user ?
is there something already exsiting into SQLite
Integration of tens of a model with a back end of ask and with integration of react native mobile app is quite challenging
There's no way to do this easily, the best way is definitely separate tables. But you could do something like a json field, and store an array in it.
yea but it is not scalable easily so
or i should dedicate the database for message and json to store account information
or i guess i could organise in an other manner my database
The simplest thing to do is stick to relational database best practices. Once you start hitting scaling or performance issues, that's when you can look at designing for those. Trying to do that before you run into those issues, will certainly be more work.
all is possible. part of basic SQL learning to get hang off. Sqlite3 is completely versatile to represent all this data in all the ways
The only reason to hesitate picking or not picking SQL/Sqlite3 can be a question if u can afford also preferably getting ORM in (in python world it is SQLAlchemy with Alembic, or Django ORM, or some other ones)
default SQL is meh, ORMs rule
if no ORMS, for simple apps i would prefer storing my data structs in nosql instead
ORMs are good. But learn SQL first.
yeah... u have to learn raw SQL in order to make sense of ORMs.
ORMs are very leaky abstraction, that just add benefits of more integrating your current langauge to SQL
oh ok, can i just have the emaning of ORM ? and yea i'm into learning sql rn just i'm going through problem and try to solve them efficiently
what does it mean, "emaning of ORM" ?
meaning of ORM is it an abreviation ?
Object Relational Mappers, like like those https://www.sqlalchemy.org/
https://github.com/vinta/awesome-python?tab=readme-ov-file#orm
Full list of popular ORMs in python world
ty
they did not mention https://sqlmodel.tiangolo.com/ , could have been mentioned 😏
Interesting project to pick for type friendliness and out of the box data structs proper present for ORM
ok thank you for all this
Guys does my ERD (entity relationship diagram) look ok? I'm very new to db design
For context I'm trying to create a basic role-based access control
Missing to see some kind of connections leading from Rules to Users/Groups
Role can be what u desire to be. Consider an option best for you and pick 😏
i would imagine though basing RBAC on example of postgres for example
in this situation we could be wishing asigning Roles to users
Many to Many relationship i can imagine. Many users, can be in many roles i guess.
Well yeah I'm planning to use Postgres
Made some changes
@wise goblet I just want to know if it looks right lol
Like in terms of valid ERD design
Because this is literally my first time creating one
did u even write this one your own?
Your last all iterations are meaningless for some reason
And I gotta show it to my peers
How are they meaningless?
And yes I did write it on my own
i don't see Foreign Keys defining relationships in your last iterations
rows like FK1 group_id NOT NULL that u have in other tables
What should it rather be?
The Group-User table for example has FK1 group_id NOT NULL for its relationship with Group
i can recommend having basic SQL education with manga guide
or head first series for a bit smth more serious, yet very brain friendly too
i am not comfortable spoon feeding primitive answers. Not good thing to do for both parties
My man, I'm not sure if I have to read 55 pages worth just to know what you mean lol. I don't want you to spoonfeed me just tell me what you mean by "Your last all iterations are meaningless for some reason"
With SQLAlchemy, if I do something like
connection.execute(text("SELECT * FROM items"))
can I get the column names associated with each result set? Best I can see, the answer is no?
55 pages sounds like punishment bro
🥲
Maybe he had good intentions bro, but maybe he type to fast so came out little rude 🤔
My advice is it depends how complex of a permission management system you want.
It is easier to first know the requirements and then do the ERD.
why is 55 pages punishment
i guess it depends on how much content there is on each page but
when it comes to an average amount of material per page for a textbook, 55 pages doesn't seem too bad
Also the diagram is not a proper ERD. The tables do not show the relationship type, it is just association arrow from UML that you have used.
@deft apex You don't need 55 pages to learn the normalisation rules. And, mostly up to 4NF is fine. That book is teaching more than this person needs to know right now.
Take baby 👶 steps, it is better
🙂
oh nevermind, my bad i spoke without context
yeah normalization rules are pretty chill
i don't know them but i'm aware that they're a thing and i'm aware of their comprehensiveness
Is it necessary to show all the relationship types?
Yes, otherwise we don't know what they are.
Hot take: SQLite is more than enough. I will die on this hill.
Maybe, depends on the requirements you have.
I mean, It won’t host a billion dollar company but it’s enough for 80 percent of people.
It only slows down when you have a couple hundred thousand records to query through
Not really hundred thousand records is actually peanuts for modern hardware
Sqlite is usually limited when you need to write. Especially with write heavy applications.
SQLite is indeed excellent, tho being dynamically typed is a bit concerning.
It also has few other weird quirks as to how it implements SQL
But IIRC you can switch it to strict mode.
Also, SQLite doesn't fare all that well when you have multiple distinct machines handling requests.
I'd go as far as to say 95% of websites.
I tried using mongodb once and I don’t know if I liked it or not
Though it was my first experience with databases or anything close to that so maybe I was just awful at using it
Maybe MongoDB is an absolute legend with strength I am not blessed to handle
If you are comparing mongodb to sqlite, then they are for 2 completely different use cases.
One is relational other is not
But postgreySQL is more powerful then any database
postgres covers most database usecases, but it would be inadequate e.g. for discord messages
honestly it's probably fine for a decently long while
Is table bloat still an issue for PG(high'ish frequency updates)? I recall hearing about some mitigations some years ago, but haven't been keeping up.
google bots says yes
Hello guys, how are y'all doing? I have a question about SupaBase in Gemini API 2.0 Flash. I want to integrate the AI with SupaBase using MCP, but I'm not sure which method is efficient and optimized in terms of performance. Because I want to use it on our website that will be available for a company. Is it a good idea? Or if there is a way that is easy to do, please let me know.
Thank you a lot, and I wish you all a good life!
probably better to @ a postgres mantainer
Casual interest. The bot's conclusion looks accurate enough. HOT was the feature that I was recalling, and it would appear to, still, provide a fairly strong bloat counter for common UPDATE cases such as counter increments/decrements.
hello guys i just started working with flask and currently im learning sql with python.So can u all suggest some courses or books
to learn sql i think basic w3 school is fine. but for more indepth check this
https://roadmap.sh/sql
Hello, I have a 200 gb csv file, I need to open and take a small part of it, chatgpt suggest using database, what do you advice
I would give DuckDB a shot
if you don't want to put it in a real database you could try to use duckdb or polars lazy api
imo you really shouldn't have a 200gb csv file though, either put it in a database or at least split it into smaller & more compact files
(e.g. a bunch of parquet files with 1 gigabyte each)
Make sure you set the temp_directory
its a data I will be usiing for academic purposes, but they provide it like that
I know R and familiar with python but I am not coder
so with your guidance I will learn the way to make it work
I will be reading some DUCKDB now, you mentioned apis I only know them from gaming 😄 . splitting is good also but the problem is it creates more documents and takes too much space
I will buy a 2tb external waiting some good deal
I am starting ot learn it, but had hard time in python so switched to R, but the funny thing is my csv file used ; semicolon instead of commas and even though I learned how to do it which works in my test.csv now I get errors for the real data which has ; semicolons 😄
You can specify the delimiter: https://duckdb.org/docs/stable/data/csv/overview.html
yeah I tried it but it didnt work dunno 😦
Can you share what you tried please?
thx for helping, ok I tried many things and encountered many problems, atm I am at this part where I use python but due to column names have spaces I get error
Are you able to paste the code?
the error is here https://github.com/duckdb/duckdb/issues/8214
someone had the same issue, my solution is if I can delete column names it will work with default names
allright
That issue was fixed, are you using the latest version?
yes I installed all today
let me show u
I will ping u when I do
duckdb.sql("SELECT 'ORDER NO' FROM 'C:/Users/icewind/Desktop/aaa/bigdata.csv'")
it returns
@tepid basalt but when I look it normally order no has data in it this just replicates
when I try one word column it works
Have you tried to SELECT *?
yes, I want to filter
And did you try specifying the delimiter?
Single quotes are used for string literals use double quotes "
SELECT "ORDER NO" FROM ...
nope it doesnt work I tried it
I wil ltry something else 1 sec
ok this worked thx
duckdb.sql('SELECT "ORDER NO" FROM "C:/Users/icewind/Desktop/aaa/bigdata.csv"')
changed the outside ones to ' and inside ones " as u suggested
👍
SQL-Standard: 'string' and "identifier"
I am new to python normally I use R but duckdb is very confusing in it
I read all documentation in website but one thing confuses me that, it says "To use DuckDB, you must first create a connection to a database. " but I use direct commands to access the csv file in pc without any connection
in R it forces you, thats where I failed no space left since creates a huge temp
Hello World ! I'm trying to scrape a website using an 2 years old tuto and I encounter this problem. Can someone help me please ?
Looks like your first error is an import error. Did you import pandas?
hmm seems I can only use duckdb.sql("SELECT once , tried selecting on selected didnt work
seems website guide is outdates downloaed a 1800 page pdf and I understood now
How do I right test from my project
DuckDB also has a discord
This is because DuckDB creates a default connection to an in memory database
yes I am in it but its not much active 🙂
Ask in the R channel.
But I'm happy to answer any Sql related q's, I don't use R
(It is moderately active on weekdays, but many of the devs are in Europe timezones)
UPDATE: I switched to python from R since its very confusing and very weird. Python was really easier. After tons of reading and trying I become a bit familiar, for today I needed to filter 2 columns and I achieved it by this code:
x1 = duckdb.sql("SELECT * FROM read_csv('bigdata.csv', header= False) WHERE column00='aaa' ")
x2 = x1.filter(" column04='bbb' ")
x2.write_csv(" out.csv ")
edit: forgot to mention in python I have no problem with semicolon
Nice. That's fine (using relations), and that could also be written as copy (select * from 'big data.csv' where ...) to 'out.csv'
how can I modify the existing csv file, I want to add 1 row for example, or should I somehow copy the existing file to python, make modification and save as new file
Modifying is just rewriting, yes. You can do it in a single statement like the one I gave above: copy (select .... from 'myfile.csv') to 'myfile.csv'
how can I delete the first row any tips
Use 'offset 1'
hey guys is this job position for a backend developer or for data engineer ?
Job Duties and Responsibilities:
Writing new services to collect and process data from external vendors and public APIs
Writing internal APIs to assist with the processing of card applications and their ongoing underwriting.
Integrating new research from the data team into our production processes.
Writing ad-hoc scripts in order to bulk-process historical data.
May be required to perform other tasks and duties reasonably related to job responsibilities.
Experience/Knowledge, Skills & Abilities:
Bachelor’s degree in a relevant field required, Graduate degree in relevant field a plus
Expertise working with one or more relational databases systems (e.g. SQL Server/Postgres).
Strong knowledge of Python
Data processing experience and analysis using Pandas/Numpy
Experience working with consumption of external REST/SOAP APIs
Experience in writing new APIs (preferably using Python/Flask)
Sounds more on the data engineer side to me, but there is some overlap between the two
yeah that's what deepseek told me also but it is like data engineer and back end because of creating api's
nice tips
yeah i agree
Many companies don't have a separate title for data engineer. While it's a data centric role, backend engineer means: everything that's not frontend
so according to you this is a backend position or just data engineer
Yah, in other words, backend includes data engineering (in many places)
ohh great thanks for clearing that out
i was a little confuse because i have just a little experience with node js and i want to get this job which title was junior python dev
i also have some experience with python on automating simple stuff
Hello, yes I did…
Can you double check? This import error has to be investigated
Okay Imma do it right now
I'm sure
This is the beginning of the error message
And when I do it with the other link it pass
the link that works is the one of the tuto, but cause I'm interested in MLB I done try with another link.
maybe it's because it is a shtml link
Well, I get: ImportError: Missing optional dependency 'html5lib'. Use pip or conda to install html5lib.
So I installed html5lib, after that I got ImportError: Missing optional dependency 'beautifulsoup4'. Use pip or conda to install beautifulsoup4.
So I installed beautifulsoup4 and after that I get ValueError: No tables found
So I just can’t it is that ?
I don't know. You have to install the html5lib package before you can continue.
Okay so I have installed with the terminal html5lib and bs4 was already on my computer. I still get the same error on jupyter and I try to rewrite the code on sublimetext but now it is pandas that can't be find...
And when I try to import both of them directly on my jupyter file they can't be found
Sorry, I can't be of help any more. I never worked with jupyter. Maybe try a help channel, it's not realy database related.
Okay I was looking for the webscrape channel but I didn’t found it. Thank you have a nice day
Hello, what's the problem?
I'm using "SQL Server Management Studio 20", I'm unable to open the containing folder for my database diagram. It gives a warning that says "Unable to open the folder. It might have been deleted or existing permissions might be insufficient."
I'm trying to send the database diagram file to another person
Are you able to view the folder with file explorer?
In the application, I can but not the windows file explorer
I'm not familiar with windows permissions, sorry. Sounds like SSMS is installed with more permissions than your account has
it should have some permissions set automatically so check in chatgpt how to remove it
Thanks guys
did you fix it
where you guys put scripts that insert hard coded data into dbs?
In my case all data of one specific SQL tabel is hard coded (for now), but I want to use the same API to all persistent data
the matter here is project structure
migrations/
<today>.py # here?
scripts/
db/
data/
that_one_table.py # here?
src/
db/
scripts/
that_one_table.py # here?
I'm using sqlalchemy so wherever it'll be, I'll have to import src.db.* stuff
is there a consensus about this?
I personally put them in migrations
hey i'm making a chat app and i want to make a table that stores for each channel the users that are in, but i wonder if it is a good idea to add a row for each user that is added to a channel. i don't think it is real efficient but idk how to do better. If someone as any idea for it. i'm taking it
Does someone know how to prevent in a TRIGGER function checking columns for a table that does not exist?
I have a trigger for 2 tables and I use TG_TABLE_NAME check to check if it correct name, however the actual check for that table was also executed and because column does not exist on other table it fail to execute query:sql IF (TG_TABLE_NAME = 'api_transfer' AND (SELECT ts.name = 'transferred_to_account' FROM api_transfer_statuses ts WHERE ts.id = NEW.status_id)) THEN Here is NEW - record of table that called in trigger function and status_id exist only on api_transfer table, but not the other, however it still being called and trigger crash exit.
👍
Hey everyone , what do you think about firebase as database?
Hello. Im trying to learn good practices using postgresql(psycopg2) and python. Like folder structure, how to separate different sql statements etc. You guys recommend any good article/video/github project?
Hello, can you share a little more context about your project? Are they DDL statements?
Im trying to get into scraping and i want to create multiple scraping projects. Some with message queue, some without; some with ORM, some without. I want to write scraped data into database and later do some analysis with it. Currently i need for DML statements.
Just write your raw SQL statements in the Python file. Keep it simple.
I'm writing a test to test an event that checks and create the schema on the first ever connection due to async structuring
And I wrote a NOTE comment stating "Pre-runtime tests of schema validation are a migration related matter"
That's because, since it is SQL, the safest runtime check is just to see if the tables already exist and create them if they don't
Is this code comment correct? Is ensuring schemas upon first and posterior runs a matter of migration, thus, needing to tested pre product launch?
And by production launch I also mean maintenance pauses for SaaS
Anyone have experience using the alembic api to use it in a script? I'm trying to get alembic to run on my test database in a pytest fixture:
@pytest.fixture(scope="session", autouse=True)
def setup_database():
database_url = f"postgresql+psycopg://{configs.DATABASE_USER}:{configs.DATABASE_PASSWORD}@{configs.DATABASE_HOST}"
engine: Engine = create_engine(database_url)
alembic_cfg = AlembicConfig()
alembic_cfg.set_main_option("script_location", "/app/src/alembic")
alembic_cfg.set_main_option(
"sqlalchemy.url", f"{database_url}/{configs.DATABASE_DATABASE}_test"
)
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
connection.execute(
text(f"DROP DATABASE IF EXISTS {configs.DATABASE_DATABASE}_test")
)
connection.execute(text(f"CREATE DATABASE {configs.DATABASE_DATABASE}_test"))
alembic_cfg.attributes["connection"] = connection
alembic_command.upgrade(alembic_cfg, "head")
yield
connection.execute(text(f"DROP DATABASE {configs.DATABASE_DATABASE}_test"))
But when I run tests, I get the error that the relation I'm testing against doesn't exist, which seems to indicate alembic never ran? Also, I don't love that this will modify my logging settings...
I also can't find if there's some way to give alembic a different database at the same host to run against (in the cli).
Is anyone good at using flask ?
Given this is a Python discord, you're better off asking the question than asking to ask.
/rant I wish a good DB ORM existed in Python. sqla should've forked for 2.0 and dropped their tech debt, gone all-in on mapping/typing
I create a different engine based on envvars for tests, personally
and then export that engine from my db package
the engine is then imported by env.py and everything just works
Do you have a favourite ORM in another language?
I have issues with all of them in one way or another 😂
I actually really like sqlx from the Rust ecosystem, but it's not actually an orm
I like sqla well enough, but it's needlessly verbose and relationship/mapped_column are poorly documented
because it's insane and does runtime eval() metaprogramming to find types
fair enough haha
the metaprogramming is probably one of the biggest issues I have with sqla because it doesn't play nicely with inheritance which it advises you to use as well
so it feels at odds
does anyone know how to use alembic to "reset" your DB?
just take the base revision and delete everything else?
Sorry, I don't follow?
I have a function for each engine I support, then have an export from my engine.py depending on DB_ENV that was set before execution
so for tests, I run sqlite in memory, localhost I run sqlite-on-disk or postgres, and in prod I run cloud postgres
So you're also doing it via the alembic API? Can you see what I'm doing wrong?
I am, yes
You haven't run into any logging issues? It seems that's a complaint on every post I've seen about using the API. But regardless, I don't know why I can't even get it working.
very important, in your env.py fileConfig make sure you pass disable_existing_loggers=False. I have no idea why it is True by default
Ok, good to know then. And yah, very strange that that's true by default.
But I can't even run an upgrade, with or without the logging issue, that was just an aside.
are you getting an error?
Not from alembic, just the test that's run that it can't find a the relationship, which means alembic isn't actually running
I'm not even sure how to turn this into a more testable scenario
so to run alembic programmatically, you pass it an alembic.config.Config pointing at your alembic.ini and then use alembic.command.upgrade(cfg, "tag")
it's rather simple
I'd hope it's that simple, heh, but you can see from my code above, I have that
alembic_cfg = AlembicConfig()
alembic_cfg.attributes["connection"] = connection
alembic_cfg.set_main_option("script_location", "/app/src/alembic")
alembic_cfg.set_main_option(
"sqlalchemy.url", f"{database_url}/{configs.DATABASE_DATABASE}_test"
)
I got the main option modifcation from a stack overflow answer
I just ran this:
from sqlalchemy import Engine, create_engine, text
from alembic import command as alembic_command
from alembic.config import Config as AlembicConfig
from app.configs import configs
def main():
database_url = f"postgresql+psycopg://{configs.DATABASE_USER}:{configs.DATABASE_PASSWORD}@{configs.DATABASE_HOST}"
engine: Engine = create_engine(database_url)
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
alembic_cfg = AlembicConfig()
alembic_cfg.attributes["connection"] = connection
alembic_cfg.set_main_option("script_location", "/app/src/alembic")
alembic_cfg.set_main_option(
"sqlalchemy.url", f"{database_url}/{configs.DATABASE_DATABASE}_test"
)
alembic_command.upgrade(alembic_cfg, "head")
main()
It took a few seconds to run, but no output and no change to my db
script location should be the migrations dir, right?
See, that's what I thought, but when I pointed it to the migrations dir (versions), it complained that it couldn't find env.py
I suggest saving your alembic.ini to disk and generating it from alembic itself so you can modify it as needed rather than just generating it in code
Can't find Python file /app/src/alembic/versions/env.py
ah yeah, so your alembic dir is your migrations dir
well, the parent folder
And I do have an ini, but for my app
You're suggesting a separate ini for each env?
I guess as it wont' be updated too often, that's not too big a deal
Oh, but I do some modification work in my env.py, as per other answers I got on this discord
Frick, why is something so basic so difficult
my envs don't really differ? I don't use the ini for my connection string, I set that up in the environment
Same, at the top of my alembic.env.py, I did
DATABASE_HOST = os.getenv("DATABASE_HOST")
DATABASE_USER = os.getenv("DATABASE_USER")
DATABASE_PASSWORD = os.getenv("DATABASE_PASSWORD")
DATABASE_DATABASE = os.getenv("DATABASE_DATABASE")
# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config
config.set_main_option(
"sqlalchemy.url",
f"postgresql+psycopg://{DATABASE_USER}:{DATABASE_PASSWORD}@{DATABASE_HOST}/{DATABASE_DATABASE}",
)
Unless you mean something different
I do mean something different
I just set the url in my engine directly since that's what you're importing into env.py
I tried to set it in different alembic.ini headers, but that feature is very clunky and I gave up on it since it doesn't merge "defaults"
and the programmatic use doesn't even let you reference different headers
example and then I use a match os.getenv("DB_ENV"): block at the end of the module to determine which function is called to export as engine
Oh, but then you can't do it via the cli, I'm guessing
Which I guess you could just replace the cli with script
right, you do it via the CLI by using the correct envvar
so you set DB_ENV to select the engine
oh, I see
I do this because my cloud db requires me to set up a complex event listener for password rotations on the engine
Makes sense
so I set up the listeners inside the engine functions
you can see that in set_sqlite_pragma
I don't suppose this is something you could throw onto a gist for me to look at? I'm definitely not following the full flow. If not, I understand, I'll see if I can piece it together
@brave bluff
def test_engine():
e = create_engine(some_sqlite_str)
return e
def local_engine():
e = create_engine(local_pg_str)
return e
def cloud_engine():
e = create_engine(cloud_pg_str)
@event.listens_for(e, "do_connect")
def receive_do_connect(dialect, conn_rec, carsg, cparams):
cparams["password"] = get_rotated_password()
match os.getenv("DB_ENV"):
case "local":
engine = local_engine()
case "prod":
engine = cloud_engine()
case _:
engine = test_engine()
Session = sessionmaker(engine)
And this is in your app, not the alembic/env.py, right?
yeah, it's used as your connectable
yeah, I had bad luck using the alembic tools in a configurable way
I spent like a week to figure it out and realizing that what I wanted to do wasn't possible or required a lot of code after reading through too many github issues
This really makes me consider swapping to atlas
But for now, I just need to get my test db structured
same reason that the JVM only really has Hibernate despite being also terrible
Yah, I definitely don't mean to make light of the difficulty
I think it's the way it is because it has to deal with the tech debt inside sqlalchemy as well
But I'm also shocked at the entire lack of answers/lack of knowledge, given it's the only player
Yah, SQLA should have 100% done a break at v2 and tossed the debt
and both frameworks have a lot of customers/clients consuming their code, so breaking changes are hard
I'm on the same boat after figuring out how to do mixins and dataclass integrations this past two weeks
Oh man, dataclass integrations are MESSY
So much stuff I'd like to be able to do, but then throws errors
or being able to use alternatives like msgspec but you can't because it doesn't use the same API as dataclasses.dataclass
SQLA and Alembic have been making me think I'm a much worse engineer than I probably am, so it's nice to hear someone else with similar opinions
so you're kind of just stuck using MappedAsDataclass or whatever
(I've had various people try to convince me the SQLA docs are fine, if you just read them carefully)
the docs kind of are fine? but the code docstrings definitely are not
I shouldn't need several browser tabs of documentation open to use the framework
I donno, as someone who's using it for the first time, I think fine is giving it a lot of credit 😛
Yes, 100%
And for example, this all came up because I'm trying to use postgres's citext
it actually is better than using pre-2.0 sqla
Yes, that's for sure
I wanted to use the AWS-provided RDS Postgres client with sqlalchemy, but that was a headache and a half because they themselves are importing sqlalchemy for connection pools with no instruction on how to integrate with sqlalchemy ORM 🤣
As someone who's just starting to use sqla... I couldn't figure out how to set a column as citext, because it and other db specific columns CITEXT (all caps) where as the base colums are Text (snake case)
AWS just assumes you're using raw connections everywhere which is absurd
It didn't occur to me that CITEXT was a class, because it was all caps
And nothing in the docs indicated it was just a swap
yeah, you're talking about type_annotation_map stuff now, btw
Yah
Sorry, I went off the rails a bit
heh
The docs aren't terrible, but I do think they're bad
I found out that the feature I wanted, which was using NewType UUIDs for type safety, was only implemented like a week before I was trying to use them recently
Haha, I had similar experience!
but I have to iterate EVERY ID NEWTYPE into that annotationmap which is absurd
So many design choices I can't imagine making...
at least my base class is beautiful
DbId is a union of all the newtypes, so it automatically implements the correct id primary key for each table
I'm quite proud of it because you can't misuse it
Nice
I do have to learn how to use generics better
Anyway, for now, I'm gonna try to get this testing working... given what you said about building an engine per env, while I'm doing my ini in memory, i'm basically doing the same thing in code, so the fact that it's not working is frustrating
yeah, dynamic connection_string in the ini wasn't working out for me which lead me down this path and it's easier to reason through
Does your script output something when alembic actually runs?
The fact that there are no outputs also bothers me
in my alembic ini I set the sqlalchemy.engine logger to have DEBUG logging so I can see outputs, and I set echo=True on my local engines
Thanks. I've gotten as far as alembic spitting out
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
in my script, which means now it's just not finding the actual migrations...
Which I guess is forward movement, but still not actually a result...
yeah, your alembic.ini expects a script_location pointed at the root folder for env.py and it assumes your migrations are named versions/
if it's not versions you need to configure version_locations
It's versions, and script location is pointed at the root
so I would verify you are specifying the command correctly
are you using upgrade(cfg, "head")?
This is my ini: https://gist.github.com/rohitsodhia/af3e4b20bdd6ad14cea47914996a7036
This is the script:
from sqlalchemy import Engine, create_engine, text
from alembic import command as alembic_command
from alembic.config import Config as AlembicConfig
from app.configs import configs
def main():
database_url = f"postgresql+psycopg://{configs.DATABASE_USER}:{configs.DATABASE_PASSWORD}@{configs.DATABASE_HOST}"
engine: Engine = create_engine(database_url)
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
alembic_cfg = AlembicConfig("/app/src/alembic.ini")
alembic_cfg.attributes["connection"] = connection
alembic_command.upgrade(alembic_cfg, "head")
main()
is attributes["connection"] correct?
I think your script is including code that should exist in your env.py instead
I plan on moving it there, I'm just going by what's in the alembic docs
with engine.begin() as connection:
alembic_cfg.attributes['connection'] = connection
command.upgrade(alembic_cfg, "head")
I'm trying to get it working with the minimum changes to start
Because I already have a feeling if it can't find the versions in my script, it won't find the versions in my tests
Hell, I'll be shocked if it works at all in my tests
btw, engines usually auto-begin
so you'll run into conflicts by doing it explicitly without changing the setting
Ok, I can sort through that
I put it in my app startup to run migrations
what are you building for?
This was meant to be a small crud app to learn fastapi and sqla off of, and improve my testing skills
This has turned into a nightmare that makes me question if I'm fit to be anything of an engineer after 10+ years if I cant even get a simple crud api going
ah yeah, so put your migrations into your lifespan startup
and then you can invoke them in your test by using the testclient as a ctx mgr
I don't mind that as an ultimate solution, though I'd still like to figure out what's wrong with the current test script... if it can't find the migrations here, why would it find the migrations at startup?
my suspicion is your env.py script
Which besides the portion I showed you earlier, is the same as the default. I get the connection is something I have to address, but that doesn't explain the lack of finding versions
It's making a db connection to the right database, I confirmed that
I think we're at a point where you'd need to share your migrations scripts maybe
or bump your alembic logger to DEBUG
Let me bump the logger to start, and I only have one migration right now; I'll throw it on gist. As mentioned before, it works via the cli on my app db, just can't get the api to work
Hey guys, Im tryna make a dynamic website using flask but idk what kind of database I should use. I've asked my teacher abt it and he said it's easier to start off with mongodb but from what I've seen MySQL or SQLite are much more popular (i guess?) so which one do I actually want to use? I'm a total newbie in database and I just wanna know which one is much more beginner friendly (it doesn't have to be limited to only the options I've mentioned above). Thx guys, much appreciated 🙏
SQLite will be the easyiest one and it already comes with python, no extra setup or server required.
alr sounds great! im definitely going to try it out thx!
Important notes: do you want a relational db, a document db, a managed db, a local db, these are a matrix of decisions and products that you must decide are best for your app. SQLite is built into the Python standard library, but it is a local (i.e., filesystem) relational db
it just doesn't matter much for me cuz i just wanna start with the easier ones and work my way up and the type of database isnt really something that i care abt, thx for the note!
in the worst-case (easiest) scenario, just use a Python dict you pass around as app state and treat that as your "db"
I also don't think theres an "easier" database. There are databases that are easier to set up, and come with downsides. There are more complicated to set up databases that can make your life easier down the line. For a quick simple project, and not knowing what else to do, sqlite is great, but choosing between a relational database like mysql vs a non-relational like mongo requires knowing what else does and why you'd want to pick each one. I'd say none are more "beginner friendly". If you're using SQL based databases, you need to learn SQL. If you're doing a NoSQL, then you need to learn the syntax for it (which are largely javascript based, but not always). Either way, there's something for you to learn.
I'm having trouble with SLQAlchemy 2. I have this model:
class Item(Base):
__tablename__ = "items"
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(), unique=True)
For which I wrote this repository method:
async def update(self, id: int, name: str | None = None, notes: str | None = None):
item = await self.get_by_id(id)
if not item:
raise NotFound(Item)
if name is not None:
item.name = name
if notes is not None:
item.notes = notes
db_check = await self.db_session.scalar(
select(Item).where(func.lower(Item.name) == item.name.lower()).limit(1)
)
if db_check and db_check.id != item.id:
raise AlreadyExists(db_check)
await self.db_session.commit()
return item
And have this test:
async def test_update_duplicate_item(self, db_session: AsyncSession):
item_repository = ItemRepository(db_session)
items_inserted = await db_session.scalars(
insert(Item).returning(Item), [{"name": "test1"}, {"name": "test2"}]
)
item1, item2 = items_inserted.all()
with pytest.raises(AlreadyExists) as e:
await item_repository.update(id=item2.id, name="test1")
item1 = await item_repository.update(
id=item1.id, name="test1", notes="changing notes"
)
assert item1.notes == "changing notes"
If I run each assertion block (the with and the line below it, and the bottom few lines) on it's own, it works. But when I run them together, I get an error on the second update that says there's an integrity error for trying to update id = 2, which means it's still trying to update the first attempt. I fixed it with session.expunge, but I don't understand why that's happening? In the update method, item is being overwritten each time.
I don't see you adding the model to your session before the commit?
Yah, I found it really strange that whether I add it to the session or not, it updates either way. And if I do add it to the session, sometimes it causes errors, which is baffling?
A REPL is an interactive shell where you can execute individual lines of code one at a time, like so:
>>> x = 5
>>> x + 2
7
>>> for i in range(3):
... print(i)
...
0
1
2
>>>
To enter the REPL, run python (py on Windows) in the command line without any arguments. The >>> or ... at the start of some lines are prompts to enter code, and indicate that you are in the Python REPL. Any other lines show the output of the code.
Trying to execute commands for the command-line (such as pip install xyz) in the REPL will throw an error. To run these commands, exit the REPL first by running exit() and then run the original command.
I have another test:
async def test_update(self, db_session: AsyncSession):
item_repository = ItemRepository(db_session)
item = Item(name="test", notes="test")
db_session.add(item)
await db_session.commit()
item = await item_repository.update(id=item.id, name="test2", notes="test more")
assert item.name == "test2", item.notes == "test more"
item_from_db = await db_session.scalar(select(Item).limit(1))
assert item_from_db
assert item.id == item_from_db.id
assert item.name == item_from_db.name
That shows, even without adding it to the session, the db changes
oh also, why are you using some get_by_id thing? You can just use .get(id) on the session
Uh, best i've seen, get doesn't exist in 2.0
the db_check seems redundant. you can just try the insert and catch the integrityerror for non-unique names
Yah, that's fair, I can remove that and just catch the integrity error
Which I'll do
But I'm hoping someone can help me understand what's going wrong with the update in my situation
If I change the method to
async def update(self, id: int, name: str | None = None, notes: str | None = None):
item = await self.get_by_id(id)
if not item:
raise NotFound(Item)
if name is not None:
item.name = name
if notes is not None:
item.notes = notes
try:
await self.db_session.commit()
except SQLAIntegrityError as e:
await self.db_session.rollback()
raise AlreadyExists(e)
return item
the error moves to a greenlet spawn issue in the second update?
https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-bulk-update-by-primary-key
I'd suggest using the update() construct? I can't find guidance in the docs about updates otherwise 😵💫
Yah, the docs are BAD about orm updates
how is it this bad, it's so common
Yah, and that's an obscure example, heh
So I guess that tracked change persists in the session even if the instance itself doesn't?
Which is WEIRD behavior IMO
And they talk about rolling back there, but rolling the session back for the update didn't work for me
I had to expunge the item from the session
So... the session.get document says it's legacy, but the v2 tutorial says it's not
so question, are you creating a new session per request?
One session per test
oh no
No?
you need one session per request
they aren't intended to be durable and long-living
See, this is the frustrating thing... some folks say "one session all the way!" while others tell me "commit commit commit"... I don't know which way to go!
each session is basically its own transaction
Yah, I read the transaction docs
the prior is just wrong, though
one engine per app, but session per request
Heh, it's a pretty common attitude though, and as someone just getting deeper into SQLA, there's no reason for me not to believe it until I get to an issue like this.
Oh, as I think about it, I think for a app request, treating it as one transaction makes sense
But for tests
Each attempt should be it's own transaction
Oh?
I've disabled autobegin and gotten errors still, too 🤷
Ok, well, I think with your transaction advice and knowing a bit more about what's going on, I can fix this
Thanks
SQLA docs confusing? ::shocked::
And I think I'm going to have to look into nested transactions for a test like this
Because I don't want to commit any transaction, so my db is clean between tests
But in some tests, I'm gonna set up data
spin up a fresh db for every test /s
I did that before... boy was that slow...
Use save points and roll back your transactions.
nah, that's a fragile approach
it shouldn't be. your tests should be in-memory
How so? Works well for me with SQLA and this is also how Django's defaults work for example.
It's also what SQLA themselves use: https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#using-savepoint
if supported by the underlying engine
Well, per class, not per test
I'm not aware of an RDBMS that SQLA supports that does not have support for savepoints....
there are bugs to be aware of.
That's unfortunate, I wonder how Django works around that
I have a database with these four entities:
- Property
- Department
- Municipality
- RuralSettlement
Each property must be associated with a specific Department, Municipality, and RuralSettlement.
The relationships are:
Department one-to-many with Municipality
Municipality one-to-many with RuralSettlement
What foreign keys should I have in Property?
Just one foreign key field for RuralSettlement. Because the other tables are already connected by their relationships.
Could you consider have foreign keys for each one? in some specific use case?
Like in what case?
That would be confusing because if you want to know a Property's Municipality how do you know if it's property.rural_settlement.municipality or property.municipality.?
Uhmm for example, if it's possible that a Property doesn't belong to a RuralSettlement, I should have foreign keys for both?
hmmm
You could make Department, Municipality, and RuralSettlement into one hierarchal entity, so you have only one foreign key field on Property.
But that might be more complicated than having nullable foreign key fields on Property...
That's all I have off of the top of my head
Uhmm, this is a good idea. But I think I need each entity to be independent because, for example, there is a feature where the user creates a property, and they select the department. Then, they need to see the available municipalities for that department.
Add a type field so you know what's a Municipality
Once again, this might be more complicated than it's worth
okok i will think about it
I've got a SQLA issue I can't figure out. I have this test:
async def test_update_duplicate_item(self, db_session: AsyncSession):
item_repository = ItemRepository(db_session)
items_inserted = await db_session.scalars(
insert(Item).returning(Item), [{"name": "test1"}, {"name": "test2"}]
)
item1, item2 = items_inserted.all()
with pytest.raises(AlreadyExists) as e:
await item_repository.update(id=item2.id, name="test1")
item1 = await item_repository.update(
id=item1.id, name="test1", notes="changing notes"
)
assert item1.notes == "changing notes"
Along with the update method:
async def update(self, id: int, name: str | None = None, notes: str | None = None):
item = await self.get_by_id(id)
if not item:
raise NotFound(Item)
if name is not None:
item.name = name
if notes is not None:
item.notes = notes
try:
await self.db_session.commit()
except SQLAIntegrityError as e:
self.db_session.expire(item)
await self.db_session.rollback()
raise AlreadyExists(e)
return item
When I run the test, it fails on the second update. I realized from feedback here it's because SQLA still has the first item in memory and tries to commit them both (but it somehow doesn't get caught in the try block?). So I thought adding the expunge would work, but then I get an error that I need to rollback. When when I add the rollback here, I get a greenlet error. I can't figure out what's going on.
BTW @hallow brook ,i did do a nested transaction, which ultimately fixed it, but now I'm hoping to learned what's going on in this situation to be better able to use SQLA.
ive been running into a lot of database performance issues with my discord bot, and I’m looking for a better alternative to SQLite, something that’s still easy to set up but handles high-concurrency writes better.
what im looking for:
easier to set up than PostgreSQL (I don’t want the hassle of full DB server management)
more scalable than SQLite without major config headaches
handles high-concurrency writes well
minimal setup & lightweight
any suggestions?
im also very unfamiliar with postgresql
Mysql seems a good alternative for database.
async def route(request):
async with create_session() as session:
# db logic
...
async def route(request, session):
# route & db logic
...
hmm with docker?
You just need to call setup_database() to configure the database, then you can forget about it.
i don’t want to use something that needs to run on a server like postgresql
Do you mean you need to store the data locally?
It doesn't necessarily run as a server, you just need to call create_session() to access the database.
yeah, like with sqlite the setup is very minimal and it’s really easy to use it when you’re hosting a bot on a vps
That's the best current setup I can have for a DB.
It is safe due to its good ORM handling, and supports concurrency (async), and also combines the Pydantic SDK.
is it able to handle several asynchronous operations concurrently
Obviously, it is an asynchronous session
You just need to familiarize yourself with SQLAlchemy, for example this one for update
item = await session.get(objects.Item, itemId, with_for_update=True)
item.price += 15
await session.commit()
What's the issue with performance?
i keep getting db locks, high contention under concurrent writes and my performance is degrading with increased interactions
i already enabled WAL mode
Yeah, I'd try something like postgres there
im not very familiar with postgres, and the setup seems to be really difficult
ill be running the code on a vps as well
As Viktor mentioned sqlalchemy as an orm may be a good choice too
You could run it in docker 
i was considering runninig it in docker
you said earlier your traffic was approximately one write per second, right? that's way within the realm of sqlite's capabilities, so i suspect your transactions are being kept open for too long
ive decided to use mysql
was it easier to setup over postgresql? i thought both had docker images
also not familiar with how they handle contention between transactions, i'd like to read a bit about that...
okay now im in a dilemma again
honestly, i cant decide what to use
its killing me
im about to pull out my hair
Mind sharing how you are using sqlite to write ? also you can try playing around the connection pooling options with sqlalchemy
https://www.postgresql.org/docs/7.1/mvcc.html
https://dev.mysql.com/doc/refman/8.4/en/innodb-multi-versioning.html
hmm if i understand correctly, postgresql and mysql (innodb) use multi-version concurrency control to allow concurrent reading/writing, but transactions can still lock individual rows or tables which prevents some transactions from reading them, depending on isolation level
i'm using a custom connection pool built on top of aiosqlite rather than sqlachemy.
this is my current approach to database writes:
connection pooling: I've implemented a DatabasePool class that manages between 3-8 connections with configurable min/max settings
write locking: i use a dedicated asyncio.Lock to ensure only one write operation runs at a time, while still allowing concurrent reads
transaction management: my implementation supports proper transaction nesting and has safeguards to prevent leaving transactions open.
retry logic: I have exponential backoff with jitter when encountering "database is locked" errors.
WAL mode: im using sqlite's WAL journaling mode with optimized settings
the current issue im facing isn't the connection pool itself but rather sqlite's fundamental limitation with concurrent writes. no matter how good the pool is, sqlite still locks the database during writes which is extremely frustrating
i did a benchmark of sqlite's write performance a while ago: #databases message
in short, the results were that one connection committed 1000 writes in 0.2s, while ten connections committed 1000 total writes in 6.6s, so around 150 writes/second
oh whoops, i didn't actually report the write performance in this message, i'll need to rerun the benchmark...
edit: results were 1000 writes in 0.6s with 1 conn, and with 10 conns it increased to 1.5s, so it was actually quite fast for writing
Yeah, the note on tables is key, a client/server db is still going to run into locks if the writes are all to one table.
do you know how long your write transactions are held open? do they have to wait for an API request to complete, or for the user to enter something, etc.
You can always ask an AI expert in databases, they know many technologies and will be able to recommend you based on your needs.
To be clear you should be able to read from the database at the same time as a write, if not something is wrong. SQLite can only support one write operation at a time, but it's so fast that shouldn't be a problem. Let it handle queuing the writes, so you don't have to worry about it.
my write transactions are generally short lived and don't wait on external operations. most of my transactions execute a single database operation or a small batch, then commit immediately. the issue isn't transaction duration, but rather multiple concurrent transactions attempting to write to the same tables, but i do have proper pooling and transaction management with automatic commits/rollbacks, but the contention happens when multiple requests hit simultaneously
what do you guys think about sharding my database?
one for user data, one for chat data, etc
honestly, if i were to migrate to postgres or mysql, id have to make several adjustments
and can you confirm that the quantity of writes is still ~1/second? or has it increased?
SQLite should handle all of this for you. Have you tried without rolling your own?
Well isn't that because you want one write operation at a time only ? It's probably better to offload this to the db or orm rather, also you can check out the isolation levels and find the best one for your need, but strictly keeping it at one write at a time will for sure hit performance
Have you tried FOR UPDATE?
SELECT ... FROM ... FOR UPDATE statements
SELECT * FROM orders WHERE id = 123 FOR UPDATE
This locks certain rows to prevent access to old values that have not yet been updated in the database but are being executed.
it hasn't increased
im primarily using regular SELECT statements followed by separate UPDATE operations
based on my understanding, sqlite will effectively lock the entire database during write transactions anyway due to its file-based locking mechanism. so while FOR UPDATE would make my intent clearer, it won't fundamentally solve the concurrency issues im currently experiencing, correct me if i'm wrong though
then i would suspect that the duration of one or more write transactions is the problem, not necessarily the quantity of concurrent writes
SQLite is used for local databases, your machine should have a good disk write speed, the more writes per second you have,
How many values do you update per transaction?
my thinking is that say you had 100 writes to finish immediately, but each transaction only lasts for say, 20ms, then that would take 2 seconds to complete, since it goes through them sequentially (as you've understood correctly)
in contrast, if you had 10 write transactions to complete, but one of them needed to stay open for 5 seconds, then even though there are fewer writes being done, the total time needed to finish those writes is 5.18s, just long enough for sqlite3's default 5 second timeout to raise "database is locked" exceptions on some of those transactions
can you reproduce the exact transactions that causes those locked errors?
In that case it would not be good to handle updates as a future, so when you want to access them you should make sure that the future has ended, although you should almost always assume that the transaction will be successful.
there are a few operations where i might be holding transactions open longer than necessary.
i have a matchmaking system within my code, this is what it does when processing matches:
- removes users from queue table
- creates a new chat record
- updates user records
- calls discord api to create thread
- updates some more records with the thread id
that discord api call happens within the transaction in some places, and that's potentially problematic. even with my retry logic, a 200-300ms api delay during high activity could be causing the lock contentions
Does the session remain active throughout the entire process?
if you can reproduce the error and figure out a way to determine what transactions are active at that time, perhaps printing when your transactions open, and the queries you're about to execute, that should help you narrow down which function is causing your other transactions to timeout
honestly, i dont think using sqlite is a viable solution anymore, this bot is being implemented in a server with 700k members
so my options are;
- shard my database
- completely migrate to either postgres or mysql
try using a decorator that displays the response time
import time
import collections.abc
import functools
def timer[**P, T](
func: collections.abc.Callable[P, T],
) -> collections.abc.Callable[P, T]:
@functools.wraps(func)
def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
s = time.perf_counter_ns()
r = func(*args, **kwargs)
print(func.__qualname__, time.perf_counter_ns() - s)
return r
return wrapper
# example
@timer
def sum(a: int, b: int):
return a + b
sum(1, 2)
I think Postgres is better for your case. MySQL is generally better for reading, but not so much for writing. That's why it's more commonly used for static data.
Depends how far you want to scale, sharding imo leaves you with single points of failure, migrating to postgres or mysql allows you to specify multi write nodes so its more scalable
I'd personally prefer postgres, but I think they're equally difficult to set up 😅
By the way, sharding some of your database tables seems like a nice solution, depending on what you're doing
yeah and i’m not familiar with postgres at all
i tried installing it, its all so complicated
this is so frustrating
What exactly was complicated? I'm really curious 
so i managed to connect to the port, but i couldn’t view the tables for some reason?? i tried watching a few tutorials as well
View the tables?
I don't use psql directly that much if you're talking about this, but there are commands to see tables or tables in current schema. If you have a db client/browser like pycharm's built-in one I'd recommend that
i mean listing all the tables in the database to see what data structures are available
I mean, if you didn't create anything there wouldn't be any? 🤔
i did
i was tryna see if i could configure it
i just gave up after a while because i was on the verge of pulling out my hair
I tried to see how far I could push WAL mode before things started breaking:
| Connections | Transactions | Journal | Time |
| ----------- | ------------ | ------- | ----- |
| 5 | 1,000 | DELETE | 4.64s |
| 5 | 1,000 | WAL | 0.19s |
| 10 | 1,000 | DELETE | LOCK |
| 5 | 10,000 | DELETE | LOCK |
| 10 | 1,000 | WAL | 0.45s |
| 50 | 1,000 | WAL | 4.63s |
| 10 | 10,000 | WAL | LOCK |
I'd still like to try a few more things like increasing the timeout.
Each transaction is an insert to the same table btw.
Hi anyone alive here?
Im currently looking for more efficient ways to update a different table based on changes from a settings table
How does one implement this in a way that won't crash a db?
For example, a different table (T1) references a central settings(T2) table, and when that setting changes, ALL rows in T1 must update with that new setting
To make it more complicated, T1 has rows which use the setting for calculations, so those values will have to be recalculated as well
For context, i'm currently creating a payroll system for a small company
I currently implement the solution to what i mentioned by having a 'bulk regenerate calculations' button (for example for the employee timesheets). Only problem is, with just around 100 of those, supabase already cuts the connection/times out
what would be a better way to approach this?
oh my god
this seems very logical
What are you using for pooling?
There are a couple database design rules that could help you out here:
- Don't store computed fields. Instead perform the calculation on read.
- Don't store the same data twice. Instead create a reference using a foreign key.
So in your case the rows in T1 would have a field with the foreign key to a row in T2. Now when you make a change you only have to update the value in T2.
oh thanks for that first tip, didn't think about that
for 2 though, i do use foreign keys
how'd you suggest i go about the first one tho in a way that doesn't recompute on every read?
oh what do you mean by perform the calculation btw? recompute then store at the DB level? or just pull the data then compute at either the server or frontend?
Let's say you want to know the total value of an order. That would be annoying to store in the database because each time the order, quantity or price changes, it needs to be updated. Instead when you're reading the data, add a field in the select statement to calculate the total value.
If that's necessary, you'll have to break one or both of those rules. Take a look into generated fields.
i see, that makes a lot of sense, thanks!
some of the fields are too complicated to be generated/computed at the DB level (mostly because the values themselves depend on values from other tables), but i got a lot of ideas from what you said, thank you!
Hello people, I'm looking for an sql database client that autorefreshes when changes occur, I'm using dbeaver but when I make changes in the tables or so it's not autorefreshed (i need to click in refresh to see the difference), can you recommend me one that have this feature enabled?
DataGrip, HeidieSQL
client? I'm not sure a database is what you want
i've managed to use postgres
now i just need to figure out how to host it on my vps
Docker is probably easiest if that's an option.
could someone tell me how to add a csv dataset into vs code. Im new to python and im using pandas.
VS Code is a glorified text editor, it is not a database in itself nor has any built-in tools for managing data like that besides "open files in a text editor"
working with dataframe libraries like pandas or polars, you'll just need to write code to read that file - there is no concept of datasets
Read the User Guide of the library you are using, it covers basic operations like reading from and writing to files, as well as all sorts of basic operations
What do you mean 'into VS Code' exactly? In general after importing pandas, you can do pandas.read_csv('somefile.csv'), optional args include 'header' and 'names' if you want to process or set names for the columns.
usually you'll want to avoid having to read the entire file to test each change you make though
you can use something like Jupyter Notebooks, IPython, marimo or even just running code in the terminal to interactively run your python code one line (or block/"cell") at a time, keeping the entire data loaded in memory in between
TL;DR watch https://www.youtube.com/watch?v=suAkMeWJ1yE then read and follow along https://pandas.pydata.org/docs/user_guide/index.html
In this video, you'll learn how to create your very first Jupyter Notebook in VS Code, including setting up your environment, running and debugging code, and visualizing data. Getting started with Jupyter Notebooks has never been easier!
🔎 Chapters:
00:00 Install
00:50 Setting up your environment
02:29 Running text, code and visuals in one n...
thank you I appreciate it
so Im doing this for a school project and one of my classmates who knows a lot more about programming then I do took the csv Im using for my project and put it into vs code. I think it was some sort of text editor possibly but im not too sure.
but Im using jupyter notebook for this
Gotcha. Jupyter is a great place to mess around and learn. Good luck!
thanks for the help I got everything working now
Im also pretty new to python, and just started with pandas a few days ago. This might sound trivial but should i try to integrate jupyter nbs into vsc or can i just use the anaconda browser localhost thingy
Nothing wrong with using the Jupyter Notebook Server (or JupyterLab). VS Code does come builtin with support for notebooks, you can give it a try and see what you prefer.
alright thank!
im not using docker
i already figured it out
i have not seen a single deadlock yet
That's awesome, let's go!
:incoming_envelope: :ok_hand: applied timeout to @tulip kite until <t:1743525234:f> (10 minutes) (reason: duplicates spam - sent 4 duplicate messages).
The <@&831776746206265384> have been alerted for review.
Ok, finally a SQLA "how would you do this" rather than a "how is this done"!
I have a SQL object with a hybrid property:
class Purchase(Base):
_price: Mapped[int] = mapped_column("price", Integer())
@hybrid_property
def price(self):
return self._price / 100
@price.inplace.setter
def set_price(self, value):
self._price = round(value, 2) * 100
But when you do a bulk insert (https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-queryguide-bulk-insert), you pass in a dict of values, which means the hybrid part never triggers. So if I try to insert {"price": 1.50}, it just fails. And I'd have to set _price to 150 for it to work, which means data manipulation anyway.
Right now, I'm doing this:
purchases_data: list[dict] = []
for purchase in purchase_input.purchases:
purchases_data.append(
Purchase(price=price).__dict__
)
Any thoughts on another mechanisms?
Possibly a dumb statement but just making sure you've considered session.add_all()? That isn't as efficient but it will run your setters
Oh I guess you could also do an event listener?
@sqlalchemy.event.listens_for(Purchase, "before_insert")
def process_price(mapper, connection, target):
if hasattr(target, "_price") and target._price is None:
target._price = round(target.price * 100, 2)
Instead of the hybrid property? I guess that is another way to handle it. I'd need to add a second listener for the other way around.
hybrid properties aren't called during initialization
you want a @validates
Sorry, not sure how I'd apply that? And I know it's not applied at initialization, but it's like any other getter/setter property, using the value in the obj to calculate the result. But since that's not how bulk works, my code doesn't work. What would I be validating?
oh I see what you're saying. you don't need to use the bulk insert api specifically
just add to your session and commit at the end
Yah, I considered that. I just know that'll run multiple inserts. Not a HUGE deal, specially at the small scale of my app, but at a larger scale, that'd be a poor solution.
I do see there's an update_expression for bulk updates, but I don't see anything for bulk inserts...
What are the git problems you have run into and felt like - someone should have taught me / or told me about this
I need to conduct session for Git - I know basic commands only..
need to know advance level
I guess I would have liked a better walkthrough of “rerere” when trying to fix rebase conflicts. Not sure this is the channel though it’s sassy to think of git as a db
Idk where to ask 😭
Thanks bro
hi i am trying to do a little data bases exercice where ive made a list of movies and next to it i have "acting", "soundtrack" ... and "good", "mediocre", "bad" as a first row and i will have to check one of the 3 for each criteria
i want to code a python script to note the movies using pandas, how do you suggest i proceed?
I would have a "rating" column that took values like "good", "mediocre", "bad", rather than having them be separate.
this is exactly an exercice im trying to learn how to convert check signs into a rating
Dm me if you can dump of a db
Dm me now
i'm currently using ubuntu 24.04.2 on my vps and i'm wondering how to add my setup.sql into my postgres database using the command line
or should i integrate the schema directly into my code
which framework/db library u use
i'm using postgres with asyncpg
https://pypi.org/project/asyncpg/ asyncpg directly, without anything else?
yup
well, a thing to check, but alembic is probably usable standalone as migrational lib.
I know for sure it should be able to handle Raw SQL migrations
integrating to your own code is always good idea => as it simplifies local dev setups to do development locally
and running unit tests
but technically nothing forbids having file in Git repo without migrational lib, that is already enough to bootstrap server
still better doing all actions through alembic for versioning purposes though
=> At minimum ensure your SQL Schema is always stored in git repo.
Usage of tools like alembic will ensure u aren't doing manual things outside of git repo code
====
It depends on size your development. if it is micro pet project with small amount of code, it can survive with whatever shit u make. Just going to be less pleasant shit
If u have many devs and too much code, better using migrational versioning lib and reviewing changes in pull requests
Hey I had a doubt we use databases to store data and perform olap operations
But I stacked my data in an array
And created a python script to query it
And used dask to parallelize it
What's the exactly the diff between both ?
What will be the advantage to use database here
how are you storing your data?
if you only need to perform olap operations, there aren't that many benefits - to the point you see tools like https://duckdb.org/ gaining popularity lately
if you need to perform updates, run queries that benefit from indexes, manage a lot of data in a single place etc. then you might rather use a proper database, but for some simple use cases storing the data in parquet files* and using libraries like duckdb, polars or dask+pandas could be enough
-# *or even just plain CSV
I am storing in hdf5
I have raster data/gridded data
So like for each time stamp it's an image
there might be some PostgreSQL Plugin or even some NoSQL databases better suited for that, but depending on which kind of processing you're doing it might be fine to just not use an actual database
Will it scale to if data is in tbs?
Currently I got only 1month data and it's 100gb
For 4 variables
If it add rest of the remaining 4 it will be around 200gb
So for whole year it will be 200 x 12 = around 2.4tb
Yes I saw postgresql plug-in it's name is postgis
hi
yes postgis can handle raster data quite well
I think this is most likely a general issue with my lack of python knowledge but I am unsure why I am getting this error when trying to attach to a sqlite db.
dbconnect = db_handlers.createDB()
cur = dbconnect.cursor()
db_handlers.addExpenses(cur, rawimport, "2024")```
throws an error AttributeError: 'str' object has no attribute 'cursor'
This I have two handler files basically. handlers.py and db_handlers.py. This function is in handlers.py and is being called by an upper level script for choice handling.
The actual function createDB works perfectly fine in that file along with cursor but not outside it
*edit* well apparently the issue is createDB fails when its being called outside of db_handlers.py
*edit2* ok yea it is me being new at python. the db path is currently hard coded and I was using relative path, which basically fails when I execute from a source file because the top level is in a different folder. time to convert to absolute path after all. Thanks for the rubber ducky moment 
dbconnect = db_handlers.createDB() that suggests createDB() is returning a string, which is probably not what you want.
yea, it was due to my sloppy error handling that was very basic. Which basically createDB() returns error text if it fails. Which worked fine for direct call but not as a library so that was the actual indicator I had it wrong and it came down to trying to do like mydb="data/mydb.db" vs "../data/mydb", etc. So I just need to go ahead and do absolute path and that will solve that problem in the future too.
Basically ive dabbled in python for a while but I wanted to try and like really learn it so in addition to taking a class im making my own program from the ground up and trying to not just look up a code snippet to copy paste but actually figure it out. Fun but also challenging
If you start adding 'type hints' a bit at a time to your functions, that will catch a lot of these errors for you automatically.
e.g. it would have complained that your createDB function returns two different types conditionally
hmm good point, Ive done that in a few cases before for troubleshooting but worth doing here. I also just need better error handling for the initial db setup in the first place because if that fails for some reason nothing else really matters 🙂
j'ai une petite question sur les bases de donnés sql y aurai t il des connaisseur qui pourrai m'aider (probleme de com avec une aplication vbnet)
I have a quick question about SQL databases — is there anyone knowledgeable who could help me? (I'm having a communication issue with a VB.NET application.)
Since this is a Python discord probably not the right place for VB.NET help you might get lucky
What error are you getting, and what database server is it? I don't know anything about modern VB.NET but maybe it'll be clear from the specifics.
huzzah, db import totally works finally. dealing with preventing duplicates is going to be a whole thing though heh...
If you can manage it, it's nice to prevent duplicates via uniqueness constraints on the columns.
Yea that would be nice but most things could be the same. its an expense/budget tracking so like dates and charges can easily be duplicates and maybe even the expenses. Was thinking maybe I could like hash the whole entry and use that as a match case or something
In a perfect world where you have super precise timestamps, you can use that as a proxy for identity.. but life isn't always that easy.
lol its a miracle the dates are correct, but sadly all I have are YYYY-MM-DD so not unique enough
hey i'm using sqlite3, is this the right way of writing this : ```py
cur.execute('SELECT account_name, password from account WHERE account_name =? AND WHERE password =?', (username,password))
so I found this before and can't find it again. How does sqlalchemy interpret string annotations / expressions? I know it tries to literally eval() them against a namespace, but somewhere they explain this in their docs and I can't find it
ah, think I found it.
https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#late-evaluation-of-relationship-arguments
I like the way it's explained in the litestar docs where they do something similar
https://docs.litestar.dev/2/usage/routing/handlers.html#signature-namespace
I feel like im overthinking but do I really have to call connect = sqlite3.connect("db") in every instance that I do something with the database? I currently have that wrapped in a function for error handling but it still feels like I'm doing it wrong
depends on what you are doing, some projects can reuse the same connection or request it via dependency injection
I'm trying to create a project for my resume that would show my ability to use SQL and do some ETL. I'm taking some nested json and putting it into a database. I wanted to get it into a tabular format to put into a staging table and then normalize it. But is this an odd thing to do? Should I not be trying to put it into relational database if its nested SQL (not in a records/row format, more of a keys with dictionaries as values that can potentially have their own dictionaries). Ive heard NOSQL may fit it better. But I'm down to make a relational database out of it as long as its not seen as "bad".
{'name': 'Not in Jojo but should be a stand name', 'lastModifiedDate': '2024-11-20', 'items': [{'track': {'trackName': 'Can You Stand The Rain', 'artistName': 'New Edition', 'albumName': 'Heart Break', 'trackUri': 'spotify:track:1vbe9sh7U9vdMceAhitWr2'}, 'episode': None, 'audiobook': None, 'localTrack': None, 'addedDate': '2024-11-20'}, {'track': {'trackName': 'This Place Hotel (a.k.a. Heartbreak Hotel)', 'artistName': 'The Jacksons', 'albumName': 'Triumph', 'trackUri': 'spotify:track:2eADspwGh5Ad8tDoJ2bHsR'}, 'episode': None, 'audiobook': None, 'localTrack': None, 'addedDate': '2024-11-20'}], 'description': None, 'numberOfFollowers': 0}
This is an example of just one piece of data from the json I have. Its basically a bunch of these.
It’s not “bad” it just requires analysis to turn into a good SQL schema
For example do you want an “albums” table or just a label on “tracks” etc
This is my personal spotify listening data. So I was thinking of Playlists, Songs, and Streaming Activity (e.g. listened to song for 1000ms on Jan 1)
Yeah, makes sense. A playlist has many songs, a song has many streaming events/listens.
Okay that's relieving. I just don't want it to look like I'm trying to hammer in a nail by smacking it with the pointy end of a screwdriver.
(By “many” I mean 0 or more)
Now, if each of these songs had different attributes, that’s when NoSQL might make more sense
Oh nah, they very consistently have the same stuff. Its just the formatting being a little annoying to read in.
Thank you for your time and help 🙂
No problem, good luck!
Not sure where to ask this question, I think here is the correct topic:
I have 3 folk stories that I want to create an inverted index for (attach word to files that the word is present in). To do that, I created a dictionary that has words as keys and indices of those stories as values. Now I want to save it as .json, so that my programme would not run it all the time.
The only way I know how to do that is by writing the dict into pl.DataFrame and then .write_json on the DF. Is there any more "meta" way to do it that you would recommend?
This is the current iteration (cast_to_df actually creates the dict)
You could do json.dumps(word_index.items()) to get back the JSON string and then write that to disk.
Thanks, after a bit of debugging this seems to actually do more of what I wanted than the pl.df method
That makes sense and I did also realize I had stuff too spread out and too specialized when I didn't need it to be so consolidating to like a single function for writing and then called by others to handle different types of writes helped a lot with keeping the db interaction cleaner. Now just need to figured out what I'm doing wrong that results in no data for a query lol.
Been over a decade since I last did any SQL and that was with PHP so basically I remember nothing
make sure you're using transactions and committing at the end of each
yea I am comitting and closing, even DL the sqlite DB browser and I see my table and test data in there. just so far each attempt at query just gives me back [ ]
yo
:incoming_envelope: :ok_hand: applied timeout to @sudden torrent until <t:1744296396:f> (10 minutes) (reason: burst spam - sent 8 messages).
The <@&831776746206265384> have been alerted for review.
good bot
And I do in theory know im doing this right as from what I see in the documentation if I was querying a nonexistent table my result.fetchall() would return None instead of an empty tuple
welllll I tore it all down and started again. and directly in py I got it working now so it must be the queries themselves. what is odd is even directly in the interpreter previously it wouldnt work either
update well I figured it out though not sure on the correct adaptation going forward but it seems that when using cursor.executescript() it must be a script, a single command doesnt appear to work. I was only using that as the docs said that cursor.execute() was being deprecated.
update2 ok that is not entirely correct though I am not sure why. I have an INSERT query that operates just fine with executescript but the SELECT does not.
What? cursor.execute() is not deprecated. You just get a warning if you do not use a dictionary with named parameters. And as for .fetchall() will return an empty list [] if no rows are found and .fetchone() will return None if no row is found.
ok then I did misread what that reffered to. however executing the same exact query string. executescript() returns [ ], while execute() returns the expected result
[('Jan 5', ' Applebees', 123.12, ' ', ' '), ('Feb 4', ' Best Buy', 999.95, ' ', ' '), ('Apr 19', 'King Soopers', 225.33, ' ', ' '), ('Aug 10', 'INTERNETCHARLOTTESVILVA', 89.0, ' ', ' '), ('Aug 10', 'BEER & WINE', 17.29, ' ', ' '), ('Aug 10', 'GOOGLE *YouTubePremiumg.co/helppay#CA', 14.94, ' ', ' '), ('Aug 12', 'STEAMGAMES.COM ellevueWA', 30.74, ' ', ' '), ('Aug 13', 'THE HOME DEPOT', 84.93, ' ', ' ')]
>>> print(readCursor.executescript("SELECT * FROM expenses2024").fetchall())
[]
I'm wondering that you can fetchall() on an executescript at all. A script normally consists of more than one sql statement. For which one should a result list be done?
Well that is an assumption on my part which is why I assumed using executescript for a single command didnt work properly
And yea using executescript instead of execute was my mistaken reading of what the deprecation refered to
I would say just like for executemany() any result list is discarded for executescript()
ok so basically I was right in thinking I used the wrong function just not the reason why. there is no return to fetchxxx() if I am executing something wiht multiple commands since it doesnt know what I want to get back
edit and I didnt notice until now because executescript for insert I didnt care about a return value outside of it working so there was no fetch() there to tell me it was wrong
Hi, I've faced very strange issue:
def create_metering(
metering: metering.MeteringCreate,
db: Session,
current_user: user.User,
account_id: int,
background_tasks: BackgroundTasks
):
# Validate metering record
valid, rec = is_metering_record_valid(account_id=account_id, request=metering, db=db)
if not valid:
raise HTTPException(
status_code=status.HTTP_409_CONFLICT,
detail=f"Please submit record with Customer: {metering.customer},"
+ f" Product: {metering.product_code}, "
+ f"Dimension: {metering.dimension_quantity} in next hour, "
+ f"current hour has a record submitted at UTC {str(rec.created)}",
)
details = metering.details.__dict__ if metering.details else None
mtrng = Metering(
account_id=account_id,
customer=metering.customer,
customer_aws_account_id=metering.customer_aws_account_id,
dimension_quantity=metering.dimension_quantity,
org_id=metering.org_id,
metering_type=metering.metering_type,
product_code=metering.product_code,
updated_by=current_user.email,
details=details,
)
if metering.agreement_id:
agreement = db.query(Agreement).filter(Agreement.identifier == metering.agreement_id).first()
if not agreement:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Agreement not found",
)
mtrng.agreement = agreement
db.add(mtrng)
db.flush()
print("------------------------------------------")
print(mtrng.__dict__)
# if mtrng.details is None:
# account = db.query(Account).filter(Account.id == account_id).first()
# if account:
# mtrng.no_of_times_submitted += 1
# background_tasks.add_task(get_metering_processor(account=account, db=db).process_metering(mtrng))
This code is really strange
def create_meterings(
account_id: int,
request: list[metering.MeteringCreate],
db: Session,
currentUser: user.User,
background_tasks: BackgroundTasks,
):
account = db.query(Account).filter(Account.id == account_id).first()
log.info(
REPO_LOG_FMT,
account.org.name,
account.name,
currentUser.email,
ActionEnum.ADD.value,
request,
)
if account.cloud == Cloud.AWS:
if not (account.status == "UPDATING" or account.status == "OK"):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Invalid Role Arn",
)
errors = []
passed_data = []
for req in request:
try:
create_metering(req, db, currentUser, account_id, background_tasks)
passed_data.append(req.dimension_quantity)
except HTTPException as e:
errors.append(str(e.detail))
db.commit()
data = db.query(Metering).filter(Metering.account_id == account_id, Metering.product_code == request[0].product_code).first()
print("+++++++++++++++++++++++++++++++++++++++")
print(data.__dict__)
if errors:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail={"errors": errors, "passed_data": passed_data},
)
return {"Message": "Meterings Created Successfully"}
When I use this data for metering creation i can eventually see customer_aws_account_id in db
[
{
"org_id": "2",
"agreement_id": "agmt-8dxk1wcle1p3bznb3krotuqkk",
"account_id": 1,
"dimension_quantity": "[{\"Dimension\":\"dim1GbpsAPI\",\"Quantity\":222}]",
"metering_type": "individual",
"product_code": "2co1o9ighmy1kkuivy6rnbtga",
"customer_aws_account_id": "607485323993"
}
]
But when I use below data for metering creation I noticed that customer_aws_account_id isn't saved db even other fields are saved in db
[
{
"account_id": "1",
"org_id": "2",
"metering_type": "individual",
"agreement_id": "agmt-8dxk1wcle1p3bznb3krotuqkk",
"customer_aws_account_id": "607485323993",
"product_code": "2co1o9ighmy1kkuivy6rnbtga",
"dimension_quantity": "[{\"Dimension\":\"dim1GbpsAPI\",\"Quantity\":11}]",
"details": {
"is_schedule": true,
"is_recurring": false,
"schedule_at": "2025-04-16T15:00:00-03:00"
}
}
]
I'm using FastAPI with SQLAlchemy
You aren’t capturing the return value of “create_metering”; what is that function doing exactly?
@topaz widget I've already solved problem. Thanks for your help.
Hi all, can someone please recommend a good python library which allows me to test my postgresql queries made in my python module? I had a look at few and they looked like they required a postgresql server to be running in the background. I don't want to keep a postgresql server running to test, rather I would like the library to spin off the server during testing and direct any psql qeuries made by my module to this test server
Appreciate any help 🙂
Hmm, that's a lot easier to do with something like SQLite than it is with PG, just because the PG server infrastructure is fairly large.. Would you be OK with a solution that used Docker?
The github page for the python version has sqlalchemy/postgres as its example in the README
(Huh, this even supports NATS. Neat.)
Thanks for the quick response
I'll have a look
testcontainers is a overly futuristic choice yes
i prefer to keep stuff old school though, just having docker-compose up with ports binded to my host
and then launching the tests then
docker-compose defines stuff in yaml, which postgres, quickly on up to raise, and on docker-compose down --volumes complete destruction (if not used volume mapping to host directly)
testcontainers in comparison allow a more flexible control through, u could be cleaning postgresql between your modules for examples, having it reraised more than once for testing session
nobody forbids running drop db between tests with docker-composed postgres too though
i am hesitant to use testcontainers just for the sake of not putting more libs to my apps -_- than less libs is used then better
testcontainers would be nicer if you have multiple dependencies to worry about, vs. having to deal with every combination you might want in docker compose. If you just have a single big daemon to care about, I agree docker compose might be less trouble.
if I have multiple verions of postgres to take care about, i will just work locally with a single postgres version
And will use Github Actions Matrix mode to run tests so that it would run for every postgres on its own in CI for every my commit
Yeah that's pragmatic
hey again ppl. So for context, I've got json data, and im working with python and sqlite. The goal is to practice some ETL. Im trying to understand proper procedure for loading data into a database. My question is, do people typically load an entire CSV into a database all at once, or are things inserted one record at a time?
I've transformed the raw json data I'm working with into a proper table. But I'm wondering if I should loop through it and INSERT one at a time, or is it better to just load it entirely via Sqlite3's import. I've heard of batch vs streaming.
is it better to use docker or to install my own server for any dbms
I see, but then I can just run the Postgres server locally then 😅, or does docker have some advantages?(sorry, asking because idk docker).
The python Postgres test libraries seem to take care of creating and tearing down databases anyways, so it seems locally running Postgres server should be preferred than docker right?
No, docker is still preferable. Because easy to run specific postgres version, easy changing, easy complete removal and running again, easy runs out of the box ready for usage
Docker very strongly simplifies all that
Baremetal postgres installation is a big pain in comparison
Ahh I see
Okay will read up on docker more
I have a question, maybe I’ll figure it out while reading, but docker would be heavy to run? I’ve 16gb ram laptop. If it is, then it seems testing setup is heavy, but it also seems inevitable. Have to figure out how to run docker on GitHub actions for my testing also
I have 16gb desktop, Docker Engine, the true real version of docker is lightweight for me and it is main point of docker.
It reuses my OS kernel for running apps, and just creates illusion of filesystem and of different os for them
So the weight of running postgres in docker is equal to regular postgres for 99.9%
But only if u a using Docker Engine, real docker runs without VMs
Docker Desktop is entirely another story. Windows/macOS abomination made to support those oses with GUI in proprietary way. It comes with VM and heavy
Possible should be to use even in wsl2 docker engine normally
I just work from Linux and normal for me running dozens containers
At some point went crazy and had running even more than half hundred at least
Ohh that’s amazing