#databases
1 messages ยท Page 34 of 1
im sorry i dont know these things
please don't ping random people.
Can u help
no
Why
What do you think about an ORM dedicated only to creating tables, validating and displaying those tables via GUI? I want to create such a tool for fun, because I think ORM is mainly helpful for table management and validation, but imo creating queries in raw SQL is easy enough that I don't need this functionality.
Two main feedback from me would be:
- To manage my schemas I use tools like liquibase or flyway
- I do like to use ORMs for everything else
I'm trying to learn AWS features in prep for an interview, but I don't know how to create a database without costing me. It says it's in free tier, but both options for configuring Dynamo and RDS are saying they'll cost me. Any help?
i could add to answer of @keen minnow that i saw quite elegantly made and feature rich schema migrating tool Atlas https://atlasgo.io/
That could be used as inpiration for making own tool. Luquibase/Flyway looked way more messy in comparison
but imo creating queries in raw SQL is easy enough that I don't need this functionality.
i think it is still important feature for static typing safety. At least lightweight abstraction helping with SQL query building as SQLAlchemy provides or Golang-bun provides is very nice to have
i don't use any ORMs only if i develop tool... which as basis i build around smth which has docs providing me with Raw SQL only
in this case i am lazy to translate to ORM, and can go raw, since usually this tool is small enough anyway and i would not benefit from using ORM
otherwise if it is any even small tool that has a lot of CRUD stuff... then ORM is obligatory for me to simplify things and having type safety
TLDR: The essential problem is with Raw SQL that it is unsafe.
but at the same time it could be nice having tool like Atlas that defines tables, helps with migrations
Hello,
I'm going to start a project with FastApi, and it would also be my first Python project, so I would like some advice about the stack.
I was going to use SQL Alchemy because it was what I had seen a lot, but then I read about Prisma Python Client, and I've already used Prisma but in Node and it's great, how is it with Python?
Atlas kind of sucks in having almost everything behind paywall
would be nice having similar tool as FOSS
i would make an educated guess it will suck completely unless you are node.js dev
at least that's how my Prisma experience went i tried to check its usage for golang client
better use Python native solutions like SQLAlchemy or Django ORM
they will provide with far better experience, because both natively intergrate with Python
and provide even full static typing capabilities (in case of SQLAlchemy ORM)
Prima will double suck, because its models aren't even described in Python, but in its own custom language/files
Thank you very much, this is the kind of information I was looking for ๐ซฐ๐ป
basically if u use SQLAlchemy (with mypy integration optionally for best experience)
you will have Python native acccessed data structs, so it will for you like u use Typescript stuff in full capabilities
Mypy integration will especially make extra things top notch
and as far as i see Python Prisma Client has not really good level of... type safety or providing intellisence support in general
so it will be... full of runtime surprises thing
that is at least what i see from short Prima offered doc examples, and is suspect there is nothing more to it ๐ค
Although Prisma does have Pyright integration
may be it is not that bad
post = await db.post.create(
{
'title': 'Hello from prisma!',
'desc': 'Prisma is a database toolkit and makes databases easy.',
'published': True,
}
)
Still
Prisma example has DICTIONARY as input
that's nightmare
i can't seem them being capable to catch it with Pyright
I'd rather go the safe route then since this api will move money hehe
never heard python devs also using Prisma in general too
so your python devs will be way more surprised too if u go with it
SQLAlchemy has nice module called Alembic to manage migrations and even can auto generate them based on introspection of a database for you
Should be quite good
thank you for taking the time to reply dude
hello does anyone know to link sqlserver with django python
padhe likhe gawar ๐
not python but ik how to link with xammp
!pip mssql-django was suggested to you. Did you try it?
ok so i'm in a debate with this guy at work for large mysql tables. ๐ฆ
to either partition large table or create explicit new archive table
I try to keep history separate from 'current' data, rather than using a single table for both.
One big reason is recovery: I don't want to have to restore massive tables to get my app back online.
So you would rather write unions or seperate queries.
No, why would I need to union? If I have some operational data that I also need a history for, I keep it separate. So I only need to query one side
I'm eating the cost of two writes for the benefit of query time
We have an app that would have to query the current and archive .
We're just talking abstractly right now, could you explain more?
we have a large table. and we want to either put records < 2020 in a different table or create a partition on the table by the date field with <2020
We have a use case that would have to scan all partitions or union all tables.
a query by only id and not a date
Nothing wrong with just throwing a partition on the year of a history table. Makes it easy to drop a year, or recover specific years.
I've done both tho, but problem with a year specific table is needing another one in a few years,
And so on
Lately, I've just been putting history in hive partitioned parquet files, but that's a different story
Yes. It's alot easier to throw a partion than to create new tables every year.
Models don't have to be updated on all the applications.
MySQL also isn't so great at parallelizing queries, so I don't love it for large historical analytical queries
This company has been using mysql since the 90's
A lot of my MySQL work was working around its scale limitations, a lot of denormalization and 'manual' parallelization
Oh, I ran a SaaS backend on MySQL, sharded the heck out of it. We got it done, just needed a lot of love
Thanks for help.
yess sir, I'm thinking about simple table-maker tool with Atlas support. At this stage of development it's dataclass with SQL-style syntax, example:
class Color(BaseTable):
name: VARCHAR(max=50)
rgb: VARCHAR(max=6)
is this stupid
i e slow
class DuckDBLogger:
def __init__(self):
self.con = duckdb.connect()
self.con.sql("CREATE TABLE metrics (name VARCHAR PRIMARY KEY)")
self.con.sql("CREATE TABLE steps (step INTEGER PRIMARY KEY)")
self.con.sql("CREATE TABLE logs (metric VARCHAR REFERENCES metrics(name), step INTEGER REFERENCES steps(step), value FLOAT)")
def log(self, step, metric, value):
self.con.sql(f"INSERT OR IGNORE INTO metrics VALUES ('{metric}')")
self.con.sql(f"INSERT OR IGNORE INTO steps VALUES ({step})")
self.con.sql(f"INSERT INTO logs VALUES ('{metric}', {step}, {value})")
i mean yes it is slow but is it slow because it is badly written or because databases won't help with that
its actually 5688.87785007 times slower than a nested python dictionary
Yes, don't use duckdb to write logs.
It's the wrong kind of database. Use almost any other database.
Preferably, just stream to a .csv file.
Also, never write SQL queries with f-strings like that:
!sql
Don't use f-strings (f"") or other forms of "string interpolation" (%, +, .format) to inject data into a SQL query. It is an endless source of bugs and syntax errors. Additionally, in user-facing applications, it presents a major security risk via SQL injection.
Your database library should support "query parameters". A query parameter is a placeholder that you put in the SQL query. When the query is executed, you provide data to the database library, and the library inserts the data into the query for you, safely.
For example, the sqlite3 package supports using ? as a placeholder:
query = "SELECT * FROM stocks WHERE symbol = ?;"
params = ("RHAT",)
db.execute(query, params)
Note: Different database libraries support different placeholder styles, e.g. %s and $1. Consult your library's documentation for details.
See Also
- Python sqlite3 docs - How to use placeholders to bind values in SQL queries
- PEP-249 - A specification of how database libraries in Python should work
Would it be bad practice to have my database open to all IP addresses when setting up a azure Postgres database?
yes
table.sql is like this:
CREATE TABLE IF NOT EXISTS BLACKLIST(
user_id BIGINT NOT NULL PRIMARY KEY,
reason TEXT,
);
vs something like this:
https://paste.pythondiscord.com/I2QA
I am trying to convert my old schema.sql to the original table.sql
I could probaly find the old things
but it would be way easier to conver it from the schema.sql to table.sql
and I'll send the schema.sql if someone replies with ping only
(so I know you responded)
sqlite is still 16.4110429335 times slower than python nested dictionary for logging
I would need to know all metrics beforehand which is a little bit annoying
next up I try redis
nooooo they dont have a redis tutorial on w3s how am I going to be able to understand it
"Redis is not officially supported on Windows"
we will see about that
it was actually true๐
You can't compare sqlite with a dictionary and redis they all offer different functionality
I am comparing their performance for my logging
How would dictionary be useful for logging though? ๐ค
this is what im using
{
"metric1":{
1: 100,
4: 200,
7: 300,
},
"metric2":{
2: 15,
3: 25,
},
}
Well, would they be persistent anywhere?
How would you fetch them?
What operations would you perform on these logs? E.g. finding averages, means, etc
Writing logs/metrics isn't that useful unless you can do something with them later
I just do it with python, convert it to numpy and do stuff return np.array(list(self.logs[key].values()), copy = False)
fetching is not an issue, its instant, only writing
If your application crashes or just exists you lose all of your data though
oh yeah I save logs into numpy savez compressed
Also memory could be a concern depending on how long your application runs for and how many metrics you collect
true as well
but I am trying to find an alternative and everything I tried is much slower
That's why using something like sql is just easier than to deal with numpy, etc
its actually very easy with numpy, because it savez_compressed basically saves a dictionary, ease of use is not an issue, its mainly writing performance
I mean, you'd have to know how that was packaged, etc, and it would only be accessible with python
sql is much easier to use
Also regarding write performance - you didn't share how you actually benchmarked it
also I'm not completely sure if you want metrics or logs
thats sort of true about sql, how its packaged depends on database and databases can have different column names
this is my benchmarking code
dlogger = DictLogger()
from glio.python_tools import perf_counter_context
with perf_counter_context():
for i in range(1_000_000):
dlogger.log(i, "loss", 1.)
dlogger.log(i, "accuracy", 1.)
dlogger.log(i, "dice", 1.)
dlogger.log(i, "iou", 1.)
if i % 2 == 0: dlogger.log(i, "loss2", 1.)
if i % 2 == 0: dlogger.log(i, "acc2", 1.)
if i % 4 == 0: dlogger.log(i, "dice4", 1.)
if i % 4 == 0: dlogger.log(i, "iou4", 1.)
if i % 10 == 0: dlogger.log(i, "loss10", 1.)
if i % 10 == 0: dlogger.log(i, "dice10", 1.)
well it logs metrics
dictionary is 2.6316637999843806 perf_counter seconds
sqlite is 44.37015440012328 perf_counter seconds
and duckdb I had to reduce number of iterations to 1000 and it still took 11.377755700144917 perf_counter seconds
Imo there's a difference between metrics and logs, logs usually are just strings, when metrics could be something more structured ๐ค Honestly I don't have a good explanation for that
For logs you can just use logging
it logs metrics, I just put "1." as the value everywhere for performance testing
but 1 is replaced by the value of the metric
dlogger.log(i, "accuracy", 1.) means at step i metric accuracy was 1
Seeing your duckdb code - I don't think you need two of those tables at all? ๐ค
Also you're not batching your operations in any way
Why insert records one after the other in different sql operations when you can insert like 10 thousands of them at once
is it possible to have 1 table if some metrics are not logged at some steps? Like some metrics are logged only once every 128 stepzs
In your code you just insert the same values into metrics and steps tables, so they're kind of redundant, no?
probably
Can you share your dict code too?
class DictLogger:
def __init__(self):
self.logs = {}
def log(self, step, metric, value):
if metric not in self.logs: self.logs[metric] = {step: value}
else: self.logs[metric][step] = value
Miscellaneous Data Types
Transact-SQL supports several data types that do not belong to any of the data type groups described previously:
- Binary data types
- BIT
- Large object data types
- CURSOR (discussed in Chapter 8)
- UNIQUEIDENTIFIER
- SQL_VARIANT
- TABLE (discussed in Chapters 5 and 8)
- XML (discussed in the previous edition of this book)
- Spatial (e.g., GEOGRAPHY and GEOMETRY) data types (discussed in Chapter 30)
- HIERARCHYID
- TIMESTAMP
- User-defined data types (discussed in Chapter 5)
Why is TIMESTAMP included in miscellaneous? I thought it would be a temporal data type
is this correct in sqlite "INSERT INTO logs (step, ?) VALUES (?, ?)"? It says OperationalError: near "?": syntax error, even though I passed it a length 3 list
before I had "?" not being replaced because of quotes
For 100k records with sqlite it takes me 15 seconds, and with dict logger 0.05, did youconsider how much time does it take for numpy to do the whole processing too? I don't think it's that slow though
Also you can always send logs in the background thread or make use of asyncio
I didn't benchmark numpy processing because it seems instant
I didn't write raw sql in a while
(I was using an orm for this example, so this may be the reason for it being slow ๐ )
it was because one of the substituted strings was empty
still weird, why would this not work self.cur.execute("ALTER TABLE logs ADD COLUMN ? FLOAT", [metric, ]), it says OperationalError: near "?": syntax error, even though metric = "a", why would it not subsititute
this works fine self.cur.execute(f"ALTER TABLE logs ADD COLUMN {metric} FLOAT")
man substituting is so weird in sqlite
Just tested, and simply inserting all records into a sqlite db shouldn't take that long
import time
from collections import defaultdict
from sqlalchemy import create_engine, insert
from sqlalchemy.orm import Mapped, mapped_column, \
sessionmaker, DeclarativeBase
class Base(DeclarativeBase):
pass
class Metric(Base):
__tablename__ = "metric"
id: Mapped[int] = mapped_column(primary_key=True)
iteration: Mapped[int]
name: Mapped[str]
value: Mapped[float]
engine = create_engine("sqlite:///db.sqlite3")
session_factory = sessionmaker(bind=engine)
class DictLogger:
def __init__(self):
self.logs = defaultdict(dict)
def log(self, step, metric, value):
self.logs[metric][step] = value
def main() -> None:
Base.metadata.create_all(bind=engine)
t = time.perf_counter()
with engine.begin() as connection:
logger = DictLogger()
for i in range(100_000):
logger.log(i, "loss", 1.)
logger.log(i, "accuracy", 1.)
logger.log(i, "dice", 1.)
logger.log(i, "iou", 1.)
if i % 2 == 0: logger.log(i, "loss2", 1.)
if i % 2 == 0: logger.log(i, "acc2", 1.)
if i % 4 == 0: logger.log(i, "dice4", 1.)
if i % 4 == 0: logger.log(i, "iou4", 1.)
if i % 10 == 0: logger.log(i, "loss10", 1.)
if i % 10 == 0: logger.log(i, "dice10", 1.)
values = []
for metric, obj in logger.logs.items():
for step, value in obj.items():
values.append({"name": metric, "iteration": step, "value": value})
connection.execute(insert(Metric), values)
print(time.perf_counter() - t)
if __name__ == "__main__":
main()
You can swith to just sqlite here, but I'm just used to sqlalchemy
It still takes around 10 times longer, which is to be expected, but you can parallelize your operations and writes
I just tried 1 table batched yeah it takes 10 times longer
I need to try asyncio now
asyncio would be beneficial if you just do that in the background
You can't parameterize colunn or table names. Just values.
Oh, do I need to somehow avoid f strings there though?
You can't avoid them for dynamic columns, but you'd need to sanitize the inputs to avoid an injection attack.
And make sure to put your variables in brackets for your placeholders e.g. (var1, ) because for some reason it wont recognise your variable as one variable but as several different variables
Someone here, much smarter than me, will explain it better
I am currently using MongoDB for saving data and PyMongo in Python.
Here is what my data look like:
{
"_id": 123,
"weapons": [
{ "name": "Sword", "durability": 100 },
{ "name": "Shield", "durability": 50 }
]
}
{
"_id": 456,
"weapons": [
{ "name": "Sword", "durability": 100 },
{ "name": "Shield", "durability": 0 }
]
}
{
"_id": 789,
"weapons": [
{ "name": "Sword", "durability": 50 },
{ "name": "Hammer", "durability": 50 }
]
}
how can I filter out a list of "_id" that does not have a "name": "Hammer" in its "weapons" ( i.e. [ 123, 456 ] )?
(Var1,) is a tuple containing one element. (Var1) is just var1
The trailing comma makes a difference
Are each of these _id in a list?
yes
or i think it can be a cursor using find() because i would like to add a hammer to each of those who dont have it
then loop over each dictionary ids and loop over each weapons name to find the Hammer. If its not Hammer, add it to a new list
oh you wanna add a hammer to ids that don't have it
had this idea but would it be slow if there are many entries? im not sure about it ๐
yep lol
yeah then I guess do that double loop I mentioned and then append hammer to it after
Yeah it would. Easier if its in SQL.
oof alright thanks lemme have a try
!e
data_list = [
{"_id": 123, "weapons": [{"name": "Sword", "durability": 100}, {"name": "Shield", "durability": 50}]},
{"_id": 456, "weapons": [{"name": "Sword", "durability": 100}, {"name": "Shield", "durability": 0}]},
{"_id": 789, "weapons": [{"name": "Sword", "durability": 50}, {"name": "Hammer", "durability": 50}]}
]
def check_for_weapon(weapons_list, weapon_name):
for weapon in weapons_list:
if weapon["name"].upper() == weapon_name.upper():
return True
return False
weapon_all_should_have = "Hammer"
for row in data_list:
if not check_for_weapon(row["weapons"], weapon_all_should_have):
print(f"{row['_id']} needs a {weapon_all_should_have}")
:white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | 123 needs a Hammer
002 | 456 needs a Hammer
๐ฎ thanks
so it still needs a double loop
@whole mica
db.collection.find({
"weapons.name": {
"$nin": [
"Hammer"
]
}
})
Mongo playground: a simple sandbox to test and share MongoDB queries online
:O thanks a lot!!! by the way is there a way to search for the value instead of the key, like searching for those with a durability of 50 (i.e. [123, 789])? ๐ค
I think you can add an and condition, too there
No idea how to do it ๐
I don't use mongodb at all, mostly just sql
To only get ids you can add a projection:
db.collection.find({
"weapons.name": {
"$nin": [
"Hammer"
]
}
},
{
"_id": 1
},
)
You want a specific weapon with a durability of 50 or just any weapon?
e.g. sword with durability 50
any weapon
Just
db.collection.find({
"weapons.durability": 50
},
)
?
It's kind of the same
Seems like in firs query you can do the same instead of using nin operator
Ah, no, nvm ๐
wait :O it works!!!
i thought like weapons.name and weapons.durability wont work if weapons is a list of objects, but came out i was wrong
If you want a specific weapon youl could use $elemMatch:
db.collection.find({
"weapons": {
"$elemMatch": {
"name": "Sword",
"durability": 50
}
}
},
)
And you should be able to use comparison operators too instead of just 50 if you need that
thanks a lot!! ๐
Hi! Can someone tell me if it makes sense in context of django/psql to create a generic model class with fields like created_at, deleted_at and perhaps some helper classmethods like create/delete for other actual models to inherit from?
A mixin class would probably do, but you don't need a create/delete method here ๐
Django's managers already have create/delete methods, and created_at is supported by django itself
For deleted objects you can just mark them as deleted instead of calling delete
Hm, thats wierd. I created a model, made a migration that added the model itself, than made a data migration to populate the table of the said model and I don't have a created_at column ๐
the only column that was actually autogenerated was id PKey
Ah, I see
The column you have to add yourself, but there's a parameter for DateTimeField ๐ค
I am starting to work on a timescaledb project(db built on top of postgres for time series). I am working on my wsl2. I just pulled the official image. I want to know if I am not using python, how do I work on package management and dependency issues? My PC is trash so I do not want to trash it further by keep installing on my local, and I do not want to use virtual machine.
Which "package management and dependency issues" exactly?
It depends heavily on which manager you are using
If a venv isn't good enough, then sounds like Docker is what you need
Someone told me to use vscode devcontainers, I want to develop, scale and deploy the project, I dont understand why I should run as entire os on docker just to code.
Why should you run a container? Precisely because you said you're having trouble managing dependencies and that you don't want to install things locally. What other solution could there be?
I'm assuming you've already looked at venv. If that's good enough, then that's what you should use. Maybe layer Poetry on top of that if you really want. But to me Devcontainers is just simple and easy.
BTW, this has absolutely nothing to do with this channel, please move it over to #tools-and-devops if you have more questions
Thanks
how can I use python to create an edit a txt file, for example I can put save data. however the open() wont let me do it is there a way to do it without root access
open can write to a text file. What are you unable to do?
welll
Looks like permissions for the location you are saving the file. I haven't run into that issue.
hmmm its in a foulder that i created
Maybe it is the permission level of the python interpreter
Is the file being saved in the same location as the .py file?
yea
Hello I am using flask and flask sqlachemy. To create my db I am using flask migrate I am wondering if there are better and more advance ways like postgresSQL. I am going to research postgresSQL more but will it work with flask and create the db with flask sqlachemy ? Also is it easy to setup in flask.
postgresql is supported by sqlalchemy
but your migrations created for Sqlite3 highly likely will not work for it ๐
so you would have to "regenerate" migrations
If I am understanding correctly that okay I haven't launched a site so it is nothing important
Thanks
Do you think postresql is the best tool to use
that depends on a project and somewhat even programming language.
for web related pet projects which have user database, Sqlite3 is kind of more lazy option to go for as default
if you utilize a good programming language and project is simple, depending on its data... less conventional dbs like redis or mongodb can fit too, or even just storing in memory of a program (+using some of its native storing in file) and that's enough too. some projects run on stuff like etcd and etc, it is very app specific.
if you develop backend api for production for work (once again we have user database we wish to persist), then postgres is good default.
TLDR: it depends on project requirements.
Postgresql is nice in allowing to bruteforce its usability for everything
but... for simple enough pet projects it is too much, if u are lazy to maintain it, sqlite3 is better then
Thanks it is just a pet project which I hope grows but might not , I am just an amateur
Alembic migrations generated for sqlite generally should work with postgres
But postgres ones if you postgres specific data types won't
What should I do to make this query go fast?
SELECT * FROM words
WHERE
(book_number, chapter_number, verse_number) IN (SELECT book_number, chapter_number, verse_number FROM words WHERE word LIKE 'jeesus%')
AND
(book_number, chapter_number, verse_number) IN (SELECT book_number, chapter_number, verse_number FROM words WHERE word LIKE 'elรครค%');
Db is sqlite, and it takes over 200ms to respond from backend. I mean to run it on every key-stroke, though, but at least once second (or two ๐ ). ChatGPT suggested some bizarre join-gadget which was meant to eliminate subqueries, but it just worsened it. :O.
use full text search instead of just matching regex LIKE on a text column
i don't use regex. But I would need to transform the database
How many rows do you have in that table?
bible words
How many rows does that amounts to in your db?
so you expect to perform around a million string comparisons for each keystroke?
idk if there is some simpler way to create a text index, but I am pretty sure you will have to transform the database
This could be faster: deleted doesn't work like I thought.
Mmm... i really like the idea of of having flexible search i get with this. One can write words, and the wording need not be exact
also, side note: storing each word as a separate row sounds like a pretty bad idea in first place but idk the best/right way to do it besides "use FTS"
the thing I linked earlier
ah
yes
i mean word counts if you start it correctly. It has to be 3 or more chars (or it won't generate sql for that). Words can be out of order too
๐ฎ So i prefer not to have full text search, because this is more flexible
maybe i just don't send every keystroke, but instead set a timer and after a time make query
maybe im just imagining things and i should change the db ;D
at least consider de-duplicating if you haven't yet I guess
(just making sure)
i haven't :< I just edited something and plรถrรคytin it from pandas to db ;D
i believe you can do that with FTS too using the trigram tokenizer and OR boolean operators, although i'd need a benchmark to say if it's a faster approach
also, could this query be simplified to SELECT * FROM words WHERE word LIKE 'jeesus%' OR word LIKE 'elรครค%', or am i misunderstanding it?
edit: oh oops, i misread the AND operator in your query
i forgot --release ๐ฎ It's still slow, but 3x faster still than before. Thanks for fts links ๐ I just kind of assumed it wasn't suitable and thus made the db one word per line. Why is it always I only learn after the mistakes. ๐
Honestly I don't undrestand this query ๐ค You're trying to find a word that contains search terms?
Hm, you could check how long it takes to just do a select with specific book_number, chapter_number and verse_number?
Maybe filtering is just slow
It searches bible verses that contain both search terms
Hello All, I have a challenge in implementing a scenario. I have 2 databases. first is updating with data for every 10 min interval and the second database is updating 3 times a day. I have to compare data and similarities between these 2 databases. which tools fit for this scenario and how to do it
:warning: Your 3.12 eval job has completed with return code 0.
[No output]
If these don't absolutely need to be two separate databases, they probably shouldn't be. Why are they separate?
If you really need to sync two unrelated databases with two different schemas there are discussions online about the possibilities like: https://softwareengineering.stackexchange.com/questions/288370/best-way-to-synchronize-data-between-two-different-databases
Hey everyone! I got a quick question about the json module.
def populateLocations(apps, schema_editor):
file = open('./data/geo_list_1.json')
locations_data = json.load(file)
for location_data in locations_data:
location = location_data
Location.populate(**location_data)
file.close()
Do I need to .close() the file after the loop, as shown above, or can I .close() the file right after it loads?
Also are there any constraints keeping the file "open" for a while?
You can close it after the load but maybe use the with statement:
def populateLocations(apps, schema_editor):
with open('./data/geo_list_1.json') as file:
locations_data = json.load(file)
for location_data in locations_data:
Location.populate(**location_data)
writes full stack app in lua wishing I could use python
Hey guys I understood the basic concept of CLUSTERED and NONCLUSTERED in MS SQL but I am unsure when's the right time to use either of the two. And I'm wondering if there are any disadvantages between them?
Clustered indices?
yep
In Transact-SQL, the names of all data types and system functions, such as CHARACTER and
INTEGER, are not reserved keywords. Therefore, they can be used to denote objects. (Do not
use data types and system functions as object names! Such use makes Transact-SQL statements
difficult to read and understand.)
They say it isn't a reserved keyword but why is it highlighted in my case? (I'm using SQL Server)
mongoDb is the best ngl
Q&A discussion discussing the merits of No SQL and relational databases.
old but gold
it's like how you can use list in python as a variable name, but you really shouldn't
Hey guys, I'm trying to use MS SQL but there are so many variations? I don't know which one to choose.
Hello there
I hope you are all doing well.
I am new and I need help, can someone please help me? I have a problem with relationships in my sqlalchemy database.
Who is managing your database? Ask them how you're supposed to connect, not us. But my guess would be TCP/IP
It's warning you that whatever is at line 31 of your file is depricated. It should still work for now but you should put that error into a web search to find the relevant documentation, and then look at your code
i'm managing the database
So unless you set up one of the other options, TCP/IP
Thank you very much, the problem is already solved.
I could use a pointer, if someone knows, on how to properly handle excluding zero dates (0000-00-00) and NULL dates in MariaDB using SQLAlchemy.
def _build_search_query(self, table, fields, term, like):
term_string = str(term)
search_conditions = []
query = smc_database.sql.select(table)
for field_name in fields:
term = term_string
field_attribute = getattr(table, field_name, None)
if field_attribute:
field_type = self._get_field_type(table, field_name).lower()
if "int" in field_type or "float" in field_type or "decimal" in field_type:
try:
term = float(term) if "." in term else int(term)
search_conditions.append(field_attribute == term)
except ValueError:
continue
elif "date" in field_type or "time" in field_type:
term = self._parse_date(term)
if term:
search_conditions.append(
smc_database.sql.and_(
field_attribute == term,
field_attribute != None,
field_attribute != "0000-00-00"
)
)
else:
if like:
search_conditions.append(field_attribute.like(f"%{term}%"))
else:
search_conditions.append(field_attribute == term)
if search_conditions:
query = query.where(smc_database.sql.or_(*search_conditions))#.limit(500)
print(f"\n{query}\n")
return query
The part in question is elif date. If I understand sql correctly, I think the generated sql query is correct. (Excerpt)
WHERE `SMC_Billing`.`Customer_No` = ? OR `SMC_Billing`.`Invoice_Date` = ? AND `SMC_Billing`.`Invoice_Date` IS NOT NULL AND `SMC_Billing`.`Invoice_Date` != ? OR `SMC_Billing`.`Item_No` = ?
2024-08-19 12:11:41,935 INFO sqlalchemy.engine.Engine [generated in 0.00169s] ('2016-09-23', datetime.date(2016, 9, 23), '0000-00-00', '2016-09-23')
Why excatly this function is so complex? ๐ Do you want to exclude these dates only on specific query or on all of them?
I might break it up a bit once it's all working. This is a general search program where the user can type a string, number, date, or date object, and it has the ability to search every column for something like it. It's complex because it's attempting to match data type as best as possible.
Funny enough, I think it was working the whole time, but I didn't realize it. Pointers in general would still be great though.
I may have some work to do on the complexity ๐
Are you using this function for multiple models or just one? ๐ค
It's for any model.
The idea is that I can set it up so if I need to give a department access to edit fields of our core database, I can just make a little program that imports the module that includes this. Then I call with the necessary fields, which fields have which permissions, and then display a search followed by an output display.
For very simple views and edits, I should add. Anything more complicated deserves more program, but can still use this module for whatever would be common. It's part of a set of core modules I'm writing for the company.
The or and and there makes me wonder if you got the order of operations right
So, do you only need to exclude fields that are of type date/dateime and have value of 0000?
Maybe like this:
from datetime import date
from sqlalchemy import inspect, select
from app.db.models.manga import Manga
mapper = inspect(Manga)
stmt = select(Manga)
for column in mapper.columns:
try:
python_type = column.type.python_type
except NotImplementedError:
continue
if issubclass(python_type, date):
stmt = stmt.where(column != "0000-00-00")
print(stmt)
change the model for your own here
But honestly I'd work on fixing your db to exclude these zero dates ๐
Honestly I'd make a separate set of searchable columns for each model which you'd apply your search term to
class User(UserMixin, db.Model):
# ...
following: so.WriteOnlyMapped['User'] = so.relationship(
secondary=followers, primaryjoin=(followers.c.follower_id == id),
secondaryjoin=(followers.c.followed_id == id),
back_populates='followers')
I don't get the logic of the secondary join, it makes sense that we're joining the user table with the assoiciation table where the follower_id == id. But I'm not sure what the secondaryjoin does
It's dynamic. I can't account for all of the models and columns this program will encounter.
If it works, that's the beauty of it.
And the database is full of garbage data from poor imports when the tables were created, so I need to account for that, too. Including the zero dates.
First, how about using is distinct from operator? https://docs.sqlalchemy.org/en/20/core/sqlelement.html#sqlalchemy.sql.expression.ColumnOperators.is_distinct_from (I'm not a sqla user, I'm just thinking about how I'd handle in sql), second, why are the 0000's an issue? Do you want to treat them as unknown or what?
In case you're using a secondary table secondaryjoin establishes how you join from your secondary table to a child model, and primaryjoin how you join from this model to a secondary table
In most cases you don't need to use primary or secondary join attributes as sqlalchemy can figure out how a typical relationship should work
I'm testing some python automation for some reports I am tasked with. I have some excel sheets that connect to our MS-SQL databases, but I am not too sure about how to connect to them with my python environment. It's health data, so the security can be weird. Does anybody know how I can generate a sqlalchemy connection strings? I have the connection string that works in excel but I don't exactly understand how to transfer that to something python can understand
What would I use IS DISTINCT FROM to accomplish? Would I need to know the fields ahead of time?
I was getting some weird results from "0000-00-00" when I was searching for things, so I decided to exclude them entirely. I don't think anyone would be searching for it specifically.
I wish I could just clear them out, but there's so many garbaged up tables with tens of thousands of rows that I can't possibly go through and fix them.
Is distinct from would simplify your null checks.
๐ค I might have to take a look at it. I thought it just checked to make sure you didn't return results with repeated information. I might have misunderstood the reading.
The way I'd do this in sql is: id write a cte to filter out irrelevant records, then query against that, like: with q1 as (select * from mytable where date is not null and date != 0000....) select * from q1 where conditions...
Could someone link me a database project theyve done that can still be understood by a beginner (as in the code is understandable to someone like me, even if parts of the code is above my level)?
Also itd help if the code itself has a lot of comments so I can understand what each part functions as? Thank you
I finally found why I needed the protections against the zero date and null entries. If I am searching for something that isn't a date, when it gets to the date field, I get a "match" with those, even though they don't match.
So I added a few checks to block the search and block against those returns as a sort of multi level protection against false positives.
So for example, If I search for "WSB-BX-LGCOVER#2" with the protections, I get one result. Without the protections, I get a segmentation fault (I think). There's so many results that the program just kind of... gives up. It exits with no error to catch. I had to create and pull a minidump to find that error.
Maybe this is an issue with nulls, but I don't think zero dates quite matter here 
You can just add a limit to see the small subset of what you get back from db
Oh yes. I did that to confirm the issue.
To make sure the results I would get back were actually bad if it weren't for the error. They were.
On a related note, is it better to query a date field with a date object? Or the string of said date object? I'm leaning towards object since it would be a closer match in my mind.
If you're talking about using it in a condition in where it shouldn't quite matter
But date would be preferred
I thought so. Thanks ๐
^ Btw you could check if python type is a date and add a condition based on that
I use this to parse a given date from one of two formats.
def _parse_date(self, date):
try:
if "/" in date:
return datetime.datetime.strptime(date, "%m/%d/%Y").date()
elif "-" in date:
return datetime.datetime.strptime(date, "%Y-%m-%d").date()
else:
return None
except ValueError:
return date
And you can use stmt = stmt.where(...) instead of collection all the where clauses in a list
Horrific ๐ฌ
Can't you handle that with orm?
lol why?
are you getting these values from db?
The date to search? Or the date returned?
The date you're parsing
Or is that what you're getting from your client?
The date being parsed is from the user.
So yeah, client.
And the source of part of my headache with item "numbers" like given before ๐
It's not that bad then ๐ I would be scared If I had dates in my db as strings and in different formats
Nono. That at least is one of the correct things about this database. Dates are in date type.
I'd love to. I don't have that control. The joke is that some of the data in this database is older than I am.
Imported from other sources of course.
I'm curious though, why ISO?
Please ping on reply.
I have a few questions about sqlachemy + flask sqlalchemy could someone please answer?
In my database/models I have the RouteToken table and the a column called token.
Can someone think of a better name for RouteToken and the column token.
For example RouteToken and the token column is used when when I have a route like
@email_password_reset.route('/verify_email_token/<username_db>/<token_db>', methods = ['GET', 'POST'])
The reason I want a different name is because RouteToken and token are so similar. Or do you think my reasoning is incorrect.
Another question I have is in RouteToken I want to add a column called ย ย attempts_token_tried: so.Mapped[int] = so.mapped_column(sa.Integer, default=0). It would reset to zero if someone tried to create more then 5 tokens. Should I create a 1 to many relationship or just have a 1 to 1 and delete the token each time and keep track with attempts_token_tried?
Also how is the 2nd example in the 1 to many link different then the 1st example in 1 to 1 link?
https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#one-to-many
https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#one-to-one
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.
@coral wasp figured we should move to a dedicated channel. I havenโt played with the idea much yet since it was just meantioned yesterday, but the software would probably have to save locally and preform a scheduled online backup rather than saving in both locations every time a record is created.
This can get complicated quickly, just warning; having a central database with local onsite replicas raises a lot of issues. The other approach is many onsite databases that 'push' to a central database. A third approach is a dual db, where transactions are applied across two db's.
https://github.com/YourLocalPurveyorOfExplsoveGoods/MyDatabaseProjects if anyone doesnt mind, could someone here give as detailed as feedback as you can on my database projects please? Just as a preface, they're all quite basic projects but I want to start advancing my skills. Thank you
Hii ,
I am dishant. I work with startups & businesses to find suitable freelancers they require. If you are a web developer or editor you can contact me with your portfolio and pricing
is there any way to fix the thrift error when trying to make a connection to a hive database ?
Not here, please delete this and read the #rules .
I need to store JSON in an SQLITE table column. JSON1 extension or Blob? Pros and Cons?
what do you mean by blob, like storing JSON strings as blobs? or using the jsonb* functions provided by the JSON1 extension?
A Blob datatype column. Supposedly you can store JSON in a BLOB (Don't know if it has to be a string or not). And yes for the latter question, using the JSON1 extension
https://sqlite.org/json1.html#the_json_blob_input_bug
apparently it was originally considered a bug for JSON1 functions to operate on BLOB values and was initially fixed in 3.45.0 before they restored backwards compatibility in 3.45.1
i think if you want to use BLOBs for the slight performance benefit of sqlite's jsonb format, that's fine
but otherwise it's probably more convenient to use TEXT for json strings
(to clarify, converting {"Hello": "world!"} into a BLOB has no benefits over storing it as TEXT, but in 3.45.0+ you can convert it into sqlite's JSONB format using the jsonb() function if you want slightly more compact storage and/or other JSON1 functions like jsonb_insert() to run more efficiently)
Is there a trick to matching a python float to a MariaDB float? I know there's a slight difference (python 64 bit, MariaDB 32 bit?) and I can't change it to decimal.
I confirmed an issue exists when I even ran the query in MariaDB through Heidi, same result. I have a row with a float value, and when I put that float value in a WHERE clause, it returns no results.
!e You could try to convert the double to a float, like:
import struct
from decimal import Decimal
double = 0.3
float32 = struct.unpack("f", struct.pack("f", double))[0]
print(f"{Decimal(double) = }")
print(f"{Decimal(float32) = }")
:white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | Decimal(double) = Decimal('0.299999999999999988897769753748434595763683319091796875')
002 | Decimal(float32) = Decimal('0.300000011920928955078125')
python floats are double precision so 64 bit
Ah. I'm thinking backwards.
the driver should cast it correctly tho
I was having trouble with that, but I tried this out and it's hitting the matches, at least. It's not the most precise...
def _build_numeric_query(self, attribute, value, conditions):
try:
if isinstance(value, int):
numeric_value = int(value)
elif isinstance(value, float) or "." in value:
numeric_value = float(value)
conditions.append(attribute.between(numeric_value - 0.005, numeric_value + 0.005))
return None
else:
numeric_value = int(value)
except (ValueError, TypeError) as error:
numeric_value = None
if numeric_value or numeric_value == 0:
conditions.append(attribute == numeric_value)
(Right before the return None)
This is using SqlAlchemy, btw.
I'm using SQLAlchemy and i think i may have written my app pretty wrong, are you suppost to do:
engine = create_engine(get_db_url())
def nameFromID(ID:int):
with Session(engine) as session:
stmt = select(Customer).where(Customer.id == ID)
result = session.scalar(stmt)
return result.name
def userFromID(ID:int):
with Session(engine) as session:
stmt = select(Customer).where(Customer.id == id)
result = session.scalar(stmt)
if not result:
raise Exception('invalid id!')
return result
or
engine = create_engine(get_db_url())
session = Session(engine)
def nameFromID(ID:int):
stmt = select(Customer).where(Customer.id == ID)
result = session.scalar(stmt)
return result.name
def userFromID(ID:int):
stmt = select(Customer).where(Customer.id == ID)
result = session.scalar(stmt)
if not result:
raise Exception('invalid id!')
return result
hi, db masters. In sqlalchemy I got a model SKU that has decimal/number col price and category (str) . However different clients have their discount config in ClientCo.disc_config (json) field.
def client_price(self, clientco_disc_config: dict[str, dict]) -> Decimal:
'''returns client price with computed discount for client catalog'''
cat_config_dict: dict = clientco_disc_config.get(self.category, {})
cat_disc_str = cat_config_dict.get(CatConfigKey.DISCOUNT.value, '0')
disc_fraction = Decimal(cat_disc_str) / 100
return round(self.price * (1 - disc_fraction), 2)
This method works well. but I would like to sort paginated catalog of SKUs based on this computed-disc_config based client_price. As per my learnings in recent day, hybrid_property does not take any extra arguments. What are my options?
appears, solved it. Not quite nice, but works.
from sqlalchemy.sql import case
from sqlalchemy import Select, asc, desc
def apply_client_price_sort(stmt: Select, clientco_disc_config, sort_asc: bool = True) -> Select:
cases = []
for category, config in clientco_disc_config.items():
cat_disc_str = config.get(CatConfigKey.DISCOUNT.value, '0')
disc_fraction = Decimal(cat_disc_str) / 100
cases.append(
(SKU.category == category, SKU.price * (1 - disc_fraction))
)
client_price_expression = case(*cases, else_=SKU.price)
sort_order = asc(client_price_expression) if sort_asc else desc(client_price_expression)
return stmt.order_by(sort_order)
...
skus_stmt = apply_client_price_sort(skus_stmt, clientco_disc_config, sort_asc=False)
Heyo people! Earlier I have only worked with xampp and mysql for a earlier project. But now my current project is getting close to needing a database to store some data. What do you say about using xampp and host it with AWS server? All ideas are appreciated
The database needs to hold a number (Id) and under every Id there will be three columns of data (temperature, time and sensors)
So I believe mysql would work fine for it but I have no experience having a database cloud hosted. Earlier it was just local
how can i fix this
any book recommendations for learning SQL (particularly books that have projects to immediately apply your knowledge)? Prefer reading books as it gives more structure to my learning. Been using sqlite and python but feel ive hit a wall as, I'm very much a beginner
For beginner resources, these are pretty good: https://owencampbell.me.uk/sql_python_tutorial/pages/intro.html https://www.hackerrank.com/domains/sql https://sqlbolt.com https://selectstarsql.com/ https://www.datawars.io/
Ill have a look. Thank you
The owencampbell one looks like itll be the perfect resource for me as Ive essentially been learning sql through python
hi
I have about select data using pandas
this line not filter
aprovados = escola_selecionada[escola_selecionada['situacao_nome'] ==
'APROVADO']
I am using this csv
my code https://pastebin.com/CVb9t07q
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.
How do I connect to a hive database ?
Thanks for the help. Sorry, ended up getting busy at work and forgetting about discord.
Can I ask a question about vector databases here?
yes, also preferably ask your question instead of asking to ask
if it's off-topic, you'll just get redirected elsewhere
if it's on topic, someone may answer it without the extra delay/compromise of responding to the original message and having you reply with the actual question
Is this a good chat to get help with creating a data logger?
I mean, itโs more just a functionality in Python rather than purely this but just wanna ask just in case
But basically
I have created a data logger for a module system for a raspberry pi to track various modules for a car. This is just context
My issue is, Iโm creating a framework for it, so that it would act as a library, you create a logger object, and you have the options to add sources, information you want to log, and those will be consistently updated with a update_source() function
There are 2 types of updates, updating a specific source, and updating all sources of the logger, both with the same problem
I want it to be as easy as possible, so by simply specifying the name of the source, or just running the update_all_sources() function, it will update and add the values into the list storing each update
This is going to be difficult to explain, but how I was thinking was to instead of getting the value, to instead store where the data is being changed, so you would simply just have to read the stored variable and then obtain the value
Because if not you would have to plug these variables into the parameters of update_all_sources and that ruins convenience to an extent. Instead do it once when you create the source and then run said code. Just wanted to know if it was possible
Sorry for the long post, itโs hard to describe my thought process, hopefully it helps
I would suggest to start by the problem you are trying to solve
A datalogger is a start, but then how much retention do you need? Who would need access to that? How often would they need access to that? Would they go straight to the RPI or would it go to a DB or a cloud or something else?
hi
so i have a many to many relationship
using postgres
say a table called records
and a table called meds
when the users create a record
they can make any number of med for it
the med can be new or one if the meds already in the database
so i need a query the first checks if each med exists or not
if it exists, make the relationship in the middle table
otherwise create it then make the relationship
if anyone knows how this can be done in an efficient way
Guys i forgot abt the cursor what does it do
The cursor is just the thing you do operations on, see https://docs.python.org/3/library/sqlite3.html#sqlite3-tutorial for how to use it. One connection can have multiple cursors.
anyone know any good books on implementing databases? namely looking into learning how distributed keyvalue stores work.
could even just be theory, just a decent book on how databases work
Maybe check the syllabus of a graduate database class at a large CS Uni? Carnegie Mellon has a large database group, or Stanford
the book "database internals"
Hi, I'm making a login system as a part of a app project I'm working on and I want the app to be accessed publicly. I plan to code it in python using the sqlite3 library and a sqlite3 database to store all of the users's information (passwords, usernames, etc...). Anyway, I was wondering if I have to code it differently as a result of wanting the app to be public ? Do I have to import libraries other than sqlite3 and consider things like protocols or am I ok to just code it normally ?
hi. in my python project i used pure python mysql connector. why do ppl use sqlalchemy or other stuff to work with database??
they abstract SQL away into Python classes and methods, are sometimes more portable than SQL with how each database has a different dialect and can help make database migrations and versioning easier
there are many people that prefer to use raw SQL queries instead of using an ORM though
(the reason varies per person but in general extra tool, extra complexity, extra dependency, more things that can go wrong and that you have to manage, sometimes their queries are less efficient than hand-written queries etc)
SQLAlchemy stuff + Adding things like Alembic + Factory boy can take care for you of
- Creating SQL migrations
- Applying not applied to db
- Easier to unit test code logic (thanks to Factory boy)
- Having DECLARATIVE table/columns definitions, where u define them once and they are used to autogenerate migrations
- Having type safety, you could is using recognizable attributes of those ORM class instances
- Increased uniformness/flexibility to implement more complex in terms of Python syntax itself code logic (raw SQL is pretty much a mess)
- You can use raw SQL if necessary too, but only if u exhausted all ORM lib capabilities
Good to choose if u develop fully fledged backend api/service (or just big in code size app. Lets say over 20-40k code lines u are drastically needing that)
And also IMP very good for any CRUD api
On another hand using raw SQL has advantages of:
- Having increased flexibility to SQL syntax (At the cost of almost no integration with a programming language itself, and no typing safety)
- Less dependencies involved (u depend only on a single minimal lib!)
Good to choose if u develop very minimalistic tool that will be maintained only once in several months. Haven't encountered usage cases when else it would be good to use raw only, since u can use it if u need for ORM too anyway
i also prefer that but maybe because i havent tried python orm yet and also feel less interested in trying but will learn anyways. but i see what u mean
can u look at my code and tell if sqlalchemy would have made it better or now
Yeah, the regular CRUD (Create, Update, Delete simple operations) app, ORMs are best fit here.
You would benefit the most if u utilize Mypy integration of SQLAlchemy
whyyyyyy??? but i love seeing my sql i literally put sql
why put extra stuff in it
even tho i havent learned them yet
described above before. ๐ #databases message
anyway your app is simple enough to live without ORM
student level project after all
i would be more excited to see unit tests first in it than ORM
okay will look into it
tnx for the advice
:incoming_envelope: :ok_hand: applied timeout to @gusty scroll until <t:1724626385:f> (10 minutes) (reason: duplicates spam - sent 4 duplicate messages).
The <@&831776746206265384> have been alerted for review.
Anyone know if My SQL can be used in Jupiter noteboooks on Mac
did you run MySQL locally?
Wdym
Yes
bluds, in my face attendance system, I am using sheety API to store the data in google sheets, it is working well but it can't save images in google sheets, any alternative ?
HI, does MariaDB Connector support async operations ? I try to create SQLALchemy async engine with valid URL started with mariadb+mariadbconnector:// using create_async_engine(...) , then SQLAlchemy raises following exception :
File "/path/to/my-project/.venv/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/engine.py", line 1031, in __init__
raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: The asyncio extension requires an async driver to be used. The loaded 'mariadbconnector' is not async
Should I switch to other connector packages such as asyncmy (looks like the only option) ?
[update]
I switched to asyncmy , async things work well
not blud creating anchors ๐
bruh ๐
if i have a problem that I was able to resolve, but not able to comprehend the log messages, should I move on?
figured out...
Is there a reason you're using Google sheets and not a normal database like SQLite or Postgres?
Using Google Sheets as a database is a wired idea, and you also want to store images? You can use the Google Drive API. But using Google sheets/drive is the most wired and unreasonable way to stor data.
Use a database like MongoDB, MySQL, or Postgres.
And store files in a server, store a file as a file.
:D
DDoS?
Did I understand you well?
You mean Distributed Denial-of-Service (DDoS) Attack?
You want malicious code..
And the server ToS doesn't agree with you.
Rule 5
Do not provide or request help on projects that may violate terms of service, or that may be deemed inappropriate, malicious, or illegal.
Do I need to delete messages?
I think I need to delete
Okay
Why would a person write code if he isn't going to do such things?
To make a billion dollar company and be rich?
Coding a calculator is seriously boring
To take $100K a year?
bruh
Did someone tell you that's coding?
Be realistic my friend.
Realistic?
what is this
are you asking why someone would write code if it isn't going to be malicious?
yeah, can you help me explain that to him?
I don't think it's right to continue this discussion from here.
Google Top Results:
The average salary for Software Developer is USD 104,440 per year in the India, MS
Ok, #cybersecurity
Are you here?
if you know my github page, Lives in the terminal
Do you think I care about money? :D
No, you care about stealing banks.
plz don't delete this comment
I can't make it unless I come across a stupid manager in a bank's system.
:D
My goal is not to enter the bank
that uses JSON & plain text to store data.
That's why my friend went to court today.
bruh
Really
?
He was released on the condition that he pay back the money he stole to its owner.
2.058$
I'm not looking for money
where are you from @ashen wraith
ok, if you want to be a programmer, I recommend Flask & HTML/CSS, then Django, then beautifulsoup to autosearch for jobs
From earth
me too :D

no
:(
it's illegal
you will just teach
i'm wondering deleted messages are viewable by maintainers
thank me for what?
They are, but a @rugged locust report is always appreciated if we need to look out for something
for talking to me
@fathom beacon bruh
see you buddy
๐คจ
Hi brad, this buddy wants code for DDoS, which is illegal..
I think I'm about to get banned
Yup, and they have now been informed of the rules (thanks for that by the way) and have discontinued their search here.
What do you mean by " discontinued their search"?
I promise I won't open a topic like this again.
That is what I mean
i don't understand, do you mean you told them to not open a topic like this again?
yeah
ok
They have understood that we will not allow it again on the server. Leniency is usually a good policy if someone is not aware of the server rules (even though everyone clicks the button saying they have read and understood the server rules when they join)
just like every other app, they need to read the rules/ToS/privacy policy!
How do I try to level up my sql skills (in terms of learning and projects). Like Ive got a good grasp in the basics but I want to go beyond that. Using sqlite atm before I go onto other sql variants
I think I was able to push my SQL abilities the most is when I was making company reports. See if you can do the same.
Ive been working on a ufc database but its getting kinda boring now. Going to move onto a new project thatll properly test me
I swear this shit is so difficult because Im self learning everything. Im not going to spend money on courses unless i really want to
What have you done?
Here's a few resources:
https://www.hackerrank.com/domains/sql
https://sqlbolt.com/
https://selectstarsql.com/
https://www.datawars.io/
https://github.com/YourLocalPurveyorOfExplsoveGoods/MyDatabaseProjects this is the link to the database projects Ive done. Hit bit of a wall with it. I feel like I need to challenge myself a lot more and pick projects that will properly test my skills and force me to learn more
There's several sides to the database world. One side is the transactional world of using databases for application data. The other is using it for analytical purposes, such as stock analysis. (there's also the no-sql world too).
Ive been focusing more on the analytical side as thats what I'm more interested in. just dont know what to do to further my skills
You'd want to balance it with some transactional stuff, if you want to be "expert" you need to know both sides.
But, I'd suggest working through some of those sites I linked above. It looks like you're just scratching the surface, and haven't yet worked with CTEs, aggregates, subqueries, and window functions.
Ill do that then for the next month or so
when i was in game industry i was constantly told that when i write code it should be future ready, that if something unforeseen happens, it should be able to handle with minimal intervention. But looks like same practice doesn't apply in data engineering. I was told, that instead of writing dynamic code, i should only apply changes when user request since table is a fixed structure.
in general is this a standard in data engineering or should i see this as case by case?
@next shadow Hey man, do you mind if I find teammates here?
We don't allow recruitment or advertising
thank you buddy
not a real database but yesterday i created a small package for storing your data models. (In cas you use pydantic)
https://github.com/42dotmk/slowstore
maybe it will work for you.
If you've never touched SQL at all before, I recommend SQLite first. Postgres is also a fine choice but SQLite is just a bit easier to dive into quickly and start building stuff with.
Is it normal to have SQL query like this? Or I should have do more logic in python instead
WITH search AS (SELECT id.id as item_id,
td.scalar_min as scalar_min,
td.scalar_max as scalar_max,
id.type as type
FROM item_held
INNER JOIN main.item_log il on il.item_id = item_held.item_id
INNER JOIN main.item_def id on il.source_item_id = id.id
INNER JOIN main.tool_def td on id.id = td.item_id
WHERE id.type in ('tool_dirt', 'tool_fish')
and holder_user_id = 0),
search2 AS (SELECT scalar_min, scalar_max, type
FROM search
WHERE (
(scalar_max = (SELECT max(scalar_max) FROM search WHERE type = 'tool_dirt') and type = 'tool_dirt')
or
(scalar_max = (SELECT max(scalar_max) FROM search WHERE type = 'tool_fish') and type = 'tool_fish')
)),
search3 AS (SELECT scalar_min, scalar_max, type
FROM(
SELECT scalar_min, scalar_max, type
FROM search2
WHERE (scalar_min = (SELECT max(scalar_min) FROM search2 WHERE type = 'tool_dirt') and type = 'tool_dirt')
or (scalar_min = (SELECT max(scalar_min) FROM search2 WHERE type = 'tool_fish') and type = 'tool_fish')
)),
dirt_out AS (
SELECT scalar_min, scalar_max, type
FROM search3
WHERE type = 'tool_dirt'
ORDER BY scalar_max DESC, scalar_min DESC
LIMIT 1),
fish_out AS (
SELECT scalar_min, scalar_max, type
FROM search3
WHERE type = 'tool_fish'
ORDER BY scalar_max DESC, scalar_min DESC
LIMIT 1
)
SELECT * FROM dirt_out
UNION ALL
SELECT * FROM fish_out;
I agree with dowcet. Try with sqlite first and you can easily reset and redo a single db file whenever. Once you're comfortable, you can then start looking at setting up Postgresql server. I like messing around with one in a docker container https://github.com/devrimgunduz/pagila
Thats normal
Ok
Yah, normal, but some opportunities to simplify. Such as consolidating the two max's, and combining dirt out and fish out by using a window function to take the scalar max of both types
Hmm, I have no clue what they are, probably would learn later
ye, it's bcz I haven't learnt them yet
It would be recommend to start learning the basic since they are quite helpful
If you really want to keep using Google Sheets and need help troubleshooting then share the details in a help channel #โ๏ฝhow-to-get-help . But switching to sqlite3 shouldn't be too difficult and should be easier / more reliable
hi
can anyone help me choose which one to download
Sqlite is in the Python stdlib , you don't need to install
What BillyBobby said
a SQLite database is just one simple file, that's the whole beauty of it ๐
How does that work though?
Like it was jus 5mb to download.
I was surprised
cause postgresql gave me a lot to download
What was 5mb to download? The Python library should be all you need. It will create a sqlite file when you use it, and that file is your whole database
Just like writing to JSON or a text file basically
does that mean i can uninstall what i installed
Probably, but you didn't say what you installed
Might be useful but probably not required. (Not so sure to be honest, because Windows)
yeah, python's windows installer bundles its own sqlite library so it works out of the box, but if desired, you can replace the sqlite3.dll inside your installation with a newer version from that SQLite downloads page
those sqlite executables are just convenient CLI tools you can use on your database files (they have some neat formatting options too)
https://sqlite.org/cli.html
Yes
Hey guys, I've found this cool open-source project for visualizing databases. I want to test it, but I've never used PostgreSQL before, and I'm using it as a Docker container. Can anyone tell me how to make it work?
Here's the link: https://app.chartdb.io/diagrams/diagramexample01
I tried adding the script mentioned in my PostgreSQL terminal, but it didn't return any JSON at all.
It looks like whatever the results are from that query, you need to paste them into the box.
This is the result
fk_info
---------
(1 row)
it should return in json
instead of this
Here's how I run
postgres=# \c postgres
You are now connected to database "postgres" as user "postgres".
postgres=# WITH fk_info AS (
postgres(# SELECT json_agg(row_to_json(t)) AS fk_info
postgres(# FROM (
postgres(# SELECT
postgres(# schema_name,
postgres(# table_name,
postgres(# fk_column,
postgres(# foreign_key_name,
postgres(# reference_table,
postgres(# reference_column
postgres(# FROM (
postgres(# SELECT
postgres(# tc.constraint_schema AS schema_name,
postgres(# tc.table_name,
postgres(# kcu.column_name AS fk_column,
postgres(# ccu.table_name AS reference_table,
postgres(# ccu.column_name AS reference_column,
postgres(# tc.constraint_name AS foreign_key_name
postgres(# FROM
postgres(# information_schema.table_constraints AS tc
postgres(# JOIN information_schema.key_column_usage AS kcu
postgres(# ON tc.constraint_name = kcu.constraint_name
postgres(# AND tc.table_schema = kcu.table_schema
postgres(# JOIN information_schema.constraint_column_usage AS ccu
postgres(# ON ccu.constraint_name = tc.constraint_name
postgres(# AND ccu.table_schema = tc.table_schema
postgres(# WHERE
postgres(# tc.constraint_type = 'FOREIGN KEY'
postgres(# ) AS sub
postgres(# ) AS t
postgres(# )
postgres-# SELECT * FROM fk_info;
fk_info
---------
(1 row)
Again I think the JSON will come after you paste results in the box...
But the problem I see here is with line breaks. Your query is getting ready line by line instead of as one line. I think you need to fix it in an editor before pasting in your terminal
is there a way to connect and execute HSQLDB commands through python script ?
I've never used it but: https://hsqldb.org/doc/2.0/verbatim/sample/sample.py might help
Looks like they connect through an odbc connection, so it's really about python -> pyodbc -> hsqldb
Tried that as well same output
I would try an IDE like DBeaver instead of terminal to see if the query executes properly.
give it a shot and let me know
if it works
Will work
look like just a filter
it will only show rows where that column value is "UDF111"
np
Hey I am using back_populate because the modern docs doesn't include backref? Also if I have a 1 to many relationship in the many if I have a column in the many table how do I get a column in the one table in flask sqlalchemy?
https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#one-to-many
The link below is using backref + back_populate. But the example below are from the old docs. Is the link below accurate with the new docs? I can't find anything with the modern docs involving my previous question. Any help would be appreciated.
https://stackoverflow.com/questions/39869793/when-do-i-need-to-use-sqlalchemy-back-populates
Please ping on reply.
Can somebody help me with this please ?
I'm making a create account screen using sqlite3 and hashlib in pycharm. Whenever I create an account it says "Database locked"
"database is locked" means you have some uncommitted transaction from another connection to your database file
that can happen from leaving an unclosed connection in your script, but also from external programs like database viewers, or just having your script running twice - can't say for sure what's causing it in your case without seeing the rest of the code
get_guild_modules, (get|list|find|search)_guilds_with_module
thanks
Finally, i can continue coding ๐
Yeah youโre right turns out it wasnโt working because I forgot to add conn.commit() and conn.close(). Thanks for the help
Maybe create more generic methods with related filter objects? So if you have to fetch a module/guild by different property you won't have to make a new method?
@dataclasses.dataclass
class SomeFilter:
module_name: str | None = None
async def guilds(self, filter: SomeFilter) -> list[Guild]: ...
:incoming_envelope: :ok_hand: applied timeout to @delicate lodge until <t:1725227150:f> (10 minutes) (reason: duplicates spam - sent 4 duplicate messages).
The <@&831776746206265384> have been alerted for review.
does anyone have a database class which I could get some inspiration from
Database class? Inspiration for what purpose?
I was thinking of some ideas to create a database class to manage all queries
and wanted to get some inspiration from others
asyncpg?
Anything works
I'm starting a new back-end project with fairly loose persistence requirements. I have experience with SQLAlchemy but it feels like a sledgehammer to crack a nut, plus it seems like there are so many different query syntaxes these days it's hard to remember how everything works.
So, does anyone have any alternative suggestions? I don't necessarily need SQL, but I do need to be able to have trivial local hosting on Windows and Linux, even better if it's self-contained like sqlite. Looking for something easy to use from Python but powerful if needed.
Sqlite3 is a good choice for local engine of data (that is just a file)
otherwise... i recommend actually to suck it up and use it through SQLAlchemy (or Django ORM in Django Ninja)
Python ORMs are too powerful, they make it all a class, with easy to use abstractions and ecosystem around for easy unit testing
And they are able to autogenerate migrations and have inbuilt migrating versioning accordinly
Like...Other options would be very hard to refactor later if u will pick anything else
ORMs are abstracting away which SQL engine u use and later if u will need to make the project more powerful than at its start u will be able easily switching Sqlite3 to postgres or other fully fledged engine
So u can use SQLAlchemy with SQlite3 as mentioned. and to make it even more powerful u could ensure u use its mypy integration ๐
https://docs.sqlalchemy.org/en/20/orm/extensions/mypy.html the the code will be awesome.
Because despite u using leaky abstraction like orm, it will have all the validations for typing and correct usage
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import Mapped
from sqlalchemy.orm.decl_api import DeclarativeMeta
class Base(metaclass=DeclarativeMeta):
__abstract__ = True
class User(Base):
__tablename__ = "user"
id: Mapped[Optional[int]] = Mapped._special_method(
Column(Integer, primary_key=True)
)
name: Mapped[Optional[str]] = Mapped._special_method(Column(String))
def __init__(self, id: Optional[int] = ..., name: Optional[str] = ...) -> None: ...
some_user = User(id=5, name="user")
print(f"Username: {some_user.name}")
select_stmt = select(User).where(User.id.in_([3, 4, 5])).where(User.name.contains("s"))
Ensuring u use only existing attributes to a model, including during a syntax to SQL query commands
https://factoryboy.readthedocs.io/en/stable/orms.html#factory.alchemy.SQLAlchemyModelFactory well synergizes with factory boy for easier unit testng ๐
There are some other options... but they are meh in average
like if you are going to maintain your project only once in year, then good choice can be just using std lib sqlite3 and nothing else
Currently there's only one query syntax (select(...)), Query has been deprecated
Why are you using _special_method instead of mapped_column?
But if I try to find any examples anywhere, there is always a mix of syntaxes. And the docs are far, far too long to easily work out what is current and what is not
Like I said, the docs are far, far too long.
You don't need to read everything
I've been using SQLAlchemy for something like 15 years, on and off. I'm not a newbie at this. I'm just looking for something a bit more convenient.
quickstart section is pretty short
i copy pasted doc example, i blame on them any inconsistencies
Which example? ๐ค
the one u asked question about and i have code posted about
Can you link it please?
copy pasted from https://docs.sqlalchemy.org/en/20/orm/extensions/mypy.html
Deprecated since version 2.0: The SQLAlchemy Mypy Plugin is DEPRECATED
ORM Declarative Models
SQLAlchemy 1.4 introduced the first SQLAlchemy-native ORM typing support using a combination of sqlalchemy2-stubs and the Mypy Plugin. In SQLAlchemy 2.0, the Mypy plugin remains available, and has been updated to work with SQLAlchemy 2.0โs typing system. However, it should now be considered deprecated, as applications now have a straightforward path to adopting the new typing support that does not use plugins or stubs.
๐ค https://docs.sqlalchemy.org/en/21/changelog/whatsnew_20.html#whatsnew-20-orm-declarative-typing
okay mapped_column and Mapped[int] is latest
This is exactly why Iโm considering other options. Almost every example of SqlAlchemy out there uses old syntax, whether itโs for queries, declaring ORM fields, or something else. It gets tiring trying to mentally translate it all the time. So Iโm just interested in alternatives.
check Django ORM then ๐
Depends on the queries you want to run really and how far you're ready to stray away from sql
Not sure I want to try and integrate Django ORM into a non-Django project ๐ Tempted to consider MongoDB again but that doesnโt seem popular any more
MongoDB I think can work nice if u will use Pydantic to set data structure
And if u don't care about data to be lost between app version upgrades
Since MongoDB is not sql, data is migratable to new format only manually by self written scripts
Sql (and especially popular python orms) take care about it for you
Also if u use mongo, u usually need to be not caring about complex syntax quering with data joins.
Supposedly you are okay your db having simple operations only
Tldr: mongodb is even more poor alternative than Sqlite3, but can play nice enough if u have very simple needs and it is one time development application without future growth in terms of code and data complexity
Tbh at this point u can just use shelve
https://docs.python.org/3/library/shelve.html
Like mongo, except inbuilt std lib and bd as simple file ๐
I hadn't thought about using Pydantic with Mongo before, which is an interesting idea. At some point I will probably need the data to be readable from multiple processes in a fairly reliable way so I don't think shelve would be good enough, but thanks for the reminder that it exists
Sprinkle with mypy/Pyright(pick one) to validate Pydantic correct usage across code.
U will very potentially sweet nice app
As long as u wrote unit tests involving their running against local (preferably in docker compose) instance of mongo
not really. you still need to validate them manually, and more complex migrations can't be done automatically
you are probably speaking about SQLAlchemy having manual part for validation
it is not present in Django ORM ๐
trully automatically
kind of an odd question but i want to make a webapp with python that would show some charts on an analysis of data that i have in a csv. this data has tags but it is currently just comma separated in a single cell for each item which feels so wrong. I am not sure if I am gong to move the data to a db or use pandas, but i am struggling to find the best way to store tags that will make it easy to read and sort by (e.x. click on a tag in the webapp and it will show all items that have that tag, and each item can have multiple tags). anyone have experience with this?
ik in something like bigquery i could use something like an array in a column for each line item but i also dont want to get that involved if it is a personal passion project that is to just track books i am reading and the tags i am adding to them
i have also considered the option of having a denormalized set where there is an entry for each book and each tag it has but that also feels kind of icky to me and could cause issues with latency (i sometimes have more than 20 tags on a book and there will be at least 1500 books with the list growing each day)
kind of an odd question but i want to make a webapp with python that would show some charts on an analysis of data that i have in a csv. this data has tags but it is currently just comma separated in a single cell for each item which feels so wrong. I am not sure if I am gong to move the data to a db or use pandas, but i am struggling to find the best way to store tags that will make it easy to read and sort by (e.x. click on a tag in the webapp and it will show all items that have that tag, and each item can have multiple tags). anyone have experience with this?
Relational database like Postgres has several solutions for that.
- Just using Array and store as it is tags https://www.postgresql.org/docs/current/arrays.html
U already again ability to query
Postgres has all the syntax to treat them correctly and return u query if only specific tag is present
https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#array-types
from sqlalchemy.dialects import postgresql
mytable = Table("mytable", metadata,
Column("data", postgresql.ARRAY(Integer, dimensions=2))
)
mytable.c.data.contains([1, 2])
SQLAlchemy with its Postgresql dialect can use this functionality for example
- More traditional SQL solution...
using Normalization. Creating extra table with unique tags for example
And then creating third table for many to many relationship
where each record links your row from main db with specific tag.
Repeat records for linking multiple tags with specific row from main table
Django ORM probably has it implemented as Many To Many relationship shortcut or smth like that
- Also can be just used some pervert solution like using "LIKE" syntax for searching values in any pseudo array for any SQL engine (including Sqlite3)
ultimately solution depends on size of your database.
- if i used golang, and it fitted in memory, i would just created data structs and kept in memory.. in python same would have worked less good but can do too. Depends on size
- Then next good pick Sqlite3
- Then postgres ๐
if it is personal project as it is, i would have crossed postgres away. too heavy for pet projects
๐ค if u expect the number growing fast, may be good to pick postgres.
Otherwise sqlite3 is a good start then
To be honest it's fine for personal projects if you already have a db running on your pc/local network
-1 would mean the last character. 6 would mean the 6th position. -1:6 means last character and up to the 6th character, but it doesn't work that way
I think it's because name[-1] (the last element) is to the right of name[6] and the default step for slicing is 1, so you move to the right
You can specify your own "step" for slicing, for example:
!e
name='Jennifer'
print(name[-1:6:-1])
:white_check_mark: Your 3.12 eval job has completed with return code 0.
r
If you just wanted the last 2 characters, just do -2:
!e
name='Jennifer'
print(name[-2:])
:white_check_mark: Your 3.12 eval job has completed with return code 0.
er
Hey guys I have 20tb of images to store
I need to store it to train a model later (yolov8)
Where should I store it? I need it long term too
If I use amazon s3/glacier, will I be able to train models with the stored data directly or will I have to download it again
Download it again
glacier is very expensive to retrieve data from
Whatโs the best solution? Should I get a physical hard drive
It's basically a math equation... how much data, how long you're going to keep it, where's the model running, etc.
Are you training locally or in AWS?
sagemaker
I heard that sagemaker can read data directly from s3 but Iโm not sure
play around with this calculator to estimate costs - https://calculator.aws/#/createCalculator/S3
AWS Pricing Calculator lets you explore AWS services, and create an estimate for the cost of your use cases on AWS.
More of a conceptual and generic question.
Best or good practice: how should I handle my database connection code? Where should I place it, how should I structure it, etc.
What "database connection code"? You're writing your own instead of using a library?
depends on the framework, not that generic
for FastAPI, you should use Dependencies to create connections that draw from a global pool
for Flask, iirc you typically just save the pool as part of the app configs
using https://github.com/fastapi/full-stack-fastapi-template/blob/master/backend/app as an example
- define the models
- setup the database and run migrations, in their case they use alembic for it
- define the connection as part of your configuration
- create a method (or a context manager, or in this case a dependency) that returns a session and closes it automatically
- use that dependency & the models within your routes
Im using a prebuild wrapper, asyncpg. I just want to know how to structure where the code that interacts within PG is.
To be honest, structuring good DB code within a non trivial FastAPI app is a pain in the ass. Like Flask, itโs too keen on using globals.
The example above is okay but it relies on a module level โengineโ variable. But at least it gets the settings from the environment
what is the problem with having the module level โengineโ variable?
at least for FastAPI, you can and should use their dependency injection overwrites if you need to mock it for testing
if you need to use a different engine depending on the request, that feels pretty cursed but could be done by having another dependency instead of leaving it global I guess
I donโt want logic being executed at import time. The engine global has side-effects, so you have to be careful when importing that file.
Their dependency override system for testing is nasty because it uses the original object as the key. Which means you get the side effects from importing it even though you wonโt use it
Hey all, I have an eight month long class for designing a project with 2-3 people that fills some need in CS that hasn't been filled. I really am trying to orient my career path towards databases (with a little bit of DS), but for the life of me can't think of a project that would take that long to make with that many people. Of course it wouldn't be full time work on the project, but I would need to get something presentable and just really don't know what I could even present related to databases. Any ideas?
collect usage analytics in whichever project you do, and look for ways to store it efficiently and analyse them
databases are part of nearly any CS project ever, but not a project on their own unless you're creating your own (which I wouldn't really recommend)
Do you have any idea how I can add exceptions for the database calls?
I'm using AsyncIOMotorClient
do i need a conn pool with async conn?
wont like the async-ness of the conn act as a paralelism mechanism by itself?
just like with conn pool
from my understanding, a pool still gives you a convenient way of running queries without needing to open a new connection each time (connection reuse), along with queuing up workers to prevent excessively high concurrent connections
well on the point of conn reuse
i would just pass single conn via app state
along with queuing up workers to prevent excessively high concurrent connections
and with that as i understood async should do even better than with sync (beacuse thats where async strikes at, as i understood)
however
i think i should just like test
with pool vs no pool
beacuse about high concurrency
i am going to store binary data in db
so perhaps separating would help
async tasks are definitely more efficient to create compared to threading, but servers cant tolerate infinite connections regardless of which concurrency model you use
but wont the db be the bottleneck at that point?
or more like process conn limit
not to mention there's the overhead of handshaking every new connection
someone else or a blog post can explain it better than me, but in short, the benefits of using them don't become redundant simply because your program is asynchronous, and having fewer connections can outperform a database with too many connections
https://stackoverflow.blog/2020/10/14/improve-database-performance-with-connection-pooling/
i read up a bit of docs actually
and
it seems like i misunderstood how it works internally a bit 
yeah i will need a conn pool
because i cant use same conn from different threads at the same time without a lock..?
i use rust btw, just to clear up some possible misunderstandings
you could definitely try, but mixing queries into one connection without locking is always bound to cause some obscure race conditions: sql BEGIN; INSERT INTO foo VALUES (7, 8, 9); BEGIN; SELECT a, b, c FROM foo WHERE a < 5; UPDATE foo SET b = b * 10 WHERE a = 2; ROLLBACK; DELETE FROM foo WHERE c = 10; COMMIT;
ic
and pg's pipeline mode
seems more like just batch request
rather than some automatic magic
seems about right, its described as just client-side so i'd assume the database was already able to queue up queries (from one connection) before the feature was added
the model in relational databases is multiple concurrent connections, not multiple concurrent transactions on one connection
hey guys why are dbt docs such a mess
MongoDB is best
This may be a silly question but it's Monday morning and i'm only two sips into my coffee (That's my story and I'm sticking to it):
I have two tables [PartsInventory] & [PartHistory] that pull just under 60k items every morning from our software's DB via API calls. I also have three more tables [CurrentDayInventory] CDI, [PreviousDayInventory] PDI, and [ArchivedInventory] AI. THe current process is to dump PDI's data into AI. Then dump CDI into PDI. Then dump the up to date data from a View I created into CDI.
I recently began working on this piece again and decided to compare the View to CDI and which ever Part required updating to update it instead of deleting all data in CDI and then inserting 60k items.
My question, is whether or not it is more efficient to simply delete the data in PDI, then insert all of CDI into it. Then delete CDI data and insert all of my View into it. OR iterate over each item and update where necessary? The latter method appears to have increased the run time on the script.
Thanks!
I should also add that I am using SQLAlchemy within my Python script to do all of this.
How are you updating the existing records?
I wonder also if you could get away with renaming the tables (at least for CDI -> PDI). Should be pretty fast : https://dba.stackexchange.com/a/53850
I have CDI and PDI set up so that I can compare yesterday's items with today's items which I use for a couple of reports for the managers.
Changing their names wouldn't help i nthis scenario. As for the update, I will grab a code sample and post it
Also beware indexes for slowing down the inserts into the AI table (https://dba.stackexchange.com/a/55575). What are your current benchmarks and what are what speed are you aiming for?
wsg ppl
Here is the function that updates iterates through my View and updates CDI where necessary. Previously, I simply deleted the data in CDI and inserted everything from the View into CDI. I figured updating would be quicker but it seems to have lengthed the process.
That might be easier to read
Hmm curious why you decided to go with SQLAlchemy vs pure SQL against the DB.
In terms of the update, you are making multiple queries per row, which is going to be quite slow. Also I'm not sure these operations are being batched, but that will have a big impact if not either.
I would expect a query like this to take on the order of 10s of seconds, a couple mins max. How long is it taking at the moment?
It shouldn't be hard to just merge exiting entities ๐ค
In terms of efficiency it will likely (definitely?) be slower though
A bit, yes, but it's just 60k entities, how slower do you think it'll be?
And also why?
Let's say you have a code like this to add new entities:
session: Session
for chunk in chunked(scraped_data, 1000): # Probably would be a good idea to chunk your data there
for schema in chunk:
model = SomeModel(id=shcema.id, name=schema.name, ...)
session.add(model)
session.flush()
You can just modify it a bit to update existing records:
session: Session
for chunk in chunked(scraped_data, 1000):
stmt = select(SomeModel).where(SomeModel.id.in_([schema.id for schema in chunk]))
existing_models = {model.id: model for model in session.scalars(stmt)}
for schema in chunk:
model = existing_models.get(schema.id) or SomeModel(id=schema.id)
model.name = schema.name
session.add(model)
session.flush()
That's if you're using orm of course
With core you could just use on conflict do update?
Not much, a couple seconds. Not 100% sure what OP is attempting to optimise for.
Honestly if you're after raw speed you probably shouldn't use python should just benchmark all the options ๐ค
Maybe yeah, just call delete/truncate on a table and insert all the records again, see if that would be better
But generally sqlalchemy should try to optimize updates, if model didn't change nothing should be emitted into db
Hmm, I will either get praise or a tonne of backlash due to this code
class DictWrapper:
def __init__(self, dictionary):
self._dict = dictionary
def __getattr__(self, key):
if key in self._dict:
value = self._dict[key]
if isinstance(value, dict):
return DictWrapper(value)
return value
else:
return None
def __setattr__(self, key, value):
if key == "_dict":
super().__setattr__(key, value)
else:
if isinstance(value, dict):
self._dict[key] = DictWrapper(value)
else:
self._dict[key] = value
def __repr__(self):
return repr(self._dict)
The only reason I made it is so that I can easily read and manupulate the dictionary without getting random index errors. The code looks prettier as well.
||This code was motivated by the javascript objects||
I don't think trying stuff out is bad, and this is a nice approach. Re. the problem that you had though (random index errors), I'm not sure how you using dicts, but dict.get(key) can be used instead as it just returns None by default (default val is configurable).
Yh thanks for that info, but typing response.StatusCode is way cleaner than response.["StatusCode"] or response.get("StatusCode") where response is a dictionary
Hello, your messages have been removed for violating rule 6
How is the picture such high resolution ๐?
guys why my table content is not deleting?
Template.py is ready
result = [(None, 1130268405333753857, 'Hikaru', 'Hikaru:', None)]
deleted
the process is occuring but the deletation si not happening
async def delete_default_character(
self, *, user_id: int, name: str | None = None, prompt_prefix: str | None = None
) -> None | str | list:
cursor = await self.db.execute_fetchall(
"SELECT * FROM default_characters WHERE user_id = ? and (prompt = ? or char_name = ?)",
(user_id, prompt_prefix, name),
)
result = list(i for i in cursor)
print(f"{result = }")
if len(result) > 0:
if len(result) == 1:
if name:
await self.db.execute(
"DELETE FROM default_characters WHERE user_id = ? and char_name = ?",
(user_id, name),
)
await self.db.commit()
print("deleted")
return "SUCESS"
elif prompt_prefix:
await self.db.execute(
"DELETE FROM default_characters WHERE user_id = ? and char_name = ?",
(user_id, name),
)
print("deleted")
await self.db.commit()
return "SUCESS"
else:
result_list = list()
for i in result:
data = {
"name": i[2],
"prompt_prefix": i[3],
"image_url": i[4],
}
result_list.append(data)
return result_list
else:
return "ERROR"
why it's not deleting from my database?
I used CodeSnap extension in VSC ๐ช๐
in the elif prompt_prefix: block you attempt to delete an entry using name, even though the condition implies that name might not be provided.
what's the purpose of doing this?
https://github.com/dbt-labs/dbt-adapters/blob/c707b534184a1a631e9b27437b4737626e19a5bd/dbt/include/global_project/macros/materializations/snapshots/helpers.sql#L62
won't that always be null 100% of the time? why not just write null? is it some kind of workaround to avoid an explicit cast?
dbt/include/global_project/macros/materializations/snapshots/helpers.sql line 62
nullif({{ strategy.updated_at }}, {{ strategy.updated_at }}) as dbt_valid_to,```
What is the fastest way to check "in" on integer on dbs?
Get the whole list from JSON and placing it in a set?
For example, a list of channels ids
Can you give an example of what you mean?
channels ids in a json file
[
16382728238381982384,
16382728238381982386,
16382728238381982385,
16382728238381982383,
16382728238381982387,
16382728238381982388,
16382728238381983389,
]
on message event for example
white_channels = getallwhitechannelsjson() #returns the whole json in set
if message.channel.id in white_channels:
...
And what do you want to query from a db? (And what type of db?)
Json is considered a database...
Well, thats not the point, i have this issue
Why does 'fastest' matter? Is there a huge number of objects or something? Is there a complex json file? If it's just your example, that's just find an element in a list.
Exceeding my example
if you're just using a json file, there's no faster way
So is json the best way to do something like that?
maybe
json could be a good choice for other reasons. it's not well suited for this task
Since this is the db channel, I'll give you my
solution: ```py
with open('myfile.json', 'w') as f:
f.write("""
[
"16382728238381982384",
"16382728238381982386",
"16382728238381982385",
"16382728238381982383",
"16382728238381982387",
"16382728238381982388",
"16382728238381983389"
]
""")
import duckdb
df = duckdb.sql("select * from 'myfile.json' where json = '16382728238381982384' ").df()
print(df)
well that's impressive
Imma try it out. Seems like something pretty cool ๐
It's actually cooler when you deal with more complex json...https://duckdb.org/2023/03/03/json.html
that is indeed cool
Ty
๐
json is not suitable as a database when partial reading and writing is required. the json file format does not support partial rewriting.
and duckdb is cool but you might want to consider just using regular python for this... json.load interprets json arrays as python lists. so once your data is loaded, you just have a normal list, and you can use normal python code on it
I set up a db but have no idea how to work it. Still using json because we are less than 5 users so far
It's worth learning and won't take long. SQLite is the easiest step up from a JSON file and good enough if you don't need simultaneous transactions. The main difference from JSON is just that you need a schema and tables
Can I set automatic attachment downloading of specific email sender that I receive every day and set it to be also automatically saved on my pc when I receive it?
yes you can
Hi ! Tnx...Is there a software for that or I have to write code ?
Can I do it through python?
Yes
which database offers the most storage for free with the most read/write cycles?
Question is vague... are you asking about fully managed cloud providers or what?
You can self-host Postgres and move a whole lot of data through it, depending on your hardware
database on the cloud, i cant self host a database because it needs to be available 24/7
You're not going to get a reliable cloud-managed database for free. If your app is on a VPS you can manage the DB yourself on the same server
This is a bit old, not sure if any of these options are still available https://www.reddit.com/r/Heroku/comments/x2qwst/free_alternatives_to_heroku_with_free_sql_database/
if you can afford the transition to KV store instead of sticking to relational databases, serverless kv store like cloudflare workers kv and vercel hobby tier is also decent
idk what a kv store is but i currently use mongodb
i was thinking of upgrading to serverless mongodb if i have to
i use mongodb right now but ill look into this
Hi, I have a question about SQLAlchemy's cascade deletion, specifically when using a secondary join.
If I declare cascade="all, delete", does it delete rows from the association (middle) table only, or does it also delete rows from the related table on the other side of the relationship? Or does it delete from both?
If I declare onDelete="CASCADE" on the middle table, do I also need to declare cascade="all, delete" on the left and right tables?
i feel dumb looking at all of this
I have a question. So I appear to be having an issue with my connection string pyodbc. I mean the code runs perfectly on my computer, but when this other person tries running it, she gets the following error:
Login failed for user 'janedoe'
Here's the code:
`
server = 'servername'
database = 'dbname'
username = 'readonly'
password = 'xxxxxxxxxx'
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password, trusted_connection='yes', autocommit=True)
`
Normal... I've been programming professionally for a few years and I still don't understand half the stuff people discuss on this server ๐
Do they have network connectivity to servername?
I'll just check with her.
Now to wait for a response.
So no she does not have a connectivity to the server. In SSMS, it says cannot connect to ServerName. Under additional information, it says Login failed for user '...\JDoe'. My guess is that she has no access to the server.
I guess I'll contact someone to try and get her access.
hello all, is it bad practice to name certain keys in the database as "name.bio" instead of name_bioi? The original source where I'm getting the data from has keys names with both underscore and dots as well.
Should I make the schema according to that or rename the incoming data to match the database with all keys in underscore?
It's probably better to be more consistent, I personally prefer _, also depending on db I think you'd have to wrap a name with . in quotes, but I may be wrong here ๐ค
On python side if you use orm it would certainly be easier to use an underscore
Also regarding the field name.bio itself, can't it be just bio?
Uh don't use periods in names. Ever.
Periods are for scoping, and you'll be forced (as Doctor said) to always double quote the names
Yeah, e.g. schema.table, table.field
Would be much easier to write table.name_fio than table."name.fio"
Not sure if this is the right channel but Iโm curious for dealing with local data files as part of Jupyter hosted data analysis Iโve found my three most useful external modules are pandas, pyarrow and duckdb. Am I missing anything thatโs better than these?
Depends entirely on what you are trying to do. Throw polars into the mix if you are looking to work with large Data frames.
Ibis might be interesting to you: it's a common interface over several data frame libraries, including Pandas, Polars, and Duckdb
Other options include Xarray for when you want >2-dimensional Pandas data frames, or Dask for when you want to do Spark-style partitioned parallel batch processing without the fuss of Spark (arguably this is less useful for local computing than it used to be, now that we have higher-performance multithreaded stuff like Duckdb that can handle out-of-core processing)
Pandera is good if you are building applications around data frames
Hi, not sure if this is the right channel to ask this in but I was having some trouble so I figured I'd try, https://github.com/vn-ki/film-raffle-bot I'm attempting to recreate a discord bot that uses python, I think I've got the main elements in the config file down but I'm having trouble with setting up a database for it since I have no experience with that. Any suggestions or recommendations for what to use?
How are you planning to host the bot? Sqlite is probably the easiest option, but you will need write access to the filesystem where you host the bot
(The sqlite database is just a file)
i was planning on just running it from my computer since it doesnt need to be online 24/7 if thats possible
iโm not really sure how hosting solutions work
That's fine. What exactly you'll need to do to get it exposed to the Internet may depend on your router or ISP but basically the options are to set up port forwarding or open some kind of tunnel
As for the database, I agree SQLite is the easiest place to start, perfect for something that's not going to support a huge amount of real-time traffic
i created the file in the same folder as the rest of the files, edited the config with the necessary info but the main.py isnt working no matter what and im not sure what the issue is
I have a question how can I make subfolders for my gui? My apologies
You should show the relevant code and other details in a help channel #โ๏ฝhow-to-get-help
oh i got that fixed but now itโs just the commands not working, iโll post in a bit
why am I get floating point errors with my float columns using asyncpg? i'm not doing anything exotic with my data except from loading it into my class
What was in the initial data? How are you loading it?
wdym by that? i'm just using a select query
I don't understand what kind of errors are you getting since these seem like normal values to me
Also is that war thunder?
yep
also i should've clarified, i'm getting this
Are you getting this from your api or from warthunder's api?
at it's core a python script get's it from a datamine and puts it into the database
I mean where are these values coming from?
can you be more specific?
That's what I'm asking you to simply understand what's the cause of that problem ๐
If that's a screenshot from warthunder's api for example then you can't really fix that
other thatn just round it to nearest N digits after coma yourself
it's from my own api, which i'm trying to fix
and i guess i could round it, but i would prefer a more robust solution
Depending on how your client works with these values you may want to use decimal.Decimal here and a numeric type in your db ๐ค
Float values can't be accurately represented with IEEE754 floats if that's what you're going for
But generally that should be fine if your client can handle formatting correctly
i tried using DecimalField and it crashed from numeric overflow ๐ซ
i'll just round it for now until i work out something better
what font are you using
looks cool
Open help thread and share code plz: #โ๏ฝhow-to-get-help
done, thanks ^^
hey guys i got a job interview for a data analyst position in two days
what topics should i go over
I googled it for you: https://www.datacamp.com/blog/how-to-prepare-for-a-data-analyst-interview
In addition to lists like these I would especially be researching the specific company and role
Prepare for your data analyst interview and practice some data analyst interview questions. Get help landing your dream job!
Iโve also been using AWSwrangler + boto for non remote stuff. (Mostly for integrating Athena queries into my notebooks or to dynamically generate glue records for S3 hosted data..)
Ibis has been on my radar to checkout, as was Polars but i actually ran into an issue with Polars and would have had to compile it myself to handle datasets with more than 4 billion (or so) records, which ironically pandas has no issue with if you have the ram. (64 bit index).
My largest dataset had 300+ billion rows. I melted my laptop trying to analyze it, so I ended up spinning up my notebook on a 1TB ec2 instance with Vscode remote. Believe it or not pandas (with Pyarrow) was able to handle it fine. (But took 12 hours.) in hindsight they t would have made sense to use a distributed processing engine but I was trying to contain my costs. (Was about $70 for the ec2 instance)
Polars is still on my list for its Iceberg support.. but Iโm more and more attracted to duckdbโs native format for those cases I need an extendable file based data store that supports schema evolution.
Ibia remains on my list.. Iโll learn more about out it at the very least. I kind of like learning to work directly with the different libraries so may wait to actually use ibis.
DuckDB discord is active and helpful, so ask for help there if you get stuck. It can handle larger than memory datasets, but as with everything, the details matter. My preferred model involves a lot of hive partitioned parquet files, which duckdb is really good with
Yeah Iโm in there. Duckdb was a recent discovery. I love it.. refactored a bunch of parquet code to use it.
Itโs not as fast with parquet as its native format but I canโt give up the space efficiency of parquet. I ended up writing a simple query caching function which more than made up for the performance difference.
(Itโs gross but I instantiate a global variable dictionary which uses a tuple of the query parameters as the cache key.)
It speeds up the first run and since I live in Jupyter I get the benefits on subsequent runs as long as I do t restart the kernel
Next time I have to analyze a large dataset Iโm going to try duckdb
Could also just create a table in an in memory duckdb table
I use a dbt pipeline that mostly starts from parquet, too
(Which also creates various tables for the intermediate steps)
And, if you want to survive restarts, then make it an on disk db rather than :memory:
All this data is data to help me. (I do Finops and Iโm largely left to fend for myself.. I had to teach myself python and enough data analysis/engineering to get it done.
And I love it!
Nice, I'm on data engineering side of fintech
Ah neat! My background is devops/cloudops/linux.
One important skill: understanding SQL window functions. Windows are the answer to many complex queries.
(Iโve had to do ops support for de teams in the past so I wasnโt scared to get my hands dirty)
Ok good to know. Is this particularly to duckdb queries or in general?
As of now I largely only use sql with Athena and duckdb. (In both cases I dynamically generate the queries in code)
I also have to use psql sql for generating cloud watch metrics in lambda. (Much more lightweight as itโs for tracking database metrics like length of longest running query)
In general if you're doing analytics work with databases.
Iโll try it and see if there is a performance difference.
Interesting experience. I definitely found Polars' support for "out of core" / "streaming" to be incomplete, eg you can't do a streaming partitioned write to a pyarrow dataset even though pyarrow supports it
Duckdb does better
I think Dask can do some out of core stuff but meh, Duckdb just works most of the time
I think they're all somewhat 'incomplete': it's an area of active work and optimization for all of them... I haven't compared them side by side, I generally can just throw RAM at a problem and partition the dataset.
Absolutely, I don't mean that to be negative -- it's just been my experience
They all have areas where they are good and less good
Yah, i guess what I really wanted to say is that this stuff is the state of the art, and as such, will have bumps along the path of using
Looking at window functions. Can they kinda replace the groupby functions in pandas? (Among other things.. really just looking g to see why I would need them)
Oh nvm I see.. seems like it could be useful.. but maybe not for any problems Iโve had to solve YET.
Groupby is just group by in sql.
Windows are good for things like: give me the top three products by country, or: show me rolling 7 day sales average
Or, let's say you want to compare sales this quarter to sales 4 quarters ago
(These are all things you can also do in pandas, just explaining why windows are useful)
I'm in a bit of a predicament and i need help please because i need to deploy
so first of all every time i need to execute sql on my webapp i open a new connection but this is a very expensive operation to do like opening new tcp connection to postgres db in the cloud so what people do is that they create a connection pool and pull connections from it so that there is no opening and closing overhead but the problem is that the appache webserver makes my code in process and each process has 5 threads so my server code gets ran num_ processes * 5 times which results in initializing a number of connections higher than the available (5)
You can use a connection pool with only one connection.
This just means that each process will reuse the same connection, rather than making new connections each time
Honestly I was thinking that maybe don't use apache or use pgbouncer so it does the pooling instead 
Is it bad to run a database server and a website on the same machine?
I only have 1 machine and I am trying to run both so any recommendations which database would be fine running on the same machine as my website
it's fine
what happen if i index on column with alphabetical data/ char type in SQL ?
like name for example
What do you mean by "what happen"?
Index would get created 
I mean i can see if its on int
But idk if possible for char type
btree indexes (which are default in most databases, I believe), so anything that can be compared can be indexed
chars can be indexed (if we're talking about single characters), strings can be too
@weak yoke , hopefully it is okay if I ping you
I created a simpler example
Like stated in the post
I am having trouble inserting and then querying data in flask sqlalchemy 2.0. Does anyone know how?
Also can someone please show an an example?
Can someone use sqlalchemy 2.0 querying when answering?
Also if I need to add other columns at another time how would I add the columns in a many to many relationship? I assume this would be equivalent to updating columns even though the columns are optional.
Here is the error.
https://paste.pythondiscord.com/TPWQ
app.py
https://paste.pythondiscord.com/TPTQ
home.html
https://paste.pythondiscord.com/B6FA
create_book.html
https://paste.pythondiscord.com/WBUQ
Thanks
Hi, I'm looking for someone to give me a hand with some facebook scraper, I have to collect phone numbers of a particular location.
Nobody will help you get around Facebook ToS here against server rules. Also that's not a database question
Hey everyone!
Can anyone tell me how can I store 3d models in sql database and how I can retrieve to render it on website.
Why do you think you want to store 3D models in your database? It's probably better to store the URLs of the files in your DB instead, but if they are small blobs that should be trivial
The normal way to retrieve any data from a DB for a website is an API which you can build with whatever framework you like
For the rendering part, pick a library https://medium.com/lai4d/how-to-embed-an-interactive-3d-model-on-a-web-page-9bd631046927
this, ive tried storing the blobs and its way worse, just store the URLs
How can I update all values in an object in MongoDB?
For example, the data look like ```
{
"_id": ...,
"myObject": {
"a": 1, "b": 2, "c": 3
}
}
and I would like to change all 1, 2 and 3 to 0s
You can use the $set operator
query = {"_id": your_document_id}
update = {
"$set": {
"myObject.a": 0,
"myObject.b": 0,
"myObject.c": 0
}
}
collection.update_one(query, update)
Hello, Ive been trying to make this work for the past 2 days, I honestly dont know if this is the correct place to come, But my bot keeps "closing" the databases connection
Ive learned that this is because of my init_database_schema function that loads the schema.sql and gets it contents and executes the commands idk, But, Nothing is wrong there, I load my schema.sql and execute, Weirdly it shows that error, But fun fact, When i do return <asyncpg.Pool>._closed, It for some reason returns, False. Even if the error occured, Can someone help me?
Perphaps connection is getting stale ๐ค
Honestly I didn't work with asyncpg directly but sqlalchemy's pool includes a pre_ping functionality which checks if the connection is alive before sending any requests
https://github.com/MagicStack/asyncpg/issues/309 Seems to be related?
Alright... here's the thingy...
I'm building a database in PostgreSQL of raw trade data from the Kraken cryptocurrency exchange. There are 724 .csv files. Each one represents a unique trading pair. Each files contains anywhere from 10,000 rows up to around 50 million rows of data. The data in each row always consists of unix time stamp, price, volume like so:
1628609725,2.50300,50.36358000
1628609725,3.07400,2.00200000
1628609725,3.07400,2.00200000
1628609725,3.07400,2.00200000
1628609733,2.50300,7.87232000
1628609739,2.50300,355.43768000
1628609750,2.50300,3.93616000
1628609757,2.50300,2.00400000```
As you can see, entries 3-5 are identical, which is fairly common in raw trade data due to trading algorithms executing trades in rapid succession. Herein lies the problem...
This following link is to the kraken exchange support page
https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales-
There you will find two links within, on for the bulk of their entire historical trade data in its most granular form and the second like is to quarterly data dumps. So the problem is, I am currently aggregating the bulk of the data into my database upto and including Q2 2024. Sometime in early Oct they will release the Q3 datadump.
My plan is to be consistently running a datacenter on my raspberry pi to capture live trade data and add it to my database via websocket, however, I'm considering the fact that this data stream can and will be interrupted at some point in the future. For example, if I started the data stream now, there would be gap of missing data from the end of the bulk data up to the end of Q2 all the way until I start the websocket for live data.
So If I added the Q3 data to the database, there is no way to differentiate between identical entries, and duplicate entries. I can't seem to figure out how to 'fill in the gaps' without duplicating trade data.
So, the best I can come up with is giving a unique trade id's to the bulk data that identifies as a data dump, and giving unique trade id's to the live data that identifies as live trade data so if there are identical entries that already exist from one side or the other, then processing will skip over adding it as a duplicate?
Does that make sense?
Make rows unique on all three columns and then run an insert that just skips on conflict.
Also lots of other ways to approach this.
but they said that there are genuine duplicates
Maybe compute a rolling checksum (ie rabin) over last N entries to find the overlap points?
See above
reading. I'll get back to you, thank you.
Since you already have a timestamp (1628609757 i guess for example?) can't you just skip all the data before that?
!pip dataset is interesting. Avoids the whole sql injection and the methods looks easy to use.
"Avoids the whole sql injection"?
SQL Injection is not a problem if you are using any library correctly, and I bet that if you use it wrong enough you'll still end up with code vulnerable to sql injections
For extremely simple projects it might be fine, but for anything with a non-trivial schema I would recommend using SQLAlchemy (which that library is built on top off) instead of using it
Hey guys. I'm new to this channel and our team recently built a SQLAlchemy extension for VS Code. It's basically an AI chat (RAG) system built out around the SQLAlchemy docs that allows you to ask questions about the project and get relevant answers without leaving VS Code (it's also completely free). Wanted to share here in case anyone found it useful! https://marketplace.visualstudio.com/items?itemName=buildwithlayer.sqlalchemy-integration-expert-jYSzG&
hello everyone, apologies if the answer is obvious, but i need some assistance identifying my server name.
i am trying to connect to my local sql server inside my .py script. however, i get the following error:
Error connecting to database: ('HYT00', '[HYT00] [Microsoft][ODBC SQL Server Driver]Login timeout expired (0) (SQLDriverConnect)')
when running select @@SERVERNAME , it shows me the default DESKTOP-###### server name.
how do i setup my connection string?
prior to this i was using a docker instance, however, the image for that was lost so i am no longer using it.
the problem is that multiple identical entries can occur within the same timestamp for example:
1628609725,3.07400,2.00200000
1628609725,3.07400,2.00200000```
Well, you can just delete the duplicates? ๐ค
they arent duplicates. they are legitimate trades
recurring trades from algo bots
I'm just saying that you can for example:
- Get latest timestamp from your db
- Delete everything with that timestamp
- Start receiving data from the timestamp you're deleted
so you won't get any duplicates
I don't think there's really a need to overcomplicate, how many trades per second are you expecting to get on average?
Correct me if I'm wrong, but I think something like
# Pseudocode
last_record = select(Something).order_by(Something.timestamp.desc()).limit(1)
delete(Something).where(Something.timestamp == last_record.timestamp)
for record in data_stream(from_=last_record.timestamp):
add_to_db(record)
should work?
If there's not a lot of records that could be in the same second/timestamp then impact of that delete statement is pretty minimal, so there shouldn't be any obvious problems?
- Timestamp is placed into the raw .csv files by the exchange not my db.
- Can't delete duplicates because they aren't duplicates, they are identical trades that occur in rapid succession on the excahnge within the same timestamp. It's down to the second so multiple trades can occur within the same second, in fact they commonly do.
- The data dumps occur about 1-2 weeks after each quarter ends. So I would be using a data dump to fill in the gaps between the last data dump and whatever gets missed by the live websocket. but if the entire entry from the live data is identical to the data dump, there is no way to differentiate between a duplicate and an identical trade.
there are hundreds of millions of trades
- There would be duplicates if you accidentally go over the same timestamp twice
32gb of .csv files i'm processing
that's what I was trying to avoid
do you mind a voice conversation?
sorry, I'm not really in the mood
no worries.
The examples I provided are real entries from the raw .csv file. multiple identical trades are legitimately occurring within the same second which subsequently produces what look like duplicate entries but are actual legitimate trades.
so if live data is recorded....
That's not typically how you do this
hang on
You add a 4th column for batch identity
Which records which input task inserted that row
so if live data is recorded... and then I add the data dump from the whole quarter to fill in any gaps, the data dump and the live trade data that exists incrementally, there is no way to tell the dumplicates from the identical trades
