#databases
1 messages · Page 32 of 1
Sure, that's a database decision what order to provide. If you want an order, add it as an order by to the outer query.
Otherwise, most databases just say the result is non-deterministic (it could come back in any order)
It's implementation defined?
It depends on how the database decided to execute the query, which may depend on the data. So, you always put an order by if you depend on the order.
Alright, that makes sense.
Thanks for helping me understand this.
ok. @coral wasp. did like you said.
first i tried below query
SELECT category, product, total_spend,
ROW_NUMBER()
OVER(
PARTITION BY category
ORDER BY total_spend DESC
)
FROM
(
SELECT category, product, SUM(spend) total_spend
FROM product_spend
GROUP BY category, product
ORDER BY category, total_spend DESC
)AS sub_query
got the correct result. From this result, i proceeded to next step of filtering only those rows whose row_number <=2, then the error comes.
To put it in simple way, i have added only last line of the below query.....
SELECT category, product, total_spend,
ROW_NUMBER()
OVER(
PARTITION BY category
ORDER BY total_spend DESC
)
FROM
(
SELECT category, product, SUM(spend) total_spend
FROM product_spend
GROUP BY category, product
ORDER BY category, total_spend DESC
)AS sub_query
WHERE row_number <= 2
Error:
column "row_number" does not exist. FYI: in PostgreSQL all identifiers (including column names) that are not double-quoted are folded to lowercase. So if you're using any uppercased letters in the names and want to specify them in SQL-query, please double-quote them first. (LINE: 14)
Ok, so row_number is not a column. You can't filter on something that isn't a column.
Look at the results from the first query. What columns do you see?
I see row number there!
pls just execute the query here once https://datalemur.com/questions/sql-highest-grossing
Share a screenshot of the result
Also, your code will look better if you format it first: https://sql-formatter-org.github.io/sql-formatter/
Yes: but row_number isn't a column in sub_query.
The where clause filters the sub_query, before the "select" statement runs. If you want to filter by something you created in the outer select, then wrap it again as a subquery and filter it.
i am adding where clause at the outer query, not inside the sub query
The where clause applies to columns in the sub_query.
Your query says: "select <stuff> from subquery where subquery.row_number <= 2"
hmm..so i am thinking like this.
SELECT a,b,c,d,e
FROM(
SELECT a,b,c
FROM so_and_so_table
)
Now, at the output i have 5 columns on which i can execute WHERE clause. Pls correct me, where is wrong in my thinking.
No, you ahve 3 columns on which you can execute the where clause
The where clause is before the select.
I know it's confusing.
?? row_number column is produced by super query, not sub query
Yes. that's why your code doesn't work.
yeah..now i get it. sorry for bothering you.
So, this is what happening.
First WHERE clause executes & whatever the result comes out of that WHERE clause, SELECT clause executes --> Becasue of this, When WHERE clause is getting executed, the row_number column is not yet there.
Yes, that's right.
The reason is:
If the where clause was after the select, it might change the results. What if you said:
select category, row_number() over (order by total_spend) from sub_query where total_spend > 100 and row_number < 3
This would be an impossible query... because you'd be filtering on both an input to the row_number (total_spend) AND the result of the row_number. (ok, not impossible, just really weird)
ok. so, what could be the best way to filter out only those rows whose row_number <=2 from the below query:
SELECT category, product, total_spend,
ROW_NUMBER()
OVER(
PARTITION BY category
ORDER BY total_spend DESC
)
FROM
(
SELECT category, product, SUM(spend) total_spend
FROM product_spend
GROUP BY category, product
ORDER BY category, total_spend DESC
)AS sub_query
select * from (yourquery) where row_number <= 2
so, totally 3 SELECT statements?
Yes
The way I would've written this, using a CTE, is:
with subquery as (...),
sq_rownumber as (select .... from subquery)
select * from sq_rownumber
where row_number <= 2
Much more readable, imo.
yes....i will write with CTE.
but, here also 3 SELECT stataments will be there, right?
Yes. With a cte or without, it's the same query.
without CTE, it looks like below 🙂
SELECT category, product, total_spend
FROM
(
SELECT *,
ROW_NUMBER()
OVER(
PARTITION BY category
ORDER BY total_spend DESC
)
FROM
(
SELECT category, product, SUM(spend) total_spend
FROM product_spend
WHERE EXTRACT(YEAR FROM transaction_date) = '2022'
GROUP BY category, product
ORDER BY category, total_spend DESC
)AS sub_query_1
)AS sub_query_2
WHERE row_number <= 2
i just read about CTE. it's the neat way of writing sub query. right?
In this example, it's just a nicer way of writing sub queries, yes. But, in other examples where you need to use the subquery multiple times, it's very important.
Even when i have to use single sub query multiple times, we can give alias name to sub query through AS keyword & use, right?
I mean, without using CTE also, we can write sub query only once & use multilpe times...
Show me an example of using a subquery multiple times. [It won't work without a CTE. ]
oh..ok.
I prepared notes for CTE as below. pls correct, if you have few minutes, could you pls correct, if there is any wrong.
What is CTE?
CTE stands for common table expressions.
It is advanced version of sub query which differs from sub query in 2 ways.
- It is much more readable compared to sub query.
- When a subquery has to be used multiple times, you can't use with the help of alias name defined. So, going with CTE is the best option here.
Syntax
WITH my_cte AS (
SELECT a,b,c
FROM T1
)
SELECT a,c
FROM my_cte
WHERE ....
Yes, that looks correct
thx
I think, a stand alone CTE will give error.
There must be a SELECT statement which selects from that CTE
Correct, yes.
You use "with" once:
with q1 as (...),
q2 as (...),
q3 as (...)
select * from ...
``` (corrected to remove trailing comma)
followed same syntax:
Sqlalchemy was updating instead of deleting the children rows because I didn't specify a delete cascade with my user to dm relationship.
I ended up changing my relationship configurations to this: py sent_dms: Mapped[List["Dms"]] = relationship(foreign_keys='Dms.sender', back_populates='sender_user',cascade="all, delete") received_dms: Mapped[List["Dms"]] = relationship(foreign_keys='Dms.receiver',back_populates='receiver_user',cascade="all, delete")
Sqlalchemy has some weird behavior.
SELECT candidate_id
FROM candidates
WHERE (
(candidate_id, 'Python') IN candidates
AND (candidate_id, 'Tableau') IN candidates
AND (candidate_id, 'PostgreSQL') IN candidates
)
;
What's wrong with the syntax here?
syntax error at or near ""candidates"" (LINE: 4)
I'm trying to do the 2nd datalemur problem.
And how do I do what I'm trying to do? What's the right syntax to do this?
I think putting just IN candidates won't work because you haven't selected anything from the candidates table therefore no value (tuple) is being returned.
Okay that sort of makes sense, but now it's showing me some insane error that my alias doesn't exist.
SELECT candidate_id AS this_candidate_id
FROM candidates
WHERE (
(this_candidate_id, 'Python') IN (SELECT (candidate_id, skill) FROM candidates)
AND (this_candidate_id, 'Tableau') IN (SELECT (candidate_id, skill) FROM candidates)
AND (this_candidate_id, 'PostgreSQL') IN (SELECT (candidate_id, skill) FROM candidates)
)
;
column "this_candidate_id" does not exist. FYI: in PostgreSQL all identifiers (including column names) that are not double-quoted are folded to lowercase. So if you're using any uppercased letters in the names and want to specify them in SQL-query, please double-quote them first. (LINE: 4)
Of course it doesn't exist, it's an alias.
I don't get it.
I assume you have to double quote the label for candidate_id
Hold up a sec
Your query is a bit insane
Your query says: "Give me all candidates where any candidate knows Python and any candidate knows Tableau and any candidate knows Postgresql?
It is redundant.
Oh soryr, the tuples... nm
Who writes like this... don't write queries like this
select * from candidates where skill in (x,y,z)
Well, that your current will never work.
Well, what the heck. What is your data model?
Show an example of the data.
I'm doing this one.
https://datalemur.com/questions/matching-skills
Ah, it's a weird name for the table. I'd call taht "candidate_skills", not "candidates".
I don't get what you mean
The naive thing to do is something like: ```sql
with python_candidates as (....),
tableau_candidates as (...),
sql_candidates as (...)
select ... from ...
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python','Tableau','PostgresSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT(skill)) = 3```
The fancy version is what Blaziken just posted... you could count the number
fwiw, I prefer people understand the first version first, because it's more extensible to different types of problems. The second version (counting) fits only certain cases that are exactly designed for this kind of solution.
I just want to check if the three tuples are in the table.
Yeah starting simple is the way to go.
A simpler version of this original idea is:
select *
from (select distinct candidate_id from candidates) distinct_cands
where distinct_cands.candidate_id in (select * from candidates where skill='Python') ...
I don't like that... tho
I would've written, if need be:
select *
from (select distinct candidate_id from candidates) distinct_cands
left outer join (select * from candidates where skill='Python') python_cands
on python_cands.candidate_id = ...
Or maybe,
select candidate_id, candidate_id in (select * from candidates where skill='Python') as knows_python, ...
from (select distinct candidate_id from candidates) distinct_cands
(still, professionally, I would've done what Blaziken did)
The problem is that it's redundant. Imagine if you wanted to check if a candidate had 20 distinct skills.
SELECT DISTINCT candidate_id AS this_candidate_id
FROM candidates AS c1
WHERE (
EXISTS (SELECT candidate_id FROM candidates AS c2 WHERE c1.candidate_id = c2.candidate_id AND c2.skill = 'Python')
AND EXISTS (SELECT candidate_id FROM candidates AS c2 WHERE c1.candidate_id = c2.candidate_id AND c2.skill = 'Tableau')
AND EXISTS (SELECT candidate_id FROM candidates AS c2 WHERE c1.candidate_id = c2.candidate_id AND c2.skill = 'PostgreSQL')
)
;
Won't it optimize that away for you?
Think of how messy this query is
The number of times you're visiting candidates
My duckdb version: ```py
import duckdb
result_df = duckdb.sql("""
create or replace table candidates as
select * from (values
(123,'Python'),
(123,'Tableau'),
(123,'PostgreSQL'),
(234,'R'),
(234,'PowerBI'),
(234,'SQL Server'),
(345,'Python'),
(345,'Tableau')
) t(candidate_id, skill)
;
select candidate_id, list_has_all(list(skill), ['Python', 'Tableau']) from candidates group by candidate_id
""").df()
print(result_df)
SELECT DISTINCT candidate_id AS this_candidate_id
FROM candidates AS c1
WHERE (
(SELECT COUNT(DISTINCT(skill))
FROM candidates c2
WHERE (
c1.candidate_id = c2.candidate_id
AND c2.skill IN ('Python','Tableau','PostgreSQL')
) = 3
)
;
At minimum, you're querying candidates once too many times. You have one query to get all Python candidates, then Tableau, then Postgresql... why bother with the from clause? You just want the inner join of all three subqueries.
I thought the query optimizer is supposed to deal with stuff like that.
Deal with what? Bad queries?
yes
It tries to map your query to the equivalent simplest version, but your query isn't the same as the inner join version.
I guess it'd probably map it to an equivalent of:
select * from candidates a
join candidates b
on a.id=b.id
and b.skill='Python'
join candidates c
on ...
But, things get hard to fathom when you start looking at the cardinalities of each step and how many rows are returned/etc.
I'm going to bet that the count solution will beat any of these solutions easily.
Altho, Blaziken's solution has one bug which I'll leave an an exercise to the reader. Sorry, no, they corrected for it (the distinct)
Ok, I thought it could optimize it to not be redundant.
The solution is close to what he showed here.
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(skill) = 3
ORDER BY candidate_id;
That is this.
That said, the "distinct skill" isn't needed because the question (I see now) says: "There are no duplicates in the candidates table."
I want to complete datalemur. But I've done a few of the easy level questions now, and they're about the difficulty of the questions I dealt with on my final exam. Maybe slightly less, but only slightly.
I feel like it's one of those situations where the class only actually gave us a surface level of knowledge/skill.
Yup, it's worth doing, but just accept that you don't know many of the "tricks" and "patterns", so you should accept "it works" over "it's identical to their solution".
It's better to do something on your own even though it isn't going to be totally correct or it takes a ton of time.
My plan is to try to do the problems, THEN look at their solution.
And hopefully end up absorbing their way of doing things.
Personally, I find other peoples solutions demotivating. THe important thing for a beginner/intermediate is that you can solve the problem. Worry about the optimal solution some other time 🙂
https://datalemur.com/questions/sql-average-post-hiatus-2
Do any of you know what this one is? Is it to find the average amount of time between posts through 2021 for each user?
Dynamic
See lag() window function
How many rows are you loading?
https://datalemur.com/questions/sql-highest-grossing
I have solved this problem using ROW_NUMBER()
But, the official solution contains RANK()
I analysed for some time & for this problem -- Either you solve with ROW_NUMBER() / RANK() / DENSE_RANK, doesn't matter. The reason is; we are partitioning based on SUM(spend) & then taking top 2 SUM(spend) -- Right?
Do you know the difference between rank and row_number?
If not, read this: https://learn.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver16 "ROW_NUMBER and RANK are similar. ROW_NUMBER numbers all rows sequentially (for example 1, 2, 3, 4, 5). RANK provides the same numeric value for ties (for example 1, 2, 2, 4, 5)."
Let's consider that you are selecting salaries column from department table.
When all the salaries are unique, there will be no difference between ROW_NUMBER(), RANK() and DENSE_RANK().
Difference comes when there are same salaries
ROW_NUMBER
No matter what the salary is, irrespective of that, ROW_NUMBER() will produce sequence of numbers one after the other.
RANK()
It's like ranking students based on the marks they got. When two students get same marks, they will be allocated same rank, isn't it. In the same way, RANK() works.
But the point to note down is, when it allocates same rank to different persons, it will skip the next rank, which can be better explained with above example.
DENSE_RANK()
It's like ranking students based on the marks they got. When two students get same marks, they will be allocated same rank, isn't it. In the same way, RANK() works.
But the point to note down is, when it allocates same rank to different persons, it won't skip the next rank, which can be better explained with above example.
The problem specification does not say what to do when there are ties, therefore any of these options are arguably right.
I have generalized when to use when. If you have any points to add to this, pls...
When to use what?
For example, if you were reporting winners, and always needed to report a first, second, and third place, regardless of ties for each place, you would use DENSE_RANK.
Otherwise, you would use RANK, which might mean that there is no second or third place. If you were certain that there could never be duplicates, then you could just use ROW_NUMBER.
I rarely use Rank. row_number is usually what I select first, and then dense_rank if I must consider ties.
But, that's all a matter of preference .
When solving question https://datalemur.com/questions/top-fans-rank
Answer i came up with
SELECT a.artist_name, COUNT(a.artist_name) no_of_time_below_10
FROM artists a
JOIN songs s
ON a.artist_id= s.artist_id
JOIN global_song_rank r
on s.song_id = r.song_id
WHERE r.rank <= 10
GROUP BY(a.artist_name)
ORDER BY no_of_time_below_10, artist_name
LIMIT 5
The official solution in the solution tab was solved with the help of DENSE_RANK(). Where could have been my query deviated from the official solution? pls..
Just to repeat my question in pydis: is your result correct? If not, in what way is it not correct?
my query:
SELECT a.artist_name, COUNT(a.artist_name) artist_rank
FROM artists a
JOIN songs s
ON a.artist_id= s.artist_id
JOIN global_song_rank r
on s.song_id = r.song_id
WHERE r.rank <= 10
GROUP BY(a.artist_name)
ORDER BY artist_rank, artist_name
LIMIT 5
Corresponding output:
official answer in solution tab
WITH top_10_cte AS (
SELECT
artists.artist_name,
DENSE_RANK() OVER (
ORDER BY COUNT(songs.song_id) DESC) AS artist_rank
FROM artists
INNER JOIN songs
ON artists.artist_id = songs.artist_id
INNER JOIN global_song_rank AS ranking
ON songs.song_id = ranking.song_id
WHERE ranking.rank <= 10
GROUP BY artists.artist_name
)
SELECT artist_name, artist_rank
FROM top_10_cte
WHERE artist_rank <= 5;
corresponding output:
And what's wrong with your answer?
😦 that's my question billybobby. I am trying to understand where is my answer deviated from the official. could not find out, so came here for help.
What's your result vs their result? Not the query, the output of the query
first of all, the number of rows itself different....
Yes, because of ties. The problem statement says: "two or more artists have the same number of song appearances, they should be assigned the same ranking, and the rank numbers should be continuous (i.e. 1, 2, 2, 3, 4, 5)"
I agree the specification is ambiguous here: top 5 ranks vs top 5 artists.
i also covered this aspect in my query.
Aren't you ordering ascending? Ariana has the least number, not most.
And limit 5 will never work, because it'll truncate ties.
But your first issue is the sort order
i see your point. because of this, i am getting less no of rows in my output. but, within the output i got, yet the data is mismatching..
That makes perfect sense because you're sorting in wrong order.
And your artist_rank isn't a rank, it's a count of top 10 hits
wrong sorting?
What does your artist rank represent?
i see this logic. But, my thought process was like this:
Let's consider a class room with 2 students as below
name marks
peter 10
heidi 20
If i want to rank the students, i have 2 options
- RANK()
- sort based on RANK which is nothing but, the student with highest marks is 1st rank & the second highest marks will become 2nd rank.
both are giving same kind of result. isn't it?
What does your artisst_rank represent? It's not a rank.
This query gave me clear picture
There you go, that's how you debug!
no_of_times an artist appeared in the top 10 is different from the rank.
i was thinking no_of_times as artist appeared it self as their rank which is the difference between my answer & official answer.
Thank you very much to be there & help me while i debug...
you are gem of SQL...
So, for an data science interview perspective, apart from the below topic, could you pls list down some more important things, based on your experience?
- Aggregate fn's
- Window fn's -- Especially ROW_NUMBER(), RANK() & DENSE_RANK()
- Cluster index & non Cluster index concept
- CTE
What I'd say is start with understanding every part of the SELECt statement: https://www.postgresql.org/docs/current/sql-select.html Indices are advanced topics: it's very database specific, though, so you'll really be learning about Postgres.
Second, practice debugging. What you just did above (modifying query to see debug info) is the most important part. I don't write perfect sql: I'm just good at debugging
Third, there's other advanced topics, like: what's the difference between different databases. How does Dynamo differ from Snowflake/Duckdb/Clickhouse?
Fourth: learning how to express the same query in Pandas vs SQL : all of the problems above could be written in Pandas. You should learn how (if you want to be in a data role)
Thank you BillyBobby. I sent you an friend request. pls accept it(If you are ok)
Your persistence here is going to pay off!
Thanks, I didn't know about that.
https://www.tutorialspoint.com/explain-division-operation-in-relational-algebra-dbms
Does postgresql have this?
I finished a hard difficulty one on sql-practice.
WITH docs_years (doctor_id, `year`) AS (
SELECT DISTINCT doctor_id,
YEAR(admission_date)
FROM admissions
JOIN (
SELECT DISTINCT doctor_id
FROM doctors
)
), adm_map AS (
SELECT *
FROM docs_years
LEFT JOIN admissions AS a
ON docs_years.doctor_id = a.attending_doctor_id AND docs_years.`year` = YEAR(a.admission_date)
JOIN doctors AS d
ON docs_years.doctor_id = d.doctor_id
)
SELECT doctor_id,
first_name || ' ' || last_name AS doctor_full_name,
specialty,
`year`,
CASE WHEN patient_id IS NULL THEN 0 ELSE COUNT(*) END AS total_admissions
FROM adm_map
GROUP BY doctor_id,
`year`
;
But I overcomplicated it because it looks like there aren't any doctors with no admissions.
Let's take apart the first subquery: ```sql
SELECT DISTINCT doctor_id,
YEAR(admission_date)
FROM admissions
JOIN (
SELECT DISTINCT doctor_id
FROM doctors
)
What you're looking for is (something like):
select doctor_id, YEAR(admission_date) year, count(*) num_admissions
from doctors
left outer join admissions
on doctors.doctor_id = admissions.doctor_id
group by doctors.doctor_id, year
Above was for you
Thanks, but would that still work if there were doctors who had no admissions in certain years?
That's why it's a left outer join
Oh right, I see now.
would num_admissions be null?
Or would it be 1?
Oh yeah, I was thinking that if year was null, then COUNT(*) would end up being 1
With related to https://datalemur.com/questions/signup-confirmation-rate
Below query produces result of 0.33 which is correct answer.
WITH basic_data AS(
SELECT e.email_id d, t.email_id n
FROM emails e
LEFT JOIN texts t
ON e.email_id = t.email_id
AND t.signup_action = 'Confirmed')
SELECT ROUND(COUNT(n)::DECIMAL / COUNT(d),2)
FROM basic_data
If i remove the DECIMAL from last but one line, then it results in 0. what could be the reason, pls..
have a read here - https://www.postgresql.org/docs/14/functions-math.html#:~:text=Division (for integral types%2C division truncates the result towards zero)
that's a very common mistake when dealing with division.
Now I got the two hard difficulties from the northwind.db problems.
Except on the last one, I can't figure out why it's SUM(unit_price * quantity * discount) instead of just SUM(quantity * discount)
How could you debug this? How would you write a query that gives more insight into what's happening?
at least one product from each product category.
I don't see any reference to product_category in your attempt.
Are you expected to be really knowledgeable in statistics for most jobs that use SQL?
by summing the product count you are discarding the category
A better way is to use distinct. This should work in postgres. I'm unsure with SQL Server though I assume there is some sort of equivalent syntax.
SELECT cc.customer_id,
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id
HAVING COUNT(distinct product_category) >=3
ORDER BY customer_id
Saying that you may get the right answer if you use COUNT and not SUM here. to make it SELECT customer_id, COUNT(product_count) That way if one category has 2 products its not over represented in the result as the question doesn't ask for it.
Why would the expected output not include month 6? It shows right there that user 445 was active that month and the previous month. I don't get it.
I think it's just saying that the output for June 2022 is that, not that it's the entire output
No that is my output
I see now.
Write a query to obtain number of monthly active users (MAUs) in July 2022
I missed the "July". I thought it wanted the whole year.
That seems easier.
The solution i came up with is:
WITH basic_data_with_count AS(
SELECT cc.customer_id, p.product_category, COUNT(product_name) product_count
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id, p.product_category)
SELECT customer_id, SUM(product_count)
FROM basic_data_with_count
GROUP BY customer_id
HAVING SUM(product_count) >=3
ORDER BY customer_id
not getting where i went wrong...
Think about how to rewrite this query to get more information about what it's returning. For instance: which products and product categories did each customer see? If you do that, you'll see your mistake.
The count(product_name) in your CTE will also count cases when 1 product_category that has multiple product names - consider using COUNT(DISTINCT )
What you need to learn next is: Stop trying to debug a query by analyzing the code. Start debugging by adding information or modifying the query to understand what is happening.
A lot of new developers stare at their code trying to find the bug, rather than getting more information.
what is this
ok. i deleted everything & staring a fresh...
i will write querry step by step & paste here.....when you see somehting is going wrong. just show me red flag
SELECT cc.customer_id, p.product_category, p.product_name
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
now i have basic data in my hand.
what i need to do is: find whether customer is buying at least one product per each product_category. I am thinking on this now.
came up with this:
SELECT cc.customer_id, p.product_category,
CASE
WHEN COUNT(p.product_name) >= 1 THEN 1
ELSE 0
END AS product_buyed
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id, p.product_category
pls leave above...i came up with new approach
SELECT cc.customer_id, p.product_category
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id, p.product_category
HAVING COUNT(cc.customer_id) >= 3
ORDER BY cc.customer_id
It's really hard to debug when you compute the count in the having clause. Move the count to the select, and reference it in the having: select count(...) as mycount from ... having mycount>3. Then, use something like group_concat(...) to inspect the values.
This is a weird formulation of the question: show me all customers that have three or more contracts?
finally....
WITH basic_data AS(
SELECT cc.customer_id, p.product_category
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id, p.product_category)
SELECT customer_id
FROM basic_data
GROUP BY customer_id
HAVING COUNT(product_category) >= 3
finally....
but, this one i did not get. could you pls show me small snippet, when you have few minutes
below is still more improved answer
WITH basic_data AS(
SELECT cc.customer_id, p.product_category
FROM customer_contracts cc
JOIN products p
ON cc.product_id = p.product_id
GROUP BY cc.customer_id, p.product_category)
SELECT customer_id
FROM basic_data
GROUP BY customer_id
HAVING COUNT(product_category) >= (SELECT COUNT(DISTINCT product_category)
FROM products)
may be the measurement_id is potraied as integer in question, but it could be char behind the scenes
@wooden topaz why do you ask SQL questions in #python-discussion?
I recommend taking the time to figure the problem out on your own. You're not always going to have someone help you.
i just wish they would ever respond to me when I ask them not to post these questions in the wrong channel. I've never gotten a reply.
Yeah I just think he should learn to solve things on his own because that how you learn. Also, he has a lot of messages from previous conversations that hold valuable information that could solve his problems.
you guys know any good SQL interview prep sources? i'm looking for something that at least sorts its problems by category, like tagging them as joins for example
this is the only one i've found but i just wanna see if i can find some more variety in the questions
Scroll up a bit, you'll see a few like datalemur
How much statistics do you need to know for most jobs that use SQL?
I tried to do one of the easy statistics questions and couldn't do it.
I didn't understand the terminology or what it was asking.
What question?
'How much' is probably not quantifiable, nor is 'for most jobs'
Link?
Are you in Uni?
I basically only know mean, median, and mode
Yes but not for math
What major?
CS
Doesn't your Cs program require a stats course?
I've never seen one that doesn't.
No, I've never taken stats
I took calculus that I've mostly forgot since it's been awhile.
That's weird. Well, I think it's expected that SWEs should have taken engineering stats. That's my experience, at least. And this question is like chapter 1 of engineering stats
Even business majors cover this in their lighter business stats course (non calc based)
I dunno. It's a 1 semester class, usually. It's not easy, but everyone takes it, ime
Lmk if you know of any resources to learn it quickly. I don't know any statistics at all.
I have summer off. Then I need to go back and I won't have any time.
Off the top of my head, no, but it's a normal course that is probably very similar at every Uni. Find a syllabus/cirriculum and maybe ask in #data-science-and-ml
(More math ppl there than here, I think)
Alright, I'll ask there. Thanks for telling me about this.
so good
Got it.
WITH months AS (
SELECT DISTINCT user_id,
EXTRACT(MONTH FROM event_date) AS "month"
FROM user_actions
WHERE (
(
EXTRACT(YEAR FROM event_date) = 2022
AND (EXTRACT(MONTH FROM event_date) = 6 OR EXTRACT(MONTH FROM event_date) = 7)
)
)
)
SELECT current_month.month,
count(*) AS monthly_active_users
FROM months AS current_month
WHERE EXISTS (
SELECT months.user_id,
months.month
FROM months
WHERE months.month = current_month.month - 1 AND months.user_id = current_month.user_id
)
GROUP BY current_month.month
;
Hello everyone,
I have a quick question. I’m building an app and I’m trying to connect it to an online database, but I actually don’t know how to do it. I keep getting an error saying I don’t have privileges, but gave my user all the privileges. Also, how do I setup an online database for my app. Can I use regular website hosting plan and just create a database?
It sounds like u use we can't possibly know what
Databases can be very different
And it also sounds like u use mysterious provider of some managed databases. Which once again we can't know which one u use
I'm using Python and the provider is Hostinger
Hostinger for web hosting
So what I did, was I created a tic tac toe game, but I need to connect it to an API, which means I need a database
So I used a web hosting platform because it has a database, but to me that doesn't seem like the right solution
Looks very complicated
I will offer using the good provider Hetzner https://www.hetzner.com/
That just provides you with Linux OS machine at which u can run anything
Your partner for system-wide hosting, from cloud to dedicated servers. We deploy the latest tech at the best price in minutes.
still dunno which db u are using, but Sqlite3 is a solid start, that can be run at the same server
For especially perveted people, there is an option to just use shelve 😄 https://docs.python.org/3/library/shelve.html
Sqlite3 is good default for pet projects. But sometimes makes sense dumb solution like shelve too i guess
https://docs.python.org/3/library/sqlite3.html here is tutorial to sqlite3
no library installing is necessary in python for its usage
but using it with ORM is ergh... more simple if necessary
i could recommend using it with https://sqlitebrowser.org/ which is very nice GUI to traverse it
i would say, using migrating system in python is kind of obligatory to have simple usage of SQL
SQLAlchemy + Alembic is strongly encouraged to use in python to have very simple life with SQL
(or Django ORM for Djangoists)
Yeah, I was using SQL. What other database is there? I thought almost all used SQL
I don't know oif there is difference between Sql and sqlite3, but the setup I used was very confusing 😁
SQL is a structured query language, used to interact with relational databases
relational databases store data in a format similar to spreadsheets, with well defined rows and columns, and nearly all of them do use SQL, but there are also a lot of other databases which store data in less structured formats, such as documents (MongoDB) or even graphs (Neo4J)
Oh, well, yes mine is a relational database for sure
SQLite, PostgreSQL and others are databases, which you can interact with using the SQL language
SQL is only used for queries though, you still need of a programming language like Python to interact with other things, and you need of a program that actually sends the query to the database
Yeah, so I'm doing that now using Python, but the problem lies in the query
It just doesn't work 😕
I keep getting an error saying my user doesn't have privileges, so I don't know if it's the database or the provider at this point
I would recommend using SQLite to make sure it works, before you try to debug connecting to an online database
OK 👍
right now it is probably an issue with the provider
SQL is good default, working for 95%+ usage cases.
Do use its engine version Sqlite3, it is just a file in filesystem
Nice for pet projects
there is:
- The database itself
- Whatever is hosting it and managing external connections to the database (their network -> their database) *****
- Whatever you are using to connect to it (your network -> their network) *
- Your program itself (your program -> your network)
Right now it is probably on that * level aka 'fix your credentials'
In general case noSQL databases should be avoided used unless u became professional enough to know when they are justified (very very rarely)
SELECT profit.year,
profit.product_id,
profit.curr_year_spend,
CASE
WHEN (
profit.product_id = LAG(product_id, 1) OVER (ORDER BY profit.year, profit.product_id)
THEN LAG(product_id, 1) OVER (ORDER BY profit.year, profit.product_id)
)
ELSE
NULL
END AS prev_year_spend,
Why is this invalid syntax?
syntax error at or near "THEN"
no wait
yeah, still this.
nvm, apparently it was the parentheses
Thanks again, I got one with it:
WITH profit (product_id, "year", curr_year_spend, prev_year_spend) AS (
SELECT product_id,
EXTRACT(YEAR from transaction_date),
SUM(spend),
CASE
WHEN
product_id = LAG(product_id, 1, NULL)
OVER (ORDER BY product_id, EXTRACT(YEAR from transaction_date))
THEN LAG(SUM(spend), 1, NULL)
OVER (ORDER BY product_id, EXTRACT(YEAR from transaction_date))
ELSE
NULL
END AS prev_year_spend
FROM user_transactions
GROUP BY product_id,
EXTRACT(YEAR from transaction_date)
ORDER BY product_id,
EXTRACT(YEAR from transaction_date)
)
SELECT profit.year,
profit.product_id,
profit.curr_year_spend,
profit.prev_year_spend,
CASE
WHEN
profit.prev_year_spend IS NOT NULL
THEN ROUND(((profit.curr_year_spend - profit.prev_year_spend) / profit.prev_year_spend) * 100, 2)
ELSE
NULL
END AS yoy_rate
FROM profit
ORDER BY profit.product_id,
profit.year
;
A few things: DRY: don't repeat yourself. You have extract(year..) four times.
I don't understand that case in the CTE. Looks like a hack that could be replaced with PARTITION BY product_id
well, I'll eave that there.
The query optimizer wouldn't fix THAT?
That's not the only reason we don't repeat ourselves.
Would you give it an alias and then use the alias?
I remember trying to do that at a certain point, and it said the column doesn't exist, so I've just been repeating expressions instead.
Push it into a subquery and give it an alias there.
do i have to install mongodb to access it or there is another way
what do you mean?
you have to install the driver to connect to the database, but for the database itself, you can either run it locally or in some cloud.
in Mongodb's case, they have a free cloud tier under their MongoDB Atlas offering
so i dont have to ..
thanks!
https://datalemur.com/questions/sql-bloomberg-stock-min-max-1
In official solution present inside the Solutions tab, data is GROUPED BY month. The questions doesn't ask to provide data month wise. still why to GROUP BY date? pls...
oh wait, never mind
Currently, you're analyzing the highest and lowest open prices for each FAANG stock by month over the years.
input data is for every month, highest stock price & lowest stock price
from that input data, i need to pick up below per stock
which month has highest stock price & what is that month
which month has lowest stock price & what is that month
I'd say just try doing it without grouping by and see what you get
I am not signing up to see their Solution though
You're right, this worked.
WITH profit (product_id, "year", curr_year_spend, prev_year_spend) AS (
SELECT product_id,
EXTRACT(YEAR from transaction_date),
SUM(spend),
LAG(SUM(spend), 1, NULL) OVER (PARTITION BY product_id ORDER BY EXTRACT(YEAR from transaction_date)) AS prev_year_spend
I am trying to solve the problem step by step without looking at solution.
Progressed up to here..
SELECT ticker,
MAX(open) highest_open,
MIN(open) lowest_open
FROM stock_prices
GROUP BY ticker
ORDER BY ticker
This is actually correct output. Now, all i want it what is the corresponding month where this highest_open & lowerst_open are happening.
But, i try to include blow in the SELECT statement, it says it needs to be included in the GROUP BY & on the other hand, i should not GROUP BY month. How to deal with this dichotomy?
TO_CHAR(date, 'Mon-YYYY') mth
You don't need to do the aliases in the with part:
WITH profit AS (
SELECT product_id,
EXTRACT(YEAR from transaction_date) as "year", --- use a different name for this, don't use a keyword/function name/etc
SUM(spend) as curr_year_spend,
LAG(SUM(spend), 1, NULL) OVER (PARTITION BY product_id ORDER BY EXTRACT(YEAR from transaction_date)) AS prev_year_spend
as far as I understand:
The database cannot guess what you want to happen when rows inside of the same ticker group have different dates
not sure if you can do something like WHERE/HAVING min/max
" all i want it what is the corresponding month where this highest_open & lowerst_open are happening": You want to know the month of the highest open, and separately, the month of the lowest open? Two different months?
this red colour data is needed
Yah, normal pattern: one query to get the minimums, then a second query to find rows where the value = that minimum.
Another approach is to use row_number to rank the values, then filter on rows where rank = 1
actually, in the solution which is written in solution tab, there is GROUP BY month.
But, the qn dont ask us about month wise data, right? it is asking across whole time......
Then turn the data into a monthly summary first.
no. my question is about why GROUP BY month needs to be there?
I don't understand. If you want to group by the month, then you need to group by the month?
pls forget about the solution. it may make to look through hind sight.
When we read the qn, it no where asks that the results shoud be grouped by month.. So, then why GROUP BY month is used in the solution?
Didn't the question ask for which month has the highest or lowest?
suppose, below is the sample data...
The questions doesn't ask as below
with in Jan months group, what is the highest?
with in feb months group, what is the highest?
with in march months group, what is the highest?
It asks among whole data, which month has the highest & what is that month?
i could be wrong also. in that case, pls tell we where i mis undertood the question....
Yes. One way to calculate the highest and lowest is to summarize the data by month.
????? That completely gives wrong results right?
when you GROUP BY month, the answer will be:
jan 20
feb 40
march 50
but, when you dont GROUP BY, answer is
march 50
If you group by month, you'd then use a second query to find the month with the largest value.
not getting what to do from here, but any way, pls leave it. i am wasting your time also..
SELECT ticker,
MAX(open) highest_open,
MIN(open) lowest_open
FROM stock_prices
GROUP BY ticker
ORDER BY ticker
Don't think of it as a "single query". Think of it as a sequence of questions:
Start at the end: What information do you need to know which months are the highest?
The idea is to work backwards from the result, and construct subqueries that make it easier
does x != y evaluate to FALSE if y is NULL and x isn't?
WHERE user_id = lag2 AND (user_id != lag3 OR lag3 IS NULL)
There was a WHERE clause that didn't work until I wrote it like that.
Not disagreeing. This approach works. But what works me as well is breaking it up. Get the first small thing because I can verify that works. Then build complexity onto it. Just 2 cents.
fair, I'm not actually sure which way I do it. end to beginning or beginning to end. I guess depends if I already know the pattern.
This is super important to understand, and so important that I won't tell you the answer. Test it yourself.
(after you knwo answer, I'm happy to explain)
It seems like it evaluates to false.
Even better, what if both are null?
I know that null isn't equal to null. But I didn't check if it's inequal to null.
You know what I mean?
You should check. Very important.
SELECT *
FROM candidates
WHERE (
NULL = NULL
OR NULL != NULL
OR NOT (NULL = NULL)
OR NOT (NULL != NULL)
)
LIMIT 1
;
This gets no data.
SELECT *
FROM candidates
WHERE NOT (
NULL = NULL
OR NULL != NULL
OR NOT (NULL = NULL)
OR NOT (NULL != NULL)
)
LIMIT 1
;
This also gets no data.
SELECT *
FROM candidates
WHERE NOT (
NULL = NULL
OR NULL != NULL
OR NOT (NULL = NULL)
OR NOT (NULL != NULL)
) IS NULL
LIMIT 1
;
This also gets no data.
SELECT *
FROM candidates
WHERE (
NULL = NULL
OR NULL != NULL
OR NOT (NULL = NULL)
OR NOT (NULL != NULL)
) IS NULL
LIMIT 1
;
This DOES get data.
That seems really weird.
You should build a truth table
Is there an existing truth table?
Or a set of rules for this?
It seems like any comparison involving NULL evaluates to NULL.
You should build one... but:
The idea is this - null represents absence of knowledge.
You can't reason about two unknowns.
Does "I don't know" equal "I don't know?
Does "I don't know" not equal "I don't know"?
Then is this accurate?
Test it
It seems like it does, but I'm not really sure.
And google isn't telling me yes or no.
Idk if there are exceptions to it evaluating to NULL.
Why aren't you sure? What test isn't clear?
I mean if just not sure if the tests I did are comprehensive enough.
I just did some comparison tests, like less than, greater than, equal to, inequal to
You know what I mean?
Yup, I understand... so, now google: "sql three valued logic". That's what you're experiencing.
Normally NULL handling differs from DB to DB as well
Even for something like THAT?
The stuff you're looking at I think is well standardized.
I hope so
there are differences in some of the edge cases, like concatenation with nulls, for instance
So basically think of NULL as "it could be anything" instead of "nothing"?
That's what it seems like.
Yes exactly
Alright then, that clears that up. Thank you.
Also many dbs have null safe comparisons. Some db's use 'is distinct from', MySQL has operators like <=> https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to
sqlserver 🗿
!paste
what do i even do
db = mysql.connector.connect(
host="localhost",
user="root",
passwd="root"
)```
im just tryna setup mysql 😭
Did it work?
the error is the the paste
i dont how this even works tbh im just tryna figure it out as i go
passwd != password
ty
getting sam error
Try this: https://stackoverflow.com/a/50027851
i dont understand that page, does that code go into command prompt?
sorry i know im asking to be spoonfed but im confused
have you tried logging into your mysql without python?
OK then run the commands or alter your user to use the mysql_native_password. Not sure how that looks like in workbench
But its something to edit your user in mysql
wdym alter user
alter also means "change"
ik but how
🤷♂️
Idk what that looks like in workbench
But
The command stated in that answer shows it can be done via CLI
If you can login to your mysql server via command prompt/cli, then you can alter your user from there (by following the commands shared)
Can someone help me complete the following process? I am trying to download Postgres to do this but I dont know which version I need and what packages to use?
Setting up a Django project locally.
Setting up a MySQL or Postgres database locally, ideally using Docker.
Make sure things can run locally, all connected.
Understand what properties are being edited to control the connection between the Django app and database.
Use ENV variables to define those properties.
Figure out how to Dockerize the Django app
Deploy a cloud database. (AWS RDS? but that costs money? other suggestions)
Set up users and grants in the cloud database.
Deploy the Django app in a cloud provider with the right ENV variables so it points to your cloud database
Use the official Docker image https://hub.docker.com/_/postgres
The PostgreSQL object-relational database system provides reliability and data integrity.
I am reading this, does it say anything anywhere about what Postgres version to download, also is it true I do not need to techncially download anything for Docker?
By default Docker will pull the latest version.
If you haven't installed Docker, you may need to download that
i have download docker I keep getting this error
Is that Amazon Linux? Have you tried sudo yum install docker? https://www.cyberciti.biz/faq/how-to-install-docker-on-amazon-linux-2/
WITH cte1 AS (
SELECT SUM(num_users) OVER (ORDER BY searches) user_count,
searches
FROM search_frequency
WHERE num_users > 0
ORDER BY searches
), cte2 AS (
SELECT user_count AS uc_high,
searches AS s_high,
LAG(user_count) OVER (ORDER BY user_count) AS uc_low,
LAG(searches) OVER (ORDER BY user_count) AS s_low
FROM cte1
ORDER BY user_count
), cte3 AS (
SELECT MAX(uc_high) AS uc_total
FROM cte2
), cte4 AS (
SELECT *
FROM cte3
CROSS JOIN cte2
WHERE uc_low <= (CAST(uc_total AS DECIMAL) / 2) AND uc_high > (CAST(uc_total AS DECIMAL) / 2)
), cte5 AS (
SELECT
CASE
WHEN MOD(uc_total, 2) = 1 THEN s_low
ELSE
CASE
WHEN uc_low = (CAST(uc_total AS DECIMAL) / 2) THEN (CAST(s_low AS DECIMAL) + s_high) / 2
ELSE
s_low
END
END AS median_not_rounded
FROM cte4
)
SELECT ROUND(median_not_rounded, 1) AS median
FROM cte5
;
https://datalemur.com/questions/median-search-freq
Their solution did this in ten lines. It's kind of frustrating.
WITH searches_expanded AS (
SELECT searches
FROM search_frequency
GROUP BY
searches,
GENERATE_SERIES(1, num_users))
SELECT
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (
ORDER BY searches)::DECIMAL, 1) AS median
FROM searches_expanded;
The searches_expanded part makes no sense at all to me. grouping by searches, GENERATE_SERIES(1, num_users)) should be exactly the same as grouping by searches, num_users.
Because it just returns a series of numbers from 1 to num_users.
Also, when I put GENERATE_SERIES(1, num_users) in the SELECT clause, it evaluates to integers, and I can't figure out where they're coming from. I don't get what it's doing.
The documentation doesn't explain it either.
it's an abuse of group by.
think of it this way
WITH tmp AS (
SELECT searches, GENERATE_SERIES(1, num_users) AS searches_user_id
FROM search_frequency
)
SELECT
searches
FROM
tmp
GROUP BY
searches, searches_user_id
you can disregard the group by abuse that the "model answer" uses tbh.
i.e. just use
WITH searches_expanded AS (
SELECT searches, GENERATE_SERIES(1, num_users)
FROM search_frequency
)
SELECT
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (
ORDER BY searches)::DECIMAL, 1) AS median
FROM searches_expanded;
I don't get what GENERATE_SERIES(1, num_users) is doing
of which
This section describes functions that possibly return more than one row.
possibly return more than one row. is the key point
Ok, I think I see now.
Doesn't that blow up the number of rows?
It seems like it would be kind of inefficient.
yeah i would assume so.
Yeah wtf, it would end up generating the trillions of rows like you're supposed to be avoiding.
I don't get what their idea is.
Anyways, are there any ways that this could be written more concisely?
I kept having a hard time with stuff like not being able to nest window functions, or not being able to look at the maximum in a column in the WHERE clause. Stuff like that made it so I couldn't figure out how to reduce the number of CTEs
i think this is just a quiz on your knowledge on window function, if you have written SUM(...) OVER(...) then i am already happy as an interviewer tbh.
the rest is just an execrise of how careful you are
i would have used a bunch of scalar query instead of cross joins
WITH cumsum AS (
SELECT
SUM(num_users) OVER(ORDER BY searches) AS running_sum, searches
FROM
search_frequency
), total_users AS (
SELECT SUM(num_users) AS val FROM search_frequency
), mid_point AS (
SELECT val::decimal / 2.0 AS val FROM total_users
), aux AS (
SELECT
(SELECT searches FROM cumsum WHERE running_sum <= (SELECT val FROM mid_point) ORDER BY running_sum DESC LIMIT 1) AS bot
, (SELECT searches FROM cumsum WHERE running_sum > (SELECT val FROM mid_point) ORDER BY running_sum ASC LIMIT 1) AS top
, MOD((SELECT val FROM total_users), 2) = 0 AS should_avg
)
SELECT
CASE WHEN should_avg THEN (bot + top)::decimal / 2.0
ELSE bot
END
FROM aux
(not sure if the above is correct as well tbh.)
Yeah it was
SELECT ROUND(
CASE WHEN should_avg THEN (bot + top)::decimal / 2.0
ELSE bot
END,
1
)
FROM aux
WITH cte as(
SELECT *,
SUM(num_users) OVER(ORDER BY searches) AS cumsum_users,
SUM(num_users) OVER() AS tot_users
FROM search_frequency
)
SELECT ROUND(AVG(searches), 1) AS median
FROM cte
WHERE (
tot_users <= cumsum_users * 2
AND tot_users >= (cumsum_users - num_users) * 2
)
;
Here's one someone posted in the discussions. The first comment posted there, the one with this answer, points out how inefficient the recommended answer is.
that's neat
SUM(something) OVER ()
That's one of the things I was looking for.
I could have done that instead of the cte that just joins with the MAX(uc_high) cte.
It only calculates SUM(something) once, right?
I don't have a very good idea of how smart query optimizers are.
What was the original question?
I came up with this.
Why not use percentile_cont?
Because then you need to generate two trillion rows.
Why would it generate two trillion rows?
Because it says there are two trillion users, so using GENERATE_SERIES for each num_users would create that number of rows.
SELECT CAST(ROUND(CAST(SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END), FLOAT) / COUNT(*), 2), TEXT) || '%' AS percent_of_male_patients
FROM patients
;
near ",": syntax error
Why is there a syntax error in this?
SELECT CAST(ROUND((CAST(SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*)) * 100, 2) AS TEXT) || '%' AS percent_of_male_patients
FROM patients
;
This fixed it.
SELECT round(100 * avg(gender = 'M'), 2) || '%' AS percent_of_male_patients
FROM patients
;
Apparently this works too.
Is giving a boolean expression to an aggregate function only possible on certain SQL implementations? It lets me do it on sql-practice but not on datalemur.
Use a sql formatter like https://sql-formatter-org.github.io/sql-formatter/
Yes, I think that works in MySQL but not Postgres for instance. See dbfiddle.uk to test on different dbs. ** but it's normally not allowed by SQL standard: use an if statement or case to convert to binary, or use count(*) filter .... instead
i did it!
now how do i use this to make a database? should i make it in docker or use docker to make it somewhere else?
I am trying to figure out how to deploy a django REST api with a MySQLdatabase on an AWS EC2 server
In order to do this should I still use Postgres/Docker to access MySQL or should I use AWS RDS or something?
Thanks. I hate CASE for ternaries, it takes up so much space. IF(expression, 1, 0) is okay, but it doesn't work on datalemur.
I completed all the available hard difficulty SQL questions on datalemur, sql-practice, and leetcode. But I still find that I need to look at the documentation a lot.
So do I
hello if someone could help me that would be great. its stupid but i need help merging a csv file into my pycharm ide file
oh wow this indian lady just helped me on youtube, 1 hour of trying solved in 30 seconds, #indian people are great
does this look accurate? now i can edit and manipulate the data is that correct?
@woeful coral #discord-bots message
yes, there are a few different sqlite extensions that provide encryption support
personally i found SQLite3MultipleCiphers the most convenient mainly because it provides prebuilt windows DLLs (see their releases page), but also because its completely free compared to SQLCipher and sqlite's official encryption extension, SEE
You seem very confused and might need to take a few steps back,. follow a basic tutorial or two, etc.
For starters, Postgres and MySQL are both DB engines so you want to pick one or the other and not use both.
You can host either one via RDS instead of a Docker container which is more convenient and reliable for production with automatic backups and such , but does cost money.
But would they let you look it up in an interview?
I guess it depends on what 'it' is. I don't really know: I see very few candidates who can solve basic sql problems, but I'm also good at teaching it, so I don't really know.
Anyone experienced with setting up local mongodb?
I’m trying to decide if I should do my apps rate limiting/usage tracking through local mongodb or redis and shared mongodb which gets periodically synced.
Obviously I want speed, the reason I’m doing this in the first place, however if the speed isn’t THAT different from local mongodb and redis id gladly take the little slowdown of mongodb to make it easy and no syncing needed. But if the difference is substantial or not even that far from shared mongodb time then might as well go through the work of redis and syncing
I tried installing local mongodb got ton of errors that seemed common but all the common fixes did nothing for me…
Because of this, I wasn’t able to run any benchmarks to see what my best choice would be.
So I’m asking for either help setting it up locally on Ubuntu or with giving me your experiences using local mongodb to get an idea of the speed. Thanks
<@&831776746206265384> shitposting
!cban 1245012707523170399 You seem to have joined just to troll.
:incoming_envelope: :ok_hand: applied ban to @bleak plover permanently.
Spinning up mongodb with the official Docker container should be trivial, and highly performant as well
Howdy there, having some issues with my sqlalchemy set up. I am getting a reference error. I have tried forward referencing, and using the typing's List type rather than the build in list. I am still getting blocked and am very confused. Any advice would be much appreciated. I have included the code below as well.
sqlalchemy.exc.ArgumentError: Could not locate SQLAlchemy Core type for Python type typing.List[services.host.ip.ip_record_sqlalchemy.IpRecordSqlAlchemy] inside the 'ip_history' attribute Mapped annotation
ip_record_sqlalchemy.py
from datetime import datetime
from typing import TYPE_CHECKING
from uuid import uuid4
from services.automatic_fiesta_base import AutomaticFiestaBase
from services.host.ip.ip_common import IpDTO
from sqlalchemy.orm import Mapped, mapped_column, relationship
if TYPE_CHECKING:
from services.host.host_record_sqlalchemy import HostRecordSqlAlchemy
class IpRecordSqlAlchemy(AutomaticFiestaBase):
__tablename__ = "ips"
id: Mapped[int] = mapped_column(primary_key=True, default=uuid4().int)
address: Mapped[str] = mapped_column()
last_seen: Mapped[datetime] = mapped_column()
host: Mapped["HostRecordSqlAlchemy"] = relationship(back_populates="ip_history")
def ip_record_to_dto(ip_record: IpRecordSqlAlchemy) -> IpDTO:
return IpDTO(
address=ip_record.address,
last_seen=ip_record.last_seen,
)
host_record_sqlalchemy.py in thread.
oops, i don't see an option for threads, so here it goes:
host_record_sqlalchemy.py
from datetime import datetime
from typing import TYPE_CHECKING, List
from uuid import uuid4
from pydantic import PositiveInt
from services.automatic_fiesta_base import AutomaticFiestaBase
from services.host.host_common import (
HostDTO,
HostRecordNotFoundException,
)
from services.host.host_record_base import HostRecordDatastoreABC
from services.host.ip.ip_common import IpDTO
from services.host.ip.ip_record_sqlalchemy import IpRecordSqlAlchemy, ip_record_to_dto
from services.scan.directory_scan.directory_scan_record_sqlalchemy import (
directory_scan_record_to_dto,
)
from services.sqlalchemy.sqlalchemy_database import SqlAlchemyDatabase
from sqlalchemy import select
from sqlalchemy.orm import Mapped, mapped_column, relationship
if TYPE_CHECKING:
from services.scan.directory_scan.directory_scan_record_sqlalchemy import (
DirectoryScanRecordSqlAlchemy,
)
class HostRecordSqlAlchemy(AutomaticFiestaBase):
__tablename__ = "hosts"
id: Mapped[int] = mapped_column(primary_key=True, default=uuid4().int)
hostname: Mapped[str] = mapped_column()
os_type: Mapped[str] = mapped_column()
os_version: Mapped[str] = mapped_column()
os_full_type: Mapped[str] = mapped_column()
ip_history: Mapped[List["IpRecordSqlAlchemy"]] = mapped_column(back_populates="host")
directory_scans: Mapped[List["DirectoryScanRecordSqlAlchemy"]] = relationship(
back_populates="host"
)
check_ins: Mapped[List[datetime]] = mapped_column()
I realized that I did not do a forward ref in host_record_sqlalchemy.py. I have updated the type checking to this:
if TYPE_CHECKING:
from services.scan.directory_scan.directory_scan_record_sqlalchemy import (
DirectoryScanRecordSqlAlchemy,
)
from services.host.ip.ip_record_sqlalchemy import IpRecordSqlAlchemy
However, getting similar error
sqlalchemy.exc.ArgumentError: Could not resolve all types within mapped annotation: "sqlalchemy.orm.base.Mapped[typing.List[ForwardRef('IpRecordSqlAlchemy')]]". Ensure all types are written correctly and are imported within the module in use.
I am silly,
ip_history: Mapped[List["IpRecordSqlAlchemy"]] = mapped_column(back_populates="host")
should be
ip_history: Mapped[List["IpRecordSqlAlchemy"]] = relationship(back_populates="host")
need to be assigned to a relationship not a mapped column. my bad...
I am running this command correctly by using two lines?
bash-4.4# is part of the prompt. Don't copy it when following guides.
new plan, i am using aws instead of powershell, but this happened. How do i write commands once i run mysql in aws?
you can't. aws databases are hosted, so you can't access the console.
run mysql on a ec2 instance using the host of the database
ok I am trying to get it running on powershell first but somehow the password is not example_pass ideas?
Is your database exposed to the internet?
no
yes
how do i access my docker through AWS EC2?
install the mysql-client
sudo yum install mysql
you might also be able to use dnf instead of yum depending on the os version
It's dnf on 2023
do I need to run anything else before that?
Looks like you need to manually add the mysql repo
Follow instructions here. https://dev.mysql.com/doc/refman/8.0/en/linux-installation-yum-repo.html
I suspect the complete lack of context for your question may be causing confusion.... Where is docker running? On your local machine?
i have it downloaded so i think so
Wait, I see the earlier screenshot that shows MySQL running on a different ec2?
they time out or otherwise have errors so i have to restart them (unless i am doing something wrong?)
there wouldnt be any latency overhead from docker? - how does speed compare against similar actions via redis (obviously redis is faster, just trying to see by how much)
"Install the downloaded release package with the following command, replacing platform-and-version-specific-package-name with the name of the downloaded RPM package:" What does this mean, after I download the package I have to manually edit it through powershell or something?
will it only understand these commands after I install MySQL?
amazon linux uses dnf, not apt
Hi, i'm trying to get mysql working with my flask application but i keep getting AttributeError: 'NoneType' object has no attribute 'cursor' when i try to get any data from my table. I've double checked the config details.
Negligible in theory but best to do your own testing. If you're comparing a local container vs a remote cloud instance, that added network latency is what's going to matter
Is there a way to generate and write migration files for postgres in pure python anyone can recomend?
Like, I'd like to make a migrationfile that creates a couple of tables. Any packages that would allow me to generate such a file and ensure that the migration is applied to the db?
Alembic? I'd rather use Django though
Hello,
I am currently solving https://datalemur.com/questions/sql-bloomberg-stock-min-max-1
As a first step, i came up with below query
SELECT ticker,
MAX(open) highest_open,
MIN(open) lowest_open
FROM stock_prices
GROUP BY ticker
ORDER BY ticker
corresponding output is:
Now, need one more column TO_CHAR(date, 'Mon-YYYY') mth in the above query. could any one pls tell me how to do this.
Why? Also is it possible to use Django migration functionality without applying the framework overall in the broader scope?
I had that in a CTE and then joined it with stock_prices twice. Once for the highest dates and again for the lowest dates.
When you're using MIN and/or MAX, is there a way to list values associated with the minimum and maximum rows aside from the minimum and maximum values?
It would make this easier and probably faster.
Compared to this way.
I always had frustrating problems with Alembic I couldn't understand... Django's migrations just work.
You would need to declare and maintain your models.py but otherwise I think you could ignore the rest.
Some databases have argmax/arguing functions. Search that.
Here's a Postgres implementation: https://github.com/bashtanov/argm
And I'd be a bad DuckDB shill if I didn't: https://duckdb.org/docs/sql/aggregates.html#arg_maxarg-val
If you're still stuck: if you know the highs for every stock, then you can join back to the original data where the high = value
So you couldn't use it on datalemur? That sucks. They should add it to the vanilla version of postgresql, it seems like it would be useful.
yo currently using postgresql with fastapi & sqlalchemy
was wondering how should I implement RLS or RBAC
Ok, thanks!
finally: after lot of tries & tries since 3 days. I mean, there is a solution in the site, but looking at that & proceeding in the same way would gives the answer, but i thought of trying on my own, step by step by understanding...
SELECT
ticker,
(SELECT TO_CHAR(date, 'Mon-YYYY')
FROM stock_prices as t2
where t2.open = MAX(t1.open)) highest_mth,
MAX(t1.open) highest_open,
(SELECT TO_CHAR(date, 'Mon-YYYY')
FROM stock_prices as t2
where t2.open = MIN(t1.open)) lowest_mth,
MIN(t1.open) lowest_open
FROM stock_prices t1
GROUP BY ticker
ORDER BY ticker
By the way, i did it using corelated sub query, on the other hand you suggested to use JOINS right. I did not get that.
could you pls modify the above query using JOINS & show me once(Nothing urgent, whenever you have few minutes). Thx
Try something like: FROM stock_prices t1 JOIN ( SELECT ticker, MAX(open) max_open, MAX(date) date FROM stock_prices GROUP BY ticker ) highest ON t1.ticker = highest.ticker AND t1.open = highest.max_open
Your version is almost right, but the correlated subqueries are wrong (you need to also join on ticker, and you need to handle case where multiple dates match the high)
not right? The solution got accepted...
dm = Dms(sender=username_1,receiver=username_2)
dm_message = Dm_Messages(text="hi",username=username_1,date=datetime.now())
dm.dm_messages.append(dm_message)```
UPDATE dm_messages SET dm=$1::INTEGER WHERE dm_messages.id = $2::INTEGER
[generated in 0.00020s] (None, 1)
When I try to create a related sqlalchemy model via relationship, it runs this query under the hood.
This is leading to a not null constraint error. Is there anyway to avoid this behavior?
What would your query do if two tickets had the same high?
already thought about this & enhanced the query....
SELECT
ticker,
MAX(t1.open) highest_open,
(SELECT TO_CHAR(date, 'Mon-YYYY')
FROM stock_prices as t2
where t2.open = MAX(t1.open)
LIMIT 1) highest_mth,
MIN(t1.open) lowest_open,
(SELECT TO_CHAR(date, 'Mon-YYYY')
FROM stock_prices as t2
where t2.open = MIN(t1.open)
LIMIT 1) lowest_mth
FROM stock_prices t1
GROUP BY ticker
ORDER BY ticker
Thats doesn't fix the problem correctly: what if two tickers had the same max in different days?
any ways, i tried new approach:
WITH highest_values AS(
SELECT ticker,
MAX(open) highest_open,
TO_CHAR(date, 'Mon-YYYY') highest_mth,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open) AS row_num
FROM stock_prices
GROUP BY ticker, open, date
ORDER BY ticker),
lowest_values AS(
SELECT ticker,
MIN(open) lowest_open,
TO_CHAR(date, 'Mon-YYYY') lowest_mth,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open) AS row_num
FROM stock_prices
GROUP BY ticker, open, date
ORDER BY ticker)
SELECT
highest_values.ticker,
highest_values.highest_mth,
highest_values.highest_open,
lowest_values.lowest_mth,
lowest_values.lowest_open
FROM highest_values
JOIN lowest_values
ON highest_values.row_num = lowest_values.row_num
AND highest_values.row_num = 1
AND lowest_values.row_num = 1
look at the last two lines alone.....
i am only joining the rows where the row number is 1, right?
but at the output, every row is coming.
yes
I don't believe they are: I'm talking about highest values and lowest values.
irrespective of highest & lowest, as per join condition, only those rows whose row number = 1 should be matched, right?
Your where clause does filter on row-num = 1
correct. thats because i only want the 1st row.
by the way, it's ON clause
So why are you asking about the row number? You're only getting row numbers 1, but your subqueries are wrong.
sub queries are worng? For example, if you see the sub query which gets the highest values
whats wrong with it?
for each and every ticker, this sub query is presenting all the rows from highest to lowest with corresponding row_num's
Also, secondly, look at your on clause: you're not comparing tickers
If this is your intent, why do you even had a group by?
There are many tickers, but the values need to be calculated per ticker group
That's what the window partition does.
Row_number() over (partition by ticker order by open) ranks each row by open, for each ticker.
right.
So what's the group by doing?
how about this: this MAX(OPEN) --> this aggregate needs to be executed per GROUP of ticker.
Why? Row number is already telling you the largest value
The way you wrote it, that's the max value for each (ticker, open, date), which is the same as the open (since only one value per date). Max does nothing.
yes. but unless you keep ticker & date inside the GROUP BY, it wont work.
Your aggregate is doing nothing in this query.
You're computing the max(open) over sets of exactly 1 value, since there's only one open for each (ticker, open, date)
right & i also corrected the mistake at JOIN ON clause & below is the perfect working query
WITH highest_values AS(
SELECT ticker,
open highest_open,
TO_CHAR(date, 'Mon-YYYY') highest_mth,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open DESC) AS row_num
FROM stock_prices),
lowest_values AS(
SELECT ticker,
open lowest_open,
TO_CHAR(date, 'Mon-YYYY') lowest_mth,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open) AS row_num
FROM stock_prices)
SELECT
highest_values.ticker,
highest_values.highest_mth,
highest_values.highest_open,
lowest_values.lowest_mth,
lowest_values.lowest_open
FROM highest_values
JOIN lowest_values
ON highest_values.ticker = lowest_values.ticker
AND highest_values.row_num = 1
AND lowest_values.row_num = 1
You know you don't need two subqueries now
You could compute both ranks in one query
And I'd optimize by pulling the to_char into the outer select. You only need it for the result
you mean joining same table with itself?
Yes (same query)
Or just refactoring to have one window query, then a highest and lowest query that uses that, and then a final query to join them
WITH data AS(
SELECT ticker,
open highest_open,
open lowest_open,
TO_CHAR(date, 'Mon-YYYY') highest_mth,
TO_CHAR(date, 'Mon-YYYY') lowest_mth,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open DESC) AS highest_row_num,
ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY open) AS lowest_row_num
FROM stock_prices)
SELECT
d1.ticker,
d1.highest_mth,
d1.highest_open,
d2.lowest_mth,
d2.lowest_open
FROM data d1
JOIN data d2
ON d1.ticker = d2.ticker
AND d1.highest_row_num = 1
AND d2.lowest_row_num = 1
i know below things can be written only once, but atleast from readabiliy perspective, it looks easy to understand like this. thats why left like this
open highest_open,
open lowest_open,
TO_CHAR(date, 'Mon-YYYY') highest_mth,
TO_CHAR(date, 'Mon-YYYY') lowest_mth,
Now, there's a trick you can use to eliminate the join, using an aggregate
You could get the first() filtered where row num is 1
And I would move the t o_char to the outer query.
Hello! I want to ask in MongoDb how to update or push the array by matching a key and value pair?
So when I used $push operator it always inserting new array in test. But I want this to behave like, it should match with the user_id if same user_id already exist in the test array just update that else it will push an new array to test. How can I do it in a single operation.
data = {
"test": [
{
"user_id": 1,
"key": "value"
},
{
"user_id": 2,
"key": "value"
}
]
}
I believe upsert is what you're asking for: https://www.mongodb.com/docs/drivers/node/current/fundamentals/crud/write-operations/upsert/
When i am reading about AWS step functions, it is always referred as serverless fn orchestrator. So, an step function can not orchestrate services like EC2, because it is server based service?
There is one more AWS service named AWS Glue, which is called as server less Data pipeline provided by AWS. Through AWS we can define whole work flow such as data loading, data cleaning & data loading. In fact we can even schedule jobs.
So, isn't it both AWS Glue & Step fn's are doing same job?
using .json file as my discord bot database? Is this good or suitable if i plan to make my bot high scale?!
it is a very bad idea if you want to scale at all
I would recommend starting with SQLite + https://github.com/Rapptz/asqlite but you can look into MongoDB if you want to stick with an unstructured data format
You can orchestrate multiple Glue crawlers in parallel with Step Functions. https://aws.amazon.com/blogs/compute/orchestrating-aws-glue-crawlers-using-aws-step-functions/
This blog post is written by Justin Callison, General Manager, AWS Workflow. Organizations generate terabytes of data every day in a variety of semistructured formats. AWS Glue and Amazon Athena can give you a simpler and more cost-effective way to analyze this data with no infrastructure to manage. AWS Glue crawlers identify the schema of […]
Hi, Does anyone know how to tackle this error ?
js exceptional error 503
This is a db channel, not js. You can ask in off topic.
how deep does the database rabbit hole go? Wha am I in for?
All the way to Australia
Get me a kangaroo and let’s go to the outback then.
the regular non web developer usually just learns Raw SQL, able to use Sqlite3 if needed, and calls it a day.
Usually learn designing relational tables and writing complex queries to query anything
for backend developers (data engineers) journey only begins with raw SQL in depth, preferably for specific also engine details (like postgresql, or optionally mysql/mariadb)
Also of course learn designing and querying in all ways
They learn how to debug it and performance optimize it (with operating millions of row records), with solving its different issues from at query level to maintenance level a bit
small enough app can be using relational db for almost everything.
As they need to go further, it is not uncommon learning Elastic Search or its equvalients. (to feel gaps in functionality)
working with blob storage databases for designing dynamic assets stateless (for horizontal scaling) backend apps
and working with other dbs depending on needs (than more data, more scale of app, then more specific usage cases can be needed)
redis/memcache for caching stuff
redis/rabbitmq or alterantives for message queue stuff (or even Kafka is common for event streaming db)
due to managed databases simplifying life a lot, it is also common choosing cloud provider to specialize in (AWS, GCP or Azure)
Maintanance of databases is usually in hands of System Administrators.
but they can be having also Database Administrators, specialized guys responsible for it with expertise in depth
If they have DevOps engineers, they can be usually responsible for infrastructure as a code approach to keep running managed databases at cloud provider, or having running dbs in kubernetes like https://cloudnative-pg.io/
fifty shades of databases, with its self hosted options, and cloud provider specific managed alternatives
So at certain level every dev learns it
But for backend (+data engineers + database experts) devs databases make the major part of technology fluff they need to learn
also databases are one of main reasons (but not all of them) why frontend devs learning small bit of backend and calling themselves full stack devs are heavily frowned upon sometimes. They have to learn dealing with databases properly, otherwise app is greatly screwed up in terms of performance and storage consumptions. At beginner level there are plenty of rookie mistakes that can make 100 times difference.
I have this sqlite3 database for a bot for a VR game that adds a leveling system, but when I try to use this function I get None returned even if the user id is in the database, does anyone know why this would happen?
def get_level(user_id):
c.execute('''SELECT level FROM users WHERE id = ?''', (user_id,))
result = c.fetchone()
if result:
return result[0]
else:
return None
Try viewing db with https://sqlitebrowser.org/ for easier seeing what is inside
Add print/logging functioning verifying what you insert as user_id
Try to query same queries manually in the sqlitebrowser ui
also, try using fetchall instead of fetchone ^_^ just for debug to see a difference
and give comparison with removed WHERE condition and fetchall used too
thank you mate
is sqlite better on python or vscode?
vscode
The question is nonsensical. What do you mean?
Does anyone know how to transfer 1 postgres db thats in a docker to another host? But without losing the data in it..?
Just needing docs or something on it?
so which is better?
Which what?
Did you mean pycharm vs VSCode?
no, like two years ago I downloaded some sqlite extension for VS code
One is a programming language and one is a IDE
which is better in terms of interactively exploring sqlite databases? writing a python script can be cumbersome for that depending on what you need, as for your vscode extension i cant say how good the experience is
personally i recommend the SQLiteStudio desktop app for interacting with databases, you can easily see the schema, get an overview of a table's data, and of course run queries on it
https://sqlitestudio.pl/
if however you need to programmatically access your database or automate some work, only python can do that
Ty.
We already got it, but thank you.
I'd use that for convenience sometimes if I'm just checking something. For more indepth database work I'd use DBeaver. What's "better" is always dependent on context and preference
I don’t remember. It was 2 years ago. It was SQLite that someone made and you just made it a environment variable and you could use it in VSCODE. I only brought it up because I learned SQLite that way first
An easier approach than SQL dump, can be... Just moving docker volumes to new host
For large DBS that can be faster at least
"We already got it, but thank you."
Deleting your question might be a more appropriate response than wasting multiple people's time with it and then being snarky about it. 😅
"We already got it, but thank you."
room_number could be a little dicey but maybe acceptable. Would you ever want to store anything else about a room, like how many students it accommodates? If yes, make it a table
thats fair. i was more curious about the enrollments relation referencing 3 tables
I see... Maybe this is better: https://dba.stackexchange.com/questions/278459/school-database-schema
:incoming_envelope: :ok_hand: applied timeout to @true zinc until <t:1719351915:f> (10 minutes) (reason: duplicates spam - sent 4 duplicate messages).
The <@&831776746206265384> have been alerted for review.
I tried to import large CSV files to MySQL Workbench database but I got this error. I wonder if there's any chance I coudl fix?
You mean, change the data type?
Should I change the data type in MySQL or with Python?
The file is 2.7 GB, I think it may take plenty of time for me to process the file in Python.
always do the processing out of Python if you can because usually the third party tool is usually 100x faster
What tools should I use and free?
Could you send more context?
I should've asked that earlier.
MySQL (what you're using) is pretty good since it runs a C compiler.
Sure, lemme upload the file to Google Drive and I'll send you the link
Lmk what you think
Dang. I can see why mysql raised that error. That's a lot of data.
I was able to read this file in Python
Should I load this to a dataframe then load to MySQL database?
Can MySQL take in that much data?
Is it on a cloud server?
I use Community Edition
user = await db.execute(select(Users).where(Users.username == usernames[index]).options(selectinload(Users.sent_dms)
.selectinload(Dms.dm_messages)
.joinedload(Dm_Messages.user))
.execution_options(populate_existing=True))```
Does anyone know why the relationships aren't loading? The relationships load when I run this query:
user = await db.execute(select(Users).where(Users.username == usernames[index]).options(selectinload(Users.sent_dms)
.selectinload(Dms.dm_messages)
)
.execution_options(populate_existing=True))```
but not when I run the top query
Update: I imported the data successfully. I just created a table in the database, then write SQL to load the CSV files, much faster than using Data Import Wizard and Python
anyone here famiiliar with structure charts?
Can anyone provide me dataset for generators(of thermal power plant).
how do I run this sql lite project through an AWS EC2 server, should i sudo apt install sqllite?
SQLite is in the Python stdlib.
I am confused, is thought_sync_django-main.zip accessible to the EC2 server, how do I fix this?
Can you open a help thread? #❓|how-to-get-help
sure
ready when you are : ]
I have a question. Is anyone here familiar with the WISKI software? I'm trying to take data stored on there and then importing that data into Python.
One library I found when looking was called kiwis_pie. Sadly there's not much information about it other than the documentation.
Never mind. I'm likely going to be able to query the data in Python via SQL Server thankfully.
can i ask for php here?
We don't have a php channel. You can ask in one of the off topic channels.
!topic
There are three off-topic channels:
The channel names change every night at midnight UTC and are often fun meta references to jokes or conversations that happened on the server.
See our off-topic etiquette page for more guidance on how the channels should be used.
Can anyone help me with executing postgres query using psycopg ?
It's been a while since I messed with databases, so I was wondering how to structure a database for a simple project of mine?
I'm making a bot that tracks scores of users on a per server basis.
The idea being that your score is specific to only your server.
So I think it only has userID, serverID, and score, but I'm not sure how to design it to seperate the scores.
Score table
Value, foreign key to User, foreign key to Server
Shrimple as that 🦐
for row in cur.execute("select distinct(player_id) from results order by match_no"):
list.append(row[0])
Why does the none appear at last?
I only have 8 rows
or if I remove the 'distinct' it became like this.. more none
Query results show you have more than 8 rows ^-^
Deselect nones with Where condition if u wish not seeing those rows
U may have filter in place to ignore some rows
Which database package should I use with python?
The databases I do with sqlite3 malfunction all the time
Malfunction? How?
I'm actually reading up on sqlite3 now being how I should probably use it.
Where would you guys store your database?
The database cannot be read after working because it's corrupt
I'd need to say, the database gets created by the program
You are definitely doing something wrong if that is the case
It is almost certainly not an sqlite issue
You mean, like, in what directory?
Yeah, pretty much.
I'll send the relevant code in
try:
# Verbindung zur SQLite-Datenbank herstellen
conn = sqlite3.connect('battle_data.db')
cursor = conn.cursor()
# Tabelle erstellen (falls nicht bereits vorhanden)
cursor.execute('''
CREATE TABLE IF NOT EXISTS Battle (
BattleID VARCHAR(255) PRIMARY KEY,
BattleTime VARCHAR(255),
BattleMode VARCHAR(255),
BattleMap VARCHAR(255),
BattleResult VARCHAR(255),
PlayerName1 VARCHAR(255),
PlayerName2 VARCHAR(255),
PlayerName3 VARCHAR(255),
PlayerName4 VARCHAR(255),
PlayerName5 VARCHAR(255),
PlayerName6 VARCHAR(255),
Player1 VARCHAR(255),
Player2 VARCHAR(255),
Player3 VARCHAR(255),
Player4 VARCHAR(255),
Player5 VARCHAR(255),
Player6 VARCHAR(255),
Brawler1 VARCHAR(255),
Brawler2 VARCHAR(255),
Brawler3 VARCHAR(255),
Brawler4 VARCHAR(255),
Brawler5 VARCHAR(255),
Brawler6 VARCHAR(255)
)
''')
# Daten in die Tabelle einfügen
cursor.execute('''
INSERT INTO Battle (
BattleID, BattleTime, BattleMode, BattleMap, BattleResult,
PlayerName1, PlayerName2, PlayerName3, PlayerName4, PlayerName5, PlayerName6,
Player1, Player2, Player3, Player4, Player5, Player6,
Brawler1, Brawler2, Brawler3, Brawler4, Brawler5, Brawler6
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', battle_data)
# Änderungen speichern
conn.commit()
except sqlite3.DatabaseError as e:
# Fehlerprotokollierung und detaillierte Fehlermeldung
print(f"Failed to insert battle data: {e}")
raise sqlite3.DatabaseError(f"Failed to insert battle data: {e}")
finally:
# Verbindung schließen
if conn:
conn.close()
Depends on what's in it - e.g. if it's something like settings, store it in a platform-specific app folder (e.g. on Windows it'd be in %APPDATA%, and in general you can use appdirs). If in doubt, can always just use the folder your program is in - that'd be a portable installation of sorts.
Sounds good. Thanks for the advice.
has anybody used sqlalchemy alembic? is it a common pattern to create migrations as models are defined?
I only have a little bit, but I think yes, that's probably a good idea.
hi guys! so my company is copy pasting all the data from google manually..... to get real time data like working hours every week. Is it possible to scrape that from the website? and not actually write it in the table
Ask over in #python-discussion , but you may run into a Terms of Use issue (see #rules / rule 5) because Google doesn't like scraping
anyone used kdb+
Only briefly, it's somewhat on the other end of what I tend to focus on
What's up tho?
looking to store daily trade data at minute resolution. also I am broke lol. Also storing daily reddit info from api, so far have about 250 parquet files for it, so getting unwieldy and some database is probably in order. doing all processing of reddit data with polars because pandas was too slow
kdb has like a one year free trial but honestly whos trying to deal with that and renewing and stuff (and I think I have to learn their weird q language).. probably gonna go with more ML approach rather than try to compete with HFT which is what kdb tries to cater to
What's the problem with the parquet files? Hive partitioned? Have you tried something like DuckDB to query the hive partitioned parquets?
1min candles are usually manageable without something like KDB
I'm a DuckDB shill but you can get pretty far querying a directory of partitioned parquet files. Might need to compute some rollups first/etc, but good partitioning leads to efficient queries
Not hive partitioned, just learning about that. One problem for example, if I want to access data from Feb 4th 2022 I'd have to iterate through all of them until I find the one with the right data
so I should just read in everything and rewrite it to more organized folders?
If you can partition the data in a manner aligned with your queries, that's worth an experiment.
hey anyone here watched CMU Intro to Database Systems course because i tried to watch it and lot of things i didn't understand i already have some database knowledge with sql but this is the first course i found it hard to understand so is this normal or what
Not familiar with it, but what did you find hard?
stuff like timescale db has automatic partitioning, researching that rn
the course overall i saw the first 3 lectures and there are somethings i couldn't understand and relate like database storage
My preference lately for parquet is the portability and ergonomics: it's easy to move, access via other engines, and you can get it to work reasonably well: sure, other systems might be faster, but depends on what you're trying to do.
If you can ask something specific, I'm sure someone will try to help
Timescale is pretty cool, though. Applicable to a lot of use cases.
for i in semi_finalist(): cur.execute('update player set seed = 1 where player_id = {i}')
how do I check the player_id is = to the i
the semi_finalist() is a list
Don't do what you're trying to do.
!sql
Don't use f-strings (f"") or other forms of "string interpolation" (%, +, .format) to inject data into a SQL query. It is an endless source of bugs and syntax errors. Additionally, in user-facing applications, it presents a major security risk via SQL injection.
Your database library should support "query parameters". A query parameter is a placeholder that you put in the SQL query. When the query is executed, you provide data to the database library, and the library inserts the data into the query for you, safely.
For example, the sqlite3 package supports using ? as a placeholder:
query = "SELECT * FROM stocks WHERE symbol = ?;"
params = ("RHAT",)
db.execute(query, params)
Note: Different database libraries support different placeholder styles, e.g. %s and $1. Consult your library's documentation for details.
See Also
- Python sqlite3 docs - How to use placeholders to bind values in SQL queries
- PEP-249 - A specification of how database libraries in Python should work
You are missing the 'f' to make it an fstring, but you never want to do that. Do what the above text says to do instead.
okay thanks
And, for efficiency's sake, you should run a single update instead of a loop. You can do this in a single query with a list of ids.
Can't do = a list
You want 'in (?)' instead
Ie: Give me all rows where the id is in the id_list
[1, 2, 3, 4]
yeah apparently kdb actually kinda sucks so not gonna put in extra effort to use them
timescale also seems nice for adding data in real time from a stream while duckdb not so much for that
HFT vs non is a big divide. I live on the non side, so my tech stack is more throughput vs latency focused
and timescale seems flexible enough that I can use it for financial metrics, economic indicators, as well as the typical time series
DuckDB (OLAP in general) is for the consumption side
As the only one involved in my project I am consumption and production lol
Or perhaps said differently: you'd still need a production pipeline to retrieve (or receive) and store the data.
thanks I used this and it worked
One option is timescale, another is parquet, etc
In general, when working with data, try to avoid loops. Theres usually a better way.
Thanks a lot
Need help with generating proper SQL statements or at least save SQL statements into a SQL file so that I can execute them into the database. https://discord.com/channels/267624335836053506/1256124450139668530
My databases keep getting "malformed" but it successed in inserting data
I use sqlite3
When I download the database and want a website to read it for me, it says it can't
Does anyone knows about how to fix it
hey! I'm trying to avoid hard-coding relationship strings in sqlalchemy to make everything more type safe
for example, if I have such models:
class User(Base):
id = Column(Integer, primary_key=True)
is_active = Column(Boolean, default=True)
items = relationship(lambda: Item, back_populates="owner")
class Item(Base):
id = Column(Integer, primary_key=True)
owner_id = Column(Integer, ForeignKey(User.id))
owner = relationship(lambda: User, back_populates="items")
how do I avoid hard-coding values like "owner" or "items" of back_populates?
Best database for beginners doing dpy?
postgres
not sure if youve ever heard of it but Im leaning towards questdb. seems to be maybe half as popular as timescale though
Do you really have this kind of workload though? 1 minute candles isn't super high frequency.
What is dpy?
await db.execute(select(Users).options(selectinload(Users.sent_dms).selectinload(Dms.dm_messages).selectinload(Dm_Messages.user)))``` Whenever I try to load in `Dm_Messages.user`, nothing loads leading to a greenlet error, but whenever I don't load it, the relationships load like normal. Does anyone know what could be causing this issue?
I did have a bottleneck of reading data in for a web app I made using daily price data
ended up using an lru cache and just dealth with the first run being kinda slow
So if questdb makes reads faster at the cost of just not having a DELETE operation available...
though now I am seeing someone on reddit say: "I tried to migrate from numpy .npy files to a timeseries database. I saw a degradation of 10x in read performance. I scrapped the idea and stuck with flat files."
for the web app I actually tried using clickhouse thinking itd be faster but it wasnt (I was using pandas' read_parquet() instead of polars or duckdb )
Databases are a frustrating specialty.... I aim for good enough and predictably scalable, over perfect/etc
Discord python
Never heard of that are you sure bro ?
Sqlite3 for the win for simple beginner projects
Is that a website ?
No, it is database
I need resources I code on mobile
O I thought sqlite3 fir the win for simple beginner projects was the name of the website
Coding on mobile is not intended 😕
I dont have a pc 😭
Start with a plan to get one then -_-.
You got a plan?
I am not you too know your conditions.
Choices can vary from asking parents/relatives to buy one
To getting summer job to get the money
Oh
I thought you were talking about sqlite3
Don't worry I'm gonna buy one next month
I'm 20 so I can get one by myself 🙂
match = 'delete from matches' result = 'delete from results' cur.execute(match) cur.execute(result)
why isn't the record in the table deleted
Did you commit?
I wanna make a voting system using python, and html/css/js what's the best db to use? around 1000 voters, I thought I'm gonna use SQLite, but how do I handle multiple simultaneous votes? Cause I looked it up it and SQLite can only have 1 write simultaneously.
I'm pretty new to this db and networking stuff, and concurrency (if it's needed)
Help? Please 🥺
Guidance for what I should learn
sqlite can handle concurrency reasonably well if you ensure your connections/transactions are short-lived, and sqlite also has a WAL mode to remove contention between readers and writers
see also: https://sqlite.org/whentouse.html
Situations Where SQLite Works Well
- Websites
SQLite works great as the database engine for most low to medium traffic websites (which is to say, most websites). The amount of web traffic that SQLite can handle depends on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic.- Server-side database
With this pattern, the overall system is still client/server: clients send requests to the server and get back replies over the network. But instead of sending generic SQL and getting back raw table content, the client requests and server responses are high-level and application-specific. The server translates requests into multiple SQL queries, gathers the results, does post-processing, filtering, and analysis, then constructs a high-level reply containing only the essential information.
you should learn how to design good database schemas, like what tables you should create, what primary/foreign keys you should use, normal forms you should follow, using indexes to optimize particular queries, etc. and then get used to making connections and executing queries to retrieve/insert into your database
(see also the SQLBolt interactive tutorial if you want practice with writing queries, and the SQLite docs to learn their SQL syntax like INSERT, SELECT, UPDATE, etc.)
Oh shoot yes thanks
I see some really old references to people using SQLite and other SQL databases as the message broker for Celery. The Celery documentation doesn't mention this possibility.
Does anyone happen to know if it is still possible?
only as a result backend though, not as a message broker?
taking a look at the update history: https://github.com/celery/celery/blob/main/docs/history/whatsnew-4.0.rst#features-removed-for-lack-of-funding
Good find, thanks!
should i use sqlalchemy with asyncpg? or just sqlachemy alone , i want good performance
is using it is overkill?
Since we don't know how serious your app (and what kind of quering usage cases it has), we can't recommend if u should use it or not.
By default i don't recommend, because u bring extra code complexity to code with having async stuff
I also don't recommend such extra over compleixities to people who did not cover their database interacting code with unit tests and added their execution into CI
its not very big
Guys i need help
why does python deleted my " HELP " from the " python-help " ??
Anyway i will post my problem here 😁😁😁
Threads are closed after 1 hour of inactivity.
BRUH ☠️
Do you have a question?
I have a big problem
...
i'm in need of help to finish my term paper i know this is for simple help, but i really need the help of a good soul so i can finish it, so here goes;
My project consists of a control of entry and exit of students using the QR-CODE of the student card when entering you must show the card to the laptop* and take the photo of the card (The QR-CODE shows the student's enrollment that will be previously registered with the student's data that I have already done).
At this point, the system must record the time in the MySQL database, and the same thing applies to the student's departure.
I'm doing my project in Python and MySQL
The problem I have is this enrollment data that I can't transport and save the time and the link to the database.
I need to present it on July 2nd now 😢
.
You want to write a record to a database table when the student enters, and write a record when the student exits.
What have you done already for the database? Do you have a database? Have you created a database table?
Where are you stuck?
The alter table and the relationtip with the main python code
Please share the relevant code and ask a specific question, I still don't know what you need help with.
Ok
The more specific the better
Is it bad to have thousands of tables
My short answer: probably. Why do you have so many?
One per stock / etf (minute level data)
Why would you put each ticker in a separate table?
I dont have any yet, just trying to see how I should do this
Something tells me it'd be more efficient
Just use one table and have a ticker column.
What you're trying to do is partitioning, and partitioning can be done within a table.
The idea is sound, just the wrong tool
Like, I don't put all my price data into a single parquet file, there's a separate praquet file for each. So naturally I didnt think that putting everything into a single table would be correct
ok
My experience with databases has been limited to school busy-work until now and definitely never worked with time series data in a db before
Many tables would require dynamically selecting tables in the sql
Query, something sql doesn't allow (can't parametrize table names)
Most database optimizations come down to: indexing, partitioning, and orientation (columnar vs row)
(And roll ups and sharding and denormalization and caching)
I'm going with questdb so far, seems pretty good. Its column oriented and as a result faster than timescale, but pretty sure the catch is you can't delete rows easily
It described itself as OLAP, while describing timescaledb being more OLTP
'Unknown MySQL server host'
Thanks bro i didn't drink coffee this afternoon
i will add more stuff to let complete
Your server host should probably be localhost, not pou
Yeah LMAO
u can use context managers , to do an action , right after a connexion is closed
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import asyncpg
from config import DatabaseConfig
from sqlalchemy import MetaData
DATABASE_URL = f"postgresql+asyncpg://{DatabaseConfig.username}:{DatabaseConfig.password}@{DatabaseConfig.hostname}:{DatabaseConfig.db_port}/{DatabaseConfig.db_name}"
print(DATABASE_URL)
DEFAULT_DATABASE_URL = f"postgresql://{DatabaseConfig.username}:{DatabaseConfig.password}@localhost:5432/postgres"
print(DEFAULT_DATABASE_URL)
# Base for declarative class configuration
Base = declarative_base(metadata=MetaData())
async def create_database_if_not_exists():
try:
# Connect to the default database
default_engine = await asyncpg.connect(DEFAULT_DATABASE_URL)
# Check if the target database exists
result = await default_engine.fetchrow(f"SELECT 1 FROM pg_database WHERE datname = '{DatabaseConfig.db_name}'")
print(result)
if not result:
# Create the target database if it doesn't exist
await default_engine.execute(f'CREATE DATABASE "{DatabaseConfig.db_name}"')
print(f"Database '{DatabaseConfig.db_name}' created successfully")
await default_engine.commit()
await default_engine.close()
except Exception as e:
print(e)
async def create_tables():
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.drop_all)
await conn.run_sync(Base.metadata.create_all)
print("Tables created successfully")
async def initialize_database():
try:
await create_database_if_not_exists()
await create_tables()
except Exception as e:
print(f"Error during database initialization: {e}")
# Create the async engine for the target database
engine = create_async_engine(
DATABASE_URL,
echo=True, # Set to True for debugging SQL queries (optional)
# Enable pool_pre_ping to check for stale connections (recommended)
pool_pre_ping=True,
pool_size=20, # Maximum number of connections in the pool
max_overflow=0 # Maximum number of connections that can be created beyond pool_size
)
# Create a configured "Session" class
AsyncSessionLocal = sessionmaker(
engine,
class_=AsyncSession,
expire_on_commit=False
)
why tables are not getting created : (
literally i cant figure out where i am doing wrong everything works but then it doesnt create any table at all
Add more print statements and narrow it down.
except Exception as e: print(e)
don't ever do this. at the very least do traceback.format_exc. otherwise you're discarding the valuable traceback.
unsure if better here or elsewhere but someone knows if it's possible to read a JS JSON in python? like it isn't plain JSON but a JS file meant to emulate JSON (along with variable names, etc…). right now made it able to read file from internet, now only to read content from it 🙂
like i want this file to be read (and putted to variables) by python: https://hacknorris.codeberg.page/alchemy/db.js
manually strip the variable stuff and load the rest as 3 different json files (or as python code for that matter)
it have 3 "tables"
aka - 3 variables
and i want them to be loaded into separate variables in python (number of "tables" in JS should be same as number of variables in python)
so, why not just change let initial = ["fire", "water", "air", "earth"]; to initial = ["fire", "water", "air", "earth"]; and similarly for the other two, and then just import that as a python file?
or must it be automatic?
?
JS is broken then
(i want one file for 2 apps - one in JS and second - local client in python)
I dunno if I'd say narrow is easier
Maybe easier to stick data in
But often performs terribly, and dealing with data types is problematic
It also compromises various compression and optimizations, and managing data integrity can be harder
--> Created an .tf file where i wrote the code to create an lambda fn
--> After executing the series of commands init, plan & apply, the lambda got created on AWS console.
--> Now, if i want to use that lambda, i need run that lambda from AWS console, pass the input & see the output.
--> But, i want to execute this lambda from my console & see the output also in my console if self without touching AWS portal. Can terraform do this?
Hello everyone just a quick question:
I have a get request to one of my models lets say Departments and im passing a column value from a dataset which is typically a short string like 'La county' but sometimes it is a string with multiple counties as one string like 'La county, multnomah county, idaho county' and my get method call fails. what would be the best way cut up that string into seperate strings and run the get method on each?
would making it into a list with split(',') work and then run a loop on that list with each string making its own get call work? Thanks
I was going to say, seems like they missed "performance" as a measure to compare with
EDIT: Problem solved! See below.
Hey everyone, I am having problems creating a session context manager. I have some generic database methods in a class like bulk add. Sometimes I do not want to commit changes unless all method calls execute properly, therefore I was thinking of creating a session context manager. The code executes without errors but the changes are not written to the database. I tried creating it:
async def transaction():
try:
async with async_session() as session:
yield session
if session.dirty or session.new or session.deleted:
await session.commit()
except SQLAlchemyError:
await session.rollback()
raise
async def bulk_update_objects(
self,
objects: List[Any],
update_fields: List[str],
model: Type[Any],
session: AsyncSession = None
):
primary_key = inspect(model).primary_key[0].name
updates = [{primary_key: getattr(obj, primary_key), **{field: getattr(obj, field) for field in update_fields}} for obj in objects]
# Perform the bulk update
async with session or async_session() as sess:
try:
await sess.execute(
update(model),
updates
)
if session is None:
await sess.commit()
except SQLAlchemyError:
if session is None:
await sess.rollback()
raise
async def example():
async with self._db_operator.transaction() as session:
await self._db_operator.bulk_update_objects(
objects=examples,
update_fields=["example_flag"],
model=Example,
session=session
)```
Even though the session object has the same ID in all places the changes are never actually written to the PostgreSQL database when using the yielded session. The bulk_update_objects method works fine when the session argument is not provided. Anyone have any idea why that might be?
I have also unsuccessfully used the begin method since I thought it might commit the changes after the session is closed automatically, thereby forcing a commit after the yield. But that didn't work either. For example:
@asynccontextmanager
async def transaction():
session = async_session()
try:
async with session.begin():
yield session
except SQLAlchemyError:
await session.rollback()
raise
Nevermind, it was stupid to create a new context using "async with" like so:
async with session or async_session() as sess:
Instead I can make it work by having the method constructed like this instead:
context = async_session() if session is None else nullcontext(session)
async with context as sess:
try:
await sess.execute(
update(model),
updates
)
if session is None:
await sess.commit()
except SQLAlchemyError:
if session is None:
await sess.rollback()
raise
Problem solved.
my code has an error and idk what it means
Ask in #python-discussion , not database related... but, you're downloading a .json.gz, but treating it as a .json
This is a random question but can you make a model that for examples tries to draw trendlines on financial data and on each output it gets a rating from a human and the correct example to learn from
Of course. But probably the more normal way would be to get all the human input together first and then train on that
If you have more questions #data-science-and-ml is probably the better place
I am in the process of making a central company package that we can import into our projects for common dev stuff. It's only for internal use and is being stored on a local hosted gitlab, and the users are locked into a menu through a putty session and as soon as their process completes, the session gets closed. So they can't view the actual coding. One thing I realized is that we have a single username and password into our database for general use, and this is something I have to work around (out of my control). The thing is, all of our current projects hardcode that user password combination.
Being that I can't adjust the database, what is the best way to handle database access? Is there a way to get away from hardcoding that? Currently I use it in a JSON.
Packages are being used with Poetry, btw.
Looks like GitLab is still working on their Secret Manager solution. 😢
Are you using on-prem servers or what? No containers? Would environment variables work?
This is more of a #tools-and-devops question BTW, unless your DB engine offers some sort of non-password authentication
But to be clear you absolutely should not be hard coding DB credentials in your repo
Git lab is locally hosted, is that what you mean by on-prem servers? We're not using any containers, unless you count the Putty remote sessions each user uses. I've raised the question about a system environment variable for the "Business Box" as we call it, as each user uses a remote session into it. I've not heard a good answer back yet other than, "rather not if it can be avoided" so I'm unsure on that. ("Small" company problems, I know...)
Should I move the question?
It was a toss up for this question and my other one on which would be the proper channels. I chose poorly 😂
And we use SQLAlchemy through a connector for MariaDB, if that's the engine.
This is a good summary of the options as I understand them. If you're stuck with the infra you're on, I do think environment variable is probably the way to go
https://medium.com/twodigits/keep-passwords-out-of-source-code-why-and-how-e84f9004815a
Oh actually, here's a doc from GitLab https://docs.gitlab.com/ee/ci/secrets/
Ok, thanks 😄 I will give them a read.
Storing a password as an environmental variable would still be open to anyone who could get access to those, right?
Yes
But this is still more secure than having them in the program, and much better than having them in the repo.
I'm currently researching the hashicorp option, but it seems like a bit much for just one user/pass
A database can use both data storage layout row and columb based together?
In a system
Some databases allow different storage engines on a per table basis, so some tables might use a row oriented engine, and others might use a columnar engine.
:incoming_envelope: :ok_hand: applied timeout to @torn sphinx until <t:1720106129:f> (10 minutes) (reason: emoji spam - sent 24 emojis).
The <@&831776746206265384> have been alerted for review.
When a site is live or phrased better connected to the internet do people use flask migrate to update a database?
Hii Billy long time no see, hru
Can it is possible we save ur data in 2 database one is row another column and a query selector who select which query is executed by which database column or row?
Yes, people do that
Is this an alright design:
CREATE TABLE IF NOT EXISTS estimates (
timestamp_accessed TIMESTAMPTZ NOT NULL,
ticker_id INTEGER NOT NULL, -- AAPL
line_item TEXT NOT NULL, -- net_sales
quarter VARCHAR(6) NOT NULL, -- 2022Q1
released INTEGER, -- 27908
forecast INTEGER, --28233
spread NUMERIC, -- -1.15%
announcement_date DATE, -- 2024/09/06
PRIMARY KEY (timestamp_accessed, ticker_id, line_item, quarter),
CONSTRAINT fk_ticker_id FOREIGN KEY (ticker_id)
REFERENCES tickers (id)
ON UPDATE CASCADE,
CONSTRAINT check_quarter_format CHECK (quarter ~ '^\d{4}Q[1-4]$')
);
That doesn't sound practical. Two tables with the same data but different storage engines?
:ok_hand: applied timeout to @tardy bolt until <t:1720190228:f> (10 minutes) (reason: mentions spam - sent 6 mentions).
The <@&831776746206265384> have been alerted for review.
@fading patrol Are there other options with flask sqlalchemy, thanks for the previous response
Other options? To do what?
@fading patrol to add and delete columns in a table without losing the data in the db
You cannot delete columns without losing data.
Alembic is a standard tool for managing migrations with SQL Alchemy if that's what you're asking for
It's not a bad idea to backup your DB before performing any migrations
Hi everyone,does anyone know how to get the unused table in MySQL other than using count(For MySQL)
Wdym unused?
is big data even there yet?
Where?
is spark and Hadoop even good? Is the technology even there yet in terms of quality?
ok
don't they require like a nonsense amount of RAM?
Depends what you're doing. But, ram is also pretty cheap.
I would say Hadoop is mostly on its way out tbh
I do not see it much really outside of legacy systems or hidden away by spark's handling
by "on its way out" I mean more along the lines of it isn't being used directly on new project anymore
Hello guys ! I need tips about Sqlite3 functions management. I'm having some exercises with Python and now I'm facing the database connection with Sqlite3.
My question is this " How can i organize well all my function for connect, disconnect, insert, select, update, etc... ? " I tried to organize as best I could but I've always a lot of issue with connection and cursor object thru all the functions i created.
It's better to write always clean code customized every time in this case or someone of you can help me with some example how the manage all these database function ?
Thank you very much !!!
If you can share some code it might clarify what you're having a problem with.
It sounds like you're starting to experience one reason that using an ORM can help
I'll write again some code for let you understand what I mean. Asap it's ready I'll sent to you !
Thank you very much
Hi guys. I need some suggestions.
My project currently has two tables in the database. staff_id = Column(Integer, primary_key=True, autoincrement=True) but the autoincrement is relative to individual table. But I would love it to be global for the database. How can I achieve this?
Example: Intern employed on July 7th gets a Staff ID of 1092 but a manager employed the following day could be getting a Staff ID of 723.
Your question is: 'how to share a sequence across tables' (a sequence being the counter used to implement the auto increment)
What database are you using?
What problem are you trying to solve?
What benefit do you see in having a global id for the database?
One option is another base table for 'person'.
Just so the next employee gets the next Staff_ID in the organization, rather than in the table.
And what benefit does it convey?
...
Can't think of any. Probably because that how I imagined it during the idea conception.
Your problem sounds a lot like a https://xyproblem.info/ to me. So I am trying to understand the root problem here and the trade offs you are willing to make.
There might be reasons to make the number sequential within an organization, but there are also reasons to make the numbers not predictable as well (ex: security measure so no one can't predict the hiring rate). Both having very different trade offs
A concrete example of trade off is that having a global id means you will run out of ID faster since it's now shared across all the tables 🙂
And note that's not something generally advisable to do
Valid points.
However, the idea behind having unique IDs is such that I can query database with the ID, and get a maximum of 1 value. If I get lucky enough, the two tables might get as many entries that there might be an overlap at some point.
Your DB design is the problem.
If it is logical for staff and interns to have employee IDs in a single sequence, then it's probably logical for them to be on the same table
If there is really a good reason to put them in two different tables, then the employee number sequence can be a third table with fk relations to the other two
I got a similar suggestion (using a single table) earlier. I would just use that.