#data-science-and-ml
let's go
I don't trust this equity split stuff
is it an HR round or the first step to get the job in that company?
oh
gotta take it with stride
what did they ask you?
I think there's a sixth
no it's 5, and the fifth I have to travel to their office, which I don't mind since it's in a cool place
first was leetcodes, second was hr
you seem very unsure
third is solving a physics problem, fourth is more leetcode and fifth is traveling to their place
i speak Spanish but im not gonna answer
That's just a sum as i goes from 1 to n (inclusive).
the first value i takes is 1
math notation is something one reads and interprets, just like any other language
you would read this as "the sum of elements x_i, where i goes from 1 to n"
so, x1 + x2 + ... xn
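To make that reading concrete, here is a minimal Python sketch of the same sum. The values in x are made up for illustration; the only point is that i runs from 1 to n inclusive, while Python lists are 0-indexed, hence x[i - 1].

```python
# Illustrative values for x_1 ... x_n (not from the conversation).
x = [2.0, 3.5, 1.5, 4.0]
n = len(x)

total = 0.0
for i in range(1, n + 1):   # i = 1, 2, ..., n (inclusive)
    total += x[i - 1]       # x_i is stored at list position i - 1

# The built-in sum() computes the same thing.
assert total == sum(x)
```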
Hey, i've made a 3D plot with matplotlib and i would like to know if there is some way to enable antialiasing for it?
I've tried passing antialiased=True as a parameter for plot but it doesn't seem to make any difference
Here's the code of the graph:
def plot_graph3D(func, zlim, name):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    x = np.linspace(0.0, 2.0, GPAPH_RESOLUTION)
    y = np.linspace(0.0, 1.0, GPAPH_RESOLUTION)
    X, Y = np.meshgrid(x, y)
    Z = func(X, Y)
    ax.view_init(15, GRAPH_ANGLE)
    ax.plot_surface(X, Y, Z, cmap='viridis', antialiased=True)
    ax.set_zlim(0.0, zlim)
    ax.set_xlabel('Pixel Luminance')
    ax.set_ylabel('Average Luminance')
    ax.set_zlabel('Output Luminance')
    if name is not None:
        plt.savefig(DIR + 'graph3D-' + name + '.pdf', bbox_inches='tight')
    plt.show()
Also could someone help me understand why the image is cropped? Even when i save it, the label of the Z axis is cropped. I thought bbox_inches='tight' would fix that but it doesn't
hello! i am a 16 year old living in the uk. i want to have some work experience in research before i enter university. i am highly interested in reinforcement learning. i have successfully replicated a fairly well-known research paper and currently refining the code (i have a functional MVP but currently writing code to hopefully speed up the learning). however i consider my ml background to be "shaky" as i still lack some math understanding in some fundamental aspects of ml. i currently use pytorch, which abstracts some math away. would this be ok with my volunteering work experience, like i can just work behind a library?
i would want to have some work experience with some tech companies (i can volunteer). does anyone know any good companies/programmes that i can join to get some real-world experience? i know that at least for some us-based highschoolers there are some programmes for them and internships are usually for undergraduates. can someone please recommend some international/uk based programmes/opportunities for me? tysm :D
You might also want to ask in #career-advice
I need a bit of help on something related to Plotly Dash. I am making an app, but it appears that I'm running into an issue where not all the data is being displayed when the data is being entered into a new dataframe from the text box which runs code to generate df_2. Here's the background. When I first run the program, I use Tkinter to enter an initial date for when the app first goes live. The data looks good when I do that. The problem is when I enter the same exact date in the text box and click the button, which should execute the Callback, which should then call the previous functions to generate a new dataframe, now a lot of the data is missing. Theoretically, nothing should change because it's the same exact query being run. Most of the data from df is the same, but when data is supposed to be saved into df_2 at certain times, it just does not do it, when it did it prior to the app going live with the initial date and continues like it never even saw said row in question even though it did. What could be causing such an issue when running all those functions from the Callback?
Here's the syntax I'm using to add rows to df_2 if it helps where A, B, C, D are variables defined in the function:
df_2.loc[len(df_2.index)] = [A, B, C, D]
Many times that line of code would appear like it's being ignored when the functions using this line are being called from the Callback.
It does not get ignored when I first run the program and enter the date. This problem I noticed only happens when the date is entered in the text box of the Dash App by the user.
In other words. It happens when everything is executing inside of the Callback instead of outside of it, where I get the expected results.
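One thing that might be worth ruling out here (an assumption, since the full app isn't shown): df_2.loc[len(df_2.index)] = ... only appends when the label len(df_2.index) is not already in the index. If df_2 was filtered or had rows dropped at some point, that label can already exist, and the assignment then overwrites a row instead of adding one. A small sketch with made-up data and placeholder column names:

```python
import pandas as pd

# Made-up data standing in for df_2; column names are placeholders.
df_2 = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# After filtering, the index is [1, 2] but len(df_2.index) is 2.
df_2 = df_2[df_2["A"] != 1].copy()

# Label 2 already exists, so this overwrites that row instead of appending.
df_2.loc[len(df_2.index)] = [99, 990]
rows_after_loc = len(df_2)

# Appending via concat with ignore_index=True does not depend on the labels.
new_row = pd.DataFrame([[7, 70]], columns=df_2.columns)
df_2 = pd.concat([df_2, new_row], ignore_index=True)
rows_after_concat = len(df_2)
```

If the row counts before and after the assignment are printed inside the callback, it should be quick to see whether this is what is happening.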

Hi I need help sending an email to a website. I want to scrape their data and use it to train my AI for a research paper/whatever that paper you send to conferences is called.
What should the email contain? They already told me I can do it via a call but I want to have an actual email.
Hello, I need help with python. I have a code but i am unable to use beautiful soup and scrape from website the column level data
Hey in SVR , we call the points which are outside the epsilon insensitive tube as support vectors right?
and for training model , we use the .fit() method right? Should it always have a 2D array as its input?
Yes
- Yes, all points outside of the epsilon tube are support vectors. In classification all misclassified points are also support vectors.
- I think most models do some dot product internally, so it can't fit when you give it a 1D feature array of shape (n,); they require you to reshape it to nx1
Noice btw does reshape convert a 1D array to a 2D array?
Try it out in the console 🙂
I did
And then call .shape on it
It gives it in the form of a 2D array
try a bunch of things, try multiplying two vectors of shape nx1 in numpy as well
You gotta run all the code to get the intuitions
if you see this
I want to make that 1D array to be vertical?
You don't need to write print in notebooks btw
Is there any way?
i would note that numpy ndarrays of dimension 1 do not behave like proper vectors
you can multiply them from the left or right with no issues
That's true
Even so, that's something you need to run to find out
Compared to someone telling you
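For anyone who wants something to paste into the console, a small sketch of both points: reshape(-1, 1) turns a 1D array into a "vertical" n x 1 column, and a plain 1D array is accepted on either side of a matrix product. The arrays are arbitrary examples.

```python
import numpy as np

a = np.arange(3)            # 1D array, shape (3,)
col = a.reshape(-1, 1)      # "vertical" 2D column, shape (3, 1)
row = a.reshape(1, -1)      # "horizontal" 2D row, shape (1, 3)

M = np.arange(9).reshape(3, 3)

# A 1D array works on both sides of @: NumPy treats it as a row
# vector on the left and as a column vector on the right.
left = a @ M                # shape (3,)
right = M @ a               # shape (3,)

# A true column only works on the right-hand side.
right_col = M @ col         # shape (3, 1)
```

Trying col @ M raises a ValueError because the shapes (3, 1) and (3, 3) do not line up, which is the behaviour a "proper" column vector would have.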
hmmm
Isn't this the same?
'tuple' object is not callable. It's giving me this error
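A guess at the cause, since the code that produced the error isn't shown: shape is an attribute that holds a tuple, not a method, so writing .shape(...) with parentheses tries to call the tuple. A minimal sketch that reproduces the message:

```python
import numpy as np

arr = np.arange(4).reshape(-1, 1)

# Correct: shape is an attribute, read it without parentheses.
shape = arr.shape           # (4, 1)

# Incorrect: arr.shape() tries to call the tuple itself.
try:
    arr.shape()
except TypeError as exc:
    message = str(exc)      # "'tuple' object is not callable"
```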
# Calculation
number_of_neurons = 86_000_000_000 # 86 billion neurons
average_synapses_per_neuron = 5_000 # average synapses
parameters_per_synapse = 40 # 10 parameters for each of 4 modes (neurotransmitter types, receptor types, synaptic strength, other factors such as modulatory receptors, ion channels, postsynaptic properties, etc.)
total_parameters = number_of_neurons * average_synapses_per_neuron * parameters_per_synapse
total_parameters
a fermi estimate for the number of parameters in the human brain
17200000000000000
17.2e15 if I counted the zeros right
But ngl, I might've over engineered my pipeline
Now that I'm able to fit the entire dataset in ram
The code will probly be useful eventually, so I'm not too bummed out
Maybe now it fits, but nothing is stopping me from just getting more data
Does anyone have any idea on what's causing this bug?
You might have more luck if you show some code
I would if this wasn't for work though.
I mean it's hard to debug code through a natural language description of it
setting up reproducible examples is an important skill. if you can't isolate the problem enough to explain it to someone else, you might not understand the problem that well yourself. often the process of figuring out how to reproduce a problem also elucidates the root cause, and from there a solution.
that's been my experience at least, and i know i'm not alone in that.
That moment when I realize most of my job is debugging stuff
truly. 70% debugging, 20% data cleaning, 5% talking to people, 5% actual data analysis and machine learning
so good to hear that, was starting to think I was doing something wrong ahah
There's just like, infinite configuration. In the GitHub actions stuff alone there were like 10 unexpected things that prevented them from reaching the 24h run time mark
Not even in my code, just libs not working as expected, or settings that I didn't know about, like a timeout at 6h, that one was unfortunate cuz I could've seen it coming if I read the docs
it's the current plague of doing data science in industry. data scientists are under-supported by engineers and devops people. so you have this perverse trend of trying to hire data scientists with unicorn-level credentials to do 3 different jobs at once, instead of hiring 2 extra people to collaborate with the data scientist and get a lot more value out of the whole team. save 2/3 on payroll but only get 1/4 productivity, it's a bad deal in the end for everybody (including and especially the data scientist who doesn't get to actually do their own job and their resume / skills atrophy over time).
I'm a data analyst and it's not much better
There are some data engineers in the company but they have their own work to focus on, mostly I have to do my own full stack end to end tasks, from system administration , to etl scripts, to cloud platform work, to SQL, python and Tableau
this was a topic of discussion a day or two ago
I think companies doing this aren't necessarily wrong
DS folk I meet just think they can get away with knowing 1 thing
On the other hand companies are trying to get away with having small teams and fewer people.
And that's really a thing for the vast majority of jobs, especially if you're not a specialist in a large company
you get to a point when that is not very productive
Yeah, if they're not pulling a lot of revenue what else are you supposed to do?
don't hire people that you can't make use of
I'm a jack of all trades and it's very hard to become a master of any
Unless your argument is: don't get a data scientist until you're at a big scale
not at big scale, but don't get a data scientist if you don't have at least 1 engineer that can help support getting their stuff into prod
Or they hire someone that can do a bit of both? 🤔
it never works out that way
I think it's a myth that having more than one responsibility means you're doing twice the work
i can do both. i've done both, professionally. unequivocally it's worse when you expect someone to do both.
you're doing half of each
that's mythical man-month thinking
I do both and I don't do twice the job tbh
right
case in point, no?
it's not about doing twice the job. it's about doing less than half of each job
if i split my time 50/50 i end up doing 30% of each job
that leaves 40% of the job not done, or backlogged
yet i'm still spending the same 100% of hours
It's all data, I never understood the distinction
there's the time spent context switching, and missed synergies
then maybe you're not doing what i'm doing
I don't see the context switching, data is data 🤷
also just the fact that it's ridiculous to expect a data scientist to also be a software engineer
If I work on notebooks I have a variety of projects I can work on
If I switch from notebooks to sql to tableau to unix sysadmin
that's context switching 😉
i'm not talking about "data" though
is writing an HTTP API and setting up a CI/CD pipeline "data" work in any non-trivial sense?
we literally pay kids $100k out of college to do that and only that, full time
here i am doing that and trying to also do data science and keeping an eye on the ETL pipelines
My interest, is in making things that work. If that requires an HTTP API, CI/CD, an ETL, ... so be it personally
What I see of a lot of DS is no interest in making things that work
same. that's why i do those things and know how to do them. i still think it's stupid to expect to hire someone who can do that
you and i are unicorns
The interest is in doing stuff in notebooks
it's very very well known across basically all industries that 1 person doing 3 jobs in 1/3 time ratios is less effective than 3 people specializing
ML model in a notebook, plot in a notebook 👎
I don't want to make any excuses for that
i use notebooks 🤷♂️ not sure what that has to do with it
i don't expect them to run in production
I also use notebooks, that's not what I mean
but even if i did, i don't see why it matters. tests are tests, pipelines are pipelines, etc.
I mean, no interest in going to prod
is it no interest in going to prod? or is it lack of interest in doing what should be someone else's job? another specialist's job, so you can focus on your own specialty?
Because some of my colleagues believe their responsibilities start at getting a clean dataset and end at producing a PoC
Didn't Adam Smith get his face on an english bank note for describing division of labour during the industrial revolution? 😉
When your responsibility should be: getting something that works. If there's no one to bail you out, then you gotta do it yourself imho
i mean, sure? but if you extend this line of reasoning, you should also criticize software devs for not also being devops and DBA
sure I do
yes, at smaller scales, it pays off to be a generalist and to hire generalists
have you ever worked with a good DBA?
sure, maybe that's entitled on their part. but at the same time, it's ridiculous to expect this level of multi-specialization as table stakes for all DS
Nothing like a good DBA.
sc2.inverse_transform(regressor.predict(sc1.transform([[6.5]])).reshape(-1,1))
the above statement is used to predict the result of an svr regression algo
hu
Is it necessary to use a reshape()
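As far as I know, yes: predict returns a 1D array of shape (n_samples,), while StandardScaler.inverse_transform expects a 2D array of shape (n_samples, n_features), so the reshape(-1, 1) bridges the two. A rough sketch with made-up data; sc1, sc2 and regressor mirror the names in the line above, everything else is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up training data: one feature, one target.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # shape (10, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

sc1 = StandardScaler()   # scales the feature
sc2 = StandardScaler()   # scales the target
X_scaled = sc1.fit_transform(X)
# The scaler needs a 2D target, SVR wants it back as 1D.
y_scaled = sc2.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel="rbf").fit(X_scaled, y_scaled)

scaled_pred = regressor.predict(sc1.transform([[6.5]]))    # 1D, shape (1,)
pred = sc2.inverse_transform(scaled_pred.reshape(-1, 1))   # 2D, shape (1, 1)
```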
I think there's just like an overspecialisation of DS folk
I don't think any CS niche overspecialises this much
DS isn't CS
i totally disagree, i think there's an unrealistic expectation of DS people also being software engineers and unless we pay them 2x what software engineers make, it's just employers trying to be cheap
i spent 6 years in school studying math, statistics, and machine learning. you now want me to also become a professional software engineer?
I've been mostly working at startups, and I do really like doing a bit of everything but, I've decided that I'm not a substitute for a team, there's a point at which it's just not fair. I personally don't see the fun in just doing one thing, but I also see the line in the sand as an extremely important thing for my own well being
that's two careers and two specializations. i expect to be paid double accordingly.
The expectation is just that people can deliver results
Every role has this problem, it's not unique to DS
but that's a very startup-centric small-scale mindset
Not necessarily imho
there is absolutely a niche for people who can "deliver results"
It's definitely compounded by the fact that the slice of the process DS people do is super narrow
Narrower than other roles
Like pure pure DS roles. You need a large supporting cast for that
I'm just pleading for knowing more than 1 thing is all, just knowing DE is already a step up
Ivory tower DSers
At my job I grew into a lot of tasks because all the rest just says no
Ever consider going into consulting?
That was the first contract I signed but I got cold feet and tore it up
I might in the future
i'm kinda on salt rock's camp here
on a time constraint, time spent learning math is time spent not learning software eng
i would say that's a job for 2 people at least
the code and software optimizations you learn on one side are completely unrelated to the ones on the other
The truth is that for most roles there's diminishing returns on that math vs. software
in general, a lot of "DS" positions really just need software eng
If you go that deep then you should really only aim for the ones where the diminishing returns aren't doing you in
people don't even know what DS and ML are in the first place
But a lot of the data work requires a different kind of software engineering. Optimal SQL or pandas is very different to the skills you learn with C/C++/Java type software engineering
But there's a reason why we team up right, a team goes farther and for a team to work you gotta have lanes
this is what SHOULD happen, but salt rock lamp was complaining about the opposite being the case when you look for job openings
But it's definitely a two way street that's what I meant in the discussion tbh
Yeah I'm aware of what happens, especially in the smaller companies
Because from observation DS are unique in the fact that they say "not my job" and don't grow towards the mismatch in the hire
Actually a lot of DS are moving into Data Engineering
getting AWS/GCP certifications, learning dbt
Unless the point all of you are trying to make is that companies should hire less data scientists
I don't know, I had to learn to say no. Not because I'm not willing to take additional tasks but because I'll very quickly become overworked
almost kinda, yeah
a lot of them don't really need it imo
well, they should think whether they need a data engineer first
before going for the data scientist
I'm very productive in general and that creates this illusion even to myself that I can just keep on doing more stuff, but it's not the case
just basic stats would take them a long way, which doesn't require heavy ds
To come full circle
What I'm trying to say is, move some of those math / stats hours to software
Or work at google / do a PhD / ...
that's where I've been trying to focus on as a data analyst, moving into hypothesis testing, linear regression etc.
but business people don't always get statistics
don't like uncertainty
We actually had a breath of fresh air 2 hires ago
this sounds about right
The person that we hired wasn't married to ML (and was previously a software engineer)
They ended up building an awesome ML product, one of the best we have on offer
Because they're willing to do what it takes
that kinda piggybacks on what i said though
that software eng is truly what is usually needed
I think we're in agreement
you could probably even do with a single ds person that doesn't even code, but regularly participates in the meetings where stuff is arranged with the others
I just don't agree with the "I can do two things so I need to be paid 2x" argument
yeah i guess that's unrealistic expectations, but from both sides
Software isn't a monolith, software engineers themselves need to do 2+ things all the time (frontend, backend, devops, data, ...) and none of them makes this argument tbh
the employer not knowing what to ask for, and DS people being reluctant
I was like this before as well and what I did was ask a million questions in interviews
There's still things I more or less "refuse" to do because I don't enjoy them and I'm not good at them either, I'm just transparent about it
If anyone decides to not hire me on the basis of that both of us win
lol
But wouldn't it be unfair to hire someone as a data scientist and then have them do 90% frontend
Where's the line that separates the roles
That's true but that's a super L for both sides
I can imagine the DS will be terrible at frontend
This doesn't happen tho
My next project is on HCI / explainable AI. The very first thing I'll do is make a frontend we'll use for the experiments.
My focus is on making cool stuff and if there's no one else to do it, then I step up. Obviously it'll take me longer than a specialist, but at the end we do have something tangible which is what matters
zestar, maker of cool stuff
Yeah, maybe I should put that on my linkedin
And take away data scientist or whatever I have, I have been thinking of it 😛
I guess the fear is to be stuck doing things that don't further what the person feels should be their career, and this is a pretty strong thing because a lot of people derive purpose from their work
I think there’s a bit of hubris involved in presupposing a career path.
i guess DS people hit this wall often because its a buzz word that was turned into a career in unis for whatever reason
Lots of people think ML -should- be their career path, and it (imo) won’t be for most of them.
I think reality is: careers are shaped primarily by opportunity, some luck, and a bit of preparation
My organized thoughts will be written down about this. I have many sketches (actual drawings/figures) of what I think the problem is
Will take me a couple of months to write it out, but then I'll let all of you know
zestar approaching us in 3 months with a large mirror
"look"
I’m wondering which other hype cycles have been like this.
bitconnect
The gist of what it's going to be: if you look at the N % most valuable work in an org, the org likely needs to be very large to sustain someone with a very lopsided skillset (I'll use radar charts for these).
The dot com boom was just general SWEing, but I guess it mainstreamed web dev
yep, web dev hype in the late 90s
Maybe chip design is comparable to what you just said.
like Verilog, layout etc.?
in what sense?
There’s an interesting stat around the number of chip designers / phds to produce successively modern chips: it’s becoming ever more expensive
One sec, there’s a talk…
idk if that's the best comparison though
Those chip companies just focus on chips though, they'll have a hardware team, a layout team, an embedded software team etc.
cuz there's also the current struggle that too few people study electronics compared to what the market would like (in the chip design end)
You need less data scientists or a more balanced radar chart such that they can go out and do more valuable work when a pure DS project isn't in the top N %
and the overall trend of people studying STEM decreasingly
[EuroPython 2023 — Forum Hall on 2023-07-20]
https://ep2023.europython.eu/session/the-future-of-microprocessors
The Future of Microprocessors - a talk about the history of microprocessors, how we got here and what might happen next. There will be two laws, one equation, some graphs and a particle beam weapon out of Star Trek.
what's the TL;DW
I'll have a look!
ok she's a bit of a legend
Yah, and this was a great talk, very accessible
but I think in her line of work it's as I mentioned above, lots of verilog people, lots of embedded software people (including assembly)
and some people who are more into layout etc.
I dunno how relevant but the idea had to do with the declining ROI of chip research
Similarly, I think there’s some limit to the returns in DS in a single organization. Maybe a bit of a stretch.
It depends, you can do lots of R&D to develop the next processor or ASIC
but it's much harder to push CMOS technology further
or give up on CMOS and come up with a replacement
not sure how this compares to data science
past a certain point what you need is a team of physicists to research new fancy stuff and separately, engineers to try implementing it
CMOS is hitting limits in terms of electrons tunnelling through thin layers of insulator
quantum mechanics and all that
so you need a paradigm shift
and that was parallelisation, multicore, GPUs etc.
Anyhow, I apologize for the controversial opinions! 😄
Esp to Salt Rock if he's still reading
It's a difficult topic
data science had the opposite with GPUs, suddenly a world of possibility opened
I'm not sure I agree with this. People get hired for certain roles which presuppose a given set of tasks. Someone hired as a data scientist refusing to do backend or anything else not in the job description doesn't strike me as something out of the ordinary.
Just when I thought I was out they pull me back in :/
We agree that people exaggerated with stuff like microservices because Google did it yeah? "Google does it, they're big so if we do it, we'll be big!"
Just to clarify tho, I don't think it's healthy to do just one thing, but there's no arrogance in pursuing self determination in a career
It looks like I found the root of my problem, which appears to happen as a result of defining another Pyodbc connection inside a function while I defined one earlier in the program.
Who says it's not the same for data science 😭
At least in the way where it's frequently touted
people invented Hadoop to mimic Google's file system, Big Table etc.
and that was the data science fad of the 2010s
ok, one of them
hello, I need a help with python
I do think it's wrong to push someone out of their role. If the person wasn't hired to do X, there should be a discussion before assigning those tasks.
for scraping a website, I am unable to get the subcolumn data
ask away
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})
    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        # Process all rows
        for row in table.find_all('tr'):
            # Extract data from each row
            row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]
            # Write the row data to the CSV file
            writer.writerow(row_data)
    print("Data has been successfully written to output.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Here, use this ```
I do not get the subcolumn data
: P
Can you reformat this and use ``` as pedantic_propagation says
I also don't think this question is best suited for this room
Could you make a help thread?
done
please help me with code to get sub column data in appropriate format
I highly recommend being a generalist (if you have strong programming and math skills, you can tackle most things). Especially now that software is massively downsizing, they can't keep hyper specialized people anymore.
(See who remains after the layoffs, it's not the specialists...)
Software is returning to where it was before, lots of generalists with many hats.
So what's going wrong ?
I highly recommend you check the stuff you're receiving, like, just cuz you got a 200, and your headers say it should be html, it don't mean nothing, sometimes people do wtv >.>
I mean the webpage has data in subcolumns, and in the CSV output that data ends up in a different column under a different column header, which messes up the data
```python
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})
    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        # Process all rows
        for row in table.find_all('tr'):
            # Extract data from each row
            row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]
            # Write the row data to the CSV file
            writer.writerow(row_data)
    print("Data has been successfully written to output.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
```
It's the same code, I need the syntax highlight
But I still didn't understand
Like maybe show the page and the CSV
Could you open a help thread? This channel can get noisy and this sounds like it's going to take more than a few back-and-forths. #❓|how-to-get-help
It’s an html table, they’re trying to write to csv.
But unclear on whether the subtotals are in the same html table (I didn’t look at the raw html)
Yes I understood, I'm not understanding what went wrong
Seeing the CSV will make it clear for me what is being described
I think the basic problem is this isn’t a simple html table. Colspan headers, multiple splits, etc
so i think there should be a way to distinguish the subcolumns, and the column names must be processed to create an equal number of names as the number of columns/subcolumns
how to fix this in my code so that I get appropriate data ?
But it looks fine
no, the "ICME Plasma/Field Start, End Y/M/D (UT)" can be divided into "ICME Plasma/Field Start Y/M/D (UT)" and "ICME Plasma/Field End Y/M/D (UT)"
currently the data goes under a different column header, which is Comp. Start, End (Hrs wrt. Plasma/
I see three columns
yes but the header is not mapped correctly and I need to fix my code
Four rows before the row with the text
missing the header
check the below row and it misses the sub column
so data is messed up
Below row it's still three columns
And choosing cells at random, they seem to match
Maybe I'm not seeing something, but the only thing missing is the first row, which is the header
The Comp Start values are actually the second column of the ICME plasma
Because of the col span
<thead>
<tr align="center"><td><b>Disturbance Y/M/D (UT) <A HREF="#(a)">(a)</a></b> </td><td colspan="2">
<b>ICME Plasma/Field Start, End Y/M/D (UT) <A HREF="#(b)">(b)</a> </b> </td><td colspan="2">
<b>Comp. Start, End (Hrs wrt. Plasma/ Field) <A HREF="#(c)">(c)</a></b> </td><td colspan="2">
Oooh I see it
Just a terrible table design.
But can CSV represent this table ?
Yah, the col span needs to be split into two headers
Meaning, the colspan=2 headers need to be unpacked into two headers
Wait is the / meant to match the two columns
Like "name of first column/ name of second"
No it's the start, end
It's a date range or something
I think even copying and pasting would work
Yah
any suggestion for code ?
You have to rewrite to handle the colspans in the headers.
You'll have to modify this step: row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]
to ?
You'll have to figure that out, I'm just telling you the problem and where to start.
Okay
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})
    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        # Process all rows
        for idx, row in enumerate(table.find_all('tr')):
            # Extract data from each row
            if idx == 0:
                # Handle headers with colspans for both main columns and subcolumns
                header_row = []
                for cell in row.find_all(['td', 'th']):
                    colspan = int(cell.get('colspan', 1))
                    header_text = cell.text.strip()
                    if colspan > 1:
                        # If colspan is greater than 1, add the text multiple times
                        header_row.extend([header_text] * colspan)
                    else:
                        # Otherwise, just add the text once
                        header_row.append(header_text)
                writer.writerow(header_row)
            else:
                # Extract data from each cell in the row
                row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]
                writer.writerow(row_data)
    print("Data has been successfully written to output.csv")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
This works for 1st 4 row and not all. Any suggestion ?
This is all GPT code, right?
So, the first step is to figure out what is not doing what you want it to do.
You ask: "Any suggestion ?". What do you need help with in this version?
Or, said differently, you say it works for 1st 4 rows. What happens on 5th and 6th?
now with the updated code the first four rows work, but when it reaches the 5th row, which is again a column header, the colspan handling doesn't work and it messes up the data
What about 6th row?
not working, same problem as earlier
What's wrong with 6th row?
Let me fix it
The trick to debugging these problems is to add a few print statements, so you can see what's happening.
Do you know what a colspan is?
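For reference, colspan="2" means one header cell sits above two real columns, so a header row has fewer cells than the data rows. The version above only expands colspans when idx == 0, but as described, header rows show up again further down the table. A minimal sketch of one way to handle it: expand every row the same way. The helper name expand_cells and the "Start, End" splitting rule are assumptions for illustration, based on the header text quoted earlier.

```python
def expand_cells(cells):
    """Flatten (text, colspan) pairs into one entry per real column."""
    flat = []
    for text, colspan in cells:
        if colspan == 2 and "Start, End" in text:
            # Split a "Start, End" header into two separate column names.
            flat.append(text.replace("Start, End", "Start"))
            flat.append(text.replace("Start, End", "End"))
        else:
            # Any other cell: repeat its text once per spanned column.
            flat.extend([text] * colspan)
    return flat


# With BeautifulSoup, the pairs for a row could be built like this:
#   [(c.get_text(strip=True), int(c.get("colspan", 1)))
#    for c in row.find_all(["td", "th"])]
header = [
    ("Disturbance Y/M/D (UT)", 1),
    ("ICME Plasma/Field Start, End Y/M/D (UT)", 2),
    ("Comp. Start, End (Hrs wrt. Plasma/ Field)", 2),
]
columns = expand_cells(header)
```

Calling the same helper on every row, not just the first one, keeps repeated header rows and data rows at the same width. Printing len(columns) for each row is a quick way to spot where the widths stop matching.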
I need help with some basic machine learning. I am trying to solve the Titanic prediction problem from Kaggle but after imputation, my train data gets more rows somehow and then it doesn't match with the y_train
X = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = train_data['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y)
# Encoding
oh_enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
oh_X_train = pd.DataFrame(oh_enc.fit_transform(X_train[['Sex']]))
oh_X_val = pd.DataFrame(oh_enc.transform(X_val[['Sex']]))
X_train_encoded = pd.concat([X_train.drop('Sex', axis=1), oh_X_train], axis=1)
X_val_encoded = pd.concat([X_val.drop('Sex', axis=1), oh_X_val], axis=1)
X_train_encoded.columns = X_train_encoded.columns.astype(str)
X_val_encoded.columns = X_val_encoded.columns.astype(str)
# Imputation
imputer = SimpleImputer()
imputed_train_data = pd.DataFrame(imputer.fit_transform(X_train_encoded))
imputed_test_data = pd.DataFrame(imputer.transform(X_val_encoded))
imputed_train_data.index = X_train_encoded.index
imputed_test_data.index = X_val_encoded.index
imputed_train_data.columns = X_train_encoded.columns
imputed_test_data.columns = X_val_encoded.columns
I put an X_train_encoded.describe() after the encoding and it says the dataframe has 668 rows at that point, which is what it should have
But when I do this after the imputation, for some reason it shows the df with a varying number of rows around 830, though this number varies a little bit every time I restart the kernel and at the end of the program, I get this error "ValueError: Found input variables with inconsistent numbers of samples: [838, 668]" when trying to fit a model
Do you have any idea about what it could be?
Is it wrong to do the imputation after encoding?
This is the kind of question that beginners think to ask. And the answer is that pros don't think about situations like this
That's interesting
But then I can't think of an actual problem with the imputation I did
even if I take out the parts where I set the index and the columns, it keeps adding rows
Encoding is about making sure that the same information is represented the same way, and that information is represented in a way that is intelligible by the model
Both of those concerns are equal
Imputation is about filling in missing information with whatever would be least interesting to the model
So it shouldn't add rows at all
that's very odd
No...
Could you try to run this program if I send you the dataset?
No.
I'm actually on vacation
I even promised my mom that I wouldn't answer questions on discord during it.
I don't live with my mom bte
lol
Btw
I just answer questions on discord when I'm at her house so I don't have to talk to her
So she assumes I do it all the time.
I get it
Yeah
but then I don't know what to do
I could try to do it using get_dummies
probably it wouldn't give an error
but I want to see what the problem really is
That's basically like one hot encoding, I think
Yeah
But this error left me curious
It's very odd and I don't want to have it happening again
"getting an error" is just one of two overarching ways that your program can do something other than what you intended for it to do.
It can also do something other than what you intended, without raising an error
In which case you're fucked. Unless you know what you're doing.
Yeah
@left tartan do you have any idea what this could be?
Please don't ping people who haven't engaged with your specific question to ask them for help. No one is on call to provide help.
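On the Titanic row-count question above, one possible cause worth checking, sketched here on made-up data: pd.DataFrame(array) gets a fresh 0..n-1 index, while X_train keeps the shuffled original row labels from train_test_split. pd.concat with axis=1 aligns on the index, so it returns the union of both label sets and pads the gaps with NaN, which the imputer then fills in. That would also fit the row count changing with every random split. Note that describe() reports the count of non-missing values per column, so it can still show 668 when the frame has more rows; len(X_train_encoded) gives the real row count. The column name and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Rows as they might look after a shuffled split: labels are not 0..n-1
X_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0]}, index=[7, 2, 9])

# Stand-in for the encoder output: a bare array, so the default index is 0, 1, 2
encoded = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Default index on the encoded frame: concat takes the union of the labels
misaligned = pd.concat([X_train, pd.DataFrame(encoded)], axis=1)

# Reusing the original index keeps the rows lined up one-to-one
aligned = pd.concat([X_train, pd.DataFrame(encoded, index=X_train.index)], axis=1)

print(len(X_train), len(misaligned), len(aligned))
```

If this is the cause, passing index=X_train.index (and index=X_val.index) when wrapping the encoder output should keep the row counts matching y_train.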
I just want to verify my intuition of why activation functions are necessary. For this example let's consider a network that classifies numbers 0-9. A network WITHOUT an activation function will be able to do well on numbers that are similar to the sizes of the numbers in the training set but it will struggle if numbers appear to be darker or lighter because it is linear and cannot take both size and lightness/darkness into account. A neural network completing the same task but WITH activation functions will be able to take into account both orientation and lightness/darkness because the weights will learn all possible relationships between the pixel values in the data set and the sigmoid (or other activation function) will then take numbers that are slightly lighter or darker and transform/smooth them so that the network could output the same probability as if it were the normal darkness/lightness. Does this intuition sound about right or is it incorrect in some way?
sorry, didn't mean to send the previous message. i'm using pytorch, and i'm not really sure on the format of the data on which my pretrained model was... trained. there's a lot of stuff in the code about making vocabs that are, to my knowledge, not actually torch vocab objects (it seems all custom?). i'm having trouble parsing it but it doesn't seem to be changing anything outside of something about ID correspondence and making a .pkl file. i want to fine tune it for a binary classification task. should i be worried about this data formatting? as far as i know the data is similar enough, at least in its raw form. would it be okay for me to just put it into a dataset and start training?
the creators of the pretrained model talk about implementing classification as a downstream task, if that changes the process at all
Activation functions in general are necessary because otherwise you can only model linear functions. Any number of weight matrices A1, A2, A3, ... that are used in a model by passing an input through every layer: ...A3⋅(A2⋅(A1⋅X)) can also be modeled using a single weight matrix A⋅X @crimson summit
Because in the end each output is just a linear combination of the inputs
In the output an activation could be necessary because, e.g., you want the outputs to sum up to 1, because it needs to be a probability distribution. So that is why you would use softmax, for example.
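The collapse of stacked linear layers into a single matrix can be checked numerically. A minimal sketch, assuming NumPy and arbitrary made-up layer sizes; A1, A2, A3 and X follow the names used above, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 3))
A2 = rng.normal(size=(5, 4))
A3 = rng.normal(size=(2, 5))
X = rng.normal(size=3)

# Three linear layers applied one after another
stacked = A3 @ (A2 @ (A1 @ X))

# The same map written as one weight matrix
A = A3 @ A2 @ A1
collapsed = A @ X


def relu(v):
    return np.maximum(v, 0.0)


# With a nonlinearity between layers there is in general no single matrix
# that reproduces the output for every input
nonlinear = A3 @ relu(A2 @ relu(A1 @ X))

print(np.allclose(stacked, collapsed))
```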
Hey guys, need some help regarding missing data. I know it's all project specific but what would you do here? I am still new to this so I'm still learning. I'm using for name it would be easier to just make them all missing. But for the rest? I'm trying to do credit score classification from kaggle, https://www.kaggle.com/datasets/parisrohan/credit-score-classification
I'm updating my resume, and this is the stuff I've been doing, and it reminded me of yesterday's conversation
looking at this, it feels like coding the model and the direct contact with data is a small part of what needs to be done to get these things going, either that or I'm doing something wrong, but I don't really see how I'm gonna do this without all the infra, and when the infra is done all I'm doing is sort of waiting around (I use that time to do other stuff like update resume etc.)
like, once the infra is done, I trigger some runs and wait for it to do its thing
then I take the results, review what I did right or wrong, go back to the data and repeat the process
once that is done I'll be doing deployments
what I'm saying is that rn to me it feels like the MLOps stuff is 90% of the work that needs to happen to train a model
that sounds about right in practical jobs and is in line with what we were discussing yesterday with zestar and the others
we use ml extensively where i work too, but we do none of this other than things like in your first bullet point :p
how do you do it then tho
like do you just get a super expensive instance and train stuff there
we do math on paper, run our bad code on the university compute cluster (e.g. lsf or slurm), and publish the results in papers
ok yeah if you have access to a cluster I can see that
is it like a super computer
yeah
I think we have a couple in my country, there was a cluster in my faculty but it was small stuff
this one is not huge either, but the nodes add up to a couple dozen A100s or so
and the nature of the work is more about reformulating problems and showing why some approaches should be better. actually running code is more to show evidence
or at least that's how i see it :p maybe my boss hates me haha
not enough to train LLaMA but still quite a lot
like, if you read their paper they were very concerned with performance and doing all these crazy optimizations
which is my next thing to do
I'm essentially emulating their challenge but on a smaller scale
my constraint is a 16 GB GPU on the cloud, which is not a lot to train even small transformers
yeah that's rough
I'm writing it to be scalable, eventually if I decide to I can just add more gpu at will
I think it's a good play to start in an artificially constrained setup because that encourages me not to be wasteful of the available resources
yeah. we do a fair amount of that too, since faster and less mem is always a selling point
ASDSC
hey I trained multiple regression models on an ethereum historical dataset and all of them are giving good results
is it time to get rich?
Now think about risk management
What's the importance of the p-value in multiple linear regression?
Test it on future data to make sure it will work well in the future 👌🏽
importance in what context?
what's the least performance-impacting way to do bounds checking on the size of a cuda array so that I can implement some kind of mitigation on my program's concurrent memory usage
What do you mean by bound checking the size ?
Like prevent out of bounds access of a memory address ?
prevent it from trying to allocate too much CUDA ram using something like thread blocking or smth
my model fits in memory but I'm using async on parts of my program
so it can be loaded twice I think
in which case being in memory twice doesn't fit
so I need to wait for the first model to be deallocated
in principle though let's say you have many GPUs I only want it to block if the next action would overallocate to CUDA
If you know how much memory each model occupies on the GPU (accounting for the stuff that happens when calculations are being made), you can use redis to store how many models have been loaded to GPU
Redis can act as a lock for multi processing and multi threading stuff since it's single threaded and does one thing at a time
good idea thx
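Inside a single process that uses async, the same counting idea can be sketched without Redis by using an asyncio.Semaphore sized to however many model copies fit in GPU memory. This is a swapped-in, in-process stand-in for the Redis counter suggested above and does not coordinate across processes. MAX_MODELS_ON_GPU, run_inference, and the sleep that stands in for loading and running the model are all made up for illustration.

```python
import asyncio

MAX_MODELS_ON_GPU = 1  # assumption: only one copy of the model fits in memory


async def run_inference(job_id, gpu_slots, stats):
    # Wait here until a slot is free instead of loading a second copy
    async with gpu_slots:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for load + forward pass + free
        stats["active"] -= 1
    return job_id


async def main():
    gpu_slots = asyncio.Semaphore(MAX_MODELS_ON_GPU)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_inference(i, gpu_slots, stats) for i in range(4))
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(results, peak)
```

Tasks only block when every slot is taken, so with more GPUs or a smaller model the limit can be raised without touching the rest of the code.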
speaking of which, I wish there were something like Spark for the GPU
that's where I really need the memory mgmt
I think pytorch has builtin functionality for handling multiple GPUs
Something something nn.DataParallel, idk
i don't mean just that part, I mean the memory management part
like it's not smart enough to say "no more right now"
the way Spark is
Yeah that would be useful for sure
But many times it's hard to predict because it's not just the memory of the model, it's the allocations that happen in between when the graph is being executed
Got an interview in 10min or so
Just wanna get it over with, these things give me anxiety ._.
Nice. What job is it for?
ML Engineer
It's gonna be a physics problem or something of the sort
Ty
Ngl, failing this interview would be a bit of hit on my pride 🥲
But I'm rusty on physics, haven't touched it in 2 years
How important is Physics in ML?
In this case it's important because it relates to the company's core product
Which is?
Using simulation data to train models that make the simulation unnecessary
Physics simulations tend to be costly and some can't even be put on a GPU without a ton of simplifying assumptions
My MSc thesis was physics simulation, but of a different kind to the ones they seem to be doing, which relates more to structural engineering
My stuff was simulation of elementary particles like photons and electrons
The first time they started doing them was actually at Los Alamos during the Manhattan project
They had to simulate neutrons and such
Aight gotta go

I have a question. Does anyone experienced in Plotly Dash know how to update 'active_cell' with the new dataframe so that the contents of the cell for data_table will be read with the new data rather than reading the old data from the data_table cell.
trimmed mean -- first sort the numbers, then remove the 1st and last number, and then find the mean of what's left... did i understand it right?
To give a better idea of the problem, imagine you have a table with 1 column and the rows have the values A, B, C, D. You then press a button that performs an operation that changes the values from A, B, C, D to 1, 2, 3, 4. Now you go to click the cell with the value 2, but the contents have it read as B, even though you can clearly see the 2 on that table. What would be the best way to solve this?
not necessarily only the first and last, but yes
You have to change the underlying data of the table. So some kind of callback that, on cell change, updates the underlying data
ok..also i didn't get what is meant by 'p' in the formula?
do i take the largest no. or smallest
in place of 'p'
what i said is exactly what the p means
.latex sigma notation means the following:
\[
\sum_{i=n}^N x_i = x_n + x_{n+1} + x_{n+2} + \cdots + x_{N-1} + x_{N}
\]
so if you add p to n, and subtract p from N, it means "ignore the p smallest and p largest entries, then take the mean"
p can be 1, but it can also be any other value
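The description above ("ignore the p smallest and p largest entries, then take the mean") can be sketched in a few lines. The formula being discussed is not shown in the chat, so trimmed_mean and its argument check are assumptions that follow that description only.

```python
def trimmed_mean(values, p):
    """Mean of `values` after dropping the p smallest and p largest entries."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= p < len(values) / 2")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


data = [100, 3, 1, 4, 2]
print(trimmed_mean(data, 0))  # p = 0 is the ordinary mean
print(trimmed_mean(data, 1))  # drops 1 and 100 before averaging
```

With p = 1 the outlier 100 no longer dominates the result, which is the usual reason for trimming.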
And that would be done by having it modify the layout?
ohk
got it
Eh, I do believe it went almost to perfection
I wasn't able to answer a question about the boundary conditions of the Navier-Stokes equations on an airplane wing, but he said it was cool. The first questions I aced them all
And there was a lot of time left at the end
We just kinda talked about the role
Tho I'm a bit embarrassed I didn't know that one
I also got wrong a question regarding genetic algorithms, I think I correctly explained everything but I said something wrong by relating them to reinforcement learning
Anyone here running a study group? Looking for one where people actually want and enjoy learning and working together on (mini) projects. Preferably based in the European timezone.
I understand if you don’t use one the whole thing becomes linear but if you do use an activation function is the visual of the flow of numbers I described correct ?
THX! I went with an interesting approach that saves a CSV somewhere and the program will read that newly created CSV when I select the cells.
Nice. Imagine if Sheldon was your interviewer though.
I really dislike that show >.<
Idk about mini-projects but if you're interested in NLP, join the Cohere discord server. They have several niche-specific study groups in NLP.
MLCollective also has something like that, not just in NLP domain alone.
Just Google them, the link to their Discord community can always be found on their website.
I was just about to say, I really like the idea of a casual small study group with non-work colleagues. Thanks for these
Do you have a specific reason?
It's just a bunch of stereotypes. High levels of Intelligence also frequently come with high levels of empathy and emotional understanding. Sheldon is a myth as far as I know, and you can see that if you read up on real world geniuses
Physics and science students in general are also kinda just normal college students, they don't look or sound like the people on that show
it's a noxious combination of bad stereotypes, pop-culture pandering, and generally not being funny or interesting
ai is cool
Guys anyone in research ?
I wrote my first paper today
So just looking for a review before submitting it
Did I write something wrong? 😶‍🌫️
@wooden sail In a fully connected neural network that detects numbers 0-9, if two 7's are inputted into the network and both 7's are the exact same size and position except one 7 is slightly darker and one is slightly lighter, then when the pre-activation values reach neuron 1 (for example) the sigmoid function will essentially make the activation values the same, something like 0.993 and 0.992, which will allow both 7's to be treated the same throughout the rest of the network and be classified correctly. Does this intuition sound right?
hmm there's several issues with that reasoning
the most important being that you wouldn't really know what value the network will output nor why
and there's no reason why the output values would be close to each other
0.50001 for one of them and 0.999999 for the other is still a correct classification
at the end of the network there would be probabilities for each number so if one is 0.50001 in the first layer it could cause the network to output a higher probability for another number whereas if the network has already learned the relationship for the pixel values for a 7 in that position the sigmoid will essentially just transform the lighter 7 to take the same path as the normal 7 throughout the network
idk what you mean by "path through the network", any input will have exactly the same operations done on it
classifiers do often have an output corresponding to a probability being assigned to each class. all you need is for one class to have a probability higher than the others for that to count as the class the network predicts
it doesn't matter if it's the largest by 0.9 or by 1e-15
also what you understand by "similarity" is not at all what the network will learn to treat as "similarity". what networks do usually has no real world interpretation that makes intuitive sense to people
it's just not the case that you can interpret what a neural network is doing in general. you'd be better served thinking of it as a function that maps an input to a categorical distribution, without wondering about the "how" for now
pre activation value for lighter 7 =4 pre activation value for darker 7 =5. Without the sigmoid the "lighter seven" would come up with a much different probability at the end of the network compared to the darker 7 but with the sigmoid if the pixel values are slightly different it will make them both 0.991 and 0.992 so then when all operations get done throughout rest of network they come out with the same probability
nope
there's no reason why the probability of the two 7s will be anywhere near each other
in fact the scenario you're describing is often enough to make a network guess the number incorrectly
if they are same shape and size and all that is different is the shade this doesn't happen ?
it can very well happen
so then isnt that what sigmoid does in this case
nope
all the sigmoid does is enforce that the outputs are between 0 and 1, and the softmax (the multidimensional form of the sigmoid) makes it so that the outputs are between 0 and 1 and add up to 1
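A small sketch of those two properties, using only the standard library. The input values are arbitrary; for reference, sigmoid(4) is about 0.982 and sigmoid(5) is about 0.993.

```python
import math


def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    # Subtract the max first so exp() cannot overflow; the result is unchanged
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(4), sigmoid(5))
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Neither function says anything about how close two outputs will be after further layers; they only bound the range and, for softmax, normalize the sum.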
okay let me try and rethink my logic to try and fit this and I'll make another scenario, hopefully better
Can anyone help me with a problem im having with chromadb in python? -
im using unstructured to chunk and embed files to my local instance of chromadb. i then query the chromadb and send the k chunks to an LLM for natural language processing, and get a result. This flow works well, but im at the point where i need to store metadata and filter by the metadata when querying.
I am inserting vendor invoices in pdf format into chroma, and then i need to query them later. This is obviously difficult with multiple invoices as chroma does not 'know' which one or ones i want to query. Therefore, i want to extract some data from the invoice into metadata (invoice #, payee, vendor name) so when i query i can use this to filter results. (example: give me the total of all invoices from VENDOR to PAYEE, or what is the total of invoice INVOICE_NUMBER)
does anyone here have any experience with this? am i barking up the wrong tree and there is an easier way to do this? at a certain point the docs for both chroma and unstructured kind of just drop off and stop being useful
i know its a long shot 😆
So a fully connected network that is trained to detect numbers 0-9 (numbers are black and background of image is white). It is trained on numbers that are slightly different shapes, sizes and brightness. Through backprop the function (aka Neural Net) learns to generalize across all numbers 0-9. During testing we have two 7's of the same size and shape but one 7 has a slightly different brightness that has not been seen in training. When both of these 7's are inputted into the network they get classified correctly. The reason they were both classified correctly is not just because of the sigmoid but a combo of the weights and sigmoid, because through backprop the weights learned relationships between pixel values of all different types of 7's, so one that appeared in training will obviously do well, and for the one whose brightness did not show up in training the sigmoid helps this unseen brightness be seen the same as one seen in training by mapping positive values to nearly the same value: 4 and 5 get mapped to 0.991 and 0.993. So in conclusion it's a combination of the sigmoid and all the learned weights from training on many different examples that allows this generalization to occur. @wooden sail does this seem to follow a better train of thought?
you're still getting the sigmoid part wrong
there's no reason two different instances of the same class would be classified with the same probability, neither through the affine transformations nor through the sigmoid
probability won't be the same but more similar than if the sigmoid was not there, right?
affine
watching a tutorial and i don't think it was explained, why is X always capital?
Or better, is it always capitalized?
need some more context, but capital bold letters usually represent matrices or tensors, while capital letters without boldface denote random variables
you'd have to show an example though, because notation only makes sense in context and varies by book/course/video. symbols in math don't have fixed meanings
so without sigmoid it's just a linear transformation but with sigmoid it helps give similar probability to 7's of different brightness. But I can't compare and say sigmoid helps give better probability than without sigmoid because without sigmoid there is no probability at all, just a linear transformation
Hi! Are you able to help me with python?
Sorry forgot about context
no, there's no reason why the probabilities would be similar
here it just means a transformation was applied to x and it's no longer the original
but if a pre-activation value is 4 and a pre-activation value of the same number but darker shade is 5 then when transformed by the sigmoid the values will be similar so then when further operations are done their result will be kind of similar as well
Alright ty
sorry if being redundant just trying to get this through my head
yeah but why would they be 4 and 5 to begin with
what you hope is that, in most cases, you predict the class correctly. all that needs to happen for that is that the correct category gets the largest probability. nothing is said about the value of that probability
an output of [1,0,0] is just as valid as [0.35, 0.33, 0.32]
and you can't even guarantee that it'll always work. you'll have many cases where you get the wrong output too
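The point about [1,0,0] and [0.35, 0.33, 0.32] can be made concrete: the predicted class is whichever index holds the largest probability, regardless of the margin. predicted_class is a made-up helper name for this sketch.

```python
def predicted_class(probs):
    """Index of the largest probability; the size of the margin is ignored."""
    return max(range(len(probs)), key=lambda i: probs[i])


confident = [1.0, 0.0, 0.0]
hesitant = [0.35, 0.33, 0.32]
print(predicted_class(confident), predicted_class(hesitant))
```

Both outputs select the same class, even though the probabilities themselves are nowhere near each other.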
Can I get help please?
not from me atm, sorry
:/
lets say the darker 7 pixel values =0.7 0.3 0.4 0.5 and the lighter 7 pixel values =0.6 0.3 0.2 0.4 when these values are multiplied by the same configuration of weights that lead to neuron 1 in layer one and then passed through the sigmoid they will have similar values
not necessarily, and especially not if you have several layers
what you consider distance or similarity is not the same as what the network considers distance or similarity
I am getting this error, the person who is doing the tutorial fixed it by running:
train, valid, test = np.split(df.sample(frac=1), [int(0.6*len(df)), int(0.8*len(df))])
but i cannot fix it by running that, what am i doing wrong? 🙂
oh wait, think i found the issue
yep found it, inside the scale_dataset i was passing the wrong args
ohhh so the network could have a 0.01 difference between activation values in neuron 1 be a massive distance, so as the sigmoid transforms both pre-activation values, if one is 0.01 smaller than the other it can result in a massive difference in estimated probability, but with patterns learned from the weights it will still have the highest probability for its class and get classified correctly
so a similar activation value doesn't mean a similar probability, but a probability that will still be the biggest in relation to all other classes so classification can be correct
@wooden sail
yeah
o shitttttttt thanks so much man ! Thanks for working with noobs like me !
Anyone know how to try Gemini 1.0 if you're in Europe ? I have the subscription thing but the model says it's not capable of generating images, which I thought was one of its capabilities
It's also clearly gpt 3.5 level
The android app is not available too
sometimes models might just hallucinate that they are or are not able to do something, and asking again in a new chat (and phrasing it differently, e.g. be more explicit like include "generate an image" in the start of the phrase) could work correctly, but there is also a huge chance it is just not available in Europe at all
See what I mean by gpt3.5 level
Ah no wait, I got misled by the "you're right" part
The text does kinda make sense still
I mean it's totally fine if it's not available, but it seems like they are saying it is when in fact it's not
one way or the other, that is closer to tech support than data science ; may as well move to offtopic
Uhm not sure if it is off topic, since I'm trying to gather how google is doing their AI model rollout
I gave one of our interns the job to play around with these for a couple of months
So we know the capabilities of stuff like goose.ai, quantized models etc.
Open ai has also released a new model
Looks insanely good
The dogs in the snow are my favorite
It's like Gemini is insufficient in coding
Yeah I don't know what's up with it
You can also use dash.store components
I really wanna work at open ai they're doing so much cool stuff 😭
It's like those impossible drawings, the whole doesn't make sense but the details do
is gpt 3.5 the one that comes for free with Bing and that Microsoft is embedding into every one of their products?
Just saw that they launched Gemini 1.5 today.
https://x.com/sundarpichai/status/1758145921131630989?s=46&t=sRKd79BJEKLsp89AJToMJw
I feel they probably added mixture of experts (M.O.E) and called it 1.5
a year ago they were saying this, not sure where they are now Next-generation OpenAI model. We’re excited to announce the new Bing is running on a new, next-generation OpenAI large language model that is more powerful than ChatGPT and customized specifically for search. It takes key learnings and advancements from ChatGPT and GPT-3.5 – and it is even faster, more accurate and more capable. https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/
To empower people to unlock the joy of discovery, feel the wonder of creation and better harness the world’s knowledge, today we’re improving how the world benefits from the web by reinventing the tools billions of people use every day, the search engine and the browser. Today, we’re launching an all new, AI-powered Bing search...
As far as I know bing is using gpt 4
ok makes sense, the blog entry I linked to is old
I had some fun with bing gpt recently. It refused to write a poem about an old cunning fox
because apparently there are iranian poems that describe England as an old cunning fox
I had to ask in a roundabout way to convince it I am not a revolutionary guard 🙂
I just asked it, it did it
In the shadowed woods, under moon's soft gaze,
Lived an old fox, traversing life's complex maze.
With fur as red as the dying day's sun,
He moved in silence, his cunning second to none.
Through the thicket, under canopy's embrace,
He danced with shadows, a silent, fleeting grace.
Eyes gleaming bright in the dark of the night,
He was a specter, a ghost, just out of sight.
(...)
I truncated it cuz it goes for a long time
The thing with GPT is that it doesn't do the same thing in a repeatable way. It second guessed me in an entirely wrong way here
Uhm bing might have a different system prompt, or be a different variation of gpt 4
it's always an older dodgier version of gpt 🙂
AI Twitter is buzzing today. What a terrific Thursday.
Sora
Gemini 1.5
V-JEPA
All in a day. And the day's not even over yet
Does anybody know a good way to perfectly time captions with an ElevenLabs-generated voice using Python?
If yes please ping
Anyone knows this tutorial? https://www.youtube.com/watch?v=i_LwzRVP7bg
And would be good to learn some basics?
Learn Machine Learning in a way that is accessible to absolute beginners. You will learn the basics of Machine Learning and how to use TensorFlow to implement many different concepts.
✏️ Kylie Ying developed this course. Check out her channel: https://www.youtube.com/c/YCubed
⭐️ Code and Resources ⭐️
🔗 Supervised learning (classification/MAGIC...
hi guys, I just want to ask a question. I have read an interesting journal article about the transformer model and found out that the transformer has its own inverted version. Does anyone understand it? I need help understanding how it works
I need help regarding data analysis project making
She talks a bit fast and the topic is physics, delta and gamma rays in energy, very interesting but half of the time no clue what the values need to be other than the same as hers 😄
hi guyz i am working on a college project, can anyone suggest / give me a learning link to train custom object classification using Tensorflow?
It's the dataset in question, but thanks for the insight. By chance you know other good tutorial with an intro to ML?
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
Check TensorFlow website there's a tutorial section therein. I think you'll find it there.
I was going for that after some intro video, i understand better when watching someone explain, i suppose it's related to being voiced idk
But i can try to give it a look again and see if i can understand text based
https://youtube.com/playlist?list=PL8P_Z6C4GcuVQZCYf_ZnMoIWLLKGx9Mi2&si=dQXDNFfk41Zvva6A
You can definitely find more videos that'll speak to you on a personal level on YouTube if this one isn't doing the job well 😀
This content is based on Machine Learning University (MLU) Accelerated Tabular Data class. Slides, notebooks and datasets are available on GitHub: https://gi...
okay thanks
Using pyspark pandas, is there a way to do operations on dates lazily? (in my case, adding pd.offsets.MonthEnd)
anyone with elastic py experience, and elastic in general, what is the best way to handle null fields gracefully?
pipeline? mappings/schema?
also, how can I make polars convert a column into a list of its own former values?
Will give it a look, thanks 😉
I've decided on two main things, the ML course on Coursera by Andrew Ng (the specialization is paid but the individual courses can be done for free), this course is theory-heavy with math etc. To balance out the theory and frankly not get bored, I also chose the hands-on book from O'Reilly in which you work on projects in every chapter. Anything that I need to read into in terms of math and stats, I'll use books. Those single tutorials on Youtube are nice, but after doing several of them I came to the conclusion that I learned what to type, but not why and how it actually works, which feels like a risk of learning bad habits from the start.
Heard of AWS deepracer student league?
I'm also currently transitioning to the ML field and it's already blowing my mind lol
- you get a chance to earn an Udacity nanodegree :)
Currently i am not a student, apparently can't join if i am not one :\
Ah, okays
Y can't the GPU do unsigned ints
Why do I need to use int64 for indexing
I thought I had all this resolved
D :
Can someone guide me with this exercise?
there's a known effect where students claim to learn the best from passive lectures, but actually retain the info the least. i personally really enjoy lectures too, but it requires the learner to actively participate by pausing to take notes, following up with practice problems, etc.
i'm not familiar with this particular YT channel and it might be really good. but i also strongly encourage spending some time with hands-on practice projects and ideally also practice problems from a good textbook
i assume you're not supposed to use a dedicated csv reader, right?
without thinking too hard about code, how would you do it? if you had to just explain it in words.
Uh, didn't know that. do you recommend any books for beginners? or is kaggle enough?
i'll share this here, it might be interesting to some of you. when doing constrained optimization, one interior point method is to make the problem unconstrained by adding "barrier functions", among which log barriers are common. they blow up to infinity as you approach the constraint boundary, so it's a good way of enforcing an inequality. stepping past the boundary in a single step, however, gets you either a complex number or a nan, depending on how the log is implemented. in the case of pytorch you get a nan, but the gradient seems to be hardcoded as 1/x even for values of x outside the domain of log(x) over the reals https://github.com/pytorch/pytorch/issues/76516
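for anyone who wants to poke at the log-barrier idea without pytorch, here's a minimal pure-Python sketch. the toy problem (minimize x subject to x > 1), the function names, and the step-halving rule are all my own assumptions for illustration, not something from the linked issue. the point is only that a step gets shrunk until the iterate stays strictly inside the domain of the log, so the "going past the boundary" case never happens:

```python
import math

def barrier_objective(x, mu):
    """Toy objective f(x) = x with a log barrier for the constraint x > 1."""
    return x - mu * math.log(x - 1.0)

def barrier_gradient(x, mu):
    """Derivative of barrier_objective with respect to x (valid only for x > 1)."""
    return 1.0 - mu / (x - 1.0)

def minimize_with_barrier(x0, mu, lr=0.05, steps=200):
    """Gradient descent that halves the step until the iterate stays feasible."""
    x = x0
    for _ in range(steps):
        step = lr * barrier_gradient(x, mu)
        # Backtrack: never let x reach or cross the boundary x = 1,
        # where math.log raises ValueError (PyTorch would return nan instead).
        while x - step <= 1.0:
            step *= 0.5
        x -= step
    return x

# The barrier objective has its minimum at x = 1 + mu.
x_star = minimize_with_barrier(x0=3.0, mu=0.1)

# A deliberately oversized learning rate still leaves the iterate feasible.
x_safe = minimize_with_barrier(x0=1.5, mu=0.1, lr=5.0, steps=1)
```

in a real interior point method you would also shrink mu over time; this sketch keeps it fixed to stay short.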
I might've been using models that are too large
but yet to see what's gonna happen on that final slope
don't matter where it is now, if it converges to a flatter slope blue will catch up
Hey guys, newbie here. Last year I started to get involved in data analysis using python, something pretty basic like using pandas on a df in a notebook. Instead of getting into more data analysis I got more interested in python itself. To be honest I was even embarrassed by how I was "writing" code before, not even using functions or error handling. Well, the thing is I read a couple of books: Python Crash Course, and Python by John Zelle, which was a bit more formal. And now my code is a bit more modular and follows pep8. The last project I made was scraping about 20k data entries from a website and also creating a recommender system for anime based on similarity in a database.
Now I was diving into a book more related to web scraping and I got frustrated because it started to use classes and objects. Any recommendations for dealing with these more "advanced" concepts? They're probably pretty basic, but I'm 28 and migrating from a health career.
if you aren't embarrassed by your old code, then you aren't growing as a programmer.
anyway, the data science side of Python applies OOP differently than "normal" python. we don't often create our own classes. and when we do, it's really just extending the interface of a library like pytorch.
always ask your actual question right out of the gate. don't wait for a commitment or spread your explanation out over multiple messages.
oh
thx :)
how do i put this
post the code as text (no screenshots) and explain how it's different from what you want.
im using a python backend for my react native project and basically it uses the phone's acceler-
if there's an error message, post the whole error message as text.
oh alr
react? you might be looking for #web-development
no lol
its more python than react :D
okay. well give as much information about your data science question as is needed in one message. don't spread it out over a bunch of messages.
tysm brb
I’m even embarrassed by my new code!
I'm never embarrassed because I have no shame.
App.js:
import React, { useEffect, useState, useRef } from 'react';
import { StyleSheet, Text, View } from 'react-native';
import { Accelerometer } from 'expo-sensors';
import axios from 'axios';
export default function App() {
const [data, setData] = useState({});
const subscription = useRef(null);
const _subscribe = () => {
Accelerometer.setUpdateInterval(1000);
subscription.current = Accelerometer.addListener(accelerometerData => {
setData(accelerometerData);
// Log the accelerometer data
console.log("Accelerometer data: ", accelerometerData);
// Send accelerometer data to the server for prediction
sendDataToServer(accelerometerData);
});
};
const _unsubscribe = () => {
subscription.current && subscription.current.remove();
subscription.current = null;
};
const sendDataToServer = (data) => {
console.log("Sending data to server: ", data);
axios.post('http://10.0.0.21:5000/predict', data)
.then(response => {
console.log("Response received from server: ", response.data);
const predictedMovement = response.data.prediction;
// Log the predicted action based on your model's predictions
logPredictedAction(predictedMovement);
// TODO: You can add logic to perform actions based on the predicted movement here
// For example, send an emergency notification or update the UI
})
.catch((error) => {
console.error('Error:', error);
console.log('Error details:', error.response);
});
};
Thanks I'm going to watch it
!code
?
three backticks, not one
but it still isn't clear how this is a python, data science question
okay, well like I've said a few times, you need to ask your whole question in one message.
character limit
im so sorry
continuation of above code:
```
  const logPredictedAction = (predictedMovement) => {
    // Customize this logic based on your model's predictions
    console.log("Predicted action: ", predictedMovement);
  };

  useEffect(() => {
    _subscribe();
    return () => _unsubscribe();
  }, []);

  let { x, y, z } = data;
  return (
    <View style={styles.container}>
      <Text>Accelerometer:</Text>
      <Text>x: {round(x)} y: {round(y)} z: {round(z)}</Text>
    </View>
  );
}

function round(n) {
  if (!n) {
    return 0;
  }
  return Math.floor(n * 100) / 100;
}

const styles = StyleSheet.create({
  container: {
    flex: 1,
    justifyContent: 'center',
    paddingHorizontal: 10,
  },
});
```
now python:
at least get to the python/data science part. so far, it isn't clear why anyone should want to read this code.
```
from flask import Flask, request, jsonify
import joblib
import numpy as np
import pandas as pd
from scipy.signal import butter, lfilter

app = Flask(__name__)

# Load the trained model
model = joblib.load('model.pkl')

# Add your preprocessing and feature extraction functions here
def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

def window_data(data, window_size):
    windows = []
    for i in range(0, len(data) - window_size + 1, window_size // 2):
        windows.append(data[i:i+window_size])
    return windows

def extract_features(windows):
    features = []
    for window in windows:
        feature = [np.mean(window), np.std(window)]  # Replace with your actual feature extraction process
        features.append(feature)
    return features

@app.route('/predict', methods=['POST'])
def predict():
    # Get the accelerometer data from the request
    data = request.get_json()
    # Apply the low-pass filter to the data
    filtered_data = butter_lowpass_filter(data, cutoff=0.3, fs=50, order=6)
    # Divide the data into windows
    windows = window_data(filtered_data, window_size=128)
    # Extract features from each window
    features = extract_features(windows)
    # Use the model to make a prediction for each window
    predictions = [model.predict([feature]) for feature in features]
    # Return the most common prediction
    prediction = max(set(predictions), key=predictions.count)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=True)
```
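a side note on the "most common prediction" step in the Flask handler above: `model.predict` returns NumPy arrays, and arrays are unhashable, so building a `set` out of them should raise a `TypeError`. here is a hedged sketch of the same majority vote using only the standard library. the helper name `majority_label` is made up for illustration, and it assumes each prediction is either a plain int or a length-1 sequence:

```python
from collections import Counter

def majority_label(predictions):
    """Return the most common label among per-window predictions.

    Each item may be a plain int or a length-1 sequence, such as the
    one-element arrays a scikit-learn model returns for a single row.
    """
    labels = []
    for p in predictions:
        try:
            labels.append(int(p[0]))   # length-1 sequence or array
        except (TypeError, IndexError):
            labels.append(int(p))      # already a scalar
    if not labels:
        raise ValueError("no predictions to vote on")
    # most_common(1) returns [(label, count)] for the top label.
    return Counter(labels).most_common(1)[0][0]

winner = majority_label([[5], [4], [5]])
```

raising on an empty list also covers the case where fewer samples than one full window arrive, which would otherwise fail inside `max()`.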
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import joblib # Import joblib for model saving
# Load the training data
X_train = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/train/X_train.txt', delim_whitespace=True, header=None)
y_train = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/train/y_train.txt', delim_whitespace=True, header=None)
# Create a new random forest classifier
rf = RandomForestClassifier()
# Train the model on the training data
rf.fit(X_train, y_train.values.ravel())
# Save the trained model to a file
joblib.dump(rf, 'model.pkl') # Add this line to save the model
# Load the test data
X_test = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/test/X_test.txt', delim_whitespace=True, header=None)
y_test = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/test/y_test.txt', delim_whitespace=True, header=None)
# Make predictions on the test data
y_pred = rf.predict(X_test)
# Print a classification report
print(classification_report(y_test, y_pred))
remember to put a py after the three backticks.
oh
oh i forgot mbmb
uh so what im trying to do is use a dataset called:
"Human Activity Recognition Using Smartphones" from the UCI machine learning repository
and essentially use the phone's accelerometer and collect the values and use those to print in console the predicted movement/action
and here was the description of the dataset:
**The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.**
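the window arithmetic in that description can be checked with a small sketch: 2.56 s at 50 Hz is 128 readings, and 50% overlap means each new window starts 64 readings after the previous one. the function name and the fake `readings` list are placeholders for illustration, not part of the dataset's tooling:

```python
SAMPLE_RATE_HZ = 50      # from the dataset description
WINDOW_SECONDS = 2.56    # from the dataset description
OVERLAP = 0.5            # 50% overlap between consecutive windows

window_size = round(SAMPLE_RATE_HZ * WINDOW_SECONDS)   # readings per window
hop = int(window_size * (1 - OVERLAP))                 # readings between window starts

def sliding_windows(samples, size, step):
    """Split a sequence into fixed-size windows whose starts are `step` apart."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]

# Stand-in for 6.4 seconds of one sensor axis sampled at 50 Hz.
readings = list(range(320))
windows = sliding_windows(readings, window_size, hop)
```

one consequence worth keeping in mind: the app above calls `setUpdateInterval(1000)`. if that value is in milliseconds, that is roughly one reading per second rather than 50, so a single posted reading is far from a full 128-reading window.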
thank you once again 🥹
what exactly are you asking for help with?
well when i run my code instead of getting the desired stuff i get:
LOG Error details: undefined
LOG Accelerometer data: {"x": 0.1241912841796875, "y": -0.073974609375, "z": -0.999786376953125}
LOG Sending data to server: {"x": 0.1241912841796875, "y": -0.073974609375, "z": -0.999786376953125}
only the values outputted to console
@trail summit so the model is supposed to tell you when the user transitions between activity classes, right?
yes pretty much, or yes?
no you're not
but also, where did you get the idea to use a decision tree to do this?
youtube and friends ;-;
can you show what the first few lines of X_train.txt and y_train.txt look like?
No screenshots.
oh
oops
brb sry
how do i format this?
2.8858451e-001 -2.0294171e-002 -1.3290514e-001 -9.9527860e-001 -9.8311061e-001 -9.1352645e-001 -9.9511208e-001 -9.8318457e-001 -9.2352702e-001 -9.3472378e-001 -5.6737807e-001 -7.4441253e-001 8.5294738e-001 6.8584458e-001 8.1426278e-001 -9.6552279e-001 -9.9994465e-001 -9.9986303e-001 -9.9461218e-001 -9.9423081e-001 -9.8761392e-001 -9.4321999e-001 -4.0774707e-001 -6.7933751e-001 -6.0212187e-001 9.2929351e-001 -8.5301114e-001 3.5990976e-001 -5.8526382e-002 2.5689154e-001 -2.2484763e-001 2.6410572e-001 -9.5245630e-002 2.7885143e-001 -4.6508457e-001 4.9193596e-001 -1.9088356e-001 3.7631389e-001 4.3512919e-001 6.6079033e-001 9.6339614e-001 -1.4083968e-001 1.1537494e-001 -9.8524969e-001 -9.8170843e-001 -8.7762497e-001 -9.8500137e-001 -9.8441622e-001 -8.9467735e-001 8.9205451e-001 -1.6126549e-001 1.2465977e-001 9.7743631e-001 -1.2321341e-001 5.6482734e-002 -3.7542596e-001 8.9946864e-001 -9.7090521e-001 -9.7551037e-001 -9.8432539e-001 -9.8884915e-001 -9.1774264e-001 -1.0000000e+000 -1.0000000e+000 1.1380614e-001 -5.9042500e-001 5.9114630e-001 -5.9177346e-001 5.9246928e-001 -7.4544878e-001 7.2086167e-001 -7.1237239e-001 7.1130003e-001 -9.9511159e-001 9.9567491e-001 -9.9566759e-001 9.9165268e-001 5.7022164e-001 4.3902735e-001 9.8691312e-001 7.7996345e-002 5.0008031e-003 -6.7830808e-002 -9.9351906e-001 -9.8835999e-001 -9.9357497e-001 -9.9448763e-001 -9.8620664e-001 -9.9281835e-001 -9.8518010e-001 -9.9199423e-001 -9.9311887e-001 9.8983471e-001 9.9195686e-001 9.9051920e-001 -9.9352201e-001 -9.9993487e-001 -9.9982045e-001 -9.9987846e-001 -9.9436404e-001 -9.8602487e-001 -9.8923361e-001 -8.1994925e-001 -7.9304645e-001 -8.8885295e-001 1.0000000e+000 -2.2074703e-001 6.3683075e-001 3.8764356e-001 2.4140146e-001 -5.2252848e-002
```
line 1
line 2
line 3
```
also is this one line?
no
I need to know what the structure of the data is
it was like this
what are the rows and columns. like what do they represent
ah
Each row represents a single observation or sample. In the context of this dataset, a sample is a 2.56-second window of time where multiple measurements were taken from the smartphone’s accelerometer and gyroscope.
Each column represents a different feature that has been calculated from the raw accelerometer and gyroscope data. These features are various statistical measures (like mean, standard deviation, etc.) and frequency domain variables that were calculated for each window of data.
is what the university of genova(the dataset makers) said
okay. but you don't know what measure each column is?
also what about the y data? what does that look like?
one huge column
like this
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
oh oh i remembered in the folder i downloaded there was something called features_info.txt
and a bunch of other stuff
and each number represents a state?
so does the model only need to identify which line is which state, in isolation? or does it need to be able to tell you when someone is switching between states?
"The model is trained to predict the activity (or state) based on the features of each observation. In its basic form, the model treats each observation in isolation and doesn’t consider the sequence of activities. So, it doesn’t inherently know when someone is switching between states."
basically
Okay, so it doesn't identify state transitions.
which is what I was asking about here, just so you know
sorry
i realized ;-;
what happens when you run model.py by itself, without the react part?
it created model.pkl
can you show the printed output of print(classification_report(y_test, y_pred)) as text?
k brb
um
@serene scaffold should i format it?
              precision    recall  f1-score   support
           1       0.90      0.97      0.93       496
           2       0.89      0.91      0.90       471
           3       0.96      0.85      0.90       420
           4       0.91      0.89      0.90       491
           5       0.90      0.92      0.91       532
           6       1.00      1.00      1.00       537
    accuracy                           0.93      2947
   macro avg       0.93      0.92      0.92      2947
weighted avg       0.93      0.93      0.93      2947
k its like this
I have good news and bad news for you
o no
the good news is that this is great model performance
the bad news is that the python code works correctly, which means that the problem is only with the javascript code, so you'll have to ask somewhere else.
noooooooooooooooo
i never liked js ;-;
but
thank you so so much!
I really appreciate it
are you a member of the js server?
no
lms if I can find the invite
thx :D
yay tysmm
@trail summit just remember what I said about how to ask questions effectively. it will increase your chances of getting help quickly in the future.
I will :D
thank you again!
also, decision trees can't identify state transitions, so that's why I was confused
I thought maybe someone was bullshitting you
Sorry about that lol
it's okay
XD
I wouldn't even know at this stage
as you continue learning, keep this question in mind: "what are state transitions, and why can't decision trees detect them?"
that's actually two questions
eventually you'll figure it out.
Exception has occurred: AttributeError
'ArrowExtensionArray' object has no attribute 'to_pydatetime'
does someone have a fix for this
I just updated my OS and now my code doesn't run anymore
It's on writing to DB with pandas
did you accidentally switch python environments?
I had to switch python environments
something changed where my old venv didn't work after the upgrade
so I had to make a new one
okay, so you might not have the same pyarrow version that you had before
upgrading was probably a bad idea but I wanted to try to upgrade my GPU driver
do python -m pip freeze in both and compare
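if eyeballing two freeze listings is painful, a small stdlib-only sketch like this can diff them. the function names are made up, the version strings in the example are placeholders rather than anyone's real environment, and it only understands plain `name==version` lines:

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` output (name==version lines only)."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            versions[name.lower()] = version
    return versions

def diff_freeze(old_text, new_text):
    """Return {package: (old_version, new_version)} for every package that differs."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }

# Placeholder listings; paste the real output of each environment instead.
old_env = "pandas==2.1.4\npyarrow==15.0.0\npolars==0.20.5\n"
new_env = "pandas==2.2.0\npyarrow==15.0.0\n"
changes = diff_freeze(old_env, new_env)
```

a `None` on either side means the package is missing from that environment.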
I'm in (venv) (base) (venv) hell
but I think they were on the same version (15)
pandas version was off by a minor version
still doesn't work, so something is strange
Dev containers ftw
# ensure conversion to pandas uses the pyarrow extension array option
# so that we can make use of the sql/db export *without* copying data
res: int | None = self.to_pandas(
use_pyarrow_extension_array=True,
).to_sql(
name=unpacked_table_name,
schema=db_schema,
con=engine_sa,
if_exists=if_table_exists,
index=False,
)
return -1 if res is None else res
else:
msg = f"engine {engine!r} is not supported"
raise ValueError(msg)
do you understand what this comment means
it's in the polars source code
I just set it to false screw it
works now
i don't care if the write is expensive
I just need the read to be fast
I don't know anything about pyarrow
What about it?
why would it copy data otherwise
That’s basically a key point of pyarrow; that pyarrow tables can be referenced by multiple engines without copying any data… they can even be passed between runtimes.
But pandas also supports numpy data types, which is completely different than pyarrow data types and is the default
whAT
so I came to the conclusion
that I can't do react native ;-;
it's better to stick with python for both frontend and backend
uh if you remember my goal from earlier today
can you please give me advice on how to go about it?
Don't ask if someone will do something based on information that you haven't yet provided. Give all the information, and invite anyone to help.
?
!
‽
The interrobang, also known as the interabang ‽ (often represented by any of ?!, !?, ?!?, ?!!, !?? or !?!), is an unconventional punctuation mark intended to combine the functions of the question mark (also known as the interrogative point) and the exclamation mark (also known in the jargon of printers and programmers as a "bang"). The glyph i...
OO
wow
u can combine them.
crazy 😭
how doooo I dooooooo thissssss
its like im drowning in confusion xd
put yourself in the shoes of the person who's helping you. what do they need to know to start helping you?
everything
let me restate: what would that person need to know about what you're doing to start helping you?
because I don't have access to your computer and idk what you've been doing since the last time we talked.
crying
I feel that
;-;-;-;
well, in the end all I want to do is take a dataset like MotionSense dataset and train a model with it and somehow implement it in a way that the device using the app(phone) uses its accelerometer and gyroscope and etc. along with that model and like it displays in console the action happening
like
idk:
WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
stel?
k im not going to ping
bro when u see this can you please ping me
thx
I'm busy irl. I'll get to this if I can. but someone else might be able to help as well.
np it can wait
thanks
gtg anyways cya :)
you need to know what all the sensory inputs are that your model is using. and then figure out how to request those readings on Android and iOS.
but that's beyond the scope of this channel.
good morning , for my project on Large language model , i need to know any good demo ?
what are you trying to do
is there anything "ready made" that allows efficient mapping of a set of column values into key-values? I would like to benchmark doing that outside of polars. right now I built a expr chain class that allows me to "cheat" around the problem by coercing sets of columns into structs, and then selecting these alone and converting the result to a dict
# pandas DataFrame
df[['a', 'b']].set_index('a').squeeze().to_dict()
lemme check
lemme rephrase the question
imagine we already have a dict of column names -> column values, I have also written my own expr builder so I can (if i want to load the polars parsing heavily...) coerce the desired columns into a struct for a new column. suppose I want FOOBAR to be a composite of X,Y,Z columns. right now i do this via the expr chain i build. how can I "displace" that specific step into something done after iter_rows?
can you provide me a sample input for the df?
from ast import Dict
import polars as pl
from pprint import pprint as pp
from typing import Dict
def expr_concat_columns_unique(new_column_name, columns):
return pl.concat_list(
[
pl.when(
pl.col(column).is_not_null()
).then(
pl.col(column)
).otherwise(
pl.lit(None)
).alias(column)
for column in columns
]
).list.drop_nulls().list.unique().alias(new_column_name)
def expr_structured_column_with_mappings(name: str, mapping: Dict) -> pl.Expr:
return pl.concat_list(
pl.struct(
{
new_key: pl.col(original_column) for new_key, original_column in mapping.items()
}
).struct.rename_fields(
list(mapping.values())
)).list.drop_nulls().list.unique().alias(name)
struct_mapping = {
"foobar": {
'foo' : "from_foo",
'bar' : "from_bar"
},
}
df = pl.DataFrame({
'foo': ['xyz', 'blah' ],
'bar': ['zyx', 'bleh' ],
})
exprs = []
exprs += expr_structured_column_with_mappings("foobars", struct_mapping['foobar'])
new_df = df.with_columns(exprs)
print(new_df)
this is how im doing it now
but it puts a significant strain on polars' engine, which handles it fine, but it does have quite some ram backpressure
@agile owl dude it's too early to start trolling
relax
;P
ram backpressure = the scan_csv op no longer seems larger-than-ram friendly
so in other words, consumption shoots up. at least in virt addr space if you are pedantic.
didnt measure actual effective occupied memory...
(good luck using something like valgrind while processing a 10mil row csv file)
I wanted to try to help you but I don't understand what your problem is sorry. what do you mean by done after iter_rows? btw polars is meant to be used with the Lazy API most of the time that's where it gets its optimization benefits from but it looks like you're just using a normal eager dataframe
so, im trying to validate and coerce as much data as possible into the actual ingestion schema for elastic
i made an expr "compiler" that takes my configuration (tl;dr "create these key-value mappings from the CSV rows, apply some transforms, inject some static values") and I apply it to the lazyframe returned from scan_csv
taking a CSV row with a set of columns i need a final dict/document for elastic as { 'somekey': { fields: ...mapped values from CSV row/dict }, 'anotherkey' .... )
as an experiment i built all that using exprs, but polars is then forced to allocate new data for every row in the dataset
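one way to move the struct-building step out of the expression chain is to do it per row in plain Python, on the dicts that something like `iter_rows(named=True)` would yield. this is only a sketch under assumptions: the helper name, the orientation of the mapping (new field name -> source column), and the `static_fields` argument are invented for illustration, and whether it actually beats the expr approach would need benchmarking:

```python
def build_document(row, struct_mapping, static_fields=None):
    """Turn one flat row dict into a nested document.

    struct_mapping: {target_key: {new_field_name: source_column}}
    Source values that are None are skipped, so null fields never reach the output.
    """
    doc = dict(static_fields or {})
    for target_key, fields in struct_mapping.items():
        nested = {
            new_name: row[column]
            for new_name, column in fields.items()
            if row.get(column) is not None
        }
        if nested:  # drop the whole sub-document if every field was null
            doc[target_key] = nested
    return doc

# Hypothetical mapping and rows, shaped like the example frame in this thread.
struct_mapping = {"foobar": {"from_foo": "foo", "from_bar": "bar"}}
rows = [{"foo": "xyz", "bar": "zyx"}, {"foo": "blah", "bar": None}]
docs = [build_document(r, struct_mapping, {"source": "csv"}) for r in rows]
```

because each document is built and can be shipped one at a time, nothing forces a new column to be materialized for the whole dataset.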
Overfitting is nonexistent
I've noticed that single head single layer transformer works better
Due to performance mainly
I might have to look into my self attention implementation
And use a non learnable positional encoding to reduce the number of gradients per mini batch index
The size of the embeddings seems to matter quite a lot, more than the number of heads or number of blocks
This at small scales with little compute, I'm sure the story is different if you can do gradient accumulation across several GPUS
Of 80Gb mem each
Changing regions is time consuming, and the current one only provides this 16gb machine
I can do federated training to get to a 32gb GPU, but at that scale might as well just tank the extra iteration on my training loop instead of having to collect results through a network
anyone has recommendations for handling parquet io (writing) from multiple Process(es)?
im using polars already so perhaps I can just output to parquet as is
there is a pandas .to_parquet() method, is there a polars equivalent? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html
is there a handy replacement for ffill() in polars
forward_fill
"DataFrame" has no attribute "forward_fill"
hmm
test_df = (
pl.read_csv("data/master.csv", try_parse_dates=True)
.with_columns(pl.col("date").cast(pl.Date).alias("date"))
.drop("")
.sort("date", "ticker")
.forward_fill() # <---- typechecker/intellisense doesn't pick up method
)
it's an expression
Well, you do it on an expression
So you forward fill on a date for instance
so I can't just do it across all columns
I don't know by heart but I think selectors are expressions so you could try cs.all().forward_fill()
gotcha
test_df = (
pl.read_csv("data/master.csv", try_parse_dates=True)
.with_columns(pl.col("date").cast(pl.Date).alias("date"))
.drop("")
.sort("date", "ticker")
)
test_df = test_df.select(cs.all().forward_fill())
so like this?
seems to be working
┌─────────────────────┬────────┬────────┬─────────┬───┬─────────────────┬─────────────────┬──────────┬───────────┐
│ date ┆ ticker ┆ open ┆ high ┆ … ┆ IRLTLT01JPM156N ┆ IRLTLT01GBM156N ┆ WTISPLC ┆ DEXCAUS │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════════╪════════╪════════╪═════════╪═══╪═════════════════╪═════════════════╪══════════╪═══════════╡
│ 2021-03-01 00:00:00 ┆ ABBV ┆ 108.53 ┆ 109.21 ┆ … ┆ 1.684268 ┆ 1.029709 ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ ACB ┆ 10.84 ┆ 11.41 ┆ … ┆ 1.684268 ┆ 1.029709 ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ ALKS ┆ 19.2 ┆ 19.605 ┆ … ┆ 1.684268 ┆ 1.029709 ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ AMGN ┆ 225.88 ┆ 227.929 ┆ … ┆ 1.684268 ┆ 1.029709 ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ AMPH ┆ 17.79 ┆ 18.03 ┆ … ┆ 1.684268 ┆ 1.029709 ┆ 1.226966 ┆ -1.790026 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
└─────────────────────┴────────┴────────┴─────────┴───┴─────────────────┴─────────────────┴──────────┴───────────┘
shape: (10_320, 165)
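one thing to double-check with a frame sorted by date and then ticker: a plain column-wise forward fill carries the value from the row above, which may belong to a different ticker. the pure-Python sketch below shows the per-group behaviour you might want instead. the helper and the toy rows are made up for illustration (two prices are borrowed from the table above, the nulls and the second date are invented). in polars this is usually written as a window expression, something like `.forward_fill().over("ticker")`, but check the docs for your version:

```python
def forward_fill_by_group(rows, group_key, columns):
    """Forward-fill None values in `columns`, carrying values only within one group.

    rows: list of dicts, already in the desired order. Returns new dicts.
    """
    last_seen = {}  # group value -> {column: last non-null value}
    filled = []
    for row in rows:
        carried = last_seen.setdefault(row[group_key], {})
        new_row = dict(row)
        for col in columns:
            if new_row[col] is None:
                new_row[col] = carried.get(col)  # stays None if nothing seen yet
            else:
                carried[col] = new_row[col]
        filled.append(new_row)
    return filled

# Toy rows sorted by date, then ticker.
rows = [
    {"date": "2021-03-01", "ticker": "ABBV", "open": 108.53},
    {"date": "2021-03-01", "ticker": "ACB", "open": None},
    {"date": "2021-03-02", "ticker": "ABBV", "open": None},
    {"date": "2021-03-02", "ticker": "ACB", "open": 10.84},
]
result = forward_fill_by_group(rows, "ticker", ["open"])
```

here the first ACB row stays null instead of inheriting ABBV's price, which is what a fill straight down the column would have given it.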
Here's something super annoying: it uses pandas.to_sql and the conversion to pandas converts my date column to datetime microseconds so I can't get the right date I want without manually editing the db column
I might as well just write the insert myself I guess
Have you tried using the ADBC engine instead of SQLAlchemy?
@versed pilot there seems to be a parquet sink
Hello, can someone tell me which course i should pursue to get into ai/ml or any data job, in india or outside?
because getting in industry with this field i think is tough
I started with Andrew Ng's course and even after only the first regression videos, I feel smarter already 🤣
Well i know it will need learning and some project experiments, but can you help me with job assistance, like what approach should i take?
any suggestions please?
ok I haven't used polars, I understand that it is different from Pandas both in syntax and underlying technology
yes
ill let you know how that works out
now trying to solve an issue
i cant seem to be able to filter null columns
Greetings guys
Hope all are well. Anyone here who is familiar with iTensor library in Julia?
Kindly let me know.
No
Looks to me like you got it...
Right side is the smoothed thing
Oh
I know I'm narrowing down
So like, if I let these go for like 3 days they'll converge, almost certainly without over fitting, been keeping an eye on eval loss, eval acc, eval f1 etc
Those graphs are different runs, leftmost graphs are larger batches
They run for an epoch each
So, smaller batches converge faster but their loss graph is extremely chaotic
I don't care for their loss graph if at the end I still get a model I can put into production
My question is then, do I care for the chaotic graph ?
The loss graph itself looks fine as far as I can tell, that's what a loss graph of a transformer training on text looks like
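On the chaotic small-batch curves: a common way to read them is to smooth the logged values before comparing runs. Here is a minimal sketch using an exponential moving average. The function name, the `alpha` value, and the loss numbers are all made up for illustration, and this is not necessarily the smoothing the plots above use:

```python
def ema_smooth(values, alpha=0.1):
    """Exponential moving average: each point blends the new value with the running average."""
    smoothed = []
    current = None
    for v in values:
        # The first point seeds the average; later points move it by a fraction alpha.
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

# Made-up noisy per-step losses, as a small-batch run might log them.
noisy_loss = [2.0, 1.2, 2.4, 1.0, 1.9, 0.8, 1.6, 0.7]
smooth_loss = ema_smooth(noisy_loss, alpha=0.3)
```

A smaller `alpha` gives a flatter curve that reacts more slowly, so the trend is easier to see but recent changes show up later.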
do you need to understand data analytics to do data science
who's a data scientist that can guide me
I need some help with a classification algorithm for sentiment of sentences. If anyone is able to help, please check my post in #1035199133436354600 for more information about it
Good day everyone
I have a problem with learning. I spend so much time on the learning process, but I don't get results, and in some cases I even get panic attacks.
Another problem is that I can't build a routine for myself. So it occurred to me to ask you: how do you study?
At which hour do you start and what time do you finish? Can you please give me some advice on how to learn? Or how to manage my time?
I'm not very experienced in this field of study, however when it comes to learning, I try to break down what I need to learn. By breaking something down, it becomes more feasible to learn each part.
For example: I am currently learning to improve my ability to explore data by not only attempting to grasp the essence of EDA, but also the techniques used within EDA. Data cleaning can be used in conjunction with the EDA process. And because duplicates must be cleaned depending on the context, I must first learn how to identify these duplicates. So I ask myself, "what graphs can I use to identify these duplicates?" I then research that, and once I find a graph that seems efficient to me, I learn how to actually create that graph using a tool. I personally use Python for this, so I would then read the matplotlib documentation on that graph.
In terms of motivation for learning, I make time out of the day: about 1-2 hours of studying and practice before going to bed. I gain this motivation by reading articles on data science, watching YouTube videos on data science, or even talking about data science. These methods really get me motivated to learn.
I also recommend referring to a resource fitting to you as long as it is trusted. This Discord server is also great when it comes to help and resources
My favorite EDA resource is: https://www.itl.nist.gov/div898/handbook/
It’s long but very thorough and no single part too difficult
Thanks, I will take a look at it
Thank you.
What are your study hours?
I mean when do you start and finish?
I usually begin around 8:00 pm all the way to somewhere around 10:00 pm. I usually do not go to sleep that late though. You should choose a time comfortable to you and fits into your schedule
I think ‘time’ is a depressing measure for studying. Instead, consider making a list of topics and making a point to cross off one topic a day. Then you can feel measurably accomplished
Switching social media off may help? Big distraction for me.
need to show some demo , text generation based examples should be fine
hello, anybody knows any good datasets libraries?