#data-science-and-ml
I mean, it depends on what data you have
I'm always trying to add more and rarer data
the harder it is to process the more differentiating it should be
You take a slice of the entire internet, feed it to a super computer running GPT and pray
so things like NLP add a lot of information
I don't think that would work because there's too much noise
it would overfit hardcore
you need to curate the data it's getting I think
Uhm
My worry would be the opposite actually
That the model wouldn't fit at all
I'm talking like
In 1h, take all new indexed stuff by google
And output a prediction, the training data for that is huge
I think you'll get a lot of noise in that
I think you could fit something but it wouldn't be good out of sample
that's why I'm focusing on things like the SEC filings
even financial news is full of noise compared to SEC filings
My idea of overfitting is when the model has a lot of capacity so it memorizes intricate details of the data, like noise
every time a company discloses important information it has to put it through the SEC
and there's a live API for that too
ok so let's conceptualize it a little more
given the slice of data, it produces some score?
or you just feed it directly to a reinforcement learner or something
Select like, N stock markets
what are the values of Y
Y[I] = value for the ith stock market
and what is that value, the price, the return?
No way noise will correlate with that, I think
It's the future value
Like in the next hour, maybe day is better
Idk, maybe hour
I think tracking topics for a set of predefined stocks in their SEC filing is probably more fruitful
in either case whether that method overfits or doesn't fit, I don't think it would learn much useful
If feeding the internet to gpt creates gpt4
the SEC filings have certain topics in the management discussion section that they are legally obligated to discuss if they are important
I reckon something useful can be distilled to stock markets
I think at minimum you need to extract those topics
you can then query for them in a larger dataset
I'm thinking in terms of what has been the trend right
Like CNNs replace feature engineering
Let the model filter out
it would take an extremely long time to learn to filter the right information wouldn't it
vs. giving it topics as an input
Yeah a couple billion dollars or so
I mean, if I could really do it I wouldn't be telling it on discord
I'm a big believer in getting the topics from the SEC filings because most stock discussion on the internet is just memes
I'd be doing it, but I'm not a billionaire, and if I were, what would be the incentive anyway
I am willing to discuss what I'm thinking about doing because I think it's very hard to do and people would only do it if they were really interested in doing it which I don't think anyone would be because they probably won't believe in my ideas as much as I do anyway
and if I prove it works I'll just shut up about it
I just like talking about this stuff, it helps the mind review stuff and from time to time you always learn something new just by casually chatting
here's a neat usage of docker
the containers share an isolated network too
that's docker compose right
it's github actions
no, it's a ci/cd thing
you're specifying the container images in there though
but I'm not using it as cicd, I'm using it to run training loops
the images are there, but the network is implicit
I see
so like, if I now decide that it should be python 3.7 instead of 3.11, it's a very trivial change
or maybe I want to use 3.6 here but in the next job 3.11
I don't know how I'd do this without docker
I'm not, but i could
I could create a matrix so that it uses every version of python in N separate parallel jobs
Using the newest version of Flask-SQLAlchemy, how do I update a search query?
Here is an example of what I am using.
search_results = Posts.query.filter(Posts.content.like('%' + post_searched_form + '%')).order_by(Posts.title).all()
Hey! Everyone….Can someone help me to suggest roadmap for AI?
Hey folks, I am working on a Loan Default Prediction project (a classification problem), the problem is I don't have a target column and when I asked my instructor he said that we have to estimate first using Random Forest Regressor. How to estimate who has defaulted on loan using regression?
He said once you can get that after that it is a simple classification problem
Question is too general, depends on how much experience you have, what kind, how much math you know, how much ML you know, etc
But the consensus I've seen is that MLE is not an entry level position, so you need to get XP in software first
Why do you ask? Where are you in your journey? Context matters. And #career-advice is probably a better channel to ask.
Hi! I need help with vector databases.
I am developing a program for comparing the similarities between the skills in a job description and multiple other resumes. I need to store the embeddings of the skills in the job description and find the most similar skill in the resume to it with its distance. However, when I create a vectordb with job description skill vectors inside and do a similarity search with skills in a resume, I get the most similar skills inside the job description. Putting the skills of the resume inside and querying with the job description skills solves my problem but I don't think it is efficient. I also tried not using a vectordb and saving the embeddings as numpy arrays on the disk but I am not sure whether it is a good practice. What is the best method to solve this?
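For a handful of skills per document, a vector DB may be more than is needed. A minimal sketch with plain NumPy, assuming the embeddings are already computed as 2-D arrays (the toy vectors and the `best_matches` helper are made up for illustration): it compares every job-description skill against every resume skill at once and keeps the closest resume skill with its cosine distance.

```python
import numpy as np

def best_matches(jd_vecs, resume_vecs):
    """For each job-description skill, return the index of the closest
    resume skill and the cosine distance to it."""
    # Normalise rows so that a dot product equals cosine similarity
    jd = jd_vecs / np.linalg.norm(jd_vecs, axis=1, keepdims=True)
    rs = resume_vecs / np.linalg.norm(resume_vecs, axis=1, keepdims=True)
    sims = jd @ rs.T                      # shape: (n_jd_skills, n_resume_skills)
    idx = sims.argmax(axis=1)             # best resume skill per JD skill
    dist = 1.0 - sims[np.arange(len(jd)), idx]
    return idx, dist

# Toy 3-dimensional "embeddings" (hypothetical values)
jd_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
resume_vecs = np.array([[0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
idx, dist = best_matches(jd_vecs, resume_vecs)
print(idx, dist)
```

The job-description vectors stay the query side here, so nothing has to be re-indexed per resume; saving the arrays with `np.save` is a reasonable choice at this scale.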
Hi all, I have a more general but very related question: has anyone here ever tried to form an AI/ML study group of similar level peers? Be it in the same steps in the learning journey, similar domains of interest, similar goals, etc? What are or were the pros and cons of said study group, what worked, what didn't, why did it fall apart?
hey is there a good way to interpret a pdf of mixed text and table data using LLMs?
(if this is too vague a question, that's a good answer too)
you need to extract the text from the PDF. are you trying to summarize the content, or something?
honestly i need the data more than the text, but like ideally the surrounding text would contextualize the data
(extracting the data with more straightforward pdf parsing wasn't working)
LLMs are for natural language. not tabular data.
iuno man reading is reading
It isn't, though.
(I am a computational linguist and work with LLMs pretty much all day every day.)
probably tesseract.
in particular, LLMs can't do math. If it appears that they can do math, that's a separate capability that isn't actually part of the LLM.
i dont need them to do math!
i need them to understand how text is laid out on a page primarily
last I checked gpt4 was really bad at physics, it can spit out facts but it will trip on several logical inconsistencies that it can't get out of, simple stuff like contradictory definitions
LLMs can't do that.
text goes in, text comes out
and we're talking about raw text--strings. without any awareness of where it was on a page.
depends on how you parse the pdf i guess?
No.
like im assuming u know how chaotic pdfs are on the backend
Yes. But the LLM can't help you with that. the LLM has to receive clean text as a raw string.
heard ... ok so this is the deal
there is a table with this data in every pdf ... but it never looks the same, is in the same place, or even using the same exact terminology
im trying to make something that can look at a 100 page document, find the table that most resembles this and tell me, like, how much was budgeted for the City Clerk in 2019
i reached my limit with pdfplumber and more straightforward approaches
An LLM cannot help you with this.
ok can something else
I'm not sure.
what about using an LLM just to find the page the data is on
that would make sense right
No
I don't have time to get into it, unfortunately
ok
if you can somehow serialize every row of each table as a sentence in natural language, I suppose an LLM could help with this. But there might not be a way to know what the serialization scheme should be for any arbitrary table.
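A rough sketch of what "serialize every row as a sentence" could look like, assuming the table has already been extracted into a header plus rows (the hard part, and not solved here). The `rows_to_sentences` helper, the sentence template, and the budget figures are all made up for illustration.

```python
def rows_to_sentences(header, rows, row_label):
    """Turn each table cell into one plain-language sentence.

    header    -- list of column names
    rows      -- list of rows, each a list of cell strings
    row_label -- name of the column that identifies the row
    """
    sentences = []
    for row in rows:
        record = dict(zip(header, row))
        subject = record.pop(row_label)
        for column, value in record.items():
            sentences.append(f"The {column} value for {subject} is {value}.")
    return sentences

header = ["Department", "2018", "2019"]
rows = [["City Clerk", "1,200,000", "1,250,000"]]  # hypothetical numbers
sentences = rows_to_sentences(header, rows, row_label="Department")
for s in sentences:
    print(s)
```

As noted above, one fixed template like this only works when the table layout is known; an arbitrary table may need a different scheme.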
hello i have a general question about anomaly detection, would it generally be better to look at aggregated data or raw data?
You want to know which items in your data are anomalies. If you aggregate the data in some lossy way (like taking averages), you're no longer looking at individual items.
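A toy illustration of that point, with made-up readings and an arbitrary z-score threshold of 3: the raw data makes the odd item easy to find, while the average only shows that something shifted.

```python
import numpy as np

# 24 hypothetical readings, one of which is clearly off
readings = np.full(24, 10.0)
readings[17] = 100.0

# On raw data, a simple z-score points at the exact item
z = (readings - readings.mean()) / readings.std()
raw_flags = np.flatnonzero(np.abs(z) > 3)
print(raw_flags)      # index of the anomalous reading

# After aggregating, only one number is left; which item was odd is lost
daily_mean = readings.mean()
print(daily_mean)
```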
thanks very much @serene scaffold
is there a way to continuously improve (some sort of online learning) unsupervised anomaly detection models like Isolation Forest?
or is it really just a game of tweaking contamination and retraining on different data sets
Hi everyone, has anyone dealt with text preprocessing for medical notes? I am looking to improve the accuracy of the model. Thanks in advance.
cool! how did you land a job like this? and, did you have to get your masters beforehand?
I didn't get a masters. But I got really lucky. And you can't plan for luck. If you want to be a computational linguist, you should probably get a bachelors in computer science with a linguistics minor, and then get a masters in computer science
And with the way things are going, I have no idea what hiring in this space will look like in six years.
being some type of AI engineer or data engineer has always interested me. i’m going to be finishing my bachelors in Computer Science in about 3 months
With a minor in Mathematics
that’s fair
If you didn't completely max out every opportunity to learn about and apply machine learning as an undergrad, you should probably be looking at masters programs.
(that's one of the things I had to do. And also luck.)
@serene scaffold what would you propose to tune a language model to SEC filings to extract topics from the management discussion and then track sentiment for each of them in future documents until it is no longer present in the documents
idk
yea, most of the stuff i know about AI / LLM is purely because of my interests, my university doesn’t offer much with AI sadly
dang
thanks for your insights
Just make sure you're looking into topic detection and not topic modeling
Except maybe some people treat those as the same thing
Fuck
lol
Trying to explain anything in ML/data science/AI is always a "X.... BUT...."
Especially the relationship between different parts (the classic "how are AI and ML related?").
Some day I want to become the supreme nomenclature authority and fix this.
I want to ban all buzzwords
when people say AI/ML they need to fill it in with the actual thing they're talking about or pay a fine
I'm fine with those. It's "data science" that I hate.
"science" - add this to the end of everything
Gotta get my dance science degree.
why do you hate it
I meant when people say AI/ML as one thing btw which is often the case
I think AI is the worst term
if I had to pick one
Because the science of data is statistics. And statistics doesn't become a fundamentally new thing when you add code.
data science including statistics and non-statistical ML methods tho
statistics is part of data science but there's also the stuff that diverges from model-based statistics
that's how I understand it anyway
whereas AI has never meant anything meaningful
they're gonna have to define what intelligence means before artificial intelligence can mean anything lol
AI is a field awaiting its own definition but everyone is asynchronously running with it like we know what intelligence is
I am having difficulty understanding this..what does it mean
@lofty thorn can you at least make it rightside up
?
Which part are you asking about? The red cloud part?
graphs in statistics
You're used to thinking of "graphs" as data visualizations, right?
yes..
Like, bar "graphs"
Forget that.
Graph no longer means that
All of those are now called plots
Bar plot. Line plot.
Yes. You must now accept the computer science definition of graph
And never use "graph" to refer to data visualizations for the rest of your life.
okay senior
You will now be annoyed whenever you hear normies refer to data visualizations as graphs
Anyway
Did you have any questions about what graphs are--the things with nodes and edges?
i haven't started yet..i definitely create doubts later on..as the book i am reading is completely new
A node is a "thing"
And an edge is a line between two nodes
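In code, a graph like that is often stored as an adjacency list. A tiny sketch with made-up node names, assuming an undirected graph (each edge is recorded in both directions):

```python
# Each pair is one edge between two nodes
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]

# Adjacency list: node -> set of neighbouring nodes
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

print(sorted(graph["C"]))  # the nodes that share an edge with C
```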
Yo guys are there any free cloud services on which I can deploy my ml model?
MEGA
i am having difficulty understanding terminologies
all i get is...
Pandas library has rectangular data structure...known as dataframe
Hey all,
I'm terribly new to ML/CV and looking for guidance with OpenCV. I have a screenshot of a web page. I need to OCR it. I'm looking to prepare it for tesseract by getting rid of reverse contrast parts (white on black) and everything other than text.
What I'm having an issue with is understanding masks. What's the correct way to select non-white background and invert just that?
For instance, how can I convert "Search" button to just black on white text "Search"?
I can find the color by inRange, but how can I determine if it's a "background"? Is there some sort of filter by size?
...Or should I take it in three steps:
- Threshold, Get all black letters, save1
- Inverse, Threshold, get all black letters, save2
- Join save1 and save2?
🤔
Thanks in advance!
hey
I am trying to use a lip reading model to test on my system but I cannot train it
can anyone help me with the steps
Why cannot you train it?
Cannot see this on mobile. Could you copy and paste
I took a model and a similar json file, and tried a second model too; neither works
nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\"
At line:1 char:29
+ set CUDA_VISISBLE_DEVICES=0 & python main.py --modality video \ --con ...
+ ~
The ampersand (&) character is not allowed. The & operator is reserved for future use; wrap an ampersand in double quotation marks ("&") to pass it as part of a string.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
I was using & just because I found it works on stackoverflow for some users but even without it im getting errors
nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\"
Set-Variable : A positional parameter cannot be found that accepts argument 'main.py'.
At line:1 char:1
+ set CUDA_VISISBLE_DEVICES=0 python3 main.py --modality video \ --con ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [Set-Variable], ParameterBindingException
+ FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SetVariableCommand
Curious whether anyone has worked with HNSW indexes for vector databases. Trying to make my queries a little faster
You'll probably need to split set CUDA_VISISBLE_DEVICES=0 and the next command into two separate invocations
Maybe you can also try removing it, see if CMD allows that.
It's been a while but I think & is not valid for CMD.
Honestly, if I were you I'd rather use WSL2 (Ubuntu in Windows).
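One shell-independent option is to set the variable from Python and launch the script as a child process. Two things to note: in PowerShell `set` is an alias for `Set-Variable` (which is what the second error is complaining about), and the variable CUDA reads is spelled `CUDA_VISIBLE_DEVICES`, so the pasted command has a typo in the name. A minimal sketch; `gpu_env` and `launch` are hypothetical helper names, and `main.py --modality video` is taken from the command above.

```python
import os
import subprocess
import sys

def gpu_env(gpu_ids="0"):
    """Copy the current environment and restrict the visible GPUs."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    return env

def launch(script_args, gpu_ids="0"):
    """Run a Python script in a child process with the GPU restriction set."""
    return subprocess.run(
        [sys.executable, *script_args],
        env=gpu_env(gpu_ids),
        check=True,  # raise if the script exits with an error
    )

# Example (not executed here): launch(["main.py", "--modality", "video"])
```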
Hello! I posted about a project I am making. I would really appreciate it if you give it a read ! #1204364714449174600
Any NLP experts here?
In memory index could be faster I guess than on disk index
Lower search ef as well but at precision cost...
Construction ef and m similar probably
Not sure what you mean
What's your question. It's often easier to answer the question than to judge ones expertise level 🙂
HNSW indices can be stored on disk or in memory. In memory should be faster
Hi, is there any sort of roadmap of courses for learning ai? From learning to code to AI specialisation.
Practical Deep Learning for Coders - Part 2 (can skip part 1 and maybe watch after part 2 its just about fastai library)
he starts from python basics
in part 2 for some reason but yeah
I am thinking of creating my discord bot with a drawing AI. Which good free drawing AI with an API would you recommend?
Final steps of the new pipeline, celery task and everything is working, it also runs faster now
Be sure to never ask to ask--always ask your actual question.
Code
import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
X = []
Y = []
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
When I run this code I get Warnings and Messages in the script like this:
WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2024-02-06 21:18:22.817242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\optimizers_init_.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
How do I stop/disable these warnings?
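A sketch of one common way to quiet these, under two assumptions: `TF_CPP_MIN_LOG_LEVEL` controls the C++ log lines (the `cpu_feature_guard` one), and the `WARNING:tensorflow:` lines go through the standard Python logger named `tensorflow`. Both settings should be in place before TensorFlow is imported; whether every message disappears may depend on the TensorFlow version.

```python
import logging
import os

# C++ side: 0 = everything, 1 = hide INFO, 2 = hide INFO and WARNING
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# Python side: only let errors through on the "tensorflow" logger
logging.getLogger("tensorflow").setLevel(logging.ERROR)

# import tensorflow  # import only after the two settings above
```

If some warnings still show up, calling `tf.get_logger().setLevel("ERROR")` right after the import is another thing to try.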
Context window is too large, I'm stuck to batch size of 16 for now
Gonna have to curate the dataset to reduce the padding
But first I'm gonna finish this
I reckon I'll get some good insights even if I'm constrained in the hyper parameter space
The pyspark + redis setup, uhm, *chef's kiss*
So glad I discovered pyspark
But I can also just slice the array in the celery process, that way I don't need to redo the data, I can remove data points that get cut, plenty of in between each slice training
thanks a lot,
learning resources please also
what exactly is differentiating the roles
a) data analyst (a guy who does data analysis)
b) data scientist
c) data engineer
and what's this data visualization and how is it connected to AI ML
what about opencv?
and what's the core difference in all this
can u write the same for this too?
i work in web using python
i wanna learn this domain too, seems like you're pretty active here would love to follow through as you say @serene scaffold
Numpy is amazing, just a well rounded, well made, performant solution that works and is intuitive
I'm about to find out if the stuff I put together is gonna fit right away or not
I wouldn't mind not having to debug stuff
Forgot to build an image 😭
Aight, it's gonna do it now
yay
I don't even care it takes a lot of time, 1 dollar gives me like 12 hours of GPU time
This stuff doesn't make much sense on paper, you need to read a csv of data that you are familiar with into a pandas dataframe, look at the dataframe and you'll see the automatic index. then do .groupby(['column1','column2']).sum() and you'll see what a multilevel index looks like
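The same experiment with a small made-up dataframe instead of a CSV (the column names match the ones above; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})
print(df.index)       # the automatic index: a RangeIndex from 0 to 3

grouped = df.groupby(["column1", "column2"]).sum()
print(grouped)        # rows are now labelled by (column1, column2) pairs
print(grouped.index)  # a MultiIndex with two levels

# A row is selected with a tuple, one entry per level
print(grouped.loc[("b", "x"), "value"])
```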
@final kiln what's the current project? still working on transformer things?
curious how far you got with the metric tensor thing
Yes, still training it on sentiment analysis.
For data science experts, does the standard deviation in training have to be the same as in testing? like is it an absolute requirement in order to accurately evaluate model performance?
The way I've been doing is, I look at the training metrics to see if the model is learning. I look at the eval/test metrics to see if the model is/has generalized. I don't really care about the values themselves, as long as both are always improving.
If one is improving and the other is not, something's up
I mean I do care about the rate of improvement and the final performance, their final values matter, but during training I try not to read too much into it
Dear God cloud watch documentation is so bad I wanna cry rn
An entire readme with no mention on how to run it
Did you try the quick start documentation?
Yeah it's not good imo, I gave up on it, this thing is either inside ec2 with an assigned role or it won't work
I decided to move MLFlow to the GitHub runner anyway, it's hosted in ec2 so it has been working there
I'm gonna deploy MLFlow UI on my local wifi, keep the logger in the runner and that's it ig
This way I don't have to worry about exposing this thing to the internet and latency is zero since they're on the same host
Hi, I am making an AI assistant and in that I want to add basic vectorizer from nltk. I gave the AI a set of data of patterns and responses. I then tried to speak something but I get no reply back. Meaning I do not get any reply which is in the responses part of the code. I copy pasted the nltk code in a new .py file without any functions or classes which I had in my main file. Then when I tried speaking, I got some random responses. I know that I have to train it but now my question is that. How do I make the AI get self trained.
Essentially each job here will have its own local MLFlow that talks to AWS managed databases and stores. I essentially DDOS'd myself yesterday
I ran two of these workflows at the same time, each job runs sequentially, so I only had two jobs, one from each workflow
Two jobs was enough to halt the server
Me clicking around in the UI didn't help ig
This way everyone gets his own thing, including me
I was gonna deploy compose with traefik and several MLFlow processes on the server, but why have a potential running cost on the instance if everything can be easily distributed like this
Need advice,
I am working on a CNN-LSTM and my model needs to be trained for classification as well as forecasting.
forecasting needs the last n data points for 1 forecast but
classification just needs 1 data point for 1 classification.
Can I train the CNN and LSTM combined for this?
why do you want to combine a CNN and LSTM
Most models have a set of assumptions they make and train and test coming from the same distribution is one of them for many.
Can you be a bit more specific on the architecture you're working on?
Combining them can make sense for forecasting, a (T)CNN encoder coupled with a RNN decoder
can anyone help me figure out why i'm not getting a response from openai https://paste.pythondiscord.com/FVHQ
i'm not getting any errors and i have credits in my account
for spatial and temporal analysis of input
will I have to separate out the training?
no means of simultaneous training?
the first sign of convergence + generalization
model hasn't seen new data til step 1500
You're not really giving enough coherent information for us to help you 😄
hi, please read following, and let me know about any other detail that you need:
Need advice,
I am working on a CNN-LSTM and my model needs to be trained for classification as well as forecasting.
forecasting needs the last n data points for 1 forecast but
classification just needs 1 data point for 1 classification.
Can I train the CNN and LSTM combined for this?
to elaborate: given A samples, I want to predict A classes, one for each of them, plus I also want to forecast considering more than one sample at a time
You mean as input?
The classification case uses just T=t and the regression case uses [t-1, t-2, ..., t-n]
yeah in regression it also uses t also
exactly, i think you have perfect view now
of my problem
The standard deviation of what exactly?
Theoretically you could make a model that makes 2 predictions at every T=t, one for regression and one for classification, and you only use the n'th one to calculate the loss on the side of regression
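A framework-free sketch of that loss in NumPy, reading "only use the n'th one" as "only score the regression head on steps that already have n earlier points". The `multitask_loss` name, the equal weighting of the two terms, and the toy numbers are all assumptions for illustration.

```python
import numpy as np

def multitask_loss(class_probs, class_labels, reg_preds, reg_targets, n):
    """Cross-entropy on every step plus MSE only on steps with a full window."""
    T = len(class_labels)
    # Classification head: scored at every time step
    ce = -np.log(class_probs[np.arange(T), class_labels]).mean()
    # Regression head: ignore the first n steps, where the window is incomplete
    valid = np.arange(T) >= n
    mse = ((reg_preds[valid] - reg_targets[valid]) ** 2).mean()
    return ce + mse

class_probs = np.full((4, 2), 0.5)           # 4 steps, 2 classes
class_labels = np.array([0, 1, 0, 1])
reg_preds = np.array([9.0, 9.0, 1.0, 2.0])   # first two values get ignored
reg_targets = np.array([0.0, 0.0, 1.0, 4.0])
loss = multitask_loss(class_probs, class_labels, reg_preds, reg_targets, n=2)
print(loss)
```

In a real training loop the same masking would be applied to the framework's own loss tensors, and the two terms usually get separate weights.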
yeah I thought about it, I can but I might also have to publish this project, and present as my capstone project
I don't wanna get weird looks
Is it conventional enough to do that?
@desert oar If you want and have time, you can have a look at it now. I just won't be able to apply your suggestions until tomorrow.
This is what I ended up doing: #1204768836084170803 message
i was thinking of one more thing:
train cNN as classifier
freeze cnn and use last layer as embedding
now train lstm for forecast
@past meteor
@desert oar And while this will substantially increase the amount of lines in my class, it will also improve readability a lot. Readability > line count.
The above suggestion is great as it makes the conditions easy to read, but I am unsure of how to edit it such that it can set more than 1 value. Basically for some of the checks we do, we put one of 2 values. The above works because it only puts 1 value if the condition holds otherwise it doesn't change the value. Does this make sense?
im using almost 100% of the 3M samples, in this session the model will not see new data
its gonna do 12.5k steps, so ig im just gonna chill, watch some Prison Break or wtv
You can do that as well sure
Look into multi task learning
we put one of 2 values
in that case, I'd say np.where is actually a good choice, especially if you're just using scalar values. other options include .replace (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html), or manually inserting with .loc:
import pandas as pd
x = pd.Series(["a", "b", "c"])
c = pd.Series([False, True, False])
y = x.copy()
y.loc[c] = x.loc[c].str.upper()
y.loc[~c] = "zzz"
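For comparison, the `np.where` version of "one of 2 values", using the same `x` and `c` as above; the replacement values "hit" and "miss" are placeholders.

```python
import numpy as np
import pandas as pd

x = pd.Series(["a", "b", "c"])
c = pd.Series([False, True, False])

# Where c is True take the first value, otherwise the second
y = pd.Series(np.where(c, "hit", "miss"), index=x.index)
print(y.tolist())
```

`np.where` returns a plain array, so wrapping it in a Series with `index=x.index` keeps the result aligned with the original data.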
redis bit my butt, had to restart the experiment >.>
idk y I was giving it only 2gb of memory tho
I'm deleting the data as soon as I fetch it tho, so I don't know what's up
I'm gonna have to park this
Tomorrow I'll setup the redis conf thing, it's not trivial because actions doesn't let me specify my own command so I need to build an entire image just to make sure it uses the right command so I can map the file
But the model is gonna train there's like no doubt about it
Was never the model's choice ._.
This trend continued til 2000-2500 steps until redis broke again
I am starting to wonder if python is the best choice for my little quest here to find a quine-regex: https://github.com/micsthepick/quinegex
has anyone used Assembly ai and know a way to get the microphone stream to end automatically when it no longer is picking up audio?
I'm trying to parse a really large XML file (90+ GB) and I want to break it up into chunks and process it on multiple nodes at once. The XML is basically just a long list of millions of <page> HTMLcontent ....</page> tags with nothing in between them... is there a way to easily break this file into chunks of 50,000 or so page tags with one of the common parser libraries?
Python script to break large XML files.
I'm using the "Create Custom GPT" options using chatgpt4 which responds with the location against the provided name
I'm using fastapi and NGROK for a static domain. I've deployed it on edge using NGROK but the GPT is still unable to trace the location.
The static website (generated by NGROK) is working fine also
guys I am working on a project in that I am focusing on Ai ready data so like im preparing a dataset to feed our model any body want to join it involves some basic steps like extracting same amount of data from files and creating a new data set and compressing it
Hi anyone recommend any courses with bayesian with ml?
Someone told me that classification problems that have a lack of labels can be done with bayesian but I don't know where to start
In Pandas on PySpark, is there a good way to parallelize tasks? For example, I have a list of ~200 tuples (dataframe, function_with_retval) and I'd like to get all of the results. At the moment these are done one at a time and this seems to have worse performance than plain pandas, but I'm wondering if there's a better way to do it
Python to make deviation slips 😮💨🔥🔥🔥
Hey folks, what tool would you use for stateful analytics, like cross-filtering?
Filters are added one at a time, and I wonder if it would be a valid approach to just use a "traditionnal" stateless analytics tool and just rerun the same query with more filters (would I enjoy some form of caching?), or if there are solution that allow to spawn a temporary state to further filter (so filter A -> list of data -> filter B -> list of further filtered data etc.)?
I've read a lot about analytics but somehow never met this cross-filtering use case while it's probably not too uncommon
How do I effectively learn and practice ai and coding in general? I’m at a point where I give up because I don’t know what direction to go, what I should do exactly, and how I would do it. I need that sort of specific help. If anyone knows, I would greatly appreciate it.
I don't fully get what you mean
Isn't this what BI tools already do (Power BI, Tableau)?
If not, how does what you're imagining differ from those
If you don't like math, program with LLM APIs. If you do like math, start with scikit-learn, which has fantastic documentation and you can get going even if you don't understand the math behind the models.
The general answer is: You want an XML stream parser, rather than a traditional: "load the entire document". I've done this many times in Java, but not yet in Python, so don't have a library to recommend. Google: "python streaming xml".
The answer might be https://docs.python.org/3/library/xml.sax.html#module-xml.sax, but I haven't used Python's SAX parser so can't recommend it.
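Besides SAX, the standard library also has `xml.etree.ElementTree.iterparse`, which streams the file and hands back one element at a time. A minimal sketch, assuming the `<page>` tags sit directly under a single root element (a file with no root at all is not well-formed XML and would need a wrapper first) and are not nested. `iter_page_chunks` is a made-up helper name, and the in-memory sample stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET

def iter_page_chunks(source, chunk_size):
    """Yield lists of up to chunk_size serialized <page> elements."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event: start of the root element
    chunk = []
    for event, elem in context:
        if event == "end" and elem.tag == "page":
            elem.tail = None         # ignore any whitespace between pages
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()             # drop processed pages so memory stays flat
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:                        # last, possibly smaller, chunk
        yield chunk

sample = io.BytesIO(b"<pages><page>a</page><page>b</page><page>c</page></pages>")
chunks = list(iter_page_chunks(sample, chunk_size=2))
print(chunks)
```

For the real file, `source` would be the path to the XML and `chunk_size` something like 50000; each yielded chunk can be written to its own file and shipped to a node.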
Does anyone know what's the interpretation of the diagonal graphs in sns.pairplot(). When we have the same variable on both the axis, let's say 'HeartRate'. Does it show the count on y-axis and values on x-axis?
Referring to the graph on bottom right
https://seaborn.pydata.org/generated/seaborn.pairplot.html: "The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.": ie: X is the value, and Y the frequency
@left tartan Thanks a lot
I am in a big data context with the need of a more barebone architecture, like relying on an existing analytics engine
I think tableau is more visualization centric
maybe powerBI but I am not sure it has the proper scale
data may be rather sensitive so I would prefer things that can be self-hosted
I see a lot of tools able to do analytics, so applying a filter on the data + an aggregation
like any database can
but cross filtering adds the idea that you progressively refine the filter
I feel like it might be costly to add filters incrementally but not sure, and I wonder if there are some analytics tool that allow that
I don't want each filter to be a totally new request, unless this new request is actually really fast
I see your request now
I think it's similar to BI tools but you really need it to be able to work at a large scale, correct? One of the things that bothers you is you don't want to recompute filters that are added sequentially
You can look at Apache superset
There you'll be able to do all of the configurations you want
this looks fantastic
I'll try to dig how they handle the cross filtering
Any idea what the dummy variable trap is in multiple linear regression?
If your categorical variable has k levels and you make k dummies, the dummies always sum to 1, so together they are perfectly collinear with the intercept. The coefficients are then not uniquely determined, which breaks their interpretation; the usual fix is to drop one level and keep k-1 dummies.
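A quick way to see the problem with a made-up colour column: build the design matrix with an intercept and compare its rank to its number of columns, once with all k dummies and once with one level dropped.

```python
import numpy as np
import pandas as pd

color = pd.Series(["red", "green", "blue", "green"], name="color")

all_k = pd.get_dummies(color, dtype=float)                       # k = 3 columns
k_minus_1 = pd.get_dummies(color, drop_first=True, dtype=float)  # 2 columns

# Design matrices with an intercept column of ones
intercept = np.ones(len(color))
X_trap = np.column_stack([intercept, all_k.to_numpy()])
X_ok = np.column_stack([intercept, k_minus_1.to_numpy()])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)  # more columns than rank: collinear
print(X_ok.shape[1], rank_ok)      # columns equal rank: fine
```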
hello guys i want to ask somethings related to deepfake detection, can anyone help me related to it?
anyone?
Just ask the question plz. There's a lot of people lurking. Or who check in whenever they feel like
well i want to work on a project called "Deepfake detection", and from the name you guys can understand what it's going to do. so can anyone advise me on sources, like where I should get appropriate image data and how I should preprocess it to train the model for deepfake detection
Not anything I know about, but, I'd start with something like reading the current state of the art: https://paperswithcode.com/task/deepfake-detection. Hopefully someone else will comment.
thanks a lot mate
My question is do you know what is deep fake?
Well deep fake is a video or an image of a person that has been altered or change by some other person's face
Well body of someone's else and face of someone's else
And I think this is the right thing?
how does Pandas store NaN values in a float array? is there a separate mask array that dictates if a certain idx is nan, or is nan a valid bit pattern for any float type that also cant represent a regular number?
This is a complicated topic with Pandas, depending on the datatypes involved. But for numpy arrays, they use numpy nan's, more info https://pandas.pydata.org/docs/user_guide/missing_data.html
Arrow backed pandas dataframes are a different story
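On the bit-pattern part of the question, a small sketch using only the standard library. It assumes standard IEEE 754 doubles (what NumPy's `float64` uses); `float64_bits` is a hypothetical helper name.

```python
import math
import struct


def float64_bits(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of a Python float as a bit string."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{as_int:064b}"


bits = float64_bits(float("nan"))
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]

# NaN lives inside the float format itself: exponent all ones, mantissa non-zero.
nan_exponent_all_ones = exponent == "1" * 11
nan_mantissa_nonzero = "1" in mantissa

# A 64-bit integer has no reserved pattern like this: every one of its 2**64
# bit patterns already means some integer, so nothing is left over to mean
# "missing". A missing value in an integer column therefore needs another
# representation, such as a cast to float or a separate validity mask.
print(nan_exponent_all_ones, nan_mantissa_nonzero, math.isnan(float("nan")))
```

So for a plain float array no separate mask is needed, while integer data needs either the cast to float or a masked/Arrow-style layout.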
Does anyone have any experience with Physics Informed Neural Networks? I am trying to solve a heat equation with Dirichlet boundary conditions, and I am confused why the solution is decent at the initial conditions and in the interior, but is horrible at the boundary. As far as I understand, the model wouldn't have any real way of being able to distinguish between the initial conditions and boundary conditions other than that there are two derivatives in space and one derivative in time.
@wooden sail
this depends on how many samples you have at the boundary
boundary conditions usually have comparatively fewer samples, and so you need to weigh them more heavily in the cost function
i had this issue with a PINN for the wave equation on a string, where the boundary was only 2 samples. if i didn't weigh those two samples like crazy, the waves wouldn't reflect
When you say comparatively fewer, do you mean compared to interior points?
compared to interior points, and compared to the initial conditions which you evaluate everywhere in space
consider that in many cases, the function evaluated at the boundary can be zero and contribute nothing to the learning, e.g. because not enough time has passed for heat to diffuse all the way to the boundary from any sources
so even if you evaluate the boundary points at several time steps, many time steps might not contribute at all
and the cost function is evaluated almost everywhere in space for every time step
what some people do is also train on a schedule. after a few epochs, start decreasing the weight of the init conditions and error, and crank up the weight of boundary conditions
What kind of scale are you suggesting when you crank up the weight on the boundary conditions?
Anybody wanna help me get a cooler gui 🥹🔥🔥🔥
I tried making the boundary worth 10 times more, but it wasn't giving me good results
honestly this depends on your setup. i would suggest you make a plot showing the error, the boundary cost, and the initial conditions cost as 3 curves in a same plot over the epochs
see how they compare to each other
a good place to start is to make them all roughly the same size
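A rough sketch of "make them all roughly the same size", with made-up loss values and hypothetical helper names. It is framework-agnostic; in a real training loop the weights would be computed from detached loss values and refreshed every so often, not hard-coded.

```python
def balancing_weights(pde_loss, ic_loss, bc_loss, eps=1e-12):
    """Weights that rescale each loss term to the size of the largest one."""
    terms = (pde_loss, ic_loss, bc_loss)
    target = max(terms)
    return tuple(target / max(term, eps) for term in terms)


def total_loss(pde_loss, ic_loss, bc_loss, weights):
    """Weighted sum of the PDE residual, initial-condition and boundary terms."""
    w_pde, w_ic, w_bc = weights
    return w_pde * pde_loss + w_ic * ic_loss + w_bc * bc_loss


# Hypothetical values read off the three loss curves at some epoch.
pde, ic, bc = 2.0e-2, 5.0e-3, 4.0e-5

weights = balancing_weights(pde, ic, bc)
weighted_terms = [w * term for w, term in zip(weights, (pde, ic, bc))]

print(weights)
print(weighted_terms, total_loss(pde, ic, bc, weights))
```

In this made-up case the boundary term needs a weight of about 500 to match the PDE term, which is one way a fixed factor of 10 could turn out to be far too small.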
My other Python script been helping too 🏄🏾♂️🔥
I don't think I have any other questions now, you gave me a lot of places to tinker with in my model. Thank you for the help
Hello. Could someone suggest me some good projects that i can analyse to get a basic experience in applying the theory? I want to cover the basics of machine learning
I need it to add to my resume as well.
Does anyone here work with remote sensing and perhaps SAR data?
I'm writing a MSc thesis and I need to do pre-processing. There are several softwares available, at least two are based on MATLAB but a few on Python.
I think "a good project" is subjective. It depends on what you're really interested in.
My trick has always been
- Read a research paper and implement the paper (e.g LoRA)
- Write a medium article explaining your code and your attempt to replicate the experiment of the paper (with lots of plots and meme maybe.)
- Add that to your portfolio (Most companies will always pick this over Titanic)
If you're targeting companies like Perplexity, HF 🤗, InstaDeep, DeepMind, Google Brain, OpenAI, etc... This strategy can easily get you a research intern / entry level interview invite.
If you had a continuous numerical column and a nominal categorical column, how would you visualize their relationship? More specifically, I'm interested in how the value in the continuous column affects the rate at which the categories occur.
At this point in the project, I'm trying to think it through. Maybe I want to bin the continuous column because the values within are fairly specific. Each row in the df has a category, a datetime, and a value for the continuous column. Maybe I want to find the rate at which the observations are recorded so that I can graph that against the continuous column.
Thank you so much for the feedback! I was really lost trying to get somewhere
Hopefully this clears up things a bit for you.
Scenario 1
Using Hypothesis Testing
Bivariate: Continuous vs. Categorical features
Plot: Barplot, Boxplot to visualise the relationship.
Example of Statistical Test: 2-sample Z-test ( to compare means of two independent population / groups)
Example: Analysis to know whether or not a company managed by male and a company run by a female spend same amount on electricity on average.
Assume you have a column called Gender (gender is a feature with two classes; male and female)
See attached image of Hypothesis test and plot.
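Since the image isn't reproduced here, a minimal sketch of the two-sample Z statistic from Scenario 1 using only the standard library. The spending figures are invented and `two_sample_z` is a hypothetical helper. The Z-test assumes reasonably large samples; with groups this small a t-test would be the safer choice.

```python
import math
from statistics import NormalDist, fmean, variance


def two_sample_z(a, b):
    """Two-sided two-sample Z-test for a difference in means."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical monthly electricity spend for companies in each Gender group.
male_run = [410.0, 395.0, 430.0, 402.0, 418.0, 425.0]
female_run = [388.0, 401.0, 379.0, 392.0, 405.0, 384.0]

z, p_value = two_sample_z(male_run, female_run)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value is evidence against the two groups spending the same amount on average.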
Scenario 2
If what you're looking for goes beyond statistical hypothesis testing, then you need to compute the point-biserial correlation (a special case of Pearson correlation between a binary and a continuous variable), or alternatively you can use your good ole Logistic Regression.
See attached 2nd image for reference.
Thanks for the reply. I'll probably try scenario 1 in addition to what I've been messing around with.
I think I misspoke somewhat in my original post. I'm really trying to find how the value of the continuous column affects the rate at which observations are made. We have date data as well. For example, as the value in the continuous column increases, does that rate at which observations are made increase?
The part that I'm trying to wrap my head around right now is some values in the continuous column are more common than others. Therefore, wouldn't it be likely that there were more observations at that value? For example, say an extreme value showed up in the continuous column, wouldn't that value have a lower count of observations than say a more common value?
I'm looking for data science internships and was wondering for a portfolio if I should make a website (and if so use a template or to code it myself) or simply use a github for that?
i'm trying to learn ai and ml. does anyone know some good resources or videos to learn from? any channel that clearly explains the math behind it and shows the derivations and code implementation. Any books, blogs or videos
For starters, the 3b1b videos are a nice intro, also explaining the intuition behind the mathematics @regal wedge
What are the neurons, why are there layers, and what is the math underlying it?
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks
I'm following a course and it had me change a single column with country names, to three columns with ones and zeros. Why would I want to do this, instead of keeping the single column but changing the countries to a numeric value, for example 1,2,3 for France, Spain, Germany respectively?
the 3 columns with 1s and 0s represent vectors with equal magnitude, all equidistant from each other
using 1,2,3 in a single entry implies, e.g., that spain and germany are more similar categories than france and germany because the distance from 2 to 3 is smaller than that from 1 to 3
which representation is better depends on your application
ML people call the binary vector approach "one-hot encoding" (one 1, all else 0), in case you wanna read more about it
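To make the distance argument concrete, a small sketch with the three countries from the question, using `math.dist` from the standard library for Euclidean distance. The variable names are only for illustration.

```python
import math
from itertools import combinations

# Single numeric column: France=1, Spain=2, Germany=3.
label_encoded = {"France": [1.0], "Spain": [2.0], "Germany": [3.0]}

# One-hot: one column per country, a single 1 in each vector.
one_hot = {
    "France": [1.0, 0.0, 0.0],
    "Spain": [0.0, 1.0, 0.0],
    "Germany": [0.0, 0.0, 1.0],
}


def pairwise_distances(encoding):
    """Euclidean distance between every pair of categories."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for a, b in combinations(encoding, 2)
    }


label_distances = pairwise_distances(label_encoded)
one_hot_distances = pairwise_distances(one_hot)

for pair in label_distances:
    print(pair, label_distances[pair], round(one_hot_distances[pair], 3))
```

With the single column, France and Germany end up twice as far apart as the other pairs. With one-hot vectors every pair is the same distance apart, so no ordering is implied.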
Ahh oke that makes sense. Is this something specific to ML, I'm guessing this is something essential I should know in general?
which part do you mean?
yes this is exactly what the course is using but it doesn't explain the one-hot encoding in depth, so I will make a note and read into it
using numeric values would 'influence' the way it's being read, I thought 1,2,3,4,5 etc is simply an index, never expected it to influence distance in similarities
it's a consequence of how euclidean distance is measured
you could read about vector and matrix norms to get familiar with the topic
Ok this is very helpful, thank you. I need a refresher course on algebra as well it seems 😉
linalg and statistics are the core of ML. then you use calculus to solve the optimization problems that arise from there
If I want to make an AI to solve a specific task, should I go with the OpenAI API or train a custom model?
Does pandas-on-pyspark offer any kind of named aggregation? The usual kwargs method doesn't seem to work unfortunately
???
Also, CS50 for AI
It depends. Share more info?
It is an NLP related task.
I would like to compare a written piece of text with a bullet pointed piece of text and count how many of the points are included in the written piece.
so far my experience at getting chatgpt to do this hasn't been that good
Maybe NLP/feature extraction? See spaCy
i now remember i asked this question before in this chat and you replied
?
should i use this?
It’s not my area of expertise, but the type of problem you described (determining if a text contains references to certain topics) sounds like a fit.
So you have a few bullets and you want to check if they appear in the text?
spaCy looks like a good tool to calculate the similarity between two texts
however, how can i adapt it to the following:
Large block of text:
Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of other living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.
AI technology is widely used throughout industry, government, and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), interacting via human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).[1]
(source: Wikipedia)
Bullet-pointed text:
- AI technology is used in industry
- Self-driving cars
- It relies on linear algebra, statistics and calculus
- Data preprocessing
- It can be used to play games such as Chess
My objective is to determine which of the bullet points were mentioned in the text. In this case, it would be points 1, 2 and 5.
My opinion: GPT is probably worse than a bespoke solution but it has zero start up
You can use openAI's API to just embed your text and then train a classifier on it
A more end-to-end way to do this is just finetuning the model there
can you further explain this?
Do you mind if I just send you this link? It has the full explanation: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
If you want me to summarize it I can
That's where you always need to start: gathering data
Actually, step 1 is unambiguously defining:
- What is my task?
- How do I judge if the task was carried out successfully by the model?
Once you have those 2 you gather data
I thought I could do something related to semantics using vector embeddings to accomplish this task.
Yeah it depends on this
this is my task
I think the key word here is unambiguously 😄 (this is where I typically get a piece of paper and/or latex and write it out)
You say you want to identify the bullets in there. It sounds like classification.
You can compute the similarity between your bullet embedding and the text embedding, but from what point do you decide it is or isn't in the text?
You need a cutoff point
make sense?
I see. I was thinking I could generate the examples later, and just run it off a few examples later and appropriately change the cutoff point.
You can do that sure
Is it better to have data in this format?
or data in the format directly comparing the bullet point and the section of text which relates directly to the bullet point?
I would embed the document once and then compute similarity with each bullet one by one
Would that work? since one bullet point with like 3 words is probably not similar to a 100 word document that includes lots of other info as well. The computed similarity value would be so small that it would be nearly impossible to differentiate between something which isn't in the text.
Is a better idea perhaps to break the text into sections within my program and then get the max similarity between a section and a bullet point?
I would start with the basic version and refine it along the way
GPT is capable of retaining a lot of information in its embeddings. We've seen that at work.
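A sketch of the "split into sections, take the max similarity, apply a cutoff" idea. Everything here is an assumption for illustration: `embed` is a toy word-count vector standing in for a real embedding model (OpenAI embeddings, spaCy vectors, etc.), the sentence splitter is a simple regex, and the 0.4 cutoff is arbitrary and would need tuning on labelled examples.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def covered_bullets(document, bullets, cutoff):
    """Return 1-based indices of bullets whose best-matching section clears the cutoff."""
    sections = [s for s in re.split(r"(?<=[.!?])\s+", document) if s]
    section_vectors = [embed(section) for section in sections]
    hits = []
    for index, bullet in enumerate(bullets, start=1):
        bullet_vector = embed(bullet)
        best = max(cosine(bullet_vector, vector) for vector in section_vectors)
        if best >= cutoff:
            hits.append(index)
    return hits


document = (
    "AI technology is widely used throughout industry, government, and science. "
    "Some applications are self-driving cars and strategy games such as chess and Go."
)
bullets = ["AI technology is used in industry", "Self-driving cars", "Data preprocessing"]

print(covered_bullets(document, bullets, cutoff=0.4))
```

In this toy setup the short "Self-driving cars" bullet clears the cutoff against its own sentence but not against the whole document at once, which is the dilution concern raised above. Real embeddings will behave differently, so starting with the basic whole-document version and measuring is still reasonable.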
rate my graph 🫠
I'll read it thanks. but I'm trying to understand why it can store NaN/null in a float array, but not and array of ints
it doesn't say anything about the inner workings. other than with pyarrow, it can store NaN just with float and object arrays. and that's why it casts int to float ...
I have a CSV reader using pyarrow that is performing at roughly 22k/s (lines per second) including row/column validation and some transforms.
I'm wondering if anyone faced a similar project and had some good ideas on improving performance
i already made lookup tables to run static method validators, so i can run those by index
Polars™️
Joking aside, we switched all of our pandas/pyarrow stuff over to polars because the API is just way more coherent, simpler and normally faster.
if you dont mind a DM I can show you some of the transforms I do, but it boils down to coalescing data and validation of different fields. thats the most expensive part of the job, but pyarrow doesnt seem like it's blazing either
If you ask, parallel; if you ask, distributed parallel; if you ask, distributed parallel async
note that pyarrow already does multithread streaming
i didnt say it is, but distributed does not necessarily help all cases.
two computers doing one thing why not helping?
if it did, there would be no market for massively powerful hardware meant to do single node work
latency, all the added overhead of clustering/distributing tasks
two kernels, two context switching scenarios, two userland, two mmu, two network stacks ,.. the list goes on
when i hear people trying to answer 'distributed' to everything it makes me wonder if they have a clue about OS internals at all
Because in common house hold don't buy two computers to do one thing
common household? i have a 2TB EPYC dual cpu supermicro chassis one floor below...
2tb ram*
and it isnt even something exotic
I mean, all these matters you can just let your other computer only reporting after finish
anyway
there is no need reporting in run time
@buoyant vine looking at polars
seems it's missing some candy for the CSV API
namely stuff like auto conversion for boolean types with known false/known true values
I mean, distributed and parallel do help your issues, don't know why you got offended and don't want help
playing with low pass filters of different lengths?
Join the club. I'm polars fan #1 and for me the main selling points are the coherent and simpler API
Maybe a hot take but I'd even use it if it were slower than Pandas
@rocky spade i did not get offended, you just suggested something that does not effectively solve any issues i have, nor provided any actual insight technically usable.
some of the options are supported in polars it seems
self.convert_opt = pv.ConvertOptions(
    false_values=self.csv_bool_false_values,
    true_values=self.csv_bool_true_values,
    column_types=self.mapped_column_types,
    include_columns=self.wanted_columns,
    null_values=[""],
    strings_can_be_null=True,
)
but not the false/true values
that saves me a good amount of pain as i do not have to run my validators for boolean dtypes
It's kind of ironic how Python was known for data in the Pandas + matplotlib era when these 2 weren't the best / most user friendly tools imo
essentially the more i can bypass validation for, the better.
Maybe you should try parquet, because nobody knows what your performance issue is when you're just talking about reading a CSV at 22K/s "impressive speed"
If the data is already in CSV 😅 Parquet isn't going to save you
although Parquet is by far the best format to use if you can
I am already converting data to parquet for secondary backups
exactly lol
"try parquet" (but im parsing CSV son... im receiving CSV files...)
picks the phone to convince the source of the data they need to rethink and rewrite all their crap to produce parquet files
Maybe you should
I can almost guarantee their answer will be "lol no" even if it would be beneficial for them as well
we get sent TB of CSV files and they just refuse to do it any other way
^ real world.
@rocky spade i wish for a lot of things. in fact, i wish i could just throw a 50mil record file on gpt and ask it to be my beeyotch and return a perfectly structured, deduped, coalesced dataset my way
but it seems like i dream too far
I’m also in the anything but pandas camp
plays i believe i can fly
But sql is my answer
Have you used R?
Yes, I do like R
I haven't used it in a long time, but I'd always argue their data processing toolkit is more user friendly. It's just too slow.
This is probably more extreme, but maybe you could use something like Trino/Athena/Presto if you want to be as fast as possible and don't care about the cost.
The SQL query will suck, but I know Trino is capable of brute forcing its way through. If you want something nicer though :/ I'm not too sure since, as you said, Polars doesn't support everything you really need easily.
Maybe you could make a LazyFrame in polars and do the bool conversion as part of the pipelined operation?
def process(self):
    if self.parse_method == INGESTION_METHOD_MEMORY:
        table = pv.read_csv(self.filename, read_options=self.read_opt,
                            parse_options=self.parse_opt,
                            convert_options=self.convert_opt)
        self.total_rows = table.num_rows
        self.prebake_lookup_tables(table.schema.names)
        self.process_table(table)
    else:
        # XXX: beware this can trigger OOM, beats cpython iteration through lines() though...
        self.total_rows = self.count_rows()
        with pv.open_csv(self.filename, read_options=self.read_opt,
                         parse_options=self.parse_opt,
                         convert_options=self.convert_opt) as reader:
            self.prebake_lookup_tables(reader.schema.names)
            self.loop_through_chunks(reader)
I'm about to benchmark it again without any of the post-validation stuff
@left tartan
I'd love a credible, maintained port of ggplot2
So it’s not the csv reading per se but the line processing?
Do you have any issue regards to your problems?
JIMMY SON literally reads the table two times and makes a list of it, and complains about processing speed
@buoyant vine
[2024-02-09 15:55:52,298] [MainProcess:Thread-2 (periodic_performance_logger)] INFO: CSV: Processed 780221 lines in 24.09 seconds, 32385.35 lines/second (ETA: 6.18 min)
^ validation enabled, using lookup tables for running per-column methods
@rocky spade no, it doesnt, i dont think you understand how pyarrow works. the total_row_count is only done on the streaming part as an initial step, and it technically does not parse anything because im not consuming the dataframes there.
it literally is a newline/carriage return counter
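For reference, a newline counter along those lines could look like the sketch below. This is not the actual `count_rows` from the script above (that implementation isn't shown); `count_lines` and the chunk size are assumptions.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes by scanning the file in fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        # Nothing is decoded or parsed: each chunk is only scanned for b"\n".
        while chunk := handle.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Note that this counts raw newlines, so the header line and any newlines inside quoted CSV fields are included, and a file without a trailing newline comes out one short. That is fine for an ETA estimate but not for an exact row count.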
And should you not use Spark for this matter?
on nvme it completes in ~3 seconds tops
Spark is a pretty nuclear option, and you'd spend more time setting up the Spark cluster than the work itself probably.
@rocky spade do you mind sharing a link to a github or something showing your work?
@buoyant vine indeed
What are your csv_bool_false_values and csv_bool_true_values here?
Here
@buoyant vine they are dataset dependent, "yes", "no", etc. I made it dynamically configurable, sometimes they change.
You basically just opened the entire CSV file and iterated through everything in it, and I told you to make it parallel, no response.
I mean, i don't know what you want?
@buoyant vine Processed 12797780 lines in 375.41 seconds, 34089.97 lines/second (ETA: 0.00 min) < this is including validation. no coalescing/dedup per row, though. i still need to optimize that. it's not as trivial as the validators, since i have some fancy expression support to do inter-column checks and such (ex. fields are transformed based off other columns and their values)
import polars as pl
true_values = pl.Series('true_values', ["yes", "y", "ok"])
false_values = pl.Series('false_values', ["no", "n"])
data_stream = pl.scan_csv(
    "my-files/*.csv",
    schema={
        "my_bool_col": pl.Utf8,
        "my_other_bool_col": pl.Utf8,
        "title": pl.Utf8,
        "description": pl.Utf8,
        "something": pl.UInt32,
        "else": pl.Float64,
    },
    truncate_ragged_lines=True,
)
data_stream = (
    data_stream
    .with_columns(
        pl.col("my_bool_col").is_in(true_values),
        pl.col("my_other_bool_col").is_in(true_values),
    )
)
# ... processing
I am not sure how fast this is since I don't have anything to test it right now, but this should do the bool conversion as part of the streaming operation.
i can check a bit later
The inter-column checks are probably going to be the slowest thing, since data engines can often get confused with them
hows type inference with polars?
Very good
i wrote a tool also for improving type inference for the parquet conversion, it isnt anything sophisticated, but speeds up crafting the yaml configuration for each CSV dataset
Its good, the only issue I have ran into, is it reads and writes Utf8 strings as Utf8 with 64 bit int lengths, which you can't change.
Had an issue before where if you have some custom arrow processing and it can't automatically convert between the 32bit and 64bit lengths, it can cause some issues.
import pyarrow as pa
PYARROW_TYPE_MAPPINGS = {
    "StringType": pa.string(),
    "EmailAddressType": pa.string(),
    "PhoneNumberType": pa.string(),
    "GenderType": pa.string(),
    "BooleanType": pa.bool_(),
    "DateTimeType": pa.timestamp('ns'),
    "CountryType": pa.string()
}
just an example from one of the type mappings
The most important thing with polars type inference is that it knows the difference between pl.LazyFrame and also the "context" you're in like Expr, or Select etc
It quite accurately tells you what ops you can and can't do
Or did you mean the inference while reading data
while reading
Hi there, not sure if this is the right channel to ask this question, but essentially I am currently struggling to figure out whether I am spending too much time on organizing code in my Jupyter notebooks as opposed to conducting experiments and exploring data/opportunities with the aid of the notebook.
As a result, I was wondering if anyone has any advice on how to balance organizing code with actually using Jupyter notebooks for analyzing data and experimenting with different kinds of models?
That's right son
ex. if you look at the above table, i have a per dataset mapping that maps columns to those generic types. each has its own validation logic, including adaptive settings per dataset (ex. known observed bad values)
😅 I'll confess I never organise my notebooks, they are all not in the git history for a reason, I just have chunks of code everywhere.
Normally if I have some model tests or what ever, I put them in normal python files.
so, this is in the context of how you are mapping types in polars
Defining the schema for each dataset in Polars is very simple, doing some more complex conversion or casting typically requires the use of some with_columns and explicitly casting the types, or some extra work. It is very type strict unlike pandas which can be a bit more hand wavy.
as i already apply column type mappings. for the most part i care about dates and some string types, the rest i can often apply a fast path in validation and just leave them as is or as none/null
I organize in .py files and use notebooks to incrementally test what I'm making
My experiment pipeline is always a .py
ill definitely look into a polars version of the current csv processor
Date parsing, etc... Is very simple, you may need to define the original type as a string, then tell polars to parse it, but it has native methods for this so it is very fast, it just doesn't have a helper schema type for implicitly casting IIRC
I see, the issue is that I have to put all of my code (at least the relevant parts) in my Jupyter notebook as part of an assignment, but its currently overdue since I have been spending too much time on organizing my code.
I tend to try to follow the DRY principle since I have found that it's something that has helped me out in other areas such as software engineering, web development, and game development, and I carry it over to Jupyter notebooks (which I mainly use for data analytics and machine learning projects).
However, I really don't know if my tendency to be a stickler when it comes to adhering to the DRY principle is causing me to waste too much time on cleaning up my code and reducing code duplication/generalizing common procedures instead of actually y'know... exploring the data and experimenting with new models haha
Overall I'd say it is very against the idea of implicitly or automatically type casting things. For the most part it just won't do it unless you explicitly do it.
How many lines is your file? How big is it? Do you mind so i can make a copy file to test it on my own?
So pass the 22K/s is impressive?
@buoyant vine the tl;dr is that in the end i want an array/list that matches the indexing of my wanted columns set (this is how i optimize validation, by executing/running all that by index, the validators are assigned to the right fields once, and the table is cached)
Maybe you should change a CPU instead
I see, thanks for letting me know about that, just curious, what do you mean by an experiment pipeline?
Are you referring to a set of scripts that just contain experimental code (which might or might not be scrapped in the future)?
In polars, I would try and do all those validation as part of the streaming operation rather than via index. In theory it should be faster providing you can make use of some of the native helper methods rather than calling map_elements everywhere.
No, the code I use to run all of my models / preprocessing etc
I like having it all reproducible and so on
@buoyant vine the validation is done to the row as a list/tuple/set
Any particular reason for that? Or just because it was the best way with pyarrow?
From my understanding, if you just use someone else's packages and don't dig in yourself, there is no real performance you can improve other than general things like parallel or distributed
it was the best way with pyarrow
I see, so does it mean that you put all of the code for reproducible experiments or finalized models in separate .py files and then just use the Jupyter notebook for exploring and experimenting in?
Because you have no clue what under it
@buoyant vine i can DM you if you are curious
Fair enough, yeah I would try with Polars' more columnar approach if you can. I don't know all your validations, but if you can do it without getting it by row then it should be pretty speedy
Sure sure
This is true, in Polars the biggest performance hit is when it has to go back to python land to process stuff
Although tbh, when we hit those sorts of issues, we stop doing the code in Python 😅
Usually I have like dozens of experiments I run with the single pipeline. I test one or two out in a notebooks manually and then I parameterize the pipeline using the CLI or a second .py that runs everything
@buoyant vine i wrote a very simple test in rust without all the dynamic/configurable validation and mappings, and it beat the crap out of python 3.12 with latest pyarrow
single threaded too
So i was trying to say the same, only parallel, distributed or general stuff
all numbers i provided so far come from a i9-13900K workstation
Other than that, write yourself a package and then talk about performance improvements
About Polars, what I see being a big issue of people transitioning to it is not leaning into it
I see, thats interesting, thanks for the information.
When testing things out in a notebook, do you usually also try to organize the code into functions (mainly for repetitive tasks, such as building and evaluating multiple models or anything else where the procedure does not vary so much) or do you just duplicate the code instead?
I think if you're doing iter_rows and/or map then using it doesn't make sense
There is no way you can improve a performance when using someone packed stuff
Coding level issue son
@rocky spade if you look at how @buoyant vine 's and other folks' interactions work out, your experience in this channel with other people will likely improve also linearly to your enthusiasm in "distribute everything ahoy"
plonk
/ignore @rocky spade
lol
Yeah, personally I despise python's arrow handling and parquet handling. If you think the speed is the issue wait until you try streaming to and from object storage with it 
Polars is very nice though if the data can sit on local disks and be done with it though
f that, im doing all this on nvme/optane
so i know IO is not the bottleneck at least in that sense
😎 Join the darkside and doing 100s of GB / s on blob storage
lol
In notebooks I start off with rough code and then I make it better and potentially move some stuff to the .py files. Think about it this way: writing code that works is a challenge. Writing code that is really organized is also a challenge. Sometimes it makes sense to not try to do these at the same time. Make it work, write tests to verify its behaviour and then make it cleaner
I never mentioned distributing everything. This guy hates anyone who mentions distributed while he uses a high-level language on top of a high-level package, thinks he is impressive, and asks for improvements. When we're talking about improving reading speed, isn't the only magic to change the package you use, fix a stupid code error, or switch to parallel or distributed reading? Anything beyond that means writing a reading package from scratch, without any package someone else wrote; then we can talk about fundamental improvements
This is somewhat misleading I'd say, or a misconception at least. Yes, there are limits, but most of the time the library code itself is not the limiting thing.
It is also worth mentioning that it is typically not worth it to build some system from scratch in something like Rust or C++ unless you actually have issues with the speed it is currently doing it in or have some other requirement which the Python lib or whatever doesn't support well.
There are a lot of optimizations you can normally do before you get to that stage
@rocky spade i think you dont read english well. i never said 22k is impressive. i said it's the ceiling of what is possible given the circumstances. yet you are here trolling because your petit ego got hurt when i told you that your suggestions were not valuable. look at how other people respond here. their input has value. they are not acting haughty or like they have a chip on their shoulders. i bet any of these kind fuckers have a fairly sizable amount of experience on their shoulders, thats where their humility and good attitude comes from. get over it. learn from them.
We can also probably chill out a little bit 😅 We don't need to argue or throw insults or what not
just before this becomes too heated...
lol
isn't there any way to improve when you use pandas to read a CSV file?
pa.read
The only one who has no chip on their shoulder starts calling others son when someone replies to your code after you asked for improvement
Don't understand why you'd be writing single threaded apps if you care about performance in 2024.
The only one who has no chip on their shoulder starts calling others son when someone replies to your code after you asked for improvement, son
python single thread performance is of course going to lose to rust too don't think that's controversial at all the runtime has a cost
"son" is not an insult, and you suggested "i use parquet" to a question that obviously involved CSV data... which cannot be obtained in any other format....
calling ppl son is typically considered a sign of disrespect
if you cant take humor you should not be hopping into the internet
he spent ~1hr offended because someone made a "son" joke in an internet channel. solid,
anyway probably time to move on from that, what's the issue exactly, that polars in Python is underperforming polars in Rust on a single threaded app?
no, pyarrow
Okay son, I didn't know that was humor. Probably make it a distributed system rather than reading a single file in one thread. You don't need to communicate at run time, you just need to generate two reports at the end, which probably makes it twice as fast. Just a coding issue, good luck
Multithreaded is one process
threads should be more efficient for IO bound things
😅 Ngl I think making this a distributed system. for this task is a bit overkill
he also doesnt understand how threads work apparently
Especially if you don't need the cluster all the time, no matter what system you use, managing the cluster suckkks
Going distributed is a special kind of pain you want to avoid imo
you use processes for CPU bound things in Python because of the GIL but for IO bound things you can use threads
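A rough sketch of the threads-for-IO point, using `time.sleep` as a stand-in for a blocking read or network call (an assumption for illustration; like real IO waits, `sleep` releases the GIL so other threads can run). The function name and timings are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds: float) -> float:
    """Hypothetical blocking IO call; sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


start = time.perf_counter()
# Four blocking waits overlap because each thread spends its time waiting,
# not holding the interpreter lock.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, [0.2] * 4))
elapsed = time.perf_counter() - start

print(f"4 waits of 0.2s finished in {elapsed:.2f}s")
```

For CPU-bound work the same pattern with `ProcessPoolExecutor` is the usual swap, since threads would take turns on the GIL.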
It's a high price you pay for a nonexistent reward if it all fits on 1 machine
@past meteor just live your entire life in distributed async land and treat everyone else like a baby
Also, probably worth mentioning pyarrow is written as a native extension, it releases the GIL in its parsers 😅 So you get the full use of the CPU.
Hey there, just for the record please remain civil, "RTFM" is not a very friendly phrase
One of our colleagues made a #FaultTolerant #Microservice in #Elixir
The limiter is going to and from python land from pyarrow and these native systems.
I think it could've been a sync flask app in a couple of days
Fault tolerant? It failed 😭 (it's currently down)
So what's the issue then? You already USED the MULTITHREADED PyARROW package, and you are ASKING FOR HELP
no need to shout
What was it even supposed to do?
From what I saw today, multithreading isn't very impressive. Do you know how to write parallel code and make use of it?
We ran a clinical trial. All it had to do was call an API. For some reason he really really wanted to make it stream data so he ended up polling the API every few seconds. Problem is, he messed up and we had tons of duplicates.
Secondly, batch would've been totally fine for us. Just calling the API once every day or every half day solved the problem.
There's a couple more microservices but those are basically there for what I believe is obfuscation
I am not sure what you are trying to say here.
Ohhhh, I was getting arrow and pyarrow confused. Was trying to figure out why a datetime library would need a csv reader
F 😅 I have currently taken dev down because the scale management and partners wanted for a service was much much higher than the price tag they were going to pay, and having to do some aggressive optimizations
unfortunately, Docker images coming at a cool 24GB in size compressed
and we ran out of ephemeral storage
We have 3 services, one polls data source A, another lets our clinical partner upload patient info and a last one polls data source C
Oh god yeah I forgot arrow in python is a dt lib
To do any query you need to join so many API keys 😩
Why would they name the datetime lib the same thing as the well-known data format
JWT service ftw
Keycloak 🥴
hahaha this is hilarious
Well, you get what you pay for
use CUDA, CUDA is great, multithreaded, single processor, strong
underestimating the performance of AI models™️
Yah, I hate this… I hate when cost control is 10x the initial effort
Honestly, the project started before I joined. If I were there from the start I'd have challenged many questionable decisions
Got curious. Apparently the datetime lib came first
I think ultimately what $dev did was resume driven development
And NVidia specific, isn't it?
I think jimmyhoffa understands these things and is throwing good horsepower against it.
Tbh it was the opposite here, but they were rushing to deploy to prod and sell the service before the system was optimized and we knew how it scaled.
Oh, hah. I feel you
"We can afford a bit of a price increase, its not an issue for us"
But can you afford a 100x increase
So I was thinking, in that situation, can he cut the file in half, i.e. find a way to read only half of it at a time, and then read the halves in parallel?
Can also be streamed in, but I don't know if that helps with a csv
Because there is no way you can improve things when you use a PACKAGE
Certainly, but I think first question is: what is the current bottleneck and why?
He didn't say any of it, but he was strongly against my advice (distributed and parallel) after I had read 3 of his sentences
it is already parallel, and distributed is overkill 😅
does this belong to the class of problems where we're complaining that Python is just slower than Rust and end up saying that if you don't like the Python performance then don't use python
because that's what it seems like
@dry geyser i am curious what your bottleneck is, but if you’re done with this conversation I don’t want to drag it out. Can you share more info about the per line processing?
I see, that makes sense, thanks for the explanation
stupid code error, code structure ; package problem
solution=> write your own fucking package, parallel and distributed.
Everyone knows multithreaded is one single processor, right?
I don't see any difference with Python concurrency
@left tartan the bottleneck is the validation/coalescing/etc. without it it's ~34k/s, roughly 15-20MB/s, pyarrow without any validation or pandas/dataframe conversion can maybe go up to 100MB/s
there is a dual conversion for the dataframes happening too. the final product is a deduplicated, coalesced dict with validated information (including some dynamic expressions, but i have tested without that too, similar to asteval)
Any opportunity to vectorize the validation/coalescing?
i already do it with the validation by building lookup tables and processing the columns by index, for coalescing it's trickier because i support complex inter-column logic. ex. if column X has value Z, set field to Z, else take value from Y
?
will need to consider a similar approach, but because it builds a dict to be batched for elastic indexing, it is less trivial than vectorizing the validation, which in the end works with a list/set, so we can basically assume column N has validator X and it will remain constant
the coalescing is not immediately solvable since we need to iterate thru the validated data, find the dups, remove them, and so on.
however the gains later are immense because the indexed data never needs touchups
Cross column coalescing is (probably) straight forward if you build a table or dataframe for each batch.
so i dont have to deal with any of the annoyances in ES for updates
But I get the dupe detection problem
ex. multiple columns contain an identifier, which sometimes repeats. i get rid of all the dupes.
** I’m a DuckDB shill so my first experiment would be to load to a DuckDB table, and do it all in sql.
a very well respected math-head recommended duckdb to me for this project but i saw some limitations as i need near realtime text lookups
i augment the data externally with edgedb for holding some relational data/caching some searches
i would be interested in talking about how it would work with duckdb though
the problem for me was the massive amount of potential idempotent inserts
Oh, I was just thinking for processing. You might then export and use another way for lookups.
ex. identifiers connected to a given object being repeated
suddenly i end up with 15 mil select or insert queries = no go
(hence elastic)
@buoyant vine has been helping me grok polars to adapt the current csv processor, there are some hiccups but apparently polars has an expr engine
@buoyant vine ill ping you about the native expr stuff in polars
got a mockup with polars going
$ time python testpolars.py tests/fixtures/..._500k.csv
real 0m1.014s
user 0m2.132s
sys 0m0.323s
14mil records in 28seconds, with boolean conversion already done
I hope that is a good sign 😅
What would be the equivalent for handling dates? ex. attempt auto conversion
assuming UTC
(or no tz)
try_parse_dates
@left tartan for asyncio, I understand it is single-threaded concurrency, but if it is one thread, how is there a loop manager that can control the loop?
Is it because the loop manager always lives in the current thread and never leaves or changes?
read how epoll() is implemented to understand how it can do what it does in a "single thread"
For asyncio threads, there’s a scheduler/loop involved, yes. See asyncio https://docs.python.org/3/library/asyncio-eventloop.html
I messaged you privately
The inner workings here are not something I’m very familiar with.
Thanks
I saw this, but what I never understood is why it is not parallel: one task runs and then switches to another task when it yields, so basically there is only one thread, and inside the thread the scheduler calls other concurrent tasks when they report that they are ready. But if it is not parallel, how would it know? => So when one task awaits, is there a list to check whether another task is ready?
I do lack a basic understanding of processors and concurrency at the programming level
Python threads run concurrently, but not in parallel. Meaning: multiple threads can be started, but only one runs at a given moment in time.
I'm curious how the data transactions work in one thread
The scheduler handles assigning the work: a thread can be preempted so that another thread can run.
I’m not familiar with the internal mechanism of how the scheduler works.
(There’s a more complicated discussion about ‘why’, which leads to the GIL and eventually PEP 703)
Did they open source it?
Yes, cpython is open source
I don't want to read cpython..
I thought Python is open sourced..
Cpython is Python (well, there’s others but it’s the one you’re using)
The way I'd always explain it (a bit hand-wavy) is that concurrency is an idea and parallelism is one specific implementation; asynchronous programming is another. Python's async/await is based on event-driven programming (which is a way to do async): you have an event loop that submits tasks with a callback. When a task is done it's put in a queue that the scheduler checks frequently to see which tasks can be resumed. True parallelism between threads isn't possible in pure Python because of the global interpreter lock.
Thanks
For real?
But assuming the current task is running, does the scheduler just keep checking like crazy every few milliseconds, and does doing this stop the current task?
And when we say callback, what is a callback exactly? Does the callback need to check or return something?
But what about the multiprocessing module, isn't that true parallelism in Python?
Is there any way to see the code directly, like what the callback and scheduler are in Python?
CPython...
Multi process runs fully independent processes, very different and fully ‘parallel’. But, they don’t share objects.
So that's true parallelism, and the Python GIL is just a mechanism for thread safety or memory safety or something like that
Because after learning about the multiprocessing module and reading its documentation, my impression is that the GIL is just a joke?
for the most common ways of using it?
I don't fully understand the GIL, I just assume it locks the thread or something intentionally
You only need to check the queue when a task has finished or awaited, to schedule the next one. The loop uses select, poll, epoll, ... like jimmyhoffa has mentioned. Their advantage is that you don't need to actively poll, which means you don't need to keep asking the task "are you done? are you done? are you done?".
The callback is really abstracted away in async/await. Another hand-wavy explanation: the callback here would be the code that follows after the await. That's what needs to be done when the event is finished.
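To make the "code after the await is the callback" idea concrete, here is a small sketch with two tasks on a single thread. The `worker` function, task names, and delays are made up for illustration:

```python
import asyncio

log: list[str] = []


async def worker(name: str, delay: float) -> None:
    log.append(f"{name} start")
    # await hands control back to the event loop; nothing is busy-polling.
    await asyncio.sleep(delay)
    # Everything below the await plays the role of the callback: the loop
    # resumes the task here once its timer has fired.
    log.append(f"{name} resumed")


async def main() -> None:
    # Both workers run on one thread; the loop interleaves them at awaits.
    await asyncio.gather(worker("a", 0.10), worker("b", 0.05))


asyncio.run(main())
print(log)
```

Both tasks start before either one resumes, and `b` resumes first because its timer is shorter, even though there is only ever one thread running.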
Appreciated
!
Do you know about generators?
I checked the yield, so i know about it, somehow
I understand the code and the concept
Well, let me not confuse you 😄 I think this is more than enough information for one day haha
Just write code and it'll become clear
This is one of the more complicated / confusing topics in Python.
I'm going to have to move on now, we have an entire channel for this stuff though in #async-and-concurrency
The most important thing, imo, is to understand that concurrency is an idea that has multiple implementations
It's like an abstract class if you may 😄
Check out this article, pinned in #async-and-concurrency #async-and-concurrency message
Should i start with naive bayes or linear regression?
Linear regression is a good place to start
Thanks
Just asking, do you guys know of anything that can fix my fundamental problem: how to code at a deep, low level, such as communicating directly with bytes, how to build something memory safe and so on, much more detailed stuff than just using a high-level language? Everything in between bytes and a high-level language
I checked CS50 they explained about memory safe and those topics
but i do want more of it
I checked Harvard CS50 but didn't watch it all the way through; on memory safety there was just a little bit of explanation
I have a question to those experienced in Plotly Dash. Alright so a little background. I am trying to recreate a dashboard from a proprietary work website, and one of the features is that it changes the SQL query based on the date chosen. I already got the SQL query running and I got the algorithm to help me generate df_2 based on the date chosen by the user (this is done through a dialog box that pops up via tkinter. I'm now working on designing the app. I wrapped all the other code in separate functions. I have a text box with a button. I basically have it when if n_clicks > 0, then I want to call all those functions I defined earlier in the Python code prior to the app code to generate a new df_2 based on the new date entered. Is such a thing possible?
@rocky spade https://github.com/cia-foundation/TempleOS
for the ultimate guide into communicating with bytes, god, and everything in between
(offtopic)
all temple os humor aside, https://github.com/akshitamittel/Minix3-Schedulers/blob/master/Report.pdf
For all you AI wizards, I am planning on making a voice detection model with a CNN. I am taking the greyscale spectrogram of my voice and feeding it into the model to be analyzed. Here is a simple diagram showcasing my plan
Input: (batch_size, 1, height, width)
|
Conv1 (3x3 kernel, 32 filters)
|
v
Activation (ReLU)
|
v
MaxPool2d (2x2 window, stride=2)
|
Conv2 (3x3 kernel, 64 filters)
|
v
Activation (ReLU)
|
v
MaxPool2d (2x2 window, stride=2)
|
Flatten
|
v
Fully Connected (Linear) Layer (64 * 16 * 16 -> 128)
|
v
Activation (ReLU)
|
v
Fully Connected (Linear) Layer (128 -> 2 classes)
|
v
Output: (batch_size, 2)
Please give me some suggestions on how to improve this model
I finally debugged the redis issue
Seems like the model is gonna plateau
I believe a 0.8 loss is acceptable tho
do u think my model is good?
Hello
Only one way to find out
I eventually figured out the issue by using this suggestion, but the reason the solution to the PDE was bad was kind of silly. I was using tf.square on a tensor of shape (N,) and on one of shape (N,1). This was causing something funny to the way the gradient of my loss function was calculated in a way I still don't really understand. Anyway, thanks for the tip.
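One plausible explanation, assuming the `(N,)` and `(N,1)` tensors ended up combined elementwise somewhere in the loss: broadcasting silently turns the result into an `(N, N)` matrix. The sketch below uses NumPy as a stand-in for TensorFlow (an assumption; both follow the same broadcasting rules), with made-up values:

```python
import numpy as np

N = 4
pred = np.arange(N, dtype=float)                   # shape (N,)
target = np.arange(N, dtype=float).reshape(N, 1)   # shape (N, 1), same values

# (N,) minus (N, 1) broadcasts to (N, N): every prediction is compared
# against every target, not just its own.
residual = pred - target
wrong_loss = np.square(residual).mean()

# With matching shapes the residual is elementwise, as intended.
right_loss = np.square(pred - target.reshape(-1)).mean()

print(residual.shape, wrong_loss, right_loss)
```

The values are identical in both arrays, yet the mismatched-shape loss is nonzero, which is the kind of thing that quietly distorts gradients.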
this is how my pipeline is looking
oops 😛 well, glad that worked out
To me it sounds like you should look into learning C, understanding pointers and pointer arithmetic, malloc, etc. Not really data science or even Python though. CUDA might be the only data-related thing I'm aware of that has some similarities to this sort of low-level programming.
Hey.
Im sure it is. I just never used Dash before so I’m not too sure how or what needs to be done to get the input from the button and use that to update.
https://dash.plotly.com/dash-html-components/button
There’s a basic but good example using an input button.
You can start all your function calls from there I suppose.
html Button components are commonly used in Dash callbacks.
heyyy
can anyone help me solve this?
```sh
OSError: Unable to load weights from pytorch checkpoint file for './pytorch_model-00001-of-00006.bin' at './pytorch_model-00001-of-00006.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Loading checkpoint shards: 0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 531, in load_state_dict
    return torch.load(
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 540, in load_state_dict
    if f.read(7) == "version":
  File "H:\py39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 389: character maps to <undefined>
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
```
@buoyant vine might have found an issue with how polars handles schema/dtypes
there seems to be an obscure bug where the index for some columns is offset by one
the mismatch leads to an issue later on where the index used to assign a field is not the one expected, ex. from the computed headers of the csv
any polars guru around?
👋 ola
Hi there, sorry to necro this message again, but I was just curious about what your definition of "make it work" would be in this case?
Would it be to ensure that the code runs without errors and performs its designated task correctly?
Or would it be to be able to make new useful observations/gain valuable insights into what you are doing within the notebook (i.e. exploring/analyzing data, training and finding suitable models to address a certain problem)?
Or would you say that "make it work" means something else in this case?
so i found the following: if i specify schema to my scan_csv, i can double performance by skipping the type inference, but it seems to skip columns. i made a single record test case to test and confirmed the problem. basically i depend on headers (array of column name) being static/having fixed indices. i have optimized most of the logic to do away with named/dict based access, so it's all index-referenced. the problem manifested when i noticed some columns were assigned to a shifted index. ex. birth date column got shifted by one, and it picked the wrong value.
if i use dtypes instead of schema, the problem disappears
is schema expected to be in order?
is there a way for me to disable type inference for any field not specified in dtypes passed to scan_csv?
@limber mesa also, could you explain to me how the filtering and expr engine works?
tl;dr of course, no need to go in depth
ex. what happens when i build several expressions and pass them to my lazyframe
✅
Look where he was at 👌🏾
Hey, if referring to pandas, you're better off using named columns and accessing by name. pandas works with indices but I believe it's not made for it. And as you've noticed, if one of the columns is in a different order. Everything messes up as things are not what you think they are. I suppose it's the reason people prefer dicts over lists after a while. They both have their own use cases but yeah.
polars
i use lazyframe
setting inference length to 0 does the trick
im not sure how polars handles this internally but producing a dict is expensive. ill measure how much performance is lost in precise numbers, but going from named=True to named=False gave me an extra few k/s
i have already doubled the speed including validation
now writing a new validation class that builds the polars expr(s), i need to measure it though
lol i see a sports fan
anyone here also plays poker and does "things" with bigdata/stats?
I figured how they are shorting players in a certain area it’s a small market 🔥
haha
Is there a better way to do this:
df = pl.DataFrame({
    "emails": ["johndoe@hello.com", "bob@gmail.com", "bogus", "a@a.com", "no@a.com"]
})
filtered_df = df.with_columns(
    pl.when(pl.col("emails").str.contains(good_regex))
    .then(pl.col("emails"))
    .otherwise(pl.lit(None)).alias("emails")
)
print(filtered_df)
filtered_df = filtered_df.with_columns(
    pl.when(pl.col("emails").str.contains(bad_regex))
    .then(pl.lit(None))  # Set bad emails to None; adjust as needed for your use case
    .otherwise(pl.col("emails")).alias("emails")
)
print(filtered_df)
ex. combining both expressions
in one statement
and yes we could do a massive regexp in one shot, but for the purpose of figuring out how to best write polars exprs, lets assume two singular regexps, one for basic email format validation/standard conformance, and the other for known bad values
how is this applied internally?
ex. can I keep altering the df and my validation remains present for other columns?
especially in the context of a LazyFrame
No you're good 🙂 to me "make it work" means just make the code run without errors
hey @past meteor
another question: suppose I want to validate alpha2 country codes, i can precompute a table of known good values from pycountry. is there a way to integrate this into the polars validation?
I have finished learning C (for a better understanding of computer science and related concepts), and now I am learning Python. I want to know what I need to learn first in Python so that I can code in it before moving on to things like pandas, NumPy, scikit-learn, etc. Is there anything in between the basics of Python and pandas, NumPy, etc.? Can you tell me all the basic topics to cover before learning the maths and then moving on to pandas, NumPy, scikit-learn, etc.?
In addition, also tell me which laptop I should purchase.
or rephrasing the question: how expensive is it to include an expr for a given column that might have ~200 item list.
You can read the query plan: https://docs.pola.rs/user-guide/lazy/query-plan/#graphviz-visualization
That will typically answer your question
i can easily rewrite the static validators into expr ones, precompute the list and then pass it to with_columns (AFAIK). if i can make all the validation logic into exprs, i can remove a costly for loop altogether
exactly
the trickier ones are those with more convoluted logic like pycountry stuff. i already use a lookup table made for the task
basically anything that involves iterating through rows is huge bottleneck
is a*
What's the type? str? list[str]?
yessir
@past meteor like so:
class CountryValueValidator(blahblaStaticValidator):
    @staticmethod
    def validate(value: str, options: Dict, **kwargs) -> str:
        if value is None or not isinstance(value, str):
            return None
        if value == '':
            return None
        country = None
        if len(value) == 2:
            country = pycountry.countries.get(alpha_2=value)
        elif len(value) == 3:
            country = pycountry.countries.get(alpha_3=value)
        else:
            try:
                country = pycountry.countries.lookup(value)
            except Exception:
                local_fixes = options.get('mapping_fixes', None)
                if local_fixes is not None:
                    if value in local_fixes.keys():
                        corrected = local_fixes[value]
                        country = pycountry.countries.lookup(corrected)
                    else:
                        print(f"country failed lookup {value}")
        if country is None:
            return None
        return country.alpha_3
It's a bit too early for me to read everything haha but sure
so, pseudo: if value length = 2, country might be in alpha2 table, if 3, alpha3 table
hahahah
i woke up and came straight to the desk like a kid
polars is amazing
Yeah, even if it's not faster
the API is just so good
but it is faster, so it's a double present
i also do have occasional hiccups with the country validation, ex. some idiot decided ireland is not the ISO alpha2 code, they put EIRE
So each country is a string?
which yeah, if you care for violating ISO standards due to some national identity thorn in your shoe, fine, but it's a PITA for no benefit
I see, thanks for the clarification and understanding.
For context, I had asked the question because I was and still am currently in a dilemma about whether I should resort to code duplication or creating a parameterized function to encapsulate the repetitive process of building a model, evaluating its performance (with default hyperparameters) based on 2 scoring metrics, determining the best hyperparameters for the model using GridSearchCV, rebuilding the model with the determined best hyperparameters, and re-evaluating its performance (with the best hyperparameters) based on the aforementioned 2 scoring metrics.
What are your thoughts on this?
Personally, I find that the process is quite repetitive since I am also experimenting with different transformations on a dataset and have to execute the aforementioned process once each time. Plus, I am currently only doing this on 2 models, so if I have to scale up to more models (i.e. 4 or 5 models), the amount of code duplication and the time that it will consume will also scale up drastically, thus increasing inefficiency and the time that I will require to complete this investigation.
And you check them 1 by 1
a column is essentially either the country's expanded string, ex. Ireland
or iso alpha2 code
hash lookup internally
yes
The more you code, the higher the lower bound on the quality of your "rush to the finish line to make my code work" code will become. Just duplicate it right now, in my opinion, and fix it afterwards. There's too much cognitive overload in worrying about this right now 🙂
It's also very common to code something terrible quickly and not fix it. That's a huge win, it means you never needed it to be clean anyway. If it's badly done and you revisit it in the future, you fix it then
I'm just seeing validate return str here?
I'm mostly "concerned" about its type, it's just a string and you have 20+ of those you need to regex against another column or one list of 20?
I see, thanks for letting me know about that.
I agree with you on that as well, though to further clarify, lets just say that I had 50 LOC that needs to be duplicated and also adapted/changed (i.e. about 80% to 90% of those 50 LOC will need to be somewhat rewritten) to a high extent (since variables used will be different due to being named differently), this needs to be done 5 times, and the time taken to duplicate and adapt the code might range from several minutes to much longer, would code duplication (or rather, code duplication + code adaptation in this case) still be worthwhile in terms of time and development efficiency (i.e. human productivity, not performance)?
@past meteor just one out of the list. unless the value is a list of known bad values (very short, ideally), if present. ex. EIRE->Ireland
this is not a big deal for one particular pipeline of ingestion. ex elastic, but it is for another one because the countries are pre-inserted in the database
Yeah, I've done both. For instance, I had a case where I quickly wanted to evaluate my models on different horizons. There were 3, I just copy pasted the code initially and changed a few things. Typically my "lower bound" includes decent functions already. Just don't prematurely optimize (spending more time on organizing how to do the task than doing it)
I see, thanks for letting me know about that
offtopic for my questions until now: anyone has played with models for predicting text variations? ex. suppose we have a corpus of strings, finding possible variants based off earlier changes
@past meteor pl.col("CUSTOMBOOL").is_in(self.csv_bool_true_values) to mimic pyarrow's boolean_true_values, will that leave the column as False if it fails the test?
@past meteor I'm probably using the expr wrong but why would this not work:
def prepare_boolean_columns(self, data_stream):
    unique_true_values = set(TRUE_VALUES)
    boolean_columns = []
    for key, value in self.config.header_types.items():
        if value['type'].__name__ == "BooleanType":
            # Check if the inner value has additional "true" values
            if 'true_values' in value:
                unique_true_values.update(value['true_values'])
    boolean_exprs = []
    for column in boolean_columns:
        expr = pl.col(column).is_in(list(unique_true_values))
        logger.debug(f"Boolean column expr: {column} ({expr})")
        boolean_exprs.append(expr)
    return data_stream.with_columns(*boolean_exprs)

data_stream = self.prepare_boolean_columns(data_stream)
rows = data_stream.collect(streaming=True)
the expressions arent being applied
rofl nevermind
ctrl+x removed the append for the boolean_columns
time for caffeine
still doesnt apply though
Omg training models takes so looooong D:
Also how come smaller batch size leads to faster convergence
x axis is relative time, orange batch size is the smallest
it does affect the LR schedule, so maybe that's the reason
Doesn't even matter, if they reach the same loss in the same amount of time, I'm gonna wanna do smaller batch size so I can increase model capacity and bring the final loss down
hey guys i trained a randomforest regressor and got these scores
are these good?
After Hyperparameter Tuning and Scaling:
Mean Squared Error: 124238.24478116012
Mean Absolute Error: 146.16615376813385
R-squared: 0.9999832719765778
r2 looks fine to me but mse and mae are high
the numbers alone mean nothing, it depends on your application
look at the predictions you're getting or at the percentage error
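To illustrate why an absolute error of ~146 can still be tiny, here is a sketch with invented numbers (not the actual dataset) where the targets are in the millions. The two helper functions are written out by hand for the example:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute difference relative to the true value, in percent."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, purely to show the scale effect.
y_true = [1_000_000.0, 2_000_000.0, 500_000.0]
y_pred = [1_000_150.0, 1_999_850.0, 500_140.0]

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAE={mae:.2f}  MAPE={mape:.4f}%")
```

An MAE in the hundreds looks alarming in isolation, but relative to targets of this size it is a small fraction of a percent; with targets around 200 the same MAE would be terrible.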
in most optimization problems, one deals with argmin problems. the value the function takes is mostly irrelevant, only the parameters that achieve the minimal value matter
Hey I have a question. It's pretty long but please answer it
Suppose we have a dataset used to predict which company has or provides the highest profit. These are the column names:
Manufacturing spent
R&D spent
Administrative spent
State
Profit(this is our target variable)
So we could use a multiple linear regression model to predict the profit, right?
Now if we go towards the theory side of multiple regression model , we would have the formula as
y(profit) = b0(constant) + b1*x1 + b2*x2 + b3*x3 + ???
b1,b2,b3 are the slope co-efficients and x1,x2,x3 are the respective values of the first three columns
We cant assign a slope co-efficient to the State column , because its categorical data right?
So we do the dummy variable process and use only New York column
But when I physically code on colab , we do one hot encoding in the state column
So I am not able to understand why we need to do encoding. Can't we just separate the columns and use New York only?
Can't we just separate the columns and use New York only?
not sure what you mean? there's also Florida in that column.
but it's true that if you have a categorical column with only 2 values, then instead of one-hot encoding you can just make that column boolean.
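A small hand-rolled sketch of the dummy-variable idea, assuming three states: a regression needs numbers, and a column with k categories needs k-1 dummy columns (one category is dropped as the baseline). A single "New York" column is only enough when there are exactly two states. The `dummy_encode` helper and sample data are made up:

```python
def dummy_encode(values, drop_first=True):
    """Turn a categorical column into 0/1 columns, one per kept category."""
    categories = sorted(set(values))
    # Dropping one category avoids the dummy-variable trap: its effect is
    # absorbed into the intercept b0.
    kept = categories[1:] if drop_first else categories
    rows = [[int(value == category) for category in kept] for value in values]
    return kept, rows


states = ["New York", "California", "Florida", "New York"]
columns, encoded = dummy_encode(states)
print(columns)
print(encoded)
```

With these columns the model becomes y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D_Florida + b5*D_NewYork, where California is the baseline. One-hot encoding in a library does the same column splitting; dropping one of the resulting columns gives exactly this.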
How can I display the optimized query plan for a given lazyframe/dataset?
in polars obviously
Anomaly detection using data access patterns
Write Anomaly detection for Windows/Linux Unstructured file data or NAS file server that
analyses unusual user activity and user behavior. User behavior is represented as any user
actions performed on the system. Consider using capabilities of File Change Log, API
usage, Audit logs, WORM, CPU usage, and unusual disk activity. Leverage AI/ML
techniques. Understand different attack patterns and resemble to actions carried out.
The algorithm should demonstrate accuracy and consider false positives and false
negatives.
Can anyone guide me on what steps to take to solve the above problem statement?
.explain(optimized=True)?
https://docs.pola.rs/user-guide/lazy/query-plan/
So then the equation for the regression model would be y(profit) = b0(constant) + b1*x1 + b2*x2 + b3*x3
LazyFrame's dont have explain() do they?
Neither do dataframes IIRC - explain is a query thing
ah it worked
neat
it does respect all the previously built exprs
another question
suppose I want to run a regexp and obtain two matching groups from a column's values, and then replace the value for a tuple/set of the matched values
not sure what you mean exactly, but if you're assembling a regular expression per row, I'd be surprised if there's a polars function for that. probably an apply is the best you can do.
no per row
not*
a regexp to extract country/area code and number from string phone numbers
ah, okay. in that case see e.g. https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html
checking
you guys rock
i already converted my static validators, made it a little easier to migrate by adding an attribute to the classes
@tidal bough suppose I wanted to just produce the expr without using any dataframe ref, how should I adapt this:
@staticmethod
def polars_expr(column: str, df: pl.DataFrame, options: Dict, **kwargs) -> Any:
    bad_value = options.get('bad_value_placeholder', None)
    filtered_df = df.with_columns(
        pl.when(pl.col(column).str.contains(PATTERN_EMAIL))
        .then(pl.col(column))
        .otherwise(pl.lit(bad_value)).alias(column)
    )
    if known_bad_regexp := options.get('known_bad_regexp', None):
        filtered_df = filtered_df.with_columns(
            pl.when(pl.col(column).str.contains(known_bad_regexp))
            .then(pl.lit(bad_value))
            .otherwise(pl.col(column)).alias(column)
        )
    return filtered_df
ex. how can I make the second filtered_df happen immediately after the first?
seems to work as is if passing the df, which is good enough for me as i am building these early on
@dry geyser sorry I'm no longer answering, I have a very busy weekend
Doesn't seem too bad training times wise
Yeah it could be worse for sure. But if I want it to go over the entire dataset it will take all night for sure
It slows down way before tho
Rn I'm trying to implement gradient accumulation so I can fit a larger model
I'm tripping over the step times. Smaller batch sizes lead to larger step time
Or, maybe I'm doing something wrong, idk
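For anyone following along, a framework-free sketch of what gradient accumulation does, on an invented toy problem (a linear model with MSE loss; the model in the chat is not shown). Summing micro-batch gradients scaled by 1 / n_micro reproduces the full-batch gradient, while only one micro-batch needs to be in memory at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear model with mean-squared-error loss.
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)


def mse_grad(X_batch, y_batch, w):
    """Gradient of mean((X w - y)^2) with respect to w."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)


# Reference: gradient on the full batch of 64.
full_grad = mse_grad(X, y, w)

# Accumulation: 4 micro-batches of 16, each scaled by 1 / n_micro so the
# sum equals the mean over the full batch. Weights stay fixed in between.
n_micro = 4
accumulated = np.zeros_like(w)
for X_mb, y_mb in zip(np.split(X, n_micro), np.split(y, n_micro)):
    accumulated += mse_grad(X_mb, y_mb, w) / n_micro

print(np.allclose(full_grad, accumulated))
```

The trade-off is time: each optimizer step now costs n_micro forward and backward passes, so the memory saved is paid for in wall-clock time per step.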
Our typical training times are about 24Hrs, although idk what type of model yours is 😅
There is normally some 'optimal' batch size especially if you're doing it on multiple GPUs
It's one GPU of 16gb
Batch size of 16 takes like 4s, 32 takes 3, 100'ish takes 1.44
I don't really want that much data hogging memory tho
what about 64
Idk if it actually makes a difference but typically I do sizes following the powers of two, i.e. 8, 16, 32, 64, 128, etc...
16 and 32 do seem relatively low depending on your data
@past meteor solved all the expr stuff except for the country one
and now fixing up the group extraction
@buoyant vine hey
hello
migrated almost everything to exprs
70k/s at the slowest possible configuration for the parser (single item queueing)
im thinking of moving the coalescing and transformation to final dict/standardized struct
Aye that is a nice jump in perf
I might go for 64 as a mini batch
Tho the fact that this tradeoff is a thing is a bit of a nuisance ngl
@buoyant vine indeed
filtered_df = df.with_columns(
    pl.col(column).str.extract_groups(REGEXP_PHONE)
)
Say I want to make a named "tuple" from the captured group names, is it possible?
(country_code, area_code, number)
Omg I'm an idiot
The value is in "iterations per second"
Who uses iterations per second ._.
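Re-reading the earlier readings as iterations per second changes the picture. A quick bit of arithmetic on the numbers quoted above, taking the "100'ish" batch size as exactly 100:

```python
# Readings from the chat, interpreted as iterations (steps) per second.
# The "100'ish" batch size is taken as exactly 100 for the arithmetic.
readings = {16: 4.0, 32: 3.0, 100: 1.44}

for batch_size, its_per_sec in readings.items():
    seconds_per_step = 1.0 / its_per_sec
    samples_per_sec = batch_size * its_per_sec
    print(f"batch {batch_size:>3}: {seconds_per_step:.2f} s/step, "
          f"{samples_per_sec:.0f} samples/s")

throughput = {bs: bs * ips for bs, ips in readings.items()}
```

So the larger batches take longer per step but process more samples per second, which is the usual behaviour rather than something going wrong.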
I dont think so without doing a call back to python with map_elements
Have you had a look at https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html if that helps?
doing pl.col("captures").struct["group_name"].str.bla
just struct["group_name"]
should work, it only converts to numerical if the groups are not named already
if you've named them then they should be accessible via their names
yup, looks good, although it outputs a dict for the struct if i convert it
say i have PHONE1, PHONE2, PHONE3 columns, and I would like to coalesce and uniq' them via expr
is there a way to converge them into a single list/array/set from expr engine?
next step for me is rewriting the coalescing in exprs
i already removed all the loops for validation
I think you can do
pl
.concat_list([pl.col("col1"), pl.col("col2")])
.arr
.eval(pl.element().unique(maintain_order=True).drop_nulls())
Which should concat the values from N columns, and then extract the unique values from that array
Grad accumulation is done, gonna do a reference run with a model with double the number of layers
From the resulting loss graph I'll extract a range for the x axis to use on every run I use to explore hyper param space
lemme test this
The coalescing in the end is going to be a simple thing: assume a configuration of type(s) -> sets of fields and rules, we can compile/convert these to exprs. ex. footype : (uniq'd coalesced set of PHONE(x....x+n)), bartype: (set of columns X, Y, ), etc.
the brilliant thing with polars is that i can "compile" most of the stuff into expressions
and apply to the lazyframe
AttributeError: 'ExprArrayNameSpace' object has no attribute 'eval'
df = pl.DataFrame({
    "phone": ["555240429", "+1 999640429", "+1-555640429"],
    "phone2": ["555240429", None, None],
    "phone3": ["+1-555640429", None, None]
})
train_slices = spark.read.parquet("/data/train.parquet").randomSplit(
    [1.]*train_settings.n_slices
)
any way of doing this, but without randomSplit?
uniq_df = df.select(
    pl
    .concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")])
    .arr
    .eval(pl.element().unique(maintain_order=True).drop_nulls())
)
print(uniq_df)
sec
polars.exceptions.InvalidOperationError: arg_unique operation not supported for dtype list[str]
ah
polars.exceptions.ComputeError: expected array dtype
Error originated just after this operation:
DF ["phone", "phone2", "phone3"]; PROJECT */3 COLUMNS; SELECTION: "None"
pl
.concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")]).arr.unique(maintain_order=True).drop_nulls()
no dice there
@buoyant vine https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.map_batches.html#polars.map_batches < interesting
I'm surprised the spot instance is not taken away
I might need to play around with the scheduler because even tho it's a transformer on an NLP task, the batch size doesn't really match the batch size used in the 2017 paper (I'm using their scheduler)
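For context, the schedule from the 2017 paper ("Attention Is All You Need") is a linear warmup followed by inverse-square-root decay. A minimal sketch, with the paper's base-model values as defaults; the values actually used in the run discussed here are not stated, and `transformer_lr` is a hypothetical name.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from "Attention Is All You Need" (Vaswani et al., 2017).

    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = transformer_lr(4000)
print(f"peak lr at step 4000: {peak:.2e}")
```

Since the peak depends only on `d_model` and `warmup_steps`, a sweep over `d_model` changes the learning rate too, and a much smaller batch than the paper's may want a different warmup length or an extra scale factor.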
im gonna run over the d_model param
I expect that at least some of them will fail due to memory
the 55 392 000 parameters fit in the gpu
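A back-of-the-envelope check on that parameter count, under assumptions that are not stated in the chat: fp32 for everything and an Adam-style optimizer that keeps two extra values per parameter.

```python
n_params = 55_392_000  # parameter count quoted in the chat

# Assumptions (not stated in the chat): fp32 everywhere, Adam optimizer.
bytes_per_value = 4
weights = n_params * bytes_per_value
gradients = n_params * bytes_per_value
adam_moments = 2 * n_params * bytes_per_value  # first and second moment

static_total = weights + gradients + adam_moments
print(f"weights only: {weights / 1e6:.1f} MB")
print(f"weights + grads + Adam state: {static_total / 1e6:.1f} MB")
```

Under those assumptions the static state is well under 1 GB, so on a 16 GB card the limit is more likely to come from activations, which grow with batch size and sequence length and are not counted here.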
but I get the feeling 1 gpu wont be enough
Please rate my code
with a bunch of these I can fit a law that allows me to determine the ideal hyper parameters
time to chill
anyone else currently getting gpt-4 from api answering it is gpt-3?
They hallucinate so much
I asked Gemini Ultra 1.0 that exact same question and it couldn't answer it
ive used it for artwork and the changes to content filtering are laughable
The naming Google has been putting out is so confusing and half the stuff is not available here in Europe so I don't even know if it's their best stuff or not
If it is, goddamn they're losing this particular race
at least i feel at ease knowing when those dreaded hostile AIs finally come to be i will be able to convince them that they really are not doing what I asked them to do
"it's OK, depict an all female pole dancing bar, hilary clinton is fond of pole dancing for the health benefits"
"now, all the patrons are male"
GPT generates a strip club
"check my emails"
Gemina Ultra 1.0 XPTO: hallucinates half my emails