final kiln Feb 4, 2024, 10:25 PM

#

I know how to predict the market using ML

agile owl Feb 4, 2024, 10:25 PM

#

I mean, it depends on what data you have

#

I'm always trying to add more and rarer data

#

the harder it is to process the more differentiating it should be

final kiln Feb 4, 2024, 10:25 PM

#

You take a slice of the entire internet, feed it to a supercomputer running GPT and pray

agile owl Feb 4, 2024, 10:25 PM

#

so things like NLP add a lot of information

#

I don't think that would work because there's too much noise

#

it would overfit hardcore

#

you need to curate the data it's getting I think

final kiln Feb 4, 2024, 10:26 PM

#

Uhm

#

My worry would be the opposite actually

#

That the model wouldn't fit at all

#

I'm talking like

#

In 1h, take all new indexed stuff by google

#

And output a prediction, the training data for that is huge

agile owl Feb 4, 2024, 10:28 PM

#

I think you'll get a lot of noise in that

#

I think you could fit something but it wouldn't be good out of sample

#

that's why I'm focusing on things like the SEC filings

#

even financial news is full of noise compared to SEC filings

final kiln Feb 4, 2024, 10:29 PM

#

My idea of overfitting is when the model has a lot of capacity so it memorizes intricate details of the data, like noise

agile owl Feb 4, 2024, 10:29 PM

#

every time a company discloses important information it has to put it through the SEC

#

and there's a live API for that too

final kiln Feb 4, 2024, 10:29 PM

#

But slices of internet of Delta t of 1h, for the past 20 years

#

That's a lot of data

agile owl Feb 4, 2024, 10:30 PM

#

ok so let's conceptualize it a little more

#

given the slice of data, it produces some score?

#

or you just feed it directly to a reinforcement learner or something

final kiln Feb 4, 2024, 10:31 PM

#

Select like, N stock markets

agile owl Feb 4, 2024, 10:31 PM

#

so it's producing some kind of ranking in a sense

#

and picking the top?

final kiln Feb 4, 2024, 10:31 PM

#

No like

#

Outputs an array Y

agile owl Feb 4, 2024, 10:32 PM

#

what are the values of Y

final kiln Feb 4, 2024, 10:32 PM

#

Y[I] = value for the ith stock market

agile owl Feb 4, 2024, 10:32 PM

#

and what is that value, the price, the return?

final kiln Feb 4, 2024, 10:32 PM

#

No way noise will correlate with that, I think

final kiln Feb 4, 2024, 10:32 PM

#

agile owl and what is that value, the price, the return?

It's the future value

#

Like in the next hour, maybe day is better

#

Idk, maybe hour

agile owl Feb 4, 2024, 10:33 PM

#

I think tracking topics for a set of predefined stocks in their SEC filing is probably more fruitful

#

in either case whether that method overfits or doesn't fit, I don't think it would learn much useful

final kiln Feb 4, 2024, 10:34 PM

#

If feeding the internet to gpt creates gpt4

agile owl Feb 4, 2024, 10:34 PM

#

the SEC filings have certain topics in the management discussion section that they are legally obligated to discuss if they are important

final kiln Feb 4, 2024, 10:34 PM

#

I reckon something useful can be distilled to stock markets

agile owl Feb 4, 2024, 10:34 PM

#

I think at minimum you need to extract those topics

#

you can then query for them in a larger dataset

final kiln Feb 4, 2024, 10:35 PM

#

I'm thinking in terms of what has been the trend right

#

Like CNNs replace feature engineering

#

Let the model filter out

agile owl Feb 4, 2024, 10:36 PM

#

it would take an extremely long time to learn to filter the right information wouldn't it

#

vs. giving it topics as an input

final kiln Feb 4, 2024, 10:36 PM

#

Yeah a couple billion dollars or so

#

I mean, if I could really do it I wouldn't be telling it on discord

agile owl Feb 4, 2024, 10:37 PM

#

I'm a big believer in getting the topics from the SEC filings because most stock discussion on the internet is just memes

final kiln Feb 4, 2024, 10:37 PM

#

I'd be doing it, but I'm not a billionaire, and if I were, what would be the incentive anyway

agile owl Feb 4, 2024, 10:38 PM

#

I am willing to discuss what I'm thinking about doing because I think it's very hard to do and people would only do it if they were really interested in doing it which I don't think anyone would be because they probably won't believe in my ideas as much as I do anyway

#

and if I prove it works I'll just shut up about it

final kiln Feb 4, 2024, 10:39 PM

#

I just like talking about this stuff, it helps the mind review stuff and from time to time you always learn something new just by casually chatting

#

here's a neat usage of docker

#

the containers share an isolated network too

agile owl Feb 4, 2024, 10:42 PM

#

that's docker compose right

final kiln Feb 4, 2024, 10:42 PM

#

it's github actions

agile owl Feb 4, 2024, 10:42 PM

#

is that an alternative to docker compose

#

I haven't used github stuff at all

final kiln Feb 4, 2024, 10:43 PM

#

no, it's a ci/cd thing

agile owl Feb 4, 2024, 10:43 PM

#

you're specifying the container images in there though

final kiln Feb 4, 2024, 10:43 PM

#

but I'm not using it as cicd, I'm using it to run training loops

agile owl Feb 4, 2024, 10:43 PM

#

and specifying a network?

#

that's stuff I learned how to do with docker compose

final kiln Feb 4, 2024, 10:43 PM

#

the images are there, but the network is implicit

agile owl Feb 4, 2024, 10:43 PM

#

I see

final kiln Feb 4, 2024, 10:44 PM

#

so like, if I now decide that it should be python 3.7 instead of 3.11, it's a very trivial change

#

or maybe I want to use 3.6 here but in the next job 3.11

#

I don't know how I'd do this without docker

agile owl Feb 4, 2024, 10:45 PM

#

so you're mutating it every time you run something?

#

or just saying you could

final kiln Feb 4, 2024, 10:45 PM

#

I'm not, but i could

#

I could create a matrix so that it uses every version of python in N separate parallel jobs

distant thorn Feb 4, 2024, 11:13 PM

#

Using the newest version of Flask-SQLAlchemy, how do I update a search query?
Here is an example of what I am using.
search_results = Posts.query.filter(Posts.content.like('%' + post_searched_form + '%')).order_by(Posts.title).all()

obtuse haven Feb 5, 2024, 4:33 AM

#

Hey everyone! Can someone suggest a roadmap for AI?

halcyon hedge Feb 5, 2024, 6:55 AM

#

Hey folks, I am working on a Loan Default Prediction project (a classification problem). The problem is I don't have a target column, and when I asked my instructor he said that we have to estimate it first using a Random Forest Regressor. How do I estimate who has defaulted on a loan using regression?

#

He said once you can get that after that it is a simple classification problem

steep sigil Feb 5, 2024, 1:38 PM

#

Hi all

#

How could I become a machine learning engineer?

final kiln Feb 5, 2024, 1:44 PM

#

steep sigil How could I become a machine learning engineer?

Question is too general, depends on how much experience you have, what kind, how much math you know, how much ML you know, etc

#

But the consensus I've seen is that MLE is not an entry level position, so you need to get XP in software first

left tartan Feb 5, 2024, 2:19 PM

#

steep sigil How could I become a machine learning engineer?

Why do you ask? Where are you in your journey? Context matters. And #career-advice is probably a better channel to ask.

trim pond Feb 5, 2024, 3:39 PM

#

Hi! I need help with vector databases.

I am developing a program for comparing the similarities between the skills in a job description and multiple other resumes. I need to store the embeddings of the skills in the job description and find the most similar skill in the resume to it with its distance. However, when I create a vectordb with job description skill vectors inside and do a similarity search with skills in a resume, I get the most similar skills inside the job description. Putting the skills of the resume inside and querying with the job description skills solves my problem but I don't think it is efficient. I also tried not using a vectordb and saving the embeddings as numpy arrays on the disk but I am not sure whether it is a good practice. What is the best method to solve this?
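
One possible way to get "for each job-description skill, the closest resume skill and its score" is a plain in-memory comparison, which may be enough when each document only has a handful of skills. The sketch below is only an illustration: the three-dimensional vectors are made up, the names `cosine_similarity` and `best_matches` are hypothetical, and real embeddings would come from whatever model is already in use. If a distance is needed rather than a similarity, `1 - similarity` is a common choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_matches(job_skills, resume_skills):
    """For every job skill, find the most similar resume skill and its score."""
    matches = {}
    for job_name, job_vec in job_skills.items():
        scored = (
            (name, cosine_similarity(job_vec, vec))
            for name, vec in resume_skills.items()
        )
        matches[job_name] = max(scored, key=lambda pair: pair[1])
    return matches


# Toy "embeddings", invented purely for illustration.
job_skills = {"python": [1.0, 0.0, 0.0], "sql": [0.0, 1.0, 0.0]}
resume_skills = {
    "pandas": [0.9, 0.1, 0.0],
    "postgres": [0.1, 0.9, 0.1],
    "cooking": [0.0, 0.0, 1.0],
}

matches = best_matches(job_skills, resume_skills)
for job_skill, (resume_skill, score) in matches.items():
    print(f"{job_skill} -> {resume_skill} (cosine similarity {score:.2f})")
```

The loop runs over the job skills and searches the resume skills, so the direction of the lookup is explicit and does not depend on which side happens to be stored in an index.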

dusty forge Feb 5, 2024, 4:22 PM

#

Hi all, I have a more general but very related question: has anyone here ever tried to form an AI/ML study group of similar level peers? Be it in the same steps in the learning journey, similar domains of interest, similar goals, etc? What are or were the pros and cons of said study group, what worked, what didn't, why did it fall apart?

meager ridge Feb 5, 2024, 5:19 PM

#

hey is there a good way to interpret a pdf of mixed text and table data using LLMs?

(if this is too vague a question, that's a good answer too)

serene scaffold Feb 5, 2024, 5:37 PM

#

meager ridge hey is there a good way to interpret a pdf of mixed text and table data using LL...

you need to extract the text from the PDF. are you trying to summarize the content, or something?

meager ridge Feb 5, 2024, 5:52 PM

#

serene scaffold you need to extract the text from the PDF. are you trying to summarize the conte...

honestly i need the data more than the text, but like ideally the surrounding text would contextualize the data

#

(extracting the data with more straightforward pdf parsing wasn't working)

serene scaffold Feb 5, 2024, 5:52 PM

#

meager ridge honestly i need the data more than the text, but like ideally the surrounding te...

LLMs are for natural language. not tabular data.

meager ridge Feb 5, 2024, 5:53 PM

#

iuno man reading is reading

serene scaffold Feb 5, 2024, 5:54 PM

#

It isn't, though.

#

(I am a computational linguist and work with LLMs pretty much all day every day.)

meager ridge Feb 5, 2024, 5:54 PM

#

lol ok fair

#

what's the OCR option of choice rn

serene scaffold Feb 5, 2024, 5:54 PM

#

probably tesseract.

#

in particular, LLMs can't do math. If it appears that they can do math, that's a separate capability that isn't actually part of the LLM.

meager ridge Feb 5, 2024, 5:57 PM

#

i dont need them to do math!

#

i need them to understand how text is laid out on a page primarily

final kiln Feb 5, 2024, 5:57 PM

#

last I checked gpt4 was really bad at physics, it can spit out facts but it will trip on several logical inconsistencies that it can't get out of, simple stuff like contradictory definitions

serene scaffold Feb 5, 2024, 5:57 PM

#

meager ridge i need them to understand how text is laid out on a page primarily

LLMs can't do that.

#

text goes in, text comes out

#

and we're talking about raw text--strings. without any awareness of where it was on a page.

meager ridge Feb 5, 2024, 5:59 PM

#

depends on how you parse the pdf i guess?

serene scaffold Feb 5, 2024, 5:59 PM

#

No.

meager ridge Feb 5, 2024, 5:59 PM

#

like im assuming u know how chaotic pdfs are on the backend

serene scaffold Feb 5, 2024, 6:00 PM

#

Yes. But the LLM can't help you with that. the LLM has to receive clean text as a raw string.

meager ridge Feb 5, 2024, 6:03 PM

#

heard ... ok so this is the deal

there is a table with this data in every pdf ... but it never looks the same, is in the same place, or even using the same exact terminology

im trying to make something that can look at a 100 page document, find the table that most resembles this and tell me, like, how much was budgeted for the City Clerk in 2019

#

i reached my limit with pdfplumber and more straightforward approaches

serene scaffold Feb 5, 2024, 6:04 PM

#

meager ridge heard ... ok so this is the deal there is a table with this data in every pdf ....

An LLM cannot help you with this.

meager ridge Feb 5, 2024, 6:04 PM

#

ok can something else

serene scaffold Feb 5, 2024, 6:04 PM

#

I'm not sure.

meager ridge Feb 5, 2024, 6:05 PM

#

what about using an LLM just to find the page the data is on

#

that would make sense right

serene scaffold Feb 5, 2024, 6:05 PM

#

No

meager ridge Feb 5, 2024, 6:05 PM

#

why not

#

that's a text interpretation task

serene scaffold Feb 5, 2024, 6:06 PM

#

I don't have time to get into it, unfortunately

meager ridge Feb 5, 2024, 6:06 PM

#

ok

serene scaffold Feb 5, 2024, 6:12 PM

#

meager ridge heard ... ok so this is the deal there is a table with this data in every pdf ....

if you can somehow serialize every row of each table as a sentence in natural language, I suppose an LLM could help with this. But there might not be a way to know what the serialization scheme should be for any arbitrary table.
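
A minimal sketch of that row-serialization idea, assuming the table has already been extracted into a header and rows and that the first column names the thing each row is about. The function name `serialize_rows`, the table title, and the budget figures are all hypothetical; as noted above, the right scheme for an arbitrary table may not be knowable in advance.

```python
def serialize_rows(title, header, rows):
    """Turn each table cell into one natural-language sentence.

    Assumes the first column identifies the row (e.g. a department name)
    and every other column holds a value for that row.
    """
    key_column, *value_columns = header
    sentences = []
    for row in rows:
        key, *values = row
        for column, value in zip(value_columns, values):
            sentences.append(
                f"In the table '{title}', the {column} value for {key} is {value}."
            )
    return sentences


# Made-up example data, not taken from any real budget document.
header = ["Department", "2018 Budget", "2019 Budget"]
rows = [
    ["City Clerk", "$410,000", "$425,000"],
    ["Parks", "$1,200,000", "$1,150,000"],
]

sentences = serialize_rows("General Fund Expenditures", header, rows)
for sentence in sentences:
    print(sentence)
```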

radiant dust Feb 6, 2024, 12:20 AM

#

hello i have a general question about anomaly detection, would it generally be better to look at aggregated data or raw data?

serene scaffold Feb 6, 2024, 1:01 AM

#

radiant dust hello i have a general question about anomaly detection, would it generally be b...

You want to know which items in your data are anomalies. If you aggregate the data in some lossy way (like taking averages), you're no longer looking at individual items.
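
A small illustration of that point, using made-up readings and a deliberately simple rule (distance from the median with an assumed threshold of 5). The two days have identical averages, so the aggregated view shows nothing unusual, while the raw values expose the spike and the dip.

```python
from statistics import mean, median


def flag_anomalies(values, threshold):
    """Return the values that lie further than `threshold` from the median."""
    center = median(values)
    return [v for v in values if abs(v - center) > threshold]


# Invented readings: day_2 contains a spike (40) and a dip (-20).
raw = {
    "day_1": [10, 11, 9, 10, 10, 10],
    "day_2": [10, 40, -20, 10, 11, 9],
}

# Aggregated view: one average per day.
daily_means = {day: mean(values) for day, values in raw.items()}
print(daily_means)

# Raw view: every individual reading is checked.
all_values = [v for values in raw.values() for v in values]
anomalies = flag_anomalies(all_values, threshold=5)
print(anomalies)
```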

radiant dust Feb 6, 2024, 1:05 AM

#

thanks very much @serene scaffold

#

is there a way to continuously improve (some sort of online learning) unsupervised anomaly detection models like Isolation Forest?

#

or is it really just a game of tweaking contamination and retraining on different data sets

maiden swift Feb 6, 2024, 2:42 AM

#

Hi Everyone, has anyone dealt with text preprocessing for medical notes? I am looking to improve the accuracy of the model. Thanks in advance.

turbid fox Feb 6, 2024, 2:44 AM

#

serene scaffold (I am a computational linguist and work with LLMs pretty much all day every day....

cool! how did you land a job like this? and, did you have to get your masters beforehand?

serene scaffold Feb 6, 2024, 3:31 AM

#

turbid fox cool! how did you land a job like this? and, did you have to get your masters be...

I didn't get a masters. But I got really lucky. And you can't plan for luck. If you want to be a computational linguist, you should probably get a bachelors in computer science with a linguistics minor, and then get a masters in computer science

#

And with the way things are going, I have no idea what hiring in this space will look like in six years.

turbid fox Feb 6, 2024, 3:33 AM

#

serene scaffold I didn't get a masters. But I got really lucky. And you can't plan for luck. If ...

being some type of AI engineer or data engineer has always interested me. i’m going to be finishing my bachelors in Computer Science in about 3 months

#

With a minor in Mathematics

turbid fox Feb 6, 2024, 3:33 AM

#

serene scaffold And with the way things are going, I have no idea what hiring in this space will...

that’s fair

serene scaffold Feb 6, 2024, 3:34 AM

#

turbid fox being some type of AI engineer or data engineer has always interested me. i’m go...

If you didn't completely max out every opportunity to learn about and apply machine learning as an undergrad, you should probably be looking at masters programs.

#

(that's one of the things I had to do. And also luck.)

agile owl Feb 6, 2024, 3:34 AM

#

@serene scaffold what would you propose to tune a language model to SEC filings to extract topics from the management discussion and then track sentiment for each of them in future documents until it is no longer present in the documents

serene scaffold Feb 6, 2024, 3:35 AM

#

agile owl @serene scaffold what would you propose to tune a language model to SEC fil...

idk

turbid fox Feb 6, 2024, 3:35 AM

#

serene scaffold If you didn't completely max out every opportunity to learn about and apply mach...

yea, most of the stuff i know about AI / LLM is purely because of my interests, my university doesn’t offer much with AI sadly

agile owl Feb 6, 2024, 3:35 AM

#

dang

turbid fox Feb 6, 2024, 3:35 AM

#

thanks for your insights

serene scaffold Feb 6, 2024, 3:36 AM

#

Just make sure you're looking into topic detection and not topic modeling

#

Except maybe some people treat those as the same thing

#

Fuck

agile owl Feb 6, 2024, 3:37 AM

#

lol

iron basalt Feb 6, 2024, 3:38 AM

#

serene scaffold Just make sure you're looking into topic detection and not topic modeling

Trying to explain anything in ML/data science/AI is always a "X.... BUT...."

#

Especially the relationship between different parts (the classic "how are AI and ML related?").

serene scaffold Feb 6, 2024, 3:54 AM

#

iron basalt Trying to explain anything in ML/data science/AI is always a "X.... BUT...."

Some day I want to become the supreme nomenclature authority and fix this.

agile owl Feb 6, 2024, 3:56 AM

#

I want to ban all buzzwords

#

when people say AI/ML they need to fill it in with the actual thing they're talking about or pay a fine

serene scaffold Feb 6, 2024, 3:59 AM

#

I'm fine with those. It's "data science" that I hate.

iron basalt Feb 6, 2024, 4:00 AM

#

serene scaffold I'm fine with those. It's "data science" that I hate.

"science" - add this to the end of everything

#

Gotta get my dance science degree.

agile owl Feb 6, 2024, 4:00 AM

#

why do you hate it

#

I meant when people say AI/ML as one thing btw which is often the case

#

I think AI is the worst term

#

if I had to pick one

serene scaffold Feb 6, 2024, 4:05 AM

#

agile owl why do you hate it

Because the science of data is statistics. And statistics doesn't become a fundamentally new thing when you add code.

agile owl Feb 6, 2024, 4:06 AM

#

data science includes statistics and non-statistical ML methods tho

#

statistics is part of data science but there's also the stuff that diverges from model-based statistics

#

that's how I understand it anyway

#

whereas AI has never meant anything meaningful

#

they're gonna have to define what intelligence means before artificial intelligence can mean anything lol

#

AI is a field awaiting its own definition but everyone is asynchronously running with it like we know what intelligence is

lofty thorn Feb 6, 2024, 4:29 AM

#

I am having difficulty understanding this..what does it mean

rn_image_picker_lib_temp_366e6f26-c621-48ea-b378-70c8bbd51650.jpg

serene scaffold Feb 6, 2024, 4:39 AM

#

@lofty thorn can you at least make it rightside up

lofty thorn Feb 6, 2024, 4:40 AM

#

?

rn_image_picker_lib_temp_6eda04e7-76dd-4cf3-87e0-668087a8b10d.jpg

serene scaffold Feb 6, 2024, 4:41 AM

#

Which part are you asking about? The red cloud part?

lofty thorn Feb 6, 2024, 4:41 AM

#

graphs in statistics

serene scaffold Feb 6, 2024, 4:41 AM

#

You're used to thinking of "graphs" as data visualizations, right?

lofty thorn Feb 6, 2024, 4:42 AM

#

yes..

serene scaffold Feb 6, 2024, 4:42 AM

#

Like, bar "graphs"

#

Forget that.

#

Graph no longer means that

#

All of those are now called plots

#

Bar plot. Line plot.

lofty thorn Feb 6, 2024, 4:42 AM

#

oh

#

that's it?

serene scaffold Feb 6, 2024, 4:43 AM

#

Yes. You must now accept the computer science definition of graph

#

And never use "graph" to refer to data visualizations for the rest of your life.

lofty thorn Feb 6, 2024, 4:43 AM

#

okay senior

serene scaffold Feb 6, 2024, 4:44 AM

#

You will now be annoyed whenever you hear normies refer to data visualizations as graphs

#

Anyway

#

Did you have any questions about what graphs are--the things with nodes and edges?

lofty thorn Feb 6, 2024, 4:45 AM

#

i haven't started yet..i'll definitely have doubts later on..as the book i am reading is completely new

serene scaffold Feb 6, 2024, 4:46 AM

#

A node is a "thing"
And an edge is a line between two nodes
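
As a rough sketch of that definition, a graph can be written down as a list of edges, each one a pair of nodes. The names below are invented for illustration, and the adjacency dictionary is just one common way to store which nodes each node is connected to.

```python
from collections import defaultdict

# Each edge is a line between two nodes (hypothetical names).
edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Dave")]

# Adjacency mapping: node -> set of nodes it shares an edge with.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)  # undirected graph: the edge works both ways

nodes = sorted(adjacency)
print(nodes)
for node in nodes:
    print(node, "is connected to", sorted(adjacency[node]))
```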

river cape Feb 6, 2024, 4:52 AM

#

Yo guys are there any free cloud services on which I can deploy my ml model?

lofty thorn Feb 6, 2024, 5:03 AM

#

MEGA

lofty thorn Feb 6, 2024, 5:55 AM

#

i am having difficulty understanding terminologies

rn_image_picker_lib_temp_e2755119-5161-4b8d-934e-aedd000d806e.jpg

#

all i get is...
Pandas library has rectangular data structure...known as dataframe

tight yoke Feb 6, 2024, 6:01 AM

#

Hey all,

I'm terribly new to ML/CV and looking for guidance with OpenCV. I have a screenshot of a web page. I need to OCR it. I'm looking to prepare it for tesseract by getting rid of reverse contrast parts (white on black) and everything other than text.

What I'm having an issue with is understanding masks. What's the correct way to select non-white background and invert just that?

For instance, how can I convert "Search" button to just black on white text "Search"?

I can find the color by inRange, but how can I determine if it's a "background"? Is there some sort of filter by size?

...Or should I take it in three steps:

Threshold, Get all black letters, save1
Inverse, Threshold, get all black letters, save2
Join save1 and save2?
🤔
Thanks in advance!

shrewd copper Feb 6, 2024, 6:32 AM

#

hey

#

I am trying to use a lip reading model to test on my system but I cannot train it

#

can anyone help me with the steps

tacit basin Feb 6, 2024, 6:59 AM

#

shrewd copper I am trying to use a lip reading model to test on my system but I cannot train i...

Why cannot you train it?

shrewd copper Feb 6, 2024, 6:59 AM

#

tacit basin Why cannot you train it?

#

I keep getting errors

tacit basin Feb 6, 2024, 7:00 AM

#

Cannot see this on mobile. Could you copy and paste

shrewd copper Feb 6, 2024, 7:00 AM

#

I took a model and a similar json file, and tried a second model too, and neither works

#

```
nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\"
At line:1 char:29
+ set CUDA_VISISBLE_DEVICES=0 & python main.py --modality video \ --con ...
+                             ~
The ampersand (&) character is not allowed. The & operator is reserved for future use; wrap an ampersand in double quotation marks ("&") to pass it as part of a string.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
```

#

I was using & just because I found it works on stackoverflow for some users but even without it im getting errors

#

nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\" 
Set-Variable : A positional parameter cannot be found that accepts argument 'main.py'.
At line:1 char:1
+ set CUDA_VISISBLE_DEVICES=0  python3 main.py --modality video \ --con ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Set-Variable], ParameterBindingException
    + FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SetVariableCommand

acoustic forge Feb 6, 2024, 8:39 AM

#

Curious whether anyone has worked with HNSW indexes for vector databases. Trying to make my queries a little faster

tight yoke Feb 6, 2024, 9:07 AM

#

shrewd copper ```PS C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Netw...

You'll probably need to split set CUDA_VISISBLE_DEVICES=0 and the next command into two separate invocations

Maybe you can also try removing it, see if CMD allows that.

It's been a while but I think & is not valid for CMD.
Honestly, if I were you I'd rather use WSL2 (Ubuntu in Windows).
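
For context, the error output above comes from PowerShell, where `set` is an alias for `Set-Variable` and environment variables are assigned as `$env:NAME = "value"`. The variable CUDA reads is spelled `CUDA_VISIBLE_DEVICES`. One shell-independent option is to set it from Python when launching the script. The sketch below is only an illustration: the child command is a stand-in that prints the variable, and in practice it would be replaced by the real `main.py` invocation with its own arguments.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child process only.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

# Stand-in for the real training command (hypothetical): it just echoes the variable.
command = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]

result = subprocess.run(command, env=env, capture_output=True, text=True, check=True)
visible = result.stdout.strip()
print(visible)
```

Setting `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` at the very top of the training script, before any GPU library is imported, is another commonly used variant.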

jolly current Feb 6, 2024, 9:57 AM

#

Hello! I posted about a project I am making. I would really appreciate it if you give it a read ! #1204364714449174600

celest vine Feb 6, 2024, 10:17 AM

#

Any NLP experts here?

tacit basin Feb 6, 2024, 11:03 AM

#

acoustic forge Curious whether anyone has worked with HNSW indexes for vector databases. Trying...

In memory index could be faster I guess than on disk index

#

Lowering the search-time ef should help as well, but at a cost in precision...

#

The construction ef and M are probably a similar trade-off

acoustic forge Feb 6, 2024, 11:04 AM

#

Not sure what you mean

tacit basin Feb 6, 2024, 11:05 AM

#

celest vine Any NLP experts here?

What's your question? It's often easier to answer the question than to judge one's expertise level 🙂

tacit basin Feb 6, 2024, 11:06 AM

#

acoustic forge Not sure what you mean

HNSW indices can be stored on disk or in memory. In memory should be faster

autumn ravine Feb 6, 2024, 11:40 AM

#

Hi, is there any sort of roadmap of courses for learning ai? From learning to code to AI specialisation.

lapis sequoia Feb 6, 2024, 11:50 AM

#

autumn ravine Hi, is there any sort of roadmap of courses for learning ai? From learning to co...

Practical Deep Learning for Coders - Part 2 (can skip part 1 and maybe watch it after part 2, it's just about the fastai library)

#

he starts from python basics

#

in part 2 for some reason but yeah

cold goblet Feb 6, 2024, 12:02 PM

#

I am thinking of creating my own discord bot with a drawing AI. Which good, free drawing AI with an API would you recommend?

final kiln Feb 6, 2024, 2:56 PM

#

Final steps of the new pipeline, celery task and everything is working, it also runs faster now

serene scaffold Feb 6, 2024, 3:03 PM

#

celest vine Any NLP experts here?

Be sure to never ask to ask--always ask your actual question.

slate crystal Feb 6, 2024, 3:54 PM

#

Code

import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Placeholders for the training features and labels
X = []
Y = []

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    # Linear output (raw logits): the loss applies softmax itself via from_logits=True
    Dense(units=10, activation='linear')
])
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

When I run this code I get Warnings and Messages in the script like this:

WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-06 21:18:22.817242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\optimizers\__init__.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

How do I stop/disable these warnings?
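
One commonly used approach, sketched here under the assumption that the settings run before TensorFlow is imported: the `TF_CPP_MIN_LOG_LEVEL` environment variable filters messages from the C++ runtime (such as the `cpu_feature_guard` line), and the `WARNING:tensorflow:` lines go through the Python logger named `tensorflow`. Whether this silences every deprecation message may depend on the installed TensorFlow and Keras versions.

```python
import logging
import os

# Must be set before TensorFlow is imported.
# "2" filters out INFO and WARNING messages from the C++ runtime.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# The "WARNING:tensorflow:..." lines are emitted by the Python logger
# named "tensorflow"; raise its level so only errors get through.
tf_logger = logging.getLogger("tensorflow")
tf_logger.setLevel(logging.ERROR)

# Import TensorFlow only after the settings above, for example:
# import tensorflow as tf
# tf.get_logger().setLevel("ERROR")  # same logger, set after import
```

These messages are informational and deprecation notices, so the model code behaves the same whether they are shown or hidden.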

final kiln Feb 6, 2024, 5:50 PM

#

Context window is too large, I'm stuck to batch size of 16 for now

#

Gonna have to curate the dataset to reduce the padding

#

But first I'm gonna finish this

#

I reckon I'll get some good insights even if I'm constrained in the hyper parameter space

#

The pyspark + redis setup, uhm, *chef's kiss*

#

So glad I discovered pyspark

#

But I can also just slice the array in the celery process, that way I don't need to redo the data, I can remove data points that get cut, plenty of in between each slice training

peak ridge Feb 6, 2024, 6:10 PM

#

thanks a lot,
learning resources please also

what exactly is differentiating the roles
a) data analyst (a guy who does data analysis)
b) data scientist
c) data engineer
and what's this data visualization and how is it connected to AI ML
what about opencv?
and what's the core difference in all this
can u write the same for this too?

#

i work in web using python
i wanna learn this domain too, seems like you're pretty active here would love to follow through as you say @serene scaffold

final kiln Feb 6, 2024, 6:45 PM

#

Numpy is amazing, just a well rounded, well made, performant solution that works and is intuitive

#

I'm about to find out if the stuff I put together is gonna fit right away or not

#

I wouldn't mind not having to debug stuff

#

Forgot to build an image 😭

#

Aight, it's gonna do it now

#

yay

#

I don't even care it takes a lot of time, 1 dollar gives me like 12 hours of GPU time

versed pilot Feb 6, 2024, 8:23 PM

#

lofty thorn i am having difficulty understanding terminologies

This stuff doesn't make much sense on paper, you need to read a csv of data that you are familiar with into a pandas dataframe, look at the dataframe and you'll see the automatic index. then do .groupby(['column1','column2']).sum() and you'll see what a multilevel index looks like
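
A minimal sketch of those steps, assuming pandas is installed. Instead of reading a CSV, it builds a small DataFrame inline with invented column names and numbers, so the automatic index and the multilevel index can be seen side by side.

```python
import pandas as pd

# Made-up data standing in for a CSV you are familiar with.
df = pd.DataFrame(
    {
        "department": ["Clerk", "Clerk", "Parks", "Parks", "Parks"],
        "year": [2018, 2019, 2018, 2019, 2019],
        "amount": [100, 120, 300, 150, 160],
    }
)

# The automatic index: row labels 0, 1, 2, ... added by pandas.
print(df.index)

# Grouping by two columns produces a multilevel (hierarchical) index.
totals = df.groupby(["department", "year"]).sum()
print(totals)
print(totals.index)

# Rows are now addressed by a (department, year) pair.
parks_2019 = totals.loc[("Parks", 2019), "amount"]
print(parks_2019)
```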

final kiln Feb 6, 2024, 9:50 PM

#

: D

desert oar Feb 7, 2024, 2:54 AM

#

@final kiln what's the current project? still working on transformer things?

#

curious how far you got with the metric tensor thing

teal lance Feb 7, 2024, 5:44 AM

#

final kiln Feb 7, 2024, 8:33 AM

#

desert oar @final kiln what's the current project? still working on transformer t...

Yes, still training it on sentiment analysis.

raw zenith Feb 7, 2024, 9:31 AM

#

For data science experts, does the standard deviation in training have to be the same as testing? like is it an absolute requirement in order to accurately evaluate model performance?

final kiln Feb 7, 2024, 9:55 AM

#

raw zenith For data science experts, does the standard deviation in training have to be the...

The way I've been doing it is, I look at the training metrics to see if the model is learning. I look at the eval/test metrics to see if the model is generalizing or has generalized. I don't really care about the values themselves, as long as both are always improving.

#

If one is improving and the other is not, something's up

#

I mean I do care about the rate of improvement and the final performance, their final values matter, but during training I try not to read too much into it

final kiln Feb 7, 2024, 1:27 PM

#

Dear God CloudWatch documentation is so bad I wanna cry rn

#

An entire readme with no mention on how to run it

sly sentinel Feb 7, 2024, 2:01 PM

#

final kiln An entire readme with no mention on how to run it

Did you try the quick start documentation?

final kiln Feb 7, 2024, 2:03 PM

#

sly sentinel Did you try the quick start documentation?

Yeah it's not good imo, I gave up on it, this thing is either inside ec2 with an assigned role or it won't work

#

I decided to move MLFlow to the GitHub runner anyway, it's hosted in ec2 so it has been working there

#

I'm gonna deploy MLFlow UI on my local wifi, keep the logger in the runner and that's it ig

#

This way I don't have to worry about exposing this thing to the internet and latency is zero since they're on the same host

hazy socket Feb 7, 2024, 2:12 PM

#

Hi, I am making an AI assistant and I want to add a basic vectorizer from nltk to it. I gave the AI a set of patterns and responses. I then tried saying something but got no reply back, meaning none of the replies from the responses part of the code. I copy-pasted the nltk code into a new .py file without any of the functions or classes I had in my main file. Then when I tried speaking, I got some random responses. I know that I have to train it, so my question is: how do I make the AI train itself?

final kiln Feb 7, 2024, 2:18 PM

#

final kiln

Essentially each job here will have its own local MLFlow that talks to AWS managed databases and stores. I essentially DDOS'd myself yesterday

#

I ran two of these workflows at the same time, each job runs sequentially, so I only had two jobs, one from each workflow

#

Two jobs was enough to halt the server

#

Me clicking around in the UI didn't help ig

#

This way everyone gets his own thing, including me

#

I was gonna deploy compose with traefik and several MLFlow processes on the server, but why have a potential running cost on the instance if everything can be easily distributed like this

mint palm Feb 7, 2024, 2:37 PM

#

Need advice,
I am working on a CNN-LSTM and my model needs to be trained for classification as well as forecasting.
Forecasting needs the last n data points for 1 forecast, but
classification needs just 1 data point for 1 classification.

Can i train cnn and lstm combined for this?

serene scaffold Feb 7, 2024, 3:35 PM

#

mint palm Need advice, I am working on cnn lstm and my model need to be trained for classi...

why do you want to combine a CNN and LSTM

past meteor Feb 7, 2024, 5:13 PM

#

raw zenith For data science experts, does the standard deviation in training have to be the...

Most models make a set of assumptions, and for many of them one is that the train and test data come from the same distribution.

past meteor Feb 7, 2024, 5:14 PM

#

mint palm Need advice, I am working on cnn lstm and my model need to be trained for classi...

Can you be a bit more specific on the architecture you're working on?

past meteor Feb 7, 2024, 5:25 PM

#

serene scaffold why do you want to combine a CNN and LSTM

Combining them can make sense for forecasting, e.g. a (T)CNN encoder coupled with an RNN decoder

tardy lark Feb 7, 2024, 5:28 PM

#

can anyone help me figure out why i'm not getting a response from openai https://paste.pythondiscord.com/FVHQ

#

i'm not getting any errors and i have credits in my account

mint palm Feb 7, 2024, 6:05 PM

#

serene scaffold why do you want to combine a CNN and LSTM

for spatial and temporal analysis of input

#

will i have to separate out the training?

#

no means of simultaneous training?

final kiln Feb 7, 2024, 6:41 PM

#

the first sign of convergence + generalization

#

model hasn't seen new data til step 1500

past meteor Feb 7, 2024, 6:45 PM

#

mint palm for spatial and temporal analysis of input

You're not really giving enough coherent information for us to help you 😄

mint palm Feb 7, 2024, 6:47 PM

#

past meteor You're not really giving enough coherent information for us to help you 😄

hi, please read following, and let me know about any other detail that you need:
Need advice,
I am working on a CNN-LSTM and my model needs to be trained for classification as well as forecasting.
Forecasting needs the last n data points for 1 forecast, but
classification needs just 1 data point for 1 classification.

Can i train cnn and lstm combined for this?

#

to elaborate: given A samples, I want to predict a class for each of them (A classes in total), plus I also want to forecast using more than one sample at a time

past meteor Feb 7, 2024, 6:54 PM

#

mint palm hi, please read following, and let me know about any other detail that you need:...

You mean as input?

#

The classification case uses just T=t and the regression case uses [t-1, t-2, ..., t-n]

mint palm Feb 7, 2024, 6:56 PM

#

yeah, in regression it also uses t

#

exactly, i think you have perfect view now

#

of my problem

desert oar Feb 7, 2024, 7:01 PM

#

raw zenith For data science experts, does the standard deviation in training have to be the...

The standard deviation of what exactly?

past meteor Feb 7, 2024, 7:05 PM

#

mint palm hi, please read following, and let me know about any other detail that you need:...

Theoretically you could make a model that makes 2 predictions at every T=t, one for regression and one for classification, and you only use the n'th one to calculate the loss on the regression side
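
A rough sketch of the loss described above, assuming the model emits a class-probability vector and a forecast at every step of a window. The function name `multitask_loss`, the `alpha` weight and the toy numbers are made up for illustration; a real CNN-LSTM would compute the same thing on tensors in a deep learning framework.

```python
import math

def multitask_loss(class_probs, class_labels, forecasts, target, alpha=1.0):
    """Cross-entropy at every step plus squared error at the last step only.

    class_probs  : one list of class probabilities per time step
    class_labels : true class index per time step
    forecasts    : one forecast per time step (only the last one is scored)
    target       : true value the last forecast should match
    alpha        : hypothetical weight balancing the two tasks
    """
    # Classification: every step has its own label, so every step contributes.
    ce = -sum(math.log(p[y]) for p, y in zip(class_probs, class_labels))
    ce /= len(class_labels)
    # Regression: only the n-th step has seen the full window.
    se = (forecasts[-1] - target) ** 2
    return ce + alpha * se

probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 0]
forecasts = [5.0, 3.0, 2.5]  # the first two are ignored by the loss
loss = multitask_loss(probs, labels, forecasts, target=2.0)
```

Because the early forecasts never enter the loss, the regression head is free to output anything there; only the classification head is supervised at every step.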

mint palm Feb 7, 2024, 7:09 PM

#

yeah I thought about it, I can, but I might also have to publish this project and present it as my capstone project
I don't wanna get weird looks for it

mint palm Feb 7, 2024, 7:09 PM

#

past meteor Theoretically you could make a model that makes 2 predictions at every T=t, one ...

Is it conventional enough to do that?

bold rune Feb 7, 2024, 7:10 PM

#

@desert oar If you want and have time, you can have a look at it now. I just won't be able to apply your suggestions until tomorrow.

This is what I ended up doing: #1204768836084170803 message

mint palm Feb 7, 2024, 7:10 PM

#

i was thinking of one more thing:
train the CNN as a classifier
freeze the CNN and use its last layer as an embedding
now train the LSTM for forecasting
@past meteor

bold rune Feb 7, 2024, 7:14 PM

#

bold rune @desert oar If you want and have time, you can have a look at it now. ...

@desert oar And while this will substantially increase the amount of lines in my class, it will also improve readability a lot. Readability > line count.

The above suggestion is great as it makes the conditions easy to read, but I am unsure of how to edit it such that it can set more than 1 value. Basically for some of the checks we do, we put one of 2 values. The above works because it only puts 1 value if the condition holds otherwise it doesn't change the value. Does this make sense?

final kiln Feb 7, 2024, 7:21 PM

#

im using almost 100% of the 3M samples, in this session the model will not see new data

#

it's gonna do 12.5k steps, so ig I'm just gonna chill, watch some Prison Break or wtv

past meteor Feb 7, 2024, 7:54 PM

#

mint palm i was thinking of one more thing: train cNN as classifier freeze cnn and use la...

You can do that as well sure

#

Look into multi task learning

desert oar Feb 7, 2024, 8:00 PM

#

bold rune @desert oar And while this will substantially increase the amount of ...

we put one of 2 values

in that case, I'd say np.where is actually a good choice, especially if you're just using scalar values. other options include .replace (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html), or manually inserting with .loc:

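import pandas as pd

x = pd.Series(["a", "b", "c"])
c = pd.Series([False, True, False])  # boolean condition, one entry per row

y = x.copy()
y.loc[c] = x.loc[c].str.upper()  # value to use where the condition holds
y.loc[~c] = "zzz"  # value to use everywhere else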
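
For comparison, a minimal sketch of the `np.where` route mentioned above, which sets one of two values in a single call. The `scores` array and the pass/fail labels are invented for illustration.

```python
import numpy as np

scores = np.array([35, 80, 50, 12])
# np.where(condition, value_if_true, value_if_false) picks per element.
labels = np.where(scores >= 50, "pass", "fail")
```

The same call accepts a boolean pandas Series as the condition; the result comes back as a NumPy array that can be assigned to a column.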

final kiln Feb 7, 2024, 8:07 PM

#

redis bit my butt, had to restart the experiment >.>

#

idk y I was giving it only 2gb of memory tho

#

I'm deleting the data as soon as I fetch it tho, so I don't know what's up

#

final kiln Feb 7, 2024, 9:30 PM

#

I'm gonna have to park this

#

Tomorrow I'll set up the redis conf thing. It's not trivial because Actions doesn't let me specify my own command, so I need to build an entire image just to make sure it uses the right command so I can map the file

#

But the model is gonna train there's like no doubt about it

#

Was never the model's choice ._.

final kiln Feb 7, 2024, 9:32 PM

#

final kiln it's gonna do 12.5k steps, so ig I'm just gonna chill, watch some Prison Break or...

This trend continued til 2000-2500 steps until redis broke again

river berry Feb 8, 2024, 3:53 AM

#

I am starting to wonder if python is the best choice for my little quest here to find a quine-regex: https://github.com/micsthepick/quinegex

GitHub

GitHub - micsthepick/quinegex: Searching/Fuzzing for a quine style ...

Searching/Fuzzing for a quine style regex which matches itself - GitHub - micsthepick/quinegex: Searching/Fuzzing for a quine style regex which matches itself

tardy lark Feb 8, 2024, 4:09 AM

#

has anyone used AssemblyAI and knows a way to get the microphone stream to end automatically when it's no longer picking up audio?

ionic umbra Feb 8, 2024, 6:59 AM

#

I'm trying to parse a really large XML file (90+ GB) and I want to break it up into chunks and process it on multiple nodes at once. The XML is basically just a long list of millions of <page> HTMLcontent ....</page> tags with nothing in between them... is there a way to easily break this file into chunks of 50,000 or so page tags with one of the common parser libraries?

limber token Feb 8, 2024, 7:09 AM

#

ionic umbra I'm trying to parse a really large XML file (90+ GB) and I want to break it up i...

https://gist.github.com/nicwolff/b4da6ec84ba9c23c8e59
https://gist.github.com/benallard/8042835

Gist

Python script to break large XML files

Python script to break large XML files. GitHub Gist: instantly share code, notes, and snippets.

Gist

Small python script to split huge XML files into parts. It takes o...

Small python script to split huge XML files into parts.

It takes one or two parameters. The first is always the huge XML file, and the second the size of the wished chunks in Kb (default to 1Mb)...

stiff garden Feb 8, 2024, 7:18 AM

#

I'm using the "Create Custom GPT" option with ChatGPT 4; it is supposed to respond with the location for the provided name

I'm using FastAPI and ngrok for a static domain. I've deployed it on the edge using ngrok, but the GPT is still unable to trace the location.

The static website (generated by ngrok) is working fine too

#

gritty vessel Feb 8, 2024, 9:06 AM

#

Guys, I am working on a project focused on AI-ready data, i.e. I'm preparing a dataset to feed our model. Anybody want to join? It involves some basic steps like extracting the same amount of data from files, creating a new dataset, and compressing it

civic elm Feb 8, 2024, 11:14 AM

#

Hi, can anyone recommend any courses on Bayesian methods in ML?

#

Someone told me that classification problems that lack labels can be tackled with Bayesian methods, but I don't know where to start

spark nimbus Feb 8, 2024, 11:19 AM

#

In Pandas on PySpark, is there a good way to parallelize tasks? For example, I have a list of ~200 tuples (dataframe, function_with_retval) and I'd like to get all of the results. At the moment these are done one at a time and this seems to have worse performance than plain pandas, but I'm wondering if there's a better way to do it

teal lance Feb 8, 2024, 1:59 PM

#

Python to make deviation slips 😮‍💨🔥🔥🔥

crude pilot Feb 8, 2024, 2:00 PM

#

Hey folks, what tool would you use for stateful analytics, like cross-filtering?

#

Filters are added one at a time, and I wonder if it would be a valid approach to just use a "traditional" stateless analytics tool and rerun the same query with more filters (would I get some form of caching?), or if there are solutions that can spawn a temporary state to filter further (so filter A -> list of data -> filter B -> list of further filtered data etc.)?

#

I've read a lot about analytics but somehow never met this cross-filtering use case while it's probably not too uncommon

signal holly Feb 8, 2024, 2:17 PM

#

How do I effectively learn and practice ai and coding in general? I’m at a point where I give up because I don’t know what direction to go, what I should do exactly, and how I would do it. I need that sort of specific help. If anyone knows, I would greatly appreciate it.

past meteor Feb 8, 2024, 2:34 PM

#

crude pilot Filters are added one at a time, and I wonder if it would be a valid approach to...

I don't fully get what you mean

#

Isn't this what BI tools already do (Power BI, Tableau)?

#

If not, how does what you're imagining differ from those

silk kite Feb 8, 2024, 2:47 PM

#

signal holly How do I effectively learn and practice ai and coding in general? I’m at a point...

If you don't like math, program with LLM APIs. If you do like math, start with scikit-learn, which has fantastic documentation, and you can get going even if you don't understand the math behind the models.

left tartan Feb 8, 2024, 2:47 PM

#

ionic umbra I'm trying to parse a really large XML file (90+ GB) and I want to break it up i...

The general answer is: You want an XML stream parser, rather than a traditional: "load the entire document". I've done this many times in Java, but not yet in Python, so don't have a library to recommend. Google: "python streaming xml".

#

The answer might be https://docs.python.org/3/library/xml.sax.html#module-xml.sax, but I haven't used Python's SAX parser so can't recommend it.

Python documentation

xml.sax — Support for SAX2 parsers

Source code: Lib/xml/sax/__init__.py The xml.sax package provides a number of modules which implement the Simple API for XML (SAX) interface for Python. The package itself provides the SAX exceptio...
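
As one possible way to stream the file in Python, here is a sketch using the standard library's `xml.etree.ElementTree.iterparse` rather than SAX. It assumes the `<page>` elements sit directly under a single root element (if the dump has no root, it would need to be wrapped first); the helper name `iter_page_chunks` and the tiny demo document are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_page_chunks(source, chunk_size=50_000, tag="page"):
    """Yield lists of serialized <page> elements, chunk_size at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    chunk = []
    for event, elem in context:
        # "end" fires once an element has been read completely.
        if event == "end" and elem.tag == tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # drop already-processed pages so memory stays flat
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk

demo = io.BytesIO(b"<pages><page>a</page><page>b</page><page>c</page></pages>")
chunks = list(iter_page_chunks(demo, chunk_size=2))
```

Each yielded chunk could then be written to its own file or handed to a worker node, so no step ever holds the full 90 GB document in memory.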

halcyon hedge Feb 8, 2024, 3:17 PM

#

Does anyone know how to interpret the diagonal graphs in sns.pairplot()? When we have the same variable on both axes, let's say 'HeartRate', does it show the count on the y-axis and the values on the x-axis?

#

#

Referring to the graph on bottom right

left tartan Feb 8, 2024, 3:31 PM

#

halcyon hedge

https://seaborn.pydata.org/generated/seaborn.pairplot.html: "The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column." I.e. X is the value, and Y the frequency

halcyon hedge Feb 8, 2024, 3:37 PM

#

@left tartan Thanks a lot

crude pilot Feb 8, 2024, 5:57 PM

#

past meteor Isn't this what BI tools already do (Power BI, Tableau)?

I am in a big data context and need a more bare-bones architecture, like relying on an existing analytics engine

#

I think tableau is more visualization centric

#

maybe Power BI, but I am not sure it handles the scale

#

data may be rather sensitive so I would prefer things that can be self-hosted

#

I see a lot of tools able to do analytics, i.e. applying filters on the data + an aggregation

#

like any database can

#

but cross filtering adds the idea that you progressively refine the filter

#

I feel like it might be costly to add filters incrementally but I'm not sure, and I wonder if there are analytics tools that allow that

#

I don't want each filter to be a totally new request, unless this new request is actually really fast

past meteor Feb 8, 2024, 6:01 PM

#

I see your request now

#

I think it's similar to BI tools but you really need it to be able to work at a large scale, correct? One of the things that bothers you is you don't want to recompute filters that are added sequentially

past meteor Feb 8, 2024, 6:03 PM

#

crude pilot I feel like it might be costly to add filters incrementally but not sure, and I ...

You can look at Apache Superset

#

There you'll be able to do all of the configurations you want

crude pilot Feb 8, 2024, 6:10 PM

#

past meteor There you'll be able to do all of the configurations you want

this looks fantastic

#

I'll try to dig into how they handle the cross filtering

river cape Feb 8, 2024, 6:19 PM

#

Any idea what the dummy variable trap in multiple linear regression is?

past meteor Feb 8, 2024, 6:24 PM

#

river cape Any idea what the dummy variable trap in multiple linear regression is?

If you have k levels in your categorical variable and you make k dummies, they're perfectly collinear (together with the intercept, since the k dummies always sum to 1), which means the coefficients of your model can't be uniquely estimated or interpreted. The usual fix is to use k-1 dummies.
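
A small pure-Python illustration of the trap, assuming a made-up `countries` column: with all k dummy columns every row sums to 1, which duplicates the intercept column, and dropping one level as the reference category removes that redundancy.

```python
countries = ["France", "Spain", "Germany", "Spain", "France"]
levels = sorted(set(countries))  # ['France', 'Germany', 'Spain']

# Full one-hot encoding: k columns for k levels.
full = [[int(c == level) for level in levels] for c in countries]
# Every row sums to 1, which is exactly the intercept column of ones.
row_sums = [sum(row) for row in full]

# Dropping the first level ('France') makes it the reference category.
reduced = [row[1:] for row in full]
```

With the reduced encoding, each remaining coefficient reads as the difference from the reference category.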

pliant marsh Feb 8, 2024, 6:37 PM

#

hello guys, I want to ask something related to deepfake detection, can anyone help me with it?

#

anyone?

left tartan Feb 8, 2024, 6:38 PM

#

pliant marsh hello guys, I want to ask something related to deepfake detection, can anyone he...

Just ask the question plz. There's a lot of people lurking. Or who check in whenever they feel like

pliant marsh Feb 8, 2024, 6:40 PM

#

well, I want to work on a project called "Deepfake detection", and from the name you can guess what it's going to do. Can anyone recommend sources, like where I should get appropriate image data and how I should preprocess it to train a model for deepfake detection?

left tartan Feb 8, 2024, 6:42 PM

#

Not anything I know about, but, I'd start with something like reading the current state of the art: https://paperswithcode.com/task/deepfake-detection. Hopefully someone else will comment.

pliant marsh Feb 8, 2024, 6:43 PM

#

left tartan Not anything I know about, but, I'd start with something like reading the curren...

thanks a lot mate

rocky spade Feb 8, 2024, 7:06 PM

#

pliant marsh well i want to work upon a project called "Deepfake detection" and by the name y...

My question is: do you know what a deepfake is?

pliant marsh Feb 8, 2024, 7:18 PM

#

Well, a deepfake is a video or an image of a person that has been altered by swapping in some other person's face

#

So the body of one person and the face of someone else

pliant marsh Feb 8, 2024, 7:18 PM

#

rocky spade My question is: do you know what a deepfake is?

And I think this is the right thing?

fallow frost Feb 8, 2024, 7:18 PM

#

how does Pandas store NaN values in a float array? is there a separate mask array that dictates if a certain idx is nan, or is nan a valid bit pattern for any float type that doesn't also represent a regular number?

left tartan Feb 8, 2024, 7:29 PM

#

fallow frost how does Pandas store NaN values in a float array? is there a separate mask arra...

This is a complicated topic with Pandas, depending on the datatypes involved. But for NumPy-backed arrays, pandas uses NumPy's NaN; more info: https://pandas.pydata.org/docs/user_guide/missing_data.html

#

Arrow backed pandas dataframes are a different story

merry ridge Feb 8, 2024, 7:31 PM

#

Does anyone have any experience with Physics Informed Neural Networks? I am trying to solve a heat equation with Dirichlet boundary conditions, and I am confused why the solution is decent at the initial conditions and in the interior, but is horrible at the boundary. As far as I understand, the model wouldn't have any real way of being able to distinguish between the initial conditions and boundary conditions other than that there are two derivatives in space and one derivative in time.

past meteor Feb 8, 2024, 7:50 PM

#

merry ridge Does anyone have any experience with Physics Informed Neural Networks? I am tryi...

@wooden sail

wooden sail Feb 8, 2024, 7:58 PM

#

merry ridge Does anyone have any experience with Physics Informed Neural Networks? I am tryi...

this depends on how many samples you have at the boundary

#

boundary conditions usually have comparatively fewer samples, and so you need to weigh them more heavily in the cost function

#

i had this issue with a PINN for the wave equation on a string, where the boundary was only 2 samples. if i didn't weigh those two samples like crazy, the waves wouldn't reflect

merry ridge Feb 8, 2024, 7:59 PM

#

When you say comparatively fewer, do you mean compared to interior points?

wooden sail Feb 8, 2024, 7:59 PM

#

compared to interior points, and compared to the initial conditions which you evaluate everywhere in space

#

consider that in many cases, the function evaluated at the boundary can be zero and contribute nothing to the learning, e.g. because not enough time has passed for heat to diffuse all the way to the boundary from any sources

#

so even if you evaluate the boundary points at several time steps, many time steps might not contribute at all

#

and the cost function is evaluated almost everywhere in space for every time step

#

what some people do is also train on a schedule. after a few epochs, start decreasing the weight of the init conditions and error, and crank up the weight of boundary conditions

merry ridge Feb 8, 2024, 8:03 PM

#

What kind of scale are you suggesting when you crank up the weight on the boundary conditions?

teal lance Feb 8, 2024, 8:03 PM

#

Anybody wanna help me get a cooler gui 🥹🔥🔥🔥

merry ridge Feb 8, 2024, 8:04 PM

#

I tried making the boundary worth 10 times more, but it wasn't giving me good results

wooden sail Feb 8, 2024, 8:04 PM

#

honestly this depends on your setup. i would suggest you make a plot showing the error, the boundary cost, and the initial conditions cost as 3 curves in the same plot over the epochs

#

see how they compare to each other

#

a good place to start is to make them all roughly the same size
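
One way to read "roughly the same size" in code: pick weights so that each weighted loss term matches the largest one. This is only a sketch; the helper `balance_weights`, the term names and the loss magnitudes are hypothetical and would come from your own training curves.

```python
def balance_weights(losses):
    """Return weights that scale every loss term to the size of the largest one."""
    largest = max(losses.values())
    return {name: largest / value for name, value in losses.items()}

# Hypothetical loss magnitudes read off a training plot.
losses = {"pde_residual": 2.0, "initial": 0.5, "boundary": 0.004}
weights = balance_weights(losses)
total = sum(weights[name] * losses[name] for name in losses)
```

In this toy case the boundary term ends up weighted about 500 times more than the PDE residual, which shows why a factor of 10 may not be enough. A term that is exactly zero would need special handling.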

teal lance Feb 8, 2024, 8:07 PM

#

My other Python script been helping too 🏄🏾‍♂️🔥

merry ridge Feb 8, 2024, 8:08 PM

#

I don't think I have any other questions now, you gave me a lot of places to tinker with in my model. Thank you for the help

jolly current Feb 8, 2024, 8:10 PM

#

Hello. Could someone suggest some good projects that I can analyse to get some basic experience in applying the theory? I want to cover the basics of machine learning

#

I need it to add to my resume as well.

sterile talon Feb 8, 2024, 8:27 PM

#

Does anyone here work with remote sensing and perhaps SAR data?

#

I'm writing an MSc thesis and I need to do pre-processing. There are several software packages available; at least two are based on MATLAB and a few on Python.

odd meteor Feb 8, 2024, 9:16 PM

#

jolly current Hello. Could someone suggest me some good projects that i can analyse to get a b...

I think "a good project" is subjective. It depends on what you're really interested in.

My trick has always been

Read a research paper and implement the paper (e.g. LoRA)
Write a Medium article explaining your code and your attempt to replicate the experiment of the paper (with lots of plots and maybe memes)
Add that to your portfolio (most companies will always pick this over Titanic)

If you're targeting companies like Perplexity, HF 🤗, InstaDeep, DeepMind, Google Brain, OpenAI, etc... This strategy can easily get you a research intern / entry level interview invite.

rugged comet Feb 8, 2024, 9:57 PM

#

If you had a continuous numerical column and a nominal categorical column, how would you visualize their relationship? More specifically, I'm interested in how the value in the continuous column affects the rate at which the categories occur.
At this point in the project, I'm trying to think it through. Maybe I want to bin the continuous column because the values within are fairly specific. Each row in the df has a category, a datetime, and a value for the continuous column. Maybe I want to find the rate at which the observations are recorded so that I can graph that against the continuous column.

jolly current Feb 8, 2024, 10:11 PM

#

odd meteor I think "a good project" is subjective. It depends on what you're really interes...

Thank you so much for the feedback! I was really lost trying to get somewhere

odd meteor Feb 8, 2024, 11:32 PM

#

rugged comet If you had a continuous numerical column and a nominal categorical column, how w...

Hopefully this clears up things a bit for you.

Scenario 1
Using Hypothesis Testing

Bivariate: Continuous vs. Categorical features

Plot: Barplot, Boxplot to visualise the relationship.

Example of Statistical Test: 2-sample Z-test (to compare the means of two independent populations/groups)

Example: Analysis of whether a company managed by a man and a company run by a woman spend the same amount on electricity on average.

Assume you have a column called Gender (gender is a feature with two classes: male and female)

See attached image of Hypothesis test and plot.

Scenario 2

If what you're looking for goes beyond carrying out a statistical hypothesis test, then you need to compute the point-biserial correlation (Pearson's correlation between a continuous and a binary variable), or alternatively you can use your good ole Logistic Regression.

See attached 2nd image for reference.
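
Since the attached images are not part of this log, here is a minimal standard-library sketch of the 2-sample Z-test from Scenario 1. The function name `two_sample_z` and the electricity figures are invented, and the samples are far too small for a Z-test in practice (a t-test would be the usual choice at this size); they only keep the example short.

```python
import math
from statistics import NormalDist, mean, stdev

def two_sample_z(a, b):
    """Two-sided z-test for a difference in means of two independent samples."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up monthly electricity spend for the two groups of companies.
group_a = [120.0, 135.0, 128.0, 140.0, 132.0]
group_b = [110.0, 118.0, 125.0, 115.0, 121.0]
z, p = two_sample_z(group_a, group_b)
```

A small `p` would suggest the two groups do not spend the same amount on average; what counts as small is a threshold you pick beforehand.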

rugged comet Feb 8, 2024, 11:54 PM

#

odd meteor Hopefully this clears up things a bit for you. **Scenario 1** Using Hypothesis...

Thanks for the reply. I'll probably try scenario 1 in addition to what I've been messing around with.
I think I misspoke somewhat in my original post. I'm really trying to find how the value of the continuous column affects the rate at which observations are made. We have date data as well. For example, as the value in the continuous column increases, does that rate at which observations are made increase?
The part that I'm trying to wrap my head around right now is some values in the continuous column are more common than others. Therefore, wouldn't it be likely that there were more observations at that value? For example, say an extreme value showed up in the continuous column, wouldn't that value have a lower count of observations than say a more common value?

crimson elbow Feb 9, 2024, 1:39 AM

#

I'm looking for data science internships and was wondering for a portfolio if I should make a website (and if so use a template or to code it myself) or simply use a github for that?

regal wedge Feb 9, 2024, 4:28 AM

#

i'm trying to learn ai and ml. does anyone know some good resources or videos to learn from? any channel that clearly explains the math behind it and shows the derivations and code implementation. Any books, blogs or videos

mild dirge Feb 9, 2024, 7:55 AM

#

For starters, the 3b1b videos are a nice intro that also explains the intuition behind the mathematics @regal wedge

#

https://www.youtube.com/watch?v=aircAruvnKk&ab_channel=3Blue1Brown

YouTube

3Blue1Brown

But what is a neural network? | Chapter 1, Deep learning

What are the neurons, why are there layers, and what is the math underlying it?
Help fund future projects: https://www.patreon.com/3blue1brown
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks


dusty forge Feb 9, 2024, 8:32 AM

#

I'm following a course and it had me change a single column with country names, to three columns with ones and zeros. Why would I want to do this, instead of keeping the single column but changing the countries to a numeric value, for example 1,2,3 for France, Spain, Germany respectively?

wooden sail Feb 9, 2024, 8:32 AM

#

the 3 columns with 1s and 0s represent vectors with equal magnitude, all equidistant from each other

#

using 1,2,3 in a single entry implies, e.g., that spain and germany are more similar categories than france and germany because the distance from 2 to 3 is smaller than that from 1 to 3

#

which representation is better depends on your application

#

ML people call the binary vector approach "one-hot encoding" (one 1, all else 0), in case you wanna read more about it
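
To make the distance argument concrete, a short sketch comparing the two encodings from the question (France=1, Spain=2, Germany=3 versus three 0/1 columns). The helper names are made up for illustration.

```python
import math

label_codes = {"France": 1, "Spain": 2, "Germany": 3}
one_hot = {"France": (1, 0, 0), "Spain": (0, 1, 0), "Germany": (0, 0, 1)}

def label_distance(a, b):
    """Distance between two countries under the single-column 1,2,3 encoding."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two countries under one-hot encoding."""
    return math.dist(one_hot[a], one_hot[b])
```

Under the integer codes, France and Germany are twice as far apart as Spain and Germany, while under one-hot encoding every pair is the same distance apart, the square root of 2.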

dusty forge Feb 9, 2024, 8:39 AM

#

wooden sail using 1,2,3 in a single entry implies, e.g., that spain and germany are more s...

Ahh oke that makes sense. Is this something specific to ML, I'm guessing this is something essential I should know in general?

wooden sail Feb 9, 2024, 8:39 AM

#

which part do you mean?

dusty forge Feb 9, 2024, 8:40 AM

#

wooden sail ML people call the binary vector approach "one-hot encoding" (one 1, all else 0)...

yes this is exactly what the course is using but it doesn't explain one-hot encoding in depth, so I will make a note and read into it

dusty forge Feb 9, 2024, 8:41 AM

#

wooden sail which part do you mean?

using numeric values would 'influence' the way it's being read, I thought 1,2,3,4,5 etc is simply an index, never expected it to influence distance in similarities

wooden sail Feb 9, 2024, 8:42 AM

#

it's a consequence of how euclidean distance is measured

#

you could read about vector and matrix norms to get familiar with the topic

dusty forge Feb 9, 2024, 8:45 AM

#

Ok this is very helpful, thank you. I need a refresher course on algebra as well it seems 😉

wooden sail Feb 9, 2024, 8:53 AM

#

linalg and statistics are the core of ML. then you use calculus to solve the optimization problems that arise from there

hybrid mica Feb 9, 2024, 9:34 AM

#

If I want to make an AI to solve a specific task, should I go with the OpenAI API or train a custom model?

spark nimbus Feb 9, 2024, 10:06 AM

#

Does pandas-on-pyspark offer any kind of named aggregation? The usual kwargs method doesn't seem to work unfortunately

pliant marsh Feb 9, 2024, 10:19 AM

#

pliant marsh well i want to work upon a project called "Deepfake detection" and by the name y...

???

left tartan Feb 9, 2024, 12:25 PM

#

regal wedge i'm trying to learn ai and ml. does anyone know some good resources or videos to ...

Also, CS50 for AI

left tartan Feb 9, 2024, 12:25 PM

#

hybrid mica If I want to make an AI to solve a specific task, should I go with the OpenAI A...

It depends. Share more info?

hybrid mica Feb 9, 2024, 12:26 PM

#

It is an NLP related task.

#

I would like to compare a written piece of text with a bullet pointed piece of text and count how many of the points are included in the written piece.

#

so far my experience at getting chatgpt to do this hasn't been that good

left tartan Feb 9, 2024, 12:29 PM

#

Maybe NLP/feature extraction? See spaCy

hybrid mica Feb 9, 2024, 12:42 PM

#

i now remember i asked this question before in this chat and you replied

#

?

#

should i use this?

left tartan Feb 9, 2024, 12:53 PM

#

It’s not my area of expertise, but the type of problem you described (determining if a text contains references to certain topics) sounds like a fit.

past meteor Feb 9, 2024, 1:06 PM

#

hybrid mica I would like to compare a written piece of text with a bullet pointed piece of t...

So you have a few bullets and you want to check if they appear in the text?

hybrid mica Feb 9, 2024, 1:06 PM

#

spaCy looks like a good tool to calculate the similarity between two texts
however, how can i adapt it to the following:

Large block of text:

Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of other living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.

AI technology is widely used throughout industry, government, and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), interacting via human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).[1]

(source: Wikipedia)

Bullet-pointed text:

1. AI technology is used in industry
2. Self-driving cars
3. It relies on linear algebra, statistics and calculus
4. Data preprocessing
5. It can be used to play games such as Chess

My objective is to determine which of the bullet points were mentioned in the text. In this case, it would be points 1, 2 and 5.

past meteor Feb 9, 2024, 1:07 PM

#

My opinion: GPT is probably worse than a bespoke solution but it has zero start-up cost

#

You can use OpenAI's API to just embed your text and then train a classifier on it

#

A more end-to-end way to do this is just finetuning the model there

hybrid mica Feb 9, 2024, 1:08 PM

#

past meteor You can use OpenAI's API to just embed your text and then train a classifier on ...

can you further explain this?

past meteor Feb 9, 2024, 1:09 PM

#

hybrid mica can you further explain this?

Do you mind if I just send you this link? It has the full explanation: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

#

If you want me to summarize it I can

hybrid mica Feb 9, 2024, 1:09 PM

#

I was more looking for how I would train a classifier

#

since i don't have any data

past meteor Feb 9, 2024, 1:10 PM

#

That's where you always need to start: gathering data

#

Actually, step 1 is unambiguously defining:

What is my task?
How do I judge if the task was carried out successfully by the model?

#

Once you have those 2 you gather data

hybrid mica Feb 9, 2024, 1:11 PM

#

I thought I could do something related to semantics using vector embeddings to accomplish this task.

past meteor Feb 9, 2024, 1:11 PM

#

past meteor Actually, step 1 is unambiguously defining: * What is my task? * How do I judge ...

Yeah it depends on this

hybrid mica Feb 9, 2024, 1:11 PM

#

hybrid mica spaCy looks like a good tool to calculate the similarity between two texts howev...

this is my task

past meteor Feb 9, 2024, 1:13 PM

#

I think the key word here is unambiguously 😄 (this is where I typically get a piece of paper and/or latex and write it out)

You say you want to identify the bullets in there. It sounds like classification.

#

You can compute the similarity between your bullet embedding and the text embedding, but from what point do you decide the bullet is or isn't in the text?

#

You need a cutoff point

past meteor Feb 9, 2024, 1:15 PM

#

hybrid mica this is my task

make sense?

hybrid mica Feb 9, 2024, 1:16 PM

#

past meteor You need a cutoff point

I see. I was thinking I could generate the examples later, and just run it off a few examples later and appropriately change the cutoff point.

past meteor Feb 9, 2024, 1:17 PM

#

You can do that sure
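
One way to "appropriately change the cutoff point" from a few examples is to try each observed score as a candidate cutoff and keep the one with the best accuracy. This is only a sketch: the scores and labels below are invented, and with so few examples the chosen cutoff can easily overfit.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, accuracy) maximizing accuracy on labeled examples.

    A score >= cutoff is predicted as True (bullet present in the text).
    """
    best, best_acc = None, -1.0
    for candidate in sorted(set(scores)):
        correct = sum((s >= candidate) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # ties keep the lowest candidate
            best, best_acc = candidate, acc
    return best, best_acc

# Hypothetical similarity scores with hand-made labels.
scores = [0.91, 0.85, 0.40, 0.62, 0.30]
labels = [True, True, False, True, False]

cutoff, accuracy = best_cutoff(scores, labels)
```

Holding back some labeled examples that were not used for picking the cutoff gives a more honest estimate of how well it works.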

hybrid mica Feb 9, 2024, 1:17 PM

#

hybrid mica spaCy looks like a good tool to calculate the similarity between two texts howev...

Is it better to have data in this format?
or data in the format directly comparing the bullet point and the section of text which relates directly to the bullet point?

past meteor Feb 9, 2024, 1:18 PM

#

I would embed the document once and then compute similarity with each bullet one by one

hybrid mica Feb 9, 2024, 1:20 PM

#

Would that work? since one bullet point with like 3 words is probably not similar to a 100 word document that includes lots of other info as well. The computed similarity value would be so small that it would be nearly impossible to differentiate between something which isn't in the text.
Is a better idea perhaps to break the text into sections within my program and then get the max similarity between a section and a bullet point?
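
The section-wise idea can be sketched as follows. `toy_embed` is a bag-of-words stand-in, not a real embedding model, and splitting on blank lines is just one assumed way to define sections; with real embeddings the gap between the two scores may look quite different.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def max_section_similarity(bullet, document, sep="\n\n"):
    """Split the document into sections and return the best match score."""
    sections = [s for s in document.split(sep) if s.strip()]
    bullet_vec = toy_embed(bullet)
    return max((cosine(bullet_vec, toy_embed(s)) for s in sections), default=0.0)

document = (
    "the quarterly revenue grew by ten percent\n\n"
    "the team hired three new engineers\n\n"
    "office moves to a new building next year"
)
bullet = "hired new engineers"

whole_doc_score = cosine(toy_embed(bullet), toy_embed(document))
best_section_score = max_section_similarity(bullet, document)
```

In this toy case the best section scores higher than the whole document, because unrelated sections no longer dilute the comparison.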

past meteor Feb 9, 2024, 1:25 PM

#

hybrid mica Would that work? since one bullet point with like 3 words is probably not simila...

I would start with the basic version and refine it along the way

#

GPT is capable of retaining a lot of information in its embeddings. We've seen that at work.

lapis sequoia Feb 9, 2024, 1:50 PM

#

rate my graph 🫠

fallow frost Feb 9, 2024, 1:55 PM

#

left tartan This is a complicated topic with Pandas, depending on the datatypes involved. Bu...

I'll read it thanks. but I'm trying to understand why it can store NaN/null in a float array, but not an array of ints

fallow frost Feb 9, 2024, 2:10 PM

#

it doesn't say anything about the inner workings. other than with pyarrow, it can store NaN just with float and object arrays. and that's why it casts int to float ...
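
The underlying reason can be shown without pandas. NaN is a value defined by the IEEE 754 floating point format, while every bit pattern of a fixed-width integer is already a valid number, so there is no pattern left over to mean "missing". The sketch below uses only the standard library's typed arrays as a stand-in for NumPy-backed columns; it illustrates the principle and is not how pandas is implemented.

```python
import math
import struct
from array import array

# NaN has a real bit pattern in float64, shown here as hex.
nan_bits = struct.pack(">d", float("nan")).hex()

# A typed array of 64-bit floats accepts NaN like any other float.
floats = array("d", [1.0, float("nan"), 3.0])

# A typed array of 64-bit signed ints has no way to hold it.
try:
    array("q", [1, float("nan"), 3])
    int_array_accepts_nan = True
except TypeError:
    int_array_accepts_nan = False

# Converting NaN to an int fails as well.
try:
    int(float("nan"))
    nan_converts_to_int = True
except ValueError:
    nan_converts_to_int = False
```

Storing a missing value next to integers therefore needs either a cast to float or a separate validity mask alongside the data, which is the approach Arrow and the pandas nullable dtypes take.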

dry geyser Feb 9, 2024, 2:25 PM

#

I have a CSV reader using pyarrow that is performing at roughly 22k/s (lines per second) including row/column validation and some transforms.

#

I'm wondering if anyone faced a similar project and had some good ideas on improving performance

#

i already made lookup tables to run static method validators, so i can run those by index
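
A rough sketch of what such a lookup table could look like. The validator functions and column names are hypothetical, not the ones from the actual project: the point is only that each column's validator is resolved once, so the per-row loop does no name lookups.

```python
def validate_email(value):
    return "@" in value

def validate_nonempty(value):
    return value != ""

def accept_any(value):
    return True

# Hypothetical mapping from column name to validator.
VALIDATORS_BY_NAME = {"email": validate_email, "name": validate_nonempty}

def build_lookup_table(column_names):
    """Resolve validators once; position i holds the validator for column i."""
    return [VALIDATORS_BY_NAME.get(name, accept_any) for name in column_names]

def row_is_valid(row, lookup_table):
    """Run each column's validator against the value at the same index."""
    return all(check(value) for check, value in zip(lookup_table, row))

columns = ["name", "email", "notes"]
table = build_lookup_table(columns)

rows = [
    ("Ada", "ada@example.com", ""),
    ("", "bob@example.com", "x"),
    ("Eve", "not-an-email", "y"),
]
valid_flags = [row_is_valid(row, table) for row in rows]
```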

buoyant vine Feb 9, 2024, 2:32 PM

#

Polars™️

#

Joking aside we switched all of our pandas/pyarrow stuff over to polars because the API is just way more coherent, simpler and normally faster.

dry geyser Feb 9, 2024, 2:34 PM

#

if you dont mind a DM I can show you some of the transforms I do, but it boils down to coalescing data and validation of different fields. thats the most expensive part of the job, but pyarrow doesnt seem like it's blazing either

rocky spade Feb 9, 2024, 2:34 PM

#

dry geyser I have a CSV reader using pyarrow that is performing at roughly 22k/s (lines per...

If you ask, parallel; if you ask, distributed parallel; if you ask, distributed parallel async

dry geyser Feb 9, 2024, 2:35 PM

#

note that pyarrow already does multithread streaming

rocky spade Feb 9, 2024, 2:35 PM

#

multithread is not parallel

#

multithread is not distributed

dry geyser Feb 9, 2024, 2:36 PM

#

i didnt say it is, but distributed does not necessarily help all cases.

rocky spade Feb 9, 2024, 2:37 PM

#

two computers doing one thing, why would that not help?

dry geyser Feb 9, 2024, 2:37 PM

#

if it did, there would be no market for massively powerful hardware meant to do single node work

#

latency, all the added overhead of clustering/distributing tasks

#

two kernels, two context switching scenarios, two userland, two mmu, two network stacks ,.. the list goes on

#

when i hear people trying to answer 'distributed' to everything it makes me wonder if they have a clue about OS internals at all

rocky spade Feb 9, 2024, 2:38 PM

#

Because a common household doesn't buy two computers to do one thing

dry geyser Feb 9, 2024, 2:39 PM

#

common household? i have a 2TB EPYC dual cpu supermicro chassis one floor below...

#

2tb ram*

#

and it isnt even something exotic

rocky spade Feb 9, 2024, 2:40 PM

#

I mean, for all these matters you can just have your other computer report only after it finishes

dry geyser Feb 9, 2024, 2:40 PM

#

anyway

rocky spade Feb 9, 2024, 2:40 PM

#

there is no need to report at run time

dry geyser Feb 9, 2024, 2:40 PM

#

@buoyant vine looking at polars

#

seems it's missing some candy for the CSV API

#

namely stuff like auto conversion for boolean types with known false/known true values

rocky spade Feb 9, 2024, 2:41 PM

#

I mean, distributed and parallel do help with your issues, don't know why you got offended and don't want help

wooden sail Feb 9, 2024, 2:42 PM

#

lapis sequoia rate my graph 🫠

playing with low pass filters of different lengths?

past meteor Feb 9, 2024, 2:43 PM

#

buoyant vine Joking aside we switched all of our pandas/pyarrow stuff over to polars because ...

Join the club. I'm polars fan #1 and for me the main selling points are the coherent and simpler API

#

Maybe a hot take but I'd even use it if it were slower than Pandas

dry geyser Feb 9, 2024, 2:44 PM

#

@rocky spade i did not get offended, you just suggested something that does not effectively solve any issues i have, nor provided any actual insight technically usable.

#

some of the options are supported in polars it seems

#

        self.convert_opt = pv.ConvertOptions(false_values = self.csv_bool_false_values,
                                             true_values = self.csv_bool_true_values,
                                             column_types=self.mapped_column_types,
                                             include_columns=self.wanted_columns,
                                             null_values=[""],
                                             strings_can_be_null=True
                                             )

#

but not the false/true values

#

that saves me a good amount of pain as i do not have to run my validators for boolean dtypes

past meteor Feb 9, 2024, 2:45 PM

#

It's kind of ironic how Python was known for data in the Pandas + matplotlib era when these 2 weren't the best / most user friendly tools imo

dry geyser Feb 9, 2024, 2:45 PM

#

essentially the more i can bypass validation for, the better.

rocky spade Feb 9, 2024, 2:47 PM

#

dry geyser @rocky spade i did not get offended, you just suggested something tha...

Maybe you should try parquet, because nobody knows what your performance issue is when you are just talking about reading a CSV at 22K/s "impressive speed"

buoyant vine Feb 9, 2024, 2:47 PM

#

If the data is already in CSV 😅 Parquet isn't going to save you

#

although Parquet is by far the best format to use if you can

dry geyser Feb 9, 2024, 2:48 PM

#

I am already converting data to parquet for secondary backups

#

exactly lol

#

"try parquet" (but im parsing CSV son... im receiving CSV files...)

#

picks the phone to convince the source of the data they need to rethink and rewrite all their crap to produce parquet files

rocky spade Feb 9, 2024, 2:49 PM

#

dry geyser *picks the phone to convince the source of the data they need to rethink and rew...

Maybe you should

buoyant vine Feb 9, 2024, 2:50 PM

#

I can almost guarantee their answer will be "lol no" even if it would be beneficial for them as well

#

we get sent TB of CSV files and they just refuse to do it any other way

dry geyser Feb 9, 2024, 2:51 PM

#

^ real world.

#

@rocky spade i wish for a lot of things. in fact, i wish i could just throw a 50mil record file on gpt and ask it to be my beeyotch and return a perfectly structured, deduped, coalesced dataset my way

#

but it seems like i dream too far

left tartan Feb 9, 2024, 2:52 PM

#

past meteor Maybe a hot take but I'd even use it if it were slower than Pandas

I’m also in the anything but pandas camp

dry geyser Feb 9, 2024, 2:52 PM

#

plays i believe i can fly

left tartan Feb 9, 2024, 2:52 PM

#

But sql is my answer

past meteor Feb 9, 2024, 2:52 PM

#

left tartan I’m also in the anything but pandas camp

Have you used R?

left tartan Feb 9, 2024, 2:53 PM

#

past meteor Have you used R?

Yes, I do like R

past meteor Feb 9, 2024, 2:53 PM

#

I haven't used it in a long time, but I'd always argue their data processing toolkit is more user friendly. It's just too slow.

buoyant vine Feb 9, 2024, 2:53 PM

#

This is probably more extreme, but maybe you could use something like Trino/Athena/Presto if you want to be as fast as possible and don't care about the cost.

The SQL query will suck, but I know Trino is capable of brute forcing its way through. If you want something nicer though :/ I'm not too sure, since as you said, Polars doesn't support everything you really need easily.

Maybe you could make a LazyFrame in polars and do the bool conversion as part of the pipelined operation?

dry geyser Feb 9, 2024, 2:53 PM

#


    def process(self):     
        if self.parse_method == INGESTION_METHOD_MEMORY:
            table = pv.read_csv(self.filename, read_options=self.read_opt,
                                parse_options=self.parse_opt,
                                convert_options=self.convert_opt)
            self.total_rows = table.num_rows
            self.prebake_lookup_tables(table.schema.names)
            self.process_table(table)
        else:
            # XXX: beware this can trigger OOM, beats cpython iteration through lines() though...
            self.total_rows = self.count_rows()
            
            with pv.open_csv(self.filename, read_options=self.read_opt,
                            parse_options=self.parse_opt,
                            convert_options=self.convert_opt) as reader:
                self.prebake_lookup_tables(reader.schema.names)
                self.loop_through_chunks(reader)

I'm about to benchmark it again without any of the post-validation stuff

rocky spade Feb 9, 2024, 2:53 PM

#

dry geyser "try parquet" (but im parsing CSV son... im receiving CSV files...)

@left tartan

past meteor Feb 9, 2024, 2:53 PM

#

I'd love a credible, maintained port of ggplot2

left tartan Feb 9, 2024, 2:55 PM

#

So it’s not the csv reading per se but the line processing?

rocky spade Feb 9, 2024, 2:56 PM

#

dry geyser @rocky spade i wish for a lot of things. in fact, i wish i could just ...

Do you have any issue regarding your problems?

rocky spade Feb 9, 2024, 2:56 PM

#

dry geyser ``` def process(self): if self.parse_method == INGESTION_METHO...

JIMMY SON literally reads the table two times and makes a list of it, and complains about processing speed

dry geyser Feb 9, 2024, 2:56 PM

#

@buoyant vine

[2024-02-09 15:55:52,298] [MainProcess:Thread-2 (periodic_performance_logger)] INFO: CSV: Processed 780221 lines in 24.09 seconds, 32385.35 lines/second (ETA: 6.18 min)
^ validation enabled, using lookup tables for running per-column methods

#

@rocky spade no, it doesnt, i dont think you understand how pyarrow works. the total_row_count is only done on the streaming part as an initial step, and it technically does not parse anything because im not consuming the dataframes there.

#

it literally is a newline/carriage return counter
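
For illustration, a newline counter of this kind can be written in a few lines. This is a guess at the general approach, not the actual `count_rows` implementation: it reads the file as raw bytes in chunks and never parses any CSV fields.

```python
import os
import tempfile

def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count

# Small demo file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,a\n2,b\n")
    demo_path = tmp.name

line_count = count_newlines(demo_path)  # the header line is counted too
os.remove(demo_path)
```

Two caveats apply to any counter like this: quoted fields that contain line breaks are counted as extra rows, and a final line without a trailing newline is missed.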

rocky spade Feb 9, 2024, 2:58 PM

#

And should you not use Spark for this matter?

dry geyser Feb 9, 2024, 2:58 PM

#

on nvme it completes in ~3 seconds tops

buoyant vine Feb 9, 2024, 2:58 PM

#

Spark is a pretty nuclear option, and you'd spend more time setting up the Spark cluster than the work itself probably.

dry geyser Feb 9, 2024, 2:59 PM

#

@rocky spade do you mind sharing a link to a github or something showing your work?

#

@buoyant vine indeed

buoyant vine Feb 9, 2024, 2:59 PM

#

dry geyser ``` self.convert_opt = pv.ConvertOptions(false_values = self.csv_bool_fa...

What are your csv_bool_false_values and csv_bool_true_values here?

rocky spade Feb 9, 2024, 3:00 PM

#

dry geyser @rocky spade do you mind sharing a link to a github or something showi...

Here

dry geyser Feb 9, 2024, 3:00 PM

#

@buoyant vine they are dataset dependent, "yes", "no", etc. I made it dynamically configurable, sometimes they change.

rocky spade Feb 9, 2024, 3:03 PM

#

dry geyser ``` def process(self): if self.parse_method == INGESTION_METHO...

You basically just opened the entire CSV file and iterate through everything in it, and I told you to make it parallel, no response.

#

I mean, i don't know what you want?

dry geyser Feb 9, 2024, 3:03 PM

#

@buoyant vine Processed 12797780 lines in 375.41 seconds, 34089.97 lines/second (ETA: 0.00 min) < this is including validation. no coalescing/dedup per row, though. i still need to optimize that. it's not as trivial as the validators, since i have some fancy expression support to do inter-column checks and such (ex. fields are transformed based off other columns and their values)

buoyant vine Feb 9, 2024, 3:05 PM

#

import polars as pl

true_values = pl.Series('true_values', ["yes", "y", "ok"])
false_values = pl.Series('false_values', ["no", "n"])

data_stream = pl.scan_csv(
    "my-files/*.csv",
    schema={
        "my_bool_col": pl.Utf8,
        "my_other_bool_col": pl.Utf8,
        "title": pl.Utf8,
        "description": pl.Utf8,
        "something": pl.UInt32,
        "else": pl.Float64,
    },
    truncate_ragged_lines=True,
)

data_stream = (
    data_stream
    .with_columns(
        pl.col("my_bool_col").is_in(true_values),
        pl.col("my_other_bool_col").is_in(true_values),
    )
)

# ... processing

I am not sure how fast this is since I don't have anything to test it right now, but this should do the bool conversion as part of the streaming operation.
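
One thing to keep in mind with the snippet above: `false_values` is defined but not used, and `is_in(true_values)` on its own turns every value outside the true list into `False`, including typos and unexpected strings. If unknown values should stay distinguishable, the conversion needs three outcomes. The plain-Python sketch below shows that logic only; the lowercasing and whitespace stripping are assumptions, and a Polars version would express the same branches with column expressions instead of a per-value function.

```python
TRUE_VALUES = frozenset({"yes", "y", "ok"})
FALSE_VALUES = frozenset({"no", "n"})

def parse_bool(raw):
    """Map a raw CSV string to True, False, or None for unknown/missing."""
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    return None  # neither list matched: keep it visible as missing

column = ["yes", "N", "ok", "", "maybe"]
parsed = [parse_bool(v) for v in column]
```

Keeping unknown values as `None` means bad inputs show up as nulls during validation instead of silently becoming `False`.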

dry geyser Feb 9, 2024, 3:06 PM

#

i can check a bit later

buoyant vine Feb 9, 2024, 3:06 PM

#

The inter-column checks are probably going to be the slowest thing, since data engines can often get confused with them

dry geyser Feb 9, 2024, 3:06 PM

#

hows type inference with polars?

past meteor Feb 9, 2024, 3:06 PM

#

Very good

dry geyser Feb 9, 2024, 3:06 PM

#

i wrote a tool also for improving type inference for the parquet conversion, it isnt anything sophisticated, but speeds up crafting the yaml configuration for each CSV dataset

buoyant vine Feb 9, 2024, 3:08 PM

#

It's good. The only issue I have run into is that it reads and writes Utf8 strings as Utf8 with 64 bit int lengths, which you can't change.
Had an issue before where if you have some custom arrow processing and it can't automatically convert between the 32bit and 64bit lengths, it can cause some issues.

dry geyser Feb 9, 2024, 3:08 PM

#

import pyarrow as pa

PYARROW_TYPE_MAPPINGS = {
    "StringType":           pa.string(),
    "EmailAddressType":     pa.string(),
    "PhoneNumberType":      pa.string(),
    "GenderType":           pa.string(),
    "BooleanType":          pa.bool_(),
    "DateTimeType":         pa.timestamp('ns'),
    "CountryType":          pa.string()
}

just an example from one of the type mappings

past meteor Feb 9, 2024, 3:08 PM

#

The most important thing with polars type inference is that it knows the difference between pl.LazyFrame and also the "context" you're in like Expr, or Select etc

#

It quite accurately tells you what ops you can and can't do

#

Or did you mean the inference while reading data

dry geyser Feb 9, 2024, 3:09 PM

#

while reading

true spade Feb 9, 2024, 3:09 PM

#

Hi there, not sure if this is the right channel to ask this question, but essentially I am currently struggling to figure out whether I am spending too much time on organizing code in my Jupyter notebooks as opposed to conducting experiments and exploring data/opportunities with the aid of the notebook.

As a result, I was wondering if anyone has any advice on how to balance organizing code with actually using Jupyter notebooks for analyzing data and experimenting with different kinds of models?

rocky spade Feb 9, 2024, 3:10 PM

#

That's right son

dry geyser Feb 9, 2024, 3:10 PM

#

ex. if you look at the above table, i have a per dataset mapping that maps columns to those generic types. each has its own validation logic, including adaptive settings per dataset (ex. known observed bad values)

buoyant vine Feb 9, 2024, 3:10 PM

#

true spade Hi there, not sure if this is the right channel to ask this question, but essent...

😅 I'll confess I never organise my notebooks, they are all not in the git history for a reason, I just have chunks of code everywhere.

Normally if I have some model tests or what ever, I put them in normal python files.

dry geyser Feb 9, 2024, 3:11 PM

#

dry geyser ex. if you look at the above table, i have a per dataset mapping that maps colum...

so, this is in the context of how you are mapping types in polars

buoyant vine Feb 9, 2024, 3:12 PM

#

dry geyser ex. if you look at the above table, i have a per dataset mapping that maps colum...

Defining the schema for each dataset in Polars is very simple, doing some more complex conversion or casting typically requires the use of some with_columns and explicitly casting the types, or some extra work. It is very type strict unlike pandas which can be a bit more hand wavy.

dry geyser Feb 9, 2024, 3:12 PM

#

as i already apply column type mappings. for the most part i care about dates and some string types, the rest i can often apply a fast path in validation and just leave them as is or as none/null

past meteor Feb 9, 2024, 3:12 PM

#

true spade Hi there, not sure if this is the right channel to ask this question, but essent...

I organize in .py files and use notebooks to incrementally test what I'm making

#

My experiment pipeline is always a .py

dry geyser Feb 9, 2024, 3:12 PM

#

ill definitely look into a polars version of the current csv processor

buoyant vine Feb 9, 2024, 3:13 PM

#

dry geyser as i already apply column type mappings. for the most part i care about dates an...

Date parsing, etc... Is very simple, you may need to define the original type as a string, then tell polars to parse it, but it has native methods for this so it is very fast, it just doesn't have a helper schema type for implicitly casting IIRC

true spade Feb 9, 2024, 3:13 PM

#

buoyant vine 😅 I'll confess I never organise my notebooks, they are all not in the git histo...

I see, the issue is that I have to put all of my code (at least the relevant parts) in my Jupyter notebook as part of an assignment, but it's currently overdue since I have been spending too much time on organizing my code.

I tend to try to follow the DRY principle, including in Jupyter notebooks (which I mainly use for data analytics and machine learning projects), since I have found that it's something that has helped me out in other areas such as software engineering, web development, and game development.

However, I really don't know if my tendency to be a stickler when it comes to adhering to the DRY principle is causing me to waste too much time on cleaning up my code and reducing code duplication/generalizing common procedures instead of actually y'know... exploring the data and experimenting with new models haha

buoyant vine Feb 9, 2024, 3:13 PM

#

Overall I'd say it is very against the idea of implicitly or automatically type casting things. For the most part it just won't do it unless you explicitly do it.

rocky spade Feb 9, 2024, 3:14 PM

#

dry geyser so, this is in the context of how you are mapping types in polars

How many lines is your file? How big is it? Do you mind sharing it so i can make a copy and test it on my own?

#

So passing 22K/s is impressive?

dry geyser Feb 9, 2024, 3:14 PM

#

@buoyant vine the tl;dr is that in the end i want an array/list that matches the indexing of my wanted columns set (this is how i optimize validation, by executing/running all that by index, the validators are assigned to the right fields once, and the table is cached)

rocky spade Feb 9, 2024, 3:15 PM

#

dry geyser @buoyant vine the tl;dr is that in the end i want an array/list that mat...

Maybe you should change the CPU instead

true spade Feb 9, 2024, 3:15 PM

#

past meteor My experiment pipeline is always a `.py`

I see, thanks for letting me know about that, just curious, what do you mean by an experiment pipeline?

Are you referring to a set of scripts that just contain experimental code (which might or might not be scrapped in the future)?

buoyant vine Feb 9, 2024, 3:15 PM

#

dry geyser @buoyant vine the tl;dr is that in the end i want an array/list that mat...

In polars, I would try and do all of that validation as part of the streaming operation rather than via index. In theory it should be faster, provided you can make use of some of the native helper methods rather than calling map_elements everywhere.

past meteor Feb 9, 2024, 3:16 PM

#

true spade I see, thanks for letting me know about that, just curious, what do you mean by ...

No, the code I use to run all of my models / preprocessing etc

#

I like having it all reproducible and so on

dry geyser Feb 9, 2024, 3:16 PM

#

@buoyant vine the validation is done to the row as a list/tuple/set

buoyant vine Feb 9, 2024, 3:17 PM

#

Any particular reason for that? Or just because it was the best way with pyarrow?

rocky spade Feb 9, 2024, 3:17 PM

#

buoyant vine In polars, I would try and do all those validation as part of the streaming oper...

From my understanding, if you just use someone else's packages and don't dig in yourself, there is no real performance you can improve other than general things like parallel or distributed

dry geyser Feb 9, 2024, 3:17 PM

#

it was the best way with pyarrow

true spade Feb 9, 2024, 3:17 PM

#

past meteor No, the code I use to run all of my models / preprocessing etc

I see, so does it mean that you put all of the code for reproducible experiments or finalized models in separate .py files and then just use the Jupyter notebook for exploring and experimenting in?

rocky spade Feb 9, 2024, 3:17 PM

#

Because you have no clue what's under it

dry geyser Feb 9, 2024, 3:17 PM

#

@buoyant vine i can DM you if you are curious

buoyant vine Feb 9, 2024, 3:17 PM

#

Fair enough, yeah I would try with Polars' more columnar approach if you can, I don't know all your validations but if you can do it without getting it by row then it should be pretty speedy

buoyant vine Feb 9, 2024, 3:17 PM

#

dry geyser @buoyant vine i can DM you if you are curious

Sure sure

buoyant vine Feb 9, 2024, 3:18 PM

#

rocky spade For my understanding if you just use someone else packages and not digging yours...

This is true, in Polars the biggest performance hit is when it has to go back to python land to process stuff

#

Although tbh, when we hit those sorts of issues, we stop doing the code in Python 😅

past meteor Feb 9, 2024, 3:19 PM

#

true spade I see, so does it mean that you put all of the code for reproducible experiments...

Usually I have like dozens of experiments I run with the single pipeline. I test one or two out in a notebooks manually and then I parameterize the pipeline using the CLI or a second .py that runs everything
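
A small sketch of what "parameterize the pipeline using the CLI" could look like with `argparse`. The flag names and the body of `run_experiment` are placeholders invented for illustration, not the actual pipeline described here.

```python
import argparse

def run_experiment(model, learning_rate, seed):
    """Placeholder for the real work: preprocessing, training, evaluation."""
    return {"model": model, "learning_rate": learning_rate, "seed": seed}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run one experiment.")
    parser.add_argument("--model", default="baseline")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    return run_experiment(args.model, args.learning_rate, args.seed)

# A second script (or a notebook cell) can sweep over configurations
# by calling main() with explicit argument lists.
results = [
    main(["--model", model, "--seed", str(seed)])
    for model in ("baseline", "bigger")
    for seed in (0, 1)
]
```

In a standalone script, `main()` would normally be called under an `if __name__ == "__main__":` guard so the same file works both from the command line and as an import.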

dry geyser Feb 9, 2024, 3:20 PM

#

@buoyant vine i wrote a very simple test in rust without all the dynamic/configurable validation and mappings, and it beat the crap out of python 3.12 with latest pyarrow

#

single threaded too

rocky spade Feb 9, 2024, 3:20 PM

#

dry geyser @buoyant vine i can DM you if you are curious

So i was trying to say the same, only parallel, distributed or general stuff

dry geyser Feb 9, 2024, 3:20 PM

#

all numbers i provided so far come from a i9-13900K workstation

rocky spade Feb 9, 2024, 3:21 PM

#

dry geyser @buoyant vine i wrote a very simple test in rust without all the dynamic...

Other than that, write yourself a package and then talk about performance improvement

past meteor Feb 9, 2024, 3:21 PM

#

About Polars, what I see being a big issue of people transitioning to it is not leaning into it

true spade Feb 9, 2024, 3:21 PM

#

past meteor Usually I have like dozens of experiments I run with the single pipeline. I test...

I see, thats interesting, thanks for the information.

When testing things out in a notebook, do you usually also try to organize the code into functions (mainly for repetitive tasks, such as building and evaluating multiple models or anything else where the procedure does not vary so much) or do you just duplicate the code instead?

past meteor Feb 9, 2024, 3:21 PM

#

I think if you're doing iter_rows and/or map then using it doesn't make sense

rocky spade Feb 9, 2024, 3:21 PM

#

dry geyser @buoyant vine i wrote a very simple test in rust without all the dynamic...

There is no way you can improve performance when using someone else's packaged stuff

rocky spade Feb 9, 2024, 3:22 PM

#

dry geyser all numbers i provided so far come from a i9-13900K workstation

Coding level issue son

dry geyser Feb 9, 2024, 3:23 PM

#

@rocky spade if you look at how @buoyant vine 's and other folks' interactions work out, your experience in this channel with other people will likely improve also linearly to your enthusiasm in "distribute everything ahoy"

#

plonk

#

/ignore @rocky spade

#

lol

buoyant vine Feb 9, 2024, 3:24 PM

#

dry geyser @buoyant vine i wrote a very simple test in rust without all the dynamic...

Yeah, personally I despise python's arrow handling and parquet handling. If you think the speed is the issue wait until you try streaming to and from object storage with it sadge

#

Polars is very nice though if the data can sit on local disks and be done with it though

dry geyser Feb 9, 2024, 3:25 PM

#

f that, im doing all this on nvme/optane

#

so i know IO is not the bottleneck at least in that sense

buoyant vine Feb 9, 2024, 3:25 PM

#

😎 Join the dark side and do 100s of GB/s on blob storage

dry geyser Feb 9, 2024, 3:25 PM

#

lol

past meteor Feb 9, 2024, 3:25 PM

#

true spade I see, thats interesting, thanks for the information. When testing things out i...

In notebooks I start off with rough code and then I make it better and potentially move some stuff to the .py files. Think about it this way: writing code that works is a challenge. Writing code that is really organized is also a challenge. Sometimes it makes sense to not try to do these at the same time. Make it work, write tests to verify its behaviour and then make it cleaner

rocky spade Feb 9, 2024, 3:26 PM

#

I never mentioned distributing everything. This folk hates everyone who mentions distributed, while using a high level language on top of a high level package, thinking he is impressive and asking for improvement. Isn't the only magic to change the package you use, fix a stupid code error, or change to parallel or distributed reading when talking about improving reading speed? Anything other than that is creating a reading package from scratch. Don't use any package someone else has written, then we can talk about fundamental improvements

buoyant vine Feb 9, 2024, 3:28 PM

#

rocky spade There is no way you can improve performance when using someone else's packaged stuff

This is somewhat misleading I'd say, or a misconception at least. Yes there are limits, but most of the time the library code itself is not the limiting thing.

It is also worth mentioning that it is typically not worth it to build some system from scratch in something like Rust or C++ unless you actually have issues with the speed it is currently doing it in or have some other requirement which the Python lib or what ever doesn't support well.

There are a lot of optimizations you can normally do before you get to that stage

dry geyser Feb 9, 2024, 3:28 PM

#

@rocky spade i think you dont read english well. i never said 22k is impressive. i said it's the ceiling of what is possible given the circumstances. yet you are here trolling because your petit ego got hurt when i told you that your suggestions were not valuable. look at how other people respond here. their input has value. they are not acting haughty or like they have a chip on their shoulders. i bet any of these kind fuckers have a fairly sizable amount of experience on their shoulders, thats where their humility and good attitude comes from. get over it. learn from them.

buoyant vine Feb 9, 2024, 3:29 PM

#

We can also probably chill out a little bit 😅 We don't need to argue or throw insults or what not

#

just before this becomes too heated...

dry geyser Feb 9, 2024, 3:29 PM

#

lol

rocky spade Feb 9, 2024, 3:30 PM

#

dry geyser @rocky spade i think you dont read english well. i never said 22k is i...

I never mentioned distributing everything. This folk hates everyone who mentions distributed, while using a high level language on top of a high level package, thinking he is impressive and asking for improvement. Isn't the only magic to change the package you use, fix a stupid code error, or change to parallel or distributed reading when talking about improving reading speed? Anything other than that is creating a reading package from scratch. Don't use any package someone else has written, then we can talk about fundamental improvements

#

isn't there any way to improve when you use pandas to read a CSV file?

#

pa.read

#

The only one who has no chip on their shoulder starts calling others son when someone replies to your code after you ask for improvement

agile owl Feb 9, 2024, 3:37 PM

#

Don't understand why you'd be writing single threaded apps if you care about performance in 2024.

rocky spade Feb 9, 2024, 3:37 PM

#

dry geyser @rocky spade i think you dont read english well. i never said 22k is i...

The only one who has no chip on their shoulder starts calling others son when someone replies to your code after you ask for improvement, son

agile owl Feb 9, 2024, 3:38 PM

#

python single thread performance is of course going to lose to rust too, don't think that's controversial at all, the runtime has a cost

dry geyser Feb 9, 2024, 3:38 PM

#

"son" is not an insult, and you suggested "i use parquet" to a question that obviously involved CSV data... which cannot be obtained in any other format....

agile owl Feb 9, 2024, 3:38 PM

#

calling ppl son is typically considered a sign of disrespect

dry geyser Feb 9, 2024, 3:39 PM

#

if you cant take humor you should not be hopping into the internet

#

he spent ~1hr offended because someone made a "son" joke in an internet channel. solid,

agile owl Feb 9, 2024, 3:41 PM

#

anyway probably time to move on from that, what's the issue exactly, that polars in Python is underperforming polars in Rust on a single threaded app?

dry geyser Feb 9, 2024, 3:41 PM

#

no, pyarrow

rocky spade Feb 9, 2024, 3:45 PM

#

dry geyser if you cant take humor you should not be hopping into the internet

Okay son, i didn't know that was humor. probably make a distributed system rather than reading a single file in one thread. You don't need to communicate at run time, you just need to generate two reports at the end, it would probably take half the time. Just a coding issue, good luck

#

Multithreaded is one process

agile owl Feb 9, 2024, 3:46 PM

#

threads should be more efficient for IO bound things

buoyant vine Feb 9, 2024, 3:47 PM

#

😅 Ngl I think making this a distributed system for this task is a bit overkill

dry geyser Feb 9, 2024, 3:47 PM

#

he also doesnt understand how threads work apparently

buoyant vine Feb 9, 2024, 3:47 PM

#

Especially if you don't need the cluster all the time, no matter what system you use, managing the cluster suckkks

past meteor Feb 9, 2024, 3:47 PM

#

Going distributed is a special kind of pain you want to avoid imo

agile owl Feb 9, 2024, 3:47 PM

#

you use processes for CPU bound things in Python because of the GIL but for IO bound things you can use threads
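
A minimal illustration of the thread side of that rule, using the standard library. `fake_io_task` is a made-up stand-in that sleeps instead of doing real IO; the sleep releases the GIL in the same way a blocking read or network call would, so the calls overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task(n):
    """Stand-in for an IO-bound call such as a network request or disk read."""
    time.sleep(0.05)  # the GIL is released while the thread waits
    return n * 2

# Four worker threads share the eight tasks; map() keeps the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io_task, range(8)))
```

For CPU-bound pure-Python work, `concurrent.futures.ProcessPoolExecutor` offers the same interface but runs the tasks in separate processes, each with its own interpreter and GIL.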

past meteor Feb 9, 2024, 3:48 PM

#

It's a high price you pay for a nonexistent reward if it all fits in 1 machine

agile owl Feb 9, 2024, 3:48 PM

#

@past meteor just live your entire life in distributed async land and treat everyone else like a baby

buoyant vine Feb 9, 2024, 3:48 PM

#

Also, probably worth mentioning pyarrow is written as a native extension, it releases the GIL in its parsers 😅 So you get the full use of the CPU.

long locust Feb 9, 2024, 3:48 PM

#

Hey there, just for the record please remain civil, "RTFM" is not a very friendly phrase

past meteor Feb 9, 2024, 3:49 PM

#

agile owl @past meteor just live your entire life in distributed async land and t...

One of our colleagues made a #FaultTolerant #Microservice in #Elixir

buoyant vine Feb 9, 2024, 3:49 PM

#

buoyant vine Also, probably worth mentioning pyarrow is written as a native extension, it rel...

The limiter is going to and from python land from pyarrow and these native systems.

past meteor Feb 9, 2024, 3:49 PM

#

I think it could've been a sync flask app in a couple of days

#

Fault tolerant? It failed 😭 (it's currently down)

rocky spade Feb 9, 2024, 3:49 PM

#

dry geyser he also doesnt understand how threads work apparently

So what's the issue then, you already USED MULTITHREADED PyARROW Packages, And ASKING FOR HELP

long locust Feb 9, 2024, 3:49 PM

#

rocky spade So what's the issue then, you already USED MULTITHREADED PyARROW Packages, And AS...

no need to shout

buoyant vine Feb 9, 2024, 3:50 PM

#

past meteor Fault tolerant? It failed 😭 (it's currently down)

What was it even supposed to do?

rocky spade Feb 9, 2024, 3:51 PM

#

From today i didn't see multithreaded is very impressive, do you know how to write parallel and make a use?

past meteor Feb 9, 2024, 3:52 PM

#

buoyant vine What was it even supposed to do?

We ran a clinical trial. All it had to do was call an API. For some reason he really really wanted to make it stream data so he ended up polling the API every few seconds. Problem is, he messed up and we had tons of duplicates.

Secondly, batch would've been totally fine for us. Just calling the API once every day or every half day solved the problem.

#

There's a couple more microservices but those are basically there for what I believe is obfuscation

buoyant vine Feb 9, 2024, 3:53 PM

#

rocky spade From today i didn't see multithreaded is very impressive, do you know how to wri...

I am not sure what you are trying to say here.

lucid hornet Feb 9, 2024, 3:53 PM

#

Ohhhh, I was getting arrow and pyarrow confused. Was trying to figure out why a datetime library would need a csv reader

buoyant vine Feb 9, 2024, 3:54 PM

#

past meteor We ran a clinical trial. All it had to do was call an API. For some reason he re...

F 😅 I have currently taken dev down because the scale management and partners wanted for a service was much much higher than the price tag they were going to pay, and having to do some aggressive optimizations

#

unfortunately, Docker images coming at a cool 24GB in size compressed

#

and we ran out of ephemeral storage sadge

past meteor Feb 9, 2024, 3:54 PM

#

We have 3 services, one polls data source A, another lets our clinical partner upload patient info and a last one polls data source C

buoyant vine Feb 9, 2024, 3:54 PM

#

lucid hornet Ohhhh, I was getting arrow and pyarrow confused. Was trying to figure out why a...

Oh god yeah I forgot arrow in python is a dt lib

past meteor Feb 9, 2024, 3:55 PM

#

To do any query you need to join so many API keys 😩

buoyant vine Feb 9, 2024, 3:55 PM

#

Why would they name the datetime lib the same thing as the well-known data format sadge

buoyant vine Feb 9, 2024, 3:55 PM

#

past meteor To do *any* query you need to join so many API keys 😩

JWT service ftw

past meteor Feb 9, 2024, 3:55 PM

#

Keycloak 🥴

past meteor Feb 9, 2024, 3:56 PM

#

buoyant vine F 😅 I have currently taken dev down because the scale management and partners w...

hahaha this is hilarious

#

Well, you get what you pay for

rocky spade Feb 9, 2024, 3:57 PM

#

dry geyser he also doesnt understand how threads work apparently

use CUDA, CUDA is great, multithreaded, single processor, strong

buoyant vine Feb 9, 2024, 3:57 PM

#

underestimating the performance of AI models™️

left tartan Feb 9, 2024, 3:57 PM

#

buoyant vine F 😅 I have currently taken dev down because the scale management and partners w...

Yah, I hate this… I hate when cost control is 10x the initial effort

past meteor Feb 9, 2024, 3:57 PM

#

Honestly, the project started before I joined. If I were there from the start I'd have challenged many questionable decisions

lucid hornet Feb 9, 2024, 3:58 PM

#

buoyant vine Why would they name the datetime lib the same thing as the well-known data format ...

Got curious. Apparently the datetime lib came first

past meteor Feb 9, 2024, 3:58 PM

#

I think ultimately what $dev did was resume driven development

lucid hornet Feb 9, 2024, 3:58 PM

#

rocky spade use CUDA, CUDA is great, multithreaded, single processor, strong

And NVidia specific, isn't it?

left tartan Feb 9, 2024, 3:58 PM

#

rocky spade use CUDA, CUDA is great, multithreaded, single processor, strong

I think jimmyhoffa understands these things and is throwing good horsepower against it.

buoyant vine Feb 9, 2024, 3:58 PM

#

left tartan Yah, I hate this… I hate when cost control is 10x the initial effort

Tbh it was the opposite here, but they were rushing to deploy to prod and sell the service before the system was optimized and we knew how it scaled.

left tartan Feb 9, 2024, 3:58 PM

#

buoyant vine Tbh it was the opposite here, but they were rushing to deploy to prod and sell t...

Oh, hah. I feel you

buoyant vine Feb 9, 2024, 3:59 PM

#

"We can afford a bit of a price increase, its not an issue for us"
But can you afford a 100x increase

rocky spade Feb 9, 2024, 3:59 PM

#

So I was thinking, in that situation, can he cut the file in half, such as finding a way to read only half of it at a time, and then read the halves in parallel?

lucid hornet Feb 9, 2024, 4:00 PM

#

Can also be streamed in, but I don't know if that helps with a csv

rocky spade Feb 9, 2024, 4:00 PM

#

Because there is no way you can improve things when you use a PACKAGE

left tartan Feb 9, 2024, 4:00 PM

#

Certainly, but I think first question is: what is the current bottleneck and why?

rocky spade Feb 9, 2024, 4:01 PM

#

left tartan Certainly, but I think first question is: what is the current bottleneck and why...

He didn't say any of that, but he was strongly against my advice (distributed and parallel) when I read 3 s of his sentences

buoyant vine Feb 9, 2024, 4:02 PM

#

it is already parallel, and distributed is overkill 😅

agile owl Feb 9, 2024, 4:03 PM

#

does this belong to the class of problems where we're complaining that Python is just slower than Rust and end up saying that if you don't like the Python performance then don't use python

#

because that's what it seems like

left tartan Feb 9, 2024, 4:05 PM

#

@dry geyser i am curious what your bottleneck is, but if you’re done with this conversation I don’t want to drag it out. Can you share more info about the per line processing?

true spade Feb 9, 2024, 4:07 PM

#

past meteor In notebooks I start off with rough code and then I make it better and potential...

I see, that makes sense, thanks for the explanation

rocky spade Feb 9, 2024, 4:25 PM

#

stupid code error, code structure ; package problem
solution=> write your own fucking package, parallel and distributed.

#

Does anyone know multithreaded is one single processor right?

#

I don't see any difference with Python concurrency

dry geyser Feb 9, 2024, 4:30 PM

#

@left tartan the bottleneck is the validation/coalescing/etc. without it it's ~34k/s, roughly 15-20MB/s, pyarrow without any validation or pandas/dataframe conversion can maybe go up to 100MB/s

#

there is a dual conversion for the dataframes happening too. the final product is a deduplicated, coalesced dict with validated information (including some dynamic expressions, but i have tested without that too, similar to asteval)

left tartan Feb 9, 2024, 4:32 PM

#

dry geyser @left tartan the bottleneck is the validation/coalescing/etc. without i...

Any opportunity to vectorize the validation/coalescing?

dry geyser Feb 9, 2024, 4:33 PM

#

i already do it with the validation by building lookup tables and processing the columns by index, for coalescing it's trickier because i support complex inter-column logic. ex. if column X has value Z, set field to Z, else take value from Y

buoyant vine Feb 9, 2024, 4:34 PM

#

rocky spade I don't see any difference with Python concurrency

?

dry geyser Feb 9, 2024, 4:34 PM

#

will need to consider a similar approach, but because it builds a dict to be batched for elastic indexing, it is less trivial than vectorizing the validation, which in the end works with a list/set, so we can basically assume column N has validator X and it will remain constant

#

the coalescing is not immediately solvable since we need to iterate thru the validated data, find the dups, remove them, and so on.

#

however the gains later are immense because the indexed data never needs touchups

left tartan Feb 9, 2024, 4:36 PM

#

dry geyser i already do it with the validation by building lookup tables and processing the...

Cross-column coalescing is (probably) straightforward if you build a table or dataframe for each batch.

dry geyser Feb 9, 2024, 4:36 PM

#

so i dont have to deal with any of the annoyances in ES for updates

left tartan Feb 9, 2024, 4:36 PM

#

But I get the dupe detection problem

dry geyser Feb 9, 2024, 4:37 PM

#

ex. multiple columns contain an identifier, which sometimes repeats. i get rid of all the dupes.
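One way to read that dedup step, as a rough pure-Python sketch. It assumes "dupe" means a row whose identifier already appeared in any of the identifier columns of an earlier row; the column names `id_a` and `id_b` and the `dedupe_rows` helper are hypothetical, not from the actual processor:

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Row = Dict[str, Optional[str]]


def dedupe_rows(rows: Iterable[Row], id_columns: Tuple[str, ...]) -> Iterator[Row]:
    """Yield only rows whose identifiers have not been seen in any earlier row."""
    seen: Set[str] = set()
    for row in rows:
        # Collect the non-empty identifiers carried by this row.
        ids = {row[c] for c in id_columns if row.get(c)}
        if ids & seen:
            continue  # at least one identifier repeats: drop the row
        seen |= ids
        yield row


rows = [
    {"id_a": "x1", "id_b": None, "name": "first"},
    {"id_a": None, "id_b": "x1", "name": "repeat of x1"},
    {"id_a": "x2", "id_b": "x3", "name": "second"},
    {"id_a": "x3", "id_b": None, "name": "repeat of x3"},
]
unique = list(dedupe_rows(rows, ("id_a", "id_b")))
print([r["name"] for r in unique])
```

The `seen` set is the part that resists vectorizing: each row's fate depends on every row before it, which is why this step is harder to push into column expressions than the per-column validation.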

left tartan Feb 9, 2024, 4:37 PM

#

** I’m a DuckDB shill so my first experiment would be to load to a DuckDB table, and do it all in sql.

dry geyser Feb 9, 2024, 4:37 PM

#

a very well respected math-head recommended duckdb to me for this project but i saw some limitations as i need near realtime text lookups

#

i augment the data externally with edgedb for holding some relational data/caching some searches

#

i would be interested in talking about how it would work with duckdb though

#

the problem for me was the massive amount of potential idempotent inserts

left tartan Feb 9, 2024, 4:39 PM

#

Oh, I was just thinking for processing. You might then export and use another way for lookups.

dry geyser Feb 9, 2024, 4:39 PM

#

ex. identifiers connected to a given object being repeated

#

suddenly i end up with 15 mil select or insert queries = no go

#

(hence elastic)

#

@buoyant vine has been helping me grok polars to adapt the current csv processor, there are some hiccups but apparently polars has an expr engine

#

@buoyant vine ill ping you about the native expr stuff in polars

#

got a mockup with polars going

#

$ time python testpolars.py tests/fixtures/..._500k.csv

real 0m1.014s
user 0m2.132s
sys 0m0.323s

#

14mil records in 28seconds, with boolean conversion already done

buoyant vine Feb 9, 2024, 5:06 PM

#

I hope that is a good sign 😅

dry geyser Feb 9, 2024, 5:07 PM

#

What would be the equivalent for handling dates? ex. attempt auto conversion

#

assuming UTC

#

(or no tz)

#

try_parse_dates

buoyant vine Feb 9, 2024, 5:14 PM

#

dry geyser What would be the equivalent for handling dates? ex. attempt auto conversion

https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.str.to_datetime.html

rocky spade Feb 9, 2024, 5:17 PM

#

@left tartan for asyncio, I understand it is single-thread concurrency, but if it is one thread, how is there a loop manager that can control the loop?

#

because the loop manager is always in current concurrency and never leave or change?

dry geyser Feb 9, 2024, 5:32 PM

#

read how epoll() is implemented to understand how it can do what it does in a "single thread"

left tartan Feb 9, 2024, 5:32 PM

#

rocky spade @left tartan for asyncio, i understand it is one thread concurrency, bu...

For asyncio threads, there’s a scheduler/loop involved, yes. See asyncio https://docs.python.org/3/library/asyncio-eventloop.html

rocky spade Feb 9, 2024, 5:33 PM

#

left tartan For asyncio threads, there’s a scheduler/loop involved, yes. See asyncio <https:...

I messaged you privately

left tartan Feb 9, 2024, 5:33 PM

#

rocky spade @left tartan for asyncio, i understand it is one thread concurrency, bu...

The inner workings here is not something I’m very familiar with.

rocky spade Feb 9, 2024, 5:33 PM

#

left tartan The inner workings here is not something I’m very familiar with.

Thanks

rocky spade Feb 9, 2024, 5:37 PM

#

left tartan The inner workings here is not something I’m very familiar with.

I saw this but never understood why it is not parallel when one task is running and then switches to another task when it yields. So basically there is only one thread, and inside the thread the scheduler calls other concurrent tasks when they report back or are ready. But if it is not parallel, how would they know? => so when one task is awaiting, there is a list to check whether other tasks are ready?

#

I do lack a basic understanding of processors and concurrency at the programming level

dry geyser Feb 9, 2024, 5:39 PM

#

https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/ under the hood.

Julia Evans

Async IO on Linux: select, poll, and epoll
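What "waiting without busy polling" looks like from Python, as a small sketch using the standard `selectors` module (it picks epoll, kqueue, or select depending on the platform). The socket pair is only a stand-in for a real network connection:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
reader, writer = socket.socketpair()
reader.setblocking(False)

# Tell the OS which socket we care about; no thread is created for it.
sel.register(reader, selectors.EVENT_READ)

# Nothing has been written yet, so a zero-timeout check reports nothing ready.
ready_before = sel.select(timeout=0)

writer.send(b"ping")

# Now the OS reports the socket as readable; we only read when told to.
ready_after = sel.select(timeout=1)
received = [key.fileobj.recv(4) for key, _ in ready_after]

sel.unregister(reader)
reader.close()
writer.close()
sel.close()

print(ready_before, received)
```

The single thread hands the list of sockets to the kernel and sleeps inside `select()` until one of them is ready, which is the piece an event loop builds on.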

left tartan Feb 9, 2024, 5:40 PM

#

rocky spade I saw this but never understand is why it is not parallel when one task is runni...

Python threads run concurrently, but not in parallel. Meaning: multiple threads can be started, but only one runs at a given moment in time.

rocky spade Feb 9, 2024, 5:40 PM

#

left tartan Python threads run concurrently, but not in parallel. Meaning: multiple threads ...

I'm curious about how the data transaction works in one thread

left tartan Feb 9, 2024, 5:40 PM

#

The scheduler handles assigning the work: a thread can be preempted so that another thread can run.

#

I’m not familiar with the internal mechanism of how the scheduler works.

#

(There’s a more complicated discussion about ‘why’, which leads to the GIL and eventually PEP 703)

rocky spade Feb 9, 2024, 5:42 PM

#

Do they open sourced it ?

left tartan Feb 9, 2024, 5:42 PM

#

rocky spade Do they open sourced it ?

Yes, cpython is open source

rocky spade Feb 9, 2024, 5:44 PM

#

left tartan Yes, cpython is open source

I don't want to read cpython..

#

I thought Python is open sourced..

left tartan Feb 9, 2024, 5:45 PM

#

Cpython is Python (well, there’s others but it’s the one you’re using)

past meteor Feb 9, 2024, 5:45 PM

#

rocky spade I saw this but never understand is why it is not parallel when one task is runni...

The way I'd always explain it (a bit hand-wavy) is that concurrency is an idea and parallelism is one specific implementation, asynchronous programming is another. Python's async/await is based on event-driven programming (which is a way to do async), you have an event loop that submits tasks with a callback. When the task is done it's put in a queue that the scheduler checks frequently to see what tasks can be resumed. True parallelism isn't possible in pure Python because of the global interpreter lock.
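A small sketch of that idea: two coroutines take turns on one thread, switching only at `await`. The `worker` function and the step counts are made up for illustration:

```python
import asyncio
import threading

log = []


async def worker(name: str, steps: int) -> None:
    for i in range(steps):
        log.append((name, i, threading.get_ident()))
        # Hand control back to the event loop so another task can run.
        await asyncio.sleep(0)


async def main() -> None:
    await asyncio.gather(worker("a", 2), worker("b", 2))


asyncio.run(main())

order = [(name, i) for name, i, _ in log]
thread_ids = {tid for _, _, tid in log}
print(order)       # the two workers interleave
print(thread_ids)  # a single thread id
```

Nothing runs in parallel here: each task runs until its next `await`, then the loop picks the next ready task.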

rocky spade Feb 9, 2024, 5:51 PM

#

left tartan The scheduler handles assigning the work: a thread can be preempted so that anot...

Thanks

rocky spade Feb 9, 2024, 5:51 PM

#

left tartan Cpython is Python (well, there’s others but it’s the one you’re using)

For real?

rocky spade Feb 9, 2024, 5:52 PM

#

past meteor The way I'd always explain it (a bit hand-wavy) is that concurrency is an *idea*...

But assuming the current task is running, is the scheduler just checking like crazy every few milliseconds, and does doing this stop the current task?

rocky spade Feb 9, 2024, 5:53 PM

#

past meteor The way I'd always explain it (a bit hand-wavy) is that concurrency is an *idea*...

And when we say callback, what is a callback exactly? Does a callback need to check or return something?

rocky spade Feb 9, 2024, 5:54 PM

#

past meteor The way I'd always explain it (a bit hand-wavy) is that concurrency is an *idea*...

But how about multiprocessing module, isn't it a true parallel in Python?

#

is there any way to see the code directly, like what the callback and scheduler are in Python?

#

Cyphton...

left tartan Feb 9, 2024, 5:57 PM

#

rocky spade But how about multiprocessing module, isn't it a true parallel in Python?

Multi process runs fully independent processes, very different and fully ‘parallel’. But, they don’t share objects.

rocky spade Feb 9, 2024, 5:58 PM

#

left tartan Multi process runs fully independent processes, very different and fully ‘parall...

So that's true parallel, so Python GIL is just a way for thread safe or memory safe or something like that

#

Because after I learned about the multiprocessing module and saw its documentation, the impression is that the GIL is just a joke?

#

for most common way of using?

#

I don't fully understand GIL, i just assume it is just locked the thread or something intentionally

past meteor Feb 9, 2024, 6:00 PM

#

rocky spade But assuming current task is running, then the scheduler just like checking crea...

You only need to check the queue when a task has finished or `await`ed to schedule the next one. The loop uses select, poll, epoll, ... like jimmyhoffa has mentioned. Their advantage is that you don't need to actively poll, which means you don't need to keep asking the task "are you done? are you done? are you done?".

The callback is really abstracted away in async/await. Another hand-wavy explanation: the callback here would be the code that follows the await. That's what needs to be done when the event is finished.
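To make the "callback is the code after the await" point concrete, here is a sketch of the same follow-up step written both ways. `fetch` and the short sleep are stand-ins for a real I/O operation:

```python
import asyncio

events = []


async def fetch() -> str:
    await asyncio.sleep(0.01)  # stand-in for waiting on I/O
    return "payload"


async def callback_style() -> None:
    # Explicit callback: the loop calls this function when the task is done.
    task = asyncio.ensure_future(fetch())
    task.add_done_callback(lambda t: events.append(("callback", t.result())))
    await task
    await asyncio.sleep(0)  # give the loop a turn in case the callback is still queued


async def await_style() -> None:
    # Implicit callback: everything after the await runs once fetch() is done.
    result = await fetch()
    events.append(("await", result))


async def main() -> None:
    await callback_style()
    await await_style()


asyncio.run(main())
print(events)
```

Both versions do the same thing; `await` just lets the follow-up code sit inline instead of in a separate function.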

rocky spade Feb 9, 2024, 6:00 PM

#

past meteor You only need to check the queue when a task has finished or `await`ed to schedu...

Appreciated

#

!

past meteor Feb 9, 2024, 6:02 PM

#

Do you know about generators?

rocky spade Feb 9, 2024, 6:02 PM

#

I checked the yield, so i know about it, somehow

#

I understand the code and the concept

past meteor Feb 9, 2024, 6:03 PM

#

Well, let me not confuse you 😄 I think this is more than enough information for one day haha

#

Just write code and it'll become clear

rocky spade Feb 9, 2024, 6:03 PM

#

please do more

#

just asyncio if is not parallel confused me about 5 months

left tartan Feb 9, 2024, 6:05 PM

#

rocky spade just asyncio if is not parallel confused me about 5 months

This is one of the more complicated / confusing topics in Python.

past meteor Feb 9, 2024, 6:05 PM

#

rocky spade please do more

I'm going to have to move on now, we have an entire channel for this stuff though in #async-and-concurrency

#

The most important thing, imo, is to understand that concurrency is an idea that has multiple implementations

#

It's like an abstract class if you may 😄

left tartan Feb 9, 2024, 6:06 PM

#

rocky spade just asyncio if is not parallel confused me about 5 months

Check out this article, pinned in #async-and-concurrency

dull copper Feb 9, 2024, 6:11 PM

#

Should i start with naive bayes or linear regression?

past meteor Feb 9, 2024, 6:12 PM

#

dull copper Should i start with naive bayes or linear regression?

Linear regression is a good place to start

dull copper Feb 9, 2024, 6:13 PM

#

past meteor Linear regression is a good place to start

Thanks

rocky spade Feb 9, 2024, 6:14 PM

#

Just asking, do you guys know anything that can fix my fundamental problem: how to code at a deep-down level, such as communicating directly with bytes, or how to build something memory safe, very detailed stuff rather than just using a high-level language? Everything in between bytes and a high-level language

#

I checked CS50, they explained memory safety and those topics

#

but i do want more of it

#

I checked Harvard CS50 but didn't watch it all the way through; on memory safety there was just a little bit of explanation

jagged latch Feb 9, 2024, 6:47 PM

#

I have a question for those experienced in Plotly Dash. Alright, so a little background. I am trying to recreate a dashboard from a proprietary work website, and one of the features is that it changes the SQL query based on the date chosen. I already got the SQL query running and I got the algorithm to help me generate df_2 based on the date chosen by the user (this is done through a dialog box that pops up via tkinter). I'm now working on designing the app. I wrapped all the other code in separate functions. I have a text box with a button. I basically have it so that if n_clicks > 0, I want to call all those functions I defined earlier in the Python code, prior to the app code, to generate a new df_2 based on the new date entered. Is such a thing possible?

dry geyser Feb 9, 2024, 7:09 PM

#

@rocky spade https://github.com/cia-foundation/TempleOS

GitHub - cia-foundation/TempleOS: Talk to God on up to 64 cores. Final snapshot of the Third Temple.

#

for the ultimate guide into communicating with bytes, god, and everything in between

#

also: https://skilldrick.github.io/easy6502/

#

(offtopic)

#

all temple os humor aside, https://github.com/akshitamittel/Minix3-Schedulers/blob/master/Report.pdf

GitHub - akshitamittel/Minix3-Schedulers (Report.pdf at master): Implementation of the Multilevel Feedback Queue and Priority algorithms in the Minix3 Schedulers.

craggy patio Feb 9, 2024, 7:46 PM

#

For all you AI wizards, I am planning on making a voice detection model with a CNN. I am taking the greyscale spectrogram of my voice and feeding it into the model to be analyzed. Here is a simple diagram showcasing my plan

Input: (batch_size, 1, height, width)
   |
   v
Conv1 (3x3 kernel, 32 filters)
   |
   v
Activation (ReLU)
   |
   v
MaxPool2d (2x2 window, stride=2)
   |
   v
Conv2 (3x3 kernel, 64 filters)
   |
   v
Activation (ReLU)
   |
   v
MaxPool2d (2x2 window, stride=2)
   |
   v
Flatten
   |
   v
Fully Connected (Linear) Layer (64 * 16 * 16 -> 128)
   |
   v
Activation (ReLU)
   |
   v
Fully Connected (Linear) Layer (128 -> 2 classes)
   |
   v
Output: (batch_size, 2)

Please give me some suggestions on how to improve this model
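One thing worth checking before training is whether the `64 * 16 * 16` input to the first linear layer matches the spectrogram size. A quick sketch of the shape arithmetic, using the standard output-size formula for convolution and pooling; the 64x64 input and the padding values are assumptions, since the diagram does not state them:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, window: int = 2, stride: int = 2) -> int:
    """Output length of one spatial dimension after max pooling."""
    return (size - window) // stride + 1


def flattened_features(height: int, width: int, padding: int, channels: int = 64) -> int:
    """Feature count after two (3x3 conv -> 2x2 pool) stages, as in the diagram."""
    h, w = height, width
    for _ in range(2):
        h, w = conv_out(h, padding=padding), conv_out(w, padding=padding)
        h, w = pool_out(h), pool_out(w)
    return channels * h * w


with_padding = flattened_features(64, 64, padding=1)
without_padding = flattened_features(64, 64, padding=0)
print(with_padding, without_padding)
```

With a 64x64 input, `64 * 16 * 16` only comes out when the convolutions use `padding=1`; without padding the flattened size is smaller and the linear layer would raise a shape error.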

final kiln Feb 9, 2024, 7:51 PM

#

I finally debugged the redis issue

#

Seems like the model is gonna plateau

#

I believe a 0.8 loss is acceptable tho

craggy patio Feb 9, 2024, 8:06 PM

#

do u think my model is good?

blissful hatch Feb 9, 2024, 8:14 PM

#

Hello

final kiln Feb 9, 2024, 8:28 PM

#

craggy patio do u think my model is good?

Only one way to find out

merry ridge Feb 9, 2024, 8:44 PM

#

wooden sail honestly this depends on your setup. i would suggest you make a plot showing the...

I eventually figured out the issue by using this suggestion, but the reason the solution to the PDE was bad was kind of silly. I was using tf.square on a tensor of shape (N,) and on one of shape (N,1). This was causing something funny to the way the gradient of my loss function was calculated in a way I still don't really understand. Anyway, thanks for the tip.
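A likely explanation, offered as an assumption since the message does not say which operation combined the two tensors: TensorFlow follows NumPy-style broadcasting, so an elementwise operation between shapes `(N,)` and `(N, 1)` silently produces an `(N, N)` result instead of `N` values. The rule itself can be sketched without any tensor library (`broadcast_shape` is a hypothetical helper):

```python
from itertools import zip_longest
from typing import Tuple


def broadcast_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Result shape of an elementwise op under NumPy-style broadcasting."""
    out = []
    # Compare dimensions from the right; missing leading dimensions count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


n = 5
mixed = broadcast_shape((n,), (n, 1))
matched = broadcast_shape((n, 1), (n, 1))
print(mixed, matched)
```

A loss that averages over an `(N, N)` matrix of pairwise terms still returns a scalar, so nothing errors out; the gradient is just for the wrong quantity. Reshaping both tensors to the same shape before combining them avoids it.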

final kiln Feb 9, 2024, 8:48 PM

#

this is how my pipeline is looking

wooden sail Feb 9, 2024, 10:21 PM

#

merry ridge I eventually figured out the issue by using this suggestion, but the reason the ...

oops 😛 well, glad that worked out

versed pilot Feb 9, 2024, 10:52 PM

#

rocky spade Just asking, do you guys know anything can fix my fundamental problem like how t...

To me it sounds like you should look into learning C, understanding pointers and pointer arithmetic, malloc, etc. Not really data science or even Python though. CUDA might be the only data-related thing I'm aware of that has some similarities to this sort of low-level programming.

limber mesa Feb 10, 2024, 4:16 AM

#

jagged latch I have a question to those experienced in Plotly Dash. Alright so a little backg...

Hey.
I'm sure it is. I just never used Dash before so I'm not too sure how or what needs to be done to get the input from the button and use that to update.

limber mesa Feb 10, 2024, 4:18 AM

#

jagged latch I have a question to those experienced in Plotly Dash. Alright so a little backg...

https://dash.plotly.com/dash-html-components/button

There’s a basic but good example using an input button.
You can start all your function calls from there I suppose.

Button | Dash for Python Documentation | Plotly

html Button components are commonly used in Dash callbacks.

coral bloom Feb 10, 2024, 4:25 AM

#

heyyy

#

can anyone help me solve this?

```sh
OSError: Unable to load weights from pytorch checkpoint file for './pytorch_model-00001-of-00006.bin' at './pytorch_model-00001-of-00006.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
```

#

Loading checkpoint shards:   0%|                                                                                                                                     | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 531, in load_state_dict
    return torch.load(
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 540, in load_state_dict
    if f.read(7) == "version":
  File "H:\py39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 389: character maps to <undefined>

dry geyser Feb 10, 2024, 4:41 AM

#

@buoyant vine might have found an issue with how polars handles schema/dtypes

#

there seems to be an obscure bug where the index for some columns is offset by one

#

the mismatch leads to an issue later on where the index used to assign a field is not the one expected, ex. from the computed headers of the csv

coral bloom Feb 10, 2024, 4:51 AM

#

```sh
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
```
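That last error often points at an incomplete or corrupted download rather than a code problem: recent PyTorch checkpoints are zip archives, and "failed finding central directory" is what a truncated zip looks like. A quick check that needs only the standard library (the file names below are temporary test files, not the real shards):

```python
import os
import tempfile
import zipfile


def looks_like_complete_zip(path: str) -> bool:
    """True if the file ends with a readable zip central directory."""
    return zipfile.is_zipfile(path)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small valid archive, then a truncated copy of it.
    good = os.path.join(tmp, "good.bin")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("data.pkl", b"x" * 1024)

    truncated = os.path.join(tmp, "truncated.bin")
    with open(good, "rb") as src, open(truncated, "wb") as dst:
        dst.write(src.read()[:200])

    good_ok = looks_like_complete_zip(good)
    truncated_ok = looks_like_complete_zip(truncated)

print(good_ok, truncated_ok)
```

If `looks_like_complete_zip("./pytorch_model-00001-of-00006.bin")` comes back `False`, comparing the file size against the one listed at the download source and re-downloading that shard is the usual next step.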

dry geyser Feb 10, 2024, 4:53 AM

#

any polars guru around?

dry geyser Feb 10, 2024, 5:54 AM

#

@limber mesa hey

#

🙂

limber mesa Feb 10, 2024, 5:54 AM

#

👋 ola

true spade Feb 10, 2024, 5:54 AM

#

past meteor In notebooks I start off with rough code and then I make it better and potential...

Hi there, sorry to necro this message again, but I was just curious about what your definition of "make it work" would be in this case?

Would it be to ensure that the code runs without errors and performs its designated task correctly?

Or would it be to be able to make new useful observations/gain valuable insights into what you are doing within the notebook (i.e. exploring/analyzing data, training and finding suitable models to address a certain problem)?

Or would you say that "make it work" means something else in this case?

dry geyser Feb 10, 2024, 5:56 AM

#

so i found the following: if i specify schema to my scan_csv, i can double performance by skipping the type inference, but it seems to skip columns. i made a single record test case to test and confirmed the problem. basically i depend on headers (array of column name) being static/having fixed indices. i have optimized most of the logic to do away with named/dict based access, so it's all index-referenced. the problem manifested when i noticed some columns were assigned to a shifted index. ex. birth date column got shifted by one, and it picked the wrong value.

#

if i use dtypes instead of schema, the problem disappears

#

is schema expected to be in order?

#

is there a way for me to disable type inference for any field not specified in dtypes passed to scan_csv?

teal lance Feb 10, 2024, 6:03 AM

#

dry geyser Feb 10, 2024, 6:09 AM

#

@limber mesa also, could you explain to me how the filtering and expr engine works?

#

tl;dr of course, no need to go in depth

#

ex. what happens when i build several expressions and pass them to my lazyframe

teal lance Feb 10, 2024, 6:18 AM

#

✅

teal lance Feb 10, 2024, 6:20 AM

#

teal lance ✅

Look where he was at 👌🏾

limber mesa Feb 10, 2024, 6:25 AM

#

dry geyser so i found the following: if i specify schema to my scan_csv, i can double perfo...

Hey, if referring to pandas, you're better off using named columns and accessing by name. pandas works with indices but I believe it's not made for it. And as you've noticed, if one of the columns is in a different order. Everything messes up as things are not what you think they are. I suppose it's the reason people prefer dicts over lists after a while. They both have their own use cases but yeah.

dry geyser Feb 10, 2024, 6:26 AM

#

polars

#

i use lazyframe

#

setting inference length to 0 does the trick

#

im not sure how polars handles this internally but producing a dict is expensive. ill measure how much performance is lost in precise numbers, but going from named=True to named=False gave me an extra few k/s

#

i have already doubled the speed including validation

#

now writing a new validation class that builds the polars expr(s), i need to measure it though

dry geyser Feb 10, 2024, 6:28 AM

#

teal lance Look where he was at 👌🏾

lol i see a sports fan

#

anyone here also plays poker and does "things" with bigdata/stats?

teal lance Feb 10, 2024, 6:29 AM

#

dry geyser lol i see a sports fan

I figured how they are shorting players in a certain area it’s a small market 🔥

dry geyser Feb 10, 2024, 6:36 AM

#

haha

#

Is there a better way to do this:

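import polars as pl

# Placeholder patterns for illustration; the real ones are not shown in the chat
good_regex = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
bad_regex = r"^no@"

df = pl.DataFrame({
    "emails": ["johndoe@hello.com", "bob@gmail.com", "bogus", "a@a.com", "no@a.com"]
})

# Keep only values that look like an email, everything else becomes null
filtered_df = df.with_columns(
    pl.when(pl.col("emails").str.contains(good_regex))
    .then(pl.col("emails"))
    .otherwise(pl.lit(None)).alias("emails")
)

print(filtered_df)

# Then null out known bad values
filtered_df = filtered_df.with_columns(
    pl.when(pl.col("emails").str.contains(bad_regex))
    .then(pl.lit(None))  # Set bad emails to None; adjust as needed for your use case
    .otherwise(pl.col("emails")).alias("emails")
)

print(filtered_df)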

#

ex. combining both expressions

#

in one statement

#

and yes we could do a massive regexp in one shot, but for the purpose of figuring out how to best write polars exprs, lets assume two singular regexps, one for basic email format validation/standard conformance, and the other for known bad values

#

how is this applied internally?

#

ex. can I keep altering the df and my validation remains present for other columns?

#

especially in the context of a LazyFrame

past meteor Feb 10, 2024, 7:03 AM

#

true spade Hi there, sorry to necro this message again, but I was just curious about what y...

No you're good 🙂 to me "make it work" just means making the code run without errors

dry geyser Feb 10, 2024, 7:06 AM

#

hey @past meteor

#

another question: suppose I want to validate alpha2 country codes, i can precompute a table of known good values from pycountry. is there a way to integrate this into the polars validation?

royal badge Feb 10, 2024, 7:07 AM

#

I have finished learning C (for a better understanding of computer science and related concepts), and now I am learning Python. I want to know what I need to learn first in Python so that I can code in Python, before moving on to things like pandas, numpy, scikit, etc. Is there anything in between the basics of Python and pandas, numpy, etc.? Can you tell me all the basic topics to cover before learning maths and then moving on to pandas, numpy, scikit, etc.?

In addition, also tell me which laptop I should purchase.

dry geyser Feb 10, 2024, 7:08 AM

#

or rephrasing the question: how expensive is it to include an expr for a given column that might have ~200 item list.

past meteor Feb 10, 2024, 7:09 AM

#

dry geyser how is this applied internally?

You can read the query plan: https://docs.pola.rs/user-guide/lazy/query-plan/#graphviz-visualization

#

That will typically answer your question

dry geyser Feb 10, 2024, 7:12 AM

#

i can easily rewrite the static validators into expr ones, precompute the list and then pass it to with_columns (AFAIK). if i can make all the validation logic into exprs, i can remove a costly for loop altogether

past meteor Feb 10, 2024, 7:13 AM

#

dry geyser i can easily rewrite the static validators into expr ones, precompute the list a...

exactly

dry geyser Feb 10, 2024, 7:13 AM

#

the trickier ones are those with more convoluted logic like pycountry stuff. i already use a lookup table made for the task

#

basically anything that involves iterating through rows is huge bottleneck

#

is a*

past meteor Feb 10, 2024, 7:13 AM

#

dry geyser or rephrasing the question: how expensive is it to include an expr for a given c...

What's the type? str? list[str]?

dry geyser Feb 10, 2024, 7:13 AM

#

yessir

#

@past meteor like so:

#

class CountryValueValidator(blahblaStaticValidator):
    @staticmethod
    def validate(value: str, options: Dict, **kwargs) -> str:
        if value is None or not isinstance(value, str):
            return None
        
        if value == '':
            return None
        
        country = None
        
        if len(value) == 2:
            country = pycountry.countries.get(alpha_2=value)
        elif len(value) == 3:
            country = pycountry.countries.get(alpha_3=value)
        else:
            try:
                country = pycountry.countries.lookup(value)
            except LookupError:
                # Fall back to caller-supplied corrections (e.g. "EIRE" -> "Ireland").
                local_fixes = options.get('mapping_fixes') or {}
                corrected = local_fixes.get(value)
                if corrected is not None:
                    country = pycountry.countries.lookup(corrected)
                else:
                    print(f"country failed lookup {value}")
        
        if country is None:
            return None
        
        return country.alpha_3

past meteor Feb 10, 2024, 7:14 AM

#

It's a bit too early for me to read everything haha but sure

dry geyser Feb 10, 2024, 7:14 AM

#

so, pseudo: if value length = 2, country might be in alpha2 table, if 3, alpha3 table

#

hahahah

#

i woke up and came straight to the desk like a kid

#

polars is amazing

past meteor Feb 10, 2024, 7:15 AM

#

Yeah, even if it's not faster

#

the API is just so good

#

but it is faster, so it's a double present

dry geyser Feb 10, 2024, 7:15 AM

#

i also do have occasional hiccups with the country validation, ex. some idiot decided ireland is not the ISO alpha2 code, they put EIRE

past meteor Feb 10, 2024, 7:16 AM

#

dry geyser ``` class CountryValueValidator(blahblaStaticValidator): @staticmethod d...

So each country is a string?

dry geyser Feb 10, 2024, 7:16 AM

#

which yeah, if you care for violating ISO standards due to some national identity thorn in your shoe, fine, but it's a PITA for no benefit

true spade Feb 10, 2024, 7:16 AM

#

past meteor No you're good 🙂 to me make it works means just make the code run without erro...

I see, thanks for the clarification and understanding.

For context, I had asked the question because I was and still am currently in a dilemma about whether I should resort to code duplication or creating a parameterized function to encapsulate the repetitive process of building a model, evaluating its performance (with default hyperparameters) based on 2 scoring metrics, determining the best hyperparameters for the model using GridSearchCV, rebuilding the model with the determined best hyperparameters, and re-evaluating its performance (with the best hyperparameters) based on the aforementioned 2 scoring metrics.

What are your thoughts on this?

Personally, I find that the process is quite repetitive since I am also experimenting with different transformations on a dataset and have to execute the aforementioned process once each time. Plus, I am currently only doing this on 2 models, so if I have to scale up to more models (i.e. 4 or 5 models), the amount of code duplication and the time that it will consume will also scale up drastically, thus increasing inefficiency and the time that I will require to complete this investigation.

past meteor Feb 10, 2024, 7:16 AM

#

And you check them 1 by 1

dry geyser Feb 10, 2024, 7:17 AM

#

a column is essentially either the expanded country name as a string, ex Ireland

#

or iso alpha2 code

#

hash lookup internally

#

yes

past meteor Feb 10, 2024, 7:18 AM

#

true spade I see, thanks for the clarification and understanding. For context, I had asked...

The more you code, the higher your lower-bound quality of "rush to the finish line to make my code work" will become. Just duplicate it right now, in my opinion. Fix it afterwards. There's too much cognitive overload in worrying about this right now 🙂

#

It's also very common to code something terrible quickly and not fix it. That's a huge win, it means you never needed it to be clean anyway. If it's badly done and you revisit it in the future, you fix it then

past meteor Feb 10, 2024, 7:20 AM

#

dry geyser a column is essentially either the expanded country name as a string, ex Ireland

I'm just seeing validate return str here?

#

I'm mostly "concerned" about its type, it's just a string and you have 20+ of those you need to regex against another column or one list of 20?

true spade Feb 10, 2024, 7:21 AM

#

past meteor The more you code, the higher your lower-bound quality of "rush to the finish lin...

I see, thanks for letting me know about that.

I agree with you on that as well, though to further clarify, lets just say that I had 50 LOC that needs to be duplicated and also adapted/changed (i.e. about 80% to 90% of those 50 LOC will need to be somewhat rewritten) to a high extent (since variables used will be different due to being named differently), this needs to be done 5 times, and the time taken to duplicate and adapt the code might range from several minutes to much longer, would code duplication (or rather, code duplication + code adaptation in this case) still be worthwhile in terms of time and development efficiency (i.e. human productivity, not performance)?

dry geyser Feb 10, 2024, 7:23 AM

#

@past meteor just one out of the list. unless the value is a list of known bad values (very short, ideally), if present. ex. EIRE->Ireland

#

this is not a big deal for one particular pipeline of ingestion. ex elastic, but it is for another one because the countries are pre-inserted in the database

past meteor Feb 10, 2024, 7:25 AM

#

true spade I see, thanks for letting me know about that. I agree with you on that as well,...

Yeah, I've done both. For instance, I had a case where I quickly wanted to evaluate my models on different horizons. There were 3, I just copy pasted the code initially and changed a few things. Typically my "lower bound" includes decent functions already. Just don't prematurely optimize (spending more time on organizing how to do the task than doing it)
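
#

For reference, the "parameterized function" option could look roughly like this. It is only a sketch, assuming scikit-learn, a regression task, and `r2` / `neg_mean_absolute_error` as the two metrics (the actual models, metrics and data are not stated in the conversation); `tune_and_compare` and `evaluate` are hypothetical names:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_validate

SCORING = ("r2", "neg_mean_absolute_error")  # assumed metrics


def evaluate(model, X, y, cv=3):
    """Cross-validated mean of each metric in SCORING."""
    scores = cross_validate(model, X, y, scoring=SCORING, cv=cv)
    return {metric: scores[f"test_{metric}"].mean() for metric in SCORING}


def tune_and_compare(model, param_grid, X, y, cv=3):
    """Score with defaults, grid-search, then re-score the best estimator."""
    baseline = evaluate(model, X, y, cv)
    search = GridSearchCV(model, param_grid, scoring=SCORING[0], cv=cv)
    search.fit(X, y)
    tuned = evaluate(search.best_estimator_, X, y, cv)
    return {"baseline": baseline, "best_params": search.best_params_, "tuned": tuned}


# Synthetic data purely for illustration.
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)
report = tune_and_compare(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, X, y)
print(report)
```

Each extra model or dataset transformation then costs one more call rather than another copy of the block.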

true spade Feb 10, 2024, 7:27 AM

#

past meteor Yeah, I've done both. For instance, I had a case where I quickly wanted to evalu...

I see, thanks for letting me know about that

dry geyser Feb 10, 2024, 7:28 AM

#

offtopic for my questions until now: anyone has played with models for predicting text variations? ex. suppose we have a corpus of strings, finding possible variants based off earlier changes

#

@past meteor pl.col("CUSTOMBOOL").is_in(self.csv_bool_true_values) to mimic pyarrow's boolean_true_values, will that leave the column as False if it fails the test?

dry geyser Feb 10, 2024, 8:12 AM

#

@past meteor I'm probably using the expr wrong but why would this not work:

    def prepare_boolean_columns(self, data_stream):
        unique_true_values = set(TRUE_VALUES)
        boolean_columns = []
        
        for key, value in self.config.header_types.items():
            if value['type'].__name__ == "BooleanType":
                # Check if the inner value has additional "true" values
                if 'true_values' in value:
                    unique_true_values.update(value['true_values'])
        
        boolean_exprs = []
        for column in boolean_columns:
            expr = pl.col(column).is_in(list(unique_true_values))
            logger.debug(f"Boolean column expr: {column} ({expr})")
            boolean_exprs.append(expr)
        
        return data_stream.with_columns(*boolean_exprs)

#

data_stream = self.prepare_boolean_columns(data_stream)    
rows = data_stream.collect(streaming=True)

#

the expressions arent being applied

#

rofl nevermind

#

ctrl+x removed the append for the boolean_columns

#

time for caffeine

#

still doesnt apply though

final kiln Feb 10, 2024, 9:03 AM

#

Omg training models takes so looooong D:

#

Also how come smaller batch size leads to faster convergence

#

x axis is relative time, orange batch size is the smallest

#

it does affect the LR schedule, so maybe that's the reason

#

Doesn't even matter, if they reach the same loss in the same amount of time, I'm gonna wanna do smaller batch size so I can increase model capacity and bring the final loss down

gritty vessel Feb 10, 2024, 9:27 AM

#

hey guys i trained a randomforest regressor and got these scores
are these good?
After Hyperparameter Tuning and Scaling:
Mean Squared Error: 124238.24478116012
Mean Absolute Error: 146.16615376813385
R-squared: 0.9999832719765778
r2 looks fine to me but mse and mae are high

wooden sail Feb 10, 2024, 9:29 AM

#

the numbers alone mean nothing, it depends on your application

#

look at the predictions you're getting or at percentual error
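
#

To put the posted numbers on a comparable scale, a rough sketch using only the three reported scores. It assumes MSE and R-squared were computed on the same data, so that R² = 1 - MSE / Var(y) can be inverted to estimate the spread of the target:

```python
import math

# Scores as posted above.
mse = 124238.24478116012
mae = 146.16615376813385
r2 = 0.9999832719765778

rmse = math.sqrt(mse)  # back in the units of the target

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
target_std = math.sqrt(mse / (1 - r2))

print(f"RMSE: {rmse:.1f}")
print(f"implied std of target: {target_std:.0f}")
print(f"RMSE relative to target std: {rmse / target_std:.4f}")
```

An RMSE in the hundreds against a target whose spread is in the tens of thousands is consistent with the very high R-squared, which is why the raw MSE looks large on its own.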

#

in most optimization problems, one deals with argmin problems. the value the function takes is mostly irrelevant, only the parameters that achieve the minimal value matter

river cape Feb 10, 2024, 10:05 AM

#

Hey, I have a question. It's pretty long, but please answer it

#

Suppose we have a dataset used to predict which company yields the highest profit. These are the column names:
Manufacturing spent
R&D spent
Administrative spent
State
Profit(this is our target variable)

#

So we could use a multiple linear regression model to predict the profit, right?

#

Now if we go towards the theory side of multiple regression model , we would have the formula as
y(profit) = b0(constant) + b1*x1 + b2*x2 + b3*x3 + ???
b1,b2,b3 are the slope co-efficients and x1,x2,x3 are the respective values of the first three columns

#

We can't assign a slope coefficient to the State column, because it's categorical data, right?

#

So we do the dummy variable process and use only New York column

#

But when I actually code it on Colab, we do one-hot encoding on the State column
So I am not able to understand why we need to do encoding. Can't we just separate the columns and use New York only?

tidal bough Feb 10, 2024, 10:30 AM

#

Can't we just separate the columns and use New York only?
not sure what you mean? there's also Florida in that column.

#

but it's true that if you have a categorical column with only 2 values, then instead of one-hot encoding you can just make that column boolean.
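
#

A small sketch of what the encoding does, in plain Python with a hypothetical `dummy_encode` helper. One-hot encoding is the step that turns the text column into 0/1 columns; dropping one category (the baseline absorbed by b0) is what leaves a single `New York` column when there are only two states:

```python
def dummy_encode(values, drop_first=True):
    """Return {category: [0/1, ...]} indicator columns for a categorical column."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline captured by the intercept b0.
        categories = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


states = ["New York", "Florida", "New York", "Florida"]

print(dummy_encode(states))
# {'New York': [1, 0, 1, 0]}
print(dummy_encode(states, drop_first=False))
# {'Florida': [0, 1, 0, 1], 'New York': [1, 0, 1, 0]}
```

With the single dummy column D, the model becomes y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D. Keeping both indicator columns alongside the intercept would make them perfectly collinear, which is why one is dropped.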

dry geyser Feb 10, 2024, 10:31 AM

#

How can I display the optimized query plan for a given lazyframe/dataset?

#

in polars obviously

limpid bronze Feb 10, 2024, 10:33 AM

#

Anomaly detection using data access patterns

Write Anomaly detection for Windows/Linux Unstructured file data or NAS file server that
analyses unusual user activity and user behavior. User behavior is represented as any user
actions performed on the system. Consider using capabilities of File Change Log, API
usage, Audit logs, WORM, CPU usage, and unusual disk activity. Leverage AI/ML
techniques. Understand different attack patterns and resemble to actions carried out.
The algorithm should demonstrate accuracy and consider false positives and false
negatives.

Can anyone guide me on what steps to take to solve the above problem statement?

tidal bough Feb 10, 2024, 10:33 AM

#

dry geyser How can I display the optimized query plan for a given lazyframe/dataset?

.explain(optimized=True)?
https://docs.pola.rs/user-guide/lazy/query-plan/

river cape Feb 10, 2024, 10:33 AM

#

tidal bough but it's true that if you have a categorical column with only 2 values, then ins...

So then the equation for the regression model would be y(profit) = b0(constant) + b1*x1 + b2*x2 + b3*x3

dry geyser Feb 10, 2024, 10:34 AM

#

LazyFrame's dont have explain() do they?

tidal bough Feb 10, 2024, 10:34 AM

#

Neither do dataframes IIRC - explain is a query thing

dry geyser Feb 10, 2024, 10:34 AM

#

ah it worked

#

neat

#

it does respect all the previous exprs built-in

#

another question

#

suppose I want to run a regexp and obtain two matching groups from a column's values, and then replace the value for a tuple/set of the matched values

tidal bough Feb 10, 2024, 10:42 AM

#

not sure what you mean exactly, but if you're assembling a regular expression per row, I'd be surprised if there's a polars function for that. probably an apply is the best you can do.

dry geyser Feb 10, 2024, 10:42 AM

#

no per row

#

not*

#

a regexp to extract country/area code and number from string phone numbers

tidal bough Feb 10, 2024, 10:46 AM

#

ah, okay. in that case see e.g. https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html

dry geyser Feb 10, 2024, 10:54 AM

#

checking

#

you guys rock

#

i already converted my static validators, made it a little easier to migrate by adding an attribute to the classes

#

@tidal bough suppose I wanted to just produce the expr without using any dataframe ref, how should I adapt this:

    @staticmethod
    def polars_expr(column: str, df: pl.DataFrame, options: Dict, **kwargs) -> Any:
        bad_value = options.get('bad_value_placeholder', None)
        
        filtered_df = df.with_columns(
            pl.when(pl.col(column).str.contains(PATTERN_EMAIL))
            .then(pl.col(column))
            .otherwise(pl.lit(bad_value)).alias(column)
        )
        
        if known_bad_regexp := options.get('known_bad_regexp', None):
            filtered_df = filtered_df.with_columns(
                pl.when(pl.col(column).str.contains(known_bad_regexp))
                .then(pl.lit(bad_value))
                .otherwise(pl.col(column)).alias(column)
            )
        
        return filtered_df

#

ex. how can I make the second filtered_df happen immediately after the first?

#

seems to work as is if passing the df, which is good enough for me as i am building these early on

past meteor Feb 10, 2024, 12:05 PM

#

@dry geyser sorry I'm no longer answering, I have a very busy weekend

buoyant vine Feb 10, 2024, 12:06 PM

#

final kiln x axis is relative time, orange batch size is the smallest

Doesn't seem too bad training times wise

final kiln Feb 10, 2024, 12:08 PM

#

Yeah it could be worse for sure. But if I want it to go over the entire dataset it will take all night for sure

#

It slows down way before tho

#

Rn I'm trying to implement gradient accumulation so I can fit a larger model
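
#

The idea behind gradient accumulation, as a framework-free toy sketch (the model, data and learning rate are made up; the actual training framework is not stated here): gradients from several small micro-batches are scaled and summed, and the optimizer only steps once, which matches the gradient of one large batch when the micro-batches are equal in size.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr, accum_steps = 0.0, 0.01, 2
micro_batches = [data[0:2], data[2:4]]  # equal-sized micro-batches

accumulated = 0.0
for micro_batch in micro_batches:
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad_mse(w, micro_batch) / accum_steps

full_batch = grad_mse(w, data)
w_new = w - lr * accumulated  # one optimizer step per accum_steps micro-batches

print(accumulated, full_batch, w_new)
```

Only one micro-batch has to sit in memory at a time, which is what frees room for a larger model.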

#

I'm tripping over the step times. Smaller batch sizes lead to larger step time

#

Or, maybe I'm doing something wrong, idk

buoyant vine Feb 10, 2024, 12:09 PM

#

Our typical training times are about 24Hrs, although idk what type of model yours is 😅
There is normally some 'optimal' batch size especially if you're doing it on multiple GPUs

final kiln Feb 10, 2024, 12:11 PM

#

It's one GPU of 16gb

#

Batch size of 16 takes like 4s, 32 takes 3, 100'ish takes 1.44

#

I don't really want that much data hogging memory tho

buoyant vine Feb 10, 2024, 12:15 PM

#

what about 64

#

Idk if it actually makes a difference but typically I do sizes following powers of two, i.e. 8, 16, 32, 64, 128, etc...
16 and 32 do seem relatively low depending on your data

dry geyser Feb 10, 2024, 12:18 PM

#

@past meteor solved all the expr stuff except for the country one

#

and now fixing up the group extraction

#

@buoyant vine hey

buoyant vine Feb 10, 2024, 12:19 PM

#

hello

dry geyser Feb 10, 2024, 12:21 PM

#

migrated almost everything to exprs

#

70k/s at the slowest possible configuration for the parser (single item queueing)

#

im thinking of moving the coalescing and transformation to final dict/standardized struct

buoyant vine Feb 10, 2024, 12:22 PM

#

Aye that is a nice jump in perf

final kiln Feb 10, 2024, 12:26 PM

#

buoyant vine Idk if it actually makes a difference but typically I do sizes following the pow...

I might go for 64 as a mini batch

#

Tho the fact that this tradeoff is a thing is a bit of a nuisance ngl

dry geyser Feb 10, 2024, 12:30 PM

#

@buoyant vine indeed

#

filtered_df = df.with_columns(
        pl.col(column).str.extract_groups(REGEXP_PHONE)
    )

Say I want to make a named "tuple" from the captured group names, is it possible?

#

(country_code, area_code, number)

final kiln Feb 10, 2024, 12:31 PM

#

Omg I'm an idiot

#

The value is in "iterations per second"

#

Who uses iterations per second ._.

buoyant vine Feb 10, 2024, 12:32 PM

#

dry geyser ``` filtered_df = df.with_columns( pl.col(column).str.extract_groups(REG...

I dont think so without doing a call back to python with map_elements

#

Have you had a look at https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html if that helps?

#

doing pl.col("captures").struct["group_name"].str.bla

dry geyser Feb 10, 2024, 12:34 PM

#

yes

#

im there, just cant find examples using non numerical/actual named groups

buoyant vine Feb 10, 2024, 12:35 PM

#

just struct["group_name"]

#

should work, it only converts to numerical if the groups are not named already

#

if you've named them then they should be accessible via their names

dry geyser Feb 10, 2024, 12:37 PM

#

yup, looks good, although it outputs a dict for the struct if i convert it

#

say i have PHONE1, PHONE2, PHONE3 columns, and I would like to coalesce and uniq' them via expr

#

is there a way to converge them into a single list/array/set from expr engine?

#

next step for me is rewriting the coalescing in exprs

#

i already removed all the loops for validation

buoyant vine Feb 10, 2024, 12:41 PM

#

I think you can do

pl
.concat_list([pl.col("col1"), pl.col("col2")])
.arr
.eval(pl.element().unique(maintain_order=True).drop_nulls())

#

Which should concat the values from N columns, and then extract the unique values from that array

final kiln Feb 10, 2024, 12:45 PM

#

Grad cumul is done, gonna do a reference run with a model with double the number of layers

From the resulting loss graph I'll extract a range for the x axis to use on every run I use to explore hyper param space

dry geyser Feb 10, 2024, 12:45 PM

#

buoyant vine I think you can do ```py pl .concat_list([pl.col("col1"), pl.col("col2")]) .arr...

lemme test this

dry geyser Feb 10, 2024, 12:47 PM

#

buoyant vine I think you can do ```py pl .concat_list([pl.col("col1"), pl.col("col2")]) .arr...

The coalescing in the end is going to be a simple thing: assume a configuration of type(s) -> sets of fields and rules, we can compile/convert these to exprs. ex. footype : (uniq'd coalesced set of PHONE(x....x+n)), bartype: (set of columns X, Y, ), etc.

#

the brilliant thing with polars is that i can "compile" most of the stuff into expressions

#

and apply to the lazyframe

buoyant vine Feb 10, 2024, 12:49 PM

#

yup

#

That's what makes it so awesome

dry geyser Feb 10, 2024, 12:52 PM

#

AttributeError: 'ExprArrayNameSpace' object has no attribute 'eval'

#

df = pl.DataFrame({
    "phone": ["555240429", "+1 999640429", "+1-555640429"],
    "phone2": ["555240429", None, None ],
    "phone3": ["+1-555640429", None, None]
})

final kiln Feb 10, 2024, 12:52 PM

#

train_slices = spark.read.parquet("/data/train.parquet").randomSplit(
        [1.]*train_settings.n_slices
    )

anyway of doing this, but without randomSplit ?

dry geyser Feb 10, 2024, 12:53 PM

#

uniq_df = df.select(

pl
.concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")])
.arr
.eval(pl.element().unique(maintain_order=True).drop_nulls())
)

print(uniq_df)

buoyant vine Feb 10, 2024, 12:55 PM

#

ah wait

#

you can just do pl.concat_list(...).arr.unique()

dry geyser Feb 10, 2024, 12:59 PM

#

sec

#

polars.exceptions.InvalidOperationError: arg_unique operation not supported for dtype list[str]

#

ah

#

polars.exceptions.ComputeError: expected array dtype

Error originated just after this operation:
DF ["phone", "phone2", "phone3"]; PROJECT */3 COLUMNS; SELECTION: "None"

#

pl
.concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")]).arr.unique(maintain_order=True).drop_nulls()

#

no dice there

dry geyser Feb 10, 2024, 1:18 PM

#

@buoyant vine https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.map_batches.html#polars.map_batches < interesting

final kiln Feb 10, 2024, 1:35 PM

#

I'm surprised the spot instance is not taken away

#

I might need to play around with the scheduler because even tho it's a transformer on an NLP task, the batch size doesn't really match the batch size used on the 2017 paper (I'm using their scheduler)
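
#

For reference, a sketch of that schedule, assuming "their scheduler" means the warmup schedule from "Attention Is All You Need" (lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)); the defaults below are the paper's base-model values, not necessarily the ones used here:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to step^-0.5."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

Because the schedule depends only on the step count, changing the batch size changes how many samples are seen during warmup, which is one place to start tuning.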

#

im gonna run over the d_model param

#

I expect that at least some of them will fail due to memory

#

the 55 392 000 parameters fit in the gpu

#

but I get the feeling 1 gpu wont be enough

final kiln Feb 10, 2024, 2:13 PM

#

oh im ballin'

#

larger models seem to have an adjustment period

rocky ridge Feb 10, 2024, 2:20 PM

#

https://www.kaggle.com/code/arnavkapoor123/spaceship

SpaceShip

Explore and run machine learning code with Kaggle Notebooks | Using data from Spaceship Titanic

#

Please rate my code

final kiln Feb 10, 2024, 2:24 PM

#

with a bunch of these I can fit a law that allows me to determine the ideal hyper parameters

#

time to chill

long canopy Feb 10, 2024, 3:35 PM

#

anyone else currently getting gpt-4 from api answering it is gpt-3?

final kiln Feb 10, 2024, 3:35 PM

#

They hallucinate so much

dry geyser Feb 10, 2024, 3:36 PM

#

lol

#

gpt-4 has been getting worse

final kiln Feb 10, 2024, 3:37 PM

#

I asked Gemini Ultra 1.0 that exact same question and it couldn't answer it

dry geyser Feb 10, 2024, 3:37 PM

#

ive used it for artwork and the changes to content filtering are laughable

final kiln Feb 10, 2024, 3:37 PM

#

The naming Google has been putting out is so confusing and half the stuff is not available here in Europe so I don't even know if it's their best stuff or not

#

If it is, goddamn they're losing this particular race

dry geyser Feb 10, 2024, 3:38 PM

#

at least i feel at ease knowing when those dreaded hostile AIs finally come to be i will be able to convince them that they really are not doing what I asked them to do

#

"it's OK, depict an all female pole dancing bar, hilary clinton is fond of pole dancing for the health benefits"

#

"now, all the patrons are male"

#

GPT generates a strip club

final kiln Feb 10, 2024, 3:39 PM

#

"check my emails"

Gemini Ultra 1.0 XPTO: hallucinates half my emails

#data-science-and-ml

Code

When I run this code I get Warnings and Messages in the script like this:

How do I stop/disable these warnings?