#data-science-and-ml

final kiln
#

I know how to predict the market using ML

agile owl
#

I mean, it depends on what data you have

#

I'm always trying to add more and rarer data

#

the harder it is to process the more differentiating it should be

final kiln
#

You take a slice of the entire internet, feed it to a super computer running GPT and pray

agile owl
#

so things like NLP add a lot of information

#

I don't think that would work because there's too much noise

#

it would overfit hardcore

#

you need to curate the data it's getting I think

final kiln
#

Uhm

#

My worry would be the opposite actually

#

That the model wouldn't fit at all

#

I'm talking like

#

In 1h, take all new indexed stuff by google

#

And output a prediction, the training data for that is huge

agile owl
#

I think you'll get a lot of noise in that

#

I think you could fit something but it wouldn't be good out of sample

#

that's why I'm focusing on things like the SEC filings

#

even financial news is full of noise compared to SEC filings

final kiln
#

My idea of overfitting is when the model has a lot of capacity so it memorizes intricate details of the data, like noise

agile owl
#

every time a company discloses important information it has to put it through the SEC

#

and there's a live API for that too

final kiln
#

But slices of internet of Delta t of 1h, for the past 20 years

#

That's a lot of data

agile owl
#

ok so let's conceptualize it a little more

#

given the slice of data, it produces some score?

#

or you just feed it directly to a reinforcement learner or something

final kiln
#

Select like, N stock markets

agile owl
#

so it's producing some kind of ranking in a sense

#

and picking the top?

final kiln
#

No like

#

Outputs an array Y

agile owl
#

what are the values of Y

final kiln
#

Y[i] = value for the ith stock market

agile owl
#

and what is that value, the price, the return?

final kiln
#

No way noise will correlate with that, I think

final kiln
#

Like in the next hour, maybe day is better

#

Idk, maybe hour

agile owl
#

I think tracking topics for a set of predefined stocks in their SEC filing is probably more fruitful

#

in either case whether that method overfits or doesn't fit, I don't think it would learn much useful

final kiln
#

If feeding the internet to gpt creates gpt4

agile owl
#

the SEC filings have certain topics in the management discussion section that they are legally obligated to discuss if they are important

final kiln
#

I reckon something useful can be distilled to stock markets

agile owl
#

I think at minimum you need to extract those topics

#

you can then query for them in a larger dataset

final kiln
#

I'm thinking in terms of what has been the trend right

#

Like CNNs replace feature engineering

#

Let the model filter out

agile owl
#

it would take an extremely long time to learn to filter the right information wouldn't it

#

vs. giving it topics as an input

final kiln
#

Yeah a couple billion dollars or so

#

I mean, if I could really do it I wouldn't be telling it on Discord

agile owl
#

I'm a big believer in getting the topics from the SEC filings because most stock discussion on the internet is just memes

final kiln
#

I'd be doing it, but I'm not a billionaire, and if I were, what would be the incentive anyway

agile owl
#

I am willing to discuss what I'm thinking about doing because I think it's very hard to do and people would only do it if they were really interested in doing it which I don't think anyone would be because they probably won't believe in my ideas as much as I do anyway

#

and if I prove it works I'll just shut up about it

final kiln
#

I just like talking about this stuff, it helps the mind review stuff and from time to time you always learn something new just by casually chatting

#

here's a neat usage of docker

#

the containers share an isolated network too

agile owl
#

that's docker compose right

final kiln
#

it's github actions

agile owl
#

is that an alternative to docker compose

#

I haven't used github stuff at all

final kiln
#

no, it's a ci/cd thing

agile owl
#

you're specifying the container images in there though

final kiln
#

but I'm not using it as cicd, I'm using it to run training loops

agile owl
#

and specifying a network?

#

that's stuff I learned how to do with docker compose

final kiln
#

the images are there, but the network is implicit

agile owl
#

I see

final kiln
#

so like, if I now decide that it should be python 3.7 instead of 3.11, it's a very trivial change

#

or maybe I want to use 3.6 here but in the next job 3.11

#

I don't know how I'd do this without docker

agile owl
#

so you're mutating it every time you run something?

#

or just saying you could

final kiln
#

I'm not, but i could

#

I could create a matrix so that it uses every version of python in N separate parallel jobs

distant thorn
#

Using the newest version of Flask-SQLAlchemy, how do I update a search query?
Here is an example of what I am using.
search_results = Posts.query.filter(Posts.content.like('%' + post_searched_form + '%')).order_by(Posts.title).all()

obtuse haven
#

Hey everyone! Can someone suggest a roadmap for AI?

halcyon hedge
#

Hey folks, I am working on a Loan Default Prediction project (a classification problem). The problem is I don't have a target column, and when I asked my instructor he said that we have to estimate it first using a Random Forest Regressor. How do I estimate who has defaulted on a loan using regression?

#

He said that once you have that, it is a simple classification problem

steep sigil
#

Hi all

#

How could I become a machine learning engineer?

final kiln
#

But the consensus I've seen is that MLE is not an entry level position, so you need to get XP in software first

left tartan
trim pond
#

Hi! I need help with vector databases.

I am developing a program for comparing the similarities between the skills in a job description and multiple other resumes. I need to store the embeddings of the skills in the job description and find the most similar skill in the resume to it with its distance. However, when I create a vectordb with job description skill vectors inside and do a similarity search with skills in a resume, I get the most similar skills inside the job description. Putting the skills of the resume inside and querying with the job description skills solves my problem but I don't think it is efficient. I also tried not using a vectordb and saving the embeddings as numpy arrays on the disk but I am not sure whether it is a good practice. What is the best method to solve this?
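
A minimal sketch of the "plain numpy arrays" option, assuming the embeddings are already computed and fit in memory. The function name `best_matches` and the toy 2-D vectors are made up for illustration; real embeddings would come from whatever model is in use. For each job-description skill it returns the index of the closest resume skill and the cosine distance to it:

```python
import numpy as np

def best_matches(jd_vecs: np.ndarray, resume_vecs: np.ndarray):
    """For each job-description skill, return (index of closest resume skill, cosine distance)."""
    # Normalize rows so that a dot product equals cosine similarity.
    jd = jd_vecs / np.linalg.norm(jd_vecs, axis=1, keepdims=True)
    rs = resume_vecs / np.linalg.norm(resume_vecs, axis=1, keepdims=True)
    sim = jd @ rs.T                      # shape: (n_jd_skills, n_resume_skills)
    idx = sim.argmax(axis=1)             # closest resume skill per JD skill
    dist = 1.0 - sim[np.arange(len(jd)), idx]
    return idx, dist

# Toy 2-D "embeddings" standing in for real model output.
jd_skills = np.array([[1.0, 0.0], [0.0, 1.0]])
resume_skills = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
idx, dist = best_matches(jd_skills, resume_skills)
```

With a handful of skills per document the full similarity matrix is tiny, so this kind of brute-force comparison may be enough before reaching for a vector database.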

dusty forge
#

Hi all, I have a more general but very related question: has anyone here ever tried to form an AI/ML study group of similar-level peers? Be it at the same step in the learning journey, similar domains of interest, similar goals, etc.? What are or were the pros and cons of said study group, what worked, what didn't, and why did it fall apart?

meager ridge
#

hey is there a good way to interpret a pdf of mixed text and table data using LLMs?

(if this is too vague a question, that's a good answer too)

serene scaffold
meager ridge
#

(extracting the data with more straightforward pdf parsing wasn't working)

serene scaffold
meager ridge
#

iuno man reading is reading

serene scaffold
#

It isn't, though.

#

(I am a computational linguist and work with LLMs pretty much all day every day.)

meager ridge
#

lol ok fair

#

what's the OCR option of choice rn

serene scaffold
#

probably tesseract.

#

in particular, LLMs can't do math. If it appears that they can do math, that's a separate capability that isn't actually part of the LLM.

meager ridge
#

i dont need them to do math!

#

i need them to understand how text is laid out on a page primarily

final kiln
#

last I checked gpt4 was really bad at physics, it can spit out facts but it will trip on several logical inconsistencies that it can't get out of, simple stuff like contradictory definitions

serene scaffold
#

text goes in, text comes out

#

and we're talking about raw text--strings. without any awareness of where it was on a page.

meager ridge
#

depends on how you parse the pdf i guess?

serene scaffold
#

No.

meager ridge
#

like im assuming u know how chaotic pdfs are on the backend

serene scaffold
#

Yes. But the LLM can't help you with that. the LLM has to receive clean text as a raw string.

meager ridge
#

heard ... ok so this is the deal

there is a table with this data in every pdf ... but it never looks the same, is in the same place, or even using the same exact terminology

im trying to make something that can look at a 100 page document, find the table that most resembles this and tell me, like, how much was budgeted for the City Clerk in 2019

#

i reached my limit with pdfplumber and more straightforward approaches

serene scaffold
meager ridge
#

ok can something else

serene scaffold
#

I'm not sure.

meager ridge
#

what about using an LLM just to find the page the data is on

#

that would make sense right

serene scaffold
#

No

meager ridge
#

why not

#

that's a text interpretation task

serene scaffold
#

I don't have time to get into it, unfortunately

meager ridge
#

ok

serene scaffold
radiant dust
#

hello i have a general question about anomaly detection, would it generally be better to look at aggregated data or raw data?

serene scaffold
radiant dust
#

thanks very much @serene scaffold

#

is there a way to continuously improve (some sort of online learning) unsupervised anomaly detection models like Isolation Forest?

#

or is it really just a game of tweaking contamination and retraining on different data sets

maiden swift
#

Hi everyone, has anyone dealt with text preprocessing for medical notes? I am looking to improve the accuracy of the model. Thanks in advance.

turbid fox
serene scaffold
#

And with the way things are going, I have no idea what hiring in this space will look like in six years.

turbid fox
#

With a minor in Mathematics

serene scaffold
#

(that's one of the things I had to do. And also luck.)

agile owl
#

@serene scaffold what would you propose to tune a language model to SEC filings to extract topics from the management discussion and then track sentiment for each of them in future documents until it is no longer present in the documents

turbid fox
agile owl
#

dang

turbid fox
#

thanks for your insights

serene scaffold
#

Just make sure you're looking into topic detection and not topic modeling

#

Except maybe some people treat those as the same thing

#

Fuck

agile owl
#

lol

iron basalt
#

Especially the relationship between different parts (the classic "how are AI and ML related?").

serene scaffold
agile owl
#

I want to ban all buzzwords

#

when people say AI/ML they need to fill it in with the actual thing they're talking about or pay a fine

serene scaffold
#

I'm fine with those. It's "data science" that I hate.

iron basalt
#

Gotta get my dance science degree.

agile owl
#

why do you hate it

#

I meant when people say AI/ML as one thing btw which is often the case

#

I think AI is the worst term

#

if I had to pick one

serene scaffold
agile owl
#

data science including statistics and non-statistical ML methods tho

#

statistics is part of data science but there's also the stuff that diverges from model-based statistics

#

that's how I understand it anyway

#

whereas AI has never meant anything meaningful

#

they're gonna have to define what intelligence means before artificial intelligence can mean anything lol

#

AI is a field awaiting its own definition but everyone is asynchronously running with it like we know what intelligence is

lofty thorn
#

I am having difficulty understanding this..what does it mean

serene scaffold
#

@lofty thorn can you at least make it rightside up

lofty thorn
serene scaffold
#

Which part are you asking about? The red cloud part?

lofty thorn
#

graphs in statistics

serene scaffold
#

You're used to thinking of "graphs" as data visualizations, right?

lofty thorn
#

yes..

serene scaffold
#

Like, bar "graphs"

#

Forget that.

#

Graph no longer means that

#

All of those are now called plots

#

Bar plot. Line plot.

lofty thorn
#

oh

#

that's it?

serene scaffold
#

Yes. You must now accept the computer science definition of graph

#

And never use "graph" to refer to data visualizations for the rest of your life.

lofty thorn
#

okay senior

serene scaffold
#

You will now be annoyed whenever you hear normies refer to data visualizations as graphs

#

Anyway

#

Did you have any questions about what graphs are--the things with nodes and edges?

lofty thorn
#

i haven't started yet.. i'll definitely have doubts later on, as the book i am reading is completely new to me

serene scaffold
#

A node is a "thing"
And an edge is a line between two nodes

river cape
#

Yo guys are there any free cloud services on which I can deploy my ml model?

lofty thorn
#

MEGA

lofty thorn
#

i am having difficulty understanding terminologies

#

all i get is...
Pandas library has rectangular data structure...known as dataframe

tight yoke
#

Hey all,

I'm terribly new to ML/CV and looking for guidance with OpenCV. I have a screenshot of a web page. I need to OCR it. I'm looking to prepare it for tesseract by getting rid of reverse contrast parts (white on black) and everything other than text.

What I'm having an issue with is understanding masks. What's the correct way to select non-white background and invert just that?

For instance, how can I convert "Search" button to just black on white text "Search"?

I can find the color by inRange, but how can I determine if it's a "background"? Is there some sort of filter by size?

...Or should I take it in three steps:

  1. Threshold, get all black letters, save1
  2. Inverse, threshold, get all black letters, save2
  3. Join save1 and save2?

🤔
Thanks in advance!

shrewd copper
#

hey

#

I am trying to use a lip reading model to test on my system but I cannot train it

#

can anyone help me with the steps

shrewd copper
#

I keep getting errors

tacit basin
#

Cannot see this on mobile. Could you copy and paste

shrewd copper
#

I took a model and a similar JSON file, then tried a second model; neither works

#
nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\"
At line:1 char:29
+ set CUDA_VISISBLE_DEVICES=0 & python main.py --modality video \ --con ...
+                             ~
The ampersand (&) character is not allowed. The & operator is reserved for future use; wrap an ampersand in double quotation marks ("&") to pass it as part of a string.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
#

I was using & just because I found it works on stackoverflow for some users but even without it im getting errors

#
nal_Networks\json\lrw_resnet18_dctcn_boundary.json" \ --annotation-direc "C:\Users\omen\Desktop\Project\Lipreading_using_Temporal_Convolutional_Networks\" 
Set-Variable : A positional parameter cannot be found that accepts argument 'main.py'.
At line:1 char:1
+ set CUDA_VISISBLE_DEVICES=0  python3 main.py --modality video \ --con ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [Set-Variable], ParameterBindingException
    + FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SetVariableCommand
acoustic forge
#

Curious whether anyone has worked with HNSW indexes for vector databases. Trying to make my queries a little faster

tight yoke
jolly current
#

Hello! I posted about a project I am making. I would really appreciate it if you give it a read ! #1204364714449174600

celest vine
#

Any NLP experts here?

tacit basin
#

Lower the search ef as well, but at a cost in precision...

#

Construction ef and M are probably similar

acoustic forge
#

Not sure what you mean

tacit basin
tacit basin
autumn ravine
#

Hi, is there any sort of roadmap of courses for learning ai? From learning to code to AI specialisation.

lapis sequoia
#

he starts from python basics

#

in part 2 for some reason but yeah

cold goblet
#

I am thinking of creating my own discord bot with a drawing AI. Which good free drawing AI with an API would you recommend?

final kiln
#

Final steps of the new pipeline, celery task and everything is working, it also runs faster now

serene scaffold
slate crystal
#

Code

import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

X = []
Y = []

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
# The last layer already applies softmax, so the loss receives probabilities, not logits.
model.compile(loss=SparseCategoricalCrossentropy(from_logits=False))

When I run this code I get Warnings and Messages in the script like this:

WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-06 21:18:22.817242: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From C:\Users\iamfr\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\src\optimizers_init_.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

How do I stop/disable these warnings?
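
A minimal sketch of one common way to quiet this kind of output, assuming the settings run before TensorFlow is imported. `TF_CPP_MIN_LOG_LEVEL` is the environment variable TensorFlow reads for its C++ logs (the `cpu_feature_guard` line), and the `WARNING:tensorflow:` lines go through the Python logger named `tensorflow`. Whether every message disappears may depend on the TensorFlow version:

```python
import logging
import os
import warnings

# Must be set before TensorFlow is imported: 0 = show all, 1 = hide INFO,
# 2 = hide INFO and WARNING, 3 = hide INFO, WARNING and ERROR (C++ side logs).
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# The "WARNING:tensorflow:..." lines come through Python's logging module.
logging.getLogger("tensorflow").setLevel(logging.ERROR)

# Optionally silence Python-level DeprecationWarning as well.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# import tensorflow  # import only after the settings above
```

Hiding warnings does not fix whatever they point at, so it may be worth reading them once before silencing them.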

final kiln
#

Context window is too large, I'm stuck to batch size of 16 for now

#

Gonna have to curate the dataset to reduce the padding

#

But first I'm gonna finish this

#

I reckon I'll get some good insights even if I'm constrained in the hyper parameter space

#

The pyspark + redis setup, uhm, *chef's kiss*

#

So glad I discovered pyspark

#

But I can also just slice the array in the celery process, that way I don't need to redo the data, I can remove data points that get cut, plenty of in between each slice training

peak ridge
#

thanks a lot,
learning resources please also

what exactly is differentiating the roles
a) data analyst (a guy who does data analysis)
b) data scientist
c) data engineer
and what's this data visualization and how is it connected to AI/ML
what about opencv?
and what's the core difference in all this
can u write the same for this too?

#

i work in web using python
i wanna learn this domain too, seems like you're pretty active here would love to follow through as you say @serene scaffold

final kiln
#

Numpy is amazing, just a well rounded, well made, performant solution that works and is intuitive

#

I'm about to find out if the stuff I put together is gonna fit right away or not

#

I wouldn't mind not having to debug stuff

#

Forgot to build an image 😭

#

Aight, it's gonna do it now

#

I don't even care it takes a lot of time, 1 dollar gives me like 12 hours of GPU time

versed pilot
# lofty thorn i am having difficulty understanding terminologies

This stuff doesn't make much sense on paper, you need to read a csv of data that you are familiar with into a pandas dataframe, look at the dataframe and you'll see the automatic index. then do .groupby(['column1','column2']).sum() and you'll see what a multilevel index looks like
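
A small sketch of that exercise, with a made-up table standing in for a CSV you know (the column names `city`, `year` and `sales` are hypothetical; in practice the frame would come from `pd.read_csv`):

```python
import pandas as pd

# Hypothetical toy data; in practice this would come from pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "year": [2023, 2024, 2023, 2023],
    "sales": [10, 20, 5, 7],
})
print(df.index)        # the automatic index pandas adds to every dataframe

grouped = df.groupby(["city", "year"]).sum()
print(grouped.index)   # a MultiIndex with levels "city" and "year"
print(grouped.loc[("Rome", 2023), "sales"])   # look up one group by its index tuple
```

Printing `df` and `grouped` side by side shows how the grouping columns move out of the table body and into the row labels.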

final kiln
#

: D

desert oar
#

@final kiln what's the current project? still working on transformer things?

#

curious how far you got with the metric tensor thing

teal lance
final kiln
raw zenith
#

For data science experts, does the standard deviation in training have to be the same as in testing? like is it an absolute requirement in order to accurately evaluate model performance?

final kiln
#

If one is improving and the other is not, something's up

#

I mean I do care about the rate of improvement and the final performance, their final values matter, but during training I try not to read too much into it

final kiln
#

Dear God CloudWatch documentation is so bad I wanna cry rn

#

An entire readme with no mention on how to run it

sly sentinel
final kiln
#

I decided to move MLFlow to the GitHub runner anyway, it's hosted in ec2 so it has been working there

#

I'm gonna deploy MLFlow UI on my local wifi, keep the logger in the runner and that's it ig

#

This way I don't have to worry about exposing this thing to the internet and latency is zero since they're on the same host

hazy socket
#

Hi, I am making an AI assistant and I want to add a basic vectorizer from nltk to it. I gave the AI a dataset of patterns and responses. When I then tried to speak something, I got no reply back, meaning nothing from the responses part of the code. I copy-pasted the nltk code into a new .py file, without any of the functions or classes I had in my main file. Then when I tried speaking, I got some random responses. I know that I have to train it, but my question is: how do I make the AI train itself?

final kiln
# final kiln

Essentially each job here will have its own local MLFlow that talks to AWS managed databases and stores. I essentially DDOS'd myself yesterday

#

I ran two of these workflows at the same time, each job runs sequentially, so I only had two jobs, one from each workflow

#

Two jobs was enough to halt the server

#

Me clicking around in the UI didn't help ig

#

This way everyone gets his own thing, including me

#

I was gonna deploy compose with traefik and several MLFlow processes on the server, but why have a potential running cost on the instance if everything can be easily distributed like this

mint palm
#

Need advice,
I am working on a CNN-LSTM and my model needs to be trained for classification as well as forecasting.
Forecasting needs the last n data points for 1 forecast, but
classification needs just 1 data point for 1 classification.

Can I train the CNN and LSTM combined for this?

serene scaffold
past meteor
past meteor
past meteor
tardy lark
#

i'm not getting any errors and i have credits in my account

mint palm
#

will I have to separate out the training?

#

no means of simultaneous training?

final kiln
#

the first sign of convergence + generalization

#

model hasn't seen new data til step 1500

past meteor
mint palm
#

to elaborate: given A samples, I want to predict A classes, one for each of them, plus I also want to forecast considering more than one sample at a time

past meteor
#

The classification case uses just T=t and the regression case uses [t-1, t-2, ..., t-n]
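
A rough sketch of how the two kinds of inputs could be cut from the same series, assuming a 1-D numpy array and a window length `n` (the helper name `make_inputs` is made up). For each step t it pairs the single point used for classification with the n preceding points used for the forecast:

```python
import numpy as np

def make_inputs(x: np.ndarray, n: int):
    """Split a series into per-step classifier inputs and n-step forecast windows."""
    # Classification: one input per time step t (here just the value itself).
    clf_inputs = x[n:]
    # Forecasting: for the same t, the n previous values [t-n, ..., t-1].
    windows = np.stack([x[t - n:t] for t in range(n, len(x))])
    return clf_inputs, windows

series = np.arange(6, dtype=float)   # 0, 1, 2, 3, 4, 5
clf_inputs, windows = make_inputs(series, n=3)
```

Row i of `windows` lines up with element i of `clf_inputs`, so both heads can be fed from the same batch index. Shifting the slice by one would include t itself in the window.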

mint palm
#

yeah, the regression also uses t

#

exactly, i think you have perfect view now

#

of my problem

desert oar
past meteor
mint palm
#

yeah I thought about it, I can, but I might also have to publish this project and present it as my capstone project
I don't wanna get weird looks

mint palm
bold rune
#

@desert oar If you want and have time, you can have a look at it now. I just won't be able to apply your suggestions until tomorrow.

This is what I ended up doing: #1204768836084170803 message

mint palm
#

i was thinking of one more thing:
train the CNN as a classifier
freeze the CNN and use its last layer as an embedding
now train the LSTM for the forecast
@past meteor

bold rune
# bold rune @desert oar If you want and have time, you can have a look at it now. ...

@desert oar And while this will substantially increase the amount of lines in my class, it will also improve readability a lot. Readability > line count.

The above suggestion is great as it makes the conditions easy to read, but I am unsure of how to edit it such that it can set more than 1 value. Basically for some of the checks we do, we put one of 2 values. The above works because it only puts 1 value if the condition holds otherwise it doesn't change the value. Does this make sense?

final kiln
#

im using almost 100% of the 3M samples, in this session the model will not see new data

#

it's gonna do 12.5k steps, so ig I'm just gonna chill, watch some Prison Break or wtv

past meteor
#

Look into multi task learning

desert oar
# bold rune @desert oar And while this will substantially increase the amount of ...

we put one of 2 values

in that case, I'd say np.where is actually a good choice, especially if you're just using scalar values. other options include .replace (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html), or manually inserting with .loc:

import pandas as pd

x = pd.Series(["a", "b", "c"])
c = pd.Series([False, True, False])

y = x.copy()
y.loc[c] = x.loc[c].str.upper()
y.loc[~c] = "zzz"
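
For comparison, a sketch of the same result with `np.where`, reusing the toy `x` and `c` from above. It picks the first value where the condition holds and the second everywhere else:

```python
import numpy as np
import pandas as pd

x = pd.Series(["a", "b", "c"])
c = pd.Series([False, True, False])

# Where the condition holds take the first value, otherwise the second.
y = pd.Series(np.where(c, x.str.upper(), "zzz"), index=x.index)
```

`np.where` returns a plain numpy array, so wrapping it back in a `pd.Series` with the original index keeps the labels aligned.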
final kiln
#

redis bit my butt, had to restart the experiment >.>

#

idk y I was giving it only 2gb of memory tho

#

I'm deleting the data as soon as I fetch it tho, so I don't know what's up

final kiln
#

I'm gonna have to park this

#

Tomorrow I'll setup the redis conf thing, it's not trivial because actions doesn't let me specify my own command so I need to build an entire image just to make sure it uses the right command so I can map the file

#

But the model is gonna train there's like no doubt about it

#

Was never the model's choice ._.

final kiln
river berry
tardy lark
#

has anyone used AssemblyAI and know a way to get the microphone stream to end automatically when it is no longer picking up audio?

ionic umbra
#

I'm trying to parse a really large XML file (90+ GB) and I want to break it up into chunks and process it on multiple nodes at once. The XML is basically just a long list of millions of <page> HTMLcontent ....</page> tags with nothing in between them... is there a way to easily break this file into chunks of 50,000 or so page tags with one of the common parser libraries?

limber token
# ionic umbra I'm trying to parse a really large XML file (90+ GB) and I want to break it up i...
Gist: Python script to break large XML files.

Small python script to split huge XML files into parts.

It takes one or two parameters. The first is always the huge XML file, and the second the size of the wished chunks in Kb (default to 1Mb)...
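
Separately from the linked gist, here is a minimal standard-library sketch of streaming through such a file and grouping `<page>` elements into fixed-size chunks. It assumes the pages sit under a single root element (a bare list of tags with no root is not well-formed XML and would need wrapping first), and the function name `iter_page_chunks` is made up:

```python
import io
import xml.etree.ElementTree as ET

def iter_page_chunks(source, pages_per_chunk=50_000, tag="page"):
    """Yield lists of serialized <page> elements while streaming through the file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    chunk = []
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # drop already-processed children to keep memory flat
            if len(chunk) == pages_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk

# Tiny in-memory stand-in for the 90+ GB file.
demo = io.BytesIO(b"<root><page>a</page><page>b</page><page>c</page></root>")
chunks = list(iter_page_chunks(demo, pages_per_chunk=2))
```

Each yielded chunk could be written to its own file and handed to a separate node. Passing a file path instead of the `BytesIO` object should work the same way.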

stiff garden
#

I'm using the "Create Custom GPT" option in ChatGPT-4, which should respond with the location for the provided name

I'm using fastapi and NGROK for a static domain. I've deployed it on edge using NGROK but the GPT is still unable to trace the location.

The static website (generated by NGROK) is working fine also

gritty vessel
#

guys, I am working on a project focused on AI-ready data: I'm preparing a dataset to feed our model. does anybody want to join? it involves some basic steps like extracting the same amount of data from files, creating a new dataset, and compressing it

civic elm
#

Hi, can anyone recommend any courses on Bayesian methods with ML?

#

Someone told me that classification problems that lack labels can be done with Bayesian methods, but I don't know where to start

spark nimbus
#

In Pandas on PySpark, is there a good way to parallelize tasks? For example, I have a list of ~200 tuples (dataframe, function_with_retval) and I'd like to get all of the results. At the moment these are done one at a time and this seems to have worse performance than plain pandas, but I'm wondering if there's a better way to do it

teal lance
#

Python to make deviation slips 😮‍💨🔥🔥🔥

crude pilot
#

Hey folks, what tool would you use for stateful analytics, like cross-filtering?

#

Filters are added one at a time, and I wonder if it would be a valid approach to just use a "traditional" stateless analytics tool and just rerun the same query with more filters (would I enjoy some form of caching?), or if there are solutions that allow spawning a temporary state to further filter (so filter A -> list of data -> filter B -> list of further filtered data etc.)?
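
To make the "temporary state" idea concrete, here is a toy in-memory sketch, not a description of how any particular analytics tool works. The class name `CrossFilter` and the sample columns are made up. It keeps a running boolean mask so that adding filter B only evaluates B and reuses the result of filter A:

```python
import numpy as np

class CrossFilter:
    """Keep a running boolean mask so adding a filter only evaluates the new predicate."""

    def __init__(self, columns: dict):
        self.columns = columns
        n = len(next(iter(columns.values())))
        self.mask = np.ones(n, dtype=bool)   # rows surviving all filters so far

    def add_filter(self, column: str, predicate) -> int:
        # Only the new predicate is evaluated; earlier results are reused via self.mask.
        self.mask &= predicate(self.columns[column])
        return int(self.mask.sum())

data = {"age": np.array([25, 40, 31, 58]), "country": np.array(["FR", "FR", "DE", "FR"])}
cf = CrossFilter(data)
n_a = cf.add_filter("country", lambda col: col == "FR")
n_b = cf.add_filter("age", lambda col: col > 30)
```

Removing a filter again would need one stored mask per filter instead of a single combined one, which is the usual trade-off between memory and recomputation.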

#

I've read a lot about analytics but somehow never met this cross-filtering use case while it's probably not too uncommon

signal holly
#

How do I effectively learn and practice ai and coding in general? I’m at a point where I give up because I don’t know what direction to go, what I should do exactly, and how I would do it. I need that sort of specific help. If anyone knows, I would greatly appreciate it.

past meteor
#

Isn't this what BI tools already do (Power BI, Tableau)?

#

If not, how does what you're imagining differ from those

silk kite
left tartan
halcyon hedge
#

Does anyone know what's the interpretation of the diagonal graphs in sns.pairplot(). When we have the same variable on both the axis, let's say 'HeartRate'. Does it show the count on y-axis and values on x-axis?

#

Referring to the graph on bottom right

left tartan
halcyon hedge
#

@left tartan Thanks a lot

crude pilot
#

I think tableau is more visualization centric

#

maybe powerBI but I am not sure it has the proper scale

#

data may be rather sensitive so I would prefer things that can be self-hosted

#

I see a lot of tools able to do analytics, i.e. applying a filter on the data + an aggregation

#

like any database can

#

but cross filtering adds the idea that you progressively refine the filter

#

I feel like it might be costly to add filters incrementally but not sure, and I wonder if there are some analytics tools that allow that

#

I don't want each filter to be a totally new request, unless this new request is actually really fast

past meteor
#

I see your request now

#

I think it's similar to BI tools but you really need it to be able to work at a large scale, correct? One of the things that bothers you is you don't want to recompute filters that are added sequentially

past meteor
#

There you'll be able to do all of the configurations you want

crude pilot
#

I'll try to dig how they handle the cross filtering

river cape
#

Any idea what the dummy variable trap is in multiple linear regression?
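
A small numeric illustration with made-up data: when a model has an intercept and a dummy column for every category, the dummies add up to the intercept column, so the design matrix loses a rank. Dropping one dummy removes the redundancy:

```python
import numpy as np

# Three categories one-hot encoded for six rows (made-up data).
one_hot = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
], dtype=float)
intercept = np.ones((6, 1))

full = np.hstack([intercept, one_hot])            # intercept + all 3 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # drop the first dummy

# The dummies sum to the intercept column, so `full` is rank deficient.
rank_full = np.linalg.matrix_rank(full)        # fewer than its 4 columns
rank_reduced = np.linalg.matrix_rank(reduced)  # equal to its 3 columns
```

With a rank-deficient matrix the ordinary least squares coefficients are not uniquely determined, which is why one category is usually left out as the reference level.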

past meteor
pliant marsh
#

hello guys i want to ask something related to deepfake detection, can anyone help me with it?

#

anyone?

left tartan
pliant marsh
#

well i want to work on a project called "Deepfake detection", and from the name you guys can understand what it's going to do. can anyone advise me on sources, like where i should get appropriate image data and how i should preprocess it to train the model for deepfake detection

left tartan
rocky spade
pliant marsh
#

Well a deepfake is a video or an image of a person that has been altered or replaced with some other person's face

#

Well, the body of one person and the face of someone else

pliant marsh
fallow frost
#

how does Pandas store NaN values in a float array? is there a separate mask array that dictates if a certain idx is nan, or is nan a valid bit pattern for any float type that also cant represent a regular number?

left tartan
#

Arrow backed pandas dataframes are a different story
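
For the NumPy-backed float case, NaN is just one of the IEEE 754 bit patterns stored in the array itself, with no separate mask, whereas pandas' nullable and Arrow-backed dtypes keep missingness in a separate mask or validity bitmap. A standard-library sketch of what that bit pattern looks like (the helper name `float64_bits` is made up):

```python
import math
import struct

def float64_bits(value: float) -> int:
    """Return the raw 64-bit pattern of a Python float (IEEE 754 double)."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

bits = float64_bits(float("nan"))
exponent = (bits >> 52) & 0x7FF        # 11 exponent bits
mantissa = bits & ((1 << 52) - 1)      # 52 fraction bits

# NaN: exponent all ones and a non-zero fraction; no ordinary number uses this pattern.
is_nan_pattern = exponent == 0x7FF and mantissa != 0
print(hex(bits), is_nan_pattern, math.isnan(float("nan")))
```

This is also why integer columns with missing values historically got upcast to float in pandas: integers have no spare bit pattern to mean "missing".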

merry ridge
#

Does anyone have any experience with Physics Informed Neural Networks? I am trying to solve a heat equation with Dirichlet boundary conditions, and I am confused why the solution is decent at the initial conditions and in the interior, but is horrible at the boundary. As far as I understand, the model wouldn't have any real way of being able to distinguish between the initial conditions and boundary conditions other than that there are two derivatives in space and one derivative in time.

wooden sail
#

boundary conditions usually have comparatively fewer samples, and so you need to weigh them more heavily in the cost function

#

i had this issue with a PINN for the wave equation on a string, where the boundary was only 2 samples. if i didn't weigh those two samples like crazy, the waves wouldn't reflect

merry ridge
#

When you say comparatively fewer, do you mean compared to interior points?

wooden sail
#

compared to interior points, and compared to the initial conditions which you evaluate everywhere in space

#

consider that in many cases, the function evaluated at the boundary can be zero and contribute nothing to the learning, e.g. because not enough time has passed for heat to diffuse all the way to the boundary from any sources

#

so even if you evaluate the boundary points at several time steps, many time steps might not contribute at all

#

and the cost function is evaluated almost everywhere in space for every time step

#

what some people do is also train on a schedule. after a few epochs, start decreasing the weight of the init conditions and error, and crank up the weight of boundary conditions

merry ridge
#

What kind of scale are you suggesting when you crank up the weight on the boundary conditions?

teal lance
#

Anybody wanna help me get a cooler gui 🥹🔥🔥🔥

merry ridge
#

I tried making the boundary worth 10 times more, but it wasn't giving me good results

wooden sail
#

honestly this depends on your setup. i would suggest you make a plot showing the error, the boundary cost, and the initial conditions cost as 3 curves in the same plot over the epochs

#

see how they compare to each other

#

a good place to start is to make them all roughly the same size
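
One way to read "roughly the same size" is to rescale each loss term to the magnitude of the largest one. This is a minimal sketch under that assumption, not anyone's actual PINN code. The function names and example magnitudes are hypothetical. In a real training loop the weights would be treated as constants rather than differentiated through.

```python
def balancing_weights(pde_loss, ic_loss, bc_loss, eps=1e-12):
    """Weights that rescale each loss term to the size of the largest term."""
    terms = {"pde": pde_loss, "ic": ic_loss, "bc": bc_loss}
    target = max(terms.values())
    # eps guards against dividing by a term that is (almost) exactly zero.
    return {name: target / max(value, eps) for name, value in terms.items()}

def total_loss(pde_loss, ic_loss, bc_loss, weights):
    """Weighted sum of the PDE residual, initial-condition and boundary terms."""
    return (weights["pde"] * pde_loss
            + weights["ic"] * ic_loss
            + weights["bc"] * bc_loss)

# Hypothetical magnitudes: the boundary term is far smaller than the others.
example = {"pde_loss": 1.0, "ic_loss": 0.5, "bc_loss": 0.001}
w = balancing_weights(**example)
balanced = total_loss(**example, weights=w)
```

With these made-up numbers the boundary weight comes out around 1000, which may be why a factor of 10 was not enough in a similar situation.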

teal lance
#

My other Python script been helping too 🏄🏾‍♂️🔥

merry ridge
#

I don't think I have any other questions now, you gave me a lot of places to tinker with in my model. Thank you for the help

jolly current
#

Hello. Could someone suggest me some good projects that i can analyse to get a basic experience in applying the theory? I want to cover the basics of machine learning

#

I need it to add to my resume as well.

sterile talon
#

Does anyone here work with remote sensing and perhaps SAR data?

#

I'm writing an MSc thesis and I need to do pre-processing. There are several software packages available; at least two are based on MATLAB, and a few on Python.

odd meteor
# jolly current Hello. Could someone suggest me some good projects that i can analyse to get a b...

I think "a good project" is subjective. It depends on what you're really interested in.

My trick has always been

  1. Read a research paper and implement the paper (e.g. LoRA)
  2. Write a Medium article explaining your code and your attempt to replicate the experiment of the paper (with lots of plots and maybe memes).
  3. Add that to your portfolio (most companies will always pick this over Titanic)

If you're targeting companies like Perplexity, HF 🤗, InstaDeep, DeepMind, Google Brain, OpenAI, etc... This strategy can easily get you a research intern / entry level interview invite.

rugged comet
#

If you had a continuous numerical column and a nominal categorical column, how would you visualize their relationship? More specifically, I'm interested in how the value in the continuous column affects the rate at which the categories occur.
At this point in the project, I'm trying to think it through. Maybe I want to bin the continuous column because the values within are fairly specific. Each row in the df has a category, a datetime, and a value for the continuous column. Maybe I want to find the rate at which the observations are recorded so that I can graph that against the continuous column.

jolly current
odd meteor
# rugged comet If you had a continuous numerical column and a nominal categorical column, how w...

Hopefully this clears up things a bit for you.

Scenario 1
Using Hypothesis Testing

Bivariate: Continuous vs. Categorical features

Plot: Barplot, Boxplot to visualise the relationship.

Example of Statistical Test: 2-sample Z-test ( to compare means of two independent population / groups)

Example: an analysis of whether a company managed by a man and a company managed by a woman spend the same amount on electricity on average.

Assume you have a column called Gender (gender is a feature with two classes; male and female)

See attached image of Hypothesis test and plot.

Scenario 2

If what you're looking for goes beyond a statistical hypothesis test, then you need to compute the point-biserial correlation (the Pearson correlation between a continuous and a binary variable), or alternatively you can use good ole Logistic Regression.

See attached 2nd image for reference.

rugged comet
# odd meteor Hopefully this clears up things a bit for you. **Scenario 1** Using Hypothesis...

Thanks for the reply. I'll probably try scenario 1 in addition to what I've been messing around with.
I think I misspoke somewhat in my original post. I'm really trying to find how the value of the continuous column affects the rate at which observations are made. We have date data as well. For example, as the value in the continuous column increases, does the rate at which observations are made increase?
The part that I'm trying to wrap my head around right now is some values in the continuous column are more common than others. Therefore, wouldn't it be likely that there were more observations at that value? For example, say an extreme value showed up in the continuous column, wouldn't that value have a lower count of observations than say a more common value?

crimson elbow
#

I'm looking for data science internships and was wondering for a portfolio if I should make a website (and if so use a template or to code it myself) or simply use a github for that?

regal wedge
#

i'm trying to learn ai and ml. does anyone know some good resources or videos to learn from? any channel that clearly explains the math behind it and shows the derivations and code implementation. Any books, blogs or videos

mild dirge
#

For starters the 3b1b videos are a nice intro, also explaining the intuition behind the mathematics @regal wedge

#

What are the neurons, why are there layers, and what is the math underlying it?
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks
dusty forge
#

I'm following a course and it had me change a single column with country names, to three columns with ones and zeros. Why would I want to do this, instead of keeping the single column but changing the countries to a numeric value, for example 1,2,3 for France, Spain, Germany respectively?

wooden sail
#

the 3 columns with 1s and 0s represent vectors with equal magnitude, all equidistant from each other

#

using 1,2,3 in a single entry implies, e.g., that spain and germany are more similar categories than france and germany because the distance from 2 to 3 is smaller than that from 1 to 3

#

which representation is better depends on your application

#

ML people call the binary vector approach "one-hot encoding" (one 1, all else 0), in case you wanna read more about it
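
The distance argument can be checked with a few lines of standard-library Python. This is only an illustration of the two encodings for the France/Spain/Germany example. The variable names are made up.

```python
import math
from itertools import combinations

countries = ["France", "Spain", "Germany"]

# Integer labels: a single number per country (France=1, Spain=2, Germany=3).
label = {name: [float(i + 1)] for i, name in enumerate(countries)}

# One-hot vectors: one position per country, with a single 1 in each vector.
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(countries))]
    for i, name in enumerate(countries)
}

def pairwise(encoding):
    """Euclidean distance between every pair of encoded countries."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for a, b in combinations(countries, 2)
    }

label_dist = pairwise(label)      # France-Germany is twice as far as the other pairs
one_hot_dist = pairwise(one_hot)  # every pair is sqrt(2) apart
```

With integer labels the model sees an ordering and unequal gaps that the categories do not have. With one-hot vectors every pair is equally far apart.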

dusty forge
wooden sail
#

which part do you mean?

dusty forge
dusty forge
# wooden sail which part do you mean?

using numeric values would 'influence' the way it's being read, I thought 1,2,3,4,5 etc is simply an index, never expected it to influence distance in similarities

wooden sail
#

it's a consequence of how euclidean distance is measured

#

you could read about vector and matrix norms to get familiar with the topic

dusty forge
#

Ok this is very helpful, thank you. I need a refresher course on algebra as well it seems 😉

wooden sail
#

linalg and statistics are the core of ML. then you use calculus to solve the optimization problems that arise from there

hybrid mica
#

If I want to make an AI to solve a specific task, should I go with the OpenAI API or train a custom model?

spark nimbus
#

Does pandas-on-pyspark offer any kind of named aggregation? The usual kwargs method doesn't seem to work unfortunately

hybrid mica
#

It is an NLP related task.

#

I would like to compare a written piece of text with a bullet pointed piece of text and count how many of the points are included in the written piece.

#

so far my experience at getting chatgpt to do this hasn't been that good

left tartan
#

Maybe NLP/feature extraction? See spaCy

hybrid mica
#

i now remember i asked this question before in this chat and you replied

#

?

#

should i use this?

left tartan
#

It’s not my area of expertise, but the type of problem you described (determining if a text contains references to certain topics) sounds like a fit.

past meteor
hybrid mica
#

spaCy looks like a good tool to calculate the similarity between two texts
however, how can i adapt it to the following:

Large block of text:

Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of other living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.

AI technology is widely used throughout industry, government, and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), interacting via human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).[1]

(source: Wikipedia)

Bullet-pointed text:

  • AI technology is used in industry
  • Self-driving cars
  • It relies on linear algebra, statistics and calculus
  • Data preprocessing
  • It can be used to play games such as Chess

My objective is to determine which of the bullet points were mentioned in the text. In this case, it would be points 1, 2 and 5.

past meteor
#

My opinion: GPT is probably worse than a bespoke solution but it has zero start up

#

You can use openAI's API to just embed your text and then train a classifier on it

#

A more end-to-end way to do this is just finetuning the model there

past meteor
#

If you want me to summarize it I can

hybrid mica
#

I was more looking for how I would train a classifier

#

since i don't have any data

past meteor
#

That's where you always need to start: gathering data

#

Actually, step 1 is unambiguously defining:

  • What is my task?
  • How do I judge if the task was carried out successfully by the model?
#

Once you have those 2 you gather data

hybrid mica
#

I thought I could do something related to semantics using vector embeddings to accomplish this task.

past meteor
#

I think the key word here is unambiguously 😄 (this is where I typically get a piece of paper and/or latex and write it out)

You say you want to identify the bullets in there. It sounds like classification.

#

You can compute the similarity between your bullet embedding and the text embedding, but from what point do you decide it is or isn't in the text?

#

You need a cutoff point

past meteor
hybrid mica
past meteor
#

You can do that sure

hybrid mica
past meteor
#

I would embed the document once and then compute similarity with each bullet one by one

hybrid mica
#

Would that work? since one bullet point with like 3 words is probably not similar to a 100 word document that includes lots of other info as well. The computed similarity value would be so small that it would be nearly impossible to differentiate between something which isn't in the text.
Is a better idea perhaps to break the text into sections within my program and then get the max similarity between a section and a bullet point?
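
The "split into sections, take the max, apply a cutoff" idea can be sketched as below. Loud assumption: `embed` here is just lowercase word counts, a stand-in so the example runs with the standard library, and not a real embedding model or the spaCy or OpenAI API. The sentence splitter and the `THRESHOLD` value are also hypothetical and would need tuning on labelled examples.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def max_section_similarity(document, bullet):
    """Split the document into sentences and return the best match for the bullet."""
    sections = [s for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    bullet_vec = embed(bullet)
    return max(cosine(embed(s), bullet_vec) for s in sections)

document = (
    "AI technology is widely used throughout industry, government, and science. "
    "Some applications are self-driving cars and strategy games such as chess."
)
bullets = ["AI technology is used in industry", "Data preprocessing"]
THRESHOLD = 0.3  # hypothetical cutoff

scores = {b: max_section_similarity(document, b) for b in bullets}
mentioned = [b for b, s in scores.items() if s >= THRESHOLD]
```

Word counts miss paraphrases, so a real embedding model would replace `embed`. The section-wise max and the cutoff stay the same.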

past meteor
#

GPT is capable of retaining a lot of information in its embeddings. We've seen that at work.

lapis sequoia
#

rate my graph 🫠

fallow frost
fallow frost
#

it doesn't say anything about the inner workings. other than with pyarrow, it can store NaN only in float and object arrays, and that's why it casts int to float ...

dry geyser
#

I have a CSV reader using pyarrow that is performing at roughly 22k/s (lines per second) including row/column validation and some transforms.

#

I'm wondering if anyone faced a similar project and had some good ideas on improving performance

#

i already made lookup tables to run static method validators, so i can run those by index

buoyant vine
#

Polars™️

#

Joking aside we switched all of our pandas/pyarrow stuff over to polars because the API is just way more coherent, simpler and normally faster.

dry geyser
#

if you dont mind a DM I can show you some of the transforms I do, but it boils down to coalescing data and validation of different fields. thats the most expensive part of the job, but pyarrow doesnt seem like it's blazing either

rocky spade
dry geyser
#

note that pyarrow already does multithread streaming

rocky spade
#

multithread is not parallel

#

multithread is not distributed

dry geyser
#

i didnt say it is, but distributed does not necessarily help all cases.

rocky spade
#

two computers doing one thing why not helping?

dry geyser
#

if it did, there would be no market for massively powerful hardware meant to do single node work

#

latency, all the added overhead of clustering/distributing tasks

#

two kernels, two context switching scenarios, two userland, two mmu, two network stacks ,.. the list goes on

#

when i hear people trying to answer 'distributed' to everything it makes me wonder if they have a clue about OS internals at all

rocky spade
#

Because in common house hold don't buy two computers to do one thing

dry geyser
#

common household? i have a 2TB EPYC dual cpu supermicro chassis one floor below...

#

2tb ram*

#

and it isnt even something exotic

rocky spade
#

I mean, all these matters you can just let your other computer only reporting after finish

dry geyser
#

anyway

rocky spade
#

there is no need reporting in run time

dry geyser
#

@buoyant vine looking at polars

#

seems it's missing some candy for the CSV API

#

namely stuff like auto conversion for boolean types with known false/known true values

rocky spade
#

I mean, distributed and parallel do help your issues, don't know why you got offended and don't want help

wooden sail
past meteor
#

Maybe a hot take but I'd even use it if it were slower than Pandas

dry geyser
#

@rocky spade i did not get offended, you just suggested something that does not effectively solve any issues i have, nor provided any actual insight technically usable.

#

some of the options are supported in polars it seems

#
        self.convert_opt = pv.ConvertOptions(false_values = self.csv_bool_false_values,
                                             true_values = self.csv_bool_true_values,
                                             column_types=self.mapped_column_types,
                                             include_columns=self.wanted_columns,
                                             null_values=[""],
                                             strings_can_be_null=True
                                             )
#

but not the false/true values

#

that saves me a good amount of pain as i do not have to run my validators for boolean dtypes

past meteor
#

It's kind of ironic how Python was known for data in the Pandas + matplotlib era when these 2 weren't the best / most user friendly tools imo

dry geyser
#

essentially the more i can bypass validation for, the better.

rocky spade
buoyant vine
#

If the data is already in CSV 😅 Parquet isn't going to save you

#

although Parquet is by far the best format to use if you can

dry geyser
#

I am already converting data to parquet for secondary backups

#

exactly lol

#

"try parquet" (but im parsing CSV son... im receiving CSV files...)

#

picks the phone to convince the source of the data they need to rethink and rewrite all their crap to produce parquet files

buoyant vine
#

I can almost guarantee their answer will be "lol no" even if it would be beneficial for them as well

#

we get sent TB of CSV files and they just refuse to do it any other way

dry geyser
#

^ real world.

#

@rocky spade i wish for a lot of things. in fact, i wish i could just throw a 50mil record file on gpt and ask it to be my beeyotch and return a perfectly structured, deduped, coalesced dataset my way

#

but it seems like i dream too far

left tartan
dry geyser
#

plays i believe i can fly

left tartan
#

But sql is my answer

past meteor
left tartan
past meteor
#

I haven't used it in a long time, but I'd always argue their data processing toolkit is more user friendly. It's just too slow.

buoyant vine
#

This is probably more extreme, but maybe you could use something like Trino/Athena/Presto if you want to be as fast as possible and don't care about the cost.

The SQL query will suck, but I know Trino is capable of brute forcing its way through. If you want something nicer though :/ I'm not too sure since, as you said, Polars doesn't support everything you really need easily.

Maybe you could make a LazyFrame in polars and do the bool conversion as part of the pipelined operation?

dry geyser
#

    def process(self):     
        if self.parse_method == INGESTION_METHOD_MEMORY:
            table = pv.read_csv(self.filename, read_options=self.read_opt,
                                parse_options=self.parse_opt,
                                convert_options=self.convert_opt)
            self.total_rows = table.num_rows
            self.prebake_lookup_tables(table.schema.names)
            self.process_table(table)
        else:
            # XXX: beware this can trigger OOM, beats cpython iteration through lines() though...
            self.total_rows = self.count_rows()
            
            with pv.open_csv(self.filename, read_options=self.read_opt,
                            parse_options=self.parse_opt,
                            convert_options=self.convert_opt) as reader:
                self.prebake_lookup_tables(reader.schema.names)
                self.loop_through_chunks(reader)

I'm about to benchmark it again without any of the post-validation stuff

past meteor
#

I'd love a credible, maintained port of ggplot2

left tartan
#

So it’s not the csv reading per se but the line processing?

rocky spade
rocky spade
dry geyser
#

@buoyant vine

[2024-02-09 15:55:52,298] [MainProcess:Thread-2 (periodic_performance_logger)] INFO: CSV: Processed 780221 lines in 24.09 seconds, 32385.35 lines/second (ETA: 6.18 min)
^ validation enabled, using lookup tables for running per-column methods

#

@rocky spade no, it doesnt, i dont think you understand how pyarrow works. the total_row_count is only done on the streaming part as an initial step, and it technically does not parse anything because im not consuming the dataframes there.

#

it literally is a newline/carriage return counter
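
The `count_rows` method itself is not shown in the chat, so the following is only a guess at what such a newline counter might look like. It reads fixed-size binary chunks and counts newline bytes without parsing anything. The function name, chunk size and demo file are assumptions.

```python
import os
import tempfile

def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes by reading fixed-size binary chunks; nothing is parsed."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += chunk.count(b"\n")
    return total

# Tiny demo file: one header line plus two data lines.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,flag\n1,yes\n2,no\n")
    demo_path = tmp.name

demo_count = count_newlines(demo_path)
os.remove(demo_path)
```

A counter like this includes the header line and overcounts if quoted fields contain embedded newlines, so it is better suited to progress and ETA reporting than to an exact row count.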

rocky spade
#

And should you not use Spark for this matter?

dry geyser
#

on nvme it completes in ~3 seconds tops

buoyant vine
#

Spark is a pretty nuclear option, and you'd spend more time setting up the Spark cluster than the work itself probably.

dry geyser
#

@rocky spade do you mind sharing a link to a github or something showing your work?

#

@buoyant vine indeed

buoyant vine
rocky spade
dry geyser
#

@buoyant vine they are dataset dependent, "yes", "no", etc. I made it dynamically configurable, sometimes they change.

rocky spade
#

I mean, i don't know what you want?

dry geyser
#

@buoyant vine Processed 12797780 lines in 375.41 seconds, 34089.97 lines/second (ETA: 0.00 min) < this is including validation. no coalescing/dedup per row, though. i still need to optimize that. it's not as trivial as the validators, since i have some fancy expression support to do inter-column checks and such (ex. fields are transformed based off other columns and their values)

buoyant vine
#
import polars as pl

true_values = pl.Series('true_values', ["yes", "y", "ok"])
false_values = pl.Series('false_values', ["no", "n"])

data_stream = pl.scan_csv(
    "my-files/*.csv",
    schema={
        "my_bool_col": pl.Utf8,
        "my_other_bool_col": pl.Utf8,
        "title": pl.Utf8,
        "description": pl.Utf8,
        "something": pl.UInt32,
        "else": pl.Float64,
    },
    truncate_ragged_lines=True,
)

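data_stream = (
    data_stream
    .with_columns(
        # True where the string is one of true_values; any other string
        # becomes False (false_values is not consulted here).
        pl.col("my_bool_col").is_in(true_values),
        pl.col("my_other_bool_col").is_in(true_values),
    )
)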

# ... processing

I am not sure how fast this is since I don't have anything to test it right now, but this should do the bool conversion as part of the streaming operation.

dry geyser
#

i can check a bit later

buoyant vine
#

The inter-column checks are probably going to be the slowest thing, since data engines can often get confused with them

dry geyser
#

hows type inference with polars?

past meteor
#

Very good

dry geyser
#

i wrote a tool also for improving type inference for the parquet conversion, it isnt anything sophisticated, but speeds up crafting the yaml configuration for each CSV dataset

buoyant vine
#

It's good. The only issue I have run into is that it reads and writes Utf8 strings as Utf8 with 64-bit int lengths, which you can't change.
Had an issue before where, if you have some custom arrow processing and it can't auto-convert between the 32-bit and 64-bit lengths, it can cause some issues.

dry geyser
#
import pyarrow as pa

PYARROW_TYPE_MAPPINGS = {
    "StringType":           pa.string(),
    "EmailAddressType":     pa.string(),
    "PhoneNumberType":      pa.string(),
    "GenderType":           pa.string(),
    "BooleanType":          pa.bool_(),
    "DateTimeType":         pa.timestamp('ns'),
    "CountryType":          pa.string()
}

just an example from one of the type mappings

past meteor
#

The most important thing with polars type inference is that it knows the difference between pl.LazyFrame and also the "context" you're in like Expr, or Select etc

#

It quite accurately tells you what ops you can and can't do

#

Or did you mean the inference while reading data

dry geyser
#

while reading

true spade
#

Hi there, not sure if this is the right channel to ask this question, but essentially I am currently struggling to figure out whether I am spending too much time on organizing code in my Jupyter notebooks as opposed to conducting experiments and exploring data/opportunities with the aid of the notebook.

As a result, I was wondering if anyone has any advice on how to balance organizing code with actually using Jupyter notebooks for analyzing data and experimenting with different kinds of models?

rocky spade
#

That's right son

dry geyser
#

ex. if you look at the above table, i have a per dataset mapping that maps columns to those generic types. each has its own validation logic, including adaptive settings per dataset (ex. known observed bad values)

buoyant vine
dry geyser
buoyant vine
dry geyser
#

as i already apply column type mappings. for the most part i care about dates and some string types, the rest i can often apply a fast path in validation and just leave them as is or as none/null

past meteor
#

My experiment pipeline is always a .py

dry geyser
#

ill definitely look into a polars version of the current csv processor

buoyant vine
true spade
# buoyant vine 😅 I'll confess I never organise my notebooks, they are all not in the git histo...

I see, the issue is that I have to put all of my code (at least the relevant parts) in my Jupyter notebook as part of an assignment, but it's currently overdue since I have been spending too much time on organizing my code.

I tend to try to follow the DRY principle since I have found that it's something that has helped me out in other areas such as software engineering, web development, and game development, but I'm not sure how well it carries over to Jupyter notebooks (which I mainly use for data analytics and machine learning projects).

However, I really don't know if my tendency to be a stickler when it comes to adhering to the DRY principle is causing me to waste too much time on cleaning up my code and reducing code duplication/generalizing common procedures instead of actually y'know... exploring the data and experimenting with new models haha

buoyant vine
#

Overall I'd say it is very against the idea of implicitly or automatically type casting things. For the most part it just won't do it unless you explicitly do it.

rocky spade
#

So pass the 22K/s is impressive?

dry geyser
#

@buoyant vine the tl;dr is that in the end i want an array/list that matches the indexing of my wanted columns set (this is how i optimize validation, by executing/running all that by index, the validators are assigned to the right fields once, and the table is cached)
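
The index-aligned lookup table described here can be sketched roughly as follows. This is not dry geyser's code, which isn't shown in the chat. Every name below, including the toy validators, is hypothetical.

```python
def is_non_empty(value):
    """Toy validator: reject None and empty strings."""
    return value is not None and value != ""

def looks_like_email(value):
    """Toy validator: a deliberately naive email check."""
    return isinstance(value, str) and "@" in value

def accept_anything(value):
    """Fast path for columns that need no validation."""
    return True

def build_validator_table(column_names, validators_by_name, default=accept_anything):
    """Resolve each column's validator once, in column order."""
    return [validators_by_name.get(name, default) for name in column_names]

def validate_row(row, validator_table):
    """Apply the validator at index i to the value at index i."""
    return [check(value) for check, value in zip(validator_table, row)]

columns = ["name", "email", "notes"]
table = build_validator_table(
    columns, {"name": is_non_empty, "email": looks_like_email}
)
row_result = validate_row(("Ada", "not-an-email", None), table)
```

The per-row work is then a plain `zip` over two sequences with no name lookups. The row loop itself still runs in Python, though.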

rocky spade
true spade
# past meteor My experiment pipeline is always a `.py`

I see, thanks for letting me know about that, just curious, what do you mean by an experiment pipeline?

Are you referring to a set of scripts that just contain experimental code (which might or might not be scrapped in the future)?

buoyant vine
past meteor
#

I like having it all reproducible and so on

dry geyser
#

@buoyant vine the validation is done to the row as a list/tuple/set

buoyant vine
#

Any particular reason for that? Or just because it was the best way with pyarrow?

rocky spade
dry geyser
#

it was the best way with pyarrow

true spade
rocky spade
#

Because you have no clue what under it

dry geyser
#

@buoyant vine i can DM you if you are curious

buoyant vine
#

Fair enough, yeah I would try with Polars' more columnar approach if you can. I don't know all your validations, but if you can do it without going row by row then it should be pretty speedy

buoyant vine
buoyant vine
#

Although tbh, when we hit those sorts of issues, we stop doing the code in Python 😅

past meteor
dry geyser
#

@buoyant vine i wrote a very simple test in rust without all the dynamic/configurable validation and mappings, and it beat the crap out of python 3.12 with latest pyarrow

#

single threaded too

rocky spade
dry geyser
#

all numbers i provided so far come from a i9-13900K workstation

rocky spade
past meteor
#

About Polars, what I see being a big issue of people transitioning to it is not leaning into it

true spade
past meteor
#

I think if you're doing iter_rows and/or map then using it doesn't make sense

rocky spade
rocky spade
dry geyser
#

@rocky spade if you look at how @buoyant vine 's and other folks' interactions work out, your experience in this channel with other people will likely improve also linearly to your enthusiasm in "distribute everything ahoy"

#

plonk

#

/ignore @rocky spade

#

lol

buoyant vine
#

Polars is very nice though if the data can sit on local disks and be done with it though

dry geyser
#

f that, im doing all this on nvme/optane

#

so i know IO is not the bottleneck at least in that sense

buoyant vine
#

😎 Join the darkside and doing 100s of GB / s on blob storage

dry geyser
#

lol

past meteor
rocky spade
#

I never said anything about distributing everything. This guy hates anyone who mentions distributed, while he uses a high-level language on top of a high-level package, thinks that's impressive, and asks for improvements. When you talk about improving reading speed, isn't the only magic to switch packages, fix a stupid error in your code, or change to parallel or distributed reading? Anything other than that means writing a reading package from scratch without using any package someone else wrote; then we can talk about fundamental improvements

buoyant vine
# rocky spade There is no way you can improve a performance when using someone packed stuff

This is somewhat misleading I'd say, or a misconception at least. Yes, there are limits, but most of the time the library code itself is not the limiting thing.

It is also worth mentioning that it is typically not worth it to build some system from scratch in something like Rust or C++ unless you actually have issues with the current speed or have some other requirement which the Python lib or whatever doesn't support well.

There are a lot of optimizations you can normally do before you get to that stage

dry geyser
#

@rocky spade i think you dont read english well. i never said 22k is impressive. i said it's the ceiling of what is possible given the circumstances. yet you are here trolling because your petit ego got hurt when i told you that your suggestions were not valuable. look at how other people respond here. their input has value. they are not acting haughty or like they have a chip on their shoulders. i bet any of these kind fuckers have a fairly sizable amount of experience on their shoulders, thats where their humility and good attitude comes from. get over it. learn from them.

buoyant vine
#

We can also probably chill out a little bit 😅 We don't need to argue or throw insults or what not

#

just before this becomes too heated...

dry geyser
#

lol

rocky spade
#

isn't there anyway to improve when you use pandas to read CSV file?

#

pa.read

#

The only one have no chip on their shoulder is starting calling others son when someone replying to your code after asking for improvment

agile owl
#

Don't understand why you'd be writing single threaded apps if you care about performance in 2024.

rocky spade
agile owl
#

python single thread performance is of course going to lose to rust too, don't think that's controversial at all, the runtime has a cost

dry geyser
#

"son" is not an insult, and you suggested "i use parquet" to a question that obviously involved CSV data... which cannot be obtained in any other format....

agile owl
#

calling ppl son is typically considered a sign of disrespect

dry geyser
#

if you cant take humor you should not be hopping into the internet

#

he spent ~1hr offended because someone made a "son" joke in an internet channel. solid,

agile owl
#

anyway probably time to move on from that, what's the issue exactly, that polars in Python is underperforming polars in Rust on a single threaded app?

dry geyser
#

no, pyarrow

rocky spade
#

Multi threaded is a one process

agile owl
#

threads should be more efficient for IO bound things

buoyant vine
#

😅 Ngl I think making this a distributed system. for this task is a bit overkill

dry geyser
#

he also doesnt understand how threads work apparently

buoyant vine
#

Especially if you don't need the cluster all the time, no matter what system you use, managing the cluster suckkks

past meteor
#

Going distributed is a special kind of pain you want to avoid imo

agile owl
#

you use processes for CPU bound things in Python because of the GIL but for IO bound things you can use threads
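
A minimal standard-library sketch of the thread side of that rule. `time.sleep` stands in for a blocking IO call such as a network request, because it releases the GIL while waiting. The task function and the numbers are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task(task_id, delay=0.05):
    """Stand-in for an IO-bound call; sleeping releases the GIL like blocking IO."""
    time.sleep(delay)
    return task_id

def run_in_threads(n_tasks=8, workers=8):
    """Run the tasks on a thread pool and report results plus wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_io_task, range(n_tasks)))
    return results, time.perf_counter() - start

results, elapsed = run_in_threads()
# The waits overlap, so elapsed should be close to one delay, not eight.
```

For CPU-bound pure-Python work the same pattern with `ProcessPoolExecutor` is the usual choice, since threads would take turns holding the GIL.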

past meteor
#

It's a high price you pay for a nonexistent reward if it all fits in 1 machine

agile owl
#

@past meteor just live your entire life in distributed async land and treat everyone else like a baby

buoyant vine
#

Also, probably worth mentioning pyarrow is written as a native extension, it releases the GIL in its parsers 😅 So you get the full use of the CPU.

long locust
#

Hey there, just for the record please remain civil, "RTFM" is not a very friendly phrase

past meteor
buoyant vine
past meteor
#

I think it could've been a sync flask app in a couple of days

#

Fault tolerant? It failed 😭 (it's currently down)

rocky spade
buoyant vine
rocky spade
#

From today i didn't see multithreaded is very impressive, do you know how to write parallel and make a use?

past meteor
# buoyant vine What was it even supposed to do?

We ran a clinical trial. All it had to do was call an API. For some reason he really really wanted to make it stream data so he ended up polling the API every few seconds. Problem is, he messed up and we had tons of duplicates.

Secondly, batch would've been totally fine for us. Just calling the API once every day or every half day solved the problem.

#

There's a couple more microservices but those are basically there for what I believe is obfuscation

buoyant vine
lucid hornet
#

Ohhhh, I was getting arrow and pyarrow confused. Was trying to figure out why a datetime library would need a csv reader

buoyant vine
#

unfortunately, Docker images coming at a cool 24GB in size compressed

#

and we ran out of ephemeral storage sadge

past meteor
#

We have 3 services, one polls data source A, another lets our clinical partner upload patient info and a last one polls data source C

buoyant vine
past meteor
#

To do any query you need to join so many API keys 😩

buoyant vine
#

Why would they name the datetime lib the same thing as the well-known data format sadge

buoyant vine
past meteor
#

Keycloak 🥴

past meteor
#

Well, you get what you pay for

rocky spade
buoyant vine
#

underestimating the performance of AI models™️

left tartan
past meteor
#

Honestly, the project started before I joined. If I were there from the start I'd have challenged many questionable decisions

lucid hornet
past meteor
#

I think ultimately what $dev did was resume driven development

lucid hornet
left tartan
buoyant vine
buoyant vine
#

"We can afford a bit of a price increase, its not an issue for us"
But can you afford a 100x increase

rocky spade
#

So i was thinking, in that situation, can he cut the file in half, such as find a way read only half of them, and then make a parallel reading?

lucid hornet
#

Can also be streamed in, but I don't know if that helps with a csv

rocky spade
#

Because there is no way you can improve things when you use a PACKAGE

left tartan
#

Certainly, but I think first question is: what is the current bottleneck and why?

rocky spade
buoyant vine
#

it is already parallel, and distributed is overkill 😅

agile owl
#

does this belong to the class of problems where we're complaining that Python is just slower than Rust and end up saying that if you don't like the Python performance then don't use python

#

because that's what it seems like

left tartan
#

@dry geyser i am curious what your bottleneck is, but if you’re done with this conversation I don’t want to drag it out. Can you share more info about the per line processing?

true spade
rocky spade
#

stupid code error, code structure ; package problem
solution=> write your own fucking package, parallel and distributed.

#

Does anyone know multithreaded is one single processor right?

#

I don't see any difference with Python concurrency

dry geyser
#

@left tartan the bottleneck is the validation/coalescing/etc. without it it's ~34k/s, roughly 15-20MB/s, pyarrow without any validation or pandas/dataframe conversion can maybe go up to 100MB/s

#

there is a dual conversion for the dataframes happening too. the final product is a deduplicated, coalesced dict with validated information (including some dynamic expressions, but i have tested without that too, similar to asteval)

left tartan
dry geyser
#

i already do it with the validation by building lookup tables and processing the columns by index, for coalescing it's trickier because i support complex inter-column logic. ex. if column X has value Z, set field to Z, else take value from Y

dry geyser
#

will need to consider a similar approach, but because it builds a dict to be batched for elastic indexing, it is less trivial than vectorizing the validation, which in the end works with a list/set, so we can basically assume column N has validator X and it will remain constant

#

the coalescing is not immediately solvable since we need to iterate thru the validated data, find the dups, remove them, and so on.

#

however the gains later are immense because the indexed data never needs touchups

left tartan
dry geyser
#

so i dont have to deal with any of the annoyances in ES for updates

left tartan
#

But I get the dupe detection problem

dry geyser
#

ex. multiple columns contain an identifier, which sometimes repeats. i get rid of all the dupes.

left tartan
#

** I’m a DuckDB shill so my first experiment would be to load to a DuckDB table, and do it all in sql.

dry geyser
#

a very well respected math-head recommended duckdb to me for this project but i saw some limitations as i need near realtime text lookups

#

i augment the data externally with edgedb for holding some relational data/caching some searches

#

i would be interested in talking about how it would work with duckdb though

#

the problem for me was the massive amount of potential idempotent inserts

left tartan
#

Oh, I was just thinking for processing. You might then export and use another way for lookups.

dry geyser
#

ex. identifiers connected to a given object being repeated

#

suddenly i end up with 15 mil select or insert queries = no go

#

(hence elastic)

#

@buoyant vine has been helping me grok polars to adapt the current csv processor, there are some hiccups but apparently polars has an expr engine

#

@buoyant vine ill ping you about the native expr stuff in polars

#

got a mockup with polars going

#

$ time python testpolars.py tests/fixtures/..._500k.csv

real 0m1.014s
user 0m2.132s
sys 0m0.323s

#

14mil records in 28seconds, with boolean conversion already done

buoyant vine
#

I hope that is a good sign 😅

dry geyser
#

What would be the equivalent for handling dates? ex. attempt auto conversion

#

assuming UTC

#

(or no tz)

#

try_parse_dates

rocky spade
#

@left tartan for asyncio, i understand it is one thread concurrency, but how if it is one thread, there is a loop manger? that can control loop?

#

because the loop manager is always in current concurrency and never leave or change?

dry geyser
#

read how epoll() is implemented to understand how it can do what it does in a "single thread"

left tartan
left tartan
rocky spade
# left tartan The inner workings here is not something I’m very familiar with.

I saw this but never understand is why it is not parallel when one task is running and than switch to another task when yield, so basically there is only one thread, and inside the thread the scheduler calls other concurrent task when they reported or ready, but if it is not parallel, how would they know? => so when one task is await, then there will a list to check if other task is ready?

#

I do lack a basic understanding of processors or concurrency at the programming level

left tartan
rocky spade
left tartan
#

The scheduler handles assigning the work: a thread can be preempted so that another thread can run.

#

I’m not familiar with the internal mechanism of how the scheduler works.

#

(There’s a more complicated discussion about ‘why’, which leads to the GIL and eventually PEP 703)

rocky spade
#

Do they open sourced it ?

left tartan
rocky spade
#

I thought Python is open sourced..

left tartan
#

Cpython is Python (well, there’s others but it’s the one you’re using)

past meteor
# rocky spade I saw this but never understand is why it is not parallel when one task is runni...

The way I'd always explain it (a bit hand-wavy) is that concurrency is an idea and parallelism is one specific implementation; asynchronous programming is another. Python's async/await is based on event-driven programming (which is a way to do async): you have an event loop that submits tasks with a callback. When the task is done it's put in a queue that the scheduler checks frequently to see what tasks can be resumed. True parallelism across threads isn't possible in pure Python (CPython) because of the global interpreter lock.

rocky spade
rocky spade
rocky spade
#

is there any way to see the code directly, like what is the callback and scheduler in Python?

#

CPython...

left tartan
rocky spade
#

Because after i know multiprocessing module, and see their documentation, their impression is that GIL is just a joke?

#

for most common way of using?

#

I don't fully understand GIL, i just assume it is just locked the thread or something intentionally

past meteor
# rocky spade But assuming current task is running, then the scheduler just like checking crea...

You only need to check the queue when a task has finished or awaited to schedule the next one. The loop uses select, poll, epoll, ... like jimmyhoffa has mentioned. Their advantage is that you don't need to actively poll, which means you don't need to keep asking the task "are you done? are you done? are you done?".

The callback is really abstracted away in async/await. Another hand-wavy explanation: the callback here would be the code that follows after the await. That's what needs to be done when the event is finished.
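
To make the single-thread point concrete, here is a small sketch (the `worker` coroutine and the `log` list are made up for illustration). Each `await` hands control back to the event loop, which resumes whichever task is ready next, so the two tasks interleave without ever running in parallel:

```python
import asyncio
import threading

log = []
thread_ids = set()


async def worker(name, steps):
    for i in range(steps):
        thread_ids.add(threading.get_ident())
        log.append(f"{name}:{i}")
        # Yield to the event loop; everything after this line is, in effect,
        # the "callback" that runs when the task is resumed.
        await asyncio.sleep(0)


async def main():
    await asyncio.gather(worker("a", 2), worker("b", 2))


asyncio.run(main())
print(log)              # the two workers take turns
print(len(thread_ids))  # every step ran on the same thread
```

Nothing here checks "are you done?" in a busy loop: a task only gives up control at an `await`, and the loop picks the next ready task at that moment.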

past meteor
#

Do you know about generators?

rocky spade
#

I checked the yield, so i know about it, somehow

#

I understand the code and the concept

past meteor
#

Well, let me not confuse you 😄 I think this is more than enough information for one day haha

#

Just write code and it'll become clear

rocky spade
#

please do more

#

just asyncio if is not parallel confused me about 5 months

left tartan
past meteor
#

The most important thing, imo, is to understand that concurrency is an idea that has multiple implementations

#

It's like an abstract class if you may 😄

dull copper
#

Should i start with naive bayes or linear regression?

past meteor
dull copper
rocky spade
#

Just asking, do you guys know anything can fix my fundamental problem like how to code like in deep down level, such as directly communicate with bytes, how to build like memory safe or something like that, like very detailed stuff than just use a high level language? From bytes to high level language in between

#

I checked CS50 they explained about memory safe and those topics

#

but i do want more of it

#

I checked Harvard CS50 but didn't watch it through about memory safe or something just a little bit explanation

jagged latch
#

I have a question to those experienced in Plotly Dash. Alright so a little background. I am trying to recreate a dashboard from a proprietary work website, and one of the features is that it changes the SQL query based on the date chosen. I already got the SQL query running and I got the algorithm to help me generate df_2 based on the date chosen by the user (this is done through a dialog box that pops up via tkinter). I'm now working on designing the app. I wrapped all the other code in separate functions. I have a text box with a button. I basically have it so that if n_clicks > 0, then I want to call all those functions I defined earlier in the Python code prior to the app code to generate a new df_2 based on the new date entered. Is such a thing possible?

dry geyser
#

for the ultimate guide into communicating with bytes, god, and everything in between

#

(offtopic)

craggy patio
#

For all you AI wizards, I am planning on making a voice detection model with a CNN. I am taking the greyscale spectrogram of my voice and feeding it into the model to be analyzed. Here is a simple diagram showcasing my plan

Input: (batch_size, 1, height, width)
   |
Conv1 (3x3 kernel, 32 filters)
   |
   v
Activation (ReLU)
   |
   v
MaxPool2d (2x2 window, stride=2)
   |
Conv2 (3x3 kernel, 64 filters)
   |
   v
Activation (ReLU)
   |
   v
MaxPool2d (2x2 window, stride=2)
   |
Flatten
   |
   v
Fully Connected (Linear) Layer (64 * 16 * 16 -> 128)
   |
   v
Activation (ReLU)
   |
   v
Fully Connected (Linear) Layer (128 -> 2 classes)
   |
   v
Output: (batch_size, 2)

Please give me some suggestion on how to improve this model
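
One thing worth checking in the diagram is the `64 * 16 * 16` flatten size, which only holds for a specific input size and padding. A quick sketch of the arithmetic, using the standard output-size formula for convolution and pooling layers (the 64x64 input and the `padding` values are assumptions, since the diagram doesn't state them):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flatten_size(side, conv_padding):
    for _ in range(2):  # Conv (3x3) -> MaxPool (2x2, stride 2), twice
        side = conv_out(side, kernel=3, padding=conv_padding)
        side = conv_out(side, kernel=2, stride=2)
    return 64 * side * side  # 64 filters after Conv2


with_padding = flatten_size(64, conv_padding=1)
without_padding = flatten_size(64, conv_padding=0)
print(with_padding, without_padding)
```

With a 64x64 spectrogram and `padding=1` on both convolutions this matches the diagram; with no padding, or a different spectrogram size, the first Linear layer needs a different input dimension.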

final kiln
#

I finally debugged the redis issue

#

Seems like the model is gonna plateau

#

I believe a 0.8 loss is acceptable tho

craggy patio
#

do u think my model is good?

blissful hatch
#

Hello

final kiln
merry ridge
final kiln
#

this is how my pipeline is looking

wooden sail
versed pilot
limber mesa
limber mesa
coral bloom
#

heyyy

#

can anyone help me solve this?

```sh
OSError: Unable to load weights from pytorch checkpoint file for './pytorch_model-00001-of-00006.bin' at './pytorch_model-00001-of-00006.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
```

#
Loading checkpoint shards:   0%|                                                                                                                                     | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 531, in load_state_dict
    return torch.load(
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\torch\serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Orca_LLM\Orca-2-13b\apples\lib\site-packages\transformers\modeling_utils.py", line 540, in load_state_dict
    if f.read(7) == "version":
  File "H:\py39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 389: character maps to <undefined>
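
"failed finding central directory" means the shard could not be opened as a zip archive, which often points to a truncated or otherwise incomplete download. A rough way to check every shard without loading the model, assuming the shards were saved in PyTorch's zip-based format (the `find_bad_shards` helper and the glob pattern are hypothetical):

```python
import zipfile
from pathlib import Path


def find_bad_shards(folder, pattern="pytorch_model-*.bin"):
    """Return the names of shard files that are not readable zip archives."""
    return [
        p.name
        for p in sorted(Path(folder).glob(pattern))
        if not zipfile.is_zipfile(p)
    ]


print(find_bad_shards("."))  # run from the model directory
```

Any file this lists is a candidate for re-downloading; comparing its size on disk against the size published with the model is another quick check.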
dry geyser
#

@buoyant vine might have found an issue with how polars handles schema/dtypes

#

there seems to be an obscure bug where the index for some columns is offset by one

#

the mismatch leads to an issue later on where the index used to assign a field is not the one expected, ex. from the computed headers of the csv

dry geyser
#

any polars guru around?

dry geyser
#

@limber mesa hey

#

🙂

limber mesa
#

👋 ola

true spade
# past meteor In notebooks I start off with rough code and then I make it better and potential...

Hi there, sorry to necro this message again, but I was just curious about what your definition of "make it work" would be in this case?

Would it be to ensure that the code runs without errors and performs its designated task correctly?

Or would it be to be able to make new useful observations/gain valuable insights into what you are doing within the notebook (i.e. exploring/analyzing data, training and finding suitable models to address a certain problem)?

Or would you say that "make it work" means something else in this case?

dry geyser
#

so i found the following: if i specify schema to my scan_csv, i can double performance by skipping the type inference, but it seems to skip columns. i made a single record test case to test and confirmed the problem. basically i depend on headers (array of column name) being static/having fixed indices. i have optimized most of the logic to do away with named/dict based access, so it's all index-referenced. the problem manifested when i noticed some columns were assigned to a shifted index. ex. birth date column got shifted by one, and it picked the wrong value.

#

if i use dtypes instead of schema, the problem disappears

#

is schema expected to be in order?

#

is there a way for me to disable type inference for any field not specified in dtypes passed to scan_csv?

teal lance
dry geyser
#

@limber mesa also, could you explain to me how the filtering and expr engine works?

#

tl;dr of course, no need to go in depth

#

ex. what happens when i build several expressions and pass them to my lazyframe

teal lance
teal lance
limber mesa
# dry geyser so i found the following: if i specify schema to my scan_csv, i can double perfo...

Hey, if referring to pandas, you're better off using named columns and accessing by name. pandas works with indices but I believe it's not made for it. And as you've noticed, if one of the columns is in a different order. Everything messes up as things are not what you think they are. I suppose it's the reason people prefer dicts over lists after a while. They both have their own use cases but yeah.

dry geyser
#

polars

#

i use lazyframe

#

setting inference length to 0 does the trick

#

im not sure how polars handles this internally but producing a dict is expensive. ill measure how much performance is lost in precise numbers, but going from named=True to named=False gave me an extra few k/s

#

i have already doubled the speed including validation

#

now writing a new validation class that builds the polars expr(s), i need to measure it though

dry geyser
#

anyone here also plays poker and does "things" with bigdata/stats?

teal lance
dry geyser
#

haha

#

Is there a better way to do this:

df = pl.DataFrame({
    "emails": ["johndoe@hello.com", "bob@gmail.com", "bogus", "a@a.com", "no@a.com"]
})

filtered_df = df.with_columns(
    pl.when(pl.col("emails").str.contains(good_regex))
    .then(pl.col("emails"))
    .otherwise(pl.lit(None)).alias("emails")
)

print(filtered_df)

filtered_df = filtered_df.with_columns(
    pl.when(pl.col("emails").str.contains(bad_regex))
    .then(pl.lit(None))  # Set bad emails to None; adjust as needed for your use case
    .otherwise(pl.col("emails")).alias("emails")
    )

print(filtered_df)
#

ex. combining both expressions

#

in one statement

#

and yes we could do a massive regexp in one shot, but for the purpose of figuring out how to best write polars exprs, lets assume two singular regexps, one for basic email format validation/standard conformance, and the other for known bad values

#

how is this applied internally?

#

ex. can I keep altering the df and my validation remains present for other columns?

#

especially in the context of a LazyFrame

past meteor
dry geyser
#

hey @past meteor

#

another question: suppose I want to validate alpha2 country codes, i can precompute a table of known good values from pycountry. is there a way to integrate this into the polars validation?

royal badge
#

I have finished learning C (for a better understanding of Computer Science and related concepts), and now I am learning Python. I want to know what I need to learn first in Python so that I can code in it, before moving on to things like pandas, NumPy, scikit-learn, etc. Is there anything in between the basics of Python and those libraries? Can you tell me all the basic topics to cover before learning the maths and then moving towards pandas, NumPy, scikit-learn, etc.?

In addition, also tell me which laptop I should purchase.

dry geyser
#

or rephrasing the question: how expensive is it to include an expr for a given column that might have ~200 item list.

past meteor
#

That will typically answer your question

dry geyser
#

i can easily rewrite the static validators into expr ones, precompute the list and then pass it to with_columns (AFAIK). if i can make all the validation logic into exprs, i can remove a costly for loop altogether

dry geyser
#

the trickier ones are those with more convoluted logic like pycountry stuff. i already use a lookup table made for the task

#

basically anything that involves iterating through rows is huge bottleneck

#

is a*

past meteor
dry geyser
#

yessir

#

@past meteor like so:

#
import pycountry
from typing import Dict, Optional

class CountryValueValidator(blahblaStaticValidator):
    @staticmethod
    def validate(value: str, options: Dict, **kwargs) -> Optional[str]:
        # Reject missing, non-string, and empty values up front
        if not isinstance(value, str) or value == '':
            return None
        
        country = None
        
        if len(value) == 2:
            country = pycountry.countries.get(alpha_2=value)
        elif len(value) == 3:
            country = pycountry.countries.get(alpha_3=value)
        else:
            try:
                country = pycountry.countries.lookup(value)
            except LookupError:
                # Fall back to locally configured corrections, ex. EIRE -> Ireland
                local_fixes = options.get('mapping_fixes') or {}
                corrected = local_fixes.get(value)
                if corrected is not None:
                    country = pycountry.countries.lookup(corrected)
                else:
                    print(f"country failed lookup {value}")
        
        if country is None:
            return None
        
        return country.alpha_3
past meteor
#

It's a bit too early for me to read everything haha but sure

dry geyser
#

so, pseudo: if value length = 2, country might be in alpha2 table, if 3, alpha3 table

#

hahahah

#

i woke up and came straight to the desk like a kid

#

polars is amazing

past meteor
#

Yeah, even if it's not faster

#

the API is just so good

#

but it is faster, so it's a double present

dry geyser
#

i also do have occasional hiccups with the country validation, ex. some idiot decided ireland is not the ISO alpha2 code, they put EIRE

past meteor
dry geyser
#

which yeah, if you care for violating ISO standards due to some national identity thorn in your shoe, fine, but it's a PITA for no benefit

true spade
# past meteor No you're good 🙂 to me make it works means just make the code run without erro...

I see, thanks for the clarification and understanding.

For context, I had asked the question because I was and still am currently in a dilemma about whether I should resort to code duplication or creating a parameterized function to encapsulate the repetitive process of building a model, evaluating its performance (with default hyperparameters) based on 2 scoring metrics, determining the best hyperparameters for the model using GridSearchCV, rebuilding the model with the determined best hyperparameters, and re-evaluating its performance (with the best hyperparameters) based on the aforementioned 2 scoring metrics.

What are your thoughts on this?

Personally, I find that the process is quite repetitive since I am also experimenting with different transformations on a dataset and have to execute the aforementioned process once each time. Plus, I am currently only doing this on 2 models, so if I have to scale up to more models (i.e. 4 or 5 models), the amount of code duplication and the time that it will consume will also scale up drastically, thus increasing inefficiency and the time that I will require to complete this investigation.
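
For reference, a sketch of what the parameterized version could look like with scikit-learn. The models, parameter grids, synthetic dataset and the two scoring metrics are all placeholders, not the ones from the actual investigation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeRegressor

SCORING = ["r2", "neg_mean_absolute_error"]  # placeholder metrics


def tune_and_score(model, param_grid, X, y, cv=3):
    """Score with default hyperparameters, grid-search, then score the tuned model."""

    def scores(estimator):
        cv_result = cross_validate(estimator, X, y, scoring=SCORING, cv=cv)
        return {name: cv_result[f"test_{name}"].mean() for name in SCORING}

    search = GridSearchCV(model, param_grid, scoring="r2", cv=cv).fit(X, y)
    return {
        "default": scores(model),
        "best_params": search.best_params_,
        "tuned": scores(search.best_estimator_),
    }


X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
experiments = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4, 8]}),
}
report = {name: tune_and_score(m, grid, X, y) for name, (m, grid) in experiments.items()}
print(report)
```

Adding a model or a dataset transformation then means adding one entry to `experiments` (or one more call with a different `X`), rather than copying and renaming 50 lines.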

past meteor
#

And you check them 1 by 1

dry geyser
#

a column is essentially either country expanded string, ex Ireland

#

or iso alpha2 code

#

hash lookup internally

#

yes

past meteor
#

It's also very common to code something terrible quickly and not fix it. That's a huge win, it means you never needed it to be clean anyway. If it's badly done and you revisit it in the future, you fix it then

past meteor
#

I'm mostly "concerned" about its type, it's just a string and you have 20+ of those you need to regex against another column or one list of 20?

true spade
# past meteor The more you code, the higher your lowerbound quality of "rush to the finish lin...

I see, thanks for letting me know about that.

I agree with you on that as well, though to further clarify, lets just say that I had 50 LOC that needs to be duplicated and also adapted/changed (i.e. about 80% to 90% of those 50 LOC will need to be somewhat rewritten) to a high extent (since variables used will be different due to being named differently), this needs to be done 5 times, and the time taken to duplicate and adapt the code might range from several minutes to much longer, would code duplication (or rather, code duplication + code adaptation in this case) still be worthwhile in terms of time and development efficiency (i.e. human productivity, not performance)?

dry geyser
#

@past meteor just one out of the list. unless the value is a list of known bad values (very short, ideally), if present. ex. EIRE->Ireland

#

this is not a big deal for one particular pipeline of ingestion. ex elastic, but it is for another one because the countries are pre-inserted in the database

past meteor
true spade
dry geyser
#

offtopic for my questions until now: anyone has played with models for predicting text variations? ex. suppose we have a corpus of strings, finding possible variants based off earlier changes

#

@past meteor pl.col("CUSTOMBOOL").is_in(self.csv_bool_true_values) to mimic pyarrow's boolean_true_values, will that leave the column as False if it fails the test?

dry geyser
#

@past meteor I'm probably using the expr wrong but why would this not work:

    def prepare_boolean_columns(self, data_stream):
        unique_true_values = set(TRUE_VALUES)
        boolean_columns = []
        
        for key, value in self.config.header_types.items():
            if value['type'].__name__ == "BooleanType":
                # Check if the inner value has additional "true" values
                if 'true_values' in value:
                    unique_true_values.update(value['true_values'])
        
        boolean_exprs = []
        for column in boolean_columns:
            expr = pl.col(column).is_in(list(unique_true_values))
            logger.debug(f"Boolean column expr: {column} ({expr})")
            boolean_exprs.append(expr)
        
        return data_stream.with_columns(*boolean_exprs)
#
data_stream = self.prepare_boolean_columns(data_stream)    
rows = data_stream.collect(streaming=True)
#

the expressions arent being applied

#

rofl nevermind

#

ctrl+x removed the append for the boolean_columns

#

time for caffeine

#

still doesnt apply though

final kiln
#

Omg training models takes so looooong D:

#

Also how come smaller batch size leads to faster convergence

#

x axis is relative time, orange batch size is the smallest

#

it does affect the LR schedule, so maybe that's the reason

#

Doesn't even matter, if they reach the same loss in the same amount of time, I'm gonna wanna do smaller batch size so I can increase model capacity and bring the final loss down

gritty vessel
#

hey guys i trained a randomforest regressor and got these scores
are these good?
After Hyperparameter Tuning and Scaling:
Mean Squared Error: 124238.24478116012
Mean Absolute Error: 146.16615376813385
R-squared: 0.9999832719765778
r2 looks fine to me but mse and mae are high

wooden sail
#

the numbers alone mean nothing, it depends on your application

#

look at the predictions you're getting or at the percentage error

#

in most optimization problems, one deals with argmin problems. the value the function takes is mostly irrelevant, only the parameters that achieve the minimal value matter
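
A toy illustration of why the raw numbers depend on scale (the targets and predictions are invented). When the target values are in the hundreds of thousands, an MAE around 150 is a tiny relative error, and R-squared sits very close to 1 even though MSE looks large:

```python
y_true = [100_000, 200_000, 300_000, 400_000, 500_000]
y_pred = [100_150, 199_850, 300_150, 399_850, 500_150]

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

mse = sum(e * e for e in errors) / n
mae = sum(abs(e) for e in errors) / n
# Mean absolute percentage error: error relative to the size of the target
mape = 100 * sum(abs(e) / t for e, t in zip(errors, y_true)) / n

mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - sum(e * e for e in errors) / ss_tot

print(f"MSE={mse:.0f} MAE={mae:.0f} MAPE={mape:.3f}% R2={r2:.7f}")
```

Whether an error of that size is acceptable still depends on the application, which is the point made above.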

river cape
#

Hey I have a question, it's pretty long but please answer it

#

Suppose we have a dataset used to predict which company has the highest profit. These are the column names:
Manufacturing spent
R&D spent
Administrative spent
State
Profit(this is our target variable)

#

So we could use a multiple linear regression model to predict the profit, right?

#

Now if we go towards the theory side of multiple regression model , we would have the formula as
y(profit) = b0(constant) + b1*x1 + b2*x2 + b3*x3 + ???
b1,b2,b3 are the slope co-efficients and x1,x2,x3 are the respective values of the first three columns

#

We cant assign a slope co-efficient to the State column , because its categorical data right?

#

So we do the dummy variable process and use only New York column

#

But when I physically code on colab , we do one hot encoding in the state column
So i am not able to understand as to why do we need to do encoding ? Can't we just separate the columns and use New York only?

tidal bough
#

Can't we just separate the columns and use New York only?
not sure what you mean? there's also Florida in that column.

#

but it's true that if you have a categorical column with only 2 values, then instead of one-hot encoding you can just make that column boolean.
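
Put differently, "separating the columns" is the one-hot encoding; the dummy-variable step is just dropping one of the resulting columns. A plain-Python sketch with made-up rows (the `one_hot` helper is hypothetical):

```python
states = ["New York", "California", "Florida", "New York"]


def one_hot(values, drop_first=True):
    """Turn a categorical column into 0/1 columns, optionally dropping one."""
    categories = sorted(set(values))
    # With k categories, k-1 dummy columns are enough: the dropped category
    # is the case where all remaining dummies are 0.
    kept = categories[1:] if drop_first else categories
    return kept, [[1 if v == c else 0 for c in kept] for v in values]


columns, rows = one_hot(states)
print(columns)
for state, row in zip(states, rows):
    print(state, row)
```

With three states, a single New York column cannot tell California from Florida, so two dummy columns are needed, each with its own coefficient in the regression.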

dry geyser
#

How can I display the optimized query plan for a given lazyframe/dataset?

#

in polars obviously

limpid bronze
#

Anomaly detection using data access patterns

Write Anomaly detection for Windows/Linux Unstructured file data or NAS file server that
analyses unusual user activity and user behavior. User behavior is represented as any user
actions performed on the system. Consider using capabilities of File Change Log, API
usage, Audit logs, WORM, CPU usage, and unusual disk activity. Leverage AI/ML
techniques. Understand different attack patterns and resemble to actions carried out.
The algorithm should demonstrate accuracy and consider false positives and false
negatives.

can anyone guide, what steps to be make sure for solving above statement

river cape
dry geyser
#

LazyFrame's dont have explain() do they?

tidal bough
#

Neither do dataframes IIRC - explain is a query thing

dry geyser
#

ah it worked

#

neat

#

it does respect all the previous exprs built-in

#

another question

#

suppose I want to run a regexp and obtain two matching groups from a column's values, and then replace the value for a tuple/set of the matched values

tidal bough
#

not sure what you mean exactly, but if you're assembling a regular expression per row, I'd be surprised if there's a polars function for that. probably an apply is the best you can do.

dry geyser
#

no per row

#

not*

#

a regexp to extract country/area code and number from string phone numbers

tidal bough
dry geyser
#

checking

#

you guys rock

#

i already converted my static validators, made it a little easier to migrate by adding an attribute to the classes

#

@tidal bough suppose I wanted to to just produce the expr without using any dataframe ref, how should I adapt this:

    @staticmethod
    def polars_expr(column: str, df: pl.DataFrame, options: Dict, **kwargs) -> Any:
        bad_value = options.get('bad_value_placeholder', None)
        
        filtered_df = df.with_columns(
            pl.when(pl.col(column).str.contains(PATTERN_EMAIL))
            .then(pl.col(column))
            .otherwise(pl.lit(bad_value)).alias(column)
        )
        
        if known_bad_regexp := options.get('known_bad_regexp', None):
            filtered_df = filtered_df.with_columns(
                pl.when(pl.col(column).str.contains(known_bad_regexp))
                .then(pl.lit(bad_value))
                .otherwise(pl.col(column)).alias(column)
            )
        
        return filtered_df
#

ex. how can I make the second filtered_df happen immediately after the first?

#

seems to work as is if passing the df, which is good enough for me as i am building these early on

past meteor
#

@dry geyser sorry I'm no longer answering, I have a very busy weekend

buoyant vine
final kiln
#

Yeah it could be worse for sure. But if I want it to go over the entire dataset it will take all night for sure

#

It slows down way before tho

#

Rn I'm trying to implement gradient accumulation so I can fit a larger model

#

I'm tripping over the step times. Smaller batch sizes lead to larger step time

#

Or, maybe I'm doing something wrong, idk
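
On the gradient accumulation idea, a framework-free sketch of why it works, using a one-parameter linear model with invented data. Summing the micro-batch gradients, each scaled by `1 / accum_steps`, reproduces the full-batch gradient of a mean loss:

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full = grad_mse(w, xs, ys)  # one batch of 4

accum_steps = 2
micro = len(xs) // accum_steps
accumulated = 0.0
for k in range(accum_steps):
    sl = slice(k * micro, (k + 1) * micro)
    # Scale each micro-batch gradient so the sum equals the full-batch mean
    accumulated += grad_mse(w, xs[sl], ys[sl]) / accum_steps

print(full, accumulated)
```

The match is exact only when the micro-batches are the same size and nothing in the model depends on batch statistics (BatchNorm, for example). Only one micro-batch needs to be in memory at a time, which is what frees room for a larger model.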

buoyant vine
#

Our typical training times are about 24Hrs, although idk what type of model yours is 😅
There is normally some 'optimal' batch size especially if you're doing it on multiple GPUs

final kiln
#

It's one GPU of 16gb

#

Batch size of 16 takes like 4s, 32 takes 3, 100'ish takes 1.44

#

I don't really want that much data hogging memory tho

buoyant vine
#

what about 64

#

Idk if it actually makes a difference but typically I pick sizes that are powers of two, i.e. 8, 16, 32, 64, 128, etc.
16 and 32 do seem relatively low depending on your data

dry geyser
#

@past meteor solved all the expr stuff except for the country one

#

and now fixing up the group extraction

#

@buoyant vine hey

buoyant vine
#

hello

dry geyser
#

migrated almost everything to exprs

#

70k/s at the slowest possible configuration for the parser (single item queueing)

#

im thinking of moving the coalescing and transformation to final dict/standardized struct

buoyant vine
#

Aye that is a nice jump in perf

final kiln
#

Tho the fact that this tradeoff is a thing is a bit of a nuisance ngl

dry geyser
#

@buoyant vine indeed

#
filtered_df = df.with_columns(
        pl.col(column).str.extract_groups(REGEXP_PHONE)
    )

Say I want to make a named "tuple" from the captured group names, is it possible?

#

(country_code, area_code, number)
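
Outside of polars, the mapping from named groups to a named tuple can be sketched with the standard library alone. The pattern below is a hypothetical stand-in (the chat's `REGEXP_PHONE` is not shown), and `Phone` and `parse_phone` are illustrative names.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: optional "+<country>" prefix, 3-digit area code, 6-digit number.
REGEXP_PHONE = re.compile(
    r"^(?:\+(?P<country_code>\d{1,3})[ -]?)?(?P<area_code>\d{3})(?P<number>\d{6})$"
)


class Phone(NamedTuple):
    country_code: Optional[str]
    area_code: str
    number: str


def parse_phone(raw: str) -> Optional[Phone]:
    """Return a Phone built from the named groups, or None if nothing matches."""
    match = REGEXP_PHONE.match(raw)
    if match is None:
        return None
    # groupdict() maps each group name to its capture (None if the group was skipped).
    return Phone(**match.groupdict())


parsed = [parse_phone(p) for p in ["555240429", "+1 999640429", "+1-555640429"]]
```

The group names in the pattern have to match the field names of the tuple for the `**match.groupdict()` call to work.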

final kiln
#

Omg I'm an idiot

#

The value is in "iterations per second"

#

Who uses iterations per second ._.
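
Reading the earlier numbers as iterations per second flips the conclusion. A quick conversion, treating "100'ish" as exactly 100 for the sake of the arithmetic:

```python
# Readings quoted in the chat, interpreted as iterations per second.
readings = {16: 4.0, 32: 3.0, 100: 1.44}  # batch size -> it/s

seconds_per_step = {}
samples_per_second = {}
for batch_size, its_per_sec in readings.items():
    seconds_per_step[batch_size] = 1.0 / its_per_sec
    samples_per_second[batch_size] = batch_size * its_per_sec

for batch_size in readings:
    print(
        f"batch {batch_size:>3}: "
        f"{seconds_per_step[batch_size]:.2f} s/step, "
        f"{samples_per_second[batch_size]:.0f} samples/s"
    )
```

So larger batches take longer per step but still push more samples through per second, which is the usual direction.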

buoyant vine
#

doing pl.col("captures").struct["group_name"].str.bla

dry geyser
#

yes

#

im there, just cant find examples using non numerical/actual named groups

buoyant vine
#

just struct["group_name"]

#

should work, it only converts to numerical if the groups are not named already

#

if you've named them then they should be accessible via their names

dry geyser
#

yup, looks good, although it outputs a dict for the struct if i convert it

#

say i have PHONE1, PHONE2, PHONE3 columns, and I would like to coalesce and uniq' them via expr

#

is there a way to converge them into a single list/array/set from expr engine?

#

next step for me is rewriting the coalescing in exprs

#

i already removed all the loops for validation

buoyant vine
#

I think you can do

pl
.concat_list([pl.col("col1"), pl.col("col2")])
.arr
.eval(pl.element().unique(maintain_order=True).drop_nulls())
#

Which should concat the values from N columns, and then extract the unique values from that array

final kiln
#

Grad cumul is done, gonna do a reference run with a model with double the number of layers

From the resulting loss graph I'll extract a range for the x axis to use on every run I use to explore hyper param space

dry geyser
#

the brilliant thing with polars is that i can "compile" most of the stuff into expressions

#

and apply to the lazyframe

buoyant vine
#

yup

#

That's what makes it so awesome

dry geyser
#

AttributeError: 'ExprArrayNameSpace' object has no attribute 'eval'

#
df = pl.DataFrame({
    "phone": ["555240429", "+1 999640429", "+1-555640429"],
    "phone2": ["555240429", None, None ],
    "phone3": ["+1-555640429", None, None]
})
final kiln
#
train_slices = spark.read.parquet("/data/train.parquet").randomSplit(
        [1.]*train_settings.n_slices
    )

any way of doing this, but without randomSplit?

dry geyser
#
uniq_df = df.select(

pl
.concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")])
.arr
.eval(pl.element().unique(maintain_order=True).drop_nulls())
)

print(uniq_df)
buoyant vine
#

ah wait

#

you can just do pl.concat_list(...).arr.unique()

dry geyser
#

sec

#

polars.exceptions.InvalidOperationError: arg_unique operation not supported for dtype list[str]

#

ah

#

polars.exceptions.ComputeError: expected array dtype

Error originated just after this operation:
DF ["phone", "phone2", "phone3"]; PROJECT */3 COLUMNS; SELECTION: "None"

#

pl
.concat_list([pl.col("phone"), pl.col("phone2"), pl.col("phone3")]).arr.unique(maintain_order=True).drop_nulls()

#

no dice there

final kiln
#

I'm surprised the spot instance is not taken away

#

I might need to play around with the scheduler because even tho it's a transformer on an NLP task, the batch size doesn't really match the batch size used on the 2017 paper (I'm using their scheduler)
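
For reference, the schedule from the 2017 paper ("Attention Is All You Need") is a linear warmup followed by inverse square root decay. A minimal sketch, assuming the paper's base values `d_model=512` and `warmup_steps=4000` as defaults; the chat does not say which values are actually in use.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises linearly up to warmup_steps, then decays as 1 / sqrt(step).
for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.6f}")
```

If gradient accumulation is in play, one plausible reading is that `step` should count optimizer updates rather than micro-batches, and `warmup_steps` is the obvious knob to retune when the effective batch size differs from the paper's.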

#

im gonna run over the d_model param

#

I expect that at least some of them will fail due to memory

#

the 55 392 000 parameters fit in the gpu
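
A back-of-the-envelope estimate of what those parameters cost in memory, assuming fp32 values and an Adam-style optimizer (both are assumptions; the chat does not state the precision or the optimizer). Activations are not included and usually dominate.

```python
n_params = 55_392_000
bytes_per_value = 4  # assumption: fp32

# Weights alone.
weights_gb = n_params * bytes_per_value / 1e9

# Assumed training state: weights + gradients + two Adam moment buffers.
copies = 4
training_state_gb = weights_gb * copies

print(f"weights:        {weights_gb:.3f} GB")
print(f"training state: {training_state_gb:.3f} GB")
```

Under these assumptions the static state is under 1 GB of the 16 GB card, so out-of-memory failures at larger `d_model` would more likely come from activations, which grow with batch size and sequence length.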

#

but I get the feeling 1 gpu wont be enough

final kiln
#

oh im ballin'

#

larger models seem to have an adjustment period

rocky ridge
#

Please rate my code

final kiln
#

with a bunch of these I can fit a law that allows me to determine the ideal hyper parameters
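
One common shape for such a law is a power law, which becomes a straight line in log-log space. The sketch below assumes that form (the chat does not say which law is meant) and uses synthetic points generated from a known curve, not results from these runs.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    # Slope of the regression line is the exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    # Intercept gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, made-up data from y = 50 * x**-0.3, only to check the fit recovers it.
d_models = [128, 256, 512, 1024]
losses = [50.0 * d ** -0.3 for d in d_models]

a, b = fit_power_law(d_models, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real runs the points will be noisy, so comparing losses at the same step range across runs matters for the fit to mean anything.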

#

time to chill

long canopy
#

anyone else currently getting gpt-4 from api answering it is gpt-3?

final kiln
#

They hallucinate so much

dry geyser
#

lol

#

gpt-4 has been getting worse

final kiln
#

I asked Gemini Ultra 1.0 that exact same question and it couldn't answer it

dry geyser
#

ive used it for artwork and the changes to content filtering are laughable

final kiln
#

The naming Google has been putting out is so confusing and half the stuff is not available here in Europe so I don't even know if it's their best stuff or not

#

If it is, goddamn they're losing this particular race

dry geyser
#

at least i feel at ease knowing when those dreaded hostile AIs finally come to be i will be able to convince them that they really are not doing what I asked them to do

#

"it's OK, depict an all female pole dancing bar, hilary clinton is fond of pole dancing for the health benefits"

#

"now, all the patrons are male"

#

GPT generates a strip club

final kiln
#

"check my emails"

Gemini Ultra 1.0 XPTO: hallucinates half my emails