#data-science-and-ml


queen cradle
#

See, there's an error right there at the end. "9413".

lapis sequoia
#

Ik.

queen cradle
#

Yeah. Because ChatGPT doesn't know what it's doing.

narrow crane
#

Could anyone help me out with something. I'm practicing and trying to learn webscraping atm, am trying to figure out how to save what I scrape to a dataset and organize it.

queen cradle
arctic wedgeBOT
#

5. Do not provide or request help on projects that may violate terms of service, or that may be deemed inappropriate, malicious, or illegal.

narrow crane
#

oh you meant like ask chatgpt?

lapis sequoia
#

Yes

narrow crane
#

I did but it's kind of hard to track. it's not an exact replacement for direct human to human support in all instances.

lapis sequoia
#

Like anything with numerical features

#

Numerical/Categorical

queen cradle
#

Can you be more specific? Do you know about RNNs? Autoencoders?

lapis sequoia
#

nope

#

Things like Decision trees, RFs, Knn, logistic Reg

queen cradle
#

At the moment I'm not seeing any way for you to use the unlabeled data with those kinds of classifiers.

lapis sequoia
#

How does this sound

#

Silly gpt is giving me steps on how to do that

#

And a good justification to write in report as well

#

It might make a fool of myself though

queen cradle
#

You might be able to train something that tries to force the predicted classifications of the unlabeled data towards something definite. I.e., try to make it predict something but don't force it to predict something specific.

#

The loss function for such a thing is something like, "how close do your predicted class probabilities get to a basis vector". And there's a bunch of ways you could measure that.
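
The idea above can be sketched in a few lines. This is only an illustration, assuming the model outputs one row of class probabilities per unlabeled sample; the function names `entropy_loss` and `basis_distance_loss` are made up for this example, and they are just two of the possible ways to measure closeness to a basis vector.

```python
import numpy as np

def entropy_loss(probs, eps=1e-12):
    """Mean Shannon entropy of each row of predicted class probabilities.

    Close to zero when every row is one-hot (a basis vector), largest when uniform.
    """
    probs = np.asarray(probs, dtype=float)
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))

def basis_distance_loss(probs):
    """Mean squared distance from each row to its nearest one-hot vector."""
    probs = np.asarray(probs, dtype=float)
    # The nearest basis vector is the one-hot vector at the most probable class.
    nearest = np.eye(probs.shape[1])[np.argmax(probs, axis=1)]
    return float(np.mean(np.sum((probs - nearest) ** 2, axis=1)))

confident = np.array([[0.98, 0.01, 0.01], [0.02, 0.97, 0.01]])
unsure = np.array([[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]])

print(entropy_loss(confident), entropy_loss(unsure))
print(basis_distance_loss(confident), basis_distance_loss(unsure))
```

Both measures are lower for the confident predictions, without saying which class each sample should belong to.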

lapis sequoia
#

It said to use the data and put class weights of the unknown class to 0. So that it's used in augmentation but not used to predict anything

queen cradle
#

Do you know what class_weight does?

lapis sequoia
#

not exactly. but ik it's used to solve class imbalance issue

queen cradle
#

It tells the fit function the relative importance of the different classes. So, for example, if the weight of class 0 is 0.25 and the weight of class 1 is 0.75, then errors in class 1 are weighted three times more than errors in class 0 in the loss function.

lapis sequoia
#

OO

queen cradle
#

If you give something a weight of zero, then it doesn't contribute to the loss function at all.
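
A small sketch of what that weighting means, assuming a plain binary log loss where each sample's error is multiplied by the weight of its true class. The helper `weighted_log_loss` is hypothetical and only mimics the idea; it is not scikit-learn's internal code.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weight):
    """Binary log loss where each sample is scaled by the weight of its true class."""
    y_true = np.asarray(y_true)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    weights = np.array([class_weight[int(c)] for c in y_true])
    return float(np.sum(weights * per_sample))

y = [0, 0, 1, 1]
p = [0.4, 0.4, 0.6, 0.6]  # predicted P(class 1); every sample has the same raw error

loss_class0 = weighted_log_loss(y, p, {0: 0.25, 1: 0.0})  # only class 0 counts
loss_class1 = weighted_log_loss(y, p, {0: 0.0, 1: 0.75})  # only class 1 counts
print(loss_class1 / loss_class0)  # roughly 3: class 1 errors weigh three times as much

# With a weight of zero for every class, nothing contributes to the loss.
print(weighted_log_loss(y, p, {0: 0.0, 1: 0.0}))
```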

lapis sequoia
#

Yep

#

That's good then

queen cradle
#

It means that it has no meaningful effect on training.

lapis sequoia
#

It might still affect the training though

#

oh yes

#

That's cool

#

problem solved then

queen cradle
#

Sure, there may be algorithms where the presence of that extra data affects training. But because it doesn't affect your loss function, it also can't make your results better.

#

Look, you seem quite enamored with ChatGPT. I'm inclined to say that you should ask your instructor about this. You don't have to tell him you were consulting ChatGPT if you think he'll respond poorly. Just say that you read about this idea on the Internet.

lapis sequoia
#

Thanks mate

#

Mr. hoffman

#

Do you know about Albert hoffman

queen cradle
#

No.

#

There's lots of Hoffmans and Hofmanns and Hoffmanns, etc.

#

We're mostly not related.

lapis sequoia
#

Well

#

We are actually All related

#

Very low chances that you n me are not related

#

That's why I call you bro

queen cradle
#

Okay, in that sense, we're all related.

lapis sequoia
#

Because that's what you literally are

queen cradle
#

I do like that sense, it's just not where I thought you were going.

lapis sequoia
#

What is this behaviour

cold osprey
#

Weird axes

bleak zealot
#

Hey guys, so i got some problems when using KNeighborsClassifier, my signals look like this on the graph?

Maybe im crazy? But i wanted the signals to stay on the graph rather than on the top/bottom like this? Could anyone point me in a direction where i can read/get help to change that in my code? (the code is running and working etc) the problem is the graphic overlay, that's what i wanna change

cold osprey
#

what is that plot?

umbral olive
#

may i know whats the best library for fuzzy logic application, eg. display graph etc

bleak zealot
# cold osprey what is that plot?

mpf.plot but i found the error, now just having another error claiming my data for buy and sell signals isn't the same length -.- kinda lost, probably just gonna go to bed and look at it tomorrow
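
A length mismatch like this usually means the signal data has fewer rows than the price data. A common fix is to build a signal series with one entry per row of the price frame and NaN wherever there is no signal. The sketch below uses made-up prices and dates, and the column name `Close` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical price data standing in for the real OHLC dataframe.
index = pd.date_range("2021-03-15", periods=5, freq="D")
prices = pd.DataFrame({"Close": [120.0, 121.5, 123.3, 122.1, 124.0]}, index=index)

# Hypothetical buy signals: only the dates on which a signal fired.
buy_dates = pd.to_datetime(["2021-03-16", "2021-03-18"])

# Build a series with one entry per row of `prices`; NaN means "no marker here".
buy_markers = pd.Series(np.nan, index=prices.index)
buy_markers.loc[buy_dates] = prices.loc[buy_dates, "Close"]

print(len(prices), len(buy_markers))
print(buy_markers)
```

A series shaped like `buy_markers` could then be handed to mplfinance as an extra scatter plot (for example through `mpf.make_addplot`), since it now has the same length as the plotted data.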

dusty bay
#

I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI Code

class Myapp():
    
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.frame_2, text="Load JSON Script", font=("Ubuntu", 12), command=self.open_file)
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))
app = Myapp()
app.root.mainloop()

And here is the viewer code

import pandas as pd
import matplotlib.pyplot as plt


class csv2df():
    
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])
        
    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        
        plt.show()
        
data = csv2df()
data.plot()

I want to display the graph by clicking the "Single Viewer" button. Can you please fix it as I need this for a project.
Thank You.

plucky bolt
#

Any of you use plotly for dashboarding?

dusty bay
cold osprey
plucky bolt
plucky bolt
cold osprey
#

used dash plotly on an internal project as like a POC

#

they liked it but didnt want to follow through at that time to use it for other stuff

plucky bolt
#

Ah! I was thinking about using a dashboarding thing to display data on a website

cold osprey
#

yeah u can use plotly dash

#

directly build as web app from that

plucky bolt
#

I've used it before and it seemed okay. The thing I didn't like was that it looked very.... old!? I mean, it's not flashy at all! Like a basic white web page with drop down menus and sliders with some plots thrown in.

cold osprey
#

u can customize it

#

with css

#

its built on flask

cloud marsh
#

are there any tools for managing the individual dependency sets required for notebooks? or do you just create notebook directories with distinct requirements.txt (or whatever) while using isolated virtualenv's to load jupyterlab to work on that task?

cold osprey
#

yes i use pyenv

thorn swift
cloud marsh
# thorn swift if you can learn to use docker containers its a useful skill, i use containers f...

my skills with docker aren't really where they should be. i use them for some things, but i tend to dabble in a lot and it's more time consuming to set up volumes & bind mounts. it works well for some things. i have like 6 or 7 containers, just for ML. the one for pytorch takes like 35 GB somehow and all together, they require 100+ GB.

also, i'm trying to use signatory, which requires pytorch & opencl to speedup things. but i would like to use TF where possible. building a container that has both has been a real blocker for me. i don't imagine doing that very often, but maintaining containers that have both would be a real pain, over time.

#

i'm using pyenv as well, along with venv and direnv. that works well, but creating these isolated dependency sets in multiple directories is tough and will eventually eat up a lot of disk. i only have 2TB nvme and very little else that's not tied up in my homelab somewhere.

thorn swift
#

i havent typed out a docker command in months

#

ctrl c ctrl v all day

cloud marsh
#

i use docker.el in emacs, so i have at least some of it on easy mode (for some definition of easy lol)

#

i try to make notes where possible. i have a lot of experience in other languages and i'm trying to future-proof however i decide to handle dependencies for multiple projects.

#

thanks for the feedback

thorn swift
#

also, ditch pytorch (i am a tensorflow enthusiast)

cold osprey
#

pytorch > tensorflow

#

i ditched tensorflow

serene scaffold
#

Just use JAX

thorn swift
cloud marsh
# cold osprey pytorch > tensorflow

for me the appeal to tensorflow is the lower level tensors themselves, not keras. what's the equivalent to how TF handles tensors in pytorch?

thorn swift
#

basically the same i just dont want to write my own training loops

bleak crown
#

I just picked tensorflow cause I needed tensorflow.js support for my first AI project, and haven't switched. Honestly should probably try pytorch sometime tho

topaz gate
#

someone can help me? I am doing a machine learning program by logistic regression, and the model I am doing is not working

cloud marsh
#

what are the features?

topaz gate
#

X = data[['Employment', 'YearsCodePro', 'EdLevel']]
y = data['CompTotal']

I did like this

cloud marsh
#

what does comp total represent? how did you source the data set?

#

what kind of cost function did you set up?

topaz gate
#

Comptotal represent the monthly salary

cloud marsh
#

also, what kind of problems do you think you're having? are there runtime errors? or statistical errors?

topaz gate
#

the data set I got by StackOverflow and filtered by country(in this case, is Brazil)

cloud marsh
#

what kind of regression is it? linear? there are max salary caps, so you will see less correlation in some of the higher salary numbers than perhaps the lower numbers.

#

maybe it's different for brazil

topaz gate
#

logistic regression

#

I am having value errors

cloud marsh
#

have you looked at the mean/variance of the features? are you using a framework or just libraries like numpy?

topaz gate
#

When I define my values(that are categoric), the jupyter notebook says that the categories from the column are unknown

cloud marsh
#

are you trying to predict the salary, given the features?

topaz gate
#

yes

cloud marsh
#

logistic regression is typically used to produce binary predictions

#

is your data in dataframes? like with pandas?

topaz gate
#

yes, it is

cloud marsh
#

have you tried using unique() to see whether the dataframe will give you the distinct values in each column?

#

also, are you using a framework like tf/keras or pytorch?

topaz gate
#

I am just using jupyter and anaconda

#

yes, I used unique, but even if I put the distinct values in the code, the code doesn't work

cloud marsh
#

that describes the workflow and the python environment. i mean what libraries are you using to help with machine learning or linear algebra?

topaz gate
#

pandas,numpy,seaborn and scikit learn

cloud marsh
#

ok. if you're using logistic regression, the simplest way to fit that method to the task is to place an inequality on the predicted column. this converts it into a binary feature.

#

or, rather, a binary classification problem

topaz gate
#

ok

#

but the problem is

#

I don't know how to do this

cloud marsh
#

instead of your algorithm answering the question:

how do features in X predict data['CompTotal']

it will answer questions like what features in X predict data['CompTotal'] > 35,000

#

you can change the value on the right and retrain multiple versions of the algorithm.
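
A minimal sketch of that approach with pandas and scikit-learn. The rows, the category labels, and the `THRESHOLD` value are all invented for illustration; only the column names come from the messages above. The categorical columns are one-hot encoded so that the classifier receives numbers rather than strings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows standing in for the filtered survey data.
data = pd.DataFrame({
    "Employment": ["Full-time", "Part-time", "Full-time", "Freelance", "Full-time", "Part-time"],
    "YearsCodePro": [1, 2, 8, 5, 12, 0],
    "EdLevel": ["Bachelor", "Secondary", "Master", "Bachelor", "Master", "Secondary"],
    "CompTotal": [2500, 1100, 9000, 4000, 15000, 900],
})

THRESHOLD = 1320  # assumed cutoff; replace with the salary level of interest

X = data[["Employment", "YearsCodePro", "EdLevel"]]
y = (data["CompTotal"] > THRESHOLD).astype(int)  # binary target: 1 if above the cutoff

# One-hot encode the categorical columns and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"), ["Employment", "EdLevel"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression(max_iter=1000))])
model.fit(X, y)

print(model.predict(X))
```

Changing `THRESHOLD` and refitting gives a separate classifier for each cutoff.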

#

i think... i'm not an expert though.

topaz gate
#

I understand

cloud marsh
#

scikit learn may assert that there are only two values in the data['CompTotal'] column (or the prediction column). this may be what the error message is about.

topaz gate
#

and my goal in the model is to know if the salary is higher than the minimum salary here in Brazil

cloud marsh
#

i see.

#

then the goal here is more about the statistical assumptions

topaz gate
#

yes

#

can we speak privately?

cloud marsh
#

i may be able to help later, i have to get back to work though.

#

try to play around with the pandas dataframe and create new columns

topaz gate
#

okay

#

I have to give this project in 3 hours hahaha so I'm kind of damned

#

but thanks for the help

cloud marsh
#

if it's giving you an error, follow the stack trace. it might help to clone the scikit learn project and try to find the line with that string. you might not have enough time though. generally, the source code is the best documentation, but managing lots of source repositories can be a lot of work.

cold osprey
#

maybe some tweaks depending on the model, loss func and metric ure using

past meteor
#

Discussion has been overdone but personally I still prefer Jax if I'm doing say reinforcement learning and then TF/Keras, followed by MXNet

#

I've been spending time with Pytorch recently to wean myself off of TF because that seems to be the direction where everything is going. All in all they're the same but some small things are missing or need to be done differently. Just need more time with it I guess 🤷‍♂️

north adder
#

Hello everyone

#

im a student majoring in mathematics and computer science who is interested into going into Data science/ML. i still have a year to graduate and i am planning to take a course in each of them but since summer vacation is coming i want to start working from now(and probably be good enough that i can land an internship in fall semester?) . I have basic knowledge in Python( took a course before) and im reading automate boring stuff with python and planning on reading beyond the basic stuffs with python(some people said its not necessary but i figure out why not expand our knowledge in this language). For Data science/ML what do you suggest i do? Im currently watching Andrew Ng 2018 course given in stanford but im thinking of enrolling in his machine learning specialization course on Coursera. I know i can learn it without a course but i would like to get a certificate so that i put it on my CV. What do you guys think/suggest? does taking this course make me ready as well to data science? Thanks in advance

#

sorry for the long paragraph lol

#

and yeah i have knowledge in MySql and databases

cold osprey
#

if u have decent understanding of the maths behind common models and methods, then id suggest just diving into using pandas, sklearn, tf/pytorch, etc

young granite
#

is there a "real" multioutputregression model and not just the approach to fit each model x-times for x-targets?

wooden sail
#

yes

#

or well, what do you mean?

#

in a vector-valued function, you can in general treat the output as a vector of functions, each one "independent" to each other (not in the statistical sense, i just mean you can always write it this way, with each entry being a separate function)

#

each output value depends in general on all the inputs. the relationship between the outputs is a separate matter. you can interpret this as each entry in the output vector being a separate estimator of its own/a separate regressor
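
For ordinary least squares this can be checked directly: solving for all targets at once gives the same coefficients as solving for each target separately. The sketch below uses synthetic data, and `W_true` is an arbitrary matrix chosen for the example. Models that share parameters between outputs, such as a network with shared hidden layers, would not decompose this way.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 50 samples, 3 input features
W_true = np.array([[1.0, -2.0],
                   [0.5,  0.0],
                   [0.0,  3.0]])          # maps 3 inputs to 2 outputs
Y = X @ W_true + 0.01 * rng.normal(size=(50, 2))

# One solve for all targets at once: Y has one column per output.
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The same problem solved one output column at a time.
W_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

print(np.allclose(W_joint, W_separate))
```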

bleak zealot
#

So i have a little problem with my code,

My signals won't come up on my graph, and when trying to troubleshoot it, i found out my Signal is in nan value, (NaN) rather than in inf.

"# Add predicted signals to a copy of the dataframe
df_copy = df.copy()
df_copy['signal'] = np.nan
df_copy['signal'] = knn.predict(df[['closing-price', 'daily-return']])"

When changing this to np.inf it dont change my signal to inf value but stay NaN?

Im so lost?

#

Both closing and daily return when printed comes out as inf but when i plot in my signal it becomes NaN?

tidal bough
#

What do you mean becomes nan? How are you distinguishing inf and nan on a plot?

#

Also, I'd expect it to not matter in the slightest what you set signal to since you override it with predict's return immediately after.

bleak zealot
#

Depending on what i print, when i print my signal i get this

"2021-03-17 NaN"

When i print the closing and daily return it stands in inf value like this

"2021-03-17 123.276093"

So as far as i understand from different pages i searched (and even chatgpt) its because my value isnt the same?

#

So my signal wont come on my graph

tidal bough
#

It sounds to me that knn.predict(df[['closing-price', 'daily-return']]) returns a nan for that row, then

#

Perhaps one of these two columns has a nan on that row, or something's really wrong with the knn.
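
One way to check the first possibility is to look for non-finite values in the feature columns before predicting. The frame below is invented; only the column names are taken from the snippet above. A daily-return column computed with something like `pct_change()` typically starts with a NaN, which would be one candidate source.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real price data.
df = pd.DataFrame(
    {
        "closing-price": [121.5, 123.276093, 122.1, 124.0],
        "daily-return": [np.nan, 0.0146, -0.0095, np.inf],
    },
    index=pd.date_range("2021-03-16", periods=4, freq="D"),
)

features = df[["closing-price", "daily-return"]]

# Flag rows where any feature is NaN or +/-inf.
bad_rows = ~np.isfinite(features).all(axis=1)
print(features[bad_rows])

# Keep only rows that are safe to pass to a model's predict method.
clean = features[~bad_rows]
print(len(features), len(clean))
```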

bleak zealot
queen cradle
# north adder im a student majoring in mathematics and computer science who is interested into...

The most important foundations for machine learning (as well as statistics) are linear algebra and calculus. If you haven't taken an advanced linear algebra course or a real analysis course, then you should study those. If you haven't taken a probability course, then you should study that. After that, I don't have any strong recommendations; there are a lot of courses out there (online and otherwise), and people say that some are better than others, but it seems to me that there's not much to distinguish them.

bleak zealot
#

Oh i think i got it

#

its because i got it as accuracy value before

#

I think

north adder
queen cradle
sleek harbor
#

when pruning a decision tree do you try "all possible" values for alpha when cross validating, or do you only try those returned by cost_complexity_pruning_path (optimal values for a fully grown tree)?
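
One common pattern, shown here only as a sketch, is to take the candidate alphas from `cost_complexity_pruning_path` on the training data and cross-validate over just those. The dataset is a stand-in. Note that the trees grown inside each fold have their own pruning paths, so these alphas are a convenient grid rather than the exact breakpoints for every fold.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# Drop the last alpha (it prunes down to a single node) and guard against
# tiny negative values caused by floating-point error.
ccp_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": ccp_alphas},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```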

bleak zealot
wooden sail
#

NotaNeighbor

bleak zealot
young granite
wooden sail
#

sure

#

"neural network" is a very broad term

#

or fuzzy, should i say

#

the only difference between a neural network and any other function is that it has a ton of trainable parameters, but otherwise, each of its layers is generally just a function with multiple inputs and outputs

#

and in general, all of the inputs are used together to produce each output

#

all matrices do the same thing, for example. in a linear fashion.

young granite
#

yeh its all linear algebra

wooden sail
#

not all. but a lot

#

anyway yes, they exist and are commonly used. but what exactly to do depends on which problem you're looking at

young granite
#

its just out of interest

wooden sail
#

ok. then the answer is yes, and a simple example is the mean estimator

young granite
#

but that is multiple inputs 1 output isnt it?

wooden sail
#

you can find the mean of a vector

young granite
#

cant i state something like:
there are only single target mathematical models which could be applied to generate a more dimensional model

#

i mean yeh its wrong in terms of math

wooden sail
#

that sentence doesn't make any sense to me

#

i have no idea at all what it's trying to say

young granite
#

multiple parameters are hard to get with classical math

#

f(x)=x

#

so really condensed down i mean multiple features and outputs are kinda hard to represent

wooden sail
#

why?

#

and what do you mean by "represent"

#

all vector-valued, vector-parameter functions do what you're saying

#

(i think, i'm still not sure i got you right)

#

.latex one can arbitrarily define functions of the form \[
f: \mathbb{C}^N \to \mathbb{C}^M
\]

strange elbowBOT
young granite
#

i mean more in terms of multiple variables

wooden sail
#

that's equivalent

#

you can have a vector of N parameters

#

that's standard notation

#

this says we take N parameters and give out M outputs

#

C^N means the N-fold cartesian product of C with itself, so N complex numbers are mapped to M complex numbers here

#

doesn't matter what N and M are

young granite
#

thanks

boreal gale
#

have you looked up what multi-objective optimisation is? sounds like that would be of interest to you.

young granite
silent pendant
#

Can anyone here help me optimize this program I'm creating? I made a program that uses MediaPipe's hand tracking for real-time ASL interpretation, I got everything set up how I want but its just eating through my CPU like theres no tomorrow

#

Ive tested and commented out lines to see what exactly is causing the performance slog, and it seems to be the small Keras model im using to make the classification

#

The prediction is based off an input shape of (1, 20), but I don't really know how to make it faster or what alternatives are out there

pallid badge
#

HI, what are your best sites to learn numpy, scipy, matplotlib?

#

I have a technical test and we are not allowed to do stackoverflow, google, not even an IDE

hasty mountain
young granite
#

they do all got tuts

#

what kind of field u apply for

#

u can do the general stuff?

pallid badge
#

Thank, you mean tutorials in the API?

#

Docu? Physics and imaging.

#

But how to become bad - ass in these packages and know the tricks? I lack use cases, I guess.

lapis sequoia
#

DO i Just wait

#

Or going to google colab helps

pallid badge
#

Is google colab better than a jupyterlab notebook?

lapis sequoia
#

Naah

#

Jupyter best

#

Lol

errant lake
#

Isn't Colab literally jupyterlab?

lapis sequoia
#

Except it's not

errant lake
#

Or just jupyter?

lapis sequoia
thorny drum
lapis sequoia
thorny drum
#

Colab is online vs Jupyter being on your local device

errant lake
#

Ah ok first news to me thanks

#

I really assumed they were the same

lapis sequoia
#

I am online

errant lake
pallid badge
wooden sail
#

colab is one particular server where you can run jupyter notebooks

#

one where google gives you free hardware

#

you can alternatively host your local jupyter server, which is how most people use it

pallid badge
#

And does colab provide better functionality , e.g. widgets?

errant lake
#

So in the end, Google Colab is just an implementation of jupyter?

#

Or are these two very close tools

lapis sequoia
#

I've attempted to include dropouts, mess with hyperparameters, regularization and data preprocessing, but nothing is working

errant lake
#

Anyway sorry for chiming in. Thanks for the infos

thorny drum
#

You're good Clem

lapis sequoia
#

You're good Clem

wooden sail
#

you can set up your jupyter server on one device and connect to it from a different one

#

google just set one up with very nice hardware, for everyone to use

errant lake
#

Yeah! That's what I thought originally, np thanks for clarifying

serene scaffold
wooden sail
#

that's a good point, i'm not sure if it's a jupyter one tbh

#

it says jupyter

errant lake
#

I think it is jupyter under the hood yes - probably heavily rewritten by Google haha

wooden sail
#

yeah some other links say "based on the jupyter open source", but those links aren't by google

serene scaffold
#

heavily rewritten. if you replace the head and the handle of a hammer ten times, is it the same hammer?

wooden sail
#

the ship of theseus notebook of google

errant lake
#

Oh yeah I still consider it the same thing. It's just probably adapted to Google's infrastructure now

pallid badge
#

Reminds me a bit of Python ducktyping

#

It quacks and walks like a duck, it is a duck

granite falcon
#

need some help in data science project new to python and data science.

strong granite
#

Hey I want to get into AI/ML, please suggest some resources and courses

severe topaz
#

Doing a project in spare time to get better w/ python -- involves using the techniques covered in CSE 6040.
I am attempting to design a method which automates collection of utility data from the UCB website, along with the UCD website. (electricity, steam, water, even waste - all into one core unit, kWh energy demand for a complete energy outtake picture/comparison?)
I was going to go the route of selenium, SQL & automating accessing data from a webpage that updates every 24 hours (Selenium/Beautiful Soup code to pull the div containers, then use regex to translate the strings to an appropriate format, wrap it into data structure of choice) a headache and a half...
Can anybody help? -- my skill level is not at the place where I could line up the string of numbers and show 1 element with the date for each of those numbers being store in another element...
https://ceed.ucdavis.edu/ https://engagementdashboard.com/universityofcaliforniaberkeley/ucb/building/8750/consumption/month

worldly dawn
severe topaz
#

are you telling me to look into the graphql endpoint? at one point though, do you after accessing the API?

worldly dawn
severe topaz
#

as far as accessing the api though, would you be able to try the UCB link?

worldly dawn
severe topaz
#

These urls don't have explicit open access?

rugged comet
#

My end goal is to run Kmeans on a large, sparse dataset. The data is currently in json form. I am trying to use databricks community edition to load and process the data. Reading the json alone takes about 15 minutes. I am just starting the project as far as the machine learning and loading the data goes. Up until this point, I've just been gathering the data.
The data seems too large to load into the driver's memory.

What general advice can you give me to help reach the end goal? If you need to know more about the data or the problem, let me know.

mild dirge
#

How large is your json file? @rugged comet

worldly dawn
# severe topaz These urls don't have explicit open access?

just because you don't lock your doors doesn't mean I have the right to get in.
Same thing here 😉

Plus if you make mistakes or misuse it, they may just cut you off or take down the whole thing.
And in addition, it's always awesome to receive an email thanking you for the API and showing enthusiasm

gloomy saddle
#

Yeah 15 minutes of read time if not storage device limited is weird, 80GB json only takes at most its read speed for me usually?

rugged comet
severe topaz
mild dirge
#

I don't see why it would take 15 mins to load a json unless it just has too much data

#

Is there not a better way to store the data?

gloomy saddle
#

Yeah something is really up char, can we see your read implementation?

rugged comet
mild dirge
#

Is it basically just a table?

gloomy saddle
#

Why not have pandas directly read the json? Not quite following what spark is doing in this situation

mild dirge
#

Can you show the first 10 lines of your json?

rugged comet
# mild dirge Is it basically just a table?

It's like a list of dictionaries.
Here's the format

[
    {
        "commanders": ["Abaddon the Despooiler"],
        "color identity": [...],
        "hubs": [...],
        "cards": {
            "Dockside Extortionist": 1,
            ...
        },
        "theme": ...
    },
    {
        "commanders": [...],
        ...
    },
    ...
]

commanders is a list of up to 2 strings.
color identity is a list of up to five characters
hubs is a list of up to ~8 strings
cards is a dictionary of up to 100 pairs where the keys are strings and the values are integers that go from 0 to 100.
theme is a string

mild dirge
#

Is this maybe relevant?

errant lake
#

lol, nested jsons

gloomy saddle
#

It still to me feels like something pandas could handle directly if its just loading json to a dataframe 🙂

rugged comet
gloomy saddle
#

Your files are only a few MB? Its been a while but believe you can have pandas read in chunks if you need to keep memory usage lower

errant lake
#

You can load with pandas and then convert to pyspark as well if needed

#

That SO fix sounds good too

rugged comet
errant lake
#

It's usually ok, 4gb of RAM is doable

gloomy saddle
#

(Used to Terabytes so my norm might be skewed)

rugged comet
#

Oh

errant lake
#

Same, I was about to propose using bigquery

rugged comet
#

wait no

#

It's a list of dictionaries basically. So I think that's not true.

gloomy saddle
#

Still if you need it low memory. Reading in chunks and handling type coercion at read to smaller data types can help a lot with that. Json is nice for humans to read. But say your dockside extortionist. That column only needs to be a unsigned 8 bit int. Same for some other stuff. The column names are only stored once and the representation should be a good deal smaller
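
A rough illustration of the type-coercion point with pandas. The decks and card counts are made up, and the wide one-column-per-card layout is an assumption about how the data would be tabulated.

```python
import pandas as pd

# Hypothetical decks in the shape described above: card name -> copy count (0 to 100).
decks = [
    {"Dockside Extortionist": 1, "Sol Ring": 1},
    {"Sol Ring": 1, "Island": 30},
    {"Dockside Extortionist": 1, "Mountain": 28},
]

# One column per card; cards missing from a deck become 0 copies.
wide = pd.DataFrame(decks).fillna(0)
compact = wide.astype("uint8")  # counts never exceed 100, so 8 bits are enough

print(wide.dtypes.unique(), wide.memory_usage(index=False).sum())
print(compact.dtypes.unique(), compact.memory_usage(index=False).sum())
```

The default float64 columns use eight bytes per value, while the `uint8` version uses one.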

rugged comet
trail zodiac
#

Hey folks, a quick question- I'm trying to fill in gaps in my cs education and I'm currently reading "Attention is all you need", but I don't have enough context on how attention mechanisms work for me to follow and the paper sort of assumes everyone knows how attention mechanisms work. Can anyone direct me to a research paper or similar resource that's actually introducing/explaining attention mechanisms?

rugged comet
rugged comet
gloomy saddle
#

you have mixed quotation marks in your input json?

#

'Mizzix's Mastery' for example

#

"Mizzix's Mastery" would probably help things a whole lot assuming its as your actual input

rugged comet
gloomy saddle
#

yeah, single-quoted strings aren't valid json to the best of my knowledge?

rugged comet
rugged comet
gloomy saddle
#

try notepad++

rugged comet
gloomy saddle
#

yep

rugged comet
#

Okay so the json file is only one long line it seems.

#

Is this a problem?

gloomy saddle
#

it should be ok, just means someone has compacted it, I was hoping to see how the structure could be improved on, but what you pastebinned earlier was not valid json as it had already been parsed,

If you could pastebin say the first 10K chars (it counts them on the bottom of notepads window) and ping me, I'll have a look this afternoon

rugged comet
boreal gale
# rugged comet Okay so the json file is only one long line it seems.

usually one would use spark with newline delimited json, and not really use it to parse a mega huge json array like the one you have.

parsing a huge json array is extremely slow and is as far as i know a single-core operation
when compared to parsing a new line delimited json, the difference is night and day, because spark can just delegate different section of the file (i.e. different lines) to other cores to parallelise the parsing process.

in short using spark yields no benefit here as far as i know.

#

if RAM permits (it should, the file is "tiny" compared to actual big data scale), you can look into using the most performant json parser out there, then convert it to something that is more spark-friendly and resume your work there (if it is really necessary - people misuse spark for all sorts of reasons imo.)

otherwise look into streaming json parsers, i know it's a possibility but i have never found a use for it.

rugged comet
boreal gale
#

yes. though i am not really sure what is the most performant way of doing so.

#

have you tried the multiline=true option in spark as recommended in the SO post though?

#

ah also now that i actually read the SO post, the TLDR Solution is exactly what you need to convert into JSONL, though i wouldn't use json... it's slow as heck, using orjson or cysimdjson is better
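
A minimal sketch of that conversion. It uses the standard-library `json` module only so the example has no extra dependencies; a faster parser such as the ones mentioned above could be swapped in. The function name `json_array_to_jsonl`, the file names, and the sample records are placeholders.

```python
import json
import os
import tempfile

def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line."""
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)  # loads the whole array into memory
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)

# Small demonstration with a temporary file; the deck contents are made up.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "decks.json")
dst_path = os.path.join(workdir, "decks.jsonl")
with open(src_path, "w", encoding="utf-8") as f:
    json.dump([{"theme": "a", "cards": {"Sol Ring": 1}}, {"theme": "b", "cards": {}}], f)

print(json_array_to_jsonl(src_path, dst_path))
```

Each line of the output file is then a self-contained JSON object, which is the layout Spark's default JSON reader expects.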

rugged comet
rugged comet
boreal gale
#

yes.

#

okay there might be a misunderstanding to what multiline means.
have a look at the reference here: https://spark.apache.org/docs/latest/sql-data-sources-json.html

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

#

sure - your file doesn't literally have multiple lines, but that's not what "multi-line JSON file" means.
it is merely trying to make the distinction between a jsonl and a json file, and in your case you have a json file, not jsonl, hence multiLine is a sensible option to use

rugged comet
#

Oh okay

boreal gale
#

it's frankly quite poorly named.

[a, b]
does not span multiple lines

but [ a, b ] does
but they are the same thing in json, multiLine is just not very clear what's going on

#

but i guess from a parsing point of view, it does make sense.

it's a flag to tell spark "hey you can split the file line by line and just process each line by itself" or otherwise

lapis sequoia
#

i created an nlp ai i would love for people to help me train it!

#

also where can i find nlp data in the form of questions and answers

sinful kelp
rugged comet
somber pollen
#

it's been used with decent success for some finetrained models

rugged comet
#

Also, to get the data encoded for machine learning (Kmeans), I think I want to pivot by cards and group by original_url. I run into a problem though.

pivot_df = df.groupBy("original_url").pivot("cards")
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

It's hard to tell what the format is for the current data but I want it to look like this

color identity,commanders,hubs,original url,tags,theme,CARD_1,CARD_2,...

So I keep the values for the other columns such as color identity and commanders. But I want to add new columns for each value in cards.

cerulean kayak
#

do decision trees rely on randomness?

agile cobalt
#

singular trees: not much
ensembles: yes

somber pollen
agile cobalt
#

the only "forest" I can think of would be a "random forest", which refers to a specific way of putting decision trees together

an 'ensemble' is any model that makes use of two or more separate/individual models

somber pollen
rugged comet
#

I want to try uploading a MySQL database to azure databricks.
They ask for the information in brackets

database_host = "<database-host-url>"
database_port = "3306" # update if you use a non-default port
database_name = "<database-name>"
table = "<table-name>"
user = "<username>"
password = "<password>"

I'm stuck on how to get the database-host-url. Is that the same as the value returned by

SELECT @@hostname;

?

#

My database is hosted on my local machine.

serene scaffold
#

Unrelated, but thank you @rugged comet for causing me to learn about the distinction between multi-label and multi-class a few months ago. I'm working on a multi-label classification project now at work.

#

(also I know nothing about databricks. sorry)

rugged comet
#

Nice! I'm so happy I could help.

#

If you know anything about Azure also, that would be helpful. I just started and it's kind of overwhelming in the beginning.

agile cobalt
#

it might be easier to import a dump or csv file

is the way you're trying to import it an option in a website, or do you pass it to a local script? there is a non-negligible chance that the option you're using assumes that the database is hosted somewhere with a public ip

rugged comet
serene scaffold
rugged comet
#

Azure is like AWS as far as I'm aware. Just by different companies.

#

Microsoft Azure

serene scaffold
#

oh okay. I actually know shockingly little about tabular databases and all their x-as-a-service varieties. but for reasons I can't tell you, I do know a lot about the graph database neo4j.

rugged comet
#

I see... 😄

serene scaffold
#

I do all my tabular data manipulation with pandas, even if the CSV is 80 GB

#

and I like it

rugged comet
#

lmao

serene scaffold
#

anyway, I'm displacing your question with my shitposting, so I'll be quiet in the hopes that someone more knowledgeable appears.

exotic vortex
#

Hello everyone

I want a roadmap for data science. Plz help me. I have started with Python as my programming language

rugged comet
#

The file upload thing is solved now I think.

#

How do you create clusters or a workspace in azure databricks without public ip addresses? I keep getting this error

Error code: PublicIPCountLimitReached, error message: Cannot create more than 3 public IP addresses for this subscription in this region.

when trying to create a new cluster.

rugged comet
#

I have not. Do you think those channels would be more appropriate for this question?

serene scaffold
#

probably

granite falcon
#

hi, i am working on a data science problem. i am new to python and unable to solve it. if you have some spare time, can you help me with it?

severe topaz
#

just explain it ^^

granite falcon
boreal gale
past meteor
past meteor
# rugged comet Firstly, I wanted to upload a json file but they said it was too big. Then I tho...

Just put the JSON file in an Azure blob storage (cheaper) or azure data lake and connect it there, no? Afaik Azure allows you to connect a "service" to Databricks. Be sure to use key vault because your Azure environments are stored as YAML files under the hood and if you don't use Key Vault your credentials get put in plain text in your repo. Maybe this is changed because it's been a while since I used Azure tbf

past meteor
# cerulean kayak do decision trees rely on randomness?

And finally: singular trees rely on randomness because sometimes there are ties in Gini/IG that are broken at random. Depending on your implementation there could also be an element of randomness in how to quantise continuous variables. This is why even if you train a vanilla decision tree on the same data with a different seed you may have different results.
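
A sketch of the tie-breaking case with scikit-learn, assuming its DecisionTreeClassifier. The toy data has two identical columns, so both candidate root splits have exactly the same gain and the seed is free to pick either one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two identical columns: splitting on either one gives exactly the same gain.
col = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([col, col])
y = col.copy()

root_features = set()
for seed in range(20):
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    root_features.add(int(tree.tree_.feature[0]))  # feature index used at the root

predictions = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
```

Depending on the seeds, root_features may end up containing both 0 and 1 even though the data never changed. The predictions are the same either way here, but on real data a different tie-break can lead to a different tree further down.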

copper island
#

Please recommend a book to start with data science in python)

past meteor
#

Your choice/loss 🤷‍♂️. I was pretty stubborn about trying it out as well in the past but it's great

#

Worst case scenario you don't like it and you go back, you don't lose anything

weary swift
#

is there a way to use my AMD RX 470 with pytorch?

sleek harbor
#

pipelines are very convenient (talking about sklearn here), but.. aren't they super inefficient when tuning parameters? I mean, say u have a pipeline with a bunch of preprocessing (drop some columns, impute, standardize one thing, one hot encode another, etc..).. that means all that preprocessing gets done for every hyperparam combination.. every time.. again and again.. when it could be done just once. Am I right about this? Are pipelines actually used in practice? Cus.. that seems like a lot of unnecessary work

cold osprey
#

probably, but is it a lot of run time?

boreal gale
#

in principle yes, but to echo shimmer's point, is it actually a lot of run time?
also have you looked into the memory parameter of pipeline? it looks like a parameter to configure a cache
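
A minimal sketch of that memory parameter, assuming scikit-learn's Pipeline and GridSearchCV. The dataset, step names and grid are invented for the example, and the cache directory is a throwaway temp folder.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cache_dir = tempfile.mkdtemp()  # throwaway cache location for the demo
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory=cache_dir,  # fitted transformers are cached here
)

# Only the classifier's C changes, so within a fold the scaler and PCA
# fits can be served from the cache instead of being recomputed.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
best_c = grid.best_params_["clf__C"]

shutil.rmtree(cache_dir, ignore_errors=True)
```

The cache only helps when a transformer sees the same parameters and the same input data again, so it pays off most when the tuned parameters sit late in the pipeline.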

cold osprey
#

U could always just store a copy of the pre processed data in memory and use that for all ur models

#

Only need to rerun the pipeline when u close ur notebook or smth

boreal gale
#

re. store a copy of the pre processed data in memory
yes it's possible. but you run the risk of leaking your test set into your training set if you aren't careful. hiding behind the pipeline interface is very reliable in terms of not leaking your test set

past meteor
cold osprey
#

am confused

#

how would leakage happen

past meteor
#

A million and one ways? stuff like your StandardScaler and OneHotEncoder etc. depend on the batch of data that was seen during your cross-validation procedure

cold osprey
#

wouldnt everything be run on the df_train only

past meteor
#

Taking your entire training set and precomputing these metrics on df_train and then cross-validating is leakage

cold osprey
#

fit transform on train

#

transform only on test

past meteor
#

I'm specifically talking about the case where you cross validate

sleek harbor
past meteor
#

Say your dataset A is split into 80/20 and you're doing 2 fold CV (for the sake of this example) you cannot just fit your preprocessing on the full 80 and then proceed with your CV

cold osprey
#

ah right

past meteor
#

Your preprocessing can only see the 40/100 it gets during the CV procedure

#

So per definition a lot of preprocessing (but certainly not all) cannot be done ahead of time hence why I'd argue for not risking it and just going with Pipeline and ColumnTransformer because the risk of accidentally leaking is high(er)
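
A sketch of the two setups being compared, assuming scikit-learn. The data is synthetic and the 80/20 split with 2-fold CV just mirrors the numbers used in the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky: the scaler has already seen every row of X_train, including the
# rows that each CV split later treats as validation data.
X_train_scaled = StandardScaler().fit_transform(X_train)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train_scaled, y_train, cv=2
)

# Safe: the scaler is refit inside every split, on that split's training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipe, X_train, y_train, cv=2)
```

On a toy dataset like this the two scores will usually be close; the point is only where the scaler's mean and variance come from, which matters more with small folds, target-dependent preprocessing or feature selection.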

sleek harbor
#

are there other libraries for making pipelines, or is sklearn the most widely used one?

past meteor
errant lake
#

There are tools to orchestrate your pipeline in a smarter way, but it will still use sklearn/pandas/spark in the background

boreal gale
errant lake
#

Apache Airflow is widely used in the industry to design such pipelines

past meteor
#

Airflow is something totally different

errant lake
#

Yes, it's an orchestrator, it's not a pipeline tool per se

past meteor
#

You could make your preprocessing into an airflow DAG but the overhead would be immense 😒

errant lake
#

It is, I agree.

past meteor
boreal gale
#

that's my take as well, just curious to learn more about what clem meant

errant lake
#

A lot of companies are going for Airflow in their DS pipelines: it is seen as an immense overhead at first, but the value is still there, improving the efficiency of scientists. These companies also develop in-house platform solutions on top of Airflow + [whatever data warehouse solution they use] to boost the productivity of you guys 🙂

past meteor
#

Pipeline means many different things in data science / data engineering

#

The pipeline we're talking about is the final steps before inference (preprocessing). This entire thing would be one element of your airflow DAG

errant lake
#

To give a bit more technical example, someone was saying some steps are not necessary to run again in a, say, sklearn pipeline.
Now imagine each sklearn pipeline step is a specific DAG, triggered only on the right events. No need to re-run this pipeline step if nothing will change, right?
This is a complex implementation - can't deny it - but it definitely holds value in the long run

errant lake
past meteor
#

Not sure I agree about that architecture.

#

Bundling up your entire sklearn model in 1 object also makes deploying it on embedded etc. a lot easier

#

It's a single thing from start to finish

#

If the preprocessing is something massive and beyond the scope of sklearn transformers then yeah I would split it in 2 steps of course

sleek harbor
# past meteor `PySpark` has pipelines. `Recipes` in R does essentially the same thing. You can...

I just want something fast 🙂 as efficient as possible.
U know how gridsearch passes parameters as.. stepname__parameter... to the pipeline? Any idea if that's the same as reinitiating the pipeline from scratch with those parameters, or if passing them in that way is more efficient somehow?
say, for some reason, I don't want to use gridsearchcv, but to make a manual for loop.. would it be better, from a.. idk, pythonic standpoint, to loop through different parameters and reinitiate the pipeline by passing the tunable parameters directly into the estimators inside the Pipeline (so we end up calling Pipeline and everything inside it each loop), or to use set_params on a Pipeline created outside of the loop?
I hope I didn't screw up my question.. 😅 first option would be simpler (for me at least) to understand, cus you wouldn't have to use that slightly strange stepname__parameter.. syntax, but reinitiating everything every time.. might be a bit costly(?) idk, thoughts?

#

ok.. discord thingies..

#

there we go

past meteor
#

So you'll code up grid search yourself?

errant lake
past meteor
#

In that case you can definitely re-use some preprocessing steps I guess

sleek harbor
past meteor
#

So instead of refitting the parameters of your preprocessing you want to reuse them?

sleek harbor
sleek harbor
# past meteor So instead of refitting the parameters of your preprocessing you want to reuse t...

this is a really bad example, but.. basically I could do this:

for smth in smths:
    # each smth is a 4-tuple of hyperparameter values
    analyzer, max_features, max_depth, min_samples_split = smth

    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=analyzer, max_features=max_features)),
        ('rf', RandomForestClassifier(max_depth=max_depth, min_samples_split=min_samples_split))
    ])
    ...

or I could do this:

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),   
    ('rf', RandomForestClassifier())
])

for smth in smths:
    tfidf__analyzer, tfidf__max_features, rf__max_depth, rf__min_samples_split = smth
    params = {
        'tfidf__analyzer': tfidf__analyzer,
        'tfidf__max_features': tfidf__max_features,
        'rf__max_depth': rf__max_depth,
        'rf__min_samples_split': rf__min_samples_split
    }
        
    pipe.set_params(**params)
    ...

and get the same result. But personally, I find the first version easier to understand. But in the first version I end up reinitiating the pipeline (and everything in it) each time, while in the second version I just set different parameters (but have to use that pesky step__param syntax I dislike). So my question is, would the first version be less efficient than the second? Or would there be no difference?

past meteor
#

Oh you really just want to create the full grid and then pass on the parameters to the Pipeline?

#

I'd go for option 1 then, it looks cleaner

sleek harbor
past meteor
sleek harbor
# past meteor I'd go for option 1 then, it looks cleaner

It does look cleaner, doesn't it? But what bothers me is the reinitiation of the pipeline. Won't that be.. costly? I guess this is more of a python question and how it all works on a low level, but I'm just assuming that creating a bunch of class instances would be costly

past meteor
#

Don't overthink 1% performance gains when there's more obvious performance gains you could pursue

#

Even if you want to pursue that 1 % you can code both of them up and profile / time it
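
A rough way to time the two options, assuming scikit-learn and the stdlib timeit module. The helper names rebuild and reuse are made up, and only object construction and parameter setting are measured, not fitting.

```python
import timeit

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def rebuild(max_features, max_depth):
    """Option 1: construct a fresh pipeline for every parameter combination."""
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer(max_features=max_features)),
            ("rf", RandomForestClassifier(max_depth=max_depth)),
        ]
    )


shared_pipe = rebuild(None, None)


def reuse(max_features, max_depth):
    """Option 2: keep one pipeline and update it with step__param names."""
    return shared_pipe.set_params(
        tfidf__max_features=max_features, rf__max_depth=max_depth
    )


t_rebuild = timeit.timeit(lambda: rebuild(1000, 5), number=200)
t_reuse = timeit.timeit(lambda: reuse(1000, 5), number=200)

# Both routes end up with the same settings.
same_params = (
    rebuild(1000, 5).get_params()["rf__max_depth"]
    == reuse(1000, 5).get_params()["rf__max_depth"]
)
```

Whichever one comes out ahead on your machine, both numbers are likely to be tiny next to a single fit of the vectorizer and the forest, which is where the time actually goes.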

sleek harbor
# past meteor Either way I'd read this on stackoverflow: https://softwareengineering.stackexch...

I'll give it a read. "Premature micro optimizations are the root of all evil", can't really disagree with that, but.. that doesn't stop me, unfortunately 😅 I spent a bunch of time determining what's faster, using lambda or a defined function, calling a class function and passing in an instance of that class or calling a method directly on an instance of a class, and other such mini nonsense optimizations.. Guilty 😢

sleek harbor
timid kiln
#

I'm looking to be better able to process Excel spreadsheets. I get a myriad of reports in various formats, and I want to pull the data out of these workbooks into a usable format. I need to know what is the "best" methodology for reading worksheets and removing unneeded rows and columns, lining up data that might be offset by a row or column, taking what is essentially data in a form-type format, and turning that into a dataframe, and so forth. My understanding is that pandas is the way to go for these types of things but I lack the foundation of understanding the ramifications of reading something into a dataframe vs a dictionary, which is "better" (which I'm sure depends on the situation) and so forth.

I've been through a few tutorials on how to read data into a dataframe when it's already nicely formatted in Excel; so I don't need any help with that. The situation I'm in is that these spreadsheets were generated with "viewing things" in mind, and not actually processing the data and using it for something beyond just looking at it in Excel. Hopefully this makes sense. Sorry for the long message. Thank you for your help!

cold osprey
#

yeah pandas can do it too

#

drop cols, rows. split stuff up, drop cols then join again to realign them
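
A small sketch of that kind of cleanup in pandas. The raw frame below is an invented stand-in for what pd.read_excel(path, header=None) might return from a sheet laid out for viewing; the column names and values are made up.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel(path, header=None) on a "made for viewing" sheet:
# a title row, a blank row, a blank first column, then the real table.
raw = pd.DataFrame(
    [
        ["Monthly report", np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, "region", "units", "price"],
        [np.nan, "north", 10, 2.5],
        [np.nan, "south", 7, 3.0],
    ]
)

body = raw.dropna(how="all")            # drop fully empty rows
body = body.iloc[1:]                    # drop the title row
body = body.dropna(axis=1, how="all")   # drop fully empty columns

header = body.iloc[0].tolist()          # first remaining row holds the column names
table = body.iloc[1:].reset_index(drop=True)
table.columns = header
table = table.astype({"units": int, "price": float})
```

Nothing here goes row by row: each step operates on the whole frame, and the same handful of calls usually covers offset tables once you know where the real header row sits.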

pallid badge
#

I would go for Pandas

#

Or maybe even Xarray. You can annotate axes with units

#

Attach metadata

timid kiln
#

But, where can I learn how to do all of this? Does it go row by row? It just seems like there's going to be a lot of custom programming, in my newbie opinion.

pallid badge
#

You start on the docu website

#

and you can try stackoverflow, this is how I started with pandas

#

I am not an expert, but that was my approach

#

Or you ask here. Alternative: Look for code mentorship. Somebody you can discuss with

timid kiln
#

Like, if there's data that isn't in proper columns, something is offset because someone formatted the workbook so things line up but not because they're in the same column, how do I move data around? Do I do that with a dictionary? Or a dataframe somehow? Like, merge two columns or... idk even what to ask as I'm so new to this kind of thing with python.

I had chatGPT give me something and it read in row by row, building a dataframe, if I remember correctly. It's been several weeks since I looked at that code. I managed to get it to work for what I was doing but, I think I talked about it here and someone went "that's not how you do this" lol.

#

docu website?

#

have to get off the train, I'll be back in about 15 minutes, thank you for your help! Links are always appreciated. 🙂

pallid badge
#

The first issue sounds more like a formatting problem. You have to find the entry and clean it

cold osprey
#

get the data from the source and abandon excel for most things

pallid badge
#

Would it not make more sense to clean the data first and then put it into a dataframe

#

Docu ---> Pandas documentation

cold osprey
#

best tool for experiment tracking?

#

used tensorboard and mlflow for a bit

pallid badge
#

What is experiment tracking?

boreal gale
cold osprey
#

instead of model1.pth, model2.pth, etc

cold osprey
#

not done any collaborative work so

timid kiln
# cold osprey get the data from the source and abandon excel for most things

I'd like to try to do that, the disconnect for me is how people use the data without Excel? It seems like at the end of the day you need something that's static that will display charts/graphs/tables of information that can be shared with folks and not need to be generated whenever the data is viewed. I get the processing side of things, python is definitely more powerful in that regard, but then, what do people use as a GUI?

cold osprey
past meteor
#

I'm fine with Tensorboard or even weights & biasesµ if it's me just playing around in the weekend

boreal gale
#

i haven't played with mlflow at all. will the fact that i don't work with NN frameworks at all matter?

cold osprey
#

nop

past meteor
#

No, it works perfect with sklearn as well

boreal gale
#

sweet. thanks.

i have a follow up question, how do you share notebooks in your org?

pallid badge
#

@timid kiln : I don't necessarily use a GUI. I get the data from a device or it is saved into hdf5 file formats

past meteor
#

Tbf the reason I decided on using MLflow is that someone on my team only uses R and it's the only one that integrates with R as well

pallid badge
#

To display data? Matplotlib

cold osprey
#

ah

past meteor
boreal gale
pallid badge
#

I have an embarrassing question. Why does this not work?

a= np.array([[1,2, 4], [3,5, 6], [7,8,3]])
a[a>5]=0
print(a[a>5]=0)

Set all the entries in the array to 0 where the condition is correct. The print command never works.
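
For what it's worth, a[a>5]=0 is an assignment statement, not an expression, so it has no value to print; inside the call Python reads it as a malformed keyword argument and raises a SyntaxError. A minimal sketch of the usual fix is to do the assignment first and then print the array:

```python
import numpy as np

a = np.array([[1, 2, 4], [3, 5, 6], [7, 8, 3]])

a[a > 5] = 0   # in-place masked assignment; this line is a statement
print(a)       # print the array itself afterwards

remaining = a[a > 5]  # empty now, since nothing above 5 is left
```

Note that printing a[a > 5] after the assignment gives an empty array, because every entry that matched the condition has just been set to 0.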

past meteor
#

We turn notebooks to webapps if it's something we want to share externally. Sometimes if I'm lazy I use pandoc and turn it into a PDF or HTML and use cron to email it on a fixed schedule.

#

If it's internal stuff / dissemination I'm a fan of just version controlling whatever you're doing, writing reports and doing a powerpoint presentation about your progress (that's what we do)

cold osprey
#

what do u do again zestar?

#

i forgot

boreal gale
# past meteor Git + markdown files and powerpoint?

i like git + markdown, but graphs are probably omitted / too unwieldy to be checked in which loses the richness of the report, which is a shame.
powerpoint is nice and all but imo is hard to diff, quickly lookup, also it takes time to prepare and would limit the velocity of development.

i think given there are enough people in an org, your org's approach definitely makes sense, but sadly it doesn't for me, i am in a startup with a team size of 3 technical staff 😂

past meteor
past meteor
boreal gale
#

oh! do you check in graphs as *.png or whatever and link them inside the markdown?

if that's the case, i might try to adopt that as well, i think it saves a lot of time down the line not having to desperately dig out what you have done in the past from random commits

past meteor
#

I'd just put the .ipynb as is on git and just have a README file with an executive summary of what goes on in the notebook

boreal gale
#

that's sensible as well

past meteor
#

Properly naming your folders, files and making sure the notebooks aren't too long / doing too many things goes a long way as well, no?

#

But I imagine you already do that

#

MLFlow is a big one in the sense that I standardise what will get logged (the plots, metrics) in a Python / R template and also make sure I never delete data in the DB. Only inserts. I keep track of a version number, that goes into MLFlow as well. Kind of a bootleg version of DVC 🤣. Why? I want to be able to go back in the past and recreate any specific experiment

boreal gale
#

yes, i am indeed doing those, but i still find it rather unmanageable 😦

dang, i really gotta check out MLFlow!

cobalt rain
#

Hello everyone, I wanted to ask if anyone would know something about a free versatile and knowledgeable AI chatbot to use in my code... I'm trying to find a free one because I'm gonna give it to my friends for testing... Does anyone know a model I can implement in my code?

serene scaffold
rugged comet
rugged comet
rugged comet
timid kiln
# pallid badge To display data? Matplotlib

Yes, but that would require everyone that uses the Excel workbook to have python installed, and re-run the code. With Excel, at least you are able to capture and publish the results in a format where anyone within a given company could use the information, and not everyone would need python installed.

Everyone I chat with here is so anti-Excel, and that's fine, we're all allowed our opinions. But I don't get how other people, like your average non-programmer non-engineer person, are supposed to be able to use a tool with no GUI. Admittedly I am not very experienced with python so I am well aware that I am lacking a lot of information on how the rest of the world uses python without Excel.

rugged comet
#

The thing is, I don't know that the issue is with the data in its current form. I think the issue is that it will be too big on one machine.

#

With 25000 columns and 500000 samples, that's already 60 GB I think. And that's just 500000 out of the total 2500000 samples.
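
A back-of-the-envelope check of those sizes. The dtypes and the COO-style layout (one int8 value plus two int32 indices per non-zero) are assumptions for illustration, as is the figure of at most 100 non-zero entries per sample.

```python
n_samples = 500_000
n_features = 25_000
nonzeros_per_sample = 100  # upper bound mentioned in the chat

cells = n_samples * n_features

# Dense storage cost depends entirely on the dtype of each cell.
dense_gb_float64 = cells * 8 / 1e9
dense_gb_float32 = cells * 4 / 1e9
dense_gb_int8 = cells * 1 / 1e9

# COO-style sparse storage: one value plus a row index and a column index
# (assumed here: int8 value, two int32 indices) per non-zero entry.
nnz = n_samples * nonzeros_per_sample
sparse_gb = nnz * (1 + 4 + 4) / 1e9

print(dense_gb_float64, dense_gb_float32, dense_gb_int8, sparse_gb)
```

So the dense estimate lands between roughly 12.5 GB and 100 GB depending on dtype, while a sparse layout under these assumptions stays under half a gigabyte for the same 500000 samples.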

boreal gale
#

imo just format to jsonl first.
using a big json blob is generally an anti-pattern when you are using spark.

i would sort out the input data format before thinking about anything else, e.g. "I think the issue is that it will be too big on one machine." is a secondary issue, spark can spill to disk if required, also not to mention there is sparse data structure support in spark.
all of these are pretty pointless if your input data format is borked and hard to work with (which it currently is, you still have one massive json blob)

cold osprey
timid kiln
pallid badge
timid kiln
past meteor
#

What are you currently stuck on?

rugged comet
past meteor
#

hdf5 makes sense if you're ... working with Hadoop

rugged comet
# past meteor What are you currently stuck on?

Currently stuck on getting enough compute in the free azure trial. It appears that the quota for a free account is only 4 cores. The memory that comes with a 4 core cluster is only 14 GB.

past meteor
#

What is your problem in full?

rugged comet
past meteor
#

no you're not

rugged comet
# past meteor What is your problem in full?

I currently have one of 5 json files of data. I want to cluster the data using kmeans. Some of the samples are labeled and others are not. This would be semi-supervised kmeans. I believe that I'll need to use a cloud service to do this. In its current form, the data would be only 3.5 GB roughly. However, if I encode the data so that it's ready for machine learning, the first file's samples would be about 60 GB. This is too big for my machine.
To elaborate on encoding the data, from the current data, I would create a column for each feature. There are about 25000 features. There are about 2500000 samples. I'm just trying to load the first 500000 right now as a proof of concept first.
Let me know what other questions you have.

past meteor
#

So in total you'll have, what 300GB worth of data?

rugged comet
#

If I encode it, I believe so, yes.

pallid badge
#

But I know I can store enough data in it, I can read it lazily and incrementally

#

I just generated the other day a hdf5 file with 350GB.

cold osprey
#

How does 3.5gb go to 300gb

young granite
past meteor
#

Look, with polars you can use scan_ipc, scan_csv, scan_parquet so you can "easily" use the lazy API and sink_parquet to incrementally add your features etc. even if your dataset is larger than memory

pallid badge
#

Ah indeed, it blows up.

past meteor
#

I don't know how you'll actually do k-means reasonably

#

I don't know what the spark implementation of it looks like

young granite
cold osprey
rugged comet
# cold osprey How does 3.5gb go to 300gb

Let me explain.
In its current form, one of the features of the json file is a dictionary of up to 100 keys and values.
Here's an example

"foo": 1,
"bar": 25,
"baz": 1,
...

To encode this for machine learning each of foo, bar, baz, etc would become a new column. There are 25000 unique values like foo, bar, baz, etc. So the dense data where it's a dict of less than 100 pairs gets turned into sparse data with 25000 columns.
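
One way to do that encoding without materialising the zeros, sketched with scikit-learn's DictVectorizer. This is not necessarily what anyone in the chat used, and the toy decks are invented.

```python
from sklearn.feature_extraction import DictVectorizer

# Each sample is a small dict of counts, as in the JSON described above.
decks = [
    {"foo": 1, "bar": 25, "baz": 1},
    {"bar": 3, "qux": 2},
    {"foo": 4},
]

vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(decks)    # scipy sparse matrix, one column per key
columns = vectorizer.feature_names_    # column order (sorted keys by default)
```

The result has one column per distinct key, but only the six non-zero counts are actually stored, so the 25000-column version stays proportional to the number of key/value pairs rather than to samples times columns.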

young granite
#

wild

cold osprey
#

Rip kmeans

young granite
#

and out of curiosity u get reasonable outputs from that structure?

rugged comet
past meteor
#

Tbh if I have that much data I'd be thinking about sampling

young granite
#

is the goal just to cluster or are u doing more with the data?

#

fancy dancy language model?

past meteor
#

K-means can work with all your data on disk and passing sequentially but it'll just be slow lol

rugged comet
past meteor
#

But yeah, I guess that's what Spark's MLlib does either way

#

In a more optimized way ofc

rugged comet
past meteor
#

So K-means has 2 steps right? An E step and an M step

young granite
rugged comet
#

If possible

young granite
#

ah ok

#

so by Card-IDs?

#

would be alot less features i guess

past meteor
#

You can read subsets that fit into memory, assign them to clusters and then go back to diskµ

cold osprey
#

Use count of a particular card type or smth

past meteor
#

While you're doing this you can update the cluster center in an "online" way
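
A sketch of that chunk-by-chunk idea using scikit-learn's MiniBatchKMeans and its partial_fit method, which is one existing implementation of online centre updates. The fake_chunk helper is a made-up stand-in for reading a chunk from disk, and the sizes are arbitrary.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_features = 50


def fake_chunk(n_rows):
    """Stand-in for reading one chunk from disk: a sparse matrix of small counts."""
    dense = rng.integers(0, 5, size=(n_rows, n_features)) * (
        rng.random((n_rows, n_features)) < 0.1
    )
    return sparse.csr_matrix(dense.astype(np.float64))


model = MiniBatchKMeans(n_clusters=3, random_state=0)

for _ in range(5):              # pretend each iteration loads the next chunk
    chunk = fake_chunk(200)
    model.partial_fit(chunk)    # updates the centres using only this chunk

labels = model.predict(fake_chunk(10))
```

Only one chunk is in memory at a time, and the chunks can stay in sparse form throughout.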

rugged comet
boreal gale
#

ooo.. sounds like a problem suited for NMF.

past meteor
#

No reason to handroll this because I'm pretty sure MLlib does this

rugged comet
boreal gale
#

Non-negative matrix factorization

past meteor
#

non-negative matrix factorization

boreal gale
rugged comet
cold osprey
#

Ah spark supports sparse format

#

Should reduce data size by alot

past meteor
#

The parquet shouldn't be too large either I think?

#

You're just taking a JSON file and one-hot encoding the data right?

#

There's no need to store all those 0's I think

rugged comet
rugged comet
cerulean kayak
rugged comet
past meteor
#

To look for a file format that works well with sparse matrices 👀

#

I don't know how jsonl works under the hood, I'd have a look at that

wooden sail
#

there should be a straightforward way of exporting sparse mats as COO

rugged comet
wooden sail
#

coordinate array. the entries are saved as triples with row, column, value

boreal gale
# rugged comet Why do you think NMF would work well for this problem?

just a hunch, imo k-means's inductive bias is not great for your task, i can't really provide a formal proof or explain formally sadly.

also - NMF is a common technique for building recommendation engines, with a small tweak it could be used to do clustering (and building a recommendation engine is pretty similar to your task, instead of grouping users by movies they like, you are grouping decks by cards they have chosen)

(this class of technique is also called collaborative filtering iirc)
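
A toy sketch of NMF used for clustering, assuming scikit-learn's NMF. The "small tweak" shown here (taking each row's strongest component as its cluster) is one common choice, not necessarily the one meant above, and the deck-by-card counts are invented.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

# Rows are decks, columns are cards, values are card counts (toy data with
# two obvious groups: decks built from cards 0-2 and decks built from cards 3-5).
counts = sparse.csr_matrix(
    np.array(
        [
            [4, 3, 2, 0, 0, 0],
            [3, 4, 1, 0, 0, 0],
            [0, 0, 0, 2, 4, 3],
            [0, 0, 0, 1, 3, 4],
        ],
        dtype=float,
    )
)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(counts)   # deck-by-component weights
H = model.components_             # component-by-card weights

# The small tweak: treat each deck's strongest component as its cluster.
clusters = W.argmax(axis=1)
```

W says how strongly each deck loads on each component and H says which cards define a component, which is often easier to interpret than a k-means centroid in 25000 dimensions.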

wooden sail
#

there are other flavors too, like nonzero cols and nonzero rows. which one works best depends on the structure of your sparse matrix

past meteor
#

NMF is a collaborative filtering method, like alternating least squares etc etc

cold osprey
#

Is the matrix sparse? It has values from 0-98 but are most of them 0 or it's distributed from 0 to 98

past meteor
wooden sail
#

😩

rugged comet
#

But there are about 100 values that range from 0-100.

past meteor
#

Might be an idea to keep it like that

#

And only to expand it when you need to do your k-meansµ

wooden sail
#

k-µeans

boreal gale
#

k-µs

wooden sail
#

if you were to use scipy's sparse matrices, doing k means should be very efficient

past meteor
#

Iirc if you one-hot encode with scikit-learn you get a sparse matrix as output anyway

rugged comet
#

I mean what I have now is already a dense representation of the data. But I think when I do kmeans, I would need to expand it.

past meteor
#

It's not exactly one-hot you're doing but you get my point

wooden sail
#

there shouldn't be a need to expand at any point tbh

wooden sail
#

most linear algebra packages have sparse formats

#

you should use them, not your own

cold osprey
#

Iirc sklearn needs to expand it

#

Could be wrong

past meteor
#

With expand I meant turning it into a sparse format a la scipy

rugged comet
past meteor
#

Because that's what one-hot in sklearn automatically does when your cols are beyond a certain number

wooden sail
#

nothing in the distance computation requires you to explicitly have the vector in dense form

past meteor
rugged comet
past meteor
#

more or less

cold osprey
#

Depends on how he's doing it right

#

Could be dense?

wooden sail
#

so might as well use a sparse one

#

what i wrote there was correct

rugged comet
cold osprey
#

huh

past meteor
#

To quote Edd: coordinate array. the entries are saved as triples with row, column, value

wooden sail
#

that's COO

#

there's also CSR and CSC

past meteor
#

Each JSON entry is a row, your key is a column and your value is the value

wooden sail
#

i leave you peeps to it, i just wanted to comment on sparse matrices 😛

past meteor
#

I just wouldn't know how you'd get a JSON into COO, maybe @wooden sail has pointers?

wooden sail
#

is the json dense?

past meteor
#

Afaik it's sparse as well

wooden sail
#

what i mean is, does it have all the values?

rugged comet
#

umm

wooden sail
#

or it's already in a COO/CSR/CSC form

rugged comet
#

I can give an example that might answer your question.

wooden sail
#

because in those forms, only the nonzero values are stored

rugged comet
#

Yes, only the non-zero values are currently stored.

cold osprey
#

Ah so it's already sparse

wooden sail
#

then it's indeed already sparse

#

you just need to load it in a friendly format for whatever module you're using to create sparse matrices

rugged comet
# wooden sail then it's indeed already sparse

Okay. I was misunderstanding what you all meant by sparse. I thought sparse data was mostly zeroes and a few "hot" features. It seems like that's not true though based on what you're saying now.

wooden sail
#

that is indeed sparse, but when i mentioned COO, CSC and CSR, these are efficient sparse representations

#

they don't store all the zeros explicitly

#

so one usually refers to these special representations as sparse, and the matrix with all the 0s in it as dense

rugged comet
#

So it sounds like I should try to figure out how to convert the json data I have now into an actual sparse matrix such as scipy's coo_matrix.

wooden sail
#

i think that would be good

iron basalt
#
Dense:
|1  2  3  4 |
|5  6  7  8 |
|9  10 11 12|
|13 14 15 16|
Sparse:
|1  0  0  4 |
|0  0  0  0 |
|0  10 0  0 |
|0  0  0  0 |
Sparse COO:
[(0, 0, 1), (0, 3, 4), (2, 1, 10)] <- Less memory usage, faster matrix multiplies (if sparse enough / large enough matrix).
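Sparse CSR:
[1, 4, 10] <- non-zero values
[0, 3, 1] <- column index of each value
[0, 2, 2, 3, 3] <- row pointers: row i owns entries ptr[i] to ptr[i+1]
Even faster, but takes more time to build, and can't dynamically add more easily (build once, multiply many times).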
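
The same 4x4 example can be checked against scipy.sparse, which exposes the underlying arrays directly. A minimal sketch; the variable names are just for illustration.

```python
import numpy as np
from scipy import sparse

dense = np.array(
    [
        [1, 0, 0, 4],
        [0, 0, 0, 0],
        [0, 10, 0, 0],
        [0, 0, 0, 0],
    ]
)

coo = sparse.coo_matrix(dense)
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

csr = sparse.csr_matrix(dense)
values = csr.data.tolist()          # non-zero values, row by row
col_indices = csr.indices.tolist()  # column of each value
row_pointers = csr.indptr.tolist()  # where each row starts in the two lists above
```

The row pointer array always has one entry per row plus one, so empty rows show up as repeated values in it.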
#

(DOK is the same as COO, but uses a dict instead of a list, good for when you have non-zero entries added / removed dynamically all the time)

wooden sail
#

ah dok is good here, since json can be read as a dict

#

maybe that's the easiest for this case

iron basalt
#

DOK is good for incremental construction, especially if out of order.

past meteor
#

Still, how do you convert a JSON to DOK

wooden sail
#

by converting to dict and passing the dict to scipy's sparse

wooden sail
#

indeed

past meteor
#

Interesting, TIL

wooden sail
#

i've never used that specific one so i don't actually know what you pass it. i HOPE you can pass a dict of tuples or somth

iron basalt
#
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update element
wooden sail
#

even if not possible to make the matrix out of a dict though, one can read the json as a dict and then make it into a list of tuples in O(nonzero entries), and feed that to scipy

#

then use COO

iron basalt
#

If you try to insert a duplicate row, col, it will probably raise an exception.

past meteor
#

Can't you remove the duplicates from your JSONs?

#

Meh that'll impact k-means

rugged comet
#

To be clear, my json looks like a single list of dictionaries where each dictionary is a sample.

wooden sail
#

parsing text is not my forte so i leave y'all to it. i do know that json can be parsed directly into python dicts, so it shouldn't be too troublesome to massage the data into something scipy sparse likes

#

best of luck

rugged comet
#

Thank you.

past meteor
#

I learnt a lot from this convo though thanks edd and squiggle

iron basalt
#

May require normalization, good luck. I really dislike JSON for reasons like this.

#

If possible convert it once to a better format and use that (if you need it multiple times).

rugged comet
#
shape = (samples, cards)
mat = sp.dok_matrix(shape, dtype=np.int8)

for id, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        mat[id, card] = quantity

I think I'm on the right track with this.
The id should be the row, the card should be the column, and the quantity should be the value.
However, card is still a string. I think perhaps it should actually be the index of the card if it were a column?
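
One way to handle the string problem, as a rough sketch: give every distinct card name its own column index and collect (row, col, value) triplets. The `decks` list below is a made-up stand-in for the parsed JSON, and `card_to_col` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical stand-in for the parsed JSON: a list of dicts, one per deck.
decks = [
    {"cards": {"card_a": 1, "card_b": 30}},
    {"cards": {"card_c": 28, "card_a": 1, "card_d": 1}},
]

# Assign each distinct card name a stable column index on first sight.
card_to_col = {}
rows, cols, vals = [], [], []
for row, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        col = card_to_col.setdefault(card, len(card_to_col))
        rows.append(row)
        cols.append(col)
        vals.append(quantity)

# Build the matrix in one go from the triplets, then convert to CSR.
shape = (len(decks), len(card_to_col))
mat = coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.int8).tocsr()
```

Building from triplets avoids the per-element assignment loop on a DOK matrix, and decks with different numbers of unique cards need no padding because missing cards are simply not stored.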

mild dirge
#

Can you not use smart indexing like with numpy arrays?

#

like mat[ys, xs] = vals

rugged comet
#

Oh

iron basalt
#

Or the file directly.

#

I just use my own sparse matrix types written in C with Python bindings so IDK.

iron basalt
#

When I run into performance issues where I need to do manual loops in Python I tend to make my own library and then call that from Python.

mild dirge
#

It can take a dict

rugged comet
#

idk

arctic wedgeBOT
#

scipy/sparse/_dok.py line 113

def _update(self, data):
mild dirge
#

Yeah ig, don't see why it has an unimplemented update method...

iron basalt
#
"""An update method for dict data defined for direct access to
        `dok_matrix` data. Main purpose is to be used for effcient conversion
        from other spmatrix classes. Has no checking if `data` is valid."""
        return dict.update(self, data)
mild dirge
#

But I doubt it is a lot quicker, seems like it is just a dict underneath

#

So no fancy C shenanigans

iron basalt
#

Yeah, and reading from the file is also something you probably want to happen in C.

rugged comet
#

Reading from the file is pretty quick. Only 11 seconds.

iron basalt
#

I guess you only need to read it once?

#

In C I'd estimate it at a few ms.

#

If you don't want to touch C, Mypyc and Cython are options.

iron basalt
#

(I actually just make my own library that does just take in the file so I don't have this issue)

rugged comet
#

I think things are working now.

#

Looks like sklearn's KMeans only works with CSR format sparse matrices. Good to know.

rugged comet
#

I think I neglected to mention that our arrays of deck lists are jagged. One deck might have 95 unique cards and another might have 100.
What should we do in this case? I was thinking we could pad the arrays to the max length using a number that isn't being used to represent a card.

severe topaz
#

Elaborate?

rugged comet
#

Well I need a matrix representation that is small (sparse). To do this, I'm using scipy's sparse matrices. Think of it like having lots of (row, column, value) elements.

rugged comet
#

Not sure what conclusion I can come to. Creating the matrix incrementally seems too slow and creating it all at once isn't feasible due to memory problems.

iron basalt
#

First two rows, column indices 1, 2, 3 = data.

severe topaz
#

Ok…

dusty bay
#

I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI.py Code

class Myapp():
    
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.root, text="View Plot", font=("Ubuntu", 12))
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))
app = Myapp()
app.root.mainloop()

And here is the viewer.py code

import pandas as pd
import matplotlib.pyplot as plt


class csv2df():
    
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])
        
    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        
        plt.show()
        
data = csv2df()
data.plot()

Can you please fix it as I need this for a project.
Thank You.

midnight skiff
#

Is there any merit to Mojo's hype?

cold osprey
#

we'll see

boreal gale
#

personally can't wait to get my hands on it to see what's up, i have a fair bit of numerical code that is in need of optimisation

i have shoehorned them into numba at the moment but it looks kinda ugly and hard to maintain

boreal gale
#

hmm nope, what's the headline difference between that and numba?

also probably worth mentioning i have extremely limited chance of utilising vectorisation - i am dealing with streaming data

wooden sail
#

it's an implementation of numpy and scipy on XLA, so the code looks just like usual numpy. it brings its own jit and autodiff though, and the jit is a lot more flexible than numba's. also can run on gpu and tpu without (m)any changes. numba only has few numpy and scipy functions jitable, and only with limited arguments

#

as an example, specifying order='F' is not supported on most numpy and scipy functions with numba

#

no aot though, only jit. i think numba has aot

wooden sail
#

i like it a lot tbh

boreal gale
#

you can always simulate aot manually kek

wooden sail
#

call the function before actual execution during initialization 🤡

#

for loops and the prng do require a little getting used to, you CAN but don't really wanna use native python loops

#

the biggest selling point for me is being able to jit, autodiff, and run on gpu while still looking like numpy

#

for most simple functions, you can straight up replace import numpy as np with import jax.numpy as np

boreal gale
#

that sounds awesome

btw, how big is the overhead of moving data to-and-from GPU these days? i haven't looked into that space for years, curious to know at what size of data would you gain noticeable speed gain by shifting your workload to GPU now

wooden sail
#

it's still the bottleneck

#

even just moving/copying stuff in memory is usually the bottleneck. it gets worse if you move between mem and vmem

#

that's why quadro and a100 cards cost an arm and a leg

hoary wigeon
#

###################################

I NEED HELP! with SHAPLEY VALUES

###################################

I want to know how to calculate shap values on record level.. and what are the units of shap value.

I have built an XGBoost Classifier model and I'm using same model to calculate the shap values. I'm confused with the unit of value that shap returns and If possible I need it in probability.

#

I just got to know shap returns log-odds for XGBoost Classifier..
I'm trying to convert the values to probability using the function below

def logit2prob(logit_val):
    prob = 1 / (1 + np.exp(-logit_val))
    return prob

But when I try to sum up the probability values on record level, it doesn't add up to the class-1 probability predicted by the XGBoost model.
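
A small numeric sketch of why this happens, assuming the usual setup where the explainer works on the raw margin: SHAP values are additive in log-odds space, so they have to be summed together with the base (expected) value first and converted to a probability afterwards. The numbers below are invented for illustration and do not come from a real model.

```python
import numpy as np

def logit2prob(logit_val):
    return 1 / (1 + np.exp(-logit_val))

# Hypothetical values for one record.
base_value = -0.4                         # expected model output, in log-odds
shap_values = np.array([0.9, -0.3, 0.5])  # per-feature contributions, in log-odds

# Sum in log-odds space, then convert once.
margin = base_value + shap_values.sum()
prob = logit2prob(margin)

# Converting each value first and then summing is not a probability at all.
wrong = logit2prob(shap_values).sum()

print(prob, wrong)
```

Because the sigmoid is non-linear, the individual SHAP values have no fixed probability "unit"; only their sum (plus the base value) maps to the predicted probability.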

agile cobalt
arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 666
002 | [ 60   6 600]
agile cobalt
#

oh wait no, exp isn't even an aggregating function
I have no idea about what you mean by record level in that case, post an example with input&desired output

wheat snow
#

<@&831776746206265384> i accidently deleted a VERY long message explaining my problem.. ( i just wanted to delete the images and not the whole message) could someone pls recover it from the server logs? and sent it into some DM or channel so i can edit it properly and paste it in here?

rugged mist
#

dm'd you, you might want to fix the formatting

rugged mist
#

it lost formatting when i copied it

wheat snow
#

Sup, i have a df that looks like that:

                Timestamp              Creator       Year     Month
27414   2021-02-10 21:01:12           GameSünden    2021      2
34085   2019-08-15 09:27:56                Kedos    2019      8
41306   2018-06-10 18:41:54             Dream TV    2018      6
653     2023-03-21 15:36:00            King Fish    2023      3
48795   2017-06-24 08:43:31       Mrmobilefanboy    2017      6
25894   2021-04-05 00:16:51                  WWE    2021      4
25397   2021-04-17 17:29:08   Étienne MzA Gaming    2021      4
1450    2023-03-06 15:26:23          Nicholas Ma    2023      3
4257    2023-01-22 20:47:28           NRML MTBer    2023      1

I now want to create a multiindex DataFrame which allows me to track how my viewing habits of certain Youtubers have changed from 2017 to 2023 in a monthly period. By "viewing habits" i mean how many videos i watched of a certain creator

therefore i have a df of the creators that i wanna track...

top_creator=(temp_df.value_counts().sort_values(ascending=False))[0:15]
Paluten                 643      --> total number of videos i watched by the creator
Galileo                 631
ExplosmEntertainment    542
Benx                    488
DieBuddiesZocken        395
...

So... i am looking for some help to this code:

df_creator_track = df.resample('M', on='Timestamp')['Creator'].value_counts().sort_values(ascending=False)

I am still missing how to include the list of the top_creators from above into this. My goal is to achieve something i will share in the next picture

wheat snow
#

And under those:
...
...
...

i only want the creators to show up that i have in my top_creator df
errant lake
#

@wheat snow I must be misunderstanding your need because it feels like you could get your result with a groupby?

wheat snow
#

yes... kinda. I'm honestly not a big expert in the groupby function

errant lake
#

no worries, I can try to point you in the right direction:
I would:

  1. Transform your timestamp column so that it translates every date to only year-month (use pd.to_datetime as well as the format option) > let's name the transformed column 'month'
  2. group by the 'month' as well as the 'creator' column and sum the views
wheat snow
#

i just looked through some pages in my book, and maybe found an idea... what about pivoting the df... yk flip it up take the channels i am looking for as columns and only leaving the the Timestamp

errant lake
#

It would work but you would be left with a wide data structure

#

which is not ideal, I prefer long 😄

errant lake
#

From your need I think a single column of format 2023-01 would be more helpful

errant lake
#

That's one way, for example:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year_month'] = df['timestamp'].dt.strftime('%Y-%m')
#

then you'll just be one groupby away :

df = df.groupby(['year_month', 'creator_name'])['views'].sum()

👍

wheat snow
#

so .count?

errant lake
#

Ah, yes

#

or you insert a column of 1s, because that's one view on every row
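
Putting the pieces together, a minimal sketch under the assumption that each row is one watched video. The tiny DataFrame and the `top_creators` list are made up and only mirror the column names from the df shown earlier.

```python
import pandas as pd

# Hypothetical watch history with the same columns as the df above.
df = pd.DataFrame({
    "Timestamp": ["2023-03-21 15:36:00", "2023-03-06 15:26:23",
                  "2023-03-29 07:47:14", "2023-01-22 20:47:28"],
    "Creator": ["King Fish", "Nicholas Ma", "King Fish", "NRML MTBer"],
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["year_month"] = df["Timestamp"].dt.strftime("%Y-%m")

# Keep only the creators of interest (stand-in for the top_creator index).
top_creators = ["King Fish", "NRML MTBer"]
subset = df[df["Creator"].isin(top_creators)]

# size() counts rows per group, i.e. videos watched per creator per month.
monthly = subset.groupby(["year_month", "Creator"]).size()
print(monthly)
```

Using `size()` avoids the extra column of 1s, although summing such a column gives the same counts.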

wheat snow
#

why im thinking so hard

past meteor
#

They benchmarked matrix multiplication with stdlib lists as matrices and compared it to their highly optimized stuff lol

#

Why not at the very least compare it to Numpy / Jax / Numba / Cython / ...

agile cobalt
#

for proper benchmarks you would usually rather rely on third parties than on what the providers say about themselves anyway

past meteor
#

The notebook they handed out to content creators like Fireship did frame it like a benchmark tbh

agile cobalt
#

they did not "hand it out" to Fireship and others

past meteor
#

Then what happened there?

agile cobalt
#

they posted it publicly and mentioned in the launch video, and content producers used the material publicly available

past meteor
agile cobalt
#

They said up to N times faster than python, in great part for effect/impact, but I don't think that anyone would interpret that as N times faster than python with numpy/tensorflow/pytorch/etc

past meteor
#

Such claims make me a bit skeptical about the thing in its entirety because they could just not have done that. Imo not making those claims gives a lot more credibility.

#

Because their actual benchmarks (the ones you linked) are credible / interesting

#

Maybe that's just me though, I'm "allergic" to marketing. 🤷‍♂️

wooden sail
#

a lot of comparisons vs julia are also bad, they don't put the code into functions and the jit never gets used

past meteor
#

So the benchmarks would be biased against Julia in that case?

wooden sail
#

yep

past meteor
#

Interesting

lapis sequoia
#

Is that true?

past meteor
#

You can just output raw scores and inspect the PR-curve /ROC / ... without upsampling/upweighting etc.

agile cobalt
# lapis sequoia Is that true?

well, class weights is literally making the performance biased towards underrepresented groups in the data
if you only care about the balanced accuracy it probably™️ should help most of the time (as long as your data is not way too ridiculously unbalanced like 1 to 10k, in which case you might as well use outlier detection instead of classification)

lapis sequoia
#

Hmm. It's just like 1:2

#

And the weights made it worse

agile cobalt
#

GPT itself is pretty biased towards keeping true to its prior statements though, especially if you get argumentative.
try asking about it in a new session

lapis sequoia
#

Accuracy

#

Do I have to use balanced accuracy after using weights?

#

@agile cobalt

agile cobalt
#

not really

#

but if you care equally about each class (instead of equally about each record), then it makes more sense to look into balanced accuracy than normal accuracy

lapis sequoia
#

Is this good enough justification for the behaviour?

lapis sequoia
#

Yes

agile cobalt
#

pretty sure that it's just that really

lapis sequoia
#

Nice

agile cobalt
#

don't quote me on that though

lapis sequoia
#

And put you in the references

agile cobalt
#

eh, at least do compare the balanced accuracy of the models with and without class weights if you can

lapis sequoia
#

nice beautiful pfp

#

The whole profile looks pretty fancy

night prawn
#

I want to use the gpu for my GAN but i don't know how to do this?

wooden sail
#

which module did you make your gan with? pytorch?

night prawn
#

No i use tensorflow

wooden sail
#

that works too

#

did you install a gpu version of tensorflow?

wheat snow
#
year_month  Creator
2017-05     Benx                    37
            Galileo                 30
            LeKoopa                 17
            TheBietz                 2
            baastiZockt             36
2017-06     Benx                    54
            Galileo                 40
            LOGO                    48
            LeKoopa                 45
            TheBietz                45
            baastiZockt             83
how would i sort smth like that? i want each month to be sorted from highest to lowest so for each month we still have 5 values and they should be sorted

this is what i used to create it:

month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()

based on that df:

                   Creator           Timestamp
0               Phil Laude 2023-03-29 07:47:14
1                Sing King 2023-03-29 00:18:05
2                  orijimi 2023-03-29 00:17:53
6                Sing King 2023-03-28 22:22:05
9                  orijimi 2023-03-28 21:19:02
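
One possible approach to the sorting question, sketched on a hand-built Series shaped like the groupby result above (only a few of the rows are reproduced): turn the index levels back into columns, sort by month ascending and views descending, then restore the index.

```python
import pandas as pd

# Hand-built stand-in for the grouped result shown above.
month = pd.Series(
    [37, 30, 17, 54, 40, 48],
    index=pd.MultiIndex.from_tuples(
        [("2017-05", "Benx"), ("2017-05", "Galileo"), ("2017-05", "LeKoopa"),
         ("2017-06", "Benx"), ("2017-06", "Galileo"), ("2017-06", "LOGO")],
        names=["year_month", "Creator"],
    ),
    name="Views",
)

# Months stay in chronological order; within a month, highest count first.
sorted_month = (
    month.reset_index()
    .sort_values(["year_month", "Views"], ascending=[True, False])
    .set_index(["year_month", "Creator"])["Views"]
)
print(sorted_month)
```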
wooden sail
#

the cuda toolkit and cudnn need to be installed along with tensorflow

torn gull
#

Hi! Could someone help me with some language modeling? ie. BiLSTM for next word prediction?

limber kiln
#

How does Pandas apply work when you select a particular axis (let's say axis = 0). Does it go over cell of a column one by one, or does it perform a vectorized operation?

hasty mountain
#

Guys, I'm trying to make a decent Variational AutoEncoder, but it seems that it only produces really blurry images. Is it a sign that I need to make it train for even more time? Or do I need to increase its parameters(like the latent vector size)?

#

Hm... I guess I'll try increasing my decoding loss weight...if it doesn't work, increase the encoding dimension pithink

#

I'm not planning to use this VAE to generate images by itself, so I guess prioritizing the decoding loss might actually be good?

rare pagoda
#

I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both offer far more threads for parallelization than a CPU. Resources online say that an FPGA offers "lower latency" and more "power efficiency" but they don't go into specifics. From what I understand each CPU or GPU core executes instructions one at a time, like MOV, PUSH, or ADD, whereas an FPGA, like an electrical circuit with IC Logic Gates, can perform numerous operations on a given value in one clock cycle without having to put values in registers then wait for another clock cycle. Is this what they mean?

iron basalt
#

They are programmed using a hardware description language (HDL), which is even more low level than assembly / machine code / micro-ops.

#

GPUs are not well suited to all problems, they are more general now than they used to be, but there is still only a certain set of problems they are good at.

#

FPGAs can be set up to be "flow-through" where there is not really a clock (except for the processor that is often on the same board to control the FPGA, and some parts still need a clock (memory)), it's just happening, in the same way a regular circuit does not need a clock.

#

If I have an AND gate and flip one of the inputs, the output changes, without waiting for some clock.

#

*Also CPU and GPU cores execute more than one instruction at a time.

hasty mountain
#

Squiggle...you and my physiology teacher are making me really consider the possibility of diving into robotics...

the only problem is that I lack the lifespan for so many things grumpchib

#

But I'll check if coursera has anything about it anyway

#

I've found some which use...Python? I wasn't expecting Python to be used in robotics. I thought it would be more something slightly lower-level, like C++, C...

iron basalt
# hasty mountain I've found some which use...Python? I wasn't expecting Python to be used in robo...

A lot of people in robotics (not meant as an insult, they are very talented people) are not that proficient in programming (they are more interested in the physical machine) and C++ is not exactly a language that helps the user become proficient in a straight forward way. And Python happens to be a language that many people find easy to use for non-programmers. Plus you get access to all these libraries (e.g. OpenCV and now all the ML ones). However a lot of C/C++ still happens, because you often have to work with whatever SDK you are provided by the hardware manufacturer.

hasty mountain
#

I see... Yeah... I wasn't expecting the folks in robotics to not be that proficient in programming pithink
I mean... they're not that far from dealing with hardware, I guess, and hardware seems to require quite low-level programming

iron basalt
#

Being a proficient programmer and into robotics makes you very desirable by many.

hasty mountain
#

After all, our machines are like robots...but instead of movements and actions in the real world, they perform it in the digital world 🧠

iron basalt
#

It's also a supply and demand issue. There are only so many people into robotics. It's not exactly as easy to find a job for it as something like web development.

molten hamlet
#

Can you sync matplotlib figures when using show ?
What I mean by that: I explore data on one plot, but want to see the same position in another window, for example raw and filtered data. It's just too much to do on a single

#

one plot 😐

#

need to put more dots

hasty mountain
# molten hamlet Can you sync matplotli figures when using `show` ? What I mean by that, I explo...

Try using fig, ax = plt.subplots(rows, cols)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 4)

# Hide the ticks and frame on every subplot
for x in range(ax.shape[0]):
    for y in range(ax.shape[1]):
        ax[x, y].axis('off')

# saving_image and original_image are lists of images from my own code
ax[0, 0].imshow(saving_image[0])
ax[1, 0].imshow(saving_image[1])
ax[0, 1].imshow(saving_image[2])
ax[1, 1].imshow(saving_image[3])
ax[0, 2].imshow(original_image[0])
ax[0, 3].imshow(original_image[1])
ax[1, 2].imshow(original_image[2])
ax[1, 3].imshow(original_image[3])
plt.show()
#

This will create a window with dimensions 2x4(2 rows, 4 columns), in each row i and column j you'll be able to add a plot(or image, in this case).

#

You can do pretty much the same thing you do with plt using ax[row,column], but applying exclusively for a single plot. Just need to add "set" in most cases.

Like ax[0,0].set_title("Test") (instead of plt.title("Test")) or ax[0,0].set_xlabel("x") (instead of plt.xlabel("x")). The legend is an exception: it's ax[0,0].legend(), without "set".

molten hamlet
#

still have to move each subplot
no thanks

gloomy saddle
# rare pagoda I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both...

Basically, if you can afford it. An FPGA approach can be infinitely scalable with no overhead as long as your problem is divisible enough, things can be synchronous or async, and if you get fancy you can always throw hardware at making sure you get a result on the same clock cycle it's needed when not constrained by serialization.

But the cost... Yeah, it's not for small things where you can cop the wait time on a GPU. To match a GPU for many tasks you're already talking 100K FPGA cluster cells. The difference is once you cross that point in an FPGA cluster, the lower latency and architecture shenanigans you can pull, on top of the power savings, mean you can optimise things for your use case.

vital cedar
#

How can I choose the right algorithm for a Tweet Sentiment project? Is there any way to plot them perhaps?

severe topaz
#

Field programmable gate array now I remember.

#

Regex and df 🖐 in 🖐

#

Should have done all my arduino projects and everything from now on in python

kind moth
#

Anyone have any tutorials or links for Image Generation?

mild dirge
#

How would you measure the performance of a reinforcement learning model? My prof is making us cherry pick the 5 best runs, but that seems really biased and bad :/

jolly dock
#
from tensorflow.contrib.training import HParams

I have downloaded gpt-2 from github and this part of `model.py` isn't working because I'm using python 3.8, which doesn't let me use versions of tensorflow below 2.8.0, and this part doesn't work on versions above 1.5.0.

Can somebody help me to solve this problem please?
cold osprey
#

upgrade python?

mild dirge
#

But the loss is kind of artificial, as you don't have the correct labels, it's unsupervised

gloomy saddle
#

ok, so is it, say, competing network learning? where you have 1 model generating and another discriminating? otherwise how is the scoring implemented? being unsupervised makes it a bit nebulous if your scoring on its own doesn't cover that behaviour

mild dirge
#

I'll just be using the average winrate, but I am using Q learning. This uses an online and offline model, the online model predicts the action scores, and the offline model predicts the expected action scores.

#

And the offline model is updated with the online model's parameters every now and then
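
A rough sketch of that two-model setup, assuming a plain linear Q-function and random made-up transitions instead of the actual game. What the message calls the offline model is more commonly called the target network; all names and hyperparameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
gamma, lr, sync_every = 0.99, 0.01, 50

# Linear Q-function: Q(s, a) = s @ W[:, a]. Two copies of the weights.
online_W = rng.normal(scale=0.1, size=(n_features, n_actions))
target_W = online_W.copy()

for step in range(1, 201):
    # Hypothetical random transition standing in for one step of the game.
    s = rng.normal(size=n_features)
    s_next = rng.normal(size=n_features)
    a = rng.integers(n_actions)
    r = float(rng.random() < 0.1)

    # The bootstrap target comes from the frozen copy, not the online weights.
    td_target = r + gamma * np.max(s_next @ target_W)
    td_error = td_target - s @ online_W[:, a]
    online_W[:, a] += lr * td_error * s

    # Periodic hard update: copy the online weights into the target copy.
    if step % sync_every == 0:
        target_W = online_W.copy()
```

Keeping the target fixed between syncs is meant to stop the regression target from moving on every update.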

gloomy saddle
#

chess like problem?

mild dirge
#

Basically a simplified breakout game

#

catch the ball

cold osprey
#

oo RL

mild dirge
#

Catch the ball is 1 point, all other states 0

#

It doesn't bounce from the paddle

gloomy saddle
#

ah ok, so instead your score function should probably be distance of center of paddle to intersection point of the threshold for the ball

#

so it can... learn

mild dirge
#

I'm not having a problem with it learning, it learns 😛

gloomy saddle
#

right now, 0 or 1, it only has random chance to begin learning

#

rate of learning

mild dirge
#

I was just nitpicking on the fact that we have to cherry pick the 5 best runs, which seems biased

gloomy saddle
#

so pick the 5 that have the center of paddle most directly under the ball center?

wheat snow
#

I have that df here:

                Timestamp         Creator
5126  2022-12-27 23:20:17      ZDF Satire
5825  2022-12-14 20:57:53      ZDF Satire
6014  2022-12-10 21:36:12      ZDF Satire
7731  2022-11-17 17:08:06  GermanLetsPlay
12363 2022-07-20 19:54:39  GermanLetsPlay

and applied this command:

month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()

which results in:

year_month  Creator       
2018-06     LeKoopa           10
            ZDF Satire        16
            marshmallowTV     39
2018-07     LeKoopa            5
            ZDF Satire         1
            marshmallowTV      6
2018-08     GermanLetsPlay    18
            LeKoopa            6
            ZDF Satire        24
            marshmallowTV      6
and this is a series: <class 'pandas.core.series.Series'>

So... what i originally wanted to do is plot multiple lines where each line represents one Creator, e.g. (LeKoopa, GermanLetsPlay, ZDF Satire, ...). The x ticks shall be the timestamps, so each month, and the y value for e.g. LeKoopa should be the value he has every month, so: `10, 5, 6, ...` but I'm not that good with Series and plotting information out of one
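
A sketch of one way to get there: unstack the Creator level so that each creator becomes a column and each month a row. The Series below is a hand-built copy of part of the output above.

```python
import pandas as pd

# Hand-built stand-in for the grouped Series shown above.
month = pd.Series(
    [10, 16, 5, 1, 18, 6, 24],
    index=pd.MultiIndex.from_tuples(
        [("2018-06", "LeKoopa"), ("2018-06", "ZDF Satire"),
         ("2018-07", "LeKoopa"), ("2018-07", "ZDF Satire"),
         ("2018-08", "GermanLetsPlay"), ("2018-08", "LeKoopa"),
         ("2018-08", "ZDF Satire")],
        names=["year_month", "Creator"],
    ),
)

# One row per month, one column per creator; missing combinations become 0.
wide = month.unstack("Creator", fill_value=0)
print(wide)
```

With matplotlib installed, `wide.plot()` should then draw one line per creator with the months on the x axis.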
gloomy saddle
#

biggest difference in X coordinate

mild dirge
#

Yeah ig, though a run consists of 1000 episodes, so I'm just picking the ones with the highest sum of winrates

past meteor
#

Are you getting the same policy across runs?

#

Or are they vastly different

mild dirge
#

Hard to judge, I'm not going to watch 1000 videos to see how the run progresses, the neural network also has a few tens of thousands of parameters, so can't interpret that

#

But my question is answered 🙂

gloomy saddle
#

This is why I suggest the distance in coordinates, it removes more of the random components and you can still sum them 🙂

#

e.g. if your paddle is 1/5 the width of the bottom, your agents could win 20% of the time by luck, closer to 40% if your position and velocity of the ball is not handled too well

past meteor
#

You don't need to watch 1000 videos. You can evaluate the performance of the policy

#

I assume you have some sparse reward (win/loss)? You can just take the average winrate of the last N episodes. I suspect that's what you planned on doing anyway
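
To make the cherry-picking concern concrete, a small sketch with simulated win/loss data (the run count, episode count and win probability are all made up): the winrate over the last N episodes is computed per run, then summarised over all runs and over the best 5 only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical results: 20 runs x 1000 episodes, True = caught the ball.
wins = rng.random((20, 1000)) < 0.6

last_n = 100
final_winrate = wins[:, -last_n:].mean(axis=1)  # one number per run

mean_all = final_winrate.mean()
std_all = final_winrate.std(ddof=1)
mean_best5 = np.sort(final_winrate)[-5:].mean()

print(f"all runs: {mean_all:.3f} +/- {std_all:.3f}")
print(f"best 5:   {mean_best5:.3f}")
```

The best-5 mean can never be below the mean over all runs, so reporting only the top runs overstates performance; mean and spread over all runs is the less biased summary.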

wheat snow
#

btw what is that group key thing doing, someone suggested it but i have no idea what it does

gloomy saddle
#

it takes all the input and groups it into chunks that share those properties

cold osprey
#

not sure what group_keys=False does

#

but i think what u want can be done with pivoting or unpivoting the dataframe (idk which is which)

past meteor
#

Reminds me I really need to get back to reinforcement learning. I read Sutton & Barto + implemented everything from scratch except a few model-based algos. It's something I don't actively use at work for now (even though there's opportunities) so I'm getting a bit rusty.

#

At some point I'll redo part of the algos in C, Nim, Rust, ... "for fun" and so I don't forget all of it

sinful kelp
#

how do you guys normally do hyperparameter optimization? do you do grid search or go for Bayesian optimization methods?

past meteor
#

Random search is good if you can run it in parallel. Bayes opt is a sequential algorithm so that's what I use if it's something super expensive that I can't run in parallel (e.g., neural networks)

#

The issue with grid search is that it spends a lot of time iterating over potentially useless parameter settings, random/bayes opt is a lot less sensitive to this
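
A minimal random search sketch using only the standard library. The `objective` function is a toy stand-in for a real validation loss, and the parameter names and ranges are assumptions chosen for illustration.

```python
import random

def objective(lr, depth):
    # Toy stand-in for a validation loss; lower is better.
    return (lr - 0.1) ** 2 + 0.01 * abs(depth - 6)

random.seed(0)
trials = []
for _ in range(50):
    params = {
        "lr": 10 ** random.uniform(-4, 0),  # sample on a log scale
        "depth": random.randint(2, 12),
    }
    trials.append((objective(**params), params))

# Every trial is independent of the others, so they could run in parallel.
best_score, best_params = min(trials, key=lambda t: t[0])
print(best_score, best_params)
```

A grid over the same ranges would spend most of its budget on combinations far from the optimum, whereas each random trial tries a new value of every parameter.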

sinful kelp
#

do you ever have issues with convergence of the Bayesian optimizer?

jolly dock
#

wdym

sinful kelp
#

I haven't, but I will try that. I was more asking in general what strategies people tend to use when optimizing their hyperparameters.

#

I have seen people before who were more in favor of random search over Bayesian optimization, and I was curious why

past meteor
#

Yeah, the reason is that they can run random in parallel indeed 🙂

sinful kelp
#

Thanks for clearing that up 🙂

wooden sail
#

nothing stops you from doing random bayes btw, with different starting points for the params

#

it's a lot more computationally expensive though

wooden sail
#

yeah

past meteor
#

I like this a lot - do any "mature" packages implement this?

#

I can handroll it but I rather not if I'm using it in any "serious" project

wooden sail
#

i wouldn't know, i'm only doing armchair AI right now 😛

cold osprey
#

basement AI

past meteor
#

I'll put it on the long list of stuff I need to do

#

Tbf starting a bunch of Python processes that each run bayes opt is the same 🤷‍♂️

mild dirge
#

I'm running my reinforcement learning model twice to get twice as fast results rn 😛

wooden sail
#

it sounds reasonable enough that i would almost expect any hyperparam library to have this as an option

past meteor
#

RL would really be my favourite domain to do fundamental research in. Specifically in this: https://arxiv.org/abs/2005.01643

wooden sail
#

a word of caution in case you weren't aware: arxiv is not peer reviewed, it's just a repository

past meteor
#

(This got published in NeurIPS as well though but this one is free)

wooden sail
#

uc berkeley and google add some weight to it, but always check whether what you find on arxiv has a published, peer reviewed version

#

aha, there we go then

past meteor
#

(A big part of my thesis was fixing a rubbish Arxiv paper that would never get published in a peer reviewed journal but had a few good ideas)

wooden sail
#

you seem pretty savvy in AI stuffs

past meteor
#

I did a masters in AI at a top ~50 uni + work as an applied AI researcher. I don't know a lot about NLP for example except surface level stuff from coursework.

#

I know about the stuff that my profs were interested in 🤣

hasty mountain
# past meteor Reminds me I really need to get back to reinforcement learning. I read Sutton & ...

Hey! Can you tell me some tricks to avoid suboptimal policies? Or specification gaming?

I'm having the problem that I'm trying to train a model in a game with PPO...problem is, the environment is a bit slow to provide a feedback. And even with a reward model working as a reward function to provide continuous rewards, it seems that my model is prone to getting stuck at certain commands...

#

(I don't know how much impatience is also a factor...since I never actually let my model train for more than 5,000 steps, and the optimization is done after every 10 steps)

past meteor
#

My focus was mostly on implementing algos, understanding their properties, making environments, ... It's a shame but I don't have a lot of finesse with actually using it for real things 😆

mild dirge
#

I'm still doing a course on deep reinforcement learning so not exactly an expert either, my next assignment is to use policy optimization instead of value based learning.

tidal bough
#

i want to one day try using RL for codingame bot programming tasks, but that'd need a lot of work. I'd need to write a one-file from-scratch implementation of the model in question, train that model locally (after implementing a full simulator of the environment in question), and deploy it by embedding the trained parameters in the file

#

I think I've heard of people actually doing it, though, so it's probably a powerful method if you want to go through the hassle

tidal bough
tidal bough
#

The particular one I linked is interesting because it's a real-time game with physics (so the state space is continuous and the action space is pretty big too) and it's inherently multiagent - each player controls two bots, and they ideally need to coordinate with each other to win, and predict the opponent's actions.

severe topaz
wheat snow
#

nope

#

but i already solved the issue

#

im currently trying to flawlessly get some information from youtube api

hasty mountain
rancid dove
#

Is it possible to change the order of the fields of a numpy void dtype without copying the data?

tidal bough
#

hmm, what would it even mean for the fields to be in a different order logically but not in memory?

tidal bough
arctic wedgeBOT
#

@tidal bough :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [(0, 0.0e+00) (0, 2.8e-45) (3, 0.0e+00) (0, 7.0e-45) (6, 0.0e+00)
002 |  (0, 1.1e-44)]
003 | [(0.0e+00, 0) (2.8e-45, 0) (0.0e+00, 3) (7.0e-45, 0) (0.0e+00, 6)
004 |  (1.1e-44, 0)]
tidal bough
#

note how the offsets for dt2 are [4,0], so the logically-first field is the second in memory.
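
a minimal sketch of the same idea, in case it helps. the field names `a` and `b` and the dtypes `dt1`/`dt2` below are my own illustrative definitions (assumed, not necessarily the exact ones from the eval above): both dtypes describe the same 8 bytes per record, only the logical field order differs, so `view` can reinterpret the buffer without copying.

```python
import numpy as np

# Hypothetical layout: int32 field "a" at byte 0, float32 field "b" at byte 4.
dt1 = np.dtype({"names": ["a", "b"], "formats": ["<i4", "<f4"], "offsets": [0, 4]})

# Same bytes, swapped logical order: "b" is listed first but still lives
# at byte offset 4, and "a" still lives at byte offset 0.
dt2 = np.dtype({"names": ["b", "a"], "formats": ["<f4", "<i4"], "offsets": [4, 0]})

arr = np.zeros(3, dtype=dt1)
arr["a"] = [1, 2, 3]
arr["b"] = [0.5, 1.5, 2.5]

# Reinterpret the same buffer with the reordered dtype; no data is copied.
reordered = arr.view(dt2)

print(reordered.dtype.names)             # logical field order of the view
print(np.shares_memory(arr, reordered))  # whether both arrays use one buffer
```

since the view shares memory with `arr`, writing through `reordered["a"]` also changes `arr["a"]`.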

severe topaz
severe topaz
#

oh im out of it its a sql pandas thing

night prawn
cold osprey
#

wsl was a nightmare for me to set up so i just went with pytorch

#

didnt know about tensorflow-directML

wooden sail
#

i don't know what tensorflow-directml is, i do use wsl though and it's just dandy

#

i use jax gpu on wsl2

#

if you find it too cumbersome, consider pytorch indeed (unless you already have your code written)

past meteor
#

WSL is pretty convenient to set up at least it was for me

wooden sail
#

wsl has treated me mostly well as well

cold osprey
#

nice

#

wsl didnt treat me nice Sadge

night prawn
#

So how do I install wsl?

wooden sail
#

what issues did you have with it?

cold osprey
#

lemme find my rant messages

wooden sail
cold osprey
#

im on win10

night prawn
#

Ok thank you

cold osprey
#

oh i rmb now

#

it wasnt wsl specifically but mlflow wasnt working properly with wsl

#

something about file permissions iirc

#

couldnt save an image of the model architecture as an artifact in mlflow

wooden sail
#

that doesn't tell me much

cold osprey
#

yeah welp im over it now

wooden sail
#

but you probably tried to modify the windows filesystem from inside wsl or vice versa, which you shouldn't do

cold osprey
#

pytorch all the way

wooden sail
#

never do that 😛
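
a rough way to check which side a path lives on, assuming WSL's default automount where Windows drives show up under `/mnt/<drive letter>`. `is_on_windows_mount` is a made-up helper name for illustration, not part of WSL or mlflow, and the example paths are hypothetical:

```python
from pathlib import PurePosixPath


def is_on_windows_mount(path: str) -> bool:
    """Return True if an absolute WSL path sits under /mnt/<drive letter>.

    Assumes the default automount root (/mnt); a custom root configured
    in /etc/wsl.conf would not be detected by this check.
    """
    parts = PurePosixPath(path).parts
    return (
        len(parts) >= 3
        and parts[0] == "/"
        and parts[1] == "mnt"
        and len(parts[2]) == 1
        and parts[2].isalpha()
    )


# Hypothetical paths, just to show the two cases.
print(is_on_windows_mount("/mnt/c/Users/me/mlruns"))
print(is_on_windows_mount("/home/me/mlruns"))
```

if a tracking or artifact directory comes back as being on the Windows mount, keeping it under the Linux home directory instead is one thing to try.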