#data-science-and-ml | Python | Page 61

queen cradle May 7, 2023, 2:48 PM

#

See, there's an error right there at the end. "9413".

lapis sequoia May 7, 2023, 2:48 PM

#

Ik.

queen cradle May 7, 2023, 2:48 PM

#

Yeah. Because ChatGPT doesn't know what it's doing.

narrow crane May 7, 2023, 2:48 PM

#

Could anyone help me out with something. I'm practicing and trying to learn webscraping atm, am trying to figure out how to save what I scrape to a dataset and organize it.

queen cradle May 7, 2023, 2:49 PM

#

narrow crane Could anyone help me out with something. I'm practicing and trying to learn webs...

!rule 5 Webscraping usually violates ToS.

arctic wedgeBOT May 7, 2023, 2:49 PM

#

Rules

5. Do not provide or request help on projects that may violate terms of service, or that may be deemed inappropriate, malicious, or illegal.

lapis sequoia May 7, 2023, 2:49 PM

#

narrow crane Could anyone help me out with something. I'm practicing and trying to learn webs...

Use chatgpt

#

yert

narrow crane May 7, 2023, 2:49 PM

#

lapis sequoia Use chatgpt

Well I can't do that because i'm trying to learn a new skill

#

oh you meant like ask chatgpt?

lapis sequoia May 7, 2023, 2:49 PM

#

Yes

narrow crane May 7, 2023, 2:50 PM

#

I did but it's kind of hard to track. it's not an exact replacement for direct human to human support in all instances.

lapis sequoia May 7, 2023, 2:50 PM

#

Like anything with numerical features

#

Numerical/Categorical

queen cradle May 7, 2023, 2:51 PM

#

Can you be more specific? Do you know about RNNs? Autoencoders?

lapis sequoia May 7, 2023, 2:54 PM

#

nope

#

Things like Decision trees, RFs, Knn, logistic Reg

queen cradle May 7, 2023, 2:58 PM

#

At the moment I'm not seeing any way for you to use the unlabeled data with those kinds of classifiers.

lapis sequoia May 7, 2023, 2:59 PM

#

How does this sound

#

Silly gpt is giving me steps on how to do that

#

And a good justification to write in report as well

#

It might make a fool of myself though

queen cradle May 7, 2023, 3:00 PM

#

You might be able to train something that tries to force the predicted classifications of the unlabeled data towards something definite. I.e., try to make it predict something but don't force it to predict something specific.

#

The loss function for such a thing is something like, "how close do your predicted class probabilities get to a basis vector". And there's a bunch of ways you could measure that.
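
A rough sketch of two ways such a "closeness to a basis vector" measure could be computed, assuming the model outputs one row of class probabilities per unlabeled sample. The function names and the example arrays are made up for illustration; neither measure is something the thread settles on.

```python
import numpy as np


def mean_entropy(probs, eps=1e-12):
    """Average Shannon entropy per row; it is ~0 when every row is one-hot."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(probs * np.log(probs)).sum(axis=1).mean())


def distance_to_nearest_basis_vector(probs):
    """Average Euclidean distance from each row to its closest one-hot vector."""
    nearest = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    return float(np.linalg.norm(probs - nearest, axis=1).mean())


# Hypothetical predictions on unlabeled data: one confident model, one undecided.
confident = np.array([[0.98, 0.01, 0.01], [0.02, 0.96, 0.02]])
undecided = np.array([[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]])

print(mean_entropy(confident), mean_entropy(undecided))
print(distance_to_nearest_basis_vector(confident),
      distance_to_nearest_basis_vector(undecided))
```

Both numbers shrink as predictions become more definite, without saying which class each sample should land in.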

lapis sequoia May 7, 2023, 3:02 PM

#

It said to use the data and put class weights of the unknown class to 0. So that it's used in augmentation but not used to predict anything

#

lapis sequoia May 7, 2023, 3:03 PM

#

queen cradle The loss function for such a thing is something like, "how close do your predict...

hmmmmm

queen cradle May 7, 2023, 3:05 PM

#

Do you know what class_weight does?

lapis sequoia May 7, 2023, 3:05 PM

#

not exactly. but ik it's used to solve class imbalance issue

#

queen cradle May 7, 2023, 3:07 PM

#

It tells the fit function the relative importance of the different classes. So, for example, if the weight of class 0 is 0.25 and the weight of class 1 is 0.75, then errors in class 1 are weighted three times more than errors in class 0 in the loss function.

lapis sequoia May 7, 2023, 3:08 PM

#

OO

queen cradle May 7, 2023, 3:08 PM

#

If you give something a weight of zero, then it doesn't contribute to the loss function at all.
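
A small hand-rolled illustration of the weighting described above. This is not scikit-learn's internal code, only a weighted log loss written out with NumPy; the helper name and the numbers are assumptions for the example.

```python
import numpy as np


def weighted_log_loss(y_true, p_class1, class_weight):
    """Sum of per-sample log losses, each scaled by the weight of its true class."""
    y_true = np.asarray(y_true)
    p_class1 = np.asarray(p_class1, dtype=float)
    # Probability the model gave to the class that was actually correct.
    p_correct = np.where(y_true == 1, p_class1, 1.0 - p_class1)
    weights = np.array([class_weight[int(label)] for label in y_true])
    return float(np.sum(weights * -np.log(p_correct)))


weights = {0: 0.25, 1: 0.75}

# Two equally bad predictions (0.2 on the correct class), one per class.
loss_class0 = weighted_log_loss([0], [0.8], weights)
loss_class1 = weighted_log_loss([1], [0.2], weights)
print(loss_class1 / loss_class0)  # the class-1 mistake counts three times as much

# With a weight of zero, class-1 samples drop out of the loss entirely.
zeroed = {0: 1.0, 1: 0.0}
with_extra = weighted_log_loss([0, 1], [0.8, 0.2], zeroed)
without_extra = weighted_log_loss([0], [0.8], zeroed)
print(with_extra == without_extra)
```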

lapis sequoia May 7, 2023, 3:08 PM

#

Yep

#

That's good then

queen cradle May 7, 2023, 3:09 PM

#

It means that it has no meaningful effect on training.

lapis sequoia May 7, 2023, 3:09 PM

#

It might still affect the training though

#

oh yes

#

That's cool

#

problem solved then

queen cradle May 7, 2023, 3:09 PM

#

Sure, there may be algorithms where the presence of that extra data affects training. But because it doesn't affect your loss function, it also can't make your results better.

#

Look, you seem quite enamored with ChatGPT. I'm inclined to say that you should ask your instructor about this. You don't have to tell him you were consulting ChatGPT if you think he'll respond poorly. Just say that you read about this idea on the Internet.

lapis sequoia May 7, 2023, 3:11 PM

#

Thanks mate

#

Mr. hoffman

#

Do you know about Albert hoffman

queen cradle May 7, 2023, 3:12 PM

#

No.

#

There's lots of Hoffmans and Hofmanns and Hoffmanns, etc.

#

We're mostly not related.

lapis sequoia May 7, 2023, 3:13 PM

#

Well

#

We are actually All related

#

Very low chances that you n me are not related

#

That's why I call you bro

queen cradle May 7, 2023, 3:14 PM

#

Okay, in that sense, we're all related.

lapis sequoia May 7, 2023, 3:14 PM

#

Because that's what you literally are

queen cradle May 7, 2023, 3:14 PM

#

I do like that sense, it's just not where I thought you were going.

lapis sequoia May 7, 2023, 7:04 PM

#

What is this behaviour

cold osprey May 7, 2023, 7:23 PM

#

Weird axes

bleak zealot May 7, 2023, 11:58 PM

#

Hey guys, so i got some problems when using KNeighborsClassifier, my signals look like this on the graph?

Maybe im crazy? But i wanted the signals to stay on the graph rather than on top/bottom like this? Could anyone point me in a direction where i can read/get help to change that in my code? (the code is running and working etc) the problem is my graphic overlay, which i wanna change?

cold osprey May 8, 2023, 12:08 AM

#

what is that plot?

umbral olive May 8, 2023, 1:00 AM

#

may i know whats the best library for fuzzy logic application, eg. display graph etc

bleak zealot May 8, 2023, 1:06 AM

#

cold osprey what is that plot?

mpf.plot but i found the error, now just having another error claiming my data for buy and sell signals isnt the same length -.- kinda lost, probably just gonna go to bed and look at it tomorrow

dusty bay May 8, 2023, 2:03 AM

#

I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI Code

class Myapp():
    
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.frame_2, text="Load JSON Script", font=("Ubuntu", 12), command=self.open_file)
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))
app = Myapp()
app.root.mainloop()

And here is the viewer code

import pandas as pd
import matplotlib.pyplot as plt


class csv2df():
    
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])
        
    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        
        plt.show()
        
data = csv2df()
data.plot()

I want to display the graph by clicking the "Single Viewer" button. Can you please fix it as I need this for a project.
Thank You.
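
One possible way to wire this up, sketched under assumptions: the plot is built as a Matplotlib `Figure` and embedded in a new customtkinter window when the button is clicked. The names `build_rms_figure`, `show_viewer`, and `main` are made up for this example, and the button is attached straight to the root window because `frame_2` is not defined in the posted code.

```python
import pandas as pd
from matplotlib.figure import Figure


def build_rms_figure(df):
    """Return a Matplotlib Figure of RMS level against frequency."""
    fig = Figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    ax.plot(df["Hz"], df["dBSPL"])
    ax.set_xlabel("Frequency (Hz)")
    ax.set_ylabel("RMS Level (dBSPL)")
    return fig


def show_viewer(parent, csv_path="RMS level.csv"):
    """Button callback: open a new window and draw the plot inside it."""
    # GUI imports stay local so the plotting code can be imported without a display.
    import customtkinter
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

    df = pd.read_csv(csv_path, skiprows=[0, 1, 2])
    window = customtkinter.CTkToplevel(parent)
    window.title("Single Viewer")
    canvas = FigureCanvasTkAgg(build_rms_figure(df), master=window)
    canvas.draw()
    canvas.get_tk_widget().pack(fill="both", expand=True)


def main():
    """Start the GUI; call this from the script that launches the app."""
    import customtkinter

    root = customtkinter.CTk()
    root.geometry("1050x600")
    root.title("APx Platform")
    button = customtkinter.CTkButton(
        root, text="Single Viewer", command=lambda: show_viewer(root)
    )
    button.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))
    root.mainloop()
```

Calling `main()` starts the window. The point of the split is that nothing is plotted at import time, which the posted viewer.py does through its module-level `data.plot()` call.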

plucky bolt May 8, 2023, 2:22 AM

#

Any of you use plotly for dashboarding?

dusty bay May 8, 2023, 2:41 AM

#

plucky bolt Any of you use plotly for dashboarding?

sorry, what u mean?

cold osprey May 8, 2023, 2:41 AM

#

plucky bolt Any of you use plotly for dashboarding?

used to use it

plucky bolt May 8, 2023, 2:47 AM

#

dusty bay sorry, what u mean?

https://plotly.com/dash/

Dash Overview

Dash is a framework for building data apps in Python. Dash Enterprise simplifies the development and deployment process in a secure, scalable environment.

plucky bolt May 8, 2023, 2:48 AM

#

cold osprey used to use it

Switched to something else better?

cold osprey May 8, 2023, 2:48 AM

#

plucky bolt Switched to something else better?

work still preferred power bi, due to our clients mainly being microsoft integrated

#

used dash plotly on an internal project as like a POC

#

they liked it but didnt want to follow through at that time to use it for other stuff

plucky bolt May 8, 2023, 2:51 AM

#

Ah! I was thinking about using a dashboarding thing to display data on a website

cold osprey May 8, 2023, 3:01 AM

#

yeah u can use plotly dash

#

directly build as web app from that

plucky bolt May 8, 2023, 3:32 AM

#

I've used it before and it seemed okay. The thing I didn't like was that it looked very.... old!? I mean, it's not flashy at all! Like a basic white web page with drop down menus and sliders with some plots thrown in.

cold osprey May 8, 2023, 3:43 AM

#

u can customize it

#

with css

#

its built on flask

cloud marsh May 8, 2023, 4:44 AM

#

are there any tools for managing the individual dependency sets required for notebooks? or do you just create notebook directories with distinct requirements.txt (or whatever) while using isolated virtualenv's to load jupyterlab to work on that task?

cold osprey May 8, 2023, 4:47 AM

#

yes i use pyenv

thorn swift May 8, 2023, 5:00 AM

#

cloud marsh are there any tools for managing the individual dependency sets required for not...

if you can learn to use docker containers its a useful skill, i use containers for everything

cloud marsh May 8, 2023, 5:04 AM

#

thorn swift if you can learn to use docker containers its a useful skill, i use containers f...

my skills with docker aren't really where they should be. i use them for some things, but i tend to dabble in a lot and it's more time consuming to set up volumes & bind mounts. it works well for some things. i have like 6 or 7 containers, just for ML. the one for pytorch takes like 35 GB somehow and all together, they require 100+ GB.

also, i'm trying to use signatory, which requires pytorch & opencl to speedup things. but i would like to use TF where possible. building a container that has both has been a real blocker for me. i don't imagine doing that very often, but maintaining containers that have both would be a real pain, over time.

#

i'm using pyenv as well, along with venv and direnv. that works well, but creating these isolated dependency sets in multiple directories is tough and will eventually eat up a lot of disk. i only have 2TB nvme and very little else that's not tied up in my homelab somewhere.

thorn swift May 8, 2023, 5:11 AM

#

cloud marsh my skills with docker aren't really where they should be. i use them for some th...

i keep a text file for docker commands and use shell scripts for resetting up environments, not sure how many containers youd need up at once, since i document setup for more complicated project envs i usually feel comfortable deleting containers

#

i havent typed out a docker command in months

#

ctrl c ctrl v all day

cloud marsh May 8, 2023, 5:12 AM

#

i use docker.el in emacs, so i have at least some of it on easy mode (for some definition of easy lol)

#

i try to make notes where possible. i have a lot of experience in other languages and i'm trying to future-proof however i decide to handle dependencies for multiple projects.

#

thanks for the feedback

thorn swift May 8, 2023, 5:15 AM

#

also, ditch pytorch (i am a tensorflow enthusiast)

cold osprey May 8, 2023, 5:18 AM

#

pytorch > tensorflow

#

i ditched tensorflow

serene scaffold May 8, 2023, 5:19 AM

#

Just use JAX

thorn swift May 8, 2023, 5:24 AM

#

cloud marsh May 8, 2023, 5:24 AM

#

cold osprey pytorch > tensorflow

for me the appeal to tensorflow is the lower level tensors themselves, not keras. what's the equivalent to how TF handles tensors in pytorch?

thorn swift May 8, 2023, 5:25 AM

#

basically the same i just dont want to write my own training loops

bleak crown May 8, 2023, 5:30 AM

#

I just picked tensorflow cause I needed tensorflow.js support for my first AI project, and haven't switched. Honestly should probably try pytorch sometime tho

topaz gate May 8, 2023, 5:35 AM

#

Can someone help me? I am building a machine learning program with logistic regression, and the model is not working

cloud marsh May 8, 2023, 5:38 AM

#

what are the features?

topaz gate May 8, 2023, 5:40 AM

#

X = data[['Employment', 'YearsCodePro', 'EdLevel']]
y = data['CompTotal']

I did like this

cloud marsh May 8, 2023, 5:41 AM

#

what does comp total represent? how did you source the data set?

#

what kind of cost function did you set up?

topaz gate May 8, 2023, 5:42 AM

#

CompTotal represents the monthly salary

cloud marsh May 8, 2023, 5:42 AM

#

also, what kind of problems do you think you're having? are there runtime errors? or statistical errors?

topaz gate May 8, 2023, 5:43 AM

#

I got the data set from Stack Overflow and filtered it by country (in this case, Brazil)

cloud marsh May 8, 2023, 5:43 AM

#

what kind of regression is it? linear? there are max salary caps, so you will see less correlation in some of the higher salary numbers than perhaps the lower numbers.

#

maybe it's different for brazil

topaz gate May 8, 2023, 5:44 AM

#

logistic regression

#

I am having value errors

cloud marsh May 8, 2023, 5:45 AM

#

have you looked at the mean/variance of the features? are you using a framework or just libraries like numpy?

topaz gate May 8, 2023, 5:46 AM

#

When I define my values (which are categorical), the jupyter notebook says that the categories from the column are unknown

cloud marsh May 8, 2023, 5:48 AM

#

are you trying to predict the salary, given the features?

topaz gate May 8, 2023, 5:48 AM

#

yes

cloud marsh May 8, 2023, 5:48 AM

#

logistic regression is typically used to produce binary predictions

#

is your data in dataframes? like with pandas?

topaz gate May 8, 2023, 5:50 AM

#

yes, it is

cloud marsh May 8, 2023, 5:51 AM

#

have you tried using unique() to see whether the dataframe will give you the distinct values in each column?

#

also, are you using a framework like tf/keras or pytorch?

topaz gate May 8, 2023, 5:52 AM

#

I am just using jupyter and anaconda

#

yes, I used unique, but even if I put the distinct values in the code, the code doesn't work

cloud marsh May 8, 2023, 5:54 AM

#

that describes the workflow and the python environment. i mean what libraries are you using to help with machine learning or linear algebra?

topaz gate May 8, 2023, 5:55 AM

#

pandas,numpy,seaborn and scikit learn

cloud marsh May 8, 2023, 5:55 AM

#

ok. if you're using logistic regression, the simplest way to fit that method to the task is to place an inequality on the predicted column. this converts it into a binary feature.

#

or, rather, a binary classification problem

topaz gate May 8, 2023, 5:56 AM

#

ok

#

but the problem is

#

I don't know how to do this

cloud marsh May 8, 2023, 5:58 AM

#

instead of your algorithm answering the question:

how do features in X predict data['CompTotal']

it will answer questions like what features in X predict data['CompTotal'] > 35,000

#

you can change the value on the right and retrain multiple versions of the algorithm.
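
A minimal sketch of that idea with pandas and scikit-learn, assuming made-up rows in place of the real survey data and a placeholder `THRESHOLD` (the actual cutoff is up to you). One-hot encoding the categorical columns with `handle_unknown="ignore"` is one way to avoid errors about unknown categories, though it may not be the cause of the error reported here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the filtered survey data.
data = pd.DataFrame({
    "Employment": ["Full-time", "Part-time", "Full-time",
                   "Freelance", "Full-time", "Part-time"],
    "YearsCodePro": [1, 2, 8, 5, 12, 1],
    "EdLevel": ["Bachelor", "Secondary", "Master",
                "Bachelor", "Master", "Secondary"],
    "CompTotal": [1200, 900, 9000, 4000, 15000, 1000],
})

THRESHOLD = 1320  # hypothetical cutoff; replace with the salary you care about

X = data[["Employment", "YearsCodePro", "EdLevel"]]
# The inequality turns the salary column into a 0/1 target.
y = (data["CompTotal"] > THRESHOLD).astype(int)

# Text columns are one-hot encoded; the numeric column passes through unchanged.
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"),
      ["Employment", "EdLevel"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X, y)

# A category never seen in training is encoded as all zeros instead of failing.
new_person = pd.DataFrame(
    {"Employment": ["Contractor"], "YearsCodePro": [6], "EdLevel": ["Bachelor"]}
)
prob_above = model.predict_proba(new_person)[0, 1]
print(prob_above)
```

Changing `THRESHOLD` and refitting gives a different classifier for each cutoff.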

#

i think... i'm not an expert though.

topaz gate May 8, 2023, 5:59 AM

#

I understand

cloud marsh May 8, 2023, 6:00 AM

#

scikit learn may assert that there are only two values in the data['CompTotal'] column (or the prediction column). this may be what the error message is about.

topaz gate May 8, 2023, 6:00 AM

#

and my goal in the model is to know if the salary is higher than the minimum salary here in Brazil

cloud marsh May 8, 2023, 6:00 AM

#

i see.

#

then the goal here is more about the statistical assumptions

topaz gate May 8, 2023, 6:01 AM

#

yes

#

can we speak privately?

cloud marsh May 8, 2023, 6:04 AM

#

i may be able to help later, i have to get back to work though.

#

try to play around with the pandas dataframe and create new columns

topaz gate May 8, 2023, 6:05 AM

#

okay

#

I have to give this project in 3 hours hahaha so I'm kind of damned

#

but thanks for the help

cloud marsh May 8, 2023, 6:09 AM

#

if it's giving you an error, follow the stack trace. it might help to clone the scikit learn project and try to find the line with that string. you might not have enough time though. generally, the source code is the best documentation, but managing lots of source repositories can be a lot of work.

cold osprey May 8, 2023, 6:20 AM

#

thorn swift basically the same i just dont want to write my own training loops

i mean, u just copy paste it, or import from a helper library

#

maybe some tweaks depending on the model, loss func and metric ure using

past meteor May 8, 2023, 6:40 AM

#

Discussion has been overdone but personally I still prefer Jax if I'm doing say reinforcement learning and then TF/Keras, followed by MXNet

#

I've been spending time with Pytorch recently to wean myself off of TF because that seems to be the direction where everything is going. All in all they're the same but some small things are missing or need to be done differently. Just need more time with it I guess 🤷‍♂️

north adder May 8, 2023, 8:06 AM

#

Hello everyone

#

im a student majoring in mathematics and computer science who is interested in going into Data science/ML. i still have a year until i graduate and i am planning to take a course in each of them, but since summer vacation is coming i want to start working now (and maybe be good enough to land an internship in the fall semester?). I have basic knowledge of Python (took a course before), im reading Automate the Boring Stuff with Python, and im planning on reading Beyond the Basic Stuff with Python (some people said its not necessary but i figure why not expand our knowledge in this language). For Data science/ML what do you suggest i do? Im currently watching the Andrew Ng 2018 course given at Stanford but im thinking of enrolling in his Machine Learning Specialization on Coursera. I know i can learn it without a course but i would like to get a certificate so that i can put it on my CV. What do you guys think/suggest? does taking this course also make me ready for data science? Thanks in advance

#

sorry for the long paragraph lol

#

and yeah i have knowledge in MySql and databases

cold osprey May 8, 2023, 8:35 AM

#

if u have decent understanding of the maths behind common models and methods, then id suggest just diving into using pandas, sklearn, tf/pytorch, etc

young granite May 8, 2023, 12:49 PM

#

is there a "real" multioutputregression model and not just the approach to fit each model x-times for x-targets?

wooden sail May 8, 2023, 12:52 PM

#

yes

#

or well, what do you mean?

#

in a vector-valued function, you can in general treat the output as a vector of functions, each one "independent" to each other (not in the statistical sense, i just mean you can always write it this way, with each entry being a separate function)

#

each output value depends in general on all the inputs. the relationship between the outputs is a separate matter. you can interpret this as each entry in the output vector being a separate estimator of its own/a separate regressor

bleak zealot May 8, 2023, 1:04 PM

#

So i have a little problem with my code,

My signals wont come up on my graph, and when trying to troubleshoot it, i found out my Signal is in nan value, (NaN) rather than in inf.

"# Add predicted signals to a copy of the dataframe
df_copy = df.copy()
df_copy['signal'] = np.nan
df_copy['signal'] = knn.predict(df[['closing-price', 'daily-return']])"

When changing this to np.inf it dont change my signal to inf value but stay NaN?

Im so lost?

#

Both closing and daily return when printed comes out as inf but when i plot in my signal it becomes NaN?

tidal bough May 8, 2023, 1:11 PM

#

What do you mean becomes nan? How are you distinguishing inf and nan on a plot?

#

Also, I'd expect it to not matter in the slightest what you set signal to since you override it with predict's return immediately after.

bleak zealot May 8, 2023, 1:13 PM

#

Depending on what i print, when i print my signal i get this

"2021-03-17 NaN"

When i print the closing and daily return it stands in inf value like this

"2021-03-17 123.276093"

So as far as i understand from different pages i searched (and even chatgpt) its because my value isnt the same?

#

So my signal wont come on my graph

tidal bough May 8, 2023, 1:14 PM

#

It sounds to me that knn.predict(df[['closing-price', 'daily-return']]) returns a nan for that row, then

#

Perhaps one of these two columns has a nan on that row, or something's really wrong with the knn.
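
A quick way to check both points, sketched with made-up data. `fake_predictions` stands in for whatever `knn.predict(...)` returns, and the column values are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented price data with one missing closing price.
df = pd.DataFrame(
    {"closing-price": [123.28, 124.10, np.nan],
     "daily-return": [0.010, 0.007, -0.002]},
    index=pd.to_datetime(["2021-03-17", "2021-03-18", "2021-03-19"]),
)
features = ["closing-price", "daily-return"]

# Rows where at least one feature is NaN are the first thing to inspect.
rows_with_nan = df[df[features].isna().any(axis=1)]
print(rows_with_nan)

# Stand-in for knn.predict(df[features]): one value per row.
fake_predictions = np.array([1, -1, 1])

df_copy = df.copy()
df_copy["signal"] = np.nan            # placeholder value...
df_copy["signal"] = fake_predictions  # ...which this line replaces completely
print(df_copy["signal"])
```

If `signal` still prints NaN after the second assignment, the NaN is coming from whatever is assigned on that line, not from the placeholder.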

bleak zealot May 8, 2023, 1:16 PM

#

tidal bough It sounds to me that `knn.predict(df[['closing-price', 'daily-return']])` return...

okay and why would it return a nan from that when all others are correct (i checked?)

queen cradle May 8, 2023, 1:18 PM

#

north adder im a student majoring in mathematics and computer science who is interested into...

The most important foundations for machine learning (as well as statistics) are linear algebra and calculus. If you haven't taken an advanced linear algebra course or a real analysis course, then you should study those. If you haven't taken a probability course, then you should study that. After that, I don't have any strong recommendations; there are a lot of courses out there (online and otherwise), and people say that some are better than others, but it seems to me that there's not much to distinguish them.

bleak zealot May 8, 2023, 1:18 PM

#

tidal bough It sounds to me that `knn.predict(df[['closing-price', 'daily-return']])` return...

I mean until that line, everything is a inf value (or numeric value) but after that line it becomes a NaN? how?

#

Oh i think i got it

#

its because i got it as accuracy value before

#

I think

north adder May 8, 2023, 1:21 PM

#

queen cradle The most important foundations for machine learning (as well as statistics) are ...

i dont think i will find any problem in the mathematical part and i have high grades in calculus, linear algebra and probability

queen cradle May 8, 2023, 1:22 PM

#

north adder i dont thik i will find any problem in the mathematical part and i have high gra...

In that case, like I said, I don't have any strong recommendations. Find a course that you like and it should be fine.

sleek harbor May 8, 2023, 1:22 PM

#

when pruning a decision tree do you try "all possible" values for alpha when cross validating, or do you only try those returned by cost_complexity_pruning_path (optimal values for a fully grown tree)?

bleak zealot May 8, 2023, 1:22 PM

#

tidal bough It sounds to me that `knn.predict(df[['closing-price', 'daily-return']])` return...

Thanks for putting me on the right track 🙂 I think i found the error now 🙂 further up i think i put the knn.predict as an accuracy value which would then fuck up that line.

wooden sail May 8, 2023, 1:34 PM

#

NotaNeighbor

bleak zealot May 8, 2023, 1:52 PM

#

tidal bough It sounds to me that `knn.predict(df[['closing-price', 'daily-return']])` return...

Thanks i found the problem and its working now. 🙂

young granite May 8, 2023, 2:23 PM

#

wooden sail in a vector-valued function, you can in general treat the output as a vector of ...

if i input 3 targets and use multioutputregressor it creates basically 3 fits, so for each target 1 fit, and i want to know if there are approaches (other than NN) that can do all 3 based on 1 set of fits

wooden sail May 8, 2023, 2:24 PM

#

sure

#

"neural network" is a very broad term

#

or fuzzy, should i say

#

the only difference between a neural network and any other function is that it has a ton of trainable parameters, but otherwise, each of its layers is generally just a function with multiple inputs and outputs

#

and in general, all of the inputs are used together to produce each output

#

all matrices do the same thing, for example. in a linear fashion.
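
To make the matrix remark concrete, here is a sketch with synthetic data (the sizes, seed, and noise level are arbitrary choices for the example). A single least-squares call estimates one matrix that maps 4 inputs to 3 outputs, and it is compared with fitting each target on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # 200 samples, N = 4 inputs
true_W = rng.normal(size=(4, 3))     # linear map to M = 3 outputs
Y = X @ true_W + 0.01 * rng.normal(size=(200, 3))

# One call estimates the whole 4x3 matrix: every output uses all inputs.
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The same problem solved one target at a time, then stacked column by column.
W_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

print(W_joint.shape)
print(np.allclose(W_joint, W_separate))
```

For plain least squares the joint fit and the per-target fits agree, which matches the point that a vector-valued function can always be written as a vector of separate functions. Models that share structure across outputs are where a single fit can differ from separate ones.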

young granite May 8, 2023, 2:26 PM

#

yeh its all linear algebra

wooden sail May 8, 2023, 2:26 PM

#

not all. but a lot

#

anyway yes, they exist and are commonly used. but what exactly to do depends on which problem you're looking at

young granite May 8, 2023, 2:28 PM

#

its just out of interest

wooden sail May 8, 2023, 2:28 PM

#

ok. then the answer is yes, and a simple example is the mean estimator

young granite May 8, 2023, 2:30 PM

#

but that is multiple inputs 1 output isnt it?

wooden sail May 8, 2023, 2:30 PM

#

you can find the mean of a vector

young granite May 8, 2023, 2:31 PM

#

cant i state something like:
there are only single target mathematical models which could be applied to generate a more dimensional model

#

i mean yeh its wrong in terms of math

wooden sail May 8, 2023, 2:31 PM

#

that sentence doesn't make any sense to me

#

i have no idea at all what it's trying to say

young granite May 8, 2023, 2:32 PM

#

multiple parameters are hard to get with classical math

#

f(x)=x

#

so really condensed down i mean multiple features and outputs are kinda hard to represent

wooden sail May 8, 2023, 2:34 PM

#

why?

#

and what do you mean by "represent"

#

all vector-valued, vector-parameter functions do what you're saying

#

(i think, i'm still not sure i got you right)

#

.latex one can arbitrarily define functions of the form \[
f: \mathbb{C}^N \to \mathbb{C}^M
\]

strange elbowBOT May 8, 2023, 2:36 PM

#

$latex.png$

young granite May 8, 2023, 2:39 PM

#

i mean more in terms of multiple variables

wooden sail May 8, 2023, 2:46 PM

#

that's equivalent

#

you can have a vector of N parameters

#

that's standard notation

#

this says we take N parameters and give out M outputs

#

C^N means N cartesian products, so N complex numbers are mapped to M complex numbers here

#

doesn't matter what N and M are

young granite May 8, 2023, 2:51 PM

#

thanks

boreal gale May 8, 2023, 2:52 PM

#

have you looked up what multi-objective optimisation is? sounds like that would be of interest to you.

young granite May 8, 2023, 2:53 PM

#

boreal gale have you looked up what multi-objective optimisation is? sounds like that would ...

i did not but certainly will thanks

silent pendant May 8, 2023, 5:03 PM

#

Can anyone here help me optimize this program I'm creating? I made a program that uses MediaPipe's hand tracking for real-time ASL interpretation, I got everything set up how I want but its just eating through my CPU like theres no tomorrow

#

Ive tested and commented out lines to see what exactly is causing the performance slog, and it seems to be the small Keras model im using to make the classification

#

The prediction is based off an input shape of (1, 20), but I don't really know how to make it faster or what alternatives are out there

pallid badge May 8, 2023, 5:13 PM

#

HI, what are your best sites to learn numpy, scipy, matplotlib?

#

I have a technical test and we are not allowed to do stackoverflow, google, not even an IDE

hasty mountain May 8, 2023, 5:31 PM

#

pallid badge HI, what are your best sites to learn numpy, scipy, matplotlib?

Matplotlib has some tutorials. Idk about the other 2.

young granite May 8, 2023, 5:31 PM

#

they do all got tuts

#

what kind of field u apply for

#

u can do the general stuff?

pallid badge May 8, 2023, 5:38 PM

#

Thank, you mean tutorials in the API?

#

Docu? Physics and imaging.

#

But how to become bad - ass in these packages and know the tricks? I lack use cases, I guess.

lapis sequoia May 8, 2023, 5:48 PM

#

DO i Just wait

#

Or going to google colab helps

pallid badge May 8, 2023, 5:48 PM

#

Is google colab better than a jupyterlab notebook?

lapis sequoia May 8, 2023, 5:48 PM

#

Naah

#

Jupyter best

#

Lol

errant lake May 8, 2023, 5:49 PM

#

Isn't Colab literally jupyterlab?

lapis sequoia May 8, 2023, 5:50 PM

#

Except it's not

errant lake May 8, 2023, 5:50 PM

#

Or just jupyter?

lapis sequoia May 8, 2023, 5:50 PM

#

yert

thorny drum May 8, 2023, 5:50 PM

#

errant lake Isn't Colab literally jupyterlab?

They're not the same

lapis sequoia May 8, 2023, 5:50 PM

#

errant lake Or just jupyter?

You r Saturn

thorny drum May 8, 2023, 5:50 PM

#

Colab is online vs Jupyter being on your local device

errant lake May 8, 2023, 5:50 PM

#

Ah ok first news to me thanks

#

I really assumed they were the same

lapis sequoia May 8, 2023, 5:50 PM

#

I am online

#

pithink

errant lake May 8, 2023, 5:51 PM

#

thorny drum Colab is online vs Jupyter being on your local device

Well hmm jupyter starts a server whatever device you use it on

pallid badge May 8, 2023, 5:52 PM

#

thorny drum Colab is online vs Jupyter being on your local device

There is also a jupyterhub version

thorny drum May 8, 2023, 5:52 PM

#

errant lake Well hmm jupyter starts a server whatever device you use it on

true

wooden sail May 8, 2023, 5:52 PM

#

colab is one particular server where you can run jupyter notebooks

#

one where google gives you free hardware

#

you can alternatively host your local jupyter server, which is how most people use it

pallid badge May 8, 2023, 5:53 PM

#

And does colab provide better functionality , e.g. widgets?

errant lake May 8, 2023, 5:53 PM

#

So in the end, Google Colab is just an implementation of jupyter?

#

Or are these two very close tools

lapis sequoia May 8, 2023, 5:56 PM

#

I've attempted to include dropouts, mess with hyperparameters, regularization and data preprocessing, but nothing is working

errant lake May 8, 2023, 5:56 PM

#

Anyway sorry for chiming in. Thanks for the infos

thorny drum May 8, 2023, 5:56 PM

#

You're good Clem

lapis sequoia May 8, 2023, 6:02 PM

#

You're good Clem

wooden sail May 8, 2023, 6:14 PM

#

errant lake So in the end, Google Colab is just an implementation of jupyter?

not an implementation, just a particular host for jupyter notebooks

#

you can set up your jupyter server on one device and connect to it from a different one

#

google just set one up with very nice hardware, for everyone to use

errant lake May 8, 2023, 6:15 PM

#

Yeah! That's what I thought originally, np thanks for clarifying

serene scaffold May 8, 2023, 6:22 PM

#

wooden sail not an implementation, just a particular host for jupyter notebooks

do we know if colab uses jupyter under the hood? my understanding is that "notebooks" are a general thing, and jupyter notebooks are a flavor of them.

wooden sail May 8, 2023, 6:23 PM

#

that's a good point, i'm not sure if it's a jupyter one tbh

#

https://research.google.com/colaboratory/local-runtimes.html

#

it says jupyter

errant lake May 8, 2023, 6:24 PM

#

I think it is jupyter under the hood yes - probably heavily rewritten by Google haha

wooden sail May 8, 2023, 6:25 PM

#

yeah some other links say "based on the jupyter open source", but those links aren't by google

serene scaffold May 8, 2023, 6:25 PM

#

heavily rewritten. if you replace the head and the handle of a hammer ten times, is it the same hammer?

wooden sail May 8, 2023, 6:25 PM

#

the ~~ship of theseus~~ notebook of google

errant lake May 8, 2023, 6:25 PM

#

Oh yeah I still consider it the same thing. It's just probably adapted to Google's infrastructure now

pallid badge May 8, 2023, 6:26 PM

#

Reminds me a bit of Python ducktyping

#

It quacks and walks like a duck, it is a duck

granite falcon May 8, 2023, 7:42 PM

#

need some help in data science project new to python and data science.

strong granite May 8, 2023, 8:00 PM

#

Hey I want to get into AI/ML, please suggest some resources and courses

severe topaz May 8, 2023, 8:17 PM

#

Doing a project in spare time to get better w/ python -- involves using the techniques covered in CSE 6040.
I am attempting to design a method which automates collection of utility data from the UCB website, along with the UCD website. (electricity, steam, water, even waste - all into one core unit, kWh energy demand for a complete energy outtake picture/comparison?)
I was going to go the route of selenium, SQL & automating accessing data from a webpage that updates every 24 hours (Selenium/Beautiful Soup code to pull the div containers, then use regex to translate the strings to an appropriate format, wrap it into data structure of choice) a headache and a half...
Can anybody help? -- my skill level is not at the place where I could line up the string of numbers and show 1 element with the date for each of those numbers being stored in another element...
https://ceed.ucdavis.edu/ https://engagementdashboard.com/universityofcaliforniaberkeley/ucb/building/8750/consumption/month

CEED - Campus Energy Education Dashboard

Historical and real-time building energy data for University of California Davis. The first step to saving energy is seeing how much you use.

worldly dawn May 8, 2023, 8:21 PM

#

severe topaz Doing a project in spare time to get better w/ python -- involves using the tech...

they have a graphql endpoint. Not sure about the license though

severe topaz May 8, 2023, 8:22 PM

#

are you telling me to look into the graphql endpoint? at one point though, do you after accessing the API?

worldly dawn May 8, 2023, 8:24 PM

#

severe topaz are you telling me to look into the graphql endpoint? at one point though, do yo...

you may want to reach out to them directly. They will be better able to guide you

severe topaz May 8, 2023, 8:28 PM

#

as far as accessing the api though, would you be able to try the UCB link?

worldly dawn May 8, 2023, 8:36 PM

#

severe topaz as far as accessing the api though, would you be able to try the UCB link?

I am not going to try any random API that doesn't have an explicit open access

severe topaz May 8, 2023, 8:39 PM

#

These URLs don’t have explicit open access?

rugged comet May 8, 2023, 8:40 PM

#

My end goal is to run Kmeans on a large, sparse dataset. The data is currently in json form. I am trying to use databricks community edition to load and process the data. Reading the json alone takes about 15 minutes. I am just starting the project as far as the machine learning and loading the data goes. Up until this point, I've just been gathering the data.
The data seems too large to load into the driver's memory.

What general advice can you give me to help reach the end goal? If you need to know more about the data or the problem, let me know.

mild dirge May 8, 2023, 8:43 PM

#

How large is your json file? @rugged comet

worldly dawn May 8, 2023, 8:44 PM

#

severe topaz These URLs don’t have explicit open access?

it's not because you don't lock your doors that it gives me the rights to get in.
Same thing here 😉

Plus if you make mistakes or misuse it, they may just cut you off or take down the whole thing.
And in addition, it's always awesome for them to receive an email thanking them for the API and showing enthusiasm

gloomy saddle May 8, 2023, 8:44 PM

#

Yeah 15 minutes of read time if not storage device limited is weird, 80GB json only takes at most its read speed for me usually?

rugged comet May 8, 2023, 8:45 PM

#

mild dirge How large is your json file? @rugged comet

The file I'm trying to use is 675,863 KB. This is only the first 500,000 samples though. I'll have 5 files total, each about this size except for the last one which is smaller.

severe topaz May 8, 2023, 8:45 PM

#

worldly dawn it's not because you don't lock your doors that it gives me the rights to get in...

Whaaaa, a thank-you for the API? I’m going to have to YouTube this. I’ve never tried accessing the API. Or I might ask a LinkedIn connection for help.

mild dirge May 8, 2023, 8:45 PM

#

I don't see why it would take 15 mins to load a json unless it just has too much data

#

Is there not a better way to store the data?

gloomy saddle May 8, 2023, 8:46 PM

#

Yeah something is really up char, can we see your read implementation?

rugged comet May 8, 2023, 8:46 PM

#

gloomy saddle Yeah something is really up char, can we see your read implementation?

df = spark.read.json("/FileStore/tables/edhrec_deck_data_500000.json")

mild dirge May 8, 2023, 8:46 PM

#

Is it basically just a table?

gloomy saddle May 8, 2023, 8:48 PM

#

Why not have pandas directly read the json? Not quite following what spark is doing in this situation

mild dirge May 8, 2023, 8:49 PM

#

Can you show the first 10 lines of your json?

rugged comet May 8, 2023, 8:49 PM

#

mild dirge Is it basically just a table?

It's like a list of dictionaries.
Here's the format

[
    {
        "commanders": ["Abaddon the Despoiler"],
        "color identity": [...],
        "hubs": [...],
        "cards": {
            "Dockside Extortionist": 1,
            ...
        },
        "theme": ...
    },
    {
        "commanders": [...],
        ...
    },
    ...
]

commanders is a list of up to 2 strings.
color identity is a list of up to five characters
hubs is a list of up to ~8 strings
cards is a dictionary of up to 100 pairs where the keys are strings and the values are integers that go from 0 to 100.
theme is a string

mild dirge May 8, 2023, 8:50 PM

#

https://stackoverflow.com/questions/75135994/spark-read-json-taking-extremely-long-to-load-data

Stack Overflow

spark.read.json() taking extremely long to load data

What I've Tried
I have JSON data which comes from an API. I saved all the data into a single directory. Now I am trying to load this data into a spark dataframe, so I can do ETL on it. The API retu...

#

Is this maybe relevant?

errant lake May 8, 2023, 8:50 PM

#

lol, nested jsons

rugged comet May 8, 2023, 8:51 PM

#

mild dirge https://stackoverflow.com/questions/75135994/spark-read-json-taking-extremely-lo...

I'll read this.

gloomy saddle May 8, 2023, 8:52 PM

#

It still to me feels like something pandas could handle directly if its just loading json to a dataframe 🙂

rugged comet May 8, 2023, 8:53 PM

#

gloomy saddle Why not have pandas directly read the json? Not quite following what spark is do...

I wanted to use spark because it has extra memory basically. Might be a silly reason.

gloomy saddle May 8, 2023, 8:54 PM

#

Your files are only a few MB? Its been a while but believe you can have pandas read in chunks if you need to keep memory usage lower

errant lake May 8, 2023, 8:54 PM

#

You can load with pandas and then convert to pyspark as well if needed

#

That SO fix sounds good too

rugged comet May 8, 2023, 8:54 PM

#

gloomy saddle Your files are only a few MB? Its been a while but believe you can have pandas r...

A few MB is a bit of an understatement. The total of all the files will be about 3.5 GB.

errant lake May 8, 2023, 8:54 PM

#

It's usually ok, 4gb of RAM is doable

gloomy saddle May 8, 2023, 8:54 PM

#

(Used to Terabytes so my norm might be skewed)

rugged comet May 8, 2023, 8:55 PM

#

Oh

errant lake May 8, 2023, 8:55 PM

#

Same, I was about to propose using bigquery

rugged comet May 8, 2023, 8:56 PM

#

mild dirge https://stackoverflow.com/questions/75135994/spark-read-json-taking-extremely-lo...

i think all my files are formatted such that each line is a valid JSON object.

#

wait no

#

It's a list of dictionaries basically. So I think that's not true.

gloomy saddle May 8, 2023, 8:58 PM

#

Still, if you need it low memory: reading in chunks and handling type coercion at read time to smaller data types can help a lot with that. Json is nice for humans to read. But take your Dockside Extortionist: that column only needs to be an unsigned 8 bit int. Same for some other stuff. The column names are only stored once and the representation should be a good deal smaller

rugged comet May 8, 2023, 8:59 PM

#

mild dirge Can you show the first 10 lines of your json?

https://paste.pythondiscord.com/rudaxajemi
Here are the first 10 objects in the list.

trail zodiac May 8, 2023, 9:01 PM

#

Hey folks, a quick question- I'm trying to fill in gaps in my cs education and I'm currently reading "Attention is all you need", but I don't have enough context on how attention mechanisms work for me to follow and the paper sort of assumes everyone knows how attention mechanisms work. Can anyone direct me to a research paper or similar resource that's actually introducing/explaining attention mechanisms?

rugged comet May 8, 2023, 9:01 PM

#

gloomy saddle Still if you need it low memory. Reading in chunks and handling type coercion at...

So there are about 25000 unique values like Dockside Extortionist. So that would be about 25000 columns. The shape of the resulting dataframe would be like 2,500,000 samples X 25,000 columns. Is this too large?
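
A rough size estimate may help with the "is this too large?" question. The sketch below only redoes the arithmetic with the approximate figures quoted in the chat. The byte widths for the sparse case (4-byte indices, 1-byte counts) are assumptions for illustration, not a measurement of any particular library.

```python
# Back-of-the-envelope sizes for the deck-by-card matrix discussed above.
# All counts are the approximate figures quoted in the chat.
n_decks = 2_500_000        # total samples across all files
n_cards = 25_000           # unique card names, one column each
max_cards_per_deck = 100   # upper bound on non-zero entries per row

# Dense: every cell stored as an unsigned 8-bit integer (1 byte).
dense_bytes = n_decks * n_cards * 1

# Sparse (COO-style): one (row, column, value) triple per non-zero entry.
# Assumed widths: 4-byte row index, 4-byte column index, 1-byte count.
nonzero_upper_bound = n_decks * max_cards_per_deck
sparse_bytes = nonzero_upper_bound * (4 + 4 + 1)

print(f"dense : {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")
print(f"fill  : {max_cards_per_deck / n_cards:.2%} of cells at most")
```

Under these assumptions the dense table is far beyond typical RAM even with 1-byte cells, while a sparse layout stays in the low gigabytes, which is why a wide one-column-per-card dataframe is the expensive part.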

rugged comet May 8, 2023, 9:04 PM

#

gloomy saddle Still if you need it low memory. Reading in chunks and handling type coercion at...

Maybe spark is taking so long to read because it's trying to convert all the unique values in cards to columns like you said. This would make the dataframe much larger, not smaller, I think.

gloomy saddle May 8, 2023, 9:05 PM

#

you have mixed quotation marks in your input json?

#

'Mizzix's Mastery' for example

#

"Mizzix's Mastery" would probably help things a whole lot assuming its as your actual input

rugged comet May 8, 2023, 9:06 PM

#

gloomy saddle you have mixed quotation marks in your input json?

Yeah so some of the card names can have " (double quote) in the name. And some can have ' (single quote).

gloomy saddle May 8, 2023, 9:07 PM

#

yeah, thats undefined behaviour in json to the best of my knowledge?

rugged comet May 8, 2023, 9:07 PM

#

gloomy saddle "Mizzix's Mastery" would probably help things a whole lot assuming its as your a...

The python discord paste I provided is what Python prints out for the first 10 objects in the list.

import json

with open("edhrec_deck_data_500000.json", "r") as f:
    decks = json.load(f)

for deck in decks[:10]:
    print(deck)

This is the code that generated the paste that I posted.
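
The mixed quotes in the paste are most likely an effect of printing parsed Python objects rather than a property of the file. A minimal sketch of the difference, using a made-up deck record (the card names are only examples that contain awkward quote characters):

```python
import json

# Hypothetical deck record; the card names are examples with awkward quotes.
deck = {"cards": {"Mizzix's Mastery": 1, 'Kongming, "Sleeping Dragon"': 1}}

# print(deck) shows Python's repr: it picks single or double quotes per
# string, so the output looks like mixed-quote JSON but is not JSON.
as_repr = repr(deck)

# json.dumps always uses double quotes and escapes any embedded ones.
as_json = json.dumps(deck)

print(as_repr)
print(as_json)

# The JSON text round-trips; the repr text is rejected by a JSON parser.
assert json.loads(as_json) == deck
try:
    json.loads(as_repr)
except json.JSONDecodeError:
    print("repr output is not valid JSON")
```

So a file that `json.load` accepts is valid JSON regardless of how the printed dictionaries look.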

rugged comet May 8, 2023, 9:08 PM

#

gloomy saddle yeah, thats undefined behaviour in json to the best of my knowledge?

The actual json file probably looks different. I can't open it though.

gloomy saddle May 8, 2023, 9:08 PM

#

try notepad++

rugged comet May 8, 2023, 9:09 PM

#

gloomy saddle try notepad++

https://notepad-plus-plus.org/
This?

gloomy saddle May 8, 2023, 9:09 PM

#

yep

rugged comet May 8, 2023, 9:11 PM

#

Okay so the json file is only one long line it seems.

#

Is this a problem?

gloomy saddle May 8, 2023, 9:16 PM

#

it should be ok, it just means someone has compacted it. I was hoping to see how the structure could be improved on, but what you pastebinned earlier was not valid json as it had already been parsed.

If you could pastebin say the first 10K chars (it counts them on the bottom of notepads window) and ping me, I'll have a look this afternoon

rugged comet May 8, 2023, 9:22 PM

#

gloomy saddle it should be ok, just means someone has compacted it, I was hoping to see how th...

I don't think I can select the first 10k characters using normal means (highlighting with the mouse). It's far too slow. Any other way to do this?
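
One way around the slow editor is to let Python read only the start of the file. This is a minimal sketch: `head_chars` is a hypothetical helper name, and the demo file is a throwaway stand-in for the real data.

```python
import tempfile
from pathlib import Path

def head_chars(path, n=10_000, encoding="utf-8"):
    """Return the first n characters of a text file without loading the rest."""
    with open(path, "r", encoding=encoding) as f:
        return f.read(n)

# Demo on a throwaway single-line file standing in for the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "one_long_line.json"
    demo.write_text("[" + ", ".join(str(i) for i in range(50_000)) + "]", encoding="utf-8")
    snippet = head_chars(demo)

print(len(snippet))
print(snippet[:40])
```

On the real file, `head_chars("edhrec_deck_data_500000.json")` would return the raw text to paste, independent of how long the single line is.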

boreal gale May 8, 2023, 9:32 PM

#

rugged comet Okay so the json file is only one long line it seems.

usually one would use spark with newline delimited json, and not really use it to parse a mega huge json array like the one you have.

parsing a huge json array is extremely slow and is, as far as i know, a single-core operation.
compared to parsing newline-delimited json, the difference is night and day, because spark can delegate different sections of the file (i.e. different lines) to other cores to parallelise the parsing process.

in short using spark yields no benefit here as far as i know.

#

if RAM permits (it should, the file is "tiny" compared to actual big data scale), you can look into using the most performant json parser out there, then convert it to something that is more spark-friendly and resume your work there (if it is really necessary - people misuse spark for all sorts of reasons imo.)

otherwise look into streaming json parsers, i know it's a possibility but i have never found a use for it.

rugged comet May 8, 2023, 9:35 PM

#

boreal gale usually one would use spark with newline delimited json, and not really use it t...

Is it possible to convert what I have into newline-delimited json?

boreal gale May 8, 2023, 9:38 PM

#

yes. though i am not really sure what is the most performant way of doing so.

#

have you tried the multiline=true option in spark as recommended in the SO post though?

#

ah also now that i actually read the SO post, the TLDR Solution is exactly what you need to convert into JSONL, though i wouldn't use the built-in json module... it's slow as heck, using orjson or cysimdjson is better
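
For reference, a minimal sketch of the array-to-JSONL conversion being discussed. It uses the standard-library `json` module only to stay self-contained, so it is the slow variant; a faster parser could be swapped in. `json_array_to_jsonl` is a hypothetical helper name and the demo records are made up.

```python
import json
import tempfile
from pathlib import Path

def json_array_to_jsonl(src, dst):
    """Rewrite a file holding one JSON array as one JSON object per line."""
    with open(src, "r", encoding="utf-8") as f:
        records = json.load(f)          # parses the whole array into memory
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line
            out.write(json.dumps(record) + "\n")
    return len(records)

# Demo with a tiny stand-in file; the field names follow the format shown earlier.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "decks.json"
    dst = Path(tmp) / "decks.jsonl"
    src.write_text(json.dumps([
        {"commanders": ["A"], "cards": {"Dockside Extortionist": 1}},
        {"commanders": ["B"], "cards": {"Mizzix's Mastery": 1}},
    ]), encoding="utf-8")
    n_written = json_array_to_jsonl(src, dst)
    lines = dst.read_text(encoding="utf-8").splitlines()

print(n_written, len(lines))
```

Spark's default JSON reader expects this one-object-per-line layout, so the converted file can be read without the multiLine option.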

rugged comet May 8, 2023, 9:41 PM

#

boreal gale have you tried the `multiline=true` option in spark as recommended in the SO pos...

It wasn't clear to me where to put that parameter. Also, multiline=true doesn't make sense to use to me because I only have one line.

rugged comet May 8, 2023, 9:42 PM

#

boreal gale ah also now that i actually read the SO post, the TLDR Solution is exactly what...

Are those library alternatives to json?

boreal gale May 8, 2023, 9:42 PM

#

yes.

#

okay there might be a misunderstanding to what multiline means.
have a look at the reference here: https://spark.apache.org/docs/latest/sql-data-sources-json.html

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

#

sure - your file doesn't literally have multiple lines, but that's not what "multi-line JSON file" means.
it is merely trying to make the distinction between jsonl and json file, and in your case you have json file not jsonl, hence multiLine is a sensible option to use

rugged comet May 8, 2023, 9:47 PM

#

Oh okay

boreal gale May 8, 2023, 9:47 PM

#

it's frankly quite poorly named.

[a, b]
does not span multiple lines

but
[
    a,
    b
]
does
but they are the same thing in json, multiLine is just not very clear what's going on

#

but i guess from a parsing point of view, it does make sense.

it's a flag to tell spark "hey you can split the file line by line and just process each line by itself" or otherwise

lapis sequoia May 8, 2023, 9:58 PM

#

i created an nlp ai i would love for people to help me train it!

#

also where can i find nlp data in the form of questions and answers

sinful kelp May 8, 2023, 10:02 PM

#

Kaggle https://www.kaggle.com and PapersWithCode https://paperswithcode.com are probably good places to start

Kaggle: Your Machine Learning and Data Science Community

Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.

Papers with Code - The latest in Machine Learning

Papers With Code highlights trending Machine Learning research and the code to implement it.

rugged comet May 8, 2023, 10:24 PM

#

boreal gale it's frankly quite poorly named. ```[a, b]``` does not span multiple lines bu...

df = spark.read.option("multiline", "true").json("/FileStore/tables/edhrec_deck_data_500000.json")

This line has been running for over 15 minutes. It doesn't seem any better than before.

somber pollen May 8, 2023, 10:53 PM

#

lapis sequoia also where can i find nlp data in the form of questions and answers

you can use an existing model to generate these

#

it's been used with decent success for some fine-tuned models

rugged comet May 8, 2023, 10:56 PM

#

Also, to get the data encoded for machine learning (Kmeans), I think I want to pivot by cards and group by original_url. I run into a problem though.

pivot_df = df.groupBy("original_url").pivot("cards")

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

It's hard to tell what the format is for the current data but I want it to look like this

color identity,commanders,hubs,original url,tags,theme,CARD_1,CARD_2,...

So I keep the values for the other columns such as color identity and commanders. But I want to add new columns for each value in cards.
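
A wide pivot creates one dense column per card, which may be what overwhelms the driver. As an alternative sketch (not something proposed in the chat), the same information can be kept as sparse (row, column, count) triples. The decks below are made-up examples in the format shown earlier, and `card_to_col` is a hypothetical name.

```python
# Hypothetical decks in the format shown earlier (most fields omitted).
decks = [
    {"theme": "treasure", "cards": {"Dockside Extortionist": 1, "Sol Ring": 1}},
    {"theme": "spells", "cards": {"Mizzix's Mastery": 1, "Sol Ring": 1}},
]

# Assign each distinct card name a column index in order of first appearance.
card_to_col = {}
rows, cols, vals = [], [], []
for row, deck in enumerate(decks):
    for card, count in deck["cards"].items():
        col = card_to_col.setdefault(card, len(card_to_col))
        rows.append(row)
        cols.append(col)
        vals.append(count)

print(card_to_col)
print(list(zip(rows, cols, vals)))
```

If SciPy is available, these lists can be passed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(len(decks), len(card_to_col)))`, and scikit-learn's KMeans accepts sparse input, so the dense table never has to exist.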

cerulean kayak May 8, 2023, 11:56 PM

#

do decision trees rely on randomness?

agile cobalt May 9, 2023, 12:03 AM

#

singular trees: not much
ensembles: yes

somber pollen May 9, 2023, 12:58 AM

#

agile cobalt singular trees: not much ensembles: yes

i've never been sure about the right notation for these kinds of things. Is one tree a tree and then an ensemble a forest? I feel like I've seen that wording to describe it

agile cobalt May 9, 2023, 2:36 AM

#

the only "forest" I can think of would be a "random forest", which refers to a specific way of putting decision trees together

an 'ensemble' is any model that makes use of two or more separate/individual models

somber pollen May 9, 2023, 3:03 AM

#

agile cobalt the only "forest" I can think of would be a "random forest", which refers to a s...

ah ok, that makes sense. I think I was confusing a forest (an ensemble of trees) with ensembles (an ensemble of any model).

rugged comet May 9, 2023, 3:10 AM

#

I want to try uploading a MySQL database to azure databricks.
They ask for the information in brackets

database_host = "<database-host-url>"
database_port = "3306" # update if you use a non-default port
database_name = "<database-name>"
table = "<table-name>"
user = "<username>"
password = "<password>"

I'm stuck on how to get the database-host-url. Is that the same as the value returned by

SELECT @@hostname;

?

#

My database is hosted on my local machine.

serene scaffold May 9, 2023, 3:13 AM

#

Unrelated, but thank you @rugged comet for causing me to learn about the distinction between multi-label and multi-class a few months ago. I'm working on a multi-label classification project now at work.

#

(also I know nothing about databricks. sorry)

rugged comet May 9, 2023, 3:14 AM

#

Nice! I'm so happy I could help.

#

If you know anything about Azure also, that would be helpful. I just started and it's kind of overwhelming in the beginning.

agile cobalt May 9, 2023, 3:29 AM

#

it might be easier to import a dump or csv file

is the way you're trying to import it an option in a website, or do you pass it to a local script? there is a non-negligible chance that the option you're using assumes that the database is hosted somewhere with a public ip

rugged comet May 9, 2023, 3:33 AM

#

agile cobalt it might be easier to import a dump or csv file is the way you're trying to imp...

Firstly, I wanted to upload a json file but they said it was too big. Then I thought I could put it in a MySQL db and upload that. Now, it looks like I may be able to use the DBFS. What do you think about the DBFS for this?

serene scaffold May 9, 2023, 3:48 AM

#

rugged comet If you know anything about Azure also, that would be helpful. I just started and...

is that an AWS thing? I only know enough AWS to be either dangerous as fuck or totally impotent.

rugged comet May 9, 2023, 3:49 AM

#

Azure is like AWS as far as I'm aware. Just by different companies.

#

Microsoft Azure

serene scaffold May 9, 2023, 3:50 AM

#

oh okay. I actually know shockingly little about tabular databases and all their x-as-a-service varieties. but for reasons I can't tell you, I do know a lot about the graph database neo4j.

rugged comet May 9, 2023, 3:51 AM

#

I see... 😄

serene scaffold May 9, 2023, 3:51 AM

#

I do all my tabular data manipulation with pandas, even if the CSV is 80 GB

#

and I like it

rugged comet May 9, 2023, 3:52 AM

#

lmao

serene scaffold May 9, 2023, 3:52 AM

#

anyway, I'm displacing your question with my shitposting, so I'll be quiet in the hopes that someone more knowledgeable appears.

exotic vortex May 9, 2023, 4:02 AM

#

Hello everyone

I want a roadmap for data science. Plz help me. I have started Python as my programming language

rugged comet May 9, 2023, 4:26 AM

#

The file upload thing is solved now I think.

#

How do you create clusters or a workspace in azure databricks without public ip addresses? I keep getting this error

Error code: PublicIPCountLimitReached, error message: Cannot create more than 3 public IP addresses for this subscription in this region.

when trying to create a new cluster.

serene scaffold May 9, 2023, 4:39 AM

#

rugged comet How do you create clusters or a workspace in azure databricks without public ip ...

have you tried asking in #tools-and-devops btw? or #databases?

rugged comet May 9, 2023, 4:40 AM

#

I have not. Do you think those channels would be more appropriate for this question?

serene scaffold May 9, 2023, 4:41 AM

#

probably

granite falcon May 9, 2023, 4:55 AM

#

hi i am working on a data science problem i am new to python unable to solve it if you have some spare time can you help me with it.

severe topaz May 9, 2023, 4:57 AM

#

just explain it ^^

granite falcon May 9, 2023, 4:58 AM

#

severe topaz just explain it ^^

is that for me?

boreal gale May 9, 2023, 6:53 AM

#

rugged comet ```py df = spark.read.option("multiline", "true").json("/FileStore/tables/edhrec...

Somewhat expected, have you tried reformatting your data to jsonl already?
How sparse is your data?

past meteor May 9, 2023, 6:57 AM

#

rugged comet I want to try uploading a MySQL database to azure databricks. They ask for the ...

Yes

past meteor May 9, 2023, 6:58 AM

#

serene scaffold I do all my tabular data manipulation with pandas, even if the CSV is 80 GB

Cursed 😦 you should look at Polars the API is a lot cleaner than Pandas and it's so so so much more efficient

past meteor May 9, 2023, 6:59 AM

#

rugged comet Firstly, I wanted to upload a json file but they said it was too big. Then I tho...

Just put the JSON file in an Azure blob storage (cheaper) or azure data lake and connect it there, no? Afaik Azure allows you to connect a "service" to Databricks. Be sure to use key vault because your Azure environments are stored as YAML files under the hood and if you don't use Key Vault your credentials get put in plain text in your repo. Maybe this is changed because it's been a while since I used Azure tbf

past meteor May 9, 2023, 7:04 AM

#

cerulean kayak do decision trees rely on randomness?

And finally: singular trees rely on randomness because sometimes there are ties in Gini/IG that are broken at random. Depending on your implementation there could also be an element of randomness in how to quantise continuous variables. This is why even if you train a vanilla decision tree on the same data with a different seed you may have different results.
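
The tie-breaking point can be illustrated with a toy example. This is a hand-rolled sketch, not how any particular library implements it: two made-up binary features give exactly the same weighted Gini impurity, so the choice between them comes down to the seed.

```python
import random

def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, feature):
    """Weighted Gini impurity after splitting on a binary feature."""
    left = [r["y"] for r in rows if r[feature] == 0]
    right = [r["y"] for r in rows if r[feature] == 1]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: features "a" and "b" separate the labels equally well.
rows = [
    {"a": 0, "b": 1, "y": 0},
    {"a": 0, "b": 1, "y": 0},
    {"a": 1, "b": 0, "y": 1},
    {"a": 1, "b": 0, "y": 1},
]

scores = {f: split_impurity(rows, f) for f in ("a", "b")}
best = min(scores.values())
tied = [f for f in scores if scores[f] == best]

# With a tie, the chosen feature depends only on the random seed.
for seed in range(4):
    print(seed, random.Random(seed).choice(tied))
```

Fixing the seed makes the choice reproducible, which is the usual reason tree implementations expose a random-state setting.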

copper island May 9, 2023, 9:13 AM

#

Please recommend a book to start with data science in Python

serene scaffold May 9, 2023, 9:24 AM

#

past meteor Cursed 😦 you should look at `Polars` the API is a lot cleaner than Pandas and i...

No.

past meteor May 9, 2023, 9:30 AM

#

Your choice/loss 🤷‍♂️ . I was pretty stubborn about trying it out as well in the past but it's great

#

Worst case scenario you don't like it and you go back, you don't lose anything

weary swift May 9, 2023, 10:58 AM

#

is there a way to use my AMD RX 470 with pytorch?

sleek harbor May 9, 2023, 11:22 AM

#

pipelines are very convenient (talking about sklearn here), but.. aren't they super inefficient when tuning parameters? I mean, say u have a pipeline with a bunch of preprocessing (drop some columns, impute, standardize one thing, one hot encode another, etc..).. that means all that preprocessing gets done for every hyperparam combination.. every time.. again and again.. when it could be done just once. Am I right about this? Are pipelines actually used in practice? Cus.. that seems like a lot of unnecessary work

cold osprey May 9, 2023, 11:28 AM

#

probably, but is it a lot of run time?

boreal gale May 9, 2023, 11:32 AM

#

in principle yes, but to echo shimmer's point, is it actually a lot of run time?
also have you looked into the memory parameter of pipeline? it looks like a parameter to configure a cache

cold osprey May 9, 2023, 11:32 AM

#

U could always just store a copy of the pre processed data in memory and use that for all ur models

#

Only need to rerun the pipeline when u close ur notebook or smth

boreal gale May 9, 2023, 11:34 AM

#

re. store a copy of the pre processed data in memory
yes it's possible. but you run the risk of leaking your test set into your training set if you aren't careful. hiding behind the pipeline interface is very reliable in terms of not leaking your test set

past meteor May 9, 2023, 11:38 AM

#

sleek harbor pipelines are very convenient (talking about sklearn here), but.. aren't they *s...

Yes you should also use them while training / tuning hyperparameters. As @boreal gale correctly says the risk of leakage is too big

cold osprey May 9, 2023, 11:39 AM

#

am confused

#

how would leakage happen

past meteor May 9, 2023, 11:40 AM

#

A million and one ways? stuff like your StandardScaler and OneHotEncoder etc. depend on the batch of data that was seen during your cross-validation procedure

cold osprey May 9, 2023, 11:40 AM

#

wouldnt everything be run on the df_train only

past meteor May 9, 2023, 11:41 AM

#

Taking your entire training set and precomputing these metrics on df_train and then cross-validating is leakage

cold osprey May 9, 2023, 11:41 AM

#

fit transform on train

#

transform only on test

past meteor May 9, 2023, 11:41 AM

#

I'm specifically talking about the case where you cross validate

sleek harbor May 9, 2023, 11:42 AM

#

boreal gale in principle yes, but to echo shimmer's point, is it actually a lot of run time?...

I gotta look into the memory parameter.., but yeah. By default, if you have a bunch of iterations looking for hyperparameters.. the runtime can pile up, I'd assume

past meteor May 9, 2023, 11:42 AM

#

Say your dataset A is split into 80/20 and you're doing 2 fold CV (for the sake of this example) you cannot just fit your preprocessing on the full 80 and then proceed with your CV

cold osprey May 9, 2023, 11:43 AM

#

ah right

past meteor May 9, 2023, 11:43 AM

#

Your preprocessing can only see the 40/100 it gets during the CV procedure

#

So by definition a lot of preprocessing (but certainly not all) cannot be done ahead of time, which is why I'd argue for not risking it and just going with Pipeline and ColumnTransformer: the risk of accidentally leaking is high(er) otherwise
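
The leakage argument can be made concrete with a toy centring step. This is a sketch with made-up numbers, using a plain mean in place of a real scaler: the statistic fitted on the whole training portion differs from the one each CV split is allowed to see.

```python
from statistics import mean

# Toy "80%" training portion, split into two CV folds.
train = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0]
fold_a, fold_b = train[:4], train[4:]

# Leaky: the centring statistic is computed once on all training rows,
# so it already contains information from whichever fold is held out.
leaky_mean = mean(train)

# Correct: refit the statistic inside each split, on the fitting fold only.
mean_when_b_held_out = mean(fold_a)
mean_when_a_held_out = mean(fold_b)

print(leaky_mean, mean_when_b_held_out, mean_when_a_held_out)
```

Because the per-split statistics differ from the precomputed one, the preprocessing has to be refitted inside every split, which is what a pipeline passed to the cross-validation routine does automatically.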

sleek harbor May 9, 2023, 11:45 AM

#

are there other libraries for making pipelines, or is sklearn the most widely used one?

past meteor May 9, 2023, 11:47 AM

#

sleek harbor are there other libraries for making pipelines, or is sklearn the most widely us...

PySpark has pipelines. Recipes in R does essentially the same thing. You can easily make your own with functools in the standard library; pipelines are just function composition. The thing you need to understand is that it's a universal problem 🙂 you have to compute them on the fly, it's not a scikit-learn problem
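
A minimal sketch of the "pipelines are just function composition" idea with `functools.reduce`. The helper `compose` and the two steps are hypothetical names for illustration; unlike a scikit-learn Pipeline, this stateless version has no separate fit and transform phases.

```python
from functools import reduce

def compose(*steps):
    """Chain single-argument functions so they run left to right."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

# Hypothetical preprocessing steps on a list of numbers.
def drop_missing(xs):
    return [x for x in xs if x is not None]

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

preprocess = compose(drop_missing, center)
result = preprocess([1.0, None, 2.0, 3.0])
print(result)
```

Here `center` recomputes its mean on whatever data it is given, so reusing it across train and test data would need the fitted state kept separately.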

errant lake May 9, 2023, 11:47 AM

#

There are tools to orchestrate your pipeline in a smarter way, but it will still use sklearn/pandas/spark in the background

boreal gale May 9, 2023, 11:48 AM

#

errant lake There are tools to orchestrate your pipeline in a smarter way, but it will still...

curious to know more, what are these tools?

errant lake May 9, 2023, 11:48 AM

#

Apache Airflow is widely used in the industry to design such pipelines

past meteor May 9, 2023, 11:48 AM

#

Airflow is something totally different

errant lake May 9, 2023, 11:48 AM

#

Yes, it's an orchestrator, it's not a pipeline tool per se

past meteor May 9, 2023, 11:49 AM

#

You could make your preprocessing into an airflow DAG but the overhead would be immense 😢

errant lake May 9, 2023, 11:50 AM

#

It is, I agree.

past meteor May 9, 2023, 11:50 AM

#

boreal gale curious to know more, what are these tools?

Personally I always stick with FeatureUnion, Pipeline and ColumnTransformer. If I need a custom transformer I subclass sklearn stuff

boreal gale May 9, 2023, 11:51 AM

#

that's my take as well, just curious to learn more about what clem meant

errant lake May 9, 2023, 11:52 AM

#

A lot of companies are going for Airflow in their DS pipelines: it is seen as an immense overhead at first, but the value is still there, improving the efficiency of scientists. These companies also develop in-house platform solutions on top of Airflow + [whatever data warehouse solution they use] to boost productivity of you guys 🙂

past meteor May 9, 2023, 11:52 AM

#

Pipeline means many different things in data science / data engineering

#

The pipeline we're talking about are the final steps before inference (preprocessing). This entire thing would be one element of your airflow DAG

errant lake May 9, 2023, 11:54 AM

#

To give a bit more technical example, someone was saying some steps are not necessary to run again in a, say, sklearn pipeline.
Now imagine each sklearn pipeline step is a specific DAG, triggered only on the right events. No need to re-run this pipeline step if nothing will change, right?
This is a complex implementation - can't deny it - but it definitely holds value in the long run

errant lake May 9, 2023, 11:55 AM

#

past meteor Pipeline means many different things in data science / data engineering

I agree, also doing my best not to mix them up 😄

past meteor May 9, 2023, 11:56 AM

#

Not sure I agree about that architecture.

#

Bundling up your entire sklearn model in 1 object also makes deploying it on embedded etc. a lot easier

#

It's a single thing from start to finish

#

If the preprocessing is something massive and beyond the scope of sklearn transformers then yeah I would split it in 2 steps of course

sleek harbor May 9, 2023, 11:57 AM

#

past meteor `PySpark` has pipelines. `Recipes` in R does essentially the same thing. You can...

I just want something fast 🙂 as efficient as possible.
U know how gridsearch passes parameters as.. stepname__parameter... to the pipeline? Any idea if that's the same as reinitiating the pipeline from scratch with those parameters, or if passing them in that way is more efficient somehow?
say, for some reason, I don't want to use gridsearchcv, but to make a manual for loop.. would it be better, from a.. idk, pythonic standpoint, to loop through different parameters and reinitiate the pipeline by passing the tunable parameters directly into the estimators inside the Pipeline (so we end up calling Pipeline and everything inside it each loop), or to use set_params on a Pipeline created outside of the loop?
I hope I didn't screw up my question.. 😅 first option would be simpler (for me at least) to understand, cus you wouldn't have to use that slightly strange stepname__parameter.. syntax, but reinitiating everything every time.. might be a bit costly(?) idk, thoughts?

#

ok.. discord thingies..

#

there we go

past meteor May 9, 2023, 11:59 AM

#

So you'll code up grid search yourself?

errant lake May 9, 2023, 11:59 AM

#

past meteor Bundling up your entire sklearn model in 1 object also makes deploying it on emb...

Seems a lot easier if you want to deploy on embedded, non-distributed systems. Of course in this case Airflow would be overkill - or any orchestrator

past meteor May 9, 2023, 11:59 AM

#

In that case you can definitely re-use some preprocessing steps I guess

sleek harbor May 9, 2023, 12:00 PM

#

past meteor So you'll code up grid search yourself?

no.. 😓 that's actually for when I define an objective function in optuna.. but the point of the question is the same

past meteor May 9, 2023, 12:02 PM

#

So instead of refitting the parameters of your preprocessing you want to reuse them?

sleek harbor May 9, 2023, 12:03 PM

#

past meteor So instead of refitting the parameters of your preprocessing you want to reuse t...

lemme rephrase.. give my slow fat fingers some time to type

sleek harbor May 9, 2023, 12:16 PM

#

past meteor So instead of refitting the parameters of your preprocessing you want to reuse t...

this is a really bad example, but.. basically I could do this:

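from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

for smth in smths:
    # unpack one hyperparameter combination (no * on the right-hand side)
    analyzer, max_features, max_depth, min_samples_split = smth

    # build a fresh pipeline each iteration, passing hyperparameters by keyword
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=analyzer, max_features=max_features)),
        ('rf', RandomForestClassifier(max_depth=max_depth, min_samples_split=min_samples_split))
    ])
    ...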

or I could do this:

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),   
    ('rf', RandomForestClassifier())
])

for smth in smths:
    tfidf__analyzer, tfidf__max_features, rf__max_depth, rf__min_samples_split = smth
    params = {
        'tfidf__analyzer': tfidf__analyzer,
        'tfidf__max_features': tfidf__max_features,
        'rf__max_depth': rf__max_depth,
        'rf__min_samples_split': rf__min_samples_split
    }
        
    pipe.set_params(**params)
    ...

and get the same result. But personally, I find the first version easier to understand. But in the first version I end up reinitiating the pipeline (and everything in it) each time, while in the second version I just set different parameters (but have to use that pesky step__param syntax I dislike). So my question is, would the first version be less efficient than the second? Or would there be no difference?

past meteor May 9, 2023, 12:21 PM

#

Oh you really just want to create the full grid and then pass on the parameters to the Pipeline?

#

I'd go for option 1 then, it looks cleaner

sleek harbor May 9, 2023, 12:23 PM

#

past meteor Oh you really just want to create the full grid and then pass on the parameters ...

noo, that's just how optuna is used, it's supposed to be "pythonic", but yeah, for the sake of discussion, let's say that's what I'm trying to do

past meteor May 9, 2023, 12:24 PM

#

Either way I'd read this on stackoverflow: https://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil

Software Engineering Stack Exchange

Is premature optimization really the root of all evil?

A colleague of mine today committed a class called ThreadLocalFormat, which basically moved instances of Java Format classes into a thread local, since they are not thread safe and "relatively expe...

sleek harbor May 9, 2023, 12:25 PM

#

past meteor I'd go for option 1 then, it looks cleaner

It does look cleaner, doesn't it? But what bothers me is the reinitiation of the pipeline. Won't that be.. costly? I guess this is more of a python question and how it all works on a low level, but I'm just assuming that creating a bunch of class instances would be costly

past meteor May 9, 2023, 12:26 PM

#

Don't overthink 1% performance gains when there's more obvious performance gains you could pursue

#

Even if you want to pursue that 1 % you can code both of them up and profile / time it
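
A rough sketch of what "code both of them up and time it" could look like. The grid values in `smths` and the helper names `rebuild` and `reconfigure` are made up for illustration; only the setup cost is timed, no fitting happens. On real data the fit itself will most likely dwarf either number, so treat the output as a sanity check rather than a benchmark.

```python
import timeit

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical grid: (analyzer, max_features, max_depth, min_samples_split)
smths = [("word", 1000, 5, 2), ("char", 500, 10, 4)]


def rebuild(analyzer, max_features, max_depth, min_samples_split):
    """Option 1: create a fresh Pipeline for every parameter combination."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=analyzer, max_features=max_features)),
        ("rf", RandomForestClassifier(max_depth=max_depth,
                                      min_samples_split=min_samples_split)),
    ])


# Option 2 keeps one Pipeline alive outside the loop.
shared_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("rf", RandomForestClassifier())])


def reconfigure(analyzer, max_features, max_depth, min_samples_split):
    """Option 2: reuse the shared Pipeline and overwrite its parameters."""
    return shared_pipe.set_params(
        tfidf__analyzer=analyzer,
        tfidf__max_features=max_features,
        rf__max_depth=max_depth,
        rf__min_samples_split=min_samples_split,
    )


# Time only object setup; nothing is fitted here.
t_rebuild = timeit.timeit(lambda: [rebuild(*s) for s in smths], number=200)
t_reconfigure = timeit.timeit(lambda: [reconfigure(*s) for s in smths], number=200)
print(f"rebuild: {t_rebuild:.4f}s  set_params: {t_reconfigure:.4f}s")
```

Both helpers end up with the same hyperparameters, so the choice mostly comes down to which one reads better.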

sleek harbor May 9, 2023, 12:30 PM

#

past meteor Either way I'd read this on stackoverflow: https://softwareengineering.stackexch...

I'll give it a read. "Premature micro optimizations are the root of all evil", can't really disagree with that, but.. that doesn't stop me, unfortunately 😅 I spent a bunch of time determining what's faster, using lambda or a defined function, calling a class function and passing in an instance of that class or calling a method directly on an instance of a class, and other such mini nonsense optimizations.. Guilty 😶

sleek harbor May 9, 2023, 12:30 PM

#

past meteor Even if you want to pursue that 1 % you can code both of them up and profile / t...

yeah.. should probably try that.. someday..

timid kiln May 9, 2023, 2:18 PM

#

I'm looking to be better able to process Excel spreadsheets. I get a myriad of reports in various formats, and I want to pull the data out of these workbooks into a usable format. I need to know what is the "best" methodology for reading worksheets and removing unneeded rows and columns, lining up data that might be offset by a row or column, taking what is essentially data in a form-type format, and turning that into a dataframe, and so forth. My understanding is that pandas is the way to go for these types of things but I lack the foundation of understanding the ramifications of reading something into a dataframe vs a dictionary, which is "better" (which I'm sure depends on the situation) and so forth.

I've been through a few tutorials on how to read data into a dataframe when it's already nicely formatted in Excel; so I don't need any help with that. The situation I'm in is that these spreadsheets were generated with "viewing things" in mind, and not actually processing the data and using it for something beyond just looking at it in Excel. Hopefully this makes sense. Sorry for the long message. Thank you for your help!

cold osprey May 9, 2023, 2:24 PM

#

yeah pandas can do it too

#

drop cols, rows. split stuff up, drop cols then join again to realign them
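
A minimal sketch of that kind of cleanup, assuming a sheet that was laid out for viewing: a title row, a blank row, and a table shifted one column to the right. The `raw` frame below is a made-up stand-in for what `pd.read_excel(path, header=None)` might return, and locating the header by searching for the cell "item" is just one possible heuristic.

```python
import pandas as pd

# Made-up stand-in for pd.read_excel("report.xlsx", header=None)
raw = pd.DataFrame([
    ["Monthly report", None, None, None],
    [None, None, None, None],
    [None, "item", "qty", "price"],
    [None, "bolt", 10, 0.5],
    [None, "nut", 20, 0.1],
])

# Find the row that holds the real column headers.
header_row = raw.index[raw.eq("item").any(axis=1)][0]

# Keep the header row and everything below it, then drop columns that are
# completely empty in that region (the offset column on the left).
body = raw.loc[header_row:].dropna(axis=1, how="all")

# Promote the first remaining row to column names and fix the dtypes.
table = body.iloc[1:].reset_index(drop=True)
table.columns = body.iloc[0].tolist()
table = table.astype({"qty": int, "price": float})

print(table)
```

The work is done on whole rows and columns at once, not cell by cell, which is usually what people mean when they say building a dataframe row by row is "not how you do this".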

pallid badge May 9, 2023, 2:26 PM

#

I would go for Pandas

#

Or maybe even Xarray. You can annotate axes with units

#

Attach metadata

timid kiln May 9, 2023, 2:28 PM

#

But, where can I learn how to do all of this? Does it go row by row? It just seems like there's going to be a lot of custom programming, in my newbie opinion.

pallid badge May 9, 2023, 2:28 PM

#

You start on the docu website

#

and you can try stackoverflow, this is how I started with pandas

#

I am not an expert, but that was my approach

#

Or you ask here. Alternative: Look for code mentorship. Somebody you can discuss with

timid kiln May 9, 2023, 2:30 PM

#

Like, if there's data that isn't in proper columns, something is offset because someone formatted the workbook so things line up but not because they're in the same column, how do I move data around? Do I do that with a dictionary? Or a dataframe somehow? Like, merge two columns or... idk even what to ask as I'm so new to this kind of thing with python.

I had chatGPT give me something and it read in row by row, building a dataframe, if I remember correctly. It's been several weeks since I looked at that code. I managed to get it to work for what I was doing but, I think I talked about it here and someone went "that's not how you do this" lol.

#

docu website?

#

have to get off the train, I'll be back in about 15 minutes, thank you for your help! Links are always appreciated. 🙂

pallid badge May 9, 2023, 2:32 PM

#

The first issue sounds more like a formatting problem. You have to find the entry and clean it

cold osprey May 9, 2023, 2:32 PM

#

get the data from the source and abandon excel for most things

pallid badge May 9, 2023, 2:32 PM

#

Would it not make more sense to clean the data first and then put it into a dataframe?

#

Docu ---> Pandas documentation

cold osprey May 9, 2023, 2:46 PM

#

best tool for experiment tracking?

#

used tensorboard and mlflow for a bit

pallid badge May 9, 2023, 2:47 PM

#

What is experiment tracking?

boreal gale May 9, 2023, 2:47 PM

#

cold osprey best tool for experiment tracking?

ditto, also super curious what people think of experiment tracking.
i haven't had a good solution

cold osprey May 9, 2023, 2:47 PM

#

pallid badge What is experiment tracking?

tracking machine learning models

#

instead of model1.pth, model2.pth, etc

cold osprey May 9, 2023, 2:48 PM

#

boreal gale ditto, also super curious what people think of experiment tracking. i haven't ha...

mlflow was decent for me on my own machine

#

not done any collaborative work so

timid kiln May 9, 2023, 2:49 PM

#

cold osprey get the data from the source and abandon excel for most things

I'd like to try to do that, the disconnect for me is how people use the data without Excel? It seems like at the end of the day you need something that's static that will display charts/graphs/tables of information that can be shared with folks and not need to be generated whenever the data is viewed. I get the processing side of things, python is definitely more powerful in that regard, but then, what do people use as a GUI?

cold osprey May 9, 2023, 2:49 PM

#

timid kiln I'd like to try to do that, the disconnect for me is how people use the data *wi...

dashboarding tool like Power BI

past meteor May 9, 2023, 2:49 PM

#

cold osprey best tool for experiment tracking?

MLflow is what we use at work

#

I'm fine with Tensorboard or even Weights & Biases if it's me just playing around in the weekend

boreal gale May 9, 2023, 2:51 PM

#

i haven't played with mlflow at all. will the fact that i don't work with NN frameworks at all matter?

cold osprey May 9, 2023, 2:51 PM

#

nop

past meteor May 9, 2023, 2:51 PM

#

No, it works perfect with sklearn as well

boreal gale May 9, 2023, 2:51 PM

#

sweet. thanks.

i have a follow up question, how do you share notebooks in your org?

pallid badge May 9, 2023, 2:51 PM

#

@timid kiln : I don't necessarily use a GUI. I get the data from a device or it is saved into hdf5 file formats

past meteor May 9, 2023, 2:51 PM

#

Tbf the reason I decided on using MLflow is that someone on my team only uses R and it's the only one that integrates with R as well

pallid badge May 9, 2023, 2:52 PM

#

To display data? Matplotlib

cold osprey May 9, 2023, 2:52 PM

#

ah

past meteor May 9, 2023, 2:52 PM

#

boreal gale sweet. thanks. i have a follow up question, how do you share notebooks in your ...

Share notebooks internally or externally?

boreal gale May 9, 2023, 2:54 PM

#

past meteor Share notebooks internally or externally?

primarily interested in internally, for

1. knowledge dissemination
2. keeping track of what you have tried/explored and for what reasons did you abandon that line of research

pallid badge May 9, 2023, 2:54 PM

#

I have an embarrassing question. Why does this not work?

a= np.array([[1,2, 4], [3,5, 6], [7,8,3]])
a[a>5]=0
print(a[a>5]=0)

Set all the entries in the array to 0 where the condition is correct. The print command never works.
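
A likely explanation, sketched below: `a[a>5]=0` is an assignment statement, not an expression, so it has no value that could be handed to `print`, and Python rejects `print(a[a>5]=0)` as a syntax error before anything runs. Assigning first and printing the array afterwards should do what was intended. The `np.where` line is an optional non-mutating alternative, not something from the original snippet.

```python
import numpy as np

a = np.array([[1, 2, 4], [3, 5, 6], [7, 8, 3]])
original = a.copy()

# Step 1: the masked assignment modifies `a` in place and returns nothing.
a[a > 5] = 0

# Step 2: print the array itself, as a separate statement.
print(a)

# Alternative that leaves the input untouched and returns a new array.
b = np.where(original > 5, 0, original)
print(b)
```

Note that after the assignment `a[a > 5]` is an empty array, because no entry is greater than 5 any more.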

past meteor May 9, 2023, 2:54 PM

#

boreal gale primarily interested in internally, for 1. knowledge dissemination 2. keeping ...

Git + markdown files and powerpoint?

#

We turn notebooks to webapps if it's something we want to share externally. Sometimes if I'm lazy I use pandoc and turn it into a PDF or HTML and use cron to email it on a fixed schedule.

#

If it's internal stuff / dissemination I'm a fan of just version controlling whatever you're doing, writing reports and doing a powerpoint presentation about your progress (that's what we do)

cold osprey May 9, 2023, 2:59 PM

#

what do u do again zestar?

#

i forgot

boreal gale May 9, 2023, 3:00 PM

#

past meteor Git + markdown files and powerpoint?

i like git + markdown, but graphs are probably omitted / too unwieldy to be checked in which loses the richness of the report, which is a shame.
powerpoint is nice and all but imo is hard to diff, quickly lookup, also it takes time to prepare and would limit the velocity of development.

i think given there is enough people in an org, your org's approach definitely makes sense, but sadly it doesn't for me, i am in a startup with a team size of 3 technical staff 😂

past meteor May 9, 2023, 3:00 PM

#

cold osprey what do u do again zestar?

(Applied) AI research.

past meteor May 9, 2023, 3:02 PM

#

boreal gale i like git + markdown, but graphs are probably omitted / too unwieldy to be chec...

Graphs are included on github at least

boreal gale May 9, 2023, 3:02 PM

#

oh! do you check in graphs as *.png or whatever and link them inside the markdown?

if that's the case, i might try to adopt that as well, i think it saves a lot of time down the line not having to desperately dig out what you have done in the past from random commits

past meteor May 9, 2023, 3:03 PM

#

I'd just put the .ipynb as is on git and just have a README file with an executive summary of what goes on in the notebook

boreal gale May 9, 2023, 3:04 PM

#

that's sensible as well

past meteor May 9, 2023, 3:05 PM

#

Properly naming your folders, files and making sure the notebooks aren't too long / doing too many things goes a long way as well, no?

#

But I imagine you already do that

#

MLFlow is a big one in the sense that I standardise what will get logged (the plots, metrics) in a Python / R template and also make sure I never delete data in the DB. Only inserts. I keep track of a version number, that goes into MLFlow as well. Kind of a bootleg version of DVC 🤣 . Why? I want to be able to go back in the past and recreate any specific experiment

boreal gale May 9, 2023, 3:13 PM

#

yes, i am indeed doing those, but i still find it rather unmanageable 😦

dang, i really gotta check out MLFlow!

cobalt rain May 9, 2023, 3:28 PM

#

Hello everyone, I wanted to ask if anyone would know something about a free versatile and knowledgeable AI chatbot to use in my code... I'm trying to find a free one because I'm gonna give it to my friends for testing... Does anyone know a model I can implement in my code?

serene scaffold May 9, 2023, 3:42 PM

#

cobalt rain Hello everyone, I wanted to ask if anyone would know something about a free vers...

be careful not to overuse the word "implement". "implementing a model" means to write all the code for the model on your own. if you're loading an existing model, you're just using it.

what does the chat bot need to be able to do as compared to ChatGPT?

rugged comet May 9, 2023, 4:54 PM

#

boreal gale Somewhat expected, have you tried reformatting your data to jsonl already? How s...

If I encode it, it is very sparse (25000 columns and only up to 100 of them have values). Currently, it's dense though.

rugged comet May 9, 2023, 4:55 PM

#

past meteor Just put the JSON file in an Azure blob storage (cheaper) or azure data lake and...

The DBFS is free as far as I'm aware. Can't really beat that price.

rugged comet May 9, 2023, 4:57 PM

#

boreal gale Somewhat expected, have you tried reformatting your data to jsonl already? How s...

I have not tried reformatting to jsonl.

timid kiln May 9, 2023, 4:57 PM

#

pallid badge To display data? Matplotlib

Yes, but that would require everyone that uses the Excel workbook to have python installed, and re-run the code. With Excel, at least you are able to capture and publish the results in a format where anyone within a given company could use the information, and not everyone would need python installed.

Everyone I chat with here is so anti-Excel, and that's fine, we're all allowed our opinions. But I don't get how other people, like your average non-programmer non-engineer person, are supposed to be able to use a tool with no GUI. Admittedly I am not very experienced with python so I am well aware that I am lacking a lot of information on how the rest of the world uses python without Excel.

rugged comet May 9, 2023, 4:58 PM

#

The thing is, I don't know that the issue is with the data in its current form. I think the issue is that it will be too big on one machine.

#

With 25000 columns and 500000 samples, that's already 60 GB I think. And that's just 500000 out of the total 2500000 samples.
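
A back-of-the-envelope check of that estimate. The figure depends heavily on the element type, which is not stated here, so the sketch tries a few. The sparse line assumes at most 100 non-zero entries per row, small integer counts that fit in one byte, and 4-byte indices; those are assumptions for illustration, not measurements.

```python
n_rows = 500_000
n_cols = 25_000
max_nonzero_per_row = 100  # assumed upper bound per sample

cells = n_rows * n_cols  # 12_500_000_000 entries if every zero is stored

# Dense storage: every cell is materialised.
dense_float64_gb = cells * 8 / 1e9
dense_float32_gb = cells * 4 / 1e9
dense_uint8_gb = cells * 1 / 1e9

# CSR-style sparse storage: one 1-byte value and one 4-byte column index per
# non-zero, plus one 4-byte row pointer per row (and one extra at the end).
nnz = n_rows * max_nonzero_per_row
csr_gb = (nnz * (1 + 4) + (n_rows + 1) * 4) / 1e9

print(f"dense float64: {dense_float64_gb:.1f} GB")
print(f"dense float32: {dense_float32_gb:.1f} GB")
print(f"dense uint8:   {dense_uint8_gb:.1f} GB")
print(f"sparse CSR:    {csr_gb:.3f} GB")
```

Under these assumptions the dense matrix lands somewhere between roughly 12 and 100 GB, while a sparse layout stays well under 1 GB for the same 500000 samples.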

boreal gale May 9, 2023, 5:08 PM

#

imo just format to jsonl first.
using a big json blob is generally an anti-pattern when you are using spark.

i would sort out the input data format before thinking about anything else, e.g. "I think the issue is that it will be too big on one machine." is a secondary issue, spark can spill to disk if required, also not to mention there is sparse data structure support in spark.
all of these are pretty pointless if your input data format is borked and hard to work with (which it currently is, you still have one massive json blob)
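
A small sketch of the json to jsonl step, assuming the blob is a single top-level JSON array of records. The function name `json_array_to_jsonl` and the sample records are invented for illustration. `json.load` reads the whole array into memory once, so a multi-GB file may need a streaming parser instead.

```python
import io
import json


def json_array_to_jsonl(src, dst):
    """Rewrite a top-level JSON array as one JSON object per line.

    Returns the number of records written.
    """
    count = 0
    for record in json.load(src):
        dst.write(json.dumps(record))
        dst.write("\n")
        count += 1
    return count


# In-memory stand-ins for open("data.json") and open("data.jsonl", "w").
blob = io.StringIO('[{"deck": 1, "cards": {"foo": 1}}, {"deck": 2, "cards": {"bar": 25}}]')
out = io.StringIO()

written = json_array_to_jsonl(blob, out)
print(out.getvalue())
```

With one record per line, a reader can split the file on newlines and parse records independently, which is what makes the format friendlier to distributed tools.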

cold osprey May 9, 2023, 5:13 PM

#

timid kiln Yes, but that would require everyone that uses the Excel workbook to have python...

Power bi, or any other dashboarding tool

timid kiln May 9, 2023, 5:17 PM

#

cold osprey Power bi, or any other dashboarding tool

Understood. My industry has been slow to adopt such things. Many people use Excel as a word processor. :/

pallid badge May 9, 2023, 6:07 PM

#

timid kiln Yes, but that would require everyone that uses the Excel workbook to have python...

You start to learn command line tools or jupyter notebooks for workflows.

pallid badge May 9, 2023, 6:08 PM

#

rugged comet With 25000 columns and 500000 samples, that's already 60 GB I think. And that's ...

Why not hdf5 file format?

timid kiln May 9, 2023, 6:16 PM

#

pallid badge You start to learn command line tools or jupyter notebooks for workflows.

Asking commercial folks to use command line tools... lololol They just want the chart, they can't be bothered to do anything beyond that. Maybe click a button. In Excel. 😄

past meteor May 9, 2023, 6:23 PM

#

rugged comet The DBFS is free as far as I'm aware. Can't really beat that price.

Agreed, but azure data lake costs virtually nothing

#

What are you currently stuck on?

rugged comet May 9, 2023, 6:28 PM

#

pallid badge Why not hdf5 file format?

I haven't heard of that format.

past meteor May 9, 2023, 6:28 PM

#

hdf5 makes sense if you're ... working with Hadoop

rugged comet May 9, 2023, 6:30 PM

#

past meteor What are you currently stuck on?

Currently stuck on getting enough compute in the free azure trial. It appears that the quota for a free account is only 4 cores. The memory that comes with a 4 core cluster is only 14 GB.

past meteor May 9, 2023, 6:30 PM

#

What is your problem in full?

rugged comet May 9, 2023, 6:31 PM

#

past meteor hdf5 makes sense if you're ... working with Hadoop

I don't think I'm working with Hadoop.

past meteor May 9, 2023, 6:31 PM

#

no you're not

rugged comet May 9, 2023, 6:35 PM

#

past meteor What is your problem in full?

I currently have one of 5 json files of data. I want to cluster the data using kmeans. Some of the samples are labeled and others are not. This would be semi-supervised kmeans. I believe that I'll need to use a cloud service to do this. In its current form, the data would be only 3.5 GB roughly. However, if I encode the data so that it's ready for machine learning, the first file's samples would be about 60 GB. This is too big for my machine.
To elaborate on encoding the data, from the current data, I would create a column for each feature. There are about 25000 features. There are about 2500000 samples. I'm just trying to load the first 500000 right now as a proof of concept first.
Let me know what other questions you have.

past meteor May 9, 2023, 6:38 PM

#

So in total you'll have, what 300GB worth of data?

rugged comet May 9, 2023, 6:39 PM

#

If I encode it, I believe so, yes.

pallid badge May 9, 2023, 6:40 PM

#

past meteor hdf5 makes sense if you're ... working with Hadoop

I use hdf5 without Hadoop in my life.

#

But I know I can store enough data in it, I can read it lazily and incrementally

#

I just generated the other day a hdf5 file with 350GB.

cold osprey May 9, 2023, 6:42 PM

#

How does 3.5gb go to 300gb

young granite May 9, 2023, 6:42 PM

#

cold osprey How does 3.5gb go to 300gb

alot of preprocessing lel

past meteor May 9, 2023, 6:43 PM

#

Look, with polars you can use scan_ipc, scan_csv, scan_parquet so you can "easily" use the lazy API and sink_parquet to incrementally add your features etc. even if your dataset is larger than memory

pallid badge May 9, 2023, 6:43 PM

#

cold osprey How does 3.5gb go to 300gb

Is this for me?

#

Ah indeed, it blows up.

past meteor May 9, 2023, 6:43 PM

#

I don't know how you'll actually do k-means reasonably

#

I don't know what the spark implementation of it looks like

young granite May 9, 2023, 6:44 PM

#

past meteor I don't know how you'll actually do k-means reasonably

only useful with good data i guess

cold osprey May 9, 2023, 6:44 PM

#

pallid badge Is this for me?

Read wrongly, it was 3.5gb to 60gb

rugged comet May 9, 2023, 6:45 PM

#

cold osprey How does 3.5gb go to 300gb

Let me explain.
In its current form, one of the features of the json file is a dictionary of up to 100 keys and values.
Here's an example

"foo": 1,
"bar": 25,
"baz": 1,
...

To encode this for machine learning each of foo, bar, baz, etc would become a new column. There are 25000 unique values like foo, bar, baz, etc. So the dense data where it's a dict of less than 100 pairs gets turned into sparse data with 25000 columns.

young granite May 9, 2023, 6:45 PM

#

wild

cold osprey May 9, 2023, 6:46 PM

#

Rip kmeans

young granite May 9, 2023, 6:46 PM

#

and out of curiosity u get reasonable outputs from that structure?

rugged comet May 9, 2023, 6:46 PM

#

past meteor I don't know how you'll actually do k-means reasonably

Well we are going to try and see what happens. That's the point.

past meteor May 9, 2023, 6:46 PM

#

Tbh if I have that much data I'd be thinking about sampling

rugged comet May 9, 2023, 6:46 PM

#

young granite and out of curiosity u get reasonable outputs from that structure?

What do you mean?

young granite May 9, 2023, 6:46 PM

#

is the goal just to cluster or are u doing more with the data?

#

fancy dancy language model?

past meteor May 9, 2023, 6:47 PM

#

K-means can work with all your data on disk and passing sequentially but it'll just be slow lol

rugged comet May 9, 2023, 6:47 PM

#

young granite is the goal just to cluster or are u doing more with the data?

The first goal is to cluster, yes. We would like to do other stuff later. But we don't really know what that is yet. Just having fun.

rugged comet May 9, 2023, 6:48 PM

#

past meteor K-means can work with all your data on disk and passing sequentially but it'll j...

Can you elaborate?

past meteor May 9, 2023, 6:48 PM

#

But yeah, I guess that's what Spark's MLlib does either way

#

In a more optimized way ofc

rugged comet May 9, 2023, 6:48 PM

#

young granite fancy dancy language model?

No it isn't with text data really. It's with deck lists of cards from a card game.

past meteor May 9, 2023, 6:49 PM

#

So K-means has 2 steps right? An E step and an M step

young granite May 9, 2023, 6:49 PM

#

rugged comet No it isn't with text data really. It's with deck lists of cards from a card gam...

but if u got card data u dont need clustering?

rugged comet May 9, 2023, 6:50 PM

#

young granite but if u got card data u dont need clustering?

We want to cluster decks that are similar to each other together.

#

If possible

young granite May 9, 2023, 6:50 PM

#

ah ok

#

so by Card-IDs?

#

would be alot less features i guess

past meteor May 9, 2023, 6:50 PM

#

You can read subsets that fit into memory, assign them to clusters and then go back to disk

cold osprey May 9, 2023, 6:50 PM

#

Use count of a particular card type or smth

past meteor May 9, 2023, 6:50 PM

#

While you're doing this you can update the cluster center in an "online" way
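
One way to sketch the chunk-by-chunk idea on a single machine is scikit-learn's `MiniBatchKMeans`, whose `partial_fit` updates the centers from one batch at a time. This is a mini-batch variant swapped in for illustration, not plain k-means and not Spark's implementation. `iter_chunks` is a hypothetical stand-in for reading one memory-sized chunk from disk; here it just generates random sparse data.

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans


def iter_chunks(n_chunks=5, rows=200, n_features=1000):
    """Hypothetical reader: yields one sparse chunk at a time."""
    for seed in range(n_chunks):
        yield sparse.random(rows, n_features, density=0.01,
                            format="csr", random_state=seed)


kmeans = MiniBatchKMeans(n_clusters=4, n_init=3, random_state=0)

# Each call sees only one chunk and nudges the cluster centers.
for chunk in iter_chunks():
    kmeans.partial_fit(chunk)

# Assign a chunk to the learned centers; only that chunk is in memory.
labels = kmeans.predict(next(iter_chunks(n_chunks=1)))
print(kmeans.cluster_centers_.shape, labels[:10])
```

The chunks stay in sparse form throughout, so the full dense matrix is never built.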

rugged comet May 9, 2023, 6:50 PM

#

young granite so by Card-IDs?

There are 25000 different cards.

boreal gale May 9, 2023, 6:51 PM

#

ooo.. sounds like a problem suited for NMF.

past meteor May 9, 2023, 6:51 PM

#

No reason to handroll this because I'm pretty sure MLlib does this

rugged comet May 9, 2023, 6:51 PM

#

boreal gale ooo.. sounds like a problem suited for NMF.

Natural moisturizing factor (NMF) is essential for appropriate stratum corneum hydration, barrier homeostasis, desquamation, and plasticity.

boreal gale May 9, 2023, 6:51 PM

#

Non-negative matrix factorization

past meteor May 9, 2023, 6:51 PM

#

non-negative matrix factorization

boreal gale May 9, 2023, 6:52 PM

#

boreal gale imo just format to jsonl first. using a big json blob is generally anti-pattern ...

just in case you missed my comment here

rugged comet May 9, 2023, 6:53 PM

#

boreal gale Non-negative matrix factorization

Why do you think NMF would work well for this problem?

cold osprey May 9, 2023, 6:53 PM

#

Ah spark supports sparse format

#

Should reduce data size by alot

past meteor May 9, 2023, 6:54 PM

#

The parquet shouldn't be too large either I think?

#

You're just taking a JSON file and one-hot encoding the data right?

#

There's no need to store all those 0's I think

rugged comet May 9, 2023, 6:56 PM

#

past meteor You're just taking a JSON file and one-hot encoding the data right?

It's very very similar to one-hot encoding. However, instead of values of 0 and 1, it's values of 0-98.

rugged comet May 9, 2023, 6:56 PM

#

boreal gale just in case you missed my comment here

Thanks for the advice. I can try to convert my json to jsonl and see what happens.

cerulean kayak May 9, 2023, 6:56 PM

#

copper island Please advise me a book to start with date science in python)

the science of dating is not something we can help you with.

rugged comet May 9, 2023, 6:56 PM

#

past meteor There's no need to store all those 0's I think

What do you propose instead?

past meteor May 9, 2023, 6:57 PM

#

To look for a file format that works well with sparse matrices 👀

#

I don't know how jsonl works under the hood, I'd have a look at that

wooden sail May 9, 2023, 6:58 PM

#

there should be a straightforward way of exporting sparse mats as COO

rugged comet May 9, 2023, 6:59 PM

#

wooden sail there should be a straightforward way of exporting sparse mats as COO

What is COO in this context?

wooden sail May 9, 2023, 6:59 PM

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

#

coordinate array. the entries are saved as triples with row, column, value

boreal gale May 9, 2023, 7:00 PM

#

rugged comet Why do you think NMF would work well for this problem?

just a hunch, imo k-means's inductive bias is not great for your task, i can't really provide a formal proof or explain formally sadly.

also - NMF is a common technique for building recommendation engines, with a small tweak it could be used to do clustering (and building a recommendation engine is pretty similar to your task, instead of grouping users by movies they like, you are grouping decks by cards they have chosen)

(this class of technique is also called collaborative filtering iirc)
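
A toy sketch of the "small tweak": factorize the deck-by-card count matrix with scikit-learn's `NMF` and label each deck by its strongest component. The synthetic decks, the three "archetypes", and all sizes below are invented for illustration; whether this gives sensible groups on real deck lists is an open question.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_decks, n_cards, n_archetypes = 60, 30, 3

# Synthetic stand-in: each deck picks 6 cards from its archetype's block of 10.
true_archetype = rng.integers(0, n_archetypes, size=n_decks)
counts = np.zeros((n_decks, n_cards))
for deck, archetype in enumerate(true_archetype):
    block = np.arange(archetype * 10, archetype * 10 + 10)
    cards = rng.choice(block, size=6, replace=False)
    counts[deck, cards] = rng.integers(1, 5, size=6)

X = sparse.csr_matrix(counts)  # non-negative, mostly zeros

model = NMF(n_components=n_archetypes, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)  # deck-by-component weights
H = model.components_       # component-by-card weights

# The tweak: treat the dominant component of each deck as its cluster label.
labels = W.argmax(axis=1)
print(labels[:10])
```

Each row of `H` can be read as a "typical deck" for that component, which is a side benefit over plain cluster ids.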

wooden sail May 9, 2023, 7:00 PM

#

there are other flavors too, like nonzero cols and nonzero rows. which one works best depends on the structure of your sparse matrix

past meteor May 9, 2023, 7:01 PM

#

NMF is a collaborative filtering method, like alternating least squares etc etc

cold osprey May 9, 2023, 7:01 PM

#

Is the matrix sparse? It has values from 0-98 but are most of them 0 or it's distributed from 0 to 98

rugged comet May 9, 2023, 7:01 PM

#

cold osprey Is the matrix sparse? It has values from 0-98 but are most of them 0 or it's dis...

The vast majority are 0.

past meteor May 9, 2023, 7:01 PM

#

wooden sail coordinate array. the entries are saved as triples with row, column, value

The irony is that their format right now stores something like this already

wooden sail May 9, 2023, 7:02 PM

#

😩

rugged comet May 9, 2023, 7:02 PM

#

But there are about 100 values that range from 0-100.

past meteor May 9, 2023, 7:02 PM

#

Might be an idea to keep it like that

#

And only to expand it when you need to do your k-meansµ

wooden sail May 9, 2023, 7:02 PM

#

k-µeans

boreal gale May 9, 2023, 7:03 PM

#

k-µs

wooden sail May 9, 2023, 7:03 PM

#

if you were to use scipy's sparse matrices, doing k means should be very efficient

past meteor May 9, 2023, 7:03 PM

#

Iirc if you one-hot encode with sci-kit learn you get a sparse matrix as output anyway

rugged comet May 9, 2023, 7:03 PM

#

I mean what I have now is already a dense representation of the data. But I think when I do kmeans, I would need to expand it.

past meteor May 9, 2023, 7:03 PM

#

It's not exactly one-hot you're doing but you get my point

wooden sail May 9, 2023, 7:03 PM

#

there shouldn't be a need to expand at any point tbh

rugged comet May 9, 2023, 7:04 PM

#

wooden sail there shouldn't be a need to expand at any point tbh

o rly?

wooden sail May 9, 2023, 7:04 PM

#

most linear algebra packages have sparse formats

#

you should use them, not your own

cold osprey May 9, 2023, 7:04 PM

#

Iirc sklearn needs to expand it

#

Could be wrong

past meteor May 9, 2023, 7:04 PM

#

With expand I meant turning it into a sparse format a la scipy

rugged comet May 9, 2023, 7:04 PM

#

past meteor With expand I meant turning it into a sparse format a la scipy

That's what I meant by expand as well.

past meteor May 9, 2023, 7:04 PM

#

Because that's what one-hot in sklearn automatically does when your cols are beyond a certain number

wooden sail May 9, 2023, 7:05 PM

#

nothing in the distance computation requires you to explicitly have the vector in dense form

past meteor May 9, 2023, 7:05 PM

#

Actually, it's the default: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


rugged comet May 9, 2023, 7:06 PM

#

wooden sail nothing in the distance computation requires you to explicitly have the vector i...

What I have now is essentially the vector in dense form. I was under the impression that I needed it in sparse form for kmeans. Unless you misspoke.

past meteor May 9, 2023, 7:07 PM

#

rugged comet Let me explain. In its current form, one of the features of the json file is a ...

This is a sparse format already

#

more or less

cold osprey May 9, 2023, 7:07 PM

#

Depends on how he's doing it right

#

Could be dense?

wooden sail May 9, 2023, 7:07 PM

#

rugged comet What I have now is essentially the vector in dense form. I was under the impress...

you don't need it in either form, the computation can be done regardless

#

so might as well use a sparse one

#

what i wrote there was correct

rugged comet May 9, 2023, 7:08 PM

#

past meteor This is a sparse format already

No I think the sparse form would be a column for each foo, bar, baz, etc and there's 25000 unique of those.

cold osprey May 9, 2023, 7:08 PM

#

huh

past meteor May 9, 2023, 7:08 PM

#

To quote Edd: coordinate array. the entries are saved as triples with row, column, value

wooden sail May 9, 2023, 7:08 PM

#

that's COO

#

there's also CSR and CSC

past meteor May 9, 2023, 7:09 PM

#

Each JSON entry is a row, your key is a column and your value is the value

rugged comet May 9, 2023, 7:09 PM

#

past meteor Each JSON entry is a row, your key is a column and your value is the value

This makes sense.

wooden sail May 9, 2023, 7:09 PM

#

i leave you peeps to it, i just wanted to comment on sparse matrices 😛

past meteor May 9, 2023, 7:09 PM

#

I just wouldn't know how you'd get a JSON into COO maybe @wooden sail has pointers?

wooden sail May 9, 2023, 7:10 PM

#

is the json dense?

past meteor May 9, 2023, 7:10 PM

#

Afaik it's sparse as well

wooden sail May 9, 2023, 7:10 PM

#

what i mean is, does it have all the values?

rugged comet May 9, 2023, 7:10 PM

#

umm

wooden sail May 9, 2023, 7:10 PM

#

or it's already in a COO/CSR/CSC form

rugged comet May 9, 2023, 7:11 PM

#

I can give an example that might answer your question.

wooden sail May 9, 2023, 7:11 PM

#

because in those forms, only the nonzero values are stored

rugged comet May 9, 2023, 7:11 PM

#

Yes, only the non-zero values are currently stored.

cold osprey May 9, 2023, 7:11 PM

#

Ah so it's already sparse

wooden sail May 9, 2023, 7:11 PM

#

then it's indeed already sparse

#

you just need to load it in a friendly format for whatever module you're using to create sparse matrices

rugged comet May 9, 2023, 7:12 PM

#

wooden sail then it's indeed already sparse

Okay. I was misunderstanding what you all meant by sparse. I thought sparse data was mostly zeroes and a few "hot" features. It seems like that's not true though based on what you're saying now.

wooden sail May 9, 2023, 7:13 PM

#

that is indeed sparse, but when i mentioned COO, CSC and CSR, these are efficient sparse representations

#

they don't store all the zeros explicitly

#

so one usually refers to these special representations as sparse, and the matrix with all the 0s in it as dense

rugged comet May 9, 2023, 7:14 PM

#

wooden sail so one usually refers to these special representations as sparse, and the matrix...

That makes sense and confirms that I was misunderstanding the terminology.

#

So it sounds like I should try to figure out how to convert the json data I have now into an actual sparse matrix such as scipy's coo_matrix.

wooden sail May 9, 2023, 7:19 PM

#

i think that would be good

iron basalt May 9, 2023, 7:21 PM

#

Dense:
|1  2  3  4 |
|5  6  7  8 |
|9  10 11 12|
|13 14 15 16|
Sparse:
|1  0  0  4 |
|0  0  0  0 |
|0  10 0  0 |
|0  0  0  0 |
Sparse COO:
[(0, 0, 1), (0, 3, 4), (2, 1, 10)] <- Less memory usage, faster matrix multiplies (if sparse enough / large enough matrix).
Sparse CSR:
[1, 4, 10]       <- values
[0, 3, 1]        <- column indices
[0, 2, 2, 3, 3]  <- row pointers (one per row, plus one)
Even faster, but takes more time to build, and can't dynamically add more easily (build once, multiply many times).

#

(DOK is the same as COO, but uses a dict instead of a list, good for when you have non-zero entries added / removed dynamically all the time)
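
For reference, a minimal sketch that builds the 4x4 sparse example above with SciPy and prints the arrays CSR actually stores. The variable names are illustrative. Note that CSR's `indptr` has one entry per row plus one, so an empty row repeats the previous offset.

```python
import numpy as np
from scipy.sparse import coo_matrix

# (row, col, value) triples from the "Sparse COO" example above
rows = np.array([0, 0, 2])
cols = np.array([0, 3, 1])
vals = np.array([1, 4, 10])

coo = coo_matrix((vals, (rows, cols)), shape=(4, 4))
csr = coo.tocsr()

print(coo.toarray())
print("data:   ", csr.data)     # non-zero values, row by row
print("indices:", csr.indices)  # column index of each value
print("indptr: ", csr.indptr)   # where each row starts in data/indices
```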

wooden sail May 9, 2023, 7:26 PM

#

ah dok is good here, since json can be read as a dict

#

maybe that's the easiest for this case

iron basalt May 9, 2023, 7:26 PM

#

DOK is good for incremental construction, especially if out of order.

past meteor May 9, 2023, 7:26 PM

#

Still, how do you convert a JSON to DOK

wooden sail May 9, 2023, 7:27 PM

#

by converting to dict and passing the dict to scipy's sparse

rugged comet May 9, 2023, 7:27 PM

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html

wooden sail May 9, 2023, 7:27 PM

#

indeed

past meteor May 9, 2023, 7:27 PM

#

Interesting, TIL

wooden sail May 9, 2023, 7:28 PM

#

i've never used that specific one so i don't actually know what you pass it. i HOPE you can pass a dict of tuples or somth

iron basalt May 9, 2023, 7:28 PM

#

rugged comet https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.htm...

Note: Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.

#

```py
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update element
```

rugged comet May 9, 2023, 7:29 PM

#

iron basalt Note: `Allows for efficient O(1) access of individual elements. Duplicates are n...

I don't think I can guarantee that there are no duplicate decks.

wooden sail May 9, 2023, 7:29 PM

#

even if not possible to make the matrix out of a dict though, one can read the json as a dict and then make it into a list of tuples in O(nonzero entries), and feed that to scipy

#

then use COO

iron basalt May 9, 2023, 7:29 PM

#

If you try to insert a duplicate row, col, it will probably raise an exception.
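
A quick way to check this, as a sketch rather than a guarantee for every SciPy version: in the releases this was written against, assigning to an existing (row, col) in a DOK matrix overwrites the old value without raising, while COO keeps repeated pairs and sums them on conversion. Two decks with identical contents are separate rows, so they are not duplicates in this sense.

```python
import numpy as np
from scipy.sparse import coo_matrix, dok_matrix

# DOK: a second assignment to the same (row, col) replaces the value
dok = dok_matrix((2, 3), dtype=np.int32)
dok[0, 1] = 5
dok[0, 1] = 7
print(dok[0, 1])

# COO: repeated (row, col) pairs are stored as given and summed on conversion
rows = np.array([0, 0])
cols = np.array([1, 1])
vals = np.array([5, 7], dtype=np.int32)
coo = coo_matrix((vals, (rows, cols)), shape=(2, 3))
print(coo.tocsr()[0, 1])
```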

past meteor May 9, 2023, 7:30 PM

#

Can't you remove the duplicates from your JSONs?

#

Meh that'll impact k-means

rugged comet May 9, 2023, 7:32 PM

#

To be clear, my json looks like a single list of dictionaries where each dictionary is a sample.

wooden sail May 9, 2023, 7:33 PM

#

parsing text is not my forte so i leave y'all to it. i do know that json can be parsed directly into python dicts, so it shouldn't be too troublesome to massage the data into something scipy sparse likes

#

best of luck

rugged comet May 9, 2023, 7:33 PM

#

Thank you.

past meteor May 9, 2023, 7:34 PM

#

I learnt a lot from this convo though thanks edd and squiggle

iron basalt May 9, 2023, 7:34 PM

#

May require normalization, good luck. I really dislike JSON for reasons like this.

#

If possible convert it once to a better format and use that (if you need it multiple times).

rugged comet May 9, 2023, 8:02 PM

#

shape = (samples, cards)
mat = sp.dok_matrix(shape, dtype=np.int8)

for id, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        mat[id, card] = quantity

I think I'm on the right track with this.
The id should be the row, the card should be the column, and the quantity should be the value.
However, card is still a string. I think perhaps it should actually be the index of the card if it were a column?

#

I was inspired by this
https://stackoverflow.com/questions/37862139/convert-dictionary-to-sparse-matrix

Stack Overflow

convert dictionary to sparse matrix

I have a dictionary with keys as user_ids and values as list of movie_ids liked by that user with #unique_users = 573000 and # unique_movies =16000.
{1: [51, 379, 552, 2333, 2335, 4089, 4484],
...

rugged comet May 9, 2023, 10:09 PM

#

iron basalt ```py import numpy as np from scipy.sparse import dok_matrix S = dok_matrix((5...

Incrementally making the dok matrix seems very slow. Is there any faster way than manually looping through the samples?

cards_list = list(unique_cards)
mapper = {card: index for index, card in enumerate(cards_list)}
for id, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        mat[id, mapper[card]] = quantity
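
One way to avoid the per-element assignments, sketched under the assumption that `decks` is a list of dicts with a `"cards"` mapping of card name to quantity (as described earlier): collect row indices, column indices and values in plain lists, then hand them to `coo_matrix` in one call. The sample decks and card names below are made up.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical sample data in the shape described above
decks = [
    {"cards": {"Sol Ring": 1, "Island": 30}},
    {"cards": {"Island": 12, "Forest": 20}},
    {"cards": {"Sol Ring": 1}},
]

# Map each card name to a column index
unique_cards = sorted({card for deck in decks for card in deck["cards"]})
mapper = {card: index for index, card in enumerate(unique_cards)}

rows, cols, vals = [], [], []
for row, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        rows.append(row)
        cols.append(mapper[card])
        vals.append(quantity)

# Build once in COO, then convert to CSR for fast row access and arithmetic
shape = (len(decks), len(unique_cards))
mat = coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.int8).tocsr()
print(mat.toarray())
```

The loop still runs in Python, but appending to lists is much cheaper than item assignment on a sparse matrix. Decks with different numbers of unique cards simply contribute different numbers of triples, so no padding is needed.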

mild dirge May 9, 2023, 10:10 PM

#

Can you not use smart indexing like with numpy arrays?

#

like mat[ys, xs] = vals

rugged comet May 9, 2023, 10:11 PM

#

mild dirge Can you not use smart indexing like with numpy arrays?

Can you elaborate, please? I haven't heard about this.

#

Oh

iron basalt May 9, 2023, 10:13 PM

#

rugged comet Incrementally making the dok matrix seems very slow. Is there any faster way tha...

Ideally you could pass the dict to the dok_matrix directly and it would internally (hopefully in C or something) do a fast loop for you.

#

Or the file directly.

#

I just use my own sparse matrix types written in C with Python bindings so IDK.

rugged comet May 9, 2023, 10:16 PM

#

iron basalt Ideally you could pass the dict to the dok_matrix directly and it would internal...

For this approach, what would the dictionary look like? My guess is that the keys would be maybe the deck id and the values would be the cards.

iron basalt May 9, 2023, 10:17 PM

#

rugged comet For this approach, what would the dictionary look like? My guess is that the key...

I don't think dok_matrix from scipy can take a dict.

#

When I run into performance issues where I need to do manual loops in Python I tend to make my own library and then call that from Python.

rugged comet May 9, 2023, 10:20 PM

#

iron basalt I don't think dok_matrix from scipy can take a dict.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html#scipy.sparse.dok_matrix
Maybe I could use a coo matrix then?

mild dirge May 9, 2023, 10:22 PM

#

It can take a dict

mild dirge May 9, 2023, 10:22 PM

#

rugged comet https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.htm...

Look at the docs

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.update.html#scipy.sparse.dok_matrix.update

rugged comet May 9, 2023, 10:23 PM

#

mild dirge https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.upd...

    def update(self, val):
        # Prevent direct usage of update
        raise NotImplementedError("Direct modification to dok_matrix element "
                                  "is not allowed.")

#

idk

iron basalt May 9, 2023, 10:24 PM

#

_update

#

https://github.com/scipy/scipy/blob/v1.10.1/scipy/sparse/_dok.py#L113

arctic wedgeBOT May 9, 2023, 10:25 PM

#

scipy/sparse/_dok.py line 113

def _update(self, data):

mild dirge May 9, 2023, 10:25 PM

#

Yeah ig, don't see why it has an unimplemented update method...

iron basalt May 9, 2023, 10:25 PM

#

"""An update method for dict data defined for direct access to
        `dok_matrix` data. Main purpose is to be used for effcient conversion
        from other spmatrix classes. Has no checking if `data` is valid."""
        return dict.update(self, data)

mild dirge May 9, 2023, 10:26 PM

#

But I doubt it is a lot quicker, seems like it is just a dict underneath

#

So no fancy C shenanigans

iron basalt May 9, 2023, 10:27 PM

#

Yeah, and reading from the file is also something you probably want to happen in C.

rugged comet May 9, 2023, 10:27 PM

#

Reading from the file is pretty quick. Only 11 seconds.

iron basalt May 9, 2023, 10:28 PM

#

I guess you only need to read it once?

#

In C I'd estimate like a few ms.

#

If you don't want to touch C, Mypyc and Cython are options.

rugged comet May 9, 2023, 10:50 PM

#

iron basalt _update

I thought we weren't supposed to use those private methods.

iron basalt May 9, 2023, 10:52 PM

#

rugged comet I thought we weren't supposed to use those private methods.

Yeah, but I do anyhow when needed because i'm a bad programmer.

#

(I actually just make my own library that does just take in the file so I don't have this issue)

rugged comet May 9, 2023, 10:56 PM

#

I think things are working now.

#

Looks like sklearn's KMeans only works with CSR format sparse matrices. Good to know.
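
A minimal sketch of feeding a sparse matrix to scikit-learn's `KMeans`, using a small made-up deck matrix. Converting with `.tocsr()` first matches the observation above. The cluster count, `n_init` and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import KMeans

# Hypothetical 4 decks x 5 cards matrix: decks 0/1 and decks 2/3 are identical
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 1, 0, 1, 3, 4, 3, 4]
vals = [4, 2, 4, 2, 3, 1, 3, 1]
mat = coo_matrix((vals, (rows, cols)), shape=(4, 5), dtype=np.float64).tocsr()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(mat)
print(labels)
```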

rugged comet May 10, 2023, 2:49 AM

#

I think I neglected to mention that our arrays of deck lists are jagged. One deck might have 95 unique cards and another might have 100.
What should we do in this case? I was thinking we could pad the arrays to the max length using a number that isn't being used to represent a card.

severe topaz May 10, 2023, 3:04 AM

#

iron basalt In C I approximate like a few ms.

🤭I’ve never heard of these

severe topaz May 10, 2023, 3:06 AM

#

rugged comet https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.htm...

What is scipy? It seems to be a package I’m not familiar with. What are you using it for? I’ve used tkinter ….

serene scaffold May 10, 2023, 3:13 AM

#

severe topaz What is scipy it seems to be a package I’m not familiar with. What are using it ...

tkinter doesn't really have anything to do with data science. scipy is more functions for numpy, basically

rugged comet May 10, 2023, 3:17 AM

#

severe topaz What is scipy it seems to be a package I’m not familiar with. What are using it ...

scipy has some data structures for matrices. That's all I know so far.

serene scaffold May 10, 2023, 3:17 AM

#

rugged comet scipy has some data structures for matrices. That's all I know so far.

pretty sure that part is numpy

rugged comet May 10, 2023, 3:18 AM

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html
This is what I'm exploring now.

serene scaffold May 10, 2023, 3:19 AM

#

hmm okay. I've mostly used scipy for the stats stuff (https://docs.scipy.org/doc/scipy/reference/stats.html)

severe topaz May 10, 2023, 3:20 AM

#

Elaborate?

rugged comet May 10, 2023, 3:21 AM

#

Well I need a matrix representation that is small (sparse). To do this, I'm using scipy's sparse matrices. Think of it like having lots of (row, column, value) elements.

rugged comet May 10, 2023, 3:51 AM

#

mild dirge like `mat[ys, xs] = vals`

I'm trying to understand this
https://scipy-lectures.org/advanced/scipy_sparse/lil_matrix.html
It looks like they do support the fancy indexing. Can you help me understand how it works, please?

#

Not sure what conclusion I can come to. Creating the matrix incrementally seems too slow and creating it all at once isn't feasible due to memory problems.

iron basalt May 10, 2023, 4:07 AM

#

rugged comet I'm trying to understand this https://scipy-lectures.org/advanced/scipy_sparse/l...

mtx[:2, [1, 2, 3]] = data

#

First two rows, column indices 1, 2, 3 = data.
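
A small sketch of both indexing styles on a `lil_matrix`, with made-up shapes and values. The first assignment mirrors the line from the lecture notes. The second is the element-wise `mat[ys, xs] = vals` form suggested earlier, where each (row, col) pair receives one value.

```python
import numpy as np
from scipy.sparse import lil_matrix

mtx = lil_matrix((4, 5))

# First two rows, columns 1, 2, 3: the right-hand side has shape (2, 3)
data = np.array([[1, 2, 3], [4, 5, 6]])
mtx[:2, [1, 2, 3]] = data

# Element-wise form: one (row, col) pair per value
ys = np.array([2, 3, 3])
xs = np.array([0, 0, 4])
vals = np.array([7, 8, 9])
mtx[ys, xs] = vals

print(mtx.toarray())
```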

severe topaz May 10, 2023, 4:28 AM

#

Ok…

dusty bay May 10, 2023, 5:03 AM

#

I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI.py Code

class Myapp():
    
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.root, text="View Plot", font=("Ubuntu", 12))
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))
app = Myapp()
app.root.mainloop()

And here is the viewer.py code

import pandas as pd
import matplotlib.pyplot as plt


class csv2df():
    
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])
        
    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        
        plt.show()
        
data = csv2df()
data.plot()

Can you please fix it as I need this for a project.
Thank You.

midnight skiff May 10, 2023, 8:33 AM

#

Is there any merit to Mojo's hype?

cold osprey May 10, 2023, 8:58 AM

#

we'll see

boreal gale May 10, 2023, 9:05 AM

#

personally can't wait to get my hands on it to see what's up, i have a fair bit of numerical code that is in need of optimisation

i have shoehorned them into numba at the moment but it looks kinda ugly and hard to maintain

wooden sail May 10, 2023, 9:10 AM

#

boreal gale personally can't wait to get my hands on it to see what's up, i have a bit fair ...

have you tried jax yet :x

boreal gale May 10, 2023, 9:14 AM

#

hmm nope, what's the headline difference between that and numba?

also probably worth mentioning i have extremely limited chance of utilising vectorisation - i am dealing with streaming data

wooden sail May 10, 2023, 9:16 AM

#

it's an implementation of numpy and scipy on XLA, so the code looks just like usual numpy. it brings its own jit and autodiff though, and the jit is a lot more flexible than numba's. also can run on gpu and tpu without (m)any changes. numba only has few numpy and scipy functions jitable, and only with limited arguments

#

as an example, specifying order='F' is not supported on most numpy and scipy functions with numba

#

no aot though, only jit. i think numba has aot

boreal gale May 10, 2023, 9:18 AM

#

oooo.. that sounds very promising.

https://jax.readthedocs.io/en/latest/jax-101/07-state.html <- this is also exactly what i need

more things for me to play with!

wooden sail May 10, 2023, 9:18 AM

#

i like it a lot tbh

boreal gale May 10, 2023, 9:18 AM

#

you can always simulate aot manually kek

wooden sail May 10, 2023, 9:18 AM

#

call the function before actual execution during initialization 🤡

#

for loops and the prng do require a little getting used to, you CAN but don't really wanna use native python loops

#

the biggest selling point for me is being able to jit, autodiff, and run on gpu while still looking like numpy

#

for most simple functions, you can straight up replace import numpy as np with import jax.numpy as np
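
A small sketch of the points above, assuming `jax` is installed. The `rms_level` function is a made-up example. It is written like NumPy code, then wrapped with `jax.jit` and `jax.grad`. The extra call before the real one is the "warm-up" trick mentioned earlier for triggering compilation up front.

```python
import numpy as np
import jax
import jax.numpy as jnp

def rms_level(x):
    # Written exactly as it would be with numpy, just using jnp
    return jnp.sqrt(jnp.mean(x ** 2))

rms_jit = jax.jit(rms_level)    # compiled on first call
rms_grad = jax.grad(rms_level)  # derivative with respect to x

x = jnp.arange(1.0, 5.0)        # [1., 2., 3., 4.]

rms_jit(x)                      # warm-up call triggers compilation
print(rms_jit(x))
print(rms_grad(x))
```

JAX defaults to 32-bit floats, so results can differ slightly from plain NumPy.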

boreal gale May 10, 2023, 9:23 AM

#

that sounds awesome

btw, how big is the overhead of moving data to-and-from GPU these days? i haven't looked into that space for years, curious to know at what size of data would you gain noticeable speed gain by shifting your workload to GPU now

wooden sail May 10, 2023, 9:25 AM

#

it's still the bottleneck

#

even just moving/copying stuff in memory is usually the bottleneck. it gets worse if you move between mem and vmem

#

that's why quadro and a100 cards cost an arm and a leg

dusty bay May 10, 2023, 9:28 AM

#

dusty bay I've created a button to display a graphic from viewer.py file. I want the graph...

bro, can u help me with this issue

hoary wigeon May 10, 2023, 10:53 AM

#

###################################

I NEED HELP! with SHAPLEY VALUES

###################################

I want to know how to calculate shap values on record level.. and what are the units of shap value.

I have built an XGBoost Classifier model and I'm using same model to calculate the shap values. I'm confused with the unit of value that shap returns and If possible I need it in probability.

#

I just got to know that shap returns log-odds for an XGBoost Classifier..
I'm trying to convert the values to probability using the function below

def logit2prob(logit_val):
    prob = 1 / (1 + np.exp(-logit_val))
    return prob

But when I try to sum up the probability values on record level, it doesn't add up to the class-1 probability predicted by the XGBoost model.
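
A worked sketch of why the per-feature conversions don't add up, using made-up numbers rather than real `shap` output. It assumes the explainer returns contributions in log-odds space, where the base value plus the sum of the contributions equals the model's raw margin. The sigmoid is nonlinear, so it has to be applied once to that total, not to each contribution.

```python
import numpy as np

def logit2prob(logit_val):
    return 1 / (1 + np.exp(-logit_val))

# Hypothetical values for one record (not produced by a real model)
base_value = -0.4                         # expected log-odds over the data
shap_values = np.array([0.9, -0.3, 0.6])  # per-feature log-odds contributions

# Additivity holds in log-odds space
margin = base_value + shap_values.sum()
predicted_prob = logit2prob(margin)

# Converting each contribution separately and summing gives something else
sum_of_probs = logit2prob(shap_values).sum()

print(predicted_prob, sum_of_probs)
```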

agile cobalt May 10, 2023, 12:03 PM

#

hoary wigeon ################################### ### **I NEED HELP! with SHAPLEY VALUES** ...

!e you meant this? ```py
data = [
    [10, 20, 30],
    [1, 2, 3],
    [100, 200, 300],
]
import numpy as np
arr = np.array(data)
print(np.sum(arr))
print(np.sum(arr, axis=-1))
```

arctic wedgeBOT May 10, 2023, 12:03 PM

#

@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 666
002 | [ 60   6 600]

agile cobalt May 10, 2023, 12:04 PM

#

oh wait no, exp isn't even an aggregating function
I have no idea about what you mean by record level in that case, post an example with input&desired output

wheat snow May 10, 2023, 1:07 PM

#

<@&831776746206265384> i accidentally deleted a VERY long message explaining my problem.. ( i just wanted to delete the images and not the whole message) could someone pls recover it from the server logs? and send it into some DM or channel so i can edit it properly and paste it in here?

rugged mist May 10, 2023, 1:10 PM

#

dm'd you you might want to fix the formatting

wheat snow May 10, 2023, 1:11 PM

#

rugged mist dm'd you you might want to fix the formatting

u mean the problem or the way i typed the message?

rugged mist May 10, 2023, 1:12 PM

#

it lost formatting when i copied it

wheat snow May 10, 2023, 1:12 PM

#

Sup, i have a df that looks like that:

                Timestamp              Creator       Year     Month
27414   2021-02-10 21:01:12           GameSünden    2021      2
34085   2019-08-15 09:27:56                Kedos    2019      8
41306   2018-06-10 18:41:54             Dream TV    2018      6
653     2023-03-21 15:36:00            King Fish    2023      3
48795   2017-06-24 08:43:31       Mrmobilefanboy    2017      6
25894   2021-04-05 00:16:51                  WWE    2021      4
25397   2021-04-17 17:29:08   Étienne MzA Gaming    2021      4
1450    2023-03-06 15:26:23          Nicholas Ma    2023      3
4257    2023-01-22 20:47:28           NRML MTBer    2023      1

I now want to create a MultiIndex DataFrame which allows me to track how my viewing habits of certain Youtubers have changed from 2017 to 2023 on a monthly basis. By "viewing habits" i mean how many videos i watched of a certain creator

therefore i have a df of the creators that i wanna track...

top_creator=(temp_df.value_counts().sort_values(ascending=False))[0:15]
Paluten                 643      --> Amount of total videos i watched by the creator
Galileo                 631
ExplosmEntertainment    542
Benx                    488
DieBuddiesZocken        395
...

So... i am looking for some help to this code:

df_creator_track = df.resample('M', on='Timestamp')['Creator'].value_counts().sort_values(ascending=False)

I am still missing how to include the list of the top_creators from above in this. My goal is to achieve something i will share in the next picture

#

wheat snow May 10, 2023, 1:13 PM

#

rugged mist it lost formatting when i copied it

oh u mean backticks... right they get lost all the time when formatting

#

And under those:
...
...
...

i only want the creators to show up that i have in my `top_creator` df

errant lake May 10, 2023, 1:45 PM

#

@wheat snow I must be misunderstanding your need because it feels like you could get your result with a groupby?

wheat snow May 10, 2023, 1:46 PM

#

yes... kinda. im honestly not a big expert in the groupby function

errant lake May 10, 2023, 1:48 PM

#

no worries, I can try to point you in the right direction:
I would:

1. Transform your timestamp column so that it translates every date to only year-month (use pd.to_datetime as well as the format option) > let's name the transformed column 'month'
2. Group by the 'month' as well as the 'creator' column and sum the views

wheat snow May 10, 2023, 1:48 PM

#

i just looked through some pages in my book, and maybe found an idea... what about pivoting the df... yk flip it up take the channels i am looking for as columns and only leaving the the Timestamp

errant lake May 10, 2023, 1:49 PM

#

It would work but you would be left with a wide data structure

#

which is not ideal, I prefer long 😄

wheat snow May 10, 2023, 1:50 PM

#

errant lake no worries, I can try to point you in the right direction: I would: 1. Transform...

so:

2022-04-29 15:25:16    --> 2022 | 4
so 2 columns year and month? or in one column?

errant lake May 10, 2023, 1:51 PM

#

From your need I think a single column of format 2023-01 would be more helpful

wheat snow May 10, 2023, 1:52 PM

#

errant lake From your need I think a single column of format `2023-01` would be more helpful

alright, splicing would be easier then i think

errant lake May 10, 2023, 1:52 PM

#

That's one way, for example:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year_month'] = df['timestamp'].dt.strftime('%Y-%m')

#

then you'll just be one groupby away :

df = df.groupby(['year_month', 'creator_name'])['views'].sum()

👍

wheat snow May 10, 2023, 1:55 PM

#

errant lake then you'll just be one groupby away : ```py df = df.groupby(['year_month', 'cre...

thing is i dont have a views column... the "views" can be seen as the amount of rows that exist in the Month we are currently looking at

#

so .count?

errant lake May 10, 2023, 1:55 PM

#

Ah, yes

#

or you insert a column of 1s, because that's one view on every row
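
Putting the suggestions above together as a sketch: since each row is one watched video, counting rows per (month, creator) group gives the views. The column names follow the DataFrame shown earlier. The sample rows and the `top_creators` list are made up.

```python
import pandas as pd

# Hypothetical sample with the same columns as the DataFrame above
df = pd.DataFrame({
    "Timestamp": ["2021-02-10 21:01:12", "2021-02-11 09:00:00",
                  "2021-02-12 10:30:00", "2021-03-01 08:15:00",
                  "2021-03-02 19:45:00"],
    "Creator": ["Paluten", "Paluten", "Galileo", "Paluten", "Kedos"],
})
top_creators = ["Paluten", "Galileo"]

df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["year_month"] = df["Timestamp"].dt.strftime("%Y-%m")

# Keep only the tracked creators, then count rows per month and creator
tracked = df[df["Creator"].isin(top_creators)]
views = tracked.groupby(["year_month", "Creator"]).size().rename("Views")
print(views)
```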

wheat snow May 10, 2023, 1:56 PM

#

errant lake or you insert a column of 1s, because that's one view on every row

bruh true

#

why im thinking so hard

past meteor May 10, 2023, 5:24 PM

#

cold osprey we'll see

The issue with Mojo's benchmarks is that they are crap. Just like Julia's benchmarks

#

They benchmarked matrix multiplication with stdlib lists as matrices and compared it to their highly optimized stuff lol

#

Why not at the very least compare it to Numpy / Jax / Numba / Cython / ...

agile cobalt May 10, 2023, 5:29 PM

#

past meteor They benchmarked matrix multiplication with stdlib lists as matrices and compare...

I am pretty sure that the matrix multiplication demo was not meant to be a benchmark?
their actual "benchmarks" are in https://performance.modular.com/ though tbh still unsatisfactory

Modular Performance Dashboard

#

for proper benchmarks you would usually rather rely on third parties than on what the providers say about themselves anyway

past meteor May 10, 2023, 5:30 PM

#

The notebook they handed out to content creators like Fireship did frame it like a benchmark tbh

agile cobalt May 10, 2023, 5:30 PM

#

they did not "hand it out" to Fireship and others

past meteor May 10, 2023, 5:31 PM

#

Then what happened there?

agile cobalt May 10, 2023, 5:32 PM

#

they posted it publicly and mentioned in the launch video, and content producers used the material publicly available

past meteor May 10, 2023, 5:33 PM

#

agile cobalt I am pretty sure that the matrix multiplication demo was not meant to be a bench...

The actual benchmarks you linked look a lot more convincing than the notebook I saw on Fireship so I'll walk back a big part of my claim

agile cobalt May 10, 2023, 5:34 PM

#

They said up to N times faster than python, in great part for effect/impact, but I don't think anyone would interpret that as N times faster than python with numpy/tensorflow/pytorch/etc

past meteor May 10, 2023, 5:36 PM

#

Such claims make me a bit skeptical about the thing in its entirety because they could just not have done that. Imo not making those claims gives a lot more credibility.

#

Because their actual benchmarks (the ones you linked) are credible / interesting

#

Maybe that's just me though, I'm "allergic" to marketing. 🤷‍♂️

wooden sail May 10, 2023, 5:53 PM

#

a lot of comparisons vs julia are also bad, they don't put the code into functions and the jit never gets used

past meteor May 10, 2023, 5:53 PM

#

So the benchmarks would be biased against Julia in that case?

wooden sail May 10, 2023, 5:54 PM

#

yep

past meteor May 10, 2023, 5:54 PM

#

Interesting

lapis sequoia May 10, 2023, 5:56 PM

#

Is that true?

past meteor May 10, 2023, 5:57 PM

#

You can just output raw scores and inspect the PR-curve /ROC / ... without upsampling/upweighting etc.

agile cobalt May 10, 2023, 6:05 PM

#

lapis sequoia Is that true?

well, class weights is literally making the performance biased towards underrepresented groups in the data
if you only care about the balanced accuracy it probably™️ should help most of the time (as long as your data is not way too ridiculously unbalanced like 1 to 10k, in which case you might as well use outlier detection instead of classification)

lapis sequoia May 10, 2023, 6:05 PM

#

Hmm. It's just like 1:2

#

And the weights made it worse

agile cobalt May 10, 2023, 6:05 PM

#

GPT itself is pretty biased towards keeping true to its prior statements though, especially if you get argumentative.
try asking about it in a new session

agile cobalt May 10, 2023, 6:06 PM

#

lapis sequoia And the weights made it worse

are you looking at accuracy or at balanced accuracy?

lapis sequoia May 10, 2023, 6:09 PM

#

Accuracy

#

DO I have to use balanced accuracy after using weights?

#

@agile cobalt

agile cobalt May 10, 2023, 6:09 PM

#

not really

#

but if you care equally about each class (instead of equally about each record), then it makes more sense to look into balanced accuracy than normal accuracy
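
A small worked sketch of the difference, using made-up labels with the 1:2 imbalance mentioned above and a degenerate model that always predicts the majority class. Balanced accuracy is computed by hand here as the mean of the per-class recalls.

```python
import numpy as np

# Hypothetical 1:2 imbalanced labels: 10 of class 0, 20 of class 1
y_true = np.array([0] * 10 + [1] * 20)

# A degenerate model that always predicts the majority class
y_pred = np.ones_like(y_true)

# Plain accuracy weighs every record equally
accuracy = np.mean(y_pred == y_true)

# Balanced accuracy weighs every class equally: mean of per-class recall
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
balanced_accuracy = np.mean(recalls)

print(accuracy, balanced_accuracy)
```

Class weights push the model toward the minority class, so plain accuracy can drop even while balanced accuracy improves.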

lapis sequoia May 10, 2023, 6:10 PM

#

Is this good enough justification for the behaviour?

agile cobalt May 10, 2023, 6:11 PM

#

lapis sequoia May 10, 2023, 6:11 PM

#

YEs

agile cobalt May 10, 2023, 6:11 PM

#

pretty sure that it's just that really

lapis sequoia May 10, 2023, 6:11 PM

#

Nice

agile cobalt May 10, 2023, 6:11 PM

#

~~don't quote me on that though~~

lapis sequoia May 10, 2023, 6:11 PM

#

lapis sequoia Is that true?

That's why I had this follow up question

lapis sequoia May 10, 2023, 6:12 PM

#

agile cobalt ~~don't quote me on that though~~

I will add your discord tag in my assignment

#

And put you in the references

agile cobalt May 10, 2023, 6:12 PM

#

eh, at least do compare the balanced accuracy of the models with and without class weights if you can

lapis sequoia May 10, 2023, 6:21 PM

#

nice beautiful pfp

#

The whole profile looks pretty fancy

night prawn May 10, 2023, 7:54 PM

#

I want to use the gpu for my GAN but i don't know how to do this?

wooden sail May 10, 2023, 7:55 PM

#

which module did you make your gan with? pytorch?

night prawn May 10, 2023, 7:55 PM

#

No i use tensorflow

wooden sail May 10, 2023, 7:55 PM

#

that works too

#

did you install a gpu version of tensorflow?

wheat snow May 10, 2023, 7:58 PM

#

```
year_month  Creator
2017-05     Benx                    37
            Galileo                 30
            LeKoopa                 17
            TheBietz                 2
            baastiZockt             36
2017-06     Benx                    54
            Galileo                 40
            LOGO                    48
            LeKoopa                 45
            TheBietz                45
            baastiZockt             83
```
how would i sort smth like that? i want each month to be sorted from highest to lowest, so for each month we still have 5 values and they should be sorted

this is what i used to create it:

```py
month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()
```

based on that df:

                   Creator           Timestamp
0               Phil Laude 2023-03-29 07:47:14
1                Sing King 2023-03-29 00:18:05
2                  orijimi 2023-03-29 00:17:53
6                Sing King 2023-03-28 22:22:05
9                  orijimi 2023-03-28 21:19:02
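
One possible way to sort within each month, sketched on a made-up Series shaped like the output above: turn the MultiIndex into columns, sort by month ascending and views descending, then restore the index.

```python
import pandas as pd

# Hypothetical monthly counts with a (year_month, Creator) MultiIndex
month = pd.Series(
    [37, 30, 17, 54, 40, 83],
    index=pd.MultiIndex.from_tuples(
        [("2017-05", "Benx"), ("2017-05", "Galileo"), ("2017-05", "LeKoopa"),
         ("2017-06", "Benx"), ("2017-06", "Galileo"), ("2017-06", "baastiZockt")],
        names=["year_month", "Creator"],
    ),
    name="Views",
)

# Months stay in chronological order, views go from highest to lowest
sorted_month = (
    month.reset_index()
    .sort_values(["year_month", "Views"], ascending=[True, False])
    .set_index(["year_month", "Creator"])["Views"]
)
print(sorted_month)
```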

night prawn May 10, 2023, 8:00 PM

#

wooden sail did you install a gpu version of tensorflow?

I used this command : pip install tensorflow

wooden sail May 10, 2023, 8:01 PM

#

https://www.tensorflow.org/install/pip#windows-native here's how to install the gpu flavor on windows, provided you have an nvidia gpu and have already installed its drivers

#

the cuda toolkit and cudnn need to be installed along with tensorflow

torn gull May 10, 2023, 9:07 PM

#

Hi! Could someone help me with some language modeling? ie. BiLSTM for next word prediction?

torn gull May 10, 2023, 10:19 PM

#

https://discord.com/channels/267624335836053506/1105979451604480041

limber kiln May 10, 2023, 10:30 PM

#

How does Pandas apply work when you select a particular axis (let's say axis = 0). Does it go over cell of a column one by one, or does it perform a vectorized operation?

hasty mountain May 10, 2023, 10:39 PM

#

Guys, I'm trying to make a decent Variational AutoEncoder, but it seems that it only produces really blurry images. Is it a sign that I need to make it train for even more time? Or do I need to increase its parameters(like the latent vector size)?

#

Hm... I guess I'll try increasing my decoding loss weight...if it doesn't work, increase the encoding dimension pithink

#

I'm not planning to use this VAE to generate images by itself, so I guess prioritizing the decoding loss might actually be good?

rare pagoda May 11, 2023, 12:40 AM

#

I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both offer far more threads for parallelization than a CPU. Resources online say that an FPGA offers "lower latency" and more "power efficiency" but they don't go into specifics. From what I understand each CPU or GPU core executes instructions one at a time, like MOV, PUSH, or ADD, whereas an FPGA, like an electrical circuit with IC Logic Gates, can perform numerous operations on a given value in one clock cycle without having to put values in registers then wait for another clock cycle. Is this what they mean?

iron basalt May 11, 2023, 12:58 AM

#

rare pagoda I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both...

FPGAs are programmable hardware. They can be configured to act as if you had made a specific hardware device for your specific problem.

#

They are programmed using a hardware description language (HDL), which is even more low level than assembly / machine code / micro-ops.

#

GPUs are not well suited to all problems, they are more general now than they used to be, but there is still only a certain set of problems they are good at.

#

FPGAs can be setup to be "flow-through" where there is not really a clock (except for the processor that is often on the same board to control the FPGA, and some parts still need a clock (memory)), it's just happening, in the same way a regular circuit does not need a clock.

#

If I have an AND gate and flip one of the inputs, the output changes, without waiting for some clock.

#

*Also CPU and GPU cores execute more than one instruction at a time.

hasty mountain May 11, 2023, 1:40 AM

#

Squiggle...you and my physiology teacher are making me really consider the possibility of diving into robotics...

the only problem is that I lack the lifespan for so many things grumpchib

#

But I'll check if coursera has anything about it anyway

#

I've found some which use...Python? I wasn't expecting Python to be used in robotics. I thought it would be more something slightly lower-level, like C++, C...

iron basalt May 11, 2023, 1:50 AM

#

hasty mountain I've found some which use...Python? I wasn't expecting Python to be used in robo...

A lot of people in robotics (not meant as an insult, they are very talented people) are not that proficient in programming (they are more interested in the physical machine) and C++ is not exactly a language that helps the user become proficient in a straightforward way. And Python happens to be a language that many non-programmers find easy to use. Plus you get access to all these libraries (e.g. OpenCV and now all the ML ones). However a lot of C/C++ still happens, because you often have to work with whatever SDK you are provided by the hardware manufacturer.

hasty mountain May 11, 2023, 1:52 AM

#

I see... Yeah... I wasn't expecting the folks in robotics to not be that proficient in programming pithink
I mean... they're not that far from dealing with hardware, I guess, and hardware seems to require quite low-level programming

iron basalt May 11, 2023, 1:52 AM

#

Being a proficient programmer and into robotics makes you very desirable by many.

hasty mountain May 11, 2023, 1:53 AM

#

After all, our machines are like robots...but instead of movements and actions in the real world, they perform them in the digital world 🧠

iron basalt May 11, 2023, 1:53 AM

#

It's also a supply and demand issue. There are only so many people into robotics. It's not exactly as easy to find a job for it as something like web development.

molten hamlet May 11, 2023, 2:04 AM

#

Can you sync matplotlib figures when using show ?
What I mean by that: I explore data on one plot, but want to see the same position in another window, for example raw and filtered data. It's just too much to do on a single

#

one plot 😐

#

need to put more dots

hasty mountain May 11, 2023, 2:11 AM

#

molten hamlet Can you sync matplotlib figures when using `show` ? What I mean by that, I explo...

Try using fig, ax = plt.subplots(x, y)

```py
import matplotlib.pyplot as plt

# saving_image and original_image are lists of 4 images each, defined elsewhere
fig, ax = plt.subplots(2, 4)

# Hide the ticks and frame on every subplot
for x in range(ax.shape[0]):
    for y in range(ax.shape[1]):
        ax[x, y].axis('off')

ax[0, 0].imshow(saving_image[0])
ax[1, 0].imshow(saving_image[1])
ax[0, 1].imshow(saving_image[2])
ax[1, 1].imshow(saving_image[3])
ax[0, 2].imshow(original_image[0])
ax[0, 3].imshow(original_image[1])
ax[1, 2].imshow(original_image[2])
ax[1, 3].imshow(original_image[3])
plt.show()
```

#

This will create a window with a 2x4 grid (2 rows, 4 columns); at each row i and column j you'll be able to add a plot (or an image, in this case).

#

You can do pretty much the same thing you do with plt using ax[row,column], but applied exclusively to a single plot. You just need to add "set" in most cases.

Like ax[0,0].set_title("Test") (instead of plt.title("Test")). The legend is an exception: it's ax[0,0].legend(), not set_legend.

molten hamlet May 11, 2023, 2:16 AM

#

still have to move each subplot
no thanks
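
For keeping the view in sync without moving each plot by hand, shared axes may be closer to what was asked. The sketch below is not from the thread: it assumes two separate figure windows, made-up raw and filtered signals, and a moving average standing in for whatever filter is really used. The second axes is created with `sharex`/`sharey` pointing at the first, so zooming or panning one should move the other.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a noisy sine and a moving-average "filtered" version
t = np.linspace(0, 10, 500)
raw = np.sin(t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
filtered = np.convolve(raw, np.ones(25) / 25, mode="same")

# First window: raw data
fig_raw, ax_raw = plt.subplots()
ax_raw.plot(t, raw)
ax_raw.set_title("raw")

# Second window: its axes share x and y limits with the first one
fig_filt = plt.figure()
ax_filt = fig_filt.add_subplot(1, 1, 1, sharex=ax_raw, sharey=ax_raw)
ax_filt.plot(t, filtered)
ax_filt.set_title("filtered")

# Changing the limits on one axes (what a zoom does) also changes the other
ax_raw.set_xlim(2, 4)

# plt.show()  # uncomment when running interactively to open both windows
```

With everything in one window, `plt.subplots(2, 1, sharex=True, sharey=True)` gives the same linking between subplots.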

gloomy saddle May 11, 2023, 2:26 AM

#

rare pagoda I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both...

Basically, if you can afford it. An FPGA approach can be infinitely scalable with no overhead as long as your problem is divisible enough. Things can be synchronous or async, and if you get fancy you can always throw hardware at making sure you get a result on the same clock cycle it's needed, when not constrained by serialization.

But the cost... Yeah, it's not for small things where you can cop the wait time on a GPU. To match a GPU for many tasks you're already talking 100K FPGA cluster cells. The difference is that once you cross that point in an FPGA cluster, the lower latency and the architecture shenanigans you can pull, on top of the power savings, mean you can optimise things for your use case.

vital cedar May 11, 2023, 3:55 AM

#

How can I choose the right algorithm for a Tweet Sentiment project? Is there any way to plot them perhaps?

hoary wigeon May 11, 2023, 3:57 AM

#

agile cobalt !e you meant this? ```py data = [ [10, 20, 30], [1, 2, 3], [100, 200...

NOPE, SHAP EXPLAINABLE AI..

severe topaz May 11, 2023, 7:42 AM

#

errant lake That's one way, for example: ```py df['timestamp'] = pd.to_datetime(df['timestam...

Looks familiar - df…. FPGA … C++ is easier than python. But I guess everybody uses python.

#

Field programmable gate array now I remember.

#

Regex and df 🖐 in 🖐

#

Should have done all my arduino projects and everything from now on in python

kind moth May 11, 2023, 9:05 AM

#

Anyone have any tutorials or links for Image Generation?

mild dirge May 11, 2023, 10:37 AM

#

How would you measure the performance of a reinforcement learning model? My prof is making us cherry pick the 5 best runs, but that seems really biased and bad :/

jolly dock May 11, 2023, 10:38 AM

#

```py
from tensorflow.contrib.training import HParams
```

I have downloaded GPT-2 from GitHub, and this part of `model.py` isn't working because I'm using Python 3.8, which doesn't let me use versions of tensorflow below 2.8.0, and this part doesn't work on versions above 1.5.0.

Can somebody help me to solve this problem please?

cold osprey May 11, 2023, 10:50 AM

#

upgrade python?

gloomy saddle May 11, 2023, 11:04 AM

#

mild dirge How would you measure the performance of a reinforcement learning model? My prof...

I think usually plotting the loss function against the scores of each run per generation?

mild dirge May 11, 2023, 11:04 AM

#

But the loss is kind of artificial, as you don't have the correct labels, it's unsupervised

gloomy saddle May 11, 2023, 11:07 AM

#

ok, so is it, say, competing network learning, where you have 1 model generating and another discriminating? otherwise how is the scoring implemented? being unsupervised makes it a bit nebulous if your scoring on its own doesn't cover that behaviour

mild dirge May 11, 2023, 11:09 AM

#

I'll just be using the average winrate, but I am using Q learning. This uses an online and offline model, the online model predicts the action scores, and the offline model predicts the expected action scores.

#

And the offline model is updated with the online model's parameters every now and then

gloomy saddle May 11, 2023, 11:10 AM

#

chess like problem?

mild dirge May 11, 2023, 11:11 AM

#

Basically a simplified breakout game

#

catch the ball

gloomy saddle May 11, 2023, 11:15 AM

#

mild dirge catch the ball

bounce the ball, so what's your current scoring method? that could greatly change the answer

cold osprey May 11, 2023, 11:15 AM

#

oo RL

mild dirge May 11, 2023, 11:15 AM

#

Catch the ball is 1 point, all other states 0

#

It doesn't bounce from the paddle

gloomy saddle May 11, 2023, 11:16 AM

#

ah ok, so instead your score function should probably be distance of center of paddle to intersection point of the threshold for the ball

#

so it can... learn

mild dirge May 11, 2023, 11:16 AM

#

I'm not having a problem with it learning, it learns 😛

gloomy saddle May 11, 2023, 11:16 AM

#

right now, 0 or 1, it only has random chance to begin learning

#

rate of learning

mild dirge May 11, 2023, 11:17 AM

#

I was just nitpicking on the fact that we have to cherry pick the 5 best runs, which seems biased

gloomy saddle May 11, 2023, 11:17 AM

#

so pick the 5 that have the center of paddle most directly under the ball center?

wheat snow May 11, 2023, 11:18 AM

#

I have this df here:

```
                Timestamp         Creator
5126  2022-12-27 23:20:17      ZDF Satire
5825  2022-12-14 20:57:53      ZDF Satire
6014  2022-12-10 21:36:12      ZDF Satire
7731  2022-11-17 17:08:06  GermanLetsPlay
12363 2022-07-20 19:54:39  GermanLetsPlay
```

and applied this command:

```py
month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()
```

which results in:

```
year_month  Creator       
2018-06     LeKoopa           10
            ZDF Satire        16
            marshmallowTV     39
2018-07     LeKoopa            5
            ZDF Satire         1
            marshmallowTV      6
2018-08     GermanLetsPlay    18
            LeKoopa            6
            ZDF Satire        24
            marshmallowTV      6
```

and this is a Series: `<class 'pandas.core.series.Series'>`

So... what I originally wanted to do is plot multiple lines where each line represents one Creator (e.g. LeKoopa, GermanLetsPlay, ZDF Satire, ...). The x ticks should be the timestamps, so one per month, and the y value for, let's say, LeKoopa should be the value he has every month, so `10,5,6,...`. But I'm not that good with Series and plotting information out of one.

gloomy saddle May 11, 2023, 11:18 AM

#

biggest difference in X coordinate

mild dirge May 11, 2023, 11:19 AM

#

Yeah ig, though a run consists of 1000 episodes, so I'm just picking the ones with the highest sum of winrates

past meteor May 11, 2023, 11:19 AM

#

Are you getting the same policy across runs?

#

Or are they vastly different

mild dirge May 11, 2023, 11:21 AM

#

Hard to judge, I'm not going to watch 1000 videos to see how the run progresses, and the neural network also has a few tens of thousands of parameters, so I can't interpret that

#

But my question is answered 🙂

gloomy saddle May 11, 2023, 11:21 AM

#

This is why I suggest the distance in coordinates, it removes more of the random components and you can still sum them 🙂

#

e.g. if your paddle is 1/5 the width of the bottom, your agents could win 20% of the time by luck, closer to 40% if your position and velocity of the ball is not handled too well

past meteor May 11, 2023, 11:22 AM

#

You don't need to watch 1000 videos. You can evaluate the performance of the policy

#

I assume you have some sparse reward (win/loss)? You can just take the average winrate of the last N episodes. I suspect that's what you planned on doing anyway
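
A rough sketch of that idea, reported over all runs rather than only the best 5. Everything here is an assumption for illustration: the helper names, `last_n=100`, and the synthetic 0/1 rewards standing in for real training logs. It also computes the "top 5" number next to the all-runs mean, to show how cherry-picking shifts the figure upward.

```python
import numpy as np

def final_winrate(rewards, last_n=100):
    """Average reward over the last `last_n` episodes of a single run."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards[-last_n:].mean()

def summarize_runs(runs, last_n=100):
    """Mean and spread of the final winrate across all runs."""
    scores = np.array([final_winrate(r, last_n) for r in runs])
    return {
        "n_runs": len(scores),
        "mean": scores.mean(),
        "std": scores.std(ddof=1),  # sample standard deviation across runs
        "scores": scores,
    }

# Synthetic stand-in for logged rewards: 10 runs of 1000 episodes,
# reward 1 for a catch and 0 otherwise, with the catch rate rising over time.
rng = np.random.default_rng(0)
n_runs, n_episodes = 10, 1000
p_catch = np.linspace(0.2, 0.9, n_episodes)
runs = (rng.random((n_runs, n_episodes)) < p_catch).astype(int)

summary = summarize_runs(runs)
top5_mean = np.sort(summary["scores"])[-5:].mean()  # the cherry-picked number

print(f"all runs: {summary['mean']:.3f} +/- {summary['std']:.3f}")
print(f"best 5 only: {top5_mean:.3f}")
```

The best-5 mean can never be lower than the all-runs mean, which is the bias being complained about; reporting mean and spread over every run avoids it.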

cold osprey May 11, 2023, 11:25 AM

#

wheat snow I have that df here: ``` Timestamp Creator 5126 2022-12...

ure looking for pivot unpivot

wheat snow May 11, 2023, 11:26 AM

#

cold osprey ure looking for pivot unpivot

so getting it back into a df?

#

btw what is that group key thing doin, someone suggested it but i have no idea what it does

gloomy saddle May 11, 2023, 11:28 AM

#

it takes all the input and groups it into chunks that share those properties

cold osprey May 11, 2023, 11:28 AM

#

not sure what group_keys=False does

#

but i think that u want can be done with pivoting or unpivoting the dataframe (idk which is which)
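
A minimal sketch of the pivot idea, assuming a frame with `year_month`, `Creator` and `Views` columns like the one in the question (the values below are copied from the grouped output posted above; the real frame will differ). `unstack` moves the `Creator` level of the grouped Series into columns, which gives one line per creator when plotted. As far as I know `group_keys` only matters for `groupby(...).apply(...)`, so it is left out here.

```python
import pandas as pd

# Small stand-in for df_channels, using the numbers from the posted output
df_channels = pd.DataFrame({
    "year_month": ["2018-06"] * 3 + ["2018-07"] * 3 + ["2018-08"] * 4,
    "Creator": [
        "LeKoopa", "ZDF Satire", "marshmallowTV",
        "LeKoopa", "ZDF Satire", "marshmallowTV",
        "GermanLetsPlay", "LeKoopa", "ZDF Satire", "marshmallowTV",
    ],
    "Views": [10, 16, 39, 5, 1, 6, 18, 6, 24, 6],
})

# Same grouping as in the question: a Series with a (year_month, Creator) index
month = df_channels.groupby(["year_month", "Creator"])["Views"].sum()

# Pivot: one row per month, one column per creator; missing months become 0
wide = month.unstack("Creator", fill_value=0)

# One line per column, months on the x axis
ax = wide.plot(marker="o")
ax.set_xlabel("year_month")
ax.set_ylabel("Views")
```

`df_channels.pivot_table(index="year_month", columns="Creator", values="Views", aggfunc="sum")` should get to the same wide table in one step, with NaN instead of 0 for missing months.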

past meteor May 11, 2023, 11:33 AM

#

Reminds me I really need to get back to reinforcement learning. I read Sutton & Barto + implemented everything from scratch except a few model-based algos. It's something I don't actively use at work for now (even though there's opportunities) so I'm getting a bit rusty.

#

At some point I'll redo part of the algos in C, Nim, Rust, ... "for fun" and so I don't forget all of it

sinful kelp May 11, 2023, 11:54 AM

#

how do you guys normally do hyperparameter optimization? do you do grid search or go for Bayesian optimization methods?

past meteor May 11, 2023, 11:58 AM

#

sinful kelp how do you guys normally do hyperparameter optimization? do you do grid search o...

Don't grid search imo

#

Random search is good if you can run it in parallel. Bayes opt is a sequential algorithm so that's what I use if it's something super expensive that I can't run in parallel (e.g., neural networks)

#

The issue with grid search is that it spends a lot of time iterating over potentially useless parameter settings, random/bayes opt is a lot less sensitive to this
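
A toy illustration of that point, not a benchmark: the objective, the parameter names `lr` and `momentum`, and the budget of 9 trials are all made up. The objective depends strongly on `lr` and barely on `momentum`, so a 3x3 grid only ever tries 3 distinct `lr` values, while 9 random draws try 9.

```python
import itertools
import random

def objective(lr, momentum):
    """Toy loss: sensitive to lr, almost flat in momentum (lower is better)."""
    return (lr - 0.37) ** 2 + 0.001 * (momentum - 0.5) ** 2

budget = 9

# Grid search: 3 values per parameter, 3 x 3 = 9 trials
grid_values = [0.0, 0.5, 1.0]
grid_trials = list(itertools.product(grid_values, grid_values))

# Random search: the same 9 trials, each drawn uniformly from [0, 1)
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(budget)]

best_grid = min(grid_trials, key=lambda p: objective(*p))
best_random = min(random_trials, key=lambda p: objective(*p))

# How many different values of the parameter that matters did each method try?
distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})

print("grid:  ", distinct_lr_grid, "distinct lr values, best", best_grid)
print("random:", distinct_lr_random, "distinct lr values, best", best_random)
```

Each random trial is independent of the others, which is also why random search is easy to run in parallel.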

sinful kelp May 11, 2023, 12:05 PM

#

do you ever have issues with convergence of the Bayesian optimizer?

jolly dock May 11, 2023, 12:08 PM

#

cold osprey upgrade python?

it works on old versions and i can't downgrade my python version because my other projects wont work

mild dirge May 11, 2023, 12:09 PM

#

jolly dock it works on old versions and i can't downgrade my python version because my othe...

That last point can be solved by using virtual environments
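
A minimal sketch of what that means, using the standard-library `venv` module. The helper name `create_env` and the folder name in the usage note are hypothetical. Each environment gets its own set of installed packages, so an old tensorflow in one project doesn't touch the others. One caveat: a venv uses the Python interpreter that created it, so for a tensorflow that needs an older Python, that older Python has to be installed and used to run this.

```python
import venv
from pathlib import Path

def create_env(path, with_pip=True):
    """Create an isolated virtual environment at `path`.

    Returns the path of the environment's pyvenv.cfg file, which venv
    writes into every environment it creates.
    """
    venv.create(path, with_pip=with_pip)
    return Path(path) / "pyvenv.cfg"
```

Calling `create_env("gpt2-env")` from the project folder would create the environment; after activating it, `pip install` only affects that environment. The usual command-line equivalent is `python -m venv gpt2-env`.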

jolly dock May 11, 2023, 12:09 PM

#

wdym

past meteor May 11, 2023, 12:10 PM

#

sinful kelp do you ever have issues with convergence of the Bayesian optimizer?

I haven't personally. Maybe a stupid question but have you tried increasing the maximum amount of trials?

sinful kelp May 11, 2023, 12:12 PM

#

I haven't, but I will try that. I was more asking in general what strategies people tend to use when optimizing their hyperparameters.

#

I have seen people before who were more in favor of random search over Bayesian optimization, and I was curious why

past meteor May 11, 2023, 12:14 PM

#

Yeah, the reason is that they can run random in parallel indeed 🙂

sinful kelp May 11, 2023, 12:16 PM

#

Thanks for clearing that up 🙂

wooden sail May 11, 2023, 12:33 PM

#

nothing stops you from doing random bayes btw, with different starting points for the params

#

it's a lot more computationally expensive though

past meteor May 11, 2023, 12:39 PM

#

wooden sail nothing stops you from doing random bayes btw, with different starting points fo...

So starting bayes opt from N random points?

wooden sail May 11, 2023, 12:40 PM

#

yeah

past meteor May 11, 2023, 12:40 PM

#

I like this a lot - do any "mature" packages implement this?

#

I can handroll it but I rather not if I'm using it in any "serious" project

wooden sail May 11, 2023, 12:41 PM

#

i wouldn't know, i'm only doing armchair AI right now 😛

cold osprey May 11, 2023, 12:41 PM

#

basement AI

past meteor May 11, 2023, 12:41 PM

#

I'll put it on the long list of stuff I need to do

#

Tbf starting a bunch of Python processes that each run bayes opt is the same 🤷‍♂️

wooden sail May 11, 2023, 12:43 PM

#

https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html i think this does it

mild dirge May 11, 2023, 12:43 PM

#

I'm running my reinforcement learning model twice to get twice as fast results rn 😛

wooden sail May 11, 2023, 12:44 PM

#

it sounds reasonable enough that i would almost expect any hyperparam library to have this as an option

past meteor May 11, 2023, 12:45 PM

#

RL would really be my favourite domain to do fundamental research in. Specifically in this: https://arxiv.org/abs/2005.01643

arXiv.org

Offline Reinforcement Learning: Tutorial, Review, and Perspectives ...

In this tutorial article, we aim to provide the reader with the conceptual
tools needed to get started on research on offline reinforcement learning
algorithms: reinforcement learning algorithms that utilize previously collected
data, without additional online data collection. Offline reinforcement learning
algorithms hold tremendous promise for...

wooden sail May 11, 2023, 12:46 PM

#

a word of caution in case you weren't aware: arxiv is not peer reviewed, it's just a repository

past meteor May 11, 2023, 12:47 PM

#

(This got published in NeurIPS as well though but this one is free)

wooden sail May 11, 2023, 12:47 PM

#

uc berkeley and google add some weight to it, but always check whether what you find on arxiv has a published, peer reviewed version

#

aha, there we go then

past meteor May 11, 2023, 12:48 PM

#

(A big part of my thesis was fixing a rubbish Arxiv paper that would never get published in a peer reviewed journal but had a few good ideas)

wooden sail May 11, 2023, 12:52 PM

#

you seem pretty savvy in AI stuffs

past meteor May 11, 2023, 12:55 PM

#

I did a masters in AI at a top ~50 uni + work as an applied AI researcher. I don't know a lot about NLP for example except surface level stuff from coursework.

#

I know about the stuff that my profs were interested in 🤣

wooden sail May 11, 2023, 12:57 PM

#

past meteor I did a masters in AI at a top ~50 uni + work as an applied AI researcher. I don...

very nice 😌

hasty mountain May 11, 2023, 1:03 PM

#

past meteor Reminds me I really need to get back to reinforcement learning. I read Sutton & ...

Hey! Can you tell me some tricks to avoid suboptimal policies? Or specification gaming?

I'm having the problem that I'm trying to train a model in a game with PPO...problem is, the environment is a bit slow to provide a feedback. And even with a reward model working as a reward function to provide continuous rewards, it seems that my model is prone to getting stuck at certain commands...

#

(I don't know how much the factor impatience also helps...since I never actually let my model train for more than 5,000 steps, and the optimization is done after each 10 steps)

past meteor May 11, 2023, 1:13 PM

#

hasty mountain Hey! Can you tell me some tricks to avoid suboptimal policies? Or specification ...

Honestly I wouldn't know, maybe @mild dirge can give you more concrete pointers

#

My focus was mostly on implementing algos, understanding their properties, making environments, ... It's a shame but I don't have a lot of finesse with actually using it for real things 😆

mild dirge May 11, 2023, 1:14 PM

#

I'm still doing a course on deep reinforcement learning so not exactly an expert either, my next assignment is to use policy optimization instead of value based learning.

tidal bough May 11, 2023, 1:16 PM

#

i want to one day try using RL for codingame bot programming tasks, but that'd need a lot of work. I'd need to write a one-file from-scratch implementation of the model in question, train that model locally (after implementing a full simulator of the environment in question), and deploy it by embedding the trained parameters in the file

#

I think I've heard of people actually doing it, though, so probably it's a powerful method if you want to get through the hassle

past meteor May 11, 2023, 1:31 PM

#

tidal bough i want to one day try using RL for codingame bot programming tasks, but that'd n...

for what?

tidal bough May 11, 2023, 1:34 PM

#

past meteor for what?

Codingame is a mostly generic programming contest site, basically, but the really fun stuff they have are the "bot programming" and "optimization" tasks, where you compete with other players in making, roughly speaking, the best bot for a certain game. E.g. https://www.codingame.com/multiplayer/bot-programming/mad-pod-racing is a very good example.

tidal bough May 11, 2023, 1:49 PM

#

The particular one I linked is interesting because it's a real-time game with physics (so the state space is continuous and the action space is pretty big too) and it's inherently multiagent - each player controls two bots, and they ideally need to coordinate with each other to win, and predict the opponent's actions.

severe topaz May 11, 2023, 3:39 PM

#

wheat snow I have that df here: ``` Timestamp Creator 5126 2022-12...

is this regex?

wheat snow May 11, 2023, 3:40 PM

#

nope

#

but i already solved the issue

#

im currently trying to flawlessly get some information from youtube api

hasty mountain May 11, 2023, 4:16 PM

#

past meteor My focus was mostly on implementing algos, understanding their properties, makin...

Okay grumpchib

hasty mountain May 11, 2023, 4:16 PM

#

mild dirge I'm still doing a course on deep reinforcement learning so not exactly an expert...

Spoiler: ||When your assignment is to implement PPO, you'll have to do both||

#

shipit

rancid dove May 11, 2023, 4:39 PM

#

Is it possible to change the order of the fields of a numpy void dtype without copying the data?

tidal bough May 11, 2023, 4:42 PM

#

hmm, what would it even mean for the fields to be in a different order logically but not in memory?

tidal bough May 11, 2023, 4:46 PM

#

rancid dove Is it possible to change the order of the fields of a numpy void dtype without c...

!e looks to me like it works just fine, though:

```py
import numpy as np

dt = np.dtype({"names": ["col1", "col2"], "formats": ["i4", "f4"], "offsets": [0, 4], "itemsize": 12})
dt2 = np.dtype({"names": ["col1", "col2"][::-1], "formats": ["i4", "f4"][::-1], "offsets": [0, 4][::-1], "itemsize": 12})

arr = np.arange(9).view(dtype=dt)
print(arr)
print(arr.view(dtype=dt2))
```

arctic wedgeBOT May 11, 2023, 4:46 PM

#

@tidal bough :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [(0, 0.0e+00) (0, 2.8e-45) (3, 0.0e+00) (0, 7.0e-45) (6, 0.0e+00)
002 |  (0, 1.1e-44)]
003 | [(0.0e+00, 0) (2.8e-45, 0) (0.0e+00, 3) (7.0e-45, 0) (0.0e+00, 6)
004 |  (1.1e-44, 0)]

tidal bough May 11, 2023, 4:46 PM

#

note how the offsets for dt2 are [4,0], so the logically-first field is the second in memory.
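
A small follow-up check that the reordered view really doesn't copy, reusing the same two dtypes (written out explicitly here instead of with `[::-1]`). The array contents are made up; the check relies on `np.shares_memory` and on a write through the view showing up in the original.

```python
import numpy as np

# col1 is an int32 at byte offset 0, col2 a float32 at byte offset 4
dt = np.dtype({"names": ["col1", "col2"], "formats": ["i4", "f4"],
               "offsets": [0, 4], "itemsize": 12})
# Same fields at the same offsets, only listed in the opposite order
dt2 = np.dtype({"names": ["col2", "col1"], "formats": ["f4", "i4"],
                "offsets": [4, 0], "itemsize": 12})

arr = np.zeros(3, dtype=dt)
reordered = arr.view(dt2)  # a view: new field order, same buffer

# Writing through the view changes the original array
reordered["col1"][0] = 42
reordered["col2"][0] = 1.5

print(reordered.dtype.names)          # field order as seen through the view
print(np.shares_memory(arr, reordered))
print(arr[0])
```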

severe topaz May 11, 2023, 5:05 PM

#

wheat snow nope

Where have I seen df I swear when I was picking up regex.

rancid dove May 11, 2023, 5:06 PM

#

tidal bough note how the offsets for dt2 are `[4,0]`, so the logically-first field is the se...

This is great news, ty

wheat snow May 11, 2023, 5:16 PM

#

severe topaz Where have I seen df I swear when I was picking up regex.

Brih idk

severe topaz May 11, 2023, 5:17 PM

#

oh im out of it its a sql pandas thing

night prawn May 11, 2023, 6:07 PM

#

wooden sail https://www.tensorflow.org/install/pip#windows-native here's how to install the ...

I installed CUDA 12.1 and cuDNN 8.9, but I saw that the latest versions of tensorflow no longer support GPUs on native Windows. I'm therefore hesitating between WSL2 and tensorflow-directML, and I also wonder how to install either.

cold osprey May 11, 2023, 6:08 PM

#

wsl was a nightmare for me to set up so i just went with pytorch

#

didnt know about tensorflow-directML

wooden sail May 11, 2023, 6:08 PM

#

i don't know what tensorflow-directml is, i do use wsl though and it's just dandy

#

i use jax gpu on wsl2

#

if you find it too cumbersome, consider pytorch indeed (unless you already have your code written)

past meteor May 11, 2023, 6:09 PM

#

WSL is pretty convenient to set up at least it was for me

wooden sail May 11, 2023, 6:09 PM

#

wsl has treated me mostly well as well

cold osprey May 11, 2023, 6:10 PM

#

nice

#

wsl didnt treat me nice Sadge

night prawn May 11, 2023, 6:10 PM

#

So how do I install wsl?

wooden sail May 11, 2023, 6:10 PM

#

what issues did you have with it?

cold osprey May 11, 2023, 6:10 PM

#

wooden sail what issues did you have with it?

ive blocked it out of my memory now

#

lemme find my rant messages

wooden sail May 11, 2023, 6:11 PM

#

night prawn So how do I install wsl?

https://learn.microsoft.com/en-us/windows/wsl/install you can follow the steps here. on windows 11 it's super straightforward. on windows 10 it can require some extra steps

Install WSL

Install Windows Subsystem for Linux with the command, wsl --install. Use a Bash terminal on your Windows machine run by your preferred Linux distribution - Ubuntu, Debian, SUSE, Kali, Fedora, Pengwin, Alpine, and more are available.

cold osprey May 11, 2023, 6:11 PM

#

im on win10

night prawn May 11, 2023, 6:11 PM

#

Ok thank you

cold osprey May 11, 2023, 6:12 PM

#

oh i rmb now

#

it wasnt wsl specifically but mlflow wasnt working properly with wsl

#

something about file permissions iirc

#

couldnt save an image of the model architecture as an artifact in mlflow

wooden sail May 11, 2023, 6:14 PM

#

that doesn't tell me much

cold osprey May 11, 2023, 6:15 PM

#

yeah welp im over it now

wooden sail May 11, 2023, 6:15 PM

#

but you probably tried to modify the windows filesystem from inside wsl or backwards, which you shouldn't do

cold osprey May 11, 2023, 6:15 PM

#

pytorch all the way

cold osprey May 11, 2023, 6:15 PM

#

wooden sail but you probably tried to modify the windows filesystem from inside wsl or backw...

i think so too

wooden sail May 11, 2023, 6:15 PM

#

never do that 😛