#data-science-and-ml
Ik.
Yeah. Because ChatGPT doesn't know what it's doing.
Could anyone help me out with something. I'm practicing and trying to learn webscraping atm, am trying to figure out how to save what I scrape to a dataset and organize it.
!rule 5 Webscraping usually violates ToS.
5. Do not provide or request help on projects that may violate terms of service, or that may be deemed inappropriate, malicious, or illegal.
Use chatgpt

Well I can't do that because i'm trying to learn a new skill
oh you meant like ask chatgpt?
Yes'
I did but it's kind of hard to track. it's not an exact replacement for direct human to human support in all instances.
Can you be more specific? Do you know about RNNs? Autoencoders?
At the moment I'm not seeing any way for you to use the unlabeled data with those kinds of classifiers.
How does this sound
Silly gpt is giving me steps on how to do that
And a good justification to write in report as well
It might make a fool of myself though
You might be able to train something that tries to force the predicted classifications of the unlabeled data towards something definite. I.e., try to make it predict something but don't force it to predict something specific.
The loss function for such a thing is something like, "how close do your predicted class probabilities get to a basis vector". And there's a bunch of ways you could measure that.
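One way to measure "how close to a basis vector" is the entropy of the predicted probabilities: it is zero for a one-hot prediction and largest for a uniform one. A minimal sketch of that measure in NumPy, assuming the model outputs one row of class probabilities per unlabeled sample (the function name `mean_entropy` and the example arrays are made up for illustration, and entropy is only one of the possible choices):

```python
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Average Shannon entropy over rows of predicted class probabilities.

    Low values mean the predictions are close to one-hot (basis) vectors.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-(probs * np.log(probs)).sum(axis=1).mean())

# Hypothetical predictions on unlabeled data
confident = np.array([[0.98, 0.01, 0.01], [0.02, 0.96, 0.02]])
undecided = np.array([[0.34, 0.33, 0.33], [0.30, 0.40, 0.30]])

print(mean_entropy(confident))  # small: rows are nearly one-hot
print(mean_entropy(undecided))  # larger: rows are nearly uniform
```

Adding a term like this to the supervised loss pushes the model to commit to some class on the unlabeled samples without saying which one.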
It said to use the data and set the class weight of the unknown class to 0, so that it's used in augmentation but not used to predict anything
hmmmmm
Do you know what class_weight does?
It tells the fit function the relative importance of the different classes. So, for example, if the weight of class 0 is 0.25 and the weight of class 1 is 0.75, then errors in class 1 are weighted three times more than errors in class 0 in the loss function.
OO
If you give something a weight of zero, then it doesn't contribute to the loss function at all.
It means that it has no meaningful effect on training.
It might still affect the training though
oh yes
That's cool
problem solved then
Sure, there may be algorithms where the presence of that extra data affects training. But because it doesn't affect your loss function, it also can't make your results better.
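To make the zero-weight point concrete, here is a small sketch of a class-weighted cross-entropy written by hand in NumPy. It only mirrors the idea described above, it is not the Keras or scikit-learn implementation, and the function name, the class ids, and the numbers are assumptions for illustration. Class 2 plays the role of the "unknown" class.

```python
import numpy as np

def weighted_cross_entropy(y_true, probs, class_weight):
    """Sum of per-sample cross-entropies, each scaled by its class's weight."""
    y_true = np.asarray(y_true)
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    # Negative log-probability assigned to the true class of each sample
    per_sample = -np.log(probs[np.arange(len(y_true)), y_true])
    weights = np.array([class_weight[int(c)] for c in y_true])
    return float((weights * per_sample).sum())

y = np.array([0, 1, 2, 2])  # the last two samples belong to the "unknown" class
probs_a = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.2, 0.2, 0.6]])
# Same predictions for the known classes, very different ones for the unknown rows
probs_b = probs_a.copy()
probs_b[2:] = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]

weights = {0: 0.25, 1: 0.75, 2: 0.0}
loss_a = weighted_cross_entropy(y, probs_a, weights)
loss_b = weighted_cross_entropy(y, probs_b, weights)
print(loss_a, loss_b)  # identical: the zero-weight rows never reach the loss
```

Whatever the model predicts on the zero-weight rows, the loss is the same, so those rows give the optimizer nothing to learn from.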
Look, you seem quite enamored with ChatGPT. I'm inclined to say that you should ask your instructor about this. You don't have to tell him you were consulting ChatGPT if you think he'll respond poorly. Just say that you read about this idea on the Internet.
No.
There's lots of Hoffmans and Hofmanns and Hoffmanns, etc.
We're mostly not related.
Well
We are actually All related
Very low chances that you n me are not related
That's why I call you bro
Okay, in that sense, we're all related.
Because that's what you literally are
I do like that sense, it's just not where I thought you were going.
What is this behaviour
Weird axes
Hey guys, so I've got some problems when using KNeighborsClassifier; my signals look like this on the graph.
Maybe I'm crazy, but I wanted the signals to stay on the graph rather than on the top/bottom like this. Could anyone point me to where I can read up or get help on changing that in my code? (The code runs and works; the problem is the graphic overlay that I want to change.)
what is that plot?
may i know whats the best library for fuzzy logic application, eg. display graph etc
mpf.plot but i found the error, now just having another error claiming my data for buy and sell signals isnt the same length -.- kinda lost, probably just gonna go to bed and look at it tomorrow
I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI Code
class Myapp():
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.frame_2, text="Load JSON Script", font=("Ubuntu", 12), command=self.open_file)
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))

app = Myapp()
app.root.mainloop()
And here is the viewer code
import pandas as pd
import matplotlib.pyplot as plt

class csv2df():
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])

    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        plt.show()

data = csv2df()
data.plot()
I want to display the graph by clicking the "Single Viewer" button. Can you please fix it as I need this for a project.
Thank You.
Any of you use plotly for dashboarding?
sorry, what u mean?
used to use it
Switched to something else better?
work still preferred power bi, due to our clients mainly being microsoft integrated
used dash plotly on an internal project as like a POC
they liked it but didnt want to follow through at that time to use it for other stuff
Ah! I was thinking about using a dashboarding thing to display data on a website
I've used it before and it seemed okay. The thing I didn't like was that it looked very.... old!? I mean, it's not flashy at all! Like a basic white web page with drop down menus and sliders with some plots thrown in.
are there any tools for managing the individual dependency sets required for notebooks? or do you just create notebook directories with distinct requirements.txt (or whatever) while using isolated virtualenv's to load jupyterlab to work on that task?
yes i use pyenv
if you can learn to use docker containers its a useful skill, i use containers for everything
my skills with docker aren't really where they should be. i use them for some things, but i tend to dabble in a lot and it's more time consuming to set up volumes & bind mounts. it works well for some things. i have like 6 or 7 containers, just for ML. the one for pytorch takes like 35 GB somehow and all together, they require 100+ GB.
also, i'm trying to use signatory, which requires pytorch & opencl to speedup things. but i would like to use TF where possible. building a container that has both has been a real blocker for me. i don't imagine doing that very often, but maintaining containers that have both would be a real pain, over time.
i'm using pyenv as well, along with venv and direnv. that works well, but creating these isolated dependency sets in multiple directories is tough and will eventually eat up a lot of disk. i only have 2TB nvme and very little else that's not tied up in my homelab somewhere.
i keep a text file for docker commands and use shell scripts for resetting up environments, not sure how many containers youd need up at once, since i document setup for more complicated project envs i usually feel comfortable deleting containers
i havent typed out a docker command in months
ctrl c ctrl v all day
i use docker.el in emacs, so i have at least some of it on easy mode (for some definition of easy lol)
i try to make notes where possible. i have a lot of experience in other languages and i'm trying to future-proof however i decide to handle dependencies for multiple projects.
thanks for the feedback
also, ditch pytorch (i am a tensorflow enthusiast)
Just use JAX
for me the appeal to tensorflow is the lower level tensors themselves, not keras. what's the equivalent to how TF handles tensors in pytorch?
basically the same i just dont want to write my own training loops
I just picked tensorflow cause I needed tensorflow.js support for my first AI project, and haven't switched. Honestly should probably try pytorch sometime tho
someone can help me? I am doing a machine learning program by logistic regression, and the model I am doing is not working
what are the features?
X = data[['Employment', 'YearsCodePro', 'EdLevel']]
y = data['CompTotal']
I did like this
what does comp total represent? how did you source the data set?
what kind of cost function did you set up?
Comptotal represent the monthly salary
also, what kind of problems do you think you're having? are there runtime errors? or statistical errors?
the data set I got by StackOverflow and filtered by country(in this case, is Brazil)
what kind of regression is it? linear? there are max salary caps, so you will see less correlation in some of the higher salary numbers than perhaps the lower numbers.
maybe it's different for brazil
have you looked at the mean/variance of the features? are you using a framework or just libraries like numpy?
When I define my values(that are categoric), the jupyter notebook says that the categories from the column are unknown
are you trying to predict the salary, given the features?
yes
logistic regression is typically used to produce binary predictions
is your data in dataframes? like with pandas?
yes, it is
have you tried using unique() to see whether the dataframe will give you the distinct values in each column?
also, are you using a framework like tf/keras or pytorch?
I am just using jupyter and anaconda
yes, I used unique, but even if I put the distinct values in the code, the code doesn't work
that describes the workflow and the python environment. i mean what libraries are you using to help with machine learning or linear algebra?
pandas,numpy,seaborn and scikit learn
ok. if you're using logistic regression, the simplest way to fit that method to the task is to place an inequality on the predicted column. this converts it into a binary feature.
or, rather, a binary classification problem
instead of your algorithm answering the question:
how do features in X predict data['CompTotal']
it will answer questions like what features in X predict data['CompTotal'] > 35,000
you can change the value on the right and retrain multiple versions of the algorithm.
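A rough sketch of that conversion with pandas, assuming a tiny made-up table with the same column names (the rows and the threshold value are hypothetical, not taken from the Stack Overflow survey). It turns the salary into a 0/1 target and one-hot encodes the categorical features so that a classifier can accept them:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the discussion above
data = pd.DataFrame({
    "Employment": ["Full-time", "Part-time", "Full-time", "Freelance"],
    "YearsCodePro": [5, 1, 10, 3],
    "EdLevel": ["Bachelor", "Secondary", "Master", "Bachelor"],
    "CompTotal": [8000, 1200, 15000, 900],
})

threshold = 1320  # hypothetical cut-off; replace with the value you care about

# Binary target: 1 if the salary is above the threshold, else 0
y = (data["CompTotal"] > threshold).astype(int)

# One-hot encode the categorical columns; the numeric column is kept as-is
X = pd.get_dummies(
    data[["Employment", "YearsCodePro", "EdLevel"]],
    columns=["Employment", "EdLevel"],
)

print(y.tolist())
print(list(X.columns))
```

`X` and `y` in this form are what a binary classifier such as logistic regression expects: numeric features and a target with exactly two values.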
i think... i'm not an expert though.
I understand
scikit learn may assert that there are only two values in the data['CompTotal'] column (or the prediction column). this may be what the error message is about.
and my goal in the model is to know if the salary is higher than the minimum salary here in Brazil
i may be able to help later, i have to get back to work though.
try to play around with the pandas dataframe and create new columns
okay
I have to give this project in 3 hours hahaha so I'm kind of damned
but thanks for the help
if it's giving you an error, follow the stack trace. it might help to clone the scikit learn project and try to find the line with that string. you might not have enough time though. generally, the source code is the best documentation, but managing lots of source repositories can be a lot of work.
i mean, u just copy paste it, or import from a helper library
maybe some tweaks depending on the model, loss func and metric ure using
Discussion has been overdone but personally I still prefer Jax if I'm doing say reinforcement learning and then TF/Keras, followed by MXNet
I've been spending time with Pytorch recently to wean myself off of TF because that seems to be the direction where everything is going. All in all they're the same but some small things are missing or need to be done differently. Just need more time with it I guess 🤷
Hello everyone
im a student majoring in mathematics and computer science who is interested into going into Data science/ML. i still have a year to graduate and i am planning to take a course in each of them but since summer vacation is coming i want to start working from now(and probably be good enough that i can land an internship in fall semester?) . I have basic knowledge in Python( took a course before) and im reading automate boring stuff with python and planning on reading beyond the basic stuffs with python(some people said its not necessary but i figure out why not expand our knowledge in this language). For Data science/ML what do you suggest i do? Im currently watching Andrew Ng 2018 course given in stanford but im thinking of enrolling in his machine learning specialization course on Coursera. I know i can learn it without a course but i would like to get a certificate so that i put it on my CV. What do you guys think/suggest? does taking this course make me ready as well to data science? Thanks in advance
sorry for the long paragraph lol
and yeah i have knowledge in MySql and databases
if u have decent understanding of the maths behind common models and methods, then id suggest just diving into using pandas, sklearn, tf/pytorch, etc
is there a "real" multioutputregression model and not just the approach to fit each model x-times for x-targets?
yes
or well, what do you mean?
in a vector-valued function, you can in general treat the output as a vector of functions, each one "independent" to each other (not in the statistical sense, i just mean you can always write it this way, with each entry being a separate function)
each output value depends in general on all the inputs. the relationship between the outputs is a separate matter. you can interpret this as each entry in the output vector being a separate estimator of its own/a separate regressor
So i have a little problem with my code,
My signals wont come up on my graph, and when trying to troubleshoot it, i found out my Signal is a nan value (NaN) rather than inf.
"# Add predicted signals to a copy of the dataframe
df_copy = df.copy()
df_copy['signal'] = np.nan
df_copy['signal'] = knn.predict(df[['closing-price', 'daily-return']])"
When changing this to np.inf it dont change my signal to inf value but stay NaN?
Im so lost?
Both closing and daily return when printed comes out as inf but when i plot in my signal it becomes NaN?
What do you mean becomes nan? How are you distinguishing inf and nan on a plot?
Also, I'd expect it to not matter in the slightest what you set signal to since you override it with predict's return immediately after.
Depending on what i print, when i print my signal i get this
"2021-03-17 NaN"
When i print the closing and daily return it stands in inf value like this
"2021-03-17 123.276093"
So as far as i understand from different pages i searched (and even chatgpt) its because my value isnt the same?
So my signal wont come on my graph
It sounds to me that knn.predict(df[['closing-price', 'daily-return']]) returns a nan for that row, then
Perhaps one of these two columns has a nan on that row, or something's really wrong with the knn.
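A quick way to check that, sketched with pandas and NumPy on a made-up frame (the values and dates are hypothetical; only the column names come from the messages above). Note that a value like 123.276093 is an ordinary finite number, not inf; inf and NaN are two separate special values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN and one inf planted in it
df = pd.DataFrame(
    {
        "closing-price": [123.276093, 124.10, np.nan],
        "daily-return": [0.004, np.inf, 0.001],
    },
    index=pd.to_datetime(["2021-03-17", "2021-03-18", "2021-03-19"]),
)

features = df[["closing-price", "daily-return"]]

# True for every row that has at least one NaN / at least one infinite value
nan_rows = features.isna().any(axis=1)
inf_rows = np.isinf(features.to_numpy()).any(axis=1)

print(features[nan_rows])  # rows that contain a NaN feature
print(features[inf_rows])  # rows that contain an infinite feature
```

If both masks are all False, the inputs are clean and the NaN is being introduced somewhere after the prediction step.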
okay and why would it return a nan from that when all others are correct (i checked?)
The most important foundations for machine learning (as well as statistics) are linear algebra and calculus. If you haven't taken an advanced linear algebra course or a real analysis course, then you should study those. If you haven't taken a probability course, then you should study that. After that, I don't have any strong recommendations; there are a lot of courses out there (online and otherwise), and people say that some are better than others, but it seems to me that there's not much to distinguish them.
I mean until that line, everything is a inf value (or numeric value) but after that line it becomes a NaN? how?
Oh i think i got it
its because i got it as accuracy value before
I think
i dont thik i will find any problem in the mathematical part and i have high grades in calculus linear algebra and probability
In that case, like I said, I don't have any strong recommendations. Find a course that you like and it should be fine.
when pruning a decision tree do you try "all possible" values for alpha when cross validating, or do you only try those returned by cost_complexity_pruning_path (optimal values for a fully grown tree)?
Thanks for putting me on the right track. I think i found the error now: further up i think i put the knn.predict as an accuracy value, which would then fuck up that line.
NotaNeighbor
Thanks i found the problem and its working now.
if i input 3 targets and use multioutputregressor it basically creates 3 fits, so 1 fit for each target, and i want to know if there are approaches (other than NN) that can do all 3 based on 1 set of fits
sure
"neural network" is a very broad term
or fuzzy, should i say
the only difference between a neural network and any other function is that it has a ton of trainable parameters, but otherwise, each of its layers is generally just a function with multiple inputs and outputs
and in general, all of the inputs are used together to produce each output
all matrices do the same thing, for example. in a linear fashion.
yeh its all linear algebra
not all. but a lot
anyway yes, they exist and are commonly used. but what exactly to do depends on which problem you're looking at
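As one concrete non-NN case, ordinary least squares handles several targets in a single solve if the targets are stacked as columns of a matrix. A small sketch with NumPy on synthetic data (shapes, seed, and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 4 features, 3 targets
X = rng.normal(size=(200, 4))
true_W = rng.normal(size=(4, 3))
Y = X @ true_W + 0.01 * rng.normal(size=(200, 3))

# One call fits all three targets: W_joint has one column per target
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# For comparison: fitting each target separately, one column at a time
W_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

print(W_joint.shape)
print(np.allclose(W_joint, W_separate))
```

For plain least squares the joint solve and the per-target solves give the same coefficients, which is the "vector of functions" view from earlier. Models that share structure between outputs (for example a shared hidden layer or a low-rank constraint) are where a joint fit starts to differ.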
its just out of interest
ok. then the answer is yes, and a simple example is the mean estimator
but that is multiple inputs 1 output isnt it?
you can find the mean of a vector
cant i state something like:
there are only single target mathematical models which could be applied to generate a more dimensional model
i mean yeh its wrong in terms of math
that sentence doesn't make any sense to me
i have no idea at all what it's trying to say
multiple parameters are hard to get with classical math
f(x)=x
so really condensed down i mean multiple features and outputs are kinda hard to represent
why?
and what do you mean by "represent"
all vector-valued, vector-parameter functions do what you're saying
(i think, i'm still not sure i got you right)
.latex one can arbitrarily define functions of the form \[
f: \mathbb{C}^N \to \mathbb{C}^M
\]
i mean more in terms of multiple variables
that's equivalent
you can have a vector of N parameters
that's standard notation
this says we take N parameters and give out M outputs
C^N means N cartesian products, so N complex numbers are mapped to M complex numbers here
doesn't matter what N and M are
thanks
have you looked up what multi-objective optimisation is? sounds like that would be of interest to you.
i did not but certainly will thanks
Can anyone here help me optimize this program I'm creating? I made a program that uses MediaPipe's hand tracking for real-time ASL interpretation, I got everything set up how I want but its just eating through my CPU like theres no tomorrow
Ive tested and commented out lines to see what exactly is causing the performance slog, and it seems to be the small Keras model im using to make the classification
The prediction is based off an input shape of (1, 20), but I don't really know how to make it faster or what alternatives are out there
HI, what are your best sites to learn numpy, scipy, matplotlib?
I have a technical test and we are not allowed to do stackoverflow, google, not even an IDE
Matplotlib has some tutorials. Idk about the other 2.
Thank, you mean tutorials in the API?
Docu? Physics and imaging.
But how to become bad - ass in these packages and know the tricks? I lack use cases, I guess.
Is google colab better than a jupyterlab notebook?
Isn't Colab literally jupyterlab?
Except it's not
Or just jupyter?

They're not the same
You r Saturn
Colab is online vs Jupyter being on your local device
Well hmm jupyter starts a server whatever device you use it on
There is also a jupyterhub version
colab is one particular server where you can run jupyter notebooks
one where google gives you free hardware
you can alternatively host your local jupyter server, which is how most people use it
And does colab provide better functionality , e.g. widgets?
So in fine, Google Colab is just an implementation of jupyter?
Or are these two very close tools
I've attempted to include dropouts, mess with hyperparameters, regularization and data preprocessing, but nothing is working
Anyway sorry for chiming in. Thanks for the infos
You're good Clem
not an implementation, just a particular host for jupyter notebooks
you can set up your jupyter server on one device and connect to it from a different one
google just set one up with very nice hardware, for everyone to use
Yeah! That's what I thought originally, np thanks for clarifying
do we know if colab uses jupyter under the hood? my understanding is that "notebooks" are a general thing, and jupyter notebooks are a flavor of them.
I think it is jupyter under the hood yes - probably heavily rewritten by Google haha
yeah some other links say "based on the jupyter open source", but those links aren't by google
heavily rewritten. if you replace the head and the handle of a hammer ten times, is it the same hammer?
the ship of theseus notebook of google
Oh yeah I still consider it the same thing. It's just probably adapted to Google's infrastructure now
Reminds me a bit of Python ducktyping
It quacks and walks like a duck, it is a duck
need some help in data science project new to python and data science.
Hey I want to get into AI/ML, please suggest some resources and courses
Doing a project in spare time to get better w/ python -- involves using the techniques covered in CSE 6040.
I am attempting to design a method which automates collection of utility data from the UCB website, along with the UCD website. (electricity, steam, water, even waste - all into one core unit, kWh energy demand for a complete energy outtake picture/comparison?)
I was going to go the route of selenium, SQL & automating accessing data from a webpage that updates every 24 hours (Selenium/Beautiful Soup code to pull the div containers, then use regex to translate the strings to an appropriate format, wrap it into data structure of choice) a headache and a half...
Can anybody help? -- my skill level is not at the place where I could line up the string of numbers and show 1 element with the date for each of those numbers being store in another element...
https://ceed.ucdavis.edu/ https://engagementdashboard.com/universityofcaliforniaberkeley/ucb/building/8750/consumption/month
Historical and real-time building energy data for University of California Davis. The first step to saving energy is seeing how much you use.
they have a graphql endpoint. Not sure about the license though
are you telling me to look into the graphql endpoint? at one point though, do you after accessing the API?
you may want to reach out to them directly. They will be better able to guide you
as far as accessing the api though, would you be able to try the UCB link?
I am not going to try any random API that doesn't have an explicit open access
These urls don't have explicit open access?
My end goal is to run Kmeans on a large, sparse dataset. The data is currently in json form. I am trying to use databricks community edition to load and process the data. Reading the json alone takes about 15 minutes. I am just starting the project as far as the machine learning and loading the data goes. Up until this point, I've just been gathering the data.
The data seems too large to load into the driver's memory.
What general advice can you give me to help reach the end goal? If you need to know more about the data or the problem, let me know.
How large is your json file? @rugged comet
it's not because you don't lock your doors that it gives me the rights to get in.
Same thing here
Plus if you make mistakes or misuse it, they may just cut you off or take down the whole thing.
And in addition, it's always awesome to receive an email to get a thank for the api and showing enthusiasm
Yeah 15 minutes of read time if not storage device limited is weird, 80GB json only takes at most its read speed for me usually?
The file I'm trying to use is 675,863 KB. This is only the first 500,000 samples though. I'll have 5 files total, each about this size except for the last one which is smaller.
Whaaaa thank for the API? I'm going to have to YouTube this. I've never tried accessing the API. Or I might ask a LinkedIn connection for help.
I don't see why it would take 15 mins to load a json unless it just has too much data
Is there not a better way to store the data?
Yeah something is really up char, can we see your read implementation?
df = spark.read.json("/FileStore/tables/edhrec_deck_data_500000.json")
Is it basically just a table?
Why not have pandas directly read the json? Not quite following what spark is doing in this situation
Can you show the first 10 lines of your json?
It's like a list of dictionaries.
Here's the format
[
  {
    "commanders": ["Abaddon the Despoiler"],
    "color identity": [...],
    "hubs": [...],
    "cards": {
      "Dockside Extortionist": 1,
      ...
    },
    "theme": ...
  },
  {
    "commanders": [...],
    ...
  },
  ...
]
commanders is a list of up to 2 strings.
color identity is a list of up to five characters
hubs is a list of up to ~8 strings
cards is a dictionary of up to 100 pairs where the keys are strings and the values are integers that go from 0 to 100.
theme is a string
Is this maybe relevant?
lol, nested jsons
I'll read this.
It still to me feels like something pandas could handle directly if its just loading json to a dataframe
I wanted to use spark because it has extra memory basically. Might be a silly reason.
Your files are only a few MB? Its been a while but believe you can have pandas read in chunks if you need to keep memory usage lower
You can load with pandas and then convert to pyspark as well if needed
That SO fix sounds good too
A few MB is a bit of an understatement. The total of all the files will be about 3.5 GB.
It's usually ok, 4gb of RAM is doable
(Used to Terabytes so my norm might be skewed)
Oh
Same, I was about to propose using bigquery
i think all my files are formatted such that each line is a valid JSON object.
wait no
It's a list of dictionaries basically. So I think that's not true.
Still if you need it low memory. Reading in chunks and handling type coercion at read to smaller data types can help a lot with that. Json is nice for humans to read. But say your dockside extortionist. That column only needs to be a unsigned 8 bit int. Same for some other stuff. The column names are only stored once and the representation should be a good deal smaller
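To put a number on the dtype point, here is a tiny NumPy sketch comparing a default 64-bit integer array with an unsigned 8-bit one. The array shape is an arbitrary assumption; the only thing taken from the discussion is that card counts stay between 0 and 100, which fits in `uint8` (range 0 to 255).

```python
import numpy as np

# Hypothetical block of card counts: 1000 decks x 50 cards
counts_default = np.zeros((1000, 50), dtype=np.int64)

# Same values stored as unsigned 8-bit integers
counts_small = counts_default.astype(np.uint8)

print(counts_default.nbytes)  # bytes used with 64-bit integers
print(counts_small.nbytes)    # bytes used with 8-bit integers (8x fewer)
```

The same cast can be applied per column in pandas with `astype("uint8")` once the counts are known to fit.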
https://paste.pythondiscord.com/rudaxajemi
Here are the first 10 objects in the list.
Hey folks, a quick question- I'm trying to fill in gaps in my cs education and I'm currently reading "Attention is all you need", but I don't have enough context on how attention mechanisms work for me to follow and the paper sort of assumes everyone knows how attention mechanisms work. Can anyone direct me to a research paper or similar resource that's actually introducing/explaining attention mechanisms?
So there are about 25000 unique values like Dockside Extortionist. So that would be about 25000 columns. The shape of the resulting dataframe would be like 2,500,000 samples X 25,000 columns. Is this too large?
Maybe spark is taking so long to read because it's trying to convert all the unique values in cards to columns like you said. This would make the dataframe much larger, not smaller, I think.
you have mixed quotation marks in your input json?
'Mizzix's Mastery' for example
"Mizzix's Mastery" would probably help things a whole lot assuming its as your actual input
Yeah so some of the card names can have " (double quote) in the name. And some can have ' (single quote).
yeah, thats undefined behaviour in json to the best of my knowledge?
The python discord paste I provided is what Python prints out for the first 10 objects in the list.
import json

with open("edhrec_deck_data_500000.json", "r") as f:
    decks = json.load(f)

for deck in decks[:10]:
    print(deck)
This is the code that generated the paste that I posted.
The actual json file probably looks different. I can't open it though.
try notepad++
yep
it should be ok, just means someone has compacted it, I was hoping to see how the structure could be improved on, but what you pastebinned earlier was not valid json as it had already been parsed,
If you could pastebin say the first 10K chars (it counts them on the bottom of notepads window) and ping me, I'll have a look this afternoon
I don't think I can select the first 10k characters using normal means (highlighting with the mouse). It's far too slow. Any other way to do this?
usually one would use spark with newline delimited json, and not really use it to parse a mega huge json array like the one you have.
parsing a huge json array is extremely slow and is as far as i know a single-core operation
when compared to parsing a new line delimited json, the difference is night and day, because spark can just delegate different section of the file (i.e. different lines) to other cores to parallelise the parsing process.
in short using spark yields no benefit here as far as i know.
if RAM permits (it should, the file is "tiny" compared to actual big data scale), you can look into using the most performant json parser out there, then convert it to something that is more spark-friendly and resume your work there (if it is really necessary - people misuse spark for all sorts of reasons imo.)
otherwise look into streaming json parsers, i know it's a possibility but i have never found a use for it.
Is it possible to convert what I have into newline-delimited json?
yes. though i am not really sure what is the most performant way of doing so.
have you tried the multiline=true option in spark as recommended in the SO post though?
ah also now that i actually read the SO post, the TLDR Solution is exactly what you need to convert into JSONL, though i wouldn't use json... it's slow as heck, using orjson or cysimdjson is better
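For reference, a minimal sketch of the array-to-JSONL conversion. It uses the standard-library `json` module only so that it runs without extra installs; the faster parsers mentioned above would replace the `json.load` / `json.dumps` calls. The function name, file names, and demo records are made up for illustration, and the whole array is loaded into memory at once.

```python
import json
import os
import tempfile

def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one big JSON array as one JSON object per line."""
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)  # loads the whole array into memory
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)

# Small self-contained demo with hypothetical records
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "decks.json")
    dst = os.path.join(tmp, "decks.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        json.dump(
            [
                {"commanders": ["A"], "cards": {"Mizzix's Mastery": 1}},
                {"commanders": ["B"], "cards": {}},
            ],
            f,
        )
    n_written = json_array_to_jsonl(src, dst)
    with open(dst, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(n_written, len(lines))
```

The resulting `.jsonl` file is the format that `spark.read.json` expects by default, without the multiLine option.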
It wasn't clear to me where to put that parameter. Also, multiline=true doesn't make sense to use to me because I only have one line.
Are those library alternatives to json?
yes.
okay there might be a misunderstanding to what multiline means.
have a look at the reference here: https://spark.apache.org/docs/latest/sql-data-sources-json.html
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine option to true.
sure - your file doesn't literally have multiple lines, but that's not what multi-line JSON file means.
it is merely trying to make the distinction between jsonl and json file, and in your case you have json file not jsonl, hence multiLine is a sensible option to use
Oh okay
it's frankly quite poorly named.
[a, b]
does not span multiple lines
but [ a, b ] does
but they are the same thing in json, multiLine is just not very clear what's going on
but i guess from a parsing point of view, it does make sense.
it's a flag to tell spark "hey you can split the file line by line and just process each line by itself" or otherwise
i created an nlp ai i would love for people to help me train it!
also where can i find nlp data in the form of questions and answers
Kaggle https://www.kaggle.com and PapersWithCode https://paperswithcode.com are probably good places to start
df = spark.read.option("multiline", "true").json("/FileStore/tables/edhrec_deck_data_500000.json")
This line has been running for over 15 minutes. It doesn't seem any better than before.
you can use an existing model to generate these
it's been used with decent success for some finetrained models
Also, to get the data encoded for machine learning (Kmeans), I think I want to pivot by cards and group by original_url. I run into a problem though.
pivot_df = df.groupBy("original_url").pivot("cards")
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
It's hard to tell what the format is for the current data but I want it to look like this
color identity,commanders,hubs,original url,tags,theme,CARD_1,CARD_2,...
So I keep the values for the other columns such as color identity and commanders. But I want to add new columns for each value in cards.
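A small pandas sketch of that target layout, assuming records shaped like the format posted earlier plus the `original_url` field from the pivot attempt (the two decks and their values are hypothetical). Each key of the `cards` dictionary becomes its own column, with 0 where a deck does not contain the card:

```python
import pandas as pd

# Two hypothetical deck records
decks = [
    {"original_url": "deck-1", "theme": "a",
     "cards": {"Dockside Extortionist": 1, "Island": 30}},
    {"original_url": "deck-2", "theme": "b",
     "cards": {"Island": 12}},
]

# Everything except the nested "cards" dictionary stays as ordinary columns
meta = pd.DataFrame([{k: v for k, v in d.items() if k != "cards"} for d in decks])

# One column per card name; missing cards become 0, stored as uint8
card_counts = pd.DataFrame([d["cards"] for d in decks]).fillna(0).astype("uint8")

wide = pd.concat([meta, card_counts], axis=1)
print(wide)
```

This dense layout is only practical for small samples: 2,500,000 decks by 25,000 card columns at one byte each is about 62.5 GB, so at full scale a sparse matrix representation is likely the more realistic input for KMeans.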
do decision trees rely on randomness?
singular trees: not much
ensembles: yes
i've never been sure about the right notation for these kinds of things. Is one tree a tree and then an ensemble a forest? I feel like I've seen that wording to describe it
the only "forest" I can think of would be a "random forest", which refers to a specific way of putting decision trees together
an 'ensemble' is any model that makes use of two or more separate/individual models
ah ok, that makes sense. I think I was confusing a forest (an ensemble of trees) with ensembles (an ensemble of any model).
I want to try uploading a MySQL database to azure databricks.
They ask for the information in brackets
database_host = "<database-host-url>"
database_port = "3306" # update if you use a non-default port
database_name = "<database-name>"
table = "<table-name>"
user = "<username>"
password = "<password>"
I'm stuck on how to get the database-host-url. Is that the same as the value returned by
SELECT @@hostname;
?
My database is hosted on my local machine.
Unrelated, but thank you @rugged comet for causing me to learn about the distinction between multi-label and multi-class a few months ago. I'm working on a multi-label classification project now at work.
(also I know nothing about databricks. sorry)
Nice! I'm so happy I could help.
If you know anything about Azure also, that would be helpful. I just started and it's kind of overwhelming in the beginning.
it might be easier to import a dump or csv file
is the way you're trying to import it an option in a website, or do you pass it to a local script? there is a non-negligible chance that the option you're using assumes that the database is hosted somewhere with a public ip
Firstly, I wanted to upload a json file but they said it was too big. Then I thought I could put it in a MySQL db and upload that. Now, it looks like I may be able to use the DBFS. What do you think about the DBFS for this?
is that an AWS thing? I only know enough AWS to be either dangerous as fuck or totally impotent.
Azure is like AWS as far as I'm aware. Just by different companies.
Microsoft Azure
oh okay. I actually know shockingly little about tabular databases and all their x-as-a-service varieties. but for reasons I can't tell you, I do know a lot about the graph database neo4j.
I see...
I do all my tabular data manipulation with pandas, even if the CSV is 80 GB
and I like it
lmao
anyway, I'm displacing your question with my shitposting, so I'll be quiet in the hopes that someone more knowledgeable appears.
Hello everyone
I want a roadmap for data science. Please help me. I have started with Python as my programming language
The file upload thing is solved now I think.
How do you create clusters or a workspace in azure databricks without public ip addresses? I keep getting this error
Error code: PublicIPCountLimitReached, error message: Cannot create more than 3 public IP addresses for this subscription in this region.
when trying to create a new cluster.
have you tried asking in #tools-and-devops btw? or #databases?
I have not. Do you think those channels would be more appropriate for this question?
probably
hi i am working on a data science problem i am new to python unable to solve it if you have some spare time can you help me with it.
just explain it ^^
is that for me?
Somewhat expected, have you tried reformatting your data to jsonl already?
How sparse is your data?
Yes
Cursed. You should look at Polars, the API is a lot cleaner than Pandas and it's so so so much more efficient
Just put the JSON file in an Azure blob storage (cheaper) or azure data lake and connect it there, no? Afaik Azure allows you to connect a "service" to Databricks. Be sure to use key vault because your Azure environments are stored as YAML files under the hood and if you don't use Key Vault your credentials get put in plain text in your repo. Maybe this is changed because it's been a while since I used Azure tbf
And finally: singular trees rely on randomness because sometimes there are ties in Gini/IG that are broken at random. Depending on your implementation there could also be an element of randomness in how to quantise continuous variables. This is why even if you train a vanilla decision tree on the same data with a different seed you may have different results.
Please advise me a book to start with date science in python)
No.
Your choice/loss 🤷. I was pretty stubborn about trying it out as well in the past but it's great
Worst case scenario you don't like it and you go back, you don't lose anything
is there a way to use my AMD RX 470 with pytorch?
pipelines are very convenient (talking about sklearn here), but.. aren't they super inefficient when tuning parameters? I mean, say u have a pipeline with a bunch of preprocessing (drop some columns, impute, standardize one thing, one hot encode another, etc..).. that means all that preprocessing gets done for every hyperparam combination.. every time.. again and again.. when it could be done just once. Am I right about this? Are pipelines actually used in practice? Cus.. that seems like a lot of unnecessary work
probably, but is it a lot of run time?
in principle yes, but to echo shimmer's point, is it actually a lot of run time?
also have you looked into the memory parameter of pipeline? it looks like a parameter to configure a cache
U could always just store a copy of the pre processed data in memory and use that for all ur models
Only need to rerun the pipeline when u close ur notebook or smth
re. store a copy of the pre processed data in memory
yes it's possible. but you run the risk of leaking your test set into your training set if you aren't careful. hiding behind the pipeline interface is very reliable in terms of not leaking your test set
Yes you should also use them while training / tuning hyperparameters. As @boreal gale correctly says the risk of leakage is too big
A million and one ways? stuff like your StandardScaler and OneHotEncoder etc. depend on the batch of data that was seen during your cross-validation procedure
wouldnt everything be run on the df_train only
Taking your entire training set and precomputing these metrics on df_train and then cross-validating is leakage
I'm specifically talking about the case where you cross validate
I gotta look into the memory parameter.., but yeah. By default, if you have a bunch of iterations looking for hyperparameters.. the runtime can pile up, I'd assume
Say your dataset A is split into 80/20 and you're doing 2 fold CV (for the sake of this example) you cannot just fit your preprocessing on the full 80 and then proceed with your CV
ah right
Your preprocessing can only see the 40/100 it gets during the CV procedure
So per definition a lot of preprocessing (but certainly not all) cannot be done ahead of time hence why I'd argue for not risking it and just going with Pipeline and ColumnTransformer because the risk of accidentally leaking is high(er)
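The leakage point can be illustrated with a small standard-library sketch. The data and the helper names `fit_scaler` and `transform` are made up for this example; the idea is only that a scaler fitted on the whole training portion has already seen the rows that later act as the validation fold, whereas refitting inside each CV round (which is what a Pipeline does for you) keeps them separate.

```python
import statistics

# Hypothetical 1-D feature for the training portion of a dataset.
train = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0]


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Two folds: each round fits on one half and validates on the other.
folds = [train[:4], train[4:]]

# Leaky: the statistics come from ALL training rows, including the rows
# that will later serve as the validation fold.
leaky_scaler = fit_scaler(train)

# Leak-free: refit the scaler inside each CV round on the fitting fold.
per_fold_scalers = []
for i, validation_fold in enumerate(folds):
    fitting_fold = [v for j, fold in enumerate(folds) if j != i for v in fold]
    scaler = fit_scaler(fitting_fold)
    per_fold_scalers.append(scaler)
    scaled_validation = transform(validation_fold, scaler)
```

In this toy case the leaky mean is 52.0, while the two per-fold means are 101.5 and 2.5, so the validation rows are scaled very differently depending on whether the scaler was allowed to see them.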
are there other libraries for making pipelines, or is sklearn the most widely used one?
PySpark has pipelines. Recipes in R does essentially the same thing. You can easily make your own with functools in the standard library; pipelines are just function composition. The thing you need to understand is that it's a universal problem: you have to compute them on the fly. It's not a scikit-learn problem
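The "pipelines are just function composition" remark can be sketched with `functools.reduce`. The helper `compose` and the three preprocessing steps are hypothetical names invented for this example. Note that this version is stateless: unlike a scikit-learn Pipeline it has no separate fit and transform phases, so statistics such as the mean are recomputed on whatever data is passed in.

```python
from functools import reduce


def compose(*steps):
    """Chain steps left to right: compose(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical preprocessing steps, for illustration only.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def to_float(rows):
    return [float(r) for r in rows]


def center(rows):
    mean = sum(rows) / len(rows)
    return [r - mean for r in rows]


preprocess = compose(drop_missing, to_float, center)
result = preprocess(["1", None, "2", "3"])
```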
There are tools to orchestrate your pipeline in a smarter way, but it will still use sklearn/pandas/spark in the background
curious to know more, what are these tools?
Apache Airflow is widely used in the industry to design such pipelines
Airflow is something totally different
Yes, it's an orchestrator, it's not a pipeline tool per se
You could make your preprocessing into an airflow DAG but the overhead would be immense
It is, I agree.
Personally I always stick with FeatureUnion, Pipeline and ColumnTransformer. If I need a custom transformer I subclass sklearn stuff
that's my take as well, just curious to learn more about what clem meant
A lot of companies are going for Airflow in their DS pipelines: it is seen as an immense overhead at first, but the value is still there, improving the efficiency of scientists. These companies also develop in-house platform solutions on top of Airflow + [whatever data warehouse solution they use] to boost the productivity of you guys
Pipeline means many different things in data science / data engineering
The pipeline we're talking about are the final steps before inference (preprocessing). This entire thing would be one element of your airflow DAG
To give a bit more technical example, someone was saying some steps are not necessary to run again in a, say, sklearn pipeline.
Now imagine each sklearn pipeline step is a specific DAG, triggered only on the right events. No need to re-run this pipeline step if nothing will change, right?
This is a complex implementation - can't deny it - but it definitely holds value in the long run
I agree, also doing my best not to mix them up
Not sure I agree about that architecture.
Bundling up your entire sklearn model in 1 object also makes deploying it on embedded etc. a lot easier
It's a single thing from start to finish
If the preprocessing is something massive and beyond the scope of sklearn transformers then yeah I would split it in 2 steps of course
I just want something fast, as efficient as possible.
U know how gridsearch passes parameters as.. stepname__parameter... to the pipeline? Any idea if that's the same as reinitiating the pipeline from scratch with those parameters, or if passing them in that way is more efficient somehow?
say, for some reason, I don't want to use gridsearchcv, but to make a manual for loop.. would it be better, from a.. idk, pythonic standpoint, to loop through different parameters and reinitiate the pipeline by passing the tunable parameters directly into the estimators inside the Pipeline (so we end up calling Pipeline and everything inside it each loop), or to use set_params on a Pipeline created outside of the loop?
I hope I didn't screw up my question..
first option would be simpler (for me at least) to understand, cus you wouldn't have to use that slightly strange stepname__parameter.. syntax, but reinitiating everything every time.. might be a bit costly(?) idk, thoughts?
ok.. discord thingies..
there we go
So you'll code up grid search yourself?
Seems a lot easier if you want to deploy on embedded, non-distributed systems, of course in this case Airflow would be overkill - or any orchestrator
In that case you can definitely re-use some preprocessing steps I guess
no.. that's actually for when I define an objective function in optuna.. but the point of the question is the same
So instead of refitting the parameters of your preprocessing you want to reuse them?
lemme rephrase.. give my slow fat fingers some time to type
this is a really bad example, but.. basically I could do this:
for smth in smths:
    analyzer, max_features, max_depth, min_samples_split = smth
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=analyzer, max_features=max_features)),
        ('rf', RandomForestClassifier(max_depth=max_depth, min_samples_split=min_samples_split))
    ])
    ...
or I could do this:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])
for smth in smths:
    tfidf__analyzer, tfidf__max_features, rf__max_depth, rf__min_samples_split = smth
    params = {
        'tfidf__analyzer': tfidf__analyzer,
        'tfidf__max_features': tfidf__max_features,
        'rf__max_depth': rf__max_depth,
        'rf__min_samples_split': rf__min_samples_split
    }
    pipe.set_params(**params)
    ...
and get the same result. But personally, I find the first version easier to understand. But in the first version I end up reinitiating the pipeline (and everything in it) each time, while in the second version I just set different parameters (but have to use that pesky step__param syntax I dislike). So my question is, would the first version be less efficient than the second? Or would there be no difference?
Oh you really just want to create the full grid and then pass on the parameters to the Pipeline?
I'd go for option 1 then, it looks cleaner
noo, that's just how optuna is used, it's supposed to be "pythonic", but yeah, for the sake of discussion, let's say that's what I'm trying to do
Either way I'd read this on Stack Exchange: https://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil
It does look cleaner, doesn't it? But what bothers me is the reinitiation of the pipeline. Won't that be.. costly? I guess this is more of a python question and how it all works on a low level, but I'm just assuming that creating a bunch of class instances would be costly
Don't overthink 1% performance gains when there's more obvious performance gains you could pursue
Even if you want to pursue that 1 % you can code both of them up and profile / time it
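One way to follow the "code both up and time it" advice is a small `timeit` comparison. This is only a sketch: the parameter values are made up, nothing is fitted, and the numbers will vary by machine and scikit-learn version. It relies on the scikit-learn convention that estimator constructors only store their arguments, so both options measure configuration cost alone, which is likely small next to the cost of fitting.

```python
import timeit

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical hyperparameter combination, for illustration only.
params = {"analyzer": "word", "max_features": 1000,
          "max_depth": 5, "min_samples_split": 4}


def rebuild():
    """Option 1: construct a fresh Pipeline for each combination."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=params["analyzer"],
                                  max_features=params["max_features"])),
        ("rf", RandomForestClassifier(
            max_depth=params["max_depth"],
            min_samples_split=params["min_samples_split"])),
    ])


shared_pipe = Pipeline([("tfidf", TfidfVectorizer()),
                        ("rf", RandomForestClassifier())])


def reuse():
    """Option 2: keep one Pipeline and update it with set_params."""
    return shared_pipe.set_params(
        tfidf__analyzer=params["analyzer"],
        tfidf__max_features=params["max_features"],
        rf__max_depth=params["max_depth"],
        rf__min_samples_split=params["min_samples_split"],
    )


# Time only the configuration step; no fitting happens here.
rebuild_seconds = timeit.timeit(rebuild, number=200)
reuse_seconds = timeit.timeit(reuse, number=200)
print(f"rebuild: {rebuild_seconds:.4f}s  set_params: {reuse_seconds:.4f}s")
```

Both functions end up with the same configuration, and in either case calling `fit` afterwards refits every step from scratch.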
I'll give it a read. "Premature micro optimizations are the root of all evil", can't really disagree with that, but.. that doesn't stop me, unfortunately. I spent a bunch of time determining what's faster, using lambda or a defined function, calling a class function and passing in an instance of that class or calling a method directly on an instance of a class, and other such mini nonsense optimizations.. Guilty
yeah.. should probably try that.. someday..
I'm looking to be better able to process Excel spreadsheets. I get a myriad of reports in various formats, and I want to pull the data out of these workbooks into a usable format. I need to know what is the "best" methodology for reading worksheets and removing unneeded rows and columns, lining up data that might be offset by a row or column, taking what is essentially data in a form-type format, and turning that into a dataframe, and so forth. My understanding is that pandas is the way to go for these types of things but I lack the foundation of understanding the ramifications of reading something into a dataframe vs a dictionary, which is "better" (which I'm sure depends on the situation) and so forth.
I've been through a few tutorials on how to read data into a dataframe when it's already nicely formatted in Excel; so I don't need any help with that. The situation I'm in is that these spreadsheets were generated with "viewing things" in mind, and not actually processing the data and using it for something beyond just looking at it in Excel. Hopefully this makes sense. Sorry for the long message. Thank you for your help!
yeah pandas can do it too
drop cols, rows. split stuff up, drop cols then join again to realign them
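As a rough sketch of the "drop cols, rows, then realign" idea in pandas: the frame below stands in for a report-style sheet read without a header (for example via `pd.read_excel(..., header=None)`), and its contents and the "Region" label are made up for illustration. Real workbooks will need their own rules for locating the header row.

```python
import pandas as pd

# Hypothetical sheet as pandas might read it with header=None:
# a title row, a blank row, an empty spacer column, then the real table.
raw = pd.DataFrame([
    ["Monthly report", None, None, None],
    [None, None, None, None],
    ["Region", None, "Units", "Price"],
    ["North", None, 10, 2.5],
    ["South", None, 7, 3.0],
])

# Drop rows and columns that are completely empty.
trimmed = raw.dropna(how="all").dropna(axis=1, how="all")

# Find the header row by a known label, promote it, keep what follows.
is_header = (trimmed.iloc[:, 0] == "Region").to_numpy()
header_pos = trimmed.index[is_header][0]
table = trimmed.loc[header_pos + 1:].copy()
table.columns = trimmed.loc[header_pos].tolist()
table = table.reset_index(drop=True)

# Columns read from a mixed sheet come in as generic objects.
table["Units"] = pd.to_numeric(table["Units"])
table["Price"] = pd.to_numeric(table["Price"])
```

The operations work on whole rows and columns at once, so there is no need to build the DataFrame row by row.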
I would go for Pandas
Or maybe even Xarray. You can annotate axes with units
Attach metadata
But, where can I learn how to do all of this? Does it go row by row? It just seems like there's going to be a lot of custom programming, in my newbie opinion.
You start on the docu website
and you can try stackoverflow, this is how I started with pandas
I am not an expert, but that was my approach
Or you ask here. Alternative: Look for code mentorship. Somebody you can discuss with
Like, if there's data that isn't in proper columns, something is offset because someone formatted the workbook so things line up but not because they're in the same column, how do I move data around? Do I do that with a dictionary? Or a dataframe somehow? Like, merge two columns or... idk even what to ask as I'm so new to this kind of thing with python.
I had chatGPT give me something and it read in row by row, building a dataframe, if I remember correctly. It's been several weeks since I looked at that code. I managed to get it to work for what I was doing but, I think I talked about it here and someone went "that's not how you do this" lol.
docu website?
have to get off the train, I'll be back in about 15 minutes, thank you for your help! Links are always appreciated.
The first issue sounds more like a formatting problem. You have to find the entry and clean it
get the data from the source and abandon excel for most things
Would it not make more sense to clean the data first and then put it into a dataframe
Docu ---> Pandas documentation
What is experiment tracking?
ditto, also super curious what people think of experiment tracking.
i haven't had a good solution
tracking machine learning models
instead of model1.pth, model2.pth, etc
mlflow was decent for me on my own machine
not done any collaborative work so
I'd like to try to do that, the disconnect for me is how people use the data without Excel? It seems like at the end of the day you need something that's static that will display charts/graphs/tables of information that can be shared with folks and not need to be generated whenever the data is viewed. I get the processing side of things, python is definitely more powerful in that regard, but then, what do people use as a GUI?
dashboarding tool like Power BI
MLflow is what we use at work
I'm fine with Tensorboard or even Weights & Biases if it's me just playing around on the weekend
i haven't played with mlflow at all. will the fact that i don't work with NN frameworks at all matter?
nop
No, it works perfect with sklearn as well
sweet. thanks.
i have a follow up question, how do you share notebooks in your org?
@timid kiln : I don't necessarily use a GUI. I get the data from a device or it is saved into hdf5 file formats
Tbf the reason we decided on using MLflow is that someone on my team only uses R and it's the only one that integrates with R as well
To display data? Matplotlib
ah
Share notebooks internally or externally?
primarily interested in internally, for
- knowledge dissemination
- keeping track of what you have tried/explored and for what reasons did you abandon that line of research
I have an embarrassing question. Why does this not work?
a= np.array([[1,2, 4], [3,5, 6], [7,8,3]])
a[a>5]=0
print(a[a>5]=0)
Set all the entries in the array to 0 where the condition is correct. The print command never works.
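A minimal sketch of what is going on here: `a[a>5]=0` is an assignment statement, which changes the array in place and has no value of its own, so Python rejects it inside `print(...)` with a syntax error. Printing the array after the assignment works, and `np.where` is an option if the goal is a new array without modifying the original (the second array `b` is just a copy of the same values, added for illustration).

```python
import numpy as np

a = np.array([[1, 2, 4], [3, 5, 6], [7, 8, 3]])

# `a[a > 5] = 0` is an assignment statement: it changes `a` in place
# and produces no value, so it cannot be placed inside print(...).
a[a > 5] = 0

# Print the array itself after the assignment.
print(a)

# To leave the original untouched and get a new array instead,
# np.where is an expression and can be passed straight to print.
b = np.array([[1, 2, 4], [3, 5, 6], [7, 8, 3]])
print(np.where(b > 5, 0, b))
```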
Git + markdown files and powerpoint?
We turn notebooks to webapps if it's something we want to share externally. Sometimes if I'm lazy I use pandoc and turn it into a PDF or HTML and use cron to email it on a fixed schedule.
If it's internal stuff / dissemination I'm a fan of just version controlling whatever you're doing, writing reports and doing a powerpoint presentation about your progress (that's what we do)
i like git + markdown, but graphs are probably omitted / too unwieldy to be checked in which loses the richness of the report, which is a shame.
powerpoint is nice and all but imo is hard to diff, quickly lookup, also it takes time to prepare and would limit the velocity of development.
i think given there are enough people in an org, your org's approach definitely makes sense, but sadly it doesn't for me, i am in a startup with a team size of 3 technical staff
(Applied) AI research.
Graphs are included on github at least
oh! do you check in graphs as *.png or whatever and link them inside the markdown?
if that's the case, i might try to adopt that as well, i think it saves a lot of time down the line not having to desperately dig out what you have done in the past from random commits
I'd just put the .ipynb as is on git and just have a README file with an executive summary of what goes on in the notebook
that's sensible as well
Properly naming your folders, files and making sure the notebooks aren't too long / doing too many things goes a long way as well, no?
But I imagine you already do that
MLFlow is a big one in the sense that I standardise what will get logged (the plots, metrics) in a Python / R template and also make sure I never delete data in the DB. Only inserts. I keep track of a version number, that goes into MLFlow as well. Kind of a bootleg version of DVC 🤣. Why? I want to be able to go back in the past and recreate any specific experiment
yes, i am indeed doing those, but i still find it rather unmanageable
dang, i really gotta check out MLFlow!
Hello everyone, I wanted to ask if anyone would know something about a free versatile and knowledgeable AI chatbot to use in my code... I'm trying to find a free one because I'm gonna give it to my friends for testing... Does anyone know a model I can implement in my code?
be careful not to overuse the word "implement". "implementing a model" means to write all the code for the model on your own. if you're loading an existing model, you're just using it.
what does the chat bot need to be able to do as compared to ChatGPT?
If I encode it, it is very sparse (25000 columns and only up to 100 of them have values). Currently, it's dense though.
The DBFS is free as far as I'm aware. Can't really beat that price.
I have not tried reformatting to jsonl.
Yes, but that would require everyone that uses the Excel workbook to have python installed, and re-run the code. With Excel, at least you are able to capture and publish the results in a format where anyone within a given company could use the information, and not everyone would need python installed.
Everyone I chat with here is so anti-Excel, and that's fine, we're all allowed our opinions. But I don't get how other people, like your average non-programmer non-engineer person, are supposed to be able to use a tool with no GUI. Admittedly I am not very experienced with python so I am well aware that I am lacking a lot of information on how the rest of the world uses python without Excel.
The thing is, I don't know that the issue is with the data in its current form. I think the issue is that it will be too big on one machine.
With 25000 columns and 500000 samples, that's already 60 GB I think. And that's just 500000 out of the total 2500000 samples.
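The size estimate can be checked with some quick arithmetic. The sample and column counts come from the discussion; the per-row limit of about 100 non-zero entries and the sparse storage layout (one int8 value plus two int32 indices per stored cell) are assumptions made for this sketch.

```python
# Figures taken from the discussion: 500,000 decks in the first file,
# about 25,000 distinct cards, at most ~100 distinct cards per deck.
n_samples = 500_000
n_columns = 25_000
max_nonzero_per_row = 100

GB = 10**9

# Dense storage: every cell is stored, zeros included.
dense_cells = n_samples * n_columns
dense_gb = {
    "int8": dense_cells * 1 / GB,
    "float32": dense_cells * 4 / GB,
    "float64": dense_cells * 8 / GB,
}

# COO-style sparse storage (assumed layout): one int8 value plus two
# int32 indices (row and column) for each non-zero cell only.
nonzero_cells = n_samples * max_nonzero_per_row
sparse_gb = nonzero_cells * (1 + 4 + 4) / GB

for dtype, size in dense_gb.items():
    print(f"dense {dtype}: {size:.1f} GB")
print(f"sparse upper bound: {sparse_gb:.2f} GB")
```

So the dense size depends heavily on the dtype (12.5 GB as int8, 50 GB as float32, 100 GB as float64), while storing only the non-zero cells stays under half a gigabyte under these assumptions.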
imo just format to jsonl first.
using a big json blob is generally anti-pattern when you are using spark.
i would sort out the input data format before thinking about anything else, e.g. "I think the issue is that it will be too big on one machine." is a secondary issue, spark can spill to disk if required, also not to mention there is sparse data structure support in spark.
all of these are pretty pointless if your input data format is borked and hard to work with (which it currently is, you still have one massive json blob)
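A one-off conversion from a single JSON array to JSONL can be done with the standard library. This is a minimal sketch: the function name, the file names and the sample records are made up, and it assumes the source file fits in memory (which seems plausible for a file of about 3.5 GB on a machine with enough RAM, but is worth checking).

```python
import json
import os
import tempfile


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON value per line.

    json.load reads the whole array into memory, which is acceptable for
    a one-off conversion of a file that fits in RAM.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
    return len(records)


# Tiny demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "decks.json")
    dst = os.path.join(tmp, "decks.jsonl")
    sample = [{"original_url": "u1", "cards": {"foo": 1}},
              {"original_url": "u2", "cards": {"bar": 25}}]
    with open(src, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    written = json_array_to_jsonl(src, dst)
    with open(dst, encoding="utf-8") as f:
        round_trip = [json.loads(line) for line in f]
```

Once the file is in this shape, Spark's JSON reader should be able to read it without the multiLine option, since one record per line is the layout it expects by default.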
Power bi, or any other dashboarding tool
Understood. My industry has been slow to adopt such things. Many people use Excel as a word processor. :/
You start to learn command line tools or jupyter notebooks for workflows.
Why not hdf5 file format?
Asking commercial folks to use command line tools... lololol They just want the chart, they can't be bothered to do anything beyond that. Maybe click a button. In Excel.
Agreed, but azure data lake costs virtually nothing
What are you currently stuck on?
I haven't heard of that format.
hdf5 makes sense if you're ... working with Hadoop
Currently stuck on getting enough compute in the free azure trial. It appears that the quota for a free account is only 4 cores. The memory that comes with a 4 core cluster is only 14 GB.
What is your problem in full?
I don't think I'm working in with Hadoop.
no you're not
I currently have one of 5 json files of data. I want to cluster the data using kmeans. Some of the samples are labeled and others are not. This would be semi-supervised kmeans. I believe that I'll need to use a cloud service to do this. In its current form, the data would be only 3.5 GB roughly. However, if I encode the data so that it's ready for machine learning, the first file's samples would be about 60 GB. This is too big for my machine.
To elaborate on encoding the data, from the current data, I would create a column for each feature. There are about 25000 features. There are about 2500000 samples. I'm just trying to load the first 500000 right now as a proof of concept first.
Let me know what other questions you have.
So in total you'll have, what 300GB worth of data?
If I encode it, I believe so, yes.
I use hdf5 without Hadoop in my life.
But I know I can store enough data in it, I can read it lazily and incrementally
I just generated the other day a hdf5 file with 350GB.
How does 3.5gb go to 300gb
alot of preprocessing lel
Look, with polars you can use scan_ipc, scan_csv, scan_parquet so you can "easily" use the lazy API and sink_parquet to incrementally add your features etc. even if your dataset is larger than memory
I don't know how you'll actually do k-means reasonably
I don't know what the spark implementation of it looks like
only useful with good data i guess
Read wrongly, it was 3.5gb to 60gb
Let me explain.
In its current form, one of the features of the json file is a dictionary of up to 100 keys and values.
Here's an example
"foo": 1,
"bar": 25,
"baz": 1,
...
To encode this for machine learning each of foo, bar, baz, etc would become a new column. There are 25000 unique values like foo, bar, baz, etc. So the dense data where it's a dict of less than 100 pairs gets turned into sparse data with 25000 columns.
wild
Rip kmeans
and out of curiosity u get reasonable outputs from that structure?
Well we are going to try and see what happens. That's the point.
Tbh if I have that much data I'd be thinking about sampling
What do you mean?
is the goal just to cluster or are u doing more with the data?
fancy dancy language model?
K-means can work with all your data on disk and passing sequentially but it'll just be slow lol
The first goal is to cluster, yes. We would like to do other stuff later. But we don't really know that that is yet. Just having fun.
Can you elaborate?
But yeah, I guess that's what Spark's MLlib does either way
In a more optimized way ofc
No it isn't with text data really. It's with deck lists of cards from a card game.
So K-means has 2 steps right? An E step and an M step
but if u got card data u dont need clustering?
We want to cluster decks that are similar to each other together.
If possible
You can read subsets that fit into memory, assign them to clusters and then go back to disk
Use count of a particular card type or smth
While you're doing this you can update the cluster center in an "online" way
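The idea of reading chunks, assigning rows and updating centers in an "online" way can be sketched as follows. This is a simple sequential (MacQueen-style) variant written for illustration, with made-up data and a hypothetical function name; it is not a description of how Spark's MLlib implements k-means.

```python
import numpy as np


def online_kmeans_pass(batches, centers):
    """One sequential pass: assign each row to its nearest center, then
    move that center toward the row with a running-mean update."""
    centers = np.array(centers, dtype=float)
    counts = np.zeros(len(centers), dtype=int)
    for batch in batches:              # each batch is small enough for RAM
        for x in np.asarray(batch, dtype=float):
            # Assignment step: nearest center by squared distance.
            k = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
            # Update step: incremental mean of the rows seen by center k.
            counts[k] += 1
            centers[k] += (x - centers[k]) / counts[k]
    return centers, counts


# Hypothetical data arriving in two chunks, as if read from disk.
batches = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[2.0, 0.0], [10.0, 12.0]],
]
centers, counts = online_kmeans_pass(batches, centers=[[1.0, 1.0], [9.0, 9.0]])
```

Only one chunk and the centers need to be in memory at any time, which is the property that matters when the full dataset does not fit.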
There are 25000 different cards.
ooo.. sounds like a problem suited for NMF.
No reason to handroll this because I'm pretty sure MLlib does this
Natural moisturizing factor (NMF) is essential for appropriate stratum corneum hydration, barrier homeostasis, desquamation, and plasticity.
Non-negative matrix factorization
non-negative matrix factorization
just in case you missed my comment here
Why do you think NMF would work well for this problem?
The parquet shouldn't be too large either I think?
You're just taking a JSON file and one-hot encoding the data right?
There's no need to store all those 0's I think
It's very very similar to one-hot encoding. However, instead of values of 0 and 1, it's values of 0-98.
Thanks for the advice. I can try to convert my json to jsonl and see what happens.
the science of dating is not something we can help you with.
What do you propose instead?
To look for a file format that works well with sparse matrices
I don't know how jsonl works under the hood, I'd have a look at that
there should be a straightforward way of exporting sparse mats as COO
What is COO in this context?
just a hunch, imo k-means's inductive bias is not great for your task, i can't really provide a formal proof or explain formally sadly.
also - NMF is a common technique for building recommendation engines, with a small tweak it could be used to do clustering (and building a recommendation engine is pretty similar to your task, instead of grouping users by movies they like, you are grouping decks by cards they have chosen)
(this class of technique is also called collaborative filtering iirc)
there are other flavors too, like nonzero cols and nonzero rows. which one works best depends on the structure of your sparse matrix
NMF is a collaborative filtering method, like alternating least squares etc etc
Is the matrix sparse? It has values from 0-98 but are most of them 0 or it's distributed from 0 to 98
The vast majority are 0.
The irony is that their format right now stores something like this already
But there are about 100 values that range from 0-100.
Might be an idea to keep it like that
And only to expand it when you need to do your k-means
k-µeans
k-µs
if you were to use scipy's sparse matrices, doing k means should be very efficient
Iirc if you one-hot encode with sci-kit learn you get a sparse matrix as output anyway
I mean what I have now is already a dense representation of the data. But I think when I do kmeans, I would need to expand it.
It's not exactly one-hot you're doing but you get my point
there shouldn't be a need to expand at any point tbh
o rly?
With expand I meant turning it into a sparse format a la scipy
That's what I meant by expand as well.
Because that's what one-hot in sklearn automatically does when your cols are beyond a certain number
nothing in the distance computation requires you to explicitly have the vector in dense form
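To illustrate why the distance computation never needs the dense vector, here is a minimal sketch using the identity ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2. The function name and the example values are made up; the sparse row is represented as a plain dict from column index to value, which mirrors how each deck's cards are stored in the JSON.

```python
def squared_distance(sparse_row, center, center_sq_norm):
    """||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, touching only non-zeros of x.

    sparse_row maps column index -> value for the non-zero entries;
    center is a dense sequence; center_sq_norm is sum(c_j ** 2),
    computed once per center and reused for every row.
    """
    x_sq_norm = sum(v * v for v in sparse_row.values())
    dot = sum(v * center[j] for j, v in sparse_row.items())
    return x_sq_norm - 2 * dot + center_sq_norm


# Hypothetical example: 6 columns, a row with two non-zero entries.
center = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0]
center_sq_norm = sum(c * c for c in center)
row = {1: 3.0, 4: 1.0}

sparse_result = squared_distance(row, center, center_sq_norm)

# Same quantity computed the dense way, for comparison.
dense_row = [row.get(j, 0.0) for j in range(len(center))]
dense_result = sum((x - c) ** 2 for x, c in zip(dense_row, center))
```

The work per row scales with the number of non-zero entries (about 100 here) rather than with the 25000 columns.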
Actually, it's the default: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
What I have now is essentially the vector in dense form. I was under the impression that I needed it in sparse form for kmeans. Unless you misspoke.
This is a sparse format already
more or less
you don't need it in either form, the computation can be done regardless
so might as well use a sparse one
what i wrote there was correct
No I think the sparse form would be a column for each foo, bar, baz, etc and there's 25000 unique of those.
huh
To quote Edd: coordinate array. the entries are saved as triples with row, column, value
Each JSON entry is a row, your key is a column and your value is the value
This makes sense.
i leave you peeps to it, i just wanted to comment on sparse matrices
I just wouldn't know how you'd get a JSON into COO maybe @wooden sail has pointers?
is the json dense?
Afaik it's sparse as well
what i mean is, does it have all the values?
umm
or it's already in a COO/CSR/CSC form
I can give an example that might answer your question.
because in those forms, only the nonzero values are stored
Yes, only the non-zero values are currently stored.
Ah so it's already sparse
then it's indeed already sparse
you just need to load it in a friendly format for whatever module you're using to create sparse matrices
Okay. I was misunderstanding what you all meant by sparse. I thought sparse data was mostly zeroes and a few "hot" features. It seems like that's not true though based on what you're saying now.
that is indeed sparse, but when i mentioned COO, CSC and CSR, these are efficient sparse representations
they don't store all the zeros explicitly
so one usually refers to these special representations as sparse, and the matrix with all the 0s in it as dense
That makes sense and confirms that I was misunderstanding the terminology.
So it sounds like I should try to figure out how to convert the json data I have now into an actual sparse matrix such as scipy's coo_matrix.
i think that would be good
Dense:
|1 2 3 4 |
|5 6 7 8 |
|9 10 11 12|
|13 14 15 16|
Sparse:
|1 0 0 4 |
|0 0 0 0 |
|0 10 0 0 |
|0 0 0 0 |
Sparse COO:
[(0, 0, 1), (0, 3, 4), (2, 1, 10)] <- Less memory usage, faster matrix multiplies (if sparse enough / large enough matrix).
Sparse CSR:
data:    [1, 4, 10]
indices: [0, 3, 1]
indptr:  [0, 2, 2, 3, 3]
Even faster, but takes more time to build, and can't dynamically add more easily (build once, multiply many times).
(DOK is the same as COO, but uses a dict instead of a list, good for when you have non-zero entries added / removed dynamically all the time)
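The layouts above can be reproduced with a few lines of plain Python. This is a minimal sketch for illustration only (the helper names `dense_to_coo` and `coo_to_csr` are made up, and in practice a library such as scipy does this for you). It uses the usual CSR convention in which the row-pointer array has one entry per row plus one, so empty rows show up as repeated offsets.

```python
def dense_to_coo(matrix):
    """Return (row, col, value) triples for the non-zero entries."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]


def coo_to_csr(triples, n_rows):
    """Return (data, indices, indptr) from row-sorted COO triples.

    Row i's entries live in data[indptr[i]:indptr[i + 1]].
    """
    data = [v for _, _, v in triples]
    indices = [j for _, j, _ in triples]
    indptr = [0] * (n_rows + 1)
    for i, _, _ in triples:
        indptr[i + 1] += 1           # count non-zeros per row
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]   # turn counts into running offsets
    return data, indices, indptr


sparse = [
    [1, 0, 0, 4],
    [0, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 0, 0],
]
coo = dense_to_coo(sparse)
data, indices, indptr = coo_to_csr(coo, n_rows=len(sparse))
```

For the 4x4 example this gives the three COO triples, and for CSR the values `[1, 4, 10]`, column indices `[0, 3, 1]` and row pointers `[0, 2, 2, 3, 3]`.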
ah dok is good here, since json can be read as a dict
maybe that's the easiest for this case
DOK is good for incremental construction, especially if out of order.
Still, how do you convert a JSON to DOK
by converting to dict and passing the dict to scipy's sparse
indeed
Interesting, TIL
i've never used that specific one so i don't actually know what you pass it. i HOPE you can pass a dict of tuples or somth
Note: Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.
```py
import numpy as np
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # Update element
```
I don't think I can guarantee that there are no duplicate decks.
even if not possible to make the matrix out of a dict though, one can read the json as a dict and then make it into a list of tuples in O(nonzero entries), and feed that to scipy
then use COO
If you try to insert a duplicate row, col, it will probably raise an exception.
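A quick way to check what happens with a repeated (row, col), as a small sketch with made-up values: assigning to the same position of a dok_matrix twice overwrites the earlier value, while a coo_matrix built with a repeated coordinate sums the entries when it is converted.

```python
import numpy as np
from scipy.sparse import coo_matrix, dok_matrix

# DOK: a second assignment to the same (row, col) replaces the first one.
S = dok_matrix((2, 2), dtype=np.int8)
S[0, 1] = 3
S[0, 1] = 5
print(S[0, 1], S.nnz)

# COO: repeated coordinates are kept and summed on conversion.
C = coo_matrix(([3, 5], ([0, 0], [1, 1])), shape=(2, 2))
print(C.toarray()[0, 1])
```

So whether duplicates matter depends on which of the two behaviours you want for repeated entries.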
To be clear, my json looks like a single list of dictionaries where each dictionary is a sample.
parsing text is not my forte so i leave y'all to it. i do know that json can be parsed directly into python dicts, so it shouldn't be too troublesome to massage the data into something scipy sparse likes
best of luck
Thank you.
I learnt a lot from this convo though thanks edd and squiggle
May require normalization, good luck. I really dislike JSON for reasons like this.
If possible convert it once to a better format and use that (if you need it multiple times).
shape = (samples, cards)
mat = sp.dok_matrix(shape, dtype=np.int8)
for id, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        mat[id, card] = quantity
I think I'm on the right track with this.
The id should be the row, the card should be the column, and the quantity should be the value.
However, card is still a string. I think perhaps it should actually be the index of the card if it were a column?
I was inspired by this
https://stackoverflow.com/questions/37862139/convert-dictionary-to-sparse-matrix
Incrementally making the dok matrix seems very slow. Is there any faster way than manually looping through the samples?
cards_list = list(unique_cards)
mapper = {card: index for index, card in enumerate(cards_list)}
for id, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        mat[id, mapper[card]] = quantity
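One option that avoids assigning into the matrix element by element, sketched here under the assumption that the JSON is a list of dicts with a "cards" mapping as described above: collect plain Python lists of rows, columns and values, then hand them to coo_matrix in one call. The deck contents and card names below are made up.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical sample data: one dict per deck, card name -> quantity.
decks = [
    {"cards": {"card_c": 1, "card_b": 30}},
    {"cards": {"card_b": 25, "card_a": 1}},
    {"cards": {"card_c": 1}},
]

# Map every card name to a column index.
unique_cards = sorted({card for deck in decks for card in deck["cards"]})
mapper = {card: index for index, card in enumerate(unique_cards)}

# Collect (row, col, value) triplets in plain lists.
rows, cols, vals = [], [], []
for row, deck in enumerate(decks):
    for card, quantity in deck["cards"].items():
        rows.append(row)
        cols.append(mapper[card])
        vals.append(quantity)

# Build the matrix in one go, then convert to CSR for downstream use.
shape = (len(decks), len(unique_cards))
mat = coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.int8).tocsr()
print(mat.toarray())
```

Decks with different numbers of unique cards simply contribute different numbers of triplets, so no padding is involved.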
Can you elaborate, please? I haven't heard about this.
Oh
Ideally you could pass the dict to the dok_matrix directly and it would internally (hopefully in C or something) do a fast loop for you.
Or the file directly.
I just use my own sparse matrix types written in C with Python bindings so IDK.
For this approach, what would the dictionary look like? My guess is that the keys would be maybe the deck id and the values would be the cards.
I don't think dok_matrix from scipy can take a dict.
When I run into performance issues where I need to do manual loops in Python I tend to make my own library and then call that from Python.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html#scipy.sparse.dok_matrix
Maybe I could use a coo matrix then?
It can take a dict
Look at the docs
def update(self, val):
    # Prevent direct usage of update
    raise NotImplementedError("Direct modification to dok_matrix element "
                              "is not allowed.")
idk
scipy/sparse/_dok.py line 113
def _update(self, data):```
Yeah ig, don't see why it has an unimplemented update method...
"""An update method for dict data defined for direct access to
`dok_matrix` data. Main purpose is to be used for effcient conversion
from other spmatrix classes. Has no checking if `data` is valid."""
return dict.update(self, data)
But I doubt it is a lot quicker, seems like it is just a dict underneath
So no fancy C shenanigans
Yeah, and reading from the file is also something you probably want to happen in C.
Reading from the file is pretty quick. Only 11 seconds.
I guess you only need to read it once?
In C I approximate like a few ms.
If you don't want to touch C, Mypyc and Cython are options.
I thought we weren't supposed to use those private methods.
Yeah, but I do anyhow when needed because i'm a bad programmer.
(I actually just make my own library that does just take in the file so I don't have this issue)
I think things are working now.
Looks like sklearns KMeans only works with CSR format sparse matrix. Good to know.
I think I neglected to mention that our arrays of deck lists are jagged. One deck might have 95 unique cards and another might have 100.
What should we do in this case? I was thinking we could pad the arrays to the max length using a number that isn't being used to represent a card.
I've never heard of these
What is scipy? It seems to be a package I'm not familiar with. What are you using it for? I've used tkinter…
tkinter doesn't really have anything to do with data science. scipy is more functions for numpy, basically
scipy has some data structures for matrices. That's all I know so far.
pretty sure that part is numpy
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html
This is what I'm exploring now.
hmm okay. I've mostly used scipy for the stats stuff (https://docs.scipy.org/doc/scipy/reference/stats.html)
Elaborate?
Well I need a matrix representation that is small (sparse). To do this, I'm using scipy's sparse matrices. Think of it like having lots of (row, column, value) elements.
I'm trying to understand this
https://scipy-lectures.org/advanced/scipy_sparse/lil_matrix.html
It looks like they do support the fancy indexing. Can you help me understand how it works, please?
Not sure what conclusion I can come to. Creating the matrix incrementally seems too slow and creating it all at once isn't feasible due to memory problems.
mtx[:2, [1, 2, 3]] = data
First two rows, column indices 1, 2, 3 = data.
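A tiny runnable version of that assignment, as a sketch with made-up numbers (the matrix size and the contents of `data` are assumptions): the right-hand side needs one row per selected row and one column per selected column index.

```python
import numpy as np
from scipy.sparse import lil_matrix

mtx = lil_matrix((4, 5))

# 2 rows x 3 columns, matching the selection [:2, [1, 2, 3]].
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

mtx[:2, [1, 2, 3]] = data
print(mtx.toarray())
```

Everything outside the first two rows and columns 1 to 3 stays zero and is not stored.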
Ok…
I've created a button to display a graphic from viewer.py file. I want the graphic from the viewer.py file to be displayed by clicking a button from the gui that I have created. Here's the code for a gui and graphics. Both are in separate files.
GUI.py Code
import customtkinter

class Myapp():
    def __init__(self):
        self.root = customtkinter.CTk()
        self.root.geometry('1050x600')
        self.root.title("APx Platform")
        self.m1 = customtkinter.CTkButton(self.root, text="View Plot", font=("Ubuntu", 12))
        self.m1.grid(row=1, column=1, padx=(65, 65), pady=(5, 10))

app = Myapp()
app.root.mainloop()
And here is the viewer.py code
import pandas as pd
import matplotlib.pyplot as plt

class csv2df():
    def __init__(self):
        self.df = pd.read_csv("RMS level.csv", skiprows=[0,1,2])

    def plot(self):
        self.x = self.df["Hz"]
        self.y = self.df["dBSPL"]
        plt.plot(self.x, self.y)
        plt.xlabel("Frequency (Hz)")
        plt.ylabel("RMS Level (dBSPL)")
        plt.show()

data = csv2df()
data.plot()
Can you please fix it as I need this for a project.
Thank You.
Is there any merit to Mojo's hype?
we'll see
personally can't wait to get my hands on it to see what's up, i have a fair bit of numerical code that is in need of optimisation
i have shoehorned them into numba at the moment but it looks kinda ugly and hard to maintain
have you tried jax yet :x
hmm nope, what's the headline difference between that and numba?
also probably worth mentioning i have extremely limited chance of utilising vectorisation - i am dealing with streaming data
it's an implementation of numpy and scipy on XLA, so the code looks just like usual numpy. it brings its own jit and autodiff though, and the jit is a lot more flexible than numba's. also can run on gpu and tpu without (m)any changes. numba only has few numpy and scipy functions jitable, and only with limited arguments
as an example, specifying order='F' is not supported on most numpy and scipy functions with numba
no aot though, only jit. i think numba has aot
oooo.. that sounds very promising.
https://jax.readthedocs.io/en/latest/jax-101/07-state.html <- this is also exactly what i need
more thing for me to play with!
i like it a lot tbh
you can always simulate aot manually kek
call the function before actual execution during initialization
for loops and the prng do require a little getting used to, you CAN but don't really wanna use native python loops
the biggest selling point for me is being able to jit, autodiff, and run on gpu while still looking like numpy
for most simple functions, you can straight up replace import numpy as np with import jax.numpy as np
that sounds awesome
btw, how big is the overhead of moving data to-and-from GPU these days? i haven't looked into that space for years, curious to know at what size of data would you gain noticeable speed gain by shifting your workload to GPU now
it's still the bottleneck
even just moving/copying stuff in memory is usually the bottleneck. it gets worse if you move between mem and vmem
that's why quadro and a100 cards cost an arm and a leg
bro, can u help me with this issue
###################################
I NEED HELP! with SHAPLEY VALUES
###################################
I want to know how to calculate SHAP values at the record level, and what the units of a SHAP value are.
I have built an XGBoost Classifier model and I'm using the same model to calculate the SHAP values. I'm confused about the unit of the values that shap returns, and if possible I need them as probabilities.
I just got to know that shap returns log-odds for an XGBoost Classifier.
I'm trying to convert the values to probabilities using the function below:
def logit2prob(logit_val):
    prob = 1 / (1 + np.exp(-logit_val))
    return prob
But when I sum up the converted values at the record level, they don't add up to the class-1 probability predicted by the XGBoost model.
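A sketch of why that sum does not match, assuming the explainer really returns log-odds contributions as stated above: SHAP values add up in log-odds space (base value plus contributions gives the raw model output), and the sigmoid is not linear, so it has to be applied once to the total rather than to each contribution. All numbers below are made up for illustration and do not come from a real model.

```python
import numpy as np

def logit2prob(logit_val):
    return 1.0 / (1.0 + np.exp(-logit_val))

# Hypothetical values for a single record.
base_value = -0.4                          # explainer's expected value, in log-odds
shap_values = np.array([0.9, -0.3, 0.6])   # per-feature contributions, in log-odds

# Add everything up in log-odds space first, then convert once.
margin = base_value + shap_values.sum()
prob = logit2prob(margin)

# Converting each contribution separately and summing gives something else entirely.
wrong = logit2prob(shap_values).sum()

print(prob, wrong)
```

Here `prob` is the quantity that should line up with the model's predicted probability for the record, while `wrong` is not even guaranteed to stay below 1.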
!e you meant this? ```py
data = [
    [10, 20, 30],
    [1, 2, 3],
    [100, 200, 300],
]
import numpy as np
arr = np.array(data)
print(np.sum(arr))
print(np.sum(arr, axis=-1))
```
@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 666
002 | [ 60 6 600]
oh wait no, exp isn't even an aggregating function
I have no idea about what you mean by record level in that case, post an example with input&desired output
<@&831776746206265384> i accidentally deleted a VERY long message explaining my problem.. ( i just wanted to delete the images and not the whole message) could someone pls recover it from the server logs? and send it into some DM or channel so i can edit it properly and paste it in here?
dm'd you you might want to fix the formatting
u mean the problem or the way i typed the message?
it lost formatting when i copied it
Sup, i have a df that looks like that:
       Timestamp            Creator             Year  Month
27414  2021-02-10 21:01:12  GameSünden          2021  2
34085  2019-08-15 09:27:56  Kedos               2019  8
41306  2018-06-10 18:41:54  Dream TV            2018  6
653    2023-03-21 15:36:00  King Fish           2023  3
48795  2017-06-24 08:43:31  Mrmobilefanboy      2017  6
25894  2021-04-05 00:16:51  WWE                 2021  4
25397  2021-04-17 17:29:08  Étienne MzA Gaming  2021  4
1450   2023-03-06 15:26:23  Nicholas Ma         2023  3
4257   2023-01-22 20:47:28  NRML MTBer          2023  1
I now want to create a MultiIndex DataFrame which allows me to track how my viewing habits of certain Youtubers have changed from 2017 to 2023 on a monthly basis. By "viewing habits" i mean how many videos i watched of a certain creator
therefore i have a df of the creators that i wanna track...
top_creator=(temp_df.value_counts().sort_values(ascending=False))[0:15]
Paluten 643 --> Total number of videos i watched by the creator
Galileo 631
ExplosmEntertainment 542
Benx 488
DieBuddiesZocken 395
...
So... i am looking for some help to this code:
df_creator_track = df.resample('M', on='Timestamp')['Creator'].value_counts().sort_values(ascending=False)
I am still missing how to include the list of the top_creators from above into this. My goal is to achieve something i will share in the next picture
oh u mean backticks... right they get lost all the time when formatting
And under those:
...
...
...
i only want the creators to show up that i have in my top_creator df
@wheat snow I must be misunderstanding your need because it feels like you could get your result with a groupby?
yes... kinda, i'm honestly not a big expert in the groupby function
no worries, I can try to point you in the right direction:
I would:
- Transform your timestamp column so that every date becomes just year-month (use pd.to_datetime as well as the format option); let's name the transformed column 'month'
- Group by 'month' as well as the 'creator' column and sum the views
i just looked through some pages in my book, and maybe found an idea... what about pivoting the df... yk flip it up take the channels i am looking for as columns and only leaving the the Timestamp
It would work but you would be left with a wide data structure
which is not ideal, I prefer long
so:
```
2022-04-29 15:25:16 --> 2022 | 4
```
so 2 columns, year and month? or in one column?
From your need I think a single column of format 2023-01 would be more helpful
alright, splicing would be easier then i think
That's one way, for example:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year_month'] = df['timestamp'].dt.strftime('%Y-%m')
then you'll just be one groupby away :
df = df.groupby(['year_month', 'creator_name'])['views'].sum()
thing is i don't have a views column... the "views" can be seen as the number of rows that exist in the month we are currently looking at
so .count?
bruh true
why im thinking so hard
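A minimal sketch of the counting version, assuming the column names `Timestamp` and `Creator` from the frame posted earlier. The rows and the `top_creators` list below are made up for illustration; `groupby(...).size()` counts rows per group, so no views column is needed.

```python
import pandas as pd

# Small made-up frame with the same two columns as the real one.
df = pd.DataFrame({
    "Timestamp": ["2021-02-10 21:01:12", "2021-02-11 09:00:00", "2021-02-12 10:00:00",
                  "2021-03-01 08:00:00", "2021-03-02 08:30:00"],
    "Creator": ["Paluten", "Paluten", "Galileo", "Paluten", "Benx"],
})
top_creators = ["Paluten", "Galileo"]  # hypothetical stand-in for the top creator names

df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["year_month"] = df["Timestamp"].dt.strftime("%Y-%m")

# Keep only the tracked creators, then count rows per (month, creator).
counts = (
    df[df["Creator"].isin(top_creators)]
    .groupby(["year_month", "Creator"])
    .size()
)
print(counts)
```

The result is a Series with a (year_month, Creator) MultiIndex, one count per month and creator.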
The issue with Mojo's benchmarks is that they are crap. Just like Julia's benchmarks
They benchmarked matrix multiplication with stdlib lists as matrices and compared it to their highly optimized stuff lol
Why not at the very least compare it to Numpy / Jax / Numba / Cython / ...
I am pretty sure that the matrix multiplication demo was not meant to be a benchmark?
their actual "benchmarks" are in https://performance.modular.com/ though tbh still unsatisfactory
for proper benchmarks you would usually rather rely on third parties than on what the providers say about themselves anyway
The notebook they handed out to content creators like Fireship did frame it like a benchmark tbh
they did not "hand it out" to Fireship and others
Then what happened there?
they posted it publicly and mentioned in the launch video, and content producers used the material publicly available
The actual benchmarks you linked look a lot more convincing than the notebook I saw on Fireship so I'll walk back a big part of my claim
They said up to N times faster than python, in great part for effect/impact, but I don't think that anyone would interpret that as to that many N times faster than python with numpy/tensorflow/pytorch/etc
Such claims make me a bit skeptical about the thing in its entirety because they could just not have done that. Imo not making those claims gives a lot more credibility.
Because their actual benchmarks (the ones you linked) are credible / interesting
Maybe that's just me though, I'm "allergic" to marketing.
a lot of comparisons vs julia are also bad, they don't put the code into functions and the jit never gets used
So the benchmarks would be biased against Julia in that case?
yep
Interesting
Is that true?
You can just output raw scores and inspect the PR-curve /ROC / ... without upsampling/upweighting etc.
well, class weights is literally making the performance biased towards underrepresented groups in the data
if you only care about the balanced accuracy it probably™ should help most of the time (as long as your data is not way too ridiculously unbalanced like 1 to 10k, in which case you might as well use outlier detection instead of classification)
GPT itself is pretty biased towards keeping true to its prior statements though, especially if you get argumentative.
try asking about it in a new session
are you looking at accuracy or at balanced accuracy?
not really
but if you care equally about each class (instead of equally about each record), then it makes more sense to look into balanced accuracy than normal accuracy
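A toy illustration of the difference, with made-up labels: balanced accuracy is the mean of the per-class recalls, so a model that always predicts the majority class looks fine on plain accuracy and poor on balanced accuracy.

```python
import numpy as np

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

# Plain accuracy: every record counts equally.
accuracy = (y_true == y_pred).mean()

# Balanced accuracy: every class counts equally (mean of per-class recall).
recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
balanced_accuracy = float(np.mean(recalls))

print(accuracy, balanced_accuracy)
```

This is the hand-rolled definition only; it says nothing about which metric suits a particular assignment.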
Is this good enough justification for the behaviour?
Yes
pretty sure that it's just that really
Nice
don't quote me on that though
That's why I had this follow up question
I will add your discord tag in my assignment
And put you in the references
eh, at least do compare the balanced accuracy of the models with and without class weights if you can
I want to use the GPU for my GAN but I don't know how to do this?
which module did you make your gan with? pytorch?
No i use tensorflow
```
year_month  Creator
2017-05     Benx           37
            Galileo        30
            LeKoopa        17
            TheBietz        2
            baastiZockt    36
2017-06     Benx           54
            Galileo        40
            LOGO           48
            LeKoopa        45
            TheBietz       45
            baastiZockt    83
```
how would i sort smth like that? i want each month to be sorted from highest to lowest so for each month we still have 5 values and they should be sorted
this is what i used to create it:
```py
month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()
```
based on that df:
Creator Timestamp
0 Phil Laude 2023-03-29 07:47:14
1 Sing King 2023-03-29 00:18:05
2 orijimi 2023-03-29 00:17:53
6 Sing King 2023-03-28 22:22:05
9 orijimi 2023-03-28 21:19:02
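One way to get each month sorted from highest to lowest, sketched on a small subset of the values from the Series shown above: turn the Series back into a frame, sort by month ascending and count descending, then restore the index. The name `Views` is taken from the groupby line; everything else is illustrative.

```python
import pandas as pd

# A few of the (month, creator) counts from the Series above.
month = pd.Series(
    [37, 30, 17, 54, 40, 83],
    index=pd.MultiIndex.from_tuples(
        [("2017-05", "Benx"), ("2017-05", "Galileo"), ("2017-05", "LeKoopa"),
         ("2017-06", "Benx"), ("2017-06", "Galileo"), ("2017-06", "baastiZockt")],
        names=["year_month", "Creator"],
    ),
    name="Views",
)

# Months stay in chronological order, counts are descending within each month.
sorted_month = (
    month.reset_index()
    .sort_values(["year_month", "Views"], ascending=[True, False])
    .set_index(["year_month", "Creator"])["Views"]
)
print(sorted_month)
```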
I used this command : pip install tensorflow
https://www.tensorflow.org/install/pip#windows-native here's how to install the gpu flavor on windows, provided you have an nvidia gpu and have already installed its drivers
the cuda toolkit and cudnn need to be installed along with tensorflow
Hi! Could someone help me with some language modeling? ie. BiLSTM for next word prediction?
How does Pandas apply work when you select a particular axis (let's say axis = 0). Does it go over cell of a column one by one, or does it perform a vectorized operation?
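A small experiment that may help with this, using a made-up frame: with axis=0, `DataFrame.apply` hands the function each column as a whole Series rather than one cell at a time. Whether the work is vectorized then depends on what the function does with that Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
seen = []

def column_range(col):
    # Record what apply passes in: the column label and the object type.
    seen.append((col.name, type(col).__name__))
    # max() and min() here are vectorized Series methods.
    return col.max() - col.min()

result = df.apply(column_range, axis=0)
print(seen)
print(result)
```

A function that loops over the Series in Python would still be slow, even though apply itself only calls it per column.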
Guys, I'm trying to make a decent Variational AutoEncoder, but it seems that it only produces really blurry images. Is it a sign that I need to make it train for even more time? Or do I need to increase its parameters(like the latent vector size)?
Hm... I guess I'll try increasing my decoding loss weight...if it doesn't work, increase the encoding dimension
I'm not planning to use this VAE to generate images by itself, so I guess prioritizing the decoding loss might actually be good?
I'm somewhat confused about the advantages of an FPGA vs a GPU. I know that both offer far more threads for parallelization than a CPU. Resources online say that an FPGA offers "lower latency" and more "power efficiency" but they don't go into specifics. From what I understand each CPU or GPU core executes instructions one at a time, like MOV, PUSH, or ADD, whereas an FPGA, like an electrical circuit with IC Logic Gates, can perform numerous operations on a given value in one clock cycle without having to put values in registers then wait for another clock cycle. Is this what they mean?
FPGAs are programmable hardware. They can be configured to act as if you had made a specific hardware device for your specific problem.
They are programmed using a hardware description language (HDL), which is even more low level than assembly / machine code / micro-ops.
GPUs are not well suited to all problems, they are more general now than they used to be, but there is still only a certain set of problems they are good at.
FPGAs can be setup to be "flow-through" where there is not really a clock (except for the processor that is often on the same board to control the FPGA, and some parts still need a clock (memory)), it's just happening, in the same way a regular circuit does not need a clock.
If I have an AND gate and flip one of the inputs, the output changes, without waiting for some clock.
*Also CPU and GPU cores execute more than one instruction at a time.
Squiggle...you and my physiology teacher are making me really consider the possibility of diving into robotics...
the only problem is that I lack the lifespan for so many things
But I'll check if coursera has anything about it anyway
I've found some which use...Python? I wasn't expecting Python to be used in robotics. I thought it would be more something slightly lower-level, like C++, C...
A lot of people in robotics (not meant as an insult, they are very talented people) are not that proficient in programming (they are more interested in the physical machine) and C++ is not exactly a language that helps the user become proficient in a straight forward way. And Python happens to be a language that many people find easy to use for non-programmers. Plus you get access to all these libraries (e.g. OpenCV and now all the ML ones). However a lot of C/C++ still happens, because you often have to work with whatever SDK you are provided by the hardware manufacturer.
I see... Yeah... I wasn't expecting the folks in robotics to not be that proficient in programming
I mean... they're not that far from dealing with hardware, I guess, and hardware seems to require quite low-level programming
Being a proficient programmer and into robotics makes you very desirable by many.
After all, our machines are like robots...but instead of movements and actions in the real world, they perform it in the digital world
It's also a supply and demand issue. There are only so many people into robotics. It's not exactly as easy to find a job for it as something like web development.
Can you sync matplotlib figures when using show ?
What I mean by that, I explore data on one plot, but want to see the same position in another window, for example raw and filtered data. It's just too much to put on a single plot
need to put more dots
Try using fig, ax = plt.subplots(x, y)
fig, ax = plt.subplots(2,4)
for x in range(ax.shape[0]):
    for y in range(ax.shape[1]):
        ax[x,y].axis('off')
ax[0,0].imshow(saving_image[0])
ax[1,0].imshow(saving_image[1])
ax[0,1].imshow(saving_image[2])
ax[1,1].imshow(saving_image[3])
ax[0,2].imshow(original_image[0])
ax[0,3].imshow(original_image[1])
ax[1,2].imshow(original_image[2])
ax[1,3].imshow(original_image[3])
plt.show()
This will create a window with dimensions 2x4(2 rows, 4 columns), in each row i and column j you'll be able to add a plot(or image, in this case).
You can do pretty much the same thing you do with plt using ax[row,column], but applying exclusively for a single plot. Just need to add "set" in most cases.
Like ax[0,0].set_title("Test") (instead of plt.title("Test")). The legend is an exception: it's ax[0,0].legend(), there is no set_legend.
still have to move each subplot
no thanks
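If the goal is to keep the x position in sync between two separate windows, one thing that may be worth trying is sharing the x-axis between axes that live in different figures. A minimal sketch with made-up signals, where `sharex` ties the second axes to the first:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 500)
raw = np.sin(t) + 0.3 * np.sin(25 * t)  # made-up "raw" signal
filtered = np.sin(t)                     # made-up "filtered" signal

# First window.
fig_raw, ax_raw = plt.subplots()
ax_raw.plot(t, raw)
ax_raw.set_title("raw")

# Second window whose x-axis is tied to the first one.
fig_filt = plt.figure()
ax_filt = fig_filt.add_subplot(111, sharex=ax_raw)
ax_filt.plot(t, filtered)
ax_filt.set_title("filtered")

# Changing the x-limits on one axes changes them on the other as well.
ax_raw.set_xlim(2, 5)
print(ax_filt.get_xlim())
```

Calling `plt.show()` afterwards opens both windows; zooming or panning along x in one should then move the other too.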
Basically if you can afford it. An FPGA approach can be infinitely scalable with no overhead as long as your problem is divisible enough, things can be synchronous or async, and if you get fancy you can always throw hardware at making sure you get a result on the same clock cycle it's needed when not constrained by serialization.
But the cost... Yeah, it's not for small things where you can cop the wait time on a GPU. To match a GPU for many tasks you're already talking 100K FPGA cluster cells. The difference is that once you cross that point in an FPGA cluster, the lower latency and the architecture shenanigans you can pull, on top of the power savings, mean you can optimise things for your use case.
How can I choose the right algorithm for a Tweet Sentiment project? Is there any way to plot them perhaps?
NOPE, SHAP EXPLAINABLE AI..
Looks familiar - df… FPGA… C++ is easier than python. But I guess everybody uses python.
Field programmable gate array now I remember.
Regex and df
Should have done all my arduino projects and everything from now on in python
Anyone have any tutorials on links for Image Generation?
How would you measure the performance of a reinforcement learning model? My prof is making us cherry pick the 5 best runs, but that seems really biased and bad :/
```py
from tensorflow.contrib.training import HParams
```
I have downloaded GPT-2 from GitHub, and this part of `model.py` isn't working because I'm using Python 3.8, which doesn't let me use versions of TensorFlow below 2.8.0, and this part doesn't work on versions above 1.5.0.
Can somebody help me to solve this problem please?
upgrade python?
I think usually plotting the loss function against the scores of each run per generation?
But the loss is kind of artificial, as you don't have the correct labels, it's unsupervised
ok, so is it, say, competing network learning, where you have 1 model generating and another discriminating? otherwise how is the scoring implemented? being unsupervised makes it a bit nebulous if your scoring on its own doesn't cover that behaviour
I'll just be using the average winrate, but I am using Q learning. This uses an online and offline model, the online model predicts the action scores, and the offline model predicts the expected action scores.
And the offline model is updated with the online model's parameters every now and then
chess like problem?
bounce the ball, so what's your current scoring method? that could greatly change the answer
oo RL
ah ok, so instead your score function should probably be distance of center of paddle to intersection point of the threshold for the ball
so it can... learn
I'm not having a problem with it learning, it learns
I was just nitpicking on the fact that we have to cherry pick the 5 best runs, which seems biased
so pick the 5 that have the center of paddle most directly under the ball center?
I have that df here:
Timestamp Creator
5126 2022-12-27 23:20:17 ZDF Satire
5825 2022-12-14 20:57:53 ZDF Satire
6014 2022-12-10 21:36:12 ZDF Satire
7731 2022-11-17 17:08:06 GermanLetsPlay
12363 2022-07-20 19:54:39 GermanLetsPlay
and applied this command:
month = df_channels.groupby(['year_month', 'Creator'], group_keys=False)['Views'].sum()
which results in:
```
year_month  Creator
2018-06     LeKoopa           10
            ZDF Satire        16
            marshmallowTV     39
2018-07     LeKoopa            5
            ZDF Satire         1
            marshmallowTV      6
2018-08     GermanLetsPlay    18
            LeKoopa            6
            ZDF Satire        24
            marshmallowTV      6
```
and this is a series: <class 'pandas.core.series.Series'>
So... what i originally wanted to do is plot multiple lines where each line represents one Creator, e.g. (LeKoopa, GermanLetsPlay, ZDF Satire,...). The x ticks shall be the timestamps, so each month, and the y value for e.g. LeKoopa should be the value he has every month, so: ```10,5,6,...``` but i'm not that good with Series and plotting information out of one
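A sketch of one way to get from that Series to something plottable, using a subset of the values shown above: `unstack` moves the Creator level of the index into columns, giving one row per month and one column per creator. Filling missing months with 0 is an assumption about how gaps should be treated.

```python
import pandas as pd

# A few of the (month, creator) counts from the Series above.
month = pd.Series(
    [10, 16, 39, 5, 1, 6],
    index=pd.MultiIndex.from_tuples(
        [("2018-06", "LeKoopa"), ("2018-06", "ZDF Satire"), ("2018-06", "marshmallowTV"),
         ("2018-07", "LeKoopa"), ("2018-07", "ZDF Satire"), ("2018-07", "marshmallowTV")],
        names=["year_month", "Creator"],
    ),
    name="Views",
)

# One row per month, one column per creator; creators absent in a month become 0.
wide = month.unstack("Creator").fillna(0)
print(wide)
```

With matplotlib installed, `wide.plot()` would then draw one line per creator with the months on the x-axis.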
biggest difference in X coordinate
Yeah ig, though a run consist of 1000 episodes, so I'm just picking the ones with the highest sum of winrates
Hard to judge, I'm not going to watch 1000 videos to see how the run progresses, the neural network also has a few tens of thousands parameters, so can't intrerpret that
But my question is answered
This is why I suggest the distance in coordinates, it removes more of the random components and you can still sum them
e.g. if your paddle is 1/5 the width of the bottom, your agents could win 20% of the time by luck, closer to 40% if your position and velocity of the ball is not handled too well
You don't need to watch 1000 videos. You can evaluate the performance of the policy
I assume you have some sparse reward (win/loss)? You can just take the average winrate of the last N episodes. I suspect that's what you planned on doing anyway
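A rough sketch of that evaluation, assuming each run logs a 1 for a win and a 0 for a loss per episode. The outcomes below are random placeholders rather than real results, and `last_n` is an arbitrary choice.

```python
import numpy as np

# Placeholder win/loss logs: 3 runs of 1000 episodes each (1 = win, 0 = loss).
rng = np.random.default_rng(0)
runs = rng.integers(0, 2, size=(3, 1000))

# Average winrate over the last N episodes of every run.
last_n = 100
final_winrates = runs[:, -last_n:].mean(axis=1)

# Summarise over all runs instead of hand-picking the best ones.
print(final_winrates)
print(final_winrates.mean(), final_winrates.std())
```

Reporting the mean and spread over every run is one way to sidestep the cherry-picking concern raised above.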
ure looking for pivot unpivot
so getting it back into a df?
btw what is that group key thing doing, someone suggested it but i have no idea what it does
it takes all the input and groups it into chunks that share those properties
not sure what group_keys=False does
but i think that u want can be done with pivoting or unpivoting the dataframe (idk which is which)
Reminds me I really need to get back to reinforcement learning. I read Sutton & Barto + implemented everything from scratch except a few model-based algos. It's something I don't actively use at work for now (even though there's opportunities) so I'm getting a bit rusty.
At some point I'll redo part of the algos in C, Nim, Rust, ... "for fun" and so I don't forget all of it
how do you guys normally do hyperparameter optimization? do you do grid search or go for Bayesian optimization methods?
Don't grid search imo
Random search is good if you can run it in parallel. Bayes opt is a sequential algorithm so that's what I use if it's something super expensive that I can't run in parallel (e.g., neural networks)
The issue with grid search is that it spends a lot of time iterating over potentially useless parameter settings, random/bayes opt is a lot less sensitive to this
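For reference, a bare-bones random search looks something like the sketch below. The `objective` function is a toy stand-in for "train a model and return a validation score", and the parameter names and ranges are invented for the example.

```python
import random

def objective(lr, depth):
    # Toy stand-in for a validation score; higher is better.
    return -(lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

random.seed(0)
best = None
for _ in range(50):
    # Sample every parameter independently; the learning rate is drawn on a log scale.
    params = {"lr": 10 ** random.uniform(-4, 0), "depth": random.randint(2, 12)}
    score = objective(**params)
    if best is None or score > best[0]:
        best = (score, params)

print(best)
```

Each trial is independent of the others, which is why this kind of loop is easy to spread over several processes.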
do you ever have issues with convergence of the Bayesian optimizer?
it works on old versions and i can't downgrade my python version because my other projects wont work
That last point can be solved by using virtual environments
wdym
I haven't personally. Maybe a stupid question but have you tried increasing the maximum amount of trials?
I haven't, but I will try that. I was more asking in general what strategies people tend to use when optimizing their hyperparameters.
I have seen people before who were more in favor of random search over Bayesian optimization, and I was curious why
Yeah, the reason is that they can run random in parallel indeed
Thanks for clearing that up
nothing stops you from doing random bayes btw, with different starting points for the params
it's a lot more computationally expensive though
So starting bayes opt from N random points?
yeah
I like this a lot - do any "mature" packages implement this?
I can handroll it but I rather not if I'm using it in any "serious" project
i wouldn't know, i'm only doing armchair AI right now
basement AI
I'll put it on the long list of stuff I need to do
Tbf starting a bunch of Python processes that each run bayes opt is the same
https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html i think this does it
I'm running my reinforcement learning model twice to get twice as fast results rn
it sounds reasonable enough that i would almost expect any hyperparam library to have this as an option
RL would really be my favourite domain to do fundamental research in. Specifically in this: https://arxiv.org/abs/2005.01643
In this tutorial article, we aim to provide the reader with the conceptual
tools needed to get started on research on offline reinforcement learning
algorithms: reinforcement learning algorithms that utilize previously collected
data, without additional online data collection. Offline reinforcement learning
algorithms hold tremendous promise for...
a word of caution in case you weren't aware: arxiv is not peer reviewed, it's just a repository
(This got published in NeurIPS as well though but this one is free)
uc berkeley and google add some weight to it, but always check whether what you find on arxiv has a published, peer reviewed version
aha, there we go then
(A big part of my thesis was fixing a rubbish Arxiv paper that would never get published in a peer reviewed journal but had a few good ideas)
you seem pretty savvy in AI stuffs
I did a masters in AI at a top ~50 uni + work as an applied AI researcher. I don't know a lot about NLP for example except surface level stuff from coursework.
I know about the stuff that my profs were interested in
very nice
Hey! Can you tell me some tricks to avoid suboptimal policies? Or specification gaming?
I'm having the problem that I'm trying to train a model in a game with PPO...problem is, the environment is a bit slow to provide a feedback. And even with a reward model working as a reward function to provide continuous rewards, it seems that my model is prone to getting stuck at certain commands...
(I don't know how much the factor impatience also helps...since I never actually let my model train for more than 5,000 steps, and the optimization is done after each 10 steps)
Honestly I wouldn't know, maybe @mild dirge can give you more concrete pointers
My focus was mostly on implementing algos, understanding their properties, making environments, ... It's a shame but I don't have a lot of finesse with actually using it for real things
I'm still doing a course on deep reinforcement learning so not exactly an expert either, my next assignment is to use policy optimization instead of value based learning.
i want to one day try using RL for codingame bot programming tasks, but that'd need a lot of work. I'd need to write a one-file from-scratch implementation of the model in question, train that model locally (after implementing a full simulator of the environment in question), and deploy it by embedding the trained parameters in the file
I think I've heard of people actually doing it, though, so probably it's a powerful method if you want to get through the hassle
for what?
Codingame is a mostly generic programming contest site, basically, but the really fun stuff they have are the "bot programming" and "optimization" tasks, where you compete with other players in making, roughly speaking, the best bot for a certain game. E.g. https://www.codingame.com/multiplayer/bot-programming/mad-pod-racing is a very good example.
The particular one I linked is interesting because it's a real-time game with physics (so the state space is continuous and the action space is pretty big too) and it's inherently multiagent - each player controls two bots, and they ideally need to coordinate with each other to win, and predict the opponent's actions.
is this regex?
nope
but i already solved the issue
im currently trying to flawlessly get some information from the YouTube API
Okay 
Spoiler: ||When your assignment is to implement PPO, you'll have to do both||
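(For reference, a rough NumPy sketch of why PPO involves "both": the usual objective combines a clipped policy term with a value-function term, plus an optional entropy bonus. The function name `ppo_loss` and the coefficient defaults are illustrative assumptions, not values from this conversation. Raising the entropy coefficient is one commonly used knob when a policy collapses onto a few actions early, though it is not a guaranteed fix.)

```python
import numpy as np


def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Scalar PPO loss (to be minimized) for one batch of transitions."""
    # Probability ratio between the current policy and the one that collected the data.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Policy part: pessimistic (minimum) surrogate, negated so we minimize.
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    # Value part: squared error between predicted values and observed returns.
    value_loss = np.mean((returns - values) ** 2)
    # Entropy bonus: higher entropy lowers the loss, which encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * np.mean(entropy)


# Tiny made-up batch: the ratio of 2.0 gets clipped to 1.2 for a positive advantage.
loss = ppo_loss(
    new_logp=np.log(np.array([2.0])), old_logp=np.zeros(1),
    advantages=np.ones(1), values=np.zeros(1), returns=np.zeros(1),
    entropy=np.zeros(1),
)
```

In a real implementation the same expression would be written in an autodiff framework so gradients flow through `new_logp`, `values`, and `entropy`.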

Is it possible to change the order of the fields of a numpy void dtype without copying the data?
hmm, what would it even mean for the fields to be in a different order logically but not in memory?
!e looks to me like it works just fine, though:
import numpy as np
dt = np.dtype({"names": ["col1", "col2"], "formats": ["i4", "f4"], "offsets": [0, 4], "itemsize": 12})
dt2 = np.dtype({"names": ["col1", "col2"][::-1], "formats": ["i4", "f4"][::-1], "offsets": [0, 4][::-1], "itemsize": 12})
arr = np.arange(9).view(dtype=dt)
print(arr)
print(arr.view(dtype=dt2))
@tidal bough :white_check_mark: Your 3.11 eval job has completed with return code 0.
[(0, 0.0e+00) (0, 2.8e-45) (3, 0.0e+00) (0, 7.0e-45) (6, 0.0e+00)
 (0, 1.1e-44)]
[(0.0e+00, 0) (2.8e-45, 0) (0.0e+00, 3) (7.0e-45, 0) (0.0e+00, 6)
 (1.1e-44, 0)]
note how the offsets for dt2 are [4,0], so the logically-first field is the second in memory.
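(A small follow-up sketch to check the "without copying" part, assuming the same two-field layout but with an itemsize of 8 so there is no padding. `np.shares_memory` reports whether two arrays overlap in memory, and a write through the reordered view should show up in the original.)

```python
import numpy as np

# Same fields, same byte offsets; only the logical order of the names differs.
dt = np.dtype({"names": ["col1", "col2"], "formats": ["i4", "f4"],
               "offsets": [0, 4], "itemsize": 8})
dt2 = np.dtype({"names": ["col2", "col1"], "formats": ["f4", "i4"],
                "offsets": [4, 0], "itemsize": 8})

arr = np.zeros(3, dtype=dt)
arr["col1"] = [1, 2, 3]
arr["col2"] = [0.5, 1.5, 2.5]

reordered = arr.view(dt2)                    # a view, not a copy
shares = np.shares_memory(arr, reordered)    # True if both use the same buffer

reordered["col1"][0] = 42                    # write through the view...
first_value = int(arr["col1"][0])            # ...and read it back from the original
```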
Where have I seen df? I swear it was when I was picking up regex.
This is great news, ty
Brih idk
oh im out of it its a sql pandas thing
I installed CUDA 12.1 and cuDNN 8.9, but I saw that the latest versions of TensorFlow no longer support GPUs on native Windows. I'm therefore torn between installing WSL2 and tensorflow-directml, and I also wonder how to set either one up.
wsl was a nightmare for me to set up so i just went with pytorch
didnt know about tensorflow-directML
i don't know what tensorflow-directml is, i do use wsl though and it's just dandy
i use jax gpu on wsl2
if you find it too cumbersome, consider pytorch indeed (unless you already have your code written)
WSL is pretty convenient to set up at least it was for me
wsl has treated me mostly well as well
So how do I install wsl?
what issues did you have with it?
ive blocked it out of my memory now
lemme find my rant messages
https://learn.microsoft.com/en-us/windows/wsl/install you can follow the steps here. on windows 11 it's super straightforward. on windows 10 it can require some extra steps
im on win10
Ok thank you
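(Once WSL2 and TensorFlow are installed, a quick sanity check along these lines can show whether the GPU is actually visible. `gpu_report` is a hypothetical helper name. `tf.config.list_physical_devices` is the TensorFlow call that lists devices, and the sketch falls back to a message if TensorFlow is not installed.)

```python
def gpu_report() -> str:
    """Return a one-line summary of the GPUs TensorFlow can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow is not installed in this environment"
    gpus = tf.config.list_physical_devices("GPU")
    return f"{len(gpus)} GPU(s) visible to TensorFlow"


report = gpu_report()
print(report)
```

If this reports 0 GPUs inside WSL2, the usual suspects are the Windows-side NVIDIA driver or a CUDA/cuDNN version mismatch with the installed TensorFlow build.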
oh i rmb now
it wasnt wsl specifically but mlflow wasnt working properly with wsl
something about file permissions iirc
couldnt save an image of the model architecture as an artifact in mlflow
that doesn't tell me much
yeah welp im over it now
but you probably tried to modify the windows filesystem from inside wsl, or the other way around, which you shouldn't do
pytorch all the way
i think so too
never do that
