#data-science-and-ml
Ah yeah, thanks
Rather than dropping 40 and keeping 20, is there a way to specify which 20 to keep? @undone flare
So basically remove everything except what's passed
you could just create a new dataframe with those 20 columns
try new_df = df[["col1", "col2"]].copy()
Ended up doing a different approach and instead only reading the columns from the csv that I want (using usecols kwarg)
Seemed bad to read the entire database when I only need a fraction of it
that works too
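Both approaches from the messages above, sketched on a tiny in-memory CSV. The column names (`col1`, `col2`, `col3`) are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV with three columns; only two are wanted.
csv_text = "col1,col2,col3\n1,2,3\n4,5,6\n"
keep = ["col1", "col2"]

# Option 1: read everything, then keep a subset.
# copy() makes new_df independent of df, so later edits don't trigger chained-assignment warnings.
df = pd.read_csv(io.StringIO(csv_text))
new_df = df[keep].copy()

# Option 2: only parse the wanted columns in the first place.
small_df = pd.read_csv(io.StringIO(csv_text), usecols=keep)

print(new_df.columns.tolist())
print(small_df.columns.tolist())
```

Option 2 is usually the cheaper one on large files, since the unwanted columns are never loaded into memory.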
skewness before transformation
```
ph : 0.04891026669821542
Hardness : -0.08517383101708786
Solids : 0.595449442721807
Chloramines : 0.01296659647911324
Sulfate : -0.04652296251790013
Conductivity : 0.26666972862929905
Organic_carbon : -0.020002726567027108
Trihalomethanes : -0.051383722200829214
Turbidity : -0.03302682552748457
```
after log transformation
```
ph : -2.2032213464172155
Hardness : -0.8250213149082217
Solids : -1.230858118768609
Chloramines : -1.069749910885117
Sulfate : -0.692747912780153
Conductivity : -0.20033687775898243
Organic_carbon : -0.9940495304526159
Trihalomethanes : -1.2119564041594677
Turbidity : -0.702269975309455
```
the goal is to make something look like a normal distribution right?
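One way to read those numbers: most columns already had skewness close to 0, and a log transform compresses the upper tail, so on roughly symmetric data it tends to introduce left (negative) skew instead of removing skew. A small sketch on synthetic, symmetric-by-construction values (not the water dataset):

```python
import numpy as np
import pandas as pd

# Evenly spaced values between 4 and 10: symmetric, so skewness is essentially 0.
values = pd.Series(np.linspace(4.0, 10.0, 601))

skew_before = values.skew()
skew_after = np.log(values).skew()

print(f"before log: {skew_before:.3f}")
print(f"after log:  {skew_after:.3f}")
```

A log transform is usually only worth trying on columns with clear right skew; here that might be `Solids` at about 0.6, and probably none of the others.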
Anyway to install older versions of tensorflow like 2.3.1?
Without having to compile from source.
pip install tensorflow==2.3.1
That's also the way you're supposed to format requirements.txt files, with the exact version listed
yea, you can ask here
i have successfully deployed my application heroku
for tweet sentiment analysis
It collects specified number of tweets and analyze it , and generate reports
when i tried using 900 tweets it exceeded the memory over 640Mib
when i tried with 600 tweets i faced server timeout, delay in response
how can i avoid it ?
can we extend service timeout ?
if not ram
so that app can work with 600 tweets
I don't know, haven't used heroku much
Count Vectorizer or Tfidf?
tfidf
did you set stop_words_ to None before pickling? (if you used it)
i dint set stopword in tfidf
actually twitter api is slow in sending response
then selecting part from json and creating a dict
converting dict to df
and vectorizing it, applying model
is taking time
can you not load json to pandas df?
contains too much unwanted data
hmm
Hey @hoary wigeon!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
yikes
not to my knowledge, maybe someone has done this same thing
that's gonna take some time ye
How would you put the legend outside to the right?
plt.legend(loc='center right')
has no effect.
I've run my tsv file by Pandas on Jupyter
but i dont know why it appeared like this
anyone know to fix this?
filenotfound error?
but it is true file in my computer
put the absolute path
yepp, i have done it
What sort of analysis would you use to figure out how close two groups of numbers are to one another?
thanks bro
eigenvalues?
Basically there is a "real" number and two algorithms that "guess" what that number is
and I have a ton of data
Trying to determine which algo is better
Oh, pandas has a .corr() method, I'll just use that, haha
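Correlation on its own can be misleading for this: an algorithm that is always off by the same amount still correlates perfectly with the real value. A sketch that reports error metrics next to `.corr()`, using made-up numbers and hypothetical column names (`real`, `algo_a`, `algo_b`):

```python
import pandas as pd

df = pd.DataFrame({
    "real":   [10.0, 12.0, 15.0, 20.0, 22.0],
    "algo_a": [20.0, 22.0, 25.0, 30.0, 32.0],  # always 10 too high
    "algo_b": [10.5, 11.0, 15.5, 19.0, 23.0],  # small errors in both directions
})

for name in ["algo_a", "algo_b"]:
    error = df[name] - df["real"]
    mae = error.abs().mean()            # mean absolute error
    rmse = (error ** 2).mean() ** 0.5   # root mean squared error
    corr = df[name].corr(df["real"])    # Pearson correlation
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f} corr={corr:.3f}")
```

Here `algo_a` has the higher correlation but the much larger error, so which metric matters depends on whether a constant offset is acceptable.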
Anyone know how I can move my matplotlib legend outside to the right of my graph?
loc="right"?
Only moves right and inside the graph.
You can't move it outside
Or at least not with the default options
There might be a different method
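For the legend question, one commonly used approach is the `bbox_to_anchor` argument, which positions the legend relative to the axes, so an x coordinate past 1.0 lands outside the plot area. A minimal sketch with made-up data:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="quadratic")
ax.plot([0, 1, 2], [0, 1, 2], label="linear")

# Pin the legend's center-left point just past the right edge of the axes (x = 1.02).
legend = ax.legend(loc="center left", bbox_to_anchor=(1.02, 0.5))

# bbox_inches="tight" grows the saved image so the legend is not cut off.
buffer = io.BytesIO()  # a filename such as "plot.png" works here too
fig.savefig(buffer, format="png", bbox_inches="tight")
```

With `loc` alone the legend stays inside the axes, which matches what was observed above.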
anyone know a good metrics for multiclass clf ?
Yes
I'm still rather new to programming in general and this is just my second program. I'm using pandas to parse a csv file and pick specific cells with iloc
I haven't read the whole documentation yet, but rather posts on blogs and Stack Overflow
my question is, is defining the type of data contained in each column with dtype={} really necessary?
@serene scaffold
not always
can you show what you've written so far?
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
here's a sample
```py
from os import path
import numpy as np
import random
import pandas as pd
from datetime import datetime as dt
from datetime import timedelta as td

def rollDice(numDice):
    results = []
    for x in range(numDice):
        results.append(random.choice(range(1,7)))
    return sum(results)

print(rollDice(3))

def HabMod_Table(path, rowN, colN):
    a = pd.read_csv(path)
    return a.iloc[rowN,colN]

table_path = R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\12 - Habitability Modifiers Table.csv"
path = table_path
ovrl_wrldType = HabMod_Table(path,1,1)
print(type(ovrl_wrldType))
```
and here's the table:
```
"No atmosphere or Trace atmosphere" ,0
"Non-breathable atmosphere, Very Thin or above, Suffocating, Toxic, and Corrosive",-2
"Non-breathable atmosphere, Very Thin or above, Suffocating and Toxic only" ,-1
"Non-breathable atmosphere, Very Thin or above, Suffocating only" ,0
"Breathable atmosphere (Very Thin)" ,1
"Breathable atmosphere (Thin)" ,2
"Breathable atmosphere (Standard or Dense)" ,3
"Breathable atmosphere (Very Dense or Superdense)" ,1
"Breathable atmosphere is not Marginal" ,1
"No liquid-water oceans, or Hydrographic Coverage 0%" ,0
"Liquid-water oceans, Hydrographic Coverage 1% to 59%" ,1
"Liquid-water oceans, Hydrographic Coverage 60% to 90%" ,2
"Liquid-water oceans, Hydrographic Coverage 91% to 99%" ,1
"Liquid-water oceans, Hydrographic Coverage 100%" ,0
"Breathable atmosphere, climate type is Frozen or Very Cold" ,0
"Breathable atmosphere, climate type is Cold" ,1
"Breathable atmosphere, climate type is Chilly, Cool, Normal, Warm, or Tropical" ,2
"Breathable atmosphere, climate type is Hot" ,1
"Breathable atmosphere, climate type is Very Hot or Infernal" ,0
```
doesn't seem like pandas is necessary for this.
agreed
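For a lookup this small, the standard-library `csv` module is probably enough. A sketch using an in-memory excerpt of the table (without the column padding); `lookup_cell` is a hypothetical helper name:

```python
import csv
import io

# A few rows from the habitability table, inlined so the sketch is self-contained.
table_text = '''"No atmosphere or Trace atmosphere",0
"Breathable atmosphere (Very Thin)",1
"Breathable atmosphere (Thin)",2
'''

def lookup_cell(text, row_n, col_n):
    """Return the cell at (row_n, col_n). The csv module always yields strings."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[row_n][col_n]

modifier = int(lookup_cell(table_text, 2, 1))  # convert to int explicitly
print(modifier)
```

For a real file, `open(path, newline="")` would take the place of `io.StringIO(text)`.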
but it was the first method I found when googling "how to pick a cell in a csv with python" so here we are
well, looks like you figured it out 
yeah
Also, your code is not following pep8
what's pep8?
the style guide. you should never have variableNamesLikeThis
# not pep8
def rollDice(numDice):
    results = []
    for x in range(numDice):
        results.append(random.choice(range(1,7)))
    return sum(results)
# pep8
def roll_dice(num_dice):
    results = []
    for x in range(num_dice):
        results.append(random.choice(range(1,7)))
    return sum(results)
# not pep8
def HabMod_Table(path, rowN, colN):
    a = pd.read_csv(path)
    return a.iloc[rowN,colN]
# pep8
def hab_mod_table(path, row_n, col_n):
    a = pd.read_csv(path)
    return a.iloc[row_n,col_n]
I see
generally, when should I use dtype={}?
because pandas is too awesome to run tables any other way from now on
@serene scaffold
If you want numeric types to be stored a certain way.
can you elaborate?
Like float or int32 or whatever
I see
also, it's reading numbers as str for some reason lol
so I guess I should just always use dtype
how annoying
You don't always need to use it though.
I rarely do. I'd have to look at the source file and the code to understand why your numbers are being inferred as strings.
I'm able to get proper ints and floats from really simple examples.
A bit more 'adversarial' example might be csvs that store floats with surrounding ""
```
9E10000000 ,0.009 ,Trace
0.01 ,0.5 ,Very Thin
0.51 ,0.8 ,Thin
0.81 ,1.2 ,Standard
1.21 ,1.5 ,Dense
1.51 ,10 ,Very Dense
11 ,9E10000000 ,Superdense
```
all values in this table are being read as strings
Those don't seem really...proper 👀
I think it'd be hard for a computer to know they are floats
What does it mean by ambiguous, i got ValueError saying the truth value of a Series is ambiguous
It's not that `1.51` is bad, it's that `1.51 ` (with the space padding after it) is bad
I don't think space-padding in csvs is common, you generally get them really dense
hdr,hdr2
0,1.523523
234,4.5234
23,666.3453
You might be asking for `if <some pandas.Series>` (an instance, not the class), which is taken to be ambiguous
So the elements of pandas.Series can be truthy or falsey
ah okay. it's something I added on Notepad++ so the columns are a bit clearer
And you should try to ask for the truthiness of the elements of the Series instead
So that means they are neither true nor false?
lol I have no idea, I think they just raise an error actually
code
```py
import pandas as pd
from datetime import datetime as dt
from datetime import timedelta as td
from pandas.io.parsers import read_csv

def atmo_pressure_table(path, rowN, colN):
    a = pd.read_csv(path)
    return a.iloc[rowN,colN]

table_path = R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\3 - Atmospheric Pressure Categories Table.csv"
ovrl_wrldType = atmo_pressure_table(table_path,5,0)
print(type(ovrl_wrldType))
```
the table
Ohh so i'll have to target the elements of the series right?
That's probably what you want, right?
You generally want an `all` or some kind of `or` (like `.any()`) over the entire Series
Yeah i want to get out the values that are False
Or an `and` of some kind over the elements of the series
@chilly geyser padding removed
```
9E10000000,0.009,Trace
0.01,0.5,Very Thin
0.51,0.8,Thin
0.81,1.2,Standard
1.21,1.5,Dense
1.51,10,Very Dense
11,9E10000000,Superdense
```
As in, the values are exactly False?
still being read as string
Hmm that's a good question
I didn't know it doesn't understand float notation xEy
I think that might be a cause
print(data1[data1['Price']>0 & data1['Type']=='Free'])
This should target the elements idk why it's giving me that error
the result
```
PS H:\01 Libraries\Documents\Tosh0kan Studios\Coding> & C:/Users/Tosh0kan/AppData/Local/Programs/Python/Python39/python.exe "h:/01 Libraries/Documents/Tosh0kan Studios/Coding/tester.py"
1.51
<class 'str'>
```
nvm it worked i missed out brackets
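The missing brackets matter because `&` binds more tightly than `>` and `==`, so without them Python groups the expression differently from how it reads. A sketch with made-up rows; the column names `Price` and `Type` follow the snippet above:

```python
import pandas as pd

data1 = pd.DataFrame({
    "Price": [0.0, 1.99, 4.99, 0.0],
    "Type": ["Free", "Free", "Paid", "Paid"],
})

# Each comparison gets its own parentheses; & then combines the two boolean Series element-wise.
mask = (data1["Price"] > 0) & (data1["Type"] == "Free")
print(data1[mask])

# To collapse a boolean Series into a single True/False, say which reduction is meant.
print(mask.any(), mask.all())
```

The same rule applies to `|` (or) and `~` (not) when building masks.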
Wait I think I got it
Not sure why yours is being str but when I changed the 9E10000000 to be within float limits like 3E100 it got read as a float properly
Lmao yes
over which, it just becomes a string?
It's stored in a limited memory space
There's an IEEE standard for this (IEEE 754), and the limit for a double (i.e. float64) is roughly ±1.8E308
The exact string inf will work as well
I'm not too sure about InF, Inf, inF, etc within the csv, can test that
oh, they all seem to work, so I think it's auto .lower()-ing them or something
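A sketch of the `inf` idea on an in-memory copy of the pressure table, with the 9E10000000 placeholders swapped for `-inf` and `Inf`. The header names (`low`, `high`, `category`) are hypothetical additions, since the original file has no header row.

```python
import io

import numpy as np
import pandas as pd

csv_text = """low,high,category
-inf,0.009,Trace
0.01,0.5,Very Thin
0.51,0.8,Thin
0.81,1.2,Standard
1.21,1.5,Dense
1.51,10,Very Dense
11,Inf,Superdense
"""

table = pd.read_csv(io.StringIO(csv_text))
print(table.dtypes)                  # low and high come back as float64
print(np.isinf(table["high"]).sum())  # one infinite upper bound

# Look up the category whose [low, high] range contains a pressure value.
pressure = 0.7
row = table[(table["low"] <= pressure) & (pressure <= table["high"])]
print(row["category"].iloc[0])
```

Since pressure can't be negative, a lower bound of 0 instead of `-inf` would work equally well for the first row.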
Yeah there are a lot of 'obvious things' done in pd I think
It's almost too convenient
because like
these tables are for a tabletop RPG
and I'm making a program to automate rolling them
is there a difference between inf and those numbers you were using
ah, so this should work
I put 9E10000000 in those two places because in the table is "0.009 or less"
I think there's dedicated functions for inf, like np.isinf() in NumPy
so I just needed a really ridiculous number that would never come up
Oh, that will certainly fail that, no issue with infinity order checking
this is the original table
Functionally (to me?) positive infinity is just an entity that is greater than any number, and should(?) error if compared against another positive infinity
Yeah >10 seems like you can go inf on it
less than 0.01, you can use 0? or just -inf
I don't think it can read negative inf
I put -inf on the table
but it's reading print(ovrl_wrldType < 0) as false
when it should be true
so I'll go with just 0
Use tab ('\t') as delimiter if you want that
yes, pressure can't be negative.
hello, i would like to know whether ML certificates catch employers' eyes more than someone who didn't get one but has projects to showcase that they know ML?
I am currently pursuing a degree in CS, but school doesn't teach ML, so I'm learning it on my own.
Hey guys, question about ML
What do you call a machine learning task whose output is used for another machine learning task?
is there any difference between:
```py
import pandas as pd

def csv_parser(path):
    a = pd.read_csv(path)
    return a

teeburu = csv_parser(R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\1 - Overall Type Table.csv")
print(type(teeburu.iloc[0,1]))
print(type(teeburu.iloc[0,0]))
```
and:
```py
import pandas as pd

def csv_parser(path):
    a = pd.read_csv(path)
    return a

teeburu = csv_parser(R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\1 - Overall Type Table.csv")
teeburu_df = pd.DataFrame(data=teeburu)
print(type(teeburu_df.iloc[0,1]))
print(type(teeburu_df.iloc[0,0]))
```
I think the second one is redundant
You don't need to specify it as a dataframe if you have already opened the CSV with pandas.
Pandas will load it as a dataframe
so, with the first, the csv is already being stored internally?
import pandas as pd
teeburu=pd.read_csv(R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\1 - Overall Type Table.csv",header=None,sep=';')
print(type(teeburu_df.iloc[0,1]))
print(type(teeburu_df.iloc[0,0]))
I normally read CSV documents like this
you can choose if you want headers or not, and the sep depends on what your CSV uses to separate columns
It could be comma, or this ;
oops, like this:
import pandas as pd
teeburu=pd.read_csv(R"H:\01 Libraries\Documents\Tosh0kan Studios\Coding\GURPS Space\Tables\1 - Overall Type Table.csv",header=None,sep=';')
print(type(teeburu.iloc[0,1]))
print(type(teeburu.iloc[0,0]))
if the file does not load, check for relative or absolute path.
Yeah, it's in memory as an object.
so, if I set header=0, then the first row will be the column names?
How can I do a very simple reinforcement learning, I already know a little about ML
No, the arguments apply to the entire dataset. If set to false, the entire header row will be left out
you'll end up with a matrix of just features but no description/headers
but that's set to 0, as in row 0.
yep
here
header : int, list of int, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file,
straight from the documentation
feature1|feature2|feature3|
---------------------------
0934828|349823849|348238943|
---------------------------
4327823|323848484|378327474|
---------------------------
if set to False it is loaded like this:
---------------------------
0934828|349823849|348238943|
---------------------------
4327823|323848484|378327474|
---------------------------
If set to zero, then you are telling pandas row 0 is the header row
I'm not asking about false. I was asking about setting header to 0
I don't think I ever asked anything about false
Oh yeah. Set to zero points to the zero row as header row
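A quick way to see the difference between `header=0` and no header row, on a made-up two-row file. Note that `header=None`, not `False`, is the documented value for "this file has no header row".

```python
import io
import pandas as pd

csv_text = "feature1,feature2\n10,20\n30,40\n"

# Row 0 supplies the column names; two data rows remain.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# No header row: columns are numbered 0, 1, ... and the first line becomes data.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist(), with_header.shape)
print(no_header.columns.tolist(), no_header.shape)
```

In the second case the name row is mixed in with the numbers, so both columns end up holding strings; that is one way numeric data gets "read as str".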
ok
how so?
😦
sorry repost :
hello, i would like to know whether ML certificates catch employers' eyes more than someone who didn't get one but has projects to showcase that they know ML?
I am currently pursuing a degree in CS, but school doesn't teach ML, so I'm learning it on my own.
can i send a script using ML, do you want to see?
Maybe with the script, you can get a sense of it, it's very basic, I send it and teach you step by step
Today I was using pandas to import a csv into sqlite3 with ';' as the separator, and noticed that all 2000 records imported except for 3, where one of the columns contained a semicolon and created an extra column. I'm trying to find a way to import the csv data while ignoring the extra semicolon. Any thoughts or ideas?
@strange portal oh sorry, it's just that my question is different, i was wondering if ML certificates are really worth it when it comes to internships
because i already have projects to showcase that i know ML, but there are other interns who have a Coursera certificate in ML
Sorry, I can't help you now, I live in Brazil
but yeah thats cool to see, you can share git repo here
can i share the zipped folder
send dm, ok?
nothing major, but it might help
start with the basic math, and then move up with Q-learning to DQNs
your csv_parser is just a wrapper around pd.read_csv and doesn't actually afford you anything.
using keras?
You don't need any libraries to do basic RL.
(And I recommend you don't and instead start out with simple tabular implementations)
ok, thanks
The book I linked is written by the people that came up with the stuff in the first place (RL, in its modern form).
It's not a very hard read, although it does require some math.
I am using keras and it is training models on my cpu. Is there a way to make it use my gpu?
spank it
Don't overthink it. pd.read_csv reads the data and returns it as a DataFrame
alright. thank you!
btw, i noticed something weird
I've been messing with themes in vscode
and for some reason, vscode thinks iloc is a variable, rather than part of a function, in regards to coloring
it works fine
it just the color
what was that library that was some kind of wrapper around numba, jax, torch, and a few other python-optimizer tools?
transonic
!pypi transonic
im using pytorch ....i have a folder of images that i want to pass to model one by one and get outputs ? custom dataloader just seems too much to do all i want is to pass files one by one to a model
actually, the textMate scope that changes that color is source.python. whatever that means
trying to get a notebook that was written with tensorflow 1 working locally on cuda 11 like: https://cdn.discordapp.com/attachments/777174797934264320/874379048787247224/redditsave.com_from_ttmrs_twitter_russell_after_he_loses_alyxs-g2yjxzkpu0g71.mp4
Hey @lapis sequoia!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
Hello!, does anyone have ideas about data analytics business in pandemic era that has social impacts?
*sorry for broken English, TYSM
Hi, I have a question : whether Recommender System is part of Machine Learning?
yes
analyzing public health data could be good. someone recently here was working on hospital bed occupancy during covid
So this is a classification problem, I fixed the skewed data, scaled the data but still can't get better results what else can I do?
also weirdly enough scaled and non scaled data give me the same results
I think you can do a lot to get better performance. You can take some time to do feature engineering (this might be tedious if you don't have domain knowledge), gather more data, tune the hyperparameters of your models, etc. I also think having a domain understanding of the problem would help a lot.
I can't gather more data so that's not happening, I will try tuning the hyper parameters of the models. Also I have some domain knowledge
Thanks for the suggestions
That’s great having domain knowledge… you can spend time on feature engineering, it would help your models a lot. Cheers 🥂
Has anyone here applied ML to heavy-tail data? I have a few questions
scaling won't affect linear model results, it's only for interpretability. but it also helps with numerical stability and it does have an effect in nonlinear models.
in a linear model, if you scale x up by 10, the parameter just scales down by 10 to compensate
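A sketch of that compensation with plain least squares on synthetic data. This holds for unregularized linear models; penalized ones such as ridge or lasso do react to feature scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept

slope, intercept = fit_line(x, y)
slope_scaled, intercept_scaled = fit_line(10 * x, y)  # same feature, scaled up by 10

# The slope shrinks by the same factor, so the predictions do not change.
pred = slope * x + intercept
pred_scaled = slope_scaled * (10 * x) + intercept_scaled
print(slope, slope_scaled)
print(np.allclose(pred, pred_scaled))
```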
how many classes do you have?
what are the f1 scores, precision, recall, log or brier score, etc.?
9 features
classes, not features
I don't know what you mean by classes, do you mean the categories the model outputs?
yes
those are usually called "classes"
the term "categories" is usually applied to features, as in "categorical features"
2 (water is potable or not)
yea I think the features are making it harder
why do you think that?
tried many models with different hyper params but still can't get the mean score above a certain point
are you expecting higher accuracy? sometimes data is noisy or the features just aren't that tightly related to the outcome
features just aren't that tightly related to the outcome
I believe that's the case
are these ppm values?
not that it matters, maybe there isn't any interesting feature engineering to be done here
different units ppm, μg/L and mg/L
yea not that I can think of any
how is potability determined?
70% out-of-sample accuracy based on those 8 of features actually sounds kind of good, but i'm not a water treatment expert either
mtft is the standard I think
also I think hard water doesn't really make drinking water unsafe
Hi
I have two numpy arrays: data, which has shape (3, n), and points, which has shape (3, m).
Until now, there was a for loop over n which produced the following output:
for i in range(n):
    diff = data[:,i,newaxis] - points # shape is 3,m
meaning, subtract each column n times.
I want to do that in a vectorized fashion, so that the result has shape (n, 3, m).
How can I achieve that?
Minecraft ai bot possible?
Hmm, that 3 is annoying, otherwise it'd be np.subtract.outer.
Oh, you can probably use einsum.
EDIT: ah, sadly no, doesn't support subtraction
you want res[i,j,k] = data[i,j] - points[i,k], I believe, where i is in range(3), j in range(n), k in range(m).
I think you might be able to broadcast data and points both to the output shape and subtract them like that.
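That broadcasting idea, sketched with small random arrays and checked against the original loop. Transposing `data` first puts `n` on the leading axis, which gives the requested `(n, 3, m)` shape.

```python
import numpy as np

n, m = 5, 4
rng = np.random.default_rng(0)
data = rng.normal(size=(3, n))
points = rng.normal(size=(3, m))

# data.T is (n, 3); adding a trailing axis gives (n, 3, 1).
# points with a leading axis is (1, 3, m); together they broadcast to (n, 3, m).
diff = data.T[:, :, np.newaxis] - points[np.newaxis, :, :]

# Reference: the loop from the question, stacked along a new first axis.
expected = np.stack([data[:, i, np.newaxis] - points for i in range(n)])

print(diff.shape)
print(np.allclose(diff, expected))
```

The result holds `n * 3 * m` floats, so for very large `n` and `m` memory may become the limiting factor.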
hello! so i'm trying to make a speech recognition ai (kind of) using the google speech recognition module and it keeps showing this error:
Traceback (most recent call last):
  File "C:\Users\rorop\Desktop\ai.py", line 10, in <module>
    text=r.recognize_google(audio_data)
  File "C:\Users\rorop\Desktop\speech_recognition\__init__.py", line 822, in recognize_google
    assert isinstance(audio_data, AudioData), "``audio_data`` must be audio data"
AssertionError: ``audio_data`` must be audio data
here's my code: ` import speech_recognition as sr
from speech_recognition import AudioFile
r=sr.Recognizer()
audio=AudioFile('vs.wav')
audio_data=r.record
type(audio_data)
text=r.recognize_google(audio_data)
print(text)`
Can anybody help me please?
Thank you
I've been trying to fix this for days
I'm also quite new to python
I hope someone helps soon! 🙂
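A likely cause of that assertion, going by the traceback: `r.record` is assigned without being called, so `audio_data` is a bound method instead of an `AudioData` object. A sketch of the usual pattern with the SpeechRecognition package, wrapped in a hypothetical `transcribe` helper. Running it needs the package installed, a real WAV file, and network access.

```python
def transcribe(wav_path):
    """Read a WAV file and send it to the Google Web Speech API."""
    # Imported inside the function so that defining the helper has no dependencies.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    # AudioFile has to be opened as a context manager before recording from it.
    with sr.AudioFile(wav_path) as source:
        audio_data = recognizer.record(source)  # note the call: record(source)
    return recognizer.recognize_google(audio_data)

# Example (not run here): print(transcribe("vs.wav"))
```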
And how would I do that?
What is the difference between splitting with train_test_split and k-fold cross-validation?
Hi. I want to learn data science. Where to start? I know Python up to average.
Of course, a basic question is what does a data scientist do?
PS:
I hope I have not violated the rules of society regarding my questions :)
Does anyone have idea about onnx operators ?
Hey, there's a dataset I'd like to use, but It's 900+ mb, I don't want it on my local disk. Is there a way to use the dataset for CNN training without having to download it?
do you have the csv link?
https://www.kaggle.com/tawsifurrahman/covid19-radiography-database?select=COVID-19_Radiography_Dataset
this is the dataset.
uh oh
you can just create a kaggle notebook and add the data if you don't want to download the data set
Hello 👋 in Pytorch, can I somehow measure a loss on a subset of a tensor? say I have
pred=[1,2,3,0]
labels=[1,1,3,nan]
Is it possible to tell the loss to only consider the first 3 values, without modifying the tensors? Possibly by passing a mask for those values to ignore ( mask=[1,1,1,0] in the example). I can't filter out the samples to ignore before sending them through the NN because I'd have to modify a large part of the model to do so.
I didn't think of this situation when I prepared the model 
Hi, I have a question, how to evaluate RecommenderSystem?
What kind of recommender system?
Hi there, I am starting to prepare data science seriously from the beginning. And am gonna complete whole data science within 6 to 12 months. If there is anyone who would like me join me, it would be great as we can study together. 🙂
what are you going to use as learning resource?
I have one hot encoded data and I want to revert this to just digits as in an 1D array with digits [0, 0, 0, 2, 5, 9...]
how can I achieve this?
do I check if the value is one and return it? I think there is probably more efficient way
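If each row has exactly one 1, `argmax` along the column axis recovers the digit directly, without a Python loop. A sketch on a tiny made-up array (3 classes instead of 10):

```python
import numpy as np

one_hot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# Index of the largest value in each row, i.e. the position of the 1.
digits = one_hot.argmax(axis=1)
print(digits)
```

With a DataFrame, `df.to_numpy().argmax(axis=1)` does the same. This assumes the column position matches the label, which may not hold for every dataset.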
oh, never used a kaggle notebook before, thanks 👍 .
did you use the sklearn.preprocessing.OneHotEncoder class?
nope the data was in the shape of (2062, 10)
so I made it a dataframe and it looks like one hot encoded
I am doing this right now
In xgboost, how is it determining which feature will be the root of the tree? I have a few continuous variables and many categorical which I one hot encoded
a decision tree uses the same splitting algorithm at every node, including the first/root node
the basic algorithm is on page 3 of https://arxiv.org/abs/1603.02754
there are other split-finding algorithms implemented in xgboost, but the others are all just approximations to the exact algorithm
like Gini impurity or gain? or is that unrelated to determine which to split at
think of a node in a decision tree as containing data points, not as containing a split on a feature. the edges between nodes are the splits.
yep, that's it. the algorithm just finds the split point with the greatest gain for each feature, then splits on the best feature
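A rough sketch of that search for a single numeric feature, using the gain formula from the linked paper with per-row gradients `g` and hessians `h`. The function name `best_split` and the toy data are made up, and real XGBoost layers many optimizations on top of this.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search: try every boundary between sorted values of one feature."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    g_total, h_total = g.sum(), h.sum()
    g_left = h_left = 0.0
    best_gain, best_threshold = 0.0, None
    for i in range(len(x) - 1):
        g_left += g[i]
        h_left += h[i]
        if x[i] == x[i + 1]:
            continue  # cannot split between identical values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (
            g_left**2 / (h_left + lam)
            + g_right**2 / (h_right + lam)
            - g_total**2 / (h_total + lam)
        ) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_threshold

# Toy data: squared-error loss with an initial prediction of 0.5, so g = pred - y and h = 1.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
g = 0.5 - y
h = np.ones_like(y)

gain, threshold = best_split(x, g, h)
print(gain, threshold)
```

Running this per feature and keeping the feature with the largest gain gives the split for one node; the same procedure then repeats on each child's subset of rows.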
here's the first tree of my model
I keep getting
```
ValueError: Expected 2D array, got 1D array instead:
array=[6 9 3 9 0 5 8 2 5 9 4 9 7 1 3 3 0 5 0 7 0 8 3 6 9 2 7 3 5 9 8 5 4 6 4 6 3
1 9 2 7 7 3 1 1 2 0 7 8 9 1 9 6 2 1 0 6 8 2 8 8 7 2 7 5 9 2 3 6 4 1 1 5 7
4 9 9 4 3 8 8 9 2 0 9 0 0 4 1 5 5 4 7 4 7 4 2 2 8 7 2 0 9 0 2 1 7 8 8 7 2
8 3 3 2 2 6 1 5 5 5 0 1 5 8 2 6 5 1 0 3 1 9 9 8 3 8 9 2 2 2 6 2 6 6 1 6 2
5 4 9 2 1 2 6 2 6 6 1 1 7 5 9 8 6 2 4 7 6 9 8 7 2 9 1 6 7 6 0 6 1 7 4 8 4
3 2 2 4 2 8 6 8 3 2 0 8 8 8 5 4 7 0 8 2 4 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
can anyone tell why? I tried many things but still end up with this
ahhh so it's gain, ok
shapes
and it will compare gains of various features, and if it's a continuous features, it will find which split in the continuous feature has the highest gain
and use the largest gain as splitting point
i think that visualization is just a little weird. unless i'm really badly misunderstanding something, NTON<585 is supposed to be the splitting criterion after the first node. it is not "the" first node as such.
show the code that caused this
the first node would contain all the data right, I'd call that the 0th node, so it decided to split using NTON at the first
and then it's the same for further nodes, where it will compare gain from each feature using the new subset of data points?
clf1 = LogisticRegression(random_state=42).fit(X_train, y_train.values.ravel())
y_pred1 = clf1.predict(X_test)
clf1.score(y_test, y_pred1)
if I make y a list then also it gives the same thing
on and on until it reaches max depth, reaches minimum data points in the node, or gets pruned based on what regularization I use
what is y_train, a Series?
.values is discouraged these days, .to_numpy() is the recommended way
def get_digit(row):
    for c in df.columns:
        if row[c]==1:
            return c
y = df.apply(get_digit, axis=1)
y = np.array(y)
that's... weird
what does this df look like
oh, the one-hot encoded data you posted in the screenshot above?
I have two npy files
yea that's the y
X had 3 dimensions so I reduced it to two by X.reshape(2062, 64*64)
what's the task?
Sign Language digits classification
image?
nope npy files
did
y doesn't need to be 2D does it?
(2062,) is the dimension
LogisticRegression y should only be 1d actually
yea
x has to be 2d
I don't know why it's giving me this error
i don't think they even support multiclass-onehot or multilabel
because its an image
I tried doing it with digits dataset of sklearn and it worked but doesn't wanna work with this data
BTW what is there in the npy files? why dont they give images in jpegs?
X should probably have shape like (n_images, image_height * image_width), no?
yes
image is 64 x 64
This is the set https://www.kaggle.com/ardamavi/sign-language-digits-dataset
Image size: 64x64
Color space: Grayscale
it should be 3D?
it looks like they're flattening it to 2d
(img_height, img_width, channel)
it was 3D I flattened to 2D
reshape
X = X.reshape(2062, 64 * 64)
X.shape
This is the image
this is when X has 3 dimensions but you can't provide array with 3 dims to LogisticRegression
Hi ! What prerequisites do I need to get started in ml ?
Good knowledge of pandas will really help
other than that it's just curiosity and decent reading comprehension lol
what's the output of X.shape
because that's not what you are passing
.
clf1 = LogisticRegression(random_state=42).fit(X_train, y_train.values.ravel())
y_pred1 = clf1.predict(X_test)
clf1.score(y_test, y_pred1)
you are obviously doing some more processing, since you don't pass X?
@undone flare
code: https://paste.pythondiscord.com/kigelewune.py
output: https://paste.pythondiscord.com/okafexanum.yaml
curious what's considered "good" out-of-sample accuracy on this problem, i assume it's in the high 90s
https://www.kaggle.com/esercicek/sign-language-with-cnn low 90s with a basic CNN, seems reasonable
What was I doing wrong tho? The dimensions were wrong?
not sure. take a look at how i did it maybe and compare
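One likely culprit in the snippet above is its last line. `score` expects the features and the true labels, as in `score(X, y)`, and runs the prediction itself, so passing `y_test` first hands it a 1D array where a 2D feature matrix is expected. A sketch on small synthetic data (not the sign-language set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Two equivalent ways to get accuracy.
score_from_model = clf.score(X_test, y_test)
score_from_preds = accuracy_score(y_test, y_pred)
print(score_from_model, score_from_preds)

# The call from the chat, for comparison: a 1D array in the X position is rejected.
try:
    clf.score(y_test, y_pred)
except ValueError as exc:
    print("ValueError:", str(exc).splitlines()[0])
```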
Hi, I'm looking at coding to implement an AR model for time series and in this image, can anyone tell me what does the -100 signify in the code
train_data = df['Consumption'][:len(df)-100]
that's a sloppy way of saying "everything but the last 100 elements". :n means "take elements until n", and they're using len(df)-100 as the "n". but this is bad style, they should have written
train_data = df['Consumption'].iloc[:-100]
test_data = df['Consumption'].iloc[-100:]
which is equivalent
indexing or slicing with a negative number means "count from the end"
Can you tell what are the shapes of x and y? I am on mobile right now
!eval @raw temple
```python
import numpy as np
import pandas as pd
x_py = list(range(10))
x_np = np.array(x_py)
x_pd = pd.Series(x_np, index=list('abcdefghij'))
print(x_py[:-3])
print()
print(x_np[:-3])
print()
print(x_pd.iloc[:-3])
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | [0, 1, 2, 3, 4, 5, 6]
002 |
003 | [0 1 2 3 4 5 6]
004 |
005 | a 0
006 | b 1
007 | c 2
008 | d 3
009 | e 4
010 | f 5
011 | g 6
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/vorubegiya.txt?noredirect
In [92]: x.shape
Out[92]: (2062, 4096)
In [93]: y.shape
Out[93]: (2062,)
Hmm I have the same shapes
In [95]: x.iloc[:10,:5]
Out[95]:
pixel 0,0 0,1 0,2 0,3 0,4
image_num
0 0.466667 0.474510 0.478431 0.482353 0.486275
1 0.596078 0.607843 0.619608 0.631373 0.643137
2 0.588235 0.603922 0.619608 0.631373 0.643137
3 0.556863 0.568627 0.584314 0.600000 0.611765
4 0.580392 0.576471 0.592157 0.607843 0.615686
5 0.517647 0.529412 0.552941 0.615686 0.635294
6 0.427451 0.439216 0.454902 0.474510 0.490196
7 0.564706 0.576471 0.588235 0.603922 0.615686
8 0.498039 0.509804 0.521569 0.537255 0.549020
9 0.501961 0.517647 0.533333 0.545098 0.564706
@desert oar thanks for your help. so if I only have 55 points of data, then would I just change the value to 50?
or maybe -15? as I want all points included until the last 15?
is that what it means?
:-15 seems reasonable, yes
great! thanks so much for your help
Hi everyone, pretty random question on neural networks. I saw sample code on Towards Data Science where they have model.add(Dense(64, activation=tf.nn.relu, kernel_initializer='uniform', input_dim = input_dim)) # fully-connected layer with 64 hidden units, where 64 is the number of layers.
Why do they choose 64 layers? Does it hurt if I choose more layers than that?
I think you mean the number of nodes?
64 is definitely the number of nodes
There are general rules on how to determine the number of nodes, and layers. Here's an explanation:
https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
Ah, seems like sensible rules of thumb.
I would still consider experimenting on a grid or random-search basis though
thanks yall
how do you guys define class imbalance?
I have 10 classes of observations
max class has 1100 observations
min class has 750 observations
should I consider oversampling or keep it as it is?
I want to plot 2 classes with barely any variance but a big interclass difference
Any way to do this? Boxplots fail miserably
what are you trying to show? maybe you just want a table
Loading times
i wouldn't worry about this difference
left with precompilation, right without
A table could work but Id like to have a nice looking figure
and a logarithmic axis is definitely not nice, sadly
make 3 plots:
1) average loading time as a bar chart with some kind of error bar showing that the errors are small relative to the difference between groups
2-3) kernel density plot or histogram for each group, separately, or faceted together so you can compare the distributions without worrying about scale
I feel like I should just summarize the results textually..
and yeah, it's almost never bad to include a table with some combination of mean, std dev, median, min, max, 25%, 75%, 10%, 90%
(in general i wouldn't recommend using tuples for "array-like" things)
!eval ```python
import pandas as pd
data = pd.DataFrame({
    'without_precomp': [90, 91, 92],
    'with_precomp': [10, 11, 12],
})
def p25(x):
    return x.quantile(0.25)
def p75(x):
    return x.quantile(0.75)
table = data.agg([
    'mean',
    'std',
    'min',
    p25,
    'median',
    p75,
    'max',
]).transpose()
print(table)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | mean std min p25 median p75 max
002 | without_precomp 91.0 1.0 90.0 90.5 91.0 91.5 92.0
003 | with_precomp 11.0 1.0 10.0 10.5 11.0 11.5 12.0
Hey guys, I stumbled upon this problem: I use a Deep Q algorithm (reinforcement learning) and, as you can see, my outputs (actions) are the same for every input, which is very bad. Can someone help me identify the cause, or name the problem I'm facing? Is it due to the learning rate? The loss function? Too small a model? I tried Huber and MAE loss functions, but nothing works in the long run...
does anyone know a doc or guide that explains the installation process of keras and tensorflow in anaconda on linux? I'm facing a few problems
or, if you're willing, provide all the lines of code here?
@desert oar sorry for the ping, but regarding my loading-time testing:
With the std calculated, can I use e.g., chi² to say that "std is not relevant so the difference between mean_with and mean_without is equal to the compilation time"?
i'm not sure you need any kind of test for that, if the difference is that big
Ok, just want to make sure my prof doesnt bicker at me
But I think thats a valid argument
has anyone done RNN, CNN or ConvLSTM? I need some help with how to set it up. I have my optimal grid search parameters and epoch count, but idk how to use those deep learning models
can someone explain this 2 rows
input number 31, 32
hahaha, thanks but i meant like.. what does data = [trace] and line 32 mean?
Anyone here have a lot of experience with Spark?
you should just ask your question.
Well I have to pull a lot of data in from an api, and I have to use a ton of separate requests due to the API limitations. I want to write all the data to a parquet file, and I think the way to do that is each time I do an api call I append the returned data to a spark dataframe that gets written into a parquet file at the end. I'm new to spark so I'm pretty sure I should be using partitions to write the file since the dataframe will be too large to fit in memory. I guess I'm just wondering if I have the idea right, and also wondering if there is a fast way to make concurrent api calls with Spark since I know it's basically designed for big data ingestion
Everything I can find is all about doing one api call or working with one JSON file that already exists
So I'm having trouble visualizing how the pieces go together in my situation
I have to pay for the API access so I'm trying to get this all sorted out without having access to the data
concurrent API calls is a separate thing entirely.
in general
for that kind of stuff
you want to do async IO
as for the partitioning...
Spark will do that for you
(basically)
do you have a cluster or what?
No it's all local, but I have probably about 20-30 gigs of data at least
and I will probably move it to a cloud service eventually
So am I right about just adding everything to a dataframe?
Should I even be using Spark? I do want to use the parquet format that's why I'm looking at it
Like I just don't understand how it goes about it. In my head I picture the data getting added to the dataframe as it comes in after each api call, then once the dataframe reaches a certain size it writes it to a parquet file... and then what? it basically starts a new dataframe and once that reaches the same size it appends it to the same parquet file?
It's stock data, so I have to loop over a list of tickers and do multiple api calls for each ticker, and there are thousands of tickers.
either that or some other distributed thing
like dask
what format is the data in
when it comes in?
json
I would say
store it in memory first
when you hit a certain limit
write that to disk
then
you'll have multiple dataframes
once you're done with the API
concatenate them
ALTERNATIVELY
you can use spark-streaming
I haven't
but I'm fairly sure
it would work here?
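A minimal sketch of the "async IO, buffer in memory, write to disk at a limit" idea from the messages above. `fetch_ticker` is a hypothetical stand-in for the real API call, and the row layout, file names, and chunk size are assumptions made only for illustration.

```python
import asyncio


async def fetch_ticker(ticker):
    """Hypothetical stand-in for one API request; returns a list of row dicts."""
    await asyncio.sleep(0)  # a real version would await an HTTP call here
    return [{"ticker": ticker, "day": d, "close": 100.0 + d} for d in range(3)]


async def collect(tickers, flush, max_concurrent=5, chunk_rows=1000):
    """Fetch all tickers concurrently and hand rows to `flush` in chunks."""
    limit = asyncio.Semaphore(max_concurrent)  # caps the requests in flight

    async def guarded(ticker):
        async with limit:
            return await fetch_ticker(ticker)

    tasks = [asyncio.create_task(guarded(t)) for t in tickers]
    buffer, part = [], 0
    for finished in asyncio.as_completed(tasks):
        buffer.extend(await finished)
        if len(buffer) >= chunk_rows:  # memory limit reached: write and reset
            flush(buffer, part)
            buffer, part = [], part + 1
    if buffer:  # write whatever is left at the end
        flush(buffer, part)
        part += 1
    return part


def write_part(rows, part):
    """Write one chunk as its own parquet file (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(f"prices_part_{part:05d}.parquet", index=False)


# Demo with an in-memory flush so nothing is written to disk.
chunks = []
n_parts = asyncio.run(
    collect(["AAPL", "MSFT", "GOOG"], lambda rows, part: chunks.append(rows), chunk_rows=5)
)
print(n_parts, [len(c) for c in chunks])
```

Passing `write_part` as `flush` produces a set of part files instead of one appended file. Spark and Dask can both read a directory of parquet parts as a single dataset, so a final concatenation step may not be needed.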
Would Cuda version 11.4 work with the newest version of Tensorflow?
for what OS?
windows
the website says 11.2
but i cant get it
because the version is too old for my gpu
so im asking if anyone has tried using 11.4
did you get an error message?
can you give the link for this page?
yes, you did get an error message? if so, show.
if you're asking for help that's in any way related to an error message, always show the error message.
will do in the future, sorry about that
no problem
Please do text next time.
i cant copy paste it
Anyway, try installing and running tensorflow and see what happens.
but it hasnt installed yet
did you try to install it?
i cant
what happened when you tried?
that's not how you install tensorflow
i have tensorflow installed. im talking about cuda
i cant install cuda because of the error message
try doing something with tensorflow so we see what error message you get from tensorflow.
like python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
can you show the warnings?
2021-08-11 19:50:14.363291: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-08-11 19:50:14.364295: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2021-08-11 19:50:14.365360: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2021-08-11 19:50:14.366408: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2021-08-11 19:50:14.367435: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2021-08-11 19:50:14.368440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found
2021-08-11 19:50:14.369441: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2021-08-11 19:50:14.370436: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-08-11 19:50:14.370627: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-08-11 19:50:14.473931: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
grep : The term 'grep' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:14
+ pip freeze | grep tensorflow
+              ~~~~
    + CategoryInfo          : ObjectNotFound: (grep:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
like in the terminal?
yes but I guess windows doesn't have grep
do you know the three-number version number of tensorflow that you have?
yeah, ive never heard of grep before
should be like x.y.z
hold on
if you do pip freeze I guess you can just look for tensorflow
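Since `grep` is not available in PowerShell, one alternative is to ask Python directly. This is a small sketch using the standard library; the helper name `installed_version` is made up for the example. On the command line, `pip show tensorflow` or `pip freeze | findstr tensorflow` should give the same information.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("tensorflow"))
```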
2.6.0
@quiet vault
bruh
!mute 314448333739524096 investigating
:incoming_envelope: :ok_hand: applied mute to @ebon walrus until <t:1628733351:f> (59 minutes and 59 seconds).
sure, how do i make the version go down?
lol
looks to me like 2.5.0 is compatible with cuda 11.2
which I realize doesn't solve your problem
yeah
i just want to get cuda in the first place
should i just try to use 2.6.0 tensorflow with 11.4 cuda
see what happens
I thought we already tried that and it didn't work
it's unlikely that anyone will have tried what you did with the exact same OS, tensorflow version, cuda version, and remember that they did it with those exact versions.
Truly a pioneer 

Well
My power went out
Right before installation finished
Hope that didn’t fuck anything up
worst case, you just delete that virtual environment. it's fine.
I mean I suppose there are worse cases but suffice to say you won't have any data loss.
How is this for my first time making a AI stock predictor?
or better yet, how did you train it lol
Its data that it hasn't seen before
K what's your bitcoin wallet I'd like to buy your code
I made my own dataset with 60 parameters
Sorry but i wont sell it
Lol I'm kidding anyway
But yeah if that's legit you should continue testing and use it if it works into the future
thats sick
Yeah it was a hassle and it took me like 5 days to make
5 days only??
Yeah
so is it based on training or does it look at parameters and make predictions as time goes on
It takes some closing prices from the present closing price and the closing 2 days before and then predicts the next days price
ooo
I just need to make it scalable for some bots i will make
yeah, it looks promising already so thats awesome
Thanks.
So what's apple gonna be tomorrow?
I havent tested it on present data yet because i am sceptical of my code
did u use a cnn?
Yeah good to have a skeptical attitude haha
Ann
Thanks
When you say 60 parameters do you mean stuff like the weather and other stock data and things?
one more question from me
Ok
How did u make this graph? Did you use walk forward validation?
two questions i guess
??
And you're sure you haven't fed it like the next days google stock price to predict the same day's apple price? (I.e. giving it future data)
dog in the fog be getting interrogated rn
lol
I mean if someone makes an accurate stock predictor that's like a billion dollar tool so...
I worked at a finance company and they basically laughed at the idea that was even possible
I just need to edit something in my code one sec and i will come back with a new graph. Just to make sure
Just by the way, I was working on something similar and I thought I had the perfect model with these results:
but
i realized that tensorflow was reusing backend graphs which meant it was cheating kind of
I have no idea what i am looking at. I just watched some tutorials and then made this from scratch
Funny my model does so well on data it has seen before...
lol
yeah
well for me this was not surprising
i was using a basic neural network
just like 2 dense layers
well
@quiet vault did you figure out your thing?

i came here originally to ask a question but i got distracted
the power came back
and then i did everything and i think it work
You had the power all along bb
i didnt but ok
No I mean
You have the power.
Like as a person
oh
lol
anyway
2021-08-11 21:49:13.705324: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-11 21:49:14.151069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3993 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660, pci bus id: 0000:01:00.0, compute capability: 7.5
2021-08-11 21:49:14.333000: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-11 21:49:15.356801: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8202
this comes up when i import tensorflow
is everything ok
Oh I'm going to sleep but hopefully someone knows
If you google that error "None of the MLIR Optimization Passes are enabled" it says it's fine
So...
That's exactly the part I wanted to single out but I can't select text on mobile
Thanks 
This means that i am the first person to make tensorflow 2.6.0 work with cuda 11.4
This is gonna take awhile to train
Are you using a local GPU or colab or AWS or something?
Colab
Statistics question here. I was thinking about architecting a code running/code contest system where students submit solutions to various problems, and I wanted to calculate how many cores I'd need to allocate for a particular contest.
Let's say that I have S students solving problems concurrently. This is a perfect system, so each student's workflow is like this: they are working on the code for I seconds, then they submit the code for testing and wait for the testing system to run all the test. After the student gets the results, they start the next iteration and so on.
Each problem has M tests in it that it runs (a test consists of running a program with a certain input and checking that the output matches the predefined one). Each test takes T_avg on average and T_worst in the worst case to run (there's a hard upper limit, but many submissions will run very quickly). I have C cores at my disposal, in other words I can run C total tests in parallel.
If I'm willing to accept that students will wait for X seconds for the results of the test, how many cores (C) do I need?
Why not simulate it to get a pretty solid estimation
I mean it's definitely a solvable stats problem, but it's pretty trivial to run a simulation
@desert oar The code which I have: https://hastebin.com/ajidowofir.py
Output: https://hastebin.com/sedosoposa.yaml
The commented lines in code gives the reshape error same as yesterday
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
However it works on your model
like
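import numpy as np
import scipy.stats
# HalvingRandomSearchCV is experimental, so this import has to come first
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingRandomSearchCV

base_model = LogisticRegression(max_iter=100)
legacy_random_state = np.random.RandomState()
tune_model = HalvingRandomSearchCV(
    base_model,
    param_distributions={
        "C": scipy.stats.expon(scale=100),
        "class_weight": ["balanced", None],
    },
    cv=3,
    n_jobs=3,
    random_state=legacy_random_state,
    verbose=1,
)
tune_model.fit(X_train, y_train)
pred_train = tune_model.predict(X_train)
score_train = tune_model.score(X_train, y_train)
pred_test = tune_model.predict(X_test)
score_test = tune_model.score(X_test, y_test)
print("Train:", score_train)
print("Test:", score_test)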
so I don't even know what I am doing wrong
import random
import scipy.stats as stats
from collections import Counter
S = 100 #students
I = 200 #time per problem
sI = 30 #std.dev of I
maxI = 3600 #maximum time
minI = 60 #minimum time
M = 10 #number of problems
T = 3 #typical time
Tf = 30 #timeout time
pTf = 3 #percent chance the student's code times out
a, b = minI, maxI
mu, sigma = I, sI
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
max_cores = []
for i in range(100):
    all_students = []
    for student in range(S):
        in_use = []  # seconds during which this student occupies a core
        for problem in range(M):
            problem_time = T
            if random.randrange(100) < pTf:  # exactly pTf percent of the time
                problem_time = Tf
            solve_time = int(dist.rvs())
            if in_use==[]:
                in_use+=(list(range(solve_time,solve_time+problem_time)))
            else:
                in_use+=(list(range(in_use[-1]+solve_time,in_use[-1]+solve_time+problem_time)))
        all_students+=in_use
    # busiest second of this run = cores needed to never queue
    max_cores.append(Counter(all_students).most_common(1)[0][1])
print(max_cores)
PAINFULLY slow and not optimized, but it works!
hm... I guess that works if I can make a big lookup table, thanks
downloading scipy
That gives the max cores to never have a collision, if you have a queue or something it would change the complexity
Also I think there is a way to have more python instances than cores, just that everyone's code would slow down a bit
100 cases tho rite? ._.
Well it's better to be sharing a core with someone in an infinite while loop than to be queuing behind them
but it will have the same throughput, or maybe a lower throughput, right?
Yes
But like if 10 people use up all 10 available cores, for example, and are all running while loops that last forever, I'll never get the chance to run something as simple as print("hello")
But if I have an available thread on a shared cpu, then my print("hello") will take one nanosecond longer to run, but I won't have to wait forever to start it
the issue is, I'll have several machines, not one machine
Yeah, I'm not really sure how you'd route traffic effectively
The numbers are something like:
maximum test time = 5s
test cases per problem = 20
students = 1000 * N
1000 * N ? Like you have multiple thousands of students?
yeah, that's the theoretical idea
I'm not making anything practical yet, just wondering how many servers that would need
So now the question becomes, how many cores do you need to run my poorly optimized code for that many students
Lol
Haha
Honestly you might be able to generalize the simulation results to an approximation
Not sure if my logic is correct, but in a perfect world
```
N = number of students
i = iteration time
t = time waiting in queue
c = number of cores
r = time to run a test
M = tests per problem
```
The system can process `capacity = c/r` tests per second, while students can submit `throughput = N * M / (i + t)` tests per second
`c/r = N * M / (i + t)`
`c = N * M * r / (i + t)`
So if
```py
N = 1000
i = 300 # 5 minutes
t = 10 # 10 s wait on average
r = 5 # 5 s per test
M = 20
```
then I need 323 cores (ouch)
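The back-of-the-envelope formula `c = N * M * r / (i + t)` can be wrapped in a small helper to try other numbers. This is only a sketch of the steady-state balance described above; the function and argument names are made up for the example, and it ignores bursts and queueing effects.

```python
import math


def cores_needed(n_students, tests_per_problem, test_seconds,
                 iteration_seconds, wait_seconds):
    """Cores needed so that capacity (c / r) matches demand (N * M / (i + t))."""
    # demand in tests per second across all students
    demand = n_students * tests_per_problem / (iteration_seconds + wait_seconds)
    return math.ceil(demand * test_seconds)


print(cores_needed(1000, 20, 5, 300, 10))  # 323
```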
I want to ask about Keras loss functions. There is categorical cross-entropy and sparse categorical cross-entropy. What is the difference between them, and which one should I use if the data are slightly imbalanced (2:1)?
Was 323 what you calculated theoretically?
yeah, from these numbers on my ""model""
The simulation predicts a very similar number, so your theory is correct!

-> [310, 293, 280, 293, 318, 309, 294, 323, 297, 301]
Weird it actually has 323 as one of the values
Does this answer it:
https://stackoverflow.com/questions/58565394/what-is-the-difference-between-sparse-categorical-crossentropy-and-categorical-c
why is it giving such an error? can someone help me?
you're trying to convert the string 'labels' into an integer, which doesn't work
what is the line where that error shows up?
This vector (a) is dense: [0.1, 0.5, 0.3, 1.0, 0.8, 0.6, 0.1, 0.7]
This vector (b) is sparse: [0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 1.0]
The vector b can be compressed. It can instead be stored like this:
nnz = [0.4, 1.0]
indices = [4, 7]
Where "nnz" is the non-zero values, and indices is the indices of those non-zero values.
Consider adding b to a (changing a in-place). When using the non-compressed form of b, there are n iterations, where n is the length of the two vectors (8 iterations).
Now consider adding b to a, but this time using the compressed form of b. Adding the zero values from b to a is pointless as it leaves those values unchanged.
So there are m iterations where m is the number of non-zero values in b (2 iterations). This provides a very large speedup with large enough sizes and if b is sparse enough (e.g. 1% non-zero).
In addition, since only the non-zero values of b are stored and b has mostly zero values, there is a large reduction in memory usage for b.
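A small sketch of the in-place addition described above, using the same `a`, `nnz`, and `indices`. The helper name `add_sparse_inplace` is made up for the example.

```python
a = [0.1, 0.5, 0.3, 1.0, 0.8, 0.6, 0.1, 0.7]
nnz = [0.4, 1.0]      # non-zero values of b
indices = [4, 7]      # positions of those values in b


def add_sparse_inplace(dense, values, positions):
    """Add a compressed sparse vector to a dense list, touching only non-zero slots."""
    for value, position in zip(values, positions):
        dense[position] += value  # m iterations instead of n
    return dense


add_sparse_inplace(a, nnz, indices)
print(a)
```

Only two additions are performed here, one per stored value, instead of eight.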
Now consider having categories of color: red, blue, green.
Each category can be stored as one-hot encoding:
red = [1, 0, 0]
blue = [0, 1, 0]
green = [0, 0, 1]
Each of these can be stored in a compressed form. While for only 3 categories and batch size 1 this does not provide much benefit, for larger sizes there are gains to be had.
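To connect this to the loss question: the "compressed form" of a one-hot label is just the class index. The NumPy sketch below (the probabilities are made-up numbers) shows that cross-entropy computed from one-hot labels and from index labels gives the same values, which is the only difference between the categorical and sparse categorical variants.

```python
import numpy as np

class_names = ["red", "blue", "green"]
index_labels = np.array([0, 2, 1, 0])                     # compressed ("sparse") labels
one_hot_labels = np.eye(len(class_names))[index_labels]   # same labels, one-hot encoded

# Made-up predicted probabilities, one row per sample
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.90, 0.05, 0.05],
])

# Categorical cross-entropy: multiply by the one-hot row and sum
cce = -np.sum(one_hot_labels * np.log(probs), axis=1)
# Sparse categorical cross-entropy: pick the probability at the label index
scce = -np.log(probs[np.arange(len(index_labels)), index_labels])

print(np.allclose(cce, scce))  # True
```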
My classes are coded 0 and 1, with a 2:1 proportion of class 0 to class 1. With the sparse loss I use 2 outputs in the final dense layer, and the accuracy is higher than with the dense one. Of course, in the dense one I only use 1 output, and in the sparse one there are 2
You are doing binary classification?
Its binary but i hot encode my class as 0 and 1 (in integer)
If it's binary classification there is no need for one-hot encoding.
It's just one output, true or false
I.m sorry to clarify the last statement, I mean from the source data, the csv dataset, the class is set on 0 and 1 in integer (although class 0 and 1 are come from the different csv file)
So you have multiple classes and you have one-hot encoded each?
yup
what is this "proportion" then?
2:1
of what
2 from class 0 and 1 from class 1
so if theres 14 class 0 so there would be 12 for class 1
I thought there was more than 2 classes.
so the case is there is normal condition data which labeled as 1, and sick one as 0. There are 2 rows different between class 0 and 1 which class 0 bigger than 1 at proportion
What do you mean by class? Category? Sub-category? Or something else.
its category
So the proportion is the amount of each class in the dataset (labelled)?
there is only one dataset (combine of class 0 and 1 in separated CSV), which has 40 class 0 or sick and 36 class 1 or healthy
So it's binary classification.
Two categories, sick and not sick.
True/False
Use binary cross-entropy loss.
thanks @iron basalt now i know how to do. But there is small issue... about the net i used and the class to feed the CNN. I read since i have two class, i need to define my net
model = Sequential([
    InputLayer(input_shape=(f_leng,1)),
    Conv1D(filters=128,kernel_size=3,activation='relu'),
    Conv1D(filters=128,kernel_size=5,activation='relu'),
    MaxPool1D(pool_size=10),
    Conv1D(filters=256,kernel_size=3,activation='relu'),
    Conv1D(filters=256,kernel_size=5,activation='relu'),
    MaxPool1D(pool_size=2),
    Dropout(rate=0.3),
    Flatten(),
    Dense(2,activation='softmax')
])
model.compile(loss = tf.keras.losses.BinaryCrossentropy(),optimizer='Adam', metrics=["accuracy"])
like this, since using Dense(1,activation='softmax') gives bad results (it always fails to classify the 2nd class). But when I feed the labels with
trainY = trainY.reshape(trainY.shape[0], 1, 1) since trainY is a 1D array
it always gives the error ValueError: logits and labels must have the same shape ((None, 2) vs (None, 1)). This error does not happen when I use sparse categorical
line 24, error in loop
sparse categorical uses index labels with one hot outputs, categorical uses onehot labels with onehot outputs
i'm assuming from the error that you have index labels so you need to use sparse categorical
can you give the full traceback?
now i'm using binary
if your labels are 0 or 1, then that would be considered index labels
and my class is 0 and 1... which arranged in 1D array
one hot would be [1, 0] or [0, 1]
ow so thats the problem
is it okay if using sparse categorical
for index labels things?
yes
you can't use categorical with index labels anyways
You can either have 2 outputs with softmax, or 1 output with sigmoid, both should work. I prefer the sigmoid route because it's less computation being done.
For the 2 output softmax version, sparse can be used but you won't really gain anything from it.
As for keras API specific stuff, idk, I have not used it in a long time.
yeah i was discussing it from the point of view that you don't wanna change the model, if you want to change the model to use sigmoid then you can change the loss function as well
To explain the difference between the two, with softmax you would get something like [0.3, 0.7] as output (probability of each class), and with sigmoid you would just get output of like [0.7]
You know the probability of the other class is just 1.0 - 0.7
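A NumPy sketch of the two options, with a made-up logit chosen so the numbers match the 0.3 / 0.7 example. It also shows why a single-unit softmax layer, as in Dense(1, activation='softmax'), cannot work: softmax over one value is always 1.0.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


z = np.log(0.7 / 0.3)                     # logit that corresponds to probability 0.7

p_sigmoid = sigmoid(z)                    # one output: P(class 1)
p_softmax = softmax(np.array([0.0, z]))   # two outputs: [P(class 0), P(class 1)]
p_single = softmax(np.array([z]))         # softmax over one unit

print(p_sigmoid, p_softmax, p_single)
```

With the first logit fixed at 0, the second softmax output equals the sigmoid of `z`, so the two setups carry the same information for a binary problem.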
in most cases people usually go the sigmoid route for binary stuff and softmax for more than that
so for binary one, its pretty costly in performance with softmax since the sigmoid one are enough
Softmax is for when you have 3 or more because then it can give you something like [0.1, 0.2, 0.7]
^
Sigmoid does not work for that
technically it can work but it will give weird results so its usually just avoided
Yeah, but when I say work, I mean also work well. A lot of things "work" in ML
yeah
Technically you could have 3 sigmoid outputs and get away with it on simple stuff.
because softmax will always have the results add up to 1, while sigmoid they will not
I have a sentiment analysis dataset with 2 columns, comment and label (POS, NEG, NEU). Then I encode label to labels (2, 0, 1) and I use a CNN model to classify them.
so softmax will be like the probabilities of each class, which will add up to 1
it's define function train model
by traceback i mean the error output
should i convert again the index labels or not? i directly feed the class without further encoding to the model
The classes are 0 and 1 so they are directly the targets in the case of using 1 sigmoid output.
They act as the probabilities
and now it works, while waiting for the result...
probability of sick is 1 or 0
the proba for sick is 0 and health is 1
so far its work with sigmoid and sparse one
actually in this project i combine signal processing since the source data is an ECG record. Why i store it as CSV because its the proper way to represent the data
Hello everyone,
I have a question regarding regex.
I'd like to check whether a variable follows a certain pattern before converting it from str to date.
The pattern is this one:
2022-05-20 13:21:29
I can't manage to write the regex needed to check it.
How would you proceed?
Thanks.
I tried this:
\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}
oups nevermind the problem was not coming from my regex it works 👍
why do you need to validate dates manually?
Because I have some unappropriate string which are not following the correct format
I realised it when attempting to mutate str to date
but Wanted to extract a list of all the non appropriate formats
Do you know a better way to proceed?
why not just use strptime in a loop
with try-except
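A minimal sketch of the strptime-in-a-loop suggestion. The helper name and the sample strings are made up; the format string matches the pattern shown earlier.

```python
from datetime import datetime


def find_bad_timestamps(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return every value that cannot be parsed with the given format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except (ValueError, TypeError):  # wrong format, impossible date, or not a string
            bad.append(value)
    return bad


samples = ["2022-05-20 13:21:29", "2022-13-40 99:99:99", "20/05/2022", "2022-05-20"]
print(find_bad_timestamps(samples))
# ['2022-13-40 99:99:99', '20/05/2022', '2022-05-20']
```

Note that "2022-13-40 99:99:99" matches the regex `\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}` but is still rejected here, because strptime also checks that the month, day, and time values are valid.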
I don't know this, I m gonna check it 🙂 thanks
This is great, how if I use assert instead of a try-except? Would I see the incorrect format appearing?
Not sure if I should ping, but pinging anyway.
You can try
https://en.wikipedia.org/wiki/M/M/c_queue
If you assume times are Markovian,
else
https://en.wikipedia.org/wiki/M/G/k_queue
(which is less likely you get analytical solutions)
There's also G/M/k queues, which should relate to M/G/k queues (but don't ask me how)
G/G/k sounds very hopeless, you might want to approximate (IIRC - heavy traffic approximation might make the problem analytically easier), or even just use the raw M/M/k queues for approximation.
In queueing theory, a discipline within the mathematical theory of probability, the M/M/c queue (or Erlang–C model) is a multi-server queueing model. In Kendall's notation it describes a system where arrivals form a single queue and are governed by a Poisson process, there are c servers, and job service times are exponentially distributed. It is...
In queueing theory, a discipline within the mathematical theory of probability, an M/G/k queue is a queue model where arrivals are Markovian (modulated by a Poisson process), service times have a General distribution and there are k servers. The model name is written in Kendall's notation, and is an extension of the M/M/c queue, where service ti...
The problem is kind of different given that you have finite students, but infinite students with a distribution of visiting times might still be a good idea
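As a rough sketch of the M/M/c suggestion, the Erlang C formula gives the probability that a job has to wait, and from it the mean time in the queue. The function names are made up for the example. It assumes every test is an independent Poisson arrival with exponentially distributed run time, which is a simplification, since tests really arrive in bursts of M per submission.

```python
import math


def erlang_c(servers, offered_load):
    """Probability that an arriving job must wait in an M/M/c queue."""
    if offered_load >= servers:
        return 1.0  # unstable: the queue grows without bound
    # Erlang B via the standard recursion, which avoids huge factorials
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    rho = offered_load / servers
    return b / (1.0 - rho * (1.0 - b))


def mean_wait(servers, arrival_rate, service_time):
    """Mean time a job spends queued before it starts running."""
    load = arrival_rate * service_time
    if load >= servers:
        return math.inf
    return erlang_c(servers, load) * service_time / (servers - load)


def min_servers(arrival_rate, service_time, max_mean_wait):
    """Smallest number of servers whose mean queueing time is within the limit."""
    c = max(1, math.ceil(arrival_rate * service_time))
    while mean_wait(c, arrival_rate, service_time) > max_mean_wait:
        c += 1
    return c


# Numbers from the earlier estimate: 1000 students, 20 tests, 310 s per cycle, 5 s per test
print(min_servers(arrival_rate=1000 * 20 / 310, service_time=5, max_mean_wait=10))
```

The wait computed here is per test, not per submission, so it is only a starting point for comparing against the acceptable delay X.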
hi um, idk if this is the right place to ask but, how much python should i learn to be able to start learning AI/ML stuff?
a basic understanding should be fine, you can learn more advanced stuff as you go on
I would say basic fundamentals and functions
can someone explain me what is class_weight parameter in sklearn clf algorithms
it has options like either None or balanced
I found this helpful: https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work
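basically class_weight="balanced" weights each class inversely to its frequency, n_samples / (n_classes * np.bincount(y)), so mistakes on the smaller class count more. The effect is similar to replicating the smaller class until it has as many samples as the larger class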
so its used when the target labels are imbalanced ?
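For reference, the weights that class_weight="balanced" uses follow the formula in the scikit-learn documentation, n_samples / (n_classes * np.bincount(y)). The sketch below reproduces it in NumPy on made-up labels; `sklearn.utils.class_weight.compute_class_weight` should return the same values.

```python
import numpy as np

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y)                       # samples per class
weights = len(y) / (len(counts) * counts)     # n_samples / (n_classes * counts)

print(dict(enumerate(weights)))
```

The rare class gets the larger weight, and the weighted class totals (weight times count) come out equal for both classes.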
i tried it to some imbalanced dataset , after changing the parameter to 'balanced' my f1 scored lowered
if you dataset is imbalanced try using technique like SMOTE
yeh going to try that now
what is the average param set for your f1 score?
binary
oh binary classification okay
if its more than 2 clf then i should change that to weighted right ?
well, this param is only required if you have multiclass labels
do you want to calculate metrics for each label?
yes
then set it to weighted
also want the average?
bcuz that is what weighted will do
ooo okay then
yep thanks i'll look into it
uh, no
you can use class_weights for n classes, be it 2 or 10
its useful for baselines if you don't want to do any augmentation
but its impact on accuracy is kinda inconsistent
try removing the param and training + with param to see if there is any difference in accuracy scores (setting seed ofc) @somber prism if you don't see any major difference, use some augmentation
I was talking about f1_score average parameter
ayy, my bad
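A short sketch of what the f1_score `average` parameter discussed above does, on made-up labels. With average=None you get one score per class; "macro" is their plain mean and "weighted" is their mean weighted by how often each class appears in y_true.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up three-class labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)      # one F1 per label
macro = f1_score(y_true, y_pred, average="macro")       # unweighted mean
weighted = f1_score(y_true, y_pred, average="weighted") # mean weighted by support

print(per_class, macro, weighted)
```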
where do i learn data science for free
Hello! I have some data for x and y axes and I also have their errors. The thing is that I can't find a way to fit a curve that takes into consideration both x and y errors. The function curve_fit only considers y error. I also tried using odr but it doesn't give me the correct curve. I have searched a lot but I can't find anything that works, so I would appreciate some help. Thanks in advance
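For comparison, here is a minimal scipy.odr sketch that passes both x and y errors through `RealData`. The straight-line model, the data, and the error sizes are made up for illustration. One common reason ODR returns a poor curve is a bad starting guess `beta0`, so that is worth checking first.

```python
import numpy as np
from scipy import odr


def line(beta, x):
    """Model function in the form scipy.odr expects: parameters first, then x."""
    return beta[0] * x + beta[1]


# Made-up measurements with uncertainties on both axes
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sx = np.full_like(x, 0.1)  # standard deviations of x
sy = np.full_like(y, 0.2)  # standard deviations of y

data = odr.RealData(x, y, sx=sx, sy=sy)
fit = odr.ODR(data, odr.Model(line), beta0=[1.0, 0.0]).run()

print(fit.beta)     # fitted slope and intercept
print(fit.sd_beta)  # their standard errors
```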
How do you know it's not giving you the 'correct curve'?
Hello, I have a dataset of X-ray images that fall under one of the classes : covid or non-covid. The assignment requires me to perform EDA on these images. Can someone help me with this. Except plotting the mean and S.D of the pixels of each image in a scatter plot, I don't know what kind of EDA can I run on images.
Well you can easily tell that it doesn't correctly follow the points. But I also checked with OriginLab which should give me the correct curve and they were different.
Yes, please ping me always!
That's... a bit over my head to be honest, but I'll try reading it again tomorrow
thanks
This is the appropriate channel for TF questions I assume?
Basically you can get statistics for waiting time or time in queue if you use such models, and I think you can compare those against X
Yes, but whether someone answers... that's a little difficult to say. It depends on your question
And like, if anyone is well-experienced and willing to answer
Anyone using the new VSCode notebooks?
So I know on xgboost, you can use a few different metrics to determine feature importance, but is there a way to determine the importance of a set of features? For example, could you pick a certain feature, and then look at all the features the algo decided to split on immediately after your picked feature, and compare the gain? Like say every time it split on "Color" and then "Size", the gain from both of them is larger on average than "Color" and then "Weight". And then you'd interpret that as those two features interacting with each other somehow
what about the average importance of a set of features? all of the importance scores used in xgboost are "additive" and behave linearly, so you can just add up the scores https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score
well, if you add the averages it won't be quite the same
but you can sum the total_gains for each of the features, and sum the number of splits (called weight in xgboost) for each of the features, and divide to get an average
I'm not sure if that would have the effect I'm looking for, which is determining the relationship between two (or more) features. I'm less concerned about the absolute value, and more about for this specific feature, which feature in combination is the most important
isn't total_gain/weight the same as just gain?
gain is the average gain per split (according to docs)
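A small sketch of the "sum total_gain, sum weight, then divide" idea for a set of features. The two dicts below are hypothetical numbers shaped like what `Booster.get_score(importance_type="total_gain")` and `Booster.get_score(importance_type="weight")` return; `group_average_gain` is a made-up helper name.

```python
# Hypothetical importance dicts, shaped like Booster.get_score(...) output.
total_gain = {"Color": 120.0, "Size": 60.0, "Weight": 30.0}
weight = {"Color": 40, "Size": 10, "Weight": 30}  # number of splits per feature


def group_average_gain(features, total_gain, weight):
    """Average gain per split, pooled over a set of features."""
    splits = sum(weight.get(f, 0) for f in features)
    if splits == 0:
        return 0.0  # none of these features was ever split on
    return sum(total_gain.get(f, 0.0) for f in features) / splits


group = ["Color", "Size"]

# Pooled average: (120 + 60) / (40 + 10)
pooled = group_average_gain(group, total_gain, weight)

# Mean of the per-feature "gain" values: (120/40 + 60/10) / 2
mean_of_gains = sum(total_gain[f] / weight[f] for f in group) / len(group)

print(pooled, mean_of_gains)
```

The two numbers differ whenever the features have different split counts, which is why adding or averaging the per-feature `gain` values is not the same as pooling. Note this only aggregates importance over a set; it does not measure whether the features interact.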
does anyone have a tutorial on image classification with machine learning (no CNN)
If you just search MNIST ANN or basic MNIST NN you'll find a bunch
thanks
Are you looking for some non-linear way for measuring importance?
Not sure if it's a good idea. Suppose you have a definition, then now you have a powerset to solve for (which you can't solve)
there are plenty that use traditional sklearn algos
I am trying to do one and not getting that good results
so I was trying to explore more
what did you try?
didn't wanna do CNN just yet
I believe SVM with a bit of preprocessing does good
yea that gave the highest result of all of them I tried ~85%
I think it can be better
like in the 90+
nobody uses MNIST with CNNs lol
it's overkill - MLPs are better
I am doing sign lang digits classification, is CNN overkill?
no
I recommend you learn the basics of NNs first rather than jumping straight to CNNs
I am
well, you are using SVM in one and CNNs in another 🤷
yes, calling them MLPs is maybe more accurate
"Artificial Neural Network" isn't very specific
alright my bad
cool - for images, CNNs are the only archs (bar some esoteric ones)
ViT is not something you would use unless you want to win or smthing
it would be overkill anyways. CNNs are always the best for images
I'm getting an error while making a box plot for my series
ValueError: The number of FixedLocator locations (2), usually from a call to set_ticks, does not match the number of ticklabels (1).
Can someone help pls
here are the results from SVC
I don't know why I did mae
Please normalise the data before making a confusion matrix
Also I would suggest use log_loss available in sklearn as the loss function
It already is?
0-9 are digits
No, the confusion matrix isn't normalised. If it were, the colorbar would go from 0 to 1 instead of 0 to 40
The benefit: say the test set has only 30 8s but 40 9s, then without normalising the 9 cell will be brighter than the 8 cell even if the accuracy in both classes is the same
I'd recommend you give more detail, because I don't think anyone can help you without additional detail
you have normalised the pixel values, which you should have. what i'm saying is you could normalise the counts per class when making the confusion matrix so it's less confusing to read
oh how to do that?
you mean the one i wrote for you? 😉
i think PCA + SVM is the "classic" MNIST solution
or something like preprocessing to binarize the data
i think you could do something similar with sign language
use some kind of image processing to extract an "outline"
then PCA + SVM or keep using RBF
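A minimal sketch of the PCA + SVM (RBF) idea, assuming scikit-learn. It uses the small 8x8 `load_digits` set as a stand-in (not MNIST and not the sign language data), and `n_components=30` and `C=10` are arbitrary starting values rather than tuned ones.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 digit images flattened to 64 features; a stand-in dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale pixels, compress them into a few "bigger" features, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, random_state=0),
    SVC(kernel="rbf", C=10, gamma="scale"),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Keeping everything in one pipeline means the scaler and PCA are fit on the training split only, which avoids leaking test data into the preprocessing.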
Well we are avoiding CNN so does that mean we are avoding NNs too?
true a boring feedforward NN could be in play
Because I have tried using autoencoders and then SVM; it works well too
"let's pretend it's 2008 again"
while creating the confusion matrix cm
you can use:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
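Here is that row normalisation as a small runnable sketch with made-up counts (30 eights and 40 nines, both classified with the same accuracy). The `cm` values are invented for illustration.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
# Made-up counts: 30 true 8s and 40 true 9s, both 80% correct.
cm = np.array([[24, 6],
               [8, 32]])

# Divide each row by its total so every row sums to 1.
row_totals = cm.sum(axis=1, keepdims=True)
cm_norm = cm / np.maximum(row_totals, 1)  # guard against empty classes

print(cm_norm)
```

After normalising, both diagonal cells show 0.8 even though the raw counts (24 vs 32) looked different. If the matrix comes from scikit-learn, `confusion_matrix(y_true, y_pred, normalize='true')` should give the same row-normalised result directly.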
did you try ada-boosting the models?
you should note that it's specifically SVM with an RBF kernel
nope
a linear SVM wouldn't be much better than ridge regression, if at all
yea
I don't know what that is
use "SVM + RBF" in the graph so you don't forget (and other people don't wonder about it)
and yeah gradient boosting could be a good option
thanks
does boosting work with non-weak learners?
i'm not surprised the random forest didn't do well... feature splitting doesn't make sense on pixels, they're too "specific"
you need to extract bigger features like with PCA
then you can try something like ensembling the SVM-RBF with the PCA+RF setup
this is why CNNs are so cool, they learn useful features from "fine-grained" data like this
and why deep learning works so well on this kind of data, where the data is very "high resolution"
I was going to learn about CNNs, think this is the right time?
ValueError: Input 0 of layer sequential is incompatible with the layer: : expected min_ndim=3, found ndim=2. Full shape received: (None, 12)
X.shape = (118, 12, 1)
Y.shape = (118,)
The input shape of the model is input_shape=(12, 1)
Does anyone know why I am getting this error
model = Sequential()
model.add(Conv1D(128, 7, activation='relu', input_shape=(n_steps, n_features)))
model.add(Conv1D(128, 7, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=3, padding='same'))
model.add(Conv1D(256, 5, padding='same'))
model.add(Conv1D(256, 5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=3, padding='same'))
model.add(Conv1D(512, 3, padding='same'))
model.add(Conv1D(512, 3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=3, padding='same'))
model.add(Conv1D(512, 1, padding='same'))
model.add(Conv1D(512, 1, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=3, padding='same'))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Here is the code for the model
can someone help
looks decent ig, but what's stochastic gradient descent lol
You need to define the input shape with one of these methods, I think
# With explicit InputLayer.
model = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(4,)),
tf.keras.layers.Dense(8)])
model.compile(tf.optimizers.RMSprop(0.001), loss='mse')
model.fit(np.zeros((10, 4)),
np.ones((10, 8)))
# Without InputLayer and let the first layer to have the input_shape.
# Keras will add a input for the model behind the scene.
model = tf.keras.Sequential([
tf.keras.layers.Dense(8, input_shape=(4,))])
model.compile(tf.optimizers.RMSprop(0.001), loss='mse')
model.fit(np.zeros((10, 4)),
np.ones((10, 8)))
That's just an example not the exact code
most probably you put the wrong shapes
maybe
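One guess at the Conv1D error: the message says the model received shape `(None, 12)`, so somewhere (training data, validation data or the array passed to `predict`) a 2D array is going in without the trailing feature axis. A numpy-only sketch of a shape check to run before calling the model; `as_conv1d_input` is a made-up helper name.

```python
import numpy as np

n_steps, n_features = 12, 1


def as_conv1d_input(a, n_steps, n_features):
    """Return `a` shaped (samples, n_steps, n_features), as Conv1D expects."""
    a = np.asarray(a, dtype="float32")
    if a.ndim == 2:
        # (samples, steps) -> add the missing feature axis
        a = a.reshape(a.shape[0], n_steps, n_features)
    if a.shape[1:] != (n_steps, n_features):
        raise ValueError(f"expected (*, {n_steps}, {n_features}), got {a.shape}")
    return a


X_flat = np.zeros((118, 12))  # the kind of array that triggers the error
X = as_conv1d_input(X_flat, n_steps, n_features)
print(X.shape)
```

Printing the shape of every array right before `fit` / `predict` is usually enough to find which one is still 2D.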
question: i have data that looks like this. When i convert it to datetime it's in the format Year-01-01 but i want it to be Year-12-31, how would i go about doing that?
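Two possible ways to get year-end dates with pandas, assuming the column just holds years as integers (the `years` Series below is made up, since the original data isn't shown):

```python
import pandas as pd

years = pd.Series([2018, 2019, 2020])  # hypothetical year column

# Option 1: build the full date string before parsing.
year_end = pd.to_datetime(years.astype(str) + "-12-31")

# Option 2: parse as Jan 1, then roll forward to the end of that year.
jan_first = pd.to_datetime(years.astype(str), format="%Y")
year_end_alt = jan_first + pd.offsets.YearEnd(0)

print(year_end)
```

Both give Dec 31 of each year; the second is handy when the column has already been converted to datetimes.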
Note that MNIST is a trivial task and does not at all represent a real computer vision task.
Just about anything will work on it. Also some digits are mislabeled so keep that in mind. Don't expect 100% accuracy ever.
Because it is so simple, it does make for a good bug check.
Fashion MNIST and other datasets that use the MNIST name are more of a real task and you will notice the large drop in accuracy.
my solution to above for anyone who cares
Hey guys, sorry to keep asking this type of question. so this is what I have
```
import numpy as np

def spike_then_quiet(c, i, other):
    # True if row i is a spike on 'ay' (>= 50) and on `other` (>= 70), row i+1 is lower,
    # and rows i+2..i+7 all stay below 20 in absolute value on both columns
    for col, peak in (('ay', 50), (other, 70)):
        start = abs(c[col].iloc[i])
        if not (start >= peak and c[col].iloc[i + 1] < start):
            return False
        if not all(abs(c[col].iloc[i + k]) < 20 for k in range(2, 8)):
            return False
    return True

c['cat'] = np.nan
cat_col = c.columns.get_loc('cat')
for i in range(len(c) - 7):  # stop 7 rows early so i+7 stays in range (last rows stay NaN)
    hit = spike_then_quiet(c, i, 'az') or spike_then_quiet(c, i, 'ax')
    c.iloc[i, cat_col] = 1 if hit else 0
```
How can i also set c['cat'].iloc[i+1] and so on up to i+7 to 1?
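One way to do this, sketched with numpy: first compute the 0/1 hits, then spread each hit forward over the next 7 rows in a second pass (doing it inside the same loop would let the spread values get overwritten on later iterations). `extend_hits` is a made-up helper name and the `hits` list is invented for illustration.

```python
import numpy as np


def extend_hits(hits, window=7):
    """Mark each hit row and the `window` rows after it with 1."""
    hits = np.asarray(hits, dtype=bool)
    out = hits.copy()
    for k in range(1, window + 1):
        # Shift the original hits down by k rows and OR them in.
        out[k:] |= hits[:-k]
    return out.astype(int)


hits = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a single hit at row 1
extended = extend_hits(hits)
print(extended)
```

With a DataFrame this would be something like `c['cat'] = extend_hits(c['cat'] == 1)`, assuming `c['cat']` already holds the 0/1 result of the detection loop.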
so my model is fitting really really well to the training data, but it's also fitting well to the test data (RMSE on train of 1, on test is 6) is that ok? Or do I want to reduce how well it fits the training data?
Hiya, anyone knows what is the simplest way to kind of model like a "gesture is not among known gestures" label in hand gesture recognition? threshold is one way but it's not really that robust, I thought of estimating uncertainty using bayesian neural networks but was wondering if there happened to be a simpler fairly robust method?
would using sigmoid instead of softmax for the last layer help? if all the probabilities are less than 0.5, it means either there was no gesture, or the gesture is not known enough?
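A sketch of the plain threshold version of that idea, numpy only. It assumes the model outputs one score per known gesture (for example independent sigmoid outputs); the `probs` values, the 0.5 threshold and the `-1` "unknown" label are all made up, and the threshold would need tuning on held-out data that includes unknown gestures.

```python
import numpy as np


def predict_or_unknown(probs, threshold=0.5, unknown=-1):
    """Return the best class per row, or `unknown` if no score reaches the threshold."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, best, unknown)


# Made-up per-class scores for three inputs.
probs = np.array([
    [0.9, 0.1, 0.2],  # clearly gesture 0
    [0.3, 0.4, 0.2],  # nothing reaches 0.5 -> unknown
    [0.1, 0.2, 0.7],  # clearly gesture 2
])
labels = predict_or_unknown(probs)
print(labels)
```

This is no more robust than any other threshold, but it is cheap to try on top of a pretrained model before reaching for uncertainty estimation.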
add another category
a bit difficult for a huge dataset, I'm already using a pre trained model
I would have to modify the dataset, and retrain the model
Hey guys, I want to read a video using matplotlib.image, can someone give me an idea on how to do that?
I've tried using imageio, which can use a reader and then iterate through the reader to get frames and an array with the pixels. However, I gave up using this library because it doesn't return a proper array that I can use in my algorithms.
Here's the code I've used so far:
data = imageio.get_reader(r'video_sample.mp4', 'ffmpeg')
for frame, rgb in enumerate(data):
    X = rgb
    y = frame
I'm out of ideas on how to iterate through a video using matplotlib.image
it shouldn’t
because each successive learner increases variance and decreases bias
if you start with strong learners you’re probably going to get mad overfitting
should have planned it out before then - that sort of thing can only be interpreted by the confidence values
that's not how enumerate works
it simply converts the data you are iterating over to a tuple while providing its count as the second elem
at least, that's what I understand ¯_(ツ)_/¯
Enumerate is working fine and it returns an array, but it returns an imageio array, not a "proper" array
why did this not work for you?
ah okay
wait hold up
it should return a subclass
which you can use as if it were a numpy array
what do you mean by this
as in RGB = the index of the iterator?
does that provide any useful information?
rgb is the array representing each frame
I thought it was the counter 🤔
When I use print(type(X)) it returns <class 'imageio.core.util.Array'>. If I try using X in my neural network, it returns the following error:
ValueError: Failed to find data adapter that can handle input: <class 'imageio.core.util.Array'>
convert it then
can you print the shapes?
np.asarray
no, index comes first
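A tiny sketch of what `enumerate` yields, using strings as stand-ins for video frames:

```python
frames = ["frame_a", "frame_b", "frame_c"]  # stand-ins for real frame arrays

# enumerate yields (index, item) pairs: the counter comes first.
pairs = list(enumerate(frames))

for index, frame in enumerate(frames):
    print(index, frame)
```

So in `for frame, rgb in enumerate(data)`, `frame` is the counter and `rgb` is the actual frame.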
ahh, lightbulb
this is one of those dynamic typing things I guess
🙏
I've tried that, but I'll try again, just to make sure
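For the `np.asarray` suggestion, a numpy-only sketch of what the conversion does. `FrameArray` is a hypothetical stand-in for an ndarray subclass like the one imageio returns, and the frame size and the scaling to `[0, 1]` are assumptions.

```python
import numpy as np


class FrameArray(np.ndarray):
    """Hypothetical stand-in for an ndarray subclass such as imageio's Array."""


# A fake 4x6 RGB frame viewed as the subclass.
frame = np.zeros((4, 6, 3), dtype=np.uint8).view(FrameArray)

# np.asarray drops the subclass and returns a plain ndarray.
plain = np.asarray(frame)

# Add a batch axis and scale to floats, as most networks expect.
batch = plain[np.newaxis, ...].astype("float32") / 255.0

print(type(frame).__name__, "->", type(plain).__name__, batch.shape)
```

If the error persists after converting, checking `type(...)` on the exact object passed to the network would show whether the converted array is really the one being used.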

