#data-science-and-ml
Ok
when not using cross val, I fit/predict and read the accuracy from the prediction
I split before tho
Ok good. You should usually stick to one evaluation procedure
cross val seems to be solid
Yeah it's a reasonable way to go
So you have two different models that have different accuracies?
I only have 1 model for now, but I was thinking about combining 2 yes
Yes, that's a technique called ensembling
Ok, gtk
It's very common and very popular in these types of prediction competitions
There are some great blog posts on it, let me find one
https://mlwave.com/kaggle-ensembling-guide/ @surreal nacelle
thank you
Decision Tree / CART with Label Encoder?
Dummies makes 0 sense
Balanced accuracy: 0.21309668192963746
Hamming loss: 0.2958520739630185
Accuracy: 0.7041479260369815
hey
"""Weather in Szeged 2006-2016: Is there a relationship between humidity and temperature? What about
between humidity and apparent temperature?
Can you predict the apparent temperature given the humidity?"""
import sklearn.linear_model as lin
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
dataw = pd.read_csv(filepath_or_buffer='C:/Users/admin/Desktop/Artifcial intelligence/ML/data/WEATHER/weatherHis.csv')
df_x = pd.DataFrame(dataw.Humidity)
df_y = pd.DataFrame(dataw['Temperature (C)'])
trainx,testx,trainy,testy = train_test_split(df_x,df_y,test_size=0.2,random_state=4)
regr = lin.LinearRegression()
regr.fit(trainx,trainy)
slope = regr.coef_
intercept = regr.intercept_
n = regr.predict(testx)
print(n)
print(testy)
print(mean_squared_error(y_true=testy , y_pred=n))
this is my
code
i get an MSE value of 55.248754390894184?
is that somewhat accurate
?
Ideally you'd always want your MSE to get as close to 0 as possible. Whether 55 is a good value really depends on what you're predicting: MSE is simply the average of the squared differences between the actual and predicted values, so you have to decide for yourself whether sqrt(MSE) is good enough for you or not
if temperature is in celsius and humidity is 0-100, then probably not
if you did temperature, humidity vs apparent temp you'd probably get a lot better results
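To make the sqrt(MSE) point concrete: the square root puts the error back into the units of the target, so it can be read as a typical miss in degrees Celsius. A minimal sketch, using only the MSE value quoted above:

```python
import math

# MSE reported above for the humidity -> temperature regression
mse = 55.248754390894184

# RMSE is in the same units as the target (degrees Celsius here)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f} C")
```

An error of that size is large compared with the usual range of outdoor temperatures, which fits the "probably not" verdict.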
```
Your Best Entry
You advanced 2,422 places on the leaderboard!
Your submission scored 0.78468, which is an improvement of your previous score of 0.77033. Great job!
```
I really don't see how to improve on that, currently ranked 4200 out of 11600. Is it time to read some notebooks written by people at the top of the leaderboard?
actually jumped 1000 places by running the model again
and again 2000 places lol
now ranked 1000 out of 11600
feeling pretty good about it
nice good job
Thanks
What is the best free way to learn data science
check the second pin
x = temp + humidity, y = feels like temp
guys i'm trying to get into a bit of algo trading and thinking of using backtrader in python, anyone know of anything better than this or is backtrader a good starting point? I'm not the greatest coder but i can fumble my way through most things
I believe you would want to look at quantopian @sweet socket
lol good luck
Alright I'm done with the titanic
Has anybody worked with SMOTENC and make_classification to fix the class imbalance in a dataset?
Looking at some code for reinforcement learning on time series (below). Why is it scaling each observation 0-1 instead of the entire observation space to 0-1?
What happens if you have divergent values (e.g. values that go above what are seen in the train set and, thus, will go above 1 if scaled)? Will everything break if values go < 0 or >1 or is everything fine? Would it be better to scale to a tighter range (e.g. instead of 0-1 scaling, 0.2-0.8 scaling)? Want to see if anyone knows / has resources before I start experimenting.
```python
# Get the data points for the last 5 days and scale to between 0-1
frame = np.array([
    self.df.loc[self.current_step: self.current_step + 5, 'Open'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step + 5, 'High'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step + 5, 'Low'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step + 5, 'Close'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step + 5, 'Volume'].values / MAX_NUM_SHARES,
])

# Append additional data and scale each value to between 0-1
obs = np.append(frame, [[
    self.balance / MAX_ACCOUNT_BALANCE,
    self.max_net_worth / MAX_ACCOUNT_BALANCE,
    self.shares_held / MAX_NUM_SHARES,
    self.cost_basis / MAX_SHARE_PRICE,
    self.total_shares_sold / MAX_NUM_SHARES,
    self.total_sales_value / (MAX_NUM_SHARES * MAX_SHARE_PRICE),
]], axis=0)

return obs
```
@void anvil i dont fully understand your question. it's scaling each feature separately
those MAX_* variables are defined outside the function
practically, i think normally you'd just clip the value to 0 or 1
ah yeah you're right, but it's still scaling the max observed to 1
whereas it could go to 1.5 or w/e
in the unobserved test set (or in real time application)
this example came from stock trading, so amazon is trading at ~2k now. If we were to let it go, amazon could go to 3k or 50k because it's unbounded
so then you'd end up feeding a value > 1
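The clipping suggestion above can be sketched as follows. This is only an illustration: the value of `MAX_SHARE_PRICE` and the helper name `scale_price` are assumptions, not taken from the example code.

```python
import numpy as np

# Hypothetical constant, standing in for the MAX_SHARE_PRICE defined outside the function
MAX_SHARE_PRICE = 5000.0

def scale_price(prices, max_price=MAX_SHARE_PRICE):
    """Scale prices by a fixed maximum and clip the result into [0, 1]."""
    scaled = np.asarray(prices, dtype=float) / max_price
    return np.clip(scaled, 0.0, 1.0)

# 7500 is above the assumed maximum, so it is clipped instead of scaling past 1
print(scale_price([2000.0, 5000.0, 7500.0]))
```

The trade-off is that every price above the assumed maximum looks identical to the model once it is clipped, so the maximum should be chosen with headroom.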
I have 7 categorical features and i want to use smotenc
from collections import Counter
from numpy.random import RandomState
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
Does somebody have an idea what this error means?
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6
expected neighbors to be less than or equal to 1
I googled the problem, seems that my data set is too small
@quartz monolith that seems wrong. you shouldnt have only 1 sample
what is X_train.shape?
and y_train.shape?
unless that means 1 sample in a particular category
y_train = 4287 shape
X_train = (4287, 8)
Here is someone with a similar problem
https://stackoverflow.com/a/48820222/11811575
but I dont understand it...
how many classes do you have
Label classes are 31
And feature 7
```python
sm = SMOTENC(random_state=42, categorical_features=[X])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
```
got now `ValueError: cannot copy sequence with size 5717 to array axis with dimension 8`
how many of each class do you have @quartz monolith
i dont think categorical_features=[X] is right. is X a matrix?
i think you would need to use the column numbers instead
im not 100% sure
what library is this?
X is wrong i need to use array
```python
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
x_res, y_res = sm.fit_resample(X, y)
```
0,7 should be right
My X_train.shape = (4287, 8)
By class you mean the number of my features?
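One likely reading of the `n_samples = 1` error, offered as an assumption because the per-class counts were never posted: SMOTE-style oversamplers search for neighbours inside each minority class, so with 31 label classes in 4287 rows, a class with a single row cannot supply the default 5 neighbours. A quick check with the standard library, on made-up labels:

```python
from collections import Counter

def smallest_class(y):
    """Return (label, count) for the rarest class in y."""
    counts = Counter(y)
    label = min(counts, key=counts.get)
    return label, counts[label]

# Made-up labels for illustration; replace with the real y_train
y_example = ["a"] * 10 + ["b"] * 4 + ["c"] * 1

label, count = smallest_class(y_example)
print(f"rarest class: {label!r} with {count} sample(s)")

# A class needs more samples than the number of neighbours requested,
# so any k_neighbors setting must stay below the smallest class count.
max_k_neighbors = count - 1
print("largest usable k_neighbors:", max_k_neighbors)
```

Here the rarest class has one sample, so no positive `k_neighbors` works; such classes would have to be merged, dropped, or given more data first. Separately, `categorical_features=[0, 7]` is a list of column indices, so it marks columns 0 and 7 only, not columns 0 through 7.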
@desert oar it does but I would have to buy a very expensive license to get the full scripting functionality which is why I wanted to create a gui bot in python for it
I actually found some corner detection and edge detection stuff in opencv that I'm going to look into
thanks!
Got another question on the take_action portion. Again, this comes from the same stock example code:
```python
# Buy amount % of balance in shares
total_possible = self.balance / current_price
shares_bought = total_possible * amount
prev_cost = self.cost_basis * self.shares_held
additional_cost = shares_bought * current_price

self.balance -= additional_cost
self.cost_basis = (prev_cost + additional_cost) / (self.shares_held + shares_bought)
self.shares_held += shares_bought
```
So right here it's calculating the max amount it can buy in the "total_possible" line.
If we wanted to arbitrarily limit it to a max amount, say 10000 shares, we could change it to:
```total_possible = min(self.balance / current_price, 10000)```
Hopefully, over the course of time, the algorithm will "learn" that it can't place a buy bigger than 10000 at a time (and we wouldn't expect a larger output than that when the algo is finished training).
But what if we want to limit it to some amount that depends on the next time step, which the algorithm SHOULD NOT have access to at time t, because the assumption of an unlimited purchase of a stock or w/e doesn't make a lot of sense? For this example, we'll limit it to an arbitrary 10% of the volume of the next time period, vol_t+1.
```total_possible = min(self.balance / current_price, 0.1 * next_period_vol)```
By placing this restriction in the take_action space, is it actually being fed a bit of information from the future, or is this restriction put in the right place? Should the RL try to place a larger buy / sell than possible and have the action be restricted elsewhere so as not to pass "cheating" information back?
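The two caps described above can be sketched with the built-in `min`, which returns the smaller of two scalars (`np.min(a, b)` would treat its second argument as an axis). The function name `shares_to_buy`, the `cap` parameter, and the numbers are illustrative only; whether the volume cap leaks future information is left open, and this sketch applies the cap only when sizing the order.

```python
def shares_to_buy(balance, current_price, amount, cap=None):
    """Shares bought when spending `amount` (0-1) of the balance, optionally capped."""
    total_possible = balance / current_price
    if cap is not None:
        total_possible = min(total_possible, cap)
    return total_possible * amount

# Fixed cap of 10000 shares per order
print(shares_to_buy(balance=5_000_000, current_price=100.0, amount=1.0, cap=10_000))

# Cap at 10% of the next period's volume (hypothetical value)
next_period_vol = 40_000
print(shares_to_buy(balance=5_000_000, current_price=100.0, amount=0.5,
                    cap=0.1 * next_period_vol))
```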
porting between pytorch and TF code is LOL
Hey does anyone have a recommended textbook for Machine Learning/Data Science?
suggest starting with this one
No, it's not
@exotic cedar Please don't share pirated works or discuss piracy on our server.
!rule 5
5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.
lol aight
Question about pandas, does it count as a multi index if I use set_index([column1, column2]) or is that just a caveman mimicry of a multiindex?
okay so the red points are my y_pred and blue points are y_true
this simple linear regr model come out well?
Is there good library for displaying graphs in Jupyter that you can modify and live update?
I mean like actual graph with nodes and edges.
matplotlib
fwiz that is not a graph with nodes and edges
he is talking about a graph in the cs definition
my goto for that would always be graphviz but idk about that in jupyter
@silk forge @dense rose
Is there a function within Keras/TF that lets you add weights to the training data? Some of my data is higher quality and I'd like to give it more weight when fitting my NN. Currently I'm using ImageDataGenerator and flow_from_directory, if that helps any.
Ideally I'd want to assign different specific directories with higher/lower weights
off the top of my head I don't know but you could always add repeats of the data during training @haughty wind
basically just copy the data you value more so that the model sees it multiple times per epoch
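The "repeat the data you value more" idea can be sketched at the index level with NumPy. The mask, the helper name `oversample_indices`, and the repeat factor are made up for illustration; with `flow_from_directory` the same effect would have to come from how the files or batches are arranged, which is not shown here.

```python
import numpy as np

def oversample_indices(n_samples, high_quality_mask, repeats=3):
    """Return sample indices where high-quality samples appear `repeats` times."""
    indices = np.arange(n_samples)
    counts = np.where(high_quality_mask, repeats, 1)
    return np.repeat(indices, counts)

# Made-up mask: samples 1 and 3 are the higher-quality ones
mask = np.array([False, True, False, True])
print(oversample_indices(4, mask, repeats=3))
```

Shuffling the returned indices before each epoch keeps the repeated samples from arriving back to back.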
@silk forge have a look at ridge regression to understand the graph
https://www.youtube.com/watch?v=Q81RR3yKn30&t=785s
still cant figure out how to use smotenc on my data set @desert oar
Has anyone had success installing tensorflow_datasets on tf 2.0?
Always get the same error when importing
AttributeError: module 'tensorflow._api.v2.autograph.experimental' has no attribute 'do_not_convert'
can you compare versions
where that attribute is originally from and whether it was taken out
yo where raggy
Hm? @mossy dragon
Doesn't really come up but good to have some confidence with manipulating differentials
@lofty girder yes it creates a multi index
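A small check of that answer, on a made-up frame (the column names are placeholders): passing a list of columns to `set_index` produces a real `pandas.MultiIndex`, and rows can then be selected with a tuple.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "b"],
    "product": ["x", "y", "x"],
    "sales": [10, 20, 30],
})

indexed = df.set_index(["store", "product"])

print(type(indexed.index).__name__)
print(indexed.loc[("a", "y"), "sales"])
```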
Hey, do you think that this is good enough ?
I'm trying to do some 'data augmentation' by rotating each element of the dataset by -5 degree. The rotated image is a little noisy tho. Should I take the time to denoise it ? (and should I apply a stronger rotation to the images ?)
anyone here using pandas? I'm trying to run this code
but it is not adding the extra columns (path, dist, init, control, meas_interm):
I just have the ones I already had
looks like a bunch of pickles
in a dataframe.. hands down the weirdest thing I've seen yet
ahah, yeah, still have to update the name
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
!pip install googletrans
import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('SMSSpamCollection',
sep='\t',
header=None,
names=['label', 'sms_message'])
# Output printing out first 5 rows
df.head()
from googletrans import Translator
translator = Translator()
df2 = pd.DataFrame()
df2 = pd.DataFrame(columns=['label', 'sms_message'])
for i,j in zip(df['sms_message'],df['label']):
    text = translator.translate(i, dest='hi')
    df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)
can anyone help me
the above code throws this error
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-26-059f1e8ecbf9> in <module>()
4 df2 = pd.DataFrame(columns=['label', 'sms_message'])
5 for i,j in zip(df['sms_message'],df['label']):
----> 6 text = translator.translate(i, dest='hi')
7 df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)
6 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
df2 only gets filled up to about row 530, but the original file has more than 6000 rows
what are you trying to do
if it's a tab separated file, why don't you open it with builtins in pandas..
oh.. I just saw read_table..
not sure what you're trying to do here man..
what is df2
@lapis sequoia it looks like they're building up df2 by translating the contents of df
@quartz stream something is wrong with whatever is in i
you can try printing the data, or using %debug in ipython to investigate
@desert oar
I tried printing i
its a standard dataset
it shows all the value
@desert oar Yes you guessed it correct I am trying to translate df into df2
Why dont you try the link I have given the data file also
Any ideas why model.predict takes an abnormal amount of time to finish ? The model.fit takes a minute or so, and predict on 3% of the dataset takes 10
@surreal nacelle depends on the model. seems weird though
KNN on mnist
oh
@desert oar Any idea on my question
@surreal nacelle sklearn? KNeighborsClassifier?
Yep
@quartz stream "normal dataset" doesn't really help. i'm not in a position to start downloading data files and debugging right now
the translator is expecting something different from what you gave it.. thats the best i can offer
im not familiar w/ that library
Its 200kb
dataset
and the translator is working fine
for half the dataset
it is just not completing
so the code is fine
I translate and print every value
it does
I guess there is something wrong with adding values in database
i mean pandas*
it is not
well look at the error
it's clearly related to translate and not pandas
there is a bad element in your data somewhere
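One way to find where the translation stops is to catch the error per message and record the position instead of letting the loop die. This is a sketch under assumptions: `translate` stands in for whatever callable does the translation, and `fake_translate` only imitates a failure. Whether the real cause is a particular message or the service's response is not known here.

```python
import json

def translate_all(messages, translate):
    """Translate each message, recording failures instead of stopping."""
    translated, failures = [], []
    for position, message in enumerate(messages):
        try:
            translated.append(translate(message))
        except json.JSONDecodeError as err:
            # Keep the position and text so the bad element can be inspected
            failures.append((position, message, str(err)))
            translated.append(None)
    return translated, failures

# Stand-in for the real translator; fails on empty input
def fake_translate(text):
    if not text:
        return json.loads("")  # raises JSONDecodeError, as parsing a non-JSON reply does
    return text.upper()

results, failures = translate_all(["hello", "", "bye"], fake_translate)
print(results)
print(failures)
```

Collecting the results in a list and building the DataFrame once at the end also avoids growing `df2` row by row.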
@surreal nacelle what distance are you using?
I'm using the default values for now, which is n_neighbors=5
so algorithm='auto'?
Yes
and metric='minkowski'?
Yes
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
hm. im not that well versed in the details of kd-trees, but the whole point is that tree lookups are supposed to be fast
are the results correct?
hm. could just be the way it is
@desert oar Someone explained to me that knn doesn't train: the .fit simply gives it the data so that it can compare against it during the .predict phase. So it does make sense that it takes much longer to predict than to 'train' (no training)
KNNs are a nonparametric method, so it doesn't learn anything, so there's nothing to fit
@silent swan sklearn doesnt build a tree when you call fit()?
actually you're right it probably does do it then
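A toy brute-force version shows why predict is the expensive step. This is not sklearn's implementation (which, as noted above, may build a tree during `fit`); the class name and data are made up.

```python
import numpy as np

class BruteForceKNN:
    """Toy k-nearest-neighbours classifier: fit stores data, predict does the work."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Nothing is learned here; the training set is just kept around
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise squared Euclidean distances: one row per query point
        dists = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(dists, axis=1)[:, : self.n_neighbors]
        votes = self.y_[nearest]
        # Majority vote among the neighbours' labels
        return np.array([np.bincount(row).argmax() for row in votes])

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
model = BruteForceKNN(n_neighbors=3).fit(X_train, y_train)
print(model.predict([[0.2, 0.1], [5.5, 5.5]]))
```

Every query is compared against every stored training point, so prediction cost grows with the size of the training set and the number of features, which is large for MNIST.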
I need some help
I have a column that contains a list.. I want to split it and add them to new columns in the dataframe.. but, not all rows have equal number of items in the list
how should I approach this
Hey, how could I see the word instead of the dictionary index?
```python
vect = CountVectorizer(analyzer='word')
bag_of_words = vect.fit_transform(emails)
test_output = vect.transform(['email', 'test', 'hello'])
print(test_output)
```
```
(0, 15725) 1
(1, 51148) 1
(2, 55302) 1
```
@lapis sequoia is it already a list, or is it a string? and what's the max number of columns?
@void anvil i'd just use None personally
@surreal nacelle that's odd, my .transform returns a sparse matrix
but you would use vect.vocabulary_ to get a mapping from words to indices
oh weird. i didnt know sparse matrices acquired a fancy print method
yes that is a sparse matrix
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun'])
print(type(transf))
print(transf.shape)
print(len(v.vocabulary_))
id rather leave null values as null, and fill in later
just my style tho
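For the list-column question, one possible approach, assuming the column already holds Python lists rather than strings: build a second frame from the lists and let pandas pad the short rows with missing values, which matches the "leave nulls as null" preference. The column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "items": [["a", "b", "c"], ["d"], ["e", "f"]],
})

# Each list becomes a row; shorter lists are padded with missing values
expanded = pd.DataFrame(df["items"].tolist(), index=df.index)
expanded.columns = [f"item_{i}" for i in expanded.columns]

result = df.drop(columns="items").join(expanded)
print(result)
```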
I got the matrix that way, but as you can see it contains the index of the word in the dictionary instead of 0 and 1.
the vocabulary is telling you that "greetings" is in column 21152
you can't really get the words back
that doesn't make sense
that's the whole point of vectorizing
words go in, numbers come out
I understand that, but shouldn't the matrix contain the number of occurrence of each words instead of their indices in the vocabulary ?
it doesnt contain their indices
it contains 0s and 1s
well it contains more than that
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun to email with email', 'turtles dont use email but turtles like lettuce'])
print(transf.toarray())
print(v.vocabulary_)
oh, so it does exactly what I want
ohhh
the np.argmax returns the index
in the array
no
not its content
err
yeah
but
email_counts = transf.toarray()[:, v.vocabulary_['email']]
you want to know the most frequent word in each document?
For example yes
the vocabulary should have unique values as well as indices, so you can "invert" it
vocab_inverse = {val: key for key, val in v.vocabulary_.items()}
most_freq_word_per_document = [vocab_inverse[i] for i in transf.toarray().argmax(axis=1)]
thank you, once again
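Putting the pieces of this exchange together on the same toy sentences: the dense count matrix is indexed by document and vocabulary column, and inverting `vocabulary_` maps a column back to its word. This is a sketch; converting to a dense array is fine for a toy example but may not be for a 60k-word vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
docs = ['email is fun to email with email',
        'turtles dont use email but turtles like lettuce']
counts = v.transform(docs).toarray()  # dense (n_docs, n_vocab) count matrix

# vocabulary_ maps word -> column index; invert it to go back to words
vocab_inverse = {index: word for word, index in v.vocabulary_.items()}

email_counts = counts[:, v.vocabulary_['email']]
most_frequent = [vocab_inverse[int(i)] for i in counts.argmax(axis=1)]

print(email_counts)
print(most_frequent)
```

Words that were not seen during `fit` (such as "fun" or "lettuce") have no column and are ignored by `transform`.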
Hey everyone, I'm new to ML, and got a task to predict sales for dataset1 using dataset2. Dataset2 has data for 4 weeks in advance, how would I go about doing this? Currently built a linear regression model for dataset2 with .94 rsquared
@crimson trellis seems like a good start. how many data points do you have?
2.7mil from dataset2
oh thats a lot
running it on bigquery
you can use a train/test split or cross validation, and measure the accuracy of your model
I graphed the linear regression predictions against the actual data and it was pretty spot on
yeah thats probably good enough
but you can use the holdout set to be sure
implementing CV in bigquery would be annoying
but you can reserve eg 1/4 of your data and not train on it
then do the prediction on it and measure accuracy
got it. I think I saw something like that
I'm stuck on figuring out how to use the model for a forecast though...
that's a bigquery specific question i'm afraid
and i wouldnt know the answer. maybe someone else does
gotcha. All the examples I've seen don't really have a date forecast, and instead do something like 'airline delays, taxi fares, etc'
if time is involved things are a bit more complicated
can you describe datasets1 and 2 in more detail
how they related to each other etc
it's just date/product/sales/units/store1 - dataset1
same thing for dataset2, but different store
I was thinking of using a coefficient store1/store2 and apply it to the model to predict store1 using store2 performance
so you're predicting, e.g. walmart sales using store2 sales?
yep
because store2 has data coming in daily, and walmart has data every month instead
how are you running that regression then
so the model is trained using target only
and not entirely sure if this is the right way to do it, but I want to use it to predict walmart data based on (walmart sales last month/target sales last month)
so you basically took the monthly average of target sales, then predicted this month's walmart sales using last months' walmart-target ratio
hm. the math of linear regression won't like that, you're going to have non-independently distributed data
as for testing it, you can only forecast 1 month ahead at best
unless you start forecasting target as well
I do have about 3 years of data in that 2.7mil records
unfortunately that doesnt help
ah
it means your model can learn more ,but it doesnt fix the fundamental issues
how would you approach this? I'd like to get it right without just throwing something up quickly
and that's how my coefficient thing feels
just a quick solution
model being valid or not, the way you would test it is by "sliding" over the data. say you have 24 months of data and you reserve the last 6 months for testing. then you train on the first 18 months and evaluate on the 19th month. then you train on the first 19 months and evaluate on the 20th, and so on until you run out of months. then you can take the mean squared error over all the evaluation points
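That sliding evaluation can be sketched in plain Python. The `forecast` callable is a placeholder for whatever model gets refit on each history, and the monthly values are made up.

```python
def walk_forward_mse(series, n_test, forecast):
    """Expanding-window evaluation: train on the past, score the next point."""
    squared_errors = []
    first_test = len(series) - n_test
    for t in range(first_test, len(series)):
        history = series[:t]          # only data strictly before month t
        prediction = forecast(history)
        squared_errors.append((series[t] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Placeholder model: predict that next month equals the last observed month
def last_value(history):
    return history[-1]

monthly_sales = list(range(100, 124))   # 24 made-up monthly values
print(walk_forward_mse(monthly_sales, n_test=6, forecast=last_value))
```

Because each prediction only sees `series[:t]`, no future month can influence the score for month `t`.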
Ok, I get that. Then using the model, I would use it to predict future target sales
ah ok, I lost track
but again you can only predict 1 month in advance
because you need last month's target sales
and also this model is likely to have other issues
due to the fact that you're violating the iid assumption
I mean, it is what it is at the moment
I can use things like # of walmart stores, and # of different products sold at each location
actually wait. it might be okay w/ least squares actually
yeah, you know what? this should be fine
just make sure youre using the testing strategy i described
otherwise you will be "cheating" and using future data to predict past data
which inflates your accuracy
good luck
for a little bit olonger year
im cleaning lyrics scraped from genius
I wanna remove text like "verse 1" "intro" "chorus" and brackets/punctuations
fortunately I found this python script that seems to do the job well
but
well here's the original data
my code: https://hastebin.com/azuwexadop.py
data output after applying the cleaning function:
sorry, i cant help with that
scraping is against their TOS
and its against the rules for us to help with TOS violations
whose TOS?
!rule 5
5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.
genius
oh really?
https://genius.com/static/terms
Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods. Any use of the Service or the Genius Content other than as specifically authorized herein is strictly prohibited. As between you and Genius, the technology and software underlying the Service or distributed in connection therewith is the exclusive property of Genius, our affiliates and our partners (the "Software"). You agree not to copy, modify, create a derivative work of, reverse engineer, reverse assemble or otherwise attempt to discover any source code, sell, assign, sublicense, or otherwise transfer any right in the Software. Any rights not expressly granted herein are reserved by Genius.
sorry
aight thank you
I'm trying to figure out if having a dataframe with 60k columns (1 for each words in the dictionary) is ok ? @desert oar
It seems a little much
Why do you want that
I mean, I get why you might want that for convenience?
Can put sparse data into a data frame
It's the flattened sparse matrix to feed the algorithm
What model are you trying to use in this particular case
no ideas yet
Usually you need to convert from data frame to matrix anyway
gonna try a bunch
man tensorflow is black magic built on black magic
yes
i dont know why they dont just give me a damn api to construct a graph manually
instead of all this as_default stuff
at that point you might as well just manually play with numpy stuff
except not at all? the whole point of TF is that you're constructing a differentiable graph
and that gets sent back down to the C++ framework for processing
its just the python API is extremely confusing and the documentation is unclear
(to me)
my experience so far with TF is
there're tons of ways to do the same thing
so it's actually easy to write code
but it's hell to read other's because you have no idea what their workflow is
the documentation problem is compounded by how often the API / "best practice" changes
like even running the official model code gives you a ton of deprecation warnings
thats just the nature of huge code bases really
pytorch is great though, everybody get on the pytorch train
I have never actually done useful things with pytorch but what Ive seen looks good
im hoping things stabilize in TF after 2.0
it's very pythonic. about the only "surprising"/obscured thing is that gradients are stored in state, and sometimes the dataloaders hide crazy stuff from you
otherwise the code does about exactly what you think it does when you read it
nah its just gonna be like the internal APIs of linux which are never guaranteed to be stable and allowed to be subject of change every commit @desert oar
i hope not
every time i load a model from a checkpoint i feel like im doing something wrong
embrace keras
import pandas as pd
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']
plt.show()
if you want to enter code, @sterile remnant , it will be easier to read if you use code block formatting
!codeblock
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
thanks for your help bro @desert oar it really help
'''import pandas as pd
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']
plt.show()
'''
no problem. did you have a question you wanted to ask?
use ` not '
on an american keyboard it's on the same key as ~, not sure what keyboard you have
ok gotcha bro
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']
plt.show()
thanks
also i assume you mean to write
data.plot(x="di**k_size", y = "satisfaction", color = "fapability")
right now you have doubled ' and [ instead of (
great variable name btw
bro i wanna do some scikit stuff but i am finding a really hard time to do so help me out please
yeak i was focusing more on my variables instead of syntax
lol
hard to say what help you need... do you have a specific objective in mind?
yep, like I've got a project on supervised learning. I have seen its tutorial but it all went over my head and I want to learn it. do you have any suggestions for how I can wrap my head around that?
like some link or something else anyone
@sterile remnant what is your level of programming and math knowledge?
i'd start by maybe working through a beginner book or online course
@desert oar bro i am also noob to this one and i have been handed over with that project i was looking out for some stuff to get some knowledge about it .
its hard to help without more context
yep so it is but i have lookes into it
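from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Load the iris data (the start of the snippet was cut off; assumed from `iris.target`)
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print(knn.score(X_test, y_test))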
like this code gives u prediction about what dataset the given data belongs to
and like this bit and pieces i been handling my shit
thouh thanks @desert oar for ur consideration
we want to see the graph
i wanna know where this data came from tbh
i really regret not applying for a job at pornhub when i had the chance
i see you mean data hub
i told my mother about it and she was not happy with the idea. it was in montreal too
would have loved to have an excuse to move to montreal
data wrangle
i wondered about that
ive heard mixed things about "porn tech"
maybe they still have data science jobs
yeah technology magazines makes me stimulated
especially MIT Technology Review
somebody ever worked with word2vec?
no. i dont know why, but we figured out something important: some errors have more than one label, so the classification model can't work. I want to build a corpus and interview the experts about how the text vector predicts words, then use it as a sentence classifier later to create new keywords, which will become the new labels for the classification
my word2vec corpus with skipgram
tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=5000, random_state=2300)
salt whats the best way to predict a sentence with w2v?
what do you mean the classification model cant work?
the way you'd use a word embedding like w2v is you'd generate a word vector for each word in the sentence, then average them to get a vector for the whole sentence, then feed that into a classifier as your feature vector
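The averaging step described above, sketched with NumPy. The three-dimensional toy embeddings and the helper name `sentence_vector` are assumptions for illustration; in practice the vectors would come from a trained word2vec model.

```python
import numpy as np

# Toy 3-dimensional embeddings; a real model (e.g. trained word2vec) would supply these
embeddings = {
    "printer": np.array([1.0, 0.0, 0.0]),
    "paper":   np.array([0.0, 1.0, 0.0]),
    "jam":     np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence, embeddings, dim=3):
    """Average the vectors of known words; zeros if no word is known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(sentence_vector("Printer paper jam again", embeddings))
```

The resulting fixed-length vector is what would be passed to the classifier as the feature vector for that sentence.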
yes i want to build a feature vectors for the troubleshoot text classification and after that using it as a label
the errors are connected with too many of the same label. The model can't predict the right outcome, or has too many outcomes, which makes it really hard
have you done a cross tab of errors and labels
like with pd.crosstab(errors, labels) and then plotted it using plt.imshow
I did it with confusion matrix
i had a 130k data set and 51%. I used a decision tree to understand how it splits the data and why the model is misleading the prediction, to optimize the knowledge base
another problem was whether certain rows were NMAR (laziness) or just irrelevant missing values
hm
could use some assistance in sklearn if this is the applicable channel
Yes?
so I'm trying to do kfold cross validation with cross_val_score
and it was working just fine until today
and now when i run it I get like a hundred lines of traceback
and I have no idea what any of it means
"Fatal Python error: initfsencoding: unable to load the file system codec" is the first line
what does your data look like?
as in shape?
i found something about pyinstaller with sklearn
next to nothing on that page makes any kind of sense to me
Give me a second. Im reinstalling my ide just to see if that fixes something
now i literally just can't install packages
it's like every time i touch something relatively new that i need to install the entire program shatters
what about your interpreter?
yea this just didn't do anything
so here is my code
i just did
now i just can't send my code
just fucking kill me
kFoldScores = scoreModel(xTrain = XTrain, yTrain = yTrain)
this is the line that messes everything up^
This is the function it's calling:
def scoreModel(xTrain, yTrain):
    checkingNetwork = KerasRegressor(build_fn = buildNetwork, batch_size = 10, epochs = 100)
    accuracies = cross_val_score(estimator = checkingNetwork, X = xTrain, y = yTrain, cv = 5, n_jobs = -1)
    #print("Standardized: %.2f (%.2f) MSE" % (results.mean(), results.std()))
    return accuracies
Data I'm using
and it's like i get different errors everytime
if i take that function out and just do k fold manually everything is fine
but i don't understand what's wrong with that function
okay I think I've fixed it. it works if i just specify one cpu to use instead of all and i don't get it but holy crap that was a headache
good to know
fastTEXT
@void anvil yeah ive heard ray and modin are not anywhere near ready for production
its kind of baffling that companies arent willing to pick up and put money towards these kinds of projects
Hey all, I'm kind of overthinking myself into a hole here... I have for example hourly (2.8e-4 Hz) data for a week, and I want to filter out sub-inertial frequencies (anything below 5e-5 Hz). There's a lot of stuff about IIR and FIR, windows, filter order, how to apply the filter, but this all seems like overkill? I'm just looking for something fairly simple!
@grizzled folio apply a scipy Butterworth filter with that cutoff frequency
That's as simple as it gets
@lean ledge cool, that's a handy pointer! I still need to provide the order of the filter though, I don't know how to choose that (unless I use buttord?)
As a bit of nuance, you can't remove the filtered frequencies completely. A higher order attenuates them more. Look at the Bode plot of the filter you construct to see the gain at different frequencies
Gain is in dB which is a log scale
Aha, that's handy. So why wouldn't you just crank the order way up? Increased computation? Artifacts?
Instability/Artifacts. Computation scales linearly too. Importantly, people working on signal processing are generally also often doing it on hardware. Higher order = more components, more cost, more board space etc
You can increase the order to an extent. After that you should come up with different strategies if you need sharper rolloff
Eg you can trade the smooth region of a Butterworth filter for one with more ripples to get a sharper rolloff (eg with elliptic or Chebyshev filters)
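A rough sketch of the Butterworth suggestion for the hourly case, using `scipy.signal`. The order of 4, the zero-phase `sosfiltfilt` call and the synthetic test signal are assumptions picked for illustration, not something settled in the discussion.

```python
import numpy as np
from scipy import signal

fs = 1.0 / 3600.0        # hourly sampling, about 2.8e-4 Hz
cutoff = 5e-5            # remove everything below this frequency (Hz)
t = np.arange(7 * 24) * 3600.0  # one week of hourly timestamps, in seconds

# Synthetic test signal: a slow 2-day oscillation plus a fast 3-hour one.
slow = 2.0 * np.sin(2 * np.pi * t / (48 * 3600.0))
fast = np.sin(2 * np.pi * t / (3 * 3600.0))
x = slow + fast

# 4th-order high-pass Butterworth, as second-order sections for numerical stability.
sos = signal.butter(4, cutoff, btype="highpass", fs=fs, output="sos")

# Forward-backward filtering gives zero phase shift (and doubles the effective order).
filtered = signal.sosfiltfilt(sos, x)

# Away from the edges the result should be close to the fast component alone.
print(np.max(np.abs(filtered[24:-24] - fast[24:-24])))
```

Raising the first argument of `butter` sharpens the rolloff, and `signal.sosfreqz(sos, fs=fs)` gives the frequency response if you want to check the gain before filtering millions of series.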
Cool, this is at least a starting point. Computational cost may be a factor since I'm filtering a few tens of millions of timeseries. I'll see how things look with Butterworth and if that's working for me. Thanks!
I appreciate the real practical condensation of different types into their effects, I was really struggling to find any literature that didn't immediately get super technical about it
Yeah signal processing is a rough subject to learn for someone who didn't learn it as part of formal electrical engineering education
I got some of it in applied math courses, but never got the chance to apply it
There's a lot of nuance to signal processing so it's hard to simplify stuff down and ignore some very technical aspects
I don't think maths generally goes over signal processing
Except maybe at a grad level
Well, I think it was that course! It was very much on the applied/computational side of things
This is something you'd learn in electrical engineering. Would be very surprised if other people are learning it much
I definitely remember talking about IIR systems, and probably designing them... So I could probably understand the nuance if I wanted to get into it, but for the moment I just need something high-level that works and I can build upon
Huh
Aha, I think it was something like "Scientific and Industrial Modelling"
Anyway, that was quite a while ago
@void anvil
https://github.com/danaugrs/huskarl
https://github.com/heronsystems/adeptRL
https://github.com/tensorflow/agents
there are certainly some
maturity is eh
arguable
i think its more about the distributed approach for adept
why is the name schulman so familar to me
if you want something in oss contribute it lol
@desert oar it's a list of objects, I have to get an item from each object.. the max number of columns can be 6
@void anvil for future reference i know 0 about RL
@lapis sequoia what was that in reference to again? ping me in the AM
great, a tensorflow version from less than a year ago breaks on 3.7
oops.. sorry.. @desert oar I have this column in my dataframe, each row has a list..
the list is a bunch of objects I can pass through a function... the number of items in that list can be max of 6..
or the list can be empty.. I want to split this column into multiple columns based on this
thanks.. let me try:)
btw if you need any help on RL you can ask.. but I can't really point you to an implementation
I wonder if I could do a freelance gig where I replicate papers or port things between TF and pytorch
Actually sounds like possible freelancing
arr = np.array([10,20,30,40,50])
for i in array:
print(arr[i])```
help me this code i wanna print this array out?
but its flashisng an error
@sterile remnant for i in array... you don't have any variables called that
And secondly, each i will be an element of the array, so you'll be trying to index a 5-element array with 10, 20, etc., which aren't valid
so hw shd i do it?
@void anvil , you can search for algos + papers + codes here:
https://paperswithcode.com/search?q=DQN
@sterile remnant , if your array is 1D, e.g. a vector, you can loop through it element by element. If your array is 2D or higher, looping happens over the first dimension by default; for a 2D array that means over rows.
I dont know your reason behind printing the array the way you do it above, but you can do:
import numpy as np
a = np.array([10,20,30,40,50])
for el in a:
    print(el)
# this will give you every element of the array because it is a vector
b = np.random.random((2, 5))
for row in b:
    print(row)
# this will give you every row of that array because it is 2D
Hey, what are my options for removing non-words in my 'corpus'?
Is there a way to keep all the words that are somewhat similar to something from the english dictionary and remove words that aren't ?
Example : 'redhat' is not a word, but I still want to keep it, however 'asdjhasgdja' must go.
also, would removing all 1 occurence words be a bad idea ?
redhat should be considered as a company name by language models
What do you mean ?
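from nltk.corpus import wordnet  # needs the WordNet data: nltk.download('wordnet')
if not wordnet.synsets(word_to_test):
    print("not an English word")  # WordNet has no entry for it
else:
    print("English word")
I haven't tried it unfortunately, so I can't tell how well it works :)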
Well, I'll give it a shot and keep you updated
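On the question of dropping words that occur only once: a minimal sketch with a frequency cutoff, using only the standard library. The toy corpus and the `min_count` value are assumptions for illustration.

```python
from collections import Counter

# Toy corpus; 'asdjhasgdja' is keyboard noise, 'redhat' is a real but non-dictionary term.
docs = [
    "install redhat on the server",
    "redhat server update failed asdjhasgdja",
    "update the server",
]

tokens = [doc.split() for doc in docs]
counts = Counter(word for doc in tokens for word in doc)

min_count = 2  # assumption: drop words seen fewer than 2 times in the whole corpus
filtered = [[w for w in doc if counts[w] >= min_count] for doc in tokens]
print(filtered)
```

The cutoff keeps 'redhat' because it recurs and drops the noise token, but it also drops legitimate rare words such as 'install' here, so whether that is acceptable depends on the corpus.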
Is room free?
might be a bit far fetched question.
Is there a way to store the loaded matlab file data in variables?
I already know about scipy.io.loadmat(), but I mean after it's been loaded
@granite sierra , according to docs scipy.io.loadmat():
Returns
mat_dictdict
dictionary with variable names as keys, and loaded matrices as values.
you can just assign a variable by indexing that dictionary
or blah["key"]
you will probably have to transform matrices to numpy arrays if it is not done automatically
otherwise, you are safe
converting it to an array is ltierally just
a = scipy.io.loadmat('test.mat')
np.array(a)
```?
no no
a is now a dictionary with variables as keys and matrices as values
lets say you have a variable foo in that matlab file
sure
you can access it now a["foo"] which will return you the matrix
np.asanyarray(a["foo"]) i think
or np.array
depending on use case
I'll be honest
wait
this is how it's returning it
{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
dtype=[('test', 'O'), ('temp_data', 'O')])}
a = scipy.io.loadmat('test.mat')
print(a)
yes, you can now access your matrices. a["testfile"] will return you array([[(array([[ 5, 10, 15, 20]])
I have never used matlab, so, i may be mistaken
ok let me test haha
well it did exactly that
how do I store the variables now?
this is what it returned
[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]
also why does it return it with double [[]]
like how do I access the first list, the 5, 10, 15, 20
you can now do c = np.asanyarray(a["foo"])
something like this
matlab is messy
also I think that code is outdated, numpy has no strip function now
ok so I did this
b = np.squeeze(np.asarray(a['testfile']))
it returned this
(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))
which is obviously just a tuple with a tuple of lists inside, is there anyway to access the first nested tuple?
tuples aren't mutable, are they, so there is no way to access by 'index'
anybody?
Hi, I have a neural network saved in ONNX format (made with matlab), do you know how to run it in python? Keras doesn't seem to support it and I failed to install caffe2 on my windows 7...
@hollow latch theres this... https://github.com/onnx/onnx-tensorflow
Ive actually been wondering how to use onnx and tf myself. So good thing i found this
@granite sierra have you tried np.asanyarray?
It works with nested structures better
let me try
Also you can reduce the tuple until you get only vectors
You can even use numpys own reduce
hmm let me see
but what ufunc would I do to the vector?
nah I dont think that works, unless I'm doing it wrong\
@desert oar Thank you, it seems to work!
import scipy.io as sci
import numpy as np
a = sci.loadmat('test.mat')
print(a)
b = a['testfile']
print(b)
c = np.squeeze(np.array(b))
d = np.reduce(c)
print(d)
Maybe show the outputs too
oh sorr
runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')
{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
dtype=[('test', 'O'), ('temp_data', 'O')])}
[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]
Traceback (most recent call last):
File "<ipython-input-530-607ea6f6830d>", line 1, in <module>
runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')
File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/danilov_d/.spyder-py3/understandfile.py", line 21, in <module>
d = np.reduce(c)
AttributeError: module 'numpy' has no attribute 'reduce'
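The printed dtype `[('test', 'O'), ('temp_data', 'O')]` suggests `a['testfile']` is a NumPy structured array (a MATLAB struct), so its parts can be read by field name rather than by reducing it. A sketch under that assumption, with the array rebuilt by hand here so it runs without the .mat file:

```python
import numpy as np

# Stand-in for a['testfile'] as printed above: a (1, 1) structured array
# whose two object fields each hold a 2-D array.
testfile = np.empty((1, 1), dtype=[("test", "O"), ("temp_data", "O")])
testfile["test"][0, 0] = np.array([[5, 10, 15, 20]])
testfile["temp_data"][0, 0] = np.array([[1, 2, 3, 4]])

# Fields are accessed by name; [0, 0] unwraps the (1, 1) struct.
test = testfile["test"][0, 0]
temp_data = testfile["temp_data"][0, 0]

# MATLAB stores everything as at least 2-D, hence the double brackets; ravel() flattens.
print(test.ravel())
print(temp_data.ravel())
```

With the real file, `scipy.io.loadmat('test.mat', squeeze_me=True)` should remove most of the singleton dimensions at load time. Also, plain tuples can be read by index (`t[0]`); being immutable only prevents assigning into them.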
how does the train_test_split function work in cases of multiple linear regression
not different from any other model
it simply splits your data into test and training data so you can see how your model performs on data it has never seen during training
so you can for example diagnose overfitting etc
x = data.ENGINESIZE , data.CYLINDERS , data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS
dfx = pd.DataFrame(x)
dfy = pd.DataFrame(y)
trainx , testx ,trainy, testy = train_test_split(dfx , dfy, test_size=0.2 , random_state=8)
when im using more than 1 xvalues , can i just do this?
@
@earnest prawn
I mean I am not 100 percent sure but I dont see any reason why it shouldnt. You could just try I guess ¯\_(ツ)_/¯
won't work
whats the errror?
@silk forge x is a tuple
why are you trying to take columns out of a dataframe then make a new dataframe out of it?
i assume data is a dataframe right? if so
dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']] # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS'] # series, equivalent to 1-d array
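To make the tuple point concrete, a short sketch comparing the two ways of picking columns. The rows are invented; only the column names come from the question.

```python
import pandas as pd

# Made-up rows using the column names from the question.
data = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 3.5, 1.5],
    "CYLINDERS": [4, 4, 6, 4],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 11.1, 5.9],
    "CO2EMISSIONS": [196, 221, 255, 136],
})

# A comma-separated list of attributes builds a tuple of three Series, not a table.
x = data.ENGINESIZE, data.CYLINDERS, data.FUELCONSUMPTION_COMB
print(type(x), len(x))

# Double brackets select several columns and keep one row per sample.
dfx = data[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
dfy = data["CO2EMISSIONS"]
print(dfx.shape, dfy.shape)
```

`dfx` and `dfy` have the same number of rows, which is what `train_test_split` needs in order to split them together.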
yuh
oh imma try your thing
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lin
from sklearn.metrics import r2_score , mean_absolute_error , mean_squared_error
from sklearn.model_selection import train_test_split
data = pd.read_csv(filepath_or_buffer="C:/Users/admin/Downloads/FuelConsumptionCo2.csv")
# so the x values are gonna be ENGINESIZE , CYLINDERS AND FUELCONSUMPTION_COMB
x = data.iloc[: , 4:6]
x["FUELCONSUM"] = data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS
dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']] # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS'] # series, equivalent to 1-d array
trainx , testx ,trainy, testy = train_test_split(dfx,dfy, test_size=0.2 , random_state=8)
regr = lin.LinearRegression()
regr.fit(trainx,trainy)
slope = regr.coef_
inter = regr.intercept_
plt.scatter(trainx,trainy,color = "red")
plt.plot(trainx,trainx*slope + inter, color = "blue")
plt.show()
@silk forge why are you using the iloc there?
you can delete that line entirely unless you really need to reduce memory usage by dropping columns
how to get a model info about the parameter of the model in gensim?
model = Word2Vec.load('fastTEXT_big_sg_w30_min12_iter15.model')
print("model ready")```
something like this
`model.info()`
Hi, I'm applying to an entry level data analyst position, and I submitted a python technical challenge last week and got invited back for an interview including some code review (and SQL whiteboarding too). Would anyone be able to look through my notebook and offer advice prior to my code review?
Deleted
(posted in a help and the career challenges as well, this is my last time, apologies)
link doesn't work
apologies, wrong link
the outlined task: was given a couple of separated data files, combine them into one spreadsheet and also point out any outliers I find, within a 30-45 min period
combined spreadsheet was supposed to be formatted to compare a brand's product across 4 different regions
you can use some scatter graphs for the outliers maybe?
@round jay
maybe for outliers you can use scipy z-score
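import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
# keeps the rows where every column has |z| < 3, i.e. the data with the outliers removed
cleaned = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]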
something like this
here more info:
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
Thank you for the suggestions! will definitely consider and go over the medium post
hey, has anyone used statsmodel to do stepwise regression?
trying to follow this here; but it's complaining about my inputs
https://stackoverflow.com/questions/22341271/get-list-from-pandas-dataframe-column
my X is a dataframe with input variables, and my y is the list of target values
Has anybody used dataquest and reccomend it? I have 3 hours a day to spend to studying and as a beginner what's the best online course material?
Is this a good channel to ask about AI?
Just wanted to know if anybody was working/is working on AI
ok I read channel topic nevermind
btw the question is, what were you working at? just curious
@crimson trellis show your code
that's awesome @surreal nacelle
Thanks! First "practical" application of ml, really fun project tbh
how long did it take you
2 days
did you do any maths for it?
I used xgboost for the model
The part that required work was the preprocessing
so not much maths
next step is to code most of the basic algo from scratch tho
spamassassin
Good stuff
That's a great project idea by the way, do you think it is something you would recommend for other beginners
Absolutely
Learned a lot
actually it is one of the assignments from the Hands on machine learning with sklearn, tensorflow, and keras book.
This book is really great
Hey y'all!
Sorry if this isn't the right category!
I'm not sure if this is totally a "python" question, but I figured I'd ask since I'm coding it in Python. I'm using numpy and matplotlib, and in this graph, I want to calculate a value for "when the graph is shooting up". Is there a word for that? Is there a function I could use in either numpy or matplotlib (or scipy) that'll help me calculate it?
It's for a school assignment, and the professor is just expecting us to look at it and write down what it looks like, but that feels super gross
The time it happens, sorry
so you want to know when the increase "starts"?
"Happens" being a confusing word since it's over a time interval, but knowing how crazy mathmeticians are there's probably a definition for it :p
Oops, see above
so theres no exact definition for it
Aww
Math let me down ;~;
A bajillion
(I think technically like a thousand but I can generate more)
You could monitor the derivative with a threshold
ok thats perfect
( If that's what you were thinking of salt rock lamp? )
OOO Bast that's a cool idea!!!
they are evenly spaced points?
yep @zenith nova exactly
take the successive differences of points
They are mostly evenly spaced; some specific points will generate errors, but that's usually only one in a hundred
and when it goes above some number say "the increase has started"
ok. i would linearly interpolate the missing points
so you don't accidentally register a 2x increase
Oh man that sounds advanced
Changes np.loadtxt to np.genfromtxt
Oh I was kidding; doesn't genfromtxt do that automatically?
yeah
lol
Hmm
I know it does something like interpolation, I don't know what it's technically doing
O well, I can just use interp ^_^
(Making edits now)
afaik genfromtxt only fills with a fixed value
oh ew
so you can fill all the nulls with 0.0
Hmm
I want to mess with the interp more, but I'll leave that for later.
Since we're taking "the successive differences of points", that's exactly what the derivative is, so imma actually try that :3
Thanks guys! <3
well... it's what the derivative is with infinite points, so kinda. but yeah that should work
either you can manually set a threshold
or use a change point detection algorithm
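A sketch of the manual-threshold idea from above: interpolate the missing points, take successive differences, and report the first place the slope passes a threshold. The synthetic curve and the threshold value are assumptions; real data will need its own cutoff.

```python
import numpy as np

# Synthetic curve: gentle slope, then a steep rise starting around t = 6.
t = np.linspace(0.0, 10.0, 101)
y = np.where(t < 6.0, 0.1 * t, 0.6 + 5.0 * (t - 6.0))

# Pretend a few readings failed, then fill them by linear interpolation.
y_missing = y.copy()
y_missing[[20, 45]] = np.nan
ok = ~np.isnan(y_missing)
y_filled = np.interp(t, t[ok], y_missing[ok])

# Successive differences divided by the spacing approximate the derivative.
slope = np.diff(y_filled) / np.diff(t)

threshold = 1.0  # assumption: pick this by looking at typical slopes in your data
onset_index = int(np.argmax(slope > threshold))  # first interval above the threshold
onset_time = t[onset_index]
print(onset_time)
```

One caveat: `np.argmax` returns 0 when nothing exceeds the threshold, so it is worth checking `slope.max() > threshold` before trusting the result.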
Does anyone know how to prepare and load a local dataset for use in keras?
@lapis sequoia through pandas. If it is in a database, you can use a database client for loading and then pandas for transforming
Love ya @desert oar and @zenith nova ! ^_^
@desert oar wasn't able to use the walmart data to predict target sales, the differences were too big.
I'm trying to use VAR now just for target
I'm getting a little confused with all these tutorials though. They show me how to build a model, but not how to use it to predict future values
am I supposed to pass in fake X values for the future?
to predict Y
VAR has the same problem kinda
where you need to know current month in everything in order to predict next month
what you can do is "chain" predictions
like predict one period with X then use it to predict the next period with Y
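A toy sketch of chaining predictions, using a first-order linear model fitted with plain NumPy instead of a full VAR. The two series, the coefficient matrix and the missing intercept are all simplifying assumptions.

```python
import numpy as np

# Toy monthly history for two series that follow x[t+1] = A_true @ x[t] exactly.
A_true = np.array([[0.9, 0.1],
                   [0.2, 0.7]])
history = [np.array([100.0, 50.0])]
for _ in range(17):
    history.append(A_true @ history[-1])
history = np.array(history)  # shape (18, 2): 18 months, 2 variables

# Fit a first-order model by least squares: next month from the current month.
X_now, X_next = history[:-1], history[1:]
coef, *_ = np.linalg.lstsq(X_now, X_next, rcond=None)  # X_next ~ X_now @ coef

# Chain the predictions: each forecast becomes the input for the following step.
steps = 3
current = history[-1]
forecasts = []
for _ in range(steps):
    current = current @ coef
    forecasts.append(current)
forecasts = np.array(forecasts)
print(forecasts)
```

No future X values are passed in; the model only ever sees the last known month and then its own output. Errors compound with each step, so forecasts further out are less reliable.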
@supple ferry Should i import the data and labels separately?
i've tried using pandas read_csv function previously, but it didnt really like that i had 2 different data types
@lapis sequoia what does your data look like? you can exercise a lot of control over data types with pandas
Its basically sequences of 5000 time measurements
sometimes a little bit less than 5000
and i want to add a label to the series as a whole to classify it
what was your concern about data types
@desert oar when you say " you need to know everything" what do you mean?
I have entire data from 2018 to this June, but nothing for July, is it still possible to predict?
Well when i import it with data labels added it struggles to determine the data type, and gives me an error
i read some documentation, and it says i should specify the dtype
however if i specify it being float, it reacts to the label not being convertable to float
can you give us a sample of the data
so in each row you have 5000 time measurements, and a label at the end
Yes
is that right?
ok. so you dont need pandas for this because you can trust that the commas arent going to be messed up
pandas is good for mixed data types
"spreadsheet" type of stuff
just open the line and split it on commas
Yea the commas wont be messed up
The thing I'm struggeling with is passing it into Keras the correct way
import numpy as np
with open('example_labeled.csv') as f:
    rows = (line.strip().split(',') for line in f)  # one list of fields per line
    data = [(vals[-1], np.array(vals[:-1], dtype=np.float32)) for vals in rows]
that gives you a list of tuples, the 1st element being the label and the 2nd element being a numpy array of the numbers
i mean, yeah
lol
for pandas, what you would do is this
i think theyre constructing this data
look at the example
i am gathering the data myself yea
I can easily remove the label from the csv file
that said
data = pd.read_csv('example_labeled.csv', header=None)
y = data.iloc[:, -1]
x = data.iloc[:, :-1]
the first 5000 columns should already be float type
and the last column should already be 'O' type which is basically "string"
wait what the heck how do you disable reading headers
yeah its header=None
if some of the series are a bit shorter than 5000 measurements, will it still work?
no
thats why you were having issues
either pad the series when you're creating the data, or read it line by line as i started describing above
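A sketch that combines both options: read line by line, then pad to a common length so the result is a rectangular array that Keras can take. The inline sample text, the zero padding value and the label names are assumptions for illustration.

```python
import io
import numpy as np

# Stand-in for the CSV file: measurements followed by a label, rows of unequal length.
raw = io.StringIO(
    "0.10,0.12,0.11,normal\n"
    "0.30,0.28,congested\n"
    "0.09,0.10,0.11,0.12,normal\n"
)

labels, series = [], []
for line in raw:
    vals = line.strip().split(",")
    labels.append(vals[-1])                               # last field is the label
    series.append(np.array(vals[:-1], dtype=np.float32))  # the rest are numbers

max_len = max(len(s) for s in series)  # would be 5000 for the real data
X = np.zeros((len(series), max_len), dtype=np.float32)
for i, s in enumerate(series):
    X[i, :len(s)] = s  # zero-pad at the end; assumption: 0 is an acceptable filler

classes = sorted(set(labels))
y = np.array([classes.index(lab) for lab in labels])  # integer class ids
print(X.shape, y)
```

`X` and `y` can then go to `model.fit`. Keras also ships a `pad_sequences` utility that does the padding step, if you would rather not write it by hand.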
Ok, I think I have my work cut out for me now.
how are you creating the data?
I'm measuring delay over a network
I'm more of a network engineer, but my master's thesis requires me to touch on machine learning, which I've never really done before
Can I message you if I have a question in the future?
you can ping me here
@desert oar fastTEXT works really good on sentence/text classificaiton
yup, i love it
i showed the experts 7 different models with different parameters we decided one
they cant believe it lol
its really a odd feeling when they help you to create a new systeme with ai and maybe they will not be needed in future
wow
yeah it is
congrats on making it work
its hard.. you know that overall you are making the world more efficient. but its scary knowing that we as a society havent figured out how to handle people who lose their jobs due to automation
https://github.com/ageron/handson-ml2/blob/master/math_linear_algebra.ipynb
This is pretty good to learn/relearn linear algebra
Using python
and matplotlib
that's from the repository associated with the hands on with machine learning (...) book, it's full of valuable resources
and you have to fiddle with it less than VW
its become my baseline go-to
instead of liblinear
for text, that is
and less hassle than bert lol
we just started using bert here
weve had a big classification project running for almost a year now
1450 classes
very rare classes in some cases, < 5
lots of mislabelings
VW fell on its face
fasttext barely beat out liblinear
its a really messy project
I have a question which isn't really python related but more generally data-science. I can't seem to google my way to a solution through key words in different combinations as I don't know precisely how to articulate my need. I have a 2d data set which is really weirdly distributed - as in, a lot of the data is clumped in one area. Is there a method by which I can 'redistribute' that data and how can I go about searching for methods to do this?
@vital plume what do you mean redistribute? what would be the desired result?
Like
for a very dumb example
lets say we have data points that produce a completely diagonal relationship
(0, 0)(0.3, 0.5)(1, 1)
I only really care that the points themselves demonstrate that 1,1 is at the far right hand corner of the coordinate space and 0,0 is at the bottom left
the middle value 0.3, .0.5 however could be at 0.5, 0.5 and it would still express its position in relationship to both those points relatively
sort of? the distance to both points changed
the angled changed
its a totally different point imo, except for the fact that 0 < 0.3 < 1 and 0 < 0.5 < 1
so it depends on your meaning
eg you can identify a bounding box or bounding circle, and evenly space points within that bound
Is there not some unsupervised processing of data that you can do to make a distribution more normal?
sure, but you havent described an actual criterion until now
you want it to be more gaussian?
heck you can take the mean and variance of the same data, and randomly generate N new points from a gaussian distribution
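Two rough sketches of what "redistribute" could mean here, on made-up clumped data. The first follows the suggestion above (new Gaussian points with the same mean and covariance); the second is an added assumption, a per-axis rank transform that spreads the existing points evenly while keeping their order along each axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clumped 2-D data: most points sit near the origin, a few are spread out.
points = np.vstack([
    rng.normal(0.0, 0.05, size=(180, 2)),
    rng.uniform(0.0, 1.0, size=(20, 2)),
])

# Option 1: draw new points from a Gaussian with the same mean and covariance.
# This replaces the original points rather than moving them.
mean = points.mean(axis=0)
cov = np.cov(points, rowvar=False)  # 2x2 covariance of the original data
resampled = rng.multivariate_normal(mean, cov, size=500)

# Option 2: replace each coordinate by its rank, scaled to [0, 1].
# Ordering along each axis is preserved, distances are not.
ranks = points.argsort(axis=0).argsort(axis=0)
spread = ranks / (len(points) - 1)

print(resampled.shape, spread.min(), spread.max())
```

Which of these is appropriate depends on the criterion: the first keeps the overall location and scale, the second keeps only relative position.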
anybody know off-hand whether I can get a speedup by opening a compressed, chunked netCDF file in parallel? I vaguely recall benchmarking the uncompressed file and getting a huge speedup in single-thread reading...
@grizzled folio i dont know netcdf specifically, but "probably"
that would be my guess
looks like HDF5 is not thread safe, even for reads?!
oof
ive never used it. never had a need. i always used parquet for "big" tabular stuff and gzipped json for non-tabular structured
interesting, I'm working with climate/ocean data so netCDF is the way to go (though zarr is making its way in)
yeah im not familiar w/ the more complex data formats from the natural sciences
in social science everything is tabular or text
that'd make things much simpler!
whats the advantage of all these complex formats
i know for example GIS data it's just a really old format, so it's really messy
ok sometimes in social sciences we use GIS data too, but thats not so bad because there are a lot of established tools for it
netCDF isn't particularly complex, it's self-describing (so you can pick one up and have all the dimensions/attributes), and handles multiple dimensions, record dimensions (so you can write them as your model runs), things like that
I've never used GIS, but it definitely sounds messy
hmmmm, might need to find a way to efficiently compute pair-wise dot products of 1million 128d vectors
ya as it turns out, nonparametric methods are... compute heavy
yea this is the sort of problem that could get close to the 100% theoretical efficiency of a GPU lol
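For scale: the full 1M x 1M result has 10^12 entries, about 4 TB in float32, so it cannot be held at once. A sketch of one way around that, computing the products in row chunks and reducing each chunk immediately. The nearest-neighbour reduction and the small stand-in array are assumptions for illustration.

```python
import numpy as np

def pairwise_dot_chunks(vectors, chunk_size):
    """Yield (start, block) where block[i, j] = vectors[start + i] . vectors[j]."""
    for start in range(0, len(vectors), chunk_size):
        block = vectors[start:start + chunk_size] @ vectors.T  # one matrix multiply per chunk
        yield start, block

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 128))  # small stand-in for 1M x 128

# Example reduction: for each vector, the index of the other vector with the largest dot product.
best = np.empty(len(vectors), dtype=np.int64)
for start, block in pairwise_dot_chunks(vectors, chunk_size=256):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = -np.inf  # ignore each vector's product with itself
    best[start:start + block.shape[0]] = block.argmax(axis=1)
print(best[:5])
```

Each chunk is a single dense matrix multiply, which is the kind of operation a GPU array library can run close to peak throughput.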
Hello,may i ask,data science is part of ML right?
yes and no
@muted garden
btw
i dont understand feature scaling and data normalization
@muted garden - there are two areas. Data Analytics encompasses Data Science and Machine Learning. Inside Data Science there are traditional methods (logistic, linear, cluster, factor analysis etc) and Machine Learning
Here's a helpful infographic I use in work:
the most confusing is the difference between data mining and machine learning
bleh
ML is a research discipline, a problem domain, and a loose collection of techniques
data science is a job title and a career path
data science subsumed a number of jobs that previously had different names, eg. "quant", "statistician", "machine learning researcher", etc.
any time you're automating something in a way that requires "learning" (anything beyond hard-coded rules) and making inferences from it, imo you're doing machine learning
basically any kind of automated prediction task
Salt hit the nail on the head, 20 years ago a Machine Learning Engineer undoubtably would be a statistician or Quant
I hate that infographic
How come Raggy?
It's just so bad. Clustering, regression and time series isn't ML, because ML is just supervised, unsupervised, and RL, which are clearly different things. Apparently the major distinguishing factor between supervised and RL is that you're maximising reward instead of minimising cost. Somehow ML is so different from traditional methods that it uses 5 more languages or something
You clearly didn't read the infographic
Clustering, regression, factor etc are all under traditional not machine learning
Regression, clustering etc are also generally ML
How is clustering ever not ML
Because you can do a 1 line in Python without any ML packages to find clusters within observations
Same with Factor Analysis
...number of lines isn't an indicator of whether something is ML
No, ML is the indicator of a machine learning from previous optimisation attempts to improve the accuracy of a model
Cluster does not do that unless specifically indicated
and thus k-means is more appropriate than Cluster for machine learning
than traditional cluster*
Let's take powerBI for example
You can use m query language to get a cluster of data
that didn't use ML at all
but you're right as well because
what if you have a constant stream of data that it needs to cluster
K means isn't ML?
k-means is
it's a very common model for machine learning
K-means clustering is a type of unsupervised learning
Great, so I think we are on the same page
I'm just having a hard time grasping what you're saying
I'm saying that the infographic is accurate for what it is trying to portray
But you're also right in some regards, it is hard to draw a hard line between the two
but there are more appropriate complex models that would separate traditional data science and machine learning
Is it though? It's honestly just confusing for everyone
It can't even properly distinguish between RL and supervised learning
It's usually better explained with the video
It's just one of the million bad infographics made in the field
But let's face it, if you dont know the difference between reward based learning and supervised learning then you shouldn't really be looking at it in the first instance un-aided
"this resource for beginners is bad"
"If you're a beginner you shouldn't be looking at it anyway"
I think you misinterpreted what I said because I said it is better explained with the video
But the infographic, for all intents and purposes, is the best I have ever had for trying to help explain the differences to people
but the entire Data Analytics field is so vast of buzz words
that is has become subjective in what sits where
everyone must make their own mind up these days
But on another topic, Matplotlib vs plotly
The infographic shouldn't try to give info when there isn't a concrete answer
Problem is Raggy, I dont think anyone has a concrete anaswer yet
Clustering, regression, and time series are problems to tackle not methods on their own
answer* it just attempts to put some structure on a wildly unstructured field
You can't classify them as traditional
And then list a bunch of ML methods to do those tasks and say they're separate
You shouldn't misinterpret the difference between RL and supervised
I dont think that was the purpose because Cluster and K means are in both
You shouldn't make claims on the languages used when there is no standard or meaningful distinction
I think you're arguing for the sake of it
Left doesn't mention K means
No but it mentions cluster analysis
which is at traditional non learning method
k means is
That infographic isn't set is stone because it isn't explicitly mentioned
means that's it ti cannot be used in either regards
Let me rephrase that as that was terrible explanation
Because the infographic doesn't explicitly say what belongs to which, doesn't mean that it is the rule of law, it's just giving some methods to help people realise what the difference is
one doesn't learn from itself, here are some methods
one learns from itself, here are some methods
not an exhaustive list
I agree with you they can be used for both, absolutely
but someone has to make the attempt at giving a few examples for each
You're incomprehensible to me
I'm not saying it's not exhaustive enough
I'm not saying something can be used for both
I'm saying the infographic is horrible because it's putting a problem statement and a solution next to each other and pretending the problem statement is a traditional method and the solution is an ML method
And that's just one of the many misleading things about it
ITS NOT LISTING EXAMPLES AT ALL
Yeah it is
Regression is not a traditional technique
It's a problem statement
Traditional technique would be the normal equation
...I am afraid you can't read
Linear regression, logistic regression, cluster analysis and factor analysis are all traditional method
Sigh
A traditional method for doing linear regression is analytically calculating weights using the normal equation
The ML way is gradient descent over weights
Both are the same tasks
- linear regression
I pray for you if you get angry over the difference between Method and problem statements. I'm going to duck out of this conversation now because as with any debate, everyone gets ingrained in their original opinion anyway.
Jesus Christ
Let's not continue this
Why does this server have such a high concentration of people who keep insisting they're right when they are clueless
It is so frustrating
All capping at people is really going to help with that
Vent your frustrations somewhere else
What else do I do when someone isn't taking in what I'm saying over and over. Not like mods are willing to tell people to stop being wrong
Be an adult and move on
Whatever
You're not the arbiter of truth on the internet
When someone is wrong, they are wrong. I'm not the arbiter of truth but it's hopefully the policy of moderators to ensure the server is both polite and filled with intelligent discussion, not just polite and filled with crap.
You consider yelling at people intelligent discussion?
Perhaps come up with a way of discussing topics with people without being a child
I did try telling the same thing multiple times without yelling. All capsing is just another form of emphasis, not childish shouting.
Right, to switch the topic: has anyone used Azure Machine Learning Studio or any deep learning instances?
Hope it helps
Fwiw I don't like the infographic much either, but not for the same reasons
silly bois, the method is (X'X)^{-1} X'Y
easy stuff
next discussion: there is no such thing as unsupervised learning
It might not be a good term but it has a specific meaning and it definitely exists
So someone who is an expert in AI should be an expert in data science too, right? I am sorry if my question is silly, but I am really wondering
sort of?
i don't think you can be an expert with data science
it's too broad
i think you often need data science to do AI
and some tools from AI are useful in data science
Yeah, if you want to hit both Fof, I guess it's statistics
Has anyone used scikit learn crf?
scikit-crfsuite?
Hey, do any of you guys have experience with this: https://www.coursera.org/specializations/mathematics-machine-learning
Worth it?
can anyone here provide an accurate, yet basic method of explaining going from two images into a 2d depth map?
i've played around with it a lot using opencv throughout the past couple days, but i think i'm missing some fundamental concepts
(not sure where to put this question, so i'll drop it here)
I suggest you take a look at this https://en.wikipedia.org/wiki/Triangulation_(computer_vision)
In computer vision triangulation refers to the process of determining a point in 3D space given its projections onto two, or more, images. In order to solve this problem it is necessary to know the parameters of the camera projection function from 3D to 2D for the cameras in...
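To make the triangulation idea concrete, here is a rough sketch of linear (DLT) triangulation for a single point in plain NumPy. The intrinsics `K`, the 0.1 baseline, the test point, and the helper names `triangulate` and `project` are all hypothetical values chosen for illustration. A real depth map would also need camera calibration and a per-pixel matching step between the two images, which this does not cover.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations."""
    # Each observation (u, v) under projection matrix P gives two linear constraints
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point into pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical rectified stereo rig: shared intrinsics, second camera shifted 0.1 along x
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

point = np.array([0.2, -0.1, 2.0])            # assumed ground-truth 3D point
x1, x2 = project(P1, point), project(P2, point)
recovered = triangulate(P1, P2, x1, x2)
print(recovered)                              # the z component is the depth
```

For a rectified pair like this one, the same depth also follows from the shortcut depth = focal length * baseline / disparity, where the disparity is the horizontal pixel shift of the point between the two images. A depth map is that calculation repeated for every matched pixel.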
Thanks, i'll look at it.
