#data-science-and-ml
Terrible for certain groups of people but okay for others.
Some suggestions I have for dealing with certain groups of data points: you can either do feature engineering
ensemble the models
yeah good point
etc etc.
whats that
You might need to look into data science.
@heady hatch any helpful resources for that?
Hey, I'm learning pandas and numpy... Can somebody tell me, How can I divide a dataframe in series?
Could you clarify what you mean by dividing a dataframe into series?
Get specific information from a dataframe by labels
Like "dates" and get all the column
What are some good books to learn stats
@lapis sequoia I've heard people like this book. https://web.stanford.edu/~hastie/ElemStatLearn/
@steep olive So get certain columns?
ie
if there's a column named 'a', get 'a' as a column from the dataframe?
Yeah, but I've found what I need, thank you XD
hey guys
what's the difference between linear regression and multiple linear regression
can someone explain it simply
Most of the time what we (in ML) use is multiple linear regression.
When you have more than one independent variable, it becomes multiple linear regression (MLR).
Y = Model(X_1) Linear Regression
Y = Model(X_1, X_2, X_3) Multiple Linear Regression
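To make the two lines above concrete, here's a rough sketch using plain NumPy least squares. The data and coefficients are made up for illustration; the only difference between the two fits is how many feature columns go in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))  # three independent variables X_1, X_2, X_3
# Noise-free target built from made-up coefficients.
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + 3.0 * X[:, 2]


def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    design = np.column_stack([np.ones(len(features)), features])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


simple = fit_linear(X[:, :1], y)  # Y = Model(X_1): one slope plus intercept
multiple = fit_linear(X, y)       # Y = Model(X_1, X_2, X_3): three slopes plus intercept
```

The simple fit returns two numbers, the multiple fit returns four, and only the multiple fit can recover all the coefficients used to build `y`.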
I was looking at different ways of doing array filtering in Python, and came across something I find weird. Why is the second method of filtering the fastest?
@full narwhal Actually the difference is not a lot (at max 20ish %).
Anyway the difference is most likely due to how indexing works in the background for numpy arrays.
In the 1st and 3rd you need to calculate the flat index k=i*ncol+j from the row and column indexes for each cell.
But in the 2nd you are avoiding that computation. Therefore it is a lil faster.
I'm not sure if I'm accounting all the possible reasons but the above one is one of them.
@lapis sequoia The difference isn't a lot, but the order is consistent across multiple array sizes. Naively, though, doesn't the second method have the most amount of allocations?
There's one for the data[1] >= 0.75, then one for np.where(), then one for data[0][index_list]
and i feel like there should be a way to combine what np.where is doing in the other two methods without that extra allocation
am i missing something?
https://stackoverflow.com/questions/55123613/why-numpy-where-is-much-faster-than-alternatives
If indexing is not a problem then its C implementation which is the reason of speed.
that code is comparing python performance to numpy perf. i dont really see what it has to do with this
%%timeit
index_list = [data[1] >= 0.75]
%%timeit
index_list = np.where(data[1] >= 0.75)
try to break the code into more cells and time each one. Then see whether it is due to numpy's C optimisation or not.
@full narwhal
As you can see, np.where is slower, but its output is in index form, whereas the simple conditional gives True and False values, which leads to different sizes.
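A small sketch of the two indexer formats being compared, assuming a 2 x N array like the one in the question (the array itself is made up here). Both routes select the same elements; they differ only in what the intermediate indexer looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((2, 10_000))   # row 0: values, row 1: the key we filter on

mask = data[1] >= 0.75           # boolean array, same length as data[1]
indices = np.where(mask)[0]      # integer positions where the mask is True

by_mask = data[0][mask]          # boolean-mask indexing
by_index = data[0][indices]      # integer-array indexing
```

The mask always has N entries, while `indices` only has as many entries as there are matches, which is the size difference mentioned above.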
Yes, but that computation has to happen either way
I would argue np.where has to do the extra step of gathering the indices
The way i see it, why doesn't the simple solution do what np.where does, except rather than gathering the indices, it gathers the associated data[0][i] (which should be an O(1) operation)
If you want to go indepth on why it is happening like that and why is there a difference then I can only suggest to look under the hood.
hm.
this is an interesting problem
@full narwhal I don't have an answer, only a guess
and my guess is that in the np.where case, it's faster because the size of the result is known
so there is only ever one allocation
I'm not sure if there's a way to track allocations, but if there is that might help?
@velvet thorn The size of the result is known because np.where had to figure out the size for its return array
and you still have to allocate another array for the result, no?
it's not overwriting the index array
@full narwhal yes
but what I'm saying is
the difference is in the column-level indexer for the original array, right?
whether it's a boolean mask or an array of indices
and I'm saying that that part is faster with the latter because the length of the result is known in that case
but not in the former case
since you have to traverse the entire boolean mask to know the length of the result
so presumably there's a number of reallocations, which lead to the slightly higher time
Hello there! Im new at data science and i want to start but i dont know from where... Can someone recommend me a course, Tutorial or any resource? Thanks btw
non-conclusive support:
if you reduce the size of the original array by a lot, the former is actually faster
which, if my supposition were valid, would make sense, because there'd be fewer reallocations
@velvet thorn what i'm saying is np.where doesn't know what size the output array is to begin with, either
right?
@full narwhal yes, it doesn't
but it calculates it
that's my point
what i'm asking is why couldn't we skip the intermediate step of collecting the indices?
@full narwhal because it's not necessarily faster, perhaps
@full narwhal there's probably some sort of tradeoff.
so i just tested it on a 2x1000 array and you seem to be right, but at 2x10000 it seems to flip the other way around (and i wouldnt really consider 10000 to be large)
but i still feel like there's something missing here; i just don't see how the np.where method can be faster for larger arrays when it has to do an extra step
and @daring crag https://jakevdp.github.io/PythonDataScienceHandbook/ is a good start
@full narwhal like I said, presumably the different indexing format cuts down on reallocations when applied to the original array
but without looking at the source it'd be hard to tell I guess
Hello everyone. What do you think is the appropriate mixture of data science skills and domain knowledge?
Filter the subset.. does anyone understand the question??
I'll be that guy.. if someone is knowledgeable with numpy masked array, I posted a question with a simplified example in #help-pancakes
my code here https://paste.pythondiscord.com/ezojitenom.py
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
    resp = resource(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
    resp = meth(*args, **kwargs)
  File "E:\demo3\findDocumentType1.py", line 126, in post
    self.resize_im(image_data)
  File "E:\demo3\findDocumentType1.py", line 209, in resize_im
    img = preprocessing(img)
  File "E:\demo3\findDocumentType1.py", line 194, in preprocessing
    img = img//255
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
in real jobs, how common is it to code the visualization with Python matplotlib as opposed to tools like Tableau?
Ime it depends on the team and the tools they use. If they're already doing all the preprocessing and analysis in Python, it's generally simpler to plot directly from there with mpl/pyplot
Same with r and ggplot etc
just used matplotlib for realtime visualization recently, it happened to be really quick to get going
short answer: matplotlib if engineers/scientists are looking at it, tableau if managers and up are looking at it :>
Is this room also for Machine Learning?
yes
I am starting to learn ML and I would be asking many questions about that
I have the book Python Machine Learning 3rd Edition by Sebastian Raschka
Does anyone have good references for signal processing?
model = load_model(pathlib.Path(r'E://demo3\\united_kingdom_50.h5')) this is not working
@daring crag What exactly do you find interesting?
Can somebody help me with question
@jolly plank just apply a filter or condition to select the desired category.
@vague bear It'll depend on your job. If your job requires you to deliver a final visualization report, then maybe you don't want to use matplotlib. But when you are doing analysis as a subtask in a project, or you just need quick plots, then you may go for matplotlib.
Conclusion: If you are serving the final result to a non-technical group, or it's a presentation, then you may want a nice dashboard made in Tableau.
@spark nimbus
You can look into this course from Coursera.
https://www.coursera.org/learn/advanced-machine-learning-signal-processing
@lapis sequoia is that one mainly focused around machine learning or mainly signal processing, because I don't need the former
nvm, seems to mostly be machine learning and the applications of it in signal processing
yeah i thought you were looking for it with ML. This is the data science channel, tbh.
@spark nimbus You can try this. This looks pure signal Processing. https://www.coursera.org/specializations/digital-signal-processing
But it looks paid. I'm not able to find audit option in Coursera.
Yeah, I was about to say it requires a login :/
You can try free trial..
There was an audit option in coursera courses. You could watch free videos. But now it looks like they have removed that option.
The main issue in signal processing is that the basic concepts are pretty simple, but for anything slightly more complex you need to suddenly understand a couple dozen terms that you've likely never heard before, and I just keep getting lost in all this, especially since I have a hard time visualizing it in my head
To give an example:
Well I studied that in my university.
You can also try this: https://nptel.ac.in/courses/117/102/117102060/#
Increase the speed and watch. If you know some basics then this can help.
Also you can look for OCW MIT Lectures. They are pretty old but still make sense.
oh now that you mention it, I might still have the PDFs of the books in my uni's mega drive
I think they had a signal processing course
For some reason i'm getting a keyerror when trying to read a column from a dataframe in pandas, when I know the column name is correct
df.columns will give you columns.
print(df.columns) and see the column names
here it shows the dataframe and the column lists but it still says keyerror trade type
yeah i already did df.columns
you can see 'Trade Type' is in the dataframe and in df.columns but it still gives me a keyerror
check for whitespaces and \n
Try checking data in database
Like pasta told
U like pasta, pasta?
Or ur name is pasta?
what's weird is that when I try to call 'Trade Date' it works fine, but 'Trade Type' doesnt work
I think u have to check on data only
what do you mean
@lone osprey I keep my nicks related to food and fruits. And yes i like pasta.
Nice
Fortran
Yeah, this is a 1999 book alright
@austere swift check like pasta told
@austere swift did you check for whitespaces and \n ?
how would I check that?
Show us data once
df[df.columns[6]]
see where the Trade Type is in the columns array. And select it.
most likely it will be at index 6. if that works, then there is some whitespace in the name
@lapis sequoia this works for some reason but putting the string directly doesnt
maybe it has some weird whitespace thats not a normal space
but even when i copy it from the terminal it doesn't work
You can't see space when you print it.
Yeah I've had df columns do that, when there was a space included at the end of the name that isn't obvious
yeah, I guess i'll just use df[df.columns[6]] instead just as a workaround
you could also just rename that column?
well it's being scraped from a website, so it would probably just be easier to use that as a workaround
@austere swift you can clean those spaces by using string.strip()
df.columns = [col.strip() for col in df.columns]
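Here's a small reproduction of that situation, as a sketch only: the table and the trailing non-breaking space are invented to mimic what scraped headers often look like. Printing `repr()` of each column name is one way to make the hidden character visible.

```python
import pandas as pd

# Hypothetical scraped table: the second header ends with a non-breaking space.
df = pd.DataFrame({"Trade Date": ["2020-10-01"], "Trade Type\u00a0": ["BUY"]})

print([repr(c) for c in df.columns])  # repr() shows hidden characters such as '\xa0'

try:
    df["Trade Type"]
    found_before = True
except KeyError:
    found_before = False

# str.strip() with no argument also removes Unicode whitespace like '\xa0'.
df.columns = df.columns.str.strip()
found_after = "Trade Type" in df.columns
```

After the strip, `df["Trade Type"]` works with the plain string, so the positional `df[df.columns[6]]` workaround is no longer needed.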
Hey folks, does anyone know any good and robust ways to convert a pretty extensive MATLAB script to a Python script?
i got a question
@haughty nymph You'll have to write it down in python. Or you can see if there is some library or some repo where the required script is already written.
i got a series of 3d brainscans with their labels
how can i extract them with the correct labels
its in matlab file
what is the format of 3d brainscans ? images?
And where are the labels? In filename ?
Have the user input a list of columns for a table
Have the user input a data type for every column: int, float, string size 255
In a loop have the user input the values for each row
Ensure the string size doesn't exceed 255
Print the results when the user is finished
for the second sentence, how do you know when the user has already input every column include int , float , string size 255?
does anyone know how to implement naive baye's classification algorithm? I understood how it works but I'm new to Python language.
@lapis sequoia the labels are in an array
@lapis sequoia I'm not sure what is your problem.
Do you want to train a image classifier to classify the brainscans with correct labels ?
@lapis condor If you are looking for a basic naive Bayes classification implementation in Python, plenty are available on the internet.
You just have to create a table of probabilities (frequency/total) for each word by each class.
And if the variables are numbers (decimals) then it is a lil tricky.
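A minimal sketch of that frequency-table idea for word features, using only the standard library. The tiny spam/ham dataset is made up, and the add-one (Laplace) smoothing is an addition of mine so that unseen word/class pairs don't produce a zero probability.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: list of class names."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[cls][word] -> frequency
    vocab = set()
    for tokens, cls in zip(docs, labels):
        word_counts[cls].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log prior + summed log likelihoods."""
    total_docs = sum(class_counts.values())
    best_cls, best_score = None, -math.inf
    for cls, n_docs in class_counts.items():
        total_words = sum(word_counts[cls].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # frequency / total, with add-one smoothing
            score += math.log((word_counts[cls][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


docs = [["cheap", "pills", "buy"], ["meeting", "at", "noon"],
        ["buy", "cheap", "now"], ["lunch", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(docs, labels)
```

Calling `predict(["cheap", "buy"], *model)` picks the class whose word table makes those tokens most likely. Numeric features need a different likelihood (commonly a per-class Gaussian), which is the tricky part mentioned above.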
Actually, I'm looking for something that doesn't use iris. I got specific dataset and was asked not to import
@lapis sequoia If you could help with that please
@lapis sequoia the problem is that I have never used matlab files
And I need a way to extract several scans and their corresponding labels to train a model
However, I'm not sure how to do so
@lapis sequoia
can you tell me the extension of the file ?
in which the brainscan image is stored
Could anyone help me with python numpy structure array?
Yup
Have the user input a list of columns for a table
Have the user input a data type for every column: int, float, string size 255
In a loop have the user input the values for each row
Ensure the string size doesn't exceed 255
Print the results when the user is finished
You may use Numpy or Pandas to do this
I choose to use numpy
import numpy as np
a = int(input("Size of array:"))
my_array = []
for i in range(a):
    my_array.append(input("Values:"))
my_array = np.array(my_array)
here's what i have been doing
how do i know when user has already input a data type of every colum: int , float , string 255
@lone osprey
@lapis sequoia well its a .mat file
i believe its several thousand 3d images
import scipy.io
X = scipy.io.loadmat('file.mat')
Hello guys, does someone here attempted using neuronal networks for building better trading bots?
@lapis sequoia
Images are nothing but arrays. Just load them as shown above. Check the shape of X.
Your X will have a shape like (n, h, w, ch).
Where n = number of images.
h = height of brainscan images
w = width of brainscan images
ch = channels (= 3 if it's a color image)
Apply CNN on them and you should be able to get a decent classifier.
Or Use transfer learning if the images are not enough.
its telling me that its a dictionary
@lapis sequoia Ok. Then you will just have to extract the value from the right key.
Most likely it's the last entry, ('Data', array[]).
how would i do that
Yeah its the key with 'Data'.
data = scipy.io.loadmat('file.mat')
X = data['data']
And you can see its a 4D array as I mentioned above. (n,h,w, ch)
check the shape of X.shape
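A hedged sketch of the whole round trip. Since the real file isn't available, it first writes a small stand-in .mat file; the key names ("data", "Target") and the tiny shapes are assumptions for illustration, and the real file's keys should be read from the dict itself.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Stand-in file with assumed key names and a small shape (5 scans of 8x8).
path = os.path.join(tempfile.mkdtemp(), "scans.mat")
scipy.io.savemat(path, {"data": np.zeros((5, 8, 8)), "Target": np.zeros((5, 1))})

mat = scipy.io.loadmat(path)                        # loadmat returns a plain dict
keys = [k for k in mat if not k.startswith("__")]   # skip metadata such as __header__

X = mat["data"]
y = mat["Target"].ravel()       # (n, 1) -> (n,)
X = X[..., np.newaxis]          # add a channel axis for greyscale: (n, h, w, 1)
```

Printing `keys` and each array's `.shape` is usually enough to work out which entry holds the images and which holds the labels.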
@lapis sequoia could this be the dimensions
i suspect its a greyscale image
Yes. its greyscale. But I'm not sure which one is the n = Number of brainscan files.
most likely n=89 and the brainscan images are 176*176.
so i got 89 brains scans
pixels is 176 by 176
channel is one because of brain scan
thank you so much
Now you should be able to create a classifier. Try using Transfer learning as you have only 89 images.
i got one more question
i basically have to diagnose a condition, which is a yes or no value
should the labels be one hot encoded?
so i went back and looked at the labels, which are stored under "Target"
and i got this result
the 89 corresponding to how many images i have
and 1 is binary
would that be a correct explanation
@narrow flume u want to know if input is int or float or string?
most of the libraries and packages takes care of this. Automatically.
Also, whether you should use OHE or binary (1/0) labels will depend on your loss function; with 1/0 labels you can use binary cross-entropy (also called log loss).
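A quick numeric check of that point, with made-up labels and predicted probabilities: for a two-class problem, binary cross-entropy on 1/0 labels gives the same value as categorical cross-entropy on the one-hot version, so the encoding mostly has to match the loss and the output layer you pick.

```python
import numpy as np

y = np.array([1, 0, 1, 1])              # binary labels, shape (n,)
p = np.array([0.9, 0.2, 0.6, 0.7])      # made-up predicted P(class = 1)

# Binary cross-entropy on 0/1 labels.
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same labels one-hot encoded, with two-column probabilities [P(0), P(1)].
y_onehot = np.eye(2)[y]
probs = np.column_stack([1 - p, p])
cce = -np.mean(np.sum(y_onehot * np.log(probs), axis=1))
```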
thank you
can anyone help me with structure array
@narrow flume First get the columns names as inputs from user.
Then get columns types as inputs.
After you have this let user input the values of each row.
@narrow flume
You have to do something like that.
I'm not completing the code. You can do the validation for strings against the max allowed length (255) and check for types.
You can do the above with numpy also.
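One possible shape for those steps, as a sketch and not a finished answer to the assignment. The function names, the prompts, and the `ask` parameter (which defaults to `input` and exists only so the flow can be driven without a keyboard) are all my own choices. Looping over the column list is what guarantees every column gets exactly one type.

```python
MAX_STR_LEN = 255
CASTERS = {"int": int, "float": float, "string": str}


def parse_value(raw, type_name):
    """Convert raw text to the column's type; raise ValueError if it does not fit."""
    value = CASTERS[type_name](raw)
    if type_name == "string" and len(value) > MAX_STR_LEN:
        raise ValueError(f"string longer than {MAX_STR_LEN} characters")
    return value


def read_table(ask=input):
    columns = [c.strip() for c in ask("Column names (comma separated): ").split(",")]
    types = {}
    for col in columns:  # one type per column, so we know when we're done
        type_name = ask(f"Type of {col} (int/float/string): ").strip()
        while type_name not in CASTERS:
            type_name = ask("Please enter int, float or string: ").strip()
        types[col] = type_name
    rows = []
    while ask("Add a row? (y/n): ").strip().lower() == "y":
        row = {}
        for col in columns:
            while True:  # re-ask until the value fits the column type
                try:
                    row[col] = parse_value(ask(f"{col}: "), types[col])
                    break
                except ValueError as err:
                    print(f"Invalid value: {err}")
        rows.append(row)
    return columns, types, rows

# Interactive use: columns, types, rows = read_table()
```

The resulting `rows` list can be handed to `pandas.DataFrame(rows)` or converted to a NumPy structured array for printing.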
oh so we have to ask user for the data type of their input every time?
what is the size of string 255 ? @lapis sequoia
@narrow flume No we have to ask the type of column. And make user to enter that type.
@narrow flume size and len are two different things in Python.
Check the question to see whether you have to get the size or the len.
And you have to check whether the user is entering it correctly or not. If not, then maybe discard that input or ask the user to fill in that row again.
Check your question on what to do.
If its not mentioned then you can decide on your own.
then i will make a length function to check
i would like to find the 4x4 transformation matrix that best fits one 3d set of points to another 3d set of points
can i do that with scikit learn and what functions should i start looking at to accomplish that?
i think i could use their stochastic gradient descent module, but is there a better way that i just dont know about?
least squares
you're basically trying to solve a least squares problem XA = Y
and there's a solution to that which is just the pseudoinverse
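Roughly what that looks like in NumPy, as a sketch with made-up points and a made-up ground-truth transform. The 3D points get a trailing 1 appended (homogeneous coordinates) so that translation fits into the 4x4 matrix, and then XA = Y is solved in the least squares sense.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ground truth: rotation about z plus a translation, as a 4x4 matrix.
theta = np.pi / 6
T_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0,  1.0],
    [np.sin(theta),  np.cos(theta), 0.0, -2.0],
    [0.0,            0.0,           1.0,  0.5],
    [0.0,            0.0,           0.0,  1.0],
])

src = rng.normal(size=(50, 3))
src_h = np.column_stack([src, np.ones(len(src))])  # homogeneous coords, shape (n, 4)
dst_h = src_h @ T_true.T                           # row-vector form: X A = Y with A = T^T

# Least squares solution of X A = Y.
A, residuals, rank, _ = np.linalg.lstsq(src_h, dst_h, rcond=None)
T_est = A.T

# Same solution written with the pseudoinverse.
T_pinv = (np.linalg.pinv(src_h) @ dst_h).T
```

Note this fits an unconstrained matrix, so with noisy data the result can include scaling or shear. If the transform must be a pure rotation plus translation, a constrained method such as the Kabsch algorithm is the usual alternative.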
cool, so it looks like the least squares in scikit fits a line to a set of points, is there an easy way to get it to learn a matrix? sorry i'm very new to DS im basically a full stack dev who got thrown onto some ds projects hahaha
hi, if i got r2 score on train data 0.96 and test data 0.90 is it still count as overfitting?
and if so, how i to handle it? should i change the max depth and gamma? (i'm using hyper parameter tuning xg boost)
How do you access the X_train, y_train, X_test, y_test after doing a KFold like this: cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)?
I want to now fit the data like this:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train,y_train)
y_pred = lda.predict(X_test)
```
and get the mean scores, mean ROC, etc.
what does tf.compat.v1.get_default_graph means? Like what is computational graph?
This I know: using the train_test_split function, you can get the splits like this:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
But how to do the same for KFold?
@bold olive
Check the documentation here. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
You will get K different train test split. And you can train and test on each of them then aggregate your results.
So basically fit the classifier in the for loop:
```python
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = lda.fit(X_train,y_train)
```
And then calculate mean scores and AUCs, correct?
hey . guys
can someone explain what pandas.to_datetime means?
I've been seeing it pop up in a lot of Kaggle notebooks and I don't understand what it does
I've looked at the pandas doc too
yeah all i would do is share the documentation with you to read and go over carefully : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
I think I won't figure it out until I do it in a project
or download a dataset and use datetime on it
Hey guys quick question on TFRecord format.
When is it worthwhile to use?
if you understand the basics of python (basic functions,if statements,loops...),how hard is it to grasp ml with scikit-learn?
I need help with adaline
I dont understand it
so how does it actually work?
and I dont really get weight and cost in ML
@bold olive Yes that is the idea. Just take the mean of all the metrics you want to have.
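For reference, a fuller sketch of that loop with the per-fold metrics collected and averaged. The data is synthetic (from `make_classification`) purely so the snippet runs on its own; swap in your own X and y.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; replace with your own DataFrame X and Series y.
X_arr, y_arr = make_classification(n_samples=300, n_features=8, random_state=42)
X, y = pd.DataFrame(X_arr), pd.Series(y_arr)

cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
accuracies, aucs = [], []

for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    lda = LinearDiscriminantAnalysis()  # fresh model for every fold
    lda.fit(X_train, y_train)

    accuracies.append(accuracy_score(y_test, lda.predict(X_test)))
    aucs.append(roc_auc_score(y_test, lda.predict_proba(X_test)[:, 1]))

mean_accuracy, mean_auc = np.mean(accuracies), np.mean(aucs)
```

If only the aggregated scores are needed, `sklearn.model_selection.cross_validate` with a list of scorers does the same bookkeeping in one call.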
It worked btw, @lapis sequoia. Managed to get the confusion matrix, mean accuracy and ROC curves of all folds with mean AUC.
@hollow sentinel it just converts a value or number of values to the datetime type
what don't you get about it?
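A tiny example of what the conversion buys you, using made-up strings. Before the call the column is just text; afterwards it is a datetime column, so date arithmetic and the `.dt` accessor work.

```python
import pandas as pd

raw = pd.Series(["2020-10-11", "2020-10-12", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
dates = pd.to_datetime(raw, errors="coerce")

years = dates.dt.year               # .dt only works on datetime columns
span = dates.max() - dates.min()    # date arithmetic yields a Timedelta
```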
Are there any examples of using SVMs for multilabel problems, with a SVM per label perhaps?
you can do multi label by combining any binary classification model, so yes, it's possible
not that it necessarily gives good results
@odd yoke Thanks - do you have any code examples or is there a pipeline in scikit learn for it?
I have a multiclass multilabel dataset and decision trees arent performing particularly well
Maybe I need more features?
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html#sklearn.multioutput.MultiOutputClassifier this looks to be what you want in terms of code, there can be many reasons why the result is not good
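A minimal sketch of wrapping an SVM with that class, on synthetic multilabel data so it runs standalone. The linear kernel and the dataset parameters are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Synthetic data: Y is an (n_samples, 4) array of 0/1 indicators, one column per label.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# One independent SVC is fitted per label column.
clf = MultiOutputClassifier(SVC(kernel="linear"))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
```

Because each label gets its own model, correlations between labels are ignored, which is one reason the results are not guaranteed to be good.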
Thanks for the link! That looks like exactly what I need
I was thinking I had to hardcode it all myself
Is it not helpful most of the time @odd yoke ?
Do you recommend multilabel classifiers over wrapping a SVM or something into a multioutput?
Hey lads, I have a pandas question. What's the difference between groupby and sort_values?
group by groups based on a condition
They return different stuff, and probably also have different algorithmic complexities (groupby requires one pass over the array, so O(n), sorting, as always, is O(n log n))
sort sorts the values based on a condition
for example:
[1, 2, 3, 4, 5, 6]
if u group this by > 4
then u get [5, 6]
if u sort it from large to small
u get
[6, 5, 4, 3, 2, 1]
@grave thunder
And if I do groupby("Column").max() how is that different from sorting?
that's a totally different operation
It will also return in order
groupby and sorting are not more than tangentially related
instead of being condescending about his question
the point of groupby is to split some data into groups and apply an operation to each group independently
you can just answer it, u know
@patent flame relax, I'm answering it
- creating an example
Ah I see, and sorting is used more like for printing data
And group by when I wanna operate on a group of data
>>> df
    fruit  price
0   apple    1.8
1   apple    1.3
2    pear    2.3
3  banana    3.7
4    pear    2.5
5   apple    1.5
6  banana    3.4
>>> df['price'].max()
3.7
>>> df.groupby('fruit')['price'].max()
fruit
apple     1.8
banana    3.7
pear      2.5
Name: price, dtype: float64
a quick example
if you want to answer the question "what is the price of the most expensive fruit", then you just do df['price'].max() (the max price)
but if you want to answer the question "for each fruit, what is the highest price", then you need to do a groupby.
Ohhhh
conceptually, this splits the DataFrame into one "mini-DF" for each value of fruit, so you have one mini-DF where fruit is apple, one for banana and so on
then you get the .max() of each
then you combine them together.
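The split / apply / combine steps can also be spelled out by hand, which may make the built-in call less magical. This sketch rebuilds the same fruit table and compares the manual version with `groupby(...).max()`.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "pear", "banana", "pear", "apple", "banana"],
    "price": [1.8, 1.3, 2.3, 3.7, 2.5, 1.5, 3.4],
})

# Split: one "mini-DF" per fruit. Apply: max of each. Combine: one Series.
manual = pd.Series({
    fruit: mini_df["price"].max()
    for fruit, mini_df in df.groupby("fruit")
})

built_in = df.groupby("fruit")["price"].max()
```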
does that make sense?
yup yup ^_^
okay
so sort just orders values
now, you see the DF above is not ordered in any way
Thanks lad! I've been trying to wrap my head around that for a while now
but I can impose an ordering:
>>> df.sort_values('price')
    fruit  price
1   apple    1.3
5   apple    1.5
0   apple    1.8
2    pear    2.3
4    pear    2.5
6  banana    3.4
3  banana    3.7
so now it's ordered by price in ascending order
@patent flame happy now?
I'm not the one you should please. Be a better person for yourself not for anyone else.
Chill, you both helped out. Thanks lads
@patent flame that's pretty ironic because you seem rather quick to jump down someone's throat
@grave thunder yw
i dont think u can have nudity in profile pic
let's calm down bois we're all friends here
@indigo steppe too early; don't do it.
you can follow a tutorial, and maybe something will kind of work, but way too soon you will run into problems that are above your level
work on your fundamentals (not just programming; mathematics too) for a while first.
Learn about sigmoid functions for example
ML has become a lot more accessible in recent years, but it's still a very complex subject.
@indigo steppe you can try a udemy course that I'm using: Python for Data Science and Machine Learning Bootcamp by Jose Portilla
not that all your doubts will be cleared but it's a good start
Hey, I've gone through that one too! Albeit it's very well done it's not for total beginners
you can also try this udemy course: 2020 complete python bootcamp from zero to hero in python by Jose Portilla, Kaggle mini courses, and Andrew Ng's course
I haven't finished the python for DS & ML bootcamp bc of college I'm still on linear regression
Yup, definitely can recommend Portilla. Among best 20 bucks I've spent.
You'll get there. ML is super fun and applicable almost everywhere. I have custom py ML programs for stocks
lmao dude I can't even figure out the right dataset to use for linear regression
I've been looking at Kaggle datasets
Kaggle is good imo
yeah I need tabular data otherwise it's lots of value_counts
But depends on what you wanna use ML for, I went through course mostly to automate my day trading and I have constantly updating market that comes in nicely sorted json or csv files
@hollow sentinel what do you mean "right"?
@velvet thorn like I wouldn't know how to do linear regression on a dataset of words
natural language processing
I also need a dataset that's between 50 and 100 KB otherwise seaborn takes too long to make a graph of it
that doesn't sound right...
Is there a better way to get the euclidean norm of a row using pandas and numpy?
```python
for index, row in sums.iteritems():
    df.iloc[index] = df.iloc[index].divide(row)
return df
```
preferrably like a one liner
well, not just get the norm, but also normalize the row by dividing by the norm.
huh.
so do you want the norm or not (stored separately)
or do you just want to normalise
I just want to normalize
use np.linalg.norm
I don't really care about the norm
does that work on a row by row basis?
in pandas, I have a column with strings like "foo_2020_10_11", how can I extract that date as a datetime?
I thought I tried that
df / np.linalg.norm(df, axis=1, keepdims=True)
@oblique socket
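A quick way to convince yourself the row-wise normalisation works, on a made-up frame. This sketch uses a slightly more explicit variant of the one-liner (`DataFrame.div` with `axis=0`) so the row alignment is spelled out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [3.0, 1.0, 0.0], "y": [4.0, 1.0, 2.0]})

row_norms = np.linalg.norm(df.to_numpy(), axis=1)  # one Euclidean norm per row
unit_rows = df.div(row_norms, axis=0)              # divide each row by its own norm
```

The first row (3, 4) has norm 5 and becomes (0.6, 0.8). A row of all zeros would divide by zero, so that case needs separate handling.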
@lapis sequoia use a regex
or rather, pd.to_datetime with a regex
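Something along these lines, as a sketch. The column name and the second sample value are made up, and the pattern assumes the date is always the trailing YYYY_MM_DD part of the string.

```python
import pandas as pd

df = pd.DataFrame({"name": ["foo_2020_10_11", "bar_2019_01_05"]})

# Pull out the trailing YYYY_MM_DD part, then parse it with an explicit format.
date_text = df["name"].str.extract(r"(\d{4}_\d{2}_\d{2})$", expand=False)
df["date"] = pd.to_datetime(date_text, format="%Y_%m_%d")
```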
thanks
Thank you!
yw!
I knew there was a simpler way
I figured, it just seemed like spaghetti
I have this function to normalize a dataset
```python
import numpy as np

# NOTE: the def line was cut off in the paste; the name and defaults here are assumed.
def normalize(df, method='minmax', a=0, b=1):
    # minmax feature scaling
    if method == 'minmax':
        # scaled value = (value - min) / (max - min)
        # should also return min and max values for future use
        # if new values are added to the dataset
        # normalize on scale [a, b] (default is [0, 1])
        normalized_df = a + (df - df.min())*(b - a)/(df.max() - df.min())
    elif method == 'mean_normalization':
        normalized_df = (df - df.mean()) / (df.max() - df.min())
    # z-score normalization (standardization)
    elif method == 'standardize':
        # make each feature have zero mean and unit variance
        # should also return mean and std for each attribute
        # for future use in case new values are added to dataset
        # This method is widely used for normalization in many machine
        # learning algorithms (e.g., support vector machines,
        # logistic regression, and artificial neural networks).
        normalized_df = (df - df.mean())/df.std()
    elif method == 'unit':
        # x' = x / ||x||
        # sums = df.apply(lambda x: np.sqrt(np.sum(x**2)),axis='columns')
        # for index, row in sums.iteritems():
        #     df.iloc[index] = df.iloc[index].divide(row)
        normalized_df = df / np.linalg.norm(df, axis=1, keepdims=True)
    return normalized_df
```
It's complete, for now
HELP
I'M BEING DROWNED IN COMMENTS
okay purely from a software engineering perspective
this is kind of dodgy IMO
I would write one function for each method of feature scaling
yeah, I guess I could do that
I probably should
right now I'm the only one using it
Alternatively, make those docstrings.
I'd say that's more important, since a comment requires actually going to your code and reading it.
good point
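One way the "one function per method, with docstrings" suggestion could look. The function names and the small lookup dict are my own choices for the sketch; the formulas are the same ones used in the pasted function.

```python
import numpy as np
import pandas as pd


def minmax_scale(df, a=0.0, b=1.0):
    """Rescale every column to the range [a, b]."""
    return a + (df - df.min()) * (b - a) / (df.max() - df.min())


def mean_normalize(df):
    """Centre every column on its mean and divide by its range."""
    return (df - df.mean()) / (df.max() - df.min())


def standardize(df):
    """Give every column zero mean and unit (sample) standard deviation."""
    return (df - df.mean()) / df.std()


def unit_rows(df):
    """Scale every row to unit Euclidean length: x' = x / ||x||."""
    return df.div(np.linalg.norm(df.to_numpy(), axis=1), axis=0)


# Optional lookup so callers can still pick a method by name.
SCALERS = {
    "minmax": minmax_scale,
    "mean_normalization": mean_normalize,
    "standardize": standardize,
    "unit": unit_rows,
}

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
scaled = SCALERS["minmax"](df)
```

With docstrings in place, `help(minmax_scale)` shows the description without anyone having to open the source file.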
Seems like from the source that you can't do multi-class and multi-label together?
Are there any workarounds or other wrappers for this
@shell berry what is that from
I need a novelty voice TTS engine with python..
but the only good engine I see is pyttsx3
and microsoft bob is most definitly not a novelty voice...
Im specifically trying to approximate glados
from portal
I found this: https://github.com/EtiennePerot/gladosvoicegen
but it looks terrible... and is 6 years old
and requires melodyne which wont work on my linux server
next I found this
https://github.com/kairess/tacotron
but that takes a 130 GB dataset
so... yea thats out
OK... alternatively... because I cant find something good...
what if I used pyttsx3's microsoft lucy or whatever
what limitations do you have
and then distorted it
I'm assuming it has to be free?
yes
and needs to be fast 5 seconds MAX delay from discord bot command to saying it in VC
quad core 4gb ram VPS. No gpu though
hm.
just an idea but
why don't you use those generic TTS services that have been around since forever
you know, the robotic ones
and then just have a transformer to make it more GLaDOS-like
and fuck with it to make it distorted
ye
yep that was fall back idea
I think that'd be more efficient
that looks like what this does
its slow because it uses a GUI tool and automates it anyway
but idk how to do that distortion command line
and their hacky solution of VM with windows and melodyne is not an option lol
that one that actually works does the same thing
OK.. I thought GLaDOS would be easy... perhaps there's some other novelty tts engine I could use
morgan freeman, or snoop dogg, or something else... though that seems way more complex
@lapis sequoia I'd say GLaDOS is the easiest because you can just do what we said above
it already sounds kinda like TTS
on the other hand, a real human's voice is more complex.
this is an unsolved problem btw
realistic TTS is worth a lot of $$
yea thats what I was thinking
hum... well good news...
pyttsx3 is so awful it sounds close to glados already
NVM thats not pyttsx3 thats espeak
not their fault its bad I guess
ok other voices actually do decent, this could work
so... how would I add robotic distortion to an audio file?
U can
My friend did change its voice
I don't know what code to change
Check in google or docs
ok gtts actually works decently
though Im not thrilled with the delay and external server need
I keep getting this error when training, but if I set my test set to like 0.001% it goes away
Whenever I try a sizable test set I get the error again. I tried np.unique to make sure I had two classes and I do. Any ideas? appreciated
what is the averaging statistic that weights recent values more than past ones in a time series?
Exponential Moving Average?
I fixed that; I ran an SVC which takes like 10 mins to train and gives me ~77% accuracy, but a linear SVC takes 2-3 seconds and gives me around 97% no matter what my test split is
Is it really that performant or a false reading?
accuracy of train, test or validation set?
Also are you performing proper splitting?
Try using grid search cross-validation: enter the desired hyperparameter values for both and compare the results.
Just test @lapis sequoia, I'm doing this: x_train, x_test, y_train, y_test = train_test_split(x_counts, output_labels, test_size=0.33, random_state=100)
if it's giving 97% accuracy on the test set
that's probably okay then
linearsvc can converge a lot faster
Something seems off because that's really really high
is this sklearn?
Just tried that
@dusty depot it can converge a lot faster but the results shouldn't be so different.
got like 0.002% higher lol
can you paste the code.
It can't be my data splitting because I'm splitting it the same way for randomforest and etc
Not sure if I can paste the entire code, this is for school
Oh oops
Ok lmao
message me if you like.
I just need to see the splitting part and training
also the results.
This is really really embarrassing - I was testing on the train set. I must have changed my code and forgot to change it back
lol.
rip
Trying a normal SVC now, should get vastly different results since I'm actually doing stuff properly now
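A small illustration of why scoring on the train set gives a false reading. This is not the poster's SVC pipeline; it swaps in a toy 1-nearest-neighbour classifier written in NumPy, on made-up data whose labels are pure noise, so any accuracy well above 50% can only come from memorization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # labels are pure noise

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def predict_1nn(X_fit, y_fit, X_query):
    """Label each query point with the label of its nearest fitted point."""
    dists = np.linalg.norm(X_query[:, None, :] - X_fit[None, :, :], axis=2)
    return y_fit[dists.argmin(axis=1)]


# Scoring on the data the model was fitted on: every point finds itself.
train_acc = (predict_1nn(X_train, y_train, X_train) == y_train).mean()
# Scoring on held-out data: the honest number.
test_acc = (predict_1nn(X_train, y_train, X_test) == y_test).mean()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The train score is perfect even though there is nothing to learn, which is the same trap as the 97% reading above.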
I've spent 90% of my time on this assignment cleaning the data
Are real world projects mostly like that lol
Ok a normal SVC gives me 56% now
Welcome to Data Science.
Basic cleaning is nothing.
In some projects I have spent 70% to 80% time in cleaning and cleaning only.
And I'm talking about a multiple month long project.
Extraction, Cleaning and Transformation will be the biggest problem in almost every project.
Does the cross validation graph look good
sorry, im new to this and im getting a negative r-squared value compared to a baseline model
but it looks like the graph of the model i created plateaus
I don't know what your CV score is, but from the loss plot I can say that you are doing something wrong, or at least not doing something correctly.
Also, baselines are used as a reference point.
Your train and test losses should be lower, and the R-squared value should be near 1. 1 is the best, 0 means no better than always predicting the mean, and a negative value means worse than that.
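To make the R-squared scale concrete, here is a short sketch using the standard definition R^2 = 1 - SS_res / SS_tot. The helper name `r_squared` and the numbers are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)                    # predictions match exactly
mean_baseline = r_squared(y, [2.5] * 4)      # always predict the mean
worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])   # predictions are reversed
print(perfect, mean_baseline, worse)
```

A negative value, like the one reported above, means the model's squared error is larger than that of a constant prediction at the mean.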
I feel that I am missing something because the Keras callback isn't working. Can someone point that out?
from typing import Tuple

import tensorflow as tf


def get_pred_loss_dataset(test_dataset: tf.data.Dataset, model: tf.keras.Model) -> Tuple[tf.data.Dataset, tf.data.Dataset]:
    """
    Returns a dataset that yields the prediction and loss for each batch in the test dataset.

    Parameters
    ----------
    test_dataset : Dataset
        The test dataset to evaluate on. Yields `(x_true, y_true, ...)` (batched).
    model : Model
        The model that predicts on the test dataset.

    Returns
    -------
    pred_dataset : Dataset
        The resultant dataset that is suitable to be zipped with `test_dataset`.
        Yields the batch prediction.
    loss_dataset : Dataset
        The resultant dataset that is suitable to be zipped with `test_dataset`.
        Yields the batch loss.
    """
    pred_dataset = test_dataset.map(lambda x_true, *_: model(x_true))
    print(test_dataset)
    print("Obtaining loss values...")
    losses = []

    def on_batch_end(batch, logs):
        print(f"batch: {batch}, loss: {logs['loss']}")
        losses.append(logs['loss'])

    log_batch_loss = tf.keras.callbacks.LambdaCallback(on_batch_end=on_batch_end)
    results = model.evaluate(test_dataset, callbacks=[log_batch_loss])
    print(results, losses)
    loss_dataset = tf.data.Dataset.from_tensor_slices(tf.stack(losses))
    return pred_dataset, loss_dataset
Console output (yes I know the model sucks, that's why I am looking at where it went wrong):
<BatchDataset shapes: ((1, 512, 512, 3), (1,), (1,)), types: (tf.float16, tf.int32, tf.float16)>
Obtaining loss values...
3965/3965 [==============================] - 578s 146ms/step - loss: 4.4005 - top_1_accuracy: 0.0504 - top_3_accuracy: 0.0918 - top_5_accuracy: 0.2683
[4.40053129196167, 0.05044136196374893, 0.09180327504873276, 0.2683480381965637] []
As you can see the print statement in the callback isn't being called
I feel dumb for still not seeing the mistake after staring at the code for several minutes
ok I still have no idea lol
oh
I'm such an idiot
on_batch_end
A backwards compatibility alias for
on_train_batch_end.
from the docs
so LambdaCallback is useless when not training
Nowhere in the docs for LambdaCallback was this mentioned
It's only mentioned in the base class Callback
hi
Quick pandas question. Say I have this DataFrame:
   col1 col2
A    a1   a2
B    b1   b2
How do I check whether row B has the value b1 in col1 and, if it does, drop that whole row? I tried df.drop(df.loc[df["col1"] == "b1"]) but it doesn't work
@grave thunder If you know it's row B then you can drop it directly, using df.drop(index = 'B').
Or
index_to_drop = df[df['col1'] == "b1"].index
df.drop(index = index_to_drop )
also you have to pass inplace = True (or assign the result back to df) if you want the change to stick.
can anyone help me out with this?
@keen sinew Well there is no code. But it means you are missing some imports or there is a version clash which is not directly obvious. There can be other reasons too.
May I get algo info here?
hello, I am still new to algorithms.
I am planning to build a recommendation system, maybe just a basic one
Can anyone give me some helpful recommendations on how I should start and what method I should use, things like that.
I plan to use the feedback and ratings from other users as the data for the recommendation
@grave thunder df = df[df['col1'] != 'b1']
@velvet thorn You save me once again
np
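Both answers above can be checked on the two-row frame from the question. A minimal sketch; the variable names `kept` and `dropped` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["a1", "b1"], "col2": ["a2", "b2"]},
    index=["A", "B"],
)

# Option 1: boolean mask, keep only the rows that do not match.
kept = df[df["col1"] != "b1"]

# Option 2: look up the matching index labels and drop them.
dropped = df.drop(index=df[df["col1"] == "b1"].index)

print(kept)
```

Neither call changes `df` itself; each returns a new frame, which is why the result has to be assigned to a name.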
whats a good deep learning home workstation?
Anyone here use Spyder IDE? How good is it in proceding eye-pleasing visual results?
producing*
if you have a python background and want to practice google earth engine, which one is more comfortable: using the python lib in conda or js on the gee platform?
please suggest
Hi guys, I'm wondering about one thing. Which libraries are most used for NLP? PyTorch or TF-Keras?
@lapis sequoia Idk but I find the SpyderIDE kinda ugly. If you're doing data science I would recommend Jupyter Notebook
@lapis sequoia Personally I'm a big fan of the RStudio IDE which used to be solely for R but has recently gained support for Python in the preview version 1.4. I'm not sure it's as feature complete as for R but it's getting there.
I have tried using Spyder as well but I just couldn't get used to it. The problem I have with Jupyter notebooks is that I can't readily see what variables I defined and what they look like and there is no real data browser.
just use Jupyter on Visual Studio Code
it shows you the variables and has a data explorer
RStudio is great (minus the fact that it's r) but the desktop version that comes with anaconda feels like it's lacking for some reason
I had been using spyder for a while and it was pretty good
brand_of_car = car_data.groupby('brand')['model'].count().reset_index().sort_values('model',ascending = False).head(10)
brand_of_car = brand_of_car.rename(columns = {'model':'count'})
fig = px.bar(brand_of_car, x='brand', y='count', color='count')
fig.show()
guys what is groupby
I got it from this kaggle notebook https://www.kaggle.com/tanersekmen/us-car-data-analysis-eda-visualization
@hollow sentinel Just google it bro
i did
it's grouping data
but like what does that mean
how does grouping the data help
@coral trellis It depends on what you want to do. I recommend TF if you want to implement some DL paper published by Google and have a SOTA model for your task. Else just find a tutorial that covers all the theory in ML and learn that first before diving into text generation, etc.
@regal belfry Depends on what kind of tasks you want to do. For most people, a GTX 1050ti would work well enough (since you would be using colab for heavy tasks anyway)
@somber bane For what data do you want the recommendation system? What is your tentative metric for that data type?
what's the difference between using df["column"] v df.column
@grave frost I will ask users to rate shows on a 1-10 scale. Then, based on the average rating along with the genre, and maybe also the user's age, recommend shows to them
@hollow sentinel groupby groups together the rows that share the same value in a column, so you can compute something per group. that is why it's useful
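A tiny sketch of what grouping buys you, using a made-up stand-in for the car dataset (the column names and values here are hypothetical): one number per brand instead of one per row.

```python
import pandas as pd

car_data = pd.DataFrame({
    "brand": ["ford", "ford", "bmw", "ford", "bmw", "kia"],
    "price": [10000, 12000, 30000, 8000, 26000, 9000],
})

# How many rows each brand has.
counts = car_data.groupby("brand")["price"].count()
# Average price within each brand.
mean_price = car_data.groupby("brand")["price"].mean()

print(counts)
print(mean_price)
```

The Kaggle snippet above does the same kind of count per brand and then plots the ten largest groups.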
and df["columns"] only works if u have a column named "columns"
@somber bane Hmm.. seems workable. How much accuracy should it have? Like is it for personal use or you want to use it in a real world scenario?
df.columns shows your columns in the dataframe
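On the earlier question about df["column"] versus df.column, a short sketch on a made-up frame: the two are the same for simple names, but attribute access breaks down for names with spaces or names that collide with DataFrame methods.

```python
import pandas as pd

df = pd.DataFrame({"price": [1, 2], "model year": [2015, 2018], "count": [3, 4]})

same = df["price"].equals(df.price)   # both return the same Series
by_bracket = df["model year"]         # a name with a space needs brackets
clash = df.count                      # this is the DataFrame.count method
real_column = df["count"]             # brackets return the actual column
```

Brackets always work, so they are the safer default.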
I am building this for my freshman computer science project
but I plan to publish it for public use
so as accurate as possible, but it's not a requirement
If the project is all about a recommendation system, I recommend you use some industry-level algo instead of implementing your own. However, if your teacher expects a custom system, then that's a different story..
You can write a basic one, given that you know how to use numpy and linear algebra.
@grave frost so do you have any industry-level ones to recommend?
I just learned to use numpy and pandas, so where should I start? I mean I need some help in setting up a basic picture and working frame behind the algo
If you are sure that the project's goal is not to make your own custom algo, then industry ones are obv good enough. What does your data look like?
right now I have not started to collect data from users yet
If it was me, I would be developing some method based on simple ML techniques
what does ML stand for?
oh,
I recently looked at this website
I think I can understand most of it, but I do believe my teacher wants me to build one myself, not just do some copy and paste
To clarify when you say your teacher hopes you build one yourself, do they mean implement a library or algo vs using some library?
algo
Like is it okay to
import recommendation_engine
recommendation_engine.fit()
recommendation_engine.predict()
guys, i want to split x axis labels into years, and labelless ticks between them, how can i do that?
I think he will be okay with it, since I am only a freshman
Okay well, @grave frost might have more libraries. But a couple that come to my mind are Surprise and ALS from Spark.
so do I go ahead and study how to use those libraries, and implement them?
But recommendation problems are a bit tricky to begin with, since they require you to have an understanding of what the algorithm is doing.
But like previously stated, it only requires a bit of understanding.
Implementing them from scratch might be a bit rough.
But I think it would be useful to look at source code to see how they're implemented.
If you want something easy to digest and start with, you can grab your data and then recommend items most popular by ratings.
so can you recommend one library that is friendly to beginners
I think Surprise is pretty friendly.
I think it's the whole problem that requires a bit of understanding.
Once you understand the problem, the library is just tools to help solve it.
I will work hard on the understanding part. Could you help me find some sources that you think would be helpful for me to begin learning Surprise? Thanks
I think the link you used talked about it.
Here's one from google.
Thank you very much! @heady hatch
But if you built a basic one like just recommending stuff based on popularity, you won't really need much of anything other than maybe data science.
ie
-> group by some category
-> grab top 10 items in those categories
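The two steps above can be sketched with pandas. The ratings table, its column names, and `top_n` are all made up for illustration; the chat suggests the top 10, and 1 is used here only to keep the toy output small.

```python
import pandas as pd

ratings = pd.DataFrame({
    "show": ["A", "A", "B", "B", "C", "C", "D"],
    "genre": ["drama", "drama", "drama", "drama", "comedy", "comedy", "comedy"],
    "rating": [9, 8, 6, 7, 10, 9, 5],
})

top_n = 1  # hypothetical; use 10 on real data

popular = (
    # Step 1: group by category (and show) and average the ratings.
    ratings.groupby(["genre", "show"], as_index=False)["rating"].mean()
    # Step 2: best-rated first within each genre, then keep the top rows.
    .sort_values(["genre", "rating"], ascending=[True, False])
    .groupby("genre")
    .head(top_n)
)
print(popular)
```

This ignores who is asking, so every user in a genre gets the same list; that is the main thing a library such as Surprise improves on.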
no, I asked my professor for something challenging, because I have programming experience. His goal is to keep me from getting bored in class
Ahh okay then, check out those resources and have fun!
Looks like there could be some kind of relationship there! Maybe under log transformation?
Or maybe no relationships at all.
dk what that is
Me neither.
but def not linear regression
I'm gonna do it anyways for the learning experience
Do it!
btw daspecito, I noticed you're unfamiliar with Python. It's great that you're eager to do data science but I think it's important to be familiar with Python first.
I took my college CS course when I was a cs major
Once you get familiar with Python more, do a bit of data science, check out other's notebooks, and then back and forth.
Okay, that's good.
I'm just rusty lol
But it doesn't seem to help your unfamiliarity with Python.
I don't know OOP
Just should be more familiar with Python syntax.
that is something I should learn
Because when you jump into data science, you don't want to be dealing with both Python and Data Science. Since both topics are quite wide.
You just want to focus on getting the information you want rather than dealing with syntax troubles.
oh like that one time I got confused and kept writing state
Ye.
yeah that was dumb
Well I was newer to pandas then
I make a lot of mistakes starting out
I'll get better
I know you will.
so do you most commonly use a jointplot to see if there's a relationship?
between two variables?
jointplot in seaborn I mean
yeah my pairplot isn't very promising either
not very surprised
df.drop(['Unnamed: 0','vin'],axis=1,inplace=True)
what does axis = 1 control
Second axis.
but what does second axis mean
if you have row and col, it's col.
oh
It's referring to dim.
hahhhahahha linear alg also don't know that
Because matrix can have more than 2 dim.
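A short sketch of what the axis argument selects on a 2-dimensional DataFrame. The frame and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"vin": ["x1", "x2"], "price": [100, 200], "year": [2015, 2018]})

no_vin = df.drop(["vin"], axis=1)     # axis=1: the labels are column names
no_first_row = df.drop([0], axis=0)   # axis=0: the labels are row index values

numbers = df[["price", "year"]]
col_totals = numbers.sum(axis=0)      # collapse the rows: one value per column
row_totals = numbers.sum(axis=1)      # collapse the columns: one value per row
```

So in `df.drop([...], axis=1)` the axis tells pandas to look for those labels among the columns rather than the rows.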
I may need to take a linear algebra course
It would help.
Or if you're going to be working mainly with 2 dimensional data, you could look into data science courses on MOOC.
3b1b has a pretty good series on la
that's where I learnt it from
- the basics of how nn's work
- calc if you need it
really if I need anything math related I go to him first
super nice guy too
i'd suggest taking notes, la gets pretty heavy pretty quick
but the comments are full of people complaining that he taught the subject better in a video than their profs in a semester
I wonder why they don't use comment to talk about LA stuff, I think it would be a better use of their time.
it's yt?
hahaha
carsData.drop([labels = "vin","lot"],axis=1)
I'm trying to drop columns "vin" and "lot"
and it says I have a syntax error unsurprisingly
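The syntax error comes from writing the keyword `labels =` inside the list brackets. A sketch of two valid spellings, on a one-row stand-in for `carsData` (the values are made up):

```python
import pandas as pd

carsData = pd.DataFrame({"vin": ["x1"], "lot": [7], "price": [100]})

# The keyword goes outside the list of names.
trimmed = carsData.drop(labels=["vin", "lot"], axis=1)

# Equivalent spelling that avoids axis altogether.
trimmed_too = carsData.drop(columns=["vin", "lot"])
```

Both return a new frame and leave `carsData` unchanged unless the result is assigned back.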
if u are doing linear regression on data u need to find which columns are most closely correlated
try using corr()
it's a great way to find the best one
thank you @analog hatch
btw each factor of the data is important if u are putting it in as the X_train data
don't exclude anything other than columns with strings
so don't take out the lot or the vin number
some kaggle notebooks did that so I was wondering if I should do it too
I mean if u are following a tutorial you might as well, but data is important even if there is barely any correlation. of course correlation will always be the main factor for the outcome
makes sense
yeah I'm just looking how they decide to visualize the data to make it look good
I mean u are doing it good with seaborn and matplotlib
yeah
pairplot, distplot, lmplot does the job pretty well
yeah pairplot shows similar information to .corr(), just visually
i think i've seen .corr() before
hahaahhahahahh I am nowhere near an expert
I just started DS & ML like 2 weeks ago
yeah I've been doing a udemy course
python for data science and machine learning bootcamp?
love it jose Portilla is amazing
yeahh he is really good at explaining the basic of it
yeah the way he gives you answers with documented code is great
so you can follow along
I always have his stuff open when I do my own work bc it's a good guide
True, once u finish the data analysis part, the machine learning parts get better
the deep learning one is my favorite
is like 5 hours of course
but its great
I will try it
only after I completely master machine learning and make bank
you can be more creative
I just find it cool
yeah me too
actually with the basics u can make one easily with enough data
yeah but like an insanely accurate one
you don't even need to create an algorithm for that tbh
it's a classification problem
that takes time of course, cleaning and modifying to make it almost perfect. in the course you learn how u are supposed to control your data so the model does not overfit or underfit
yep
oo yeah thats logistic regression
haven't learned that yet
figured I'd do some linear regression on my own and then hop back into the course
Displaying the tf-idf vector of a pdf file using Pyqt
it's so cool that this method actually knows when a word is overused and lowers its weight accordingly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
ValueError: With n_samples=1, test_size=0.3 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
uhhhhhh
X = [["Price", 'lot', 'year' ]]
#What you're using to predict the mileage
y = ["Mileage"]
#Trying to predict the mileage based on the price
that's what X and y are equal to
I got no clue
omg guys I think I figured out my first machine learning error
pyqtgraph is datascience right!
what the fresh hell is going on here
i assumed it was some 'first argument is self' business
but
wot
never heard of pyqtgraph @errant parcel but I'm new to DS/ML
noice
what was your error in the end
and it turns out the issue with that is just that they accidentally made something optional that shouldn't have been
I didn't have the dataframe I was getting the columns out of
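A sketch of that fix. `X = [["Price", 'lot', 'year']]` is a plain list containing one inner list of strings, so it has length 1, which is where `n_samples=1` in the error comes from. Indexing the DataFrame with the same names gives one row per car. The frame, its lowercase column names, and its values are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [6300, 2899, 5350, 25000],
    "lot": [1, 2, 3, 4],
    "year": [2008, 2011, 2018, 2014],
    "mileage": [274117, 190552, 39590, 64146],
})

X_wrong = [["price", "lot", "year"]]  # a list holding one inner list: 1 "sample"
X = df[["price", "lot", "year"]]      # a DataFrame: one row per car
y = df["mileage"]

print(len(X_wrong), X.shape, y.shape)
```

With `X` and `y` taken from the frame, a splitter such as train_test_split sees four samples instead of one.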
Hi guys, does anyone know Matplotlib and is willing to help me out XD?
just ask your question
^
Sorry, so, I want to produce a histogram that show how the different traffic levels impact the light changes. I got all the data. I kinda know what to do but I am struggling with making a multicolour curve and histogram
@quick epoch what do you mean multicolour?
got an example?
so it's possible to pass no arguments and it needs arguments
@errant parcel my guess is that it wraps a C library so it can't do argument checking on the Python side...?
but honestly the message looks p self-explanatory
well I'm not passing None
i'm passing nothing
my guess is that all arguments are optional, but when it calls addHandle it fails to actually provide valid default values
@errant parcel don't have enough experience to say, but one way of implementing overloads is to have None as default arguments
🤷‍♂️
yep but i think the combinations that it allows are wrong
yeah, that's possible
just throwing in my utterly uninformed two cents
it's a classification problem
@hollow sentinel it being a classification problem doesn't mean a new algorithm/architecture wouldn't be appropriate/necessary
true @velvet thorn
But I just want a single line
@quick epoch that is a single line
or do you mean a single plot?
anyway, I believe that's from an official MPL example, right?
so you should just be able to follow it
so there's no graphs you create after a logistic regression
after you create the model you just do the classification report and that's how you can judge how the model did
right?
also has anyone's tab shift tab to see jupyter doc stop working?
mine doesn't work at times and I don't get why
@hollow sentinel that's a start
but there are many other things you can do
look into lift
calibration
ROC-AUC score
PR curve
oh my udemy course didn't mention those haha probably bc it's introductory
idk why my tab shift isn't working
when you have X why do we use a list inside of a list
@hollow sentinel because X must be 2D
bc train_test_split requires it?
no
because otherwise
how would you tell the difference between N samples with 1 feature and 1 sample with N features?
yes
oh ok i see
yeah there'd be no other way
thanks @velvet thorn that was a question that was bothering me haha
yw
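The ambiguity described above is easy to see in NumPy: the same three numbers can be three samples with one feature each, or one sample with three features, and only a 2D shape says which. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])   # 1D: shape (3,), ambiguous

n_samples_one_feature = values.reshape(-1, 1)   # 3 rows, 1 column
one_sample_n_features = values.reshape(1, -1)   # 1 row, 3 columns

print(n_samples_one_feature.shape, one_sample_n_features.shape)
```

The convention is rows for samples and columns for features, which is what the list inside a list encodes.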
Yeah I tried but I did if statements to change the colours whenever a certain value appears. I will show you what I mean in a sec
Could anyone please be so kind to explain what I am doing wrong with my plot for my models?
you have 5 bars but are trying to set 7 tick labels? @lapis sequoia
@paper niche How do I change this?
your names list has 7 items in it
reduce it to 5 items. I see you have Logistic Regression and Decision Tree repeated twice. I suppose that isn't intentional?
Oops
@paper niche I've been staring myself blind on this, thanks!
yep, no sweat
@paper niche Is there a way to get the percentage shown in each bar?
For example like this
Something like this: https://stackoverflow.com/a/28931750
hey,
i tried vgg2 face for face verification. and i was wondering, can we detect that the face is, in fact, a real human face, not some cut-out face print on cardboard ?
is there any research on this or any library that i can use?
hmm, i think i got the keyword, liveness detection!
I'm trying to convert my data into a format for ml. It gives an example where it uses audio from util, but I can't find anything about that online. Is there a drop-in replacement for that library?
never mind, it was a custom library
so I was following this kaggle notebook https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values
and I did this
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100
on my dataset and I got that my dataset is missing 95% of its data
did I do something wrong? is it possible to have a dataset that's missing 95% of its data?
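One way to sanity-check a number like that is to run the same calculation on a tiny frame where the answer is known, and to look at the share missing per column. A sketch on made-up data; `missing_values_count` is defined here the way the notebook's snippet appears to assume (per-column counts from `isnull().sum()`).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

missing_values_count = data.isnull().sum()   # missing cells per column
total_cells = np.prod(data.shape)
total_missing = missing_values_count.sum()

percent_missing = (total_missing / total_cells) * 100
per_column = data.isnull().mean() * 100      # percent missing in each column

print(percent_missing)
print(per_column)
```

If the per-column view shows most columns almost entirely empty, a very high overall figure is real rather than a coding mistake.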
I'm wondering why parameter tuning with gridsearchCV is giving me worse accuracy than the default?
I hope it's not a dumb question; where can one find interesting csv files to practice visualization/analysis as a beginner?
@livid tundra not at all. kaggle is great for that. There are notebooks that will teach you data cleaning, data visualization, and machine learning.
Thank you very much!
@livid tundra no problem
why is there no accuracy shown in the classification report
How do you fit multiple features into a model? Would it be like [[f1, f2], [f1,f2]] etc. where [f1, f2] is one training example with two features?
each example would be a vector, yes
usually the input is a matrix of shape (n_samples,n_features)
Thank you
so each row vector is a single datapoint
What if feature 1 is a bag of words and feature 2 is a bag of POS tags? [[1,0,1,0,1],[0,0,0,0,0,1]] or something. Is this a good feature vector? It seems intuitively hard to "graph" these as one point
why not? That's a lot of binary features.
More generally, a feature is anything your model accepts
Cool thanks lol
What do you think would be better, a bag of bigrams of (word, pos_tag) or bag of words followed by a bag of POS tags?
Im sure it depends on the scenario but any thoughts?
the former seems to make more sense, though depending on the model it might not matter
like, if the model knows there's a correspondence between the two bags, it's the same as if they were already in bigrams
Thanks @tidal bough, I'll try both just to experiment. I now have a list of lists of tuples, where each inner list is a sentence and each tuple is (word, tag). However, I can't use CountVectorizer or tf-idf now. Is there another way to make it an input vector, or should I convert the tuples to strings?
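One possible approach, sketched under the assumption that scikit-learn's CountVectorizer is the vectorizer in use: join each (word, tag) pair into a single string token, then pass a callable `analyzer` so the vectorizer takes the token lists as they are. The example sentences and the `word_TAG` joining scheme are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("run", "NN")],
]

# Join each (word, tag) pair into one string token.
docs = [[f"{word}_{tag}" for word, tag in sent] for sent in sentences]

# A callable analyzer receives each document as-is and returns its tokens,
# so the default string tokenization is skipped.
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Because the tag is part of the token, "run_NN" and "runs_VBZ" stay distinct features, which is the bigram-style encoding discussed above.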
Any reason why mean accuracy from cross_val_score and manually average will be different?
What do you mean by manually average?
@hollow sentinel what do you mean by no accuracy?
@heady hatch outputting separate accuracies for each fold and then taking the mean of it.
like accuracy is blank in the pic I sent
@hollow sentinel if you take a look at accuracy in the third column, it shows 0.84.
@bold olive So from my understanding are you taking the accuracy of predicted train and predicted val and taking the mean?
Yes, exactly. Accuracy from each fold and then averaged in the end.
can someone explain what StandardScaler is and why you need to do it on your dataset before you run k nearest neighbors
also how is k nearest neighbors regression different from linear regression
@bold olive I believe cross_val_score uses k-fold validation by default (stratified when the estimator is a classifier).
Meaning that it trains it on training set and then score it on validation set.
@hollow sentinel You want to scale the features before running KNN because KNN takes distance into account. Features on different scales will throw the distance calculations off.
KNN vs linear regression, I recommend reading up on the two algorithms.
In short, KNN predicts from the k nearest training points (for regression, by averaging their targets), while linear regression fits a linear model, ie y = mx + b.
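A small numeric sketch of the scaling point, with made-up ages and incomes. It standardizes by hand (subtract the column mean, divide by the column std), which is the same idea StandardScaler applies; it is not a full KNN implementation.

```python
import numpy as np

# Columns: age in years, income in dollars (toy values).
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
    [40.0, 90_000.0],
])


def nearest_to_first(X):
    """Index of the row closest to row 0 (excluding row 0 itself)."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return 1 + int(dists.argmin())


# Raw units: income differences swamp age differences.
raw_neighbor = nearest_to_first(X)

# Standardized: both columns contribute on a comparable scale.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_neighbor = nearest_to_first(X_scaled)

print(raw_neighbor, scaled_neighbor)
```

In raw units the 25-year-old's nearest neighbour is the 60-year-old with a similar income; after scaling it is the 26-year-old, because a 35-year age gap now counts for more than a modest income gap.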
i think it's time to break out the Intro to Statistical Learning
Good luck.
It's so boring to read
It might help to apply the concepts to real life.
@heady hatch , we can change the validation technique in the function.
What do you mean by in the function?
Yes, I'm using stratified shuffle split and calling it.
Even if I use KFold the problem is that the accuracies are different!
cross_val_score is reliable right?
it just a function that does cross validation. hahaha
Yeah ik I meant the way it calculates the accuracy metric
Yup.
Something probably wrong in my manual approach then!
Hmm one question to ask is why are you looking at the accuracy of the training set?
It does not.
I was wondering because you said you took the average of training and validation.
No! The average over all folds
So you trained the model on the training set, scored it on the validation set and then took the average of all the validation score?
Yes
Ahh.
Hmm the other factor I would probably consider is maybe the splitting through each fold.
Might not be splitting the same way.
I think something you can try is write your own splitting function
seed it
do cross_val_score using your own splitting function
seed it again
split it the same way each fold.
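The suggestion above can be sketched with scikit-learn: build one seeded splitter, hand it to cross_val_score, and reuse the same splitter in the manual loop. The synthetic dataset and the LogisticRegression model are stand-ins, not the poster's setup.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# One seeded splitter, so both approaches see identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

library_scores = cross_val_score(model, X, y, cv=cv)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Fit a fresh copy on the training part, score on the validation part.
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

print(library_scores.mean(), np.mean(manual_scores))
```

If the two means still differ with a shared, seeded splitter, the difference is somewhere else in the manual loop, for example reusing a fitted model across folds or scoring with a different metric.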
Is anyone here well versed in scikit-learn and offering paid tutoring services?
@shell berry you can try finding a udemy course that does that. Are you a beginner to machine learning? If you are I would use Python for Data Science and Machine Learning Bootcamp by Jose Portilla
Is there a better way to implement cross_validation_split using pandas and numpy?
```
def cross_validation_split(df, num_folds):
    folds = []
    fold_length = df.index.size // num_folds
    shuffled = df.sample(frac=1)
    for i in range(num_folds):
        folds.append(shuffled.iloc[i*fold_length:(i+1)*fold_length])
    return folds
```
Have you tried the KFold class in sklearn? I'm pretty sure you can pass it as the cv parameter of cross_validate and get the splits automatically, without writing your own function.
I saw that, I wanted to try it without sklearn first
Oh ok.
I'll try that
you can just use len(df)
also, if you do it that way, your folds will all be the same size
which could omit rows if the number of rows you have is not perfectly divisible by the number of folds
other than that it looks more or less okay
space out your operators
yw
@velvet thorn What do you mean? Are they not the same size?
which could omit rows if the number of rows you have is not perfectly divisible by the number of folds
@velvet thorn see this
yeah
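A sketch of one way to avoid dropping the leftover rows, still without sklearn: let `np.array_split` divide the row positions, since it allows folds whose sizes differ by one. The `seed` parameter is an addition for reproducibility and not part of the original function.

```python
import numpy as np
import pandas as pd


def cross_validation_split(df, num_folds, seed=0):
    """Shuffle the rows and split them into num_folds nearly equal folds."""
    shuffled = df.sample(frac=1, random_state=seed)
    # array_split permits uneven chunks, so no row is left out.
    index_chunks = np.array_split(np.arange(len(shuffled)), num_folds)
    return [shuffled.iloc[chunk] for chunk in index_chunks]


df = pd.DataFrame({"x": range(10)})
folds = cross_validation_split(df, num_folds=3)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```

With 10 rows and 3 folds, the integer-division version keeps only 9 rows, while this one keeps all 10.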
just realized logistic regression default solver in scikit-learn uses l-bfgs solver instead of gradient descent
@hollow sentinel Thanks for the advice. I'm actually looking for some guidance on a particular project and I have specific questions
Hey guys, I'm not seeing the branches, do you guys know the issue?
```
dtree = dtree.fit(X_train, y_train)
```
```
plot_tree(dtree,
          filled = True,
          rounded = True,
          class_names = ['released', 'deceased'],
          feature_names = X.columns)
```
not sure which topical chat / help channel is the best place for this, sorry, but I'm making an HTTP request to https://www.asos.com/api/product/catalogue/v3/stockprice?productIds=20510882&store=COM from Python, and the response I get is different from the one I get in my browser.
On my browser:
On Python with requests library:
any ideas what's causing this? This code ran fine a few weeks ago, but I tried it again today and it didn't work. Not sure if asos changed their backend
the response is different in python
what do you mean?
the response body from Python is:
{"id":14014948,"name":"Nike Air Jordan 1 Mid trainers in colourblock","description":"<a href="/women/shoes/trainers/cat/?cid=6456"><strong>Trainers</strong></a> by <a href="women/a-to-z-of-brands/jordan/cat/?cid=29517"><strong>Jordan</strong></a><ul> <li><span style="background-color: initial;">Unboxing potential: considerable</span></li><li>Mid rise</li><li>Padded cuff for a supportive fit</li><li>Lace-up fastening </li><li>Nike Swoosh logo</li><li>Perforated toe cap for breathability</li><li>Helps keep them fresher for longer</li><li>Nike Air sole with Air units</li><li>Units contain pressurised air that compress on impact</li><li>For lightweight, durable cushioning</li><li>Rubber outsole</li></ul>","alternateNames":[{"locale":"en-GB","title":"Nike Air Jordan 1 Mid trainers in colourblock"},{"locale":"ru-RU","title":"ะัะพััะพะฒะบะธ ััะตะดะฝะตะน ะฒััะพัั ะฒ ััะธะปะต ะบะพะปะพั ะฑะปะพะบ Nike Air Jordan 1"},{"locale":"sv-SE","title":"Nike โ Air Jordan 1 โ Blockfรคrgade trรคningsskor med halvhรถgt skaft"}],"localisedData":null,"gender":"Women","productCode":"1611119","pdpLayout":"Footwear",
I'm expecting it to look more like my first screenshot, with the productID and prices
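A hedged sketch for comparing what Python sends with what the browser sends. Why the responses differ is not established in this thread; a common first check is the request headers, since `requests` sends its own default `User-Agent`. The header values below are placeholders (copy the real ones from the browser's network tab), and `build_request` / `fetch_stock_price` are hypothetical helper names.

```python
import requests

URL = "https://www.asos.com/api/product/catalogue/v3/stockprice"
PARAMS = {"productIds": "20510882", "store": "COM"}

# Placeholder headers: replace with the values the browser actually sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "application/json",
}


def build_request():
    """Prepare the request without sending it, so it can be inspected."""
    return requests.Request("GET", URL, params=PARAMS, headers=HEADERS).prepare()


def fetch_stock_price(session=None):
    """Send the prepared request and return the raw response object."""
    session = session or requests.Session()
    response = session.send(build_request(), timeout=10)
    # Inspect status, final URL and content type before parsing the body.
    print(response.status_code, response.url, response.headers.get("Content-Type"))
    return response


prepared = build_request()
# prepared.url and prepared.headers show exactly what would go over the wire.
```

Calling `fetch_stock_price()` and comparing `response.url` with the URL requested also shows whether a redirect happened along the way.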
data
Hi everyone, I'm trying to build a Flask backend that takes tweets as input, preprocesses them and makes predictions. I want to use a Keras model saved in h5 format. Can anyone direct me to any helpful resources on this? Thank you
can someone explain why sklearn needs dummy columns
Somebody help me with this... why does the normal DQN perform so much better than a lot of the other ones
I need help with Deep Neural Network, like OG pro
Is it because DQN was faster to train in a simple environment?
what does DQN mean? @lapis sequoia does it mean double Q Network?
Deep Q Network I see, cool
School Homework?^
@sleek rampart are you talking about what I posted
Hey @hollow sentinel , the reason for dummy variables is for dealing with categorical variables.
You could also use an ordinal encoder, but sometimes order doesn't make sense, so you need dummies instead.
Dummy features also let you handle categorical variables with more than two classes.
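A minimal sketch of the dummy-column idea described above, using pandas. The column names `age` and `city` and their values are made up for illustration. It contrasts one-hot dummies with integer codes, which impose an order the categories do not really have.

```python
import pandas as pd

# Made-up example data: one numeric column, one categorical column.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "city": ["Paris", "Tokyo", "Paris", "Lima"],
    }
)

# One-hot / dummy encoding: one 0/1 column per category, no implied order.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)

# Ordinal-style integer codes: compact, but Lima < Paris < Tokyo is arbitrary.
ordinal = df["city"].astype("category").cat.codes
print(ordinal.tolist())
```

The `encoded` frame is all numeric, so it can be passed to an sklearn estimator directly.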
Yo guys. Can you help me with the problem?
For some reason I am not getting multicoloured line
I'm not familiar with visualization, but have you checked the documentation?
you need dummy columns to run a random forest/decision trees?
You need dummy columns to deal with non-numerical data.
@quick epoch not using jupyter notebook is a dangerous game haha
I know. But I got all the necessary libraries and etc xd
So I am not worried about that XD
@quick epoch i couldn't get pycharm to run on my machine properly lmao
hey anyone familiar with dash library ?
It's easy actually. There is a way to install all necessary packages automatically using conda
But I am not good at data visualisation but I need this for data science and ml
So if you could help me I would appreciate it
maybe that's it idk
I've been using seaborn the most bc that's what I've been doing with Udemy
you need one but you want certain parts of the line to be different colors
what graphing library are you using
I want the plot to change whenever it sees those values in the csv file
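A hedged sketch of one way to get a single line whose colour changes with the data. The plotting library was never named in the thread, so this assumes matplotlib; the `flag` array stands in for whatever column in the CSV should trigger the colour change, and all values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

# Made-up data; in practice these would come from the CSV file.
x = np.arange(6)
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 4.0])
flag = np.array([0, 0, 1, 1, 0, 0])  # 1 marks points that should be highlighted

# Split the line into one segment per pair of neighbouring points.
points = np.column_stack([x, y]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

# Colour each segment by the flag at its starting point.
colors = ["red" if f == 1 else "blue" for f in flag[:-1]]

fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors=colors, linewidths=2))
ax.autoscale()  # collections do not update the axis limits on their own
```

A plain `ax.plot(x, y)` call draws the whole line in one colour, which is why the line is split into segments here.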
@hollow sentinel are you familiar with dash? and datatables and callbacks? if so, maybe you could help me out? thanks
@dawn vault hahhahaha I've only been doing machine learning for 2 or 3 weeks
is dash a graphing library like plotly?
sort of.. one can build dashboards quite easily.. and yeah dash and plotly are more or less the same thing..
with dash one can build interactive dashboards using plotly and stuff
yeah so far I only know pandas, matplotlib, seaborn, and plotly rn
kk
I am working with pandas and dash/plotly to get my project going, but running into issues..
so what are you working on right now ?
support vector machines
omg .. dont know any of those words... lol
which one?
Python for Data Science and Machine Learning Bootcamp by Jose Portilla
I'd recommend it for people who want to start learning machine learning it's only 20 bucks
jose portilla rings a bell.. i might have taken a course from him..
took python for financial analysis and algo trading...
