#data-science-and-ml
usually when I approach ridge regression, I try to find the optimum value. This is what I did before, hope it helps ```# Variable selection and shrinkage methods we have learned in week 2.
# (Both step-wise selection and penalized likelihood approaches).
# Ridge Regression
# Define a fine grid of tuning parameters, lambdas.
# In Python, this tuning parameter is referred to as "alpha"
# Set an equal spaced grid in log scale
import numpy as np
from sklearn import linear_model

n_alphas = 100
alphaR = np.logspace(-1, 4, n_alphas)
#aR = np.arange(0, 1000, 5)
#print(alphaR)
# Ridge Regression (using alphaR)
coefs_R = []
for a in alphaR:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    coefs_R.append(ridge.coef_)```
hmm looks good
I will try that out too, thanks
How would I go about finding an optimal value of K in a KNN model? Right now I just have a loop and I'm plotting the MAE
this is what i did in my last class project ``` # from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import cross_val_score
# knn = KNeighborsClassifier(n_neighbors = 3)# Create KNN classifier with k=3, for instance.
# knn.fit(X_train,y_cat_train)# Fit the classifier to the data
# y_pred = knn.predict(X_test)# Test error in confusion matrix
# k_range = range(1,51) # This search over k=1,...,50. Adjust the range as you like.
# cv_scores = []
# for k in k_range:
#     knn_cv = KNeighborsClassifier(n_neighbors=k)
#     scores = cross_val_score(knn_cv, X_train, y_cat_train, cv=5) # This code uses 5-fold CV.
#     cv_scores.append(scores.mean())
# plt.plot(k_range, cv_scores)
# plt.xlabel('K')
# plt.ylabel('CV accuracy score')
# #more flexible compare to all of it
# print(confusion_matrix(y_cat_test,y_pred))```
the reason I commented it out: somehow it didn't work, or the professor didn't require it
yea I am also doing something like this
Yeah thats pretty much it, search and pick the lowest error
hello
is there a learning algorithm for training a NN model that changes the structure of the NN as well as its parameters (weights)?
Yes, NEAT and its variants
There's a python implementation available as well if you don't want to do it from scratch
salary_map={'<=50K':,'>50K':1 }
X_train['salary_map']=X_train['salary'].map(salary_map)
it's not working
can anyone help?
what is it supposed to do?
map the values
Try using .loc[row_indexer,col_indexer] = value instead
this is kind of warning m getting
salary_map = {'<=50K': , '>50K': 1}
X_train['salary_map'] = X_train['salary'].map(salary_map)
salary_map={'<=50K':0,'>50K':1 }
X_train['salary_map']=X_train['salary'].map(salary_map)
Try using syntax highlighting and following style conventions.
What is the value supposed to be for '<=50K'
its value present in column
salary_map is a dictionary
yes
you have '<=50K' as a key. what is the value?
so, make sure the value is there in your code.
yes it is there
it's not there in this example
X_train['salary_map'] = (X_train['salary'] > 50_000).astype(int)
Try that. Also, when sharing code or error messages, please copy and paste the text instead of showing a screenshot.
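A minimal sketch of the dictionary-based mapping, assuming the `salary` column holds the strings `'<=50K'` and `'>50K'` (in that case a numeric comparison such as `> 50_000` would not apply). The small `X_train` below is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real X_train; 'salary' holds strings.
X_train = pd.DataFrame({"salary": ["<=50K", ">50K", "<=50K", ">50K"]})

# Every key needs a value; a dict entry like '<=50K': with nothing after
# the colon is a syntax error.
salary_map = {"<=50K": 0, ">50K": 1}

# Working on an explicit copy avoids the "value is trying to be set on a
# copy of a slice" warning when X_train came from slicing another frame.
X_train = X_train.copy()
X_train["salary_map"] = X_train["salary"].map(salary_map)

print(X_train)
```

Values that are not keys of `salary_map` (for example strings with stray whitespace) come out as NaN, which is a quick way to spot unexpected categories.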
yea sure
i have an issue on a graph and the screenshot is much needed, i hope it's not that big deal
It's fine if it's a graphic of some kind and not text.
Yeah because you have a billion categories lol
import pandas as pd
from pandas import to_datetime
import plotly
import plotly.express as px
import plotly.io as pio
df = pd.read_csv(r'\Users\almas\Desktop\amazon_jobs.csv')
df.dtypes
df["Posting_date"] = to_datetime(df["Posting_date"])
y = df.loc[(to_datetime(df["Posting_date"]) > to_datetime("January 1,2018")) &
(df["location"] == "US, WA, Seattle ")]
print(df)
y.groupby("Title").size().plot.pie(y="Title",ylabel="LABEL")
Yeah, you group them by title, but theres loads of titles
Well the idea was from csv (https://www.kaggle.com/atahmasb/amazon-job-skills) show a pie chart of Title (job positions) in Seattle, WA dated from January 1st 2018
Well, technically you were successful
You'd have to figure a way to label them better, eg anything that contains the words "software" and "engineer" or "developer" => "software engineer"
You would have to decide on the labels and how best to generate them though
Or other key words from the description
Okay, I'll try something else maybe
Thanks!
Oh and yes, while i'm here..any idea or example for ternary plot
If you're just playing around, maybe categorize by what languages they require
Seems a bit hard because it's not a specific category, but 'PREFERRED QUALIFICATIONS' where they took data from applications
But thanks
anyone?
lol
any text-to-speech recognition with ANN in python available somewhere for learning??
Not sure yet. How could I figure that out?
Anybody doing the Kaggle 30DayMl challenge?
How can I solve the ImportError?
The matplotlib library is installed,
but I got the error
Hi guys, how do you improve the accuracy level of your machine learning model?
@modern dragon it's not possible to answer this question in general, as there's no one-size-fits-all solution.
What does your model do? How is it performing currently?
@tame solstice make sure pycharm is running your code in the environment where you installed matplotlib. If you don't understand what I mean by this (and it's okay if you don't) then it probably isn't.
It uses age and gender to predict their top 3 category interests
what type of model is it?
Oh uh what are the different types of models(
?*
This is my first ML project
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
please don't post screenshots
Oh it's really weird, is it ok if I show you the tutorial I learned it from?
Python Machine Learning Tutorial - Learn how to predict the kind of music people like.
You can skip to 29 mins
no, I would need to see the code
When you have the code ready to share, ping me and I'll look next time I'm online.
does anybody know any good sources to learn machine learning without have to use any of the modules (like sklearn)? ping me if you have an answer, thanks!
@covert herald why don't you want to use modules?
They're not an added layer of complexity. They're there to help. You could implement some algorithms "from scratch" for educational purposes, but I would still use numpy at the very least.
The reason I insist on numpy: if you write all the math by hand, you're going to waste a lot of time on implementation details that don't deepen your understanding of anything.
How to know that you have overfitted your model?
Hello guys,
I'm using OpenCV and Yolo3 to detect objects in a video file I have in a folder. The problem is that I don't know how to save the output video (the one with the detections). This is my code:
video = 'test.mp4'
vid = detect_video(video, yolo, all_classes)
uhh
this doesn't seem right
why is it doing that?
I have one hot encoded the data
oh I should drop some columns
fixed it, the problem was there were too many unique values for some column
just like matlab has simulation interfaces
is there any link between machine learning and binary search?
```
LinAlgWarning: Ill-conditioned matrix (rcond=1.05001e-17): result may not be accurate.
  return linalg.solve(A, Xy, sym_pos=True,
```what does this mean? code:
```py
reg2 = Ridge(alpha=0.0, normalize=True).fit(X_train, y_train)
y_pred2 = reg2.predict(X_test)
rmse_ohe[1] = rmse(y_test, y_pred2)
rmse_ohe[1]
```
you might have strong multicollinearity
basically, there is substantial uncertainty in the regression coefficients
oh is there a way to fix this?
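One way to see what the warning is about, as a rough sketch on made-up data: with `alpha=0.0` ridge reduces to plain least squares, and a full set of one-hot columns is perfectly collinear with an intercept. The example below is a simplification (it penalizes the intercept too, and the data and the helper `gram_condition` are hypothetical); it only compares how well-conditioned the matrix to be solved is under three setups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one 3-level category, one-hot encoded with ALL levels
# kept, plus an intercept column. The one-hot columns add up to the
# intercept column, so the design matrix is perfectly collinear.
levels = rng.integers(0, 3, size=200)
one_hot = np.eye(3)[levels]
X = np.column_stack([np.ones(200), one_hot])

def gram_condition(X, alpha):
    """Condition number of the matrix a ridge solver has to invert."""
    return np.linalg.cond(X.T @ X + alpha * np.eye(X.shape[1]))

cond_no_penalty = gram_condition(X, alpha=0.0)        # alpha=0: plain least squares
cond_with_penalty = gram_condition(X, alpha=1.0)      # positive penalty
cond_dropped = gram_condition(X[:, :-1], alpha=0.0)   # drop one one-hot column

print(f"alpha=0, all columns  : {cond_no_penalty:.3g}")
print(f"alpha=1, all columns  : {cond_with_penalty:.3g}")
print(f"alpha=0, one dropped  : {cond_dropped:.3g}")
```

Either a positive `alpha` or dropping one dummy column per categorical feature brings the condition number down to a modest value, which is usually what makes the warning go away.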
thank you, I solved it, I added matplotlib in PyCharm
hello @all I want to build a script that detects the dominant color in an image. I did some searching and found packages like color-thief, but the results aren't good, so I want to build something myself. I don't know how or where to start; could anyone help me?
i got this table from a website with bs4 and pandas
its a list of strings with a length of 1 named dfs and that output is dfs[0]
how do i get just the common name column?
the spacing changes depending on which state's data im looking at
printing for s in dfs[0] gets me the column names
dfs["Common Name"] should work
Or use iloc:
https://www.geeksforgeeks.org/select-rows-columns-by-name-or-index-in-pandas-dataframe-using-loc-iloc/
it says the indices must be integers or slices, not strings
Maybe it's not a data frame?
no it isn't its a list of strings
but theres only 1 element
which is that big table
Ah okay, you can turn a table into a data frame and then use what I was talking about
I think it's just df = pd.DataFrame(dfs[0])
Then you can do df["Common Name"]
@serene scaffold I do use numpy, but I don't want to use the machine learning modules until I learn how the algorithms work
For basic ones like linear regression you should just be able to look up an algorithm and implement it yourself. For more complicated ones, you can find tutorials like "random forest from scratch using numpy" or "neural network from scratch using numpy"
I don't think there's one single resource that has it all
alright
Images are just rgb values, so you can extract them and map them into a 3d space, then run some sort of clustering algorithm to find the dominant colors.
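A minimal sketch of that idea using only numpy: treat every pixel as a point in RGB space and run a plain k-means on them. The function name `dominant_colors` and the synthetic image are made up for illustration; a real image would first be loaded into an (H, W, 3) array with an image library, and probably downsampled, since the distance matrix grows with the number of pixels.

```python
import numpy as np

def dominant_colors(image, k=3, n_iter=20, seed=0):
    """Cluster the pixels of an (H, W, 3) RGB array with plain k-means.

    Returns the cluster centers sorted from most to least common, and the
    share of pixels assigned to each center.
    """
    pixels = image.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen pixels.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every pixel to every center, shape (n_pixels, k).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    order = counts.argsort()[::-1]
    return centers[order], counts[order] / counts.sum()

# Synthetic test image: 70% pure red, 30% pure blue.
image = np.zeros((20, 20, 3), dtype=np.uint8)
image[:14] = (255, 0, 0)
image[14:] = (0, 0, 255)

centers, shares = dominant_colors(image, k=2)
print(centers.round(), shares)
```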
I'm taking an Artificial Intelligence course next semester, yay
yes it is a very nice idea,
Do I need mathematical rigor to start learning TensorFlow?
well, TensorFlow is for deep learning in general, so the amount of mathematical knowledge you need to understand what you're doing depends on what you are doing. You will probably need to know linear algebra.
Also, I would avoid "learning TensorFlow" or any other library, and instead focus on approaches to AI and use whichever library suits what you're doing.
Right now I am just planning on using a library to create an LSTM, so what mathematical knowledge would that need?
are you trying to use an LSTM to do something, or implement the actual LSTM?
Tbh I only know up to geometry so I might learn this another time after I start calculus and linear algebra
in the mean time, developing general programming skills will always serve you well.
any idea which dataset would be good for ternary plot?
i found this one but im not sure if that's ok: https://www.kaggle.com/vinven7/comprehensive-database-of-minerals
Hey, I'm trying to use Fitter from fitter library on large image data. It always times out. I tried increasing the timeout quite a lot, but it never makes the cut. I converted it to an np array and everything, but no luck....
Can't seem to find anybody that knows anything about it
Any thoughts?
@serene scaffold Katie told me that you work with NLP, so I have a question regarding that. Would you need ML to do it, could you just work based off of grammatical structures of sentences (I.e subject, verbs, predicates, etc.) and classify words as a certain description
well, I dont really know how to explain it
I am trying to archive the messages of my friends and me from a discord channel, that works great, but I get this json file from discord:
Can you put that in a paste bin?
yes sorry
It's more ML/AI than it is linguistics in many ways.
Hm
though there are certain approaches where grammatical features are taken heavily into account.
For my example, I'm more or less trying to generate a sentence rather than predict the next word - using basic sentence structures such as simple sentences and the likes, would that need ML?
depends on your definition of "ML", but if you're just having fun, you can make a sentence generator using ngrams and markov chains.
This happens to be the second assignment in the NLP class I helped teach.
I would try and create an LSTM but I seriously don't understand videos for them because they all require previous knowledge in ML
ML is a tough area to jump into, yes
I can get behind how LSTMs and RNNs work but I don't understand the mathematical portions of it
how do you feel about statistics?
Not that good at it tbh, I'm in 8th grade so my mathematical knowledge is quite bad. I do know concepts like correlation coefficient and the likes, but probably not at the point where I'm competent enough for ML
you can do the ngram/markov chain approach if you simply understand that if something happens 8 times out of 10, it has a .8 chance.
Hm alright
Would there be a certain framework that is strong with those concepts?
Or should I work on the concepts first then find a framework that suits my needs
NLTK is a library you can use to get the ngrams.
the statistics and stuff, you can just store some numbers in a nested dict data structure of some kind.
Does it abstract it too much? Cause I do want some abstraction but not too much so I can understand the concepts
Ah, so stuff like the most common word in x position (I.e the position of the first word)
not really. an ngram is a tuple of n consecutive tokens.
[(not really .), (really . an), (. an ngram), (an ngram is), ... (consecutive tokens .)]
these are 3grams or trigrams.
Are there any articles or books that are helpful for understanding what ngrams are?
maybe? the course I taught specifically didn't have one to save money for the students.
well, helped teach. anyway, a token is just a word or punctuation mark. and it's just n tokens in order
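The trigram list above can be reproduced with a few lines of plain Python. This is only a sketch with a hand-written token list; NLTK's `ngrams` helper does the same job once the text has been tokenized.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of the sentence used as the example above (words and punctuation).
tokens = ["not", "really", ".", "an", "ngram", "is", "a", "tuple",
          "of", "n", "consecutive", "tokens", "."]

trigrams = ngrams(tokens, 3)
print(trigrams[:3])
print(trigrams[-1])
```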
Ohh alright
Ohh wait I understand it
I was a bit confused at first but now I get it
Anyways, thanks for your help! I appreciate it
@grand lion I'll ask my now-former advisor for her slides. Ask me again on like Tuesday.
Alright, will do
elt is the same thing as tlc for a dataset
pretty interesting competition
https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/
is anyone here who's pretty experienced in CV research/kaggle willing to collab?
I will give the idea and you would code all of it - prize would be split 50-50
dont wonder why no one takes the deal
Do I need Anaconda to plot on 2d maps with matplotlib?
Also do I need mpl_toolkits.basemap?
Hey
Suppose I give you data as (interval, frequency) pairs.
If we apply
np.histogram(data_given, bins=class_intervals, density=False)
we get a tuple of two arrays:
1. frequency counts
2. bin edges
But if we instead set
density=True
what does np.histogram return, statistically?
Thats the data
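A small sketch of the difference, on made-up data with unequal class intervals: with `density=True` each count is divided by the total number of points times the bin width, so the result is a probability density whose area (value times width, summed over bins) is 1.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 9])   # hypothetical observations
edges = [0, 2, 4, 10]                              # hypothetical class intervals

counts, _ = np.histogram(data, bins=edges, density=False)
density, _ = np.histogram(data, bins=edges, density=True)

widths = np.diff(edges)
by_hand = counts / (counts.sum() * widths)         # same thing computed manually

print("counts :", counts)
print("density:", density)
print("area   :", (density * widths).sum())
```

Note that the density values themselves do not sum to 1 unless every bin has width 1; it is the area that does.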
Do you guys think the only good way to learn neural networks is by learning every aspect of it
and understanding all the math and how they are built?
or can you get a good understanding and make a lot of cool AI just by learning tensorflow and mastering that
I thought learning from scratch would be nice and help me understand but after hours and hours of learning gradients equations types activation functions
it just got too much to handle and I would like to make AI with a less info-needed approach which is why i thought tensorflow would be nice
but i am scared that would limit me and what i can make
guys i have one doubt , are the tree based models prone to outliers, skewed features ?
if they are not then i dont need to standardize or scale the features right ?
hello
i am working with pandas dataframe
when i run ```python
for date_1 in rem_dup_dt_column[0]:
    print("date_1:", date_1)
    print()
    row_data = main_dataframe.loc[main_dataframe['date']==date_1]
    print("row_data:")
    print(row_data)
    print()```this command i get only first date is getting stored
i want here that it will run for every entry in date column
ping me when u reply
hello, i had a question regarding label encoding and one hot encoding. A few examples that i found online which had Sex column in it trained the model after label encoding and no one hot encoding, shouldnt one hot encoding be done in such cases? Thanks
in cases where column only has 2 unique values, there's zero downside to label encoding. so if the dataset for Sex had only 2 values, then you essentially bypassed the pitfall of label encoding
SEX???
sex is a synonym for gender
aw
Sorry, gender *
Ohh

in general you're correct, for higher cardinality (ie more unique values) in a categorical column, label encoding isn't appropriate
So like labelling male as 0 and female as 1 doesnt really have any effect on the model huh
yes, because it's just two numeric values, with some distance between them
Makes sense, thanks!!
you could have even set it to 0.25 and 0.75, or 0.3 and 0.6 if you wanted (though i dont know why you'd want to do that)
the model will never see any value outside those two for this feature, and thus its relations will largely stay independent of the actual values
Sorry if I'm repeating the question, but even the distance or small difference should set those two apart, right? Like in one-hot encoding it's more like true and false, but label encoding is more like assigning a value to a variable?
Shouldn't that affect even models with only two values..
ultimately a model doesn't care. all it does is weight * some_feature
the weight could be learnt arbitrarily to scale any 2 values into anything
Oh
and also, for the record, a model also doesn't even understand true or false. all it understands is math and numbers
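A quick numerical check of that claim for a linear model, on made-up data: coding the two categories as 0/1 or as 0.25/0.75 gives exactly the same fitted values, because the learned weight and intercept simply rescale. The helper `fit_predict` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
is_female = rng.integers(0, 2, size=100)              # hypothetical binary feature
y = 3.0 + 2.0 * is_female + rng.normal(0, 0.1, 100)   # hypothetical target

def fit_predict(feature, y):
    """Least-squares fit of y = w * feature + b; returns fitted values and (w, b)."""
    X = np.column_stack([feature, np.ones(len(feature))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef, coef

pred_01, coef_01 = fit_predict(is_female.astype(float), y)
pred_alt, coef_alt = fit_predict(np.where(is_female == 1, 0.75, 0.25), y)

print("weights:", coef_01[0], coef_alt[0])
print("same predictions:", np.allclose(pred_01, pred_alt))
```

The second weight comes out twice as large because the two codes are only 0.5 apart instead of 1. With three or more categories this no longer holds, since integer labels impose an order and spacing that a single weight cannot undo.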
Ig i need to think about it a lil more to completely understand that
Thank you!
do anyone know how to change language in Jupyter to English
google says try https://stackoverflow.com/questions/52667314/jupyter-notebook-is-displayed-partially-in-french or https://github.com/jupyter/notebook/issues/4158
I'm using Jupyter for Python programming on Windows 10 and some of the text is translated in French but not all of it (which makes it kinda annoying).
Does someone know how to change the display
can anyone help me to convert a nested json directory to a dataframe?
https://pastebin.com/vrvXVsMe
heres the json, I tried everything google has to offer...
ig you could extract keys and values from the json object using a for loop and keep adding new rows in each iteration while setting the key as the column and the value as the column value?
https://www.kite.com/python/answers/how-to-get-values-from-a-json-string-in-python -check this out maybe?
or you could convert the json to csv first? maybe thats easier
ok, I got another question, I want to print a value of a nested dict.
I got the path from here: https://piotros.github.io/json-path-picker/ , but I don't get a result
print( data['messages']) prints everything under it which makes sense,
print( data['messages'][0][0]['embeds'][0]['title'])
should print the value of title but it doesn't
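A sketch of how to walk such a structure step by step. The JSON below is a hypothetical export that only reuses the key names from the message above; the real file may nest things differently. The point is to check at each level whether you are holding a list (index with a number) or a dict (index with a key).

```python
import json

# Hypothetical data with the same key names; not the real export.
raw = """
{"messages": [
    {"content": "hi", "embeds": [{"title": "Example title"}]},
    {"content": "no embed here", "embeds": []}
]}
"""
data = json.loads(raw)

messages = data["messages"]
first = messages[0]
# A list of dicts: messages[0] is a dict, so messages[0][0] would raise KeyError.
print(type(messages).__name__, type(first).__name__)

# Collect every embed title, skipping messages that have no embeds.
titles = [embed["title"]
          for message in data["messages"]
          for embed in message.get("embeds", [])
          if "title" in embed]
print(titles)
```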
is 3000 files enough for a CNN
hello, I was trying a project on twitter sentiment analysis
so I took dataset from Kaggle - cleaned it (like @, RT, links)
I want to know if is there a way to choose tweets based on a topic??
like I want tweets regarding "Donald Trump", "Artificial Intelligence", etc
can I do that without using the twitter api - just choosing from the dataset!!
pls help!!
can you send the dataset link
so you want get all the tweets regarding donald ?
yes for any topic, not just the Trump
basically I need to create a model for this!!
I only showed wordcloud, bar graphs
need to show accuracy and train the model too
so thats why i thought I should filter the data to a topic only
try df[df.tweet.str.contains('donald')]
oh yeah that works
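A slightly more forgiving variant of that filter, sketched on a made-up frame with a `tweet` column: `case=False` also catches "Donald", `na=False` treats missing tweets as non-matches instead of failing, and `regex=False` searches for the literal text.

```python
import pandas as pd

# Hypothetical tweets; the real dataset's column name may differ.
df = pd.DataFrame({"tweet": ["Donald Trump said...", "AI is everywhere",
                             None, "donald duck memes"]})

topic = "donald"
mask = df["tweet"].str.contains(topic, case=False, na=False, regex=False)
matches = df[mask]

print(matches)
```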
are you a beginner ? cuz i am and i need one help
yeah lol beginner here - always asking for help
its just a joke lol
can someone help me with this. i tried this dataset https://www.kaggle.com/purumalgi/music-genre-classification . i filled the nan values with their mean then i tried to fit it to logistic, svm, random forest clf, naive bayes and xgboost but i am only getting the accuracy of 0.52 in cross validation score
can someone tell me what iam doing wrong here or the dataset wasnt meant to be classified easily ?
https://www.kaggle.com/muhammedjaabir/music-genre-clf , my notebook url
no one?
this link is 404
but Idk data science much,, just started 2 weeks ago!
I hope someone else will help!
hmm ok
tried DNN?
yes
idk i only know machine learning algo not deep learning
so i have to wait for it ig
could you briefly explain your dataset?
this dataset is about classifying genre of the music , there are like 10 diff genre . it has features like Artist Name ,Track Name , Popularity , danceability , energy key, loudness , mode, speechiness , acousticness , instrumentalness , liveness , valence ,tempo , duration_in min/ms , time_signature
oh alrighty, I've also just begun ML so lemme see if i can understand this
you couldve simply clicked that kaggle link to see the dataset
i did, i was kinda confused with the datalist
ohh
how long has it been since you started ?
ok
all the features except class cuz that one is the output variable ( the one needs to be predicted )
all the features in this except Artist name, Track Name?
yep
i think you shouldnt use popularity as an input....
wym ?
because it is insignificant right?
like what determines a genre is the rest of the features but not the popularity?
genre is dependent on every of those feature there but wait lemme try dropping the popularity and see if i get the improvement in score
maybe you are right cuz i do get a significant drop in accu due to popularity
someone correct me if i am wrong.
ok nvm
popularity is a important feature
huh, well my bad
its ok
i cant seem to download the code, so i cant try it out myself...
processing seems to take a lot of time on kaggle
Hey everyone, my name is Paras, and I recently started to learn ML from random resources on YouTube and Google. Can you please guide me on how and where to learn ML? Thank you in advance
i started first by learning probability and statistics from khan academy --> ml by andrew ng --> how to create model using python by python engg ( just search some model name with python engineer ) --> started working with diff datasets and still doing
If arr is a 3D numpy array, does arr[arr<90] return values in row-wise order as in arr?
In other words, is arr[arr<90] roughly equivalent to this:
(Not considering return type)
def select(arr):
    output = []
    for i in arr:
        for j in i:
            for k in j:
                if k<90: output.append(k)
    return output
!e
import numpy as np
arr = np.random.random((2, 2, 2))
print(arr)
print(arr[arr > .5])
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[[[0.74413612 0.56879201]
  [0.02811816 0.03907577]]

 [[0.78955903 0.69845739]
  [0.61117874 0.84977809]]]
[0.74413612 0.56879201 0.78955903 0.69845739 0.61117874 0.84977809]
@real dew you can use that to infer what the logic is.
looks to me that it's the same as if you had first reshaped the array into one dimension
Oh yeah
Thanks!!!
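The same check can be made exact rather than read off by eye. A small sketch on made-up integer data: boolean indexing returns the selected elements in row-major (C) order, which matches both the triple loop and masking the flattened array.

```python
import numpy as np

arr = (np.arange(24).reshape(2, 3, 4) * 7) % 100   # hypothetical 3D data
mask = arr < 90

# The nested-loop version from the question, written as a comprehension.
loop_result = [int(k) for i in arr for j in i for k in j if k < 90]

masked = arr[mask]
flat_masked = arr.ravel()[mask.ravel()]

print(masked.tolist() == loop_result)
print(np.array_equal(masked, flat_masked))
```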
Does anyone know a good book or course to learn about machine learning algos like xgboost, random forests, decision trees, etc
try "data science from scratch". read the most recent edition unless you're already advanced with python.
is it the one written by joel grus?
I believe so
thanks! I'll check it out
If you attend a university, see if it's in their online library
any monte carlo youtube that i can watch? Trying to get into simulation to fill up my holiday and monte carlo seems to be the buzzword so might start with there
hay guys i want to take the informational fact from a paragraph and put it as bullet points what should i start with?
have any of you tried to predict 2 variables ?
or we have to do it separately by predicting first var then the next var by choosing that as a predicting var
?
You can predict 2 or more variables at once with multi-output (multivariate) regression, e.g. a linear regression fit on a 2D array of target
values
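A minimal sketch of predicting two targets in one go with ordinary least squares, on made-up data. The names `true_W`, `X1` and `W` are illustrative; scikit-learn's `LinearRegression` also accepts a two-column `y` in the same spirit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                             # hypothetical features
true_W = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])  # made-up coefficients
Y = X @ true_W + rng.normal(scale=0.01, size=(200, 2))    # two targets per row

X1 = np.column_stack([X, np.ones(len(X))])   # add an intercept column
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # one solve fits both targets

predictions = X1 @ W                         # shape (200, 2): one column per target
print(W.round(2))
```

For a linear model this is equivalent to fitting each target separately; fitting them one after another, using the first prediction as an input to the second, is a different design (a regressor chain).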
Can you give an example paragraph and what you want to extract?
yeah sure just a sec
Sample paragraph :
```The inflated style itself is a kind of euphemism. A mass of Latin words falls upon the facts like soft snow, blurring the outline and covering up all the details. The great enemy of clear language is insincerity. When there is a gap between one's real and one's declared aims, one turns as it were instinctively to long words and exhausted idioms, like a cuttlefish spurting out ink. In our age there is no such thing as "keeping out of politics." All issues are political issues, and politics itself is a mass of lies, evasions, folly, hatred, and schizophrenia. When the general atmosphere is bad, language must suffer. I should expect to find - this is a guess which I have not sufficient knowledge to verify - that the German, Russian and Italian languages have all deteriorated in the last ten or fifteen years, as a result of dictatorship```
Like some of the points should be
*The inflated style itself is a kind of euphemism
*The great enemy of clear language is insincerity
*All issues are political issues, and politics itself is a mass of lies, evasions, folly, hatred, and schizophrenia.
@fathom ruin let me get back to you on this. It's an interesting question.
sure
thanks btw
ping me when u find any info on this, i am trying to find things as well
@fathom ruin just so we're clear, you're just trying to classify which sentences do or do not have true/false statements, yes? You're not trying to determine if the statement is actually true?
yep i am not trying to determine if the statement is actually true
Okay great. That greatly simplifies the problem 
i have been trying to do this for hours and here you be like its a piece of cake lol
I literally have 0 idea on which way to get the points
I didn't say it's easy, it's just easier than detecting misinformation.
yeah k nice lol
I'm trying to identify what sets your three bullet points apart from the sentences that aren't of interest. The last one contains a lot of opinions.
i mean it was just a example
maybe we should just remove the wide option points and do the remaining and figure out later?
One thing they all have in common though: the subjects of each sentence are third person nouns that aren't people, and the verb is a form of "to be". You might actually be able to solve this with rules.
hmm thats something i didnt know
i used a package to seperate verb, noun and other types in a paragraph apart
Spacy?
but it gave the verb WORDS i am not sure how to get the setence
yep
Yay!
so like where should i go next
!paste
how do i get the sentence where the verb is?
Can you show the code?
sure
https://paste.pythondiscord.com/minewahoqe.properties this is not only spaCy, I am also cutting up my own passage with nltk
if u want i will just remove those and keep the spacy alone
https://paste.pythondiscord.com/amenubilef.lua this is without the previous code and just spacy
but this just returns all the verbs
not as a "sentence"
spacy can divide it up into sentences for you.
how?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('pretend this is a long paragraph with multiple sentences.')
for sentence in doc.sents:
    # do stuff with sentence, e.g. print its text
    print(sentence.text)
let me try that
but this just splits the paragraph by . should i check for verb and split?
it doesn't just split it by punctuation, no. it would, for example, know that "I went to Dr. Johnson because I was sick." is one sentence and not two.
oh wait yeah
for that original para, it gave me this
*The inflated style itself is a kind of euphemism.
*A mass of Latin words falls upon the facts like soft snow, blurring the outline and covering up all the details.
*The great enemy of clear language is insincerity.
think its working
Great!
thank you so much

totally not related to this but when writing a text file is there a way to format it? now it just writes it all in a single line
for point i can do \n before every points
but for just normal paragraph
' '.join(list_of_strings)
ohk thank you again
i want it for classification
if i have a df with missing values (Firm Age), how am i able to make the value = previous one + 1
i was thinking of ffill
but im not sure
did you look into .interpolate?
hi. i would lke to know how to make a tensorflow input layer for a dataset which is like this [[1,2,3,4,5,6,7], [1,2,3,4,5,6,7]].
i need to input each element of the subset into a node
@glad mulch you can use .shift(1)
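A sketch of "previous value + 1" that also handles several missing rows in a row, and that never carries an age over from one company to the next. The grouping column `Security`, the helper `fill_previous_plus_one` and the data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: two securities, some ages missing.
df = pd.DataFrame({
    "Security": ["A", "A", "A", "A", "B", "B", "B"],
    "Firm Age": [5, np.nan, np.nan, 8, np.nan, 2, np.nan],
})

def fill_previous_plus_one(ages):
    missing = ages.isna()
    # Rows since the last observed value: 0 on observed rows, 1, 2, ... in a gap.
    steps = missing.astype(int).groupby((~missing).cumsum()).cumsum()
    # Last observed age plus the number of rows elapsed since then.
    return ages.ffill() + steps

df["Firm Age"] = (
    df.groupby("Security")["Firm Age"].transform(fill_previous_plus_one)
)
print(df)
```

A gap at the very start of a security stays NaN, because there is no earlier value to count up from.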
How do you form a sentence from ngrams
What does this mean?
train_metrics = pd.DataFrame({'MAE': mae_train, 'MSE': mse_train, 'RMSE': rmse_train})
train_metrics.reset_index(drop=True, inplace=True)
train_metrics.head(10)
I'm getting an error when I want to pass the results of the SVR model
I'm trying to help a friends son with their machine learning homework and they are using a technique to estimate a PMF I don't think I've seen before.
They are constructing a probability distribution by taking emails that are categorized as spam, and creating a frequency histogram of every word that appears in each email. Then generating a probability by taking each frequency and dividing by the sum of all the other words appearances
But instead of doing just that, they are adding 1 to each frequency and dividing by the sum plus the number of distinct words to account for the addition of those 1s
It seems like some sort of finite population correction factor, the resources this student was provided is riddled with typos everywhere and I don't know why an "n+1" correction in this manner makes sense
Put in another way, if their probability distribution for their data is p = (x_1/n, x_2/n, ....., x_m/n). They adjust it to p = ((x_1 + 1)/(n+m), ....., (x_m + 1)/(n+m))
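This adjustment is commonly known as add-one (Laplace) smoothing: every word in the vocabulary gets one extra pseudo-count, so a word that never appeared in spam still has a small non-zero probability instead of zeroing out a product of word probabilities. A minimal sketch with a made-up word list; `smoothed_pmf` is a hypothetical helper name.

```python
from collections import Counter

def smoothed_pmf(words, vocabulary):
    """Add-one smoothed estimate: (x_i + 1) / (n + m) for each vocabulary word."""
    counts = Counter(words)
    n = sum(counts[w] for w in vocabulary)   # total observed word count
    m = len(vocabulary)                      # number of distinct words
    return {w: (counts[w] + 1) / (n + m) for w in vocabulary}

spam_words = ["free", "free", "win", "money"]      # hypothetical spam tokens
vocabulary = ["free", "win", "money", "hello"]     # "hello" never appears in spam

pmf = smoothed_pmf(spam_words, vocabulary)
total = sum(pmf.values())
print(pmf, total)
```

The m in the denominator is exactly what keeps the adjusted values summing to 1 after adding 1 to each of the m counts.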
mae_train etc. are scalars, should be lists.
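A sketch of what that looks like in practice, assuming the three metrics are plain numbers (the values below are made up): pandas cannot infer the number of rows from scalars alone, so wrapping each in a list gives a one-row frame.

```python
import pandas as pd

mae_train, mse_train, rmse_train = 0.8, 1.0, 1.0   # hypothetical scalar metrics

error_message = None
try:
    pd.DataFrame({"MAE": mae_train, "MSE": mse_train, "RMSE": rmse_train})
except ValueError as err:
    # pandas asks for an index when every value is a scalar
    error_message = str(err)

# Wrapping each scalar in a list yields a single-row DataFrame.
train_metrics = pd.DataFrame(
    {"MAE": [mae_train], "MSE": [mse_train], "RMSE": [rmse_train]}
)
print(error_message)
print(train_metrics)
```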
problem with that is that if there is a NaN after with a different security name it would also interpolate to that right
@thorn bobcat there's a few, just ask the question
this is the question
this is how arabic looks like
[Arabic sample text, garbled in extraction]
first of all it'll have to be tokenized in reverse instinctively.
but how would a transformer even work with arabic?
cause the sentences don't have a clear set structure
@thorn bobcat what do you mean that they don't have structure?
Depending on what you want to do, you could just use a pretrained Arabic model and save yourself some time
This is because it is different from most languages in a few ways.
It is written from right to left.
It uses its own set of characters that are unrecognizable to speakers of other languages.
Vowels are omitted when it's written. It has a complex and rich grammatical structure, for example, pronouns are embedded in the words themselves in many cases.
It is much more fluid than most other languages as sentences don't conform to the subject-verb order that is typical of English.
All of this makes it harder to learn and leads to a larger risk of ambiguity than would exist in most other common languages.
@thorn bobcat I don't agree with your assessment that these properties make it harder
Our largest model, ARAGPT2-MEGA, has 1.46 billion parameters, which makes it the largest Arabic language model available.
https://arxiv.org/pdf/2012.15520.pdf
I can't pretend to know anything about Arabic but I don't see why any of that means it's not modelable
it was actually some sites assessment.
Anyway what sort of chat bot do you want to make, something conversational, question answering etc?
something conversational
able to answer philosophical questions
and give legal advice
ARAGPT2 is a stacked transformer-decoder model trained using the causal language modeling objective. The model is trained on 77GB of Arabic text
I really wanna learn about transformers but don't know where to start really..
Once you have the ngrams and the most common frequency, how do you form a sentence from them?
I can't seem to find anything on S.O
(Using NLTK btw)
@thorn bobcat https://arxiv.org/abs/1706.03762 is the best place to start on learning transformers
For specifically making a chat bot, you are best to use a pretrained model as opposed to making your own
What do you mean make sentences?
Create sentences from ngrams
There used to be a method in NLTK called generate but it's deprecated now
Stelercus might know since he recommended ngrams to me yesterday, but overall I've been trying to search for the answer and for some reason there are zero answers on it
Where are you getting your generate from, I can see it in the nltk.text module fine, no deprecation warnings
nltk.models.ngrams or something along the lines of that
I can't find any modules with that name or similar locally or in the docs
Nevermind it is in an old version
People who insist that a certain language is especially "complex and rich" as compared to other languages usually have ulterior motives. Every language is complex and rich.
And there's nothing special about subject-verb-object word order.
Anyway, I think transformers should work just as well for Arabic.
https://stackoverflow.com/questions/68705057/cleaning-pandas-column-on-specific-data-type
I was hoping I could get help on this question I posted
could be true because i got this from a link to website marketing their nlp toolkit
prolly just fine-tune it on your dataset
so I looked it up and someone told me it'll be computationally expensive to train a model from scratch
https://github.com/aub-mind/arabert I want to work on this
wanted to do something like this but I understand now it'll cost a lot to do it from scratch
So I'd like to take what they did and improve it.
how?
I'd like to train it on ancient scriptures
make him more inclined to use the new data over the old data although the old data would still exist.
I'd like to also give him a face and apply first order motion.
what?
that's oddly niche - better fine tune it
because there aren't enough ancient scriptures to constitute a sizeable amount for training
what is fine tuning? how do I begin to grasp this. From the most basic level to complex concepts.
check out a few articles, start with the official paper
I'd like to also give him a face and apply first order motion.
though I think you want to do something else? can you elaborate on this?
I want to generate a voice for this agent.
and a face that has lips that move matching the words
well ancient scriptures and law.
Idk would seem fascinating talking with an ancient mid eastern philosopher.
assuming you know arabic, fine-tuning is your best bet
unless you have a ton of data and compute
I know arabic
what do you mean by that?
assume I got a corpus of about 100 movie subtitles and a 1000 books for starters with an average of 150 pages.
as in a few hundred GBs of data, and few hundred GPUs
that's small - you can only fine tune it
I am gonna be using the free version of Google Colab.
should be enough for fine-tuning
maybe
do you have a CPU with huge RAM?
its free collab? idk random
Each user is currently allocated 12 GB of RAM
As of October 13, 2018, Google Colab provides a single 12GB NVIDIA Tesla K80 GPU that can be used up to 12 hours continuously.
no, on your own PC
4gb ram
well, leave it and use Colab then
do I need to understand about transformers to fine tune it?
I want to understand them at the most basic level
I meant transformers and multi-headed attention
self- attention
that kind of stuff
to make it very basic, they represent words in way that takes context into account.
I might be conflating transformers with BERT a bit.
self.fc1 = nn.Linear(input_size, 50)
self.fc2 = nn.Linear(50, input_size)```
anyone know the naming convention? used
why fc?
it stands for "fully connected"
fully connected, because every neuron in a layer is connected to every neuron in the preceding and following layers
gm, you might be interested to know, I recently got a data science-related position with a large US company. I have absolutely no idea how. I must have deceived them.
so I'm trying to do the MNEST classification challenge but I want to do it with https://www.kaggle.com/salmaneunus/rock-classification or https://github.com/morrisfranken/glyphreader
congratulations!
can someone tell me which would be easier to do?
I'm actually planning on applying to a US university for a master's degree
doing research now
fake it till you make it.
you: I'm applying to a master's program
everyone: you don't have a masters?
did you mean MNIST?
yea
shouldn't stop you.
uh
well by definition
MNIST is done with the MNIST dataset...
do you mean like
you want to perform a similar task (multiclass image classification) and are asking which dataset might be better to work with?
my tensorflow results are all in lists like this [result] how do i make them floats so i can see the accuracy?
what do you mean results
like the predictions?
yeah
so you have like
its just one float in a list
show code
just the result
predictions:[[47.402496]
[47.278564]
[47.387936]
[47.897003]
[48.52338 ]
[48.993202]
[49.162148]
[49.390816]
[49.802197]
[49.949066]
[50.186504]
[50.12692 ]
[50.034527]
[49.935844]
[49.875698]]
actual:[47.11750031 47.18000031 47.48749924 47.81000137 48.50500107 48.83750153
48.92250061 49.25 50.02500153 49.875 50.15499878 49.73749924
49.71749878 49.80749893 49.8125 ]
the predictions is the problem
yea with the same code used for mnist
yeah
you can also use .ravel() or .flat
or maybe it's .flat() it's been a long time
since I worked with numpy
something like that
will that also allow the accuracy number to show?
!e
import numpy as np
a = np.array([[1, 2, 3]])
b = np.array([[4], [5], [6]])
print(a - b.flat)
print(a - b.ravel())
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | [[-3 -3 -3]]
002 | [[-3 -3 -3]]
that depends on how you're calculating it
that looks like regression though
so why are you talking about accuracy?
i want to see the accuracy but since its in that nested list format the accuracy number is 000000e00
i need it since i have to document the prediction accuracy
but the point is
accuracy is a thing of classification
you're doing regression, right?
i dont really know since this is the first network i made from scratch and i am predicting stock prices
do you know the difference between classification and regression
?
i know what classification is. but not regression
okay
so like
classification involves discrete outcomes
e.g.
is this person positive or negative for COVID
ok
regression involves continuous outcomes
in this case, stock prices
because like
it's not just 1 or 0, right
it can vary continuously from 0 all the way up to infinity (theoretically)
so what you're doing is regression
oh ok
accuracy is the % of correct predictions
but that doesn't make sense for regression, right?
say you predict 45
and the actual price is 40
yeah you are right
you're wrong, but how wrong matters
that's a lot better than predicting 500, right?
so we don't use accuracy for regression
there are other metrics
the most common is RMSE
root mean square error
like loss
you can Google that
yes but no
loss is a general term
that tells the model "how wrong" its prediction is
it can apply to classification too
so there are different loss functions
depending on your task
ok then i will find some in the docs
i will use mape
now it seems to work like a charm. thanks
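A minimal sketch of the regression metrics discussed above, using the first four values from the pasted output and assuming NumPy arrays. It also shows why the nested `[[...]]` shape matters: without flattening, subtraction broadcasts to a 4x4 grid instead of pairing each prediction with its actual value.

```python
import numpy as np

# First four values from the output pasted above, shortened for illustration.
predictions = np.array([[47.402496], [47.278564], [47.387936], [47.897003]])
actual = np.array([47.11750031, 47.18000031, 47.48749924, 47.81000137])

# ravel() turns the (4, 1) column into a flat (4,) array so it lines up
# element-by-element with `actual`.
pred_flat = predictions.ravel()

errors = pred_flat - actual
rmse = np.sqrt(np.mean(errors ** 2))             # root mean square error
mape = np.mean(np.abs(errors / actual)) * 100    # mean absolute percentage error

print((predictions - actual).shape)  # broadcast shape without flattening
print(pred_flat.shape)
print(f"RMSE: {rmse:.4f}")
print(f"MAPE: {mape:.4f}%")
```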
yw!
anyone want to help me with the task of classifying hieroglyphics?
I'm trying to create a function but I'm having trouble writing it correctly
df = df[(df['TVStandWallMount'] == 0) | (df['TVStandWallMount'] == 1)]
def clean_int_col(df, col):
    df = df[(df[col] == 0) | (df[col] == 1)]
    return df
what do you want to do?
I have some integer columns that I am trying to clean with df = df[(df['TVStandWallMount'] == 0) | (df['TVStandWallMount'] == 1)]
to retain the binary values
I have a bunch of them and I want to do
for col in df.columns:
clean_int_col(df, col)
uh.
so you want
wait
I'm confused
so
basically
you want
to take out
the non-numeric values
yes?
I want to take everything that is not 1 or 0. Other possible values would be 11, 10, or some other numeric
take out anything that's not 1 or 0
okay
so
for any row
in which
any column
has a non 1 or 0 value
remove that?
yes
df = df[(df['TVStandWallMount'] == 0) | (df['TVStandWallMount'] == 1)]
this works
but I'd like to iterate over all my columns and apply that
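One way to apply the 0/1 filter to several columns at once, sketched on made-up data (only `TVStandWallMount` is a column name from the chat; `OtherFlag` and the values are hypothetical). Note that the looping version would also need `df = clean_int_col(df, col)`, because the filter returns a new DataFrame rather than changing `df` in place.

```python
import pandas as pd

# Hypothetical data; only 'TVStandWallMount' is a column name from the chat.
df = pd.DataFrame({
    "TVStandWallMount": [0, 1, 11, 1],
    "OtherFlag": [1, 0, 1, 10],
})

binary_cols = ["TVStandWallMount", "OtherFlag"]

# isin([0, 1]) gives a True/False table; all(axis=1) keeps a row only if
# every listed column holds 0 or 1.
mask = df[binary_cols].isin([0, 1]).all(axis=1)
cleaned = df[mask]

print(cleaned)
```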
but they're not na
they will be
the 'wrong' values are something like 11 or 10
oh
sorry I'm not focusing hard enough
wait then
why is there to_numeric above
so there are also cases
where they're not strings?
uh let me think about this for a moment
I took out the to_numeric part
I sent something wrong at first
I edited the message
so that checks the values no? doesnt drop the rows right?
i have a set of images that look like.
to train them using an mnist classifier do i need the position of the object in the image or just the label?
the result of that expression
is a df without those rows
can someone help me prepare my dataset?
Does anyone have anything on handwriting with tf, like generating a page of writing? I saw something like it on reddit but cant find anything short of the digits thing on the tf website.
def calculate_correlation(self, feature_one, feature_two):
    feature_one_data = []
    feature_two_data = []
    for data in self.data_list:
        feature_one_data.append(data[feature_one])
        feature_two_data.append(data[feature_two])
    feature_one_mean = statistics.mean(feature_one_data)
    feature_two_mean = statistics.mean(feature_two_data)
    feature_one_sample_std = statistics.stdev(feature_one_data)
    feature_two_sample_std = statistics.stdev(feature_two_data)
    mean_diff_sum = 0
    for k in range(len(feature_one_data)):
        mean_diff_sum += (feature_one_data[k] - feature_one_mean) * (feature_two_data[k] - feature_two_mean)
    print(mean_diff_sum)
    corrcoef = mean_diff_sum / (feature_one_sample_std * feature_two_sample_std)
    return corrcoef
So, I am trying to calculate correlation coefficient by using this class method. self.data_list is a list of dictionaries and contains data such as age, bmi,insurance charge, smoker(boolean), sex etc. I want to calculate correlation coefficient of two features. Normally, I should get a value between -1 and 1. However, when I run this function to test it, I noticed that I get absurd results like 389,500. There must be something wrong with my calculation but I couldn't figure it out. Any ideas what I do wrong?
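A likely cause, sketched under the assumption that the data is plain numeric lists: the sum of products of deviations is never divided by n - 1, so it is not yet a covariance. Since `statistics.stdev` is the sample standard deviation, dividing the sum by n - 1 first brings the result back into [-1, 1]. The function name and the sample values below are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Divide by n - 1 to get the sample covariance, matching stdev().
    covariance = cross / (n - 1)
    return covariance / (statistics.stdev(xs) * statistics.stdev(ys))

ages = [19, 25, 33, 47, 52]                # hypothetical values
charges = [1600, 2100, 4400, 8200, 9000]   # hypothetical values
r = correlation(ages, charges)
print(r)
```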
what would be the best way to go about removing rows where the numerical values are all 0?
i've tried a for loop to iterate over every row but i don't think that's very wise
i think another approach might be to "keep" the rows if there's a 1 present in any of them but i'm not sure if i know of a function that does this
df[(df != 0).any(axis=1)]
hmm didn't seem to work
as in no changes have been made to the dataframe
ye
that creates
a new DataFrame
you need to assign it to a variable, of course
I have
i've assigned it to df_gm and it's the exact same as the original dataframe it seems
same shape
one of my columns is text content, would that interfere with your method?
axis = 1 is columns isn't it?
yeah that would interfere since gm is just keeping all the rows that have any non-zero value in them
yes
you can do this
df[(df.select_dtypes('number') != 0).any(axis=1)]
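A small sketch of why the text column interferes, on a made-up frame: a string is never equal to 0, so every row counts as having a non-zero value unless the comparison is limited to numeric columns.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
# dtype=object is set explicitly so the text column behaves like plain strings.
df = pd.DataFrame({
    "text": pd.Series(["a", "b", "c"], dtype=object),
    "x": [0, 1, 0],
    "y": [0, 0, 2],
})

# Whole-frame comparison: the text column always counts as "non-zero",
# so nothing is dropped.
kept_all = df[(df != 0).any(axis=1)]

# Comparing only the numeric columns drops the all-zero row.
kept_numeric = df[(df.select_dtypes("number") != 0).any(axis=1)]

print(len(kept_all), len(kept_numeric))
```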
How do I make a grid where the x number line and y number line are thicker than the other grid lines? Like this:
I'm trying to create a deep q learning environment, similar to snake which tracks the position of certain things. How do I deal with their position (delta y, delta x) being null or undefined?
Should I assign it a value it can never reach e.g. 100,100 or allocated an input which can be only 1 or 0 depending on whether these parameters should be ignored
is this in matplotlib? if so then something like `for axis in ['top','bottom','left','right']: ax.spines[axis].set_linewidth(0.5)` would change the width of those lines, where axis is each of the 4 spines and ax is your subplot axes object
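The spines are the border of the plot. If the goal is thicker lines through x = 0 and y = 0 on top of a thin grid (an assumption about what the picture showed), one option is `axhline` and `axvline`. A rough sketch, with arbitrary limits and widths:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)

# Thin grid lines everywhere...
ax.grid(True, linewidth=0.5)

# ...and thicker lines through y = 0 and x = 0 for the two number lines.
x_line = ax.axhline(0, color="black", linewidth=2)
y_line = ax.axvline(0, color="black", linewidth=2)

fig.canvas.draw()  # render once; use plt.show() in an interactive session
```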
it's easier if you provide code and error messages as text.
delete the "staticmethod" decorator from recommend
mkj is in demographic filter not recommend
How can I assure that i is always a int, not float?
for i in range(0, 2**25):
    step = 0
    print(i)
    while i != 1:
        if i == 0:
            break
        elif i % 2 == 0:
            i /= 2
            step += 1
            print(i, end=" ")
        elif i % 2 == 1:
            i = 3*i + 1
            step += 1
            print(i, end=" ")
    print(f'\nAmount of steps: {step}')
please put spaces on either side of infix operators; i % 2 == 0, not i%2==0
what should i do?
do //= instead of /= so it's floor division
what should i do?
I don't know.
Spaces only for cleaner code or do they have a purpose?
yes, easier to read
they don't change how it's executed, but it's best to present others with readable code.
Division returns a float rather than an int, since division between integers is the only one (among addition, subtraction, and multiplication) that doesn't always return an integer, mathematically speaking.
Makes sense, didn't know that // is such a big change
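A compact sketch of the floor-division fix, wrapped in a function for illustration (the name `collatz_steps` is made up here). With `//=`, the value stays an `int` the whole way through.

```python
def collatz_steps(n):
    """Count the steps for a positive integer n to reach 1."""
    steps = 0
    while n != 1:
        if n % 2 == 0:
            n //= 2          # floor division keeps n an int
        else:
            n = 3 * n + 1
        steps += 1
    return steps

print(type(6 / 2), type(6 // 2))  # true division vs floor division
print(collatz_steps(6))
```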
how to solve that?
I'm gonna try fixed code once my pc finishes this code and stops being on fire
Also, how can I check the time needed to execute whole code?
I'm curious if by increasing the power of two by 1, the time needed for execution increases exponentially
Okay, it works fine now, except still being on fire
dont pass mkj into recommend
They also delineate things, it's probably easier to programmatically lint/search code with spaces than code without
For simple measurement you can use time.perf_counter
You could use timeit things but I don't recommend it
I would prefer just perf_counter at specific points in your script (usually start, end), and run the script multiple times to get multiple readings than run timeit
So if I want to use this function, I need to import time and type that function at start and end of code?
Sorry, but it tells you exactly what's wrong: there isn't a between attribute on a DataFrame, you will have to sort that out
Hi guys, I want to ask you if someone worked on getting high season of each product/item in an e-shop. what is the best approach you use or do you have some articles that might help. In another way, I want to know each product's season by variance of sales when it starts and when it ends and the season length.
this is my full code. what should i write in my code?
You are joking if you think im gonna rewrite your code for you
I mean can you give me a clue?
just a clue, not to rewrite my code
Your dataframe doesnt know what between means
If you look a couple lines below you might see what you are missing
data_pns3.between, vs data_pns3.mkj.between
Oh yeah i understand
Thank you!!
!e
from time import perf_counter, sleep
start = perf_counter()
# your code here
sleep(1) # to simulate things happening
end = perf_counter()
print(end - start)
@chilly geyser :white_check_mark: Your eval job has completed with return code 0.
1.0050674946978688
Thanks a lot, is there a documentation for this module on official page?
!d time
This module provides various time-related functions. For related functionality, see also the datetime and calendar modules.
Although this module is always available, not all functions are available on all platforms. Most of the functions defined in this module call platform C library functions with the same name. It may sometimes be helpful to consult the platform documentation, because the semantics of these functions varies among platforms.
An explanation of some terminology and conventions is in order.
• The epoch is the point where the time starts, and is platform dependent. For Unix, the epoch is January 1, 1970, 00:00:00 (UTC). To find out what the epoch is on a given platform, look at time.gmtime(0).
Thanks, gonna check
If I get Time: 2.1679996279999614e-05, it means code was executed within milliseconds?
Yes-kinda?
If you're doing benchmark on small things you might want to do timeit.timeit
There's also timeit.repeat
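For reference, `perf_counter` reports seconds, so 2.17e-05 is roughly 21.7 microseconds. For something that short, a sketch with `timeit.repeat` might look like this. The `work` function is a hypothetical stand-in for the code being measured.

```python
import timeit

def work():
    # Stand-in for the code being measured (hypothetical workload).
    return sum(range(1000))

# Each entry is the total time in seconds for `number` calls.
runs = timeit.repeat(work, number=1000, repeat=5)

# The minimum is usually the least noisy estimate.
best_per_call = min(runs) / 1000

print(f"{best_per_call * 1e6:.2f} microseconds per call")
```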
Up to the line plt.subplots(2) there are no plots
oh my god im so dumb
Greetings. So I have two SERIES:
- tst - has fake data,
- usr - has true data.
I am trying to check .isin() on the third series, which I made a list:
third_s = [i for i in df['Some_Col']]
The thing is, that both tst and usr returns the same results when checking isin(). I tried:
# all syntax are correct and works on my PC
tst = # ... has fake data series
usr = # ... has data which is also in third_s
third_s = [i for i in df['Some_Col']]
# 1st approach - .empty:
if post.isin(third_s).empty:
print('Yes it is')
else:
print('No it is not empty') # why tst returns this if it is not in third_s?
# 2nd approach - .bool:
if post.isin(third_s).bool:
print('Y') # again, why tst returns that as well TST HAS FAKE DATA
else:
print('N')
Question: I need to skip in for-loop all tst values that are NOT IN third_s. Any ideas how?
what are you trying to actually do?
[i for i in df['Some_Col']] is the same thing as df['Some_Col']
but to answer your specific question @willow spindle, .empty does not check if "all values are false". it checks if one of its axes is length 0. you should read the docs instead of guessing. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.empty.html
i assume you are looking for .any()
if post.isin(df['Some_Col']).any():
print('There is at least one value in "post" that is also in "Some_Col".')
else:
print('There is no value from "post" that is also in "Some_Col".')
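A sketch of the difference on made-up series (the values are hypothetical stand-ins for `tst`, `usr` and `Some_Col`). `.empty` only asks whether there are zero elements, while `.any()` asks whether at least one element is True. Also, `.bool` without parentheses is just a method object, which is always truthy. The last part shows one way to loop over only the values that are present.

```python
import pandas as pd

# Hypothetical stand-ins for the series described above.
df = pd.DataFrame({"Some_Col": ["a", "b", "c"]})
tst = pd.Series(["x", "y"])   # values not in Some_Col
usr = pd.Series(["a", "z"])   # one value in Some_Col

tst_mask = tst.isin(df["Some_Col"])
usr_mask = usr.isin(df["Some_Col"])

# .empty is False for both: each mask has two elements.
print(tst_mask.empty, usr_mask.empty)

# .any() distinguishes them.
print(tst_mask.any(), usr_mask.any())

# Keep only the values that do appear in Some_Col, then loop over those.
for value in usr[usr_mask]:
    print(value)
```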
I have built a model time series forecasting with TensorFlow dataset creator. At this link https://www.tensorflow.org/tutorials/structured_data/time_series?authuser=1#data_windowing.
When I set like this it work well.
But I set:
It doesn't work.
Can I solve a classification problem with tensorflow time series dataset?
Hi folks. I'm looking at some covid related dataset and trying to filter to a specific row and drop Unnamed columns. Could anyone help fix my code?
import pandas as pd
df = pd.read_excel('/tmp/Covid-Publication-06-04-2021.xlsx',
engine='openpyxl',
skiprows=11,
sheet_name=['Total Beds Occupied','Total Beds Occupied Covid'],
header=1,
usecols="B:NX")
eng_hosps_total_use = \
df['Total Beds Occupied'].loc[df['Total Beds Occupied']['Name'].str.match("ENGLAND", case=True).fillna(False, axis=0)].fillna("").set_index("Name")
eng_hosps_total_use.drop(columns=eng_hosps_total_use.columns.str.match("Unnamed", na=False))
Source: https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-hospital-activity/
Gives:
raise KeyError(f"{labels[mask]} not found in axis")
KeyError: '[False ... ] not found in axis'
Health and high quality care for all, now and for future generations
@lapis sequoia what's wrong with it?
drop doesn't work "in place", you need to use inplace=True or do eng_hosps_total_use = eng_hosps_total_use.drop(...)
@desert oar Same error with inplace=True
oh i missed the error
oh, hah
match returns True and False
not the matching values
oh
ha
Yes. I wondered about passing a list of column names as list. Is it possible to do something like:
eng_hosps_total_use.drop(columns=eng_hosps_total_use.columns.str.match("Unnamed", na=False).values)
unnamed_columns = eng_hosps_total_use.columns[
eng_hosps_total_use.columns.str.match("Unnamed", na=False)
]
eng_hosps_total_use.drop(columns=unnamed_columns, inplace=True)
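A sketch of the same idea on a tiny made-up frame: `drop` expects labels, so the True/False mask from `str.match` has to be turned into column names first (or used directly for boolean selection). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame mimicking the stray "Unnamed" columns.
df = pd.DataFrame({"Name": ["ENGLAND"], "Unnamed: 1": [None], "2021-04-01": [100]})

is_unnamed = df.columns.str.match("Unnamed", na=False)  # array of True/False
unnamed_labels = df.columns[is_unnamed]                  # the actual column names

cleaned = df.drop(columns=unnamed_labels)

# Equivalent: keep the columns where the mask is False.
also_cleaned = df.loc[:, ~is_unnamed]

print(list(cleaned.columns))
```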
df['Total Beds Occupied']['Name']
what's this? are you getting a row with the label Name?
or do you have multi-index columns?
no idea
@desert oar I'm selecting the column "Name" to match for the row of ENGLAND data. Single row.
Hi all, I'm trying to import the module gensim and its doesnt work
This was the solution as suggested by this thread
However, for whatever reason it doesnt work for me
I've got gensim installed
is price null/nan?
!code-block @ebon walrus can you share your code, error messages, and data as text, not as a screenshot? see below
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
alright
@lapis sequoia sorry didn't mean to @ you
lemme format
@desert oar np. Do you have any idea btw why Pycharm ipython console keeps giving me
console_thrift.UnsupportedArrayTypeException: UnsupportedArrayTypeException(type='ExceptionOnEvaluate')
p = model.predict(new_house)
#add a new column 'price' in new_house file to show the model predicted price
new_house['Price'] = p
#export new house price file to local system
new_house.to_csv("new_house_price.csv")
import sys
import os
plt.xlabel('area of house (sq.ft)')
plt.ylabel('Price of House(Dollars.)')
plt.title("relationship plot between area and price")
plt.scatter(price.area,price.price, color = 'red', marker = '+')
plt.plot(df.area, model.predict([['area']]), color = 'blue')
plt.show()
AttributeError Traceback (most recent call last)
<ipython-input-38-cea7655e712e> in <module>
5 plt.ylabel('Price of House(Dollars.)')
6 plt.title("relationship plot between area and price")
----> 7 plt.scatter(price.area,price.price, color = 'red', marker = '+')
8 plt.plot(df.area, model.predict([['area']]), color = 'blue')
9 plt.show()
@desert oar
@marsh beacon We don't allow that type of advertisement here.
no idea. restart the console
no figured it out. Happens when have DataFrame open in SciView inspection. Need to close the dataframe.
@lapis sequoia i see, you have a multi-index column due to using 2 sheets
Yes. The source data excel file has many worksheets inside the workbook. I only need two of them.
ah sorry it's not a multi-index, it's a dict
whichever
they're not the same!
it's important to know that [ on dataframes and series is a complicated operation with a lot of possible behaviors, while on a dict it isn't
so yes it's somewhat important to know what data type you are working with
@desert oar gotchya. How do I modify your code to drop columns with NaN values?
eng_hosps_total_use.drop(columns=eng_hosps_total_use.columns[eng_hosps_total_use.columns.values.isna()], inplace=True)
= AttributeError no isna
i'd do it like this:
import pandas as pd
data = pd.read_excel(
'Covid-Publication-06-04-2021.xlsx',
engine='openpyxl',
skiprows=11,
sheet_name=['Total Beds Occupied', 'Total Beds Occupied Covid'],
header=1,
usecols="B:NX",
)
eng_hosps_total_use = data['Total Beds Occupied'].set_index("Name")
eng_hosps_total_use = eng_hosps_total_use.loc["ENGLAND"]
eng_hosps_total_use.drop(
eng_hosps_total_use.index[
eng_hosps_total_use.index.str.match("Unnamed", na=False)
],
inplace=True,
)
.values is discouraged in favor of .to_numpy(), don't use it
note that this ultimately returns a Series, not a DataFrame
you only have 1 row of data, no reason to keep it as a dataframe
for that matter, why bother "parsing" like this at all? just read the one row you need
price might not be a dataframe if that's the error you get
whats the dataframe then?
the error is float area something
i am saying that you might have accidentally assigned something to price, overwriting the dataframe
Ah, easy
eng_hosps_total_use.dropna(axis=1, how='all')
@desert oar Have you run this code? The dates are being picked up as Float64 dtypes and inserting a 24hr time. Do you have any idea how to correct this?
i ran some code but only for checking the column names
there are other options you can use to control how dates and other data types are handled
check the docs for pandas read_excel
Did that, but no luck. The idea is to make a simple line-bar matplotlib graph. Try to graph the DataFrame.
import matplotlib.pyplot as plt
plt.plot(eng_hosps_total_use)
Dates along the X axis
So column values would be values of X
in the plot
So I need timeseries data. The Columns are dates, but are Float64s not datetime
That's first problem.
@lapis sequoia this code gives me a eng_hosps_total_use as a Series with a DateTime index and int data:
import pandas as pd
data = pd.read_excel(
'Covid-Publication-06-04-2021.xlsx',
engine='openpyxl',
skiprows=11,
sheet_name=['Total Beds Occupied', 'Total Beds Occupied Covid'],
header=1,
usecols="B:NX",
)
eng_hosps_total_use = data['Total Beds Occupied'].set_index("Name")
eng_hosps_total_use = eng_hosps_total_use.loc["ENGLAND"]
unnamed_cols = eng_hosps_total_use.index[
eng_hosps_total_use.index.str.match("Unnamed", na=False)
].tolist()
extra_cols = ['NHS England Region', 'Code']
eng_hosps_total_use.drop(unnamed_cols + extra_cols, inplace=True)
eng_hosps_total_use.index = pd.to_datetime(eng_hosps_total_use.index)
eng_hosps_total_use = eng_hosps_total_use.astype(int)
i'd encourage you to spend time figuring out how it works
Amazing. Data frame, series... Ballache. What's the difference
@desert oar can you give me the shape after your drop
(370,)
So this is 370 rows
a series is a single "column", like a 1-d array, and each element has a label (the "index"). a dataframe is a "table", a collection of several series, where each series itself has a label (the "columns") and each row has a label (the "index").
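A tiny sketch of that distinction with made-up numbers. A Series already carries both the labels (its index, here dates) and the values, so for plotting it can supply x and y on its own, e.g. `plt.plot(beds.index, beds.values)`.

```python
import pandas as pd

# A Series: 1-d values plus an index of labels (hypothetical numbers).
beds = pd.Series(
    [100, 120, 90],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    name="total",
)

# A DataFrame: several Series sharing one index, each with a column label.
frame = pd.DataFrame({"total": beds, "covid": [30, 35, 20]})

print(beds.shape)            # one dimension
print(frame.shape)           # rows x columns
print(type(frame["total"]))  # selecting one column gives a Series back
```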
So I need two series. One for datetime and one for corresponding integer (beds used). Right?
Pass to matplotlib's X and Y each series
Seems like a lot of work when the dataframe is already a collection of series
hey guys, i tried importing numpy and matplotlib in idle but cmd gave me error saying pip is not recognised
You should plot the result of your code. It isn't correct.
do you guys know how to make my graphs stack vertically and horizontally like 4 rows 3 columns . this is what i keep getting
i'm also a stranger on the internet, offering free advice and assistance during gaps in my workday
it looked right when i ran it
but i also posted it more an example of another way to do it, not a definitively correct implementation of whatever you are trying to do
Np
The plot should not be linear is all I was alluding to
I don't know if this is the right channel but does anyone know how to set a hard RAM limit in PyTorch?
My program keeps using all the RAM and it crashes the computer
(this is the right channel)
Well how do I do it? I'm running it on Google Colab and I have 12GB of RAM, and I want to set a limit to that because it keeps using all of it and crashes the runtime
i don't know, i was just trying to answer the first question. maybe it's not possible?
you might need to change how data is loaded into your model
@desert oar why does your
eng_hosps_total_use = data['Total Beds Occupied'].set_index("Name")
eng_hosps_total_use = eng_hosps_total_use.loc["ENGLAND"]
return a series, but my
eng_hosps_total_use = \
df['Total Beds Occupied'].loc[df['Total Beds Occupied']['Name'].str.match("ENGLAND", case=True).fillna(False, axis=0)].fillna(np.NaN).set_index("Name")
returns a dataframe?
I'm doing exactly the same thing with loc. You're just setting the index before looking for the row relating to England, whereas I filter to all of the data for England and finish by setting the index to Name.
Shape difference is (375,) vs (1,375)
because you're passing something array-like/list-like to .loc
with a single label, eng_hosps_total_use.loc["ENGLAND"] returns just that one row as a series, which is fine because i know i only want 1 row
@lapis sequoia
import matplotlib.pyplot as plt
import pandas as pd
with pd.ExcelFile('Covid-Publication-06-04-2021.xlsx') as xlsx:
eng_beds_total = xlsx.parse(
'Total Beds Occupied',
skiprows=11,
nrows=1,
header=1,
usecols="E:NJ",
).squeeze()
eng_beds_total.index = pd.to_datetime(eng_beds_total.index)
eng_beds_covid = xlsx.parse(
'Total Beds Occupied Covid',
skiprows=11,
nrows=1,
header=1,
usecols="E:NW",
squeeze=True,
).squeeze()
eng_beds_covid.index = pd.to_datetime(eng_beds_covid.index)
beds = eng_beds_total.to_frame(name='total').join(
eng_beds_covid.to_frame(name='covid')
)
beds.plot()
plt.show()
Fabulous.
How would I transform my line to make a series? I thought df.transpose() might do this.
what do you mean by "line"?
I quoted you.
if you have a dataframe with exactly one row, you have two options to turn that row into a series:
- .squeeze() as in my code
- use .loc[row label] or .iloc[0]
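A sketch of those options on a made-up one-row frame. On a DataFrame, `.at` and `.iat` take both a row and a column and return a single value, so the sketch uses `.loc` and `.iloc` for pulling out a whole row. It also shows the shape difference asked about earlier: a scalar label gives a Series, a list of labels (or a boolean mask) gives a DataFrame.

```python
import pandas as pd

# Hypothetical one-row frame.
df = pd.DataFrame({"a": [1], "b": [2]}, index=["ENGLAND"])

by_squeeze = df.squeeze()          # collapses the length-1 row axis
by_label = df.loc["ENGLAND"]       # scalar label -> Series
by_position = df.iloc[0]           # first row by position -> Series
still_frame = df.loc[["ENGLAND"]]  # list of labels -> DataFrame

print(type(by_label).__name__, type(still_frame).__name__)
print(by_squeeze.shape, still_frame.shape)
```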
ah
my answer remains
.match returns a boolean series
subsetting a Series by a Series returns another Series (or an Index, in this case)
my code avoids all that stuff entirely by grabbing data out of the xlsx more selectively
no need to "parse" the row labels etc. when you know exactly what row you want
When you work with Pandas and DataFrames is the typical workflow to reduce what you want to series data for operations e.g. plotting etc.
yo yo
in this particular case i actually built it back up into a dataframe. but i pulled a series out of each file because that's all we needed from each file
This function doesn't allow you to specify an engine though. XLRD not good with new xlsx file formats so no idea why yours didn't error.
Don't understand the squeeze.
gah
What is the shape of beds ?
Also, I don't intend to rely on df.plot() as I'd want to call plt directly, e.g. there is no axhline() as there is for plt.
!e ```python
import pandas as pd
df = pd.DataFrame({'a': [1,2,3]})
print(df)
print()
print(df.squeeze())
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | a
002 | 0 1
003 | 1 2
004 | 2 3
005 |
006 | 0 1
007 | 1 2
008 | 2 3
009 | Name: a, dtype: int64
you can get the Axis object after using df.plot
that's the parse method of ExcelFile, not the ExcelFile constructor
i don't know why the constructor isn't documented
pandas/io/excel/_base.py lines 1166 to 1168
def __init__(
self, path_or_buffer, engine=None, storage_options: StorageOptions = None
):```
Ah. God sakes I'm shocking.
with pd.ExcelFile(... ,engine='openpyxl') as xlsx:
works.
@desert oar Is the squeeze on eng_beds_covid responsible for cutting off all of March data?
probably not, more likely i either didn't understand your requirements or made a mistake
oh it's a mistake, i probably didn't do the join right
I think its a union and
beds = eng_beds_total.to_frame(name='total').join(
eng_beds_covid.to_frame(name='covid'),
how='outer',
)
definitely not a union
better?
Yes. It includes only the data where both have same timestamps
Sounds like a union to me. Outer join includes everything.
If I've coded a one-player pong game (with Arcade), how would I go about adding AI to it? I basically want to implement a NEAT algorithm into it, with the fitness function just being the number of hits it can get until it dies. Ideally I'd be able to run multiple samples per generation at the same time (rather than doing one-by-one). If it helps, this is my code: https://paste.pythondiscord.com/oqitemoyuk.py
@desert oar Just speculating here, but do you notice anything unusual about those two plots?
it's an outer join on the index values. a union is different, there's no union "on" anything, a union treats the tables as sets of tuples as in relational algebra
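A sketch of how the join type changes which dates survive, on made-up series with partly overlapping dates. `DataFrame.join` defaults to `how='left'`, which keeps only the calling frame's index. That default may be why rows went missing above, though this is a guess about the real data.

```python
import pandas as pd

# Hypothetical series whose date ranges only partly overlap.
total = pd.Series(
    [100, 110, 120],
    index=pd.to_datetime(["2021-03-30", "2021-03-31", "2021-04-01"]),
    name="total",
)
covid = pd.Series(
    [30, 35],
    index=pd.to_datetime(["2021-03-31", "2021-04-01"]),
    name="covid",
)

left = total.to_frame().join(covid.to_frame())                # default how='left'
inner = total.to_frame().join(covid.to_frame(), how="inner")  # dates in both
outer = covid.to_frame().join(total.to_frame(), how="outer")  # dates in either

print(len(left), len(inner), len(outer))
```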
dip around christmas, total beds rising even as covid beds are falling. would want to see pre-2020 bed data for context
covid beds smooth, non-covid beds not smooth (scheduled c-sections and elective surgeries?)
Needs further context. You're right.
really i'd want to see this going back years
yep.
might also be interesting to subtract covid beds from total beds
yep
beat me to it. Don't provide the code. I'm copypasting
it was a 1 line addition
@desert oar why do you have the squeeze twice for eng_beds_covid
Hi, how do I know if a pth file contains the architecture of the model, not only its "weights"?
A pytorch model.
What you need to do is subtract mv_covid from total to give how many of the total beds are non-covid.
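A sketch of that subtraction on a made-up frame. The column names `total` and `covid` follow the `beds` frame built earlier in the chat; the numbers and the `non_covid` name are hypothetical.

```python
import pandas as pd

# Hypothetical numbers; 'total' and 'covid' are the column names used above.
beds = pd.DataFrame(
    {"total": [100, 110, 120], "covid": [30, 35, 20]},
    index=pd.to_datetime(["2021-03-30", "2021-03-31", "2021-04-01"]),
)

# Element-wise subtraction, aligned on the shared date index.
beds["non_covid"] = beds["total"] - beds["covid"]

print(beds)
```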
it's not exact, right?
duplicates are kept
leaky abstraction
oh that was a mistake, squeeze=True only squeezes columns, but not rows
same reason i didn't use parse_dates=True, it doesn't parse dates in column names
hey, I'm a beginner in python and I wanna know about AI. Where should I start? And how much should I know?
what's your math background?
Nothing
what's the highest math class you've taken?
it's okay if it's just algebra or something. I'm just asking
you'll need to also learn statistics and linear algebra
Is that easy?
I thought programming doesnt require math
AI most certainly does.
whether or not you find stats and linalg easy will depend, but what ultimately matters is that you maintain a positive attitude about learning. because the learning never stops.
That is it
what do you mean, that is it?
Just statistics? How about frameworks or anything?
there are a lot of libraries. numpy, pandas, matplotlib, sklearn, pytorch, tensorflow. but you learn the parts of them that you need as you go.
Like I said, it's important to maintain a positive attitude about learning.
Alright thanks
Can you freelance as data scientist?
does training a model in colab take a long time?
Do you guys think the only good way to learn neural networks is by learning every aspect of it
and understanding all the math and how they are built?
or can you get a good understanding and make a lot of cool AI just by learning tensorflow and mastering that
I thought learning from scratch would be nice and help me understand but after hours and hours of learning gradients equations types activation functions
it just got too much to handle and I would like to make AI with a less info-needed approach which is why i thought tensorflow would be nice
but i am scared that would limit me and what i can make
what exactly are you scared of?
that not learning the core of neural networks and just learning tensorflow would limit what i can create
how did you come to that conclusion?
idk AI just seems like one of those things you need to know completely
which one
you don't need to know everything about how your car works to drive it
I think you need to know the very basics, matrix multiplication, back-propagation, gradient descent, as well as a theoretical understanding of the rest (like the effects of changing hyper parameters or the idea behind transformers for nlp) to get 95% out of machine learning. You can know basically nothing and still get a lot out of it. The biggest thing is going to be experience. Knowing what to use when comes from having done it before, not some complex mathematical understanding.
Yeah I see what you mean
but back prop, gradient descent, and more are still a lot to handle
like these equations and stuff
just seem like a lot to fully understand, and I was just wondering if I would just have to know how back prop works and not get into the math for it
You could get an intuitive understanding and then just move on if you wanted, when starting
In my opinion you can defer the math when learning ml for later, I think people emphasize it too much
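For a feel of what "intuitive understanding" of gradient descent can look like, here is a small sketch that fits a straight line with plain NumPy. The data, learning rate, and step count are arbitrary choices for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 3x + 0.5 plus a little noise (made up for this sketch).
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.01, size=100)

w, b = 0.0, 0.0   # start from a bad guess
lr = 0.1          # learning rate (step size)

for _ in range(2000):
    err = (w * x + b) - y          # how wrong the current line is
    grad_w = 2 * np.mean(err * x)  # slope of the mean squared error w.r.t. w
    grad_b = 2 * np.mean(err)      # slope of the mean squared error w.r.t. b
    w -= lr * grad_w               # step downhill
    b -= lr * grad_b

print(w, b)
```

The loop repeatedly nudges `w` and `b` against the gradient of the error. Frameworks like tensorflow and pytorch compute those gradients automatically for much larger models.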
If my independent variables are highly correlated should I use Ridge Regression?
Hey guys any school student interested in AI here?
Anyone interested in collaborating for this competition can DM me.
https://aischoolofindia.com/waicy-competition/
WHAT IS WAICY INDIA? WAICY India is an online competition for Indian schools & students which engages them to learn and use artificial intelligence (AI) technology to solve real-world problems. AI researchers around the world are harnessing the power of AI for a sustainable future. From solving the toughest environmental challenges to becoming t...
Does anybody know which path planning algorithm is used in Tesla maps?
I got information from online that dijkstra is used in google maps
It could help, yes. It's often a good idea anyway, depending on what you are trying to do. Did you check the VIFs for the fitted linear model?
no
I will do that actually
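As a rough illustration of the VIF check, the sketch below computes variance inflation factors with NumPy only, using the identity that they equal the diagonal of the inverse correlation matrix of the predictors. The helper name `vif` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2).
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)             # unrelated to the others
X = np.column_stack([x1, x2, x3])

print(vif(X))
```

Here the first two columns get very large VIFs while the third stays close to 1. Large values are the usual hint that a penalized model such as ridge may be worth trying.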
how do I rename long column names in a dataset like this? I want to change the column names to only s-1, s-2, s-3, without 'JENJANGPENDIDIKAN_'
can anyone help me? I've been trying to rename the columns for 2 days and still can't get a proper result
!eval ```python
import re
import pandas as pd

def remove_long_prefix(colname):
    return re.sub(r'^JENJANGPENDIDIKAN_', '', colname)

data = pd.DataFrame({
    'JENJANGPENDIDIKAN_s_1': [11,12,13],
    'JENJANGPENDIDIKAN_s_2': [21,22,23],
})
data = data.rename(columns=remove_long_prefix)
print(data)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | s_1 s_2
002 | 0 11 21
003 | 1 12 22
004 | 2 13 23
or equivalently
import re
import pandas as pd
data = pd.DataFrame({
    'JENJANGPENDIDIKAN_s_1': [11,12,13],
    'JENJANGPENDIDIKAN_s_2': [21,22,23],
})
data = data.rename(
    columns=lambda colname: re.sub(r'^JENJANGPENDIDIKAN_', '', colname)
)
print(data)
maybe even better:
import pandas as pd
data = pd.DataFrame({
    'JENJANGPENDIDIKAN_s_1': [11,12,13],
    'JENJANGPENDIDIKAN_s_2': [21,22,23],
})
data.columns = data.columns.str.replace(r'^JENJANGPENDIDIKAN_', '', regex=True)
print(data)
or if you're using python 3.9+
import pandas as pd
data = pd.DataFrame({
    'JENJANGPENDIDIKAN_s_1': [11,12,13],
    'JENJANGPENDIDIKAN_s_2': [21,22,23],
})
data = data.rename(
    columns=lambda colname: colname.removeprefix('JENJANGPENDIDIKAN_')
)
print(data)
THANK YOU SO MUCHHHHHHHHH
now you have four ways to do it
Really, a big thanks to you
If I have missing values in a numerical column would I be fine with replacing those values with the mean? I have been doing this for like every project I did and I think it can be better
but how if I have multiple columns like that?
it shouldn't be any different
it depends on what's missing, how often it's missing, why it's missing, and what you're doing with it. there are no strict rules or unambiguous best practices
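For reference, a minimal sketch of mean imputation across several numeric columns at once. The frame and its column names are hypothetical, and whether the mean is a sensible fill value depends on the caveats above. Keeping a mask of what was missing makes it possible to revisit the choice later.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two numeric columns.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [1000.0, 3000.0, np.nan],
    "city": ["A", "B", None],
})

numeric_cols = df.select_dtypes(include="number").columns

# Keep a record of what was missing before it is overwritten.
was_missing = df[numeric_cols].isna()

# fillna with a Series fills each column with its own value, here the column mean.
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(filled)
```

Swapping `.mean()` for `.median()` is a common alternative when a column is skewed.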
is that manually one by one for each column?
no, the first data = pd.DataFrame line is just creating a new dataframe to demonstrate the solution
I mean if i have another column like that
i encourage you to read all 4 solutions and spend time understanding what they do and why they work
oh, you should probably do it manually for each prefix
you could do it with a single regex but i don't see much value in that
are you using python 3.9?
what is 'regex'?
yes because that's an encoding number
the `re.sub` thing with `r'^prefix'`
what do you mean by that?
i'm using version 3.8.5
ok, then you are not using 3.9
and then?
i'm sorry, i misunderstood before
bad_patterns = [
    '^JENJANGPENDIDIKAN_',
    '^JABATANSTRUKTURAL_',
]
for pattern in bad_patterns:
    data.columns = data.columns.str.replace(pattern, '', regex=True)
the ^ in the pattern means "only match at the beginning of the text"
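A quick sketch of what the `^` anchor changes, plus the single-regex variant mentioned earlier. The second input string and the combined pattern are made up for this illustration.

```python
import re

prefix = r'^JENJANGPENDIDIKAN_'

# The prefix is removed only when it sits at the very start of the name.
print(re.sub(prefix, '', 'JENJANGPENDIDIKAN_s_1'))     # s_1
# Here the same text appears mid-string, so nothing is replaced.
print(re.sub(prefix, '', 'x_JENJANGPENDIDIKAN_s_1'))   # x_JENJANGPENDIDIKAN_s_1

# One pattern covering both prefixes, using alternation inside a group.
both = r'^(?:JENJANGPENDIDIKAN|JABATANSTRUKTURAL)_'
print(re.sub(both, '', 'JABATANSTRUKTURAL_s_2'))       # s_2
```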
Thank you so muchh
can someone give me some tips on how to handle an imbalanced multiclass classification problem like this
do you know how to change column position in a dataset?
use df = df[colnames], where colnames is a list of column names in the order you want
generic list of things to try: weighting, oversampling (e.g. with SMOTE), use gradient boosting which can "focus" on misclassified instances
ok
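On the weighting suggestion, one common scheme is inverse-frequency class weights, sketched below with NumPy on invented labels. The formula is the same heuristic scikit-learn documents for `class_weight='balanced'`. Treat it as a starting point rather than a fix.

```python
import numpy as np

# Hypothetical labels for a 3-class problem with a strong imbalance.
y = np.array([0] * 90 + [1] * 8 + [2] * 2)

counts = np.bincount(y)
n_classes = len(counts)

# Inverse-frequency weights: n_samples / (n_classes * count_per_class).
class_weights = len(y) / (n_classes * counts)

# One weight per sample, looked up from its class.
sample_weights = class_weights[y]

print(dict(enumerate(class_weights.round(2))))
```

With these weights every class contributes the same total weight, so the rare classes are no longer drowned out by the majority class during training.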
No, i mean changing column positions in a larger dataset, not only accessing several columns
you can do arbitrary manipulation on the list of column names
insert, remove, append, etc
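For example, moving one column to the front could look like the sketch below. The frame and column names are placeholders. The same list operations work no matter how many columns there are.

```python
import pandas as pd

# Hypothetical frame; the column names are placeholders.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

colnames = list(df.columns)
colnames.remove("c")      # take "c" out of its current position...
colnames.insert(0, "c")   # ...and put it first
df = df[colnames]

print(list(df.columns))   # ['c', 'a', 'b', 'd']
```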
I have 4725 columns, how can I do it in a simple way?
what exactly are you trying to do
I'm building a recommender system. This picture is the result of a recommendation with 4725 columns. How can I take only several columns from the recommendation table?
which columns do you want? what's the rule for selecting a column?
most of the time you can just use a list comprehension
can you give me example of code?
yes, i want to select several columns
i've given you a lot of code already... do you know what a list comprehension is?
you can even write a for loop and build a list with append
i know, but i don't properly understand it
list comprehension is just looping in 1 cell, right?
yes
if by "cell" you mean "expression"
!e ```python
items1 = [f'number:{i}' for i in range(5)]
print(items1)

items2 = []
for i in range(5):
    items2.append(f'number:{i}')
print(items2)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | ['number:0', 'number:1', 'number:2', 'number:3', 'number:4']
002 | ['number:0', 'number:1', 'number:2', 'number:3', 'number:4']
same thing, 2 different ways to write it
can you give me an example with a dataframe? i tried your way, but i get confused when i want to select the middle columns
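A sketch of how the list-comprehension idea carries over to a dataframe, with a position-based alternative for "middle" columns. The table below is a made-up stand-in for the recommendation table, and the selection rule (a name prefix) is an assumption.

```python
import pandas as pd

# Hypothetical recommendation table; names are placeholders.
df = pd.DataFrame({
    "user": ["u1", "u2"],
    "JENJANGPENDIDIKAN_s_1": [1, 0],
    "JENJANGPENDIDIKAN_s_2": [0, 1],
    "JABATANSTRUKTURAL_x": [1, 1],
    "score": [0.9, 0.4],
})

# Rule-based selection: keep columns whose name starts with a given prefix.
wanted = [c for c in df.columns if c.startswith("JENJANGPENDIDIKAN_")]
by_name = df[wanted]

# Position-based selection: columns 1 up to (not including) 4, the middle three.
by_position = df.iloc[:, 1:4]

print(by_name)
print(by_position)
```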
How can I remove the specified columns of my dataset? I've tried things like `to_drop = ['reproduction_rate', 'female_smokers', 'male_smokers', 'tests_per_case', 'tests_units', 'excess_mortality']` followed by `for x in to_drop: df.drop(x)` but keep getting errors (in this case saying invalid key for 'reproduction_rate' even though that's the column header)
This shows the rough structure (first line is the headers, then the rest is the data, which has a lot of missing values)
yo
what does df.columns return?
also if you want to drop the whole column do df.drop(x, axis=1)
sigh any hardware experts in TPUs?
(on mobile) drop works on rows rather than columns by default, so you have to change the axis. Also you should pass the whole list of column labels that you want to drop. Also the drop method, like nearly all pandas methods, returns a new dataframe with the transformation applied and leaves the original untouched. It's not the same as mutator methods elsewhere in python.
reproduction_rate is the 2nd one on 3rd line
yea so as Stelercus said, you will need to provide an axis
new_df = df.drop(["reproduction_rate", "female_smokers"], axis=1)
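Putting those points together on a toy frame (the data is invented, and only some of the column names from the question are used):

```python
import pandas as pd

# Hypothetical frame with a few of the column names from the question.
df = pd.DataFrame({
    "location": ["A", "B"],
    "reproduction_rate": [1.1, 0.9],
    "female_smokers": [10.0, 12.0],
})

# With the default axis=0, drop looks for a ROW labelled "reproduction_rate".
try:
    df.drop("reproduction_rate")
except KeyError as exc:
    print("KeyError:", exc)

to_drop = ["reproduction_rate", "female_smokers", "tests_units"]

# columns= avoids the axis argument; errors="ignore" skips labels that are absent.
new_df = df.drop(columns=to_drop, errors="ignore")

print(list(new_df.columns))  # only "location" is left
print(list(df.columns))      # the original frame still has all three columns
```

Because `drop` returns a new frame, the result has to be assigned, as in `new_df = ...` or `df = df.drop(...)`.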
I did originally pass in the full list but still got the invalid key error
!d pandas.DataFrame.drop
```
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
```
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.