#data-science-and-ml
yeah it gave an error
astype is a method, not an accessor.
if something causes an error, please always copy and paste the whole error message into the chat.
!docs pandas.Series.str.join
Series.str.join(sep)
Join lists contained as elements in the Series/Index with passed delimiter.
If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the function. This function is an equivalent to [`str.join()`](https://docs.python.org/3/library/stdtypes.html#str.join "(in Python v3.10)").
also: .astype(str) is dangerous because if you have a "null" value like NaN, it will produce the string "nan", not a proper None
in addition to the obvious fact that .astype(str) on a column of lists will produce bad text like "['a', 'b']" which is not what you want
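to make the difference concrete, here is a minimal sketch on a made-up column of token lists (the data is hypothetical, not the asker's dataset). it compares what `.astype(str)` and `.str.join(" ")` each produce:

```python
import pandas as pd

# Hypothetical column where every element is a list of tokens.
tokens = pd.Series([["love", "u", "guy"], ["sick", "really", "cheap"]])

# .astype(str) calls str() on each list, so you get the Python repr of the list.
as_text = tokens.astype(str)
print(as_text.tolist())

# .str.join(" ") joins the items of each list with the given separator.
joined = tokens.str.join(" ")
print(joined.tolist())
```

the first result still contains brackets, quotes and commas; the second is ordinary text.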
that said, it looks like your text is already tokenized. what's the point of joining and then tokenizing again?
they might have wanted to tokenize -> lemmatize -> join
hm. the screenshot is really hard to read, but it looks like some rows are tokenized and some rows are not
like... some rows are strings and some are lists of strings. i.e. a real mess
yess, at the end i want to vectorize it so i want my list as,['a', 'b'] and not [a,b] so i wasnt sure at which part i should convert it to strings
ohh alrightt
what is the type of each element here? are these strings or lists? please show us some example data that is not in a tiny unreadable screenshot
i am following this
it is a list but at the end to vectorize i wanted to convert it to a list of strings
X=dataset['text'].astype(str)
y=dataset.target
so do you understand now why .astype(str) is wrong?
what is the content of dataset['text']?
i see, they have it in their example
this is some questionable code
the data analysis content seems OK, but i would not recommend this article as an example of good code
yeah?? why
it's not bad code either, but it's messy and they do some weird/redundant things
df. shape 
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
dataset['text']= dataset['text'].apply(lambda x: cleaning_punctuations(x))
- this is inefficient because it re-creates the `translator` in every call, `translator =` should be outside the function (there are some considerations w/ respect to local vs global lookup, but that's a different issue)
- `.apply(lambda x: cleaning_punctuations(x))` should be `.apply(cleaning_punctuations)`
- i don't want to criticize their use of english too much because not everyone is a proficient english speaker, and english is a complicated language. but `cleaning_punctuations` should probably be `clean_punctuation`
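put together, those three points could look roughly like this. it's a sketch, assuming the article's `punctuations_list` is equivalent to `string.punctuation`, with a tiny made-up `dataset`:

```python
import string

import pandas as pd

# Assumption: the article's punctuations_list matches string.punctuation.
# The translation table is built once, outside the function.
translator = str.maketrans("", "", string.punctuation)


def clean_punctuation(text):
    """Remove ASCII punctuation using the prebuilt translation table."""
    return text.translate(translator)


# Hypothetical data standing in for the tweets.
dataset = pd.DataFrame({"text": ["hello, world!", "it's fine..."]})

# Pass the function itself; no lambda wrapper is needed.
dataset["text"] = dataset["text"].apply(clean_punctuation)
print(dataset["text"].tolist())
```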
when i used x=dataset['text'].astype.str it gave this
800000 ['love', 'healthuandpets', 'u', 'guy', 'r', 'b...
800001 ['im', 'meeting', 'one', 'besties', 'tonight',...
800002 ['darealsunisakim', 'thanks', 'twitter', 'add'...
800003 ['sick', 'really', 'cheap', 'hurt', 'much', 'e...
800004 ['lovesbrooklyn', 'effect', 'everyone']
!e ```python
x = ['love', 'healthuandpets', 'u']
print(repr(str(x)))
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
"['love', 'healthuandpets', 'u']"
basically it just made a string containing literal python code
that is not what you want, ever
STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))
they did the right thing creating STOPWORDS outside the function, but
- why is it `STOPWORDS` and not `stopwords`? not all globals need to be upper case
- why `str(text)`? it's definitely already a string
# remove punctuation
from string import punctuation as p
def some_func(text):
    return [char for char in text if char not in p]
    # alternatively, to get a string back instead of a list of characters:
    # return "".join([char for char in text if char not in p])
is this not simpler?
also they defined their own stopword list
which is hmm
All languages are complicated and no language is uniquely so.
ohh okay,thanks a lot! i used the above code for removing punctuations but i did not use the lambda function in most of the places
meh, the "standard" stopword lists can be too broad @royal crest , i've done that before
str.translate might be significantly faster. i'm not sure, but i would bet $1 on it at least
time to benchmark
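a quick way to run that benchmark is `timeit`. this is only a sketch: the sample text and repeat counts are arbitrary, and the numbers will differ by machine, so no timings are claimed here:

```python
import timeit
from string import punctuation

# Arbitrary sample text, repeated to make the work measurable.
text = "hello, world! this is a #tweet... isn't it?" * 100

translator = str.maketrans("", "", punctuation)
punctuation_set = set(punctuation)


def with_translate(s):
    """Strip punctuation with a prebuilt translation table."""
    return s.translate(translator)


def with_comprehension(s):
    """Strip punctuation by filtering characters one at a time."""
    return "".join([char for char in s if char not in punctuation_set])


# Both approaches must agree before their speed is worth comparing.
assert with_translate(text) == with_comprehension(text)

for func in (with_translate, with_comprehension):
    seconds = timeit.timeit(lambda: func(text), number=200)
    print(f"{func.__name__}: {seconds:.4f} s for 200 runs")
```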
this is also just questionable string cleaning in general, e.g. it doesn't take @usernames into account
i guess you could argue that words in usernames could be relevant for determining the sentiment of a tweet, but i doubt it
i guess it'll all wash out in feature selection
a better method to remove punctuations would be to use regex ?
str.translate is fine. regex is good if you need to match sequences of more than 1 unicode code point, or if you need to do more complicated replacements
you might want to read about "unicode normalization"
and you might want to familiarize yourself generally with what "unicode" is - python strings are sequences of unicode code points
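a small illustration of why this matters, using only the standard library: the same visible character can be one code point or two, and the two forms don't compare equal until they are normalized.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# They render the same, but they are different sequences of code points.
print(len(composed), len(decomposed))
print(composed == decomposed)

# NFC normalization composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)
```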
yeah i didnt get the 2nd point either, just used it as text
that's good
ohh okay, thank you! i will read about both
but i am unable to vectorize it w/o converting the list into a list of strings. how do i go about converting [a,b] to ['a','b']?
you should have strings like this: "hello this is a tweet", which you tokenize into lists: ["hello", "this", "is", "a", "tweet"]
pandas tries to make the output "pretty", but in the process it can be difficult to see what is actually in each element
i'm not sure how you ended up with the [] stuff. it's better if you show your code
hello fellas
I need help with a project in python
i am new to python so i need help
alright!
i want to develop a program which can identify difference between 'ideal image" and 'difference image' . can anybody help
i think i end up w [] since i have tokenized, stemmed and lemmatized before vectorizing
should i paste it here?
!paste yes, please. see below
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
imo it's bad taste to re-post your question, as if nobody here already tried to help you earlier today
2 people here spent a good amount of time trying to understand the use case and guide you towards a solution
yes i appreciate help
but my ocr results are not quite good
i need help with my ocr result
okay, are you able to share examples? there might not be any real OCR experts here. you also might want to describe if there are any kinds of non-text differences that your system should be looking for
should i post code?
someone else suggested comparing rgb histograms. maybe you can also use some kind of image embedding to compute distances between images in some interesting feature space
here's a tip, instead of this:
english_stopwords = set(stopwords.words('english'))
not_stopwords = ['not']
final_stopwords = set([word for word in english_stopwords if word not in not_stopwords])
you can do this, using the - operation on sets:
english_stopwords = set(stopwords.words('english'))
not_stopwords = {'not'}
final_stopwords = english_stopwords - not_stopwords
the result which i am looking for is the program can highlight 'spelling mistakes' , change in a text in difference image as compared to ideal image, any content which is missing in difference image as compared to original image.
@wicked grove this is part of the problem:
tokenizer=RegexpTokenizer(r'\w+')
dataset['text']=dataset['text'].apply(tokenizer.tokenize)
imo you should be saving the tokenized text into a separate column
tokenizer = RegexpTokenizer(r'\w+')
dataset['tokens'] = dataset['text'].apply(tokenizer.tokenize)
Ohh okayy ill change that
but otherwise this is OK until line 176 when you use .astype
Ah yes! I will try this
all you have to do is replace dataset['text'].astype(str) with dataset['text'].str.join(" ") as per stelercus' suggestion
but i strongly encourage you to think about what the difference is
generally this is inefficient code but it's not necessarily bad. what i will say is that TfidfVectorizer lets you apply pretty much all of these data transformations in one shot
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html see the preprocessor, tokenizer, and stop_words arguments
Yeah i used that because even after i tokenize the result is this [a,b] and not a list of individual strings like ['a','b'] which ig is needed for vectorizing( im not too sure since i watched only a video on that topic)
i don't see any place where you would get a string like "[a, b]"
i think you might need to restart your notebook kernel
(assuming this is in a notebook)
Okayy,thank you so much!
finally, you don't actually have to re-join the strings at the end
you can do TfidfVectorizer(analyzer=lambda x: x) which does not attempt to split any strings
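a minimal sketch of that idea, on made-up pre-tokenized documents. when `analyzer` is a callable, scikit-learn calls it on each raw document and uses whatever it returns as the features, so the lists pass through untouched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents that are already lists of tokens.
docs = [["love", "u", "guy"], ["sick", "really", "cheap"], ["love", "cheap"]]

# The identity analyzer returns each token list as-is, so nothing is
# lowercased, re-split, or filtered by the vectorizer.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

one thing to keep in mind: a lambda can't be pickled, so if the fitted vectorizer needs to be saved, a named function is the safer choice.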
Alright, that is because astype is an accessor and our object is a list, so it is converting the entire list to a string. But for some reason that is not showing in my dataframe?
astype is not an accessor
Special shout-out to .reset_index(), it's a great method
It is not a notebook
i showed you what happens when you "convert a list to a string". it's not what you want.
!eval here, i will show you again:
x = ['a', 'b', 'c']
print(repr(str(x)))
@desert oar :white_check_mark: Your eval job has completed with return code 0.
"['a', 'b', 'c']"
!eval you probably want this:
x = ['a', 'b', 'c']
print(' '.join(x))
@desert oar :white_check_mark: Your eval job has completed with return code 0.
a b c
Yess you showed this to me, but my question is how do you get ['a','b','c'] in the dataframe?
My dataset['text'] column has lists like this [a,b,c] ,like the words in the list arent tokenized like this 'a'
Okay yes i looked it up
Thank youu!! I will refer this and change the code
Also for steps 8,9,10 should i just keep the tutorial as a reference and follow some other code for the confusion matrix,etc
Hello
I am writing to CSV file
I am getting some of columns are not completely filled
Ping me when replying
the highlighted part is getting empty
Had a question, why should there be a space after binary operators?
Hi all, I'm trying to drop the predictions column in a pandas dataframe and getting the following error:
AttributeError: 'DataFrame' object has no attribute 'DEATH_EVENT'
on line:
y = X_full.DEATH_EVENT
X_full.dtypes returns the following:
sex int64
smoking int64
time int64
DEATH_EVENT int64
dtype: object
...Any ideas?
DEATH_EVENT is definitely a column in the dataframe
Try accessing it with .loc rather than dot notation
clever, thanks!
hi
anyone know what this error means in seaborn ? ==> "RuntimeError: Selected KDE bandwidth is 0. Cannot estimate density."
why when i convert a json to csv using pandas, then try to read that csv, i get decoding errors?
gbk cant decode etc
Hey guys, have been scratching my head for two hours trying to figure out a way to add names of teams instead of hue dot points. Please help me
is there any place where i can get beginner questions on numpy,pandas,etc to practice?
yes
where?
Numpy 100 exercise: https://github.com/rougier/numpy-100
That's the style used by pretty much all python developers
hello!
uhm. i have question.
Can you recommend a machine learning course on Coursera?
Last time, you recommended "data science from scratch", so I bought it and wanted to take a machine learning course.
Do I need to master statistics and linear algebra for additional machine learning?
not on coursera but this one is good: https://course.fast.ai/
i want on coursera.
not available on coursera?
What are some methods or tools I can use to distill a small training dataset from a big dataset that when trained on have as good projected performance as the big dataset?
I seem to not have the Google Fu required to find information to read about this.
does anyone know how I can get output node names of an Xception model?
The stanford machine learning course by andrew ng on coursera ,deep learning specialization by andrew ng on coursera
Oh alright!! I looked up the str.join and wanted to ask how do you get ['a','b','c'] in the dataframe?
My dataset['text'] column has lists like this [a,b,c] ,like the words in the list arent tokenized like this 'a'
the words in the list aren't tokenized? if you have one big string that represents a document, and from that you derive a list of individual strings, what are those individual strings if not tokens?
just arbitrary slices of the original string?
Hey all, I'm trying to use pytesseract to read text from an image. This is my current image, but the output produced from it is just nonsense
Hey all, I'm trying to use pytesseract to read an image to extract some numbers. This is the code I'm using to try and extract the numbers from:
output = pytesseract.image_to_string(img,config='--psm 10 --oem 3 tessedit_char_whitelist=0123456789')
And the output I receive is:
iy Qe
One sec, I'll upload the image to imgur too.
And here is the screenshot: https://imgur.com/a/7xYZOd8
Yeah i guess,i expected the output to be ['a','b'] after using regexptokenizer but it was just [a,b] w/o ' ' ( umm is that correct?)
I've got this window written in tkinter now i want to make it so the user can enter list of users and this will result in something like this:
add;Username@something.com;Rolename1
add;Username@something.com;Rolename2
add;Username2@something.com;Rolename1
add;Username2@something.com;Rolename2
How do i make that username list from tkinter Text gets appended with selected things into csv file?
Hi would someone be able to help me coding some loops?
How can I use a custom keras model with opencv?
Alternatively, how can I get output node names for my model
Is there a way to find the accuracy of your results with a tolerance
e.g. I'm predicting final grades for students, my output can be 0-20, right now I'm just checking how often my model is correct, but often when it isn't it is off by only 1, I want to see how often this is
I've stumbled across explained_variance_score and am wondering if that's what I'm looking for, seems like it
Hey everyone! Can someone help with this? Thanks!
I started like this:
`#method for kNN method
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
"""
@param X, np ndarray, feature matrix
@param y, np array, labels vector
@param kfold, Integer number of CV folds
@param n_neighbors, Integer number of neighbors in KNN
@returns floating number
"""
def evaluate_kNN(X,y,kfold,n_neighbors):
    kf = KFold(kfold)
    neigh = KNeighborsClassifier(n_neighbors)`
Seems like many of the TFX.utils functions have been deprecated.
Same with some tfx.components libraries. Found the new imports for some of them while I can't find the rest.
Hey @kind cape!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
⢠If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
⢠If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
You mentioned tolerance... Are you trying to perform a collinearity diagnostic? Tolerance is a metric used to assess the amount of multicollinearity present in a model.
You might wanna provide some more clarity on that part....
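On the other hand, explained_variance_score is closely related to the Coefficient of Determination a.k.a R-squared; the two are identical when the prediction errors average out to zero.
R-squared is the default metric for regression problems in scikit-learn (it is what a regressor's .score() returns).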
This metric measures the amount of variation in your response variable (target) which the explanatory variables (a.k.a features) in your model was able to explain. So the higher the R-squared score the better your model performance.
The best metric to capture what you are actually looking for is MSE or better still RMSE. The lower your RMSE score the better your model performance.
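For the original question (how often the prediction is off by at most 1), one option is to count it directly next to RMSE. This is a sketch with made-up grades on the 0-20 scale; `within_one` is a name invented here, not a scikit-learn metric:

```python
import numpy as np

# Hypothetical true and predicted grades on a 0-20 scale.
y_true = np.array([10, 12, 15, 8, 20, 13])
y_pred = np.array([10, 13, 13, 8, 19, 17])

errors = np.abs(y_pred - y_true)

exact_accuracy = np.mean(errors == 0)  # share of exactly correct predictions
within_one = np.mean(errors <= 1)      # share of predictions off by at most 1
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

print(f"exact: {exact_accuracy:.2f}, within 1: {within_one:.2f}, RMSE: {rmse:.2f}")
```

The tolerance (1 here) is a free choice; changing the comparison to `errors <= 2` gives the share within two grade points.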
Hello guys I was wondering if anyone is experienced with autoencoders for anomaly detection
if you are please @ me
is pytorch easier than TF + keras?
also is TF + keras more well known and has more guides?
I use r2_score
from sklearn.metrics import r2_score
in my opinion keras is much easier
how do I get the accuracy of the prediction in tensorflow?
val = model.predict(images)
???
is it possible to do a time series prediction on super irregular data with a lot of time where the data is zero?
for example, from say june 1-15, there are rejected records every 4 hrs, between them there are none rejected. June 16-30 there are zero rejected records. Is it possible to forecast another 30 days with this type of irregular data?
model.predict returns the output of the model on the input data, it's up to you to compute the accuracy of the predictions against the samples
yeah, I know how to get the accuracy while training but how do I get a prediction during a 'runtime'?
During inference? You do the same thing
I pass the image into my model, how do I get tensor flow to give me a "sureness" because I don't know the answer for the "live testing"
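One common way to get a "sureness" at inference time is to read it off the predicted probabilities. The sketch below uses a hard-coded array in place of `model.predict(images)` and assumes the model ends in a softmax layer, so each row is a probability distribution over classes; if the model outputs raw logits, they would need a softmax first.

```python
import numpy as np

# Hypothetical stand-in for model.predict(images): 3 images, 3 classes.
# Assumption: the final layer is a softmax, so each row sums to 1.
probabilities = np.array([
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
])

predicted_class = probabilities.argmax(axis=1)  # most likely class per image
confidence = probabilities.max(axis=1)          # probability of that class

for label, score in zip(predicted_class, confidence):
    print(f"class {label} with probability {score:.2f}")
```

A high softmax probability is not a guarantee of correctness, so it is best treated as a rough signal.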
Has anyone worked with raspberry before??
gonna write a guide on audio processing, what do you guys recommend I talk about or explain in a fairly simple way?
Thanks
Hi! I'm trying to convert a numpy array into a pd dataframe
When I try to print df.head, i get this:
<bound method NDFrame.head of 0
0 0
1 1
2 0
3 0
4 0
5 0
the first column must be the index, and the second column is the array (now dataframe), but it doesn't return the typical first five values, like a df.head function call usually does. Any ideas?
you didn't call it
Yeah... looks the same as before
I used this:
df_YV = pd.DataFrame(y_valid)
am i missing something?
where 'y_valid' is the np array
you didn't call .head
i.e. you're missing parentheses
you're welcome
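for reference, a small sketch of the difference, with a made-up array standing in for `y_valid`:

```python
import numpy as np
import pandas as pd

y_valid = np.array([0, 1, 0, 0, 0, 0, 1])  # hypothetical stand-in
df_YV = pd.DataFrame(y_valid)

# Without parentheses you get the method object itself, which is what
# prints as "<bound method NDFrame.head of ...>".
print(callable(df_YV.head))

# With parentheses the method runs and returns the first 5 rows.
first_rows = df_YV.head()
print(first_rows)
```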
sublime + ella fitzgerald, great name
Is there a function to convert a pd dataframe (single column) of decimal values 0 < x < 1 , to 0-1 values?
i.e. if x_h < 0.5, x_h = 0, else x_h = 1
...without iterating over the whole dataframe or array? I have continuous predictions that need to be converted to 0-1
Off topic but ella is ❤
column = (column > activation).astype(int)?
...that worked!
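spelled out on a made-up series, with 0.5 as the cut-off. note that the rule as stated (x < 0.5 gives 0, anything else gives 1) corresponds to `>=`; with `>` a value of exactly 0.5 would map to 0 instead:

```python
import pandas as pd

predictions = pd.Series([0.1, 0.5, 0.49, 0.93])  # hypothetical predictions
threshold = 0.5

# The comparison yields booleans; astype(int) turns True/False into 1/0.
binary = (predictions >= threshold).astype(int)
print(binary.tolist())
```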
I wonder what a person does when they're so advanced that people on Discord can't help them lol
Also apart from that you have plenty of ways. My fav is .apply others are iloc but iloc is not good enough for complex logic.
.apply, I'll write that down, thank you
!d pandas.DataFrame.apply
oh that makes sense using iloc
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame's index (`axis=0`) or the DataFrame's columns (`axis=1`). By default (`result_type=None`), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
Replace values where the condition is False.
Nice, would that make a new column in the df or replace the existing?
I reckon you can replace if you wish. Just give same column name.
Why do you need to use the NumPy ufunc
You can use pd.Series.where too if I'm not mistaken. But anyways i personally would not mind np in this case since pd uses np too.
Tho they are bit different as the docs say. I'll check out the details when i get time.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
Via: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
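A quick check of that equivalence on made-up frames. The main practical difference is the return type: `DataFrame.where` keeps the index and column labels, while `np.where` returns a plain array.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = -df1
m = df1 % 2 == 0  # boolean mask, True where df1 is even

# Keep df1 where the mask is True, take df2 elsewhere.
pandas_result = df1.where(m, df2)

# Same selection with NumPy; the result is an ndarray without labels.
numpy_result = np.where(m, df1, df2)

print(pandas_result)
print(numpy_result)
```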
principle of least privilege
I would favour this solution
(maybe because it's what I'd do too 🥴)
Im training a sklearn.svm.svc classifier on 12k samples
Is it normal to be really slow
Its like an hour
Do you have a lot of features...?
do you have an nvidia gpu
What library are you using
Right, dumb question
Im running it in vscode and i cant tell
If its frozen or not
Maybe i should have done in pycharm but idk how to tell if its not frozen or not
Scikit has no GPU support
Damn.
Is there a way to tell its still running
I guess im looking at task manager and python is taking 50% of cpu
Scikit doesn't have gpu support 🤦‍♂️
If it helps, colab may help.
Oh i never used it before can you explain
The cpu may be better than ours and it has 13gb ram (i guess)
Basically cloud for you to run notebook. Gives nice gpu and cpu and ram. No load on your hardware. And you can use drive as the storage.
You can do whatever you want. Just think of drive as your storage.
(Assuming you have enough space of course )
Sorry for weird questions im just really anxious haha
Waiting for a model to train without knowing whats happening
It's alright.
You can give it a try once you're done with doing on your pc. I like it and use in my daily life. Like eating and all.
Yep.
Ill just upload my folder then
Of code and data
Should be able to execute immediately
Alright. You'll need to mount drive if you want to tho. I'm not on laptop otherwise I'd share a small demo notebook.
Yeah that is the limit. Beware.
i just deployed some jax models. i have no idea what i deployed.
i really need to learn jax and quick,
i have played around on a beginner level on keras and pytorch,
any suggestions welcome
AttributeError: 'SVC' object has no attribute 'transform', theres nothing about this issue
anyone know?
are you talking about https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
?
would it help to tell you that SVC has no transform() method?
as the documentation suggests
wha thte
thank you for pointing it out i fixed the issue

Getting a strange error again (thought I fixed it last night)
KeyError: 'DEATH_EVENT'
line in question is:
y = X_full.loc[:,'DEATH_EVENT']
where 'DEATH_EVENT' is a column in a pd dataframe
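The chat doesn't show what caused this, so the following is only a diagnostic sketch on a made-up frame. Two common causes of a `KeyError` for a column that "is definitely there" are stray whitespace in the column name and the column having already been removed by an earlier `drop` or `pop` (for example when a script or cell is run twice).

```python
import pandas as pd

# Hypothetical stand-in for X_full; note the trailing space in the last name.
X_full = pd.DataFrame({"sex": [1, 0], "smoking": [0, 1], "DEATH_EVENT ": [1, 0]})

# repr() of each name makes stray whitespace visible.
print([repr(name) for name in X_full.columns])
print("DEATH_EVENT" in X_full.columns)

# Stripping the names makes the lookup work. pop() removes the column and
# returns it, so running it a second time raises KeyError.
X_full.columns = X_full.columns.str.strip()
y = X_full.pop("DEATH_EVENT")
print(y.tolist(), list(X_full.columns))
```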
anyone has a good series of notebooks for getting started at jax ???
hi
i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)
ValueError: Classification metrics can't handle a mix of binary and continuous targets
even though the dataframe (one column of binary predictions) is of type int64
...?
Hey could someone help me with plotting a histogram
https://gyazo.com/59474ba5a4d4d6a8d95e60ebd9cf79ef
The x-axis is fine, but the y-axis is supposed to have the values in the array (up to 50,000 or so)
but the y-axis here is just going up to 10

when i change bins = it fixes but the axis are wrong
The y axis is going up to 10 because for the bins you specified, the frequency of the values that fall within that bin is 10
how do i make it so the x-axis is from 0 to 200
and the y axis is just takes from the array i set
The y axis is just going to show you how the data falls within the bins you want
It won't be the data itself
So what do I do
if I want the values of the bars to be what's in the array
i need to make the x-axis like this
its just the values in the bars are wrong and the y-axis is obviously wrong
if you look at the array they're in the thousands so
The x goes only to 200 because thats the upper bound you set for the bins
wat now
doesn't do anything
i need it to look something like this
but obviously y-axis will be larger
Change the step to 25
What this tells you is that for the bounds you specified, all your values fall between 0 and 25, and 100 and 150
The y axis will never be the values in the array, because in a histogram, you count the frequency of that value appearing
lemme explain wat i want
array= [0,0,0,0,0,0,0,0,146,4021,4323,13434,24089,36611,27023,31367,26800,11079,9285,23142,12757,7973,30389,20770,10426,17347,25305,34806,23654,14287,12107,2004,4256,106,3866,2247,2688,0,0]
so i have 39 values in here
i want each one to be in its own bin
and the x-axis to be from 0-200
how can I do that
And you're clear that by setting the x axis to be 200, you'll miss all the values that are greater than 200?
wdym why would taht be
the values in the array are for the y axis
each value in the array is a spacing of 5
so first value is 0-5
second is 5-10
etc
The x axis is the bins
The y axis is the frequency of numbers within the bin
yea
ohhhh I understand. @main fox in the array, the first entry is a count of all values between 0-5
yea
so array[0] -> X = 0-5, count = 0
It's not an array of individual values
@flat crown wouldn't it be easier to create a bar chart instead?
i need histograms for this assignment
does the prompt require you to use np.array
**np.arange
nah that's just what i used for my other histograms
its not really assignment, its a research project
Since you're already using pandas
Convert the list into a series arr= pd.Series(array)
And then use arr.plot(kind='hist')
Pandas will handle the bins automatically
That's not how a histogram works
is that colab
jupyter notebook
yes
I think you should make a bar chart (which will also be a histogram), because your data is already binned and counted.
X axis should be (index of the array + 1) * 5
Y axis should be array[index]
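A sketch of that bar chart with matplotlib, using the array posted above and assuming each entry is the count for a bin of width 5 starting at 0. The output file name is arbitrary. Bars are drawn from their left edge, which is `index * 5`, so each bar ends at `(index + 1) * 5`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Counts per bin, as posted in the chat; bin i covers [5*i, 5*(i+1)).
counts = np.array([0, 0, 0, 0, 0, 0, 0, 0, 146, 4021, 4323, 13434, 24089,
                   36611, 27023, 31367, 26800, 11079, 9285, 23142, 12757,
                   7973, 30389, 20770, 10426, 17347, 25305, 34806, 23654,
                   14287, 12107, 2004, 4256, 106, 3866, 2247, 2688, 0, 0])
bin_width = 5
left_edges = np.arange(len(counts)) * bin_width

fig, ax = plt.subplots()
# align="edge" places each bar between its left and right bin edge.
ax.bar(left_edges, counts, width=bin_width, align="edge", edgecolor="black")
ax.set_xlim(0, 200)
ax.set_xlabel("value")
ax.set_ylabel("count")
fig.savefig("prebinned_counts.png")  # hypothetical output file name
```

Because the data is already binned and counted, `plt.hist(counts)` would instead count how often each count occurs, which is why the y axis only went up to 10.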
precisely
excel lol
I'm new to sklearn/pandas. But i know exactly what you're looking for and understand what the np array is communicating
ok
ohh. can't help with that unfortunately.
or you can run your other data in the same style
You said your output needs to be exactly like this?
yes
They don't have any values that fall between 0 and 25
Your data has 0's
yea
What this means is:
the frequency of values between 0-5 is 0. So 0 is not in the dataset
So, formatting aside, this is what the graph/histogram should look like, right?
yep
Cool. Good luck, gotta go to bed
guys, I have some data displayed in the terminal. like the one shown in the image. Do we have a way to put the entire displayed data into a file?
This is in visual code
how to make the xaxis look better ? as you can see nothing is visible on the x axis Please help asap
and why is my figure size not changing?
hi
i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)
I never used a database like that (years as columns until 2019). How can I find the line or cell of the value that I'm trying to find? It just returns me a True and False dataframe
Hi there,i am a total beginner in ml and have a question that may make not much sense.I am reading about reinforcement learning right now.I understand that there is a punishment/reward system.So if the program searches for patterns (time series data for example),do i need to define what kind of patterns i want or does the program find its own patterns?Sorry i am not a good programmer and i try to understand the theory behind supervised,unsupervised and reinforcement learning.
so annoying 😫
you could try transposing?
!d pandas.DataFrame.T
property DataFrame.T
since you weren't happy about years being in columns ...
though i don't know what kind of data you're working with
I just want to find the line or cel of the value. But if I continue with this dataframe idk if would be very difficult to make graphs, since maybe I will have to put 1950-2019
indeed
you might want to try and restructure your data
and before you ask "how", think about what would make your life easier should something go on the index vs column
though you can probably access all the columns with df.columns
minus the first two
yes, but why would this help with plt.hist(x = 2:70, data=df)?
What you are replying to was not in response to your message
It was just me trying to finish what I was saying from earlier
but if you turn the columns into a year column, maybe you would have to repeat the countries for each year
That's why I said this.
I never transformed a database like that, and I don't think would be possible with this database since probably would have to repeat the countries to each year
I think it's pretty simple, but you're the data scientist.
but how would you structure this database?
countries year expectancy?
i would transpose it
as suggested earlier
handy dandy
i would transpose it then manipulate it in such a way that I'm able to do two things from this dataFrame
I can't figure out how would transpose be helpful in that
typically one would select year
years but like that: 1950 1951
so it's useful for us to take advantage of that fact, and put year as index
because right now, the indexes are not being used at all
also, the Code is not useful in this dataframe at all, so we can take that out and store it as dict values using entity as key
which leaves us countries as columns, years as indexes and life expectancy as the values
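a rough sketch of that restructuring. the table below is made up (column names `Entity` and `Code` follow the discussion, the numbers are invented), and the long format at the end is the "countries year expectancy" layout asked about earlier, where each country is indeed repeated once per year:

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "Entity": ["Brazil", "Japan"],
    "Code": ["BRA", "JPN"],
    "1950": [50.0, 60.0],
    "1951": [51.0, 61.0],
})

# Keep the codes on the side as a dict keyed by entity.
codes = dict(zip(wide["Entity"], wide["Code"]))

# Countries as columns, years as the index, life expectancy as values.
by_year = wide.drop(columns="Code").set_index("Entity").T
by_year.index = by_year.index.astype(int)
by_year.index.name = "Year"
print(by_year)
print(by_year.loc[1951, "Japan"])

# Alternative: long format with one row per (country, year) pair.
long = wide.melt(id_vars=["Entity", "Code"], var_name="Year", value_name="Expectancy")
print(long)
```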
actually if we create a variable years with all years columns that can be used as df[df['years' == 10101] we don't need to transpose
since this is the major problem in the df
Most of data science is just wrestling with data
it does not get better 
hi I need pandas help
we know you do, that's why you are here
here is "df_confirmed_local"
we need to make a function that
def case_no(report_date)
and return the case no. of the corresponding date
example:
case_no("04/02/2020")
output: 16
someone help me plsss
so you're just counting the number of values in the "report date" column that matches your input
not row?
return the case no. of the corresponding date
ah sorry I dont get
so in this case, you gotta match two things: "report date" == input, "Confirmed/probably" == "confirmed"
then return the len()
so
def case_no(report_date):
    report_date = df_confirmed_local["report date"] ?
actually
df_confirmed_local -> does this mean your dataframe only contains confirmed cases?
that makes life a bit easier
yes
def case_counts(report_date):
    for i in df_confirmed_local["Report date"].value_counts():
        if i == report_date:
            return i
case_counts("04/02/2020")
I tried this
but It doesnt work
cuz I dont know how to link up the two columns
i don't think you need for and if loops
!e
import pandas as pd
def howmany(val):
    """return how many values in A match the input"""
    # make random dataframe
    df = pd.DataFrame({"A": [1, 0, 0, 1, 1], "B": [True, False, True, True, False], "C": ["Y", "Y", "Y", "Y", "Y"]})
    print(df)
    # return number of vals that match "A" == val
    return len(df[df["A"] == val]["A"])
print(howmany(1))
here's a really basic example
@royal crest :white_check_mark: Your eval job has completed with return code 0.
001 |    A      B  C
002 | 0  1   True  Y
003 | 1  0  False  Y
004 | 2  0   True  Y
005 | 3  1   True  Y
006 | 4  1  False  Y
007 | 3
line 7 is the output for howmany(1)
returning 3 since column "A" has three 1's
and in your case, that would be a bit like
def case_counts(report_date):
    return len(df_confirmed_local[df_confirmed_local["Report date"] == report_date]["Report date"])
case_counts("04/02/2020")
i'm sure there's always a better way, so feel free to chime in
k lemme try ty :>
Hey, i have a huge amout of csv files with some apparently empty columns. I want to check if they are actually empty in all files, and if that is the case i want to print their name.
To do that, i would like to know if there a way to see if df[columnname] does not contain any values?
!d pandas.DataFrame.isna
DataFrame.isna()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or `numpy.NaN`, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings `''` or `numpy.inf` are not considered NA values (unless you set `pandas.options.mode.use_inf_as_na = True`).
this could be useful
rd_s = df_confirmed_local['Report date']
return len(rd_s[rd_s == report_date])
why len() tho? doesnt it return the length of the string?
alternatively, you could do this once:
report_date_counts = df_confirmed_local['Report date'].value_counts()
and then any time you need a specific report date count, it's just a dict-like lookup to report_date_counts.
length of an array in this case
well, Series
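The two suggestions above (filtering with a boolean mask, or counting every date once with value_counts()) can be sketched side by side. The frame here is made-up stand-in data; only the "Report date" column name and the df_confirmed_local name come from the discussion.

```python
import pandas as pd

# Made-up stand-in for df_confirmed_local; only the column name is from the chat.
df_confirmed_local = pd.DataFrame({
    "Case no.": [1, 2, 3, 4],
    "Report date": ["03/02/2020", "04/02/2020", "04/02/2020", "05/02/2020"],
})

def case_counts(report_date):
    """Count the rows whose "Report date" equals report_date."""
    # Comparing a Series to a scalar gives a boolean Series; True counts as 1.
    return int((df_confirmed_local["Report date"] == report_date).sum())

# Alternative: count every date once, then look counts up like a dict.
report_date_counts = df_confirmed_local["Report date"].value_counts()

print(case_counts("04/02/2020"))
print(report_date_counts.get("04/02/2020", 0))
```

Using .get() with a default of 0 avoids a KeyError for a date that never appears in the column.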
how could I do in this database that i'm replying like that?
apologies, my inner
taking over
not too sure what you mean mate
to appear the cell not a True, False and Nan dataframe
cell or line
so if query is "Banana", you'd want a dataframe that looks like:
Name Calories Type
True False False
?
so with the code:
```py
let:
A is a column and B is a column
a is an item in column A
b is an item in column B
code:
len([A=a][B])
output:
b
```
is it like this? @royal crest
look my reply, I would like to replicate what I do in banana but with this database
close, don't forget that you are accessing those columns through a dataframe
yup so If including the df_ stuffs am I correct?
sounds good, also check out Stelercus' response
maybe with all columns but I'm not sure
i think you need to provide a complete demonstration of input and your expected output. it's still not clear from these scattered examples what you want
thanks for the help with isna()
it worked :>
took forever to run through all of the files, i have like 150000 csv files i need to process
big help
My output is in the print A True and false data frame my expected output is this:
i saw that, but it doesn't help me understand
there is no context
man... In one print I use a the code with a column and get the exacly cell that the value "user_result" is in, but in the life expectancy database I got a true and false dataframe, how can I got the cell from it?
i still truly have no idea what you mean, i'm sorry. the code database_df[database_df["Name"] == user_results] returns a dataframe, not a single value
do you want to get all rows where life expectancy is 18.907 in any column?
EXACLY
this is why it's important to provide small examples that clearly demonstrate the behavior
you force people to guess and interrogate you
but is clear, the second print is not a True and False database
it's not clear, as shown by multiple people trying to help and giving up
also a "true and false database" is not the same as "all rows where a certain value is present in any column"
but I use the same strategy in the two
df[df == something]
it doesn't matter, if you don't show people what your data is, there's no way anyone can help
screenshots are not enough
ok sorry
your life expectancy data looks like this?
data = pd.DataFrame([
    ['A', 1, 2, 3],
    ['B', 4, 5, 6],
    ['C', 2, 5, 3],
    ['D', 3, 9, 4],
], columns=['entity', '2007', '2008', '2009'])
ok. and can you explain what output you do want?
do not show the banana example, it's not helpful
Hello guys.
Do you know how can I parse this string "20200915" to date with pyspark?
yes, I want the cells or lines as the banana example that contains the minimum value of the database that is 18..
18.907
not a True and false dataframe
understand?
and I'm not sure if this .min().min() was the best way to do, so if you have any suggestion to this too you can tell me
df.min().min() is better than trying to filter
however you need to exclude the non-numeric columns
!e ```python
import pandas as pd
data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
min_all = data.loc[:, '2007':'2009'].min().min()
print(min_all)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
1
!e ```python
import pandas as pd
data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
meta_columns = ['entity', 'code']
value_columns = [c for c in data.columns if c not in meta_columns]
min_all = data[value_columns].min().min()
print(min_all)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
1
it's what I do
look
yes, that's fine
and now how do i find the row?
but that is all the columns lol
it's not
you don't need to use .iloc at all if you want all columns
you can just write years_database.min().min()
does anyone have an idea how to make this perform better?
def find_non_empty(file):
    try:
        frame = pd.read_csv(file, sep=";")
        for name in frame.columns:
            if not name in listA:
                listA.append(name)
            if not name in listb:
                res = True
                i = 0
                for val in frame[name].isna():
                    i +=1
                    if val is False:
                        res = False
                if res is False:
                    listb.append(name)
    except UnicodeDecodeError:
        print("ooops")
        print(file)
    except:
        print ("whoops")
so you want the row that contains the minimum year @median fulcrum ?
yeah
with the country etc
i have been scanning files for 80 minutes
@lapis sequoia are you just trying to look at the column names and ignore the data?
the bare except: is a really bad idea... at least do except Exception as e: print(e) so you can see what the error was
i want to know if certain columns are empty in all of my 150000 csv files
df.isnull()
i also am in the habit of putting the try around the smallest possible area of my code, in this case just the pd.read_csv should be inside try
not the whole frame, just some of its collumns
that makes sense
i will adapt that
df['column'].isnull()
and yes, looping over the entire df is a bad idea. use frame.isnull().any() or something like that
yeah
i assumed
:D
yes, ill change that. Thanks, i wasnt sure .isnull() on the column would work
frame.isnull() returns a dataframe of booleans, frame.isnull().any() returns a series with 1 element for each column, True if any are null, False otherwise
>>> data.isnull().any()
entity False
code False
2007 False
2008 False
2009 False
dtype: bool
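For the "is this column empty in every file" question, a minimal sketch follows. Note that .any() answers "does the column have at least one missing value", while .all() answers "is every value missing", which is the check the question seems to need. The CSV contents below are made up and read from memory, and the helper name empty_columns is hypothetical.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the real CSVs (made-up data).
csv_texts = [
    "a;b;c\n1;;x\n2;;y\n",
    "a;b;c\n3;;\n4;;\n",
]

def empty_columns(frame):
    """Return the set of column names that contain only missing values."""
    all_missing = frame.isna().all()  # one boolean per column
    return set(all_missing[all_missing].index)

empty_everywhere = None
for text in csv_texts:
    frame = pd.read_csv(io.StringIO(text), sep=";")
    empty_here = empty_columns(frame)
    # Keep only the columns that were empty in every file seen so far.
    if empty_everywhere is None:
        empty_everywhere = empty_here
    else:
        empty_everywhere &= empty_here

print(empty_everywhere)
```

Column "c" is empty only in the second file, so it drops out of the intersection; "b" is empty in both and survives. A column that is absent from some file also drops out, which may or may not be what is wanted.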
@median fulcrum
!e ```python
import pandas as pd
data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
meta_columns = ['entity', 'code']
idx_min = (
    data
    .drop(columns=meta_columns)
    .min(axis='columns')
    .idxmin()
)
print(data.loc[idx_min])
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
entity       A
code      asdf
2007         1
2008         2
2009         3
Name: 0, dtype: object
that of course returns the row as a series
use data.loc[[idx_min]] to get a 1-row dataframe
in columns=, since there are 1950-2019 columns, what can I use?
your dataframe have 5 columns, and you are trying to find the minimum of two, mine has 70 columns, what can I put in columns=?
understand?
that's the problem :/
@desert oar
columns= where? in drop, or in DataFrame?
note also that 1:70 selects the 2nd through 70th column. numbering starts at 0 and ranges exclude the upper bound. with 70 columns you would have to write 0:70 to get all of them, which is equivalent to 0:, which is equivalent to :
in drop
mine database have 70 columns
drop removes columns. so put the column names you want to remove. if there are no columns that you want to remove, don't use drop.
!d pandas.DataFrame.drop
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide (advanced.shown_levels) for more information about the now unused levels.
do you have missing values (NaN) in the data?
no
did you remove Entity and Code like you were supposed to?
look at my example
i literally use the same column names...
ye men
the database is like that
result has columns Entity and Code which, according to your screenshots, do not contain numerical data
well then your result is messed up
i don't know what you expected
stop and think for a second
you are showing me 2 dataframes
oh
in screenshots, no less
I have to use
please use code blocks
i don't know what that means either
still......
it isn't showing the minimum row
idx_min is the index of the row that contains the minimum value
the index is the label of the row
in this case, your dataframe has the default index. so the label is the row number
ok, but I don't want that, I already know, I want to find where the minimum are
show me an example using this dataframe:
data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
show me the output you want
print like that
that has the minimum in that
or just the exclusive year that has
don't need to be everything
just know the country and the year of it
i told you, data.loc[idx_min] is a Series of the data in the row that contains the minimum. if you want a 1-row DataFrame of the data in that row, use data.loc[[idx_min]].
the minimum value I already have
if you said this 30 minutes ago, we'd be done by now
also, you need to slow down and read what other people are writing. i used idxmin, which is not the same as min
omg
thanks
would be possible, show just the year that the minimum value was found?
!e ```python
import pandas as pd
data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
data_num = data.drop(columns=['entity', 'code'])
minval = data_num.min().min()
minval_row = data_num.min(axis='columns').idxmin()
minval_col = data_num.min().idxmin()
print(minval, minval_row, minval_col)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
1 0 2007
and the dataframe of the column with the country etc would be?
be more specific
I would like to print a dataframe with:
Country year column
Cambodia 1977
can have also the code whatever
please start using code blocks
it's not fair to force people to read screenshots and re-type long variable names
did you create years_database from lifexpectancy_database?
if so, show how you did it
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
do not post a screenshot
hello i tried this and i got this as the output
text target tokens
800000 love healthuandpets u guys r best 1 [love, healthuandpets, u, guys, r, best]
800001 im meeting one besties tonight cant wait girl... 1 [im, meeting, one, besties, tonight, cant, wai...
800002 darealsunisakim thanks twitter add sunisa got ... 1 [darealsunisakim, thanks, twitter, add, sunisa...
800003 sick really cheap hurts much eat real food plu... 1 [sick, really, cheap, hurts, much, eat, real, ...
@wicked grove yes, every element of the tokens column is a list of strings
but the strings are not within ' '
that's just how pandas displays it
ohh alright!!
you can put a list of strings into the model if you use TfidfVectorizer(analyzer=lambda x: x)
i had another question,when i used stemming on the tokens column i got this
okayy,thank you
what is that?
can i send a picture of the output? i am unable to copy paste it correctly
text target tokens
800000 love healthuandpets u guys r best 1 t
800001 im meeting one besties tonight cant wait girl... 1 k
800002 darealsunisakim thanks twitter add sunisa got ... 1 t
800003 sick really cheap hurts much eat real food plu... 1 p
800004 lovesbrooklyn effect everyone 1 e
Hi got a really simple machine learning / pandas question but really struggling with this:
I just want to write a function that converts the 'source_type' column in my dataframe to a number depending on what string 'source_type' is
for example:
if item == 'AGB':
    item = 0
else:
    item = 1
I understand I should use df.apply() but I'm not sure how to write the function to make it work
please ping me!
how did you end up with this?
there are a few different ways to do this.
one way:
data['source_num'] = 0
data.loc[data['source_type'] == 'AGB', 'source_num'] = 1
another way:
def convert_source_type(source_type):
    return 1 if source_type == 'AGB' else 0
data['source_num'] = data['source_type'].apply(convert_source_type)
you could of course use lambda instead of def
data['source_num'] = data['source_type'] \
    .apply(lambda source_type: 1 if source_type == 'AGB' else 0)
you could also use map which is kind of a neat trick:
from collections import defaultdict
source_type_mapping = defaultdict(int, {'AGB': 1})
data['source_num'] = data['source_type'].map(source_type_mapping)
homework: figure out how this works
how map works?
all of it
(hint: collections is part of the python standard library, you'll have to read about defaultdict there)
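For anyone checking their answer to the homework, here is a small sketch of the map trick. The sample values other than 'AGB' are made up. It also shows the opposite coding (AGB becomes 0, everything else 1), which is what the original question asked for; the source_num_flipped column name is hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Made-up sample values; only 'AGB' and the column names come from the chat.
data = pd.DataFrame({"source_type": ["AGB", "RGB", "other"]})

# defaultdict(int, ...) returns int() == 0 for any key that is missing,
# and Series.map uses that default instead of producing NaN.
source_type_mapping = defaultdict(int, {"AGB": 1})
data["source_num"] = data["source_type"].map(source_type_mapping)

# The original question asked for the opposite coding (AGB -> 0, else 1);
# a plain comparison cast to int gives that directly.
data["source_num_flipped"] = (data["source_type"] != "AGB").astype(int)

print(data)
```

With a plain dict instead of a defaultdict, the values missing from the mapping would come out as NaN rather than 0.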
okay will do
how did you end up with this?
yes
years_database = lifexpectancy_database.iloc[:, 2:72]
with just the years
Hey guys quick question pytorch or tensorflow?
ah yes, the eternal question
in that case yes, you can use minval_row to get the row from lifexpectancy_database. both dataframes will have the same index values
lifexpectancy_database.minval_row()?
no, why would you expect that to work?
years_database = lifexpectancy_database.iloc[:, 2:72]
minval = years_database.min().min()
minval_row = years_database.min(axis='columns').idxmin()
minval_col = years_database.min().idxmin()
print(minval, minval_row, minval_col)
i am using the variable names from my previous example
oh
that was my fault, it wasn't clear
do you understand now? you can do lifexpectancy_database.at[minval_row, 'Entity'] for example, to get "Cambodia" (years_database no longer has the Entity column, but both dataframes share the same index)
and minval_col will be the year
but minval_row is not:
Country year column
Cambodia 1977
or what I'm missing?
read my message above
so I would have to mix minval_row and minval_col to get what I'm thinking
how do i create the code block in chat?
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
!code see below:
thanks
kind of, yes. minval_col will contain the column label (in this case, the year), and minval_row will contain the row label (in this case, the row number).
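Putting the row label and the column label together gives the small "country, year" table that was asked for. This is a sketch on the toy frame used earlier in the thread; the summary name and its column names are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

data_num = data.drop(columns=['entity', 'code'])

# Row label of the row holding the overall minimum...
minval_row = data_num.min(axis='columns').idxmin()
# ...and, within that row, the column label (the year) of the minimum.
minval_col = data_num.loc[minval_row].idxmin()

# data and data_num share the same index, so the row label works on both.
summary = pd.DataFrame({
    'entity': [data.at[minval_row, 'entity']],
    'year': [minval_col],
    'value': [data_num.at[minval_row, minval_col]],
})
print(summary)
```

On the real data the same pattern would read the name from lifexpectancy_database and the numbers from years_database.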
can somebody tell me, in the code below. .. is order of the original index guaranteed to be preserved.
pd.DataFrame({n: df.T[col].nlargest(5).index.tolist() for n, col in enumerate(df.T)}).T
alright, so i have 2 excel sheets that i load into dataframes, the objective is to match rows from one dataframe onto another
currently im computing columns in both dataframes that match so i can join them together
am i correct in thinking that pandas.merge(df1, df2, on=cols, how="left") will return all of df1 and also any rows from df2 that match what im merging on cols?
don't "chain" [ and .loc. use lifexpectancy_database.loc[[idx_min], [minval_col]]
it's the same
which index order? recent python versions do preserve dict insertion order
the index from the original dataframe, ... these are top shapley values from my model. ... i'm wanting to concat these (columnwise) to the prediction output. .... just want to ensure the new columns line up with the intended prediction
more or less. you can think of an inner join as taking every pair of rows and matching them. a left join is an inner join + the unmatched rows from the left side. so you could end up with more than one match for a given row in df1.
alright, the objective is to use columns in df1 to match rows on df2, im not sure what kind of join i should use for something like this tbh, i guess inner? i want the set of rows from df1 that can be identified on df2
I applied stemming
!e does this also do what you want? ```python
import pandas as pd
data = pd.DataFrame([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
data.index = ['a', 'b', 'c']
print( data.apply(lambda y: y.nlargest(2).index) )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
   0  1  2
0  c  c  c
1  b  b  b
Ill show you my code
yes, that would help
it depends on what you want to do with the data afterwards, but i think a left join makes sense. you can de-duplicate afterwards if you end up with multiple matches
(you can get multiple matches from an inner join too! think about it)
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
#tokenizer=word_tokenize()
#tokenizer=RegexpTokenizer(r' \w+ ')
#dataset['tokens']=dataset['text'].apply(tokenizer.tokenize)
dataset['tokens']=dataset['text'].apply(word_tokenize)
print(dataset.head())
st=nltk.PorterStemmer()
def stemming_on_text(data):
    text=[]
    for word in data:
        text=st.stem(word)
    return text
dataset['tokens']=dataset['text'].apply(stemming_on_text)
print(dataset.head())
print("lemmatized")
lm=nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    text=[lm.lemmatize(word) for word in data]
    return text
dataset['text']=dataset['text'].apply(lemmatizer_on_text)
print(dataset['text'].tail())
i removed regexptokenizer as it gave a weird output
and when i apply stem and lemmatizer it gives only individual characters
it helps to think of the inner join as a filter applied to the "cartesian product" aka the "cross join"
# cartesian product
result = []
for row_x in data_x:
    for row_y in data_y:
        result.append(row_x + row_y)
# inner join
result = []
for row_x in data_x:
    for row_y in data_y:
        if match(row_x, row_y):
            result.append(row_x + row_y)
then left, right, and outer joins are easy: you just append the rows that were not matched, filling in missing data with NULL
in fact, in most databases SELECT * FROM a, b WHERE <condition> is identical to SELECT * FROM a INNER JOIN b ON <condition>
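The same idea in pandas, as a small sketch on made-up tables: the inner join keeps only matched pairs, the left join adds the unmatched left rows, and a key that repeats on the right produces one output row per match.

```python
import pandas as pd

# Made-up tables; the key k == 1 appears twice on the right side.
left = pd.DataFrame({"k": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"k": [1, 1, 2], "y": ["p", "q", "r"]})

inner = left.merge(right, on="k", how="inner")
# indicator=True adds a "_merge" column saying where each row came from.
left_join = left.merge(right, on="k", how="left", indicator=True)

print(inner)
print(left_join)
```

The row for k == 3 shows up only in the left join, with y missing and _merge set to "left_only"; filtering on that column is a quick way to list the rows that found no partner.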
makes sense, i think im not computing common columns properly, i should be getting more rows with a left join
i'll revisit when i get home, time for a pint
did you notice that you were lemmatizing and stemming the 'text' column and not the 'tokens' column? hint: what happens when you iterate over a string, instead of a list?
okayy yes!!thank you so much. i also hadn't included .append...just to clarify, in this line of code dataset['tokens'].apply(stemming_on_text) we are applying the function on the entire column of the dataframe and word iterates through every row??
dataset['tokens'].apply(stemming_on_text) loops through elements of dataset['tokens'] and calls stemming_on_text on each element
it's like map() or a list comprehension, but for pandas
note that you can rewrite stemming_on_text as:
def stemming_on_text(data):
    return [st.stem(word) for word in data]
And when the element is a list, stemming_on_text is applied to each string in the list?
Oh alright,i will keep that in mind!! Thanks a lot:))
I have just begun using list comprehension so i did not use it here
no, it's applied to the entire list
Oh okayy
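A small sketch of the difference between applying the function to a column of lists and to a column of strings. To keep it self-contained, toy_stem is a hypothetical stand-in for st.stem that only strips a trailing "s"; the data is made up.

```python
import pandas as pd

def toy_stem(word):
    """Hypothetical stand-in for st.stem: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def stemming_on_text(data):
    # data is ONE element of the column; the loop walks over its items.
    return [toy_stem(word) for word in data]

dataset = pd.DataFrame({
    "text": ["cats dogs"],
    "tokens": [["cats", "dogs"]],
})

# Element is a list of words, so each word is stemmed.
from_tokens = dataset["tokens"].apply(stemming_on_text)[0]
# Element is a string, so the loop sees single characters.
from_text = dataset["text"].apply(stemming_on_text)[0]

print(from_tokens)
print(from_text)
```

apply calls the function once per row with the whole element; iterating over that element inside the function is what reaches the individual words (or, for a string, the individual characters).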
!e thanks, but i think we were thinking of two different result sets...
import pandas as pd
data = pd.DataFrame([
[1,2,3],
[6,5,4],
[7,9,8]
], columns = ['a','b','c'])
print("original")
print(data)
print('\ntop by column')
dataT = pd.DataFrame({n: data.T[col].nlargest(3).index.tolist() for n, col in enumerate(data.T)}).T
dataT.columns = ["top_1_column", "top_2_column", "top_3_column"]
print(dataT)
print('\nlambda:')
data.index = ['a','b','c']
print(data.apply(lambda y: y.nlargest(3).index))
btw: it does appear to keep original sort, thanks for the sounding board
@frigid elk :white_check_mark: Your eval job has completed with return code 0.
001 | original
002 | a b c
003 | 0 1 2 3
004 | 1 6 5 4
005 | 2 7 9 8
006 |
007 | top by column
008 | top_1_column top_2_column top_3_column
009 | 0 c b a
010 | 1 a b c
011 | 2 b c a
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/doqaqozaqi.txt?noredirect
@frigid elk data.apply(lambda y: y.nlargest(2).index, axis=1)?
or data.T.apply(lambda y: y.nlargest(2).index)
i'll have to give that a shot in the name of code simplification, it's working now though and i'm on a deadline. ... thanks for the suggestions
For getting quantile / percentile data points with pandas, is there a specific use case for setting the interpolation to midpoint instead of the default linear method? Such as when doing something like this: pandas.Series(some_data).quantile(0.99, interpolation="midpoint")
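To see what the interpolation argument changes, here is a sketch on made-up values. It only shows the mechanics; which method is appropriate depends on the convention you need to match.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 10])  # made-up values

# With 5 points, q=0.9 falls at position 0.9 * (5 - 1) = 3.6,
# i.e. between the sorted values 4 and 10.
results = {
    method: s.quantile(0.9, interpolation=method)
    for method in ["linear", "lower", "higher", "midpoint", "nearest"]
}
for method, value in results.items():
    print(method, value)
```

"linear" weights the two neighbours by how far between them the position falls, while "midpoint" ignores that distance and always returns their plain average, so the two differ most when neighbouring values are far apart, as in the tail of this series.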
suggestions to better visualize the years in x?
whats your audience and purpose ?
this is great for exploratory notes. i mean the colour isnt necessary but its purdy
My purpose is to plot the Cambodia life expectancy in 1950-2019 and make visible how 1977 was bad to the country
mission accomplished
thx
maybe tie clearer colour to the value
I'm also making this plot comparing Cambodia and other countries in 1977, if there is a way to write in the outlier(cambodia) I think the mission would be accomplished too
yeah lol
but if I write that it is cambodia
would be nice
remove ticks that have no meaning . but it really depends on your audience how much you want to give a crap
add a label for your outlier
label the x axis with mortality rate
what is the y axis ?
nvm reading your code
Instead of the "1977" could be the age like 20 30 etc but I don't know how would I do that
the database is strange
ya seaborn just jitters it nicely
the 1977 should be on the y-tick and the x axis should have rates
now it's not so visible that cambodia is very bad, but is better to visualize
I'm just wondering if has a possibility in catplot to write in a specific point
okay... rabbit hole
for a single visualization are you sure you want to jump in ? are you trying to learn data visualization through python ?
but if you want fine control like that on a programmatic level, python is home for sure
matplotlib is under Seaborn , there you'll be able to pick it all apart
plt.scatter
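Labelling one specific point can be done on the matplotlib Axes with annotate. A minimal sketch with made-up values and placeholder labels (not the real life expectancy data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up values for illustration only.
labels = ["country_a", "country_b", "country_c", "outlier"]
values = [61.0, 58.5, 63.2, 19.0]

fig, ax = plt.subplots()
ax.scatter(range(len(values)), values)

# Label the lowest point with an arrow pointing at it.
i_min = values.index(min(values))
note = ax.annotate(
    labels[i_min],
    xy=(i_min, values[i_min]),              # the point being labelled
    xytext=(i_min - 1, values[i_min] + 5),  # where the text sits
    arrowprops={"arrowstyle": "->"},
)
ax.set_ylabel("life expectancy")
```

Since seaborn draws on matplotlib Axes, the same annotate call should also work on the Axes a seaborn plot was drawn on; call fig.savefig(...) or plt.show() to see the result.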
has?
How can I plot a grouped-by monthly time series in seaborn or plotnine?
1- what? 2- Yes but not focused on it
Anyone able to help me melt a dataframe with multiple headers, time series
Can you give me a dummy dataframe?
I need some advice. Has anyone created a machine learning model to detect malware or similar project?
how can I use annotated images to train an opencv object detection model?
also, is it possible to implement transfer learning with it
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from pandas.plotting import register_matplotlib_converters
from pathlib import Path
register_matplotlib_converters()
df = pd.read_csv('airline_passengers.csv', index_col = ['Month'])
df.head()
~\AppData\Local\Temp/ipykernel_15824/309454017.py in <module>
----> 1 df = pd.read_csv('airline_passengers.csv', index_col = ['Month'])
2 df.head()
~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588
~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
480
481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
483
484 if chunksize or iterator:
809 self.options["has_index_names"] = kwds["has_index_names"]
810
--> 811 self._engine = self._make_engine(self.engine)
812
813 def close(self):
~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
1038 )
1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1041
1042 def _failover_to_python(self):
~\Anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
49
50 # open handles
---> 51 self._open_handles(src, kwds)
52 assert self.handles is not None
53
~\Anaconda3\lib\site-packages\pandas\io\parsers\base_parser.py in _open_handles(self, src, kwds)
227 memory_map=kwds.get("memory_map", False),
228 storage_options=kwds.get("storage_options", None),
--> 229 errors=kwds.get("encoding_errors", "strict"),
230 )
231
~\Anaconda3\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
705 encoding=ioargs.encoding,
706 errors=errors,
--> 707 newline="",
708 )
709 else:
FileNotFoundError: [Errno 2] No such file or directory: 'airline_passengers.csv'
anyone knows how to deal with this?
either use ./airline_passengers.csv or whatever is the absolute path of the file, which you can find by right clicking the file and going into properties
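A quick way to see where Python is actually looking, as a sketch using only the standard library. A relative name (with or without a leading ./) is resolved against the current working directory, which for a notebook is not always the folder you expect.

```python
from pathlib import Path

csv_path = Path("airline_passengers.csv")

# A relative path is resolved against the current working directory,
# which is not always the folder the notebook or script lives in.
resolved = csv_path.resolve()
print("working directory:", Path.cwd())
print("looking for:", resolved)
print("exists:", csv_path.exists())
```

If "exists" prints False, either move the file into the printed working directory or pass the file's absolute path to pd.read_csv.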
I had a question which plot is the best to use for this question
im using seaborn to plot
my guess is a barplot or histogram
but
it does not plot for shit
so if anyone has any idea how to use this
it would be appreciated
box plot?
violin with boxplot whiskers overlaid
doesn't a violin already have everything a boxplot has?
not really, you lose the specific "points of interest" like the 25th and 75th percentiles
anyone can help me with tesseract and opencv,
https://stackoverflow.com/questions/69656785/python-tesseract-unable-to-detect-two-lines
In the end this kinda work but the return are not in order. Eg. it should be SCF6045P but I got 6045PSCF
text = pytesseract.image_to_string(ROI, lang='eng', config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 11 --oem 3')
y'all know any free websites to learn python?
im tryna learn python
and i learn from practicing
but all i got is a pdf
please lmk of any free websites if you know any
the pinned messages in this channel have links to some resources
insofar as data science is concerned.
for learning python in general, this isn't the right channel.
what are some good starting projects?
depends on your interests and intentions
Aight thanks
How can I download cuDNN 8.1
A object detector, Chat bot, A voice password or your idea
what OS are you on
is this book up to date
i want to start learning ai i have read nnfs, and im searching for something new
I'm working on an ARIMA model, I want to know how significant something should have impacted the model to be considered an intervention.
In 2014 South Africa implemented new travel regulations for tourists, requiring specific documents if a child is not traveling with both their parents.
To me it looks like the underlying pattern has changed in sometime in 2014. If the chance is indeed significant enough, is there any way to exactly pick which month the intervention happened in? Would is just be going back on news articles and finding out when exactly the changes were implemented/announced?
scikit-learn went to version 1.0 so there's some outdated stuff
But the notions are the same.
But what would you recommend?
So i'd suggest giving it a try and when you have examples with code, refer to sklearn's documentation to see what changed and what stayed the same
im 18 years old, and i dont have any degree pending
i would like to take a look on AI
and do some stuff
If you're 18 I'd suggest checking the basic mathematics for AI. Look up "linear regression" because most of the stuff you'll see here uses that.
Otherwise if you want to get straight to model construction, you can browse the examples on scikit-learn and launch them.
Are you familiar with Colab?
Ok so, to describe how scikit-learn works, you have models, and models must train using data.
Ok i know that
but i would like to move from backend to more Data engineering
AI/ML
So you want to create pipelines?
For data engineering you need to be familiar with big data I think
And with tools like pyspark
Hmm, i just want to go into data world
and maybe for a while i'llchoose path where ill be going
Try Kaggle
i have two dataframes
db_adjusted.columns=Index(['activity_dttm', 'subject_txt', 'action_type_nm', 'activity_id',
'a.activity_html_txt', 'Start Time Final', 'Subject', 'Type'],
dtype='object')
5137
cbs_adjusted.columns=Index(['Subject', 'Type', 'Calc Full Name'], dtype='object')
1537
Common Columns {'Type', 'Subject'}
i want to merge on the two common columns
merged = pandas.merge(
    cbs_adjusted,
    db_adjusted,
    on=["Subject", "Type"],
    how="left",
)
merged.count()
>>> [67313 rows x 4 columns]
how is this possible
dont be sorry for politely asking for help!
my guess: your keys are not formatted the same
did you check that unique values for each are equal?
did you check ?
yes
i need to find rows where both the subject and the type are the same
you cleaned the data ?
yes
i stripped everything, i lowercased everything
idk what more i can do tbh to clean them up
pandas.merge(df1, df2, on=[col1, col2], how="inner") should get me the rows in the two dfs that have the same value in the two cols col1 and col2
but i get 10x more rows than what any of the two dfs have
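One possible cause, which is an assumption and not something confirmed in this thread: the same (Subject, Type) pair occurs many times on both sides, so every left row is paired with every matching right row. A sketch on made-up data, with shortened hypothetical names cbs and db:

```python
import pandas as pd

# Made-up data: the key pair ("hello", "email") repeats on both sides.
cbs = pd.DataFrame({"Subject": ["hello"] * 3, "Type": ["email"] * 3, "name": list("abc")})
db = pd.DataFrame({"Subject": ["hello"] * 4, "Type": ["email"] * 4, "id": range(4)})
keys = ["Subject", "Type"]

# 3 left rows x 4 matching right rows = 12 output rows.
merged = cbs.merge(db, on=keys, how="left")
print(len(cbs), len(db), len(merged))

# How many rows share their key pair with at least one other row?
dup_left = int(cbs.duplicated(keys, keep=False).sum())
dup_right = int(db.duplicated(keys, keep=False).sum())
print(dup_left, dup_right)

# validate= makes pandas raise instead of silently multiplying rows.
try:
    cbs.merge(db, on=keys, how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print("not one-to-one:", err)
```

If the duplicate counts on the real frames are large, the key columns alone do not identify a row, and either more columns are needed in on= or one side has to be de-duplicated first.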
can someone help me improve my anomaly detection model?
input: 10k toy product descriptions, 225 alcohol product descriptions
the problem here is to detect those 225 alcohol products as anomalies based on their descriptions; the toy and alcohol product descriptions will be in one dataset
so i'll tell you what i did first
s-1: i cleaned the product description text by removing punctuation and stop words, tokenizing, lemmatizing etc. (basic cleaning process)
s-2: i converted the cleaned text to paragraph vectors using gensim doc2vec
s-3: then i tried to fit an isolation forest on this vectorized dataset
i was able to get 30 true negatives and 597 false positives (true negatives are the anomalies here). i could do this evaluation because i labeled the descriptions before mixing the alcohol product descriptions in with the toy product descriptions. my goal here is to minimize type 1 and type 2 errors.
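A rough sketch of the pipeline described above (vectorize, fit an isolation forest, score against the known labels). It swaps TF-IDF in for gensim doc2vec purely to keep the example short, uses a handful of made-up descriptions instead of the real data, and treats anomalies as the positive class, so all names and numbers here are assumptions:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

# Made-up stand-ins for the toy and alcohol product descriptions.
toys = [
    "wooden building blocks for toddlers",
    "plush teddy bear soft toy",
    "remote control racing car toy",
    "puzzle set for kids ages 3 and up",
    "doll house with furniture",
    "toy train set with tracks",
    "stacking rings baby toy",
    "kids kitchen play set",
]
alcohol = [
    "single malt scotch whisky aged 12 years",
    "dry red wine cabernet sauvignon 750ml bottle",
]
texts = toys + alcohol
is_anomaly = [0] * len(toys) + [1] * len(alcohol)  # 1 = alcohol

# TF-IDF is used here instead of doc2vec (an assumption of this sketch).
vectors = TfidfVectorizer().fit_transform(texts)

# contamination is the expected share of anomalies in the data.
forest = IsolationForest(contamination=len(alcohol) / len(texts), random_state=0)
raw = forest.fit_predict(vectors)  # -1 = flagged as anomaly, 1 = inlier
predicted = [1 if r == -1 else 0 for r in raw]

# With anomalies as the positive class: fp is a type 1 error, fn is a type 2 error.
tn, fp, fn, tp = confusion_matrix(is_anomaly, predicted, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)
```

On the real data, `contamination` would be roughly 225 / 10225, and comparing the four counts across different vectorizers is one way to see whether the text representation or the detector is the weak part.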
Why isn't it staying in just one plot?
fig, ax = plt.subplots(ncols=2)
sns.catplot(data = america_dataset, ax=ax[0]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='America').set_ylabels('Life expectancy', weight='bold');
sns.catplot(data = asia_dataset, ax=ax[1]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='Asia').set_ylabels('Life expectancy', weight='bold');
sns.catplot doesn't allow me to do that?
ya i misread that count. pull open your data and explore it a bit
You wanna add things to the figure or axis directly, then plot
are you trying to do two plots side by side or ..
yes
looking at the data doesnt really help me, i've been staring at them all day
theres something wrong im doing with pandas
try adding ax=ax[0] to the beginning of your catplot parameters
and ax=ax[1]
specifies which subplot to use
doesn't matter
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
try using merge as a method
separate the methods to act on the axis
ax[0].set_ylabel(stuff)
I removed all the sets and kept just this, and it also doesn't work
I think it's because of catplot, but I don't see any alternative to it
fig, axes = plt.subplots(1, 2)
fig.suptitle('hello world')
sns.barplot(ax = axes[0], x = df.x_values, y = df.y_values)
axes[0].set_title('title1')
sns.barplot(ax = axes[1], x = df.x_values, y = df.y_values)
axes[1].set_title('title2')
It's the same
the (1,2)
copy, see if it works
what's the difference between your code and this:
fig, ax = plt.subplots(1,2)
sns.catplot(data = america_dataset, ax=ax[0])
sns.catplot(data = asia_dataset, ax=ax[1])
you tell me lol
im out
it doesn't work with catplot apparently
its called Seaborn...
?
yes, and with other types of plot it will probably work
like countplot, I know that works
I have faith you know what you're doing
I thought with catplot we could plot more than one graph at the same time, like with the other seaborn plots
fig, ax = plt.subplots(1,2, figsize=(9,5));
sns.countplot(x = y_test, ax=ax[0]).set_title('Real results');
sns.countplot(x = predictions, ax=ax[1]).set_title('Algorithm predictions');
plt.ylabel(' ')
fig.show();
like that
The reason i said specify your variables is that it forces you to make sure they make sense. I dont know your data so only you can know if those magical dataframes are in order
does countplot use subplots.. i remember problems with facetgrids long ago...
I have two datasets, america and asia, so I just pass data= and it's good
I already separated the variables
apparently not right?
why?
cause its not working! lol a joke
what do you want your vis to look like ?
Can you show the two you want working separately ?
??? Each graph is working, but putting them together is not. Do you feel the difference?
a scatter would take so much work to plot
you're not specifying your variables, are you
the year variables?
the year variables are columns
catplot is a way of looking at categorical vars at the same time.. you just want two side by side scatter plots
build the arrays you need and feed them in
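A sketch of the side-by-side version, on the assumption that the problem is that `sns.catplot` is a figure-level function (it builds its own figure, so it does not draw into an existing `ax`). Axes-level functions such as `sns.stripplot`, which is what catplot draws by default, do accept `ax=`. The two small frames and their column names are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical long-format data: one row per (country, year) observation.
america = pd.DataFrame({
    "year": [1950, 1950, 2019, 2019],
    "life_expectancy": [68.0, 62.5, 78.9, 75.1],
})
asia = pd.DataFrame({
    "year": [1950, 1950, 2019, 2019],
    "life_expectancy": [42.0, 60.1, 76.9, 84.4],
})

fig, axes = plt.subplots(ncols=2, figsize=(9, 4), sharey=True)

# Axes-level seaborn functions draw into the axis they are given.
sns.stripplot(data=america, x="year", y="life_expectancy", ax=axes[0])
sns.stripplot(data=asia, x="year", y="life_expectancy", ax=axes[1])

# Style each subplot through the matplotlib Axes methods.
for ax, title in zip(axes, ["America", "Asia"]):
    ax.set_title(title)
    ax.set_xlabel("1950-2019", weight="bold")
    ax.set_ylabel("Life expectancy", weight="bold")
    ax.tick_params(axis="x", labelbottom=False)  # hide the crowded year labels

fig.savefig("life_expectancy.png")
```

With wide data where the years are columns, reshaping to this long format first (for example with `DataFrame.melt`) is probably needed before passing `x=` and `y=`.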
a scatter plot would not work since the number of countries or years in the database is huge
this is a scatter plot
this is cat plot
you didnt show your code and i dont really care what function you used, im telling you what it is lol
those are points, thats the x and y axis lol
can someone help me with anomaly detection
but is catplot
lol
use a scatter
I would have to have a years column
guys is it possible to detect the anomalous products just from their descriptions, for any 2 random categories?

