#data-science-and-ml

serene scaffold Oct 20, 2021, 3:32 AM

#

dataset['text'].astype.str() does not convert the data to strings.

wicked grove Oct 20, 2021, 3:32 AM

#

yeah it gave an error

serene scaffold Oct 20, 2021, 3:32 AM

#

astype is a method, not an accessor.

serene scaffold Oct 20, 2021, 3:32 AM

#

wicked grove yeah it gave an error

if something causes an error, please always copy and paste the whole error message into the chat.

#

!docs pandas.Series.str.join

arctic wedgeBOT Oct 20, 2021, 3:33 AM

#

pandas.Series.str.join


Series.str.join(sep)
Join lists contained as elements in the Series/Index with passed delimiter.

If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the function. This function is an equivalent to [`str.join()`](https://docs.python.org/3/library/stdtypes.html#str.join "(in Python v3.10)").

desert oar Oct 20, 2021, 3:33 AM

#

also: .astype(str) is dangerous because if you have a "null" value like NaN, it will produce the string "nan", not a proper None

#

in addition to the obvious fact that .astype(str) on a column of lists will produce bad text like "['a', 'b']" which is not what you want
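To make the difference concrete, here is a minimal sketch on a small made-up Series (the data and variable names are illustrative, not taken from the dataset being discussed). It contrasts `.astype(str)` with `.str.join(" ")` on a column of token lists.

```python
import pandas as pd

# Hypothetical column of token lists, standing in for dataset['text'].
tokens = pd.Series([["love", "u", "guy"], ["sick", "really", "cheap"]])

# .astype(str) calls str() on each list, so every cell becomes Python-literal text.
as_text = tokens.astype(str)
print(repr(as_text.iloc[0]))   # "['love', 'u', 'guy']"

# .str.join(" ") joins the items of each list with the given separator.
joined = tokens.str.join(" ")
print(repr(joined.iloc[0]))    # 'love u guy'
```

How missing values come out of `.astype(str)` may depend on the pandas version and dtype, so it is worth checking that case separately on real data.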

#

that said, it looks like your text is already tokenized. what's the point of joining and then tokenizing again?

serene scaffold Oct 20, 2021, 3:35 AM

#

desert oar that said, it looks like your text is already tokenized. what's the point of joi...

they might have wanted to tokenize -> lemmatize -> join

desert oar Oct 20, 2021, 3:35 AM

#

hm. the screenshot is really hard to read, but it looks like some rows are tokenized and some rows are not

#

like... some rows are strings and some are lists of strings. i.e. a real mess

wicked grove Oct 20, 2021, 3:36 AM

#

desert oar that said, it looks like your text is already tokenized. what's the point of joi...

yess, at the end i want to vectorize it so i want my list as,['a', 'b'] and not [a,b] so i wasnt sure at which part i should convert it to strings

wicked grove Oct 20, 2021, 3:36 AM

#

serene scaffold `astype` is a method, not an accessor.

ohh alrightt

desert oar Oct 20, 2021, 3:37 AM

#

what is the type of each element here? are these strings or lists? please show us some example data that is not in a tiny unreadable screenshot

wicked grove Oct 20, 2021, 3:38 AM

#

i am following this

#

https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-analysis-a-nlp-use-case-for-beginners/

Analytics Vidhya

Gunjan Goyal

Twitter Sentiment Analysis | Implement Twitter Sentiment Analysis M...

In this project, we try to implement a Twitter sentiment analysis model that helps to overcome the challenges in Twitter sentiment analysis.

wicked grove Oct 20, 2021, 3:39 AM

#

desert oar what is the type of each element here? are these strings or lists? please show u...

it is a list but at the end to vectorize i wanted to convert it to a list of strings

#

X=dataset['text'].astype(str)
y=dataset.target

desert oar Oct 20, 2021, 3:39 AM

#

so do you understand now why .astype(str) is wrong?

#

what is the content of dataset['text']?

#

i see, they have it in their example

#

this is some questionable code

#

the data analysis content seems OK, but i would not recommend this article as an example of good code

wicked grove Oct 20, 2021, 3:42 AM

#

desert oar the data analysis content seems OK, but i would not recommend this article as an...

yeah?? why

desert oar Oct 20, 2021, 3:42 AM

#

it's not bad code either, but it's messy and they do some weird/redundant things

royal crest Oct 20, 2021, 3:42 AM

#

df.shape :sweatDuck:

desert oar Oct 20, 2021, 3:45 AM

#

def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
dataset['text']= dataset['text'].apply(lambda x: cleaning_punctuations(x))

this is inefficient because it re-creates the translator in every call, translator = should be outside the function (there are some considerations w/ respect to local vs global lookup, but that's a different issue)
.apply(lambda x: cleaning_punctuations(x)) should be .apply(cleaning_punctuations)
i don't want to criticize their use of english too much because not everyone is a proficient english speaker, and english is a complicated language. but cleaning_punctuations should probably be clean_punctuation
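A minimal sketch of the three suggestions above, assuming `string.punctuation` as a stand-in for the article's `punctuations_list` (which is not shown here). The translation table is built once, and the function is passed around directly instead of being wrapped in a lambda.

```python
import string

# Build the translation table once instead of on every call.
translator = str.maketrans("", "", string.punctuation)

def clean_punctuation(text):
    """Return text with ASCII punctuation characters removed."""
    return text.translate(translator)

tweets = ["hello, world!", "what's up?"]
cleaned = [clean_punctuation(t) for t in tweets]
print(cleaned)  # ['hello world', 'whats up']
```

With a pandas column the equivalent call would be `dataset['text'].apply(clean_punctuation)`.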

wicked grove Oct 20, 2021, 3:46 AM

#

desert oar so do you understand now why `.astype(str)` is wrong?

when i used x=dataset['text'].astype.str it gave this

#

800000 ['love', 'healthuandpets', 'u', 'guy', 'r', 'b...
800001 ['im', 'meeting', 'one', 'besties', 'tonight',...
800002 ['darealsunisakim', 'thanks', 'twitter', 'add'...
800003 ['sick', 'really', 'cheap', 'hurt', 'much', 'e...
800004 ['lovesbrooklyn', 'effect', 'everyone']

desert oar Oct 20, 2021, 3:46 AM

#

wicked grove 800000 ['love', 'healthuandpets', 'u', 'guy', 'r', 'b... 800001 ['im', 'me...

!e ```python
x = ['love', 'healthuandpets', 'u']
print(repr(str(x)))
```

arctic wedgeBOT Oct 20, 2021, 3:46 AM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

"['love', 'healthuandpets', 'u']"

desert oar Oct 20, 2021, 3:46 AM

#

basically it just made a string containing literal python code

#

that is not what you want, ever

#

STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))

they did the right thing creating STOPWORDS outside the function, but

why is it STOPWORDS and not stopwords? not all globals need to be upper case
why str(text)? it's definitely already a string

royal crest Oct 20, 2021, 3:48 AM

#

# remove punctuation
from string import punctuation as p
def some_func(text):
    # returns a list of the characters that are not punctuation
    return [char for char in text if char not in p]
    # alternatively, to return a string instead:
    # return "".join([char for char in text if char not in p])

is this not simpler?

#

also they defined their own stopword list

#

which is hmm

serene scaffold Oct 20, 2021, 3:48 AM

#

desert oar ```python def cleaning_punctuations(text): translator = str.maketrans('', ''...

All languages are complicated and no language is uniquely so.

wicked grove Oct 20, 2021, 3:48 AM

#

desert oar ```python def cleaning_punctuations(text): translator = str.maketrans('', ''...

ohh okay,thanks a lot! i used the above code for removing punctuations but i did not use the lambda function in most of the places

desert oar Oct 20, 2021, 3:49 AM

#

meh, the "standard" stopword lists can be too broad @royal crest , i've done that before

desert oar Oct 20, 2021, 3:49 AM

#

royal crest ```py # remove punctuation from string import punctuation as p def some_func(tex...

str.translate might be significantly faster. i'm not sure, but i would bet $1 on it at least

royal crest Oct 20, 2021, 3:50 AM

#

time to benchmark
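One way such a benchmark could look, as a rough sketch using only the standard library. The sample text and the function names are made up for illustration, and timings will vary by machine and Python version, so no result is claimed here.

```python
import string
import timeit

# Made-up sample text, repeated to get a reasonably long string.
text = "Hello, world! It's a test: does this work? (yes) " * 200

translator = str.maketrans("", "", string.punctuation)
punct = set(string.punctuation)

def with_translate(s):
    return s.translate(translator)

def with_comprehension(s):
    return "".join([ch for ch in s if ch not in punct])

# Both approaches should produce the same cleaned string.
same_result = with_translate(text) == with_comprehension(text)

t_translate = timeit.timeit(lambda: with_translate(text), number=200)
t_comprehension = timeit.timeit(lambda: with_comprehension(text), number=200)
print(f"same result:   {same_result}")
print(f"translate:     {t_translate:.4f} s")
print(f"comprehension: {t_comprehension:.4f} s")
```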

desert oar Oct 20, 2021, 3:50 AM

#

this is also just questionable string cleaning in general, e.g. it doesn't take @usernames into account

#

i guess you could argue that words in usernames could be relevant for determining the sentiment of a tweet, but i doubt it

#

i guess it'll all wash out in feature selection

wicked grove Oct 20, 2021, 3:52 AM

#

desert oar `str.translate` might be significantly faster. i'm not sure, but i would bet $1 ...

a better method to remove punctuations would be to use regex ?

desert oar Oct 20, 2021, 3:54 AM

#

wicked grove a better method to remove punctuations would be to use regex ?

str.translate is fine. regex is good if you need to handle text that consists of more than 1 unicode code point, or if you need to do more complicated replacements

#

you might want to read about "unicode normalization"

#

and you might want to familiarize yourself generally with what "unicode" is - python strings are sequences of unicode code points
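A small illustration of why normalization matters, using the standard `unicodedata` module. The same visible character can be one code point or two, and a `str.translate` table keyed on single code points only sees them one at a time.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# NFC normalization composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)           # True
```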

wicked grove Oct 20, 2021, 3:56 AM

#

desert oar ```python STOPWORDS = set(stopwordlist) def cleaning_stopwords(text): return...

yeah i didnt get the 2nd point either, just used it as text

desert oar Oct 20, 2021, 3:57 AM

#

that's good

wicked grove Oct 20, 2021, 3:58 AM

#

desert oar and you might want to familiarize yourself generally with what "unicode" is - py...

ohh okay, thank you! i will read about both

wicked grove Oct 20, 2021, 4:02 AM

#

desert oar !e ```python x = ['love', 'healthuandpets', 'u'] print(repr(str(x))) ```

but i am unable to vectorize it w/o converting the list into a list of strings. how do i go about converting [a,b] to ['a','b']?

desert oar Oct 20, 2021, 4:02 AM

#

you should have strings like this: "hello this is a tweet", which you vectorize into lists: ["hello", "this", "is", "a", "tweet"]

#

pandas tries to make the output "pretty", but in the process it can be difficult to see what is actually in each element
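A quick way to look past the pretty-printing, sketched on a made-up Series. pandas typically shows a list of strings without quotes (so `['a', 'b']` appears as `[a, b]`), while `type()` and `repr()` on a single element show what is really stored.

```python
import pandas as pd

# Hypothetical column of token lists.
tokens = pd.Series([["a", "b"], ["c"]])

print(tokens)   # the display usually drops the quotes around each string

first = tokens.iloc[0]
print(type(first))   # <class 'list'>
print(repr(first))   # ['a', 'b']

# Counting element types helps spot a column that mixes strings and lists.
print(tokens.map(type).value_counts())
```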

#

i'm not sure how you ended up with the [] stuff. it's better if you show your code

fluid pebble Oct 20, 2021, 4:03 AM

#

hello fellas

#

I need help with a project in python

#

i am new to python so i need help

wicked grove Oct 20, 2021, 4:04 AM

#

desert oar i'm not sure how you ended up with the `[]` stuff. it's better if you show your ...

alright!

fluid pebble Oct 20, 2021, 4:06 AM

#

i want to develop a program which can identify difference between 'ideal image" and 'difference image' . can anybody help

wicked grove Oct 20, 2021, 4:06 AM

#

desert oar you should have strings like this: `"hello this is a tweet"`, which you vectoriz...

i think i end up w [] since i have tokenized, stemmed and lemmatized before vectorizing

wicked grove Oct 20, 2021, 4:08 AM

#

desert oar i'm not sure how you ended up with the `[]` stuff. it's better if you show your ...

should i paste it here?

desert oar Oct 20, 2021, 4:08 AM

#

wicked grove should i paste it here?

!paste yes, please. see below 👇

arctic wedgeBOT Oct 20, 2021, 4:08 AM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

desert oar Oct 20, 2021, 4:09 AM

#

fluid pebble i want to develop a program which can identify difference between 'ideal image" ...

imo it's bad taste to re-post your question, as if nobody here already tried to help you earlier today

#

2 people here spent a good amount of time trying to understand the use case and guide you towards a solution

fluid pebble Oct 20, 2021, 4:10 AM

#

yes i appreciate help

#

but my OCR results are not quite good

#

i need help with my ocr result

desert oar Oct 20, 2021, 4:11 AM

#

okay, are you able to share examples? there might not be any real OCR experts here. you also might want to describe if there are any kinds of non-text differences that your system should be looking for

wicked grove Oct 20, 2021, 4:12 AM

#

desert oar !paste yes, please. see below 👇

https://paste.pythondiscord.com/vuxazonide.py

fluid pebble Oct 20, 2021, 4:12 AM

#

desert oar okay, are you able to share examples? there might not be any real OCR experts he...

should i post code?

desert oar Oct 20, 2021, 4:12 AM

#

someone else suggested comparing rgb histograms. maybe you can also use some kind of image embedding to compute distances between images in some interesting feature space

fluid pebble Oct 20, 2021, 4:13 AM

#

well i have also done that

#

but my boss dont want these results

desert oar Oct 20, 2021, 4:15 AM

#

wicked grove https://paste.pythondiscord.com/vuxazonide.py

here's a tip, instead of this:

english_stopwords = set(stopwords.words('english'))
not_stopwords = ['not']
final_stopwords = set([word for word in stop_words if word not in not_stopwords])

you can do this, using the - operation on sets:

english_stopwords = set(stopwords.words('english'))
not_stopwords = {'not'}
final_stopwords = english_stopwords - not_stopwords

fluid pebble Oct 20, 2021, 4:15 AM

#

the result which i am looking for is the program can highlight 'spelling mistakes' , change in a text in difference image as compared to ideal image, any content which is missing in difference image as compared to original image.

desert oar Oct 20, 2021, 4:16 AM

#

@wicked grove this is part of the problem:

tokenizer=RegexpTokenizer(r'\w+')
dataset['text']=dataset['text'].apply(tokenizer.tokenize)

#

imo you should be saving the tokenized text into a separate column

#

tokenizer = RegexpTokenizer(r'\w+')
dataset['tokens'] = dataset['text'].apply(tokenizer.tokenize)

wicked grove Oct 20, 2021, 4:16 AM

#

desert oar ```python tokenizer = RegexpTokenizer(r'\w+') dataset['tokens'] = dataset['text'...

Ohh okayy ill change that

desert oar Oct 20, 2021, 4:17 AM

#

but otherwise this is OK until line 176 when you use .astype

wicked grove Oct 20, 2021, 4:17 AM

#

desert oar here's a tip, instead of this: ```python english_stopwords = set(stopwords.words...

Ah yes! I will try this

desert oar Oct 20, 2021, 4:18 AM

#

all you have to do is replace dataset['text'].astype(str) with dataset['text'].str.join(" ") as per stelercus' suggestion

#

but i strongly encourage you to think about what the difference is

#

generally this is inefficient code but it's not necessarily bad. what i will say is that TfidfVectorizer lets you apply pretty much all of these data transformations in one shot

#

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html see the preprocessor, tokenizer, and stop_words arguments

scikit-learn

sklearn.feature_extraction.text.TfidfVectorizer

Examples using sklearn.feature_extraction.text.TfidfVectorizer: Biclustering documents with the Spectral Co-clustering algorithm Biclustering documents with the Spectral Co-clustering algorithm, To...
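A minimal sketch of doing the cleaning inside the vectorizer through the `preprocessor`, `tokenizer`, and `stop_words` arguments. The helper functions, the tiny stop word list, and the sample tweets are made up for illustration. `token_pattern=None` is passed because the default pattern is unused once a custom tokenizer is supplied.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer

translator = str.maketrans("", "", string.punctuation)

def preprocess(text):
    # Lowercase and strip ASCII punctuation before tokenizing.
    return text.lower().translate(translator)

def tokenize(text):
    # Split on runs of word characters.
    return re.findall(r"\w+", text)

vectorizer = TfidfVectorizer(
    preprocessor=preprocess,
    tokenizer=tokenize,
    token_pattern=None,
    stop_words=["the", "is", "a"],   # illustrative list only
)

tweets = ["The movie is great!", "A terrible, terrible movie."]
X = vectorizer.fit_transform(tweets)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)     # ['great', 'movie', 'terrible']
print(X.shape)   # (2, 3)
```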

wicked grove Oct 20, 2021, 4:20 AM

#

desert oar but otherwise this is OK until line 176 when you use `.astype`

Yeah i used that because even after i tokenize the result is this [a,b] and not a list of individual strings like ['a','b'] which ig is needed for vectorizing( im not too sure since i watched only a video on that topic)

desert oar Oct 20, 2021, 4:20 AM

#

i don't see any place where you would get a string like "[a, b]"

#

i think you might need to restart your notebook kernel

#

(assuming this is in a notebook)

wicked grove Oct 20, 2021, 4:21 AM

#

desert oar https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.tex...

Okayy,thank you so much!

desert oar Oct 20, 2021, 4:22 AM

#

finally, you don't actually have to re-join the strings at the end

#

you can do TfidfVectorizer(analyzer=lambda x: x) which does not attempt to split any strings

#

see https://github.com/scikit-learn/scikit-learn/issues/17279#issuecomment-630746598

GitHub

How to input tokenized sentences into sklearn.feature_extraction.te...

I find the example in the document : from sklearn.feature_extraction.text import TfidfVectorizer corpus = [ 'This is the first document.', 'This document is the second docum...
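A short sketch of feeding already-tokenized documents to `TfidfVectorizer` through a callable `analyzer`. The token lists are made up. A named function is used here instead of the lambda only so that the fitted vectorizer could be pickled later; the behavior is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents (each one is a list of tokens).
docs = [["love", "u", "guy"], ["sick", "really", "cheap", "love"]]

def identity(tokens):
    # The documents are already tokenized, so return them unchanged.
    return tokens

vectorizer = TfidfVectorizer(analyzer=identity)
X = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)     # ['cheap', 'guy', 'love', 'really', 'sick', 'u']
print(X.shape)   # (2, 6)
```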

wicked grove Oct 20, 2021, 4:23 AM

#

desert oar all you have to do is replace `dataset['text'].astype(str)` with `dataset['text'...

Alrighit, that is because astype is an accessor and the our object is a list so it is converting the entire list to string. But for some reason that is not showing in my dataframe?

desert oar Oct 20, 2021, 4:23 AM

#

wicked grove Alrighit, that is because astype is an accessor and the our object is a list so ...

astype is not an accessor

main fox Oct 20, 2021, 4:23 AM

#

Special shout-out to .reset_index(), it's a great method

wicked grove Oct 20, 2021, 4:23 AM

#

desert oar (assuming this is in a notebook)

It is not a notebook

desert oar Oct 20, 2021, 4:24 AM

#

wicked grove Alrighit, that is because astype is an accessor and the our object is a list so ...

i showed you what happens when you "convert a list to a string". it's not what you want.

#

!eval here, i will show you again:

x = ['a', 'b', 'c']
print(repr(str(x)))

arctic wedgeBOT Oct 20, 2021, 4:24 AM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

"['a', 'b', 'c']"

desert oar Oct 20, 2021, 4:24 AM

#

!eval you probably want this:

x = ['a', 'b', 'c']
print(' '.join(x))

arctic wedgeBOT Oct 20, 2021, 4:24 AM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

a b c

wicked grove Oct 20, 2021, 4:31 AM

#

desert oar !eval here, i will show you again: ```python x = ['a', 'b', 'c'] print(repr(str(...

Yess you showed this to me, but my question is how do you get ['a','b','c'] in the dataframe?
My dataset['text'] column has lists like this [a,b,c] ,like the words in the list arent tokenized like this 'a'

wicked grove Oct 20, 2021, 4:32 AM

#

desert oar `astype` is not an accessor

Okay yes i looked it up😅

wicked grove Oct 20, 2021, 4:33 AM

#

desert oar see https://github.com/scikit-learn/scikit-learn/issues/17279#issuecomment-63074...

Thank youu!! I will refer this and change the code

#

Also for steps 8,9,10 should i just keep the tutorial as a reference and follow some other code for the confusion matrix,etc

lone drum Oct 20, 2021, 5:25 AM

#

Hello
I am writing to CSV file
I am getting some of columns are not completely filled
Ping me when replying

#

the highlighted part is getting empty

wicked grove Oct 20, 2021, 5:37 AM

#

Had a question, why should there be a space after binary operators?

stoic musk Oct 20, 2021, 6:12 AM

#

Hi all, I'm trying to drop the predictions column in a pandas dataframe and getting the following error:

#

AttributeError: 'DataFrame' object has no attribute 'DEATH_EVENT'

#

on line:
y = X_full.DEATH_EVENT

#

X_full.dtypes returns the following:

#

sex int64
smoking int64
time int64
DEATH_EVENT int64
dtype: object

#

...Any ideas?

#

DEATH_EVENT is definitely a column in the dataframe

main fox Oct 20, 2021, 6:30 AM

#

stoic musk DEATH_EVENT is definitely a column in the dataframe

Try accessing it with .loc rather than dot notation
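The actual cause is not shown in this conversation. One possibility, sketched below purely as an assumption, is a column label that is not exactly `DEATH_EVENT` (for example, one with stray whitespace). Printing the labels with `tolist()` makes that visible, and label-based access with `.loc` or brackets works once the labels are clean.

```python
import pandas as pd

# Hypothetical frame whose column label carries a stray trailing space.
X_full = pd.DataFrame({"time": [4, 6, 7], "DEATH_EVENT ": [1, 1, 0]})

print(X_full.columns.tolist())   # ['time', 'DEATH_EVENT ']

try:
    X_full.DEATH_EVENT
    dot_access_failed = False
except AttributeError:
    dot_access_failed = True
print(dot_access_failed)         # True

# Normalizing the labels makes label-based access behave as expected.
X_full.columns = X_full.columns.str.strip()
y = X_full.loc[:, "DEATH_EVENT"]
print(y.equals(X_full["DEATH_EVENT"]))   # True
```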

stoic musk Oct 20, 2021, 6:30 AM

#

clever, thanks!

fluid pebble Oct 20, 2021, 6:52 AM

#

hi

lapis sequoia Oct 20, 2021, 7:08 AM

#

anyone know what this error mean in seaborn ? ==> "RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density."

zinc rock Oct 20, 2021, 8:26 AM

#

why when i convert a json to csv using pandas, then try to read that csv, i get decoding errors?

#

gbk cant decode etc
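A "gbk can't decode" message usually points at a file being read with the system's default encoding instead of the one it was written in, although that is only a guess without the code. A minimal sketch, with made-up data and a temporary file, that states the encoding explicitly on both the write and the read:

```python
import os
import tempfile

import pandas as pd

# Made-up data containing non-ASCII text.
df = pd.DataFrame({"city": ["北京", "Zürich"], "count": [1, 2]})

path = os.path.join(tempfile.mkdtemp(), "out.csv")

# State the encoding explicitly when writing and when reading.
df.to_csv(path, index=False, encoding="utf-8")
roundtrip = pd.read_csv(path, encoding="utf-8")

# The same applies when opening the file with plain Python.
with open(path, encoding="utf-8") as f:
    header = f.readline().strip()

print(roundtrip)
print(header)   # city,count
```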

vocal basin Oct 20, 2021, 10:11 AM

#

Hey guys, have been scratching my head for two hours trying to figure out a way to add names of teams instead of hue dot points. Please help me

#

royal glen Oct 20, 2021, 10:49 AM

#

is there any place where i can get beginner questions on numpy,pandas,etc to practice?

ember goblet Oct 20, 2021, 10:58 AM

#

yes

royal glen Oct 20, 2021, 10:59 AM

#

ember goblet yes

where?

ember goblet Oct 20, 2021, 10:59 AM

#

Numpy 100 exercise: https://github.com/rougier/numpy-100

GitHub

GitHub - rougier/numpy-100: 100 numpy exercises (with solutions)

100 numpy exercises (with solutions). Contribute to rougier/numpy-100 development by creating an account on GitHub.

#

Pandas: https://github.com/ajcr/100-pandas-puzzles

GitHub

GitHub - ajcr/100-pandas-puzzles: 100 data puzzles for pandas, rang...

100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete) - GitHub - ajcr/100-pandas-puzzles: 100 data puzzles for pandas, ranging from short and simple to super tri...

#

Pandas: https://github.com/guipsamora/pandas_exercises

GitHub

GitHub - guipsamora/pandas_exercises: Practice your pandas skills!

Practice your pandas skills! Contribute to guipsamora/pandas_exercises development by creating an account on GitHub.

serene scaffold Oct 20, 2021, 11:20 AM

#

wicked grove Had a question, why should there be a space after binary operators?

That's the style used by pretty much all python developers

fluid pebble Oct 20, 2021, 11:23 AM

#

hi

#

how to improve OCR results accuracy

flint grotto Oct 20, 2021, 12:02 PM

#

hello!

#

uhm. i have question.

#

Can you recommend a machine learning course on Coursera?

#

Last time, you recommended “data science from scratch”, so I bought it and wanted to take a machine learning course.

#

Do I need to master statistics and linear algebra for additional machine learning?

novel elbow Oct 20, 2021, 12:16 PM

#

flint grotto Can you recommend a machine learning course on Coursera?

not on coursera but this one is good: https://course.fast.ai/

flint grotto Oct 20, 2021, 12:18 PM

#

novel elbow not on coursera but this one is good: https://course.fast.ai/

i want on coursera.

#

not available on coursera?

vivid cairn Oct 20, 2021, 12:34 PM

#

What are some methods or tools I can use to distill a small training dataset from a big dataset that when trained on have as good projected performance as the big dataset?

#

I seem to not have the Google Fu required to find information to read about this.

robust jungle Oct 20, 2021, 1:31 PM

#

does anyone know how I can get output node names of an Xception model?

wicked grove Oct 20, 2021, 2:00 PM

#

flint grotto i want on coursera.

The stanford machine learning course by andrew ng on coursera ,deep learning specialization by andrew ng on coursera

wicked grove Oct 20, 2021, 2:02 PM

#

serene scaffold That's the style used by pretty much all python developers

Oh alright!! I looked up the str.join and wanted to ask how do you get ['a','b','c'] in the dataframe?
My dataset['text'] column has lists like this [a,b,c] ,like the words in the list arent tokenized like this 'a'

serene scaffold Oct 20, 2021, 2:07 PM

#

wicked grove Oh alright!! I looked up the str.join and wanted to ask how do you get ['a','b'...

the words in the list aren't tokenized? if you have one big string that represents a document, and from that you derive a list of individual strings, what are those individual strings if not tokens?

#

just arbitrary slices of the original string?

chrome lintel Oct 20, 2021, 2:37 PM

#

Hey all, I'm trying to use pytesseract to read text from an image. This is my current image, but the output produced from it is just nonsense

#

Hey all, I'm trying to use pytesseract to read an image to extract some numbers. This is the code I'm using to try and extract the numbers from:

output = pytesseract.image_to_string(img,config='--psm 10 --oem 3 tessedit_char_whitelist=0123456789')

#

And the output I receive is:

iy Qe

#

One sec, I'll upload the image to imgur too.

#

And here is the screenshot: https://imgur.com/a/7xYZOd8

Imgur

wicked grove Oct 20, 2021, 2:46 PM

#

serene scaffold just arbitrary slices of the original string?

Yeah i guess,i expected the output to be ['a','b'] after using regexptokenizer but it was just [a,b] w/o ' ' ( umm is that correct?)

sly wharf Oct 20, 2021, 3:21 PM

#

I've got this window written in tkinter now i want to make it so the user can enter list of users and this will result in something like this:
add;Username@something.com;Rolename1
add;Username@something.com;Rolename2
add;Username2@something.com;Rolename1
add;Username2@something.com;Rolename2
How do i make that username list from tkinter Text gets appended with selected things into csv file?

mighty spoke Oct 20, 2021, 3:51 PM

#

Hi would someone be able to help me coding some loops?

robust jungle Oct 20, 2021, 4:01 PM

#

How can I use a custom keras model with opencv?

#

Alternatively, how can I get output node names for my model

vague moon Oct 20, 2021, 5:57 PM

#

Is there a way to find the accuracy of your results with a tolerance

#

e.g. I'm predicting final grades for students, my output can be 0-20, right now I'm just checking how often my model is correct, but often when it isn't it is off by only 1, I want to see how often this is

#

I've stumbled across explained_variance_score and am wondering if that's what I'm looking for, seems like it

kind cape Oct 20, 2021, 6:49 PM

#

Hey everyone 👋 Can someone help with this? Thanks!

#

I started like this:

#

`#method for kNN method
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss

"""
@param X, np ndarray, feature matrix
@param y, np array, labels vector
@param kfold, Integer number of CV folds
@param n_neighbors, Integer number of neighbors in KNN

@returns floating number
"""
def evaluate_kNN(X,y,kfold,n_neighbors):
    kf = KFold(kfold)
    neigh = KNeighborsClassifier(n_neighbors)`
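The assignment itself is not shown, so the following is only a guess at how the function might continue. It assumes, based on the imports and the docstring, that the floating number to return is the mean zero-one loss over the cross-validation folds. The synthetic data at the end is made up to exercise it.

```python
import numpy as np
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_kNN(X, y, kfold, n_neighbors):
    """Return the mean zero-one loss of a kNN classifier over `kfold` CV folds."""
    kf = KFold(n_splits=kfold)
    losses = []
    for train_idx, test_idx in kf.split(X):
        # Fit a fresh classifier on each training split.
        neigh = KNeighborsClassifier(n_neighbors=n_neighbors)
        neigh.fit(X[train_idx], y[train_idx])
        losses.append(zero_one_loss(y[test_idx], neigh.predict(X[test_idx])))
    return float(np.mean(losses))

# Tiny synthetic check: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

loss = evaluate_kNN(X, y, kfold=4, n_neighbors=3)
print(loss)
```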

boreal summit Oct 20, 2021, 7:06 PM

#

Seems like many of the TFX.utils functions have been deprecated.

#

Same with some tfx.components libraries. Found the new imports for some of them while I can't find the rest. 😐

arctic wedgeBOT Oct 20, 2021, 7:13 PM

#

Hey @kind cape!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

odd meteor Oct 20, 2021, 9:06 PM

#

vague moon I've stumbled across explained_variance_score and am wondering if that's what I'...

You mentioned tolerance... Are you trying to perform a collinearity diagnostic? Tolerance is a metric used to assess the amount of multicollinearity present in a model.
You might wanna provide some more clarity on that part....

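On the other hand, explained_variance_score is closely related to, but not the same as, the Coefficient of Determination a.k.a. R-squared (r2_score). The two agree when the prediction errors average to zero; explained variance ignores a constant offset in the predictions, while R-squared penalizes it.

R-squared is the default metric for regression problems in scikit-learn (it is what a regressor's score method returns).

This metric measures the amount of variation in your response variable (target) which the explanatory variables (a.k.a. features) in your model were able to explain. So the higher the R-squared score, the better your model performance.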

The best metric to capture what you are actually looking for is MSE or better still RMSE. The lower your RMSE score the better your model performance.
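For the original question (how often a prediction is off by at most 1), a direct count is also possible. A minimal NumPy sketch with made-up grades, showing exact accuracy, accuracy within a tolerance of 1, and RMSE side by side:

```python
import numpy as np

# Made-up true and predicted grades on the 0-20 scale.
y_true = np.array([10, 12, 15, 8, 20])
y_pred = np.array([10, 13, 17, 8, 19])

abs_err = np.abs(y_pred - y_true)

exact_accuracy = np.mean(abs_err == 0)   # share of exact hits
within_one = np.mean(abs_err <= 1)       # share of predictions off by at most 1
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

print(exact_accuracy)   # 0.4
print(within_one)       # 0.8
print(round(rmse, 3))   # 1.095
```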

worthy crystal Oct 20, 2021, 9:53 PM

#

Hello guys I was wondering if anyone is experienced with autoencoders for anomaly detection

#

if you are please @ me

lapis sequoia Oct 20, 2021, 10:55 PM

#

is pytorch easier than TF + keras?

#

also is TF + keras more well known and has more guides?

lapis sequoia Oct 20, 2021, 10:58 PM

#

vague moon e.g. I'm predicting final grades for students, my output can be 0-20, right now ...

I use r2_score

#

from sklearn.metrics import r2_score

quiet vault Oct 20, 2021, 11:05 PM

#

lapis sequoia is pytorch easier than TF + keras?

in my opinion keras is much easier

sonic comet Oct 21, 2021, 1:47 AM

#

how do I get the accuracy of the prediction in tensorflow?

#

val = model.predict(images)

#

???

sharp beacon Oct 21, 2021, 1:54 AM

#

is it possible to do a time series prediction on super irregular data with a lot of time where the data is zero?

#

for example, from say june 1-15, there are rejected records every 4 hrs, between them there are none rejected. June 16-30 there are zero rejected records. Is it possible to forecast another 30 days with this type of irregular data?

tender hearth Oct 21, 2021, 2:18 AM

#

sonic comet ```python val = model.predict(images) ```

model.predict returns the output of the model on the input data, it's up to you to compute the accuracy of the predictions against the samples

sonic comet Oct 21, 2021, 2:20 AM

#

yeah, I know how to get the accuracy while training but how do I get a prediction during a 'runtime'?

tender hearth Oct 21, 2021, 2:20 AM

#

During inference? You do the same thing

sonic comet Oct 21, 2021, 2:23 AM

#

I pass the image into my model, how do I get tensor flow to give me a "sureness" because I don't the answer for the "live testing"
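A rough sketch of turning model outputs into a per-image "sureness" score. It assumes a classifier whose `model.predict` returns one row of raw scores (logits) per image; the array below is a made-up stand-in for that output. If the last layer already applies softmax, the softmax step can be skipped. The resulting number is the model's own probability estimate, not a guarantee of correctness.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw outputs for 2 images and 3 classes.
val = np.array([[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]])

probs = softmax(val)
predicted_class = probs.argmax(axis=1)   # index of the most likely class per image
confidence = probs.max(axis=1)           # probability assigned to that class

print(predicted_class)   # [0 2]
print(confidence)
```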

rigid zodiac Oct 21, 2021, 2:37 AM

#

Have anyone work with raspberry before??

spark nimbus Oct 21, 2021, 2:38 AM

#

gonna write a guide on audio processing, what do you guys recommend I talk about or explain in a fairly simple way?

spark nimbus Oct 21, 2021, 2:39 AM

#

rigid zodiac Have anyone work with raspberry before??

try #microcontrollers

rigid zodiac Oct 21, 2021, 2:42 AM

#

spark nimbus try #microcontrollers

Thanks

stoic musk Oct 21, 2021, 4:26 AM

#

Hi! I'm trying to convert a numpy array into a pd dataframe

#

When I try to print df.head, i get this:

#

<bound method NDFrame.head of 0
0 0
1 1
2 0
3 0
4 0
5 0

#

the first column must be the index, and the second column is the array (now dataframe), but it doesn't return the typical first five values, like a df.head function call usually does. Any ideas?

velvet thorn Oct 21, 2021, 4:29 AM

#

stoic musk <bound method NDFrame.head of 0 0 0 1 1 2 0 3 0 4 0 5 0

you didn’t call it

stoic musk Oct 21, 2021, 4:29 AM

#

Yeah... looks the same as before

#

I used this:
df_YV = pd.DataFrame(y_valid)

am i missing something?

#

where 'y_valid' is the np array

velvet thorn Oct 21, 2021, 4:30 AM

#

stoic musk I used this: df_YV = pd.DataFrame(y_valid) am i missing something?

you didn’t call .head

#

i.e. you’re missing parentheses
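A tiny illustration of the difference, with a made-up array standing in for `y_valid`. Without parentheses `df_YV.head` is just a reference to the method, and printing it shows the method's repr; with parentheses it is called and returns the first five rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the y_valid array.
y_valid = np.array([0, 1, 0, 0, 0, 0, 1])
df_YV = pd.DataFrame(y_valid)

print(df_YV.head)     # a bound method, not a DataFrame
first_rows = df_YV.head()
print(first_rows)     # the first five rows
```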

stoic musk Oct 21, 2021, 4:31 AM

#

....no i didn't

#

thank you

#

or.. yes I am missing parentheses.

#

thank yOU!

velvet thorn Oct 21, 2021, 4:31 AM

#

you’re welcome

stoic musk Oct 21, 2021, 4:32 AM

#

sublime + ella fitzgerald, great name

#

Is there a function to convert a pd dataframe (single column) of decimal values 0 < x < 1 , to 0-1 values?

#

i.e. if x_h < 0.5, x_h = 0, else x_h = 1

...without iterating over the whole dataframe or array? I have continuous predictions that need to be converted to 0-1

lapis sequoia Oct 21, 2021, 4:40 AM

#

velvet thorn you’re welcome

Off topic but ella is ❤

tender hearth Oct 21, 2021, 4:40 AM

#

stoic musk Is there a function to convert a pd dataframe (single column) of decimal values ...

column = (column > activation).astype(int)?

stoic musk Oct 21, 2021, 4:42 AM

#

...that worked!

#

I wonder what a person does when they're so advanced that people on Discord can't help them lol

lapis sequoia Oct 21, 2021, 4:43 AM

#

stoic musk ...that worked!

Also apart from that you have plenty of ways. My fav is .apply others are iloc but iloc is not good enough for complex logic.

stoic musk Oct 21, 2021, 4:43 AM

#

.apply, I'll write that down, thank you

lapis sequoia Oct 21, 2021, 4:44 AM

#

!d pandas.DataFrame.apply

stoic musk Oct 21, 2021, 4:44 AM

#

oh that makes sense using iloc

arctic wedgeBOT Oct 21, 2021, 4:44 AM

#

pandas.DataFrame.apply


DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (`axis=0`) or the DataFrame’s columns (`axis=1`). By default (`result_type=None`), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

lapis sequoia Oct 21, 2021, 4:46 AM

#

Also .where would be good for this use case too.

#

!d pandas.Series.where

arctic wedgeBOT Oct 21, 2021, 4:46 AM

#

pandas.Series.where


Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
Replace values where the condition is False.

lapis sequoia Oct 21, 2021, 4:48 AM

#

df['newcol']= np.where(df.col1>0.5, 1, 0)

#

@stoic musk

stoic musk Oct 21, 2021, 4:49 AM

#

Nice, would that make a new column in the df or replace the existing?

lapis sequoia Oct 21, 2021, 4:49 AM

#

I reckon you can replace if you wish. Just give same column name.

stoic musk Oct 21, 2021, 4:49 AM

#

ohh, yes of course

#

thanks!

tender hearth Oct 21, 2021, 4:50 AM

#

Why do you need to use the NumPy function?

lapis sequoia Oct 21, 2021, 4:51 AM

#

You can use pd.Series.where too if I'm not mistaken. But anyways i personally would not mind np in this case since pd uses np too.

#

Tho they are bit different as the docs say. I'll check out the details when i get time.

#

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

Via: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
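
A small sketch of the difference quoted above, using made-up values and an assumed cutoff of 0.5. It compares `np.where`, `Series.where`, and the boolean cast from earlier in the thread.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.7, 0.5, 0.9])  # made-up values
mask = s > 0.5

# np.where(cond, a, b): picks a where cond is True, b elsewhere.
via_numpy = np.where(mask, 1, 0)

# Series.where(cond, other): KEEPS the original value where cond is True
# and substitutes `other` where it is False, so the 1s must already be there.
ones = pd.Series(1, index=s.index)
via_where = ones.where(mask, 0)

# The boolean-cast version from earlier in the thread.
via_astype = mask.astype(int)

print(via_numpy.tolist(), via_where.tolist(), via_astype.tolist())
```

All three give the same 0/1 labels here. `np.where` returns a plain NumPy array, while the other two return a Series that keeps the index.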

velvet thorn Oct 21, 2021, 4:57 AM

#

principle of least privilege

velvet thorn Oct 21, 2021, 4:57 AM

#

tender hearth `column = (column > activation).astype(int)`?

I would favour this solution

#

(maybe because it’d what I’d do too 🥴)

zinc rock Oct 21, 2021, 5:06 AM

#

Im training a sklearn.svm.svc classifier on 12k samples

#

Is it normal to be really slow

#

Its like an hour

stoic musk Oct 21, 2021, 5:08 AM

#

Do you have a lot of features...?

zinc rock Oct 21, 2021, 5:09 AM

#

4 i think?

#

The dataset each entry has 4 items per row i guess

tender hearth Oct 21, 2021, 5:10 AM

#

do you have an nvidia gpu

zinc rock Oct 21, 2021, 5:11 AM

#

Yes 1050 laptop

#

I assume its running on cpu now

tender hearth Oct 21, 2021, 5:11 AM

#

What library are you using

zinc rock Oct 21, 2021, 5:11 AM

#

Sklearn

#

Svm.svc

tender hearth Oct 21, 2021, 5:12 AM

#

Right, dumb question

zinc rock Oct 21, 2021, 5:12 AM

#

Im running it in vscode and i cant tell

#

If its frozen or not

#

Maybe i should have done in pycharm but idk how to tell if its not frozen or not

tender hearth Oct 21, 2021, 5:13 AM

#

Scikit has no GPU support

zinc rock Oct 21, 2021, 5:13 AM

#

Damn.

#

Is there a way to tell its still running

#

I guess im looking at task manager and python is taking 50% of cpu

lapis sequoia Oct 21, 2021, 5:19 AM

#

Scikit doesn't have gpu support🤦‍♀️

lapis sequoia Oct 21, 2021, 5:19 AM

#

zinc rock Im training a sklearn.svm.svc classifier on 12k samples

If it helps, colab may help.

zinc rock Oct 21, 2021, 5:20 AM

#

Oh i never used it before can you explain

lapis sequoia Oct 21, 2021, 5:20 AM

#

The cpu may be better than ours and it has 13gb ram (i guess)

#

Basically cloud for you to run notebook. Gives nice gpu and cpu and ram. No load on your hardware. And you can use drive as the storage.

zinc rock Oct 21, 2021, 5:21 AM

#

The model is saved to sav files

#

It should be fine?

#

I can extract it

lapis sequoia Oct 21, 2021, 5:21 AM

#

zinc rock The model is saved to sav files

You can do whatever you want. Just think of drive as your storage.

#

(Assuming you have enough space of course )

zinc rock Oct 21, 2021, 5:22 AM

#

Sorry for weird questions im just really anxious haha

#

Waiting for a model to train without knowing whats happening

lapis sequoia Oct 21, 2021, 5:22 AM

#

It's alright.

zinc rock Oct 21, 2021, 5:22 AM

#

https://colab.research.google.com/

Google Colaboratory

#

This?

lapis sequoia Oct 21, 2021, 5:22 AM

#

You can give it a try once you're done with doing on your pc. I like it and use in my daily life. Like eating and all.

lapis sequoia Oct 21, 2021, 5:22 AM

#

zinc rock This?

Yep.

zinc rock Oct 21, 2021, 5:23 AM

#

Ill just upload my folder then

#

Of code and data

#

Should be able to execute immediately

lapis sequoia Oct 21, 2021, 5:24 AM

#

Alright. You'll need to mount drive if you want to tho. I'm not on laptop otherwise I'd share a small demo notebook.

zinc rock Oct 21, 2021, 5:25 AM

#

Itll only run for 12 hours max huh

#

Should be done before...

lapis sequoia Oct 21, 2021, 5:25 AM

#

zinc rock Itll only run for 12 hours max huh

Yeah that is the limit. Beware.

median berry Oct 21, 2021, 5:30 AM

#

i just deployed some jax models. i have no idea what i deployed.

i really need to learn jax and quick,
i have played around at a beginner level with keras and pytorch,

median berry Oct 21, 2021, 5:30 AM

#

median berry i just deployed some jax models. i have no idea what i deployed. i really need ...

any suggestions welcome

zinc rock Oct 21, 2021, 5:47 AM

#

AttributeError: 'SVC' object has no attribute 'transform', theres nothing about this issue

#

anyone know?

royal crest Oct 21, 2021, 5:48 AM

#

zinc rock AttributeError: 'SVC' object has no attribute 'transform', theres nothing about ...

are you talkign about https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

scikit-learn

sklearn.svm.SVC

Examples using sklearn.svm.SVC: Release Highlights for scikit-learn 0.24 Release Highlights for scikit-learn 0.24, Release Highlights for scikit-learn 0.22 Release Highlights for scikit-learn 0.22,...

#

?

zinc rock Oct 21, 2021, 5:48 AM

#

yes

#

i finished training a model and trying to test it now

royal crest Oct 21, 2021, 5:48 AM

#

would it help to tell you that SVC has no transform() method?

#

#

as the documentation suggests

zinc rock Oct 21, 2021, 5:50 AM

#

wha thte

zinc rock Oct 21, 2021, 6:04 AM

#

royal crest would it help to tell you that SVC has no `transform()` method?

thank you for pointing it out i fixed the issue
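
The actual fix was not posted, so the following is only a sketch of the usual pattern, on made-up data with 4 features per row. An `SVC` is a classifier, so a fitted model is tested with `predict()`. `transform()` belongs to transformers such as scalers.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset: 4 features per row, like the one described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)

clf = SVC(kernel="rbf")
clf.fit(X, y)

# Classifiers expose predict(); there is no transform() on SVC.
preds = clf.predict(X[:5])
print(preds.shape)
print(hasattr(clf, "transform"))
```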

royal crest Oct 21, 2021, 6:04 AM

#

wavey

stoic musk Oct 21, 2021, 6:06 AM

#

Getting a strange error again (thought I fixed it last night)

KeyError: 'DEATH_EVENT'

line in question is:
y = X_full.loc[:,'DEATH_EVENT']

#

where 'DEATH_EVENT' is a column in a pd dataframe

median berry Oct 21, 2021, 6:12 AM

#

median berry i just deployed some jax models. i have no idea what i deployed. i really need ...

anyone has a good seris of notebooks for getting started at jax ???

fluid pebble Oct 21, 2021, 6:12 AM

#

hi

#

i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)

stoic musk Oct 21, 2021, 6:49 AM

#

ValueError: Classification metrics can't handle a mix of binary and continuous targets

even though the dataframe (one column of binary predictions) is of type int64

#

...?

flat crown Oct 21, 2021, 6:56 AM

#

Hey could someone help me with plotting a histogram
https://gyazo.com/59474ba5a4d4d6a8d95e60ebd9cf79ef

Gyazo

#

The x-axis is fine, but the y-axis is supposed to have the values in the array (up to 50,000 or so)

#

but the y-axis here is just going up to 10

#

hmge

#

when i change bins = it fixes but the axis are wrong

main fox Oct 21, 2021, 6:59 AM

#

flat crown but the y-axis here is just going up to 10

The y axis is going up to 10 because for the bins you specified, the frequency of the values that fall within that bin is 10

flat crown Oct 21, 2021, 7:00 AM

#

how do i make it so the x-axis is from 0 to 200

#

and the y axis is just takes from the array i set

main fox Oct 21, 2021, 7:01 AM

#

The y axis is just going to show you how the data falls within the bins you want

#

It won't be the data itself

flat crown Oct 21, 2021, 7:02 AM

#

So what do I do

#

if I want the values of the bars to be what's in the array

#

i need to make the x-axis like this

#

its just the values in the bars are wrong and the y-axis is obviously wrong

#

if you look at the array they're in the thousands so

main fox Oct 21, 2021, 7:04 AM

#

The x goes only to 200 because thats the upper bound you set for the bins

flat crown Oct 21, 2021, 7:04 AM

#

that's fine

#

that's what i want it to be

#

the y-axis is the probem

main fox Oct 21, 2021, 7:04 AM

#

Change the step for the bins

#

Try 3 instead of 5, see how it looks

flat crown Oct 21, 2021, 7:05 AM

#

#

wat now

#

doesn't do anything

#

#

i need it to look something like this

#

but obviously y-axis will be larger

main fox Oct 21, 2021, 7:07 AM

#

Change the step to 25

flat crown Oct 21, 2021, 7:08 AM

#

#

step has nothing to do with this

main fox Oct 21, 2021, 7:09 AM

#

What this tells you is that for the bounds you specified, all your values fall between 0 and 25, and 100 and 150

#

The y axis will never be the values in the array, because in a histogram, you count the frequency of that value appearing

flat crown Oct 21, 2021, 7:11 AM

#

lemme explain wat i want

#

array= [0,0,0,0,0,0,0,0,146,4021,4323,13434,24089,36611,27023,31367,26800,11079,9285,23142,12757,7973,30389,20770,10426,17347,25305,34806,23654,14287,12107,2004,4256,106,3866,2247,2688,0,0]

#

so i have 39 values in here

#

i want each one to be in its own bin

#

and the x-axis to be from 0-200

#

how can I do that

main fox Oct 21, 2021, 7:12 AM

#

And you're clear that by setting the x axis to be 200, you'll miss all the values that are greater than 200?

flat crown Oct 21, 2021, 7:13 AM

#

wdym why would taht be

#

the values in the array are for the y axis

#

each value in the array is a spacing of 5

#

so first value is 0-5

#

second is 5-10

#

etc

main fox Oct 21, 2021, 7:13 AM

#

The x axis is the bins

The y axis is the frequency of numbers within the bin

flat crown Oct 21, 2021, 7:14 AM

#

yea

stoic musk Oct 21, 2021, 7:14 AM

#

ohhhh I understand. @main fox in the array, the first entry is a count of all values between 0-5

flat crown Oct 21, 2021, 7:14 AM

#

yea

stoic musk Oct 21, 2021, 7:15 AM

#

so array[0] -> X = 0-5, count = 0

#

It's not an array of individual values

#

@flat crown wouldn't it be easier to create a bar chart instead?

flat crown Oct 21, 2021, 7:17 AM

#

i need histograms for this assignment

stoic musk Oct 21, 2021, 7:17 AM

#

does the prompt require you to use np.array

flat crown Oct 21, 2021, 7:18 AM

#

no

#

i just have to make this histogram

stoic musk Oct 21, 2021, 7:18 AM

#

**np.arange

flat crown Oct 21, 2021, 7:18 AM

#

nah that's just what i used for my other histograms

#

its not really assignment, its a research project

main fox Oct 21, 2021, 7:20 AM

#

Since you're already using pandas
Convert the list into a series arr= pd.Series(array)

And then use arr.plot(kind='hist')
Pandas will handle the bins automatically

flat crown Oct 21, 2021, 7:21 AM

#

#

almost there

#

just need the x-axis to be on the y

#

and the x-axis to be 0-200

main fox Oct 21, 2021, 7:22 AM

#

That's not how a histogram works

flat crown Oct 21, 2021, 7:22 AM

#

#

i need it to be like this one

#

wdym

#

tiny

lapis sequoia Oct 21, 2021, 7:24 AM

#

flat crown

is that colab

flat crown Oct 21, 2021, 7:24 AM

#

idk wat colab is

#

hmge

lapis sequoia Oct 21, 2021, 7:24 AM

#

google colab

#

for ipynb

#

nvm

flat crown Oct 21, 2021, 7:24 AM

#

jupyter notebook

lapis sequoia Oct 21, 2021, 7:24 AM

#

yes

stoic musk Oct 21, 2021, 7:24 AM

#

I think you should make a bar chart (which will also be a histogram), because your data is already binned and counted.

X axis should be (index of the array + 1) * 5

Y axis should be array[index]
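
A sketch of that bar-chart idea with matplotlib, assuming each entry of the posted array is the count for one bin of width 5. It places bars at the left edge of each bin (index * 5) rather than the right edge, which is a layout choice and not something the chat settled on.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Pre-binned counts copied from the message above: entry i covers [5*i, 5*i + 5).
counts = [0, 0, 0, 0, 0, 0, 0, 0, 146, 4021, 4323, 13434, 24089, 36611, 27023,
          31367, 26800, 11079, 9285, 23142, 12757, 7973, 30389, 20770, 10426,
          17347, 25305, 34806, 23654, 14287, 12107, 2004, 4256, 106, 3866,
          2247, 2688, 0, 0]
bin_width = 5
left_edges = np.arange(len(counts)) * bin_width  # 0, 5, ..., 190

fig, ax = plt.subplots()
# align="edge" anchors each bar at its left edge, so the bars tile the
# x-axis the way histogram bins do.
bars = ax.bar(left_edges, counts, width=bin_width, align="edge")
ax.set_xlim(0, 200)
ax.set_xlabel("value (bin width 5)")
ax.set_ylabel("count")
```

Because the data is already counted, `plt.hist` on the raw array would count the counts themselves, which is why the earlier plots topped out around 10 on the y-axis. In a notebook, drop the `matplotlib.use("Agg")` line to see the figure inline.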

lapis sequoia Oct 21, 2021, 7:24 AM

#

precisely

flat crown Oct 21, 2021, 7:24 AM

#

alright i can make a bar chart

#

how would I do that hmge

lapis sequoia Oct 21, 2021, 7:25 AM

#

i use plotly express

#

lmao

#

anyways

stoic musk Oct 21, 2021, 7:25 AM

#

excel lol

flat crown Oct 21, 2021, 7:25 AM

#

stoic musk Oct 21, 2021, 7:25 AM

#

I'm new to sklearn/pandas. But i know exactly what you're looking for and understand what the np array is communicating

flat crown Oct 21, 2021, 7:25 AM

#

this looks kinda bad

#

hmge

stoic musk Oct 21, 2021, 7:26 AM

#

Just change the axis formatting

#

how many bins do you want

flat crown Oct 21, 2021, 7:26 AM

#

i can fix the axis

#

its the actual blue lines themselves

stoic musk Oct 21, 2021, 7:26 AM

#

ok

flat crown Oct 21, 2021, 7:26 AM

#

look bad

#

hmge

stoic musk Oct 21, 2021, 7:27 AM

#

ohh. can't help with that unfortunately.

or you can run your other data in the same style

main fox Oct 21, 2021, 7:28 AM

#

flat crown

You said your output needs to be exactly like this?

flat crown Oct 21, 2021, 7:28 AM

#

yes

main fox Oct 21, 2021, 7:29 AM

#

They don't have any values that fall between 0 and 25

flat crown Oct 21, 2021, 7:29 AM

#

my broski

#

tiny

main fox Oct 21, 2021, 7:29 AM

#

Your data has 0's

stoic musk Oct 21, 2021, 7:29 AM

#

@main fox , think of the np array like values in a dict

#

array[0] = 0 right?

flat crown Oct 21, 2021, 7:30 AM

#

yea

stoic musk Oct 21, 2021, 7:31 AM

#

What this means is:

the frequency of values between 0-5 is 0. So 0 is not in the dataset

main fox Oct 21, 2021, 7:31 AM

#

So his output won't match the example

#

That's my point

stoic musk Oct 21, 2021, 7:32 AM

#

nope, not with np.arange I don't think so

#

cool

stoic musk Oct 21, 2021, 7:36 AM

#

flat crown

So, formatting aside, this is what the graph/histogram should look like, right?

flat crown Oct 21, 2021, 7:36 AM

#

yep

stoic musk Oct 21, 2021, 7:37 AM

#

Cool. Good luck, gotta go to bed

lapis sequoia Oct 21, 2021, 7:58 AM

#

guys, I have some data displayed in the terminal. like the one shown in the image. Do we have a way to put the entire displayed data into a file?

#

#

This is in visual code

uneven thistle Oct 21, 2021, 9:09 AM

#

how to make the xaxis look better ? as you can see nothing is visible on the x axis Please help asap

#

and why is my figure size not changing?

fluid pebble Oct 21, 2021, 10:25 AM

#

hi
i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)

median fulcrum Oct 21, 2021, 10:58 AM

#

I never use a database like that(yeras as columns until 2019). How can I find the line or cell of this value that i'm trying to find. Just returns me a True and False dataframe

indigo steppe Oct 21, 2021, 11:04 AM

#

Hi there,i am a total beginner in ml and have a question that may make not much sense.I am reading about reinforcement learning right now.I understand that there is a punishment/reward system.So if the program searches for patterns (time series data for example),do i need to define what kind of patterns i want or does the program find its own patterns?Sorry i am not a good programmer and i try to understand the theory behind supervised,unsupervised and reinforcement learning.

median fulcrum Oct 21, 2021, 11:31 AM

#

median fulcrum I never use a database like that(yeras as columns until 2019). How can I find th...

so annoying 😫

royal crest Oct 21, 2021, 11:35 AM

#

median fulcrum so annoying 😫

you could try transposing?

median fulcrum Oct 21, 2021, 11:37 AM

#

royal crest you could try transposing?

to what?

#

0 and 1?

royal crest Oct 21, 2021, 11:38 AM

#

!d pandas.DataFrame.T

arctic wedgeBOT Oct 21, 2021, 11:38 AM

#

pandas.DataFrame.T


property DataFrame.T

royal crest Oct 21, 2021, 11:38 AM

#

since you weren't happy about years being in columns ...

#

though i don't know what kind of data you're working with

median fulcrum Oct 21, 2021, 11:38 AM

#

royal crest though i don't know what kind of data you're working with

#

I just want to find the line or cel of the value. But if I continue with this dataframe idk if would be very difficult to make graphs, since maybe I will have to put 1950-2019

royal crest Oct 21, 2021, 11:40 AM

#

indeed

#

you might want to try and restructure your data

#

and before you ask "how", think about what would make your life easier should something go on the index vs column

median fulcrum Oct 21, 2021, 11:40 AM

#

if was possible to do something:
plt.hist(x = 2:70, data=df)

#

:/

royal crest Oct 21, 2021, 11:41 AM

#

though you can probably access all the columns with df.columns

#

minus the first two

median fulcrum Oct 21, 2021, 11:42 AM

#

royal crest though you can probably access all the columns with `df.columns`

yes, but why would this help with plt.hist(x = 2:70, data=df)?

royal crest Oct 21, 2021, 11:43 AM

#

What you are replying to was not in response to your message

#

It was just me trying to finish what I was saying from earlier

median fulcrum Oct 21, 2021, 11:43 AM

#

but if you think in a columns to year, maybe would have to repeat the countries to each yar

#

🤔

royal crest Oct 21, 2021, 11:44 AM

#

royal crest and before you ask "how", think about what would make your life easier should so...

That's why I said this.

median fulcrum Oct 21, 2021, 11:45 AM

#

I never transformed a database like that, and I don't think would be possible with this database since probably would have to repeat the countries to each year

royal crest Oct 21, 2021, 11:46 AM

#

I think it's pretty simple, but you're the data scientist.

median fulcrum Oct 21, 2021, 11:47 AM

#

but how would you structure this database?

countries year expectancy?

royal crest Oct 21, 2021, 11:48 AM

#

i would transpose it

#

as suggested earlier

#

https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose

#

handy dandy

#

i would transpose it then manipulate it in such a way that I'm able to two things from this dataFrame

median fulcrum Oct 21, 2021, 11:51 AM

#

I can't figure out how would transpose be helpful in that

royal crest Oct 21, 2021, 11:52 AM

#

typically one would select year

median fulcrum Oct 21, 2021, 11:52 AM

#

years but like that: 1950 1951

royal crest Oct 21, 2021, 11:52 AM

#

so it's useful for us to take advantage of that fact, and put year as index

#

because right now, the indexes are not being used at all

#

also, the Code is not useful in this dataframe at all, so we can take that out and store it as dict values using entity as key

#

which leaves us countries as columns, years as indexes and life expectancy as the values
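
A rough sketch of that restructuring on a made-up two-country table. The column names `Entity` and `Code` are guesses based on the messages above, and the numbers are invented.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "Entity": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "1950": [40.1, 55.3],
    "1951": [40.8, 55.9],
})

# Keep Entity -> Code as a plain dict, as suggested above.
codes = dict(zip(wide["Entity"], wide["Code"]))

# Countries become columns, years become the index.
by_year = wide.drop(columns="Code").set_index("Entity").T
by_year.index = by_year.index.astype(int)  # year labels as integers

print(by_year.loc[1950, "A"])
```

With years on the index, selecting one year is `by_year.loc[1950]` and plotting one country over time is `by_year["A"].plot()`.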

median fulcrum Oct 21, 2021, 11:54 AM

#

actually if we create a variable years with all years columns that can be used as df[df['years'] == 10101] we don't need to transpose

#

since this is the major problema in the df

royal crest Oct 21, 2021, 11:55 AM

#

Sure, sounds like you have a plan!

#

Good luck.

median fulcrum Oct 21, 2021, 11:55 AM

#

sad

#

I'm in rage with the guy who make this dataset

#

😩

royal crest Oct 21, 2021, 11:56 AM

#

Most of data science is just wrestling with data

median fulcrum Oct 21, 2021, 11:56 AM

#

lol

#

I'm homesick with this databases:

#

clientid loan income default

#

lemon_angrysad

royal crest Oct 21, 2021, 11:58 AM

#

it does not get better cz_Unsure_smile

median bone Oct 21, 2021, 12:00 PM

#

hi I need pandas help

royal crest Oct 21, 2021, 12:00 PM

#

we know you do, that's why you are here 😄

median bone Oct 21, 2021, 12:00 PM

#

here is "df_confirmed_local"

#

we need to make a function that

def case_no(report_date)
and return the case no. of the corresponding date

#

example:

#

case_no("04/02/2020")

#

output: 16

#

someone help me plsss

royal crest Oct 21, 2021, 12:03 PM

#

so you're just counting the number of values in the "report date" column that matches your input

median bone Oct 21, 2021, 12:03 PM

#

royal crest so you're just counting the number of values in the "report date" column that ma...

not row?

royal crest Oct 21, 2021, 12:04 PM

#

return the case no. of the corresponding date

median bone Oct 21, 2021, 12:04 PM

#

yup

#

but idk how to do it

royal crest Oct 21, 2021, 12:04 PM

#

so do you need any other columns?

#

ah i guess you do

#

the last one

median bone Oct 21, 2021, 12:05 PM

#

ah sorry I dont get 😅

royal crest Oct 21, 2021, 12:06 PM

#

so in this case, you gotta match two things: "report date" == input, "Confirmed/probably" == "confirmed"

#

then return the len()

median bone Oct 21, 2021, 12:07 PM

#

so

def case_no(report_date):
  report_date = df_confirmed_local["report date"] ?

royal crest Oct 21, 2021, 12:08 PM

#

actually

#

df_confirmed_local -> does this mean your dataframe only contains confirmed cases?

#

that makes life a bit easier

median bone Oct 21, 2021, 12:09 PM

#

yes

#

def case_counts(report_date):
    for i in df_confirmed_local["Report date"].value_counts():
        if i == report_date:
            return i
    
case_counts("04/02/2020")

I tried this

#

but It doesnt work

#

cuz I dont know how to link up the two columns

royal crest Oct 21, 2021, 12:12 PM

#

i don't think you need for and if loops

#

!e

import pandas as pd

def howmany(val):
    """return how many values in A match the input"""
    # make random dataframe
    df = pd.DataFrame({"A": [1, 0, 0, 1, 1], "B": [True, False, True, True, False], "C": ["Y", "Y", "Y", "Y", "Y"]})
    print(df)
 
    # return number of vals that match "A" == val
    return len(df[df["A"] == val]["A"])

print(howmany(1))

here's a really basic example

arctic wedgeBOT Oct 21, 2021, 12:14 PM

#

@royal crest :white_check_mark: Your eval job has completed with return code 0.

001 |    A      B  C
002 | 0  1   True  Y
003 | 1  0  False  Y
004 | 2  0   True  Y
005 | 3  1   True  Y
006 | 4  1  False  Y
007 | 3

royal crest Oct 21, 2021, 12:14 PM

#

line 7 is the output for howmany(1)

#

returning 3 since column "A" has three 1's

#

and in your case, that would be a bit like

#

def case_counts(report_date):
    return len(df_confirmed_local[df_confirmed_local["Report date"] == report_date]["Report date"])
    
case_counts("04/02/2020")

#

i'm sure there's always a better way, so feel free to chime in

median bone Oct 21, 2021, 12:17 PM

#

k lemme try ty :>

lapis sequoia Oct 21, 2021, 12:18 PM

#

Hey, i have a huge amout of csv files with some apparently empty columns. I want to check if they are actually empty in all files, and if that is the case i want to print their name.
To do that, i would like to know if there a way to see if df[columnname] does not contain any values?

royal crest Oct 21, 2021, 12:19 PM

#

!d pandas.DataFrame.isna

arctic wedgeBOT Oct 21, 2021, 12:19 PM

#

pandas.DataFrame.isna


DataFrame.isna()
Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or `numpy.NaN`, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings `''` or `numpy.inf` are not considered NA values (unless you set `pandas.options.mode.use_inf_as_na = True`).

royal crest Oct 21, 2021, 12:19 PM

#

this could be useful

serene scaffold Oct 21, 2021, 12:25 PM

#

royal crest ```py def case_counts(report_date): return len(df_confirmed_local[df_confirm...

rd_s = df_confirmed_local['Report date']
return len(rd_s[rd_s == report_date])

median bone Oct 21, 2021, 12:25 PM

#

royal crest ```py def case_counts(report_date): return len(df_confirmed_local[df_confirm...

why `len()` tho? doesnt it return the length of the string?

serene scaffold Oct 21, 2021, 12:25 PM

#

alternatively, you could do this once:

report_date_counts = df_confirmed_local['Report date'].value_counts()

#

and then any time you need a specific report date count, it's just a dict-like lookup to report_date_counts.
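
A sketch of that lookup approach. The dataframe below is a made-up stand-in for `df_confirmed_local`, and only the `Report date` column name comes from the chat.

```python
import pandas as pd

# Hypothetical stand-in for df_confirmed_local.
df_confirmed_local = pd.DataFrame({
    "Report date": ["04/02/2020", "04/02/2020", "05/02/2020", "04/02/2020"],
})

# Count every date once up front.
report_date_counts = df_confirmed_local["Report date"].value_counts()

def case_counts(report_date):
    """Return how many rows carry the given report date (0 if none)."""
    # Series.get behaves like dict.get, so unknown dates do not raise KeyError.
    return int(report_date_counts.get(report_date, 0))

print(case_counts("04/02/2020"))
```

The counting happens once, so repeated calls are cheap. This assumes the dates are stored as strings in the same format as the query.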

royal crest Oct 21, 2021, 12:27 PM

#

median bone why `len()` tho? doesnt it return the length of the string?

length of an array in this case

serene scaffold Oct 21, 2021, 12:27 PM

#

well, Series

median fulcrum Oct 21, 2021, 12:27 PM

#

median fulcrum I never use a database like that(yeras as columns until 2019). How can I find th...

how could I do in this database that i'm replying like that?

royal crest Oct 21, 2021, 12:28 PM

#

serene scaffold well, Series

apologies, my inner matlabSenpai taking over

royal crest Oct 21, 2021, 12:33 PM

#

median fulcrum how could I do in this database that i'm replying like that?

not too sure what you mean mate

median fulcrum Oct 21, 2021, 12:33 PM

#

royal crest not too sure what you mean mate

to appear the cell not a True, False and Nan dataframe

#

cell or line

royal crest Oct 21, 2021, 12:35 PM

#

so if query is "Banana", you'd want a dataframe that looks like:

Name  Calories  Type
True  False    False

?

median bone Oct 21, 2021, 12:36 PM

#

royal crest length of an array in this case

so with the code:
```py
let:
A is a column and B is a column
a is an item in column A
b is an item in column B

code:

len([A=a][B])

output:

b
```

is it like this? @royal crest

median fulcrum Oct 21, 2021, 12:37 PM

#

royal crest so if query is "Banana", you'd want a dataframe that looks like: ``` Name Calo...

look my reply, I would like to replicate what I do in banana but with this database

#

royal crest Oct 21, 2021, 12:38 PM

#

median bone so with the code:```py let: A is a column and B is a column a is an item in col...

close, don't forget that you are accessing those columns through a dataframe

median bone Oct 21, 2021, 12:38 PM

#

yup so If including the df_ stuffs am I correct?

royal crest Oct 21, 2021, 12:39 PM

#

sounds good, also check out Stelercus' response

median fulcrum Oct 21, 2021, 12:45 PM

#

median fulcrum

maybe with all columns but I'm not sure

desert oar Oct 21, 2021, 1:00 PM

#

median fulcrum

i think you need to provide a complete demonstration of input and your expected output. it's still not clear from these scattered examples what you want

lapis sequoia Oct 21, 2021, 1:00 PM

#

royal crest sounds good, also check out Stelercus' response

thanks for the help with isna()
it worked :>

#

took forever to run through all of the files, i have like 150000 csv files i need to process

#

big help

median fulcrum Oct 21, 2021, 1:01 PM

#

desert oar i think you need to provide a complete demonstration of input and your expected ...

My output is in the print A True and false data frame my expected output is this:

#

desert oar Oct 21, 2021, 1:16 PM

#

median fulcrum My output is in the print A True and false data frame my expected output is this...

i saw that, but it doesn't help me understand

#

there is no context

median fulcrum Oct 21, 2021, 1:18 PM

#

desert oar i saw that, but it doesn't help me understand

man... In one print I use a the code with a column and get the exacly cell that the value "user_result" is in, but in the life expectancy database I got a true and false dataframe, how can I got the cell from it?

desert oar Oct 21, 2021, 1:20 PM

#

i still truly have no idea what you mean, i'm sorry. the code database_df[database_df["Name"] == user_results] returns a dataframe, not a single value

#

do you want to get all rows where life expectancy is 18.907 in any column?

median fulcrum Oct 21, 2021, 1:21 PM

#

desert oar do you want to get all _rows_ where life expectancy is 18.907 in any column?

EXACLY

desert oar Oct 21, 2021, 1:21 PM

#

this is why it's important to provide small examples that clearly demonstrate the behavior

#

you force people to guess and interrogate you

median fulcrum Oct 21, 2021, 1:22 PM

#

but is clear, the second print is not a True and False database

desert oar Oct 21, 2021, 1:22 PM

#

it's not clear, as shown by multiple people trying to help and giving up

#

also a "true and false database" is not the same as "all rows where a certain value is present in any column"

median fulcrum Oct 21, 2021, 1:23 PM

#

desert oar also a "true and false database" is not the same as "all rows where a certain va...

but I use the same strategy in the two

#

🤔

#

df[df == something]

desert oar Oct 21, 2021, 1:23 PM

#

it doesn't matter, if you don't show people what your data is, there's no way anyone can help

median fulcrum Oct 21, 2021, 1:23 PM

#

men

#

I sent many prints

#

I am really not sure how you could messed up with that

desert oar Oct 21, 2021, 1:24 PM

#

screenshots are not enough

median fulcrum Oct 21, 2021, 1:24 PM

#

ok sorry

desert oar Oct 21, 2021, 1:24 PM

#

your life expectancy data looks like this?

data = pd.DataFrame([
  ['A', 1, 2, 3],
  ['B', 4, 5, 6],
  ['C', 2, 5, 3],
  ['D', 3, 9, 4],
], columns=['entity', '2007', '2008', '2009'])

median fulcrum Oct 21, 2021, 1:24 PM

#

desert oar Oct 21, 2021, 1:25 PM

#

ok. and can you explain what output you do want?

#

do not show the banana example, it's not helpful

upper granite Oct 21, 2021, 1:25 PM

#

Hello guys.
Do you know how can I parse this string "20200915" to date with pyspark?

median fulcrum Oct 21, 2021, 1:26 PM

#

yes, I want the cells or lines as the banana example that contains the minimum value of the database that is 18..

#

18.907

#

not a True and false dataframe

#

understand?

#

and I'm not sure if this .min().min() was the best way to do, so if you have any suggestion to this too you can tell me 🙂

desert oar Oct 21, 2021, 2:54 PM

#

upper granite Hello guys. Do you know how can I parse this string "20200915" to date with pysp...

https://stackoverflow.com/a/41273036/2954547

Stack Overflow

Convert pyspark string to date format

I have a date pyspark dataframe with a string column in the format of MM-dd-yyyy and I am attempting to convert this into a date column.

I tried:

df.select(to_date(df.STRING_COLUMN).alias('new...

desert oar Oct 21, 2021, 2:55 PM

#

median fulcrum yes, I want the cells or lines as the banana example that contains the minimum v...

df.min().min() is better than trying to filter

#

however you need to exclude the non-numeric columns

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

min_all = data.loc[:, '2007':'2009'].min().min()
print(min_all)
```

arctic wedgeBOT Oct 21, 2021, 2:58 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

desert oar Oct 21, 2021, 2:59 PM

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

meta_columns = ['entity', 'code']
value_columns = [c for c in data.columns if c not in meta_columns]

min_all = data[value_columns].min().min()
print(min_all)
```

arctic wedgeBOT Oct 21, 2021, 2:59 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

median fulcrum Oct 21, 2021, 3:00 PM

#

desert oar !e ```python import pandas as pd data = pd.DataFrame([ ['A', 'asdf', 1, 2, 3]...

it's what I do

#

#

look

desert oar Oct 21, 2021, 3:00 PM

#

yes, that's fine

median fulcrum Oct 21, 2021, 3:01 PM

#

desert oar yes, that's fine

and now how do i find the row?

desert oar Oct 21, 2021, 3:01 PM

#

why 1:70? it looks like you want all columns

#

column numbers in python start at 0

median fulcrum Oct 21, 2021, 3:01 PM

#

desert oar why `1:70`? it looks like you want _all_ columns

but that is all the columns lol

desert oar Oct 21, 2021, 3:01 PM

#

it's not

#

you don't need to use .iloc at all if you want all columns

#

you can just write years_database.min().min()

lapis sequoia Oct 21, 2021, 3:02 PM

#

does anyone have an idea how to make this perform better?

def find_non_empty(file):
    try:
        frame = pd.read_csv(file, sep=";")
        for name in frame.columns:
            if not name in listA:
                listA.append(name)
            if not name in listb:
                res = True
                i = 0
                for val in frame[name].isna():
                    i +=1
                    if val is False:
                        res = False

                        
                if res is False:
                    listb.append(name)
                        
    except UnicodeDecodeError:
        print("ooops")
        print(file)
    except:
        print ("whoops")

desert oar Oct 21, 2021, 3:02 PM

#

so you want the row that contains the minimum year @median fulcrum ?

median fulcrum Oct 21, 2021, 3:02 PM

#

desert oar so you want the _row_ that contains the minimum year <@!758034911641862304> ?

yeah

#

with the country etc

lapis sequoia Oct 21, 2021, 3:02 PM

#

i have been scanning files for 80 minutes

desert oar Oct 21, 2021, 3:02 PM

#

@lapis sequoia are you just trying to look at the column names and ignore the data?

#

the bare except: is a really bad idea... at least do except Exception as e: print(e) so you can see what the error was

lapis sequoia Oct 21, 2021, 3:03 PM

#

i want to know if certain columns are empty in all of my 150000 csv files

median fulcrum Oct 21, 2021, 3:03 PM

#

lapis sequoia i want to know if certain columns are empty in all of my 150000 csv files

df.isnull()

desert oar Oct 21, 2021, 3:04 PM

#

i also am in the habit of putting the try around the smallest possible area of my code, in this case just the pd.read_csv should be inside try
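A minimal sketch of that habit, assuming a hypothetical helper named `read_frame` and an in-memory stand-in for a real file. Only the `pd.read_csv` call sits inside the `try`, and the caught exception is printed rather than swallowed:

```python
import io
import pandas as pd

def read_frame(source):
    """Return a DataFrame, or None if the source cannot be read (hypothetical helper)."""
    try:
        # Only the call that is expected to fail lives inside the try block.
        return pd.read_csv(source, sep=";")
    except (UnicodeDecodeError, pd.errors.ParserError) as e:
        # Show which source failed and why, instead of a bare "whoops".
        print(f"could not read {source!r}: {e}")
        return None

# In-memory stand-in for one of the CSV files (made-up content).
frame = read_frame(io.StringIO("a;b\n1;\n2;\n"))
print(frame.columns.tolist())
```

Any later bug in the column-checking code then raises normally instead of being hidden by the handler.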

lapis sequoia Oct 21, 2021, 3:04 PM

#

not the whole frame, just some of its columns

#

that makes sense

#

i will adapt that

median fulcrum Oct 21, 2021, 3:04 PM

#

lapis sequoia not the whole frame, just some of its columns

df['column'].isnull()

desert oar Oct 21, 2021, 3:04 PM

#

and yes, looping over the entire df is a bad idea. use frame.isnull().any() or something like that

median fulcrum Oct 21, 2021, 3:04 PM

#

yeah

lapis sequoia Oct 21, 2021, 3:04 PM

#

i assumed

#

:D

#

yes, ill change that. Thanks, i wasnt sure .isnull() on the column would work

desert oar Oct 21, 2021, 3:05 PM

#

frame.isnull() returns a dataframe of booleans, frame.isnull().any() returns a series with 1 element for each column, True if any are null, False otherwise

#

>>> data.isnull().any()
entity    False
code      False
2007      False
2008      False
2009      False
dtype: bool
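A rough sketch of how the per-column check could replace the nested loops, assuming the goal is "which columns never contain a value in any file". The two in-memory files and the names `seen_columns`, `non_empty_columns` and `always_empty` are made up for illustration:

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the real CSVs (made-up data).
files = [
    io.StringIO("a;b;c\n1;;\n2;;\n"),
    io.StringIO("a;b;c\n;;x\n3;;\n"),
]

seen_columns = set()       # every column name encountered so far
non_empty_columns = set()  # columns with at least one non-null value in some file

for f in files:
    frame = pd.read_csv(f, sep=";")
    seen_columns.update(frame.columns)
    has_data = frame.notna().any()  # one boolean per column, no Python-level row loop
    non_empty_columns.update(has_data[has_data].index)

# Columns that were empty in every file that had them.
always_empty = sorted(seen_columns - non_empty_columns)
print(always_empty)
```

`frame.notna().any()` is the mirror image of `frame.isnull().all()`; either works, depending on whether it is easier to track the non-empty or the empty columns.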

lapis sequoia Oct 21, 2021, 3:06 PM

#

that makes sense

#

i will do that :>

#

thank you for explaining

median fulcrum Oct 21, 2021, 3:07 PM

#

🙂

desert oar Oct 21, 2021, 3:07 PM

#

@median fulcrum

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

meta_columns = ['entity', 'code']

idx_min = (
data
.drop(columns=meta_columns)
.min(axis='columns')
.idxmin()
)

print(data.loc[idx_min])
```

arctic wedgeBOT Oct 21, 2021, 3:08 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | entity       A
002 | code      asdf
003 | 2007         1
004 | 2008         2
005 | 2009         3
006 | Name: 0, dtype: object

desert oar Oct 21, 2021, 3:08 PM

#

that of course returns the row as a series

#

use data.loc[[idx_min]] to get a 1-row dataframe
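A small sketch of the difference, using a shortened copy of the example frame. A scalar label gives a Series, a list of labels gives a DataFrame:

```python
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
], columns=['entity', 'code', '2007', '2008', '2009'])

row_as_series = data.loc[0]    # scalar label -> Series indexed by column name
row_as_frame = data.loc[[0]]   # list of labels -> DataFrame with a single row

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```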

median fulcrum Oct 21, 2021, 3:08 PM

#

desert oar !e ```python import pandas as pd data = pd.DataFrame([ ['A', 'asdf', 1, 2, 3]...

since there are columns for 1950-2019, what can I put in columns?

desert oar Oct 21, 2021, 3:09 PM

#

be more specific

#

what are you asking?

median fulcrum Oct 21, 2021, 3:09 PM

#

your dataframe have 5 columns, and you are trying to find the minimum of two, mine has 70 columns, what can I put in columns=?

#

understand?

#

that's the problem :/

median fulcrum Oct 21, 2021, 3:17 PM

#

median fulcrum your dataframe have 5 columns, and you are trying to find the minimum of two, mi...

@desert oar

desert oar Oct 21, 2021, 3:17 PM

#

median fulcrum your dataframe have 5 columns, and you are trying to find the minimum of two, mi...

columns= where? in drop, or in DataFrame?

#

note also that 1:70 selects the 2nd through 70th columns (positions 1 to 69). numbering starts at 0 and ranges exclude the upper bound. with 70 columns you would have to write 0:70 to get all of them, which is equivalent to 0:, which is equivalent to :
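A quick sketch of how positional slices behave, assuming a made-up one-row frame with 70 columns labelled 1950 to 2019 (the real data is only visible in screenshots):

```python
import pandas as pd

# Hypothetical frame: one row, 70 columns named '1950' ... '2019'.
year_labels = [str(y) for y in range(1950, 2020)]
years = pd.DataFrame([list(range(70))], columns=year_labels)

subset = years.iloc[:, 1:70]   # positions 1..69, so the first column is skipped
print(subset.shape[1], subset.columns[0], subset.columns[-1])

everything = years.iloc[:, :]  # a bare ":" keeps every column
print(everything.shape[1])
```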

median fulcrum Oct 21, 2021, 3:18 PM

#

desert oar `columns=` where? in `drop`, or in `DataFrame`?

in drop

#

mine database have 70 columns

desert oar Oct 21, 2021, 3:18 PM

#

median fulcrum in drop

drop removes columns. so put the column names you want to remove. if there are no columns that you want to remove, don't use drop.

#

!d pandas.DataFrame.drop

arctic wedgeBOT Oct 21, 2021, 3:19 PM

#

pandas.DataFrame.drop


```py
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
```
Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide <advanced.shown\_levels> for more information about the now unused levels.

median fulcrum Oct 21, 2021, 3:19 PM

#

still the same as the other one

#

:/

#

just as an index

desert oar Oct 21, 2021, 3:20 PM

#

do you have missing values (NaN) in the data?

median fulcrum Oct 21, 2021, 3:20 PM

#

desert oar do you have missing values (`NaN`) in the data?

no

desert oar Oct 21, 2021, 3:21 PM

#

did you remove Entity and Code like you were supposed to?

#

look at my example

#

i literally use the same column names...

median fulcrum Oct 21, 2021, 3:21 PM

#

desert oar did you remove `Entity` and `Code` like you were supposed to?

ye men

desert oar Oct 21, 2021, 3:21 PM

#

no, you didn't

#

i can see in your code that you didn't

median fulcrum Oct 21, 2021, 3:21 PM

#

the database is like that

desert oar Oct 21, 2021, 3:22 PM

#

result has columns Entity and Code which, according to your screenshots, do not contain numerical data

median fulcrum Oct 21, 2021, 3:22 PM

#

Why do I need to remove them?

#

#

#

bruh

desert oar Oct 21, 2021, 3:22 PM

#

well then your result is messed up

#

i don't know what you expected

#

stop and think for a second

#

you are showing me 2 dataframes

median fulcrum Oct 21, 2021, 3:23 PM

#

oh

desert oar Oct 21, 2021, 3:23 PM

#

in screenshots, no less

median fulcrum Oct 21, 2021, 3:23 PM

#

I have to use

desert oar Oct 21, 2021, 3:23 PM

#

please use code blocks

median fulcrum Oct 21, 2021, 3:23 PM

#

the year

#

lol

desert oar Oct 21, 2021, 3:23 PM

#

i don't know what that means either

median fulcrum Oct 21, 2021, 3:24 PM

#

still......

desert oar Oct 21, 2021, 3:24 PM

#

that looks right to me

#

what's the problem?

median fulcrum Oct 21, 2021, 3:24 PM

#

it's not showing the minimum row

desert oar Oct 21, 2021, 3:24 PM

#

idx_min is the index of the row that contains the minimum value

median fulcrum Oct 21, 2021, 3:24 PM

#

It's showing the row of the minimum row?

#

line or cell?

#

no

desert oar Oct 21, 2021, 3:25 PM

#

let's fix up the terminology here

#

row is the data in each line

median fulcrum Oct 21, 2021, 3:25 PM

#

I want to show WHERE the minimum are located

#

WHERE

desert oar Oct 21, 2021, 3:25 PM

#

idx_min is the index of the row that contains the minimum value

#

the index is the label of the row

#

in this case, your dataframe has the default index. so the label is the row number

median fulcrum Oct 21, 2021, 3:26 PM

#

desert oar > `idx_min` is the index of the row that contains the minimum value

ok, but I don't want that, I already know, I want to find where the minimum are

desert oar Oct 21, 2021, 3:26 PM

#

show me an example using this dataframe:

data = pd.DataFrame([
  ['A', 'asdf', 1, 2, 3],
  ['B', 'zxcv', 4, 5, 6],
  ['C', 'qwer', 2, 5, 3],
  ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

#

show me the output you want

median fulcrum Oct 21, 2021, 3:27 PM

#

print like that

#

that has the minimum in that

#

or just the exclusive year that has

#

don't need to be everything

#

just know the country and the year of it

desert oar Oct 21, 2021, 3:28 PM

#

i told you, data.loc[idx_min] is a Series of the data in the row that contains the minimum. if you want a 1-row DataFrame of the data in that row, use data.loc[[idx_min]].

median fulcrum Oct 21, 2021, 3:28 PM

#

the minimum value I already have

desert oar Oct 21, 2021, 3:28 PM

#

median fulcrum just know the country and the year of it

if you said this 30 minutes ago, we'd be done by now

#

also, you need to slow down and read what other people are writing. i used idxmin, which is not the same as min

median fulcrum Oct 21, 2021, 3:29 PM

#

omg

#

thanks

#

😫

#

would it be possible to show just the year where the minimum value was found?

desert oar Oct 21, 2021, 3:32 PM

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

data_num = data.drop(columns=['entity', 'code'])

minval = data_num.min().min()
minval_row = data_num.min(axis='columns').idxmin()
minval_col = data_num.min().idxmin()

print(minval, minval_row, minval_col)
```

arctic wedgeBOT Oct 21, 2021, 3:32 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

1 0 2007
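One caveat with computing the row and the column separately: if the minimum value occurs in more than one cell, the two `idxmin` calls may point at different cells. A hedged alternative sketch that locates a single cell in one step with NumPy (the names `row_label` and `col_label` are illustrative only):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

data_num = data.drop(columns=['entity', 'code'])
values = data_num.to_numpy()

# argmin works on the flattened array; unravel_index turns it back into (row, column) positions.
row_pos, col_pos = np.unravel_index(values.argmin(), values.shape)

row_label = data_num.index[row_pos]    # row label of the minimum cell
col_label = data_num.columns[col_pos]  # column label (the year) of the same cell

print(values[row_pos, col_pos], row_label, col_label)
```

With ties, `argmin` returns the first minimum in row-major order, so the row and column always describe the same cell.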

median fulcrum Oct 21, 2021, 3:35 PM

#

desert oar !e ```python import pandas as pd data = pd.DataFrame([ ['A', 'asdf', 1, 2, 3]...

and the dataframe of the column with the country etc would be?

desert oar Oct 21, 2021, 3:36 PM

#

be more specific

median fulcrum Oct 21, 2021, 3:37 PM

#

I would like to print a dataframe with:

Country year column

Cambodia 1977

#

can have also the code whatever

desert oar Oct 21, 2021, 3:43 PM

#

please start using code blocks

#

it's not fair to force people to read screenshots and re-type long variable names

#

did you create years_database from lifexpectancy_database?

#

if so, show how you did it

#

!code

arctic wedgeBOT Oct 21, 2021, 3:44 PM

#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar Oct 21, 2021, 3:44 PM

#

do not post a screenshot

wicked grove Oct 21, 2021, 3:45 PM

#

desert oar ```python tokenizer = RegexpTokenizer(r'\w+') dataset['tokens'] = dataset['text'...

hello i tried this and i got this as the output

#

text target tokens
800000 love healthuandpets u guys r best 1 [love, healthuandpets, u, guys, r, best]
800001 im meeting one besties tonight cant wait girl... 1 [im, meeting, one, besties, tonight, cant, wai...
800002 darealsunisakim thanks twitter add sunisa got ... 1 [darealsunisakim, thanks, twitter, add, sunisa...
800003 sick really cheap hurts much eat real food plu... 1 [sick, really, cheap, hurts, much, eat, real, ...

desert oar Oct 21, 2021, 3:45 PM

#

@wicked grove yes, every element of the tokens column is a list of strings

wicked grove Oct 21, 2021, 3:46 PM

#

but the strings are not within ' '

desert oar Oct 21, 2021, 3:46 PM

#

that's just how pandas displays it
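A tiny sketch to confirm this, with made-up tokens. The pandas display drops the quotes, but each element is still a list of `str`:

```python
import pandas as pd

tokens = pd.Series([['love', 'u', 'guys'], ['im', 'meeting']])

print(tokens)           # pandas prints the lists without quotes around the strings

first = tokens.iloc[0]
print(repr(first))      # the plain Python repr shows the quotes again
print(all(isinstance(t, str) for t in first))
```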

wicked grove Oct 21, 2021, 3:46 PM

#

ohh alright!!

desert oar Oct 21, 2021, 3:46 PM

#

you can put a list of strings into the model if you use TfidfVectorizer(analyzer=lambda x: x)
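A minimal sketch of that suggestion, assuming scikit-learn is installed and using a few made-up token lists. Passing a callable as `analyzer` tells the vectorizer to use the tokens as they are, with no further splitting or lowercasing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-tokenized documents.
token_lists = [
    ['love', 'u', 'guys'],
    ['im', 'meeting', 'one', 'besties'],
    ['love', 'twitter'],
]

# The identity analyzer returns each token list unchanged.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
matrix = vectorizer.fit_transform(token_lists)

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

One row per document, one column per distinct token. In a DataFrame this would be called on the tokens column, e.g. `dataset['tokens']`.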

wicked grove Oct 21, 2021, 3:47 PM

#

desert oar that's just how pandas displays it

i had another question,when i used stemming on the tokens column i got this

wicked grove Oct 21, 2021, 3:47 PM

#

desert oar you can put a list of strings into the model if you use `TfidfVectorizer(analyze...

okayy,thank you

desert oar Oct 21, 2021, 3:48 PM

#

what is that?

wicked grove Oct 21, 2021, 3:49 PM

#

can i send a picture of the output? i am unable to copy paste it correctly

desert oar Oct 21, 2021, 3:50 PM

#

you should use a code block for posting data

#

paste output here

wicked grove Oct 21, 2021, 3:53 PM

#

`

#

                                                    text  target tokens
800000                  love healthuandpets u guys r best       1      t
800001  im meeting one besties tonight cant wait  girl...       1      k
800002  darealsunisakim thanks twitter add sunisa got ...       1      t
800003  sick really cheap hurts much eat real food plu...       1      p
800004                      lovesbrooklyn effect everyone       1      e

arctic wedgeBOT Oct 21, 2021, 3:56 PM

#

:incoming_envelope: :ok_hand: applied mute to @lone nacelle until <t:1634832380:f> (9 minutes and 58 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

sick wedge Oct 21, 2021, 3:59 PM

#

Hi got a really simple machine learning / pandas question but really struggling with this:
I just want to write a function that converts the 'source_type' column in my dataframe to a number depending on what string 'source_type' is

for example:

if item == 'AGB':
  item = 0
else:
  item = 1

#

I understand I should use df.apply() but I'm not sure how to write the function to make it work

#

please ping me!

desert oar Oct 21, 2021, 4:03 PM

#

wicked grove ``` text target tokens 8000...

how did you end up with this?

desert oar Oct 21, 2021, 4:05 PM

#

sick wedge Hi got a really simple machine learning / pandas question but really struggling ...

there are a few different ways to do this.

one way:

data['source_num'] = 0
data.loc[data['source_type'] == 'AGB', 'source_num'] = 1

another way:

def convert_source_type(source_type):
    return 1 if source_type == 'AGB' else 0

data['source_num'] = data['source_type'].apply(convert_source_type)

#

you could of course use lambda instead of def

data['source_num'] = data['source_type'] \
    .apply(lambda source_type: 1 if source_type == 'AGB' else 0)
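A further vectorized sketch, with made-up `source_type` values. Note that the snippets above encode 'AGB' as 1 and everything else as 0, whereas the question asked for the opposite, so both directions are shown:

```python
import pandas as pd

# Made-up values for illustration.
data = pd.DataFrame({'source_type': ['AGB', 'XYZ', 'AGB', 'QRS']})

# Same encoding as the snippets above: 'AGB' -> 1, everything else -> 0.
data['source_num'] = (data['source_type'] == 'AGB').astype(int)

# Encoding from the original question: 'AGB' -> 0, everything else -> 1.
data['source_num_flipped'] = (data['source_type'] != 'AGB').astype(int)

print(data)
```

The comparison produces a boolean Series, and `astype(int)` turns `True`/`False` into 1/0 without calling a Python function per row.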

sick wedge Oct 21, 2021, 4:06 PM

#

awesome thanks 👍

#

I'll go with 1st one, doesn't need to be elegant

desert oar Oct 21, 2021, 4:07 PM

#

you could also use map which is kind of a neat trick:

from collections import defaultdict

source_type_mapping = defaultdict(int, {'AGB': 1})
data['source_num'] = data['source_type'].map(source_type_mapping)

homework: figure out how this works

sick wedge Oct 21, 2021, 4:07 PM

#

how map works?

desert oar Oct 21, 2021, 4:07 PM

#

all of it 😉

#

(hint: collections is part of the python standard library, you'll have to read about defaultdict there)
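For anyone checking their homework, a hedged sketch of the two pieces involved (sample values are made up). A `defaultdict(int)` returns `int()`, i.e. 0, for missing keys, and `Series.map` uses that default instead of NaN when the mapping defines `__missing__`:

```python
from collections import defaultdict

import pandas as pd

mapping = defaultdict(int, {'AGB': 1})

print(mapping['AGB'])            # key is present
print(mapping['anything else'])  # missing key falls back to int()

# Made-up values for illustration.
data = pd.DataFrame({'source_type': ['AGB', 'XYZ', 'AGB']})
data['source_num'] = data['source_type'].map(mapping)
print(data)
```

With a plain `dict`, the unmatched 'XYZ' row would become NaN instead.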

sick wedge Oct 21, 2021, 4:08 PM

#

okay will do 😄

desert oar Oct 21, 2021, 4:08 PM

#

wicked grove ``` text target tokens 8000...

how did you end up with this?

median fulcrum Oct 21, 2021, 4:09 PM

#

desert oar did you create `years_database` from `lifexpectancy_database`?

yes

#

years_database = lifexpectancy_database.iloc[:, 2:72]

#

with just the years

formal perch Oct 21, 2021, 4:10 PM

#

Hey guys quick question pytorch or tensorflow?

serene scaffold Oct 21, 2021, 4:10 PM

#

ah yes, the eternal question

desert oar Oct 21, 2021, 4:11 PM

#

median fulcrum `years_database = lifexpectancy_database.iloc[:, 2:72]`

in that case yes, you can use minval_row to get the row from lifexpectancy_database. both dataframes will have the same index values

median fulcrum Oct 21, 2021, 4:12 PM

#

desert oar in that case yes, you can use `minval_row` to get the row from `lifexpectancy_da...

lifexpectancy_database.minval_row()?

desert oar Oct 21, 2021, 4:12 PM

#

median fulcrum lifexpectancy_database.minval_row()?

no, why would you expect that to work?

median fulcrum Oct 21, 2021, 4:13 PM

#

desert oar no, why would you expect that to work?

🤔

desert oar Oct 21, 2021, 4:13 PM

#

years_database = lifexpectancy_database.iloc[:, 2:72]

minval = years_database.min().min()
minval_row = years_database.min(axis='columns').idxmin()
minval_col = years_database.min().idxmin()

print(minval, minval_row, minval_col)

#

i am using the variable names from my previous example

median fulcrum Oct 21, 2021, 4:14 PM

#

desert oar ```python years_database = lifexpectancy_database.iloc[:, 2:72] minval = years_...

oh

desert oar Oct 21, 2021, 4:18 PM

#

that was my fault, it wasn't clear

#

do you understand now? you can do lifexpectancy_database.at[minval_row, 'Entity'] for example, to get "Cambodia"

#

and minval_col will be the year

median fulcrum Oct 21, 2021, 4:19 PM

#

desert oar ```python years_database = lifexpectancy_database.iloc[:, 2:72] minval = years_...

but minval_row is not:

Country year column

Cambodia 1977

#

or what I'm missing?

desert oar Oct 21, 2021, 4:20 PM

#

read my message above

median fulcrum Oct 21, 2021, 4:21 PM

#

desert oar do you understand now? you can do `lifexpectancy_database.at[minval_row, 'Entity']` for ...

so I would have to mix minval_row and minval_col to get what I'm thinking

frigid elk Oct 21, 2021, 4:21 PM

#

how do i create the code block in chat?

median fulcrum Oct 21, 2021, 4:21 PM

#

!code

arctic wedgeBOT Oct 21, 2021, 4:21 PM

#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar Oct 21, 2021, 4:21 PM

#

frigid elk how do i create the code block in chat?

!code see below:

frigid elk Oct 21, 2021, 4:21 PM

#

thanks

desert oar Oct 21, 2021, 4:21 PM

#

median fulcrum so I would have to mix minval_row and minval_col to get what I'm thinking

kind of, yes. minval_col will contain the column label (in this case, the year), and minval_row will contain the row label (in this case, the row number).
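A sketch of how the two labels could be combined into the one-row table that was asked for, using the small example frame instead of the real data. The `result` frame and its column names are illustrative only:

```python
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

data_num = data.drop(columns=['entity', 'code'])

minval_row = data_num.min(axis='columns').idxmin()  # row label of the minimum
minval_col = data_num.min().idxmin()                # column label (the year)

# Look up the entity in the full frame; both frames share the same index.
result = pd.DataFrame({
    'entity': [data.at[minval_row, 'entity']],
    'year': [minval_col],
    'value': [data.at[minval_row, minval_col]],
})
print(result)
```

With the real data, `data` would be `lifexpectancy_database`, `data_num` would be `years_database`, and the lookup column would be `'Entity'`.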

frigid elk Oct 21, 2021, 4:23 PM

#

can somebody tell me, in the code below. .. is order of the original index guaranteed to be preserved.

pd.DataFrame({n: df.T[col].nlargest(5).index.tolist() for n, col in enumerate(df.T)}).T

median fulcrum Oct 21, 2021, 4:24 PM

#

oh

#

I think it's good idk

proven plinth Oct 21, 2021, 4:46 PM

#

alright, so i have 2 excel sheets that i load into dataframes, the objective is to match rows from one dataframe onto another
currently im computing columns in both dataframes that match so i can join them together
am i correct in thinking that pandas.merge(df1, df2, on=cols, how="left") will return all of df1 and also any rows from df2 that match what im merging on cols?

desert oar Oct 21, 2021, 4:47 PM

#

median fulcrum oh

don't "chain" [ and .loc. use lifexpectancy_database.loc[[idx_min], [minval_col]]

median fulcrum Oct 21, 2021, 4:49 PM

#

desert oar don't "chain" `[` and `.loc`. use `lifexpectancy_database.loc[[idx_min], [minval...

it's the same

desert oar Oct 21, 2021, 4:51 PM

#

frigid elk can somebody tell me, in the code below. .. is order of the original index guara...

which index order? recent python versions do preserve dict insertion order

frigid elk Oct 21, 2021, 4:52 PM

#

desert oar which index order? recent python versions do preserve dict insertion order

the index from the original dataframe, ... these are top shapley values from my model. ... i'm wanting to concat these (columnwise) to the prediction output. .... just want to ensure the new columns line up with the intended prediction

desert oar Oct 21, 2021, 4:54 PM

#

proven plinth alright, so i have 2 excel sheets that i load into dataframes, the objective is ...

more or less. you can think of an inner join as taking every pair of rows and matching them. a left join is an inner join + the unmatched rows from the left side. so you could end up with more than one match for a given row in df1.

proven plinth Oct 21, 2021, 4:56 PM

#

alright, the objective is to use columns in df1 to match rows on df2, im not sure what kind of join i should use for something like this tbh, i guess inner? i want the set of rows from df1 that can be identified on df2

wicked grove Oct 21, 2021, 4:56 PM

#

desert oar how did you end up with this?

I applied stemming

desert oar Oct 21, 2021, 4:56 PM

#

frigid elk can somebody tell me, in the code below. .. is order of the original index guara...

!e does this also do what you want? ```python
import pandas as pd

data = pd.DataFrame([
[1,2,3],
[4,5,6],
[7,8,9]
])
data.index = ['a', 'b', 'c']

print( data.apply(lambda y: y.nlargest(2).index) )
```

arctic wedgeBOT Oct 21, 2021, 4:56 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |    0  1  2
002 | 0  c  c  c
003 | 1  b  b  b

wicked grove Oct 21, 2021, 4:57 PM

#

wicked grove I applied stemming

Ill show you my code

desert oar Oct 21, 2021, 4:57 PM

#

wicked grove Ill show you my code

yes, that would help

desert oar Oct 21, 2021, 4:58 PM

#

proven plinth alright, the objective is to use columns in df1 to match rows on df2, im not sur...

it depends on what you want to do with the data afterwards, but i think a left join makes sense. you can de-duplicate afterwards if you end up with multiple matches

#

(you can get multiple matches from an inner join too! think about it)
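A small sketch with made-up frames showing both points: a left join keeps unmatched rows from `df1`, and either join can return more than one row per key when `df2` has duplicate keys. The `indicator=True` flag adds a `_merge` column saying where each row came from:

```python
import pandas as pd

# Made-up data: key 'a' appears twice in df2, key 'c' has no match at all.
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'], 'right_val': [10, 11, 20]})

left = pd.merge(df1, df2, on='key', how='left', indicator=True)
inner = pd.merge(df1, df2, on='key', how='inner')

print(left)   # 'a' twice, 'b' once, 'c' once with a missing right_val
print(inner)  # 'a' twice, 'b' once, no 'c'
```

If one row per key is needed afterwards, `drop_duplicates(subset='key')` is one way to de-duplicate, at the cost of choosing which match to keep.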

wicked grove Oct 21, 2021, 4:59 PM

#

desert oar yes, that would help

from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
#tokenizer=word_tokenize()
#tokenizer=RegexpTokenizer(r' \w+ ')
#dataset['tokens']=dataset['text'].apply(tokenizer.tokenize)
dataset['tokens']=dataset['text'].apply(word_tokenize)
print(dataset.head())

st=nltk.PorterStemmer()
def stemming_on_text(data):
    text=[]
    for word in data:
        text=st.stem(word)
    return text
dataset['tokens']=dataset['text'].apply(stemming_on_text)
print(dataset.head())

print("lemmatized")
lm=nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
     text=[lm.lemmatize(word) for word in data]
     return text
dataset['text']=dataset['text'].apply(lemmatizer_on_text)
print(dataset['text'].tail())

proven plinth Oct 21, 2021, 4:59 PM

#

goddamn joins are doing my head in

#

i shall be back shortly

#

thanks @desert oar

wicked grove Oct 21, 2021, 4:59 PM

#

i removed regexptokenizer as it gave a weird output

#

and when i apply stem and lemmatizer it gives only individual characters

desert oar Oct 21, 2021, 5:02 PM

#

proven plinth goddamn joins are doing my head in

it helps to think of the inner join as a filter applied to the "cartesian product" aka the "cross join"

# cartesian product
result = []
for row_x in data_x:
    for row_y in data_y:
        result.append(row_x + row_y)

# inner join
result = []
for row_x in data_x:
    for row_y in data_y:
        if match(row_x, row_y):
            result.append(row_x + row_y)

then left, right, and outer joins are easy: you just append the rows that were not matched, filling in missing data with NULL

#

in fact, in most databases SELECT * FROM a, b WHERE <condition> is identical to SELECT * FROM a INNER JOIN b ON <condition>
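The loops above can be made runnable with a few assumptions: rows are dicts, `match` compares a hypothetical `'key'` field, and the sample rows are made up. The left join is the inner join plus the unmatched left rows, padded with `None`:

```python
# Made-up rows; 'a' matches twice, 'c' matches nothing.
data_x = [{'key': 'a', 'x': 1}, {'key': 'b', 'x': 2}, {'key': 'c', 'x': 3}]
data_y = [{'key': 'a', 'y': 10}, {'key': 'a', 'y': 11}, {'key': 'b', 'y': 20}]

def match(row_x, row_y):
    # Join condition: equal keys.
    return row_x['key'] == row_y['key']

def inner_join(data_x, data_y):
    # Filter applied to the cartesian product.
    return [{**row_x, **row_y}
            for row_x in data_x
            for row_y in data_y
            if match(row_x, row_y)]

def left_join(data_x, data_y):
    result = []
    for row_x in data_x:
        matches = [{**row_x, **row_y} for row_y in data_y if match(row_x, row_y)]
        if not matches:
            # No partner on the right side: keep the left row, fill with None.
            matches = [{**row_x, 'y': None}]
        result.extend(matches)
    return result

inner_rows = inner_join(data_x, data_y)
left_rows = left_join(data_x, data_y)
print(len(inner_rows), len(left_rows))
```

Real databases and pandas use hashing or sorting instead of nested loops, but the result is defined the same way.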

proven plinth Oct 21, 2021, 5:05 PM

#

makes sense, i think im not computing common columns properly, i should be getting more rows with a left join
i'll revisit when i get home, time for a pint

desert oar Oct 21, 2021, 5:07 PM

#

wicked grove ```py from nltk.tokenize import RegexpTokenizer from nltk.tokenize import word_t...

did you notice that you were lemmatizing and stemming the 'text' column and not the 'tokens' column? hint: what happens when you iterate over a string, instead of a list?

wicked grove Oct 21, 2021, 5:23 PM

#

desert oar did you notice that you were lemmatizing and stemming the `'text'` column and no...

okayy yes!!thank you so much. i also hadn't included .append...just to clarify, in this line of code dataset['tokens'].apply(stemming_on_text) we are applying the function on the entire column of the dataframe and word iterates through every row??

desert oar Oct 21, 2021, 5:34 PM

#

wicked grove okayy yes!!thank you so much. i also hadn't included .append...just to clarify, ...

dataset['tokens'].apply(stemming_on_text) loops through elements of dataset['tokens'] and calls stemming_on_text on each element

#

it's like map() or a list comprehension, but for pandas

#

note that you can rewrite stemming_on_text as:

def stemming_on_text(data):
    return [st.stem(word) for word in data]
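A sketch of why the column matters, using a hypothetical `toy_stem` in place of `st.stem` so it runs without NLTK. `apply` hands each cell to the function as a whole; a list cell is iterated word by word, a string cell character by character:

```python
import pandas as pd

def toy_stem(word):
    # Hypothetical stand-in for st.stem(): just strips one trailing "s".
    return word[:-1] if word.endswith('s') else word

def stemming_on_text(data):
    return [toy_stem(word) for word in data]

# Made-up one-row dataset with both a raw string and its token list.
dataset = pd.DataFrame({
    'text': ['cats dogs'],
    'tokens': [['cats', 'dogs']],
})

from_tokens = dataset['tokens'].apply(stemming_on_text)[0]
from_text = dataset['text'].apply(stemming_on_text)[0]

print(from_tokens)  # one stem per word
print(from_text)    # one entry per character of the string
```

This is the same symptom as the single-character output earlier in the conversation.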

wicked grove Oct 21, 2021, 5:37 PM

#

desert oar `dataset['tokens'].apply(stemming_on_text)` loops through elements of `dataset['...

And when the element is a list, stemming_on_text is applied to each string in the list?

wicked grove Oct 21, 2021, 5:38 PM

#

desert oar it's like `map()` or a list comprehension, but for pandas

Oh alright,i will keep that in mind!! Thanks a lot:))
I have just begun using list comprehension so i did not use it here

desert oar Oct 21, 2021, 5:44 PM

#

wicked grove And when the element is a list, stemming_on_text is applied to each string in the...

no, it's applied to the entire list

wicked grove Oct 21, 2021, 5:45 PM

#

Oh okayy

frigid elk Oct 21, 2021, 6:01 PM

#

desert oar !e does this also do what you want? ```python import pandas as pd data = pd.Dat...

!e thanks, but i think we were thinking of two different result sets...

import pandas as pd
data = pd.DataFrame([
  [1,2,3],
  [6,5,4],
  [7,9,8]
], columns = ['a','b','c'])

print("original")
print(data)
print('\ntop by column')
dataT = pd.DataFrame({n: data.T[col].nlargest(3).index.tolist() for n, col in enumerate(data.T)}).T
dataT.columns = ["top_1_column", "top_2_column", "top_3_column"]
print(dataT)
print('\nlambda:')
data.index = ['a','b','c']
print(data.apply(lambda y: y.nlargest(3).index))

btw: it does appear to keep original sort, thanks for the sounding board

arctic wedgeBOT Oct 21, 2021, 6:02 PM

#

@frigid elk :white_check_mark: Your eval job has completed with return code 0.

001 | original
002 |    a  b  c
003 | 0  1  2  3
004 | 1  6  5  4
005 | 2  7  9  8
006 | 
007 | top by column
008 |   top_1_column top_2_column top_3_column
009 | 0            c            b            a
010 | 1            a            b            c
011 | 2            b            c            a
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/doqaqozaqi.txt?noredirect

desert oar Oct 21, 2021, 6:02 PM

#

@frigid elk data.apply(lambda y: y.nlargest(2).index, axis=1)?

#

or data.T.apply(lambda y: y.nlargest(2).index)
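A sketch of the row-wise variant that returns a regular DataFrame, assuming the goal is the names of the top columns for each row. `result_type='expand'` spreads each returned list over columns, and the result keeps the original row index, so it can be concatenated column-wise with the predictions:

```python
import pandas as pd

data = pd.DataFrame([
    [1, 2, 3],
    [6, 5, 4],
    [7, 9, 8],
], columns=['a', 'b', 'c'])

# For each row, take the labels of the 2 largest values, largest first.
top = data.apply(
    lambda row: row.nlargest(2).index.tolist(),
    axis=1,
    result_type='expand',
)
top.columns = ['top_1_column', 'top_2_column']

print(top)
print(top.index.equals(data.index))
```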

frigid elk Oct 21, 2021, 6:05 PM

#

desert oar <@!508135091445432341> `data.apply(lambda y: y.nlargest(2).index, axis=1)`?

i'll have to give that a shot in the name of code simplification, it's working now though and i'm on a deadline. ... thanks for the suggestions

wise pelican Oct 21, 2021, 6:18 PM

#

For getting quantile / percentile data points with pandas, is there a specific use case for setting the interpolation to midpoint instead of the default linear method? Such as when doing something like this: pandas.Series(some_data).quantile(0.99, interpolation="midpoint")
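This does not settle when midpoint is the better choice, but a small sketch on made-up data shows what each `interpolation` option returns when the requested quantile falls between two data points:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# The 0.9 quantile falls between the values 3 and 4.
results = {
    method: s.quantile(0.9, interpolation=method)
    for method in ['linear', 'lower', 'higher', 'midpoint', 'nearest']
}

for method, value in results.items():
    print(method, value)
```

`linear` weights the two neighbours by how close the quantile position is to each, while `midpoint` always returns their plain average, so it never depends on the fractional position.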

median fulcrum Oct 21, 2021, 6:24 PM

#

suggestions to better visualize the years in x?

median fulcrum Oct 21, 2021, 6:41 PM

#

🤔

median fulcrum Oct 21, 2021, 7:04 PM

#

#

or something like that...

shut trail Oct 21, 2021, 7:16 PM

#

whats your audience and purpose ?

#

this is great for exploratory notes. i mean the colour isnt necessary but its purdy

median fulcrum Oct 21, 2021, 7:22 PM

#

shut trail whats your audience and purpose ?

My purpose is to plot the Cambodia life expectancy in 1950-2019 and make visible how 1977 was bad to the country

shut trail Oct 21, 2021, 7:22 PM

#

mission accomplished

median fulcrum Oct 21, 2021, 7:22 PM

#

thx

shut trail Oct 21, 2021, 7:23 PM

#

maybe tie clearer colour to the value

median fulcrum Oct 21, 2021, 7:24 PM

#

shut trail maybe tie clearer colour to the value

I'm also making this plot comparing Cambodia and other countries in 1977, if there is a way to write in the outlier(cambodia) I think the mission would be accomplished too

shut trail Oct 21, 2021, 7:25 PM

#

ahhh effective

#

of course youll fix up your labels but the point is clear

median fulcrum Oct 21, 2021, 7:26 PM

#

shut trail of course youll fix up your labels but the point is clear

yeah lol

#

but if I could write that it is cambodia

#

would be nice

shut trail Oct 21, 2021, 7:28 PM

#

remove ticks that have no meaning . but it really depends on your audience how much you want to give a crap

#

add a label for your outlier

#

label the x axis with mortality rate

#

what is the y axis ?

#

nvm reading your code

median fulcrum Oct 21, 2021, 7:30 PM

#

shut trail label the x axis with mortality rate

Instead of the "1977" could be the age like 20 30 etc but I don't know how would I do that

#

#

the database is strange

shut trail Oct 21, 2021, 7:31 PM

#

ya seaborn just jitters it nicely

#

the 1977 should be on the y-tick and the x axis should have rates

median fulcrum Oct 21, 2021, 7:34 PM

#

shut trail the 1977 should be on the y-tick and the x axis should have rates

now it's not so visible that cambodia is very bad, but is better to visualize

#

I'm just wondering if has a possibility in catplot to write in a specific point

shut trail Oct 21, 2021, 7:35 PM

#

okay... rabbit hole

#

for a single visualization are you sure you want to jump in ? are you trying to learn data visualization through python ?

#

but if you want fine control like that on a programmatic level, python is home for sure

#

matplotlib is under Seaborn , there you'll be able to pick it all apart

shut trail Oct 21, 2021, 7:42 PM

#

median fulcrum I'm just wondering if has a possibility in catplot to write in a specific point

plt.scatter

#

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

#

https://stackoverflow.com/questions/48456959/plotting-single-data-point-using-seaborn

Stack Overflow

Plotting single data point using seaborn

I am using seaborn to create a boxplot. But how would I add a line or a single point to show a single data value on the chart. For instance how would i go about plotting the value 3.5 on the below ...
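A rough matplotlib-only sketch of labelling one point, assuming made-up numbers and country labels (only "Cambodia" comes from the conversation). Since seaborn draws on matplotlib axes, the same `annotate` call should also work on the axes of a seaborn plot:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Made-up values for illustration only, not real life-expectancy data.
values = [61.2, 63.5, 18.9, 58.0, 66.1]
labels = ['A', 'B', 'Cambodia', 'D', 'E']

fig, ax = plt.subplots()
ax.scatter(values, [0] * len(values))

# Find the outlier (smallest value) and write its label next to the point.
outlier = values.index(min(values))
ax.annotate(
    labels[outlier],
    xy=(values[outlier], 0),         # the point being labelled
    xytext=(values[outlier], 0.02),  # where the text is placed
    arrowprops={'arrowstyle': '->'},
)

ax.set_xlabel('life expectancy in 1977 (made-up numbers)')
ax.set_yticks([])  # the y axis carries no meaning here
```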

median fulcrum Oct 21, 2021, 7:44 PM

#

shut trail plt.scatter

has?

twin mantle Oct 21, 2021, 8:56 PM

#

How can I plot a grouped-by monthly time series in seaborn or plotnine?

median fulcrum Oct 21, 2021, 9:04 PM

#

shut trail for a single visualization are you sure you want to jump in ? are you trying to ...

1- what? 2- Yes but not focused on it

austere ridge Oct 21, 2021, 9:07 PM

#

Anyone able to help me melt a dataframe with multiple headers, time series

https://stackoverflow.com/questions/69667889/loading-a-time-series-dataframe-with-multiple-headers-melting-it-down

Stack Overflow

Loading a time series dataframe with multiple headers - Melting it ...

I have dataframe below which I originally loaded as:
df = pd.read_excel("Rectifier_DB.xlsx", header = [0,1], index_col=0)
1/1/2015
1/1/2015
2/1/2015
2/1/2015
3/1/2015
3/1/2015
Rectifier...

twin mantle Oct 21, 2021, 9:12 PM

#

austere ridge Anyone able to help me melt a dataframe with multiple headers, time series http...

Can you give me a dummy dataframe?

silver sun Oct 21, 2021, 9:37 PM

#

I need some advice. Has anyone created a machine learning model to detect malware or similar project?

arctic wedgeBOT Oct 21, 2021, 9:48 PM

#

Hey @austere ridge!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

robust jungle Oct 21, 2021, 10:04 PM

#

how can I use annotated images to train an opencv object detection model?

#

also, is it possible to implement transfer learning with it

arctic wedgeBOT Oct 21, 2021, 10:23 PM

#

Hey @tight walrus!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

tight walrus Oct 21, 2021, 10:24 PM

#

```py
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from pandas.plotting import register_matplotlib_converters
from pathlib import Path
register_matplotlib_converters()

df = pd.read_csv('airline_passengers.csv', index_col = ['Month']) 
df.head()```

#

```
~\AppData\Local\Temp/ipykernel_15824/309454017.py in <module>
----> 1 df = pd.read_csv('airline_passengers.csv', index_col = ['Month'])
      2 df.head()

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:```

#

```
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~\Anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

~\Anaconda3\lib\site-packages\pandas\io\parsers\base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

~\Anaconda3\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    705                 encoding=ioargs.encoding,
    706                 errors=errors,
--> 707                 newline="",
    708             )
    709         else:

FileNotFoundError: [Errno 2] No such file or directory: 'airline_passengers.csv'```

#

anyone know how to deal with this?

grave frost Oct 21, 2021, 10:41 PM

#

tight walrus ```import numpy as np import pandas as pd from matplotlib import pyplot as plt f...

either use ./airline_passengers.csv or the absolute path of the file, which you can find by right-clicking it and going into properties
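
A quick way to see why the relative path fails is to check where the notebook is actually running. A minimal sketch using only the standard library; the file name is the one from the traceback, everything else is illustrative:

```python
from pathlib import Path

csv_name = "airline_passengers.csv"  # file name taken from the traceback

# Relative paths are resolved against the current working directory,
# which for a notebook is usually the folder the kernel was started in.
print("Working directory:", Path.cwd())

candidate = Path.cwd() / csv_name
if candidate.exists():
    print("Found:", candidate)
else:
    # Show which CSV files are really here, to spot a typo or a wrong folder
    nearby = sorted(p.name for p in Path.cwd().glob("*.csv"))
    print("Not found. CSV files in this folder:", nearby)
```

If the file lives somewhere else, passing its full path to `pd.read_csv` (a raw string such as `r"C:\...\airline_passengers.csv"` on Windows) avoids depending on the working directory.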

stable kestrel Oct 21, 2021, 11:20 PM

#

I had a question which plot is the best to use for this question

#

im using seaborn to plot

#

my guess is a barplot or histogram

#

but

#

it does not plot for shit

#

#

so if anyone has any idea how to use this

#

it would be appreciated ❤️

velvet thorn Oct 21, 2021, 11:59 PM

#

stable kestrel ```Create a plot that shows the average body size for extinct species versus ext...

box plot?

royal crest Oct 22, 2021, 12:00 AM

#

yeah box is good

#

violin if you're feeling fancy

desert oar Oct 22, 2021, 12:52 AM

#

violin with boxplot whiskers overlaid

calm thicket Oct 22, 2021, 12:55 AM

#

doesn't a violin already have everything a boxplot has?

desert oar Oct 22, 2021, 1:00 AM

#

not really, you lose the specific "points of interest" like the 25th and 75th percentiles
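
For what it's worth, seaborn can draw both at once: `violinplot` accepts `inner="box"`, which puts a small box plot (median and quartiles) inside each violin. A minimal sketch with made-up numbers; the column names `status` and `body_size` are hypothetical, since the real dataset isn't shown:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: body size for two groups of species
df = pd.DataFrame({
    "status": ["extinct"] * 6 + ["extant"] * 6,
    "body_size": [12.0, 15.5, 9.8, 20.1, 17.3, 14.2, 5.1, 7.4, 6.3, 8.8, 4.9, 7.0],
})

fig, ax = plt.subplots()
# inner="box" overlays a miniature box plot on the density shape
sns.violinplot(data=df, x="status", y="body_size", inner="box", ax=ax)

# The same "points of interest" as numbers, for checking against the plot
quartiles = df.groupby("status")["body_size"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```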

covert compass Oct 22, 2021, 4:35 AM

#

can anyone help me with tesseract and opencv?
https://stackoverflow.com/questions/69656785/python-tesseract-unable-to-detect-two-lines
In the end this kinda works, but the returned text is not in order. E.g. it should be SCF6045P but I got 6045PSCF

text = pytesseract.image_to_string(ROI, lang='eng', config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 11 --oem 3')

Stack Overflow

Python Tesseract unable to detect two lines

I am trying to read car plates numbers and I have a problem with reading plates with two rows. I can read this image fine:
But this plate returns 'BEP7':

How should my preprocessing work in order...
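
One possible explanation for the ordering: Tesseract documents `--psm 11` as sparse text found "in no particular order", so a block mode such as `--psm 6` may be worth trying. Another option is to sort the words by position yourself. The sketch below assumes a dict shaped like the output of `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, with parallel `text`, `left` and `top` lists; the helper name and the sample values are made up:

```python
def join_in_reading_order(data, line_tolerance=10):
    """Join OCR words top-to-bottom, then left-to-right.

    `data` is assumed to hold parallel lists under "text", "left" and "top".
    `line_tolerance` is a hypothetical row height in pixels used to group words.
    """
    words = [
        (top, left, text.strip())
        for text, left, top in zip(data["text"], data["left"], data["top"])
        if text.strip()  # drop the empty entries Tesseract emits for blocks/lines
    ]
    # Crude row bucketing so small vertical jitter does not reorder a line
    words.sort(key=lambda w: (w[0] // line_tolerance, w[1]))
    return "".join(text for _, _, text in words)


# Made-up detections for a two-row plate, returned bottom row first
sample = {"text": ["6045P", "", "SCF"], "left": [12, 0, 30], "top": [80, 0, 20]}
plate = join_in_reading_order(sample)
print(plate)  # SCF6045P
```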

timid rivet Oct 22, 2021, 6:11 AM

#

y'all know any free websites to learn python?

#

im tryna learn python

#

and i learn from practicing

#

but all i got is a pdf

#

please lmk of any free websites if you know any

royal crest Oct 22, 2021, 6:12 AM

#

the pinned messages in this channel have links to some resources

#

insofar as data science is concerned.

#

for learning python in general, this isn't the right channel.

timid rivet Oct 22, 2021, 6:12 AM

#

oh okay

#

sorry

violet gull Oct 22, 2021, 6:27 AM

#

what are some good starting projects?

royal crest Oct 22, 2021, 6:50 AM

#

depends on your interests and intentions

tight walrus Oct 22, 2021, 7:14 AM

#

grave frost either use `./airline_passengers.csv` or whatever is the absolute path of the fi...

Aight thanks

next lance Oct 22, 2021, 8:38 AM

#

How can I download cuDNN 8.1

next lance Oct 22, 2021, 8:40 AM

#

violet gull what are some good starting projects?

An object detector, a chatbot, a voice password, or your own idea

royal crest Oct 22, 2021, 9:31 AM

#

what OS are you on

coral kindle Oct 22, 2021, 10:24 AM

#

Any practical application I could do with Pyspark?

#

I need to hone my skills there

gentle sleet Oct 22, 2021, 10:49 AM

#

is this book up to date

#

i want to start learning ai i have read nnfs, and im searching for something new

mortal dove Oct 22, 2021, 12:38 PM

#

I'm working on an ARIMA model, and I want to know how strongly something must have affected the series for it to be considered an intervention.
In 2014 South Africa implemented new travel regulations for tourists, requiring specific documents if a child is not traveling with both their parents.
To me it looks like the underlying pattern changed sometime in 2014. If the change is indeed significant enough, is there any way to pick exactly which month the intervention happened in? Would it just be going back through news articles and finding out when exactly the changes were implemented/announced?
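
One rough way to shortlist a candidate month before checking it against the announcement dates is a simple level-shift scan: try every split point and keep the one where two constant-mean segments fit best. This is not ARIMA intervention analysis (it ignores trend and seasonality), just a sketch on made-up numbers; the function names are hypothetical:

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the segment mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_level_shift(series, min_segment=3):
    """Return the index where splitting into two constant-mean segments fits best."""
    candidates = range(min_segment, len(series) - min_segment + 1)
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))


# Made-up monthly arrivals with a drop starting at position 6
arrivals = [100, 102, 98, 101, 99, 103, 80, 82, 79, 81, 78, 80]
shift_at = best_level_shift(arrivals)
print(shift_at)  # 6
```

A more principled follow-up would be to add a step dummy that switches on at the candidate month as an external regressor in the ARIMA fit and look at the size and uncertainty of its coefficient.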

coral kindle Oct 22, 2021, 12:52 PM

#

gentle sleet is this book up to date

scikit-learn went to version 1.0 so there's some outdated stuff

#

But the notions are the same.

gentle sleet Oct 22, 2021, 12:53 PM

#

coral kindle But the notions are the same.

But what would you recommend?

coral kindle Oct 22, 2021, 12:53 PM

#

So i'd suggest giving it a try and when you have examples with code, refer to sklearn's documentation to see what changed and what stayed the same

gentle sleet Oct 22, 2021, 12:54 PM

#

im 18 years old, and i dont have any degree pending

#

i would like to take a look on AI

#

and do some stuff

coral kindle Oct 22, 2021, 12:55 PM

#

gentle sleet im 18 years old, and i dont have any degree pending

If you're 18 I'd suggest checking the basic mathematics for AI. Look up "linear regression" because most of the stuff you'll see here uses that.

#

Otherwise if you want to get straight to model construction, you can browse the examples on scikit-learn and launch them.

#

Are you familiar with Colab?

gentle sleet Oct 22, 2021, 12:56 PM

#

Yep

#

im junior backend developer currently XD

#

so i should know a bit

coral kindle Oct 22, 2021, 12:58 PM

#

Ok so, to describe how scikit-learn works, you have models, and models must train using data.
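
A minimal sketch of that fit-then-predict pattern, using made-up numbers and `LinearRegression` purely as an example estimator:

```python
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1
X = [[1.0], [2.0], [3.0], [4.0]]  # one feature per row
y = [3.0, 5.0, 7.0, 9.0]

model = LinearRegression()
model.fit(X, y)  # "training": the model learns its coefficients from the data

prediction = model.predict([[5.0]])  # apply the trained model to unseen input
print(round(float(prediction[0]), 2))  # 11.0
```

Most scikit-learn estimators share this `fit` / `predict` interface, so swapping in a different model usually only changes the import and the constructor line.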

gentle sleet Oct 22, 2021, 12:58 PM

#

Ok i know that

#

but i would like to move from backend to more Data engineering

#

AI/ML

coral kindle Oct 22, 2021, 1:01 PM

#

So you want to create pipelines?

#

For data engineering you need to be familiar with big data I think

#

And with tools like pyspark

gentle sleet Oct 22, 2021, 1:03 PM

#

Hmm, i just want to go into data world

#

and maybe after a while i'll choose which path to take

coral kindle Oct 22, 2021, 1:16 PM

#

gentle sleet Hmm, i just want to go into data world

Try Kaggle

proven plinth Oct 22, 2021, 1:59 PM

#

i have two dataframes

```
db_adjusted.columns=Index(['activity_dttm', 'subject_txt', 'action_type_nm', 'activity_id',
       'a.activity_html_txt', 'Start Time Final', 'Subject', 'Type'],
      dtype='object')
5137
cbs_adjusted.columns=Index(['Subject', 'Type', 'Calc Full Name'], dtype='object')
1537
Common Columns {'Type', 'Subject'}
```

i want to merge on the two common columns

```py
merged = pandas.merge(
    cbs_adjusted,
    db_adjusted,
    on=["Subject", "Type"],
    how="left",
)
merged.count()
>>> [67313 rows x 4 columns]
```
how is this possible

shut trail Oct 22, 2021, 2:00 PM

#

timid rivet sorry

dont be sorry for politely asking for help!

shut trail Oct 22, 2021, 2:06 PM

#

proven plinth i have two dataframes ``` db_adjusted.columns=Index(['activity_dttm', 'subject_t...

my guess: your keys are not formatted the same

#

did you check that unique values for each are equal?

proven plinth Oct 22, 2021, 2:07 PM

#

Subject on the first df

#

and the other

#

the columns have common values

shut trail Oct 22, 2021, 2:08 PM

#

did you check ?

proven plinth Oct 22, 2021, 2:08 PM

#

yes

shut trail Oct 22, 2021, 2:08 PM

#

get unique values, sort, and evaluate equal..

#

okay

proven plinth Oct 22, 2021, 2:09 PM

#

i need to find rows where both the subject and the type are the same

shut trail Oct 22, 2021, 2:09 PM

#

you cleaned the data ?

proven plinth Oct 22, 2021, 2:09 PM

#

yes

#

i stripped everything, i lowercased everything

#

idk what more i can do tbh to clean them up

#

pandas.merge(df1, df2, on=[col1, col2], how="inner") should get me the rows in the two dfs that have the same value in the two cols col1 and col2

#

but i get 10x more rows than what any of the two dfs have
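
A merge only returns more rows than its inputs when the key pair repeats: every copy of a key on one side is paired with every copy on the other. A minimal sketch with made-up frames standing in for `cbs_adjusted` and `db_adjusted` (only the `Subject` and `Type` column names come from the messages above):

```python
import pandas as pd

# Made-up stand-ins; ("a", "x") repeats on both sides
left = pd.DataFrame({
    "Subject": ["a", "a", "b"],
    "Type": ["x", "x", "y"],
    "Calc Full Name": ["n1", "n2", "n3"],
})
right = pd.DataFrame({
    "Subject": ["a", "a", "a", "b"],
    "Type": ["x", "x", "x", "y"],
    "activity_id": [1, 2, 3, 4],
})
keys = ["Subject", "Type"]

merged = pd.merge(left, right, on=keys, how="left")
# ("a", "x"): 2 left rows x 3 right rows = 6, plus ("b", "y"): 1 x 1 = 1
print(len(left), len(right), len(merged))  # 3 4 7

# Which key pairs repeat on each side?
left_counts = left.groupby(keys).size()
right_counts = right.groupby(keys).size()
print(left_counts[left_counts > 1])
print(right_counts[right_counts > 1])

# validate= makes pandas raise instead of silently multiplying rows
try:
    pd.merge(left, right, on=keys, how="left", validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```

If the repeats are expected, the usual fix is to decide which row per key should survive (for example with `drop_duplicates(subset=keys)` or an aggregation) before merging.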

somber prism Oct 22, 2021, 2:11 PM

#

can someone help me improve my anomaly detection model,

input : 10k toys product description, 225 alcohol product description

the problem here is to detect those 225 alcohol products as anomalies based on their descriptions; the toy and alcohol product descriptions will be in one dataset

so i'll explain what i did first

s-1: i cleaned the product description text by removing punctuation and stop words, tokenizing, lemmatizing, etc. (basic cleaning process)
s-2: i converted the cleaned text to a paragraph vector using gensim doc2vec
s-3: then i tried to fit this vectorized dataset into an isolation forest
i was able to get 30 true negatives and 597 false positives (the true negatives are the anomalies). i was able to get this evaluation because i labeled them before mixing the alcohol product descriptions into the toy product descriptions. my goal here is to minimize type 1 and type 2 errors.

median fulcrum Oct 22, 2021, 2:13 PM

#

Why isn't it staying in just one plot?

#

fig, ax = plt.subplots(ncols=2)
sns.catplot(data = america_dataset, ax=ax[0]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='America').set_ylabels('Life expectancy', weight='bold');
sns.catplot(data = asia_dataset, ax=ax[1]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='Asia').set_ylabels('Life expectancy', weight='bold');

#

sns.catplot doesn't allow me to do that?

shut trail Oct 22, 2021, 2:14 PM

#

proven plinth but i get 10x more rows than what any of the two dfs have

ya i misread that count. pull open your data and explore it a bit

shut trail Oct 22, 2021, 2:15 PM

#

median fulcrum sns.catplot doesn't allow me to do that?

You wana add things to the figure or axis directly then plot

#

are you trying to do two plots side by side or ..

median fulcrum Oct 22, 2021, 2:16 PM

#

shut trail are you trying to do two plots side by side or ..

yes

proven plinth Oct 22, 2021, 2:18 PM

#

looking at the data doesnt really help me, i've been staring at them all day

#

theres something wrong im doing with pandas

shut trail Oct 22, 2021, 2:18 PM

#

median fulcrum yes

try adding ax=ax[0] to the beginning of your catplot parameters

#

and ax=ax[1]

#

specifies which subplot to use

median fulcrum Oct 22, 2021, 2:19 PM

#

shut trail try adding ax=ax[0] to the beginning of your catplot parameters

doesn't matter

shut trail Oct 22, 2021, 2:21 PM

#

proven plinth theres something wrong im doing with pandas

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

try using merge as a method

shut trail Oct 22, 2021, 2:25 PM

#

median fulcrum doesn't matter

separate the methods to act on the axis

#

ax[0].set_ylabel(stuff)

median fulcrum Oct 22, 2021, 2:26 PM

#

shut trail separate the methods to act on the axis

I removed all the .set calls and kept just this, and it still doesn't work

#

I think it's because catplot, but I don't see any alternative to it

shut trail Oct 22, 2021, 2:31 PM

#

median fulcrum I think it's because catplot, but I don't see any alternative to it

fig, axes = plt.subplots(1, 2)
fig.suptitle('hello world')

sns.barplot(ax = axes[0], x = df.x_values, y = df.y_values)
axes[0].set_title('title1')

sns.barplot(ax = axes[1], x = df.x_values, y = df.y_values)
axes[1].set_title('title2')

median fulcrum Oct 22, 2021, 2:32 PM

#

shut trail fig, axes = plt.subplots(1, 2) fig.suptitle('hello world') sns.barplot(ax = axe...

#

It's the same

#

the (1,2)

shut trail Oct 22, 2021, 2:34 PM

#

copy, see if it works

median fulcrum Oct 22, 2021, 2:36 PM

#

shut trail copy, see if it works

what's the difference between your code and this:

fig, ax = plt.subplots(1,2)
sns.catplot(data = america_dataset, ax=ax[0])
sns.catplot(data = asia_dataset, ax=ax[1])

shut trail Oct 22, 2021, 2:36 PM

#

you tell me lol

median fulcrum Oct 22, 2021, 2:36 PM

#

there's no difference

#

bro

shut trail Oct 22, 2021, 2:36 PM

#

im out

median fulcrum Oct 22, 2021, 2:37 PM

#

it doesn't work with catplot apparently

shut trail Oct 22, 2021, 2:37 PM

#

its called Seaborn...

median fulcrum Oct 22, 2021, 2:38 PM

#

shut trail its called Seaborn...

?

#

yes, and with other types of plot probably will work

#

like countplot, I know that works

shut trail Oct 22, 2021, 2:38 PM

#

I have faith you know what you're doing

median fulcrum Oct 22, 2021, 2:39 PM

#

shut trail I have faith you know what you're doing

I thought with catplot we could plot more than one graph at the same time, like with the other seaborn plots

#

fig, ax = plt.subplots(1,2, figsize=(9,5));
sns.countplot(x = y_test, ax=ax[0]).set_title('Real results');
sns.countplot(x = predictions, ax=ax[1]).set_title('Algorithm predictions');
plt.ylabel(' ')
fig.show();

#

like that

shut trail Oct 22, 2021, 2:42 PM

#

The reason i said specify your variables is that it forces you to make sure they make sense. I dont know your data so only you can know if those magical dataframes are in order

#

does countplot use subplots.. i remember problems with facetgrids long ago...

median fulcrum Oct 22, 2021, 2:43 PM

#

shut trail The reason i said specify your variables is that it forces you to make sure they...

I do two america and asia datasets, so just put the data= and it's good

#

I already separated the variables

shut trail Oct 22, 2021, 2:44 PM

#

median fulcrum I do two america and asia datasets, so just put the data= and it's good

apparently not right?

median fulcrum Oct 22, 2021, 2:45 PM

#

shut trail apparently not right?

why?

shut trail Oct 22, 2021, 2:45 PM

#

cause its not working! lol a joke

#

what do you want your vis to look like ?

#

Can you show the two you want working separately ?

median fulcrum Oct 22, 2021, 2:47 PM

#

shut trail cause its not working! lol a joke

??? Each graph works on its own, but putting them together does not. Do you see the difference?

#

shut trail Oct 22, 2021, 2:47 PM

#

and how my friend are those cat plots ?

#

use a scatter ?

median fulcrum Oct 22, 2021, 2:48 PM

#

shut trail use a scatter ?

a scatter would take so much work to plot

shut trail Oct 22, 2021, 2:49 PM

#

youre not specifying your variables are you

median fulcrum Oct 22, 2021, 2:49 PM

#

shut trail youre not specifying your variables are you

the year variables?

#

the year variables are columns

shut trail Oct 22, 2021, 2:49 PM

#

catplot is a way of looking at categorical vars at the same time.. you just want two side by side scatter plots

#

build the arrays you need and feed them in

median fulcrum Oct 22, 2021, 2:50 PM

#

shut trail catplot is a way of looking at categorical vars at the same time.. you just want...

a scatter plot would not work since the number of countries and years in the database is huge

shut trail Oct 22, 2021, 2:50 PM

#

median fulcrum

this is a scatter plot

median fulcrum Oct 22, 2021, 2:50 PM

#

shut trail this is a scatter plot

this is cat plot

shut trail Oct 22, 2021, 2:51 PM

#

you didnt show your code and i dont really care what function you used, im telling you what it is lol

#

those are points, thats the x and y axis lol

somber prism Oct 22, 2021, 2:51 PM

#

can someone help me with anomaly detection 😐

median fulcrum Oct 22, 2021, 2:51 PM

#

shut trail you didnt show your code and i dont really care what function you used, im telli...

but is catplot

#

lol

#

shut trail Oct 22, 2021, 2:52 PM

#

use a scatter

median fulcrum Oct 22, 2021, 2:52 PM

#

shut trail use a scatter

I would have to have a years column

shut trail Oct 22, 2021, 2:52 PM

#

build the array you need

#

thats why we use python for this

median fulcrum Oct 22, 2021, 2:53 PM

#

I'm not so confident in making databases

#

That's why I picked one that was already finished

somber prism Oct 22, 2021, 2:53 PM

#

guys is it possible to detect the anomaly products just from their descriptions for 2 random categories ?

so with the code:```py let: A is a column and B is a column a is an item in column A b is an item in column B

len([A=a][B])

so with the code:```py
let:
A is a column and B is a column
a is an item in column A
b is an item in column B