#data-science-and-ml


serene scaffold
#

dataset['text'].astype.str() does not convert the data to strings.

wicked grove
#

yeah it gave an error

serene scaffold
#

astype is a method, not an accessor.

serene scaffold
#

!docs pandas.Series.str.join

arctic wedgeBOT
#

Series.str.join(sep)
Join lists contained as elements in the Series/Index with passed delimiter.

If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the function. This function is an equivalent to [`str.join()`](https://docs.python.org/3/library/stdtypes.html#str.join "(in Python v3.10)").
desert oar
#

also: .astype(str) is dangerous because if you have a "null" value like NaN, it will produce the string "nan", not a proper None

#

in addition to the obvious fact that .astype(str) on a column of lists will produce bad text like "['a', 'b']" which is not what you want
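To make the point about lists concrete, here is a minimal sketch on a made-up Series (the data and the name `s` are assumptions for illustration, not the tutorial's dataset). It compares `.astype(str)` with `.str.join(" ")` on a column of token lists.

```python
import pandas as pd

# Hypothetical column of already-tokenized tweets (lists of strings).
s = pd.Series([["love", "u", "guy"], ["sick", "really", "cheap"]])

# astype(str) calls str() on each list, so the result is Python literal syntax.
as_text = s.astype(str)
print(as_text[0])

# .str.join(" ") joins the tokens inside each list with a space instead.
joined = s.str.join(" ")
print(joined[0])
```

The first print shows the brackets and quotes baked into the text, while the second gives plain space-separated words.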

#

that said, it looks like your text is already tokenized. what's the point of joining and then tokenizing again?

serene scaffold
desert oar
#

hm. the screenshot is really hard to read, but it looks like some rows are tokenized and some rows are not

#

like... some rows are strings and some are lists of strings. i.e. a real mess

wicked grove
wicked grove
desert oar
#

what is the type of each element here? are these strings or lists? please show us some example data that is not in a tiny unreadable screenshot

wicked grove
#

i am following this

wicked grove
#
X=dataset['text'].astype(str)
y=dataset.target
desert oar
#

so do you understand now why .astype(str) is wrong?

#

what is the content of dataset['text']?

#

i see, they have it in their example

#

this is some questionable code

#

the data analysis content seems OK, but i would not recommend this article as an example of good code

desert oar
#

it's not bad code either, but it's messy and they do some weird/redundant things

royal crest
#

df.shape

desert oar
#
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
dataset['text']= dataset['text'].apply(lambda x: cleaning_punctuations(x))
  1. this is inefficient because it re-creates the translator in every call, translator = should be outside the function (there are some considerations w/ respect to local vs global lookup, but that's a different issue)
  2. .apply(lambda x: cleaning_punctuations(x)) should be .apply(cleaning_punctuations)
  3. i don't want to criticize their use of english too much because not everyone is a proficient english speaker, and english is a complicated language. but cleaning_punctuations should probably be clean_punctuation
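The three points above could be sketched roughly as follows. This is only an illustration: `string.punctuation` stands in for the article's `punctuations_list`, and the name `_PUNCT_TABLE` and the sample tweets are made up.

```python
import string

# Build the translation table once, at module level, instead of on every call.
# string.punctuation is an assumption standing in for punctuations_list.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_punctuation(text):
    """Return text with ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


tweets = ["hello, world!", "what's up?"]
cleaned = [clean_punctuation(t) for t in tweets]
print(cleaned)  # ['hello world', 'whats up']
```

With a pandas column the same function can be passed directly, as in `dataset['text'].apply(clean_punctuation)`, with no lambda wrapper.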
wicked grove
#

800000 ['love', 'healthuandpets', 'u', 'guy', 'r', 'b...
800001 ['im', 'meeting', 'one', 'besties', 'tonight',...
800002 ['darealsunisakim', 'thanks', 'twitter', 'add'...
800003 ['sick', 'really', 'cheap', 'hurt', 'much', 'e...
800004 ['lovesbrooklyn', 'effect', 'everyone']

desert oar
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

"['love', 'healthuandpets', 'u']"
desert oar
#

basically it just made a string containing literal python code

#

that is not what you want, ever

#
STOPWORDS = set(stopwordlist)
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))

they did the right thing creating STOPWORDS outside the function, but

  1. why is it STOPWORDS and not stopwords? not all globals need to be upper case
  2. why str(text)? it's definitely already a string
royal crest
#
# remove punctuation
from string import punctuation as p
def some_func(text):
    return [char for char in text if char not in p]
    # alternatively, to get a string back instead of a list of characters:
    # return "".join([char for char in text if char not in p])

is this not simpler?

#

also they defined their own stopword list

#

which is hmm

serene scaffold
wicked grove
desert oar
#

meh, the "standard" stopword lists can be too broad @royal crest , i've done that before

desert oar
royal crest
#

time to benchmark

desert oar
#

this is also just questionable string cleaning in general, e.g. it doesn't take @usernames into account

#

i guess you could argue that words in usernames could be relevant for determining the sentiment of a tweet, but i doubt it

#

i guess it'll all wash out in feature selection
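If one did want to drop @usernames before tokenizing, a minimal sketch might look like this. The pattern is a simplification of Twitter's real username rules, and `MENTION_RE` and `strip_mentions` are hypothetical names.

```python
import re

# "@" followed by word characters: a rough approximation of a mention.
MENTION_RE = re.compile(r"@\w+")


def strip_mentions(text):
    """Remove @username mentions and collapse the leftover whitespace."""
    without_mentions = MENTION_RE.sub("", text)
    return " ".join(without_mentions.split())


print(strip_mentions("@darealsunisakim thanks for the twitter add"))
# thanks for the twitter add
```

This has to run before punctuation is stripped, otherwise the `@` is already gone and the username looks like an ordinary word.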

wicked grove
desert oar
#

you might want to read about "unicode normalization"

#

and you might want to familiarize yourself generally with what "unicode" is - python strings are sequences of unicode code points
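As a small illustration of why normalization matters (a sketch using only the standard library), the same visible character can be stored as different code point sequences, and those compare as unequal until they are normalized:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# NFC normalization composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                  # True
```

Normalizing text before tokenizing or removing stopwords keeps visually identical words from being counted as different tokens.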

wicked grove
desert oar
#

that's good

wicked grove
wicked grove
desert oar
#

you should have strings like this: "hello this is a tweet", which you tokenize into lists: ["hello", "this", "is", "a", "tweet"]

#

pandas tries to make the output "pretty", but in the process it can be difficult to see what is actually in each element

#

i'm not sure how you ended up with the [] stuff. it's better if you show your code

fluid pebble
#

hello fellas

#

I need help with a project in python

#

i am new to python so i need help

fluid pebble
#

i want to develop a program which can identify differences between the 'ideal image' and the 'difference image'. can anybody help

wicked grove
desert oar
arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

desert oar
#

2 people here spent a good amount of time trying to understand the use case and guide you towards a solution

fluid pebble
#

yes i appreciate help

#

but my ocr results are not quite good

#

i need help with my ocr result

desert oar
#

okay, are you able to share examples? there might not be any real OCR experts here. you also might want to describe if there are any kinds of non-text differences that your system should be looking for

desert oar
#

someone else suggested comparing rgb histograms. maybe you can also use some kind of image embedding to compute distances between images in some interesting feature space

fluid pebble
#

well i have also done that

#

but my boss dont want these results

desert oar
# wicked grove https://paste.pythondiscord.com/vuxazonide.py

here's a tip, instead of this:

english_stopwords = set(stopwords.words('english'))
not_stopwords = ['not']
final_stopwords = set([word for word in english_stopwords if word not in not_stopwords])

you can do this, using the - operation on sets:

english_stopwords = set(stopwords.words('english'))
not_stopwords = {'not'}
final_stopwords = english_stopwords - not_stopwords
fluid pebble
#

the result which i am looking for is the program can highlight 'spelling mistakes' , change in a text in difference image as compared to ideal image, any content which is missing in difference image as compared to original image.

desert oar
#

@wicked grove this is part of the problem:

tokenizer=RegexpTokenizer(r'\w+')
dataset['text']=dataset['text'].apply(tokenizer.tokenize)
#

imo you should be saving the tokenized text into a separate column

#
tokenizer = RegexpTokenizer(r'\w+')
dataset['tokens'] = dataset['text'].apply(tokenizer.tokenize)
desert oar
#

but otherwise this is OK until line 176 when you use .astype

desert oar
#

all you have to do is replace dataset['text'].astype(str) with dataset['text'].str.join(" ") as per stelercus' suggestion

#

but i strongly encourage you to think about what the difference is

#

generally this is inefficient code but it's not necessarily bad. what i will say is that TfidfVectorizer lets you apply pretty much all of these data transformations in one shot

wicked grove
desert oar
#

i don't see any place where you would get a string like "[a, b]"

#

i think you might need to restart your notebook kernel

#

(assuming this is in a notebook)

desert oar
#

finally, you don't actually have to re-join the strings at the end

#

you can do TfidfVectorizer(analyzer=lambda x: x) which does not attempt to split any strings
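A minimal sketch of that idea, assuming scikit-learn is installed and using made-up token lists rather than the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents (each one is a list of strings).
docs = [["love", "u", "guy"], ["sick", "really", "cheap"], ["love", "cheap"]]

# The analyzer is called on each document and must return its tokens.
# Returning the input unchanged means the lists are used as they are.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Because the analyzer is a callable, the vectorizer's own lowercasing and tokenizing steps are bypassed, so any cleaning has to happen before this point.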

wicked grove
main fox
#

Special shout-out to .reset_index(), it's a great method

wicked grove
desert oar
#

!eval here, i will show you again:

x = ['a', 'b', 'c']
print(repr(str(x)))
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

"['a', 'b', 'c']"
desert oar
#

!eval you probably want this:

x = ['a', 'b', 'c']
print(' '.join(x))
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

a b c
wicked grove
wicked grove
wicked grove
#

Also for steps 8,9,10 should i just keep the tutorial as a reference and follow some other code for the confusion matrix,etc

lone drum
#

Hello
I am writing to CSV file
I am getting some of columns are not completely filled
Ping me when replying

#

the highlighted part is getting empty

wicked grove
#

Had a question, why should there be a space after binary operators?

stoic musk
#

Hi all, I'm trying to drop the predictions column in a pandas dataframe and getting the following error:

#

AttributeError: 'DataFrame' object has no attribute 'DEATH_EVENT'

#

on line:
y = X_full.DEATH_EVENT

#

X_full.dtypes returns the following:

#

sex int64
smoking int64
time int64
DEATH_EVENT int64
dtype: object

#

...Any ideas?

#

DEATH_EVENT is definitely a column in the dataframe

main fox
stoic musk
#

clever, thanks!

fluid pebble
#

hi

lapis sequoia
#

anyone know what this error mean in seaborn ? ==> "RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density."

zinc rock
#

why when i convert a json to csv using pandas, then try to read that csv, i get decoding errors?

#

gbk cant decode etc
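One possible explanation, offered as an assumption because the code is not shown: the CSV is written as UTF-8 but later read with the system's default codec (gbk on some Windows setups). Stating the encoding on both sides avoids the guesswork. The sketch below uses a made-up DataFrame and a temporary file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data containing non-ASCII text.
df = pd.DataFrame({"name": ["café", "数据"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    # Write and read with the same explicit encoding.
    df.to_csv(path, index=False, encoding="utf-8")
    round_trip = pd.read_csv(path, encoding="utf-8")

print(round_trip)
```

The same applies to the built-in `open()`: passing `encoding="utf-8"` there stops Python from falling back to the locale's codec.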

vocal basin
#

Hey guys, have been scratching my head for two hours trying to figure out a way to add names of teams instead of hue dot points. Please help me

royal glen
#

is there any place where i can get beginner questions on numpy, pandas, etc. to practice?

ember goblet
#

yes

royal glen
ember goblet
serene scaffold
fluid pebble
#

hi

#

how to improve OCR results accuracy

flint grotto
#

hello!

#

uhm. i have question.

#

Can you recommend a machine learning course on Coursera?

#

Last time, you recommended “data science from scratch”, so I bought it and wanted to take a machine learning course.

#

Do I need to master statistics and linear algebra for additional machine learning?

novel elbow
flint grotto
#

not available on coursera?

vivid cairn
#

What are some methods or tools I can use to distill a small training dataset from a big dataset that when trained on have as good projected performance as the big dataset?

#

I seem to not have the Google Fu required to find information to read about this.

robust jungle
#

does anyone know how I can get output node names of an Xception model?

wicked grove
wicked grove
serene scaffold
#

just arbitrary slices of the original string?

chrome lintel
#

Hey all, I'm trying to use pytesseract to read text from an image. This is my current image, but the output produced from it is just nonsense

#

Hey all, I'm trying to use pytesseract to read an image to extract some numbers. This is the code I'm using to try and extract the numbers from:

output = pytesseract.image_to_string(img,config='--psm 10 --oem 3 tessedit_char_whitelist=0123456789')

#

And the output I receive is:

iy Qe

#

One sec, I'll upload the image to imgur too.
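A possible cause, offered as an assumption since nobody in the log confirms it: Tesseract config variables such as `tessedit_char_whitelist` are set with a `-c` prefix on the command line, and without it the whitelist may not be applied. Also, `--psm 10` treats the image as a single character, so a multi-digit number might need a different mode such as `--psm 7` (single text line). The helper name `digits_only_config` is hypothetical.

```python
def digits_only_config(psm=10, oem=3):
    """Build a Tesseract config string that restricts output to digits."""
    # "-c name=value" is how a Tesseract config variable is set.
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist=0123456789"


config = digits_only_config()
print(config)

# Usage, assuming pytesseract is installed and img is an already loaded image:
# output = pytesseract.image_to_string(img, config=config)
```

Whether the whitelist is honoured can also depend on the Tesseract version and OCR engine in use, so checking the installed version is worth a try if this changes nothing.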

wicked grove
sly wharf
mighty spoke
#

Hi would someone be able to help me coding some loops?

robust jungle
#

How can I use a custom keras model with opencv?

#

Alternatively, how can I get output node names for my model

vague moon
#

Is there a way to find the accuracy of your results with a tolerance

#

e.g. I'm predicting final grades for students, my output can be 0-20, right now I'm just checking how often my model is correct, but often when it isn't it is off by only 1, I want to see how often this is

#

I've stumbled across explained_variance_score and am wondering if that's what I'm looking for, seems like it

kind cape
#

Hey everyone 👋 Can someone help with this? Thanks!

#

I started like this:

#

`#method for kNN method
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss

"""
@param X, np ndarray, feature matrix
@param y, np array, labels vector
@param kfold, Integer number of CV folds
@param n_neighbors, Integer number of neighbors in KNN

@returns floating number
"""
def evaluate_kNN(X,y,kfold,n_neighbors):
    kf = KFold(kfold)
    neigh = KNeighborsClassifier(n_neighbors)`
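One way the function could be finished is sketched below. It assumes the intended return value is the mean zero-one loss across the folds, which the imports and the docstring suggest but the message does not state. The demo data is made up.

```python
import numpy as np
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def evaluate_kNN(X, y, kfold, n_neighbors):
    """Return the mean zero-one loss of a kNN classifier over kfold CV folds."""
    X = np.asarray(X)
    y = np.asarray(y)
    losses = []
    for train_idx, test_idx in KFold(n_splits=kfold).split(X):
        # Fit a fresh classifier on the training part of each fold.
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        losses.append(zero_one_loss(y[test_idx], predictions))
    return float(np.mean(losses))


# Tiny made-up data set: two well separated classes, interleaved.
X_demo = [[0.0], [10.0], [0.1], [10.1], [0.2], [10.2], [0.3], [10.3]]
y_demo = [0, 1, 0, 1, 0, 1, 0, 1]
print(evaluate_kNN(X_demo, y_demo, kfold=4, n_neighbors=1))  # 0.0
```

`KFold` does not shuffle by default, so data sorted by class should be shuffled first (or `KFold(..., shuffle=True)` used), otherwise some folds may train on a single class.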

boreal summit
#

Seems like many of the TFX.utils functions have been deprecated.

#

Same with some tfx.components libraries. Found the new imports for some of them while I can't find the rest. 😐

arctic wedgeBOT
#

Hey @kind cape!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

odd meteor
# vague moon I've stumbled across explained_variance_score and am wondering if that's what I'...

You mentioned tolerance... Are you trying to perform a collinearity diagnostic? Tolerance is a metric used to assess the amount of multicollinearity present in a model.
You might wanna provide some more clarity on that part....

On the other hand, explained_variance_score is closely related to the Coefficient of Determination a.k.a R-squared, but it is not the same thing: the two only agree when the prediction errors average out to zero.

R-squared (r2_score) is the default metric for regression problems in scikit-learn.

This metric measures the amount of variation in your response variable (target) which the explanatory variables (a.k.a features) in your model were able to explain. So the higher the R-squared score the better your model performance.

If what you want is an overall measure of how far off your predictions are, MSE or better still RMSE would capture that. The lower your RMSE score the better your model performance.
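The original question, how often a prediction lands within 1 of the true grade, can also be computed directly. Here is a minimal sketch in plain Python. The function name `accuracy_within` and the grades are made up for illustration.

```python
def accuracy_within(y_true, y_pred, tolerance=1):
    """Fraction of predictions whose absolute error is at most tolerance."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


true_grades = [10, 12, 15, 8, 20]
pred_grades = [10, 13, 12, 9, 20]

print(accuracy_within(true_grades, pred_grades, tolerance=0))  # 0.4
print(accuracy_within(true_grades, pred_grades, tolerance=1))  # 0.8
```

With `tolerance=0` this is ordinary accuracy, so the two numbers side by side show how many of the misses are near misses.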

worthy crystal
#

Hello guys I was wondering if anyone is experience with autoencoders for anomaly detection

#

if you are please @ me

lapis sequoia
#

is pytorch easier than TF + keras?

#

also is TF + keras more well known and has more guides?

lapis sequoia
#

from sklearn.metrics import r2_score

quiet vault
sonic comet
#

how do I get the accuracy of the prediction in tensorflow?

#
val = model.predict(images)
#

???

sharp beacon
#

is it possible to do a time series prediction on super irregular data with a lot of time where the data is zero?

#

for example, from say june 1-15, there are rejected records every 4 hrs, between them there are none rejected. June 16-30 there are zero rejected records. Is it possible to forecast another 30 days with this type of irregular data?

tender hearth
sonic comet
#

yeah, I know how to get the accuracy while training but how do I get a prediction during a 'runtime'?

tender hearth
#

During inference? You do the same thing

sonic comet
#

I pass the image into my model, how do I get tensor flow to give me a "sureness" because I don't the answer for the "live testing"
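If the model ends in a softmax layer (an assumption, since the architecture is not shown), `model.predict` already returns one probability per class, and the largest one can serve as a rough "sureness". The array below is a made-up stand-in for that output.

```python
import numpy as np

# Stand-in for model.predict(images): one row per image, one column per class.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])

predicted_class = probs.argmax(axis=1)  # index of the most likely class
confidence = probs.max(axis=1)          # probability assigned to that class

for cls, conf in zip(predicted_class, confidence):
    print(f"class {cls} with probability {conf:.2f}")
```

Softmax probabilities are not guaranteed to be well calibrated, so a value of 0.9 should be read as a relative score rather than a literal 90% chance of being right.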

rigid zodiac
#

Has anyone worked with a Raspberry Pi before??

spark nimbus
#

gonna write a guide on audio processing, what do you guys recommend I talk about or explain in a fairly simple way?

rigid zodiac
stoic musk
#

Hi! I'm trying to convert a numpy array into a pd dataframe

#

When I try to print df.head, i get this:

#

<bound method NDFrame.head of 0
0 0
1 1
2 0
3 0
4 0
5 0

#

the first column must be the index, and the second column is the array (now dataframe), but it doesn't return the typical first five values, like a df.head function call usually does. Any ideas?

velvet thorn
stoic musk
#

Yeah... looks the same as before

#

I used this:
df_YV = pd.DataFrame(y_valid)

am i missing something?

#

where 'y_valid' is the np array

velvet thorn
#

i.e. you’re missing parentheses
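A short sketch of the difference, with a made-up array standing in for `y_valid`: `df.head` without parentheses is the method object itself, which is why printing it shows `<bound method NDFrame.head of ...>`, while `df.head()` calls it and returns the first five rows.

```python
import numpy as np
import pandas as pd

y_valid = np.array([0, 1, 0, 0, 0, 0, 1])  # hypothetical stand-in
df_YV = pd.DataFrame(y_valid)

print(callable(df_YV.head))  # True: this is the method, not its result
print(len(df_YV.head()))     # 5: calling it returns the first five rows
```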

stoic musk
#

....no i didn't

#

thank you

#

or.. yes I am missing parentheses.

#

thank yOU!

velvet thorn
#

you’re welcome

stoic musk
#

sublime + ella fitzgerald, great name

#

Is there a function to convert a pd dataframe (single column) of decimal values 0 < x < 1 , to 0-1 values?

#

i.e. if x_h < 0.5, x_h = 0, else x_h = 1

...without iterating over the whole dataframe or array? I have continuous predictions that need to be converted to 0-1

lapis sequoia
tender hearth
stoic musk
#

...that worked!

#

I wonder what a person does when they're so advanced that people on Discord can't help them lol

lapis sequoia
# stoic musk ...that worked!

Also apart from that you have plenty of ways. My fav is .apply others are iloc but iloc is not good enough for complex logic.

stoic musk
#

.apply, I'll write that down, thank you

lapis sequoia
#

!d pandas.DataFrame.apply

stoic musk
#

oh that makes sense using iloc

arctic wedgeBOT
#

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (`axis=0`) or the DataFrame’s columns (`axis=1`). By default (`result_type=None`), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result\_type argument.
lapis sequoia
#

Also .where would be good for this use case too.

#

!d pandas.Series.where

arctic wedgeBOT
#

Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
Replace values where the condition is False.
lapis sequoia
#

df['newcol']= np.where(df.col1>0.5, 1, 0)

#

@stoic musk

stoic musk
#

Nice, would that make a new column in the df or replace the existing?

lapis sequoia
#

I reckon you can replace if you wish. Just give same column name.
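A minimal sketch of the thresholding, on a made-up column named `col1` as in the snippet above. It also shows a pure pandas alternative. The two differ only at exactly 0.5, where the rule stated in the question (`x < 0.5` gives 0, anything else gives 1) corresponds to `>=`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0.1, 0.5, 0.51, 0.9]})

# np.where(condition, value_if_true, value_if_false), applied to the whole column.
df["via_where"] = np.where(df["col1"] > 0.5, 1, 0)

# Same idea without NumPy: compare, then cast the booleans to integers.
df["via_astype"] = (df["col1"] >= 0.5).astype(int)

print(df)
```

Assigning to an existing column name overwrites that column, and assigning to a new name adds one.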

stoic musk
#

ohh, yes of course

#

thanks!

tender hearth
#

Why do you need to use the NumPy ufunc

lapis sequoia
#

You can use pd.Series.where too if I'm not mistaken. But anyways i personally would not mind np in this case since pd uses np too.

#

Tho they are bit different as the docs say. I'll check out the details when i get time.

velvet thorn
#

principle of least privilege

velvet thorn
#

(maybe because it’s what I’d do too 🥴)

zinc rock
#

Im training a sklearn.svm.SVC classifier on 12k samples

#

Is it normal to be really slow

#

Its like an hour

stoic musk
#

Do you have a lot of features...?

zinc rock
#

4 i think?

#

The dataset each entry has 4 items per row i guess

tender hearth
#

do you have an nvidia gpu

zinc rock
#

Yes 1050 laptop

#

I assume its running on cpu now

tender hearth
#

What library are you using

zinc rock
#

Sklearn

#

Svm.svc

tender hearth
#

Right, dumb question

zinc rock
#

Im running it in vscode and i cant tell

#

If its frozen or not

#

Maybe i should have done in pycharm but idk how to tell if its not frozen or not

tender hearth
#

Scikit has no GPU support

zinc rock
#

Damn.

#

Is there a way to tell its still running

#

I guess im looking at task manager and python is taking 50% of cpu

lapis sequoia
#

Scikit doesn't have gpu support🤦‍♀️

lapis sequoia
zinc rock
#

Oh i never used it before can you explain

lapis sequoia
#

The cpu may be better than ours and it has 13gb ram (i guess)

#

Basically cloud for you to run notebook. Gives nice gpu and cpu and ram. No load on your hardware. And you can use drive as the storage.

zinc rock
#

The model is saved to sav files

#

It should be fine?

#

I can extract it

lapis sequoia
#

(Assuming you have enough space of course )

zinc rock
#

Sorry for weird questions im just really anxious haha

#

Waiting for a model to train without knowing whats happening

lapis sequoia
#

It's alright.

zinc rock
#

This?

lapis sequoia
#

You can give it a try once you're done with doing on your pc. I like it and use in my daily life. Like eating and all.

lapis sequoia
zinc rock
#

Ill just upload my folder then

#

Of code and data

#

Should be able to execute immediately

lapis sequoia
#

Alright. You'll need to mount drive if you want to tho. I'm not on laptop otherwise I'd share a small demo notebook.

zinc rock
#

Itll only run for 12 hours max huh

#

Should be done before...

lapis sequoia
median berry
#

i just deployed some jax models. i have no idea what i deployed.

i really need to learn jax and quick,
i have played around on a beginner level on keras and pytorch,

zinc rock
#

AttributeError: 'SVC' object has no attribute 'transform', theres nothing about this issue

#

anyone know?

royal crest
#

?

zinc rock
#

yes

#

i finished training a model and trying to test it now

royal crest
#

would it help to tell you that SVC has no transform() method?

#

as the documentation suggests
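For context, a minimal sketch with made-up data: in scikit-learn, `transform` belongs to preprocessing steps such as scalers, while a fitted classifier like `SVC` is used through `predict`. Wrapping both in a pipeline lets a single `predict` call apply the preprocessing first. The pipeline layout here is an assumption, not the poster's code.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training data: two small, well separated groups.
X_train = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
y_train = [0, 0, 1, 1]

print(hasattr(SVC(), "transform"))             # False
print(hasattr(StandardScaler(), "transform"))  # True

# The pipeline scales the features, then hands them to the classifier.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print(model.predict([[0.1, 0.0], [5.0, 5.1]]))
```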

zinc rock
#

wha thte

zinc rock
royal crest
stoic musk
#

Getting a strange error again (thought I fixed it last night)

KeyError: 'DEATH_EVENT'

line in question is:
y = X_full.loc[:,'DEATH_EVENT']

#

where 'DEATH_EVENT' is a column in a pd dataframe

median berry
fluid pebble
#

hi

#

i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)

stoic musk
#

ValueError: Classification metrics can't handle a mix of binary and continuous targets

even though the dataframe (one column of binary predictions) is of type int64
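This error usually means that one of the two arguments passed to the metric still holds continuous values (for example raw probabilities), even if the other one is a clean 0/1 column. A minimal sketch with made-up numbers, assuming the continuous side is the predictions:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_scores = [0.2, 0.8, 0.6, 0.4]  # continuous scores, not class labels

try:
    accuracy_score(y_true, y_scores)
except ValueError as exc:
    print("ValueError:", exc)

# Threshold the scores into 0/1 labels before calling the metric.
y_pred = [1 if score >= 0.5 else 0 for score in y_scores]
print(accuracy_score(y_true, y_pred))  # 1.0
```

Checking the dtype and unique values of both arguments is a quick way to see which side is the continuous one.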

#

...?

flat crown
#

The x-axis is fine, but the y-axis is supposed to have the values in the array (up to 50,000 or so)

#

but the y-axis here is just going up to 10

#

when i change bins = it fixes but the axis are wrong

main fox
flat crown
#

how do i make it so the x-axis is from 0 to 200

#

and the y axis is just takes from the array i set

main fox
#

The y axis is just going to show you how the data falls within the bins you want

#

It won't be the data itself

flat crown
#

So what do I do

#

if I want the values of the bars to be what's in the array

#

i need to make the x-axis like this

#

its just the values in the bars are wrong and the y-axis is obviously wrong

#

if you look at the array they're in the thousands so

main fox
#

The x goes only to 200 because thats the upper bound you set for the bins

flat crown
#

that's fine

#

that's what i want it to be

#

the y-axis is the problem

main fox
#

Change the step for the bins

#

Try 3 instead of 5, see how it looks

flat crown
#

wat now

#

doesn't do anything

#

i need it to look something like this

#

but obviously y-axis will be larger

main fox
#

Change the step to 25

flat crown
#

step has nothing to do with this

main fox
#

What this tells you is that for the bounds you specified, all your values fall between 0 and 25, and 100 and 150

#

The y axis will never be the values in the array, because in a histogram, you count the frequency of that value appearing

flat crown
#

lemme explain wat i want

#

array= [0,0,0,0,0,0,0,0,146,4021,4323,13434,24089,36611,27023,31367,26800,11079,9285,23142,12757,7973,30389,20770,10426,17347,25305,34806,23654,14287,12107,2004,4256,106,3866,2247,2688,0,0]

#

so i have 39 values in here

#

i want each one to be in its own bin

#

and the x-axis to be from 0-200

#

how can I do that

main fox
#

And you're clear that by setting the x axis to be 200, you'll miss all the values that are greater than 200?

flat crown
#

wdym why would that be

#

the values in the array are for the y axis

#

each value in the array is a spacing of 5

#

so first value is 0-5

#

second is 5-10

#

etc

main fox
#

The x axis is the bins

The y axis is the frequency of numbers within the bin

flat crown
#

yea

stoic musk
#

ohhhh I understand. @main fox in the array, the first entry is a count of all values between 0-5

flat crown
#

yea

stoic musk
#

so array[0] -> X = 0-5, count = 0

#

It's not an array of individual values

#

@flat crown wouldn't it be easier to create a bar chart instead?

flat crown
#

i need histograms for this assignment

stoic musk
#

does the prompt require you to use np.array

flat crown
#

no

#

i just have to make this histogram

stoic musk
#

**np.arange

flat crown
#

nah that's just what i used for my other histograms

#

its not really assignment, its a research project

main fox
#

Since you're already using pandas
Convert the list into a series arr= pd.Series(array)

And then use arr.plot(kind='hist')
Pandas will handle the bins automatically

flat crown
#

almost there

#

just need the x-axis to be on the y

#

and the x-axis to be 0-200

main fox
#

That's not how a histogram works

flat crown
#

i need it to be like this one

#

wdym

lapis sequoia
flat crown
#

idk wat colab is

lapis sequoia
#

google colab

#

for ipynb

#

nvm

flat crown
#

jupyter notebook

lapis sequoia
#

yes

stoic musk
#

I think you should make a bar chart (which will also be a histogram), because your data is already binned and counted.

X axis should be (index of the array + 1) * 5

Y axis should be array[index]
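Since the array already holds one count per 5-wide bin, one way to still get a histogram-style plot is to pass the bin centers as data and the counts as weights. This is a sketch assuming matplotlib is available. The names `counts`, `edges` and `centers` are made up, and the `Agg` backend is chosen only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

counts = [0, 0, 0, 0, 0, 0, 0, 0, 146, 4021, 4323, 13434, 24089, 36611,
          27023, 31367, 26800, 11079, 9285, 23142, 12757, 7973, 30389,
          20770, 10426, 17347, 25305, 34806, 23654, 14287, 12107, 2004,
          4256, 106, 3866, 2247, 2688, 0, 0]
bin_width = 5

# 39 counts need 40 edges: 0, 5, 10, ..., 195.
edges = np.arange(0, bin_width * (len(counts) + 1), bin_width)
centers = edges[:-1] + bin_width / 2

fig, ax = plt.subplots()
# One data point per bin (its center), weighted by that bin's count.
heights, _, _ = ax.hist(centers, bins=edges, weights=counts)
ax.set_xlim(0, 200)
ax.set_xlabel("value")
ax.set_ylabel("count")
# fig.savefig("prebinned_histogram.png")  # uncomment to write the image
```

`ax.bar(edges[:-1], counts, width=bin_width, align="edge")` draws the same picture as a bar chart, which is the approach suggested above.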

lapis sequoia
#

precisely

flat crown
#

alright i can make a bar chart

#

how would I do that hmge

lapis sequoia
#

i use plotly express

#

lmao

#

anyways

stoic musk
#

excel lol

flat crown
stoic musk
#

I'm new to sklearn/pandas. But i know exactly what you're looking for and understand what the np array is communicating

flat crown
#

this looks kinda bad

stoic musk
#

Just change the axis formatting

#

how many bins do you want

flat crown
#

i can fix the axis

#

its the actual blue lines themselves

stoic musk
#

ok

flat crown
#

look bad

stoic musk
#

ohh. can't help with that unfortunately.

or you can run your other data in the same style

main fox
# flat crown

You said your output needs to be exactly like this?

flat crown
#

yes

main fox
#

They don't have any values that fall between 0 and 25

flat crown
#

my broski

main fox
#

Your data has 0's

stoic musk
#

@main fox , think of the np array like values in a dict

#

array[0] = 0 right?

flat crown
#

yea

stoic musk
#

What this means is:

the frequency of values between 0-5 is 0. So 0 is not in the dataset

main fox
#

So his output won't match the example

#

That's my point

stoic musk
#

nope, not with np.arange I don't think so

#

cool

stoic musk
# flat crown

So, formatting aside, this is what the graph/histogram should look like, right?

flat crown
#

yep

stoic musk
#

Cool. Good luck, gotta go to bed

lapis sequoia
#

guys, I have some data displayed in the terminal. like the one shown in the image. Do we have a way to put the entire displayed data into a file?

#

This is in visual code

uneven thistle
#

how to make the xaxis look better ? as you can see nothing is visible on the x axis Please help asap

#

and why is my figure size not changing?

fluid pebble
#

hi
i need to compare two images 'ideal image' and 'difference image' . i need to develop an algorithm which can highlight or create boxes around the identified differences. the differences which i want to get through the program is 'if there is a change of text in (difference image) as compared to (ideal image) ,or if there is a spot in (difference image) as compared to (ideal image) or there is a spelling mistake in (difference image) as compare to (ideal image)

median fulcrum
#

I never used a database like that (years as columns until 2019). How can I find the line or cell of this value that i'm trying to find? It just returns me a True and False dataframe
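A comparison such as `df == value` returns a True/False DataFrame of the same shape. To turn that into row and column labels, one option is `np.where` on the mask. The sketch below uses a made-up table with countries as the index and years as columns.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
df = pd.DataFrame(
    {"1950": [40.1, 55.3], "1951": [41.0, 56.2]},
    index=["A", "B"],
)
target = 56.2

mask = df == target          # the True/False DataFrame
rows, cols = np.where(mask)  # integer positions of the True cells
hits = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
print(hits)

# The full matching row(s) can be selected with the same mask.
print(df[mask.any(axis=1)])
```

Exact equality on floats only matches if the stored value is identical, so for measured data a tolerance (for example `np.isclose`) may be safer.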

indigo steppe
#

Hi there,i am a total beginner in ml and have a question that may make not much sense.I am reading about reinforcement learning right now.I understand that there is a punishment/reward system.So if the program searches for patterns (time series data for example),do i need to define what kind of patterns i want or does the program find its own patterns?Sorry i am not a good programmer and i try to understand the theory behind supervised,unsupervised and reinforcement learning.

royal crest
median fulcrum
#

0 and 1?

royal crest
#

!d pandas.DataFrame.T

arctic wedgeBOT
royal crest
#

since you weren't happy about years being in columns ...

#

though i don't know what kind of data you're working with

median fulcrum
#

I just want to find the line or cell of the value. But if I continue with this dataframe idk if it would be very difficult to make graphs, since maybe I will have to put 1950-2019

royal crest
#

indeed

#

you might want to try and restructure your data

#

and before you ask "how", think about what would make your life easier should something go on the index vs column

median fulcrum
#

if only it was possible to do something like:
plt.hist(x = 2:70, data=df)

#

:/

royal crest
#

though you can probably access all the columns with df.columns

#

minus the first two

median fulcrum
royal crest
#

What you are replying to was not in response to your message

#

It was just me trying to finish what I was saying from earlier

median fulcrum
#

but if you think of a column for year, maybe you would have to repeat the countries for each year

#

🤔

median fulcrum
#

I never transformed a database like that, and I don't think would be possible with this database since probably would have to repeat the countries to each year

royal crest
#

I think it's pretty simple, but you're the data scientist.

median fulcrum
#

but how would you structure this database?

countries year expectancy?

royal crest
#

i would transpose it

#

as suggested earlier

#

handy dandy

#

i would transpose it then manipulate it in such a way that I'm able to two things from this dataFrame

median fulcrum
#

I can't figure out how would transpose be helpful in that

royal crest
#

typically one would select year

median fulcrum
#

years but like that: 1950 1951

royal crest
#

so it's useful for us to take advantage of that fact, and put year as index

#

because right now, the indexes are not being used at all

#

also, the Code is not useful in this dataframe at all, so we can take that out and store it as dict values using entity as key

#

which leaves us countries as columns, years as indexes and life expectancy as the values
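The restructuring described above could be sketched like this. The column names `Entity` and `Code` come from the discussion, while the numbers and the value column name are made up. The long format at the end is the "countries year expectancy" layout asked about earlier, where repeating each country once per year is expected.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "Entity": ["Brazil", "Japan"],
    "Code": ["BRA", "JPN"],
    "1950": [50.0, 60.0],
    "1951": [50.5, 60.8],
})

# Keep the codes aside as a dict keyed by entity.
codes = dict(zip(wide["Entity"], wide["Code"]))

# Years as the index, countries as columns, life expectancy as values.
by_year = wide.drop(columns="Code").set_index("Entity").T
by_year.index = by_year.index.astype(int)
print(by_year.loc[1951, "Japan"])

# Long format: one row per (country, year) pair.
long = wide.melt(id_vars=["Entity", "Code"],
                 var_name="Year", value_name="Life expectancy")
print(long.shape)
```

With years on the index, `by_year["Japan"].plot()` gives a time series directly, and the long format suits libraries that expect one observation per row.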

median fulcrum
#

actually if we create a variable years with all the year columns that can be used as df[df['years'] == 10101] we don't need to transpose

#

since this is the major problem in the df

royal crest
#

Sure, sounds like you have a plan!

#

Good luck.

median fulcrum
#

sad

#

I'm in rage with the guy who made this dataset

#

😩

royal crest
#

Most of data science is just wrestling with data

median fulcrum
#

lol

#

I'm homesick with this databases:

#

clientid loan income default

royal crest
#

it does not get better cz_Unsure_smile

median bone
#

hi I need pandas help

royal crest
#

we know you do, that's why you are here 😄

median bone
#

here is "df_confirmed_local"

#

we need to make a function that

def case_no(report_date)
and return the case no. of the corresponding date
#

example:

#
case_no("04/02/2020")
#

output: 16

#

someone help me plsss

royal crest
#

so you're just counting the number of values in the "report date" column that matches your input

royal crest
#
return the case no. of the corresponding date
median bone
#

yup

#

but idk how to do it

royal crest
#

so do you need any other columns?

#

ah i guess you do

#

the last one

median bone
#

ah sorry I dont get 😅

royal crest
#

so in this case, you gotta match two things: "report date" == input, "Confirmed/probably" == "confirmed"

#

then return the len()

median bone
#

so

def case_no(report_date):
  report_date = df_confirmed_local["report date"] ?```
royal crest
#

actually

#

df_confirmed_local -> does this mean your dataframe only contains confirmed cases?

#

that makes life a bit easier

median bone
#

yes

#
def case_counts(report_date):
    for i in df_confirmed_local["Report date"].value_counts():
        if i == report_date:
            return i
    
case_counts("04/02/2020")

I tried this
#

but It doesnt work

#

cuz I dont know how to link up the two columns

royal crest
#

i don't think you need a for loop or an if statement

#

!e

import pandas as pd

def howmany(val):
    """return how many values in A match the input"""
    # make random dataframe
    df = pd.DataFrame({"A": [1, 0, 0, 1, 1], "B": [True, False, True, True, False], "C": ["Y", "Y", "Y", "Y", "Y"]})
    print(df)
 
    # return number of vals that match "A" == val
    return len(df[df["A"] == val]["A"])

print(howmany(1))

here's a really basic example

arctic wedgeBOT
#

@royal crest :white_check_mark: Your eval job has completed with return code 0.

001 |    A      B  C
002 | 0  1   True  Y
003 | 1  0  False  Y
004 | 2  0   True  Y
005 | 3  1   True  Y
006 | 4  1  False  Y
007 | 3
royal crest
#

line 7 is the output for howmany(1)

#

returning 3 since column "A" has three 1's

#

and in your case, that would be a bit like

#
def case_counts(report_date):
    return len(df_confirmed_local[df_confirmed_local["Report date"] == report_date]["Report date"])
    
case_counts("04/02/2020")
#

i'm sure there's always a better way, so feel free to chime in

median bone
#

k lemme try ty :>

lapis sequoia
#

Hey, i have a huge amount of csv files with some apparently empty columns. I want to check if they are actually empty in all files, and if that is the case i want to print their names.
To do that, i would like to know if there is a way to see if df[columnname] does not contain any values?

royal crest
#

!d pandas.DataFrame.isna

arctic wedgeBOT
#

DataFrame.isna()
Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or `numpy.NaN`, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings `''` or `numpy.inf` are not considered NA values (unless you set `pandas.options.mode.use_inf_as_na = True`).
royal crest
#

this could be useful

serene scaffold
median bone
serene scaffold
#

alternatively, you could do this once:

report_date_counts = df_confirmed_local['Report date'].value_counts()
#

and then any time you need a specific report date count, it's just a dict-like lookup to report_date_counts.
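
#

a minimal sketch of that lookup, on a made-up stand-in for df_confirmed_local (the real column name and date format are assumed from the messages above):

```python
import pandas as pd

# Toy stand-in for df_confirmed_local; the dates are made up.
df_confirmed_local = pd.DataFrame({
    "Report date": ["04/02/2020", "04/02/2020", "05/02/2020"],
})

# Count every date once, up front.
report_date_counts = df_confirmed_local["Report date"].value_counts()

def case_counts(report_date):
    # .get returns the default instead of raising KeyError for unseen dates
    return int(report_date_counts.get(report_date, 0))

print(case_counts("04/02/2020"))
print(case_counts("01/01/2019"))
```

the counting happens once, so repeated calls don't rescan the dataframe.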

royal crest
serene scaffold
#

well, Series

median fulcrum
royal crest
royal crest
median fulcrum
#

cell or line

royal crest
#

so if query is "Banana", you'd want a dataframe that looks like:

Name  Calories  Type
True  False    False

?

median bone
median fulcrum
royal crest
median bone
#

yup so If including the df_ stuffs am I correct?

royal crest
#

sounds good, also check out Stelercus' response

median fulcrum
desert oar
# median fulcrum

i think you need to provide a complete demonstration of input and your expected output. it's still not clear from these scattered examples what you want

lapis sequoia
#

took forever to run through all of the files, i have like 150000 csv files i need to process

#

big help

median fulcrum
desert oar
#

there is no context

median fulcrum
desert oar
#

i still truly have no idea what you mean, i'm sorry. the code database_df[database_df["Name"] == user_results] returns a dataframe, not a single value

#

do you want to get all rows where life expectancy is 18.907 in any column?

desert oar
#

this is why it's important to provide small examples that clearly demonstrate the behavior

#

you force people to guess and interrogate you

median fulcrum
#

but is clear, the second print is not a True and False database

desert oar
#

it's not clear, as shown by multiple people trying to help and giving up

#

also a "true and false database" is not the same as "all rows where a certain value is present in any column"

median fulcrum
#

🤔

#

df[df == something]

desert oar
#

it doesn't matter, if you don't show people what your data is, there's no way anyone can help

median fulcrum
#

man

#

I sent many screenshots

#

I'm really not sure how you could get confused by that

desert oar
#

screenshots are not enough

median fulcrum
#

ok sorry

desert oar
#

your life expectancy data looks like this?

data = pd.DataFrame([
  ['A', 1, 2, 3],
  ['B', 4, 5, 6],
  ['C', 2, 5, 3],
  ['D', 3, 9, 4],
], columns=['entity', '2007', '2008', '2009'])
median fulcrum
desert oar
#

ok. and can you explain what output you do want?

#

do not show the banana example, it's not helpful

upper granite
#

Hello guys.
Do you know how I can parse this string "20200915" to a date with pyspark?

median fulcrum
#

yes, like in the banana example, I want the cells or lines that contain the minimum value of the database, which is 18..

#

18.907

#

not a True and false dataframe

#

understand?

#

and I'm not sure if this .min().min() was the best way to do it, so if you have any suggestions for this too, you can tell me 🙂

desert oar
desert oar
#

however you need to exclude the non-numeric columns

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

min_all = data.loc[:, '2007':'2009'].min().min()
print(min_all)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

1
desert oar
#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

meta_columns = ['entity', 'code']
value_columns = [c for c in data.columns if c not in meta_columns]

min_all = data[value_columns].min().min()
print(min_all)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

1
desert oar
#

yes, that's fine

median fulcrum
desert oar
#

why 1:70? it looks like you want all columns

#

column numbers in python start at 0

median fulcrum
desert oar
#

it's not

#

you don't need to use .iloc at all if you want all columns

#

you can just write years_database.min().min()

lapis sequoia
#

does anyone have an idea how to make this perform better?

def find_non_empty(file):
    try:
        frame = pd.read_csv(file, sep=";")
        for name in frame.columns:
            if not name in listA:
                listA.append(name)
            if not name in listb:
                res = True
                i = 0
                for val in frame[name].isna():
                    i +=1
                    if val is False:
                        res = False

                        
                if res is False:
                    listb.append(name)
                        
    except UnicodeDecodeError:
        print("ooops")
        print(file)
    except:
        print ("whoops")
desert oar
#

so you want the row that contains the minimum year @median fulcrum ?

median fulcrum
#

with the country etc

lapis sequoia
#

i have been scanning files for 80 minutes

desert oar
#

@lapis sequoia are you just trying to look at the column names and ignore the data?

#

the bare except: is a really bad idea... at least do except Exception as e: print(e) so you can see what the error was

lapis sequoia
#

i want to know if certain columns are empty in all of my 150000 csv files

desert oar
#

i also am in the habit of putting the try around the smallest possible area of my code, in this case just the pd.read_csv should be inside try

lapis sequoia
#

not the whole frame, just some of its columns

#

that makes sense

#

i will adapt that

median fulcrum
desert oar
#

and yes, looping over the entire df is a bad idea. use frame.isnull().any() or something like that

median fulcrum
#

yeah

lapis sequoia
#

i assumed

#

:D

#

yes, ill change that. Thanks, i wasnt sure .isnull() on the column would work

desert oar
#

frame.isnull() returns a dataframe of booleans, frame.isnull().any() returns a series with 1 element for each column, True if any are null, False otherwise

#
>>> data.isnull().any()
entity    False
code      False
2007      False
2008      False
2009      False
dtype: bool
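#

for the "empty in every file" question, .all() is probably closer to what you want than .any(). a sketch under some assumptions: the two in-memory "files" below stand in for CSVs on disk, and empty_columns is a made-up helper name:

```python
import io
import pandas as pd

def empty_columns(frame):
    """Names of columns in which every value is NA."""
    all_na = frame.isna().all()      # one boolean per column
    return set(all_na[all_na].index)

# Two in-memory "files" standing in for CSVs on disk (contents made up).
files = [
    io.StringIO("a;b;c\n1;;\n2;;\n"),
    io.StringIO("a;b;c\n3;;x\n"),
]

always_empty = None
for f in files:
    try:
        frame = pd.read_csv(f, sep=";")   # only the read is inside try
    except UnicodeDecodeError as e:
        print("could not decode:", e)
        continue
    cols = empty_columns(frame)
    # keep only the columns that were empty in every file so far
    always_empty = cols if always_empty is None else always_empty & cols

print(always_empty)
```

note that a column missing from one file drops out of the intersection, which may or may not be what you want.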
lapis sequoia
#

that makes sense

#

i will do that :>

#

thank you for explaining

median fulcrum
#

🙂

desert oar
#

@median fulcrum

#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

meta_columns = ['entity', 'code']

idx_min = (
data
.drop(columns=meta_columns)
.min(axis='columns')
.idxmin()
)

print(data.loc[idx_min])
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | entity       A
002 | code      asdf
003 | 2007         1
004 | 2008         2
005 | 2009         3
006 | Name: 0, dtype: object
desert oar
#

that of course returns the row as a series

#

use data.loc[[idx_min]] to get a 1-row dataframe

median fulcrum
desert oar
#

be more specific

#

what are you asking?

median fulcrum
#

your dataframe has 5 columns, and you are trying to find the minimum of two; mine has 70 columns, so what can I put in columns=?

#

understand?

#

that's the problem :/

desert oar
#

note also that 1:70 selects the 2nd through 70th column. numbering starts at 0 and ranges exclude the upper bound. with 70 columns you would have to write 0:70 to get all of them, which is equivalent to 0:, which is equivalent to :
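
#

a quick way to check the slicing rule yourself, on a toy frame with made-up column names:

```python
import pandas as pd

# 5 toy columns named c0..c4
df = pd.DataFrame([[0, 1, 2, 3, 4]], columns=["c0", "c1", "c2", "c3", "c4"])

print(list(df.iloc[:, 1:4].columns))  # positions 1, 2, 3
print(list(df.iloc[:, 0:5].columns))  # all five columns
print(list(df.iloc[:, :].columns))    # same thing
```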

median fulcrum
#

my database has 70 columns

desert oar
# median fulcrum in drop

drop removes columns. so put the column names you want to remove. if there are no columns that you want to remove, don't use drop.

#

!d pandas.DataFrame.drop

arctic wedgeBOT
#

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.
median fulcrum
#

still the same as the other one

#

:/

#

just as an index

desert oar
#

do you have missing values (NaN) in the data?

median fulcrum
desert oar
#

did you remove Entity and Code like you were supposed to?

#

look at my example

#

i literally use the same column names...

desert oar
#

no, you didn't

#

i can see in your code that you didn't

median fulcrum
#

the database is like that

desert oar
#

result has columns Entity and Code which, according to your screenshots, do not contain numerical data

median fulcrum
#

Why do I need to remove them?

#

bruh

desert oar
#

well then your result is messed up

#

i don't know what you expected

#

stop and think for a second

#

you are showing me 2 dataframes

median fulcrum
#

oh

desert oar
#

in screenshots, no less

median fulcrum
#

I have to use

desert oar
#

please use code blocks

median fulcrum
#

the year

#

lol

desert oar
#

i don't know what that means either

median fulcrum
#

still......

desert oar
#

that looks right to me

#

what's the problem?

median fulcrum
#

it isn't showing the minimum row

desert oar
#

idx_min is the index of the row that contains the minimum value

median fulcrum
#

Is it showing the row of the minimum value?

#

line or cell?

#

no

desert oar
#

let's fix up the terminology here

#

row is the data in each line

median fulcrum
#

I want to show WHERE the minimum is located

#

WHERE

desert oar
#

idx_min is the index of the row that contains the minimum value

#

the index is the label of the row

#

in this case, your dataframe has the default index. so the label is the row number

median fulcrum
desert oar
#

show me an example using this dataframe:

data = pd.DataFrame([
  ['A', 'asdf', 1, 2, 3],
  ['B', 'zxcv', 4, 5, 6],
  ['C', 'qwer', 2, 5, 3],
  ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])
#

show me the output you want

median fulcrum
#

print like that

#

that has the minimum in that

#

or just the exclusive year that has

#

don't need to be everything

#

just know the country and the year of it

desert oar
#

i told you, data.loc[idx_min] is a Series of the data in the row that contains the minimum. if you want a 1-row DataFrame of the data in that row, use data.loc[[idx_min]].

median fulcrum
#

the minimum value I already have

desert oar
#

also, you need to slow down and read what other people are writing. i used idxmin, which is not the same as min

median fulcrum
#

thanks

#

😫

#

would it be possible to show just the year in which the minimum value was found?

desert oar
#

!e ```python
import pandas as pd

data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

data_num = data.drop(columns=['entity', 'code'])

minval = data_num.min().min()
minval_row = data_num.min(axis='columns').idxmin()
minval_col = data_num.min().idxmin()

print(minval, minval_row, minval_col)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

1 0 2007
median fulcrum
desert oar
#

be more specific

median fulcrum
#

I would like to print a dataframe with:

Country year column

Cambodia 1977

#

can have also the code whatever

desert oar
#

please start using code blocks

#

it's not fair to force people to read screenshots and re-type long variable names

#

did you create years_database from lifexpectancy_database?

#

if so, show how you did it

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar
#

do not post a screenshot

wicked grove
#

text target tokens
800000 love healthuandpets u guys r best 1 [love, healthuandpets, u, guys, r, best]
800001 im meeting one besties tonight cant wait girl... 1 [im, meeting, one, besties, tonight, cant, wai...
800002 darealsunisakim thanks twitter add sunisa got ... 1 [darealsunisakim, thanks, twitter, add, sunisa...
800003 sick really cheap hurts much eat real food plu... 1 [sick, really, cheap, hurts, much, eat, real, ...

desert oar
#

@wicked grove yes, every element of the tokens column is a list of strings

wicked grove
#

but the strings are not within ' '

desert oar
#

that's just how pandas displays it

wicked grove
#

ohh alright!!

desert oar
#

you can put a list of strings into the model if you use TfidfVectorizer(analyzer=lambda x: x)

wicked grove
desert oar
#

what is that?

wicked grove
#

can i send a picture of the output? i am unable to copy paste it correctly

desert oar
#

you should use a code block for posting data

#
paste output here
wicked grove
#

`

#
                                                    text  target tokens
800000                  love healthuandpets u guys r best       1      t
800001  im meeting one besties tonight cant wait  girl...       1      k
800002  darealsunisakim thanks twitter add sunisa got ...       1      t
800003  sick really cheap hurts much eat real food plu...       1      p
800004                      lovesbrooklyn effect everyone       1      e
arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lone nacelle until <t:1634832380:f> (9 minutes and 58 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

sick wedge
#

Hi got a really simple machine learning / pandas question but really struggling with this:
I just want to write a function that converts the 'source_type' column in my dataframe to a number depending on what string 'source_type' is

for example:

if item == 'AGB':
  item = 0
else:
  item = 1
#

I understand I should use df.apply() but I'm not sure how to write the function to make it work

#

please ping me!

desert oar
desert oar
#

you could of course use lambda instead of def

data['source_num'] = data['source_type'] \
    .apply(lambda source_type: 1 if source_type == 'AGB' else 0)
sick wedge
#

awesome thanks 👍

#

I'll go with 1st one, doesn't need to be elegant

desert oar
#

you could also use map which is kind of a neat trick:

from collections import defaultdict

source_type_mapping = defaultdict(int, {'AGB': 1})
data['source_num'] = data['source_type'].map(source_type_mapping)

homework: figure out how this works

sick wedge
#

how map works?

desert oar
#

all of it 😉

#

(hint: collections is part of the python standard library, you'll have to read about defaultdict there)
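
#

spoiler for the homework, skip this if you want to work it out yourself. a small sketch with made-up source_type values; it also shows the flipped mapping from the original question (AGB -> 0, everything else -> 1):

```python
from collections import defaultdict
import pandas as pd

data = pd.DataFrame({'source_type': ['AGB', 'SNR', 'AGB', 'NOVA']})  # made-up values

# int() returns 0, so any key missing from the dict maps to 0.
source_type_mapping = defaultdict(int, {'AGB': 1})
data['source_num'] = data['source_type'].map(source_type_mapping)

# The mapping asked for originally (AGB -> 0, anything else -> 1)
# needs a different default factory.
flipped = defaultdict(lambda: 1, {'AGB': 0})
data['source_num_flipped'] = data['source_type'].map(flipped)

print(data)
```

with a plain dict, values that are not keys would come out as NaN; the defaultdict supplies the fallback instead.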

sick wedge
#

okay will do 😄

desert oar
median fulcrum
#

years_database = lifexpectancy_database.iloc[:, 2:72]

#

with just the years

formal perch
#

Hey guys quick question pytorch or tensorflow?

serene scaffold
#

ah yes, the eternal question

desert oar
median fulcrum
desert oar
median fulcrum
desert oar
#
years_database = lifexpectancy_database.iloc[:, 2:72]

minval = years_database.min().min()
minval_row = years_database.min(axis='columns').idxmin()
minval_col = years_database.min().idxmin()

print(minval, minval_row, minval_col)
#

i am using the variable names from my previous example

desert oar
#

that was my fault, it wasn't clear

#

do you understand now? you can do lifexpectancy_database.at[minval_row, 'Entity'] for example, to get "Cambodia"

#

and minval_col will be the year
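
#

another way to get both labels in one go, sketched on the toy frame from earlier (not the real dataset): put the entity in the index, stack the year columns into one long Series, and ask for the label of the minimum.

```python
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009'])

# Index by entity, keep only the year columns, then flatten to one
# Series whose index is (entity, year) pairs.
long = data.set_index('entity').drop(columns='code').stack()

entity, year = long.idxmin()   # label of the smallest value
print(entity, year, long.min())
```

if the minimum occurs more than once, idxmin only reports the first occurrence.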

median fulcrum
#

or what am I missing?

desert oar
#

read my message above

median fulcrum
frigid elk
#

how do i create the code block in chat?

median fulcrum
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar
frigid elk
#

thanks

desert oar
frigid elk
#

can somebody tell me: in the code below, is the order of the original index guaranteed to be preserved?

pd.DataFrame({n: df.T[col].nlargest(5).index.tolist() for n, col in enumerate(df.T)}).T
median fulcrum
#

I think it's good idk

proven plinth
#

alright, so i have 2 excel sheets that i load into dataframes, the objective is to match rows from one dataframe onto another
currently im computing columns in both dataframes that match so i can join them together
am i correct in thinking that pandas.merge(df1, df2, on=cols, how="left") will return all of df1 and also any rows from df2 that match what im merging on cols?

desert oar
# median fulcrum oh

don't "chain" [ and .loc. use lifexpectancy_database.loc[[idx_min], [minval_col]]

desert oar
frigid elk
desert oar
proven plinth
#

alright, the objective is to use columns in df1 to match rows on df2, im not sure what kind of join i should use for something like this tbh, i guess inner? i want the set of rows from df1 that can be identified on df2

wicked grove
desert oar
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |    0  1  2
002 | 0  c  c  c
003 | 1  b  b  b
wicked grove
desert oar
desert oar
#

(you can get multiple matches from an inner join too! think about it)
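
#

a toy illustration of how duplicate keys multiply rows, with made-up frames that only borrow the Subject/Type column names:

```python
import pandas as pd

left = pd.DataFrame({
    "Subject": ["s1", "s1", "s2"],
    "Type": ["t", "t", "t"],
    "name": ["a", "b", "c"],
})
right = pd.DataFrame({
    "Subject": ["s1", "s1", "s1"],
    "Type": ["t", "t", "t"],
    "activity_id": [1, 2, 3],
})

# Each of the 2 left "s1" rows pairs with each of the 3 right "s1" rows;
# the unmatched "s2" row is kept once with NaN on the right side.
merged = pd.merge(left, right, on=["Subject", "Type"], how="left")
print(len(left), len(right), len(merged))

# How many rows repeat an earlier key pair on each side?
print(left.duplicated(["Subject", "Type"]).sum())
print(right.duplicated(["Subject", "Type"]).sum())
```

so a merge result can be larger than both inputs whenever the key pair is not unique. merge also takes a validate argument (for example validate="one_to_one") that raises instead of silently multiplying rows.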

wicked grove
# desert oar yes, that would help
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import word_tokenize
#tokenizer=word_tokenize()
#tokenizer=RegexpTokenizer(r' \w+ ')
#dataset['tokens']=dataset['text'].apply(tokenizer.tokenize)
dataset['tokens']=dataset['text'].apply(word_tokenize)
print(dataset.head())

st=nltk.PorterStemmer()
def stemming_on_text(data):
    text=[]
    for word in data:
        text=st.stem(word)
    return text
dataset['tokens']=dataset['text'].apply(stemming_on_text)
print(dataset.head())

print("lemmatized")
lm=nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
     text=[lm.lemmatize(word) for word in data]
     return text
dataset['text']=dataset['text'].apply(lemmatizer_on_text)
print(dataset['text'].tail())
proven plinth
#

goddamn joins are doing my head in

#

i shall be back shortly

#

thanks @desert oar

wicked grove
#

i removed regexptokenizer as it gave a weird output

#

and when i apply stem and lemmatizer it gives only individual characters

desert oar
# proven plinth goddamn joins are doing my head in

it helps to think of the inner join as a filter applied to the "cartesian product" aka the "cross join"

# cartesian product
result = []
for row_x in data_x:
    for row_y in data_y:
        result.append(row_x + row_y)

# inner join
result = []
for row_x in data_x:
    for row_y in data_y:
        if match(row_x, row_y):
            result.append(row_x + row_y)

then left, right, and outer joins are easy: you just append the rows that were not matched, filling in missing data with NULL

#

in fact, in most databases SELECT * FROM a, b WHERE <condition> is identical to SELECT * FROM a INNER JOIN b ON <condition>
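
#

the same idea as runnable python, as a sketch: rows are plain tuples, the key is assumed to be the first field, and match / inner_join / left_join are just illustrative names.

```python
# Rows are plain tuples; the key is the first field of each row.
data_x = [(1, "a"), (2, "b"), (3, "c")]
data_y = [(1, "X"), (1, "Y"), (3, "Z")]

def match(row_x, row_y):
    return row_x[0] == row_y[0]

def inner_join(xs, ys):
    # cartesian product, filtered by the match condition
    return [rx + ry for rx in xs for ry in ys if match(rx, ry)]

def left_join(xs, ys):
    result = []
    width_y = len(ys[0]) if ys else 0
    for rx in xs:
        matches = [rx + ry for ry in ys if match(rx, ry)]
        # unmatched left rows are kept, padded with None (NULL)
        result.extend(matches or [rx + (None,) * width_y])
    return result

print(inner_join(data_x, data_y))
print(left_join(data_x, data_y))
```

real databases and pandas use hashing or sorting rather than the nested loop, but the result set is the same.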

proven plinth
#

makes sense, i think im not computing common columns properly, i should be getting more rows with a left join
i'll revisit when i get home, time for a pint

desert oar
wicked grove
desert oar
#

it's like map() or a list comprehension, but for pandas

#

note that you can rewrite stemming_on_text as:

def stemming_on_text(data):
    return [st.stem(word) for word in data]
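#

and a sketch of why applying it to the text column gives single characters. toy_stem is a hypothetical stand-in for st.stem so this runs without nltk, and the tokens come from a plain whitespace split rather than word_tokenize:

```python
import pandas as pd

def toy_stem(word):
    # Hypothetical stand-in for st.stem: just strips a trailing "s".
    return word[:-1] if word.endswith("s") else word

def stemming_on_text(data):
    return [toy_stem(word) for word in data]

dataset = pd.DataFrame({"text": ["cats love dogs"]})
dataset["tokens"] = dataset["text"].str.split()

# Iterating over a string yields characters, one "word" per letter.
on_text = dataset["text"].apply(stemming_on_text)
# Iterating over a token list yields words, which is what we want.
on_tokens = dataset["tokens"].apply(stemming_on_text)

print(on_text[0][:4])
print(on_tokens[0])
```

so the stemmer and lemmatizer should be applied to dataset['tokens'], not dataset['text'].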
wicked grove
wicked grove
desert oar
wicked grove
#

Oh okayy

frigid elk
# desert oar !e does this also do what you want? ```python import pandas as pd data = pd.Dat...

!e thanks, but i think we were thinking of two different result sets...

import pandas as pd
data = pd.DataFrame([
  [1,2,3],
  [6,5,4],
  [7,9,8]
], columns = ['a','b','c'])

print("original")
print(data)
print('\ntop by column')
dataT = pd.DataFrame({n: data.T[col].nlargest(3).index.tolist() for n, col in enumerate(data.T)}).T
dataT.columns = ["top_1_column", "top_2_column", "top_3_column"]
print(dataT)
print('\nlambda:')
data.index = ['a','b','c']
print(data.apply(lambda y: y.nlargest(3).index))

btw: it does appear to keep original sort, thanks for the sounding board

arctic wedgeBOT
#

@frigid elk :white_check_mark: Your eval job has completed with return code 0.

001 | original
002 |    a  b  c
003 | 0  1  2  3
004 | 1  6  5  4
005 | 2  7  9  8
006 | 
007 | top by column
008 |   top_1_column top_2_column top_3_column
009 | 0            c            b            a
010 | 1            a            b            c
011 | 2            b            c            a
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/doqaqozaqi.txt?noredirect

desert oar
#

@frigid elk data.apply(lambda y: y.nlargest(2).index, axis=1)?

#

or data.T.apply(lambda y: y.nlargest(2).index)

frigid elk
wise pelican
#

For getting quantile / percentile data points with pandas, is there a specific use case for setting the interpolation to midpoint instead of the default linear method? Such as when doing something like this: pandas.Series(some_data).quantile(0.99, interpolation="midpoint")
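
#

For reference, a small sketch of what the two settings actually compute on made-up data. It only shows the numerical difference, not when one is preferable:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# q=0.4 falls between the 2nd and 3rd values (2 and 3).
linear = s.quantile(0.4)                              # default: interpolate by position
midpoint = s.quantile(0.4, interpolation="midpoint")  # always halfway between the neighbours

print(linear, midpoint)
```

With "linear" the result moves smoothly as q changes; with "midpoint" it jumps between the averages of neighbouring data points.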

median fulcrum
#

suggestions to better visualize the years in x?

median fulcrum
#

🤔

median fulcrum
#

or something like that...

shut trail
#

whats your audience and purpose ?

#

this is great for exploratory notes. i mean the colour isnt necessary but its purdy

median fulcrum
shut trail
#

mission accomplished

median fulcrum
#

thx

shut trail
#

maybe tie clearer colour to the value

median fulcrum
shut trail
#

ahhh effective

#

of course youll fix up your labels but the point is clear

median fulcrum
#

but if I could write that it is cambodia

#

would be nice

shut trail
#

remove ticks that have no meaning . but it really depends on your audience how much you want to give a crap

#

add a label for your outlier

#

label the x axis with mortality rate

#

what is the y axis ?

#

nvm reading your code

median fulcrum
#

the database is strange

shut trail
#

ya seaborn just jitters it nicely

#

the 1977 should be on the y-tick and the x axis should have rates

median fulcrum
#

I'm just wondering if it's possible in catplot to write text at a specific point

shut trail
#

okay... rabbit hole

#

for a single visualization are you sure you want to jump in ? are you trying to learn data visualization through python ?

#

but if you want fine control like that on a programmatic level, python is home for sure

#

matplotlib is under Seaborn , there you'll be able to pick it all apart
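
#

a minimal matplotlib-only sketch of labelling one point, with made-up numbers. with seaborn you would call annotate on the Axes that the plot hands back (for an unfaceted catplot that should be the .ax attribute of the returned grid, but treat that as an assumption and check the docs):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up points: (year, life expectancy)
years = [1975, 1976, 1977, 1978]
values = [35.0, 30.2, 18.9, 27.5]

fig, ax = plt.subplots()
ax.scatter(years, values)

# Label the lowest point; the text is just an example.
i = values.index(min(values))
note = ax.annotate(
    "Cambodia, 1977",
    xy=(years[i], values[i]),                # the point being labelled
    xytext=(years[i] + 0.3, values[i] + 5),  # where the text goes
    arrowprops={"arrowstyle": "->"},
)
fig.canvas.draw()  # render once; use fig.savefig(...) to write a file
```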

shut trail
median fulcrum
twin mantle
#

How can I plot a grouped-by monthly time series in seaborn or plotnine?

median fulcrum
austere ridge
twin mantle
silver sun
#

I need some advice. Has anyone created a machine learning model to detect malware or similar project?

arctic wedgeBOT
#

Hey @austere ridge!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

robust jungle
#

how can I use annotated images to train an opencv object detection model?

#

also, is it possible to implement transfer learning with it

arctic wedgeBOT
#

Hey @tight walrus!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

tight walrus
#
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from pandas.plotting import register_matplotlib_converters
from pathlib import Path
register_matplotlib_converters()

df = pd.read_csv('airline_passengers.csv', index_col = ['Month']) 
df.head()```
#
~\AppData\Local\Temp/ipykernel_15824/309454017.py in <module>
----> 1 df = pd.read_csv('airline_passengers.csv', index_col = ['Month'])
      2 df.head()

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:```
#
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~\Anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

~\Anaconda3\lib\site-packages\pandas\io\parsers\base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

~\Anaconda3\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    705                 encoding=ioargs.encoding,
    706                 errors=errors,
--> 707                 newline="",
    708             )
    709         else:

FileNotFoundError: [Errno 2] No such file or directory: 'airline_passengers.csv'```
#

does anyone know how to deal with this?
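
#

A first thing to check, as a sketch: the relative path is resolved against the notebook's working directory, so the error usually means the CSV is in a different folder. This only inspects paths and does not read the file:

```python
from pathlib import Path

csv_path = Path("airline_passengers.csv")

print("working directory:", Path.cwd())
print("looking for:", csv_path.resolve())

if csv_path.exists():
    print("found it, pd.read_csv(csv_path, ...) should work")
else:
    # List the CSV files that are actually in the working directory.
    print("not found; CSVs here:", sorted(p.name for p in Path.cwd().glob("*.csv")))
```

If the file turns up elsewhere, either move it next to the notebook or pass the full path to pd.read_csv.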

grave frost
stable kestrel
#

I had a question which plot is the best to use for this question

#

im using seaborn to plot

#

my guess is a barplot or histogram

#

but

#

it does not plot for shit

#

so if anyone has any idea how to use this

#

it would be appreciated ❤️

royal crest
#

yeah box is good

#

violin if you're feeling fancy

desert oar
#

violin with boxplot whiskers overlaid

calm thicket
#

doesn't a violin already have everything a boxplot has?

desert oar
#

not really, you lose the specific "points of interest" like the 25th and 75th percentiles

covert compass
#

can anyone help me with tesseract and opencv?
https://stackoverflow.com/questions/69656785/python-tesseract-unable-to-detect-two-lines
In the end this kinda works but the returned characters are not in order. E.g. it should be SCF6045P but I got 6045PSCF

text = pytesseract.image_to_string(ROI, lang='eng', config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 11 --oem 3')
timid rivet
#

y'all know any free websites to learn python?

#

im tryna learn python

#

and i learn from practicing

#

but all i got is a pdf

#

please lmk of any free websites if you know any

royal crest
#

the pinned messages in this channel have links to some resources

#

insofar as data science is concerned.

#

for learning python in general, this isn't the right channel.

timid rivet
#

oh okay

#

sorry

violet gull
#

what are some good starting projects?

royal crest
#

depends on your interests and intentions

next lance
#

How can I download cuDNN 8.1

next lance
royal crest
#

what OS are you on

coral kindle
#

Any practical application I could do with Pyspark?

#

I need to hone my skills there

gentle sleet
#

is this book up to date

#

i want to start learning ai i have read nnfs, and im searching for something new

mortal dove
#

I'm working on an ARIMA model, and I want to know how significantly something should have impacted the model to be considered an intervention.
In 2014 South Africa implemented new travel regulations for tourists, requiring specific documents if a child is not traveling with both their parents.
To me it looks like the underlying pattern changed sometime in 2014. If the change is indeed significant enough, is there any way to pick exactly which month the intervention happened in? Would it just be going back through news articles and finding out when exactly the changes were implemented/announced?

coral kindle
#

But the notions are the same.

coral kindle
#

So i'd suggest giving it a try and when you have examples with code, refer to sklearn's documentation to see what changed and what stayed the same

gentle sleet
#

im 18 years old, and i dont have any degree pending

#

i would like to take a look at AI

#

and do some stuff

coral kindle
#

Otherwise if you want to get straight to model construction, you can browse the examples on scikit-learn and launch them.

#

Are you familiar with Colab?

gentle sleet
#

Yep

#

i'm a junior backend developer currently XD

#

so i should know a bit

coral kindle
#

Ok so, to describe how scikit-learn works, you have models, and models must train using data.
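
The "models train on data" idea maps onto scikit-learn's `fit` / `predict` pattern. A minimal sketch, assuming the bundled iris dataset and a logistic regression as arbitrary stand-ins (neither was mentioned in the chat):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features X and labels y from a small built-in dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)  # the model
model.fit(X_train, y_train)                # training on data
predictions = model.predict(X_test)        # using the trained model
accuracy = model.score(X_test, y_test)     # mean accuracy on held-out rows
print(accuracy)
```

Other scikit-learn estimators follow the same `fit` then `predict` shape, so swapping the model usually means changing one line.
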

gentle sleet
#

Ok i know that

#

but i would like to move from backend to more Data engineering

#

AI/ML

coral kindle
#

So you want to create pipelines?

#

For data engineering you need to be familiar with big data I think

#

And with tools like pyspark

gentle sleet
#

Hmm, i just want to go into data world

#

and maybe after a while i'll choose which path i'll be going down

proven plinth
#

i have two dataframes

db_adjusted.columns=Index(['activity_dttm', 'subject_txt', 'action_type_nm', 'activity_id',
       'a.activity_html_txt', 'Start Time Final', 'Subject', 'Type'],
      dtype='object')
5137
cbs_adjusted.columns=Index(['Subject', 'Type', 'Calc Full Name'], dtype='object')
1537
Common Columns {'Type', 'Subject'}

i want to merge on the two common columns

```
merged = pandas.merge(
    cbs_adjusted,
    db_adjusted,
    on=["Subject", "Type"],
    how="left",
)
merged.count()
>>> [67313 rows x 4 columns]
```
how is this possible
shut trail
#

did you check that unique values for each are equal?

proven plinth
#

Subject on the first df

#

and the other

#

the columns have common values

shut trail
#

did you check ?

proven plinth
#

yes

shut trail
#

get unique values, sort, and evaluate equal..

#

okay

proven plinth
#

i need to find rows where both the subject and the type are the same

shut trail
#

you cleaned the data ?

proven plinth
#

yes

#

i stripped everything, i lowercased everything

#

idk what more i can do tbh to clean them up

#

pandas.merge(df1, df2, on=[col1, col2], how="inner") should get me the rows in the two dfs that have the same value in the two cols col1 and col2

#

but i get 10x more rows than what any of the two dfs have
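
A likely cause of this kind of row explosion, stated as an assumption about the data, is repeated `(Subject, Type)` pairs on both sides: a merge emits one row for every matching left/right combination, so 2 left rows and 3 right rows with the same key become 6 output rows. A minimal sketch with made-up frames (only the column names echo the ones above):

```python
import pandas as pd

# Tiny hypothetical frames; only the column names come from the chat.
left = pd.DataFrame({
    "Subject": ["a", "a", "b"],
    "Type": ["x", "x", "y"],
    "Calc Full Name": ["n1", "n2", "n3"],
})
right = pd.DataFrame({
    "Subject": ["a", "a", "a", "b"],
    "Type": ["x", "x", "x", "y"],
    "activity_id": [1, 2, 3, 4],
})

keys = ["Subject", "Type"]
merged = pd.merge(left, right, on=keys, how="left")
print(len(left), len(right), len(merged))  # 3 4 7

# Count rows whose key pair occurs more than once on each side.
dup_left = left.duplicated(subset=keys, keep=False).sum()
dup_right = right.duplicated(subset=keys, keep=False).sum()

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, on=keys, how="left", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    raised = True
```

If the duplicate counts on the real frames are non-zero, the next step is deciding which row per key to keep (or aggregating) before merging.
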

somber prism
#

can someone help me improve my anomaly detection model?

input: 10k toy product descriptions, 225 alcohol product descriptions

the problem here is to detect those 225 alcohol products as anomalies based on their descriptions; the toy and alcohol product descriptions will be in one dataset

so i'll tell you what i did first

s-1: i cleaned the product description text by removing punctuation and stop words, tokenizing, lemmatizing, etc. (basic cleaning process)
s-2: i converted the cleaned text to paragraph vectors using gensim doc2vec
s-3: then i fit an isolation forest on this vectorized dataset

i was able to get 30 true negatives and 597 false positives (true negatives are the anomalies). i could do this evaluation because i labeled the rows before mixing the alcohol product descriptions into the toy product descriptions. my goal here is to minimize type 1 and type 2 errors.
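
For comparison, here is a minimal sketch of the same pipeline shape with two deliberate swaps: TF-IDF features instead of doc2vec, and a tiny made-up corpus instead of the real descriptions. It shows `contamination` being set to the known anomaly share (for the data described above that would be roughly 225 / 10225, about 0.022) and the confusion matrix being read with anomalies as the positive class. It is a sketch under those assumptions, not a tuned solution.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

# Hypothetical mini corpus standing in for the product descriptions.
toys = [f"plastic toy car number {i} for kids" for i in range(40)]
alcohol = ["red wine bottle 750ml", "single malt whisky 12 years"]
texts = toys + alcohol
is_anomaly = [0] * len(toys) + [1] * len(alcohol)  # labels used only for scoring

X = TfidfVectorizer().fit_transform(texts)

# Tell the forest what share of rows is expected to be anomalous.
contamination = len(alcohol) / len(texts)
forest = IsolationForest(contamination=contamination, random_state=0)
pred = forest.fit_predict(X)           # -1 = flagged as anomaly, 1 = normal
flagged = (pred == -1).astype(int)

# Rows are true classes, columns are predictions, in the order [normal, anomaly].
cm = confusion_matrix(is_anomaly, flagged, labels=[0, 1])
print(cm)
```
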
median fulcrum
#

Why it's not staying in just one plot?

#
fig, ax = plt.subplots(ncols=2)
sns.catplot(data = america_dataset, ax=ax[0]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='America').set_ylabels('Life expectancy', weight='bold');
sns.catplot(data = asia_dataset, ax=ax[1]).set_xlabels('1950-2019', weight='bold').set(xticklabels=[], title='Asia').set_ylabels('Life expectancy', weight='bold');
#

sns.catplot doesn't allow me to do that?

shut trail
#

are you trying to do two plots side by side or ..

proven plinth
#

looking at the data doesnt really help me, i've been staring at them all day

#

theres something wrong im doing with pandas

shut trail
#

and ax=ax[1]

#

specifies which subplot to use

shut trail
#

ax[0].set_ylabel(stuff)

median fulcrum
#

I think it's because catplot, but I don't see any alternative to it

shut trail
#

copy, see if it works

median fulcrum
# shut trail copy, see if it works

what's the difference between your code and this:

fig, ax = plt.subplots(1,2)
sns.catplot(data = america_dataset, ax=ax[0])
sns.catplot(data = asia_dataset, ax=ax[1])
shut trail
#

you tell me lol

median fulcrum
#

there's no difference

#

bro

shut trail
#

im out

median fulcrum
#

it doesn't work with catplot, apparently

shut trail
#

its called Seaborn...

median fulcrum
#

yes, and with other types of plot probably will work

#

like countplot, I know that works

shut trail
#

I have faith you know what you're doing

median fulcrum
#
fig, ax = plt.subplots(1,2, figsize=(9,5));
sns.countplot(x = y_test, ax=ax[0]).set_title('Real results');
sns.countplot(x = predictions, ax=ax[1]).set_title('Algorithm predictions');
plt.ylabel(' ')
fig.show();
#

like that

shut trail
#

The reason i said specify your variables is that it forces you to make sure they make sense. I dont know your data so only you can know if those magical dataframes are in order

#

does countplot use subplots.. i remember problems with facetgrids long ago...

median fulcrum
#

I already separated the variables

shut trail
#

cause its not working! lol a joke

#

what do you want your vis to look like ?

#

Can you show the two you want working separately ?

shut trail
#

and how my friend are those cat plots ?

#

use a scatter ?

shut trail
#

youre not specifying your variables are you

median fulcrum
#

the year variables are columns

shut trail
#

catplot is a way of looking at categorical vars at the same time.. you just want two side by side scatter plots

#

build the arrays you need and feed them in

shut trail
#

you didnt show your code and i dont really care what function you used, im telling you what it is lol

#

those are points, thats the x and y axis lol

somber prism
#

can someone help me with anomaly detection 😐

shut trail
#

use a scatter

shut trail
#

build the array you need

#

thats why we use python for this

median fulcrum
#

I'm not so confident in making databases

#

That's why I pick already finished

somber prism
#

guys, is it possible to detect the anomalous products just from their descriptions for 2 random categories?