#data-science-and-ml | Python | Page 226

desert oar Jun 2, 2020, 1:57 PM

#

show the error message

supple minnow Jun 2, 2020, 1:58 PM

#

PS C:\Users\User\Desktop\Diplomski rad> & C:/Python/Python38/python.exe "c:/Users/User/Desktop/Diplomski rad/FuzzyPDC.py"
Traceback (most recent call last):
File "c:/Users/User/Desktop/Diplomski rad/FuzzyPDC.py", line 7, in <module>
class Zvonimirova:
File "c:/Users/User/Desktop/Diplomski rad/FuzzyPDC.py", line 244, in Zvonimirova
np.max(urg_activation60,np.max(urg_activation61,np.max(urg_activation62,np.max(urg_activation63,urg_activation64))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
File "<array_function internals>", line 5, in amax
File "C:\Python\Python38\lib\site-packages\numpy\core\fromnumeric.py", line 2667, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "C:\Python\Python38\lib\site-packages\numpy\core\fromnumeric.py", line 90, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
TypeError: only integer scalar arrays can be converted to a scalar index

desert oar Jun 2, 2020, 2:00 PM

#

i dont think np.max is the right function, is it?

#

can you link to the example in the docs? i see this one https://pythonhosted.org/scikit-fuzzy/auto_examples/plot_tipping_problem.html

supple minnow Jun 2, 2020, 2:03 PM

#

damn im so dumb

#

i didnt see that i wrote np.max instead of np.fmax.....sry for wasting your time and thank you very much
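
A minimal sketch of the difference behind the traceback above. The arrays are made-up stand-ins, since the real `urg_activation` values are not shown in the chat. `np.max` reduces a single array and takes `axis` as its second positional argument, while `np.fmax` compares two arrays element by element.

```python
import numpy as np

# Hypothetical activations; the real urg_activation arrays are not in the chat.
act_a = np.array([0.2, 0.7, 0.1])
act_b = np.array([0.5, 0.3, 0.1])
act_c = np.array([0.0, 0.9, 0.4])

# np.max reduces ONE array; its second positional argument is `axis`,
# so passing another array there raises a TypeError.
try:
    np.max(act_a, act_b)
    max_error = None
except TypeError as exc:
    max_error = exc

# np.fmax works element-wise on two arrays, so calls can be nested.
combined = np.fmax(act_a, np.fmax(act_b, act_c))
print(combined)
```

Nesting `np.fmax` calls keeps the result the same length as the inputs, which is what a chain of fuzzy "or" rules needs.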

#

now it's working perfect =)))

desert oar Jun 2, 2020, 2:05 PM

#

hah there you go

#

i was about to ask that

#

but i didnt want to assume you made a mistake, maybe there was some weird numpy magic happening

supple minnow Jun 2, 2020, 2:07 PM

#

I need to be more concentrated while tipping

#

thx for help

peak lava Jun 2, 2020, 3:41 PM

#

Hello there, I wanted to ask where and How do I start with ML from scratch ? Are there any Books, Articles or Videotutorials you can recommend ?

wise garden Jun 2, 2020, 3:45 PM

#

I've just started my Master's in Business Analytics and we're using Hands-On Machine Learning with Scikit-Learn & TensorFlow (that's the name of the book). If you already know Python this is a great book. If not, you might want to learn the basics while reading the book. In addition, you'll want to supplement this with online videos about the mathematics behind the algorithms. StatQuest is perfect for explaining concepts of the algos in an easy way.

#

At the very least, find an intro textbook and supplement it with free youtube vids online. You don't always need a textbook but I've always really liked them

peak lava Jun 2, 2020, 3:47 PM

#

Sounds good I am going to look into it. I appreciate the help thank you !

wise garden Jun 2, 2020, 3:47 PM

#

No problem, good luck!

desert oar Jun 2, 2020, 3:50 PM

#

a lot of people really like fast.ai if you already know python, but it's very specifically oriented to a few machine learning tasks and you'll want to follow it up with more generalist material

peak lava Jun 2, 2020, 3:56 PM

#

thanks!

lapis sequoia Jun 2, 2020, 5:21 PM

#

Is it okay if I ask for help here?

#

So I have a list of dictionaries under this variable called daily_engagement . This is how it looks: [OrderedDict([('acct', '0'), ('utc_date', datetime.datetime(2015, 1, 9, 0, 0)), ('num_courses_visited', 1), ('total_minutes_visited', 11.6793745), ('lessons_completed', 0), ('projects_completed', 0)]), OrderedDict([('acct', '0'), ('utc_date', datetime.datetime(2015, 1, 10, 0, 0)), ('num_courses_visited', 2), ('total_minutes_visited', 37.2848873333), ('lessons_completed', 0), ('projects_completed', 0)]), OrderedDict([('acct', '0'), ('utc_date', datetime.datetime(2015, 1, 11, 0, 0)), ('num_courses_visited', 2), ('total_minutes_visited', 53.6337463333), ('lessons_completed', 0), ('projects_completed', 0)]),....ect

#

I want to change the key of acct to account_key but I am having trouble brainstorming how to do this.

#

I started out with a for loop to iterate over the list for engagement_records in daily_engagement: daily_engagement but I am stuck on how I should access only the acct keys in all these list and what method i should use to replace acct with account_key. should i use the pop method?

flat quest Jun 2, 2020, 5:28 PM

#

yea fast ai and all these online courses are pretty good.
But to really become a solid data scientist you're gonna have to read papers/articles on new architectures and keep experimenting through kaggle competitions, datasets you scraped, or public datasets made available by universities and other organizations.

#

yeah pop method should suffice -> do engagement_records['account_key'] = engagement_records.pop('acct')
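
A minimal sketch of the rename over the whole list, assuming records shaped like the ones pasted above (the sample values here are shortened). `dict.pop` takes the key as its argument, removes it, and returns the value.

```python
from collections import OrderedDict
from datetime import datetime

# Two sample records in the shape shown above (fields shortened).
daily_engagement = [
    OrderedDict([("acct", "0"), ("utc_date", datetime(2015, 1, 9)), ("num_courses_visited", 1)]),
    OrderedDict([("acct", "0"), ("utc_date", datetime(2015, 1, 10)), ("num_courses_visited", 2)]),
]

for record in daily_engagement:
    # pop("acct") removes the old key and returns its value,
    # which is then stored under the new key.
    record["account_key"] = record.pop("acct")

print(daily_engagement[0])
```

Note that the new key lands at the end of each OrderedDict, since it is inserted after the old one is removed.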

candid vault Jun 2, 2020, 5:32 PM

#

I am trying to understand the architecture of a "Vanilla" recurrent neural network , but failing again and again . Can't seem to wrap my head around it

#

📎 1_aIT6tmnk3qHpStkOX3gGcQ.png

flat quest Jun 2, 2020, 5:33 PM

#

which part do you not get?

candid vault Jun 2, 2020, 5:37 PM

#

As far as I understand h0, h1, h2, ... are the weight matrices of hidden layers. But if we have multiple hidden layers in a single "feed-forward network" then we will have multiple hidden weight matrices for a single network. Is my reasoning ok? Can "h1" be multiple matrices?

flat quest Jun 2, 2020, 5:42 PM

#

well
not entirely

#

an RNN isn't a traditional feed-forward network

solid mantle Jun 2, 2020, 5:42 PM

#

Hiii

flat quest Jun 2, 2020, 5:42 PM

#

its recurrent since it passes a state vector between each RNN cell. H0, h1, etc are those state vectors

#

these are analogous to short term memory

solid mantle Jun 2, 2020, 5:43 PM

#

I needed some support in bayesian analysis. Is anyone familiar with it?

#

also sorry for the interruption

flat quest Jun 2, 2020, 5:44 PM

#

lol np
i haven't specifically worked with bayesian models so prob can't help you there.
are u just starting out with using bayesian analysis?

solid mantle Jun 2, 2020, 5:44 PM

#

yes

#

kind of

#

im familiar with the base concepts, posterior = prior*likelihood. But I was reading a paper, and came across 'cutoff scheme' used in place of likelihood

candid vault Jun 2, 2020, 5:45 PM

#

What is meant by "state vector" exactly ? Some kind of matrix that holds current state (values) of the perceptrons ?

solid mantle Jun 2, 2020, 5:45 PM

#

I dont understand what they mean by it, and i cant find any definition or explanation on it


flat quest Jun 2, 2020, 5:46 PM

#

do you know the bayesian probability equation?
its sorta mathy but not too difficult

solid mantle Jun 2, 2020, 5:46 PM

#

yes, i do

flat quest Jun 2, 2020, 5:47 PM

#

well @candid vault
state vector is a matrix that is a representation of the memory. Perceptrons themselves don't technically have values. The state vector is really the same as the output of the perceptron for simple RNN (i.e. short term memory -> we are remembering the output of the last few sequences and using that to output a new value)

#

hmm i haven't taken a look into cutoff scheme exactly, have u taken a look to see if there's any medium articles? They seem to clear some confusing concepts well sometimes.

solid mantle Jun 2, 2020, 5:48 PM

#

Yes, i have looked everywhere.

#

Looks like I have to ask the author then. It's embarrassing though haha

#

I will be sure to write a medium article on it if its worth it

flat quest Jun 2, 2020, 5:50 PM

#

lol xd
no thats perfectly fine, a lot of the times especially in more complex ideas in ML -> we have to go back to the source of the idea or information. The authors are actually really helpful sometimes.

yea for sure send me the link if u ever make one

solid mantle Jun 2, 2020, 5:50 PM

#

for sure

candid vault Jun 2, 2020, 5:56 PM

#

I can't find any image on the internet that shows the feed-forward networks in a Vanilla RNN . All of them abstract away the networks and replace them with nodes . I am unable visualise whats happening inside those nodes and how it's affecting the output from the nodes

#

I tried to create some image myself in adobe to help me understand

#

📎 Screenshot_2020-06-02_at_11.33.56_PM.png

flat quest Jun 2, 2020, 6:00 PM

#

do you know what the red X1, Green X2, and blue X3 are?

candid vault Jun 2, 2020, 6:04 PM

#

No , I thought they are separate inputs for these separate feed-forward networks . But after going through some articles I think . It's actually a "single" feed forward network . If we take snapshot of the recurrent network , for example at "t" , "t+1" ,"t+2" ......... and so on , we will get something like this .

#

📎 A-recurrent-neural-network-and-the-unfolding-architecture-U-V-and-W-are-the-weights-of.png

indigo fractal Jun 2, 2020, 6:07 PM

#

anyone wanna make me a quick i page 2 link website for 50 usd

desert oar Jun 2, 2020, 6:08 PM

#

pretty sure that's not allowed here

indigo fractal Jun 2, 2020, 6:08 PM

#

really?

#

i just joined

#

i needed somone

arctic wedgeBOT Jun 2, 2020, 7:14 PM

#

Hey @heady vigil!

It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg.

Feel free to ask in #community-meta if you think this is a mistake.

heady vigil Jun 2, 2020, 7:20 PM

#

I'd like to know how I can get the content inside this pdf in a 100% accurate way, to use in a database or a spreadsheet format. It has to be 100% precise because it has to do with taxes, portfolio, accounting and stuff like this. I've tried some OCR but the results weren't 100% accurate. There must be a way of copying and organizing the strings into a database and then exporting them to a spreadsheet so users can view them. I'm struggling to extract the information. The pdf is not in image format, it's text strings. If someone has questions, I'd be happy to DM the pdf.

📎 unknown.png

#

My bad, I dunno if im the correct channel. Im new btw

flat quest Jun 2, 2020, 7:53 PM

#

its not really a feed forward network in the traditional sense. @candid vault

its horizontal data transfer, you're not propagating data through layers. However, you are right that in a unidirectional RNN, all data flows from t=0 to t=n.

To get a better understanding, here's another image

📎 Screen_Shot_2020-06-02_at_12.56.03_PM.png

#

basically each X is a timestep. And in each timestep are a batch of elements.

#

the memory cell itself contains n units. Each one applying a weight on each element in the batch.

since there are usually more than one unit applying weights to each element in the batch, we get multiple outputs for each batch element.
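
A rough NumPy sketch of the description above: the same weights are reused at every timestep, and the state `h` is what gets handed from one step to the next. All sizes and names (`W_x`, `W_h`, `n_units`, and so on) are illustrative assumptions, not taken from the attached images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the images in the chat).
timesteps, batch, n_features, n_units = 3, 4, 5, 6

# One set of weights, shared by every timestep.
W_x = rng.normal(size=(n_features, n_units))  # input -> units
W_h = rng.normal(size=(n_units, n_units))     # previous state -> units
b = np.zeros(n_units)

x = rng.normal(size=(timesteps, batch, n_features))  # X1, X2, X3 stacked
h = np.zeros((batch, n_units))                        # h0: initial state

states = []
for t in range(timesteps):
    # The new state depends on the current input AND the previous state.
    h = np.tanh(x[t] @ W_x + h @ W_h + b)
    states.append(h)

print([s.shape for s in states])
```

Each state has one row per batch element and one column per unit, which matches "multiple outputs for each batch element".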

tawdry fox Jun 2, 2020, 8:17 PM

#

I just did a Coursera course in ML using tensor Flow will that help me in my career for higher studies and job . I got a certificate

lapis sequoia Jun 2, 2020, 8:26 PM

#

@tawdry fox if you start doing stuff like kaggle or your own projects, most definitely yes.

#

you have a head start now into the stuff. start applying it now.

uncut shadow Jun 2, 2020, 8:45 PM

#

Yeah, well in most cases certificates from online courses don't actually matter a lot. The best thing is to have real projects (for example on github, gitlab or anywhere else) as they actually show what you can do and what you came up with

rain palm Jun 2, 2020, 9:02 PM

#

@heady vigil Also check out PDFMiner - it gives you granular control of every element in the PDF.

flat quest Jun 2, 2020, 10:28 PM

#

@tawdry fox yeah courses/certificates don't mean much

Even kaggle, while nice, doesn't really show how you would program in a production data science environment. For that you're gonna need personal github projects or ones using google cloud, azure, or amazon's data storage / ai tools.

But yeah like @lapis sequoia said, you definitely still have a head start. There's still significantly more ai jobs than devs, so as long as you develop your abilities in the field now you should be fine

muted relic Jun 3, 2020, 12:50 AM

#

yo i'm new to all this and have a quick question about dictionaries

marsh berry Jun 3, 2020, 12:52 AM

#

@muted relic go ahead

muted relic Jun 3, 2020, 12:53 AM

#

In this chunk of code I wrote, I understand why the value which corresponds to the Key is being added into the dictionary, but it is unclear to me what section of the code is updating the actual Key itself.
https://gyazo.com/1dc440a6cf4aaa94df98a778224ce058

#

Like, it doesn't seem that there is an explicit instruction to update the key of dictionary anywhere in this code but there must be

#

else it wouldn't return what I'm expecting it to return.

marsh berry Jun 3, 2020, 12:55 AM

#

You create the key in the line

elif name not in reviews_max:
   reviews_max[name] = value #this creates the key name in the reviews_max dictionary and sets it to value. 
#Your dict now looks like {name:value}

#

When you set a value in a dict using item access if the key doesn't already exist it will create it.
Technically, looking at that code, the if statement is redundant, as you aren't trying to get a value from a non-existing key in the dict

muted relic Jun 3, 2020, 1:03 AM

#

What's confusing me is that I thought reviews_max[name] = review is updating the value which corresponds to the key

#

Not the key

#

And if there is no value which corresponds to the key, which no line of the code seems to create a key explicitly, how can we update those values.

#

"When you set a value in a dict using item access if the key doesn't already exist it will create it." This seems to be the answer i am looking for

#

And I guess that it assigns the correct value to the key because in each iteration of the loop, it is checking the same row # just different columns

#

Right?

marsh berry Jun 3, 2020, 1:05 AM

#

reviews_max = {'hello' :'world'}
Hello_var = reviews_max['hello'] # this succeeds, Hello_var == 'world'
Goodbye_var = reviews_max['goodbye'] # this fails with a KeyError exception because 'goodbye' does not exist in the dict
reviews_max['goodbye'] = 'BOO' #this succeeds, goodbye is now in the dict

Goodbye_var = reviews_max['goodbye'] # this now succeeds, Goodbye_var == BOO

#

Yeah so row is always different each iteration, so name is set to that specific row. Hence why it changes.
That's then used to add the key into the dict, and so that then succeeds
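
A guess at what the loop in the screenshot roughly looks like, to show both branches side by side. The rows and the column layout are hypothetical; only `reviews_max`, `name` and the `elif name not in reviews_max` line come from the chat.

```python
# Hypothetical rows of (app name, review count); the real data is only in the screenshot.
rows = [
    ["Maps", "120"],
    ["Chat", "45"],
    ["Maps", "300"],
    ["Chat", "30"],
]

reviews_max = {}
for row in rows:
    name = row[0]
    n_reviews = float(row[1])
    if name in reviews_max and reviews_max[name] < n_reviews:
        # Key already exists: overwrite its value with the larger count.
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        # Key does not exist yet: this assignment creates it.
        reviews_max[name] = n_reviews

print(reviews_max)
```

The same assignment syntax does both jobs: it creates the key the first time a name shows up and updates the value afterwards.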

#

Anyway I'm off to bed, pm me if you need this cleared up and I'll reply in the morning

muted relic Jun 3, 2020, 1:07 AM

#

This response was super helpful, thanks.

autumn galleon Jun 3, 2020, 2:25 AM

#

ok i think this is the right channel since it is a question for nltk,

#

Hi all, I am currently trying to do a project with NLTK and flask and its goal is phishing email detection. At this point I am able to scan an email, however it takes 3 seconds per email due to loading of the nltk model file. I would like to have the time reduced. My understanding is that with flask I am unable to have some sort of global variable that holds the models. Do you suggest any other ways to do it?
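
Nobody answered this in the channel. One common pattern, sketched here with only the standard library, is to load the model once per process and reuse the cached object on later calls. `get_model`, `scan_email` and the placeholder model are hypothetical names; the real loading step (for example unpickling the trained classifier) would go where the placeholder is.

```python
import functools

load_count = 0  # only here to show how often the slow step runs


@functools.lru_cache(maxsize=None)
def get_model():
    """Load the model on first use and reuse the cached object afterwards."""
    global load_count
    load_count += 1
    # Placeholder for the slow step, e.g. unpickling the trained classifier.
    return {"name": "phishing-classifier"}


def scan_email(text):
    model = get_model()  # fast after the first call in this process
    return model["name"], len(text)


scan_email("first mail")
scan_email("second mail")
print(load_count)
```

With several worker processes, each process would still load its own copy once, so this is a sketch of the idea rather than a full deployment setup.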

sharp leaf Jun 3, 2020, 3:00 AM

#

is machine learning covered in here or no

dusty depot Jun 3, 2020, 3:17 AM

#

yup @sharp leaf

sharp leaf Jun 3, 2020, 3:18 AM

#

i am trying to create a machine learning program. the problem is that i am not sure how i am supposed to give it data, store it and be able to use it. does anyone have any recommendations for modules or anything

desert oar Jun 3, 2020, 3:23 AM

#

thats incredibly vague

sharp leaf Jun 3, 2020, 3:23 AM

#

how can i specify

desert oar Jun 3, 2020, 3:25 AM

#

what are you actually trying to accomplish

#

what libraries are you using

#

what kind of data are you using

sharp leaf Jun 3, 2020, 3:25 AM

#

none thats what i need

desert oar Jun 3, 2020, 3:25 AM

#

do you even understand how machine learning works

#

or what it does

sharp leaf Jun 3, 2020, 3:25 AM

#

yes

desert oar Jun 3, 2020, 3:25 AM

#

i dont want to be a gatekeeper here, but the way you phrased your question sounds like youre missing fundamental information

#

but i also want to give you the benefit of the doubt

#

so can you explain what you are trying to do more precisely

sharp leaf Jun 3, 2020, 3:26 AM

#

whats a library i should use for it

desert oar Jun 3, 2020, 3:27 AM

#

usually in python the go-to machine learning library is scikit-learn

#

to implement your own algorithms, there are some optimizers in scipy

sharp leaf Jun 3, 2020, 3:27 AM

#

ok thank you

desert oar Jun 3, 2020, 3:27 AM

#

numpy is for array/matrix manipulation, pandas is for data frames

#

matplotlib for plotting

#

and obviously tensorflow and pytorch for neural networks

sharp leaf Jun 3, 2020, 3:28 AM

#

yeah

#

ok

desert oar Jun 3, 2020, 3:28 AM

#

pymc3, pyro, edward, pystan for bayesian...

sharp leaf Jun 3, 2020, 3:28 AM

#

thanks

flat quest Jun 3, 2020, 6:30 AM

#

Yeah if ur new go with scikit there’s enough architectures in there that you’ll probably never use some of them

jade chasm Jun 3, 2020, 9:15 AM

#

Hey guys, I have 2 pd.Series with identical indexes, and I suspect the normalized values of both to be quite similar (however, they are not normalized yet). What would be an intuitive way to visualize or 'prove' this?

#

so e.g. I suspect value A:2 B:5 in series 1, and A:4 B:10 in series 2, being the same 'ratio'. But I'm not sure how to show this.
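
This question went unanswered in the channel. One simple check, sketched with the numbers from the example above, is to divide each series by its own sum and compare the resulting shares. Normalizing by the sum is an assumption; any other normalization the asker has in mind would slot in the same way.

```python
import numpy as np
import pandas as pd

s1 = pd.Series({"A": 2, "B": 5})
s2 = pd.Series({"A": 4, "B": 10})

# Normalize each series so its values sum to 1; equal ratios give equal shares.
share1 = s1 / s1.sum()
share2 = s2 / s2.sum()

same_ratio = np.allclose(share1, share2)
# Largest absolute gap between the shares, useful when the match is only approximate.
max_gap = (share1 - share2).abs().max()

print(same_ratio, max_gap)
```

For a visual check, a scatter plot of one series against the other should put the points close to a straight line through the origin when the ratios match.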

random arch Jun 3, 2020, 9:27 AM

#

Hi all, I'm new to this community, not sure if this is the right place to ask this question - do let me know if not. I want to understand how to use python multiprocessing with a large dataframe (from a csv that doesn't fit in my ram) - please see my approach and can anyone point me to the mistake I'm doing here?

arctic wedgeBOT Jun 3, 2020, 9:30 AM

#

Hey @random arch!

It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg.

Feel free to ask in #community-meta if you think this is a mistake.

random arch Jun 3, 2020, 9:32 AM

#

this works

📎 unknown.png

#

this doesn't work. what's wrong here?

📎 unknown.png

lapis sequoia Jun 3, 2020, 11:05 AM

#

@random arch wild guess but jupyter notebook might not want to co-operate with threads.

random arch Jun 3, 2020, 11:25 AM

#

@lapis sequoia Thank you for the suggestion, but it didn't work as a script either.

lapis sequoia Jun 3, 2020, 11:48 AM

#

@random arch i don't know what you're trying to do

#

print(list(map(np.shape, df)), list(map(np.shape, df.values)), np.shape(df))

#

but these all give different results

#

also ```py
import concurrent.futures as fut
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,2))

print(df)

def f(i):
    print(f"piece {i}\n")
    return np.sum(i, axis=0)

with fut.ThreadPoolExecutor() as executor:
    print(list(executor.map(f, df.values)))
```

#

run that to understand

real wigeon Jun 3, 2020, 11:56 AM

#

For those of you who used dash/plotly:

Is it possible to number crunch using pandas, and then list the output using dash?

random arch Jun 3, 2020, 12:03 PM

#

@real wigeon yes.

#

@lapis sequoia I'm doing a very simple operation here. Get the shape of the dataframe in question (from a list of dataframes from the pandas chunking) - sequentially first, then using multiprocessing. Also, I intended to have a ProcessPoolExecutor there (was just trying to see if ThreadPoolExecutor works instead), and it also didn't work.

lapis sequoia Jun 3, 2020, 12:06 PM

#

@random arch why are you trying to get the shape of the dataframe in parallel? the solution is to run it on the values, not the dataframe itself since that only catches the column index.

random arch Jun 3, 2020, 12:07 PM

#

@lapis sequoia Actually, its a dummy operation (getting the shape) to try and see how best I can parallelize code

lapis sequoia Jun 3, 2020, 12:08 PM

#

@random arch copypaste the code i just wrote

random arch Jun 3, 2020, 12:08 PM

#

I already have an entire implementation written in the normal (Sequential) manner. so the idea is if I can get this to work, I can extrapolate it to my current solution.

lapis sequoia Jun 3, 2020, 12:08 PM

#

@random arch the problem you had was that when you run df through the map, it doesn't actually map the function on the dataframe

random arch Jun 3, 2020, 12:09 PM

#

oh is it?

lapis sequoia Jun 3, 2020, 12:09 PM

#

try it...

#

it maps the function on the columns

#

the column names
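
A small sketch of the point being made here, using a made-up three-row frame: iterating over a DataFrame gives its column labels, while iterating over `df.values` gives one array per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# Iterating a DataFrame yields its column labels, not rows or sub-frames.
labels = list(df)

# Iterating df.values yields one NumPy array per row.
row_shapes = [row.shape for row in df.values]

print(labels, row_shapes)
```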

random arch Jun 3, 2020, 12:11 PM

#

trying it out, thanks a lot!

#

I understand your code - you're sending in an array (for each row), and summing it up.

#

I thought the df was getting passed because of this result:

#

import pandas as pd
import pprint
import concurrent.futures as fut

iterator = pd.read_csv("2019-Oct.csv", chunksize=1000000, low_memory=False)
def df_shape(df):
    return df.shape
shapes_1 = map(df_shape, list(iterator))
print(list(shapes_1))

#

[(1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (1000000, 9),
 (448764, 9)]

#

this didn't work when I used the previous parallel version of map, i.e.,

#

with fut.ThreadPoolExecutor() as executor:
    shapes_2 = executor.map(df_shape, list(iterator))
    print(list(shapes_2))

#

doesn't run.

#

pardon me if I'm missing something obvious here.
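
One possible explanation, assuming both snippets ran in the same session (the chat does not confirm this): `list(iterator)` in the sequential version already consumed the chunk reader, so the later `list(iterator)` has nothing left to pass to the executor. The sketch below builds the chunk list once and reuses it, with a small in-memory CSV standing in for "2019-Oct.csv".

```python
import concurrent.futures as fut
import io

import pandas as pd

# Small in-memory stand-in for "2019-Oct.csv" (hypothetical data).
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))


def df_shape(df):
    return df.shape


# Materialize the chunks once; a chunk iterator can only be consumed a single time.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=4))

shapes_seq = list(map(df_shape, chunks))

with fut.ThreadPoolExecutor() as executor:
    shapes_par = list(executor.map(df_shape, chunks))

print(shapes_seq, shapes_par)
```

For a file that does not fit in memory, holding every chunk in a list defeats the purpose, so creating a fresh reader for the parallel run would be the closer equivalent.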

solar phoenix Jun 3, 2020, 12:35 PM

#

does anyone have experience with holoviews for plotting data? I have a dataframe and I have grouped by one of the columns and plotted the others against each other following this example: http://holoviews.org/gallery/demos/bokeh/iris_splom_example.html

#

Is there a way i can modify this code to change the colours of the plotted groups?

paper niche Jun 3, 2020, 12:45 PM

#

try color=hv.Cycle(['red', 'green', 'blue']) in the opts.Scatter

solar phoenix Jun 3, 2020, 12:47 PM

#

awesome- thanks it worked

solar phoenix Jun 3, 2020, 1:03 PM

#

@paper niche Do you know if i can cycle alpha values too, I have been trying with little success?

#

@paper niche don't worry figured it out

real wigeon Jun 3, 2020, 1:34 PM

#

Is a Tableau certificate worth it?

#

Looks like after the first 3 months you have to sign up for a subscription plan to use the software/learning material

noble gale Jun 3, 2020, 1:54 PM

#

How do I install Notebook?

uncut shadow Jun 3, 2020, 2:03 PM

#

wdym?

noble gale Jun 3, 2020, 2:14 PM

#

Jupyter Notebook How do I use it?

#

I already did Pip install

#

@uncut shadow

random arch Jun 3, 2020, 2:31 PM

#

@noble gale type in jupyter notebook in the terminal

noble gale Jun 3, 2020, 3:23 PM

#

Ok

#

@random arch https://gyazo.com/c52d8ac9bdad77cc340108a18e222469 umm?

random arch Jun 3, 2020, 4:20 PM

#

go to the "Files" tab and create a new notebook.

noble gale Jun 3, 2020, 4:45 PM

#

Ah

#

ok thx

random arch Jun 3, 2020, 5:10 PM

#

no problem.

noble gale Jun 3, 2020, 5:53 PM

#

Which Youtube video did you guys learn Data Analysis?

shell raft Jun 3, 2020, 6:11 PM

#

Hey guys, so I have a defaultdict(list), and Im storing items like {'1' : ['1','2','3','4'], '2': ['1','2']}

#

how would I see how many items are for each key

#

i tried compare_prices[key][x] and that didn't work

#

m=0
        for x in compare_prices[key]:
            l = len(compare_prices[key])
            
            x = re.sub('[!@,]', '', x)
            print("ADDING PRICE.. ", x)
            print("M EQUALS", m)
            m = (((int(m) + int(m)) + int(x)) / l)
            print("Mean of ITEM: ", key, " with length of ", l, " is ", m)

dusty depot Jun 3, 2020, 6:20 PM

#

how do you want the result?

#

@shell raft

#

oh, for each key?

shell raft Jun 3, 2020, 6:20 PM

#

I want the result to add all the values for each key and give me the mean

#

So like i have a key for different item ids, and each item id has like 10 numbers

dusty depot Jun 3, 2020, 6:21 PM

#

so given a dictionary

 values = {'1' : ['1','2','3','4'], '2': ['1','2']}

tight fern Jun 3, 2020, 6:21 PM

#

a key can have multiple different values attached?

dusty depot Jun 3, 2020, 6:21 PM

#

{k: sum(values[k])/len(values[k]) for k,v in values.items()}

#

?

#

oh, string numbers

tight fern Jun 3, 2020, 6:22 PM

#

i didnt realize that was a thing (sorry im new to python programming and i was curious)

shell raft Jun 3, 2020, 6:23 PM

#

```py
{'1027': ['3,000', '4,499', '4,499', '4,499', '5,000', '10,000', '4,000', '4,500', '5,500', '5,500', '4,888', '4,500', '6,888', '6,888', '5,000', '11,080', '11,000', '10,000', '50,000', '20,000'], '1034': ['3,600', '3,650', '4,499', '3,600', '3,600', '5,555', '4,500', '50,000', '5,000', '5,555', '5,000', '25,000', '4,500', '15,000', '9,999', '4,900', '50,000', '5,000', '5,000', '6,000'], '8876': ['12,000', '13,500', '30,000', '40,000', '35,550', '35,000', '22,222', '30,000', '17,000', '15,000', '15,000', '30,000', '30,000', '40,000', '14,141', '14,141', '40,000', '40,000', '14,000', '40,000'], '8877': ['10,000', '20,000', '70,000', '100,000', '35,000', '40,000', '40,000', '61,000', '40,000', '62,500', '60,000', '35,000', '50,000', '60,000', '62,799', '30,000', '40,000', '40,000', '50,000', '50,000', '10,000', '20,000', '70,000', '100,000', '35,000', '40,000', '40,000', '61,000', '40,000', '62,500', '60,000', '35,000', '50,000', '60,000', '62,799', '30,000', '40,000', '40,000', '50,000', '50,000']}
```

dusty depot Jun 3, 2020, 6:23 PM

#

!e ```py
values = {'1': ['1', '2', '3', '4'], '2': ['1', '2']}
means = {}
for k, v in values.items():
    int_values = [int(x) for x in v]
    total = sum(int_values)
    count = len(int_values)
    means[k] = total / count

print(means)
```

arctic wedgeBOT Jun 3, 2020, 6:23 PM

#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

dusty depot Jun 3, 2020, 6:23 PM

#

ah

#

well something like that, then

#

oh, hgmn

#

if you have commas

shell raft Jun 3, 2020, 6:23 PM

#

i removed the commas

dusty depot Jun 3, 2020, 6:23 PM

#

ah okay, no commas

#

then the above code should be ok

shell raft Jun 3, 2020, 6:25 PM

#

alright just a sec ima try

#

int_values = [int(x) for x in v]

#

what is the x

#

keep getting this error

#

NameError: name 'means' is not defined

#

for key in compare_prices:
        # item = market_items[key]["ID"]
        # listofprices[item]["PRICE"] = (market_items[key]["PRICE"])
        # print(listofprices)
        # #print(item)
        # #print(listofprices)
        
        print(key)
        for k,v in compare_prices.items():
            int_values = [int(x) for x in v]
            
            total = sum(int_values)
            count = len(int_values)
            print(total, " ", count)
            means[k] = total/count
            print(means[k])

uneven jay Jun 3, 2020, 6:48 PM

#

Hey, is there a data engineer here that has 5 free minutes, so I could pick your brain? I need some help trying to understand the whole data pipeline and want to ask about some parts of it and how they fit in the whole process.

uncut shadow Jun 3, 2020, 7:05 PM

#

@shell raft so it basically means that means is not defined
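
A minimal sketch of the fix, using a shortened sample of the `compare_prices` data pasted earlier. It creates `means` before the loop, iterates over the dict once instead of nesting two loops, and strips the thousands separators before converting.

```python
# Shortened sample of the compare_prices data shown above.
compare_prices = {
    "1027": ["3,000", "4,499", "5,000"],
    "1034": ["3,600", "3,650"],
}

means = {}  # must exist before the loop assigns into it
for item_id, prices in compare_prices.items():
    # Strip thousands separators before converting to int.
    int_values = [int(p.replace(",", "")) for p in prices]
    means[item_id] = sum(int_values) / len(int_values)

print(means)
```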

shell raft Jun 3, 2020, 7:05 PM

#

lol

#

yeah i got it

uncut shadow Jun 3, 2020, 7:05 PM

#

;-;

desert oar Jun 3, 2020, 7:24 PM

#

@uneven jay the concept of "the" data pipeline doesn't make a ton of sense

#

there are different kinds of operations that you might need to do, in some kind of sequence, to form a data pipeline for a specific task

shell raft Jun 3, 2020, 7:26 PM

#

How can you make a function call every time another function is called? Like how can I make it so anytime the function session_requests is used, it calls another function
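
The question was redirected to another channel, but one common way to do this is a decorator. The sketch below is only an illustration: `log_call` and `also_call` are hypothetical names, and the body of `session_requests` is a stand-in because the real function is not shown in the chat.

```python
import functools

calls = []


def log_call():
    # Hypothetical side function that should run on every call.
    calls.append("called")


def also_call(extra):
    """Decorator factory: run `extra()` before each call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            extra()
            return func(*args, **kwargs)
        return wrapper
    return decorator


@also_call(log_call)
def session_requests(url):
    # Stand-in body; the real session_requests is not shown in the chat.
    return f"requested {url}"


print(session_requests("https://example.com"), calls)
```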

uneven jay Jun 3, 2020, 7:26 PM

#

@desert oar mind if I pm you, so you could explain these things to me? 😄

desert oar Jun 3, 2020, 7:27 PM

#

i'd rather just discuss it here

#

@shell raft this doesn't sound like a data science question (but it's an interesting question). maybe ask in one of the general-purpose help channels. see #❓｜how-to-get-help

shell raft Jun 3, 2020, 7:28 PM

#

Youre right

uneven jay Jun 3, 2020, 7:31 PM

#

Alright, just didn't want to spam too much. So I am trying to learn the skills that a data engineer should have. I have done a bunch of webscraping, so if I understand, that is data collection. I know how to store it in an excel, or an sql db (so I guess data storage?) and I know how to do things with the data, get analytics etc using pandas, numpy, matplotlib. But I read about an important thing being message broker services (rabbitMQ or Kafka). I kind of read the basics of what they are, but I don't really understand how they are used in data engineering. Also, what is docker and where is it used in data engineering?

flat quest Jun 3, 2020, 7:31 PM

#

hey guys so having an issue with pandas.
i'm making a series here thats completely numerical (type int64)

📎 Screen_Shot_2020-06-02_at_11.22.30_PM.png

#

but when i try to plot a distribution plot on it with seaborn getting an error that says cannot convert string to float

📎 Screen_Shot_2020-06-02_at_11.23.06_PM.png

dusty depot Jun 3, 2020, 7:50 PM

#

@flat quest try sns.distplot(series.values)

#

it shouldn't need that, but

flat quest Jun 3, 2020, 8:10 PM

#

still getting the same error :/ @dusty depot

bronze hamlet Jun 3, 2020, 8:19 PM

#

does anyone have experience with solving recaptcha?

uncut shadow Jun 3, 2020, 8:20 PM

#

@flat quest from what I can see, one of lines has "scott" string in it which cannot be changed to float

#

@bronze hamlet it would be better to ask a question and provide code/errors you get so people are able to answer them

bronze hamlet Jun 3, 2020, 8:22 PM

#

thank you i will send my errors soon

flat quest Jun 3, 2020, 8:24 PM

#

yea thats the weird part tho

cause i've looked through the entire series and there's not a single string in it @uncut shadow

uncut shadow Jun 3, 2020, 8:25 PM

#

and what is this dataset?

flat quest Jun 3, 2020, 8:26 PM

#

well the original dataset is from kaggle (twitter disaster tweets)

i'm doing some feature engineering tho and getting the mention count from each tweet

uncut shadow Jun 3, 2020, 8:26 PM

#

Ok

real wigeon Jun 3, 2020, 8:26 PM

#

i feel bummed out

#

dashboards are more complicated than i thought

uncut shadow Jun 3, 2020, 8:31 PM

#

ur using this? https://www.kaggle.com/c/nlp-getting-started/data

flat quest Jun 3, 2020, 8:32 PM

#

yeah that one

dusty depot Jun 3, 2020, 8:37 PM

#

oh hmn

#

@flat quest https://stackoverflow.com/questions/61440184/who-is-scott-valueerror-in-seaborn-pairplot-could-not-convert-string-to-floa try mucking around with this guy's solution

flat quest Jun 3, 2020, 8:40 PM

#

aight thanks! :D
i'll see what i can find

#

seems like the method used to automatically find the bandwidth for the KDE is failing :/. Best solution seems to be using a custom bandwidth in that case.

Anyways it's working decently with a custom bandwidth, thanks for the help! @dusty depot
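
To make "custom bandwidth" concrete: a KDE puts one Gaussian bump on every sample, and the bandwidth is the width of those bumps. The sketch below is plain NumPy rather than seaborn's own code, the function name and the data are made up for illustration, and 0.5 is an arbitrary bandwidth. In seaborn releases from around that time the value was reportedly passed as kde_kws={"bw": 0.5}, but the keyword has changed between versions, so check the docs for yours.

```python
import numpy as np

def gaussian_kde_fixed(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` using a fixed, user-chosen bandwidth."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, centred on that sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The density estimate is the average of all bumps.
    return bumps.mean(axis=1)

# Made-up "mentions per tweet" counts (integers, mostly zero).
mention_counts = np.array([0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0])
grid = np.linspace(-3, 6, 901)
density = gaussian_kde_fixed(mention_counts, grid, bandwidth=0.5)
```

Because the bandwidth is given explicitly, no automatic rule (such as Scott's rule) has to be evaluated, which is the step that was failing in the thread.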

dusty depot Jun 3, 2020, 8:55 PM

#

👌

fathom bronze Jun 3, 2020, 9:44 PM

#

Hey guys, I have a pandas dataframe with 260 rows suppose. I want to divide the dataframe into 10 parts and write each part into a different excel. Why 10 parts? because I have to divide by 26. So, if I have 140 rows, it will create 6 dataframes parts, first 5 parts containing 26 rows, and the last part containing just 10 of the rows.

dusty depot Jun 3, 2020, 9:50 PM

#

@fathom bronze are you looking for a way to do that slicing?

fathom bronze Jun 3, 2020, 9:51 PM

#

Yeah

#

can say that

#

sort of dividing the rows

dusty depot Jun 3, 2020, 9:51 PM

#

uh, something like divisions = numpy.linspace(0, len(df.index), num=11, dtype=int)

#

and then

#

for div in divisions:

#

df.iloc[:div, :]

#

stuff like that

#

well u want like
df.iloc[div_before:div_after, :]

#

but yknow what i mean

fathom bronze Jun 3, 2020, 9:52 PM

#

Can you explain the code a bit

dusty depot Jun 3, 2020, 9:52 PM

#

mkay gimme 10 minutes to finish this meeting

#

👌

fathom bronze Jun 3, 2020, 9:52 PM

#

Yeah sure

dusty depot Jun 3, 2020, 10:07 PM

#

@fathom bronze aight so

fathom bronze Jun 3, 2020, 10:07 PM

#

yeah

dusty depot Jun 3, 2020, 10:08 PM

#

basically

#

you can slice a dataframe by row via

#

dataframe.iloc[start:end, :]

fathom bronze Jun 3, 2020, 10:08 PM

#

ok

dusty depot Jun 3, 2020, 10:08 PM

#

or well, just dataframe[start:end]

fathom bronze Jun 3, 2020, 10:08 PM

#

ok

dusty depot Jun 3, 2020, 10:08 PM

#

so if you want, say, 10 divisions

#

each clump is basically like

#

len(df.index) // divisions

fathom bronze Jun 3, 2020, 10:09 PM

#

clump
?

dusty depot Jun 3, 2020, 10:09 PM

#

like each part

#

so the first tenth is like

#

0: len(df.index)//divisions

#

so in this case, if it's 260 rows and 10 parts

#

then
division = 260//10 = 26

#

so then your first group is

#

df[0:26]

#

and then the second group is
df[26:26*2]

#

and so on

#

so you could do like

for i in range(10):
    part = df[i*division:(i+1)*division]

fathom bronze Jun 3, 2020, 10:10 PM

#

Ok. I understand it

#

although,

#

Two things :

Number of rows may not be a perfect multiple of 26 always
~~I don't know before hand how many parts~~

#

ok. Maybe I do know. If I just int(rows/26) I get the parts

#

What do I do about the first part?

dusty depot Jun 3, 2020, 10:24 PM

#

@fathom bronze oh, uh

#

that way it'll just have one shorter group at the end
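
A minimal sketch of the fixed-size split discussed above, assuming 26 rows per part and a stand-in dataframe. The helper name split_rows and the output file names are made up for illustration. Stepping through the row positions in strides of 26 avoids computing the number of parts up front, and the last part simply comes out shorter when the row count is not a multiple of 26.

```python
import pandas as pd

def split_rows(df, rows_per_part=26):
    """Return consecutive slices of df, each with at most rows_per_part rows."""
    return [
        df.iloc[start:start + rows_per_part]
        for start in range(0, len(df), rows_per_part)
    ]

df = pd.DataFrame({"value": range(140)})  # stand-in for the real data
parts = split_rows(df)
part_sizes = [len(part) for part in parts]

# Writing each part to its own workbook (needs an Excel writer such as openpyxl):
# for i, part in enumerate(parts, start=1):
#     part.to_excel(f"part_{i}.xlsx", index=False)
```

With 140 rows this gives five parts of 26 rows and a final part of 10. Note that int(rows / 26) alone would give 5 here and drop the remainder, so the number of parts is the rounded-up quotient.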

fathom bronze Jun 3, 2020, 10:28 PM

#

Ight

#

thanks @dusty depot 😁

dusty depot Jun 3, 2020, 10:29 PM

#

thumbsup

patent jewel Jun 4, 2020, 12:42 AM

#

Hello, I am new to DNN and I am working on a music classifier DNN using tensorflow as a starter project. I can load sound data into a numpy array (music size length)x2 shape. What are some ways I can design the input of the network to handle the different lengths of sound arrays and non-flat shape. I am looking at convolution but I also don't want to convert the sound into a graph image because I eventually want to do what google did with deep dream but for sounds. Any help would be greatly appreciated.

slim elm Jun 4, 2020, 2:32 AM

#

hi friends

#

this may be quick , if not ill move over to #help

#

got a pandas int column that im exporting to csv, and it's writing my 16-digit values in scientific notation. Any way to override it so it shows more than 16 digits?

dusty depot Jun 4, 2020, 2:37 AM

#

you can do uh, to_csv(float_format="%.20f") @slim elm

#

probably

slim elm Jun 4, 2020, 2:37 AM

#

ouf

dusty depot Jun 4, 2020, 2:37 AM

#

i thiiiiink

slim elm Jun 4, 2020, 2:37 AM

#

didnt know there was that condition

#

imma try

#

i hate csv export

dusty depot Jun 4, 2020, 2:38 AM

#

ye it's kinda wonky

slim elm Jun 4, 2020, 2:38 AM

#

but my work team are dumb

#

and think excel is hard

#

rip didnt work

#

can i convert it to txt

#

.astype(str)

dusty depot Jun 4, 2020, 2:45 AM

#

didn't work how

#

complained about invalid argument?

#

float_format="%f" might work better

slim elm Jun 4, 2020, 3:59 AM

#

worked

#

i just exported excel

#

and the float_format worked
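
For reference, a small sketch of the float_format approach, assuming the column is actually stored as float (float_format has no effect on true integer columns). The column name and values are made up.

```python
import pandas as pd

# Made-up 16-digit identifiers that ended up stored as floats.
df = pd.DataFrame({"account_id": [1234567890123456.0, 42.0]})

# "%.0f" writes every float with no decimal places and no exponent.
csv_text = df.to_csv(index=False, float_format="%.0f")
```

If the values are whole numbers with no missing entries, converting the column with astype("int64") before exporting is another option. Keep in mind that a float64 only represents integers exactly up to 2**53, so longer identifiers are safer stored as strings.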

lapis sequoia Jun 4, 2020, 6:31 AM

#

how do i extract a particular text from image?

📎 contoured.jpg

onyx cove Jun 4, 2020, 6:34 AM

#

how do I add a list to a dataframe as a row?

#

df.append(list) will add each list member as a new entry in the first column

#

the list doesnt specify the column location, its just a list of floats
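
This question went unanswered in the channel. One common approach, sketched here under the assumption that the dataframe has the default 0..n-1 index and that the list has exactly one value per column, is to assign the list to a new row label with .loc. The column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [3.0]})
new_row = [4.0, 5.0, 6.0]  # one float per column, in column order

# len(df) is the next unused label when the index is the default 0..n-1,
# so this adds the list as a whole row instead of spreading it down a column.
df.loc[len(df)] = new_row
```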

flat quest Jun 4, 2020, 7:16 AM

#

same concept as object detection if u've done that before @lapis sequoia

lapis sequoia Jun 4, 2020, 8:05 AM

#

no i haven't done it before

#

@flat quest can you explain it please?

coarse fox Jun 4, 2020, 9:01 AM

#

Hi guys, I need to sum up approx. 23 columns from a pandas dataframe imported from a csv file
how do I go about doing it?

worn stratus Jun 4, 2020, 9:04 AM

#

what do you mean by sum columns?

#

do you want the sum of all the cells? Or just the sum across a single row of 23 columns?

coarse fox Jun 4, 2020, 9:08 AM

#

sorry, sum across a single row of 23 columns

quick hawk Jun 4, 2020, 9:21 AM

#

@coarse fox df['sum'] = df.sum(axis=1) ?

coarse fox Jun 4, 2020, 9:23 AM

#

no, I'd like to specify the range of columns e.g. ('col1':'col5','col7':'col10')

quick hawk Jun 4, 2020, 9:28 AM

#

So col1 + col2, ... + col7 + ... + col10 and store the result in one column?

coarse fox Jun 4, 2020, 9:31 AM

#

yeah

quick hawk Jun 4, 2020, 9:39 AM

#

I suppose you could get your columns as a list (cols = df.columns.tolist()) and then filter what you need (df['sum'] = df[cols[1:6] + cols[7:11]].sum(axis=1)). Not sure if this is the best way to do it but it seems to work
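
An alternative sketch that selects the ranges by column name instead of by position, assuming columns literally named col1 to col10 (made-up data). Label slices with .loc include both endpoints.

```python
import pandas as pd

# Made-up frame with columns col1..col10 and two rows.
df = pd.DataFrame({f"col{i}": [i, 10 * i] for i in range(1, 11)})

# Pick col1..col5 and col7..col10 by label, then sum across each row.
wanted = pd.concat(
    [df.loc[:, "col1":"col5"], df.loc[:, "col7":"col10"]],
    axis=1,
)
row_sums = wanted.sum(axis=1)
df["sum"] = row_sums
```

The first row sums 1+2+3+4+5 and 7+8+9+10, leaving out col6.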

onyx cove Jun 4, 2020, 9:48 AM

#

anyone know why my RandomForestRegressor model .fit() function wont run? if n_jobs=-1 is set, it will error out saying ValueError: need at most 63 handles, got a sequence of length 65

#

apparently it cant spawn over 60 threads, and I have a 64 core CPU

#

however, if I set n_jobs to say, 60/55/1 etc, it just doesnt run

#

I managed to get another model to fit with n_jobs set to 60, but this one isnt

#

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [10000]}

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)
```

chilly cove Jun 4, 2020, 10:28 AM

#

I am using fcn-resnet-101 for semantic segmentation from pytorch. I have also tried deeplabv3-resnet-101 and did not see any performance improvement; in fact the fcn model worked better. I need to remove the background from images of humans. Overall the model worked well, but I am getting a slight halo around the head after removing the background.

Any suggestions to fix the issue of the halos?

quick sparrow Jun 4, 2020, 1:20 PM

#

I find myself in quite a pickle. If anybody could help me I would be most thankful. I would like to use regex to match in a file with a basic structure similar to this

#

<Contour name="green1" hidden="false" closed="true" simplified="true" border="0 1 0" fill="0 1 0" mode="9"
points="1.30303 1.91643,
1.30787 1.87772,
1.32602 1.80029,
1.33207 1.78093,
1.35505 1.74221,
1.3611 1.73253,

<Contour name="pink1" hidden="false" closed="true" simplified="true" border="1 0 0.5" fill="1 0 0.5" mode="-13"
points="1.5878 1.97466,
1.59021 1.9553,
1.59505 1.93594,
1.59868 1.92626,
1.60473 1.91658,
1.62288 1.89964,

<Contour name="a1" hidden="false" closed="true" simplified="false" border="1 0.5 0" fill="1 0.5 0" mode="13"
points="1.77483 2.11831,
1.77483 2.11589,
1.77725 2.11347,
1.77967 2.11347,

#

problem is that i want to just get the numbers but under the condition that they belong to a specific Contour name, in this case "pink1", "a1", "green1"

#

my previous idea was something like using
name="pink1".\n.\n? points="(\d.\d+) (\d.\d+)|^\t(\d.\d+) (\d.\d+)
as regex

#

but this just shows me every numberpair without differentiating for "contour name"

paper niche Jun 4, 2020, 2:03 PM

#

you don't have to do it in a single regex, find the pink1 contour first, then parse that line to extract the numbers
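
A sketch of that two-step idea. It assumes each points attribute is closed by a double quote and that attribute values contain no ">" character. The sample text is shortened and given closing "/> markers that the excerpt above does not show, and the helper name points_for is made up.

```python
import re

def points_for(text, contour_name):
    """Return the (x, y) pairs listed in the points attribute of one named Contour."""
    # Step 1: isolate the points="..." attribute of the requested contour only.
    block = re.search(
        r'<Contour name="' + re.escape(contour_name) + r'"[^>]*?points="([^"]*)"',
        text,
    )
    if block is None:
        return []
    # Step 2: pull the number pairs out of that one attribute.
    pairs = re.findall(r'(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)', block.group(1))
    return [(float(x), float(y)) for x, y in pairs]

sample = '''<Contour name="green1" hidden="false" mode="9"
points="1.30303 1.91643,
1.30787 1.87772,
"/>
<Contour name="pink1" hidden="false" mode="-13"
points="1.5878 1.97466,
1.59021 1.9553,
"/>
'''

pink = points_for(sample, "pink1")
green = points_for(sample, "green1")
```

Splitting the work this way keeps each pattern simple: the first one decides which contour, the second one only ever sees that contour's numbers.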

tight shale Jun 4, 2020, 2:58 PM

#

anyone familiar with jupyterlab?

#

trying to install extensions but can't see any in the extension manager

#

https://i.stack.imgur.com/kVJlm.png

#

any ideas why?

flat quest Jun 4, 2020, 3:53 PM

#

have u done any ml before? @lapis sequoia

real patrol Jun 4, 2020, 4:03 PM

#

hello guys

#

i am trying to use pandas but i'm getting error

#

Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/WebScraping/panddas.py", line 1, in <module>
import pandas
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\__init__.py", line 55, in <module>
from pandas.core.api import (
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\api.py", line 29, in <module>
from pandas.core.groupby import Grouper, NamedAgg
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\groupby\__init__.py", line 1, in <module>
from pandas.core.groupby.generic import DataFrameGroupBy, NamedAgg, SeriesGroupBy
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\groupby\generic.py", line 60, in <module>
from pandas.core.frame import DataFrame
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 124, in <module>
from pandas.core.series import Series
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py", line 4572, in <module>
Series.add_series_or_dataframe_operations()
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 10349, in add_series_or_dataframe_operations
from pandas.core.window import EWM, Expanding, Rolling, Window
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\window\__init__.py", line 1, in <module>
from pandas.core.window.ewm import EWM # noqa:F401
File "C:\Users\user\AppData\Roaming\Python\Python37\site-packages\pandas\core\window\ewm.py", line 5, in <module>
import pandas._libs.window.aggregations as window_aggregations
ImportError: DLL load failed: The specified module could not be found.
this is what i'm getting

#

how can i solve this
1) I've installed pandas

#

2) it's the latest version, which is 1.0.4

#

3) my py version is up to date

#

but i'm still getting the error

#

can someone help me please

#

why is this error?

📎 error.JPG

spiral peak Jun 4, 2020, 4:11 PM

#

@real patrol What IDE are you using?

real patrol Jun 4, 2020, 4:13 PM

#

pychar,

#

pycharm

#

please help me

spiral peak Jun 4, 2020, 4:13 PM

#

Give me a minute to load up pycharm

real patrol Jun 4, 2020, 4:13 PM

#

ok

lapis sequoia Jun 4, 2020, 4:16 PM

#

https://tenor.com/view/machine-learning-baby-crying-deep-learning-sad-math-gif-17208484

Tenor

real patrol Jun 4, 2020, 4:16 PM

#

anyone can help me to solve that error

lapis sequoia Jun 4, 2020, 4:17 PM

#

@real patrol are you working in a file or a directory that is called pandas or is there another file or directory that is called pandas

spiral peak Jun 4, 2020, 4:17 PM

#

@real patrol In the Menu, can you click Run and then Configurations

#

I'm not sure if you're working with the system-wide python or a venv

real patrol Jun 4, 2020, 4:18 PM

#

i did not get you

lapis sequoia Jun 4, 2020, 4:19 PM

#

there can be multiple python versions installed on your computer. kutiekatj8 wants to make sure you are working on the correct one

#

or at least one that has pandas installed

#

ww: I wish that all the pontification of ML opened with that GIF so people that had the other bits would know.

#

@lapis sequoia i don't understand what you mean

#

I have a background in math, stats, algebra, etc. ML seems to have its own language so all the tutorials were frustrating for a slow learner. I posted about the mental break through it took to understand "Tensors" were a fancy wrapper for stuff I already knew.

real patrol Jun 4, 2020, 4:21 PM

#

how can i check it?

lapis sequoia Jun 4, 2020, 4:21 PM

#

@lapis sequoia makes sense

real patrol Jun 4, 2020, 4:22 PM

#

@ww how can i check it?

lapis sequoia Jun 4, 2020, 4:22 PM

#

@real patrol follow what kutiekatj9 wrote.

#

It's not unique to ML. CS/Math/Engineering all seem to have invented their own terminology for terms.

bronze hamlet Jun 4, 2020, 4:34 PM

#

can i ask in this topic a question about selenium ? or wrong topic?

solar oracle Jun 4, 2020, 5:26 PM

#

Ask away, I think it is relevant.

bronze hamlet Jun 4, 2020, 5:48 PM

#

📎 unknown.png

#

📎 unknown.png

#

📎 unknown.png

#

i want to click on the button but i get error. sorry for my bad english

lapis sequoia Jun 4, 2020, 6:34 PM

#

The images are a bit hard to read, try clicking it via find_element_by_xpath

#

and if that doesnt work try to click either the parent container or the one right above it

lapis sequoia Jun 4, 2020, 7:13 PM

#

Hey y'all, can somebody point me the right direction how I can pass this API response to a pandas dataframe?: https://api.pushshift.io/reddit/search/submission/?subreddit=NEO&after=1500004800&size=10&fields=author,created_utc,full_link,num_comments,score,selftext,title

bronze hamlet Jun 4, 2020, 7:29 PM

#

@lapis sequoia yes xpath works thank you !

lapis sequoia Jun 4, 2020, 7:29 PM

#

yay!

desert oar Jun 4, 2020, 8:26 PM

#

@lapis sequoia it looks like it's pretty much already formatted like a dataframe

#

@lapis sequoia try
```python
data = pd.DataFrame.from_dict(response['data'], orient='index')
```

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

fathom bronze Jun 4, 2020, 8:32 PM

#

Hello Guys. This can be a weird question, but, is it possible to resize the column widths to fit the text in a xlsx file using python? I am using pandas.to_excel and the column widths are narrow. I want them to be changed automatically via the python program. Is it possible?

lapis sequoia Jun 4, 2020, 8:40 PM

#

Hey peeps, am I braindead or something? What's happening here with this pandas.DataFrame? Why does .max() not give me the largest entry?!

📎 unknown.png

#

Thanks @desert oar

#

but
```python
response = requests.get("https://api.pushshift.io/reddit/search/submission/?subreddit=NEO&after=1500004800&size=10&fields=author,created_utc,full_link,num_comments,score,selftext,title,")
response = response.json()
data = pd.DataFrame.from_dict(response['data'], orient='index')
```

#

gives me this error: AttributeError: 'list' object has no attribute 'values'

#

@lapis sequoia try .idxmax() instead of .max()

#

@lapis sequoia thanks, but didn't work either but I fixed it after an hour of being stupid: the column score was the only column in that df that had a dtype object............................................... fixed

#

oh ok...

#

do you by any chance have some experience with APIs? It's my first time trying to pass data from an API to a dataframe and i am not quite sure how to do it

slender cypress Jun 4, 2020, 9:05 PM

#

usually a combo of
requests.get()
and
pd.read_csv
work for me on most cases.

#

I guess it depends on the API's, but for my work that's what works. Depends on what the request looks like.

desert oar Jun 4, 2020, 9:08 PM

#

@lapis sequoia oh i misunderstood the data format

#

just try pd.DataFrame(response['data'])

#

my browser was lying to me about what was in the json 🙂

slender cypress Jun 4, 2020, 9:26 PM

#

lol I totally didn't even realize this was an ongoing convo

#

My reading comprehension isn't at its max potential today 😄

lapis sequoia Jun 4, 2020, 10:05 PM

#

it just returns TypeError: 'Response' object is not subscriptable

#

data = json.loads(response.text)
data = pd.DataFrame(response['data'])

#

sorry for all my questions. It's my first time trying to get an API response into a pandas dataframe

#

why can't I directly parse the response to the dataframe? I mean isn't the API response already in json format?

#

I don't really understand what is happening with the data formats

vapid wren Jun 5, 2020, 12:48 AM

#

I have a question of understanding about the Delta Rule:

Δwᵢ = (y - ŷ)*xᵢ

Why does x have to be multiplied again after the difference? If the input is 0, the product of w and x remains 0 anyway. Then it should not matter if the weight changes with an input of 0.

#

with a binary step function*

paper niche Jun 5, 2020, 1:28 AM

#

data = json.loads(response.text)
data = pd.DataFrame(response['data'])

@lapis sequoia your last line, try data = pd.DataFrame(data['data'])
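
To spell out what is happening with the formats: requests.get returns a Response object, not the data itself, so it cannot be indexed. Its body is JSON text. json.loads(response.text) (or response.json()) turns that text into a Python dict, and the "data" key of that dict holds a list of records that pandas can read directly. The sketch below uses a hard-coded string with made-up values in place of a live response, so the field values are illustrative only.

```python
import json
import pandas as pd

# Stand-in for response.text; the real API returns more fields per record.
raw_text = (
    '{"data": ['
    '{"author": "user_a", "created_utc": 1500004801, "score": 3},'
    '{"author": "user_b", "created_utc": 1500011111, "score": 7}'
    ']}'
)

payload = json.loads(raw_text)      # JSON text -> dict
records = payload["data"]           # list of dicts, one per submission
df = pd.DataFrame(records)          # one row per dict, one column per key

last_timestamp = int(df["created_utc"].iloc[-1])
```

The earlier AttributeError came from from_dict(..., orient='index'), which expects a dict of rows rather than a list.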

real wigeon Jun 5, 2020, 2:07 AM

#

if anyone's used plotly, please let me know how to remove the trace names off of a bubble map

#

i turned the subplot off, but removing the name variable only replaces the names of the trace, with the trace count

#

please @ me

green hornet Jun 5, 2020, 3:00 AM

#

anyone know how to use nltk to detect if a string is a question

#

been stuck on this for a while

real wigeon Jun 5, 2020, 3:24 AM

#

disregard, turns out you can just leave the field blank

quasi cargo Jun 5, 2020, 3:47 AM

#

hi all

#

i need some help in my project

balmy trellis Jun 5, 2020, 4:00 AM

#

hey - is this a good spot to ask about pandas and dataframes?

solar oracle Jun 5, 2020, 5:43 AM

#

Sure

random arch Jun 5, 2020, 5:49 AM

#

@fathom bronze Yes, you can format Excel output from pandas. You'll need to rely on xlsxwriter https://xlsxwriter.readthedocs.io/example_pandas_column_formats.html

solar oracle Jun 5, 2020, 6:33 AM

#

Anybody here worked with selenium and svg elements? The problem is that I want to go through all of the elements(already working) but sometimes the actual element is not in the middle of the path element.

#

Thus when I move to the element, it just "skips" it.

blazing bramble Jun 5, 2020, 7:56 AM

#

Made a scraper, and was working perfectly fine.. But all of a sudden, it completely stopped working..Something about the host not responding correctly (using requests). Any thoughts, ideas?

crude tartan Jun 5, 2020, 8:15 AM

#

You need logs.

gray eagle Jun 5, 2020, 10:27 AM

#

Hello

#

I've wanted to learn Artificial Intelligence but I don't know where to start.

#

Should I learn Machine Learning or Deep Learning

quartz stream Jun 5, 2020, 11:18 AM

#

Hello Everyone,
Hope you are doing well and good

I want to create my own voice authentication application/service
Basically a place where it can recognize the difference between voice of users
anyone with any guidance or places to get started?

uncut shadow Jun 5, 2020, 11:19 AM

#

you should probably check a library for deep learning in python called tensorflow or the other one, pytorch

tight shale Jun 5, 2020, 12:05 PM

#

how do i update anaconda's jupyterlab?

#

tried conda update jupyter lab, conda install -c conda-forge jupyter lab, and conda install -c conda-forge jupyterlab=2.1.4 but my version is still 1.2.6

lapis sequoia Jun 5, 2020, 12:07 PM

#

hey, i have this dataframe with this last row:

    author            created_utc    full_link    num_comments    score    selftext    title
999    x               1502089779    x            x            x    x            x

Based on that, I need to select the unix timestamp in order to request the next dataset. So the last row will not always be row 999 but could also be 10 or 1304 or whatever. What is the foolproof way to always select the 'created_utc' value that is in the last row even if you don't know how many rows you have in a dataset?

#

it is a pandas dataframe btw

paper niche Jun 5, 2020, 12:09 PM

#

df.iloc[-1]['created_utc']

tight shale Jun 5, 2020, 12:10 PM

#

or df.tail(1)

lapis sequoia Jun 5, 2020, 12:12 PM

#

@paper niche works perfectly! thanks

autumn flax Jun 5, 2020, 12:29 PM

#

Been there

real wigeon Jun 5, 2020, 1:04 PM

#

Herm interesting solutions

willow quest Jun 5, 2020, 1:17 PM

#

Should I learn Machine Learning or Deep Learning
@SwashyAsian#2245 No idea of where you're at knowledge-wise currently, but generally start with some basic statistics (get a feel for data peculiarities, observations/variables, typical tests, distributions and the like). Expand that to some algorithms like clustering. After that start focussing on ML basics, like decision trees, lin/log regression, maybe go to ensembles from there. Then start looking at neural nets and the likes.

But that's my $0.02 from a bit of an academic perspective, maybe industry does it differently.

#

Did he really leave already..

lapis sequoia Jun 5, 2020, 1:37 PM

#

hey, how can i set a name to the column which doesnt have one?

#

in pandas

uncut shadow Jun 5, 2020, 2:39 PM

#

Do you think these are good tutorials for math for ML? https://www.youtube.com/playlist?list=PLmAuaUS7wSOP-iTNDivR0ANKuTUhEzMe4

YouTube

Mathematics of Machine Learning - YouTube

dawn quest Jun 5, 2020, 2:42 PM

#

if you have experience with cv2/multiprocessing, I'd love to get your opinion on this question:
https://discordapp.com/channels/267624335836053506/704067023939960985/718473842946998292

lapis sequoia Jun 5, 2020, 3:10 PM

#

process it in different cores for what

#

not very clear

#

you can split and process numpy arrays in parallel

elder willow Jun 5, 2020, 3:50 PM

#

kind of a stupid question: as a chemist I frequently measure various standard solutions at known concentrations and do a linear regression on those to get a line I can use to predict the concentration of unknown samples.
I'm trying to use Scikit learn to do so with sklearn.linear_model.LinearRegression
Everything looks fine but in my use case I need to predict an X value (concentration) out of a Y (instrumental signal that I get from the unknown sample). Is there any way to do that with a method? predict can do what I need but in the "opposite direction" (getting a predicted Y from a given X)

uncut shadow Jun 5, 2020, 4:04 PM

#

ummm

#

well

#

if you have Y and you need to predict X

#

then why don't you just turn this Y to become an X and then feed it to predict the value you need tho

#

(but I'm not sure if I understood the question)
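
For a straight-line calibration the fitted line can simply be inverted by hand: if signal = slope * concentration + intercept, then concentration = (signal - intercept) / slope. The sketch below uses numpy.polyfit instead of scikit-learn to stay self-contained, and the standards are made-up, perfectly linear numbers. With a fitted single-feature LinearRegression called model, the same two values are available as model.coef_[0] and model.intercept_.

```python
import numpy as np

# Made-up calibration standards: known concentrations and measured signals.
concentration = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
signal = 0.5 * concentration + 0.1

# Fit signal as a function of concentration (degree-1 polynomial).
slope, intercept = np.polyfit(concentration, signal, deg=1)

def concentration_from_signal(y):
    """Invert the calibration line to get concentration from a measured signal."""
    return (y - intercept) / slope

unknown = concentration_from_signal(2.6)
```

Swapping X and Y and fitting a new regression, as suggested above, also produces an answer, but on noisy data it is a different fit and generally does not give the same line as inverting the original calibration.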

unreal thistle Jun 5, 2020, 7:12 PM

#

how can i transform a grey image to rgb pls, i tried image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) but it doesn't work

#

?
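
A likely cause, assuming the image really has a single channel: cv2.COLOR_BGR2RGB expects a 3-channel input, whereas OpenCV's conversion code for single-channel images is cv2.COLOR_GRAY2RGB. The same result can be sketched in plain NumPy by copying the grey channel three times. The tiny array below is made up.

```python
import numpy as np

# Made-up 2x2 greyscale image (one channel, 8-bit).
gray = np.array([[0, 128],
                 [255, 64]], dtype=np.uint8)

# Repeat the single channel along a new last axis: shape (H, W) -> (H, W, 3).
rgb = np.stack([gray, gray, gray], axis=-1)
```

This only changes the array layout so that code expecting three channels accepts it. The picture still looks grey, since no colour information can be recovered.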

slim elm Jun 5, 2020, 10:23 PM

#

any good gui builder suggestions? looking for a python gui and sqlite

real wigeon Jun 5, 2020, 10:49 PM

#

tinkter is one I hear about

lapis sequoia Jun 5, 2020, 11:10 PM

#

tkinter?

fiery cosmos Jun 5, 2020, 11:15 PM

#

hey guys quick requests question concerning a phpbb i believe, im trying to get my script to send a pm to a user on a forum via requests. the part of the code you cant see prompts user to enter their forum profile link in proper format(Ie: https://forums.d2jsp.org/user.php?i=1149478)
I've been messing with it for a while but have been coming up short, i'd like for it to pm the user a confirmation code. how far off am i? heres my script, any help is much appreciated thanks guys.

📎 24fd3dd68e0c0c3693fa9f46abd3722e.png

slim lance Jun 6, 2020, 12:59 AM

#

Hey, I need to export some data for wrangling with pandas. Does it matter if I used gzip'ed CSV vs Parquet? (Advantage of CSV is that I can always open in excel if I want to spot check stuff). I am not a data science person.. just a sysadmin trying to grok some billing data.

bleak field Jun 6, 2020, 1:12 AM

#

if you are exporting/sending this data to someone on your data science team to use - then definitely csv

safe tapir Jun 6, 2020, 1:18 AM

#

parquet is smaller and faster to load

bleak field Jun 6, 2020, 1:23 AM

#

True - depends on the use case. If it is a small dataset and you expect to do a lot of spot checks or if this is just a one-off, I'd still say CSV. Prod-side parquet might be better

slim lance Jun 6, 2020, 2:04 AM

#

kk.. I'll use CSV.. I don't know how big yet.. but it's AWS cost and usage data.

#

trying to create summary reports.

#

can't imagine they are bigger than a couple hundred MB each.

#

Trying to create summary reports to look for areas to look for cost savings.

#

Advantage of CSV is I've dabbled with CSV in Pandas before.

#

(I've used Pandas for a personal project related to managing a collection and the data around it, including a couple third party data sets.)

real patrol Jun 6, 2020, 4:01 AM

#

hey guys

#

can anyone help me to get data from worldometers using Beautifulsoup?

#

hello

lapis sequoia Jun 6, 2020, 4:50 AM

#

@real patrol What do you need help with?

real patrol Jun 6, 2020, 4:50 AM

#

Web scraping

lapis sequoia Jun 6, 2020, 4:51 AM

#

Sure, I can help! What do you want to do?

real patrol Jun 6, 2020, 4:51 AM

#

wow

#

i wanted the number of corona infected people but i ain't getting it

lapis sequoia Jun 6, 2020, 4:52 AM

#

Can you send me the particular link you're looking at?

real patrol Jun 6, 2020, 4:52 AM

#

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url)
# print(response)
soup = BeautifulSoup(response.text, "html.parser")
get_table = soup.find("table", id="main_table_countries_today")

# for names in get_table.find_all("a", class_="mt_a"):
#     name = names.text
#     print(name)

for cases in get_table.find_all("td", class_="sorting_1"):
    print(cases)
```

#

see this

lapis sequoia Jun 6, 2020, 4:52 AM

#

Ah.

real patrol Jun 6, 2020, 4:52 AM

#

https://www.worldometers.info/coronavirus/

Coronavirus Update (Live): 6,850,473 Cases and 398,244 Deaths from ...

Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data a...

frozen lotus Jun 6, 2020, 4:56 AM

#

you cant get help for it if you are gonna scrape from that website

#

!rule 5

arctic wedgeBOT Jun 6, 2020, 4:56 AM

#

Rules

5. Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious/inappropriate or be for graded coursework/exams.

frozen lotus Jun 6, 2020, 4:58 AM

#

if u look in the FAQ section of that website u will see why

#

https://www.worldometers.info/licensing/faq/

normal magnet Jun 6, 2020, 5:20 AM

#

can someone tell me how to solve this equation using python
x-var1*x/100==var2 where var1 and var2 will be entered by user running the script, sorry i started to learn today, i dont know anything yet

thorn dust Jun 6, 2020, 5:29 AM

#

var1 = float(input('Enter var1 '))
var2 = float(input('Enter var2 '))
?? Idk if this is what you mean

lapis sequoia Jun 6, 2020, 12:22 PM

#

Your question is more suited in the general help chats. Though to answer your question: Solving for x should be pretty simple: x = var2 / (1 - var1 / 100)
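
The rearrangement behind that formula: x - var1*x/100 = var2 factors to x * (1 - var1/100) = var2, so x = var2 / (1 - var1/100), which only works when var1 is not 100. A small sketch, with a made-up function name:

```python
def solve_x(var1, var2):
    """Solve x - var1 * x / 100 == var2 for x."""
    factor = 1 - var1 / 100
    if factor == 0:
        # var1 == 100 makes the left-hand side 0 for every x.
        raise ValueError("no unique solution when var1 is 100")
    return var2 / factor

x = solve_x(20, 80)
```

In a script, var1 and var2 would come from float(input(...)), since input returns text.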

open spindle Jun 6, 2020, 12:39 PM

#

ok, im fairly new to deep learning and

#

is 90% of it literally just sitting in front of a computer waiting for the training to get done

lapis sequoia Jun 6, 2020, 1:16 PM

#

are you doing stuff on cpu

#

and where were you before DL

open spindle Jun 6, 2020, 1:40 PM

#

i think i've actually got my code fixed

#

apparently i was using a beefy gpu

#

but the problem was that i was using one thread by default

sonic raft Jun 6, 2020, 3:53 PM

#

Hi! Is it possible to make an Image classifier with Logistic Regression?

open spindle Jun 6, 2020, 4:16 PM

#

I think you'd need neural nets for that

cyan sierra Jun 6, 2020, 4:17 PM

#

Is it possible? Yes. How good would such a classifier be? Probably pretty bad. Logistic Regression is just too simple for classifying images

sonic raft Jun 6, 2020, 4:36 PM

#

And, what about Support Vector Machines?

#

Btw, I created a Classifier, and I think it isn't that bad 🙂

lapis sequoia Jun 6, 2020, 5:46 PM

#

Hey y'all, I have a stupid beginner question concerning plotting in python. I have a simple dictionary: {'anger': 2.1819999999999995, 'disgust': 0.89, 'fear': 0.8440000000000001, 'joy': 3.1100000000000003, 'sadness': 0.344, 'trust': 6.134000000000001, 'surprise': 0.727}

What is the easiest way to create a radar chart in python like these https://en.wikipedia.org/wiki/Radar_chart

I looked at a few examples but the codes are full of stuff that are too complex for my little diagram that I am trying to make

Radar chart

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative, but vari...

#

do you have a super basic example how to make one?

untold tundra Jun 6, 2020, 5:49 PM

#

https://stackoverflow.com/questions/46564099/what-are-the-steps-to-create-a-radar-chart-in-bokeh-python

Stack Overflow

What are the steps to create a radar chart in Bokeh python?

Objective: to create a radar chart within Bokeh python

To be helpful this is the chart type I am after:
I obtained this chart example from Matplotlib which might be helpful in closing the gap on a

lapis sequoia Jun 6, 2020, 5:53 PM

#

man .... why are codes for plots always so long and hard to understand? I am seriously considering to use Excel even though I don't want to ....

#

hello

#

I need a bit of help with matplotlib time series graphing

#

I have 2 lists

#

one is the date

#

and the other is a value associated with the date

#

how do I plot that?

spare karma Jun 6, 2020, 6:30 PM

#

Any nlp scientists out there? I'm wondering if one could use a tf-idf to assist in finding stopwords.

unkempt monolith Jun 6, 2020, 7:18 PM

#

Do you guys know how to learn Data Science for free?

lapis sequoia Jun 6, 2020, 7:25 PM

#

there are a lot of videos and tutorials on the web where you can learn for free

#

it just depends on what you want to learn exactly

#

a good starting point in my opinion is iTunes U (iTunes University)

#

there are a lot of good courses that you can choose from for free

#

Different topic:
I need to create a new column in a dataframe that contains part of a string from another column. this was my try:
megadf['id']=megadf['full_link'][39:44]
I need to extract the characters 39:44 from the link in the column 'full_link' and make a new column with it

lapis sequoia Jun 6, 2020, 7:53 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html

#

thanks zahand, I already solved it 🙂
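
For anyone reading along: megadf['full_link'][39:44] slices rows 39 to 43 of the column, not characters. The .str accessor applies the slice to each string instead. A sketch with a made-up link, built so that characters 39 to 43 are known:

```python
import pandas as pd

# Made-up link: 39 filler characters, then a 5-character id, then a tail.
link = "x" * 39 + "ABCDE" + "/rest_of_link"
megadf = pd.DataFrame({"full_link": [link, link.replace("ABCDE", "FGHIJ")]})

# .str[39:44] takes characters 39..43 of every string in the column.
megadf["id"] = megadf["full_link"].str[39:44]
```

megadf["full_link"].str.slice(39, 44) from the linked docs is equivalent.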

#

I'm building an evo sim with NEAT, trying to add risk assessment as a function

#

To me it seems like it makes sense to make it a hidden neuron and hook it up with the input of the creature's stats and other creature's stats

#

But I don't think there is a way to do that

#

Should I just shove the function in the input alongside all the other inputs, or just hope that it evolves one on its own

real wigeon Jun 6, 2020, 8:43 PM

#

anyone worked with dash and plotly?

#

I'm trying to put my graph into the dash script

real wigeon Jun 6, 2020, 10:25 PM

#

i figured it out

#

now how to use dash to add data from a pandas function

coarse fox Jun 7, 2020, 1:13 AM

#

I need some help with a code I'm writing: I want to write a code that'll import all csv files in a specific folder into a class. all the csv files are in this form:

📎 unknown.png

hearty kindle Jun 7, 2020, 2:09 AM

#

https://images-ext-1.discordapp.net/external/6fyvfjuRsm43XKP6cnkF3M4klzhUq5ZwKcGKGMfkdp0/https/media.discordapp.net/attachments/712512283586330626/718727769478922300/unknown.png
hi, would any of you happen to know of a way to remove texture seams, ive heard that esrgan an image upscaler has a feature like that but to do that you would have to upscale with that option on and the script that i used didn't allow for an option like that to be enabled, reupscaling is sadly not an option since well its just too much to upscale

slate scroll Jun 7, 2020, 4:08 AM

#

@hearty kindle Can you provide any more details? Something like this perhaps: http://vcg.isti.cnr.it/Publications/2012/Tar12/

Cylindrical and Toroidal Parameterizations Without Vertex Seams

hearty kindle Jun 7, 2020, 4:10 AM

#

well i wouldn't really call myself tech savvy but anything capable of removing seams from tiled textures would help me tremendously

#

ill give this a look

slate scroll Jun 7, 2020, 4:10 AM

#

Why are you getting seams, if you're just tiling textures onto a flat surface there shouldn't be any seams. The problem arises when you project a texture onto a surface.

hearty kindle Jun 7, 2020, 4:13 AM

#

well im not really using any 3d modeler, im just working on a texture replacement for a very old game, and there really isnt a way to remove seams through the game engine due to how old it is, so the textures have to be seamless

slate scroll Jun 7, 2020, 4:14 AM

#

Hm ok well you're outside my area of expertise. I know image processing but nothing about games or how their textures work. Sorry!

hearty kindle Jun 7, 2020, 4:51 AM

#

no no need to apologize you sent me exactly what i was looking for, thank you so much!

sonic finch Jun 7, 2020, 5:14 AM

#

any NLP people around this time of night?

paper niche Jun 7, 2020, 5:33 AM

#

maybe, just ask your question. don't ask to ask

flat quest Jun 7, 2020, 8:46 AM

#

depends on what kinda nlp person ur looking for

tidal sonnet Jun 7, 2020, 4:32 PM

#

quick question...

#

i'm trying to use pandas. But i keep getting this error

#

 import pandas._libs.window.aggregations as window_aggregations
ImportError: DLL load failed while importing aggregations: The specified module could not be found.

#

any idea how to fix it or what might be causing it?

bitter kayak Jun 7, 2020, 4:33 PM

#

What do you get when you type py -0 at the console?

tidal sonnet Jun 7, 2020, 4:33 PM

#

Installed Pythons found by py Launcher for Windows
 (venv) *
 -3.8-64

bitter kayak Jun 7, 2020, 4:34 PM

#

If you type pip show pandas, what comes up?

tidal sonnet Jun 7, 2020, 4:36 PM

#

Name: pandas
Version: 1.0.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: c:\users\zelda\onedrive\programs\code\python\practice\venv\lib\site-packages
Requires: pytz, numpy, python-dateutil
Required-by:

bitter kayak Jun 7, 2020, 4:36 PM

#

Aha, this sounds like a cousin to this problem.

#

https://stackoverflow.com/questions/60767017/importerror-dll-load-failed-while-importing-aggregations-the-specified-module

Stack Overflow

ImportError: DLL load failed while importing aggregations: The spec...

I am new to Python and currently having trouble when importing some libraries.

I am using Python 3.8.

I have installed Pandas in the CMD using "pip install pandas"

If i go to Python folder i see...

#

The suggested solution:

#

pip uninstall pandas
pip install pandas==1.0.1

#

Basically, you install a slightly earlier version of Pandas until they fix this bug with the most recent version.

tidal sonnet Jun 7, 2020, 4:37 PM

#

aite... i'll try that

bitter kayak Jun 7, 2020, 4:37 PM

#

Another suggestion was to install the most recent Visual C++ Redistributables https://aka.ms/vs/16/release/vc_redist.x64.exe

#

but installing 1.0.1 sounds like an easier fix

tidal sonnet Jun 7, 2020, 4:42 PM

#

ah...

#

it works...

#

thank you so much

bitter kayak Jun 7, 2020, 4:55 PM

#

YW

lapis sequoia Jun 7, 2020, 6:41 PM

#

Hi, does anyone have experience with perceptron classifiers? I have what should be a reasonably basic question

uncut shadow Jun 7, 2020, 7:06 PM

#

then you should ask this question

flat quest Jun 7, 2020, 7:14 PM

#

^^

hearty jewel Jun 7, 2020, 8:58 PM

#

hi everyone - got a quick question: why is " step=random_walk[-1]" set to the last element in this picture here? if I change it to say step=random_walk[0].. i get the following output: [0, 3, 1, 1, -1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 5, -1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, -1, 4, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 5, 1, 4, 1, -1, 1, 1, -1, 1, 1, 2, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 2, -1, -1, 1, -1, 1, 1, 1, 2, -1, 1, 1, 1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 3, 1, 1, 1, -1, 1, 1]

📎 unknown.png

#

instead of the [0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, -1, 0, 5, 4, 3, 4, 3, 4, 5, 6, 7, 8, 7, 8, 7, 8, 9, 10, 11, 10, 14, 15, 14, 15, 14, 15, 16, 17, 18, 19, 20, 21, 24, 25, 26, 27, 32, 33, 37, 38, 37, 38, 39, 38, 39, 40, 42, 43, 44, 43, 42, 43, 44, 43, 42, 43, 44, 46, 45, 44, 45, 44, 45, 46, 47, 49, 48, 49, 50, 51, 52, 53, 52, 51, 52, 51, 52, 53, 52, 55, 56, 57, 58, 57, 58, 59]

uncut shadow Jun 7, 2020, 8:59 PM

#

Ummm

#

Well

#

In python if you do e.g. list[-1] you are going to get last element of this list

#

If you do [-2] you're going to get the second-to-last element

#

So

#

>>> a = [0, 1, 2, 3, 4]
>>> a[-1]
4
>>> a[-2]
3

hearty jewel Jun 7, 2020, 9:02 PM

#

okay

#

so why is it important to set it to -1 in this situation

#

what changes between setting it to zero or -1

uncut shadow Jun 7, 2020, 9:03 PM

#

0 gives you first element

#

While -1 gives you last element

hearty jewel Jun 7, 2020, 9:04 PM

#

why do i want the last element?

uncut shadow Jun 7, 2020, 9:04 PM

#

Well idk lol

hearty jewel Jun 7, 2020, 9:04 PM

#

why did making it the last element make it additive?

#

look at the outputs

#

oh sorry

#

i put in the new output

#

so when you make it the last element, it becomes additive in nature

uncut shadow Jun 7, 2020, 9:07 PM

#

Well, I see there is some other part of code which is not shown in this screenshot and maybe it might come in handy

hearty jewel Jun 7, 2020, 9:07 PM

#

here u go

📎 unknown.png

uncut shadow Jun 7, 2020, 9:09 PM

#

Hmmm

#

Well, this looks like you maybe did some logical error

hearty jewel Jun 7, 2020, 9:10 PM

#

no no there is no error or anything

#

im just wondering why

uncut shadow Jun 7, 2020, 9:10 PM

#

Oh

hearty jewel Jun 7, 2020, 9:10 PM

#

the author sets step=random_walk[-1]

#

as the starting step

uncut shadow Jun 7, 2020, 9:11 PM

#

Yeah

hearty jewel Jun 7, 2020, 9:11 PM

#

and why when you use step=random_walk[0]

#

why the output changes

#

why did the author choose to set step as the last element and what impact did that make

uncut shadow Jun 7, 2020, 9:12 PM

#

So, the author wants each new step to start from the latest position, which is the last element of the list, so he uses -1. If he used 0, step would always start from 0 (because 0 is the first element of this list), so every entry would be just one move away from 0 instead of adding up
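
The code in the screenshots is not available here, so the following is only a sketch of a dice-based random walk with assumed rules (down 1 on a roll of 1-2, up 1 on 3-5, up by a second roll on 6). It shows the difference between starting each step from `random_walk[-1]` and from `random_walk[0]`:

```python
import random

def walk(start_from_last, n_steps=10, seed=0):
    """Build a random walk; each new position starts from an earlier one."""
    rng = random.Random(seed)
    random_walk = [0]
    for _ in range(n_steps):
        # [-1] is the most recent position; [0] is always the starting value 0.
        step = random_walk[-1] if start_from_last else random_walk[0]
        dice = rng.randint(1, 6)
        if dice <= 2:
            step -= 1
        elif dice <= 5:
            step += 1
        else:
            step += rng.randint(1, 6)
        random_walk.append(step)
    return random_walk

cumulative = walk(start_from_last=True)    # positions build on each other
independent = walk(start_from_last=False)  # every entry is one move away from 0
```

With the same seed, `cumulative` is the running sum of `independent`, which is the "additive" behaviour noticed in the two outputs above.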

keen wasp Jun 7, 2020, 9:15 PM

#

was wondering if anyone has any ideas how I could write this function in a vectorized fashion. I'm trying to run it on a dataframe with 500million rows and it takes a very long time.

It's trying to find the indices where the cumulative sum first goes over some limit, then resets.

📎 unknown.png

lusty coral Jun 7, 2020, 9:28 PM

#

@keen wasp i have an idea

#

📎 unknown.png

#

let's say we have this

#

i want to check where sum of 15 is gone over

#

📎 unknown.png

#

then drop duplicates

#

📎 unknown.png

desert oar Jun 7, 2020, 9:29 PM

#

@keen wasp if all else fails you can use numba and make the function operate on the underlying numpy array 😉

keen wasp Jun 7, 2020, 9:30 PM

#

@lusty coral interesting ... lemme try that.

desert oar Jun 7, 2020, 9:30 PM

#

in all seriousness you might want to just iterate over df['x'].iteritems

lusty coral Jun 7, 2020, 9:30 PM

#

also check "vaex" if you looking for working with that kind of data

#

iteritems goes over all rows, column based operations might speed up the process i guess?

desert oar Jun 7, 2020, 9:31 PM

#

i really like that solution @lusty coral

#

the entire cumsum operation should be pretty fast, faster than basically anything you can write by hand other than numba

#

the downside is it's computing an extra column

#

you can maybe bypass the full computation with the right numexpr invocation

keen wasp Jun 7, 2020, 9:31 PM

#

yeah the only worry i had about hitting cumsum is the memory of storing that entire column

desert oar Jun 7, 2020, 9:31 PM

#

but 500m rows of an extra column is a nontrivial amount of memory

#

yeah

keen wasp Jun 7, 2020, 9:32 PM

#

thanks for the help both of you, gives me some ideas to experiment with

lusty coral Jun 7, 2020, 9:32 PM

#

"vaex" claims to solve these memory issues because it computes the columns on the fly instead of storing them

#

works with pandas

keen wasp Jun 7, 2020, 9:32 PM

#

cool never heard of vaex! ill check it out

desert oar Jun 7, 2020, 9:33 PM

#

isnt the idea with vaex just that it keeps data on-disk until needed?

#

like dask or spark

#

which seems good for your case

lusty coral Jun 7, 2020, 9:34 PM

#

i'm checking it out but i never deal with that many rows 😄 so for me, even though interesting, it's over-productive 🙂

#

@desert oar it says we do not store computed columns, they just show it i guess?

#

i dont get it, but they claim it, so i believe 😄

desert oar Jun 7, 2020, 9:36 PM

#

no, id believe it

#

if they use some kind of DAG execution engine

#

that's what spark does for example

lusty coral Jun 7, 2020, 9:36 PM

#

so it would be cpu heavy?

desert oar Jun 7, 2020, 9:36 PM

#

(and most sql query planners)

#

its the same cpu usage as if it stores in memory, at least conceptually

#

its a sequential algorithm so you have to do 500 million comparisons and 500 million += operations

#

no matter what

lusty coral Jun 7, 2020, 9:37 PM

#

why people deal with that many rows of data? i mean why they dont partition the data, then deal with it?

desert oar Jun 7, 2020, 9:37 PM

#

its easier if you dont have to bother

#

also how do you partition this?

#

this algorithm needs the entire data

#

so maybe you need a data structure that transparently partitions but logically it should look like 1 single data frame

#

dask and spark both do something like that

lusty coral Jun 7, 2020, 9:38 PM

#

interesting. glad i'm not dealing with big data things 😄 i'm happy with my top 10k or so data

desert oar Jun 7, 2020, 9:42 PM

#

i cant tell if vaex even supports cumsum

warm mauve Jun 7, 2020, 9:43 PM

#

Hello, I have an assignment for class. Anyone willing to help me?

#

PCA, Cross-validation and all..

desert oar Jun 7, 2020, 9:49 PM

#

that said, @keen wasp this should be a lot more efficient

@numba.jit(nopython=True)
def get_bar_indices(arr, limit):
    indices = []
    row_num = 0
    cum_dollar = 0.0
    for x in arr:
        if cum_dollar < limit:
            cum_dollar += x
        else:
            indices.append(row_num)
            cum_dollar = 0.0
        row_num += 1
    return indices

bar_indices = get_bar_indices(df['x'].to_numpy(), float(limit))

caveat: .to_numpy() is not guaranteed to not create a copy. however if df['x'] is a standard Numpy type (e.g. float) it should create a view and not a copy

#

internally it calls np.asarray(df['x']._values) - so you're relying on np.asarray to create a view and not a copy

#

also make sure to pass limit as a float just in case numba tries to generalize or cast types in a weird way
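
A quick way to check what the loop returns on a tiny array. This is a sketch, not a benchmark: the sample values and the limit of 15.0 are made up, and the fallback decorator is only there so the snippet still runs (slowly) where numba is not installed:

```python
import numpy as np

try:
    import numba
    jit = numba.njit  # compiles the loop to machine code
except ImportError:
    # Fallback so the sketch still runs without numba installed.
    def jit(func):
        return func

@jit
def get_bar_indices(arr, limit):
    indices = []
    cum_dollar = 0.0
    for row_num in range(arr.shape[0]):
        if cum_dollar < limit:
            cum_dollar += arr[row_num]
        else:
            # The running sum reached the limit on an earlier row: record and reset.
            indices.append(row_num)
            cum_dollar = 0.0
    return indices

x = np.array([5.0, 5.0, 6.0, 1.0, 9.0, 9.0, 1.0])
bar_indices = get_bar_indices(x, 15.0)
```

Note that, as written, the row that triggers the reset is not added to the next running sum; whether that matches the original function depends on code that is only visible in the screenshot.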

sonic bridge Jun 7, 2020, 10:45 PM

#

def load_config(cfg):
    with open("config.json", "r") as f:
        config = json.load(f)
        #main = config["main.owner_id"]
        for data in config["main"]:
            cfg = data[cfg]
        return cfg

why does that return cfg twice?

real wigeon Jun 7, 2020, 11:41 PM

#

is it because you named the input cfg?

#

idk tbh

#

someone who's used dash before plz help

#

i am having a hard time with dash

blazing bridge Jun 8, 2020, 12:06 AM

#

Can someone help me with gradient descent. So the definition given is that gradient refers to the slope of the curve at any point. So they say to find the gradient of loss as intercept changes is this formula... what do they Mean by gradient of loss, is it slope of the loss curve itself or something else. if someone could ping me and maybe talk to me that would be great

desert oar Jun 8, 2020, 12:11 AM

#

@sonic bridge what do you mean "returns"

#

@blazing bridge yes, the gradient of the loss curve

#

or more generally the loss surface if you are optimizing over more than one variable

lapis sequoia Jun 8, 2020, 12:12 AM

#

Can someone help me please?

#

Very easy for you professionals

sonic bridge Jun 8, 2020, 12:13 AM

#

i mean if execute

print(load_config("data"))

it prints it twice

blazing bridge Jun 8, 2020, 12:17 AM

#

@desert oar it’s as we change the y intercept or b we check to see the slope of the loss curve and see if it’s going down or how does that work finding the gradient of b

desert oar Jun 8, 2020, 12:18 AM

#

the loss is a function of y and b

#

you're checking the gradient of the loss function

#

the loss function is the loss curve

#

the slope of a curve is the gradient

#

@sonic bridge it shouldn't with that code you wrote

#

but your code is wrong in another way...

        for data in config["main"]:
            cfg = data[cfg]
        return cfg

doesn't make any sense
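
One guess at what was intended, assuming config.json looks like `{"main": {"owner_id": ...}}` (the real file layout is not shown, so the structure and the `path` parameter are assumptions):

```python
import json

def load_config(key, path="config.json"):
    """Return the value stored under config["main"][key]."""
    with open(path, "r") as f:
        config = json.load(f)
    # Look the key up directly instead of looping and overwriting the argument.
    return config["main"][key]
```

The loop in the original reassigns `cfg` on every iteration, so after the first pass it is no longer the key that was passed in.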

real wigeon Jun 8, 2020, 12:20 AM

#

@desert oar ever used dash? you seem busy now, but I could really use some guidance

blazing bridge Jun 8, 2020, 12:20 AM

#

Oh ok, sorry to annoy you. So it's just checking how much the loss changes when we change the intercept, and this is done using the loss vs b curve

desert oar Jun 8, 2020, 12:21 AM

#

i have not @real wigeon

#

@blazing bridge i'm not sure what you mean

#

for gradient descent, you compute the value of the gradient at your current b value, then use that to update your b value
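
A minimal sketch of that update rule for the intercept only, assuming a mean squared error loss, a fixed slope, and made-up data that lies exactly on y = 2x + 1 (the learning rate of 0.1 is also an arbitrary choice):

```python
# Hypothetical data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m = 2.0              # slope held fixed so only b is learned
b = 0.0              # initial guess for the intercept
learning_rate = 0.1

def loss_gradient_b(b):
    """d/db of the mean squared error: mean(-2 * (y - (m*x + b)))."""
    return sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / len(xs)

for _ in range(100):
    grad = loss_gradient_b(b)
    b -= learning_rate * grad   # step against the gradient, i.e. downhill
```

A negative gradient means the loss decreases as b grows, so b is increased; the updates shrink as the gradient approaches zero near the minimum.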

blazing bridge Jun 8, 2020, 12:22 AM

#

What does the slope of the curve do to update your value

#

Is it if the value of the gradient is zero we have reached the min

real wigeon Jun 8, 2020, 12:25 AM

#

slope is rate of change i believe

blazing bridge Jun 8, 2020, 12:25 AM

#

Yeah

#

But what does it do in this case

flat quest Jun 8, 2020, 12:45 AM

#

@numba.jit(nopython=True)
def get_bar_indices(arr, limit):
    indices = []
    row_num = 0
    cum_dollar = 0.0
    for x in arr:
        if cum_dollar < limit:
            cum_dollar += x
        else:
            indices.append(row_num)
            cum_dollar = 0.0
        row_num += 1
    return indices

bar_indices = get_bar_indices(df['x'].to_numpy(), float(limit))

@desert oar wouldn't this still be a looped sequential operation, making it not that efficient?
Or am i missing something here

sonic bridge Jun 8, 2020, 1:07 AM

#

but your code is wrong in another way...

         for data in config["main"]:
             cfg = data[cfg]
         return cfg

doesn't make any sense
@desert oar at least it works lmao

desert oar Jun 8, 2020, 1:07 AM

#

@flat quest its an inherently sequential algorithm

#

Cumsum is a loop too, just in C not Python

flat quest Jun 8, 2020, 1:09 AM

#

yeah ik, but would that algorithm be parallelizable? Since it can't really split up the operations like you could with cumsum.

desert oar Jun 8, 2020, 1:10 AM

#

Right i dont think it is

#

Unless you precompute cumsum and then you have the memory problem

flat quest Jun 8, 2020, 1:11 AM

#

hmm
yeah true. Its always speed vs memory :/. Tho I heard vaex and dask help out with the memory part to some extent

desert oar Jun 8, 2020, 1:12 AM

#

Only by storing things on disk and not in memory @flat quest

blazing bridge Jun 8, 2020, 1:13 AM

#

@desert oar are you able to get on a voice chat on the Code/Help section. Its ok if you are not comfortable. I feel like it would be much easier to explain what I mean

desert oar Jun 8, 2020, 1:13 AM

#

No sorry

blazing bridge Jun 8, 2020, 1:13 AM

#

ok thats ok

flat quest Jun 8, 2020, 1:14 AM

#

ah gotcha. Is the diff in speed that much between pandas and something like vaex? Been mainly using pandas, but considering picking up vaex / dask if its worth.

blazing bridge Jun 8, 2020, 1:14 AM

#

@desert oar I had a question about slope and what does the slope do in this case

#

and if you dont mind can you explain the concept of gradient descent down in simple terms

safe tapir Jun 8, 2020, 2:54 AM

#

gradient is an n-dimensional slope (i.e. vector)

#

ML solves problems by finding local minima

#

gradient descent is following the gradient downhill towards a local minimum

#

this is similar to finding the turning point of a parabola, but in n-dimensions

blazing bridge Jun 8, 2020, 3:15 AM

#

@safe tapir what does the slope do or in this case the slope do to reach the minimum of the loss curve

safe tapir Jun 8, 2020, 3:16 AM

#

the turning point is at slope = 0

blazing bridge Jun 8, 2020, 3:17 AM

#

Ok so if the slope is going downward at any parameter such as m or b the gradient will follow it until it reaches 0

safe tapir Jun 8, 2020, 3:17 AM

#

yes

#

that is the turning point

#

when m = 0

blazing bridge Jun 8, 2020, 3:20 AM

#

Yeah I meant the slope or m of the line in this case rather than the gradient but thank you for clearing it up

#

Sorry to bombard you with questions but for linear regression how does the line relate to the loss curve

#

I understand it’s used to see if the parameters when changed we check to see how much loss was produced and to minimize loss but where is it plotted on the curve

safe tapir Jun 8, 2020, 3:29 AM

#

the position of the line of best fit relative to the actual data points produces residuals

#

the loss function measures those residuals; reducing it maximizes the fit of the line

#

i.e. minimize loss

blazing bridge Jun 8, 2020, 3:29 AM

#

Sorry what are residuals

safe tapir Jun 8, 2020, 3:30 AM

#

it's the red lines

blazing bridge Jun 8, 2020, 3:30 AM

#

Oh so like mean squared error

safe tapir Jun 8, 2020, 3:30 AM

#

📎 nR5CpI_HA5GWzpVAwlKvhzqq3pPhpRtb4QSv8vqnJE-wFAL_pDLKW9H06WGUHDbCoFy1940McZZCuWY01oaf8sErR7K3yUYdBwjA.png

#

the residuals are the distance of the data from the line of best fit

#

in the right plot you are projecting the line of best fit onto the x axis
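
As a concrete sketch with made-up numbers, residuals and the mean squared error for one candidate line can be computed like this:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]   # hypothetical observations
m, b = 2.0, 0.0              # a candidate line y = m*x + b

# Residual = observed y minus the y predicted by the line (vertical distance).
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
# Mean squared error: one common loss built from the residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)
```

Changing m or b changes the residuals and therefore the loss, which is what the loss curve plots.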

blazing bridge Jun 8, 2020, 3:32 AM

#

Ok so gradient descent minimizes the loss using these residuals

#

Thank you so much

#

So just to summarize what was said gradient descent minimizes loss following the slope of the curve at any point downwards towards the local minimum until the gradient of the curve reaches zero and this is done for a line where we are changing m and b of the line accordingly and see if that minimizes the loss i.e gradient moving downwards at that point on the graph

flat quest Jun 8, 2020, 3:40 AM

#

would be nice if we could find the absolute min :/, rather than relying on local min all the time @safe tapir

safe tapir Jun 8, 2020, 3:41 AM

#

you are updating parameters (in this case, m and b) until you get slope that approaches 0

#

there is no guarantee of absolute minima because there can be many many critical points for your function

#

you have to keep testing to find a better minimum

flat quest Jun 8, 2020, 3:48 AM

#

yeah ik. Well if we were able to mathematically find an absolute minima within a certain boundary at least, that may be nice. Would make learning a lot faster at least, even if it isn't the absolute best critical point.

Well another thing with local minima. Once a model reaches the local min, it won't leave it unless you run the model again.

solid aurora Jun 8, 2020, 7:14 AM

#

Why does Pandas not support using the and operator to find the && of two series, but it can use the __and__() function just fine?

#

doesn't the former delegate to the latter?

stable star Jun 8, 2020, 7:17 AM

#

@solid aurora Pandas objects such as Series do not have a boolean value

#

its not clear whether they are true or false, so pandas throws an error

sonic raft Jun 8, 2020, 10:11 AM

#

Hi! Is it possible, to create an accurate, image classifier with SVM?

desert oar Jun 8, 2020, 10:31 AM

#

@solid aurora because of the way python works. the behavior of and cannot be overridden by classes and therefore cannot be used for custom functionality in pandas or any other library
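
A small sketch of the difference, using two made-up boolean Series:

```python
import pandas as pd

a = pd.Series([True, True, False])
b = pd.Series([True, False, False])

elementwise = a & b          # calls a.__and__(b): element-wise AND
try:
    a and b                  # Python first asks bool(a), which pandas refuses
    raised = False
except ValueError:
    raised = True
```

So `and` never reaches `__and__`; it only needs the truth value of its left operand, and a Series does not define one.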

#

@sonic raft it was used in the 90s and 00s but the accuracy is poor compared to a modern neural network. lots of transformations were applied to images in order to get SVMs to work

sonic raft Jun 8, 2020, 10:35 AM

#

@desert oar Okay, I'm learning about ML at codecademy and they only teach me how to use Perceptrons, do you consider Perceptrons as neural networks?

desert oar Jun 8, 2020, 10:36 AM

#

a "multilayer perceptron" is a basic kind of neural network yes

#

for image classification people commonly use convolutional neural networks, which are more complicated

leaden creek Jun 8, 2020, 10:47 AM

#

okay, so i'm trying to get the openai gym environments to work with a machine learning experiment (very bad, but works, has successfully learned the xor example) but i keep getting an error along the lines of AssertionError: 0.8197223612113118 (<class 'float'>) invalid with

def test(agent):
    done = False
    observation = env.reset()
    while not done:
        env.render()
        action = agent.predict(observation)
        observation, reward, done, info = env.step(action[0][0].item())

        if done:
            observation = env.reset()

#

specifically on env.step

#

the documentation on openai gym (at least what i've seen) is sparse at best and non-existent at worst, so could anyone point me in the right direction on what to do or where to look for docs?

lapis sequoia Jun 8, 2020, 12:44 PM

#

hey, i have two datasets: One has hourly data in UTC Datetime, the other one has events with a UNIX timestamps. In one hour there can be any number of events from the second dataset, so sometimes 3 or 2000 or 0 events could occur in an hour. Do you have some good starting tutorial on how to work with dates and consolidate these two datasets? I want to count the events per hour but it is already an issue for me to work with datetime and UNIX timestamps. Sorry for this basic question, but if you have a link to a simple tutorial on how to work with dates in pandas dataframes it would be greatly appreciated

paper niche Jun 8, 2020, 12:50 PM

#

just pd.to_datetime() both columns individually: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

#

once both of them are of type datetime you can easily do whatever you want with them
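
A sketch of that conversion, with hypothetical column names and values. It assumes the UNIX timestamps are in seconds, hence `unit="s"`:

```python
import pandas as pd

# Hypothetical column names and values.
hourly = pd.DataFrame({"time": ["2017-01-27 05:00:00", "2017-01-27 06:00:00"]})
events = pd.DataFrame({"created_utc": [1485493991, 1485494500, 1485497700]})

hourly["time"] = pd.to_datetime(hourly["time"])
# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC.
events["time"] = pd.to_datetime(events["created_utc"], unit="s")

# Both columns are now datetime64, so they can be compared or merged directly.
# Flooring to the hour gives each event the same key as the hourly table.
events["hour"] = events["time"].dt.floor("h")
```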

#

the documentation on openai gym (at least what i've seen) is sparse at best and non-existent at worst, so could anyone point me in the right direction on what to do or where to look for docs?
@leaden creek if there aren't docs, then maybe try the source code on github? What is your env defined as? And what's the full error stacktrace?
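
That assertion message looks like what gym environments with a discrete action space raise when `env.step` gets a float instead of an integer action index. The environment is not shown, so this is only a sketch under that assumption; `to_discrete_action` is a hypothetical helper, not part of gym:

```python
import numpy as np

def to_discrete_action(network_output, n_actions):
    """Map raw network output to an integer action index in [0, n_actions)."""
    out = np.asarray(network_output, dtype=float).ravel()
    if out.size == 1:
        # Single output unit: scale a value in [0, 1] to an action index.
        return int(min(n_actions - 1, max(0, int(out[0] * n_actions))))
    # One output per action: pick the highest-scoring one.
    return int(np.argmax(out))
```

The call would then look like `env.step(to_discrete_action(action, env.action_space.n))`, assuming `env.action_space` is a `Discrete` space.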

dense knot Jun 8, 2020, 3:52 PM

#

Hello guys, do you have datasets for network security or malware? It will be very helpful if you have it and are willing to share it with me

flat quest Jun 8, 2020, 3:53 PM

#

there might be some on kaggle @dense knot or uni websites

#

take a look at those

dense knot Jun 8, 2020, 3:59 PM

#

thank you, i appreciate it

real wigeon Jun 8, 2020, 7:05 PM

#

Hey do you guys recommend any good tutorials for working with big data?

desert oar Jun 8, 2020, 7:22 PM

#

how big is big

uncut shadow Jun 8, 2020, 7:26 PM

#

Well, I don't think there are many tutorials like that. Big data is not just few GBs of some labels. Only big companies are able to use it (cuz others simply don't have this amount of data) which makes it harder

real wigeon Jun 8, 2020, 7:33 PM

#

Idk

#

I am looking for work as an analyst, but the field is so broad it's hard to tell what skills I need

#

For example dashboards or not

#

Predictive statistics or not

#

Sql/db knowledge or not

#

Now I'm seeing stuff about spark

desert oar Jun 8, 2020, 7:37 PM

#

skip big data imo

#

analysts generally don't need to deal with that or care about it

#

i'd focus on: probability & stats, intermediate-level excel (array formulas, vlookup/index-match, pivot tables, charts), data visualization principles, at least one data viz/analysis tool like qlik or tableau, sql fundamentals, and python fundamentals

#

that's already quite a handful without worrying about big data and spark

#

you don't need to deep dive into mathematical stats, but you need to be familiar with the most important equations and have a thorough conceptual understanding of how everything works

#

source: i work with analysts and this is their skillset

lapis sequoia Jun 8, 2020, 7:42 PM

#

Can anyone of you help me with understanding how datetime works in pandas dataframes?

desert oar Jun 8, 2020, 7:43 PM

#

sure, got a specific example of some data you're using?

lapis sequoia Jun 8, 2020, 7:43 PM

#

I want to consolidate a dataframe that has data with UNIX timestamps. I changed them to datetime but here is my issue:

#

It contains a number of events. there can be a lot of events within an hour and it changes from hour to hour

#

Basically i want to create a dataframe where each row is one consecutive hour and i count the events for each hour and a few more data points

#

but i cannot figure out how to select only those rows of the first dataframe that are from a specific hour, or how i can loop through the dataframe and create a new one based on the hours, because sometimes there are hours where no events occur, so those would need to be created separately

#

it is so confusing for me. What are the right steps that I need to take?

#

or in which order can i tackle this problem?

desert oar Jun 8, 2020, 7:48 PM

#

@lapis sequoia it sounds like you want to group by hour and compute some aggregate values for each hour

#

is that a reasonable summary?

lapis sequoia Jun 8, 2020, 7:48 PM

#

yes

#

precisely

desert oar Jun 8, 2020, 7:50 PM

#

let's assume your timestamp column is called 'timestamp' and your data frame is df

#

df['hour'] = df['timestamp'].dt.strftime('%Y-%m-%d %H:00')
df.groupby('hour').count()

#

more info on what you can do with groupby here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

lapis sequoia Jun 8, 2020, 7:52 PM

#

okay i will read into it

#

thank you

#

so with that i can group my dataframe into the single hours?

desert oar Jun 8, 2020, 7:53 PM

#

yeah, i'm using strftime to ensure that every hour has a unique string associated with it

#

then just grouping by that

#

the datatype of the timestamp column will be datetime64[ns] (it is only a DatetimeIndex when you set it as the index) https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html#datetimeindex
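
Grouping by a string only produces hours that contain at least one event. A sketch of an alternative that also yields the empty hours, using `resample` on made-up rows (the column names `score_sum` and `no_events` are illustrative):

```python
import pandas as pd

# Illustrative rows; 'timestamp' is already a datetime column.
df = pd.DataFrame({
    "score": [345, 1, 7],
    "timestamp": pd.to_datetime([
        "2017-01-27 05:13:11", "2017-01-27 05:48:02", "2017-01-27 07:05:30",
    ]),
})

per_hour = (
    df.set_index("timestamp")
      .resample("h")                     # one bin per consecutive hour, gaps included
      .agg({"score": ["sum", "count"]})
)
per_hour.columns = ["score_sum", "no_events"]
```

The 06:00 hour has no events, so it appears with a count of 0 rather than being skipped.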

lapis sequoia Jun 8, 2020, 7:54 PM

#

And how would i then loop through the single rows of each hour and pass them to a function? like this?

for index, row in df['hour']:
  ...

desert oar Jun 8, 2020, 7:54 PM

#

user guide here https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

#

generally you dont want to loop over dataframe rows if the data is big

calm scarab Jun 8, 2020, 7:55 PM

#

@lapis sequoia what about apply axis=1 ?

desert oar Jun 8, 2020, 7:55 PM

#

^ yes

#

in fact looping over rows leads to some weirdness with data types

#

so it largely depends on what exactly you want to do row-wise
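
To make the `apply(..., axis=1)` suggestion concrete, here is a sketch with a made-up rule (`label_row` and the thresholds are hypothetical), next to a vectorized version of the same rule:

```python
import pandas as pd

df = pd.DataFrame({"score": [345, 1, 2461], "compound": [0.5106, 0.4836, 0.0]})

def label_row(row):
    # 'row' is a Series holding one row; columns are accessed by name.
    return "positive" if row["compound"] > 0 and row["score"] > 10 else "other"

# axis=1 passes each row (not each column) to the function.
df["label"] = df.apply(label_row, axis=1)

# The same result without a Python-level loop, usually much faster on big data.
df["label_vectorized"] = ((df["compound"] > 0) & (df["score"] > 10)).map(
    {True: "positive", False: "other"}
)
```

Inside `label_row` the integer score arrives as a float, because each row is converted to a single-dtype Series; that is one example of the data type weirdness mentioned above.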

lapis sequoia Jun 8, 2020, 7:56 PM

#

what is apply axis=1? im sorry for my basic questions... i just started with data sciences last week and am quite new to this

desert oar Jun 8, 2020, 7:56 PM

#

back up

#

what do you want to do with each row

lapis sequoia Jun 8, 2020, 7:57 PM

#

wait, i will add some data rows down here so that we are all on the same page

#

one moment

umbral aspen Jun 8, 2020, 7:57 PM

#

Hello - I have a multilabel image classification problem and I am wondering which metrics you guys use to track model performance with tf/keras? I am using just the plain 'accuracy' right now but it shows a much better picture than the real picture as most of my images have 2-3 labels out of a possible 13 and my model shows a good accuracy because it often does not predict classes etc

calm scarab Jun 8, 2020, 7:57 PM

#

@lapis sequoia http://jonathansoma.com/lede/foundations/classes/pandas columns and functions/apply-a-function-to-every-row-in-a-pandas-dataframe/

Apply a function to every row in a pandas dataframe

#

Hello - I have a multilabel image classification problem and I am wondering which metrics you guys use to track model performance with tf/keras? I am using just the plain 'accuracy' right now but it shows a much better picture than the real picture as most of my images have 2-3 labels out of a possible 13 and my model shows a good accuracy because it often does not predict classes etc
@umbral aspen

desert oar Jun 8, 2020, 7:58 PM

#

@umbral aspen one option is hamming distance, see https://stats.stackexchange.com/a/234354/36229

Cross Validated

Multilabel classification metrics on scikit

I am trying to build a multi-label classifier so as to assign topics to existing documents using scikit

I am processing my documents passing them through the TfidfVectorizer the labels through the

#

not sure what the cool kids are using nowadays

umbral aspen Jun 8, 2020, 7:58 PM

#

😎

desert oar Jun 8, 2020, 7:59 PM

#

wikipedia has a few suggestions too https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics

Multi-label classification

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass class...

#

e.g. you can compute precision, recall, and F1 in a multilabel context (i've done this personally)
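
A sketch of why plain accuracy flatters a sparse multilabel problem, using made-up labels (4 images, 13 labels, 2-3 positives each) and a model that never predicts anything. The metrics are written out in NumPy for illustration; `sklearn.metrics` offers `hamming_loss` and `f1_score` for the same purpose:

```python
import numpy as np

# Hypothetical ground truth: 4 images, 13 possible labels, 2-3 labels each.
y_true = np.zeros((4, 13), dtype=int)
y_true[0, [0, 3]] = 1
y_true[1, [1, 3, 7]] = 1
y_true[2, [2, 5]] = 1
y_true[3, [0, 9, 12]] = 1

# A useless model that never predicts any label.
y_pred = np.zeros_like(y_true)

# Per-label accuracy looks good because most entries are 0 in both arrays.
label_accuracy = (y_true == y_pred).mean()
hamming_loss = (y_true != y_pred).mean()

# Micro-averaged precision / recall / F1 count only the positive labels.
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Here the per-label accuracy is above 80% while recall and F1 are 0, which is the kind of gap described in the question.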

calm scarab Jun 8, 2020, 7:59 PM

#

@umbral aspen Accuracy gives the wrong impression if your data is imbalanced. I advise you to use more robust metrics like roc-auc etc.
Check: https://towardsdatascience.com/metrics-for-imbalanced-classification-41c71549bbb5

Medium

Metrics for Imbalanced Classification

The notion of metrics in Data Science is extremely important. If you don’t know how to estimate current results properly, you are unable…

desert oar Jun 8, 2020, 8:00 PM

#

yes multilabel AUC is another option https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/48755

Toxic Comment Classification Challenge

Identify and classify toxic online comments

#

sklearn has an implementation as well i think

umbral aspen Jun 8, 2020, 8:00 PM

#

@calm scarab I did use roc or auc (I forget) and also got a very good accuracy (like 93 %)

#

However the real life performance was bad...even though I used very similar images

desert oar Jun 8, 2020, 8:00 PM

#

did you use separate train and test sets?

umbral aspen Jun 8, 2020, 8:00 PM

#

Yes

#

😄

desert oar Jun 8, 2020, 8:01 PM

#

...is your test set representative of real life data?

umbral aspen Jun 8, 2020, 8:01 PM

#

Yup

#

I am confused as well...

desert oar Jun 8, 2020, 8:01 PM

#

evidently not, or there's a bug in your testing code, or there's a bug in your model

calm scarab Jun 8, 2020, 8:01 PM

#

@umbral aspen check your validation scheme for the models? Are there data leaks? Is the validation data representative of the test set? Is your validation scheme unbiased? Maybe check k-folds and its variants

umbral aspen Jun 8, 2020, 8:02 PM

#

I have about 10k of classified photos. I randomly take 7k of those as training data and the other 3k as validation data

calm scarab Jun 8, 2020, 8:03 PM

#

are you sure that the dev set and test set come from the same distribution?

umbral aspen Jun 8, 2020, 8:03 PM

#

# take 70% of photos for training
df_copy = df.copy()
train_df = df_copy.sample(frac=0.70, random_state=0)
validation_df = df_copy.drop(train_df.index)

calm scarab Jun 8, 2020, 8:03 PM

#

is it kaggle dataset or something?

umbral aspen Jun 8, 2020, 8:03 PM

#

Nope I generated the dataset myself

calm scarab Jun 8, 2020, 8:04 PM

#

How are you evaluating the real performance of the model?

umbral aspen Jun 8, 2020, 8:04 PM

#

Just grabbing 100-200 similar photos and manually going through it

calm scarab Jun 8, 2020, 8:04 PM

#

@umbral aspen can you give k-folds a try?

#

k folds cross validation (or its variants etc)
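
A rough sketch of what k-fold splitting does, in plain Python. `k_fold_indices` is a hypothetical helper; scikit-learn ships `KFold` and `StratifiedKFold` in `sklearn.model_selection` for real use.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal, disjoint folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Training once per fold and averaging the k validation scores gives a less noisy estimate than a single 70/30 split, though it cannot help if the whole dataset differs from the real-life photos.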

lapis sequoia Jun 8, 2020, 8:06 PM

#

@desert oar

Here some example data:

score    compound    datetime
345      0.5106    2017-01-27 05:13:11
1        0.4836    2017-02-03 13:39:00
2461     0         2017-03-19 16:12:53
0        0         2017-03-19 16:56:43
235      0         2017-05-13 12:39:52

What I want in the end is something like this:

score    compound    datetime      no_events
345      0.5106    2017-01-27 05   1
0        0         2017-01-27 06   0
....
2461     0         2017-03-19 16   2

notice how hour 6 on that date has no events, and the two events in hour 16 in this example are counted in one row

desert oar Jun 8, 2020, 8:07 PM

#

how do you compute score and compound in aggregate?

#

sum?

lapis sequoia Jun 8, 2020, 8:08 PM

#

basically yes, but i have a few more values that i need to pass through some functions that i already defined

desert oar Jun 8, 2020, 8:08 PM

#

are those the functions you want to apply to each row?

lapis sequoia Jun 8, 2020, 8:08 PM

#

yes

#

i have some texts in there that i need to pass through some nlp

desert oar Jun 8, 2020, 8:09 PM

#

is it possible for score or compound to be null?

lapis sequoia Jun 8, 2020, 8:09 PM

#

the first example data is the dataframe that i already have

desert oar Jun 8, 2020, 8:09 PM

#

ahh i see

#

ok so you need the hourly logic to be applied to each group

lapis sequoia Jun 8, 2020, 8:09 PM

#

no, score and compound are all filled but it can be zero

#

yes i have the events in the first dataset and i need consecutive hours in the second dataset even if there is no event in an hour

desert oar Jun 8, 2020, 8:11 PM

#

how are the custom functions defined

#

what are their parameters?

#

pandas series? numpy arrays? individual strings of text?

lapis sequoia Jun 8, 2020, 8:12 PM

#

individual strings

#

here one example for simple sentiment:

#

def nltk_sentiment(post):
    nltk_sentiment = SentimentIntensityAnalyzer()
    score = nltk_sentiment.polarity_scores(post)
    return score

desert oar Jun 8, 2020, 8:12 PM

#

but it looks like you already ran that

#

per row

#

in the original data

#

no?

#

that doesn't need to be run hourly

#

that needs to be run on every original row, then the scores are aggregated

#

right?

lapis sequoia Jun 8, 2020, 8:13 PM

#

yeah... it sounds right what you are saying...

#

i need a second to think 🙂

#

well I need to run this function on a column with text, but yes i need to do this on the original data.

#

it has nothing to do with the datetime problem

#

i confused two things

desert oar Jun 8, 2020, 8:15 PM

#

data['hour'] = data['timestamp'].dt.strftime('%Y-%m-%d %H:00')
data['score'] = data['text'].map(nltk_sentiment)

data_hourly_groupby = data.groupby('hour')
data_hourly = data_hourly_groupby.agg({'score': 'count'})
data_hourly['no_events'] = data_hourly_groupby.size()

#

how about something like that

#

if your data is big, you will want to manipulate this a bit so that you aren't doing 2 passes over the groupby, but it's not important if you're just learning

#

then you need to fill in the missing hours

calm scarab Jun 8, 2020, 8:17 PM

#

df['score'] = df[['text']].apply(lambda x: your_func(x['text']), axis=1) is also an option

lapis sequoia Jun 8, 2020, 8:18 PM

#

thanks

desert oar Jun 8, 2020, 8:18 PM

#

better yet, df['text'].map(your_func)
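
A small comparison of the two spellings on toy data. `text_length` is a hypothetical stand-in for any function that takes one string, such as the sentiment function above.

```python
import pandas as pd

def text_length(text):
    # hypothetical stand-in for a per-string function such as nltk_sentiment
    return len(text)

df = pd.DataFrame({"text": ["good", "really bad", "ok"]})

# row-wise apply: pandas builds a Series object for every row
via_apply = df[["text"]].apply(lambda row: text_length(row["text"]), axis=1)

# Series.map: calls the function once per value
via_map = df["text"].map(text_length)

print(via_map.tolist())
```

Both give the same values; `map` on the single column just skips the per-row overhead and reads more directly.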

lapis sequoia Jun 8, 2020, 8:18 PM

#

i will try your suggestions and read more into pandas

calm scarab Jun 8, 2020, 8:18 PM

#

also you can apply in parallel as well, which is faster: https://github.com/nalepae/pandarallel

GitHub

nalepae/pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs - nalepae/pandarallel

desert oar Jun 8, 2020, 8:18 PM

#

im writing up a more complete example, give me a moment

lapis sequoia Jun 8, 2020, 8:18 PM

#

okay

#

one question though concerning the missing hours you mentioned:

#

right now i work with a small piece of the original dataset (2500 rows with 25 columns)

#

i will later run my functions on the original dataset that has almost 2.5 million rows

#

but i can later just generate a datetime column with just hours and then map the filled hours to the new column with all hours right?

desert oar Jun 8, 2020, 8:28 PM

#

@lapis sequoia

def round_hour(ts):
    """ Strip minutes and seconds from Pandas Timestamp """
    return pd.Timestamp(
        year=ts.year,
        month=ts.month,
        day=ts.day,
        hour=ts.hour,
        tzinfo=ts.tzinfo
    )

# nltk_sentiment returns a dict of scores, so keep one numeric value that can be summed
data['score'] = data['text'].map(lambda post: nltk_sentiment(post)['compound'])

data['hour'] = data['timestamp'].map(round_hour)

# named aggregation produces the flat column names directly
data_hourly = data.groupby('hour').agg(score_sum=('score', 'sum'), no_events=('score', 'count'))

full_hourly_index = pd.date_range(data_hourly.index.min(), data_hourly.index.max(), freq='1H')
data_hourly = data_hourly.reindex(full_hourly_index)

this is my recommendation/example
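
A runnable sketch of the same idea on toy data shaped like the example rows posted earlier (the timestamps are made up so the gap is short). It assumes hours without events should show zeros, hence `fill_value=0`, and uses `pd.offsets.Hour()` instead of a frequency string.

```python
import pandas as pd

events = pd.DataFrame({
    "score": [345, 1, 2461, 0],
    "compound": [0.5106, 0.4836, 0.0, 0.0],
    "datetime": pd.to_datetime([
        "2017-01-27 05:13:11",
        "2017-01-27 07:39:00",
        "2017-01-27 09:12:53",
        "2017-01-27 09:56:43",
    ]),
})

hour = pd.offsets.Hour()
events["hour"] = events["datetime"].dt.floor(hour)   # strip minutes and seconds

hourly = events.groupby("hour").agg(
    score=("score", "sum"),
    compound=("compound", "sum"),
    no_events=("score", "count"),
)

# every hour between the first and last event, including the empty ones
full_index = pd.date_range(hourly.index.min(), hourly.index.max(), freq=hour)
hourly = hourly.reindex(full_index, fill_value=0)
print(hourly)
```

Because the aggregation happens before the reindex, the same code should behave the same way on the full dataset; only the hourly table grows with the time span, not with the number of events.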

lapis sequoia Jun 8, 2020, 8:29 PM

#

thank you so much @desert oar

#

i will try it

desert oar Jun 8, 2020, 8:29 PM

#

as always, never copy and paste code that you do not understand

#

i'll be offline for a while but feel free to @ me in a few hours if you still have questions

lapis sequoia Jun 8, 2020, 8:30 PM

#

yeah, thank you so much!

hearty jewel Jun 8, 2020, 8:33 PM

#

can someone please explain to me line by line whats happening here

#

📎 unknown.png

lapis sequoia Jun 8, 2020, 8:34 PM

#

while the more_results variable is true, the fetchmany function does its thing (maybe fetch 50 more results or something?)

#

if this function returns an empty array, the more_results variable is set to False, which stops the loop around fetchmany()

#

for each row in the results the state_count is incremented by one - something like a counter

#

.close() 🙂

calm scarab Jun 8, 2020, 8:37 PM

#

it basically fetches many rows as long as there are more, and for each newly fetched row it counts how many times that row's state has been seen (state_dict is kind of a dict, I think). If there are no more rows, i.e. more_results is False, the loop ends and the proxy is closed.
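
The screenshot isn't available here, so this is only a guess at the shape of that loop, rebuilt with `sqlite3` from the standard library. The `census` table, the `state` column and the batch size of 2 are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT)")
conn.executemany(
    "INSERT INTO census VALUES (?)",
    [("Texas",), ("Ohio",), ("Texas",), ("Ohio",), ("Texas",)],
)

cursor = conn.execute("SELECT state FROM census")
state_count = {}
more_results = True
while more_results:
    partial_results = cursor.fetchmany(2)   # next batch of at most 2 rows
    if not partial_results:                 # an empty batch means nothing is left
        more_results = False
    for (state,) in partial_results:
        state_count[state] = state_count.get(state, 0) + 1

cursor.close()
conn.close()
print(state_count)
```

Fetching in batches keeps memory use flat when the result set is large, which is the usual reason to prefer `fetchmany` over `fetchall`.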

hearty jewel Jun 8, 2020, 8:58 PM

#

thank you @lapis sequoia!!!

#

and @calm scarab !!

#

🙂

lapis sequoia Jun 8, 2020, 9:04 PM

#

Does anyone here know how to implement exponential growth in python? Someone in #help-cookie needs help on the subject and I'm not versed in it. Thanks
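
For reference, a minimal sketch of the two usual forms of exponential growth. The function names are made up for illustration.

```python
import math

def grow_discrete(initial, rate, steps):
    """Compound growth: multiply by (1 + rate) once per step."""
    return initial * (1 + rate) ** steps

def grow_continuous(initial, rate, time):
    """Continuous growth: x(t) = x0 * e^(rate * t)."""
    return initial * math.exp(rate * time)

doubled = grow_discrete(100, 1.0, 3)   # 100 -> 200 -> 400 -> 800
print(doubled, grow_continuous(100, 0.05, 10))
```

Which form fits depends on whether growth happens in fixed steps (interest per year, generations) or continuously over time.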

solid aurora Jun 8, 2020, 9:21 PM

#

How can I convert a DataFrame like this:```py
measure_1 measure_2 measure_3
count 10.00000 10.000000 10.000000
mean 5.50000 1.500000 55.000000
std 3.02765 0.527046 30.276504
min 1.00000 1.000000 10.000000
25% 3.25000 1.000000 32.500000
50% 5.50000 1.500000 55.000000
75% 7.75000 2.000000 77.500000
max 10.00000 2.000000 100.000000

#

into a DataFrame like this:

#

    measure_1_count  measure_1_mean  measure_1_std  measure_1_min ...  measure_3_max
0   10.00000         5.50000         3.02765        1.00000       ...  100.000000```?

#

Essentially, I have a bunch of dataframes, and I want to use the output of df.describe() for each dataframe into a row in a "stats" dataframe

#

my stats dataframe will then be used as input for a machine learning model

#

Is there a way to do it short of manually copying over values? I feel like there's a better way using some underlying numpy stuff

dusty depot Jun 8, 2020, 9:28 PM

#

uhh hmn

#

@solid aurora df.transpose()?

#

no wait

#

you want it to be all one 'row'?

solid aurora Jun 8, 2020, 9:29 PM

#

yea

#

I've seen np.reshape for numpy arrays

dusty depot Jun 8, 2020, 9:29 PM

#

uhh hmn

#

you could do like

#

v =df.unstack().to_frame().sort_index(level=1).T
v.columns = v.columns.map('_'.join)

solid aurora Jun 8, 2020, 9:30 PM

#

ok could you explain that? I'm completely confused lol

dusty depot Jun 8, 2020, 9:31 PM

#

so

#

what df.unstack() does is essentially pivot the index labels so that they go, like, horizontally instead, of sorts?

#

so like

solid aurora Jun 8, 2020, 9:32 PM

#

ok

dusty depot Jun 8, 2020, 9:32 PM

#

the index labels become column names

solid aurora Jun 8, 2020, 9:32 PM

#

yea I see

dusty depot Jun 8, 2020, 9:32 PM

#

right

#

and so it converts it from like a pivot-view-type thing, with to_frame() into a regular ol dataframe

solid aurora Jun 8, 2020, 9:32 PM

#

ok

dusty depot Jun 8, 2020, 9:32 PM

#

so after that point you have a nested dataframe, sort of

solid aurora Jun 8, 2020, 9:32 PM

#

and at that point it has 1 column and a bunch of rows?

dusty depot Jun 8, 2020, 9:33 PM

#

uh

#

lemme just make a quick example

solid aurora Jun 8, 2020, 9:33 PM

#

looks like that's what's happening

#

📎 unknown.png

dusty depot Jun 8, 2020, 9:33 PM

#

ah well, it's become a multi-index dataframe

#

so er, kind of yeah

solid aurora Jun 8, 2020, 9:34 PM

#

multi-index meaning it's like a nested structure?

dusty depot Jun 8, 2020, 9:34 PM

#

ya

solid aurora Jun 8, 2020, 9:34 PM

#

ok

dusty depot Jun 8, 2020, 9:35 PM

#

because sort_index is operating on level=1, as a side effect it explicitly gives each row both its label and sub-label

solid aurora Jun 8, 2020, 9:35 PM

#

as a list? [label, sublabel]?

#

actually looks like it's a tuple

#

makes sense either way

dusty depot Jun 8, 2020, 9:35 PM

#

aye

solid aurora Jun 8, 2020, 9:36 PM

#

what exactly is sort_index supposed to do?

dusty depot Jun 8, 2020, 9:36 PM

#

or well inside pandas it might be something else

#

it sorts objects by their labels

#

so in this case

#

row label

solid aurora Jun 8, 2020, 9:36 PM

#

oh just ORDER BY the index

#

and the level=1 means it collapses sub-labels?

dusty depot Jun 8, 2020, 9:37 PM

#

more specifically in this case, it also makes (possibly redundant) sure that you don't get into weird reordering issues

#

level=1 means it sorts by the sublabel

solid aurora Jun 8, 2020, 9:37 PM

#

oh ok

dusty depot Jun 8, 2020, 9:37 PM

#

the uh, collapsing is more of a side effect of the sorting

solid aurora Jun 8, 2020, 9:37 PM

#

got it

#

and I get the rest of it

#

thanks so much!

dusty depot Jun 8, 2020, 9:38 PM

#

👌

#

i'm not confident that the sort_index is actually necessary tbh

solid aurora Jun 8, 2020, 9:47 PM

#

@dusty depot you're right, it appears that the sort_index isn't really needed

#

I suppose it can't be terribly slow since my number of columns is in the hundreds, so I may as well keep it in to avoid any sort of reordering issues as u stated before
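
A self-contained sketch of the whole flow, under the assumption that every input frame has the same numeric columns. `describe_as_row` is a hypothetical helper name and the two frames are toy data.

```python
import pandas as pd

def describe_as_row(df):
    """Flatten df.describe() into a single-row DataFrame with '<column>_<stat>' names."""
    flat = df.describe().unstack()            # Series indexed by (column, statistic)
    row = flat.to_frame().T                   # one row, two-level column index
    row.columns = row.columns.map("_".join)   # ('measure_1', 'mean') -> 'measure_1_mean'
    return row

frames = [
    pd.DataFrame({"measure_1": range(1, 11), "measure_2": [1] * 5 + [2] * 5}),
    pd.DataFrame({"measure_1": range(11, 21), "measure_2": [3] * 10}),
]

# one row of summary statistics per input frame
stats = pd.concat([describe_as_row(f) for f in frames], ignore_index=True)
print(stats.shape)
```

Without the sort, the columns come out grouped by measure in `describe()` order (count, mean, std, ...), which matches the layout asked for above.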

dusty depot Jun 8, 2020, 9:53 PM

#

👌

solid aurora Jun 8, 2020, 9:54 PM

#

btw I managed to make it a one-liner @dusty depot

#

(df.unstack() 
     ▓▒░    .to_frame() 
      ▓▒░    .sort_index() 
      ▓▒░    .transpose() 
      ▓▒░    .pipe(lambda d: d.set_axis(d.columns.map('_'.join), axis=1)))

dusty depot Jun 8, 2020, 9:54 PM

#

uh

solid aurora Jun 8, 2020, 9:54 PM

#

ugh it copied some of my prompt too

dusty depot Jun 8, 2020, 9:55 PM

#

but 👌

solid aurora Jun 8, 2020, 9:55 PM

#

(well technically not one line but one expression that could be put on one line)

#

@dusty depot there we go

📎 unknown.png

#

took me a bit to realize that you can't use = in a lambda

#

somewhat surprising that I've never run into that before

#

I just got an inexplicable syntax error lmao
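
A small demonstration of why that fails: a lambda body has to be a single expression, and assignment is a statement. `Table` is a made-up stand-in for a DataFrame so the example runs without pandas.

```python
# compiling a lambda that contains an assignment raises SyntaxError
try:
    compile("f = lambda d: d.columns = ['a']", "<example>", "exec")
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False

class Table:
    """Hypothetical stand-in for a DataFrame with a set_axis-style method."""
    def __init__(self, columns):
        self.columns = columns

    def set_axis(self, columns):
        return Table(columns)   # returns a new object instead of assigning in place

# calling a method that returns the renamed object is an expression, so it is allowed
renamed = (lambda d: d.set_axis([c.upper() for c in d.columns]))(Table(["a", "b"]))
print(assignment_allowed, renamed.columns)
```

That is the same trick as `.pipe(lambda d: d.set_axis(...))` in the chained version above.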

dusty depot Jun 8, 2020, 9:57 PM

#

ah yeah lmao

twilit brook Jun 8, 2020, 10:13 PM

#

Hey guys.. I want to check if a value is present in a data frame column. If there is no value I want to append False to a list, and if there is a value I want to append True. I convert the dataframe to a dictionary and use a for-loop to check all the values under that specific key.. here is my code

#

"""

#

"""

#

"""

#

filename = "nba.csv"
nba = pd.read_csv(filename)
nba_dict = nba.to_dict()
nba_list = list(nba)
nba_df = pd.DataFrame(nba_dict)

datatypes = nba_df.dtypes
print(datatypes)

df = pd.DataFrame(nba_dict, columns=["College"])

college_degree = []
check = 0

d = {} # Empty dictionary
l = [] # Empty list
ms = set() # Empty set
s = '' # Empty string
t = () # Empty tuple
n = 0 # Empty integer

for college in nba_dict["College"]:
    if college == d or l or ms or s or t or n:
        college_degree.append(False)
        check = check + 1
    else:
        college_degree.append(True)
        check = check + 1

print(college_degree)
check

#

the list doesn't append 😦

#

Sorry.. the list comes back as all true
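
A likely explanation, sketched with a plain dict shaped like the output of `DataFrame.to_dict()` (the three values are made up): the loop iterates over row labels rather than college names, and the `or` chain never compares against anything but the first empty value.

```python
nan = float("nan")
nba_dict = {"College": {0: "Texas", 1: nan, 2: "Duke"}}   # shape of DataFrame.to_dict()

# iterating a dict yields its keys, so these are row labels, not college names
seen = list(nba_dict["College"])

# `x == d or l or s` parses as `(x == d) or l or s`, not "x equals any of them"
d, l, s = {}, [], ""
flags = [not (college == d or l or s) for college in nba_dict["College"]]

# looking at the values and testing for NaN explicitly gives the intended answer
has_college = [value == value for value in nba_dict["College"].values()]   # NaN != NaN

print(seen, flags, has_college)
```

Since every operand after the first comparison is an empty (falsy) object, the condition is always false and the `else` branch appends True every time.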

twilit brook Jun 8, 2020, 10:48 PM

#

I solved it.. if anyone wants to see my solution let me know 🙂

#

I converted the specific dataframe column I needed as an array, then I converted any 'nan' value to a string value of 'none' and used a for-loop to check that.

#

df = pd.DataFrame(nba_dict, columns=["College"])
arr = df.values
#print(arr)

arr[np.where(arr.astype(str)==str(np.nan))]='none'

#

last line is the conversion

polar acorn Jun 8, 2020, 10:52 PM

#

You don't have to use a for loop even. Pandas is nice that way, you can just write df.College.isna() or df.College.isna().to_list() if you insist on getting a list back.

flat quest Jun 8, 2020, 10:53 PM

#

^^
yeah use the pandas built in one.
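
A quick sketch on a toy frame (the names and colleges are made up). Note that `isna()` is True where the value is missing, so `notna()` is the one that matches "True if there is a value".

```python
import pandas as pd

nba = pd.DataFrame({"Name": ["A", "B", "C"], "College": ["Texas", None, "Duke"]})

missing = nba["College"].isna()        # True where the value is missing
has_college = nba["College"].notna()   # True where a value is present
as_list = has_college.to_list()

print(as_list)
```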

olive lagoon Jun 8, 2020, 10:53 PM

#

Guys please

#

How can i give the user the permission to edit a csv file with pandas

cunning nebula Jun 8, 2020, 10:55 PM

#

chmod +w file.csv

#

in bash

twilit brook Jun 8, 2020, 10:56 PM

#

@polar acorn @flat quest Holy cow that would've made my life SOO much easier.. just got started with pandas

#

Thank you guys

solid aurora Jun 8, 2020, 11:09 PM

#

@twilit brook a good rule of thumb is if you're looping through a dataframe you're likely doing something wrong

#

Wrong as in there is a more efficient vectorized way to do it

twilit brook Jun 8, 2020, 11:10 PM

#

@solid aurora Makes sense.. It seemed like my method had too many in-between steps

#

I converted a specific column from the df into an array and then looped that array

#

I'll try to avoid that next time

flat quest Jun 8, 2020, 11:33 PM

#

yeah either use the built in pandas vector operations

or if that doesn't work
try to use the numpy ones

solid aurora Jun 9, 2020, 12:38 AM

#

Is there a way to force python to garbage collect a dataframe?

#

Currently as I loop through each input file I am using more and more RAM

#

until I run out of ram and my computer freezes completely

#

that sounds like a memory leak to me

gilded shadow Jun 9, 2020, 12:45 AM

#

Does del work
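
A sketch of the pattern, assuming the loop only needs a small result from each file. `load_frame` is a hypothetical stand-in for `pd.read_csv(path)`. Whether memory actually goes back to the OS depends on the allocator, so treat this as something to try, not a guarantee.

```python
import gc
import pandas as pd

def load_frame(i):
    # hypothetical stand-in for pd.read_csv(path) on one input file
    return pd.DataFrame({"value": range(i, i + 1000)})

summaries = []
for i in range(3):
    df = load_frame(i)
    summaries.append(df["value"].mean())   # keep only the small result
    del df                                  # drop the last reference to the big frame
    gc.collect()                            # optional: also clears reference cycles

print(summaries)
```

If RAM still climbs with a loop like this, something is probably holding a reference to each frame, for example a list that collects the frames themselves or slices of them.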

hearty jewel Jun 9, 2020, 1:06 AM

#

why am i getting an error when Im trying to fetch some information from my results?

📎 unknown.png

twilit brook Jun 9, 2020, 1:20 AM

#

@solid aurora Can you avoid using an object dtype?

#

https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe

Stack Overflow

How do I release memory used by a pandas dataframe?

I have a really large csv file that I opened in pandas as follows....

import pandas
df = pandas.read_csv('large_txt_file.txt')
Once I do this my memory usage increases by 2GB, which is expected b...

#

This may help?

safe tapir Jun 9, 2020, 2:17 AM

#

Can anyone link me to more active deep/reinforcement learning channels? Are they on discord/slack/irc?

lapis sequoia Jun 9, 2020, 4:27 AM

#

which classification algorithm should i use for this ? decision,neural networks, logistic or k nearest ?

📎 unknown.png

#

k nearest seems to be the one ?

lapis sequoia Jun 9, 2020, 5:14 AM

#

depends what the classes are

#

and what distance metric you want to use
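
To make the distance-metric point concrete, a bare-bones k-nearest-neighbours sketch with a pluggable metric. The points and labels are toy data, not the dataset from the screenshot.

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(points, labels, query, k=3, distance=euclidean):
    """Majority label among the k training points closest to query."""
    nearest = sorted(range(len(points)), key=lambda i: distance(points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1, 1)))
print(knn_predict(points, labels, (5.5, 5.5), distance=manhattan))
```

On well-separated clusters like these the metric hardly matters; it starts to matter when features have different scales or the classes overlap.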