#data-science-and-ml | Python | Page 210

barren bluff Oct 7, 2019, 6:26 PM

#

📎 unknown.png

#

without the plt.gray() its green and blue biskthink

native patrol Oct 7, 2019, 9:00 PM

#

@jade chasm if you're looking to solve from an academic context - you can use Gurobi (free academic licenses iirc).
that's what we had used in our Linear Programming course

pulsar stag Oct 7, 2019, 10:49 PM

#

https://youtu.be/FoxlrOmen8c

YouTube

Potluck Economics

Manipulating Financial Time Series Data Dash Plotly Datetime

My Website: https://cryptopotluck.com/portfolio/16 Github Repo of Project: https://github.com/cryptopotluck/alpha_vantage_tutorial Alpha Vantage Github: http...

#

https://github.com/cryptopotluck/alpha_vantage_tutorial

GitHub

cryptopotluck/alpha_vantage_tutorial

tutorial video I made and the repo that goes with it - cryptopotluck/alpha_vantage_tutorial

barren bluff Oct 8, 2019, 12:05 PM

#

Hey guys im a bit stuck with my assignment. I have to plot an image before and after using pca, nothing fancy with k-means or anything like that, but I am a little stuck with the plotting after reducing the dimensions

#

here is the code I have so far

#

print(digits.keys())

data = scale(digits.data)

#find amount of samples and features
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))

print("datashape: ",digits.data.shape)

print("n_samples %d, \t n_features %d"
      % (n_samples, n_features))


def plotdigitWithoutPCA(num):
    plt.figure(1, figsize=(3, 3))
    plt.imshow(digits.images[num], cmap = plt.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()
    
plotdigitWithoutPCA(0)
plotdigitWithoutPCA(2)

pca = PCA(n_components=n_digits).fit(data)
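
One possible way to get the "after PCA" picture, sketched under the assumption that `digits` comes from `sklearn.datasets.load_digits()`: project the data onto the components, map it back with `inverse_transform`, undo the scaling, and reshape each row to 8x8. `StandardScaler` stands in for `scale` here only because it can be inverted afterwards; `restore_digits` and `plot_before_after` are made-up helper names, not part of the original code.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()


def restore_digits(n_components=10):
    """Compress the digits with PCA and map them back to 8x8 images."""
    scaler = StandardScaler()
    data = scaler.fit_transform(digits.data)      # standardise each pixel column
    pca = PCA(n_components=n_components).fit(data)
    reduced = pca.transform(data)                 # shape (n_samples, n_components)
    approx = pca.inverse_transform(reduced)       # back to 64 features, with some loss
    restored = scaler.inverse_transform(approx)   # undo the standardisation
    return reduced, restored.reshape(-1, 8, 8)


def plot_before_after(num, restored):
    """Show the original digit next to its PCA reconstruction."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it

    fig, (left, right) = plt.subplots(1, 2, figsize=(6, 3))
    left.imshow(digits.images[num], cmap=plt.cm.binary, interpolation="nearest")
    left.set_title("original")
    right.imshow(restored[num], cmap=plt.cm.binary, interpolation="nearest")
    right.set_title("after PCA")
    for ax in (left, right):
        ax.axis("off")
    plt.show()


reduced, restored = restore_digits(n_components=10)
# plot_before_after(0, restored)
```

Raising `n_components` should make the right-hand image look closer to the original, which is a quick visual check of how much information the components keep.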

barren bluff Oct 8, 2019, 12:22 PM

#

@ me if anyone can help

rapid ridge Oct 8, 2019, 4:54 PM

#

hey guys, is web scraping legal?

fallen quest Oct 8, 2019, 6:33 PM

#

Yeah, just make sure not to overdo it with how frequently you scrape, because the website will ban you or something

deft harbor Oct 8, 2019, 8:21 PM

#

There are copyright concerns, and also ToS issues to keep an eye on.

fallen quest Oct 8, 2019, 8:29 PM

#

True!

rapid ridge Oct 8, 2019, 8:54 PM

#

I will keep that in mind. I will only extract the links.

cyan mantle Oct 8, 2019, 9:26 PM

#

time.sleep(10) will save you bro

#

weSmart

rapid ridge Oct 8, 2019, 11:13 PM

#

LinkedIn and Monster don't allow crawling: their robots.txt has Disallow: /
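
A small sketch of how the two tips above (respect `Disallow` rules, pause between requests) could be combined using only the standard library. The helper names `make_parser` and `allowed_urls` and the example.com URLs are assumptions for illustration; nothing here fetches a real site.

```python
import time
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*", delay=0.0):
    """Yield only the URLs the rules permit, pausing after each one."""
    for url in urls:
        if parser.can_fetch(user_agent, url):
            yield url
            time.sleep(delay)  # e.g. delay=10 to stay polite


# A site that forbids everything, and one that only hides /private/
blocked = make_parser("User-agent: *\nDisallow: /")
partial = make_parser("User-agent: *\nDisallow: /private/")

print(list(allowed_urls(blocked, ["https://example.com/jobs"])))
print(list(allowed_urls(partial, ["https://example.com/jobs",
                                  "https://example.com/private/x"])))
```

Against a live site the parser would normally be filled with `parser.set_url(...)` followed by `parser.read()` instead of a hard-coded string. Passing robots.txt says nothing about copyright or the terms of service, so those still need checking separately.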

rapid ridge Oct 9, 2019, 6:02 AM

#

is mysql or sqlite better for web scraping?

ancient thistle Oct 9, 2019, 6:15 AM

#

you dont need a DB for web scraping (unless you're trying to scrape and download images or something to a server?)

quartz stream Oct 9, 2019, 6:21 AM

#

MCQs are a widely-used question format that is used for general assessment on domain knowledge of candidates. Most of the MCQs are created as paragraph-based questions.A paragraph or code snippet forms the base of such questions. These questions are created based on the three or four options from which one option is the correct answer. The other remaining options are called Distractors which means that these options are nearest to the correct answer but are not correct. You are provided with a training dataset of questions, answers, and distractors to build and train an

#

I am starting to learn NLP

#

Can anyone help me

#

📎 unknown.png

rapid ridge Oct 9, 2019, 7:16 AM

#

where can I get this list of countries by their top 10 cities to work for IT / security or penetration testing or IT in general of each country below?

United Kingdom
Spain
Belgium
Romania
Italy
Russia
France
Czech Republic
Poland
Switzerland
Australia
Ireland
Singapore
Sweden
Germany
New Zealand

eager heath Oct 9, 2019, 8:40 AM

#

Maybe on LinkedIn

#

(But I don’t see why this question is in this channel, #career-advice would be a better fit 😄 )

rapid ridge Oct 9, 2019, 8:49 AM

#

I had to google all the cities one by one, but how would you implement this? I already have a dict of files with the top 10-25 cities for each country, but how can I pick the wordlist that matches each link? For example:

self.cities = {
      'AU': open('AU','r').read().splitlines(),
      'BE': open('BE','r').read().splitlines(),
      'CA': open('CA','r').read().splitlines(),
      'CH': open('CH','r').read().splitlines(),
      'CZ': open('CZ','r').read().splitlines(),
      'DE': open('DE','r').read().splitlines(),
      'ES': open('ES','r').read().splitlines(),
      'FR': open('FR','r').read().splitlines(),
      'GB': open('GB','r').read().splitlines(),
      'IE': open('IE','r').read().splitlines(),
      'IT': open('IT','r').read().splitlines(),
      'MX': open('MX','r').read().splitlines(),
      'NL': open('NL','r').read().splitlines(),
      'NZ': open('NZ','r').read().splitlines(),
      'PL': open('PL','r').read().splitlines(),
      'RO': open('RO','r').read().splitlines(),
      'RU': open('RU','r').read().splitlines(),
      'SE': open('SE','r').read().splitlines(),
      'SG': open('SG','r').read().splitlines(),
      'US': open('US','r').read().splitlines(),   
}


for url in self.links:
      for city in self.cities:
          print(url+city)


https://www.indeed.com/jobs?q=Los Angeles, CA
https://www.indeed.com/jobs?q=San Jose, CA

https://ca.indeed.com/jobs?q=...

....

#

I was thinking on this , but i am not sure

for url in self.links:
      for city in self.cities:
          if(self.cities['AU']):
            print(city)
            elif self.cities['BE']:
                  print(city)

hazy sierra Oct 9, 2019, 9:10 AM

#

Elif is tabbed too much ?

eager heath Oct 9, 2019, 9:11 AM

#

^

hazy sierra Oct 9, 2019, 9:11 AM

#

I could be wrong but wouldn't self.cities['AU'] need to return true ?

#

also self.cities['AU'] returns the same value over and over again

eager heath Oct 9, 2019, 9:14 AM

#

Why it wouldn’t return the same value?

hazy sierra Oct 9, 2019, 9:15 AM

#

It would

#

over and over again.

eager heath Oct 9, 2019, 9:16 AM

#

As long as you don’t change the value, it is not going to change

rapid ridge Oct 9, 2019, 9:16 AM

#

any other way to loop it without too much if , elif

for city in cities:
    if(cities['AU']):
      print(cities['AU'][0])

hazy sierra Oct 9, 2019, 9:18 AM

#

Isn't self.cities a dictionary?

#

I thought you couldn't iterate through it

#

I guess you can

#

That's weird

eager heath Oct 9, 2019, 9:24 AM

#

You can iterate over a dict

rapid ridge Oct 9, 2019, 9:25 AM

#

I am in a tester

#

that;s why I am not using a self.

hazy sierra Oct 9, 2019, 9:27 AM

#

Hmmm

#

I thought dictionaries weren't ordered

rapid ridge Oct 9, 2019, 9:27 AM

#

this piece of code is getting the len of cities and not the len of each wordlist , how can I fix it?

cities = {
      'AU': open('AU','r').read().splitlines(),
      'BE': open('BE','r').read().splitlines(),
      'CA': open('CA','r').read().splitlines(),
      'CH': open('CH','r').read().splitlines(),
      'CZ': open('CZ','r').read().splitlines(),
      'DE': open('DE','r').read().splitlines(),
      'ES': open('ES','r').read().splitlines(),
      'FR': open('FR','r').read().splitlines(),
      'GB': open('GB','r').read().splitlines(),
      'IE': open('IE','r').read().splitlines(),
      'IT': open('IT','r').read().splitlines(),
      'MX': open('MX','r').read().splitlines(),
      'NL': open('NL','r').read().splitlines(),
      'NZ': open('NZ','r').read().splitlines(),
      'PL': open('PL','r').read().splitlines(),
      'RO': open('RO','r').read().splitlines(),
      'RU': open('RU','r').read().splitlines(),
      'SE': open('SE','r').read().splitlines(),
      'SG': open('SG','r').read().splitlines(),
      'US': open('US','r').read().splitlines(),   
}


for city in cities:
    for x in range(0, len(cities)):
          print(cities['AU'][x])

hazy sierra Oct 9, 2019, 9:27 AM

#

so how do you iterate over it?

rapid ridge Oct 9, 2019, 9:27 AM

#

for city in cities:
    for x in range(0, len(cities)):
          print(cities['AU'][x])

eager heath Oct 9, 2019, 9:28 AM

#

for city in cities:
 for i in city:
  print(cities[city][i])

#

try this

rapid ridge Oct 9, 2019, 9:31 AM

#

TypeError: list indices must be integers or slices, not str

eager heath Oct 9, 2019, 9:32 AM

#

Which line ?

rapid ridge Oct 9, 2019, 9:32 AM

#

print(cities[city][i])

eager heath Oct 9, 2019, 9:32 AM

#

Then

#

for city in cities:
 for i in city:
  print(city[i])

rapid ridge Oct 9, 2019, 9:33 AM

#

TypeError: string indices must be integers the same

eager heath Oct 9, 2019, 9:33 AM

#

for city in cities:
 for i in cities[city]:
  print(i)
?

rapid ridge Oct 9, 2019, 9:36 AM

#

works, but how can I now get this?

"I am using US wordlist " + Los Angeles, CA
"I am using US wordlist " + San Jose, CA
"I am using CA wordlist " + Toronto, ON

eager heath Oct 9, 2019, 9:37 AM

#

What does the last snippet output ?

hazy sierra Oct 9, 2019, 9:38 AM

#

That's confusing

rapid ridge Oct 9, 2019, 9:38 AM

#

for countries in country:
 for i in country[countries]:
  print("I am using {} wordlist amd I am in {}").format(i)

hazy sierra Oct 9, 2019, 9:38 AM

#

wouldn't split lines return each line in a list ?

arctic wedgeBOT Oct 9, 2019, 9:38 AM

#

:incoming_envelope: :ok_hand: applied mute to @rapid ridge until 2019-10-09 09:48 (reason: newlines rule: sent 145 newlines in 10s).

hazy sierra Oct 9, 2019, 9:39 AM

#

What the heck

eager heath Oct 9, 2019, 9:40 AM

#

Well..

#

could an admin unmute him, please?

lyric canopy Oct 9, 2019, 9:42 AM

#

!unmute 518596072122351637

arctic wedgeBOT Oct 9, 2019, 9:42 AM

#

:incoming_envelope: :ok_hand: pardoned infraction mute for @rapid ridge.

rapid ridge Oct 9, 2019, 9:43 AM

#

great

eager heath Oct 9, 2019, 9:44 AM

#

Thanks vez

rapid ridge Oct 9, 2019, 9:44 AM

#

using AU wordlist amd I am in Brisbane
using BE wordlist amd I am in Antwerp

#

works

eager heath Oct 9, 2019, 9:44 AM

#

so print(i) output what ?

#

Perfect

rapid ridge Oct 9, 2019, 9:45 AM

#

yeah

#

for link in self.links:
      for countries in country:
       for i in country[countries]:
           print("using {} wordlist amd I am in {}".format(countries,i))

#

the last one integrates the links
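
The loop that finally worked can also be written with `dict.items()`, which hands over the country code and its list of cities together, so there is no need for `if`/`elif` chains or index lookups. This is only a sketch: `describe` is a made-up helper and the small `sample` dict stands in for the wordlists read from the files.

```python
def describe(cities_by_country):
    """Return one line per (country, city) pair."""
    lines = []
    for code, cities in cities_by_country.items():
        # len(cities) is the length of this wordlist, not of the whole dict
        for city in cities:
            lines.append(f"using {code} wordlist and I am in {city}")
    return lines


sample = {"AU": ["Brisbane", "Sydney"], "BE": ["Antwerp"]}
for line in describe(sample):
    print(line)
```

Iterating over a dict directly yields only its keys, which is why `for city in cities` earlier produced country codes rather than city names.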

eager heath Oct 9, 2019, 9:46 AM

#

you can also use fstrings

rapid ridge Oct 9, 2019, 9:46 AM

#

self.links = open('indeed.txt','r').read().splitlines()

arctic wedgeBOT Oct 9, 2019, 9:46 AM

#

In Python, there are several ways to do string interpolation, including using %s's and by using the + operator to concatenate strings together. However, because some of these methods offer poor readability and require typecasting to prevent errors, you should for the most part be using a feature called format strings.

In Python 3.6 or later, we can use f-strings like this:

snake = "Pythons"
print(f"{snake} are some of the largest snakes in the world")

In earlier versions of Python or in projects where backwards compatibility is very important, use str.format() like this:

snake = "Pythons"

# With str.format() you can either use indexes
print("{0} are some of the largest snakes in the world".format(snake))

# Or keyword arguments
print("{family} are some of the largest snakes in the world".format(family=snake))

rapid ridge Oct 9, 2019, 9:49 AM

#

how can I complete the string ? using https://www.indeed.com/jobs?q=

links = open('indeed.txt','r').read().splitlines()

for link in links:
      for countries in country:
       for i in country[countries]:
           print("using {}".format(link))

#

+ job_qry +'&l=' + str(city) + '&start=' + str(start)

#

our output looks like this using https://www.indeed.es/jobs?q= now

eager heath Oct 9, 2019, 9:50 AM

#

Did you read the embed sent by the bot just up here ^?

rapid ridge Oct 9, 2019, 9:50 AM

#

yeah

#

these are stored in a file as https://www.indeed.es/jobs?q= , so how am I supposed to complete the string? in this case I cannot do https://www.indeed.es/jobs?q={}

eager heath Oct 9, 2019, 9:52 AM

#

print(f"https://www.indeed.es/jobs?q={job_qry}&l={i}&start={start}")

rapid ridge Oct 9, 2019, 9:56 AM

#

that would be static, without a file, but how can we complete all the links in a for loop?

#

for link in links:
      for countries in country:
       for i in country[countries]:
           print("using %s "%(link+"A"+"^S"+"a")) # dirty fix it . but how can I do it in a better way?
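
One way to complete the links without string concatenation hacks is to let `urllib.parse.urlencode` build the query string. The sketch below assumes the file holds prefixes such as `https://www.indeed.es/jobs?q=` and that the parameter names are `q`, `l` and `start`, as in the concatenation shown earlier; `build_search_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_urls(base_links, cities_by_country, job_query, start=0):
    """Combine every base link with every city into a full search URL."""
    urls = []
    for base in base_links:
        root = base.split("?", 1)[0]  # drop a trailing "?q=" if the file has one
        for code, cities in cities_by_country.items():
            for city in cities:
                params = urlencode({"q": job_query, "l": city, "start": start})
                urls.append(f"{root}?{params}")
    return urls


urls = build_search_urls(
    ["https://www.indeed.com/jobs?q="],
    {"US": ["Los Angeles, CA"]},
    job_query="penetration tester",
)
print(urls[0])
```

`urlencode` also escapes spaces and commas, so city names like "Los Angeles, CA" end up as valid URL text.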

barren bluff Oct 9, 2019, 11:43 AM

#

Hey, I'm working with the Fashion MNIST dataset as a project for school. I am working on a little data analysis and I was wondering, what could be interesting to look at? I just made a histogram from all of the labels, but it's not very descriptive. Also, how can I make a seaborn scatterplot with Fashion MNIST??

slim fox Oct 9, 2019, 1:59 PM

#

I am not sure that you can do much on that kind of image data

#

@barren bluff what's your task exactly?

barren bluff Oct 9, 2019, 2:00 PM

#

yeah I kind of figured, I transformed some of the data via PCA and made a scatterplot, but not much more than that

#

I just had to do a data analysis for my project, I am going to create a neural network and use CNN's later on though

slim fox Oct 9, 2019, 2:04 PM

#

for images themselves you can just do smth like this:

plt.imshow(images[n], cmap=plt.cm.binary)

barren bluff Oct 9, 2019, 2:04 PM

#

yeah I could do that too

serene veldt Oct 9, 2019, 2:05 PM

#

Are there any good metrics for defining batch size and epochs?

silent swan Oct 9, 2019, 8:16 PM

#

differs widely for tasks/models, unfortunately

#

sometimes it's just determined by how much you can fit in memory

vestal pecan Oct 9, 2019, 8:49 PM

#

has anyone here used Alteryx before or knows about it?

lapis sequoia Oct 10, 2019, 7:51 AM

#

sure

#

what do you want to know

#

it's just a bunch of blocks you can connect together to run experiments... meant for Enterprise users

#

think.. it's like matlab for data science

vestal pecan Oct 10, 2019, 8:28 AM

#

yeah basically the company put me into test, trying data analytics using python and also predictive analytics with alteryx

#

I m a bit confused on which approach to go for and whether knowing alteryx would be any advantage in future jobs

#

though the course of learning alteryx was not that good because, it is all about dropping blocks, which didn't explain much

vestal pecan Oct 10, 2019, 8:48 AM

#

I liked both, tho i found python more interesting, but I don't feel confident enough to do forecasts or analysis

#

there are zillions of models, and ways for each model, etc. and online each article says each model is not good enough

barren bluff Oct 10, 2019, 9:29 AM

#

hey any of you know how to add references inside of markdown cells in jupyter notebooks?

dim beacon Oct 10, 2019, 9:48 AM

#

@barren bluff what do you mean by “reference” ?

barren bluff Oct 10, 2019, 9:48 AM

#

Like a reference to an article

#

because I used something that stood there

#

and I don't want to plagiarise

dim beacon Oct 10, 2019, 9:52 AM

#

I use this

Here is some ref[^foo]. And another one[^bar].

[^foo]: _Great Book_ by _Great Author_
[^bar]: [Link to cool study](https://example.com)

#

That's not standard Markdown though

barren bluff Oct 10, 2019, 10:10 AM

#

ah thanks!

#

hey how do you add more neurons per layer when working with keras? Im using a mlp algorithm

desert oar Oct 10, 2019, 2:02 PM

#

@dim beacon sadly you're right, commonmark doesn't have footnote syntax

#

however pandoc markdown does support footnotes https://pandoc.org/MANUAL.html#footnotes

crude flame Oct 10, 2019, 2:26 PM

#

So I just saw that DeepMind has some really interesting internships coming up next year and I think I'll apply, but atm I'm doing a PhD in Maths with no AI or Data Science relation... I know some basics in both, but would like to really brush up on my skills and try to do some small project - does anyone know any good reviews/books/papers/other resources to get me towards research level AI problems? I know it's a very broad question, but I'd be happy about any input and am willing to read some heavy and complicated stuff, if I can learn something from it

desert oar Oct 10, 2019, 2:31 PM

#

good question, not sure if the Goodfellow book is still considered relevant

crude flame Oct 10, 2019, 2:34 PM

#

so far for Deep Learning I have the Chollet book Deep Learning with Python and I did like half of it so far, but it's very focused on applying DL and I think I'd like to learn some more theory as well

#

or on the application side some more physics- or math-related applications would be interesting, since I'm doing Mathematical Physics... also went to a summer school about Deep Learning for High Energy Physics at some point, but that was mostly also very introductory

deft harbor Oct 10, 2019, 3:29 PM

#

I'm using Goodfellow

#

With a ton of supplemental material

vestal pecan Oct 10, 2019, 5:16 PM

#

So after learning data wrangling, connecting to api, pandas, plotting and such in python, is there any website or so to practice these stuff and learn prediction technique or so through exercises ?

#

I have some knowledge of statistics, not super strong, but I'm good at probabilities and stats and such. But not forecasting models.

#

Or models

deft harbor Oct 10, 2019, 7:08 PM

#

Search Harvard intro to data science

#

They have a lot of notebooks to work through on their github

desert oar Oct 10, 2019, 7:17 PM

#

or start doing kaggle competitions

#

they tend to throw you into the deep end with little assistance

#

you probably won't get a good score but you will get the chance to practice

#

doing old kaggle competitions is probably better

#

the new ones are pretty sophisticated

vestal pecan Oct 10, 2019, 8:20 PM

#

I m not sure i can do competitions

desert oar Oct 10, 2019, 8:22 PM

#

you dont have to win

#

in fact you probably wont even come close to winning

#

the point is, it's a chance to work on an unfamiliar problem and try out new skills/techniques without pressure to succeed or fail

vestal pecan Oct 10, 2019, 8:28 PM

#

Oh will they like tell me what to do and show me answers so i understand ?

#

I like the projects where they help you reach the answer so you understand how to think when you have a specific problem

desert oar Oct 10, 2019, 9:21 PM

#

No, it's the opposite

#

However they have active discussion forums

#

So you get to see what other people are attempting and working on

vestal pecan Oct 11, 2019, 12:15 AM

#

Oh okay thank you!

lapis sequoia Oct 11, 2019, 2:36 AM

#

hi

#

I have javascript cell magic available on my jupyter.. i'm trying to find a way to add a file upload button to a cell, so I can upload from local to the notebook

desert oar Oct 11, 2019, 2:38 AM

#

Notebooks dont have a filesystem to upload to...

lapis sequoia Oct 11, 2019, 3:51 AM

#

sure they do

small ore Oct 11, 2019, 5:04 AM

#

I only see a blank website when I go to Kaggle. Is it geography limited?

#

Also, has anyone found out anything sane from the annealing database on UCI? Or is there some questions/goals somewhere on the web on what to obtain from the UCI database?

desert oar Oct 11, 2019, 5:06 AM

#

probably javascript

small ore Oct 11, 2019, 5:07 AM

#

I tried multiple browsers and I even tried from an android mobile.

deft harbor Oct 11, 2019, 5:09 AM

#

Works for me. Email webmaster@kaggle.com

#

Maybe its still under construction hilarious_lemon

lapis sequoia Oct 11, 2019, 6:30 AM

#

hi, I want to apply a function to a single dataframe column

#

trying to figure out the most efficient method..

desert oar Oct 11, 2019, 6:36 AM

#

@lapis sequoia what kind of function? usually df['mycol'].map is sufficient

lapis sequoia Oct 11, 2019, 6:37 AM

#

ok.. but the function I'm mapping to.. how should I define it

#

like, should I pass the whole dataframe, and define the name of the column I want to use within the function

#

I think I'll pass one column as a series, because I want to create a separate column using this

#

hmm

#

it doesnt work

#

def convert_open_to_easy_id(open_id_row_input):
    response_xml_as_string = requests.get(url = URL, 
                                          params = {'openid':open_id_row_input}).text
    responseXml = ET.fromstring(response_xml_as_string)
    return responseXml.find('easyId').text

working_df['Easy_id'] = working_df['Open_id'].apply(convert_open_to_easy_id)

#

like.. the new column gets created and everything.. but everything inside it is None

desert oar Oct 11, 2019, 7:04 AM

#

that probably means your function is wrong

lapis sequoia Oct 11, 2019, 7:10 AM

#

hmm :<

#

it was working fine for a single request though

#

I just checked it.. it works fine for a single input

lapis sequoia Oct 11, 2019, 7:34 AM

#

do i use apply or map

desert oar Oct 11, 2019, 7:35 AM

#

age old question

#

some say the answer was once written on a scroll, but that scroll has been lost to the sands of time

#

i usually use map for Series

#

and apply for DataFrame

#

the main reason being that .map(..., na_action='ignore') is extremely useful
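
A tiny illustration of why `na_action='ignore'` is handy, under the assumption that the column can contain missing values: `map` leaves them untouched, while a plain `apply` would hand `None` to the function. The `shout` function and the sample data are invented for the example.

```python
import pandas as pd


def shout(value):
    """Upper-case a string; fails on None because None has no .upper()."""
    return value.upper()


df = pd.DataFrame({"Open_id": ["abc", None, "xyz"]})

# Missing entries are skipped instead of being passed to shout()
df["via_map"] = df["Open_id"].map(shout, na_action="ignore")

# df["Open_id"].apply(shout) would call shout(None) and raise AttributeError
print(df)
```

Either method calls the function once per element of the Series, so for a column without missing values the two give the same result.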

lapis sequoia Oct 11, 2019, 7:36 AM

#

here i'm passing a series.. because it's one column of a dataframe

desert oar Oct 11, 2019, 7:36 AM

#

so i personally would use map

#

apply wouldnt be wrong

#

but i personally use map for series

lapis sequoia Oct 11, 2019, 7:39 AM

#

I should probably look up how they differ in operation

#

right after I figure out why my function doesnt work x.x

desert oar Oct 11, 2019, 7:40 AM

#

they dont, really

#

https://stackoverflow.com/a/56300992/2954547

Stack Overflow

Difference between map, applymap and apply methods in Pandas

Can you tell me when to use these vectorization methods with basic examples?

I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap meth...

#

https://stackoverflow.com/q/38276860/2954547

Stack Overflow

What is the difference between Pandas Series.apply() and Series.map()?

Series.map():
Map values of Series using input correspondence (which can be a dict, Series, or function)
Series.apply()
Invoke function on values of Series. Can be ufunc (a NumPy function that

lapis sequoia Oct 11, 2019, 7:43 AM

#

if I apply on a series.. what is the input to a function

#

each element of the series one by one? or the whole series?

desert oar Oct 11, 2019, 7:44 AM

#

each element

#

otherwise thered be no point, right

lapis sequoia Oct 11, 2019, 7:44 AM

#

and map is the same?

desert oar Oct 11, 2019, 7:44 AM

#

yeah

lapis sequoia Oct 11, 2019, 7:45 AM

#

then why doesn't my function work for series :<

#

going nuts

#

ok I think I got it

#

This works:

#

convert_open_to_easy_id('url as string')

#

this doesn't work

#

convert_open_to_easy_id(working_df['Open_id'][0])

#

just need to figure out why..

#

ok

#

I figured out why

#

my data was wrong

#

I am such an idiot

#

I left the url as part of the data....

late gull Oct 11, 2019, 8:43 AM

#

Any good data scientists who wanna team up for kaggle NFL competition?

desert oar Oct 11, 2019, 12:51 PM

#

@late gull what's the timeline for it? i might have time depending on when it's happening

late gull Oct 11, 2019, 12:52 PM

#

@desert oar About 2 months to the deadline, 1 month to the group merge

desert oar Oct 11, 2019, 12:52 PM

#

when were you looking to get started

late gull Oct 11, 2019, 12:52 PM

#

I already did yoj

desert oar Oct 11, 2019, 12:53 PM

#

oof alright. thats a little short of a deadline i think considering what's going on in my life currently

#

i'll probably have to pass but good luck

late gull Oct 11, 2019, 12:53 PM

#

Allright cheers

fading kernel Oct 11, 2019, 3:12 PM

#

Hi all,
I have a little problem in one of my projects atm.

I have a class in which I create a CountVectorizer and create vectors with fit_transform. This generates a vocabulary_.
I would like to have this CountVectorizer with the vocabulary in one file to be able to reuse it in another class.
Does anyone have any advice for me? I already tried to do the whole thing with save_npz. But it didn't work properly.

fading kernel Oct 11, 2019, 3:31 PM

#

I created a post 🙂 https://stackoverflow.com/questions/58344350/how-to-save-and-load-vocabulary-from-a-countvectorizer?stw=2

Stack Overflow

How to save and load vocabulary_ from a CountVectorizer?

I have a class in which I create a countVectorizer and create vectors with fit_transform. This generates a vocabulary_.
I would like to have this CountVectorizer with the vocabulary in one file to be

native patrol Oct 11, 2019, 6:20 PM

#

pickle/joblib is standard for sklearn objects
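
A minimal sketch of the pickle route for the CountVectorizer question above: dump the fitted object to a file, load it elsewhere, and call `transform` with the same vocabulary. The two sample documents, the temporary directory and the file name `vectorizer.pkl` are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]
vectorizer = CountVectorizer()
vectorizer.fit_transform(docs)  # learns vocabulary_

# Save the whole fitted object, vocabulary included
path = Path(tempfile.mkdtemp()) / "vectorizer.pkl"
with path.open("wb") as fh:
    pickle.dump(vectorizer, fh)

# Later, possibly in another class or script
with path.open("rb") as fh:
    restored = pickle.load(fh)

new_counts = restored.transform(["the cat barked"])
print(restored.vocabulary_)
```

Pickle files should only be loaded from trusted sources, and the same scikit-learn version should ideally be used for saving and loading.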

lapis sequoia Oct 11, 2019, 6:28 PM

#

Anyone have an idea how I could show the relationship between 3 variables?

#

How I could plot that visually?

#

A scatter plot?

late gull Oct 11, 2019, 6:46 PM

#

Anybody wanna team up for kaggle competition?

#

@lapis sequoia use the 3rd variable as hue

lapis sequoia Oct 11, 2019, 7:41 PM

#

@late gull I think I'll do a heatmap

#

Are you familiar with Seaborn?

late gull Oct 11, 2019, 8:32 PM

#

@lapis sequoia I am familiar. How many of your variables are categorical or numerical?

lapis sequoia Oct 11, 2019, 8:33 PM

#

All three are numerical

#

@late gull I used pd.cut to bin the data

#

Now it's just a matter of how I can plot this

late gull Oct 11, 2019, 8:36 PM

#

I'm not sure how you can plot 3 numerical variables together

#

You can just do 3 plots?

lapis sequoia Oct 11, 2019, 8:37 PM

#

I was thinking about what you said - using the third value as a hue

#

So it would be just plotting 2 numerical variables

#

Or is that assumption incorrect?

obtuse skiff Oct 11, 2019, 8:37 PM

#

Can someone explain what it means when it says maxCategories for VectorIndexer? in pyspark

#

"Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical."

#

idk what this means

late gull Oct 11, 2019, 8:38 PM

#

@lapis sequoia That's what I would do. But it only works if one of the variables is categorical (not continuous)

lapis sequoia Oct 11, 2019, 8:40 PM

#

I could certainly change one of the variables to a categorical one

#

@late gull One of the columns I'm working with only contains 2 values - so I could definitely do that

late gull Oct 11, 2019, 8:41 PM

#

So set the hue to that column

#

and it works

lapis sequoia Oct 11, 2019, 8:41 PM

#

Yeah, but my question is how do I plot this on Seaborn

#

I binned the data

#

Now I don't understand how to plot it @late gull

native patrol Oct 11, 2019, 8:42 PM

#

@obtuse skiff https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/VectorIndexer.html
it just checks if the number of distinct values in that column is less than or equal to maxCategories, then it treats it as a categorical column - otherwise treats it as a continuous variable
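
The rule described above can be mimicked in plain Python to see what it does. This is only an illustration of the counting logic, not Spark code: `split_by_cardinality` is a made-up function and the rows are toy data.

```python
def split_by_cardinality(rows, max_categories):
    """Sort column indexes into categorical or continuous by distinct-value count."""
    columns = list(zip(*rows))  # transpose rows into columns
    categorical, continuous = [], []
    for index, column in enumerate(columns):
        if len(set(column)) <= max_categories:
            categorical.append(index)
        else:
            continuous.append(index)
    return categorical, continuous


rows = [(0.0, 1.5), (1.0, 2.7), (0.0, 3.1), (1.0, 4.8)]
categorical, continuous = split_by_cardinality(rows, max_categories=2)
print(categorical, continuous)
```

Column 0 has two distinct values and counts as categorical with `max_categories=2`, while column 1 has four and stays continuous.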

deft harbor Oct 12, 2019, 12:22 AM

#

Does anyone have a recommendation on where I should start on big data systems? Example, spark vs the alternative.

#

I guess a better question would be, which systems should I focus on learning first

lapis sequoia Oct 12, 2019, 12:46 AM

#

depends on your use case.. field of application

#

spark, kafka.. you can't go wrong with those.. Then big data formats..

#

avro, parquet, capacitor

obtuse skiff Oct 12, 2019, 1:13 AM

#

Say I want to have n number of dataframes, so that I would have a regression model on each. Could I create a dataframe that holds dataframes? or would that not work?

lapis sequoia Oct 12, 2019, 2:30 AM

#

a regression model for what..

deft harbor Oct 12, 2019, 3:19 AM

#

What is the best way to import this .txt file?

#

CG FEB19 30 YEAR US TREASURY BOND OPTIONS CALL
10600      ----     39'37B    38'33A     ----     38'48    -'44                  39'28
10700      ----     38'37B    37'33A     ----     37'48    -'44                  38'28
10800      ----     37'37B    36'33A     ----     36'48    -'44                  37'28
10900      ----     36'37B    35'34A     ----     35'48    -'44                  36'28
11000      ----     35'37B    34'33A     ----     34'48    -'44                  35'28
11100      ----     34'37B    33'33A     ----     33'48    -'44                  34'28
11200      ----     33'37B    32'33A     ----     32'48    -'44                  33'28
11300      ----     32'37B    31'33A     ----     31'48    -'44                  32'28
11400      ----     31'37B    30'33A     ----     30'48    -'44                  31'28

#

I thought it would be tabs, but it just isnt coming out right

unkempt spire Oct 12, 2019, 4:22 AM

#

Maybe something like:
with open('path_to_file', 'r') as file_in:
    lines = file_in.readlines()

#

and after that split each line on whitespace with line.strip().split()

#

for each line

#

using the same approach for the first line

#

Tell me if you're still there

silent swan Oct 12, 2019, 4:31 AM

#

@obtuse skiff why not just a list or dictionary of dataframes

obtuse skiff Oct 12, 2019, 4:33 AM

#

so I need to have n linear regression models
and I will have the n dataframes holding the data for each model

but I was hoping to be able to do each of them in parallel but idk if thats possible

silent swan Oct 12, 2019, 4:34 AM

#

right, you could still do that with a list of dataframes

#

e.g. using joblib to do them in parallel
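
A rough sketch of the "one model per dataset, fitted in parallel" idea. To keep it dependency-free it swaps in `concurrent.futures` from the standard library for joblib and a hand-written one-variable least-squares fit for a real regression model; the `datasets` dict of `(xs, ys)` pairs is a stand-in for the dictionary of dataframes.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs, ys = points
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


datasets = {
    "a": ([0, 1, 2, 3], [1, 3, 5, 7]),  # y = 2x + 1
    "b": ([0, 1, 2, 3], [4, 3, 2, 1]),  # y = -x + 4
}

# pool.map keeps the input order, so zipping with the keys is safe
with ThreadPoolExecutor() as pool:
    models = dict(zip(datasets, pool.map(fit_line, datasets.values())))

print(models)
```

Threads mostly help when the work releases the GIL or waits on I/O; for CPU-heavy fits a process pool (or joblib, as suggested) is the more usual choice.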

unkempt spire Oct 12, 2019, 4:36 AM

#

@deft harbor

#

data_df = pd.DataFrame()
headers = ['put', 'them', 'here']  # one name per column
for ind, line in enumerate(lines):
    if ind > 0:  # skip the title line
        lines_split = line.split()  # split on any run of whitespace
        # one-row frame for this line; zip pairs each header with its value
        tmp_df = pd.DataFrame([dict(zip(headers, lines_split))])
        data_df = data_df.append(tmp_df)  # works on pandas < 2.0

#

maybe not the best but the first that comes in mind

silent swan Oct 12, 2019, 4:36 AM

#

yea don't use df.append
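
A sketch of the alternative to appending row by row: split every line once, collect the rows in a list, and build the table in a single step at the end. It uses only the standard library and the first lines of the sample posted above; `parse_table` is a made-up name.

```python
def parse_table(text):
    """Return the title line and a list of whitespace-split data rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    title, *data_lines = lines
    rows = [line.split() for line in data_lines]  # split() collapses runs of spaces
    return title, rows


sample = """CG FEB19 30 YEAR US TREASURY BOND OPTIONS CALL
10600      ----     39'37B    38'33A     ----     38'48    -'44                  39'28
10700      ----     38'37B    37'33A     ----     37'48    -'44                  38'28
"""

title, rows = parse_table(sample)
print(title)
print(rows[0])
```

The resulting list can be passed to `pd.DataFrame(rows, columns=[...])` in one go. If the wide gap before the last value means some columns are blank, whitespace splitting will shift values left, and a fixed-width reader such as `pandas.read_fwf` is probably the safer tool.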

deft harbor Oct 12, 2019, 4:46 AM

#

Sorry, was away

#

@unkempt spire thanks, I'll give that idea a try in the morning

candid solar Oct 12, 2019, 10:10 PM

#

I am not sure if this is the right place to ask, but I have three dataframes that use datetime64's as their indexes.

I want to make stacked bar graph, but the dates don't always overlap properly, so I think I need to merge these data sets into one big dataframe, but I am unsure how best to do so

#

the original data comes from csv's (downloaded from netflix) of an item title, and a date watched ("YYYY-MM-DD")

#

I've used group by to get a count of things watched per day

#

and I want to do a stacked bar for each user's data
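
One possible approach, assuming each user's data boils down to a list of "YYYY-MM-DD" strings: count per day, then let `pd.concat(..., axis=1)` line the users up on the union of all dates and fill the gaps with zero. The user names and dates below are invented.

```python
import pandas as pd


def daily_counts(dates):
    """Count how many items were watched on each day."""
    return pd.Series(pd.to_datetime(dates)).value_counts().sort_index()


per_user = {
    "alice": daily_counts(["2019-10-01", "2019-10-01", "2019-10-03"]),
    "bob": daily_counts(["2019-10-02", "2019-10-03"]),
}

# Outer-join on the date index; days a user did not watch anything become 0
combined = pd.concat(per_user, axis=1).fillna(0).astype(int).sort_index()
print(combined)

# combined.plot.bar(stacked=True)  # needs matplotlib
```

With one column per user and one row per date, the stacked bar chart no longer depends on the dates overlapping in the original frames.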

obtuse skiff Oct 13, 2019, 12:31 AM

#

So I want to loop through like 16 things of data creating a dataframe out of it, then append those rows to a single data frame

dataframeCombine = Row('prediction', 'label', 'features')
for i in lst:
#code
dataframeCombine = dataframeCombine.union(dataframeTemp)

something like that where the temporary dataframe has the columns 'prediction', 'label', 'features'

but I'm getting AttributeError: __fields__ when I do the union, and also when I check which columns are in dataframeCombine

in pyspark

hot compass Oct 13, 2019, 3:44 PM

#

can somone help me with json files

#

like hop in the voice chat with me so i can describe what i mean

earnest prawn Oct 13, 2019, 3:48 PM

#

I could but im around publicly right now so i can hardly jump into voice chat with you

marsh token Oct 13, 2019, 4:19 PM

#

Is there any equivalent of pca methods (R) in python?

https://rdrr.io/bioc/pcaMethods/man/

pcaMethods documentation

The pcaMethods package contains the following man pages: asExprSet biplot-methods bpca BPCA_dostep BPCA_initmodel centered-pcaRes-method center-pcaRes-method checkData completeObs-nniRes-method cvseg cvstat-pcaRes-method deletediagonals derrorHierarchic dim.pcaRes DModX-pcaRe...

#

Also, is there an R discord?

small ore Oct 13, 2019, 5:20 PM

#

The closest I can suggest is the Programming channel in /r/LearnMachineLearning. Maybe people there know of a R only server too

#

@marsh token

silent swan Oct 13, 2019, 5:48 PM

#

what do you need from pca methods

fallen anchor Oct 13, 2019, 7:21 PM

#

is docker the way to go for running TF GPU code?

silent swan Oct 13, 2019, 8:50 PM

#

I've always just run directly/with conda

fallen anchor Oct 13, 2019, 9:00 PM

#

Is conda like pip?

#

If I eventually wanna run in aws, isn't docker good?

#

What OS do you use @silent swan

silent swan Oct 13, 2019, 9:09 PM

#

conda is sort of like pip +environment manager, and it's better for installing scientific computing libraries

#

I've not used docker myself, always seemed like a lot more work, but also I'm not productionizing my models

fallen anchor Oct 13, 2019, 9:27 PM

#

Are you on Windows?

#

I want to make sure whatever I do will also work on aws

silent swan Oct 13, 2019, 10:38 PM

#

macos / linux

nocturne loom Oct 13, 2019, 11:00 PM

#

I am getting the following error from fill_between from matplotlib:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

def plot_area(f,g,var,x0,x1):
    f_coords = []
    g_coords = []
    x_coords = arange(x0-1, x1+1,0.1)
    for i in x_coords:
        f_coords.append(f.subs({var:i}).evalf())
        g_coords.append(g.subs({var:i}).evalf())
    plt.plot(x_coords,f_coords,x_coords,g_coords)
    plt.fill_between(x_coords, f_coords, g_coords)
    plt.show()

f and g are sympy expressions, so I am creating a list of x-coordinates and their corresponding y-coordinates for functions f and g from point x0 to x1 with a bit of leeway for plotting purposes.
I don't really understand why fill_between errors out like that in all honesty.
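A likely cause, sketched below with made-up expressions: `evalf()` returns sympy `Float` objects, so the coordinate lists turn into object-dtype arrays, and `np.isfinite` (which `fill_between` calls internally) does not accept those. Converting to plain floats, for example with `sympy.lambdify`, avoids that. The expressions `f` and `g` here are hypothetical stand-ins.

```python
import numpy as np
import sympy as sp

x = sp.Symbol("x")
f = x**2          # hypothetical example expressions
g = 2 * x + 1

x_coords = np.arange(-1.0, 4.0, 0.1)

# evalf() yields sympy Float objects, so numpy stores them with dtype=object.
as_objects = np.array([f.subs({x: i}).evalf() for i in x_coords])
print(as_objects.dtype)

# lambdify compiles each expression into a numpy function returning floats.
f_num = sp.lambdify(x, f, "numpy")
g_num = sp.lambdify(x, g, "numpy")
f_coords = f_num(x_coords)
g_coords = g_num(x_coords)
print(f_coords.dtype)

# plt.fill_between(x_coords, f_coords, g_coords) should accept these arrays.
```

Wrapping each value in `float(...)` inside the original loop would be the smaller change; `lambdify` just skips the per-point `subs` calls.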

fallen anchor Oct 14, 2019, 3:52 AM

#

📎 unknown.png

#

Am I wrong?

#

I don't see how it could possibly get any bigger

#

hmm

#

what if it re-uses edge data

limber cradle Oct 14, 2019, 4:37 AM

#

I am barely entering data science but isn't the point (well, one of the most important points) of max-pooling to make your image smaller? If your 2x2 filters don't overlap at all, then it'll be 13x13

quartz stream Oct 14, 2019, 5:30 AM

#

Yes for every 2x2 matrix

#

it will take the max value

#

so the output will be smaller than 26x26

#

So only B is the correct answer

#

@fallen anchor

silent swan Oct 14, 2019, 7:49 AM

#

actually it depends on the stride

#

and padding

#

if you do some padding and stride=1, you can perversely even get a slightly larger output (although I think the current libraries don't let this happen)
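The dependence on stride and padding can be checked with the usual output-size formula. This is a small sketch assuming floor rounding, which is a common default; `pool_output_size` is a name invented here.

```python
def pool_output_size(n, kernel, stride, padding=0):
    """Output length along one dimension for a pooling window (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

print(pool_output_size(26, kernel=2, stride=2))             # 13
print(pool_output_size(26, kernel=2, stride=1))             # 25
print(pool_output_size(26, kernel=2, stride=1, padding=1))  # 27
```

So non-overlapping 2x2 windows halve a 26x26 input to 13x13, while stride 1 with one pixel of padding would give 27x27.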

loud kindle Oct 14, 2019, 9:17 PM

#

anyone know how i can plot a large amount of data on a bar graph? I want to plot 1000 "bins" and show the integer at the bottom of the graph (with matplotlib), but the labels just completely overlap :(
ive tried plt.figure(figsize=(2^16,2^16), dpi=200) but the figure is still just 640x480 px

📎 unknown.png

deft harbor Oct 14, 2019, 9:25 PM

#

Don't know if there is a way to fit 1000 labels..

#

Do you need EVERY label? Is there some sorting you could do, and then reduce it to major ticks?

loud kindle Oct 14, 2019, 9:32 PM

#

i could put them into buckets, but i thought there might be a way to increase the resolution. Of course thats not gonna be efficient, but a 2^16 inch plot with 200 dpi should be able to handle 1000 labels imo 😄
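One thing worth checking about the `figsize` attempt: in Python `^` is bitwise XOR, not exponentiation, so `2^16` is 18. The sketch below shows that, plus a hypothetical helper (`sparse_ticks`, not a matplotlib function) for keeping only every n-th label, along the lines of the "major ticks" suggestion.

```python
# In Python, ^ is bitwise XOR; exponentiation is **.
print(2 ^ 16)   # 18
print(2 ** 16)  # 65536

def sparse_ticks(labels, max_ticks=50):
    """Pick evenly spaced (position, label) pairs so at most max_ticks remain."""
    step = max(1, -(-len(labels) // max_ticks))  # ceiling division
    positions = list(range(0, len(labels), step))
    return positions, [labels[p] for p in positions]

positions, shown = sparse_ticks(list(range(1000)), max_ticks=50)
print(len(positions), positions[:3])

# plt.xticks(positions, shown, rotation=90) would then label every 20th bar.
```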

#

im trying a scatterplot instead now.

silent swan Oct 14, 2019, 9:34 PM

#

what sort of data is this?

loud kindle Oct 14, 2019, 9:34 PM

#

its the pixelcount of 10k images

#

y shows the amount of images with that pixelcount, x shows the amount of pixels. there are ~1000 different formats in the set

silent swan Oct 14, 2019, 9:35 PM

#

why not just a histogram with fewer bins?

loud kindle Oct 14, 2019, 9:37 PM

#

im just trying stuff out tbh. I tried a bargraph instead because the bins arent continuous, so i thought i will save on empty bins

#

why do you think a histogram is better @silent swan ? isnt it kind of the same thing?

silent swan Oct 14, 2019, 10:30 PM

#

what command are you using to plot exactly?

fallen anchor Oct 15, 2019, 3:57 AM

#

@quartz stream Ah, I got it confused

limber cradle Oct 15, 2019, 4:17 AM

#

So I'm trying to dig into a starter project. Autoencoder - D&D dungeon maps (of a consistent style) - sliders in the middle - create new dungeon maps. That's pretty straightforward.
I also have a decent folder of maps and I've already ditched everything not within a particular style. Next issue is that they're not consistent in size, scaling, or aspect ratio.

#

How consistent do I have to make these images, really?

fallen anchor Oct 15, 2019, 4:24 AM

#

are they image files?

#

like jpeg or png?

#

TF can rescale them for you

limber cradle Oct 15, 2019, 4:27 AM

#

They're jpg currently

#

examples in variation

📎 unknown.png

#

I need to make another pass to crop out side views etc and rotate anything that's rotated

fallen anchor Oct 15, 2019, 4:30 AM

#

are you gonna use AI/ML to crop the images?

#

or just a batch ps script?

limber cradle Oct 15, 2019, 4:32 AM

#

I was considering doing it manually since I don't know how to do scripting in PS (which I don't have) or GIMP, and I don't know how to use AI/ML to do this for me either

#

I don't have THAT many maps that need big bits cropped out

fallen anchor Oct 15, 2019, 4:32 AM

#

then do it by hand

#

no need to spend 4 hrs programming that if it will only save you 10min of time

#

but of course this ^ is more fun

#

you could blow the contrast to an extreme ratio with PIL or something

#

then determine the coords of the hopefully 4 big white boxes

#

and then get the coords for the top left one, or whichever one you want to use

#

or just a simple for loop to find the white pixels
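The "find the white pixels" idea can be sketched with numpy instead of an explicit loop. This assumes the image has already been loaded as a 2D grayscale array (for instance via PIL and `np.asarray`); the helper name, the threshold of 200 and the toy image are all made up for illustration.

```python
import numpy as np

def white_bounding_box(gray, threshold=200):
    """Return (top, left, bottom, right) of pixels brighter than threshold, or None."""
    rows, cols = np.nonzero(gray > threshold)
    if rows.size == 0:
        return None
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())

# Hypothetical 6x6 grayscale image: dark background with one white 2x3 block.
image = np.zeros((6, 6), dtype=np.uint8)
image[1:3, 2:5] = 255
print(white_bounding_box(image))  # (1, 2, 2, 4)
```

On a real map this only gives the box around all bright pixels at once; separating individual grid squares would need something more, such as connected-component labelling.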

limber cradle Oct 15, 2019, 4:38 AM

#

you're describing a way to find the grid size? And then rescale images automatically?

fallen anchor Oct 15, 2019, 4:40 AM

#

yes

#

is the view you want always in the same place?

limber cradle Oct 15, 2019, 4:45 AM

#

Not with any real consistency. A single map may contain separate buildings or floors and there's no rule about where the whitespace between may go.

umbral olive Oct 15, 2019, 6:57 AM

#

what is the best way to learn data science?

#

o.0

#

(start from 0)

loud kindle Oct 15, 2019, 7:42 AM

#

@silent swan

plt.bar(buckets, bar_values)
plt.xticks(range(len(bar_values)), buckets, rotation=90 )
plt.show()

safe monolith Oct 15, 2019, 10:29 AM

#

So i'm using python3 to try anonymize some data

#

atm i'm working on richtext/html

#

 soup = BeautifulSoup(x[0], "html.parser")
            #Removes Images.
            for image in soup.find_all('img'):
                image.decompose()
            for p_tag in soup.find_all('p'):
                for p_cxt in p_tag:
                    words = p_cxt.split(' ')
                    for i, word in enumerate(words):
                        words[i] = fake.word()
                    words = ' '.join(words)
                    p_tag.string.replaceWith(words)

#

fake.word() generates a fake word,
what i'm trying to do is replace every word ...

#

with a fakeword

#

also removes all 'imgs'

supple ferry Oct 15, 2019, 8:38 PM

#

@safe monolith this is for other help channels I presume.

safe monolith Oct 15, 2019, 10:39 PM

#

@supple ferry got told might get help In here...

small shore Oct 16, 2019, 2:50 AM

#

I am trying to build a good word embedding system that will allow me to go from words to embeddings and back to words for a chatbot (or at least build a large vocabulary tokenizer from tensorflow and well built word embeddings). What would the best method be to go about doing this?

silent swan Oct 16, 2019, 5:09 AM

#

https://fasttext.cc/

fastText

Library for efficient text classification and representation learning

#

or GloVe for something more standard

#

but the question is what you're doing for the chatbot

supple ferry Oct 16, 2019, 5:38 AM

#

Any daily users of TS data?
I have some timestamps of different events and I extracted the time from midnight of the earliest event of that type and of that day as a float, which gives me good range of float numbers. What can I extract more? Using sin cos for them is one way and I already did it. How would you approach this question?

normal plinth Oct 16, 2019, 10:27 AM

#

Hey guys, basically a problem I'm having is that I'm using Pandas in Anaconda, trying to predict a health score value for each person. The value is already within the database, but I want the system to try and get them correct based off other factors such as Age, weight, if they smoke etc. The problem is that the correct percentage that they guess is very low and I don't know why it's happening or how I can fix it. I'll post the code too.

#

def display(mess, values):
    print()
    print("-----", mess, "-----")
    print(values)
    print("------------------------")
    
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split


health_data = pd.read_csv("C:/Users/??/Downloads/HealthScores.csv")

health_train, health_test = train_test_split(health_data, test_size=0.2)

#display("Healthscore", health_data)

#display("Column Headings", list(health_data.columns.values))

f_train = health_train[['Age', 'Weight  in lbs', 'Height in Inch',
                        'Units of alcohol per day', 'Cigarettes per day', 'Maritial Status Num', 
                        'Additional People in household', 'Salary', 'ActiveNum']].copy()
f_test = health_test[['Age', 'Weight  in lbs', 'Height in Inch', 
                      'Units of alcohol per day', 'Cigarettes per day', 'Maritial Status Num', 
                      'Additional People in household', 'Salary', 'ActiveNum']].copy()

s_train = health_train[['Health Score (high is good)']].copy()
s_test  = health_test[['Health Score (high is good)']].copy()

#display("features", f_train)
display("", s_train)

#

# Create a Naive Bayes Classifier. By convention, clf means 'Classifier'
clf = GaussianNB()

#Train the Classifier to take the training features and learn how they relate
#to the training y (the species)
clf.fit(f_train, s_train).predict(f_train)


correct = 0
wrong = 0
for index, row in health_test.iterrows():
    prediction = clf.predict([row[['Age', 'Weight  in lbs', 'Height in Inch',
                                   'Units of alcohol per day', 'Cigarettes per day', 'Maritial Status Num', 
                                   'Additional People in household', 'Salary', 'ActiveNum']]])

    diff = abs(row['Health Score (high is good)'] - prediction)
    if (diff < 10):
         correct = correct + 1
    else:
        wrong = wrong + 1
        
total = correct + wrong
        
print("Correct ", correct, " wrong", wrong)
print("Total   ", total,   " percentage right", (correct*100)/total,"%")

print("Predict Data", clf.predict(f_test))
display("Actual Data", s_test)

#

https://paste.pydis.com/cugedujebu.py

#

Here is a bit of the CSV file that I am using.

polar acorn Oct 16, 2019, 11:06 AM

#

You are treating this as a classification problem but it looks to me more like a regression problem. Maybe you should try using a simple regression model?

supple ferry Oct 16, 2019, 11:17 AM

#

If values are not discrete you should not use classification @normal plinth

rare coral Oct 16, 2019, 11:44 AM

#

  File "N:/Discord Bot-20190711T105934Z-001/ChatBot.py", line 6, in <module>
    import tflearn
  File "C:\Program Files\Python37\lib\site-packages\tflearn\__init__.py", line 4, in <module>
    from . import config
  File "C:\Program Files\Python37\lib\site-packages\tflearn\config.py", line 5, in <module>
    from .variables import variable
  File "C:\Program Files\Python37\lib\site-packages\tflearn\variables.py", line 7, in <module>
    from tensorflow.contrib.framework.python.ops import add_arg_scope as contrib_add_arg_scope
ModuleNotFoundError: No module named 'tensorflow.contrib'

#

w h el p

supple ferry Oct 16, 2019, 11:57 AM

#

Where is the code itself @rare coral

#

😀

rare coral Oct 16, 2019, 11:58 AM

#

dammit I just logged off

#

I basically imported tensorflow and tflearn

normal plinth Oct 16, 2019, 12:14 PM

#

@supple ferry @polar acorn What do you both mean by should not use a classification?

supple ferry Oct 16, 2019, 12:16 PM

#

If the value you try to predict only takes a finite number of options, for example only 0, 1, 2, 3, then you should use classification. If the value can also be 0.1, 3.8 etc., aka continuous values, you should use regression

#

@normal plinth

polar acorn Oct 16, 2019, 12:21 PM

#

It may be a bit more complicated sometimes though. Say you want to predict a score say for a movie or something which is an integer between 0 and 100. What you're trying to predict is discrete values. But there are many discrete values and they are ordinal so you should use regression.

slim fox Oct 16, 2019, 12:23 PM

#

I would probably say that regression is for when your predictable variable can be ordered

#

and ofc they are not just 1,2,3 discrete values

normal plinth Oct 16, 2019, 12:26 PM

#

@supple ferry @polar acorn The only numbers they need to predict, range from 80-400, so decimals aren't needed - there is a finite number within the database.

slim fox Oct 16, 2019, 12:28 PM

#

if they are 80-400 I'd use regression

#

and then just round it

normal plinth Oct 16, 2019, 12:29 PM

#

How would I change my code so that it uses regression? I'm currently using abs to try and get the percentage up

polar acorn Oct 16, 2019, 12:30 PM

#

@normal plinth
As I said sometimes it's best to use regression even though your target is discrete. Ask yourself this: if you predict 99 and the correct score was 100 are you closer than if you would have predicted 45? If you use classification your model will treat 99, 45 and 100 as three separate classes that have nothing to do with each other. Obviously it would be better to predict the wrong number but be close than to predict the wrong number and be far off. This means you should use regression

slim fox Oct 16, 2019, 12:32 PM

#

have you done EDA on it?

polar acorn Oct 16, 2019, 12:32 PM

#

Anyway, as you're already using sklearn you can check out the many regression models they have.

normal plinth Oct 16, 2019, 12:32 PM

#

@polar acorn Isn't that what abs is? For example, if I have an abs of 25 and it guesses 110, when the correct answer is 100 it'll still deem it correct

#

What is EDA?

slim fox Oct 16, 2019, 12:32 PM

#

exploratory data analysis

#

at least plot your target var against independents

normal plinth Oct 16, 2019, 12:33 PM

#

No, I don't think I have.

#

Also, if I wanted to train the model with a file, but then test it on a different file (using the train_test_split method), would I have to load both files in within the same block of code?

polar acorn Oct 16, 2019, 12:37 PM

#

abs() just means absolute value (look it up if you don't know what that means), such that if your prediction is either -10 wrong or 10 wrong the error will come out as 10. I see that you check if your prediction is close enough. But this doesn't solve your problem. Your problem is that your model doesn't care about getting close, just about getting it exactly right. So you should choose another model.

normal plinth Oct 16, 2019, 12:39 PM

#

I've tried about 5 models (Tree, SVM, ForestTree, Naïve, and MLP). All of which give a low percentage - The Tree model being the best (gives around 55%).

polar acorn Oct 16, 2019, 12:40 PM

#

They are all models that can be trained for both classification and regression. Did you use MLPClassifier or MLPRegressor?

normal plinth Oct 16, 2019, 12:41 PM

#

Classifier

polar acorn Oct 16, 2019, 12:42 PM

#

Try Regressor 😉

normal plinth Oct 16, 2019, 12:43 PM

#

Alright, I'll try that now - you would just change the include right?

polar acorn Oct 16, 2019, 12:44 PM

#

Or any of the other sklearn models that say regression (check the sklearn docs if you are unsure). And do read up on classifiers vs regressors. There are probably many good intros and it's an important concept.

#

The include?

normal plinth Oct 16, 2019, 12:44 PM

#

Import, sorry.

#

Changed it to Regressor and it changed to 10%

polar acorn Oct 16, 2019, 12:45 PM

#

Just the import or the classifier also?

normal plinth Oct 16, 2019, 12:46 PM

#

I changed the classifier to regressor within the import.

polar acorn Oct 16, 2019, 12:47 PM

#

you need from sklearn.neural_network import MLPRegressor and clf = MLPRegressor()

normal plinth Oct 16, 2019, 12:47 PM

#

Yeah, I changed both of them

#

📎 CorrectPercentage.PNG

#

That's the outcome

polar acorn Oct 16, 2019, 12:49 PM

#

Well that's not that good. But there are many, many things to tune in an MLPRegressor. There are also many other types of regressors. Try out a few different settings and also a few different regression models.

normal plinth Oct 16, 2019, 12:50 PM

#

I've literally tried a load of models and they're all around the same percentage - do you think it's something wrong with the code in relation to the size of the data, i.e. 5000+ datasets?

polar acorn Oct 16, 2019, 12:53 PM

#

I don't know the size of the data, but I'd look at other stuff first. For instance, right now we forgot to scale our data, which is often nice to do when working with MLPs. Maybe you could try a random forest regressor instead?

normal plinth Oct 16, 2019, 12:53 PM

#

Before trying to predict health scores, I tried to predict a different dataset (wine quality) and it gave me 60/70% with an abs of 2 - so that worked.

#

Just used the forest regressor and it went from 10% to 45%, so that's good, but it's still pretty low

#

That's with a abs of 10 too

polar acorn Oct 16, 2019, 1:01 PM

#

As I said, there are many, many ways of getting more out of models. Do look at the docs for the models and see how you can tune them. Also when we do regression we usually don't look at accuracy (as in how many are closer than 10). We often look at MSE (mean square error) as that is often more informative. When you check with an abs of 10 as you say, that means being 11 wrong and being 1000 wrong count as equally wrong.

normal plinth Oct 16, 2019, 1:02 PM

#

By 'Mean square error' do you mean, you calculate the error percentage

#

And the lower it is, the better?

polar acorn Oct 16, 2019, 1:06 PM

#

No, to calculate the mean square error you find the error of each prediction, square it and then find the mean of all of those. You don't have to calculate them yourself though, sklearn has that implemented: from sklearn.metrics import mean_squared_error and then just call mean_squared_error(true_values, predicted_values), where true_values and predicted_values are arrays of the true and predicted values
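To make the difference between the "within 10" check and MSE concrete, here is a plain-Python sketch of both. The function names `mse` and `within_tolerance` and the scores are invented for this example; sklearn's `mean_squared_error` would replace the hand-written `mse` in practice.

```python
def mse(true_values, predicted_values):
    """Mean of the squared differences between truth and prediction."""
    errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)

def within_tolerance(true_values, predicted_values, tol=10):
    """Fraction of predictions that land within tol of the truth."""
    hits = [abs(t - p) < tol for t, p in zip(true_values, predicted_values)]
    return sum(hits) / len(hits)

true_scores = [100, 200, 300]
slightly_off = [111, 211, 311]   # every prediction misses by 11
way_off = [1100, 1200, 1300]     # every prediction misses by 1000

print(within_tolerance(true_scores, slightly_off), mse(true_scores, slightly_off))
print(within_tolerance(true_scores, way_off), mse(true_scores, way_off))
```

Both prediction sets score 0% on the tolerance check, but the MSE separates them clearly (121 versus 1,000,000).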

normal plinth Oct 16, 2019, 1:09 PM

#

Alright fair - I'll give it a try.

One final question, if you don't mind. If I wanted to train the model with one csv file, but then test it on a smaller csv where I have to predict a health score that isn't displayed within it; how would I do that, would I have to load the two csv's within the same block of code?

polar acorn Oct 16, 2019, 1:11 PM

#

Though I'm not sure what you mean with block of code the answer is probably no. You can load one file and train your regressor on that and then load the other file and predict on that.

normal plinth Oct 16, 2019, 1:16 PM

#

And that's all in the same python file? Because I tried to earlier and it didn't work

#

It just displayed the default values which were 0 as they hadn't been predicted yet

#

This is what I tried to do:

health_data = pd.read_csv("C:/Users/?/Downloads/Female(2)Database.csv")

health_datas = pd.read_csv("C:/Users/?/Downloads/Population(1).csv")

#health_train, health_test = train_test_split(health_data, health_datas, test_size=0.2)

health_train = train_test_split(health_data, test_size=0.2)

health_test = train_test_split(health_datas, test_size=0.2)

#display("Healthscore", health_data)

#display("Column Headings", list(health_data.columns.values))

f_train = health_train[['Age', 'SexNum', 'Weight', 'Height','Alcohol Per Day (Units)', 'Cigarettes per day', 'ActiveNum']].copy()
f_test = health_test[['Age', 'SexNum', 'Weight', 'Height','Alcohol Per Day (Units)', 'Cigarettes per day', 'ActiveNum']].copy()

s_train = health_train[['Health Score (high is good)']].copy()
s_test  = health_test[['Health Score (high is good)']].copy()

polar acorn Oct 16, 2019, 1:22 PM

#

If you wanted to train on one file and test on the other you don't need to use the train_test_split. Just read one file and use it as f_train and use the other file as f_test.

normal plinth Oct 16, 2019, 1:22 PM

#

So how would I write that?

polar acorn Oct 16, 2019, 1:27 PM

#

Where you define f_train for instance just use f_train = health_data[['Age',... and use health_datas for f_test. Or switch if you want to train on datas and test on data.
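Put together, training on one file and predicting on another could look roughly like this. The two small DataFrames stand in for the two `pd.read_csv(...)` results, the column names are shortened, and `RandomForestRegressor` is just one regressor picked for the sketch; all of that is assumed rather than taken from the real files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins for the two CSV files; in practice each would come
# from its own pd.read_csv(...) call.
train_df = pd.DataFrame({
    "Age": [25, 40, 55, 30, 65, 45],
    "Cigarettes per day": [0, 10, 20, 0, 5, 15],
    "Health Score": [320, 210, 120, 330, 180, 160],
})
test_df = pd.DataFrame({
    "Age": [35, 60],
    "Cigarettes per day": [0, 20],
})

features = ["Age", "Cigarettes per day"]

# Fit on every row of the training file: no train_test_split involved.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train_df[features], train_df["Health Score"])

# Predict on the second file and round, since the scores are whole numbers.
predictions = model.predict(test_df[features])
test_df["Predicted Health Score"] = predictions.round().astype(int)
print(test_df)
```

The second file only needs the feature columns; the score column is whatever the model fills in.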

normal plinth Oct 16, 2019, 1:33 PM

#

That has worked - Thank you so much. Only problem is that it's only predicting 19 of the health scores, not the full 20. It's like it can't read the first column.

#

And when it is training the model, it only trained 508 bits of data when there's 5k+ of them.

#

Nevermind - I can't count apparently. It goes all 20.

The second problem is still happening though (training only 500).

#

It only trains 10% of the data - does that mean it has defaulted if I haven't specified a value?

polar acorn Oct 16, 2019, 1:39 PM

#

Hmm that sounds strange. How does your code look now?

normal plinth Oct 16, 2019, 1:42 PM

#

def display(mess, values):
    print()
    print("-----", mess, "-----")
    print(values)
    print("------------------------")
    
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split


health_data = pd.read_csv("C:/Users/16027787/Downloads/HealthScores.csv")

health_datas = pd.read_csv("C:/Users/16027787/Downloads/Population(1).csv")

f_train = health_data[['Age', 'SexNum', 'Weight  in lbs', 'Height in Inch', 
                      'IQ', 'Units of alcohol per day', 'Cigarettes per day', 'Maritial Status Num', 
                      'Additional People in household', 'Salary', 'ActiveNum']].copy()
f_test = health_datas[['Age', 'SexNum', 'Weight  in lbs', 'Height in Inch', 
                      'IQ', 'Units of alcohol per day', 'Cigarettes per day', 'Maritial Status Num', 
                      'Additional People in household', 'Salary', 'ActiveNum']].copy()

s_train = health_data[['Health Score (high is good)']].copy()
s_test  = health_datas[['Health Score (high is good)']].copy()

#

I would say that's the main bit of the code - where the problem probably is

#

Before deleting the train_test_split - this is what I used to tell it to train 20%

health_train, health_test = train_test_split(health_data, test_size=0.2)

polar acorn Oct 16, 2019, 1:48 PM

#

Hmm that seems fair enough. I assume you are sure that f_train now has 5k+ rows? You can check by printing out f_train.shape

normal plinth Oct 16, 2019, 1:51 PM

#

Yeah, it is showing 5k but only training on 500.

#

📎 Capture.PNG

#

That's what it's displaying.

polar acorn Oct 16, 2019, 1:56 PM

#

Does your code still say for index, row in health_test.iterrows(): ?

normal plinth Oct 16, 2019, 1:57 PM

#

Yeah, do I need to change that to health_data?

#

Health_data is the training csb.

#

*csv

#

datas being the test

polar acorn Oct 16, 2019, 1:58 PM

#

Yes. That would print "Total 5k+" if that is what you're after.

normal plinth Oct 16, 2019, 1:59 PM

#

Oh wait, no - I did change it to:

for index, row in health_data.iterrows():

#

What I'm after is that its training on only 500 while there is 5000 within the csv file

#

Nevermind - it's working now. I was using the wrong one; my bad.

polar acorn Oct 16, 2019, 2:02 PM

#

Note however that while it can be interesting to look at how well the model does on the training data, it doesn't really say much about how good the model is. If you want to evaluate how good the model is you need to check how it performs on the test data. So you should certainly not say your model is right 70% of the time just because it scores that on the training data.

normal plinth Oct 16, 2019, 2:03 PM

#

Yeah, I know that. I looked at the stats for the test (Gender, if they smoke, weight, active or not etc.) And the health scores look quite accurate.

polar acorn Oct 16, 2019, 2:06 PM

#

Well done 👍 There's probably many things to improve still, but that is always true no matter the case. Good luck further.

normal plinth Oct 16, 2019, 2:06 PM

#

Thank you so much, pptt - you're an actual legend mate. I would legit buy you a pint if I knew you.

hallow hawk Oct 16, 2019, 2:12 PM

#

hey, hope you all are doing well!
noob question here - is it worth installing Anaconda instead of manually installing each lib and piece of software? I've already installed numpy, scipy, pandas, scikit-learn and jupyter and it was such a painful process. jupyter dot org said they strongly recommend installing Python and Jupyter using the Anaconda Distribution.
I'm a little afraid of such a massive distro with a bunch of useless (for me ofc) libs. usually I prefer to install each piece manually (so I know what each lib does), but maybe I'm just paranoid?

slim fox Oct 16, 2019, 2:40 PM

#

and it was such a painful process
@hallow hawk why so?

#

most usually it is as simple as pip install lib

hallow hawk Oct 16, 2019, 2:44 PM

#

a lot of errors. I spent a few hours in searching and trying to fix it

silent swan Oct 16, 2019, 4:44 PM

#

yes, use miniconda, that's conda without libraries installed

tidal remnant Oct 17, 2019, 2:04 AM

#

Would someone help me understand back propagation?

#

of a neural net

#

so basically you change the weights of each node by however much the previous node was wrong, based on its weight?

#

how much do you change it by

#

ping me if you have a response please

silent swan Oct 17, 2019, 2:39 AM

#

differentiation and chain rule

tidal remnant Oct 17, 2019, 3:33 AM

#

okay could you elaborate a little more on how differentiation is used?

earnest prawn Oct 17, 2019, 5:21 AM

#

If you want to know that, you'd actually have to go through how exactly the math behind back propagation works, but basically you differentiate the loss function to find out in which direction you should step to minimize it a little more, and based on that derivative + your neural network's derivatives you can figure out how you have to change the values to take that step
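As a toy illustration of "differentiation and chain rule", here is gradient descent on a single neuron with a squared-error loss. All the numbers (input, target, starting weights, learning rate) are arbitrary choices for the sketch; a real network repeats the same chain-rule step layer by layer.

```python
# One neuron, one training example: prediction = w * x + b.
x, target = 2.0, 10.0
w, b = 0.5, 0.0
learning_rate = 0.05

losses = []
for _ in range(50):
    prediction = w * x + b
    loss = (prediction - target) ** 2
    losses.append(loss)

    # Chain rule: dloss/dw = dloss/dprediction * dprediction/dw
    dloss_dpred = 2 * (prediction - target)
    dloss_dw = dloss_dpred * x
    dloss_db = dloss_dpred * 1.0

    # Step against the gradient, scaled by the learning rate.
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db

print(losses[0], losses[-1])
print(w * x + b)
```

This is the answer to "how much do you change it by": each weight moves by the learning rate times the derivative of the loss with respect to that weight.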

graceful birch Oct 17, 2019, 6:55 AM

#

I have some long-running ML pipelines. What tools are good to manage the pipeline.. say start dependent tasks, report progress or errors to a server/dashboard?

blissful badger Oct 17, 2019, 7:11 AM

#

I'm not entirely sure but maybe RQ or Hangfire?

graceful birch Oct 17, 2019, 7:36 AM

#

@blissful badger those seem a bit like Celery, I looked at it but i found the issue was they want me to "lift and shift" my stuff into their framework & language

#

it would be great if i can do something like

./prepare_data --progress_callback=http://127.0.0.1/progress?task_id=abc123

then inside my prepare_data program i can instrument it to make callbacks there

http://127.0.0.1/progress?task_id=abc123&percentage=1
http://127.0.0.1/progress?task_id=abc123&percentage=2
... etc

#

ie the task manager exposes an API that lets me instrument my code to report progress

dim beacon Oct 17, 2019, 7:56 AM

#

@graceful birch Celery can do all that

graceful birch Oct 17, 2019, 8:03 AM

#

@dim beacon let's say the ./prepare_data is a long-running Java process. The way I can think of is to have a celery task that
(1) Starts a socket server or HTTP server on localhost:6969
(2) Launches ./prepare_data --progress_server=localhost:6969 using subprocess
(3) The ./prepare_data process will send progress info to the server started in (1)
(4) The handler for (1) will take the progress info it has been given and put that in celery's task state metadata ?

dim beacon Oct 17, 2019, 8:05 AM

#

@graceful birch what I would do is make ./prepare_data feed progress status to a Redis DB that you'd be able to query from anywhere

#

If you use Celery you'd be able to use that value to update the PROGRESS of the task

#

But do not start a web server for that

graceful birch Oct 17, 2019, 8:08 AM

#

@dim beacon yes you are right no point in the webserver and we will have redis anyway

#

@dim beacon so something like this?

def prepare_data_task():
    progress_key = ... # some random key or the celery task id

    subtask = subprocess.Popen(["./prepare_data", "--progress_key", progress_key])
    try:
        while True:
            time.sleep(1.0)
            if task_is_done(subtask):
                break
            try:
                progress = redis.get('progress:' + progress_key)
                celery.update_progress(progress)
            except Exception:
                print('update progress failed')

        if task_went_pear_shaped(subtask):
            raise Exception(subtask)
    finally:
        if task_is_running(subtask):
            kill_task(subtask)

hallow hawk Oct 17, 2019, 9:20 AM

#

@silent swan
thank you

supple ferry Oct 17, 2019, 9:54 AM

#

@tidal remnant , look at this. this is the best explanation that i have come across
https://www.youtube.com/watch?v=QJoa0JYaX1I&t=1s

YouTube

The Coding Train

10.14: Neural Networks: Backpropagation Part 1 - The Nature of Code

In this video, I discuss the backpropagation algorithm as it relates to supervised learning and neural networks. Next Video: https://youtu.be/r2-P1Fi1g60 Thi...

▶ Play video

#

@void anvil , i have been trying to find you on this discord for some time now, looks like you have deleted and restored your account

tidal remnant Oct 17, 2019, 10:04 AM

#

awesome thanks

upper eagle Oct 17, 2019, 4:24 PM

#

Hey, I have two dataframes in pandas, df1 and df2
df1 looks like this:

a
c
b
e
j

df2 looks like this:

a
j
e
z
k

I would like to iterate over each dataframe and check if the values match
I have been trying to setup two for-loops as I would with lists in python
but have been running into issues.

I am looking for something like

    for i in df1:
        for j in df2:
            if i == j:
                print(f"{i} has a match in df2")

I have been looking around at pandas documentation for a while now and
have not been able to find something useful, any help would be appreciated 😄

silk acorn Oct 17, 2019, 4:26 PM

#

It's the .eq function iirc

#

or == even

upper eagle Oct 17, 2019, 4:29 PM

#

.eq will not return the case for 'e'

#

since e is at different positions

silk acorn Oct 17, 2019, 4:29 PM

#

Oh, my bad, this was just element for element.

upper eagle Oct 17, 2019, 4:30 PM

#

Yea I tried that already : (

silk acorn Oct 17, 2019, 4:32 PM

#

You could turn the columns into sets and get the intersection if duplicates don't matter

supple ferry Oct 17, 2019, 4:36 PM

#

@void anvil have you worked with time series data? I need some advice in feature extraction

silk acorn Oct 17, 2019, 4:36 PM

#

@upper eagle set intersection seems to be one of the faster ways.

#

as opposed to np.intersect1d

upper eagle Oct 17, 2019, 4:37 PM

#

yea duplicates don't matter, I will look into set intersection
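For reference, the set-intersection idea might look like this on the example data. The column name `value` is assumed since the original frames don't show one. `isin` is included as an alternative that keeps df1's row order.

```python
import pandas as pd

df1 = pd.DataFrame({"value": ["a", "c", "b", "e", "j"]})
df2 = pd.DataFrame({"value": ["a", "j", "e", "z", "k"]})

# Set intersection: order and duplicates are discarded.
common = set(df1["value"]) & set(df2["value"])
print(sorted(common))

# isin keeps df1's row order and gives a boolean mask usable for filtering.
matches = df1[df1["value"].isin(df2["value"])]

for value in matches["value"]:
    print(f"{value} has a match in df2")
```

Either way the position of "e" in each frame no longer matters, which was the problem with `.eq`.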

supple ferry Oct 17, 2019, 5:31 PM

#

@void anvil so, i have some timestamps of flight times, and I want to extract information from them. I already have weekdays. I also made a new variable which is this:
(timestamp of flight - midnight of the earliest flight in that direction) / 86400

#

which gives me a float of relative distances for every flight. I can then add cosine of it to my mnl

#

my question is, which other methods i can use to extract as much information from those horizontal features

supple ferry Oct 17, 2019, 7:01 PM

#

if the flight is selected

#

my idea is treating flights on 10 october at 23:00 similar to the ones which are on 11 october but around midnight, 1 am

#

time is horizontal in this case

#

im trying to make it vertical
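
One common way to make 23:00 and 01:00 look close to each other is to place the time of day on a circle and use both sine and cosine as features. The sketch below is only an illustration: it assumes timestamps in seconds whose midnights fall on multiples of 86400 (for example UTC Unix time), and the function name is made up.

```python
import numpy as np

SECONDS_PER_DAY = 86400

def time_of_day_features(timestamps):
    """Map timestamps (seconds) to a point on the 24-hour circle."""
    ts = np.asarray(timestamps, dtype=float)
    angle = 2.0 * np.pi * (ts % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return np.column_stack([np.sin(angle), np.cos(angle)])

# 23:00, 01:00 on the next day, and 12:00, as offsets from a reference midnight
late, early, noon = 23 * 3600, 25 * 3600, 12 * 3600
feats = time_of_day_features([late, early, noon])
dist_late_early = np.linalg.norm(feats[0] - feats[1])
dist_late_noon = np.linalg.norm(feats[0] - feats[2])
```

Using the cosine alone would also map 06:00 and 18:00 to the same value, which is why the pair is usually kept together.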

supple ferry Oct 17, 2019, 10:55 PM

#

How to do that? Can you give me any links where I can learn more about this

supple ferry Oct 18, 2019, 5:09 AM

#

@void anvil so I have both origin and destination + every timestamp of every flight in that connection. I. E, the flight has 1 stop I also have its time and where the stop is

#

The idea with time since last flight seems reasonable. I will try it over the weekend or today if I get enough time

#

However, it might happen that it will be insignificant, because I already have a variable time since midnight of the earliest flight which I take as an arbitrary reference point

jade pine Oct 18, 2019, 7:35 AM

#

hello guys can any one help me in data science, i wanted to create a content in data science and the minimum word count is 1000 words, can any one suggest me any link or github repo from where i can take help

quartz stream Oct 18, 2019, 7:48 AM

#

Anyone knows of production ready speech to text model

#

or any library ?

rare coral Oct 18, 2019, 9:36 AM

#

reeeeeee how tf are we meant to use tflearn with tf 2.0

lapis sequoia Oct 18, 2019, 2:56 PM

#

hey how can I group a pandas dataframe by its index? (the index is non-unique)

north river Oct 18, 2019, 3:30 PM

#

where's a good place to ask questions about matplotlib?

supple ferry Oct 18, 2019, 3:42 PM

#

@lapis sequoia df.groupby(df.index)
You may have to sort them by index first if it is time based index
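
A small illustration of grouping by a non-unique index, with made-up data. As far as I know, `groupby` sorts the group keys by default, so pre-sorting is mostly about the order of rows inside each group.

```python
import pandas as pd

# Toy frame with a repeated index (values are made up).
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["x", "y", "x", "y"])

by_index = df.groupby(df.index)   # here equivalent to df.groupby(level=0)
totals = by_index["value"].sum()
```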

silent swan Oct 18, 2019, 6:01 PM

#

here's a good place to ask about matplotlib

wet mica Oct 18, 2019, 6:27 PM

#

@upper eagle there is a function within pandas called .iterrows()

it allows you to iterate through the rows of a dataframe without having to convert everything to a list or something.

for index, row in df.iterrows():
    ...  # do some stuff here

you will have to check, but you can also designate which column it iterates through. cant remember how off the top of my head though

unreal dome Oct 18, 2019, 7:05 PM

#

Hi. I'm a reasonably experienced dev in python and in other languages/environments and now have a task involving DSP-ish stuff, an area i've never dealt with before. (home project, so unconnected with my professional experience).

i have some ideas about how to approach it, but no clear idea which might be the best option. is this a good place to pose the question, or is there a more appropriate 'cord for python/dsp stuff?

(TL;DR- of the task: find where a slate tone, a ~1 kHz "sine" ends in an audio stream)

fervent lance Oct 18, 2019, 7:10 PM

#

is there a way to give the program x and y and get the pattern of it ?

#

it's an exponential function

exotic reef Oct 19, 2019, 12:36 AM

#

@upper eagle @wet mica i would advise against using iterrows it's quite slow. Better to use itertuples or convert to a dict with .to_dict(orient='records') which gives you a list of dicts. Unless you really need stuff returned as a series, the overhead of iterrows isn't worth it

#

@unreal dome do you know for sure the frequency of the sine wave you are trying to track?

#

@fervent lance what do you mean by 'the program' and pattern?

wet mica Oct 19, 2019, 12:56 AM

#

@exotic reef that's really good to know. I'm used to working in R, so moving away from dataframes seems scary to me. I'll have to compare the scripts and see how it performs

exotic reef Oct 19, 2019, 12:57 AM

#

Funny you should mention that, i am currently looking longingly at R's plotting ecosystem and tidyverse 😛

#

I did R aaages ago. I hate plotting in Python.

#

As for the issue at hand, unless you need Series specific functionality, list of dicts is the way to go

#

https://stackoverflow.com/a/24871316/2774823

Stack Overflow

Does pandas iterrows have performance issues?

I have noticed very poor performance when using iterrows from pandas.

Is this something that is experienced by others? Is it specific to iterrows and should this function be avoided for data of a

#

It doesn't mention using dictionaries there, but it does put iterrows deadlast (almost)

silent swan Oct 19, 2019, 1:04 AM

#

I like iterrows, but I would not use it if I were concerned about performance

#

good point above, feels like they should just expose .to_dict(orient='records') as a more convenient method

#

maybe an iterdict, corresponding to itertuples

exotic reef Oct 19, 2019, 1:15 AM

#

Iterrows is fine when you just need to get the thing done, and as you say performance isn't an issue. But it can add up pretty quickly just because the performance hit is considerable. True, there is overhead in doing the to_dict conversion, and maybe memory overhead if you want to keep a copy of the original df or more cpu overhead if you convert back to df, but in my experience you can easily get a 20x speedup even including all this

#

I've not done extensive testing compared to itertuples though so maybe that is the best of both worlds

silent swan Oct 19, 2019, 1:24 AM

#

does itertuples return tuples or namedtuples

exotic reef Oct 19, 2019, 1:27 AM

#

Ah, good question. I think named tuples
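
For reference, a short sketch of the two alternatives discussed above, on a made-up frame. By default `itertuples` yields namedtuples, so fields are read as attributes, while `to_dict(orient="records")` gives one plain dict per row.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# itertuples yields namedtuples; index=False leaves the index out of each tuple.
rows = list(df.itertuples(index=False))
first = rows[0]

# to_dict(orient="records") yields a list with one dict per row.
records = df.to_dict(orient="records")
```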

unreal dome Oct 19, 2019, 8:27 AM

#

@exotic reef it's notionally a 1 kHz sinewave, but in reality it's more a square wave with rounded corners so unfortunately its bandwidth is rather wider than it should be. here's the section of audio I want to algorithmically identify:

📎 unknown.png

#

you can see what i mean about it being a square-ish wave. but at least its fundamental period is, indeed, 1 ms.

#

Its amplitude should nearly always be much larger (-6 dBFS) than the associated audio but of course I can't guarantee that. You can also see that its amplitude decays over about 8 periods. (the left hand side extends backward for about a second.) Some of the inputs come from a shotgun mic that is rather hotter and noisier than the section you see there.

unreal dome Oct 19, 2019, 8:57 AM

#

possible approaches that occur to me include:
• simply calculate the RMS amplitude and look for the fall off, but this depends on a significant differential between the slate tone and the recorded audio, which isn't a safe assumption given that the target recording level for local peaks is c. -12 dBFS.

• apply a narrow bandpass filter (IIR or FIR? idk the difference for these purposes) and do the same. The sidelobe frequency components will fall away and the 1kHz fundamental will come through, so the fact that it's not a true sine doesn't matter so much. This would be more accurate. Idk if there's a window comparable to that of an FFT, but I assume not in the same way.

• do a Fourier transform and look for when the peak at 1 kHz goes away. This involves a little imprecision because of the FFT window, but since the frame period is 40 ms (25 fps), i expect that this imprecision is going to be within a frame or two. I'm not sure that this is functionally different from the above and is just wasteful of cycles.

#

—
you or others here might think of an even better way to do it.
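
A rough sketch of a cheaper variant of the third approach: instead of a full FFT, measure only the 1 kHz component in short frames (a single-bin DFT) and report where it first drops away. The function name, the 10 ms frame length and the 50% threshold are all assumptions for illustration, not tuned values, and real recordings may need a different threshold.

```python
import numpy as np

def tone_end_time(samples, sample_rate, tone_hz=1000.0, frame_ms=10.0, rel_threshold=0.5):
    """Estimate the time (seconds) at which the tone_hz component stops."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)

    # Single-bin DFT kernel at the tone frequency.
    t = np.arange(frame_len) / sample_rate
    probe = np.exp(-2j * np.pi * tone_hz * t)

    # Amplitude of the tone component in each frame.
    mag = np.abs(frames @ probe) * 2.0 / frame_len
    loud = mag >= rel_threshold * mag.max()

    # First loud frame, then the first quiet frame after it.
    first = int(np.argmax(loud))
    quiet_after = np.flatnonzero(~loud[first:])
    end_frame = first + int(quiet_after[0]) if quiet_after.size else n_frames
    return end_frame * frame_len / sample_rate

# Synthetic check: 1 s of 1 kHz tone followed by 1 s of low-level noise.
rate = 48_000
rng = np.random.default_rng(0)
tone = 0.5 * np.sin(2 * np.pi * 1000.0 * np.arange(rate) / rate)
noise = 0.05 * rng.standard_normal(rate)
estimate = tone_end_time(np.concatenate([tone, noise]), rate)
```

The time resolution is one frame (10 ms here), which sits inside the 40 ms video frame period mentioned above. Because the harmonics of a square-ish wave fall in other bins, only the fundamental contributes to the measurement.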

exotic reef Oct 19, 2019, 8:59 AM

#

So is the idea that this signal will be inside another one and you want to dig it out?

#

Or you want to recognise precisely this wave when it appears without anything else noising it up

unreal dome Oct 19, 2019, 9:04 AM

#

It's a slate tone: when the PCM recorder starts recording, it substitutes (rather than superimposes, I think) this ~1 second 1kHz tone on the four PCM tracks it records as well as outputs this tone to the camera's own audio input. That gets embedded in the video to ease synchronisation in post.

So the objective is to automatically correlate and align the four discrete PCM tracks with a fifth AAC-compressed audio track taken from video. Then, when the offsets are known, add the four PCM tracks to the video container. The result can then be imported into an NLE and the video editing and colour grading process goes as normal from there, but with the ability to switch between audio tracks as is indicated.

#

.
also worth mentioning that the slate tone always only ever appears within the first second of PCM tracks and within some variable number of seconds at the beginning of the video's audio track.

#

so i don't have to scan all of the audio, just the first few seconds of each of the five sources.

vivid dagger Oct 19, 2019, 2:09 PM

#

is k means the most efficient way to cluster data

supple ferry Oct 19, 2019, 2:24 PM

#

@void anvil can you give me some more specifics?

quaint marten Oct 19, 2019, 6:32 PM

#

Hey, any data scientists around in this chat that I could ask a few questions to about the field

#

would be appreciated 🙂

quaint marten Oct 19, 2019, 7:00 PM

#

@void anvil are you a data scientist?

#

I just wanted to know about what entry level job i should apply to

#

after gaining skills

#

what entry level job will help me learn most about the field

quaint marten Oct 19, 2019, 7:20 PM

#

i'm going the self taught route rage pop

#

sorry what I mean is what specific job should i target to learn the most about data science

#

e.g. a data analyst etc.

fallen anchor Oct 19, 2019, 7:28 PM

#

data scientist is a job

quaint marten Oct 19, 2019, 7:28 PM

#

data base administrator

#

blah blah

velvet kite Oct 19, 2019, 8:31 PM

#

Does anyone know of a good tutorial to make a game as a custom gym environment?

#

and then make an agent to play it?

deft harbor Oct 19, 2019, 9:49 PM

#

Quick question, because I'm pretty sure I've been looking at this for too long now.

#

When you use sklearns logisticregression, do I need to reshape the features (pandas df) first?

#

I know when it is a single feature I have to reshape using (-1,1)

#

Nm, ignore that

supple ferry Oct 20, 2019, 2:13 AM

#

@velvet kite Sentdex has good tutorial on that

marsh token Oct 20, 2019, 7:42 AM

#

@velvet kite use unity?

hallow wave Oct 20, 2019, 1:23 PM

#

As a data analyst what qualifications and theory would I need to know ?

lapis sequoia Oct 21, 2019, 2:22 AM

#

depends what domain you're going to do analysis on....

#

so background knowledge for one

#

Spreadsheets, git and numpy..

#

basic statistics and statistical tests

dusty talon Oct 21, 2019, 9:23 AM

#

hi has anyone ever worked with facenet?

#

I'm working on realtime face recognition system using facenet, but direct euclidean distance comparison between two vectors (of faces) gave me too many false positives (and negatives too)

#

so I think maybe I need to train a more sophisticated face classifier

#

if anyone here have any thoughts, I would like some advice

deft harbor Oct 21, 2019, 4:22 PM

#

I've built a classification model using LogisticRegressionCV. There are three classes, 2 predictors and I've used 5 folds. However, I'm struggling with understanding the array that is output when using model.scores_

#

I can't post the whole output because of a limit on text, but it returns three of these.

#

0.0: array([[0.775     , 0.775     , 0.78333333, 0.78333333, 0.78333333,
         0.8       , 0.875     , 0.86666667, 0.86666667, 0.86666667],
        [0.80833333, 0.80833333, 0.80833333, 0.825     , 0.79166667,
         0.79166667, 0.88333333, 0.89166667, 0.89166667, 0.89166667],
        [0.83333333, 0.83333333, 0.84166667, 0.80833333, 0.86666667,
         0.86666667, 0.88333333, 0.88333333, 0.875     , 0.86666667],
        [0.83333333, 0.83333333, 0.83333333, 0.85      , 0.86666667,
         0.89166667, 0.89166667, 0.88333333, 0.89166667, 0.89166667],
        [0.80833333, 0.80833333, 0.80833333, 0.85      , 0.825     ,
         0.84166667, 0.89166667, 0.89166667, 0.9       , 0.9       ]]),

#

0.0, 1.0, 2.0

#

How do I read this?
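
One hedged reading, based on scikit-learn's documented layout for `LogisticRegressionCV.scores_` at the time of this chat: it is a dict keyed by class, and each value is an array of shape (n_folds, n_Cs). With 5 folds and the default of 10 candidate `C` values, each row would be one fold and each column one value of `C`. The sketch below uses only NumPy on the array quoted above; the variable names are made up.

```python
import numpy as np

# scores_[0.0] as posted above: 5 rows (folds) x 10 columns (candidate C values)
scores_class0 = np.array([
    [0.775, 0.775, 0.78333333, 0.78333333, 0.78333333,
     0.8, 0.875, 0.86666667, 0.86666667, 0.86666667],
    [0.80833333, 0.80833333, 0.80833333, 0.825, 0.79166667,
     0.79166667, 0.88333333, 0.89166667, 0.89166667, 0.89166667],
    [0.83333333, 0.83333333, 0.84166667, 0.80833333, 0.86666667,
     0.86666667, 0.88333333, 0.88333333, 0.875, 0.86666667],
    [0.83333333, 0.83333333, 0.83333333, 0.85, 0.86666667,
     0.89166667, 0.89166667, 0.88333333, 0.89166667, 0.89166667],
    [0.80833333, 0.80833333, 0.80833333, 0.85, 0.825,
     0.84166667, 0.89166667, 0.89166667, 0.9, 0.9],
])

# Average over folds (rows) to get one cross-validated score per candidate C.
mean_per_C = scores_class0.mean(axis=0)
best_mean = mean_per_C.max()
```

If that reading is right, the columns line up with `model.Cs_`, so the column with the highest mean points at the `C` that did best on average.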

wet mica Oct 21, 2019, 4:45 PM

#

Looking for some input here: I am the only datascience person at my company. I got in contact with the datascience team of our holding company, and they want me to put together a wishlist of things I want so they can enable me to get my job done better. For context, I work in marketing data analytics

So far what I have is:

github account on the corporate account
virtual desktop or server space to run larger jobs on (my 8GB RAM laptop can't handle too much)
server space to deploy bots from for automatic data retrieval

My manager wants me to go big on the asks, so does anybody here have any ideas of other things I could ask for? Previously I was a graduate student, so i was used to just getting things myself as I needed them. I'm not used to being able to put together a wish list like this.

slim fox Oct 21, 2019, 5:51 PM

#

virtual desktop or server space to run larger jobs on (my 8GB RAM laptop can't handle too much)
AWS probably is what you are looking for

wet mica Oct 21, 2019, 6:04 PM

#

@void anvil what do you mean by cloud credits?

#

ah ok

oblique belfry Oct 21, 2019, 11:28 PM

#

https://arxiv.org/abs/1802.04799
Just came across this. Somebody figured out how to compile ML models. Interesting...
I wonder if this (or strategies like this) will work.

arXiv.org

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

There is an increasing need to bring machine learning to a wide diversity of
hardware devices. Current frameworks rely on vendor-specific operator libraries
and optimize for a narrow range of...

quartz stream Oct 22, 2019, 7:04 AM

#

Hmm

#

Interesting Paper @oblique belfry

kindred flame Oct 22, 2019, 11:56 AM

#

is it possible to make a poker AI?

#

I have no clue about data science, just asked myself

rigid storm Oct 22, 2019, 12:12 PM

#

@kindred flame If by possible you mean it has been achieved by someone or some institution already, then yes

#

there are models that can play 100BB deep 6-max cashgames vs opponents and realize positive BB/100 results over large samples

kindred flame Oct 22, 2019, 12:12 PM

#

@rigid storm Its already achieved?

#

Lol

#

Are they using it?

#

In online poker

rigid storm Oct 22, 2019, 12:13 PM

#

It can't be implemented on sites since that's highly illegal

kindred flame Oct 22, 2019, 12:13 PM

#

Yea but i mean its possible right?

rigid storm Oct 22, 2019, 12:14 PM

#

im not saying there arent any bots out there, because there are - but they're way simpler and are often detected and banned from the sites

#

yes its possible, but illegal

kindred flame Oct 22, 2019, 12:14 PM

#

But a ai wouldnt be really detected or?

#

I mean the decisions of an ai arent like from a normal bot

rigid storm Oct 22, 2019, 12:16 PM

#

I don't know how sites like PokerStars measure weird activity, nor do i know how hard or easy it is to hide from their security

#

But bots would use more simple heuristics yes, so it would be abc poker

kindred flame Oct 22, 2019, 12:16 PM

#

@rigid storm btw how much experience do you have in ai?

rigid storm Oct 22, 2019, 12:17 PM

#

those usually get around 2bb/100 in MTTs over large samples, which is slightly winning

#

excluding variance of course

#

I have experience in some simple machine learning tasks such as classifying dementia looking at brain volumes of patients

#

stuff like that

kindred flame Oct 22, 2019, 12:18 PM

#

How long are you already in ai?

rigid storm Oct 22, 2019, 12:18 PM

#

Ehm i study cognitive science and AI at tilburg uni in the netherlands

#

3rd year now

#

I also play poker for a 'living' haha

kindred flame Oct 22, 2019, 4:01 PM

#

@rigid storm haha

oblique belfry Oct 22, 2019, 7:03 PM

#

In regards to the TVM paper I linked earlier...

https://tvm.ai/about

Interesting results with Pytorch.

About

TVM

small ore Oct 22, 2019, 11:42 PM

#

I get <matplotlib.axes._subplots.AxesSubplot at 0x16227e80> when I use any seaborn function and the plot never appears. Could someone tell me what I am doing wrong?

#

I did try plt.show(). No luck

small ore Oct 23, 2019, 12:16 AM

#

Ping me please

silent swan Oct 23, 2019, 1:38 AM

#

is this in a notebook?

small ore Oct 23, 2019, 2:49 AM

#

Ipython terminal

silent swan Oct 23, 2019, 3:29 AM

#

try %matplotlib inline

plain ice Oct 23, 2019, 7:30 AM

#

hey guys for data science or data mining more specifically is kaggle very useful as a portfolio?

lapis sequoia Oct 23, 2019, 10:50 AM

#

Anyone familiar with performance measures in terms of time within machine learning?

slim fox Oct 23, 2019, 10:52 AM

#

hi there. Got a quick-ish question. How to deal with variable shapes of images for classification using CNN in keras/tensorflow?

#

I understand that there are several ways, like resizing (which can lead to image distortion), global max/avg pooling, or simply adding a uniform background (like 0s) to fill them to match the shape of the biggest image, but I'm really not sure what is best and how to decide
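
To make the padding option concrete, here is a small NumPy sketch that zero-pads 2-D images to the largest height and width in the batch. The helper name and the top-left alignment are assumptions. It does not say which option is best, only what padding would look like.

```python
import numpy as np

def pad_to_common_shape(images, fill=0):
    """Pad 2-D arrays (top-left aligned) to the largest height and width."""
    height = max(img.shape[0] for img in images)
    width = max(img.shape[1] for img in images)
    batch = np.full((len(images), height, width), fill, dtype=images[0].dtype)
    for i, img in enumerate(images):
        batch[i, : img.shape[0], : img.shape[1]] = img
    return batch

batch = pad_to_common_shape([np.ones((2, 3)), np.ones((4, 2))])
```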

lapis sequoia Oct 23, 2019, 10:58 AM

#

Hello, I am looking for a book, article etc. to gaining real-word business insight case by case, can u suggest?

rigid storm Oct 23, 2019, 11:32 AM

#

@lapis sequoia Hey, you mean just the measuring of time to train or test? Or you mean ways to cut the time needed?

lapis sequoia Oct 23, 2019, 11:36 AM

#

Say you want to detect if a malicious message has been injected into a car's computer system (used for sensors etc.) - so a metric to evaluate the performance in terms of time to detect this.

rigid storm Oct 23, 2019, 11:37 AM

#

Well a really simple method is to just use the time module right?

#

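```python
import time
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test, y_test are assumed to exist already
forest = RandomForestClassifier(max_depth=16, n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
start = time.time()  # time only the prediction step
pred_forest = forest.predict(X_test)
end = time.time()
print("Run time: ", end - start)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))
```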

#

for example, this piece of code shows you the time it took to test a classifier

#

its import time btw

#

output

📎 unknown.png

lapis sequoia Oct 23, 2019, 11:40 AM

#

So simply measure the time it took to run the test?

rigid storm Oct 23, 2019, 11:40 AM

#

For it to predict all classes

#

and then get the accuracy that came with its guesses

lapis sequoia Oct 23, 2019, 11:45 AM

#

ye, i think it makes sense

#

So if I have additional classifiers, runtime of the models can be used as a metric

#

I'm just wondering if it is a valid metric - considering hardware can have an influence.

rigid storm Oct 23, 2019, 1:54 PM

#

Well i think you can use it as a 2nd grade metric. the accuracy, confusion matrix numbers and/or AUC/ROC are the most important metrics for how well a classifier performs ofc
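
On the hardware concern: a single wall-clock measurement is noisy, so one hedged option is to repeat the call and report the median, always on the same machine. The helper below is a generic sketch using only the standard library. Its name is made up, and `sum` stands in for a model's predict call.

```python
import statistics
import time

def median_runtime(func, *args, repeats=5):
    """Call func(*args) `repeats` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in practice this would be something like forest.predict, X_test
elapsed = median_runtime(sum, range(100_000))
```

Numbers from this are only comparable between classifiers timed on the same hardware under similar load.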

hot compass Oct 23, 2019, 2:35 PM

#

I need to find a place where I can find average latency,download, and upload speed of fibre optic, dsl, 3g, 4g, 5g

#

but like one site that compares values of each

#

anyone know of any sites that do that?

small ore Oct 23, 2019, 3:26 PM

#

@silent swan No. That wont work in terminal. That is only for notebook I think

silent swan Oct 23, 2019, 4:40 PM

#

it should work if the terminal supports graphics. If not, then, welp

hot compass Oct 23, 2019, 4:59 PM

#

What hardware and software components are required to create a wireless network?

exotic reef Oct 23, 2019, 10:01 PM

#

not really a data science question @hot compass ...

glad arch Oct 23, 2019, 10:52 PM

#

anyone familiar with unittest?

wicked mantle Oct 24, 2019, 6:21 AM

#

I made bs4 parser for site, and i want to upload it to google docs, how to do it? I know how to save it to 'csv', but dunno about google docs

brazen folio Oct 24, 2019, 10:54 AM

#

i have an assignment about cleaning data for data science, I have a large amount of data in csv format, any suggestion what i should do to clean it and reduce redundant data?

#

or any reference about massive data cleaning

#

?

chilly salmon Oct 24, 2019, 12:57 PM

#

So I have a CSV file with data.
I use pandas to import the data into a dataframe.
df = pd.read_csv('file.csv')
Works perfectly. However, it'll be missing headers.
The thing is, when I add headers in the CSV file or by using "names = ['Date', 'Name', 'Message']" (matching the amount of columns in the CSV file). It throws an error.

It attempts to import the CSV file 3 times. It always ends on the same line (line 14667 out of 14672).
First error = "Traceback (most recent call last): File "mydata.py", line 14, in <module> print(df) OSError: [WinError 87] The parameter is incorrect"

Second error = "Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'> OSError: [WinError 87] The parameter is incorrect"

The third time it doesn't provide any error message, it just stops at line 14667.
Does anyone have any ideas?

Using 2 column headers instead of 3 works fine btw, for some reason.
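
A hedged sketch of reading a header-less CSV with explicit column names, using an in-memory stand-in for `file.csv` (the rows are made up). Note that the traceback above points at `print(df)` rather than `read_csv`, so the error may come from writing to the Windows console and not from parsing. This sketch only covers the parsing side.

```python
import io
import pandas as pd

# Stand-in for file.csv: three columns, no header row (contents are made up).
raw = io.StringIO("2019-10-24,alice,hello\n2019-10-25,bob,hi there\n")

# header=None says the file has no header line; names supplies the labels.
df = pd.read_csv(raw, header=None, names=["Date", "Name", "Message"])
```

If parsing succeeds like this, printing a small slice such as `df.head()` or writing the frame to a file could help narrow down whether the console is the problem.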

glad arch Oct 24, 2019, 2:35 PM

#

What type of unit test can i do for a fourier transform function in python

upper eagle Oct 24, 2019, 4:36 PM

#

📎 unknown.png

#

What am I doing wrong here?

upper eagle Oct 24, 2019, 5:27 PM

#

Never mind, fixed it

#

Had to use, for x, i ... for j, k ...

#

then do, if i['col'] == k['col']

exotic reef Oct 24, 2019, 11:21 PM

#

yeah you can't compare series' like that. Also there are probably faster ways to find matching rows

#

Certainly using iterrows is ill-advised for this

#

(it's slow)

chilly shuttle Oct 25, 2019, 7:22 AM

#

looks like it could be done as a 2d numpy array product

exotic reef Oct 25, 2019, 7:36 AM

#

well if you're just looking for the matches you could do drop duplicates in pandas and look at the inverse or something

glad arch Oct 25, 2019, 9:13 AM

#

Hi guys, so I have this Signal class: https://paste.pydis.com/iyihugeqod.py
and Im running unit testing using this TestSignal class: https://paste.pydis.com/elilufozap.py
but when I run the test_comp function I get this error:

diff = Signal.compute_distance(self.signal,self.signal)
AttributeError: 'TestSignal' object has no attribute 'signal'

any idea why?
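
The pasted files are not visible here, so this is only a guess at a common cause: `self.signal` is normally created in a method named exactly `setUp`, and a misspelling such as `setup`, or creating it inside another test method, leaves the attribute missing. The `Signal` class below is a hypothetical stand-in, not the real one.

```python
import unittest

class Signal:
    """Hypothetical stand-in for the pasted Signal class."""
    def __init__(self, values):
        self.values = list(values)

    def compute_distance(self, other):
        return sum(abs(a - b) for a, b in zip(self.values, other.values))

class TestSignal(unittest.TestCase):
    def setUp(self):
        # Runs before every test method; the name must be exactly `setUp`.
        self.signal = Signal([1.0, 2.0, 3.0])

    def test_comp(self):
        diff = Signal.compute_distance(self.signal, self.signal)
        self.assertEqual(diff, 0)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSignal)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```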

wind linden Oct 25, 2019, 12:14 PM

#

Hey does anybody here know how to successfully import sklearn.linear_model to pyinstaller? Hiddenimport keeps saying "sklearn not found"

toxic spindle Oct 25, 2019, 5:22 PM

#

Hello. I am trying to read a number from screen and convert it into a string that I can use. I take a screenshot with PIL ImageGrab. This is how the screenshot is.

📎 image.png

#

I'm trying to use pytesseract to convert this image into a string, but it seems to be unable to output the correct one.

#

I've tried making the RGB image into a black/white, as maybe that would get the pytesseract to work better, but no luck. Is there something I can do with pytesseract to tune it to work better with my images, or is there a way I can filter my images so they can be recognized better?

#

Thanks! :)

#

📎 unknown.png

#

📎 unknown.png

#

It works with some images, but not all.

small ore Oct 25, 2019, 5:29 PM

#

Just from the looks of it, the digits in 290 and 1423 are so close to each other that probably there is no pixel separating two digits at least at one point. Whereas 1514 has clear spacing between digits

toxic spindle Oct 25, 2019, 5:31 PM

#

I mean, that's what numbers recognizers are supposed to be able to do right?

#

even while so close apart

small ore Oct 25, 2019, 5:33 PM

#

Can't comment on the intent behind them coding in a particular way and also I do not know anything about the subject. I just made an observation

toxic spindle Oct 25, 2019, 5:35 PM

#

Oh, well from more observations, it seems to handle them stretched much better than not stretched. Good observation, thanks a lot!

#

Now that you mention it, can't believe I didn't notice it 😄

small ore Oct 25, 2019, 5:38 PM

#

Maybe having to recognize numbers close apart may involve it running the same recognition algorithm for every column of pixels added (in a loop) and when it matches some digit satisfactorily, remove the columns from the buffer and carry on with new columns. I am of course shooting in the dark.

toxic spindle Oct 25, 2019, 5:40 PM

#

well, in the example of 1423, it seems to recognize "1" but not 423. That might be related to the fact that 423 is seen as one number and therefore not recognized.

#

It either recognizes them very well or recognizes nothing., So it must be a result of it seeing one character instead of two/three characters

small ore Oct 25, 2019, 5:41 PM

#

4 and 2 seem to touch each other and 3 seems to have a spacing pixel in a different column at the top and different in the bottom

silent swan Oct 25, 2019, 6:19 PM

#

semi-related fact is that deep learning methods for image recognition actually have a more or less built in efficiency for recognizing multiple digits at once

latent oyster Oct 26, 2019, 1:04 PM

#

Hey all! I've done some searching online and wanted to supplement with some answers and/or suggestions from here: what are some data science skills (e.g., data visualization) that individual projects can showcase?

soft siren Oct 26, 2019, 5:33 PM

#

Data cleaning (outlier removal, missing data imputation), exploratory data analysis, model building, model validation.

lapis sequoia Oct 26, 2019, 8:39 PM

#

Hello, how can I determine a correlation threshold? Is there any technique to determine it, for example if I use 95%? Or does it depend on the case?

soft siren Oct 26, 2019, 10:36 PM

#

What do you mean by correlation threshold?

lapis sequoia Oct 27, 2019, 12:12 AM

#

So I have a machine learning question with regards to supervised classifiers. If I have this dataset, with a bunch of messages comprised of timestamp, id, actual data values etc., and the features are computed based on message timestamps. How do I know the amount of messages needed to compute meaningful features? Say for instance the features are computed within a message window of 10 or 100 milliseconds.

lapis sequoia Oct 27, 2019, 11:44 AM

#

what does the time stamp have to do with the message

glad arch Oct 27, 2019, 12:06 PM

#

hi, I'm trying to re-discretise a signal using fft, any idea how?

jovial bay Oct 27, 2019, 7:35 PM

#

@dry sage invert the img. when the text is black onto a white bg it works better. use IMG = 255 - IMG

#

also clear it of any noise and rotate it

#

these things better prepare the img
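
A tiny NumPy sketch of the inversion step (`255 - IMG`) plus a plain black/white threshold, assuming an 8-bit grayscale array. The function name and the threshold of 128 are assumptions. Noise removal and rotation are left out.

```python
import numpy as np

def prepare_for_ocr(gray, threshold=128):
    """Invert an 8-bit grayscale image, then binarise it to 0 or 255."""
    inverted = 255 - np.asarray(gray, dtype=np.uint8)
    return np.where(inverted >= threshold, 255, 0).astype(np.uint8)

# Light text (200, 255) on a dark background (0, 30) becomes dark on white.
sample = np.array([[0, 200], [255, 30]], dtype=np.uint8)
prepared = prepare_for_ocr(sample)
```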

glad arch Oct 27, 2019, 9:57 PM

#

is there a built in function for calculating mutual information?

bitter spire Oct 28, 2019, 4:34 AM

#

What would be the simplest way to plot this chart https://fivethirtyeight.com/features/every-nba-teams-chance-of-winning-in-every-minute-across-every-game/

FiveThirtyEight

Every NBA Team’s Chance Of Winning In Every Minute Across Every Game

Can you summarize the NBA season in one chart? With 794 games, more than 152,000 possessions and some 372,000 plays, probably not, but we’ve given it a shot. Wh…

#

im trying to use plotly but it wont let me use my first row as my x axis

silent swan Oct 28, 2019, 5:22 AM

#

the plot or the interactive chart

rotund siren Oct 29, 2019, 1:23 AM

#

Using ipython notebook, running the notebook on Mac is no problem, but when I switch to my windows machine im getting this: Unable to allocate array with shape (50, 1216689) and data type float64

silent swan Oct 29, 2019, 4:58 AM

#

that's a very big array

fringe cove Oct 29, 2019, 1:42 PM

#

hello can somebody tells me why csvlook gives me a different line column values from the cat command of the csv ? i should get a column filled with float and not dates 😮

📎 unknown.png

#

no idea why i have this date in the column. i opened it with Numbers and the file is correct without the date

lyric canopy Oct 29, 2019, 1:52 PM

#

It's trying to infer the type from the values and making a mistake

#

It sees 1.5 and apparently then concludes it's a common date format

#

The documentation probably has something about disabling inference and/or providing explicit types

#

I've never used it myself, though

#

It's just for output, right?

#

Try the -I flag, that should disable the type inference completely and just display it as-is, @fringe cove

#

See https://csvkit.readthedocs.io/en/1.0.2/scripts/csvlook.html

fringe cove Oct 29, 2019, 2:05 PM

#

it works thanks for the great insights ! i had no idea this could happen

#

#learneveryday

#

and do u know how i could limit the numbers after the floating point like 1.666666 didnt see any option for that in -h

lyric canopy Oct 29, 2019, 2:27 PM

#

I have no idea, I have never used that tool

glad arch Oct 29, 2019, 5:00 PM

#

Im trying to import obspy

#

but it doesn't work

#

from obspy import correlate_template

#

but it shows red line underneath the whole thing

olive willow Oct 30, 2019, 1:54 PM

#

guys what can I use to predict numerical values with not a load of data?

#

cuz I'm trying to predict the temp of my home but dont have loads of data or dont even know where to start

#

with the predicting

fringe cove Oct 30, 2019, 2:01 PM

#

if u have a connected heater and u got the time of activation of it 😂

olive willow Oct 30, 2019, 3:17 PM

#

Hahahhaha no I've a rpi and get heat data from there using a sensor

woeful jungle Oct 30, 2019, 3:52 PM

#

Hello, I am just looking for a sanity check, if I am doing a multiple linear regression, I am supposed to trim the model until there are only significant predictors left correct?

hardy lodge Oct 30, 2019, 7:25 PM

#

How do you guys handle writing text files quickly?

#

For work I am using regex to find certain data, storing it in variables and then writing it in a specific format in a text file
we generally do something like
while True:
    try:
        with open(yada yada):
            actual writing
        break
    except OSError:
        continue

#

Is this a bad way to go about it? Me and the other python guy here are self taught and we have 0 guidance lol

mental umbra Oct 30, 2019, 7:29 PM

#

what do you need the while True loop for?

hardy lodge Oct 30, 2019, 7:29 PM

#

I guess it's just so it loops if there is an error

#

It would only break out if no error

mental umbra Oct 30, 2019, 7:30 PM

#

hmm interesting. I can't imagine there could be many errors if you've successfully opened the file in the first place, and if there is there's probably some serious problem you should deal with

#

you'd also end up restarting your whole write process

hardy lodge Oct 30, 2019, 7:34 PM

#

Yeah we probably could cut the while true out
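
For comparison, a sketch of the same job without the retry loop: collect the regex matches, write them once, and let a failure surface instead of looping. The input text, the pattern and the output path are all made up for illustration.

```python
import re
import tempfile
from pathlib import Path

text = "temp=21.5 id=7\ntemp=22.0 id=8\n"   # made-up input
pairs = re.findall(r"temp=([\d.]+) id=(\d+)", text)

out_path = Path(tempfile.gettempdir()) / "report_example.txt"
try:
    with open(out_path, "w", encoding="utf-8") as handle:
        for temp, ident in pairs:
            handle.write(f"{ident}\t{temp}\n")
except OSError as exc:
    # Surface the failure once instead of retrying forever.
    raise SystemExit(f"could not write {out_path}: {exc}")
```

The `with` block closes the file even if a write fails, so there is nothing left for a loop to clean up.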

wheat plaza Oct 30, 2019, 7:52 PM

#

good evening guys, if any of you have experience with tesseract [and training it] please let me know :)
i would like to create my own "language" data by training tesseract over cpu-z windows from screenshots (i can provide examples if necessary) and would like to know if that is a feasible and/or good way to go to improve my ocr detection accuracy

lapis sequoia Oct 31, 2019, 5:52 AM

#

why do you want to use tesseract for ocr

#

what cases are you dealing with..

lusty arrow Oct 31, 2019, 9:13 AM

#

@hardy lodge i like your Zeta Gundam Char Aznable avatar

#

Quattro

hardy lodge Oct 31, 2019, 10:18 AM

#

Lol yeah what you mean? That's not Char, it's clearly the new guy Quattro lol

lapis sequoia Oct 31, 2019, 5:31 PM

#

I have to develop a predictive model for an internship interview

#

Never done ML before - any resources to get me started?

#

Unfortunately, I only have 1 week to do this

lapis sequoia Oct 31, 2019, 6:18 PM

#

can anyone explain what this means

📎 Screenshot_from_2019-10-31_19-18-21_-_1.png

#

if I need to conclude anything from this.... how?

#

I know what lift means and what support means; but I don't know how lift vs support works

rare grove Oct 31, 2019, 6:51 PM

#

I have a pandas DataFrame and I want to group all of the Serieses in it based on values in one column. Specifically, I'm working with event logs with IP addresses, and I'd like to get a view where I can loop over the IPs and examine all of their events.

#

I'm pretty sure this is basic, I just don't know the data science words for it

rare grove Oct 31, 2019, 7:34 PM

#

It seems like everyone pushes the groupby method but that squashes the rows - for my use case, I need to reorient the data and pickle it for another process to pick up and analyze.

kindred flame Oct 31, 2019, 7:47 PM

#

how much data do i need to predict hotel bookings?

versed axle Oct 31, 2019, 7:58 PM

#

hello

wheat plaza Oct 31, 2019, 10:27 PM

#

Tron what else would you recommend to extract text from screenshots?
Preprocessing the screenshot and then running tesseract over it seems to run pretty fine, but im open to other technology suggestions

#

ideally i would use something where i can input the used font and the size and then it would detect all the text with that, but since i havent found anything like this using tesseract seems like the easiest way, and i think i have to train it to get rid of the weird stuff it sometimes detects

#

this would be an example of a typical input image, im trying to extract the cpu-z data (maybe benchmark data later)

📎 example_screenshot.png

wheat plaza Oct 31, 2019, 11:18 PM

#

if anybody has ideas / tips feel free to ping or pm me, everything appreciated

covert torrent Oct 31, 2019, 11:19 PM

#

Hello, can someone recommend me some good articles or documentaries about data in general

#

why is it so valuable in modern society that it surpassed oil in value.

#

thank you

#

I'm trying to understand this

exotic reef Nov 1, 2019, 12:56 AM

#

@manic axle Very much 'how long is a piece of string' question. What do you want to predict/ What kind of input data do you think you'll have? What level of accuracy/precision/recall is sufficient for your task?

#

@rare grove groupby should not squash the rows, do you have example code of how you are using it and what you expect the output format to be?

#

groupby will return an iterable

rare grove Nov 1, 2019, 12:58 AM

#

Wait really?! I read the docs and apparently I'm an idiot

exotic reef Nov 1, 2019, 12:58 AM

#

Well i might also be an idiot and it does stuff i don't know, we shall see 😛

#

groups = result.groupby('shipper')
for s, subset in groups:
    pass  # do stuff

This is an extract from code i am currently working on

#

the 's' will be the shippers, the 'subset' will be the dataframe corresponding to that group key
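A minimal sketch of the same loop applied to the IP use case discussed above. The DataFrame and its column names (`ip`, `event`, `bytes`) are made up for illustration. It shows that each group is a full DataFrame with its original rows, not an aggregate.

```python
import pandas as pd

# Hypothetical event log; column names and values are illustrative only.
events = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"],
    "event": ["login", "login", "upload", "scan", "logout"],
    "bytes": [120, 80, 4096, 0, 60],
})

per_ip = {}
for ip, subset in events.groupby("ip"):
    # `ip` is the group key; `subset` holds every original row for that key.
    per_ip[ip] = subset

print(per_ip["10.0.0.1"])
```

Nothing is squashed until an aggregation such as `mean()` is called on the grouped object.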

rare grove Nov 1, 2019, 12:59 AM

#

❗ ❗ ❗ ❗ ❗

#

That is amazing exactly what I need

exotic reef Nov 1, 2019, 1:00 AM

#

it's wicked fast too, and i just learned why. it pre-sorts and then does binary search stuff

#

sorta...

rare grove Nov 1, 2019, 1:00 AM

#

Several tutorials used groupby and mean() or other measurements and so I thought it just had combined/aggregate values, but an iterable of Series is exactly what I need

exotic reef Nov 1, 2019, 1:00 AM

#

ah yes mean will indeed squash the rows because, well, it's an average

rare grove Nov 1, 2019, 1:00 AM

#

binary search trees are fun

exotic reef Nov 1, 2019, 1:00 AM

#

totes

#

it provides a good connection to sql-thinking too

rare grove Nov 1, 2019, 1:01 AM

#

ah, yeah so I want to say Hey, what did 127.0.0.1 do on my network today? and give that IP a rating based on a set of all the logs it generated

exotic reef Nov 1, 2019, 1:01 AM

#

so you need NLP too? 😛

rare grove Nov 1, 2019, 1:02 AM

#

NLP?

exotic reef Nov 1, 2019, 1:02 AM

#

natural language processing

#

i mean, for this you can use basic rule based stuff to get the IP from that text

rare grove Nov 1, 2019, 1:02 AM

#

Oh, haha no I can say it in a computery way, hopefully via a web interface with a table of interesting targets to look at

#

but is that the best way? always select all the rows by the column value? there's no idea like temporary tables for pandas?

exotic reef Nov 1, 2019, 1:04 AM

#

ah so it depends on the frequency of each operation which will be optimal

#

for example are you periodically looping over a long list of ips, or only ever querying one at a time

#

it would be interesting to benchmark this actually...

#

the most straightforward lookup way is

subset = df[df['ip_col'] == ip_val]

#

but if you want to loop and collect over a large number of ips then groupby will be faster
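A small sketch of the two lookup styles side by side, on a made-up DataFrame (`ip_col` matches the name used above; `status` is an invented column). It only shows that both return the same rows; which is faster depends on the data and the access pattern, so it is worth timing on real logs.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "ip_col": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "status": [200, 404, 500],
})

# One-off lookup: a boolean mask compares the whole column for each query.
one = df[df["ip_col"] == "10.0.0.1"]

# Repeated lookups: build the grouping once, then fetch groups by key.
groups = df.groupby("ip_col")
same = groups.get_group("10.0.0.1")
sizes = groups.size()  # number of rows per IP

print(one.equals(same))
print(sizes)
```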

rare grove Nov 1, 2019, 1:07 AM

#

I guess I could look at set metadata with groupby, then if the set shows signs of interest (it's large, or has large values, etc) pull it out and pickle it (500-1k at a time) for a second process to pick up

exotic reef Nov 1, 2019, 1:08 AM

#

oh right yeah if you are methodically processing them all then groupby is the way to go

#

however you will need temporary arrays and things rather than mutating the groupby object

#

i think

rare grove Nov 1, 2019, 1:10 AM

#

damn, I think I really need a Kubernetes cluster for this project

exotic reef Nov 1, 2019, 1:10 AM

#

really? how much data do you have?

rare grove Nov 1, 2019, 1:10 AM

#

this is all sounding like I'm going to need 12 cores running parallel tasks to keep up with the sessions per minute I hope to ingest

exotic reef Nov 1, 2019, 1:10 AM

#

how many rows?

#

You can also use Dask if you really need distributed computing, kubernetes would be overkill for this i think

rare grove Nov 1, 2019, 1:11 AM

#

Infinite rows, since this would run over time - but right now I store a few hundred million log rows

exotic reef Nov 1, 2019, 1:11 AM

#

Well okay it's never infinite rows, or if it is, you might be solving the wrong problem 😛

rare grove Nov 1, 2019, 1:11 AM

#

One thing I do have access to is an elasticsearch cluster with all this data in it but I have no idea how to leverage that

exotic reef Nov 1, 2019, 1:11 AM

#

Check out Dask, it's dope.

#

Basically distributed pandas

rare grove Nov 1, 2019, 1:12 AM

#

I shall do so presently

#

I'm perpetually amazed at how many words I don't understand in these docs 😄

#

this library looks amazing, I think this is what I'll spend tomorrow morning diving into - thank you!

exotic reef Nov 1, 2019, 1:17 AM

#

👍

loud kindle Nov 1, 2019, 2:18 PM

#

im trying to get pandas to plot the index as my x axis, but i don't know how to pass it, since df.index returns a RangeIndex type which "is not hashable". how can i get the index so it is hashable?

rare grove Nov 1, 2019, 2:21 PM

#

df.index.tolist() maybe?

loud kindle Nov 1, 2019, 3:06 PM

#

list is also unhashable 🙈

#

i just added my own column with index values now :X

rare grove Nov 1, 2019, 4:00 PM

#

Oof

silent swan Nov 1, 2019, 6:26 PM

#

what command are you using to plot

deft harbor Nov 1, 2019, 11:39 PM

#

@exotic reef is this marketing blurb from dask's website right?

#

But you don't need a massive cluster to get started. Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.

#

Is it really worth putting it on my desktop?

exotic reef Nov 1, 2019, 11:44 PM

#

Depends what you mean by 'worth'

lapis sequoia Nov 1, 2019, 11:47 PM

#

what's the best way to scrape data from this website and put it in a csv file?
https://basketball.realgm.com/international/transactions/2020

rare grove Nov 2, 2019, 12:15 AM

#

I got things mostly ported over to dask today but didn't get quite far enough to be able to tell if there was any performance gain

exotic reef Nov 2, 2019, 1:19 AM

#

To state the obvious, make sure you are using dask df functions wherever possible

rare grove Nov 2, 2019, 1:30 AM

#

Yes and no - it wants me to use pandas objects in some cases, like appending series to dataframe (the docstring for dask.dataframe.Series literally says not to use it lol)

#

also I will never stop being annoyed at this convention of abbreviating already-short-enough names to two letters 😛

exotic reef Nov 2, 2019, 4:17 AM

#

Ah interesting

exotic reef Nov 2, 2019, 7:48 AM

#

Just learned about this too, which is neat
https://github.com/jmcarpenter2/swifter

GitHub

jmcarpenter2/swifter

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner - jmcarpenter2/swifter

naive jay Nov 2, 2019, 9:29 PM

#

idk if this is the right channel but anyone have any experience with hash functions such as md5 and sha and generated hashes?

devout ridge Nov 2, 2019, 9:32 PM

#

#cybersecurity

lapis sequoia Nov 2, 2019, 11:35 PM

#

Could someone tell me how I could go about providing high level summary stats on a dataset?

#

Things that I should aim to look at?

soft siren Nov 3, 2019, 12:05 AM

#

Depends on the dataset @lapis sequoia . Things like mean, median, and variance of your variables (columns). Number of missing data points, outliers. If you have obvious groups, looking at summary statistics and counts within groups is usually good too
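A rough sketch of those checks in pandas, on an invented two-column DataFrame (`group` and `value` are placeholder names). `df.describe()` gives a similar overview in one call.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, np.nan, 10.0, 14.0],
})

overall = df["value"].agg(["mean", "median", "var"])   # NaN is skipped
missing = df.isna().sum()                               # missing count per column
per_group = df.groupby("group")["value"].agg(["count", "mean"])

print(overall)
print(missing)
print(per_group)
```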

hexed rampart Nov 3, 2019, 7:59 PM

#

I am getting the error: "TypeError: parse() takes 1 positional argument but 4 were given" for the following code:

from datetime import datetime
import pandas as pd
from pandas import read_csv,read_table

def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')


datasetInput = read_csv(r"C:\MLProject\08_LMR_data\basin_mean_forcing\daymet\08\07375000_lump_cida_forcing_leap.txt", sep=" ", parse_dates=[['Year','Mnth','Day','Hr']], index_col=0, date_parser=parse)

i researched it and could not find any fix. I tried doing the date_parser as a lambda function but that did not work either. Any suggestions?

soft siren Nov 4, 2019, 12:43 AM

#

@hexed rampart Instead of [["Year", "Mnth", "Day", "Hr"]]
Try
["Year", "Mnth", "Day", "Hr"]
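How `parse_dates` and `date_parser` interact has changed between pandas versions, so here is a sketch of an alternative that avoids both (it is not what was suggested above): read the four columns as plain integers, then assemble them with `pd.to_datetime`, which accepts a DataFrame whose columns are named `year`, `month`, `day`, `hour`. The sample rows and the `prcp` column are invented; the real file layout may differ.

```python
import io
import pandas as pd

# Stand-in for the real text file; contents are made up.
raw = io.StringIO(
    "Year Mnth Day Hr prcp\n"
    "1980 01 01 12 0.0\n"
    "1980 01 02 12 3.5\n"
)
df = pd.read_csv(raw, sep=" ")

# Rename to the names pd.to_datetime expects when assembling from columns.
parts = df[["Year", "Mnth", "Day", "Hr"]].rename(
    columns={"Year": "year", "Mnth": "month", "Day": "day", "Hr": "hour"}
)
df.index = pd.DatetimeIndex(pd.to_datetime(parts))
df = df.drop(columns=["Year", "Mnth", "Day", "Hr"])

print(df)
```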

lapis frost Nov 4, 2019, 5:18 AM

#

can anyone help with this error?

#

File "<ipython-input-68-e6aa6e95856f>", line 10
population_record = census.assign(trend = census.2010 + ' ' + census.2011)
^
SyntaxError: invalid syntax

#

i am trying to make a new column with the values of other columns inside.

lapis sequoia Nov 4, 2019, 5:32 AM

#

are you using assign for a pandas df?

lapis frost Nov 4, 2019, 5:32 AM

#

yes. it is telling me i have invalid syntax on my column name which is 2010

soft siren Nov 4, 2019, 5:33 AM

#

@lapis frost you can use a lambda function:
population_record = census.assign(trend = lambda x : x["2010"] + " " + x["2011"])

lapis frost Nov 4, 2019, 5:33 AM

#

i am trying to create a new column with the values of 2010 and 2011 inside

#

what is lambda. i have never seen that before.

soft siren Nov 4, 2019, 5:34 AM

#

Using the bracket notation may work too.
census["2010"] vs census.2010

lapis sequoia Nov 4, 2019, 5:34 AM

#

when you're using assign, you can't perform an operation in what you're assigning it with

#

you need to compute that before you assign it to 'trend'

soft siren Nov 4, 2019, 5:35 AM

#

The lambda function is an anonymous function that would allow you to do this.

lapis frost Nov 4, 2019, 5:35 AM

#

this is the lesson i am following:

five_years = five_years.assign(fullname = five_years.namefirst + ' ' + five_years.namelast)
five_years.head()

lapis sequoia Nov 4, 2019, 5:35 AM

#

which is what he's showing you here

lapis frost Nov 4, 2019, 5:35 AM

#

and it works in the lesson.

sullen wing Nov 4, 2019, 5:35 AM

#

in python specifically, lambda is a way to create anonymous functions

#

lambda and functions are mostly the same in python, with lambda restricted to a single expression
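A short sketch of that equivalence. The names `join_lambda` and `join_def` and the two-row DataFrame are invented for illustration; the column names follow the lesson snippet above.

```python
import pandas as pd

# A lambda is an unnamed function whose body is one expression.
join_lambda = lambda first, last: first + " " + last

# The same thing written as a regular function.
def join_def(first, last):
    return first + " " + last

people = pd.DataFrame({
    "namefirst": ["Henry", "Geoff"],
    "namelast": ["Blanco", "Blum"],
})

# assign() calls the lambda with the DataFrame and stores the result.
with_fullname = people.assign(
    fullname=lambda x: x["namefirst"] + " " + x["namelast"]
)

print(join_lambda("Sam", "Demel") == join_def("Sam", "Demel"))
print(with_fullname)
```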

lapis frost Nov 4, 2019, 5:35 AM

#

      namefirst    namelast  yearid   salary           fullname
21454     Henry      Blanco    2011  1000000       Henry Blanco
21455    Willie  Bloomquist    2011   900000  Willie Bloomquist
21456     Geoff        Blum    2011  1350000         Geoff Blum
21457   Russell     Branyan    2011  1000000    Russell Branyan
21458       Sam       Demel    2011   417000          Sam Demel

soft siren Nov 4, 2019, 5:36 AM

#

In this case I think your issue is that accessing columns with the "." can be a bit unreliable for some column names, things like numerics and column names with spaces

#

That’s why bracket notation may be better here

lapis frost Nov 4, 2019, 5:36 AM

#

ok

sullen wing Nov 4, 2019, 5:37 AM

#

Unlike js, dot notation accesses an attribute by default; dot notation and bracket notation trigger 2 different dunders

#

So take note about that as well
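A toy illustration of the two dunders, using a made-up `Table` class rather than pandas itself (pandas implements both, which is why `df.col` and `df["col"]` usually agree). It also shows why `census.2010` fails before any dunder runs: it is not valid Python syntax.

```python
class Table:
    """Hypothetical stand-in for a DataFrame; not how pandas is implemented."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, key):
        # Called for bracket access: table["2010"]
        return self._columns[key]

    def __getattr__(self, name):
        # Called for dot access only when normal attribute lookup fails.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


census = Table({"state": ["Alabama", "Alaska"], "2010": [100, 200]})
by_bracket = census["2010"]   # goes through __getitem__
by_dot = census.state         # goes through __getattr__

try:
    compile("census.2010", "<demo>", "eval")
    dot_with_digits_ok = True
except SyntaxError:
    # ".2010" is read as a number, so the expression never parses.
    dot_with_digits_ok = False

print(by_bracket, by_dot, dot_with_digits_ok)
```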

lapis frost Nov 4, 2019, 5:37 AM

#

UFuncTypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/ops/__init__.py in na_op(x, y)
967 try:
--> 968 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
969 except TypeError:

6 frames
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')

During handling of the above exception, another exception occurred:

UFuncTypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
462 if mask.any():
463 with np.errstate(all="ignore"):
--> 464 result[mask] = op(xrav[mask], y)
465
466 result, changed = maybe_upcast_putmask(result, ~mask, np.nan)

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')

#

i am in a prep course so i do not know what any of this means

soft siren Nov 4, 2019, 5:39 AM

#

That error is telling you you’re trying to add different types. What is in census["2010"]? I assume it’s a float

lapis sequoia Nov 4, 2019, 5:39 AM

#

can you do dtypes on your columns

lapis frost Nov 4, 2019, 5:40 AM

#

2010 and 2011 are Columns in the DataFrame

#

Index(['state', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
'region', 'division'],
dtype='object')

soft siren Nov 4, 2019, 5:40 AM

#

You can find out what type of columns they are by doing census.dtypes

lapis frost Nov 4, 2019, 5:41 AM

#

int64

soft siren Nov 4, 2019, 5:41 AM

#

Yea so when you try to add the int with " " you’ll get an error. The question is what you’re trying to calculate.

lapis sequoia Nov 4, 2019, 5:42 AM

#

try .astype(str)

lapis frost Nov 4, 2019, 5:42 AM

#

ok, so i need to create a new column with the populations listed in 2010-2016 in that new column

#

again, I am in a prep course. i do not know what "try .astype" means because i haven't seen that yet.

#

i assume you mean write it as census.astype(string)?

lapis sequoia Nov 4, 2019, 5:44 AM

#

census['2010'].astype(str)

soft siren Nov 4, 2019, 5:44 AM

#

this

lapis frost Nov 4, 2019, 5:45 AM

#

k, i'll go try that real quick

soft siren Nov 4, 2019, 5:45 AM

#

Same with 2011

lapis sequoia Nov 4, 2019, 5:45 AM

#

you want to add a space between your 2010 and the other one.. by using ' ' ... ' ' is a string

#

so you need to type convert your dataframe columns to string first..
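Putting the pieces of this thread together as a sketch: bracket access for the numeric column names, `astype(str)` to turn the int64 columns into strings, then `assign`. The population numbers are made up.

```python
import pandas as pd

# Illustrative values only; the real census table has more columns.
census = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "2010": [100, 200],
    "2011": [110, 190],
})

population_record = census.assign(
    # int64 + " " raises a TypeError, so convert both columns to str first.
    trend=census["2010"].astype(str) + " " + census["2011"].astype(str)
)

print(population_record)
```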

lapis frost Nov 4, 2019, 5:46 AM

#

ok, that worked!

#

thank you! they might look at it a little funny since I haven't been taught that yet, but it lets me move on so...

lapis sequoia Nov 4, 2019, 5:47 AM

#

as long as you understood what you're doing..

soft siren Nov 4, 2019, 5:48 AM

#

this

lapis frost Nov 4, 2019, 5:48 AM

#

hahahahahahaha, yah. i don't. i just follow the example in the book. when it doesn't work, i have no idea why.

#

it's learning a new language. when i say words in spanish wrong i don't know why they are wrong. they just are. we'll see how i grow over the next couple of months.

lost sinew Nov 4, 2019, 9:07 AM

#

i did

#

df = df[df['tweet'].str.contains('helo')]

#

and the index of the df is messed up (shows only the one that is selected)

#

is there a way to reset the index to normal (1,2,3,4,5)

#

i can save it to another csv by index=False, got it

#

unless there is a way to do it with a csv

#

but its fine 🙂

lapis sequoia Nov 4, 2019, 10:12 AM

#

wut

#

the index isn't messed up, you've filtered the df based on your selection, so it's only showing those indices

#

you should name your result df something else, so your code looks cleaner

#

if you want to drop the index from the result, do

#

result_df = result_df.reset_index(drop=True)

#

else try it without the drop = True
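A quick sketch of both variants on an invented four-row DataFrame, showing what the index looks like before and after.

```python
import pandas as pd

# Made-up tweets.
df = pd.DataFrame({"tweet": ["helo world", "bye", "oh helo", "nope"]})

result_df = df[df["tweet"].str.contains("helo")]
print(result_df.index.tolist())        # original row labels survive the filter

renumbered = result_df.reset_index(drop=True)   # fresh 0..n-1 index
kept = result_df.reset_index()                  # old labels kept in an "index" column

print(renumbered)
print(kept)
```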

pale thunder Nov 4, 2019, 3:35 PM

#

how would one Run-length encode each row of 2D numpy array?

river plume Nov 4, 2019, 4:06 PM

#

Hello guys, I wanted to know the difference between LDA and PCA on a tokenized dataset in NLP

#

I know that PCA is unsupervised and LDA is supervised

#

How to decide when to use LDA and when to use PCA ? I'm talking specifically about NLP when we have 5000+ columns of words after tokenizing data

polar acorn Nov 4, 2019, 10:29 PM

#

@river plume
Keeping in mind that I just looked up LDA and don't do much NLP here are my thoughts: LDA would encode more of the context, while PCA would encode more of the variation. LDA needs you to set a number of topics before running and would need to be rerun if you want to change that number. That number of topics would be the length of your encoding vector per document. PCA is run once and you can after that decide how many components you want to include, the length of your encoding vector is equal to the number of components you want to include. I think LDA would be more useful unless you have absolutely no idea about your corpus and the amount or range of topics it contains.

#

But do take that with a pinch of salt 🙂 I just now read about LDA on wikipedia.

acoustic mural Nov 4, 2019, 10:31 PM

#

let's say i have access to decades worth of news articles, separated by language and country. 250 million articles is a safe low estimate because I know I have 30 million in English alone. also let's say i were to derive and maintain a set of statistically significant bi-, tri-, and four-grams for each country/language combo, calculated from the beginning of time (as far as i'm concerned) until say... a week ago.

what's a decent rule of thumb for how many times an n-gram should appear in the last week that hasn't crossed that threshold ever before prior to this week, in order to consider it relevant to the current news cycle?

i was thinking 10 would be a good place to start and i could experiment for each language/country, as it's implemented, based on the results (because the number of articles available per combo varies in magnitude significantly) but it can't hurt to ask if anyone's been here before.

#

@river plume what are you looking to do with the data? and, by 5000+ columns do you mean you've one-hot encoded your words, or you have that many features?

lapis sequoia Nov 5, 2019, 1:37 AM

#

Could someone help me figure out how I could split my dataset into train and test?

devout ridge Nov 5, 2019, 1:44 AM

#

just pick some

#

the test dataset should be stuff that the NN has never seen before

lapis sequoia Nov 5, 2019, 2:01 AM

#

@devout ridge Also, what does it mean to describe model results?

#

I'm working on an assignment for an interview - and I'm not entirely familiar with ML

river plume Nov 5, 2019, 2:06 AM

#

@acoustic mural I have cleaned the texts, applied Porter Stemmer and then vectorized the words. That's how i got that many columns. Something like one hot encoding

acoustic mural Nov 5, 2019, 2:06 AM

#

any reason for porter over snowball?

river plume Nov 5, 2019, 2:07 AM

#

I have no idea about snowball

acoustic mural Nov 5, 2019, 2:07 AM

#

when you say vectorized the words, what do you mean exactly? because when i say that, i mean turning them into dense vectors, not sparse ones

#

also snowball is essentially porter version 2

river plume Nov 5, 2019, 2:08 AM

#

Yeah my vector was a sparse one

#

I'll look up snowball

acoustic mural Nov 5, 2019, 2:08 AM

#

very few, if any, reasons to use porter over snowball nowadays

river plume Nov 5, 2019, 2:08 AM

#

@polar acorn thanks, I'll compare the performance of both and see which one is better

acoustic mural Nov 5, 2019, 2:09 AM

#

what are you trying to do once you have your vocabulary?

#

or, one-hot encoded text sorry

river plume Nov 5, 2019, 2:09 AM

#

Something similar to sentiment analysis

acoustic mural Nov 5, 2019, 2:09 AM

#

via what method? or are you not there yet

river plume Nov 5, 2019, 2:09 AM

#

It's a supervised dataset

#

There are text reviews and there's a target variable containing the binary value whether it is a positive review or a negative review, so i am making a model that predicts if it is a positive review or a negative one

acoustic mural Nov 5, 2019, 2:11 AM

#

🙂 the imdb set?

#

once your text is encoded, the method you're going to use matters for what you do with the vectors next

#

if you want to feed it into a neural network with an embedding layer, perhaps, you're going to want to replace each one-hot vector with the index of its active value

#

then feed that dense vector into an embedding layer with a width equal to the width of your one-hot
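A NumPy-only sketch of that conversion, with a toy vocabulary of 5 words and a random matrix standing in for a trained embedding layer (in a real framework the matrix would be learned). It also shows why the index form is enough: looking up row `i` gives the same result as multiplying the one-hot vector by the matrix.

```python
import numpy as np

vocab_size = 5       # width of the one-hot vectors (toy value)
embedding_dim = 2    # size of each dense word vector (toy value)

# Three one-hot encoded words: ids 3, 0 and 4.
one_hot = np.eye(vocab_size, dtype=int)[[3, 0, 4]]

# Replace each one-hot row with the index of its active value.
token_ids = one_hot.argmax(axis=1)

# Stand-in for an embedding layer: one row of weights per vocabulary entry.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embedding_dim))

vectors = embedding[token_ids]   # row lookup, shape (3, embedding_dim)

print(token_ids)
print(np.allclose(one_hot @ embedding, vectors))
```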

lapis sequoia Nov 5, 2019, 2:20 AM

#

@acoustic mural Perhaps you might know the answer to my question

#

What exactly does it mean to "describe your model's results"

acoustic mural Nov 5, 2019, 2:23 AM

#

well, your model was built for a purpose, right? how well does it do what it's supposed to?
if it was created to mimic a function, how often does it get it right? when it does get it right, how close is its answer to the truth?
if it was created to explore possible solutions to a problem, what insights can you gain from it?

#

stuff like that

lapis sequoia Nov 5, 2019, 2:23 AM

#

I see - my model was actually supposed to predict tree cover types based on a dataset

#

So, is there a particular metric I could use to see if it did well?

#

I'm doing this for an internship interview - have never done ML before so I'm really new to this

acoustic mural Nov 5, 2019, 2:25 AM

#

well for starters, what percent of the time does it get it right on data it wasn't trained on (your test set)?

lapis sequoia Nov 5, 2019, 2:28 AM

#

Crap, lol I only have a test set and a training set

#

I don't have data outside of that

acoustic mural Nov 5, 2019, 2:28 AM

#

well the test set is what i'm referring to

#

you shouldn't train your model on your test set

#

you use your test set to evaluate its performance against labeled, unseen data

lapis sequoia Nov 5, 2019, 2:30 AM

#

Oh

#

I see

#

I trained my model on the test set - I was essentially given a big dataset and asked to make a predictive model... But I assume you're saying I should be testing the model on data from outside this dataset?

#

Sorry if these questions are elementary

acoustic mural Nov 5, 2019, 2:35 AM

#

you have a dataset, right? take somewhere between 10-30% of your data and put it somewhere else for now

#

take the remaining 70-90% and train your model

#

then once you have the model, bust out the data you hid earlier and use your model to predict the answer to each, and compare its results against the answer you already have
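The three steps above, sketched with NumPy on placeholder data (50 rows, 20% held out; both numbers are arbitrary). scikit-learn's `train_test_split` does the same job in one call if that library is available.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder features and labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 3

# Shuffle row positions, then set aside the first 20% as the test set.
order = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = order[:n_test], order[n_test:]

X_train, y_train = X[train_idx], y[train_idx]   # fit the model on these
X_test, y_test = X[test_idx], y[test_idx]       # only use these to evaluate

print(len(X_train), len(X_test))
```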

lapis sequoia Nov 5, 2019, 2:36 AM

#

Awesome, this makes sense

#

Thank you so much

acoustic mural Nov 5, 2019, 2:40 AM

#

👍 good luck with your interview

real wigeon Nov 5, 2019, 4:19 AM

#

im learning about data structures

#

interesting stuff

lost sinew Nov 5, 2019, 10:58 AM

#

how do i round a column of datetime in a dataframe to the nearest minute

polar acorn Nov 5, 2019, 11:13 AM

#

@lost sinew , df.column.dt.round('min')
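A tiny sketch of that call on two invented timestamps; `when` is a placeholder column name.

```python
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2019-11-05 10:58:29", "2019-11-05 10:58:31"])
})

# Round each timestamp to the nearest whole minute.
df["when_rounded"] = df["when"].dt.round("min")

print(df)
```

`dt.floor("min")` and `dt.ceil("min")` always round down or up instead.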

lost sinew Nov 5, 2019, 11:21 AM

#

@polar acorn thanks man

jolly briar Nov 5, 2019, 12:59 PM

#

i'm wondering about the usual approach to google colab with the google cloud platform.

I've just been looking at it and it seems that it requires authentication using an account rather than a service key.

does anyone know of any decent guides to it?

lapis sequoia Nov 5, 2019, 1:28 PM

#

aye

#

you come to the right place

#

you can link your google colab to your own machine, but the connection will be through internet.. so, not worth it.. Colab doesn't give you enough resources to run for extended periods of time

#

which is where Datalab comes in.. it's part of GCP

#

https://www.youtube.com/watch?v=kdAI0IewkGU

YouTube

Drivsno

How to Datalab - Creating and Running Notebooks

Using Cloud Datalab, creating and deleting the VM instance, using Cloud Storage Bucket


#

@jolly briar

jolly briar Nov 5, 2019, 1:32 PM

#

@lapis sequoia thanks - i'm trying to get a handle on what a typical workflow is here with a team -- as i have a bunch of data coming into storage buckets that's then sent to bigquery with schema etc... but then colab just seemed to work with drive 🤔
i'll have a look at datalab now though

timid vortex Nov 5, 2019, 2:58 PM

#

Anyone here have experience with running your Python on AWS or other clouds?

#

Just generally curious about the experience

glad arch Nov 5, 2019, 3:51 PM

#

hi, does anyone here know how to use fft?

fallen path Nov 5, 2019, 4:01 PM

#

Maybe you can look into this...

https://www.google.com/amp/s/www.geeksforgeeks.org/python-fast-fourier-transformation/amp/

GeeksforGeeks

Python | Fast Fourier Transformation - GeeksforGeeks

It is an algorithm which plays a very important role in the computation of the Discrete Fourier Transform of a sequence. It converts a space…

upper ginkgo Nov 5, 2019, 11:51 PM

#

Hey guys, I'm trying to learn all about machine learning and more specifically neural networks to make a bot be able to 'talk'(or at least pretend it does), but I don't know anything about this, I'm not even a beginner, although I have a lot of experience with Python and I know most things very well. Any sources you would recommend for me to learn?

#

I have messed around with neural networks a little, but sadly I only have modified code that does not belong to me and I haven't learned much

soft siren Nov 6, 2019, 1:10 AM

#

@upper ginkgo I think Andrew Ng’s coursera course on machine learning is probably one of the most common resources for learning about the math behind neural nets (as opposed to just using them)

upper ginkgo Nov 6, 2019, 1:10 AM

#

So it focuses on math?

soft siren Nov 6, 2019, 1:11 AM

#

It goes into more of the insides and guts of the neural nets than just saying “this package does them, this is how you fit and this is how you predict”

upper ginkgo Nov 6, 2019, 1:12 AM

#

That’d be nice, I haven’t learned the required math knowledge since I’m young and those subjects are taught in higher grades than my current one

#

I’ll be definitely looking into that course, I hate those guides that explain shit

soft siren Nov 6, 2019, 1:14 AM

#

http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10740.pdf

#

This covers some of it

#

It’s a bit sparse in details but the course covers them a bit more in depth

#

https://www.coursera.org/learn/machine-learning

Coursera

Machine Learning | Coursera

Learn Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, ...

#

This is the course

#

Week 4 particularly

upper ginkgo Nov 6, 2019, 1:19 AM

#

Thanks a ton

soft siren Nov 6, 2019, 1:36 AM

#

👍

lapis sequoia Nov 6, 2019, 2:18 AM

#

What is a confusion matrix?

#

Tried googling but I just don't understand

soft siren Nov 6, 2019, 3:53 AM

#

A confusion matrix tells you about the predictive performance of your classification model. It summarizes the true positives, true negatives, false positives, and false negatives. The diagonal entries tell you how often you’re predicting right, the off diagonals tell you when you’re predicting wrong

#

📎 image0.jpg

#

Important summary statistics can be derived from the matrix. Namely recall and precision
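A small worked sketch for a binary classifier, with made-up labels. Rows of the matrix are the actual class and columns are the predicted class, which is one common layout (some references transpose it).

```python
import numpy as np

# Invented ground truth and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# matrix[actual, predicted] counts how often each combination occurs.
matrix = np.zeros((2, 2), dtype=int)
np.add.at(matrix, (y_true, y_pred), 1)

tn, fp, fn, tp = matrix.ravel()
precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found

print(matrix)
print(precision, recall)
```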

mighty tartan Nov 6, 2019, 8:46 AM

#

I know this is a bit of a broad question but does anyone have a good statistics course for python thinkhonk

#

Or a good statistics course in general

jolly briar Nov 6, 2019, 11:49 AM

#

When I run a Google colab notebook, stored on Google Drive, where is the VM located?

#

I have all the data located in EU ( which is necessary ), but I'm not sure where the computation is done in the case of using colab notebooks

#

@mighty tartan what kind of stats?

mighty tartan Nov 6, 2019, 11:52 AM

#

For finance

#

I know its a broad question, mostly done the basics like linear regression. Anything that could be helpful I would appreciate ❤

jolly briar Nov 6, 2019, 11:58 AM

#

@mighty tartan not too sure about finance - quantopian have good resources, and there's also a python for finance book which has some stats and stuff in it iirc (monte carlo sims and stuff)

#

📎 919ecXlVukL.png

mighty tartan Nov 6, 2019, 11:58 AM

#

yea looked into those :p (also looked into the book)
anyways thx for wanting to help me out ❤

river plume Nov 6, 2019, 4:13 PM

#

guys can anyone explain when to use feature scaling and normalization?

#

is it necessary to use it in all the models?

#

also, when to use StandardScaler and when to use MinMaxScaler?

deft harbor Nov 6, 2019, 9:14 PM

#

@mighty tartan google stat110

#

@river plume It depends on your data and what you are doing

#

If you want to use PCA, you need to use standardscaler to bring the mean to 0 and sd to 1

#

MinMaxScaler is generally used to bring all values between 0 and 1

#

This is useful for certain classification problems

#

models
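What the two scalers compute, written out with NumPy on a made-up 3x2 array. This mirrors the formulas only; scikit-learn's `StandardScaler` and `MinMaxScaler` additionally remember the fitted statistics so they can be reused on test data.

```python
import numpy as np

# Two features on very different scales (invented values).
x = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Standardization: per column, subtract the mean and divide by the std.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max scaling: per column, map the minimum to 0 and the maximum to 1.
min_max = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(standardized)
print(min_max)
```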

mighty tartan Nov 6, 2019, 9:18 PM

#

arigato ^^

quaint halo Nov 6, 2019, 9:43 PM

#

@river plume You do not need to scale / normalise your data for all models, only those which use distances in the feature space to extract insight and perform classification. The likes of K-NearestNeighbour and K-Means are sensitive to scale whereas tree based algorithms are not. You will need to understand the internals of your models to understand when scaling / normalisation is or is not required.
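A toy demonstration of that scale sensitivity, with invented age and income values: the nearest neighbour of the same query point changes once both features are min-max scaled, because the raw income column dominates the Euclidean distance.

```python
import numpy as np

# Columns: age in years, income in currency units (made-up values).
train = np.array([
    [25.0, 50000.0],
    [60.0, 50500.0],
])
query = np.array([27.0, 50400.0])


def nearest(points, q):
    """Index of the row in `points` closest to `q` (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


raw_nearest = nearest(train, query)

# Min-max scale both the training points and the query with the same statistics.
lo, hi = train.min(axis=0), train.max(axis=0)
scaled_nearest = nearest((train - lo) / (hi - lo), (query - lo) / (hi - lo))

print(raw_nearest, scaled_nearest)
```

A tree-based model splits on one feature at a time, so rescaling a column does not change which rows fall on which side of a split.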

glad arch Nov 6, 2019, 9:48 PM

#

hi guys, I have a signal with samples equal to 120 and another signal with samples equal to 240. Im trying to rediscretise one of the signals in space of another.

I tried this method which works (i.e. PyCharm doesn't complain) but I feel like i'm losing information by doing that.

signal_ = (self.sampling_rate*self.sample_time)
signal_ = int(signal_)
scaling = int(signal_/samples)
new_signal = self.signal[::scaling]
return new_signal

another approach could be taking the signal to fourier domain and change samples but I don't see how I can do that

any idea?
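One way the Fourier-domain idea could look, as a sketch with NumPy only. `fourier_resample` is a made-up helper name, and the approach assumes the signal is periodic over the sampled window and band-limited below the new Nyquist frequency; otherwise expect ringing or aliasing. `scipy.signal.resample` implements the same general technique with more care around edge cases.

```python
import numpy as np


def fourier_resample(signal, new_len):
    """Resample a real 1-D signal to `new_len` samples via its spectrum."""
    signal = np.asarray(signal, dtype=float)
    old_len = len(signal)
    spectrum = np.fft.rfft(signal)
    # irfft crops the spectrum (downsampling) or zero-pads it (upsampling)
    # to the new_len // 2 + 1 bins that a signal of length new_len needs.
    resampled = np.fft.irfft(spectrum, n=new_len)
    # Undo the change in FFT normalisation caused by the new length.
    return resampled * (new_len / old_len)


# Test signal: 3 full cycles of a sine, sampled at 240 and at 120 points.
t240 = np.arange(240) / 240
t120 = np.arange(120) / 120
x240 = np.sin(2 * np.pi * 3 * t240)

x120 = fourier_resample(x240, 120)      # 240 -> 120 samples
x240_back = fourier_resample(x120, 240) # 120 -> 240 samples

print(np.allclose(x120, np.sin(2 * np.pi * 3 * t120)))
print(np.allclose(x240_back, x240))
```

Unlike slicing with `[::scaling]`, this uses every input sample and also works in the upsampling direction.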

deft harbor Nov 6, 2019, 9:48 PM

#

Tau wrote it better

worthy meadow Nov 7, 2019, 2:14 AM

#

Hello, I was advised to try this forum to guide me in the right direction. I'm a beginner in python. Just started learning a month ago today, I think. And I'm struggling with career fields to go into. I believe I'd like to work with BCI research or human perception research with virtual reality in the future, and was told by someone in the careers tab that this space may be able to tell me if data science was the correct route to go.