#data-science-and-ml

midnight oracle
#

I don't have access to it right now

#

I have already sent it in my previous message

#

timesData.csv

#

Here it is

supple ferry
#

I see

#

instead of null values, you have "-"

#

right ?

midnight oracle
#

Yes

#

Some of them

supple ferry
#

Okay. What you can do is, again, tweak your reading function

#

df = pd.read_csv("data.csv", na_values = ["-", "_"])

#

na_values is the parameter allowing you to add special na values for a given dataset

#

if parser sees such fields it will treat them as null

#

there I put a list of two strings to be considered as null. if parser sees a field with any of these two strings, it will parse it as null

#

then you can easily df.dropna(inplace = True)
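
The na_values advice above can be tried on a small made-up sample. This is only a sketch: the CSV text and the column names are invented stand-ins for the real file, and the cleaned frame is reassigned to a new name instead of being modified in place.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: "-" and "_" mark missing values.
csv_text = """name,score,rank
A,91.5,1
B,-,2
C,88.0,_
D,75.2,4
"""

# Tell the parser which extra strings should be read as null.
df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "_"])
print(df.isna().sum())   # missing values per column

# Drop the incomplete rows and keep the result under a new name.
clean = df.dropna()
print(clean)
```

Without na_values, the score and rank columns would be read as text, because "-" and "_" are not numbers.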

midnight oracle
#

Oh I didnt know that

#

Thanks

supple ferry
#

Now you know 😃

midnight oracle
#

:)

supple ferry
#

read_csv is a very powerful function. I advise you to read its documentation. It will save you tons of headaches already at read time

midnight oracle
#

Okay

supple ferry
#

btw, try not to use inplace = True at any time :

#

😃

#

it will confuse you sooner or later

void anvil
#

inplace is so useful

supple ferry
#

It is useful, but also confusing. It is better to reassign than to modify in place

fervent solar
#

can someone explain to me how np.searchsorted() works? I checked the documentation but didn't get it. Thank you

supple ferry
#

@fervent solar let's say you have two arrays, A and B, where A is sorted. You want to put B's elements into A without messing up A's order. That function gives you the indices at which you should insert the elements of B so that the overall sorted order of A is preserved

polar acorn
#

Also inplace is supposed to be deprecated.

supple ferry
#

I hope in version 1.0 they do it

jagged nymph
#

i have a question tangentially related to data science so maybe you guys could help

#

as you can see, the result isn't quite that great with those dips on the side

#

is there a better curve fitting module i can use for this purpose?

#

oh and i should mention, the points move and there are an arbitrary number of them

supple ferry
#

This is really a good question, I hope there is someone to answer that. I also became interested 😄

cursive sun
#

Lmfaooo ya getting Runge phenomenoned

#

scruuuuub

#

Use a chebyshev grid next time or use a degree sqrt(n) polynomial regressor

#

Or throw in a lipschitz coefficient bound

#

Worlds your oyster fam

#

Git gud
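
For anyone who wants to see what the Runge phenomenon and the Chebyshev grid suggestion look like in numbers, here is a rough sketch. It assumes the fit in question is plain polynomial interpolation, which the chat does not confirm, and it uses the classic Runge test function rather than the asker's points.

```python
import numpy as np

def runge(x):
    # Classic test function for which equispaced interpolation oscillates.
    return 1.0 / (1.0 + 25.0 * x**2)

n = 11                                   # number of interpolation nodes
dense = np.linspace(-1.0, 1.0, 1001)     # fine grid for measuring the error

equi_nodes = np.linspace(-1.0, 1.0, n)   # evenly spaced nodes
k = np.arange(n)
cheb_nodes = np.cos((2 * k + 1) * np.pi / (2 * n))   # Chebyshev nodes

def max_error(nodes):
    # A degree n - 1 polynomial through n nodes is an exact interpolant.
    coeffs = np.polyfit(nodes, runge(nodes), n - 1)
    return np.max(np.abs(np.polyval(coeffs, dense) - runge(dense)))

err_equi = max_error(equi_nodes)
err_cheb = max_error(cheb_nodes)
print(err_equi, err_cheb)
```

The equispaced error is dominated by the swings near the edges of the interval. Clustering the nodes towards the edges, as the Chebyshev grid does, damps them.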

void anvil
#

you can also look into wave transforms

#

EEMD, Fourier

past sonnet
#

Hi guys,

So I'm solving a problem where I have HTML which contains challenge_details, i.e. the Question.
I managed to extract main_text from this HTML manually with the help of bs4.
Now I have the challenge_details in a txt file, and my next goal is to extract KEYWORDS which I can use to find
relevant documents on the internet which can help solve the Question or may contain knowledge which is useful
for understanding the question.

My questions are:
- Should I manually scrape Google search results, or is there a good Python lib for it? (I'm not going to scrape too frequently or too fast, so getting banned will not be an issue.)
- I'm only going to take URLs in the search results which lead to an HTML page or a PDF, so my next question is: what is the best library for ARTICLE extraction from HTML and text extraction from PDF?
- Then I would like to use some magic [library which can find document similarity or the n most relevant docs].

What better information retrieval system is there than Google? Hence, after extracting keywords, I'd like to scrape Google search, as the docs in the results will be the most relevant.
But I don't think keywords will fetch good search results, so I'll be using the title which I extracted from challenge_details. Still, extracting keywords and finding documents based on them is a requirement that I need to fulfill.

#

Libraries that I've found so far: newspaper3k and spaCy

jagged nymph
#

@cursive sun @void anvil thanks! i don't really know what any of these things mean but I'll look into them 😅

cursive sun
#

Oof, this is why we need better math education in schools tbqh

supple ferry
#

@cursive sun if he knew, then he would not ask, right? Not knowing is okay, not wanting to know is not

#

@past sonnet i think you can go to help channels for this question. It is better suited to those profiles

cursive sun
#

Calm down fam, it's a joke. Now the question is why you were the one to get antsy about it 🤔

simple crag
#

It's not a very good joke

cursive sun
#

Fine, if you think qwerty of all people is smart enough to help beyond 'import cv2' then you dont need me

#

You guys need me to tiptoe around then im not helping for free, you guys provide solutions for issues associated with interpolation problems

mossy dragon
#

🍿

lyric canopy
#

Honestly, the most important thing we expect someone to do on this server is to show respect for the other users. Statements like 'qwerty of all people is smart enough' are not that, so you can keep your attitude. @cursive sun

#

So, drop the attitude

cursive sun
#

Lmfao im not gonna listen to you, you can keep querty, ill bounce

lyric canopy
#

!kick 339017672211693570 telling staff they're not going to listen when told to change their immature and toxic attitude.

midnight oracle
#

QWERTY why don't you prefer inplace=True what are its drawbacks, if it has any?

void anvil
#

Honestly fourier transforms were taught in differential equations for me

#

which is sophomore math

#

EEMD and other decompositions are taught in signals processing courses which are at least Jr. level engineering courses and more advanced ones are grad level

mossy dragon
#

differential equations is sophomore math?

void anvil
#

sophomore in college

#

assuming calc 1-3 in freshman / first half of sophomore

fervent solar
#

@supple ferry can you give a more detailed example

supple ferry
#

@fervent solar, let's say you have an array A = [1, 2, 5, 7, 10] and some array B = [3, 4, 9], and you want to insert the values from B into A so that the sorted order of A (as you see) doesn't change. If you just stick B onto A, it will extend it to [1, 2, 5, 7, 10, 3, 4, 9]. The order is spoiled. If you use np.searchsorted(), it will give you the indexes at which you can insert the values of B.

In [4]: a = [1, 2, 5, 7, 10]

In [5]: b = [3, 4, 9]
In [6]: np.searchsorted(a, b)
Out[6]: array([2, 2, 4], dtype=int64)

You see the output? It says to put the first element of B (3) between 2 and 5 (at index 2) in A, and it won't spoil its order. The same goes for 4: you can put it in between 2 and 5. But 9 should be put between 7 and 10 (at index 4).
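
To go one step further than the session above, the indices can be handed to np.insert to actually build the merged array. This is a small sketch using the same A and B; combining searchsorted with np.insert is an illustration added here, not something stated in the chat.

```python
import numpy as np

a = np.array([1, 2, 5, 7, 10])   # must already be sorted
b = np.array([3, 4, 9])          # values to place into a (sorted here as well)

idx = np.searchsorted(a, b)      # position in a where each value of b fits
merged = np.insert(a, idx, b)    # insert each value of b before that position

print(idx)
print(merged)
```

The indices refer to positions in the original a, which is why 3 and 4 can both get index 2.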

#

@midnight oracle, inplace = True is very handy, but only at first sight. If you don't order your code properly, you can easily forget that you modified your dataframe in place. I am telling you this from my experience, and I am pretty sure there are some other users here who have had the same thing happen

#

just writing a couple more symbols will not hurt, but may help
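
One concrete way the in-place style can bite is sketched below, with a tiny made-up frame: a call with inplace=True returns None, so assigning its result throws the data away, and the original frame is changed without any new name recording that.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})   # made-up data with one missing value

# Reassignment: df is untouched and the cleaned result has its own name.
cleaned = df.dropna()

# In place: the call returns None and changes df itself.
result = df.dropna(inplace=True)

print(len(cleaned), result, len(df))
```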

midnight oracle
#

Okay

fervent solar
#

@supple ferry oh python is smart

supple ferry
#

😄 It is NumPy and Python

fervent solar
void anvil
#

eww, not muting discord channels

fervent solar
#

hahaha

supple ferry
#

Literally unplayable

fervent solar
#

any good platform to ask programming based questions ?

magic pecan
#

stack overflow @fervent solar

lean ledge
#

aw first time I saw someone with actual signal processing knowledge online and they get kicked

fervent solar
#

(x[:-1] > x[1:]) can some one explain this code

silk acorn
#

x[:-1] and x[1:] are slices

#
```py
[1,2,3][:-1] -> [1,2]
[1,2,3][1:] -> [2,3]
```
#

as for comparing lists with < or >

#

For plain Python lists, this compares element by element from the start, so the result is decided by the first elements that differ

#

slices are formatted like this

```py
[start:end:step]
```
with default of
```py
[0:-1:1]
```
that can be omitted
#

It lets you take part of a list

fervent solar
#

it was used in:
    while np.any(x[:-1] > x[1:]):
        np.random.shuffle(x)
    return x

#

bogosort

#

the thing i'm not getting is why (x[:-1] > x[1:]) is used, because x[1:] will miss the first value

#

and why greater than sign is used

silk acorn
#

It's comparing the first value with the second value

#

hmm, it's np.any, was gonna say that wouldn't work for any, but maybe it's different for numpy, lemme check

fervent solar
#

ok

silk acorn
#

Yeah, for numpy it compares the two arrays element by element and returns True if any element of the first is larger than the matching element of the second

fervent solar
#

@silk acorn u said default value is [ 0 : - 1 : 1 ] but its printing reversed list

silk acorn
#

It shouldn't be.

#

But i did make a mistake there

#

it's
[0:len(list):1]

#

That's not a reversed list, that's a list missing the last element

fervent solar
#

yes

silk acorn
#

[::-1] would get you a reversed list
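
The slice forms discussed above can be checked side by side. This is a short sketch on a made-up list; the variable names are only for illustration.

```python
x = [8, 3, 6, 1, 7, 5]

everything = x[0:len(x):1]   # the explicit defaults, same as x[:]
no_last = x[0:-1:1]          # stops before the last element, same as x[:-1]
no_first = x[1:]             # starts at the second element
backwards = x[::-1]          # a negative step walks the list in reverse

print(everything, no_last, no_first, backwards)
```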

fervent solar
#

on what basis is (x[:-1] > x[1:]) computed as True

#

when both will have same size

#

In [94]: x
Out[94]: [8, 3, 6, 1, 7, 5]

In [95]: x[:-1]
Out[95]: [8, 3, 6, 1, 7]

In [96]: x[1:]
Out[96]: [3, 6, 1, 7, 5]

In [97]: (x[:-1] > x[1:])
Out[97]: True

In [98]: (x[1:] > x[:-1])
Out[98]: False

silk acorn
#

np.any will take the two arrays, compare them element for element with >, and return True if any of the comparisons is True

fervent solar
#

means it wil compare the first and last element ?

silk acorn
#

It will compare a[0] > b[0], a[1] > b[1] etc, where a in this case is x[:-1] and b is x[1:]

#

and return True if one or more are True
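
The difference between the list and the NumPy behaviour can be seen directly. The sketch below reuses the x from the session above and wraps the loop in a function; the function name, the seed and the use of default_rng (instead of np.random.shuffle) are choices made for this illustration.

```python
import numpy as np

x_list = [8, 3, 6, 1, 7, 5]
x = np.array(x_list)

# NumPy arrays: one comparison per neighbouring pair, x[i] > x[i + 1].
pairwise = x[:-1] > x[1:]

# Plain lists: a single True/False from a lexicographic comparison.
list_result = x_list[:-1] > x_list[1:]

def bogosort(values, seed=0):
    """Shuffle until no element is larger than its right-hand neighbour."""
    rng = np.random.default_rng(seed)
    values = np.array(values)
    while np.any(values[:-1] > values[1:]):
        rng.shuffle(values)
    return values

sorted_x = bogosort([3, 1, 2, 4])
print(pairwise, list_result, sorted_x)
```

The loop only stops when every neighbouring pair is in order, which is exactly the condition for the array being sorted.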

fervent solar
#

got it

#

bogo sort done

silk acorn
#

The pure python equivalent would be

```py
any(a < b for a, b in zip(x[1:], x[:-1]))
```
pure lynx
#

Hi. Is there a library in python that graphs in the coordinate plane and draws segments in between points? I have tried researching online but I cannot seem to find anything. I don't believe matplotlib can suit my purposes for this.

fervent solar
#

u tried seaborn ?

pure lynx
#

No, but that looks cool and useful for another project. However, from what I am seeing of it, it serves nearly the same purpose as matplotlib.

simple crag
#

In what way is matplotlib insufficient?

#

Connecting points with segments seems like something it would do well

pure lynx
#

By plotting points, I mean like you could manually do in Geogebra. I may just be ignorant but I don't believe matplotlib has a functionality to connect the three points in such a fashion.

simple crag
lapis sequoia
#

Hi, been working on a binary text clf using scikit-learn and I feel a bit lost on what i can do to improve the acc (currently at 0.73-0.75). The dataset is quite small (~7000) so I don't know how far I can push it .
I am still very much learning so if anything seems off please let me know I'd really appreciate it :)

PreProcessing:
Cleaned the data
Set up some stopwords
Tried some word-clustering but didn't see any gains (because of dataset size?)
Just now messing around with MaxAbsScaler and Normalizer

Pipeline:
CountVectorizer
TfidfTransformer
The Preprocessing I mentioned above
an SGDClf

pure lynx
#

I did not end up needing patches, but I did use matplotlib in a way that I did not know existed.

#

I also learned about zip(), which was good.

#

I essentially made a list of lists, something like data = [[1,3],[2,1],[3,2]], then I found a way to make all of the possible segments given some points. Since I only have three (I am graphing triangles), that works for me, and I used: plt.plot(*zip(*itertools.chain.from_iterable(itertools.combinations(data, 2)))) Still not entirely sure how it works, but it does.

#

Oh yeah, I had to import itertools in addition to matplotlib.pyplot. FYI for anyone looking to use that method.
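
For anyone wondering how that one-liner works, it can be taken apart step by step. This is a sketch using the same example data; the intermediate variable names are invented here, and the final plotting call is left as a comment so that the snippet runs without matplotlib.

```python
import itertools

data = [[1, 3], [2, 1], [3, 2]]   # three points of a triangle

# Every pair of points, i.e. every possible segment.
pairs = list(itertools.combinations(data, 2))

# Flatten the pairs into one long sequence of points.
points = list(itertools.chain.from_iterable(pairs))

# zip(*points) turns [[x, y], ...] into all x values and all y values.
xs, ys = zip(*points)

print(pairs)
print(xs, ys)
# plt.plot(xs, ys) then draws one line through the points in this order.
```

Because the line passes through both ends of every pair, all three sides get drawn, with some segments drawn twice. With four points the same trick would also draw the diagonals, since every pair of points counts as a segment.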

supple ferry
#

@pure lynx wait you could do that? That's illegal 😀

supple ferry
#

Does anyone have experience with conditional logistic models?

#

I want to find out why my hessian matrix fails to calculate. I found out that I have hidden intercepts in my design matrix and I want to find out which variables cause that and remove them

lapis sequoia
#

Hello, I have a question about kinds of data for analysis... is working with technical data very different from any other kind of data? I understand that the reasoning the data provides goes in a different way, but how about the technical aspect of processing data?

supple ferry
#

@lapis sequoia , can you be more specific?

pure lynx
#

What..? How is it illegal?!

lapis sequoia
#

As far as I understand, data analysis for, let's say, business analytics requires a set of skills that involves knowledge of the business field, its processes, etc. Technical analytics would be more specific to some technical area of knowledge, like laser efficiency. But is the process of working with data (munging, cleaning, modeling) the same, with the difference only coming from the background of the data, or does the whole process of working with data differ in some way?

#

I am not sure I explain myself well enough, sorry, for confusing question

pale eagle
#

@lapis sequoia have u done any projects in data analytics

lapis sequoia
#

I have done some projects at school but they were related to things like predicting whether a new website would bring in more clients, or which store's range of goods to improve

#

Nothing with technical data, that is why I am interested if there is any difference

pale eagle
#

@lapis sequoia which lang u are using

#

??

lapis sequoia
#

Python

supple ferry
#

So, from my understanding, data will mean data always. Which means, technical or non-technical you will work with numbers. However, every industry and data type will need specific approach to its data. If you work with panel data, methods that you will use will differ completely from the methods you will use for example in cross-sectional data

pure lynx
#

Not to be off topic, but... how was what I did illegal?

lyric canopy
#

I don't think that was a serious remark. I'm not sure either.

polar acorn
#

@pure lynx It was not meant literally. It was most likely meant as an amusing compliment 😃

supple ferry
#

@pure lynx it was an amusing compliment as @polar acorn suggested 😀

#

I was surprised that one can use itertools in matplotlib. It is very creative

pure lynx
#

Oh, I guess I'm just daft in that case. Thanks for the compliment, though I can take credit only for relentless searching on StackExchange. I will have to find a different way to graph when I eventually move on to quadrilaterals.

pale eagle
#

Is anyone learning data analytics from DataCamp?

heavy apex
#

How do I build a portfolio for data science if I've never had a job or internship within the field. I'm coming up on my graduation date, and kinda scared my basic understanding of core skills isn't going to be enough to present to employers.

supple ferry
#

@heavy apex you can use Kaggle to see work by other people, which will give you a feeling for what kind of projects people are doing. If you see something you like, you can find another dataset and try to implement a similar methodology there. There are various free dataset websites you can visit, be it Kaggle itself, r/datasets or Google's search for datasets (I forgot its name)

#

Even more important than modeling is the way you interpret the results

supple ferry
#

Anyone got a good book advice for Bayesian Statistics?

lapis sequoia
#

@supple ferry thanks for insight.

heavy apex
#

@supple ferry thank you, great advice.

void anvil
#

If you get Kaggle GM you'll get 6-7 fig job offers

#

depending on what field you want to go in

lean ledge
#

@heavy apex Hackathons, projects, kaggle

void anvil
#

Kaggle is the gold standard tbh

lean ledge
#

And yes you're right, just having the skills won't be enough

void anvil
#

hackathons and projects mean fuck all

lean ledge
#

Kaggle isn't really a good standard at all

#

It's just a common recommendation since it's easy to find data there

#

Hackathons and projects can easily mean as much as or more than kaggle

#

It's not about what it is, it's the skills you display

#

With hackathons and projects, you can show fast prototyping under pressure or longer term software skills which are hard to show with just Kaggle. Kaggle doesn't actually show any particular kind of skill that the other two don't

#

For reference: my company has hired both top 50 on kaggle and hackathon winners (I'm from the latter) along with those with work experience only

fervent solar
#

Recall that previously we created a simple array using an expression like this:
In[3]: x = np.zeros(4, dtype=int)
We can similarly create a structured array using a compound data type specification:
In[4]: # Use a compound data type for structured arrays
       data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                                 'formats':('U10', 'i4', 'f8')})

#

how does this numpy structure work?

#

i understand python dictionaries
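
Since the question went unanswered in the channel, here is a short sketch of what the quoted snippet sets up. The dictionary only describes the layout of each row: a field called 'name' holding text of up to 10 characters ('U10'), 'age' as a 4-byte integer ('i4') and 'weight' as an 8-byte float ('f8'). The values filled in below are made up for illustration.

```python
import numpy as np

# Four empty rows, each with a name, an age and a weight field.
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

# Fill each field like a column (made-up values).
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

first_row = data[0]                            # one record with all three fields
young_names = data[data['age'] < 30]['name']   # filter rows, then pick a field

print(data.dtype)
print(first_row, young_names)
```

So it behaves less like a dictionary and more like a small table: indexing with a field name gives a column, and indexing with a number gives a row.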

vivid hedge
#

Might not be the correct channel but what is a good name for summing up all the values up to a certain value.

Ex. Value 3, 1 + 2 +3 = 6

#

Is there a mathematical name or function for it?

lean ledge
#

sum_until()?

hardy crag
#

A triangular number or triangle number counts objects arranged in an equilateral triangle, as in the diagram on the right. The nth triangular number is the number of dots in the triangular arrangement with n dots on a side, and is equal to the sum of the n natural numbers fro...

vivid hedge
#

Could work, currently I have "Sum up to value" its going to be for the filename also.

hardy crag
#

sry for the preview 😦

#

if you are just interested in the result u might try something like np.arange(1, n + 1).sum()

#

or write the function given in the wikipedia article 😃

vivid hedge
#

Me? I am simply looking for a name I can call my file for github push. But I guess Triangular numbers seems to be correct

polar acorn
#

Cum_sum, short for cumulative sum might be used for similar things.
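
The suggestions above can be compared in a few lines. This is a sketch with an invented function name; note that the stop value of np.arange is exclusive, so the range has to run to n + 1 to include n.

```python
import numpy as np

def triangular(n):
    """n-th triangular number via the closed form n(n + 1) / 2."""
    return n * (n + 1) // 2

n = 3
by_formula = triangular(n)                  # 1 + 2 + 3
by_sum = int(np.arange(1, n + 1).sum())     # same sum computed with NumPy
running = np.cumsum(np.arange(1, 6))        # cumulative sums of 1..5

print(by_formula, by_sum, running)
```

The cumulative sum of 1, 2, 3, ... is the sequence of triangular numbers, which is how the two names relate.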

violet crag
#

let's say I am using data of stock prices.
I want to train the set on some subset of dataset and other subset to test it.

Should I go with scikit learn's train_test_split, which "Split arrays or matrices into random train and test subsets"

Or should I divide dataset with respect to dates. Such as train on data from 2016 to 31st Dec 2018, and all the data after it for testing?

polar acorn
#

You should train on data up to a date and test on later data instead of using sklearn's train_test_split. If you want to do CV you probably want to do this several times for different dates. If you just google "time series cross validation" there are lot of guides.

violet crag
#

CV?

polar acorn
#

Cross validation. A more complete way to test model performance but takes more time.

void anvil
#

absolutely not random CV, agreed

#

time series data needs to be split into train/test with quarantine periods based on forecast period to prevent contamination of data

#

If you want to do cv train test splits they can be segmented with additional quarantine periods

#

Or you can generate synthetic data based off of your current time series

#

just know if you're training on data not actually available at the time you're introducing some level of cheating into your predictions
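
A date-based split with a gap between train and test could look roughly like the sketch below. The price series is synthetic, and the cutoff date and the 5-day gap are placeholder choices; the chat only says the quarantine period should be based on the forecast period.

```python
import numpy as np
import pandas as pd

# Synthetic daily "prices" on business days (stand-in for real stock data).
dates = pd.bdate_range("2016-01-01", "2019-06-28")
prices = pd.Series(np.linspace(100.0, 150.0, len(dates)), index=dates)

cutoff = pd.Timestamp("2018-12-31")   # train on everything up to this date
gap = pd.Timedelta(days=5)            # hypothetical quarantine period

train = prices[prices.index <= cutoff]
test = prices[prices.index > cutoff + gap]   # skip the days right after the cutoff

print(train.index.max(), test.index.min())
```

Unlike a random split, no test row is older than any training row, so the model never sees the future during training.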

violet crag
#

I don't think they have taught CV in the course yet. So for this assignment I won't use it. As the assignment is due tomorrow. But I have bookmarked it and will read up on it.

#

Thanks a lot guys.

#

🙏

supple ferry
#

@lean ledge , why we cant name it like sum factorial or smth 😄

#

@void anvil i really liked the process of making synthetic data and now trying to get deeper into Bayesian inference, Monte carlo methods and Markovian methods

void anvil
#

synthetic data is a huge fucking pita

#

and doesn't always work

supple ferry
#

i dont mean for actually using it in production or research

#

it will help me to master numpy

#

i am trying to replace most of the python standard functionalities that i use solely with numpy

void anvil
#

ah

supple ferry
#

one ring to rule them all

warm gulch
#

Hello! Has anyone worked with k-means clustering?

supple ferry
#

hi

#

yes. what is your question

warm gulch
#

I'm working on a school project and most of the resources I'm finding online deal with just 2-D x and y data, it's possible to cluster more complex data in python correct?

supple ferry
#

yes! of course

#

you can use kmeans also with multidimensional data

warm gulch
#

Ok I'll look into that 🤔. I figured out how to get my program to read and store my csv, would I have to do anything extra or will my program realize it is multidimensional?

#

And plot/cluster the data accordingly

supple ferry
#

so, for using kmeans you should use external library. for reading csv and working with it pandas is your friend. for kmeans sklearn is the package you should use

#

in this post you can find very detailed approach how to implement kmeans

warm gulch
#

Oh thank you! I did use pandas, as for my kmeans I had found different ways using numpy and matplotlib, but this looks much cleaner!

supple ferry
#

Should you come up with questions, feel free to ask them here. We will be glad to help :)

warm gulch
#

For sure, thank you!

violet crag
#

got the solution

#

set_xticks()

reef bone
#

you can do k-means on as many dims as you want, the only problem is that it quickly becomes impossible to visualize the results on a single 2d plot

#

tutorials might deal with 2d solely because it's easy to see what's happening
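
To illustrate that nothing special is needed for more than two dimensions, here is a minimal sketch with sklearn's KMeans on made-up 4-dimensional data. The two blobs, the number of clusters and the random seeds are assumptions for the example. With a real CSV, the numeric columns of the pandas DataFrame would take the place of X.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two made-up groups of points in 4 dimensions, far apart from each other.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 4))
X = np.vstack([blob_a, blob_b])   # shape (samples, features) = (100, 4)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)     # one cluster label per row

print(labels[:5], labels[-5:])
print(model.cluster_centers_.shape)
```

The algorithm works on however many columns X has. Only plotting needs extra care, for example by picking two columns at a time.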

potent path
#

Does anyone have any exp using time series algorithms with python?

supple ferry
#

@violet crag there used to be plt.tight_layout too

#

But in your case it will probably not be helpful

violet crag
#

Now I've run into another issue.

#

The regression function needs floats for "x" and I have dates

void anvil
#

convert to unix time

#

if you have variant time steps

#

or if time steps are uniform or could be considered uniform just throw ints
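
Both options mentioned above can be sketched with pandas. The three dates are made up, and the "days since the first date" variant is an extra alternative that the chat does not mention.

```python
import pandas as pd

# Made-up trading dates with an uneven gap (1 day, then 2 days).
dates = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"])

# Unix time: whole seconds since 1970-01-01, which keeps the real spacing.
unix_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Smaller numbers with the same spacing: days since the first date.
days_since_start = (dates - dates[0]).days

# If the steps can be treated as uniform: just count 0, 1, 2, ...
step_index = list(range(len(dates)))

print(unix_seconds.tolist(), days_since_start.tolist(), step_index)
```

The first two keep the information that the last gap is twice as long. The plain counter throws it away.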

violet crag
#

It's variant

void anvil
#

is the step important?

violet crag
#

It is stock price, sometime difference between the entries is 1 day (consecutive) other time its 2-3 days.

#

"step"? Idk what you mean

void anvil
#

time step

#

and yeah it's pretty important

#

have fun with jump functions

violet crag
#

Hmm 🤔 alright, I'll see what those are

void anvil
#

you can start here

#

this one is also neat

potent path
#

ohhhhhh

void anvil
#

but he definitely cheats

potent path
#

Thanks for that

void anvil
#

and doesn't realize it

violet crag
#

Cheats? How?

void anvil
#

he leaks information from train to test that shouldn't exist

#

He also predicts price, not movement

violet crag
#

Oh, I see. In my case I've made a clean cut in the data frame using list slicing.

void anvil
#

which is very bad

#

yeah you'll probably catch it when he goes over feature creation

violet crag
#

What do you mean by "movement"?

void anvil
#

% change

#

you should never predict price, it's too easy

#

and you'll get bad results

#

you want to predict day over day changes and back that out to price

#

so if your stock price is 100, 101, 102, 101

#

you want to be predicting 1%, 0.99%, -0.98%
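
The same numbers can be worked out with pandas, including the step of backing the changes out to prices again. This is a sketch on the four example prices; in real use the returns would come from a model rather than from the known prices.

```python
import pandas as pd

prices = pd.Series([100.0, 101.0, 102.0, 101.0])

# Day-over-day change: (today - yesterday) / yesterday.
returns = prices.pct_change().dropna()

# Backing the changes out to prices: start price times the compounded returns.
rebuilt = prices.iloc[0] * (1 + returns).cumprod()

print((returns * 100).round(2).tolist())   # percentage changes
print(rebuilt.round(2).tolist())           # prices recovered from the changes
```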

violet crag
#

Damn man. I was working on predicting prices.

#

😕

void anvil
#

you will get significantly better predictions predicting price than actual movements

potent path
#

Is it possible to use ML as a tool for your investments?

void anvil
#

yes

#

data is expensive

potent path
#

I've done some work before in that area but never had much success using it on my own portfolio.

void anvil
#

because you need good data

#

and it's expensive lol

#

need level 2 or level 3 data

lapis sequoia
#

try using the news data

#

there was something on kaggle previously

#

it should help

supple ferry
#

one possible idea can be to use news, derive sentiment and apply it to prediction models

violet crag
#

Is it true that regression can't work on dates and I have to convert them into numeric data

mossy dragon
#

so is anyone here an actual data scientist

supple ferry
#

@violet crag regression assumes your exogenous variables are scalars. So it cannot work directly with dates or categoricals (unless they are encoded, e.g. into dummies)

violet crag
#

@supple ferry I successfully converted date into numeric value, keeping the step in mind as well.

I've issue in regression. I thought it would draw a regression line. But regressor.predict([dates]) just gave me same values as actual price

#

😦

#

thinkmon I think I know where I am going wrong

polar acorn
#

@violet crag In case you didn't see you are predicting on your training dates. You should predict on test_dates.

violet crag
#

@polar acorn yes, I caught that

#

now I have issues with Reshape your data either using array.reshape(-1, 1)

#

is this a numpy thing?

polar acorn
#

It's a sklearn thing. It prefers your numpy arrays in a certain way.

violet crag
#

np.asarray(test_dates).reshape(-1, 1)
ValueError: shapes (101,1) and (403,403) not aligned: 1 (dim 1) != 403 (dim 0)

#

I am clueless

polar acorn
#

What is the shape of the X and y you pass to fit()?

violet crag
#

it's like this [[1, 2, 3]]

#

simple 1D array was giving error

polar acorn
#

if you call test_dates.shape what do you get?

#

or train_dates.shape rather

#

Okay I see what you are doing, you are actually feeding two lists to LinearRegression.fit()

violet crag
#

yea, one x and one y

polar acorn
#

The fit function wants numpy arrays instead. And it wants them in shape of (samples, features) for X and (samples, targets) for y. So what you can do is
fit(np.reshape(train_dates, (-1, 1)), np.reshape(train_prices, (-1,1)))

#

The np.reshape(train_dates, (-1, 1)) means: reshape my list into a numpy array with dimensions -1 and 1. Wtf you might think, -1? -1 just tells numpy to work out that dimension itself from the number of elements; with a single column, that is the length of the list. It is a convenience so you can reshape your list without checking its length.

#

Makes sense?

violet crag
#

can you explain me this -1, 1 further

#

🙏

#

now since I have to depict this predicted array on graph, I need to remove extra dimension, right?

#

like from [[]] to []

supple ferry
#

@violet crag , if you reshape your list of 100 elements into (-1, 1), it means it will reshape it to (len(thatList), 1)

#

it is easier to know it this way

#

there is no extra dimension btw, technically there is, but no 😄
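
The shapes involved are easier to see printed out. The sketch below uses a made-up list of four values and also shows ravel() as one way to get back to a flat array for plotting, which is the "[[]] to []" step asked about above.

```python
import numpy as np

train_dates = [1, 2, 3, 4]                 # made-up 1D input

# -1 is inferred: 4 elements / 1 column = 4 rows.
column = np.reshape(train_dates, (-1, 1))

# The same rule with 2 columns: 4 elements / 2 columns = 2 rows.
two_cols = np.reshape(train_dates, (-1, 2))

# Back to a flat 1D array, e.g. for plotting predictions.
flat = column.ravel()

print(column.shape, two_cols.shape, flat.shape)
```

sklearn's fit() and predict() expect X in the (samples, features) layout, which for a single feature is the column shape.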

violet crag
#

duuuuuude I finally got something

polar acorn
#

Looks good 😃

#

I mean the predictions are bad but that is pretty much what you should expect here.

supple ferry
#

MANOVA is your way to go. Now try that

violet crag
#

my assignment is due in 11 hours 🤣

#

this is bad, I'll lose money in the market

polar acorn
#

😂 If you could make money in the market doing linear regression there would be no market

supple ferry
#

😄

#

time series and panel data overall require different approaches

#

i dont know if you can manage to watch this and do your assignment at the same time

#

but this video is gold

violet crag
#

ooh I'll add it to the list of material that I've found on last two days, I have to read a lot

supple ferry
#

reading is the key

violet crag
#

will do after assignment

supple ferry
#

if you can rwerite the text you read into the code and vice versa

#

you are good to go

#

😃

#

rewrite*

violet crag
#

you mean theory to practice?

supple ferry
#

yes

#

knowing is not the power now, knowing how to find out is

violet crag
#

🙏

void anvil
#

you can do LSMA

#

with linear regression

#

it'll be better than one regression for the entire time series at least

#

you do a linear regression on the last X time points (default is 25) and use it to predict the next period

#

it's still pretty awful but it's better than predicting out months / years
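
The idea described above, fitting a line to the last X points and extending it by one step, might be sketched like this. The function name is invented, np.polyfit stands in for whatever regression routine is actually used, and the price series is synthetic.

```python
import numpy as np

def lsma_forecast(values, window=25):
    """Fit a line to the last `window` points and extend it one step ahead."""
    recent = np.asarray(values[-window:], dtype=float)
    t = np.arange(len(recent))                    # 0, 1, ..., window - 1
    slope, intercept = np.polyfit(t, recent, 1)   # least-squares straight line
    return slope * len(recent) + intercept        # value of the line at the next step

# Synthetic, perfectly linear prices: 100.0, 100.5, 101.0, ...
prices = 100.0 + 0.5 * np.arange(60)
next_price = lsma_forecast(prices)
print(next_price)
```

On real prices the fit is redone every period with the newest window, so the line follows the recent trend instead of the whole history.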

violet crag
#

"Estimated coefficients for the linear regression problem"

#

what does this mean in linear regression?

void anvil
#

it's literally the coefficients for the line you draw

violet crag
#

aah, slope of the line

void anvil
#

slope for all the X variables + intercept

vagrant vector
lapis sequoia
#

well.. you should start with an objective..what are you learning it for?

#

to do data analysis - for business, finance or for other fields?

#

or to do ml for image or text processing?

#

these are very broad..there's more to it.. but it all depends on your objective so you dont end up all over the place

#

this is where you start

#

then come back for more

quiet gyro
lapis sequoia
#

ooooooh

#

this is pretty cool

supple ferry
#

Now, WOW

buoyant trellis
#

I would like to start with data analysis ... essentially transition from QA automation to data analysis... finding it a bit difficult to figure out how to go about it

lapis sequoia
#

Has any of you heard of any good bootcamps for data science in Europe? There's plenty of them advertised online, but which are any good? Or any advice how to filter?

supple ferry
#

@lapis sequoia in Berlin there was one for 8k. I saw its agenda and advised my friend to learn it on his own

#

It was like a year ago and I was learning myself too

lapis sequoia
#

So it's not really worth it? I understand all the information is online and accessible for free, but I am becoming overwhelmed having no guidance and also the amount is so vast I just feel lost.

supple ferry
#

what do you need guidance with ?

buoyant trellis
#

@supple ferry As an experienced programmer desiring to move to data analyst role I dont know a definite path to reach data analyst role... I am looking at some online courses but not sure if they are worth the cost...

supple ferry
#

@buoyant trellis this can be a good guide

#

It has also programming things, you can ignore them of course

#

@lapis sequoia also for you

buoyant trellis
#

ok

lapis sequoia
#

I'll look through that, thanks! @supple ferry

#

I guess my main confusion comes from that most courses i looked at for data analysts/ data scientists suggest a package of skills to master, but when I look at job ads they require a phd in physics or mathematics... to my understanding one needs statistics the most and the technical programming tools, but again, i guess i have the whole idea wrong

buoyant trellis
#

my understanding was statistics and probability... last time I switched career path best thing that worked was to demonstrate through projects /blogs etc... and that is what I am planning now as well.. @lapis sequoia I think if you have projects to show there is always someone who is there to hire

lapis sequoia
#

I do have project but they are far behind a phd research projects. Would you have any references to share to what kind of project is expected for an entry level analyst?

buoyant trellis
#

honestly no... but I will go in for some project from kaggle.com and put it on my cv once I am there

lean ledge
#

@lapis sequoia it's not just statistics and probability, it's also calculus and linear algebra and some other maths here and there. People want others with really good maths backgrounds, people that can understand papers that use differential geometry or Poincare embeddings of trees into hyperbolic surfaces.

#

To compete with those guys, you have to know maths at a PhD level and that's going to be very hard

#

Most courses in my experience are only training computer science students and people changing careers for the basics of the field. I'm not sure how accurate a look that gives into the kind of ability employers want, especially since what employers want can vary

torn musk
#

@lean ledge wow so differential geometry is a thing??? I really want to know so I can choose to do college course in it

lean ledge
#

It is a thing

torn musk
#

Because everyone I asked was like "oh it's not necessary"

#

Oof

#

If it's a thing I'll do it

lean ledge
#

Prerequisites are real analysis and abstract algebra and number theory here, probably want to do those first

torn musk
#

Ahhh interesting

#

I've done the linear algebra, calculus, statistics package

#

But have not touched into those real analysis or differential geometry

lean ledge
#

Linear algebra is not abstract algebra

torn musk
#

Ok

lapis sequoia
#

That is interesting... I think I should give up the idea of trying to change my career without actually going back to school and starting all over from scratch, especially since I do not hold a degree in a quantitative field.

heavy apex
#

I'm really close to finishing my BS, and will probably get my MS in data science, but really curious about the difficulty for a non-degree holder trying to get into a data science career off boot camps or self training alone.

#

I never hear any first hand experiences, just a bunch of those ads everywhere on YouTube and whatnot.

lean ledge
#

I don't know a single person who's self taught themselves into a data science role completely

#

Dunno about how difficult it would be but anecdotally everyone I know has some degree, mostly in maths/Phys/stats or engineering, with some CS

feral lodge
#

I'm not familiar with the job market so forgive my ignorance, but surely most entry-level data analyst positions require only a bachelor's? That's the case with the first few results on glassdoor when i search "entry level data analyst". The Poincaré embedding stuff (https://arxiv.org/pdf/1705.08039.pdf) is still at the research stage, no?

lapis sequoia
#

^sorry for typing something when you have a question but @lean ledge why do u say u dont know a single person who's self taught themselves data science

lapis sequoia
#

What does inline for an embedded message mean?

lean ledge
#

data analyst positions arent data science positions. in my experience at least, data analysts are mostly full of non-technical people (eg. business majors etc) manually looking for trends. lots of tableau visualisations and whatnot as opposed to data science which tends to be lots of machine learning, statistics and modelling, very math heavy.

#

there are definitely lots of positions in data science that are satisfied with bachelors level too, I was just trying to give a motivation behind why some specific positions like the one they are finding might be looking for math/physics majors. @feral lodge

#

the poincare example was just me looking for a complicated sounding paper I saw recently, not an actual serious thing people might implement but I am reading research papers so often for data science stuff, I dont think research stage stuff is off limits for most data scientists at all

#

@lapis sequoia I know lots of people who've taught themselves data science, just not those with no degree at all who've taught themselves data science to a working capacity. lots of engineering/phys/math/CS majors who self-taught themselves it all, just dont know anyone without a degree at all or from non-technical backgrounds

lapis sequoia
#

Well, for data scientist I understand the high requirements, but I was referring more to entry level data analyst. I do see jobs in US that would fall into my educational background and training well, but in EU requirements seem to be through the roof.

lyric canopy
#

Terms are used in a slightly different manner from place to place. Here, it's used for people that have an actual analytical background and have an understanding of, say, R/Python, SQL, modelling, and stuff like that

#

I've just pulled up three random listings for "data analyst" and they all require modelling experience, a mathematical/statistical background, and a university degree. They do pay well, though.

supple ferry
#

In the EU the situation is quite different though. When I was working in Germany, I was a data analyst, but I was also doing data science stuff, like building predictive models for forecasting. In the job description they had written about a solid math and programming background. However, when I later compared my skillset to the job market in the US, I saw that to be a data analyst there they were not requiring that much in comparison to the EU. A solid data analyst in the EU can easily be a senior or lead data analyst in the US for more money

vagrant vector
#

Well I decided to start learning machine learning and some guy here told me to try the Coursera tutorial

#

Anyone knows if its a good one??

warm gulch
#

I'm doing edX courses

vagrant vector
#

I dont know if this course is good my self because I am new to this

#

So I am looking for someone who had learned it already and can tell me what he did

pale eagle
#

@vagrant vector try that course, you will automatically know what is good or bad for you

kindred stirrup
#

anyone know a good package for doing arima modeling? seems more like a job for R

supple ferry
#

@kindred stirrup statsmodels has arima functions

kindred stirrup
#

@supple ferry ahh thanks man i was spending several frustrating hours trying to download this auto.arima package

#

FWIW I work with a lot of data analysts and all of us have a master's degree. The only Data Science people have PhDs

supple ferry
#

@kindred stirrup you welcome

#

I think the power of having a PhD is the experience you get while doing scientific research. It teaches you to look not just at the result, but also at the process and how it should be done

visual tangle
#

What library would be the best for machine learning?

void anvil
#

sklearn

lapis sequoia
#

SnapML from ibm

#

:v

supple ferry
#

It automatically deletes random 50% of your data? @lapis sequoia

lapis sequoia
#

lol

#

what

supple ferry
#

Reference to Thanos 😀

lapis sequoia
#

I know..

supple ferry
#

I am being too nerdy 🤪

supple ferry
#

Which alternatives can I use for measuring the model performance in Conditional Logit?

#

alternatives to ROC curve

void anvil
#

everything that you use for linear regressions

#

it runs MLE so log likelihood

#
  • all the lag + normality tests
#

but you should get a coef, std error, z & p score, conf intervals, chi2, pseudo R2, F score (prob > chi2), log likelihood, etc.

#

ROC curve is pretty shit for logit/probit tbh

#

pseudo-R2 has Aldrich-Nelson, Cragg-Uhler (Cox and Snell) and its variants, Estrella and its variants, Veall-Zimmermann

#

@supple ferry

#

And, of course, AIC / BIC
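For reference, the likelihood-based measures mentioned above can be computed by hand once the log-likelihoods are known. This is a minimal sketch: the function name and the numbers are made up for illustration, and McFadden's pseudo-R2 is used only because it is the simplest variant (it is not one of the variants named above). What counts as the "null" model for a conditional logit is an assumption you would need to settle for your own setup.

```python
import math


def likelihood_metrics(llf, ll_null, n_params, n_obs):
    """Return McFadden pseudo-R2, AIC and BIC from log-likelihoods.

    llf      -- log-likelihood of the fitted model
    ll_null  -- log-likelihood of the reference (null) model
    n_params -- number of estimated parameters
    n_obs    -- number of observations
    """
    return {
        "mcfadden_r2": 1.0 - llf / ll_null,
        "aic": 2 * n_params - 2 * llf,
        "bic": n_params * math.log(n_obs) - 2 * llf,
    }


# Hypothetical numbers, only to show the call.
metrics = likelihood_metrics(llf=-120.0, ll_null=-150.0, n_params=3, n_obs=400)
print(metrics)
```

Lower AIC/BIC is better when comparing models fitted on the same data; the absolute values on their own say little.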

fervent solar
lyric canopy
#

It's grouping by the specified row indices, I think

#

Fits the numbers it produces

#

(0 + 2 + 5) / 3 = 2.33333

supple ferry
#

@void anvil , thank you!
I tend to agree with you on the ROC curve; however, I am asked to provide some in order to explain the performance of the model to "business" people. I am using statsmodels.discrete.conditional_model.ConditionalLogit for this, which is in master but not in a published release. They don't have predict for that.
I understand the reason. It is calculated by assuming group-related fixed effects, and predicting just out of the box would be statistically stupid.

void anvil
#

sklearn has logistic regression

#

@supple ferry

#

is a good introduction as well

#

I have an amazing econometrics book, but it's hand written by our teacher and I can't really scan it up at the moment

#

But basically if you just follow that lecture you can explain in simple terms what everything means

#

If you have access to STATA (bleh), it's super easy to run and get everything precalculated by just doing load data, x = these columns, y = this column, logit(x,y)

#

It's not nearly as powerful but it generates nice, neat tables with nearly 0 effort

#

iirc R has a ton of business-report level stuff as well

lapis sequoia
#

How do i delay a project

#

So it does not close right away

lapis sequoia
#

In MPL how do I get these 2 numbers to be the same? The graph isn't scaling properly.

visual tangle
#

@lapis sequoia time.sleep

#

make sure to import time
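A minimal sketch of the suggestion above, assuming the goal is just to keep a console window open for a moment after the script finishes. The half-second delay is an arbitrary choice for the example.

```python
import time

start = time.monotonic()
print("Work finished.")

# Pause for a fixed number of seconds before the script exits.
time.sleep(0.5)

elapsed = time.monotonic() - start
print(f"Exiting after {elapsed:.1f} s")

# Alternative for interactive use: wait until the user presses Enter.
# input("Press Enter to exit...")
```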

lapis sequoia
#

yes i got it

lyric canopy
#

What do you mean by scaling correctly, @lapis sequoia ? To me it just looks like your line ends at (3.00, 6.0)

#

Or, y = 2x

mossy dragon
#

how much do you think affirmative action plays into graduate admissions

lapis sequoia
#

not much man..

#

and this probably isn't the channel to talk about this

lapis sequoia
#

I want the 2 axes to be the same

supple ferry
#

@lapis sequoia, you can access and change ticks via plt.xticks(np.arange(min(x), max(x)+1, 1.0))

#

if you want to set limits, matplotlib.axes.Axes.set_xlim
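Putting the two suggestions together, here is a small sketch that gives both axes the same range, the same ticks and the same on-screen scale. The data is a stand-in for the y = 2x line discussed above, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x  # the line ends at (3.0, 6.0)

fig, ax = plt.subplots()
ax.plot(x, y)

# Give both axes the same range and the same tick positions.
upper = max(x.max(), y.max())
ax.set_xlim(0, upper)
ax.set_ylim(0, upper)
ticks = np.arange(0, upper + 1, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

# One data unit takes the same length on screen along x and y.
ax.set_aspect("equal")

fig.canvas.draw()
```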

lapis sequoia
#

so uhh

#

is matplotlib like #1 used package for datavisualization?

reef bone
#

I'd say matplotlib is like the go-to for if you just want to quickly visualize something

#

I know that Seaborn and Plotly generally produce better looking visualizations for if you want to present your results in a more sophisticated way

#

Also, I think Plotly is appropriate if you want to have interactive visualizations

lapis sequoia
#

ok

#

does it support showing outliers?

reef bone
#

Not sure what you mean by that, outliers are just data points right

#

You might need to select the correct visualization type / technique but that's not really a package specific feature

lapis sequoia
#

cuz like

#

theres a mathematical definition of an outlier

#

so like

#

is there a way you can plot.get_outliers() or something?

reef bone
#

I'm not aware of that being a feature, but I would argue that's not the package's concern

#

You can always extract them yourself and feed them to the visualization framework

#

Or colour them differently

lapis sequoia
#

yeah but like

#

how would I know which ones to color differently

#

how would I extract the outliers

reef bone
#

You said there's a mathematical formula you would like to use, why not apply it to the data and extract the data points that you wish to work with?

lapis sequoia
#

hmm

reef bone
#

Then feed them to matplotlib separately

lapis sequoia
#

I said mathematical definition but I guess I could make it into a formula

reef bone
#

Which definition are you working with

lapis sequoia
#

any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile.

#

that

reef bone
#

I would extract those thresholds from your dataset and then loop over it and filter the outliers

#

And input them separately

#

You probably only need to extract the indices, I'm fairly sure matplotlib can colour by indices

#

For the record I'm not saying that there aren't data vis packages that can do that, I just wouldn't expect it from them
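The filtering step described above can be sketched with NumPy alone. The sample values are invented, and `np.percentile` with its default linear interpolation is one of several quartile conventions, so the fences may differ slightly from other tools.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0, 9.0, 12.0, -15.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True where a point falls outside the fences.
is_outlier = (values < lower) | (values > upper)
outlier_idx = np.flatnonzero(is_outlier)

print("fences:", lower, upper)
print("outlier indices:", outlier_idx)
print("outlier values:", values[is_outlier])
```

The mask can then be used to plot `values[~is_outlier]` and `values[is_outlier]` in different colours. For a quick look, matplotlib's `boxplot` uses the same 1.5 × IQR rule by default for its whiskers and draws the points beyond them individually.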

void anvil
#

matlab and seaborn

#

are the two I use the most

lapis sequoia
#

ok

#

thanks for your help guys!

sharp jetty
#

anyone know how you can add labels to an image?
I have images that are named XR_ELBOW_patient00011_negative_0
as well as images that are named XR_ELBOW_patient00016_positive_4
essentially my labels are positive or negative

#

what im looking to do is run them through a CNN model

supple ferry
#

@sharp jetty , you can just run a script which will parse the names of images and then assign that array to label column
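A minimal sketch of such a script, assuming the label always appears as `_positive_` or `_negative_` in the file name. The `.png` extension and the 1/0 encoding are assumptions for the example.

```python
import re

# Hypothetical file names following the pattern from the question.
filenames = [
    "XR_ELBOW_patient00011_negative_0.png",
    "XR_ELBOW_patient00016_positive_4.png",
    "XR_ELBOW_patient00016_positive_5.png",
]

label_pattern = re.compile(r"_(positive|negative)_")


def label_from_name(name):
    """Return 1 for 'positive', 0 for 'negative'; raise if neither appears."""
    match = label_pattern.search(name)
    if match is None:
        raise ValueError(f"no label found in {name!r}")
    return 1 if match.group(1) == "positive" else 0


labels = [label_from_name(name) for name in filenames]
print(list(zip(filenames, labels)))
```

The resulting `labels` list lines up index by index with the file list, so it can be passed to whatever data loader feeds the CNN.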

midnight atlas
#

Hi, I'm having a problem with some feature extraction code for speech recognition. I'm trying to use a function to convert from linear magnitude spectrum to mel spectrum. I think I may just be misunderstanding how to use certain functions in Python, but any pointers would be great!

#
```python
    def make_mel_filterbank(self):

        lo_mel = self.lin2mel(self.lo_freq)
        hi_mel = self.lin2mel(self.hi_freq)

        # uniform spacing on mel scale
        mel_freqs = np.linspace(lo_mel, hi_mel, self.num_mel+2)

        # convert mel freqs to hertz and then to fft bins
        bin_width = self.samp_rate/self.fft_size # typically 31.25 Hz, bin[0]=0 Hz, bin[1]=31.25 Hz,..., bin[256]=8000 Hz
        mel_bins = np.floor(self.mel2lin(mel_freqs)/bin_width)

        num_bins = self.fft_size//2 + 1
        self.mel_filterbank = np.zeros([self.num_mel, num_bins])
        for i in range(0,self.num_mel):
            left_bin = int(mel_bins[i])
            center_bin = int(mel_bins[i+1])
            right_bin = int(mel_bins[i+2])
            up_slope = 1/(center_bin-left_bin)
            for j in range(left_bin, center_bin):
                self.mel_filterbank[i, j] = (j - left_bin)*up_slope
            down_slope = -1/(right_bin-center_bin)
            for j in range(center_bin, right_bin):
                self.mel_filterbank[i, j] = (j-right_bin)*down_slope
```
#

that's the function ^

#
```python
    # for each frame (column of 2D array 'magspec'), compute the log mel spectrum by applying the mel filterbank to the magnitude spectrum
    def magspec_to_fbank(self, magspec):
        # apply the mel filterbank
        fbank = np.convolve(self.make_mel_filterbank(), magspec)
        return fbank
```
#

that's where I'm trying to apply it to magspec

#

thanks
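Two things stand out in the code above: `make_mel_filterbank` stores the matrix on `self` but returns nothing, so `np.convolve` receives `None`, and `np.convolve` only accepts one-dimensional inputs anyway. Applying a filterbank to every frame is a weighted sum over FFT bins, which is a matrix product. Below is a standalone sketch under stated assumptions: the `lin2mel`/`mel2lin` formulas are the common 2595·log10(1 + f/700) pair (the originals are not shown), the parameter values are guesses, and `magspec` is random stand-in data with one frame per column.

```python
import numpy as np


def lin2mel(freq_hz):
    # Assumed formula; the original lin2mel is not shown.
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel2lin(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def make_mel_filterbank(num_mel, fft_size, samp_rate, lo_freq, hi_freq):
    """Return a (num_mel, fft_size // 2 + 1) matrix of triangular filters."""
    mel_freqs = np.linspace(lin2mel(lo_freq), lin2mel(hi_freq), num_mel + 2)
    bin_width = samp_rate / fft_size
    mel_bins = np.floor(mel2lin(mel_freqs) / bin_width).astype(int)

    num_bins = fft_size // 2 + 1
    filterbank = np.zeros((num_mel, num_bins))
    for i in range(num_mel):
        left, center, right = mel_bins[i], mel_bins[i + 1], mel_bins[i + 2]
        for j in range(left, center):      # rising edge
            filterbank[i, j] = (j - left) / (center - left)
        for j in range(center, right):     # falling edge, 1.0 at the center bin
            filterbank[i, j] = (right - j) / (right - center)
    return filterbank


def magspec_to_fbank(filterbank, magspec, floor=1e-10):
    """magspec has one frame per column: shape (num_bins, num_frames)."""
    mel_energies = filterbank @ magspec    # (num_mel, num_frames)
    # Floor before the log so empty bands do not produce -inf.
    return np.log(np.maximum(mel_energies, floor))


filterbank = make_mel_filterbank(num_mel=40, fft_size=512, samp_rate=16000,
                                 lo_freq=100.0, hi_freq=8000.0)
magspec = np.abs(np.random.default_rng(0).normal(size=(257, 5)))
fbank = magspec_to_fbank(filterbank, magspec)
print(filterbank.shape, fbank.shape)
```

In the class-based version the equivalent change would be to build the filterbank once, keep it on `self`, and compute `self.mel_filterbank @ magspec` inside `magspec_to_fbank`.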

lapis sequoia
#

are there any things on Kaggle for students?
to practice doing data science
or should I just pick a competition and try it

#

I just need simple data science problem to practice the basics; ones that should take about an hour to solve

lean ledge
#

@lapis sequoia there's basic intro problems there

#

Eg the house price one

mossy dragon
#

Hello guys, I have to write a short essay about a technological trend that will impact my target career (data scientist), and I was thinking about writing something about how increasingly complex models and automation are making it easier to make predictions, but since it's becoming more of a "black box" where understanding the model is not necessary, it might cause problems down the line.

#

sounds like a good idea?

marsh fog
#

Can anyone here give me a hand with Pandas?

supple ferry
#

@marsh fog , you can ask your question right away

marsh fog
#

@supple ferry I posted it in

supple ferry
#

@marsh fog , I could not find your question there. you can repost it here

marsh fog
#

I've got a list of dataframes and I'm using pd.melt to modify them:

gdp_melt = pd.melt(dataframes[0].reset_index(), id_vars=['country'], var_name="year", value_name="GDP")
gdp_melt.set_index(['country', 'year'], inplace=True)

Is there a way I can run a loop or function to melt all the dataframes in the list but change the value_name to apply to each dataframe?

supple ferry
#

is your dataset list will be something grouped ?

#

you can write a function which does that, but you will need to pass it the value for value_name

marsh fog
#

how would I go about doing that?

#

Yeah I have 3 dataframes contained in a list called dataframes so to access each one it's dataframes[0], dataframes[1] etc

supple ferry
#

do you need to modify those datasets, or return a new list with melted datasets

#

?

#

something like this maybe

#

in the last line i got a typo :D
melted_list.append

#

not meelted_list.append

marsh fog
#

Modify those datasets rather than return a new list

#

So something like this

def melt_df(df):
    df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP")
    df = df.set_index(['country', 'year'], inplace=True)
    return
dataframes = [melt_df(df) for df in dataframes]

but then it makes all my dataframes blank

supple ferry
#

if you take one df from that list and try to run that func

#

what do you get ?

marsh fog
lyric canopy
#

What does your melt_df function return? It looks like it's returning None at the moment

#

Does it modify something in place instead of returning something?

marsh fog
#

It's meant to modify the list of dataframes

supple ferry
#

So, mobile discord doesn't allow me to read the code normally @marsh fog. Ves is right 😀

#

You are using list comprehension, it means your function must return something
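The point about return values can be shown without pandas at all. In this sketch (function names and records are made up), a function that only mutates its argument returns `None`, so a list comprehension over it collects nothing useful:

```python
def rename_in_place(record):
    # Mutates its argument and has no return statement, so it returns None.
    record["year"] = str(record["year"])


def rename_and_return(record):
    record["year"] = str(record["year"])
    return record


records = [{"country": "A", "year": 1960}, {"country": "B", "year": 1961}]

lost = [rename_in_place(dict(r)) for r in records]
kept = [rename_and_return(dict(r)) for r in records]

print(lost)  # [None, None]
print(kept)
```

A bare `return`, or assigning the result of a call that works in place, ends up in the same situation as `rename_in_place`.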

hoary geyser
#
def melt_df(df):
    df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP")
    df = df.set_index(['country', 'year'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]
#

maybe?

#

return the df you modified

marsh fog
#

unhashable list error again

#

Sorry @lyric canopy this is what I'm trying and It's throwing an unhashable list error

def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"])
    df.set_index(['country', 'year'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]

hoary geyser
#

can you show a picture of the traceback

marsh fog
hoary geyser
#

value name doesnt seem like it should be a list

#

just do value_name="GDP"

#

also what IDE are you using? thats the nicest traceback ive ever seen

marsh fog
#

Now I get a Key Error: 'year'

#

Anaconda

#

JupyterLab

hoary geyser
#

ty

marsh fog
#

It's quite nice 😄

hoary geyser
#

whats the traceback for the keyerror

marsh fog
#

Can you even read that? eek

hoary geyser
#

i opened it on browser and zoomed

marsh fog
#

👌

hoary geyser
#

not sure, but it looks like set_index is expecting certain keys?

marsh fog
#

Something else Is wrong here though; because if I change the line to df.set_index(['country'], inplace=True)

#

It no longer throws an error

#

but

hoary geyser
#

so it doesnt like 'years'

marsh fog
#

It does absolutely nothing to the dataframes

hoary geyser
#

oh

marsh fog
#

it doesn't melt them at all

hoary geyser
#

what library is this so i can look at some docs

peak jetty
#

pandas?

marsh fog
#

yes

#

ideally as well I need value_name to change for each dataframe within the list - But I'm just trying to get this to work first xD

#

because there are 3 datasets in a dataframe of their own - Each of them have a different variable: HDI,GDP, Unemployment figures along with countries and years. So the value_name needs to change for each

hoary geyser
#
def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"])
    df.set_index(['country', 'year'], inplace=True)
    return df
#

when you melt it, the id_vars is ["country"] only

#

and 'year' is a var

#

does that cause issues?

marsh fog
#

Nope

#

If I run this outside of a function

#

on a dataframe not in a list

#

It works fine

peak jetty
#

Is this sensitive data? Could you dump a snippet of df.to_dict() so we could mess around with it?

marsh fog
#

ofcourse

peak jetty
#

Oh how are you getting it into your df

marsh fog
#

using glob

peak jetty
#

Oh I see, yea a snippet of the converted dict might be easier, try pd.from_dict() and let us know the conversion params, too please

marsh fog
#

would you just like the datasets? xD

#

not sure if that's easier

peak jetty
#

Maybe, I just wanted to run one line to set it up,

#

Oh, now you'll have to give us all your code though

marsh fog
#

that's okay, there is only a couple of lines

#

just imports it and then renames a few columns and then the bit I'm stuck on aha

#

using melt

#
filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding = "ISO-8859-1"))
    
def process_df(df):
    """
    intput: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(index = df.tail(6).index, columns=["Series Name", "Series Code", "Country Code"], inplace=True)
    df.rename(columns= lambda x: x[0:4], inplace=True)
    df.rename(columns={'Coun':'country'}, inplace=True)
    df.set_index('country', inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors='coerce')
dataframes = [process_df(df) for df in dataframes]
#1 is GDP, 2 is internet-users, 3 is unemployment


def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    df.set_index(['country'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]
#

@peak jetty I do much appreciate the help though dude! Thank you!

peak jetty
#

Well don't thank me yet

#

Uh, missing pandas import? import pandas as pd I'm guessing?

marsh fog
#

yeah

#
```python
import pandas as pd
```
peak jetty
#

Ok, so it's not throwing an error now, but melt_df doesn't seem to be doing anything, that's where you left off?


```python
In [2]: new_dataframes = [melt_df(df) for df in inter_dataframes]

In [3]: inter_dataframes == new_dataframes
Out[3]: True
```
marsh fog
#

yes

peak jetty
#

Is the process function working correctly?


```python
In [2]: the_df = inter_dataframes[0]

In [3]: the_df.columns
Out[3]:
Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018'],
      dtype='object')
```

[0] is the GDP df?
marsh fog
#

yes

#

it should be the GDP df, but after running pd.melt df.columns should only return GDP column

#

because countries and years should be set as indexes

peak jetty
#

I'm only looking at dataframes[0] for now

#

So the columns are years from 1960-2018?

marsh fog
#

yes

#

So melt should change those Year columns into a single column and then you get the GDP values in their own column with the value_name part of pd.melt

peak jetty
marsh fog
#

I believe value_vars is the same as value_name

peak jetty
#

Are you sure? From their example:

```python
...         var_name='myVarname', value_name='myValname')
```
#

They are passing both

marsh fog
#

oh yeah

peak jetty
#

Oh yea?

marsh fog
#

oh no

#

by default

#

value_vars is none

#

because it's columns to unpivot if not specified by id_vars

peak jetty
#

That would explain why it doesn't do anything

marsh fog
#

No, should be working?

#

value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

peak jetty
#

Oh I see, hmm but your id is the index

#

Can you actually do that? In their examples they never pass an index as the id_var

marsh fog
#

from what I looked up - To get around that this is used pd.melt(df.reset_index()

#

So it resets the index making the index a column again so that it can then be used by id_vars

peak jetty
#

Ah, got it, that would probably explain why it isn't blowing up

marsh fog
#

haha

#

that's why this df.reset_index(inplace=True) is on a separate line because if you try and run df.melt(df.reset_index, id_vars=["country"], var_name=["year"], value_vars="GDP") you get this

peak jetty
#
```
Out[10]:
                                                 country  year        GDP
0                                            Afghanistan  1960        NaN
1                                                Albania  1960        NaN
2                                                Algeria  1960        NaN
3                                         American Samoa  1960        NaN
4                                                Andorra  1960        NaN
5                                                 Angola  1960        NaN
6                                    Antigua and Barbuda  1960        NaN
```
marsh fog
#

that looks right

#

What did you do?

peak jetty
#

Only thing I did was assign a new df to the commands already here

marsh fog
#

So that's only performed it on 1/3 dataframes?

peak jetty
#
```python
    the_df = the_df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df = the_df.set_index(["country"], inplace=True)
```
#

Literally all I changed

#

It seemed like df.melt() was returning a df but not actually modifying the df it was applied to
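That observation is easy to check on a toy frame. In this sketch (made-up data; `var_name` is passed as a plain string, as in the pandas documentation example quoted earlier), `melt` hands back a new long-format frame and leaves the original untouched:

```python
import pandas as pd

wide = pd.DataFrame(
    {"country": ["A", "B"], "1960": [1.0, 2.0], "1961": [3.0, 4.0]}
)

# melt returns a new frame; the result has to be assigned to be kept.
long = wide.melt(id_vars=["country"], var_name="year", value_name="GDP")

print(wide.shape)  # unchanged: 2 rows, 3 columns
print(long.shape)  # 4 rows, 3 columns
print(long)
```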

marsh fog
#

AH

#

I wonder if df.melt can take an inplace argument

peak jetty
#

I hate how pandas does that

marsh fog
#

No it can't xD

peak jetty
#

I don't see one

#

I really hate how Pandas does that, and how there is df.melt() and pd.melt(df)

marsh fog
#

yeah it's so frustrating

#

wrong line

#

So you've kept that line after the function?

peak jetty
#

Yea

#

Oh and sorry, should be

```python
def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df
```
#

You can't even mix and match the assignments, what a mess that can become

marsh fog
#

mhmm it's telling me the_df is not defined

peak jetty
#
import glob

import pandas as pd

filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding="ISO-8859-1"))


def process_df(df):
    """
    intput: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(
        index=df.tail(6).index,
        columns=["Series Name", "Series Code", "Country Code"],
        inplace=True,
    )
    df.rename(columns=lambda x: x[0:4], inplace=True)
    df.rename(columns={"Coun": "country"}, inplace=True)
    df.set_index("country", inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors="coerce")


inter_dataframes = [process_df(df) for df in dataframes]
# 1 is GDP, 2 is internet-users, 3 is unemployment


def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df


new_dataframes = [melt_df(df) for df in inter_dataframes]
#

Maybe I missed another change?

#

I'm not getting any issues running that

marsh fog
#

All my values are blank

#

mhmm xD

#

oh no

#

It's working now

#

how shitty is that though from pandas

#

You're a legend

#

So now If I wanted to do exactly the same thing but for where you see value_name

#

for new_dataframes[0] it should be GDP, for new_dataframes[1] it should be internet-users and for new_dataframes[2] it should be Unemployment

peak jetty
#

Start by adding a param to the melt function

```python
def melt_df(df, value_name):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name=value_name)
    the_df.set_index(["country"], inplace=True)
    return the_df
```
minor solar
#

Hey y'all, I have a simple question when @marsh fog is finished with his...

peak jetty
#

Then it's up to you, instead of a list of dataframes you could loop through objects, where the object has a value_name and dataframe, but you should tie them together somehow, a dict might work, too

marsh fog
#

if I specify a dictionary of the value_names

#

How can I get the function df.melt to apply a dictionary name to the value_name

peak jetty
#

See my comment above, there are two changes to your melt function
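One way to tie each frame to its value name is a dict keyed by that name, as hinted above. This is a sketch under assumptions: `make_frame` builds tiny stand-ins for the three processed frames, the keys are taken from the comment in the posted code, and this variant indexes by both country and year and does not modify its input.

```python
import pandas as pd


def melt_df(df, value_name):
    """Return a long-format copy of df with the values named value_name."""
    long = df.reset_index().melt(
        id_vars=["country"], var_name="year", value_name=value_name
    )
    return long.set_index(["country", "year"])


def make_frame(offset):
    # Hypothetical stand-in: countries as index, one column per year.
    return pd.DataFrame(
        {"1960": [offset + 1.0, offset + 2.0], "1961": [offset + 3.0, offset + 4.0]},
        index=pd.Index(["A", "B"], name="country"),
    )


frames = {
    "GDP": make_frame(0),
    "internet-users": make_frame(10),
    "unemployment": make_frame(20),
}

melted = {name: melt_df(df, name) for name, df in frames.items()}

for name, df in melted.items():
    print(name, list(df.columns), df.shape)
```

Keying by name also avoids relying on the order in which `glob.glob` returns the files, which is not guaranteed to be sorted.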

marsh fog
#

ahh yeah, I only saw the first change; cheers dude!

#

So thankful!

peak jetty
#

No worries, happy to help, I learned there as much as you did

marsh fog
#

That's what it's all about! ahah 👍

cursive glade
#

so, i have a few machine learning scripts. id like to write a program that takes a specified dataset and evaluates that dataset on each of the ML scripts, collects the results and then outputs the statistics for each script.
i was thinking about writing an evaluation script that uses subprocesses to run the ML scripts, optionally with some passed args, and starts the next subprocess after the current one is finished.
if i run a testscript that has basically only output = subprocess.check_output(["python", "Net1.py", *some args*]) in it, i can see the errors that Net1.py throws when my assert(condition) statements are false, but if they arent and the program just runs as intended, i dont get any feedback...
i havent worked with subprocesses before, am i doing anything wrong here? or more general, is there a better way to transfer results from the ML scripts to the testscript without storing the results in say textfiles somewhere to then just read them?

supple ferry
#

@cursive glade i don't know how relevant it is, but I had problems with subprocess and multiprocessing. Depending on the IDE I could either see my errors or not. When I was running them from the command line I could see all errors; the ipython console was outputting nothing

cursive glade
#

i usually log remotely onto a machine with the necessary gpus so im not rly using any IDE, just execute scripts from the shell with python script.py *args*
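A likely explanation for the missing feedback: `subprocess.check_output` does not print the child's standard output, it returns it, while the child's standard error is passed through, which is why failed asserts are visible. A minimal sketch of collecting results that way, assuming each ML script prints its results as one JSON line; the inline child script and its fields are made up to keep the example self-contained.

```python
import json
import subprocess
import sys

# Stand-in for an ML script such as Net1.py: it prints its results as JSON.
child_code = (
    "import json, sys\n"
    "lr = float(sys.argv[1])\n"
    "print(json.dumps({'script': 'demo', 'lr': lr, 'accuracy': 0.9}))\n"
)

completed = subprocess.run(
    [sys.executable, "-c", child_code, "0.01"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

# Normal output arrives on stdout; tracebacks and warnings on stderr.
results = json.loads(completed.stdout)
print(results["accuracy"])
```

If the scripts can be restructured, importing them and calling a function that returns the metrics avoids the parsing step entirely; running them one after another with `subprocess.run` already waits for each to finish.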

lapis sequoia
#

Hey guys, I'm new to python, I need to use numpy to store multiple 2d arrays in a big 3d array basically a vector of 2d arrays, [number_of_arrays,array_x,array_x], I have troubles doing this

#

I tried to use many things, the single one that remotely worked is .append

#

but only without axis and it flattens the input

#

when I specify axis it throws errors

#
import numpy as numpy


def load_images(path):
    images = numpy.array([[[]]])
    index = 0
    while True:
        try:
            print("try: " + path + "_" + str(index) + ".npy")
            images = numpy.append(images, numpy.load(path + "_" + str(index) + ".npy"))
        except Exception as ex:
            print(ex)
            break

        index += 1

    print(images.shape)


load_images("assets/car")
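The starting array `numpy.array([[[]]])` has shape (1, 1, 0), which cannot be joined with real images along an axis, and `numpy.append` without `axis` flattens everything. A common pattern is to collect the 2D arrays in a Python list and stack them once. A minimal sketch, with generated arrays standing in for the `numpy.load` results and assuming all images share one shape:

```python
import numpy as np

# Stand-ins for the arrays that numpy.load would return, all the same shape.
frames = [np.full((2, 5), fill_value=i, dtype=np.float32) for i in range(3)]

# Stack once along a new leading axis: (number_of_arrays, rows, cols).
images = np.stack(frames, axis=0)
print(images.shape)

# Appending one more 2D array later needs a matching leading axis of length 1.
extra = np.zeros((2, 5), dtype=np.float32)
images = np.append(images, extra[np.newaxis, ...], axis=0)
print(images.shape)
```

In `load_images` this would mean appending each loaded array to a list inside the loop and calling `np.stack` after it ends.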
analog helm
#

Hey all, I am looking to work on a Python project which is essentially a simulation/simulation-like, and based on 3d voxel space/octrees (same thing?). I know I could literally just instantiate a raw block of data with numpy and then fool around with it, but unless I was misunderstanding things, I was under the impression that once your volumes start increasing in size (lets say 256^3), that becomes extremely cumbersome, inefficient, and slow, even with numpy array objects. As a very brief description of my usecase needs, I'm not doing many, or any, matrices operations or transforms, etc. Mostly just running a finite voxel space and performing checks/modifications on specific voxels every tick.

Is there a good library out there for this? Or am I mistaken in the first place for thinking that I need a special library to do this properly?

lean ledge
#

@analog helm There's nothing that can prevent large matrices from slowing down your computer by consuming memory and computational time unless you know more about the matrix. In particular, if the matrix is almost empty, you can apply sparse matrix optimisations and so on

#

Other than that, if all of it is filled with no pattern, nothing can help

analog helm
#

there will be large portions which are empty, but by no means the majority. How do games which run on standard PC hardware handle this? EG Minecraft or Dwarf Fortress. Am I just seriously overestimating the load that kind of environment applies? I kind of assumed they did something special to work with their data sets. Is it really just an array similar to what numpy would give me?

#

and thanks for the response btw @lean ledge

lean ledge
#

You split it up into chunks and only deal with small chunks of the data at a time. While the other chunks aren't loaded, they're stored on storage rather than in memory

#

That's why as you walk over to an area that hasnt loaded yet, it reads it from disk and loads it

#

As you go away from a chunk, it stores it back into disk

analog helm
#

yea, for MC. But this is a finite space, which will be permanently loaded. I should have clarified on that one, my bad. More just thinking of a specific space in MC, and managing the data in that area. If you're familiar with Dwarf Fortress, that is a much better comparison for what I expect to be doing

lean ledge
#

Never played DF, sorry

analog helm
#

well, just imagine a finite space in Minecraft I guess, with tons of entities and interactions. All the entities are bound to the octree coordinate system as well.

#

I know much of MC's data complexity has to do with buffering and streaming data off of and onto disk, but i was under the impression even besides that, there was some special handling going on. I could be wrong though!

lean ledge
#

There's nothing you can really do to avoid storing all the data you have to show in memory. It's not generally a problem since unless the memory is starting to get filled, it doesnt take much computation to have voxels stored. Minecraft often doesnt have thaaat many entities and obvious optimisations are made where you can save on interactions when they dont matter. I have no idea how Minecraft in particular works, though I'm sure some others here do

analog helm
#

Mhmm, it's not so much the data storage its self Im concerned about. RAM is cheap these days. More the iteration and random access of data

#

the only real "complexity" in storage Im aware of is differences in data density (ie, one voxel might need to only store a single byte, another voxel might need to store several), but from what I understand that can be easily resolved by having a base voxel array of single bytes, with a sparse voxel tree parallel to store extra info as necessary
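A minimal sketch of that layout, assuming numpy and using a plain dict as the sparse side structure instead of a tree (the names `base`, `extra`, `set_voxel` and `get_voxel` are made up for illustration):

```python
import numpy as np

SIZE = 64  # kept small here; a 256**3 grid at one byte per voxel is 16 MiB

# Dense base layer: one byte (for example a material id) per voxel.
base = np.zeros((SIZE, SIZE, SIZE), dtype=np.uint8)

# Sparse side table: only voxels that need more than one byte get an entry.
extra = {}


def set_voxel(pos, material, payload=None):
    """Write the material id and, optionally, extra per-voxel data."""
    base[pos] = material
    if payload is None:
        extra.pop(pos, None)
    else:
        extra[pos] = payload


def get_voxel(pos):
    """Return (material id, extra data or None) for one voxel."""
    return int(base[pos]), extra.get(pos)


set_voxel((1, 2, 3), 7)
set_voxel((4, 5, 6), 9, {"temperature": 310})
```

Both lookups are constant time per voxel, so per-tick checks on specific voxels should not depend on the size of the volume.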

lean ledge
#

It's all in memory! It can be accessed randomly as it pleases with no slowdown. You give modern computers less credit than they deserve given they can render complex 3d scenes, I feel like IO with the data isnt where the main slowdowns would be concerning: rendering would be the bigger problem since that's what needs to be done real time and is just a time consuming process in comparison to fetching whatever few blocks exist

analog helm
#

fair enough

#

On that topic, are there existing libraries which specialize in rendering voxel data? Or am I reaching on that one?

#

My main interest here is really just the concept its self, so the less I have to reinvent, the better. The underlying engine is of little interest to me.

#

Something which has some amount of pre-existing code for determining what is or isnt visible based on what voxels are surrounding an area, what is or isnt transparent, etc

#

volumes would be dynamically generated, and the user would be able to specify which areas of the volume they want to look at, so which voxels are or arent visible in any given situation is dynamic. Dunno if thats something which can be semi-automated via library, or if it is.

#

In either case, thanks for your input on this

lean ledge
#

Cant say I know much about gamedev, it just intersects with other things I actually enjoy. But there are voxel based game engines and it might help to see how they handle things. Probably some combination of not worrying about what isnt visible, combining many blocks into the same mesh with fewer vertices, doing weird stuff with lighting, chunks, etc

analog helm
#

Alright, I'll keep looking around at other projects, I know of a few. Thanks for talking it out with me @lean ledge!

mossy dragon
#

hello guys

lapis sequoia
#

hello, if you want to ask a question can you ask it

grave fog
#

Does anyone have experience with Jupyter Notebooks? Can't seem to make my kernel work

#

Or any kernels. Managed to install everything correctly on another machine, but this one is just throwing errors

supple ferry
#

can you also paste your errors in formatted way?

grave fog
supple ferry
#

Which tensorflow you have and which python?

upper ginkgo
#

Is there any open source bot that uses natural language commands out there?

wraith sage
#

Hello

#
```python
import tensorflow as tf

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

W = tf.Variable([0.3], dtype=tf.float32)
```
#

is 0.3 the initial value of W?

#

If so, how can I print the value from W?

lean ledge
#

You'll have to make a session and evaluate the computation graph for the variable. Pass it to a session.run() call after initialising the variables @wraith sage

wraith sage
#

without initializer, it wont work. What is the global_variables_initializer()?

lean ledge
#

It just initialises variables. Tensorflow objects are abstract and you're essentially compiling the logic you want to run into a computational graph and then working on it in a session

serene veldt
#

Greetings, im trying to create multiple bags for a sort of bootstrap aggregation algorithm

#

is there a more efficient way than subsampling the dataset n times with pandas.DataFrame.sample ?

#

like, for example, at least something similar but with an n_samples parameter so i dont have to write a list comprehension or iterate

supple ferry
#

Not in pandas
But in sklearn.ensemble you can find what you need

serene veldt
#

how so? they only have the classifiers made with ensembles, cant find anything regarding the subsampling part, unless i go into source code
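For the subsampling part on its own, one option is to draw all the row positions in a single numpy call and only then slice the frame. This is a sketch, not something from sklearn; `make_bags` and the toy columns are made-up names, and it assumes each bag should have as many rows as the original data, drawn with replacement:

```python
import numpy as np
import pandas as pd


def make_bags(df, n_bags, seed=0):
    """Return n_bags bootstrap resamples of df, each with len(df) rows."""
    rng = np.random.default_rng(seed)
    # One (n_bags, n_rows) matrix of row positions, drawn with replacement.
    idx = rng.integers(0, len(df), size=(n_bags, len(df)))
    return [df.iloc[row] for row in idx]


df = pd.DataFrame({"x": range(10), "y": range(10, 20)})
bags = make_bags(df, n_bags=5)
```

There is still a comprehension at the end, but the random draws happen once, and the index matrix can be kept around if you later need to know which rows were left out of each bag.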

sharp jetty
#

using a CNN for a binary image classification, should I use a validation set or simply train and then run it through a test set?

empty current
#
```python
def mbo(n, others):
    # Minimum number of set bits in n ^ b over every b in the set.
    lowest = None  # stays None if the set is still empty
    for i in others:
        x = bin(n ^ i).count('1')
        if x == 0:
            return 0  # exact match, cannot do better
        if lowest is None or x < lowest:
            lowest = x
    return lowest


s = set()

for _ in range(int(input())):
    e, a = input().split(' ')

    if e == "1":
        s.add(int(a))
    else:
        print(mbo(int(a), s))
```
#

This is apparently really inefficient,
so I am trying to find a way to reduce the complexity.
What this is supposed to do is:

```
Person1 and Person2 are playing an XOR game. Initially, Person1 has an empty set of integers. Then a sequence of N events happens. There are two types of events:

Person1 chooses integer A and adds it to the set;
Person2 chooses integer A and passes it to Person1 who finds integer B in the set such that integer A⊕B contains minimal possible number of 1s in its binary representation. Here ⊕ is a bitwise exclusive or operation, for more details check Wikipedia page.
Your task is to help Person1 finding minimal possible number of 1 bits in binary representation of A⊕B.

Input
The first line contains integer N. Each of the following N lines describes an event as two integers T and A separated by a single space. Here T is an event type.

Output
For each event of the second type print the corresponding minimal number of 1 bits in a separate line.
```
#

Don't need code, need algorithm
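One possible algorithm, assuming the integers fit in a small fixed number of bits (the statement above gives no limits, so `bits=16` is a guess): keep a table with the current answer for every possible value, and on each insert run a breadth-first search that flips one bit at a time and only continues where it improves the table. Queries then become a single lookup. `XorIndex` is a made-up name for this sketch:

```python
from collections import deque


class XorIndex:
    """Answers 'fewest 1 bits in a ^ b over all added b' with one lookup."""

    def __init__(self, bits=16):
        self.bits = bits
        self.unset = bits + 1  # larger than any real Hamming distance
        self.dist = [self.unset] * (1 << bits)

    def add(self, value):
        """Insert value and push its distances outward, breadth first."""
        if self.dist[value] == 0:
            return  # already present
        self.dist[value] = 0
        queue = deque([value])
        while queue:
            v = queue.popleft()
            d = self.dist[v] + 1
            for p in range(self.bits):
                u = v ^ (1 << p)
                if self.dist[u] > d:  # only walk on where we improve
                    self.dist[u] = d
                    queue.append(u)

    def query(self, a):
        """Minimal popcount of a ^ b, or None while the set is empty."""
        d = self.dist[a]
        return None if d == self.unset else d
```

Each table entry can only decrease, so the total work over all inserts is bounded by roughly `2**bits * bits**2` steps no matter how many events there are. If the values can be much wider than that, this table gets too big and a different idea is needed.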

empty current
#

No one?

void anvil
#

you should probably go to help and not data science

lapis sequoia
#

@sharp jetty you don't need a validation set

sharp jetty
#

@lapis sequoia why is that?

#

shouldnt i run the model on a validation ,save the weights and then run it through a test?

lapis sequoia
#

depends on the problem

#

if it's binary classification.. you should have enough examples to not need additional validation

#

what sorta images are you trying to classify

sharp jetty
#

x ray

#

its binary classification

lapis sequoia
#

and how many examples do you have for training

#

for each class

#

you don't need to do validation unless your classes are imbalanced..

#

this type of problem requires that you have large enough training and test sets.. then just plot/get roc for your predictions

sharp jetty
#

well im looking at different body parts individually, but for the most part i dont have class imbalance

#

on the low side i have about 500 images per class and on the high side i have about 5000 per class

lean ledge
#

@lapis sequoia You definitely need a validation set

#

Binary classification does not dictate whether or not you have enough examples

#

@sharp jetty Dont listen to them, you always need a validation set to prevent overfitting. If you're really short on data, maybe dont have a test set and use validation accuracy at face value

#

You have no idea when to stop training without validation

sharp jetty
#

yea thats what i was thinking

#

@lean ledge thanks again

#

btw so when i run my model and obtain the highest possible accuracy on validation do i save the weights and run it on a test?

lean ledge
#

Yeah, generally how it works. Most people's workflow looks like giving the framework both train set and validation set and then monitoring the output values on tensorboard or equivalent until validation stops improving (in which case you save and try to drop learning rate further) or starts getting worse (overtraining). It's a good idea to use something like TF's Object detection API or similar since that has essentially everything set up for you with pretrained networks to retrain, all outputs configured as necessary etc

#

Instruct your framework to just output a model every epoch for the latest 5-10 epochs or something along those lines. I think object detection API is 5 by default but I prefer 10 because I tend to miss a bit
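The "stop when validation stops improving" rule can be sketched without any framework. This is only an illustration: `best_epoch`, the `patience` parameter and the loss values are made up, and real frameworks usually ship their own early stopping and checkpoint callbacks:

```python
def best_epoch(val_losses, patience=5):
    """Return (index of the best epoch, index where training would stop)."""
    best, best_at = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_at = loss, epoch  # a checkpoint would be saved here
        elif epoch - best_at >= patience:
            return best_at, epoch  # no improvement for `patience` epochs
    return best_at, len(val_losses) - 1


losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.63, 0.66]
print(best_epoch(losses, patience=3))  # (3, 6)
```

The weights you would carry over to the test set are the ones saved at the best epoch, not the ones from the epoch where training stopped.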

sharp jetty
#

hmm a lot of what you just said im not familiar with lol. First time working with CNN's. Any resource you would recommend to implement what you described?

lean ledge
#

What's your specific task? @sharp jetty

#

binary classification, no object detection or anything?

sharp jetty
#

binary classification of x ray images

#

abnormality vs normal

#

and i have 7 different body parts which ill run a model on seperately

#

in which case im not sure if i should use the weights learned from one body part to another

lean ledge
#

How much data do you have? @sharp jetty

lapis sequoia
#

dont at me.. do you have a job or pushed anything to production?

#

all I see you doing is posting inane academia stuff with little to no background knowledge or know-how of actual application..

sharp jetty
#

i have anywhere from 500 to 5000 images per class for each body part

lapis sequoia
#

not you coldchillin.. this other person I have blocked who continues to feel like he needs to chime in on things he doesn't understand

#

just because validation is part of usual workflow doesn't mean it's something you use to update weights for every problem.. reason you don't use validation is you're tweaking weights to fit certain images, that isn't something you do when it comes to medical applications..

lean ledge
#

...validation isn't used to update weights, it's used to avoid the scenarios of overfitting

sharp jetty
#

hmm, the dataset im working with is part of a competition and is separated into train, valid and testing folders

lapis sequoia
#

he previously talked about using validation to update weight..

#

oh so it's for kaggle.. or something and not actual application

#

is that right?

sharp jetty
#

yea something like that

lean ledge
#

I have both a job in a data science company where I do computer vision among other things and I'm currently at CSIRO'S Robotics and autonomous systems group chugging through literature reviews on my first computer vision paper for multi camera detection and tracking in an industry application. I've won hackathons using computer vision also. If you feel the need to disparage me because you don't like me disagreeing with you, then go ahead

sharp jetty
#

not kaggle but a competition nonetheless, im just doing it for practice

lean ledge
#

@sharp jetty I'd start by trying to retrain a pretrained resnet model (avoid inceptionnet, it tends to have worse results in biomedical) on body part with the most data and then try to transfer learn with those weights on parts with smaller datasets

lapis sequoia
#

I've blocked you because you're an annoying person with no knowledge of actual application.. I don't feel the need to waste my time with people who feel the need to annoy and spew half knowledge..

sharp jetty
#

by other classes you mean body parts?

lean ledge
#

yep

#

edited for clarity

sharp jetty
#

also my aim was to try and produce a model from scratch and then compare that to one thats pretrained

#

right now im still working trying to figure out how to optimize my scratch model

lean ledge
#

if you're trying to compare architectures, it's only a fair battle if they're all trained on the same dataset

sharp jetty
#

but im getting horrid results lol

#

yea i plan on training them both on the same set

#

just doing them seperately

lean ledge
#

models from scratch will rarely beat a model that's been pretrained and then trained again on the same dataset

sharp jetty
#

yea i figured but is it that huge of a difference?

#

like im getting 46% validation accuracy on my model so far

lean ledge
#

can be pretty significant because training from scratch makes it easier to overfit. remember, something like imagenet has 14 million images and a model pretrained on it will have learnt very very generic features in early layers

#

it only needs to change some logic in the last few layers to start identifying new classes

sharp jetty
#

hmm... what do you think of my validation accuracy results though is it something i should expect? for reference other ppl in the competition are getting like 70% test accuracy

lean ledge
#

cant say anything about validation accuracy without actually being there, lots of reasons it can be low. bad optimisation/training, problem with the model architecture, problem with datasets

sharp jetty
#

this is my architecture for reference

south quest
#

!warn 389084425566289930 Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: warned @lapis sequoia (Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.).

lean ledge
#

Oh, that's a very shallow network. Very much possible that itself is the problem. The more layers there are, the more complex things it can learn. Only a few layers cant learn very many things. For reference, state of the art in many applications have 100s of layers, many layers having complex structures like being a combination of multiple kinds of convolutions for complex features but without loss of speed, or gated units for easier optimisation and so on

#

I have no idea what kind of problem that is but adding a couple more layers may or may not help

#

If it doesnt, i'd try messing around with hyperparameters. Lowering learning rate after you've reached your 0.5 accuracy or whatever

sharp jetty
#

layers as in adding more conv layers?

#

and yea i realized that my arch was pretty basic its something i tweaked from the MNIST project i did earlier

lean ledge
#

try reading up on the architecture of ResNet. it's essentially a simple architecture but it got state of the art a few years ago. it has lotssss of layers

#

CS231n is a great course to go through btw if you're interested in deep vision

#

should go over everything you need to start reading actual literature really

sharp jetty
#

yea any resources would be a great help, it seems i know too little to do a decent from scratch model on my own

lean ledge
#

nah all good, knowing what you know is a great place to start off. just need something to accelerate you into recent literature

#

lectures 6, 7, 9 and 10 should get you up to date somewhat

#

everything 11 onwards is not exactly core deep vision knowledge

#

except maybe generative models

sharp jetty
#

nice! thanks ill look into it

#

btw one more question, in terms of accuracy you usually get a higher accuracy for validation than test right?

#

as ur tuning the model weights on the validation

#

and running it on the test

lean ledge
#

if there is a discrepancy at all, yeah, validation is tiny bit higher because you stop based on when validation gets worse. but otherwise, the difference tends to be relatively minor-ish since validation only comes in play with figuring out when to stop rather than actually helping with training

sharp jetty
#

so if i just wanted to show the efficacy of my model would it be enough to just show my validation score?

#

because in that case what would be the reason for running a test?

lean ledge
#

it's generally bad practice to use validation in an actual formal setting (competitions, academia, etc) but nobody would really care in industry too much given there's little data so having no test is justified

sharp jetty
#

hmm so it would be better if i just take a small sample of images for each class and use that as a test?

#

would 50 images per class suffice?

lean ledge
#

everyone I know does it all percentage-wise. Out of all the data you have, splitting it 80/15/15 sounds fairish for train/validation/test.

#

but honestly, people's splits are anything from 80/15/15 to 70/20/10 to many in between
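A percentage-wise split can be done on shuffled indices. This is a sketch assuming numpy; `split_indices` is a made-up helper, the default fractions are just an example, and whatever is left after validation and test goes to training so the three parts always add up to the whole dataset:

```python
import numpy as np


def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index arrays."""
    if val_frac + test_frac >= 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    idx = np.random.default_rng(seed).permutation(n)
    n_val = int(round(n * val_frac))
    n_test = int(round(n * test_frac))
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 700 150 150
```

For the per-class question above, the same function could be applied to each class separately so every class keeps the same proportions in all three parts.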

sharp jetty
#

got it, thanks so much man you've been a tremendous help

lean ledge
#

nw, glad to help

quiet gyro
sharp jetty
#

just a quick question regarding CNN implementation, if I'm doing a binary image classification and running a model separately on 7 different body parts(ex hand, wrist, humerus, etc) so ill run a model on wrist only binary classification and then wrist and so on. Should I save the weights from each model and run them for the next?

lean ledge
#

@sharp jetty Id train for the part with the largest dataset first, and then reuse weights from the model for retraining other parts with smaller datasets

sharp jetty
#

would i use the adjusted weights after each dataset?

#

ex i train on hand dataset -> use weights to train on arm->use new weights train on shoulder->use new weights to train on wrist

#

or just use one set of weights

lean ledge
#

That, I have no clue about. I don't expect the difference to be huge either way.

magic mauve
#

I hope this is the appropriate place to ask, which module would be better for working with csv files, pandas or csv?
I'd assume csv since its in the name but I heard a lot about pandas as well

polar acorn
#

Depends on how you mean work with them. If you mean read a csv to a dataframe fiddle around with it and save it as csv I would use pandas.

pale eagle
#

From where i can find the projects for data analytics

magic mauve
#

I just want to take some data from a csv file and find averages, max, and mins
@polar acorn

polar acorn
#

I would use pandas, it has a read_csv function that handles most csv's. And finding summary stats is quite straight forward.

#

It might take some getting used first time you use it. But its a great tool to know, so it's a worthwhile investment.
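A small sketch of what that looks like, assuming pandas is installed. The CSV content and the column names `city` and `temp` are invented for the example; with a real file you would pass its path to `read_csv` instead of the in-memory text:

```python
import io

import pandas as pd

# Stand-in for a real file on disk.
csv_text = """city,temp
Oslo,3.5
Cairo,29.0
Lima,19.5
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df["temp"].mean())  # average of one column
print(df["temp"].max())   # 29.0
print(df["temp"].min())   # 3.5

# Several statistics at once, returned as a small labelled Series.
summary = df["temp"].agg(["mean", "max", "min"])
print(summary)
```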

magic mauve
#

Appreciate it

wraith sage
#

Hello

#

What is the best way to learn Tensorflow?

supple ferry
#

@magic mauve go for pandas all the way. It can work with lots of data types, and not only it allows to manipulate the data but also save it in other formats, plot (simple plots)

magic mauve
#

I'd assume matplotlib would be better for more complicated plots though?

proven ravine
#

Hey guys,

so I am quite new to programming in python but there is a question that is bugging me. There is a project in my head which would be awesome to realize but I dont know if it's achievable in any way (technology or skillwise).

The thing is I have been doing quite a lot of demo trading forex currencies for around a year. I did find a pattern that kind of worked for me, the problem is, that I can mostly only see the pattern in the past. If I look at old graphs I will see all the spots where it worked but I can never really make it work for me, if the plot hasnt formed yet.

Would there be a way to feed the computer some currency graph, give it also the points where I would have taken long or short trades because of my pattern with the stop loss (point where you would sell automatically, to cut losses) and where you would put the take profit and the machine (AI, neural network/whatever) spits out the most likely trade in the present?

supple ferry
#

@magic mauve , matplotlib is good. there are other libraries that are built on top of it and are higher level. maybe you will like them more. seaborn, plotly e.g

#

@proven ravine , there is an entire concept for such things. Algorithmic trading. Its been around for decades now and with the rise of AI and ML, lots of people find it interesting.
I hope it is okay to post reddit links here, so, if you are interested in such stuff, you can head to https://www.reddit.com/r/algotrading

proven ravine
#

@supple ferry but is it doable for an average person like me to program such an algorithm?

supple ferry
#

it is doable. but it probably won't be useful. you cant compete with bigger machines

void anvil
#

It depends on the uniqueness of your signal

#

tbh

#

You can market make with relatively simple rules and be profitable, but you're being paid to take on risk

#

You would want to sample whenever you see a pattern emerging

#

and either rule-base sample in

#

for live trading

#

or have it predict if it's a trade or not

proven ravine
#

rule-base sample means its like a condition which needs to be met?

void anvil
#

yes

proven ravine
#

for example, one of the patterns includes a certain formation of three candles, which need to be the lowest or highest of a certain move

void anvil
#

like if the market moves up 5%

#

in 5 minutes

#

then predict whether to make a trade
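A rule like that can be written as a plain scan over the price series. This sketch assumes one price per minute; `rule_triggers` and the sample prices are made up, and the 5% / 5 step values are just the numbers from the example above:

```python
def rule_triggers(prices, window=5, threshold=0.05):
    """Indices where the price rose by at least `threshold` over `window` steps."""
    hits = []
    for i in range(window, len(prices)):
        change = prices[i] / prices[i - window] - 1.0
        if change >= threshold:
            hits.append(i)
    return hits


prices = [100, 100, 101, 101, 102, 106, 107, 104, 103, 103]
print(rule_triggers(prices))  # [5, 6]
```

The returned indices are the moments you would sample, either to trade on directly or to hand to a model that decides whether it is a trade or not.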

proven ravine
#

ah okay

#

I was hoping I could just throw data into a black box and get an algorithm 😂

#

as said, marking all the trades with numbers in a chart of the past (for example last 5 years) and the program figures out what it needs to do, to predict present trades

#

something like that. I thought thats how machine learning or AI works until now, but I guess I need to look into it a little more 😄

void anvil
#

that is how it works

#

you have to mark the trades to sample

#

and mark the trades for the prediction

#

Basically instead of your data set being image classification of a cat or a dog

#

you take your stock history or forex or w/e

#

and create the "This is an example of a long trade"

#

and create the "this is an example of a short trade"

#

then feed in new data and predict if you should go long, short, or not trade

proven ravine
#

hmm, any tips with which "software" or whatever I can do it "easily" or test it?

#

or any docs which I should read into ?

void anvil
#

hahaha

#

do it in python

#

you'll have to create most things from scratch

#

good luck finding anything useful besides stats packages like ffn

#

and bt

#

but you're better off writing your own backtesting functions

#

99% of all signal generation is going to need to be scripted yourself

proven ravine
#

thanks for the help 😃

#

thats at least a project which could be helpful in the future so I am more eager to continue with it 😃

deft bough
#

hey I am trying to plot a 2d gaussian distribution with matplot3d but I have some problems.

cset = ax.contourf(temp1, temp2, Z, zdir='z', offset=-0.15, cmap=cm.viridis)
ax.set_zticks(np.linspace(0,0.6,5))
#

that is my code to plot and the result. I would really appreciate it, if you could help me to make my plot OK

lapis sequoia
#
def distance_parse():
    hist_list = []
    with open('20190326_distance_v1', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            try:
                m = re.search('sdf : (\d+.\d+)', line).group(1)
                hist_list.append(m)
            except:
                pass

        print(hist_list)
        plt.hist(hist_list, bins=5)
        plt.show()

This code will print and draw like this.

proven ravine
#

@void anvil do you know if codecademys machine learning course is helpful?

lapis sequoia
#

How can I fix it??

void anvil
#

no idea

lapis sequoia
#

I guess `plt.hist(x, **kwargs)` could be the key to this problem, but I don't know these options.

#

ah.. maybe type..
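The type guess is likely right: `re.search(...).group(1)` returns strings, and a list of strings is treated as categories rather than numbers when it is plotted. A sketch of the parsing step that converts to float first (the sample lines are invented, since the real file is not shown):

```python
import re

# Escaped dot so that only a literal decimal point matches.
PATTERN = re.compile(r"sdf : (\d+\.\d+)")


def parse_distances(lines):
    """Pull the number after 'sdf : ' out of each line, as floats."""
    values = []
    for line in lines:
        match = PATTERN.search(line)
        if match:  # skip lines without a match instead of try/except
            values.append(float(match.group(1)))
    return values


sample = ["id 1 sdf : 0.25", "no match here", "id 2 sdf : 1.50"]
values = parse_distances(sample)
print(values)  # [0.25, 1.5]
# values can then be passed to plt.hist(values, bins=5)
```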

proven ravine
#

@void anvil would you say my goal is more likely to be supervised or unsupervised learning? Cause I am not sure while reading the description of the two ... I am leaning more toward supervised?

void anvil
#

if you have a trading pattern you want it to learn it's supervised

#

if you want it to learn trading patterns, unsupervised

proven ravine
#

ehhh I am confused

#

the second is just giving it data and waiting for what happens

#

?

#

and the first one is where I give it my long/shorts and it learns to use those in present situations

void anvil
proven ravine
#

good read!

vestal axle
#

Hi guys, any of you familiar with R-studio?

void anvil
#

A bit

#

what do you need from it?

void anvil
#

@proven ravine

Supervised: You tell the algorithm this is class A, B, C. Determine if this new stuff is A, B, C.

Unsupervised: Here's a bunch of shit, tell me what you can predict

kindred stirrup
#

@vestal axle yes I'm familiar with R Studio

#

Have any of y'all used facebook's prophet? Wondering how it compares to ARIMA for time series

sharp jetty
#

with regards to early stopping, should i stop my model when it has the highest accuracy for validation or when the val_loss values are lowest?

chilly shuttle
#

accuracy is presumably what you're interested in

sharp jetty
#
hist = classifier.fit_generator(
        training_set_finger,
        steps_per_epoch=(5064/8),
        nb_epoch=60,
        validation_data=valid_set_finger,
        validation_steps=(461/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])
#

i have a question regarding this code which is part of a CNN im running. If i run this code and another but using a different dataset are the two pieces of code interacting in any way?

#
hist = classifier.fit_generator(
        training_set_humerus,
        steps_per_epoch=(1230/8),
        nb_epoch=60,
        validation_data=valid_set_humerus,
        validation_steps=(288/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])
#

as in when i fit on one set of data and then fit on another are the weights or other parameters being affected?

mossy dragon
#

anyone done the titanic competition on kaggle?

polar acorn
#

@kindred stirrup In my experience prophet is quite good out of the box while ARIMA models would need more tinkering to achieve the same level of performance.

chilly shuttle
#

oh cool, TIL prophet

midnight atlas
#

Hi, still looking for help with speech recognition front end in Python, would be great to get in touch with someone familiar with speech rec. Thanks!

serene veldt
#

need some help normalizing data

#

i have tried most of sklearn.preprocessing module but nothing suits my needs

#

i have a semi sparse set of values

#

and wish them to normalize them into floats from range [0,1]

#

scale, binarize, minmax dont work

paper niche
#

what does "don't work" mean?

frail crow
#

Hey guys, first timer, learning on the job ds - How do I detect abnormal subseries in periodic and synchronous time-series data?

sharp jetty
#

if im doing a CNN model looking to classify abnormal vs normal images, would my dense output be 2 or 1?

void anvil
#

@serene veldt

assuming you have list x:

fixed_x = (x-min(x))/(max(x)-min(x))
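As written, that formula needs `x` to be a numpy array (or a pandas Series), because a plain Python list does not support subtracting a number. A sketch with numpy, where `min_max_scale` is a made-up helper and the all-equal case is handled explicitly:

```python
import numpy as np


def min_max_scale(x):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x)  # constant input: avoid dividing by zero
    return (x - lo) / (hi - lo)


scaled = min_max_scale([2, 0, 0, 10, 4])
print(scaled)  # [0.2 0.  0.  1.  0.4]
```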

narrow obsidian
#

Yo guys, anyone here is experienced with numerical methods and jupyter?

#

hitting a wall trying to solve a ODE with bisection and FPI

serene veldt
#

Yeah I figured it out after some time

#

I was standardizing the values

#

Which was the problem

civic saffron
#

I'm trying to come up with a quick method for determining document similarity to a given query phrase (a list of words). I am fine if it's not extremely accurate, as long as the majority of documents returned are at least "somewhat" similar to the given query. It is important that it is fast though (I need to be able to process at around one hundred 1000-word documents per second, including tokenizing, vectorization, and scanning for matches). I have come up with the following method for extracting a set of similar words to a given term:

#

def simset(query_word, set_size=10, depth=1, size_decay = 0, threshold_score=0.33, with_scores=False):
    if size_decay < 0 or size_decay > 1: 
        raise ValueError("decay rate must be in interval [0,1]")
    
    simsets = [set([query_word])] # list of sets, one per level, query word at root
    level_set_size = set_size        # size of simset for each word @ current level
    for level in range(1, depth+1):      # for each level of depth ...
        
        level_set = set()
        for word in simsets[level-1]: # for each word in the *previous* level's simset
            if level_set_size >=1:
                level_set.update({w[0] for w in vectors.most_similar(word, topn=level_set_size)})
            else: # if decay rate results in set size < 1, just get 1 word for each following level
                level_set.update({w[0] for w in vectors.most_similar(word, topn=1)})

        # remove words from previous levels, to avoid duplicate simset() calls
        for l in range(level-1,-1,-1):
            level_set -= simsets[l]
    
        simsets.append(level_set)
        
        level_set_size = round(level_set_size * (1 - size_decay))
        
    simset = set.union(*simsets)
    if with_scores:
        return {(vectors.similarity(query_word, w), w) for w in simset if vectors.similarity(query_word, w) > threshold_score}
    else:
        return simset
#

The above code gets a list of most similar words to query term, then for each of those it gets most similar, and so on .. to depth levels. The simset of a query phrase is just the union of the simsets for each of the individual terms. Based on this "simset", I then get a "similarity score" for a given text to the query phrase, by comparing the words in the text with the simset of the query phrase with the following function (similar to Jaccard similarity where I am measuring size of intersection between the two):



def skim_text(query, text, word_simset_size=10, depth=2, freq_threshold = 0.0005):    
    query_words = word_tokenize(filter_stopwords(query)) 
    query_simset = set.union(*[simset(w, word_simset_size, depth) for w in query_words])
    query_simset = {(word_freq(w, 'en'), w) for w in query_simset if word_freq(w, 'en') < freq_threshold}

    text_word_count = len(text.split()) 
    if not text_word_count > 0:
        raise NLPError("skim_text() requires a non-empty string as input", text = text)
        
    escaped_words = [re.escape(w[1]) for w in query_simset]
    re_query_simset = r'(' + '|'.join(escaped_words) + ')'
    match_count = len(re.findall(re_query_simset, text))

    # log(1) == 0, so a one-word text would divide by zero below.
    if match_count == 0 or text_word_count < 2:
        return 0

    score = math.log(match_count+1) / math.log(text_word_count)
    print(f"DEBUG: match_ct: {match_count}, text_word_count: {text_word_count}, score: {score}")
    # Clamp to [0, 1]: substring matches can outnumber the words in the text.
    return min(1, max(0, score))
#

I am looking for feedback on how to improve my code to make it faster / more accurate, or if there are pre-existing tools that I should be using instead that can operate at the speed I need?

#

... also tips on how to just improve my python would be appreciated, because there are probably several things I'm doing here that could be done more efficiently 😃
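One direction that might help with speed, sketched under assumptions: build the expanded word set once per query instead of inside every `skim_text` call, then score each document by tokenising it once and testing set membership, which also avoids matching "cat" inside "category" the way an unanchored regex alternation does. `skim_score` and the tokeniser pattern are made up for this sketch, and `query_terms` stands in for the precomputed simset:

```python
import math
import re

TOKEN = re.compile(r"[a-z0-9']+")


def skim_score(query_terms, text):
    """Score in [0, 1]: log-scaled count of text tokens found in query_terms."""
    terms = {t.lower() for t in query_terms}
    tokens = TOKEN.findall(text.lower())
    if len(tokens) < 2 or not terms:
        return 0.0
    # Set membership is a constant-time check per token.
    hits = sum(1 for tok in tokens if tok in terms)
    if hits == 0:
        return 0.0
    return min(1.0, math.log(hits + 1) / math.log(len(tokens)))


score = skim_score({"cat", "dog"}, "The cat sat on the cat mat")
```

The scoring formula is the same log ratio as in the code above; only the matching step differs.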

light cloud
#

How would you explain additive models to a 5 year old

obtuse skiff
#

Im clustering using kmeans and DBSCAN using sklearn
the parameters for kmeans are self explanatory just how many different clusters I want
but how do I go about chosing the parameters for DBSCAN, it wants eps and min_samples

obtuse skiff
#

Also, is there a library for Sum of Squared Error?

supple ferry
#

@obtuse skiff sklearn.metrics has mean_squared_error, from which you can get the root mean squared error (RMSE). If you want, you can calculate the sum yourself easily.
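Calculating it by hand is a couple of lines with numpy. This sketch shows the plain version and a clustering version, where each point is compared with the centre of its assigned cluster; both function names are made up. For k-means in sklearn, the fitted model's `inertia_` attribute already holds that second quantity:

```python
import numpy as np


def sse(y_true, y_pred):
    """Sum of squared differences between two equally shaped arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(diff ** 2))


def cluster_sse(points, labels, centers):
    """Sum of squared distances from each point to its cluster centre."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # centers[labels] picks, for every point, the centre it is assigned to.
    return sse(points, centers[np.asarray(labels)])


print(sse([1, 2, 3], [1, 3, 5]))  # 0 + 1 + 4 = 5.0
print(cluster_sse([[0, 0], [2, 0], [10, 0]], [0, 0, 1], [[1, 0], [10, 0]]))  # 2.0
```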

obtuse skiff
#

@supple ferry do you know how I do the parameters for dbscan?

supple ferry
#

Never used it, sorry

#

As per documentation, @obtuse skiff , those arguments are optional:

Parameters:    
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.

eps : float, optional
The maximum distance between two samples for them to be considered as in the same neighborhood.

min_samples : int, optional
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

polar acorn
#

@obtuse skiff
Pasted from Introduction to Machine Learning with Python A Guide for Data Scientists

Increasing eps means that more points will be included in a cluster. This makes clusters grow, but might also lead to multiple clusters joining into one. Increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.

The parameter eps is somewhat more important, as it determines what it means for points to be “close.” Setting eps to be very small will mean that no points are core samples, and may lead to all points being labeled as noise. Setting eps to be very large will result in all points forming a single cluster.

The min_samples setting mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters. If you decrease min_samples, anything that would have been a cluster with less than min_samples many samples will now be labeled as noise.
#

So knowing that, I assume the best way to set the parameters is to try and see what clusters you get and adjust the parameters so that the number and size of clusters makes sense to you.
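One common heuristic for narrowing down eps before trying values by hand is to look at each point's distance to its min_samples-th nearest neighbour. The sketch below uses made-up toy data and picks a percentile of those distances as a starting eps; both the data and the 90th-percentile choice are assumptions, not a rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy data (an assumption): two tight blobs plus three far-away outliers.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.2, size=(50, 2))
outliers = np.array([[10.0, -10.0], [-10.0, 10.0], [12.0, 12.0]])
X = np.vstack([blob_a, blob_b, outliers])

min_samples = 5

# Distance from every point to its min_samples-th nearest neighbour.
# Querying the training data returns the point itself as the first neighbour,
# which matches DBSCAN counting the point itself towards min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Starting guess for eps; plotting k_distances and picking the "knee" by eye
# is the usual alternative to a fixed percentile.
eps = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

From there, eps and min_samples can be nudged up or down while watching how the cluster count and the number of noise points (label -1) change.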

modest halo
#

hey everyone. I'm working on an application that does a bunch of file input/pandas transformations/statistics, and i'm using PyCharm as my IDE. I'm wondering if anyone could give their workflows while they're working on a project -- specifically in cases where you import a lot of data and are beginning to write a new function to do ~something~, like create a matplotlib plot, run a statistical test, create outputs, etc.

I find I waste a lot of time because my code needs to import all my files, sort them, clean them, etc. Currently, I set a breakpoint at the start of my new function, run the debugger, it stops at my breakpoint, and then I do code edits, restart the debugger, let it run over my new code, check for problems, and do this again and again. I feel like this is really inefficient because I'm constantly restarting my debugger over and over, and (of course) my code needs to load all my data again every time. I know a lot of this may be unavoidable, but getting some insight into others' data science workflows would be great. Thanks!

paper niche
#

I do my analysis in Jupyter notebooks. having the ability to run cell by cell is much easier for data exploration and analysis

#

I find working with py files very cumbersome for this purpose, especially if you need to make small changes and run blocks of code over and over

modest halo
#

@paper niche thanks for the suggestion! I really should get into jupyter -- always been on the list and never really got around to checking it out. I can see how running cell by cell would be useful. When you use jupyter, do you create cells as "building blocks" and just execute them in a certain order depending on what you want to do (or work on next)? Can you also "undo" execution of a cell?

paper niche
#

maybe 1 cell for loading data to df, 1 cell for doing some modification to the columns, another for df.plot, say.

#

and there's no "undo"-ing the execution of a cell, what you do is you re-run the cell that created that particular variable in the first place. e.g. for my example workflow above, if I realized my modification in the second cell was wrong, I'll run the first cell again to re-assign df, then make changes to the second cell, then re-run the second cell again @modest halo
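A rough sketch of that cell layout, written with "# %%" cell markers that some editors also recognise in plain .py files. The tiny in-memory CSV and the column names are made up for illustration. One variation on the workflow described above: the transform cell starts from a copy, so re-running it alone resets df without reloading the data.

```python
# %% Cell 1: load the data once (the slow step)
import io

import pandas as pd

# Stand-in for a large file on disk (an assumption for this sketch).
raw_csv = io.StringIO("widget,month,sales\nA,1,10\nA,2,12\nB,1,7\nB,2,-\n")
df_raw = pd.read_csv(raw_csv, na_values=["-"])

# %% Cell 2: clean and transform, starting from a fresh copy each time
df = df_raw.copy()
df = df.dropna(subset=["sales"])
df["sales_doubled"] = df["sales"] * 2

# %% Cell 3: inspect or plot; this is the cell to re-run while iterating
summary = df.groupby("widget")["sales"].mean()
print(summary)
```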

modest halo
#

@paper niche Oh I see. so essentially achieving the same thing. After seeing that notebook you linked, I'm really starting to rethink my workflows lol. That notebook is pretty much exactly the kind of thing I want to do, and it's much more readable for people who aren't as familiar with python (a lot of people I work with). Thanks a lot!

paper niche
#

@modest halo np, happy to share! 😄 yup, the ability to write text and code in the same document is really nice for this purpose as well.

marsh fog
#

For the part where it plots the regression line

fig, axes = plt.subplots(ncols=3, figsize=(20,8))
for ax, col in zip(axes, mergfix.loc[:, mergfix.columns != "internet_users"]):
    mergfix.plot.scatter(x=[col],y=1, c='green', ax=ax)
    _ = plt.plot(mergfix.iloc[:, v].values, sm.OLS(mergfix.iloc[:, v].values, sm.add_constant(mergfix.iloc[:, v].values)).fit().fittedvalues,'r-')

how can I get it to cycle through the columns in the dataframe?

#

Help? 😄

worn field
#

@marsh fog I like to just make a list of the columns and wrap it in a for-loop 😃 Like ```
columns = ['column1', 'column2', 'column3']

for column in columns:
    plotfunction(dataframe[column])
```

#

Not sure if that was what you were asking though :p

marsh fog
#

It wasn't 😄
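For the original question, the regression line probably does not cycle because it is indexed with v, which the loop never changes, and because plt.plot draws on the current axes rather than on ax. A sketch of one way around that, under several assumptions: the DataFrame here is made-up stand-in data, internet_users is taken to be the y variable, and np.polyfit is swapped in for sm.OLS to keep the example small.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up stand-in for mergfix; only the "internet_users" name is from the question.
data = pd.DataFrame({"internet_users": rng.uniform(0, 100, 30)})
for name, true_slope in [("feature_a", 0.5), ("feature_b", -1.0), ("feature_c", 2.0)]:
    data[name] = true_slope * data["internet_users"] + rng.normal(0, 5, 30)

target = "internet_users"
features = [c for c in data.columns if c != target]

fig, axes = plt.subplots(ncols=len(features), figsize=(20, 8))
for ax, col in zip(axes, features):
    x = data[col].to_numpy()
    y = data[target].to_numpy()
    ax.scatter(x, y, c="green")
    # Straight-line least-squares fit of y on x (np.polyfit instead of sm.OLS).
    slope, intercept = np.polyfit(x, y, deg=1)
    xs = np.sort(x)
    ax.plot(xs, slope * xs + intercept, "r-")  # draw on this subplot, not plt
    ax.set_xlabel(col)
    ax.set_ylabel(target)
```

The key points are that both the scatter and the line use the loop variable col, and that both are drawn through ax so each subplot gets its own fit.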

mellow birch
#

Has anyone here written any useful graphs for MISP (www.misp-project.org)? Or have any experience with data science for threat hunting?

lapis sequoia
#

Hey guys, I have 8 black-and-white images and I need their mean image. If I do sum(list_of_images) I get a grayscale image with very bad edges; if I divide this new image by len(images) (which feels logical, since that is what a mean is) I get the same rough image but with colors

#

UserWarning: Float image out of standard range; displaying image with stretched contrast.
warn("Float image out of standard range; displaying "

#

images are stored in numpy arrays

#

nvm my add logic is wrong

#

fixed, used numpy.mean() and then cast the output image to uint8 to get it grayscale
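A small sketch of that fix, using two made-up 2x2 images. It also shows a likely cause of the rough result from summing: adding uint8 arrays wraps around at 256 instead of growing past 255.

```python
import numpy as np

# Two tiny stand-in "images" (assumptions for illustration).
images = [
    np.full((2, 2), 200, dtype=np.uint8),
    np.full((2, 2), 100, dtype=np.uint8),
]

# uint8 + uint8 stays uint8, so 200 + 100 wraps around modulo 256.
wrapped = images[0] + images[1]

# np.mean accumulates in floating point, so nothing overflows;
# casting back to uint8 gives a displayable grayscale image.
mean_image = np.mean(np.stack(images), axis=0).astype(np.uint8)
print(wrapped[0, 0], mean_image[0, 0])
```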

light cloud
#

I want to see if I can explain KNN here in my own terms.
Let's say I am a company that sells different types of widgets, we have seasonality in our business, and I want to see which widgets are likely to go up or down in sales for a given month.
Would I have my various widgets as rows, with columns for the month, previous year's sales, previous month's sales and maybe a few others? Then I would label each row as going either up or down in sales the next month. Would I then run a KNN algorithm on the data and hope to get a prediction for a row that has no next-month sales figure?

light cloud
#

Wrong place for that question or am I way off?

mossy dragon
#

you don't want to use KNN for that

#

at least from what I can remember of it

#

since you're talking about time, and especially seasonality, you want to do time series analysis

light cloud
#

Thanks for the response. I think I was reading that KNN isn't ideal for time series but it can be used.

#

What about random forest then?

#

Similar data setup in structure but different algorithm.

#

There was a tutorial I was reading that was about forecasting weather with it

hasty maple
#

I haven't done much with time series, but from what I've read, people usually use LSTMs, GRUs, fbprophet, or ARIMA for it. Maybe check those out

paper niche
#

RFs can't extrapolate well: since they always take the average in the leaf nodes, you can't get values higher or lower than the extreme values in your training set.

#

unlike a linear regression for example

#

not to say that it can't, but I would expect other methods to work better for time series predictions
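A quick sketch of the extrapolation point, using a made-up upward trend: the forest's predictions for inputs beyond the training range stay capped at the largest target it saw, while a linear regression keeps following the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy trend (an assumption): y = 2x + 1 for x in 0..49, so max(y) is 99.
X_train = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + 1.0

# "Future" inputs that lie beyond anything in the training set.
X_future = np.array([[60.0], [80.0], [100.0]])

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

rf_pred = rf.predict(X_future)    # averages of training targets, so never above 99
lin_pred = lin.predict(X_future)  # continues the line past the training range
print(rf_pred, lin_pred)
```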

lapis sequoia
#

I'd say it's more about what problem you

#

are trying to solve..than it is about the method..

#

case you cited for example.. how much can you afford to be off by

#

like..what is the tolerance..

#

fbprophet and arima are what I'd suggest for time series.. the person above was right on the money..

#

with Arima, you can tweak it to be conservative..

light cloud
#

I have been using prophet for forecasting and I like it.

#

I just wanted to try some other models that I haven't used much and that are common in ML, not only to compare results but also to learn.

#

So to get back to my original question, it looks like KNN and rf still belong in the classification toolbox and prophet in the forecasting toolbox.

lapis sequoia
#

yep.. and with knn you use different distance metrics for different applications/problems..

#

RF.. for most business applications with plenty of tolerance.. and minimum number of trees for fast results..
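To illustrate the distance-metric point for KNN, here is a tiny sketch with made-up points where the choice of metric changes which training point is nearest, and therefore the predicted class. The coordinates and labels are assumptions chosen only to make the difference visible.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two labelled points and one query (made-up values).
X = np.array([[3.0, 3.0], [0.0, 4.5]])
y = np.array([0, 1])
query = np.array([[0.0, 0.0]])

# Euclidean: (3, 3) is about 4.24 away, (0, 4.5) is 4.5 away.
euclid = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X, y)
# Manhattan: (3, 3) is 6 away, (0, 4.5) is 4.5 away.
manhattan = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X, y)

pred_euclid = int(euclid.predict(query)[0])
pred_manhattan = int(manhattan.predict(query)[0])
print(pred_euclid, pred_manhattan)
```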

light cloud
#

Do you have a preference between the two? Or is it just very dependent on the type and amount of data and variables?

lean ledge
#

LSTMs, GRUs, ARIMA and CNNs are all common for time series stuff

#

you can't just use random common ML methods for any problem. you have to use ones that suit the problem