#data-science-and-ml | Python | Page 195

midnight oracle Mar 12, 2019, 7:36 PM

#

I don't have access to it right now

#

I have already sent it in my previous message

#

timesData.csv

#

📎 timesData.csv

#

Here it is

supple ferry Mar 12, 2019, 7:37 PM

#

I see

#

instead of null values, you have "-"

#

right ?

midnight oracle Mar 12, 2019, 7:38 PM

#

Yes

#

Some of them

supple ferry Mar 12, 2019, 7:38 PM

#

okay. what you can do is again, tweak your reading function

#

df = pd.read_csv("data.csv", na_values = ["-", "_"])

#

na_values is the parameter allowing you to add special na values for a given dataset

#

if parser sees such fields it will treat them as null

#

there I put a list of two strings to be considered as null. if parser sees a field with any of these two strings, it will parse it as null

#

then you can easily df.dropna(inplace = True)
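
A minimal sketch of the suggestion above. The CSV contents are a made-up stand-in for timesData.csv (the column names and values are assumptions), read from a string so the snippet runs on its own:

```python
import io

import pandas as pd

# Hypothetical stand-in for timesData.csv: "-" marks a missing value.
csv_text = """university,score,students
Alpha,91.2,2243
Beta,-,15596
Gamma,88.0,-
"""

# Treat "-" and "_" as missing values while parsing.
df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "_"])

# The affected columns are parsed as numbers with NaN, not as strings.
print(df.dtypes)

# Drop every row that still contains a missing value.
clean = df.dropna()
print(clean)
```

Without na_values, the "-" fields would likely force both columns to be read as strings, so dropna() would find nothing to drop.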

midnight oracle Mar 12, 2019, 7:40 PM

#

Oh I didnt know that

#

Thanks

supple ferry Mar 12, 2019, 7:40 PM

#

Now you know 😃

midnight oracle Mar 12, 2019, 7:40 PM

#

:)

supple ferry Mar 12, 2019, 7:41 PM

#

read_csv is a very powerful function. I advise you to read its documentation. It will save you tons of headaches already at read time

midnight oracle Mar 12, 2019, 7:41 PM

#

Okay

supple ferry Mar 12, 2019, 7:42 PM

#

btw, try not to use inplace = True at any time

#

😃

#

it will confuse you sooner or later

void anvil Mar 12, 2019, 9:54 PM

#

inplace is so useful

supple ferry Mar 12, 2019, 10:09 PM

#

It is useful, but also confusing. It is better to reassign rather than in place

fervent solar Mar 13, 2019, 1:27 AM

#

can someone explain to me how np.searchsorted() works? i checked the documentation but didn't get it. thank you

supple ferry Mar 13, 2019, 7:14 AM

#

@fervent solar let's say you have two arrays, A and B, where A is sorted. You want to put B's elements into A without messing up A's order. That function gives you the indices at which you should insert the elements of B so that the overall sorted order of A remains unchanged

polar acorn Mar 13, 2019, 7:15 AM

#

Also inplace is supposed to be deprecated.

supple ferry Mar 13, 2019, 9:54 AM

#

I hope in version 1.0 they do it

jagged nymph Mar 13, 2019, 10:32 AM

#

i have a question tangentially related to data science so maybe you guys could help

#

here i'm using scipy.interpolate.interp1d (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.interp1d.html) to generate the function for fitting the points on these bars into a curve.

📎 unknown.png

#

as you can see, the result isn't quite that great with those dips on the side

#

is there a better curve fitting module i can use for this purpose?

#

i want the curve to follow more along the path in red

📎 unknown.png

#

oh and i should mention, the points move and there are an arbitrary number of them

#

like so

📎 2019-03-13_23-36-54.mp4

supple ferry Mar 13, 2019, 1:26 PM

#

This is really a good question, I hope there is someone to answer that. I also became interested 😄

cursive sun Mar 13, 2019, 1:26 PM

#

Lmfaooo ya getting Runge phenomenoned

#

scruuuuub

#

Use a chebyshev grid next time or use a degree sqrt(n) polynomial regressor

#

Or throw in a lipschitz coefficient bound

#

Worlds your oyster fam

#

Git gud
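
For anyone unfamiliar with the terms above, here is a rough sketch of the Runge phenomenon and of the "degree sqrt(n)" suggestion. It uses the textbook Runge function on evenly spaced points, which is an assumption and not the data from the screenshots:

```python
import numpy as np
from numpy.polynomial import Polynomial


def runge(x):
    """Textbook function that provokes oscillation near the interval edges."""
    return 1.0 / (1.0 + 25.0 * x**2)


n = 15
x = np.linspace(-1.0, 1.0, n)  # evenly spaced sample points
y = runge(x)

# Degree n - 1 passes through every point but swings wildly near the edges.
interp = Polynomial.fit(x, y, deg=n - 1)

# Degree ~sqrt(n) is a least-squares fit: smoother, but it misses the peak.
smooth = Polynomial.fit(x, y, deg=int(np.sqrt(n)))

grid = np.linspace(-1.0, 1.0, 1001)
interp_err = np.max(np.abs(interp(grid) - runge(grid)))
smooth_err = np.max(np.abs(smooth(grid) - runge(grid)))
print(f"degree {n - 1}: max error {interp_err:.2f}")
print(f"degree {int(np.sqrt(n))}: max error {smooth_err:.2f}")
```

The low-degree fit trades accuracy at the points for stability between them, so whether it suits the moving-bars use case depends on how closely the curve has to hit each point.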

void anvil Mar 13, 2019, 2:27 PM

#

you can also look into wave transforms

#

EEMD, Fourier

past sonnet Mar 13, 2019, 2:44 PM

#

Hi guys,

So I'm solving a problem where I have html which contains challenge_details ie Question
I managed to extract main_text from this html manually with the help of bs4
now I've the challenge_details in txt file now my next goal is to extract KEYWORDS which I can use to find
relevant documents from the internet which can be helpful to solve the Question or may contain knowledge which can be useful
to understand the question.

My questions are:
- should I manually scrape Google search results, or is there a good Python lib for it? (I'm not going to scrape too frequently or too fast, so getting banned will not be an issue)
- I'm only going to take URLs in the search results which lead to an HTML page or PDF, so my next question is: what is the best library for ARTICLE extraction from HTML and text extraction from PDF?
- Then I would like to use some magic [library which can find document similarity or the n most relevant docs]

What better information retrieval system is there than Google? Hence, after extracting keywords, I'd like to scrape Google search, as the docs in the results will be the most relevant.
But I don't think keywords will fetch good search results, so I'll be using the title which I extracted from challenge_details. Still, extracting keywords and finding documents based on them is a requirement that I need to fulfill.

#

Library that I've found so far: newspaper3k and spacy

jagged nymph Mar 13, 2019, 2:58 PM

#

@cursive sun @void anvil thanks! i don't really know what any of these things mean but I'll look into them 😅

cursive sun Mar 13, 2019, 3:00 PM

#

Oof, this is why we need better math education in schools tbqh

supple ferry Mar 13, 2019, 3:08 PM

#

@cursive sun if he knew, then he would not ask, right? Not knowing is okay, not wanting to know is not

#

@past sonnet i think you can go to help channels for this question. It is better suited to those profiles

cursive sun Mar 13, 2019, 3:23 PM

#

Calm down fam, it's a joke. Now the question is why you were the one to get antsy about it 🤔

simple crag Mar 13, 2019, 4:14 PM

#

It's not a very good joke

cursive sun Mar 13, 2019, 4:16 PM

#

Fine, if you think qwerty of all people is smart enough to help beyond 'import cv2' then you dont need me

#

You guys need me to tiptoe around then im not helping for free, you guys provide soltutions for issues associated with interpolation problems

mossy dragon Mar 13, 2019, 4:34 PM

#

🍿

lyric canopy Mar 13, 2019, 4:34 PM

#

Honestly, the most important thing we expect someone to do on this server is showing respect for the other users. Statements like 'qwerty of all people is smart enough' is not that, so you can keep your attitude. @cursive sun

#

So, drop the attitude

cursive sun Mar 13, 2019, 4:37 PM

#

Lmfao im not gonna listen to you, you can keep querty, ill bounce

lyric canopy Mar 13, 2019, 4:39 PM

#

!kick 339017672211693570 telling staff they're not going to listen when told to change their immature and toxic attitude.

midnight oracle Mar 13, 2019, 4:43 PM

#

QWERTY why don't you prefer inplace=True what are its drawbacks, if it has any?

void anvil Mar 13, 2019, 5:11 PM

#

Honestly fourier transforms were taught in differential equations for me

#

which is sophomore math

#

EEMD and other decompositions are taught in signals processing courses which are at least Jr. level engineering courses and more advanced ones are grad level

mossy dragon Mar 13, 2019, 5:41 PM

#

differential equations is sophmore math?

void anvil Mar 13, 2019, 5:59 PM

#

sophomore in college

#

assuming calc 1-3 in freshman / first half of sophomore

fervent solar Mar 13, 2019, 6:04 PM

#

@supple ferry can you get more detailed example

supple ferry Mar 13, 2019, 6:42 PM

#

@fervent solar, let's say you have array A = [1, 2, 5, 7, 10] and some array B = [3, 4, 9], and you want to insert the values from B into A so that the sorted order of A doesn't change. If you just stick B onto the end of A, you get [1, 2, 5, 7, 10, 3, 4, 9] and the order is spoiled. np.searchsorted() gives you the indexes at which you can insert the elements of B instead.

In [4]: a = [1, 2, 5, 7, 10]

In [5]: b = [3, 4, 9]
In [6]: np.searchsorted(a, b)
Out[6]: array([2, 2, 4], dtype=int64)

You see the output? It says to put the first element of B (3) at index 2 of A, between 2 and 5, and it won't spoil the order. The same goes for 4: it also goes between 2 and 5. But 9 should be put at index 4, between 7 and 10.
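
The same example as a runnable sketch, with np.insert added to show what the indices are for (np.insert and the side argument are not mentioned in the chat, they are just one way to use the result):

```python
import numpy as np

a = np.array([1, 2, 5, 7, 10])  # must already be sorted
b = np.array([3, 4, 9])

idx = np.searchsorted(a, b)
print(idx)  # indices 2, 2, 4

# np.insert puts each value in front of the given position of the original a.
merged = np.insert(a, idx, b)
print(merged)  # 1, 2, 3, 4, 5, 7, 9, 10: still sorted

# For a value that already exists in a, `side` picks which end of the tie to use.
left = np.searchsorted(a, 5, side="left")    # 2: before the existing 5
right = np.searchsorted(a, 5, side="right")  # 3: after the existing 5
print(left, right)
```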

#

@midnight oracle, inplace = True is very handy, but only at first sight. If you don't order your code properly, you can easily forget that you modified your dataframe in place. I am speaking from experience, and I am pretty sure there are other users here who have had the same thing happen

#

just writing a couple more symbols will not hurt, but may help
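
A small sketch of the kind of surprise being described. The frame and the variable names are made up for illustration:

```python
import pandas as pd

raw = pd.DataFrame({"score": [91.2, None, 88.0]})

# Reassignment: raw keeps all rows, and the new name says what changed.
clean = raw.dropna()

# In place: every name that points at the same frame changes with it.
working = raw.copy()
alias = working              # a second name for the same object
working.dropna(inplace=True)

print(len(raw), len(clean))  # original is intact, cleaned copy is shorter
print(len(alias))            # alias lost the row too, without being touched
```

With reassignment it is always visible at the call site which variable holds which version of the data.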

midnight oracle Mar 13, 2019, 6:44 PM

#

Okay

fervent solar Mar 13, 2019, 6:57 PM

#

@supple ferry oh python is smart

supple ferry Mar 13, 2019, 6:57 PM

#

😄 It is NumPy and Python

fervent solar Mar 13, 2019, 7:00 PM

#

guess its output @supple ferry

📎 Screenshot_from_2019-03-14_00-29-41.png

void anvil Mar 13, 2019, 7:38 PM

#

eww, not muting discord channels

fervent solar Mar 13, 2019, 7:45 PM

#

hahaha

supple ferry Mar 13, 2019, 7:58 PM

#

Literally unplayable 😁

fervent solar Mar 13, 2019, 8:39 PM

#

any good platform to ask programming based questions ?

magic pecan Mar 13, 2019, 9:01 PM

#

stack overflow @fervent solar

lean ledge Mar 13, 2019, 9:09 PM

#

aw first time I saw someone with actual signal processing knowledge online and they get kicked

fervent solar Mar 13, 2019, 10:34 PM

#

(x[:-1] > x[1:]) can some one explain this code

silk acorn Mar 13, 2019, 10:35 PM

#

x[:-1] and x[1:] are slices

#

```py
[1,2,3][:-1] -> [1,2]
[1,2,3][1:] -> [2,3]
```

#

as for comparing lists with < or >

#

For plain Python lists, this compares them lexicographically, starting with the first elements

#

slices are formatted like this

```py
[start:end:step]
```
with default of
```py
[0:-1:1]
```
that can be omitted

#

It lets you take part of a list

fervent solar Mar 13, 2019, 10:37 PM

#

it was used in
while np.any(x[:-1] > x[1:]):
    np.random.shuffle(x)
return x

#

bogosort

#

the thing i'm not getting is why (x[:-1] > x[1:]) because it will miss first value in x[1:])

#

and why greater than sign is used
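
A runnable sketch of the loop quoted above, wrapped in a function. The seeded generator is an addition for reproducibility; the quoted snippet uses np.random.shuffle:

```python
import numpy as np


def bogosort(values, seed=0):
    """Shuffle until sorted. Only sensible for tiny inputs."""
    x = np.array(values)
    rng = np.random.default_rng(seed)
    # x[:-1] is every element except the last, x[1:] is every element except
    # the first, so x[:-1] > x[1:] asks "is x[i] > x[i + 1]?" for each
    # neighbouring pair. Nothing is skipped: x[0] is the left side of pair 0.
    while np.any(x[:-1] > x[1:]):
        rng.shuffle(x)
    return x


print(bogosort([8, 3, 6, 1, 7, 5]))
```

The greater-than sign is there because a pair with x[i] > x[i + 1] is out of ascending order, so the loop keeps shuffling while any such pair exists.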

silk acorn Mar 13, 2019, 10:40 PM

#

It's comparing the first value with the second value

#

hmm, it's np.any, was gonna say that wouldn't work for any, but maybe it's different for numpy, lemme check

fervent solar Mar 13, 2019, 10:42 PM

#

ok

silk acorn Mar 13, 2019, 10:42 PM

#

Yeah, for numpy arrays the > is applied element by element, and np.any returns True if at least one element is larger than its counterpart

fervent solar Mar 13, 2019, 10:54 PM

#

@silk acorn u said default value is [ 0 : - 1 : 1 ] but its printing reversed list

silk acorn Mar 13, 2019, 10:55 PM

#

It shouldn't be.

#

But i did make a mistake there

#

it's
[0:len of list:1]

#

That's not a reversed list, that's a list missing the last element

fervent solar Mar 13, 2019, 10:57 PM

#

yes

silk acorn Mar 13, 2019, 10:59 PM

#

[::-1] would get you a reversed list
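
A quick sketch of the slice forms discussed above, using the same list as in the later IPython session:

```python
x = [8, 3, 6, 1, 7, 5]

whole = x[0:len(x):1]   # the defaults written out: same as x[:]
no_last = x[:-1]        # stop one before the end
no_first = x[1:]        # start at index 1
reversed_x = x[::-1]    # step -1 walks backwards

print(whole)
print(no_last)
print(no_first)
print(reversed_x)

# [0:-1:1] is not the default: it equals [:-1] and drops the last element.
print(x[0:-1:1] == no_last)
```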

fervent solar Mar 13, 2019, 11:00 PM

#

on what basis is (x[:-1] > x[1:])
is computed true

#

when both will have same size

#

In [94]: x
Out[94]: [8, 3, 6, 1, 7, 5]

In [95]: x[:-1]
Out[95]: [8, 3, 6, 1, 7]

In [96]: x[1:]
Out[96]: [3, 6, 1, 7, 5]

In [97]: (x[:-1] > x[1:])
Out[97]: True

In [98]: (x[1:] > x[:-1])
Out[98]: False

silk acorn Mar 13, 2019, 11:01 PM

#

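With numpy arrays, x[:-1] > x[1:] compares the two element by element, and np.any returns True if any of the results is True. With plain lists, as in your session, > compares lexicographically, which is why you get a single True or False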

fervent solar Mar 13, 2019, 11:02 PM

#

means it wil compare the first and last element ?

silk acorn Mar 13, 2019, 11:03 PM

#

It will compare a[0] > b[0], a[1] > b[1] etc, where a in this case is x[:-1] and b is x[1:]

#

and return True if one or more are True

fervent solar Mar 13, 2019, 11:05 PM

#

got it

#

bogo sort done

silk acorn Mar 13, 2019, 11:06 PM

#

The pure python equivalent would be

```py
any(a < b for a, b in zip(x[1:], x[:-1]))
```
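
To make the list-versus-array difference concrete, a small sketch with the list from the session above:

```python
import numpy as np

x = [8, 3, 6, 1, 7, 5]

# Plain lists: > is lexicographic, decided by the first differing pair (8 > 3).
list_result = x[:-1] > x[1:]
print(list_result)

# NumPy arrays: > yields one boolean per neighbouring pair.
arr = np.array(x)
pairwise = arr[:-1] > arr[1:]
print(pairwise)

# np.any collapses those booleans: True while at least one pair is out of order.
print(np.any(pairwise))

# The same check without NumPy.
pure_python = any(a > b for a, b in zip(x, x[1:]))
print(pure_python)
```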

pure lynx Mar 13, 2019, 11:09 PM

#

Hi. Is there a library in python that graphs in the coordinate plane and draws segments in between points? I have tried researching online but I cannot seem to find anything. I don't believe matplotlib can suit my purposes for this.

fervent solar Mar 13, 2019, 11:10 PM

#

u tried seaborn ?

pure lynx Mar 13, 2019, 11:11 PM

#

No, but that looks cool and useful for another project. However, from what I am seeing of it, it serves nearly the same purpose as matplotlib.

simple crag Mar 13, 2019, 11:15 PM

#

In what way is matplotlib insufficient?

#

Connecting points with segments seems like something it would do well

pure lynx Mar 13, 2019, 11:17 PM

#

By plotting points, I mean like you could manually do in Geogebra. I may just be ignorant but I don't believe matplotlib has a functionality to connect the three points in such a fashion.

📎 unknown.png

simple crag Mar 13, 2019, 11:23 PM

#

How about patches: https://matplotlib.org/api/patches_api.html

lapis sequoia Mar 13, 2019, 11:30 PM

#

Hi, been working on a binary text clf using scikit-learn and I feel a bit lost on what i can do to improve the acc (currently at 0.73-0.75). The dataset is quite small (~7000) so I don't know how far I can push it .
I am still very much learning so if anything seems off please let me know I'd really appreciate it :)

PreProcessing:
Cleaned the data
Set up some stopwords
Tried some word-clustering but didn't see any gains (because of dataset size?)
Just now messing around with MaxAbsScaler and Normalizer

Pipeline:
CountVectorizer
TfidfTransformer
The Preprocessing I mentioned above
an SGDClf
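
For reference, a minimal sketch of how the pipeline described above could be wired up in scikit-learn. The toy texts, labels, and step names are invented, and the extra scaling steps are left out:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the ~7000-sample dataset.
texts = [
    "great movie, loved it",
    "wonderful acting and story",
    "really enjoyable film",
    "fantastic and fun",
    "terrible movie, hated it",
    "boring plot and bad acting",
    "awful film, waste of time",
    "dull and disappointing",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),  # tokens -> counts
    ("tfidf", TfidfTransformer()),                      # counts -> tf-idf
    ("sgd", SGDClassifier(random_state=0)),             # linear classifier
])
clf.fit(texts, labels)

pred = clf.predict(["great story", "boring film"])
print(pred)
```

Keeping every step inside one Pipeline means cross-validation refits the vectorizer on each training fold, which avoids leaking vocabulary from the held-out data.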

pure lynx Mar 14, 2019, 1:25 AM

#

I did not end up needing patches, but I did use matplotlib in a way that I did not know existed.

#

I also learned about zip(), which was good.

#

I essentially made a list of lists, something like data = [[1,3],[2,1],[3,2]], then I found a way to make all of the possible segments given some points. Since I only have three (I am graphing triangles), that works for me, and I used: plt.plot(*zip(*itertools.chain.from_iterable(itertools.combinations(data, 2)))) Still not entirely sure how it works, but it does.

📎 unknown.png

#

Oh yeah, I had to import itertools in addition to matplotlib.pyplot. FYI for anyone looking to use that method.
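
Since the one-liner above is a bit opaque, here it is unpacked step by step. Only itertools is used; the final plt.plot call is left as a comment:

```python
import itertools

data = [[1, 3], [2, 1], [3, 2]]  # the triangle's vertices

# Every unordered pair of vertices, i.e. every side of the triangle.
pairs = list(itertools.combinations(data, 2))

# Flatten the pairs into one sequence of points: A, B, A, C, B, C.
points = list(itertools.chain.from_iterable(pairs))

# Transpose [x, y] points into one tuple of xs and one tuple of ys.
xs, ys = zip(*points)
print(xs)
print(ys)

# plt.plot(xs, ys) then draws a single polyline through those six points.
```

Because it is one continuous line, it also draws the hops between pairs. For a triangle those hops retrace existing sides, but with four or more points the same trick would add diagonals.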

supple ferry Mar 14, 2019, 6:29 AM

#

@pure lynx wait you could do that? That's illegal 😀

supple ferry Mar 14, 2019, 7:47 AM

#

Does anyone have experience with conditional logistic models?

#

I want to find out why my hessian matrix fails to calculate. I found out that I have hidden intercepts in my design matrix and I want to find out which variables cause that and remove them

lapis sequoia Mar 14, 2019, 9:26 AM

#

Hello, I have a question about kinds of data for analysis... is working with technical data very different from any other kind of data? I understand that the reasoning the data provides goes in a different way, but how about the technical aspect of processing data?

supple ferry Mar 14, 2019, 9:43 AM

#

@lapis sequoia , can you be more specific?

pure lynx Mar 14, 2019, 10:31 AM

#

What..? How is it illegal?!

lapis sequoia Mar 14, 2019, 11:01 AM

#

As far as I understand, data analysis for, let's say, business analytics requires a set of skills that involves knowledge of the business field, processes, etc. Technical analytics would be more specific to some technical area of knowledge, like laser efficiency. But is the process of working with data (munging, cleaning, modeling) the same, with the difference only coming from the background of the data, or does the whole process of working with data differ in some way?

#

I am not sure I explain myself well enough, sorry, for confusing question

pale eagle Mar 14, 2019, 1:30 PM

#

@lapis sequoia have u done any projects in data analytics

lapis sequoia Mar 14, 2019, 1:32 PM

#

I have done some projects at school, but they were related to things like predicting whether a new website would bring in more clients, or which store's range of goods to improve

#

Nothing with technical data, that is why I am interested if there is any difference

pale eagle Mar 14, 2019, 1:33 PM

#

@lapis sequoia which lang u are using

#

??

lapis sequoia Mar 14, 2019, 1:34 PM

#

Python

supple ferry Mar 14, 2019, 1:38 PM

#

So, from my understanding, data will mean data always. Which means, technical or non-technical you will work with numbers. However, every industry and data type will need specific approach to its data. If you work with panel data, methods that you will use will differ completely from the methods you will use for example in cross-sectional data

pure lynx Mar 14, 2019, 1:55 PM

#

Not to be off topic, but... how was what I did illegal?

lyric canopy Mar 14, 2019, 1:58 PM

#

I don't think that was a serious remark. I'm not sure either.

polar acorn Mar 14, 2019, 2:31 PM

#

@pure lynx It was not meant literally. It was most likely meant as an amusing compliment 😃

supple ferry Mar 14, 2019, 3:16 PM

#

@pure lynx it was an amusing compliment as @polar acorn suggested 😀

#

I was surprised that one can use itertools with matplotlib. It is very creative

pure lynx Mar 14, 2019, 4:13 PM

#

Oh, I guess I’m just daft in that case. Thanks for the compliment, though I can take credit only for relentless searching on StackExchange. I will have to find a different way to graph when I eventually move on to quadrilaterals.

pale eagle Mar 14, 2019, 5:08 PM

#

Anyone is learning data analytics from data camp

heavy apex Mar 14, 2019, 5:27 PM

#

How do I build a portfolio for data science if I've never had a job or internship within the field. I'm coming up on my graduation date, and kinda scared my basic understanding of core skills isn't going to be enough to present to employers.

supple ferry Mar 14, 2019, 6:06 PM

#

@heavy apex you can use Kaggle to see work by other people, which will give you a feeling for what kind of projects people are doing. If you see something you like, you can find another dataset and try to implement a similar methodology there. There are various free dataset websites you can visit, be it Kaggle itself, r/datasets, or Google's search for datasets, I forgot its name

#

Even more important than modeling is the way you interpret the results

supple ferry Mar 14, 2019, 6:37 PM

#

Anyone got a good book advice for Bayesian Statistics?

lapis sequoia Mar 14, 2019, 7:10 PM

#

@supple ferry thanks for insight.

heavy apex Mar 14, 2019, 7:25 PM

#

@supple ferry thank you, great advice.

void anvil Mar 14, 2019, 8:49 PM

#

If you get Kaggle GM you'll get 6-7 fig job offers

#

depending on what field you want to go in

#

@supple ferry https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/

This section provides the assigned readings and reading questions that students are required to complete prior to attending class sessions.

lean ledge Mar 14, 2019, 8:51 PM

#

@heavy apex Hackathons, projects, kaggle

void anvil Mar 14, 2019, 8:51 PM

#

Kaggle is the gold standard tbh

lean ledge Mar 14, 2019, 8:51 PM

#

And yes you're right, just having the skills won't be enough

void anvil Mar 14, 2019, 8:51 PM

#

hackathons and projects mean fuck all

lean ledge Mar 14, 2019, 8:51 PM

#

Kaggle isn't really a good standard at all

#

It's just a common recommendation since it's easy to find data there

#

Hackathons and projects can easily mean as much as or more than kaggle

#

It's not about what it is, it's the skills you display

#

With hackathons and projects, you can show fast prototyping under pressure or longer term software skills which are hard to show with just Kaggle. Kaggle doesn't actually show any particular kind of skill that the other two don't

#

For reference: my company has hired both top 50 on kaggle and hackathon winners (I'm from the latter) along with those with work experience only

fervent solar Mar 14, 2019, 9:03 PM

#

Recall that previously we created a simple array using an expression like this:
In[3]: x = np.zeros(4, dtype=int)
We can similarly create a structured array using a compound data type specification:
In[4]: # Use a compound data type for structured arrays
       data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                                 'formats':('U10', 'i4', 'f8')})

#

how this numpy structure is working

#

i understand python dictionaries
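
A short sketch of what that compound dtype gives you. The dict is not stored as data; it only describes the fields of each record. The values filled in below are made up:

```python
import numpy as np

# 'names' lists the fields, 'formats' gives each field's type:
# U10 = unicode string up to 10 chars, i4 = 4-byte int, f8 = 8-byte float.
data = np.zeros(4, dtype={"names": ("name", "age", "weight"),
                          "formats": ("U10", "i4", "f8")})

# Hypothetical values, assigned one field (column) at a time.
data["name"] = ["Alice", "Bob", "Cathy", "Doug"]
data["age"] = [25, 45, 37, 19]
data["weight"] = [55.0, 85.5, 68.0, 61.5]

print(data.dtype)   # the field layout
print(data[0])      # one whole record
young = data[data["age"] < 30]["name"]
print(young)        # boolean mask on one field, read another
```

So it behaves like a small table: index with a field name to get a column, with an integer to get a row.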

vivid hedge Mar 14, 2019, 9:08 PM

#

Might not be the correct channel but what is a good name for summing up all the values up to a certain value.

Ex. Value 3, 1 + 2 +3 = 6

#

Is there a mathematical name or function for it?

lean ledge Mar 14, 2019, 9:10 PM

#

sum_until()?

hardy crag Mar 14, 2019, 9:14 PM

#

https://en.wikipedia.org/wiki/Triangular_number

Triangular number

A triangular number or triangle number counts objects arranged in an equilateral triangle, as in the diagram on the right. The nth triangular number is the number of dots in the triangular arrangement with n dots on a side, and is equal to the sum of the n natural numbers fro...

vivid hedge Mar 14, 2019, 9:14 PM

#

Could work, currently I have "Sum up to value" its going to be for the filename also.

hardy crag Mar 14, 2019, 9:15 PM

#

sry for the preview 😦

#

if you are just interested in the result u might try something like np.arange(1, n + 1).sum()

#

or write the function given in the wikipedia article 😃

vivid hedge Mar 14, 2019, 9:17 PM

#

Me? I am simply looking for a name I can call my file for github push. But I guess Triangular numbers seems to be correct

polar acorn Mar 14, 2019, 9:23 PM

#

Cum_sum, short for cumulative sum might be used for similar things.
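
The two names side by side, as a small sketch. The function name is just a suggestion, and itertools.accumulate stands in for a cumulative sum so the snippet needs no third-party packages:

```python
from itertools import accumulate


def triangular(n):
    """Sum 1 + 2 + ... + n using the closed form n(n + 1)/2."""
    return n * (n + 1) // 2


print(triangular(3))  # 1 + 2 + 3

# A cumulative sum gives every partial sum, i.e. all triangular numbers up to n.
running = list(accumulate(range(1, 6)))
print(running)
```

The triangular number is the last entry of the cumulative sum, so which name fits depends on whether the file returns one value or the whole running series.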

violet crag Mar 14, 2019, 9:24 PM

#

let's say I am using data of stock prices.
I want to train the set on some subset of dataset and other subset to test it.

Should I go with scikit learn's train_test_split, which "Split arrays or matrices into random train and test subsets"

Or should I divide dataset with respect to dates. Such as train on data from 2016 to 31st Dec 2018, and all the data after it for testing?

polar acorn Mar 14, 2019, 9:30 PM

#

You should train on data up to a date and test on later data instead of using sklearn's train_test_split. If you want to do CV you probably want to do this several times for different dates. If you just google "time series cross validation" there are lot of guides.

violet crag Mar 14, 2019, 9:31 PM

#

CV?

polar acorn Mar 14, 2019, 9:32 PM

#

Cross validation. A more complete way to test model performance but takes more time.

void anvil Mar 14, 2019, 9:38 PM

#

absolutely not random CV, agreed

#

time series data needs to be split into train/test with quarantine periods based on forecast period to prevent contamination of data

#

If you want to do cv train test splits they can be segmented with additional quarantine periods

#

Or you can generate synthetic data based off of your current time series

#

just know if you're training on data not actually available at the time you're introducing some level of cheating into your predictions
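
One way to sketch the ordered splits with a quarantine period is scikit-learn's TimeSeriesSplit. This assumes a scikit-learn version recent enough to have the gap parameter; the series length and gap size are arbitrary:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 30                            # hypothetical number of trading days
X = np.arange(n_days).reshape(-1, 1)   # rows must already be in time order

gap = 2                                # rows skipped between train and test
splitter = TimeSeriesSplit(n_splits=3, gap=gap)
splits = list(splitter.split(X))

for train_idx, test_idx in splits:
    # Training always ends before testing starts, with `gap` rows in between.
    print(f"train 0..{train_idx.max()}  test {test_idx.min()}..{test_idx.max()}")
```

Each fold trains only on the past, and the skipped rows keep features that look a few days ahead from leaking into the test window.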

violet crag Mar 14, 2019, 9:42 PM

#

I don't think they have taught CV in the course yet. So for this assignment I won't use it. As the assignment is due tomorrow. But I have bookmarked it and will read up on it.

#

Thanks a lot guys.

#

🙏

supple ferry Mar 14, 2019, 10:12 PM

#

@lean ledge , why we cant name it like sum factorial or smth 😄

#

@void anvil i really liked the process of making synthetic data and now trying to get deeper into Bayesian inference, Monte carlo methods and Markovian methods

void anvil Mar 14, 2019, 10:20 PM

#

synthetic data is a huge fucking pita

#

and doesn't always work

supple ferry Mar 14, 2019, 10:22 PM

#

i dont mean for actually using it in production or research

#

it will help me to master numpy

#

i am trying to replace most of the python standard functionalities that i use solely with numpy

void anvil Mar 14, 2019, 10:29 PM

#

ah

supple ferry Mar 14, 2019, 10:30 PM

#

one ring to rule them all

warm gulch Mar 14, 2019, 10:38 PM

#

Hello! Has anyone worked with k-means clustering?

supple ferry Mar 14, 2019, 10:38 PM

#

hi

#

yes. what is your question

warm gulch Mar 14, 2019, 10:41 PM

#

I’m working on a school project and most of the resources I’m finding online deal with just 2-D x and y data, it’s possible to cluster more complex data in python correct?

supple ferry Mar 14, 2019, 10:41 PM

#

yes! of course

#

you can use kmeans also with multidimensional data

warm gulch Mar 14, 2019, 10:44 PM

#

Ok I’ll look into that 🤔. I figured out how to get my program to read and store my csv, would I have to do anything extra or will my program realize it is multidimensional?

#

And plot/cluster the data accordingly

supple ferry Mar 14, 2019, 10:48 PM

#

so, for using kmeans you should use external library. for reading csv and working with it pandas is your friend. for kmeans sklearn is the package you should use

#

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

#

in this post you can find very detailed approach how to implement kmeans
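
A minimal sketch showing that nothing special is needed for more than two dimensions: KMeans takes an array of shape (samples, features) whatever the number of features. The data here is synthetic, with four made-up features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 4-dimensional data: 50 points around each of three centres.
centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [-5, 5, -5, 5]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 4)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # one 4-dimensional centre per cluster
print(np.bincount(km.labels_))    # how many points landed in each cluster
```

With a CSV loaded through pandas, passing the numeric columns (for example df[["col_a", "col_b", "col_c"]], names hypothetical) works the same way.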

warm gulch Mar 14, 2019, 10:52 PM

#

Oh thank you! I did use pandas. As for my kmeans, I had found different ways using numpy and matplotlib, but this looks much cleaner!

supple ferry Mar 14, 2019, 10:52 PM

#

Should you come up with questions, feel free to ask them here. We will be glad to help :)

warm gulch Mar 14, 2019, 11:06 PM

#

For sure, thank you!

violet crag Mar 14, 2019, 11:07 PM

#

the dates on x axis are too cramped up. How do I make it only show year?

📎 Screenshot_from_2019-03-15_04-35-53.png

#

got the solution

#

set_xticks()
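
An alternative to setting the ticks by hand, sketched with matplotlib's date locators. This is not what was used above, and the price series is invented:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical daily series covering three years.
dates = np.arange("2016-01-01", "2019-01-01", dtype="datetime64[D]")
prices = np.linspace(100, 150, len(dates))

fig, ax = plt.subplots()
ax.plot(dates, prices)

# One tick per year, labelled with the year only.
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
fig.canvas.draw()
```

This only works when the x values are real dates. If they are strings, they would need converting first, for example with pd.to_datetime.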

reef bone Mar 14, 2019, 11:20 PM

#

you can do k-means on as many dims as you want, the only problem is that it quickly becomes impossible to visualize the results on a single 2d plot

#

tutorials might deal with 2d solely because it's easy to see what's happening

potent path Mar 14, 2019, 11:44 PM

#

Does anyone have any exp using time series algorithms with python?

supple ferry Mar 14, 2019, 11:45 PM

#

@violet crag there used to be plt.tight_layout too

#

But in your case it will probably not be helpful

violet crag Mar 14, 2019, 11:45 PM

#

Now I've run into another issue.

#

Regression function needs float for "x" and i have dates

void anvil Mar 14, 2019, 11:47 PM

#

convert to unix time

#

if you have variant time steps

#

or if time steps are uniform or could be considered uniform just throw ints
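
A sketch of both options with pandas. The dates are made up, and "days since the first row" is added as a variant that keeps the gaps but gives smaller numbers than Unix time:

```python
import pandas as pd

# Hypothetical trading dates with an uneven gap (a weekend is skipped).
dates = pd.to_datetime(["2019-03-14", "2019-03-15", "2019-03-18"])

# Variant time steps: Unix time in seconds keeps the real spacing.
unix_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Same spacing, smaller numbers: days elapsed since the first row.
days_since_start = (dates - dates[0]).days

# Uniform (or treated as uniform) steps: plain integers 0, 1, 2, ...
step_index = list(range(len(dates)))

print(list(unix_seconds))
print(list(days_since_start))
print(step_index)
```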

violet crag Mar 14, 2019, 11:47 PM

#

It's variant

void anvil Mar 14, 2019, 11:48 PM

#

is the step important?

violet crag Mar 14, 2019, 11:48 PM

#

It is stock price, sometime difference between the entries is 1 day (consecutive) other time its 2-3 days.

#

"step"? Idk what you mean

void anvil Mar 14, 2019, 11:48 PM

#

time step

#

and yeah it's pretty important

#

have fun with jump functions

violet crag Mar 14, 2019, 11:49 PM

#

Hmm 🤔 alright, I'll see what those are

void anvil Mar 14, 2019, 11:49 PM

#

you can start here

#

https://medium.com/@alexrachnog/neural-networks-for-algorithmic-trading-1-2-correct-time-series-forecasting-backtesting-9776bfd9e589

Medium

Neural networks for algorithmic trading. Correct time series forec...

Hi everyone! Some time ago I published a small tutorial on financial time series forecasting which was interesting, but in some moments…

#

this one is also neat

#

https://towardsdatascience.com/aifortrading-2edd6fac689d

Towards Data Science

Using the latest advancements in deep learning to predict stock pr...

Link to the complete notebook: https://github.com/borisbanushev/stockpredictionai

potent path Mar 14, 2019, 11:50 PM

#

ohhhhhh

void anvil Mar 14, 2019, 11:50 PM

#

but he definitely cheats

potent path Mar 14, 2019, 11:50 PM

#

Thanks for that

void anvil Mar 14, 2019, 11:50 PM

#

and doesn't realize it

violet crag Mar 14, 2019, 11:50 PM

#

Cheats? How?

void anvil Mar 14, 2019, 11:50 PM

#

he leaks information from train to test that shouldn't exist

#

He also predicts price, not movement

violet crag Mar 14, 2019, 11:51 PM

#

Oh, I see. In my case I've made a clean cut in the data frame using list slicing.

void anvil Mar 14, 2019, 11:51 PM

#

which is very bad

#

yeah you'll probably catch it when he goes over feature creation

violet crag Mar 14, 2019, 11:51 PM

#

What do you mean by "movement"?

void anvil Mar 14, 2019, 11:51 PM

#

% change

#

you should never predict price, it's too easy

#

and you'll get bad results

#

you want to predict day over day changes and back that out to price

#

so if your stock price is 100, 101, 102, 101

#

you want to be predicting 1%, 0.99%, -0.98%
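
The same numbers as a short sketch, including the step of backing predicted changes out to prices again:

```python
import numpy as np

prices = np.array([100.0, 101.0, 102.0, 101.0])

# Day-over-day change: (today - yesterday) / yesterday.
returns = np.diff(prices) / prices[:-1]
print(np.round(returns * 100, 2))  # in percent

# Backing changes out to prices: compound them from the first price.
rebuilt = prices[0] * np.cumprod(1 + returns)
print(rebuilt)
```

In practice the returns fed into cumprod would be the model's predictions; here the actual returns are used just to show the round trip.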

violet crag Mar 14, 2019, 11:54 PM

#

Damn man. I was working on predicting prices.

#

😕

void anvil Mar 14, 2019, 11:56 PM

#

your predictions will look significantly better when predicting price than when predicting actual movements

potent path Mar 15, 2019, 12:03 AM

#

Is it possible to use ML as a tool for your investments?

void anvil Mar 15, 2019, 12:03 AM

#

yes

#

data is expensive

potent path Mar 15, 2019, 12:04 AM

#

Ive done some work before in that area but never had much success using it on my own portfolio.

void anvil Mar 15, 2019, 12:07 AM

#

because you need good data

#

and it's expensive lol

#

need level 2 or level 3 data

lapis sequoia Mar 15, 2019, 1:37 AM

#

try using the news data

#

there was something on kaggle previously

#

it should help

supple ferry Mar 15, 2019, 9:22 AM

#

one possible idea can be to use news, derive sentiment and apply it to prediction models

violet crag Mar 15, 2019, 11:17 AM

#

Is it true that regression can't work on dates and I have to convert them into numeric data

mossy dragon Mar 15, 2019, 11:39 AM

#

so is anyone here an actual data scientist

supple ferry Mar 15, 2019, 11:48 AM

#

@violet crag regression assumes your exogenous variables are scalars. So, it can not work with date or categorical (I don't mean encoded into dummy)

violet crag Mar 15, 2019, 12:06 PM

#

@supple ferry I successfully converted date into numeric value, keeping the step in mind as well.

I've issue in regression. I thought it would draw a regression line. But regressor.predict([dates]) just gave me same values as actual price

#

📎 download_1.png

#

😦

#

my notebook: https://colab.research.google.com/drive/1RxQ2wBMVmCCu-3u8FrKal3rm1aohdSDC

Google Colaboratory

#

thinkmon I think I know where I am going wrong

polar acorn Mar 15, 2019, 12:20 PM

#

@violet crag In case you didn't see you are predicting on your training dates. You should predict on test_dates.

violet crag Mar 15, 2019, 12:21 PM

#

@polar acorn yes, I caught that

#

now I have issues with Reshape your data either using array.reshape(-1, 1)

#

is this a numpy thing?

polar acorn Mar 15, 2019, 12:22 PM

#

It's a sklearn thing. It prefers your numpy arrays in a certain way.

violet crag Mar 15, 2019, 12:27 PM

#

np.asarray(test_dates).reshape(-1, 1)
ValueError: shapes (101,1) and (403,403) not aligned: 1 (dim 1) != 403 (dim 0)

#

I am clueless

polar acorn Mar 15, 2019, 12:27 PM

#

What is the shape of the X and y you pass to fit()?

violet crag Mar 15, 2019, 12:28 PM

#

it's like this [[1, 2, 3]]

#

simple 1D array was giving error

polar acorn Mar 15, 2019, 12:28 PM

#

if you call test_dates.shape what do you get?

#

or train_dates.shape rather

#

Okay I see what you are doing, you are actually feeding two lists to LinearRegression.fit()

violet crag Mar 15, 2019, 12:32 PM

#

yea, one x and one y

polar acorn Mar 15, 2019, 12:33 PM

#

The fit function wants numpy arrays instead. And it wants them in shape of (samples, features) for X and (samples, targets) for y. So what you can do is
fit(np.reshape(train_dates, (-1, 1)), np.reshape(train_prices, (-1,1)))

#

The np.reshape(train_dates, (-1, 1)) means: reshape my list as a numpy array with dimensions -1 and 1. Wtf you might think, -1? -1 just tells numpy to work out that dimension from the size of the data, so here it is substituted with the length of the list. It is a convenience so you can reshape your list without checking its length.

#

Makes sense?

violet crag Mar 15, 2019, 12:36 PM

#

can you explain me this -1, 1 further

#

🙏

#

now since I have to depict this predicted array on graph, I need to remove extra dimension, right?

#

like from [[]] to []

supple ferry Mar 15, 2019, 12:40 PM

#

@violet crag, if you reshape your list of 100 elements into (-1, 1), it means it will be reshaped to (len(thatList), 1), i.e. (100, 1)

#

it is easier to know it this way

#

there is no extra dimension btw, technically there is, but no 😄
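
For the "from [[]] to []" question, a small sketch assuming the predictions come back as a column of shape (n, 1); the numbers are invented. ravel() returns the same values as a plain 1-D array, which is convenient for plotting:

```python
import numpy as np

# Hypothetical prediction output with shape (3, 1), as predict() returns
# when the target was fitted as a column.
predicted = np.array([[20.0], [22.0], [24.0]])

flat = predicted.ravel()   # shape (3,), same values, no nested brackets
```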

violet crag Mar 15, 2019, 12:41 PM

#

duuuuuude I finally got something

#

📎 download_2.png

polar acorn Mar 15, 2019, 12:42 PM

#

Looks good 😃

#

I mean the predictions are bad but that is pretty much what you should expect here.

supple ferry Mar 15, 2019, 12:43 PM

#

MANOVA is your way to go. Now try that

violet crag Mar 15, 2019, 12:43 PM

#

my assignment is due in 11 hours 🤣

#

📎 download_3.png

#

this is bad, I'll lose money in the market

polar acorn Mar 15, 2019, 12:46 PM

#

😂 If you could make money in the market doing linear regression there would be no market

supple ferry Mar 15, 2019, 12:47 PM

#

😄

#

time series and panel data in general require different approaches

#

https://www.youtube.com/watch?v=JNfxr4BQrLk

YouTube

Enthought

Time Series Analysis with Python Intermediate | SciPy 2016 Tutoria...

Tutorial materials for the Time Series Analysis tutorial including notebooks may be found here: https://github.com/AileenNielsen/TimeSeriesAnalysisWithPython...

#

i dont know if you can manage to watch this and do your assignment at the same time

#

but this video is gold

violet crag Mar 15, 2019, 12:49 PM

#

ooh I'll add it to the list of material that I've found in the last two days, I have to read a lot

supple ferry Mar 15, 2019, 12:49 PM

#

reading is the key

violet crag Mar 15, 2019, 12:49 PM

#

will do after assignment

supple ferry Mar 15, 2019, 12:50 PM

#

if you can rwerite the text you read into the code and vice versa

#

you are good to go

#

😃

#

rewrite*

violet crag Mar 15, 2019, 12:50 PM

#

you mean theory to practice?

supple ferry Mar 15, 2019, 12:50 PM

#

yes

#

knowing is not the power now, knowing how to find out is

violet crag Mar 15, 2019, 12:53 PM

#

🙏

void anvil Mar 15, 2019, 2:27 PM

#

you can do LSMA

#

with linear regression

#

it'll be better than one regression for the entire time series at least

#

you do a linear regression on the last X time points (default is 25) and use it to predict the next period

#

it's still pretty awful but it's better than predicting out months / years
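
A rough sketch of the LSMA idea described above: fit a straight line to the last N points and extrapolate one step. The function name lsma_forecast and the price series are assumptions for illustration, and np.polyfit stands in for whatever regression routine you prefer:

```python
import numpy as np

def lsma_forecast(series, window=25):
    """Fit a line to the last `window` points and extrapolate one step ahead."""
    y = np.asarray(series, dtype=float)[-window:]
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)
    # The next time step sits at x = len(y).
    return slope * len(y) + intercept

# Hypothetical, perfectly linear series: 100, 101, ..., 149.
prices = np.linspace(100.0, 149.0, 50)
next_value = lsma_forecast(prices)
```

On real prices the fit is redone every period with the newest window, so the line keeps following recent data instead of the whole history.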

violet crag Mar 15, 2019, 7:06 PM

#

"Estimated coefficients for the linear regression problem"

#

what does this mean in linear regression?

#

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn-linear-model-linearregression

void anvil Mar 15, 2019, 8:08 PM

#

it's literally the coefficients for the line you draw

violet crag Mar 15, 2019, 8:49 PM

#

aah, slope of the line

void anvil Mar 15, 2019, 10:20 PM

#

slope for all the X variables + intercept
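
A small sketch of where those numbers live in scikit-learn, using data generated from a known line so the result can be checked. The data is made up; note that coef_ holds only the slopes, while the intercept is stored separately in intercept_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x1 - 2*x2 + 5, so the answer is known.
X = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)
slopes = model.coef_          # one slope per column of X
intercept = model.intercept_  # the constant term
```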

vagrant vector Mar 16, 2019, 6:40 AM

#

Hey
Man
I want to learn machine learning
https://pythonprogramming.net/machine-learning-tutorials/
With what should I start from here (what will be the most comfortable to start with)

Python Programming Tutorials

Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free.

lapis sequoia Mar 16, 2019, 7:10 AM

#

well.. you should start with an objective..what are you learning it for?

#

to do data analysis - for business, finance or for other fields?

#

or to do ml for image or text processing?

#

these are very broad..there's more to it.. but it all depends on your objective so you dont end up all over the place

#

https://www.coursera.org/learn/machine-learning#syllabus

Coursera

Machine Learning | Coursera

Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, ...

#

this is where you start

#

then come back for more

quiet gyro Mar 16, 2019, 9:46 AM

#

https://poloclub.github.io/ganlab/

#

lapis sequoia Mar 16, 2019, 9:47 AM

#

ooooooh

#

this is pretty cool

supple ferry Mar 16, 2019, 9:57 AM

#

Now, WOW

buoyant trellis Mar 16, 2019, 11:32 AM

#

I would like to start with data analysis ... essentially transition from QA automation to data analysis... finding it a bit difficult to figure out how to go about it

lapis sequoia Mar 16, 2019, 2:41 PM

#

Has any of you heard of any good bootcamps for data science in Europe? There’s plenty of them advertised online, but which are any good? Or any advice how to filter?

supple ferry Mar 16, 2019, 3:16 PM

#

@lapis sequoia in Berlin there was one for 8k. I saw its agenda and advised my friend to learn it on his own

#

It was like a year ago and I was learning myself too

lapis sequoia Mar 16, 2019, 3:26 PM

#

So it s not really worth it? I understand all the information is online and accessible for free, but I am becoming overwhelmed having no guidance and also the amount is so vast I just feel lost.

supple ferry Mar 16, 2019, 4:50 PM

#

what do you need guidance with ?

buoyant trellis Mar 16, 2019, 4:59 PM

#

@supple ferry As an experienced programmer desiring to move to data analyst role I dont know a definite path to reach data analyst role... I am looking at some online courses but not sure if they are worth the cost...

supple ferry Mar 16, 2019, 6:03 PM

#

@buoyant trellis this can be a good guide

#

https://www.springboard.com/resources/learning-paths/data-analysis/

#

It has also programming things, you can ignore them of course

#

@lapis sequoia also for you

buoyant trellis Mar 16, 2019, 6:04 PM

#

ok

lapis sequoia Mar 16, 2019, 6:17 PM

#

I ll look through that, thanks! @supple ferry

#

I guess my main confusion comes from that most courses i looked at for data analysts/ data scientists suggest a package of skills to master, but when I look at job ads they require a phd in physics or mathematics... to my understanding one needs statistics the most and the technical programming tools, but again, i guess i have the whole idea wrong

buoyant trellis Mar 16, 2019, 6:31 PM

#

my understanding was statistics and probability... last time I switched career path best thing that worked was to demonstrate through projects /blogs etc... and that is what I am planning now as well.. @lapis sequoia I think if you have projects to show there is always someone who is there to hire

lapis sequoia Mar 16, 2019, 6:34 PM

#

I do have projects, but they are far behind PhD research projects. Would you have any references to share on what kind of project is expected of an entry-level analyst?

buoyant trellis Mar 16, 2019, 6:43 PM

#

honestly no... but I will go in for some project from kaggle.com and put it on my cv once I am there

lean ledge Mar 16, 2019, 8:40 PM

#

@lapis sequoia it's not just statistics and probability, it's also calculus and linear algebra and some other maths here and there. People want others with really good maths backgrounds, people that can understand papers that use differential geometry or Poincare embeddings of trees into hyperbolic surfaces.

#

To compete with those guys, you have to know maths at a PhD level and that's going to be very hard

#

Most courses in my experience are only training computer science students and people changing careers for the basics of the field. I'm not sure how accurate a look that gives into the kind of ability employers want, especially since what employers want can vary

torn musk Mar 16, 2019, 8:47 PM

#

@lean ledge wow so differential geometry is a thing??? I really want to know so I can choose to do college course in it

lean ledge Mar 16, 2019, 8:47 PM

#

It is a thing

torn musk Mar 16, 2019, 8:47 PM

#

Because everyone I asked was like "oh it's not necessary"

#

Oof

#

If it's a thing I'll do it

lean ledge Mar 16, 2019, 8:48 PM

#

Prerequisites are real analysis and abstract algebra and number theory here, probably want to do those first

torn musk Mar 16, 2019, 8:48 PM

#

Ahhh interesting

#

I've done the linear algebra, calculus, statistics package

#

But have not touched into those real analysis or differential geometry

lean ledge Mar 16, 2019, 8:49 PM

#

Linear algebra is not abstract algebra

torn musk Mar 16, 2019, 8:49 PM

#

Ok

lapis sequoia Mar 16, 2019, 9:40 PM

#

That is interesting... I think I should give up the idea of changing my career without actually going back to school and starting all over from scratch, especially since I do not hold a degree in a quantitative field.

heavy apex Mar 16, 2019, 11:10 PM

#

I'm really close to finishing my BS, and will probably get my MS in data science, but really curious about the difficulty for a non-degree holder trying to get into a data science career off boot camps or self training alone.

#

I never hear any first hand experiences, just a bunch of those ads everywhere on YouTube and whatnot.

lean ledge Mar 16, 2019, 11:43 PM

#

I don't know a single person who's self taught themselves into a data science role completely

#

Dunno about how difficult it would be but anecdotally everyone I know has some degree, mostly in maths/Phys/stats or engineering, with some CS

feral lodge Mar 17, 2019, 3:32 AM

#

I'm not familiar with the job market so forgive my ignorance, but surely most entry-level data analyst positions require only a bachelor's? That's the case with the first few results on glassdoor when i search "entry level data analyst". The Poincaré embedding stuff (https://arxiv.org/pdf/1705.08039.pdf) is still on the research stage no?

lapis sequoia Mar 17, 2019, 3:35 AM

#

^sorry for typing something when you have a question but @lean ledge why do u say u dont know a single person who's self taught themselves data science

lapis sequoia Mar 17, 2019, 3:51 AM

#

What does inline for an embedded message mean?

lean ledge Mar 17, 2019, 4:08 AM

#

data analyst positions arent data science positions. in my experience at least, data analysts are mostly full of non-technical people (eg. business majors etc) manually looking for trends. lots of tableau visualisations and whatnot as opposed to data science which tends to be lots of machine learning, statistics and modelling, very math heavy.

#

there are definitely lots of positions in data science that are satisfied with bachelors level too, I was just trying to give a motivation for why some specific positions, like the ones they are finding, might be looking for math/physics majors. @feral lodge

#

the poincare example was just me looking for a complicated sounding paper I saw recently, not an actual serious thing people might implement but I am reading research papers so often for data science stuff, I dont think research stage stuff is off limits for most data scientists at all

#

@lapis sequoia I know lots of people who've taught themselves data science, just not those with no degree at all who've taught themselves data science to a working capacity. lots of engineering/phys/math/CS majors who self-taught themselves it all, just dont know anyone without a degree at all or from non-technical backgrounds

lapis sequoia Mar 17, 2019, 9:45 AM

#

Well, for data scientist I understand the high requirements, but I was referring more to entry level data analyst. I do see jobs in US that would fall into my educational background and training well, but in EU requirements seem to be through the roof.

lyric canopy Mar 17, 2019, 10:47 AM

#

Terms are used in a slightly different manner from place to place. Here, it's used for people that have an actual analytical background and have an understanding of, say, R/Python, SQL, modelling, and stuff like that

#

I've just pulled up three random listings for "data analyst" and they all require modelling experience, a mathematical/statistical background, and a university degree. They do pay well, though.

supple ferry Mar 17, 2019, 11:47 AM

#

In the EU the situation is quite different, though. When I was working in Germany, I was a data analyst, but I was also doing data science stuff, like building predictive models for forecasting. The job description asked for a solid math and programming background. However, when I later compared my skillset to the job market in the US, I saw that they required much less of a data analyst there than in the EU. A solid data analyst in the EU can easily be a senior or lead data analyst in the US for more money

vagrant vector Mar 17, 2019, 4:20 PM

#

Well I decided to start learning machine learning and some guy here told me to try the Coursera course

#

Anyone knows if its a good one??

warm gulch Mar 17, 2019, 4:24 PM

#

I’m doing edX courses

vagrant vector Mar 17, 2019, 4:26 PM

#

I dont know if this course is good my self because I am new to this

#

So I am looking for someone who had learned it already and can tell me what he did

pale eagle Mar 17, 2019, 5:04 PM

#

@vagrant vector try that course, you will automatically know what is good or bad for you

kindred stirrup Mar 17, 2019, 5:38 PM

#

anyone know a good package for doing arima modeling? seems more like a job for R

supple ferry Mar 17, 2019, 5:42 PM

#

@kindred stirrup statsmodels has arima functions

kindred stirrup Mar 17, 2019, 5:44 PM

#

@supple ferry ahh thanks man i was spending several frustrating hours trying to download this auto.arima package

#

FWIW I work with a lot of data analysts and all of us have a master's degree. The only Data Science people have PhDs

supple ferry Mar 17, 2019, 5:54 PM

#

@kindred stirrup you welcome

#

İ think the power of having PhD is the experience you get while doing scientific research.. It teaches you to look not at just the result part, but also at the process and how it should be done

visual tangle Mar 17, 2019, 6:04 PM

#

What library would be the best for machine learning?

void anvil Mar 17, 2019, 8:20 PM

#

sklearn

lapis sequoia Mar 18, 2019, 1:47 AM

#

SnapML from ibm

#

:v

supple ferry Mar 18, 2019, 5:21 AM

#

It automatically deletes random 50% of your data? @lapis sequoia

lapis sequoia Mar 18, 2019, 5:21 AM

#

lol

#

what

supple ferry Mar 18, 2019, 5:21 AM

#

Reference to Thanos😀

lapis sequoia Mar 18, 2019, 5:22 AM

#

I know..

supple ferry Mar 18, 2019, 6:18 AM

#

İ am being too nerdy 🤪

supple ferry Mar 18, 2019, 8:05 AM

#

Which alternatives can I use for measuring the model performance in Conditional Logit?

#

alternatives to ROC curve

void anvil Mar 18, 2019, 12:08 PM

#

everything that you use for linear regressions

#

it runs MLE so log likelihood

#

all the lag + normality tests

#

but you should get a coef, std error, z & p score, conf intervals, chi2, pseudo R2, F score (prob > chi2), log likelihood, etc.

#

ROC curve is pretty shit for logit/probit tbh

#

pseudo-R2 has Aldrich-Nelson, Cragg-Uhler (Cox and Snell) and its variants, Estrella and its variants, Veall-Zimmermann

#

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/

IDRE Stats

JOSE ROCHA

FAQ: What are pseudo R-squareds?

#

@supple ferry

#

And, of course, AIC / BIC

fervent solar Mar 18, 2019, 1:42 PM

#

whats happening when grouping by L ??

📎 Screenshot_from_2019-03-18_19-11-59.png

lyric canopy Mar 18, 2019, 2:25 PM

#

It's grouping by the specified row indices, I think

#

Fits the numbers it produces

#

(0 + 2 + 5) / 3 = 2.33333
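
A small sketch of what grouping by a list does, assuming the screenshot showed something like df.groupby(L) on a six-row frame. The frame and the list below are made up to reproduce the (0 + 2 + 5) / 3 figure. The list is read as one group label per row, so rows that share a label are aggregated together:

```python
import pandas as pd

# Hypothetical frame with values 0..5 and one group label per row.
df = pd.DataFrame({"data1": range(6)})
L = [0, 1, 0, 1, 2, 0]

# Rows 0, 2 and 5 carry label 0, rows 1 and 3 carry label 1, row 4 carries label 2.
means = df.groupby(L).mean()
```

Label 0 collects the values 0, 2 and 5, which gives the 2.33333 mean quoted above.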

supple ferry Mar 18, 2019, 4:51 PM

#

@void anvil , thank you!
I tend to agree with you on the ROC curve; however, I am asked to provide one in order to explain the performance of the model to "business" people. I am using statsmodels.discrete.conditional_model.ConditionalLogit for this, which is in master but not in a published release. They don't have predict for that.
I understand the reason. It is estimated assuming group-related fixed effects, so predicting out of the box would be statistically stupid.

void anvil Mar 18, 2019, 5:40 PM

#

sklearn has logistic regression

#

@supple ferry

#

https://www.uio.no/studier/emner/sv/oekonomi/ECON4150/v16/lectures/lecture15_binary_dependent_variables.pdf

#

is a good introduction as well

#

I have an amazing econometrics book, but it's hand written by our teacher and I can't really scan it up at the moment

#

But basically if you just follow that lecture you can explain in simple terms what everything means

#

If you have access to STATA (bleh), it's super easy to run and get everything precalculated by just doing load data, x = these columns, y = this column, logit(x,y)

#

It's not nearly as powerful but it generates nice, neat tables with nearly 0 effort

#

iirc R has a ton of business-report level stuff as well

lapis sequoia Mar 18, 2019, 10:51 PM

#

How do i delay a project

#

So it does not close right away

lapis sequoia Mar 18, 2019, 11:18 PM

#

In MPL how do I get these 2 numbers to be the same? The graph isn't scaling properly.

📎 unknown.png

visual tangle Mar 19, 2019, 2:09 AM

#

@lapis sequoia time.sleep

#

make sure to import time

lapis sequoia Mar 19, 2019, 2:10 AM

#

yes i got it

lyric canopy Mar 19, 2019, 7:11 AM

#

What do you mean by scaling correctly, @lapis sequoia ? To me it just looks like your line ends at (3.00, 6.0)

#

Or, y = 2x

mossy dragon Mar 19, 2019, 9:17 AM

#

how much do you think affirmative action plays into graduate admissions

lapis sequoia Mar 19, 2019, 9:22 AM

#

not much man..

#

and this probably isn't the channel to talk about this

lapis sequoia Mar 19, 2019, 11:10 AM

#

I want the 2 axis to be same

supple ferry Mar 19, 2019, 11:17 AM

#

@lapis sequoia , you can access and change ticks via plt.xticks(np.arange(min(x), max(x)+1, 1.0))

#

if you want to set limits, matplotlib.axes.Axes.set_xlim
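
A sketch of one way to make both axes match, assuming "the same" means identical limits and tick spacing. The y = 2x line is a stand-in for the plot in the screenshot, and the Agg backend is chosen only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x                      # hypothetical line y = 2x

fig, ax = plt.subplots()
ax.plot(x, y)

# Use the larger of the two maxima as the shared upper limit.
upper = max(x.max(), y.max())
ax.set_xlim(0, upper)
ax.set_ylim(0, upper)

# Same tick positions on both axes, one unit apart.
ticks = np.arange(0, upper + 1, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

# One unit on x is drawn as long as one unit on y.
ax.set_aspect("equal")
```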

lapis sequoia Mar 19, 2019, 2:56 PM

#

so uhh

#

is matplotlib like the #1 used package for data visualization?

reef bone Mar 19, 2019, 3:00 PM

#

I'd say matplotlib is like the go-to for if you just want to quickly visualize something

#

I know that Seaborn and Plotly generally produce better looking visualizations for if you want to present your results in a more sophisticated way

#

Also Plotly is, I think, appropriate if you want to have interactive visualizations

lapis sequoia Mar 19, 2019, 3:13 PM

#

ok

#

does it support showing outliers?

reef bone Mar 19, 2019, 3:18 PM

#

Not sure what you mean by that, outliers are just data points right

#

You might need to select the correct visualization type / technique but that's not really a package specific feature

lapis sequoia Mar 19, 2019, 3:36 PM

#

cuz like

#

theres a mathematical definition of an outlier

#

so like

#

is there a way you can plot.get_outliers() or something?

reef bone Mar 19, 2019, 3:42 PM

#

I'm not aware of that being a feature, but I would argue that's not the package's concern

#

You can always extract them yourself and feed them to the visualization framework

#

Or colour them differently

lapis sequoia Mar 19, 2019, 3:42 PM

#

yeah but like

#

how would I know which ones to color differently

#

how would I extract the outliers

reef bone Mar 19, 2019, 3:43 PM

#

You said there's a mathematical formula you would like to use, why not apply it to the data and extract the data points that you wish to work with?

lapis sequoia Mar 19, 2019, 3:43 PM

#

hmm

reef bone Mar 19, 2019, 3:43 PM

#

Then feed them to matplotlib separately

lapis sequoia Mar 19, 2019, 3:44 PM

#

I said mathematical definition but I guess I could make it into a formula

reef bone Mar 19, 2019, 3:44 PM

#

Which definition are you working with

lapis sequoia Mar 19, 2019, 3:45 PM

#

any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile.

#

that

reef bone Mar 19, 2019, 3:54 PM

#

I would extract those thresholds from your dataset and then loop over it and filter the outliers

#

And input them separately

#

You probably only need to extract the indices, I'm fairly sure matplotlib can colour by indices

#

For the record I'm not saying that there aren't data vis packages that can do that, I just wouldn't expect it from them
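
A short sketch of the 1.5 × IQR rule discussed above, done with a boolean mask rather than an explicit loop. The data values are invented:

```python
import numpy as np

# Hypothetical measurements with two obvious extremes.
data = np.array([10, 12, 11, 13, 12, 11, 14, 95, 12, 13, -40], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True where a point falls outside the fences.
is_outlier = (data < low) | (data > high)
outliers = data[is_outlier]
inliers = data[~is_outlier]
```

The mask is_outlier can then be used to draw the two groups in separate plot calls with different colours.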

void anvil Mar 19, 2019, 7:04 PM

#

matlab and seaborn

#

are the two I use the most

lapis sequoia Mar 19, 2019, 7:10 PM

#

ok

#

thanks for your help guys!

sharp jetty Mar 20, 2019, 3:54 AM

#

anyone know how you can add labels to an image?
I have images that are named XR_ELBOW_patient00011_negative_0
as well as images that are named XR_ELBOW_patient00016_positive_4
essentially my labels are positive or negative

#

what im looking to do is run them through a CNN model

supple ferry Mar 20, 2019, 12:36 PM

#

@sharp jetty , you can just run a script which will parse the names of images and then assign that array to label column
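
A minimal sketch of such a script, assuming the files follow the naming pattern quoted above. The .png extension and the helper name label_from_name are assumptions:

```python
from pathlib import Path

def label_from_name(filename):
    """Return 1 for 'positive', 0 for 'negative', based on the file name."""
    stem = Path(filename).stem.lower()
    if "_positive_" in stem:
        return 1
    if "_negative_" in stem:
        return 0
    raise ValueError(f"no label found in {filename!r}")

# Hypothetical file names following the pattern from the question.
names = [
    "XR_ELBOW_patient00011_negative_0.png",
    "XR_ELBOW_patient00016_positive_4.png",
]
labels = [label_from_name(n) for n in names]
```

The resulting list lines up with the file list, so it can be stored next to the image paths and passed to the model as the target.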

midnight atlas Mar 20, 2019, 5:57 PM

#

Hi, I'm having a problem with some feature extraction code for speech recognition. I'm trying to use a function to convert from linear magnitude spectrum to mel spectrum. I think I may just be misunderstanding how to use certain functions in Python, but any pointers would be great!

#

```python
    def make_mel_filterbank(self):

        lo_mel = self.lin2mel(self.lo_freq)
        hi_mel = self.lin2mel(self.hi_freq)

        # uniform spacing on mel scale
        mel_freqs = np.linspace(lo_mel, hi_mel, self.num_mel+2)

        # convert mel freqs to hertz and then to fft bins
        bin_width = self.samp_rate/self.fft_size # typically 31.25 Hz, bin[0]=0 Hz, bin[1]=31.25 Hz,..., bin[256]=8000 Hz
        mel_bins = np.floor(self.mel2lin(mel_freqs)/bin_width)

        num_bins = self.fft_size//2 + 1
        self.mel_filterbank = np.zeros([self.num_mel, num_bins])
        for i in range(0,self.num_mel):
            left_bin = int(mel_bins[i])
            center_bin = int(mel_bins[i+1])
            right_bin = int(mel_bins[i+2])
            up_slope = 1/(center_bin-left_bin)
            for j in range(left_bin, center_bin):
                self.mel_filterbank[i, j] = (j - left_bin)*up_slope
            down_slope = -1/(right_bin-center_bin)
            for j in range(center_bin, right_bin):
                self.mel_filterbank[i, j] = (j-right_bin)*down_slope
```

#

that's the function ^

#

```python
    # for each frame(column of 2D array 'magspec'), compute the log mel spectrum by applying the mel filterbank to the magnitude spectrum
    def magspec_to_fbank(self, magspec):
        # apply the mel filterbank
        fbank = np.convolve(self.make_mel_filterbank(), magspec)
        return fbank
```

#

that's where I'm trying to apply it to magspec

#

thanks
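
Two things stand out in the snippets above: make_mel_filterbank() stores the matrix on self but returns nothing, so np.convolve receives None, and np.convolve only handles 1-D inputs anyway. Applying a filterbank to a spectrogram is usually a matrix product. Below is a standalone sketch of that idea; the hz_to_mel / mel_to_hz formulas are a common choice and an assumption here, since lin2mel and mel2lin are not shown in the chat, and the parameter values are made up:

```python
import numpy as np

def hz_to_mel(f):
    # Common mel formula; an assumption, since lin2mel is not shown in the chat.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def make_mel_filterbank(num_mel, fft_size, samp_rate, lo_freq, hi_freq):
    """Return a (num_mel, fft_size//2 + 1) matrix of triangular filters."""
    mel_freqs = np.linspace(hz_to_mel(lo_freq), hz_to_mel(hi_freq), num_mel + 2)
    bin_width = samp_rate / fft_size
    mel_bins = np.floor(mel_to_hz(mel_freqs) / bin_width).astype(int)

    filterbank = np.zeros((num_mel, fft_size // 2 + 1))
    for i in range(num_mel):
        left, center, right = mel_bins[i], mel_bins[i + 1], mel_bins[i + 2]
        for j in range(left, center):        # rising edge of the triangle
            filterbank[i, j] = (j - left) / (center - left)
        for j in range(center, right):       # falling edge of the triangle
            filterbank[i, j] = (right - j) / (right - center)
    return filterbank  # returned, not only stored, so the caller can use it

def magspec_to_fbank(filterbank, magspec):
    """Apply the filterbank to every frame (column) and take the log."""
    mel_energies = filterbank @ magspec             # (num_mel, num_frames)
    return np.log(np.maximum(mel_energies, 1e-10))  # floor avoids log(0)

fb = make_mel_filterbank(num_mel=40, fft_size=512, samp_rate=16000,
                         lo_freq=0, hi_freq=8000)
magspec = np.ones((257, 3))   # hypothetical magnitude spectrum with 3 frames
fbank = magspec_to_fbank(fb, magspec)
```

In the class from the question, the equivalent change would be to call self.make_mel_filterbank() once and then use self.mel_filterbank @ magspec.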

lapis sequoia Mar 21, 2019, 6:05 PM

#

are there any things on Kaggle for students?
to practice doing data science
or should I just pick a competition and try it

#

I just need simple data science problem to practice the basics; ones that should take about an hour to solve

lean ledge Mar 21, 2019, 8:56 PM

#

@lapis sequoia there's basic intro problems there

#

Eg the house price one

mossy dragon Mar 22, 2019, 12:04 AM

#

Hello guys, I have to write a short essay about a technological trend that will impact my target career (data scientist), and I was thinking about writing something about how increasingly complex models and automation are making it easier to make predictions, but since it's becoming more of a "black box" where understanding the model is not necessary, it might cause problems down the line.

#

sounds like a good idea?

marsh fog Mar 22, 2019, 2:21 PM

#

Can anyone here give me a hand with Pandas?

supple ferry Mar 22, 2019, 2:48 PM

#

@marsh fog , you can ask your question right away

marsh fog Mar 22, 2019, 2:49 PM

#

@supple ferry I posted it in

#

#help-grapes if you could lend a hand

supple ferry Mar 22, 2019, 2:50 PM

#

@marsh fog , I could not find your question there. you can repost it here

marsh fog Mar 22, 2019, 2:51 PM

#

I've got a list of dataframes and I'm using pd.melt to modify them:
gdp_melt = pd.melt(dataframes[0].reset_index(), id_vars=['country'], var_name="year", value_name="GDP")
gdp_melt.set_index(['country', 'year'], inplace=True)
Is there a way I can run a loop or function to melt all the dataframes in the list but change the value_name to apply to each dataframe?

supple ferry Mar 22, 2019, 2:59 PM

#

will your list of datasets be grouped in some way?

#

you can write a function which does that, but you will need to pass it the value for value_name

marsh fog Mar 22, 2019, 3:00 PM

#

how would I go about doing that?

#

Yeah I have 3 dataframes contained in a list called dataframes so to access each one it's dataframes[0], dataframes[1] etc

supple ferry Mar 22, 2019, 3:03 PM

#

do you need to modify those datasets, or return a new list with melted datasets

#

?

#

https://hastebin.com/efehumowab.py

#

something like this maybe

#

in the last line i got a typo :D
melted_list.append

#

not meelted_list.append

marsh fog Mar 22, 2019, 3:08 PM

#

Modify those datasets rather than return a new list

#

So something like this:
def melt_df(df):
    df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP")
    df = df.set_index(['country', 'year'], inplace=True)
    return
dataframes = [melt_df(df) for df in dataframes]
but then it makes all my dataframes blank

supple ferry Mar 22, 2019, 3:11 PM

#

if you take one df from that list and try to run that func

#

what do you get ?

marsh fog Mar 22, 2019, 3:11 PM

#

blank dataframes is what I get xD

📎 unknown.png

lyric canopy Mar 22, 2019, 3:33 PM

#

What does your melt_df function return? It looks like it's returning None at the moment

#

Does it modify something in place instead of returning something?

marsh fog Mar 22, 2019, 3:35 PM

#

It's meant to modify the list of dataframes

supple ferry Mar 22, 2019, 3:35 PM

#

So, mobile discord doesn't allow me to read the code normally @marsh fog. Ves is right 😀

#

You are using list comprehension, it means your function must return something
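
A tiny sketch of why the list ends up "blank": a function without a return value hands back None, and the list comprehension collects exactly that. The frame and the function names are made up for illustration:

```python
import pandas as pd

# Hypothetical one-frame list, shaped loosely like the data in the chat.
df = pd.DataFrame({"country": ["A", "B"], "2000": [1.0, 2.0]})

def no_return(frame):
    frame = frame.set_index("country")   # result is computed but never returned

def with_return(frame):
    return frame.set_index("country")    # the new frame is handed back

broken = [no_return(d) for d in [df]]    # collects None
fixed = [with_return(d) for d in [df]]   # collects the re-indexed frame
```

Methods called with inplace=True have historically returned None as well, so assigning their result back to df has the same effect as the missing return.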

hoary geyser Mar 22, 2019, 3:37 PM

#

def melt_df(df):
    df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP")
    df = df.set_index(['country', 'year'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]

#

maybe?

#

return the df you modified

marsh fog Mar 22, 2019, 3:38 PM

#

unhashable list error again

#

Sorry @lyric canopy this is what I'm trying and it's throwing an unhashable list error:
def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"])
    df.set_index(['country', 'year'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]

hoary geyser Mar 22, 2019, 3:38 PM

#

can you show a picture of the traceback

marsh fog Mar 22, 2019, 3:39 PM

#

📎 unknown.png

hoary geyser Mar 22, 2019, 3:40 PM

#

value name doesnt seem like it should be a list

#

just do value_name="GDP"

#

also what IDE are you using? thats the nicest traceback ive ever seen

marsh fog Mar 22, 2019, 3:41 PM

#

Now I get a Key Error: 'year'

#

Anaconda

#

JupyterLab

hoary geyser Mar 22, 2019, 3:41 PM

#

ty

marsh fog Mar 22, 2019, 3:41 PM

#

It's quite nice 😄

hoary geyser Mar 22, 2019, 3:42 PM

#

whats the traceback for the keyerror

marsh fog Mar 22, 2019, 3:43 PM

#

It's small - Had to scroll out to catch it all

📎 unknown.png

#

Can you even read that? eek

hoary geyser Mar 22, 2019, 3:45 PM

#

i opened it on browser and zoomed

marsh fog Mar 22, 2019, 3:45 PM

#

👌

hoary geyser Mar 22, 2019, 3:45 PM

#

not sure, but it looks like set_index is expecting certain keys?

marsh fog Mar 22, 2019, 3:45 PM

#

Something else Is wrong here though; because if I change the line to df.set_index(['country'], inplace=True)

#

It no longer throws an error

#

but

hoary geyser Mar 22, 2019, 3:46 PM

#

so it doesnt like 'years'

marsh fog Mar 22, 2019, 3:46 PM

#

It does absolutely nothing to the dataframes

hoary geyser Mar 22, 2019, 3:46 PM

#

oh

marsh fog Mar 22, 2019, 3:46 PM

#

it doesn't melt them at all

hoary geyser Mar 22, 2019, 3:46 PM

#

what library is this so i can look at some docs

peak jetty Mar 22, 2019, 3:46 PM

#

pandas?

marsh fog Mar 22, 2019, 3:46 PM

#

yes

#

ideally as well I need value_name to change for each dataframe within the list - But I'm just trying to get this to work first xD

#

because there are 3 datasets in a dataframe of their own - Each of them have a different variable: HDI,GDP, Unemployment figures along with countries and years. So the value_name needs to change for each

hoary geyser Mar 22, 2019, 3:48 PM

#

def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"])
    df.set_index(['country', 'year'], inplace=True)
    return df

#

when you melt it, the id_vars is ["country"] only

#

and 'year' is a var

#

does that cause issues?

marsh fog Mar 22, 2019, 3:49 PM

#

Nope

#

If I run this outside of a function

#

on a dataframe not in a list

#

It works fine

#

📎 unknown.png

peak jetty Mar 22, 2019, 3:49 PM

#

Is this sensitive data? Could you dump a snippet of df.to_dict() so we could mess around with it?

marsh fog Mar 22, 2019, 3:49 PM

#

ofcourse

#

It's just data from worldbank.org

peak jetty Mar 22, 2019, 3:50 PM

#

Oh how are you getting it into your df

marsh fog Mar 22, 2019, 3:50 PM

#

using glob

#

📎 unknown.png

peak jetty Mar 22, 2019, 3:52 PM

#

Oh I see, yea a snippet of the converted dict might be easier, try pd.DataFrame.from_dict() and let us know the conversion params, too please

marsh fog Mar 22, 2019, 3:52 PM

#

would you just like the datasets? xD

#

not sure if that's easier

#

https://send.firefox.com/download/77de6aac33/#dYXBiukO1RSSnGStmIYP1Q

Firefox Send

Encrypt and send files with a link that automatically expires to ensure your important documents don’t stay online forever.

peak jetty Mar 22, 2019, 3:57 PM

#

Maybe, I just wanted to run one line to set it up,

#

Oh, now you'll have to give us all your code though

marsh fog Mar 22, 2019, 3:59 PM

#

that's okay, there is only a couple of lines

#

just imports it and then renames a few columns and then the bit I'm stuck on aha

#

using melt

#

filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding = "ISO-8859-1"))
    
def process_df(df):
    """
    input: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(index = df.tail(6).index, columns=["Series Name", "Series Code", "Country Code"], inplace=True)
    df.rename(columns= lambda x: x[0:4], inplace=True)
    df.rename(columns={'Coun':'country'}, inplace=True)
    df.set_index('country', inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors='coerce')
dataframes = [process_df(df) for df in dataframes]
#1 is GDP, 2 is internet-users, 3 is unemployment


def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    df.set_index(['country'], inplace=True)
    return df
dataframes = [melt_df(df) for df in dataframes]

#

@peak jetty I do much appreciate the help though dude! Thank you!

peak jetty Mar 22, 2019, 4:05 PM

#

Well don't thank me yet

#

Uh, missing pandas import? import pandas as pd I'm guessing?

marsh fog Mar 22, 2019, 4:05 PM

#

yeah

#

import pandas as pd

peak jetty Mar 22, 2019, 4:10 PM

#

Ok, so it's not throwing an error now, but melt_df doesn't seem to be doing anything, that's where you left off?


```
In [2]: new_dataframes = [melt_df(df) for df in inter_dataframes]

In [3]: inter_dataframes == new_dataframes
Out[3]: True
```

marsh fog Mar 22, 2019, 4:11 PM

#

yes

peak jetty Mar 22, 2019, 4:13 PM

#

Is the process function working correctly?


In [2]: the_df = inter_dataframes[0]                                                                                                                           

In [3]: the_df.columns                                                                                                                                         
Out[3]: 
Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018'],
      dtype='object')

[0] is the GDP df?

marsh fog Mar 22, 2019, 4:14 PM

#

yes

#

it should be the GDP df, but after running pd.melt df.columns should only return GDP column

#

because countries and years should be set as indexes

peak jetty Mar 22, 2019, 4:15 PM

#

I'm only looking at dataframes[0] for now

#

So the columns are years from 1960-2018?

marsh fog Mar 22, 2019, 4:15 PM

#

yes

#

So melt should change those Year columns into a single column and then you get the GDP values in their own column with the value_name part of pd.melt

peak jetty Mar 22, 2019, 4:18 PM

#

Hmm, in the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) there is a value_vars param which seems to be missing from yours

marsh fog Mar 22, 2019, 4:19 PM

#

I believe value_vars is the same as value_name

peak jetty Mar 22, 2019, 4:19 PM

#

Are you sure? From their example:

>>> df.melt(id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')

#

They are passing both

marsh fog Mar 22, 2019, 4:19 PM

#

oh yeah

peak jetty Mar 22, 2019, 4:20 PM

#

Oh yea?

marsh fog Mar 22, 2019, 4:20 PM

#

oh no

#

by default

#

value_vars is none

#

because it's columns to unpivot if not specified by id_vars

peak jetty Mar 22, 2019, 4:20 PM

#

That would explain why it doesn't do anything

marsh fog Mar 22, 2019, 4:21 PM

#

No, should be working?

#

value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

peak jetty Mar 22, 2019, 4:21 PM

#

Oh I see, hmm but your id is the index

#

Can you actually do that? In their examples they never pass an index as the id_var

marsh fog Mar 22, 2019, 4:23 PM

#

from what I looked up - To get around that this is used pd.melt(df.reset_index()

#

So it resets the index making the index a column again so that it can then be used by id_vars

peak jetty Mar 22, 2019, 4:23 PM

#

Ah, got it, that would probably explain why it isn't blowing up

marsh fog Mar 22, 2019, 4:23 PM

#

haha

#

that's why df.reset_index(inplace=True) is on a separate line, because if you try to run df.melt(df.reset_index, id_vars=["country"], var_name=["year"], value_vars="GDP") you get this

📎 unknown.png

peak jetty Mar 22, 2019, 4:26 PM

#

Out[10]: 
                                                 country  year        GDP
0                                            Afghanistan  1960        NaN
1                                                Albania  1960        NaN
2                                                Algeria  1960        NaN
3                                         American Samoa  1960        NaN
4                                                Andorra  1960        NaN
5                                                 Angola  1960        NaN
6                                    Antigua and Barbuda  1960        NaN

marsh fog Mar 22, 2019, 4:26 PM

#

that looks right

#

What did you do?

peak jetty Mar 22, 2019, 4:27 PM

#

Only thing I did was assign a new df to the commands already here

marsh fog Mar 22, 2019, 4:28 PM

#

So that's only performed it on 1/3 dataframes?

peak jetty Mar 22, 2019, 4:28 PM

#

    the_df = the_df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df = the_df.set_index(["country"], inplace=True)

#

Literally all I changed

#

It seemed like df.melt() was returning a df but not actually modifying the df it was applied to
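
A minimal sketch of the behaviour described above, assuming a toy two-country frame (the names and values are made up). `reset_index()` turns the `country` index back into a column, and `melt()` hands back a new long-format frame without touching the one it was called on, so the result has to be assigned:

```python
import pandas as pd

# Toy wide-format frame; countries and values are invented for illustration.
wide = pd.DataFrame(
    {"1960": [1.0, 2.0], "1961": [3.0, 4.0]},
    index=pd.Index(["Albania", "Algeria"], name="country"),
)

# reset_index() returns a copy where "country" is an ordinary column again.
flat = wide.reset_index()

# melt() returns a new frame; `flat` keeps its wide shape.
long_df = flat.melt(id_vars=["country"], var_name="year", value_name="GDP")

print(flat.shape)     # still one column per year
print(long_df.shape)  # one row per (country, year) pair
```

Calling `flat.melt(...)` without assigning the result is what made the original `melt_df` look like it did nothing.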

marsh fog Mar 22, 2019, 4:29 PM

#

AH

#

I wonder if df.melt can take an inplace argument

peak jetty Mar 22, 2019, 4:29 PM

#

I hate how pandas does that

marsh fog Mar 22, 2019, 4:29 PM

#

No it can't xD

peak jetty Mar 22, 2019, 4:29 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt

#

I don't see one

#

I really hate how Pandas does that, and how there is df.melt() and pd.melt(df)

marsh fog Mar 22, 2019, 4:30 PM

#

yeah it's so frustrating

#

wrong line

#

So you've kept that line after the function?

peak jetty Mar 22, 2019, 4:32 PM

#

Yea

#

Oh and sorry, should be

def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df

#

You can't even mix and match the assignments, what a mess that can become

marsh fog Mar 22, 2019, 4:34 PM

#

mhmm it's telling me the_df is not defined

peak jetty Mar 22, 2019, 4:34 PM

#

import glob
import pandas as pd

filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding="ISO-8859-1"))


def process_df(df):
    """
    input: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(
        index=df.tail(6).index,
        columns=["Series Name", "Series Code", "Country Code"],
        inplace=True,
    )
    df.rename(columns=lambda x: x[0:4], inplace=True)
    df.rename(columns={"Coun": "country"}, inplace=True)
    df.set_index("country", inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors="coerce")


inter_dataframes = [process_df(df) for df in dataframes]
# 1 is GDP, 2 is internet-users, 3 is unemployment


def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df


new_dataframes = [melt_df(df) for df in inter_dataframes]

#

Maybe I missed another change?

#

I'm not getting any issues running that

marsh fog Mar 22, 2019, 4:35 PM

#

All my values are blank

#

mhmm xD

#

oh no

#

It's working now

#

how shitty is that though from pandas

#

You're a legend

#

So now If I wanted to do exactly the same thing but for where you see value_name

#

for new_dataframes[0] it should be GDP, for new_dataframes[1] it should be internet-users, and for new_dataframes[2] it should be unemployment

peak jetty Mar 22, 2019, 4:38 PM

#

Start by adding a param to the melt function

def melt_df(df, value_name):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name=value_name)
    the_df.set_index(["country"], inplace=True)
    return the_df

minor solar Mar 22, 2019, 4:40 PM

#

Hey y'all, I have a simple question when @marsh fog is finished with his...

peak jetty Mar 22, 2019, 4:40 PM

#

Then it's up to you, instead of a list of dataframes you could loop through objects, where the object has a value_name and dataframe, but you should tie them together somehow, a dict might work, too
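
One way to tie each value name to its dataframe is the dict suggested above. This is only a sketch: `make_frame` is a hypothetical stand-in for the processed World Bank frames, and this `melt_df` works on a reset copy instead of using `inplace=True`.

```python
import pandas as pd


def melt_df(df, value_name):
    # Work on a reset copy so the caller's frame is not modified.
    flat = df.reset_index()
    long_df = flat.melt(id_vars=["country"], var_name="year", value_name=value_name)
    return long_df.set_index("country")


def make_frame(values):
    # Hypothetical stand-in for one processed frame (countries x years).
    return pd.DataFrame(
        {"1960": values, "1961": values},
        index=pd.Index(["Albania", "Algeria"], name="country"),
    )


# The dict key doubles as the value_name passed to melt.
frames = {
    "GDP": make_frame([1.0, 2.0]),
    "internet-users": make_frame([0.1, 0.2]),
    "unemployment": make_frame([5.0, 6.0]),
}

melted = {name: melt_df(df, name) for name, df in frames.items()}
print(melted["internet-users"].columns.tolist())
```

With real files, the keys could come from the filenames matched by `glob`, provided the names identify the series.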

marsh fog Mar 22, 2019, 4:40 PM

#

if I specify a dictionary of the value_names

#

How can I get the function df.melt to apply a dictionary name to the value_name

peak jetty Mar 22, 2019, 4:41 PM

#

See my comment above, there are two changes to your melt function

marsh fog Mar 22, 2019, 4:41 PM

#

ahh yeah, I only saw the first change; cheers dude!

#

So thankful!

peak jetty Mar 22, 2019, 4:42 PM

#

No worries, happy to help, I learned there as much as you did

marsh fog Mar 22, 2019, 4:42 PM

#

That's what it's all about! ahah 👍

cursive glade Mar 22, 2019, 5:13 PM

#

so, i have a few machine learning scripts. id like to write a program that takes a specified dataset and evaluates that dataset on each of the ML scripts, collects the results and then outputs the statistics for each script.
i was thinking about writing an evaluation script that uses subprocesses to run the ML scripts, optionally with some passed args, and starts the next subprocess after the current one is finished.
if i run a testscript that has basically only output = subprocess.check_output(["python", "Net1.py", *some args*]) in it, i can see the errors that Net1.py throws when my assert(condition) statements are false, but if they arent and the program just runs as intended, i dont get any feedback...
i havent worked with subprocesses before, am i doing anything wrong here? or more general, is there a better way to transfer results from the ML scripts to the testscript without storing the results in say textfiles somewhere to then just read them?
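
A note on why nothing shows up: `subprocess.check_output` returns only what the child writes to stdout, while a failing `assert` prints its traceback to stderr, which goes straight to the terminal. If the script never prints its results, there is nothing to capture. A minimal sketch, assuming the ML script can print one JSON line at the end (the inline `child_code` stands in for something like `Net1.py`, and the metric names are made up):

```python
import json
import subprocess
import sys

# Stand-in for an ML script: it prints its results as a single JSON line.
child_code = 'import json; print(json.dumps({"accuracy": 0.9, "loss": 0.3}))'

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect stdout and stderr instead of inheriting them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

# Parse the child's stdout back into a Python dict.
results = json.loads(completed.stdout)
print(results["accuracy"])
```

Running the scripts one after another in a loop over `subprocess.run` calls gives the sequential behaviour described, since each call blocks until the child exits.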

supple ferry Mar 22, 2019, 5:24 PM

#

@cursive glade I don't know how relevant it is, but I had problems with subprocess and multiprocessing. Depending on the IDE I could either see my errors or not. When I was running them from the command line I could see all errors, while the IPython console was outputting nothing

cursive glade Mar 22, 2019, 5:26 PM

#

i usually log remotely onto a machine with the necessary gpus so im not rly using any IDE, just execute scripts from the shell with python script.py *args*

lapis sequoia Mar 22, 2019, 6:58 PM

#

Hey guys, I'm new to python, I need to use numpy to store multiple 2d arrays in a big 3d array basically a vector of 2d arrays, [number_of_arrays,array_x,array_x], I have troubles doing this

#

I tried to use many things, the single one that remotely worked is .append

#

but only without axis, and it flattens the input

#

when I specify axis it throws errors

#

import numpy as numpy


def load_images(path):
    images = numpy.array([[[]]])
    index = 0
    while True:
        try:
            print("try: " + path + "_" + str(index) + ".npy")
            images = numpy.append(images, numpy.load(path + "_" + str(index) + ".npy"))
        except Exception as ex:
            print(ex)
            break

        index += 1

    print(images.shape)


load_images("assets/car")
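
`numpy.append` without `axis` flattens both inputs, and with `axis` it needs arrays with the same number of dimensions, which the empty `(1, 1, 0)` starting array does not provide. A common alternative is to collect the 2D arrays in a Python list and stack them once. A sketch, where the generated arrays stand in for what `numpy.load` would return and the 4x4 shape is arbitrary:

```python
import numpy as np

# Stand-ins for the loaded images; all must share the same 2D shape.
frames = [np.full((4, 4), i, dtype=np.float32) for i in range(3)]

# Stack once along a new first axis: (number_of_arrays, rows, cols).
images = np.stack(frames, axis=0)

print(images.shape)
```

In `load_images`, this would mean appending each loaded array to a list inside the loop and calling `np.stack` after the loop ends.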

analog helm Mar 23, 2019, 2:03 AM

#

Hey all, I am looking to work on a Python project which is essentially a simulation/simulation-like, and based on 3d voxel space/octrees (same thing?). I know I could literally just instantiate a raw block of data with numpy and then fool around with it, but unless I was misunderstanding things, I was under the impression that once your volumes start increasing in size (lets say 256^3), that becomes extremely cumbersome, inefficient, and slow, even with numpy array objects. As a very brief description of my usecase needs, I'm not doing many, or any, matrices operations or transforms, etc. Mostly just running a finite voxel space and performing checks/modifications on specific voxels every tick.

Is there a good library out there for this? Or am I mistaken in the first place for thinking that I need a special library to do this properly?

lean ledge Mar 23, 2019, 5:19 AM

#

@analog helm There's nothing that can prevent large matrices from slowing down your computer by consuming memory and computational time unless you know more about the matrix. In particular, if the matrix is almost empty, you can apply sparse matrix optimisations and so on

#

Other than that, if all of it is filled with no pattern, nothing can help

analog helm Mar 23, 2019, 5:21 AM

#

there will be large portions which are empty, but by no means the majority. How do games which run on standard PC hardware handle this? EG Minecraft or Dwarf Fortress. Am I just seriously overestimating the load that kind of environment applies? I kind of assumed they did something special to work with their data sets. Is it really just an array similar to what numpy would give me?

#

and thanks for the response btw @lean ledge

lean ledge Mar 23, 2019, 5:22 AM

#

You split it up into chunks and only deal with small chunks of the data at a time. While the other chunks aren't loaded, they're kept on disk rather than in memory

#

That's why as you walk over to an area that hasnt loaded yet, it reads it from disk and loads it

#

As you go away from a chunk, it stores it back into disk

analog helm Mar 23, 2019, 5:24 AM

#

yea, for MC. But this is a finite space, which will be permanently loaded. I should have clarified on that one, my bad. More just thinking of a specific space in MC, and managing the data in that area. If you're familiar with Dwarf Fortress, that is a much better comparison for what I expect to be doing

lean ledge Mar 23, 2019, 5:24 AM

#

Never played DF, sorry

analog helm Mar 23, 2019, 5:25 AM

#

well, just imagine a finite space in Minecraft I guess, with tons of entities and interactions. All the entities are bound to the octree coordinate system as well.

#

I know much of MC's data complexity has to do with buffering and streaming data off of and onto disk, but i was under the impression even besides that, there was some special handling going on. I could be wrong though!

lean ledge Mar 23, 2019, 5:27 AM

#

There's nothing you can really do to avoid storing all the data you have to show in memory. It's not generally a problem since unless the memory is starting to get filled, it doesnt take much computation to have voxels stored. Minecraft often doesnt have thaaat many entities and obvious optimisations are made where you can save on interactions when they dont matter. I have no idea how Minecraft in particular works, though I'm sure some others here do

analog helm Mar 23, 2019, 5:28 AM

#

Mhmm, it's not so much the data storage itself I'm concerned about. RAM is cheap these days. More the iteration and random access of data

#

the only real "complexity" in storage Im aware of is differences in data density (ie, one voxel might need to only store a single byte, another voxel might need to store several), but from what I understand that can be easily resolved by having a base voxel array of single bytes, with a sparse voxel tree parallel to store extra info as necessary
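
The layout described here can be sketched as a dense one-byte array plus a side table for the few voxels that need more. This is an assumption-laden illustration: a plain dict stands in for the sparse voxel tree, and `set_voxel`, the material ids and the payload are hypothetical.

```python
import numpy as np

SIZE = 64  # kept small for the example; 256 would need 16 MiB at one byte per voxel

# Dense base layer: one byte (a material id) per voxel.
base = np.zeros((SIZE, SIZE, SIZE), dtype=np.uint8)

# Sparse side table: only voxels that carry extra data get an entry.
extra = {}


def set_voxel(x, y, z, material, payload=None):
    """Write the material id and keep the side table in sync."""
    base[x, y, z] = material
    if payload is None:
        extra.pop((x, y, z), None)
    else:
        extra[(x, y, z)] = payload


set_voxel(1, 2, 3, material=7)
set_voxel(4, 5, 6, material=9, payload={"temperature": 300})

print(base.nbytes, len(extra))
```

Reading or writing a single voxel this way is an index into the array plus at most one dict lookup, regardless of the volume size.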

lean ledge Mar 23, 2019, 5:31 AM

#

It's all in memory! It can be accessed randomly as it pleases with no slowdown. You give modern computers less credit than they deserve given they can render complex 3d scenes, I feel like IO with the data isnt where the main slowdowns would be concerning: rendering would be the bigger problem since that's what needs to be done real time and is just a time consuming process in comparison to fetching whatever few blocks exist

analog helm Mar 23, 2019, 5:32 AM

#

fair enough

#

On that topic, are there existing libraries which specialize in rendering voxel data? Or am I reaching on that one?

#

My main interest here is really just the concept itself, so the less I have to reinvent, the better. The underlying engine is of little interest to me.

#

Something which has some amount of pre-existing code for determining what is or isnt visible based on what voxels are surrounding an area, what is or isnt transparent, etc

#

volumes would be dynamically generated, and the user would be able to specify which areas of the volume they want to look at, so which voxels are or arent visible in any given situation is dynamic. Dunno if thats something which can be semi-automated via library, or if it is.

#

In either case, thanks for your input on this

lean ledge Mar 23, 2019, 5:41 AM

#

Cant say I know much about gamedev, it just intersects with other things I actually enjoy. But there are voxel based game engines and it might help to see how they handle things. Probably some combination of not worrying about what isnt visible, combining many blocks into the same mesh with fewer vertices, doing weird stuff with lighting, chunks, etc

analog helm Mar 23, 2019, 6:47 AM

#

Alright, I'll keep looking around at other projects, I know of a few. Thanks for talking it out with me @lean ledge!

mossy dragon Mar 23, 2019, 7:34 AM

#

hello guys

lapis sequoia Mar 23, 2019, 8:14 AM

#

hello, if you want to ask a question can you ask it

grave fog Mar 23, 2019, 11:38 AM

#

Does anyone have experience with Jupyter Notebooks? Can't seem to make my kernel work

#

Or any kernels. Managed to install everything correctly on another machine, but this one is just throwing errors

supple ferry Mar 23, 2019, 12:57 PM

#

can you also paste your errors in formatted way?

grave fog Mar 23, 2019, 2:00 PM

#

📎 log.txt

supple ferry Mar 23, 2019, 3:30 PM

#

Which TensorFlow version do you have, and which Python?

upper ginkgo Mar 24, 2019, 3:39 AM

#

Is there any open source bot that uses natural language commands out there?

wraith sage Mar 24, 2019, 9:33 AM

#

Hello

#

import tensorflow as tf

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

W = tf.Variable([0.3], dtype=tf.float32)

#

is 0.3 the initial value of W?

#

If so, how can I print the value from W?

lean ledge Mar 24, 2019, 9:46 AM

#

You'll have to make a session and evaluate the computation graph for the variable. Pass it to a session.run() call after initialising the variables @wraith sage

wraith sage Mar 24, 2019, 10:05 AM

#

📎 unknown.png

#

without initializer, it wont work. What is the global_variables_initializer()?

lean ledge Mar 24, 2019, 10:33 AM

#

It just initialises variables. Tensorflow objects are abstract and you're essentially compiling the logic you want to run into a computational graph and then working on it in a session

serene veldt Mar 24, 2019, 12:16 PM

#

Greetings, im trying to create multiple bags for a sort of bootstrap aggregation algorithm

#

is there a more efficient way than subsampling the dataset n times with pandas.DataFrame.sample?

#

like, for example, at least something similar but with an n_samples parameter so I don't have to use a list comprehension or iterate
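
As far as I know `DataFrame.sample` has no parameter for drawing several bags at once, but the random row positions for all bags can be drawn in a single NumPy call. A sketch with toy data, where `n_bags` and the seed are arbitrary choices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10, 20)})  # toy data

n_bags = 5
rng = np.random.default_rng(0)

# One call draws every bag's row positions, with replacement:
# shape (n_bags, len(df)).
positions = rng.integers(0, len(df), size=(n_bags, len(df)))

# iloc accepts an integer array, so each row of `positions` selects one bag.
bags = [df.iloc[row] for row in positions]

print(positions.shape, len(bags))
```

The final list comprehension only slices; the sampling itself is vectorised. If the model accepts arrays, indexing `df.to_numpy()[positions]` would avoid the loop entirely.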

supple ferry Mar 24, 2019, 2:25 PM

#

Not on pandas
But in sklearn.ensemble you can find what you need

serene veldt Mar 24, 2019, 2:49 PM

#

how so? they only have the classifiers made with ensembles, cant find anything regarding the subsampling part, unless i go into source code

sharp jetty Mar 24, 2019, 3:15 PM

#

using a CNN for binary image classification, should I use a validation set or simply train and then run it through a test set?

empty current Mar 24, 2019, 3:21 PM

#

def mbo(n, others):
    not_first = False
    for i in others:
        if not_first:
            x = bin(n ^ i).count('1')
            if x == 0:
                return 0
            elif x < lowest:
                lowest = x
        else:
            lowest = bin(n ^ i).count('1')
            not_first = True
    return lowest


s = set()

for i in range(int(input())):
    e, a = input().split(' ')

    if e == "1":
        s.add(int(a))
    else:
        print(mbo(int(a), s))

#

This is apparently really inefficient
So I'm trying to find a way to reduce the complexity
What this is supposed to do is:

Person1 and Person2 are playing an XOR game. Initially, Person1 has an empty set of integers. Then a sequence of N events happens. There are two types of events:

Person1 chooses integer A and adds it to the set;
Person2 chooses integer A and passes it to Person1 who finds integer B in the set such that integer A⊕B contains minimal possible number of 1s in its binary representation. Here ⊕ is a bitwise exclusive or operation, for more details check Wikipedia page.
Your task is to help Person1 find the minimal possible number of 1 bits in the binary representation of A⊕B.

Input
The first line contains integer N. Each of the following N lines describes an event as two integers T and A separated by a single space. Here T is an event type.

Output
For each event of the second type print the corresponding minimal number of 1 bits in a separate line.

#

Don't need code, need algorithm

empty current Mar 24, 2019, 4:55 PM

#

No one?

void anvil Mar 25, 2019, 12:37 AM

#

you should probably go to help and not data science

lapis sequoia Mar 25, 2019, 1:17 AM

#

@sharp jetty you don't need a validation set

sharp jetty Mar 25, 2019, 1:18 AM

#

@lapis sequoia why is that?

#

shouldnt i run the model on a validation ,save the weights and then run it through a test?

lapis sequoia Mar 25, 2019, 1:21 AM

#

depends on the problem

#

if it's binary classification.. you should have enough examples to not need additional validation

#

what sorta images are you trying to classify

sharp jetty Mar 25, 2019, 1:34 AM

#

x ray

#

its binary classification

lapis sequoia Mar 25, 2019, 1:40 AM

#

and how many examples do you have for training

#

for each class

#

you don't need to do validation unless your classes are imbalanced..

#

this type of problem requires that you have large enough training and test sets.. then just plot/get roc for your predictions

sharp jetty Mar 25, 2019, 1:57 AM

#

well im looking at different body parts individually, but for the most part i dont have class imbalance

#

on the low side i have about 500 images per class and on the high side i have about 5000 per class

lean ledge Mar 25, 2019, 2:02 AM

#

@lapis sequoia You definitely need a validation set

#

Binary classification does not dictate whether or not you have enough examples

#

@sharp jetty Dont listen to them, you always need a validation set to prevent overfitting. If you're really short on data, maybe dont have a test set and use validation accuracy at face value

#

You have no idea when to stop training without validation

sharp jetty Mar 25, 2019, 2:07 AM

#

yea thats what i was thinking

#

@lean ledge thanks again

#

btw so when i run my model and obtain the highest possible accuracy on validation do i save the weights and run it on a test?

lean ledge Mar 25, 2019, 2:10 AM

#

Yeah, generally how it works. Most people's workflow looks like giving the framework both train set and validation set and then monitoring the output values on tensorboard or equivalent until validation stops improving (in which case you save and try to drop learning rate further) or starts getting worse (overtraining). It's a good idea to use something like TF's Object detection API or similar since that has essentially everything set up for you with pretrained networks to retrain, all outputs configured as necessary etc

#

Instruct your framework to just output a model every epoch for the latest 5-10 epochs or something along those lines. I think object detection API is 5 by default but I prefer 10 because I tend to miss a bit
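
The stopping rule described above can be sketched without any framework. This is a simplified illustration, not what any particular library does internally: `best_epoch_with_patience` and `patience` are hypothetical names, and the loss values are made up.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation improved: this is where the weights would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improving at first, then getting worse.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(best_epoch_with_patience(losses))  # (3, 6)
```

The checkpoint from the best epoch is the one to evaluate on the test set, which is why keeping the last several saved models matters.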

sharp jetty Mar 25, 2019, 2:16 AM

#

hmm a lot of what you just said im not familiar with lol. First time working with CNN's. Any resource you would recommend to implement what you described?

lean ledge Mar 25, 2019, 2:18 AM

#

What's your specific task? @sharp jetty

#

binary classification, no object detection or anything?

sharp jetty Mar 25, 2019, 2:19 AM

#

binary classification of x ray images

#

abnormality vs normal

#

and i have 7 different body parts which ill run a model on separately

#

in which case im not sure if i should use the weights learned from one body part to another

lean ledge Mar 25, 2019, 2:23 AM

#

How much data do you have? @sharp jetty

lapis sequoia Mar 25, 2019, 2:24 AM

#

dont at me.. do you have a job or pushed anything to production?

#

all I see you doing is posting inane academia stuff with little to no background knowledge or know how of actual application..

sharp jetty Mar 25, 2019, 2:25 AM

#

i have anywhere from 500 to 5000 images per class for each body part

lapis sequoia Mar 25, 2019, 2:25 AM

#

not you coldchillin.. this other person I have blocked who continues to feel like he needs to chime in on things he doesn't understand

#

just because validation is part of usual workflow doesn't mean it's something you use to update weights for every problem.. reason you don't use validation is you're tweaking weights to fit certain images, that isn't something you do when it comes to medical applications..

lean ledge Mar 25, 2019, 2:27 AM

#

...validation isn't used to update weights, it's used to avoid the scenarios of overfitting

sharp jetty Mar 25, 2019, 2:28 AM

#

hmm, the dataset im working with is part of a competition which is separated into train, valid and testing folders

lapis sequoia Mar 25, 2019, 2:28 AM

#

he previously talked about using validation to update weight..

#

oh so it's for kaggle.. or something and not actual application

#

is that right?

sharp jetty Mar 25, 2019, 2:28 AM

#

yea something like that

lean ledge Mar 25, 2019, 2:28 AM

#

I have both a job in a data science company where I do computer vision among other things and I'm currently at CSIRO's Robotics and Autonomous Systems group chugging through literature reviews on my first computer vision paper for multi camera detection and tracking in an industry application. I've won hackathons using computer vision also. If you feel the need to disparage me because you don't like me disagreeing with you, then go ahead

sharp jetty Mar 25, 2019, 2:28 AM

#

not kaggle but a competition nonetheless, im just doing it for practice

lean ledge Mar 25, 2019, 2:30 AM

#

@sharp jetty I'd start by trying to retrain a pretrained resnet model (avoid inceptionnet, it tends to have worse results in biomedical) on body part with the most data and then try to transfer learn with those weights on parts with smaller datasets

lapis sequoia Mar 25, 2019, 2:30 AM

#

I've blocked you because you're an annoying person with no knowledge of actual application.. I don't feel the need to waste my time with people who feel the need to annoy and spew half knowledge..

sharp jetty Mar 25, 2019, 2:30 AM

#

by other classes you mean body parts?

lean ledge Mar 25, 2019, 2:31 AM

#

yep

#

edited for clarity

sharp jetty Mar 25, 2019, 2:31 AM

#

also my aim was to try and produce a model from scratch and then compare that to one thats pretrained

#

right now im still working trying to figure out how to optimize my scratch model

lean ledge Mar 25, 2019, 2:32 AM

#

if you're trying to compare architectures, it's only a fair battle if they're all trained on the same dataset

sharp jetty Mar 25, 2019, 2:32 AM

#

but im getting horrid results lol

#

yea i plan on training them both on the same set

#

just doing them separately

lean ledge Mar 25, 2019, 2:32 AM

#

models from scratch will rarely beat a model that's been pretrained and then trained again on the same dataset

sharp jetty Mar 25, 2019, 2:32 AM

#

yea i figured but is it that huge of a difference?

#

like im getting 46% validation accuracy on my model so far

lean ledge Mar 25, 2019, 2:34 AM

#

can be pretty significant because training from scratch makes it easier to overfit. remember, something like imagenet has 14 million images and a model pretrained on it will have learnt very very generic features in early layers

#

it only needs to change some logic in the last few layers to start identifying new classes

sharp jetty Mar 25, 2019, 2:35 AM

#

hmm... what do you think of my validation accuracy results though is it something i should expect? for reference other ppl in the competition are getting like 70% test accuracy

lean ledge Mar 25, 2019, 2:36 AM

#

cant say anything about validation accuracy without actually being there, lots of reasons it can be low. bad optimisation/training, problem with the model architecture, problem with datasets

sharp jetty Mar 25, 2019, 2:38 AM

#

this is my architecture for reference

#

📎 unknown.png

south quest Mar 25, 2019, 2:39 AM

#

!warn 389084425566289930 Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.

arctic wedgeBOT Mar 25, 2019, 2:40 AM

#

:incoming_envelope: :ok_hand: warned @lapis sequoia (Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.).

lean ledge Mar 25, 2019, 2:40 AM

#

Oh, that's a very shallow network. Very much possible that itself is the problem. The more layers there are, the more complex things it can learn. Only a few layers cant learn very many things. For reference, state of the art in many applications have 100s of layers, many layers having complex structures like being a combination of multiple kinds of convolutions for complex features but without loss of speed, or gated units for easier optimisation and so on

#

I have no idea what kind of problem that is but adding a couple more layers may or may not help

#

If it doesnt, i'd try messing around with hyperparameters. Lowering learning rate after you've reached your 0.5 accuracy or whatever

sharp jetty Mar 25, 2019, 2:42 AM

#

layers as in adding more conv layers?

#

and yea i realized that my arch was pretty basic its something i tweaked from the MNIST project i did earlier

lean ledge Mar 25, 2019, 2:46 AM

#

try reading up on the architecture of ResNet. it's essentially a simple architecture but it got state of the art a few years ago. it has lotssss of layers

#

CS231n is a great course to go through btw if you're interested in deep vision

#

should go over everything you need to start reading actual literature really

sharp jetty Mar 25, 2019, 2:47 AM

#

yea any resources would be a great help, it seems i know too little to do a decent from scratch model on my own

lean ledge Mar 25, 2019, 2:48 AM

#

nah all good, knowing what you know is a great place to start off. just need something to accelerate you into recent literature

#

https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv

YouTube

Lecture Collection | Convolutional Neural Networks for Visual Reco...

Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving car...

#

lectures 6, 7, 9 and 10 should get you up to date somewhat

#

everything 11 onwards is not exactly core deep vision knowledge

#

except maybe generative models

sharp jetty Mar 25, 2019, 2:50 AM

#

nice! thanks ill look into it

#

btw one more question, in terms of accuracy you usually get a higher accuracy for validation than test right?

#

as ur tuning the model weights on the validation

#

and running it on the test

lean ledge Mar 25, 2019, 2:53 AM

#

if there is a discrepancy at all, yeah, validation is tiny bit higher because you stop based on when validation gets worse. but otherwise, the difference tends to be relatively minor-ish since validation only comes in play with figuring out when to stop rather than actually helping with training

sharp jetty Mar 25, 2019, 2:57 AM

#

so if i just wanted to show the efficacy of my model would it be enough to just show my validation score?

#

because in that case what would be the reason for running a test?

lean ledge Mar 25, 2019, 2:59 AM

#

it's generally bad practice to report validation accuracy as your final result in a formal setting (competitions, academia, etc), but nobody in industry would care too much when there's little data, so having no test set is justified there

sharp jetty Mar 25, 2019, 3:00 AM

#

hmm so it would be better if i just take a small sample of images for each class and use that as a test?

#

would 50 images per class suffice?

lean ledge Mar 25, 2019, 3:03 AM

#

everyone I know does it all percentage-wise. Out of all the data you have, splitting it 70/15/15 sounds fairish for train/validation/test.

#

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1337&rep=rep1&type=pdf reading the conclusion of this paper seems to give a recommendation for the kind of split

#

but honestly, people's splits are anything from 70/15/15 to 70/20/10 to many in between
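
A percentage-wise split can be sketched by shuffling the sample indices once and cutting the result. The fractions below assume a 70/15/15 train/validation/test split, and `split_indices` is a hypothetical helper, not a library function:

```python
import numpy as np


def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle 0..n_samples-1 and cut it into train/validation/test indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = round(n_samples * val_frac)
    n_test = round(n_samples * test_frac)
    val = order[:n_val]
    test = order[n_val:n_val + n_test]
    train = order[n_val + n_test:]  # everything left over
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

For per-class image folders, the same function could be applied to each class separately so every split keeps roughly the original class balance.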

sharp jetty Mar 25, 2019, 3:06 AM

#

got it, thanks so much man you've been a tremendous help

lean ledge Mar 25, 2019, 3:07 AM

#

nw, glad to help

quiet gyro Mar 25, 2019, 3:09 AM

#

https://github.com/ray-project/ray

GitHub

ray-project/ray

A system for parallel and distributed Python that unifies the ML ecosystem. - ray-project/ray

sharp jetty Mar 25, 2019, 4:24 AM

#

just a quick question regarding CNN implementation, if I'm doing a binary image classification and running a model separately on 7 different body parts(ex hand, wrist, humerus, etc) so ill run a model on wrist only binary classification and then wrist and so on. Should I save the weights from each model and run them for the next?

lean ledge Mar 25, 2019, 4:41 AM

#

@sharp jetty Id train for the part with the largest dataset first, and then reuse weights from the model for retraining other parts with smaller datasets

sharp jetty Mar 25, 2019, 4:42 AM

#

would i use the adjusted weights after each dataset?

#

ex i train on hand dataset -> use weights to train on arm->use new weights train on shoulder->use new weights to train on wrist

#

or just use one set of weights

lean ledge Mar 25, 2019, 4:48 AM

#

That, I have no clue about. I don't expect the difference to be huge either way.

magic mauve Mar 25, 2019, 1:19 PM

#

I hope this is the appropriate place to ask, which module would be better for working with csv files, pandas or csv?
I'd assume csv since it's in the name but I heard a lot about pandas as well

polar acorn Mar 25, 2019, 1:56 PM

#

Depends on how you mean work with them. If you mean read a csv to a dataframe, fiddle around with it, and save it as csv, I would use pandas.

pale eagle Mar 25, 2019, 1:56 PM

#

Where can I find projects for data analytics?

magic mauve Mar 25, 2019, 2:06 PM

#

I just want to take some data from a csv file and find averages, max, and mins
@polar acorn

polar acorn Mar 25, 2019, 2:08 PM

#

I would use pandas, it has a read_csv function that handles most csv's. And finding summary stats is quite straight forward.
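
For the averages/max/min use case, a minimal sketch could look like this. The CSV content and column names are made up for illustration; with a real file you would pass its path to `read_csv` instead of the in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained.
csv_text = """city,temp_c,rain_mm
Oslo,4.0,12.5
Lima,19.0,0.0
Cairo,28.0,0.5
"""

df = pd.read_csv(io.StringIO(csv_text))

# One row per statistic, one column per numeric field.
stats = df[["temp_c", "rain_mm"]].agg(["mean", "max", "min"])
print(stats)
```

`df.describe()` gives a similar table with a few more statistics in one call.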

#

It might take some getting used to the first time you use it. But it's a great tool to know, so it's a worthwhile investment.

magic mauve Mar 25, 2019, 2:09 PM

#

Appreciate it

wraith sage Mar 25, 2019, 2:31 PM

#

Hello

#

What is the best way to learn Tensorflow?

supple ferry Mar 25, 2019, 2:36 PM

#

@magic mauve go for pandas all the way. It can work with lots of data types, and it not only lets you manipulate the data but also save it in other formats and make simple plots

magic mauve Mar 25, 2019, 2:36 PM

#

I'd assume matplotlib would be better for more complicated plots though?

proven ravine Mar 25, 2019, 3:46 PM

#

Hey guys,

so I am quite new to programming in python but there is a question that is bugging me. There is a project in my head which would be awesome to realize but I dont know if it's achievable in any way (technology or skillwise).

The thing is I have been doing quite a lot of demo trading forex currencies for around a year. I did find a pattern that kind of worked for me, the problem is, that I can mostly only see the pattern in the past. If I look at old graphs I will see all the spots where it worked but I can never really make it work for me, if the plot hasnt formed yet.

Would there be a way to feed the computer some currency graph, give it also the points where I would have taken long or short trades because of my pattern with the stop loss (point where you would sell automatically, to cut losses) and where you would put the take profit and the machine (AI, neural network/whatever) spits out the most likely trade in the present?

supple ferry Mar 25, 2019, 3:55 PM

#

@magic mauve , matplotlib is good. there are other libraries that are built on top of it and are higher level. maybe you will like them more. seaborn, plotly e.g

#

@proven ravine , there is an entire concept for such things. Algorithmic trading. Its been around for decades now and with the rise of AI and ML, lots of people find it interesting.
I hope it is okay to post reddit links here, so, if you are interested in such stuff, you can head to https://www.reddit.com/r/algotrading

r/algotrading

proven ravine Mar 25, 2019, 4:03 PM

#

@supple ferry but is it doable for an average person like me to program such an algorithm?

supple ferry Mar 25, 2019, 4:32 PM

#

it is doable. but it probably won't be useful. you cant compete with bigger machines

void anvil Mar 25, 2019, 4:48 PM

#

It depends on the uniqueness of your signal

#

tbh

#

You can market make with relatively simple rules and be profitable, but you're being paid to take on risk

#

You would want to sample whenever you see a pattern emerging

#

and either rule-base sample in

#

for live trading

#

or have it predict if it's a trade or not

proven ravine Mar 25, 2019, 4:50 PM

#

rule-base sample means its like a condition which needs to be met?

void anvil Mar 25, 2019, 4:51 PM

#

yes

proven ravine Mar 25, 2019, 4:51 PM

#

for example, one of the patterns includes a certain formation of three candles, which need to be the lowest or highest of a certain move

void anvil Mar 25, 2019, 4:51 PM

#

like if the market moves up 5%

#

in 5 minutes

#

then predict whether to make a trade
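
A rule like "the market moved up 5% in 5 minutes" could be sketched in pandas as below. The price series is invented, and the 5-bar window assumes one bar per minute; this only flags the candidate moments, it does not decide whether to trade:

```python
import pandas as pd

# Hypothetical 1-minute closing prices.
idx = pd.date_range("2019-03-25 09:00", periods=8, freq="min")
close = pd.Series(
    [100.0, 100.5, 101.0, 102.0, 104.0, 106.0, 106.5, 106.0], index=idx)

# Relative change versus the price 5 bars (here: 5 minutes) earlier.
move_5min = close.pct_change(periods=5)

# True wherever the rule fires; the first 5 bars have no history, so they stay False.
signal = move_5min >= 0.05
print(close[signal])
```

The rows where `signal` is True are the samples you would then hand to a model (or a further rule) to decide long, short, or no trade.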

proven ravine Mar 25, 2019, 4:51 PM

#

ah okay

#

I was hoping I could just throw data into a black box and get an algorithm 😂

#

as said, marking all the trades with numbers in a chart of the past (for example last 5 years) and the program figures out what it needs to do, to predict present trades

#

something like that. I thought thats how machine learning or AI works until now, but I guess I need to look into it a little more 😄

void anvil Mar 25, 2019, 5:02 PM

#

that is how it works

#

you have to mark the trades to sample

#

and mark the trades for the prediction

#

Basically instead of your data set being image classification of a cat or a dog

#

you take your stock history or forex or w/e

#

and create the "This is an example of a long trade"

#

and create the "this is an example of a short trade"

#

then feed in new data and predict if you should go long, short, or not trade

proven ravine Mar 25, 2019, 5:17 PM

#

hmm, any tips with which "software" or whatever I can do it "easily" or test it?

#

or any docs which I should read into ?

void anvil Mar 25, 2019, 5:34 PM

#

hahaha

#

do it in python

#

you'll have to create most things from scratch

#

https://www.amazon.com/gp/product/0470284889/
https://www.amazon.com/gp/product/0470432063/
https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089
https://www.amazon.com/Hands-Machine-Learning-Algorithmic-Trading-ebook/dp/B07JLFH7C5

#

good luck finding anything useful besides stats packages like ffn

#

https://pypi.org/project/ffn/

PyPI

ffn

Financial functions for Python

#

and bt

#

https://github.com/pmorissette/bt

GitHub

pmorissette/bt

bt - flexible backtesting for Python. Contribute to pmorissette/bt development by creating an account on GitHub.

#

but you're better off writing your own backtesting functions

#

99% of all signal generation is going to need to be scripted yourself

proven ravine Mar 25, 2019, 6:05 PM

#

thanks for the help 😃

#

thats at least a project which could be helpful in the future so I am more eager to continue with it 😃

deft bough Mar 25, 2019, 6:42 PM

#

hey I am trying to plot a 2d gaussian distribution with matplot3d but I have some problems.

cset = ax.contourf(temp1, temp2, Z, zdir='z', offset=-0.15, cmap=cm.viridis)
ax.set_zticks(np.linspace(0,0.6,5))

📎 unknown.png

#

that is my code to plot and the result. I would really appreciate it, if you could help me to make my plot OK

lapis sequoia Mar 25, 2019, 6:48 PM

#

def distance_parse():
    hist_list = []
    with open('20190326_distance_v1', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            try:
                m = re.search('sdf : (\d+.\d+)', line).group(1)
                hist_list.append(m)
            except:
                pass

        print(hist_list)
        plt.hist(hist_list, bins=5)
        plt.show()

This code will print and draw like this.

#

📎 unknown.png

proven ravine Mar 25, 2019, 6:49 PM

#

@void anvil do you know if codecademys machine learning course is helpful?

lapis sequoia Mar 25, 2019, 6:49 PM

#

How can I fix it??

void anvil Mar 25, 2019, 6:49 PM

#

no idea

lapis sequoia Mar 25, 2019, 6:51 PM

#

I guess `plt.hist(x, **kwargs)` could be the key to this problem, but I don't know these options.

#

ah.. maybe type..
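
The type guess looks plausible: `re.search(...).group(1)` returns strings, and a histogram of strings is binned as categories rather than as numbers. A minimal sketch of the fix, assuming log lines shaped like the regex above (the sample lines are invented), converts each match to `float` before binning:

```python
import re
import numpy as np

# Hypothetical log lines; the real ones come from the file being parsed.
lines = ["sdf : 0.25", "noise", "sdf : 1.50", "sdf : 3.75", "sdf : 10.00"]

# Raw string, and the dot is escaped so it only matches a literal "."
pattern = re.compile(r"sdf : (\d+\.\d+)")

values = []
for line in lines:
    m = pattern.search(line)
    if m:  # skip lines without a match instead of a bare try/except
        values.append(float(m.group(1)))

# Same binning plt.hist(values, bins=5) would use, without needing a display.
counts, edges = np.histogram(values, bins=5)
print(values, counts)
```

Passing the same `values` list to `plt.hist(values, bins=5)` should then give numeric bins on the x-axis.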

proven ravine Mar 25, 2019, 6:58 PM

#

@void anvil would you say my goal is more likely to be supervised or unsupervised learning? Cause I am not sure while reading the description of the two ... I am leaning more toward supervised?

void anvil Mar 25, 2019, 6:58 PM

#

if you have a trading pattern you want it to learn it's supervised

#

if you want it to learn trading patterns, unsupervised

proven ravine Mar 25, 2019, 6:59 PM

#

ehhh I am confused

#

the second is just giving it data and waiting for what happens

#

?

#

and the first one is where I give it my long/shorts and it learns to use those in present situations

void anvil Mar 25, 2019, 7:01 PM

#

https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Towards Data Science

Supervised vs. Unsupervised Learning

Understanding the differences between the two main types of machine learning methods

proven ravine Mar 25, 2019, 7:05 PM

#

good read!

vestal axle Mar 25, 2019, 7:21 PM

#

Hi guys, any of you familiar with R-studio?

void anvil Mar 25, 2019, 8:55 PM

#

A bit

#

what do you need from it?

void anvil Mar 25, 2019, 9:21 PM

#

@proven ravine

Supervised: You tell the algorithm this is class A, B, C. Determine if this new stuff is A, B, C.

Unsupervised: Here's a bunch of shit, tell me what you can predict
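
As a rough sketch of the difference, using scikit-learn on six made-up points: the supervised model is given labels and predicts one for a new point, while the unsupervised model only gets the points and groups them on its own. The "long"/"short" labels and the feature values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of 2-feature samples (invented numbers).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])

# Supervised: we provide the answer for every training sample.
y = np.array(["short", "short", "short", "long", "long", "long"])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = clf.predict([[4.8, 5.0]])

# Unsupervised: no labels, the algorithm only finds groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print(pred[0], clusters)
```

Note that the cluster ids are arbitrary numbers; it is up to you to decide what, if anything, each group means.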

kindred stirrup Mar 25, 2019, 10:56 PM

#

@vestal axle yes I’m familiar with R Studio

#

Have any of y’all used facebook’s prophet? Wondering how it compares to ARIMA for time series

sharp jetty Mar 25, 2019, 11:56 PM

#

with regards to early stopping, should i stop my model when it has the highest accuracy for validation or when the val_loss values are lowest?

chilly shuttle Mar 26, 2019, 3:46 AM

#

accuracy is presumably what you're interested in
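
The two criteria can pick different epochs, which is easy to see with a small pure-Python sketch. The function name and the metric histories below are made up; in Keras the equivalent choice is the `monitor` argument of the `EarlyStopping` callback:

```python
def early_stop_epoch(values, mode, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    mode="min" treats lower as better (loss), mode="max" higher (accuracy).
    Training stops after `patience` epochs in a row without improvement.
    """
    best, best_epoch, wait = None, -1, 0
    for epoch, v in enumerate(values):
        improved = best is None or (v < best if mode == "min" else v > best)
        if improved:
            best, best_epoch, wait = v, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(values) - 1, best_epoch


# Invented validation curves: loss bottoms out before accuracy peaks.
val_loss = [0.70, 0.55, 0.50, 0.52, 0.56, 0.60]
val_acc = [0.60, 0.72, 0.75, 0.75, 0.77, 0.76]

by_loss = early_stop_epoch(val_loss, mode="min", patience=2)
by_acc = early_stop_epoch(val_acc, mode="max", patience=2)
print(by_loss, by_acc)
```

Monitoring loss tends to stop earlier here because loss also penalizes over-confident predictions, while accuracy only counts right versus wrong.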

sharp jetty Mar 26, 2019, 5:25 AM

#

hist = classifier.fit_generator(
        training_set_finger,
        steps_per_epoch=(5064/8),
        nb_epoch=60,
        validation_data=valid_set_finger,
        validation_steps=(461/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])

#

i have a question regarding this code which is part of a CNN im running. If i run this code and another but using a different dataset are the two pieces of code interacting in any way?

#

hist = classifier.fit_generator(
        training_set_humerus,
        steps_per_epoch=(1230/8),
        nb_epoch=60,
        validation_data=valid_set_humerus,
        validation_steps=(288/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])

#

as in when i fit on one set of data and then fit on another are the weights or other parameters being affected?

mossy dragon Mar 26, 2019, 7:58 AM

#

anyone done the titanic competition on kaggle?

polar acorn Mar 26, 2019, 9:21 AM

#

@kindred stirrup In my experience prophet is quite good out of the box while ARIMA models would need more tinkering to achieve the same level of performance.

chilly shuttle Mar 26, 2019, 12:53 PM

#

oh cool, TIL prophet

midnight atlas Mar 26, 2019, 2:11 PM

#

Hi, still looking for help with speech recognition front end in Python, would be great to get in touch with someone familiar with speech rec. Thanks!

serene veldt Mar 26, 2019, 3:32 PM

#

need some help normalizing data

#

i have tried most of sklearn.preprocessing module but nothing suits my needs

#

i have a semi sparse set of values

#

and wish to normalize them into floats in the range [0, 1]

#

scale, binarize, minmax dont work

paper niche Mar 26, 2019, 3:47 PM

#

what does "don't work" mean?

frail crow Mar 26, 2019, 4:14 PM

#

Hey guys, first timer, learning on the job ds - How do I detect abnormal subseries in periodic and synchronous time-series data?

sharp jetty Mar 26, 2019, 4:31 PM

#

if im doing a CNN model looking to classify abnormal vs normal images, would my dense output be 2 or 1?

void anvil Mar 26, 2019, 4:43 PM

#

@serene veldt

assuming you have a numpy array x:

fixed_x = (x-min(x))/(max(x)-min(x))
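
The same formula as a small function, with a guard for the case where every value is identical (otherwise it divides by zero). The function name and sample values are just for illustration:

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # All values equal: there is no spread to scale, so return zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span

scaled = min_max_scale([0, 0, 5, 0, 20, 10])
print(scaled)
```

Zeros in a semi-sparse input stay at 0 as long as the minimum is 0, which standardizing (subtracting the mean) does not preserve.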

narrow obsidian Mar 26, 2019, 5:45 PM

#

Yo guys, anyone here is experienced with numerical methods and jupyter?

#

hitting a wall trying to solve an ODE with bisection and FPI

serene veldt Mar 26, 2019, 6:18 PM

#

Yeah I figured it out after some time

#

I was standardizing the values

#

Which was the problem

civic saffron Mar 26, 2019, 10:20 PM

#

I'm trying to come up with a quick method for determining document similarity to a given query phrase (a list of words). I am fine if it's not extremely accurate, as long as the majority of documents returned are at least "somewhat" similar to the given query. It is important that it is fast though (I need to be able to process at around one hundred 1000-word documents per second, including tokenizing, vectorization, and scanning for matches). I have come up with the following method for extracting a set of similar words to a given term:

#


def simset(query_word, set_size=10, depth=1, size_decay = 0, threshold_score=0.33, with_scores=False):
    if size_decay < 0 or size_decay > 1: 
        raise ValueError("decay rate must be in interval [0,1]")
    
    simsets = [set([query_word])] # list of sets, one per level, query word at root
    level_set_size = set_size        # size of simset for each word @ current level
    for level in range(1, depth+1):      # for each level of depth ...
        
        level_set = set()
        for word in simsets[level-1]: # for each word in the *previous* level's simset
            if level_set_size >=1:
                level_set.update({w[0] for w in vectors.most_similar(word, topn=level_set_size)})
            else: # if decay rate results in set size < 1, just get 1 word for each following level
                level_set.update({w[0] for w in vectors.most_similar(word, topn=1)})

        # remove words from previous levels, to avoid duplicate simset() calls
        for l in range(level-1,-1,-1):
            level_set -= simsets[l]
    
        simsets.append(level_set)
        
        level_set_size = round(level_set_size * (1 - size_decay))
        
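    all_words = set.union(*simsets)  # local name that does not shadow the function
    if with_scores:
        # compute each similarity once, then filter by the threshold
        scored = ((vectors.similarity(query_word, w), w) for w in all_words)
        return {(s, w) for s, w in scored if s > threshold_score}
    else:
        return all_words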

#

The above code gets a list of most similar words to query term, then for each of those it gets most similar, and so on .. to depth levels. The simset of a query phrase is just the union of the simsets for each of the individual terms. Based on this "simset", I then get a "similarity score" for a given text to the query phrase, by comparing the words in the text with the simset of the query phrase with the following function (similar to Jaccard similarity where I am measuring size of intersection between the two):



def skim_text(query, text, word_simset_size=10, depth=2, freq_threshold = 0.0005):    
    query_words = word_tokenize(filter_stopwords(query)) 
    query_simset = set.union(*[simset(w, word_simset_size, depth) for w in query_words])
    query_simset = {(word_freq(w, 'en'), w) for w in query_simset if word_freq(w, 'en') < freq_threshold}

    text_word_count = len(text.split()) 
    if not text_word_count > 0:
        raise NLPError("skim_text() requires a non-empty string as input", text = text)
        
    escaped_words = [re.escape(w[1]) for w in query_simset]
    if not escaped_words:
        return 0  # an empty pattern would match at every position in the text
    re_query_simset = r'(' + '|'.join(escaped_words) + ')'
    match_count = len(re.findall(re_query_simset, text))
    
    if match_count == 0:
        return 0
    if text_word_count == 1:
        return 1  # log(1) == 0 would divide by zero; a matching one-word text is a full match
    
    score = math.log(match_count+1) / math.log(text_word_count)
    print(f"DEBUG: match_ct: {match_count}, text_word_count: {text_word_count}, score: {score}") 
    return min(score, 1)  # score is positive here, so only the upper bound needs clamping
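
To make the scoring formula concrete without needing word vectors, here is a stripped-down sketch that takes an already-built set of similar words. The function name, the word set and the sentences are invented for illustration:

```python
import math
import re

def overlap_score(simset_words, text):
    """log(matches + 1) / log(word_count), clamped to [0, 1]."""
    n_words = len(text.split())
    if not simset_words or n_words == 0:
        return 0.0
    pattern = "|".join(re.escape(w) for w in simset_words)
    matches = len(re.findall(pattern, text))
    if matches == 0:
        return 0.0
    if n_words == 1:
        return 1.0  # log(1) == 0, so handle the one-word text separately
    return min(1.0, math.log(matches + 1) / math.log(n_words))

words = {"dog", "puppy", "canine"}
score = overlap_score(words, "the puppy chased another dog across the park")
print(round(score, 3))
```

With 2 matches in an 8-word text the score is log(3) / log(8). Because both counts go through a logarithm, long documents are not penalized as hard as they would be with a plain ratio.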

#

I am looking for feedback on how to improve my code to make it faster / more accurate, or if there are pre-existing tools that I should be using instead that can operate at the speed I need?

#

... also tips on how to just improve my python would be appreciated, because there are probably several things I'm doing here that could be done more efficiently 😃

light cloud Mar 27, 2019, 3:01 PM

#

How would you explain additive models to a 5 year old

obtuse skiff Mar 28, 2019, 12:02 AM

#

Im clustering using kmeans and DBSCAN using sklearn
the parameters for kmeans are self explanatory just how many different clusters I want
but how do I go about choosing the parameters for DBSCAN, it wants eps and min_samples

obtuse skiff Mar 28, 2019, 12:19 AM

#

Also, is there a library for Sum of Squared Error?

supple ferry Mar 28, 2019, 7:33 AM

#

@obtuse skiff sklearn.metrics has mean_squared_error (take its square root for RMSE). If you want the sum of squared errors, you can calculate it yourself easily.
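
Computing it by hand is a one-liner in numpy. The arrays below are made-up values; the sketch also shows how the sum of squared errors relates to scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented true values and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

sse = float(np.sum((y_true - y_pred) ** 2))   # sum of squared errors
mse = mean_squared_error(y_true, y_pred)      # SSE divided by the sample count
rmse = float(np.sqrt(mse))

print(sse, mse, rmse)
```

For k-means specifically, a fitted `KMeans` object exposes `inertia_`, the sum of squared distances of samples to their closest cluster centre, which may be the number wanted here.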

obtuse skiff Mar 28, 2019, 7:40 AM

#

@supple ferry do you know how I do the parameters for dbscan?

supple ferry Mar 28, 2019, 8:15 AM

#

Never used it, sorry

#

As per documentation, @obtuse skiff , those arguments are optional:

Parameters:    
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.

eps : float, optional
The maximum distance between two samples for them to be considered as in the same neighborhood.

min_samples : int, optional
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

polar acorn Mar 28, 2019, 8:46 AM

#

@obtuse skiff
Pasted from Introduction to Machine Learning with Python A Guide for Data Scientists

Increasing eps means that more points will be included in a cluster. This makes clusters grow, but might also lead to multiple clusters joining into one. Increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.

The parameter eps is somewhat more important, as it determines what it means for points to be “close.” Setting eps to be very small will mean that no points are core samples, and may lead to all points being labeled as noise. Setting eps to be very large will result in all points forming a single cluster.

The min_samples setting mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters. If you increase min_samples, anything that would have been a cluster with less than min_samples many samples will now be labeled as noise.

#

So knowing that, I assume the best way to set the parameters is to try and see what clusters you get and adjust the parameters so that the number and size of clusters makes sense to you.
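
The try-and-see approach can be sketched as a small sweep over `eps`, counting clusters and noise points each time. The blob data and the candidate `eps` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups, standing in for the real feature matrix.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

def summarize(eps, min_samples=5):
    """Fit DBSCAN and return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {eps: summarize(eps) for eps in (0.001, 0.3, 0.5, 1.0, 100.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

The two extremes behave as the quoted passage says: a tiny `eps` labels everything as noise, and a huge one merges everything into a single cluster. Scaling the features first usually makes a sensible `eps` easier to find.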

modest halo Mar 28, 2019, 2:32 PM

#

hey everyone. I'm working on an application that does a bunch of file input/pandas transformations/statistics, and i'm using PyCharm as my IDE. I'm wondering if anyone could give their workflows while they're working on a project -- specifically in cases where you import a lot of data and are beginning to write a new function to do ~something~, like create a matplotlib plot, run a statistical test, create outputs, etc.

I find I waste a lot of time because my code needs to import all my files, sort them, clean them, etc. Currently, I set a breakpoint at the start of my new function, run the debugger, it stops at my breakpoint, and then I do code edits, restart the debugger, let it run over my new code, check for problems, and do this again and again. I feel like this is really inefficient because I'm constantly restarting my debugger over and over, and (of course) my code needs to load all my data again every time. I know a lot of this may be unavoidable, but getting some insight into others' data science workflows would be great. Thanks!

paper niche Mar 28, 2019, 2:42 PM

#

I do my analysis in Jupyter notebooks. having the ability to run cell by cell is much easier for data exploration and analysis

#

I find working with py files very cumbersome for this purpose, especially if you need to make small changes and run blocks of code over and over

modest halo Mar 28, 2019, 2:46 PM

#

@paper niche thanks for the suggestion! I really should get into jupyter -- always been on the list and never really got around to checking it out. I can see how running cell by cell would be useful. When you use jupyter, do you create cells as "building blocks" and just execute them in a certain order depending on what you want to do (or work on next)? Can you also "undo" execution of a cell?

paper niche Mar 28, 2019, 2:55 PM

#

yup i do. as an example notebook (not mine, but you'll get the idea): https://www.kaggle.com/gpreda/rsna-pneumonia-detection-eda

RSNA Pneumonia Detection EDA

Using data from RSNA Pneumonia Detection Challenge

#

maybe 1 cell for loading data to df, 1 cell for doing some modification to the columns, another for df.plot, say.

#

and there's no "undo"-ing the execution of a cell, what you do is you re-run the cell that created that particular variable in the first place. e.g. for my example workflow above, if I realized my modification in the second cell was wrong, I'll run the first cell again to re-assign df, then make changes to the second cell, then re-run the second cell again @modest halo

modest halo Mar 28, 2019, 3:02 PM

#

@paper niche Oh I see. so essentially achieving the same thing. After seeing that notebook you linked, I'm really starting to rethink my workflows lol. That notebook is pretty much exactly the kind of thing I want to do, and it's much more readable for people who aren't as familiar with python (a lot of people I work with). Thanks a lot!

paper niche Mar 28, 2019, 3:07 PM

#

@modest halo np, happy to share! 😄 yup, the ability to write text and code in the same document is really nice for this purpose as well.

marsh fog Mar 29, 2019, 6:55 PM

#

For the part where it plots the regression line

fig, axes = plt.subplots(ncols=3, figsize=(20,8))
for ax, col in zip(axes, mergfix.loc[:, mergfix.columns != "internet_users"]):
    mergfix.plot.scatter(x=[col],y=1, c='green', ax=ax)
    _ = plt.plot(mergfix.iloc[:, v].values, sm.OLS(mergfix.iloc[:, v].values, sm.add_constant(mergfix.iloc[:, v].values)).fit().fittedvalues,'r-')

how can I get it to cycle through the columns in the dataframe?

#

Help? 😄

worn field Mar 29, 2019, 8:29 PM

#

@marsh fog I like to just make a list of the columns and wrap it in a for-loop 😃 Like
```
columns = ['column1', 'column2', 'column3']

for column in columns:
    plotfunction(dataframe[column])
```

#

Not sure if that was what you were asking though :p

marsh fog Mar 29, 2019, 11:15 PM

#

It wasn't 😄

mellow birch Mar 30, 2019, 4:35 AM

#

anyone here written any useful graphs for MISP (www.misp-project.org)? Or have any experience with data science for threat hunting?

lapis sequoia Mar 30, 2019, 4:45 PM

#

Hey guys, I have 8 black&white images and I need their mean image. If I do sum(list_of_images) I get a grayscale image with very bad edges; if I divide this new image by len(images) (which feels logical, since that is what a mean is) I get the same rough image but with colors

#

📎 unknown.png

#

📎 unknown.png

#

UserWarning: Float image out of standard range; displaying image with stretched contrast.
warn("Float image out of standard range; displaying "

#

images are stored in numpy arrays

#

nvm my add logic is wrong

#

fixed, used numpy.mean() and then cast the output image to uint8 to get it grayscale
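
A minimal sketch of that fix, using tiny invented 2x2 images: stack the arrays, average along the stack axis (the result is float), then round and cast back to `uint8` so image viewers treat it as an ordinary 0-255 grayscale image:

```python
import numpy as np

# Eight tiny "black & white" images: each is all 0 or all 255 (made-up data).
images = [np.full((2, 2), v, dtype=np.uint8)
          for v in (0, 255, 255, 255, 0, 255, 255, 255)]

stack = np.stack(images)          # shape (8, 2, 2)
mean_img = stack.mean(axis=0)     # float64, values between 0 and 255

# Round before casting; a bare astype() would truncate toward zero.
mean_u8 = np.round(mean_img).astype(np.uint8)
print(mean_u8)
```

Adding `uint8` arrays directly can wrap around past 255, which is one likely reason a plain sum of the images looks wrong.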

light cloud Mar 30, 2019, 6:03 PM

#

I want to see if I can explain KNN here in my own terms.
Let’s say I am a company that sells different types of widgets and we have seasonality to our business and I want to see which widgets are likely to go up or down in sales for a given month.
Would I have my various widgets as rows, and the months, previous year sales, previous month's sales and maybe a few other things as columns? Then I would label those rows as either going up or down in sales the next month. Would I then run a KNN algorithm on the data and hope to get a prediction for a row that has no next-month sales figure yet?

light cloud Mar 30, 2019, 8:17 PM

#

Wrong place for that question or am I way off?

mossy dragon Mar 31, 2019, 4:40 AM

#

you dont want to use KNN for that

#

at least from what i can remember of it

#

since you're talking about time, and especially seasonality, you want to do time series analysis

light cloud Mar 31, 2019, 7:47 AM

#

Thanks for the response. I think I was reading that KNN isn’t ideal for time series but it can be used.

#

What about random forest then?

#

Similar data setup in structure but different algorithm.

#

There was a tutorial I was reading that was about forecasting weather with it

hasty maple Mar 31, 2019, 7:53 AM

#

I've not done much of time series but what I've read on them, people usually use LSTM's, GRU's, fbprophet, arima for it. Maybe check those out

paper niche Mar 31, 2019, 7:54 AM

#

RF's can't extrapolate well, since they always take the average in the leaf nodes, you can't get values higher or lower than the extreme values in your training set.

#

unlike a linear regression for example

#

not to say that it can't, but I would expect other methods to work better for time series predictions
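
The extrapolation point can be checked with a small sketch: train both models on a straight line for x from 0 to 9, then ask for x = 20. The data is synthetic and the model settings are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training data on the line y = 2x, for x in 0..9 (so y never exceeds 18).
X_train = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# A point far outside the training range.
X_new = np.array([[20.0]])
rf_pred = rf.predict(X_new)[0]
lin_pred = lin.predict(X_new)[0]

print(rf_pred, lin_pred)
```

The forest's answer stays at or below the largest target it saw during training, while the linear model continues the trend to about 40. For a series with growth or a trend, that ceiling is the practical problem.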

lapis sequoia Mar 31, 2019, 7:56 AM

#

I'd say it's more about what problem you

#

are trying to solve..than it is about the method..

#

case you cited for example.. how much can you afford to be off by

#

like..what is the tolerance..

#

fbprophet and arima are what I'd suggest for time series.. the person above was right on the money..

#

with Arima, you can tweak it to be conservative..

light cloud Mar 31, 2019, 7:58 AM

#

I have been using prophet for forecasting and I like it.

#

I just wanted to try some other models that I haven’t used much and that are common in ML, not only to compare results but also to learn.

#

So to get back to my original question, it looks like KNN and rf still belong in the classification toolbox and prophet in the forecasting toolbox.

lapis sequoia Mar 31, 2019, 8:01 AM

#

yep.. and with knn you use different distance metrics for different applications/problems..

#

RF.. for most business applications with plenty of tolerance.. and minimum number of trees for fast results..

light cloud Mar 31, 2019, 8:07 AM

#

Do you have a preference between the two? Or is just very dependent on type and amount of data and variables?

lean ledge Mar 31, 2019, 8:08 AM

#

LSTMs, GRUs, ARIMA and CNNs are all common for time series stuff

#

cant just use random common ML methods for any problem. have to be using ones that suit the problem