#data-science-and-ml


lime lava
#

Gimme a sec, I'll put up an example

#

Obv for a very big database, but the idea is to remove duplicated groups; in the example it would be the third group (Id=3)

desert oar
#

How big

lime lava
#

12k ids, each with variable amount of a and b characteristics

desert oar
#

that's not big 😉 anyway you can do .groupby('B').apply(whatever).drop_duplicates(keep='first')

lime lava
#

Haha youre right its not that big

#

So you're suggesting I do it in reverse: group by the characteristic combo and then remove ids?

desert oar
#

Maybe I don't fully understand what you are getting at

#

It looks to me like you are doing an aggregate operation by group

#

And then removing duplicates after that aggregation

lime lava
#

I update the gist which maybe shows my problem a bit better

#

Group with Id 2 has the same first 2 characteristics, but has a third one so its not the same as groups 1 and 3

#

On the other hand groups 1 and 3 are, excluding ID, the same

#

So i want to remove either 1 or 3

desert oar
#

Isnt that just what drop_duplicates does?

#

You can use subset=['foo', 'bar'] to only look at specific columns

lime lava
#

Wouldn't that catch group 2 though?

desert oar
#

hm

#

i see

#

naively you can iterate over unique pairs of groups and remove dupes

#

for 17k records just do it

#

use tqdm to monitor progress and go get some water while it runs

lime lava
#

Okay that doesn't sound so bad actually

desert oar
#

hmm the dupe removal is nontrivial

#

from a data structure perspective

#
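import itertools as it

import pandas as pd
from tqdm import tqdm

def groups_are_identical(g1, g2):
    # compare only the characteristic columns, ignoring the id and the row index
    cols = ['A', 'B']
    try:
        pd.testing.assert_frame_equal(
            g1[cols].reset_index(drop=True),
            g2[cols].reset_index(drop=True),
        )
    except AssertionError:
        # the assertion fails when the two frames differ
        return False
    else:
        return True

# assumes 'id' is a column, not the dataframe index itself
grps = data.groupby('id')
n_grps = len(grps)
# maps each pair of ids to True if the two groups are duplicates of each other
grp_pairs = {(lab1, lab2): groups_are_identical(grp1, grp2)
             for (lab1, grp1), (lab2, grp2)
             in tqdm(it.combinations(grps, 2), total=n_grps * (n_grps - 1) // 2)}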

this gets you part of the way

lime lava
#

Thank you!

desert oar
#

that tells you if any pair of id's is a dupe

#

then youd have to build up connected sets, and grab 1 id from each set

#

probably an easier way to do it tho

#

yeah theres gotta be an easier way

#

or not. graph algorithms pop up in unlikely places

#

wait hang on this is idiotic

#

lol

#
unique_grps = {}

for lab1, grp1 in grps:
    # keep grp1 only if it matches none of the groups already kept
    if not any(groups_are_identical(grp1, grp2) for grp2 in unique_grps.values()):
        unique_grps[lab1] = grp1

ukeys, ugrps = zip(*unique_grps.items())
data_deduplicated = pd.concat(ugrps, keys=ukeys)
lime lava
#

😮

desert oar
#

ok that's my cue to go for a walk. too tired for this lol

lime lava
#

Thank you!
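A possibly simpler route for the duplicate-group problem above, as a minimal sketch: build one hashable "signature" per id from its (A, B) rows and keep the first id for each signature. The column names `id`, `A`, `B` follow the thread; the sample data and the helper `group_signature` are made up for illustration, and the sketch assumes row order inside a group does not matter.

```python
import pandas as pd

# Hypothetical sample data: ids 1 and 3 have the same (A, B) rows, id 2 has an extra row.
data = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A":  ["x", "y", "x", "y", "z", "x", "y"],
    "B":  [10, 20, 10, 20, 30, 10, 20],
})

def group_signature(grp):
    # Hashable summary of a group's characteristic rows, ignoring id and row order.
    return tuple(sorted(zip(grp["A"], grp["B"])))

seen = set()
ids_to_keep = []
for label, grp in data.groupby("id"):
    sig = group_signature(grp)
    if sig not in seen:  # first time this combination of rows shows up
        seen.add(sig)
        ids_to_keep.append(label)

data_deduplicated = data[data["id"].isin(ids_to_keep)]
print(data_deduplicated)
```

This does a single pass over the groups instead of comparing every pair, so it should stay comfortable at 12k ids.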

granite stream
azure wren
#

🤔 Has anyone tried these Python for Data Science essential oils?

brazen spade
#

Hello,

I'm having a hard time finding information on some machine learning topics. I'm looking for theoretical help, not how to / programming stuff. If anyone can answer these questions for me it would be much appreciated (and sorry I'm a noob!), or at least point me in the right direction:

  • After training and testing a model, the accuracy score and jaccard index are reporting a 98.7% accuracy rate. This seems ridiculously high considering I didn't even tune the model, using default parameters. I am following someone else's example that did extensive feature engineering, is it possible it was just that good or am I right to be skeptical?

  • After training a model, can I make alterations to the parameters of the model or to the original data itself to obtain different results? Or do I need completely new data? I've tried this and the result does not seem to change

  • In the same line of thought, can I train multiple models, say a decision tree and an SVM on the same data set (albeit one data set separated into training and test sets, but the same training and test set used on both models?) or do they need completely different data sets?

arctic moth
#

@brazen spade for the first question, it is difficult to say, because it depends on the ratio of data you are using. If you use 99% of the dataset to train the model and just 1% to test the model, you can get really high accuracy, because of overfitting. So it is possible, but I would have to see the code. 2) What do you mean by changing parameters of the model? If you fit the model on the same data with some parameters and afterwards change the original data, the model would still be trained on the previous data and the inner weights would not change. But you can train 2 models with different parameters on the same dataset and compare the results.

#
  • if you are following someone else's example (same parameters, same dataset etc.) you should obtain a comparable result.
brazen spade
#

@arctic moth thanks for the comment. I followed an example for the data cleaning / feature engineering, but I did not like their modeling approach, it was complex with very few notes explaining why they were doing what they were. My goal was to obtain essentially the same data that they used, then model it with maybe three different models with default parameters, then pick the best resulting model and use GridSearchCV from scikit-learn or something like that to tune the hyperparameters and see improvement. As I mentioned I'm new to this so I'm not really sure of the best way to do it. Thank you for your answers though, you've answered a couple of my questions

arctic moth
#

Sure np.

arctic moth
#

Anyone know any good free tensorflow tutorials?

ripe niche
#

@arctic moth If you're just starting out, might want to try pytorch instead.

chilly shuttle
#

or keras

#

probably keras

arctic moth
#

And do you have any recommendation for keras tutorials?

undone dirge
#

There are some on reddit.. forgot the link

#

And you tube

arctic moth
#

thanks 😃

desert oar
#

@brazen spade 1) check baseline prevalence of class, or possibly it's wildly overfitted to the test set, or it's too easy (e.g. on MNIST you should be getting > 98%). 2) same data is fine... why wouldn't it be? 3) as with 2, model algorithm choice is just a big parameter so same data is fine. But know that if you use a test set to compare models and select the best one, that test set is no longer valid for estimating the true out of sample performance of the model. Because once you use it to compare/evaluate models, you are incorporating information from it into the model. This is why i recommend keeping a "holdout set" until the end of the project. Choose model type, parameters, and features using cross validation on say 80% of the data, then use the holdout set to evaluate final accuracy after you do all that stuff ... ideally you exclude this holdout set from exploratory analysis as well but that's not always feasible. When i start a predictive modeling / machine learning project i always try to reserve a holdout set asap if possible
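The holdout workflow described above can be sketched roughly like this with scikit-learn. The synthetic data, the 80/20 split, and the two candidate models are assumptions for illustration only; swap in your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own features X and labels y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Reserve the holdout set first and do not touch it while choosing models.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Compare candidate models with cross-validation on the development data only.
candidates = {"tree": DecisionTreeClassifier(random_state=0), "svm": SVC()}
cv_scores = {
    name: cross_val_score(model, X_dev, y_dev, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Refit the chosen model on all development data, then score the holdout once.
best_model = candidates[best_name].fit(X_dev, y_dev)
holdout_accuracy = best_model.score(X_holdout, y_holdout)

# Baseline: accuracy of always predicting the most common class.
baseline = max((y_holdout == 0).mean(), (y_holdout == 1).mean())
print(best_name, cv_scores, holdout_accuracy, baseline)
```

Comparing `holdout_accuracy` against `baseline` is the "check baseline prevalence" step: a 98.7% score means little if one class already makes up 98% of the data.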

#

Sorry for wall of text, there is no "return" in discord on mobile

#

@arctic moth i saw the link to the datasets, what did you want to know about them? I dont have personal experience with them

granite stream
arctic moth
#

@desert oar i just wanted to ask how to import a dataset which is in a matlab file, but I already figured it out 😃

#

for anyone interested it is in module scipy.io

#
```python
import scipy.io
mat = scipy.io.loadmat('musk1_normalized_matlab.mat')
```
lapis sequoia
#

so uh

#

I have a 4 million entries csv which can occupy more than like 30GB when in a pandas dataframe

#

what's the best way to compress all this mess?

#

actually 200GB

gritty hawk
#

@lapis sequoia I would try to split that csv up first

#

it's kind of a mess to find things to split them by though =/
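One way to avoid holding the whole file in memory, as a rough sketch: read the CSV in chunks with explicit, smaller dtypes and keep only what you need from each chunk. The column names and the tiny in-memory CSV below are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical tiny CSV standing in for the real multi-gigabyte file.
csv_text = "user,city,amount\n" + "\n".join(
    f"{i},{'Paris' if i % 2 else 'Oslo'},{i * 1.5}" for i in range(10)
)

chunks = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for chunk in pd.read_csv(
    io.StringIO(csv_text),
    chunksize=4,
    dtype={"user": "int32", "city": "category", "amount": "float32"},
):
    # Filter or aggregate here so only the rows you need stay in memory.
    chunks.append(chunk[chunk["amount"] > 3])

result = pd.concat(chunks, ignore_index=True)
print(result.memory_usage(deep=True))
```

Repeated strings stored as `category` and numbers stored as 32-bit types are often where most of the savings come from, but how much you gain depends on the data.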

dreamy tartan
#

Hi, I want to ask something. One-hot encoding ignores unseen values in the test data, so there is no problem with the categorical features, but there are some missing values in the numeric features too. Is there any way to ignore these missing values in the test data? I don't want to drop these rows.

olive trench
#

Guys do you know why pandas function cat.codes skips some numbers? It is consistent, but my problem is that I want to use one of the cat codes as a row number indicator. I have around 154k unique values but the cat codes go up to 157k (with around 3k being skipped). The cat codes was used on a column that has a text ID

void star
chilly shuttle
#

@olive trench i mean it's trivial to convert any column into unique integer representation, so you could just do that

#

only guessing here but it might be skipping number ranges to help with ordering

#

and if you want a row number indicator... just make one?
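A guess at what is going on, since the original data isn't shown: `cat.codes` are positions in the list of categories, so categories that are declared but no longer present in the column leave gaps. The sketch below uses a made-up column to show the gap and two ways to get gap-free integers.

```python
import pandas as pd

# Hypothetical text IDs; "c" is declared as a category but never appears in the data.
ids = pd.Series(["a", "b", "d", "b"], dtype=pd.CategoricalDtype(["a", "b", "c", "d"]))
gappy_codes = ids.cat.codes.tolist()  # codes index into the category list, so 2 is skipped

# Option 1: drop categories that no longer occur, then the codes are gap-free.
compact = ids.cat.remove_unused_categories()
compact_codes = compact.cat.codes.tolist()

# Option 2: factorize assigns 0..n-1 in order of first appearance.
codes, uniques = pd.factorize(ids.astype(str))

print(gappy_codes, compact_codes, codes.tolist())
```

Unused categories typically appear after filtering rows out of a frame whose column was already categorical.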

desert oar
#

@void star that's just your IDE telling you that you haven't used it yet...

polar acorn
#

I assume he meant he didn't get anything done except for importing numpy, I've had such mornings too...

lapis sequoia
#

Hey guys: quick question here:

```python
[1,3] in list1
>>> False
nplist = np.asarray(list1)
[1,3] in nplist
>>> True
```
I would like numpy to compare the entire [1,3] and not break the list down first. I came up with this solution, but it's ugly.
```np.any(np.all([1,3] == nplist, axis=1))```
Any suggestions for a nicer solution? [was routed here from help]
desert oar
#

oof

#

i dont know of a better workaround honestly

#

its a glaring edge case in numpy

#

because tbh its sometimes extremely convenient to write myarray == [1,2] instead of myarray == np.array([1,2]), so its really eager about casting iterable non-array data structures to arrays

#

specifically lists...

#

try it with tuples maybe but i think those get converted as well

#

did you try converting the RHS to array?

#

wait i actually dont think i understand what youre trying to do

#

are you trying to replicate the non-numpy comparison behavior?

lapis sequoia
#

yes. that is what I was trying to do.

#

since I am working with coords, I figured it would be more logical to just convert everything to tuples and take it from there

#

but the numpy issue still kinda bothered me. So if you have a suggestion for the future

lyric canopy
#

Yes, I think that's the best workaround. I'd write it as this, but it comes down to same thing:

>>> x = np.arange(20).reshape((10,2))
>>> x
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])
>>> ([6, 7] == x).all(axis=1).any()
True
>>> ([6, 8] == x).all(axis=1).any()
False
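The same check wrapped in a small helper, plus the tuple-based alternative mentioned earlier in the thread. `contains_row` is a made-up name for illustration. For a 2-D array, `[1, 3] in nplist` behaves like `(nplist == [1, 3]).any()`, which is why a single matching element is enough to give `True`.

```python
import numpy as np

def contains_row(array_2d, row):
    # True only if some row of array_2d equals `row` in every position.
    array_2d = np.asarray(array_2d)
    return bool((array_2d == np.asarray(row)).all(axis=1).any())

x = np.arange(20).reshape((10, 2))
print(contains_row(x, [6, 7]))
print(contains_row(x, [6, 8]))

# For coordinates, a set of tuples gives the plain-Python meaning of `in`.
coords = {tuple(r) for r in x.tolist()}
print((6, 7) in coords, (6, 8) in coords)
```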
timber crescent
#

any one help??

dreamy tartan
#

I couldnt solve this error:
TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>
I didnt understand why i get this error because i didnt do something different than what i was doing till now. I have some numeric and categoric columns, i fit_transform X_train and i transform X_test and thats all. Also already my categoric columns are labeled by labelencoder.

```python
ohe = OneHotEncoder(categorical_features=col_index, handle_unknown='ignore')
X_train = ohe.fit_transform(X_train).toarray()
X_test = ohe.transform(X_test).toarray()
```

What should i do? What im missing?
desert oar
#

hmmm

#

that seems weird

#

full traceback?
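Not a diagnosis of the error above, but a sketch of how the same encoding is usually written in newer scikit-learn releases, where the `categorical_features` and `n_values` parameters of `OneHotEncoder` no longer exist: select the categorical columns with a `ColumnTransformer`. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric.
X_train = np.array([["red", 1.0], ["blue", 2.0], ["red", 3.0]], dtype=object)
X_test = np.array([["green", 4.0], ["blue", 5.0]], dtype=object)

col_index = [0]  # positions of the categorical columns
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), col_index)],
    remainder="passthrough",  # numeric columns are passed through unchanged
    sparse_threshold=0,       # always return a dense array
)
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)  # unseen "green" becomes an all-zero one-hot block
print(X_test_enc)
```

With this setup the encoder takes the raw string categories directly, so a separate LabelEncoder pass over the feature columns should not be needed.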

thorn river
#

I want to use POS-Tags to train a model.

The data is like this:

```python
label = [1, 0, 0, ... N]
```

(1 = Female, 0 = Male) 

I have tokenized the strings with SpaCy and intend to use the POS-tagger from SpaCy.
If I apply the POS-tag to the tokenized strings, do I have to do anything else to train a model on this? Such as concatenating the POS-tags to the strings? 
Or can I immediately apply something like tfidfvectorizer or something to supply it to a model (such as a SVM or anything)
solemn topaz
#

Anyone interested in helping me with classifying text documents based on the presence of certain keywords in them?

scenic musk
#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

solemn topaz
#

Ok. I have a bunch of call center recordings that have been transcribed with AWS Transcribe. I'm trying to detect whether or not the customer service agent promoted a new chat-bot tool on a website available to the customer.

#

When the agent promotes the tool they usually say something along the lines of "..by the way have you tried our new virtual-assistant"

#

or "we have a new tool available where you can chat with us..."

#

So my approach was to have all these key words and phrases 'ask IT', 'virtual assistant', 'chat with us' etc

#

And scan for their presence in the call transcription json files spit out by AWS transcribe

#

That approach wasn't very accurate. A lot of false positives. So now I am trying cosine similarity: tokenizing the transcription into sentences, stemming, removing stop words etc., then getting the term frequency vector of each sentence and comparing it to the term frequency vector of a concatenated string of all the keywords

#

It's still not as accurate as I'd like so I'm wondering if anyone knows a good approach for this.

#

Cosine similarity seems to work well for comparing two sentences together however I'm trying to find the presence of certain keywords in the text and I'm not sure it's the most suitable method for my particular problem .

#

Wondering if anyone has any suggestions
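For reference, the sentence-versus-keywords cosine similarity described above can be sketched with the standard library alone. One variation is assumed here: each sentence is scored against every key phrase separately and the best score is kept, rather than comparing against one concatenated keyword string. The phrases and sentences are made up, and stemming and stop-word removal are left out.

```python
import math
import re
from collections import Counter

def term_vector(text):
    # Lowercased bag-of-words counts; stemming and stop-word removal are omitted here.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical key phrases and transcript sentences.
phrases = ["virtual assistant", "chat with us", "ask IT"]
sentences = [
    "By the way have you tried our new virtual assistant",
    "I will reset your password now",
]

def best_score(sentence):
    # Score the sentence against each phrase on its own and keep the best match.
    vec = term_vector(sentence)
    return max(cosine_similarity(vec, term_vector(p)) for p in phrases)

scores = [best_score(s) for s in sentences]
print(scores)
```

Scoring per phrase keeps a short phrase from being diluted by the other keywords, though long sentences still pull the score down, so a threshold would have to be tuned on the confirmed examples.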

scenic musk
#

Try Machine Learning?

solemn topaz
#

the problem is I only have about 10 confirmed examples of the agent promoting the tool. Not much of a training set.

scenic musk
#

Yeah.

solemn topaz
#

The plan is to use my current approach to detect new instances of promotion, verify them and slowly build up a decent sized training set.

#

but that will take some time.

#

If anyone has any experience with text similarity measures and knowledge of how to best approach this problem I would be so grateful.

hybrid temple
#

Is python pretty good when it comes to handling math stuff? I'm about to take a math class that requires you to learn python along the way

reef bone
#

I would say computers are generally pretty good when it comes to handling math stuff

#

Python is beginner-friendly and has many libraries

#

So it's a good tool

hybrid temple
#

Yeah thats what i was more looking for

#

Seems python has more user-friendly math libs than java

desert oar
#

in python you will just have to write a lot less code than java

#

to do simple things

#

also python has Sympy for symbolic math. im not aware of a java equivalent

granite stream
lapis sequoia
#

A good tutorial for pandas and numpy?

chilly shuttle
#

To do what? Those libraries are pretty general

vivid hedge
#

Would this channel include mathematics?

placid snow
#

The one relevant to data-science I suppose

vivid hedge
#

I'm not sure if this is, but I'll give it a shot. I'm trying to generate passwords based on different settings and with weights on certain characters (like I want symbols to be heavier than, say, lowercase letters) and was wondering if you guys can direct me to some good resources for this.

hearty token
vivid hedge
#

@hearty token That could actually help me.. I believed I had to write my own (or implement someone elses) algorithm for this.

That might be shooting for the stars a bit tho.

Thank you!

hearty token
#

Sure thing! Best of luck with it.
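A minimal sketch of weighted password generation with the standard library, not necessarily what was linked above: pick a character class according to its weight, then pick a character from that class. The class names, the symbol set, and the default weights are all assumptions.

```python
import secrets
import string

def generate_password(length=16, weights=None):
    # Character classes and their relative weights (hypothetical defaults).
    classes = {
        "lower": string.ascii_lowercase,
        "upper": string.ascii_uppercase,
        "digits": string.digits,
        "symbols": "!@#$%^&*",
    }
    weights = weights or {"lower": 1, "upper": 1, "digits": 1, "symbols": 3}
    rng = secrets.SystemRandom()  # OS-backed randomness, suitable for secrets
    names = list(classes)
    # First pick a class according to its weight, then a character within it.
    picked = rng.choices(names, weights=[weights[n] for n in names], k=length)
    return "".join(rng.choice(classes[name]) for name in picked)

print(generate_password(20))
```

With the default weights, symbols are drawn about three times as often as each other class; setting a weight to 0 switches that class off.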

half olive
#

hey guys. Would like to contribute in an open source python package. I am a data scientist with a good knowledge of python.
I am aware that there are plenty of projects in github where I could start. But, I would like to start with small projects as it is
my first time. cheers!

lean ledge
#

@lapis sequoia This is not data science

#

Also change your nickname please

lapis sequoia
#

Prove me wrong

lean ledge
#

?

lapis sequoia
#

Prove me it's not data science

lean ledge
#

Stop trolling

simple crag
#

!ban 387197586370592768 troll

arctic wedgeBOT
#

:ok_hand: permanently banned @rustic lily (troll).

runic siren
#

hello

lapis sequoia
#

sup

#

so i am using this function: ```python
import requests

limit = 100
symbol = "BTCUSDT"
timeframe = "1m"

def get_bars(symbol, limit=100):
    api = '/api/v1/klines?'
    postdict = {
        'symbol': symbol,
        'interval': timeframe,
        'limit': limit
    }
    return _curl_fox(api=api, postdict=postdict)

def _curl_fox(api, postdict=None):
    BASE_URL = 'https://api.binance.com'
    url = BASE_URL + api
    if postdict:
        response = requests.get(url, params=postdict).json()
    else:
        response = requests.get(url).json()
    return response

bardata = get_bars(symbol=symbol, limit=limit)

# each kline is [open time, open, high, low, close, volume, ...],
# so the close is at index 4 (index 5 is the volume); values arrive as strings
C = []
for innerlist in bardata:
    C.append(float(innerlist[4]))

#print(bardata)
print(C)
```

#

that should make a list of the closing prices of btc for the 1m x 100 times that

#

so i want to make a rsi out of that

#

i have a function from that somewhere ```python
def make_RSI(dataframe):
    delta = dataframe['c'].diff()
    dUp, dDown = delta.copy(), delta.copy()
    dUp[dUp < 0] = 0
    dDown[dDown > 0] = 0
    RolUp = dUp.rolling(14).mean()
    RolDown = dDown.rolling(14).mean().abs()

    RS = RolUp / RolDown
    dataframe['RSI'] = 100 - (100/(1+RS))
```
but this one uses a pandas dataframe and i don't wanna use pandas, so how do i make this function useful for the code i am already using?!
chilly shuttle
#

re-implement all the pandas functionality that's using..? I'm not sure why anyone would want to do that though

lapis sequoia
#

no i wanna avoid using pandas @chilly shuttle

#

and i don't know how to do that thats what i am asking?

#

?

#

anybody knows how to rewrite the function for my list?
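For what it's worth, the same simple-moving-average RSI can be written for a plain list of closes. This is a sketch meant to mirror the rolling-mean version above, not a drop-in from any library; the name `rsi_from_closes` is made up, and it expects numbers, so string prices from the API need `float()` first.

```python
def rsi_from_closes(closes, period=14):
    # Price changes between consecutive closes.
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    rsi = [None] * len(closes)  # None where there is not enough history yet
    for end in range(period, len(deltas) + 1):
        window = deltas[end - period:end]
        avg_gain = sum(d for d in window if d > 0) / period
        avg_loss = sum(-d for d in window if d < 0) / period
        if avg_loss == 0:
            # No losses in the window: RSI is 100, or undefined if prices never moved.
            rsi[end] = 100.0 if avg_gain > 0 else None
        else:
            rs = avg_gain / avg_loss
            rsi[end] = 100 - 100 / (1 + rs)
    return rsi

# Hypothetical closes that alternate up and down by the same amount.
example = rsi_from_closes([1.0, 2.0] * 10)
print(example[-1])
```

The first `period` entries are `None`, which corresponds to the leading NaN values the pandas version produces.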

wispy blaze
#

"don't wanna use pandas"? but why?

lapis sequoia
#

too complicated for me now i don't know how to use it on what i wanna do

wispy blaze
#

pandas is like the core.

#

you need to learn it xD

simple crag
#

You're making it more complicated by attempting to reimplement Pandas

lapis sequoia
#

how do i change my code to make this data in a pandas dataframe?

simple crag
#

Didn't the code you copied from the internet already do that?

lapis sequoia
#

yes

#
```python
        bar_data = pd.DataFrame(get_bars(symbol=symbol, limit=limit))

        if len(bar_data.index) < length + 2: #if the api dont return sufficent OHLC data: TERMINATE
            printMessages.terminatingProgram()
            printMessages.notEnoughBarData()
            quit()

        bar_data.drop([0, 6, 7, 8, 9, 10, 11], axis=1, inplace=True)
        bar_data.columns = ['o', 'h', 'l', 'c', 'v']
        for j in ['o', 'h', 'l', 'c', 'v']:
            for i, v in enumerate(bar_data[j]):
                bar_data.loc[i, j] = float(v)

        # H/O L/O H/C L/C
        for i in bar_data.index:
            bar_data.loc[i, 'Body'] = min((max(abs(bar_data.loc[i, 'o'] - bar_data.loc[i, 'c']), 0.0001) / max(
                (bar_data.loc[i, 'h'] - bar_data.loc[i, 'l']), 0.0001)), 0.001)
            bar_data.loc[i, 'L/O'] = (bar_data.loc[i, 'l'] / bar_data.loc[i, 'c'])
            bar_data.loc[i, 'C/O'] = (bar_data.loc[i, 'c'] / bar_data.loc[i, 'o'])
            if bar_data.loc[i, 'c'] >= bar_data.loc[i, 'o']:
                bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'c']), 0.001) / max(
                    (bar_data.loc[i, 'o'] - bar_data.loc[i, 'l']), 0.0001)), 100)
            else:
                bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'o']), 0.001) / max(
                    (bar_data.loc[i, 'c'] - bar_data.loc[i, 'l']), 0.0001)), 100)```
#

but i have like zero clue of pandas

chilly shuttle
#

"too complicated for me now i don't know how to use it on what i wanna do"
then you're gonna have zeeeeero idea how to replicate the functionality in those functions without pandas

#

move on

lapis sequoia
#
```python
import requests
import pandas as pd
limit = 100
symbol = "BTCUSDT"
timeframe = "1m"
length = 60



def get_bars(symbol, limit=100):
    api = '/api/v1/klines?'
    postdict = {
        'symbol': symbol,
        'interval': timeframe,
        'limit': limit
    }
    return _curl_fox(api=api, postdict=postdict)
    
    
def _curl_fox(api, postdict=None):
    BASE_URL = 'https://api.binance.com'
    url = BASE_URL + api
    if postdict:
        response = requests.get(url, params=postdict).json()
    else:
        response = requests.get(url).json()
    return response

    
        # get the bars
bar_data = pd.DataFrame(get_bars(symbol=symbol, limit=limit))

if len(bar_data.index) < length + 2: #if the api dont return sufficent OHLC data: TERMINATE
    printMessages.terminatingProgram()
    printMessages.notEnoughBarData()
    quit()

    bar_data.drop([0, 6, 7, 8, 9, 10, 11], axis=1, inplace=True)
    bar_data.columns = ['o', 'h', 'l', 'c', 'v']
    for j in ['o', 'h', 'l', 'c', 'v']:
        for i, v in enumerate(bar_data[j]):
            bar_data.loc[i, j] = float(v)

        # H/O L/O H/C L/C
    for i in bar_data.index:
        bar_data.loc[i, 'Body'] = min((max(abs(bar_data.loc[i, 'o'] - bar_data.loc[i, 'c']), 0.0001) / max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'l']), 0.0001)), 0.001)
        bar_data.loc[i, 'L/O'] = (bar_data.loc[i, 'l'] / bar_data.loc[i, 'c'])
        bar_data.loc[i, 'C/O'] = (bar_data.loc[i, 'c'] / bar_data.loc[i, 'o'])
        if bar_data.loc[i, 'c'] >= bar_data.loc[i, 'o']:
            bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'c']), 0.001) / max((bar_data.loc[i, 'o'] - bar_data.loc[i, 'l']), 0.0001)), 100)
        else:
            bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'o']), 0.001) / max((bar_data.loc[i, 'c'] - bar_data.loc[i, 'l']), 0.0001)), 100)


#print(bardata)
print(bar_data)```
#

this is what it should look with that

#

you can run it for yourself what the return is (its too big too paste here)

#

so how do i call this function then ```python
def make_RSI(dataframe):
    delta = dataframe['c'].diff()
    dUp, dDown = delta.copy(), delta.copy()
    dUp[dUp < 0] = 0
    dDown[dDown > 0] = 0
    RolUp = dUp.rolling(14).mean()
    RolDown = dDown.rolling(14).mean().abs()

    RS = RolUp / RolDown
    dataframe['RSI'] = 100 - (100/(1+RS))
```
#

like what is "dataframe" bar_data?

simple crag
#

bar_data is a dataframe, yes

lapis sequoia
#

then i get ```KeyError: 'c'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "pandas_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
TypeError: an integer is required```

#

with thiscode

wispy blaze
#

your columns are named after integers

#

not letters

lapis sequoia
#

yes?

wispy blaze
#

dataframe['c']

#

this is calling the column with header "c"

chilly shuttle
#

50 bucks it'll return object

#

as the dtype

wispy blaze
#

well yeah

#

It's still not "c" tho

chilly shuttle
#

i don't get it

#

all the code there is referencing a column named c

lapis sequoia
#

yes

#

it should be named c

wispy blaze
lapis sequoia
#

but it isn't

wispy blaze
#

his output

lapis sequoia
#

yeah

#

shouldn't they be renamed here? for j in ['o', 'h', 'l', 'c', 'v']: for i, v in enumerate(bar_data[j]): bar_data.loc[i, j] = float(v)

#

or here

#

bar_data.columns = ['o', 'h', 'l', 'c', 'v']

wispy blaze
#

ah

#

okay

#

your if statement is failing

#

if len(bar_data.index) < length + 2:

#

length = 60

#

oh wait

#

lolol

lapis sequoia
#

lol

wispy blaze
#

i need a nap

lapis sequoia
#

also an approach.....

wispy blaze
#

so if i run it piecewise.. it works

lapis sequoia
#

yes

#

but once i want to get indicators it doesn't

wispy blaze
#

It is your if statement

#

len(bar_data.index) = 100

#

length = 60

#

if you leave it in you need to move the rest of the code to else:

#
```python
if len(bar_data.index) < length + 2:
    printMessages.terminatingProgram()
    printMessages.notEnoughBarData()
    quit()

else:
    bar_data.drop([0, 6, 7, 8, 9, 10, 11], axis=1, inplace=True)
    bar_data.columns = ['o', 'h', 'l', 'c', 'v']
    for j in ['o', 'h', 'l', 'c', 'v']:
        for i, v in enumerate(bar_data[j]):
            bar_data.loc[i, j] = float(v)

        # H/O L/O H/C L/C
    for i in bar_data.index:
        bar_data.loc[i, 'Body'] = min((max(abs(bar_data.loc[i, 'o'] - bar_data.loc[i, 'c']), 0.0001) / max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'l']), 0.0001)), 0.001)
        bar_data.loc[i, 'L/O'] = (bar_data.loc[i, 'l'] / bar_data.loc[i, 'c'])
        bar_data.loc[i, 'C/O'] = (bar_data.loc[i, 'c'] / bar_data.loc[i, 'o'])
        if bar_data.loc[i, 'c'] >= bar_data.loc[i, 'o']:
            bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'c']), 0.001) / max((bar_data.loc[i, 'o'] - bar_data.loc[i, 'l']), 0.0001)), 100)
        else:
            bar_data.loc[i, 'TopBottom'] = min((max((bar_data.loc[i, 'h'] - bar_data.loc[i, 'o']), 0.001) / max((bar_data.loc[i, 'c'] - bar_data.loc[i, 'l']), 0.0001)), 100)```
lapis sequoia
#

yes

#

so now it should work right?

wispy blaze
#

it should

#

that rsi function should eat the bar_data just fine

lapis sequoia
#

yeah oke thats cool i guess

#

but here is the part that i don't get lol, how to use this data

#

i wanna create an average of the rsi

#

but as you can see the rsi starts with nan in order to start creating a rsi

#
```
0     4.277778        NaN
1     4.462882        NaN
2   100.000000        NaN
3     3.490909        NaN
4     1.837302        NaN
5     2.070513        NaN
6    25.000000        NaN
7     0.212454        NaN
8     0.000741        NaN
9     3.985294        NaN
10    1.128302        NaN
11    1.585714        NaN
12    2.701754        NaN
13   42.500000        NaN
14    0.126482  55.902004```
#

so we can't run an calculation on that

wispy blaze
#

wait what

#

rephrase

lapis sequoia
#

the rsi's first outputs are NAN as you can see

#

so i want the latest rsi value

chilly shuttle
#

yeah that's what rolling window functions will do

lapis sequoia
#

in a variable

#

so how do you extract such thing from it?

chilly shuttle
#

define 'latest'

lapis sequoia
#

the latest known value so the 100th value in this case

#

correction 99th since the first is the name

wispy blaze
#

just drop those rows?

lapis sequoia
#

no but i want that value in a variable

wispy blaze
#

then get more data?

lapis sequoia
#

whut

chilly shuttle
#

you are trying to build a castle

#

on a cloud

#

with no foundation

wispy blaze
#

xD

chilly shuttle
#

understand stats. Understand pandas

#

then do whatever tf you're trying to do

lapis sequoia
#

i don't understand pandas, i told that already and also said i wanted to avoid it

wispy blaze
#

there is no avoiding pandas

#

if you want to avoid pandas python is not for you xD

chilly shuttle
#

you sound like 'i wanna drift around a pro racing circuit in a top 10% time, but I don't understand transmissions. How do I race without using a transmission?'

wispy blaze
#

^^

lapis sequoia
#

there are only 2 things left that i need? please 😄 ?

wispy blaze
#

is this homework?

lapis sequoia
#

nope

wispy blaze
#

then youve got plenty of time to read up xD

lapis sequoia
#

sad face

#

this is the code i have now

#

i added another indicator to it since heck why no

#

i am just testing around with this data, but sadly it is in pandas thats why i wanna make them variables now so its more in my field

#

so if you'd want to explain to me how extracting specific data from the dataframe is possible so that we can both sleep 😄

chilly shuttle
#

.dropna().values[0]

#

now get out

wispy blaze
#

LOLOL

lapis sequoia
#

no i wasn't talking about the nan values

wispy blaze
#

You're doing an extremely poor job of explaining what you even want out of the dataframe.

lapis sequoia
#

i need the latest (in this case the 99th) value of the RSI column in a variable named x and i need to have the distance from the upperband and underband from the ma in (2) variables

wispy blaze
#

x=dataframe['column'][row]
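To make the options concrete, here is a small sketch of pulling single values out of a column. The tiny `bar_data` frame is a hypothetical stand-in for the real one; the difference worth noting is that `.iloc[-1]` is the last row by position, while `.dropna()` followed by `[0]` gives the first non-NaN value.

```python
import pandas as pd

# Hypothetical stand-in for bar_data with an RSI column that starts with NaN.
bar_data = pd.DataFrame({"RSI": [float("nan"), float("nan"), 55.9, 61.2, 48.7]})

x = bar_data["RSI"].iloc[-1]                    # latest value (last row by position)
first_valid = bar_data["RSI"].dropna().iloc[0]  # first value that is not NaN
mean_rsi = bar_data["RSI"].mean()               # mean() skips NaN by default

print(x, first_valid, mean_rsi)
```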

chilly shuttle
#

wait you can do that now?

#

no need for iloc?

wispy blaze
#

I've never used iloc

chilly shuttle
#

i guess it's very pandas'y to have multiple syntax for the same thing

wispy blaze
#

iloc has always bothered me tbh

simple crag
#

coming from MATLAB iloc is great

wispy blaze
#

MATLAB is sin

simple crag
#

No memes pls

wispy blaze
#

i hear the germans still love it tho

lapis sequoia
#

lol

chilly shuttle
#

oh shit it wins on perf too

#

TIL

#

caveat being it only works in favour if you're fetching a single column

wispy blaze
#

true.

#

bar_data[['c','o']][1:2] unless you do that

#

but the output is a bit different

chilly shuttle
#

also vectorized access

#

different beasts after all

#

but yeah i had no idea you would access rows with a second indexer like that

#

...even though it's the same as slicing a dataframe or view

#

I guess I'm just dumb

wispy blaze
#

shrug you seem alright to me

chilly shuttle
#

so the only actual winning use case for iloc is vectorized access

#

and iirc it's the only way to do assignment anymore

lapis sequoia
#

yes

#

i got it working now

#

thank you guys both so much

chilly shuttle
#

congrats, you have built a castle on top of a cloud

lapis sequoia
#

yey 😄

chilly shuttle
#

it will fall apart at the first breeze

wispy blaze
#

lol

lapis sequoia
#

¯\_(ツ)_/¯

#

definitely

#

thats why i don't touch it anymore lolzs

#

but just thanks, really appreciated

wispy blaze
#

geen probleem

lapis sequoia
#

fijne avond nog

chilly shuttle
vale hedge
#

anyone know for pandas dataframe, what is simplest way to add a row?

slender oracle
#

If the row you're adding is another dataframe with the same columns you could use pd.concat

woven tundra
#
df.append({"col1": val1, "col2": val2}, ignore_index=True)
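Note that `DataFrame.append` was deprecated and then removed in pandas 2.0, so on current versions the line above raises an error. Two replacements, sketched with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# Option 1: concatenate a one-row DataFrame.
new_row = pd.DataFrame([{"col1": 3, "col2": "c"}])
df = pd.concat([df, new_row], ignore_index=True)

# Option 2: assign by label; fine when the index is the default 0..n-1 range.
df.loc[len(df)] = [4, "d"]

print(df)
```

If you are adding many rows in a loop, collecting them in a list and building the frame once at the end is usually much faster than growing it row by row.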
thorn river
#

Does anyone know how to cite the SpaCy POS-tagger? I've scoured the internet but can't seem to find anything

void anvil
#

for ML in python, is it more correct / efficient to keep everything in data frames or separate lists

#

e.g. df1 has predictor 1, predictor x, bins

#

or df predictor 1, predictor x; list bin

twin sierra
#

Hello, I am looking for a better way to mask out the Green and Blue channel of an opencv image.

  • The current way I do it is to create an image the same size as my target, set each pixel to (0,0,255), then use cv.bitwise_and between the two images so only the Red channel is left. This works, but I would like to not have to have a 2nd buffer for the mask when every pixel of the mask is the same.
  • The 2nd way I tried is img_target[:,:,:2] = 0, which sets the first two channels of every pixel to 0. This also works, but takes 16 times as long as the bitwise_and method I currently use.
  • Finally, I thought I could apply the bitwise operation using a slice like img_target[:,:] &= (0,0,255), but that threw a type error.

Is there a way to do this that is as fast as bitwise_and, but doesn't use a 2nd buffer?

twin sierra
#

I figured out a way that is even faster than the buffer. cv.bitwise_and(img_target, (0, 0, 255), img_target)
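For comparison, a NumPy-only sketch of the same in-place mask. It assumes the image is a `uint8` array in OpenCV's BGR channel order; the 2x2 image is hypothetical. The earlier type error with `&= (0,0,255)` was likely because the tuple is converted to a wider integer type that cannot be written back into a `uint8` array in place, so the mask is given an explicit `uint8` dtype here. No claim is made about speed relative to `cv.bitwise_and`.

```python
import numpy as np

# Hypothetical 2x2 BGR image as OpenCV would load it (dtype uint8, red is the last channel).
img_target = np.full((2, 2, 3), 200, dtype=np.uint8)

# A uint8 mask of shape (3,) broadcasts over every pixel, so no full-size buffer is needed.
mask = np.array([0, 0, 255], dtype=np.uint8)
np.bitwise_and(img_target, mask, out=img_target)  # writes the result back in place

print(img_target[0, 0])
```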

wheat wedge
#

is there a good way in python to calculate the joint eigenvalues of a pair of matrices?
so finding the lambdas for det(\lambda*A-B)=0

#

where A and B are covariance matrices

wheat wedge
#

nvm, it is solved afaik
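For anyone landing here later: det(λA − B) = 0 is the generalized eigenvalue problem B v = λ A v, and one way to solve it (not necessarily what was used above) is `scipy.linalg.eigh`, which handles a symmetric B and a symmetric positive-definite A. The random covariance matrices are just for illustration.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Hypothetical covariance matrices built from random samples (3 variables, 50 observations).
A = np.cov(rng.normal(size=(3, 50)))
B = np.cov(rng.normal(size=(3, 50)))

# eigh(B, A) solves B v = lambda * A v, i.e. det(lambda * A - B) = 0.
lambdas, vectors = eigh(B, A)

# Each column of `vectors` should satisfy the equation for its own eigenvalue.
residual = B @ vectors - (A @ vectors) * lambdas
print(lambdas, np.abs(residual).max())
```

If A is only positive semi-definite (a singular covariance), `eigh` will fail and `scipy.linalg.eig(B, A)` is the more general fallback.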

latent flicker
#

Anyone have any good pandas resources to go from beginner to intermediate?

polar acorn
#

Should give you a good grasp of the fundamentals at least when it comes to indexing, where I think a lot of beginners have trouble.

gray tartan
late garnet
#

Does anyone have experience with NLP? I am trying to do some data cleansing on some free form text - specifically job titles. However, there are many variations in typos making it difficult to cleanse. Do any of you have any general suggestions? So far I have manually curated job titles, built a BernoulliNB model to suggest labels and have attempted to use spelling correction. Another approach I have thought about is to use edit distance, but I would need to find a list of common job titles.

carmine lava
#

hi

lean ledge
#

@gray tartan the image you linked looks like an image with high barrel radial distortion. Should be able to construct a distortion coefficient matrix that does that and crop into a circle. OpenCV should be able to do that. Think you might have to construct an identity camera matrix

small ore
#

What is a greedy algorithm?

lean ledge
#

Algorithms that at each step take whatever looks best locally according to a heuristic function, hoping that this still leads to the global optimum (which isn't guaranteed)
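A small illustration of the idea, using coin change as a stand-in problem (the function name and coin sets are made up for this sketch): always grab the largest coin that still fits.

```python
def greedy_change(amount, coins):
    """Always take the largest coin that still fits into the remaining amount."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    return picked

us_style = greedy_change(63, [25, 10, 5, 1])  # [25, 25, 10, 1, 1, 1]
awkward = greedy_change(6, [4, 3, 1])         # [4, 1, 1], although [3, 3] uses fewer coins
```

The second call shows the catch: the locally best choice is not always part of the globally best answer.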

small ore
#

That definition seems more complicated than ESL 😄

#

"heuristic function" :faint:

latent flicker
#

@polar acorn thanks

lean ledge
#

@small ore A heuristic function is a "hint" function. It approximates and hints to whether you're close to the goal or not. Imagine if you were looking for a path from point A to B in a maze. A naive search algorithm would start at the start and search in all directions even if it's not meant to go that way. A heuristic function would be a measure of how close you are to the goal. So an algorithm like A* would start by expanding only in the directions that help reach the goal (by listening to the heuristic function) but if that doesn't work try out other options
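The maze description can be sketched as a tiny A* search. Everything here is an assumption made for illustration: a 4-connected grid, `#` for walls, and Manhattan distance as the heuristic.

```python
import heapq

def astar(grid, start, goal):
    """Return the number of steps on a shortest path, or None if the goal is unreachable."""
    def h(cell):
        # Heuristic: Manhattan distance, a cheap guess of how far the goal still is.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (cost so far + heuristic, cost so far, cell)
    best = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best[cell]:
            continue  # stale queue entry, a cheaper route to this cell was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(frontier, (new_cost + h((nr, nc)), new_cost, (nr, nc)))
    return None

maze = ["A..#",
        ".#.#",
        "...B"]
steps = astar(maze, (0, 0), (2, 3))
```

Cells that look closer to the goal are popped first, and the other options stay in the queue in case the promising direction is blocked.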

gray tartan
#

@lean ledge hey thx

#

I learnt about it

#

But i cant find any algorithm that can reproduce it

#

So i think i'll try it myself

#

With opencv i found only things to correct it

#

So, the contrary

#

To project on a spherical surface, is it better to do barrel or pincushion distortion ?

lean ledge
#

you want barrel distortion

gray tartan
#

Cool

#

Thx

#

In fact i just have to move the points nearer to the center

#

Starting by the center

#

I'll try by dividing by 10 the distance between the pixel and the center

gray tartan
#

ok @lean ledge

#

i've ended up with this :

#
from PIL import Image
import math


im = Image.open('08.jpg').convert('RGBA')
im2 = Image.new('RGBA', (im.width, im.height), (255,255,255,255))

coordsStart = (int(im.width/2) + 1, int(im.height/2) - 1)
cardinalSearch = [(-1, 0), (0, 1), (1, 0), (0, -1)]
pointer = 0
for x in range(im.height - 1):
    for i in range(2):
        for y in range(x):
            coords = (coordsStart[0] + (x * cardinalSearch[pointer][0]), coordsStart[1] + (x * cardinalSearch[pointer][1]))
            centerDistance = math.sqrt((coords[0] - coordsStart[0]) ** 2 + (coords[1] - coordsStart[1]) ** 2)

            im2.putpixel((int(coords[0] - cardinalSearch[pointer + 1 if pointer != 3 else 0][0] * centerDistance / 4), int(coords[1] - cardinalSearch[pointer + 1 if pointer != 3 else 0][1] * centerDistance / 4)), im.getpixel(coords))

        pointer += 1 if pointer != 3 and x != 0 else -3 if x != 0 else 0

im2.save('barrelDistorded.png')
#

but it gives me an index out of range on getpixel :/

#

coords is supposed to go spiral from center

#

to get a smooth distortion

#

starting from the top right pixel from the square of 4 pixels at the center
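One way to sidestep the out-of-range lookups is to loop over output pixels and pull each one from a source position, checking bounds before reading. The sketch below is not a fix of the spiral code; it assumes the image is available as a NumPy array (for example via `np.asarray(im)` on a PIL image) and uses a made-up strength parameter `k`.

```python
import numpy as np

def barrel_distort(img, k=0.3, fill=255):
    """Inverse-map every output pixel to a source pixel farther from the centre."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalised offsets from the centre, roughly in [-1, 1].
    ny, nx = (ys - cy) / max(cy, 1.0), (xs - cx) / max(cx, 1.0)
    scale = 1.0 + k * (nx ** 2 + ny ** 2)  # grows with distance from the centre
    src_y = np.rint(cy + ny * scale * max(cy, 1.0)).astype(int)
    src_x = np.rint(cx + nx * scale * max(cx, 1.0)).astype(int)
    # Bounds check: source positions outside the image keep the fill colour.
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full_like(img, fill)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

demo = np.arange(5 * 5 * 3, dtype=np.uint8).reshape(5, 5, 3)
bulged = barrel_distort(demo, k=0.3)
```

Because content from farther out lands closer to the centre, positive `k` gives the barrel look; the exact sign convention differs between libraries, so treat it as a knob to experiment with.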

somber zodiac
#

Hello

#

Is anyone familiar with Yolo object detection?

#

I'm doing a project for Cambridge Uni

#

Would greatly appreciate some help

#

Not regarding code, but concept

hearty token
#

I'm not familiar with it @somber zodiac, but I saw your question in #help-kiwi and had a look. Just a thought, but:
By default, YOLO only displays objects detected with a confidence of .25 or higher. You can change this by passing the -thresh <val> flag to the yolo command. If you have many similar objects in the image (like the smileys), perhaps the threshold of 25% match is too low? So it would find too many matches. The tool is made for real-time object detection after all, so it can't be too picky. But in your case it seems it should be very picky. Did you try adjusting this?

somber zodiac
#

I have. I'll explain more clearly what's happening

#

Suppose you have this image

#

Now suppose all of the emojis are moving, such that they may occlude other emojis

#

And suppose we want to detect just 1 of all the emojis on the screen

#

I.e. this one:

#

How many images of 😪 would you expect to require to train the model?

#

Because I've used 300 and sometimes it produces bounding boxes around 😪 and 😆 for instance saying they are the same thing

#

Thing is, for those 300 images, they're all of the same icon

#

I've changed the threshold as best as I can to narrow the detection down to 1 object as closely as possible but sometimes the minimum it can detect is 2 objects

#

Which it shouldn't, because there is only 1 😪

#

My error is near 0 in training

hearty token
#

Well, this is not my area, but I'll share my thoughts. I'm thinking if you want it to be able to identify partially obstructed smileys, shouldn't you train it with what the partially obstructed smiley looks like? 300 identical images wouldn't give it any training, as far as I can tell anyway (or does the algorithm account for that?). Maybe you could randomize training images by randomly cropping off segments of the smiley you want to identify, or something like that?

somber zodiac
#

Yup I've done that

#

I didn't literally use 300 of the same image

#

I used 300 of the same emoji, but in different scenarios

#

i.e. how you described

#

Sometimes it is partially covered

#

Sometimes it isn't but there is something in the background

hearty token
#

Well, I don't know, to be honest. And I don't even know if you're getting unexpected behavior. Does the percentage of times it gets it right correspond somewhat with the confidence threshold?

somber zodiac
#

It does depend on the threshold. I'm usually able to detect it correctly when the object isn't occluded but sometimes it classifies 2 different emojis that aren't occluded as the same object

#

But when I vary the threshold to detect just 1 when it does that

#

It picks the wrong one

hearty token
#

In the demo video YOLO is distinguishing bicycles from dogs, for example. That difference is pretty big compared to 😪 and 😆 . So perhaps it's expected that it confuses the two? Both smileys will have obstructed parts that look the same. I guess maybe the training images shouldn't be too obstructed? Because if all the distinguishing features are obstructed, and you're saying "that's my smiley", then the program will think anything yellow is the one you're looking for

somber zodiac
#

Yes that's correct, but what would need to be done to overcome this?

#

Also, the problem I'm talking about doesn't have any occlusion

#

Sometimes, the emojis aren't even covered, yet it still classifies more than 1 emoji that aren't covered as the same thing

lyric canopy
#

I've also been reading a bit about the model: Isn't its main goal to detect classes of object, not specific individuals?

#

That would explain why it sometimes groups smileys together

#

It also doesn't seem to care much about small details (because it isn't interested in classifying individuals, but classes)

hearty token
#

@somber zodiac Right, but wouldn't that happen if you've trained it with too many "feature-weak" images?

#

Or perhaps it's in the model itself, as Ves suggests

#

Also, the first answer you got (in #help-kiwi ) was that you should train it more. You said it shouldn't need training because it's a static image, but in this problem description, it isn't at all static. So maybe it is just more training required?

somber zodiac
#

Suppose you have a 60 second clip of the emojis moving

#

I've taken 60 frames

#

So 1 frame per second

#

So those are my static images

#

I have 60 static images

#

And in each image I want to detect that emoji

hearty token
#

Why not try giving it more training and see if it improves?

lapis sequoia
#

Also doesn't that more fall under object tracking rather than classification?

small ore
#

Thanks @lean ledge .

lean ledge
#

That sounded unproductive, lol

#

@somber zodiac Is it generating extra bounding boxes for the other emoji during the non-maximum suppression stage?

#

It sounds to me like the problem might be removing the bounding box for the emoji behind because of high shared area
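For context, a bare-bones version of non-maximum suppression looks like the sketch below. It is not YOLO's actual implementation; the box format `(x1, y1, x2, y2)`, the threshold and the example boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box is discarded purely because it shares a lot of area with the first, which is how a partly hidden emoji right behind another one can lose its box.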

#

Have you tried using RCNNs instead of YOLO? They are not as fast but they result in generally better performance than YOLO

#

So they're better for non real-time contexts

#

And they can generally be built up to more complicated stages where they can do instance segmentation instead of just detection which YOLO isn't really built for. If you get instance segmentation working for your network, your problem would be solved because it would be trying to segment each different instance of the emoji differently

delicate nymph
#

good evening

#

is there a way to draw the best line for this graph?

lean ledge
#

looks like you can just do linear fits for each set of points

delicate nymph
#

they look more like curves to me

lean ledge
#

search up linear regression. lots of frameworks that can do that

#

you sure?

#

they're blobby but the relationship is still linear

delicate nymph
#

well i have a slightly different one

lean ledge
#

now that is not linear, yes

delicate nymph
#

and i think they both should be curves

#

may i explain what i've tried so far?

lean ledge
#

honestly, the best fit to me looks like 2 different lines

delicate nymph
#

i don't understand i'm sorry

#

i've tried 2 libraries so far and they don't have subplots

#
fig5, ax5 = plt.subplots()
ax5.plot(data30['PAR'],data30['302nm'],'bo',data30['PAR'],data30['312nm'],'ko',
        data30['PAR'],data30['320nm'],'go',data30['PAR'],data30['340nm'],'ro',
        data30['PAR'],data30['380nm'],'mo',ms=2)
ax5.grid()
fig5.savefig("PAR30.png")
plt.show()
lean ledge
#

one line with a higher slope for the first few parts, another line for the second part also going up but less so
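The two-line idea can be tried without guessing a model equation: test every split point and keep the pair of straight-line fits with the lowest squared error. A sketch with synthetic data; the function name and the minimum of three points per segment are arbitrary choices.

```python
import numpy as np

def two_segment_fit(x, y, min_pts=3):
    """Try every split point and keep the pair of lines with the lowest squared error."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best = None
    for i in range(min_pts, len(x) - min_pts + 1):
        left = np.polyfit(x[:i], y[:i], 1)    # (slope, intercept) of the first part
        right = np.polyfit(x[i:], y[i:], 1)   # (slope, intercept) of the second part
        sse = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2)
               + np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, x[i], left, right)
    return best

xs = np.arange(20.0)
ys = np.where(xs < 10, 2.0 * xs, 0.5 * xs + 15.0)  # steep start, flatter continuation
sse, breakpoint_x, left_line, right_line = two_segment_fit(xs, ys)
```

On real, noisy measurements the breakpoint will only be approximate, and this brute-force search is fine for a few thousand points but not much more.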

delicate nymph
#

shouldn't there be code that finds the curve by itself? i mean there is in Origin, so there should be in python too

lean ledge
#

the closest curve I can think of for that data would be the output characteristics and operation regions of a MOSFET

#

so you can look at the model equations they use for its regions

delicate nymph
#

but this wouldn't be exact right?

#

it's like scipy's curve_fit

#

you have to guess the equation

fierce saffron
#

Is there any good way of including a numeric measure of the number of samples for jointplot?

#

in seaborn

#

I've tried kdeplot & jointplot, but neither seems to provide meaningful numbers to go along with the plot.

ancient dome
#

hello

#

anyone here can help me with time series forecasting ?

polar acorn
#

Maybe, ask your question and we'll see.

strange radish
#

Hey, everyone. I'm the lead dev and maintainer of PyData/Sparse (http://sparse.pydata.org/) and I'd like to invite everyone here to a webinar where I'll be talking about it. It's on December 19 at Noon Eastern. https://app.livestorm.co/quansight/

small ore
#

Helps if you can give a x hours from now, @strange radish

strange radish
#

@small ore Done.

small ore
#

Ah. Still days to it. I said x hrs from now so the timezone miscalculation problem can be avoided

strange radish
#

I'll re-post closer to when it's about to happen! 😄

#

But if you're interested, you can register and it'll email you.

small ore
#

👍 I check here more than I check emails. Thanks for letting us know

serene veldt
#

hey guys

#

im having an issue with eager tensorflow

#

i have this dataset

#
70,1,4,130,322,0,2,109,0,24,2,3,3,2
67,0,3,115,564,0,2,160,0,16,2,0,7,1
57,1,2,124,261,0,0,141,0,3,1,0,7,2
64,1,4,128,263,0,0,105,1,2,2,1,7,1
74,0,2,120,269,0,2,121,1,2,1,1,3,1
65,1,4,120,177,0,0,140,0,4,1,0,7,1
56,1,3,130,256,1,2,142,1,6,2,1,6,2
59,1,4,110,239,0,2,142,1,12,2,1,7,2
60,1,4,140,293,0,2,170,0,12,2,2,7,2
63,0,4,150,407,0,2,154,0,4,2,3,7,2
#

just a small sample for testing

#

and am trying to read it properly

#
tf.enable_eager_execution()
defaults = [tf.float64] * 14
dataset=tf.data.experimental.CsvDataset(path, defaults)
>>> dataset
>>> <CsvDataset shapes: ((), (), (), (), (), (), (), (), (), (), (), (), (), ()), types: (tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64, tf.float64)>
#

so far so good, but then i get this CsvDataset object

#

my goal would be to have a list of tensors containing all the values from each row

#

like such

#
[<tf.Tensor(shape=(10,), dtype=float64, numpy=array([70.0,67.0,57.0,64.0,74.0,65.0,56.0,59.0,60.0,63.0]))>,
(...) x14]
#

i have tried doing : col1 = dataset.map(lambda *row: row[0]) which makes an iterable <MapDataset shapes: (), types: tf.float64>

#

the problem with that would be raising the complexity to O(n^2) since i would have to loop all the columns and then iterate over the MapDatasets

#

isn't there any proper way to get that desired list of tensors?

small ore
#

Could we discuss l1 vs l2 regression please? I just learnt about each and I am not sure how one fares vs the other and in what circumstances each is used

serene veldt
#

ignore my question, found an answer

lean ledge
#

l1 vs l2 isnt specifically about regression, it's about the nature of the norm

polar acorn
#

@small ore Regularisation is used when you want to avoid overfitting right? Regularisation works by adding a term to the cost function, the term involves the parameters of your model. So in essence you want to minimize the cost function while at the same time keeping your parameters close to zero. So do you want to minimize the square of the parameters (this is l2) or the absolute value of the parameters (this is l1)?

If you choose l1, the absolute value, there is a larger chance that some of your parameters end up at zero, but the cost function itself might not be as minimized so the model is a slightly worse fit to the data. With l2 you might end up with fewer parameters that are zero but with a slightly better fit.

So in the end. If you have a lot of parameters in your model and you suspect some of them are superfluous you can use l1 to reduce the number of parameters that influence your model. If you have fewer parameters and you suspect they are all valuable you could use l2 to squeeze out a better fit while still avoiding overfitting.
Makes sense?
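The "some parameters end up at zero" effect can be seen in the one case where both penalties have a closed form. This sketch assumes orthonormal features (XᵀX = I), a loss of ½‖y − Xw‖² plus either λ‖w‖₁ or (λ/2)‖w‖², and made-up unregularised coefficients `b`.

```python
import numpy as np

# Unregularised least-squares coefficients (hypothetical values).
b = np.array([3.0, -0.2, 0.5, -2.0])
lam = 0.6

# L1: soft-thresholding moves every coefficient towards zero by lam and clips at zero.
w_l1 = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# L2: every coefficient is scaled by the same factor, so none becomes exactly zero.
w_l2 = b / (1.0 + lam)
```

With these numbers L1 zeroes the two small coefficients and keeps 2.4 and -1.4, while L2 keeps all four but smaller. For general, correlated features there is no such formula for L1, but the same tendency shows up.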

chilly shuttle
#

that's a pretty good explanation

#

one thing worth adding is even though regularisation modifies trainable parameters, the consequence is it modifies sensitivity to actual features in the input

#

and L1 will cause low-impact features to get ignored as they're received with a 0 term

small ore
#

Woo. Awesome explanations, pppt and bicubic. Thank you. Though I would like to hear any more intuitions if there are any.

chilly shuttle
#

ask questions

small ore
#

Hm. This is also a discussion channel. So I thought a discussion would benefit. Anyway, question: Is there a way to assess sensitivity of each parameter rather than just find out these are in or out from a l1 after the optimization?

chilly shuttle
#

there's a whole bunch of feature selection algorithms, using L1 is not the only way to do feature selection

small ore
#

Well, noob here. I just learnt l1 and was interested in knowing how best to put it to use

charred crest
#

Hey, is this the right channel to ask questions about DL ? (Basically I'm wondering if my results are coherent, I train an AI against itself then I make it play against another AI that my teacher trained before, but my results are kind of weird)

lean ledge
#

@charred crest Yes

charred crest
#

So, I'm wondering why when I train my AI 70.000 times I have a better % of victory than when I train it 100.000 times, my % of victory decrease by 5 ? That's kind of weird, no? @lean ledge

latent flicker
#

@lean ledge are there assignments in that course?

charred crest
#

@latent flicker Are you talking to me or?

latent flicker
#

No, to @lean ledge

small ore
#

Anyone has experience with azure notebooks unable to restart?

worn hollow
#

Hey im trying to get started with machine learning and neural networks in python (seems like tensorflow is the way to go?) could anyone get me pointed in the right direction to start this stuff? I tried to do the tutorials from https://www.tensorflow.org/tutorials/ but keras fails to download the datasets.

arctic moth
#

Hello, I want to use the weights that are output by the SVM classifier from scikit-learn to test the data that I used to fit the model (positive should give me a number > 1 and negative should give me a number < -1). For a small number of samples (5) the output weights are 2D, but when I add more points (100) and train the classifier on 2-dimensional vectors, its output weights are a 3-dimensional array. How can I get the coefficients that are used to multiply the 2D vector?
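A sketch of one way to get those coefficients, assuming a binary problem and `kernel="linear"` (the data is synthetic). `coef_` holds one weight per input feature and only exists for the linear kernel, whereas `support_vectors_` and `dual_coef_` grow with the number of support vectors, which may be what changed shape when more points were added.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters of 2-D points.
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]       # shape (2,): one weight per input feature
b = clf.intercept_[0]
scores = X @ w + b     # same values as clf.decision_function(X)
```

Only the support vectors sit at roughly ±1; other points score further out, and with a soft margin some training points can end up inside the band.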

#

@worn hollow Dont start with tensorflow, if you have no experience with machine learning or neural networks. Try Keras tutorials for starters, or even scikit-learn is much more user friendly. There is a lot of concepts behind machine learning and neural networks in general and it is much better to start learning in easier frameworks than in the hardest one.

sleek path
#

hi

#

is ms in data science a good idea?

lyric canopy
#

No one can decide that for you

charred crest
#

Is this a coherent result for AI reinforcement learning with Q-learning / Td-lambda algorithms?

lapis sequoia
#

Not sure to what extent this is a specific data science question, but my application is regarding data science, so I'll ask away. Do any of you have experience in subclassing numpy's ndarray? The context is typhoon tracking. My idea is to break down my global model into region class instances and add my parameters to these. The parameters are all np arrays (from netCDF)
e.g.

def func_for_ndarray():
    pass

#region.parameter.function
se_asia.pressure.func_for_ndarray()

Would this method be advisable? It would be extremely elegant, but I'm afraid of starting something that's maybe too complex for what it's worth.

void anvil
#

Quick question about using an ensemble classifier for time series data:

Say I have a dataset that's all of 2017. For a train / test split I arbitrarily choose October as the last month for train data, Nov:Dec for the test and I train a few models (SVM, RF, MLP, etc.) on this data.

Now I want to train an ensemble (without contamination) on the few model's I've made (e.g. train an RF using predictors of SVM, RF, MLP probability output from the Jan:Oct / Nov:Dec train / test split). Can I predict Jan:Oct using the previous models and feed that to train the ensemble model (e.g. use the same Jan:Oct train / Nov:Dec test), or is that 'cheating' because the initial set of models are trained on the input data which is then being fed to the ensemble? I'm thinking in order to create a 'clean' model I would need to train on the November model outputs (of this example) and test on the December actuals in order to keep the data sanitar
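The "train on November outputs, test on December" idea from the question can be sketched as three time blocks. This is only an illustration with synthetic data and arbitrary model choices, not a statement about what works best for the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
month = np.repeat(np.arange(1, 13), 30)  # hypothetical: 30 rows per month of 2017
X = rng.normal(size=(len(month), 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=len(month)) > 0).astype(int)

base_fit = month <= 10    # Jan-Oct: fit the stage-1 models
meta_fit = month == 11    # Nov: stage-1 predictions on these rows are out-of-sample
final_test = month == 12  # Dec: untouched until the very end

base_models = [LogisticRegression(), DecisionTreeClassifier(max_depth=3, random_state=0)]
for model in base_models:
    model.fit(X[base_fit], y[base_fit])

def stage1_features(rows):
    """Stack each base model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X[rows])[:, 1] for m in base_models])

meta = LogisticRegression().fit(stage1_features(meta_fit), y[meta_fit])
dec_accuracy = meta.score(stage1_features(final_test), y[final_test])
```

Feeding the Jan-Oct predictions to the second stage instead would show it base-model outputs that are more confident than anything it will see on new data, which is the contamination the question worries about.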

void anvil
fierce saffron
void anvil
#

What do you mean numbers

#

like volume numbers on the x / y axis?

fierce saffron
#

yeah.

#

like the number the bar on the histogram represents.

void anvil
#

don't know, sorry

#

Anyone know how I should preprocess data? Currently using

scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

However, trying to do a Multinomial NaiveBayes is throwing me for a massive loop because it's rather upset there's negative values
ValueError: Input X must be non-negative
feral lodge
#

@void anvil Drawing from a k-dimensional multinomial is like casting a k-sided die n times and counting how many times each side came up, right? So here are a few draws from a Multinomial(k=3, n=5):

[1, 1, 3]
[3, 0, 2]
[4, 1, 0]
[3, 2, 0]
[0, 3, 2]
[3, 1, 1]
[1, 2, 2]
[2, 2, 1]
[3, 2, 0]
[3, 2, 0]

In this case the die is not fair; the probabilities for the three sides are [0.5, 0.3, 0.2]. So, if you have a bunch of data and you want to see which multinomial fits it best, your data should look like that -- a bunch of counts from each category. If you're using sklearn, you can also use fractional "counts". Then the above draws are equivalent to this:

[0.2, 0.2, 0.6]
[0.6, 0.0, 0.4]
[0.8, 0.2, 0.0]
[0.6, 0.4, 0.0]
[0.0, 0.6, 0.4]
[0.6, 0.2, 0.2]
[0.2, 0.4, 0.4]
[0.4, 0.4, 0.2]
[0.6, 0.4, 0.0]
[0.6, 0.4, 0.0]

Regardless of whether your data is integer counts or fractional counts, you can see that negative numbers don't make sense for fitting a multinomial -- no multinomial will ever produce negative counts, so therefore there is no good fit.
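Draws like the ones listed above can be reproduced with NumPy. A minimal sketch using the same k=3, n=5 and side probabilities; the seed is arbitrary, so the actual rows will differ from the lists in the message.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 draws: each row counts how often each of the 3 sides came up in 5 throws.
counts = rng.multinomial(5, [0.5, 0.3, 0.2], size=10)

# Fractional "counts": divide every row by its total so it sums to 1.
fractions = counts / counts.sum(axis=1, keepdims=True)
```

Both forms are non-negative by construction, which is why standardised features with negative values get rejected.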

feral lodge
#

@void anvil By the way, in an old post (https://discordapp.com/channels/267624335836053506/366673247892275221/465938441520152576) I explained and plotted what the standard scaler does to your data, in case you're not sure! Since the standard scaler transforms your data to have zero mean, you'll always get some negative data points. I've personally never been in a situation where I needed to normalize multinomial data. It's usually either in fractional count (ie category frequencies) or integer count form already, so I have always been able to just fit it

void anvil
#

@feral lodge thanks for that

Unfortunately this is real world data and it very much needs to be scaled for piping into MLP. I'm setting up data pipelines inside my ensemble class so it can make the appropriate decision on a model by model basis (e.g. feed data in to ensemble class; pipeline scales for MLP, doesn't scale for RF, etc.)

feral lodge
#

@void anvil Oh, okay! Could you maybe show some sample data? I've only ever fit multinomials as part of school work and I'm having a hard time understanding why vectors of category counts would need any kind of normalization (other than dividing by the sum to produce category frequencies)

feral lodge
#

Or I guess since it's an ensemble you don't really know the nature of the data? The way I see it (though i'm only vaguely familiar with multinomials) we can immediately exclude the multinomial distribution as a model of x if x is not either (1): A vector of positive integers, or (2) A vector of fractions that sum to 1

#

In the same way we can immediately, for example, exclude the Beta distribution if the input is not a real number between 0 and 1; the Gamma distribution if the input is negative; or the Bernoulli distribution if the input is not exactly 1 or 0

void anvil
#

so an example of the data would be like:

10999    -1      37    1/1/18 00:05
10432    -1       5    1/1/18 02:13
11993     1       3    1/1/18 02:15
12345     1      27    1/1/18 17:29
13500    -1     -13    1/1/18 23:23
13400     1    -150    1/1/18 23:45
#

We know the nature of the data before it gets fed in to the initial models, then the models are fed to an ensemble

#

so it looks something like:

Data
stage 1 models
stage 2 ensemble
#

And each stage 1 model needs a different data pipeline

#

MNB requires non-negative data, MLP requires standardized data

#

RF doesn't require any data preprocessing

#

so when you're piping in to the first stage models there's varying preprocessing needed (none, some)

feral lodge
#

@void anvil

void anvil
#

it fucks up way bad when you do 0-1 and you can get features outside

feral lodge
#

Ah, true! I'm out of ideas, maybe someone else can pick up from here 😄 To me it just doesn't look like multinomial data, so if it were me I'd probably just not include a multinomial in the ensemble

void anvil
#

even if the data isn't multinomial you can still get good results

#

using MNB and using it as part of an ensemble

#

just need it to pick up something the other models don't

#

it's about model diversity rather than single model perfection

small ore
#

Out of curiosity, how are you modelling the date and time?

stray elk
#

i dont mean to interrupt a conversation if one is already happening, but I have a question regarding hyperparameter tuning that i cant seem to find an answer to

#

I would like to know if it is possible to dynamically tune hyperparameters that aren't specified in the code

#

I have some code that will iterate over a bunch of different regression models, and then will select the 3 most accurate (based on mae) and create an ensemble model out of them

#

I was wondering if it was possible to be able to tune the parameters on this ensemble model, given I won't know beforehand what ml models will actually be a part of it

#

it could even do the parameter tuning within the ModelTransformer class that I have created for the ensemble model, however if that was the case it would still need to somehow do it dynamically due to not knowing what model could be passed to it

void anvil
#

you would tune it the same way you would a regular model tuning

#

you should tune all the 1st layer algorithms

#

if you're going to do that

#

and once you're done tuning hyperparameters on the first layer your inputs to layer 2 don't change

#

so tune all your stuff there

#

etc.

#

just make sure you don't train too many

small shore
#

Hello. So I need to run tensorflow gpu on my computer, but I have cuda 10 installed and no cuda 9 support because of rtx not being able to work with it apparently, so how do I get its support/how do I compile tensorflow from the source/able to run cuda 10.

#

idk if this is appropriate to ask here, so excuse me if it isnt

strange radish
#

Hey everyone. I'm the maintainer of PyData/Sparse. Join me in talking about this FOSS project in a webinar hosted by Quansight. It starts in 43ish minutes, You can register at https://app.livestorm.co/quansight/

#

@small ore Sorry if the ping is unwelcome, but as I understand it you wanted to join. 😄

small ore
#

I am already trying to connect from the past 5 mins. No luck

strange radish
#

😦

#

Just like, click Access Webinar in the registration email.

#

@small ore

#

It's live now.

stray elk
#

@void anvil hi, I had a look into tuning the 1st layer algorithms, but I wasn't quite sure how to do that as the first layer parameters are simply the classes for the transformers I have created

#

this is part of the output for when i run get_params() on the ensemble model

#

Im not quite sure how i would go about tuning the model as a whole, given 1) i wouldn't know what the parameters are in the first place due to dynamic creation, and 2) I don't know how to access deeper parameter levels given the actual ml models are wrapped in a transformer class as part of a feature union

void anvil
#

basically before you pump into the ensemble

#

you want to tune each one individually

#

so the pred_union_modelA

#

just gridsearch all the hyperparameters

#

or w/e

#

then when you run the ensemble, it should be the 'best pred_union_modelA'

#

even if it isn't chosen

stray elk
#

yea, that is what i was wondering i would need to do

void anvil
#

it's not really efficient computationally

#

but if you have unlimited time / resources that's what I'd do

#

not sure how long your training and stuff takes

#

and how much performance increase you get out of it

stray elk
#

yea, this isn't time limited

void anvil
#

but yeah basically just lazily gridsearch until you perfect each submodel

#

then train on the optimal models

#

then do the same for each layer

#

until you have your final output

stray elk
#

so in that case, I would need to pass the params as an argument in the class instantiation

void anvil
#

yeah

stray elk
#

cool, figured i would probably need to do it that way

void anvil
#

I mean you can do

#
    def stage1:
        for models in dic
            grid_search best  hyper params
        return predictions

    def stage 2 (x,y)
        train stage 1
        return ensemble predictions
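A runnable version of that outline could look like the sketch below. The candidate models, their grids and the synthetic data are placeholders; keeping each model next to its own grid in one dict is what lets the tuning stay generic when the chosen models are not known in advance.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Hypothetical candidates: every model carries its own search grid.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "enet": (ElasticNet(max_iter=10000), {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.8]}),
}

def stage1(X_fit, y_fit):
    """Grid-search each candidate separately and keep its best refitted estimator."""
    tuned = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, scoring="neg_mean_absolute_error", cv=3)
        tuned[name] = search.fit(X_fit, y_fit).best_estimator_
    return tuned

def stage2(tuned, X_meta, y_meta):
    """Fit the ensemble on the tuned models' predictions for rows they were not fitted on."""
    preds = np.column_stack([m.predict(X_meta) for m in tuned.values()])
    return LinearRegression().fit(preds, y_meta)

tuned = stage1(X_a, y_a)
ensemble = stage2(tuned, X_b, y_b)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` keeps the same structure if the full grid gets too slow.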
stray elk
#

thanks for helping out, would the predictions being returned in stage 1 be predictions for the best parameters?

void anvil
#

yeah

#

you would grid search hyperparam for each model

#

then only return predictions from the best

#

it's going to be horribly slow likely

stray elk
#

ok cool i see, thanks

#

idm about speed, its a personal project

void anvil
#

because you'll train a fuck ton of models

#

depending on how in depth you do your gridsearch

stray elk
#

i can always do random search

void anvil
#

if you do a 10x10x10 you're training 1k models for each model then selecting the best

#

sure

#

just keep in mind you're just adding a ton of bloat

#

and it may or may not affect the results

#

might run a trial of

#

random guess 25 hyperparams for each

#

and 10 and 50

#

and compare the ensemble returns vs the normal ones

#

and see if you really get any improvements

stray elk
#

thanks

#

also, I don't mean to bother you more, but while having this convo i have just noticed that nothing i do changes the mae of the ensemble model

#

meaning that something is up with the file, presumably

void anvil
#

make sure your train / test splits

#

aren't cheating

stray elk
#

my train/test split is 450/200

#

i just ran a test where all 4 models were doing simple ols regression, and the ensemble mae is still 3.15

#

with ridge, ard and elastic net being the 3 models built into the ensemble

#

as you can see, the mae of the models are incredibly similar, yet the models being used are quite different

#

do you have any idea why this may be?

void anvil
#

because the ensemble is able to pick up mostly the same from each set of outputs?

#

look at how correlated each model is to each other

#

if it's an easy data set and you get great results for each

#

you're probably only going to see marginal improvements

#

so if linear regression is .90 auc and ridgecv is .9001

#

it's only marginally better

#

and you'd only expect a marginally better result

#

by putting in a .9000 vs a .9001 model to the ensemble

#

unless the .9001 is highly uncorrelated to the other models plugged in whereas the .9000 is highly correlated

#

If you try with a much harder dataset, you can probably find substantial improvements

#

as you swap in / out

#

but if every model is similar

#

swapping isn't going to do fuck all

stray elk
#

thats the thing, the models are all quite different

void anvil
#

you need to look at

stray elk
#

also, i think its notable that the ensemble model is improving accuracy by just under 1 whole % point

void anvil
#

colinearity between model predictions

#

because each set of 4 could be not colinear and be enough for the ensemble to get the same info

#

as the other set of 4
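Checking how much the models overlap takes one line once their predictions sit side by side. The sketch uses synthetic predictions from three hypothetical models that share most of their error.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=200)

# Three hypothetical models that make almost the same mistakes.
shared_error = rng.normal(scale=0.5, size=200)
preds = np.column_stack(
    [truth + shared_error + rng.normal(scale=0.05, size=200) for _ in range(3)]
)

corr = np.corrcoef(preds, rowvar=False)                             # between predictions
residual_corr = np.corrcoef(preds - truth[:, None], rowvar=False)   # between errors
```

The correlation of the errors is usually the more telling number: if it is close to 1, the ensemble has nothing to average away, whatever the individual MAEs are.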

stray elk
#

ah ok i see what you mean, although wouldn't the different maes indicate that the collinearity isn't perfect?

void anvil
#

not really

stray elk
#

whereas the ensemble mae would suggest the colinearity was perfect

void anvil
#

well

#

it's sort of saying that 'roughly the same amount of decision making material is present in both sets'

#

if you dump all 8 of them together and the mae doesn't change

#

then you know that both sets of 4 are passing, roughly, the same information

stray elk
#

ok, i think i get what you are saying

#

so in this case, accuracy can still be gained by tuning the models separately and then putting them into the ensemble?

stray elk
#

ok its taking a while to run and hasn't thrown an error yet, so im pretty sure i have got the tuning working

void anvil
#

yeah basically

#

You know about pipelines?

#

'mlp': Pipeline([('transform', scaler.transform), ('clf', MLPClassifier())]),

is throwing
TypeError: All intermediate steps should be transformers and implement fit and transform. '<bound method StandardScaler.transform of StandardScaler(copy=True, with_mean=True, with_std=True)>' (type <class 'method'>) doesn't

stray elk
#

yea, i have a pipeline as part of the ensemble builder function

void anvil
#

no I mean

#

I'm trying to put one in

#

and it's messing up

#

I'm not sure where to drop the scaler.fit

#

I think I put it here?

```py
def train(self, X, y):
    for name, clf in self.learners.items():
        clf.fit(X, y)
```

changes to

```py
def train(self, X, y):
    for name, clf in self.learners.items():
        scaler.fit(X, y)
        clf.fit(X, y)
```
?
#

nah still breaking

stray elk
#

i have a pipeline, but i have only ever created this one so i don't have much experience

#

in fact looking at that i would say i don't even need to use the ModelTransformer at the lin_reg_start, i could just use LinearRegression() on its own

#

well, turns out one of my tuned models is in fact less accurate than when left untuned:

#

@void anvil by removing the ModelTransformer at the lin_reg_start, i got that exact error, so im guessing the first and last steps of a pipeline have to have a fit and transform method?

#

linear regression on its own has no transform method

void anvil
#

yeah

#

makes sense

#

thanks
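
For reference, the TypeError quoted above comes from passing the bound method `scaler.transform` as a step instead of a transformer object. scikit-learn only requires the intermediate steps of a `Pipeline` to implement `fit` and `transform`; the final step just needs `fit`, so a plain estimator is fine in last position. A minimal sketch, assuming a `StandardScaler` in front of the `MLPClassifier` and using made-up data purely to show that it runs:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pass the transformer object itself, not its bound .transform method.
mlp_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", MLPClassifier(max_iter=500, random_state=0)),
])

# Tiny made-up dataset, only to show that fit/predict run end to end.
rng = np.random.default_rng(0)
X = rng.normal(loc=20000.0, scale=5000.0, size=(60, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

mlp_pipeline.fit(X, y)            # fits the scaler, then the classifier
labels = mlp_pipeline.predict(X)  # scaling is re-applied automatically
```

With this layout there is no separate `scaler.fit` call to place: calling `fit` on the pipeline fits every step in order.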

stray elk
#

@void anvil Hey, sorry to bother you again, but i don't think the model always returning a mae of 3.15 is because of high colinearity between different models

void anvil
#

no I mean they're probably not colinear

#

the combinations probably all give the same info to the second model

#

the sum of information from model group a is ~= model group b

stray elk
#

when i comment out the preprocessing stages, RobustScaler and PolynomialFeatures, the resulting mae of the ensemble model is almost the exact same as just regular LinearRegression

#

suggesting to me that the feature union isn't quite functioning right

void anvil
#

sure

#

or it could be that 3.15 is all you can learn

stray elk
void anvil
#

most of your

#

models

#

don't need feature transform

stray elk
void anvil
#

and will behave the same regardless of transform

stray elk
#

the mae of the ensemble model which uses ridgecv, ardregression and elasticnetcv is almost the exact same as linear regression on its own

void anvil
#

MLP and MNB are the two ones that need

#

transforms

#

And again, there might not be much to learn

stray elk
#

wdym by feature transform?

void anvil
#

15000, 20000, 25000 => -.66, 0, .66

#

or => 0, 0.5, 1

#

that's what the scaling does

stray elk
#

ok, the scaling

void anvil
#

if you feed any of those 3 into a linear regression

#

you'll get the same result

#

same with RF

#

etc.

#

or about the same I guess
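
A quick way to convince yourself of this for plain least squares is to fit the same target on raw, standardised and min-max scaled inputs and compare the fitted values. This is only a sketch: the target values are made up, and `ols_predictions` is a hypothetical helper built on NumPy rather than the scikit-learn models used in the ensemble.

```python
import numpy as np

def ols_predictions(x, y):
    """Fit ordinary least squares with an intercept and return the fitted values."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

x_raw = np.array([15000.0, 20000.0, 25000.0, 30000.0, 35000.0])
y = np.array([3.1, 4.2, 4.8, 6.1, 6.9])  # made-up target values

x_std = (x_raw - x_raw.mean()) / x_raw.std()                    # standardised
x_minmax = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())  # scaled to [0, 1]

pred_raw = ols_predictions(x_raw, y)
pred_std = ols_predictions(x_std, y)
pred_minmax = ols_predictions(x_minmax, y)
print(np.allclose(pred_raw, pred_std), np.allclose(pred_raw, pred_minmax))
```

The coefficients change with the scaling but the predictions do not. Penalised models such as ridge or elastic net, and neural networks, do not share this property, which is why they are the ones that usually want scaled inputs.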

#

your data could exhibit linear relationships

stray elk
#

its feeding 450 different rows of data for the training data

void anvil
#

which is why linear is nearly as good as the ensemble

stray elk
#

surely one would expect variation between LinearRegression and an ensemble of ARD, RidgeCV and ElasticNetCV?

void anvil
#

run a linear regression, look at test statistics, etc.

stray elk
#

especially given when done individually, those three return different maes than lr on its own

void anvil
#

it might be linear

#

and it might be you can't really learn better than 3.15

stray elk
#

is that a thing in ml? a 'maximum' that can be learned by the data at hand?

void anvil
#

yeah

#

unless you want to way overfit

#

everything is modeled by y = ax1 + bx2... + randomness

#

more or less

#

you may be at the part that's just randomness

stray elk
#

Tbh that makes sense, it’s just that a Mae of 3.15 seems high for the problem I’m dealing ei

#

With*

#

In fairness tho, I think that this is because of the data I’ve chosen to do this with, rather than flaws in the ml implementation

void anvil
#

if you want to make some fake data

#

or run the same thing on another training set

#

you can verify it's the data vs what you did

stray elk
#

Yea, I think that is how I’m going to do it in the evaluation

#

I will run the same thing on the entire set, hopefully get a more accurate result, and then use that to say it’s not the ml model but the data

stable egret
#

I am looking at diving deeper and learning about data science, I am coming from a strictly javascript background for the past 10 years tho. Can anyone recommend good courses, resources or even projects I can see the source code of? I am more of a practical learner than a theoretical

sinful forge
#

@lean ledge I guess this is the right place

#

Can I get some links? :)

proud jolt
lean ledge
sinful forge
#

Awesome thanks mate

lean ledge
#

Mind you, I personally really don't like Andrew Ng's course because of how mathematically shallow it is, given machine learning is basically a field of maths, but other people here get annoyed at me if I say I don't recommend it.

sinful forge
#

I'll keep your opinion in mind

#

@proud jolt you got them udemy links? I wouldn't mind accessing them too

sinful forge
#

Thanks bro

void anvil
#

anyone know what's up

lone mist
#

Probably doesn't like the ~

void anvil
#

I've done that everywhere else

#

should I put the whole C:/filepath

#

?

lone mist
#

I thought that the tilde wouldn't work at all, but maybe I am misremembering

#

worth a shot to use a full path though

topaz nacelle
#

i don't think open expands ~ by default

void anvil
#

yeah

#

it dumps if I put the whole filepath

#

pandas takes the ~/shortcut

topaz nacelle
#

you'd have to use os.path.expanduser or pathlib
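
A minimal sketch of both options; the file name is hypothetical and `expanded_path` is just an illustrative wrapper:

```python
import os
from pathlib import Path

def expanded_path(path):
    """Expand a leading ~ to the current user's home directory."""
    return os.path.expanduser(path)

# The built-in open() treats "~" as a literal directory name, so expand it first.
print(expanded_path("~/data/output.pkl"))
print(Path("~/data/output.pkl").expanduser())  # pathlib equivalent
```

Either result can then be passed to `open()` as usual.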

void anvil
#

yeah cancer

#

that's fine

#

thanks

topaz nacelle
#

"cancer"

void anvil
#

When the package makers don't implement the lazy solution everyone else has

#

heaven forbid they don't copy/paste some code or put the warning that they don't

lone mist
#

It's likely somewhere in the docs

#

But perhaps it could be clearer. I wont comment on that since I don't remember where it's documented

twilit current
#

Hey there, I need some help with a pandas module question 😃

#

Can anyone here take a look at my problem? I've been working on it for a while, but not getting anywhere

forest willow
#

I'm thinking of taking part in a Code Camp and need to develop an app on Pattern Analysis and Data Visualization, it would be a great help if you guys provided me with some suggestions on it. Thanks!

reef bone
#

Quick question about numpy: say I have an array that goes like a = [1, 2, 3] and i want to move all elements left by one step.

numpy.roll(a, -1, axis=0) will produce [2, 3, 1] which is good but I would prefer it not to wrap around, so I'd like to produce [2, 3, 0] instead. Is there a nice way to do this?

#

For now I pad the values with 0s manually but that's not nice 😬

sinful forge
#

@forest willow if you find out anything please let me know or tag me because I'm really interested in learning anything data science!

midnight oracle
#
```py
from subprocess import check_output

display = check_output(["ls"])
print(display)
```

im getting an error on this code... please help

error is:

```py
Traceback (most recent call last):
  File "C:\Users\Omer Kural\Desktop\๏ฟฝmer\Python Projects\Projects\data_projectx.py", line 3, in <module>
    display = check_output(["ls"])
  File "C:\Users\Omer Kural\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "C:\Users\Omer Kural\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\Omer Kural\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Users\Omer Kural\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified```
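
The traceback most likely means Windows could not find a program called `ls`: it is a Unix command, and a stock Windows install has no `ls` executable for `subprocess` to launch. If the goal is just to list the files in a directory, a portable sketch that avoids the subprocess entirely (the helper name `list_directory` is made up here) would be:

```python
import os

def list_directory(path="."):
    """Return the sorted names of the entries in a directory."""
    return sorted(os.listdir(path))

print(list_directory())
```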
kind sluice
#

hey guys, can anyone recommend a good geospatial analysis tools (preferably for jupyter), what I'd like to see is basically being able to show heatmaps and scatterplots on a map given data (pandas dataframe of real estate stuff) and being able to interactively select regions/shapes so that I could filter those within the selected region and calc selective stats. Folium seems to be the most promising option but maybe there's something with simpler plotting functionality? I also tried geonotebook but it was broken for me.

chilly shuttle
#

@kind sluice the closest thing is ipyleaflet

dim slate
#

@reef bone roll it, then change the last member to 0?

reef bone
#

Yeah I ended up writing my own function that takes the axis and index difference as arguments, in my use case the array is multi dimensional and will sometimes have to roll by more than one, so I was looking for something more robust. Now I can one line it. I was surprised that np.roll will always wrap around as filling the "new" indices with 0s seems like something that would be useful more often than wrapping around. Thanks for the answer. @dim slate
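
reef bone's own function was not posted, so the following is only a sketch of what such a helper could look like, assuming the goal is "like `np.roll`, but fill the vacated positions instead of wrapping". The name `shift` and its signature are made up.

```python
import numpy as np

def shift(arr, steps, axis=0, fill=0):
    """Shift arr along axis by steps, filling vacated slots instead of wrapping."""
    arr = np.asarray(arr)
    out = np.full_like(arr, fill)
    n = arr.shape[axis]
    if steps == 0:
        return arr.copy()
    if abs(steps) >= n:
        return out  # everything was shifted out of range
    src = [slice(None)] * arr.ndim
    dst = [slice(None)] * arr.ndim
    if steps > 0:
        src[axis] = slice(None, n - steps)
        dst[axis] = slice(steps, None)
    else:
        src[axis] = slice(-steps, None)
        dst[axis] = slice(None, n + steps)
    out[tuple(dst)] = arr[tuple(src)]
    return out

print(shift([1, 2, 3], -1))                   # left by one, zero on the right
print(shift([[1, 2], [3, 4]], 1, axis=1))     # right by one along each row
```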

dim slate
#

Share? @reef bone

stray elk
#

@void anvil sorry I cant help with your issue, I just wanted to say that i've been using the wrong metric to measure my model accuracy

#

mae was not good, im now using r2

#

i was getting low mae values, but the predictions were still wildly off the actual results, now that im using r2, all my tuning is working fine, and i've managed to predict the EU referendum within 0.2%

gaunt axle
#

Hi!
I've been trying to get a good answer to "why huffman codes is a greedy algorithm"
and I've found:
"The reason that this is a greedy algorithm is that at each stage we
perform a merge without regard to global considerations. We merely
select the two smallest trees." in "Data Structures and Algorithm
Analysis in C" by Mark Allen Weiss

But still, why would "merge the two smallest trees" be the optimal local solution?

feral lodge
#

@gaunt axle Because in the Huffman algorithm, we want to create a binary tree where high-frequency symbols appear high up (near the root) -- higher up than low-frequency symbols.

When we merge two trees, we create a parent node to the trees, pushing them down one level. If we're feeling greedy, what trees look like good choices to push down one level? The trees whose nodes are associated with the lowest-frequency symbols! If we push those trees down, that means that the remaining trees, associated with more frequent symbols, will remain further up in the final tree, which is what we wanted.

That's why it's a greedy algorithm; we take a quick look at the available trees and push down (merge) the trees whose leafs are the lowest-frequency symbols. Your quote calls these trees "smallest", which is not a good choice of words IMO. They can be big and deep, it's just that the sum of their leafs' frequencies is low.
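
The greedy step described above can be sketched with a priority queue. This version only tracks how deep each symbol ends up (its code length) rather than building the actual bit strings; the function name and the example frequencies are illustrative.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length} by repeatedly merging the two lightest trees."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs one bit.
        return {sym: 1 for sym in heap[0][2]}
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)  # lightest tree
        w2, _, d2 = heapq.heappop(heap)  # second lightest tree
        # Merging adds a parent node, so every leaf in both trees moves down a level.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

lengths = huffman_code_lengths({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(lengths)
```

The most frequent symbol is merged last and so stays closest to the root, which is exactly the "push the low-frequency trees down" behaviour from the explanation.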

gaunt axle
#

I understand better now, thanks :)

feral lodge
heavy apex
#

What should I expect from a Data Visualization class next semester? My data science class after finishing data analysis with spread sheets, so not sure how in depth it can really get.

desert oar
#

R, Python, and/or D3.js code

#

Likely one of the first two, and possibly the latter

#

Basic statistics

#

Undergrad level? Masters?

terse pewter
#

Good to probably familiarize yourself with matplotlib and numpy if you haven't used them already

candid pilot
#

Hey everyone! So I recently decided to start my journey on machine learning.
I searched quite a bit, and built my first neural network! It was super fun!
But now I have a kind of "problem". I don't know how to actually use it...
I mean I have tested with random numbers that really don't mean anything. What if I want to test it with something like written numbers recognition, or simple voice commands recognition, how do I "convert" those things to numbers?
Maybe this is a silly question but I'm really new to machine learning...

#

I mean, they need to be converted to numbers right? So they can work with the weights and the activation function

spark summit
#

definitely started with python with numpy, scipy, and matplotlib as a substitute for matlab/octave

#

it's like octave with OOP

light cloud
#

Is anyone familiar with pyLDAvis? Having a deprecation issue

hardy crag
#

@candid pilot depends on what you are doing. If you are doing image based tasks, the images you are using are already "numbers" in the way that for each pixel there is a value between 0 and 255 (in grayscale) and 3 values for a RGB image.

#

If you are doing natural language processing, e.g. translation between languages, you need to encode the "text" you are using into numbers.

#

For more common tasks there are already lots of ideas how to accomplish that, however it really depends on your data and your task

candid pilot
#

Oh ok thanks! So for example, if I want to use voice recognition, is there anything for that already created?

#

Like "Yes" or "No", is there anything I can use to "convert" that to numbers?

hardy crag
#

by voice recognition do you mean finding words in sound files

#

?

candid pilot
#

Yeah

hardy crag
#

so afaik sound is already represented as a series of numbers

#

it's a data exploration of an audio dataset

candid pilot
#

Thanks!

carmine lava
#

Darknet any one try

spark summit
#

we don't discuss that here

sinful forge
#

Lol

#

What about Skynet? :)

prime elm
#

Proper topic i’d say

#

As long as skynet is a bunch of LSTMs huehue

placid snow
#

#databases y'all seem to have a misplaced question there

scarlet salmon
#

Hey guys, so I've got some simple stuff that I need to do for an assignment

#

I'm basically building a simple k means clustering program

#
    def kmeans(m,c):
        cluster1 = []
        cluster2 = []
        cluster3 = []
        for x in c:
            if x-m[0]>x-m[1] and x-m[2]:
                cluster1.append(x)
            if x-m[1]>x-m[0] and x-m[1]:
                cluster2.append(x)
            if x-m[2]>x-m[1] and x-m[2]:
                cluster3.append(x)

This is what I have so far which I'd expect to go through one iteration

#

To do: write something to store new means, go through multiple iterations, and terminate the function

#

Before I go any further I'd just like some confirmation that this should work as intended

small ore
#

Did you mean to do if x-m[0]>x-m[1] and x-m[0]>x-m[2]: or if x-m[0]>x-m[1] > x-m[2]: instead? For each of those if conditions

#

@scarlet salmon

scarlet salmon
#

Oh, sorry

#

I've solved this now

#

Here it is finished

#
import numpy as np
from itertools import chain
with open(r"C:\Users\Evan\Desktop\Patch 4 year 2\data.txt") as data:

    means = data.readline()
    means = [int(n) for n in means.split(", ")] 
    
    numbers1 = data.readlines()[0:] #reads all lines except first
    numbers = [elem.strip().split(';') for elem in numbers1]
    
    clusters = list(chain(*numbers)) #lists numbers
    clusters = map(int, clusters) #changes list to integers
    
    def kmeans(m,c):
        for counter in range (10):
            cluster1 = []
            cluster2 = []
            cluster3 = []
            for x in c:
                c1 = abs(x - m[0])
                c2 = abs(x - m[1])
                c3 = abs(x - m[2])
                if c1 < c2 and c1 < c3:
                    cluster1.append(x)
                elif c2 < c1 and c2 < c3:
                    cluster2.append(x)
                elif c3 < c1 and c3 < c2:
                    cluster3.append(x)
            print "mean = " + str(m[0])+ ' ' + str(cluster1)
            print "mean = " + str(m[1])+ ' ' + str(cluster2)
            print "mean = " + str(m[2]) + ' ' + str (cluster3)
            print "end of iteration"
            print '\n'
            if np.mean(cluster1) == m[0] and np.mean(cluster2) == m[1] and np.mean(cluster3) == m[2]:
                return
            m = []
            m.append(np.mean(cluster1))
            m.append(np.mean(cluster2))
            m.append(np.mean(cluster3))
        return
    
    kmeans(means, clusters)
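
For comparison, here is a sketch of the same one-dimensional k-means loop written for Python 3 with NumPy. It is not the assignment's required solution: the numbers are made up instead of read from the data file, and `kmeans_1d` is a hypothetical name. It also handles two cases the version above skips, namely a value that is exactly equidistant from two means and a cluster that ends up empty.

```python
import numpy as np

def kmeans_1d(values, means, max_iter=10):
    """Cluster 1-D values around the given starting means (Lloyd's algorithm)."""
    values = np.asarray(values, dtype=float)
    means = np.asarray(means, dtype=float)
    for _ in range(max_iter):
        # Assign each value to its nearest mean; argmin resolves ties to the lower index.
        labels = np.abs(values[:, None] - means[None, :]).argmin(axis=1)
        new_means = means.copy()
        for k in range(len(means)):
            members = values[labels == k]
            if members.size:  # keep the old mean if a cluster is empty
                new_means[k] = members.mean()
        if np.allclose(new_means, means):
            break  # means stopped moving, so the assignment is stable
        means = new_means
    return means, labels

# Made-up numbers, not the assignment's data file.
means, labels = kmeans_1d([1, 2, 3, 10, 11, 12, 20, 21, 22], [0, 9, 30])
print(means, labels)
```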
placid snow
#

You can technically dedent almost everything in that code

#

except means = data.readline()

desert oar
#

what?

#

i still wouldnt do data = open() even in a script

#

just a bad habit

#

and asking for mistakes

#

especially since so much data science is done in notebooks

#

where you can have a process run for days depending on your setup

carmine plinth
#

can you recommend framework for chatbot?

candid pilot
#

When training a neural network, is it ok to train it in different loops and each loop as a target? For example: I'm doing a voice recognition, is it ok to train for each word in a different loop or should I mix everything?

hardy crag
#

@candid pilot You should mix all the words. If you sort them the network will "forget" earlier words.

candid pilot
#

Ok thanks!

scarlet salmon
#

Can anyone tell me why this plot doesn't show a line that fits the data well?

#
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq


time = np.arange(0.0,35.1,5.0)
pop = np.array([12,48,84,113,195,225,193,188])


def logistic(pars,t):
    r,K,y0 = pars
    return K*y0*np.exp(r*t)/(K+y0*(np.exp(r*t)-1))

def logistic_resid(pars,t,data):
    return logistic(pars,t)-data

# code to run the Levenberg-Marquardt algorithm

#initial values - can you adjust these to better guesses?
p0 = np.array([1.0,1.0,1.0])

lsq_out = leastsq(logistic_resid, p0, args=(time,pop))
# code to plot the data and fitted values goes here
plt.plot(time, pop, 'h', logistic([0.224119,207.214,13.2892],time))
#

The plot I get back is:

#

I feel like I'm making a very basic mistake

dire atlas
#
plt.plot(time, pop, 'h')
plt.plot(logistic([0.224119,207.214,13.2892],np.arange(0,35,1)))
#
logistic([0.224119,207.214,13.2892],time)

this only generates 8 values, and since you don't specify its corresponding x values it just uses x=0,1,2...7

#

you could also just specify the x values like this

plt.plot(time, logistic([0.224119,207.214,13.2892],time))
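
Putting the pieces together, a sketch that plots with the parameters `leastsq` actually returns instead of pasting them in by hand. The starting guesses in `p0` are an assumption read off the data (growth rate, plateau, first value), and `plot_fit` is a hypothetical helper that is defined but not called here.

```python
import numpy as np
from scipy.optimize import leastsq

time = np.arange(0.0, 35.1, 5.0)
pop = np.array([12, 48, 84, 113, 195, 225, 193, 188], dtype=float)

def logistic(pars, t):
    r, K, y0 = pars
    return K * y0 * np.exp(r * t) / (K + y0 * (np.exp(r * t) - 1))

def logistic_resid(pars, t, data):
    return logistic(pars, t) - data

# Starting guesses read off the data (an assumption, not from the problem sheet).
p0 = np.array([0.2, 200.0, 10.0])
fitted, ier = leastsq(logistic_resid, p0, args=(time, pop))

def plot_fit():
    """Plot the data and the fitted curve, passing x values to both calls."""
    import matplotlib.pyplot as plt
    t_dense = np.linspace(time.min(), time.max(), 200)
    plt.plot(time, pop, "h", label="data")
    plt.plot(t_dense, logistic(fitted, t_dense), label="fit")
    plt.legend()
    plt.show()

print(fitted)
```

Calling `plot_fit()` draws the curve on a dense time grid, so it looks smooth rather than being joined through only eight points.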
scarlet salmon
#

Ah, thank you

#

I wonder if you'd be able to help with the project I have going on that's led on from this

#

It's detailed here (bit of a long post)

sweet ember
#

Solved it. Stupid me forgot to put the square brackets

scarlet salmon
#

So I currently have this

#
from scipy.integrate import odeint
import numpy as np
from scipy.optimize import leastsq

time = np.array([0,168,336,504,672,840,1008,1176,1344,1512,1680,1848,2016,2184,2352,2520,2688,2856])
pop = np.array([2,27,43,36,39,32,27,22,13,10,14,14,4,4,9,3,3,1])
# defines the model equations
def amr_ode(x,t,params):
    B, G, Da = params
    S, R, A = x
    derivs = [0.5*(1-R+S/10**7)*0.6577*S-0.025*S-(B*S*R/R+S), 0.5*(1-G)*(1-R+S/10**7)*0.0000156*R-0.025*R+(B*S*R/R+S), -Da*5.6]
    return derivs

def amr_run(pars,t):
    B,G,Da,S0,R0,A0 = pars
    ode_params=[B,G,Da]
    ode_starts=[S0,R0,A0]
    out = odeint(amr_ode, ode_starts, t, args=(ode_params,))
    return out

def amr_resid(pars,t,data):
    B,G,Da,S0,R0,A0 = pars
    ode_params=[B,G,Da]
    ode_starts=[S0,R0,A0]
    out = odeint(amr_ode, ode_starts, t, args=(ode_params,))
    return amr_run(pars,t)-data

p0 = [0.00001, 0.5, 400, 1, 1, 1]
lsq_out = leastsq(amr_resid, p0, args=(time,pop))
lsq_out
#

For the problem indicated in the above reddit post

#

But I get this error: ValueError: operands could not be broadcast together with shapes (18,3) (18,)

scarlet salmon
#

Okay I've fixed the above issue

#

I now get a graph (yay) but the estimated line of best fit falls completely flat ( 😦 )

#
from scipy.integrate import odeint
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

time = np.array([0,168,336,504,672,840,1008,1176,1344,1512,1680,1848,2016,2184,2352,2520,2688,2856])
pop2 = np.array([2,27,43,36,39,32,27,22,13,10,14,14,4,4,9,3,3,1])

# defines the model equations
def amr_ode(x,t,params):
    B, G, Da = params
    S, R, A = x
    derivs = [0.5*(1-R+S/10**7)*0.6577*S-0.025*S-(B*S*R/R+S), 0.5*(1-G)*(1-R+S/10**7)*0.0000156*R-0.025*R+(B*S*R/R+S), -Da*5.6]
    return derivs

# runs a simulation and returns the population size
def amr_run(pars,t):
    B,G,Da,S0,R0,A0 = pars
    ode_params=[B,G,Da]
    ode_starts=[S0,R0,A0]
    out = odeint(amr_ode, ode_starts, t, args=(ode_params,))
    return out[:,0] # we only return the population size - we don't worry about the substrates as they are not measured

# residual function. Note that the parameters for optimization include all of the starting values, but the parameters for the ODE do not
def amr_resid(pars,t,data):
    B,G,Da,S0,R0,A0 = pars
    ode_params=[B,G,Da]
    ode_starts=[S0,R0,A0]
    out = odeint(amr_ode, ode_starts, t, args=(ode_params,))
    return amr_run(pars,t)-data

p0 = [1e-05, 0.5, 400, 1, 1, 1]
lsq_out = leastsq(amr_resid, p0, args=(time,pop2))
plt.plot(time, pop2, 'h')
plt.plot(time, amr_run([0.000148593, 17.7663, 112260, 1.99715, -20.8516, -313.528], time))
plt.show()
#

If anyone would be so kind as to want to help out and wants a link to the full problem sheet I can send it via PM

small pumice
#

Okay, I've got a rather long question here:

I am working on a project that uses neural networks and satellite data to predict wildfires. I am using the Google Earth Engine Javascript API and will use Keras to train a deep ANN. The network will take the temperature, humidity (if I can get the data), and vegetation (I might use NDVI if possible). (Just in advance, part of my question won't necessarily have to do with datasets and more with neural networks, I just want to get more done in one question).

I am using the MODIS satellite to find the temperature of given areas within a given timeframe using the Land Surface Temperature and Emissivity dataset. I am able to do this with the following code:

#
var dataset = ee.ImageCollection('MODIS/006/MOD11A1')
              .filter(ee.Filter.date('2018-12-10', '2018-12-23'));
var landSurfaceTemperature = dataset.select('LST_Day_1km');
var landSurfaceTemperatureVis = {
  min: 13000.0,
  max: 16500.0,
  palette: [
    '040274', '040281', '0502a3', '0502b8', '0502ce', '0502e6',
    '0602ff', '235cb1', '307ef3', '269db1', '30c8e2', '32d3ef',
    '3be285', '3ff38f', '86e26f', '3ae237', 'b5e22e', 'd6e21f',
    'fff705', 'ffd611', 'ffb613', 'ff8b13', 'ff6e08', 'ff500d',
    'ff0000', 'de0101', 'c21301', 'a71001', '911003'
  ],
};
Map.setCenter(6.746, 46.529, 2);
Map.addLayer(landSurfaceTemperature, landSurfaceTemperatureVis, 'Land Surface Temperature');
print(landSurfaceTemperature);

// map over the image collection and use server side functions
var tempToDegrees = landSurfaceTemperature.map(function(image){
  return image.multiply(0.02).subtract(273.15);
});
// print and add to the map
print('image collection in temp in degrees', tempToDegrees);
Map.addLayer(tempToDegrees, {min: -20, max: 40, palette: landSurfaceTemperatureVis.palette}, 'temp in degrees');

With this code, I can click on a specific area on the map and get a graph of the temperature within a specified timeframe. How would I go about turning this into a Python array, with the temperatures of 1 km squares with their respective coordinates? I also want to be able to find such array for humidity and vegetation.

Second, I am also using the Terra Thermal Anomalies & Fire Daily Global 1km MODIS dataset for my wildfire data. I want to combine this data with the temperature data to find whether a wildfire will occur in a 1 km square within a month. How can I turn this into an array that corresponds with the other array(s)?
Overall, I want to build a neural network that, for input data, takes the temperature, humidity, and vegetation in a given area and output the likelihood of a fire occurring in the area within a month.

hardy crag
#

@scarlet salmon can you please explain the pops? why is p0 not in use?

scarlet salmon
#

I thought p0 was in use in the least squares function to calculate estimates for the values I want to know

#

I'll be honest I think I'm in way over my head

hardy crag
#

right, but why not also in the plotting?

scarlet salmon
#

Since when I plot I want the original data against a predicted line of best fit

hardy crag
#

(it doesn't really change anything if it's used, just curious where the numbers in plt.plot(time, amr_run([0.000148593, 17.7663, 112260, 1.99715, -20.8516, -313.528], time))
come from

#

)

scarlet salmon
#

p0 is guesses at what the values may be

#

Those values come from least square estimating what the values are
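
The general pattern being discussed (integrate the ODE, pick the column that was measured, let `leastsq` adjust the parameters, then reuse its output for plotting) can be sketched on a much simpler model. This is deliberately not the AMR system above: it is a one-equation logistic growth ODE fitted to the population numbers from the earlier logistic example, and the starting guesses are assumptions.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import leastsq

# Illustrative one-equation model (logistic growth), not the AMR system above.
def growth_ode(y, t, r, K):
    return r * y * (1 - y / K)

def run_model(pars, t):
    r, K, y0 = pars
    out = odeint(growth_ode, [y0], t, args=(r, K))
    return out[:, 0]  # pick the one column that is compared with the data

def resid(pars, t, data):
    return run_model(pars, t) - data

t_obs = np.arange(0.0, 35.1, 5.0)
y_obs = np.array([12, 48, 84, 113, 195, 225, 193, 188], dtype=float)

p0 = [0.2, 200.0, 10.0]                # rough starting guesses (assumed)
best, ier = leastsq(resid, p0, args=(t_obs, y_obs))
fitted_curve = run_model(best, t_obs)  # reuse `best` directly when plotting
print(best)
```

Passing `best` straight into the plotting call avoids copying numbers by hand, and printing it makes it easy to spot estimates that are not physically sensible, such as negative starting populations.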

hardy crag
#

Is your goal to actually solve the differential equations or just fit the pop2 data?

scarlet salmon
#

Mostly just to fit the pop2 data so I have values for B, G, Da

#

If you want to help out I can DM you the project sheet if you'd like

#

It explains things a bit better than I can typing over discord

hardy crag
#

sure

sweet ember
dire atlas
#

what are you trying to do?

#

plt.plot takes 2 arrays of the same length as input

sweet ember
#

I am trying to take X as features and y as the labels. This is the titanic dataset from kaggle. I want to evaluate the models I know and submit it

hardy crag
#

if titanic_data is a pandas DataFrame, you are trying to plot many columns at once

#

which obviously does not work, because you only have two dimensions

reef bone
#

There should be an error at the end of the traceback

#

What does it say?

sweet ember
#

I get this "AttributeError: 'NoneType' object has no attribute 'update'"

#

how do I do it then?

small pumice
#

Hey could someone please help me with the question I asked yesterday? It’s for a project that I need to finish soon.

lapis sequoia
#

anyone good with data analytics?
or know a thing for data i don't think R will be useful. I have 5 columns.

source address
destination address
protocol
source port
destination port

and sometimes destination port has a name instead of an int

desert oar
#

What is your question @lapis sequoia

#

@small pumice what is your question

small pumice
#

Just scroll up a bit

#

It’s kind of long

void anvil
#

@small pumice set an array with geocoordinates [x,y] to be the grid value at [x,y] based off of whatever arbitrary value you want to choose as 0,0

#

then dump it into a single DF with an array with columns of whatever values you have + temp + the values in the 8 directions around the 1x1 square

#

or create values for each square in a time series df and add in lags

#

with thousands of predictor variables

#

assuming you have enough time
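
The "one row per grid square, plus the values in the 8 directions around it" idea can be sketched with NumPy and pandas. This assumes the temperatures have already been exported as a 2-D array. The grid below is made up, and `grid_to_frame` and its column names are hypothetical. Squares on the edge get NaN for neighbours that fall outside the grid.

```python
import numpy as np
import pandas as pd

def grid_to_frame(temp):
    """Flatten a 2-D grid into rows of (row, col, value, 8 neighbour values)."""
    temp = np.asarray(temp, dtype=float)
    padded = np.pad(temp, 1, mode="constant", constant_values=np.nan)
    rows, cols = temp.shape
    r_idx, c_idx = np.indices(temp.shape)
    data = {"row": r_idx.ravel(), "col": c_idx.ravel(), "temp": temp.ravel()}
    offsets = {"nw": (-1, -1), "n": (-1, 0), "ne": (-1, 1), "w": (0, -1),
               "e": (0, 1), "sw": (1, -1), "s": (1, 0), "se": (1, 1)}
    for name, (dr, dc) in offsets.items():
        # Slice the padded grid so each cell lines up with its neighbour.
        window = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
        data["temp_" + name] = window.ravel()
    return pd.DataFrame(data)

# Made-up 3x3 grid of temperatures in degrees Celsius.
frame = grid_to_frame([[10, 11, 12], [13, 14, 15], [16, 17, 18]])
print(frame)
```

Humidity, vegetation and the fire label could be added as further columns in the same way, as long as every layer is on the same grid.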

sweet ember
#

@hardy crag what do I do about it then?

vital bison
#

so how many of you have appeared for GRE exams
im planning for MSDS

#

and get into data science

hardy crag
#

titanic_data.plot()

#

and see what happens maybe. or you could do

#
for col in titanic_data.columns:
    titanic_data.plot(x='col', y='survived')
#

(this doesn't really make sense if the columns have different data types)

sweet ember
#

I got the output for the plot titanic_data.plot()

#

Thanks!

#

The other one didn't work

small ore
#

@sweet ember By other one did you mean the for loop? Maybe try without the ' ' for col

sweet ember
#

Also when I try to fit it in LinearRegression I get ValueError: could not convert string to float: 'Q'

small ore
#

Unclean data? Also it helps if you can show the head. Titanic_data.head()
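
The 'Q' in that error is probably a value from a text column such as Embarked in the Kaggle file. scikit-learn estimators expect numeric features, so text columns have to be encoded first. A minimal sketch with made-up rows, using one-hot encoding as one possible choice:

```python
import pandas as pd

# Made-up rows with the kind of columns the Kaggle Titanic file has.
titanic_sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Survived": [0, 1, 1, 0],
})

features = titanic_sample.drop(columns="Survived")
# One-hot encode the text columns so every feature is numeric.
X = pd.get_dummies(features, columns=["Sex", "Embarked"], dtype=float)
y = titanic_sample["Survived"]
print(X.columns.tolist())
```

Note that Survived is a yes/no label, so a classifier such as logistic regression is usually a more natural first model than `LinearRegression`.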

desert oar
#

what do you expect to happen @sweet ember ?

#

survived takes 2 values, and you are plotting that against passenger id

#

it's oscillating between 1 and 0

#

and drawing a line between them

#

can you describe in words what kind of plot you are trying to create?

sweet ember
#

I actually want to plot a graph to see how many survived with respect to each column so that I can identify what model to use

desert oar
#

what do you mean "how many survived with respect to each column"

#

can you give a few examples of columns and describe how the chart should look

small ore
#

I think histograms are a better option but not really for every column. Doesnt make sense for passenger id for example

sweet ember
#

I want to plot age against survived, price against survived etc

small ore
#

I don't think a passenger id or a name would make sense for one of the axes to be plotted with the 'survived'

sweet ember
desert oar
#

i still dont understand what you expect here

#

do you want a histogram in each category (survived / perished)?

#

a violin plot? box plot?

#

horizontally jittered point cloud?

#

its just not clear how you expect to plot a continuous data set vs a categorical one

small ore
#

something like:

plot_cols = ['Age', 'pclass', 'Sex'....]
for col in plot_cols:
   <plot functions here>

might help

desert oar
#

and pandas is just making a guess

#

if you just want points like in that pairs matrix, use .plot.scatter instead of .plot

#

.plot is for lines, .plot.scatter is for points

#

but you will need to jitter said points on the categorical axis, otherwise it's just a line and you can't actually see the density of the data in each category

small ore
#

Salt rock lamp, by histogram, I meant number survived vs relevant columns

desert oar
#

wait whose question am i answering lol

small ore
#

His. Not mine 😃

sweet ember
#

Yes I tried to create features as X but was not able to due to attribute error

desert oar
#

back up

#

stop writing code

#

write down in words what you are trying to do

#

then write the code to achieve what you wrote down in words

sweet ember
desert oar
#

ok great. so focus on how to do it for one column first

small ore
#

You were trying to plot the 'survived' column as opposed to the number survived in a normal line plot

desert oar
#
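plt.clf()  # pyplot has no clear(); clf() clears the current figure
ax = plt.gca()
survived = titanic['Survived'] == 1  # boolean mask, assuming the column holds 0/1
ax.hist(titanic.loc[survived, 'Age'].dropna(), alpha=0.5, bins=16, label='Survived')
ax.hist(titanic.loc[~survived, 'Age'].dropna(), alpha=0.5, bins=16, label='Did not survive')
ax.legend()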

something like that should do what you want

#

now you have to figure out how to automate that for every column in the dataset

#

also .hist won't work with a column that's already categorical like gender

#

so you will need to think a bit more on how to generalize your plots to work with all columns in the data

#

and yes, name and passenger id are poor choices for this visualization method...
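
One possible way to generalise across columns is to summarise instead of overlaying histograms: compute the share of survivors per category, binning numeric columns first. The sketch below uses made-up rows, and `survival_rate_by` is a hypothetical helper.

```python
import pandas as pd

def survival_rate_by(df, col, bins=4):
    """Return the share of survivors per category, binning numeric columns first."""
    if pd.api.types.is_numeric_dtype(df[col]):
        groups = pd.cut(df[col], bins=bins)  # continuous -> interval bins
    else:
        groups = df[col]                     # already categorical
    return df.groupby(groups, observed=True)["Survived"].mean()

# Made-up rows; column names follow the Kaggle file.
sample = pd.DataFrame({
    "Age": [4.0, 22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0],
    "Sex": ["male", "male", "female", "female", "male", "male", "female", "female"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 1],
})

by_sex = survival_rate_by(sample, "Sex")
by_age = survival_rate_by(sample, "Age")
print(by_sex)
print(by_age)
```

Each result is a pandas Series, so `by_sex.plot.bar()` gives a comparable chart for any column, numeric or categorical.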

sweet ember
#

Thanks now I ll try that now and get back @small ore , @desert oar

#

So i ll just use age, gender and class then

haughty wind
#

This is kind of more of a hardware question - I want to upgrade my ram from 8 GB to 16 or 24 GB, will the extra 8 GB really matter if I want to run neural nets and such or is it not going to be that useful?

reef bone
#

It heavily depends on what you're trying to do.

will the extra 8 GB really matter if I want to run neural nets

They will matter if you're doing something that requires more than 8 GB. Ideally you would want to have all the data you're working with loaded in RAM, but depending on the dimensionality and overall size it might not be possible. 8 GB extra can make a big difference, but it's possible that as a beginner you won't have much use for it. Also keep in mind that we generally want to utilize GPUs for neural nets, the performance gain is substantial, so in this case the GPU's memory also matters.

desert oar
#

Ive hit double digits memory usage just fitting an SVM

#

Gradient descent should in theory be much more memory efficient tho

#

Having a ton of RAM basically means you dont have to think all that hard about memory efficiency

#

Vs if you are RAM-constrained you gotta be careful with sparsity and stuff

#

In this day and age i would upgrade to 16 pretty much no matter what

#

And 24 would be a nice buffer, also add some future proofing as well

#

Especially if you plan to do stuff other than just machine learning on this machine

#

Basically 24gb frees you from having to care about memory usage day to day

reef bone
#

Yeah, it depends. 16 GB of RAM is useful to have, definitely. I'm currently training an LSTM on a chatlog corpus with 500,000 expressions, the np arrays holding the data aren't too big because i can use sparse matrices and sparse categorical crossentropy, but my GPU suffers and I get OOM errors when I try batch sizes above 32
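
#

Rough arithmetic on why integer targets (which is what sparse categorical crossentropy accepts) help so much compared to one-hot targets. Only the 500,000 figure is from my corpus; the sequence length, vocabulary size and 4 bytes per value are assumed round numbers for illustration, and target_bytes is just a helper name I made up.

```python
def target_bytes(n_samples, seq_len, vocab_size, one_hot, bytes_per_value=4):
    """Memory needed to hold the training targets of a sequence model."""
    per_step = vocab_size if one_hot else 1  # one-hot vector vs a single integer id
    return n_samples * seq_len * per_step * bytes_per_value


n_samples = 500_000   # number of expressions in the corpus
seq_len = 20          # assumed tokens per expression
vocab_size = 10_000   # assumed vocabulary size

one_hot_gb = target_bytes(n_samples, seq_len, vocab_size, one_hot=True) / 1e9
sparse_gb = target_bytes(n_samples, seq_len, vocab_size, one_hot=False) / 1e9
print(f"one-hot targets: {one_hot_gb:,.0f} GB")
print(f"integer targets: {sparse_gb:.2f} GB")
```

That only covers the arrays in system RAM; the GPU still has to hold the activations for every sequence in a batch, which is where the batch size limit comes from.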

haughty wind
#

For GPU memory, is that the VRAM in the system? I know it's important if I want to run CNNs

small pumice
#

Hi,
I'm trying to use Google Earth Engine to make a neural network that can predict whether a fire will occur in a given area. I want the neural network to take in temperature, humidity, and amount of vegetation in a given area and output whether a wildfire will occur in the area within a month. What kind of neural network should I use? I was thinking a deep network because it would be the easiest in terms of data preprocessing. Would a recurrent network also work well for this situation?

steady escarp
#

Oh, nice! A data science channel!

#

That's what I want to do.

#

Also, are graphing modules usually a little tougher to install than other modules?

#

I've been working on a project that tracks bird behaviours, and want to graph those sightings.

#

It's pretty dang cool.

lapis sequoia
#

Can I see it?

small pumice
#

@steady escarp do you mean modules like Matplotlib? In that case, just use "pip install matplotlib".

steady escarp
#

Yeah, I've installed matplotlib. I just wanted the basemap module specifically. Matplotlib is pretty great so far.

#

I just want a map that uses lat and long coordinates.

small pumice
#

Oh I get it

#

Physically graph it

steady escarp
#

Oh, that's kinda neat. Using google maps. Maybe I can create a scatter plot using this. Thanks!
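
#

Something like this is what I have in mind for the scatter plot, using plain matplotlib with longitude on x and latitude on y and no map background. The sightings below are invented placeholder coordinates, not my real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder sightings: (species, latitude, longitude), all values invented
sightings = [
    ("robin", 51.05, -114.07),
    ("robin", 51.08, -114.13),
    ("heron", 51.03, -114.02),
    ("heron", 51.10, -114.05),
    ("robin", 51.01, -114.10),
]

fig, ax = plt.subplots()
for species in sorted({s for s, _, _ in sightings}):
    lats = [lat for s, lat, _ in sightings if s == species]
    lons = [lon for s, _, lon in sightings if s == species]
    ax.scatter(lons, lats, label=species)  # one colour per species
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend()
```

Calling fig.savefig("sightings.png") afterwards would write the plot to a file.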

small pumice
#

np

opal knot
#

Hey guys, wanted to know something

#

I have been doing data analysis and machine learning for more than a year

#

And I can deploy full blown apps for these purposes (frontend, backend, deployment etc.)

#

How do i go from being a data analyst to a data scientist? Am I already technically a data scientist?

reef bone
#

@haughty wind Yes, GPU memory is usually referred to as VRAM. Convolutional networks can be quite costly, so usually you would downsample in between conv layers.

sinful forge
#

@reef bone you said you should utilise the GPU. If we're running the AI on a server in a datacentre, they don't use GPUs, right? So it would need to be CPU and RAM, no?

#

Also I'm very interested in your 500,000 expression corpus data set you have. Can we converse sometime on this?

reef bone
#

I'm not sure what you mean by the first question

#

And yes feel free to ask anything, it's nothing spectacular, just a compilation of data I'm using at the moment for a project

sinful forge
#

Oh I mean you said that it's best to run neural networks on the GPU rather than RAM. That's fine I guess if you are using it on a desktop, but what if I needed more processing power or I wanted it running 24/7 off my PC? I'd use a datacentre dedicated server, no? They don't normally have GPUs in them, right?

#

Oh nevermind

reef bone
#

I've never used cloud computing services but I'm sure ones aimed at machine learning will offer powerful GPUs. It depends on the plan you have. You can refer to AWS for example (first one I checked).
https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html

sinful forge
#

My info is old

#

Awesome thanks

reef bone
#

Regardless, you would normally use the GPU for training. Once trained, inference isn't nearly as costly, so provided you can hold the model in RAM, it can be done even on regular computers.

sinful forge
#

I'm still learning python. Takes time likebeer

#

Awesome thanks for the info!

sinful forge
#

I want to create an AI that's able to hold a text conversation and learn from me typing to it. Wonder how possible that is?

#

👌

reef bone
#

It's a fun project to do and definitely possible, there's a lot of ways to approach that

#

For example, if you've used keras before, turning the example seq2seq model they have into a chatbot should be fairly simple, but I can't promise you it'll give good results

#

They have a character-level model for translation; using the keras embedding layer it should be fairly straightforward to turn this into a word-level model

#

"learn from me typing to it" might be a bit difficult to do, usually you would either train on a publicly available dataset (cornell movie lines comes to mind, that one is quite popular), or make your own by scraping reddit comments for example

#

Oh, in fact that article shows how to turn it into a word level model
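
#

To give an idea of what "word level" means in practice, here's a minimal sketch of turning sentences into integer word ids, which is the kind of input an embedding layer takes. This isn't the code from the article; build_vocab, encode, the special tokens and the three-line corpus are all made up for illustration

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each word to an integer id; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, n in counts.most_common():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of word ids."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]
    ids = ids[:max_len]  # truncate long sentences
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones


corpus = ["hello there", "hello how are you", "i am fine"]
vocab = build_vocab(corpus)
encoded = encode("hello are you ok", vocab, max_len=6)
print(encoded)
```

A real chatbot would want proper tokenization (punctuation, contractions) instead of a plain split, but the shape of the data is the same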

#

So that's something you can use for inspiration and maybe have a look at github and see what others have come up with

#

However, I probably wouldn't recommend this as a starter project, NLP and sequence modelling come with some difficulties that might be a little daunting for beginners, so if you're just starting maybe something like image classification would be a more fitting project

#

LSTMs are still a bit of magic to me as well

sinful forge
#

Wow. Thank you so much @reef bone !!

full shard
#

Hello, I've got the following problem - I'm trying to use supervised machine learning to differentiate between multiple classes of data. The data I am plugging in consists of, essentially, 500-point scatter plots - therefore, my feature vector per sample is all the x values, followed by the matching y values (each normalized to between 0 and 1). I am using AdaBoost as a multiclass classifier, and it works well with unseen data up to 3-4 classes. However, with 6-7 classes, as some of the classes are more similar (although they can be differentiated by eye), the classification breaks down. Would anyone have any suggestions as to what other multiclass classification methods to use? And is the x-y value feature vector a good option, or should I use the average and standard deviation of each variable instead? (turning the feature vector from a length of 1000 into a length of 4)
Thank you for any suggestions!
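
#

For concreteness, the two feature layouts I'm comparing look roughly like this. The random points are only a stand-in for one of my samples, and the variable names are just for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one sample: 500 (x, y) points already scaled to [0, 1]
points = rng.random((500, 2))

# Layout 1: all x values followed by the matching y values, length 1000
raw_features = np.concatenate([points[:, 0], points[:, 1]])

# Layout 2: mean and standard deviation of each variable, length 4
summary_features = np.array([
    points[:, 0].mean(), points[:, 0].std(),
    points[:, 1].mean(), points[:, 1].std(),
])

print(raw_features.shape, summary_features.shape)
```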

chrome lily
#

Context: A group of programs needs to be run across many servers, among which the programs can be distributed
Problem: Each server has a different execution time for each task (example: CPU and RAM available)
Objective: Find a combination of servers which gives the minimal execution time for a program/software (the sum of its tasks)

#

Servidores = Server and Tarefas = Tasks

#

This is my first-year project, I'm pretty much new to Python

#

I'd be forever grateful for any help

carmine lava
#

Anyone know how to train Faster R-CNN?

#

Any tutorials

hardy crag
#

@chrome lily what kind of class? optimization? programming? any more information available? I guess you should try to form an equation for transfer time and calculation time and then minimize the total time.

chrome lily
#

@hardy crag
It's a programming class, artificial intelligence. My teacher gave me this assignment and I'm a bit lost on how to do it. I just started learning Python a couple of weeks ago and was a little intimidated.

[Problem 1]
Context: A group of programs needs to be run across many servers, among which they can be distributed.
Problem: Each server has a different execution time for each task (example: too much load on the server; CPU and RAM available).
Objective: Find the combination of servers that minimizes the execution time of a program (the SUM of its tasks).

[Problem 2]
Context: Many prediction programs must be executed, and they generate different volumes of data.
Problem: Each program generates a different volume of data, and the servers which execute them have limited storage capacity.
Objective: Determine the maximum number of programs which can run on a given server, and determine the minimum set of servers needed to execute all the prediction programs.

hardy crag
#

do you have an idea about how to solve these problems in general (without the programming part)? What part of your class would you think is best suited for this task?

chrome lily
#

Basically we have a 2D cost matrix, with the values of the execution times of each server for each task

And I have an array with the same number of rows and columns as the previous one, but containing only 1s and 0s

This marks the activation of a given task-server pair

1 - enabled, 0 - disabled

Then we implement the genetic algorithm on top of that

I also took a pic today if it helps
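
#

In code, I think the two arrays would look something like this. The numbers are invented, and I'm assuming rows are tasks (tarefas), columns are servers (servidores), and that every task has to be enabled on exactly one server; is_valid and total_time are just names I picked.

```python
import numpy as np

# Hypothetical execution times: rows = tasks, columns = servers
cost = np.array([
    [4.0, 2.0, 8.0],
    [3.0, 7.0, 1.0],
    [5.0, 6.0, 2.0],
    [9.0, 3.0, 4.0],
])

# Same shape: 1 = this task runs on this server, 0 = disabled
active = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])


def is_valid(active):
    """Assumed constraint: every task is enabled on exactly one server."""
    return bool(np.all(active.sum(axis=1) == 1))


def total_time(cost, active):
    """Sum of the execution times of the enabled task-server pairs."""
    return float((cost * active).sum())


print(is_valid(active), total_time(cost, active))
```

The genetic algorithm would then search over different 0/1 arrays and keep the ones with the lowest total time.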

#

Sorry about my phone quality

#

@hardy crag

hardy crag
#

right, you have the theory down. I recommend setting up a python script that gets the matrices as input of some kind (e.g. txt or csv) and calls a class which contains the actual optimizer.

#

(if the input is always the same matrix, then you can just hardcode them)
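
#

a minimal sketch of that layout, using only the standard library. It assumes the CSV has one row per task and one column per server, and it encodes an individual as a list with one server index per task (equivalent to your 0/1 array with a single 1 per row). load_matrix, GeneticOptimizer and all the default parameters are made-up names and values, so treat it as a starting point rather than the way to do it

```python
import csv
import io
import random


def load_matrix(file_obj):
    """Read a CSV of execution times: one row per task, one column per server."""
    return [[float(v) for v in row] for row in csv.reader(file_obj) if row]


class GeneticOptimizer:
    """Tiny genetic algorithm; an individual lists the chosen server index for each task."""

    def __init__(self, cost, pop_size=30, generations=50, mutation_rate=0.1, seed=0):
        self.cost = cost
        self.n_tasks = len(cost)
        self.n_servers = len(cost[0])
        self.pop_size = pop_size
        self.generations = generations
        self.mutation_rate = mutation_rate
        self.rng = random.Random(seed)

    def total_time(self, individual):
        """Fitness: sum of execution times of the chosen task-server pairs."""
        return sum(self.cost[t][s] for t, s in enumerate(individual))

    def _random_individual(self):
        return [self.rng.randrange(self.n_servers) for _ in range(self.n_tasks)]

    def _crossover(self, a, b):
        # Single-point crossover: head of one parent, tail of the other
        cut = self.rng.randrange(1, self.n_tasks) if self.n_tasks > 1 else 0
        return a[:cut] + b[cut:]

    def _mutate(self, individual):
        # Each gene is replaced by a random server with probability mutation_rate
        return [
            self.rng.randrange(self.n_servers) if self.rng.random() < self.mutation_rate else s
            for s in individual
        ]

    def run(self):
        population = [self._random_individual() for _ in range(self.pop_size)]
        for _ in range(self.generations):
            population.sort(key=self.total_time)
            parents = population[: self.pop_size // 2]  # keep the fitter half
            children = [
                self._mutate(self._crossover(*self.rng.sample(parents, 2)))
                for _ in range(self.pop_size - len(parents))
            ]
            population = parents + children
        best = min(population, key=self.total_time)
        return best, self.total_time(best)


# Inline CSV standing in for a real file, e.g. open("costs.csv") (hypothetical name)
demo_csv = io.StringIO("4,2,8\n3,7,1\n5,6,2\n9,3,4\n")
cost = load_matrix(demo_csv)
best, best_time = GeneticOptimizer(cost).run()
print(best, best_time)
```

a handy sanity check for this toy objective: the result can never beat the sum of the cheapest entry in each row, so you can compare the GA output against that lower bound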