#data-science-and-ml


maiden eagle
#

Could be a bunch of stuff

#

Usually when I get that error it's because my array has dtype=object, because I wasn't careful about how I built it

#

The fact that X.shape[1] returns a value error makes me think you have an array of lists or something

desert oar
#

@median siren you'll need to show your code or a sample of the data or something

median siren
#

Alright

lapis sequoia
#

check the values first

#

before trying to fit

desert oar
#

@median siren you have an array of arrays

#

that might be causing problems

median siren
#

That makes sense, right, considering each of my data points represents a vector? So my X_train variable is a list of vectors?

desert oar
#

well, no, it's a dataframe

#

usually you don't get an array of arrays when you use .values

#

something is funny in your data

median siren
#

🤔

median siren
#

Yes, I figured out what's wrong.

hard veldt
#

hey does anyone have a recommendation for a free online course to learn ML

lean ledge
#

Check pinned

karmic geyser
#

@desert oar Hey I tried your code. I think I need to edit the cython module or change the data type I'm passing. "ValueError: Buffer dtype mismatch, expected 'DTYPE_t' but got 'float'"

lapis sequoia
#

And why do we use fit_transform() on training set and only transform() on test set?

charred onyx
#

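We use fit_transform() on the train data so that we learn the parameters of scaling from the train data and, at the same time, scale the train data. The test data is then scaled with transform() only, reusing those train-derived parameters instead of learning new ones.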

lapis sequoia
#

what does "parameters of scaling on train data" mean?
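
To make "parameters of scaling" concrete: for standardization, the parameters are the mean and standard deviation of each feature, computed from the training set only. The sketch below uses plain Python rather than scikit-learn, so `fit`, `transform`, and the toy numbers are illustrative assumptions, not the library's API. scikit-learn's `StandardScaler` follows the same fit-on-train, reuse-on-test pattern.

```python
from statistics import mean, pstdev

def fit(train):
    """Learn the scaling parameters (mean, population std) from training data."""
    return mean(train), pstdev(train)

def transform(values, params):
    """Scale values using parameters that were learned elsewhere."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 14.0, 16.0]   # hypothetical training feature
test = [13.0, 20.0]                # hypothetical test feature

params = fit(train)                       # the "fit" half of fit_transform()
train_scaled = transform(train, params)   # the "transform" half
test_scaled = transform(test, params)     # test set: transform only, no refit

print(params)
print(train_scaled)
print(test_scaled)
```

The test values are scaled relative to the training mean and spread, so a test point can land well outside the range seen in training, which is expected.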

sand reef
#

Does anyone know how to fix the issue with TensorBoardColab? The one where I make the tensorboard and pass it in, and it says:

#

AttributeError: TensorBoardColab does not have a parameter, on_batch_training_begin()

#

or something close to that, let me get the error

#
```
AttributeError                            Traceback (most recent call last)
<ipython-input-6-98410153379f> in <module>()
      1 with tf.Session() as sess:
      2   sess.run(tf.global_variables_initializer())
----> 3   model.fit(X, Y, batch_size = 32, epochs = 10,validation_split = 0.1, callbacks = [TensorBoardColabCallback(tbc)])

2 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, max_queue_size, workers, use_multiprocessing, **kwargs)
    878           initial_epoch=initial_epoch,
    879           steps_per_epoch=steps_per_epoch,
--> 880           validation_steps=validation_steps)
    881 
    882   def evaluate(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, mode, validation_in_fit, **kwargs)
    323         # Callbacks batch_begin.
    324         batch_logs = {'batch': batch_index, 'size': len(batch_ids)}
--> 325         callbacks._call_batch_hook(mode, 'begin', batch_index, batch_logs)
    326         progbar.on_batch_begin(batch_index, batch_logs)
    327 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
    194     t_before_callbacks = time.time()
    195     for callback in self.callbacks:
--> 196       batch_hook = getattr(callback, hook_name)
    197       batch_hook(batch, logs)
    198     self._delta_ts[hook_name].append(time.time() - t_before_callbacks)

AttributeError: 'TensorBoardColabCallback' object has no attribute 'on_train_batch_begin'
```
#

and here is the model made

#
```py
import tensorflow as tf
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Flatten, Dense, Activation, Conv2D, MaxPooling2D
from google.colab import drive
drive.mount('/content/drive')
import pickle
X = pickle.load(open('/content/drive/My Drive/data/X.pickle', 'rb'))
Y = pickle.load(open('/content/drive/My Drive/data/Y.pickle', 'rb'))
model = Sequential()
model.add(Conv2D(64, (3,3), input_shape = X.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Conv2D(64,(3,3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation("relu"))

model.add(Dense(1))
model.add(Activation("sigmoid"))

model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
!pip install -U tensorboardcolab
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback
tbc = TensorBoardColab()
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  model.fit(X, Y, batch_size = 32, epochs = 10,validation_split = 0.1, callbacks = [TensorBoardColabCallback(tbc)])
```
#

pls halp

lapis sequoia
#

tensorboard is to visualize the flow of your program?

#

can you look it up and see where it fails?

#

oh

#

misunderstood your problem

#

try this

#

tbCallBack = TensorBoard()

#

and use that in model.fit

#

i'm not sure if you need to pass arguments into TensorBoard.. but I think you probably might need to

#

@sand reef

sand reef
#

But the TensorBoardColab is what is imported, the regular one is only TensorBoardv2.0 which is supported by Colab

#

The TensorBoardv2.0 is having another set of issues of not reading any of my tensorboard event files

#

despite me running ngrok on it

olive willow
#

btw guys what should I know before starting to learn calc?

sand reef
#

functions

olive willow
#

so f(x)

sand reef
#

and a bit of set theory

olive willow
#

f(x) = x^2

#

what's that? set theory

sand reef
#

like x is a real number or x belongs to an interval between 2 and 5

#

that notation

#

f:x->x

olive willow
#

what's an interval idk the english terms that good

sand reef
#

yeah, that sort of stuff, check it out

#

cuz they will use a lot of that weird notation

olive willow
#

sure I'm doing linear algebra rn after that it's calc and stuff

#

and after that the holy motherland MACHINE LEARNING

sand reef
#

okay

spice cargo
#

thanks @lean ledge

desert oar
#

@karmic geyser I did warn you it was untested 😃 but the error means what it says

#

What is a bit weird is that DTYPE_t should be np.float32

#

Maybe the issue is native python float vs numpy float

lost sinew
#
import pandas as pd

df_btcusdt = pd.read_csv("BTCUSDT.csv", parse_dates=True, index_col=0)
df_ethusdt = pd.read_csv("ETHUSDT.csv", parse_dates=True, index_col=0)
df_ltcusdt = pd.read_csv("LTCUSDT.csv", parse_dates=True, index_col=0)

df_btcusdt = df_btcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ethusdt = df_ethusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ltcusdt = df_ltcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])

df_btcusdt.rename(columns={'Close':'BTCUSDT Close'}, inplace=True)
df_ethusdt.rename(columns={'Close':'ETHUSDT Close'}, inplace=True)
df_ltcusdt.rename(columns={'Close':'LTCUSDT Close'}, inplace=True)


main_df = pd.concat([df_btcusdt, df_ethusdt, df_ltcusdt], axis=1, sort=False)

print(main_df.corr())

#

how do i make this code neater?

desert oar
#

@lost sinew inplace= i think is discouraged nowadays. but other than that, seems neat enough to me

#

alternatively you can use usecols= in the read_csv call instead of dropping columns afterward

lost sinew
#

alright thanks.. new to programming so im scared if its messy lol

#
Date,BTCUSDT Close,ETHUSDT Close,LTCUSDT Close
2018-01-23,10799.18,980.0,176.98
2018-01-24,11349.99,1061.0,180.89
2018-01-25,11175.27,1056.52,179.59
2018-01-26,11089.0,1051.03,177.09
2018-01-27,11491.0,1118.99,182.1
2018-01-28,11879.95,1251.96,196.74
2018-01-29,11251.0,1177.01,181.5
2018-01-30,10237.51,1085.5,168.21
2018-01-31,10285.1,1124.81,165.19
2018-02-01,9224.52,1041.94,143.69
#

how do i find the time lag between each of the * Close

#

this is a csv file

desert oar
#

what do you mean time lag

lost sinew
#

like whether the price increases or decreases.. each one follows the other prices because they are highly correlated.. is there a way to find the average lag/lead time

desert oar
#

not sure i understand. you want to find lags or leads such that the series are all maximally correlated?

lost sinew
#

for example, ETHUSDT Close has a +ve increase 10 minutes after BTCUSDT Close has a +ve increase

desert oar
#

also your data is daily so you can't figure out +10 minutes from that. but i think i see

lost sinew
#

ohh its just an example

desert oar
#

i'm not sure of any principled way to do that other than making leads and lags of different lengths and computing the correlations

#

or making plots and eyeballing it

lost sinew
#

so theres no quantitative way to do it?

desert oar
#

im sure there is, but i dont know it

#

sometimes the "dumb way" is good enough

lost sinew
#

alright thanks for ur help

#

been searching google for days and i still cant find the answer 😦

desert oar
#

https://quant.stackexchange.com/a/14868

it seems like you really just have to compute a bunch of lagged correlations, or use Granger causality
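
The "bunch of lagged correlations" idea can be sketched as follows. This is a minimal plain-Python illustration, not the method from the linked answer: `pearson`, `lagged_corr`, and the toy series are assumptions made up for the example, with `y` built as `x` delayed by two steps so the answer is known in advance.

```python
from math import sqrt

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]; lag > 0 means y follows x."""
    if lag > 0:
        return pearson(x[:-lag], y[lag:])
    if lag < 0:
        return pearson(x[-lag:], y[:lag])
    return pearson(x, y)

# Toy data: y repeats x two steps later (the first two values are filler).
x = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 10, 11]
y = [0, 0] + x[:-2]

correlations = {lag: lagged_corr(x, y, lag) for lag in range(-3, 4)}
best_lag = max(correlations, key=correlations.get)
print(best_lag)  # 2 for this toy data
```

With real price series it is usually safer to run this on day-to-day changes (returns) rather than raw prices, since two trending series look correlated at almost every lag.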

turbid bay
desert oar
#

guessing the same number every time suggests something degenerate in your training

#

if you print the gradient at each training step maybe you can see something going wrong

turbid bay
#

which bit's the gradient tho? 😂. sorry i don't 100% know what's going on

sand reef
#

Well. You implemented the entirety of the neural network from scratch.

#

It's gonna be a bit hard to point out where you are going wrong.

#

Here's what you can do.

#

Shuffle the dataset

#

And then take samples.

#

If it started predicting 3 a lot, only 2 things are possible. Either something is going wrong in your network, which is unlikely since you said that it was working with 10 examples, or your training data had a lot of 3s, so your network learnt to predict only 3 for high accuracy.

desert oar
#

is your data really unbalanced

sand reef
#

Altho, how did it get a very high accuracy with just 10 examples? Was that also the MNIST dataset?

#

Is that even possible with a balanced out dataset?

turbid bay
#

no my data consists of 20 of each number

#

i mean it is from the mnist dataset. but i just got the images online and saved them as png’s. i only got 200 of them

sand reef
#

Well. Do one thing.

#

Use tensorflow.keras and get the mnist dataset

#

And train your model on that.

#

If the same issue persists, then your code for the neural network is having errors.

turbid bay
#

i wanted to do that before. but i never knew how to get it

#

or how to use it if i did get it

lost sinew
#
import requests
import csv
import pandas as pd

market = 'XRPUSDT'#'LTCUSDT'#'BTCUSDT'#'ETHUSDT'
interval = '1d'

url = 'https://api.binance.com/api/v1/klines?symbol=' + market + '&interval=' + interval
data = requests.get(url).json()

with open(market + '.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(data)

df = pd.read_csv(market + '.csv', names=['Date', 'Open', 'High', 'Low', 'Close',
                                         'Volume', 'Close time', 'Quote asset volume',
                                         'Number of trades', 'Taker buy base asset volume',
                                         'Taker buy quote asset volume', 'Ignore'])

df['Date'] = pd.to_datetime(df['Date'], unit='ms')

df = df.drop(columns=['Close time', 'Quote asset volume',
                      'Number of trades', 'Taker buy base asset volume',
                      'Taker buy quote asset volume', 'Ignore'])
# save file
df.to_csv(market + '.csv', index=False)
#

how do i make a loop for all of the 'market' commented

#

i wanna just type all of the different markets in a list and loop around it automatically instead of manually chanign the market

#

changing*

desert cradle
#
```py
markets = ['XRPUSDT', 'LTCUSDT', ...]
for market in markets:
    ...  # all the rest of your code goes here, indented under the loop
```
#

the rest of your code could probably be made more efficient but that's how you'd make the loop

lost sinew
#

how can i make it more efficient

desert cradle
#

i'd probably create the dataframe directly from the json

lost sinew
#

how would i do that tho.. im really new into programming

desert cradle
#

something like this
```py
data = requests.get(url).json()
df = pd.DataFrame(data,
                  columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume',
                           'x', 'x', 'x', 'x', 'x', 'x'])
df['Date'] = pd.to_datetime(df['Date'], unit='ms')
df = df.drop(columns=['x'])
df.to_csv(market + '.csv', index=False)
```

#

(I just used 'x' instead of names for columns you're deleting anyway)

lost sinew
#

alright thanks

#
import pandas as pd


markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
           'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']

df_btcusdt = pd.read_csv("BTCUSDT.csv", parse_dates=True, index_col=0)
df_ethusdt = pd.read_csv("ETHUSDT.csv", parse_dates=True, index_col=0)
df_ltcusdt = pd.read_csv("LTCUSDT.csv", parse_dates=True, index_col=0)

df_btcusdt = df_btcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ethusdt = df_ethusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ltcusdt = df_ltcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])

df_btcusdt.rename(columns={'Close':'BTCUSDT Close'}, inplace=True)
df_ethusdt.rename(columns={'Close':'ETHUSDT Close'}, inplace=True)
df_ltcusdt.rename(columns={'Close':'LTCUSDT Close'}, inplace=True)


main_df = pd.concat([df_btcusdt, df_ethusdt, df_ltcusdt], axis=1, sort=False)

#main_df.to_csv('combined.csv', index=False)

print(main_df.corr())
#

how would i do this for all of the markets now

#

i tried it but it wouldn't work.. what should i change the final 'main_df' line to

#

@desert cradle

desert cradle
#

ok uh

#

that's very different from your other code that was making separate files

lost sinew
#

yea..

#

is there a way to loop it?

desert cradle
#

i think what you want is pd.merge

#

but i'm not 100% sure on the details of how to use it

lost sinew
#

is there like a way to concat it to the main_df everytime it loops?

desert cradle
#

actually...

#

since you're using an index column

#

wait can you tell me why the concat didn't work?

lost sinew
#

i'll show u what i tried.. wait

desert cradle
#

but anyway, you just want the close columns, right?

#

also you have this long list of markets but just three csv files, what are you trying to do with that exactly

lost sinew
#
import pandas as pd

markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
           'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
pair = 'USDT'
for market in markets:

    df = pd.read_csv(market + pair+ ".csv", parse_dates=True, index_col=0)

    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])

    df.rename(columns={'Close': market + pair +' Close'}, inplace=True)

    main_df = pd.concat([df], axis=1, sort=False)

print(main_df.corr())
desert cradle
#

ok, that's your problem

#

concat works fine, you're just doing it in the wrong place

lost sinew
#

ohh

#

where should it be

desert cradle
#
import pandas as pd

markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
           'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
pair = 'USDT'
dfs = []
for market in markets:
    df = pd.read_csv(market + pair+ ".csv", parse_dates=True, index_col=0)
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
    df.rename(columns={'Close': market + pair +' Close'}, inplace=True)
    dfs.append(df)

main_df = pd.concat(dfs, axis=1, sort=False)

print(main_df.corr())
lapis sequoia
#

@desert cradle is there any way of doing it without the append?

desert cradle
#

why

lost sinew
#

it came out with this error: ValueError: No objects to concatenate

lapis sequoia
#

would a list comprehension be worse in this case?

desert cradle
#

that doesn't make any sense

lost sinew
#

lol nvm @desert cradle i forgot to put dfs.append(df)

#

THANKS

lapis sequoia
#
import pandas as pd


markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
           'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']

def op(market):
    pair = 'USDT'

    df = pd.read_csv(market + pair+ ".csv", parse_dates=True, index_col=0)
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
    df.rename(columns={'Close': market + pair +' Close'}, inplace=True)
    return df

dfs = [op(market) for market in markets]


main_df = pd.concat(dfs, axis=1, sort=False)


print(main_df.corr())


#

maybe something like this?

desert cradle
#

eh

#

making a function just so you can have a list comprehension doesn't really improve readability that much

#

and it's not much of a difference for performance either

lost sinew
#

how do i make it into a heatmap after having a correlation table?

desert cradle
#

no idea

lost sinew
#

okay thanks ill figure it out

desert cradle
#

sounds like a lot of math, we're past my ability to help

#

i just know the basics of how pandas itself works

lost sinew
#

ohh okay

lapis sequoia
#

@desert cradle Okey ty!, i thought the performance would be better with the list comprehension

desert cradle
#

it probably doesn't make much difference - a list comprehension might be very slightly faster than a loop with append, but adding an extra layer of function call might slow it down too, and it's not worth worrying about anyway
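
If you want to check this on your own machine, a rough `timeit` sketch like the one below works. The function names and the workload (doubling 1000 integers) are made up for illustration, and the timings depend on hardware and Python version, so treat the numbers as indicative only.

```python
import timeit

items = list(range(1000))

def with_loop():
    """Build the list with an explicit loop and append."""
    out = []
    for item in items:
        out.append(item * 2)
    return out

def with_comprehension():
    """Build the same list with a list comprehension."""
    return [item * 2 for item in items]

# Run each version 1000 times and report the total elapsed time.
loop_time = timeit.timeit(with_loop, number=1000)
comp_time = timeit.timeit(with_comprehension, number=1000)
print(f"loop: {loop_time:.4f}s, comprehension: {comp_time:.4f}s")
```

In the CSV case above, reading the files from disk will dominate either way, so the choice is mostly about readability.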

lost sinew
#

how does python calculate correlation for the x.corr() code

desert oar
#

Standard Pearson correlation

lost sinew
#

thanks

desert oar
#

Can choose Spearman or Kendall if you want

#

It's in the docs

lost sinew
#

which one is the best

#

nvm

#

do you think the standard pearson correlation is suitable for finding the correlation between two stock prices?

#

or is the standard pearson correlation only suitable for linear relationships

desert oar
#

by definition it's only suitable for linear relationships, but you might be underestimating the value of measuring a linear relationship. if "priceA" generally goes up whenever "priceB" goes up, then you can see that with a linear relationship
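
A small sketch of the difference, assuming toy data where priceA always rises with priceB but not linearly (here priceA is the cube of priceB, which is an invented example). Spearman is computed as Pearson on the ranks, and the `ranks` helper assumes there are no tied values.

```python
from math import sqrt

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

price_b = [1, 2, 3, 4, 5, 6, 7, 8]        # hypothetical prices
price_a = [v ** 3 for v in price_b]       # rises with price_b, but not linearly

linear = pearson(price_a, price_b)        # about 0.93: strong, but below 1
monotone = spearman(price_a, price_b)     # 1.0 up to rounding: perfectly monotone
print(round(linear, 3), round(monotone, 3))
```

Pearson still reports a strong relationship here, which is the point made above: a linear measure picks up "goes up together" even when the relationship is not exactly linear.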

gaunt gorge
#

What project will be good to do to learn data science and to put on the resume?

desert oar
#

anything tbh

#

your learning project likely won't be a good resume project

#

kaggle is never a bad place to start for machine learning

#

it's kind of hard to learn "data science" on your own tbh

#

you end up mostly learning technical stuff, which is maybe 80% of the equation

sand reef
#

say, anyone here well versed in the concept of hopfield neural networks?

#

i m slowly starting to lose it on this neural network

#

please ping me if anyone can help

#

For some reason, this network is always converging only to the latest learnt pattern

sand reef
#

to see what the error is, use this code in conjunction: "Processor.py" is the name of the pasted file in the above link

#
```py
from Processor import *

def print_matrix(matrix):
    for i in range(len(matrix)):
        string = ''
        for j in range(len(matrix)):
            string += str(1 if matrix[i][j] == 1 else 0) + ' '
        print(string)


nn = Network()
a = [
[ 1, 1, 1, 1, 1, 1, 1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1]]

b = [
[ 1,-1,-1,-1,-1,-1, 1],
[-1, 1,-1,-1,-1, 1,-1],
[-1,-1, 1,-1, 1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1, 1,-1, 1,-1,-1],
[-1, 1,-1,-1,-1, 1,-1],
[ 1,-1,-1,-1,-1,-1, 1]]

c = [
[ 1, 1, 1,-1, 1, 1, 1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1, 1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1, 1,-1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1]]

nn.read_matrix(a)
nn.set_weights()
nn.read_matrix(b)
nn.set_weights()
nn.run_async(1000)
print_matrix(nn.get_matrix())
```
#

okay....i m seeing an issue......

#

i think, the issue does not lie in the network, i think my gui is messing up

sand reef
#

now who do i ask to help me see the error that i think my logic is causing?

#

since its a gui logic error that i cant seem to fit into the button events

naive shore
#

sorry im of no help but i have a question ducky

#

if i have an array of say 50000 values and i iterate over them with my algorithm, extensively using a counter (like counter++ at each iteration, and then use it in my calculations)

so.... it works fine for an array, but when i apply this algorithm to a data flow obviously i get overflow very fast

#

so i thought (to not reinvent the wheel) maybe there is a well known concept to deal with this kind of things, like phase iterators or something

#

googling "phase iterator" gives nothing useful, so maybe it has different name

#

but my image of it is like when your counter is more than some value it zero's out but we still know its not real 0 but 0 + whatever counter we zeroed

#

duh such a mess of a thought )

zenith nova
#

As far as I am aware, instead of overflowing, Python's ints become longs (in Python 3, every int is already arbitrary precision), and longs simply don't overflow?

#
>>> 10**10**3
100000000000....(snip too many zeroes)
naive shore
#

oh...
so i must inspect why it gives me overflow more carefully

desert oar
#

Numpy doesnt do that though, if its float32 its float32

#

But yeah use native python ints, they can get huge
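
A quick way to see the difference using only the standard library. `ctypes.c_int32` stands in here for a fixed-width integer and `struct` for a float32 round trip; both are stand-ins chosen for illustration, not what NumPy does internally.

```python
import ctypes
import struct

# Native Python ints are arbitrary precision: they grow instead of overflowing.
counter = 2 ** 31 - 1            # largest signed 32-bit value
counter += 1
print(counter)                   # 2147483648

# A fixed-width 32-bit integer wraps around (ctypes does no overflow checking).
wrapped = ctypes.c_int32(2 ** 31 - 1 + 1).value
print(wrapped)                   # -2147483648

# float32 does not wrap, but it cannot count past 2**24 in steps of 1.
as_float32 = struct.unpack('f', struct.pack('f', float(2 ** 24 + 1)))[0]
print(as_float32)                # 16777216.0
```

So if a counter is stored in a fixed-width array dtype, it can overflow or silently lose precision, while the same counter held as a plain Python int will not.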

knotty nexus
#

does anyone know how I can calculate permutations, but with multiple lists of combinations? ie. [3,4] [5,6] [7,8,9] would be (3,5,7) (3,6,7) etc

earnest prawn
#

I've never done this but my intuition says take a look at the itertools module from stdlib @knotty nexus

hollow quartz
#

Hi, I am a beginner in Data Science. I have a machine learning problem, so I want to know which statistics are useful when beginning a machine learning problem?

desert oar
#

Depends on the problem

#

Usually you want to learn something about the data

#

Summary statistics, or plot the data if you can

#

You should have a goal in mind so you can stay focused on that goal

silk forge
#

made my first ever decision tree classifier

knotty nexus
#

thanks @earnest prawn . I looked at itertools, but as far as I can see it can only handle combinations of a single list; [3,4] would be (3,3), (3,4), (4,3) etc. For now I'm gonna try for loops, but it's gonna be really slow

desert oar
#

You want itertools.product() maybe @knotty nexus
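
For the three lists from the question, `itertools.product` picks one element from each list in every possible way. The variable names below are just for illustration.

```python
from itertools import product

lists = [[3, 4], [5, 6], [7, 8, 9]]

# Unpack the lists so each one becomes a separate argument to product().
combos = list(product(*lists))

print(len(combos))   # 12, i.e. 2 * 2 * 3
print(combos[:3])    # [(3, 5, 7), (3, 5, 8), (3, 5, 9)]
```

`product` is lazy, so for large inputs you can loop over it directly instead of building the full list.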

hollow quartz
#

@desert oar I use pandas for example data.describe() show mean, std, min, max, 1st , 2nd and 3rd quartile

#

Is it the only statistic that i can use?

desert oar
#

You can use anything you want

#

Its better to start with a specific objective

#

What are you trying to achieve? What question do you want to answer?

hollow quartz
#

ok thanks

wary fox
#

This is kind of a simple question and I am not 100% sure if it belongs here under data-science but I figured it fits, so what exactly about numpy makes it "faster"

earnest prawn
#

that it's written in C and uses C arrays instead of Python lists

lost sinew
#

how would i find the average lead/lag time of a time series

desert oar
#

What do you mean "average" lead/lag?

lapis sequoia
#

Can anyone pls explain me p-value in layman terms

#

I seriously can only understand a bit

#

i do understand what the null and alternate hypotheses are

lean ledge
#

When p<0.05, there is less than 5% chance that the results were a fluke of random chance

#

< 0.1 -> less than 10%

#

Etc

lapis sequoia
#

so what does accepting the null hypothesis means when p>0.05

#

i was watching a video; it gave an example where the null hypothesis is that people are on my website for a population average time of 20 min before a change, and the alternate hypothesis is that people are on my website for more than 20 min

#

then it set significance value = 0.05

#

then it took sample mean of 100 people and found out to be 25 min

#

after that i didnt understand a thing

#

so can u tell me based on this example what exactly is p value..

#

like this much i understood that lower the p value lower are the chances that my observation was just a random chance

lean ledge
#

@lapis sequoia p>0.05 means there's more than 5% chance that your results was due to chance

#

We consider that too likely

#

Hence we consider it to mean that "the experiment did not show the relationship we expected"

#

Hence we are unable to reject the null hypothesis

#

Say your hypothesis is A is correlated with B

#

Null hypothesis: there is no relationship

#

You do the experiment and find that A is correlated with B

#

But

#

There's a greater than 5% chance that it is due to random chance that you got that result

#

Hence you are unable to reject the idea that they are unrelated

#

And can't accept the hypothesis

lapis sequoia
#

Ok ok..

desert oar
#

@lean ledge be careful, it means that if the null hypothesis is true there's more than 5% chance that your results was due to chance
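
The website example from above can be worked through as a one-sided z-test. The chat never gives the spread of visit times, so the population standard deviation of 25 minutes below is purely an assumption for illustration; the other numbers (20 min, 25 min, 100 people, 0.05) come from the example.

```python
from math import sqrt
from statistics import NormalDist

mu_null = 20.0       # null hypothesis: mean time on site is 20 min
sample_mean = 25.0   # observed mean of the sample
n = 100              # sample size
sigma = 25.0         # ASSUMPTION: population standard deviation, not given in the chat
alpha = 0.05         # significance level

standard_error = sigma / sqrt(n)
z = (sample_mean - mu_null) / standard_error   # 2.0 with these numbers

# p-value: probability, computed as if the null were true, of seeing a sample
# mean at least this far above 20 min (one-sided test).
p_value = 1.0 - NormalDist().cdf(z)            # about 0.0228

print(round(z, 3), round(p_value, 4))
print("reject null" if p_value < alpha else "fail to reject null")
```

Note that the p-value is computed under the null hypothesis. It says how surprising the data would be if nothing had changed, not how likely the null hypothesis itself is.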

sonic girder
lapis sequoia
#

I am just curious do we use the 3,4,5 method ever or do we use backward elimination most of the time

#

coz at least thats what i am learning.. just backward elimination

lean ledge
desert oar
#

I know what a hypothesis test is

lean ledge
#

What was I being corrected on?

desert oar
#

p>0.05 means there's more than 5% chance that your results was due to chance

#

that's only true under the null hypothesis

#

which of course is the whole point

#

if you get a value that's "rare" under the null (in this case 1-in-20 or rarer), then we say we don't believe the null

#

ah wait i think i misunderstood your comment 😄

wide gyro
#

Is anyone good with Pandas and could help me with a problem regarding the csv reader?

lapis sequoia
#

!ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

wide gyro
#

I am using csv reader to read data from one file and get rid of all the rows that are missing a value, and when I use dropna(inplace=True), it works fine. However, I want to exclude some columns from that, so I tried dropna(subset=[]). Unfortunately, when appending to a new csv, that file is actually larger in size than my previous one.

supple ferry
#

@wide gyro , hi. First, try to avoid inplace=True if at all possible. It is shorter, but it will bring you more headaches in the long run. Also, the pandas developers have discussed deprecating it.
Can you manually check both files and report their outputs?
Size can also be affected by data types

#

shape of dataframes and which types you have

silent swan
#

huh, why is inplace getting dropped? It sounds like something has reasonable usecases

wide gyro
#

@supple ferry I will update you in a bit with the output file, I didn’t look too much into it as I still have a decent sample size but would like to refine it as well as I can

#

Also, how would I pass the csv reader to other functions? I’d like to manipulate the data however I’d want once it’s read through but I’m not sure how to pass it through. New to Python but I understand the basics due to knowing a couple other languages

#

Would I need to put the reader into a list or dictionary? I figured I wouldn’t need to as I can call the column and row number in the method I initialize the reader, and it returns whatever I need. However, when trying to use it in certain functions, it says something like “Missing argument”

supple ferry
#

what do you want to achieve by putting it into the function?? you want to read on demand ??

wide gyro
#

@supple ferry I want to read the data and be able to use it for whatever I’d like, with one column being time that I’m converting or simple arithmetic use

turbid bay
#

how does one get the mnist dataset and how does one use it?

#

im using my own made neural network using pygame

#

numpy*

#

oops 😂😂

lapis sequoia
#

my text file has uneven number of lines

#

want to read to dataframe.. how should I approach this

#

I want everything in one column

desert oar
#

@silent swan I think in some cases it was actually less efficient, but it was misleading people into thinking it was somehow a performance optimization; also it leads to two discordant and incompatible programming styles, rather than just one

#

@wide gyro what CSV reader? The native python one, or the pandas function?

#

If you're getting an error in your code nobody can help you unless you post a sample of code that demonstrates the error, and also the full error message

#

Usually when a data frame is bigger than you expect, it's because of a join/merge that went wrong

gilded dagger
#

I have a few questions about Machine Learning, is this the right thread for it?

#

In particular, I'd like to do some lip reading using TensorFlow, but I'm unsure as to what's already been done. Anybody knows which projects are still maintained?

supple ferry
#

!ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

wide gyro
#

@desert oar Pandas

void anvil
#

Where is the default save directory for WSL? I spent a couple days working on a project. Went to open it up today and it's gone. The .csv's I created in the folder are there, but the code is gone.

random jasper
#

Sorry if this is the wrong section. I'm trying to use OpenCV to analyze images of particles. I have gotten to the point where I have binarized the image and the particles are decently defined, but what functions should I be looking at to analyze, say, the area or the diameter bounded by a contour?

void anvil
#

And is there a way to just search the WSL drive

sand reef
#

@random jasper I am not very well versed in openCV and there might be something already existing which does what you are asking for, but well, you can convert it into an array, and use a condition and mark those areas. If you can mathematically represent a contour that is.

void anvil
#
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
   1504         try:
-> 1505             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1506         except TypeError:

~/.local/lib/python3.5/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    207     if use_numexpr:
--> 208         return _evaluate(op, op_str, a, b, **eval_kwargs)
    209     return _evaluate_standard(op, op_str, a, b)

~/.local/lib/python3.5/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     67     with np.errstate(all='ignore'):
---> 68         return op(a, b)
     69 

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')


#line of my code
-> 1155    l = (a + 2 * i + 2 * j + k)/6


~/.local/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(left, right)
   1581             rvalues = rvalues.values
   1582 
-> 1583         result = safe_na_op(lvalues, rvalues)
   1584         return construct_result(left, result,
   1585                                 index=left.index, name=res_name, dtype=None)

#
   1527         try:
   1528             with np.errstate(all='ignore'):
-> 1529                 return na_op(lvalues, rvalues)
   1530         except Exception:
   1531             if is_object_dtype(lvalues):

~/.local/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
   1505             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1506         except TypeError:
-> 1507             result = masked_arith_op(x, y, op)
   1508 
   1509         result = missing.fill_zeros(result, x, y, op_name, fill_zeros)

~/.local/lib/python3.5/site-packages/pandas/core/ops.py in masked_arith_op(x, y, op)
   1024         if mask.any():
   1025             with np.errstate(all='ignore'):
-> 1026                 result[mask] = op(xrav[mask], y)
   1027 
   1028     result, changed = maybe_upcast_putmask(result, ~mask, np.nan)

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
sand reef
#

?

#

pandas issue?

void anvil
#

Not sure

#

maybe

#

I'm running on WSL for the first time

#

porting over some code

#

I've never seen anything like it running on windows

sand reef
#

i see..

void anvil
#

trying to figure out what the error is actually saying

sand reef
#

seems to be stemming from pandas....so i guess something went wrong there..

#

didcha google this issue?

#

the type error, just copy and paste and see what it gives?

#

see if this helps

#

if this doesn't help, well, i am sorry, this, for now, is out of my league...
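One thing the error message does say: `<U32` is NumPy's dtype code for unicode strings, so at least one operand in `a + 2 * i + 2 * j + k` is probably text rather than a number (for example a column that was read in as strings). A rough sketch of checking and converting, where the series `a` is invented for illustration:

```python
import pandas as pd

# Hypothetical column that ended up holding text instead of numbers.
a = pd.Series(["1.5", "2.0", "oops"])
print(a.dtype)

# Convert to numbers; anything that cannot be parsed becomes NaN.
a_num = pd.to_numeric(a, errors="coerce")
print(a_num)
```

Printing `.dtype` for each of the operands is a quick way to find which one is not numeric.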

wide gyro
#

I am passing my pandas dataframe to a class's method and using df.at[row, column] to check if a certain value equals 0 or 1, and returns a String based on the outcome. However, nothing is being returned yet the program is being terminated without giving an error.

#

If I try to print that same df.at[row,column] outside of the method, though, I receive the value I need

desert oar
#

@wide gyro did you forget to write return in the function? you'll need to show your code

wide gyro
#
class CellTower():

    def __init__(self,data):
        self.changeable = data.changeable

    def checkChange(data,int1):
        if data.at[int1,'changeable'] == 1:
            return 'Firm'
        else:
            return 'Processing'

data = pd.read_csv('towers.csv')
CellTower.checkChange(data,3)
#

@desert oar

#

One suggested that I go through cmd line and that would give me the true error, but I'm struggling to get that going

desert oar
#

and that code reproduces your problem?

wide gyro
#

Well I found out that trying to access any method inside that class doesn't work

desert oar
#
    def checkChange(data,int1)

this needs a :

#

it's likely that your class is failing to be instantiated and your program isn't even running

#

cause that's a syntax error

wide gyro
#

Oh, it has that

#

I must've accidentally deleted it when putting it on here

desert oar
#

is checkChange supposed to be a static method

wide gyro
#

Yes

desert oar
#

ok

#

did you forget to write @staticmethod then?

#

well actually

#

@classmethod in this case

#

well... oh

#

that's weird

#

that should work but it's not recommended

#

anyway, i can't reproduce your problem, works fine

wide gyro
#

It appears that my program ends once it finishes the csv reader

#

Calling any method in the class I'm trying to get from returns nothing

#

I've been stuck on this for a few hours now, not sure what I can do

#

Been trying to figure out the cmd line too, looks like that's my only hope

desert oar
#

are you expecting it to print the output?

#

is that the full script you have?

#

what do you mean "returns nothing" -- clearly that method returns something

#

it's also not even really a method, the way it's defined, more of a namespaced function. that will fail if you actually instantiate the CellTower class
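A sketch of what the `@staticmethod` version could look like. The small `towers` frame is a stand-in for `towers.csv`, and `check_change` is just a renamed `checkChange`. Note that in a script a bare call does not display its return value; it has to be printed or stored.

```python
import pandas as pd


class CellTower:
    def __init__(self, data):
        self.changeable = data.changeable

    @staticmethod
    def check_change(data, row):
        # No `self` parameter: this works on whatever DataFrame is passed in,
        # whether it is called on the class or on an instance.
        if data.at[row, "changeable"] == 1:
            return "Firm"
        return "Processing"


# Stand-in for pd.read_csv('towers.csv')
towers = pd.DataFrame({"changeable": [0, 1, 0, 1]})

status = CellTower.check_change(towers, 3)
print(status)
```

Without the decorator, calling the function on an instance would pass the instance itself as `data`, which is the failure mentioned above.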

wide gyro
#

well right now I'm just using a simple IDE and going to transfer it to Linux machine once I sort this all out

#

Then that's probably where I go wrong

#

What's the difference between method and namespaced function?

desert oar
#

can you share more of the code?

#

a namespaced function is something i made up...

#

it shouldn't matter

#

point being, i unfortunately can't help because i can't reproduce the error and it's not clear what's going wrong

wide gyro
#

Hm, have you ever used the windows cmd line?

#

@desert oar

desert oar
#

yes

#

what IDE are you using though

#

it shouldn't matter

wide gyro
#

Spyder

#

Anaconda

desert oar
#

ok

#

how are you running this

wide gyro
#

the .py file?

desert oar
#

how are you running the code

#

and what output are you expecting. some kind of print-out?

#

and can you share a more representative piece of code that demonstrates the issue?

#

unfortunately i have to head out now but maybe someone else can see this and help

modest scarab
#

I might be overreaching but i could be using wrong keywords

#

Is there such a thing where i can predict what a person will say based on time of the day

#

Using previous messages in a group chat?

#

how can i go about finding resources for this?

desert oar
#

yep, that's something you might call a "language model". usually they just predict the next word in a sentence but one of them can probably be adapted for whole text messages

#

it's not really my area of expertise but that might at least get you started. you can also look into the blog-literature on chatbots, which is abundant

lapis sequoia
#

hi

#

I need some help..

#

How do I load my text file into a dataframe? I want the dataframe to have only one column.. the text file is delimited between sentences by a newline, and the length of the lines is not fixed.. so I'm running into issues

desert oar
#

just read the file as a list

#

then create a dataframe with that list as the only column

#
with open('myfile.txt') as f:
    text_lines = [line.strip() for line in f]

data = pd.DataFrame({'text': text_lines})
lapis sequoia
#

@modest scarab you're talking about smart reply.. like in gmail.. there are pretrained embeddings that can help you do this.. just base next message prediction on the previous one, or on the fly

#

thanks a bunch!

desert oar
#

ahhh "smart reply" thats what people call it

#

you might need to train your own model to incorporate "external" features like time of day

lapis sequoia
#

or just switch between models depending on time of day.. easier

desert oar
#

i'd only do that as a last resort if i couldn't use fine-tuning or reuse the embeddings in another model. but again i don't know this area specifically

modest scarab
#

this is probably too advanced and probably havent been done

#

but it isn't "smart reply" that i am looking for

#

it's just basically what sort of words or phrases my friend would predictably say

#

at a certain time of the day

lapis sequoia
#

it's not advanced..

#

the next level down is if else blocks :v

desert oar
#

@lapis sequoia i wouldnt say that at all man. those language models are really complicated

lapis sequoia
#

well it's all relative.. for me programming concepts are hard at the moment.. but language comes easy..

desert oar
#

language? maybe. but actually understanding the math and design decisions that goes into a SOTA language model? i dont think anyone would ever call that easy

#

unless you're already very experienced in machine learning

lean ledge
#

tbf, dont have to understand something to reuse pretrained SOTA models

lapis sequoia
#

math yes.. design decisions.. I'm not really sure these were built to be efficient.. just as poc..

wide gyro
#

What is the correct format to set your constructor with your dataframe?

#

I figured you would just set the init and match the columns like

#
class CT:
    def init(self,data)
        self.radio = data.radio
        self.cell = data.cell
        self.range = data.range
data = pd.read_csv("ct.csv")
#

Would I then do

#
ct1 = CT(data.iloc[i])
#

or

#
ct1 = CT(data[i])
desert oar
#

@wide gyro what are you trying to do exactly

#

you want radio, cell, and range to be columns in the data frame?

wide gyro
#

I might be approaching this entirely wrong

#

But I wanna make instances of class CT that could hold a row of my dataframe

desert oar
#

oh i see

#

what you wrote should actually work

#

oh youre asking about iloc

#

iloc is positional indexing

#

loc uses the index of the dataframe

#

[] changes meaning depending on what you pass into it

#

i always use .loc or .iloc for extracting rows, for clarity
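A small sketch of the three forms, using an invented frame with a non-default index so the difference is visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"radio": ["GSM", "LTE", "UMTS"], "range": [1000, 500, 750]},
    index=[10, 20, 30],
)

first_by_position = df.iloc[0]  # first row, whatever its index label is
row_by_label = df.loc[20]       # the row whose index label is 20
radio_column = df["radio"]      # [] with a string selects a column

try:
    df[0]  # [] with 0 looks for a COLUMN named 0
except KeyError:
    print("df[0] raises KeyError: there is no column called 0")
```

So to hand one row to a constructor, `df.iloc[i]` (by position) or `df.loc[label]` (by index label) is the unambiguous choice.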

wide gyro
#

I'm getting back "init() missing 1 required positional argument: data" with iloc, but when I just use say "data[i]" I get a KeyError for whatever number is i

desert oar
#

oh

#

__init__ not init

wide gyro
#

Yeah, sorry, is there a major difference? Can I just change it to init?

#

if init works and init doesn't?

desert oar
#

no

#

python specifically looks for __init__

#

not init

wide gyro
#

I tried putting init into the code but it made me update it

desert oar
#
class CT:
    def __init__(self,data):
        self.radio = data.radio
        self.cell = data.cell
        self.range = data.range

data = pd.read_csv("ct.csv")

ct1 = CT(data.iloc[1])
#

that should work

#

err try now. syntax

#

you were missing a :

wide gyro
#

oh oops, I always mess up translating the code over, I have it on desktop but use my laptop for this

#

apologies

#

but now that i initialized ct1, how could i extract say "self.radio"?

#

I couldn't right

#

or like

#
ct1.radio
#

that wouldn't work right

desert oar
#

why not?

#

did you try it?

wide gyro
#

I did right before you sent that, works well noice

#

I'm finally getting somewhere hehe

#

@desert oar are you familiar with dropna as well?

desert oar
#

yes

wide gyro
#

I tried using dropna(subset=['']) but I don't believe it gets rid of any data, and instead it just adds another index, which takes up more storage

#

but when I use inplace=True, it lowers the size of the file

#

Well I guess that might be because every column I'm taking out of subset could be the only ones containing missing data?

#

Also, how would I stop it from adding another index if I use the subset drop

desert oar
#

er

#

you have a column called ''?

wide gyro
#

No, I was just putting it in

desert oar
#

??

#

the effect youre describing doesnt match up with what you are saying you did

#

can you share actual code

wide gyro
#

Now that I'm thinking about it I think the reason it isn't dumping any rows is because everything I'm subsetting in dropna contains data in each row and every column I'm excluding are the ones that are missing the data

#

which is why inplace=True works better but I can fix that

desert oar
#

why would inplace=True have any bearing on this at all

wide gyro
#

However do you know how to stop the subset from adding another index in front of the updated csv file?

desert oar
#

also inplace=True is generally discouraged, and the pandas devs have talked about deprecating it

#

it shouldn't add an index

#

that's why im confused

#

and want to see your code

wide gyro
#

Yeah I was told not to use it, inplace doesn't add an index but subset does

#

alright one sec

#
def clean(df):
    #df.dropna(inplace=True)
    df.dropna(subset=['radio', 'range'])
    df.to_csv('updated.csv')
#

Is that correct format?

desert oar
#

no subset should not add an index

#

but that's correct syntax yes

#

oh i know whats happening

wide gyro
#

I genuinely think it's just that every column I'm adding to that subset isn't missing a single piece of data and the columns I'm excluding are the ones that are

desert oar
#

that shouldn't have anything to do with any index

wide gyro
#

yeah

desert oar
#

so the extra index is coming from somewhere else

#

do you have a column in the csv that you already intend to use as an index?

wide gyro
#

Well if I took a raw csv file that didn't have an index in first column and ran it through that clean method, it would add an index to first column and shift everything else 1 over

desert oar
#

pandas always adds an index

wide gyro
#

and if I were to clean that clean file, it would add another

desert oar
#

every pandas dataframe has an index

wide gyro
#

So just keep it?

desert oar
#

if you want to omit the index when saving, use to_csv(..., index=False)

wide gyro
#

I'm never gonna clean a clean csv so I don't have to worry about 2 index's, but will removing the initial index cause any issues down the line?

#

I guess it doesn't matter that much if I have it, but with millions of rows I feel like it adds some storage

desert oar
#

no, unless you are planning to use the index for something

#

it does add

#

if you aren't using it, omit it when saving
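A quick round trip showing where the extra column comes from, using an in-memory buffer instead of a real file (the frame is made up):

```python
import io

import pandas as pd

df = pd.DataFrame({"radio": ["GSM", "LTE"], "range": [1000, 500]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread.columns))
print(list(reread_clean.columns))
```

Each save-and-reload cycle without `index=False` adds one more of those unnamed columns.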

wide gyro
#

so it would be something like

#
df.to_csv('updated.csv',index=False)
#

I was also initially using data as the df's variable name, but I switched it to df because I feel like data could be a keyword in some cases

desert oar
#

its not a keyword

#

go ahead

wide gyro
#

oo

#

noice

wide gyro
#

I'm trying to use chunksize to only get a portion of the csv file I'm reading off, but that turns the dataframe into a TextFileReader which I don't want

#

if I'm using pandas csv reader do I have to use something other than chunksize or do I have to read everything from the file?

#

Or do I somehow have to convert it back from a textfilereader to dataframe

earnest spear
#

I'm not sure how to quote on discord, but your message from 10:11 with the code snippet- you do the df.dropna() without inplace=True, but you don't ever set anything to its return value. I think you need df = df.dropna(...) since you aren't doing it inplace anymore

wide gyro
#

@earnest spear what do you mean by that?

#

Are you saying if I append to another csv file after dropna, it would keep the values I used before dropna?

#

But if I do df = df.dropna, it would take the new values into other csv file?

earnest spear
#

df.dropna() returns a dataframe with the na values dropped. It doesn't modify the dataframe you call it on (unless you use inplace=True)

wide gyro
#

so if I wanted to use the subset dropna I would have to set df equal to it

earnest spear
#

So in the code snippet you posted, the df.dropna() is functionally doing nothing because you don't assign its return value to anything

#

Right

wide gyro
#

Gotcha, can't believe I never thought of that

#

I'm assuming I'd do the same thing for fillna

earnest spear
#

Pandas as a whole likes the idea of immutable dataframes. Most operations don't change the dataframe but rather return a new one. The inplace=True is saying, instead of returning a new dataframe, I want to change the dataframe I'm calling this on.

wide gyro
#

So could I use the subset followed by inplace = true

#

or would that not work

#

instead of setting it equal

#

like do they accomplish same thing

earnest spear
#

df = df.dropna() and df.dropna(inplace=True) will yield the same result (df being a dataframe with the NA values dropped). It's functionally different in that the first creates a new dataframe, whereas the second just changes the existing one, but they do accomplish the same thing.

#

The use of subset or other arguments shouldn't affect the usage of inplace
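Putting that together, a sketch of a `clean` function that keeps the result of `dropna`. The `raw` frame is invented; the column names follow the snippet posted earlier.

```python
import numpy as np
import pandas as pd


def clean(df):
    # dropna returns a NEW DataFrame; the original is left untouched,
    # so the result has to be kept and returned.
    return df.dropna(subset=["radio", "range"])


raw = pd.DataFrame(
    {
        "radio": ["GSM", None, "LTE"],
        "range": [1000.0, 500.0, np.nan],
        "cell": [1, 2, 3],
    }
)

cleaned = clean(raw)
print(len(raw), len(cleaned))
```

Saving would then be `cleaned.to_csv('updated.csv', index=False)` on the returned frame rather than on the original.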

lapis sequoia
#

I have one question too regarding pandas.. do u think i should buy a book regarding pandas library.. coz i know its very important library?

#

like do i need to know it on fingertips?

earnest spear
#

My opinion is that the best way to learn pandas is to just use it for something. Its documentation is pretty good and usually any question you have has a solution on stack overflow since it's so widely used. But if you learn well from books then it's never bad

lapis sequoia
#

So i can understand everything from documentation?

#

Like usually documentation are complicated..u get everything but its a little hard for me to understand

earnest spear
#

Yeah that is the issue - pandas can go pretty deep, but being able to parse the documentation for what you need is a good skill to get comfortable with

wide gyro
#

I'm trying to use iterrows to compare some floats that have been selected and then run it through each line of the dataframe's same float

#

but i'm not sure how to access that df's float

#

I tried using at and iat but it says that i need integer indexers

#

and when i use loc and iloc it returns "KeyError: None of [Index] are in the [Index]"

lapis sequoia
#

is Linear Regression more reliable or support vector machines?

#

i noticed that when i use svm.SVR() to recognise a pattern it only works to a certain extent and after that it gets it wrong, whereas LinearRegression() was on point when predicting the pattern

#

so which would be better to use when it comes to Predicting stock prices?

wide gyro
#

Is there a way I can drop any values that are NaN but not those with an input of 0?

#

dropna for pandas appears to get rid of 0/NaN but some of the 0 values are good for me
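For what it's worth, `dropna` only removes missing values (NaN/None); zeros are ordinary values and stay. A tiny check on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 3.0, 0.0])

kept = s.dropna()
print(kept.tolist())
```

If zeros seem to disappear, something other than `dropna` is likely removing them, or they were never read in as 0 in the first place.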

hollow quartz
#

Does pandas take missing values into account when calculating correlation?

desert oar
#

@wide gyro thats not what chunksize does. Read the docs carefully

#

@wide gyro you are asking a lot of XY questions. Maybe start by saying what you are trying to do first, then tell us what specific thing isnt working

wide gyro
#

Yeah I switched to nrows and it works fine @desert oar
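A sketch of the difference, reading from an in-memory string instead of a real file (the CSV content is made up):

```python
import io

import pandas as pd

csv_text = "radio,range\nGSM,1000\nLTE,500\nUMTS,750\nGSM,250\n"

# nrows: read only the first N rows and get an ordinary DataFrame back.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# chunksize: get an iterator that yields the WHOLE file, N rows at a time.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
sizes = [len(chunk) for chunk in chunks]

print(len(head), sizes)
```

So `chunksize` is for processing a large file piece by piece, while `nrows` is for taking just the top of it.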

sand reef
#

@hollow quartz if you mean the NaN values, DataFrame.corr() leaves them out: each correlation is computed only from the rows where both columns have a value.

hollow quartz
#

@sand reef so if a row has a missing value, that row might not be used?
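A quick way to see it, on invented data: with two columns, the correlation from `corr()` matches the one computed after dropping every row that has a NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, np.nan],
        "y": [2.0, 4.0, np.nan, 8.0, 10.0],
    }
)

# corr() skips rows where either x or y is missing.
r_all = df.corr().loc["x", "y"]

# Same thing done by hand: drop incomplete rows first.
r_complete = df.dropna().corr().loc["x", "y"]

print(r_all, r_complete)
```

With more than two columns the exclusion is done per pair of columns, so a row missing only `z` still counts toward the x/y correlation.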

lapis sequoia
#
#Building Multiple Linear Regression Model
import numpy as np
import statsmodels.api as sm  # OLS lives here; newer statsmodels no longer exposes it via formula.api
#adding the orignal x to a column of ones so that ones column is at first
x = np.append(arr = np.ones((50,1)).astype(int), values = x, axis=1)
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

#removing the index x with p values greater than SL 
x_opt = x[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

x_opt = x[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

x_opt = x[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

x_opt = x[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
#

Is there a way in which i can run a loop that checks the p values and automatically removes any variable whose p value is greater than 0.05..

#

I know automated backward elimination can be done using r squared values.. but any way of p values

#

Also after i did the backward elimination now what.. do i create a new test set and training set and predict new values?

ripe sundial
#

Heya. What exactly is the difference between MSE in keras and the loss? The MSE in my case is as seen in the image. I am not sure I understand why it is growing or if the values are good

sand reef
#

Mean squared error is growing means something is wrong. It should go down.

#

MSE is just basic telling if you are close or far. Loss is used to train your model.

#

It has the answer to all such technicalities.

#

@lapis sequoia try making a list with those indexes. Now remove values from that list with your loop and condition. And then pass that list of indices into the x_opt = x[:, custom_list]

lapis sequoia
#

We can also use .pop method right?

sand reef
#

Well. Pop will only remove the last element of the list

lapis sequoia
#

oh ok

sand reef
#

Wait. I think you can use that.

#

I got it confused for the stack pop

lapis sequoia
#

Ok...

sand reef
#

Sorry about that

#

An error has occurred?

lapis sequoia
#

Ok so what u r saying is I make a list with all my indices and then i compare p values?

#

and gradually remove it and then fit in x_opt

sand reef
#

Yes

lapis sequoia
#

ok

#

Now how will get p values?

sand reef
#

Good question.

lapis sequoia
#

Is there any particular command?

#

Ok wait i saw this somewhere..

#
regressor_OLS.pvalues[j].astype(float)
#

Is this any good?

sand reef
#

Welp. You could try it.

#

I am not familiar with the api

lapis sequoia
#

Ok well thanks for the idea tho.. i will try to implement this

sand reef
#

No problem!

lapis sequoia
#
def backwardElimination(x, sl):
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
    regressor_OLS.summary()
    return x
 
SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)
#

@sand reef is this something it would look like? This was the code given to be done in automatic backward elimination but to be honest i didnt get it somewhat

#

Like the numVars = len(x[0])

sand reef
#

Numvars = len(x[0]) means number of columns in x

lapis sequoia
#

Ok..

sand reef
#

Well. Kind of yes. It is kind of similar to what I suggested.

#

Instead of making a copy and assigning it to x_opt

#

It directly removes the index from x itself

#

Altho I am kind of afraid of the fact that this might throw a index out of range error

#

Nah it won't.

#

i is never used in indexing, so it won't throw.

#

So, what's the issue?

lapis sequoia
#

Ok.. so i are all the entries in a particular line right?

sand reef
#

Yus. Index of the columns.

lapis sequoia
#

Ok and then j is picking up each entry in that line

#

ok this is a very noob question but why did we do numVars - i?

#

also just to confirm

sand reef
#

No. J is still a column index.

lapis sequoia
#

i will look something like this a b c d

sand reef
#

Because if you see in np.delete(), we are passing it with axis =1

lapis sequoia
#

and not this:
a
b
c
d

sand reef
#

Well. The one above this comment is a column.

lapis sequoia
#

ok .. sorry i am getting confused in loops..

#

So j is column index

sand reef
#

Yes

#

So what is being done here is,

#

The largest p value across all the columns is taken and compared with sl. If it is above sl, the column whose p value equals that max is removed from x

#

And the numvars - i

#

That is saying that, after i passes, up to i columns may already have been deleted, so only numVars - i columns (and p values) are left to check in that iteration.

lapis sequoia
#

Oh ok.... now i get it

#

so in first for loop it does i=[a,b,c,d] takes out the max p value out of them

#

and then comparing it with sl so the first time i will be 0 so j will take all index .. and then delete the max value

sand reef
#

Yes

lapis sequoia
#

and then the second loop will continue till the condition is satisfied i.e. all the indexes with p values more than sl will be removed and then it will move onto next line starting from the first loop

#

right?

sand reef
#

Yus

lapis sequoia
#

Ok... jeez thanks a lot

sand reef
#

Yeah. All values will be removed until either nothing remains or only columns with p values less than sl remain.
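Here is a rough sketch of the same idea written as a single loop. To keep it free of extra dependencies it computes the p values itself with a normal approximation, which is an assumption of this sketch; with statsmodels you would use `sm.OLS(y, X).fit().pvalues` instead. The names `ols_pvalues` and `backward_eliminate` and the random data are made up for illustration.

```python
import math

import numpy as np


def ols_pvalues(X, y):
    """Two-sided p values for OLS coefficients (normal approximation)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    z = beta / std_err
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])


def backward_eliminate(X, y, sl=0.05):
    """Return the indices of the columns that survive elimination."""
    kept = list(range(X.shape[1]))
    while kept:
        p = ols_pvalues(X[:, kept], y)
        worst = int(np.argmax(p))  # position of the largest p value
        if p[worst] <= sl:
            break                  # everything left is significant
        kept.pop(worst)            # drop that column and refit
    return kept


# Made-up data: column 0 is the intercept, column 1 drives y, 2 and 3 are noise.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack(
    [np.ones(n), rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)]
)
y = 3.0 * X[:, 1] + rng.normal(size=n)

kept = backward_eliminate(X, y)
print(kept)
```

Keeping a list of surviving column indices, as suggested earlier, means the original `X` is never modified and the final model is just `X[:, kept]`.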

#

No problem! Now I am off.

lapis sequoia
#

ok

wide gyro
#

Do NaN values and 0 give the same result if you used that part of your data?

#

Like if I asked for a row of a column that contains 0 or NaN, will they both return 0?
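They are not the same thing. A short check on an invented column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"range": [0.0, np.nan]})

zero_value = df.at[0, "range"]     # a real measurement of 0
missing_value = df.at[1, "range"]  # NaN: no value recorded

print(zero_value == 0)        # comparing a real 0
print(pd.isna(missing_value)) # NaN has to be tested with isna

# Only if missing really should mean 0 for your data:
filled = df["range"].fillna(0)
```

NaN never compares equal to anything, including 0, so `== 0` checks will skip missing rows.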

ripe sundial
#

@sand reef Would you expect it to look like this: Also what does the number 5.X mean? Is it a lot?

#

Also @sand reef I do not use the MSE as loss but rather a metric (in Keras)

heavy tundra
#

I'm trying to learn how to use Beautiful Soup by scraping data from Twitch.tv

#

# Twitch.tv Top Channels Data Scrape

import csv
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

URL = "https://www.twitch.tv/directory/all"
uClient = uReq(URL)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html , "html.parser")
containers = page_soup.findAll("div",{"data-target" : "directory-game__card_container"})
containers2 = page_soup.findAll("div",{"class" : "tw-mg-b-2"})
print(len(containers))
print(len(containers2))

#

I got this far

#

but it says both containers have length 0

#

containers and containers 2 should be how I separate each of the channels on the top channels page but it doesn't pick them up

#

also I'm sorry I just dumped the code in chat, I don't know how to format it in this server

shut widget
#

for x in v:
print(x)

#

dangit

#

how do you make it activate

#

oh

#

this isn't a help channel

#

@heavy tundra is this data science related?

heavy tundra
#

yeah

#

I can go to a different channel though

shut widget
#

seems like a scraping question to me

heavy tundra
#

is that not data science

shut widget
#

scraping? uh, no...?

heavy tundra
#

it would lead to a dataset and statistics

shut widget
#

yes but the scraping itself is not data science and that's what your question is about

hardy solstice
#

Hello

#

I want to start a career in data science

#

do i have to master numpy/pandas first or can i pick them up like any other lib

#

so i start in Data visual.. and ML

void anvil
#

@supple ferry numba is lit. I'm seeing speed ups of 100-1000x with minimal work. You basically convert pandas objects to np objects and wrap loops in @jit functions.

earnest prawn
#

@hardy solstice libs like numpy for example "just" provide implementations of common mathematical concepts, if you understand them mastering numpy is a question of reading the docs

hardy solstice
#

this is what i mean, it is just normal lib where i can go through docs

#

but someone told me i have to master it as first step

earnest prawn
#

if you know the maths behind it Id call bullshit on that

void anvil
#

^

#

Just know how to use it to do your data transforms from A to B in a timely manner

hardy solstice
#

oooh this explains alot

sand reef
#

@ripe sundial yes that's what it's supposed to look like

#

5.x?

ripe sundial
#

@sand reef so I found the error (no pun intended :D) thanks to @feral lodge Turns out MSE is not usually used for classification problems, this is why the MSE I linked earlier is wrong

sand reef
#

I see. So you were classifying.

void anvil
#

Is there an equivalent of np.argmin() that is compatible with numba or should I rewrite it in python and use it there?

sand reef
#

It's mentioned there.

void anvil
#
            index_min = np.argmin(close.iloc[i-25:i].values)
#

I'm on the latest pip3 install of numba and it's not liking it
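A guess at what is going wrong: `np.argmin` itself is one of the NumPy functions numba supports, but pandas objects and `.iloc` are not usable inside a compiled function. A sketch that passes a plain array in instead; `rolling_argmin` is a made-up name, and the fallback decorator is only there so the snippet also runs without numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # run as plain Python if numba is not available
    def njit(func):
        return func


@njit
def rolling_argmin(values, window):
    # For each position with a full window behind it, store the absolute
    # index of the smallest value in that window; -1 where there is no
    # full window yet.
    out = np.full(values.shape[0], -1, dtype=np.int64)
    for i in range(window, values.shape[0] + 1):
        out[i - 1] = (i - window) + np.argmin(values[i - window:i])
    return out


values = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
result = rolling_argmin(values, 3)
print(result)
```

With a DataFrame you would call it as `rolling_argmin(close.values, 25)`, doing the pandas-to-NumPy conversion once outside the compiled function.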

wide gyro
#

I'd like to take a column of time in my dataframe that is set to Epoch, and change it to the exact date

#

which I accomplish in one of my functions

#

but how do I loop through an entire dataframe and run that time column through my epochTimeConverter function until there are no more rows

#

I tried this but I'm not sure if it'd work

#
df.created = CT.epochCre()
df.updated = CT.epochUpd()
#

would that put every value in created and updated column through their respective functions?

#

Got TypeError: epochCre() missing 1 required positional argument: 'self' when I tried to run it
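The TypeError comes from calling an instance method on the class itself, so there is no `self`. For converting whole columns, though, no loop or class is needed. A sketch assuming the epoch values are in seconds (the frame is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "created": [0, 86400],
        "updated": [1500000000, 1500003600],
    }
)

# Vectorized: converts every row of the column in one call.
df["created"] = pd.to_datetime(df["created"], unit="s")
df["updated"] = pd.to_datetime(df["updated"], unit="s")

print(df)
```

If the values are in milliseconds, `unit="ms"` is the corresponding setting.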

void anvil
#

I use dt.datetime.now()

lapis sequoia
#

I just watched a YouTube ML program by some expert at Google. Man it's so intimidating to not know so many concepts in machine learning.
The code seems so unrealistic and difficult to understand. It makes me wanna give up.
I am a beginner, been studying for few months and feels like am no where near.

What to do when you're confidence level is down?

junior dragon
#

Every beginner is same as you, you aren't alone

#

Listen to some music ig

steep herald
#

Anyone awake that can answer a question about SaaS metrics?

#

Posting it in Help-5 if any of you can help

lapis sequoia
#

hi

#

how do I drop rows in my df, if they contain datetime..

modest scarab
#

@lapis sequoia is it already a csv?

#

most of the time, it's easier to manipulate the csv in SQL and then save it as a new csv file

lapis sequoia
#

no it's not csv

#

it's in the dataframe

#

I read a text file to get this

#

im thinking I need to do something like

#

df = df.drop(df[df.text.str.contains].index)
#

but not sure how to implement the contains here..for excluding datetime
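One way this could look, assuming the timestamps appear as `YYYY-MM-DD HH:MM:SS` somewhere in the line. The pattern is a guess and has to be adjusted to whatever format the file really uses; the sample lines are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": [
            "hello there",
            "2019-03-14 10:11:00",
            "see you at 5",
            "[2019-03-15 08:00:01] joined",
        ]
    }
)

# Assumed timestamp format; change the regex to match your data.
pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

has_timestamp = df["text"].str.contains(pattern, regex=True)
df = df[~has_timestamp]  # keep only the rows WITHOUT a timestamp

print(df["text"].tolist())
```

Filtering with a boolean mask does the same job as `drop(...index)` and is a little easier to read.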

steep herald
#

subscription_start subscription_end
35:09.0
09:48.0
00:51.0
53:30.0 10:28.0

I am trying to apply ARPU & Churn rate.

but example 53:30.0 or format XX:XX.X doesn't look like any date format I can find online.

#

i think its most likely minutes:seconds

However start is sometimes bigger than end.

For instance

28:33.0 and 16:02.0

if there was a counter that went up every time it exceeds an hour I would understand. but there is nothing

lapis sequoia
#

need code to help buddy..

foggy bridge
#

Hello guys

sand reef
#

Hoi

foggy bridge
#

hey whats up

#

i wanted to ask

#

currently im learning pandas

sand reef
#

Mmm?

foggy bridge
#

and what would you guys recommend me

#

to train myself with

sand reef
#

To train yourself with?

foggy bridge
#

yeah

#

i mean

#

like a project so i could test my skills and improve

#

or test my knowledge

sand reef
#

Well. There are a lot of things you can do.

#

And it depends where you want to go

#

Try googling for some projects

#

Related to the skills you've learnt I guess?

#

And try pulling them off.

foggy bridge
#

@errorsans to be honest i like data analyzing that's why i dove into pandas, but the thing is i feel like lost

lyric canopy
sand reef
#

I mean, well, then I guess I am the wrong person to be asking that then.

#

Cuz I literally am trying to do something like that myself.

foggy bridge
#

@sand reef its ok , thank you !

#

@lyric canopy i just finished Sentdex tutorials

#

@lyric canopy will do! Thank you very much!

tender lance
#

I'm trying to click on a HTML5 canvas.

#

or simulate some clicks.

#

I can't get selenium basic to work, can someone point me in a direction to simulate clicks and drags on a html5 canvas?

sleek otter
#

Hi

#

im doing a project with python and dataframes, can i ask for a little help with pandas

olive willow
#

just post the question dude @sleek otter

outer marsh
#

Hey!

#

I'm writing a report where I have to explain MLE

#

But what's the intuition behind multiplying probabilities together if you've got a large dataset?

lapis sequoia
#

any dot graphs making application?

shut widget
#

@lapis sequoia what you're asking for makes no sense

lapis sequoia
#

in the file i downloaded i dont see EXE file

shut widget
#

the issue here is that you don't understand how to run programs, not that they all need to be "built"

#

there's a lot of exe files

#

in the zip

lapis sequoia
#

none

#

that run the program

#

i think just some command lines that does nothing

shut widget
#

they don't do nothing

lapis sequoia
#

they dont run it

#

yeah i just want simply open program and write graph

#

exactly like that

#

thats all i want

#

any app

#

anybody knows app for it?

lapis sequoia
#

those are good

#

but i want to insert image to graphs

#

and im pretty sure its not possible on web ones

desert oar
#

that isn't generally possible

lapis sequoia
#

it is

desert oar
#

make a graph out of an image?

lapis sequoia
#

no

#

adding images to graphs

desert oar
#

oh. no im not aware of one either

#

just use an image editor?

lapis sequoia
#

noo

desert oar
#

@outer marsh it's just the arithmetic of probabilities. each data point is the realization of a random variable, and we assume those random variables are independent -- this is the "iid" assumption.

that means each observation is an "event". let's say you have a data set with 5 individuals, and you know their favorite ice cream flavor (chocolate or vanilla), and you know their age -- you can propose a logistic regression model (linear in the logit)

Y_i | AGE_i ~ Bernoulli(p_i), where p_i = Pr(Y_i = "chocolate" | AGE_i)
logit(p_i) = b1*AGE_i + b0

This is pretty typical of what you'd use maximum likelihood for.

Our data set might be like

age | flavor
----|----------
 21 | chocolate
 27 | vanilla
 18 | chocolate
 20 | vanilla
 30 | vanilla

which means we have 5 random variables (one per individual), and one event per individual (one data point per individual).

You can think of the entire dataset as one big event: the intersection of all the independent events that correspond to individual data points. Mathematically you might write it like this:

Dataset = Y_1 ∩ ... ∩ Y_5

so that

Pr(Dataset | AGE_1, ..., AGE_5 ) = Pr(Y_1 ∩ ... ∩ Y_5 | AGE_1, ..., AGE_5) 

Then you just apply the usual rule for computing the probability of the intersection of independent events, which gives us

Pr(Dataset | AGE) = Pr(Y_1 | AGE_1) * ... * Pr(Y_5 | AGE_5)
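
A minimal sketch of that product for the toy table above. The coefficient values passed in at the end are made-up guesses, not fitted estimates, and the helper names are just for illustration:

```python
import math

# Toy data from the table above: (age, flavor)
data = [(21, "chocolate"), (27, "vanilla"), (18, "chocolate"),
        (20, "vanilla"), (30, "vanilla")]

def p_chocolate(age, b0, b1):
    """Inverse logit: Pr(Y = chocolate | age) under logit(p) = b1*age + b0."""
    return 1.0 / (1.0 + math.exp(-(b1 * age + b0)))

def point_prob(age, flavor, b0, b1):
    """Probability of the flavor that was actually observed for one person."""
    p = p_chocolate(age, b0, b1)
    return p if flavor == "chocolate" else 1.0 - p

def likelihood(b0, b1):
    """Product of the per-person probabilities (valid under independence)."""
    total = 1.0
    for age, flavor in data:
        total *= point_prob(age, flavor, b0, b1)
    return total

def log_likelihood(b0, b1):
    """Sum of logs: same maximizer, numerically safer for large datasets."""
    return sum(math.log(point_prob(a, f, b0, b1)) for a, f in data)

# b0 = 5.0 and b1 = -0.25 are illustrative guesses only.
print(likelihood(5.0, -0.25), math.exp(log_likelihood(5.0, -0.25)))
```

Maximum likelihood then just means searching for the b0 and b1 that make this number as large as possible.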
lapis sequoia
#

definitely possible

outer marsh
#

Hmmm

#

So it's like the probability of the dataset?

#

Like I don't get what the product of all these probabilities represents

#

@desert oar

desert oar
#

You're exactly right, when you do maximum likelihood you are looking at the likelihood over the entire data

#

So the product of the probabilities you can interpret literally as the probability of the intersection of all those events describing all the data points

#

Thinking about it that way also makes it obvious why independent and identically distributed are necessary assumptions

#

If they aren't identically distributed, you need a different expression for each data point, which is fine of course but you can't implement that as efficiently in a computer, and computing the gradient, not to mention the Hessian, is more involved

outer marsh
#

Yeah I see

desert oar
#

Whereas if they aren't independent, the whole expression falls apart

outer marsh
desert oar
#

Yeah, although usually you distinguish the variable from its realization with capital and lowercase letters

#

x_1 is the realization of X_1

outer marsh
#

Ah okay

desert oar
#

Meaning the event is "X_1 = x_1"

#

Which of course has zero probability if X is continuous...

outer marsh
#

Oh yeah he also does that, just isn't that clear

#

But why is that 🤔

desert oar
#

It's kind of just how probability theory is set up. that's why the event thing isn't technically correct in general

#

Think about a number line, the point "1" is infinitely small, because there are an infinite number of real numbers that are arbitrarily close to it

outer marsh
#

Yeah

desert oar
#

This stuff gets very esoteric very quickly, but suffice it to say that an infinitely small interval must have probability zero

outer marsh
#

Well yes

desert oar
#

This is where probability density comes in

#

A probability density is the derivative of the distribution function, right? Well, a derivative is basically the rate of change over an infinitely small interval, that's how they are defined

outer marsh
#

Yeah

desert oar
#

So even though the event itself is infinitely small and has zero probability, the way the stuff works is you can just drop in the probability density instead

#

And that's where you get the usual expression of multiplying the probability density for each point in the data set
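
Roughly what that looks like in code, as a sketch only: a normal distribution is assumed here purely for illustration, and the sample values are made up.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x. A density, not a probability."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(xs, mu, sigma):
    """Sum of log densities, i.e. the log of the product of densities."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in xs)

xs = [4.8, 5.1, 5.3, 4.9, 5.4]  # made-up sample
sample_mean = sum(xs) / len(xs)

# For a fixed sigma, the sample mean scores better than a mu far from the data.
print(log_likelihood(xs, sample_mean, 0.25), log_likelihood(xs, 4.0, 0.25))
```

Note that a density can be bigger than 1 (try `normal_pdf(0, 0, 0.1)`), which is one way to see it isn't a probability of the point itself.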

outer marsh
#

Hmmm

#

I think I'll ask my math teacher tomorrow, I think it's better if I can talk to someone irl

#

But thanks for the help

lapis sequoia
#

@desert oar the lecturer said to use this line plt.plot(x, lin_reg2.predict(poly_reg.fit_transform(x)), color = 'blue')

#

instead of this plt.plot(x, lin_reg2.predict(x_poly), color = 'blue')

#

coz he said that then we can use the model for other dataset

#

so was he saying that if we had training and test set instead of the current dataset where we dont split it.... then x_poly would already have been assigned to something'?

#

the code is this

#
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing dataset
dataset=pd.read_csv('Position_Salaries.csv')
x = dataset.iloc[:, 1:2].values
#if we give only the index value, i.e. 1, then it will return a vector rather than a matrix
#hence we give a range, i.e. 1:2, rather than 1
y = dataset.iloc[:,2].values

#polynomial linear regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
x_poly = poly_reg.fit_transform(x)
lin_reg2 = LinearRegression()
lin_reg2.fit(x_poly, y)

#plotting polynomial regression model
plt.scatter(x, y, color = 'red')
plt.plot(x, lin_reg2.predict(poly_reg.fit_transform(x)), color = 'blue')  #<----- this line
plt.title('truth vs bluff (without x_poly)')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.show()

#another model using x_poly
plt.scatter(x, y, color = 'red')
plt.plot(x, lin_reg2.predict(x_poly), color = 'blue')  #<---- and this line
plt.title('truth vs bluff (with x_poly)')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.show()
#

because no matter what the line is i get the same graph for both

desert oar
#

the first one re-fits the poly reg model, the second one doesnt

lapis sequoia
#

ok so if i had a dataset in which i had training and test set then would i understand this thing better?

#

also when i increase my degree the regression line keeps on improving but somewhere in q and a the lecturer said that if we increase the degree too much then i would overfit the model... how is that possible

desert oar
#

i dont really understand your question about training and test sets

#

i also dont know why you would re-fit your model every plot...

#

err, oh. i see

#

maybe hes saying that if you want to use a different x, you can use the first one instead of hardcoding x_poly

#

that's an arbitrary distinction... just write a function if you need to reuse
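
Something like this, as a rough sketch: `fit_poly` and `predict_poly` are hypothetical helper names, and the data is a made-up stand-in for the Level/Salary columns. The expansion is fitted once and then only `transform` is called on new x.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def fit_poly(x, y, degree=2):
    """Fit the feature expansion and the regression once, on the training x."""
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(x), y)
    return poly, model

def predict_poly(poly, model, x_new):
    """Reuse the already-fitted expansion on any new x (transform, not fit_transform)."""
    return model.predict(poly.transform(x_new))

# Made-up data following y = 3 + 2x + x^2 exactly.
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 3 + 2 * x.ravel() + x.ravel() ** 2

poly, model = fit_poly(x, y)
print(predict_poly(poly, model, np.array([[11.0], [12.0]])))
```

The same `predict_poly` call works for the plotting line too: pass it whatever x you want to draw.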

lapis sequoia
#

Okkk..

#

ok thanks

lapis sequoia
#

does anyone have any reference graphics for ML metrics

#

need a quick revision..

lapis sequoia
#

guys any favorite books or courses to start data science

lapis sequoia
#

Will check it out

lapis sequoia
#

Im stuck

#

how do I get my df to have equal number of rows per group..

warm orbit
#

could someone dumb down the kalman filter for me pls lol

#

so that i can implement it in python

lean ledge
#

It looks at the state variable (the one you're trying to measure), and assumes its measurements, its changes etc. are affected by noise that's Gaussian (follows a normal distribution)

#

And then it uses probability to predict the most likely state and standard deviation/variance given previous measurements, the current measurements and any "control" you put into it

#

@warm orbit

warm orbit
#

i see

lean ledge
#

It also assumes the current state is related to the control/previous state/measurement using a linear function

hazy hare
#

hello guys, i've been stuck on a problem for 3-4 hours and couldn't fix it... when i try to convert my Reviews column to float (with this code: data['Reviews'] = data['Reviews'].astype('float') ) i get this error message: "You have categorical data, but your model needs something numerical. See our one hot encoding tutorial for a solution." I tried one-hot encoding but got a different error... If somebody could help me i'd be glad, thank you

https://www.kaggle.com/berkeyilmaz/my-first-data-analysis

lapis sequoia
#

what is the error

hazy hare
#

ValueError: could not convert string to float: '3.0M'

lapis sequoia
#

says it's a string

#

is this from you trying to do one hot encoding?

#

what column does this value belong to
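
If the column really holds strings like '3.0M', one possible workaround is to parse the suffix by hand. This is only a sketch: it assumes 'k' and 'M' mean thousand and million, which nothing here confirms, and `parse_count` is a made-up helper name.

```python
def parse_count(text):
    """Convert strings like '3.0M' or '12k' to a float.

    Assumes 'k'/'M' mean thousand/million. Returns None for anything that
    still cannot be parsed, so bad rows can be inspected instead of crashing.
    """
    multipliers = {"k": 1e3, "m": 1e6}
    cleaned = str(text).strip().replace(",", "")
    if not cleaned:
        return None
    factor = multipliers.get(cleaned[-1].lower())
    if factor is not None:
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * (factor or 1.0)
    except ValueError:
        return None

print(parse_count("3.0M"), parse_count("159"), parse_count("abc"))
```

With pandas it could be applied as `data['Reviews'].map(parse_count)`, after which the rows that came back as missing are the ones worth looking at.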

hazy hare
wide gyro
#

@desert oar you available? I counted the columns of my rows in dataframe, and while most have 14, some have 13 or less. I'm assuming it's telling me the ones less than 14 contain NaN values

#

However, isn't 0 considered NaN? If that's the case, I wanted to know if I could remove rows that are missing an input, but keep those with 0 as those are useful

desert oar
#

what do you mean 'counted the columns of my rows'

#

and why would 0 be NaN? NaN stands for "not a number" (part of the IEEE floating point specification, used/abused in pandas to represent missing data)

desert oar
#

what do you mean "counted"

wide gyro
#

I ran print(df.count(axis='columns')) and it showed 14 for some, 13 for others

desert oar
#

do you know what .count does?

#

i recommend checking out the docs. i don't think it does what you think it does

wide gyro
#

"Count non-NA cells for each column or row."

#

Where am I getting the 13 and 14 from then?

#

My dataframe holds 14 values

desert oar
#

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
wide gyro
#

I figured the ones that showed 13 were missing a value inside

desert oar
#

exactly

#

cause thats what the function does..
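
A small sketch of what that count is doing, on a made-up frame (the column names are borrowed from the ones mentioned in chat, the values are invented). Note that 0 counts as a real value, only None/NaN are skipped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "radio": ["GSM", "LTE", None],
    "mcc": [262, 0, 310],
    "range": [1000.0, np.nan, 0.0],
})

# Non-NA cells per row: 0 counts as a value, None/NaN do not.
per_row = df.count(axis="columns")
print(per_row.tolist())

# Rows with at least one missing cell.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```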

wide gyro
#

Is there a difference between dropna inplace and subset?

desert oar
#

they do completely different things

#

what do the docs say

wide gyro
#

inplace : bool, default False
If True, do operation inplace and return None.

#

subset : array-like, optional Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

#

so for subset, I would list the columns that I wanna check for missing values, and those according rows would be marked to be eliminated

#

I guess I don't really understand what inplace fully does

desert oar
#

In place modifies the data frame, rather than creating a new data frame. It sounds like it'd be faster or more efficient but in practice it usually isn't. There has also been discussion about deprecating it, so don't bother with it

spark nimbus
#

Live right now, in need of some help. Lacking some insight into what I'm doing wrong

desert oar
#

@spark nimbus on stream? i cant join but if you describe here maybe i can help

spark nimbus
#

so basically

#

I'm trying to play audio

#

when I play it all at once it works fine

#

but I want to play the next few samples every 20ms or so

#

and now it's all static-y

wide gyro
#

is using dropna(subset=) realistic if I'm placing every column in the subset

desert oar
#

@wide gyro what do you mean placing? if you omit subset= it uses every column by default
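
A quick sketch of the difference on a made-up frame (invented values, column names borrowed from the chat):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "radio": ["GSM", "LTE", None, "UMTS"],
    "mcc": [262, 0, 310, 222],
    "range": [1000.0, 0.0, 500.0, np.nan],
})

all_cols = df.dropna()                    # drops rows with a missing value in any column
only_radio = df.dropna(subset=["radio"])  # only checks the 'radio' column

print(len(all_cols), len(only_radio))
```

Both calls return a new frame and leave `df` alone, and the row with `mcc` equal to 0 survives in both because 0 is not missing.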

#

@spark nimbus is this a data science question?

#

not sure what you mean. is this about python? OBS?

wide gyro
#

Oh so I don't need to add every column like subset='radio','mcc',etc.

#

didn't know that

desert oar
#

yeah that one isn't explicitly mentioned in the docs, but the examples demonstrate how it's used

sand reef
#

If I am not wrong, isn't NaN counted as 0 during computation?

spark nimbus
#

@desert oar audio is closer to data science than anything else here

sand reef
#

Say. I have learnt conv nets and all and rnns and all. What do I do now? Like what skills are needed for a decent job in this field? And is data science a completely different aspect all together? Or is it somewhat related?

#

Basically the coursera Machine Learning course and the Deep Learning Specialization.

desert oar
#

@sand reef depends on the job

#

there arent many "junior ML engineer" roles out there afaik yet

#

@spark nimbus ok, unfortunately im not sure what the context for the problem is, nor am i an audio guy. good luck though

#

@sand reef also my career path has been very much not "machine learning oriented" -- so maybe i'm not the best to ask. i know that, honestly, i wouldn't hire a data scientist that's only done a couple online machine learning courses

#

i'd personally prefer someone with math and stats background, who can reason about data

#

data cleaning, missing values, basic statistical analysis, etc

#

and who knows how to code

#

if they have a math background i can teach them any of the fancy modeling they need to know

#

ideally someone who can write well and make good data visualizations too

#

maybe try a project now? something "end to end" where you have to choose a problem statement, get your own data, clean it yourself, come up with your own model, and then make some kind of report w/ your results

#

that's quite a bit of work but if i was hiring at the very basic junior level it would make me more interested

lean ledge
#

@sand reef if you're going for a computer vision role, unless you're familiar and up to date with research, I don't think you'll find many to hire you. That means being familiar with ResNet, ResNeXt, AlexNet, VGG, GoogLeNet/Inception, U-Net, (Fast/Faster) R-CNN + knowing about GANs, NLP based description models, etc etc. Computer vision is sort of high barrier of entry and you should definitely be familiar enough that you can pick up random papers from CVPR and understand them

#

If you're going for generic deep learning, well... You'll be disappointed because no one actually uses just deep learning that much. It's used in research, and for CV, NLP and then a few things here and there. Outside of that, it either works worse or there isn't enough data for it

#

The only reason it's as popular as it is is that people like the sound of "neural networks" and because it's easier to learn without much maths.

prisma verge
#

so, uh, i have no idea what i'm doing since i'm bad at ml
can anybody explain what should i do so it predicts continuation of csv file?

#

1 and 2 mean wins (though 1 = safer, 2 = closer to lose), while 3 means lose, 4 5 6 - are teams, the leftmost column is the round number

#

i'm quite bad at ml but i just wanna try to predict some stuff

#

would be nice to improve my own knowledge with this project
also, anyone got good keras books?

#

i just have no idea how to predict it

warm orbit
#

@lean ledge is there an advantage to using a kalman filter over a linear regression

#

i think they both identify a linear trend with gaussian noise

lean ledge
#

@warm orbit they don't do the same thing at all

#

Nothing alike

warm orbit
#

yes kalman just predicts the next one right

lean ledge
#

linear regression is finding an approximate A for a given
y=Ax
When y and x are known for a lot of examples

Kalman filter is estimating the state x_(k+1) for a
x_(k+1) = Ax_k + Bu_k + Ω
y_(k+1) = Cx_(k+1) + μ
Where μ and Ω are Gaussian noise vectors
Given the measurements y so far, the controls u and known A, B, C
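
For the scalar case this can be sketched in a few lines. Here a, b, c stand in for A, B, C, and q, r are the variances of the two noise terms. The default values and the measurements are made up for illustration.

```python
def kalman_step(x_est, p_est, y, u=0.0, a=1.0, b=0.0, c=1.0, q=1e-4, r=0.25):
    """One predict/update cycle of a scalar Kalman filter.

    x_est, p_est: previous state estimate and its variance.
    y: new measurement, u: control input.
    a, b, c: scalar versions of A, B, C. q, r: process/measurement noise variances.
    """
    # Predict: push the estimate through the linear model.
    x_pred = a * x_est + b * u
    p_pred = a * p_est * a + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred * c / (c * p_pred * c + r)
    x_new = x_pred + gain * (y - c * x_pred)
    p_new = (1.0 - gain * c) * p_pred
    return x_new, p_new

# Made-up noisy readings of a constant value near 10.
measurements = [10.4, 9.7, 10.1, 9.9, 10.3, 9.8]
x, p = 0.0, 100.0  # vague initial guess with a large variance
for y in measurements:
    x, p = kalman_step(x, p, y)
print(x, p)
```

The estimate settles close to the average of the readings and the variance shrinks with every measurement, which is the "most likely state and variance" part.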

#

@warm orbit

#

I think you should really just study a bunch of maths

warm orbit
#

lol

#

thanks

lean ledge
#

Kalman filters are generally the kind of stuff you learn in a 3rd year electrical engineering (signal processing) class

#

They're the most basic form of Bayesian filters and still aren't very approachable

lapis sequoia
#

that's what I told him yesterday..

queen vigil
#

what's a good example to start learning about neural networks

#

i know the concept but i need a project/example to do it on

#

im thinking a game for the computer to play or processing images but idk

reef bone
#

i think the MNIST handwritten digits one is pretty much canonical in terms of image classification
http://yann.lecun.com/exdb/mnist/
this is a good starter dataset imo, the samples are small which makes it feasible to quickly retrain and play with parameters of your network, and you can easily find others' implementations and see what they did differently
otherwise this is a good repository of commonly used datasets
https://archive.ics.uci.edu/ml/index.php
and of course you can check out kaggle too

queen vigil
#

ive seen sentdex use mnist in his tensorflow tutorial series but i wanna try my own thing instead of just following the tutorial blindly

small ore
#

Is NN needed for handwriting recognition or will just normal regresson work?

reef bone
#

regression is a problem, rather than an algorithm, and digit recognition is inherently a classification task

#

you could use logistic regression to classify with one vs all

#

after all, logistic regression is essentially a weighted sum of the inputs passed through a sigmoid function, and you often see the sigmoid used as an activation function in simple networks

#

so you can think of a simple NN as multiple sigmoids connected together, which allows them each to learn a distinct relationship and work together to solve a more complex problem
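
A bare-bones sketch of the one vs all idea with sigmoid units. The weights are hand-picked for illustration, not learned, and the function names are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(x, w, b):
    """Logistic regression for one class: sigmoid of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def one_vs_all(x, units):
    """Score x with one logistic unit per class and pick the most confident."""
    scores = [logistic_unit(x, w, b) for w, b in units]
    return scores.index(max(scores)), scores

# Hypothetical (weights, bias) pairs for three classes over two features.
units = [([2.0, -1.0], 0.0), ([-1.0, 2.0], 0.0), ([-1.0, -1.0], 1.0)]
label, scores = one_vs_all([0.9, 0.1], units)
print(label, scores)
```

For digits you'd have ten units and 784 pixel features, and the weights would come from training rather than from guessing.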

sand reef
#

So, if I did contribute in a research paper, will that boost me up?

prisma verge
sand reef
#

Well, whats the error?

prisma verge
#

it gives absurd -16 loss

#

and predicts nan values

sand reef
#

mmm

#
X = dataset[:, 0:10]
Y = dataset[:, 3]
#

You are using the Y values in the X values?

prisma verge
#

uhhh, have no idea honestly

#

it doesn't seem to touch y values on slice

sand reef
#

Well, according to the code, you are taking the first 10 columns into X

#

and the third column into Y

#

*11 columns for X

prisma verge
#

huh

#

let me try

sand reef
#

no, its 10 only, my bad xD

#

but yeah, the Y is being used to train?

prisma verge
#

i'm just trying to finetune existing network to predict my csv

#

i have very little idea behind ml honestly

sand reef
#

well, I can't seem to pinpoint the major issue here other than the labels being a part of the training data

prisma verge
#

should i bring up csv here?

#

maybe it'll help?

sand reef
#

i guess?

prisma verge
sand reef
#

maybe i'll try to run it and see

prisma verge
#
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import to_categorical

df = pd.read_csv('m.csv')
dataset = df.values
X = dataset
Y = dataset
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
print(X_scale)
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.3)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)

model = Sequential([Dense(32, activation='relu', input_shape=(4,)), 
                    Dense(32, activation='relu', input_shape=(4,)),
                    Dense(4, activation='sigmoid')])
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])
hist = model.fit(X_train, Y_train,
                 batch_size=1, 
                 epochs=100,          
                 validation_data=(X_val, Y_val))
print(model.predict(X_scale))
sand reef
#

welp, gimme a sec to download sklearn

prisma verge
#

any comments so far?

lapis sequoia
#

what are you trying to do

#

why does your X and Y both point to dataset

prisma verge
#

i'm trying to make nn that'd predict the contents of csv

#

and i have no idea, because i'm bad at ml actually and just trying to finetune existing network

#

before it was
X = dataset[:,0:10]
Y = dataset[:,3]

#

but that didn't work

#

though it still doesn't

#

also, @sand reef, i'm calling you!
since i didn't get any further by changing layers

sand reef
#

mew?

#

ah yes

#

i am having an issue here

#

for some reason, its not importing scipy

#

even tho i have it installed

prisma verge
#

huh

#

or install anaconda since afaik it has this stuff

#

... or just stop wasting too much power to help me lol

sand reef
#

well to instal anaconda means to set up a lot of stuff lol

prisma verge
#

i didn't use it tbh so don't know how hard it is

#

i'm just using colab since it provides free gpu

sand reef
#

yeah thats there..

#

wellp, i'll try it out on colab

#

what i dont like about colab is the tensorboard issue

prisma verge
#

have no idea what it is, works finely with keras for me

sand reef
#

tensorboard?

prisma verge
#

yeh

sand reef
#

well, its basically to see the progress of your model and how its going

#

but tensorboard for some reason has some issues on google colab

prisma verge
#

keras outputs it automatically on calling fit method

lapis sequoia
#

can someone explain me what a dimension is?

sand reef
#

like length, breadth and height. Matrices can have those too.

lapis sequoia
#

how is input_shape=(781,)
same as
input_dim=781

sand reef
#

well, @prisma verge

#

there is one issue, its reading the head also, which has NaN in it

prisma verge
#

that's the only problem?

sand reef
#

@lapis sequoia (781, ) thats how the shape of a one dimensional array is written

#

meaning there are 781 elements in it
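
A tiny NumPy sketch of that shape, with zeros as placeholder data. As far as I know, in Keras `input_dim=781` and `input_shape=(781,)` both describe one example like `sample` below, without the batch axis:

```python
import numpy as np

sample = np.zeros(781)         # one example with 781 features
batch = np.zeros((32, 781))    # 32 such examples stacked together

print(sample.shape, sample.ndim)  # (781,) 1
print(batch.shape, batch.ndim)    # (32, 781) 2
```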

prisma verge
#

removed nans

#

gonna try it now

sand reef
#

okay

prisma verge
#

how do i define x and y though?

#

i'm really not sure which should go in y and which should go to x

sand reef
#

well, you need to figure out that by reading the csv file

lapis sequoia
#

so input_dim takes number of elements present ?

sand reef
#

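more or less: input_dim is the number of features in one example, and input_shape is the full shape tuple, so input_dim=781 is shorthand for input_shape=(781,)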

prisma verge
#

csv represents 47 columns with 4 rows

#

i just wanna know that so i can know it for future projects

#

and get deeper into ml

lapis sequoia
#

if possible can u give an example please :) ?

sand reef
#

okay. meaning you have 4 examples?

prisma verge
#

just i learn best by practice, haha

sand reef
#

@lapis sequoia well, like, for example, you want to input an image

prisma verge
#

welp, yeah
first one goes from 1 to 47, and the others have 1, 2, and 3 in "random" sequences every round

#

because those are logs from one game which has 2/3 chances to win

#

1 and 2 means wins, 3 means lose

sand reef
#

a gray image is of the shape (200, 200, 1)

prisma verge
#
47,1,3,1
46,1,3,2
45,3,1,2
44,2,1,3
43,2,1,3
42,2,1,3
41,2,3,1
40,3,2,1
39,1,3,2
38,2,1,3
37,2,3,1
36,1,2,3
35,1,3,2
34,2,3,1
33,2,1,3
32,1,3,2
31,2,1,3
30,2,1,3
29,3,2,1
28,1,2,3
27,1,3,2
26,3,1,2
25,1,3,2
24,3,2,1
23,3,2,1
22,2,3,1
21,3,2,1
20,3,2,1
19,2,1,3
18,2,3,1
17,2,1,3
16,3,2,1
15,2,1,3
14,2,3,1
13,1,2,3
12,1,3,2
11,1,3,2
10,1,3,2
9,3,2,1
8,2,1,3
7,2,3,1
6,2,1,3
5,1,2,3
4,1,2,3
3,3,2,1
2,3,2,1
1,2,3,1
lapis sequoia
#

so its dimension is 3?

prisma verge
#

kinda like that it looks

sand reef
#

yes, its a 3D matrix

#

a color image is (200, 200, 3)
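
In NumPy terms, with placeholder arrays of zeros, both images have three axes and only the size of the last one differs:

```python
import numpy as np

gray = np.zeros((200, 200, 1))   # height, width, 1 channel
color = np.zeros((200, 200, 3))  # height, width, 3 channels (e.g. RGB)

print(gray.ndim, color.ndim)     # 3 3
print(gray.size, color.size)     # 40000 120000

# Flattening turns the gray image into a one-dimensional array of pixels.
flat = gray.reshape(-1)
print(flat.shape)                # (40000,)
```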

prisma verge
#

also, removing nans didn't help

sand reef
#

so tell me something about your csv

#

does your csv have 4 features and 47 examples?

#

or 4 examples and 47 features?

prisma verge
#

how should i know that?

#

i guess 4 features and 47 examples
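
With that layout (rows are examples, columns are features), splitting X and Y could look like the sketch below. Which column is the label is an assumption here (the last team's outcome), nobody has settled that yet. The point is just that the label column should not also sit inside X.

```python
import numpy as np

# First few rows of the CSV above: round, then one outcome code per team.
dataset = np.array([
    [47, 1, 3, 1],
    [46, 1, 3, 2],
    [45, 3, 1, 2],
    [44, 2, 1, 3],
])

label_col = 3  # assumption: predict the last team's outcome
X = np.delete(dataset, label_col, axis=1)  # every column except the label
Y = dataset[:, label_col]

print(X.shape, Y.shape)
```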

sand reef
#

well, you made the csv or know about its origin right?

prisma verge
#

yeah, i do

lapis sequoia
prisma verge
#

biggest team means lose, middle means close to lose but wins, and smallest means highest chances to win
then i've changed them to numbers and removed rounds

#

well, not removed rounds, removed word round

#

now i wanna make ai predict from that csv

sand reef
#

I see

prisma verge
#

because it'd be amazing experience to machine learning, and also very useful

#

i know that predictions won't predict reality, but i wanna make it at least as project for fun

lapis sequoia
#

9:05*

sand reef
#

oh that thing