#data-science-and-ml
Usually when I get that error it's because my array has dtype object because I wasn't careful about building my array
The fact that X.shape[1] returns a value error makes me think you have an array of lists or something
@median siren you'll need to show your code or a sample of the data or something
That makes sense right, considering my data points represents a vector. So my X_train variable is a list of vectors?
well, no, it's a dataframe
usually you don't get an array of arrays when you use .values
something is funny in your data
🤔
Yes, I figured out what's wrong.
hey does anyone have a recommendation for a free online course to learn ML
Check pinned
@desert oar Hey I tried your code. I think I need to edit the cython module or change the data type I'm passing. "ValueError: Buffer dtype mismatch, expected 'DTYPE_t' but got 'float'"
And why do we use fit_transform() on training set and only transform() on test set?
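We use fit_transform() on the training data so that we learn the scaling parameters from the training data and scale it at the same time. On the test set we call only transform(), so the test data is scaled with the parameters learned from the training data.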
parameters of scaling on train data means?
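To make "parameters of scaling" concrete: for standardization they are just the mean and standard deviation of each feature, computed from the training set only. A minimal pure-Python sketch of the idea follows. The helpers `fit` and `transform` and all the numbers are made up for illustration; this is not scikit-learn's implementation.

```python
from statistics import mean, pstdev

def fit(train):
    """Learn the scaling parameters (mean and std) from the training data."""
    return mean(train), pstdev(train)

def transform(values, params):
    """Scale values using parameters that were learned earlier."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical single feature
test = [3.0, 6.0]

params = fit(train)                       # "fit": learn mean and std from train only
train_scaled = transform(train, params)   # fit followed by transform == fit_transform
test_scaled = transform(test, params)     # the test set reuses the train parameters
print(params, test_scaled)
```

Calling `fit` on the test set as well would leak information about the test data into the preprocessing, which is why only `transform` is applied there.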
Does anyone know how do I fix the issue with TensorBoardColab? The one where I make the tensorboard and pass it in, and it says:
AttributeError: TensorBoardColab does not have a parameter, on_batch_training_begin()
or something close to that, let me get the error
AttributeError Traceback (most recent call last)
<ipython-input-6-98410153379f> in <module>()
1 with tf.Session() as sess:
2 sess.run(tf.global_variables_initializer())
----> 3 model.fit(X, Y, batch_size = 32, epochs = 10,validation_split = 0.1, callbacks = [TensorBoardColabCallback(tbc)])
2 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, max_queue_size, workers, use_multiprocessing, **kwargs)
878 initial_epoch=initial_epoch,
879 steps_per_epoch=steps_per_epoch,
--> 880 validation_steps=validation_steps)
881
882 def evaluate(self,
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, mode, validation_in_fit, **kwargs)
323 # Callbacks batch_begin.
324 batch_logs = {'batch': batch_index, 'size': len(batch_ids)}
--> 325 callbacks._call_batch_hook(mode, 'begin', batch_index, batch_logs)
326 progbar.on_batch_begin(batch_index, batch_logs)
327
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
194 t_before_callbacks = time.time()
195 for callback in self.callbacks:
--> 196 batch_hook = getattr(callback, hook_name)
197 batch_hook(batch, logs)
198 self._delta_ts[hook_name].append(time.time() - t_before_callbacks)
AttributeError: 'TensorBoardColabCallback' object has no attribute 'on_train_batch_begin'```
and here is the model made
import tensorflow as tf
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Flatten, Dense, Activation, Conv2D, MaxPooling2D
from google.colab import drive
drive.mount('/content/drive')
import pickle
X = pickle.load(open('/content/drive/My Drive/data/X.pickle', 'rb'))
Y = pickle.load(open('/content/drive/My Drive/data/Y.pickle', 'rb'))
model = Sequential()
model.add(Conv2D(64, (3,3), input_shape = X.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Conv2D(64,(3,3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Dense(1))
model.add(Activation("sigmoid"))
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
!pip install -U tensorboardcolab
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback
tbc = TensorBoardColab()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    model.fit(X, Y, batch_size = 32, epochs = 10,validation_split = 0.1, callbacks = [TensorBoardColabCallback(tbc)])```
pls halp
tensorboard is to visualize the flow of your program?
can you look it up and see where it fails?
oh
misunderstood your problem
try this
tbCallBack = TensorBoard()
and use that in model.fit
i'm not sure if you need to pass arguments into TensorBoard.. but I think you probably might need to
@sand reef
But the TensorBoardColab is what is imported, the regular one is only TensorBoardv2.0 which is supported by Colab
The TensorBoardv2.0 is having another set of issues of not reading any of my tensorboard event files
despite me running ngrok on it
btw guys what should I know before starting to learn calc?
functions
so f(x)
and a bit of set theory
like x is a real number or x belongs to an interval between 2 and 5
that notation
f:x->x
what's an interval idk the english terms that good
sure I'm doing linear algebra rn after that it's calc and stuff
and after that the holy motherland MACHINE LEARNING
okay
thanks @lean ledge
@karmic geyser I did warn you it was untested 😃 but the error means what it says
What is a bit weird is that DTYPE_t should be np.float32
Maybe the issue is native python float vs numpy float
import pandas as pd
df_btcusdt = pd.read_csv("BTCUSDT.csv", parse_dates=True, index_col=0)
df_ethusdt = pd.read_csv("ETHUSDT.csv", parse_dates=True, index_col=0)
df_ltcusdt = pd.read_csv("LTCUSDT.csv", parse_dates=True, index_col=0)
df_btcusdt = df_btcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ethusdt = df_ethusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ltcusdt = df_ltcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_btcusdt.rename(columns={'Close':'BTCUSDT Close'}, inplace=True)
df_ethusdt.rename(columns={'Close':'ETHUSDT Close'}, inplace=True)
df_ltcusdt.rename(columns={'Close':'LTCUSDT Close'}, inplace=True)
main_df = pd.concat([df_btcusdt, df_ethusdt, df_ltcusdt], axis=1, sort=False)
print(main_df.corr())
how do i make this code neater?
@lost sinew inplace= i think is discouraged nowadays. but other than that, seems neat enough to me
alternatively you can use usecols= in the read_csv call instead of dropping columns afterward
alright thanks.. new to programming so im scared if its messy lol
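A rough sketch of the two suggestions above (`usecols=` instead of dropping columns, and `rename` without `inplace=`). The helper name `load_close` and the inline CSV text are assumptions made so the snippet runs on its own; the real files would be passed by path instead.

```python
import io
import pandas as pd

# Tiny stand-in for BTCUSDT.csv (values invented for illustration).
csv_text = """Date,Open,High,Low,Close,Volume
2018-01-23,10700.0,10900.0,10600.0,10799.18,100
2018-01-24,10800.0,11400.0,10750.0,11349.99,120
"""

def load_close(source, name):
    # usecols reads only the columns we need, so no drop() afterwards;
    # rename() returns a new frame, so no inplace=True either.
    df = pd.read_csv(source, usecols=["Date", "Close"],
                     parse_dates=["Date"], index_col="Date")
    return df.rename(columns={"Close": f"{name} Close"})

btc = load_close(io.StringIO(csv_text), "BTCUSDT")
print(btc)
```

With one such frame per market, the `pd.concat([...], axis=1)` and `.corr()` lines from the original script stay the same.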
Date,BTCUSDT Close,ETHUSDT Close,LTCUSDT Close
2018-01-23,10799.18,980.0,176.98
2018-01-24,11349.99,1061.0,180.89
2018-01-25,11175.27,1056.52,179.59
2018-01-26,11089.0,1051.03,177.09
2018-01-27,11491.0,1118.99,182.1
2018-01-28,11879.95,1251.96,196.74
2018-01-29,11251.0,1177.01,181.5
2018-01-30,10237.51,1085.5,168.21
2018-01-31,10285.1,1124.81,165.19
2018-02-01,9224.52,1041.94,143.69
how do i find the time lag between each of the * Close
this is a csv file
what do you mean time lag
like the whether the price increases or decreases.. it follows each of the other prices because they are highly correlated.. is there a way to find the average lag/lead time
not sure i understand. you want to find lags or leads such that the series are all maximally correlated?
for example, ETHUSDT Close has a +ve increase 10 minutes after BTCUSDT Close has a +ve increase
also your data is daily so you can't figure out +10 minutes from that. but i think i see
ohh its just an example
i'm not sure of any principled way to do that other than making leads and lags of different lengths and computing the correlations
or making plots and eyeballing it
so theres no quantitative way to do it?
alright thanks for ur help
been searching google for days and i still cant find the answer 😦
https://quant.stackexchange.com/a/14868
it seems like you really just have to compute a bunch of lagged correlations, or use Granger causality
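The "compute a bunch of lagged correlations" idea can be sketched in plain Python as below. The series `a` and `b` are invented so that `b` repeats `a` two steps later, and `pearson` and `lagged_corr` are hypothetical helper names; Granger causality is not covered here.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_corr(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])

# Made-up series: `b` repeats `a` two steps later.
a = [1, 3, 2, 5, 4, 7, 6, 9, 8, 11]
b = [0, 0] + a[:-2]

# Try several candidate lags and keep the one with the highest correlation.
scores = {lag: lagged_corr(a, b, lag) for lag in range(4)}
best_lag = max(scores, key=scores.get)
print(best_lag, scores)
```

With pandas Series the same comparison should be expressible as `a.corr(b.shift(-lag))`, which avoids the manual slicing.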
hey. im making a neural network for detecting handwritten digits from scratch using numpy. rn when i train with 10-20 images and test with those too i get a very high accuracy. but when i go above using 50 training images it just guesses 3 every time. idk y. heres the code
https://pastebin.com/32Z5ZMgr
guessing the same number every time suggests something degenerate in your training
if you print the gradient at each training step maybe you can see something going wrong
which bits the gradient tho? 😂. sorry i dont 100% know whats going on
Well. You implemented the entirety of the neural network from scratch.
It's gonna be a bit hard to point out where you are going wrong.
Here what you can do.
Shuffle the dataset
And then take samples.
If it started predicting 3 a lot, only two things are possible: either something is going wrong in your network, which is unlikely since you said it was working with 10 examples, or your training data had a lot of 3s, so your network learnt to predict only 3 for high accuracy.
is your data really unbalanced
Altho, how did it get a very high accuracy with just 10 examples? That also the mnist dataset?
Is that even possible with a balanced out dataset?
no my data consists of 20 of each number
i mean it is from the mnist dataset. but i just got the images online and saved them as png’s. i only got 200 of them
Well. Do one thing.
Use tensorflow.keras and get the mnist dataset
And train your model on that.
If the same issue persists, then your code for the neural network is having errors.
i wanted to do that before. but i never knew how to get it
or how to use it if i did get it
import requests
import csv
import pandas as pd
market = 'XRPUSDT'#'LTCUSDT'#'BTCUSDT'#'ETHUSDT'
interval = '1d'
url = 'https://api.binance.com/api/v1/klines?symbol=' + market + '&interval=' + interval
data = requests.get(url).json()
with open(market + '.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(data)
df = pd.read_csv(market + '.csv', names=['Date', 'Open', 'High', 'Low', 'Close',
'Volume', 'Close time', 'Quote asset volume',
'Number of trades', 'Taker buy base asset volume',
'Taker buy quote asset volume', 'Ignore'])
df['Date'] = pd.to_datetime(df['Date'], unit='ms')
df = df.drop(columns=['Close time', 'Quote asset volume',
'Number of trades', 'Taker buy base asset volume',
'Taker buy quote asset volume', 'Ignore'])
# save file
df.to_csv(market + '.csv', index=False)
how do i make a loop for all of the 'market' commented
i wanna just type all of the different markets in a list and loop around it automatically instead of manually chanign the market
changing*
```py
markets = ['XRPUSDT', 'LTCUSDT', ...]
for market in markets:
    ...  # all the rest of your code
```
the rest of your code could probably be made more efficient but that's how you'd make the loop
how can i make it more efficient
i'd probably create the dataframe directly from the json
how would i do that tho.. im really new into programming
something like this
```py
data = requests.get(url).json()
df = pd.DataFrame(data,
                  columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume',
                           'x', 'x', 'x', 'x', 'x', 'x'])
df['Date'] = pd.to_datetime(df['Date'], unit='ms')
df = df.drop(columns=['x'])
df.to_csv(market + '.csv', index=False)
```
(I just used 'x' instead of names for columns you're deleting anyway)
alright thanks
import pandas as pd
markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
df_btcusdt = pd.read_csv("BTCUSDT.csv", parse_dates=True, index_col=0)
df_ethusdt = pd.read_csv("ETHUSDT.csv", parse_dates=True, index_col=0)
df_ltcusdt = pd.read_csv("LTCUSDT.csv", parse_dates=True, index_col=0)
df_btcusdt = df_btcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ethusdt = df_ethusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_ltcusdt = df_ltcusdt.drop(columns=['Open', 'High', 'Low', 'Volume'])
df_btcusdt.rename(columns={'Close':'BTCUSDT Close'}, inplace=True)
df_ethusdt.rename(columns={'Close':'ETHUSDT Close'}, inplace=True)
df_ltcusdt.rename(columns={'Close':'LTCUSDT Close'}, inplace=True)
main_df = pd.concat([df_btcusdt, df_ethusdt, df_ltcusdt], axis=1, sort=False)
#main_df.to_csv('combined.csv', index=False)
print(main_df.corr())
how would i do this for all of the markets now
i tried it but it wouldnt work.. what should i change the final 'main_df' line to
@desert cradle
i think what you want is pd.merge
but i'm not 100% sure on the details of how to use it
is there like a way to concat it to the main_df everytime it loops?
actually...
since you're using an index column
wait can you tell me why the concat didn't work?
illk show u what i tried.. wait
but anyway, you just want the close columns, right?
also you have this long list of markets but just three csv files, what are you trying to do with that exactly
import pandas as pd
markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
pair = 'USDT'
for market in markets:
    df = pd.read_csv(market + pair+ ".csv", parse_dates=True, index_col=0)
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
    df.rename(columns={'Close': market + pair +' Close'}, inplace=True)
    main_df = pd.concat([df], axis=1, sort=False)
print(main_df.corr())
ok, that's your problem
concat works fine, you're just doing it in the wrong place
import pandas as pd
markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
pair = 'USDT'
dfs = []
for market in markets:
    df = pd.read_csv(market + pair+ ".csv", parse_dates=True, index_col=0)
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
    df.rename(columns={'Close': market + pair +' Close'}, inplace=True)
    dfs.append(df)
main_df = pd.concat(dfs, axis=1, sort=False)
print(main_df.corr())
@desert cradle is there any way of doing it without the append?
why
it came out with this error: ValueError: No objects to concatenate
list comprehension would be worst in this case?
that doesn't make any sense
import pandas as pd
markets = ['BNB', 'LTC', 'EOS', 'ONE', 'TRX', 'BCHABC', 'MATIC', 'XRP', 'LTC', 'BTC', 'ETH', 'BTT', 'FET', 'ZIL', 'ADA',
'ATOM', 'LINK', 'NEO', 'ETC', 'CELR', 'XLM']
def op(market):
    pair = 'USDT'
    df = pd.read_csv(market + pair + ".csv", parse_dates=True, index_col=0)
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
    df.rename(columns={'Close': market + pair + ' Close'}, inplace=True)
    return df
dfs = [op(market) for market in markets]
main_df = pd.concat(dfs, axis=1, sort=False)
print(main_df.corr())
maybe something like this?
eh
making a function just so you can have a list comprehension doesn't really improve readability that much
and it's not much of a difference for performance either
how do i make it into a heatmap after having a correlation table?
no idea
okay thanks ill figure it out
sounds like a lot of math, we're past my ability to help
i just know the basics of how pandas itself works
ohh okay
@desert cradle Okey ty!, i thought the performance would be better with the list comprehension
it probably doesn't make much difference - a list comprehension might be very slightly faster than a loop with append, but adding an extra layer of function call might slow it down too, and it's not worth worrying about anyway
how does python calculate correlation for the x.corr() code
Standard Pearson correlation
thanks
which one is the best
nvm
do you think the standard pearson correlation is suitable for finding the correlation between two stock prices?
or is the standard pearson correlation only suitable for linear relationships
by definition it's only suitable for linear relationships, but you might be underestimating the value of measuring a linear relationship. if "priceA" generally goes up whenever "priceB" goes up, then you can see that with a linear relationship
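A small illustration of "only suitable for linear relationships", using invented data and a hand-written `pearson` helper (an assumption for this sketch, not the pandas implementation): a straight line gives a correlation of 1, while a parabola over a symmetric range gives 0 even though `y` is completely determined by `x`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # moves in lockstep with xs
parabola = [x * x for x in xs]     # fully determined by xs, but not linearly

r_linear = pearson(xs, linear)
r_parabola = pearson(xs, parabola)
print(r_linear, r_parabola)
```

For two prices that tend to rise and fall together, the first case is the relevant one, which is why a linear measure is still informative.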
What project will be good to do to learn data science and to put on the resume?
anything tbh
your learning project likely won't be a good resume project
kaggle is never a bad place to start for machine learning
it's kind of hard to learn "data science" on your own tbh
you end up mostly learning technical stuff, which is maybe 80% of the equation
say, anyone here well versed in the concept of hopfield neural networks?
i m slowly starting to lose it on this neural network
please ping me if anyone can help
For some reason, this network is always converging only to the latest learnt pattern
to see what the error is, use this code in conjunction: "Processor.py" is the name of the pasted file in the above link
from Processor import *
def print_matrix(matrix):
    for i in range(len(matrix)):
        string = ''
        for j in range(len(matrix)):
            string += str(1 if matrix[i][j] == 1 else 0) + ' '
        print(string)
nn = Network()
a = [
[ 1, 1, 1, 1, 1, 1, 1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1]]
b = [
[ 1,-1,-1,-1,-1,-1, 1],
[-1, 1,-1,-1,-1, 1,-1],
[-1,-1, 1,-1, 1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1, 1,-1, 1,-1,-1],
[-1, 1,-1,-1,-1, 1,-1],
[ 1,-1,-1,-1,-1,-1, 1]]
c = [
[ 1, 1, 1,-1, 1, 1, 1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1, 1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1],
[-1,-1, 1,-1,-1,-1,-1],
[-1,-1,-1, 1,-1,-1,-1]]
nn.read_matrix(a)
nn.set_weights()
nn.read_matrix(b)
nn.set_weights()
nn.run_async(1000)
print_matrix(nn.get_matrix())```
okay....i m seeing an issue......
i think, the issue does not lie in the network, i think my gui is messing up
now who do i ask to help me see the error that i think my logic is causing?
since its a gui logic error that i cant seem to fit into the button events
sorry im of no help but i have a question 
if i have an array of say 50000 values and i iterate over them with my algorithm and extensively using a counter (like counter++ at each iteration and than use it in my calculations)
so.... it works fine for an array, but when i apply this algorithm to a data flow obviously i get overflow very fast
so i thought (to not reinvent the wheel) maybe there is a well known concept to deal with this kind of things, like phase iterators or something
googling "phase iterator" gives nothing useful, so maybe it has different name
but my image of it is like when your counter is more than some value it zero's out but we still know its not real 0 but 0 + whatever counter we zeroed
duh such a mess of a thought )
As far as I am aware, Python ints don't overflow: in Python 2 they silently became longs, and in Python 3 every int is arbitrary precision.
>>> 10**10**3
100000000000....(snip too many zeroes)
oh...
so i must inspect why it gives me overflow more carefully
Numpy doesnt do that though, if its float32 its float32
But yeah use native python ints, they can get huge
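A quick way to see the difference, as a sketch: native Python ints keep growing, while a fixed-width integer wraps around. `ctypes.c_int32` is used here only as a standard-library stand-in for a fixed-width counter; NumPy's fixed-width integer dtypes are expected to behave similarly in array arithmetic, but that is not shown here.

```python
import ctypes

big = 2 ** 64 + 1          # native Python int: just keeps growing
print(big)                 # 18446744073709551617

# Fixed-width C integers wrap around instead of growing.
counter = ctypes.c_int32(2 ** 31 - 1)   # largest signed 32-bit value
counter.value += 1                      # one step past the maximum
print(counter.value)                    # -2147483648
```

So an overflow in a streaming counter usually points at a fixed-width type somewhere in the pipeline rather than at Python's own `int`.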
does anyone know how I can calculate permutations, but with multiple lists of combinations? ie. [3,4] [5,6] [7,8,9] would be (3,5,7) (3,6,7) etc
I've never done this but my intuition says take a look at the itertools module from stdlib @knotty nexus
Hi I am a beginner in Data Science. I have a machine learning problem. So I want to know what is the useful statistic for begin a machine learning problem?
Depends on the problem
Usually you want to learn something about the data
Summary statistics, or plot the data if you can
You should have a goal in mind so you can stay focused on that goal
made my first ever decision tree classifier
thanks @earnest prawn . I look at itertools, but as far I can see it can only handle combinations of a single list; [3,4] would be (3,3), (3,4), (4,3) etc. For now I'm gonna try for loops, but it's gonna be really slow
You want itertools.product() maybe @knotty nexus
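For the three lists from the question, `itertools.product` would be used roughly like this (the variable names are just for illustration):

```python
from itertools import product

lists = [[3, 4], [5, 6], [7, 8, 9]]

# product() picks one element from each list, in order.
combos = list(product(*lists))

print(len(combos))   # 12
print(combos[:3])    # [(3, 5, 7), (3, 5, 8), (3, 5, 9)]
```

Because `product` is lazy, iterating over it directly instead of wrapping it in `list()` avoids holding every tuple in memory when the lists are large.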
@desert oar I use pandas for example data.describe() show mean, std, min, max, 1st , 2nd and 3rd quartile
Is it the only statistic that i can use?
You can use anything you want
Its better to start with a specific objective
What are you trying to achieve? What question do you want to answer?
ok thanks
This is kind of a simple question and I am not 100% sure if it belongs here under data-science but I figured it fits, so what exactly about numpy makes it "faster"
that its written in C and uses C arrays instead of python lists
also https://github.com/tensorflow/agents if anyone is interested
how would i find the average lead/lag time of a time series
What do you mean "average" lead/lag?
Can anyone pls explain me p-value in layman terms
I seriously can't understand it a bit
i do understood what is null and alternate hypothesis
When p<0.05, there is less than 5% chance that the results were a fluke of random chance
< 0.1 -> less than 10%
Etc
so what does accepting the null hypothesis means when p>0.05
i was watching a video. it gave an example: let the null hypothesis be that people are on my website for a population average time of 20 min before the change, and the alternate hypothesis be that people are on my website for more than 20 min
then it set significance value = 0.05
then it took sample mean of 100 people and found out to be 25 min
after that i didnt understand a thing
so can u tell me based on this example what exactly is p value..
like this much i understood that lower the p value lower are the chances that my observation was just a random chance
@lapis sequoia p>0.05 means there's more than 5% chance that your results was due to chance
We consider that too likely
Hence we consider it to mean that "the experiment did not show the relationship we expected"
Hence we are unable to reject the null hypothesis
Say your hypothesis is A is correlated with B
Null hypothesis: there is no relationship
You do the experiment and find that A is correlated with B
But
There's a greater than 5% chance that it is due to random chance that you got that result
Hence you are unable to reject the idea that they are unrelated
And can't accept the hypothesis
Ok ok..
@lean ledge be careful, it means that if the null hypothesis is true there's more than 5% chance that your results was due to chance
https://raw.githubusercontent.com/AndrewCathcart/got-sentiment-analysis/master/cleaned_got.csv if anyone wants a dataset of around 600k game of thrones tweets from roughly the time S8E3 aired
I am just curious do we use the 3,4,5 method ever or do we use backward elimination most of the time
coz at least thats what i am learning.. just backward elimination
@desert oar You're meant to "not reject" it rather than accept it https://researchskills.epigeum.com/courses/researchskills/473/course_files/html/wht_1_10.html
I know what a hypothesis test is
What was I being corrected on?
p>0.05 means there's more than 5% chance that your results was due to chance
that's only true under the null hypothesis
which of course is the whole point
if you get a value that's "rare" under the null (in this case 1-in-20 or rarer), then we say we don't believe the null
ah wait i think i misunderstood your comment 😄
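The website example from earlier (null mean 20 min, sample mean 25 min, 100 visitors, significance 0.05) can be worked through as a one-sided z-test. This is only a sketch: the chat never gives a standard deviation, so `sigma = 20.0` below is an assumption chosen purely to make the arithmetic concrete.

```python
from math import erfc, sqrt

mu_0 = 20.0         # null hypothesis: mean time on site is 20 min
sample_mean = 25.0  # observed mean from the example
n = 100             # sample size from the example
sigma = 20.0        # ASSUMPTION: population std dev, not given in the chat

# How many standard errors the sample mean lies above the null mean.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# One-sided p-value: P(Z >= z) when the null hypothesis is true.
p_value = 0.5 * erfc(z / sqrt(2))

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {p_value < alpha}")
```

Under that assumption the p-value is about 0.006: if the true mean really were 20 minutes, a sample mean of 25 or more would show up in well under 5% of samples, so the null is rejected.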
Is anyone good with Pandas and could help me with a problem regarding the csv reader?
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
I am using csv reader to read data from one file and get rid of all the rows that are missing a value, and when i use dropna(inplace=True), it works fine. However, I want to exclude some columns from that so I tried to implement the dropna(subset=[]). Unfortunately, when appending to a new csv, that file is actually larger in size than my previous one.
@wide gyro , hi. First, try to avoid inplace=True. It is shorter, but it will bring you more headaches in the long run. Also, there has been talk of deprecating it in a future version of pandas.
Can you manually check both files and report their outputs?
Size can be affected also by data types too
shape of dataframes and which types you have
huh, why is inplace getting dropped? It sounds like something has reasonable usecases
@supple ferry I will update you in a bit with the output file, I didn’t look too much into it as I still have a decent sample size but would like to refine it as well as I can
Also, how would I pass the csv reader to other functions? I’d like to manipulate the data however I’d want once it’s read through but I’m not sure how to pass it through. New to Python but I understand the basics due to knowing a couple other languages
Would I need to put the reader into a list or dictionary? I figured I wouldn’t need to as I can call the column and row number in the method I initialize the reader, and it returns whatever I need. However, when trying to use it in certain functions, it says something like “Missing argument”
what do you want to achieve by putting it into the function?? you want to read on demand ??
@supple ferry I want to read the data and be able to use it for whatever I’d like, with one column being time that I’m converting or simple arithmetic use
how does one get the mnist dataset and how does one use it?
im using my own made neural network using pygame
numpy*
oops 😂😂
my text file has uneven number of lines
want to read to dataframe.. how should I approach this
I want everything in one column
@silent swan I think in some cases it was actually less efficient, but it was misleading people into thinking it was somehow a performance optimization; also it leads to two discordant and incompatible programming styles, rather than just one
@wide gyro what CSV reader? The native python one, or the pandas function?
If you're getting an error in your code nobody can help you unless you post a sample of code that demonstrates the error, and also the full error message
Usually when a data frame is bigger than you expect, it's because of a join/merge that went wrong
I have a few questions about Machine Learning, is this the right thread for it?
In particular, I'd like to do some lip reading using TensorFlow, but I'm unsure as to what's already been done. Anybody knows which projects are still maintained?
!ask
@desert oar Pandas
Where is the default save directory for WSL? I spent a couple days working on a project. Went to open it up today and it's gone. The .csv's I created in the folder are there, but the code is gone.
Sorry if this is the wrong section. I'm trying to use OpenCV to analyze images of particles. I have gotten to the point where I have binarized the image and the particles are decently defined, but what functions should I be looking at to analyze say the area or the diameter bounded by a contour?
And is there a way to just search the WSL drive
@random jasper I am not very well versed in openCV and there might be something already existing which does what you are asking for, but well, you can convert it into an array, and use a condition and mark those areas. If you can mathematically represent a contour that is.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
1504 try:
-> 1505 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1506 except TypeError:
~/.local/lib/python3.5/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
207 if use_numexpr:
--> 208 return _evaluate(op, op_str, a, b, **eval_kwargs)
209 return _evaluate_standard(op, op_str, a, b)
~/.local/lib/python3.5/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
67 with np.errstate(all='ignore'):
---> 68 return op(a, b)
69
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
#line of my code
-> 1155 l = (a + 2 * i + 2 * j + k)/6
~/.local/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(left, right)
1581 rvalues = rvalues.values
1582
-> 1583 result = safe_na_op(lvalues, rvalues)
1584 return construct_result(left, result,
1585 index=left.index, name=res_name, dtype=None)
1527 try:
1528 with np.errstate(all='ignore'):
-> 1529 return na_op(lvalues, rvalues)
1530 except Exception:
1531 if is_object_dtype(lvalues):
~/.local/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
1505 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1506 except TypeError:
-> 1507 result = masked_arith_op(x, y, op)
1508
1509 result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
~/.local/lib/python3.5/site-packages/pandas/core/ops.py in masked_arith_op(x, y, op)
1024 if mask.any():
1025 with np.errstate(all='ignore'):
-> 1026 result[mask] = op(xrav[mask], y)
1027
1028 result, changed = maybe_upcast_putmask(result, ~mask, np.nan)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')```
Not sure
maybe
I'm running on WSL for the first time
porting over some code
I've never seen anything like it running on windows
i see..
trying to figure out what the error is actually saying
seems to be stemming from pandas....so i guess something went wrong there..
didcha google this issue?
the type error, just copy and paste and see what it gives?
see if this helps
if this doesn't help, well, i am sorry, this, for now, is out of my league...
I am passing my pandas dataframe to a class's method and using df.at[row, column] to check if a certain value equals 0 or 1, and returns a String based on the outcome. However, nothing is being returned yet the program is being terminated without giving an error.
If I try to print that same df.at[row,column] outside of the method, though, I receive the value I need
@wide gyro did you forget to write return in the function? you'll need to show your code
class CellTower():
    def __init__(self,data):
        self.changeable = data.changeable
    def checkChange(data,int1):
        if data.at[int1,'changeable'] == 1:
            return 'Firm'
        else:
            return 'Processing'
data = pd.read_csv('towers.csv')
CellTower.checkChange(data,3)
@desert oar
One suggested that I go through cmd line and that would give me the true error, but I'm struggling to get that going
and that code reproduces your problem?
Well I found out that trying to access any method inside that class doesn't work
def checkChange(data,int1)
this needs a :
it's likely that your class is failing to be instantiated and your program isn't even running
cause that's a syntax error
is checkChange supposed to be a static method
Yes
ok
did you forget to write @staticmethod then?
well actually
@classmethod in this case
well... oh
that's weird
that should work but it's not recommended
anyway, i can't reproduce your problem, works fine
It appears that my program ends once it finishes the csv reader
Calling any method in the class I'm trying to get from returns nothing
I've been stuck on this for a few hours now, not sure what I can do
Been trying to figure out the cmd line too, looks like that's my only hope
are you expecting it to print the output?
is that the full script you have?
what do you mean "returns nothing" -- clearly that method returns something
it's also not even really a method, the way it's defined, more of a namespaced function. that will fail if you actually instantiate the CellTower class
well right now I'm just using a simple IDE and going to transfer it to Linux machine once I sort this all out
Then that's probably where I go wrong
What's the difference between method and namespaced function?
can you share more of the code?
a namespaced function is something i made up...
it shouldn't matter
point being, i unfortunately can't help because i can't reproduce the error and it's not clear what's going wrong
the .py file?
how are you running the code
and what output are you expecting. some kind of print-out?
and can you share a more representative piece of code that demonstrates the issue?
unfortunately i have to head out now but maybe someone else can see this and help
I might be overreaching but i could be using wrong keywords
Is there such a thing where i can predict what a person will say based on time of the day
Using previous messages in a group chat?
how can i go about finding resources for this?
yep, that's something you might call a "language model". usually they just predict the next word in a sentence but one of them can probably be adapted for whole text messages
it's not really my area of expertise but that might at least get you started. you can also look into the blog-literature on chatbots, which is abundant
hi
I need some help..
How do I load my text file into a dataframe? I want the dataframe to have only one column.. the text file is delimited between sentences by a newline, and the length of the lines is not fixed.. so I'm running into issues
just read the file as a list
then create a dataframe with that list as the only column
with open('myfile.txt') as f:
text_lines = [line.strip() for line in f]
data = pd.DataFrame({'text': text_lines})
@modest scarab you're talking about smart reply.. like in gmail.. there are pretrained embeddings that can help you do this.. just base next message prediction on the previous one, or on the fly
thanks a bunch!
ahhh "smart reply" thats what people call it
you might need to train your own model to incorporate "external" features like time of day
or just switch between models depending on time of day.. easier
i'd only do that as a last resort if i couldn't use fine-tuning or reuse the embeddings in another model. but again i don't know this area specifically
this is probably too advanced and probably havent been done
but it isn't "smart reply" that i am looking for
it's just basically what sort of words or phrases my friend would predictably say
at a certain time of the day
@lapis sequoia i wouldnt say that at all man. those language models are really complicated
well it's all relative.. for me programming concepts are hard at the moment.. but language comes easy..
This paper presents a computationally efficient machine-learned method for
natural language response suggestion. Feed-forward neural networks using n-gram
embedding features encode messages into...
language? maybe. but actually understanding the math and design decisions that goes into a SOTA language model? i dont think anyone would ever call that easy
unless you're already very experienced in machine learning
tbf, dont have to understand something to reuse pretrained SOTA models
math yes.. design decisions.. I'm not really sure these were built to be efficient.. just as poc..
What is the correct format to set your constructor with your dataframe?
I figured you would just set the init and match the columns like
class CT:
    def init(self,data)
        self.radio = data.radio
        self.cell = data.cell
        self.range = data.range
data = pd.read_csv("ct.csv")
Would I then do
ct1 = CT(data.iloc[i])
or
ct1 = CT(data[i])
@wide gyro what are you trying to do exactly
you want radio, cell, and range to be columns in the data frame?
I might be approaching this entirely wrong
But I wanna make instances of class CT that could hold a row of my dataframe
oh i see
what you wrote should actually work
oh youre asking about iloc
iloc is positional indexing
loc uses the index of the dataframe
[] changes meaning depending on what you pass into it
i always use .loc or .iloc for extracting rows, for clarity
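A small sketch of the difference, assuming a made-up frame whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2). The column names are borrowed from the question; the values are invented.

```python
import pandas as pd

# Made-up frame with a non-default index so labels and positions differ.
df = pd.DataFrame(
    {"radio": ["GSM", "LTE", "UMTS"], "range": [1000, 500, 750]},
    index=[10, 20, 30],
)

by_position = df.iloc[1]   # second row, whatever its label is
by_label = df.loc[20]      # row whose index label is 20
column = df["radio"]       # [] with a column name selects a column

try:
    df[1]                  # [] with a bare integer looks for a COLUMN named 1
except KeyError as exc:
    print("KeyError:", exc)

print(by_position["radio"], by_label["radio"])
```

This is why `data[i]` with an integer raises a KeyError on a frame whose columns are named strings, while `data.iloc[i]` returns the row as a Series.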
I'm getting back "init() missing 1 required positional argument: data" with iloc, but when I just use say "data[i]" I get a KeyError for whatever number is i
Yeah, sorry, is there a major difference? Can I just change it to init?
if init works and init doesn't?
I tried putting init into the code but it made me update it
class CT:
    def __init__(self,data):
        self.radio = data.radio
        self.cell = data.cell
        self.range = data.range
data = pd.read_csv("ct.csv")
ct1 = CT(data.iloc[1])
that should work
err try now. syntax
you were missing a :
oh oops, I always mess up translating the code over, I have it on desktop but use my laptop for this
apologies
but now that i initialized ct1, how could i extract say "self.radio"?
I couldn't right
or like
ct1.radio
that wouldn't work right
I did right before you sent that, works well noice
I'm finally getting somewhere hehe
@desert oar are you familiar with dropna as well?
yes
I tried using dropna(subset=['']) but I don't believe it gets rid of any data, and instead just adds another index which then puts on storage
but when I use inplace=True, it lowers the size of the file
Well I guess that might be because every column I'm taking out of subset could be the only ones containing missing data?
Also, how would I stop it from adding another index if I use the subset drop
No, I was just putting it in
??
the effect youre describing doesnt match up with what you are saying you did
can you share actual code
Now that I'm thinking about it I think the reason it isn't dumping any rows is because everything I'm subsetting in dropna contains data in each row and every column I'm excluding are the ones that are missing the data
which is why inplace=True works better but I can fix that
why would inplace=True have any bearing on this at all
However do you know how to stop the subset from adding another index in front of the updated csv file?
also inplace=True is discouraged, and there has been discussion about deprecating it in a future pandas release
it shouldn't add an index
that's why im confused
and want to see your code
Yeah I was told not to use it, inplace doesn't add an index but subset does
alright one sec
def clean(df):
    #df.dropna(inplace=True)
    df.dropna(subset=['radio', 'range'])
    df.to_csv('updated.csv')
Is that correct format?
no subset should not add an index
but that's correct syntax yes
oh i know whats happening
I genuinely think it's just that every column I'm adding to that subset isn't missing a single piece of data and the columns I'm excluding are the ones that are
that shouldn't have anything to do with any index
yeah
so the extra index is coming from somewhere else
do you have a column in the csv that you already intend to use as an index?
Well if I took a raw csv file that didn't have an index in first column and ran it through that clean method, it would add an index to first column and shift everything else 1 over
pandas always adds an index
and if I were to clean that clean file, it would add another
every pandas dataframe has an index
So just keep it?
if you want to omit the index when saving, use to_csv(..., index=False)
I'm never gonna clean a clean csv so I don't have to worry about 2 index's, but will removing the initial index cause any issues down the line?
I guess it doesn't matter that much if I have it, but with millions of rows I feel like it adds some storage
no, unless you are planning to use the index for something
it does add
if you aren't using it, omit it when saving
so it would be something like
df.to_csv('updated.csv',index=False)
I was also initially using data as the df's variable name, but I switched it to df because I feel like data could be a keyword in some cases
I'm trying to use chunksize to only get a portion of the csv file I'm reading off, but that turns the dataframe into a TextFileReader which I don't want
if I'm using pandas csv reader do I have to use something other than chunksize or do I have to read everything from the file?
Or do I somehow have to convert it back from a textfilereader to dataframe
I'm not sure how to quote on discord, but your message from 10:11 with the code snippet: you do the df.dropna() without inplace=True, but you don't ever set anything to its return value. I think you need df = df.dropna(...) since you aren't doing it inplace anymore
@earnest spear what do you mean by that?
Are you saying if I append to another csv file after dropna, it would keep the values I used before dropna?
But if I do df = df.dropna, it would take the new values into other csv file?
df.dropna() returns a dataframe with the na values dropped. It doesn't modify the dataframe you call it on (unless you use inplace=True)
so if I wanted to use the subset dropna I would have to set df equal to it
So in the code snippet you posted, the df.dropna() is functionally doing nothing because you don't assign its return value to anything
Right
Gotcha, can't believe I never thought of that
I'm assuming I'd do the same thing for fillna
Pandas as a whole likes the idea of immutable dataframes. Most operations don't change the dataframe but rather return a new one. The inplace=True is saying, instead of returning a new dataframe, I want to change the dataframe I'm calling this on.
So could I use the subset followed by inplace = true
or would that not work
instead of setting it equal
like do they accomplish same thing
df = df.dropna() and df.dropna(inplace=True) will yield the same result (df being a dataframe with the NA values dropped). It's functionally different in that the first creates a new dataframe, whereas the second just changes the existing one, but they do accomplish the same thing.
The use of subset or other arguments shouldn't affect the usage of inplace
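A short sketch of the three variants, assuming a made-up three-row frame; the column names follow the snippet above and the values are invented.

```python
import pandas as pd

# Made-up data: one missing value in 'radio'; the 0 in 'range' is a real value.
df = pd.DataFrame({"radio": ["GSM", None, "LTE"], "range": [0, 500, 750]})

df.dropna(subset=["radio"])             # result is discarded, df is unchanged
rows_after_bare_call = len(df)

cleaned = df.dropna(subset=["radio"])   # keep the returned frame instead

# inplace=True mutates df itself and returns None.
result = df.dropna(subset=["radio"], inplace=True)

print(rows_after_bare_call, len(cleaned), len(df), result)
```

Only rows with a missing value in the listed columns are dropped, so the row whose range is 0 stays in both `cleaned` and `df`.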
I have one question too regarding pandas.. do u think i should buy a book regarding pandas library.. coz i know its very important library?
like do i need to know it on fingertips?
My opinion is that the best way to learn pandas is just use it for something. Its documentation is pretty good and usually any question you have has a solution on stack overflow since it's so widely used. But if you learn well from books then it's never bad
So i can understand everything from documentation?
Like usually documentation are complicated..u get everything but its a little hard for me to understand
Yeah that is the issue - pandas can go pretty deep, but being able to parse the documentation for what you need is a good skill to get comfortable with
I'm trying to use iterrows to compare some floats that have been selected and then run it through each line of the dataframe's same float
but i'm not sure how to access that df's float
I tried using at and iat but it says that i need integer indexers
and when i use loc and iloc it returns "KeyError: None of [Index] are in the [Index]"
is Linear Regression more reliable or support vector machines?
i noticed that when i use SVM.SVR() in recognising a pattern it only works for certain extent after that it gets wrong where is LinearRegression() was on point when predicting the pattern
so which would be better to use when it comes to Predicting stock prices?
Is there a way I can drop any values that are NaN but not those with an input of 0?
dropna for pandas appears to get rid of 0/NaN but some of the 0 values are good for me
Does pandas take account the missing values for the calcul of correlation?
@wide gyro thats not what chunksize does. Read the docs carefully
@wide gyro you are asking a lot of XY questions. Maybe start by saying what you are trying to do first, then tell us what specific thing isnt working
Yeah I switched to nrows and it works fine @desert oar
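For reference, a sketch of what the two read_csv options return, assuming an in-memory CSV string as a stand-in for the real file:

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file.
csv_text = "radio,range\nGSM,1000\nLTE,500\nUMTS,750\nGSM,250\nLTE,900\n"

# nrows reads only the first N data rows and returns a plain DataFrame.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# chunksize returns an iterator that yields DataFrames of up to N rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]

print(len(head), chunk_lengths)
```

So chunksize is for walking through the whole file piece by piece, while nrows is the option for reading just the first portion.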
@hollow quartz if you mean the NaN values, DataFrame.corr() excludes them pairwise: a row is left out only for the column pairs where one of the two values is missing.
@sand reef if the line have a a missing value this line might not be taken?
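A small check of how missing values enter the correlation, assuming a made-up two-column frame with one NaN:

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is exactly 2 * 'a', with one value missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# corr() leaves out the row where 'b' is missing when pairing a with b.
r_all = df.corr().loc["a", "b"]

# Dropping incomplete rows first gives the same number for this pair.
r_complete = df.dropna().corr().loc["a", "b"]

print(r_all, r_complete)
```

With more than two columns the two approaches can differ, because dropna() removes a row for every pair while corr() only skips it for the pairs that involve the missing cell.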
#Building Multiple Linear Regression Model
import numpy as np
import statsmodels.api as sm
#adding the original x to a column of ones so that ones column is at first
x = np.append(arr = np.ones((50,1)).astype(int), values = x, axis=1)
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
#removing the index x with p values greater than SL
x_opt = x[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
x_opt = x[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
x_opt = x[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
x_opt = x[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
Is there a way in which i can run a loop so that p values checks if its less than 0.05 and removes the variable automatically..
I know automated backward elimination can be done using r squared values.. but any way of p values
Also after i did the backward elimination now what.. do i create a new test set and training set and predict new values?
Heya. What exactly is the difference between MSE in keras and the loss? The MSE in my case is as seen in the image. I am not sure I understand why it is growing or if the values are good
A growing mean squared error means something is wrong. It should go down.
MSE just tells you how close or far your predictions are. The loss is what's used to train your model.
It has the answer to all such technicalities.
@lapis sequoia try making a list with those indexes. Now remove values from that list with your loop and condition. And then pass that list of indices into the x_opt = x[:, custom_list]
To remove items from list:
https://www.quora.com/How-do-I-remove-an-item-from-a-python-list
You can remove an item from a list in three ways: 1. using list object's remove() method. Here you need to specify an item to be removed. If there are multiple occurrences, then the first such item is removed. This can be seen as removal by item's...
We can also use .pop method right?
Well. With no argument, pop() removes the last element of the list; pop(i) removes the element at index i
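A quick sketch of the usual ways to remove items from a list; the column indexes are just example values.

```python
cols = [0, 1, 2, 3, 4, 5]

cols.remove(2)        # remove by value: drops the first item equal to 2
last = cols.pop()     # no argument: removes and returns the last item
second = cols.pop(1)  # with an index: removes and returns the item at index 1
del cols[0]           # remove by index without returning anything

print(cols, last, second)  # [3, 4] 5 1
```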
oh ok
Ok...
Ok so what u r saying is I make a list with all my indices and then i compare p values?
and gradually remove it and then fit in x_opt
Yes
Good question.
Is there any particular command?
Ok wait i saw this somewhere..
regressor_OLS.pvalues[j].astype(float)
Is this any good?
Ok well thanks for the idea tho.. i will try to implement this
No problem!
def backwardElimination(x, sl):
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
    regressor_OLS.summary()
    return x
SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)
@sand reef is this something it would look like? This was the code given to be done in automatic backward elimination but to be honest i didnt get it somewhat
Like the numVars = len(x[0])
Numvars = len(x[0]) means number of columns in x
Ok..
Well. Kind of yes. It is kind of similar to what I suggested.
Instead of making a copy and assigning it to x_opt
It directly removes the index from x itself
Altho I am kind of afraid of the fact that this might throw a index out of range error
Nah it won't.
i is never used in indexing, so it won't throw.
So, what's the issue?
Ok.. so i are all the entries in a particular line right?
Yus. Index of the columns.
Ok and then j is picking up each entry in that line
ok this is a very noob question but why did we do numVars - i?
also just to confirm
No. J is still a column index.
i will look something like this a b c d
Because if you see in np.delete(), we are passing it with axis =1
and not this
a
b
c
d
Well. The one above this comment is a column.
Yes
So what is being done here is,
Max value of p for a column is being taken, and then compared. If a p value of a column is the same as the max p value, it is then removed from x
And the numvars - i
That accounts for the columns already removed. If one column was dropped in each earlier pass, x only has numVars - i columns left, so j only needs to run over those.
Oh ok.... now i get it
so in first for loop it does i=[a,b,c,d] takes out the max p value out of them
and then comparing it with sl so the first time i will be 0 so j will take all index .. and then delete the max value
Yes
and then the second loop will continue till the condition is satisfied i.e. all the indexes with p values more than sl will be removed and then it will move onto next line starting from the first loop
right?
Yus
Ok... jeez thanks a lot
Yeah. All values will be removed until either nothing remains or only columns with p values less than sl remain.
No problem! Now I am off.
ok
Do NaN values and 0 give the same result if you used that part of your data?
Like if I asked for a row of a column that contains 0 or NaN, will they both return 0?
@sand reef Would you expect it to look like this: Also what does the number 5.X mean? Is it a lot?
Also @sand reef I do not use the MSE as loss but rather a metric (in Keras)
I'm trying to learn how to use Beautiful Soup by scraping data from Twitch.tv
Twitch.tv Top Channels Data Scrape
import csv
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
URL = "https://www.twitch.tv/directory/all"
uClient = uReq(URL)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html , "html.parser")
containers = page_soup.findAll("div",{"data-target" : "directory-game__card_container"})
containers2 = page_soup.findAll("div",{"class" : "tw-mg-b-2"})
print(len(containers))
print(len(containers2))
I got this far
but it says both containers have length 0
containers and containers 2 should be how I separate each of the channels on the top channels page but it doesn't pick them up
also I'm sorry I just dumped the code in chat, I don't know how to format it in this server
for x in v:
print(x)
dangit
how do you make it activate
oh
this isn't a help channel
@heavy tundra is this data science related?
seems like a scraping question to me
is that not data science
scraping? uh, no...?
it would lead to a dataset and statistics
yes but the scraping itself is not data science and that's what your question is about
Hello
I want to start a career in data science
do I have to master numpy/pandas, or can I go through them like any other lib
so i start in Data visual.. and ML
@supple ferry numba is lit. I'm seeing speed ups of 100-1000x with minimal work. You basically convert pandas objects to np objects and wrap loops in @jit functions.
@hardy solstice libs like numpy for example "just" provide implementations of common mathematical concepts, if you understand them mastering numpy is a question of reading the docs
this is what i mean, it is just normal lib where i can go through docs
but someone told me i have to master it as first step
if you know the maths behind it Id call bullshit on that
oooh this explains alot
@sand reef so I found the error (no pun intended :D) thanks to @feral lodge Turns out MSE is not usually used for classification problems, this is why the MSE I linked earlier is wrong
I see. So you were classifying.
Is there an equivalent of np.argmin() that is compatible with numba or should I rewrite it in python and use it there?
index_min = np.argmin(close.iloc[i-25:i].values)
I'm on the latest pip3 install of numba and it's not liking it
I'd like to take a column of time in my dataframe that is set to Epoch, and change it to the exact date
which I accomplish in one of my functions
but how do I loop through an entire dataframe and run that time column through my epochTimeConverter function until there are no more rows
I tried this but I'm not sure if it'd work
df.created = CT.epochCre()
df.updated = CT.epochUpd()
would that put every value in created and updated column through their respective functions?
Got TypeError: epochCre() missing 1 required positional argument: 'self' when I tried to run it
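A sketch of converting whole columns at once instead of looping, assuming the created and updated columns hold Unix timestamps in seconds (the sample values are made up; use unit="ms" if they are milliseconds).

```python
import pandas as pd

# Made-up rows; 'created' and 'updated' are assumed to be Unix seconds.
df = pd.DataFrame({
    "created": [0, 86400, 1500000000],
    "updated": [60, 90000, 1500003600],
})

# Vectorized: converts the whole column at once, no explicit loop needed.
df["created"] = pd.to_datetime(df["created"], unit="s")
df["updated"] = pd.to_datetime(df["updated"], unit="s")

print(df)
```

A custom converter that takes one value can be applied the same way with `df["created"].apply(some_function)`. The TypeError above comes from calling an instance method on the class itself, so nothing is passed as self.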
I use dt.datetime.now()
I just watched a YouTube ML program by some expert at Google. Man it's so intimidating to not know so many concepts in machine learning.
The code seems so unrealistic and difficult to understand. It makes me wanna give up.
I am a beginner, been studying for a few months, and it feels like I'm nowhere near.
What to do when your confidence level is down?
Anyone awake that can answer a question about SaaS metrics?
Posting it in Help-5 if any of you can help
@lapis sequoia is it already a csv?
most of the time, it's easier to manipulate the csv in SQL and then save it as a new csv file
no it's not csv
it's in the dataframe
I read a text file to get this
im thinking I need to do something like
df = df.drop(df[df.text.str.contains].index)
but not sure how to implement the contains here..for excluding datetime
subscription_start subscription_end
35:09.0
09:48.0
00:51.0
53:30.0 10:28.0
I am trying to apply ARPU & Churn rate.
but example 53:30.0 or format XX:XX.X doesn't look like any date format I can find online.
i think its most likely minutes:seconds
However start is sometimes bigger than end.
For instance
28:33.0 and 16:02.0
if there was a counter increasing every time it exceeds an hour I would understand, but there is nothing
need code to help buddy..
Hello guys
Hoi
Mmm?
To train yourself with?
yeah
i mean
like a project so i could test my skills and improve
or test my knowledge
Well. There are a lot of things you can do.
And it depends where you want to go
Try googling for some projects
Related to the skills you've learnt I guess?
And try pulling them off.
@errorsans to be honest i like data analyzing that's why i dove into pandas, but the thing is i feel like lost
Maybe Kaggle's micro-courses are something for you to get a feel for the field: https://www.kaggle.com/learn/overview
I mean, well, then I guess I am the wrong person to be asking that then.
Cuz I literally am trying to do something like that myself.
@sand reef its ok , thank you !
@lyric canopy i just finished Sentdex tutorials
@lyric canopy will do! Thank you very much!
I'm trying to click on a HTML5 canvas.
or simulate some clicks.
I can't get selenium basic to work, can someone point me in a direction to simulate clicks and drags on an HTML5 canvas?
Hi
im doing a project with python and dataframes, can i ask for a little help with pandas
just post the question dude @sleek otter
Hey!
I'm writing a report where I have to explain MLE
But what's the intuition behind multiplying probabilities together if you've got a large dataset?
any dot graphs making application?
@lapis sequoia what you're asking for makes no sense
in the file i downloaded i dont see EXE file
the issue here is that you don't understand how to run programs, not that they all need to be "built"
there's a lot of exe files
in the zip
they don't do nothing
they dont run it
yeah i just want simply open program and write graph
exactly like that
thats all i want
any app
anybody knows app for it?
those are good
but i want to insert image to graphs
and im pretty sure its not possible on web ones
that isn't generally possible
it is
make a graph out of an image?
noo
@outer marsh it's just the arithmetic of probabilities. each data point is the realization of a random variable, and we assume those random variables are independent -- this is the "iid" assumption.
that means each observation is an "event". let's say you have a data set with 5 individuals, and you know their favorite ice cream flavor (chocolate or vanilla), and you know their age -- you can propose a (generalized) linear model
Y_i | AGE_i ~ Bernoulli(p_i), i.e. Pr(Y_i = "chocolate" | AGE_i) = p_i
logit(p_i) = b1*AGE_i + b0
This is pretty typical of what you'd use maximum likelihood for.
Our data set might be like
age | flavor
----|----------
21 | chocolate
27 | vanilla
18 | chocolate
20 | vanilla
30 | vanilla
which means we have 5 random variables (one per individual), and one event per individual (one data point per individual).
You can think of the entire dataset as one big event: the intersection of all the independent events that correspond to individual data points. Mathematically you might write it like this:
Dataset = Y_1 ∩ ... ∩ Y_5
so that
Pr(Dataset | AGE_1, ..., AGE_5 ) = Pr(Y_1 ∩ ... ∩ Y_5 | AGE_1, ..., AGE_5)
Then you just apply the usual rule for computing the probability of the intersection of independent events.
which gives us
Pr(Dataset | AGE) = Pr(Y_1 | AGE_1) * ... * Pr(Y_5 | AGE_5)
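A minimal sketch of that product for the five-row table above. The coefficient values passed in at the bottom are arbitrary assumptions for illustration, not fitted estimates.

```python
import math

# The five rows from the table above; 1 = chocolate, 0 = vanilla.
ages = [21, 27, 18, 20, 30]
chocolate = [1, 0, 1, 0, 0]


def prob_chocolate(age, b0, b1):
    # Inverse logit: p_i = 1 / (1 + exp(-(b1 * AGE_i + b0)))
    return 1.0 / (1.0 + math.exp(-(b1 * age + b0)))


def likelihood(b0, b1):
    # Product over independent observations of Pr(Y_i = y_i | AGE_i).
    total = 1.0
    for age, y in zip(ages, chocolate):
        p = prob_chocolate(age, b0, b1)
        total *= p if y == 1 else 1.0 - p
    return total


def log_likelihood(b0, b1):
    # Same quantity on the log scale: the product becomes a sum.
    total = 0.0
    for age, y in zip(ages, chocolate):
        p = prob_chocolate(age, b0, b1)
        total += math.log(p if y == 1 else 1.0 - p)
    return total


# Arbitrary parameter guesses, compared against "age has no effect".
print(likelihood(5.0, -0.25), likelihood(0.0, 0.0))
```

Maximum likelihood means searching for the b0 and b1 that make this number as large as possible. In practice the log version is maximized, because a product of many small probabilities underflows quickly.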
Hmmm
So it's like the probability of the dataset?
Like I don't get what the product of all these probabilities represents
@desert oar
You're exactly right, when you do maximum likelihood you are looking at the likelihood over the entire data
So you can interpret the product of the probabilities literally as the probability of the intersection of all those events describing all the data points
Thinking about it that way also makes it obvious why independent and identically distributed are necessary assumptions
If they aren't identically distributed, you need a different expression for each data point, which is fine of course but you can't implement that as efficiently in a computer, and computing the gradient, not to mention the Hessian, is more involved
Yeah I see
Whereas if they aren't independent, the whole expression falls apart
The video I'm watchin (https://www.youtube.com/watch?v=I_dhPETvll8) describes the event as the variable x_1 took on the value x_1 and the variable x_2 took on the value of x_2 etc.
Yeah, although usually you distinguish the variable from its realization with capital and lowercase letters
x_1 is the realization of X_1
Ah okay
Meaning the event is "X_1 = x_1"
Which of course has zero probability if X is continuous...
It's kind of just how probability theory is set up. that's why the event thing isn't technically correct in general
Think about a number line, the point "1" is infinitely small, because there are an infinite number of real numbers that are arbitrarily close to it
Yeah
This stuff gets very esoteric very quickly, but suffice it to say that an infinitely small interval must have probability zero
Well yes
This is where probability density comes in
A probability density is the derivative of the distribution function, right? Well, a derivative is basically a function of an infinitely small interval, that's how they are defined
Yeah
So even though the event itself is infinitely small and has zero probability, the way the stuff works is you can just drop in the probability density instead
And that's where you get the usual expression of multiplying the probability density for each point in the data set
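A sketch of the continuous case, assuming made-up measurements and a normal model with a fixed standard deviation. The density at each point is dropped in where the event probability would go.

```python
import math

data = [4.8, 5.1, 5.3, 4.9, 5.0]  # made-up continuous measurements


def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and std dev sigma.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(mu, sigma):
    # Log of the product of densities, written as a sum of logs.
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)


# For a fixed sigma, the sample mean maximizes the normal likelihood in mu.
mu_hat = sum(data) / len(data)

print(log_likelihood(mu_hat, 0.2), log_likelihood(mu_hat + 0.3, 0.2))
```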
Hmmm
I think I'll ask my math teacher tomorrow, I think it's better if I can talk to someone irl
But thanks for the help
@desert oar the lecturer said to use this line plt.plot(x, lin_reg2.predict(poly_reg.fit_transform(x)), color = 'blue')
instead of this plt.plot(x, lin_reg2.predict(x_poly), color = 'blue')
coz he said that then we can use the model for other dataset
so was he saying that if we had training and test set instead of the current dataset where we dont split it.... then x_poly would already have been assigned to something'?
the code is this
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing dataset
dataset=pd.read_csv('Position_Salaries.csv')
x = dataset.iloc[:, 1:2].values
#if we give only index value i.e. 1 then it will return a vector rather than matrix
#hence we give range i.E. 1:2 rather than 1
y = dataset.iloc[:,2].values
#polynomial linear regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
x_poly = poly_reg.fit_transform(x)
lin_reg2 = LinearRegression()
lin_reg2.fit(x_poly, y)
#plotting polynomial regression model
plt.scatter(x, y, color = 'red')
plt.plot(x, lin_reg2.predict(poly_reg.fit_transform(x)), color = 'blue') #<----- this line
plt.title('truth vs bluff (without x_poly)')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.show()
#another model using x_poly
plt.scatter(x, y, color = 'red')
plt.plot(x, lin_reg2.predict(x_poly), color = 'blue') #<---- and this line
plt.title('truth vs bluff (with x_poly)')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.show()
because no matter what the line is i get the same graph for both
the first one re-fits the poly reg model, the second one doesnt
ok so if i had a dataset in which i had training and test set then would i understand this thing better?
also when i increase my degree the regression line keeps on improving but somewhere in q and a the lecturer said that if we increase the degree too much then i would overfit the model... how is that possible
i dont really understand your question about training and test sets
i also dont know why you would re-fit your model every plot...
err, oh. i see
maybe hes saying that if you want to use a different x, you can use the first one instead of hardcoding x_poly
that's an arbitrary distinction... just write a function if you need to reuse
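One way to write such a function, as a sketch: fit the polynomial expansion once, then call transform() (not fit_transform()) on any new x. The data is a made-up stand-in for the Level/Salary columns, and predict_salary is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up stand-in for the Level/Salary columns: an exact quadratic.
x = np.arange(1, 11).reshape(-1, 1)
y = 1000.0 * x.ravel() ** 2

poly_reg = PolynomialFeatures(degree=2)
lin_reg2 = LinearRegression().fit(poly_reg.fit_transform(x), y)


def predict_salary(levels):
    # transform() reuses the already-fitted expansion; nothing is re-fitted.
    levels = np.asarray(levels, dtype=float).reshape(-1, 1)
    return lin_reg2.predict(poly_reg.transform(levels))


print(predict_salary([6.5]))
```

The two plots in the question look identical because the polynomial expansion does not learn anything from the values in x, so re-fitting it on the same x yields the same features as x_poly.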
as for the degree and overfitting... https://statisticsbyjim.com/regression/overfitting-regression-models/
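A small experiment that illustrates the point, assuming synthetic data (a quadratic trend plus noise) and plain numpy polynomial fits instead of the sklearn pipeline above:

```python
import numpy as np

rng = np.random.default_rng(0)


def truth(x):
    # Assumed "true" relationship for this toy example.
    return 1.0 + 2.0 * x - 3.0 * x ** 2


# Separate noisy samples for fitting and for checking the fit.
x_train = np.linspace(0.0, 1.0, 12)
x_test = np.linspace(0.04, 0.96, 12)
y_train = truth(x_train) + rng.normal(0.0, 0.2, x_train.size)
y_test = truth(x_test) + rng.normal(0.0, 0.2, x_test.size)


def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 2, 5, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

Training error can only go down as the degree grows, since each higher-degree fit contains the lower-degree ones as special cases. Error on held-out data typically stops improving and then gets worse, which is what overfitting refers to.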
guys any favorite books or courses to start data science
prereq for this is that you know basic statistics, calculus, and matrices
but a decent beginner book
Will check it out
could someone dumb down the kalman filter for me pls lol
so that i can implement it in python
It looks at the state variable (the one you're trying to measure), and assumes its measurements, its changes, etc. are affected by noise that's Gaussian (follows a normal distribution)
And then it uses probability to predict the most likely state and standard deviation/variance given previous measurements, the current measurements and any "control" you put into it
@warm orbit
i see
It also assumes the current state is related to the control/previous state/measurement using a linear function
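A minimal one-dimensional sketch of that idea, assuming the simplest possible model: the true value stays constant apart from noise, and there is no control input. The noise variances and readings are made-up numbers, and kalman_1d is a hypothetical helper name.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a value assumed constant apart from noise."""
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state model is x_k = x_{k-1} + noise, so the
        # estimate stays put and only its uncertainty grows.
        var = var + process_var
        # Update: blend prediction and measurement, weighted by the gain.
        gain = var / (var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        var = (1.0 - gain) * var
        estimates.append(estimate)
    return estimates, var


# Made-up noisy readings of a quantity that is roughly 5.
readings = [5.2, 4.9, 5.1, 5.0, 4.8, 5.3, 5.0]
estimates, final_var = kalman_1d(
    readings, process_var=1e-4, measurement_var=0.04,
    initial_estimate=0.0, initial_var=100.0,
)
print(estimates[-1], final_var)
```

The gain is the key quantity: when the prediction is uncertain relative to the sensor, the filter trusts the measurement more, and the reverse. The full multi-dimensional filter replaces these scalars with matrices.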
hello guys, I've been stuck on a problem for 3-4 hours... when I try to convert my Reviews column (with this code: data['Reviews'] = data['Reviews'].astype('float') ) I get this error message: "You have categorical data, but your model needs something numerical. See our one hot encoding tutorial for a solution." I tried one-hot encoding but got a different error... I'd be glad if somebody could help, thank you
what is the error
ValueError: could not convert string to float: '3.0M'
says it's a string
is this from you trying to do one hot encoding?
what column does this value belong to
@desert oar you available? I counted the columns of my rows in dataframe, and while most have 14, some have 13 or less. I'm assuming it's telling me the ones less than 14 contain NaN values
However, isn't 0 considered NaN? If that's the case, I wanted to know if I could remove rows that are missing an input, but keep those with 0 as those are useful
what do you mean 'counted the columns of my rows'
and why would 0 be NaN? NaN stands for "not a number" (part of the IEEE floating point specification, used/abused in pandas to represent missing data)
what do you mean "counted"
I ran print(df.count(axis='columns')) and it showed 14 for some, 13 for others
do you know what .count does?
i recommend checking out the docs. i don't think it does what you think it does
"Count non-NA cells for each column or row."
Where am I getting the 13 and 14 from then?
My dataframe holds 14 values
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
I figured the ones that showed 13 were missing a value inside
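A small sketch of how 0 and NaN are treated differently, assuming a made-up three-column frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 has a real 0, rows 1 and 2 each miss one value.
df = pd.DataFrame({
    "radio": ["GSM", "LTE", None],
    "range": [0, np.nan, 750],
    "cell": [11, 22, 33],
})

per_row = df.count(axis="columns")  # non-missing cells in each row
kept = df.dropna()                  # drops rows with any missing cell
range_mean = df["range"].mean()     # NaN is skipped, not treated as 0

print(list(per_row), len(kept), range_mean)
```

The row containing 0 counts as complete and survives dropna(). The mean is taken over the two present values only; it would be lower if NaN were counted as 0.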
Is there a difference between dropna inplace and subset?
`inplace : bool, default False
If True, do operation inplace and return None.
`
subset : array-like, optional Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
so for subset, I would list the columns that I wanna check for missing values, and those according rows would be marked to be eliminated
I guess I don't really understand what inplace fully does
In place modifies the data frame, rather than creating a new data frame. It sounds like it'd be faster or more efficient, but in practice it usually isn't. There has also been talk of deprecating it, so don't bother with it
Live right now, in need of some help. Lacking some insight into what I'm doing wrong
@spark nimbus on stream? i cant join but if you describe here maybe i can help
so basically
I'm trying to play audio
when I play it all at once it works fine
but I want to play the next few samples every 20ms or so
and now it's all static-y
is using dropna(subset=) realistic if I'm placing every column in the subset
@wide gyro what do you mean placing? if you omit subset= it uses every column by default
@spark nimbus is this a data science question?
not sure what you mean. is this about python? OBS?
Oh so I don't need to add every column like subset='radio','mcc',etc.
didn't know that
yeah that one isn't explicitly mentioned in the docs, but the examples demonstrate how it's used
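A sketch of both forms, again on invented data. Note that `subset` takes a list of column labels, and that rows holding 0 are kept either way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "radio": ["GSM", None, "LTE"],
    "mcc": [262.0, 262.0, np.nan],
    "signal": [0, 0, 0],  # zeros are real values, not missing
})

# No subset: a row is dropped if ANY column is missing.
all_cols = df.dropna()

# With subset: only the listed columns are checked.
only_radio = df.dropna(subset=["radio"])

print(all_cols.index.tolist())
print(only_radio.index.tolist())
```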
If I am not wrong, isn't NaN counted as 0 during computation?
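For reference, pandas does not treat NaN as 0. By default its reductions skip NaN, which gives a different answer than filling with 0 would. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN is skipped by default
mean_skipped = s.mean()             # averaged over the 2 present values
mean_as_zero = s.fillna(0).mean()   # what "NaN counted as 0" would give
strict_total = s.sum(skipna=False)  # NaN propagates when skipping is off

print(total, mean_skipped, mean_as_zero, strict_total)
```

The sum happens to match either way, but the mean does not, so it matters which behaviour you assume.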
@desert oar audio is closer to data science than anything else here
Say. I have learnt conv nets and all and rnns and all. What do I do now? Like what skills are needed for a decent job in this field? And is data science a completely different aspect all together? Or is it somewhat related?
Basically the coursera Machine Learning course and the Deep Learning Specialization.
@sand reef depends on the job
there arent many "junior ML engineer" roles out there afaik yet
@spark nimbus ok, unfortunately im not sure what the context for the problem is, nor am i an audio guy. good luck though
@sand reef also my career path has been very much not "machine learning oriented" -- so maybe i'm not the best to ask. i know that, honestly, i wouldn't hire a data scientist that's only done a couple online machine learning courses
i'd personally prefer someone with math and stats background, who can reason about data
data cleaning, missing values, basic statistical analysis, etc
and who knows how to code
if they have a math background i can teach them any of the fancy modeling they need to know
ideally someone who can write well and make good data visualizations too
maybe try a project now? something "end to end" where you have to choose a problem statement, get your own data, clean it yourself, come up with your own model, and then make some kind of report w/ your results
that's quite a bit of work but if i was hiring at the very basic junior level it would make me more interested
@sand reef if you're going for a computer vision role, unless you're familiar and up to date with research, I don't think you'll find many to hire you. That means being familiar with ResNet, ResNeXt, AlexNet, VGG, GoogLeNet/Inception, U-Net, (Fast/Faster) R-CNN + knowing about GANs, NLP-based description models, etc etc. Computer vision is sort of high barrier of entry and you should definitely be familiar enough that you can pick up random papers from CVPR and understand them
If you're going for generic deep learning, well... You'll be disappointed because no one actually uses just deep learning that much. It's used in research, and for CV, NLP and then a few things here and there. Outside of that, it either works worse or there isn't enough data for it
The only reason it's popular as it is is because people like the sound of "neural networks" and because it's easier to learn without much maths.
so, uh, i have no idea what i'm doing since i'm bad at ml
can anybody explain what should i do so it predicts continuation of csv file?
1 and 2 mean wins (though 1 = safer, 2 = closer to losing), while 3 means lose, 4 5 6 are the teams, and the leftmost column is the round number
i'm quite bad at ml but i just wanna try to predict some stuff
would be nice to improve knowledge of myself with this project
also, anyone got good keras books?
i just have no idea how to predict it
@lean ledge is there an advantage to using a kalman filter over a linear regression
i think they both identify a linear trend with gaussian noise
yes kalman just predicts the next one right
linear regression is finding an approximate A for a given
y=Ax
When y and x are known for a lot of examples
Kalman filter is finding y_(k+1) for a
x_(k+1) = Ax_k + Bu_k + Ω
y_(k+1) = Cx_(k+1) + μ
Where μ and Ω are Gaussian noise vectors
Given previous values of x, y and known A, B, C
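The contrast above can be sketched in a few lines of NumPy. This is a scalar toy version, not a general implementation: the noise variances `Q` and `R`, the true values, and the starting estimate are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear regression: many (x, y) pairs are known, A in y = A x is estimated.
A_true = 2.5
x = rng.uniform(-1, 1, size=200)
y = A_true * x + rng.normal(0, 0.1, size=200)
A_hat = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

# Kalman filter (scalar case): A, B, C are known, the state is tracked
# step by step from noisy measurements y_k.
A, B, C = 1.0, 0.0, 1.0
Q, R = 1e-4, 0.25        # assumed variances of process and measurement noise
u_k = 0.0                # no control input in this toy example
true_state = 3.0
x_est, P = 0.0, 1.0      # initial state estimate and its variance

for _ in range(200):
    # Simulate the system one step forward and take a noisy measurement.
    true_state = A * true_state + B * u_k + rng.normal(0, np.sqrt(Q))
    y_k = C * true_state + rng.normal(0, np.sqrt(R))

    # Predict step: propagate the estimate and its uncertainty.
    x_pred = A * x_est + B * u_k
    P_pred = A * P * A + Q

    # Update step: blend prediction and measurement using the Kalman gain.
    K = P_pred * C / (C * P_pred * C + R)
    x_est = x_pred + K * (y_k - C * x_pred)
    P = (1 - K * C) * P_pred

print(A_hat, x_est, true_state, P)
```

The regression recovers a fixed coefficient from a batch of data, while the filter keeps refining an estimate of a changing state as each new measurement arrives.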
@warm orbit
I think you should really just study a bunch of maths
Kalman filters are generally the kind of stuff you learn in a 3rd year electrical engineering (signal processing) class
They're the most basic form of Bayesian filters and still aren't very approachable
that's what I told him yesterday..
what's a good example to start learning about neural networks
i know the concept but i need a project/example to do it on
im thinking a game for the computer to play or processing images but idk
i think the MNIST handwritten digits one is pretty much canonical in terms of image classification
http://yann.lecun.com/exdb/mnist/
this is a good starter dataset imo, the samples are small which makes it feasible to quickly retrain and play with parameters of your network, and you can easily find others' implementations and see what they did differently
otherwise this is a good repository of commonly used datasets
https://archive.ics.uci.edu/ml/index.php
and of course you can check out kaggle too
ive seen sentdex use mnist in his tensorflow tutorial series but i wanna try my own thing instead of just following the tutorial blindly
Is NN needed for handwriting recognition or will just normal regression work?
regression is a problem, rather than an algorithm, and digit recognition is inherently a classification task
you could use logistic regression to classify with one vs all
after all, logistic regression is essentially a linear model fed through a sigmoid function, and you often see the sigmoid used as an activation function in simple networks
so you can think of a simple NN as multiple sigmoids connected together, which allows them each to learn a distinct relationship and work together to solve a more complex problem
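A rough NumPy sketch of the one-vs-all idea. To keep it self-contained it uses three synthetic 2-D blobs as stand-ins for the digit classes instead of MNIST, and plain gradient descent with an arbitrarily chosen learning rate and step count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated 2-D blobs stand in for the classes (synthetic data).
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
labels = np.repeat(np.arange(3), 50)
X = centers[labels] + rng.normal(0, 1.0, size=(150, 2))
Xb = np.hstack([X, np.ones((150, 1))])  # append a bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-vs-all: train one binary logistic regression per class.
W = np.zeros((3, 3))  # one weight row (2 features + bias) per class
for c in range(3):
    target = (labels == c).astype(float)
    for _ in range(500):
        p = sigmoid(Xb @ W[c])
        # Gradient of the mean log-loss with respect to the weights.
        W[c] -= 0.1 * Xb.T @ (p - target) / len(target)

# Predict the class whose classifier is most confident.
pred = np.argmax(sigmoid(Xb @ W.T), axis=1)
accuracy = (pred == labels).mean()
print(accuracy)
```

Each row of `W` followed by a sigmoid is exactly one "neuron", which is why this looks like a one-layer network with no hidden units.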
So, if I did contribute in a research paper, will that boost me up?
... honestly, what i'm doing wrong
Well, whats the error?
Well, according to the code, you are taking the first 10 columns into X
and the third column into Y
*11 columns for X
i'm just trying to finetune existing network to predict my csv
i have very little idea behind ml honestly
well, I can't seem to pinpoint the major issue here other than the labels being a part of the training data
i guess?
maybe i'll try to run it and see
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import to_categorical
df = pd.read_csv('m.csv')
dataset = df.values
X = dataset
Y = dataset
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
print(X_scale)
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.3)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
model = Sequential([Dense(32, activation='relu', input_shape=(4,)),
                    Dense(32, activation='relu', input_shape=(4,)),
                    Dense(4, activation='sigmoid')])
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])
hist = model.fit(X_train, Y_train,
                 batch_size=1,
                 epochs=100,
                 validation_data=(X_val, Y_val))
print(model.predict(X_scale))
welp, gimme a sec to download sklearn
any comments so far?
i'm trying to make nn that'd predict the contents of csv
and i have no idea, because i'm bad at ml actually and just trying to finetune existing network
before it was
X = dataset[:,0:10]
Y = dataset[:,3]
but that didn't work
though it still doesn't
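One thing worth checking with those two slices, sketched here on three rows in the same 4-column layout as the csv: the stop index in `0:10` is silently clipped, and index 3 is the last column, so Y ends up being a column that is also inside X. The `X_features` name is only for illustration.

```python
import numpy as np

# Three rows in the same layout as the csv: round, then three team results.
dataset = np.array([
    [47, 1, 3, 1],
    [46, 1, 3, 2],
    [45, 3, 1, 2],
])

X = dataset[:, 0:10]  # stop index is clipped, so this is all 4 columns
Y = dataset[:, 3]     # index 3 is the fourth (last) column

print(X.shape)
print(Y)

# Y is identical to X's last column, so the labels leak into the inputs.
print(np.array_equal(X[:, 3], Y))

# One way to keep them apart: inputs are every column except the label.
X_features = dataset[:, :3]
print(X_features.shape)
```

Whether the last column is the right thing to predict is a separate question; this only shows that the two slices overlap.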
also, @sand reef, i'm calling you!
since i didn't get any further by changing layers
mew?
ah yes
i am having an issue here
for some reason, its not importing scipy
even tho i have it installed
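A common cause is that the package got installed for a different Python than the one running the script. A small check, assuming nothing beyond the standard library (`is_importable` is just a helper name made up here):

```python
import importlib.util
import sys

# The interpreter actually running this script.
print(sys.executable)

def is_importable(name):
    """Return True if this interpreter can find the named top-level module."""
    return importlib.util.find_spec(name) is not None

print(is_importable("scipy"))

# If that prints False, installing with the same interpreter usually helps:
#   <path printed above> -m pip install scipy
```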
huh
or install anaconda since afaik it has this stuff
... or just stop wasting too much power to help me lol
well to install anaconda means to set up a lot of stuff lol
i didn't use it tbh so don't know how hard it is
i'm just using colab since it provides free gpu
yeah thats there..
wellp, i'll try it out on colab
what i dont like about colab is the tensorboard issue
have no idea what it is, works finely with keras for me
tensorboard?
yeh
well, its basically to see the progress of your model and how its going
but tensorboard for some reason has some issues on google colab
keras outputs it automatically on calling fit method
can someone explain me what a dimension is?
like length, breadth and height. Matrices can have those too.
how is input_shape=(781,)
same as
input_dim=781
well, @prisma verge
there is one issue, it's reading the header row too, which has NaN in it
that's the only problem?
@lapis sequoia (781, ) thats how one dimensional matrices are represented
meaning there are 781 elements in it
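The same thing in NumPy terms, as a sketch. As far as I know, in Keras `input_dim=781` on a Dense layer is shorthand for `input_shape=(781,)`, and in both cases the shape describes one sample, without the batch size:

```python
import numpy as np

# One sample with 781 features: a 1-D array.
sample = np.zeros(781)
print(sample.shape)  # shape tuple with a single entry
print(sample.ndim)   # number of axes

# A batch of 32 such samples; the per-sample shape is what follows the batch axis.
batch = np.zeros((32, 781))
per_sample_shape = batch.shape[1:]
print(per_sample_shape)
```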
okay
how do i define x and y though?
i'm really not sure which should go in y and which should go to x
well, you need to figure out that by reading the csv file
so input_dim takes number of elements present ?
no, it takes the shape of the input matrix
csv represents 47 columns with 4 rows
i just wanna know that so i can know it for future projects
and get deeper into ml
if possible can u give an example please :) ?
okay. meaning you have 4 examples?
just i learn best by practice, haha
@lapis sequoia well, like, for example, you want to input an image
welp, yeah
first one goes from 1 to 47, and the others have 1, 2, and 3 in "random" sequences every
because those are logs from one game which has 2/3 chances to win
1 and 2 means wins, 3 means lose
a gray image is of the shape (200, 200, 1)
47,1,3,1
46,1,3,2
45,3,1,2
44,2,1,3
43,2,1,3
42,2,1,3
41,2,3,1
40,3,2,1
39,1,3,2
38,2,1,3
37,2,3,1
36,1,2,3
35,1,3,2
34,2,3,1
33,2,1,3
32,1,3,2
31,2,1,3
30,2,1,3
29,3,2,1
28,1,2,3
27,1,3,2
26,3,1,2
25,1,3,2
24,3,2,1
23,3,2,1
22,2,3,1
21,3,2,1
20,3,2,1
19,2,1,3
18,2,3,1
17,2,1,3
16,3,2,1
15,2,1,3
14,2,3,1
13,1,2,3
12,1,3,2
11,1,3,2
10,1,3,2
9,3,2,1
8,2,1,3
7,2,3,1
6,2,1,3
5,1,2,3
4,1,2,3
3,3,2,1
2,3,2,1
1,2,3,1
so its dimension is 3?
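In array terms the number of axes and the number of elements are different things. A short sketch with an all-zero stand-in for the gray image:

```python
import numpy as np

# Stand-in for a 200x200 grayscale image with one channel.
image = np.zeros((200, 200, 1))

print(image.ndim)   # number of axes: height, width, channel
print(image.size)   # total number of pixel values

# Flattened into a single vector, e.g. to feed a plain Dense layer.
flat = image.reshape(-1)
print(flat.shape)
```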
kinda like that it looks
also, removing nans didn't help
so tell me something about your csv
does your csv have 4 features and 47 examples?
or 4 examples and 47 features?
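One way to settle it is to load the file and look at `.shape`, which is (rows, columns), i.e. (examples, features). A sketch on the first three rows of the data pasted above; the column names are placeholders, since which column is which team isn't stated. It also shows what happens when a file without a header row is read with the default settings:

```python
import io
import pandas as pd

# First three rows of the pasted data, with no header line.
raw = "47,1,3,1\n46,1,3,2\n45,3,1,2\n"

# header=None tells pandas the first line is data, not column names.
cols = ["round", "team_a", "team_b", "team_c"]  # placeholder names
df = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(df.shape)

# With the default, the first data row is consumed as the header.
df_wrong = pd.read_csv(io.StringIO(raw))
print(df_wrong.shape)
```

For the full file the first call would give 47 rows, so 47 examples with 4 columns each.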
well, you made the csv or know about its origin right?
yeah, i do
but in this yt video https://www.youtube.com/watch?v=VGCHcgmZu24
timestamp 9:05-9:15
what this dude says is kinda opposite can u explain to me if u dont mind?
original table looks like that
biggest team means lose, middle means close to lose but wins, and smallest means highest chances to win
then i've changed them to numbers and removed rounds
well, not removed rounds, removed word round
now i wanna make ai predict from that csv
I see
because it'd be an amazing experience with machine learning, and also very useful
i know that predictions won't predict reality, but i wanna make it at least as project for fun
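One possible way to frame this as a supervised problem, purely as a sketch: use the results of round k as the input and the losing team of round k+1 as the label. The column names are placeholders and only five rows of the pasted data are used. If the rounds are independent of each other, no model will do better than chance on this.

```python
import io
import pandas as pd

# Rounds 46 to 42 from the pasted data (round, then three team results).
raw = """46,1,3,2
45,3,1,2
44,2,1,3
43,2,1,3
42,2,1,3
"""

cols = ["round", "team_a", "team_b", "team_c"]  # placeholder names
df = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# The file lists the newest round first, so sort into chronological order.
df = df.sort_values("round").reset_index(drop=True)
teams = cols[1:]

# Inputs: results of round k. Labels: which team lost (value 3) in round k+1.
X = df[teams].iloc[:-1].to_numpy()
y = df[teams].iloc[1:].eq(3).to_numpy().argmax(axis=1)

print(X)
print(y)
```

With X and y separated like this, the output layer would need 3 units (one per team) rather than 4, and the round number stays out of the inputs.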
9:05*
oh that thing