#data-science-and-ml
so you are using a linear regression model, which tries to fit the data given to it by creating a line of best fit through the data. if you think of it as a 2d graph for now, what the model is trying to do is find the best gradient and intercept of a line so that this line of best fit goes through the data
alright yeah
x_train and x_test is just the data from x defined at the top but split into 2 groups, it takes most of that data for training but reserves some for testing its performance on at the end
so let's say for example you had some data like `array([-1.15878911, 2.93868307, -1.59251035, 2.96522191, -1.47123134, 2.73263764, 2.17527494, 2.90636932])`. this would be the data in x at the beginning (there would be values in y for the true values as well). when you pass this to sklearn.model_selection.train_test_split(), it takes some of the data away, e.g. -1.59251035, to be used for testing (x_test) at the end, and the rest of it is kept for training on (x_train). the 0.1 (test_size=0.1) means 10% of the data is held out for testing
you don't give all of the data to the model because then you don't know if it is just memorizing the input so you test it on new data to see how well it can make predictions on new data
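A minimal sketch of the split described above, using the eight sample values from the example. The `y` values and `random_state=0` are made up for illustration; they are not from the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The eight example values for x quoted above.
x = np.array([-1.15878911, 2.93868307, -1.59251035, 2.96522191,
              -1.47123134, 2.73263764, 2.17527494, 2.90636932])
y = 2 * x + 1  # hypothetical "true values", only here so there is something to split

# test_size=0.1 holds out 10% of the samples for testing; the rest is for training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

print("train:", len(x_train), "test:", len(x_test))
```

With eight samples, that leaves one value in `x_test` and seven in `x_train`; which value gets held out depends on the random shuffle.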
ahhh
this is a lot clearer
so that's what
train_test_split() does
so for example
yes, it partitions all the data you give it so it can be used for training and then checking if your model has done a good job at finding general patterns in the data
```py
14 x = np.array(data.drop([predict], 1))
15 y = np.array(data[predict])
16
17 x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
18 linear = linear_model.LinearRegression()
```
ignore the 14s, and 15s
so if instead of x and y at the first two rows
if I did like
dataA and dataB
then on the train, and test lines
would it be
dataA_train, dataA_test, and so on?
sort of yes, but y isn't data the model uses to make predictions with; it's only used as a comparison at the end to see how well that prediction did, so the model knows how it should change to make better future predictions
oh so then
y is like the answer sheet
it compares it's own results (x) with the answers (y)?
Hello there! I'm new to the server and I'm coming with a few questions about an error I'm getting using cosine similarity (for a Count Vectorizer Matrix) and linear kernel (TFIDF vectorizer matrix) of Sklearn. I'm getting a Memory Error each time I try to compute the similarities as my data-frame is too big (df.shape = (394592, 7)) and I don't know how to approach this problem. Any help would be appreciated!
for example if I was trying to teach you to classify 2 new animals, the data x could be an image of them, and the labels y could be which animal that is. you would make a prediction from the image I showed you (x data), then I would tell you what that animal actually was (y labels). this way you learn from your mistakes; this is called supervised learning
ahh
so like your answers against an answer sheet then
Yeah that makes sense
This is gonna be one really long course to finish
it's important to note here that x isn't the prediction but the data, e.g. the image. your prediction is made based on the data x, but it's not the same thing
oh.
then what is the prediction then
y_train or y_test?
Or am I still not getting it
the prediction isn't any of the data you store, that is just data. the prediction is based on those variables, but because the model may interpret the data differently as it learns, the predictions cannot be stored beforehand. they are made by the model in linear.fit and linear.score, which then check against the true value y what it should have predicted
i haven't used sklearn much, but see if the model (linear) has a predict method (or something similar). if so you can print out a single data item, print out linear.predict(<that item>), then print out the label (y) for that item; it may help you understand the difference between them
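A small sketch of that experiment, assuming scikit-learn's `LinearRegression` (which does have a `predict` method). The numbers are made up so that the labels follow y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature per row, labels following y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

linear = LinearRegression()
linear.fit(x, y)

item = x[2:3]                      # a single data item, kept 2-D as sklearn expects
prediction = linear.predict(item)  # produced by the model, not stored anywhere in x or y

print("data item: ", item[0])
print("prediction:", prediction[0])
print("label (y): ", y[2])
print("gradient:", linear.coef_[0], "intercept:", linear.intercept_)
```

The data item, the model's prediction, and the label are three separate things; on this perfectly linear toy data the prediction happens to land on the label.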
@amber hound I've heard of Gensim for the TFIDF case and I'm searching information on how to implement it but I don't know what to do about the Count Vectorizer and Cosine Similarity memory error.
hello, i am using pycountry module.
i am getting "country_code" : "IND" in request.
how i can make use it in my code to get the value of respected country
my code here
modelType = pycountry.countries.get(alpha_3='country_code').name
@dull turtle you should use one of our general purpose help channels. See #how-to-get-help
@amber hound you want the entire pairwise similarity matrix for almost 400k vectors?
That's like trillions of nonzero entries in the similarity matrix. No surprise it won't fit in RAM
Sorry not trillions. Many billions though
So multiply that by 64 bit or 32 bit for minimum RAM usage
What do you actually want to do with that
In SageMath, is it possible to calculate probabilities of multiple sets, using P(x) notation?
no @void anvil it sends data back and forth between python and an R process
so yes you still need R
@amber hound Yes, your df is too big.
Consider that your distance matrix would be 394592^2 = 155702846464 entries. Assuming 1 byte each, that's 155 GB. Typically you'd store reals or floats, which are 4 bytes. Assuming you store above-diagonal only since it's symmetric, you'd still need approx 2x of 155GB or 300GB+
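The arithmetic above can be reproduced with a few lines of plain Python. The row count comes from the `df.shape` quoted earlier; the element sizes are the usual 4 bytes for float32 and 8 bytes for float64.

```python
n = 394_592                  # number of rows in the data frame

entries = n * n              # full pairwise matrix
upper = n * (n - 1) // 2     # strictly above the diagonal (the matrix is symmetric)

for label, count in [("full matrix", entries), ("upper triangle", upper)]:
    for dtype_name, nbytes in [("float32", 4), ("float64", 8)]:
        gigabytes = count * nbytes / 1e9
        print(f"{label:>14} as {dtype_name}: {gigabytes:,.0f} GB")
```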
There is no sane or normal computer with approx. 300GB RAM, assuming it's processed in RAM.
I'd recommend you somehow make your df smaller with some qualitative reasoning
I must admit IDK why your df has 7 columns though.
for context, the 32 core general purpose ML server that my team uses has 256 GB of RAM
That's some serious investment mannnnnn
it's a physical on-prem machine, so it's not like they're paying through the nose for some cloud services
but yes
this is neither a sane nor normal setup
I'm pretty sure I can get access to supercomputing for highly parallelizable loads, though I can't say how much it'd cost. Not too sure about specific high-corecount single-machine-ish things
but yeah point being you'd basically have to parallelize the distance computation and dump the results to disk periodically
which again begs the question: what exactly do you need such a huge distance matrix for?
I hope there's a better way to do the distance-matrix thing though. Scaling by n^2 in memory means it's quite impossible to develop large-scale things
not that i know of. it's a fundamental limitation of techniques that require a full distance matrix
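When only each row's closest neighbours are needed rather than the full matrix, the block-wise idea can be sketched as follows. This is an illustration on a small dense NumPy array, not the sparse TF-IDF matrix from the question; the function name `top_k_cosine_chunked` and its parameters are hypothetical.

```python
import numpy as np

def top_k_cosine_chunked(vectors, k=5, chunk_size=1024):
    """Return (indices, similarities) of the k most similar rows for every row.

    Only a (chunk_size, n) block of similarities exists at any time, so the
    peak memory is chunk_size * n values instead of n * n.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # unit length: dot product == cosine
    n = unit.shape[0]

    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T           # one block of the similarity matrix
        # A row is always most similar to itself; exclude it from the results.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Pick the k largest per row, then sort those k in descending order.
        idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part = np.take_along_axis(sims, idx, axis=1)
        order = np.argsort(-part, axis=1)
        top_idx[start:stop] = np.take_along_axis(idx, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part, order, axis=1)
    return top_idx, top_sim

toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
indices, similarities = top_k_cosine_chunked(toy, k=1, chunk_size=2)
print(indices.ravel())
```

The result grows as n * k rather than n * n; each block could equally be written to disk instead of reduced to its top k, which is the "dump the results periodically" variant.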
Large distance matrices are always useful
yeah but for what? are you going to compute hdbscan on it or something?
obviously if you are building a database it makes sense that you'd want to construct an index of some kind
that actually might be better
is there a general-purpose on-disk "vector database" for doing neighbor queries and stuff?
Well I have used distance matrices in 2 instances:
- node-A to node-B distance in a graph. The more different locations, the more nodes you get. And also generalises to anything that 'nodes' can be, which can scale to very large numbers
- NLP word vectors. A larger vocabulary is better than a smaller vocabulary
word-vec A dist to word-vec B is useful IMO
You'd need the whole distance matrix if you want to query any pair, or alternatively you could just query specific word vectors/elements themselves to get closest for just one I guess, which should scale linearly in terms of memory
yes, this is what indexes are for
go look at how e.g. fasttext does it, that's all written in very simple readable C++
and just computing distance for a specific pair of vectors isn't that slow
I doubt you'd want a specific specific pair
For a specific-vector-to-all-other-vectors query I can see applications, yes
Also I must be really late to the party, but I found out about fastText from what you just said
Welp, guess there's always new things, although I do NLP more for hobby (what's a close word to "xyz"?) than anything
yeah there are a few of these "oddball" ML tools out there
vowpal wabbit, fasttext, starspace
the latter 2 are facebook research products
Interestingly I don't see fastText data in CC0 license
software typically isnt CC licensed
I meant the word vectors themselves
ah
CC BY-SA 3.0
That said I'm surprised a Gigaword-trained data is dedicated to Public Domain
Yeah it's BY-SA, which is copyleft-forcing
Which means no commercialisation
it isn't even "viral" like the GPL
im not sure how CC BY-SA handles derivative works
so you need to check
well no... if you modify the data you must share it under BY-SA
Ah, I guess MIT is more permissive?
so for the data itself, it is viral like GPL
It's definitely easier to self-train
but for derivative works e.g. a software product that uses said data internally, i don't know
well the data itself is under the license too
so a trained model would be a derivative work of the source data
probably not at the same scale or with the exact same data
ah derivative work i think still requires SA
"Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.
"Distribute" means to make available to the public the original and copies of the Work or Adaptation, as appropriate, through sale or other transfer of ownership.
I just realised Wikipedia is CC-BY-SA 3.0 legally speaking too
yes
Welllllllll I guess I'm never working for an NLP company
i dont understand your issue
you want to do hobby projects?
who cares, you arent distributing them
Nah my hobby projects are obviously doable
you want to start a company and use other people's free data? sucks, follow the license terms
But for any monetary purposes I'd need to declare it comes via training on Wiki, for example
Yeah hmm
that's my guess as to why they bothered using CC BY-SA 3.0
because the source wikipedia data is, and their dataset + trained model are derivative works
also english gigaword has a whole license agreement attached to it
not sure what public domain version there is
are you looking for a public domain dataset?
That'd be nice
I'd be able to publish that under permissive licenses on Github without troubles
Ah well I should have just used my original Wikinews dump
But honestly Wikinews is very small.
@desert oar going line-by-line worked but produced an output with fewer lines but a much larger file size
I'm thoroughly confused.
@desert oar That worked out. Thank you so much!!
@mellow spruce Hi, it's me again. This worked great, I have a question though. I want to apply this calculation across two different columns, i.e. calculate the idle time between when one activity is over and the next activity starts, for these different groups I have created. I tried `time_diff_byname=lf.groupby('name')['Time','Time 2'].apply(lambda y:y['Time'].shift(-1)-y['Time2']` but that didn't work
first of all, write [['Time', 'Time 2']]
actually that should do it
or you can write lf[['Time', 'Time 2', 'name']].groupby('name')
That did it king. much thanks
In SageMath, is it possible to calculate probabilities of multiple sets, using P(x) notation?
idk if anyone here uses sage :/ i saw you've asked this a few times
Hello, Im having some trouble using a dataset I have created with tensor flow
im getting this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
this is the code:
train_data = np.asarray(train_data)
train_labels = np.asarray(train_labels)
test_data = np.asarray(test_data)
test_labels = np.asarray(test_labels)
return (train_data, train_labels), (test_data, test_labels)
model.fit(train_data, train_labels, epochs=10)
ask for any other stuff u need
So when I tried to add time_diff_byname to the data frame with lf['Time_diff']=time_diff_byname.reset_index(level-1, drop=True) it gives me this error message cannot reindex from a duplicate axis any ideas to append this column?
@earnest wadi what does train_data and test_data contain?
@mellow spruce does time_diff_byname.index contain duplicate values?
`.reset_index(level=-1, drop=True)`?
can you show what time_diff_byname.head() contains
4 -1 days +23:10:10
8 -1 days +22:49:50
12 NaT
Mary 2 -1 days +22:22:35```
It might contain duplicates
ah
im actually not sure how it would, tbh
oh did i steer you wrong on this too
does lf itself have duplicate indices?
or no
nop, I haven't set indices to the df yet
can you send me some sample data again
'tool':['Hammer', 'Drill','Wipes', 'Driver', 'Drill','Wipes','Hammer', 'Driver','Driver', 'Drill','Hammer', 'Drill', 'Drill','Wipes','Hammer', 'Driver'],
'Time':['13:40:31','13:20:33','13:05:00','12:15:28','12:00:00','11:43:35','11:27:35','11:17:22','11:10:10','10:59:11','10:22:15','10:12:10','10:00:00','09:55:05','09:45:45','09:16:35']}
lf=pd.DataFrame(data=d)
lf['Time']=pd.to_timedelta(lf['Time'])```
so, groupby on series and dataframes have different semantics
with respect to how the indices are constructed at the end
which i did not realize
try level=0 instead of level=-1
In [31]: time_diff_byname
Out[31]:
name
John 0 12:00:00
4 11:10:10
8 10:00:00
12 NaT
Mary 2 11:27:35
6 10:22:15
10 09:45:45
14 NaT
Peter 1 11:43:35
5 10:59:11
9 09:55:05
13 NaT
Richard 3 11:17:22
7 10:12:10
11 09:16:35
15 NaT
this is whati get
which means the first (0th) level of the index needs to go
not the last
that was my mistake
lf['Time_diff']=time_diff_byname.reset_index(level-1, drop=True) in here?
that worked out correctly, thank you so much master!!1
@desert oar
they both look like lots of these
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] ```
there is a corresponding word in a word index for each number
it cant print the whole thing
[ 4 5 14 ... 0 0 0]
[ 22 23 5 ... 0 0 0]
...
[1080 32 5 ... 0 0 0]
[ 89 5 25 ... 0 0 0]
[ 448 59 76 ... 0 0 0]] [[ 330 366 1032 ... 0 0 0]
[1125 22 615 ... 0 0 0]
[1142 134 5 ... 0 0 0]
...
[ 126 2402 128 ... 0 0 0]
[ 33 2419 248 ... 0 0 0]
[ 128 2430 22 ... 0 0 0]]```
oh yeah that would actually make sense
ill iterate through it and make sure they are all arrays
still getting this ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
I did this:
for i in range(len(test_labels)):
    for r in range(len(test_labels[i])):
        test_labels[i][r] = int(test_labels[i][r])
    test_labels[i] = np.array(test_labels[i])
you still have an array of arrays
where is this data coming from?
how is it constructed?
seems like you ought to flatten this to a 2d array
the data is from a file structured like this
data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]```
I then split it up into train and test data
how do I flatten it
that's what the data looks like in the file?
what kind of horrible data format is that
or .npy
I can extract the data back in to a python list
yeah but i shudder to imagine how
either way you need to make sure this is correctly handled as a 2d matrix, not a 1d array of 1d arrays
literally just reading it, splitting it at the \n then at the first " " to remove data
alright so how do I do that
how can I make it a 2d matrix
oh, they are all converted to the same length by keras later down the line
hang on
[ 4 5 6 7 8 5 4 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
like this
so you need to 1) read the data, 2) pad them with 0s, 3) convert to np array
is that right?
by "need" do you mean thats what im doing?
well that's what you should be doing
I thinking im np arraying after padding
nope
padding is last currently
should I change
yes
if you try to make a numpy array out of uneven length lists
it will never make a 2d array from that
okay
it will always be an array of arrays
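To make the difference concrete, here is a tiny sketch with made-up rows. One caveat: recent NumPy versions (1.24 and later) raise a `ValueError` for uneven rows unless `dtype=object` is requested, instead of silently building an array of lists, so the sketch asks for it explicitly.

```python
import numpy as np

ragged = [[4, 5, 6], [4, 5], [22]]   # made-up rows of uneven length

# Uneven rows can only be stored as a 1-D array whose elements are Python lists.
as_objects = np.array(ragged, dtype=object)
print(as_objects.shape, as_objects.dtype)

# Pad every row with 0s to the longest length first, then convert.
max_len = max(len(row) for row in ragged)
padded = np.array([row + [0] * (max_len - len(row)) for row in ragged])
print(padded.shape, padded.dtype)
```

Only the padded version is a real 2-D integer matrix; the object array is what TensorFlow refuses to convert.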
[[ 4 5 6 ... 0 0 0]
[ 4 5 14 ... 0 0 0]
[ 22 23 5 ... 0 0 0]
...
[1080 32 5 ... 0 0 0]
[ 89 5 25 ... 0 0 0]
[ 448 59 76 ... 0 0 0]]
it's now padded before being converted to an array
but still getting an error
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
data_txt = """data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""
data = eval(data_txt.split(' ', maxsplit=1)[1])
max_len = max(len(rec) for rec in data)
data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
data = np.array(data)
works for me
resulting shape is (21, 21)
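def parse_horrible_format(txt):
    # Parse the `txt` argument (not the global `data_txt`): drop the leading
    # "data " label and evaluate the rest as a list of lists of strings.
    data = eval(txt.split(' ', maxsplit=1)[1])
    max_len = max(len(rec) for rec in data)
    # Convert to int and right-pad every record with 0s to the longest length.
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
    data = np.array(data)
    return data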
here's a function to do it
yes, small brain good
hahaha
parse horrible format
so if I input """data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""
it will return
train_data that will work with tf?
no idea, but it will definitely do the padding and np.array conversion for you
alright
ill get that in
and see what happens
@desert oar your function doesn't work entirely
AttributeError: 'list' object has no attribute 'split'
what string should it be
this
"""data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""
if you are starting with the literal list
[['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]
then feel free to delete the first line where you split the string and eval
oh
I see
@desert oar still the same error
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
show your code..
```python
def import_data(dsdir):
    ds = open(f"{dsdir}.cds", "r", encoding='cp1252')
    dataset = ds.read()
    ds.close()
    dataset = dataset.split("\n")
    splitup = []
    for part in dataset:
        splitup.append(part.split(" ", 1))
    splitup[1][1] = splitup[1][1].replace("\'", "\"")
    splitup[2][1] = splitup[2][1].replace("\'", "\"")
    data = json.loads(splitup[1][1])
    labels = json.loads(splitup[2][1])
    train_data = (data[:len(data)//2])
    train_labels = (data[:len(labels)//2])
    test_data = (data[len(data)//2:])
    test_labels = (data[len(labels)//2:])
    def parse_horrible_format(data):
        max_len = max(len(rec) for rec in data)
        data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
        data = np.array(data)
        return data
    train_data = parse_horrible_format(train_data)
    test_data = parse_horrible_format(test_data)
    return (train_data, train_labels), (test_data, test_labels)
```
(train_data, train_labels), (test_data, test_labels) = ds.import_data("Pickup Lines - Insults")
word_index = ds.get_word_index("Pickup Lines - Insults")
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, "?") for i in text])
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
value=word_index["<PAD>"],
padding="post",
maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
value=word_index["<PAD>"],
padding="post",
maxlen=256)
train_data = np.array(train_data)
train_labels = np.array(train_labels)
test_data = np.array(test_data)
test_labels = np.array(test_labels)
print (train_data)
vocab_size = 88000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["acc"])
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
model.fit(train_data, train_labels, epochs=10)
first block is in my package, 2nd is the main neural net script
parse_horrible_format already does the padding FYI
so I should delete keras.pre... etc
i mean, there is quite a bit more action happening here
or does it not matter too much
frankly i dont use keras so i have no idea what much of this code does
ooh
the basic problem is: train_data and train_labels must be numpy arrays of floats, ints, et al
not arrays of arrays
so whatever processing you do, at the end of the day you must make sure that you are feeding "flat" numpy arrays to keras
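One way to check this before calling `model.fit` is a small diagnostic like the sketch below; `describe` is a hypothetical helper, not part of NumPy or Keras. It is worth running on the labels as well as the data: in the `import_data` snippet above, `train_labels` and `test_labels` are sliced from `data` rather than from `labels`, which would hand Keras uneven lists of strings as labels.

```python
import numpy as np

def describe(name, values):
    """Report whether `values` converts to a regular numeric array."""
    try:
        arr = np.asarray(values)
    except ValueError:
        # NumPy 1.24+ refuses uneven rows outright.
        return f"{name}: ragged input, cannot form a regular array"
    if arr.dtype == object:
        # Older NumPy falls back to an array of Python objects instead.
        return f"{name}: object array of shape {arr.shape} (uneven rows?)"
    return f"{name}: ok, shape={arr.shape}, dtype={arr.dtype}"

print(describe("uneven", [[1, 2], [3]]))
print(describe("padded", [[1, 2], [3, 0]]))
```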
but you dont know how I can flatten my data to work
ill try looking some more stuff up
i dont know because i dont know what your data looks like before you pass it to keras
its possible/likely that pad_sequences is returning an array of lists or something
idk because this code was working with an official keras dataset, I basically just tried my best to replicate the format of the index, labels and data
is this "tf keras" or "keras keras" btw?
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
this looks so easy even i could probably do it and i am legitimately very stupid
yeah i managed to really easy get my head around the code they provided to make this:
i also really hate that they called their "keras-style" library "keras"
they should have come up with a different name
cool
all im tryna do now is make an easy and modular way to train it on any text dataset you want
straight from a .txt fike
file*
bassically
my question
after your function you made for parsing
is the output a 2d matrix
or is it still array of arrays
thats called a "clean room" reimplementation
when you copy the logic but none of the source code
im in a meeting, but i know something about this topic
ill @ you later
alright, I've done some testing, the shape is definitely 2-dimensional all the way through the code @desert oar
ok so
i have
can i be smart enough to take average of percentage difference every month?
data.set_index('Date').resample('1M')['Percentage Difference'].mean()
like that?
or
```python
data.resample('1M', on='Date')['Percentage Difference'].mean()
```
@visual violet ^
insanely smart
thanks!
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
hmm type error
so something like this?
df_newyork['Date Time'] = pd.to_datetime(df_newyork['Date Time'])
Should work, im on mobile now so cant check
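A sketch of the whole sequence on made-up data: convert the column with `pd.to_datetime` first, then average per month. To sidestep differences between pandas versions in the month alias accepted by `resample` (newer releases prefer `'ME'` over `'M'`), this sketch swaps in a `groupby` on monthly periods, which gives the same monthly means.

```python
import pandas as pd

# Made-up rows; the dates start out as strings, as they would after reading a CSV.
data = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-15", "2020-02-01"],
    "Percentage Difference": [1.0, 3.0, 5.0],
})

# Without this conversion the column holds plain strings, which leads to the
# "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex" error.
data["Date"] = pd.to_datetime(data["Date"])

# Group the rows by calendar month and average each group.
monthly = data.groupby(data["Date"].dt.to_period("M"))["Percentage Difference"].mean()
print(monthly)
```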
is there anyway I can get twitter dataset?
i dont have the time to scrape one, are there any repos online that can provide twitter dataset?
you could always do a quick search
I'm sure theres some on kaggle or uni websites @severe island
Has anyone ever worked with neural machine translation? I got an error where it says runtime error: dimension specified as 0 but tensor has no dimensions.
someone told me to ask it here so I here I go
https://paste.pythondiscord.com/abinalowir.coffeescript Hi I want to convert a table I made to a pd.DataFrame(), take a look at the script I used, if that helps.
since the whole thing should work just fine if the table would just be in pd.DataFrame format
please ping me if you got a solution
Hi, how can I fix this error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Hi so basically im attempting to reshape my 1D array to become a 2D array, but I keep getting an error that says my object doesnt have the Attribute array.reshape(1, -1), even though my console told me to reshape my data using array.reshape(1, -1).
how i can fix it : len() of unsized object
Read the error and then understand why its giving that error
@lapis sequoia You need to post your code
@silk knot can you show an example of what rows and data contain? also try to avoid using variable names like list which shadow important built-in names
something like this but I think its fixed already
And good point
But I did have another issue you might could help me with
ValueError: Shape of passed values is (33905, 36), indices imply (36, 36)
I dont know if you mind being tagged so im just gonna do it once, let me know if it bothers you @desert oar
once is fine, thanks for asking
if i disappear for a while you can tag me again although i will be heading offline soon
show the code that produced that error?
usually that means you have mismatched sizes
e.g. ```python
pd.Series([1,2], index=list('abcdefg'))```
produces a similar error
the thing is, I don't remember ever making it 36, 36
I did have 36 samples however, so its not a random number I just don't know where I implied it
what line
I think it goes wrong in 74
makes sense
pca_data is not the covariance matrix...
it's your full data projected into PC space
so N rows and 36 columns
whereas you specified index=columns, columns=labels both of which evidently contain 36 elements
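The mismatch described above can be reproduced on a small scale. This is only a sketch: the array is filled with zeros, 100 rows stand in for the 33905 in the error message, and the `PC1`...`PC36` labels are hypothetical:

```python
import numpy as np
import pandas as pd

n_samples, n_components = 100, 36            # stand-ins for 33905 rows and 36 components
pca_data = np.zeros((n_samples, n_components))
labels = [f"PC{i + 1}" for i in range(n_components)]

try:
    # 36 index labels for 100 rows: the index length does not match the data
    pd.DataFrame(pca_data, index=labels, columns=labels)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# One index entry per row, one column label per component
projected = pd.DataFrame(pca_data, index=range(n_samples), columns=labels)
print(projected.shape)
```

The index has to have one entry per row of the projected data, so a list of 36 labels only fits the column axis.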
uhu
well thank you I should be able to fix it, after I get some sleep
some nights, before you know it its 5am
@void anvil arch? that's a library on pypi?
so it is https://pypi.org/project/arch/
nice
oh
not about clean room code but
i wanted to clarify that copyright applies to code
specifically to the code
not to the algorithm
because the algorithm as you point out is part of nature
and so can be patented, but not copyrighted, because it is not a creative work
however the code is copyrightable, and not patentable, because it is considered a creative work
(under US law)
this is why licenses such as the GPL are interesting - they apply to the source code, but because of the license terms also have implications for consumers of the software itself
thats actually a good question
my guess is that like paraphrasing someone else's paper without citing their work
right
i would imagine the same is true for code
that isn't necessarily true. this is why clean room implementations exist.
for plausible deniability that any source code which happens to be in common is a coincidence
i'm not sure about that
i'm sure there is plenty of case law on that subject
i have a friend who is an attorney in this field
i can ask how this all works
he will likely say "this scenario is absurd and would never happen in real life" and it would take me 10 mins to get an answer
if there is exactly one way and only one way to implement an algorithm in a particular language, e.g. C, i imagine that it would not be considered copyrightable, or that using the same code is fair use
so much of US IP law is in the form of case law and precedent
so its really really hard to know even if you are brave enough to wade through the statutes
so i can ask him but i cant guarantee a good answer
i think in general if you don't willfully commit copyright infringement, most codebases are sufficiently complicated and distinct enough that you won't get copyright trolled
well if there's no license at all then it's all rights reserved
how what works?
you must distribute your contributions under the same license as the original.
i'm not sure if the CC BY-SA license can be interpreted to mean that software compiled from the licensed source code must also be distributed under CC BY-SA
i suspect that it can't, or isn't
and this is also why you should use a code-specific license for your code, so as to remove the ambiguity
wait who is using CC for code
why would you do that
other than like... contributions to rosetta code
i suspect that compiled software falls outside the scope of "Adapted Material" in the 4.0 version
and of "Derivative Work" in the 2.0 version
that's a bit like if you printed out the script to Arcadia, ran it through a paper shredder, then used the pieces to make a collage
is that a derivative work of Arcadia?
is it?
i mean, is it a derivative work?
i mean ignore the fact that snippets of copyrighted material are visible in the collage
the fact that it's physically made of Arcadia i think isn't alone enough to call it derivative
its a good question
if you arent distributing it then the GPL at least doesnt care
its more interesting if you are building a public-facing API with a closed-source backend
are you "distributing" something based on the GPL'ed code?
you aren't physically distributing software, but you are providing access to that software
right. if providing an API were determined to be "distributing" it would be the legal equivalent of finding P=NP or breaking RSA
a court discovery order would change that in a hurry
you wouldn't have to look far
what if they give a tech talk at a python conference about how they use your GPL library to go super duper fast
im not saying thats available in every case, but it could be enough to make an example out of someone
plus theres some software which is both GPL and almost literally ubiquitous
although then you're basically saying "statistically this person is likely to be using my software" which has a lot of uncomfortable implications for criminal law
right, idk if or how much they differ in that respect
you might also be able to avoid the "we have cause because 99% of businesses use this"
look at job postings for example
do they list django? they're probably using django
having a case like this succeed in courts would be a true apotheosis of the free software movement
not sure its even a good thing
or if its obviated by some other legal doctrine or statute
ok that one probably will show up in court
in the next 5 years
great question
yep
funny i literally just read this article tonight
IP law is such a mess already, now throw in datasets and trained models
increasingly complicated blends of the above with source code and patentable algorithms
incoming: a German court rules that articles written by AI are considered derivative works of the AI developer
i can't wait for that hacker news thread
...i think its past my bedtime
im having bizarre copyright fantasies
likewise, ill try to remember to ask about some of this
how do I get numpy or matplotlib in vscode?
`def f(row):
    try:
        return Polygon([(pt['Longitude'], pt['Latitude']) for pt in json.loads(row['PlotGeoFence'])])
    except:
        return numpy.nan
InputFile['geofence_poly'] = InputFile.apply(f, axis=1)`
i have this function which returns NaN if it fails, but it is setting all rows to NaN
but i want to keep the other column data also and only set NaN where it fails
any help will be appreciated
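One possible reason every row ends up NaN is that the bare `except:` hides the real error, for example a missing import, which would fail on every row the same way. The sketch below narrows the exceptions and prints the cause. It is an illustration only: `parse_geofence` is a hypothetical helper, and it returns plain coordinate tuples instead of a `Polygon` so that it runs without extra libraries:

```python
import json
import math

def parse_geofence(raw):
    """Return a list of (longitude, latitude) tuples, or NaN if raw is unusable."""
    try:
        points = json.loads(raw)
        return [(pt["Longitude"], pt["Latitude"]) for pt in points]
    except (TypeError, ValueError, KeyError) as exc:
        # Printing the reason shows whether every row fails for the same cause
        print(f"could not parse {raw!r}: {exc!r}")
        return math.nan

good = '[{"Longitude": 1.0, "Latitude": 2.0}, {"Longitude": 3.0, "Latitude": 4.0}]'
print(parse_geofence(good))
print(parse_geofence("not json"))
```

Errors outside the listed types, such as a `NameError` from a missing import, are no longer swallowed and would show up as a normal traceback.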
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])```
Could someone explain the 2nd layer?
the video says it in a very complicated way and I don't really get it
So is it right to say that it has 128 functions that will determine what that thing is?
Does anyone know the most attractive data to scrape with python, from any links or useful data?
guys can you help me with this```
pca=dt.corr() #dt is my data
#this works fine
k=10
col=pca.nlargest(k,'SalePrice').index
d=np.corrcoef(dt[col].values.T) #presence of transpose here
sb.heatmap(d,annot=True)
#this doesn't work (it keeps on running till RAM uses exceed and crashes, gives no output )
k=10
col=pca.nlargest(k,'SalePrice').index
d=np.corrcoef(dt[col].values)
sb.heatmap(d,annot=True)```
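A likely explanation for the difference, sketched with random data: `np.corrcoef` treats each row as one variable by default. With the transpose the 10 columns become the rows, so the result is 10 x 10. Without it every sample is treated as a variable, so the matrix has one row and column per sample, and an annotated heatmap of that size could plausibly exhaust memory. The 500 rows below are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(500, 10))      # hypothetical: 500 rows, 10 selected columns

with_transpose = np.corrcoef(values.T)   # rows are the 10 columns  -> 10 x 10
without_transpose = np.corrcoef(values)  # rows are the 500 samples -> 500 x 500

print(with_transpose.shape, without_transpose.shape)
```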
@rapid plank
Install numpy and matplotlib in your Python environment and import it...
@desert parcel do you understand how a neural network works at a basic level?
a "dense" layer in keras means that every input is connected to every node. in this case there are 128 nodes
this is also called a "fully connected" layer
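As a rough sketch of what "every input is connected to every node" means for `Dense(128)` after `Flatten(input_shape=(28, 28))`: each of the 784 inputs gets its own weight for each of the 128 nodes, plus one bias per node. The random values below are placeholders, not trained weights, and this is plain NumPy rather than what Keras does internally:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)                          # one flattened 28 x 28 image
W = rng.normal(scale=0.01, size=(784, 128))  # one weight per (input, node) pair
b = np.zeros(128)                            # one bias per node

hidden = np.maximum(0.0, x @ W + b)          # ReLU applied to x W + b
print(hidden.shape)
print(W.size + b.size)                       # number of trainable values in this layer
```

So "128 functions" is a fair first reading: each node computes a weighted sum of all inputs and passes it through ReLU.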
Oh no I do not, the video i'm watching doesn't go that detailed
ah
It's this one
i strongly recommend the 3blue1brown series that explains how neural networks operate
Machine Learning represents a new paradigm in programming, where instead of programming explicit rules in a language such as Java or C++, you build a system which is trained on data to infer the rules itself. But what does ML actually look like? In part one of Machine Learning...
Home page: https://www.3blue1brown.com/
Brought to you by you: http://3b1b.co/nn1-thanks
Additional funding provided by Amplify Partners
Full playlist: http://3b1b.co/neural-networks
Typo correction: At 14 minutes 45 seconds, the last index on the bias vector is n, when it's...
ohh
I couldn't find a good one I just thought that i'd stick to the official tensorflow yt
And I keep redoing my notes
which is really annoying
dont forget that TF is a software library
they will be focused on teaching you the software
although it looks like they are doing a good job at introducing you to the concept
they might go back and explain layers later
this actually seems like a very nice gentle introduction
which one?
the TF one
but the 3blue1brown video will probably be enlightening
i don't see where they use the Dense(128) thing
eventually you will want to learn the math as well
yep. the whole idea of a "neuron" is a conceptual aid to understanding the model and doesn't really have much to do with how the model actually works
it's all math underneath
most of the text in the notes is just me paraphrasing what the guy said
you dont need to understand it all, but the more you do understand the more interesting problems you can solve
Well I don't seem too excited to get into the mathy side of things lol
i can't access that link from work
i'm just wondering where you got the idea to use this
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
oh, the part 2?
Yeah part 2
ok
I watched the 2 of them b2b and keep rewatching them
I wanna get good at the start first
the 3blue1brown video should help explain what's happening with this
Alright then
and it will also show you where the math part comes in
Alright I'll give that a watch then
cool
@desert oar hey, have gotten quite a bit further with my problem, i got some help from other people and shuffled some stuff up, however now im having a slight problem with your function
def parse_horrible_format(data):
    max_len = max(len(rec) for rec in data)
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
    data = np.array(data).astype(np.float32)
    return data
Traceback (most recent call last):
File "c:/Users/Silv3/OneDrive/Desktop/datasetup/datasetup/hillbilly.py", line 12, in <module>
(train_data, train_labels), (test_data, test_labels) = ds.import_data("Pickup Lines - Insults")
File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 258, in import_data
labels = parse_horrible_format(labels)
File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in parse_horrible_format
data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in <listcomp>
data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in <listcomp>
data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
ValueError: invalid literal for int() with base 10: 'O'
spooky
oh
yeah I have no idea
labels is literally just 1's and 0's
oooohh
I see
labels is this @desert oar
'Do you like Star Wars? Because Yoda only one for me!': '1', 'Call me Shrek because I’m head ogre heels for you.': '1', 'Wanna go bowling? I thought it might be right up your alley.': '1', 'Excuse me, I just noticed you noticing me and I just wanted to give you notice that I noticed you too.': '1', 'If your heart was a prison, I would like to be sentenced for life.': '1', 'I love you like a pig loves not being bacon.': '1', 'Are you parents bakers? Because you are a cutie pie.': '1', 'Are you a cat? Cause you are purrrfect': '1'}
I had to shorten it because its like 90 000 characters
so the first one is
{'Of course I talk like an idiot. How else could you understand me?': '0'
hence the "O"
?
lol i am stealing that
Oh here let me give you mine
?
annoying = ['Is it now?', 'Nope', 'Not really', 'I say not', 'lol git gud', 'Naaaa']
query = ['Yes that is correct.', f"""I agree with which was last said""", "That's correct",
"I'm not sure, you tell me.", "Oh please you're smarter than that.","Figure it out.",
"I'm not google.", "You think I know everything?",
"I'm not going to say it, now that you want me to say it.",
"lol good luck figuring it out on your own","why would I know.",
"Sure. If that's what you wanna think."]
what_is = [
"I'm not sure, you tell me.", "Oh please you're smarter than that.",
"Figure it out.", "I'm not google.", "You think I know everything?",
"I'm not going to say it, now that you want me to say it.",
"lol good luck figuring it out on your own",
"why would I know.", "Sure. If that's what you wanna think."
]```
@earnest wadi you need to use a totally different function to parse that
Lol It's just my bot being sassy
haha
I wanted to implement this by learning ML and stuff
you really really really need to use json or some standard format
I'll let you guys talk lol
this is turning into a mess to read data from disk
thats not a weird format
its a python dictionary
before all the numpy stuff
data is a list
index is a dict
labels is a dict
yes but you are dumping literal python objects
parsing that is inevitably a mess
eval is not a good idea
use json instead
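A small sketch of the JSON suggestion, using two of the sentences quoted above as sample data. The string round trip stands in for writing to and reading from a file:

```python
import json

labels = {
    "Are you a cat? Cause you are purrrfect": "1",
    "Of course I talk like an idiot. How else could you understand me?": "0",
}

text = json.dumps(labels)        # what would be written to disk
restored = json.loads(text)      # what would be read back

sentences = list(restored)                      # the dict keys
targets = [int(v) for v in restored.values()]   # the 1/0 labels as ints
print(targets)
```

Iterating over the dict itself yields the sentences (the keys), which is where the 'O' in the traceback came from. The labels are in `.values()`.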
X = np.array([-1.5, 0, 3.5], dtype=float)
Y = np.array([0, 3, 10], dtype=float)
newModel = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
newModel.compile(optimizer='sgd', lose='mean_squared_error')
newModel.fit(X, Y, epochs=20)
print(Model.predict([1.0]))```
This is the error I am getting
I am not sure how to fix it
why does this work ```py
X = np.array([-1.5, 0, 3.5], dtype=float)
Y = np.array([0, 3, 10], dtype=float)
newModel = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
newModel.compile(optimizer='sgd', loss='mean_squared_error')
newModel.fit(X, Y, epochs=250)
print(newModel.predict([1.0]))```
excuse me guys, can someone help me, currently im learning something about Image Classification using google colab. since im learning it by watching Youtube Video, i got some problem and i can't solve it, can anyone help me?
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.
You can find a much more detailed explanation on our website.
see our guide to good questions above
so i just put my question here right?
Yes @autumn veldt
ok, lemme prepare my question first
!ask so, i was learning something about image classification using feature GLCM + SVM method. where i put my dataset into csv file. after reading the dataset im trying to see how much the accuracy that i can got from it (it's only showing one time training with 0.70 accuracy), now the problem that i want to ask is, how to put something like 5-20 training with different accuracy (epoch = 20) in one time run? im so sorry if my english so bad tho
if its possible, can u guys give me some link or whatever that i can learn about my problem too, thanks
you dont need the !ask command, that just shows you info about asking good questions
still don't really understand what you're trying to do @autumn veldt
so, here i was running DataValidation.csv using GLCM+SVM and the result of train is 0.70 accuracy on one time running (only one data that is calculated from the accuracy of the many existing data). the problem is, on DataValidation.csv there's about 100+ data and i want to show accuracy at least 20 data from 100 data available in the CSV file by expecting different accuracy results.
what do you mean by one data?
A single row? @autumn veldt
yea, single row
Hello, I need some suggestions. I've trained a model to detect the probability of a client cancelling their services with us using our current client's list, if I deploy that model to production to an API to do live-predictions, and I pass the one of the clients that was used in the training set, wouldn't that be biased?
then just get 20 rows and run them through the model @autumn veldt, and check accuracy?
thats the problem @flat quest , how to get 20 rows on one run?
just select all 20 rows xd @autumn veldt
data[0:20]
and run that through the model
that should work unless that particular model doesn't work on batched data
ill try it soon, btw thanks @flat quest
How i can fix : thid white draw
In CNN models, why is it more accurate to have Conv layers and then some dense layers instead of only Conv layers?
What's the best machine learning model that can take in a decent sized amount of rows with many nominal categorical variable and a few output variables to determine predicting factors?
@void anvil you mean training the model on a small subset of the clients and running the predictions on the total population?
What if my dataset is small? Like 3,000~?
Hello all. Can someone recommended me a good book/course/site to start learning machine learning.
Learning about linear algebra and stats would be the place to start
@desert parcel Are you free to upload that dataset somewhere / link me to it? It looks awesome to play with
@void anvil Well I made the dataset myself I just came up with a random linear line equation.
I realize my question above was rather vague, but if anyone has the time to spare, I just need help figuring out how to go about analyzing this data. We have a product with many configurations, and rather than going through and checking each individual feature or permutation of the features, I'd like to run it through a model that can identify correlation to a binary good or bad column.
I have found many methods that seem almost on point for what I'm looking for, but not quite on the nose.
what kind of product is this? @queen barn
@limpid raft well technically you can. And it does work reasonably well. Remember Conv nets are basically 2d dense nets with a limited scope.
But at some point you need to switch from a 2d output to a 1D. That can be done through maxpool, etc. but dense layers tend to work better.
Conv nets may focus too much on low level features. Since each filter is like 3 x 3, low level local features (ex blue dot in left bottom corner) may be prioritized over a higher level feature like an oval face.
Do I need a GPU for this?
I have been using google colab for this
But do I really need a GPU?
@flat quest I'm not sure why that's relevant. I'd prefer not to share the specifics of the product, but I'd be more than willing to explain or provide an example of the data structure.
@queen barn Ah yeah, was only asking because like you said question was pretty vague. The type of model to use is heavily dependent on what kinda product and data ur working with.
You said configurations are these config files for software? Different setups for products as in machines, etc.
No it's a physical product that has about 95 different attributes that can be changed by customization request. I'm trying to find a correlation between some of these customizations and a quality failure.
@flat quest I hope that helps a bit
I am trying to convert a column in a dataframe from an object to a string but it is not working? Could someone explain how come?
because the column is a column/a 1D array
you'll have to take every element and append/concat them together
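One possible reading of the question is that the conversion does happen but the column still reports `object`, which is the dtype pandas has traditionally used for Python strings. A minimal sketch with a made-up `code` column (exact dtype names depend on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"code": [101, "A7", 3.5]})   # hypothetical mixed values

as_str = df["code"].astype(str)                  # every element becomes a Python str
print(all(isinstance(v, str) for v in as_str))

as_string = as_str.astype("string")              # pandas' dedicated string dtype
print(as_string.dtype)
```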
Getting an error when fitting data to simple neural net( input, 2 hidden, 1 output, all dense layers): Error when checking input: expected dense_34_input to have shape (518994,) but got array with shape (1,). My input is 32950 by 63 and I flattened it to fit it to the network. Not sure why it's showing (1,). Anyone know what I'm doing wrong?
well a very basic way (This would be something like a base model) would be to simply throw in all these attributes as features.
As for getting those categories into a numerical format. Since its only 95 attributes, one hot should work fine.
Another method would be to use embeddings.
You should probably start with a standard dense model and see how well it performs. Since there's no temporal data involved, you won't need RNN's or LSTM's. Transformers will tend to work better than standard dense models, but its much more compute heavy.
@queen barn
@wise garden
So if you print the flattened input what shape does it have? It should be (batch_size, features)
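The number in the error message is suggestive: 518994 = 8238 x 63, so it may be that an array of 8238 samples was flattened across samples as well as features. That is an assumption, not something stated in the question. The shapes involved can be checked with NumPy:

```python
import numpy as np

X = np.zeros((8238, 63))        # hypothetical: 8238 samples with 63 features each

print(X.flatten().shape)        # (518994,)
print(X.shape[1:])              # (63,)
```

If that is what happened, keeping `X` two-dimensional and giving the first layer `input_shape=(63,)` would match the `(batch_size, features)` layout mentioned above.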
can anyone here explain neural networks to me like I'm five? I got into Machine Learning about a month ago and Python like 2 months ago. I use google colab if that will help u help me better
Im really struggling to grasp the concepts of RNN's, CNN's, KNN's, and ANN's. I just don't understand how they work
Please PM me if possible
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
DataA = np.array([1, 2, 3, 4, 5, 6, 7], dtype=int)
DataB = np.array([6, 7, 8, 9, 10, 11, 12], dtype=int)
# DataB = DataA + 5
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(DataA, DataB, epochs=250)
print("-"*20)
print(model.predict([9]))```
Output:
Why is the output 15 though if it's 9 shouldn't it be 14
You might not have enough data for it to be accurate maybe? but dont take my advice, ask someone more experienced @desert parcel
But I do though
In another one I copied from the web
it has less epochs and less data but is more accurate than what i'm using
Yeah alright
no
it's literally just adding 5 to every element of DataA
it's an estimator and the model you're using isn't predicting well
because it's an estimator you're usually not going to get exact values like that
try using a different model and test your result
So the equation is backwards?
no?
Could you explain it again
@ebon nebula Sorry to answer you a little bit late;
I recommend you sololearn application that is also available in web.
It has a specific section named "Machine learning".
I'm learning python by sololearn and it teaches everything from 0 to 100.
If it's the first time you're hearing about sololearn, it's better to start with android or ios version not web application.
and sorry for wall.
You have nothing to be sorry for. Thank you for the suggestion. I will check it out asap.
@desert parcel one issue you probably have is that you are only training the model on 7 different inputs (which is a very small amount for a neural network) while training for 250 epochs, which is a lot, so your model is probably overfitting to the data being passed to it. having said that, i would have thought that the weight would update towards one with the bias being 5. i would recommend using more training data and probably fewer epochs, but as you only have 1 weight i don't think this can overfit
one other note is that this is not the kind of problem neural networks are good at solving, you will probably get better results using some sort of regression. i was just highlighting above why the results may not be what you were expecting and how to try to amend these issues for other models
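Since the data is exactly DataB = DataA + 5, an ordinary least-squares line recovers slope 1 and intercept 5. A minimal sketch using `np.polyfit`, which is swapped in here for the Keras model from the question:

```python
import numpy as np

DataA = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
DataB = DataA + 5

# Fit a degree-1 polynomial; coefficients come back highest degree first
slope, intercept = np.polyfit(DataA, DataB, deg=1)
prediction = slope * 9 + intercept
print(round(slope, 6), round(intercept, 6), round(prediction, 6))
```

A prediction of 15 from the one-weight network would be consistent with a slope slightly above 1 and a bias still below 5, which may mean the SGD run had not converged yet. That is a guess rather than something verified here.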
ohhh yeah
I was just using a Y=MX+C equation since that is used in the example
I base most of my test models on that equation
how does one get into deep learning without much higher math?
i've been looking at tf2.0 for a really long time, and it seems really simple. i do understand how a dataset is processed in most cases, but when it comes to model building, choosing an optimizer, etc. then i'm just lost. so many models, loss functions, and every model needs a number of parameters that i don't understand how to calculate. my brain can't find any patterns in this.
tl;dr how do i acquire basic knowledge of deep learning without diving into higher math and linear algebra? modern frameworks seem to make it possible, but i just don't understand which optimizers/layers to use in which situation, how to get the needed numbers of params, etc
@flat quest I had a few too many last night and didn't realize I was passing the wrong tensor through my pipeline lol got it figured out
@prisma verge You just need HS level math up to how to differentiate and how to do matrix addition, subtraction, and multiplication. PyTorch can however do all of that for you with functions.
But knowing the concepts does help
`noise = np.random.uniform(-1,1(observations,1))
targets = 2*xs - 3*zs + 5 + noise`
why do i get an error for this
please some body help
the error is int is not callable
1(observations,1)
As the error says, you can't call ints
Which is what you're doing there
You might've meant 1, (observations, 1) idk
TypeError: Level type mismatch: month
noo not that one
noise = np.random.uniform(-1,1(observations,1))
targets = 2*xs - 3*zs + 5 + noise
this one
gives me an error at line 1 stating int is not callable
observations = 1000
xs = np.random.uniform(low=-10,high=10,size=(observations,1))
zs = np.random.uniform(-10,10,(observations,1))
inputs = np.column_stack((xs,zs))
but this gives no error
Because you aren't calling an integer there
yeah yeah got it
i missed a comma there
-1,1,(observation)
yeah got it thank you so much !
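For reference, a sketch of the snippet with the missing comma restored, so that `(observations, 1)` is passed as the size argument instead of being "called" on the integer 1:

```python
import numpy as np

observations = 1000
xs = np.random.uniform(low=-10, high=10, size=(observations, 1))
zs = np.random.uniform(-10, 10, (observations, 1))

# The comma after 1 makes (observations, 1) the third argument (size)
noise = np.random.uniform(-1, 1, (observations, 1))
targets = 2 * xs - 3 * zs + 5 + noise
print(targets.shape)
```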
x = temperature.Temperature.resample('D').mean()
temperature is a dataframe
so x should be a series
will anaconda work better on my 2gb ram pc
better than what
use google colab they let you use their free gpu or whatever
it's just faster and it saves time from downloading all the modules
temperature_array is pm2.5- an air pollutant 2017-2019
pm25_array is just average temperature 2017-2019
what does this graph mean
i have a ongoing test can anyone please help me
my_dict = {'100001':{
'forename':'John',
'surname':'Smith'
},
'100002':{
'forename': 'Alice',
'surname': 'Van Gogh'
}
}
I have a dictionary as such where one ID corresponds to two values. In pandas, I have a column which has an ID for each record - I want to split this into two columns of the forename and surname.
What's the most idiomatic way to do this?
I have very little experience with pandas
oh no, I mean that i have a column in an existing dataframe which needs to be replaced by the two other columns
thank you though
Can u share it... And output expected so that I can share you code
Hey guys, I gotta form a prediction model involving a prediction of one main variable based on it's dependence on 4 different parameters ... do I use Step-wise regression or Multi- polynomial regression...Or anything else
@gloomy thistle multi-pol will be good.
I would also like to know how much of each parameter is also tallied into it.. how do I do that?
You can use PCA or LDA to find it out
So use all your variables and run PCA, it will generate a new matrix where you can select the top N columns which have the impact of x1, x2, ... xn
Cool, I'll look it up, thanks Kapil !
Welcome bro
temperature_test = pd.read_csv ('C:/Users/dotha/PythonNotebook/File/temperature (2020) NYC.csv')
pm25_predicted_list = []
for i in temperature_test['temperature'].values.reshape(-1, 1):
predicted_value = linear_regressor.predict(i)
pm25_predicted_list.append(predicted_value)
pm25_predicted_list
please help
error: Expected 2D array, got 1D array instead: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
i did try both way
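The shape problem behind this error can be seen with NumPy alone. Looping over a `(n, 1)` array hands out one 1-D row at a time, and a 1-D array is what the message complains about. The temperature values below are made up:

```python
import numpy as np

temperatures = np.array([3.5, 7.0, 12.25])    # hypothetical values
column = temperatures.reshape(-1, 1)           # shape (3, 1): 3 samples, 1 feature

row_shapes = [row.shape for row in column]     # each row taken by the loop is 1-D
print(row_shapes)

single_sample = column[0].reshape(1, -1)       # shape (1, 1): 1 sample, 1 feature
print(column.shape, single_sample.shape)
```

Assuming `linear_regressor` was fitted on a single feature, passing the whole 2-D `column` to `predict` in one call would avoid the loop entirely, and reshaping each row to `(1, -1)` would be the per-sample alternative.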
(sorry for the bump hb ill delete after) and ty for the help kapil
use the old generated df which i shared and use pd.merge(your_df, my_df, key = 'id') to merge it
awesome, tysm
welcome bro
the temperature_test looks like this
i dont understand why i am getting the errors
@visual violet, try only .values
Just put a print(predicted_value) after your predict code line
Let's see what is output of your regressor
Man u r appending pm* list not m* list.... Correct that you will find your answer
oh, i get an error saying that key is not a recognised keyword parameter in pd.merge?
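`pd.merge` takes `on=` (or `left_on=` / `right_on=` / `right_index=`), not `key=`. A sketch of the whole approach, where the `records` dataframe and its `id` and `score` columns are hypothetical stand-ins for the existing data:

```python
import pandas as pd

my_dict = {
    "100001": {"forename": "John", "surname": "Smith"},
    "100002": {"forename": "Alice", "surname": "Van Gogh"},
}

# Hypothetical existing dataframe with an ID column
records = pd.DataFrame({"id": ["100002", "100001", "100002"], "score": [7, 3, 9]})

# Outer keys become the index, inner keys become the columns
names = pd.DataFrame.from_dict(my_dict, orient="index")

# Match the id column against the index of names
merged = records.merge(names, left_on="id", right_index=True, how="left")
print(merged)
```

The IDs in the dictionary are strings, so the ID column has to hold strings too for the match to work. Dropping the `id` column afterwards would complete the "replace by two columns" step.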
for this dataframe, I'm trying to subtract consecutive times in the df1['JOB TIME'] column and express the result in minutes in the df1['delta_t'] column. Why is my code failing to execute the conversion? No traceback ```md
# Here is my first function to calculate the time delta between consecutive rows
# Goal was to express the timedelta in minutes (see the last two comments)
def time_diff1(col1, col2):
    if col1 is np.datetime64('NaT'):
        pass
    else:
        diff = col2 - col1
        # trying to get minutes here but doesn't work
        return np.timedelta64(diff, 'm')
        # Also tried these without luck
        # return diff.dt.total_seconds()/60
df1['delta_t'] = np.vectorize(time_diff1)(df1['Previous Time'], df1['JOB TIME'])
df1.head()
df1.head()
JOB TIME COMMENT Values Previous Time delta_t
1 2017-11-20 01:23:00 d_P = 1259.08 psi 1259.08 2017-11-20 00:13:00 0 days 01:10:00
2 2017-12-02 13:24:00 Offset_Pressure 0.00 2017-11-20 01:23:00 12 days 12:01:00
3 2017-12-02 16:21:00 d_P = 4142.57 psi 4142.57 2017-12-02 13:24:00 0 days 02:57:00
4 2017-12-03 02:57:00 Offset_Pressure 0.00 2017-12-02 16:21:00 0 days 10:36:00
5 2017-12-03 04:03:00 d_P = 539.38 psi 539.38 2017-12-03 02:57:00 0 days 01:06:00```
@sinful dock in function use diff=pd.Timedelta(col2- col1).minutes
@lapis sequoia pandas shows only top5 and bottom 5 rows but it has in actual all rows... So you can go ahead and continue your operation
Tell me davidd
I want to remove these numbers on the side
31975 Zimbabwe
31976 Zimbabwe
31977 Zimbabwe
31978 Zimbabwe
31979 Zimbabwe
31980 Zimbabwe
31981 Zimbabwe
31982 Zimbabwe
31983 Zimbabwe
31984 Zimbabwe
31985 Zimbabwe
31986 Zimbabwe
31987 Zimbabwe
31988 Zimbabwe
31989 Zimbabwe
31990 Zimbabwe
31991 Zimbabwe
31992 Zimbabwe
31993 Zimbabwe
31994 Zimbabwe
31995 Zimbabwe
31996 Zimbabwe
31997 Zimbabwe
31998 Zimbabwe
31999 Zimbabwe
32000 Zimbabwe
32001 Zimbabwe
32002 Zimbabwe
32003 Zimbabwe
32004 Zimbabwe
32005 Zimbabwe
32006 Zimbabwe
32007 Zimbabwe
32008 Zimbabwe
32009 Zimbabwe
32010 Zimbabwe
32011 Zimbabwe
32012 Zimbabwe
not sure how
They are just index....
When exporting, pass index=False and you won't find them in your file
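A minimal sketch of that advice, assuming the export is a CSV. The frame and its `country` column are made up to mimic the pasted output; an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Made-up frame whose index starts at 31975, like the pasted output.
df = pd.DataFrame({"country": ["Zimbabwe", "Zimbabwe"]}, index=[31975, 31976])

# Write CSV to an in-memory buffer; index=False leaves the row labels out.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The same `index=False` argument is accepted by `to_string` and `to_excel`.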
@bleak fox Bhai how are you so sure when he didn't give column names? (I'm new so just asking)
@sick swan your question is about which questions reply... Tell me I'll explain
Do i really need to be a math GOD in order to DS? or can I.... Know how things work and know how to apply things, would that be enough?
@bleak fox the Zimbabwe guy
I mean I'm good at math, no probs, but not GOD tier
You can start with whatever you know in Math but keep on learning while practicing
@sick swan you can go ahead... DS is an art... And it requires a mindset more than math and programming
@sick swan who is the Zimbabwe guy?
I have done an IBM professional specialization, but it was clearly visible that they didnt dig deep into math
Man, if DS were as simple as calling an sklearn API then everyone would have done that.... Correct?
@bleak fox
I want to remove these numbers on the side
31975 Zimbabwe 31976 Zimbabwe 31977 Zimbabwe
this one
@sick swan he shared it first but deleted... From where i got that these numbers are index
@bleak fox xD exactly, that's why I was trying to do a lotta math
you do like this
print(df.to_string(index=False))
@sick swan
I figured it out that's why I deleted
@lapis sequoia @sick swan check this
here's what code I have just so you can see in full
""" Modules """
import pandas as pd
# Read excel spreadsheet.
dataframe = pd.read_excel("../covid19data.xlsx", usecols="G")
pd.set_option("display.max_rows", None, "display.max_columns", None)
def countries():
    """ This prints all of the countries in the spreadsheet """
    countries = dataframe.to_string(index=False)
    print(countries)
countries()
damn im glad i found this community
more like, its so good to have like minded people, everywhere i look, people are just freaking clueless
in the sense they dont take their future seriously
or dont yet have the realisation
Yup thanks to creators
Hey so, I'm trying to get some data from a redis website. (It has a .php ending) and then cache that data and then put it into either google sheets or some kind of analysis tool. I'm really new to this so I'm flying blind.
I can implement this myself in python but I don't want to reinvent the wheel. What kind of stuff do you (more experienced people) all use for this kind of thing?
@strange igloo use selenium
GOD
@sick swan GOD
:3
Alright, I'll try and figure out selenium
@bleak fox
def time_diff1(col1, col2):
    if col1 is np.datetime64('NaT'):
        pass
    else:
        diff = pd.Timedelta(col2 - col1).minutes
        # trying to get minutes here but doesn't work
        return diff
Unsure if datetime can take any self, and I believe the arguments need to be inside, because it is failing to vectorize
<ipython-input-48-23b332169aaf> in time_diff1(col1, col2)
5 pass
6 else:
----> 7 diff = pd.Timedelta(col2- col1).minutes
8 #trying to get minutes here but doesn't work
9 return diff
AttributeError: 'Timedelta' object has no attribute 'minutes'
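The traceback is expected: `pd.Timedelta` has no `.minutes` attribute. One common way to get minutes is `total_seconds() / 60`, and on whole columns the subtraction can be done without `np.vectorize`. A sketch using two rows copied from the table above (everything else is illustrative):

```python
import pandas as pd

# Two rows copied from the table above.
df1 = pd.DataFrame({
    "JOB TIME": pd.to_datetime(["2017-11-20 01:23:00", "2017-12-02 13:24:00"]),
    "Previous Time": pd.to_datetime(["2017-11-20 00:13:00", "2017-11-20 01:23:00"]),
})

# Subtracting datetime columns gives a timedelta column; .dt.total_seconds() converts it to floats.
df1["delta_t"] = (df1["JOB TIME"] - df1["Previous Time"]).dt.total_seconds() / 60

# For a single value, pd.Timedelta offers total_seconds() rather than .minutes.
single_minutes = pd.Timedelta("0 days 01:10:00").total_seconds() / 60
```

Rows where either time is `NaT` come out as `NaN`, so the explicit `NaT` branch is not needed here. Note also that `col1 is np.datetime64('NaT')` compares object identity and is never true; `pd.isna(col1)` is the usual check.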
@sinful dock welcome buddy
stats noob btw
I am noob * 100...
so i have some survey data, likert scale (ordinal)
and i want to test for significant differences between some nominal groups
which test should i be using?
P-value test...
ANOVA
kruskal wallis anova?
Yup
prof said that was wrong
I want gpt-3 to write my jq filters or xpath predicates!!
holy **** that would be cool
Anyone have any ideas on what I can do with a bunch of twitch VOD chat logs?
be open to possible violation of GDPR?
aside from that, depends what you want to do. make a bot that responds in some way to VOD? automated moderation? some kind of game?
Hmm, not sure. Trying to glean some insight from the data. I saw someone said they were able to select timestamps based on chat to pick out highlights in the stream.
Right now, I'm just making a word cloud and picking out the most popular emotes. Originally, I wanted to do topic modelling and sort out chat comments and figure out how they are reacting to the stream, but kind of stuck on that since I'm not sure what to look for, really.
sure, sounds like you could do that
topic modelling, I'm not sure how you'd output that
Well, so I was able to do that. It generated its own topics but they don't really mean much lol
but you could make use of pre-trained models to help, you'd just get a large vector
Like one of the topics is one of the emotes "Pog" so it's mostly comments that just say "Pog"
I guess you could see when the topic changed, but to do anything more may require you to train something
when the topic changed?
Currently, my first run of it was using the same process as a post analyzing tweets. Twitch comments seem like a whole different beast since it's so much shorter and dependent on the comments and stream.
the issue with this is there won't be a clear divide between topics, as well as having to deal with 'garbage' comments not related to the topic/general 'sum' of topics
Idk how you'd approach it
Going to see if I can find a correlation between chat and the stream by plotting chat "acceleration" minute by minute by counting the difference in comments. If I see a spike, there may be something there that the stream responded to.
Hmm, is there a way to get the data that's plotted by a histogram?
I have an array of timestamps and I want to group them by minute. The histogram plot does that for me but won't give me data back like
hist_data = {"bin_1": 200, "bin_2": 300}
Oh, got it. It was just bins = timestamps // 60
Then it's easy to group
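A small sketch of that grouping, assuming the timestamps are seconds from the start of the VOD (the values are made up):

```python
import numpy as np

# Made-up chat timestamps in seconds from the start of the VOD.
timestamps = np.array([3, 15, 59, 61, 62, 130, 170, 175, 179])

# Integer-divide by 60 to get the minute each comment falls in.
bins = timestamps // 60

# Count comments per minute; np.unique returns the distinct minutes and their counts.
minutes, counts = np.unique(bins, return_counts=True)
hist_data = {f"bin_{int(m)}": int(c) for m, c in zip(minutes, counts)}
```

For the original question, `np.histogram` also returns the counts and bin edges without plotting, and matplotlib's `plt.hist` returns them alongside the drawn patches.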
Ooh, I'll try that then. Thanks
Dumb question, What's the best way to practice Pandas ?
Now I have this image of the difference in the number of comments per minute. Gotta use some math to figure out where the cut-off between signal and noise should be.
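One possible starting point for that cut-off, as a sketch only: flag minutes where the jump in comment count is more than two standard deviations above the average jump. The two-sigma rule and the data are assumptions, not something from the chat.

```python
import numpy as np

# Made-up comments-per-minute series with one obvious burst.
counts_per_minute = np.array([20, 22, 19, 21, 80, 23, 20, 18, 21, 22])

# Change in comment count from one minute to the next (the chat "acceleration").
delta = np.diff(counts_per_minute)

# Assumed rule of thumb: flag jumps more than 2 standard deviations above the mean change.
threshold = delta.mean() + 2 * delta.std()

# delta[i] is the change going into minute i + 1, hence the + 1.
spike_minutes = (np.nonzero(delta > threshold)[0] + 1).tolist()
```

With a single large burst the standard deviation itself gets inflated, so a median-based threshold might be more robust on real data.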
Sorry for interrupting :/
No worries.
I would look at some videos of pandas tutorials at first. They guide you through the basics
Get some data you want to play around with and try to follow along
I'm following a book, but it doesn't have many challenges so
Ah-
Thanks ! :D
Btw
Does kaggle have challenges on datasets?
Yep tons of them.
I would look into the datacamp series of videos. They cover a lot of pandas stuff
Thanks a lot
This might help. I skipped through it but it gives a nice overview. https://www.youtube.com/watch?v=zyIN3SE11V0
If I know all of these foundations, Am I ready to lean on myself and try to play with some datasets?
Yeah
Because I do..
Great!
Quick question, What I will be doing now is called Data analysis ?
The most important thing to know is not to use something like a for loop to go through the whole dataset. Numpy/Pandas has its own faster way of doing calculations on large sets of data
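To make the loop-versus-vectorized point concrete, a tiny sketch with a made-up temperature column:

```python
import pandas as pd

# Made-up column of temperatures in Celsius.
df = pd.DataFrame({"celsius": [0.0, 25.0, 100.0]})

# Loop version: works, but runs Python code once per row.
fahrenheit_loop = []
for c in df["celsius"]:
    fahrenheit_loop.append(c * 9 / 5 + 32)

# Vectorized version: one expression applied to the whole column at once.
df["fahrenheit"] = df["celsius"] * 9 / 5 + 32
```

Both give the same numbers; on large frames the vectorized form is typically much faster because the arithmetic runs in compiled code.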
^^ I find doing random stuff that'll never be useful/never be used is super useful for learning not only fundamentals but some extra stuff as well
Yep, I agree.
Quick question, What I will be doing now is called Data analysis ?
yes
data analysis is more of a generalization tho
The most important thing to know is not to use something like a for loop to go through the whole dataset. Numpy/Pandas has its own faster way of doing calculations on large sets of data
@coarse spire
I might not use them again in my entire life lol
^^ I find doing random stuff that'll never be useful/never be used is super useful for learning not only fundamentals but some extra stuff as well
@bitter harbor Can you give me an example of this ?
I've found numpy's useful in a lot of things
well for example matrix manipulation can be used for data science, but it's also super useful for game dev
How's it useful for Game Dev?
A common thing is filtering out data. On the project I'm working on, I had to remove a bunch of comments made by bots so I would do df[~df.comment.str.contains("some bot message")] Super common if you are just looking at the data.
take tetris as an example
you could store the position of each piece as a separate variable (plz never do this, I've seen it and it's awful)
or you could use matrices
A common thing is filtering out data. On the project I'm working on, I had to remove a bunch of comments made by bots so I would do
df[~df.comment.str.contains("some bot message")]Super common if you are just looking at the data.
@coarse spire
I didn't think~will be that useful xD
like if avg in row == 1, remove row and bring all values above it, down one
you could store the position of each piece as a separate variable (plz never do this, I've seen it and it's awful)
@bitter harbor
just imagine this
Yeah this makes a whole sense to me now ..
I've seen it and it's awful
Ah, that's cool. You can even make an ascii version of tetris and basically just print the matrix out.
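A rough sketch of the matrix idea for Tetris, assuming a board of 0s and 1s with row 0 at the top. The board contents and the `#`/`.` rendering are made up for illustration.

```python
import numpy as np

# Made-up 4x3 board: 1 is a filled cell, 0 is empty; row 0 is the top.
board = np.array([[0, 0, 0],
                  [0, 1, 0],
                  [1, 1, 1],
                  [1, 0, 1]])

# A row is complete when every cell is 1 (equivalently, its average is 1).
full = board.mean(axis=1) == 1

# Keep the incomplete rows and stack empty rows on top so everything above drops down.
kept = board[~full]
cleared = np.vstack([np.zeros((full.sum(), board.shape[1]), dtype=board.dtype), kept])

# ASCII rendering: '#' for filled cells, '.' for empty ones.
ascii_board = "\n".join("".join("#" if cell else "." for cell in row) for row in cleared)
```

Calling `print(ascii_board)` then draws the board after the completed row has been removed.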
When I first tried to play around with ~
I used it on a number and yeah.. My head started to hurt
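The two behaviours of `~` side by side, as a short sketch. The comments are made up; the filter mirrors the bot example from earlier in the chat.

```python
import pandas as pd

# Made-up chat comments, one of them from a bot.
df = pd.DataFrame({"comment": ["Pog", "some bot message: follow us", "nice play"]})

# str.contains gives a boolean mask; ~ flips True/False element by element.
is_bot = df["comment"].str.contains("some bot message", regex=False)
humans = df[~is_bot]

# On a plain Python int, ~ is bitwise NOT, so ~n equals -n - 1.
flipped_int = ~5
```

That is why `~` looks odd on numbers: `~5` is `-6`, while on a boolean Series it simply means "not".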
like you could use lists, but lists are essentially 1d matrices
I have another important question-
Do companies ask you to tell them specific things about their data?
Or it's up to you to warn them about specific things and give tips to improve their companies
depends on your position *and values ig
DS
Yeah, there are business/data analysts who create reports about the data.
About the specific data they need from the dataset ?
Some places will have you doing predictive analytics, some will want just reporting.
Yeah, really depends on who's asking you to do it.
you'll never have to do a bunch of random calculations on a dataset
(unless they tell you to do it)
So they might tell me to do random calculations and come out with random predictions ?
So, yeah, at my company, the person they're "giving money" to would be the Machine Learning teams.
Well I'm complaining because I won't know what to predict from a current dataset I have if I don't search the calculations
Who usually come up with models to forecast some outcomes.
@coarse spire Oh..
Wait
This is how ML works ?
Correct me if I'm wrong,
It can predict random info that I don't know about as a DS from a dataset
Well, when you say random info, I'm not sure what you mean.
Let's say I have a dataset about heights and width of squares
A common example is figuring out house prices based on their square footage.
that's actually a common example in describing the difference between coords and vectors
What's the difference?
If it's related to statistics, I can watch an explanation rn
The housing price example? Makes sense since you don't just want to start at 0,0.
I like this playlist for stats. https://www.youtube.com/playlist?list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr
Because I don't have any idea..
"Similarly, while vectors generally imply some kind of linear space and the use of linear operations (i.e. linear algebra), they can be used to describe other data which may not have a linear basis."
Ah that makes sense then.
also a coordinate is always a vector but a vector is not always a coordinate.
This is really helpful ! @coarse spire
You're trying to find a vector to predict other housing prices
^
Which is what linear regression is.
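A minimal sketch of that idea with plain NumPy, assuming a single feature (square footage) and made-up prices that lie exactly on a line. The "vector" being found is the pair (slope, intercept).

```python
import numpy as np

# Made-up data: price = 150 * square_feet + 50_000 exactly.
square_feet = np.array([800.0, 1000.0, 1200.0, 1500.0])
price = 150.0 * square_feet + 50_000.0

# Design matrix with a column of ones so the fit includes an intercept.
X = np.column_stack([square_feet, np.ones_like(square_feet)])

# Least squares finds the weight vector (slope, intercept) that minimizes squared error.
weights, *_ = np.linalg.lstsq(X, price, rcond=None)
slope, intercept = weights

# Predict the price of an unseen 1100 sq ft house.
predicted = slope * 1100.0 + intercept
```

sklearn's `LinearRegression` solves the same kind of least-squares problem; real data will not sit exactly on the line, so the fit only approximates it.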
I'm confused ..
Yeah, so sklearn has a bunch of machine learning algorithms and tensorflow has a lot of deep learning algorithms.
I think that's....about what you need to know unless you need to use one of them, then you would go deeper.
@bitter harbor ?
@coarse spire I might start sklearn after getting familiar with matplotlib, Do you think it's a good idea ?
Yes
Well, it depends on your task, but if you just want to learn about the datascience ecosystem then machine learning with sklearn is great.
(a1,a2) would be the coords, [a1] [a2] would be the vector
like that
I know it'd be a lot more work but i'd suggest learning about basic linear algebra, stats, and neural nets before looking at ml libraries
Can't I learn them along the way ?
I think Bepples is saying that you should learn it that way
So, the thing about that is, without a solid foundation, you won't even know what you don't know.
You might make it harder on yourself when there is a simpler solution.
like you can for sure just get a dataset and put it into a template, but it's easier to tell what's going wrong if you actually know what's happening
especially if the error isn't a programming error
At a certain point, you'll have to do that anyway "I don't know this exact thing so I'll search for it as I go."
When I first got interested in data science, I took an online stats course.
Then I would try programming simple models and trying to pick apart others.
Picking apart Random Forest was fun.
This makes sense to me ..
thank god
You could also just try to throw Neural Networks at everything and see what sticks
I think that will be horrible
^
2 days ago, Someone was asking a good question
That's the fun part.
About working as a DS in a great company like Google
What libraries should he learn and etc.
"great"
I don't know it seems to be
it's a ds company because it's a data farm
Someone said you should know how the libraries work like what's going on behind them then think about working for them
Eh...
I'm glad I didn't ask that question
how the libraries work like what's going on behind them
that's a good idea imo
That's basically what the fundamentals are though. I think they want to know that you know what you're doing with the most common ML algos at least.
I don't know if they care too much about the underlying implementation though.
well ya they aren't going to ask you how nn's work, but understanding how allows you to do more and do it correctly
Oh
Well, I guess I'm saying that there is a difference between knowing how an NN works and how Tensorflow 2.0 implements an NN layer.
Hmm, actually, tensorflow is from google so maybe they do care :p
ya the implementation is different because of the language
the concepts of an nn are __mostly__ linear algebra/stats but the implementation is cs
Can't wait!
I just released libra: a machine learning API that lets you build and train models in one line of code. Check it out here please: https://github.com/Palashio/libra. Would really appreciate any feedback / questions y'all have
@warm wedge #303934982764625920 is a better place to put this
There seems to be an error and I don't understand
so does it mean I can only do .backward() on scalars only?
changed it to this and the output is none
Lol that might get confused with Libra blockchain @warm wedge
I need a .backward() in order to differentiate something hmm
but I can't .backward() something that isnt a scalar
So what do I do
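For context, a sketch of the two usual ways to call `.backward()` when the output is not a scalar in PyTorch: reduce it to a scalar first (for example with `.sum()`), or pass a `gradient` tensor of the same shape. The tensor values are made up, and this assumes the code in question is PyTorch.

```python
import torch

# Leaf tensor that tracks gradients.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y is a vector, so y.backward() with no arguments would raise a RuntimeError.
y = x ** 2

# Option 1: reduce to a scalar first; d(sum(x^2))/dx = 2x.
y.sum().backward()
grad_from_sum = x.grad.clone()

# Option 2: pass a `gradient` tensor shaped like the output (ones gives the same result here).
x.grad.zero_()
y2 = x ** 2
y2.backward(gradient=torch.ones_like(y2))
grad_from_vector = x.grad.clone()

# detach() drops gradient tracking so the values can be converted to a NumPy array.
values = x.detach().numpy()
```

A `.grad` of `None` usually means `backward()` has not run yet, or the tensor is not a leaf created with `requires_grad=True`. NumPy arrays have no `.backward()`, so converting first would lose the gradients.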
You should learn the fundamentals. Not because you don't have to do searching along the way, but a baseline knowledge is critical if you want to understand why you got a particular result
@arctic cliff
I think numpy has a function for that
So what can you do in numpy
wait let me rephrase
So I should convert a tensor into a numpy array and do a .backward() on that?
I'll try it
The error gave me a suggestion
var.detach().numpy()
But that sort of fixed a problem
i'm trying to recreate the error that gave me the recommendation