#data-science-and-ml


desert parcel
#

so x_test is the stuff inside the .csv and it uses the things it came up with in x_train and compares them against x_test?

spark stag
#

so you are using a linear regression model, which tries to fit the data given to it by drawing a line of best fit through the data. if you think of it as a 2d graph for now, what the model is trying to do is find the best gradient and intercept of a line so that this line of best fit goes through the data
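
To make the "gradient and intercept" idea concrete, here is a minimal sketch using NumPy's least-squares polynomial fit on made-up data. The toy points and the noise level are assumptions for illustration only, and this is not the sklearn code from the course.

```python
import numpy as np

# Hypothetical toy data: points scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# A degree-1 polynomial fit is a least-squares line of best fit.
# np.polyfit returns the coefficients from highest degree to lowest.
gradient, intercept = np.polyfit(x, y, 1)
print(f"gradient={gradient:.2f}, intercept={intercept:.2f}")
```

The fitted values should land close to 2 and 1, the numbers used to generate the data.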

desert parcel
#

alright yeah

spark stag
#

x_train and x_test are just the data from x defined at the top but split into 2 groups: it takes most of that data for training but reserves some for testing its performance at the end

desert parcel
#

so the percentage it gives out at the end

#

are the y values?

spark stag
#

so let's say for example you had some data like array([-1.15878911, 2.93868307, -1.59251035, 2.96522191, -1.47123134, 2.73263764, 2.17527494, 2.90636932]). this would be the data in x at the beginning (there would be values in y for the true values as well). when you pass this to sklearn.model_selection.train_test_split(), it takes some of the data away, e.g. -1.59251035, to be used for testing (x_test) at the end, and the rest of it is kept for training on (x_train). the 0.1 (test_size) means 10% of the data is reserved for testing

#

you don't give all of the data to the model because then you don't know if it is just memorizing the input so you test it on new data to see how well it can make predictions on new data
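
A rough sketch of the idea behind the split, written by hand with NumPy so the steps are visible. The function name `simple_train_test_split` and the toy data are hypothetical. This is not how sklearn implements `train_test_split`, which may differ in details such as rounding and shuffling.

```python
import numpy as np

def simple_train_test_split(x, y, test_size=0.1, seed=0):
    """Shuffle the rows, then hold out a `test_size` fraction for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))                  # random order of row indices
    n_test = max(1, int(round(len(x) * test_size)))  # how many rows to hold out
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(20, dtype=float).reshape(-1, 1)  # 20 samples, 1 feature each
y = 3.0 * x.ravel() + 1.0                      # the true value for each sample

x_train, x_test, y_train, y_test = simple_train_test_split(x, y, test_size=0.1)
print(len(x_train), len(x_test))
```

Each row of x stays paired with its own y value, and no row ends up in both groups.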

desert parcel
#

ahhh

#

this is a lot clearer

#

so that's what

#

train_test_split() does

#

so for example

spark stag
#

yes, it partitions all the data you give it so it can be used for training and then checking if your model has done a good job at finding general patterns in the data

desert parcel
#
```
14 x = np.array(data.drop([predict], 1))
15 y = np.array(data[predict])
16
17 x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
18 linear = linear_model.LinearRegression()
```
#

ignore the 14s, and 15s

#

so if instead of x and y at the first two rows

#

if I did like

#

dataA and dataB

#

then on the train, and test lines

#

would it be

#

dataA_train, dataA_test, and so on?

spark stag
#

sort of yes, but y isn't used as data the model makes predictions with. it's only used as a comparison at the end to see how well that prediction did, so the model knows how it should change to make better future predictions

desert parcel
#

oh so then

#

y is like the answer sheet

#

it compares its own results (x) with the answers (y)?

amber hound
#

Hello there! I'm new to the server and I'm coming with a few questions about an error I'm getting using cosine similarity (for a Count Vectorizer Matrix) and linear kernel (TFIDF vectorizer matrix) of Sklearn. I'm getting a Memory Error each time I try to compute the similarities as my data-frame is too big (df.shape = (394592, 7)) and I don't know how to approach this problem. Any help would be appreciated!

spark stag
#

for example if I was trying to teach you to classify 2 new animals, the data x could be an image of them, and the labels y could be which animal that is. you would make a prediction from the image I showed you (x data), then I would tell you what that animal actually was (y labels). this way you learn from your mistakes. this is called supervised learning

desert parcel
#

ahh

#

so like your answers against an answer sheet then

#

Yeah that makes sense

#

This is gonna be one really long course to finish

spark stag
#

it's important to note here that x isn't the prediction but the data, e.g. the image. your prediction is made based on the data x but it's not the same thing

desert parcel
#

oh.

#

then what is the prediction then

#

y_train or y_test?

#

Or am I still not getting it

spark stag
#

the prediction isn't any of the data you store, that is just data. the prediction is based on those variables, but because the model may interpret the data differently as it learns, the predictions cannot be stored beforehand. they are made by the model inside linear.fit and linear.score, which then check against the true value y what it should have predicted

#

i haven't used sklearn much but see if the model (linear) has a predict method (or something similar). if so you can print out a single data item, print out linear.predict(<that item>), then print out the label (y) for that item. it may help you understand the difference between them
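
sklearn's `LinearRegression` does have a `predict` method. A small sketch of the experiment suggested here, on made-up data where y is exactly 3*x + 2 (the toy data is an assumption, not the course's .csv):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y is exactly 3*x + 2, so a line fits it perfectly.
x = np.arange(10, dtype=float).reshape(-1, 1)  # data, shape (10, 1)
y = 3.0 * x.ravel() + 2.0                      # labels (the "answer sheet")

linear = LinearRegression()
linear.fit(x, y)                       # learn gradient and intercept from (x, y)

item = x[4:5]                          # one data item, kept 2-D as predict expects
prediction = linear.predict(item)[0]   # what the model thinks y should be

print("data item: ", item[0])
print("prediction:", prediction)
print("true label:", y[4])
```

The data item, the prediction, and the label are three separate things: the prediction is computed from the item, and the label is only used to check it.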

desert parcel
#

alright i'll mess around with this

#

You can help the other person

amber hound
#

@amber hound I've heard of Gensim for the TFIDF case and I'm searching information on how to implement it but I don't know what to do about the Count Vectorizer and Cosine Similarity memory error.

dull turtle
#

hello, i am using the pycountry module.
i am getting "country_code" : "IND" in the request.
how can i use it in my code to get the name of the respective country?
my code is here
modelType = pycountry.countries.get(alpha_3='country_code').name

desert oar
#

@amber hound you want the entire pairwise similarity matrix for almost 400k vectors?

#

That's like trillions of nonzero entries in the similarity matrix. No surprise it won't fit in RAM

#

Sorry not trillions. Many billions though

#

So multiply that by 64 bit or 32 bit for minimum RAM usage

#

What do you actually want to do with that

river thistle
#

In SageMath, is it possible to calculate probabilities of multiple sets, using P(x) notation?

desert oar
#

no @void anvil it sends data back and forth between python and an R process

#

so yes you still need R

chilly geyser
#

@amber hound Yes, your df is too big.
Consider that your distance matrix would be 394592^2 = 155702846464 entries. Assuming 1 byte each, that's 155 GB. Typically you'd store reals or floats, which are 4 bytes. Assuming you store above-diagonal only since it's symmetric, you'd still need approx 2x of 155GB or 300GB+

There is no sane or normal computer with approx. 300GB RAM, assuming it's processed in RAM.

I'd recommend you somehow make your df smaller with some qualitative reasoning

I must admit IDK why your df has 7 columns though.
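
The arithmetic above can be checked with a few lines of plain Python. The byte sizes are the usual ones for 1-byte, float32 and float64 entries, and the helper name `gigabytes` is made up for this sketch.

```python
n = 394_592                        # number of rows, from df.shape

full_entries = n * n               # every ordered pair of rows
upper_entries = n * (n - 1) // 2   # strictly above the diagonal (symmetric matrix)

def gigabytes(entries, bytes_per_entry):
    """Storage in GB (10**9 bytes) for a given number of entries."""
    return entries * bytes_per_entry / 1e9

print(f"full matrix, 1 byte per entry:      {gigabytes(full_entries, 1):.1f} GB")
print(f"upper triangle, 4 bytes (float32):  {gigabytes(upper_entries, 4):.1f} GB")
print(f"full matrix, 8 bytes (float64):     {gigabytes(full_entries, 8):.1f} GB")
```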

desert oar
#

for context, the 32 core general purpose ML server that my team uses has 256 GB of RAM

chilly geyser
#

That's some serious investment mannnnnn

desert oar
#

it's a physical on-prem machine, so it's not like they're paying through the nose for some cloud services

#

but yes

#

this is neither a sane nor normal setup 😆

chilly geyser
#

I'm pretty sure I can get access to supercomputing for highly parallelizable loads, though I can't say how much it'd cost. Not too sure about specific high-corecount single-machine-ish things

desert oar
#

but yeah point being you'd basically have to parallelize the distance computation and dump the results to disk periodically
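
A minimal sketch of the "compute in pieces and dump to disk" idea, assuming dense vectors and a NumPy memory-mapped file as the on-disk store. The function name and block size are made up, the parallel part is left out, and the file itself still needs n^2 entries of disk space.

```python
import os
import tempfile
import numpy as np

def cosine_similarity_to_disk(vectors, path, block_size=1000):
    """Write the full cosine-similarity matrix to `path`, one row block at a time."""
    n = vectors.shape[0]
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Only a (block_size, n) slab of results is computed in RAM at a time.
        out[start:stop] = unit[start:stop] @ unit.T
    out.flush()                                     # push pending writes to disk
    return out

rng = np.random.default_rng(0)
vectors = rng.random((500, 16)).astype(np.float32)  # small stand-in data
path = os.path.join(tempfile.mkdtemp(), "similarity.dat")
sim = cosine_similarity_to_disk(vectors, path, block_size=128)
print(sim.shape)
```

Each row block is independent of the others, which is what would make the loop straightforward to spread over several workers.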

#

which again begs the question: what exactly do you need such a huge distance matrix for?

chilly geyser
#

I hope there's a better way to do the distance-matrix thing though. Scaling by n^2 in memory means it's quite impossible to develop large-scale things

desert oar
#

not that i know of. it's a fundamental limitation of techniques that require a full distance matrix

chilly geyser
#

Large distance matrices are always useful

desert oar
#

yeah but for what? are you going to compute hdbscan on it or something?

#

obviously if you are building a database it makes sense that you'd want to construct an index of some kind

#

that actually might be better

#

is there a general-purpose on-disk "vector database" for doing neighbor queries and stuff?

chilly geyser
#

Well I have used distance matrices in 2 instances:

  1. node-A to node-B distance in a graph. The more different locations, the more nodes you get. And also generalises to anything that 'nodes' can be, which can scale to very large numbers
  2. NLP word vectors. A larger vocabulary is better than a smaller vocabulary
#

word-vec A dist to word-vec B is useful IMO

#

You'd need the whole distance matrix if you want to query any pair, or alternatively you could just query specific word vectors/elements themselves to get closest for just one I guess, which should scale linearly in terms of memory
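
A sketch of the one-vector-against-all query mentioned here, using brute-force cosine similarity in NumPy. The function name and the random stand-in vectors are assumptions. Memory for the scores grows linearly with the number of vectors, not quadratically.

```python
import numpy as np

def nearest_neighbors(vectors, query_index, k=5):
    """Return indices of the k vectors most cosine-similar to one query vector."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[query_index]           # one score per vector: O(n) memory
    scores[query_index] = -np.inf               # exclude the query itself
    top = np.argpartition(scores, -k)[-k:]      # the k best, in arbitrary order
    return top[np.argsort(scores[top])[::-1]]   # sort those k, best first

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))
vectors[1] = 2.0 * vectors[0]                   # same direction as vector 0

neighbors = nearest_neighbors(vectors, query_index=0, k=3)
print(neighbors)
```

Vector 1 points in the same direction as vector 0, so it should come back as the closest match.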

desert oar
#

yes, this is what indexes are for

#

go look at how e.g. fasttext does it, that's all written in very simple readable C++

#

and just computing distance for a specific pair of vectors isn't that slow

chilly geyser
#

I doubt you'd want a specific pair

#

For a specific-vector-to-all-other-vectors query I can see applications yes

#

Also I must be really late to the party, but I found out about fastText from what you just said

#

Welp, guess there's always new things, although I do NLP more for hobby (what's a close word to "xyz"?) than anything

desert oar
#

yeah there are a few of these "oddball" ML tools out there

#

vowpal wabbit, fasttext, starspace

#

the latter 2 are facebook research products

chilly geyser
#

Interestingly I don't see fastText data in CC0 license 😦

desert oar
#

software typically isnt CC licensed

chilly geyser
#

I meant the word vectors themselves

desert oar
#

ah

#

CC BY-SA 3.0

chilly geyser
#

That said I'm surprised a Gigaword-trained data is dedicated to Public Domain

#

Yeah it's BY-SA, which is copyleft-forcing

#

Which means no commercialisation

desert oar
#

that isn't true at all

#

that would be NC

#

this is not a NC license

chilly geyser
#

Oh

#

Right

desert oar
#

it isn't even "viral" like the GPL

chilly geyser
#

Ah I got confused

#

So it's not very functionally different from MIT/BSD

desert oar
#

im not sure how CC BY-SA handles derivative works

#

so you need to check

#

well no... if you modify the data you must share it under BY-SA

chilly geyser
#

Ah, I guess MIT is more permissive?

desert oar
#

so for the data itself, it is viral like GPL

chilly geyser
#

It's definitely easier to self-train

desert oar
#

but for derivative works e.g. a software product that uses said data internally, i don't know

#

well the data itself is under the license too

#

so a trained model would be a derivative work of the source data

chilly geyser
#

probably not at the same scale or with the exact same data

desert oar
#

ah derivative work i think still requires SA

#

"Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.

#

"Distribute" means to make available to the public the original and copies of the Work or Adaptation, as appropriate, through sale or other transfer of ownership.

chilly geyser
#

I just realised Wikipedia is CC-BY-SA 3.0 legally speaking too

desert oar
#

yes

chilly geyser
#

Welllllllll I guess I'm never working for an NLP company

desert oar
#

i dont understand your issue

#

you want to do hobby projects?

#

who cares, you arent distributing them

chilly geyser
#

Nah my hobby projects are obviously doable

desert oar
#

you want to start a company and use other people's free data? sucks, follow the license terms

chilly geyser
#

But for any monetary purposes I'd need to declare it comes via training on Wiki, for example

#

Yeah hmm

desert oar
#

that's my guess as to why they bothered using CC BY-SA 3.0

#

because the source wikipedia data is, and their dataset + trained model are derivative works

#

also english gigaword has a whole license agreement attached to it

#

not sure what public domain version there is

chilly geyser
#

Yeahhhh

#

I got a feeling that it can't be sent to Public Domain ...

desert oar
#

are you looking for a public domain dataset?

chilly geyser
#

That'd be nice

#

I'd be able to publish that under permissive licenses on Github without troubles 😛

#

But honestly Wikinews is very small.

serene scaffold
#

@desert oar going line-by-line worked but produced an output with fewer lines but a much larger file size

#

I'm thoroughly confused.

mellow spruce
#

@desert oar That worked out. Thank you so much!!
@mellow spruce Hi, it's me again. This worked great, I have a question though. I want to apply this calculation across two different columns, i.e. I want to calculate the idle time between when one activity is over and the next activity starts, for the different groups I have created. I tried time_diff_byname=lf.groupby('name')['Time','Time 2'].apply(lambda y:y['Time'].shift(-1)-y['Time2'] But that didn't work

desert oar
#

first of all, write [['Time', 'Time 2']]

#

actually that should do it

#

or you can write lf[['Time', 'Time 2', 'name']].groupby('name')

mellow spruce
#

That did it king. much thanks

river thistle
#

In SageMath, is it possible to calculate probabilities of multiple sets, using P(x) notation?

desert oar
#

idk if anyone here uses sage :/ i saw you've asked this a few times

earnest wadi
#

Hello, Im having some trouble using a dataset I have created with tensor flow

im getting this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

this is the code:

train_data = np.asarray(train_data)
train_labels = np.asarray(train_labels)
test_data = np.asarray(test_data)
test_labels = np.asarray(test_labels)
return (train_data, train_labels), (test_data, test_labels)


model.fit(train_data, train_labels, epochs=10)
#

ask for any other stuff u need

mellow spruce
#

So when I tried to add time_diff_byname to the data frame with lf['Time_diff']=time_diff_byname.reset_index(level-1, drop=True) it gives me this error message: cannot reindex from a duplicate axis. Any ideas how to append this column?

desert oar
#

@earnest wadi what does train_data and test_data contain?

#

@mellow spruce does time_diff_byname.index contain duplicate values?

#
.reset_index(level=-1, drop=True)?
#

can you show what time_diff_byname.head() contains

mellow spruce
#

```
      4    -1 days +23:10:10
      8    -1 days +22:49:50
      12                 NaT
Mary  2    -1 days +22:22:35
```
#

It might contain duplicates

desert oar
#

ah

#

im actually not sure how it would, tbh

#

oh did i steer you wrong on this too

#

does lf itself have duplicate indices?

#

or no

mellow spruce
#

nop, I haven't set indices to the df yet

desert oar
#

can you send me some sample data again

mellow spruce
#

```
   'tool':['Hammer', 'Drill','Wipes', 'Driver', 'Drill','Wipes','Hammer', 'Driver','Driver', 'Drill','Hammer', 'Drill', 'Drill','Wipes','Hammer', 'Driver'],
   'Time':['13:40:31','13:20:33','13:05:00','12:15:28','12:00:00','11:43:35','11:27:35','11:17:22','11:10:10','10:59:11','10:22:15','10:12:10','10:00:00','09:55:05','09:45:45','09:16:35']}
lf=pd.DataFrame(data=d)
lf['Time']=pd.to_timedelta(lf['Time'])
```
desert oar
#

so, groupby on series and dataframes have different semantics

#

with respect to how the indices are constructed at the end

#

which i did not realize

#

try level=0 instead of level=-1

#

In [31]: time_diff_byname
Out[31]:
name
John     0    12:00:00
         4    11:10:10
         8    10:00:00
         12        NaT
Mary     2    11:27:35
         6    10:22:15
         10   09:45:45
         14        NaT
Peter    1    11:43:35
         5    10:59:11
         9    09:55:05
         13        NaT
Richard  3    11:17:22
         7    10:12:10
         11   09:16:35
         15        NaT

this is what i get

#

which means the first (0th) level of the index needs to go

#

not the last

#

that was my mistake
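
A different way to get the per-name "next time" column that avoids the index juggling altogether: `GroupBy.shift` returns a result aligned to the original row index, so it can be assigned straight back. In this sketch the 'name' column is an assumption reconstructed from the output above, and the column name `Time_next` is made up.

```python
import pandas as pd

# Assumed sample data: the 'name' values are inferred from the printed output.
d = {
    "name": ["John", "Peter", "Mary", "Richard"] * 4,
    "Time": ["13:40:31", "13:20:33", "13:05:00", "12:15:28",
             "12:00:00", "11:43:35", "11:27:35", "11:17:22",
             "11:10:10", "10:59:11", "10:22:15", "10:12:10",
             "10:00:00", "09:55:05", "09:45:45", "09:16:35"],
}
lf = pd.DataFrame(data=d)
lf["Time"] = pd.to_timedelta(lf["Time"])

# Shift within each name group; the result keeps lf's own index,
# so no reset_index call is needed before assigning it as a column.
lf["Time_next"] = lf.groupby("name")["Time"].shift(-1)
print(lf.head(8))
```

The last row of each group has no following row, so it gets NaT, the same as in the output above.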

mellow spruce
#

lf['Time_diff']=time_diff_byname.reset_index(level-1, drop=True) in here?

#

that worked out correctly, thank you so much master!!1

earnest wadi
#

@desert oar
they both look like lots of these

  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0] ```

there is a corresponding word in a word index for each number
#

it cant print the whole thing

```
 [   4    5   14 ...    0    0    0]
 [  22   23    5 ...    0    0    0]
 ...
 [1080   32    5 ...    0    0    0]
 [  89    5   25 ...    0    0    0]
 [ 448   59   76 ...    0    0    0]] [[ 330  366 1032 ...    0    0    0]
 [1125   22  615 ...    0    0    0]
 [1142  134    5 ...    0    0    0]
 ...
 [ 126 2402  128 ...    0    0    0]
 [  33 2419  248 ...    0    0    0]
 [ 128 2430   22 ...    0    0    0]]```
desert oar
#

is that a 2d array?

#

it might be an array of lists

#

which the error message hints at

earnest wadi
#

oh yeah that would actually make sense

#

ill iterate through it and make sure they are all arrays

#

still getting this ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

#

I did this:

for i in range(len(test_labels)):
        for r in range(len(test_labels[i])):
            test_labels[i][r] = int(test_labels[i][r])
        test_labels[i] = np.array(test_labels[i])
desert oar
#

you still have an array of arrays

#

where is this data coming from?

#

how is it constructed?

#

seems like you ought to flatten this to a 2d array

earnest wadi
#

the data is from a file structured like this

data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]```

I then split it up into train and test data
#

how do I flatten it

desert oar
#

that's what the data looks like in the file?

#

what kind of horrible data format is that

earnest wadi
#

yes

#

haha

#

it's literally just the list

#

I slapped it in there

desert oar
#

:/

#

use json

#

please

earnest wadi
#

alr haha

#

I will learn that for the future

#

point is

desert oar
#

or .npy

earnest wadi
#

I can extract the data back in to a python list

desert oar
#

yeah but i shudder to imagine how

#

either way you need to make sure this is correctly handled as a 2d matrix, not a 1d array of 1d arrays

earnest wadi
#

literally just reading it, splitting it at the \n then at the first " " to remove data

#

alright so how do I do that

desert oar
#

then evaling it right?

#

blah

earnest wadi
#

how can I make it a 2d matrix

desert oar
#

ah

#

wait

#

these all have different lengths

earnest wadi
#

oh, they are all converted to the same length by keras later down the line

#

hang on

#

[ 4 5 6 7 8 5 4 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

#

like this

desert oar
#

so you need to 1) read the data, 2) pad them with 0s, 3) convert to np array

#

is that right?

earnest wadi
#

by "need" do you mean thats what im doing?

desert oar
#

well that's what you should be doing

earnest wadi
#

I think I'm np arraying after padding

#

nope

#

padding is last currently

#

should I change

desert oar
#

yes

#

if you try to make a numpy array out of uneven length lists

#

it will never make a 2d array from that

earnest wadi
#

okay

desert oar
#

it will always be an array of arrays
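
A small demonstration of the difference, using made-up records. With `dtype=object` NumPy keeps each uneven list as a single element, while padding first gives a proper 2-D integer array.

```python
import numpy as np

ragged = [[4, 5, 6], [4, 5], [22, 23, 5, 25]]   # rows of different lengths

# With dtype=object, each list becomes one element of a 1-D array.
# (Recent NumPy versions raise ValueError if dtype=object is left out here.)
as_objects = np.array(ragged, dtype=object)
print(as_objects.shape, as_objects.dtype)

# Pad every row with zeros to the longest length, then convert.
max_len = max(len(rec) for rec in ragged)
padded = np.array([rec + [0] * (max_len - len(rec)) for rec in ragged])
print(padded.shape, padded.dtype)
```

The object array has one dimension and no numeric dtype, which is the kind of input a tensor conversion tends to reject.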

earnest wadi
#

[[ 4 5 6 ... 0 0 0]
[ 4 5 14 ... 0 0 0]
[ 22 23 5 ... 0 0 0]
...
[1080 32 5 ... 0 0 0]
[ 89 5 25 ... 0 0 0]
[ 448 59 76 ... 0 0 0]]

#

it's now padded before being converted to an array

#

but still getting an error

desert oar
#

show the error

#

same one?

earnest wadi
#

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

desert oar
#
data_txt = """data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""

data = eval(data_txt.split(' ', maxsplit=1)[1])
max_len = max(len(rec) for rec in data)
data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
data = np.array(data)

works for me

#

resulting shape is (21, 21)

earnest wadi
#

damn ok so uh

#

thats just like

#

what i was trying to do, but not small brain

desert oar
#
def parse_horrible_format(txt):
    data = eval(txt.split(' ', maxsplit=1)[1])  # parse the txt argument, not the global data_txt
    max_len = max(len(rec) for rec in data)
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
    data = np.array(data)
    return data

here's a function to do it 🙂

#

yes, small brain good

earnest wadi
#

hahaha

#

parse horrible format

#

so if I input """data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""

#

it will return

#

train_data that will work with tf?

desert oar
#

no idea, but it will definitely do the padding and np.array conversion for you

earnest wadi
#

alright

#

ill get that in

#

and see what happens

#

@desert oar your function doesn't work entirely

AttributeError: 'list' object has no attribute 'split'

desert oar
#

data_txt should be a string

#

obviously this won't work on a list..

earnest wadi
#

what string should it be

desert oar
#

this

"""data [['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]"""
#

if you are starting with the literal list

[['4', '5', '6', '7', '8', '5', '4', '11'], ['4', '5', '14', '15', '16', '5', '4', '6', '20', '21'], ['22', '23', '5', '25', '6', '27', '28', '29', '30', '31'], ['32', '33', '34', '35', '6', '37', '22', '39', '25', '41', '42', '43', '44', '45'], ['46', '47', '22', '49', '50', '5', '52', '47', '54', '22', '49', '57', '41', '59', '5', '61', '62', '22', '50', '5', '66'], ['67', '68', '69', '22', '71', '72', '73', '42', '75', '76', '33', '78'], ['79', '47', '81', '16', '83', '84', '85', '86', '44', '5'], ['89', '5', '25', '92', '93', '16', '95', '96', '97', '44', '47'], ['100', '5', '102', '103', '104', '44', '106', '16', '5', '109', '110', '111'], ['5', '113', '42', '6', '116', '117', '118', '119', '6', '121', '16', '123', '5', '125', '126', '127', '128', '129', '130'], ['89', '5', '133', '134', '135', '8', '137', '138', '126', '140'], ['125', '33', '143', '144', '8', '146', '147', '44', '149', '150'], ['32', '5', '153', '6', '155', '156', '157', '42', '6', '160'], ['32', '5', '153', '6', '165', '157', '42', '6', '169'], ['125', '33', '172', '173', '16', '137', '176', '177'], ['100', '5', '180', '126', '182', '16', '5', '185', '186', '187', '44', '47'], ['4', '5', '129', '193', '16', '195', '6', '197', '198', '119', '129', '201', '202', '47', '22', '205', '206', '5', '208'], ['32', '22', '153', '6', '7', '214', '215', '216', '217', '218', '219', '5'], ['89', '5', '133', '134', '225', '16', '22', '25', '5', '6', '231'], ['47', '233', '5', '125', '25', '6', '238', '233', '240', '6', '242', '233', '244', '6', '246'], ['137', '248', '249', '5', '251', '47', '253', '254', '33', '143']]

then feel free to delete the first line where you split the string and eval

earnest wadi
#

oh

#

I see

#

@desert oar still the same error
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

desert oar
#

show your code..

earnest wadi
#
```
def import_data(dsdir):
    ds = open(f"{dsdir}.cds", "r", encoding='cp1252')
    dataset = ds.read()
    ds.close()

    dataset = dataset.split("\n")
    splitup = []
    for part in dataset:
        splitup.append(part.split(" ", 1))

    splitup[1][1] = splitup[1][1].replace("\'", "\"")
    splitup[2][1] = splitup[2][1].replace("\'", "\"")
    data = json.loads(splitup[1][1])
    labels = json.loads(splitup[2][1])

    train_data = (data[:len(data)//2])
    train_labels = (data[:len(labels)//2])
    test_data = (data[len(data)//2:])
    test_labels = (data[len(labels)//2:])

    def parse_horrible_format(data):
        max_len = max(len(rec) for rec in data)
        data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
        data = np.array(data)
        return data    
    
    train_data = parse_horrible_format(train_data)
    test_data = parse_horrible_format(test_data)

    

    return (train_data, train_labels), (test_data, test_labels)```
#
(train_data, train_labels), (test_data, test_labels) = ds.import_data("Pickup Lines - Insults")


word_index = ds.get_word_index("Pickup Lines - Insults")


word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, "?") for i in text])

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding="post",
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                        value=word_index["<PAD>"],
                                                        padding="post",
                                                        maxlen=256)

train_data = np.array(train_data)
train_labels = np.array(train_labels)
test_data = np.array(test_data)
test_labels = np.array(test_labels)

print (train_data)

vocab_size = 88000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.compile(optimizer="adam",
                loss="binary_crossentropy",
                metrics=["acc"])

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

model.fit(train_data, train_labels, epochs=10)
#

first block is in my package, 2nd is the main neural net script

desert oar
#

parse_horrible_format already does the padding FYI

earnest wadi
#

so I should delete keras.pre... etc

desert oar
#

i mean, there is quite a bit more action happening here

earnest wadi
#

or does it not matter too much

desert oar
#

frankly i dont use keras so i have no idea what much of this code does

earnest wadi
#

ooh

desert oar
#

the basic problem is: train_data and train_labels must be numpy arrays of plain numbers (floats, ints, etc.)

#

not arrays of arrays

#

so whatever processing you do, at the end of the day you must make sure that you are feeding "flat" numpy arrays to keras
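
A minimal sketch of that check, assuming the records are lists of digit strings like the ones shown earlier. The names `records`, `ragged` and `padded` are made up for illustration and are not part of the original script. An array built from lists of different lengths ends up with `dtype=object`, which is the kind of input the "Unsupported object type list" error points at, while a padded array is a regular 2-D numeric array:

```python
import numpy as np

# Hypothetical ragged records (illustrative values only).
records = [["4", "5", "6"], ["4", "5"], ["22"]]

# Lists of different lengths become a 1-D array of Python lists
# (dtype=object), which TensorFlow cannot convert to a tensor.
ragged = np.array(records, dtype=object)

# Padding every record to the same length gives a regular 2-D integer array.
max_len = max(len(rec) for rec in records)
padded = np.array(
    [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in records]
)

print(ragged.dtype, ragged.shape)       # object (3,)
print(padded.dtype.kind, padded.shape)  # i (3, 3)
```

Printing `.dtype` and `.shape` for both the data and the labels right before `model.fit` is a cheap way to see which of the two is still an array of lists.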

earnest wadi
#

but you dont know how I can flatten my data to work

#

ill try looking some more stuff up

desert oar
#

i dont know because i dont know what your data looks like before you pass it to keras

#

its possible/likely that pad_sequences is returning an array of lists or something

earnest wadi
#

idk because this code was working with an official keras dataset, I basically just tried my best to replicate the format of the index, labels and data

desert oar
#

is this "tf keras" or "keras keras" btw?

#
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(32, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

this looks so easy even i could probably do it and i am legitimately very stupid

earnest wadi
#

this is

#

tf keras

#

from tensorflow import keras

desert oar
#

cool

#

i should look into tf 2.0

#

ive used pytorch more, tf 1.0 was too cumbersome

earnest wadi
#

yeah i managed to get my head around the code they provided really easily to make this:

desert oar
#

i also really hate that they called their "keras-style" library "keras"

#

they should have come up with a different name

earnest wadi
desert oar
#

cool

earnest wadi
#

all im tryna do now is make an easy and modular way to train it on any text dataset you want

#

straight from a .txt fike

#

file*

#

basically

#

my question

#

after your function you made for parsing

#

is the output a 2d matrix

#

or is it still array of arrays

desert oar
#

my function should return a 2d array

#

i.e. the length of .shape should be 2

earnest wadi
#

ill check

#

len(train_data.shape) ?

#

yes

#

it is 2

desert oar
#

thats called a "clean room" reimplementation

#

when you copy the logic but none of the source code

#

im in a meeting, but i know something about this topic

#

ill @ you later

earnest wadi
#

alright, ive done some testing, the length of the shape is definitely 2 all the way through the code @desert oar

visual violet
#

ok so

#

i have

#

can i be smart enough to take average of percentage difference every month?

desert oar
#
data.set_index('Date').resample('1M')['Percentage Difference'].mean()

like that?

#

or ```python
data.resample('1M', on='Date')['Percentage Difference'].mean()
```

#

@visual violet ^

visual violet
#

insanely smart

#

thanks!

#

Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

#

hmm type error

desert oar
#

Convert your Date column to an actual date or timestamp type

#

Not a string

visual violet
#

so something like this?

#

df_newyork['Date Time'] = pd.to_datetime(df_newyork['Date Time'])

desert oar
#

Should work, im on mobile now so cant check
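
Putting the two steps together, a small sketch under stated assumptions: the column names and values below are invented, and `"MS"` (month start) is used as the resampling alias here only because it is accepted by both older and newer pandas releases. The `'1M'` alias from the messages above does the same grouping on the pandas versions that accept it.

```python
import pandas as pd

# Hypothetical data: dates stored as strings, which is what triggers the
# "Only valid with DatetimeIndex..." error.
df = pd.DataFrame({
    "Date": ["2020-01-05", "2020-01-20", "2020-02-10", "2020-02-25"],
    "Percentage Difference": [10.0, 20.0, 30.0, 50.0],
})

# Convert the string column to a real datetime dtype first.
df["Date"] = pd.to_datetime(df["Date"])

# Group by calendar month and average within each month.
monthly = df.resample("MS", on="Date")["Percentage Difference"].mean()
print(monthly)
```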

visual violet
#

@desert oar dude it worked

#

thanks for the second code

severe island
#

is there any way I can get a twitter dataset?

#

i dont have the time to scrape one, are there any repos online that can provide a twitter dataset?

flat quest
#

you could always do a quick search

I'm sure theres some on kaggle or uni websites @severe island

vernal sierra
#

Has anyone ever worked with neural machine translation? I got an error where it says runtime error: dimension specified as 0 but tensor has no dimensions.

silk knot
#

someone told me to ask it here so here I go

#

since the whole thing should work just fine if the table would just be in pd.DataFrame format 😅

#

please ping me if you got a solution

lapis sequoia
#

Hi, how can I fix this error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

#

Hi so basically im attempting to reshape my 1D array to become a 2D array, but I keep getting an error that says my object doesnt have the Attribute array.reshape(1, -1), even though my console told me to reshape my data using array.reshape(1, -1).

lapis sequoia
#

how can I fix this: len() of unsized object

rancid brook
#

Read the error and then understand why its giving that error

#

@lapis sequoia You need to post your code

desert oar
#

@silk knot can you show an example of what rows and data contain? also try to avoid using variable names like list which shadow important built-in names

silk knot
#

And good point

#

But I did have another issue you might be able to help me with 😅

#

ValueError: Shape of passed values is (33905, 36), indices imply (36, 36)

#

I dont know if you mind being tagged so im just gonna do it once, let me know if it bothers you @desert oar

desert oar
#

once is fine, thanks for asking

#

if i disappear for a while you can tag me again although i will be heading offline soon

#

show the code that produced that error?

#

usually that means you have mismatched sizes

#

e.g. ```python
pd.Series([1,2], index=list('abcdefg'))
```

produces a similar error
silk knot
#

the thing is, I don't remember ever making it 36, 36

#

I did have 36 samples however, so its not a random number I just don't know where I implied it

desert oar
#

what line

silk knot
#

I think it goes wrong in 74

desert oar
#

makes sense

silk knot
desert oar
#

pca_data is not the covariance matrix...

#

it's your full data projected into PC space

#

so N rows and 36 columns

#

whereas you specified index=columns, columns=labels both of which evidently contain 36 elements
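
A sketch of that mismatch with stand-in data. The sizes, the random matrix and the label names are all assumptions for illustration; the real `pca_data` would come from the PCA step in the script. The index needs one label per row (sample) and the columns need one label per principal component:

```python
import numpy as np
import pandas as pd

n_samples, n_components = 100, 36  # hypothetical sizes
rng = np.random.default_rng(0)
pca_data = rng.normal(size=(n_samples, n_components))  # stand-in for the projected data

sample_labels = [f"sample{i}" for i in range(n_samples)]  # one per row
pc_labels = [f"PC{i + 1}" for i in range(n_components)]   # one per column

pca_df = pd.DataFrame(pca_data, index=sample_labels, columns=pc_labels)
print(pca_df.shape)

# Passing 36 index labels for 100 rows reproduces the same kind of error.
mismatch_raised = False
try:
    pd.DataFrame(pca_data, index=pc_labels, columns=pc_labels)
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```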

silk knot
#

uhu

#

well thank you I should be able to fix it, after I get some sleep

#

some nights, before you know it its 5am

desert oar
#

@void anvil arch? that's a library on pypi?

#

nice

#

oh

#

not about clean room code but

#

i wanted to clarify that copyright applies to code

#

specifically to the code

#

not to the algorithm

#

because the algorithm as you point out is part of nature

#

and so can be patented, but not copyrighted, because it is not a creative work

#

however the code is copyrightable, and not patentable, because it is considered a creative work

#

(under US law)

#

this is why licenses such as the GPL are interesting - they apply to the source code, but because of the license terms also have implications for consumers of the software itself

#

thats actually a good question

#

my guess is that it's like paraphrasing someone else's paper without citing their work

#

right

#

i would imagine the same is true for code

#

that isn't necessarily true. this is why clean room implementations exist.

#

for plausible deniability that any source code which happens to be in common is a coincidence

#

i'm not sure about that

#

i'm sure there is plenty of case law on that subject

#

i have a friend who is an attorney in this field

#

i can ask how this all works

#

he will likely say "this scenario is absurd and would never happen in real life" and it would take me 10 mins to get an answer

#

if there is exactly one way and only one way to implement an algorithm in a particular language, e.g. C, i imagine that it would not be considered copyrightable, or that using the same code is fair use

#

so much of US IP law is in the form of case law and precedent

#

so its really really hard to know even if you are brave enough to wade through the statutes

#

so i can ask him but i cant guarantee a good answer

#

i think in general if you don't willfully commit copyright infringement, most codebases are sufficiently complicated and distinct enough that you won't get copyright trolled

#

well if there's no license at all then it's all rights reserved

#

how what works?

#

you must distribute your contributions under the same license as the original.

#

i'm not sure if the CC BY-SA license can be interpreted to mean that software compiled from the licensed source code must also be distributed under CC BY-SA

#

i suspect that it can't, or isn't

#

and this is also why you should use a code-specific license for your code, so as to remove the ambiguity

#

wait who is using CC for code

#

why would you do that

#

other than like... contributions to rosetta code

#

i suspect that compiled software falls outside the scope of "Adapted Material" in the 4.0 version

#

and of "Derivative Work" in the 2.0 version

#

that's a bit like if you printed out the script to Arcadia, ran it through a paper shredder, then used the pieces to make a collage

#

is that a derivative work of Arcadia?

#

is it?

#

i mean, is it a derivative work?

#

i mean ignore the fact that snippets of copyrighted material are visible in the collage

#

the fact that it's physically made of Arcadia i think isn't alone enough to call it derivative

#

its a good question

#

if you arent distributing it then the GPL at least doesnt care

#

its more interesting if you are building a public-facing API with a closed-source backend

#

are you "distributing" something based on the GPL'ed code?

#

you aren't physically distributing software, but you are providing access to that software

#

right. if providing an API were determined to be "distributing" it would be the legal equivalent of finding P=NP or breaking RSA

#

a court discovery order would change that in a hurry

#

you wouldn't have to look far

#

what if they give a tech talk at a python conference about how they use your GPL library to go super duper fast

#

im not saying thats available in every case, but it could be enough to make an example out of someone

#

plus theres some software which is both GPL and almost literally ubiquitous

#

although then you're basically saying "statistically this person is likely to be using my software" which has a lot of uncomfortable implications for criminal law

#

right, idk if or how much they differ in that respect

#

you might also be able to avoid the "we have cause because 99% of businesses use this"

#

look at job postings for example

#

do they list django? they're probably using django

#

having a case like this succeed in courts would be a true apotheosis of the free software movement

#

not sure its even a good thing

#

or if its obviated by some other legal doctrine or statute

#

ok that one probably will show up in court

#

in the next 5 years

#

great question

#

yep

#

funny i literally just read this article tonight

#

IP law is such a mess already, now throw in datasets and trained models

#

increasingly complicated blends of the above with source code and patentable algorithms

#

incoming: a German court rules that articles written by AI are considered derivative works of the AI developer

#

i can't wait for that hacker news thread

#

...i think its past my bedtime

#

im having bizarre copyright fantasies

#

likewise, ill try to remember to ask about some of this

rapid plank
#

how do I get numpy or matplotlib in vscode?

limpid oak
#

`def f(row):
    try:
        return Polygon([(pt['Longitude'], pt['Latitude']) for pt in json.loads(row['PlotGeoFence'])])
    except:
        return numpy.nan

InputFile['geofence_poly'] = InputFile.apply(f, axis=1)`

#

i have this function which returns NaN if it fails, but it is setting every row to NaN

#

but i want to keep the other column data also and only set NaN where it fails

#

any help will be appreciated
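
One possible cause, offered as a guess: a bare `except:` swallows every exception, so if the function fails for the same reason on every row (for example a missing import, which would raise `NameError`), every row silently becomes NaN. A minimal sketch that catches only parsing errors and prints the reason. The `Polygon` construction is left out so the example has no extra dependency, and the data and the `Owner` column are hypothetical:

```python
import json

import numpy as np
import pandas as pd

def parse_fence(raw):
    """Return a list of (longitude, latitude) tuples, or NaN if parsing fails."""
    try:
        return [(pt["Longitude"], pt["Latitude"]) for pt in json.loads(raw)]
    except (TypeError, ValueError, KeyError) as err:
        # Surface the reason instead of silently hiding it.
        print(f"could not parse {raw!r}: {err!r}")
        return np.nan

df = pd.DataFrame({
    "PlotGeoFence": [
        '[{"Longitude": 1.0, "Latitude": 2.0}, {"Longitude": 3.0, "Latitude": 4.0}]',
        "not json",
    ],
    "Owner": ["a", "b"],
})

# Applying to a single column leaves every other column untouched.
df["geofence_pts"] = df["PlotGeoFence"].apply(parse_fence)
print(df)
```

Any other exception type now propagates with a full traceback, which usually shows what is actually going wrong.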

desert parcel
#
```py
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
```
#

Could someone explain the 2nd layer?

#

the video says it in a very complicated way and I don't really get it

#

So is it right to say that it has 128 functions that will determine what that thing is?

simple violet
#

Does anyone know the most interesting data to scrape with Python, or have any links to useful data?

snow grove
#

guys can you help me with this```
pca=dt.corr() #dt is my data

#this works fine
k=10
col=pca.nlargest(k,'SalePrice').index
d=np.corrcoef(dt[col].values.T) #presence of transpose here
sb.heatmap(d,annot=True)

#this doesn't work (it keeps running until RAM usage is exceeded and it crashes, gives no output)
k=10
col=pca.nlargest(k,'SalePrice').index
d=np.corrcoef(dt[col].values)
sb.heatmap(d,annot=True)
```
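
A likely explanation, sketched with made-up data: `np.corrcoef` treats each row as a variable by default, so without the transpose it correlates every data row with every other data row. For a table with thousands of rows that produces a huge square matrix, and an annotated heatmap of it would be very expensive to draw. The array size below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 10))  # hypothetical: 1000 rows, 10 columns

# np.corrcoef treats each ROW as a variable by default (rowvar=True).
by_columns = np.corrcoef(values.T)  # 10 variables -> 10 x 10 matrix
by_rows = np.corrcoef(values)       # 1000 "variables" -> 1000 x 1000 matrix

print(by_columns.shape, by_rows.shape)

# Same result as the transposed call, without transposing:
same = np.corrcoef(values, rowvar=False)
```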

grizzled oriole
#

how do I get numpy or matplotlib in vscode?
@rapid plank
Install numpy and matplotlib in your Python environment and import them...

desert oar
#

@desert parcel do you understand how a neural network works at a basic level?

#

a "dense" layer in keras means that every input is connected to every node. in this case there are 128 nodes

#

this is also called a "fully connected" layer
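
For a concrete picture, here is a rough NumPy sketch of what a 128-node dense layer with ReLU computes on one flattened 28x28 image. The random weights are placeholders (a trained layer would have learned values), and the variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((1, 784))                 # one flattened 28x28 image
W = rng.normal(size=(784, 128)) * 0.01   # one weight per (input, node) pair
b = np.zeros(128)                        # one bias per node

# Each of the 128 nodes computes a weighted sum of ALL 784 inputs,
# then ReLU clips negative results to zero.
hidden = np.maximum(0.0, x @ W + b)
print(hidden.shape)  # (1, 128)
```

So "128 functions" is a fair mental model: each node is one weighted sum followed by the activation, and the next layer combines those 128 outputs into 10 class scores.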

desert parcel
#

Oh no I do not, the video i'm watching doesn't go that detailed

desert oar
#

ah

desert parcel
#

It's this one

desert oar
#

i strongly recommend the 3blue1brown series that explains how neural networks operate

desert parcel
desert oar
desert parcel
#

ohh

#

I couldn't find a good one I just thought that i'd stick to the official tensorflow yt

#

And I keep redoing my notes

#

which is really annoying

desert oar
#

dont forget that TF is a software library

#

they will be focused on teaching you the software

#

although it looks like they are doing a good job at introducing you to the concept

#

they might go back and explain layers later

desert parcel
#

yeah

#

they jump around the code a bit

desert oar
#

this actually seems like a very nice gentle introduction

desert parcel
#

which one?

desert oar
#

the TF one

desert parcel
#

Oh

#

yeah I like it

#

I can make notes and simplify it

desert oar
#

but the 3blue1brown video will probably be enlightening

desert parcel
#

Ah I'll watch it while eating lunch lol

#

Which will be tomorrow haha

desert oar
#

i don't see where they use the Dense(128) thing

desert parcel
#

since it's night already

#

Let me get my code

#

hold on

desert oar
#

eventually you will want to learn the math as well

desert parcel
#

the math huh

#

well I don't like math a lot sometimes

desert oar
#

yep. the whole idea of a "neuron" is a conceptual aid to understanding the model and doesn't really have much to do with how the model actually works

#

it's all math underneath

desert parcel
#

most of the text in the notes is just me paraphrasing what the guy said

desert oar
#

you dont need to understand it all, but the more you do understand the more interesting problems you can solve

desert parcel
#

Well I don't seem too excited to get into the mathy side of things lol

desert oar
#

i can't access that link from work

desert parcel
#

Yeah that makes sense

#

Oh you're at work?

desert oar
#

i'm just wondering where you got the idea to use this

(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
desert parcel
#

The youtube video

#

I copied it from the notebook they linked

desert oar
#

oh, the part 2?

desert parcel
#

Yeah part 2

desert oar
#

ok

desert parcel
#

I watched the 2 of them b2b and keep rewatching them

#

I wanna get good at the start first

desert oar
#

the 3blue1brown video should help explain what's happening with this

desert parcel
#

Alright then

desert oar
#

and it will also show you where the math part comes in

desert parcel
#

Alright I'll give that a watch then

desert oar
#

cool

earnest wadi
#

@desert oar hey, i've gotten quite a bit further with my problem, i got some help from other people and shuffled some stuff around, however now im having a slight problem with your function

#
def parse_horrible_format(data):
        max_len = max(len(rec) for rec in data)
        data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
        data = np.array(data).astype(np.float32)
        return data
Traceback (most recent call last):
  File "c:/Users/Silv3/OneDrive/Desktop/datasetup/datasetup/hillbilly.py", line 12, in <module>
    (train_data, train_labels), (test_data, test_labels) = ds.import_data("Pickup Lines - Insults")
  File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 258, in import_data
    labels = parse_horrible_format(labels)
  File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in parse_horrible_format
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
  File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in <listcomp>
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
  File "c:\Users\Silv3\OneDrive\Desktop\datasetup\datasetup\datasetup.py", line 253, in <listcomp>
    data = [[int(x) for x in rec] + [0] * (max_len - len(rec)) for rec in data]
ValueError: invalid literal for int() with base 10: 'O'
desert oar
#

that's a capital letter O, not a digit 0

#

how did that get into your data?

earnest wadi
#

wheres the O coming from wtf

#

i have no idea

desert oar
#

👻 spooky

earnest wadi
#

oh

#

yeah I have no idea

#

labels is literally just 1's and 0's

#

oooohh

#

I see

#

labels is this @desert oar

'Do you like Star Wars? Because Yoda only one for me!': '1', 'Call me Shrek because I’m head ogre heels for you.': '1', 'Wanna go bowling? I thought it might be right up your alley.': '1', 'Excuse me, I just noticed you noticing me and I just wanted to give you notice that I noticed you too.': '1', 'If your heart was a prison, I would like to be sentenced for life.': '1', 'I love you like a pig loves not being bacon.': '1', 'Are you parents bakers? Because you are a cutie pie.': '1', 'Are you a cat? Cause you are purrrfect': '1'}

I had to shorten it because its like 90 000 characters

#

so the first one is

#

{'Of course I talk like an idiot. How else could you understand me?': '0'

#

hence the "O"

desert parcel
#

...

#

nice joke

earnest wadi
#

?

desert parcel
#

lol i am stealing that

earnest wadi
#

oh haha

#

yeah its a dataset of insults and pickup lines

desert parcel
#

Oh here let me give you mine

earnest wadi
#

?

desert parcel
#
```py
annoying = ['Is it now?', 'Nope', 'Not really', 'I say not', 'lol git gud', 'Naaaa']
query = ['Yes that is correct.', f"""I agree with which was last said""", "That's correct",
        "I'm not sure, you tell me.", "Oh please you're smarter than that.","Figure it out.",
        "I'm not google.", "You think I know everything?",
        "I'm not going to say it, now that you want me to say it.",
        "lol good luck figuring it out on your own","why would I know.",
        "Sure. If that's what you wanna think."]
what_is = [
    "I'm not sure, you tell me.", "Oh please you're smarter than that.",
    "Figure it out.", "I'm not google.", "You think I know everything?",
    "I'm not going to say it, now that you want me to say it.",
    "lol good luck figuring it out on your own",
    "why would I know.", "Sure. If that's what you wanna think."
]```
desert oar
#

@earnest wadi you need to use a totally different function to parse that

desert parcel
#

Lol It's just my bot being sassy

earnest wadi
#

haha

desert parcel
#

I wanted to implement this by learning ML and stuff

desert oar
#

you really really really need to use json or some standard format

desert parcel
#

I'll let you guys talk lol

desert oar
#

this is turning into a mess to read data from disk

earnest wadi
#

thats not a weird format

#

its a python dictionary

#

before all the numpy stuff

data is a list
index is a dict
labels is a dict

desert oar
#

yes but you are dumping literal python objects

#

parsing that is inevitably a mess

#

eval is not a good idea

#

use json instead
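
A minimal sketch of what that could look like, assuming the three objects described above (a list of records, a word index dict, and a dict mapping each text to its label). The file name, keys and sample values are invented for illustration:

```python
import json
import os
import tempfile

# Hypothetical stand-ins for the three objects mentioned above.
dataset = {
    "data": [["4", "5", "6"], ["4", "5"]],
    "index": {"hello": 4, "world": 5},
    "labels": {"some pickup line": "1", "some insult": "0"},
}

path = os.path.join(tempfile.mkdtemp(), "dataset.json")

# Write everything as one JSON document instead of str()-dumped Python objects.
with open(path, "w", encoding="utf-8") as f:
    json.dump(dataset, f)

# Reading it back needs no string splitting, quote replacing, or eval.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Labels become a plain list of ints, in the same order as the texts.
texts = list(loaded["labels"])
label_values = [int(v) for v in loaded["labels"].values()]
print(texts, label_values)
```

Iterating over the dict directly yields its keys (the sentences), which is where the stray `'O'` came from; taking `.values()` gives the `'0'`/`'1'` strings instead.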

lapis sequoia
desert parcel
#
```py
X = np.array([-1.5, 0, 3.5], dtype=float)
Y = np.array([0, 3, 10], dtype=float)

newModel = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
newModel.compile(optimizer='sgd', lose='mean_squared_error')
newModel.fit(X, Y, epochs=20)

print(Model.predict([1.0]))```
#

I am not sure how to fix it

#

why does this work ```py
X = np.array([-1.5, 0, 3.5], dtype=float)
Y = np.array([0, 3, 10], dtype=float)

newModel = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
newModel.compile(optimizer='sgd', loss='mean_squared_error')
newModel.fit(X, Y, epochs=250)

print(newModel.predict([1.0]))```

desert oar
#

look at the error @desert parcel

#

what does it say?

desert parcel
#

Model must be made and compiled with the same DistStart

#

and a type error

desert oar
#

@desert parcel the error message is the last line

#

where it says TypeError
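
For illustration, a plain-Python stand-in (not Keras itself; `compile_model` is a hypothetical function) showing the kind of `TypeError` a misspelled keyword such as `lose=` produces, and why the last line of the traceback is the one to read:

```python
def compile_model(optimizer, loss=None):
    """Hypothetical stand-in for a function that takes a `loss` keyword."""
    return {"optimizer": optimizer, "loss": loss}

try:
    compile_model(optimizer="sgd", lose="mean_squared_error")  # typo: lose
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# With the keyword spelled correctly the call goes through.
config = compile_model(optimizer="sgd", loss="mean_squared_error")
```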

autumn veldt
#

excuse me guys, can someone help me? currently i'm learning about image classification using google colab. since i'm learning it by watching a YouTube video, i ran into a problem that i can't solve. can anyone help me?

desert oar
#

just ask your question

#

!ask

arctic wedgeBOT
#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

desert oar
#

see our guide to good questions above

autumn veldt
#

so i just put my question here right?

silk axle
#

Yes @autumn veldt

autumn veldt
#

ok, lemme prepare my question first

#

!ask so, i was learning about image classification using GLCM features + an SVM, with my dataset in a csv file. after reading the dataset i'm trying to see how much accuracy i can get from it (it only shows one training run with 0.70 accuracy). what i want to ask is: how do i get something like 5-20 training runs with different accuracies (epoch = 20) in a single run? i'm sorry if my english is bad

arctic wedgeBOT
#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

autumn veldt
#

if its possible, can u guys give me some link or whatever that i can learn about my problem too, thanks

desert oar
#

you dont need the !ask command, that just shows you info about asking good questions

flat quest
#

still don't really understand what you're trying to do @autumn veldt

autumn veldt
#

so, i was running DataValidation.csv through GLCM+SVM and the training result is 0.70 accuracy from a single run (the accuracy is calculated from only one of the many rows available). the problem is, DataValidation.csv has about 100+ rows and i want to show the accuracy for at least 20 of those rows, expecting different accuracy results.

flat quest
#

what do you mean by one data?

A single row? @autumn veldt

autumn veldt
#

yea, single row

drifting hemlock
#

Hello, I need some suggestions. I've trained a model to predict the probability of a client cancelling their services with us, using our current client list. If I deploy that model to production behind an API to do live predictions, and I pass it one of the clients that was used in the training set, wouldn't that be biased?

flat quest
#

then just get 20 rows and run them through the model @autumn veldt, and check accuracy?

autumn veldt
#

thats the problem @flat quest , how to get 20 rows on one run?

visual violet
#

guys

#

so i calculated percentage difference from 2019 to 2020

flat quest
#

just select all 20 rows xd @autumn veldt

data[0:20]

#

and run that through the model
that should work unless that particular model doesn't work on batched data

autumn veldt
#

ill try it soon, btw thanks @flat quest

lapis sequoia
limpid raft
#

In CNN models, why is it more accurate to have Conv layers and then some dense layers instead of only Conv layers?

queen barn
#

What's the best machine learning model that can take in a decent number of rows with many nominal categorical variables and a few output variables to determine predicting factors?

drifting hemlock
#

@void anvil you mean training the model on a small subset of the clients and running the predictions on the total population?

#

What if my dataset is small? Like 3,000~?

ebon nebula
#

Hello all. Can someone recommend a good book/course/site to start learning machine learning?

bitter harbor
#

Learning about linear algebra and stats would be the place to start

eager arrow
#

sup nerds

#

i just got back from banging your girlfriends

desert parcel
#

@desert parcel Are you free to upload that dataset somewhere / link me to it? It looks awesome to play with
@void anvil Well I made the dataset myself I just came up with a random linear line equation.

queen barn
#

I realize my question above was rather vague, but if anyone has the time to spare, I just need help figuring out how to go about analyzing this data. We have a product with many configurations, and rather than going through and checking each individual feature or permutation of the features, I'd like to run it through a model that can identify correlation to a binary good or bad column.

#

I have found many methods that seem almost on point for what I'm looking for, but not quite on the nose.

flat quest
#

what kind of product is this? @queen barn

#

@limpid raft well technically you can. And it does work reasonably well. Remember Conv nets are basically 2d dense nets with a limited scope.

But at some point you need to switch from a 2d output to a 1D. That can be done through maxpool, etc. but dense layers tend to work better.

Conv nets may focus too much on low level features. Since each filter is like 3 x 3, low level local features (ex blue dot in left bottom corner) may be prioritized over a higher level feature like an oval face.

desert parcel
#

Do I need a GPU for this?

#

I have been using google colab for this

#

But do I really need a GPU?

queen barn
#

what kind of product is this? @queen barn
@flat quest I'm not sure why that's relevant. I'd prefer not to share the specifics of the product, but I'd be more than willing to explain or provide an example of the data structure.

flat quest
#

@queen barn Ah yeah, was only asking because like you said question was pretty vague. The type of model to use is heavily dependent on what kinda product and data ur working with.

You said configurations: are these config files for software? Or different setups for products, as in machines, etc.?

queen barn
#

No it's a physical product that has about 95 different attributes that can be changed by customization request. I'm trying to find a correlation between some of these customizations and a quality failure.

#

@flat quest I hope that helps a bit

nimble lotus
#

I am trying to convert a column in a dataframe from an object to a string but it is not working? Could someone explain how come?

bitter harbor
#

because the column is a column/a 1D array

#

you'll have to take every element and append/concat them together
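
A side note that may explain the confusion, hedged because the original code was not shown: on many pandas versions a column of Python strings still reports its dtype as `object`, so `astype(str)` can look like it did nothing even though every element is a `str`. The column below is a made-up example, and the dedicated `"string"` dtype is assumed to be available (it is in pandas 1.0 and later):

```python
import pandas as pd

df = pd.DataFrame({"value": [1, "b", 3.5]})  # hypothetical mixed column

as_str = df["value"].astype(str)
print(as_str.dtype)          # often still reported as object
print(type(as_str.iloc[0]))  # each element is a Python str

# The dedicated string dtype makes the conversion visible in .dtype.
as_string = df["value"].astype("string")
print(as_string.dtype)
```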

wise garden
#

Getting an error when fitting data to simple neural net( input, 2 hidden, 1 output, all dense layers): Error when checking input: expected dense_34_input to have shape (518994,) but got array with shape (1,). My input is 32950 by 63 and I flattened it to fit it to the network. Not sure why it's showing (1,). Anyone know what I'm doing wrong?

flat quest
#

well a very basic way (This would be something like a base model) would be to simply throw in all these attributes as features.

As for getting those categories into a numerical format: since it's only 95 attributes, one hot should work fine.

Another method would be to use embeddings.

You should probably start with a standard dense model and see how well it performs. Since there's no temporal data involved, you won't need RNN's or LSTM's. Transformers will tend to work better than standard dense models, but its much more compute heavy.

@queen barn
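
The one-hot step could be sketched like this with pandas. The column names and values are hypothetical stand-ins for the product attributes and the good/bad outcome:

```python
import pandas as pd

# Hypothetical product configurations with a binary quality outcome.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "finish": ["matte", "gloss", "gloss", "matte"],
    "failed": [1, 0, 1, 0],
})

# One 0/1 column per category value; these become the model's input features.
features = pd.get_dummies(df[["color", "finish"]])
target = df["failed"]

print(features.columns.tolist())
```

With the features in this form, a simple baseline model can be fitted first, and the columns it weights most heavily give a first hint at which customizations go along with failures.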

#

@wise garden

So if you print the flattened input what shape does it have? It should be (batch_size, features)
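
A small shape check along those lines. The row count is an assumption chosen because 8238 x 63 = 518994, the number in the error message, which hints that a whole split was flattened into one long vector:

```python
import numpy as np

X = np.zeros((8238, 63))   # hypothetical: 8238 samples, 63 features each

flattened = X.reshape(-1)  # collapses everything into one long vector
print(flattened.shape)     # (518994,)

# A dense network usually wants one row per sample, with the first layer's
# input size equal to the number of features, so X can be passed as-is.
print(X.shape)             # (8238, 63)
n_features = X.shape[1]    # what the first layer's input shape refers to
```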

old thorn
#

can anyone here explain neural networks to me like I'm five? I got into Machine Learning about a month ago and Python like 2 months ago. I use google colab if that will help u help me better

#

Im really struggling to grasp the concepts of RNN's, CNN's, KNN's, and ANN's. I just don't understand how they work

#

Please PM me if possible

desert parcel
#
```py
import numpy as np
import tensorflow as tf
from tensorflow import keras

# One dense unit with one input: the model can only learn y = w*x + b
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])

DataA = np.array([1, 2, 3, 4, 5, 6, 7], dtype=int)
DataB = np.array([6, 7, 8, 9, 10, 11, 12], dtype=int)
# DataB = DataA + 5

model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(DataA, DataB, epochs=250)
print("-"*20)
print(model.predict([9]))```
#

Why is the output 15 though if it's 9 shouldn't it be 14

old thorn
#

You might not have enough data for it to be accurate maybe? but dont take my advice, ask someone more experienced @desert parcel

desert parcel
#

But I do though

#

In another one I copied from the web

#

it has less epochs and less data but is more accurate than what i'm using

old thorn
#

huh

#

well idk

#

that was my only answer, gl on ur quest

desert parcel
#

Yeah alright

tame fractal
#

What exactly were you expecting?

#

why should it be 14?

desert parcel
#

DataB = DataA + 5

#

in the comment over there

tame fractal
#

no

#

it's literally just adding 5 to every element of DataA

#

it's an estimator and the model you're using isn't predicting well

#

because it's an estimator you're usually not going to get exact values like that

#

try using a different model and test your result

desert parcel
#

So the equation is backwards?

tame fractal
#

no?

desert parcel
#

Could you explain it again

lapis sequoia
#

Hello all. Can someone recommended me a good book/course/site to start learning machine learning.
@ebon nebula Sorry to answer you a little bit late;
I recommend you sololearn application that is also available in web.
It has a specific section named "Machine learning".
I'm learning python by sololearn and it teaches everything from 0 to 100.
If it's the first time you're hearing about sololearn, it's better to start with android or ios version not web application.

#

and sorry for wall.

ebon nebula
#

You have nothing to be sorry for. Thank you for the suggestion. I will check it out asap.

spark stag
#

@desert parcel one issue you probably have is that you only have 7 different inputs you are training the model on (which is a very small amount for a neural network), but you are training for 250 epochs, which is a lot. Your model is probably overfitting to the data being passed to it, although having said that, I would have thought that the weight would update towards one with the bias being 5. I would recommend using more training data and probably fewer epochs, but as you only have 1 weight I don't think this can overfit.

#

One other note is that this is not the kind of problem neural networks are good at solving; you will probably get better results using some sort of regression. I was just highlighting above why the results may not be what you were expecting and how to try to amend these issues for other models.
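As a sketch of the "some sort of regression" suggestion, an ordinary least-squares line fit with `np.polyfit` recovers the slope and intercept of the `DataB = DataA + 5` data directly, with no epochs involved. The variable names mirror the snippet above; using `np.polyfit` here is my choice for illustration, not something from the original code.

```python
import numpy as np

DataA = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
DataB = DataA + 5

# Degree-1 polynomial fit = straight line y = slope * x + intercept.
slope, intercept = np.polyfit(DataA, DataB, 1)

prediction = slope * 9 + intercept
print(slope, intercept, prediction)
```

A single dense unit trained with SGD is aiming for the same line; if it predicts something other than 14 for an input of 9, one plausible reason is that the weight and bias simply have not converged yet after 250 small steps.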

desert parcel
#

ohhh yeah

#

I was just using a Y=MX+C equation since that is used in the example

#

I base most of my test models on that equation

prisma verge
#

how does one get into deep learning without much higher math?
i've been looking at tf2.0 for a really long time, and it seems really simple. i do understand how a dataset is processed in most cases, but when it comes to model building, choosing an optimizer, etc. then i'm just lost. so many models, loss functions, and every model needs a number of parameters that i don't understand how to calculate. my brain can't find any patterns in this.
tl;dr how do i acquire basic knowledge of deep learning without diving into higher math and linear algebra? modern frameworks seem to make it possible, but i just don't understand which optimizers/layers to use in which situation, how to get the needed numbers of params, etc

wise garden
#

@flat quest I had a few too many last night and didn't realize I was passing the wrong tensor through my pipeline lol got it figured out

visual violet
#

guys

#

if i use resample function of pandas

#

it returns a series?

desert parcel
#

how does one get into deep learning without much higher math?
i've been looking at tf2.0 for a really long time, and it seems really simple. i do understand how a dataset is processed in most cases, but when it comes to model building, choosing an optimizer, etc. then i'm just lost. so many models, loss functions, and every model needs a number of parameters that i don't understand how to calculate. my brain can't find any patterns in this.
tl;dr how do i acquire basic knowledge of deep learning without diving into higher math and linear algebra? modern frameworks seem to make it possible, but i just don't understand which optimizers/layers to use in which situation, how to get the needed numbers of params, etc
@prisma verge You just need HS level math up to how to differentiate and how to do matrix addition, subtraction, and multiplication. PyTorch can however do all of that for you with functions.

#

But knowing the concepts does help

lapis sequoia
#

`

#

noise = np.random.uniform(-1,1(observations,1)) targets = 2*xs - 3*zs + 5 + noise

#

why do i get an error for this

#

please some body help

#

the error is int is not callable

silk axle
#

1(observations,1)

#

As the error says, you can't call ints

#

Which is what you're doing there

#

You might've meant 1, (observations, 1) idk

lapis sequoia
#

TypeError: Level type mismatch: month

#

noo not that one

#

noise = np.random.uniform(-1,1(observations,1)) targets = 2*xs - 3*zs + 5 + noise

#

this one

#

gives me an error at line 1 stating int is not callable

silk axle
#

Read what I said

#

I literally answered that

#

@lapis sequoia

lapis sequoia
#

observations = 1000 xs = np.random.uniform(low=-10,high=10,size=(observations,1)) zs = np.random.uniform(-10,10,(observations,1)) inputs = np.column_stack((xs,zs))

#

but this gives no error

silk axle
#

Because you aren't calling an integer there

lapis sequoia
#

yeah yeah got it

#

i missed a comma there

#

-1,1,(observation)

#

yeah got it thank you so much !
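For reference, a sketch of the corrected version with the missing comma restored, so the third argument to `np.random.uniform` is the `size` tuple instead of a call on the integer `1`. The setup lines are the ones posted above.

```python
import numpy as np

observations = 1000
xs = np.random.uniform(low=-10, high=10, size=(observations, 1))
zs = np.random.uniform(-10, 10, (observations, 1))

# low=-1, high=1, size=(observations, 1): note the comma after the 1.
noise = np.random.uniform(-1, 1, (observations, 1))
targets = 2*xs - 3*zs + 5 + noise

print(noise.shape, targets.shape)
```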

visual violet
#

x = temperature.Temperature.resample('D').mean()

#

temperature is a dataframe

#

so x should be a series
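A quick way to check this, sketched with a made-up `temperature` frame (the four readings and their timestamps are hypothetical): selecting one column and resampling it by day gives back a Series indexed by day.

```python
import pandas as pd

# Hypothetical readings: two per day over two days.
index = pd.to_datetime([
    "2019-01-01 00:00", "2019-01-01 12:00",
    "2019-01-02 00:00", "2019-01-02 12:00",
])
temperature = pd.DataFrame({"Temperature": [1.0, 3.0, 5.0, 7.0]}, index=index)

# Resampling a single column yields a Series of daily means.
x = temperature.Temperature.resample('D').mean()

print(type(x))
print(x)
```

Resampling the whole frame instead (`temperature.resample('D').mean()`) would return a DataFrame.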

wanton bough
#

will anacoda work better in my 2gb ram pc

visual violet
#

better than wat

desert parcel
#

use google colab they let you use their free gpu or whatever

#

it's just faster and it saves time from downloading all the modules

visual violet
#

temperature_array is pm2.5- an air pollutant 2017-2019
pm25_array is just average temperature 2017-2019

#

what does this graph mean

river wing
#

i have a ongoing test can anyone please help me

signal sluice
#
my_dict = {'100001':{
  'forename':'John',
  'surname':'Smith'
  },
'100002':{
  'forename': 'Alice',
  'surname': 'Van Gogh'
  }
}

I have a dictionary as such where one ID corresponds to two values. In pandas, I have a column which has an ID for each record - I want to split this into two columns of the forename and surname.
What's the most idiomatic way to do this?

#

I have very little experience with pandas

bleak fox
#

df = (pd.DataFrame.from_dict(my_dict)).T

signal sluice
#

oh no, I mean that i have a column in an existing dataframe which needs to be replaced by the two other columns

#

thank you though

bleak fox
#

Can u share it... And output expected so that I can share you code

gloomy thistle
#

Hey guys, I gotta form a prediction model involving a prediction of one main variable based on its dependence on 4 different parameters... do I use Step-wise regression or Multi-polynomial regression... or anything else?

bleak fox
#

@gloomy thistle multi-pol will be good.

gloomy thistle
#

I would also like to know how much of each parameter is also tallied into it.. how do I do that?

bleak fox
#

You can use PCA or LDA to find it out

#

So use all your variables and run PCA; it will generate a new matrix where you can select the top N columns, which carry the impact of x1, x2, ..., xn

gloomy thistle
#

Cool, I'll look it up, thanks Kapil !

bleak fox
#

Welcome bro

visual violet
#
temperature_test =  pd.read_csv ('C:/Users/dotha/PythonNotebook/File/temperature (2020) NYC.csv')
pm25_predicted_list = []
for i in temperature_test['temperature'].values.reshape(-1, 1):
    predicted_value = linear_regressor.predict(i)
    pm25_predicted_list.append(predicted_value)
pm25_predicted_list
#

please help

#

error: Expected 2D array, got 1D array instead: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

#

i did try both way

signal sluice
#

(sorry for the bump hb ill delete after) and ty for the help kapil

bleak fox
#

use the old generated df which i shared and use pd.merge(your_df, my_df, key = 'id') to merge it

signal sluice
#

awesome, tysm

bleak fox
#

welcome bro

visual violet
#

the temperature_test looks like this

#

i dont understand why i am getting the errors

bleak fox
#

@visual violet , try only .values

visual violet
#

yea i tried that one too

bleak fox
#

Use .predict([i])

#

And rest as your shared code..

visual violet
#

nvm it worked

#

no error but the list is blank

bleak fox
#

Just put a print(predicted_value) after your predict code line

#

Let's see what is output of your regressor

#

Man u r appending pm* list not m* list.... Correct that you will find your answer

visual violet
#

lmao you are right

#

omg

#

thanks a lot Kapil
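A loop-free sketch of the same prediction, assuming a scikit-learn `LinearRegression` like the one in the thread. The training numbers and the three test temperatures are invented for illustration; only the names `linear_regressor`, `temperature_test` and the `temperature` column come from the posted code. `predict` wants a 2D array of shape (n_samples, n_features), so the whole column can be passed in one call.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data: temperature (one feature) vs. pm2.5.
train_temp = np.array([[10.0], [15.0], [20.0], [25.0]])
train_pm25 = np.array([12.0, 10.0, 8.0, 6.0])
linear_regressor = LinearRegression().fit(train_temp, train_pm25)

# Stand-in for the CSV that was read in the original snippet.
temperature_test = pd.DataFrame({"temperature": [12.0, 18.0, 30.0]})

# Double brackets keep the data 2D: shape (3, 1) instead of (3,).
X_test = temperature_test[["temperature"]].to_numpy()
pm25_predicted = linear_regressor.predict(X_test)

print(X_test.shape)
print(pm25_predicted)
```

Inside a loop, each `i` taken from `reshape(-1, 1)` is a 1D array of length 1, which is why wrapping it as `[i]` made the error go away.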

signal sluice
#

oh, i get an error saying that key is not a recognised keyword parameter in pd.merge?

bleak fox
#

Which IDE you are using?

#

Ok for my dataframe write 1 more line df['id']= df.index

signal sluice
#

oh yeh

#

that worked tyty
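Putting the pieces of this exchange together, a sketch using the `my_dict` from above. The `records` frame and its `score` column are hypothetical stand-ins for the existing dataframe. Note that `pd.merge` takes `on=`, `left_on=` or `right_index=` rather than `key=`.

```python
import pandas as pd

my_dict = {
    '100001': {'forename': 'John', 'surname': 'Smith'},
    '100002': {'forename': 'Alice', 'surname': 'Van Gogh'},
}

# Hypothetical existing frame with an ID per record.
records = pd.DataFrame({"id": ["100001", "100002", "100001"], "score": [1, 2, 3]})

# One row per ID, with forename/surname as columns.
names = pd.DataFrame.from_dict(my_dict, orient="index")

# Match the "id" column against the index of the lookup table.
merged = records.merge(names, left_on="id", right_index=True, how="left")

# Drop the ID if it should be replaced by the two name columns.
result = merged.drop(columns="id")
print(result)
```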

sinful dock
#

for this dataframe, I'm trying to subtract consecutive times in df1['JOB TIME'] columns an expressing the result in minutes in df1['delta_t'] column. Why is my code failing to execute the conversion.? No traceback ```md
#Here is my first function to calculate time delta between consecutive rows
#Goal was to express the timedelta in minutes (see the last two comments)
def time_diff1(col1, col2):
    if col1 is np.datetime64('NaT'):
        pass
    else:
        diff = col2 - col1
        #trying to get minutes here but doesn't work
        return np.timedelta64(diff, 'm')

#Also Tried these without luck
#return diff.dt.total_seconds()/60

df1['delta_t'] = np.vectorize(time_diff1)(df1['Previous Time'], df1['JOB TIME'])

df1.head()
JOB TIME COMMENT Values Previous Time delta_t
1 2017-11-20 01:23:00 d_P = 1259.08 psi 1259.08 2017-11-20 00:13:00 0 days 01:10:00
2 2017-12-02 13:24:00 Offset_Pressure 0.00 2017-11-20 01:23:00 12 days 12:01:00
3 2017-12-02 16:21:00 d_P = 4142.57 psi 4142.57 2017-12-02 13:24:00 0 days 02:57:00
4 2017-12-03 02:57:00 Offset_Pressure 0.00 2017-12-02 16:21:00 0 days 10:36:00
5 2017-12-03 04:03:00 d_P = 539.38 psi 539.38 2017-12-03 02:57:00 0 days 01:06:00```

bleak fox
#

@sinful dock in function use diff=pd.Timedelta(col2- col1).minutes

#

@lapis sequoia pandas shows only the top 5 and bottom 5 rows, but it actually has all the rows... So you can go ahead and continue your operation

lapis sequoia
#

I figured it out that's why I deleted

#

but I have another issue

bleak fox
#

Tell me davidd

lapis sequoia
#

I want to remove these numbers on the side

31975                                    Zimbabwe
31976                                    Zimbabwe
31977                                    Zimbabwe
31978                                    Zimbabwe
31979                                    Zimbabwe
31980                                    Zimbabwe
31981                                    Zimbabwe
31982                                    Zimbabwe
31983                                    Zimbabwe
31984                                    Zimbabwe
31985                                    Zimbabwe
31986                                    Zimbabwe
31987                                    Zimbabwe
31988                                    Zimbabwe
31989                                    Zimbabwe
31990                                    Zimbabwe
31991                                    Zimbabwe
31992                                    Zimbabwe
31993                                    Zimbabwe
31994                                    Zimbabwe
31995                                    Zimbabwe
31996                                    Zimbabwe
31997                                    Zimbabwe
31998                                    Zimbabwe
31999                                    Zimbabwe
32000                                    Zimbabwe
32001                                    Zimbabwe
32002                                    Zimbabwe
32003                                    Zimbabwe
32004                                    Zimbabwe
32005                                    Zimbabwe
32006                                    Zimbabwe
32007                                    Zimbabwe
32008                                    Zimbabwe
32009                                    Zimbabwe
32010                                    Zimbabwe
32011                                    Zimbabwe
32012                                    Zimbabwe
#

not sure how

bleak fox
#

They are just index....

#

While exporting, pass index=False and you won't find them in your file

lapis sequoia
#

yep I put index=False and it worked

#

thanks

sick swan
#

@bleak fox Bhai how are you so sure when he didn't give column names? (I'm new so just asking)

bleak fox
#

@sick swan your question is about which questions reply... Tell me I'll explain

sick swan
#

Do i really need to be a math GOD in order to DS? or can I.... Know how things work and know how to apply things, would that be enough?

#

@sick swan your question is about which questions reply... Tell me I'll explain
@bleak fox the Zimbabwe guy

#

I mean I'm good at math, no probs, but not GOD tier

bleak fox
#

You can start with whatever you know in Math but keep on learning while practicing

#

@sick swan you can go ahead... DS is an art... And it requires a mindset more than math and programming

#

@sick swan who is Zinbabwe guy?

sick swan
#

I have done an IBM professional specialization, but it was clearly visible that they didnt dig deep into math

bleak fox
#

Man if DS was as simple as calling a sklearn api then everyone would have done that.... Correct?

sick swan
#

@bleak fox

I want to remove these numbers on the side

31975                                    Zimbabwe
31976                                    Zimbabwe
31977                                    Zimbabwe
this one
bleak fox
#

@sick swan he shared it first but deleted... From where i got that these numbers are index

sick swan
#

@bleak fox xD exactly, that's why I was trying to do a lotta math

lapis sequoia
#

you do like this

print(df.to_string(index=False))

@sick swan

bleak fox
#

I figured it out that's why I deleted
@lapis sequoia @sick swan check this

sick swan
#

rn i need to revise my python so HARD

#

Ohk

lapis sequoia
#

here's what code I have just so you can see in full

""" Modules """
import pandas as pd

# Read excel spreadsheet.
dataframe = pd.read_excel("../covid19data.xlsx", usecols="G")

pd.set_option("display.max_rows", None, "display.max_columns", None)


def countries():
    """ This prints all of the countries in the spreadsheet """
    countries = dataframe.to_string(index=False)
    print(countries)


countries()
sick swan
#

damn im glad i found this community

lapis sequoia
#

same

#

it is helpful

sick swan
#

more like, its so good to have like minded people, everywhere i look, people are just freaking clueless

#

in the sense they dont take their future seriously

#

or dont yet have the realisation

bleak fox
#

Yup thanks to creators

strange igloo
#

Hey so, I'm trying to get some data from a redis website. (It has a .php ending) and then cache that data and then put it into either google sheets or some kind of analysis tool. I'm really new to this so I'm flying blind.

I can implement this myself in python but I don't want to reinvent the wheel. What kind of stuff do you (more experienced people) all use for this kind of thing?

bleak fox
#

@strange igloo use selenium

sick swan
#

GOD

bleak fox
#

@sick swan GOD

sick swan
#

:3

strange igloo
#

Alright, I'll try and figure out selenium

sinful dock
#

@bleak fox
```md
def time_diff1(col1, col2):
    if col1 is np.datetime64('NaT'):
        pass
    else:
        diff = pd.Timedelta(col2- col1).minutes
        #trying to get minutes here but doesn't work
        return diff
```
Unsure if datetime can take any self and I believe arguments need to be inside because is it failing to vectorize
<ipython-input-48-23b332169aaf> in time_diff1(col1, col2)
5 pass
6 else:
----> 7 diff = pd.Timedelta(col2- col1).minutes
8 #trying to get minutes here but doesn't work
9 return diff

AttributeError: 'Timedelta' object has no attribute 'minutes'

bleak fox
#

Use seconds/60

#

Instead of minutes

#

().seconds/60

sinful dock
#

yay!!

#

Thanks, documentation on these is quite confusing
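One caveat worth sketching: `.seconds` on a Timedelta only holds the part of the difference that is under one day, so a gap like `12 days 12:01:00` would come out as 721 minutes. `total_seconds()` includes the days. The example below uses three of the `JOB TIME` values shown earlier and a vectorized `diff()` in place of `np.vectorize`; that swap is my own assumption about what is wanted, not the original approach.

```python
import pandas as pd

df1 = pd.DataFrame({"JOB TIME": pd.to_datetime([
    "2017-11-20 00:13:00",
    "2017-11-20 01:23:00",
    "2017-12-02 13:24:00",
])})

# Difference to the previous row, expressed in minutes (days included).
df1["delta_t"] = df1["JOB TIME"].diff().dt.total_seconds() / 60

# For comparison: .seconds drops whole days from the difference.
seconds_only = df1["JOB TIME"].diff().dt.seconds / 60

print(df1)
print(seconds_only)
```

The first row has no previous time, so its `delta_t` is NaN.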

bleak fox
#

@sinful dock welcome buddy

sinful wharf
#

hi

#

anyone here, have a quick stats question

bleak fox
#

I can try

#

@sinful wharf

sinful wharf
#

stats noob btw

bleak fox
#

I am noob * 100...😂😂😂

sinful wharf
#

so i have some survey data, likert scale (ordinal)

#

and i want to test for significance differences between some nominal groups

#

which test should i be using?

bleak fox
#

P-value test...

sinful wharf
#

yea they all spit out a p-value right?

#

or am i missing something

bleak fox
#

ANOVA

sinful wharf
#

kruskal wallis anova?

bleak fox
#

Yup

sinful wharf
#

prof said that was wrong

hardy apex
#

I want gpt-3 to write my jq filters or xpath predicates!!

#

holy **** that would be cool

coarse spire
#

Anyone have any ideas on what I can do with a bunch of twitch VOD chat logs?

marble jasper
#

be open to possible violation of GDPR?

#

aside from that, depends what you want to do. make a bot that responds in some way to VOD? automated moderation? some kind of game?

coarse spire
#

Hmm, not sure. Trying to glean some insight from the data. I saw someone said they were able to select timestamps based on chat to pick out highlights in the stream.

#

Right now, I'm just making a word cloud and picking out the most popular emotes. Originally, I wanted to do topic modelling and sort out chat comments and figure out how they are reacting to the stream, but kind of stuck on that since I'm not sure what to look for, really.

marble jasper
#

sure, sounds like you could do that

#

topic modelling, I'm not sure how you'd output that

coarse spire
#

Well, so I was able to do that. It generated its own topics but they don't really mean much lol

marble jasper
#

but you could make use of pre-trained models to help, you'd just get a large vector

coarse spire
#

Like one of the topics is one of the emotes "Pog" so it's mostly comments that just say "Pog"

marble jasper
#

I guess you could see when the topic changed, but to do anything more may require you to train something

coarse spire
#

when the topic changed?

#

Currently, my first run of it was using the same process as a post analyzing tweets. Twitch comments seem like a whole different beast since it's so much shorter and dependent on the comments and stream.

bitter harbor
#

I guess you could see when the topic changed, but to do anything more may require you to train something
the issue with this is there won't be a clear divide between topics, as well as having to deal with 'garbage' comments not related to the topic/general 'sum' of topics

#

Idk how you'd approach it

coarse spire
#

Going to see if I can find a correlation between chat and the stream by plotting chat "acceleration" minute by minute by counting the difference in comments. If I see a spike, there may be something there that the stream responded to.

#

Hmm, is there a way to get the data that's plotted by a histogram?

#

I have an array of timestamps and I want to group them by minute. The histogram plot does that for me but won't give me data back like

hist_data = {"bin_1": 200, "bin_2": 300}
#

Oh, got it. It was just bins = timestamps // 60

#

Then it's easy to group

tidal bough
#

oh, you just need the bins, not to bin the data before plotting.

#

oh, and actually

coarse spire
#

No, I still need to bin the data lol

#

Thank you 🙂

tidal bough
#

the docs for pyplot.hist say they return the values

coarse spire
#

Ooh, I'll try that then. Thanks
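A sketch of the per-minute binning and the "acceleration" idea with NumPy, assuming timestamps are in seconds from the start of the stream (the nine values below are made up). `np.histogram` returns the counts and bin edges without plotting; `pyplot.hist` likewise returns the counts and edges along with the drawn patches.

```python
import numpy as np

# Hypothetical comment timestamps, in seconds since the stream started.
timestamps = np.array([5, 30, 59, 61, 70, 185, 190, 200, 210])

# Minute index per comment, then comments per minute.
minute = timestamps // 60
counts = np.bincount(minute)

# Change in comment count from one minute to the next.
change = np.diff(counts)

# Same counts via histogram with one bin per minute.
hist_counts, edges = np.histogram(timestamps, bins=np.arange(0, 241, 60))

print(counts, change)
```

A large positive value in `change` marks a minute where chat suddenly picked up.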

arctic cliff
#

Dumb question, What's the best way to practice Pandas ?

coarse spire
#

Now I have this image of the difference in the number of comments per minute. Gotta use some math to figure out what the cut off point should be between signal and noise.

arctic cliff
#

Sorry for interrupting :/

coarse spire
#

No worries.

#

I would look at some videos of pandas tutorials at first. They guide you through the basics

#

Get some data you want to play around with and try to follow along

arctic cliff
#

I'm following a book, But It doesn't have much challenges so

#

Ah-

#

Thanks ! :D

#

Btw

#

Does kaggle have challenges on datasets?

coarse spire
#

Yep tons of them.

#

I would look into the datacamp series of videos. They cover a lot of pandas stuff

arctic cliff
#

Thanks a lot

coarse spire
arctic cliff
#

If I know all of these foundations, Am I ready to lean on myself and try to play with some datasets?

coarse spire
#

Yeah

arctic cliff
#

Because I do..

#

Great!

#

Quick question, What I will be doing now is called Data analysis ?

coarse spire
#

The most important thing to know is not to use like a for loop to go through the whole dataset. Numpy/Pandas have their own faster way of doing calculations on large sets of data

bitter harbor
#

^^ I find doing random stuff that'll never be useful/never be used is super useful for learning not only fundamentals but some extra stuff as well

coarse spire
#

Yep, I agree.

bitter harbor
#

Quick question, What I will be doing now is called Data analysis ?
yes

#

data analysis is more of a generalization tho

arctic cliff
#

The most important thing to know is not to use like a for loop to go through the whole dataset. Numpy/Pandas have their own faster way of doing calculations on large sets of data
@coarse spire
I might not use them again in my entire life lol

#

^^ I find doing random stuff that'll never be useful/never be used is super useful for learning not only fundamentals but some extra stuff as well
@bitter harbor Can you give me an example of this ?

bitter harbor
#

I've found numpy's useful in a lot of things

#

well for example matrix manipulation can be used for data science, but it's also super useful for game dev

arctic cliff
#

How's it useful for Game Dev?

coarse spire
#

A common thing is filtering out data. On the project I'm working on, I had to remove a bunch of comments made by bots so I would do df[~df.comment.str.contains("some bot message")] Super common if you are just looking at the data.

bitter harbor
#

take tetris as an example

#

you could store the position of each piece as a separate variable (plz never do this, I've seen it and it's awful)

#

or you could use matrices

arctic cliff
#

A common thing is filtering out data. On the project I'm working on, I had to remove a bunch of comments made by bots so I would do df[~df.comment.str.contains("some bot message")] Super common if you are just looking at the data.
@coarse spire
I didn't think ~ will be that useful xD

bitter harbor
#

like if avg in row == 1, remove row and bring all values above it, down one
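A tiny sketch of that row-clearing rule on a 0/1 board, assuming a made-up 4 by 4 grid and a hypothetical helper name `clear_full_rows`. On a board of zeros and ones, "average of the row equals 1" is the same as "every cell in the row is filled".

```python
import numpy as np

# Hypothetical board: 1 = filled cell, 0 = empty cell; row 0 is the top.
board = np.array([
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
])

def clear_full_rows(board):
    """Remove full rows and shift everything above them down."""
    full = board.all(axis=1)                 # True where the row is complete
    kept = board[~full]                      # rows that survive
    empty = np.zeros((full.sum(), board.shape[1]), dtype=board.dtype)
    return np.vstack([empty, kept])          # new empty rows appear on top

print(clear_full_rows(board))
```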

coarse spire
#

lol yeah, for pandas, it's used to negate whole series

#

instead of not
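A small sketch of that filter, with a made-up `comment` column. `str.contains` gives a boolean Series, and `~` flips it element by element, whereas Python's `not` would try to reduce the whole Series to a single True/False and raise an error.

```python
import pandas as pd

# Hypothetical chat log.
df = pd.DataFrame({"comment": [
    "Pog",
    "some bot message: follow me",
    "nice play",
    "some bot message",
]})

# True for rows that look like bot output (plain substring match).
is_bot = df.comment.str.contains("some bot message", regex=False)

# ~ negates the mask row by row, keeping only the human comments.
humans = df[~is_bot]
print(humans)
```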

arctic cliff
#

you could store the position of each piece as a separate variable (plz never do this, I've seen it and it's awful)
@bitter harbor
just imagine this

#

Yeah this makes a whole sense to me now ..

bitter harbor
#

I've seen it and it's awful

coarse spire
#

Ah, that's cool. You can even make an ascii version of tetris and basically just print the matrix out.

arctic cliff
#

When I first tried to play around with ~
I used it on a number and yeah.. My head started to hurt

bitter harbor
#

like you could use lists, but lists are essentially 1d matrices

arctic cliff
#

I have another important question-

#

Do companies ask you to tell them specific things about their data?
Or it's up to you to warn them about specific things and give tips to improve their companies

bitter harbor
#

depends on your position *and values ig

arctic cliff
#

DS

coarse spire
#

Yeah, there are business/data analysts who create reports about the data.

arctic cliff
#

About the specific data they need from the dataset ?

coarse spire
#

Some places will have you doing predictive analytics, some will want just reporting.

bitter harbor
#

it depends on what they want you to do with the data

#

which they'll tell you

coarse spire
#

Yeah, really depends on who's asking you to do it.

bitter harbor
#

you'll never have to do a bunch of random calculations on a dataset

#

(unless they tell you to do it)

arctic cliff
#

So they might tell me to do random calculations and come out with random predictions ?

coarse spire
#

They could tell you to do anything 🙂

#

Probably won't

arctic cliff
#

This is really confusing

#

Like I'm giving you my money, Buy me whatever you want -

coarse spire
#

So, yeah, at my company, the person they're "giving money" to would be the Machine Learning teams.

arctic cliff
#

Well I'm complaining because I won't know what to predict from a current dataset I have if I don't search the calculations 😂

coarse spire
#

Who usually come up with models to forecast some outcomes.

arctic cliff
#

Who usually come up with models to forecast some outcomes.
@coarse spire Oh..

#

Wait

#

This is how ML works ?

coarse spire
#

A part of it, yeah.

#

A very common part of it.

arctic cliff
#

Correct me if I'm wrong,
It can predict random info that I don't know about as a DS from a dataset

coarse spire
#

Well, when you say random info, I'm not sure what you mean.

arctic cliff
#

Let's say I have a dataset about heights and width of squares

coarse spire
#

A common example is figuring out house prices based on their square footage.

arctic cliff
#

..

#

Holy

#

Moly

#

I'm speechless

bitter harbor
#

that's actually a common example in describing the difference between coords and vectors

arctic cliff
#

What's the difference?

#

If it's related to statistics, I can watch an explanation rn

coarse spire
#

The housing price example? Makes sense since you don't just want to start at 0,0.

arctic cliff
#

Btw

#

What's the difference between sklearn and tensorflow ?

coarse spire
arctic cliff
#

Because I don't have any idea..

bitter harbor
#

Similarly, while vectors generally imply some kind of linear space and the use of linear operations (i.e. linear algebra) they can be used to describe other data which may not have a linear basis."
coarse spire
#

Ah that makes sense then.

bitter harbor
#

also a coordinate is always a vector but a vector is not always a coordinate.

arctic cliff
#

This is really helpful ! @coarse spire

coarse spire
#

You're trying to find a vector to predict other housing prices

bitter harbor
#

^

coarse spire
#

Which is what linear regression is.

arctic cliff
#

I'm confused ..

coarse spire
#

Yeah, so sklearn has a bunch of machine learning algorithms and tensorflow has a lot of deep learning algorithms.

#

I think that's....about what you need to know unless you need to use one of them, then you would go deeper. 🙂

arctic cliff
#

I think that's....about what you need to know unless you need to use one of them, then you would go deeper. 🙂
@coarse spire I might start sklearn after getting familiar with matplotlib, Do you think it's a good idea ?

coarse spire
#

Yes

#

Well, it depends on your task, but if you just want to learn about the datascience ecosystem then machine learning with sklearn is great.

bitter harbor
#

(a1,a2) would be the coords, [a1] [a2] would be the vector

arctic cliff
#

OH

#

I got this !

bitter harbor
#

like that

#

I know it'd be a lot more work but i'd suggest learning about basic linear algebra, stats, and neural nets before looking at ml libraries

arctic cliff
#

Can't I learn them along the way ?

coarse spire
#

I think Bepples is saying that you should learn it that way 🙂

arctic cliff
#

Like if I don't understand a specific point, I go and search it

#

Ah-

coarse spire
#

So, the thing about that is, without a solid foundation, you won't even know what you don't know.

#

You might make it harder on yourself when there is a simpler solution.

bitter harbor
#

like you can for sure just get a dataset and put it into a template, but it's easier to tell what's going wrong if you actually know what's happening
especially if the error isn't a programming error

coarse spire
#

At a certain point, you'll have to do that anyway "I don't know this exact thing so I'll search for it as I go."

#

When I first got interested in data science, I took an online stats course.

#

Then I would try programming simple models and trying to pick apart others.

#

Picking apart Random Forest was fun.

arctic cliff
#

This makes sense to me ..

bitter harbor
#

thank god 😆

coarse spire
#

You could also just try to throw Neural Networks at everything and see what sticks 🙂

arctic cliff
#

😂 I think that will be horrible

bitter harbor
#

^

arctic cliff
#

2 days ago, Someone was asking a good question

coarse spire
#

That's the fun part.

arctic cliff
#

About working as a DS in a great company like Google

#

What libraries he should learn, etc.

bitter harbor
#

"great"

arctic cliff
#

I don't know it seems to be

bitter harbor
#

it's a ds company because it's a data farm

arctic cliff
#

Someone said you should know how the libraries work like what's going on behind them then think about working for them

coarse spire
#

Eh...

arctic cliff
#

I'm glad I didn't ask that question

bitter harbor
#

how the libraries work like what's going on behind them
that's a good idea imo

coarse spire
#

That's basically what the fundamentals are though. I think they want to know that you know what you're doing with the most common ML algos at least.

#

I don't know if they care too much about the underlying implementation though.

bitter harbor
#

well ya they aren't going to ask you how nn's work, but understanding how allows you to do more and do it correctly

arctic cliff
#

Oh

coarse spire
#

Well, I guess I'm saying that there is a difference between knowing how an NN works and how Tensorflow 2.0 implements an NN layer.

#

Hmm, actually, tensorflow is from google so maybe they do care :p

bitter harbor
#

ya the implementation is different because of the language
the concepts of a nn are mostly linear algebra/stats but the implementation is cs

arctic cliff
#

Can't wait to post my complicated questions related to DS here >:D

#

One day

coarse spire
#

Can't wait!

warm wedge
#

I just released libra: a machine learning API that lets you build and train models in one line of code. Check it out here please: https://github.com/Palashio/libra. Would really appreciate any feedback / questions y'all have 🙂

bitter harbor
desert parcel
#

There seems to be an error and I don't understand

#

so does it mean I can only do .backward() on scalars only?

flat quest
#

Lol that might get confused with Libra blockchain @warm wedge

desert parcel
#

I need a .backward() in order to differentiate something hmm

#

but I can't .backward() something that isnt a scalar

#

So what do I do

flat quest
#

You should learn the fundamentals. Not because you don't have to do searching along the way, but because a baseline knowledge is critical if you want to understand why you got a particular result

#

@arctic cliff

bitter harbor
#

I think numpy has a function for that

desert parcel
#

So what can you do in numpy

#

wait let me rephrase

#

So I should convert a tensor into a numpy array and do a .backward() on that?

#

I'll try it

bitter harbor
#

uhh no

#

I'll try to find what it's called

#

does nf.reverse() not work?

desert parcel
#

The error gave me a suggestion

#

var.detach().numpy()

#

But that sort of fixed a prolem

flat quest
#

C is not a scalar output

#

It's a vector
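A minimal PyTorch sketch of the usual way around this, assuming made-up tensors `a` and `c` (the real code isn't shown here): `.backward()` with no arguments only works on a scalar, so either reduce the vector to a scalar first, or pass a `gradient` tensor of the same shape as the output.

```python
import torch

# Hypothetical input; requires_grad=True so gradients are tracked.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# c is a vector, so c.backward() on its own would raise a RuntimeError.
c = a ** 2

# Option 1: reduce to a scalar, then differentiate.
c.sum().backward()
grad_from_sum = a.grad.clone()

# Option 2: pass an explicit gradient with the same shape as c.
a.grad.zero_()
c2 = a ** 2
c2.backward(gradient=torch.ones_like(c2))

print(grad_from_sum, a.grad)
```

Both routes give d(a^2)/da = 2a for each element; converting to NumPy with `.detach().numpy()` silences the error but also cuts the tensor off from autograd, so no gradient comes out of it.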

desert parcel
#

problem i'm trying to recreate the error that gave me the recommendation