#data-science-and-ml


grave sparrow
#

Just fillna in advance?

desert oar
#

that's one good way to do it, if you have a string you want to use to represent missing data

#
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'v1': ['a', 'b', np.nan],
    'v2': ['x', np.nan, 'z'],
})

not_null = df[['v1', 'v2']].notnull().all(axis=1)
df.loc[not_null, 'v3'] = df.loc[not_null, 'v1'] + ' -> ' + df.loc[not_null, 'v2']

you could do it this way too

#

i generally recommend not ever using .astype unless you really need to

#

usually i want more control than that

grave sparrow
#

Ahhhh that's perfect

#

So I could either fillna in the columns as needed before I concatenate, or do that, then afterwards fillna in v3 with 'Null values detected, unable to generate directive' or something
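A minimal sketch of the two options described above, reusing the `df` from desert oar's snippet. The placeholder string `'missing'` is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'v1': ['a', 'b', np.nan],
    'v2': ['x', np.nan, 'z'],
})

# Option 1: fill the inputs first, then concatenate.
# 'missing' is a hypothetical placeholder string.
filled = df.fillna('missing')
option1 = filled['v1'] + ' -> ' + filled['v2']

# Option 2: concatenate only the complete rows, then fill the gaps in v3.
message = 'Null values detected, unable to generate directive'
out = df.copy()
not_null = out[['v1', 'v2']].notnull().all(axis=1)
out.loc[not_null, 'v3'] = out.loc[not_null, 'v1'] + ' -> ' + out.loc[not_null, 'v2']
out['v3'] = out['v3'].fillna(message)
```

Option 1 leaves the placeholder embedded in the result, while option 2 keeps incomplete rows clearly marked as unusable.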

#

Thanks for the advice

desert oar
#

yeah, i've seen way too many f'ed up datasets with nan where it doesn't belong

#

can't allow it 😛

#

also don't use csv to save this data once you've generated it... use parquet or something else intelligent

ember sapphire
#

i can't figure out why this doesn't converge to a local minimum, the cost function is oscillating

#
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image

rng = np.random.default_rng()

img = image.imread('fruits_small.jpg') / 256.0
h, w = img.shape[:2]

copy = np.array(img)

plt.subplot(3, 3, 1)
plt.title('Original')
plt.imshow(img)

for plot, k in enumerate([4, 8, 16, 32, 64]):
    centroids = rng.choice(copy.reshape((-1, 3)), size=k, replace=False)
    clusters = np.empty((h, w))

    print(centroids)

    while True:
        for y, x in np.ndindex(img.shape[:2]):
            v = copy[y, x]
            clusters[y, x] = np.argmin(np.linalg.norm(centroids - v, axis=1))

        cost = 0
        for i in range(k):
            cost += np.linalg.norm(copy[clusters == i] - centroids[i], axis=1).sum()

        print(f'cost = {cost}')

        d = 0
        for i in range(k):
            new_centroid = copy[clusters == i].mean(axis=0)
            d += np.linalg.norm(centroids[i] - new_centroid)
            centroids[i] = new_centroid

        if d == 0:
            break

        print(centroids)

    for i in range(k):
        img[clusters == i] = centroids[i]

    plt.subplot(3, 3, plot + 2)  # slot 1 already holds the original image
    plt.title(f'k = {k}')
    plt.imshow(img)

plt.show()
#

this version isn't the 5d one, it clusters based solely on color

desert oar
#

@ember sapphire this is k-means? i think usually k-means stops after a fixed number of iterations anyway

ember sapphire
#

the cost function should be decreasing

#

mine isn't

#

i can't figure out why though

near oasis
#

it's an NLP project on vaccine information.

#

if you're interested DM me

thorn bobcat
#

yo

#

what you guys think about grabcut?

cedar sun
#

does keras have a fast way to measure top 5 accuracy?
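For reference, `tf.keras.metrics` ships `TopKCategoricalAccuracy` (one-hot labels) and `SparseTopKCategoricalAccuracy` (integer labels), both with `k=5` by default, and either can be passed in the `metrics` list of `model.compile`. The NumPy sketch below only illustrates what such a metric computes. The `probs` and `labels` arrays are made-up data.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]       # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)   # is the true label among them?
    return float(hits.mean())

# Hypothetical predictions for 2 samples over 7 classes.
probs = np.array([
    [0.05, 0.10, 0.15, 0.20, 0.30, 0.02, 0.18],
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],
])
labels = np.array([5, 0])
score = top_k_accuracy(probs, labels, k=5)
```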

oblique palm
#

Hello everyone, I'm working on an IT project proposal based on ripeness detection through machine learning. My project consists of a camera observing a basket of bananas (for example). I'd like to know if there are effective ways to recognize if a banana looks rotten based on what the camera sees from a bunch of bananas in a basket?

My main doubt is how hard this is, and whether there are machine learning tools that make it easier to identify one bad banana among the group in the basket. From what I've seen, most related projects do it at a small scale, like putting 1 banana in front of a camera. Any suggestions?

desert oar
#

i suspect that such a model will mostly be detecting brown vs yellow vs green

#

training data is probably the hardest part here

#

how many photos of bananas, in how many different configurations, and different kitchens, and different lighting conditions do you have to collect and label before you have enough data? potentially a lot

#

maybe you can use an off-the-shelf object detection algorithm to find the bananas in the image first, so your model doesn't have to be so powerful

oblique palm
#

would be inside a supermarket so honestly we are considering one type of configuration haha

desert oar
#

so you want to have a camera that takes a photo of the banana display every day, and sends an alarm when they look overripe?

oblique palm
#

More like live

#

or maybe pictures every hour

#

and yeah, send an alarm when they look overripe

desert oar
#

it might be a fun hobby project, but can't an employee go check the bananas? i know they have a lot to do in a grocery store already, but it's a pretty quick task with a quick yes/no answer. of course the downside is that employees might let the bananas go bad because they're lazy and don't want to throw them out if they say "yes"

#

you still will want to account for day/night lighting conditions

#

what if someone in a brown shirt walks in front of the camera?

oblique palm
#

the problem we are trying to solve here is that in my country, bananas and other fruits dont look very good when they are ripe, but they are still good for feeding, so they throw these to the trash because they are not in their "standards" to be sold

#

and these could be donated or sold cheaper to institutions that help poor people

wet folio
#

I'm using PyTorch XLA to port a training script from using a CPU/GPU to a TPU. Is there a way to use a ParallelLoader on more than one variable?

livid venture
#

Hey! I'm not sure if this is the right channel but it's my first post so please be lenient with me this time 😄
I have a problem with an unusual task… I have a dataframe like this:

    id     name
0     962966     A
1     402171     A
2     478034     B
3     936505     B
4     516152     C
5     379497     C
6     977649     D
7     869046     D

Now what I have to do is split this dataframe by name into several multi-sheet Excel files... So for example in my case I would like to have 4 files (named: A, B, C, D), each with 2 sheets named by id inside (for example A: 962966 and 402171)

This is my code (random_df is only to fill up sheets with some data):

ExcelWriter = 0

for index, row in df.iterrows():
    random_df = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns = list("1234"))
    
    if ExcelWriter != pd.ExcelWriter(row["name"] + ".xlsx"):
        
        if ExcelWriter != 0:
            ExcelWriter.save()
        
        ExcelWriter = pd.ExcelWriter(row["name"] + ".xlsx")

        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))
        
    else:
        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))

ExcelWriter.save()

The result I get is almost fine because this code generates 4 Excel files, but each has only one sheet, named by the last id for that name... it looks like the data is being overwritten by the next ones but I have no idea how to fix this because I'm a complete beginner in pandas 😄 Do you have any ideas?

limpid oxide
serene scaffold
#

@limpid oxide you can ask for data science help in these channels

#

@livid venture you can do a groupby and then iterate over that. That will be faster than trying to fumble around by row.
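A rough sketch of the groupby approach, assuming the `df` from the question. The helper name `write_workbooks` and the `out_dir` parameter are made up for this example. Opening one `pd.ExcelWriter` per name and writing every sheet before it closes avoids the overwriting described in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [962966, 402171, 478034, 936505, 516152, 379497, 977649, 869046],
    "name": ["A", "A", "B", "B", "C", "C", "D", "D"],
})

def write_workbooks(df, out_dir="."):
    """Write one workbook per name, with one sheet per id (hypothetical helper)."""
    for name, group in df.groupby("name"):
        # The context manager saves and closes the file once all sheets are written.
        with pd.ExcelWriter(f"{out_dir}/{name}.xlsx") as writer:
            for sheet_id in group["id"]:
                random_df = pd.DataFrame(
                    np.random.randint(0, 100, (5, 4)), columns=list("1234")
                )
                random_df.to_excel(writer, sheet_name=str(sheet_id))

# Preview which sheets would land in which file, without touching the disk.
layout = {name: group["id"].astype(str).tolist() for name, group in df.groupby("name")}
```

Calling `write_workbooks(df)` needs an Excel engine such as openpyxl installed.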

thorn bobcat
#

can computers create pixar movies

serene scaffold
#

@thorn bobcat not autonomously

thorn bobcat
#

style transfer looks like a filter to me tbh

#

it feels like AI is a long way from where I want it tbh.

serene scaffold
thorn bobcat
#

I want it to be able to convert real life videos into pixar movie scenes

#

want to bring the power of animation studios to ordinary users.

#

I might have to apply this to individual frames containing just the foreground.

#

I'll also have to extract the background so I can control the setting.

#

something like this but for video

languid chasm
#

You're model has >93% accuracy, >93% sensitivity, >93% specificity AND >93% precision?

#

Your*

charred umbra
#

yes, I had very few false positives and false negatives

cedar sun
#

Can u always overfit a model?

charred umbra
visual violet
#

distance measurement methods for time-series clustering algo calculate the physical distance in space between two objects right?

serene scaffold
visual violet
#

yes

#

i am trying to make sure the way i am describing things is correct

serene scaffold
visual violet
#

tslearn

#

metric = dtw

#

algo = k-means

#

i am in the process of writing the actual paper

serene scaffold
#

dtw?

visual violet
#

dynamic time warp

#

works similarly to euclidean

#

but more lenient with time-series data

visual violet
#

*at least that is what i got from my research

serene scaffold
#

"DTW is computed as the Euclidean distance between aligned time series"

visual violet
#

would be nice if they define what Xi and Y j are

serene scaffold
#

the ith element of X and the jth element of Y

#

I think. let me check

pulsar hull
#

i want to try out machine learning and as a final goal I want to try to make a chatbot. I tried following a tutorial for TensorFlow but i didn't really understand it and was mostly just copying code from the tutorial, anyone know a good place to start and/or a good TensorFlow tutorial for something simple?

visual violet
serene scaffold
#

Yes, it's the ith element of X and the jth element of Y, where each (i, j) tuple comes from the set of tuples that most closely align with each other

visual violet
#

sheesh

#

it is just euclidean no more no less

serene scaffold
visual violet
#

it works the same way

serene scaffold
#

yes

visual violet
#

that is why the zigma is there

#

damn big brain

serene scaffold
#

the sigma? yes.

visual violet
#

suppose i have two objects/time-series

#

1: 1 3 4 7
2: 2 4 8 9

#

X is 1 and Y is 2

#

i am pretty sure

#

the distance would be sqrt ( |2-1|^2 + |4-3|^2 + |8-4|^2 + |9-7|^2 )
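The arithmetic above can be checked quickly with NumPy. This covers only the plain Euclidean distance between the two example series, not DTW.

```python
import numpy as np

x = np.array([1, 3, 4, 7])
y = np.array([2, 4, 8, 9])

# sqrt(1 + 1 + 16 + 4) = sqrt(22)
distance = float(np.sqrt(((y - x) ** 2).sum()))
```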

serene scaffold
serene scaffold
visual violet
#

what i am wondering is why make up a fancy name

#

and use the same century-old method

serene scaffold
visual violet
#

the name

#

DTW

#

"time warping"

#

may as well name it time bending

serene scaffold
#

idk what that is.

visual violet
#

the problem is

#

i don't know how to test the metric

#

like there is 0 example code

serene scaffold
#

it's on the website

visual violet
#

oh lord

#

i am dumb

visual violet
#

for x, y in arr1, arr2:

#

is that legal python syntax?
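For what it's worth, `for x, y in arr1, arr2:` does parse, but it loops over the two-element tuple `(arr1, arr2)` and tries to unpack each whole array into `x, y`, so it raises `ValueError` unless each array has exactly two elements. Pairing elements position by position is usually done with `zip`. A small sketch with made-up lists:

```python
arr1 = [1, 3, 4, 7]
arr2 = [2, 4, 8, 9]

pairs = []
for x, y in zip(arr1, arr2):
    # x comes from arr1 and y from arr2, at the same position
    pairs.append((x, y))
```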

serene scaffold
#

!e

import numpy as np

a = np.random.random((5,))
b = np.random.random((5,))
print(a, b)

print(np.sqrt((np.abs(a - b) ** 2).sum()))
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

001 | [0.98432056 0.62829273 0.62277413 0.44841935 0.8191274 ] [0.33321332 0.25763163 0.75565997 0.36456538 0.89677632]
002 | 0.76944770610025
visual violet
#

life is not always simple huh

serene scaffold
#

I don't read screenshots of code, but I'll give you some more code to illustrate the earlier point

#

!e

import numpy as np

a = np.random.random((5,))
b = np.random.random((5,))

print(a - b)
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

[-0.55704989 -0.41628915 -0.04400796 -0.552509    0.38075351]
visual violet
#

that is very neat

#

when the documentation tells you what it is supposed to do

#

but it doesn't function like the doc when you test it

serene scaffold
#

I mean, the docs didn't say that dtw is exactly like euclidean distance

#

"DTW is computed as the Euclidean distance between aligned time series, i.e., if π is the optimal alignment path:"

#

idk what an optimal alignment path is.

#

my earlier guess was wrong, apparently

#

(and not very well stated)

visual violet
#

ya i am taking a look

#

trying to read the source code

#

makes absolutely 0 sense

desert oar
#

I believe it's the alignment with the lowest distance such that the order of points is preserved

#

Don't quote me on that

visual violet
#

yup

#

i mean i don't know

#

but

#

when the array elements are identical

#

like [1,1,1,1] and [2,2,2,2]

#

dtw would give the same result as euclidean

#

but when things start to be weird, they diverge

dire frost
#

where should i start with ai?
i watched this video but i cant find any python tutorials for ai
https://youtu.be/JMUxmLyrhSk
lapis sequoia
#

what should i learn to go for ML?

#

field

slow comet
#

Hi! I'm looking for a data cleaning library. I found HoloClean but it contains some TODOs in the code and the GitHub repo hasn't been updated since April 2019..

slow comet
#

the course is free unless you want to have the certificate

viral scroll
#

Is this the correct channel to ask Pandas related queries...??

uncut barn
#

Are there any good resources to understand context free grammars and logical grammars?

random mist
#

for quicker help start a help channel

viral scroll
#

I have a pandas dataframe like below:
   company name  role level  role keywords
0  Company1      Director    director
1  Company2      Developer   developer
2  *             Developer   Engineer

and a list of companies like ['Google', 'Apple', 'Facebook']

I want to write an operation in pandas to find the rows where company name has "*" and replace it with individual companies

the output should be

   company name  role level  role keywords
0  Company1      Director    director
1  Company2      Developer   developer
2  Google        Developer   Engineer
3  Apple         Developer   Engineer
4  Facebook      Developer   Engineer

#

Thanks in advance

novel elbow
#

so for each row in the original dataframe with the "*" field you will replace it with 3 rows (with google, apple, fb) ?

viral scroll
#

Yes, the company list is dynamic it can be n rows. I have added 3 companies for example

novel elbow
#

ok, first I will divide the data into the rows that have companies and the rows with "*", then with a function create copies replacing "*" with the names in your list, and finally concatenate all the dataframes

#

should look like this

#
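import pandas as pd

# df = ...  # your dataframe

# split the rows into those with a real company name and those with the "*" wildcard
mask = df['company name'] == "*"
df_withname, df_noname = df.loc[~mask], df.loc[mask]

names = ['Google', 'Apple', 'Facebook']

def add_name(df, name):
    # return a copy of the wildcard rows with "*" replaced by one company name
    df = df.copy()
    df['company name'] = name
    return df

# one copy of the wildcard rows per company, stacked under the named rows
df_final = pd.concat([df_withname] + [add_name(df_noname, o) for o in names], ignore_index=True)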
viral scroll
#

Thanks a lot

#

Let me try this

serene scaffold
#

@novel elbow I think there's a simpler solution.

#

@viral scroll how do you know what company you're replacing a given row with?

#

Eh maybe not. Though I'm confused as to how one would arrive at this particular problem

viral scroll
#

so in my code i need to replace * with multiple rows for individual company so that I can do an inner join

serene scaffold
#

You could replace the asterisk with a python list and then do explode

#

I'm on mobile so I can explain more in a bit

viral scroll
#

Sure, I will wait for your response

#

Thanks

normal grove
serene scaffold
# viral scroll Sure, I will wait for your response
In [40]: df.loc[(df['company'] == '*'), 'company'] = [['Google', 'Apple', 'Facebook']]

In [41]: df
Out[41]: 
                     company       role      level
0                   Company1   Director   director
1                   Company2  Developer  developer
2  [Google, Apple, Facebook]  Developer   Engineer

In [42]: df.explode('company')
Out[42]: 
    company       role      level
0  Company1   Director   director
1  Company2  Developer  developer
2    Google  Developer   Engineer
2     Apple  Developer   Engineer
2  Facebook  Developer   Engineer
#

You have to use a nested list for the first step though
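A variation on the same idea, sketched under the assumption that the column is named `company` as in the session above. Instead of assigning a nested list through `.loc`, it maps each `"*"` to the full list with `Series.map`, which also works when several rows carry the wildcard. The `ignore_index` argument of `explode` needs pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Company1', 'Company2', '*'],
    'role': ['Director', 'Developer', 'Developer'],
    'level': ['director', 'developer', 'Engineer'],
})
companies = ['Google', 'Apple', 'Facebook']

# Swap every "*" for the full list; other values stay scalars.
expanded = df.copy()
expanded['company'] = expanded['company'].map(lambda c: companies if c == '*' else c)

# explode turns each list element into its own row and leaves scalars alone.
result = expanded.explode('company', ignore_index=True)
```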

viral scroll
#

Sure, this will work

#

Thanks

visual violet
#

good morning stelercus

#

do you think hierarchical and k-means will produce different results?

hard hound
#

you didn't ask me but still I will answer - probably

ember sapphire
#

can anybody help me figure out why my kmeans implementation is not converging?

visual violet
#

holy cow

visual violet
#

i am doing the exact same thing

#

it is not working

ember sapphire
#

i just print the centroid locations and cost function after each iteration

#

it just jumps around

visual violet
#

can you share the code?

inland zephyr
#

good evening guys i need some suggestion about my small project

#

i have a small experiment to test whether a CNN works with a small amount of data. I have 38 samples from 2 classes, sick and healthy (coded as 0 and 1), with 20 from class 0 and 18 from class 1. Each sample is a long ECG record (about 1 day), where class 0 has an anchor point marking where the episode of sickness happens, but class 1 is healthy so it has no marking point

#

i want to check for the n minutes before anchor point of class 0 can be detected as sick and not 1, so from 20 i have 10 train, 2 valid and 8 test for n minutes. I also take arbitrary data from 18 healthy record, so i have 9 train, 2 valid and 7 test. Since CNN will feed small data for training, so I try with simple network

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d (Conv1D)              (None, 15352, 128)        1280      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 5117, 128)         0         
_________________________________________________________________
dropout (Dropout)            (None, 5117, 128)         0         
_________________________________________________________________
flatten (Flatten)            (None, 654976)            0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1309954   
=================================================================
Total params: 1,311,234
Trainable params: 1,311,234
Non-trainable params: 0
_________________________________________________________________

When i run the training, the training accuracy are pretty fast to decrease but the validation result are very low

#

but when i run the test, the result is promising

#

I know this is the sign of overfit but with the accuracy result of the test set I still confused whether to add more epoch or rearrange the model.

#

the feature is 1D with 15360 feature

visual violet
#

jesus

inland zephyr
#

The result pretty funny but frightening at the same time

#

and still dunno what to do next

desert oar
#

look at some examples

visual violet
desert oar
#

yes

#

sorry i didn't mean examples of model code

#

i literally meant examples of data where the predictions are wrong

#

it can be very enlightening to see why, qualitatively, the model is failing

#

maybe you don't have enough data - i have no idea if CNNs can be trained from scratch on so few datapoints, i would assume that they can't

#

or maybe your train/test split is f'ed up

visual violet
#

What is arima?

desert oar
#

who asked you this?

visual violet
#

oh i asked people

desert oar
#

for what?

#

also it's silly that prophet isn't "interpretable" it's a goddamn linear regression, it's more interpretable than ARIMA

#

anyway ARIMA is "AutoRegressive + Integrated + Moving Average" - basically a family of models you can use to fit a single time series

#

there's kind of a lot to be said about time series modeling and ARIMA specifically

visual violet
#

The dataset was a matrix of the weighted AMP of the 724 ingredients from 2016 to 2020 - tables where rows represent drug ingredients, columns represent the time, such as the second quarter of 2017, and numbers in each cell characterize the price of the particular ingredient in the particular year.
When i do kmean clustering with dynamic time warping metric, it cluster the majority of the ingredients into one cluster, which i definitely do not want. so i asked the person and he told me that

desert oar
#

what's the point of doing arima then?

inland zephyr
visual violet
#

to somehow break the big cluster up?

desert oar
desert oar
desert oar
inland zephyr
desert oar
#

can you link the paper?

inland zephyr
#

since this is a health-related project, so a golden standard data is mandatory to use (although imho, the data are pretty old and very small)

visual violet
#

i am kinda lost at the moment

inland zephyr
#

this is the example, https://pubmed.ncbi.nlm.nih.gov/30117048/ but i only using they testing flow and the data

desert oar
# visual violet so what do you suggest me do?

i don't know because i don't know what this person is responding to. what exactly did you ask, that prompted this response? also what ultimately are you trying to do again? i know you were getting into clustering time series but i don't remember what the purpose of this was

inland zephyr
desert oar
#

ok, but it looks like the paper uses a purpose-built algorithm based on some specific signal processing stuff

#

i don't know that it makes sense to just replace that with a CNN

inland zephyr
visual violet
#

OHH i asked "how would you proceed to predict the future with the given dataset"

inland zephyr
#

that's my opinion based on what i have read about CNN

desert oar
inland zephyr
#

the idea can be said: what if we can feed this info without preprocess the data, which mean we can lost some important sign and let CNN infer the condition

desert oar
#

i would imagine that the CNN probably can't learn much from 20 data points

inland zephyr
#

since CNN or deep learning method are still pretty new for ECG-related research

desert oar
#

now, if you had 2000 datapoints of "unlabeled" data, but only 38 data points of "labeled" data, i'd suggest fitting an autoencoder on the 2000 unlabeled examples, then using transfer learning or something to fit a simpler model on the 38 labeled examples

#

maybe there are ways to train the autoencoder simultaneously with the small number of labeled data points - this was a thing several years ago called "semi-supervised learning", but it dates back to the SVM era and it was of questionable value back then

inland zephyr
#

i am cautious about that, since when i use a more classical method, like preprocessing the data into several features and using traditional machine learning such as a decision tree or random forest, the result is still higher than the CNN

visual violet
inland zephyr
visual violet
#

like i want to see if there is a trend in price in group of certain disease-targeted drug

desert oar
inland zephyr
#

so since the data are pretty small but complex dimension, i decide with the simpler one

#

no much layer and just dump the feature from the convo - maxpool - dropout into dense with 2 class

desert oar
#

so you have a model that needs to learn 1280 parameters from 20 data points, and that's not even including the dense final layer

#

that seems like a losing proposition to me

#

are you at least using regularization to train it?

#

128 features as in, each example is a single time series of 128 data points?

#

or wait, you have 128 "channels" in the conv1d?
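The figures in the model summary can be reproduced by hand, which shows where the 1280 comes from: 128 filters, each with a kernel of 9 weights plus one bias. The sketch below redoes that arithmetic using the layer settings from the posted summary (input length 15360, one input channel, pool size 3, two output classes).

```python
input_length, in_channels = 15360, 1
filters, kernel_size, pool_size, classes = 128, 9, 3, 2

# Conv1D: one kernel of kernel_size * in_channels weights per filter, plus a bias each.
conv_params = kernel_size * in_channels * filters + filters

# "valid" padding shortens the sequence by kernel_size - 1.
conv_length = input_length - kernel_size + 1

# Max pooling with a stride equal to the pool size drops the remainder.
pooled_length = conv_length // pool_size

# Flatten, then a dense layer with one weight per feature per class plus biases.
flat_features = pooled_length * filters
dense_params = flat_features * classes + classes

total_params = conv_params + dense_params
```

Almost all of the 1.3 million parameters sit in the final dense layer, not in the convolution.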

visual violet
#

interesting. i don't understand a single word you guys are saying

inland zephyr
#
import tensorflow as tf
from tensorflow.keras import Sequential, layers

def GetModel(shp):
    # shp is the shape of X; shp[1] is the number of samples per record (15360)
    model = Sequential()
    model.add(layers.Conv1D(filters=128,kernel_size=9,activation=tf.nn.leaky_relu,input_shape=[shp[1],1]))
    model.add(layers.MaxPool1D(pool_size=3))
    model.add(layers.Dropout(rate=0.7))
    model.add(layers.Flatten())
    model.add(layers.Dense(2,activation='softmax'))
    model.compile(loss = 'sparse_categorical_crossentropy',optimizer='Adam', metrics=['accuracy','mse'])
    return model
#

the shp is the X, [1] is 15360

desert oar
#

obviously you've already started, but still make lots and lots of plots

#

so your k-means results are bad, why? are they really bad? or are they sensible given the data and # of clusters?

#

did you try other numbers of clusters?

#

did you try k-medians? etc.

inland zephyr
#

plotting the data is the easiest thing to do, and the easiest to understand too

visual violet
#

yes sir. similarity/dissimilarity in price and price percentage difference -> two tasks

desert oar
#

did you try "soft" clustering like HDBSCAN? self-organizing maps? did you look into https://github.com/fpetitjean/DBA to see if there's any common overall shape?

inland zephyr
desert oar
#

you can also try fitting individual time series models to each series, then looking at the models as a dataset of its own. you can fit all kinds of forecast models like exponential smoothing, arima, etc. and then look at the distribution of model parameters for example

visual violet
#

it feel like it doesn't

desert oar
#

you're just looking at the cluster memberships

#

how do you know those make sense?

#

did you try using dimension reduction to plot these and color them by cluster membership?

visual violet
#

i didn't think it is necessary since the dataset only has 20 dimensions or 20 columns

desert oar
#

that doesn't make sense

inland zephyr
#

20 if not have importance are pretty useless to predict

visual violet
#

so i have to find more data?

desert oar
#

i'm not sure what you mean by that @inland zephyr

visual violet
desert oar
#

you mean that there are 20 time points in each time series @visual violet ?

visual violet
#

yes salt

#

they all have equal time length 2016 quarter one to 2020 quarter 4

desert oar
#

and are they all measured at the same time points?

inland zephyr
visual violet
#

yes

#

same time points

desert oar
#

ok, that makes things easier

visual violet
#

oh so the vocab is time point

desert oar
#

eh, there isn't a single term for it

visual violet
#

lol i have a tough time describing things

desert oar
#

that's the term i use

#

naming things is hard

inland zephyr
#

why dont plot a line bar to see the movement? since the data is time related right?

desert oar
#

@inland zephyr they have 600+ individual time series with 20 time points in each

arctic wedgeBOT
#

Hey @visual violet!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

visual violet
#

oh shit i can't share excel lol

inland zephyr
#

so it's 600*20 dimension right?

desert oar
#

you could think of it that way, yes

visual violet
#

the majority of them is at the bottom

desert oar
#

great, use a log scale for y axis

#

price is never exactly 0 right?

inland zephyr
#

thats what i said

visual violet
#

cluster 0 : dark blue has 665 members out of 724

inland zephyr
#

how bout cumulated it?

#

cumsum?

desert oar
#

im not sure what the point of cumsum on prices would be

visual violet
#

so there gotta be a price

inland zephyr
#

nvm, no need cumsum for this

#

the cluster has been seen easily, with majority cluster is the blue and the cyan one

#

but what is the red data? is it come from some data or it's come from one member?

desert oar
#

right. k-means looks like it's just clustering time series based on how high the average price is

visual violet
#

red has one member :(((

inland zephyr
#

bingo

visual violet
#

i don't want that since it doesn't mean much (i assume)

desert oar
#
  1. use a log scale y axis
  2. replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level
inland zephyr
#

it's the outlier maybe

desert oar
#

so my recommendation is to normalize each price such that its starting price is 1. then your distance metric can focus on the shapes of the time series, not just the price levels

#

basically a price index for each drug

#

so if the price is $100 in qtr 1, and $125 in qtr 2, normalize that to 1 and 1.25
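A small sketch of that price index, assuming the prices sit in a DataFrame with one row per drug and one column per quarter. The drug names and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical prices: rows are drugs, columns are quarters.
prices = pd.DataFrame(
    [[100.0, 125.0, 150.0],
     [2.0, 2.5, 3.0],
     [40.0, 30.0, 20.0]],
    index=['drug_a', 'drug_b', 'drug_c'],
    columns=['2016Q1', '2016Q2', '2016Q3'],
)

# Divide every row by its own first-quarter price, so each series starts at 1.
price_index = prices.div(prices.iloc[:, 0], axis=0)
```

Here `drug_a` and `drug_b` differ in price level by a factor of 50 but end up with identical index series, which is the point: only the shape is left for the distance metric to compare.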

inland zephyr
#

or maybe as @visual violet said no 0 price, the normalization can be done between min-max of the price

visual violet
desert oar
#

this way you can take the "price level" of the drug as separate from relative changes in price over time

#

@visual violet in general i recommend the following:

  • the specific feature engineering described above, perhaps using the starting price level (or average or median price level) as a separate feature
  • try several dimension reduction techniques to visualize the time series, don't worry about "modeling" yet. such techniques include PCA and UMAP. this might mean trying out a few distance metrics such as DTW
  • use https://github.com/fpetitjean/DBA to see if there is any overall "shape" to the drug prices, although i doubt it based on the plot i just saw
  • try "soft" clustering methods such as HDBSCAN
  • do you have any "metadata" about the drugs? company developing it, year development started, type of drug (antibiotic, etc.), specialized vs general market, etc. that could be interesting to examine as well.
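As one way to try the dimension reduction item, here is a sketch of PCA done directly with NumPy's SVD on log prices. The random `series` array merely stands in for the real 724 x 20 price table, and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.lognormal(size=(724, 20))   # stand-in for the real price table

# Work on log prices, then center each time point (column).
log_series = np.log(series)
centered = log_series - log_series.mean(axis=0)

# The rows of `components` are the principal directions, strongest first.
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# 2-D coordinates for every time series, ready for a scatter plot.
embedding = centered @ components[:2].T
explained = singular_values[:2] ** 2 / (singular_values ** 2).sum()
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` and coloring the points by cluster label shows whether the clusters separate on anything other than price level.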
#

VAR for 600+ time series is too big i think

#

however you could try using the correlation between time series as "distance"

visual violet
#

@desert oar do you think i should even bother with the percentage difference dataset? here is the graph i got

desert oar
#

specifically distance = 1 - abs(corr(ts1, ts2))
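A sketch of that correlation-based distance with NumPy, on four invented series. `np.corrcoef` treats each row as one variable, so a 2-D array of time series yields the full pairwise matrix in one call.

```python
import numpy as np

# Hypothetical time series, one per row.
series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same shape as row 0, different level
    [4.0, 3.0, 2.0, 1.0],   # mirror image of row 0
    [1.0, 3.0, 2.0, 4.0],
])

corr = np.corrcoef(series)      # pairwise Pearson correlations
distance = 1 - np.abs(corr)     # distance = 1 - abs(corr(ts1, ts2))
```

Because of the absolute value, a perfectly anti-correlated pair (rows 0 and 2) gets distance 0 just like a perfectly correlated one. A constant series has zero variance and produces NaN, so such rows would need handling first.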

desert oar
visual violet
#

i see 0 pattern

desert oar
#

try plotting this without the cluster coloring, but set alpha=0.2 or something so you can see them all overlaid

#

(and make the lines thinner)

#

generally you should probably do all your clustering on log prices anyway, if you're not using this price index thing

#

also - do you know what would cause these drug price fluctuations in the first place?

#

that could also guide your thinking about it

visual violet
#

i have been doing research

#

there are no direct causes

desert oar
visual violet
#

the drug economy does not follow the supply-and-demand model

#

sometimes existing drugs increase price for no reason

#

i have trouble understanding this statement "replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level" can you please explain?

#

oh i understand

#

take average for each row

#

so then i only have 724 values

desert oar
#

yep

#

you'll probably see that the clusters are just splitting the drugs somewhat arbitrarily into average price levels

#

re-do the clustering on log price scale, if nothing else

ember sapphire
#

im ready to kms over this kmeans implementation

#

ive read it like 500 times and cant see anything wrong but it's still wrong

visual violet
desert oar
#

why symlog?

#

log should be fine

#

also i mean literally re-run k-means but using log price

desert oar
#

just "reading" the code isn't always the best way to debug code

visual violet
#

so log every single data points (724 *20) and k cluster

desert oar
#

yes

#

because the price levels are so wildly different

#

look at the formula for euclidean distance

#

this is a good lesson in the importance of using intuition about fundamentals to solve problems

visual violet
#

k means with what distance measurement method? euclidean?

desert oar
#

whatever, euclidean is probably fine but you can try DTW too

visual violet
#

i have been using dynamic time warping without understanding exactly what it does

desert oar
#

when taking differences, and especially squared differences, the order-of-magnitude differences between time series will overwhelm anything else

#

so at least put them on log scale to try and reduce that effect
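A minimal sketch of the point about squared differences, using made-up prices (all three series are hypothetical). It compares how much a pure difference in price level dominates a difference in shape, before and after taking logs:

```python
import numpy as np

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length series."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical prices: two cheap drugs with different shapes, one expensive flat drug.
cheap_flat = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
cheap_doubling = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
pricey_flat = np.full(5, 1000.0)

# How many times larger is the "level" distance than the "shape" distance?
raw_ratio = euclidean(cheap_flat, pricey_flat) / euclidean(cheap_flat, cheap_doubling)
log_ratio = (
    euclidean(np.log(cheap_flat), np.log(pricey_flat))
    / euclidean(np.log(cheap_flat), np.log(cheap_doubling))
)
print(raw_ratio, log_ratio)
```

On the raw scale the level gap is over a hundred times the shape gap; on the log scale it is only a few times larger, so the shape of the series gets a chance to matter.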

visual violet
#

let me do that real quick

#

hoping you won't sleep soon

desert oar
#

i'm at work so i might have to leave for a while, but ping me so i don't lose track of it

#

DTW just re-aligns elements of the time series such that the distance between the two time series is minimized

visual violet
#

dtw does that for every pair of time series?

desert oar
#

(more strictly the sum of the absolute values of the differences)

#

yes

visual violet
#

let say i have
A: 1 7 16 50
B: 2 15 51 8

desert oar
#

with certain restrictions: the first and last elements must be matched to each other, at least one of the sequences must have every point matched, and the matching must be monotonically increasing, so if A10 matches B15, then A11 cannot match B14

#

basically, the lines in the above picture can't cross

#

the distance itself is the sum of the lengths of the dashed lines
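A rough sketch of the DTW description above, applied to the A and B series posted earlier. It assumes the textbook step pattern (each step advances A, B, or both) and absolute differences as the local cost; a library such as tslearn may use different defaults.

```python
def dtw_distance(a, b):
    """DTW distance with |x - y| as local cost and the standard step pattern."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # the three predecessors keep the matching monotone (no crossing lines)
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

A = [1, 7, 16, 50]
B = [2, 15, 51, 8]

pointwise = sum(abs(x - y) for x, y in zip(A, B))  # no warping at all
print(dtw_distance(A, B), pointwise)
```

For these two series the warped alignment costs 50 while the point-by-point sum is 86, so under this step pattern equal-length series can still be re-aligned.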

#

@ember sapphire as a matter of courtesy, if you could repost that using https://paste.pythondiscord.com/, it will be easier to follow the multiple conversations happening here. otherwise now there's a big wall of code between this message and the others.

ember sapphire
#

ah sure

desert oar
#

(you can edit your messages btw)

ember sapphire
ember sapphire
#

my understanding is that following Lloyd's algorithm, the cost should decrease monotonically until it converges to a local optimum

#

but my implementation seems to bounce around indefinitely

desert oar
#

first of all, write some damn functions

#

don't just put it all in one big script

#

are you trying to plot each step of lloyd's algorithm?

visual violet
#

salt, the algo won't work

#

i have tried some matching

desert oar
#

what is assign, the cluster assignment?

ember sapphire
#

currently the plotting is just there for debugging purposes so i can see the behavior, but the final version should compute clusterings for k=5 to k=12 and then plot them all at the end

#

yeah assign[y, x] is the index of the cluster that pixel [y, x] is assigned to

desert oar
#

what is copy?

#

just the image copied?

ember sapphire
#

just a copy of the image

#

yeah because the original is mutated

#

every step

#

also im not sure why, but when i ran it just now, it actually converged for k=5 for once and moved on to k=6... i didn't change anything though

desert oar
#

unrelated to the algorithm itself, my suggestions for your code itself:

  1. use functions
  2. use better / more-descriptive names
  3. when you do have a "big" function with multiple "sections", write comments so it's obvious to the reader what the sections are
#

plus if you're running this all in a notebook cell you're guaranteed to make a mess out of the top-level namespace, and you should at minimum restart it once in a while

ember sapphire
#

im not using ipython

visual violet
#

If the first and last elements must match, the lines can't cross, and the time series have equal length, then does DTW behave exactly the same way as Euclidean?

#

I can't think of a way for it to not behave the same way as Euclidean if the time lengths are equal

desert oar
#

it is possible that DTW can produce the same results as euclidean yes. although the distance is a sum of absolute differences, not a sum of squared differences

#

but DTW can "skip" elements of one of the series

#

actually if they are equal it might be required to be the same as euclidean

#

good observation

visual violet
#

You can't skip one without skipping the one in the other series

desert oar
#

right

#

anyway i don't think euclidean or DTW are great solutions here, i think correlation could be more interesting

#

maybe even spearman correlation
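One way to read the correlation suggestion is as a distance of the form 1 - correlation. The sketch below is an illustration only: the helper names and the sample prices are made up, and the Spearman variant uses a simple ranking that assumes no tied values.

```python
import numpy as np

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for exactly opposite."""
    return 1.0 - float(np.corrcoef(a, b)[0, 1])

def spearman_distance(a, b):
    """Same idea on ranks; argsort of argsort gives ranks when there are no ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson_distance(rank_a, rank_b)

cheap = np.array([1.0, 1.2, 1.1, 1.5, 1.9])   # hypothetical price path
pricey = 1000.0 * cheap                        # same shape, very different level
opposite = cheap[::-1].copy()                  # same values, moving the other way

print(pearson_distance(cheap, pricey), spearman_distance(cheap, pricey))
print(pearson_distance(cheap, opposite))
```

Unlike Euclidean distance, this ignores the price level entirely: the cheap and expensive series come out at distance zero because they move together.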

#

but it depends on what you hope to find

#

subjectively, what does it mean for 2 time series to be similar?

#

if the price movements are arbitrary, why would you expect to find any interesting clusters based on price movements?

#

maybe you should be looking at linear trends, seasonal decompositions, mean price level, etc.

visual violet
#

i guess i am hoping to find anything possible out of this dataset lol

desert oar
#

i gave you a lot of suggestions of places to look

#

think of what characteristics of the price sequence could be meaningful, i gave some examples above

#

use those as features for clustering

visual violet
#

okay so first, log everything and cluster again
second, try several dimension reduction methods and graph the results
third, try soft clustering methods instead of k-means

desert oar
#

i'm changing my recommendation

#

i don't think you should waste time with these distance metrics based on price movements

#

find a way to characterize each drug in a way that makes sense

#

and use that to compute distances, do clustering, etc.

#

however i think you did discover something: price movements are indeed arbitrary and don't suggest any kind of meaningful relationships

visual violet
#

i feel good about that, salt

#

lmao i may have to submit my paper as a null result

desert oar
#

nah

#

you're not done yet

#

all you found is that euclidean distance isn't useful for these time series

#

(this is just one of many examples of how you could use such a thing)

visual violet
#

okay so abandon k-means and hierarchical as well as dtw and euclidean altogether

#

since price movements are random and i should not expect like 20 nice clusters

desert oar
#

no, i'm not saying to abandon k-means or hierarchical clustering

#

but yes, you probably want to abandon those distance metrics

visual violet
#

btw

#

i am still trying to understand how dtw works

#

my code make sense tho

#

in theory they should output the same thing?

desert oar
#

don't do the sqrt

visual violet
#

i have always thought the straight line provides the shortest path

#

oh well not anymore i guess

jolly ginkgo
#

pls help me

visual violet
ember sapphire
#

i don't see any obvious pieces that should be extracted into functions

#

but aside from the style, do you notice any errors / opportunities for optimization?

dawn sapphire
#

is there a good live plotting software?

visual violet
#

jupyter lol

dawn sapphire
#

i've got a loop that's constantly reading from a source and i want the same plot to be updated every second. is jupyter the best tool for this?

desert oar
normal grove
visual violet
#

thank you for the suggestion salt

#

finally a new direction

inland zephyr
#

since we can break down the data into the volume one (the Y) and the time (the X) whether to find any small detail to cluster the data

desert oar
#

that could also be interesting, although i'm generally skeptical of fourier-like methods on very short time series. but @visual violet take note

desert oar
#

so you can at least verify that the output is correct in a tighter loop without making all these plots

#

my guess is that somewhere you are forgetting to update something

#

also the centroid differences will almost never be exactly zero

#

how badly does the cost oscillate?

#

that's more interesting than looking at cluster assignments

#

also if this is a very small number of data points the oscillations could be significant

silk marsh
#

am working on payment prediction

desert oar
#

try it on a dataset with known obvious clusters

silk marsh
#

need a lil help

#

anyone please?

desert oar
#

@silk marsh successfully asking for machine learning help requires a detailed description of your task, a detailed description of your data, and a detailed description of your current solution(s) and/or the actual code you're using. example data is even more helpful

#

the "dont ask to ask" rule is even more important in data science than in programming, because there are even more open-ended questions

#

otherwise you force people to waste their time interviewing you in order to be able to help you

silk marsh
#

bro if u don't want to help then ignore my msg

#

@desert oar

desert oar
#

i already just tried to offer some help

silk marsh
#

sounds like u are angry

#

don't take me in wrong way

#

@desert oar

#

so can u join code help1 @desert oar

#

voice?

#

so that i can explain my prob?

desert oar
#

@ember sapphire i don't see anything obviously wrong with this code either, so i suspect that maybe you forgot to update a "new thing" and left an "old thing" behind

ember sapphire
#

Hmm

#

It doesn't oscillate badly

#

It goes down relatively quickly and then reaches what appears to be a local optimum but then it starts slowly going up again

desert oar
#

i wonder if you just need to be cutting it off there? maybe randomly swapping around points is causing it to destabilize

#

not sure what the theoretical guarantees are on the algorithm

#

https://en.wikipedia.org/wiki/Lloyd's_algorithm#Convergence

The algorithm converges slowly or, due to limitations in numerical precision, may not converge. Therefore, real-world applications of Lloyd's algorithm typically stop once the distribution is "good enough." One common termination criterion is to stop when the maximum distance moved by any site in an iteration falls below a preset threshold.
floating point issues?

ember sapphire
#

Yeah I thought it was guaranteed to converge so I was confused

#

Hmm maybe..

desert oar
#

try using that termination criterion

ember sapphire
#

Sounds reasonable, if it's just floating point issues then I'm fine. I was worried it meant there was a deeper problem

desert oar
#
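while True:
    # ... assignment and centroid-update steps ...
    # stop once no centroid coordinate moved more than the threshold
    if np.abs(centroids_curr - centroids_prev).max() < threshold:
        break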
#

try to compare vs scikit-learn k-means or some other known-good implementation

#

if you have the same-ish stopping criterion and the same-ish algorithm, you should get same-ish results

ember sapphire
#

Depends on initial centroids too but yeah, the image I'm using has a pretty obvious fit I think

lapis sequoia
#

Hey folks, would anyone here spare a few moments and help me with a scikit learn question?

misty flint
#

you can just ask. if people have time, theyll reply

lapis sequoia
#

Yeah thats fair. i just dont want to clog the chat you know. But here goes. I have a set of data generated by a random model .

Each row has an id, a prediction value from some random model, and the actual boolean value.

I am trying to calculate the AUC of the random model by 2 methods, one by the scikit-learn and one by hand (running the normal algorithm)

There is quite a mismatch from the 2 methods.

for the data file/ code : https://filebin.net/yuzmnov4k8r6ha0c (hope filebin links are allowed)

near oasis
lapis sequoia
#

Now my question is: am I using the roc_curve function right (I pass it the actual boolean values (1 and 0) and the predictions of the random model)? If yes, is there any reason why the by-hand calculation of the ROC (going over all the steps of the AUC calculation) differs so much from the scikit-learn one?
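Since the linked file isn't shown here, the following is only a sketch of two by-hand AUC calculations that should agree with each other: counting correctly ordered (positive, negative) pairs, and integrating the ROC curve with the trapezoid rule. The function names are made up. A frequent source of mismatch in hand-rolled versions is tied scores, which both versions below handle explicitly; whether that is the cause in this case is an assumption.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def auc_by_roc(y_true, scores):
    """Area under the ROC curve, one point per distinct threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr, fpr = [0.0], [0.0]
    for threshold in np.unique(scores)[::-1]:      # highest threshold first
        predicted = scores >= threshold
        tpr.append((predicted & (y_true == 1)).sum() / n_pos)
        fpr.append((predicted & (y_true == 0)).sum() / n_neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # trapezoid rule over consecutive ROC points
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

labels = [0, 0, 1, 1]
preds = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(labels, preds), auc_by_roc(labels, preds))
```

If a hand calculation and these two disagree on the same data, checking the label order passed in (true labels first, scores second) and the treatment of equal scores is a reasonable first step.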

shadow quiver
graceful ledge
#

Hey everyone! I am wanting to use Facebook Prophet to perform multivariate anomaly detection of user session data but struggling to figure out how it would work.

For some context, when a user logs in we create a session id and certain events/actions are captured and tied to that session. We are able to get counts of the different actions at 1 min intervals for a particular session.

The thing I am struggling with is how to detect anomalies at the session level. How would that data be fed into Facebook Prophet and is this something that can be done with it?

Online, I have only seen multivariate examples of things like different store locations but those store locations are static while the sessions are dynamic.

Appreciate your time!

visual violet
#

do you have a bachelor degree in data science?

graceful ledge
#

No, but I have created some basic autonomous driving apps like being able to drive around a track in a simulator, detecting road signs, lanes etc.

visual violet
#

sheesh

#

you are good

graceful ledge
#

Love the sarcasm. Just seeking some guidance as I am "not good"

visual violet
#

i am a high school student

#

i know absolutely nothing

#

i am sorry to push your question to the top

thorn bobcat
#

I want to learn about stylegan2

#

Its like I'm trying to swim before learning to walk

#

but it's beautiful

visual violet
#

@desert oar i just realized the red book by micromedex is the answer to my hope and dream

main fox
#

How should I typically handle dates in a ML model?
I have a dataframe that has a column for Year (from 2010 to 2020)
Should I leave it as int or code them from 1 to 10?

desert oar
#

is there a digital version? or do you have to now type in data for 700 drugs?

grave frost
visual violet
#

micromedex provides historical data going back 50 years

#

right now what i have is quite enough

#

but like to predict the future

#

20 datapoints are like a grain of sand

#

the problem is there is a subscription lol

thorn bobcat
#

anyone know what I should start learning to make something like style gan 2?

#

or 1

#

do I gotta learn about gans or do i gotta learn about neural style transfer?

velvet thorn
#

and how you want to model

#

can't tell without knowing what the problem is

main fox
visual violet
#

it is open source?

main fox
#

I can easily forecast these values, but I'm building a ML model just for the fun of it.

visual violet
#

are you saying you can forecast college tuitions?

velvet thorn
#

sounds like time series modelling

main fox
#

Yeah. Just from looking at the data, the trend appears to be +200 every year since 2010

velvet thorn
#

so

#

the simplest way to handle this is

#

given (X - n...X), model X + 1

#

in this case the date is used only to order data points
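As a sketch of the "given (X - n...X), model X + 1" idea: the years only decide the order of the rows, and each training example is a window of the previous values. The tuition numbers below are invented to match the "+200 every year" description, and `make_lag_windows` is a hypothetical helper name.

```python
import numpy as np

def make_lag_windows(series, n_lags):
    """Turn an ordered series into (previous n_lags values, next value) pairs."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Hypothetical yearly values for 2010..2020, already sorted by year.
tuition = 1000.0 + 200.0 * np.arange(11)

X, y = make_lag_windows(tuition, n_lags=3)
print(X.shape, y.shape)
print(X[0], y[0])
```

Any regressor can then be fit on `X` and `y`. Note that the year itself never appears as a feature.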

visual violet
#

can you share the data for upenn?

#

i am going there so i am curious

main fox
main fox
visual violet
#

University of Pennsylvania

velvet thorn
#

dates are used only to order the points

#

so they aren't features

#

UNLESS you want to encode more information

#

for simple stuff

#

you could look into e.g. ARIMA

main fox
visual violet
#

thank you

main fox
visual violet
#

@velvet thorn i am having a similar time series

#

i have historical data for price of different colleges, by year

#

now i am trying to find interesting patterns

thorn bobcat
#

this was a good paper

visual violet
#

i did k-means with dynamic time warping on my data already but it gives me one cluster with the majority of the drugs, and the rest of the clusters have very few members

#

what do you suggest me to do

visual violet
#

@desert oar

#

not exactly how stuffs work

#

but it appear to provide good results now?!?!?!?

#

i legit added one more line of code ingredient_price_matrix = np.log(ingredient_price_matrix)

desert oar
#

yep

#

i told you

#

If โ€œdtwโ€, DBA is used for barycenter computation.
this is also a good feature in that library

#

plot the log time series again w/ the colors

#

also 15 clusters seems like a lot

#

do you have any reason to expect 15? why not 1 or 5?

silver sun
#

Can anyone give me resources and link to prepare for a data science Interview? They said I will be asked on ML rapid prototyping, and simple ML models.

serene scaffold
#

And be prepared to talk about anything that's on your resume.

visual violet
#

to be honest, i have no idea which number of clusters i should pick

#

the elbow method tells me 6? so i will try that

#

also what does DBA mean?

#
  1. What's an example of an unsupervised learning algorithm?
    = kmeans? @serene scaffold
desert oar
visual violet
#

huh no wonder why when i check, the math doesn't work out

#

this is not normal dtw

#

this is DBA

#

also how do you know it uses DBA?

#

i check the tslearn website. there is no where it says that

#

@desert oar

#

i am so sorry for pinging you so many times

desert oar
# visual violet this is not normal dtw

DBA is a method for calculating centroids. the docs say that when you use DTW as the distance metric, it uses DBA instead of regular means to calculate the centroids in k-means

visual violet
#

thank you very much.

#

PCA happens to be the method to plot

#

so i will try that out as well

silver sun
visual violet
#

what do you think?

desert oar
#

i still think this is just clustering on average price level

#

k-means tends to try to find "round" clusters

#

and it will somewhat arbitrarily segment the data in order to do so

visual violet
#

perhaps it is because my graphing function is weird?

desert oar
#

if you see a bunch of evenly-sized clusters with no obvious separation, k-means is not doing anything interesting

#

no, this just looks like k-means not doing anything interesting

visual violet
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

visual violet
desert oar
#

as i suggested earlier, if price movements are arbitrary then you probably won't find much that's interesting in the individual price movements

#

it might be interesting to look at the variability in prices

#

e.g. you can start building features for each drug like mean price, std. dev of price, etc.

#

in fact, how about this: plot each drug with mean on the x axis and std. dev on the y axis

visual violet
#

so there will be 700 graphs?

desert oar
#

no

#

1 graph

#

700 points
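A sketch of the "1 graph, 700 points" idea: collapse each drug's series to two numbers, then plot one point per drug. The price matrix here is randomly generated as a stand-in for the real 724 x 20 data, and the function name is made up.

```python
import numpy as np

def mean_std_features(price_matrix):
    """One row per drug: (mean price, standard deviation of price)."""
    price_matrix = np.asarray(price_matrix, dtype=float)
    means = price_matrix.mean(axis=1)
    stds = price_matrix.std(axis=1, ddof=1)
    return np.column_stack([means, stds])

rng = np.random.default_rng(0)
# Stand-in data: a per-drug price level times small per-period noise.
level = rng.lognormal(mean=2.0, sigma=1.5, size=(724, 1))
noise = rng.lognormal(mean=0.0, sigma=0.1, size=(724, 20))
price_matrix = level * noise

features = mean_std_features(price_matrix)
print(features.shape)
# With matplotlib, plt.scatter(features[:, 0], features[:, 1]) draws the single plot.
```

Passing `np.log(price_matrix)` instead gives the same plot on the log scale.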

visual violet
#

oh right

visual violet
#

but i had an idea of using percentage difference

#

because i didn't think there will be an actual method

#

haha

#

the fact that my teacher recommends me to cluster the big cluster 💀

#

he gives 0 advice on anything

visual violet
#

btw the matrix is the log of the prices

#

if used actual prices

digital tulip
#

Hi, I would like to learn about AI, where do you recommend me to start?

serene scaffold
digital tulip
serene scaffold
digital tulip
#

And then?

visual violet
#

i have access to the IBM SPSS Statistics

#

let see if i can do the ward method without coding

#

i can't seem to find a library

desert oar
# visual violet is there a potential?

not for clustering, but i see some interesting structure here:

  1. variability looks like it increases as the avg price gets higher (not surprising)
  2. there are some weird outliers with hugely variable prices
visual violet
#

i agree yeah

cedar sky
#

Hi guys...
So I have been trying to load word embeddings from tf hub...
But most of it are converting the word embedding to a sentence embedding and I am not able to find any documentation to help me out of it...
If any of you guys have any ideas how to get the embeddings please let me know

visual violet
#

by looking at it, i see no distinct cluster

normal bay
#

how can i determine model over-fitting or not ?

dim beacon
coral kindle
lapis sequoia
#

what's better for storing analytics one-table data, sql databases (relational) such as mysql or noSQL such as mongodb?

ripe forge
primal tulip
primal tulip
# silver sun Can anyone give me resources and link to prepare for a data science Interview? T...

I do recommend datacamp.com if you have some extra bucks. I do have the yearly plan (150$ per year or so if you get it on cybermonday)
They have this new prep thing after you've done some 10 minute tests on each category, a real-life project you must pass to be certified and finally they'll help you get ready for the interviews and match your profile to other companies.

I'm still studying and haven't done them yet. And it just came out recently, so don't expect it to be perfectly flawless.

https://app.datacamp.com/certification/dashboard

Also there might be a free prep service around if you look properly. I was just willing to pay.

silver widget
#

Hi all. Been using CatBoostRegressor for a project (this is my first time with Catboost). I used it in StackingRegressor with different models. I tuned the hyper parameters of the boost which is ;
'Cat', CatBoostRegressor(iterations= 5000,
learning_rate=0.01,
l2_leaf_reg = 5,
depth = 6,
border_count = 50))

However, the model runs each iteration from 0 to 5000 with different learning rates. Is there a way to fix these parameters?

normal bay
normal bay
wanton sleet
#

If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it? *

#

the dataset is tabular

serene scaffold
#

Be sure to be specific as the question isn't answerable in general.

wanton sleet
#

This is one of the interview questions i had faced

#

Here we have to be creative as possible.

#

The data is numeric tabular data

serene scaffold
wanton sleet
#

No

#

its just random ML interview question for trainee

serene scaffold
wanton sleet
#

yes

#

If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it?

#

After preprocessing data what can be done

serene scaffold
wanton sleet
#

?

serene scaffold
#

if one column is just telling you the class, then you only have four features, not five

wanton sleet
#

okayy

#

any further

#

for discrete data

serene scaffold
# wanton sleet for discrete data

depends on what algorithm you want to use. for the discrete data, would the number be mathematically meaningful? Like for example, age?

#

or would it just be arbitrary numbers like "1" for green and "2" for blue
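To illustrate the distinction: a meaningful number such as age can stay as a single numeric column, while arbitrary codes such as "1" for green and "2" for blue are usually expanded into one indicator column per category so that no fake ordering is implied. A minimal sketch with NumPy (the `one_hot` helper and the colour codes are hypothetical):

```python
import numpy as np

def one_hot(codes):
    """Expand arbitrary category codes into one 0/1 column per category."""
    codes = np.asarray(codes)
    categories = np.unique(codes)
    encoded = (codes[:, None] == categories[None, :]).astype(int)
    return categories, encoded

colour_codes = np.array([1, 2, 2, 1, 3])   # e.g. 1 = green, 2 = blue, 3 = red
categories, encoded = one_hot(colour_codes)
print(categories)
print(encoded)
```

Tree-based models are often less sensitive to this choice than distance-based or linear ones, so how much it matters depends on the algorithm.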

wanton sleet
#

okay thanks

#

now i need to know the data nature

#

continuous or discrete

#

and act accordingly

serene scaffold
wanton sleet
#

best option might be support vector machine

serene scaffold
wanton sleet
#

i have done decision trees, random forest but not SVM yet

#

but maybe in near future

serene scaffold
wanton sleet
#

okay thanks

#

highly appreciated

serene scaffold
#

No problem 😄

ember sapphire
#

@desert oar lol I forgot to square the norms when computing error

#

It was actually converging just slowly
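The squared norms matter because Lloyd's algorithm minimises the sum of squared distances; that is the quantity that cannot increase from one iteration to the next. Below is a small sketch on synthetic blobs, split into functions as suggested earlier in the thread. All names are illustrative and this is not the original image-quantisation script.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Index of the nearest centroid for every point (squared Euclidean)."""
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def update_centroids(points, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = points[labels == i]
        if len(members) > 0:
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

def squared_cost(points, labels, centroids):
    """Sum of squared distances to the assigned centroid (the k-means objective)."""
    return float(((points - centroids[labels]) ** 2).sum())

def lloyd(points, k, rng, threshold=1e-6, max_iter=100):
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    costs = []
    for _ in range(max_iter):
        labels = assign_clusters(points, centroids)
        costs.append(squared_cost(points, labels, centroids))
        new_centroids = update_centroids(points, labels, centroids)
        moved = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if moved < threshold:   # stop when no centroid coordinate moved noticeably
            break
    return centroids, labels, costs

rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.concatenate([c + 0.3 * rng.standard_normal((50, 2)) for c in blob_centres])

centroids, labels, costs = lloyd(points, k=3, rng=rng)
print(costs)
```

Printing `costs` gives a non-increasing sequence. Summing unsquared norms instead tracks a different quantity, which need not decrease at every step.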

fallen vapor
#

can anyone suggest me some cool chatbot repos

#

which i can train

#

nobody?

#

:(

desert oar
cedar sun
#
        is a single float value, the range will be (-shift_limit, shift_limit). Absolute values for lower and
        upper bounds should lie in range [0, 1]. Default: (-0.0625, 0.0625).
#

What it means?

#

like, height width will be multiplied by that?

#

like, a 100x100 will result into a 6x6?

#

or into a 94x94?

silver widget
#

Train score (r2) = 0.98
test score = 0.86
is this counted as overfitting? (data= kaggle house prices advanced reg)

chilly geyser
#

I'd say yes

#

That difference is quite big

silver widget
#

Figured it out. RandomForest max depth is too much high..tuning them al again

#

*all

#

Thanks for the answer

somber prism
#

can someone explain me how is this 1/2 for this SVM prob

#

so for whichever choice p1 * theta >= 1 and p2 * theta <= -1 that would the right theta value ?

#

someone correct me if i am wrong

wheat sandal
#

Hey guys, which YouTube channel or resource would you recommend me to learn statistics from scratch? I don't have a math background

desert oar
#

i wish i knew, all the good statistics resources i know of already assume some knowledge of the basics

#

what is your background? how much math do you know?

wheat sandal
#

I studied literature lol

#

I have knowledge from high school's Math

desert oar
#

do you know calculus?

#

what's the "most advanced" math thing you remember how to do?

wheat sandal
desert oar
#

alright, you might really have to start from the basics then

#

i honestly don't know where you'd start from there, khan academy i guess has some videos? there might be some "statistical thinking" type of courses online, which would focus more on intuition and concepts

wheat sandal
#

I found one course in data camp called introduction to statistics in python . I will do that. Thanks anyway

digital tulip
#

After having a so-so knowledge in mathematics, what should I learn to make my first AI? 🤔

distant needle
#

I have a .dat data file that I believe is in some type of .csv format, however it's too large (3.1GB) to inspect with any programs that I've tried inspecting it with. What's the easiest way to convert from .dat to CSV so I can load it into a pandas dataframe?
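One low-tech way to find out what is inside a file that is too big to open is to read only its first few lines. This is a sketch assuming the file is text; the demo writes a tiny temporary file with an invented pipe-delimited layout purely so the snippet runs on its own.

```python
import os
import tempfile
from itertools import islice

def peek_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Demo on a small temporary file (the contents are made up).
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False, encoding="utf-8") as tmp:
    tmp.write("id|year|value\n1|2020|3.5\n2|2021|4.0\n")
    demo_path = tmp.name

first_lines = peek_lines(demo_path, n=2)
os.remove(demo_path)
print(first_lines)
```

Once the delimiter is visible, `pandas.read_csv` can usually read the file directly via its `sep` argument, with `nrows` or `chunksize` to avoid loading all 3.1GB at once.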

distant needle
#

Not a clear way of knowing. It's a government dataset.

#

I may have found a way around it though...I think I found a CSV, which makes life much easier.

desert oar
#

hopefully it says somewhere in the govt docs though

wheat sandal
main fox
#

Can someone help me with an xgboost model making bad forecasts?

iron basalt
wheat sandal
iron basalt
wheat sandal
#

Cool , I really appreciate your help !

lapis sequoia
#

Anyone have an idea of what approach to take for a problem where I need to predict the column name based on its data? I understand it could be a classification however the issue is there are so many different types of fields which contain various fields over 1000

desert oar
lapis sequoia
#

Recommend unsupervised for this?

desert oar
#

so this is like looking at the data, then covering up the columns and trying to figure out which one was which?

#

this is a strange task

lapis sequoia
#

Correct

#

Because column names might be misspelled

#

So let's say one year you have name and then the next year column name for same data is nme

desert oar
#

so it's not the same exact dataset, it's different datasets but with the same columns in each dataset?

#

that's a different task

lapis sequoia
#

Similar column names *

desert oar
#

and harder

lapis sequoia
#

Oh you're right apologize

#

Explained incorrectly

#

Yea it's a hard task that's for sure

#

I'm having a hard time figuring out the approach to take

visual violet
#

@desert oar i am so sorry to bother you again

#

i have been doing research today

#

i can't find the correct method to cluster

#

given the variability is not so clumped together

visual violet
#

may i ask

#

if i can somehow insert more variables in addition to time series?

#

like drug class and strength

#

and form factors like tablet, liquid, etc

desert oar
#

Multiple factor analysis (MFA) is a factorial method devoted to the study of tables in which a group of individuals is described by a set of variables (quantitative and / or qualitative) structured in groups. It may be seen as an extension of:

Principal component analysis (PCA) when variables are quantitative,
Multiple correspondence analysis ...

visual violet
#

thank you very much

#

given my situation

#

do you think i should try different methods with my current dataset?

visual violet
pearl nymph
#

Hello

#

how are you doing guys

main fox
# velvet thorn you could look into e.g. ARIMA

Hello, thank you for suggesting an ARIMA model yesterday. Traditional methods through sk tried using randomforest or xgboost along with custom train, test, split and walk forward validations which ended up being too convoluted. I found sktime and they have an Arima model that produced the results I was looking for.

main fox
#

Haha thanks. I was just stubborn enough to read through how machine learning with time series usually goes.

#

And a lot of trial and error

visual violet
#

much respect

serene scaffold
visual violet
#

you changed your pfp

#

AND name

#

the pfp looks nice back then :((

serene scaffold
#

I'm a mosaic now

visual violet
#

is amazing lol

#

maybe the person has a degree in data science already?

#

who knows

visual violet
#

didn't you

#

go to school for cs

serene scaffold
#

it was cs and data science at the same time

main fox
# visual violet wait so you actually read

I was under the impression that ML with a snapshot of data using xgboost was similar enough to using a time series. Or that xgboost could handle time series easily. That wasn't the case and it took me a lot of errors to realize.

main fox
visual violet
#

so cool!!!!

#

i might as well study cs + biochem now

#

increasingly sounds like a good option

main fox
#

A lot of Pharma companies are in need of people with those backgrounds. Not a lot of people that study CS go for health sciences.

visual violet
#

working for faang is so overrated

#

they are good jobs but seems boring

main fox
#

Agreed, I'd go for Netflix though lol. Other than that, I'd rather work anywhere else.

visual violet
#

i mean

#

i don't think netflix needs a microbiologist lmao

main fox
#

Lol yeah they don't. But I'm leaning towards data science

tough frigate
#

so am i lol

#

is it true? that Data Science will be overtaken by AIs

raven hare
#

we need to keep its control

drowsy maple
#

I am not getting desirable output... anyone?

warped relic
#

I have git logs where the commit hash, author, timestamp, commit message and file logs (the number of file-log lines varies per commit) are each on separate lines. I can pull out the individual fields (except the file logs and messages), but I can't turn them into columns.

This is the sample file

commit 2232asdfeafdadssc0a63d3ded7e95e894bb735c121f
Author: John Shahi
Date:   Thu Jun 10 05:10:31 2021 +0000

    Feature/bill overview

70    0    src/components/pages/abc.tsx


commit 18asdfasd9104c7fb59d9027f48csdfss8b61776e21d0
Author: rashi.coder
Date:   Wed Jun 9 12:39:33 2021 +0545

    disable other call, refine disabled styles

13    1    src/components/organisms/contact/detail/card/ContactDetailCard.tsx
11    2    src/components/organisms/contact/list/ActionsColumn.tsx

commit 65adfadfc2090e299bf9c514735eb1a2779a12ed9
Author: Ritesh  Poudel
Date:   Wed Jun 9 05:08:00 2021 +0000

commit 04afdad56f5136da10c87d5181dab8afdsfs29e57a5
Author: rashi.coder
Date:   Wed Jun 9 10:22:29 2021 +0545

    fix multiple contacts selection overflow

1    0    src/components/organisms/contact/list/ContactTable.tsx
8    1    src/components/organisms/contact/list/Styles.tsx

this is what I was trying to do

commits = pd.read_csv(
        COMMIT_LOG,
        sep="\n",
        header=None,
        names=["raw"]
        # names=[
        #     "sha",
        #     "author",
        #     "timestamp",
        #     "message",
        #     "additions",
        #     "deletions",
        #     "filename",
        # ],
    )
    commit_marker = commits[commits["raw"].str.startswith("commit")]
    author_marker = commits[commits["raw"].str.startswith("Author")]
    date_marker = commits[commits["raw"].str.startswith("Date:")]
    print(commit_marker.head())
    print(author_marker.head())
    print(date_marker.head())
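Because each commit spans a variable number of lines, one possible approach is to parse the log line by line in plain Python first and build the table afterwards. The sketch below assumes the layout shown in the sample (header lines at column 0, message lines indented by four spaces, then `additions deletions filename` lines). The function name is made up and the hashes in the demo are shortened.

```python
import re

COMMIT_RE = re.compile(r"^commit (\S+)")
AUTHOR_RE = re.compile(r"^Author:\s*(.+)")
DATE_RE = re.compile(r"^Date:\s*(.+)")
NUMSTAT_RE = re.compile(r"^(\d+|-)\s+(\d+|-)\s+(.+)")

def parse_git_log(text):
    """Return one dict per (commit, file); commits without files get one row."""
    commits = []
    for line in text.splitlines():
        if (m := COMMIT_RE.match(line)):
            commits.append({"sha": m.group(1), "author": None,
                            "timestamp": None, "message": [], "files": []})
        elif not commits:
            continue                      # ignore anything before the first commit
        elif (m := AUTHOR_RE.match(line)):
            commits[-1]["author"] = m.group(1).strip()
        elif (m := DATE_RE.match(line)):
            commits[-1]["timestamp"] = m.group(1).strip()
        elif (m := NUMSTAT_RE.match(line)):
            added, deleted, filename = m.groups()
            commits[-1]["files"].append((
                int(added) if added.isdigit() else None,      # "-" marks binary files
                int(deleted) if deleted.isdigit() else None,
                filename.strip(),
            ))
        elif line.startswith("    "):
            commits[-1]["message"].append(line.strip())

    rows = []
    for commit in commits:
        base = {"sha": commit["sha"], "author": commit["author"],
                "timestamp": commit["timestamp"],
                "message": "\n".join(commit["message"])}
        for added, deleted, filename in commit["files"] or [(None, None, None)]:
            rows.append({**base, "additions": added,
                         "deletions": deleted, "filename": filename})
    return rows

sample = """commit 2232asdf
Author: John Shahi
Date:   Thu Jun 10 05:10:31 2021 +0000

    Feature/bill overview

70    0    src/components/pages/abc.tsx

commit 65adfadf
Author: Ritesh  Poudel
Date:   Wed Jun 9 05:08:00 2021 +0000

commit 04afdad5
Author: rashi.coder
Date:   Wed Jun 9 10:22:29 2021 +0545

    fix multiple contacts selection overflow

1    0    src/components/organisms/contact/list/ContactTable.tsx
8    1    src/components/organisms/contact/list/Styles.tsx
"""

rows = parse_git_log(sample)
for row in rows:
    print(row)
```

`pd.DataFrame(rows)` then yields the sha, author, timestamp, message, additions, deletions and filename columns listed in the commented-out `names`.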
spark stag
pine umbra
#

Guys, how are you? Do any of you already work with geospatial data? I need to plot accident areas and vehicle paths on the map of Brazil. I tried it and I liked using the folium, but when I use too many points for the path the map just doesn't load, does anyone have a solution for that or already known and like another tool?

pure sleet
#

whats the best way to learn ML ?

#

directly starting from tensorflow

#

or doing octave/matlab first

serene scaffold
pine umbra
serene scaffold
#

Two people have said it independently, so it must be true

pine umbra
#

Tensorflow is most used in deep learning, it is a step after

serene scaffold
#

and while I know deep learning sounds cooler, that's just what they call it. the algorithms in sklearn can be great for a lot of use cases and don't require you to know quite as much math to wrap your head around what's happening.

pine umbra
#

this is true... a lot of problems can be solved using ML with sklearn... other some specific and difficult problems need DL and other frameworks.

grave breach
#

Maybe I can try helping you

hollow ember
#

Can someone help me with pandas and datasets?

arctic wedgeBOT
#

Hey @hollow ember!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

hollow ember
#

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

#

can someone help me out . Thank you

visual violet
#

1 means diabetes?

hollow ember
#

yes

visual violet
#

recently i have learned a thing called t-sne

#

but since you said predict, i have no idea how

hollow ember
#

yeah im confused too

#

Can anyone else help me out?

cedar sun
#

when training a model

#

can i tell it to focus more on a specific class?
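
A common way to do this is to give the under-represented class a larger weight in the loss. A minimal sketch of the usual "balanced" weighting, n_samples / (n_classes * count), with made-up labels; Keras `fit()` accepts such a dict through its `class_weight` argument, and many scikit-learn estimators take a `class_weight` parameter too:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to how often it appears."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Made-up, imbalanced labels: 90 samples of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

To push the model toward one class even harder, its entry can be scaled up by hand before passing the dict along, for example `model.fit(..., class_weight=weights)`.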

slim marten
#

Hello, I'm trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.

My test data (34 pictures) is saved in a 'test' folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.

I receive a "Found 34 images belonging to 1 classes." message. However, the result I get in the terminal for this code line result = seqModel.evaluate(data, batch_size=1, verbose=1) is a very weird one: 2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

Why do I receive a "2/2" every time when running the script now, no matter what batch_size I choose? And why is my loss 282.6923, while accuracy is 0.7353? Doesn't it look super weird? I know I'm doing something wrong, but I just can't figure it out - maybe when creating the data generator or maybe when using flow_from_directory? (When I add the validationDataGenerator as first argument - in order to test it - it seems all fine, but here I just can't figure it out.)

A little bit of help would be appreciated. 🙂

desert oar
slim marten
#

Yes, sure

#
```python
imageDataGenerator = ImageDataGenerator(validation_split = 0.2,
                                   rescale = 1./255,
                                   rotation_range = 40,
                                   width_shift_range = 0.2,
                                   height_shift_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True,
                                   fill_mode = 'nearest')
trainingDataGenerator = imageDataGenerator.flow_from_directory(
                                dir,
                                target_size = (70, 70),
                                batch_size = batchSize,
                                color_mode="rgb",
                                class_mode = 'binary',
                                shuffle = True,
                                seed=42,
                                subset = 'training')

validationDataGenerator = imageDataGenerator.flow_from_directory(
                                dir,
                                target_size = (70, 70),
                                batch_size = batchSize,
                                color_mode="rgb",
                                class_mode = 'binary',
                                subset = 'validation')
```

so these are for the training and the validation - using the data from my dir directory, which is gonna be divided as 80% for the training, 20% for the validation
the problem is that, except for the dir directory, I have a test directory as well
and I'm trying to obtain a generator based on the images in it
slim marten
# austere swift could you send some code?

```python
test_data = 'C:/Users/Ana/Desktop/Licenta/Practic/maskAPI/test'

datagen = ImageDataGenerator(rotation_range = 40,          # rotate the images
                            width_shift_range = 0.2,      # shift the width
                            height_shift_range = 0.2,     # shift the height
                            zoom_range = 0.2,
                            horizontal_flip = True,       # flip the image horizontally
                            fill_mode = 'nearest')

data = datagen.flow_from_directory('./test', classes=['test'], target_size=(70, 70), color_mode='rgb')

result = seqModel.evaluate(data, batch_size=1, verbose=1)
```

I've tried many variants, this is just the last one of them
by the way, I commented most of the training part, as I don't think it's relevant - only showed you the other 2 generators, that work just fine (I followed a tutorial for those)
the test part is what I don't get
I just want to evaluate the loss and accuracy of my model and I simply don't get how
austere swift
#

make sure that the directory you're getting the images from actually has the correct amount of images and that the data generator read them correctly, i don't see anything else that would be wrong

slim marten
#

yes, test directory has 34 images. (The directory 'test' from test_data has one subfolder called test containing all the images)

#

but even if I look at the result, I don't get what that 2/2 is and why it remains like that. Shouldn't it be 34/34? (it was like that at some point, but I had other problems then when experimenting)
2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

#

Hmm, any other suggestions for obtaining my accuracy and loss for the test data, other than the evaluate() method? 🙂

austere swift
#

wait i think i know the issue

slim marten
#

YAY

austere swift
#

you didn't specify batch size in the ImageDataGenerator

#

and by default its 32

slim marten
#

OH, let me try

austere swift
#

so even if the evaluate() method has a batch size of 1, it'll take 1 batch from the datagen which will be 32
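
The step counter is just the number of batches the generator yields, i.e. ceil(number of images / generator batch size). A quick check of the numbers in this thread, assuming the `flow_from_directory` default of `batch_size=32`:

```python
import math


def steps_per_pass(n_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(n_images / batch_size)


print(steps_per_pass(34, 32))  # 2  (one batch of 32, one batch of 2)
print(steps_per_pass(34, 1))   # 34
```

So the progress bar reflects the batch size set on the generator, not the `batch_size` passed to `evaluate()`.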

slim marten
#

true

#

34/34 [==============================] - 0s 3ms/step - loss: 239.4841 - accuracy: 0.7647

#

now it's 34/34

#

still getting the 239.4841 loss value

austere swift
#

what loss alg?

slim marten
#

what do you mean exactly?

austere swift
#

the loss algorithm

#

which one are you using

hollow ember
#

How can i use multiple categories In X or Y axis?

slim marten
# austere swift the loss algorithm
```python
               loss = 'binary_crossentropy',
               metrics = ['accuracy'])
```
In the compile method I use 'binary_crossentropy'
and I have 2 classes in the 'train' folder -> _with mask_ and _without mask_
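
For reference, a minimal pure-Python sketch of what binary cross-entropy measures, using made-up labels and predicted probabilities (the clipping constant `eps` is also an assumption for illustration). It shows why accuracy and loss can disagree: accuracy only counts which side of 0.5 a prediction falls on, while the loss keeps growing as a wrong prediction gets more confident:

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 1, 1, 0]
mild = [0.9, 0.8, 0.7, 0.6]            # last one is wrong, but not very sure
confident = [0.9, 0.8, 0.7, 0.999999]  # same mistake, made very confidently

for name, probs in (("mild", mild), ("confident", confident)):
    print(name, accuracy(y_true, probs), binary_crossentropy(y_true, probs))
```

In the snippets above, the training generator sets `rescale = 1./255` and `class_mode = 'binary'` while the test generator sets neither, so that difference is worth checking too.
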
grave breach
hollow ember
#

im restricted to using only matplotlib and seaborn for my assignment

grave breach
#

You can implement them by yourself

#

Or, you can implement a neural network in raw python

#

If you know the theory they're not so complex

#

But, implementing a neural network without numpy can be hard

hollow ember
#

I know nothing, that's the problem

austere swift
#

why are you suggesting implementing a neural network from scratch? theres no need for a neural network for that application

#

a simple logistic regression model would probably be fine

grave breach
#

But they're pretty fun

grave breach
austere swift
#

but a logistic regression model should be fine

#

you can do that in seaborn as well

grave breach
#

I agree

austere swift
#

no seaborn wow

hollow ember
#

can u help with this, im quite new to this whole thing

grave breach
#

@hollow ember By the way, can you upload the dataset?

#

Maybe as a .zip

hollow ember
#

sure

arctic wedgeBOT
#

Hey @hollow ember!

It looks like you tried to attach file type(s) that we do not allow (.rar). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

hollow ember
#

can't send rar files

grave breach
#

Try uploading it on pastebin

hollow ember
#

yup

slim marten
vital lodge
#

Hey, I'm new to computer vision and I tried training a resnet 50 model
as I'm working with like 280 images I tried implementing data augmentation
now the accuracy score is something like 0.006
I'm guessing it overfitted, are there any solutions to this problem?

grave breach
#

@hollow ember I got something for you

#

I tried different methods on your dataset and measured the accuracy of each of those

#

Here's my results:

#

LogisticRegression -> 0.7825520833333334
DecisionTree -> 0.7916666666666666
GradientBoostedTrees -> 0.7838541666666666
SVM -> 0.8033854166666666
NearestNeighbors -> 0.75390625
Neural Network -> 0.7955729166666666

#

(Accuracy)

#

So I think that an SVM would be the best way to go

#

(By the way, I didn't do a train/test split, so some outputs could suffer from overfitting)
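
A minimal sketch of the same kind of comparison with a held-out test set, using scikit-learn. The data here is a synthetic stand-in generated with `make_classification` (the real dataset was not posted in the channel, and the shape is chosen arbitrarily), so the printed accuracies say nothing about the diabetes data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 768 rows, 8 numeric features, 2 classes.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# Hold out 25% of the rows; the models never see them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} -> {scores[name]:.3f}")
```

Accuracy measured this way is on rows the model was not trained on, which is what makes the numbers comparable across models.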

grave breach
#

By the way, are you dealing with a classification task?

slim marten
#

so is there any problem with this? 🙂

```python
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])
```
vital lodge
cerulean mauve
#

Can pandas df.drop() accept a python list? I swear.

l = ['foo', 'bar']
df.drop(columns=l, axis=1)

Does not work, even with labels, or raw...
while

df.drop(['foo', 'bar'], axis=1)

Does work. What the !@#!@#$ is that?

desert oar
#

@cerulean mauve what does "does not work" mean?

#

you probably can't use both axis= and columns=

cerulean mauve
#

Tried that as well.

#

Example 1 with the list does not work, already tried it bare

desert oar
#

.drop(l, axis='columns') is the same as .drop(columns=l) and .drop(l, axis=1)

cerulean mauve
#

I should be able to .drop(l) with no problem according to what I have seen and read.

#

Every single way, I just tried them again to make sure that I am not crazy.

#

I got need to specify at least one of 'labels', 'index', or 'columns'

#

I got that with columns specified

#

like .drop(columns=l)

#

Let me walk you through what I have here.

#
#First I am taking a byte string to buffer from s3, using panda's optional library(fsspec) to handle the buffer into the DF, which works fine, I can print(df.to_string) and it comes out well
csv_buffer = StringIO(some_bytes_from_s3)
df = pandas.read_csv(csv_buffer)
my_list = [ 'foo', 'bar']
df.drop(my_list) # does not work
#df.drop(columns=my_list) # does not work
#df.drop(my_list, axis=1) # does not work

# I get the aforementioned error every time.
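
For what it's worth, pandas raises that particular ValueError when everything passed to drop() is None, not when it receives a list. A minimal sketch of one way that can happen (an assumption, the actual cause is not shown in the thread): in-place list methods such as `list.sort()` or `list.append()` return None, so assigning their result loses the list. The data below is made up:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2], "bar": [3, 4], "baz": [5, 6]})

good_list = ["foo", "bar"]
bad_list = ["foo", "bar"].sort()  # sort() works in place and returns None

# A real list is accepted without complaint.
print(df.drop(columns=good_list).columns.tolist())

# None is not: pandas has nothing to drop and raises ValueError.
try:
    df.drop(columns=bad_list)
except ValueError as exc:
    print(type(bad_list), exc)
```

Printing `type(my_list)` right before the drop() call is a quick way to rule this in or out.
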
#

I can give you some sample data that the CSV looks like.

#

If needed.

desert oar
#

!e ```python
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4], 'b': [11,12,13,14], 'c': [21,22,23,24]})
print(df.drop(['a', 'c'], axis=1))
print()
print(df.drop(['a', 'c'], axis='columns'))
print()
print(df.drop(columns=['a', 'c']))
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |     b
002 | 0  11
003 | 1  12
004 | 2  13
005 | 3  14
006 | 
007 |     b
008 | 0  11
009 | 1  12
010 | 2  13
011 | 3  14
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/xajinecesi.txt?noredirect

desert oar
#

so yes i think you need to send a csv and the exact code to reproduce the error, because i can't reproduce it here

cerulean mauve
#

As far as I can tell

desert oar
cerulean mauve
#

if I do as you have done, and .drop(['a', 'b'], axis=1) it will work, it's only when I use a variable of type list that it fails.

desert oar
#

but ['a', 'b'] is a list

cerulean mauve
#

right?

#

I think my container is fubarred.

#

I'm using python-lambda-wrapper to develop aws lambda jobs for some database backfilling from CSV

#

going to test on my other machine

#

brb

drowsy maple
cerulean mauve
#

I want to drop the columns arg_reportdate and arg_reportname

#

I place those in a variable that is classed as a list

#

When I use the list, I cannot drop them.

#

When I literally write it out aka .drop(['arg_reportdate', 'arg_reportname'])

#

It works.

#

just not with the python list variable.

#

Which is maddening.

#

because it's a !@#$!$@#!@$ list

#

🙃

#

maybe it's time to find food and microwave it.

#

i'll brb

cedar sun
#

if my model input shape is (100,100,4)

#

it means the images it reads have alpha, right?

desert oar
#

it means that your input shape is 100,100,4 😉

#

if they're images, yeah probably the 4 corresponds to rgba

#

but nobody can answer that question for you, you have to know about your own data

cedar sun
#

so if i wanna augment my data on a custom way using the info of the alpha channel, the images given by flow_from_directory are gonna have alpha too?

#

ah nooooo

#

ooooh

#

ok ok

#

color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb". Whether the images will be converted to have 1, 3, or 4 channels.

#

what happens if i read an RGB image as RGBA?

#

cuz i have some images with Alpha, and i wanna do special things with those only

cerulean mauve
#

back @desert oar thanks for looking for me.

cedar sun
#

fml

#

do u guys have any idea of how to approach this?

#

my dataset contains rgb and rgba images

cerulean mauve
#

convert to grayscale?

cedar sun
#

I am using the flow_from_directory method, which reads images as rgb by default, but i can set it to rgba. The thing is... i get this error: ValueError: could not broadcast input array from shape (160,160,3) into shape (160,160,4)

#

can i read rgb images as rgb and rgba as rgba?

cerulean mauve
#

Maybe, seems like you just want to drop the alpha layer, no?

#

oh wait

#

wait you wanted to do something special with those.

cedar sun
#

yes, but i think i cant

#

on an easy way

cerulean mauve
#

I would check the image metadata for the existence of an alpha layer, and then operate differently based on that.

cedar sun
#

like... i could rewrite flow_from_dir method, and if color_mode='rgba' and the current image has 3 channels, read it as rgb

#

or something
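
A minimal sketch of that idea outside of flow_from_directory: check the channel count of each decoded image and pad RGB images with an opaque alpha channel, so every array ends up (H, W, 4) while it is still known which ones really had alpha. It assumes the images are already NumPy arrays with uint8 values in 0-255; `ensure_rgba` is a hypothetical helper, not part of Keras:

```python
import numpy as np


def ensure_rgba(image):
    """Return (rgba_image, had_alpha) for an (H, W, 3) or (H, W, 4) array."""
    if image.ndim != 3 or image.shape[2] not in (3, 4):
        raise ValueError(f"expected (H, W, 3) or (H, W, 4), got {image.shape}")
    if image.shape[2] == 4:
        return image, True
    # RGB input: append a fully opaque alpha channel (255 assumes uint8 data).
    alpha = np.full(image.shape[:2] + (1,), 255, dtype=image.dtype)
    return np.concatenate([image, alpha], axis=2), False


rgb = np.zeros((160, 160, 3), dtype=np.uint8)
rgba = np.zeros((160, 160, 4), dtype=np.uint8)
for img in (rgb, rgba):
    out, had_alpha = ensure_rgba(img)
    print(out.shape, had_alpha)
```

The `had_alpha` flag is what lets the custom augmentation run only on the images that originally carried an alpha channel.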

twin moth
#

Hey guys, I need some help with a DS project I'm working on.
I'm trying to predict people's taste using recipes, profiles, favorites, and reviews I scraped from a popular recipes website.

I currently have 95K recipes, and 2M profiles, 2M reviews and 70M~ favorites.
How would one use the data I fetch in order to finish the project?

cerulean mauve
#

if you wanted to drop the alpha layer yes.

cedar sun
cerulean mauve
#

got the code handy?

cedar sun
#

?

#

wdym

cerulean mauve
#

source code?

#

I assume you are using a method

cerulean mauve
#

seems right 😄

cedar sun
#

@desert oar how do i set weights for a certain class? i asked u before

grave breach
#

@cedar sun what do you mean?

cerulean mauve
#

The class ImageDataGenerator seems to have a parameter called data_format, which sounds promising. @twin moth

cedar sun
cedar sun
cerulean mauve
#

Polymorphism is the way the truth, and the light.

#

you can use super() to call the parent's method from your own class.

cedar sun
grave breach
#

So maybe I can be helpful

cedar sun
cerulean mauve
#

to make a child class:

class Parent:
  def __init__(self, txt):
    self.message = txt

  def printmessage(self):
    print(self.message)

class Child(Parent):
  def __init__(self, txt):
    super().__init__(txt)

x = Child("Hello, and welcome!")

x.printmessage()
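
The same pattern also covers overriding a single method while reusing the parent's version of it. A minimal sketch with hypothetical class names (nothing here is Keras API): the child changes what `load` returns for one case and defers to the parent otherwise:

```python
class Loader:
    """Stand-in for a parent class whose behaviour we mostly want to keep."""

    def load(self, name, channels):
        return f"{name} loaded with {channels} channels"


class AlphaAwareLoader(Loader):
    """Overrides load() but still delegates the real work to the parent."""

    def load(self, name, channels):
        result = super().load(name, channels)  # reuse the parent's logic
        if channels == 4:
            return result + " (alpha kept for special handling)"
        return result


loader = AlphaAwareLoader()
print(loader.load("a.png", 4))
print(loader.load("b.jpg", 3))
```

Only the overridden method needs to be written; everything else is inherited unchanged.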