#data-science-and-ml | Python | Page 322

grave sparrow Jun 21, 2021, 7:36 PM

#

Just fillna in advance?

desert oar Jun 21, 2021, 7:37 PM

#

that's one good way to do it, if you have a string you want to use to represent missing data

#

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'v1': ['a', 'b', np.nan],
    'v2': ['x', np.nan, 'z'],
})

not_null = df[['v1', 'v2']].notnull().all(axis=1)
df.loc[not_null, 'v3'] = df.loc[not_null, 'v1'] + ' -> ' + df.loc[not_null, 'v2']

you could do it this way too

#

i generally recommend not ever using .astype unless you really need to

#

usually i want more control than that

grave sparrow Jun 21, 2021, 7:40 PM

#

Ahhhh thats perfect

#

So I could either fillna in the columns as needed before I concatenate, or do that, then afterwards fillna in v3 with 'Null values detected, unable to generate directive' or something
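A minimal sketch of the two options summarized above, reusing the `v1`/`v2` frame from the earlier snippet. The placeholder strings and the `v3_filled`/`v3_masked` column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "v1": ["a", "b", np.nan],
    "v2": ["x", np.nan, "z"],
})

# Option 1: fill missing values first, then concatenate every row.
placeholder = "<missing>"  # hypothetical placeholder string
filled = df[["v1", "v2"]].fillna(placeholder)
df["v3_filled"] = filled["v1"] + " -> " + filled["v2"]

# Option 2: concatenate first (a missing value on either side gives NaN),
# then label the gaps afterwards.
df["v3_masked"] = (df["v1"] + " -> " + df["v2"]).fillna("unable to generate directive")

print(df)
```

Option 1 keeps partial information in every row; option 2 makes it obvious which rows could not be built at all.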

#

Thanks for the advice

desert oar Jun 21, 2021, 7:45 PM

#

yeah, i've seen way too many f'ed up datasets with nan where it doesn't belong

#

can't allow it 😛

#

also don't use csv to save this data once you've generated it... use parquet or something else intelligent

ember sapphire Jun 21, 2021, 7:46 PM

#

i can't figure out why this doesn't converge to a local minimum, the cost function is oscillating

#

import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image

rng = np.random.default_rng()

img = image.imread('fruits_small.jpg') / 256.0
h, w = img.shape[:2]

copy = np.array(img)

plt.subplot(3, 3, 1)
plt.title('Original')
plt.imshow(img)

for plot, k in enumerate([4, 8, 16, 32, 64]):
    centroids = rng.choice(copy.reshape((-1, 3)), size=k, replace=False)
    clusters = np.empty((h, w))

    print(centroids)

    while True:
        for y, x in np.ndindex(img.shape[:2]):
            v = copy[y, x]
            clusters[y, x] = np.argmin(np.linalg.norm(centroids - v, axis=1))

        cost = 0
        for i in range(k):
            cost += np.linalg.norm(copy[clusters == i] - centroids[i], axis=1).sum()

        print(f'cost = {cost}')

        d = 0
        for i in range(k):
            new_centroid = copy[clusters == i].mean(axis=0)
            d += np.linalg.norm(centroids[i] - new_centroid)
            centroids[i] = new_centroid

        if d == 0:
            break

        print(centroids)

    for i in range(k):
        img[clusters == i] = centroids[i]

    plt.subplot(3, 3, plot + 2)  # slot 1 already holds the original image
    plt.title(f'k = {k}')
    plt.imshow(img)

plt.show()

#

this version isn't the 5d one, it clusters based solely on color

desert oar Jun 21, 2021, 7:57 PM

#

@ember sapphire this is k-means? i think usually k-means stops after a fixed number of iterations anyway

ember sapphire Jun 21, 2021, 7:57 PM

#

the cost function should be decreasing

#

mine isn't

#

i can't figure out why though
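One possible explanation, not confirmed anywhere in this conversation: the posted cost sums plain Euclidean distances, while moving each centroid to the cluster mean only guarantees that the sum of squared distances does not go up. An empty cluster would also turn its centroid into NaN. The sketch below is a generic Lloyd's-algorithm loop on random stand-in data, not the original script. The `kmeans` helper and its parameters are illustrative names.

```python
import numpy as np

def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm; returns centroids, labels, and the cost per iteration."""
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    costs = []
    for _ in range(max_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Sum of SQUARED distances: the quantity each step cannot increase.
        costs.append(sq_dist[np.arange(len(points)), labels].sum())

        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, costs

rng = np.random.default_rng(0)
pixels = rng.random((500, 3))  # stand-in for img.reshape(-1, 3)
centroids, labels, costs = kmeans(pixels, k=4, rng=rng)
print(costs)
```

If the squared cost still rises between iterations, the bug is somewhere in the assignment or update step rather than in how the cost is measured.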

near oasis Jun 21, 2021, 7:58 PM

#

anyone interested on making this prototype better? https://youtu.be/UIOO09AYvQM

#

it's an NLP project on vaccine information.

#

if you're interested DM me

thorn bobcat Jun 21, 2021, 8:25 PM

#

yo

#

what you guys think about grabcut?

grave frost Jun 21, 2021, 8:33 PM

#

near oasis anyone interested on making this prototype better? https://youtu.be/UIOO09AYvQM

its a private video?

cedar sun Jun 21, 2021, 8:33 PM

#

does keras have a fast way to measure top 5 accuracy?
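As far as I know, Keras ships `tf.keras.metrics.TopKCategoricalAccuracy(k=5)` (one-hot labels) and `SparseTopKCategoricalAccuracy(k=5)` (integer labels), which can be passed in the `metrics` list of `model.compile`. The NumPy sketch below only illustrates what such a metric computes. The `top_k_accuracy` helper is a made-up name, not a Keras function.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # argpartition puts the k largest scores in the last k columns (in no particular order).
    top_k = np.argpartition(probs, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((1000, 10))          # fake model outputs for 10 classes
truth = rng.integers(0, 10, size=1000)   # fake integer labels
print(top_k_accuracy(scores, truth, k=5))
```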

oblique palm Jun 21, 2021, 9:37 PM

#

Hello everyone, I'm working on an IT project proposal based on ripeness detection through machine learning. My project consists of a camera observing a basket of bananas (for example). I'd like to know if there are effective ways to recognize if a banana looks rotten based on what the camera sees from a bunch of bananas in a basket?

My main doubt is how hard this is, and whether there are machine learning tools that help with identifying one bad banana within the group in the basket, because from what I've seen most related projects do it at a small scale, like putting 1 banana in front of a camera. Any suggestions?

desert oar Jun 21, 2021, 9:46 PM

#

i suspect that such a model will mostly be detecting brown vs yellow vs green

#

training data is probably the hardest part here

#

how many photos of bananas, in how many different configurations, and different kitchens, and different lighting conditions do you have to collect and label before you have enough data? potentially a lot

#

maybe you can use an off-the-shelf object detection algorithm to find the bananas in the image first, so your model doesn't have to be so powerful

oblique palm Jun 21, 2021, 9:49 PM

#

would be inside a supermarket so honestly we are considering one type of configuration haha

desert oar Jun 21, 2021, 9:49 PM

#

so you want to have a camera that takes a photo of the banana display every day, and sends an alarm when they look overripe?

oblique palm Jun 21, 2021, 9:50 PM

#

More like live

#

or maybe pictures every hour

#

and yeah, send an alarm when they look overripe

desert oar Jun 21, 2021, 9:51 PM

#

it might be a fun hobby project, but can't an employee go check the bananas? i know they have a lot to do in a grocery store already, but it's a pretty quick task with a quick yes/no answer. of course the downside is that employees might let the bananas go bad because they're lazy and don't want to throw them out if they say "yes"

#

you still will want to account for day/night lighting conditions

#

what if someone in a brown shirt walks in front of the camera?

oblique palm Jun 21, 2021, 9:52 PM

#

the problem we are trying to solve here is that in my country, bananas and other fruits don't look very good when they are ripe, but they are still good to eat, so stores throw them in the trash because they don't meet their "standards" to be sold

#

and these could be donated or sold cheaper to institutions that help poor people

wet folio Jun 21, 2021, 11:21 PM

#

I'm using PyTorch XLA to port a training script from using a CPU/GPU to a TPU. Is there a way to use a ParallelLoader on more than one variable?

livid venture Jun 21, 2021, 11:24 PM

#

Hey! I’m not sure if this is the right channel but it’s my first post so please be lenient with me this time 😄
I have a problem with an unusual task… I have a dataframe like this:

    id     name
0     962966     A
1     402171     A
2     478034     B
3     936505     B
4     516152     C
5     379497     C
6     977649     D
7     869046     D

Now what I have to do is split this dataframe by name into several multi-sheet Excel files... So for example in my case I would like to have 4 files (named: A, B, C, D), each with 2 sheets named by id inside (for example A: 962966 and 402171)

This is my code (random_df is only to fill up sheets with some data):

ExcelWriter = 0

for index, row in df.iterrows():
    random_df = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns = list("1234"))
    
    if ExcelWriter != pd.ExcelWriter(row["name"] + ".xlsx"):
        
        if ExcelWriter != 0:
            ExcelWriter.save()
        
        ExcelWriter = pd.ExcelWriter(row["name"] + ".xlsx")

        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))
        
    else:
        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))

ExcelWriter.save()

The result I get is almost fine because this code generates 4 Excel files, but each one has only a single sheet, named after the last id for that name... it looks like the data is being overwritten by the next ones, but I have no idea how to fix this because I'm a complete beginner in pandas 😄 Do you have any ideas?

limpid oxide Jun 21, 2021, 11:39 PM

#

For help you should ask in one of the help channels. If you want to know how to use them, see #❓｜how-to-get-help

serene scaffold Jun 21, 2021, 11:45 PM

#

@limpid oxide you can ask for data science help in these channels

#

@livid venture you can do a groupby and then iterate over that. That will be faster than trying to fumble around by row.
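A rough sketch of the groupby idea. In the posted loop, `ExcelWriter != pd.ExcelWriter(...)` compares against a brand-new writer object on every row, so the condition is always true and each row starts its file over. Opening one writer per name and adding all of that name's sheets before closing it avoids that. `plan_workbooks` and `write_workbooks` are illustrative names, and writing the files assumes an Excel engine such as openpyxl is installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [962966, 402171, 478034, 936505],
    "name": ["A", "A", "B", "B"],
})

def plan_workbooks(df):
    """Map each output file name to the sheet names it should contain."""
    return {
        f"{name}.xlsx": [str(i) for i in group["id"]]
        for name, group in df.groupby("name")
    }

def write_workbooks(df, rng):
    """Write one workbook per name with one sheet per id (needs an Excel engine)."""
    for name, group in df.groupby("name"):
        # One writer per file, opened once, so later sheets do not replace earlier ones.
        with pd.ExcelWriter(f"{name}.xlsx") as writer:
            for sheet_id in group["id"]:
                filler = pd.DataFrame(rng.integers(0, 100, (5, 4)), columns=list("1234"))
                filler.to_excel(writer, sheet_name=str(sheet_id))

print(plan_workbooks(df))
# write_workbooks(df, np.random.default_rng())  # uncomment to actually create the files
```

The `with` block saves and closes each workbook when it exits, so no explicit `save()` call is needed.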

thorn bobcat Jun 21, 2021, 11:47 PM

#

can computers create pixar movies

serene scaffold Jun 21, 2021, 11:47 PM

#

@thorn bobcat not autonomously

thorn bobcat Jun 21, 2021, 11:48 PM

#

style transfer looks like a filter to me tbh

#

it feels like AI is a long way from where I want it tbh.

serene scaffold Jun 21, 2021, 11:51 PM

#

thorn bobcat it feels like AI is a long way from where I want it tbh.

You want it to be able to make two-hour animations with coherent plots?

thorn bobcat Jun 21, 2021, 11:52 PM

#

I want it to be able to convert real life videos into pixar movie scenes

#

want to bring the power of animation studios to ordinary users.

#

I might have to apply this to individual frames containing just the foreground.

#

I'll also have to extract the background so I can control the setting.

#

something like this but for video

#

https://github.com/CMU-Perceptual-Computing-Lab/openpose
could work for cgi

GitHub

CMU-Perceptual-Computing-Lab/openpose

OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation - CMU-Perceptual-Computing-Lab/openpose

languid chasm Jun 22, 2021, 12:20 AM

#

You're model has >93% accuracy, >93% sensitivity, >93% specificity AND >93% precision?

#

Your*

charred umbra Jun 22, 2021, 1:29 AM

#

yes, I had very few false positives and false negatives

cedar sun Jun 22, 2021, 1:48 AM

#

Can u always overfit a model?

charred umbra Jun 22, 2021, 1:59 AM

#

cedar sun Can u always overfit a model?

You in theory could, but you probably don't want to

visual violet Jun 22, 2021, 2:49 AM

#

distance measurement methods for time-series clustering algo calculate the physical distance in space between two objects right?

serene scaffold Jun 22, 2021, 3:09 AM

#

visual violet distance measurement methods for time-series clustering algo calculate the physi...

you might be thinking of euclidean distance

visual violet Jun 22, 2021, 3:09 AM

#

yes

#

i am trying to make sure the way i am describing things is correct

serene scaffold Jun 22, 2021, 3:10 AM

#

visual violet distance measurement methods for time-series clustering algo calculate the physi...

what are you using to determine their distance? a function from a certain library?

visual violet Jun 22, 2021, 3:11 AM

#

tslearn

#

metric = dtw

#

algo = k-means

#

i am in the process of writing the actual paper

serene scaffold Jun 22, 2021, 3:11 AM

#

dtw?

visual violet Jun 22, 2021, 3:11 AM

#

dynamic time warp

#

works similarly to euclidean

#

but more lenient with time-series data

serene scaffold Jun 22, 2021, 3:11 AM

#

https://tslearn.readthedocs.io/en/stable/gen_modules/metrics/tslearn.metrics.dtw.html

visual violet Jun 22, 2021, 3:12 AM

#

*at least that is what i got from my research

serene scaffold Jun 22, 2021, 3:12 AM

#

"DTW is computed as the Euclidean distance between aligned time series"

visual violet Jun 22, 2021, 3:13 AM

#

would be nice if they define what Xi and Y j are

serene scaffold Jun 22, 2021, 3:13 AM

#

visual violet would be nice if they define what Xi and Y j are

elements of an array

#

the ith element of X and the jth element of Y

#

I think. let me check

pulsar hull Jun 22, 2021, 3:14 AM

#

i want to try out machine learning and as a final goal I want to try to make a chatbot. I tried following a tutorial for TensorFlow but i didn't really understand it and was mostly just copying code from the tutorial, anyone know a good place to start and/or a good TensorFlow tutorial for something simple?

visual violet Jun 22, 2021, 3:15 AM

#

https://github.com/tslearn-team/tslearn/blob/60a39f2/tslearn/metrics/dtw_variants.py#L387-L468
so yeah after reading this , i should be able to figure out

GitHub

tslearn-team/tslearn

A machine learning toolkit dedicated to time-series data - tslearn-team/tslearn

serene scaffold Jun 22, 2021, 3:15 AM

#

Yes, it's the ith element of X and the jth element of Y, where each (i, j) tuple comes from the set of tuples that most closely align with each other

visual violet Jun 22, 2021, 3:15 AM

#

UPUY_dSWBpH3LM_ujmZAHhiFQdArEwklCUA-wOFSqBRo1Y4SFtnD5io397_Iw3YREocm_EkDPEUgKU3sDIMnZdU.png

#

sheesh

#

it is just euclidian no more no less

serene scaffold Jun 22, 2021, 3:16 AM

#

visual violet it is just euclidian no more no less

that's only for two dimensions

visual violet Jun 22, 2021, 3:16 AM

#

it works the same way

TheDistanceFormulaItlooksthesameinspaceasitdidbeforeexceptwithathirdcoordinate3A.png

serene scaffold Jun 22, 2021, 3:16 AM

#

yes

visual violet Jun 22, 2021, 3:16 AM

#

that is why the zigma is there

#

damn big brain

serene scaffold Jun 22, 2021, 3:17 AM

#

the sigma? yes.

visual violet Jun 22, 2021, 3:17 AM

#

suppose i have two objects/time-series

#

1: 1 3 4 7
2: 2 4 8 9

#

X is 1 and Y is 2

#

i am pretty sure

#

the distance would be sqrt ( |2-1|^2 + |4-3|^2 + |8-4|^2 + |9-7|^2 )

serene scaffold Jun 22, 2021, 3:19 AM

#

visual violet 1: 1 3 4 7 2: 2 4 8 9

try passing those as arrays to the dtw function and see if you get the euclidean distance

serene scaffold Jun 22, 2021, 3:19 AM

#

visual violet the distance would be sqrt ( |2-1|^2 + |4-3|^2 + |8-4|^2 + |9-7|^2 )

would be easier to do it as np.sqrt((np.abs(a - b) ** 2).sum())

visual violet Jun 22, 2021, 3:20 AM

#

what i am wondering is why make up a fancy name

#

and use the same century-old method

serene scaffold Jun 22, 2021, 3:20 AM

#

visual violet what i am wondering is why make up a fancy name

which part is fancy? euclidean?

visual violet Jun 22, 2021, 3:20 AM

#

the name

#

DTW

#

"time warping"

#

may as well name it time bending

serene scaffold Jun 22, 2021, 3:20 AM

#

visual violet DTW

well, there's the optimal alignment path

#

idk what that is.

visual violet Jun 22, 2021, 3:22 AM

#

the problem is

#

i don't know how to test the metric

#

like there is 0 example code

serene scaffold Jun 22, 2021, 3:22 AM

#

#

it's on the website

visual violet Jun 22, 2021, 3:23 AM

#

oh lord

#

i am dumb

visual violet Jun 22, 2021, 3:25 AM

#

serene scaffold would be easier to do it as `np.sqrt((np.abs(a - b) ** 2).sum())`

shouldn't make this into a loop

#

for x, y in arr1, arr2:

#

is that legal python syntax?

serene scaffold Jun 22, 2021, 3:26 AM

#

visual violet shouldn't make this into a loop

no, numpy does the looping internally.
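For reference, `for x, y in arr1, arr2:` does parse, but it iterates over the tuple `(arr1, arr2)` and tries to unpack each whole array into two names, which fails for four-element arrays. Pairing elements needs `zip`. A small sketch using the two series from earlier in the conversation (`loop_distance` and `vector_distance` are illustrative names):

```python
import numpy as np

arr1 = np.array([1, 3, 4, 7])
arr2 = np.array([2, 4, 8, 9])

# Explicit loop: zip pairs up the i-th elements of both arrays.
total = 0.0
for x, y in zip(arr1, arr2):
    total += abs(y - x) ** 2
loop_distance = total ** 0.5

# Vectorized: numpy subtracts element-wise and sums in compiled code.
vector_distance = np.sqrt((np.abs(arr1 - arr2) ** 2).sum())

print(loop_distance, vector_distance)  # both are sqrt(22)
```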

#

!e

import numpy as np

a = np.random.random((5,))
b = np.random.random((5,))
print(a, b)

print(np.sqrt((np.abs(a - b) ** 2).sum()))

arctic wedgeBOT Jun 22, 2021, 3:27 AM

#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

001 | [0.98432056 0.62829273 0.62277413 0.44841935 0.8191274 ] [0.33321332 0.25763163 0.75565997 0.36456538 0.89677632]
002 | 0.76944770610025

visual violet Jun 22, 2021, 3:28 AM

#

life is not always simple huh

serene scaffold Jun 22, 2021, 3:28 AM

#

I don't read screenshots of code, but I'll give you some more code to illustrate the earlier point

#

!e

import numpy as np

a = np.random.random((5,))
b = np.random.random((5,))

print(a - b)

arctic wedgeBOT Jun 22, 2021, 3:29 AM

#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

[-0.55704989 -0.41628915 -0.04400796 -0.552509    0.38075351]

visual violet Jun 22, 2021, 3:29 AM

#

that is very neat

#

when the documentation tells you what it is supposed to do

#

but it doesn't function like the doc when you test it

serene scaffold Jun 22, 2021, 3:30 AM

#

I mean, the docs didn't say that dtw is exactly like euclidean distance

#

"DTW is computed as the Euclidean distance between aligned time series, i.e., if π is the optimal alignment path:"

#

idk what an optimal alignment path is.

#

my earlier guess was wrong, apparently

#

(and not very well stated)

visual violet Jun 22, 2021, 3:40 AM

#

ya iam taking a look

#

trying to read the source code

#

makes absolutely 0 sense

desert oar Jun 22, 2021, 3:58 AM

#

I believe it's the alignment with the lowest distance such that the order of points is preserved

#

Don't quote me on that

visual violet Jun 22, 2021, 4:01 AM

#

yup

#

i mean i don't know

#

but

#

when the array elements are identical

#

like [1,1,1,1] and [2,2,2,2]

#

dtw would give the same result as euclidean

#

but when things start to be weird, they diverge
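The observation above can be checked with a small from-scratch DTW. This is a textbook dynamic-programming sketch, not tslearn's implementation. It follows the definition quoted from the docs: the square root of the summed squared differences along the cheapest alignment that preserves the order of points. `dtw_distance` and `euclidean` are illustrative helper names.

```python
import numpy as np

def dtw_distance(x, y):
    """Textbook DTW: sqrt of the smallest summed squared difference over monotonic alignments."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of: a diagonal match, or reusing an element of x or of y.
            acc[i, j] = cost + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return float(np.sqrt(acc[n, m]))

def euclidean(x, y):
    return float(np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum()))

flat_a, flat_b = [1, 1, 1, 1], [2, 2, 2, 2]
a, b = [1, 3, 4, 7], [2, 4, 8, 9]

print(dtw_distance(flat_a, flat_b), euclidean(flat_a, flat_b))  # both 2.0
print(dtw_distance(a, b), euclidean(a, b))  # sqrt(7) vs sqrt(22)
```

For the constant series every pairing costs the same, so the straight one-to-one alignment is already optimal and both distances agree. For the second pair, DTW is allowed to match 4 with 4 and 7 with 8, which is why it comes out smaller than the Euclidean distance.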

dire frost Jun 22, 2021, 6:03 AM

#

where should i start with ai?
i watched this video but i cant find any python tutorials for ai
https://youtu.be/JMUxmLyrhSk

YouTube

edureka!

Artificial Intelligence Full Course | Artificial Intelligence Tutor...

🔥 Machine Learning Engineer Masters Program (Use Code "𝐘𝐎𝐔𝐓𝐔𝐁𝐄𝟐𝟎"): https://www.edureka.co/masters-program/machine-learning-engineer-training
This Edureka video on "Artificial Intelligence" will provide you with a comprehensive and detailed knowledge of Artificial Intelligence concepts with hands-on examples.

Following topics are covered in th...


lapis sequoia Jun 22, 2021, 6:29 AM

#

what should i learn to go for ML?

#

field

slow comet Jun 22, 2021, 7:30 AM

#

Hi! I'm looking for a data cleaning library. I found HoloClean but it contains some TODOs in the code and the GitHub repo hasn't been updated since April 2019..

slow comet Jun 22, 2021, 7:37 AM

#

lapis sequoia what should i learn to go for ML?

Recently I talked with a friend that did the ML course of Deeplearning AI in coursera and he told me that the content is a good starting point

#

the course is free unless you want to have the certificate

viral scroll Jun 22, 2021, 10:20 AM

#

Is this the correct channel to ask Pandas related queries...??

uncut barn Jun 22, 2021, 10:31 AM

#

Are there any good resources to understand context free grammars and logical grammars?

random mist Jun 22, 2021, 10:48 AM

#

viral scroll Is this the correct channel to ask Pandas related queries...??

yah sure!

#

for quicker help start a help channel

novel elbow Jun 22, 2021, 11:46 AM

#

viral scroll Is this the correct channel to ask Pandas related queries...??

go ahead

viral scroll Jun 22, 2021, 11:47 AM

#

I have a pandas dataframe like below:
company name role level role keywords
0 Company1 Director director
1 Company2 Developer developer
2 * Developer Engineer

and a list of companies like ['Google', 'Apple', 'Facebook']

I want to write an operation in pandas to find the rows where company name has "*" and replace it with individual companies

the output should be

company name role level role keywords
0 Company1 Director director
1 Company2 Developer developer
2 Google Developer Engineer
3 Apple Developer Engineer
4 Facebook Developer Engineer

#

Thanks in advance

novel elbow Jun 22, 2021, 11:51 AM

#

so for each row in the original dataframe with the "*" field you will replace it with 3 rows (with google, apple, fb) ?

viral scroll Jun 22, 2021, 11:51 AM

#

Yes, the company list is dynamic it can be n rows. I have added 3 companies for example

novel elbow Jun 22, 2021, 11:54 AM

#

ok, first I will divide the data into the ones that have companies and the ones with "*" then with a function create copies replacing "*" with the names in your list and finally concatenate all the dataframes

#

should look like this

#

# df = # your dataframe

mask = df['company name'] == "*"
df_withname,df_noname = df.loc[~mask],df.loc[mask]

names = ['Google', 'Apple', 'Facebook']

def add_name(df, name):
    df = df.copy()
    df['company name'] = name
    return df

df_final = pd.concat([df_withname] + [add_name(df_noname, o) for o in names])

viral scroll Jun 22, 2021, 12:05 PM

#

Thanks a lot

#

Let me try this

serene scaffold Jun 22, 2021, 12:50 PM

#

@novel elbow I think there's a simpler solution.

#

@viral scroll how do you know what company you're replacing a given row with?

#

Eh maybe not. Though I'm confused as to how one would arrive at this particular problem

viral scroll Jun 22, 2021, 12:57 PM

#

serene scaffold @novel elbow I think there's a simpler solution.

Basically I take query inputs from user using an excel sheet and the user has an option to either specify search condition for specific company or use "*" to apply the search condition to all the companies

#

so in my code i need to replace * with multiple rows for individual company so that I can do an inner join

serene scaffold Jun 22, 2021, 12:58 PM

#

You could replace the asterisk with a python list and then do explode

#

I'm on mobile so I can explain more in a bit

viral scroll Jun 22, 2021, 12:58 PM

#

Sure, I will wait for your response

#

Thanks

normal grove Jun 22, 2021, 1:13 PM

#

dire frost where should i start with ai? i watched this video but i cant find any python tu...

https://www.coursera.org/learn/ai-for-everyone?
If you can afford to take this course. Or else you can enroll in each course separately but you wont get a certificate after completion.

Coursera

AI For Everyone

Offered by DeepLearning.AI. AI is not only for engineers. If you want your organization to become better at using AI, this is the course to ... Enroll for free.

serene scaffold Jun 22, 2021, 1:13 PM

#

viral scroll Sure, I will wait for your response

In [40]: df.loc[(df['company'] == '*'), 'company'] = [['Google', 'Apple', 'Facebook']]

In [41]: df
Out[41]: 
                     company       role      level
0                   Company1   Director   director
1                   Company2  Developer  developer
2  [Google, Apple, Facebook]  Developer   Engineer

In [42]: df.explode('company')
Out[42]: 
    company       role      level
0  Company1   Director   director
1  Company2  Developer  developer
2    Google  Developer   Engineer
2     Apple  Developer   Engineer
2  Facebook  Developer   Engineer

#

You have to use a nested list for the first step though
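A self-contained variant of the explode idea, in case the `.loc` assignment with a nested list behaves differently on your pandas version. It uses `apply` to swap each `*` for the list, which is a substitution of mine rather than what was posted above. The column names follow the shortened ones in the session output.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["Company1", "Company2", "*"],
    "role": ["Director", "Developer", "Developer"],
    "level": ["director", "developer", "Engineer"],
})
companies = ["Google", "Apple", "Facebook"]

# Swap each "*" for the whole list, leaving real company names untouched.
df["company"] = df["company"].apply(lambda c: companies if c == "*" else c)

# explode turns every list element into its own row; reset_index renumbers the rows.
expanded = df.explode("company").reset_index(drop=True)
print(expanded)
```

Because the list is just a Python variable, this works unchanged when the company list grows to n entries.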

viral scroll Jun 22, 2021, 1:15 PM

#

Sure, this will work

#

Thanks

visual violet Jun 22, 2021, 1:37 PM

#

good morning stelercus

#

do you think hierarchical and k-means will produce different results?

hard hound Jun 22, 2021, 1:47 PM

#

you didn't ask me but still I will answer - probably

ember sapphire Jun 22, 2021, 2:16 PM

#

can anybody help me figure out why my kmeans implementation is not converging?

visual violet Jun 22, 2021, 2:32 PM

#

holy cow

visual violet Jun 22, 2021, 2:32 PM

#

ember sapphire can anybody help me figure out why my kmeans implementation is not converging?

how do you check if it converges or not

#

i am doing the exact same thing

#

it is not working

ember sapphire Jun 22, 2021, 2:33 PM

#

i just print the centroid locations and cost function after each iteration

#

it just jumps around

visual violet Jun 22, 2021, 2:41 PM

#

can you share the code?

inland zephyr Jun 22, 2021, 2:41 PM

#

good evening guys i need some suggestion about my small project

#

i have a small experiment to test whether a CNN works with a small amount of data. I have 38 samples from 2 classes, sick and healthy (coded as 0 and 1), with 20 from class 0 and 18 from class 1. Each sample is a long ECG record (about 1 day). The class 0 records have an anchor point marking where the episode of sickness happens, but the class 1 records are healthy, so there is no marking point on them

#

i want to check whether the n minutes before the anchor point of a class 0 record can be detected as sick (0) and not healthy (1), so from the 20 i have 10 train, 2 valid and 8 test for n minutes. I also take arbitrary segments from the 18 healthy records, so i have 9 train, 2 valid and 7 test. Since the CNN will get very little data for training, I try a simple network

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d (Conv1D)              (None, 15352, 128)        1280      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 5117, 128)         0         
_________________________________________________________________
dropout (Dropout)            (None, 5117, 128)         0         
_________________________________________________________________
flatten (Flatten)            (None, 654976)            0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1309954   
=================================================================
Total params: 1,311,234
Trainable params: 1,311,234
Non-trainable params: 0
_________________________________________________________________

When i run the training, the training accuracy are pretty fast to decrease but the validation result are very low

#

#

but when i run the test, the result is promising

#

I know this is a sign of overfitting, but given the accuracy on the test set I am still confused about whether to add more epochs or rearrange the model.

#

the input is 1D with 15360 features

visual violet Jun 22, 2021, 2:57 PM

#

jesus

inland zephyr Jun 22, 2021, 2:58 PM

#

The result pretty funny but frightening at the same time

#

and still dunno what to do next

desert oar Jun 22, 2021, 3:02 PM

#

look at some examples

visual violet Jun 22, 2021, 3:03 PM

#

desert oar look at some examples

are you a data scientist?

desert oar Jun 22, 2021, 3:03 PM

#

yes

#

sorry i didn't mean examples of model code

#

i literally meant examples of data where the predictions are wrong

#

it can be very enlightening to see why, qualitatively, the model is failing

#

maybe you don't have enough data - i have no idea if CNNs can be trained from scratch on so few datapoints, i would assume that they can't

#

or maybe your train/test split is f'ed up

visual violet Jun 22, 2021, 3:05 PM

#

What is arima?

desert oar Jun 22, 2021, 3:05 PM

#

who asked you this?

visual violet Jun 22, 2021, 3:05 PM

#

oh i asked people

desert oar Jun 22, 2021, 3:05 PM

#

for what?

#

also it's silly that prophet isn't "interpretable" it's a goddamn linear regression, it's more interpretable than ARIMA

#

anyway ARIMA is "AutoRegressive + Integrated + Moving Average" - basically a family of models you can use to fit a single time series

#

there's kind of a lot to be said about time series modeling and ARIMA specifically
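To make the "AR" and "I" parts concrete, here is a NumPy-only sketch that simulates an ARIMA(1, 1, 0) series, differences it once, and estimates the autoregressive coefficient by least squares. It leaves out the moving-average part entirely. The coefficient 0.6 and the series length are arbitrary choices, and a real analysis would normally use a dedicated library such as statsmodels rather than this hand-rolled fit.

```python
import numpy as np

rng = np.random.default_rng(0)
true_phi, n = 0.6, 5000  # arbitrary illustration values

# Simulate an ARIMA(1, 1, 0) series: the first differences follow an AR(1) process.
noise = rng.normal(size=n)
diffs = np.zeros(n)
for t in range(1, n):
    diffs[t] = true_phi * diffs[t - 1] + noise[t]
series = np.cumsum(diffs)  # summing the differences gives the observed series

# "I": difference once to undo the integration.
d = np.diff(series)

# "AR": regress each difference on the previous one (least squares, no intercept).
phi_hat = float((d[:-1] @ d[1:]) / (d[:-1] @ d[:-1]))
print(round(phi_hat, 3))
```

Note that this fits one series at a time, which is the point made in the conversation: ARIMA describes a single time series, so it does not by itself group many series into clusters.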

visual violet Jun 22, 2021, 3:07 PM

#

The dataset was a matrix of the weighted AMP of the 724 ingredients from 2016 to 2020 - tables where rows represent drug ingredients, columns represent the time, such as the second quarter of 2017, and numbers in each cell characterize the price of the particular ingredient in the particular year.
When i do k-means clustering with the dynamic time warping metric, it puts the majority of the ingredients into one cluster, which i definitely do not want. so i asked the person and he told me that

desert oar Jun 22, 2021, 3:07 PM

#

what's the point of doing arima then?

inland zephyr Jun 22, 2021, 3:07 PM

#

desert oar maybe you don't have enough data - i have no idea if CNNs can be trained from sc...

i using the split and experimental flow based on paper (this is what i pretty hate why they don't clearly tell what they do in their experiment), they use 50:20:30 for splitting not based on the data chunk but the record

visual violet Jun 22, 2021, 3:07 PM

#

to somehow break the big cluster up?

desert oar Jun 22, 2021, 3:07 PM

#

inland zephyr i using the split and experimental flow based on paper (this is what i pretty ha...

but the paper said it worked with as few as 20 data points in the training set?

inland zephyr Jun 22, 2021, 3:07 PM

#

desert oar but the paper said it worked with as few as 20 data points in the training set?

they claimed yes

desert oar Jun 22, 2021, 3:08 PM

#

inland zephyr they claimed yes

are you trying to replicate the paper? or are you trying to use their technique on your own data?

desert oar Jun 22, 2021, 3:08 PM

#

visual violet to somehow break the big cluster up?

that doesn't make a ton of sense to me, ARIMA is a way to model and characterize one time series

inland zephyr Jun 22, 2021, 3:08 PM

#

desert oar are you trying to replicate the paper? or are you trying to use their technique ...

i try to replicate only the flow of the experiment, with the same data that they use

desert oar Jun 22, 2021, 3:09 PM

#

can you link the paper?

inland zephyr Jun 22, 2021, 3:09 PM

#

since this is a health-related project, so a golden standard data is mandatory to use (although imho, the data are pretty old and very small)

visual violet Jun 22, 2021, 3:10 PM

#

desert oar that doesn't make a ton of sense to me, ARIMA is a way to model and characterize...

so what do you suggest me do?

#

i am kinda lost at the moment

inland zephyr Jun 22, 2021, 3:10 PM

#

this is the example, https://pubmed.ncbi.nlm.nih.gov/30117048/ but i only using their testing flow and the data

PubMed

A Novel Wavelet Transform-Homogeneity Model for Sudden Cardiac Deat...

Sudden cardiac death (SCD) is one of the main causes of death among people. A new methodology is presented for predicting the SCD based on ECG signals employing the wavelet packet transform (WPT), a signal processing technique, homogeneity index (HI), a nonlinear measurement for time series signals, …

visual violet Jun 22, 2021, 3:10 PM

#

like https://cdn.discordapp.com/attachments/852558685590650951/856633177136824350/unknown.png

#

very sad

desert oar Jun 22, 2021, 3:10 PM

#

visual violet so what do you suggest me do?

i don't know because i don't know what this person is responding to. what exactly did you ask, that prompted this response? also what ultimately are you trying to do again? i know you were getting into clustering time series but i don't remember what the purpose of this was

desert oar Jun 22, 2021, 3:11 PM

#

inland zephyr this is the example, https://pubmed.ncbi.nlm.nih.gov/30117048/ but i only using ...

what do you mean by "flow"?

inland zephyr Jun 22, 2021, 3:11 PM

#

desert oar what do you mean by "flow"?

the testing flow, how they parted the data for training and testing purpose

desert oar Jun 22, 2021, 3:11 PM

#

ok, but it looks like the paper uses a purpose-built algorithm based on some specific signal processing stuff

#

i don't know that it makes sense to just replace that with a CNN

inland zephyr Jun 22, 2021, 3:12 PM

#

desert oar ok, but it looks like the paper uses a purpose-built algorithm based on some spe...

since CNN can make the feature

visual violet Jun 22, 2021, 3:12 PM

#

OHH i asked "how would you proceed to predict the future with the given dataset"

inland zephyr Jun 22, 2021, 3:12 PM

#

that's my opinion based on what i have read about CNN

desert oar Jun 22, 2021, 3:13 PM

#

visual violet OHH i asked "how would you proceed to predict the future with the given dataset"

ok, but that's a completely unrelated task to clustering. can you back up and provide a more complete explanation of what's happening here? i feel like i'm only seeing random little snippets of what you're working on.

inland zephyr Jun 22, 2021, 3:13 PM

#

the idea is: what if we feed in the raw signal without preprocessing the data (preprocessing means we can lose some important signs) and let the CNN infer the condition

desert oar Jun 22, 2021, 3:13 PM

#

i would imagine that the CNN probably can't learn much from 20 data points

inland zephyr Jun 22, 2021, 3:14 PM

#

since CNN or deep learning method are still pretty new for ECG-related research

desert oar Jun 22, 2021, 3:14 PM

#

now, if you had 2000 datapoints of "unlabeled" data, but only 38 data points of "labeled" data, i'd suggest fitting an autoencoder on the 2000 unlabeled examples, then using transfer learning or something to fit a simpler model on the 38 labeled examples

#

maybe there are ways to train the autoencoder simultaneously with the small number of labeled data points - this was a thing several years ago called "semi-supervised learning", but it dates back to the SVM era and it was of questionable value back then

inland zephyr Jun 22, 2021, 3:17 PM

#

i'm cautious about that, since with a more classical approach (preprocessing the data into several features and using traditional machine learning methods like decision tree or random forest) the results are still better than the CNN's

visual violet Jun 22, 2021, 3:17 PM

#

desert oar ok, but that's a completely unrelated task to clustering. can you back up and pr...

My research question: How do characteristics in the pharmaceutical industry, such as new drug development, patent expiration, and drug classifications influence the similarities and dissimilarities among both drugs' prices and drug prices' percentage difference over time. I am still under the process of revising my question, so i hope it makes sense

inland zephyr Jun 22, 2021, 3:18 PM

#

visual violet My research question: How do characteristics in the pharmaceutical industry, suc...

breakdown into more specific one:
maybe like: Is there any group of drug development which is more focused on specific field, or what kind of drug class mostly made by the industry

visual violet Jun 22, 2021, 3:18 PM

#

like i want to see if there is a trend in price in group of certain disease-targeted drug

desert oar Jun 22, 2021, 3:19 PM

#

inland zephyr i have cautious with that, since i using more classical method like preprocess t...

yeah that doesn't surprise me. like i said, the CNN might just need more data to train. what's the size of the convolution "window"?

inland zephyr Jun 22, 2021, 3:19 PM

#

desert oar yeah that doesn't surprise me. like i said, the CNN might just need more data to...

i using 128 feature with 9 window

#

so since the data are pretty small but complex dimension, i decide with the simpler one

#

no much layer and just dump the feature from the convo - maxpool - dropout into dense with 2 class

desert oar Jun 22, 2021, 3:22 PM

#

so you have a model that needs to learn 1280 parameters from 20 data points, and that's not even including the dense final layer

#

that seems like a losing proposition to me

#

are you at least using regularization to train it?

#

128 features as in, each example is a single time series of 128 data points?

#

or wait, you have 128 "channels" in the conv1d?

visual violet Jun 22, 2021, 3:24 PM

#

interesting. i don't understand a single word you guys are saying

inland zephyr Jun 22, 2021, 3:24 PM

#

desert oar or wait, you have 128 "channels" in the conv1d?

wait

#

import tensorflow as tf
from tensorflow.keras import Sequential, layers

def GetModel(shp):
    # shp describes X; shp[1] is the length of each series (15360 here)
    model = Sequential()
    model.add(layers.Conv1D(filters=128, kernel_size=9, activation=tf.nn.leaky_relu, input_shape=[shp[1], 1]))
    model.add(layers.MaxPool1D(pool_size=3))
    model.add(layers.Dropout(rate=0.7))
    model.add(layers.Flatten())
    model.add(layers.Dense(2, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=['accuracy', 'mse'])
    return model

#

the shp is the X, [1] is 15360
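
For reference, the 1280 figure quoted above can be reproduced by hand, and the same arithmetic gives a rough size for the dense layer. This is only a sketch: it assumes the Keras defaults for the model as posted ("valid" padding, stride 1 for the convolution, pooling stride equal to the pool size) and an input length of 15360.

```python
# Rough parameter count for Conv1D -> MaxPool1D -> Flatten -> Dense(2).
# Assumes Keras defaults: "valid" padding, conv stride 1, pool stride = pool size.
input_length = 15360
filters, kernel_size, in_channels = 128, 9, 1

conv_params = filters * kernel_size * in_channels + filters  # weights + biases
conv_length = input_length - kernel_size + 1                 # "valid" convolution
pool_length = conv_length // 3                               # MaxPool1D(pool_size=3)
flat_units = pool_length * filters                           # after Flatten
dense_params = flat_units * 2 + 2                            # weights + biases

print(conv_params, flat_units, dense_params)
```

Under those assumptions the dense layer is far larger than the convolution, which is the point being made about fitting this on a few dozen examples.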

desert oar Jun 22, 2021, 3:27 PM

#

visual violet My research question: How do characteristics in the pharmaceutical industry, suc...

ok, so these are 2 separate tasks under the same "umbrella", thanks for clarifying. i think you're on the right track with the clustering, although i think k-means is almost always the wrong choice. i recommend starting by making lots and lots of plots

#

obviously you're already started, but still make lots and lots of plots

#

so your k-means results are bad, why? are they really bad? or are they sensible given the data and # of clusters?

#

did you try other numbers of clusters?

#

did you try k-medians? etc.

inland zephyr Jun 22, 2021, 3:28 PM

#

plotting the data is the most easiest one to do, and also easiest to understand too

visual violet Jun 22, 2021, 3:29 PM

#

yes sir. similarity/dissimilarity in price and price percentage difference -> two tasks

desert oar Jun 22, 2021, 3:29 PM

#

did you try "soft" clustering like HDBSCAN? self-organizing maps? did you look into https://github.com/fpetitjean/DBA to see if there's any common overall shape?

GitHub

fpetitjean/DBA

DBA: Averaging for Dynamic Time Warping. Contribute to fpetitjean/DBA development by creating an account on GitHub.

inland zephyr Jun 22, 2021, 3:29 PM

#

visual violet yes sir. similarity/dissimilarity in price and price percentage difference -> t...

this is your feature. Plot it using scatter plot and see if there're some pattern

visual violet Jun 22, 2021, 3:30 PM

#

https://cdn.discordapp.com/attachments/852558685590650951/856632891483488326/unknown.png
https://cdn.discordapp.com/attachments/852558685590650951/856633177136824350/unknown.png

#

is this sensible?

desert oar Jun 22, 2021, 3:30 PM

#

you can also try fitting individual time series models to each series, then looking at the models as a dataset of its own. you can fit all kinds of forecast models like exponential smoothing, arima, etc. and then look at the distribution of model parameters for example

visual violet Jun 22, 2021, 3:30 PM

#

it feel like it doesn't

desert oar Jun 22, 2021, 3:30 PM

#

you're just looking at the cluster memberships

#

how do you know those make sense?

#

did you try using dimension reduction to plot these and color them by cluster membership?

visual violet Jun 22, 2021, 3:30 PM

#

i didn't think it is necessary since the dataset only has 20 dimensions or 20 columns

desert oar Jun 22, 2021, 3:30 PM

#

that doesn't make sense

inland zephyr Jun 22, 2021, 3:31 PM

#

20 if not have importance are pretty useless to predict

visual violet Jun 22, 2021, 3:31 PM

#

so i have to find more data?

desert oar Jun 22, 2021, 3:31 PM

#

i'm not sure what you mean by that @inland zephyr

visual violet Jun 22, 2021, 3:31 PM

#

desert oar did you try using dimension reduction to plot these and color them by cluster me...

i will share pic. one sec please

desert oar Jun 22, 2021, 3:31 PM

#

you mean that there are 20 time points in each time series @visual violet ?

visual violet Jun 22, 2021, 3:31 PM

#

yes salt

#

they all have equal time length 2016 quarter one to 2020 quarter 4

desert oar Jun 22, 2021, 3:32 PM

#

and are they all measured at the same time points?

inland zephyr Jun 22, 2021, 3:32 PM

#

desert oar i'm not sure what you mean by that @inland zephyr

oh i mean feature importance from the 20 feature, not a 20 time points. My bad

visual violet Jun 22, 2021, 3:32 PM

#

yes

#

same time points

desert oar Jun 22, 2021, 3:32 PM

#

ok, that makes things easier

visual violet Jun 22, 2021, 3:32 PM

#

oh so the vocab is time point

desert oar Jun 22, 2021, 3:32 PM

#

eh, there isn't a single term for it

visual violet Jun 22, 2021, 3:32 PM

#

lol i have a tough time describing things

desert oar Jun 22, 2021, 3:33 PM

#

that's the term i use

#

naming things is hard

inland zephyr Jun 22, 2021, 3:33 PM

#

why dont plot a line bar to see the movement? since the data is time related right?

desert oar Jun 22, 2021, 3:33 PM

#

@inland zephyr they have 600+ individual time series with 20 time points in each

arctic wedgeBOT Jun 22, 2021, 3:34 PM

#

Hey @visual violet!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

visual violet Jun 22, 2021, 3:34 PM

#

oh shit i can't share excel lol

inland zephyr Jun 22, 2021, 3:34 PM

#

so it's 600*20 dimension right?

desert oar Jun 22, 2021, 3:34 PM

#

you could think of it that way, yes

visual violet Jun 22, 2021, 3:35 PM

#

desert oar did you try using dimension reduction to plot these and color them by cluster me...

#

the majority of them is at the bottom

desert oar Jun 22, 2021, 3:36 PM

#

great, use a log scale for y axis

#

price is never exactly 0 right?

inland zephyr Jun 22, 2021, 3:36 PM

#

thats what i said

visual violet Jun 22, 2021, 3:36 PM

#

cluster 0 : dark blue has 665 members out of 724

inland zephyr Jun 22, 2021, 3:37 PM

#

how bout cumulated it?

#

cumsum?

desert oar Jun 22, 2021, 3:37 PM

#

im not sure what the point of cumsum on prices would be

visual violet Jun 22, 2021, 3:37 PM

#

desert oar price is never exactly 0 right?

nope since drug can't be free

#

so there gotta be a price

inland zephyr Jun 22, 2021, 3:37 PM

#

nvm, no need cumsum for this

#

the cluster has been seen easily, with majority cluster is the blue and the cyan one

#

but what is the red data? is it come from some data or it's come from one member?

desert oar Jun 22, 2021, 3:38 PM

#

right. k-means looks like it's just clustering time series based on how high the average price is

visual violet Jun 22, 2021, 3:39 PM

#

red has one member :(((

inland zephyr Jun 22, 2021, 3:39 PM

#

bingo

visual violet Jun 22, 2021, 3:39 PM

#

i don't want that since it doesn't mean much (i assume)

desert oar Jun 22, 2021, 3:39 PM

#

use a log scale y axis
replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level

inland zephyr Jun 22, 2021, 3:39 PM

#

it's the outlier maybe

desert oar Jun 22, 2021, 3:40 PM

#

so my recommendation is to normalize each price such that its starting price is 1. then your distance metric can focus on the shapes of the time series, not just the price levels

#

basically a price index for each drug

#

so if the price is $100 in qtr 1, and $125 in qtr 2, normalize that to 1 and 1.25
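
A minimal sketch of that price index, assuming the prices sit in a NumPy array with one row per drug and one column per quarter (the numbers here are made up):

```python
import numpy as np

# Hypothetical data: one row per drug, one column per quarter.
prices = np.array([
    [100.0, 125.0, 150.0],
    [2.0, 2.5, 3.0],
])

# Divide each row by its own first value so every series starts at 1.
price_index = prices / prices[:, [0]]
print(price_index)
```

Both rows end up identical after indexing, even though their raw price levels differ by a factor of 50, so a distance computed on the index only sees the shape.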

inland zephyr Jun 22, 2021, 3:41 PM

#

or maybe as @visual violet said no 0 price, the normalization can be done between min-max of the price

visual violet Jun 22, 2021, 3:41 PM

#

https://paste.pythondiscord.com/ejeruhawod.css -> for first 200 ingredients

desert oar Jun 22, 2021, 3:41 PM

#

inland zephyr or maybe as @visual violet said no 0 price, the normalization can be don...

min-max normalization is bad when you don't have a known min and max

#

this way you can take the "price level" of the drug as separate from relative changes in price over time

#

@visual violet in general i recommend the following:

the specific feature engineering described above, perhaps using the starting price level (or average or median price level) as a separate feature
try several dimension reduction techniques to visualize the time series, don't worry about "modeling" yet. such techniques include PCA and UMAP. this might mean trying out a few distance metrics such as DTW
use https://github.com/fpetitjean/DBA to see if there is any overall "shape" to the drug prices, although i doubt it based on the plot i just saw
try "soft" clustering methods such as HDBSCAN
do you have any "metadata" about the drugs? company developing it, year development started, type of drug (antibiotic, etc.), specialized vs general market, etc. that could be interesting to examine as well.
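
The dimension-reduction step in the list above could be sketched like this. It uses plain NumPy (PCA via SVD) rather than any particular library's PCA class, and the data is a synthetic stand-in for the real 724 x 20 table of log prices, so the names and numbers are assumptions.

```python
import numpy as np

def pca_2d(series):
    """Project the rows of `series` onto their first two principal components."""
    centered = series - series.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:2].T, explained[:2]

rng = np.random.default_rng(0)
# Synthetic stand-in: each series is a random price level plus small noise.
log_prices = rng.normal(size=(724, 1)) * 2.0 + rng.normal(scale=0.1, size=(724, 20))
coords, explained = pca_2d(log_prices)
print(coords.shape, explained)
```

`coords` can then go into a scatter plot colored by cluster membership. In this synthetic example the first component is almost entirely price level by construction; whether the real data looks the same is something the plot would show.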

#

VAR for 600+ time series is too big i think

#

however you could try using the correlation between time series as "distance"

visual violet Jun 22, 2021, 3:44 PM

#

@desert oar do you think i should even bother with the percentage difference dataset? here is the graph i got


desert oar Jun 22, 2021, 3:44 PM

#

specifically distance = 1 - abs(corr(ts1, ts2))
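
As a sketch of that formula, assuming the series are the rows of a NumPy array and that "corr" means Pearson correlation (the example rows are made up):

```python
import numpy as np

def correlation_distance(series):
    """Pairwise 1 - |Pearson r| between the rows of a 2-D array."""
    corr = np.corrcoef(series)  # each row is treated as one variable
    return 1.0 - np.abs(corr)

series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],  # same shape, 10x the level
    [4.0, 3.0, 2.0, 1.0],      # mirror image
    [1.0, 3.0, 2.0, 4.0],
])
dist = correlation_distance(series)
print(dist.round(3))
```

Note that the absolute value makes perfectly anti-correlated series count as distance 0, and a completely flat series has undefined correlation (NumPy returns NaN for it), so constant rows would need handling separately.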

desert oar Jun 22, 2021, 3:44 PM

#

visual violet @desert oar do you think i should even bother with the percentage dif...

do you see an obvious pattern here?

visual violet Jun 22, 2021, 3:44 PM

#

i see 0 pattern

desert oar Jun 22, 2021, 3:44 PM

#

try plotting this without the cluster coloring, but set alpha=0.2 or something so you can see them all overlaid

#

(and make the lines thinner)

#

generally you should probably do all your clustering on log prices anyway, if you're not using this price index thing

#

also - do you know what would cause these drug price fluctuations in the first place?

#

that could also guide your thinking about it

visual violet Jun 22, 2021, 3:46 PM

#

i have been doing research

#

there are no direct causes

desert oar Jun 22, 2021, 3:47 PM

#

also this is maybe a helpful reference https://www.kaggle.com/izzettunc/introduction-to-time-series-clustering

Introduction to Time Series Clustering

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

visual violet Jun 22, 2021, 3:47 PM

#

the drug economy does not follow the supply-and-demand model

#

sometimes existing drugs increase price for no reason

#

i have trouble understanding this statement "replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level" can you please explain?

#

oh i understand

#

take average for each row

#

so then i only have 724 values

desert oar Jun 22, 2021, 3:49 PM

#

yep

#

you'll probably see that the clusters are just splitting the drugs somewhat arbitrarily into average price levels

#

re-do the clustering on log price scale, if nothing else

ember sapphire Jun 22, 2021, 3:51 PM

#

im ready to kms over this kmeans implementation

#

ive read it like 500 times and cant see anything wrong but it's still wrong

visual violet Jun 22, 2021, 3:52 PM

#

desert oar re-do the clustering on log price scale, if nothing else

do you mean this: `matplotlib.pyplot.yscale('symlog')`?

desert oar Jun 22, 2021, 3:55 PM

#

why symlog?

#

log should be fine

#

also i mean literally re-run k-means but using log price

desert oar Jun 22, 2021, 3:55 PM

#

ember sapphire im ready to kms over this kmeans implementation

show your code one more time?

#

just "reading" the code isn't always the best way to debug code

visual violet Jun 22, 2021, 3:56 PM

#

so log every single data points (724 *20) and k cluster

desert oar Jun 22, 2021, 3:56 PM

#

yes

#

because the price levels are so wildly different

#

look at the formula for euclidean distance

#

this is a good lesson in the importance of using intuition about fundamentals to solve problems

visual violet Jun 22, 2021, 3:57 PM

#

k means with what distance measurement method? euclidean?

desert oar Jun 22, 2021, 3:57 PM

#

whatever, euclidean is probably fine but you can try DTW too

visual violet Jun 22, 2021, 3:57 PM

#

i have been using dynamic time warping without understanding exactly what it does

desert oar Jun 22, 2021, 3:57 PM

#

when taking differences, and especially squared differences, the order-of-magnitude differences between time series will overwhelm anything else

#

so at least put them on log scale to try and reduce that effect
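
A small made-up example of that effect: two series with the same shape but prices 1000x apart, compared against a flat series at the same cheap level. On raw prices the level difference swamps everything; on log prices it is still there but far smaller.

```python
import numpy as np

# Made-up series: same doubling shape at two very different price levels.
cheap_flat = np.array([1.0, 1.0, 1.0, 1.0])
cheap_rising = np.array([1.0, 2.0, 4.0, 8.0])
pricey_rising = np.array([1000.0, 2000.0, 4000.0, 8000.0])

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

# How much farther is the same-shape series than the same-level series?
raw_ratio = euclidean(cheap_rising, pricey_rising) / euclidean(cheap_rising, cheap_flat)
log_ratio = (
    euclidean(np.log(cheap_rising), np.log(pricey_rising))
    / euclidean(np.log(cheap_rising), np.log(cheap_flat))
)
print(raw_ratio, log_ratio)
```

The log transform shrinks the gap from roughly a thousandfold to single digits here, but it does not remove the level effect entirely.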

visual violet Jun 22, 2021, 3:58 PM

#

let me do that real quick

#

hoping you won't sleep soon

desert oar Jun 22, 2021, 3:58 PM

#

i'm at work so i might have to leave for a while, but ping me so i don't lose track of it

#

DTW just re-aligns elements of the time series such that the distance between the two time series is minimized

visual violet Jun 22, 2021, 3:59 PM

#

dtw does that for every pair of time series?

desert oar Jun 22, 2021, 3:59 PM

#

(more strictly the sum of the absolute values of the differences)

#

yes

visual violet Jun 22, 2021, 4:00 PM

#

let say i have
A: 1 7 16 50
B: 2 15 51 8

desert oar Jun 22, 2021, 4:00 PM

#

with certain restrictions: the first and last elements must be matched to each other, at least one of the sequences must have every point matched, and the matching must be monotonically increasing, so if A10 matches B15, then A11 cannot match B14

#

https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Dynamic_time_warping.png/600px-Dynamic_time_warping.png

#

basically, the lines in the above picture can't cross

#

the distance itself is the sum of the lengths of the dashed lines
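
To make the description concrete, here is a sketch of the textbook dynamic-programming DTW with absolute-difference cost, run on the A and B series posted above. It is the unconstrained variant in which every element of both series is matched at least once, which may differ in detail from the rules described in the messages above.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest monotone alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # b[j-1] is also matched to the previous a element
                cost[i][j - 1],      # a[i-1] is also matched to the previous b element
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]

A = [1, 7, 16, 50]
B = [2, 15, 51, 8]
pointwise = sum(abs(x - y) for x, y in zip(A, B))  # no warping at all
warped = dtw_distance(A, B)
print(pointwise, warped)
```

In this variant the warped distance (50) comes out below the point-by-point sum (86) even though the two series have the same length, because one element may be matched to several neighbours in the other series.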

#

@ember sapphire as a matter of courtesy, if you could repost that using https://paste.pythondiscord.com/, it will be easier to follow the multiple conversations happening here. otherwise now there's a big wall of code between this message and the others.

ember sapphire Jun 22, 2021, 4:01 PM

#

ah sure

desert oar Jun 22, 2021, 4:02 PM

#

(you can edit your messages btw)

ember sapphire Jun 22, 2021, 4:02 PM

#

https://paste.pythondiscord.com/cuveroxafe.py

visual violet Jun 22, 2021, 4:03 PM

#

desert oar with certain restrictions: the first and last elements must be matched to each o...

this makes a lot of sense

ember sapphire Jun 22, 2021, 4:03 PM

#

my understanding is that following Lloyd's algorithm, the cost should decrease monotonically until it converges to a local optimum

#

but my implementation seems to bounce around indefinitely

desert oar Jun 22, 2021, 4:07 PM

#

first of all, write some damn functions

#

don't just put it all in one big script

#

are you trying to plot each step of lloyd's algorithm?

visual violet Jun 22, 2021, 4:08 PM

#

salt, the algo won't work

#

i have tried some matching

desert oar Jun 22, 2021, 4:08 PM

#

visual violet salt, the algo won't work

?

#

what is assign, the cluster assignment?

ember sapphire Jun 22, 2021, 4:09 PM

#

currently the plotting is just there for debugging purposes so i can see the behavior, but the final version should compute clusterings for k=5 to k=12 and then plot them all at the end

#

yeah assign[y, x] is the index of the cluster that pixel [y, x] is assigned to

desert oar Jun 22, 2021, 4:10 PM

#

what is copy?

#

just the image copied?

ember sapphire Jun 22, 2021, 4:10 PM

#

just a copy of the image

#

yeah because the original is mutated

#

every step

#

also im not sure why, but when i ran it just now, it actually converged for k=5 for once and moved on to k=6... i didn't change anything though

desert oar Jun 22, 2021, 4:11 PM

#

unrelated to the algorithm itself, my suggestions for your code itself:

use functions
use better / more-descriptive names
when you do have a "big" function with multiple "sections", write comments so it's obvious to the reader what the sections are

#

plus if you're running this all in a notebook cell you're guaranteed to make a mess out of the top-level namespace, and you should at minimum restart it once in a while

ember sapphire Jun 22, 2021, 4:12 PM

#

im not using ipython

visual violet Jun 22, 2021, 4:13 PM

#

If the first and last elements must match, the lines can’t cross, and the time series have equal length, then does DTW behave exactly the same way as Euclidean?

#

I can’t think of a way for it to not behave the same way as Euclidean if the time lengths are equal

desert oar Jun 22, 2021, 4:14 PM

#

it is possible that DTW can produce the same results as euclidean yes. although the distance is a sum of absolute differences, not a sum of squared differences

#

but DTW can "skip" elements of one of the series

#

actually if they are equal it might be required to be the same as euclidean

#

good observation

visual violet Jun 22, 2021, 4:15 PM

#

You can’t skip one without skipping the one in the other series

desert oar Jun 22, 2021, 4:15 PM

#

right

#

anyway i don't think euclidean or DTW are great solutions here, i think correlation could be more interesting

#

maybe even spearman correlation

#

but it depends on what you hope to find

#

subjectively, what does it mean for 2 time series to be similar?

#

if the price movements are arbitrary, why would you expect to find any interesting clusters based on price movements?

#

maybe you should be looking at linear trends, seasonal decompositions, mean price level, etc.

visual violet Jun 22, 2021, 4:16 PM

#

desert oar if the price movements are arbitrary, why would you expect to find any interesti...

You are not wrong

#

i guess i am hoping to find anything possible out of this dataset lol

desert oar Jun 22, 2021, 4:17 PM

#

i gave you a lot of suggestions of places to look

#

think of what characteristics of the price sequence could be meaningful, i gave some examples above

#

use those as features for clustering
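
One way to turn those suggestions into a feature table, as a sketch only: the three features (mean log price, linear trend, noise around the trend) are illustrative picks from the ideas above, and the input layout (rows = drugs, columns = quarters) is an assumption.

```python
import numpy as np

def series_features(prices):
    """One row of summary features per price series (rows = drugs)."""
    log_prices = np.log(prices)
    quarters = np.arange(prices.shape[1])
    # np.polyfit fits each column of y separately, so pass the series as columns.
    slope, intercept = np.polyfit(quarters, log_prices.T, deg=1)
    residual = log_prices - (slope[:, None] * quarters + intercept[:, None])
    return np.column_stack([
        log_prices.mean(axis=1),  # overall price level
        slope,                    # average log-price change per quarter
        residual.std(axis=1),     # how noisy the series is around its trend
    ])

# Made-up example: one drug growing 10% per quarter, one flat at $5.
prices = np.array([
    [100.0 * 1.1**t for t in range(20)],
    [5.0] * 20,
])
features = series_features(prices)
print(features.round(4))
```

The resulting rows can be standardized and passed to whatever clustering method is being tried, instead of the raw 20-point series.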

visual violet Jun 22, 2021, 4:18 PM

#

okay so first, log everything and cluster again
second, try several dimension reduction techniques and graph
third, try soft clustering methods instead of k-means

desert oar Jun 22, 2021, 4:18 PM

#

i'm changing my recommendation

#

i don't think you should waste time with these distance metrics based on price movements

#

find a way to characterize each drug in a way that makes sense

#

and use that to compute distances, do clustering, etc.

#

however i think you did discover something: price movements are indeed arbitrary and don't suggest any kind of meaningful relationships

visual violet Jun 22, 2021, 4:20 PM

#

i feel good about that, salt

#

lmao i may have to submit my paper as a null result

desert oar Jun 22, 2021, 4:20 PM

#

nah

#

you're not done yet

#

all you found is that euclidean distance isn't useful for these time series

#

PCA however could be interesting https://towardsdatascience.com/the-pca-trick-with-time-series-d40d48c69d28

Medium

The PCA Trick with Time-Series

PCA can be used to reject cyclic time-series behavior, and this works for anomaly detection.

#

(this is just one of many examples of how you could use such a thing)

#

https://stats.stackexchange.com/q/158281/36229

Cross Validated

Can PCA be applied for time series data?

I understand that Principal Component Analysis (PCA) can be applied basically for cross sectional data. Can PCA be used for time series data effectively by specifying year as time series variable and

visual violet Jun 22, 2021, 4:22 PM

#

okay so abandon k-means and hierarchical as well as dtw and euclidean altogether

#

since price movements are random and i should not expect like 20 nice clusters

desert oar Jun 22, 2021, 4:23 PM

#

no, i'm not saying to abandon k-means or hierarchical clustering

#

but yes, you probably want to abandon those distance metrics

visual violet Jun 22, 2021, 4:24 PM

#

btw

#

i am still trying to understand how dtw works

#

my code make sense tho

#

in theory they should output the same thing?

desert oar Jun 22, 2021, 4:25 PM

#

don't do the sqrt

visual violet Jun 22, 2021, 4:31 PM

#

i have always thought the straight line provides the shortest path

#

oh well not anymore i guess

jolly ginkgo Jun 22, 2021, 4:38 PM

#

https://stackoverflow.com/questions/68082030/valueerror-dimensions-must-be-equal-but-are-262144-and-327680

Stack Overflow

ValueError: Dimensions must be equal, but are 262144 and 327680

Orijinal data shape : (512, 512, 1)
Reshaped Train/Test data shape : ((286, 256, 256, 4), (293, 256, 256, 4))
Model:
import tensorflow as tf
from tensorflow.keras.models import Model
from keras.la...

#

pls help me

visual violet Jun 22, 2021, 4:43 PM

#

desert oar PCA however could be interesting https://towardsdatascience.com/the-pca-trick-wi...

PCA is a dimension reduction method. don't i still need a measurement metric?

ember sapphire Jun 22, 2021, 4:43 PM

#

ok i added some comments https://paste.pythondiscord.com/opuvepokok.py @desert oar

#

i don't see any obvious pieces that should be extracted into functions

#

but aside from the style, do you notice any errors / opportunities for optimization?

dawn sapphire Jun 22, 2021, 4:47 PM

#

is there a good live plotting software?

visual violet Jun 22, 2021, 4:50 PM

#

jupyter lol

dawn sapphire Jun 22, 2021, 4:52 PM

#

i've got a loop that's constantly reading from a source and i want the same plot to be updated every second. is jupyter the best tool for this?

jolly ginkgo Jun 22, 2021, 5:16 PM

#

jolly ginkgo https://stackoverflow.com/questions/68082030/valueerror-dimensions-must-be-equal...

heeeeelp

desert oar Jun 22, 2021, 5:17 PM

#

visual violet PCA is a dimension reduction method. don't i still need a measurement metric?

no, it doesn't use an explicit distance matrix

normal grove Jun 22, 2021, 5:22 PM

#

Hey guys, this is my first AI-related blog post on Medium. Have a look, awaiting your responses.
https://medium.com/@h_ali/linear-regression-using-gradient-descent-from-scratch-66328352e671

Medium

Linear Regression using Gradient Descent from Scratch

In this blog, I’m going to explain how linear regression i.e equation of line finds slope and intercept using gradient descent.

visual violet Jun 22, 2021, 5:24 PM

#

thank you for the suggestion salt

#

finally a new direction

inland zephyr Jun 22, 2021, 5:26 PM

#

desert oar PCA however could be interesting https://towardsdatascience.com/the-pca-trick-wi...

and since i also work with time series data (like ECG), wanna try wavelet decomposition to make the feature?

#

since we can break down the data into the volume one (the Y) and the time (the X) whether to find any small detail to cluster the data

desert oar Jun 22, 2021, 6:22 PM

#

that could also be interesting, although i'm generally skeptical of fourier-like methods on very short time series. but @visual violet take note

desert oar Jun 22, 2021, 6:27 PM

#

ember sapphire i don't see any obvious pieces that should be extracted into functions

the entire k-means procedure can be

#

so you can at least verify that the output is correct in a tighter loop without making all these plots

#

my guess is that somewhere you are forgetting to update something

#

also the centroid differences will almost never be exactly zero

#

use https://docs.python.org/3/library/math.html#math.isclose / https://numpy.org/doc/stable/reference/generated/numpy.isclose.html
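
A quick illustration of why an exact-equality check on centroids can fail to trigger, using made-up centroid arrays that differ only by floating-point rounding:

```python
import numpy as np

# Made-up centroids: the first coordinate differs only by rounding error.
centroids_prev = np.array([[0.1 + 0.2, 0.5], [0.9, 0.3]])
centroids_curr = np.array([[0.3, 0.5], [0.9, 0.3]])

exactly_equal = bool(np.array_equal(centroids_curr, centroids_prev))
close_enough = bool(np.allclose(centroids_curr, centroids_prev, rtol=0.0, atol=1e-8))
print(exactly_equal, close_enough)
```

The tolerance `atol=1e-8` is an arbitrary choice for the example; a sensible value depends on the scale of the data (pixel values in [0, 1] here).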

#

how badly does the cost oscillate?

#

that's more interesting than looking at cluster assignments

#

also if this is a very small number of data points the oscillations could be significant

silk marsh Jun 22, 2021, 6:29 PM

#

am working on payment prediction

desert oar Jun 22, 2021, 6:29 PM

#

try it on a dataset with known obvious clusters

silk marsh Jun 22, 2021, 6:29 PM

#

need a lil help

#

anyone please?

desert oar Jun 22, 2021, 6:34 PM

#

@silk marsh successfully asking for machine learning help requires a detailed description of your task, a detailed description of your data, and a detailed description of your current solution(s) and/or the actual code you're using. example data is even more helpful

#

the "dont ask to ask" rule is even more important in data science than in programming, because there are even more open-ended questions

#

otherwise you force people to waste their time interviewing you in order to be able to help you

silk marsh Jun 22, 2021, 6:35 PM

#

bro if u don't want to help then ignore my msg

#

@desert oar

desert oar Jun 22, 2021, 6:35 PM

#

i already just tried to offer some help

silk marsh Jun 22, 2021, 6:36 PM

#

sounds like u are angry

#

don't take me in wrong way

#

@desert oar

#

so can u join code help1 @desert oar

#

voice?

#

so that i can explain my prob?

desert oar Jun 22, 2021, 6:40 PM

#

@ember sapphire i don't see anything obviously wrong with this code either, so i suspect that maybe you forgot to update a "new thing" and left an "old thing" behind

ember sapphire Jun 22, 2021, 6:41 PM

#

Hmm

#

It doesn't oscillate badly

#

It goes down relatively quickly and then reaches what appears to be a local optimum but then it starts slowly going up again

desert oar Jun 22, 2021, 6:43 PM

#

i wonder if you just need to be cutting it off there? maybe randomly swapping around points is causing it to destabilize

#

not sure what the theoretical guarantees are on the algorithm

#

https://en.wikipedia.org/wiki/Lloyd's_algorithm#Convergence

The algorithm converges slowly or, due to limitations in numerical precision, may not converge. Therefore, real-world applications of Lloyd's algorithm typically stop once the distribution is "good enough." One common termination criterion is to stop when the maximum distance moved by any site in an iteration falls below a preset threshold.
floating point issues?

ember sapphire Jun 22, 2021, 6:44 PM

#

Yeah I thought it was guaranteed to converge so I was confused

#

Hmm maybe..

desert oar Jun 22, 2021, 6:44 PM

#

try using that termination criterion

ember sapphire Jun 22, 2021, 6:45 PM

#

Sounds reasonable, if it's just floating point issues then I'm fine. I was worried it meant there was a deeper problem

desert oar Jun 22, 2021, 6:45 PM

#

while True:
    # stop once no centroid coordinate has moved by more than the threshold
    if np.abs(centroids_curr - centroids_prev).max() < threshold:
        break

#

try to compare vs scikit-learn k-means or some other known-good implementation

#

if you have the same-ish stopping criterion and the same-ish algorithm, you should get same-ish results
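
Putting the suggestions from this thread together (wrap k-means in a function, stop when the centroids barely move, track the cost), a sketch might look like the following. It is not the code from the paste links; the function name, threshold, and test data are assumptions. One thing worth knowing: the monotone-decrease guarantee for Lloyd's algorithm is for the sum of *squared* distances, so a cost that sums plain norms, as the earlier snippet appears to, is not covered by it.

```python
import numpy as np

def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels, cost_history)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(points, size=k, replace=False)
    history = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Cost = sum of squared distances to the assigned centroid.
        history.append(float(sq_dist[np.arange(len(points)), labels].sum()))
        # Update step: mean of each cluster; empty clusters keep their centroid.
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        moved = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if moved < threshold:
            break
    return centroids, labels, history

# Synthetic check on data with two obvious clusters.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 3)),
    rng.normal(loc=1.0, scale=0.1, size=(50, 3)),
])
centroids, labels, history = kmeans(points, k=2)
print(history)
```

Because it is a function with no plotting, it can be run in a tight loop on small synthetic data and compared against a known-good implementation with the same starting centroids.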

ember sapphire Jun 22, 2021, 6:47 PM

#

Depends on initial centroids too but yeah, the image I'm using has a pretty obvious fit I think

lapis sequoia Jun 22, 2021, 7:42 PM

#

Hey folks, would anyone here spare a few moments and help me with a scikit learn question?

misty flint Jun 22, 2021, 7:51 PM

#

you can just ask. if people have time, theyll reply

lapis sequoia Jun 22, 2021, 7:57 PM

#

Yeah thats fair. i just dont want to clog the chat you know. But here goes. I have a set of data generated by a random model .

They have a id, prediction value from some random model, and the actual boolean value.

I am trying to calculate the AUC of the random model by 2 methods, one by the scikit-learn and one by hand (running the normal algorithm)

There is quite a mismatch from the 2 methods.

for the data file/ code : https://filebin.net/yuzmnov4k8r6ha0c (hope filebin links are allowed)

near oasis Jun 22, 2021, 7:59 PM

#

grave frost its a private video?

yea why?

lapis sequoia Jun 22, 2021, 7:59 PM

#

Now my question is: am i using the roc_curve function right? (I pass it the actual boolean values (1 and 0) and the predictions of the random model.) If yes, is there any reason why the by-hand calculation of roc (going over all the steps of the AUC calculation) differs so much from the scikit-learn one?
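
Without the data it is hard to say where the mismatch comes from, but a tie-aware pairwise AUC is a handy reference point for this kind of comparison. The sketch below is not the poster's by-hand method; it computes the probability that a random positive scores above a random negative, counting ties as one half, which is what the area under the ROC curve equals. The function name is made up for the example.

```python
def pairwise_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

If a by-hand result disagrees with this, common things to check are how tied scores are handled and whether labels and scores were passed in the intended order and orientation.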

shadow quiver Jun 22, 2021, 8:00 PM

#

Any help would be appreciated on this question from me: https://stackoverflow.com/questions/68089910/tf-keras-ignore-values-in-custom-cross-entropy-function

Stack Overflow

TF-Keras ignore values in custom cross entropy function

Shape of y_trues is (16, 50). E.g. [1, 2, 3, 0, ..., 0, 0]. It's going to be always end with some sequence of zeros, since they are padding values.
Shape of y_preds is (None, 50, 50), an output of a

graceful ledge Jun 22, 2021, 9:53 PM

#

Hey everyone! I am wanting to use Facebook Prophet to perform multivariate anomaly detection of user session data but struggling to figure out how it would work.

For some context, when a user logs in we create a session id and certain events/actions are captured and tied to that session. We are able to get counts of the different actions at 1 min intervals for a particular session.

The thing I am struggling with is how to detect anomalies at the session level. How would that data be fed into Facebook Prophet and is this something that can be done with it?

Online, I have only seen multivariate examples of things like different store locations but those store locations are static while the sessions are dynamic.

Appreciate your time!

visual violet Jun 22, 2021, 9:54 PM

#

do you have a bachelor degree in data science?

graceful ledge Jun 22, 2021, 9:57 PM

#

No, but I have created some basic autonomous driving apps like being able to drive around a track in a simulator, detecting road signs, lanes etc.

visual violet Jun 22, 2021, 10:06 PM

#

sheesh

#

you are good

graceful ledge Jun 22, 2021, 10:07 PM

#

Love the sarcasm. Just seeking some guidance as I am "not good"

visual violet Jun 22, 2021, 10:13 PM

#

i am a high school student

#

i know absolutely nothing

#

i am sorry to push your question to the top

thorn bobcat Jun 22, 2021, 10:23 PM

#

I want to learn about stylegan2

#

Its like I'm trying to swim before learning to walk

#

but it's beautiful

visual violet Jun 22, 2021, 10:32 PM

#

@desert oar i just realized the red book by micromedex is the answer to my hope and dream

main fox Jun 22, 2021, 10:32 PM

#

How should I typically handle dates in a ML model?
I have a dataframe that has a column for Year (from 2010 to 2020)
Should I leave it as int or code them from 1 to 10?

desert oar Jun 22, 2021, 10:35 PM

#

visual violet @desert oar i just realized the red book by micromedex is the answer ...

im not familiar with it. is it a source of data about medications?

#

is there a digital version? or do you have to now type in data for 700 drugs?

grave frost Jun 22, 2021, 10:36 PM

#

near oasis yea why?

you tell me lol

visual violet Jun 22, 2021, 10:36 PM

#

micromedex provides historical data going back 50 years

#

right now what i have is quite enough

#

but like to predict the future

#

20 datapoints are like a grain of sand

#

the problem is there is a subscription lol

thorn bobcat Jun 22, 2021, 10:41 PM

#

anyone know what I should start learning to make something like style gan 2?

#

or 1

#

do I gotta learn about gans or do i gotta learn about neural style transfer?

velvet thorn Jun 22, 2021, 10:49 PM

#

main fox How should I typically handle dates in a ML model? I have a dataframe that has a...

depends on what you want to encode

#

and how you want to model

#

can't tell without knowing what the problem is

main fox Jun 22, 2021, 10:50 PM

#

velvet thorn can't tell without knowing what the problem is

I have historical data for tuition of different colleges, by year.

visual violet Jun 22, 2021, 10:51 PM

#

it is open source?

main fox Jun 22, 2021, 10:51 PM

#

I can easily forecast these values, but I'm building a ML model just for the fun of it.

visual violet Jun 22, 2021, 10:52 PM

#

are you saying you can forecast college tuitions?

velvet thorn Jun 22, 2021, 10:52 PM

#

main fox I have historical data for tuition of different colleges, by year.

hm

#

sounds like time series modelling

main fox Jun 22, 2021, 10:52 PM

#

Yeah. Just from looking at the data, the trend appears to be +200 every year since 2010

velvet thorn Jun 22, 2021, 10:52 PM

#

so

#

the simplest way to handle this is

#

given (X - n...X), model X + 1

#

in this case the date is used only to order data points
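One way to read "given (X - n...X), model X + 1" is as a sliding window: each training row holds the previous few values and the target is the value that follows, so the year only decides the ordering. A minimal sketch, where `make_lagged` and the tuition figures are invented for illustration:

```python
def make_lagged(series, n_lags):
    """Turn an ordered series into (features, target) pairs.

    Each feature row holds the previous n_lags values;
    the target is the value that comes right after them.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y

# Hypothetical tuition values, already sorted by year (2010, 2011, ...).
tuition = [1000, 1200, 1400, 1600, 1800]
X, y = make_lagged(tuition, n_lags=2)
# X holds [1000, 1200], [1200, 1400], [1400, 1600]; y holds 1400, 1600, 1800
```

Any regressor can then be fit on `X` and `y`; the year column itself never appears as a feature.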

visual violet Jun 22, 2021, 10:54 PM

#

can you share the data for upenn?

#

i am going there so i am curious

main fox Jun 22, 2021, 10:56 PM

#

velvet thorn in this case the date is used only to order data points

So what's your suggestion? I tested xgboost and RandomForest. RandomForest performed better. Now I'm looking to fine tune the model. Should I leave the years as integers or code them with ordered labels from 1 to 10?

main fox Jun 22, 2021, 10:57 PM

#

visual violet can you share the data for upenn?

Whats the full name of that college?

visual violet Jun 22, 2021, 10:57 PM

#

University of Pennsylvania

velvet thorn Jun 22, 2021, 10:57 PM

#

main fox So what's your suggestion? I tested xgboost and RandomForest. RandomForest perfo...

I literally just said

#

dates are used only to order the points

#

so they aren't features

#

UNLESS you want to encode more information

#

for simple stuff

#

you could look into e.g. ARIMA

main fox Jun 22, 2021, 11:00 PM

#

visual violet University of Pennsylvania

In 2020 and 2019 the tuition change was +3.9% for both years. So I'd say for a 4 year program $60k * 3.9%

visual violet Jun 22, 2021, 11:00 PM

#

thank you

main fox Jun 22, 2021, 11:01 PM

#

velvet thorn you could look into e.g. ARIMA

Thanks, I'll look into it.

visual violet Jun 22, 2021, 11:06 PM

#

@velvet thorn i am having a similar time series

#

i have historical data for price of different colleges, by year

#

now i am trying to find interesting patterns

thorn bobcat Jun 22, 2021, 11:07 PM

#

this was a good paper

visual violet Jun 22, 2021, 11:07 PM

#

i did k-means with dynamic time warping on my data already but it gives me one cluster with the majority of the drugs and the rest of the clusters have very few members

#

what do you suggest me to do

visual violet Jun 22, 2021, 11:58 PM

#

@desert oar

#

not exactly how stuffs work

#

but it appear to provide good results now?!?!?!?

#

i legit added one more line of code ingredient_price_matrix = np.log(ingredient_price_matrix)

desert oar Jun 23, 2021, 12:55 AM

#

yep

#

i told you

#

If “dtw”, DBA is used for barycenter computation.
this is also a good feature in that library

#

plot the log time series again w/ the colors

#

also 15 clusters seems like a lot

#

do you have any reason to expect 15? why not 1 or 5?

silver sun Jun 23, 2021, 12:59 AM

#

Can anyone give me resources and link to prepare for a data science Interview? They said I will be asked on ML rapid prototyping, and simple ML models.

serene scaffold Jun 23, 2021, 1:09 AM

#

silver sun Can anyone give me resources and link to prepare for a data science Interview? T...

I've had a number of data scientist interviews lately. Here are some of the questions I was asked:

What's an example of an unsupervised learning algorithm?
What are the pros and cons of neural networks?
What do you do if your dataset is missing certain data?
What's an example of a time that you solved a problem you were unsure about?

#

And be prepared to talk about anything that's on your resume.

visual violet Jun 23, 2021, 1:39 AM

#

desert oar do you have any reason to expect 15? why not 1 or 5?

i will attempt to graph with t-SNE hehe.

#

to be honest, i have no idea which number of clusters i should pick

#

the elbow method tells me 6? so i will try that

#

also what does DBA mean?

#

What's an example of an unsupervised learning algorithm?
= kmeans? @serene scaffold

desert oar Jun 23, 2021, 1:49 AM

#

visual violet also what does DBA mean?

https://github.com/fpetitjean/DBA

GitHub

fpetitjean/DBA

DBA: Averaging for Dynamic Time Warping. Contribute to fpetitjean/DBA development by creating an account on GitHub.

visual violet Jun 23, 2021, 1:50 AM

#

huh no wonder why when i check, the math doesn't work out

#

this is not normal dtw

#

this is DBA

#

also how do you know it uses DBA?

#

i check the tslearn website. there is no where it says that

#

@desert oar

#

i am so sorry for pinging you so many times

desert oar Jun 23, 2021, 1:53 AM

#

visual violet this is not normal dtw

DBA is a method for calculating centroids. the docs say that when you use DTW as the distance metric, it uses DBA instead of regular means to calculate the centroids in k-means

#

i happened to have seen DBA in this doc that i think i posted earlier, and remembered it. the tslearn docs really should have a reference to this, instead of just using the acronym. https://www.kaggle.com/izzettunc/introduction-to-time-series-clustering

Introduction to Time Series Clustering

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

visual violet Jun 23, 2021, 1:56 AM

#

thank you very much.

#

PCA happens to be the method to plot

#

so i will try that out as well

silver sun Jun 23, 2021, 2:33 AM

#

serene scaffold I've had a number of data scientist interviews lately. Here are some of the ques...

Thank you so much!!! If I may ask how was the technical interviews like?

visual violet Jun 23, 2021, 2:37 AM

#

desert oar plot the log time series again w/ the colors

looks more like layers to me. before it was a clump and outliers 🙂

#

what do you think?

desert oar Jun 23, 2021, 2:38 AM

#

i still think this is just clustering on average price level

#

k-means tends to try to find "round" clusters

#

and it will somewhat arbitrarily segment the data in order to do so

visual violet Jun 23, 2021, 2:38 AM

#

perhaps it is because my graphing function is weird?

desert oar Jun 23, 2021, 2:38 AM

#

if you see a bunch of evenly-sized clusters with no obvious separation, k-means is not doing anything interesting

#

no, this just looks like k-means not doing anything interesting

visual violet Jun 23, 2021, 2:39 AM

#

!paste

arctic wedgeBOT Jun 23, 2021, 2:39 AM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

visual violet Jun 23, 2021, 2:39 AM

#

https://paste.pythondiscord.com/betusevuni.apache just in case

desert oar Jun 23, 2021, 2:39 AM

#

as i suggested earlier, if price movements are arbitrary then you probably won't find much that's interesting in the individual price movements

#

alternative suggestion: https://stats.stackexchange.com/a/165234/36229

Cross Validated

Using correlation as distance metric (for hierarchical clustering)

I would like to hierarchically cluster my data, but rather than using Euclidean distance, I'd like to use correlation. Also, since the correlation coefficient ranges from -1 to 1, with both -1 and 1
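To make the correlation idea concrete: one common choice of distance is 1 minus the Pearson correlation, so two series that move together are close even when their price levels are very different. This is a plain-Python sketch under that assumption; the helper names are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """0 when the series move together, 2 when they move exactly opposite."""
    return 1.0 - pearson(a, b)

cheap = [1.0, 2.0, 3.0, 4.0]
pricey = [100.0, 200.0, 300.0, 400.0]   # same shape, 100x the level
falling = [4.0, 3.0, 2.0, 1.0]

d_same_shape = correlation_distance(cheap, pricey)
d_opposite = correlation_distance(cheap, falling)
```

A matrix of these pairwise distances can be handed to a hierarchical clustering routine in place of Euclidean distances.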

#

it might be interesting to look at the variability in prices

#

e.g. you can start building features for each drug like mean price, std. dev of price, etc.

#

in fact, how about this: plot each drug with mean on the x axis and std. dev on the y axis

visual violet Jun 23, 2021, 2:42 AM

#

so there will be 700 graphs?

desert oar Jun 23, 2021, 2:42 AM

#

no

#

1 graph

#

700 points
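The "1 graph, 700 points" plot needs one (mean, std. dev) pair per drug. A small sketch of computing those pairs with the standard library, using a made-up `prices_by_drug` dictionary in place of the real price matrix:

```python
import statistics

def summarize(prices_by_drug):
    """Map each drug to a (mean price, sample std. dev of price) point."""
    points = {}
    for drug, prices in prices_by_drug.items():
        points[drug] = (statistics.mean(prices), statistics.stdev(prices))
    return points

# Hypothetical data: each drug needs at least two prices for stdev.
prices_by_drug = {
    "drug_a": [10.0, 10.0, 10.0],
    "drug_b": [5.0, 10.0, 15.0],
}
points = summarize(prices_by_drug)
xs = [mean for mean, _ in points.values()]   # x axis: mean price
ys = [std for _, std in points.values()]     # y axis: std. dev of price
```

Passing `xs` and `ys` to a scatter plot gives a single chart with one dot per drug.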

visual violet Jun 23, 2021, 2:42 AM

#

oh right

visual violet Jun 23, 2021, 2:43 AM

#

desert oar it might be interesting to look at the _variability_ in prices

i thought about it too

#

but i had an idea of using percentage difference

#

because i didn't think there will be an actual method

#

haha

#

the fact that my teacher recommends me to cluster the big cluster 💀

#

he gives 0 advice on anything

visual violet Jun 23, 2021, 3:01 AM

#

desert oar in fact, how about this: plot each drug with mean on the x axis and std. dev on ...

is there a potential?

#

btw the matrix is the log of the prices

#

if used actual prices

digital tulip Jun 23, 2021, 3:17 AM

#

Hi, I would like to learn about AI, where do you recommend me to start?

serene scaffold Jun 23, 2021, 3:25 AM

#

digital tulip Hi, I would like to learn about AI, where do you recommend me to start?

what's your math background?

digital tulip Jun 23, 2021, 3:29 AM

#

serene scaffold what's your math background?

I don't know how to define my level in mathematics, I guess high school with a little bit of the first year of college.

serene scaffold Jun 23, 2021, 3:30 AM

#

digital tulip I don't know how to define my level in mathematics, I guess high school with a l...

Learning linear algebra and probability/statistics would help

digital tulip Jun 23, 2021, 3:37 AM

#

serene scaffold Learning linear algebra and probability/statistics would help

Hmmm ok

#

And then?

visual violet Jun 23, 2021, 3:41 AM

#

i have access to the IBM SPSS Statistics

#

let see if i can do the ward method without coding

#

i can't seem to find a library

desert oar Jun 23, 2021, 3:50 AM

#

visual violet is there a potential?

not for clustering, but i see some interesting structure here:

variability looks like it increases as the avg price gets higher (not surprising)
there are some weird outliers with hugely variable prices

visual violet Jun 23, 2021, 3:57 AM

#

i agree yeah

cedar sky Jun 23, 2021, 3:58 AM

#

Hi guys...
So I have been trying to load word embeddings from tf hub...
But most of it are converting the word embedding to a sentence embedding and I am not able to find any documentation to help me out of it...
If any of you guys have any ideas how to get the embeddings please let me know

visual violet Jun 23, 2021, 3:58 AM

#

by looking at it, i see no distinct cluster

normal bay Jun 23, 2021, 7:14 AM

#

how can i determine model over-fitting or not ?

dim beacon Jun 23, 2021, 7:15 AM

#

normal bay how can i determine model over-fitting or not ?

Try cross validation
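Cross validation checks for overfitting by scoring the model on data it was not trained on: a training score far above the held-out score is the warning sign. The sketch below only shows how the folds are formed. `k_fold_indices` is an illustrative helper, and it does not shuffle, which real use normally would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, roughly equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
# For each fold: fit on train_idx, score on both train_idx and val_idx,
# then compare the two scores averaged over all folds.
```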

coral kindle Jun 23, 2021, 7:25 AM

#

https://lightning-flash.readthedocs.io/en/latest/ Anybody heard of this library?

lapis sequoia Jun 23, 2021, 8:02 AM

#

what's better for storing analytics one-table data, sql databases (relational) such as mysql or noSQL such as mongodb?

ripe forge Jun 23, 2021, 8:55 AM

#

normal bay how can i determine model over-fitting or not ?

Add a callback that shows test performance after each epoch

primal tulip Jun 23, 2021, 10:07 AM

#

serene scaffold I've had a number of data scientist interviews lately. Here are some of the ques...

And how was it? Good luck on your interviews bud.

primal tulip Jun 23, 2021, 10:22 AM

#

silver sun Can anyone give me resources and link to prepare for a data science Interview? T...

I do recommend datacamp.com if you have some extra bucks. I do have the yearly plan (150$ per year or so if you get it on cybermonday)
They have this new prep thing after you've done some 10 minute tests on each category, a real-life project you must pass to be certified and finally they'll help you get ready for the interviews and match your profile to other companies.

I'm still studying and haven't done them yet. And it just came out recently, so don't expect it to be perfectly flawless.

https://app.datacamp.com/certification/dashboard

Also there might be a free prep service around if you look properly. I was just willing to pay.

silver widget Jun 23, 2021, 10:31 AM

#

Hi all. Been using CatBoostRegressor for a project (this is my first time with Catboost). I used it in StackingRegressor with different models. I tuned the hyper parameters of the boost which is ;
'Cat', CatBoostRegressor(iterations= 5000,
learning_rate=0.01,
l2_leaf_reg = 5,
depth = 6,
border_count = 50))

However, the model runs each iteration from 0 to 5000 with different learning rates. Is there a way to fix these parameters?

normal bay Jun 23, 2021, 11:55 AM

#

dim beacon Try cross validation

Okay

normal bay Jun 23, 2021, 11:56 AM

#

ripe forge Add a callback that shows test performance after each epoch

Thanks lemme try

wanton sleet Jun 23, 2021, 12:52 PM

#

If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it? *

#

the dataset is tabular

serene scaffold Jun 23, 2021, 12:57 PM

#

wanton sleet If you were provided a csv file with 2000 rows and 5 columns, and asked to creat...

this is the beginning of a lot of machine learning problems. What is the data? what does it represent?

#

Be sure to be specific as the question isn't answerable in general.

wanton sleet Jun 23, 2021, 12:58 PM

#

This is one of the interview questions i had faced

#

Here we have to be creative as possible.

#

The data is numeric tabular data

serene scaffold Jun 23, 2021, 12:59 PM

#

wanton sleet This is one of the interview questions i had faced

if you were continuing the earlier conversation thread about interview questions, I wasn't able to infer that.

wanton sleet Jun 23, 2021, 12:59 PM

#

No

#

its just random ML interview question for trainee

serene scaffold Jun 23, 2021, 1:00 PM

#

wanton sleet its just random ML interview question for trainee

Is this a question you want an answer to?

wanton sleet Jun 23, 2021, 1:00 PM

#

yes

#

If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it?

#

After preprocessing data what can be done

serene scaffold Jun 23, 2021, 1:01 PM

#

wanton sleet If you were provided a csv file with 2000 rows and 5 columns, and asked to creat...

You would first need to know if one of the five columns is what tells you the class of that row, and you'd need to know if the data (since you said it's all numeric) is discrete or continuous.

wanton sleet Jun 23, 2021, 1:01 PM

#

?

serene scaffold Jun 23, 2021, 1:02 PM

#

if one column is just telling you the class, then you only have four features, not five

wanton sleet Jun 23, 2021, 1:02 PM

#

okayy

#

any further

#

for discrete data

serene scaffold Jun 23, 2021, 1:04 PM

#

wanton sleet for discrete data

depends on what algorithm you want to use. for the discrete data, would the number be mathematically meaningful? Like for example, age?

#

or would it just be arbitrary numbers like "1" for green and "2" for blue
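When the numbers are arbitrary labels like that, a common fix (not mentioned in the thread, so treat it as one option) is one-hot encoding, which stops the model from reading "2" as larger than "1". A minimal sketch with an illustrative `one_hot` helper:

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# 1 = green, 2 = blue: the codes carry no order or magnitude.
categories, encoded = one_hot([1, 2, 1])
```

A column like age, where the number is mathematically meaningful, would be left as it is.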

wanton sleet Jun 23, 2021, 1:07 PM

#

okay thanks

#

now i need to know the data nature

#

continuous or discrete

#

and act accordingly

serene scaffold Jun 23, 2021, 1:08 PM

#

wanton sleet now i need to know the data nature

if they refuse to budge on being specific about what they want you to do, you can say that you'd use support vector machines

wanton sleet Jun 23, 2021, 1:09 PM

#

best option might be support vector machine

serene scaffold Jun 23, 2021, 1:10 PM

#

wanton sleet best option might be support vector machine

are you familiar with those?

wanton sleet Jun 23, 2021, 1:10 PM

#

i have done decision trees, random forest but not SVM yet

#

but maybe in near future

serene scaffold Jun 23, 2021, 1:11 PM

#

wanton sleet i have done decision trees, random forest but not SVM yet

SVM would treat each row in your table as a point in space, and try to determine the boundary between the two classes
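A small sketch of the "points in space" picture for a linear boundary. The weights here are hand-picked rather than trained, so this only shows how a boundary classifies rows and how the hinge loss (the quantity a linear SVM tries to keep small) is computed; the names and numbers are illustrative.

```python
def decision_value(w, b, x):
    """Signed score of point x relative to the boundary w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """+1 on one side of the boundary, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def hinge_loss(w, b, rows, labels):
    """Average hinge loss: 0 when every point is on its side with margin >= 1."""
    losses = [max(0.0, 1.0 - y * decision_value(w, b, x)) for x, y in zip(rows, labels)]
    return sum(losses) / len(losses)

# Each row of the table is a point; labels are -1 or +1.
rows = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 1.0), -3.5   # hand-picked boundary x1 + x2 = 3.5, not a fitted model
predictions = [classify(w, b, x) for x in rows]
loss = hinge_loss(w, b, rows, labels)
```

Training an SVM amounts to searching for the `w` and `b` that separate the classes with the widest margin.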

wanton sleet Jun 23, 2021, 1:12 PM

#

okay thanks

#

highly appreciated

serene scaffold Jun 23, 2021, 1:12 PM

#

No problem 😄

ember sapphire Jun 23, 2021, 2:04 PM

#

@desert oar lol I forgot to square the norms when computing error

#

It was actually converging just slowly
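For reference, the standard k-means objective is the sum of squared distances to the assigned centroids. The assignment and update steps never increase that quantity, whereas a sum of unsquared norms carries no such guarantee. A minimal NumPy sketch with made-up points:

```python
import numpy as np

def kmeans_cost(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]   # centroids[labels] picks one centroid per point
    return float((diffs ** 2).sum())

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centroids = np.array([[0.0, 1.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
cost = kmeans_cost(points, centroids, labels)
```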

fallen vapor Jun 23, 2021, 2:35 PM

#

can anyone suggest me some cool chatbot repos

#

which i can train

#

nobody?

#

:(

desert oar Jun 23, 2021, 2:39 PM

#

ember sapphire @desert oar lol I forgot to square the norms when computing error

lol glad you figured it out

cedar sun Jun 23, 2021, 2:51 PM

#

        is a single float value, the range will be (-shift_limit, shift_limit). Absolute values for lower and
        upper bounds should lie in range [0, 1]. Default: (-0.0625, 0.0625).

#

What it means?

#

like, height width will be multiplied by that?

#

like, a 100x100 will result into a 6x6?

#

or into a 94x94?
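The message does not name the library, so the following is an assumption: the wording appears to describe a shift (translation) augmentation in which the factor is a fraction of the image height and width. Under that reading a 100x100 image stays 100x100 and its content moves by at most about 6 pixels in each direction. `sample_shift_pixels` is an illustrative helper, not a library function.

```python
import random

def sample_shift_pixels(height, width, shift_limit=0.0625, rng=None):
    """Draw a random (dy, dx) translation in pixels.

    A single float means the range (-shift_limit, shift_limit);
    a (low, high) tuple is used as given. The factor is treated as a
    fraction of the image size (assumption, see the note above).
    """
    rng = rng or random.Random()
    if isinstance(shift_limit, (int, float)):
        low, high = -shift_limit, shift_limit
    else:
        low, high = shift_limit
    dy = rng.uniform(low, high) * height
    dx = rng.uniform(low, high) * width
    return dy, dx

demo_rng = random.Random(0)
demo_shifts = [sample_shift_pixels(100, 100, rng=demo_rng) for _ in range(1000)]
max_abs_shift = max(max(abs(dy), abs(dx)) for dy, dx in demo_shifts)
```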

silver widget Jun 23, 2021, 3:44 PM

#

Train score (r2) = 0.98
test score = 0.86
is this counted as overfitting? (data= kaggle house prices advanced reg)

chilly geyser Jun 23, 2021, 3:50 PM

#

I'd say yes

#

That difference is quite big

silver widget Jun 23, 2021, 4:02 PM

#

Figured it out. RandomForest max depth is too much high..tuning them al again

#

*all

#

Thanks for the answer

somber prism Jun 23, 2021, 4:31 PM

#

can someone explain me how is this 1/2 for this SVM prob

#

so for whichever choice p1 * theta >= 1 and p2 * theta <= -1 that would the right theta value ?

#

someone correct me if i am wrong

wheat sandal Jun 23, 2021, 5:03 PM

#

Hey guys , which YouTube channel or resource would you recommend me to learn statistics from scratch? I don’t have a math background

desert oar Jun 23, 2021, 6:27 PM

#

i wish i knew, all the good statistics resources i know of already assume some knowledge of the basics

#

what is your background? how much math do you know?

wheat sandal Jun 23, 2021, 6:43 PM

#

I studied literature lol

#

I have knowledge from high school’s Math

desert oar Jun 23, 2021, 8:05 PM

#

do you know calculus?

#

what's the "most advanced" math thing you remember how to do?

wheat sandal Jun 23, 2021, 8:30 PM

#

desert oar do you know calculus?

If it is a high school stuff yes! I only remember few algebra things

desert oar Jun 23, 2021, 8:30 PM

#

alright, you might really have to start from the basics then

#

i honestly don't know where you'd start from there, khan academy i guess has some videos? there might be some "statistical thinking" type of courses online, which would focus more on intuition and concepts

wheat sandal Jun 23, 2021, 8:50 PM

#

I found one course in data camp called introduction to statistics in python . I will do that. Thanks anyway

digital tulip Jun 23, 2021, 8:51 PM

#

After having a so-so knowledge in mathematics, what should I learn to make my first AI? 🤔

distant needle Jun 23, 2021, 9:02 PM

#

I have a .dat data file that I believe is in some type of .csv format, however it's too large (3.1GB) to inspect with any programs that I've tried inspecting it with. What's the easiest way to convert from .dat to CSV so I can load it into a pandas dataframe?
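A file that is too large for an editor can still be inspected by reading only its first few lines. The sketch below does that and then makes a crude guess at the delimiter from the header line. The helper names and the candidate delimiter list are assumptions, and the demo writes a tiny temporary file to stand in for the real 3.1GB one.

```python
import os
import tempfile
from itertools import islice

def peek(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]

def guess_delimiter(header_line, candidates=(",", "\t", ";", "|")):
    """Crude heuristic: the candidate that occurs most often in the header line."""
    counts = {c: header_line.count(c) for c in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("id|name|value\n1|a|2\n2|b|3\n")
    demo_path = tmp.name

first_lines = peek(demo_path, n_lines=2)
delimiter = guess_delimiter(first_lines[0])
os.remove(demo_path)
```

If the file does turn out to be delimited text, `pandas.read_csv` can read it directly with `sep=` set to the delimiter, and `nrows=` or `chunksize=` keeps memory use bounded.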

desert oar Jun 23, 2021, 9:23 PM

#

distant needle I have a .dat data file that I believe is in some type of .csv format, however i...

what program generated the file?

distant needle Jun 23, 2021, 9:23 PM

#

Not a clear way of knowing. It's a government dataset.

#

I may have found a way around it though...I think I found a CSV, which makes life much easier.

desert oar Jun 23, 2021, 9:46 PM

#

hopefully it says somewhere in the govt docs though

desert oar Jun 23, 2021, 9:53 PM

#

wheat sandal I found one course in data camp called introduction to statistics in python . I ...

wheat sandal Jun 23, 2021, 9:54 PM

#

desert oar

🙏🏻 thanks!!!

main fox Jun 23, 2021, 9:56 PM

#

Can someone help me with an xgboost model making bad forecasts?

iron basalt Jun 23, 2021, 10:02 PM

#

wheat sandal 🙏🏻 thanks!!!

https://brilliant.org not sponsored.

Brilliant | Learn to think

Brilliant - Build quantitative skills in math, science, and computer science with fun and
challenging interactive explorations.

wheat sandal Jun 23, 2021, 10:03 PM

#

iron basalt https://brilliant.org not sponsored.

Thanks 🤩

iron basalt Jun 23, 2021, 10:04 PM

#

wheat sandal Thanks 🤩

Most important one: https://brilliant.org/courses/math-fundamentals/

Learn Mathematical Fundamentals on Brilliant

In this course, we'll introduce the foundational ideas of algebra, number theory, and logic that come up in nearly every topic across STEM.

This course is ideal for anyone who's either starting or re-starting their math education. You'll learn many essential problem solving techniques and you'll need to think creatively and strategically to so...

wheat sandal Jun 23, 2021, 10:05 PM

#

Cool , I really appreciate your help !

lapis sequoia Jun 23, 2021, 10:18 PM

#

Anyone have an idea of what approach to take for a problem where I need to predict the column name based on its data? I understand it could be classification, but the issue is that there are over 1000 different types of fields.

desert oar Jun 23, 2021, 11:36 PM

#

lapis sequoia Anyone have an idea of what approach to take for a problem where I need to predi...

what's the purpose of this task? you want to be able to guess a column name in any dataset? a specific dataset?

lapis sequoia Jun 23, 2021, 11:44 PM

#

desert oar what's the purpose of this task? you want to be able to guess a column name in _...

Want to predict column name based on data. It’s a specific dataset however the issue is the dataset is very large!

#

Recommend unsupervised for this?

desert oar Jun 23, 2021, 11:44 PM

#

so this is like looking at the data, then covering up the columns and trying to figure out which one was which?

#

this is a strange task

lapis sequoia Jun 23, 2021, 11:45 PM

#

Correct

#

Because column names might be misspelled

#

So let’s say one year you have name and then the next year column name for same data is nme

desert oar Jun 23, 2021, 11:45 PM

#

so it's not the same exact dataset, it's different datasets but with the same columns in each dataset?

#

that's a different task

lapis sequoia Jun 23, 2021, 11:45 PM

#

Similar column names *

desert oar Jun 23, 2021, 11:45 PM

#

and harder

lapis sequoia Jun 23, 2021, 11:45 PM

#

Oh you’re right apologize

#

Explained incorrectly

#

Yea it’s a hard task that’s for sure

#

I’m having a hard time figuring out the approach to take
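For the misspelled-name case specifically (`name` one year, `nme` the next), a cheap first pass is fuzzy string matching on the column names before looking at the data at all. This sketch uses the standard library's `difflib`. The column lists are invented, and names with no close match are left as `None` for a content-based approach to handle.

```python
import difflib

def match_columns(new_columns, known_columns, cutoff=0.6):
    """Map each new column name to its closest known name, or None if nothing is close."""
    mapping = {}
    for col in new_columns:
        candidates = difflib.get_close_matches(col, known_columns, n=1, cutoff=cutoff)
        mapping[col] = candidates[0] if candidates else None
    return mapping

known = ["name", "address", "phone"]      # hypothetical reference schema
this_year = ["nme", "adress", "zzz"]      # hypothetical misspelled headers
mapping = match_columns(this_year, known)
```

Raising `cutoff` makes the matching stricter. Columns that were renamed outright, rather than misspelled, still need a classifier on the values.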

visual violet Jun 24, 2021, 12:26 AM

#

@desert oar i am so sorry to bother you again

#

i have been doing research today

#

i can't find the correct method to cluster

#

given the variability is not so clumped together

visual violet Jun 24, 2021, 12:43 AM

#

may i ask

#

if i can somehow insert more variables in addition to time series?

#

like drug class and strength

#

and form factors like tablet, liquid, etc

desert oar Jun 24, 2021, 12:46 AM

#

visual violet if i can somehow insert more variables in addition to time series?

Yes

#

https://en.m.wikipedia.org/wiki/Multiple_factor_analysis

Multiple factor analysis

Multiple factor analysis (MFA) is a factorial method devoted to the study of tables in which a group of individuals is described by a set of variables (quantitative and / or qualitative) structured in groups. It may be seen as an extension of:

Principal component analysis (PCA) when variables are quantitative,
Multiple correspondence analysis ...

#

@visual violet https://youtube.com/playlist?list=PLnZgp6epRBbRX8TEp1HlFGqfMf_AxYEj7

YouTube

Multiple Factor Analysis

Course on Multiple Factor Analysis

visual violet Jun 24, 2021, 12:50 AM

#

thank you very much

#

given my situation

#

do you think i should try different methods with my current dataset?

visual violet Jun 24, 2021, 12:55 AM

#

desert oar @visual violet https://youtube.com/playlist?list=PLnZgp6epRBbRX8TEp1HlFGq...

very nice theoretical explanation. i am looking up python implementation rn

pearl nymph Jun 24, 2021, 1:09 AM

#

Hello

#

how are you doing guys

main fox Jun 24, 2021, 1:12 AM

#

velvet thorn you could look into e.g. ARIMA

Hello, thank you for suggesting an ARIMA model yesterday. Traditional methods through sk (I tried using randomforest or xgboost along with a custom train/test split and walk-forward validation) ended up being too convoluted. I found sktime, and they have an ARIMA model that produced the results I was looking for.

visual violet Jun 24, 2021, 1:13 AM

#

main fox Hello, thank you for suggesting an ARIMA model yesterday. Traditional methods th...

you are a god

main fox Jun 24, 2021, 1:14 AM

#

Haha thanks. I was just stubborn enough to read through how machine learning with time series usually goes.

#

And a lot of trial and error

velvet thorn Jun 24, 2021, 1:15 AM

#

main fox Hello, thank you for suggesting an ARIMA model yesterday. Traditional methods th...

yw

#

👋

visual violet Jun 24, 2021, 1:15 AM

#

main fox Haha thanks. I was just stubborn enough to read through how machine learning wit...

wait so you actually read

#

much respect

serene scaffold Jun 24, 2021, 1:21 AM

#

visual violet you are a god

I've been dethroned?

visual violet Jun 24, 2021, 1:21 AM

#

you changed your pfp

#

AND name

#

the pfp looks nice back then :((

serene scaffold Jun 24, 2021, 1:22 AM

#

I'm a mosaic now

visual violet Jun 24, 2021, 1:22 AM

#

serene scaffold I've been dethroned?

the ability to do arima over night

#

is amazing lol

#

maybe the person has a degree in data science already?

#

who knows

serene scaffold Jun 24, 2021, 1:23 AM

#

visual violet maybe the person has a degree in data science already?

I do

visual violet Jun 24, 2021, 1:24 AM

#

didn't you

#

go to school for cs

serene scaffold Jun 24, 2021, 1:26 AM

#

it was cs and data science at the same time

main fox Jun 24, 2021, 1:30 AM

#

visual violet wait so you actually read

I was under the impression that ML with a snapshot of data using xgboost was similar enough to using a time series. Or that xgboost could handle time series easily. That wasn't the case and it took me a lot of errors to realize.

main fox Jun 24, 2021, 1:32 AM

#

visual violet maybe the person has a degree in data science already?

My BS is in Microbiology. I only began studying programming and more advanced quantitative analysis last year

visual violet Jun 24, 2021, 1:32 AM

#

so cool!!!!

#

i might as well study cs + biochem now

#

increasingly sounds like a good option

main fox Jun 24, 2021, 1:35 AM

#

A lot of Pharma companies are in need of people with those backgrounds. Not a lot of people that study CS go for health sciences.

visual violet Jun 24, 2021, 1:36 AM

#

working for faang is so overrated

#

they are good jobs but seems boring

main fox Jun 24, 2021, 1:38 AM

#

Agreed, I'd go for Netflix though lol. Other than that, I'd rather work anywhere else.

visual violet Jun 24, 2021, 1:38 AM

#

i mean

#

i don't think netflix needs a microbiologist lmao

main fox Jun 24, 2021, 1:41 AM

#

Lol yeah they don't. But I'm leaning towards data science

tough frigate Jun 24, 2021, 6:31 AM

#

so am i lol

#

is it true? that Data Science will be overtaken by AIs

raven hare Jun 24, 2021, 8:39 AM

#

tough frigate is it true? that Data Science will be overtaken by AIs

probably, in my opinion at least

#

we need to keep its control

drowsy maple Jun 24, 2021, 10:17 AM

#

I am not getting desirable output... anyone?

warped relic Jun 24, 2021, 10:18 AM

#

I have a git log where the commit hash, author, timestamp, commit message and file logs (these vary in number of lines) are each on separate lines. I could pull out the data individually (except the file logs and messages), but could not create columns from them.

This is the sample file

commit 2232asdfeafdadssc0a63d3ded7e95e894bb735c121f
Author: John Shahi
Date:   Thu Jun 10 05:10:31 2021 +0000

    Feature/bill overview

70    0    src/components/pages/abc.tsx


commit 18asdfasd9104c7fb59d9027f48csdfss8b61776e21d0
Author: rashi.coder
Date:   Wed Jun 9 12:39:33 2021 +0545

    disable other call, refine disabled styles

13    1    src/components/organisms/contact/detail/card/ContactDetailCard.tsx
11    2    src/components/organisms/contact/list/ActionsColumn.tsx

commit 65adfadfc2090e299bf9c514735eb1a2779a12ed9
Author: Ritesh  Poudel
Date:   Wed Jun 9 05:08:00 2021 +0000

commit 04afdad56f5136da10c87d5181dab8afdsfs29e57a5
Author: rashi.coder
Date:   Wed Jun 9 10:22:29 2021 +0545

    fix multiple contacts selection overflow

1    0    src/components/organisms/contact/list/ContactTable.tsx
8    1    src/components/organisms/contact/list/Styles.tsx

this is what I was trying to do

commits = pd.read_csv(
    COMMIT_LOG,
    sep="\n",
    header=None,
    names=["raw"],
    # names=[
    #     "sha",
    #     "author",
    #     "timestamp",
    #     "message",
    #     "additions",
    #     "deletions",
    #     "filename",
    # ],
)
commit_marker = commits[commits["raw"].str.startswith("commit")]
author_marker = commits[commits["raw"].str.startswith("Author")]
date_marker = commits[commits["raw"].str.startswith("Date:")]
print(commit_marker.head())
print(author_marker.head())
print(date_marker.head())
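Since each commit spans a variable number of lines, one possible approach is to walk the log line by line, keep track of the current commit, and emit one row per changed file. The sketch below assumes the layout of the sample above (`commit`, `Author:`, `Date:`, an indented message, then `additions deletions filename` lines). The hashes in the demo are shortened placeholders.

```python
import re

COMMIT_RE = re.compile(r"^commit (\S+)")
NUMSTAT_RE = re.compile(r"^(\d+|-)\s+(\d+|-)\s+(.+)$")

def to_int(value):
    """Convert a numstat count to int ('-' for binary files becomes None)."""
    return int(value) if value.isdigit() else None

def parse_git_log(text):
    """Return one dict per changed file; commits without files get a single row."""
    commits = []
    for line in text.splitlines():
        m = COMMIT_RE.match(line)
        if m:
            commits.append({"sha": m.group(1), "author": None,
                            "timestamp": None, "message": [], "files": []})
            continue
        if not commits or not line.strip():
            continue  # skip blank lines and anything before the first commit
        current = commits[-1]
        if line.startswith("Author:"):
            current["author"] = line[len("Author:"):].strip()
        elif line.startswith("Date:"):
            current["timestamp"] = line[len("Date:"):].strip()
        elif line.startswith("    "):
            current["message"].append(line.strip())  # indented lines are the message
        else:
            stat = NUMSTAT_RE.match(line)
            if stat:
                current["files"].append(
                    (to_int(stat.group(1)), to_int(stat.group(2)), stat.group(3)))

    rows = []
    for c in commits:
        base = {"sha": c["sha"], "author": c["author"],
                "timestamp": c["timestamp"], "message": " ".join(c["message"])}
        for adds, dels, name in c["files"] or [(None, None, None)]:
            rows.append({**base, "additions": adds, "deletions": dels, "filename": name})
    return rows

sample = "\n".join([
    "commit aaa111",
    "Author: John Shahi",
    "Date:   Thu Jun 10 05:10:31 2021 +0000",
    "",
    "    Feature/bill overview",
    "",
    "70\t0\tsrc/components/pages/abc.tsx",
    "",
    "commit bbb222",
    "Author: rashi.coder",
    "Date:   Wed Jun 9 05:08:00 2021 +0000",
])
rows = parse_git_log(sample)
```

`pd.DataFrame(rows)` then gives the sha, author, timestamp, message, additions, deletions and filename columns from the commented-out `names` list.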

spark stag Jun 24, 2021, 10:20 AM

#

drowsy maple I am not getting desirable output... anyone?

you aren't allowing the : in your test string that comes after the From, so that will need to be accounted for in the regex. also, in future a help channel is probably better suited for this type of question

pine umbra Jun 24, 2021, 11:41 AM

#

Guys, how are you? Do any of you already work with geospatial data? I need to plot accident areas and vehicle paths on the map of Brazil. I tried folium and liked it, but when I use too many points for the path the map just doesn't load. Does anyone have a solution for that, or know and like another tool?
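One common workaround when a path has too many points for the browser to render is to thin it before plotting. This was not suggested in the thread, so take it as an assumption about what might help. The sketch keeps evenly spaced points plus both endpoints; `thin_path` is an illustrative helper.

```python
def thin_path(points, max_points):
    """Keep at most max_points coordinates, evenly spaced, including both endpoints."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    n = len(points)
    if n <= max_points:
        return list(points)
    step = (n - 1) / (max_points - 1)
    return [points[round(i * step)] for i in range(max_points)]

# Hypothetical path of 101 (lat, lon) pairs reduced to 11.
path = [(float(i), float(i)) for i in range(101)]
thinned = thin_path(path, 11)
```

The thinned list can be passed to the map's polyline in place of the full path; a shape-aware simplification would preserve curves better than even spacing.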

pure sleet Jun 24, 2021, 11:48 AM

#

whats the best way to learn ML ?

#

directly starting from tensorflow

#

or doing octave/matlab first

serene scaffold Jun 24, 2021, 12:04 PM

#

pure sleet directly starting from tensorflow

I'd probably start with the stuff in sklearn before using tensorflow

pine umbra Jun 24, 2021, 12:04 PM

#

pure sleet whats the best way to learn ML ?

definitely with Python frameworks starting with sklearn and after tensorflow or pytorch

serene scaffold Jun 24, 2021, 12:04 PM

#

Two people have said it independently, so it must be true

pine umbra Jun 24, 2021, 12:04 PM

#

Tensorflow is most used in deep learning, it is a step after

serene scaffold Jun 24, 2021, 12:06 PM

#

and while I know deep learning sounds cooler, that's just what they call it. the algorithms in sklearn can be great for a lot of use cases and don't require you to know quite as much math to wrap your head around what's happening.

pine umbra Jun 24, 2021, 12:08 PM

#

this is true... a lot of problems can be solved using ML with sklearn... some other specific and difficult problems need DL and other frameworks.

pure sleet Jun 24, 2021, 12:23 PM

#

serene scaffold I'd probably start with the stuff in sklearn before using tensorflow

thanks

pure sleet Jun 24, 2021, 12:24 PM

#

pine umbra definitely with Python frameworks starting with sklearn and after tensorflow or ...

thanks

grave breach Jun 24, 2021, 12:54 PM

#

pine umbra Guys, how are you? Do any of you already work with geospatial data? I need to pl...

Mathematica has great geospatial capabilities

#

Maybe I can try helping you

hollow ember Jun 24, 2021, 1:32 PM

#

Can someone help me with pandas and datasets?

arctic wedgeBOT Jun 24, 2021, 1:37 PM

#

Hey @hollow ember!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

hollow ember Jun 24, 2021, 1:38 PM

#

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

#

can someone help me out . Thank you

visual violet Jun 24, 2021, 1:39 PM

#

1 means diabetes?

hollow ember Jun 24, 2021, 1:39 PM

#

yes

visual violet Jun 24, 2021, 1:41 PM

#

recently i have learned a thing called t-sne

#

but since you said predict, i have no idea how

hollow ember Jun 24, 2021, 1:41 PM

#

yeah im confused too

#

Can anyone else help me out?

cedar sun Jun 24, 2021, 2:08 PM

#

when training a model

#

can i tell it to focus more on a specific class?

slim marten Jun 24, 2021, 2:10 PM

#

Hello, I'm trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.

My test data (34 pictures) is saved in a 'test' folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.

I receive a "Found 34 images belonging to 1 classes." message. However, the result I get in the terminal for this code line result = seqModel.evaluate(data, batch_size=1, verbose=1) is a very weird one: 2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

Why do I receive a "2/2" every time I run the script now, no matter what batch_size I choose? And why is my loss 282.6923, while accuracy is 0.7353? Doesn't it look super weird? I know I'm doing something wrong, but I just can't figure it out - maybe when creating the data generator or maybe when using flow_from_directory? (When I pass the validationDataGenerator as the first argument, in order to test it, it all seems fine, but here I just can't figure it out.)

A little bit of help would be appreciated. 🙂

desert oar Jun 24, 2021, 2:23 PM

#

cedar sun can i tell it to _focus_ more on a specific class?

yes, you can increase the weight of that class
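
One common way to do this in Keras is the `class_weight` argument of `model.fit`, which takes a dict mapping class indices to weights. A minimal sketch of building such a dict, assuming integer labels; the helper name `balanced_class_weights` and the toy label list are made up for illustration:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes get weights above 1, common classes below 1.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels: class 1 is the rare class the model should focus on.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(labels)
print(weights)

# Assumed usage with a compiled Keras model (not run here):
# model.fit(x_train, y_train, class_weight=weights)
```

Raising the weight of one class by hand, instead of balancing, works the same way: only the numbers in the dict change.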

austere swift Jun 24, 2021, 2:35 PM

#

slim marten Hello, I'm trying to measure the performance (accuracy and loss) of my model and...

could you send some code?

slim marten Jun 24, 2021, 2:36 PM

#

Yes, sure

#

imageDataGenerator = ImageDataGenerator(validation_split = 0.2,  
                                   rescale = 1./255,             
                                   rotation_range = 40,          
                                   width_shift_range = 0.2,      
                                   height_shift_range = 0.2,    
                                   zoom_range = 0.2,            
                                   horizontal_flip = True,       
                                   fill_mode = 'nearest')
trainingDataGenerator = imageDataGenerator.flow_from_directory(
                                dir,
                                target_size = (70, 70),
                                batch_size = batchSize,
                                color_mode="rgb", 
                                class_mode = 'binary',              
                                shuffle = True,
                                seed=42,                    
                                subset = 'training')            

validationDataGenerator = imageDataGenerator.flow_from_directory(
                                dir,
                                target_size = (70, 70),
                                batch_size = batchSize,
                                color_mode="rgb",
                                class_mode = 'binary',               
                                subset = 'validation')

so these are for the training and the validation - using the data from my dir directory, which is gonna be divided as 80% for the training, 20% for the validation
the problem is that, except for the dir directory, I have a test directory as well
and I'm trying to obtain a generator based on the images in it

cedar sun Jun 24, 2021, 2:36 PM

#

desert oar yes, you can increase the weight of that class

how?

slim marten Jun 24, 2021, 2:38 PM

#

austere swift could you send some code?


test_data = 'C:/Users/Ana/Desktop/Licenta/Practic/maskAPI/test'

datagen = ImageDataGenerator(rotation_range = 40,          # rotirea imaginilor
                            width_shift_range = 0.2,      # modificarea latimii
                            height_shift_range = 0.2,     # modificarea inaltimii
                            zoom_range = 0.2,            
                            horizontal_flip = True,       # intoarce imaginea orizontal
                            fill_mode = 'nearest')

data = datagen.flow_from_directory('./test', classes=['test'], target_size=(70, 70), color_mode='rgb')

result = seqModel.evaluate(data, batch_size=1, verbose=1)

I've tried many variants, this is just the last one of them
by the way, I commented most of the training part, as I don't think it's relevant - only showed you the other 2 generators, that work just fine (I followed a tutorial for those)
the test part is what I don't get
I just want to evaluate the loss and accuracy of my model and I simply don't get how

austere swift Jun 24, 2021, 2:42 PM

#

make sure that the directory you're getting the images from actually has the correct amount of images and that the data generator read them correctly, i don't see anything else that would be wrong

slim marten Jun 24, 2021, 2:46 PM

#

yes, test directory has 34 images. (The directory 'test' from test_data has one subfolder called test containing all the images)

#

but even if I look at the result, I don't get what that 2/2 is and why it remains like that. Shouldn't it be 34/34? (it was like that at some point, but I had other problems then when experimenting)
2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

#

Hmm, any other suggestions for obtaining my accuracy and loss for the test data, other than the evaluate() method? 🙂

austere swift Jun 24, 2021, 3:04 PM

#

wait i think i know the issue

slim marten Jun 24, 2021, 3:04 PM

#

YAY

austere swift Jun 24, 2021, 3:04 PM

#

you didn't specify batch_size in the flow_from_directory call

#

and by default it's 32

slim marten Jun 24, 2021, 3:04 PM

#

OH, let me try

austere swift Jun 24, 2021, 3:04 PM

#

so even if the evaluate() method has a batch size of 1, it'll take 1 batch from the datagen which will be 32
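
The `2/2` in the progress bar counts batches, not images. A quick check of the arithmetic, assuming the default batch size of 32 mentioned above:

```python
import math

n_images = 34      # "Found 34 images" from the log above
batch_size = 32    # default when none is given to the generator

steps = math.ceil(n_images / batch_size)
print(steps)  # 2 -> one batch of 32 images plus one batch of 2

# With a batch size of 1 on the generator, every image is its own step.
print(math.ceil(n_images / 1))  # 34
```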

slim marten Jun 24, 2021, 3:06 PM

#

true

#

34/34 [==============================] - 0s 3ms/step - loss: 239.4841 - accuracy: 0.7647

#

now it's 34/34

#

still getting the 239.4841 loss value

austere swift Jun 24, 2021, 3:06 PM

#

what loss alg?

slim marten Jun 24, 2021, 3:08 PM

#

what do you mean exactly?

austere swift Jun 24, 2021, 3:08 PM

#

the loss algorithm

#

which one are you using

hollow ember Jun 24, 2021, 3:09 PM

#

How can i use multiple categories In X or Y axis?

slim marten Jun 24, 2021, 3:10 PM

#

austere swift the loss algorithm

               loss = 'binary_crossentropy',                   
               metrics = ['accuracy'])
In the compile method I use 'binary_crossentropy'
and I have 2 classes in the 'train' folder -> _with mask_ and _without mask_
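
One possible reason for a loss in the hundreds, offered as a guess rather than a diagnosis: the training generator above uses `rescale = 1./255`, while the test generator does not, so the model may be seeing pixel values up to 255 times larger than it was trained on. The toy calculation below uses a made-up one-weight "model" (`w`, `b` and the pixel values are invented) and computes binary cross-entropy from the raw logit to show how unscaled inputs can inflate the loss:

```python
import math

def bce_from_logit(y_true, z):
    """Binary cross-entropy computed from a raw logit z in a numerically stable form."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

w, b = 4.0, -2.0            # invented weights, suited to inputs in [0, 1]
y_true = 0                  # true label of the toy example

pixel_scaled = 0.2          # pixel value after rescale = 1./255
pixel_raw = 0.2 * 255       # the same pixel without rescaling

loss_scaled = bce_from_logit(y_true, w * pixel_scaled + b)
loss_raw = bce_from_logit(y_true, w * pixel_raw + b)
print(loss_scaled, loss_raw)
```

If that is the cause, creating the test generator with `ImageDataGenerator(rescale = 1./255)` and no augmentation would be the thing to try. Since `flow_from_directory` takes its labels from the subfolder names, the test folder would probably also need the same two class subfolders as the training folder for the accuracy to mean anything.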

grave breach Jun 24, 2021, 3:12 PM

#

hollow ember The objective of the dataset is to diagnostically predict whether or not a patie...

I suggest you use Keras's brand new trees

hollow ember Jun 24, 2021, 3:12 PM

#

i'm restricted to using only matplotlib and seaborn for my assignment

grave breach Jun 24, 2021, 3:13 PM

#

You can implement them by yourself

#

Or, you can implement a neural network in raw python

#

If you know the theory they're not so complex

#

But, implementing a neural network without numpy can be hard

hollow ember Jun 24, 2021, 3:14 PM

#

I know nothing, that's the problem

austere swift Jun 24, 2021, 3:14 PM

#

why are you suggesting implementing a neural network from scratch? theres no need for a neural network for that application

#

a simple logistic regression model would probably be fine
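
For reference, logistic regression is small enough to write by hand if libraries are off the table. A minimal sketch with one feature and plain gradient descent; the toy numbers are invented and stand in for a single standardized measurement, not the actual diabetes data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean binary cross-entropy with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Invented toy data: low values -> class 0, high values -> class 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)
```

With several features the same update applies per weight; the single-feature version just keeps the loop readable.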

grave breach Jun 24, 2021, 3:14 PM

#

austere swift why are you suggesting implementing a neural network from scratch? theres no nee...

That's why I suggested trees

#

But they're pretty fun

grave breach Jun 24, 2021, 3:15 PM

#

hollow ember I know nothing, that's the problem

If you need something simpler try implementing a dimensionality reduction algorithm and a clustering one

austere swift Jun 24, 2021, 3:15 PM

#

grave breach That's why I suggested trees

oh yeah i saw keras and thought you meant a neural network lol

#

but a logistic regression model should be fine

#

you can do that in seaborn as well

grave breach Jun 24, 2021, 3:16 PM

#

I agree

austere swift Jun 24, 2021, 3:16 PM

#

no seaborn wow

#

https://seaborn.pydata.org/generated/seaborn.regplot.html

hollow ember Jun 24, 2021, 3:17 PM

#

can u help with this, im quite new to this whole thing

grave breach Jun 24, 2021, 3:18 PM

#

@hollow ember By the way, can you upload the dataset?

#

Maybe as a .zip

hollow ember Jun 24, 2021, 3:18 PM

#

sure

arctic wedgeBOT Jun 24, 2021, 3:19 PM

#

Hey @hollow ember!

It looks like you tried to attach file type(s) that we do not allow (.rar). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

hollow ember Jun 24, 2021, 3:19 PM

#

can send rar files

grave breach Jun 24, 2021, 3:19 PM

#

Try uploading it on pastebin

hollow ember Jun 24, 2021, 3:19 PM

#

yup

#

https://pastebin.com/hMGDZA1b @grave breach https://ufile.io/tfwhi557

slim marten Jun 24, 2021, 3:20 PM

#

austere swift the loss algorithm

why? is it ok? c:

vital lodge Jun 24, 2021, 3:21 PM

#

Hey, I'm new to computer vision and I tried training a resnet 50 model
as I'm working with like 280 images I tried implementing data augmentation
now the accuracy score is something like 0.006
I'm guessing it overfitted, are there any solutions to this problem?

grave breach Jun 24, 2021, 3:36 PM

#

@hollow ember I got something for you

#

I tried different methods on your dataset and measured the accuracy of each of those

#

Here's my results:

#

LogisticRegression -> 0.7825520833333334
DecisionTree -> 0.7916666666666666
GradientBoostedTrees -> 0.7838541666666666
SVM -> 0.8033854166666666
NearestNeighbors -> 0.75390625
Neural Network -> 0.7955729166666666

#

(Accuracy)

#

So I think that an SVM would be the best way to go

#

(By the way, I didn't do a train/test split, so some of these scores could be inflated by overfitting)
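
Holding out part of the data gives a more honest number. A minimal sketch of a shuffled hold-out split using only the standard library; `train_test_split_rows` is a hypothetical helper and the ten integer rows are placeholders for real dataset rows (scikit-learn's `train_test_split` does the same job):

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for dataset rows
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

The model is then fitted on `train_rows` only and scored on `test_rows`.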

grave breach Jun 24, 2021, 3:41 PM

#

vital lodge Hey, I'm new to computer vision and I tried training a resnet 50 model as I'm wo...

I think that 280 images is too low

#

By the way, are you dealing with a classification task?

slim marten Jun 24, 2021, 3:42 PM

#

so is there any problem with this? 🙂

              loss = 'binary_crossentropy',                   
              metrics = ['accuracy'])

vital lodge Jun 24, 2021, 3:48 PM

#

grave breach I think that 280 images is too low

Oh, I'm trying to count objects in an image

slim marten Jun 24, 2021, 3:54 PM

#

slim marten so is there any problem with this? 🙂 ```seqModel.compile(optimizer = Adam(lear...

sad moment 😦

cerulean mauve Jun 24, 2021, 4:05 PM

#

Can pandas df.drop() accept a python list? I swear.

l = ['foo', 'bar']
df.drop(columns=l, axis=1)

Does not work, even with labels, or raw...
while

df.drop(['foo', 'bar'], axis=1)

Does work. What the !@#!@#$ is that?

desert oar Jun 24, 2021, 4:06 PM

#

@cerulean mauve what does "does not work" mean?

#

you probably can't use both axis= and columns=

cerulean mauve Jun 24, 2021, 4:06 PM

#

Tried that as well.

#

Example 1 with the list does not work, already tried it bare

desert oar Jun 24, 2021, 4:06 PM

#

.drop(l, axis='columns') is the same as .drop(columns=l) and .drop(l, axis=1)

cerulean mauve Jun 24, 2021, 4:08 PM

#

I should be able to .drop(l) with no problem according to what I have seen and read.

#

Every single way, I just tried them again to make sure that I am not crazy.

#

I got need to specify at least one of 'labels', 'index', or 'columns'

#

I got that with columns specified

#

like .drop(columns=l)

#

Let me walk you through what I have here.

#

#First I am taking a byte string to buffer from s3, using panda's optional library(fsspec) to handle the buffer into the DF, which works fine, I can print(df.to_string) and it comes out well
csv_buffer = StringIO(some_bytes_from_s3)
df = pandas.read_csv(csv_buffer)
my_list = [ 'foo', 'bar']
df.drop(my_list) # does not work
#df.drop(columns=my_list) # does not work
#df.drop(my_list, axis=1) # does not work

# I get the aforementioned error every time.

#

I can give you some sample data that the CSV looks like.

#

If needed.

desert oar Jun 24, 2021, 4:20 PM

#

!e ```python
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4], 'b': [11,12,13,14], 'c': [21,22,23,24]})
print(df.drop(['a', 'c'], axis=1))
print()
print(df.drop(['a', 'c'], axis='columns'))
print()
print(df.drop(columns=['a', 'c']))

arctic wedgeBOT Jun 24, 2021, 4:20 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |     b
002 | 0  11
003 | 1  12
004 | 2  13
005 | 3  14
006 | 
007 |     b
008 | 0  11
009 | 1  12
010 | 2  13
011 | 3  14
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/xajinecesi.txt?noredirect

desert oar Jun 24, 2021, 4:20 PM

#

so yes i think you need to send a csv and the exact code to reproduce the error, because i can't reproduce it here
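
For what it's worth, that exact message is what pandas raises when everything passed to `drop` is `None`, so it may be worth printing the variable right before the call. A small sketch of that situation; the `sort()` slip is only a guess at how a list variable could end up as `None`, and the column names are made up:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2], 'bar': [3, 4], 'baz': [5, 6]})

cols = ['foo', 'bar']
remaining = df.drop(columns=cols).columns.tolist()
print(remaining)  # ['baz']

# list.sort() sorts in place and returns None, so this rebinds the name to None.
cols = ['foo', 'bar'].sort()

try:
    df.drop(columns=cols)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```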

cerulean mauve Jun 24, 2021, 4:21 PM

#

As far as I can tell

desert oar Jun 24, 2021, 4:21 PM

#

use https://paste.pythondiscord.com/ for the csv file

cerulean mauve Jun 24, 2021, 4:22 PM

#

if I do as you have done, and .drop(['a', 'b'], axis=1) it will work, it's only when I use a variable of type list that it fails.

desert oar Jun 24, 2021, 4:22 PM

#

but ['a', 'b'] is a list

cerulean mauve Jun 24, 2021, 4:22 PM

#

right?

#

I think my container is fubarred.

#

I'm using python-lambda-wrapper to develop aws lambda jobs for some database backfilling from CSV

#

going to test on my other machine

#

brb

drowsy maple Jun 24, 2021, 4:31 PM

#

spark stag you aren't allowing the `:` in your test string this is after the `From`, so tha...

Thank you .... i got it... and i will remember it in the future

cerulean mauve Jun 24, 2021, 4:35 PM

#

https://paste.pythondiscord.com/awugevijuf.lua

#

I want to drop the columns arg_reportdate and arg_reportname

#

I place those in a variable that is classed as a list

#

When I use the list, I cannot drop them.

#

When I literally write it out aka .drop(['arg_reportdate', 'arg_reportname'])

#

It works.

#

just not with the python list variable.

#

Which is maddening.

#

because it's a !@#$!$@#!@$ list

#

🙃

#

maybe it's time to find food and microwave it.

#

i'll brb

cedar sun Jun 24, 2021, 4:42 PM

#

if my model input shape is (100,100,4)

#

it means the images it reads have alpha, right?

desert oar Jun 24, 2021, 4:43 PM

#

it means that your input shape is 100,100,4 😉

#

if they're images, yeah probably the 4 corresponds to rgba

#

but nobody can answer that question for you, you have to know about your own data

cedar sun Jun 24, 2021, 4:43 PM

#

so if i wanna augment my data in a custom way using the info in the alpha channel, are the images given by flow_from_directory gonna have alpha too?

#

ah nooooo

#

ooooh

#

ok ok

#

color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb". Whether the images will be converted to have 1, 3, or 4 channels.

#

what happens if i read an RGB image as RGBA?

#

cuz i have some images with Alpha, and i wanna do special things with those only

cerulean mauve Jun 24, 2021, 4:46 PM

#

back @desert oar thanks for looking for me.

cedar sun Jun 24, 2021, 4:47 PM

#

fml

#

do u guys have any idea of how to approach this?

#

my dataset contains rgb and rgba images

cerulean mauve Jun 24, 2021, 4:48 PM

#

convert to grayscale?

cedar sun Jun 24, 2021, 4:48 PM

#

I am using the flow_from_directory method, which reads rgb images by default, but i can set it to rgba. The thing is... i get this error: ValueError: could not broadcast input array from shape (160,160,3) into shape (160,160,4)

#

can i read rgb images as rgb and rgba as rgba?

cerulean mauve Jun 24, 2021, 4:51 PM

#

Maybe, seems like you just want to drop the alpha layer, no?

#

oh wait

#

wait you wanted to do something special with those.

cedar sun Jun 24, 2021, 4:52 PM

#

yes, but i think i cant

#

in an easy way

cerulean mauve Jun 24, 2021, 4:53 PM

#

I would check the image metadata for the existence of an alpha layer, and then operate differently based on that.
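
For PNG files this check can be done without decoding the image, since the colour type is stored in the header. A minimal sketch, assuming the files are PNGs; `png_has_alpha_channel` and `fake_png_header` are hypothetical helpers, and the check does not detect palette images that carry transparency in a separate chunk:

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def png_has_alpha_channel(header: bytes) -> bool:
    """Return True if the PNG header declares an alpha channel.

    `header` must hold at least the first 26 bytes of the file:
    8-byte signature, 4-byte chunk length, 4-byte chunk type (IHDR),
    then width, height, bit depth and colour type.
    """
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b'IHDR':
        raise ValueError('not a PNG header')
    colour_type = header[25]
    # Colour type 4 = greyscale with alpha, 6 = truecolour with alpha.
    return colour_type in (4, 6)

def fake_png_header(colour_type: int) -> bytes:
    """Build a 160x160, 8-bit PNG header for demonstration purposes."""
    ihdr = struct.pack('>IIBBBBB', 160, 160, 8, colour_type, 0, 0, 0)
    return PNG_SIGNATURE + struct.pack('>I', len(ihdr)) + b'IHDR' + ihdr

print(png_has_alpha_channel(fake_png_header(2)))  # False (RGB)
print(png_has_alpha_channel(fake_png_header(6)))  # True (RGBA)
```

With a real file, reading the first 26 bytes in binary mode supplies the header.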

cedar sun Jun 24, 2021, 4:53 PM

#

like... i could rewrite flow_from_dir method, and if color_mode='rgba' and the current image has 3 channels, read it as rgb

#

or something

twin moth Jun 24, 2021, 4:53 PM

#

Hey guys, I need some help with a DS project I'm working on.
I'm trying to predict people's tastes using recipes, profiles, favorites, and reviews I scraped from a popular recipe website.

I currently have 95K recipes, 2M profiles, 2M reviews and ~70M favorites.
How would one use the data I fetched in order to finish the project?

cerulean mauve Jun 24, 2021, 4:53 PM

#

if you wanted to drop the alpha layer yes.

cedar sun Jun 24, 2021, 4:54 PM

#

cerulean mauve I would check the image metadata for the existence of an alpha layer, and then o...

i cant cuz the image i get is read from flow from dir, and it reads imgs as rgb

cerulean mauve Jun 24, 2021, 4:55 PM

#

got the code handy?

cedar sun Jun 24, 2021, 4:55 PM

#

?

#

wdym

cerulean mauve Jun 24, 2021, 4:55 PM

#

source code?

#

I assume you are using a method

cedar sun Jun 24, 2021, 4:57 PM

#

cedar sun like... i could rewrite ``flow_from_dir`` method, and if ``color_mode='rgba'`` a...

.

#

i mean

#

https://keras.io/api/preprocessing/image/#flowfromdirectory-method

Keras documentation: Image data preprocessing

#

this?

cerulean mauve Jun 24, 2021, 4:57 PM

#

seems right 😄

cedar sun Jun 24, 2021, 5:01 PM

#

@desert oar how do i set weights for a certain class? i asked u before

grave breach Jun 24, 2021, 5:02 PM

#

@cedar sun what do you mean?

cerulean mauve Jun 24, 2021, 5:02 PM

#

The class ImageDataGenerator seems to have a parameter called data_format, which sounds promising. @twin moth

cedar sun Jun 24, 2021, 5:03 PM

#

cerulean mauve The class `ImageDataGenerator ` seems to have an parameter called `data_format` ...

yeah but still, i would have to override the method

cedar sun Jun 24, 2021, 5:03 PM

#

grave breach @cedar sun what do you mean?

like, telling the model to focus more on a certain class

cerulean mauve Jun 24, 2021, 5:03 PM

#

Polymorphism is the way the truth, and the light.

#

you can use super() to call the parent's method from your own class.

cedar sun Jun 24, 2021, 5:03 PM

#

cerulean mauve Polymorphism is the way the truth, and the light.

yeah but ive never done it in python lol

grave breach Jun 24, 2021, 5:04 PM

#

cedar sun like, telling the model to focus more on a certain class

Sorry, I wasn't following the discussion from the beginning. Can you tell me what goal you are trying to achieve, what your data is, and how you are trying to do it?

#

So maybe I can be helpful

cedar sun Jun 24, 2021, 5:05 PM

#

cerulean mauve you can use super() to call the parent's method from your own class.

i dont think that would be enough. What i actually want is to make flow_from_directory method behave the following way:
if color_mode = 'rgba', then attempt to read all images as rgba. If an image doesnt have alpha, then read it as rgb

cerulean mauve Jun 24, 2021, 5:05 PM

#

to make a child class:

class Parent:
  def __init__(self, txt):
    self.message = txt

  def printmessage(self):
    print(self.message)

class Child(Parent):
  def __init__(self, txt):
    super().__init__(txt)

x = Child("Hello, and welcome!")

x.printmessage()
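
Building on that, an override can call the parent's version with `super()` and then add its own behaviour. A minimal sketch with invented class names; it illustrates the mechanism only and is not a working `flow_from_directory` replacement:

```python
class Loader:
    def load(self, name):
        # Parent behaviour: pretend every image is read as 3-channel RGB.
        return {'name': name, 'channels': 3}

class AlphaAwareLoader(Loader):
    def __init__(self, names_with_alpha):
        self.names_with_alpha = set(names_with_alpha)

    def load(self, name):
        # Reuse the parent's result, then adjust it for the special case.
        image = super().load(name)
        if name in self.names_with_alpha:
            image['channels'] = 4
        return image

loader = AlphaAwareLoader(['logo.png'])
print(loader.load('photo.jpg'))  # {'name': 'photo.jpg', 'channels': 3}
print(loader.load('logo.png'))   # {'name': 'logo.png', 'channels': 4}
```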