#data-science-and-ml
that's one good way to do it, if you have a string you want to use to represent missing data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'v1': ['a', 'b', np.nan],
'v2': ['x', np.nan, 'z'],
})
not_null = df[['v1', 'v2']].notnull().all(axis=1)
df.loc[not_null, 'v3'] = df.loc[not_null, 'v1'] + ' -> ' + df.loc[not_null, 'v2']
you could do it this way too
i generally recommend not ever using .astype unless you really need to
usually i want more control than that
Ahhhh thats perfect
So I could either fillna in the columns as needed before I concatenate, or do that, then afterwards fillna in v3 with 'Null values detected, unable to generate directive' or something
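A minimal sketch of the second option, reusing the snippet above: build v3 only where both inputs are present, then fill whatever is left. The placeholder text is just the one from the message; any string works.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'v1': ['a', 'b', np.nan],
    'v2': ['x', np.nan, 'z'],
})

# Rows where both inputs are present.
not_null = df[['v1', 'v2']].notnull().all(axis=1)

# Concatenate only the complete rows; the other rows of v3 stay missing.
df.loc[not_null, 'v3'] = df.loc[not_null, 'v1'] + ' -> ' + df.loc[not_null, 'v2']

# Afterwards, replace the missing v3 values with an explicit marker.
placeholder = 'Null values detected, unable to generate directive'
df['v3'] = df['v3'].fillna(placeholder)
print(df)
```

Filling v3 at the end keeps the NaN in v1 and v2 intact, so you can still tell which input was missing.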
Thanks for the advice
yeah, i've seen way too many f'ed up datasets with nan where it doesn't belong
can't allow it
also don't use csv to save this data once you've generated it... use parquet or something else intelligent
i can't figure out why this doesn't converge to a local minimum, the cost function is oscillating
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image
rng = np.random.default_rng()
img = image.imread('fruits_small.jpg') / 256.0
h, w = img.shape[:2]
copy = np.array(img)
plt.subplot(3, 3, 1)
plt.title('Original')
plt.imshow(img)
for plot, k in enumerate([4, 8, 16, 32, 64]):
    centroids = rng.choice(copy.reshape((-1, 3)), size=k, replace=False)
    clusters = np.empty((h, w))
    print(centroids)
    while True:
        # assign every pixel to its nearest centroid
        for y, x in np.ndindex(img.shape[:2]):
            v = copy[y, x]
            clusters[y, x] = np.argmin(np.linalg.norm(centroids - v, axis=1))
        cost = 0
        for i in range(k):
            cost += np.linalg.norm(copy[clusters == i] - centroids[i], axis=1).sum()
        print(f'cost = {cost}')
        # move every centroid to the mean of its pixels
        d = 0
        for i in range(k):
            new_centroid = copy[clusters == i].mean(axis=0)
            d += np.linalg.norm(centroids[i] - new_centroid)
            centroids[i] = new_centroid
        if d == 0:
            break
    print(centroids)
    for i in range(k):
        img[clusters == i] = centroids[i]
    plt.subplot(3, 3, plot + 2)  # slot 1 already holds the original image
    plt.title(f'k = {k}')
    plt.imshow(img)
plt.show()
this version isn't the 5d one, it clusters based solely on color
@ember sapphire this is k-means? i think usually k-means stops after a fixed number of iterations anyway
the cost function should be decreasing
mine isn't
i can't figure out why though
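One possible reason the printed cost bounces around (an assumption, not something tested on the image): the update step uses the mean, which minimizes the sum of squared distances, while the printed cost is the sum of plain distances. Lloyd's guarantee only covers the squared version. A small sketch on random points standing in for the pixels, with the per-pixel loop replaced by broadcasting (the function name `lloyd` is made up here):

```python
import numpy as np

def lloyd(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm; returns centroids, labels and the cost history."""
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    costs = []
    for _ in range(max_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Cost = sum of SQUARED distances to the assigned centroid.
        costs.append(sq_dist[np.arange(len(points)), labels].sum())
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster goes empty
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, costs

rng = np.random.default_rng(0)
pixels = rng.random((2000, 3))  # stand-in for img.reshape(-1, 3)
_, _, costs = lloyd(pixels, k=8, rng=rng)
print(costs)
```

With the squared cost, each printed value should be no larger than the one before it. The empty-cluster guard also avoids the NaN you get from taking the mean of zero pixels.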
anyone interested in making this prototype better? https://youtu.be/UIOO09AYvQM
it's an NLP project on vaccine information.
if you're interested DM me
it's a private video?
does keras have a fast way to measure top 5 accuracy?
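Keras ships metric classes for this: `tf.keras.metrics.TopKCategoricalAccuracy(k=5)` for one-hot labels and `tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)` for integer labels, either of which can go in the `metrics=[...]` list of `model.compile`. To show what they compute without needing TensorFlow, here is a NumPy sketch (the function name `top_k_accuracy` is made up for illustration):

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=5):
    """Fraction of rows whose true class index is among the k highest scores."""
    # argpartition leaves the indices of the k largest scores in the last k columns.
    top_k = np.argpartition(y_prob, -k, axis=1)[:, -k:]
    return (top_k == y_true[:, None]).any(axis=1).mean()

# Tiny example with 4 classes and k=2.
y_prob = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.3, 0.4, 0.2],
])
y_true = np.array([2, 3, 2])
acc = top_k_accuracy(y_true, y_prob, k=2)
print(acc)
```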
Hello everyone, I'm working on an IT project proposal based on ripeness detection through machine learning. My project consists of a camera observing a basket of bananas (for example). I'd like to know if there are effective ways to recognize if a banana looks rotten based on what the camera sees from a bunch of bananas in a basket?
My main doubt is how hard this is, and whether there are machine learning tools that make it easier to identify one bad banana among the group in the basket, because from what I've seen most related projects do it at a small scale, like putting 1 banana in front of a camera. Any suggestions?
i suspect that such a model will mostly be detecting brown vs yellow vs green
training data is probably the hardest part here
how many photos of bananas, in how many different configurations, and different kitchens, and different lighting conditions do you have to collect and label before you have enough data? potentially a lot
maybe you can use an off-the-shelf object detection algorithm to find the bananas in the image first, so your model doesn't have to be so powerful
would be inside a supermarket so honestly we are considering one type of configuration haha
so you want to have a camera that takes a photo of the banana display every day, and sends an alarm when they look overripe?
More like live
or maybe pictures every hour
and yeah, send an alarm when they look overripe
it might be a fun hobby project, but can't an employee go check the bananas? i know they have a lot to do in a grocery store already, but it's a pretty quick task with a quick yes/no answer. of course the downside is that employees might let the bananas go bad because they're lazy and don't want to throw them out if they say "yes"
you still will want to account for day/night lighting conditions
what if someone in a brown shirt walks in front of the camera?
the problem we are trying to solve here is that in my country, bananas and other fruits don't look very good when they are ripe, but they are still good to eat. stores throw them in the trash because they don't meet their "standards" to be sold
and these could be donated or sold cheaper to institutions that help poor people
I'm using PyTorch XLA to port a training script from using a CPU/GPU to a TPU. Is there a way to use a ParallelLoader on more than one variable?
Hey! I'm not sure if this is the right channel but it's my first post so please be lenient with me this time
I have a problem with an unusual task… I have a dataframe like this:
id name
0 962966 A
1 402171 A
2 478034 B
3 936505 B
4 516152 C
5 379497 C
6 977649 D
7 869046 D
Now what I have to do is split this dataframe by name into several multi-sheet Excel files... So for example in my case I would like to have 4 files (named: A, B, C, D), each with 2 sheets named by id inside (for example A: 962966 and 402171)
This is my code (random_df is only to fill up sheets with some data):
ExcelWriter = 0
for index, row in df.iterrows():
    random_df = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns = list("1234"))
    if ExcelWriter != pd.ExcelWriter(row["name"] + ".xlsx"):
        if ExcelWriter != 0:
            ExcelWriter.save()
        ExcelWriter = pd.ExcelWriter(row["name"] + ".xlsx")
        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))
    else:
        random_df.to_excel(
            ExcelWriter,
            sheet_name = str(row["id"]))
ExcelWriter.save()
The result I get is almost fine because this code generates 4 excel files, but each has only one sheet, named after the last id for that name... it looks like the data is being overwritten by the next rows but I have no idea how to fix this because I'm a complete beginner in pandas. Do you have any ideas?
For help you should ask in one of the help channels. To learn how to use them, see #how-to-get-help
@limpid oxide you can ask for data science help in these channels
@livid venture you can do a groupby and then iterate over that. That will be faster than trying to fumble around by row.
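A sketch of the groupby idea. In the posted loop, `ExcelWriter != pd.ExcelWriter(...)` builds a brand-new writer on every row, so the comparison is always true and the file is recreated each time, which would explain why only the last sheet survives. Opening one writer per name avoids that. The helper name `write_workbooks` and the `out_dir` parameter are made up here, and writing .xlsx assumes an Excel engine such as openpyxl is installed.

```python
import numpy as np
import pandas as pd

def write_workbooks(df, out_dir="."):
    """Write one .xlsx per name, with one sheet per id."""
    for name, group in df.groupby("name"):
        # One writer per file; the context manager saves and closes it on exit.
        with pd.ExcelWriter(f"{out_dir}/{name}.xlsx") as writer:
            for sheet_id in group["id"]:
                # Placeholder data, as in the question.
                random_df = pd.DataFrame(np.random.randint(0, 100, (5, 4)), columns=list("1234"))
                random_df.to_excel(writer, sheet_name=str(sheet_id))

df = pd.DataFrame({
    "id": [962966, 402171, 478034, 936505, 516152, 379497, 977649, 869046],
    "name": ["A", "A", "B", "B", "C", "C", "D", "D"],
})

# Which sheets end up in which file:
layout = {name: group["id"].astype(str).tolist() for name, group in df.groupby("name")}
print(layout)
```

Calling `write_workbooks(df)` then produces A.xlsx through D.xlsx. The `with` block also replaces the explicit `.save()` call, which recent pandas versions no longer provide on the writer.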
can computers create pixar movies
@thorn bobcat not autonomously
style transfer looks like a filter to me tbh
it feels like AI is a long way from where I want it tbh.
You want it to be able to make two-hour animations with coherent plots?
I want it to be able to convert real life videos into pixar movie scenes
want to bring the power of animation studios to ordinary users.
I might have to apply this to individual frames containing just the foreground.
I'll also have to extract the background so I can control the setting.
something like this but for video
https://github.com/CMU-Perceptual-Computing-Lab/openpose
could work for cgi
You're model has >93% accuracy, >93% sensitivity, >93% specificity AND >93% precision?
Your*
yes, I had very few false positives and false negatives
Can u always overfit a model?
You in theory could, but you probably don't want to
distance measurement methods for time-series clustering algo calculate the physical distance in space between two objects right?
you might be thinking of euclidean distance
what are you using to determine their distance? a function from a certain library?
tslearn
metric = dtw
algo = k-means
i am in the process of writing the actual paper
dtw?
dynamic time warp
works similarly to euclidean
but more lenient with time-series data
*at least that is what i got from my research
"DTW is computed as the Euclidean distance between aligned time series"
would be nice if they define what Xi and Y j are
elements of an array
the ith element of X and the jth element of Y
I think. let me check
i want to try out machine learning and as a final goal I want to try to make a chatbot. I tried following a tutorial for TensorFlow but i didn't really understand it and was mostly just copying code from the tutorial, anyone know a good place to start and/or a good TensorFlow tutorial for something simple?
https://github.com/tslearn-team/tslearn/blob/60a39f2/tslearn/metrics/dtw_variants.py#L387-L468
so yeah after reading this , i should be able to figure out
Yes, it's the ith element of X and the jth element of Y, where each (i, j) tuple comes from the set of tuples that most closely align with each other
that's only for two dimensions
it works the same way
yes
the sigma? yes.
suppose i have two objects/time-series
1: 1 3 4 7
2: 2 4 8 9
X is 1 and Y is 2
i am pretty sure
the distance would be sqrt ( |2-1|^2 + |4-3|^2 + |8-4|^2 + |9-7|^2 )
try passing those as arrays to the dtw function and see if you get the euclidean distance
would be easier to do it as np.sqrt((np.abs(a - b) ** 2).sum())
what i am wondering is why make up a fancy name
and use the same century-old method
which part is fancy? euclidean?
shouldn't make this into a loop
for x, y in arr1, arr2:
is that legal python syntax?
no, numpy does the looping internally.
!e
import numpy as np
a = np.random.random((5,))
b = np.random.random((5,))
print(a, b)
print(np.sqrt((np.abs(a - b) ** 2).sum()))
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
001 | [0.98432056 0.62829273 0.62277413 0.44841935 0.8191274 ] [0.33321332 0.25763163 0.75565997 0.36456538 0.89677632]
002 | 0.76944770610025
life is not always simple huh
I don't read screenshots of code, but I'll give you some more code to illustrate the earlier point
!e
import numpy as np
a = np.random.random((5,))
b = np.random.random((5,))
print(a - b)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[-0.55704989 -0.41628915 -0.04400796 -0.552509 0.38075351]
that is very neat
when the documentation tells you what it is supposed to do
but it doesn't function like the doc when you test it
I mean, the docs didn't say that dtw is exactly like euclidean distance
"DTW is computed as the Euclidean distance between aligned time series, i.e., if π is the optimal alignment path:"
idk what an optimal alignment path is.
my earlier guess was wrong, apparently
(and not very well stated)
I believe it's the alignment with the lowest distance such that the order of points is preserved
Don't quote me on that
yup
i mean i don't know
but
when the array elements are identical
like [1,1,1,1] and [2,2,2,2]
dtw would give the same result as euclidean
but when things start to be weird, they diverge
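To make the comparison concrete, here is a small dynamic-programming sketch of the quoted definition (square root of the summed squared differences along the best alignment path). It is a from-scratch illustration, not tslearn's code, and the function names are made up.

```python
import numpy as np

def dtw_distance(x, y):
    """sqrt of the summed squared differences along the cheapest alignment path."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return np.sqrt(acc[n, m])

def euclidean(x, y):
    return np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum())

flat_a, flat_b = [1, 1, 1, 1], [2, 2, 2, 2]
ser_a, ser_b = [1, 3, 4, 7], [2, 4, 8, 9]

print(dtw_distance(flat_a, flat_b), euclidean(flat_a, flat_b))  # identical here
print(dtw_distance(ser_a, ser_b), euclidean(ser_a, ser_b))      # DTW is smaller here
```

For the two constant series both distances are 2. For 1 3 4 7 against 2 4 8 9 the Euclidean distance is sqrt(22) while this DTW comes out at sqrt(7), because the alignment is allowed to pair an element with a neighbour's partner when that is cheaper.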
where should i start with ai?
i watched this video but i cant find any python tutorials for ai
https://youtu.be/JMUxmLyrhSk
Hi! I'm looking for a data cleaning library. I found HoloClean but it contains some TODOs in the code and the GitHub repo hasn't been updated since April 2019..
Recently I talked with a friend that did the ML course of Deeplearning AI in coursera and he told me that the content is a good starting point
the course is free unless you want to have the certificate
Is this the correct channel to ask Pandas related queries...??
Are there any good resources to understand context free grammars and logical grammars?
yah sure!
for quicker help start a help channel
go ahead
I have a pandas dataframe like below:
company name role level role keywords
0 Company1 Director director
1 Company2 Developer developer
2 * Developer Engineer
and a list of companies like ['Google', 'Apple', 'Facebook']
I want to write an operation in pandas to find the rows where company name has "*" and replace it with individual companies
the output should be
company name role level role keywords
0 Company1 Director director
1 Company2 Developer developer
2 Google Developer Engineer
3 Apple Developer Engineer
4 Facebook Developer Engineer
Thanks in advance
so for each row in the original dataframe with the "*" field you will replace it with 3 rows (with google, apple, fb) ?
Yes, the company list is dynamic it can be n rows. I have added 3 companies for example
ok, first I will divide the data into the rows that have companies and the rows with "*", then with a function create copies replacing "*" with the names in your list, and finally concatenate all the dataframes
should look like this
# df = # your dataframe
mask = df['company name'] == "*"
df_withname,df_noname = df.loc[~mask],df.loc[mask]
names = ['Google', 'Apple', 'Facebook']
def add_name(df, name):
    df = df.copy()
    df['company name'] = name
    return df
df_final = pd.concat([df_withname] + [add_name(df_noname, o) for o in names])
@novel elbow I think there's a simpler solution.
@viral scroll how do you know what company you're replacing a given row with?
Eh maybe not. Though I'm confused as to how one would arrive at this particular problem
Basically I take query inputs from user using an excel sheet and the user has an option to either specify search condition for specific company or use "*" to apply the search condition to all the companies
so in my code i need to replace * with multiple rows for individual company so that I can do an inner join
You could replace the asterisk with a python list and then do explode
I'm on mobile so I can explain more in a bit
https://www.coursera.org/learn/ai-for-everyone?
If you can afford to take this course. Or else you can enroll in each course separately but you won't get a certificate after completion.
Offered by DeepLearning.AI. AI is not only for engineers. If you want your organization to become better at using AI, this is the course to ... Enroll for free.
In [40]: df.loc[(df['company'] == '*'), 'company'] = [['Google', 'Apple', 'Facebook']]
Out[41]:
company role level
0 Company1 Director director
1 Company2 Developer developer
2 [Google, Apple, Facebook] Developer Engineer
In [42]: df2.explode('company')
Out[42]:
company role level
0 Company1 Director director
1 Company2 Developer developer
2 Google Developer Engineer
2 Apple Developer Engineer
2 Facebook Developer Engineer
You have to use a nested list for the first step though
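A self-contained version of the explode approach, using the column names from the question. Instead of the nested-list `.loc` assignment, this sketch swaps in the list with `apply`, which is just one way to get a list into the cell.

```python
import pandas as pd

df = pd.DataFrame({
    "company name": ["Company1", "Company2", "*"],
    "role level": ["Director", "Developer", "Developer"],
    "role keywords": ["director", "developer", "Engineer"],
})
names = ["Google", "Apple", "Facebook"]

# Swap each "*" for the whole list, leaving real company names untouched.
df["company name"] = df["company name"].apply(lambda c: names if c == "*" else c)

# explode() gives each list element its own row and copies the other columns.
result = df.explode("company name").reset_index(drop=True)
print(result)
```

`reset_index(drop=True)` is only there to get the 0..4 index from the expected output; without it the exploded rows all keep index 2.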
good morning stelercus
do you think hierarchical and k-means will produce different results?
you didn't ask me but still I will answer - probably
can anybody help me figure out why my kmeans implementation is not converging?
holy cow
how do you check if it converges or not
i am doing the exact same thing
it is not working
i just print the centroid locations and cost function after each iteration
it just jumps around
can you share the code?
good evening guys i need some suggestion about my small project
i have a small experiment to test whether a CNN works with a small amount of data. I have 38 samples from 2 classes, sick and healthy (coded as 0 and 1), with 20 from class 0 and 18 from class 1. Each sample is a long ECG record (about 1 day). Class 0 records have an anchor point marking where the episode of sickness happens, but class 1 is healthy so there is no marking point on it
i want to check whether the n minutes before the anchor point of class 0 can be detected as sick (0) and not healthy (1), so from the 20 i have 10 train, 2 valid and 8 test for n minutes. I also take arbitrary segments from the 18 healthy records, so i have 9 train, 2 valid and 7 test. Since the CNN will be trained on so little data, I try a simple network
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 15352, 128) 1280
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 5117, 128) 0
_________________________________________________________________
dropout (Dropout) (None, 5117, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 654976) 0
_________________________________________________________________
dense (Dense) (None, 2) 1309954
=================================================================
Total params: 1,311,234
Trainable params: 1,311,234
Non-trainable params: 0
_________________________________________________________________
When i run the training, the training accuracy are pretty fast to decrease but the validation result are very low
but when i run the test, the result is promising
I know this is a sign of overfitting, but given the accuracy on the test set I'm still confused about whether to add more epochs or rearrange the model.
the input is 1D with 15360 features
jesus
The result pretty funny but frightening at the same time
and still dunno what to do next
look at some examples
are you a data scientist?
yes
sorry i didn't mean examples of model code
i literally meant examples of data where the predictions are wrong
it can be very enlightening to see why, qualitatively, the model is failing
maybe you don't have enough data - i have no idea if CNNs can be trained from scratch on so few datapoints, i would assume that they can't
or maybe your train/test split is f'ed up
What is arima?
who asked you this?
oh i asked people
for what?
also it's silly that prophet isn't "interpretable" it's a goddamn linear regression, it's more interpretable than ARIMA
anyway ARIMA is "AutoRegressive + Integrated + Moving Average" - basically a family of models you can use to fit a single time series
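A toy illustration of the "AR" and "I" parts only, with made-up numbers: simulate a series whose changes follow an AR(1) process, then recover the coefficient by differencing once and regressing on the lag. The moving-average part is left out, and for real work you would use a library such as statsmodels rather than this hand-rolled fit.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.6   # true AR(1) coefficient (made up for the demo)
n = 5000

# "AR": each change depends on the previous change plus noise.
changes = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):
    changes[t] = phi * changes[t - 1] + noise[t]

# "I": the observed series is the running sum of those changes.
series = np.cumsum(changes)

# Fitting goes the other way: difference once, then regress on the lag.
diffed = np.diff(series)
phi_hat = (diffed[1:] @ diffed[:-1]) / (diffed[:-1] @ diffed[:-1])
print(phi_hat)
```

The estimate should land close to 0.6. The point is only that one series goes in and a handful of parameters describing that one series come out.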
there's kind of a lot to be said about time series modeling and ARIMA specifically
The dataset was a matrix of the weighted AMP of the 724 ingredients from 2016 to 2020 - tables where rows represent drug ingredients, columns represent the time, such as the second quarter of 2017, and numbers in each cell characterize the price of the particular ingredient in the particular year.
When i do k-means clustering with the dynamic time warping metric, it puts the majority of the ingredients into one cluster, which i definitely do not want. so i asked the person and he told me that
what's the point of doing arima then?
i'm using the split and experimental flow from the paper (what i really hate is that they don't clearly say what they did in their experiment); they use a 50:20:30 split based on the record, not the data chunk
to somehow break the big cluster up?
but the paper said it worked with as few as 20 data points in the training set?
they claimed yes
are you trying to replicate the paper? or are you trying to use their technique on your own data?
that doesn't make a ton of sense to me, ARIMA is a way to model and characterize one time series
i try to replicate only the flow of the experiment, with the same data that they use
can you link the paper?
since this is a health-related project, a gold standard dataset is mandatory (although imho, the data are pretty old and very small)
so what do you suggest me do?
i am kinda lost at the moment
this is the example, https://pubmed.ncbi.nlm.nih.gov/30117048/ but i'm only using their testing flow and the data
Sudden cardiac death (SCD) is one of the main causes of death among people. A new methodology is presented for predicting the SCD based on ECG signals employing the wavelet packet transform (WPT), a signal processing technique, homogeneity index (HI), a nonlinear measurement for time series signals, …
i don't know because i don't know what this person is responding to. what exactly did you ask, that prompted this response? also what ultimately are you trying to do again? i know you were getting into clustering time series but i don't remember what the purpose of this was
what do you mean by "flow"?
the testing flow, how they parted the data for training and testing purpose
ok, but it looks like the paper uses a purpose-built algorithm based on some specific signal processing stuff
i don't know that it makes sense to just replace that with a CNN
since CNN can make the feature
OHH i asked "how would you proceed to predict the future with the given dataset"
that's my opinion based on what i have read about CNN
ok, but that's a completely unrelated task to clustering. can you back up and provide a more complete explanation of what's happening here? i feel like i'm only seeing random little snippets of what you're working on.
the idea is: what if we feed in the signal without preprocessing the data (preprocessing could lose some important signs) and let the CNN infer the condition
i would imagine that the CNN probably can't learn much from 20 data points
since CNN or deep learning method are still pretty new for ECG-related research
now, if you had 2000 datapoints of "unlabeled" data, but only 38 data points of "labeled" data, i'd suggest fitting an autoencoder on the 2000 unlabeled examples, then using transfer learning or something to fit a simpler model on the 38 labeled examples
maybe there are ways to train the autoencoder simultaneously with the small number of labeled data points - this was a thing several years ago called "semi-supervised learning", but it dates back to the SVM era and it was of questionable value back then
i'm cautious about that, since when i use a more classical method (preprocessing the data into several features and using traditional machine learning like a decision tree or random forest) the result is still higher than the CNN
My research question: How do characteristics in the pharmaceutical industry, such as new drug development, patent expiration, and drug classifications influence the similarities and dissimilarities among both drugs' prices and drug prices' percentage difference over time. I am still under the process of revising my question, so i hope it makes sense
breakdown into more specific one:
maybe like: Is there any group of drug development which is more focused on specific field, or what kind of drug class mostly made by the industry
like i want to see if there is a trend in price in group of certain disease-targeted drug
yeah that doesn't surprise me. like i said, the CNN might just need more data to train. what's the size of the convolution "window"?
i'm using 128 features with a window of 9
so since the dataset is pretty small but high-dimensional, i decided on the simpler one
not many layers, just dump the features from conv - maxpool - dropout into a dense layer with 2 classes
so you have a model that needs to learn 1280 parameters from 20 data points, and that's not even including the dense final layer
that seems like a losing proposition to me
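The parameter counts can be checked by hand from the layer settings in the model summary above (128 filters, kernel size 9, one input channel, pool size 3, 2 output classes). A short arithmetic sketch:

```python
input_len, in_channels = 15360, 1
filters, kernel_size, pool_size, n_classes = 128, 9, 3, 2

conv_params = kernel_size * in_channels * filters + filters  # weights + biases
conv_len = input_len - kernel_size + 1                       # no padding
pool_len = conv_len // pool_size
flat = pool_len * filters
dense_params = flat * n_classes + n_classes                  # weights + biases

print(conv_params, conv_len, pool_len, flat, dense_params, conv_params + dense_params)
```

So the conv layer accounts for 1,280 parameters and the dense layer for about 1.3 million more, all to be learned from roughly 20 training records.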
are you at least using regularization to train it?
128 features as in, each example is a single time series of 128 data points?
or wait, you have 128 "channels" in the conv1d?
interesting. i don't understand a single word you guys are saying
wait
import tensorflow as tf
from tensorflow.keras import Sequential, layers

def GetModel(shp):
    # shp is X.shape; shp[1] is the number of samples per record
    model = Sequential()
    model.add(layers.Conv1D(filters=128, kernel_size=9, activation=tf.nn.leaky_relu, input_shape=[shp[1], 1]))
    model.add(layers.MaxPool1D(pool_size=3))
    model.add(layers.Dropout(rate=0.7))
    model.add(layers.Flatten())
    model.add(layers.Dense(2, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=['accuracy', 'mse'])
    return model
shp is the shape of X, shp[1] is 15360
ok, so these are 2 separate tasks under the same "umbrella", thanks for clarifying. i think you're on the right track with the clustering, although i think k-means is almost always the wrong choice. i recommend starting by making lots and lots of plots
obviously you're already started, but still make lots and lots of plots
so your k-means results are bad, why? are they really bad? or are they sensible given the data and # of clusters?
did you try other numbers of clusters?
did you try k-medians? etc.
plotting the data is the easiest thing to do, and also the easiest to understand
yes sir. similarity/dissimilarity in price and price percentage difference -> two tasks
did you try "soft" clustering like HDBSCAN? self-organizing maps? did you look into https://github.com/fpetitjean/DBA to see if there's any common overall shape?
this is your feature. Plot it using a scatter plot and see if there's some pattern
is this sensible?
you can also try fitting individual time series models to each series, then looking at the models as a dataset of its own. you can fit all kinds of forecast models like exponential smoothing, arima, etc. and then look at the distribution of model parameters for example
it feels like it doesn't
you're just looking at the cluster memberships
how do you know those make sense?
did you try using dimension reduction to plot these and color them by cluster membership?
i didn't think it is necessary since the dataset only has 20 dimensions or 20 columns
that doesn't make sense
20 features, if they have no importance, are pretty useless for prediction
so i have to find more data?
i'm not sure what you mean by that @inland zephyr
i will share pic. one sec please
you mean that there are 20 time points in each time series @visual violet ?
and are they all measured at the same time points?
oh i mean feature importance from the 20 feature, not a 20 time points. My bad
ok, that makes things easier
oh so the vocab is time point
eh, there isn't a single term for it
lol i have a tough time describing things
why not plot a line chart to see the movement? since the data is time related right?
@inland zephyr they have 600+ individual time series with 20 time points in each
Hey @visual violet!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
oh shit i can't share excel lol
so it's 600*20 dimension right?
you could think of it that way, yes
the majority of them is at the bottom
thats what i said
cluster 0 : dark blue has 665 members out of 724
im not sure what the point of cumsum on prices would be
nope since drug can't be free
so there gotta be a price
nvm, no need cumsum for this
the clusters are easy to see, with the majority in the blue and the cyan ones
but what is the red data? does it come from several series or from one member?
right. k-means looks like it's just clustering time series based on how high the average price is
red has one member :(((
bingo
i don't want that since it doesn't mean much (i assume)
- use a log scale y axis
- replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level
it's the outlier maybe
so my recommendation is to normalize each price such that its starting price is 1. then your distance metric can focus on the shapes of the time series, not just the price levels
basically a price index for each drug
so if the price is $100 in qtr 1, and $125 in qtr 2, normalize that to 1 and 1.25
or maybe as @visual violet said no 0 price, the normalization can be done between min-max of the price
https://paste.pythondiscord.com/ejeruhawod.css -> for first 200 ingredients
min-max normalization is bad when you don't have a known min and max
this way you can take the "price level" of the drug as separate from relative changes in price over time
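A minimal sketch of the price index idea on made-up numbers (rows are drugs, columns are quarters):

```python
import numpy as np

prices = np.array([
    [100.0, 125.0, 150.0],  # expensive drug, rising
    [2.0, 2.5, 3.0],        # cheap drug, same relative path
    [40.0, 40.0, 38.0],     # roughly flat
])

# Divide every row by its first value so each series starts at 1.
price_index = prices / prices[:, [0]]

# Keep the level as its own feature instead of throwing it away.
start_level = prices[:, 0]
print(price_index)
```

The first two rows become identical after normalization, which is the point: a distance on `price_index` compares shapes, while `start_level` carries the price level separately.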
@visual violet in general i recommend the following:
- the specific feature engineering described above, perhaps using the starting price level (or average or median price level) as a separate feature
- try several dimension reduction techniques to visualize the time series, don't worry about "modeling" yet. such techniques include PCA and UMAP. this might mean trying out a few distance metrics such as DTW
- use https://github.com/fpetitjean/DBA to see if there is any overall "shape" to the drug prices, although i doubt it based on the plot i just saw
- try "soft" clustering methods such as HDBSCAN
- do you have any "metadata" about the drugs? company developing it, year development started, type of drug (antibiotic, etc.), specialized vs general market, etc. that could be interesting to examine as well.
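For the dimension reduction step, PCA can be done with plain NumPy. The data below is a random stand-in shaped like the 724 x 20 table (an assumption for illustration, not the real prices), and `pca_2d` is a made-up helper name.

```python
import numpy as np

def pca_2d(rows):
    """Project rows onto their first two principal components via SVD."""
    centered = rows - rows.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

rng = np.random.default_rng(0)
# Stand-in data: a random log price level per drug plus a small random walk over 20 quarters.
levels = rng.normal(3.0, 2.0, size=(724, 1))
walks = np.cumsum(rng.normal(0.0, 0.05, size=(724, 20)), axis=1)
log_prices = levels + walks

# Log of the price index: subtracting the first column makes every series start at 0.
log_index = log_prices - log_prices[:, [0]]
coords = pca_2d(log_index)
print(coords.shape)
```

`coords` can go straight into a scatter plot, colored by cluster label or by any metadata column, to see whether the groups separate at all.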
VAR for 600+ time series is too big i think
however you could try using the correlation between time series as "distance"
@desert oar do you think i should even bother with the percentage difference dataset? here is the graph i got
specifically distance = 1 - abs(corr(ts1, ts2))
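That distance is a one-liner with `np.corrcoef`, which treats each row as one series. The example rows are made up.

```python
import numpy as np

def correlation_distance(series):
    """Pairwise 1 - |Pearson r| between the rows of an (n_series, n_times) array."""
    corr = np.corrcoef(series)
    return 1.0 - np.abs(corr)

series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],  # same shape, different level
    [4.0, 3.0, 2.0, 1.0],      # mirror image
    [1.0, 3.0, 2.0, 4.0],
])
dist = correlation_distance(series)
print(dist.round(3))
```

Because of the absolute value, a series and its mirror image count as distance 0. Drop the `abs` if opposite movements should count as far apart. A perfectly constant series has no defined correlation and will produce NaN, so those need handling first.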
do you see an obvious pattern here?
i see 0 pattern
try plotting this without the cluster coloring, but set alpha=0.2 or something so you can see them all overlaid
(and make the lines thinner)
generally you should probably do all your clustering on log prices anyway, if you're not using this price index thing
also - do you know what would cause these drug price fluctuations in the first place?
that could also guide your thinking about it
also this is maybe a helpful reference https://www.kaggle.com/izzettunc/introduction-to-time-series-clustering
the drug economy does not follow the supply-and-demand model
sometimes existing drugs increase price for no reason
i have trouble understanding this statement "replace each time series with its mean over the 20 data points, you'll probably see that the clusters are neatly segmented according to price level" can you please explain?
oh i understand
take average for each row
so then i only have 724 values
yep
you'll probably see that the clusters are just splitting the drugs somewhat arbitrarily into average price levels
re-do the clustering on log price scale, if nothing else
im ready to kms over this kmeans implementation
ive read it like 500 times and cant see anything wrong but it's still wrong
do you mean this `matplotlib.pyplot.yscale('symlog')`?
why symlog?
log should be fine
also i mean literally re-run k-means but using log price
show your code one more time?
just "reading" the code isn't always the best way to debug code
so take the log of every single data point (724 * 20) and run k-means
yes
because the price levels are so wildly different
look at the formula for euclidean distance
this is a good lesson in the importance of using intuition about fundamentals to solve problems
k means with what distance measurement method? euclidean?
whatever, euclidean is probably fine but you can try DTW too
i have been using dynamic time warping without understanding exactly what it does
when taking differences, and especially squared differences, the order-of-magnitude differences between time series will overwhelm anything else
so at least put them on log scale to try and reduce that effect
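A tiny made-up example of that effect: two series with the same shape but very different price levels, and two series with the same level but different shapes.

```python
import numpy as np

cheap_up = np.array([1.0, 1.5, 2.0, 3.0])               # triples
cheap_flat = np.array([1.0, 1.0, 1.0, 1.0])              # stays put
pricey_up = np.array([1000.0, 1500.0, 2000.0, 3000.0])   # also triples

def euclid(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# How much bigger is the "level" distance than the "shape" distance?
ratio_raw = euclid(cheap_up, pricey_up) / euclid(cheap_up, cheap_flat)
ratio_log = euclid(np.log(cheap_up), np.log(pricey_up)) / euclid(np.log(cheap_up), np.log(cheap_flat))
print(ratio_raw, ratio_log)

# Subtracting the first log price (the log of a price index) removes the level entirely.
same_shape = np.allclose(np.log(cheap_up) - np.log(cheap_up[0]),
                         np.log(pricey_up) - np.log(pricey_up[0]))
print(same_shape)
```

On the raw scale the level difference swamps the shape difference by a factor in the thousands. On the log scale it still dominates but far less, and with the price index it disappears.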
i'm at work so i might have to leave for a while, but ping me so i don't lose track of it
DTW just re-aligns elements of the time series such that the distance between the two time series is minimized
dtw does that for every pair of time series?
let say i have
A: 1 7 16 50
B: 2 15 51 8
with certain restrictions: the first and last elements must be matched to each other, at least one of the sequences must have every point matched, and the matching must be monotonically increasing, so if A10 matches B15, then A11 cannot match B14
basically, the lines in the above picture can't cross
the distance itself is the sum of the lengths of the dashed lines
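Here is a small sketch that applies those rules to the A and B series above, using absolute differences as the length of each dashed line, and also recovers which elements get matched. It is a from-scratch illustration (the name `dtw_path` is made up), not any library's implementation.

```python
import numpy as np

def dtw_path(a, b):
    """Return (distance, path) using absolute differences as the step cost."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            acc[i, j] = step + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Walk back from the last pair to the first to recover the matches.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: acc[ij])
        path.append((i - 1, j - 1))
    return acc[n, m], path[::-1]

A = [1, 7, 16, 50]
B = [2, 15, 51, 8]
distance, path = dtw_path(A, B)
pointwise = sum(abs(x - y) for x, y in zip(A, B))
print(distance, path, pointwise)
```

The path starts at the first pair, ends at the last pair, and never steps backwards, so the lines don't cross. Note that even with equal lengths the best path here is not the straight one-to-one pairing: some elements are matched twice, and the resulting distance is smaller than the point-by-point sum.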
@ember sapphire as a matter of courtesy, if you could repost that using https://paste.pythondiscord.com/, it will be easier to follow the multiple conversations happening here. otherwise now there's a big wall of code between this message and the others.
ah sure
(you can edit your messages btw)
this makes a lot of sense
my understanding is that following Lloyd's algorithm, the cost should decrease monotonically until it converges to a local optimum
but my implementation seems to bounce around indefinitely
first of all, write some damn functions
don't just put it all in one big script
are you trying to plot each step of lloyd's algorithm?
currently the plotting is just there for debugging purposes so i can see the behavior, but the final version should compute clusterings for k=5 to k=12 and then plot them all at the end
yeah assign[y, x] is the index of the cluster that pixel [y, x] is assigned to
just a copy of the image
yeah because the original is mutated
every step
also im not sure why, but when i ran it just now, it actually converged for k=5 for once and moved on to k=6... i didn't change anything though
unrelated to the algorithm itself, my suggestions for your code itself:
- use functions
- use better / more-descriptive names
- when you do have a "big" function with multiple "sections", write comments so it's obvious to the reader what the sections are
plus if you're running this all in a notebook cell you're guaranteed to make a mess out of the top-level namespace, and you should at minimum restart it once in a while
im not using ipython
If the first and last elements must match, the lines can't cross, and the time series have equal length, then does DTW behave exactly the same way as Euclidean?
I can't think of a way for it to not behave the same way as Euclidean if the lengths are equal
it is possible that DTW can produce the same results as euclidean yes. although the distance is a sum of absolute differences, not a sum of squared differences
but DTW can "skip" elements of one of the series
actually if they are equal it might be required to be the same as euclidean
good observation
You can't skip one without skipping the one in the other series
right
anyway i don't think euclidean or DTW are great solutions here, i think correlation could be more interesting
maybe even spearman correlation
but it depends on what you hope to find
subjectively, what does it mean for 2 time series to be similar?
if the price movements are arbitrary, why would you expect to find any interesting clusters based on price movements?
maybe you should be looking at linear trends, seasonal decompositions, mean price level, etc.
You are not wrong
i guess i am hoping to find anything possible out of this dataset lol
i gave you a lot of suggestions of places to look
think of what characteristics of the price sequence could be meaningful, i gave some examples above
use those as features for clustering
okay so first, log everything and cluster again
second, try several dimension reduction methods and graph the results
third, try soft clustering methods instead of k-means
i'm changing my recommendation
i don't think you should waste time with these distance metrics based on price movements
find a way to characterize each drug in a way that makes sense
and use that to compute distances, do clustering, etc.
however i think you did discover something: price movements are indeed arbitrary and don't suggest any kind of meaningful relationships
nah
you're not done yet
all you found is that euclidean distance isn't useful for these time series
PCA however could be interesting https://towardsdatascience.com/the-pca-trick-with-time-series-d40d48c69d28
(this is just one of many examples of how you could use such a thing)
okay so abandon k-means and hierarchical as well as dtw and euclidean altogether
since price movements are random and i should not expect like 20 nice clusters
no, i'm not saying to abandon k-means or hierarchical clustering
but yes, you probably want to abandon those distance metrics
btw
i am still trying to understand how dtw works
my code makes sense tho
in theory they should output the same thing?
don't do the sqrt
i have always thought the straight line provides the shortest path
oh well not anymore i guess
Original data shape : (512, 512, 1)
Reshaped Train/Test data shape : ((286, 256, 256, 4), (293, 256, 256, 4))
Model:
import tensorflow as tf
from tensorflow.keras.models import Model
from keras.la...
pls help me
PCA is a dimension reduction method. don't i still need a distance metric?
ok i added some comments https://paste.pythondiscord.com/opuvepokok.py @desert oar
i don't see any obvious pieces that should be extracted into functions
but aside from the style, do you notice any errors / opportunities for optimization?
is there a good live plotting software?
jupyter lol
i've got a loop that's constantly reading from a source and i want the same plot to be updated every second. is jupyter the best tool for this?
heeeeelp
no, it doesn't use an explicit distance matrix
Hey guys, this is my first AI-related blog post on Medium. Have a look, I'm looking forward to your responses.
https://medium.com/@h_ali/linear-regression-using-gradient-descent-from-scratch-66328352e671
and since i also work with time series data (like ECG), wanna try wavelet decomposition to make the feature?
since we can break down the data into the volume one (the Y) and the time (the X) whether to find any small detail to cluster the data
that could also be interesting, although i'm generally skeptical of fourier-like methods on very short time series. but @visual violet take note
the entire k-means procedure can be
so you can at least verify that the output is correct in a tighter loop without making all these plots
my guess is that somewhere you are forgetting to update something
also the centroid differences will almost never be exactly zero
how badly does the cost oscillate?
that's more interesting than looking at cluster assignments
also if this is a very small number of data points the oscillations could be significant
am working on payment prediction
try it on a dataset with known obvious clusters
@silk marsh successfully asking for machine learning help requires a detailed description of your task, a detailed description of your data, and a detailed description of your current solution(s) and/or the actual code you're using. example data is even more helpful
the "dont ask to ask" rule is even more important in data science than in programming, because there are even more open-ended questions
otherwise you force people to waste their time interviewing you in order to be able to help you
i already just tried to offer some help
sounds like u are angry
don't take me in wrong way
@desert oar
so can u join code help1 @desert oar
voice?
so that i can explain my prob?
@ember sapphire i don't see anything obviously wrong with this code either, so i suspect that maybe you forgot to update a "new thing" and left an "old thing" behind
Hmm
It doesn't oscillate badly
It goes down relatively quickly and then reaches what appears to be a local optimum but then it starts slowly going up again
i wonder if you just need to be cutting it off there? maybe randomly swapping around points is causing it to destabilize
not sure what the theoretical guarantees are on the algorithm
https://en.wikipedia.org/wiki/Lloyd's_algorithm#Convergence
The algorithm converges slowly or, due to limitations in numerical precision, may not converge. Therefore, real-world applications of Lloyd's algorithm typically stop once the distribution is "good enough." One common termination criterion is to stop when the maximum distance moved by any site in an iteration falls below a preset threshold.
floating point issues?
try using that termination criterion
Sounds reasonable, if it's just floating point issues then I'm fine. I was worried it meant there was a deeper problem
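while True:
    # ... assignment step and centroid update step go here ...
    # stop once no centroid coordinate moved by more than the threshold
    if np.abs(centroids_curr - centroids_prev).max() < threshold:
        break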
try to compare vs scikit-learn k-means or some other known-good implementation
if you have the same-ish stopping criterion and the same-ish algorithm, you should get same-ish results
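A small pure-Python sketch of Lloyd's algorithm with that termination criterion, for checking behaviour on data with obvious clusters. The function names, the two synthetic blobs, and the default threshold are all assumptions made for the example. It tracks the usual k-means objective, the sum of squared distances to the assigned centroid, which should not increase from one iteration to the next.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def lloyd(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Run Lloyd's algorithm until no centroid moves more than threshold."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    costs = []
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # k-means objective: sum of SQUARED distances to the assigned centroid.
        costs.append(sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels)))
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # leave an empty cluster where it is
        moved = max(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < threshold:
            break
    return centroids, labels, costs


# Two well-separated synthetic blobs (hypothetical test data).
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]

centroids, labels, costs = lloyd(points, k=2)
print(centroids)
print(costs)
```

If the printed cost ever goes up by more than floating-point noise, the cost being tracked is probably not the quantity the two steps are minimizing.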
Depends on initial centroids too but yeah, the image I'm using has a pretty obvious fit I think
Hey folks, would anyone here spare a few moments and help me with a scikit learn question?
you can just ask. if people have time, theyll reply
Yeah thats fair. i just dont want to clog the chat you know. But here goes. I have a set of data generated by a random model .
They have an id, a prediction value from some random model, and the actual boolean value.
I am trying to calculate the AUC of the random model by 2 methods, one by the scikit-learn and one by hand (running the normal algorithm)
There is quite a mismatch from the 2 methods.
for the data file/ code : https://filebin.net/yuzmnov4k8r6ha0c (hope filebin links are allowed)
yea why?
Now my question is, am I using the roc_curve function right (I supply it the actual boolean values (1 and 0) and the predictions of the random model)? If yes, is there any reason why the by-hand calculation of the ROC (going over all the steps of the AUC calculation) differs so much from the scikit-learn one?
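One by-hand definition that is easy to check against: AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below is a slow reference version for small data, not the code from the linked file, and `auc_by_pairs` is a made-up name. For comparison, scikit-learn's `roc_curve` and `roc_auc_score` both take the true labels first and the scores second.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Tiny hypothetical example: labels and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, y_score))
```

Things worth ruling out when the two numbers disagree: swapped arguments, passing thresholded 0/1 predictions instead of raw scores, and a by-hand version that handles tied scores differently. A truly random model should land near 0.5 either way.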
Any help would be appreciated on this question from me: https://stackoverflow.com/questions/68089910/tf-keras-ignore-values-in-custom-cross-entropy-function
Hey everyone! I am wanting to use Facebook Prophet to perform multivariate anomaly detection of user session data but struggling to figure out how it would work.
For some context, when a user logs in we create a session id and certain events/actions are captured and tied to that session. We are able to get counts of the different actions at 1 min intervals for a particular session.
The thing I am struggling with is how to detect anomalies at the session level. How would that data be fed into Facebook Prophet and is this something that can be done with it?
Online, I have only seen multivariate examples of things like different store locations but those store locations are static while the sessions are dynamic.
Appreciate your time!
do you have a bachelor degree in data science?
No, but I have created some basic autonomous driving apps like being able to drive around a track in a simulator, detecting road signs, lanes etc.
Love the sarcasm. Just seeking some guidance as I am "not good"
i am a high school student
i know absolutely nothing
i am sorry to push your question to the top
I want to learn about stylegan2
Its like I'm trying to swim before learning to walk
but it's beautiful
@desert oar i just realized the red book by micromedex is the answer to my hope and dream
How should I typically handle dates in a ML model?
I have a dataframe that has a column for Year (from 2010 to 2020)
Should I leave it as int or code them from 1 to 10?
im not familiar with it. is it a source of data about medications?
is there a digital version? or do you have to now type in data for 700 drugs?
you tell me lol
micromedex provides historical data going back 50 years
right now what i have is quite enough
but like to predict the future
20 datapoints are like a grain of sand
the problem is there is a subscription lol
anyone know what I should start learning to make something like style gan 2?
or 1
do I gotta learn about gans or do i gotta learn about neural style transfer?
depends on what you want to encode
and how you want to model
can't tell without knowing what the problem is
I have historical data for tuition of different colleges, by year.
it is open source?
I can easily forecast these values, but I'm building a ML model just for the fun of it.
are you saying you can forecast college tuitions?
hm
sounds like time series modelling
Yeah. Just from looking at the data, the trend appears to be +200 every year since 2010
so
the simplest way to handle this is
given (X - n...X), model X + 1
in this case the date is used only to order data points
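A sketch of that framing, with made-up numbers: the series is only ordered by year, and each training row is "the previous n values" with the next value as the target. `make_windows` and the starting tuition of 50000 are assumptions for illustration; the +200 per year step is the trend mentioned above.

```python
def make_windows(values, n_lags):
    """Turn an ordered series into (previous n_lags values, next value) pairs."""
    rows = []
    for end in range(n_lags, len(values)):
        rows.append((values[end - n_lags:end], values[end]))
    return rows


# Hypothetical tuition for 2010..2020, already sorted by year.
tuition = [50000 + 200 * i for i in range(11)]

windows = make_windows(tuition, n_lags=3)
for features, target in windows:
    print(features, "->", target)
```

The year itself never appears as a column here; it only decides the order of `tuition`.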
So what's your suggestion? I tested xgboost and RandomForest. RandomForest performed better. Now I'm looking to fine tune the model. Should I leave the years as integers or code them with ordered labels from 1 to 10?
Whats the full name of that college?
University of Pennsylvania
I literally just said
dates are used only to order the points
so they aren't features
UNLESS you want to encode more information
for simple stuff
you could look into e.g. ARIMA
In 2020 and 2019 the tuition change was +3.9% for both years. So I'd say for a 4 year program $60k * 3.9%
thank you
Thanks, I'll look into it.
@velvet thorn i am having a similar time series
i have historical data for price of different colleges, by year
now i am trying to find interseting patterns
this was a good paper
i did k-means with dynamic time warping on my data already but it gives me one cluster with the majority of the drugs, and the rest of the clusters have very few members
what do you suggest me to do
@desert oar
not exactly how stuffs work
but it appear to provide good results now?!?!?!?
i legit added one more line of code ingredient_price_matrix = np.log(ingredient_price_matrix)
yep
i told you
If "dtw", DBA is used for barycenter computation.
this is also a good feature in that library
plot the log time series again w/ the colors
also 15 clusters seems like a lot
do you have any reason to expect 15? why not 1 or 5?
Can anyone give me resources and link to prepare for a data science Interview? They said I will be asked on ML rapid prototyping, and simple ML models.
I've had a number of data scientist interviews lately. Here are some of the questions I was asked:
- What's an example of an unsupervised learning algorithm?
- What are the pros and cons of neural networks?
- What do you do if your dataset is missing certain data?
- What's an example of a time that you solved a problem you were unsure about?
And be prepared to talk about anything that's on your resume.
i will attempt to graph with t-SNE hehe.
to be honest, i have no idea which number of clusters i should pick
the elbow method tells me 6? so i will try that
also what does DBA mean?
- What's an example of an unsupervised learning algorithm?
= kmeans? @serene scaffold
huh no wonder why when i check, the math doesn't work out
this is not normal dtw
this is DBA
also how do you know it uses DBA?
i checked the tslearn website. it doesn't say that anywhere
@desert oar
i am so sorry for pinging you so many times
DBA is a method for calculating centroids. the docs say that when you use DTW as the distance metric, it uses DBA instead of regular means to calculate the centroids in k-means
i happened to have seen DBA in this doc that i think i posted earlier, and remembered it. the tslearn docs really should have a reference to this, instead of just using the acronym. https://www.kaggle.com/izzettunc/introduction-to-time-series-clustering
thank you very much.
PCA happens to be the method to plot
so i will try that out as well
Thank you so much!!! If I may ask how was the technical interviews like?
looks more like layers to me. before it was a clump and outliers
what do you think?
i still think this is just clustering on average price level
k-means tends to try to find "round" clusters
and it will somewhat arbitrarily segment the data in order to do so
perhaps it is because my graphing function is weird?
if you see a bunch of evenly-sized clusters with no obvious separation, k-means is not doing anything interesting
no, this just looks like k-means not doing anything interesting
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
https://paste.pythondiscord.com/betusevuni.apache just in case
as i suggested earlier, if price movements are arbitrary then you probably won't find much that's interesting in the individual price movements
alternative suggestion: https://stats.stackexchange.com/a/165234/36229
it might be interesting to look at the variability in prices
e.g. you can start building features for each drug like mean price, std. dev of price, etc.
in fact, how about this: plot each drug with mean on the x axis and std. dev on the y axis
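A minimal sketch of building those two features, assuming the prices are available as one list per drug (the drug names and prices below are invented). Each drug becomes a single (mean, std. dev) pair, so the suggested plot is one scatter plot with one point per drug.

```python
import statistics


def price_features(prices_by_drug):
    """Map each drug name to (mean price, sample std. dev of price)."""
    features = {}
    for name, prices in prices_by_drug.items():
        features[name] = (statistics.mean(prices), statistics.stdev(prices))
    return features


# Hypothetical price histories, one list per drug.
prices_by_drug = {
    "drug_a": [10.0, 10.0, 10.0, 10.0],
    "drug_b": [1.0, 2.0, 3.0, 4.0, 5.0],
}

features = price_features(prices_by_drug)
for name, (mean_price, std_price) in features.items():
    print(name, mean_price, std_price)
```

The x values for the scatter plot would be the means and the y values the standard deviations.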
so there will be 700 graphs?
oh right
i thought about it too
but i had an idea of using percentage difference
because i didn't think there will be an actual method
haha
the fact that my teacher recommends me to cluster the big cluster
he gives 0 advice on anything
is there a potential?
btw the matrix is the log of the prices
if used actual prices
Hi, I would like to learn about AI, where do you recommend me to start?
what's your math background?
I don't know how to define my level in mathematics, I guess high school with a little bit of the first year of college.
Learning linear algebra and probability/statistics would help
Hmmm ok
And then?
i have access to the IBM SPSS Statistics
let see if i can do the ward method without coding
i can't seem to find a library
not for clustering, but i see some interesting structure here:
- variability looks like it increases as the avg price gets higher (not surprising)
- there are some weird outliers with hugely variable prices
i agree yeah
Hi guys...
So I have been trying to load word embeddings from tf hub...
But most of them convert the word embeddings into a sentence embedding, and I am not able to find any documentation to help me out with it...
If any of you guys have any ideas how to get the embeddings please let me know
by looking at it, i see no distinct cluster
how can i determine whether a model is over-fitting or not?
Try cross validation
https://lightning-flash.readthedocs.io/en/latest/ Anybody heard of this library?
what's better for storing analytics one-table data, sql databases (relational) such as mysql or noSQL such as mongodb?
Add a callback that shows test performance after each epoch
And how was it? Good luck on your interviews bud.
I do recommend datacamp.com if you have some extra bucks. I do have the yearly plan (150$ per year or so if you get it on cybermonday)
They have this new prep thing after you've done some 10 minute tests on each category, a real-life project you must pass to be certified and finally they'll help you get ready for the interviews and match your profile to other companies.
I'm still studying and haven't done them yet. And it just came out recently, so don't expect it to be perfectly flawless.
https://app.datacamp.com/certification/dashboard
Also there might be a free prep service around if you look properly. I was just willing to pay.
Hi all. Been using CatBoostRegressor for a project (this is my first time with Catboost). I used it in StackingRegressor with different models. I tuned the hyper parameters of the boost which is ;
'Cat', CatBoostRegressor(iterations= 5000,
learning_rate=0.01,
l2_leaf_reg = 5,
depth = 6,
border_count = 50))
However, the model runs each iteration from 0 to 5000 with different learning rates. Is there a way to fix these parameters?
Okay
Thanks lemme try
If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it? *
the dataset is tabular
this is the beginning of a lot of machine learning problems. What is the data? what does it represent?
Be sure to be specific as the question isn't answerable in general.
This is one of the interview questions i had faced
Here we have to be creative as possible.
The data is numeric tabular data
if you were continuing the earlier conversation thread about interview questions, I wasn't able to infer that.
Is this a question you want an answer to?
yes
If you were provided a csv file with 2000 rows and 5 columns, and asked to create a binary classifier, how would you solve it?
After preprocessing data what can be done
You would first need to know if one of the five columns is what tells you the class of that row, and you'd need to know if the data (since you said it's all numeric) is discrete or continuous.
?
if one column is just telling you the class, then you only have four features, not five
depends on what algorithm you want to use. for the discrete data, would the number be mathematically meaningful? Like for example, age?
or would it just be arbitrary numbers like "1" for green and "2" for blue
okay thanks
now i need to know the nature of the data
continuous or discrete
and act accordingly
if they refuse to budge on being specific about what they want you to do, you can say that you'd use support vector machines
best option might be support vector machine
are you familiar with those?
SVM would treat each row in your table as a point in space, and try to determine the boundary between the two classes
No problem
@desert oar lol I forgot to square the norms when computing error
It was actually converging just slowly
lol glad you figured it out
is a single float value, the range will be (-shift_limit, shift_limit). Absolute values for lower and
upper bounds should lie in range [0, 1]. Default: (-0.0625, 0.0625).```
What does it mean?
like, height width will be multiplied by that?
like, a 100x100 will result into a 6x6?
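A rough arithmetic sketch, assuming the quoted docstring describes a translation given as a fraction of the image size (that is how a `shift_limit` parameter reads in augmentation libraries such as albumentations). Under that assumption the image keeps its size and its content is moved by up to that fraction. `sample_shift_pixels` is a made-up helper, not a library function.

```python
import random


def sample_shift_pixels(width, height, shift_limit=0.0625, rng=None):
    """Sample a random translation (dx, dy) in pixels for one image."""
    rng = rng or random.Random()
    # The factor is drawn from (-shift_limit, shift_limit) ...
    dx_fraction = rng.uniform(-shift_limit, shift_limit)
    dy_fraction = rng.uniform(-shift_limit, shift_limit)
    # ... and scaled by the image size to get a pixel offset.
    return round(dx_fraction * width), round(dy_fraction * height)


rng = random.Random(0)
shifts = [sample_shift_pixels(100, 100, rng=rng) for _ in range(1000)]
print(shifts[:5])
```

So for a 100x100 image with the default limit, the content moves by roughly 6 pixels at most in each direction and the output is still 100x100.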
or into a 94x94?
Train score (r2) = 0.98
test score = 0.86
is this counted as overfitting? (data= kaggle house prices advanced reg)
Figured it out. RandomForest max depth is too much high..tuning them al again
*all
Thanks for the answer
can someone explain to me how this is 1/2 for this SVM problem
so whichever choice gives p1 * theta >= 1 and p2 * theta <= -1, that would be the right theta value?
someone correct me if i am wrong
Hey guys, which YouTube channel or resource would you recommend for learning statistics from scratch? I don't have a math background
i wish i knew, all the good statistics resources i know of already assume some knowledge of the basics
what is your background? how much math do you know?
If it's high school stuff, yes! I only remember a few algebra things
alright, you might really have to start from the basics then
i honestly don't know where you'd start from there, khan academy i guess has some videos? there might be some "statistical thinking" type of courses online, which would focus more on intuition and concepts
I found one course in data camp called introduction to statistics in python . I will do that. Thanks anyway
Once I have a so-so knowledge of mathematics, what should I learn to make my first AI?
I have a .dat data file that I believe is in some type of .csv format, however it's too large (3.1GB) to inspect with any programs that I've tried inspecting it with. What's the easiest way to convert from .dat to CSV so I can load it into a pandas dataframe?
what program generated the file?
Not a clear way of knowing. It's a government dataset.
I may have found a way around it though...I think I found a CSV, which makes life much easier.
hopefully it says somewhere in the govt docs though
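For a file too large to open in an editor, one option is to read only the first few lines and guess the delimiter from them. The sketch below uses only the standard library; `peek`, `guess_delimiter`, and the tiny pipe-delimited sample file are made up for the example, and the real file's encoding and layout are unknown.

```python
import csv
import os
import tempfile
from itertools import islice


def peek(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


def guess_delimiter(lines, candidates=",;|\t"):
    """Ask csv.Sniffer which candidate delimiter best fits the sample lines."""
    return csv.Sniffer().sniff("\n".join(lines), delimiters=candidates).delimiter


# Demo on a small stand-in file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.dat")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("id|name|value\n1|alpha|2.5\n2|beta|3.5\n3|gamma|4.5\n")
    head = peek(sample_path, n_lines=3)
    delimiter = guess_delimiter(head)

print(head)
print(repr(delimiter))
```

Once the delimiter is known, `pandas.read_csv` accepts it through `sep`, and `nrows` or `chunksize` let you load the file in pieces instead of all at once.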
thanks!!!
Can someone help me with an xgboost model making bad forecasts?
https://brilliant.org not sponsored.
Thanks
Most important one: https://brilliant.org/courses/math-fundamentals/
In this course, we'll introduce the foundational ideas of algebra, number theory, and logic that come up in nearly every topic across STEM.
This course is ideal for anyone who's either starting or re-starting their math education. You'll learn many essential problem solving techniques and you'll need to think creatively and strategically to so...
Cool , I really appreciate your help !
Anyone have an idea of what approach to take for a problem where I need to predict the column name based on its data? I understand it could be a classification problem, however the issue is that there are over 1000 different types of fields
what's the purpose of this task? you want to be able to guess a column name in any dataset? a specific dataset?
Want to predict column name based on data. It's a specific dataset however the issue is the dataset is very large!
Recommend unsupervised for this?
so this is like looking at the data, then covering up the columns and trying to figure out which one was which?
this is a strange task
Correct
Because column names might be misspelled
So let's say one year you have name and then the next year column name for same data is nme
so it's not the same exact dataset, it's different datasets but with the same columns in each dataset?
that's a different task
Similar column names *
and harder
Oh you're right, I apologize
Explained incorrectly
Yea it's a hard task that's for sure
I'm having a hard time figuring out the approach to take
@desert oar i am so sorry to bother you again
i have been doing research today
i can't find the correct method to cluster
given the variability is not so clumped together
may i ask
if i can somehow insert more variables in addition to time series?
like drug class and strength
and form factors like tablet, liquid, etc
Yes
Multiple factor analysis (MFA) is a factorial method devoted to the study of tables in which a group of individuals is described by a set of variables (quantitative and / or qualitative) structured in groups. It may be seen as an extension of:
Principal component analysis (PCA) when variables are quantitative,
Multiple correspondence analysis ...
thank you very much
given my situation
do you think i should try different methods with my current dataset?
very nice theoretical explanation. i am looking up python implementation rn 
Hello, thank you for suggesting an ARIMA model yesterday. I tried traditional methods through sklearn, using RandomForest or xgboost along with custom train/test splits and walk-forward validation, which ended up being too convoluted. I found sktime and they have an ARIMA model that produced the results I was looking for.
you are a god
Haha thanks. I was just stubborn enough to read through how machine learning with time series usually goes.
And a lot of trial and error
yw
wait so you actually read
much respect
I've been dethroned?
I'm a mosaic now
the ability to do arima over night
is amazing lol
maybe the person has a degree in data science already?
who knows
I do
it was cs and data science at the same time
I was under the impression that ML with a snapshot of data using xgboost was similar enough to using a time series. Or that xgboost could handle time series easily. That wasn't the case and it took me a lot of errors to realize.
My BS is in Microbiology. I only began studying programming and more advanced quantitative analysis last year
so cool!!!!
i might as well study cs + biochem now
increasingly sounds like a good option
A lot of Pharma companies are in need of people with those backgrounds. Not a lot of people that study CS go for health sciences.
Agreed, I'd go for Netflix though lol. Other than that, I'd rather work anywhere else.
Lol yeah they don't. But I'm leaning towards data science
probably, in my opinion at least
we need to keep its control
I am not getting desirable output... anyone?
I have git logs where the commit hash, author, timestamp, commit message and file logs (the number of file-log lines varies per commit) are each on a separate line. I can pull out the individual fields (except the file logs and messages) but can't turn them into columns.
This is the sample file
commit 2232asdfeafdadssc0a63d3ded7e95e894bb735c121f
Author: John Shahi
Date: Thu Jun 10 05:10:31 2021 +0000
Feature/bill overview
70 0 src/components/pages/abc.tsx
commit 18asdfasd9104c7fb59d9027f48csdfss8b61776e21d0
Author: rashi.coder
Date: Wed Jun 9 12:39:33 2021 +0545
disable other call, refine disabled styles
13 1 src/components/organisms/contact/detail/card/ContactDetailCard.tsx
11 2 src/components/organisms/contact/list/ActionsColumn.tsx
commit 65adfadfc2090e299bf9c514735eb1a2779a12ed9
Author: Ritesh Poudel
Date: Wed Jun 9 05:08:00 2021 +0000
commit 04afdad56f5136da10c87d5181dab8afdsfs29e57a5
Author: rashi.coder
Date: Wed Jun 9 10:22:29 2021 +0545
fix multiple contacts selection overflow
1 0 src/components/organisms/contact/list/ContactTable.tsx
8 1 src/components/organisms/contact/list/Styles.tsx
this is what I was trying to do
commits = pd.read_csv(
COMMIT_LOG,
sep="\n",
header=None,
names=["raw"]
# names=[
# "sha",
# "author",
# "timestamp",
# "message",
# "additions",
# "deletions",
# "filename",
# ],
)
commit_marker = commits[commits["raw"].str.startswith("commit")]
author_marker = commits[commits["raw"].str.startswith("Author")]
date_marker = commits[commits["raw"].str.startswith("Date:")]
print(commit_marker.head())
print(author_marker.head())
print(date_marker.head())
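One possible approach, sketched with the standard library only: walk the file line by line, start a new record at every `commit` line, and attach the following lines to it. The field names mirror the commented-out `names=[...]` list above; `parse_git_log`, `to_rows`, and the shortened sample text are assumptions for the example. A message line that happens to look like `<number> <number> <path>` would be mistaken for a file entry.

```python
import re

# "<additions> <deletions> <filename>"; git prints "-" for binary files.
NUMSTAT = re.compile(r"^(\d+|-)\s+(\d+|-)\s+(\S.*)$")


def parse_git_log(text):
    """Group log lines into one dict per commit."""
    commits = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("commit "):
            current = {"sha": line.split()[1], "author": None,
                       "timestamp": None, "message": [], "files": []}
            commits.append(current)
        elif current is None:
            continue  # ignore anything before the first commit line
        elif line.startswith("Author:"):
            current["author"] = line.split(":", 1)[1].strip()
        elif line.startswith("Date:"):
            current["timestamp"] = line.split(":", 1)[1].strip()
        else:
            match = NUMSTAT.match(line)
            if match:
                added, deleted, filename = match.groups()
                current["files"].append((
                    int(added) if added.isdigit() else None,
                    int(deleted) if deleted.isdigit() else None,
                    filename,
                ))
            else:
                current["message"].append(line)
    return commits


def to_rows(commits):
    """Flatten to one row per changed file (one empty row if no files)."""
    rows = []
    for commit in commits:
        base = {"sha": commit["sha"], "author": commit["author"],
                "timestamp": commit["timestamp"],
                "message": " ".join(commit["message"])}
        for added, deleted, filename in commit["files"] or [(None, None, None)]:
            rows.append({**base, "additions": added,
                         "deletions": deleted, "filename": filename})
    return rows


SAMPLE = """\
commit 2232asdfeafdadssc0a63d3ded7e95e894bb735c121f
Author: John Shahi
Date: Thu Jun 10 05:10:31 2021 +0000
Feature/bill overview
70 0 src/components/pages/abc.tsx
commit 65adfadfc2090e299bf9c514735eb1a2779a12ed9
Author: Ritesh Poudel
Date: Wed Jun 9 05:08:00 2021 +0000
commit 04afdad56f5136da10c87d5181dab8afdsfs29e57a5
Author: rashi.coder
Date: Wed Jun 9 10:22:29 2021 +0545
fix multiple contacts selection overflow
1 0 src/components/organisms/contact/list/ContactTable.tsx
8 1 src/components/organisms/contact/list/Styles.tsx
"""

rows = to_rows(parse_git_log(SAMPLE))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` then gives one column per field.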
you aren't allowing the : in your test string that comes after the From, so that will need to be accounted for in the regex. also, in future a help channel is probably better suited for this type of question
Guys, how are you? Do any of you already work with geospatial data? I need to plot accident areas and vehicle paths on the map of Brazil. I tried folium and liked it, but when I use too many points for a path the map just doesn't load. Does anyone have a solution for that, or know and like another tool?
whats the best way to learn ML ?
directly starting from tensorflow
or doing octave/matlab first
I'd probably start with the stuff in sklearn before using tensorflow
definitely with Python frameworks, starting with sklearn and then tensorflow or pytorch
Two people have said it independently, so it must be true
Tensorflow is most used in deep learning, it is a step after
and while I know deep learning sounds cooler, that's just what they call it. the algorithms in sklearn can be great for a lot of use cases and don't require you to know quite as much math to wrap your head around what's happening.
this is true... a lot of problems can be solved using ML with sklearn... some other specific and difficult problems need DL and other frameworks.
thanks
thanks
Mathematica has great geospatial capabilities
Maybe I can try helping you
Can someone help me with pandas and datasets?
Hey @hollow ember!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
can someone help me out . Thank you
1 means diabetes?
yes
recently i have learned a thing called t-sne
but since you said predict, i have no idea how
Hello, I'm trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.
My test data (34 pictures) is saved in a 'test' folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.
I receive a "Found 34 images belonging to 1 classes." message. However, the result I get in the terminal for this code line result = seqModel.evaluate(data, batch_size=1, verbose=1) is a very weird one: 2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353
Why do I receive a "2/2" everytime when running the script now, no matter what batch_size I choose? And why is my loss 282.6923, while accuracy is 0.7353? Doesn't it look super weird? I know I'm doing something wrong, but I just can't figure it out - maybe when creating the data generator or maybe when using flow_from_directory? (When I add the validationDataGenerator as first argument - in order to test it - it seems all fine, but here I just can't figure it out.)
A little bit of help would be appreciated.
yes, you can increase the weight of that class
could you send some code?
Yes, sure
```python
imageDataGenerator = ImageDataGenerator(validation_split = 0.2,
                                        rescale = 1./255,
                                        rotation_range = 40,
                                        width_shift_range = 0.2,
                                        height_shift_range = 0.2,
                                        zoom_range = 0.2,
                                        horizontal_flip = True,
                                        fill_mode = 'nearest')
trainingDataGenerator = imageDataGenerator.flow_from_directory(
    dir,
    target_size = (70, 70),
    batch_size = batchSize,
    color_mode="rgb",
    class_mode = 'binary',
    shuffle = True,
    seed=42,
    subset = 'training')
validationDataGenerator = imageDataGenerator.flow_from_directory(
    dir,
    target_size = (70, 70),
    batch_size = batchSize,
    color_mode="rgb",
    class_mode = 'binary',
    subset = 'validation')
```
so these are for the training and the validation - using the data from my dir directory, which is gonna be divided as 80% for the training, 20% for the validation
the problem is that, except for the dir directory, I have a test directory as well
and I'm trying to obtain a generator based on the images in it
how?
```python
test_data = 'C:/Users/Ana/Desktop/Licenta/Practic/maskAPI/test'
datagen = ImageDataGenerator(rotation_range = 40,       # rotate the images
                             width_shift_range = 0.2,   # shift the width
                             height_shift_range = 0.2,  # shift the height
                             zoom_range = 0.2,
                             horizontal_flip = True,    # flip the image horizontally
                             fill_mode = 'nearest')
data = datagen.flow_from_directory('./test', classes=['test'], target_size=(70, 70), color_mode='rgb')
result = seqModel.evaluate(data, batch_size=1, verbose=1)
```
I've tried many variants, this is just the last one of them
by the way, I commented most of the training part, as I don't think it's relevant - only showed you the other 2 generators, that work just fine (I followed a tutorial for those)
the test part is what I don't get
I just want to evaluate the loss and accuracy of my model and I simply don't get how
make sure that the directory you're getting the images from actually has the correct amount of images and that the data generator read them correctly, i don't see anything else that would be wrong
yes, test directory has 34 images. (The directory 'test' from test_data has one subfolder called test containing all the images)
but even if I look at the result, I don't get what that 2/2 is and why it remains like that. Shouldn't it be 34/34? (it was like that at some point, but I had other problems then when experimenting)
2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353
Hmm, any other suggestions for obtaining my accuracy and loss for the test data, other than the evaluate() method?
wait i think i know the issue
YAY
OH, let me try
so even if the evaluate() method has a batch size of 1, it'll take 1 batch from the datagen which will be 32
true
34/34 [==============================] - 0s 3ms/step - loss: 239.4841 - accuracy: 0.7647
now it's 34/34
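Side note on the 2/2 vs 34/34 above: as far as I can tell the progress bar counts batches, not images, and with a generator the batch size comes from flow_from_directory (which defaults to 32 when none is passed), not from the batch_size given to evaluate(). A quick sketch of the arithmetic, where steps_per_pass is just a helper name made up for this note:

```python
import math


def steps_per_pass(num_samples: int, generator_batch_size: int) -> int:
    """Number of batches a generator needs to cover every sample once."""
    return math.ceil(num_samples / generator_batch_size)


# 34 test images with the generator's default batch size of 32
default_steps = steps_per_pass(34, 32)
# 34 test images after setting batch_size=1 on the generator itself
single_steps = steps_per_pass(34, 1)

print(default_steps)  # 2
print(single_steps)   # 34
```

So the second batch in the 2/2 case would hold only the last 2 images.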
still getting the 239.4841 loss value
what loss alg?
what do you mean exactly?
How can i use multiple categories In X or Y axis?
```python
loss = 'binary_crossentropy',
metrics = ['accuracy'])
```
In the compile method I use 'binary_crossentropy'
and I have 2 classes in the 'train' folder -> _with mask_ and _without mask_
I suggest you use Keras's brand new trees
im restricted to only matplotlib and seaborn for my assignment
You can implement them by yourself
Or, you can implement a neural network in raw python
If you know the theory they're not so complex
But, implementing a neural network without numpy can be hard
I know nothing, that's the problem
why are you suggesting implementing a neural network from scratch? theres no need for a neural network for that application
a simple logistic regression model would probably be fine
That's why I suggested trees
But they're pretty fun
If you need something simpler try implementing a dimensionality reduction algorithm and a clustering one
oh yeah i saw keras and thought you meant a neural network lol
but a logistic regression model should be fine
you can do that in seaborn as well
I agree
can u help with this, im quite new to this whole thing
sure
Hey @hollow ember!
It looks like you tried to attach file type(s) that we do not allow (.rar). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
can't send rar files
Try uploading it on pastebin
why? is it ok? c:
Hey, I'm new to computer vision and I tried training a resnet 50 model
as I'm working with like 280 images I tried implementing data augmentation
now the accuracy score is something like 0.006
I'm guessing it overfitted, are there any solutions to this problem?
@hollow ember I got something for you
I tried different methods on your dataset and measured the accuracy of each of those
Here's my results:
LogisticRegression -> 0.7825520833333334
DecisionTree -> 0.7916666666666666
GradientBoostedTrees -> 0.7838541666666666
SVM -> 0.8033854166666666
NearestNeighbors -> 0.75390625
Neural Network -> 0.7955729166666666
(Accuracy)
So I think that an SVM would be the best way to go
(By the way, I didn't do a train/test split, so some of these scores could be inflated by overfitting)
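For reference, a minimal sketch of the kind of hold-out split mentioned above, using only the standard library. The row count of 768, the 25% test fraction and the function name are assumptions for illustration, not what was actually run on the dataset:

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


# 768 is an assumed number of rows
train_idx, test_idx = train_test_split_indices(768)
print(len(train_idx), len(test_idx))  # 576 192
```

The idea would be to fit each model on the train rows only and report accuracy on the test rows, so the scores are measured on data the model has not seen.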
I think that 280 images is too low
By the way, are you dealing with a classification task?
so is there any problem with this?
```python
loss = 'binary_crossentropy',
metrics = ['accuracy'])
```
Oh, I'm trying to count objects in an image
sad moment
Can pandas df.drop() accept a python list? I swear.
l = ['foo', 'bar']
df.drop(columns=l, axis=1)
Does not work, even with labels, or raw...
while
df.drop(['foo', 'bar'], axis=1)
Does work. What the !@#!@#$ is that?
@cerulean mauve what does "does not work" mean?
you probably can't use both axis= and columns=
.drop(l, axis='columns') is the same as .drop(columns=l) and .drop(l, axis=1)
I should be able to .drop(l) with no problem according to what I have seen and read.
Every single way, I just tried them again to make sure that I am not crazy.
I got need to specify at least one of 'labels', 'index', or 'columns'
I got that with columns specified
like .drop(columns=l)
Let me walk you through what I have here.
# First I take a byte string from S3 into a buffer, using pandas' optional library (fsspec) to load the buffer into the DataFrame. That works fine: I can print(df.to_string()) and it comes out well
csv_buffer = StringIO(some_bytes_from_s3)
df = pandas.read_csv(csv_buffer)
my_list = [ 'foo', 'bar']
df.drop(my_list) # does not work
#df.drop(columns=my_list) # does not work
#df.drop(my_list, axis=1) # does not work
# I get the aforementioned error every time.
I can give you some sample data that the CSV looks like.
If needed.
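A small sketch that might help narrow this down, using a made-up three-column frame rather than the real CSV. It checks three things: a list variable behaves the same as a literal list, drop() returns a new DataFrame instead of changing df in place, and drop(my_list) with no axis looks for row labels. The last part is only a guess at the cause: the quoted error message is what pandas raises when the labels it receives are None.

```python
import pandas as pd

# Made-up data; the column names are placeholders
df = pd.DataFrame({'foo': [1, 2], 'bar': [3, 4], 'baz': [5, 6]})
my_list = ['foo', 'bar']

# A list variable and a literal list give the same result
same = df.drop(columns=my_list).equals(df.drop(columns=['foo', 'bar']))

# drop() returns a new frame; df itself keeps all three columns
trimmed = df.drop(columns=my_list)
print(list(df.columns))       # ['foo', 'bar', 'baz']
print(list(trimmed.columns))  # ['baz']

# With no axis or columns argument, labels are searched in the row index
try:
    df.drop(my_list)
except KeyError as exc:
    row_lookup_error = exc

# Guess: if the variable is None, pandas raises the message quoted above
try:
    df.drop(columns=None)
except ValueError as exc:
    none_error = str(exc)
print(none_error)
```

If the real variable prints as None right before the drop call, that would explain why the literal list works and the variable does not.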
!e ```python
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4], 'b': [11,12,13,14], 'c': [21,22,23,24]})
print(df.drop(['a', 'c'], axis=1))
print()
print(df.drop(['a', 'c'], axis='columns'))
print()
print(df.drop(columns=['a', 'c']))
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | b
002 | 0 11
003 | 1 12
004 | 2 13
005 | 3 14
006 |
007 | b
008 | 0 11
009 | 1 12
010 | 2 13
011 | 3 14
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/xajinecesi.txt?noredirect
so yes i think you need to send a csv and the exact code to reproduce the error, because i can't reproduce it here
As far as I can tell
use https://paste.pythondiscord.com/ for the csv file
if I do as you have done, and .drop(['a', 'b'], axis=1) it will work, it's only when I use a variable of type list that it fails.
but ['a', 'b'] is a list
right?
I think my container is fubarred.
I'm using python-lambda-wrapper to develop aws lambda jobs for some database backfilling from CSV
going to test on my other machine
brb
Thank you .... i got it... and i will remember it in the future
I want to drop the columns arg_reportdate and arg_reportname
I place those in a variable that is classed as a list
When I use the list, I cannot drop them.
When I literally write it out aka .drop(['arg_reportdate', 'arg_reportname'])
It works.
just not with the python list variable.
Which is maddening.
because it's a !@#$!$@#!@$ list
maybe it's time to find food and microwave it.
i'll brb
if my model input shape is (100,100,4)
it means the images it reads have alpha, right?
it means that your input shape is 100,100,4
if they're images, yeah probably the 4 corresponds to rgba
but nobody can answer that question for you, you have to know about your own data
so if i wanna augment my data on a custom way using the info of the alpha channel, the images given by flow_from_directory are gonna have alpha too?
ah nooooo
ooooh
ok ok
color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb". Whether the images will be converted to have 1, 3, or 4 channels.
what happens if i read an RGB image as RGBA?
cuz i have some images with Alpha, and i wanna do special things with those only
back @desert oar thanks for looking for me.
fml
do u guys have any idea of how to approach this?
my dataset contains rgb and rgba images
convert to grayscale?
I am using the flow_from_directory method, which reads images as rgb by default, but I can set it to rgba. The thing is, I get this error: ValueError: could not broadcast input array from shape (160,160,3) into shape (160,160,4)
can i read rgb images as rgb and rgba as rgba?
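That error message is what NumPy gives when a 3-channel array is written into a slot that expects 4 channels. A sketch that reproduces it and shows one possible workaround, padding RGB images with a fully opaque alpha channel. add_opaque_alpha is a made-up helper, and whether this matches what happens inside the generator is an assumption:

```python
import numpy as np

rgb = np.zeros((160, 160, 3), dtype=np.uint8)
batch = np.zeros((2, 160, 160, 4), dtype=np.uint8)  # batch buffer expecting RGBA

# Writing an RGB image into an RGBA slot fails
try:
    batch[0] = rgb
except ValueError as exc:
    message = str(exc)
print(message)


def add_opaque_alpha(image):
    """Append a fully opaque alpha channel to an (H, W, 3) image."""
    alpha = np.full(image.shape[:2] + (1,), 255, dtype=image.dtype)
    return np.concatenate([image, alpha], axis=-1)


padded = add_opaque_alpha(rgb)
batch[0] = padded  # now the shapes agree
print(padded.shape)  # (160, 160, 4)
```

With this approach every image ends up with 4 channels, and the ones that originally had no alpha can still be recognised because their alpha is 255 everywhere.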
Maybe, seems like you just want to drop the alpha layer, no?
oh wait
wait you wanted to do something special with those.
I would check the image metadata for the existence of an alpha layer, and then operate differently based on that.
like... i could rewrite flow_from_dir method, and if color_mode='rgba' and the current image has 3 channels, read it as rgb
or something
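A rough sketch of the "operate differently when alpha exists" idea, working on images already loaded as NumPy arrays. Both function names are made up, and blanking out fully transparent pixels is only a placeholder for whatever the special alpha handling is supposed to be:

```python
import numpy as np


def has_alpha(image):
    """True for (H, W, 4) arrays, i.e. images that carry an alpha channel."""
    return image.ndim == 3 and image.shape[-1] == 4


def mask_transparent(image):
    """Placeholder step: zero the RGB values wherever alpha is 0.

    Images without an alpha channel are returned unchanged.
    """
    if not has_alpha(image):
        return image
    rgb, alpha = image[..., :3], image[..., 3:]
    return np.where(alpha > 0, rgb, 0)


rgba = np.full((2, 2, 4), 200, dtype=np.uint8)
rgba[0, 0, 3] = 0  # make one pixel fully transparent
masked = mask_transparent(rgba)
print(masked.shape)  # (2, 2, 3)
```

ImageDataGenerator also takes a preprocessing_function argument that is applied to each image, which might be an alternative to overriding flow_from_directory, though that has not been tried here.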
Hey guys, I need some help with a DS project I'm working on.
I'm trying to predict people's taste using recipes, profiles, favorites, and reviews I scraped from a popular recipes website.
I currently have 95K recipes, 2M profiles, 2M reviews, and ~70M favorites.
How would one use the data I fetch in order to finish the project?
if you wanted to drop the alpha layer yes.
i cant cuz the image i get is read from flow from dir, and it reads imgs as rgb
got the code handy?
.
i mean
this?
seems right
@desert oar how do i set weights for a certain class? i asked u before
@cedar sun what do you mean?
The ImageDataGenerator class seems to have a parameter called data_format, which sounds promising. @twin moth
yeah but still, i would have to override the method
like, telling the model to focus more on a certain class
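One common way to do that is to weight each class inversely to how often it appears, then hand the resulting dict to the class_weight argument of Keras's fit(). A minimal sketch of the weight computation only, with made-up labels and a made-up helper name:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels: 90 samples of class 0, 10 samples of class 1
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights[1])  # 5.0
```

The rare class gets the larger weight, so its mistakes count more in the loss. Picking the weights by hand instead of by frequency works the same way, it is just a dict from class index to a number.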
Polymorphism is the way the truth, and the light.
you can use super() to call the parent's method from inside your own class.
yeah but ive never done it in python lol
Sorry, wasn't following the discussion from the beginning, can you tell me what goal you are trying to achieve, what your data is and how you are trying to do this?
So maybe I can be helpful
i dont think that would be enough. What i actually want is to make flow_from_directory method behave the following way:
if color_mode = 'rgba', then attempt to read all images as rgba. If an image doesnt have alpha, then read it as rgb
to make a child class:
class Parent:
    def __init__(self, txt):
        self.message = txt

    def printmessage(self):
        print(self.message)

class Child(Parent):
    def __init__(self, txt):
        super().__init__(txt)

x = Child("Hello, and welcome!")
x.printmessage()
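The same pattern works for changing a single method: override it in the child and call super() inside it to reuse the parent's version. A sketch with made-up classes, where Loader stands in for whatever library class is being extended:

```python
class Loader:
    """Stand-in for a library class whose method we want to tweak."""

    def load(self, name):
        return f"loaded {name}"


class TaggedLoader(Loader):
    def load(self, name):
        # Run the parent's load() first, then add our own step on top
        result = super().load(name)
        return result + " (checked for alpha)"


loader = TaggedLoader()
print(loader.load("img.png"))  # loaded img.png (checked for alpha)
```

Only load() is replaced here; everything else on Loader is inherited untouched.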