#data-science-and-ml | Python | Page 400

misty flint May 3, 2022, 3:40 PM

#

i was more looking for any youtubers who explore this space or books or that type of educational content

#

not necessarily tools

#

the only one i know is part-time larry

#

but hes been doing web3 stuff lately and that doesnt interest me atm

gilded kestrel May 3, 2022, 3:46 PM

#

maybe noob question, if you have a time series is training a bidirectional LSTM considered data leakage?

barren wedge May 3, 2022, 3:48 PM

#

In my opinion
Forecasting is useless
Time series analysis is just another way to drawing a chart

gilded kestrel May 3, 2022, 3:49 PM

#

I guess this reply is for my question?

barren wedge May 3, 2022, 3:50 PM

#

gilded kestrel I guess this reply is for my question?

No but maybe
Lol

gilded kestrel May 3, 2022, 3:50 PM

#

haha

versed gulch May 3, 2022, 4:45 PM

#

Hi does anyone know how to upscale a 3D image i.e. producing much finer slices. For example, my image is of size (30, 400, 400) (number of images x height x width) and I want to convert this to (40, 400, 400)

tidal bough May 3, 2022, 4:46 PM

#

Huh, interesting. So do you want to interpolate between the slices somehow?

versed gulch May 3, 2022, 4:49 PM

#

tidal bough Huh, interesting. So do you want to interpolate between the slices somehow?

yh

tidal bough May 3, 2022, 4:49 PM

#

You can use scipy.interpolate.interp1d

#

if that's for something like model training though, I'm not sure the new slices will be useful (they are just made from original ones after all)

versed gulch May 3, 2022, 4:50 PM

#

tidal bough if that's for something like model training though, I'm not sure the new slices ...

no I'm using this to filter the vessels by making my image isotropic

versed gulch May 3, 2022, 4:51 PM

#

tidal bough You can use `scipy.interpolate.interp1d`

however my image is 3D not 1D?

mild dirge May 3, 2022, 4:54 PM

#

versed gulch however my image is 3D not 1D?

yeah but you would interpolate for each pixel basically

#

So in the end you could still use 1d interpolation

versed gulch May 3, 2022, 4:56 PM

#

mild dirge So in the end you could still use 1d interpolation

how so?

mild dirge May 3, 2022, 4:57 PM

#

For each pixel you can interpolate over all images

#

Each pixel is just a value, so you'd have 30 values per pixel location

#

and can use 1d interpolation for that

versed gulch May 3, 2022, 4:57 PM

#

ok so a (30,) --> (40,)?

mild dirge May 3, 2022, 4:58 PM

#

yeah

#

basically

pliant marten May 3, 2022, 5:00 PM

#

whats the easiest way to make machine learning for a game?

mild dirge May 3, 2022, 5:00 PM

#

pliant marten whats the easiest way to make machine learning for a game?

That's way too general of a question

pliant marten May 3, 2022, 5:00 PM

#

okay

#

what about

#

for mario

mild dirge May 3, 2022, 5:01 PM

#

mario kart, platformer 2d, platformer 3d, mario party?

pliant marten May 3, 2022, 5:01 PM

#

2d mario

#

mario the first

#

1-1

#

the first level

versed gulch May 3, 2022, 5:01 PM

#

mild dirge basically

so how would you assess each pixel of those 30 slices?

tidal bough May 3, 2022, 5:01 PM

#

versed gulch however my image is 3D not 1D?

Your x-coordinate, as far as interp1d is concerned, is the height (frame number, whatever). The values at each x-coordinate is a 400x400 image (notice that interp1d supports multidimensional y-values). Since you're only interpolating along one input variable, it's 1d interpolation.

mild dirge May 3, 2022, 5:02 PM

#

pliant marten the first level

There's plenty of videos on it*

#

https://www.youtube.com/watch?v=CI3FRsSAa_U

YouTube

Chrispresso

AI Learns to Play Super Mario Bros!

Using a Genetic Algorithm and Neural Network, a population of AI were able to learn to play different levels of Super Mario Bros for the NES.

Code: https://github.com/Chrispresso/SuperMarioBros-AI
Blog: https://chrispresso.github.io/AI_Learns_To_Play_SMB_Using_GA_And_NN

Music: https://soundcloud.com/ashamaluevmusic
first song: Memory
second so...


#

Like this one f.e. (haven't watched, but highest result on yt)

pliant marten May 3, 2022, 5:02 PM

#

oh okay

#

thank you

versed gulch May 3, 2022, 5:05 PM

#

tidal bough Your `x`-coordinate, as far as `interp1d` is concerned, is the height (frame num...

so it would be

scipy.interpolate.interp1d(threeD_image, (40, 400, 400), kind='linear')?

tidal bough May 3, 2022, 5:07 PM

#

that's not how the arguments to it go, I'm pretty sure

#

!docs scipy.interpolate.interp1d

arctic wedgeBOT May 3, 2022, 5:07 PM

#

scipy.interpolate.interp1d


class scipy.interpolate.interp1d(x, y, kind='linear', axis=-1, copy=True, bounds_error=None, fill_value=nan, assume_sorted=False)
Interpolate a 1-D function.

*x* and *y* are arrays of values used to approximate some function f:
`y = f(x)`. This class returns a function whose call method uses
interpolation to find the value of new points.

pliant marten May 3, 2022, 5:07 PM

#

wait

#

how do i track where a goomba is going/going to go?

tidal bough May 3, 2022, 5:07 PM

#

first argument is the interpolation variable. For you it's, say, linspace(0,1,len(threeD_image)). The y would be your threeD_image indeed.

#

Then to interpolate using the interpolator it returns, you'd do interpolator(linspace(0,1,100)), for example, which will give you 100 evenly-spaced slices.

versed gulch May 3, 2022, 5:15 PM

#

tidal bough Then to interpolate using the interpolator it returns, you'd do `interpolator(li...

sorry just a bit confused with this part of the interpolator

tidal bough May 3, 2022, 5:18 PM

#

interpolator being what interp1d returns

versed gulch May 3, 2022, 5:23 PM

#

tidal bough `interpolator` being what `interp1d` returns

so this?

from scipy.interpolate import interp1d

interpolator = interp1d(x = np.linspace(0,1,len(threeD_image)), y = threeD_image)

tidal bough May 3, 2022, 5:36 PM

#

yup

lapis sequoia May 3, 2022, 5:37 PM

#

c i think?

versed gulch May 3, 2022, 5:38 PM

#

tidal bough yup

get this

#

unless I need to do axis = 0

tidal bough May 3, 2022, 5:38 PM

#

ah, yeah, you do - it defaults to -1
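
Putting the thread together, here is a minimal sketch of the slice interpolation discussed above. The helper name `resample_slices` and the small random stand-in volume are assumptions made for illustration; the real image in the question has shape (30, 400, 400).

```python
import numpy as np
from scipy.interpolate import interp1d


def resample_slices(volume, n_out):
    """Linearly interpolate a (slices, height, width) stack along axis 0."""
    n_in = volume.shape[0]
    # Positions of the existing slices and of the requested slices on [0, 1].
    x_old = np.linspace(0.0, 1.0, n_in)
    x_new = np.linspace(0.0, 1.0, n_out)
    # axis=0 tells interp1d that the slices run along the first dimension
    # (the default, axis=-1, would interpolate along the width instead).
    interpolator = interp1d(x_old, volume, kind="linear", axis=0)
    return interpolator(x_new)


# Hypothetical stand-in data: 30 slices of 40 x 40 pixels.
rng = np.random.default_rng(0)
volume = rng.random((30, 40, 40))
finer = resample_slices(volume, 40)
print(finer.shape)
```

The first and last output slices coincide with the original ones because both grids start at 0 and end at 1.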

mild dirge May 3, 2022, 5:39 PM

#

lapis sequoia c i think?

We can't just give the answer to hw/exam questions

lapis sequoia May 3, 2022, 5:40 PM

#

mild dirge We can't just give the answer to hw/exam questions

it's study for an exam tomorrow from a past paper just wanted to check if im right.... i think c because you should try the other options before doing this

brazen sapphire May 3, 2022, 6:00 PM

#

Hey guys, this may be belonging more into "general" but its super packed there and it kind of belongs here too so:

csv_writer = csv.writer(f)
for i in range(1, len(data_list)-1):
    for j in range(0, len(data_list[i])):
        csv_writer.writerow(data_list[i][j])

"iterable expected no float" - how can i achieve to split my list of tuples into a proper table like i am aiming to with this?

tidal bough May 3, 2022, 6:01 PM

#

csv_writer.writerow wants a row at a time. You seem to be passing it an element at a time.

brazen sapphire May 3, 2022, 6:02 PM

#

Makes sense. Unfortunately a row of a list is an element

#

Seems like i might have to make an array out of my list first, thanks for your advice
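
A minimal sketch of the row-at-a-time idea, assuming `data_list` is a list of tuples where each tuple is one table row (the sample values are made up, and an in-memory buffer stands in for the real file `f`). Converting to an array first is not required, since `writerow` accepts any iterable.

```python
import csv
import io

# Hypothetical data: each tuple is one row of the table.
data_list = [(1.0, 2.5, 3.0), (4.0, 5.5, 6.0)]

buffer = io.StringIO()  # stand-in for an opened file
csv_writer = csv.writer(buffer)
for row in data_list:
    # Pass the whole tuple; passing a single float raises
    # "iterable expected, not float".
    csv_writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

`csv_writer.writerows(data_list)` would write all rows in one call.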

mighty spoke May 3, 2022, 6:08 PM

#

Hi, I'm trying to code this recursive relation but it's giving me an error saying
IndexError: index 256 is out of bounds for axis 0 with size 256. I think it's because the list/array u is not big enough, since I refer to u[i+1] and u[i+2], which are outside the range I am looking at. But I'm not sure how to fix this.
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.fft import fft, fftfreq
from scipy import fftpack

data = np.loadtxt('data.txt')
x=np.arange(0,1.024, 4e-3)

T=4e-3
N=256

#amp=data**2

#yf = fft(amp)

#xf = fftfreq(N,T)
#plt.plot(xf, yf)

def alg(u,t):
    for i in range(0,N-1):
        for j in range(1,N-1):
            u[i]=x[i]+2*np.cos((2*np.pi*j)/N)*u[i+1]-u[i+2]
            ak=(1/N)*u[0]-u[1]*np.cos((2*np.pi*j)/N)
            bk=(1/N)*u[1]*np.sin((2*np.pi*j)/N)
            p=(ak)**2+(bk)**2#power
    return ak, bk, p

y=alg(data,x)
print(y[0])
```
This is a pic of the equations i'm using

mild dirge May 3, 2022, 6:09 PM

#

well if there is no [n+1] or [n+2] then you either,

1. stop early
2. pad the array

mighty spoke May 3, 2022, 6:11 PM

#

mild dirge well if there is no [n+1] or [n+2] then you either, 1. stop early 2. pad the arr...

Hi but there is a [n+1] and [n+2]

mild dirge May 3, 2022, 6:12 PM

#

then there will be no index error

mighty spoke May 3, 2022, 6:13 PM

#

mild dirge then there will be no index error

yeah but in the pic i sent there's an n+1 and n+2

#

in the relation

mild dirge May 3, 2022, 6:14 PM

#

but in your array there is not

#

you can't take [n+1] if your n is at the end of the array

mighty spoke May 3, 2022, 6:16 PM

#

yhh i guess i can work out u1 and u0 instead as thats all i need

mild dirge May 3, 2022, 6:17 PM

#

What they say is to set Un+1 and Un to 0

#

and then you only have to compute for N-1, N-2...0

#

So if you do that, there will be no issue

#

So their solution is padding with 0's
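
A minimal sketch of the zero-padding approach. The picture with the equations is not available here, so this assumes the standard backward recursion u[n] = x[n] + 2cos(2πk/N)·u[n+1] − u[n+2] with u[N] = u[N+1] = 0, followed by a_k = (u[0] − u[1]cos(2πk/N))/N and b_k = u[1]sin(2πk/N)/N. The function name `goertzel_coefficients` and the random test signal are made up for illustration.

```python
import numpy as np


def goertzel_coefficients(x, k):
    """Return (a_k, b_k, power) for frequency index k via backward recursion."""
    n_samples = len(x)
    theta = 2.0 * np.pi * k / n_samples
    # Two extra zeros at the end play the role of u[N] and u[N+1].
    u = np.zeros(n_samples + 2)
    # Run from N-1 down to 0 so u[n+1] and u[n+2] always exist.
    for n in range(n_samples - 1, -1, -1):
        u[n] = x[n] + 2.0 * np.cos(theta) * u[n + 1] - u[n + 2]
    a_k = (u[0] - u[1] * np.cos(theta)) / n_samples
    b_k = u[1] * np.sin(theta) / n_samples
    return a_k, b_k, a_k**2 + b_k**2


# Hypothetical input: 256 random samples instead of data.txt.
rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
a3, b3, p3 = goertzel_coefficients(signal, 3)

# Cross-check against the FFT: a_k = Re(X_k)/N and b_k = -Im(X_k)/N.
spectrum = np.fft.fft(signal)
print(a3, spectrum[3].real / 256)
print(b3, -spectrum[3].imag / 256)
```

Under these assumptions the recursion reproduces the cosine and sine sums of the discrete Fourier transform, which is why the FFT is a convenient check.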

misty flint May 3, 2022, 6:22 PM

#

interesting concept here:

#

simple but easy way to get more mileage out of ML models i believe

mighty spoke May 3, 2022, 6:23 PM

#

mild dirge What they say is to set Un+1 and Un to 0

ah yeah i'll try that thanks

eager wedge May 3, 2022, 6:36 PM

#

cm_df = pd.DataFrame(cm, index = ['Glioma', 'Meningioma', 'Pituitary'], columns = ['Glioma', 'Meningioma', 'Pituitary'])
plt.figure(figsize=(5,4))
sns.heatmap(cm_df, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()

#

Why no work? Error message -> Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets

fervent flicker May 3, 2022, 6:52 PM

#

@eager wedge there's a problem with how you've made y_test and y_pred

#

post that code and i'll let you know what's wrong
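
The code that builds `cm` was never posted, so the following is only a guess at a common cause: this error typically appears when `y_test` is one-hot encoded and `y_pred` holds per-class probabilities. Under that assumption, converting both to class indices before calling `confusion_matrix` avoids it. The arrays below are made-up examples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical one-hot labels and softmax-style predictions for 3 classes.
y_test = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_pred = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
])

# Reduce both to a single class index per sample.
y_true_labels = y_test.argmax(axis=1)
y_pred_labels = y_pred.argmax(axis=1)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_labels, y_pred_labels, labels=[0, 1, 2])
print(cm)
```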

balmy burrow May 3, 2022, 7:51 PM

#

Dunno if i am butchering the right place but "ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 16512 and the array at index 3 has size 8" error is caused by some sort of missing data?

#

nvm brainfarted and forgot numpy concatenate needs 2d arrays.

serene scaffold May 3, 2022, 9:58 PM

#

balmy burrow nvm brainfarted and forgot numpy concatenate needs 2d arrays.

In [1]: a = np.array([1, 2, 3])

In [2]: a.shape
Out[2]: (3,)

In [3]: np.concatenate((a, a))
Out[3]: array([1, 2, 3, 1, 2, 3])

balmy burrow May 3, 2022, 9:59 PM

#

yeah i just splited an array with a[:, b] it works too probably.

daring cape May 3, 2022, 10:00 PM

#

is someone able to help me with this?

tall blaze May 3, 2022, 10:40 PM

#

So if I have a data set consisting of numeric and categorical variables. How can I show a correlation coefficient to y (y is numeric) where I am comparing apples to apples? Additionally, if I am feeding this data into a DNN, what is a way I could visualize how each variable affects the accuracy of the model?

mild dirge May 3, 2022, 10:41 PM

#

if some feature is categorical, you can show a histogram, with the average value (or maybe even a boxplot) of y for each category

#

For others you could indeed plot the correlation (the x value with y values in a scatter plot)

#

To find out the effect of a single variable on the resulting performance, you could try feeding some inputs with the specific feature randomized

tall blaze May 3, 2022, 10:42 PM

#

is there anyway to quantify a correlation for the cat variables that would be comparable to the pearson correlation for the numerical variables

#

and ok

#

interesting

mild dirge May 3, 2022, 10:43 PM

#

tall blaze is there anyway to quantify a correlation for the cat variables that would be co...

not sure

tall blaze May 3, 2022, 10:43 PM

#

basically I have stakeholders who want to know what variables "made the difference" for a dnn I built them

mild dirge May 3, 2022, 10:44 PM

#

Well just a correlation won't tell you everything

#

maybe only combinations of certain features were important for deciding the output

tall blaze May 3, 2022, 10:44 PM

#

I am thinking of dimensionality reduction somehow but that is tough to explain

mild dirge May 3, 2022, 10:45 PM

#

mild dirge To find out the effect of a single variable on the resulting performance, you co...

This would tell you more I think, but I just heard this was a method, maybe look into seeing how reliable this is for telling the importance of a feature

#

and how exactly to randomize the value

tall blaze May 3, 2022, 10:45 PM

#

ok do you know how to randomize certain nodes in a dnn

#

input nodes*

#

do I just randomize the x_train

mild dirge May 3, 2022, 10:45 PM

#

You wouldn't randomize nodes, you would randomize the specific feature of the input

tall blaze May 3, 2022, 10:45 PM

#

ahhhh ok

mild dirge May 3, 2022, 10:46 PM

#

So the DNN basically can't get information from it anymore

tall blaze May 3, 2022, 10:46 PM

#

and I would have to fully train the model for each input wouldnt I?

mild dirge May 3, 2022, 10:46 PM

#

Don't think so

#

Again, I just heard someone say this was a method to check what inputs were important

#

Not too sure on the specifics

tall blaze May 3, 2022, 10:46 PM

#

ahh ok

mild dirge May 3, 2022, 10:48 PM

#

https://christophm.github.io/interpretable-ml-book/feature-importance.html

8.5 Permutation Feature Importance | Interpretable Machine Learning

Machine learning algorithms usually operate as black boxes and it is unclear how they derived a certain decision. This book is a guide for practitioners to make machine learning decisions interpretable.

tall blaze May 3, 2022, 10:48 PM

#

snap this helps

mild dirge May 3, 2022, 10:48 PM

#

So basically just change a single feature for every data point, feed to model, compare to error without randomizing

#

you can do this for every feature

tall blaze May 3, 2022, 10:49 PM

#

yea I see that

#

any packages or I am coding the math

#

?

mild dirge May 3, 2022, 10:50 PM

#

you could literally just shuffle the feature column

#

that way you still get values in the correct range

tall blaze May 3, 2022, 10:51 PM

#

ok and no differentiaton between the num and cat inputs?

#

I am not seeing reason any but I guess I just need some handholding today

mild dirge May 3, 2022, 10:52 PM

#

For shuffling I don't think so no

#

For correlation I am not sure what to use for categories

#

for visualizing you could make a histogram like I said, not sure about a single measure telling "correlation"

tall blaze May 3, 2022, 10:54 PM

#

Awesome and ohhhh it clicked I am running it out of model predict by randomizing the x_test columns then comparing the change in MSE

#

Thank you!
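
A minimal sketch of the shuffle-and-compare procedure described above, with no retraining involved. `toy_predict` is a made-up stand-in for the trained model's `predict`, and the function name `permutation_importance_mse` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def permutation_importance_mse(predict, x_test, y_test, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(x_test) - y_test) ** 2)
    importances = np.zeros(x_test.shape[1])
    for col in range(x_test.shape[1]):
        increases = []
        for _ in range(n_repeats):
            x_shuffled = x_test.copy()
            # Shuffling keeps the values in the correct range but breaks
            # the link between this feature and the target.
            x_shuffled[:, col] = rng.permutation(x_shuffled[:, col])
            mse = np.mean((predict(x_shuffled) - y_test) ** 2)
            increases.append(mse - baseline)
        importances[col] = np.mean(increases)
    return baseline, importances


def toy_predict(x):
    # Stand-in model: relies heavily on feature 0, slightly on feature 1,
    # and not at all on feature 2.
    return 3.0 * x[:, 0] + 0.5 * x[:, 1]


rng = np.random.default_rng(1)
x_demo = rng.standard_normal((500, 3))
y_demo = toy_predict(x_demo)

baseline_mse, scores = permutation_importance_mse(toy_predict, x_demo, y_demo)
print(baseline_mse, scores)
```

Repeating the shuffle a few times and averaging reduces the noise from any single permutation. scikit-learn also ships `sklearn.inspection.permutation_importance` for estimators that follow its API.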

karmic valley May 4, 2022, 12:45 AM

#

Pls help

odd meteor May 4, 2022, 12:58 AM

#

tall blaze is there anyway to quantify a correlation for the cat variables that would be co...

There are 3 methods (that I know of) you could use to understand if a continuous and a categorical feature in your data are significantly correlated. As you have probably realised, the conventional Pearson correlation is not appropriate for measuring the association between a numeric feature and a categorical feature in your dataset.

Point Biserial correlation

The point biserial correlation coefficient is a special case of Pearson’s correlation coefficient, used when the categorical variable is binary. You can do more research to understand this (I'm not too familiar with this approach)

Logistic regression

The idea behind using logistic regression to understand correlation between variables is actually quite straightforward and follows as such: If there is a relationship between the categorical and continuous variable, you should be able to construct an accurate predictor of the categorical variable from the continuous variable.

If the resulting classifier has a high degree of fit, is accurate, sensitive, and specific we can conclude the two variables share a relationship and are indeed correlated.

Kruskal Wallis H Test.

A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates another sample, although the test does not identify where this stochastic dominance occurs. For analyzing the specific sample pairs for stochastic dominance, Dunn’s test, pairwise Mann-Whitney tests with Bonferroni correction, or the Conover–Iman test are appropriate, or t-tests when you use ANOVA.

NB: Kruskal-Wallis test is a non parametric test.

I wouldn’t wanna go deep into stats explanations. But I do hope you'll be curious enough to proceed with your research and make progress from there.

#

This table summarizes what I'm tryna say
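
Two of the methods above are available in SciPy. This is a minimal sketch on synthetic data (all variable names and values are made up), assuming a binary category for the point-biserial case and a three-level category for Kruskal-Wallis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Point-biserial: binary category (0/1) versus a continuous value.
group = np.repeat([0, 1], 50)
y = rng.normal(loc=group * 2.0, scale=1.0)
r_pb, p_pb = stats.pointbiserialr(group, y)

# It matches Pearson's r computed on the 0/1 coding of the category.
r_pearson = np.corrcoef(group, y)[0, 1]

# Kruskal-Wallis: continuous values split by a three-level category.
category = np.repeat(["a", "b", "c"], 40)
values = rng.normal(loc=np.repeat([0.0, 0.0, 1.5], 40), scale=1.0)
samples = [values[category == level] for level in ("a", "b", "c")]
h_stat, p_kw = stats.kruskal(*samples)

print(r_pb, r_pearson, h_stat, p_kw)
```

Note that Kruskal-Wallis yields a test statistic and p-value, not a coefficient on the same scale as Pearson's r, so the two outputs are not directly comparable.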

serene scaffold May 4, 2022, 12:59 AM

#

for a moment I thought I had ended up on the rules page for a different discord or something

odd meteor May 4, 2022, 1:01 AM

#

serene scaffold for a moment I thought I had ended up on the rules page for a different discord ...

😊 I guess I need to learn brevity

karmic valley May 4, 2022, 1:07 AM

#

Basically in my data I have 15 patients and for each patient I am analysing parameters at 5 time points (Baseline, Immediate Sonovue injection, 20s after sonovue, 40s after and 1minute after). For each time point for each patient I have calculated a signal intensity value. I have also calculated an AI Quality score for the same point. So in total I have 75 values of intensity (5*15) and 75 values of AI quality.

I want to do a correlation graph of AI quality and intensity. I thought of plotting 75 points on a graph and then doing line of best fit and then doing pearsons coefficient to get correlation. However, I read online that pearsons requires data to be independent and I think my data is not all independent as I have 5 points for the same patient on the graph? If this is not something I can use, is there any other way to simply get a correlation coefficient?

serene scaffold May 4, 2022, 1:09 AM

#

odd meteor 😊 I guess I need to learn brevity

if you know what you're talking about, it takes more effort to be brief than to be thorough. but that wasn't my point, in either case.

odd meteor May 4, 2022, 1:18 AM

#

serene scaffold if you know what you're talking about, it takes more effort to be brief than to ...

That's true. Although, long post can sometimes be tiring to read... 😀

odd meteor May 4, 2022, 1:32 AM

#

karmic valley Basically in my data I have 15 patients and for each patient I am analysing para...

Yeah. Your explanatory variables need to be independent. If they aren't, you'll most likely notice they have a very high correlation amongst themselves (signalling the presence of multicollinearity)

I'm not into medical practice but it seems your features are time dependent. Try to plot the correlation matrix first and observe the result.

If your fear is confirmed positive, then you might wanna engineer more features + apply feature interactions

karmic valley May 4, 2022, 1:35 AM

#

I will try to Google how to do correlation matrix. What will that tell me and what do I do with that information

#

Can I share my data it's not long

arctic wedgeBOT May 4, 2022, 1:38 AM

#

Hey @karmic valley!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.

Feel free to ask in #community-meta if you think this is a mistake.

karmic valley May 4, 2022, 1:39 AM

#

📎 Book2.csv

odd meteor May 4, 2022, 1:39 AM

#

karmic valley I will try to Google how to do correlation matrix. What will that tell me and wh...

It will show you the correlation score of every feature in your dataset.

Yes you can. But I can't infer the correlation score by just looking at your data

karmic valley May 4, 2022, 1:41 AM

#

okay ill briefly just explain what the data is. so 1st and 2nd column is the two variables i want to correlate. the column period tells you at what time point that measurement was taken (5time points). the patent number column tells you what patient so 15 patients and each patient has variables at the time points measured

#

it shows easier when you download it as csv

karmic valley May 4, 2022, 1:42 AM

#

odd meteor It will show you the correlation score of every feature in your dataset. Yes yo...

so doing a correlation matrix allows for non-independent data?

odd meteor May 4, 2022, 1:45 AM

#

karmic valley okay ill briefly just explain what the data is. so 1st and 2nd column is the two...

I'll take a look at it when I'm up. It's 2:45 am here and I'm very sleepy now. I must disappear now 🙏

karmic valley May 4, 2022, 1:46 AM

#

aha no problem!

#

yes would appreciate if you could help with this when youre free tomorrow. been stuck on it for ages lool

odd meteor May 4, 2022, 1:48 AM

#

karmic valley so doing a correlation matrix allows for non-independent data?

Yes. A correlation matrix tells you the correlation score between the dependent variable and the independent variables, as well as the correlation between an independent variable and other independent variables.

karmic valley May 4, 2022, 1:50 AM

#

does it allow for paired data too like in my case.?

bold timber May 4, 2022, 2:36 AM

#

Why I get an error like this? How to fix out?

desert quartz May 4, 2022, 3:38 AM

#

Hello people, I hope you have a nice day. How could I avoid that when I am making a loop of requests to an API, if the internet cuts me, I lose all the queries that the program had already made and have to start over.

In python is my query

Practically, if the communication is cut, the program waits for it to resume and does not break

serene scaffold May 4, 2022, 3:40 AM

#

desert quartz Hello people, I hope you have a nice day. How could I avoid that when I am maki...

this is not a data science question. see #❓｜how-to-get-help to read how to ask in a general help channel

desert quartz May 4, 2022, 3:41 AM

#

I get it. sorry

drifting oasis May 4, 2022, 3:55 AM

#

trying to add a line after each tweet so i can see my data better, please help

Screen_Shot_2022-05-03_at_8.55.15_PM.png

odd meteor May 4, 2022, 8:50 AM

#

karmic valley does it allow for paired data too like in my case.?

You could use the groupby() method on your patient column to get a more summarized view of the data.

Yes you can perform correlation on your SNR and AI Qual columns.
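
A minimal sketch of the `groupby()` suggestion with pandas. The column names `SNR` and `AI Qual` come from the message above; `Patient`, `Period`, and the synthetic values are assumptions, since the real Book2.csv is not shown. Averaging per patient gives one point per patient, which is one simple way to avoid treating the 5 repeated measurements as independent.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Book2.csv: 15 patients x 5 time points.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Patient": np.repeat(np.arange(1, 16), 5),
    "Period": np.tile(np.arange(5), 15),
    "SNR": rng.normal(10, 2, 75),
})
df["AI Qual"] = 0.5 * df["SNR"] + rng.normal(0, 1, 75)

# Pooled Pearson correlation over all 75 rows (ignores the pairing).
pooled_r = df["SNR"].corr(df["AI Qual"])

# One mean per patient, then correlate the 15 patient-level points.
patient_means = df.groupby("Patient")[["SNR", "AI Qual"]].mean()
between_r = patient_means["SNR"].corr(patient_means["AI Qual"])

# Correlation matrix of the two measurement columns.
corr_matrix = df[["SNR", "AI Qual"]].corr()
print(pooled_r, between_r)
print(corr_matrix)
```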

bold timber May 4, 2022, 9:01 AM

#

Hi, I have a question, why if we use a different version of sklearn causes get different scores? If the new version provides a score less than an old version, which is better?

odd meteor May 4, 2022, 9:03 AM

#

bold timber Hi, I have a question, why if we use a different version of sklearn causes get d...

I'm not sure you're experiencing difference in scores because of the sklearn versions.

What metric were you calculating?

bold timber May 4, 2022, 9:10 AM

#

odd meteor I'm not sure you're experiencing difference in scores because of the sklearn ver...

I have several times used different versions of sklearn and I got different results of scores. I just wondering why did it happen

I use r-squares to evaluate the model

#

I try to reinstall of python version and I only can use the version of sklearn more than 0.24.0 and then I get a score of model is less than when I use the sklearn version 0.22.2

#

Maybe it causes when I get a new python version? or what?

upper spindle May 4, 2022, 9:40 AM

#

Is there a way to replicate this graph like this using the frequency of the values per date

odd meteor May 4, 2022, 9:50 AM

#

bold timber I have several times used different versions of sklearn and I got different resu...

Your R^2 (r2_score) should normally only vary if you didn’t set same seed or use same random_state when running your code with different versions of sklearn

supple leaf May 4, 2022, 10:04 AM

#

Good day everyone. Does anyone have an idea how I could calculate whether the derivative is positive or negative where the blue line crosses the dotted red line in my graph? Would very much appreciate any tips.

bold timber May 4, 2022, 10:13 AM

#

odd meteor Your explained_variance_score (R^2) will only vary if you didn’t set same `seed`...

I always use a random state in the model, but it still gets a different score

This is my model to evaluate the score. Before this, I have scored 92.8 % on test_score when I use 0.22.2 version of scikit learn

rose agate May 4, 2022, 10:19 AM

#

supple leaf Good day everyone. Does anyone have an idea how I could calculate whether the de...

Can't you just tell it's positive by looking at it? Anyway, if you know the point of intersection, and have a function for the blue line, I think you might just be able to get two very close points, e.g. 45 and 45.01 and get a decent estimate for the gradient based on those. There might be some actual functions that calculate gradient that I don't know about though

rose agate May 4, 2022, 10:28 AM

#

bold timber I always use a random state in the model, but it still gets a different score T...

I can't say why it's different, maybe they slightly altered some functions, but if the accuracy has gone from 92.8% to 91.6%, I don't think that's much of a problem. It's a small random deviation which I don't think indicates worse performance unless your sample size is pretty large.

supple leaf May 4, 2022, 10:35 AM

#

rose agate Can't you just tell it's positive by looking at it? Anyway, if you know the poin...

I forgot to add important information, the x axis will stretch all the way to 1000 hours. Therefore the blue line will cross the red line a lot of times. What im actually looking for is to know when the blue line is above/below the red line. Maybe it would just be too complicated to calculate the derivative. Maybe I can just look when the value of the blue line is greater than the red line => the blue line is above the red line

odd meteor May 4, 2022, 10:36 AM

#

bold timber I always use a random state in the model, but it still gets a different score T...

When you split your data into train and validation set, did you also set a seed?
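
A minimal sketch of what fixing the seed in both places looks like. The model posted in the thread is not visible, so `RandomForestRegressor`, the synthetic data, and the helper name `fit_and_score` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(seed):
    """Split, train, and score with the same seed used everywhere."""
    rng = np.random.default_rng(0)  # fixed synthetic data
    x = rng.standard_normal((200, 5))
    y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 200)
    # Seed the split...
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=seed
    )
    # ...and seed the model itself.
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test))


score_a = fit_and_score(42)
score_b = fit_and_score(42)
print(score_a, score_b)
```

Within one installation the two runs match. Across scikit-learn releases, defaults and internals can change, so identical seeds do not guarantee identical scores between versions.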

rose agate May 4, 2022, 10:36 AM

#

supple leaf I forgot to add important information, the x axis will stretch all the way to 10...

yep, I'd probably just step through 1 hour at a time, and find the instances where it goes from below to above or above the below the line

supple leaf May 4, 2022, 10:37 AM

#

rose agate yep, I'd probably just step through 1 hour at a time, and find the instances whe...

Perfect, thank you 🙂

bold timber May 4, 2022, 10:43 AM

#

rose agate I can't say why it's different, maybe they slightly altered some functions, but ...

Of course

sullen dirge May 4, 2022, 10:44 AM

#

hi guys , sorry to interrupt

#

but like i took data analyst as my course

#

learning python currently

#

any advice what shall i learn so i can be ahead from class ?

bold timber May 4, 2022, 10:45 AM

#

rose agate I can't say why it's different, maybe they slightly altered some functions, but ...

The dataset is not too large, it only has 1460 values and 75 features. And then I only use 30 feature

odd meteor May 4, 2022, 10:45 AM

#

bold timber I always use a random state in the model, but it still gets a different score T...

To be honest, I don't see any problem with the little difference in scores here. So there's no need to be worried about the little difference. Also, you don't need to switch to different versions of sklearn. Both scores are quite close.

rose agate May 4, 2022, 10:46 AM

#

bold timber Of course

I encourage you to do it on your own but I wrote this in case you get stuck

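import matplotlib.pyplot as plt
import numpy as np

# Example data: 30 random values in [0, 10); replace with your own series
ls = np.random.rand(30)*10

line = 4  # height of the horizontal reference line
for i in range(1, len(ls)):
    # Previous value below the line, current value above it: upward crossing
    if ls[i] > line and ls[i-1] < line:
        print("Gradient positive at:", i, "crossing the line")
    # Previous value above the line, current value below it: downward crossing
    if ls[i] < line and ls[i-1] > line:
        print("Gradient negative at:", i, "crossing the line")

plt.plot(ls)
plt.axhline(y = line, color = 'r', linestyle = '--')
plt.show()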

bold timber May 4, 2022, 10:48 AM

#

odd meteor To be honest, I don't see any problem with the little difference in scores here....

I think like that, but I wonder if I use that model in competition whether can affect my leaderboard? When I use my old score, which is 92.8% I got to be in the top 4% on the leaderboard.

supple leaf May 4, 2022, 10:49 AM

#

rose agate I encourage you to do it on your own but I wrote this in case you get stuck ``` ...

Was that for me or for "noway"? Since it looks like you answered "noway" 🙂

odd meteor May 4, 2022, 10:49 AM

#

sullen dirge any advice what shall i learn so i can be ahead from class ?

Usually in school, we are given our complete curriculum and scheme of work before/on the 1st class of each semester. So you can start from there I guess. You're doing great by learning Python as well. 🦾

rose agate May 4, 2022, 10:49 AM

#

oops lol

rose agate May 4, 2022, 10:50 AM

#

rose agate I encourage you to do it on your own but I wrote this in case you get stuck ``` ...

@supple leaf

#

thanks

#

wait you replied to that

#

ignore me

sullen dirge May 4, 2022, 10:51 AM

#

odd meteor Usually in school, we are given our complete curriculum and scheme of work befor...

mmm i feel like they arent teaching what they should , in the first semester i get to learn R but it was very rushed . Plus i dont think teachers are that good so i am studying more online but appreciate the help 🙏

supple leaf May 4, 2022, 10:53 AM

#

rose agate @supple leaf

Haha im a bit confused. But it kind of looks like the code you write would be a very good direction towards my first idea with using gradients for knowing when the blue line is crossing the red line, right? I truly appreciate it!

rose agate May 4, 2022, 10:54 AM

#

supple leaf Haha im a bit confused. But it kind of looks like the code you write would be a ...

let me know if you need any part of it explained

supple leaf May 4, 2022, 10:55 AM

#

rose agate let me know if you need any part of it explained

Will do, will look at it and work on it now. Thanks once again

odd meteor May 4, 2022, 10:59 AM

#

bold timber I think like that, but I wonder if I use that model in competition whether can a...

Ohh it makes more sense now.. When it comes to hackathons and competition, even a 0.01% increase counts 😄 Since your model's performance is quite good, you can try to squeeze out more juice from your model to rank higher in the LB by using any of these ensemble tricks: Weighted Average, Blending, Stacking, Voting etc.

supple leaf May 4, 2022, 11:02 AM

#

rose agate let me know if you need any part of it explained

ls = np.random.rand(30) * 10

This line creates a random function right? I would have to find the function for my blue line?

rose agate May 4, 2022, 11:04 AM

#

supple leaf ```py ls = np.random.rand(30) * 10 ``` This line creates a random function right...

it creates a list with some random numbers just for the example, you'll need a list with the value of the blue line at every timestep
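
For a long series (say 1000 hourly values), the same above/below check can be sketched without a loop. The function name `find_crossings` and the sample values are made up; `blue_line` stands for the list of blue-line values at every timestep.

```python
import numpy as np


def find_crossings(values, threshold):
    """Return an above-threshold mask plus upward/downward crossing indices."""
    above = np.asarray(values) > threshold
    # +1 where the series goes from below to above, -1 for the reverse.
    change = np.diff(above.astype(int))
    upward = np.flatnonzero(change == 1) + 1
    downward = np.flatnonzero(change == -1) + 1
    return above, upward, downward


# Hypothetical blue-line values and a red line at 4.0.
blue_line = np.array([1.0, 3.0, 5.0, 6.0, 2.0, 1.0, 4.5])
above, upward, downward = find_crossings(blue_line, 4.0)
print(above)
print(upward, downward)
```

The reported index is the first timestep on the new side of the line, and `above` directly answers when the blue line is above the red one.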

supple leaf May 4, 2022, 11:05 AM

#

rose agate it creates a list with some random numbers just for the example, you'll need a l...

Ahhh okok!

bold timber May 4, 2022, 11:18 AM

#

odd meteor Ohh it makes more sense now.. When it comes to hackathons and competition, even ...

But still, I don't know why I got a different score with my model when I use different version of my sklearn and python version 😅

But, thank you for discussion, I've learned something new from you.

odd meteor May 4, 2022, 11:39 AM

#

bold timber But still, I don't know why I got a different score with my model when I use dif...

Sometimes these things can be tricky to figure out. I remember one time at my office where a model's performance was quite good but after the model was deployed the model performance reduced.

I recall the guys who built the model spent days trying to figure out what went wrong. I think it was on the third day that they realized it came from the operating system. Apparently, two operating systems can handle data splitting differently.

The model was built on Windows and deployed on macOS. They fixed the problem by using Docker.

On a normal day, there was no way I coulda imagined maybe the problem was from the OS. 😀

bold timber May 4, 2022, 11:55 AM

#

odd meteor Sometimes these things can be tricky to figure out. I remember one time at my of...

What is docker? can you explain to me?

#

Hi, I also wondering how to deploy machine learning model? can you give me a recommendation source to learn of deployment? @odd meteor

odd meteor May 4, 2022, 12:07 PM

#

bold timber What is docker? can you explain to me?

Docker is a tool used to bundle an application's code and its dependencies into a single package that can run independently of any computer’s environment. A software application may depend on a ton of dependencies, and those dependencies may fail to install because of differences between environments, such as the operating system or a poor environment setup. If we can isolate the software so that it is independent of the computer’s environment, the frustration of failed dependencies will be greatly reduced.
Example
A machine learning application is built in Python for classifying objects in images and videos. The engineer's goal is to make this software available for everyone to use. In reality, using the software will require you to install deep learning libraries like TensorFlow or PyTorch, additional dependencies like OpenCV and NumPy, and a ton of other packages. The engineer can package the software’s code and dependencies as a single package using Docker, making it possible for anyone to download the application as a dockerized application and use it without worrying about installing its dependencies.

Docker is mostly used in DevOps and MLOps. I'm really not into MLOps yet so my experience in using it isn't too solid at this time.

odd meteor May 4, 2022, 12:17 PM

#

bold timber Hi, I also wondering how to deploy machine learning model? can you give me a rec...

Most online ML courses don't teach MLOps or model deployment in general. However, I learnt model deployment by reading documentation and watching YouTube videos on how to deploy a model.

If you are interested in MLOps, consider checking out these websites

DeepLearning.AI

Ben

Machine Learning Engineering for Production (MLOps) Specialization

Home - Made With ML

Learn how to responsibly deliver value with ML.

lapis sequoia May 4, 2022, 12:32 PM

#

This is a very general question. I know people doing certs like the associate/professional DevOps ones in AWS who mention that it helps, that it's challenging, and that it's worth it. But is the ML Specialty certificate by AWS also worth it?

because it will make you learn some things about AWS which may or may not be needed as a data scientist. So should one think about preparing for it in free time?
I need opinions on this and will appreciate any positive or negative opinion.

link of course: https://aws.amazon.com/certification/certified-machine-learning-specialty/
Thanks!!

Amazon Web Services, Inc.

AWS Certified Machine Learning - Specialty Certification | AWS Cert...

Earning AWS Certified Machine Learning – Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS. Learn more about this certification and AWS resources that can help you prepare for your exam.

plucky vine May 4, 2022, 1:13 PM

#

import requests

r = requests.post(
    "https://api.deepai.org/api/text2img",
    data={
        'text': text,  # the prompt string
    },
    headers={'api-key': '<redacted>'}  # key redacted
)
print(r.json())

when i run it, it shows 2 values named {id} and {output_img}, but i only need the output img. how can i get it? pls help me

mortal wren May 4, 2022, 1:25 PM

#

Can someone recommend some interesting open source projects on AI and ML? Specifically projects that are still under active development

serene scaffold May 4, 2022, 1:59 PM

#

plucky vine ```py r = requests.post( "https://api.deepai.org/api/text2img", ...

you should regenerate the api key now that you've leaked it
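
On the original question, a minimal sketch of pulling one field out of the decoded response, assuming `r.json()` returns a dict with the two keys named in the question (`id` and `output_img`). The helper name `pick_output` and the sample values are hypothetical; check the printed response for the exact key name the API uses.

```python
def pick_output(payload, key="output_img"):
    """Return only the requested field from a decoded JSON response."""
    if key not in payload:
        # Show which keys are actually there, to make a wrong key name easy to spot
        raise KeyError(f"{key!r} not in response; available keys: {sorted(payload)}")
    return payload[key]


# Stand-in for r.json(); a real response will carry different values
example = {"id": "abc123", "output_img": "https://example.com/image.jpg"}
print(pick_output(example))
```

With the real request it would be `pick_output(r.json())`, or simply `r.json()["output_img"]` if the key is known to be present.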

river citrus May 4, 2022, 2:10 PM

#

Hello, i have a question.

ive never coded an ai in python, so im asking if its even possible.

i have a large number of images, and i want to quickly identify images that contain text. (simple text, made by computers. not photos of signs or smth)
is that possible?

im currently using pytesseract to extract the text from the images, this is rather slow tho.

im thinking that an ai could possibly decide that relatively fast, but as i said i have no experience so im not sure.

im happy for any reply or comment on this matter.

serene scaffold May 4, 2022, 2:10 PM

#

river citrus Hello, i have a question. ive never coded an ai in python, so im asking if its...

state-of-the-art AIs require a lot of computation power and are going to be slower than simpler AIs.

river citrus May 4, 2022, 2:12 PM

#

ive heard that its only computationally expensive to train the ai, identifying the images in the end should be fast.

is that incorrect?

serene scaffold May 4, 2022, 2:13 PM

#

river citrus ive heard that its only computationally expensive to train the ai, identifying ...

are you aware that pytesseract is an AI?

river citrus May 4, 2022, 2:13 PM

#

no

#

in that case

serene scaffold May 4, 2022, 2:14 PM

#

it's also the one that everyone uses for this task

river citrus May 4, 2022, 2:15 PM

#

i just doubt that it takes so long to detect if an image contains text

#

i understand that it can take longer to actually read the text

serene scaffold May 4, 2022, 2:15 PM

#

so you don't actually care what the text is?

river citrus May 4, 2022, 2:15 PM

#

indeed

#

i just want to know if there is text

serene scaffold May 4, 2022, 2:16 PM

#

that wasn't explicitly clear from your question. let me think

river citrus May 4, 2022, 2:16 PM

#

sorry, ill try to better formulate my questions in the future

serene scaffold May 4, 2022, 2:18 PM

#

no problem. I'm still looking into it for you.

#

@river citrus this article talks about a system for finding the bounding box over text in images, without deciding what the text is. but it will detect any text, including signs and shirts and stuff. https://pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/

PyImageSearch

Adrian Rosebrock

OpenCV Text Detection (EAST text detector) - PyImageSearch

In this tutorial you will learn how to use OpenCV to detect text in images and video, including using OpenCV's EAST text detector for natural scene text detection.

#

but just figuring out where the text is is a simpler problem

river citrus May 4, 2022, 2:21 PM

#

wow thank you very much. ive googled for some time. but i was missing the right keywords apparently

serene scaffold May 4, 2022, 2:21 PM

#

btw, I should plug our brand new #media-processing channel

river citrus May 4, 2022, 2:22 PM

#

serene scaffold btw, I should plug our brand new #media-processing channel

ill do that. 👋 thx

serene scaffold May 4, 2022, 2:22 PM

#

river citrus wow thank you very much. ive googled for some time. but i was missing the right ...

no problem!

serene scaffold May 4, 2022, 2:22 PM

#

serene scaffold btw, I should plug our brand new #media-processing channel

@devout sail where is my cookie

lucid abyss May 4, 2022, 3:08 PM

#

Is anyone confident in deep learning and neural network

#

I need help

echo lance May 4, 2022, 3:10 PM

#

i also need help...but for the basics only @lucid abyss can you help me ??

lucid abyss May 4, 2022, 3:10 PM

#

No bro I need help with it

echo lance May 4, 2022, 3:12 PM

#

do you know any good source for learning DL and NN ?..for beginner

serene scaffold May 4, 2022, 3:21 PM

#

lucid abyss Is anyone confident in deep learning and neural network

you should always ask your actual question, not if someone knows about a general topic.

lucid abyss May 4, 2022, 3:21 PM

#

I need em to tutor me

devout sail May 4, 2022, 3:27 PM

#

serene scaffold @devout sail where is my cookie

🍪

misty flint May 4, 2022, 3:36 PM

#

odd meteor Most online ML courses don't teach MLOps or model deployment in general. However...

great resources. i forgot i had the first one bookmarked but you reminded me to add this to my to-do list on my notion Praise

lapis sequoia May 4, 2022, 3:57 PM

#

Hello, recently got back into codecademy and I have completed the "Breast Cancer Classifier Project" in the machine learning projects. - Yet on my graph - the accuracy appears to be going down over time ? Am I misunderstanding something here?

#

mild dirge May 4, 2022, 5:04 PM

#

over time? @lapis sequoia

#

over epochs of training?

lucid abyss May 4, 2022, 5:11 PM

#

Hello anyone good?

old stump May 4, 2022, 5:11 PM

#

so, I dont do a lot of data and numpy stuff as I am mostly a web developer and just general linux man.

I have a question i cant make sense of the parameters to reshape.

    video = (
        np.frombuffer(
            out, np.uint8
        ).reshape(
            [-1, height, width, 3]
        )
    )

so... reshape here is making an ndarray from the dimensions of the video which is a pipe of rawvideo from ffmpeg. What is the -1 specifying here? the 3 means a 3d array?

mild dirge May 4, 2022, 5:12 PM

#

no, the 3 means 3 channel

#

so basically rgb

#

then you have the height and the width of each image

#

and the -1 is basically whatever is left

#

So if you have 50 images, then that -1 would make it 50 x height x width x 3
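
A small sketch of that, using a zero-filled buffer in place of the real ffmpeg output. The frame size and frame count are made-up numbers; the buffer is assumed to hold 3 bytes (RGB) per pixel, as in the snippet above.

```python
import numpy as np

height, width = 4, 6   # hypothetical frame size
n_frames = 50          # hypothetical number of frames in the pipe

# Stand-in for the raw bytes read from ffmpeg: 3 bytes per pixel, all zeros here
out = bytes(n_frames * height * width * 3)

# -1 tells NumPy to work out that dimension from the total number of elements
video = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])
print(video.shape)  # (50, 4, 6, 3)
```

`video[0]` is then the first frame as a height x width x 3 array, and `video[:, :, :, 0]` is the first channel of every frame.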

old stump May 4, 2022, 5:13 PM

#

i see, ty

tranquil yarrow May 4, 2022, 5:27 PM

#

lapis sequoia Hello, recently got back into codecademy and I have completed the "Breast Cancer...

It would appear you've done something wrong.

lapis sequoia May 4, 2022, 5:28 PM

#

mild dirge over time? @lapis sequoia

Yeah, after a bit of research you were right, thank-you!

mild dirge May 4, 2022, 5:29 PM

#

I was right about what?

#

You still don't know why accuracy degrades over epochs do you?

upper spindle May 4, 2022, 5:50 PM

#

what does the validation set do in plain english

#

every youtube video and google search pages just complicate it

mild dirge May 4, 2022, 5:51 PM

#

used for validating how well your model performs while you are still training

#

we use a validation set instead of the test set for this to prevent overfitting on the test set

#

Otherwise we "taint" the test set, and the final accuracy we get on the test set wouldn't be "fair" as we have used it in the process of designing our model
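
A rough sketch of what that split looks like in practice, using only index arrays. The dataset size and the 60/20/20 proportions are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100                      # hypothetical dataset size
indices = rng.permutation(n_samples)

n_test, n_val = 20, 20               # assumed 60/20/20 split

test_idx = indices[:n_test]                 # touched once, at the very end
val_idx = indices[n_test:n_test + n_val]    # checked repeatedly while tuning the model
train_idx = indices[n_test + n_val:]        # what the model actually learns from

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Model choices (architecture, learning rate, when to stop) are made by looking at scores on `val_idx`; the score on `test_idx` is only computed once those choices are final.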

upper spindle May 4, 2022, 5:52 PM

#

oh, i see

#

thats helpful thanks

upper spindle May 4, 2022, 5:53 PM

#

mild dirge Otherwise we "taint" the test set, and the final accuracy we get on the test set...

thank you

lapis sequoia May 4, 2022, 5:53 PM

#

mild dirge You still don't know why accuracy degrades over epochs do you?

my understanding of it is the learning rate is too high, the program then converges too quickly, and this leads to failure in the accuracy of the program because of noise ?

mild dirge May 4, 2022, 5:54 PM

#

It is probably overfitting

old stump May 4, 2022, 5:56 PM

#

what sorts of things can you do with a video that is represented as a numpy array?

mild dirge May 4, 2022, 5:56 PM

#

I'm currently working on a classifier for character images, but the dataset is very imbalanced, how would I split this data into train/test

old stump May 4, 2022, 5:57 PM

#

there are a lot of guides on how to do this but not any of why you would do it.

mild dirge May 4, 2022, 5:57 PM

#

mild dirge May 4, 2022, 6:00 PM

#

mild dirge

Should I just get a set amount of images per class, or a proportion?

desert oar May 4, 2022, 6:15 PM

#

mild dirge Should I just get a set amount of images per class, or a proportion?

you can use stratified sampling to take a % from each class. the thing about the smaller classes is that "20%" means something very different for a class with 300 members vs one with 10, and 2 instances in the test set might not be enough. so you might need to do something like 50/50 for the classes with fewer than 20 members, and 80/20 for the rest

#

on the other hand, you also need to be able to evaluate performance before "burning" your test set. so you also need to be careful w/ cross-validation that you have at least 1 instance of the rare class in each fold

#

you can also use class weighting in your model

#

and because these are images, you can (probably should) use data augmentation to create more instances, e.g. stretching or blurring or altering colors etc
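
A minimal sketch of the per-class split described above, in plain Python. The function name `stratified_split`, its parameters, and the toy labels are hypothetical; the 20-member threshold and the 50/50 vs 80/20 fractions are the ones suggested in the message.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, small_class_size=20,
                     small_test_fraction=0.5, seed=0):
    """Split sample indices per class, giving rare classes a larger test share."""
    rng = random.Random(seed)

    # Group sample indices by their class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Rare classes get the 50/50 split, the rest get 80/20
        frac = small_test_fraction if len(idxs) < small_class_size else test_fraction
        # Keep at least one test sample, unless the class has a single member
        n_test = max(1, round(len(idxs) * frac)) if len(idxs) > 1 else 0
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


labels = ["a"] * 300 + ["b"] * 10   # one large class, one rare class
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] == "a" for i in test_idx))  # 60
print(sum(labels[i] == "b" for i in test_idx))  # 5
```

scikit-learn's `train_test_split(..., stratify=y)` covers the plain proportional case; the hand-rolled version is only there to show the different fraction for small classes.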

mild dirge May 4, 2022, 6:17 PM

#

desert oar on the other hand, you also need to be able to evaluate performance _before_ "bu...

Yeah was planning to do that, this is a good point too

desert oar May 4, 2022, 6:17 PM

#

hopefully augmentation gives you enough extra instances that you don't need to care so much

upper spindle May 4, 2022, 6:20 PM

#

are nodes and neurons the same thing in an lstm model

woven coral May 4, 2022, 6:21 PM

#

upper spindle May 4, 2022, 6:21 PM

#

also what is the main purpose of an lstm layer

woven coral May 4, 2022, 6:21 PM

#

any solution??

upper spindle May 4, 2022, 6:21 PM

#

google and youtube are not doing me any favours

woven coral May 4, 2022, 6:21 PM

#

how to solve this problem??

serene scaffold May 4, 2022, 6:39 PM

#

woven coral

I don't help when the question is presented only as a screenshot; it may be that other people reading the channel feel similarly.

serene scaffold May 4, 2022, 6:41 PM

#

upper spindle also what is the main purpose of an lstm layer

you can use LSTMs for sequences of data, like text, audio, or video. Whereas traditional neural layers mostly deal with data points that have a fixed number of features (like images and tabular data)

upper spindle May 4, 2022, 7:07 PM

#

serene scaffold you can use LSTMs for sequences of data, like text, audio, or video. Whereas tra...

thanks

balmy burrow May 4, 2022, 7:33 PM

#

Can anyone explain what is "dot" here? I know x_b.T returns transpose of matrix but cant figure out what .dot does.

#

and yes my linear algebra and math sucks, sorry for dumb question.

tidal bough May 4, 2022, 7:42 PM

#

.dot for matrices is matrix multiplication.

#

I prefer using the @ operator, myself, but some people use .dot for some reason.

balmy burrow May 4, 2022, 7:43 PM

#

tidal bough `.dot` for matrices is matrix multiplication.

thank you!

desert oar May 4, 2022, 7:43 PM

#

tidal bough I prefer using the `@` operator, myself, but some people use `.dot` for some rea...

.dot predates @, so habit (and backward-compatibility, until recently)
i was morally opposed to adding @/__matmul__ to python, even if it directly benefits my work 😛

tidal bough May 4, 2022, 7:45 PM

#

elaborate on 2? 🥴

balmy burrow May 4, 2022, 7:46 PM

#

I dont even know if I should go beyond chapter 3 in hands-on machine learning,seeing all you people and everyone knowing maths on python or math irl really well lol

upper spindle May 4, 2022, 7:49 PM

#

last question, what does units mean in the lstm layer

#

and what does the dense layer do

#

thanks in advance

iron basalt May 4, 2022, 8:05 PM

#

tidal bough I prefer using the `@` operator, myself, but some people use `.dot` for some rea...

IMO, dot is still better, @ adds yet another thing someone needs to learn / memorize. Anybody can see dot and understand what that means (without really knowing Python).

#

@ is also not used in mathematics for this AFAIK (I know there are some very strange notations out there).

tidal bough May 4, 2022, 8:12 PM

#

There's two reasons I dislike .dot:

It has a dual meaning depending on the shapes involved: matrix multiplication and vector dot product - and the name seems to imply the latter. IMO that's at least as confusing as a new operator.
As it's a function call, it makes for tons of nested parentheses. Compare:

x_b.T.dot(x_b.dot(theta)-y)
x_b.T @ (x_b@theta - y)
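
A quick check that the two spellings compute the same thing, plus the "dual meaning" case for 1-D vectors. The shapes and random values are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_b = rng.normal(size=(5, 3))    # 5 samples, 3 features (arbitrary)
theta = rng.normal(size=(3, 1))
y = rng.normal(size=(5, 1))

grad_dot = x_b.T.dot(x_b.dot(theta) - y)
grad_at = x_b.T @ (x_b @ theta - y)
print(np.allclose(grad_dot, grad_at))  # True

# For 1-D arrays, both forms give the scalar vector dot product
v = np.array([1.0, 2.0, 3.0])
print(v.dot(v), v @ v)  # 14.0 14.0
```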

iron basalt May 4, 2022, 8:21 PM

#

Well, dot product can be between vectors, or matrices, etc, so the naming is correct. A new operator unlike "dot" has no meaning at all. So while dot may apply to too many things, @ applies to nothing (it's not a thing) (in terms of the meaning of the word / symbol). The extra parenthesis is not ideal. Ideally it would have been something like AB, but that is just parsed as one token by Python (an identifier), and . is already the access operator (I don't know if this can be overloaded)

#

For matrices specifically I also use matmul instead of @, making it very obvious.

#

It's not a super big deal, they just need to quickly look up @, but I like to have as little of those type of language specific lookup moments as possible (or lookups in general (so I don't use crazy complex functions (e.g. einsum) either if I can help it)).

#

(The less lookups is also why type hints are great, because then if I don't know what exact type the function returns, but I understand the meaning of its name, I don't have to look it up if the code has a type hint)

#

(Basically, not being rude to the reader / having them do as little digging as possible, and that includes different kinds of readers, such as non Python programmers, or those that know some Python, but are not really programmers (and would probably not know more obscure operators like @))

odd meteor May 4, 2022, 8:45 PM

#

balmy burrow I dont even know if I should go beyond chapter 3 in hands-on machine learning,se...

Please remain steadfast. There's no such thing as "not a math person". You'll definitely get better at it with time 😀

tidal bough May 4, 2022, 8:47 PM

#

looks like there's only two minor differences I suppose:
https://numpy.org/doc/stable/reference/generated/numpy.matmul.html#numpy.matmul

serene scaffold May 4, 2022, 8:47 PM

#

balmy burrow I dont even know if I should go beyond chapter 3 in hands-on machine learning,se...

I agree with Emyrs. All these things sound complicated until you understand them. If you maintain a positive attitude about learning, it will come soon enough 😄

odd meteor May 4, 2022, 8:51 PM

#

upper spindle and what does the dense layer do

A dense layer is simply a fully connected layer (i.e a layer where all the neurons are connected to every neuron of its preceding layer) in NN.
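
In code, a dense layer without its activation function is just a matrix multiplication plus a bias. A minimal NumPy sketch with made-up sizes and weights (the `dense` helper is illustrative, not a library function):

```python
import numpy as np


def dense(x, W, b):
    """Fully connected layer: every input feeds every output unit."""
    return x @ W + b


x = np.array([[1.0, 2.0, 3.0]])   # 1 sample with 3 input features
W = np.ones((3, 2))               # 3 inputs connected to 2 output units
b = np.array([0.5, -0.5])         # one bias per output unit

print(dense(x, W, b))  # [[6.5 5.5]]
```

The number of columns in `W` plays the role of the `units` argument in libraries like Keras.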

desert oar May 5, 2022, 2:11 AM

#

tidal bough elaborate on 2? 🥴

matrix multiplication is specific to one very specific problem domain, while literally nothing else in the python standard library is oriented towards that problem domain. and it isn't even supported in the stdlib anyway

serene scaffold May 5, 2022, 2:47 AM

#

desert oar matrix multiplication is specific to one very specific problem domain, while lit...

I tend to agree. and then they won't even imbue that operator with function composition 😠

misty flint May 5, 2022, 3:00 AM

#

"Every Superman has his kryptonite, but as a Data Scientist, coding can't be yours."

#

i like this quote

#

kekHands

serene scaffold May 5, 2022, 3:07 AM

#

misty flint > "Every Superman has his kryptonite, but as a Data Scientist, coding can't be y...

shame it's the kryptonite of seemingly every PhD student.

misty flint May 5, 2022, 3:08 AM

#

serene scaffold shame it's the kryptonite of seemingly every PhD student.

omg it seems like its a very common kryptonite for them

serene scaffold May 5, 2022, 3:08 AM

#

except they also don't know it

misty flint May 5, 2022, 3:08 AM

#

unless they are a phd student that has done frequent internships as a software person

#

which isnt as common

serene scaffold May 5, 2022, 3:09 AM

#

though one of the talks at pycon, the speaker made the point that the reward model for academia only involves publishing, and the code is just a means to that singular end. so we can't really blame the authors of shitty research code for what they have unleashed on the world.

misty flint May 5, 2022, 3:13 AM

#

oof

#

the incentives of academia

#

kekHands

#

oh this reminds me

#

i have started to see uhh...how to explain this

#

postdocs for phd peeps to transition to industry better

#

created by tech companies themselves

#

let me see if i can find it to show you

misty flint May 5, 2022, 3:16 AM

#

serene scaffold though one of the talks at pycon, the speaker made the point that the reward mod...

https://www.amazon.science/postdoctoral-science-program

Amazon Science

Amazon launches new Postdoctoral Science Program

The program offers recent PhD graduates an opportunity to advance research while working alongside experienced scientists with backgrounds in industry and academia.

#

kekHands

#

good option for phd students i guess

serene scaffold May 5, 2022, 3:17 AM

#

misty flint https://www.amazon.science/postdoctoral-science-program

I heard that working at amazon is low-key cancer, even as a dev

misty flint May 5, 2022, 3:17 AM

#

~~i have also heard the same~~

#

RunFail

#

i believe they are one of the worst ones at contributing back to open source as well

#

out of big tech

serene scaffold May 5, 2022, 3:18 AM

#

I didn't think they do any amount of open source

misty flint May 5, 2022, 3:19 AM

#

kekHands

serene scaffold May 5, 2022, 3:19 AM

#

my company does open source occasionally. some of my coworkers open sourced something, and I would have liked to refactor it...

misty flint May 5, 2022, 3:20 AM

#

kekHands

#

you can always make a pull request

#

but its good that your company gives back. i think any company that has the space for it should do so

#

even if its features that your company would use, since chances are others might need it as well

worldly dawn May 5, 2022, 3:23 AM

#

misty flint i believe they are one of the worst ones at contributing back to open source as ...

They do contribute a lot, but you don't hear that much about it.

The worst one is apple

misty flint May 5, 2022, 3:24 AM

#

oof kekHands

#

i can see that

#

apple being apple

worldly dawn May 5, 2022, 3:24 AM

#

I have heard it's so bad that it has created some recruitment issues since their secrecy goes against the common practices

misty flint May 5, 2022, 3:24 AM

#

Pika

#

i have heard part of that on a podcast once

#

mostly about employees that have left

#

and how hard it was for them afterwards

#

ah this was why: https://www.washingtonpost.com/technology/2022/02/10/apple-associate/

#

Every employee who leaves Apple becomes an ‘associate’
In job databases used by employers to verify resume information, every former Apple employee’s title gets erased and replaced with a generic title

#

imagine being in charge of a team of data scientists, data engineers, and ML engineers

#

but then you only become an "associate" if you leave

#

kekHands

iron basalt May 5, 2022, 3:36 AM

#

serene scaffold shame it's the kryptonite of seemingly every PhD student.

https://www.youtube.com/watch?v=YnL9vAFphmE

YouTube

Programmers are also human

Interview with a Postdoc, Junior Python Developer in 2022

Python programming language

Programmer humor
Python humor
Programming jokes
Programming memes
Python 2022
Python memes
python jokes
Keras
Tensorflow
Data science
Data Science humor
Pandas Pandas Pandas
async with
OpenCV
GANs
Scikit-l...

misty flint May 5, 2022, 3:41 AM

#

"geared towards children and phds. children and phds"

#

💀

#

im dead

iron basalt May 5, 2022, 3:44 AM

#

About the same amount of experience programming.

#

(Except many children have more free time to learn it and end up being better)

#

(Although they lack the math)

misty flint May 5, 2022, 3:46 AM

#

wildly entertaining

#

havent laughed like that in a while

serene scaffold May 5, 2022, 3:46 AM

#

iron basalt https://www.youtube.com/watch?v=YnL9vAFphmE

> C++ is bad because you get errors before you can even run the code
and
> I just learned that on Medium, uhh, an hour ago
these are the only two parts of the video that I found clever. the rest is just him saying random phrases and trying to pass that off as a punchline.

also

> it's easy because it's just one page. where's the async with?
sounds like a joke from a d.py video

misty flint May 5, 2022, 3:46 AM

#

💀

#

im gonna share this with my classmates if you dont mind kekHands

iron basalt May 5, 2022, 3:47 AM

#

serene scaffold > C++ is bad because you get errors before you can even run the code and > I jus...

Yeah it does have the editing of early Youtube with many jump cuts and repeated phrases.

misty flint May 5, 2022, 3:47 AM

#

i like the repeated phrases since i interpreted it as him making fun of the buzzwords in this space

#

and the hype that they create

#

kekHands

worldly dawn May 5, 2022, 3:48 AM

#

and the mug

serene scaffold May 5, 2022, 3:48 AM

#

iron basalt Yeah it does have the editing of early Youtube with many jump cuts and repeated ...

I don't object to that. if the punchlines were good, that might add to it. "lists are arrays. tuple unpacking." saying a bunch of phrases is not a punchline.

misty flint May 5, 2022, 3:48 AM

#

im going to watch it again

#

kekHands

worldly dawn May 5, 2022, 3:48 AM

#

each video has the mug with the logo taped on it

iron basalt May 5, 2022, 3:49 AM

#

serene scaffold I don't object to that. if the punchlines were good, that might add to it. "list...

I agree.

misty flint May 5, 2022, 3:50 AM

#

worldly dawn each video has the mug with the logo taped on it

ahhh i just realized this kekHands

#

it starts unfolding midway

lapis sequoia May 5, 2022, 3:51 AM

#

what does the core analyst engineer do? I got its job in hand and i got no idea.(sorry if this feels like off topic, if it is not related to this channel i can move on to offtopic.)

bold timber May 5, 2022, 6:24 AM

#

odd meteor Most online ML courses don't teach MLOps or model deployment in general. However...

it's free?

celest vine May 5, 2022, 8:02 AM

#

Hi, can anyone help me with a problem statement?

fresh moss May 5, 2022, 8:06 AM

#

Hello, does anyone have experience in scraping consumer reviews? I have difficulty scraping reviews of many products without changing the product links one by one.

celest vine May 5, 2022, 8:07 AM

#

So, Its a classification problem.
I have the following Twitter data.
profileURL
screenName
name
Bio
Location
Created at
FollowersCount
FollowingCount
TweetsCount
Certified - boolean(yes/no)
Replied - (yes/no)

So, the Replied column is my output, and I want to predict if a Twitter account will reply to my dm or not.

Any suggestions?

rose agate May 5, 2022, 8:53 AM

#

celest vine So, Its a classification problem. I have the following Twitter data. profileUR...

There should be a lot of resources online for label classification, following a tutorial should be a good way to learn. sklearn will probably be the easiest package to use, and you can change which model to use pretty simply.

odd meteor May 5, 2022, 9:06 AM

#

bold timber it's free?

The first one is free, however the second one from deeplearning.ai requires a financial commitment of about $46 per month.

bold timber May 5, 2022, 9:48 AM

#

odd meteor The first one is free, however the second one from deeplearning.ai requires a fi...

Can you give me the source that free?😅

odd meteor May 5, 2022, 9:48 AM

#

lapis sequoia what does the core analyst engineer do? I got its job in hand and i got no idea....

The job description will give you a well-rounded overview on what the role entails. In some companies, it could very well mean a BI analyst or Data Analyst.

bold timber May 5, 2022, 9:48 AM

#

some people say we can use streamlit. Where I can learn of streamlit?

celest vine May 5, 2022, 9:49 AM

#

rose agate There should be a lot of resources online for label classification, following a ...

Okay

odd meteor May 5, 2022, 9:49 AM

#

odd meteor Most online ML courses don't teach MLOps or model deployment in general. However...

@bold timber the first one on this list

odd meteor May 5, 2022, 9:51 AM

#

bold timber some people say we can use streamlit. Where I can learn of streamlit?

Check their documentation and use YouTube as well. That's how I learned it

bold timber May 5, 2022, 9:51 AM

#

odd meteor @bold timber the first one on this list

Whether the first one of that course completed to deployment?

odd meteor May 5, 2022, 9:52 AM

#

bold timber Whether the first one of that course completed to deployment?

It goes beyond model deployment. It covers the entirety of MLOps. It appears you're only interested about model deployment 😁

bold timber May 5, 2022, 9:55 AM

#

odd meteor It goes beyond model deployment. It covers the entirety of MLOps. It appears you...

yeah, because I'm such a beginner in ML. then, some company asked me to deploy my model and I don't understand how to do that 😂 🥲

#

but, thank you for the information.

odd meteor May 5, 2022, 10:01 AM

#

fresh moss Hello, does anyone have experience in scraping consumer reviews? I have difficul...

You can first write code to grab the product pages you're interested in scraping. It might be a good idea to save the grabbed pages as HTML files on your PC to avoid sending multiple requests (which could lead to your IP getting banned) 😀

ocean swallow May 5, 2022, 11:29 AM

#

I am trying to extract the most used word groups of n length in a text

#

Something like a word tree but n length.

#

So like word group tree maybe lol
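
Word groups of length n are usually called n-grams. A minimal sketch of counting the most frequent ones with the standard library; the function name `top_ngrams`, the lowercase/whitespace tokenization, and the sample sentence are assumptions for the example.

```python
from collections import Counter


def top_ngrams(text, n, k=3):
    """Return the k most common groups of n consecutive words."""
    words = text.lower().split()
    # zip over n shifted copies of the word list to get sliding windows
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)


sample = "the cat sat on the mat and the cat sat down"
print(top_ngrams(sample, 2, k=2))
```

Real text would need punctuation stripped first; `str.split()` only breaks on whitespace.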

fresh moss May 5, 2022, 11:31 AM

#

odd meteor You can first write code to grab the product pages you're interested in scraping...

If so, do you have any suggestions for references? So far I haven't found a tutorial for a similar case to mine.... The tutorial I found was just scraping from the same page.

odd meteor May 5, 2022, 12:08 PM

#

fresh moss If so, do you have any suggestions for references? So far I haven't found a tuto...

Once I get back on my pc, I'll try to check if I can find a specific reference resource that fits your case.
However, for the time being, you can try to restructure my example to fit your scenario (depending on how the website you're scraping is structured.)

Presuming you're working with a website with this kind of url structure
https://www.basketball-reference.com/awards/awards_2021.html and you'd like to scrape data for the years 2018 to 2021.

You could write your code to grab those pages like this:

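import os

import requests
from bs4 import BeautifulSoup  # not used in this snippet; needed when parsing the saved pages

years = list(range(2018, 2022))
url_start = 'https://www.basketball-reference.com/awards/awards_{}.html'

# Make sure the output folder exists before writing to it.
os.makedirs('data_folder', exist_ok=True)

for year in years:
    url = url_start.format(year)
    data = requests.get(url)

    # Save the raw HTML so it can be parsed later without requesting the page again.
    with open('data_folder/{}.html'.format(year), 'w+', encoding='utf-8') as f:
        f.write(data.text)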

This code will pretty much write the scraped HTML page of each year into data_folder. You can try to run the code on your pc to perhaps understand how to structure yours.

Basketball-Reference.com

2020-21 NBA Awards Voting | Basketball-Reference.com

2020-21 NBA Awards Voting Summary

#

Once you've successfully grabbed all the product pages you're interested in, you can then write a function/ for loop that will enable you to easily scrap information from each product page without changing product url every time. This way, you can also reduce the number of request you send to the website. If you send too much request some website will ban your IP permanently or temporarily.

You could also call time.sleep() between requests to slow your scraper down and lower the risk of getting blocked, if you prefer that approach.
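A minimal sketch of the pause-between-requests idea, assuming a fixed delay is acceptable for the site. `fetch_pages` and its `fetch` parameter are hypothetical names for illustration; in practice `fetch` could be something like `lambda url: requests.get(url).text`.

```python
import time


def fetch_pages(urls, fetch, delay_seconds=2.0):
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

A delay only makes the scraper gentler; it does not guarantee the site won't block it, so checking the site's terms and robots.txt is still worthwhile.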

arctic cliff May 5, 2022, 12:47 PM

#

Why do we shuffle the data before training?

#

also do we shuffle the samples along with their targets?

fresh moss May 5, 2022, 12:49 PM

#

odd meteor Once I get back on my pc, I'll try to check if I can find a specific reference r...

Woahh thank you so much for the insight! When I tried your suggestion I found a source. I want to make sure this source matches what you said earlier. This is the link: https://www.geeksforgeeks.org/how-to-scrape-multiple-pages-of-a-website-using-python/

tidal bough May 5, 2022, 12:51 PM

#

arctic cliff also do we shuffle the samples along with their targets?

Yes, otherwise you ruin your data.

tidal bough May 5, 2022, 12:52 PM

#

arctic cliff Why do we shuffle the data before training?

I think it's mostly that the easiest way to train on randomly picked samples is to shuffle the entire dataset once, and then you can just pick consecutive slices without randomly drawing many times.
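A small sketch of "shuffle once, then slice", assuming the data fits in NumPy arrays. The arrays, seed and batch size are made up for illustration. The key point is that one permutation is applied to both samples and targets, so each row keeps its own label.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the example repeatable

X = np.arange(10).reshape(5, 2)  # 5 samples, 2 features each
y = np.array([0, 1, 2, 3, 4])    # target i belongs to row i of X

perm = rng.permutation(len(X))   # one random ordering of the row indices
X_shuffled, y_shuffled = X[perm], y[perm]  # same ordering applied to both

batch_size = 2
batches = [
    (X_shuffled[i:i + batch_size], y_shuffled[i:i + batch_size])
    for i in range(0, len(X), batch_size)
]
```

Shuffling `X` and `y` with two independent calls would break the pairing, which is the "ruin your data" case mentioned above.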

arctic cliff May 5, 2022, 12:53 PM

#

That's reasonable. Thanks!

#

Just to make sure I get this straight: by passing mini-batch arguments, is my data sliced randomly, without having to do np.random.shuffle(data)?

#

@tidal bough Sorry for the ping in case you mind

tidal bough May 5, 2022, 12:56 PM

#

Check the docs for whatever method of whatever library you're using for minibatching

arctic cliff May 5, 2022, 12:56 PM

#

Thanks a lot

gusty anvil May 5, 2022, 1:43 PM

#

HOWDY >> I have a question 🙂
I have data with this layout in power bi and I want to forecast with machine learning in order to receive the same data in those columns for following days. Any advice for noobs ?

#

somber prism May 5, 2022, 1:44 PM

#

guys i have one doubt: since lstm or any other sequence model can work well on nlp problems and word2vec can capture the semantics, does that mean we don't need to do preprocessing like stemming, removing stop words etc.?

odd meteor May 5, 2022, 1:46 PM

#

fresh moss Woahh thank you so much for the insight! when i tried your suggestion i found a ...

Yes that's a good resource with more detailed examples on how to get the work done.

faint isle May 5, 2022, 2:00 PM

#

Hi guys anyone can help me with sentiment analysis using logistic regression and random forest? 🙂

serene scaffold May 5, 2022, 2:10 PM

#

faint isle Hi guys anyone can help me with sentiment analysis using logistic regression and...

try being more specific

faint isle May 5, 2022, 2:12 PM

#

serene scaffold try being more specific

balmy burrow May 5, 2022, 2:34 PM

#

is there any channel I can ask machine learning questions specifically?

#

for scikit or smth

#

never mind found ultimatum of problem solving

odd meteor May 5, 2022, 2:40 PM

#

balmy burrow is there any channel I can ask machine learning questions specifically?

This is the right channel to ask your ML questions

odd meteor May 5, 2022, 2:47 PM

#

faint isle

If you're familiar with sentiment analysis: after you're done with the whole cleaning and text pre-processing stage, you're expected to use a logistic regression and/or random forest model to do the sentiment analysis instead of TextBlob. If you're not too familiar with sentiment analysis, you could look online for some examples (check kaggle.com). I'm sure you'll find a good reference notebook there.

balmy burrow May 5, 2022, 2:48 PM

#

can anyone help why .reshape doesnt work here?

shell pasture May 5, 2022, 2:53 PM

#

NLP Contract Processing - Hi guys, I have a question on NLP in terms of processing contracts. When building a model would the contracts all have to be somewhat the same for the model to work or would the model be able to work on different kinds of contracts. If I’m asking a vague question apologies

odd meteor May 5, 2022, 3:01 PM

#

somber prism guys i have one doubt, since lstm or any othere sequence model can work well on ...

Performing basic pre-processing steps is still very important before we get to the model building part. Word2Vec is just a word embedding, so we still need to feed clean text to it. Depending on the task, though, some pre-processing steps might not be strictly necessary: https://aclanthology.org/W19-6203/

ACL Anthology

To Lemmatize or Not to Lemmatize: How Word Normalisation Affects EL...

Andrey Kutuzov, Elizaveta Kuzmenko. Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing. 2019.

somber prism May 5, 2022, 3:04 PM

#

i see

odd meteor May 5, 2022, 3:04 PM

#

For example, when it comes to ABSA, depending on the approach you're using, removing stop words before doing cross-lingual coreference and applying POS tag isn't a good practice.

somber prism May 5, 2022, 3:04 PM

#

thank you

#

so basically, if i am planning on making a model that predicts the next word based on the previous sentence, then i should avoid removing the stop words right ?

spare briar May 5, 2022, 3:14 PM

#

bold timber Hi, I also wondering how to deploy machine learning model? can you give me a rec...

Some great resources here as well https://huyenchip.com/mlops/

MLOps - tools, best practices, and case studies

Chip Huyen is a writer and computer scientist. She works to bring the best engineering practices to machine learning production.

odd meteor May 5, 2022, 3:14 PM

#

somber prism so basically, if i am planning on making a model that predicts the next word bas...

When it comes to predicting the next word in a sentence, I'd say don't remove stopwords. However, you can try both and compare which yields the more meaningful result (with or without stopwords).

spare briar May 5, 2022, 3:15 PM

#

(plus she has a ml ops focused discord server)

misty flint May 5, 2022, 3:51 PM

#

spare briar (plus she has a ml ops focused discord server)

i find great resources are there as well

#

also

#

i just saw a video with all my favorite data influencers in one place

#

but the video got privated

#

secret secrets are secret

#

ID_blurryeyes

#

what are they doing?

#

where are they going?

#

something is happening but i do not know what

#

some of the peeps i saw: ken jee, tina huang, shashank kalanithi, mikiko bazely, luke barousse, ben rogojan, and others ID_blurryeyes

misty flint May 5, 2022, 4:05 PM

#

spare briar Some great resources here as well https://huyenchip.com/mlops/

oh hey i just saw this too btw

#

also

#

i remember seeing chip huyen post her slides for her ML Systems course online

#

free for anyone to study

#

that course is more focused towards ML deployment and ML case studies

#

pretty nifty

spare briar May 5, 2022, 4:11 PM

#

yes her content + the book designing data intensive applications were helpful for me when asked ops/productionization questions in interviews

misty flint May 5, 2022, 4:11 PM

#

oh nice ive heard great things about that book as well

#

PikaThink

#

iirc, she writes a decent amount on real-time ML inference as well

spare briar May 5, 2022, 4:14 PM

#

yes although i have some issues with that

#

i did like her content on bandits

misty flint May 5, 2022, 4:14 PM

#

was your issue that sometimes real-time is not the way to go?

#

that you can get away with mini-batch sometimes?

#

PikaThink

spare briar May 5, 2022, 4:15 PM

#

nope more with methods

misty flint May 5, 2022, 4:15 PM

#

ah i see

spare briar May 5, 2022, 4:15 PM

#

i am very into continual learning these days

#

we have a system that does real time classification with user feedback

misty flint May 5, 2022, 4:15 PM

#

thats good

#

i believe real time can provide a lot of value

misty flint May 5, 2022, 4:55 PM

#

for those into computer vision https://www.xview2.org/

#

Recognizing an opportunity to solve a key analytical bottleneck, the Defense Innovation Unit, together with other Humanitarian Assistance and Disaster Recovery (HADR) organizations, is releasing a new labeled, high-resolution satellite dataset and a challenge to the computer vision community.
Help us automate damage assessment to accelerate recovery from natural disasters and win prizes ($150,000 of cash awards)!

glossy zenith May 5, 2022, 6:43 PM

#

Is someone of you familiar with PyTorch and vgg16?

My model is having a hard time predicting the images correctly from a dataset (I guess around 3200 images per label), with SFW as 0 and NSFW as 1.

At epoch 50/100 it recognizes NSFW pretty often, but I haven't seen it predict SFW once...

radiant frigate May 5, 2022, 6:47 PM

#

Hello, this may be a weird request but is it possible to find anywhere a benchmark of computing a Fourier series in languages like Python, Julia and Matlab?

tidal bough May 5, 2022, 7:05 PM

#

Hmm, quick googling doesn't show me anyone doing a comparison like that. You'd need to make a benchmark yourself.

#

Note though that speed of different libraries in one language can vary a lot. There's this benchmark which compares various C/C++ FFT libraries and also compares a few popular Python ones:
https://github.com/project-gemmi/benchmarking-fft

#

Note how installing mkl-fft accelerates numpy's fft by 16 times. (At least, on the old python version this benchmark was done on)

upper spindle May 5, 2022, 7:27 PM

#

does anyone know how to plot two graphs together, so i can see these two on the same graph but with different colours

#

the df's are there

#

I want to plot log returns and sentiment on the same graphs, thanks in advance

serene scaffold May 5, 2022, 7:32 PM

#

upper spindle does anyone know how to plot two graphs together so i can these two on graphs on...

you can add Standard Deviation as a column on the "main" dataframe

#

what code did you use to make those two plots?

#

!docs pandas.DataFrame.plot.line

arctic wedgeBOT May 5, 2022, 7:32 PM

#

pandas.DataFrame.plot.line


```py
DataFrame.plot.line(x=None, y=None, **kwargs)
```
Plot Series or DataFrame as lines.

This function is useful to plot lines using DataFrame’s values
as coordinates.

radiant frigate May 5, 2022, 7:35 PM

#

tidal bough Hmm, quick googling doesn't show me anyone doing a comparison like that. You'd n...

Thank you for helping me out :)

upper spindle May 5, 2022, 7:36 PM

#

serene scaffold what code did you use to make those two plots?

i used df['Log Returns'].plot(figsize=(10,5), ylabel='Log Returns')

#

and df.plot(figsize=(10,5), xlabel='Date', ylabel='Std Deviation of Sentiment')

serene scaffold May 5, 2022, 7:38 PM

#

upper spindle i used `df['Log Returns'].plot(figsize=(10,5), ylabel='Log Returns')`

there are examples in the docs of plotting two lines in different colors from the same DF
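A hedged sketch of one way to get both series on one figure, assuming they share a date index and live in one DataFrame (the "Sentiment Std" column name and all values are made up). Because log returns and a standard deviation are on different scales, this uses a second y-axis via `twinx()` rather than one shared axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2022-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "Log Returns": [0.01, -0.02, 0.015, 0.0, -0.005],
        "Sentiment Std": [0.30, 0.45, 0.25, 0.40, 0.35],
    },
    index=dates,
)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(df.index, df["Log Returns"], color="tab:blue")
ax.set_xlabel("Date")
ax.set_ylabel("Log Returns")

ax2 = ax.twinx()  # second y-axis sharing the same x-axis
ax2.plot(df.index, df["Sentiment Std"], color="tab:orange")
ax2.set_ylabel("Std Deviation of Sentiment")
```

If both columns were on a similar scale, a plain `df.plot()` would draw them on one axis with different colours automatically.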

upper spindle May 5, 2022, 7:53 PM

#

serene scaffold there are examples in the docs of plotting two lines in different colors from th...

ooh okay, thanks, ill have a look

lapis sequoia May 5, 2022, 9:54 PM

#

idk where to ask this but how can i print something at the same position even if it has less characters
for example

bla bla     TEST
bla         TEST

serene scaffold May 5, 2022, 10:09 PM

#

lapis sequoia idk where to ask this but how can i print something at the same position even if...

take a look at #❓｜how-to-get-help. also look into f string formatting
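For the f-string route, a minimal sketch: pad the first column to a fixed width so the second column always starts at the same position. The width of 12 is an assumption chosen to match the example in the question.

```python
rows = ["bla bla", "bla"]
width = 12  # assumed column width; should be at least as long as the longest label

# "<" left-aligns the label and pads it with spaces up to `width` characters
lines = [f"{label:<{width}}TEST" for label in rows]
for line in lines:
    print(line)
```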

lapis sequoia May 5, 2022, 10:10 PM

#

okay thanks

upper spindle May 6, 2022, 12:11 AM

#

can anyone create a function for a univariate lstm to test on the validation set

#

i created a multivariate function for forecasting on validation set

#

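```py
def forecast_multi(model, range_index):
    # dt (datetime), pd, n_past, input_df, df and windowed_dataset are defined elsewhere
    start_index = range_index[0] - dt.timedelta(n_past - 1)
    end_index = range_index[-1]
    mat_X, _ = windowed_dataset(input_df[start_index:end_index],
                                df['Future Vol'][range_index], n_past)
    preds = pd.Series(model.predict(mat_X)[:, 0],
                      index=range_index)

    return preds
```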

#

testY_preds = forecast_multi(lstm_multi, test_index) which gives me the output

#

but im struggling to write a code for forecasting on a univariate

ocean swallow May 6, 2022, 12:27 AM

#

Anyone tried to predict views we get if we just

#

published a video on youtube, given the title, channel sub and date etc

#

?

plush jungle May 6, 2022, 2:05 AM

#

can someone explain the differences between decision trees, support vector machines, and neural nets?

#

like, they can all solve complex, non-linearly separable problems by analyzing training data and then classifying new data, but I feel like I have a much stronger grasp of neural nets and why they work than the other two

serene scaffold May 6, 2022, 2:31 AM

#

@plush jungle a decision tree algorithm takes the training data and tries to construct a series of yes/no decisions that can make a decision about each instance. does that make sense?

plush jungle May 6, 2022, 2:33 AM

#

serene scaffold @plush jungle a decision tree algorithm takes the training data and trie...

so while a neural network's neuron takes a matrix and spits out a vector, a decision tree's branch takes a vector and spits out a scalar?

serene scaffold May 6, 2022, 2:34 AM

#

plush jungle so while a neural network's neuron takes a matrix and spits out a vector, a deci...

neural networks don't necessarily take only a matrix.

plush jungle May 6, 2022, 2:35 AM

#

you mean because they can also take a vector?

serene scaffold May 6, 2022, 2:35 AM

#

there are a lot of neural architectures

plush jungle May 6, 2022, 2:37 AM

#

ok but I'm familiar with neural nets mainly as applied to optical character recognition, and my friend told me that any data you can train a neural net on you can also construct a decision tree for

#

so what would it look like

#

if you made a decision tree where each example in the training data was a vector of pixels

serene scaffold May 6, 2022, 2:38 AM

#

plush jungle ok but I'm familiar with neural nets mainly as applied to optical character reco...

you can always train a decision tree; whether or not it will be good is another matter.

#

images are usually 2d arrays if they're greyscale images or 3d arrays if they have RGB values.

plush jungle May 6, 2022, 2:40 AM

#

right, so the first x number of branches of the decision tree would get passed one of these 2d arrays

#

and they would return a boolean value?

ocean swallow May 6, 2022, 2:43 AM

#

if the problem is optical character recognition, in the simplest case it would be passed a flattened vector of the 2d image, and you would branch on individual pixels: is the 1st pixel black or white, yes/no, etc.

serene scaffold May 6, 2022, 2:44 AM

#

plush jungle and they would return a boolean value?

each decision is yes/no, but the final result of traversing the decision tree is a label for a given instance
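To make "yes/no decisions ending in a label" concrete, here is a toy hand-written tree over a flattened 4-pixel image. The tree, the labels and the pixel encoding (1 = dark, 0 = light) are invented for illustration; a real tree would be learned from training data by a library such as scikit-learn.

```python
# Internal nodes are (pixel_index, subtree_if_dark, subtree_if_light);
# leaves are plain label strings.
tree = (
    0,
    (3, "label A", "label B"),  # pixel 0 is dark: look at pixel 3 next
    "label C",                  # pixel 0 is light: decide immediately
)


def classify(node, pixels):
    """Follow yes/no pixel tests until a leaf label is reached."""
    while not isinstance(node, str):
        pixel_index, if_dark, if_light = node
        node = if_dark if pixels[pixel_index] == 1 else if_light
    return node
```

Note that this tree only ever tests pixels 0 and 3. A tree does not need a split for every pixel, only for the ones that help separate the training examples.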

plush jungle May 6, 2022, 2:45 AM

#

ocean swallow if the problem is optical character recognition it would elementarily be passed ...

so a 723 pixel image would have how many branches? 723? one branch for each pixel?

ocean swallow May 6, 2022, 2:45 AM

#

up to 2^723 leaves in the worst case, if every path tested every pixel

plush jungle May 6, 2022, 2:45 AM

#

oh

#

because each new pixel would be another split

ocean swallow May 6, 2022, 2:45 AM

#

yes

plush jungle May 6, 2022, 2:46 AM

#

so the downside is that decision trees scale very poorly?

serene scaffold May 6, 2022, 2:46 AM

#

they also tend to overfit to the training data

plush jungle May 6, 2022, 2:46 AM

#

serene scaffold they also tend to overfit to the training data

yeah that makes sense, given that it sounds like it's not generalizing at all

#

it's just making up rules that perfectly match the training data

serene scaffold May 6, 2022, 2:47 AM

#

plush jungle it's just making up rules that perfectly match the training data

yesssssss

ocean swallow May 6, 2022, 2:48 AM

#

yeah and as you can see impractical too

serene scaffold May 6, 2022, 2:48 AM

#

a random forest is an algorithm that attempts to solve that problem. but if you hear about a random forest being used in consumer software, it probably means that they just wanted to use AI for marketing purposes, and should be avoided.

ocean swallow May 6, 2022, 2:48 AM

#

so you need to do some feature extraction

plush jungle May 6, 2022, 2:49 AM

#

are SVM's the opposite? they generalize more than neural nets?

ocean swallow May 6, 2022, 2:50 AM

#

I don't know what they would do in real data honestly but the way I was told in university

#

since it creates functions of vectors that only barely fits the data, they are good against overfitting

#

it is a great algorithm even today

#

low cost fast etc.

#

not the basic linear svm

plush jungle May 6, 2022, 2:52 AM

#

I thought we were talking about the non-linear svm

#

linear ones are just for linearly separable problems right?

ocean swallow May 6, 2022, 2:52 AM

#

yeah fixed it

plush jungle May 6, 2022, 2:52 AM

#

oh gotcha

ocean swallow May 6, 2022, 2:54 AM

#

apart from being optimization problem solvers, they don't have much in common really. probably svm is closer to neural network but that's all

#

also svm's name don't make sense is what my professor told me as well lol

plush helm May 6, 2022, 3:52 AM

#

ocean swallow published a video on youtube, given the title, channel sub and date etc

that would generally be subjective & we have to account for yt algorithm itself

misty flint May 6, 2022, 3:58 AM

#

random forests are also relatively robust to outliers so theres that use case

mental bane May 6, 2022, 4:28 AM

#

Can anyone please tell me how to get the count of Malignant and Benign tumors from load_breast_cancer of scikit?

mental bane May 6, 2022, 4:44 AM

#

nvm, found it in the description of the dataset
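For anyone who wants the counts programmatically rather than from the description, a short sketch using the dataset's own `target` and `target_names` fields. `per_class` is just an illustrative variable name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
counts = np.bincount(data.target)  # counts[i] = number of samples with class index i
per_class = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print(per_class)
```

The result should agree with the class distribution given in the dataset description (212 malignant, 357 benign).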

misty flint May 6, 2022, 4:50 AM

#

misty flint secret secrets are secret

secret unveiled

#

ZoomEyes

versed sleet May 6, 2022, 4:59 AM

#

I still can't understand tensorflow or pytorch. I think my weakness is the math behind it.

#

I know basic math

#

Not sure if I should revisit trigonometry and algebra

#

Or go for statistics and probability

#

🤔

serene scaffold May 6, 2022, 5:16 AM

#

versed sleet I still can't understand tensorflow or pytorch. I think my weakness is the math ...

neural networks involve linear algebra and calculus. and classifiers in general involve probability and statistics.

#

but yes, tensorflow and pytorch are both for neural networks. they do most of the work for you. so if you don't already understand neural networks, you won't really understand what those libraries enable you to do.

#

and you won't really figure it out from using them, because most of the work happens internally

misty flint May 6, 2022, 5:20 AM

#

yes

#

you should learn neural networks and how they work at least

#

if you intend to use either pytorch or tensorflow for your needs

misty flint May 6, 2022, 5:44 AM

#

also im glad data engineering is becoming more popular

#

since it will enable data science

#

and allow for more DS work to be able to provide value

#

also fang salaries are fang so im not surprised

#

kekHands

inland zephyr May 6, 2022, 6:34 AM

#

does anyone have better knowledge about regex? i have this text Character_Xingqiu_Thumb.webp and I want to keep Xingqiu and remove the surrounding words. i tried ((Character_))($\w^)((_Thumb.webp)) but it doesn't work.

#

when working with re, it also doesn't work properly for the last part (_Thumb.webp)

head = "Character_"
tail = "_Thumb.webp"
username = re.sub(tail,"",re.sub(head,"",username))

it returns Xinqiu_Thumb thats what i dont need

delicate apex May 6, 2022, 6:42 AM

#

inland zephyr does anyone have better knowledge about regex? i have this text Character_Xin...

consider something like (?:Character_)(\w+)(?:_Thumb\.webp)

delicate apex May 6, 2022, 6:42 AM

#

delicate apex consider something like `(?:Character_)(\w+)(?:_Thumb\.webp)`

the ?: means non-capturing group, so the only capturing group is the part you want
the . has to be escaped, as it has a special meaning in regex
\w on its own only matches a single character, so it needs a + to match a whole word, and ^ and $ refer to the start and end of, depending on mode, either the entire string or each line

#

doubts and problems about regex can frequently be helped with online interpreters, and i use https://regex101.com/ for such tasks

#

!e

import re
s = 'Character_Xingqiu_Thumb.webp'
rem = re.search(r'(?:Character_)(\w+)(?:_Thumb\.webp)', s)
print(rem.group(1))

arctic wedgeBOT May 6, 2022, 6:48 AM

#

@delicate apex :white_check_mark: Your eval job has completed with return code 0.

Xingqiu
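If the `re.sub` approach with `head` and `tail` is preferred, a sketch of one way to make it robust: escape both strings so they are treated as literal text, and anchor them to the start and end. `strip_affixes` is a hypothetical helper name.

```python
import re

head = "Character_"
tail = "_Thumb.webp"

# ^head matches only at the start, tail$ only at the end; re.escape keeps the
# "." in ".webp" from acting as a wildcard.
pattern = re.compile("^" + re.escape(head) + "|" + re.escape(tail) + "$")


def strip_affixes(filename):
    """Remove the fixed prefix and suffix from a file name."""
    return pattern.sub("", filename)
```

Unlike the `\w+` capture, this leaves everything between the prefix and suffix untouched, which may matter if a name contains characters outside `\w`.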

inland zephyr May 6, 2022, 6:49 AM

#

lemme try this

potent fractal May 6, 2022, 6:54 AM

#

Hi,

is there a package out there that could take my RNA dataset (bulk or single cell) and compare it to published datasets to find those with the highest percentage of similarity to my dataset?

Thanks!

celest vine May 6, 2022, 6:56 AM

#

I have text as well as numerical feature in my dataset. Can these both be used together for supervised classification problem?

celest vine May 6, 2022, 6:57 AM

#

celest vine I have text as well as numerical feature in my dataset. Can these both be used t...

Which algorithm will be best for this?

inland zephyr May 6, 2022, 6:58 AM

#

delicate apex doubts and problems about regex can frequently be helped with online interpreter...

this works, thanks for the help

wild pagoda May 6, 2022, 7:26 AM

#

hey everyone, can i create excel image like this in python?

#

Currently my excel file is like this:

tacit basin May 6, 2022, 8:01 AM

#

wild pagoda Currently my excel file is like this:

Create pandas dataframe from it and df.plot() should plot something similar. You can customize it with matplotlib

wild pagoda May 6, 2022, 8:08 AM

#

tacit basin Create pandas dataframe from it and df.plot() should plot something similar. You...

currently matplot plot is rounded, anyway to make it sharp?

tacit basin May 6, 2022, 8:09 AM

#

wild pagoda currently matplot plot is rounded, anyway to make it sharp?

Not sure if i understand.

tacit basin May 6, 2022, 8:10 AM

#

misty flint also im glad data engineering is becoming more popular

Interesting 🤔

inland zephyr May 6, 2022, 8:55 AM

#

I need to investigate deeper about how to deal with transparent image such as webp and the best augmentation method for them.

#

The resnet embedder for Genshin portrait... i need to deep dive into augmentation methods for image classification tho...

upper spindle May 6, 2022, 9:41 AM

#

why would an lstm model prediction be bad

#

like the drawbacks

#

as in the model predicts this

#

whereas a univariate lstm predicts this

iron kettle May 6, 2022, 9:52 AM

#

hello everybody, I have a question. I use pandas to read a CSV.
Then I use df[df['columnName']=='XX'] to search for a value, but I can't find it,
even though the value is in the dataframe.

#

such us

chilly abyss May 6, 2022, 9:53 AM

#

Hi

#

Pls what's the full meaning of ADADelta?

iron kettle May 6, 2022, 9:54 AM

#

#

anybody can help me? grumpchib

rose agate May 6, 2022, 9:55 AM

#

iron kettle anybody can help me? grumpchib

The data type might not be a string perhaps, try removing the quotes around the number

iron kettle May 6, 2022, 9:56 AM

#

rose agate The data type might not be a string perhaps, try removing the quotes around the ...

I set the data type to str

#

#

I remove the quote.....

chilly abyss May 6, 2022, 9:57 AM

#

Hello all, pls could you help with the full meaning of ADADelta

rose agate May 6, 2022, 9:59 AM

#

iron kettle I set the data type is str

not sure then tbh. If you're able to upload the file I could try though
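One common cause worth ruling out (an assumption, since the file isn't shown) is stray whitespace around the values, which makes an exact `==` comparison fail even when the value looks present. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"columnName": [" XX", "YY ", "XX"]})  # made-up data with stray spaces

exact = df[df["columnName"] == "XX"]                # only matches the clean value
cleaned = df[df["columnName"].str.strip() == "XX"]  # ignores surrounding whitespace

print(df["columnName"].map(repr).tolist())  # repr() makes hidden spaces visible
```

Checking `df.dtypes` and printing a few values with `repr()` usually shows quickly whether it is a type mismatch or hidden characters.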

odd meteor May 6, 2022, 10:31 AM

#

chilly abyss Hello all, pls could you help with the full meaning of ADADelta

It's an optimizer that's used in NN to minimise the loss function. Just like Momentum, AdaDelta is just the name the creator gave it. There's no extra "full meaning" to it since that's the full name.

https://keras.io/api/optimizers/adadelta/

Keras documentation: Adadelta

upper spindle May 6, 2022, 11:19 AM

#

what is a zip file

#

and how do i create a zip file containing two ipynb files

misty flint May 6, 2022, 11:36 AM

#

tacit basin Interesting 🤔

indeed

#

DoggoKek

rose agate May 6, 2022, 11:45 AM

#

upper spindle what is a zip file

It's a type of file that contains data compressed to save space. If you're on windows download WinRAR, put the two .ipynb files in a folder, right click the folder and click "add to archive", click the zip option and then click "ok"
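Since this is a Python channel: the standard library's `zipfile` module can also build the archive, with no extra software. A minimal sketch; `zip_files` is a hypothetical helper, and the notebook names in the usage note are placeholders.

```python
import zipfile
from pathlib import Path


def zip_files(paths, archive_path):
    """Write the given files into a new zip archive under their bare file names."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            path = Path(path)
            zf.write(path, arcname=path.name)  # arcname drops the folder part
    return archive_path
```

Usage would look like `zip_files(["first.ipynb", "second.ipynb"], "notebooks.zip")`, run from the folder that holds the notebooks.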

frigid elk May 6, 2022, 11:55 AM

#

is it possible to create boxplot without full dataset? .. using just the breakpoints? .. e.g. [lower, q1, med, q3, upper, [outliers]] ?

rose agate May 6, 2022, 12:10 PM

#

frigid elk is it possible to create boxplot without full dataset? .. using just the breakpo...

Found a post online that explained how to create boxplot from the quartiles and whiskers. To add outliers I found that using a scatterplot works

import matplotlib.pyplot as plt

stats = [
    {'med': 5, 'q1': 2, 'q3': 6, 'whislo': 1, 'whishi': 8},
]

_, ax = plt.subplots();
ax.bxp(stats, showfliers=False);

plt.scatter([1,1,1,1], [9,10,-1,-2])
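As an alternative to the separate scatter call, `Axes.bxp` also accepts an optional `fliers` entry in each stats dict, so the outliers can be drawn by the boxplot itself. A sketch with the same made-up numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

stats = [{
    "med": 5, "q1": 2, "q3": 6, "whislo": 1, "whishi": 8,
    "fliers": [9, 10, -1, -2],  # outlier values supplied directly
}]

fig, ax = plt.subplots()
artists = ax.bxp(stats, showfliers=True)  # returns a dict of the drawn artists
```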

#

upper spindle May 6, 2022, 12:32 PM

#

rose agate It's a type of file that contains data compressed to save space. If you're on wi...

thanks, ill try it now

upper spindle May 6, 2022, 12:33 PM

#

rose agate It's a type of file that contains data compressed to save space. If you're on wi...

how do i put my two .ipynb files in a folder

#

im using jupyter notebook

rose agate May 6, 2022, 12:43 PM

#

upper spindle how do i put my two .ipynb files in a folded

just create a folder wherever you want, you can either save them to that location from inside jupyter or just copy and paste them into the folder

wild pagoda May 6, 2022, 12:47 PM

#

hi everyone, why does my plot break when i want to use scientific notation? my pd.DataFrame is like this

#

plot image

serene scaffold May 6, 2022, 12:48 PM

#

wild pagoda plot image

can you show the code you used to make the plot as text?

#

!code

arctic wedgeBOT May 6, 2022, 12:48 PM

#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

wild pagoda May 6, 2022, 12:49 PM

#

        GraphSheet = pd.read_csv(fileExcel, dtype=str)
        GraphData = pd.DataFrame(GraphSheet)
        # Start make graph
            # Get column
        trvl = GraphData.columns[0]
        FY_raw = GraphData.columns[1]
        FY_fit = GraphData.columns[2]
        note.append(FY_raw)  # Add note of graph data
        note.append(FY_fit)

        # Add data to graph
        plt.plot(GraphData[trvl],
                GraphData[FY_raw], markersize=5)
        plt.plot(GraphData[trvl],
                GraphData[FY_fit], markersize=5)
        plt.legend(note, loc=(1.04, 0))

serene scaffold May 6, 2022, 12:50 PM

#

        GraphSheet = pd.read_csv(fileExcel, dtype=str)
        GraphData = pd.DataFrame(GraphSheet)

you shouldn't need both of these. is GraphSheet not already what you need?

#

!docs pandas.DataFrame.plot.line

arctic wedgeBOT May 6, 2022, 12:50 PM

#

pandas.DataFrame.plot.line


```py
DataFrame.plot.line(x=None, y=None, **kwargs)
```
Plot Series or DataFrame as lines.

This function is useful to plot lines using DataFrame’s values
as coordinates.

serene scaffold May 6, 2022, 12:51 PM

#

I would see how you can make the plot with this. There should be a way to make the scale labels less frequent, if that ends up being an issue.

wild pagoda May 6, 2022, 12:53 PM

#

serene scaffold I would see how you can make the plot with this. There should be a way to make t...

if i remove the dtype=str, pandas is gonna automatically convert my data to float. i want to prevent that, but when i put in dtype i get that error

serene scaffold May 6, 2022, 12:54 PM

#

wild pagoda if i remove the `dtype=str` pandas gonna automaticly convert my data to float, i...

why do you not want numerical data to be numerical?

wild pagoda May 6, 2022, 12:54 PM

#

serene scaffold why do you not want numerical data to be numerical?

my data is scientific notation

serene scaffold May 6, 2022, 12:54 PM

#

wild pagoda my data is scientific notation

the way the data is notated for human visualization is not the same as the data itself

wild pagoda May 6, 2022, 12:55 PM

#

serene scaffold the way the data is notated for human visualization is not the same as the data ...

so there's no way making a graph like this?

serene scaffold May 6, 2022, 12:55 PM

#

wild pagoda so there's no way making a graph like this?

there is, but you have to first store the data as actual numbers (strings of digits are just text and are mathematically meaningless). then you can just change the settings for the plot so that the numbers are displayed in scientific notation

wild pagoda May 6, 2022, 12:56 PM

#

serene scaffold there is, but you have to first store the data as actual numbers (strings of dig...

oh i thought i needed to change the data

#

in that case, i currently have it working like this; now idk how to change those to scientific notation

serene scaffold May 6, 2022, 12:59 PM

#

wild pagoda so if like that, currently i make it work like this, now idk how to change those...

https://matplotlib.org/3.5.0/api/_as_gen/matplotlib.axes.Axes.ticklabel_format.html

wild pagoda May 6, 2022, 1:01 PM

#

        plt.plot(GraphData[trvl],
                GraphData[FY_raw], markersize=5)
        plt.plot(GraphData[trvl],
                GraphData[FY_fit], markersize=5)
        plt.ticklabel_format(style='sci')

it's not working tho

serene scaffold May 6, 2022, 1:01 PM

#

in what way is it not working? also what is the type of plt?

wild pagoda May 6, 2022, 1:01 PM

#

import matplotlib.pyplot as plt

serene scaffold May 6, 2022, 1:02 PM

#

and in what way is it not working? did you get an error message? did your computer explode?

wild pagoda May 6, 2022, 1:02 PM

#

it just run without any error

serene scaffold May 6, 2022, 1:03 PM

#

run without any error... so it worked? I'm still in the dark about what happened that is different from what you expected.

wild pagoda May 6, 2022, 1:03 PM

#

i want these to be scientific notation, but currently it's not

serene scaffold May 6, 2022, 1:05 PM

#

it might be that scientific notation only kicks in for sufficiently large or small numbers.

#

try multiplying the y data by a really large number and plot it again, and see what happens
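A sketch of the whole flow under discussion, with made-up column names and values: convert the text columns to real numbers, plot, then ask for scientific notation on the axis. By default the scalar formatter only switches to scientific notation outside a certain magnitude range; passing `scilimits=(0, 0)` requests it for all magnitudes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for a CSV read with dtype=str: numbers stored as scientific-notation text.
raw = pd.DataFrame({"travel": ["0", "1", "2"], "force": ["1.5e-3", "2.5e-3", "4.0e-3"]})
data = raw.apply(pd.to_numeric)  # convert every column from text to numbers

fig, ax = plt.subplots()
ax.plot(data["travel"], data["force"])
# scilimits=(0, 0) asks for scientific notation regardless of magnitude
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
```

With this setting the exponent is shown once as an offset at the end of the axis, rather than repeated on every tick label.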

wild pagoda May 6, 2022, 1:07 PM

#

i don't think it works in matplotlib

wild pagoda May 6, 2022, 1:08 PM

#

wild pagoda i want these to be scientific notation, but currently it's not

so how do i get more numbers on the axis labels? i want 100-200-300-400, not 200-400-600 like right now
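For denser tick labels, one option is `matplotlib.ticker.MultipleLocator`, which places a major tick at every multiple of a chosen step. The sketch assumes the x-axis is the one in question and uses made-up data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

fig, ax = plt.subplots()
ax.plot([0, 250, 500, 700], [1, 3, 2, 4])  # made-up data spanning a few hundred units
ax.xaxis.set_major_locator(MultipleLocator(100))  # one major tick every 100 units
ticks = ax.get_xticks()
```

The same call on `ax.yaxis` would change the y-axis spacing instead.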

barren wedge May 6, 2022, 1:20 PM

#

what can we do with NLP tokens (from a tokenizer)?
do you have any ideas for leveraging the tokens?

serene scaffold May 6, 2022, 1:36 PM

#

barren wedge what we can do with NLP token(tokenizer)? do you have any idea to leverage the t...

it depends on what you're trying to do. tokenizers decide what word boundaries are. often this is straightforward, but there are times when you might want "don't" to be treated as two tokens, or treat short phrases as one token.
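A toy illustration of how the tokenizer's rules change the result. The regex below is a deliberately simplistic, made-up rule set, not how any particular NLP library tokenizes.

```python
import re

text = "I don't like stop words."

# Rule 1: split on whitespace only, so "don't" and "words." stay as single tokens
whitespace_tokens = text.split()

# Rule 2: split off the "n't" clitic and punctuation
rule_tokens = re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)

print(whitespace_tokens)
print(rule_tokens)
```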

barren wedge May 6, 2022, 1:37 PM

#

serene scaffold it depends on what you're trying to do. tokenizers decide what word boundaries a...

how about BERT Tokenizer?

serene scaffold May 6, 2022, 1:38 PM

#

barren wedge how about BERT Tokenizer?

that's just another tokenizer. but for bert models, you have to use the same tokenizer that the model expects.

barren wedge May 6, 2022, 1:39 PM

#

serene scaffold that's just another tokenizer. but for bert models, you have to use the same tok...

can we unsupervised token to make some similarities between the 2 columns?

serene scaffold May 6, 2022, 1:40 PM

#

barren wedge can we unsupervised token to make some similarities between the 2 columns?

unsupervised token? a token is just a word. it's not something that does something.

barren wedge May 6, 2022, 1:41 PM

#

i mean in the form of input_ids and attention_mask in transformers

serene scaffold May 6, 2022, 1:41 PM

#

also what are the two columns?

barren wedge May 6, 2022, 1:43 PM

#

one column contains the customer address and the other is PIC area of responsibility

#

so we can match the customer for the right PIC

upper spindle May 6, 2022, 1:44 PM

#

rose agate just create a folder wherever you want, you can either save them to that locatio...

thanks

barren wedge May 6, 2022, 1:45 PM

#

like VLOOKUP or HLOOKUP in excel but not exact values

upper spindle May 6, 2022, 2:08 PM

#

rose agate just create a folder wherever you want, you can either save them to that locatio...

where would i locate the python files from jupyterlab, im currently using windows

cunning condor May 6, 2022, 2:09 PM

#

Is the following sequence correct using Depth-First Postorder?

D, E, B, F, C, A.

OR

D, E, B, C, F, A.

Thank You.

#

echo vigil May 6, 2022, 2:38 PM

#

Is there a way to group by / aggregate to a boolean value? Currently working in pyspark but a solution with pandas or sql would also be helpful. For instance, if we have a table where each row is a person_id and a pet, I'd want to group by person_id and the aggregate column should be true if the person has a dog and a cat and false otherwise.

ID | pet
1 | dog
1 | cat
2 | dog
2 | snake

->

ID | agg
1 | True
2 | False
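
One way this could look in SQL, sketched with Python's built-in sqlite3 so it runs standalone. The table name `pets` and the in-memory database are assumptions for illustration; the `ID` and `pet` columns follow the example above. The idea is to count the distinct dog/cat values per person and compare that count against 2:

```python
import sqlite3

# Hypothetical in-memory table mirroring the example rows above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, pet TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "dog"), (1, "cat"), (2, "dog"), (2, "snake")],
)

# Count how many of the two required pets each person has;
# the person qualifies only when both are present.
query = """
SELECT ID,
       COUNT(DISTINCT CASE WHEN pet IN ('dog', 'cat') THEN pet END) = 2 AS agg
FROM pets
GROUP BY ID
ORDER BY ID
"""
result = {person_id: bool(agg) for person_id, agg in conn.execute(query)}
print(result)
```

The same shape should carry over to pandas or PySpark as a group-by followed by a check that both required values appear in the group.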

frigid elk May 6, 2022, 2:59 PM

#

rose agate Found a post online that explained how to create boxplot from the quartiles and ...

nice, thanks for that. i'll give it a shot

tacit basin May 6, 2022, 3:32 PM

#

Any recs for practical exercises for designing and implementing solutions to an Object-Oriented Programming task.
Practical like MLOps interview (live coding) practical :) Thanks!

urban lance May 6, 2022, 3:41 PM

#

I have a dataset sorted by user_id and timestamp (essentially session_id)
I want to keep just the first n rows for 2 sessions of each user. What would be the most efficient way to go about it?

echo vigil May 6, 2022, 3:54 PM

#

@urban lance pandas?

#

group by session id rank by timestamp and filter where the rank <= 2 would be the cleanest code; there may be a more efficient method if performance is critical.

urban lance May 6, 2022, 4:04 PM

#

echo vigil @urban lance pandas?

Pandas

echo vigil May 6, 2022, 4:16 PM

#

Something like:

df['rank'] = df.groupby('user_id')['session_id'].rank(method='dense', ascending=True)
df = df[df['rank'] <= 2]

will get you most of the way there
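
A fuller sketch of that idea, assuming the columns are named `user_id`, `session_id` and `timestamp`, and that session ids grow over time for each user (the toy data and `n` are made up). A dense rank numbers each user's sessions 1, 2, 3, ... regardless of how many rows each session has, and `head(n)` then keeps the first n rows per session:

```python
import pandas as pd

# Toy data, already sorted by user_id and timestamp.
df = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "session_id": [10, 10, 10, 11, 11, 12, 20, 20, 21],
    "timestamp":  [1, 2, 3, 4, 5, 6, 1, 2, 3],
})
n = 2  # rows to keep per session

# Dense rank: every row of a user's first session gets 1, the second session 2, ...
session_rank = df.groupby("user_id")["session_id"].rank(method="dense")
first_two = df[session_rank <= 2]

# Keep only the first n rows inside each remaining (user, session) group.
result = first_two.groupby(["user_id", "session_id"]).head(n)
print(result)
```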

rose agate May 6, 2022, 4:40 PM

#

Never knew about the rank function, looks useful

ocean swallow May 6, 2022, 4:40 PM

#

plush helm that would generally be subjective & we have to account for yt algorithm itself

thanks for replying. Like where do I start my research in this area though? I don't know where to start

#

I think the final result would yield like idea categories in a word cloud kind of thing. so that the person who would create a video would have a good idea of what video will succeed when

#

and yeah like you said there is this black box of youtube algorithm

frosty flower May 6, 2022, 5:24 PM

#

Is there any library that can be used to check how common a word (or lots of words) is?

#

Like if you input a word "hello" it gives you a score, say, 100

#

And if you give it a word "serendipity" then it maybe gives a score 15 or something

#

@serene scaffold I recall that you are a computational linguist so you might know the answer to this? Hope you don't mind the ping

serene scaffold May 6, 2022, 6:07 PM

#

frosty flower Is there any library that can be used to check how common a word (or lots of wor...

what is the use case? it's not very difficult to construct a frequency table for a given corpus.

spare briar May 6, 2022, 6:09 PM

#

you might be interested in https://en.wikipedia.org/wiki/Tf–idf

frosty flower May 6, 2022, 6:11 PM

#

serene scaffold what is the use case? it's not very difficult to construct a frequency table for...

Just trying to build a simple tool to help me read English books. I plan to extract difficult words out of the input (pdf or text) and list them out so I can learn them before I start reading the book.

serene scaffold May 6, 2022, 6:18 PM

#

frosty flower Just trying to build a simple tool to help me read English books. I plan to extr...

what kinds of books? nonfiction?

frosty flower May 6, 2022, 6:24 PM

#

serene scaffold what kinds of books? nonfiction?

Philosophy textbooks

#

Hume, Kant etc

#

English isn’t my first language so it’s a bit hard for me to digest these books if I have to pause from time to time to look up the words I don’t know

serene scaffold May 6, 2022, 6:27 PM

#

frosty flower Philosophy textbooks

so you'll need to see which words are more frequent in philosophy textbooks than they are in general English use. and those are going to be the ones that you need to look up

#

someone already linked to the wikipedia about that, which is term frequency inverse document frequency

#

or tf-idf
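
A rough sketch of that comparison, using only the standard library. The two tiny strings stand in for a philosophy book and a general-English reference corpus (both invented here), and the add-one smoothing and the plain frequency ratio are simplifying assumptions rather than full tf-idf:

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Lowercase the text and count alphabetic words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Invented stand-ins for the real inputs.
book = "the noumenon is not the phenomenon and the noumenon is unknowable"
reference = "the cat is on the mat and the dog is not on the mat"

book_counts = word_counts(book)
ref_counts = word_counts(reference)
book_total = sum(book_counts.values())
ref_total = sum(ref_counts.values())

def rarity_ratio(word: str) -> float:
    """How much more frequent a word is in the book than in the reference."""
    book_freq = book_counts[word] / book_total
    # Add-one smoothing so words missing from the reference do not divide by zero.
    ref_freq = (ref_counts[word] + 1) / (ref_total + 1)
    return book_freq / ref_freq

# Words that are unusually common in the book come first.
ranked = sorted(book_counts, key=rarity_ratio, reverse=True)
print(ranked[:3])
```

With a real reference corpus, the top of `ranked` would be the candidate list of words to look up before reading.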

grave frost May 6, 2022, 7:10 PM

#

frosty flower Just trying to build a simple tool to help me read English books. I plan to extr...

you can use BERT to get the probability the model gives for each word, sort by it, take the words with the lowest likelihood (that the model still got right), compile them, and look them up

misty flint May 6, 2022, 7:21 PM

#

https://github.com/facebookresearch/metaseq

#

OPT-175B

#

they say its similar to GPT-3

#

pithink

gilded kestrel May 6, 2022, 8:14 PM

#

hey guys I'm stupid and first time trying keras, how do I change the Bidirectional to just an LSTM layer here?

    x_input = layers.Input(shape=x_train.shape[1:])
    x = layers.Masking(mask_value=MASK, input_shape=(x_train.shape[1:]))(x_input)
    x = layers.Bidirectional(tf.compat.v1.keras.layers.CuDNNLSTM(16, return_sequences=True))(x_input)

can I just replace the last with `x = tf.compat.v1.keras.layers.CuDNNLSTM(16, return_sequences=True)(x_input)` ?

cinder schooner May 6, 2022, 9:11 PM

#

cunning condor Is the following sequence correct using Depth-First Postorder? D, E, B, F, C, A...

D,E,B,F,C,A
Explanation: Postorder = (Left, Right, Root).
I also think this should go into algos and data structs
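
For reference, a small sketch of a postorder traversal. The tree shape is an assumption, since the original diagram is not shown here: A is the root with children B and C, B has children D and E, and C has the single child F.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def postorder(node: Optional[Node]) -> list[str]:
    """Visit the left subtree, then the right subtree, then the node itself."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.label]

# Assumed tree: A -> (B, C), B -> (D, E), C -> (F,)
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C", Node("F")))
order = postorder(tree)
print(order)
```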

lapis sequoia May 6, 2022, 9:22 PM

#

Guys, I have a question. In my dataset, I have some Boolean features that are put in as 0,1. Does it make sense to use them in spatial distance algorithms like k nearest neighbour?

#

Because it's actually a categorical feature. But converted forcefully into a numeric one.

runic raft May 6, 2022, 9:26 PM

#

lapis sequoia Guys, I have a question. In my dataset, I have some Boolean features that are pu...

if you have lots of boolean features, it could potentially make sense to calculate the Jaccard similarity from a sample where all booleans are true

lapis sequoia May 6, 2022, 9:27 PM

#

I have a decent amount compared to total number of features. But idk what jaccard similarity is

#

5/12 features are Boolean.

#

It's a medical dataset. Where booleans are like smoker or not.

runic raft May 6, 2022, 9:30 PM

#

well, before I keep pushing my answer, can you tell us more about what you're reducing with a clustering algo like KNN?

lapis sequoia May 6, 2022, 9:31 PM

#

It's just 2 cluster. Death or not.

#

We wanna predict if the patient dies based on their medical data

runic raft May 6, 2022, 9:31 PM

#

Well, is there a reason you've ruled out decision trees?

lapis sequoia May 6, 2022, 9:31 PM

#

No. I am gonna use decision trees

#

I am just trying to make sense of it. Because it feels right for categorical features. But the thing is, i will need to use k nearest neighbour too along with decision trees. So was wondering if I should include those features in knn or not.

#

Usually we drop the categorical features in knn

runic raft May 6, 2022, 9:40 PM

#

I don't really see what the KNN is adding here for predicting death, decision trees ought to be able to deal with continuous features just fine.

That being said, you said you need KNN, so I'll explain the Jaccard distance a bit more with a python snippet

lapis sequoia May 6, 2022, 9:43 PM

#

knn is gonna classify them into positive and negative. there's other numeric features too.

runic raft May 6, 2022, 9:43 PM

#

them?

#

the numeric features?

lapis sequoia May 6, 2022, 9:45 PM

#

like age

lapis sequoia May 6, 2022, 9:45 PM

#

runic raft them?

the death

#

positive or negative

#

death

#

so more age means more likely to die. Like that

runic raft May 6, 2022, 9:47 PM

#

lapis sequoia knn is gonna classify them into positive and negative. there's other numeric fea...

I was asking what "them" meant in your previous statement. It sounds like "them" meant the sample in your dataset

lapis sequoia May 6, 2022, 9:48 PM

#

runic raft I was asking what "them" meant in your previous statement. It sounds like "them"...

them meant the data points. or each single row of a patient

#

They need to be classified as dead or not.

runic raft May 6, 2022, 9:52 PM

#

Anyways, Jaccard similarity (Jaccard distance is just 1 minus it) is, roughly, when you look at what percent of values match each other:

For example let's say you have four binary/boolean features A, B, C, D.

def jaccard_similarity(sample: list[bool], other: list[bool]) -> float:
  assert len(sample) == len(other)
  num_matching = 0
  for sample_feature, other_feature in zip(sample, other):
    if sample_feature == other_feature:
      num_matching += 1
  return num_matching / len(sample)

sample = [True, False, False, True]
other_sample = [True, False, True, False]
print(jaccard_similarity(sample, other_sample)) # prints 0.5

lapis sequoia May 6, 2022, 9:53 PM

#

Pog

runic raft May 6, 2022, 9:54 PM

#

lapis sequoia them meant the data points. or each single row of a patient

again, what is the KNN adding to your ability to classify samples/rows which isn't being done with a decision tree?

lapis sequoia May 6, 2022, 9:54 PM

#

runic raft again, what is the KNN adding to your ability to classify samples/rows which isn...

Nothing. But it's for school

runic raft May 6, 2022, 9:54 PM

#

oh, lol

lapis sequoia May 6, 2022, 9:54 PM

#

So have to do both.

#

Haha

#

Need to research on all the hyperparameters tuning too

runic raft May 6, 2022, 9:55 PM

#

so if you have five boolean features, I would compare them all to a hypothetical case where every boolean is true, or every boolean is false

lapis sequoia May 6, 2022, 9:57 PM

#

Does jaccard similarity return the percentage of matching booleans?

#

In 2 lists

runic raft May 6, 2022, 9:57 PM

#

the definition you'd see in a textbook is much more general than the sample code I just gave you
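
To illustrate that difference, here is a sketch of the more general set-based definition, |A ∩ B| / |A ∪ B|, applied to boolean features by treating each row as the set of positions that are true. Under this definition, positions that are false in both rows are ignored, so the result can differ from the share of matching positions. The function name and the convention of returning 1.0 for two all-false rows are choices made for this example:

```python
def jaccard_index(sample: list[bool], other: list[bool]) -> float:
    """Set-based Jaccard index over the positions that are True."""
    assert len(sample) == len(other)
    a = {i for i, value in enumerate(sample) if value}
    b = {i for i, value in enumerate(other) if value}
    union = a | b
    if not union:
        # Both rows are all False; treating them as identical is a convention.
        return 1.0
    return len(a & b) / len(union)

sample = [True, False, False, True]
other_sample = [True, False, True, False]
score = jaccard_index(sample, other_sample)
print(score)
```

For these two rows the set-based value is one shared position out of three positions that are true in either row, whereas the matching-positions version counts two matches out of four.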

lapis sequoia May 6, 2022, 9:58 PM

#

lapis sequoia Does jaccard similarity return the percentage of matching booleans?

You could have said this without giving the code. Haha

#

So how exactly do I use jaccard similarity in knn

runic raft May 6, 2022, 9:59 PM

#

lapis sequoia You could have said this without giving the code. Haha

You're not wrong, but I find that writing out code can remove lots of the ambiguity that shows up in English

runic raft May 6, 2022, 10:00 PM

#

runic raft so if you have five boolean features, I would compare them all to a hypothet...

do this for every row and then use the Jaccard Similarity as a feature instead of the categorical features

lapis sequoia May 6, 2022, 10:00 PM

#

Every row, compared with what?

runic raft May 6, 2022, 10:01 PM

#

so if you have five boolean features, I would compare them all to a hypothetical case where every boolean is true, or every boolean is false
was this not clear

lapis sequoia May 6, 2022, 10:02 PM

#

runic raft so if you have five boolean features, I would compare them all to a hypothet...

Oh

#

Didn't read that

#

But that's, in easy terms. Calculating the percentage of true booleans. Lolllll

runic raft May 6, 2022, 10:02 PM

#

doesnt matter whether you pick all true or all false, but just do the same for every row

lapis sequoia May 6, 2022, 10:04 PM

#

lapis sequoia But that's, in easy terms. Calculating the percentage of true booleans. Lolllll

OMG, it is, isn't it?

runic raft May 6, 2022, 10:04 PM

#

always all true or always all false, and yes that is a simple way of implementing it.

As with most summary statistics, you are losing information

#

but you aren't losing as much info as you would be in the case of just omitting it

lapis sequoia May 6, 2022, 10:05 PM

#

lapis sequoia But that's, in easy terms. Calculating the percentage of true booleans. Lolllll

If you would have told me that initially then there was no need for the rest of the explanations 😁

#

You made it so formal

#

Jaccard similarity. Hahaha

#

Can I change it into a percentage too btw?

#

0.7 changed to 70%

runic raft May 6, 2022, 10:06 PM

#

I'm not trying to just give you the correct answer, I'm trying to give you enough info to be able to explore the idea more in your time

#

The formality is a justification for why this would be a good approach

lapis sequoia May 6, 2022, 10:07 PM

#

hmm. Thanks

lapis sequoia May 6, 2022, 10:07 PM

#

lapis sequoia 0.7 changed to 70%

will this be fine?

runic raft May 6, 2022, 10:07 PM

#

it should be, yes

lapis sequoia May 6, 2022, 10:07 PM

#

great

runic raft May 6, 2022, 10:10 PM

#

but yeah, ideally if you're doing this for an assignment then your prof might ask "why did you do this" and if you say "some guy on Discord told me to" that's probably not a very good answer

cunning condor May 6, 2022, 10:11 PM

#

cinder schooner D,E,B,F,C,A Explanation: Postorder = (Left, Right, Root). I also think this shou...

I see thank you 🙂

lapis sequoia May 6, 2022, 10:14 PM

#

runic raft but yeah, ideally if you're doing this for an assignment then your prof might as...

No i will tell them that I did it with my hardwork and reading medical research papers on the internet. HAhaha

runic raft May 6, 2022, 10:22 PM

#

Question for pandas experts here:

I have a column of strings and I'd like to compute the base64-encoded SHA256 hash of each value to see if any of them match a key. The naïve way of doing this is probably something along these lines:

import base64
import hashlib

def hash_encode(s: str) -> str:
    # hashlib needs bytes, so encode the string first
    hashed = hashlib.new("sha256", s.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(hashed).decode('utf-8')

df["hashes"] = df["textcolumn"].apply(hash_encode)
results = df.loc[df["hashes"] == key]

But, let's say I have a lot of rows, and this hash_encode operation isn't exactly lightweight. What's the most sane fix?

#

I think if I'm looking for a match, it might make the most sense to actually iterate through the rows and check them manually one at a time
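
A sketch of that row-by-row idea, stopping at the first match so later rows are never hashed. It uses a plain list in place of the DataFrame column, and the sample strings and the way `key` is produced are invented for the example:

```python
import base64
import hashlib
from typing import Iterable, Optional

def hash_encode(s: str) -> str:
    """URL-safe base64 of the SHA-256 digest of a UTF-8 string."""
    hashed = hashlib.sha256(s.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(hashed).decode("utf-8")

def find_first_match(values: Iterable[str], key: str) -> Optional[int]:
    """Return the position of the first value whose hash equals key, else None."""
    for position, value in enumerate(values):
        if hash_encode(value) == key:
            return position  # stop early; remaining values are not hashed
    return None

# Invented data standing in for df["textcolumn"].
texts = ["alpha", "bravo", "charlie", "delta"]
key = hash_encode("charlie")
match_position = find_first_match(texts, key)
print(match_position)
```

If every match is needed rather than just the first, hashing only the unique values of the column would at least avoid repeated work on duplicates.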

lapis sequoia May 6, 2022, 11:20 PM

#

How do I select the hyperparameters for my decision tree

runic raft May 6, 2022, 11:47 PM

#

lapis sequoia How do I select the hyperparameters for my decision tree

What's wrong with good ol' grid search

lapis sequoia May 6, 2022, 11:47 PM

#

That I don't know what that is

#

Is it like I calculate accuracy by looping through a list of values?

runic raft May 6, 2022, 11:48 PM

#

yes

lapis sequoia May 6, 2022, 11:48 PM

#

And how is that list determined

#

Minimum is 1 for each leaf node. What's max?

#

Size of root node? Haha

runic raft May 6, 2022, 11:50 PM

#

1 - How fast your computer is
2 - What other people have done on similar problems

Usually for nodes in a decision tree, aren't you optimizing the splits with something like CART and Gini impurity?

lapis sequoia May 6, 2022, 11:50 PM

#

And do we do it as chained loops for all the parameters to get the best combination? As in 3 hyperparameters with 7,6,8 possible values respectively. So do we do 7 × 6 × 8 = 336 iterations to exhaust all possible combinations?

lapis sequoia May 6, 2022, 11:50 PM

#

runic raft 1 - How fast your computer is 2 - What other people have done on similar problem...

Gini

#

My computer is stupidly slow. But dataset is small.

#

299 rows only.

#

I split it into 70-30

runic raft May 6, 2022, 11:52 PM

#

Are you familiar with multiprocessing

lapis sequoia May 6, 2022, 11:52 PM

#

299,12

#

Nope

lapis sequoia May 6, 2022, 11:52 PM

#

lapis sequoia And do we do it as chained loops for all the parameters to get the best combinat...

But this sounds good?

runic raft May 6, 2022, 11:52 PM

#

!docs multiprocessing

arctic wedgeBOT May 6, 2022, 11:52 PM

#

multiprocessing

Source code: Lib/multiprocessing/

runic raft May 6, 2022, 11:52 PM

#

that's the general idea, yeah
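
The nested loops can be flattened with `itertools.product`. In this sketch the parameter names and the `score` function are placeholders; in practice `score` would fit the tree with those settings and return a validation accuracy. With 7, 6 and 8 candidate values the grid has 7 × 6 × 8 = 336 combinations:

```python
from itertools import product

# Placeholder grids with 7, 6 and 8 candidate values.
grid = {
    "max_depth": list(range(1, 8)),
    "min_samples_leaf": list(range(1, 7)),
    "min_samples_split": list(range(2, 10)),
}

def score(params: dict) -> float:
    """Stand-in for 'fit the model and return accuracy' (made up for this sketch)."""
    return -(
        abs(params["max_depth"] - 4)
        + abs(params["min_samples_leaf"] - 2)
        + abs(params["min_samples_split"] - 5)
    )

names = list(grid)
# One dict per combination, in the same order nested loops would visit them.
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
best = max(combinations, key=score)
print(len(combinations), best)
```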

lapis sequoia May 6, 2022, 11:53 PM

#

Haha. Let me run it and see how fast it is.

#

Then will limit the values a bit maybe

#

My professor would beat me up for using this shit 😛

#

I have ryzen 5600h, 6 cores.

#

Gotta go make sandwiches. And then will code this shit. So exciting 😁

#

I am gonna do feature selection too. Should it be done before or after this grid search?

#

Using a hill climbing method which takes in features one by one. And check accuracy

runic raft May 6, 2022, 11:57 PM

#

if you have a multicore CPU then you should be able to leverage your hardware better if you can find a way to put your model.fit and your evaluation approach (K-fold eval?) into a single python def and then use multiprocessing.Pool.map or multiprocessing.Pool.starmap to do a bunch of work faster

#

grid search is like usually one of the last things you wanna do

lapis sequoia May 7, 2022, 12:05 AM

#

I haven't figured out k-fold yet

#

I thought I would write everything and then modify it for k-fold.

lapis sequoia May 7, 2022, 12:06 AM

#

lapis sequoia I am gonna do feature selection too. Should it be done before or after this grid...

How about this though? @runic raft

wild pagoda May 7, 2022, 12:54 AM

#

Hey everyone, can i make a graph like this in python? (matplotlib or smthing like that )

modern cypress May 7, 2022, 12:54 AM

#

wild pagoda Hey everyone, can i make a graph like this in python? (matplotlib or smthing lik...

Yeah matplotlib can do this, also seaborn as well I think

wild pagoda May 7, 2022, 12:55 AM

#

modern cypress Yeah matplotlib can do this, also seaborn as well I think

currently i'm stuck that the input of x-axis is string not value

#

so it make my matplot froze

modern cypress May 7, 2022, 12:55 AM

#

wild pagoda currently i'm stuck that the input of x-axis is string not value

post the full error

wild pagoda May 7, 2022, 12:57 AM

#

modern cypress post the full error

My code rn:

GraphSheet = pd.read_csv(csv_files)
GraphData = pd.DataFrame(GraphSheet)
file_col = GraphData.columns[0]
note = []
second_col = GraphData.columns[1]
note.append(second_col)  # Add note of graph data
# Add data to graph
plt.plot(GraphData[file_col],
         GraphData[second_col], markersize=5)

#

now when i'm running it's just freeze bc the input of file_col is not number

inner furnace May 7, 2022, 12:59 AM

#

You could assign the values of the column to a new variable already defined as an integer before

modern cypress May 7, 2022, 12:59 AM

#

wild pagoda My code rn: ```py GraphSheet = pd.read_csv(csv_files) GraphData = pd.DataFrame(G...

can you try GraphData.describe(include='all')

wild pagoda May 7, 2022, 1:04 AM

#

modern cypress can you try GraphData.describe(include='all')

currently it's sticky like this, anyway to make it like in the image?

wild pagoda May 7, 2022, 1:04 AM

#

wild pagoda Hey everyone, can i make a graph like this in python? (matplotlib or smthing lik...

like this

modern cypress May 7, 2022, 1:05 AM

#

hmm

modern cypress May 7, 2022, 1:05 AM

#

wild pagoda currently it's sticky like this, anyway to make it like in the image?

try plt.xticks(rotation=90)

wild pagoda May 7, 2022, 1:06 AM

#

modern cypress try `plt.xticks(rotation=90)`

oki that solved, the last problem is, i want to make the x-axis sit at the 0.0 of the y-axis

modern cypress May 7, 2022, 1:09 AM

#

wild pagoda oki that solved, the last problem is, i want to make the x-axis sit at the 0.0 of...

Honestly, I have no idea hahahah

wild pagoda May 7, 2022, 1:09 AM

#

modern cypress Honestly, I have no idea hahahah

lol, thanks for all the help that you gave me!

modern cypress May 7, 2022, 1:10 AM

#

wild pagoda lol, thanks for all the help that you gave me!

hmm, can you try adding labelpad=20 (change the number if you see it works to whatever you need) to your x axis

lapis sequoia May 7, 2022, 1:11 AM

#

@runic raft my pc is doing 900,000 iterations rn 🤪

#

I skimmed it down to 25000 that is also taking a lot of time.

wild pagoda May 7, 2022, 1:18 AM

#

modern cypress hmm, can you try adding `labelpad=20` (change the number if you see it works to ...

i found the way, but currently it making this error lol

lapis sequoia May 7, 2022, 1:43 AM

#

I am getting accuracies lower when using k-fold cross validation when compared to simple train test split. Should I not use it then?

serene scaffold May 7, 2022, 2:00 AM

#

lapis sequoia I am getting accuracies lower when using k-fold cross validation when compared t...

no. picking the evaluation system that produces the best results has no effect on how well the model would actually perform in a real situation

lapis sequoia May 7, 2022, 2:01 AM

#

So better to use k fold?

serene scaffold May 7, 2022, 2:02 AM

#

k fold generally gives you a more realistic sense of how well your model would perform in real life. though it requires training your model k times, which might not be feasible for models that are very expensive to train.
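
As a sketch of what k-fold does under the hood: the data is cut into k folds, and each fold takes one turn as the test set while the rest is used for training. The helper below is written for illustration only (it does not shuffle, and libraries such as scikit-learn provide their own splitters); the 299-row size echoes the dataset mentioned earlier.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(299, 5))
for train, test in folds:
    print(len(train), len(test))
```

Every row ends up in a test fold exactly once, which is why the averaged score tends to be a steadier estimate than a single train/test split.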

#

@lapis sequoia make sense?

lapis sequoia May 7, 2022, 2:03 AM

#

Sure

#

Also, regarding hyper parameters in decision trees.

#

According to the paper, An empirical study on hyperparameter tuning of decision trees [5] the ideal min_samples_split values tend to be between 1 to 40 for the CART algorithm which is the algorithm implemented in scikit-learn.

#

I found this on the internet

serene scaffold May 7, 2022, 2:05 AM

#

I've never actually used decision trees, but go on.

lapis sequoia May 7, 2022, 2:05 AM

#

I used all these values and looped through them

#

A nested loop for 3 hyperparameters

#

8000 iterations

#

And took the one with best accuracy score

misty flint May 7, 2022, 2:07 AM

#

reinforcement learning is interesting but state space needs to be limited for many real world tasks pithink

lapis sequoia May 7, 2022, 2:07 AM

#

cross_val_score(clf, X, targets, cv=5).mean() is this the right way to do cross validation?

#

or am I doing something wrong

#

because the best accuracy I am getting is at only 1 split of the tree

ocean swallow May 7, 2022, 2:18 AM

#

hey, I want to get some advice for my personal project.

#

I have view count, publish date and date info for youtube videos that I got from kaggle. I got vectors for the titles using Doc2Vec. I use year_day (0-365, for seasonality) and days passed since publishing as features, together with the document embedding vector of each video title, as the X input, and I'm trying to predict the view count.

#

Is there anything weird to you here in the code?

#

X = us_yt[['year_day', 'days_past']]
X = (X - X.mean()) / X.std()
nlp_data = us_yt['title_vector']
nlp_data = np.array([data for data in nlp_data])
Y = us_yt[['view_count']]
nlp_input = Input(shape=(len(nlp_data[0]),))
feature_input = Input(shape=(len(X.columns),))
feature_out = Dense(100, activation='relu')(feature_input)
concat = Concatenate()([feature_out, nlp_input])
hidden1 = Dense(100, activation='relu')(concat)
hidden2 = Dense(40, activation='relu')(hidden1)
output = Dense(1, activation='relu')(hidden2)

model = Model(inputs=[nlp_input , feature_input], outputs=[output])
model.compile(optimizer='Adam', loss='MSE')

#

Not so surprisingly, it is not doing really fine.

#

I think subscriber number is an essential part to this data.

#

Which I lack. But anything else to add?

misty flint May 7, 2022, 3:54 AM

#

ocean swallow I think subscriber number is an essential part to this data.

yeah these features just might not be helpful, especially if your target is view count

#

subscriber number would def help

#

i wonder if video length would be helpful too

#

PikaThink

lapis sequoia May 7, 2022, 4:11 AM

#

Is it better to do feature selection before optimising the hyper parameters or later?

ocean swallow May 7, 2022, 4:17 AM

#

misty flint PikaThink

Want this emote. Need it want it need it have to have it

#

Yeah subs is the biggest predictor of views I think.

#

length I think meeeeh.

misty flint May 7, 2022, 4:33 AM

#

ocean swallow Want this emote. Need it want it need it have to have it

bro...you have access to it

#

the python server has it

#

💀

misty flint May 7, 2022, 4:34 AM

#

ocean swallow length I think meeeeh.

length is meh but i think its a better feature than the vector of the title kekHands

ocean swallow May 7, 2022, 4:45 AM

#

misty flint length is meh but i think its a better feature than the vector of the title <:ke...

is this serious lol? Title is like the first thing that 100% pulls the person?

#

like say videos with Will Smith will probably have 100 times the views of say Grasshopper

#

or will smith slap

lilac yew May 7, 2022, 6:35 AM

#

Has anyone tried importing data into DB from Grafana, Loki & Prometheus?
Need some tips or good approaches

arctic wedgeBOT May 7, 2022, 7:11 AM

#

Hey @cinder matrix!

You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.

cinder matrix May 7, 2022, 7:13 AM

#

Hi guys, Im trying to do reverse summarisation, if you consider the script, it takes the input document and the output is a summary. Is it a good idea to reverse the input and output so the model then takes in a summary to predict the document? Please @ when replying https://paste.pythondiscord.com/ixisafedul

modest mulch May 7, 2022, 10:36 AM

#

cinder matrix Hi guys, Im trying to do reverse summarisation, if you consider the script, it t...

Might be, i think it would be harder to train in general

#

Try it

cinder matrix May 7, 2022, 10:36 AM

#

modest mulch Try it

i did the result is awful XD

#

i tried same on transformer and its thousand time better

modest mulch May 7, 2022, 10:37 AM

#

yea expected lmao

cinder matrix May 7, 2022, 10:37 AM

#

and only needed really few data to converge

#

is there a technical reason why lstm is bad for anti-summarisation?

bitter ivy May 7, 2022, 10:41 AM

#

Hello can I ask, why do I get all K1 devices (red dots) moved outside of range (time) on x axis? How can I fix it, so that time is from 0:00 to 24:00, and red dots (K1 devices) are within the range. Not separated.

Here is the image: https://i.imgur.com/NA37A3n.png

Here is the code:

import plotly.express as px
import pandas as pd

data = pd.read_excel('log_data.xls')
data = data.sort_values(by=['Time'])
print(data)
fig = px.scatter(data, x="Time", y="Date", color="Device")
fig.show()

Here are the data:

Index Device       Date      Time
53      53     L1 2022-02-06  00:01:00
124    124     L1 2022-02-13  00:02:00
83      83     L1 2022-02-09  00:09:00
54      54     L1 2022-02-06  00:22:00
0        0     L1 2022-02-01  00:38:00
33      33     L1 2022-02-04  01:22:00
60      60     L1 2022-02-07  02:22:00
44      44     L1 2022-02-05  03:03:00
94      94     L1 2022-02-10  03:44:00
61      61     L1 2022-02-07  04:20:00
73      73     L1 2022-02-08  05:03:00
62      62     L1 2022-02-07  05:15:00
84      84     L1 2022-02-09  05:59:00
55      55     L1 2022-02-06  06:00:00
114    114     L1 2022-02-12  06:01:00
105    105     L1 2022-02-11  06:03:00
23      23     L1 2022-02-03  06:06:00
1        1     L1 2022-02-01  06:20:00
13      13     L1 2022-02-02  06:22:00
74      74     L1 2022-02-08  06:28:00
14      14     L1 2022-02-02  06:28:00
34      34     L1 2022-02-04  06:34:00
56      56     K1 2022-02-06  06:39:00
85      85     K1 2022-02-09  06:48:00
115    115     K1 2022-02-12  06:52:00
95      95     L1 2022-02-10  06:59:00
106    106     K1 2022-02-11  07:02:00
2        2     K1 2022-02-01  07:02:00
116    116     L1 2022-02-12  07:03:00
96      96     K1 2022-02-10  07:04:00
..     ...    ...        ...       ...

#

Thanks.

craggy tiger May 7, 2022, 12:13 PM

#

Hey Folks, can someone recommend a guide for implementing Business Intelligence in a corporation? An up-to-date BI structure with pros and cons as well as role definitions and workforce planning?

woven sonnet May 7, 2022, 2:18 PM

#

excuse me, anyone here know about arc consistency map coloring? i want to ask something ty

serene scaffold May 7, 2022, 3:16 PM

#

woven sonnet excuse me, anyone here know about arc consistency map coloring? i want to ask so...

you're more likely to get help if you ask your question, rather than if anyone knows about the topic

woven sonnet May 7, 2022, 3:27 PM

#

serene scaffold you're more likely to get help if you ask your question, rather than if anyone k...

Yeah i want to ask something about that

#

how do i make a user input for map coloring using arc consistency? im newbie and i just know the simple algorithm like
arcs = [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'a'), ('a', 'c')]

domains = {
    'a': [2, 3, 4, 5, 6, 7],
    'b': [4, 5, 6, 7, 8, 9],
    'c': [1, 2, 3, 4, 5]
}
constraints = {
    ('a', 'b'): lambda a, b: a * 2 == b,
    ('b', 'a'): lambda b, a: b == 2 * a,
    ('a', 'c'): lambda a, c: a == c,
    ('c', 'a'): lambda c, a: c == a,
    # a dict keeps only the last value per key, so both bounds go in one lambda
    ('b', 'c'): lambda b, c: c - 2 <= b <= c + 2,
    ('c', 'b'): lambda c, b: c - 2 <= b <= c + 2
}

thank you
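
One possible way to wire user input into this, sketched for plain map colouring (neighbouring regions must get different colours). The border format `a-b, b-c`, the colour names and the pre-restricted domains are assumptions for the example, and the AC-3 loop here is a minimal version, not a full solver:

```python
from collections import deque

def parse_borders(text: str) -> list[tuple[str, str]]:
    """Turn 'a-b, b-c' into arcs in both directions."""
    arcs = []
    for pair in text.split(","):
        left, right = (name.strip() for name in pair.split("-"))
        arcs += [(left, right), (right, left)]
    return arcs

def ac3(domains: dict[str, list[str]], arcs: list[tuple[str, str]]) -> bool:
    """Prune domains until every arc is consistent; False if a domain empties."""
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()
        # Keep only colours of x that leave y at least one different colour.
        kept = [cx for cx in domains[x] if any(cx != cy for cy in domains[y])]
        if len(kept) < len(domains[x]):
            domains[x] = kept
            if not kept:
                return False
            # x changed, so arcs pointing at x must be rechecked.
            queue.extend((z, w) for (z, w) in arcs if w == x and z != y)
    return True

# This string could come from input("borders: ") instead of being hard-coded.
borders = "a-b, b-c, c-a"
arcs = parse_borders(borders)
regions = sorted({region for arc in arcs for region in arc})
domains = {region: ["red", "green", "blue"] for region in regions}
domains["a"] = ["red"]           # pretend the user fixed one region's colour
domains["b"] = ["red", "green"]  # and restricted another

consistent = ac3(domains, arcs)
print(consistent, domains)
```

Arc consistency only prunes values; on larger maps it usually has to be combined with a search such as backtracking to pick the final colours.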

cyan sierra May 7, 2022, 3:41 PM

#

Hey guys, in pandas, why should I use .apply() to apply a function on rows in specific columns instead of using the function directly on the columns?

serene scaffold May 7, 2022, 4:07 PM

#

cyan sierra Hey guys, in pandas, why should I use .apply() to apply a function on rows in sp...

what do you mean, the function direction of the columns? usually you want to avoid apply as much as possible.

cyan sierra May 7, 2022, 4:12 PM

#

serene scaffold what do you mean, the function direction of the columns? usually you want to avo...

I meant directly sorry

bold timber May 7, 2022, 4:13 PM

#

Hi, I have a question: Why I get an error like this? and how to fix this error?

serene scaffold May 7, 2022, 4:13 PM

#

cyan sierra I meant directly sorry

func(series) doesn't have the same semantics as series.apply(func). the first takes the series as one argument for the function. the second is basically the same as pd.Series(func(elem) for elem in series)

cyan sierra May 7, 2022, 4:19 PM

#

serene scaffold `func(series)` doesn't have the same semantics as `series.apply(func)`. the firs...

I see! Thank you. In other words, func(series) is trying to apply the function onto the series as a whole? whereas .apply() is trying to apply a function to each element of the series?

serene scaffold May 7, 2022, 4:20 PM

#

cyan sierra I see! Thank you. In other words, ```func(series)``` is trying to apply the func...

func(series) just passes the series to the function. there's no special behavior here.

series.apply(func) does func(elem) for each element in the Series and returns a new Series with the results.

#

the problem is that apply is not optimized. it is just for convenience if there's no native pandas or numpy functionality for what you are trying to do.
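
A small sketch of the difference, using `len` as a stand-in function and a made-up three-element Series (both are illustrative choices, not from the chat). It also shows the native `.str.len()` method as the kind of built-in alternative to prefer when one exists.

```python
import pandas as pd

# Illustrative data, not from the conversation.
series = pd.Series(["a", "bb", "ccc"])

# func(series): the whole Series is passed as one argument,
# so len() reports the number of elements.
whole = len(series)

# series.apply(func): len() is called once per element and the
# results are collected into a new Series.
per_element = series.apply(len)

# A native pandas method that does the per-element work without apply.
vectorized = series.str.len()

print(whole)                 # 3
print(per_element.tolist())  # [1, 2, 3]
print(vectorized.tolist())   # [1, 2, 3]
```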

cyan sierra May 7, 2022, 4:23 PM

#

serene scaffold `func(series)` just passes the series to the function. there's no special behavi...

What do you mean not optimized?

serene scaffold May 7, 2022, 4:23 PM

#

cyan sierra What do you mean not optimized?

native pandas and numpy functions/methods are orders of magnitude faster than equivalent Python code

cyan sierra May 7, 2022, 4:24 PM

#

serene scaffold native pandas and numpy functions/methods are orders of magnitude faster than eq...

I see, thanks. So the only reason why apply should be avoided is because of computational efficiency?

serene scaffold May 7, 2022, 4:25 PM

#

cyan sierra I see, thanks. So the only reason why apply should be avoided is because of comp...

yes. take a look at this.

In [3]: %timeit mean(list(range(10_000_000)))
2.37 s ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit np.arange(10_000_000).mean()
14.7 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#

14 milliseconds vs two whole seconds.
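
The same comparison can be tried for `.apply()` itself. This is a rough sketch with a hypothetical `double` function and an arbitrary Series size; the actual timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

series = pd.Series(np.arange(100_000))

def double(x):
    # Hypothetical per-element function, used only for the comparison.
    return x * 2

# Both approaches compute the same values.
via_apply = series.apply(double)
vectorized = series * 2

# Time five runs of each; apply calls double() once per element in Python,
# while series * 2 runs as a single NumPy operation.
t_apply = timeit.timeit(lambda: series.apply(double), number=5)
t_vectorized = timeit.timeit(lambda: series * 2, number=5)

print(f"apply:      {t_apply:.4f} s")
print(f"vectorized: {t_vectorized:.4f} s")
```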

cyan sierra May 7, 2022, 4:32 PM

#

serene scaffold yes. take a look at this. ```py In [3]: %timeit mean(list(range(10_000_000))) 2....

I see! Thank you so much

bitter ivy May 7, 2022, 5:31 PM

#

Sorry but did I ask my question in the right room? No one here understands plotly and pandas? It is fairly basic, but I do not get it.

agile cobalt May 7, 2022, 5:52 PM

#

bitter ivy Hello can I ask, why do I get all K1 devices (red dots) moved outside of range (...

What is the type of `time`? if it's a string (`object`), make sure that no rows have any leading whitespace

#

also, whether or not it's basic, staff members here are just volunteers. there's no guarantee / obligation that your question will be answered even if you ask it correctly

bitter ivy May 7, 2022, 5:59 PM

#

agile cobalt What is the type of `time`? if it's a string (`object`), make sure that no rows ...

Time is: hh:mm:ss

#

That is all

#

it is a string

#

but with no whitespaces

agile cobalt May 7, 2022, 6:03 PM

#

just to make sure, would you mind checking
```py
data['Time'].str.len().value_counts()
```

pseudo wren May 7, 2022, 6:06 PM

#

i'm not sure how i should approach working with this data

#

i am working on the titanic dataset

#

right now i'm just doing the basic data cleaning and data observation

#

i checked for correlation and so far, nothing really has a strong relationship to each other

#

not sure what story i should tell about the data

mighty spoke May 7, 2022, 6:17 PM

#

Hi, if I have an FFT graph of power vs frequency, does anyone know how I can find the time period and the fundamental frequency?

serene scaffold May 7, 2022, 6:35 PM

#

@pseudo wren if you know the history of the titanic, do you know which groups of people were more likely to be saved?

pseudo wren May 7, 2022, 6:36 PM

#

serene scaffold @pseudo wren if you know the history of the titanic, do you know which ...

See that’s the thing

#

Based on history

#

I could probably answer that question

#

But I don’t think I can answer it with the data provided

#

The correlation matrix suggests that nothing really contributed to their survival in a meaningful way

serene scaffold May 7, 2022, 6:37 PM

#

as the story goes, women, children, and people in higher-classed cabins were more likely to be saved. so you should see if the data bears that out.

pseudo wren May 7, 2022, 6:37 PM

#

I checked for that

#

But the heatmap has it as cold

#

And I can’t use a different titanic dataset

#

So I’m not sure what to say other than

#

“The data suggests nothing meaningfully contributed to their survival”

serene scaffold May 7, 2022, 6:38 PM

#

did you try grouping by saved/not saved and doing value counts on gender?

#

also is this just the kaggle dataset?

pseudo wren May 7, 2022, 6:38 PM

#

Oh this is true I could do value counts on gender

#

Yes it is from Kaggle

#

It’s from their titanic competition thing

#

So I could throw saved and not saved together

#

And then count how many more women were saved than men

#

That could be interesting?

#

Or the likelihood of being saved based on gender

serene scaffold May 7, 2022, 6:40 PM

#

pseudo wren Or the likelihood of being saved based on gender

In [6]: df.groupby('Survived')['Sex'].value_counts()
Out[6]:
Survived  Sex
0         male      468
          female     81
1         female    233
          male      109

#

this shows pretty clearly that women were strongly favored for not dying
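
To put a number on "likelihood of being saved based on gender", the counts above can be turned into survival rates. This sketch rebuilds a small table from those four counts only; the column names `died` and `survived` are made up for the example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.DataFrame(
    {"died": [468, 81], "survived": [109, 233]},
    index=pd.Index(["male", "female"], name="Sex"),
)

# Survival rate = survivors / everyone of that sex.
rate = counts["survived"] / (counts["died"] + counts["survived"])

print(rate.round(3))
# male      0.189
# female    0.742
```

On the full dataframe, and assuming `Survived` is coded 0/1 as in the output above, `df.groupby('Sex')['Survived'].mean()` should give the same rates directly.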

pseudo wren May 7, 2022, 6:41 PM

#

Yeah I could work with that

#

Maybe try one hot encoding some values as well

#

See if that helps anything

serene scaffold May 7, 2022, 6:42 PM

#

In [8]: df.groupby('Survived')['Pclass'].value_counts().unstack()
Out[8]:
Pclass      1   2    3
Survived
0          80  97  372
1         136  87  119

most people in third class died. most people in first class lived.

pseudo wren May 7, 2022, 6:43 PM

#

Ahhh okay

#

I just need to manipulate the data a little more then

serene scaffold May 7, 2022, 6:44 PM

#

In [9]: df.groupby(['Sex', 'Survived'])['Pclass'].value_counts().unstack()
Out[9]:
Pclass            1   2    3
Sex    Survived
female 0          3   6   72
       1         91  70   72
male   0         77  91  300
       1         45  17   47

Third class men who died are the largest group. wow
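
Raw counts are hard to compare because the groups have different sizes, so here is a sketch that converts the table above into survival rates per sex and class. The two frames are rebuilt by hand from the counts shown; nothing else is assumed about the dataset.

```python
import pandas as pd

classes = pd.Index([1, 2, 3], name="Pclass")

# Counts copied from the grouped value_counts() output above.
died = pd.DataFrame({"female": [3, 6, 72], "male": [77, 91, 300]}, index=classes)
survived = pd.DataFrame({"female": [91, 70, 72], "male": [45, 17, 47]}, index=classes)

# Element-wise survival rate for each (class, sex) cell.
rate = survived / (survived + died)

print(rate.round(2))
#         female  male
# Pclass
# 1         0.97  0.37
# 2         0.92  0.16
# 3         0.50  0.14
```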

pseudo wren May 7, 2022, 6:44 PM

#

I think I maybe have a little more direction now then

#

if I can group those who survived

#

and demonstrate the relationship between those

#

I can train my model to predict who would most likely be saved

serene scaffold May 7, 2022, 6:45 PM

#

yessss

pseudo wren May 7, 2022, 6:45 PM

#

out of men and women or 1st and 3rd class

serene scaffold May 7, 2022, 6:45 PM

#

though a problem with the titanic dataset is that there really aren't that many instances

pseudo wren May 7, 2022, 6:45 PM

#

yeah i think they made it that way so it could be simple for beginners

serene scaffold May 7, 2022, 6:46 PM

#

it's not that they wanted there to be fewer instances. it's just that there were only so many people on the boat

#

and the fates of all of them might not be known

pseudo wren May 7, 2022, 6:46 PM

#

ah that makes sense

#

being a data scientist is very interesting in that it teaches you about the world more than just the average profession

#

so for classification

#

i can make a new dataframe

#

that contain those who have survived/died

#

with their genders and classes

serene scaffold May 7, 2022, 6:47 PM

#

I wonder how age played into it. were old men more likely to survive? idk

pseudo wren May 7, 2022, 6:47 PM

#

see that's what i was hoping to see

#

but i tried correlating the gender and age values

#

nothin

#

i may need to one hot encode

serene scaffold May 7, 2022, 6:48 PM

#

because another problem with this dataset is that you have gender and class, and maybe also age. and beyond that, you can't really know what real world circumstances might have enabled or prevented a given person from getting to a lifeboat

#

if one random first class woman tripped and got trampled by other passengers, or something, that's not going to show up in this data.

pseudo wren May 7, 2022, 6:49 PM

#

yeah that's what sort of made me feel discouraged about working with this dataset at first

serene scaffold May 7, 2022, 6:49 PM

#

well, this is all part of the learning

#

it's a very interesting dataset. look at all the discussion we've had

pseudo wren May 7, 2022, 6:49 PM

#

i do think it's very interesting!

#

I just have a lot of learning to do on how to exactly manipulate the data

#

I know what my goal is

serene scaffold May 7, 2022, 6:50 PM

#

by the way, don't one-hot encode age

#

one-hot encoding is for nominal features.

pseudo wren May 7, 2022, 6:50 PM

#

I want to be able to demonstrate the relationship between the values, and classify them so that i can better train my model

pseudo wren May 7, 2022, 6:50 PM

#

serene scaffold one-hot encoding is for nominal features.

hm

#

okay good to know

serene scaffold May 7, 2022, 6:51 PM

#

things like gender and passenger class are categories; they don't have numeric relationships to each other

#

but someone who is 40 has "twice as much age" as someone who's 20
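
A minimal sketch of that split, using `pd.get_dummies` on a few made-up rows (the values are hypothetical; only the column names follow the Kaggle file). `Sex` and `Pclass` are one-hot encoded and `Age` is left as a number.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Pclass": [3, 1, 2, 3],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# One-hot encode only the categorical columns; Age passes through unchanged.
encoded = pd.get_dummies(df, columns=["Sex", "Pclass"])

print(encoded.columns.tolist())
# ['Age', 'Sex_female', 'Sex_male', 'Pclass_1', 'Pclass_2', 'Pclass_3']
```

This may also be why the correlation heatmap looked cold: a text column like `Sex` has no numeric values to correlate until it is encoded.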

pseudo wren May 7, 2022, 6:51 PM

#

that's fair

#

i think i have a better idea of what i should look for at least

serene scaffold May 7, 2022, 6:52 PM

#

lemon_hyperpleased

#

https://tenor.com/view/baby-yes-well-done-keep-it-up-gif-11035050

misty flint May 7, 2022, 6:56 PM

#

Praise

misty flint May 7, 2022, 6:56 PM

#

pseudo wren being a data scientist is very interesting in that it teaches you about the worl...

yeah i think you mightve gotten there if you did a bit more EDA or data visualization and asked more questions about the data

#

like

#

what if you had asked initially

#

did more women survive than men?

#

and then gone about checking your hypothesis with the data

#

my point is its not necessarily about getting it right the first time, but if you get into the habit of asking initial questions when you receive a dataset, you will get better at getting those questions answered

#

and then move towards the modeling that you would like to get to

#

afterwards

#

since it would help you set up for that

#

just dont get stuck in exploration land forever. thats another trap people fall into

#

kekHands

pseudo wren May 7, 2022, 7:01 PM

#

misty flint what if you had asked initially

I think the issue was that I was stumped on what to ask or explore. Or how to manipulate it. I had an initial question that was somewhat vague but it was asking what determining factors led to someone’s survival. I expected to see values like age to be highly correlated to survival rates but they weren’t.

#

So after a minute I just sort of lost confidence in it

bitter ivy May 7, 2022, 7:06 PM

#

agile cobalt just to make sure, would you mind checking ```py data['Time'].str.len().value_co...

Hm, it says: dtype: int64. There is the problem. How would I convert to string?

agile cobalt May 7, 2022, 7:07 PM

#

sorry but go read the user guides & documentation at least enough to understand what that piece of code does before anything else...

#

you must build a basic understanding of how pandas works first, otherwise you'll never be able to debug anything yourself

#

||in this case, the dtype is of the series output by the `value_counts` method||

misty flint May 7, 2022, 7:14 PM

#

pseudo wren So after a minute I just sort of lost confidence in it

thats okay. its difficult to see easy patterns irl

#

sometimes you get lucky tho kekHands

pseudo wren May 7, 2022, 7:24 PM

#

misty flint thats okay. its difficult to see easy patterns irl

Yeah I think for me, because I’m a little new at this, if I don’t see an easy pattern at first glance I feel like I have to read the entire sklearn bible in order to figure out how to manipulate the data the way I need

#

I’m not super confident in sklearn yet. I just started using it.

misty flint May 7, 2022, 7:35 PM

#

sklearn is more for modeling

#

i think for the data manipulation youre looking for, i would turn to pandas

#

there is a lot you can do with pandas

#

just to explore the data more

#

i heard some good things about wes mckinney's book python for data analysis

#

it also helps that hes the creator of pandas

#

kekHands

bitter ivy May 7, 2022, 8:04 PM

#

agile cobalt ||in this case, the dtype is of the series output by the `value_counts` method||

If I print your line of code I get: Series([], Name: Time, dtype: int64). So I think I am checking the length of every object (str). What should I get if it is not?

agile cobalt May 7, 2022, 8:06 PM

#

it should show how many times each length appears.
if it's something like `8: (number of rows)`, then all of them have 8 characters
if it's something like `8: (number of L)`, `9: (number of K)`, then some of them may have whitespace
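
For reference, a sketch of what that check looks like on a tiny made-up `Time` column where one value has a stray leading space. The sample values are hypothetical; `str.strip()` is one possible cleanup, not something suggested in the chat.

```python
import pandas as pd

# Hypothetical sample: the second value has a leading space.
data = pd.DataFrame({"Time": ["12:00:01", " 12:00:02", "12:00:03", "12:00:04"]})

# How many rows have each string length.
lengths = data["Time"].str.len().value_counts()
print(lengths.to_dict())  # {8: 3, 9: 1}

# Removing surrounding whitespace makes every value 8 characters long.
data["Time"] = data["Time"].str.strip()
cleaned = data["Time"].str.len().value_counts()
print(cleaned.to_dict())  # {8: 4}
```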

misty flint May 7, 2022, 8:29 PM

#

this is very good for understanding the bias variance tradeoff http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

A visual introduction to machine learning, Part II

Learn about bias and variance in our second animated data visualization.

serene scaffold May 7, 2022, 10:59 PM

#

Is it possible that they could be talking about more than one animal at a time?

#

you will want to see what words come up in the context of only one animal.

#

my guess is that in your case, there aren't going to be enough words for that to end up making a difference

lapis sequoia May 7, 2022, 11:14 PM

#

How could I do feature scaling combined with cross validation. Because I have no X_train when doing cross validation. So what do I feature scale. The whole X?

serene scaffold May 8, 2022, 12:00 AM

#

lapis sequoia How could I do feature scaling combined with cross validation. Because I have no...

when you're doing k fold cross validation, the data gets partitioned into k groups, and each one takes a turn being the test data. each turn that a group isn't part of the test data, it is part of the train data.
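
One common way to combine the two in scikit-learn (not something spelled out in the chat) is to put the scaler inside a pipeline, so it is re-fit on the training part of every fold and never sees that fold's test data. A minimal sketch on synthetic data, with `LogisticRegression` picked arbitrarily as the model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real X and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is a pipeline step, so each fold fits it on its own
# training split and only transforms the held-out split.
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross validation on the unscaled X; scaling happens per fold.
scores = cross_val_score(model, X, y, cv=5)

print(len(scores))    # 5
print(scores.mean())
```

Scaling the whole `X` up front would let the held-out fold influence the scaler's mean and variance, which is the leak the pipeline avoids.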

lapis sequoia May 8, 2022, 12:01 AM

#

Can someone help me understand why my performance is not getting better with feature selection.

lapis sequoia May 8, 2022, 12:01 AM

#

serene scaffold when you're doing k fold cross validation, the data gets partitioned into k grou...

I know the theory behind it.

serene scaffold May 8, 2022, 12:01 AM

#

Sorry, but I will not look at this. I will only look at actual text.