#data-science-and-ml
oh my fucking god
it was right all along, since I was using apply
but it didn't look right
because I didn't know that pandas has special goddamn support for lists
and if you have a column of lists, it will visually "unwrap" these lists, each element into a row
This is one row 😩
no, there's no such automatic thing for lists
and no, that's not a one row...
you need explicit .explode to roll lists to rows
that screenshot implies a MultiIndex frame
so I guess the issue then is that apply explodes a column of mine without asking
(or however it's called, here)
well, without an MRE, i don't know what to write
because not sure what function you're applying to what kind of dataframe :|
test_df = pd.DataFrame.from_dict(dict(name=["A", "A", "A", "B", "B"], thing=["1", "2", "3", "1", "2"]))
def test_f(group):
    lst = [row.thing for row in group.itertuples()]
    return pd.DataFrame.from_dict(dict(
        name=group.iloc[0].name,
        lst=lst
    ))
test_df.groupby("name").apply(test_f)
here's a simplified example. Each group is collapsed into a 1-row dataframe with a list-type column.
The result becomes multiindex:
MultiIndex([('A', 0),
            ('A', 1),
            ('A', 2),
            ('B', 0),
            ('B', 1)],
           names=['name', None])
oh hey, I got it
the way to do it is pretty counterintuitive to me, though:
    return pd.Series(dict(
        name=group.iloc[0].name,
        lst=lst
    ))
if apply returns a Series, it's not unwrapped into a multiindex. If it returns a dataframe, it is.
I wonder if there's even a mention of that in the docs.
the distinction is not Series vs DataFrame-returning function per se; it's about whether what you return respects the original index of the apply-e group
pandas will try to be helpful by putting another level of index consisting of the grouper keys so you can identify which group led to which new indexing scheme.
if you want, you can disable this via passing group_keys=False to .groupby(...).
in your case, the 2 groups had the indices [0, 1, 2] and [3, 4]; but what you returned from the function per group did not preserve the corresponding indexes fully, e.g., you returned [0, 1, 2] for "A" (fine) but [0, 1] for "B" (not so fine), hence the multiindex appearing.
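a small sketch of the above, assuming a recent pandas. `relabel` is a made-up helper that deliberately throws away the group's original index, and the `[["thing"]]` selection is only there so the grouping column is not passed into the function:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "thing": ["1", "2", "3", "1", "2"],
})

def relabel(group):
    # hypothetical helper: returns the group with a fresh 0..n-1 index,
    # so the result no longer lines up with the original row labels
    return group.reset_index(drop=True)

# default: the group keys are added as an extra index level
with_keys = df.groupby("name")[["thing"]].apply(relabel)

# group_keys=False: the per-group results are just concatenated
without_keys = df.groupby("name", group_keys=False)[["thing"]].apply(relabel)

# if the goal is one row per group holding a list, this avoids the issue entirely
collapsed = df.groupby("name")["thing"].apply(list)

print(with_keys.index.nlevels)     # 2
print(list(without_keys.index))    # [0, 1, 2, 0, 1]
print(collapsed)
```

the last line is probably the shortest way to get one list per group; whether it fits depends on what else the real function needs to compute.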
Hmm, interesting
Hey everyone, I have encountered a unique issue and I could really use some suggestions on how to resolve it as I have been stuck for a few days. I am currently trying to iterate through a df and create a new column that contains the value of the difference between two columns. The issue is that I need to find the difference between two separate columns on different rows. (so the difference between column A on row 1, and Column B on row 2, and store the difference value in column C on row 1. ) Does anyone have any experience doing this?
hi, from that explanation, it seems like df["new"] = df["A"] - df["B"].shift(-1) should work?
.shift(-1) will "pull up" the column B one row above; then, when subtracted from df["A"] it is as if you're subtracting row k + 1 of B from row k of A
Thank you so much! I will give it a try
you can also .shift() or .shift(1) to shift forward instead
!e ```python
import pandas as pd
data = pd.DataFrame({
    'x': [1,2,3],
    'y': [33,22,11],
}, index=list('abc'))
print(data['x'])
print(data['x'].shift(-1))
print(data['x'].shift())  # default value is 1
```
@desert oar :white_check_mark: Your 3.10 eval job has completed with return code 0.
001 | a 1
002 | b 2
003 | c 3
004 | Name: x, dtype: int64
005 | a 2.0
006 | b 3.0
007 | c NaN
008 | Name: x, dtype: float64
009 | a NaN
010 | b 1.0
011 | c 2.0
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/ecegevatub.txt?noredirect
note that shift shifts the data while keeping the index in place. this is perfect for doing computations exactly like what @solid quail described
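applied to the original question (A on row k minus B on row k + 1, stored in C on row k), a minimal sketch with made-up numbers; the column names A, B, C are taken from the question, the values are invented:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]})

# B.shift(-1) lines up B from the next row with A from the current row
df["C"] = df["A"] - df["B"].shift(-1)

print(df)
# the last row of C is NaN because there is no "next row" of B to subtract
```

no loop over rows is needed; the last row has nothing to pair with, so it ends up as NaN and the column becomes float.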
If you have a regression problem, where the output can range from 1000 to 10 billion, is it a good idea to scale the output down? (atm i'm using the sklearn minmaxscaler)
And then you just did one hot?
That's so weird cuz I did just that and my model is just getting stuck at a local minimum where it outputs the most frequent character in the text
In my case it's either e or spaces
And when I researched on Google, it looks like people use more fancy encoding/decoding methods to avoid this issue
E truly is the most magnificent
even our computers agree
class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.rand(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
X, y = spiral_data(samples=100, classes=3)
dense1 = Layer_Dense(2, 3)
dense1.forward(X)
print(dense1.output.shape)
Any idea why dense1 output is in the shape (300, 3)? Shouldn't it be 200 since I'm inputting 100 different x and y values for each dot?
Where does the 300 come from?
any recommendations on courses with projects or a course that goes over practical parts to get me set up creating projects?
im currently waiting on financial aid for the 2nd Andrew Ng course in the ml specialization but since it takes 2 weeks im looking for something while i wait
i recommend "just doing projects", e.g. kaggle. don't worry too much about high score, focus on figuring out workflows and tools that you like using and feel comfortable with.
and just google and watch videos when i need?
no, avoid videos and google and medium.com
so books? i got access to O'Reilly learning for free rn
practice reading actual software docs, and get a textbook if you don't already have one
they got all their books and some videos associated with the books
u recommend skipping that too?
no, those are good books. the videos are probably fine
OpenIntro has a statistics textbook too
ye im doing the math as im preparing for a masters in cs and focus on ml
i don't think this is doing what you think it's doing. you're saying the weights are of size 2 x 3 and the biases of size 3. what does your data look like? you're dotting X with the weights, so that seems to imply that X is of size 300 x 2
there is just so much "blogspam" out there on beginner-level ML that it's very difficult to avoid as a newbie
just wanna get into the practical parts beforehand
yeah, then just spend your time messing with data
if you are going to read blog junk, at least make sure it's from towardsdatascience.com, their blog junk is still somewhat junky but at least the minimum quality is somewhat high
rando youtube videos are unlikely to be useful
fast.ai is also a free online course and collection of resources
The biases are 0 because of the np.zeros function
I think randn is better for initialization no?
that doesn't matter. what's the shape of X?
i'll check it out and try some of the o'reilly books
it's 2 dimensional, 1 x value and 1 y value
aside from learning ml, any recommendations on learning how to actually work with data @desert oar
just learn by practise?
learn statistics, read about data visualization, learn about databases, learn about missing data imputation (subset of statistics), practice
can you just print the shape? and what do you expect spiral_data to return with the parameters you gave it?
because matrix products between matrices of sizes m x n and n x k are of size m x k, so naturally your X is of size 300 x 2
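to make the shape rule concrete, a small numpy sketch with random stand-in data (the real X comes from spiral_data, which is not used here):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(300, 2))          # stand-in for the inputs: 300 points, 2 coordinates each
W = 0.01 * rng.normal(size=(2, 3))     # weights: n_inputs x n_neurons
b = np.zeros((1, 3))                   # one bias per neuron, broadcast over rows

out = X @ W + b                        # (300, 2) @ (2, 3) -> (300, 3)
print(out.shape)
```

the number of rows of the output always equals the number of rows of X, so a (300, 3) output means X already had 300 rows before it reached the layer.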
X.shape returns ```py
(300, 2)
```
indeed. so the question is what did you expect the shape to be given the parameters you passed to spiral data
Ah... I thought it would be 100 since I specified samples=100
maybe I misunderstood what samples=100 means
what does spiral data return?
what do you mean?
... what it does return lol
what is the data supposed to look like?
you specified 3 classes, too
how are those supposed to be returned
I just told you the shape of the X values, the y values(labels) are either 0, 1, or 2.
an example of the X value would be:
[-8.67685974e-01 -3.85566771e-01]
[-9.11101043e-01 3.01196396e-01]
opinions on balancing the test set in binary classification of medical record data?
appreciate it!
yes dude, that's not what i'm asking. you were expecting it to return something of shape 100 x 2. why?
what do the classes mean?
how are the classes being concatenated?
i'm asking you what your code means
X, y = spiral_data(samples=100, classes=3)
you wrote this, yes? what does this mean? what do the classes do there?
I was expecting spiral_data to return 100, 2 because I thought samples=100 would mean 100 different instances of X
what does spiral_data return? did you write this function? is it from another lib?
ah
it's from library called nnfs, they created spiral_data where X returns (x, y) values which are points, and y returns classes(0, 1, or 2) which show what spiral the dot(X) belongs to
And I thought there would be 100 rows because I specified spiral_data(samples=100) and not 300 rows
isn't that what you'd expect samples=100 to mean?
i'm looking for the docs for this func but can't find anything useful
on top of that, does having the values range from 0-1 affect how mse is calculated? (the square of a decimal between 0 and 1 will be less than the value itself)
one thing you can do is just tell us the shape and type of the thing it returns
using print
print(X.shape, y.shape)
print(type(X), type(y))
X, y = spiral_data(samples=100, classes=3)
print(X.shape, y.shape)
returns
(300, 2) (300,)
they're both numpy.ndarray
oh so the input shape is 300?
Yeah
wait this doesn't make sense
how so?
if you have 100 samples and the input shape is a vector of size 300
then I would expect the shape to be (100, 300)
the only question is why is it 300 x 2 instead of 100 x 2
oh
that's what I'm saying
let me show a different output with different sample and class values
i know, i'm just getting abdulhaleem up to speed. yeah, change classes to 1
maybe it generates n samples per class
i can't find the docs anywhere
X, y = spiral_data(samples=10, classes=2)
print(X.shape, y.shape)
returns
(20, 2) (20,)
and ```py
X, y = spiral_data(samples=100, classes=1)
print(X.shape, y.shape)
```
prints
(100, 2) (100,)
ok, so it generates n samples per class indeed
actually yeah that makes sense
I'm guessing that the spiral_data returns x,y coordinates representing a spiral
hence why it's 100, 2
actually nvm then what does classes mean
right, but when I do samples=100, classes=3 why does it return (300, 2)
oh interesting
well then I'd expect the matrix multiply to return shape n_samples * num_classes
opinions in balancing test set?
so it's returning a new x, y coordinate for each class per sample...
yes
what I was expecting it to return was 100 samples, with 33% of them in class 1, 33% in class 2, and 34% in class 3
idk why I was expecting it to act like that but yeah it doesn't act like that
i would've expected that too
yeah that's pretty confusing
yeah, thanks for your help Edd and Abdulhaleem
bad design choice if you ask me
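the two readings of samples can be put side by side. this is a sketch with made-up helper names (`points_per_class`, `points_in_total`) and random points, not the nnfs source; it only mimics the shapes observed above:

```python
import numpy as np

def points_per_class(samples, classes, seed=0):
    # hypothetical: "samples" means points PER class (matches the observed shapes)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(samples * classes, 2))
    y = np.repeat(np.arange(classes), samples)
    return X, y

def points_in_total(samples, classes, seed=0):
    # hypothetical: "samples" means points in TOTAL, split as evenly as possible
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(samples, 2))
    y = np.arange(samples) % classes
    return X, y

X1, y1 = points_per_class(samples=100, classes=3)
X2, y2 = points_in_total(samples=100, classes=3)

print(X1.shape, y1.shape)   # (300, 2) (300,)
print(X2.shape, y2.shape)   # (100, 2) (100,)
print(np.bincount(y2))      # [34 33 33]
```

the second variant is the 34/33/33 split that was expected; the first is what the observed output corresponds to.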
anyone know why a neural net would do this if the training data was oversampled to balance?
random forest manages just fine to reach 0.6 recall and 0.1 prec for class 1
I'd like to use NLP to generate answers in a simple oracle bot, but I haven't dealt with ML much yet. What are my options?
I haven't done any NLP but I've heard of GPT 2 which could be a good option
actually although using categorical sampling at inference time significantly improved my results, it turns out my biggest oopsie was thinking that torch.nn.functional.cross_entropy accepts a probability distribution as input when it actually accepts logits
now my model actually works
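why that mistake hurts can be shown without torch. the sketch below is a numpy stand-in for a loss that, like torch.nn.functional.cross_entropy, applies log-softmax to its input; the function names are made up for illustration:

```python
import numpy as np

def log_softmax(z):
    # numerically stable log-softmax along the last axis
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy_from_logits(logits, target):
    # mean negative log-likelihood of the target classes; expects raw logits
    return -log_softmax(logits)[np.arange(len(target)), target].mean()

logits = np.array([[8.0, 0.0, 0.0]])   # a very confident, correct prediction
target = np.array([0])

probs = np.exp(log_softmax(logits))    # what you get if you softmax first

correct = cross_entropy_from_logits(logits, target)
double_softmax = cross_entropy_from_logits(probs, target)

print(correct)          # close to 0
print(double_softmax)   # stuck above 0.5 even though the prediction is right
```

because probabilities live in [0, 1], running them through softmax a second time flattens them, so the loss can never get close to zero and the gradients push in unhelpful directions.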
Does anyone have some tips, or know a good tutorial, for someone who wants to create a classification model on images on my own PC? A lot of tutorials just use the built-in datasets.
so you want to do classification on your own dataset?
that isn't that different from using built in datasets
yeah
I don't know how to get my code to read the images off my PC or how to label them for my model
In pytorch for example, the built in datasets are built using the DataLoader and Dataset classes. So all you have to do is get your images and wrap them in those classes
Ok, I'm gonna try that
No, I'm not
well there's a long way to do it and an easy way to do it
but if you're already working within a framework then just using DataLoader and Dataset can make your life pretty easy
they can do all that under the hood
Do Dataloaders handle turning images into tensors?
Ok, trying that now
Yeah I think once you are able to turn your images into PIL, then the rest should be easy
Ok, I converted an image to a tensor
but I'm not sure if that's the most efficient way
I don't know about efficiency either. I just loaded the image as a PIL image and used a ToTensor() transform and it seems to work
probably shouldn't matter since you only need to load the dataset once and you're done
it returns 3 RGB values like expected
it's still better to have good habits, but yeah doesn't matter to me now.
no I'm thinking about if you have say 100,000 images for example then there might be a more efficient way to do all this at the same time
rather than going one by one
but you probably don't need to worry about that
Yeah
are you using dataloader and datasets?
not yet
I'm working with 1 image right now
Do you know how to convert a tensor to a numpy array?
wait so you're not using either pytorch or tensorflow?
if you're only dealing with one image then you don't need a dataloader
but why would you want to turn it into a numpy array if you're using pytorch?
you want to show it using matplotlib?
in that case it's just .numpy()
well the ToTensor object isn't the same thing as a Tensor object right?
Trying to figure out how to convert it to a tensor object or a numpy array
unless they're the same thing
Ok, not the same because when I run "my_tensor.shape" I get
AttributeError: 'ToTensor' object has no attribute 'shape'
ToTensor is a class, not a tensor
calling ToTensor() gives you a callable that can be used to convert images into tensors
if you want to use the function right away, theres a pil_to_tensor function
but ToTensor is good if you want to compose it with other transformations
torchvision.transforms.functional.pil_to_tensor might be what you're looking for and you can import it
also I checked one of my recent projects and it looks pretty simple to load an entire folder of images as a dataloader
I used "PILToTensor" and it worked as well
Ok, I'll try it and see how it goes
wait... I could've just kept using ToTensor and set a variable to the function and it would've worked
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.Resize(image_size),
    transforms.ToTensor(),
    # other transformations
])
# the folder data/fiftyk has a single folder in it where all images are
train_dataset = datasets.ImageFolder(root="data/fiftyk", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
yeah ik but I'd say it's cleaner to use pil_to_tensor
maybe that's just me though
“Shame I want to get us coming from his s1elling table!” said Stan. “Harry yelled after
chance of glittening of Otter, while Kuttic tells me.”
They all thowed his broom below at the bit past twelve
Prifet consulets, had said, “I always middle shop I told your common room, the
statue and next way. “So you’ve eaten to do, Harry ... it’s dead ideas had had
exams to frightens, and she — what was a rust.
Trained for 2 epochs and already this good 😵‍💫
Also look at the source code for PILToTensor
it literally returns F.pil_to_tensor
If you have a regression problem, where the output can range from 1000 to 10 billion, is it a good idea to scale the output down? (atm i'm using the sklearn minmaxscaler, which scales it to be between 0 and 1)
on top of that, does having the values range from 0-1 affect how mse is calculated? (the square of a decimal between 0 and 1 will be less than the value itself)
yes, it's a good idea because the gradients will depend on the size of the values the function takes. rescaling prevents exploding gradients
the MSE will behave as usual
what does matter about the MSE, independent of what you mentioned, is that small error values are effectively "ignored". this happens regardless of the error dynamic range, and is one of the reasons regularization is helpful
the TL;DR is "yes it's a good idea" and "no, MSE still works the same"
as long as there's no problem of outliers then minmax feature scaling is good
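a quick numeric check of "MSE still works the same", using hand-rolled min-max scaling in numpy instead of the sklearn scaler and invented target values:

```python
import numpy as np

y_true = np.array([1e3, 5e6, 2e8, 1e10])
y_pred = np.array([2e3, 4e6, 3e8, 9e9])

lo, hi = y_true.min(), y_true.max()

def scale(y):
    # min-max scaling with the range of the true targets
    return (y - lo) / (hi - lo)

mse_raw = np.mean((y_true - y_pred) ** 2)
mse_scaled = np.mean((scale(y_true) - scale(y_pred)) ** 2)

# the scaled MSE is the raw MSE divided by a constant, (hi - lo) ** 2
print(mse_raw, mse_scaled, mse_scaled * (hi - lo) ** 2)
```

since the two losses differ only by a constant factor, they rank models the same way; the scaling changes the size of the numbers and gradients, not which predictions count as better.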
https://www.udemy.com/course/tensorflow-developer-certificate-machine-learning-zero-to-mastery/
https://www.udemy.com/course/artificial-intelligence-az/
https://www.udemy.com/course/data-science-machine-learningtheoryprojectsa-z-90-hours/
Are any of these Udemy courses good for data science? Or is there a even better course I don't know of/better resource
im getting a memory error even tho htop says i have almost 3gb memory at my hand idk why
what are you trying to do?
@worthy phoenix you can double check that with psutil
import psutil
psutil.virtual_memory()
got the reason for the error
thesis results complete, not great but at least we can say 'u DEF dont have cancer'
what would you do
its reading the whole model into memory and the model is about 10gb
is it worth balancing the test set to get a better understanding of the model
If your set is overall unbalanced, then no
Preprocess, split with stratification, balance the training data, train the models, and then test on the imbalanced data with appropriate metrics
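a numpy-only sketch of that order of operations, on invented data with 2% positives. it uses plain random oversampling as a stand-in for SMOTE, and the helper names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 980 + [1] * 20)          # invented labels, 2% positives
X = rng.normal(size=(len(y), 4))            # invented features

def stratified_split(y, test_frac, rng):
    # take the same fraction of every class for the test set
    test_idx = []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
    mask = np.zeros(len(y), dtype=bool)
    mask[test_idx] = True
    return np.flatnonzero(~mask), np.flatnonzero(mask)

def oversample(train_idx, y, rng):
    # duplicate minority rows (with replacement) until all classes match the largest
    counts = np.bincount(y[train_idx])
    parts = []
    for cls, n in enumerate(counts):
        cls_idx = train_idx[y[train_idx] == cls]
        extra = rng.choice(cls_idx, size=counts.max() - n, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    return np.concatenate(parts)

train_idx, test_idx = stratified_split(y, test_frac=0.25, rng=rng)
balanced_idx = oversample(train_idx, y, rng)

print(np.bincount(y[balanced_idx]))   # balanced training set
print(np.bincount(y[test_idx]))       # test set keeps the real class ratio
```

the test indices are never touched by the balancing step, so metrics like precision and recall on the test set reflect the class ratio the model will actually see.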
i trained this model on balanced data
but its hard to get a clear understanding of the model when 98% of the test data is a single class
also i used smote on the train set
If your set is balanced to begin with, you don't want to overengineer
Oh ok
Then this applies
so u can say im training on lets say 4000 of each
then testing on 2000 and 500
BUT, if i tested on 500 and 500 i might better see how it determines class
from those results, would you say its over predicting class 0 because the data available doesnt allow for much class 1 predictions
For the train/test splitting you can additionally apply the K-fold, and then to the remainders for each fold
wdym
i did oversample
i used smote
maybe i shud try non-informed over sampling
because the data is v noisy?
AND its quite high dimensional, i one hot encoded a couple features
Hi there, any suggestions on data science interview practice sites? I've used Hackerrank, Leetcode, and AceAI so far
my experience so far with coding data science interviews is so negative
idk how u can give someone a bunch of tables in an env and with code theyve never seen before and expect them in 10 mins to return a table exactly how you like it when it takes ages to understand whats even going on
especially when hacker rank error output is bugged and invisible
Hello, does one of you have experience with using statsmodels? I used it to plot acf plots like such:
its good but i prefer stata
ok, my question is though: I have 10 different TimeSeries that I want to plot in one ACF plot
matplotlib lets u do that yes
im sure statsmodels isnt the graph itself, pyplot is
just add a bunch of plots and then draw it theyll all go in the space
on that axis
how would I do this? Should I first sum the 10 Timeseries up to one timeseries and then run acfplot?
well are you allowed to do that?
10 time series cant be represented by 1 time series
ud need to plot them as separate lines surely
I have 10 different timesseries that are kind of correlated, but I want to see if there is generally some autocorrelation on a meta level.
cant u do them alongside each other then as separate values
wouldnt that add to the correlation
or is there a way to get the autocorrelations per lag per timeseries from the plot? And calculate a mean autocorrelation per lag?
like multiple variables
Well I already have created an acf plot per dataset and per predictor (all in all 80 plots), so on a detailed level I can already assess the autocorrelations. I just want to have an aggregated view of autocorrelations
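one way to get that aggregated view without reading values off the plots is to compute the autocorrelations yourself and average them per lag. a numpy sketch on 10 invented random walks; `acf` here is a hand-written sample autocorrelation, and it is an assumption that it matches what the plotting function reports closely enough:

```python
import numpy as np

def acf(x, nlags):
    # sample autocorrelation for lags 0..nlags (lag 0 is 1 by construction)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)

rng = np.random.default_rng(0)
series = rng.normal(size=(10, 500)).cumsum(axis=1)   # 10 toy series, 500 steps each

per_series = np.array([acf(s, nlags=20) for s in series])   # shape (10, 21)
mean_acf = per_series.mean(axis=0)                           # one value per lag

print(per_series.shape, mean_acf.shape)
```

note that this is not the same as summing the 10 series first and taking one ACF: summing mixes the series together, while averaging the per-series ACFs keeps each series' own structure and only aggregates at the end.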
Been training this yolo algorithm for 6.5 hours now.. through google colab. I watched 2 movies, died a few times in playing ps4 games all while keeping the session active...
y_1 = lambda x: x**2
plt.scatter(list(range(1000)),[y_1(a) for a in range(1000)], s=20, edgecolor='none', cmap=plt.cm.Blues)
The cmap doesn't work here. Why is that?
got it
Hey is the structure of make_column_transformer right?
# First would need to deal with Binary Labeling
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder
bin_cols = ["gender", "ever_married", "Residence_type"]
ohe_cols = ["work_type", "smoking_status"]
ct = make_column_transformer(
    (LabelBinarizer(), bin_cols),
    (OneHotEncoder(), ohe_cols),
    remainder="passthrough"
)
ct.fit(df)
Error:
TypeError: LabelBinarizer.fit_transform() takes 2 positional arguments but 3 were given
cmap only does something when the colors come from data, e.g. 2D images, or a scatter where you pass c= an array of numbers. otherwise you can directly specify the color of each curve with a letter or by making your own colors
`model = Sequential()
model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256,256,3)))
model.add(MaxPooling2D(4,4))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))
model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))
model.add(Flatten())
model.add(Dense(32))
model.add(Dense(5, activation='softmax'))`
I trained this model and got these results
Hi! I want to calculate the array of sound envelope of a signal(in each music note), and I see a tool called https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.hilbert.html. However, there are two questions I want to check from the reference.
The first one is what are they doing in
signal = chirp(t, 20.0, t[-1], 100.0)
signal *= (1.0 + 0.5 * np.sin(2.0*np.pi*3.0*t))
I am not sure what the numbers 20 and 100 in the first line mean.
The second question is, I want to calculate the envelope of a signal(onset), not the whole sound. I already have onset and offset. Then, how to calculate it?
Thanks!
I was wondering if the this is overfitting or non-ideal and also wondering how you would change the model to increase accuracy
what do you mean per note?
also chirps require you to specify the starting and ending frequency, since the frequency changes over time. that's what the 20 and 100 are: start at 20 hz and end at 100 hz
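for the second question, a numpy-only sketch. it rebuilds the docs example by hand (linear chirp from 20 Hz to 100 Hz, times a 3 Hz amplitude modulation), gets the envelope from an FFT-based analytic signal, and then slices it between an onset and an offset. the onset/offset times are invented, and `analytic_signal` is a textbook construction that is assumed, not verified, to match scipy.signal.hilbert:

```python
import numpy as np

def analytic_signal(x):
    # keep DC (and Nyquist), double positive frequencies, zero the negative ones
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

fs = 400.0
t = np.arange(int(fs)) / fs                      # 1 second of samples
f0, f1, t1 = 20.0, 100.0, t[-1]                  # start at 20 Hz, reach 100 Hz at t1

phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / t1 * t ** 2)
modulation = 1.0 + 0.5 * np.sin(2.0 * np.pi * 3.0 * t)
signal = modulation * np.cos(phase)

envelope = np.abs(analytic_signal(signal))

# hypothetical note boundaries in seconds; replace with your detected onset/offset
onset_s, offset_s = 0.25, 0.75
start, stop = int(onset_s * fs), int(offset_s * fs)
note_envelope = envelope[start:stop]

print(note_envelope.shape)
```

an alternative is to cut the signal first and take the envelope of just that segment; that keeps neighbouring notes out of the transform but adds edge effects at the cut points.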
I mean onset, each sound event (might be more clear)
i don't know what you mean by onset and offset here, you'd have to clarify a bit more
Is it the maximum and minimum, or just the starting point and ending point?
those two things are the same here
well. i lie. if you look at the spectrum, depending on the extent of the signal w.r.t. the total time duration, you'll get higher frequency harmonics.
let's go with starting and ending value of the modulation frequency
Sure, Onset refers to the beginning of a musical note and offset means the ending of a musical note
and by note you mean, pitch?
What is b note?
a typo 😛
yeah
yes! You can say in this way
this is kind of a difficult problem, depending on how complicated you want to make it
Really? I thought it was easy!
the easiest answer is "you don't need a hilbert transform, but rather a fourier one", but that's probably not what you're looking for
hahaa
there's ongoing research in this stuff
or other methods, maybe?
do you know which notes you are looking for?
each note in this music piece
Librosa has one, I can show you.
But I have the same problem: I don't know how to calculate it per sound event
i'm reading a paper right now and they suggest using a filterbank, which is a generalization of the fourier approach i mentioned just now
Oh, cool. Let me check for a sec
not only that, they seem to use a time-windowed approach, so it's really more like a short time fourier transform
i think that's what you're looking for
that'll give you a time-varying magnitude (well, really amplitude and phase, but the phase presumably won't matter much unless you want a super sophisticated method) for each frequency bin
that's how you make these "spectrograms"
each horizontal line is the time-varying magnitude of a bin or "note"
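a bare-bones version of that short time fourier transform in numpy, run on an invented 440 Hz test tone; the frame length, hop size and helper name are arbitrary choices for illustration:

```python
import numpy as np

def stft_magnitude(x, frame_len, hop):
    # slide a Hann window over the signal and take the FFT magnitude of each frame
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T    # rows: frequency bins, columns: time

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                     # 1 second of a 440 Hz tone

mag = stft_magnitude(x, frame_len=1024, hop=256)
freqs = np.fft.rfftfreq(1024, d=1 / fs)
peak_hz = freqs[mag.mean(axis=1).argmax()]

print(mag.shape)    # (513, 28)
print(peak_hz)      # the bin closest to 440 Hz
```

each row of mag is the time-varying magnitude of one frequency bin, which is the "horizontal line" in a spectrogram; the bin spacing here is fs / frame_len, about 7.8 Hz.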
Great! Could you send me the link of paper?
Yes, I think so too
i just glossed over this because it mainly discusses the detection that goes AFTER. but just seeing that plot you can immediately tell it's something akin to an STFT, but presumably with filters that have a better band rejection http://dafx.de/paper-archive/2002/DAFX02_Duxbury_Sandler_Davis_note_onset_detection.pdf
if you're doing this for fun, i'd recommend to start with an STFT. if you're doing research, then it's time to read about polyphase filters
Yeah, I am doing the research, probably need to read it
😭
i'm checking the docs and it makes something called an "onset strength envelope", which is going to be some variation of what i mentioned right now
idk how it finds the peaks, but it's probably something like a thresholded smoothed derivative
Find the peak is not a big problem for me, because there is a function tool called get_peak 🙂
that's ok, but those have very low resolution
what kind of research are we talking 😛 super resolution parameter estimation?
Yeah! onset strength envelope is what I am looking for
i'm gonna check the pick_peak code really quick, let's see
Do you mean to find the peak?
ah it's even simpler
it's just a simple heuristic, pick the max value in an interval if it's above a threshold
that will 'work' but you won't get state of the art results. depending on what your research is in, this is not good enough
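the heuristic as described fits in a few lines. this is a simplified sketch with a made-up function name and invented values, not the librosa or scipy implementation:

```python
import numpy as np

def pick_peaks(x, half_width, threshold):
    # keep index i if x[i] is the max of its local window and at least the threshold
    x = np.asarray(x, dtype=float)
    peaks = []
    for i in range(len(x)):
        lo = max(0, i - half_width)
        hi = min(len(x), i + half_width + 1)
        if x[i] >= threshold and x[i] == x[lo:hi].max():
            peaks.append(i)
    return peaks

strength = np.array([0.0, 0.2, 1.0, 0.3, 0.1, 0.4, 0.1, 0.0, 0.9, 0.2])

print(pick_peaks(strength, half_width=2, threshold=0.5))   # [2, 8]
print(pick_peaks(strength, half_width=2, threshold=0.3))   # [2, 5, 8]
```

the resolution limit mentioned above shows up directly: two true onsets closer together than the window can only ever produce one peak.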
scipy.signal.find_peaks(x, height=None, threshold=None, distance=None, prominence=None, width=None, wlen=None, rel_height=0.5, plateau_size=None)
they actually coded their own in librosa
lemme read how scipy does it
eh pretty similar
okay
What do you "feel" about this? About calculate onset strength envelope.
in fairness, this type of peak finder is indeed a maximum likelihood estimator of peak locations, but only if the underlying parametric model is "easily resolved"
Yeah, u'r right...
i'm not sure i fully understand what their onset strength computation is doing, i don't have enough time to go through all the details right now. they reference this mel spectrogram though, so this is a good place to start
[#] Böck, Sebastian, and Gerhard Widmer.
"Maximum filter vibrato suppression for onset detection."
16th International Conference on Digital Audio Effects,
Maynooth, Ireland. 2013.
at any rate, it's a filterbank, they're applying some bandpass filter to the signal to split it into bands over different time windows
this is a good place to start if your research is in onset detection itself. if not, and you just need this as an intermediate result, i'd say this is probably good enough. if this IS exactly what you're researching... the next question is whether you want to develop a new, better method or just make a survey of what is out there
Sure! Thanks for giving me those information and suggestion! I appreciate it!
Does anyone know how I would proceed making my own image generator from words app?
You can use an existing one, but training your own "from scratch" would require you to learn very advanced techniques that you wouldn't be able to understand until you've been studying AI for a while.
Not to say that you'll never be able to do it. But it would be a supremely disappointing first project.
a LOT of hard coding lol. i recommend you to use someone elses and then just ask permission or if its free use put it in ur app
Anyone know if it's possible to pass a spacy object through an SKlearn pipeline
(I need the info in the object at different stages)
Anyone know any open source ai image generator that generates purely based on word descriptions?
DALL-E mini and DALL-E
I think Dream too
disco diffusion
I'm solving a regression problem, but when i calculate the MSE for my loss it becomes "inf". I believe this is because my data ranges from 1,000 to 2,147,000,000, but what should i do to solve it?
(using keras)
i have rmse as a metric, and that displays a valid number, but when the loss is initially high it is impossible for it to be displayed as mse
so is it a viable solution to set objectives (like for early stopping, reducing lr, and hyper parameter tuning) to be the validation rmse?
a quick fix is to change the dynamic range of the data
and how would i go about doing that?
whats the best modules to learn b4 starting doing ML
matplotlib, numpy, pandas. but you will almost certainly learn how to use these in the process of learning machine learning, i wouldn't worry too much about them. focus on being a good general-purpose programmer and being comfortable with python.
ok
it's the other way around I think
you should learn a bit of ML then start learning these modules
or at the very least be doing them in parallel
Are there any ML models that have sequential/ordered splits?
I’d like the model to take into account specific columns first then others
you could try changing the units of your target variable
you can try normalizing the target data
i tried doing that and came up with a lot of problems, such as the model not predicting anything below 107319 (very weird number)
that doesn't make sense
I suspect your implementation could have a bug
could you show me how you normalized your targets?
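for comparison, one common way to normalize targets that span several orders of magnitude is a log transform followed by standardization, with the inverse applied to the predictions. a numpy sketch on invented targets; whether a log transform suits this particular problem is an assumption:

```python
import numpy as np

y = np.array([1e3, 5e4, 3e6, 8e8, 2.147e9])   # invented targets spanning the stated range

log_y = np.log10(y)
mu, sigma = log_y.mean(), log_y.std()          # in practice, fit these on the training targets only

z = (log_y - mu) / sigma                       # what the model would be trained on

def inverse(z_pred):
    # map model outputs back to the original units
    return 10 ** (z_pred * sigma + mu)

print(z)
print(inverse(z))                              # recovers the original targets
```

a frequent source of bugs is forgetting the inverse step, or fitting mu and sigma on one split and inverting with values from another.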
define best ? Easiest, cheapest ?
Hello :wave:, can anyone tell me what exactly generalization error is and how it differs from train or test error?
i need best one
@quaint leaf ?
I don't know what best means for you, so I cannot answer that without more info
I mean best courses which are available
tell me the easiest one
then definitely this one:
https://www.tableau.com/learn/classroom
https://www.tableau.com/tableau-training-pass
It's official from Tableau itself, and when you don't understand you can ask the teachers so that makes it easier
Normally I never recommend paid stuff actually but since you asked for the best 🙂
Hi,
I need help with this question about clustering and finding optimal number of clusters. I would appreciate for any help.
Can u give me the best in paid one
yes that's a paid one
ok
cheapest?
this one is pretty complete (and free)
https://www.youtube.com/watch?v=aHaOIvR00So
Hello, I'm trying to remove the sky from this tree picture. I have used kmeans to cluster the colors. And this is the output. Now I need to remove the sky. What would be the best way about it.
original picture
How can I predict 2020 values with machine learning
Looks like regression at first glance
do you know what an activation function is?
each image goes through the whole network. what changes with each image is the output, not whether or not it goes all the way through.
and keep in mind that we're talking about mathematical functions, which (unlike Python functions) always have an output.
anyway, activation functions are non-linear functions
here's a great comment from Emyrs about what activation functions are for
as for the 128, this determines how many weights and biases there are. how to pick this "well" is difficult to answer and it is often the case one has to try a couple different configurations to see which one works best
in estimation theory this is called the "model order" and one tries to strike a balance between having too few and too many parameters. too many means you have quite a bit of "descriptive power", but it is both difficult to tune the parameters correctly and it is easy to overfit. if you have too few parameters, you'll simply lose predictive power
a more in depth discussion requires quite some information theory and statistics
the number stands for the size of the output of that layer. that means you have an input of 28*28 and an output of 128. one dense layer is an affine transformation with a matrix of size layer input x layer output and biases of size layer output, so you're telling the layer to grab the input of size 28*28, multiply it by a weight matrix W of size 28*28 x 128, and add a bias vector b of size 128. then, the relu activation function is applied
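the same layer written out in numpy, with random stand-in images and weights (the batch size of 32 and the initialization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

batch = rng.random((32, 28 * 28))                 # 32 flattened 28x28 images
W = 0.01 * rng.normal(size=(28 * 28, 128))        # weight matrix: layer input x layer output
b = np.zeros(128)                                 # one bias per output unit

hidden = np.maximum(0.0, batch @ W + b)           # affine transformation, then relu

n_params = W.size + b.size
print(hidden.shape)   # (32, 128)
print(n_params)       # 100480
```

so picking 128 fixes the number of trainable parameters in this layer at 784 * 128 + 128, which is the "model order" knob discussed above.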
google happened to quickly grace me with this illustration 😛 this is exactly the same as the network you have
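in case it helps to see the dense layer described above as plain array math, here's a minimal numpy sketch. the random weights and the fake input are made up for illustration (a real layer learns W and b during training), and this is not how keras implements it internally, just the same arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 28 * 28   # flattened image
n_units = 128        # the "128" passed to the dense layer

# Randomly initialised parameters, only so there is something to multiply.
W = rng.normal(size=(n_inputs, n_units))  # weight matrix, 784 x 128
b = np.zeros(n_units)                     # bias vector, 128

x = rng.random(n_inputs)                  # one fake flattened image

z = x @ W + b            # affine transformation
a = np.maximum(z, 0.0)   # relu: negative values become 0

n_params = W.size + b.size
print(a.shape, n_params)  # (128,) 100480
```

so picking 128 means 784 * 128 weights plus 128 biases to tune for that one layer.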
Hey I had a quick question about coefficient and p-value.
Based on the results I have noticed Radio is the correct answer for the first question and billboard is the answer for second; however, I am struggling to grasp the reasons for it. Could someone help out?
that depends on which definition of p-value you are using in your course
no prior information was given or constructed
this is completely uncorrelated question to the previous ones
i doubt that
at any rate, the standard usage of a p-value is the probability of observing data at least as extreme as your measurements under a base model, where the null hypothesis is that the base model explains the data. the smaller the value p, the more unlikely it is to observe such data under the base model, meaning the evidence against the base model is stronger and the parameters you derived are more significant
that'd make p values of 0 have the interpretation of "the base model cannot explain the data at all, these new parameters should be accepted". then TV would be the most effective channel, as it has the strongest positive correl and highest significance
my recommendation is to review the content, it looks like you skipped something either during the course, or one of its prerequisites. be it the interpretation of p-values, or the definition they decided to use here
See that's the thing I'm confused about, due to the fact that p-value was so insignificant in both cases kind of cancels it out and falls under the idea of <0.05 rejection, but what I am confused about is that Radio was more effective than TV in the answer
so the question is, what definition of p-value did they give during the course
None, it is practice tests I'm doing with no prior information given
well, then we can't answer the question, can we 😛 you don't have enough info
no way to know if the quiz is wrong, or if they were working with a different def
put in the TV at first and was given a feedback that Radio was the right answer
Why tf me network crashing after 3 epochs
Wdum crashing
hi there
i wanna deploy a ml inference program in rpi
but the thing is it seems to take too much time to give a single output
as we all know ml requires alot of computing power
so should i buy a jetson instead? or what updates can i make to the rpi to get it working
I wanted to become a data scientist. However I wanted to know what modules and things would I need to master in Python to be a competent one.
I have a message in the pins about the major libraries, but your understanding of data science theory would matter more than your programming ability.
Hi everyone. I have a corpus of text with a continuous dependent variable, and I would like to create a model that predicts this variable based on text. Which types of ML/DL models could be used for such a task?
so then any books?
I would start with the second edition of "data science from scratch"
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
^ you can go to that page and filter for books
Ok
Hey another question, this is easier 😄 I've tried to understand Bayes theorem, however, struggling. The answer appears to be 16.7%, but don't know how it was calculated and https://www.youtube.com/watch?v=HZGCoVF3YvM didn't help sadly.
you gave some info but didn't show the question
sorry, added
yup where P(A) is the probability of something happening?
here, they are asking you for P(A|B) where A is the event that the person has cancer, and B is the event that your code gave the output 1
okay so in this case the probability of the person having cancer is 0.01? since the total population is 0.01?
i don't get what you mean by "the total population is 0.01", but yes, the probability of any person having cancer is 0.01
let me walk you through it because this has several steps
let's say P(A) is the probability a person has cancer. this is 0.01
next, P(B) is the probability your model outputs a 1. we will deal with this later
then, P(B|A) is the probability your model outputs 1 when a person has cancer. we are told the model is 0.99 correct when a person has cancer, so if a person is known to have cancer, the output 1 is 99% of the time. P(B|A) = 0.99
we want P(A|B): if the output of the model is 1, how likely is it they have cancer? we can compute that with bayes' theorem, but we need that missing value P(B)
due to the 1/100 having it
so the top side of the equation will be 0.99 * 0.01
correct
now, P(B) is the gotcha. your model can output 1 in two ways. it can be a true positive, or a false positive. a true positive happens when a person has cancer AND you detect it correctly. this is the probability of A and B happening at the same time. the probability of this happening is P(output 1 | person has cancer) * P(person has cancer).
we need something else here
the probability of false positive?
.latex $P(A \cap B) = P(A \vert B) P(B)$
so that will be 0.99 * 0.01?
this is the probability of two dependent events happening together
right. so P(model gives 1 | person has cancer) = 0.99, as we were told. that means the probability of getting a TRUE positive is 0.99 * 0.01
but for the false positive, we need P(model gives 1 AND person does NOT have cancer)
this would be P(model gives 1 | person does not have cancer) * P(person does not have cancer)
we know the model is correct 95% of the time when a person does not have cancer
okay so that would be .99 -> prob of getting 1
no
since it's correct 95% of the time, that means it's wrong 5% of the time
so the missing term for a false positive is 0.05*0.99
yup that makes sense
so we have
0.99 * 0.01 / (0.99 * 0.01 + 0.05 * 0.99)
.latex $P(cancer \vert test = 1) = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.05 \cdot 0.99} = 16.6 \cdots$
i missed the % mark
ahhh okay so that will be division of the population
then we just extract the probability of the model being incorrect for the false positive
let me try to flip the question and see what comes out
idk what you mean by division of the population
so: you run the model and it predicts 0. What is the probability that this person does not have cancer?
that would be:
0.95 * 0.99 / (0.95 * 0.99 + 0.01 * 0.01)
looks aight
okay great that makes a lot of sense, thanks Edd!
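for anyone who wants to check both calculations above numerically, here's a small sketch. the function and variable names (posterior, sensitivity, specificity) are just my labels for the numbers in the question, they don't come from the quiz:

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem: P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    # Total probability of the evidence: true-positive path + false-positive path.
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


p_cancer = 0.01      # P(A): any person has cancer
sensitivity = 0.99   # P(output 1 | cancer)
specificity = 0.95   # P(output 0 | no cancer)

# Model says 1: how likely is cancer?
p_cancer_given_1 = posterior(p_cancer, sensitivity, 1 - specificity)

# Model says 0: how likely is no cancer?
p_healthy_given_0 = posterior(1 - p_cancer, specificity, 1 - sensitivity)

print(f"{p_cancer_given_1:.3f}")   # 0.167
print(f"{p_healthy_given_0:.4f}")  # 0.9999
```

the first number is the 16.7% from the quiz. the second one is the flipped question.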
Edd how do you invoke Sir Lancelot to display latex? 😊
Thanks, let me try it out 😊
.latex VIF_{Weight} = \frac{1}{1-R_{Weight}^{2}}
you need .latex at the beginning of the message and $ to start and end an in-line equation environment
.latex so presumably this is in-line $x = 3$ but this other one makes a separate environment
\begin{align}
\phi(x) = \int f(\tau - x) d\tau
\end{align}
but i'm not sure if it'll work
Thanks. I'm still yet to get used to Latex.
Any idea why this isn't rendering? Where's the typo coming from
yeah latex doesn't like unexplained back slashes. \frac is not expected outside of math mode, so you need either $$ or another math environment
They're all unexplained if you don't know latex
Ohh I get now... Let me try it again
.latex $VIF = \frac{1}{1 - R^{2}} = \frac{1}{Tolerance}$
It’s easier to just use Microsoft word equation writer
word also accepts latex for typesetting equations
I like that, in the sense that word is becoming less like word.
Yeah it's easier to use MS Word to write math equations, but writing it on JupyterLab or Jupyter Notebook requires Latex; except of course, there's now a new trick to import equation written in MS Word to Jupyter.
this is the worst conversation ever, jupyter and word brought together
No conversation involving Emyrs can be the worst one. But those are two of my least favorite things, yes.
😁 I think it's much easier for people who don't know Latex. Of course, to people with good knowledge of Latex, it's drudgery moving back and forth from Word to Jupyter.
i can't deny that, it took me months to warm up to latex
We have to use it to write reports at uni, but it's pretty nice for scientific reporting with a lot of formulas
Still get annoyed by latex placing figures 2 pages further than you want though :/
Yeah. Not a fan of equations in jupyter, I keep them to reports written in word
I don’t think many people read jupyter for anything method related right? That stuffs all reported on
Hey guys for Gensims most_similar function how do I get a list of just the most similar word without the float value next to them
I was looking through documentations trying to do it and it wasn’t working
Bottom image is my code
Sorry if this is a noob question bye
Btw *
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
hey, for some reason tensorflow is not properly installed and I tried uninstalling and reinstalling but I get this when I try to verify the install (in the cmd)
2022-08-07 21:34:41.553613: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
What are some classification algorithms which work with categorical features.
Hmm
Like how
Wouldn't that be incorrect
Or as long as you encode them properly, categorical features can be used As numeric?
Last time someone taught me one-hot encoding.
Does that now allow me to use any algorithm to use, even if it's only supposed to work for numerical features?
Like knn
@lapis sequoia, @serene scaffold
If I get your question correctly, you mean the ML algorithms that can pretty much handle categorical data explicitly without having to preprocess or encode them to a numeric feature....
LightGBM and CatBoost do that with ease! But because I kinda love CatBoost more than LightGBM, I'll focus on 🙀-Boost
You only need to identify which feature is a categorical feature in your dataset.
You could do something like this with CatBoost
categorical_features_indices = np.where(X.dtypes != float)[0]
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_val, y_val), plot=True)
almost all of them
if not all
use dummy features and yes you can with 0s and 1s
However, all ML classification algorithms can work with categorical feature. Most of them would require you explicitly preprocess your categorical features to numeric feature, while some few (like CatBoost and LightGBM) can handle such even when you don't explicitly preprocess your categorical features.
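a quick sketch of the "dummy features" route for algorithms that only take numbers (like knn), using pandas' get_dummies. the column names and values here are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "red"],  # categorical feature
    "size": [1.0, 2.0, 3.0],           # already numeric
})

# One new 0/1 column per category; the numeric column is left untouched.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())  # ['size', 'color_green', 'color_red']
```

the encoded frame can then go into any model that expects numeric input. whether distances between one-hot columns are meaningful for knn is a separate question that depends on your data.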
Quick question...
df['timestamp'].diff() returns a series of timestamps(?); I have a data stream and trying to identify different 'epoch' (gaps in data). Most .diff() values are 1 sec; but every so often I get a gap of minutes/hours/days. How can I find the index of these 'jumps'?
is_large_gap = df.timestamp.diff().abs().gt(pd.Timedelta("1 sec"))
inds_large_gaps = is_large_gap[is_large_gap].index
Checking whether the (absolute value of) the difference is greater than 1 second; this gives a True/False Series. Then index it with itself to let only True's through. The .index will then give the indexes of the (end point of) jumps.
.diff on a datetime type Series gives a Series of type timedelta.
side note: is df['timestamp'] same as calling df.timestamp? how do you distinguish between methods and columns?
using pandas to get the duration between two dates gives me a weird output, what can I do to get the hours that passed between those dates?
```python
import pandas as pd
df = pd.read_csv('./data/essential_info_dashborad.csv')
df['Last session begin'] = pd.to_datetime(df['Last session begin'], errors='coerce')
df['Last session end'] = pd.to_datetime(df['Last session end'], errors='coerce')
df['test'] = df['Last session begin'] - df['Last session end']
df.to_csv('./data/testingdate.csv', index=False)
```
df['diff'] = (df['end']-df['start']).dt.total_seconds()
df['diff'] = (df['end']-df['start']).dt.total_seconds()/3600
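a self-contained version of the two lines above, with made-up timestamps, in case you want to try it without the csv. the column names start/end/hours are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2022-08-07 08:00", "2022-08-07 22:30"]),
    "end": pd.to_datetime(["2022-08-07 09:30", "2022-08-08 01:00"]),
})

# end - start gives a timedelta Series; total_seconds() turns it into floats.
df["hours"] = (df["end"] - df["start"]).dt.total_seconds() / 3600
print(df["hours"].tolist())  # [1.5, 2.5]
```

note the order: end minus start. the other way round gives negative durations.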
This works great!
I'm trying to add a column with a letter to designate each group/epoch. Any suggestions on how to add a 'epoch' column, and the first set is 'A', after the first large gap index, 'B', and so on?
Omg I was doing it backwards 🥲 thanks for the enlightenment!!
does anyone know how to retrieve and add the rest of a dataframe(df1) based on two columns worth of data in another dataframe(df2) that have the same column names in both df? I tried merging(inner and left) but since there some of the values in the dfs are duplicative it messes the whole thing up. im coming from excel so im trying to fundamentally do a vlookup based on two conditions - thanks!
is there any opensource implementation of gpt3?
Even if there were, it would be incredibly expensive to train.
with pretrained models
xDDD, forgot to add that
There's only one instance of gpt 3, and it's behind a paywall.
sadge
any other generative transformers which is at par with gpt3 and opensource?
There's nothing on par with gpt 3, but there is gpt 2.
aight, link to the repo pls
couldnt find the gpt2 one either
I'd have to look for it. I'm sure you can.
aight trying again, btw if u get free time to look for it , pls drop it to me will ya?
Did you try "gpt2 GitHub"
yep but the implementations are kinda not looking ok xD
oh ok
this one yeilded the official openai repo
nice
thanks
Is there a way to make pandas's fancy integrated dataframe display break very long cells into lines?
E.g. consider
df = pd.DataFrame.from_dict(dict(a=[["hello"*30]]))
If you do pd.options.display.max_colwidth = 150, this row will be shown as a giant line. But can you make it shown fully, but broken into more than one line?
visual aid
I want this cell shown in full, but broken into two lines
I'm using Pytorch to try to train a model and I'm getting this error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x25568 and 400x120)
any tips on how to fix?
my code:
transform = transforms.Compose([transforms.Resize((150, 200)), transforms.ToTensor()])
-----------------
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_dler, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0
I can share more code if needed
Do you understand what the error message is telling you?
I think that my images are sized incorrectly
but just taking the error message by itself, do you understand the problem?
I understand that it can't multiply the two matrices(my image's pixels and the weights I presume)
but I might be wrong
not even as it relates to your code. does "mat1 and mat2 shapes cannot be multiplied (10x25568 and 400x120)" mean anything to you?
well, it quite literally says that there are two matrices that can't be multiplied. do you understand why?
because the row in the first matrix needs to be the same size as the column of the 2nd matrix
and it isn't so
if we're talking about matrix multiplication, yes. I don't actually know if we're talking about matrix multiplication or element-wise multiplication
so the question is, how are mat1 and mat2 created from your code?
we would need to see the whole traceback (ie, the whole error message) to begin to guess.
what is net?
so Net.forward() is:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net is an instance of this ^
great. the error message shows you that the problem starts here: ---> 19 x = F.relu(self.fc1(x))
and then return F.linear(input, self.weight, self.bias)
yes
so figure out what x is when you get the error
I think x is the matrix of my images,
I don't understand what the second matrix (400x120) is
and I'm not certain if the first matrix is even the image
How do I include the head of the dataframe (first column) in the csv
The code for that would be under
#CENTRAL DATAFRAME
Anyone here has ideas on how to fix this?
the error is not even in the code you gave us
it's probably inside net
oh you also gave that nvm
that's the shape of the weight matrix of the first fully connected layer
in other words, it's expecting inputs that are 400 dimensional vectors
convolutional layers don't care about the spatial size of the image, but fully connected layers do
so you have to make sure that your input image is the right size such that the dimensions end up matching (you can do that by cropping/resizing the image)
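to see where the 25568 in the error comes from, you can trace the spatial size through the conv and pool layers by hand. this is a plain-python sketch of the usual output-size formula (no padding, no dilation), with helper names I made up; it assumes the Net posted above, i.e. two rounds of 5x5 conv followed by 2x2 max pooling, ending in 16 channels:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output length of max pooling along one spatial dimension."""
    return (size - kernel) // stride + 1


def flattened_features(height, width, channels=16):
    """Number of values per image that reach fc1 after flatten."""
    for _ in range(2):  # conv(5x5) then pool(2x2), twice
        height = pool_out(conv_out(height, 5))
        width = pool_out(conv_out(width, 5))
    return channels * height * width


print(flattened_features(150, 200))  # 25568, what your resized images produce
print(flattened_features(32, 32))    # 400, what nn.Linear(16 * 5 * 5, 120) expects
```

so the two options are: resize to 32x32 so fc1 sees 400 features, or keep 150x200 and change fc1 to take 16 * 34 * 47 inputs.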
how do i make a ai that predict typos
like if i type "helo"
list = ["hello", "goodbye"] it will predict that "helo" is the closest to "hello"
You can make a tree of the letters in your list and find words that match by using the letters of the typo until the typo doesn’t have a letter in a path on your tree then suggest the word at that node
@rapid cedar
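if you just want something working before building a letter tree, the standard library's difflib can already rank candidates by similarity. to be clear, this is a different technique from the tree idea above (it scores whole-string similarity rather than walking a prefix tree), and closest_word is a helper name I made up:

```python
import difflib

known_words = ["hello", "goodbye"]


def closest_word(typo, candidates, cutoff=0.6):
    """Return the candidate most similar to `typo`, or None if nothing is close enough."""
    matches = difflib.get_close_matches(typo, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_word("helo", known_words))  # hello
```

the cutoff (0 to 1) controls how similar a word has to be before it's suggested at all.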
hey guys I hope this is right place to ask. I've been having hard time trying to look for real life example for usage of global temp view in spark. Has anyone used this feature, and if so, could you share your experience, like why use that over temp view.
I'm well aware of definition for global temp view which exists across all spark sessions compared to temp view which expires when session that was created ends; I just can't imagine real life case to use global temp view from the first place. What would require you to create separate session and use that global temp view instead of just using a session you already have opened with existing temp view? Thanks in advance
hi again, sorry i went AFK shortly after
yes df["col_name"] and df.col_name are equivalent, except in 2 cases:
- if "col_name" isn't a valid Python identifier, it will fail
- e.g., spaces in it: "current value", starting with a number: "20th quantile", containing apostrophe: "today's"
- if it clashes with a method/attribute of a Series/DataFrame as you said
- e.g., df.sum will go to the method even if you have a column named "sum"
so, df["col name"] always works, df.col_name sometimes works. But the latter is easier to type, so if it is fine to do, i prefer it due to laziness :| It also makes code more readable IMHO when chaining things like we did above (.diff().abs()...)
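a tiny demonstration of those two cases, with an invented frame that has a column literally named "sum" and one with a space in it:

```python
import pandas as pd

df = pd.DataFrame({"sales": [300, 100], "sum": [1, 2], "current value": [10, 20]})

sales_by_attribute = df.sales        # fine: valid identifier, no clash
sales_by_brackets = df["sales"]      # same column

sum_column = df["sum"]               # the column
sum_method = df.sum                  # NOT the column: this is the DataFrame.sum method

current_value = df["current value"]  # only brackets work, the name has a space

print(callable(sum_method))  # True
```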
IIUC, you can map the cumulative sum of is_large_gap... why? here's a demonstration:
sample data
In [20]: df
Out[20]:
date sales
0 2021-12-29 300
1 2021-12-27 100
2 2021-12-30 100
3 2021-12-31 300
4 2021-12-28 200
5 2022-01-03 500
6 2022-01-02 0
7 2022-01-01 200
above code applied to get is_large_gap, except the threshold being 1 day here
In [21]: is_large_gap = df.date.diff().abs().gt(pd.Timedelta("1 day"))
In [22]: is_large_gap
Out[22]:
0 False
1 True
2 True
3 False
4 True
5 True
6 False
7 False
Name: date, dtype: bool
so what I understood is, you want to start with some letter, say "A", and keep it going as long as "no large gap"; once it hits a large gap, change it to "B", i.e., the successive letter. And do this until the end.
like for i in len(typo):
typo[:1] match for i in range len(possible_typo)
match with possible typo[:1]?
to that end, we can use .cumsum()... This takes the cumulative sum. let's see what it would do to a Boolean series: if it sees a True, the sum accumulated so far is increased by 1 (True is 1 in numeric context). So this is good: when it hits a large gap point, it will change its value. If, on the other hand, it sees a False, the sum accumulated so far won't change (because False is 0 in numeric context). This too is good: when it hits a not-large gap, it won't change its value.
well this gives this:
In [23]: is_large_gap.cumsum()
Out[23]:
0 0
1 1
2 2
3 2
4 3
5 4
6 4
7 4
Name: date, dtype: int32
see how it stays the same when it hits False's?
now all we need is a mapper: 0 -> A, 1 -> B, ...
assuming it won't exceed 25, we can use the ASCII alphabet :p
so here we go map:
In [24]: import string
In [25]: mapper = dict(enumerate(string.ascii_uppercase))
In [26]: mapper
Out[26]:
{0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
...
23: 'X',
24: 'Y',
25: 'Z'}
In [27]: is_large_gap.cumsum().map(mapper)
Out[27]:
0 A
1 B
2 C
3 C
4 D
5 E
6 E
7 E
Name: date, dtype: object
note that you'll get NaNs instead of letters in the output if the value you're mapping is not in the mapper
e.g., if there's 47 to map, since it's not in mapper, it will put NaN in the result.
range(is_large_gap.sum() + 1) will give you the range of numbers to cover in your mapping's keys (the cumulative sum runs from 0 up to and including the number of large gaps).
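putting the steps above together for the original "1 second stream with occasional jumps" case. the timestamps are made up, and the column name epoch is just what I picked:

```python
import string

import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2022-08-07 10:00:00", "2022-08-07 10:00:01", "2022-08-07 10:00:02",
    "2022-08-07 12:30:00", "2022-08-07 12:30:01",  # jump of hours
    "2022-08-09 08:00:00",                         # jump of days
])})

# True wherever the step from the previous row is larger than 1 second.
is_large_gap = df["timestamp"].diff().abs().gt(pd.Timedelta(seconds=1))

# Running count of gaps seen so far: 0 for the first epoch, 1 for the next, ...
epoch_number = is_large_gap.cumsum()

# 0 -> 'A', 1 -> 'B', ...; anything past 25 would become NaN.
mapper = dict(enumerate(string.ascii_uppercase))
df["epoch"] = epoch_number.map(mapper)

print(df["epoch"].tolist())  # ['A', 'A', 'A', 'B', 'B', 'C']
```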
Hi guys, im not sure where should i ask this,
but im looking for a project idea (something like AlgoTrading).
i feel like im intermediate - advanced in python
Any idea, or help would be appreciated!
I am taking machine learning classes (high level, only application), the teacher said that rather than learning about the working of the algorithm (gradient descent and OLS) I should just learn to use it
is this correct ? should I attempt to understand/learn it myself ?
you should at least have some notion of when and how it makes sense to use it
OLS is not always the best estimator, and gradient descent doesn't always make sense to use. even when it DOES make sense to use grad des, it only converges under special conditions
and WHAT it converges to is another matter
okok, makes sense
what is the google service that provides computing power for machine learning called
anything in a csv is included anyways
anyone know how to deal with such noisy data that my minority class precision is almost 0?
google colab?
Google cloud
hi guys im a new data science student and i was wondering what do i call those values that are above 600
do i call them anomalies or outliers?
anomaly usually refers to a behavior that is caused by something external and is not part of the underlying distribution. extreme values are part of the original distribution, just very unlikely
Ohhhhhh
to be able to tell which of the two it is, you need some kind of anomaly detection
in some cases the difference doesn't matter and they're treated as synonyms, so more context is needed 😛
is it logical to remove any values above 400 or 500?
one time, a non-technical person asked me to do anomaly detection. and they weren't very clear on what an anomaly was in the context of their data. so I asked a senior coworker, and he said
"anomaly detection. they don't know what an anomaly is--wouldn't know if it bit them. but it's a buzzword. everyone and their grandma is doing it--that's probably who told them about it."
"sure, but what would an anomaly be in this context?"
"you know, you're asking the right questions."
sounds about right
if i were to do a .describe() on my departure delay in minutes column, the 3rd quartile (75%) only has a value of 12 minutes
does this count as anomaly detection?
you have a point
I'm solving a regression problem, but when i calculate the MSE for my loss it becomes "inf". I believe this is because my output data ranges from 1 to 2.147 billion, but what should i do to solve it?
(using keras)
i have rmse as a metric, and that displays a valid number, but when the loss is initially high it's impossible for it to be displayed as mse,
so is it a viable solution to set objectives (like for early stopping, reducing lr, and hyper parameter tuning) to be the validation rmse?
as a buzzword, sure. but since you're likely not trying to find how many distributions are needed to correctly describe your data without overfitting (a model order estimation problem), it sounds like you're just looking to remove outliers. you can make a histogram and see what probability distribution fits it best, then remove extreme values based on that
im currently doing the data cleaning process of my whole project and later on im going to do linear and logistic regression
so im kind of stuck on whether i should remove them or not
probably so if the entries skew the distribution
ohhhhh
for example OLS regression is only the maximum-likelihood estimator if the errors truly are normally distributed
so check the histogram
alright ill try
thank you for the help by the way
if you're willing to use maximum likelihood based on the sample covariance instead of assuming normality, you'll get something often called "mahalanobis distance" instead of the ordinary least squares
would be nice to compare if you have some time to spend
@wooden sail are you able to possibly assist me?
what happens if my histogram is skewed?
i'm pretty sure i suggested you rescale/normalize your data to start with
try with 100 bins, this plot is deceptive
I tried doing that beforehand using sklearn's minmaxscaler, but i ended up having this weird issue where it wouldn't predict any value lower than 103,000, is there another encoding method i should use to scale the y data?
i already normalize my x data btw
the y as well. it also sounds like you have some overfitting
what encoding method should i use for scaling the y data?
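one option for a target that spans 1 to ~2 billion is a log transform: train on log(1 + y) and invert the predictions afterwards. this is only a sketch of that idea with made-up values, not something specific to keras, and whether it suits your data is an assumption you'd have to check:

```python
import numpy as np

y = np.array([1.0, 250.0, 1.0e6, 2.147e9])  # made-up targets with a huge range

y_scaled = np.log1p(y)           # train the model on this
y_restored = np.expm1(y_scaled)  # apply this to the model's predictions

print(y_scaled.round(2))
```

errors on the log scale are relative rather than absolute, so the loss stays in a sane range, but it also changes what the model is optimising for.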
after chaging the bins there wasnt much difference and it remained skewed
how many data points do you have?
101695 rows and 24 columns
try 500 bins for one last look at the data
i think 100 is the max it can go 😆
if not the bins would be very very thin
lemme try removing some outliers
well but that's already ok
now the question is, which distribution does this look like
usually for OLS regression it would be logical for the column to be normally distributed right?
technically this is right skewed
that's what you normally assume, but that evidently isn't the case
Ohhhhhhhh
so your options are: DON'T use OLS and compute the sample covariance
or manipulate the data and use OLS
that was the point of this exercise
but it may be the case that the data never looks normally distributed no matter where you draw the cutoff point
poisson, gamma, and chi-squared distributions all kinda look like what you have. and indeed, these have limiting cases in which they approach the gaussian distribution
you can try and make a box and whiskers plot and see where the mean and median are, and use this info to make a cutoff decision
aight, best of luck
Hey guys not sure if this is the right place to ask but could anybody here help me generalize this Matrix in a loop im kinda stuck
if for example here K1 = {0, 2}, K2 = {0}, K3 = {3}, K4 = {1, 3}
not sure how to build the Matrix so the rows equal the amount of Elements inside each K[i]
as i understand it, the first column contains the elements of the K_i. the second column is just the value of i?
the first column(first row) is the Value of the first Element in K1, the first column(row 2) is the second Value of K1
the second column stands for the the index after K
and the third columns are the constraints (a(x) if its K1, b(x) if its K2, M(x) if its K3 and Q(x) if its K4
this is my artistic interpretation. you can add the bit about a,b,M and Q yourself, i think
In [1]: import numpy as np
In [2]: K = [[0,2], [0], [3], [1,3]]
In [3]: N = 0
In [4]: for k in K:
   ...:     N += len(k)
   ...:
In [5]: B = np.zeros((N,3))
In [6]: row = 0
In [7]: for i,k in enumerate(K):
   ...:     B[row:row+len(k),0] = k
   ...:     B[row:row+len(k),1] = i
   ...:     row += len(k)
   ...:
In [8]: B
Out[8]:
array([[0., 0., 0.],
[2., 0., 0.],
[0., 1., 0.],
[3., 2., 0.],
[1., 3., 0.],
[3., 3., 0.]])
ah, it was supposed to be i+1
ty very much this line 7 was killing me
is there a possibility to change a number out of K to for example -> n ?
if its K = [[0,2], [0], [n], [1,3]]
if n is already defined, yes
do you need the matrix to be variable?
your options are to make this into a function and pass in specific values, or use sympy/symengine
okay i guess i'll just leave that then 😄 not sure if I even need that was just a thought
thank you
the code i wrote there works regardless of what is inside K, as long as it is defined and K is a list of lists of floats
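here's the same loop wrapped in a function, with the i+1 fix from above so the second column counts from 1 like K1, K2, .... the function name is mine, and the third column is still left as zeros for the constraints:

```python
import numpy as np


def build_index_matrix(K):
    """One row per element of each K_i: [element, i (1-based), 0]."""
    n_rows = sum(len(k) for k in K)
    B = np.zeros((n_rows, 3))
    row = 0
    for i, k in enumerate(K, start=1):
        B[row:row + len(k), 0] = k  # elements of K_i
        B[row:row + len(k), 1] = i  # which K they came from
        row += len(k)
    return B


B = build_index_matrix([[0, 2], [0], [3], [1, 3]])
print(B[:, :2].tolist())
# [[0.0, 1.0], [2.0, 1.0], [0.0, 2.0], [3.0, 3.0], [1.0, 4.0], [3.0, 4.0]]
```

passing a different K (for example with n filled in as a number) just means calling the function again.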
there are libraries that do this. have you googled "python speech recognition library"?
being a liturgical language, if you had to create your own speech recognition model, how much training data would be available?
you would need lots of examples of Sanskrit speech audio and transcripts. and you would probably need to align those in the time dimension.
Hey, I might be in a little wrong section, but I think some of you guys could give me ideas.
I currently work in a high school as a physics teacher, and I'm considering replacing slides with colab notebooks. One of the main reasons, is use of interactive demos with widgets. I learned how to make interactive graphs which is great, however I was wondering if somebody could suggest any other cool demos/libraries. I would kill for a library where I could create simple interactive animations, as I'm already using manimCE to create short animations.
Any directions/suggestions are welcome!
is there already a name for a ML program that has a model with lots of sentences and entity's. and classifies a input sentence with the data in the model and returns the entity?
Hey guys, how can I transpose one column to another df while matching another column? Visual example:
Without having to iterate through the df using df.at, that is
it looks like ATTR1 has different kinds of data in df1 and df2, so you need to change the name of one of them. but you should just be able to merge the two dataframes, and then sort the result on the SKU column.
if what you want to do is more intricate than that, your example does not illustrate it.
They're the same kind of data, just different values, my question is, if I use merge, if df1 has the SKUs in this order: 1, 2, 3, 4, 5... and df2 has the SKUs in this order: 5, 2, 3, 4, 1, will it merge them by matching the SKUs or will it just do it in their original orders?
Sorry if I wasn't clear
when you do the merge method, you say that the SKU column is the one that you want to use to do the linking.
!docs pandas.DataFrame.merge
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
it might be that you need to merge on both SKU and ATTR1.
by the way, what pandas calls a "merge" is a join in SQL terminology.
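To make the key-matching point concrete, here's a minimal sketch with two made-up frames. The column names SKU, ATTR1, ATTR2 and the values are assumptions for illustration, not the asker's actual data. merge pairs rows by equal SKU values, not by row position:

```python
import pandas as pd

# Hypothetical frames: same SKUs, listed in different orders.
df1 = pd.DataFrame({"SKU": [1, 2, 3, 4, 5], "ATTR1": ["a", "b", "c", "d", "e"]})
df2 = pd.DataFrame({"SKU": [5, 2, 3, 4, 1], "ATTR2": ["E", "B", "C", "D", "A"]})

# Rows are paired by equal SKU values; the row order of df2 does not matter.
merged = df1.merge(df2, on="SKU", how="left").sort_values("SKU").reset_index(drop=True)
print(merged)
```

If both frames really do share a column name like ATTR1, the suffixes argument shown in the docs above decides how the two copies get renamed.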
Desmos is really nice: https://www.desmos.com/calculator/fxhu08lai8
I'm looking for something to use inside colab (so anything that works on jupyter notebook)
Wdym by interactive animations? You mean drag a point in a graph function?
matplotlib sliders can give you some level of interaction with your plots https://matplotlib.org/stable/gallery/widgets/slider_demo.html rather basic though
Is it fine to use clustering algorithms even if you have labels?
But instead to use the labels for validation
To be honest anything. I did some graphs + sliders stuff. For example, how the trajectory of a ball depends on initial velocity.
manim
Hi there, I have a small issue with importing a data set. The CSV file contains: Team 1; Team 2; Score Team 1; Score Team 2, so str; str; int; int. But I can't get it to import in numpy.
I used this command: np.genfromtxt(filepathEast, skip_header=1, usecols=(0, 1)) but I get an error message for each line: Line #2 (got 1 columns instead of 2). So yeah, struggling a bit, I haven't touched python for a month and a half by now
Is this a really bad model?
perhaps if you import it through pandas first and then turn score team 1 and 2 into numeric values? could be a temporary solution but im not that versed
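A rough sketch of the pandas route, assuming the file is semicolon-separated as described. The sample rows below are made up, and an in-memory buffer stands in for the real file path. The "got 1 columns" error is probably because genfromtxt splits on whitespace by default, so passing delimiter=";" on the numpy side may also be worth trying:

```python
import io
import pandas as pd

# Hypothetical file content standing in for the real CSV (semicolon-separated).
raw = io.StringIO(
    "Team 1;Team 2;Score Team 1;Score Team 2\n"
    "Lions;Tigers;3;1\n"
    "Bears;Wolves;0;2\n"
)

# sep=";" tells pandas the delimiter; the score columns are parsed as integers.
games = pd.read_csv(raw, sep=";")
scores = games[["Score Team 1", "Score Team 2"]].to_numpy()
print(games.dtypes)
print(scores)
```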
does anyone know how to recursively merge/concat a temp df into a master_df. I have a temp df that is created for every id, and it calculates the difference between row values in a column. Im trying to bring that column to the master_df so I can track the differences in the master_df. <- for python
yes, it's significantly worse than always guessing "no churn", which should give you 86.2% accuracy since "no churn" is 86.2% of the data. see https://towardsdatascience.com/calculating-a-baseline-accuracy-for-a-classification-model-a4b342ceb88f and https://machinelearningmastery.com/dont-use-random-guessing-as-your-baseline-classifier/
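For reference, the majority-class baseline is quick to compute. This is a minimal sketch with made-up labels chosen to match the 86.2% figure above; the actual dataset isn't shown in the chat:

```python
from collections import Counter

# Hypothetical labels: 862 "no churn" and 138 "churn", matching the 86.2% figure.
labels = ["no churn"] * 862 + ["churn"] * 138

# The majority-class baseline always predicts the most common label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```

Any model whose accuracy lands below this number is doing worse than ignoring the features entirely.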
What can I do then
To make it better
Because my train data also has tons of negative labels
No churn labels I mean
Can someone help explain the tensorboard training logs generated for GAN network
Are these good or bad? How do I compare ?
does treating a column dtype as categorical instead of object dtype make any difference?
apart from the space optimization does it have any other impact?
Would any of you have good resources about score prediction with neural networks? I can't find anything satisfying online
I have a simple pandas question... I want to fill in the NaN values for the code column by referring to the dict below... how can I do that?
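One common way to do this, sketched under assumptions: the screenshot isn't available, so the key column "name", the dict contents, and the values here are all hypothetical. The idea is to map the key column through the dict and use the result only where code is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "name" is an assumed key column, "code" has gaps.
df = pd.DataFrame({"name": ["apple", "pear", "plum"], "code": ["A1", np.nan, np.nan]})
lookup = {"apple": "A1", "pear": "P7", "plum": "L3"}

# map() looks each name up in the dict; fillna() uses it only where code is missing.
df["code"] = df["code"].fillna(df["name"].map(lookup))
print(df)
```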
Just show the precision recall
Use a class weight or resample
I opt to use class weight atm
Easier one liner code
Does the same shit but you don’t lose info or gain noise
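As a sketch of what a class weight actually is: the "balanced" heuristic that scikit-learn documents for class_weight="balanced" is n_samples / (n_classes * class_count). The labels below are made up, and which estimator is being used isn't stated in the chat, so this only shows how the weights come out:

```python
from collections import Counter

# Hypothetical imbalanced labels: 0 = no churn, 1 = churn.
y = [0] * 862 + [1] * 138

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so the rarer class gets a proportionally larger weight.
counts = Counter(y)
class_weight = {label: len(y) / (len(counts) * n) for label, n in counts.items()}
print(class_weight)
```

Mistakes on the rare class then cost more in the loss, without dropping or duplicating any rows.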
I just got my thesis results and the model has 0.06 precision in predicting the positive class. Do u think there’s any redeeming thing I can write so to not get bad grade? It’s the datas fault
is it possible to overfit to data whiile using naive bayes?
yes
for completeness: naive bayes is a probabilistic model. you pair this with a "rule" to make an estimator: a statistic with which you evaluate whether a model is good. maximum posterior likelihood is common here (also called maximum a posteriori)
you then bundle this up with 2 other things: an actual model, which is your network, and an optimizer, which optimizes the model parameters based on the estimator
the model, optimizer, and/or available data might not be good enough to correctly implement the estimator
but what are the hyper parameters?
aren't sigma and mean all fixed? also in naive bayes we consider the variables to be independent and disjoint.
so you're specifically using a gaussian model? also no, the means and variances of the gaussian model are computed per class
and yes, you consider each mean and variance per vector in a class to be disjoint and dependent only on the class
so each class has its own mean vector and covariance matrix (diagonal, given the naive independence assumption)
if i dont do gaussian, wont there be even fewer params?
yes we compute them separately but they are fixed for each vector, right?
i dont get, what are we optimising
they are fixed, but they are computed from the data. and i guess a gaussian is about as easy as it gets
can we further nudge sigma and mean?
wdym nudge
optimize
they're already computed from the data
you solve an optimization problem to find them
and once those are learned, what the estimator does is optimize for the class of each sample you feed it by maximizing the posterior likelihood
so it's optimizing 2 things
the parameters of the gaussian distribution of each class first, as parameters of the model during training
then those params are used to infer the classes of examples you feed it when using the trained model
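A minimal sketch of those two stages for the plain Gaussian case (no neural network involved). The toy data and the helper name predict are made up for illustration. "Training" is just computing per-class priors, means and variances; inference picks the class with the largest posterior:

```python
import numpy as np

# Hypothetical toy data: 2 features, 2 classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 0.5], [4.2, 0.7], [3.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
# "Training": per-class priors, feature means and feature variances from the data.
priors = np.array([(y == c).mean() for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def predict(x):
    # Log posterior up to a constant: log prior + sum of per-feature Gaussian log densities.
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
    return classes[np.argmax(np.log(priors) + log_lik)]

print(predict(np.array([1.1, 2.0])), predict(np.array([4.1, 0.4])))
```

Overfitting can still show up here, for example when a class has very few samples and its estimated variance is tiny.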
if you're using neural networks for the inference, then you can consider learning the parameters of the gaussian distributions a "pre-learning", and then you feed labeled examples so that the network trains the weights of the inference network that predicts the class
i formulate it this way because all you said was naive bayes with gaussian model. that's barely enough to describe the estimator, not how you're implementing it
i sort of get it, but it's sort of difficult to understand how the optimisation will take place.
optimization of what?
the thing during fine-tuning as per this
i will look at code, it will clear things
i would think it's usually backwards, but hopefully that helps you
Hi!! Do any of you know about data scrubbing?
Is there anything wrong with using a model to make predictions then incorporating the predictions into another model?
why do people use -1 and 1 for perceptron?
1 is the max value the weighted sum can be
thx
Hey guys, I couldn't understand the meaning and the calculation in this section. Could sb explain to me, I appreciate it!!!
Adding up the bins, and then averaging those sums across the N frames.
.latex $\frac{1}{N}\sum_{i=1}^{N}...$ is often an averaging.
Thanks! So what's the meaning of bins here?
bins usually refers to frequency bins, but it's hard to say without further context
Why are pd.join/merge/concat/combine so cursed
Got it!
in general bins refer to reference values for the discretization of a domain, so do make sure you read the explanations given before that equation to ascertain whether it truly referred to the spectrum
Will that change the way confusion matrix looks?
And can you tell me how to use class weights
Name: City, Length: 200000, dtype: category
Categories (7489, object): ['Abbeville', 'Abbotsford', 'Abbottstown', 'Aberdeen', ..., 'Zumbro Falls', 'Zumbrota', 'Zuni', 'Zwingle']
7489 sounds like a lot of Categories to use for a categorical variable, but given the length of 200k it doesn't seem large enough, so it seems fine?
Hi, I have a question: What the meaning of "lambda: input_fn(train, train_y, training=True)"?
Why after lambda is not any variable like "lambda x: x**2"? Why is only "lambda: " ?
"input_fn" is input function
hi, it means the function does not take any arguments; equivalent to this in functionality:
def f():
return input_fn(train, train_y, training=True)
except yours has no name f
if train, train_y and training are defined and visible in the same scope as the lambda function, then it will call the function on those parameters
So I'm trying to print a pandas table by saving it to a csv file and opening it in Excel. At first everything seemed fine, but now the decimal separator of the values has changed. If a value is 50.650 cm, it will now just write 50650. Anyone know a way to fix this? I already tried in Options > Advanced.
ok thank you so much! @untold bloom @wooden sail
does the input have to be a function when we want to train the model in tensorflow?
sorry i didn't understand, can you rephrase
Sorry, my bad. I misunderstood @untold bloom
Sorry, I have a question again: Why the result is different?
this is my function of 'input_fn'
in the first one you're calling the function with parens ()
in the second one you're not calling it; only referring to the function itself
if you do (lambda: input_fn(train_y, training=True))(), they will do the same thing
f is a function; f() is calling it
lambda: ... is a function; (lambda: ...)() is calling it
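A tiny runnable version of the same point. The input_fn here is a stand-in that just echoes its arguments, not the poster's real function, and make_input is a name made up for the example:

```python
def input_fn(features, labels, training=True):
    # Stand-in for the real input function; it just echoes its arguments.
    return features, labels, training

train, train_y = [1, 2, 3], [0, 1, 0]

# A zero-argument lambda: nothing runs until it is called.
make_input = lambda: input_fn(train, train_y, training=True)

print(callable(make_input))  # the lambda itself is a function object
print(make_input())          # calling it runs input_fn with the captured names
```

This is why the lambda can be handed to something that wants a function to call later, rather than the data itself.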
Uhh… yes ?
Have you thought about how models make predictions and what this means for confusion matrix
Thank you so much for the explanation. I understand now!
how to check for cross validation using a model like isolation forest?
Hi, I recently tried creating a linear regression model using statsmodels.api and a scatter plot using matplotlib. After using the train_test_split function I ran into this error
is there any way i can make both x and y train the same size?
Hello there, I’ve transaction dataset, my goal is to find rules. So i decided to use FP-growth algorithm which is an association rule algorithm, but my minimum support is like 0.01% which is so low for 55k transactions.
What can I do to fix this?
What is the meaning of ** in .format(**eval_result)?
Why does it calculate the power of eval_result?
if df_concat_noid['42006-0.0'][i] != np.datetime64('1970-01-01T00:00:00.000000000'):
anyone know why i cannot do this
hey guys... does anyone know whether it is fair practice to drop NA from ONLY the test set? While I can conditionally impute in the training data, the massive data imbalance means that imputing at all in the test set will misguide models towards predicting the majority class
so, train test split, impute training data, and dropping any rows with NA from testing data to only test on complete samples
They're not. That was easy
How do you know that you can't
hi, as it stands it is as if you're trying a 5D scatter plot - 4 columns of X + the y values; since matplotlib (or any other tool i guess) is incapable of that, it flattens the input arrays to attempt a 2D plot. But then the sizes do not match. What you can do includes plt.scatter(x_val[:, 0], y_val), for example; it selects the 0th column first, then plots its scatter against y_val.
one use of ** is indeed exponentiation; but that's when it's an infix operator (i.e., takes 2 operands like 5 ** 7); here it's a prefix operator (i.e., unary operator; cares about only what's after) and it does some other thing. (another example is -: when used like 12 - 9, it subtracts; when used as -7, though, it negates.)
What unary ** does is to "unpack" its mapping operand. Presumably eval_result is a dictionary. Then that dictionary's key-value pairs are passed as keyword arguments to the .format function. Say it was the dict {"accuracy": 0.77}. Then with .format(**eval_result), it is as if we wrote .format(accuracy=0.77).
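A quick check of that equivalence. The dict contents and the format string are made up for the example; the real eval_result comes from the poster's model:

```python
# Hypothetical result dict; the real eval_result comes from the poster's model.
eval_result = {"accuracy": 0.77}

# ** unpacks the dict into keyword arguments for format().
unpacked = "Test set accuracy: {accuracy:0.3f}".format(**eval_result)
explicit = "Test set accuracy: {accuracy:0.3f}".format(accuracy=0.77)
print(unpacked)
print(unpacked == explicit)
```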
Hi, I am not sure which channel I should send this message to, but I wanted to know if there are any good websites for challenges in Python. By challenges, I don't mean like a competition, but something which will help practise basic concepts like loops, lists, functions, etc., using libraries like Pandas, Matplotlib, Numpy etc. If you know of any such website, please let me know.
There are things like 100-numpy-exercises, 100-pandas-exercises, etc. on GitHub. Is that what you’re looking for? If you don’t get basic Python concepts like loops and built-in data structures, I recommend mastering those first.
Yes. But I don't know much about how class weights work
Thank you so much for the explanation!
you're welcome!
Hi there I had a Q on imbalanced data
I am predicting if an animal gets rehomed from a shelter. In general my data is only slightly imbalanced (60:40 split), however by animal type it is extremely imbalanced (like 5% of birds get adopted, 90% of dogs get adopted). Is this an issue? What are ways I should solve it?
These are the average proportions of animal types that get rehomed
Guys, if Boolean variables can be handled by the Algorithms which only work with numerical distance. Then what's the point of calculating a jaccard similarity for Boolean features.
In my last assignment I dropped the Boolean features before feeding them to my knn classifier and converted them into a feature named jaccard.
But now people say that all of the Algorithms can handle categorical features with one hot encoding. And a Boolean column is already sort of one hot encoded.
What do u mean by that
Like
High BP:
1
0
0
0
1
1
Can I put it in directly to knn?
I last time transformed 5 of such Boolean features to a jaccard feature.
Which basically calculated the percentage of 1's for each row
So if it's 5 features.
1,1,0,0,1
Jaccard was 60% for that row.
What is a jaccard “feature”
That’s a binary feature ?
Jaccard index measured between values
Told you
.
5 features having values 1,1,0,0,1 gets transformed into 60% jaccard index.
As a feature.
Well, if you think about it the animal type is a feature that is related to the probability of it being adopted or not.
5% of birds are adopted and 90% the dogs are, so the animal being a bird or a dog is something that should be taken into consideration while fitting the model.
So no, you should not balance for each animal type.
Fair point, ty
while doing clustering, if a scatterplot ends up looking like this, it's safe say that both columns have no correlation and thus should be disregarded, is that correct?
looks like you need to downsample. it might be that there's actually a curve where the plot is "thicker", but that it's impossible to see because there's so many points.
Lmao looks like that African flag
I don’t rly understand that
Hmm
Leave it. I just have to write a report. I will write silly stuff in it
Is this channel pretty active, or are there better discords for AI questions?
Any good datasets of free to use images? They don't have to be labeled or anything, just images.
It's one of the most active channels on this server.
Just arbitrary images? Literally any images will do?
Okay cool. I’m trying to do an informal survey of 1: what the future of AI is (like particularly promising sub fields) and 2: what sub fields need more hardware acceleration. If you have answers to those two questions I’d be really appreciative
Preferably real world photos. What they are of matters less. Also preferably the same size.
- I don't know
- natural language processing and image/video processing
So not like ct scan data or the large scale fish dataset
Hrm I found 50k art scans but that’s probably not what you want
Why those two? Also third question, how do you test a novel type of NN?
Actually that could work really well
not what I originally had in mind, but yep that's perfect.
Those two domains use deep neural networks and large training sets, and that's where one uses GPU computation.
- building it with pytorch and seeing if it performs well on a given dataset
Hrm so I’ve got a bigger one of plants and an equal sized one of celebrities
Idk you can pick out of this https://imerit.net/blog/22-free-image-datasets-for-computer-vision-all-pbm/
Gotcha, I kinda wanted to make a NN only using multiplication and division. Idk if that’s been tried before
Hi anyone familiar with Tweepy
that's not gonna be a neural network though, it's equivalent to doing linear (or affine) transformations and can therefore be condensed into a single matrix-vector product. you need nonlinear operations to reap the benefits of the universal approximation theorems
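A small numerical illustration of the collapse, assuming "layers" that only multiply by weights (random matrices here, no bias, no activation). Two such layers give exactly the same map as one matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical "layers" that only scale and mix inputs (no bias, no activation).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x  # the same map, condensed into a single matrix
print(np.allclose(two_layers, one_layer))
```

Stacking more layers of this kind adds parameters but no expressive power, which is where the nonlinearity comes in.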
Well you can make nonlinear activation functions from that iirc, no?
from mult and division? no
Okay so I’m lost, like what does the sigmoid function do besides multiplication and division?
exponentiation
Which is a special case of multiplication.
not at all, not for non integer arguments
if you're only working with integers, i'd be willing to let that slide, but then you can't use division
exponentiation is a nonlinear transformation
Okay. So would a NN with exponentiation, multiplication, and division be sufficient?
Do you know of any approach that does that, IE avoiding addition and subtraction?
nope
Would be fun to try then
there's a reason no one does it, but try it out by all means
multiplication, division, and exponentiation can only do so much
Like what’s it missing?
in particular, you can't do any translations/shift up, down, left, or right
which means no matter how hard you try, you cannot change activation thresholds with them
Hrm true.
anyway you need addition in back propagation and grad desc, but idk how strict you were being with the "no addition/subtraction"
you could do gradient-free techniques like simulated annealing, but again, depends how strict you are in allowing addition in the cost function itself
Well I’m kinda intrigued now.
What if you were completely strict, no addition/subtraction whatsoever
go find out and let me know 😛 but that limits which functions you can work with quite a bit
for one you are immediately kinda limited to working with the min and max of functions that are bounded from below or above
since you can't even subtract the target value
can't take norms of anything other than scalars either
i would call it like "diet optimization", more like what you do the first time you learn about opt with local minima and maxima looking at the first and second derivative tests, or using probabilistic approaches
though now that you think about it, you can be a pain in the ass and work out loopholes like logs and exponents of products and divisions to achieve the same as addition and subtraction. up to you if you consider that fair or not
How would that work?
stuff like log(ab) = log(a) + log(b)
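The loophole written out as a sketch with arbitrary numbers: exp turns a sum into a product and log turns it back, so addition and subtraction can be recovered from multiplication and division alone:

```python
import math

a, b = 3.0, 7.0

# Addition recovered from multiplication: exp turns a sum into a product,
# log turns the product back into the sum.
sum_via_product = math.log(math.exp(a) * math.exp(b))
# Subtraction the same way, via division.
diff_via_division = math.log(math.exp(a) / math.exp(b))
print(sum_via_product, diff_via_division)
```

Results match a + b and a - b up to floating-point rounding, and exp overflows for large inputs, so this is more of a curiosity than a practical trick.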
Hrm that’s probably fair
in that case you're not all that constrained regarding cost funcs. working with independent random variables already lets you create log-likelihood expressions and maximum likelihood estimators only out of products of probability density functions, and their log expressions are equivalent to common cost funcs like least squares
Do you know how to add something that affects the positive part of a function only (without being piecewise)?
Oh and Edd if you could answer these two questions too please
1.) i like explainable/hybrid AI where networks are made out of classical models, but the hyperparams are optimally learned in a data driven fashion. 2.) any that need it:P i work with multidimensional data, and it's certainly needed there. that's stuff like anything with sensor arrays, composition of different sensors, hyperspectral imaging, spatial audio, etc
i can't think of anything off the top of my head, but i'm pretty sure you could fit a pretty good polynomial to a relu and add two of them with opposite signs, and compose another function with the positive one. maybe a hyperboloid would do well
With multidimensional data what sort of processing are you doing?
most commonly multi channel ultrasound stuff, but some colleagues work with stuff like mimo radar and satellites
depending how the ultrasound data is collected, you have data with axes like tx angle, rx angle, tx element, rx element, time/freq
i try to image stuff with it. tomographic inversion
Oh cool.
The wait is over. Finally found physicals and the calculus grind begins
All I need now is a trig book first
heh gil's book has an illustration of the "fundamental theorem of linear algebra" on the cover
I spotted a really good one which was called foundations of mathematics and it holds your hand through the very basics which I may borrow and recap, especially for stuff like sine and angles
may help to go back over that before trying this because i bet after a few dozen pages theyll ask you to use it, tho im not so sure the methodology cares whether its a sine function or not
oh, and logarithms
what does WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 71742 vs previous value: 71742. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize. mean?
wouldnt it be much better to just initially feed generator with real images but the ones which we are not feeding in discriminator?
Instead of random input??
probably not cuz when deploying, we would need similar images again to get novel output images.
We should input something that is actually input-able when deployed.
A step is one forward pass + gradient update
tf maintains a global step value which can be used by each component of your model (for example you might change your learning rate based on global step)
tf is warning you that your global step is not being incremented and suggesting that you have your optimizer update it automatically when it updates gradients
Also our generator generates noisy labels while our discriminator, being relatively robust to noise, cleans these labels which is not the case in TS framework.
I am new to generative models, but how is possible what that bold thing says?
I have a question regarding the concept of ELT: after the transformation, once you have created and optimized the dataframes and tables, what happens to those dataframes/tables? Do they go back to the data warehouse for data analysts to work on? If not, do data analysts need to run the optimization every time they want to work on the optimized dataframes/tables?
How would I do that
"it depends"
different architectures for different use cases
different resources as well
How would the steepness of an activation function impact its utility?
the other question is also: are you the software/data engineer serving the data analyst/scientist? are you on the cloud -- if so, this will allow for more flexible architecture.
another thing to consider is what kind of data warehouse are you working with? does it allow for "easy reads" without interfering with other processes? how expensive are the queries to run, etc.
Hello, I'm using pandas to filter some data, I'm using the next code:
first_week_data = data_relevant.loc[(data_relevant["FechaEncuesta"] >= first_day_week1) &
(data_relevant["FechaEncuesta"] <= last_day_week1)]
The problem is that I'm getting only the first and last day's data, not the days in between. Any ideas where the problem is?
data science is about analyzing data using programming. AI is about writing programs that make decisions, and stuff.
Fuente                   object
FechaEncuesta    datetime64[ns]
Grupo                    object
Ali                      object
Cant Kg                 float64
dtype: object
Oh, like predicting and stuff
Got it
!docs pandas.Series.between
Series.between(left, right, inclusive='both')
Return boolean Series equivalent to left <= series <= right.
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
@hazy saddle use this instead ^
AttributeError: 'DataFrame' object has no attribute 'between'
did you see what class has the between method?
this was made with machine learning https://www.descript.com/
Series!!
AttributeError: 'DataFrame' object has no attribute 'between'
do you see the problem?
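For the record, a minimal sketch of the Series version. The frame below is a made-up stand-in for data_relevant (only the column names FechaEncuesta and Cant Kg come from the dtypes listing above; the dates and values are invented). between is called on the single column, and the resulting boolean mask is used to filter the frame:

```python
import pandas as pd

# Hypothetical stand-in for data_relevant with a datetime column.
data_relevant = pd.DataFrame({
    "FechaEncuesta": pd.to_datetime(
        ["2021-01-03", "2021-01-04", "2021-01-06", "2021-01-10", "2021-01-12"]
    ),
    "Cant Kg": [1.0, 2.0, 3.0, 4.0, 5.0],
})
first_day_week1 = pd.Timestamp("2021-01-04")
last_day_week1 = pd.Timestamp("2021-01-10")

# between() lives on the Series (one column), not on the DataFrame.
mask = data_relevant["FechaEncuesta"].between(first_day_week1, last_day_week1)
first_week_data = data_relevant.loc[mask]
print(first_week_data)
```

Both boundaries are included by default, so days strictly between them should come through as well.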
