#data-science-and-ml
Hey @serene scaffold , tell me a bit about LSTMs in text networks... If I make a GAN for text without any LSTMs, without any syllable/word/sentence sequence, my model will only generate text without any logic, right? Even if I pass to the discriminator texts with some logic? Nevermind this latter. I remembered that if I don't shuffle my batch, it'll overfit the model
Can I simply pass sequences as inputs to both generator and discriminator without using LSTMs? Or should I use them in order to achieve better performance?
even with overfitting, loss shouldn't start at zero. also that test curve is weirdly smooth compared to the train curve. check for bugs in your code
alright, will double check the parameters
i'm less concerned about the parameters and more about the train/test split and/or how you're calculating loss
when in doubt, try to simplify, then work back upwards
I'm creating sequences and then separating those sequences into train and test set with the following functions:
```python
import numpy as np

def to_sequences(data, seq_len):
    # Build every contiguous window of length seq_len
    d = []
    for index in range(len(data) + 1 - seq_len):
        d.append(data[index: index + seq_len])
    return np.array(d)

def preprocess(data_raw, seq_len, train_split):
    data = to_sequences(data_raw, seq_len)
    num_train = int(train_split * data.shape[0])
    # The last step of each window is the target, the rest is the input
    X_train = data[:num_train, :-1, :]
    y_train = data[:num_train, -1, :]
    X_test = data[num_train:, :-1, :]
    y_test = data[num_train:, -1, :]
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = \
    preprocess(scaled_set1, SEQ_LEN, train_split=0.8)
```
what i'm finding weird is that the original dataset has 7 points (set1), but the X_train and X_test have only 3 and 1 sequence, respectively, while 5 sequences are possible:
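A quick way to check the counts is to run the windowing on a tiny array. This is only a sketch: it assumes `SEQ_LEN` was 4 (the message doesn't say), and `data_raw` is a made-up 7-row stand-in for `scaled_set1`.

```python
import numpy as np

def to_sequences(data, seq_len):
    # Every contiguous window of length seq_len
    return np.array([data[i:i + seq_len] for i in range(len(data) + 1 - seq_len)])

data_raw = np.arange(7, dtype=float).reshape(7, 1)  # hypothetical stand-in for scaled_set1
seq_len = 4                                         # assumption, not stated in the chat

windows = to_sequences(data_raw, seq_len)
num_train = int(0.8 * windows.shape[0])

X_train, y_train = windows[:num_train, :-1, :], windows[:num_train, -1, :]
X_test, y_test = windows[num_train:, :-1, :], windows[num_train:, -1, :]

print(windows.shape, X_train.shape, X_test.shape)
```

With 7 points, a window of 4 gives 7 + 1 - 4 = 4 windows, and `int(0.8 * 4)` is 3, so the split is 3/1. Five windows only come out when the window length is 3.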
wanted to use this old dataset of mine to try a bit of algorithm and nn on it but i struggle to find a suitable approach to generate 1 big dataset out of n=349 dfs like shown:
so far i tried .pivot with the column "temp"
atm the dfs are all stored inside a dict with range(0, 350)
people what is the best place for learn Machine learning?
maybe kaggle works for u in my opinion its a bit to easy with the tuts and then the competitions are rlly hard but there are some good githubs u can check from contributors
I need to understand how it works and why it works, you know?
i was studying for w3school but i dont like the blog of ML
so i have these two csv files the first one is the ingredients and the total amount we have of each and the second file is a pastry, the price the pastry sells for then the amount of ingredients needed to create that pastry
sorry for cutting you off
but i need to find the best solution, given your circumstances. Output the total profit and how much of each
pastry you have to make
i'm not here for the answer i want to actually understand how i would go about this
divide the amounts u need for each recipe to see how many u could produce after that u could build total price
by that you mean the max number of each pastry right?
ye
u could then also check whats the best function for a mix of pastrys with the given amount
sorry if its dumb qs but just to make sure i divide the total we have of each ingredient by the amount of it the recipe needs to find max pastry we could make right
so for apple pie i got 158, croissant 79, poppy seed 51
u need to consider u can always only produce the least amount possible
so if
Y F S
1 2 3
is the result u can only do 1
got it i was dumbo
im a bit confused
so what i did was the max amount of each individual pastry that could be made with the total amount of ingredients
yes but what if for example for x pastry u would need x sugar but >x flour
if it works out just fine thats good but u need to consider that
ohh ok yea i took that into account i divided the total of each ingredient by the amount needed for apple pie
then i took the lowest amount
👍
thats what you mean right
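The "take the lowest amount" step can be sketched like this. The stock and recipe numbers are made up for illustration (the real CSV contents aren't shown in the chat), and `max_batches` is a hypothetical helper name.

```python
# Hypothetical data: the real values live in the two CSV files.
stock = {"flour": 1000, "sugar": 300, "butter": 300}
recipes = {
    "apple pie": {"flour": 200, "sugar": 100, "butter": 50},
    "croissant": {"flour": 100, "butter": 60},
}

def max_batches(recipe, stock):
    # Each ingredient allows stock // needed pastries;
    # the scarcest ingredient decides the real maximum.
    return min(stock[name] // amount for name, amount in recipe.items())

limits = {pastry: max_batches(recipe, stock) for pastry, recipe in recipes.items()}
print(limits)
```

Here apple pie is capped by sugar (300 // 100 = 3) even though flour alone would allow 5.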
how do i do A* (star) search when my goal state is finding all the keys in a grid. Like how do i calculate my heuristic
do yk what i should calculate next
how would i find the best combo to maximize profit
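One simple way to look for the best combination is to try every mix up to the per-pastry maximum and keep the most profitable one that fits the stock. This is a brute-force sketch, not the standard method (problems like this are usually posed as integer linear programs), and all numbers and names below are made up.

```python
from itertools import product

# Hypothetical data standing in for the two CSV files.
stock = {"flour": 1000, "sugar": 300, "butter": 300}
recipes = {
    "apple pie": {"flour": 200, "sugar": 100, "butter": 50},
    "croissant": {"flour": 100, "butter": 60},
}
prices = {"apple pie": 10, "croissant": 4}

def max_batches(recipe, stock):
    # Upper bound for one pastry if nothing else is baked
    return min(stock[name] // amount for name, amount in recipe.items())

def best_mix(stock, recipes, prices):
    names = list(recipes)
    limits = [max_batches(recipes[name], stock) for name in names]
    best_profit, best_counts = 0, dict.fromkeys(names, 0)
    # Try every combination of counts within the individual limits
    for counts in product(*(range(limit + 1) for limit in limits)):
        used = dict.fromkeys(stock, 0)
        for name, count in zip(names, counts):
            for ingredient, amount in recipes[name].items():
                used[ingredient] += amount * count
        if any(used[i] > stock[i] for i in stock):
            continue  # this mix needs more than we have
        profit = sum(prices[name] * count for name, count in zip(names, counts))
        if profit > best_profit:
            best_profit, best_counts = profit, dict(zip(names, counts))
    return best_profit, best_counts

profit, counts = best_mix(stock, recipes, prices)
print(profit, counts)
```

The search space grows quickly with more pastries, so this only makes sense for a handful of recipes.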
well
people what is best place for learn machine learning with fundamentals
lol
Neural network from scratch 🤓
Have you read it?
Good to know. It's on our resources page, but I can't actually verify that it's good
Ok
Yes
Add a new categorical column to df_housing called NOXCAT. This column categorizes the suburbs into towns with LOW, MEDIUM, and HIGH nitric oxides concentration (based on the variable NOX). The categorization should be based on quantiles of NOX as follows:
LOW (NOX <= 30% quantile)
MEDIUM (> 30% quantile; <= 70% quantile)
HIGH (> 70% quantile).
There is a dataset with a column NOX, all numbers with about 3 decimals.
I know this will be way off but my attempt that keeps getting an error is:
itm_low= np.quantile(df_housing["NOX"], q=0.30)
itm_med= np.quantile(df_housing["NOX"], q=0.70)
itm_high= np.quantile(df_housing["NOX"], q=1)
df_housing['NOXCAT']= {"NOX": {(itm_low): "LOW", (itm_med): "MEDIUM", (itm_high): "HIGH"}}
Any assistance would be much appreciated!
Does anyone have a tip to get the closest float number from a certain input?
I'm testing a word prediction model and I'm trying to work with data within range [-1, 1]. The model is doing quite fine, but I'm having some problems when trying to convert my tokens back to words again.
How can I make an output which has value -0.0703 be converted to a word which has value(in my dictionary) -0.0702?
Meh. I'll just stick to scikit learn's nearest neighbours...
I'm still trying to work my one out, ffs
the operation is in general not invertible. in special cases, you can reconstruct the values using sparse recovery, so L1 regularized optimization
Oh... I see...
Uh...well...at least it worked with KNN...
I'm not even using embedding layers, since I'm using floats and not using one-hot encoding. Hope this doesn't prejudice the model too much.
My answer to this is;
is_small = df_housing['NOX'] < df_housing['NOX'].quantile(.3)
is_large = df_housing['NOX'] > df_housing['NOX'].quantile(.7)
is_medium = ~(is_small | is_large)
df_housing['NOXCAT'] = df_housing['NOX'].mask(is_small, 'small').mask(is_large, 'large').mask(is_medium, 'medium')
print (df_housing['NOXCAT'])
Seems to work.
just need to change the names around to LOW, MEDIUM and HIGH
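Another way to express the same binning is `pd.cut` with the two quantiles as bin edges. This is a sketch on a made-up series rather than the real `df_housing`. With the default `right=True`, each bin includes its upper edge, which matches the `<=` in the task.

```python
import numpy as np
import pandas as pd

nox = pd.Series(range(1, 11), dtype=float)   # made-up stand-in for df_housing["NOX"]
q30, q70 = nox.quantile(0.30), nox.quantile(0.70)

# Bins are (-inf, q30], (q30, q70], (q70, inf)
noxcat = pd.cut(nox, bins=[-np.inf, q30, q70, np.inf],
                labels=["LOW", "MEDIUM", "HIGH"])
print(noxcat.value_counts())
```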
maate no wonder software engineers and data scientists are on the big bucks, being proficient at excel I thought I was clever until I took on this stuff
!rule 8
8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.
Anyway other than this, anyone who claims to be 'paying well' is 99.95% guaranteed to not be paying well
you sound really nice, I'm sure someone will be hanging to work for you
All the best chief
ok, not using embeddings does make it a lot easier. knn is certainly one way to do it, but it depends how. in theory you already know the centroids of the voronoi regions as these are what you have in your dict already, so there's no need to compute them again
you could also use the 2-norm (euclidean distance) if the encoding is multidimensional
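For a one-dimensional encoding, the nearest-value lookup can be done directly against the dictionary values. A minimal sketch, assuming a small made-up vocabulary (`vocab` and `nearest_word` are illustrative names):

```python
import numpy as np

# Hypothetical word -> value dictionary in [-1, 1]
vocab = {"cat": -0.0702, "dog": 0.3150, "fish": -0.8800}
words = list(vocab)
values = np.array([vocab[w] for w in words])

def nearest_word(x):
    # Pick the word whose stored value is closest to the model output
    return words[int(np.argmin(np.abs(values - x)))]

print(nearest_word(-0.0703))
```

For a multidimensional encoding the absolute difference would be replaced by a distance such as `np.linalg.norm(values - x, axis=1)`.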
Nice. Thanks. Then I'll stick to the knn
Hi, any idea on which algorithm i can explore (or article for reference) if i want to predict a coordinate value (x,y) based on x values (x1, x2, ... , x (n))
Eg:
| x1 | x2 | x3 | ....... | y
0.43 | 0.56 | 31.21 | ....... | (3.51, 4.66)
I tried RandomForest but it gives me error - "ValueError: could not convert string to float: ''
That's a Python error on your end, not an algorithmic error
should I convert all my x values to float then?
They should be floats yeah
Let me try
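The `''` in that error suggests some cells are empty strings. A sketch of one way to handle it with pandas, on made-up data (the column names are assumptions): coerce the feature columns to numbers, then drop the rows that could not be converted.

```python
import pandas as pd

# Made-up rows; empty strings mimic the values that triggered the error
raw = pd.DataFrame({
    "x1": ["0.43", "0.51", "", "0.60"],
    "x2": ["0.56", "", "0.75", "0.80"],
    "y_x": [3.51, 2.10, 4.00, 1.25],
    "y_y": [4.66, 1.90, 3.30, 2.75],
})

# Empty or malformed strings become NaN instead of raising
features = raw[["x1", "x2"]].apply(pd.to_numeric, errors="coerce")
clean = pd.concat([features, raw[["y_x", "y_y"]]], axis=1).dropna()
print(clean)
```

After this, the two target columns can be passed together as `y`. As far as I know, scikit-learn's `RandomForestRegressor` accepts a two-column target for this kind of (x, y) output.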
Hi, is there a way I can condense my code? What I'm doing here is taking a centre pixel and looking at neighbouring pixel values that are equal to 255
coords = zip(z_coords.tolist(), x_coords.tolist(), y_coords.tolist())
for z, x, y in coords:
    # exclude edges/boundary of skeleton image (maybe could pad skeleton image in the future)
    if z == skel3d.shape[0] - 1 or x == skel3d.shape[1] - 1 or y == skel3d.shape[2] - 1:
        continue
    # keep track of the neighbours
    neighbours = []
    # current slice
    neighbours.append(skel3d[z, x-1, y-1])
    neighbours.append(skel3d[z, x-1, y])
    neighbours.append(skel3d[z, x-1, y+1])
    # middle so exclude the actual centre voxel - except for prev and next slice
    neighbours.append(skel3d[z, x, y-1])
    neighbours.append(skel3d[z, x, y+1])
    neighbours.append(skel3d[z, x+1, y-1])
    neighbours.append(skel3d[z, x+1, y])
    neighbours.append(skel3d[z, x+1, y+1])
    # previous slice
    neighbours.append(skel3d[z-1, x-1, y-1])
    neighbours.append(skel3d[z-1, x-1, y])
    neighbours.append(skel3d[z-1, x-1, y+1])
    neighbours.append(skel3d[z-1, x, y-1])
    neighbours.append(skel3d[z-1, x, y])
    neighbours.append(skel3d[z-1, x, y+1])
    neighbours.append(skel3d[z-1, x+1, y-1])
    neighbours.append(skel3d[z-1, x+1, y])
    neighbours.append(skel3d[z-1, x+1, y+1])
    # next slice
    neighbours.append(skel3d[z+1, x-1, y-1])
    neighbours.append(skel3d[z+1, x-1, y])
    neighbours.append(skel3d[z+1, x-1, y+1])
    neighbours.append(skel3d[z+1, x, y-1])
    neighbours.append(skel3d[z+1, x, y])
    neighbours.append(skel3d[z+1, x, y+1])
    neighbours.append(skel3d[z+1, x+1, y-1])
    neighbours.append(skel3d[z+1, x+1, y])
    neighbours.append(skel3d[z+1, x+1, y+1])
    if neighbours.count(255) > 2:
        print(z, x, y)
has anyone created their own python implementation of neat? It would be really helpful if I could see it
for loops ? lmao
Turn neighbors into a numpy array, compare the entire thing to 255, and apply a convolution on the result of the comparison
^ would work better
Do a > 2 on the result of that, and send it to numpy.where to get the indices
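The suggestion above can be sketched with NumPy alone. Instead of calling a convolution routine, this version adds up the 26 shifted copies of the mask, which is the same sum a 3x3x3 all-ones kernel with a zero centre would compute. Two assumptions are made: the coordinates in the original loop are the voxels equal to 255, and the border is zero-padded rather than skipped. `crowded_voxels` is an illustrative name.

```python
import itertools
import numpy as np

def crowded_voxels(skel3d, value=255, min_neighbours=3):
    mask = (skel3d == value)
    # Zero border so edge voxels simply have fewer neighbours
    padded = np.pad(mask, 1).astype(np.int32)
    depth, rows, cols = mask.shape
    counts = np.zeros(mask.shape, dtype=np.int32)
    for dz, dx, dy in itertools.product((-1, 0, 1), repeat=3):
        if (dz, dx, dy) == (0, 0, 0):
            continue  # the centre voxel is not its own neighbour
        counts += padded[1 + dz:1 + dz + depth,
                         1 + dx:1 + dx + rows,
                         1 + dy:1 + dy + cols]
    # Indices (z, x, y) of set voxels with more than two set neighbours
    return np.argwhere(mask & (counts >= min_neighbours))

skel = np.zeros((5, 5, 5), dtype=np.uint8)
for voxel in [(2, 2, 2), (2, 2, 1), (2, 2, 3), (1, 2, 2)]:
    skel[voxel] = 255
result = crowded_voxels(skel)
print(result)
```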
Hi guys, is there anyone here who has experience on number recognition by chance?
Please always ask your actual question, rather than asking if people know something
hey! so I have this final year project on the theme of the city. I thought about writing a program to optimize a traffic light system. My idea is to count the vehicles in each waiting queue (which I've already done using opencv and yolov3), but I'm having trouble implementing an algorithm I found that uses a conflict matrix, so I'm looking for alternative things I can do in case I can't get that code working
something that uses vehicles detection
and that is not just programs but also math theories
Hey @tacit nacelle!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
Is it not recommended to try make an AI crack sha256 hashes?
Ai can't do that.
Is dumping model with pickle library fine
I dumped it and loaded it in another script. But the predictions seems to be a bit off. So I was wondering if there's some issue with the model loading
Because my accuracy was good earlier
Infact it couldn't even predict well the data it was trained on
If the predictions seem wrong, pickle probably isn't the problem. If there was a problem with using pickle, it would probably fail to load.
But for whatever library you used to train the model, I would use its native saving functionality.
I did model=logisticreg()
Model.fit(train, test)
Pickle.dump(model)
Is that fine way?
See if there's a save method for the model object
Hmm
I used sklearn
And official sklearn says to use pickle
Might be something wrong with my preprocessing maybe then
well i'm trying anyway
my lowest is 89 bits off out of 256 (Maximum error across 1,500 random hashes)
hey I want to learn AI can anyone share any roadmap
(ignore, figured it out)
That's not how it works. AI can't learn stuff that's basically random.
i mean, yeah, but worth a shot
not worth a shot
AI isn't magic, it's a science
it is not applicable to that task at all
You could potentially overfit a model to the hashes in your training data, but it simply isn't possible to create a generalized model that can do this.
I just want to make hash cracking faster for 32 bytes
Then it absolutely is not "worth a shot" to do it this way.
Without reading above, it is literally impossible to predict randomness
By definition of what randomness is
so i am using ssh and was having problem of process getting "killed" probably due to resource utilisation.
I had to extract feature using resnext3d on 1900 videos
doing all was giving that "killed" error
so i tried to decrease the number of videos i did feature extraction.
on 10 videos it took 50 minutes.
I dont know if its normal or no, Please guide.
this was pretty nifty https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor
In our paper, published today in Nature, we introduce AlphaTensor, the first artificial intelligence (AI) system for discovering novel, efficient, and provably correct algorithms for fundamental tasks such as matrix multiplication. This sheds light on a 50-year-old open question in mathematics about finding the fastest way to multiply two matric...
tldr is this:
yeah
it's pretty cool that they managed to discover a new multiplication algorithm for 4x4 matrixes over Z/2
hopefully they adjust their loss function and see if they can generate some novel algorithms that work on floats (or even just on ints)
hey if anyone is familiar with deploying a ml model in ai can u check my q in broccoli pls thx
I need help on part 5
import numpy as np
import pandas as pd
yellow_taxi = pd.read_csv('2018_Yellow_Taxi_Trip_Data.csv')
# For each month, print the entire row with the highest fare_amount
print("For each month, print the entire row with the highest fare_amount.")
# obtain month from pickupdate
yellow_taxi['month'] = pd.DatetimeIndex(yellow_taxi['tpep_pickup_datetime']).month
monthlyMaxFares = yellow_taxi.groupby(['month'])['fare_amount'].max()
# for month_index in range(1,13):
#     month_subset = yellow_taxi.loc[yellow_taxi['month']==month_index]
#     max_fare_row = month_subset.loc[month_subset['fare_amount']==np.max(month_subset['fare_amount'])]
#     if max_fare_row.shape[0] != 0:
#         print(max_fare_row)
Look into idxmax
Once you know the idxmax, you can get the rows
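A small sketch of the `idxmax` idea on made-up rows (the column names follow the taxi CSV above; the values are invented): `idxmax` returns the index label of the largest fare in each month, and `.loc` then pulls the full rows.

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2018-01-03", "2018-01-20", "2018-02-05", "2018-02-11"]),
    "fare_amount": [12.5, 40.0, 8.0, 7.5],
})
trips["month"] = trips["tpep_pickup_datetime"].dt.month

# One index label per month, pointing at the row with the highest fare
top_index = trips.groupby("month")["fare_amount"].idxmax()
top_rows = trips.loc[top_index]
print(top_rows)
```

If several rows tie for the maximum in a month, `idxmax` keeps only the first one.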
iv never heard of this before lol but let me check
cuz i have the csv i have to get maximum of certain column for every month and have to print down the entire row
The new script where this model is loaded in, is it running on the same machine you used to train the model in the first place? If No, then that's probably what could have caused the drift.
Yes it is. Given the drift you're currently experiencing, you might wanna try using joblib instead of pickle. Then compare and contrast if there's any significant change in your model performance.
Hi Musk, check out this roadmap. You might wanna pay less attention to the timeline therein as I don't find it very realistic.
Meanwhile.... I need a new Tesla 😊
Some of these need to be abbreviated so that deep learning can have more time. And NLP should probably be dropped completely
Month 8 might probably be for learning a Deep Learning framework, who knows? 😄 I do find it ridiculously unrealistic to learn CV + NLP in 1 month. Is one month even enough to do justice to CV alone?
I doubt it
how do i get a heuristic for multiple targets
like my goal isn't an end point but finding all keys
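One simple option for a "collect all keys" goal is sketched below. It assumes a 4-connected grid with unit step cost, and that the search state is the pair (position, set of keys still missing). The heuristic is the Manhattan distance to the farthest remaining key. Every key has to be reached at some point, so this never overestimates the remaining cost. The function names are illustrative.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def keys_heuristic(position, remaining_keys):
    # Nothing left to collect: the goal is reached
    if not remaining_keys:
        return 0
    # Lower bound: we must at least walk to the farthest key
    return max(manhattan(position, key) for key in remaining_keys)

h = keys_heuristic((0, 0), frozenset({(0, 3), (2, 2)}))
print(h)
```

Tighter bounds exist (for example adding a spanning-tree estimate over the keys), but this one is cheap and safe to start with.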
@regal ingot can you elaborate?
What metadata?
I think I figured it out
I am basically doing it this way
# adding metadata of sheet name into the df to be used later
df.sheet_name = sheet_name
This is too vague for me to understand what would be helpful for you. If you have a specific question, feel free to ask.
thank you
no worries on this question though I think I am good for now
thank you though!
Does anyone know of a credible paid tutoring service for pandas, numpy, matplotlib, scipy? Not sure I will use it but curious what kind of resources are out there.
I don't. You can just ask questions in this channel, and as long as they're well formulated, people will be happy to answer.
This
I tried to create 3 variables using numpy.quantile, just can’t get it to work
I would have answered that if I were at a desktop
Unfortunately I'm on vacation. So I only have my phone.
It’s ok, thanks though
Try again on Tuesday
Just frustrating, uni course pre-requisites were basically nothing. They sting you $3500 and the lectures assume a lot of prior knowledge, I’m on track for a fail.
Good way for them to make money I guess.
To get better help I suggest you come up with publicly available toy data and code
At best I can give you this
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df.rename(columns={"sepal length (cm)": "a"}, inplace=True)
quantiles = np.quantile(df["a"], (0.2, 0.5, 0.8))

def cat(x):  # categorise
    if x < quantiles[0]:
        return "LOW"
    # the other categories are left to fill in

df["b"] = df["a"].apply(cat)
I don't want to ping Stelercus but let me know if you think that's too much help
Cheers I’ll give it a run in a few hours
It won't solve your problem if that's what you're expecting
I need to name the quantiles LOW MEDIUM HIGH and then do visualisations and regress each
That’s ok
I’m gonna try get my money back on the basis the course outline is misleading
Respect to those that are good at this, a lot to learn
Do you think I'll eat you?
Your solution could be made a bit more performant with a loc assignment
I guess that's not actually a solution. And I strongly agree with your saying that creating a toy example is very helpful
This was my answer, but when I regressed each against another column it omits LARGE, leading me to believe it’s wrong
It’s due tonight so I’ll submit that and hope for the best, problem is that part is only worth 5% 😵
The rest is probably even more difficult
oh it's my first "let's do this" kind of solution
I think this means you don't quite understand the regression.
I'm not sure if you're regressing it correctly
The newly created column NOXCAT in df_housing is a categorical column with three possible values (LOW, MEDIUM, and HIGH).
Create a set of dummy variables (for different values of NOXCAT).
Regress MEDV on the different NOX categories using the dummy variables. Choose the dummy variable coding in your regression such that the intercept reflects the MEDV value of suburbs in the MEDIUM category. Save the regression result as res_2 and print the regression result to the console.
Report the regression results from res_2 in your own words according to APA style and interpret the coefficients.
Hint: Look at pd.get_dummies.
ANSWER that doesn't provide coef for HIGH
pd.get_dummies(df_housing['NOXCAT'], prefix="dummy")
#print (df_housing) - checked
mod_2 = smf.ols('MEDV ~ NOXCAT', data=df_housing)
res_2 = mod_2.fit()
print (res_2.summary())
I do note I see no HIGH values in NOXCAT, so I think my previous answer is wrong
That is what it produces
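What "the intercept reflects the MEDIUM category" means can be seen with a tiny least-squares fit. This is a sketch on made-up numbers, using NumPy instead of statsmodels so it stays self-contained: MEDIUM gets no dummy column, so the intercept is the MEDIUM mean and the other two coefficients are differences from it. With the statsmodels formula API, the reference level can be set inside the formula, e.g. `C(NOXCAT, Treatment(reference="MEDIUM"))`.

```python
import numpy as np
import pandas as pd

# Made-up data, not the housing dataset
df = pd.DataFrame({
    "NOXCAT": ["LOW", "LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH"],
    "MEDV": [30.0, 32.0, 20.0, 24.0, 10.0, 14.0],
})

dummies = pd.get_dummies(df["NOXCAT"], dtype=float)
# Intercept column plus dummies for LOW and HIGH; MEDIUM is the baseline
X = np.column_stack([np.ones(len(df)), dummies["LOW"], dummies["HIGH"]])
coef, *_ = np.linalg.lstsq(X, df["MEDV"].to_numpy(), rcond=None)
intercept, low_effect, high_effect = coef
print(intercept, low_effect, high_effect)
```

Note that `pd.get_dummies` returns a new frame, so its result has to be assigned to something before it can be used in a regression.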
Can I do MS in AI or NLP with a BS in IT?
You might have to take some prerequisites before you would be able to start the MS courses. The best way to figure this out would be to look at the admissions websites for the programs.
There usually isn't an "MS in AI", and there definitely won't be one in NLP. It would probably be an MS in CS. If you know you want to do NLP, make sure there are research faculty at that university who specialize in it.
It's the same machine but different consoles. I am using it with flask in anaconda second time. First time I trained it in jupyter.
Do undergrad research?
Are you in US btw?
I don't understand the question.
Please make sure that your question is a complete sentence. If you use an incomplete sentence thinking that I'll know what the intended full sentence is, there's a very high chance that we'll miscommunicate.
What do you mean by this, "If you know you want to do NLP, make sure there are research faculty at that university who specialize in it."
Yes
There are faculty that specialize in it
If you want to get an MS so that you can do NLP, you would get a CS degree and take the NLP courses. And if there's an NLP specialist, you could do thesis work with them.
Ah I understand
My major is fully Python
The CS major is heavily C language
Basically all of NLP is done in Python. The "main" courses might be in C, but the AI ones won't be.
Can someone tell me some metrics and tricks using loss functions for Text Generator Models? I suppose there's some GANs for text in order to have a good conversation model, right? Perhaps some metric or trick to measure how much the generated text makes sense?
I know that, in SRGAN, it was used a MSE Loss multiplied by an "adversarial loss", in order to achieve the "pixel-wise loss"(or something like this), which can improve the GAN output diversity.
There's overlap between what is considered data science and what is considered NLP. Beyond that, I don't agree with the premise of your question.
Ok understood
"data science" isn't a well defined thing. Linear algebra is, however. And it's needed for NLP.
Okay thx for your time
I have always thought of making a reinforcement learning algorithm for better traffic lights, never got to it
can you guys take mock interviews lmao, I would attend 100%
I am kinda scared of them
lol
You need to read what R is doing for categorical predictors.
e.g. https://rpubs.com/beane/n3_5
Cheers, I think I need to convert the categorical values to int, I’ll have a better read later and play around. Have 3hrs left til it’s due 🤪 Thinking with what I’ve done will go close to 50% and keep me afloat for now…
I want to get into concepts of ai and ml, where should i start?
the links seem to point at books
I did not point at the fcc, so no, I don't think the fcc is worth it
Thanks
thanks dude Tesla Coming ur way😂
Rubiks Cube AI assistant
Is it normal for PyTorch CUDA models to show barely any usage in Task Manager?
It seems like it just uses a fraction of Copy and fills up VRAM, but it's not actually working too hard
Is it possible to somehow use more of the GPU in order to accelerate the workload, or is that just not possible?
Sorry if it's a silly question, it's my first time using CUDA
I'm just fabulous as always
Hello y'all! Does anyone of you know the correct term for when you assume the prediction for tomorrow is the same as the value today? I think I read it on machinelearningmastery, but I am unable to find it and I also don't remember what this was called. I believe he introduced it as the simplest benchmark in order to see if a model can beat the simple assumption that value today == value tomorrow
What about interpolation?
not stationarity. it was a special word that i cannot remember.... what i actually mean is a martingale sequence. but he used another word (and in my eyes better word) for it when he created a baseline model
"naive" model it is called
while martingale always kind of implies that you double your stakes, he used a word (not it but like) "autoregressive baseline"
it's random walk's corresponding model
do you have some link or paper at hand ? Naive as search word always returns naive bayes 😄
it's mentioned here https://otexts.com/fpp3/simple-methods.html
under "naive method"
although "i" is with 2 dots on it
thank you so much!
This is what i was looking for! (even though i still believe brownlee used another word 😄 )
you'd have to give more context 😛
He actually called it naive! You are exactly right @untold bloom : https://machinelearningmastery.com/how-to-grid-search-naive-methods-for-univariate-time-series-forecasting/
Simple forecasting methods include naively using the last observation as the prediction or an average of prior observations. It is important to evaluate the performance of simple forecasting methods on univariate time series forecasting problems before using more sophisticated methods as their performance provides a lower-bound and point of comp...
finally found it 🙂
stupid follow up question: If i assume the price of X tomorrow is the same as price of X today in a naive model, how do i decide if I should buy or sell? I basically can only make the decision based on the differenced timeseries, right?
So if the change from yesterday to today was let's say 2%, I assume 2% for tomorrow as well. Whereas if todays price of X was 100 and I assume 100 for tomorrow as well, there is no room for decisionmaking - which would kind of imply a "hold" strategy, right?
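The two variants discussed here can be written out on a made-up price series. This is only a sketch of the baselines, not a trading rule: the naive method on prices predicts "tomorrow equals today", and the naive method on the differenced series predicts "tomorrow's change equals today's change".

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])  # made-up values

# Naive on levels: forecast for day t is the price on day t-1
level_pred = prices[:-1]
level_mae = np.mean(np.abs(prices[1:] - level_pred))

# Naive on differences: forecast change for day t is the change on day t-1
changes = np.diff(prices)
change_pred = prices[1:-1] + changes[:-1]
change_mae = np.mean(np.abs(prices[2:] - change_pred))

print(level_mae, change_mae)
```

The level forecast always implies an expected change of zero, which is why it offers no buy or sell signal by itself. It is a benchmark for other models to beat.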
Given an image of a chess board, I would like to find out what piece each square contains. Is there any python package that could assist this task?
Don't look for libraries/packages. Look for techniques
The first step would be segmenting the chessboard image into each tile
This should be easy since chessboard is already a grid
The second step is to classify each tile as either blank or what piece it is
For the second step, you would need training data that has different images of what those pieces could look like
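The segmentation step can be sketched with a reshape, assuming the image is already cropped to the board, axis-aligned, and loaded as a NumPy array of shape (height, width, channels). `split_board` is an illustrative name.

```python
import numpy as np

def split_board(image, squares=8):
    height, width = image.shape[:2]
    tile_h, tile_w = height // squares, width // squares
    # Drop any leftover pixels so the image divides evenly
    image = image[:tile_h * squares, :tile_w * squares]
    tiles = image.reshape(squares, tile_h, squares, tile_w, *image.shape[2:])
    # Result shape: (row, column, tile_h, tile_w, channels)
    return tiles.swapaxes(1, 2)

board = np.arange(64 * 64 * 3).reshape(64, 64, 3)  # fake image for the demo
tiles = split_board(board)
print(tiles.shape)
```

`tiles[r, c]` is then the picture of the square in row `r`, column `c`, ready to be fed to a classifier.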
Are these pictures of real chess boards, or virtual chess boards that are 2d?
@hardy siren sorry, I was away for a bit. Look at 3blue1brown's series about neural networks. He makes a classifier for the MNIST dataset of images, which is a very similar problem.
https://jakevdp.github.io/PythonDataScienceHandbook/
There is a second edition coming out this December which will also be online (although you may find a pdf of a pre-release version floating about 👀 )
Covers NumPy, Pandas, Matplotlib, and Scikit-learn
I have the pre release via my O'Reilly account. Perhaps this should go on our resources page?
Does anyone know of research done on extracting features from images for structure from motion? A neural SIFT so to say. I've only found one or two Papers that don't really delve deep into the subject.
Yeah if you can access the latest version there is no reason not to
Do you have something like that in video format
You'll never find a video with the level of detail in a text book
What do data scientists do? It is not clear to me, would anyone mind explaining it?
modern statistics
hmm
Would you give me an example of where data science could be used?
I've seen some people commenting on the use of Python for managing investments, would that be a case scenario where data science is used?
u can use python for everything thats the neat thing bout it
hmm
so u generate data
u import the data into ur algorithm
u transform the data for better use of it
u can run different types of "tests" to see trends in ur data
u can visualise ur data
so u see there is no clear description
I got quite interested in investment lately, would Python be a useful tool for analyzing data and then deciding on what would be a good investment?
This part I'd be doing myself rly
the market is not following any rules
I'd only use Python for analyzing and showing important data
ye that works
hmm
easy
Apparently, data science and machine learning seem to be used together quite often, is ML useful for data science?
yfinance or another API
for ML u need data so ofc
but its not always the best approach to a problem
sometimes human brain works aswell
i plan to do so
cool
any tips on where I could get started? assuming I already know Python
depends on ur background im new to data science aswell
well, I'm a backend developer
well then u got more knowledge then me i guess 😄
but if u wanna analyse stocks i can give u my import tool on crypto currencies
Yeah but not used to reading text books. They get boring to me.
Plus I get overwhelmed by how slow i proceed in a book
Like 30 minutes a page
that'd be cool, do u have the code on github? if so, would you mind letting me see it?
its not complex code only the import part i had not yet managed to work on it further
that's cool
```python
import pandas as pd
from requests_html import HTMLSession

numbers = [number for number in range(0, 1100, 100)]
table = pd.DataFrame()
for number in range(len(numbers)):
    if numbers[number] == 1000:
        break
    else:
        session = HTMLSession()
        resp = session.get(f"https://finance.yahoo.com/cryptocurrencies?offset={numbers[number]}&count=100")
        tables = pd.read_html(resp.html.raw_html)
        df = tables[0].copy()
        df.index = range(numbers[number],
                         numbers[number+1])
        table = pd.concat([table,df])
Symbols = list(table.Symbol)

import yfinance as yf
import datetime as dt
import timeit
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import clear_output

fig=go.Figure()
fig = make_subplots(specs=[[{"secondary_y":True}]])
start = "2020-01-01"
end = dt.datetime.now()
count = 0
start_timer = timeit.default_timer()
for x in Symbols:
    count +=1
    symbol = x
    x = yf.download(x,
                    start,
                    end,
                    )
    stop_timer = timeit.default_timer()
    clear_output(wait=True)
    print("Current progress:",
          np.round((count/len(Symbols))*100, 2), "%",
          #end="\r"
          )
    print("Current runtime:", np.round((stop_timer-start_timer)/60, 2), "minutes")
    fig.add_trace(go.Scatter(
        y=x['Open'],
        x=x.index,
        name = symbol,
        legendgroup = symbol,
        marker_color = "green"
        ),
        secondary_y=False
    )
    fig.add_trace(go.Scatter(
        y=x['Volume'],
        x=x.index,
        name = symbol,
        legendgroup = symbol,
        marker_color = "red"
        ),
        secondary_y=True
    )
```
thx for sharing :]
my pleasure
@young granite I was working with yfinance yesterday! Nice coincidence. What are you working on? I am an experienced ML developer, so I can help answer a few questions if you have any
How to measure speed rate when someone reading a paragraph
Interesting question. But my best guess is this is not something solved using ML, apart from the part where you track eye movement.
Hey anyone knows how to rewrite this compile without string parameter ?
m.compile(
optimizer="RMSprop",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
Tried this, but got an error in model fitting:
m.compile(
optimizer=keras.optimizers.RMSprop(),
loss=keras.losses.sparse_categorical_crossentropy,
metrics=[keras.metrics.Accuracy()]
)
I understand that it is metrics issue, however, confused on why would that throw an error
ValueError: Shapes (None, 1) and (None, 10) are incompatible
Note: I'm getting this error only with Accuracy
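A likely cause, as far as I know: the string `"accuracy"` lets Keras pick an accuracy variant that matches the loss and output shape, while `keras.metrics.Accuracy()` compares `y_true` and `y_pred` element by element, which fails for labels of shape (None, 1) against 10 class scores. The metric that fits integer labels is `keras.metrics.SparseCategoricalAccuracy()`. The NumPy sketch below only illustrates what that metric computes, on made-up values.

```python
import numpy as np

# Made-up batch: integer labels of shape (N, 1), scores of shape (N, classes)
y_true = np.array([[2], [0], [1]])
y_pred = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
])

def sparse_categorical_accuracy(y_true, y_pred):
    # Compare each label with the index of the highest score
    predicted = np.argmax(y_pred, axis=-1)
    return float(np.mean(predicted == y_true.reshape(-1)))

acc = sparse_categorical_accuracy(y_true, y_pred)
print(acc)
```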
Hi! I had a question regarding backwards propagation and gradient descent.
W <- W - alpha * dJ(W)/dW
From what I am understanding, the gradient in the formula above is retrieved via back propagation in the neural network. In screenshot1 I understand how this chain rule gives us the gradient for J(w) and w2.
In screenshot 2 though, I am a bit confused as to why we apply the chain rule in that way compared to screenshot 3. Since we have the values of w1 and J(w), why do we need to apply the chain rule again for dy/dw1? Wouldnt that be an unnecessary extra step?
what is J here?
i assume W are all the model parameters
J would be the cost function, and yes W are all the model parameters
what is the context for screenshots 2 and 3?
to get the gradient of the graph mapping the cost function J(W) against w1. This gradient is then used in the gradient descent formula to find the optimal w1
it looks like screenshot 2 is the "fully expanded" form of 3
Yes, but I am just a bit confused as to why we need to do that, in the video it was said that we cant directly get dy/dw1 (screenshot3)
hence why they apply chain rule again in screenshot 2
since we already have the values of dJ(W) and dw1
dw1 isn't really a "value". dJ/dw1 is notation indicating the derivative of J with respect to w1
Lets say we are at the stage where we want to find the optimal w1. Why would we even need to use the chain rule to find that gradient, isnt it enough to just have the values of the cost, based on different vals of w1?
you can do that if you know the closed form of that expression!
I came across this formula before
Would that be the closed form?
If thats the case, then is the chain rule preferred due to performance? Ie would it be more expensive to use that closed form for each w (especially in networks with higher depth and width)?
it's hard to tell what this formula means
this looks a bit like the expression for one layer only
yes I think thats what it is
it might be illustrative to actually work through these expressions in their fully "expanded" form
it's good that you're asking these kinds of questions though
(and also a great example of why learning math from the videos is not usually that effective)
by fully "expanded", i mean write out a model with a small number of inputs, one output, and one small hidden layer, and then actually work through the backprop equations
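One way to do that exercise in code, as a sketch only: a network with 2 inputs, 2 sigmoid hidden units and 1 linear output, with squared-error loss. All sizes, weights and names here are assumptions chosen for illustration. The backward pass reuses the values cached in the forward pass, and a finite-difference check (the "just evaluate the cost at different values of w1" idea) confirms the result.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, W1, w2):
    z1 = W1 @ x        # hidden pre-activations, shape (2,)
    h = sigmoid(z1)    # hidden activations, shape (2,)
    y = w2 @ h         # scalar output
    return z1, h, y


def loss(x, t, W1, w2):
    _, _, y = forward(x, W1, w2)
    return 0.5 * (y - t) ** 2


def backward(x, t, W1, w2):
    z1, h, y = forward(x, W1, w2)     # intermediate values are computed once and reused
    dJ_dy = y - t                     # dJ/dy
    dJ_dw2 = dJ_dy * h                # dJ/dw2 = dJ/dy * dy/dw2
    dJ_dh = dJ_dy * w2                # dJ/dh  = dJ/dy * dy/dh
    dJ_dz1 = dJ_dh * h * (1.0 - h)    # sigmoid'(z1) = h * (1 - h)
    dJ_dW1 = np.outer(dJ_dz1, x)      # dz1_i/dW1_ij = x_j
    return dJ_dW1, dJ_dw2


def numerical_grad_W1(x, t, W1, w2, eps=1e-6):
    # Central differences: perturb one weight at a time and re-evaluate the loss.
    grad = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            W_plus, W_minus = W1.copy(), W1.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (loss(x, t, W_plus, w2) - loss(x, t, W_minus, w2)) / (2 * eps)
    return grad


x = np.array([0.5, -1.0])
t = 1.0
W1 = np.array([[0.1, -0.2], [0.4, 0.3]])
w2 = np.array([0.7, -0.5])

dJ_dW1, dJ_dw2 = backward(x, t, W1, w2)
print(np.allclose(dJ_dW1, numerical_grad_W1(x, t, W1, w2), atol=1e-6))
```

The numerical check needs two full forward passes per weight, while the backward pass gets every gradient from one forward pass plus a handful of multiplications.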
Alright yes that sounds good, will give that a try. I just had one thing to ask about the closed form stuff u mentioned. If we did have closed form equations, would it be more efficient than the chain rule? I assume thats usually not the case, and finding the closed form seems harder/more computationally expensive than just using the chain rule?
it is, and in fact backpropagation allows us to take advantage of a lot of repeated computation in practice and significantly reduce runtime by caching them
just to confirm, its more expensive to use the closed form?
Ahhh ok things are starting to click now
I also stumbled across this earlier:
Answer (1 of 5): The chain rule is a mathematical formula.
There are many ways of computing that formula.
For example, if you have a formula a + b + c, you could compute a + b first, then add c, but you could also do b + c first, then add a, and so on.
Back-propagation is one particular way to...
right. the closed form very quickly becomes ridiculous
Awesome, its super clear now! Thank you so much for all of ur help!! I really appreciate it 🙂
Hey is anyone available to refactor a quick df.apply if statement into a np.where?
Can anyone recommend a youtube video that will help with creating certain bots which tell u different key words to use etc.? and by bots I mean the ones that are meant to do stuff for you
I think this may have something to do with ai not so sure
You might want to start with reading up on Natural Language Processing.
And then I have an IterativeImputer object go over the columns with missing values after that point?
Please don't ask to ask. It wastes everyone's time, including your own. If you want help, show the series and the function you were applying.
What do you mean "which key words to use"?
i am lost with my "machine learning" project attempting to predict the winner of this upcoming world cup
i recognize that to train a model i would need to find at least two correlated variables that somehow connect back to the team who won (in a match). however, i realize that team names are not numbers but strings so that's not very useful in a correlation matrix
is my approach flawed? is there another (better) way to approach this problem?
are there any books related to python ml for beginners that I could download?
https://www.amazon.com/gp/product/B0BHCFNY9Q/ is nice and available on kindle and oreilly
would it be good to read the data science one first before moving on to this book?
I don't know which book you are referring to
Data Science from Scratch: First Principles with Python, Second Edition, by Joel Grus
doesn't look as modern
ideally, read both 🙂
alright cool
Hey what's a good source to get started on AWS for ML/Infra?
Did anyone solve the turing.com test for the data science stack?
Do I need to learn data science before starting with AI?
yes
Hi in tensorflow, I get a ValueError: Shapes (None, 1) and (None, 5) are incompatible. I am implementing an NLP scenario which has a multi class classification
I have converted my training data and labels into numpy arrays
what operation are you trying to do, exactly? this is telling you you placed something of size 1 where it should have been of size 5 (or backwards)
common cases where it happens are where you use something that expected a one-hot encoded vector, but you returned an int instead
e.g. [0,0,1,0,0] in one-hot vs [2] as an int
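To make the shape mismatch concrete, here is a small NumPy-only sketch of the two label formats. The label values are invented for illustration. As far as the Keras losses go, `categorical_crossentropy` expects the one-hot form and `sparse_categorical_crossentropy` expects the integer form, so either the labels or the loss can be changed to match.

```python
import numpy as np

num_classes = 5
int_labels = np.array([2, 0, 4])             # integer class ids, shape (3,)

# One-hot encode by picking rows of the identity matrix.
one_hot = np.eye(num_classes)[int_labels]    # shape (3, 5)

# Going back from one-hot to integer ids.
recovered = one_hot.argmax(axis=1)

print(int_labels.reshape(-1, 1).shape)  # the "(None, 1)" side of the error
print(one_hot.shape)                    # the "(None, 5)" side of the error
```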
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=123, stratify=[target])
ValueError: Found input variables with inconsistent numbers of samples: [11846, 1]
print(features.shape, target.shape) # (11846, 25) (11846,)
😐
I'm going to give this a shot: https://stackoverflow.com/questions/30813044/sklearn-found-arrays-with-inconsistent-numbers-of-samples-when-calling-linearre
you can think of "team" itself as a categorical variable, for which we have several encoding techniques. but i would not spend your energy trying to hammer your data into some generic format that people usually use for machine learning.
there is plenty of formal probability analysis that you can do with this. see for example the Elo ranking system https://en.m.wikipedia.org/wiki/Elo_rating_system, which predicts something akin to the probability of any one team beating any other team.
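For a feel of what Elo looks like in code, here is a minimal sketch of the standard expected-score formula and rating update. The ratings and the K-factor of 20 are placeholder values, not anything tuned for football.

```python
def expected_score(rating_a, rating_b):
    # Expected score of A against B (roughly: win probability, draws counting half).
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=20.0):
    # score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


print(expected_score(1900, 1500))
print(update(1500, 1500, score_a=1.0))
```

Team names only act as keys into a table of ratings here, so no numeric encoding of the names is needed.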
look at the MLOps section
no, stop guessing and trying to pattern-match to stackoverflow questions. look at the actual error message. clearly you provided arrays of different lengths. so what are the shapes of features and target?
They each have the same number of rows, where one is a DataFrame and the other, the target, is a Series.
(11846, 25) (11846,) are not the same length?
okay. i also see the shapes now, sorry. yes that SO post seems like a reasonable solution
you can try doing target.to_frame() to easily "upgrade" the series to dataframe
Does anyone know how I can convert a pytorch geometric GNN model to ONNX? I can't seem to find any examples on this topic
That worked in that it upgraded the frame, but it didn't work as I still got the same error message(!)
I've tried converting to np.arrays, passing the .values attribute, and adding to_frame(); it reads exactly the same (I'm not misreading those numbers below, am I?)
But none of these seem to have worked.
Wait - I resolved it.
Damn it.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=123, stratify=[target])
I'd read "array-like" as if you would be passing in an array of the columns to stratify by. So the stratify parameter should be stratify=target, not stratify=[target]
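A small reproduction of that fix, assuming scikit-learn and pandas are installed. The 20-row data set is synthetic. Wrapping the Series in a list gives `train_test_split` something of length 1, which is where the `[11846, 1]` in the original message comes from.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real features/target.
features = pd.DataFrame(np.arange(60).reshape(20, 3), columns=["a", "b", "c"])
target = pd.Series([0, 1] * 10, name="label")

# Buggy call: [target] is a list containing ONE item, so its length is 1, not 20.
try:
    train_test_split(features, target, test_size=0.30, random_state=123, stratify=[target])
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Fixed call: pass the Series itself.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=123, stratify=target
)
print(X_train.shape, X_test.shape)
print(y_test.value_counts().to_dict())
```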
Thanks for your help, @desert oar
Hello, do you know how to make the attached picture? Sounds like it was made with python Matplotlib to me but I was wondering is there any resource you can suggest how to color the indicator function and super trend buy and sell order?
cheers
np. take a look at fullstackdeeplearning, the 2022 course, if you are interested in diving deeper afterwards
Sure, will do. Thanks
Does anyone know how I can convert a pytorch geometric GNN model to ONNX? I can't seem to find any examples on this topic
Dl is a part of AI right?
Hi, I just had a random thought regarding this. I think im missing something basic, but for example in the image I uploaded earlier: https://cdn.discordapp.com/attachments/366673247892275221/1028761475453571132/unknown.png wouldn’t we need some sort of closed form formula to calc du/dz1 and dz1/dw? Ie wouldn’t this still have the same problem as if we were to use the closed form instead of chain rule?
you do, but those are fairly simple to compute
from the perspective of y as a function of z1, all there is is a single layer
by using the chain rule, you take derivatives by considering a single layer at a time
those are a lot more simple because it's an affine transformation composed with a nonlinear func. that itself requires using the chain rule, but it's not so bad
oops, i completely missed the stratify= kwarg. that was the original source of error that i suspected, but got thrown off
yes but the network is literally defined in terms of formulas like "z as a function of w1". and as Edd said, they are usually easy to differentiate
and when they are not necessarily easy to differentiate "symbolically" (what you learn in calculus class), they can probably still be differentiated using "automatic differentiation"
the latter is part of the magic behind the various deep learning and differentiable computing frameworks, and why specifically the property of a program or algorithm being differentiable is interesting: because you can actually use it in backpropagation, or rather you can backpropagate through it
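To give a flavour of automatic differentiation, here is a toy sketch using dual numbers. Note this is forward-mode autodiff, which is simpler to show in a few lines; deep learning frameworks mostly use reverse mode for backpropagation. The `Dual` class and the example function are invented for illustration.

```python
import math


class Dual:
    """A value together with its derivative with respect to the input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def exp(x):
    # Chain rule for e^u: derivative is e^u * du.
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def derivative(f, x):
    # Seed the input with derivative 1 and read the derivative off the output.
    return f(Dual(x, 1.0)).deriv


def f(x):
    return exp(x * x + 3 * x)


print(derivative(f, 0.5))
print(math.exp(0.5 ** 2 + 3 * 0.5) * (2 * 0.5 + 3))  # symbolic derivative for comparison
```

Nobody wrote down the derivative of `f` as a whole; each primitive operation only knows its own local rule.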
If for example, this screenshot is z1, and x is the function in the screenshot nested many times, I am mainly confused about how you can take a derivative considering a single layer at a time when each layer is dependent on the previous due to the nesting?
by using the chain rule 😛
let's take an easier example
you know the derivative of e^x is equal to e^x * dx/dx
now let's replace x with f(x)
the derivative of e^f(x) is e^f(x) df/dx (x)
or in a more general case of function composition, the derivative of g(f(x)) is g'(f(x)) * f'(x)
you can see that, inside of g and g', it's always f(x). we can black box this
then we treat f'(x) completely separately
it's just your usual chain rule
Ahh ok, so if I am understanding this correctly, the black boxing of f(x) pretty much solves our issue of the deeply nested functions making things super complex?
yep
hi i got a question about pytorch cnn's
why after a conv2d layer, then a maxpool layer and we have a new conv2d layer why is the new conv2d layer input the same as the output layer of 1st conv2d
even tho theres a maxpooling layer
ah nvm
i think its cuz its channels and not image size
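A quick way to see this is to track shapes by hand. The sketch below uses the usual output-size formula for convolution and pooling (dilation 1) in plain Python; the layer sizes are made up. Pooling shrinks height and width but leaves the channel count alone, so the second conv's input channels equal the first conv's output channels.

```python
def conv2d_shape(shape, out_channels, kernel_size, stride=1, padding=0):
    # shape is (channels, height, width); the channel count becomes out_channels.
    _, height, width = shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (out_channels, out_h, out_w)


def maxpool2d_shape(shape, kernel_size, stride=None):
    # Pooling works per channel, so the channel count is unchanged.
    stride = kernel_size if stride is None else stride
    channels, height, width = shape
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    return (channels, out_h, out_w)


image = (3, 32, 32)
after_conv1 = conv2d_shape(image, out_channels=16, kernel_size=3, padding=1)
after_pool = maxpool2d_shape(after_conv1, kernel_size=2)
after_conv2 = conv2d_shape(after_pool, out_channels=32, kernel_size=3, padding=1)

print(after_conv1, after_pool, after_conv2)
```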
OH wait a sec, so basically we can find the dJ(W)/dw1 without having to look further back in the NN, ie if we had even more layers before w1
yep, chain rule
but for example, how does dy/dz1 get calculated?
instead of expanding the composition and differentiating a complex function once, we take several easy derivatives
ok I think this solves my doubt
exactly as we did in the example above
look, let's take g(f(x))
now let's call z = f(x)
we find the derivative of g(z) w.r.t. x
that's g'(z) z'
and z' = f'(x)
so g'(z) f'(x)
g'(z) doesn't need anything other than the derivative of g, evaluated at whatever z is. it doesn't matter what
z is just the previous layers evaluated at the given input
it's just chain rule
forget about the network
just review your calculus
Ok ok I see now, this is super clear. Yeah I checked out another video on the chain rule and things are connecting now. I now understand WHY we use the chain rule and what it actually does.
Just to confirm the process overall goes:
If we specify f(x) and the sigmoid as an activation function, we can then specify their derivatives in the code (I assume we would have to calculate/specify the derivatives of our functions if we are implementing back propagation?). This then allows us to take these simple derivatives we talked about earlier via the chain rule and hence find the derivative of J(W) w.r.t. some w value that is super far back in the chain, for example?
right
if you can compute functions and their individual derivatives, you can do the same for their composition by using the chain rule
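That last point can be sketched directly: give each "layer" as a pair (function, derivative), then walk through them multiplying local derivatives. The three layers below are arbitrary examples, and the finite-difference function is only there as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# Each layer is (function, derivative of that function alone).
layers = [
    (lambda x: 2.0 * x + 1.0, lambda x: 2.0),
    (sigmoid, d_sigmoid),
    (lambda x: x ** 2, lambda x: 2.0 * x),
]


def value_and_derivative(layers, x):
    value, deriv = x, 1.0
    for f, df in layers:
        deriv = df(value) * deriv   # local derivative, evaluated at this layer's input
        value = f(value)            # then move one layer forward
    return value, deriv


def numerical_derivative(layers, x, eps=1e-6):
    def full(v):
        for f, _ in layers:
            v = f(v)
        return v
    return (full(x + eps) - full(x - eps)) / (2 * eps)


value, deriv = value_and_derivative(layers, 0.3)
print(deriv, numerical_derivative(layers, 0.3))
```

Adding more layers only makes the list longer; no derivative in it ever gets more complicated.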
Alright awesome, Its very clear to me now! Thank you both so much for all of ur help and for bearing with me! I appreciate it a lot 😄
I'm trying to run a model on CUDA and am pretty clueless on what I'm doing - is there a way I can somehow get around this memory issue possibly at the cost of performance or is it a hard border of what I can and cannot run?
Can you show the part of the code that loads the model with all relevant import statements? And please don't ask people to read screenshots of text.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Like the if statements or return or def or lists or the and statement keywords like that because rn im doing a course so i have a really good base for learning bot making since its a goal of mine for this year
Thank you very much I'll note that
Sorry, I had screenshotted that without thinking 😅
And yes, here's the code (whisper) is https://github.com/openai/whisper:
import whisper
from torch import cuda

def load_model(model_name: str = 'base'):
    device = 'cuda' if cuda.is_available() else 'cpu'
    print(f'Loading {model_name} model on {device.upper()}')
    return whisper.load_model(model_name, device=device)
I'm not doing any of the loading myself, just using the library's exposed functions
If you want the source for whisper.load_model, I can provide that as well:
def load_model(name: str, device: Optional[Union[str, torch.device]] = None, download_root: str = None, in_memory: bool = False) -> Whisper:
    """
    Load a Whisper ASR model

    Parameters
    ----------
    name : str
        one of the official model names listed by `whisper.available_models()`, or
        path to a model checkpoint containing the model dimensions and the model state_dict.
    device : Union[str, torch.device]
        the PyTorch device to put the model into
    download_root: str
        path to download the model files; by default, it uses "~/.cache/whisper"
    in_memory: bool
        whether to preload the model weights into host memory

    Returns
    -------
    model : Whisper
        The Whisper ASR model instance
    """

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    if download_root is None:
        download_root = os.getenv(
            "XDG_CACHE_HOME",
            os.path.join(os.path.expanduser("~"), ".cache", "whisper")
        )

    if name in _MODELS:
        checkpoint_file = _download(_MODELS[name], download_root, in_memory)
    elif os.path.isfile(name):
        checkpoint_file = open(name, "rb").read() if in_memory else name
    else:
        raise RuntimeError(f"Model {name} not found; available models = {available_models()}")

    with (io.BytesIO(checkpoint_file) if in_memory else open(checkpoint_file, "rb")) as fp:
        checkpoint = torch.load(fp, map_location=device)
    del checkpoint_file

    dims = ModelDimensions(**checkpoint["dims"])
    model = Whisper(dims)
    model.load_state_dict(checkpoint["model_state_dict"])

    return model.to(device)
My battery is running low. Hopefully I can look later. Others are welcome to.
Alright, thanks a lot for helping!
@fresh tiger it looks like you worked through it with Edd, but this is a great example of why it's valuable to actually go through the motions with specific (simple) cases, like some small neural network with 4 inputs, 2 hidden nodes, and 3 outputs + softmax, with MSE loss and sigmoid activations. that's well within reach of what you can write out and work through entirely by hand, even completely avoiding vector notation and working with sums of scalar terms
it's not the kind of thing you need to do more than once or twice before you get it
part of the value of a good course, or at least a good textbook, is having exercises like the above presented to you
can you post the error too? i can't read that screenshot
whisper> load large
Loading large model on CUDA
100%|█████████████████████████████████████| 2.87G/2.87G [05:57<00:00, 8.64MiB/s]
CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 8.00 GiB total capacity; 7.12 GiB already allocated; 0 bytes free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
whisper>
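One possible way to trade performance for memory, sketched under assumptions: try the GPU first and fall back to the CPU if loading runs out of memory. `load_with_fallback` and `fake_load_model` are hypothetical names; the fake loader only exists so the sketch runs without a GPU or whisper installed. With the real library the call would be `load_with_fallback(whisper.load_model, "large")`. The other obvious option is a smaller model from `whisper.available_models()`.

```python
def load_with_fallback(load_fn, model_name, devices=("cuda", "cpu")):
    """Try each device in order, moving on only when a device runs out of memory."""
    last_error = None
    for device in devices:
        try:
            return load_fn(model_name, device=device), device
        except RuntimeError as err:
            # PyTorch reports CUDA OOM as a RuntimeError (or a subclass of it).
            if "out of memory" not in str(err).lower():
                raise
            last_error = err
    if last_error is None:
        raise ValueError("devices must not be empty")
    raise last_error


# Stand-in loader for demonstration only (hypothetical): pretends the GPU is full.
def fake_load_model(name, device=None):
    if device == "cuda":
        raise RuntimeError("CUDA out of memory. Tried to allocate 26.00 MiB")
    return f"{name} model on {device}"


model, device = load_with_fallback(fake_load_model, "large")
print(model, device)
```

After a failed GPU load it may also help to call `torch.cuda.empty_cache()` before retrying, though whether that frees enough here is untested.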
Does anyone know how I can convert a pytorch geometric GNN model to ONNX? I can't seem to find any examples on this topic
hello, sorry to annoy, i just started uni (last month) for data science diploma. problem is, math was never a strength. is there any material out there, that is clear, and shows the maths directly in python.
cause looking at math formulas all day, i fried my brain 3 days in a row.
thanks for your time
Not answering this one directly, but I am (somewhat) in the same boat as you. I'm in my third year now. Still struggling XD
Barbara Oakley's "Learning How To Learn" free course on Coursera is a must do soon. Helps create a framework to work with Math concepts and memory recall and comprehension.
Side note everyone - here's a silly question.
So this is a little right skewed, right?
Now watch before your very eyes as I REMOVE the slight right skewness.
# I am going to do a Log Transform of Min C, output the graph and p-value to see what improvement is shown.
logged_minc = df_imputed['Min °C'].copy() # not logged yet
logged_minc = (logged_minc - logged_minc.min()) + 1
logged_minc = np.log10(logged_minc)
plt.figure()
ax = sns.histplot(data=df_imputed, x=logged_minc, hue=df_imputed['Rain(Y/N)'], kde=True)
TADA!
The right skew is gone!
It is now a trailing left Skew.
Am I doing something wrong here?
thanks, ill check out the course
Log transform here as well: the spread has narrowed in terms of the values on the x-axis, and the deep crevasse of nothing between 0 and the start of the distribution has shrunk from about 10 units to about 1.
But I have an overwhelming feeling what I should be saying in my report is that Log Transformations may not be the catch-all transformation for skewed data? Or is that taking it too far?
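The effect can be reproduced on a tiny made-up sample, which suggests the code is not wrong so much as the transform being too strong for mildly skewed data. Shifting so the minimum is 1 and then taking logs stretches the values nearest the minimum, which can turn a slight right skew into a left skew. The numbers below are invented; `skewness` is the plain moment-based coefficient.

```python
import numpy as np


def skewness(values):
    # Moment-based skewness: third central moment over variance^(3/2).
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Invented sample with a mild right skew.
raw = np.array([0, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12], dtype=float)

# Same steps as the snippet above: shift so the minimum is 1, then log10.
shifted = (raw - raw.min()) + 1
logged = np.log10(shifted)

skew_before = skewness(raw)
skew_after = skewness(logged)
print(skew_before, skew_after)
```

A gentler transform such as a square root may be worth comparing, but which one suits the real data is something to check rather than assume.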
elif args.attack_type == "remote":
    prisoner_loc = env.env.prisoner.location.copy()
    dists = []
    for i in range(env.env.num_known_cameras):
        cam_loc = env.env.camera_list[i].location.copy()
        dist = np.linalg.norm(np.array(prisoner_loc)-np.array(cam_loc))
        dists.append((i, dist))
    sorted_dists = sorted(dists, key=lambda x: x[1], reverse=True)
    idx = np.random.choice(5, args.C, replace=False)
    attack_action = [sorted_dists[i][0] for i in idx]
can someone explain this chunk of code? it's supposed to only perturb the detection flag to True and set the detected location to be the camera's own location. how could I edit this code so it could specify the location of the camera as an action?
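Reading only the snippet itself (the environment is not shown, so this is a guess at intent): it measures the distance from the prisoner to every known camera, sorts cameras from farthest to nearest, and then randomly picks `args.C` distinct cameras out of the 5 farthest. `attack_action` ends up as a list of camera indices, not locations. A self-contained sketch of the same selection logic, with hypothetical names and fake coordinates:

```python
import numpy as np


def pick_remote_cameras(prisoner_loc, camera_locs, num_attacked, pool_size=5, rng=None):
    # Hypothetical helper mirroring the snippet; needs at least pool_size cameras.
    rng = np.random.default_rng() if rng is None else rng
    prisoner_loc = np.asarray(prisoner_loc, dtype=float)
    dists = [
        (i, np.linalg.norm(prisoner_loc - np.asarray(loc, dtype=float)))
        for i, loc in enumerate(camera_locs)
    ]
    # Farthest cameras first.
    sorted_dists = sorted(dists, key=lambda pair: pair[1], reverse=True)
    # Choose positions within the pool_size farthest cameras, without repeats.
    idx = rng.choice(pool_size, num_attacked, replace=False)
    return [sorted_dists[i][0] for i in idx]


camera_locs = [(i, 0) for i in range(8)]   # fake cameras along a line
chosen = pick_remote_cameras((0, 0), camera_locs, num_attacked=2)
print(chosen, [camera_locs[i] for i in chosen])
```

Mapping the chosen indices back to locations, as in the last line, is one way to get camera positions out; how the environment expects the action to be shaped is not visible from the snippet.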
https://www.youtube.com/watch?v=YqaNo0XfAD4
A quick talk I gave to PyHEP last September, organised by the people at CERN.
We talked about what Python can do in VR, not exactly related to particle physics, but they kindly invited us to show our work.
Through several examples of practical use cases the talk will present our experiences of 3D and Virtual Reality, all implemented in Python with the help of our 3D package "HARFANG 3D" :
Human factor study of a railway station in virtual reality
Using an aircraft simulation sandbox for AI training
Tele-operating a humanoid robot in VR...
@fringe anvil @alpine temple it's unfortunately a disservice to students to try to force them to learn material that they don't have the prerequisites for. "in python" is also a bit of a challenge here. there might be some books that use numpy for linear algebra examples (i don't know of one), and i know there is at least one book using code examples to teach probability. but if you can at least specify a couple of things you don't understand, someone might be able to direct you to useful resources
realistically the only way to learn applied math is to learn math. you don't need a graduate degree, but you do need the fundamentals of linear algebra and multivariable calculus.
if you are significantly more comfortable with code than with traditional math notation, maybe a good exercise would be to translate traditional formulas into python functions
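As an example of that kind of translation, take the sample variance, s^2 = (1 / (n - 1)) * sum over i of (x_i - x_bar)^2, where x_bar is the mean. Written out term by term in plain Python (the data values are arbitrary):

```python
def mean(xs):
    # x_bar = (1 / n) * sum of x_i
    return sum(xs) / len(xs)


def sample_variance(xs):
    # s^2 = (1 / (n - 1)) * sum of (x_i - x_bar)^2
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data), sample_variance(data))
```

Each symbol in the formula has exactly one counterpart in the code: the big sigma is `sum(...)`, the subscript i is the loop variable, and n is `len(xs)`.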
Hey there! Whats a good free resource to learn python for data science / ML ? Something similar to TOP but for python?
Dl is a sub category of ML right?
Yes, I believe so
Ty
Kaggle? Maybe
check the pinned messages, there might be something in there
@alpine temple @fringe anvil in addition to what i said before, check the pinned messages in this channel. look for the MML book, among other things
Hey @empty nacelle!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
Hey, I want to do some kind of benchmarking (multiple datasets with multiple different algorithms) and I was searching for some tooling in order to make it reproducible. I found DVC and mlflow, they seem to support running experiments with configuration files of the hyperparameters etc, but all I could find was with only one dataset and one algorithm (but different models of this algorithm). Does anyone know if those tools are appropriate for my use case as well, or are there better alternatives?
Hi, for my engineering project I have to predict data of solar panels on how much energy they get. And I'm new to deep learning so would Keras be the best option to make my model or are there better alternatives?
hi, am new to machine learning and needed some assistance from experienced ppl
heeeeeelp
you're gonna have to specify what with
well i started by learning numpy, pandas and matplotlib, made some projects, learned scikit-learn and did things with that. now i am confused whether i should go further with normal algorithms or just move on to deep learning
any suggestions?
my main goal is to go further in deep learning and make deep learning models
what do you have in mind when you say "normal algorithms"
like i learned the basics such as linear regression logistic regression etc but i've seen there are more such as knn naive bayes and more
so should i complete them and then go for deep learning?
or am i good to go?
i would say you should complete those first, but i'm a big proponent of learning from the ground up. it really depends on how you prefer doing stuff
you'll pick up stuff that will be useful/necessary for deep learning
hmm so according to you i should strengthen my basics before goin for deep learning and neural networks
that would be my claim since ML is math
the more you know, the better
you can learn it ahead of time or at the same time, and i'm saying learning ahead of time is what i prefer, but that's personal flavor
hey anyone has speech labeled dataset in English?
I dont know if this is the right place to ask this
but
I have a pandas dataframe , which has a column
which is full with joined hashtags leme show you
so is there anything I can do to like
first take these rows , then make a list by separating those strings
then arrange in ascending order of hashtags used
where did you get this data from? the ÿ characters appear to be some kind of record separator, represented incorrectly because the original data is in a different text encoding from what pandas used to load the data
I think that might be because
the guy who made the data , must have used a mac
or something. I am using win 11
do you know what program they might have used to make the data?
i wonder if they just chose a byte that isn't valid ascii
that's 0xFF which maybe was some overly-clever programmer's idea of a "character that nobody will use and will be obviously just a record separator"
split on ÿ of course
that way you will get a list of hashtags in each data frame cell, and you can proceed
!d pandas.Series.str.split
Series.str.split(pat=None, n=-1, expand=False, *, regex=None)
Split strings around given separator/delimiter.
Splits the string in the Series/Index from the beginning, at the specified delimiter string.
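A small sketch of both points, assuming the separator really is the single byte 0xFF: decoded as Latin-1 that byte becomes `ÿ`, and `.str.split` on that character turns each cell into a list. The sample rows are invented; `Hashtags` and `NewHash` are the column names that show up in this conversation.

```python
import pandas as pd

# Byte 0xFF read as Latin-1 shows up as the character "ÿ" (U+00FF).
raw = b"#finance\xff#money\xff#business"
text = raw.decode("latin-1")
print(text)

df = pd.DataFrame({"Hashtags": [text, "#python\u00ff#data"]})

# Keep the original column and put the lists in a new one.
df["NewHash"] = df["Hashtags"].str.split("\u00ff")
print(df["NewHash"].tolist())
```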
...from the data?
can you clarify your question?
well
I want to separate the hashtags
then ,
i want to count all the unique ones ,
and arrange them in ascending order
and make a bar graph out of it
okay, you want the number of unique hashtags in each row?
or the number of times each hashtag appears? or something else?
no
or the number of hashtags that appears in the whole table
like including all rows
that's just one number, not something you want to plot with a bar chart
but there will be many values right
for different hashtags
x axis will be the hashtag name
y axis will be its value
or count
it sounds like you are asking for the number of times each individual hash tag appears
Yes
good question. it might be worth your while to at least make an attempt at it on your own
i will give you the hint that there is no single function or method that will do this for you
welp I did try to do it before
also I am doing this for a school project
but I cant figure it out
okay, and what did you try?
well I tried that one split command
and made an effort to create a list
by making a virtual column
a virtual column?
what is a virtual column?
idk
in mysql they call it virtual column
I don't know what they call it in python
(keep in mind that pandas is not python, it's just a library written in python)
Yup sorry about that
how did you attempt to create a virtual column? pandas doesn't really have that concept, so it's very likely that you just misunderstood what you were doing
I tried to do something like that
because in sql, I did a question like that before
i'm asking you to describe specifically what you tried. pandas does not have virtual columns, so telling me that you tried to create one doesn't actually tell me anything!
how about this, can you share the code that you used?
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
i got rid of it
also it wasn't really a virtual column I think
I had just updated the column prob
but the process is big
indeed, i was going to suggest that but i didn't want to guess without seeing your code
in general, it's usually prudent to try out your code on a small sample of data before trying to use the full dataset
so am I supposed to use loops?
you often can avoid writing for loops in pandas, but conceptually almost always yes
using built-in pandas looping tools is usually much faster and considered better style
sorry, i won't do your homework for you
I don't mind that , I know that
just asking you what commands will be used
like about the split thing you mentioned
i posted a link to the docs for the split method
you would access the column with [] and then call .str.split on the result
okay
if you are a practicing data scientist you must be able to read docs and combine them with your fundamental understanding in order to solve a problem
so il get a list after that?
you can't rely on other people for that
you get a Series where each element is a list
I am planning to take Cse
this is stated in the documentation
I am in 12th grade
ok, then imo you've had enough education to know that in order to learn you need to solve problems for yourself, instead of waiting for solutions to be provided to you
i'm sure you hear this from people all the time, but expectations only go up as you get older and gain more experience
Mhm... I'm in 11th..
well i gave you plenty of advice so far. why don't you at least try to split the hashtags into lists?
yup doing that now
assign them to a new column in the same dataframe perhaps, just in case you make a mistake and need to get the original
pandas ??
i also strongly suggest working with a small sample of this data
Is it better to work with csv files or json files while using pandas ??
I mostly use csv..
1000 rows should be more than enough to be "realistic" for testing your code, without worrying about performance on a bigger dataset
ideally neither, parquet is better than both for a lot of uses. csv and json have their own, but different, problems
csv is good if you don't have complicated text in your data
I think matrix calculations might be a part of your syllabus
Mm...
I really haven't worked much with data science....
Just doing some discord bot coding stuff ;-;
there is no universally best data format. so really the question needs to be qualified: in what context, for what purpose?
Mhm...
I was looking for a general answer but ya it's ok...
the general answer is that there is no general answer
Mm....
you will encounter that in many many situations in programming, data science, and elsewhere in life
Making a roadmap to self-study for data analyst, starting with Excel. I will also need statistics and probability, but which math topics do you need to study before statistics and probability? Graduated 30 years ago as an engineer, but this math stuff is buried deep. You may point me to Khan Academy courses, YouTube videos, books incl. The Manga Guides to ... from nostarch, FreeCodeCamp, ... Anything that may help me to know where to start and what's next.
Is there any way to assess the quality of audio?
hoo boy, is there
if you want to be a data analyst, but not a "data scientist", this server may be of limited use for you. because we mostly use pandas or SQL for tabular data
but it depends entirely on what kind of audio and the application
plain calculus and vector/matrix arithmetic is good enough to start. eventually you'll want multivariable calculus and linear algebra.
there is a "Math for ML" book in the pinned messages
keep in mind that probability is often considered a subset of pure math, or at least tends to straddle the line between pure and applied. you will probably want to focus on understanding the fundamentals and don't need to worry (at least not at first) about things like moment generating functions
i just came up with this now: if you can derive the binomial distribution on the back of an envelope, you are off to a good start in practical probability
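For reference, the result of that back-of-the-envelope derivation is P(X = k) = C(n, k) * p^k * (1 - p)^(n - k): choose which k of the n trials succeed, then multiply the probabilities. A direct transcription using only the standard library (the example numbers are arbitrary):

```python
from math import comb


def binomial_pmf(k, n, p):
    # C(n, k) ways to place k successes, each arrangement has probability p^k (1-p)^(n-k).
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 fair coin flips.
print(binomial_pmf(2, 4, 0.5))

# The probabilities over all possible k add up to 1.
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))
```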
I don't think they want to do ML though
Actually, the roadmaps I found for data analyst also include SQL and python (pandas, numpy, matplotlib, ...)
well they asked about math for stats and probability
that's a pretty good book
it might be a little advanced for your level if you forgot all your math
but it's also full of the kinds of things that, if you can implement them in practice, you will be able to solve a huge variety of real problems in a variety of domains
a good introductory statistics book would also be a really good idea
let me see if there's a probability book that starts a little on the lighter side, so you can get into the fun stuff more quickly without worrying too much about math prereqs
you can do a lot without much more than high school algebra
the best "data analysts" in my experience are the ones who don't worry about learning fancy stuff but are incredibly solid with their fundamentals, and have one or two extremely powerful tools that they know how to use proficiently, like SQL and Excel, and also have substantial domain knowledge about whatever field they work in
Hey I am back here
I managed to split the data a long time ago
was drinking tea
so now I have this
each row has a list of all the tags
how do I count how many times Finance or Money has been repeated
and stuff
is each element a list of strings, or a string that looks like a list of strings?
once you have a Series of lists of strings, you can do .explode().value_counts()
for Series[list[str]], .explode will give you a flat Series[str]. and then you can do value_counts on that
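A minimal sketch of that on invented data: `.explode()` gives one row per hashtag (repeating the original index), and `.value_counts()` counts how often each one appears across all rows.

```python
import pandas as pd

# Invented Series where each element is a list of strings.
s = pd.Series([
    ["#finance", "#money", "#python"],
    ["#python", "#data"],
    ["#python", "#money"],
])

flat = s.explode()              # one hashtag per row, index repeated
counts = flat.value_counts()    # most frequent first
ascending = counts.sort_values()

print(flat)
print(ascending)
```

From there `ascending.plot.bar()` should give the bar chart, assuming matplotlib is installed. Note that a tag repeated inside one row is counted every time it appears.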
I already told you
x = PivotTable.loc[PivotTable.Retailer == "Bela", "Promotion Relevance (Cat)_Energy"]
print(x)
How can I get just the value
use .at instead of .loc
giving me error
x = PivotTable.at[PivotTable.Retailer == "Bela","Promotion Relevance (Cat)_Energy"]
print(x)
I got the number of values in a single row
please do not ask people to read screenshots of text. please copy and paste text as text.
I guess you can do x = PivotTable.loc[PivotTable.Retailer == "Bela","Promotion Relevance (Cat)_Energy"].iat[0]
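Some background on why `.at` errored: `.at` wants a single row label and a single column label, not a boolean mask. `.loc` with a mask always returns a Series, even when only one row matches, so the scalar has to be pulled out of it afterwards. A sketch on a made-up table with the same column names:

```python
import pandas as pd

# Made-up stand-in for the real PivotTable.
PivotTable = pd.DataFrame({
    "Retailer": ["Alfa", "Bela", "Cita"],
    "Promotion Relevance (Cat)_Energy": [0.12, 0.34, 0.56],
})

mask = PivotTable.Retailer == "Bela"
col = "Promotion Relevance (Cat)_Energy"

as_series = PivotTable.loc[mask, col]   # a Series with one element
value = as_series.iat[0]                # first element by position
same_value = as_series.item()           # errors unless exactly one row matched

print(type(as_series).__name__, value, same_value)
```

`.item()` is the stricter choice if exactly one matching retailer is expected, since `.iat[0]` silently takes the first of several matches.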
I'm not sure what you mean.
Please do print(series.head().to_dict()) and put the text in the chat. I will not accept a screenshot.
series is whatever the NewHash column is.
Okay
it worked
{0: ['#finance', '#money', '#business', '#investing', '#investment', '#trading', '#stockmarket', '#data', '#datascience', '#dataanalysis', '#dataanalytics', '#datascientist', '#machinelearning', '#python', '#pythonprogramming', '#pythonprojects', '#pythoncode', '#artificialintelligence', '#ai', '#dataanalyst', '#amankharwal', '#thecleverprogrammer'], 1: ['#healthcare', '#health', '#covid', '#data', '#datascience', '#dataanalysis', '#dataanalytics', '#datascientist', '#machinelearning', '#python', '#pythonprogramming', '#pythonprojects', '#pythoncode', '#artificialintelligence', '#ai', '#dataanalyst', '#amankharwal', '#thecleverprogrammer'], 2: ['#data', '#datascience', '#dataanalysis', '#dataanalytics', '#datascientist', '#machinelearning', '#python', '#pythonprogramming', '#pythonprojects', '#pythoncode', '#artificialintelligence', '#ai', '#deeplearning', '#machinelearningprojects', '#datascienceprojects', '#amankharwal', '#thecleverprogrammer', '#machinelearningmodels'], 3: ['#python', '#pythonprogramming', '#pythonprojects', '#pythoncode', '#pythonlearning', '#pythondeveloper', '#pythoncoding', '#pythonprogrammer', '#amankharwal', '#thecleverprogrammer', '#pythonprojects'], 4: ['#datavisualization', '#datascience', '#data', '#dataanalytics', '#machinelearning', '#dataanalysis', '#artificialintelligence', '#python', '#datascientist', '#bigdata', '#deeplearning', '#dataviz', '#ai', '#analytics', '#technology', '#dataanalyst', '#programming', '#pythonprogramming', '#statistics', '#coding', '#businessintelligence', '#datamining', '#tech', '#business', '#computerscience', '#tableau', '#database', '#thecleverprogrammer', '#amankharwal']}
thank you, one moment
In [5]: s.explode().value_counts()
Out[5]:
#thecleverprogrammer 5
#amankharwal 5
#pythonprojects 5
#pythonprogramming 5
...
#stockmarket 1
#trading 1
#investment 1
#investing 1
#database 1
dtype: int64
Is this different from what you wanted? Do you need the value counts per row, rather than overall?
great, so we did it 
2 [#data, #datascience, #dataanalysis, #dataanal...
3 [#python, #pythonprogramming, #pythonprojects,...
4 [#datavisualization, #datascience, #data, #dat...
dtype: object
In [11]: s.explode()
Out[11]:
0 #finance
0 #money
0 #business
0 #investing
0 #investment
...
4 #computerscience
4 #tableau
4 #database
4 #thecleverprogrammer
4 #amankharwal
Length: 98, dtype: object
leme try
We can also clip the hashtag, if you don't want that.
In [14]: s.explode().str[1:].value_counts()
Out[14]:
thecleverprogrammer 5
amankharwal 5
pythonprojects 5
pythonprogramming 5
python 5
pythoncode 4
you can use the .str accessor to do string methods to every element at once.
no attributes called value_counts
show code
print(x.explode().str.value_counts())
you did .str., not .str[1:].
then remove the .str part entirely
oh okay
#thecleverprogrammer 117
#amankharwal 117
#python 109
#machinelearning 97
#pythonprogramming 95
...
#bigdataanalytics 1
#qrcodes 1
#datascienceinterview 1
#facebook 1
#boxplots 1
Name: Hashtags, Length: 164, dtype: int64
Did it , all thanks to you
you shall surely progress in the pandas arts
well thank you
@serene scaffold Very sorry to bother you again
but just wanted to confirm one thing
the command we used made a new series right?
nvm thats a dumb question
no it's not
yes, pretty much all pandas operations return new objects
@worldly wren
yeah?
Hello
hey
Thanks a lot man
Mhm
really needed a cute picture of a cat
but my day is made
I managed to get to a solution of something I have been trying to do for 2 days now
Here comes woofer to help with your stress
I love your kitty. but cat pics should go in one of the off-topic channels
Ok
hello yall! Can anyone of you tell me why statsmodels VECM.predict() returns sometimes float64 and sometimes complex128 arrays?
I don't understand what this triggers and statsmodels documentation does not say anything about this like always :/
also it seems to be random which prediction is complex128 and which is float64. not like "every third pred is X" or sth.
Does anyone know how I can convert a pytorch geometric GNN model to ONNX? I can't seem to find any examples on this topic
is anyone familiar with RandomForestClassifier
Thanks! I'm at work, but as soon as I get home I'll have a look
I did also find https://machinelearningmastery.com/start-here/
I just started with AI and I want to know how models work?
it's simple. you usually take exemplary data and pass it to a model (e.g. neural network):
2, 3 -> 5
3, 4 -> 7
6, 2 -> 8
....
while training the model with such data, it will learn in this case to add the first two numbers to find the desired output (here: 5,7,8). Now you take new data that the model has not seen yet to test if it really works:
4, 2 -> 6! Congrats you trained a model to add up numbers 🙂
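To make that toy example concrete, here is a minimal sketch that fits a plain linear model to the three examples with least squares in NumPy. The linear model is a stand-in for the neural network (an assumption of this sketch), and the variable names are made up for illustration.

```python
import numpy as np

# Training examples from the messages above: two inputs -> their sum.
X_train = np.array([[2.0, 3.0], [3.0, 4.0], [6.0, 2.0]])
y_train = np.array([5.0, 7.0, 8.0])

# "Training": find the weights w that minimise ||X_train @ w - y_train||.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Test on an input the model has not seen yet.
prediction = np.array([4.0, 2.0]) @ weights

print(weights)     # close to [1, 1], i.e. "add the two inputs"
print(prediction)  # close to 6
```

A real neural network would find its weights by gradient descent rather than a closed-form solve, but the train-then-test-on-unseen-data idea is the same.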
each neuron computes the sum of its inputs times their weights and then compares it to a threshold (the bias). if the sum is higher than the threshold, it gets passed on to the next layer as an input
if you are interested just watch a video on youtube on how neural nets work. its probably easier to see some graphics than explaining it here via text
are there techniques to prevent your discriminator from learning faster than your generator in GANs?
my generator loss is almost always higher, and I'm told that ideally they should stay about equal until the generator starts to gradually get below .5 and the discriminator should end up at .5
good job! note that this is the total number of times each hashtag appears, not the number of unique rows it appears in. that is, if a hashtag appears twice in the same row, it will be double counted.
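A small sketch of that difference, using a made-up Series where one hashtag is repeated inside a single row. De-duplicating each row before exploding is one possible way to count rows instead of occurrences; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: "#python" appears twice in the first row.
s = pd.Series([
    ["#python", "#data", "#python"],
    ["#python", "#ai"],
    ["#data"],
])

# Total occurrences: duplicates within a row are counted every time.
total_counts = s.explode().value_counts()

# Rows containing each hashtag: de-duplicate within each row first.
deduped = s.map(lambda tags: list(dict.fromkeys(tags)))
rows_per_tag = deduped.explode().value_counts()

print(total_counts["#python"])  # 3
print(rows_per_tag["#python"])  # 2
```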
ok what exactly is going on here?
```py
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()

        def discriminator_block(in_filters, out_filters, bn=True):
            block = [nn.Conv2d(in_filters, out_filters, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True), nn.Dropout2d(0.25)]
            if bn:
                block.append(nn.BatchNorm2d(out_filters, 0.8))
            return block

        self.model = nn.Sequential(
            *discriminator_block(opt.channels, 16, bn=False),
            *discriminator_block(16, 32),
            *discriminator_block(32, 64),
            *discriminator_block(64, 128),
        )

        # The height and width of downsampled image
        ds_size = opt.img_size // 2 ** 4
        self.adv_layer = nn.Sequential(nn.Linear(128 * ds_size ** 2, 1), nn.Sigmoid())

    def forward(self, img):
        out = self.model(img)
        out = out.view(out.shape[0], -1)
        validity = self.adv_layer(out)
        return validity
```
this is not the usual way I see neural net layers defined
I saw a thing on the internet that suggested dumbing down the discriminator by removing a hidden layer
but every time I try to change any of the discriminator_block() lines it throws the following error
```
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x32768 and 8192x1)
```
how can I remove a hidden layer from this?
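One way to see where that shape error comes from is to do the size bookkeeping by hand. Each `discriminator_block` uses a stride-2 convolution, so it halves the height and width, and the `Linear` layer expects `last_out_filters * ds_size ** 2` inputs. Below is a rough sketch in plain Python (no PyTorch needed). The helper names are made up, and `IMG_SIZE = 128` is an assumption inferred from the `8192` in the error message, not something stated in the code.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output height/width of a Conv2d layer for a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(img_size, block_channels):
    """Number of features entering the Linear layer after all blocks."""
    size = img_size
    for _ in block_channels:
        size = conv_out_size(size)  # every block halves height and width
    return block_channels[-1] * size ** 2


IMG_SIZE = 128  # assumption, chosen because 128 * (128 // 2**4) ** 2 == 8192

print(flattened_features(IMG_SIZE, [16, 32, 64, 128]))  # 8192, original four blocks
print(flattened_features(IMG_SIZE, [16, 32, 64]))       # 16384, last block removed
print(flattened_features(IMG_SIZE, [16, 32, 128]))      # 32768, the size in the error
```

So after removing a block, both numbers in the `adv_layer` line probably need to follow: the exponent in `ds_size = opt.img_size // 2 ** 4` should match the new number of blocks, and the `128` in `nn.Linear(128 * ds_size ** 2, 1)` should match the `out_filters` of the last remaining block.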
Is there a way to vectorize this in some sensible way? I have a vector of random numbers, call it x. I need to create a numpy 1d array so that the ith element is a number drawn from a probability distribution with parameters that are a function of the ith element of x.
So for example, something like this:
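import numpy as np
import scipy.stats

x = scipy.stats.norm(0, 1).rvs(10**5)
y = np.zeros(10**5)
for ind in range(x.shape[0]):
    # one draw per element, parameters depend on x[ind]; scale must be positive
    y[ind] = scipy.stats.norm(x[ind], abs(x[ind]) / 2).rvs()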
except not glacially slow.
is norm.rvs not vectorized over mean and variance?
I don't think so, but it doesn't hurt to try. Let me see
it does not seem to work unless I am misunderstanding the syntax
x = scipy.stats.norm(0,1).rvs(10**5)
y = scipy.stats.norm.rvs(x, x/2, size=10**5)
or whatever the exact setup is
also numpy random is probably faster
scipy uses this object oriented framework that involves a lot of indirection internally
and i would be very very surprised if numpy rng norm was not vectorized over mean and variance
I'll check it out and swap if that works. I can't recall the reason why I am using scipy other than that I brought it up and was told that unless the reason is considerable, the speed increase from that swap isn't worth it because "the engineers want it this way".
numpy is usually simpler, scipy is good if you like the OO interface or want to reuse the object representing a specific distribution repeatedly
with vectorization both should be "fast enough"
I think I must be just doing something weird with the syntax because your links suggest I can do it. That's helpful thank you
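For reference, a minimal sketch of the vectorized version with NumPy's random generator, where `loc` and `scale` are arrays and broadcast element-wise. Taking `abs()` for the standard deviation is an assumption of this sketch, since about half of the values in `x` are negative and a negative scale is not valid.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is arbitrary, just for reproducibility

x = rng.normal(loc=0.0, scale=1.0, size=10**5)

# One draw per element of x: y[i] ~ Normal(mean=x[i], std=|x[i]| / 2).
# abs() is an assumption of this sketch: a standard deviation cannot be negative.
y = rng.normal(loc=x, scale=np.abs(x) / 2)

print(y.shape)  # (100000,)
```

As far as I know, `scipy.stats.norm.rvs(loc=x, scale=np.abs(x) / 2, size=x.shape)` should behave the same way, and a negative scale is a likely reason for the plain `x/2` version failing.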
2 quick questions:
-> In a translation model, translating sentences is better than translating each word inside a sentence, right?
-> If so, then each sentence will be assigned to a token, right? I'll have a single value for an entire sentence, no matter how big that sentence is?
Oh...now I think I get it... I'll have to tokenize each word, but the input will be the entire sentence. So each sentence will be a sequence of tokens...
- Yes
- i dont think so because the model tokenize each word no? I'm not sure
Thanks! I'll see what I can do, then
some models require you to tokenize the words yourself, but if you use a ready-made/pretrained model from Hugging Face you can just feed in the whole sentence and be done
Meh. The fun part is doing it all by myself...
even if it's through copying someone else's code
yeah its fun to code from scratch but those big ass model is fun to play with too
I got it working now thanks again for your help.
Is there any way to assess the quality of audio?
To know whether it has disturbance in the sound
The audio is basically voice recordings
didn't someone already help you with this earlier today?
@RenegadeZed#4600 one more thing: if you feel like you are lacking in core intuition (many of us are), the 3blue1brown "essence of" video sequences are excellent
you won't learn the mechanical equation stuff, but you will probably come away with a much richer intuition than you had before
do you have a database of voice recordings without any disturbance?
Hey! I just wrote a data analysis project using Python on Jupyter Notebook and I really want someone to help me with a short review of it. Would you be up for this?
This is my first project and I want to get a second perspective from someone with more experience.
if you put all the code in the paste bin (you'll have to copy and paste the code in each cell individually), I can look over it.
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
It is quite big, I don't think it will be readable here. 😦
that's why I told you to put it in the paste bin.
if it's too big to share, how were you expecting to get it reviewed?
All right. I will try it later and ping you. Thanks a lot!
Sure. Next time you want people to look at something, be it in the context of a question or whatever else, make sure that everything is available all at once. Don't ask people to commit to answering your question, or looking at your code, before sharing it.
If you had shared the code in your first message, I would be reviewing it right now. Now we're just wasting our time.
That is for sure, thanks for your advice, I will keep this in mind.
noob here, is it possible to use datetime dtype for training a ML model?
how should i approach the idea that the machine should pay attention (via feature selection) that the date or year column is pretty relevant considering
what kind of model? please be specific.
random tree
are you sure you don't mean random forest?
sorry, yes i mean random forest
what's the first and last year in the data?
min() is 1930, max() is 2014
how precise are the timestamps? to the day? hour?
1930-2-15
what is the data? like what does each row represent?
each row represents a football match
each row contains a date column, among other attributes ofc
does the date actually have anything to do with what the model is supposed to learn?
among other attributes ofc
I have no idea what they are unless you tell me
i mean tbh idk -- maybe?
then I wouldn't use it.
home team score, away team score, penalty kicks, home team id, away team id
what is the model supposed to learn?
learn the different teams based on the attributes provided, ultimately to predict match outcomes (win/lose/draw)
Great. So far, you've said what your model is, what your features are, and what the model is supposed to learn. Next time you have a question, please say all of these things in your first message.
isn't the winner whichever team has the higher score?
well, the reason i ask is simply because year might have statistical significance tied to it. as we get closer to modernity, the recurring national teams should have more statistical advantage than non-recurring national teams. if that makes sense?
statistical significance
this term has a specific meaning that isn't the one that you meant.
anyway
noted
obviously, if a team has existed from 1930 all the way to the 2000s, then the players who compose that team have changed. so, can we assume that for any team, the players that compose that team are the same within a calendar year?
i think that's a fair assumption
and do we know which teams have faster rates of turnover?
hmm we do not, but i think if i did have that rate, i would certainly use it as a feature
ofc
it doesn't sound to me like you have enough features to do anything interesting. if you pick two teams, and ask "which is more likely to win", you can just pick whichever wins more often. if there are considerations other than whichever wins more, you don't really have features for that.
you are exactly right, which is precisely the understanding that i have come to when approaching future models to train
hmm, I actually have an idea. do you know about time series forecasting?
i am just entering the machine learning universe, so sadly no
so sadly no
don't look at what you don't know as a negative. think of it as another thing you get to learn
but for my original question: in either general cases or specifically random forest classifier cases, can we designate datetime dtypes as features. i ask only because i understand machine learning models can only accept int and float dtypes (atm)
in "normal" ML, the order of the observations (ie, the rows of data) doesn't really matter. but for time series stuff, the order is taken into account.
Time series forecasting means to forecast or to predict the future value over a period of time. It entails developing models based on previous data and applying them to make observations and guide future strategic decisions. The future is forecast or estimated based on what has already happened.
whoa that sounds very interesting
in general, you would want to decide what level of precision you need (years, months, days, etc) and encode the time as an int that counts how many of that unit have passed since the start of your data
so if you decide that you need to be as precise as months, and your time range starts at 1 January, 1970, and you want to encode 7 February, 1971, you would encode that as 14, because individual days don't matter, and February 1971 is the 14th month in the data.
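That month-level encoding can be sketched with the standard library alone. The function name `month_index` is made up for illustration, and it counts the starting month as 1 so that it matches the "14th month" in the message above.

```python
from datetime import date


def month_index(d, start):
    """1-based count of calendar months from `start`'s month to `d`'s month."""
    return (d.year - start.year) * 12 + (d.month - start.month) + 1


start = date(1970, 1, 1)
print(month_index(date(1970, 1, 1), start))  # 1
print(month_index(date(1971, 2, 7), start))  # 14
```

The day of the month is ignored on purpose, since at month precision 1 February and 7 February encode to the same value.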
alternatively, if you're treating time as a sort of category (like the name of the month or the day of the week), you can one-hot encode those
are you still with me? questions?
of course. take your time.
anyway, I don't know that you have enough features to do time series stuff, either. because if you had the turnover rate for each team, you could make a model that estimates how a team's turnover rate and past performance determines its future performance.
so if i understand correctly your examples, the first example (month) requires ordering to be accounted for in the encoding process, while the other one is just for assigning an int encode?
the second one is just treating the month/day-of-week as a label. when you treat time things as a label, the model won't know which days or months come before or after each other. so you use that when you assume that certain things usually happen in/on certain days/months
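A quick sketch of the "time as a label" option with pandas, one-hot encoding the day of the week. The dates are made up for illustration. Each row gets exactly one active column, and nothing in the encoding says that Monday comes before Tuesday.

```python
import pandas as pd

# Hypothetical dates: 2024-01-01 and 2024-01-08 are Mondays.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-08"]))

day_names = dates.dt.day_name()
one_hot = pd.get_dummies(day_names)

print(one_hot.columns.tolist())  # ['Monday', 'Tuesday']
```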
ok that makes total sense
for example, I used to work for Starbucks, and we knew that sales were higher on weekdays and on Friday especially, and that sales are especially low in July, and that sales are especially high in December.
and this is true week to week and year to year. so we don't really need to know that today is Friday, tomorrow is Saturday, and that Sunday comes afterward.
did you love my dank reference?
i have a follow up question about categorical dtypes, using the example of t-shirt sizes (S,M,L,XL)
yeah, you'd one hot encode those, because knowing that some of them are bigger than others doesn't really help you that much
This can also be used to do what the single int does. If your system wants binary inputs or 0 to 1, it can be used.
But ignoring the ordering can be useful as described.
Since you are doing Football I can give you the hint that you want injury data more than anything else.
did fifa give you forbidden knowledge or something?
(Not easy to get, that is very private information)
using another example, under the context of competition and sports, specifically high school sports? wouldn't you want to designate hierarchy with grade level ? (Freshman, Sophomore, Junior, Senior)?
I have seen a lot of Football predictors.
It's something many want to try to do.
i guess the specific sport im thinking of is wrestling (freshman vs sophomore)
(And coach apps track injury data for maximum training efficiency too)
i dont believe you'd want grade level to just be a label encode
idk i could be wrong totally
if you have a small set of categories, knowing that they're conceptualized in a certain order isn't really that helpful. and if you encode the four of them as the integers 0 to 3, depending on your model, you might get a prediction of 2.653, or something. and what are you going to do with that?
i honestly dont know, maybe im overcomplicating this concept. i think i will experiment with the question using small pilot tests
since your data is of limited use, I would take this opportunity to practice data manipulation. see if you can make a line plot to show each team's performance over the years
like what percentage of games they win each year
that information is available given the features you described, but you'd have to fiddle around with it.
As Stelercus wrote, plot some stuff, then try adding some more data if you have it (although not just any data randomly, try to pick something reasonable or it will just make it harder).
@iron basalt since you are familiar with football predictions, do you have some examples to share off the top of your head? preferable beginner-friendly examples?
and by extension, assigned that "performance rating" to that year?
in a new column
You are using this dataset? https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017
the performance rating would just be the number of wins divided by the number of games
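Here is a rough sketch of that wins-divided-by-games calculation with pandas. The column names and the four matches are invented for illustration (the real dataset will differ), and draws are counted as non-wins, which is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical matches; column names are assumptions, not the real dataset's.
matches = pd.DataFrame({
    "date": pd.to_datetime(["1930-07-13", "1930-07-14", "1934-05-27", "1934-05-27"]),
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "C", "C", "B"],
    "home_score": [4, 1, 2, 0],
    "away_score": [1, 1, 3, 2],
})

# One row per (match, team): once from the home side, once from the away side.
home = pd.DataFrame({
    "year": matches["date"].dt.year,
    "team": matches["home_team"],
    "won": matches["home_score"] > matches["away_score"],
})
away = pd.DataFrame({
    "year": matches["date"].dt.year,
    "team": matches["away_team"],
    "won": matches["away_score"] > matches["home_score"],
})
per_team = pd.concat([home, away], ignore_index=True)

# Mean of a boolean column = wins / games. Rows are years, columns are teams.
win_rate = per_team.groupby(["team", "year"])["won"].mean().unstack("team")
print(win_rate)
```

With matplotlib installed, `win_rate.plot()` should then draw one line per team over the years.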
not quite
Stelercus brought this up, but teams are not static, they are made up entities with players constantly moving in and out, and so without that data there is not too much that can be done. @lapis sequoia
It's not a team name that wins, but a temporary group of people that wins.
(And this gets really complicated, because while the nice story is that there are star players that carry teams (can still happen), the combinations of players are key (in ways that are not obvious, like two star players in their own teams might be good, but when put on the same team fail))
these are very good points that i didnt think about
thank you both for entertaining my curious questions, i will leave you be
Sigh, now I have this robotic voice stuck in my head.
Hi everyone,
I have been working on a project where I have to extract text from images and I am using pytesseract for that. Currently, I am working on preprocessing and have used basic transformations: binarization, then dilation followed by erosion. It is working well on some of the images, but for other images it is not even detecting the text. Can anyone suggest how to get better results?
What is different about the images it works on and the ones it does not?
Images where the text is black on a white background, like in books and papers, give really good results. However, on images where the text is white, it isn't able to detect it
And what does the text look like after pre-processing?
Some adaptive thresholding might do the trick.
I'm using adaptive thresholding because global thresholding methods weren't giving good results
Is it detecting the smaller text in the image?
Yes, it is detecting that text
What colors are happening for the block with the 6 Person.
If you put it through some edge detection what does it look like?
I have not tried edge detection yet.
If the edge detection gives some nice text without the block around it and it still does not work, it could be a multiple scales issue.
Alright. I will try this. Thanks for the inputs 🙂
Try messing with your adaptive threshold parameters too first.
(block size and constant subtracted)
Yes, exactly. I did that because a small change in those was giving pretty different results. I was also looking as to how I can set block size if there is a way to figure out optimal value but could not find it.
When the block size is small it can act kind of like edge detection.
Did you do any blurring?
(e.g. Gaussian blur)
Yes, I tried median blurring but blurring was capturing unnecessary noise
Try Gaussian.
The blur size relative to the threshold block size is something to consider.
Alright. I will try that
Also try Gaussian on the threshold if you are doing mean.
And block size is somewhat relative to size of the image right? Other than that I could not come up with any relation to figure out block size
How quickly the lighting varies.
(And image size)
okaay
The regular thresholding does not handle such variations (it's globally, not locally, applied).
(Imagine what happens when block size equals the image size)
aaah okaay. Got it.
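To illustrate the global vs. local point, here is a small NumPy sketch of mean adaptive thresholding on a synthetic image with uneven lighting. It follows the rule the OpenCV docs describe for the mean method of `cv2.adaptiveThreshold` (threshold = neighbourhood mean minus C), but it is a re-implementation for illustration, not OpenCV's code. The image, the function name and the parameter values are all made up, and `c` is negative here because the "text" is brighter than its surroundings.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def adaptive_mean_threshold(img, block_size, c):
    """Mark pixels brighter than (mean of their block_size x block_size block) - c."""
    pad = block_size // 2  # block_size is assumed to be odd
    padded = np.pad(img.astype(float), pad, mode="edge")
    windows = sliding_window_view(padded, (block_size, block_size))
    local_mean = windows.mean(axis=(-2, -1))
    return img > local_mean - c


# Synthetic 5x10 image: the background gets brighter from left to right
# (uneven lighting), with a brighter 3-pixel vertical stroke standing in for text.
img = np.tile(np.arange(10) * 10.0, (5, 1))
text = np.zeros_like(img, dtype=bool)
text[1:4, 4] = True
img[text] += 60

global_mask = img > 50  # one threshold for the whole image
local_mask = adaptive_mean_threshold(img, block_size=3, c=-20)

print(np.array_equal(global_mask, text))  # False: the bright right side is picked up too
print(np.array_equal(local_mask, text))   # True: only the stroke stands out locally
```

As the block grows towards the image size, the local mean stops following the lighting and the result drifts back towards what a single global threshold gives.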
I have no idea what you are talking about but it sounds cool
Can anyone help me use this API to get the population statistic for each city in Canada? I have gone through the doc so many times but I can't figure it out
https://www12.statcan.gc.ca/wds-sdw/2021profile-profil2021-eng.cfm
2021 Census Profile Web Data Service User Guide
this isn't a trivial document to read, it's definitely written for experienced programmers to use
it looks like you need to go through the docs and figure out what "flowRef" you need
Probably easier just to download the data here rather than use the API: https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/details/download-telecharger.cfm?Lang=E
This product presents information from the Census of Population for various levels of geography. Data are from the 2021 Census of Population and are available according to the major releases of the 2021 Census release dates: February 9, 2022 – Population and dwelling counts; April 27, 2022 – Age, Sex at birth and gender, Type of dwelling; July 1...
is there any library to recognise only alphabets from audio without API i.e offline ?
I tried VOSK , but as it is trying to recognise all Words and sentence, it has lots of errors.
I only want letter recognition.
What does it need to do
i don't really understand the difference between data science and machine learning could someone explain to me?
data science is basically just knowing how to analyze data with code. machine learning is where you have programs that adjust themselves ("learn") based on example data.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Please don't ask people to read screenshots of text.
..
    def __init__(self):
        super().__init__()
        # Simple CNN
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 1)
Hi, why is the input for the linear 320
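A likely explanation, sketched as arithmetic: the forward pass is not shown, so this assumes a 28x28 single-channel input (e.g. MNIST) and a 2x2 max pool after each convolution, which is the usual setup for this kind of network. Under those assumptions, the second conv layer outputs 20 channels of 4x4, and 20 * 4 * 4 = 320.

```python
def after_conv(size, kernel_size):
    """Conv2d with stride 1 and no padding shrinks each side by kernel_size - 1."""
    return size - kernel_size + 1


size = 28                    # assumption: 28x28 input (e.g. MNIST)
size = after_conv(size, 5)   # conv1 -> 24
size = size // 2             # assumed 2x2 max pool -> 12
size = after_conv(size, 5)   # conv2 -> 8
size = size // 2             # assumed 2x2 max pool -> 4

fc1_inputs = 20 * size * size  # 20 output channels from conv2
print(fc1_inputs)  # 320
```

If the input size or the pooling is different, the number going into `fc1` changes accordingly.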
@serene scaffold like this
Do you know how to open the excel file in pandas?
This is the approach....ryt ?
@serene scaffold
A very silly thing about pandas that I noticed was that for object dtype. It can store some entries as int. And others as string.
That's very silly to me
Oh
Or maybe not. Maybe the mismatch was because of the "22.0" kind of strings.
But I did notice this happening tbh
If a column is heterogenous, the type will be object
So if you do an isinstance check on the elements of an object dtype column, some of them were str and others were int
Oh shoot. I wasn't aware of that
I thought it always transforms the whole series back to the most generalised dtype
Like strings
So if it has strings and int. All the ints being string
But it just keeps heterogeneous varieties
It can sort of do this for ints to floats, since any int can be represented as a float.
So I think it's worth changing dtype to numeric each time for object dtype numerical columns
Yeah, in R it did even for strings
I always thought that's how it works. I didn't know about the existence of heterogeneous dtypes. No one told me 😭
Each column should be homogeneous. Rows often will not be.
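A small sketch of what is being described, with a made-up column. An object column keeps each element's original Python type, ints mixed with floats get upcast to float, and `pd.to_numeric` is one way to turn a numeric-looking object column into a proper numeric dtype.

```python
import pandas as pd

mixed = pd.Series([1, "22.0", 3])  # hypothetical column with ints and a string
print(mixed.dtype)                 # object
print(mixed.map(type).tolist())    # [<class 'int'>, <class 'str'>, <class 'int'>]

upcast = pd.Series([1, 2.5])       # ints and floats share one numeric dtype
print(upcast.dtype)                # float64

numeric = pd.to_numeric(mixed)     # raises if a value cannot be parsed
print(numeric.dtype)               # float64
print(numeric.tolist())            # [1.0, 22.0, 3.0]
```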
But I think I need to lemmatize you. Are you down?
Gently lemmatize you*
You lemmatize words. Not people.
my model keeps giving value of one kind
which leads to high accuracy how do I fix that
like I have two class yes and no
Wdym by heterogeneous. I am comprehending it as having more than one dtypes
and my model predicts no
I also faced this issue. Maybe something with your training data
In predictions?
and it keeps on predicting no
Heterogenous is more than one type. Homogenous is only one type. It's the same distinction as homo or heterosexual
but how do I fix that
