#data-science-and-ml
is this a DS/AI question?
robotics is not equal to AI
yeah but Reinforcement Learning is part of AI and pybullet is being used to create virtual environments for training RL agents
The question is fine for this channel, though unfortunately it's not likely to be answered given that it's very niche @lapis sequoia @frank edge
Hi there 👋
I'm applying for data science degree apprenticeships and was wondering whether or not people have concocted any opinions as to how good they are?
I would be training as a data scientist while earning roughly £20000 give or take depending on the company (if I managed to land the apprenticeship ofc), and after 4 years would have a bachelors in Data Science paid for
my main long-term concern would be career progression with only a bachelor's degree. I've seen a few posts on reddit about how it's much harder to progress without at least a master's, with many people choosing to get their PhDs
So my question is, would you agree with that or not? Thanks in advance
@rugged tide you might ask in #career-advice, asking for those who are familiar with the job market in Britain.
My apologies, didn't see that channel.
as for the question about the masters, the posts i've seen have been worldwide, do you have an opinion on that part?
In the US, it's harder to get your foot in the door with only a bachelors related to data science, but once you get a job, progression isn't necessarily stopped by not having a higher degree.
also idk what £20000 can get you in the UK, but if you take today's exchange rate for GBP->USD and try to live on that here, it wouldn't be that great. Are you sure you could live on that?
yeah up north you can live on that easily tbh
and if in london I can commute
I see
I say easily, I guess I mean for my lifestyle lol
I don't really drink or rave so its fine
There are other ways to burn money fast 
what income prospects are you looking at if you have a bachelors? because people encouraged me to do community college before starting the CS program "to save money", but in taking longer to get my degree, I missed out on a few years of higher income.
So in retrospect, I lost money by not getting my degree in four years.
obviously the situation is different. I'm just pointing out that future income is a consideration.
it's not really like that here, it's usually 3 years to obtain a bachelors, and that would be around 27k in uni-fee debt plus another 20-30k maintenance loan debt, so roughly 55k I guess for 3 years of uni? This would allow me to obtain a degree in 4 years, with the degree fully paid for while also earning a little bit of money
so 1 extra year for the degree, but no debt, and far more exp
there are very conflicting opinions on degree apprenticeships here though, some people think they're amazing, other people say they suck
well, one thing you'll have to do as a data scientist is figure out why similar events have different outcomes, so I guess you can start doing that now 😄
🤣 thanks
PPP is vastly different in the UK.
This is definitely more of a #career-advice place for this kind of discussion, but since you haven't been moaned at yet ill put my answer here 🤣
Getting a good apprenticeship in the UK is incredibly difficult and it's unbelievably competitive for what it is (a job and a place at a mid-rank uni). My advice is to take what you have, assuming you've just finished A-levels or equivalent, and hit up the best name-brand university you can (Bristol, Warwick, Birmingham, etc). Coming from an elite university is what makes the most difference, even over experience in some cases. Personally, I'm finishing my masters in September and will be an EO for the ONS, and I had a better edge than most of the kids applying just because my university's name was shinier.
I was definitely dumber though.
hello, in k-fold cross validation are the weights initialized in each fold?
hello guys, i need help. i need to make a project in https://robotbenchmark.net/benchmark/obstacle_avoidance/ can someone give me some tutorials to learn about the control library, or any useful documentation?
What is your opinion on BlobCity AI Cloud?
Hello, I'm just learning python for my homework. I was asked to make a program with ImageAI, and when I tried, I got an error. Can someone explain why?
the pip install failed
try restarting the kernel (as the message suggests)
i doubt anybody is gonna just make a model for you, thats a job you'd have to pay quite a bit for
you can try to make one and we'll help you if you run into any snags though
its a school project-
yeah I've already made one
but I just don't know how to improve it further
I've used every parameter for a random forest classifier ( best_params )
!rule 8
8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.
homework
we can't do it for you, but we can help you
which is what i explained earlier
yeah u don't need to help
but could u give tips on improving it ( after hyperparameter tuning )
is there anything else u can do on a model
just tell me, I'll research and do it myself
+I'm not able to balance the data
"can anyone create a good model for the dataset I'll give" sounds a lot like asking for someone to do it for you
try using a different type of model
like a neural network or an svm etc
welp SVM did very bad on the data
idk bout neural network
its more of a y/n model
😭 gomenasai, I'll be clearer next time
So far I've used :
- Logistic Regression Model
- Gradient Boosting Model
- Random Forest Classifier
- Decision Tree Classifier
- Naive Bayes
- Pipeline
Random Forest gave the best results but, the options to improve it are limited
well try neural networks
will it give only 2 outputs?
isn't it a multiclass thing
neural networks can pretty much do whatever you want, it's just about how you configure them
you can give as many or as few outputs as you want
- I have a doubt, is there any way I can increase the speed of my GridSearchCV
its been 3.5 hours
oh i see
I'll try it then
you can increase your n_jobs
which is just how many parallel jobs it'll run
oh
how much?
I did n_jobs=-1 before
well -1 is the maximum you can use anyways
oh 🗿
if I update a running jupyter cell
I just added
n_jobs=-1
into a running cell
will it update the parameter?
@austere swift
you have to restart the cell
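For context on why the search takes hours: as far as I know, GridSearchCV fits one model per parameter combination per CV fold, so n_jobs only spreads that work across cores and doesn't reduce it. A rough sketch of the arithmetic, using a made-up random forest grid (the parameter values and fold count below are assumptions, the real grid was never posted):

```python
from itertools import product

# Hypothetical grid for illustration; not the grid from the chat.
param_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 5, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}
cv_folds = 5

# Every combination of values is one candidate model.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), "combinations x", cv_folds, "folds =", total_fits, "fits")
```

Trimming the grid, or switching to RandomizedSearchCV with a fixed n_iter, usually saves more time than n_jobs can.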
Where can I start learning about Data Science?
Depends on your background, do you know the basics of Python and/or data analysis ?
what can happen if i encode my csv dataset into very high number of columns using hash encoders
Just learn statistics.
it was doing so good too :(
Found this interesting. Important Advanced Regression Techniques with a project : https://medium.com/coders-mojo/day-37-60-days-of-data-science-and-machine-learning-series-2e78afca9680
I figured it out and it was not really connected to pybullet engine, the problem was that I just wrote colision instead of collision
Hello, what would be the best model for a dataset containing 5000 rows and 2098 columns
F
And I once got a loss rate of 2 and was upset lmao
Yo why is this not working 
```py
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
# load the dataset
dataset = loadtxt("data.csv", delimiter=",")
X = dataset[:,0:3]
y = dataset[:,3]
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=3, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
# make class predictions with the model
predictions = (model.predict([4,5,7]) > 0.5).astype(int)
# summarize the first 5 cases
for i in range(5):
    print('%s => %d (expected %d)' % (X[i].tolist(), predictions[i], y[i]))
```
import numpy as np
import matplotlib.pyplot as plt
def gradient_descent(x,y):
    m_curr = b_curr = 0
    iterations = 100000
    n = len(x)
    learning_rate = 0.001
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1/n) * sum([val**2 for val in (y-y_predicted)])
        plt.plot(x,y_predicted, color = "green")
        md = -(2/n)*sum(x*(y-y_predicted))
        bd = -(2/n)*sum(y-y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print ("m {}, b {}, cost {} iteration {}".format(m_curr,b_curr,cost, i))
x = np.array([10,9,11,12,6,5,7,6,12,14])
y = np.array([95,90,90,105,75,75,80,85,110,115])
gradient_descent(x,y)
Even after so many iterations the cost is still 20
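A quick sanity check, assuming an ordinary least-squares line is the goal: the closed-form fit for this exact data already has a mean squared error of about 20.2. So gradient descent has most likely converged, and the remaining 20 is scatter in the data that no straight line can remove. A minimal sketch using np.polyfit as the reference:

```python
import numpy as np

x = np.array([10, 9, 11, 12, 6, 5, 7, 6, 12, 14])
y = np.array([95, 90, 90, 105, 75, 75, 80, 85, 110, 115])

# Closed-form least-squares line: the best m and b gradient descent can reach.
m_best, b_best = np.polyfit(x, y, deg=1)

# Same cost as in the loop above: mean of the squared residuals.
best_cost = np.mean((y - (m_best * x + b_best)) ** 2)

print(f"m {m_best:.3f}, b {b_best:.3f}, minimum cost {best_cost:.2f}")
```

If the m and b printed by the loop match these values, the implementation is fine. Getting the cost lower would take a different model or features, not more iterations.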
Tryna make a basic neural network in keras
if your ratio of classes isn't close to being even try using SMOTE or undersampling
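For the undersampling half of that suggestion, here is a minimal sketch with plain NumPy on made-up data (the arrays and the 90/10 split are assumptions for illustration). SMOTE itself lives in the separate imbalanced-learn package and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels: 90 negatives, 10 positives (made-up data).
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 3))

# Keep every minority row and an equally sized random sample of the majority.
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

keep = np.concatenate([minority_idx, kept_majority])
X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))
```

Resampling should normally be applied to the training split only, so the test set keeps the real class ratio.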
hey guys
i want help
can anyone help me with a machine learning course pls
can anyone help me
you have to ask your actual question, not give a teaser for it
the question is that i want an advanced machine learning course
you want someone to tell you what advanced ML course you should take?
yes i want suggestions
the andrew ng course seems to be popular. I have not taken it.
are u joking or talking serious
I am being serious. there's an ML course taught by Andrew Ng that I hear about a lot. But I have not taken it personally, so I can't tell you how it is from experience.
class MyCell(tf.keras.layers.AbstractRNNCell):
    @property
    def output_size(self):
        return 16
    @property
    def state_size(self):
        return 16
    def call(self, inputs, states):
        alpha_t, alpha_t_prev = inputs, states[0]
        return alpha_t, [alpha_t]
my_cell = MyCell()
layer_res = tf.keras.layers.RNN(my_cell)(logits)
anything obviously wrong with this code? logits contains a tensor with shape [batch_size, timesteps, logits]. I know it doesn't do anything right now but i'm getting this error: TypeError: Cannot iterate over a scalar tensor.
is it the way i'm defining the output and state sizes?
so i just tried to get anaconda on my computer
never again
anaconda's like that hot ex that you want back in your life bc you think it'll change
but then it's exactly the same
stel i know you're gonna read this
jupyter notebook fucking sucks
except they're not even hot. you just had bad taste as a teen.
HAHAHXCKDN
that describes my last one so well
anywaysssss
thonny is the shawty
there was an ask reddit where someone asked "what is your high school crush doing now?" and someone said "I'm 40. he's still a douchebag who spends all his time at the gym. but I was into that at the time."
i was facetiming this girl last night and she was like full time real estate, full time law firm stuff, full time college student and she was dating this dude who had nothing going for him
and i was like why?
and then she said verbatim "i was bored"
oh i'm gonna preach this: everyone please pip install pyforest
you don't need to write import numpy as np, import pandas as pd, import scikitlearn as sklearn
it lazily imports everything
i am using mediapipe to make stick figures from video (sorry for the rick roll i was just using it as a test video)
does that use AI in some way?
mediapipe
what is that
look it up
no
what should i use instead of jupyter?
Hi, does someone know if you can export a yolov5 file in xml format? I already found https://pytorch.org/hub/ultralytics_yolov5/ but I can't find it there. I use OpenCV for all the image stuff
I know the basics of python
@lapis sequoia So, from where should I learn
I like Medium articles (practical applications) and Codecademy (structured learning), but there are also lots of free materials on Kaggle, for example
Can you send me a link to something you recommend the most
Thanks
I see that your entire participation in this channel is posting content from that same author. This is tantamount to advertising, so we're going to remove you if you don't actually contribute.
thonny
thonny mad cute
and gives nice debugging tips
mad easy to install and update packages
lightweight
people give it shit bc it’s for beginners but i prefer it
it basically has a rubber duck installed that talks to you about your code and tries to suggest what went wrong
more descriptive than some long ass error message you’ll run a rabbit hole for hours looking to solve
hello, can someone please tell me how i can use grad cam to correct my model
i can see the areas where my model is making a mistake using grad cam
can anyone send a course from code academy to start in machine learning if i studied data analysis
what did you study in data analysis
i studied statistics and spreadsheets and business metrics
i know this is not enough
is that a MOOC or a degree?
google colab

aka their version of jupyter notebooks
no need to install anything

only obscure errors i mean what

i think yes
i think yes
yes
ok yes
can you tell me a course to begin in ML in codecademy
yes
???
You can find all their courses in the catalogue
yo
Hey small question since questions about pandas seem to fall short in the help channels, does anybody know of a quick way to replace values in a column with 'day' if it falls within a certain time of the day and 'night' if it falls within a certain time in a pandas dataframe? I've searched and dataframe.between_time does not seem to return it in a way such that I can set a certain column in those values to another value
Nvm I think I got it
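For anyone landing here with the same question, one common way to do it is np.where on the hour of a datetime column. This is a sketch under assumptions: the column names and the 06:00 to 17:59 cut-off are invented, since the original data was never posted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-04-19 03:15", "2022-04-19 09:30",
        "2022-04-19 17:59", "2022-04-19 22:05",
    ]),
})

# Treat hours 6 through 17 as day, everything else as night (assumed cut-offs).
hour = df["timestamp"].dt.hour
df["period"] = np.where(hour.between(6, 17), "day", "night")
print(df)
```

between_time only filters rows on a DatetimeIndex, which is why it was awkward for setting values. A boolean mask like this one can be used on any column.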
Hello. Could anybody recommend a textbook on data science with python that mainly focuses on quant methods?
if you know more than one book, please do refer a few for evaluation
This is probably the wrong section but is python able to make a script that grabs a =count(E2:Ewhatever the last one is) on a specific column in multiple sheets and print them in a new column of a specific file?
I have a small macro from Fiji I made/found parts of and it runs an analysis on a set of images within a folder and spits out all the excel sheets into a folder. So I have that backbone but I don't think it's going to work
so if anyone knows if this is easily possible, that would be great
Any free source from where I can learn Data Science?
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
hello
what can be the right answer for this question.?
I selected B as the right answer
but unfortunately its wrong
B looks like the correct answer. I'm guessing they messed up the question
for that one they gave C as answer
anybody got interviewed at Uber for data analyst ? any help is appreciated.
hi'
hey
Thanks
How can I manipulate the Names of a Data Frame that are within a specific range of id`s
using Pandas
you can use df.loc[start_id:end_id] to get the desired rows, but beyond that, you'll have to be a lot more specific.
I have a Data Frame that looks like this:
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
Cora,F,1045
Martha,F,1040
Laura,F,1012
The header is ['Name','Gender','Id'] and I want to change the names that have an Id within a range of 1180-1200 to John
Is this enough
you would just need to do df.loc[df['Id'].between(1180, 1200), 'Name'] = 'John'
if you change the Id column to be the index, it would be df.loc[1180:1200, 'Name'] = 'John'
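A runnable version of the mask approach, using the first few sample rows from the message above. Note that none of the sample Ids fall in 1180-1200, so this sketch uses 1200-1300 instead to make the change visible.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bertha", "Sarah", "Annie", "Clara", "Ella"],
    "Gender": ["F"] * 5,
    "Id": [1320, 1288, 1258, 1226, 1156],
})

# The boolean mask picks the rows; the second argument picks the column to overwrite.
mask = df["Id"].between(1200, 1300)  # inclusive on both ends by default
df.loc[mask, "Name"] = "John"
print(df)
```

As far as I know, the index-slice variant only behaves as expected when the index is sorted ascending, e.g. after df.set_index('Id').sort_index(). The mask version has no such requirement.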
lemme test it
im just getting
```py
k:
 John
```
I'd have to see the code that caused that to be printed.
like, what caused k: to be displayed?
k = df.loc[df_1880['Id'].between(42, 49), 'Name'] = 'John'
print('k:\n',k)
you have to print df to see the result
k = df.loc[df_1880['Id'].between(42, 49), 'Name'] = 'John' is the same as
df.loc[df_1880['Id'].between(42, 49), 'Name'] = 'John'
k = 'John'
No I fixed it
```py
k = df_1880.loc[df_1880['Id'].between(42, 49), 'Name'] = 'John'
print('k:\n',k)
```
its working now
when I look into variable explorer
great. but k is still just going to be 'John'
Yes, because you overwrote df_1880
k is irrelevant
df.loc[...] = ... is a method call. it's not actually doing assignment.
k is just a variable you created. How do you want to display the output?
I see
I wanted to get smth like this
```py
      Name
2117  Vince
2118  Vivian
2119  Whit
2120  Willaim
2121  Winifred
2122  Wirt
2123  Woodson
2124  Woody
2125  Worley
2126  Zed
```
Yes. Then just write df_1880 underneath your input.
it's the same as df.loc.__setitem__(x, y). it changes the state of df. it doesn't write any new variables. but since you stacked it with k =, it assigned to k
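The same behaviour can be shown without pandas. In this sketch, Loud is a made-up dict subclass that announces when item assignment goes through __setitem__, standing in for df.loc:

```python
class Loud(dict):
    """Dict that reports when item assignment is routed through __setitem__."""

    def __setitem__(self, key, value):
        print(f"__setitem__({key!r}, {value!r})")
        super().__setitem__(key, value)


d = Loud()

# Chained assignment: the right-hand side is evaluated once, then bound to
# each target from left to right. k gets the string, not the container.
k = d["Name"] = "John"

print(k)
print(d)
```

So k is just another name for 'John'. To see the changed rows, print the DataFrame itself.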
Pandas is the apple imac of python code. Please check the documentation.
I tried to but got lost in it
I appreciate where this is coming from, but the pandas docs are incomprehensible if you don't understand the basics of pandas.
but Kingu is right that pandas works very differently from the rest of Python
Of course this is more advanced but I just got into it and dont know how it reacts to code so I dont know what to expect
```py
m = df_1880['Name'].value_counts()['Mary']
print('Mary :\n',m)
```
here I counted how many times the name Mary occurs
how would I do the same thing for 3 dataframes at once
what distinguishes the three dataframes?
what do you think about julia for datascience ?
maybe this is the dumbest error but i keep running into it
I've heard that it performs well for CPU-bound work (which as you may know, is one of Python's greatest drawbacks), but it just doesn't have the ecosystem that Python has.
whenever i am trying to insert table values with sqlite3, i get a syntax error and i am not sure what's wrong with it or my spacing
idk
Car_Name
Year
Selling_Price
Present_Price
Kms_Driven
Fuel_Type
Seller_Type
Transmission
Owner
)
VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', tups)
connector.commit()
connector.close()```
you might try #databases
you should probably specify what flavor of SQL you're using as well
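Two things in the visible part of the snippet look suspicious, assuming it was pasted as written: the column names have no commas between them, and there are 22 ? placeholders for 9 columns. One way to rule out both is to build the statement from a single list of columns. The table name cars and the two rows below are invented for the sketch, since the start of the original code is cut off.

```python
import sqlite3

columns = [
    "Car_Name", "Year", "Selling_Price", "Present_Price", "Kms_Driven",
    "Fuel_Type", "Seller_Type", "Transmission", "Owner",
]
# Hypothetical rows; the real data and table name were not posted.
tups = [
    ("ritz", 2014, 3.35, 5.59, 27000, "Petrol", "Dealer", "Manual", 0),
    ("sx4", 2013, 4.75, 9.54, 43000, "Diesel", "Dealer", "Manual", 0),
]

connector = sqlite3.connect(":memory:")
connector.execute(f"CREATE TABLE cars ({', '.join(columns)})")

# Build the statement so the column list is comma-separated and the number
# of ? placeholders always matches the number of columns.
placeholders = ", ".join("?" for _ in columns)
sql = f"INSERT INTO cars ({', '.join(columns)}) VALUES ({placeholders})"
connector.executemany(sql, tups)
connector.commit()

row_count = connector.execute("SELECT COUNT(*) FROM cars").fetchone()[0]
print(sql)
print(row_count)
connector.close()
```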
df_1880,df_1881,df_1882
if the columns of these three DFs represent the same thing, they should be one DF.
are 1880-2 just years?
kind of a beginner question but can someone explain buffer size and batch size to me in simple terms? (in ML i mean)
are you doing audio processing?
nope, im following a dcgan tutorial
I haven't heard of buffer size before. batch size is the number of training instances that are passed through the network at once.
in an epoch or an iteration?
in an iteration, I guess. an epoch is a pass over the entire training set.
thanks for the explaination, i was kinda confused :)
ima look more into what buffer size is though
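A toy loop that makes the three terms concrete, with no ML library involved (the dataset of ten integers is only a stand-in):

```python
import math
import random

dataset = list(range(10))   # 10 toy training examples
batch_size = 4
epochs = 2

# One epoch = one pass over the data, split into this many iterations.
iterations_per_epoch = math.ceil(len(dataset) / batch_size)

for epoch in range(epochs):
    random.shuffle(dataset)  # reshuffle once per epoch
    for step in range(iterations_per_epoch):
        batch = dataset[step * batch_size:(step + 1) * batch_size]
        # One iteration = one forward/backward pass on this batch.
        print(f"epoch {epoch} iteration {step}: {len(batch)} examples")
```

On buffer size: if the tutorial is the TensorFlow DCGAN one (an assumption, the chat doesn't say), BUFFER_SIZE is most likely the argument to Dataset.shuffle, i.e. how many examples are held in memory to draw random samples from. It affects how well the data is shuffled, not how much is processed per step.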
Yeah these are Years so its 3 years
then Year should probably be a column. splitting things into a variable number of dataframes (like having a separate dataframe for each year in the data) is almost always bad.
So you suggest to create a DataFrame with these inside
can someone help me to create a ChatBot Song Recommender System,
I can provide all the information which I have
yes. and then you can do df.groupby('Year')['Name'].value_counts()
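A sketch of that suggestion with tiny stand-in frames (the rows are invented; only the names df_1880, df_1881, df_1882 come from the chat):

```python
import pandas as pd

# Tiny stand-ins for the three yearly frames from the chat.
df_1880 = pd.DataFrame({"Name": ["Mary", "Anna", "Mary"], "Gender": ["F", "F", "F"]})
df_1881 = pd.DataFrame({"Name": ["Mary", "Emma"], "Gender": ["F", "F"]})
df_1882 = pd.DataFrame({"Name": ["Anna", "Mary", "Mary", "Mary"], "Gender": ["F"] * 4})

# Tag each frame with its year, then stack them into one DataFrame.
frames = {1880: df_1880, 1881: df_1881, 1882: df_1882}
df = pd.concat(
    [frame.assign(Year=year) for year, frame in frames.items()],
    ignore_index=True,
)

# Counts per (Year, Name); level 1 of the result index is the name.
counts = df.groupby("Year")["Name"].value_counts()
mary_per_year = counts.xs("Mary", level=1)
print(mary_per_year)
```

With Year as a column, adding a fourth year is just one more entry in frames and the counting code stays the same.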
Does anybody here have experience with gpt-2? I'm really interested in making a chatbot with it, but I have NO clue what I'm doing.
https://paste.pythondiscord.com/luhoyasewu can you help me with some different code i want to work out average pixel whiteness of image. my code i wrote i think the library doesnt support transparency, i have transparency in my image
do people most typically use jupyter whenever they're actually doing data analysis?
most people use some kind of interactive setup to explore data, though the appropriateness of jupyter for different use cases is controversial.
so what is the most popular way to do it?
Jupyter is a fine way
just don't become dependent on it. it's a tool for visualization and exploration. if it becomes the only way you write code, you're gonna have a bad time in the future.
whats the tool for actual reporting? would it be inside the terminal or are there other things like sublime text that would be better fit?
Jupyter notebooks are fine-ish for reporting if you use it effectively (i.e., use markdown cells to document it well and clean up the code) then generate a PDF, or just use Powerpoint
okay so coding can be done however but final reporting is usually done outside and probably in a more presentable way? such as screenshots for powerpoint?
usually not screenshots if possible
most libraries will have ways of outputting to a file
ah okay so relying on jupyter is only bad if you don't know how to do the programming or final reporting separately?
that's not really the point
the issue about Jupyter is that the code tends to get messy, not as easy to isolate, and the global state is just a mess with things from even deleted cells still existing.
It is fine for data exploration and reporting though
it is bad if you do not organise your code or if you end up with something you cannot reproduce later
ah okay I think I have a better idea of why now, and I see what you mean whenever I want to make outside changes you have to go back through it which can be tedious
Just to make sure - What exactly do you mean by that? (outside changes & going back through it)
like if i have a csv and i want to change it, i cant just make changes and go back and continue working I have to rerun sections of the notebook that would create a df to make sure my changes were made
you really shouldn't simply "change it" in the middle of the process like that
A good rule of thumb for notebooks is that if some output needs to be reproducible, it should be possible to obtain it only by running each cell once in order.
changing your data is not something you should do often to the point of it being a concern, but if you do change anything about the source you should definitely shutdown/restart the kernel and rerun all
thank you again
Just to add in as well, though the point has probably already been made: a Jupyter kernel keeps everything you've defined in global state, even after you delete the cell, which can get tedious at times. That's why it's often better and easier to write from scratch in a script.
Hi what's the best method to approach this question? logistic regression maybe?
Yes. Each variable is categorical and it has <20 elements.
there are some numeric too
would it be enough to just fit the model using all variables and see their significance using a model summary in R (I use RStudio)
Yes. Make sure you diagnose gauss.
what is the 4 variable version of this class?
do you mean four parameter version ? There is one that takes five parameters, of which three are optional https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
yea
Is it nn.Linear that takes in five parameters?
according to those docs yes
Ohk cool, thanks for the help
diagnose gauss?
Normality. Plus do all the other assumptions (independence, linearity etc)
Hello, I have a problem to solve using python. I need to build a school timetable, and I'm thinking of using deep learning or something like that. Could anyone here tell me what the best model is for this problem? Neural network, NLP, random forest? I have some rules, like whether a teacher can give class in the morning or at night, and other rules.
what do you mean "school timetable"? also, if there's any way you can solve the problem without AI, that is better.
a school timetable is a board with defined hours, each with 1 teacher, 1 physical space and 1 subject, like "William, 09:30AM - 10:30AM - Math - Room 5"
i solved this problem without AI but.. maybe with AI the solution can be more smart.
if you can solve a problem without AI, the AI solution will be worse.
one uses AI when it's not possible to write a program that can always solve the problem. AI programs attempt to approximate human judgement
if there's an exact series of steps that is guaranteed to produce the correct result for something, do that.
Time for coffee im drinking mine rn
ok.. but if i have some problem with a lot of strings, is there an "ideal" model to work with? or do i have to transform all the strings into numbers?
how did you solve it without ai? you might be confused about what ai actually is, at our current level of (non-fantasy, non-scifi) technology
What is a single metric I can use to calculate correlation amongst multiple variables? A correlation matrix doesn't accomplish it because it reports multiple correlations
Could someone explain to me what PR AUC is?
Or has any resources that could help me understand that metric better?
!mute 899605898639597568 "1 week" It seems your only interest in our community is as a place to post Medium articles from the same author. This is not reddit. Please take your promotion elsewhere.
:incoming_envelope: :ok_hand: applied mute to @next phoenix until <t:1650946679:f> (6 days and 23 hours).
I have a band column in a data frame. There are 5 values only spread throughout the 100 rows df . I want to make the df in 5 rows. All the same band values column data will be added into a single row. So there will be one row each for those 5 band. So 5 rows will be there. Can anyone please help me??
Use pivot table or groupby function
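A minimal groupby sketch, assuming "added" means summing the numeric columns within each band (the column names and values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "band": ["A", "B", "A", "C", "B", "A"],
    "hours": [1, 2, 3, 4, 5, 6],
    "cost": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# One row per band; numeric columns are summed within each group.
per_band = df.groupby("band", as_index=False).sum()
print(per_band)
```

If some columns should be averaged or joined instead of summed, .agg({...}) with one function per column is the usual next step.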
I solved it with an API using C#. I've created an API, rules, everything.
hi there, need a help, can't plot the graphs like in excel
`
#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%%
df_new_1 = pd.read_excel(r"C:\Users\Nikita\Documents\MEGAsync\current_test\Outline Proposed\night_minds\data_velocities_impr.xlsx",
                         index_col=None, header=None,
                         sheet_name="Dynamic velocities_1")
df_new_2 = pd.read_excel(r"C:\Users\Nikita\Documents\MEGAsync\current_test\Outline Proposed\night_minds\data_velocities_impr.xlsx",
                         index_col=None, header=None,
                         sheet_name="Dynamic velocities_2")
#%%
z_1 = df_new.iloc[1:, 1:len(df_new_1.columns)]
z_2 = df_new.iloc[1:, 1:len(df_new_2.columns)]
#%%
z_1 = z_1.rename(columns={
1: 0.1,
2: 0.15,
3: 0.2,
4: 0.25,
5: 0.3,
6: 0.35,
7: 0.4,
8: 0.45,
9: 0.5,
10: 0.55,
11: 0.6,
12: 0.65,
13: 0.7,
14: 0.75,
15: 0.8,
16: 0.85,
17: 0.9,
18: 0.95,
19: 1.0
})
z_2 = z_2.rename(columns={
1: 0.1,
2: 0.15,
3: 0.2,
4: 0.25,
5: 0.3,
6: 0.35,
7: 0.4,
8: 0.45,
9: 0.5,
10: 0.55,
11: 0.6,
12: 0.65,
13: 0.7,
14: 0.75,
15: 0.8,
16: 0.85,
17: 0.9,
18: 0.95,
19: 1.0
})
#%%
z_1 = z_1.rename({
1: 0.1,
2: 0.15,
3: 0.2,
4: 0.25,
5: 0.3,
6: 0.35,
7: 0.4,
8: 0.45,
9: 0.5,
10: 0.55,
11: 0.6,
12: 0.65,
13: 0.7,
14: 0.75,
15: 0.8,
}, axis="index")
z_2 = z_2.rename({
1: 0.1,
2: 0.15,
3: 0.2,
4: 0.25,
5: 0.3,
6: 0.35,
7: 0.4,
8: 0.45,
9: 0.5,
10: 0.55,
11: 0.6,
12: 0.65,
13: 0.7,
14: 0.75,
15: 0.8,
}, axis="index")
`
`
#%%
sns.set(style = "whitegrid")
f, (ax1, ax2) = plt.subplots(1, 2, figsize = (16, 9), dpi=160)
x_1 = z_1.index
y_1 = z_1.columns
Y_1, X_1 = np.meshgrid(y_1, x_1)
ax1.plot(X_1, Y_1, ".-", label="")
ax1.legend(loc="upper right")
ax1.set(xlabel=r"$d_n$",
ylabel=r"$\rho_{m}^{-} \cdot 10^7$ (Ом$\cdot$м)")
ax1.grid(b=True, which='major', color='#666666', linestyle='-', alpha=0.7)
ax1.minorticks_on()
`
here's the result in python
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
i want to plot every row as separated graph
where x is column name and y is the data in the row
@true elk i want a scatter plot
Do the rules always produce the correct result?
how to select every time each row separated?
for instance i need a first row, I use head(1) but then I need only 2nd row
Hey @jaunty mural!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
that's how i plot the first row of my table
here's the result
but how to plot the next line and next, and ... etc
this command didn't work for next lines
df_new_1.head(1), df_new_1.iloc[1, :]
@jaunty mural you can use df.iterrows, I guess.
when you do df.head(n), you get the first n rows, so it's not a way to select a specific row.
didn't work for me, i have tried
it didn't work in what way? saying that something "didn't work" is quite opaque.
because each row through loop printed in two columns
if you use iterrows, it will give you values from each column, for every row
but i need different for each row all columns (values)
i have managed it!!!!
plt.scatter(df_new.iloc[i].index, df_new.iloc[i, :])
plt.show()
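To spell out the difference between the selectors discussed above, here is a small pandas-only sketch. The frame is a made-up stand-in with the column labels used as x values, and the plotting call is left as a comment.

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=[0.1, 0.15, 0.2],
)

# head(n) returns the first n rows; iloc[i] returns row i as a Series.
second_row = df.iloc[1]

# iterrows yields (index label, row Series); the Series index holds the
# column names, so it can serve as x while the values serve as y.
for label, row in df.iterrows():
    xs, ys = row.index.tolist(), row.tolist()
    print(label, xs, ys)
    # e.g. plt.scatter(xs, ys); plt.show() here for one figure per row
```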
for independence I have some predictors that are correlated, what should I do? cause I can't really remove them if I want to check to see if they affect whether someone has heart disease or not?
Is there any way to get interactive outputs like this one in Jupyter in Spyder?
damn it, i don't understand why the second ax2 is bigger than ax1
< hii >```
Yep, but the result can be better
hey trying to make loop
import skimage.io as io
image1 = io.imread(r"C:\Users\samay\part1.png")
image2 = io.imread(r"C:\Users\samay\part2.png")
images = [image1,image2]
for image in images:
    print(image[image[..., -1] != 0][...,0:-1].mean())
at the moment i have to write out every image
they are all called part1, part2, part3, etc
i want to do it automatically
iterate over the directory
over files in C:\Users\samay\
easiest way would be using list comprehension
images = [read(file) for file in directory if file is pngfile]
@karmic valley
how can the result be better than "correct"? does the problem not have definitive answers?
hey im trying to run this Masked_RCNN for webcam from https://github.com/Cheng-Lin-Li/MachineLearning/tree/master/Competition/ObjectDetectionSegmentation.
but i got an error AlreadyExistsError: Another metric with the same name already exists.
oh i see. will try this!
good luck i wrote pseudocode but you should be able to replace with python functions
do i have to specify the filepath first like stop at this C:\Users\samay\
python standard library has a module called os
it gives functions for iterating over a directory
like os.listdir
ah okay i will try look that up, im still newbie
from this code, which ones shall i delete for now? line 2,3,4,?
and i replace that with file directory
you can replace lines 2-4 with the list comprehension
and it will work for a directory full of as many .png files as you want
put them all in a list
oh so they are 2 separate lines of code. first read file directory then do list comprehension
nope you are reading in the list comprehension
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay") if file.endswith(".png")]
ahh thanks i will add this to my code!
yeah what you wrote seems to make sense logically just didnt know how to do before
yup just try to remember the pattern, it is very useful
import skimage.io as io
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay") if file.endswith(".png")]
#for image in images:
print(image[image[..., -1] != 0][...,0:-1].mean())
do i still need for loop
or shall i put for loop on 2nd line
well images is a list
now you want to do something with each object in the list
so you need to iterate over images
oh so your line of code adds them all to list?
right my line reads every .png file in C:\Users\samay into a list called images
import skimage.io as io
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay") if file.endswith(".png")]
for image in images:
    print(image[image[..., -1] != 0][...,0:-1].mean())
yeah that should work
need to import os too
thank you!
oh yeah
adding stuff to list like this so powerful true! way easier than doing manually
oh nice, will search that up too
though that happens to work out to be the same as dict(zip(list1, list2))
im just trying to teach him that this is a generic pattern
right
to learn more about this what can i search up exactly
generator expressions are related, but generators in general are not.
ah got you
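Useful search terms are "list comprehension", "dict comprehension" and "generator expression". A side-by-side sketch with made-up lists:

```python
names = ["part1", "part2", "part3"]
sizes = [10, 20, 30]

# List comprehension: build a list, optionally filtering with `if`.
big = [n for n, s in zip(names, sizes) if s >= 20]

# Dict comprehension: same pattern with key: value pairs.
by_name = {n: s for n, s in zip(names, sizes)}

# Generator expression: same syntax in parentheses, evaluated lazily.
total = sum(s for s in sizes)

print(big, by_name == dict(zip(names, sizes)), total)
```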
It's not so hard.
If I choose a teacher to give class A, I need to put another teacher on class B. If the first teacher can also give class B and I put another teacher on class A, maybe that's better.
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay\out\test\raw-image\") if file.endswith(".png")]
how come words after the if are in green
did i type wrong?
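The green text comes from the trailing backslash: even in a raw string, a backslash right before the closing quote keeps the string from ending, so everything after it is treated as part of the string. Separately, os.listdir returns bare file names, so io.imread(file) would look in the current working directory instead of the image folder. A sketch that avoids both with pathlib (list_pngs is a hypothetical helper, and the demo runs on a throwaway folder instead of the real one):

```python
import tempfile
from pathlib import Path


def list_pngs(folder):
    """Return the full paths of the .png files in folder, sorted by name."""
    return sorted(Path(folder).glob("*.png"))


# Demo on a temporary folder. For the folder from the chat this would be
# list_pngs("C:/Users/samay/out/test/raw-image"); forward slashes are fine
# on Windows and need no escaping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["part2.png", "part1.png", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_pngs(tmp)]

print(found)
```

Each returned path can be passed straight to io.imread(...) in place of the bare file name.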
I've never used ML before, I think I have a good opportunity to use it in a project. Is 41 nodes in the output layer viable?
viable if necessary, but what are you actually doing? you might need a lot of data for something like that
It's kind of OCR but with only 41 possibilities
I have 30 hours of video to analyse, I can skip some frames (I only need 1 reading per second)
And the number only ranges from -20° to +20°
I thought about doing manual detect on the pixels but the background is changing too much
It's not too hard for the data entry but I'm on a Mac so no CUDA/big GPU
and as it would be my first project, I don't want to get into a problem that might not be solvable with my current setup/knowledge
So guys, i have a doubt about machine learning. I am currently building a model based on a dataset about music popularity. The idea for model is that the company gets a model that can predict the popularity of each song. But now i have a doubt about data preparation that is : Should i delete the musics with low popularity or should i keep it? As shown in the screenshot, the column of music popularity goes from 0 to 100, so should i delete values above 60/70 or are they good to train the model?
what would you recommend in my case?
what are you trying to do? there are a few out of the box solutions that could just detect those numbers if needed
i would highly recommend doing some image processing however first
to increase accuracy
before feeding in frames from the videos
also i am not a Computer Vision guy so i will defer to someone with more expertise 
Huge pre processing was one of my ideas but I'm not sure how to implement it. It's not opaque numbers 😦
As I have a lot of data, and data entry is not hard, I thought going directly to ML would be a better option
I've been stuck on other OCR projects so I can foresee some issues by going with an out of the box OCR solution. My best chance would be EasyOCR with a custom model, which kinda leads to the same situation I'm in right now
If all images look similar to that, you could at least make a mask for the blue color of the digits
how to decide the output size for a convolutional 2d net?
Do you mind if I @ you in a help channel so you can explain your suggestion better?
It's not that complicated, a mask is just checking for each pixel if it falls in a certain color range
and then setting that pixel to 1, if it is in that range, otherwise 0
So you'll get an image the size of your original image with only 1's and 0's
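The mask described above can be sketched with NumPy. This is only a rough illustration: the helper name `color_mask` and the blue-ish RGB bounds are made up for the example, not values measured from the actual game footage.

```python
import numpy as np


def color_mask(image, lower, upper):
    """Return a 0/1 mask that is 1 where every channel lies within [lower, upper]."""
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    # A pixel passes only if all of its channels fall inside the range.
    in_range = np.all((image >= lower) & (image <= upper), axis=-1)
    return in_range.astype(np.uint8)


# Tiny 2x2 RGB image: one blue-ish pixel and three that should be rejected.
image = np.array(
    [[[10, 20, 200], [250, 250, 250]],
     [[0, 0, 0], [30, 40, 90]]],
    dtype=np.uint8,
)

# Hypothetical bounds for "blue": low red/green, high blue.
mask = color_mask(image, lower=(0, 0, 150), upper=(80, 80, 255))
print(mask)
```

The result has the same height and width as the input, with only 1's and 0's. Doing the same thing in HSV instead of RGB usually makes the range easier to pick, since the hue is then a single channel.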
Like a threshold with upper and lower boundaries? Better to do it in the RGB mode?
no, image processing =/= pre processing
using matlab image functions or opencv for image processing
sorry, I'm not sure I understood what you meant
If you make the color range around the blue color of your display, then you will at least have filtered out the important stuff (hopefully)
thats a good initial approach
And then manually doing the detection of the 7 segment digits on some specific pixel location?
I think I've done something with 7 seg in AoC this year 😄
Not really sure what would be the best way of classifying it
Just making a classifier with an output for each possible outcome seems naive since a lot of outcomes will be very similar (like -19 and 19 f.e.)
and maybe just discard the frames where there is too much blue
Depends if you want your model to classify images with a lot of blue too
damn forgot about that.. Thank god I asked here 😄
If you remove all the blue-ish images, your model will surely perform badly on those
nah, I have a lot of frames, I just need data points each 30 or 60 frames
the only thing is that I need high confidence
I get that you have a lot of training data, but if you want it to work for those blue-ish images, you want to use those for training too
Removing "difficult to classify images" is a bad idea is what I'm trying to say
Unless you know your real data is not gonna contain any of those
help
import skimage.io as io
import os
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay\out\test\raw-image\") if file.endswith(".png")]
for image in images:
    print(image[image[..., -1] != 0][...,0:-1].mean())
41 distinct classes is more manageable than 41 continuous outputs. seems reasonable, start with MNIST practice and go up from there to your real data
i getting error
File "C:\Users\samay\Dropbox\Average pixel colour.py", line 5
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay\out\test\raw-image\") if file.endswith(".png")]
^
SyntaxError: EOL while scanning string literal
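A likely cause of that SyntaxError: even in a raw string, a backslash right before the closing quote escapes the quote, so the string never ends. A minimal sketch of one way around it follows, assuming the folder path is written without the trailing backslash. The helper name `list_png_paths` is made up for the example. It also joins the folder back onto each name, because `os.listdir` returns bare file names rather than full paths.

```python
import os
import tempfile


def list_png_paths(folder):
    """Return sorted full paths of the .png files directly inside folder."""
    return sorted(
        os.path.join(folder, name)  # listdir gives bare names, so re-attach the folder
        for name in os.listdir(folder)
        if name.lower().endswith(".png")
    )


# On the real machine the folder would be written without the final backslash:
# folder = r"C:\Users\samay\out\test\raw-image"
# Demo on a throwaway folder so the sketch runs anywhere.
demo_folder = tempfile.mkdtemp()
for name in ("a.png", "b.png", "notes.txt"):
    open(os.path.join(demo_folder, name), "w").close()

png_paths = list_png_paths(demo_folder)
print([os.path.basename(p) for p in png_paths])
```

With skimage installed, the images could then be read with `images = [io.imread(p) for p in list_png_paths(folder)]`.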
so just to check if I understood correctly: doing blue mask on data + doing ML anyway (and keeping difficult images)?
the blue mask is to make the images easier for whatever model you are planning to use
Since we already know the display is going to be blue
I actually have Kaggle tutorial on MNIST opened in my browser right now 😄
Your model won't have to learn that
i think im so close
but missing some syntax
cant figure out
Thanks for the help guys! Let's try to code this 😄
hey anyone know what wrong with this:
import skimage.io as io
import os
images = [io.imread(file) for file in os.listdir(r"C:\Users\samay\out\test\raw-image\") if file.endswith(".png")]
for image in images:
    print(image[image[..., -1] != 0][...,0:-1].mean())
You already asked
twice in here and opened a help channel, try waiting for a reply pls
how can a residual plot show both a normal distribution and constant variance (homoscedasticity) in the same graph?
(wrt Ordinary Least Squares assumptions)
What is the type of distance if I use p = 1.5?
I know p=1 is manhattan and p=2 is euclidean, but what is the distance if p=1.5?
context?
instead of either Manhattan or Euclidean distance, youll get something in between. gimme a sec
p=1
p=2
now imagine a curve connecting the green and red blocks
but right between the previous two trajectories. and thats how you can visualize p=1.5
dont think it has a specific name. i think most just refer to it as a minkowski distance but with p=1.5
For a more mathematical/theoretical take, you should look into Lp spaces. Here is a good article on them: https://en.wikipedia.org/wiki/Lp_space?wprov=sfti1
In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue (Dunford & Schwartz 1958, III.3), although according to the Bourbaki group (Bourbaki 1987) they were first introduced by Frigyes Riesz (...
It means manhattan and euclidean are included as Minkowski distance?
every vector from the origin to the unit circle has a length of one, the length being calculated with length-formula of the corresponding p
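To make the "something in between" concrete, here is a small sketch of the Minkowski distance in NumPy. The function name `minkowski_distance` and the two example points are just for illustration.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|**p) ** (1/p)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


a, b = (0, 0), (3, 4)
d_manhattan = minkowski_distance(a, b, 1)    # p = 1
d_between = minkowski_distance(a, b, 1.5)    # p = 1.5
d_euclidean = minkowski_distance(a, b, 2)    # p = 2
print(d_manhattan, d_between, d_euclidean)
```

For these two points the p=1.5 value lands between the Euclidean and Manhattan distances, so both are special cases of the same formula.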
ok thank you for the explanation
hi channel
If i have a question regarding file names in a directory, where should I go ?
a general help channel. see #❓|how-to-get-help for instructions for how to get one.
I am currently in a course for data science, that is going to take 3.5 months.
As it is of utmost importance for me to journey with a high learning curve, it is of essence to communicate.
What literature would be very convenient to go through?
I want to get very competent, as fast as I can. It is important to be playful and accumulate experience, so to program a lot on basics and then on different projects that are challenging regarding a solid structure and a diverse set of functions and styles to write code, is a no brainer.
I have a need to be efficient in learning and practice.
Would very much appreciate every constructive advice and suggestion.
To get from zero -> A.I. Developer
Also working on a degree in physics at the same time.
Well if you are at the completely zero stage right now as I presume I would start with Automate the Boring Stuff the book to kind of take you through the beginning. Lots of practice projects good information and free PDFs online are available. I don't know much past that but it is always a good base.
Thank you for your feedback, I am already on that, but I do not have the time to go about it the average way, on the way I will sort out by myself what will be of importance and what not, but someone with loads of experience can give me some more insights on what to focus more and on what to focus less.
you want to learn how AI works, or data science in general or?
I know some basic type of data cleaning we do about anomalous data....but when its comes to data such as face motion, heart rate pulse, etc etc., what kind of cleaning is done...i mean what do we do
And there aren't clear shortcuts, if you want to do data science in Python, you need to know python
also the more basic stuff like data structures/functions/classes etc.
Yeah I am doing a python - Data science course -> MYSQL / NOSQL -> AI / ML
But the average way is not working by itself, I don't have the time to idle through this. I need to push it.
So I just do everything that is of essence for everyone but with a faster pace? So I need to adjust my learning speed, my reading speed and processing speed like in every other discipline, I guess.
I think the expectation might be a bit too high
How much experience do you have with python?
or coding in general?
What is your background in mathematics? Specifically, Linear Algebra, Probability Theory (and therefore statistics), and Optimization theory?
I am almost done with my physics degree.
You should be good on the theory then.
I found this link that might be helpful: https://realpython.com/learning-paths/math-data-science/
Either way, to get your desired speed, you’ll need to read and write a lot of code
That’s the main way to learn
Especially at your pace
Thank you very much @river sierra
is there anyone here good with tensorflow?
I would like to ask a few questions to see if what I want to do is viable
still, that page makes an excellent point.
Let me rephrase.. is there anyone I can DM to ask questions
still, same problem. Could you give a general idea of what you want here? I have some general idea of machine learning, I dabbled in this stuff a while ago, but I might have some idea of what might be possible. However, if you're asking about specifics, I've got absolutely no clue.
if you think it would contravene the rules of this server, you'd be better off not asking at all
sure. i have a 2-d histogram that has two "tails" of data. i would like to build a model that can essentially distinguish between the two with a level of confidence.
Like this?
sorry for the crappy ms paint picture, art isn't exactly my forte
here you go
so you want to distinguish between the upper and lower tails?
What would stop you from using traditional programming, what necessitates machine learning?
I can and have already. I would like to compare the two and for practice
do you have thousands of different input datasets with two tails? Is that feasible?
I really don't know how that sort of thing would be done, it's not like the sort of optimization problem that machine learning is really good at
It's been done before using svm
What would be the optimal machine learning strategy to detect and remove specific sounds from an input audio source?
im going to throw a wild guess and say fourier transform
I looked into that, and my intended use case, removing laugh tracks, wouldn't work with the fourier transform because the frequencies of a laugh track are so similar to the frequencies of everyday speech
it sounds like you will have enough on your plate between the physics degree and this course. i would suggest not fixating on trying to achieve mastery over something complicated in a short period of time, and instead focusing on what you are learning in the course, and attempting to internalize and apply it as much as possible.
a good foundation in the basics is more valuable than a scattered sample of a lot of advanced things
I need to practice predictive modeling with linear regression
How can I do this
What are good resources
start with the "ames housing" dataset, the more modern replacement for the old "boston housing" dataset. it's on kaggle as the "house prices" competition
it's like the titanic dataset, but for regression instead of classification
Hmmm okay
Now the actual linear regression has a built in
And the predictive modeling built in as well
i don't know what you mean by that, sorry
in order to do predictive modeling sklearn has built in functions for this right?
scikit-learn implements a large number of machine learning algorithms. you must decide if any of those algorithms are useful to your work.
never said anything about mastery in a short time, but to go the average way or the way of mastery is about being in two different worlds
and yes, it is important to focus on what I am confronted with, that's what I already do
I got myself a few books now and working these through as well, it will solidify what I already know and give me new perspectives. Thanks for your feedback!
how to calculate the output channels for conv2d layer, is it a random value or is there anyway to get the number for it?
yes. ditto what salt said
i would say after you understand the tooling, try and use it on a new and different dataset
can anyone explain to me what this type of error means: "ERROR! Session/line number was not unique in database. History logging moved to new session 12088"
I got it when I use bayesian search for tuning the hyperparameter
Ah if you ask the answer I think is it depends on your interests and your skills
you can try kaggle comps, datasets for example
Implement a Tsetlin Machine.
Does data such as temperature readings,and ECG need filtering?
What kind of filtering do these need
Hello, I need your help with pytorch.DataLoader. I want it to sample images from my custom Coco_Dataset_Manager, a batch of 3 images per iteration. Images in Coco have different sizes. Let's say DataLoader sampled 3 images with sizes (100, 100), (100, 100), (400, 400). I want to force it to cache these images in a "waiting room" and sample some more. When a waiting room of a particular size has 3 or more images, I want DataLoader to put the batch through collate_fn and return it.
How can I program such behaviour?
I'm thinking about writing a custom Sampler.
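One possible shape for that sampler is sketched below in plain Python. It assumes the image sizes are known up front (for example from the width/height fields in the COCO annotations). The class name `SizeBucketBatchSampler` and its parameters are invented for this sketch.

```python
import random
from collections import defaultdict


class SizeBucketBatchSampler:
    """Yield lists of dataset indices whose images all share the same size."""

    def __init__(self, sizes, batch_size, shuffle=True, seed=None):
        self.sizes = list(sizes)      # sizes[i] is the (height, width) of image i
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(len(self.sizes)))
        if self.shuffle:
            self.rng.shuffle(order)
        waiting = defaultdict(list)   # one "waiting room" per image size
        for index in order:
            room = waiting[self.sizes[index]]
            room.append(index)
            if len(room) == self.batch_size:
                yield list(room)      # the room is full: release it as a batch
                room.clear()
        # Images left in a room that never filled up are dropped for this epoch.


sizes = [(100, 100), (100, 100), (400, 400), (100, 100),
         (400, 400), (400, 400), (100, 100)]
batches = list(SizeBucketBatchSampler(sizes, batch_size=3, shuffle=False))
print(batches)
```

An object like this could be handed to `torch.utils.data.DataLoader` through its `batch_sampler` argument, together with your own `collate_fn`. In that case `batch_size`, `shuffle`, `sampler` and `drop_last` have to be left at their defaults.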
Where are you taking the samples from? are you loading the images from a folder, or...
Trying to understand how much wiggle room you have
So guys, i have a doubt about machine learning. I am currently building a model based on a dataset about music popularity. The idea for model is that the company gets a model that can predict the popularity of each song. But now i have a doubt about data preparation that is : Should i delete the musics with low popularity or should i keep it? As shown in the screenshot, the column of music popularity goes from 0 to 100, so should i delete values above 60/70 or are they good to train the model?
how can i implement Keras Tuner into my code? https://nbviewer.org/urls/bpa.st/raw/A6JA
I don't see a screenshot
Also, a high value means unpopular?
Either way, I don't see why you would remove it
You said here "delete [...] with low popularity" and then "delete values above 60/70", so I'm confused
below 60/70* sorry
yeah I'm not sure why that's necessary
If anything you want a good representation of the sample space
I was in a paradox where i was thinking " Well, my target is predicting songs with high popularity, but should i delete low popularity? However, with low popularity, the model will also understand what are bad musics"
yeah, basically your answer is that last sentence
Negative examples are important
If it only sees popular songs, then nothing stops it from learning to say that everything is popular
True, thanks mate! Thank you for your time 🙂
np
anyone?
Hi everyone, I am currently solving the Titanic Disaster Problem on Kaggle.
I analysed the data, and then built a machine learning model and I got a score of around 0.78 (78% accuracy)
I tried to improve my model (around 30times by now) for better accuracy and no hope. I tried almost every possible technique to improve my score (using machine learning) and no hope. Can you suggest me any proven technique I could’ve missed? (please don’t send full code to the problem I want to try on my own).
Many thanks in advance ☺️
Hey guys, I'm back 😄
Do you think this image processing will be enough for classification? I'll try the MNIST dataset first to get some understanding on how to train a model
I've tried already a lot of things, I think HSV mask was my best option yet
From instances_train2017.json, fetching them through img_url
you're downloading the images on the fly each time you're training?
Do you think I can get out of it with this data?
I need to classify those into -20 to 20, so 41 nodes output
I only need high confidence data, I can discard all the other ones
Yesterday, PC Camel suggested me to focus on processing first, then going for MNIST example for ML
Is CNN the right path for this task?
😦
Anyone have any suggestions for storing different versions of multiple models?
I have finished the basics of Python and I've decided that I want to learn Data Science, where can I start from? Can someone give me a good source that's free?
while waiting for an answer, I'm really amused by the amount of people requesting help for assignments or bad purposes 
A lot of people want easy/fast solutions after all lol
I'm planning to start a career as a Machine Learning Researcher.
I've already gotten the math side (Linear Algebra, Calculus, Statistics and Probabilities) somewhat covered. And currently learning TensorFlow/Keras and planning to study Pytorch after.
Tool-wise, is there anything else I need to make sure I know?
i think it is the right task, but you have a lot of noise. if you can do pre-processing to increase the signal-to-noise ratio that would help. and you will have to accept that some instances are not going to give usable results, e.g. the blank ones
it looks like maybe 1/3 to 1/2 of these are unusable
i think a "good" model should give ~0 confidence on all categories for those unusable ones
that's going to be the hard part imo, not obtaining false positives on the junk images
what is this task, anyway? is this the thermometer in your car dashboard? some industrial process that is hooked up to an LCD but not a usable computer?
it's to detect the range of temperature in a game (NGL Biathlon)
Don't ask why 😄
Kidding
is going for a numpy as my ML machine a good idea in this context?
I have the information that "Cold" should range from "-10 to -15" but I'd like to see the bell curve to check the repartition (and for all the other temperatures)
Any ideas for better pre-processing? The temperature is kind of transparent and blue, so on a blue sky even a human can't read the temperature
it seems like it although it might be better to pretrain a model on something like mnist and then maybe train the pretrained model on the data you have
or use a already pretrained model
numpy is not really a machine learning framework. it is just a library for linear algebra. you will want to use a higher-level framework like pytorch that can take care of all the complicated mathematical optimization stuff for you
you can try mnist pre-training. you'll want to convert rgb to b&w first though
depending on your time and resources, you could also take like 1000 pictures of "clean" output, artificially add noise, and use that as pre-training
if this were an industrial process maybe i'd suggest doing that, idk if it's worth it for a video game
i had no idea there was a biathlon video game..
yeah I thought of doing just manual pixel detection as this is 7 segment digits
but also a good first project for ML
but maybe not the optimal first project 
no i think it's a great project
it's a relatively straightforward task but with lots of noise and a small amount of data
i personally recommend tensorflow keras more for beginners
I thought of going on Numpy directly to get my hands dirty (and understanding better what I'm doing), and also because all the MNIST video/tutorial I found just import it easily. I can't really do that with my current data
fair enough
i'd try it both ways tbh. pre-train on mnist, and not-pretraining on mnist. i am skeptical that handwritten numbers will be good pre-training for 7-segment lcd
i wouldn't suggest trying to implement a neural network from scratch in numpy, that would be a good exercise if you wanted to get into ml engineering or numerical computing. but it would be a distraction in this project i think
but it is worth to try tensorflow and pytorch i personally really want to learn pytorch but dont have the time yet
From my understanding, pre-training wouldn't be beneficial in this case. The numbers are really in a fixed position!
the reason you want pre-training is that you have a very small dataset here
even smaller when you consider the number of "usable" items
that is the entire dataset?
I have 30 hours of footage at 30 fps, it's not enough?
oh
that's a lot
and they're all labeled? that is, you know the number for every frame?
well you might want to use semi supervised learning if they aren't labeled but if they are labeled use supervised learning
This is where I'm at right now
I can do manual labelling to some extent
are all numbers always the same at the same location?
it's for learning purposes
exact same location!
as a side note, this game looks too realistic, my heart rate is going up just watching gameplay footage 😆
That's why the fixed position pixel detection was my first idea
but will a 2 always look the same or will it slightly change
(+ I've done some 7 segment logic in AoC this year 😄 )
so basically the same font
i assume it's the temperature in the top right corner here? https://www.youtube.com/watch?v=ZlJ6VPN8G0o
In this video we race through every biathlon event of the 2022 Olympics as Alexander Loginov in the game NGL Biathlon! Will we manage to win a medal?!
Download the game 1 - http://boosty.to/ngl_biathlon
Download the game 2 - http://patreon.com/biathlon
VKontakte group - http://vk.com/ngl_biathlon
Vasya's Instagram - https://www.instagram.com/vasya_ngl/
The game's Instagram -...
yep
it looks like it's fixed position in a HUD, with varying backgrounds
gotta love open source gaming!
Please don't suggest training Tesseract, I literally had nightmares with it. EasyOCR saved my life
in that case you may not want to use ml
i think they just want to do this as a toy project to learn
oh in that case use ml or multiple solutions
i actually love this idea. easy enough to download gameplay footage and DIY it too
maybe i'll do it too 😛
i need hands-on practice w/ image deep learning i think
what's the state of the art for semi-supervised learning? i looked into it several years ago and it seemed like it was kind of a dead end
active learning might be a better choice perhaps, i've used that successfully for record linkage / deduplication projects
if you want to pair-programming on this one, I'd love some help and peer review !
i'm trying to get some by downloading like 1600 images from pexels so all images are high res, then taking a 256x256 tile at a random point of each image and using that to train a super resolution model. i still need to make the discriminator and i'm planning on doing that today (i'm not expecting great results, it only has like 50k to 300k params, depending on whether i use conv2d or depthwise separable convolution)
What kind of model would you recommend in my case? I've been suggested both CNN and CRNN
i probably won't have time but i'd be interested to see your progress on this
how often does the temperature change like how many frames are between there and do you want the challenge of a multi frame solution because that will be way harder
sounds fun
i know nothing about ML on videos, i'm curious to know what the multi-frame solution is
is it a 3D CNN over fixed-length durations of video? like 5 frames at a time
maybe a ConvLSTM2d()
That's the point of my project, check the time at the change of temperature and count seconds. The temperature changes are more likely to be each 3-10 seconds but I'd like to have data each second
so I need at least 1 good frame each 30
and it's okay if I can get 1 out of 100ish
I only need high confidence on the 1
Damn the more I think, the more I realise that image comparison would be much easier 😦
if you want extra challenge you might want to make a solution that considers multiple frames but that would be very hard
although you are also maybe able to average out like 3 frames and use that to get more clear data but that can result in that you are 2 frames off
someone told me yesterday to not discard bad frames, but he didn't know about the whole context, so I'm a bit confused rn
no i mean that you have like 30 frames each sec so you could make 30 predictions on each sec but what if you make it 28 groups of 3
and you use the average of each one
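The "28 groups of 3" idea is a sliding window over the 30 frames of one second. A rough NumPy sketch, with random arrays standing in for real grayscale frames and a made-up helper name `sliding_mean`:

```python
import numpy as np


def sliding_mean(frames, window=3):
    """Average every run of `window` consecutive frames (sliding by one frame)."""
    frames = np.asarray(frames, dtype=float)
    n_windows = frames.shape[0] - window + 1
    return np.stack(
        [frames[i:i + window].mean(axis=0) for i in range(n_windows)]
    )


# Stand-in for one second of footage: 30 small grayscale frames.
rng = np.random.default_rng(0)
frames = rng.random((30, 8, 8))

averaged = sliding_mean(frames, window=3)
print(averaged.shape)  # (28, 8, 8)
```

Because the HUD digits stay in place while the background moves, averaging should blur the background more than the digits.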
damn this would be great for my image comparison solution! Get 30 frames of each second and doing kind of overlay/masking thing
like groups of 30
i would just use averages because the shape and color doesn't really change
but that way you get 3 values for each frame so if the output is 2,2,5 you can say it is most likely a 2
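Picking "most likely a 2" out of 2,2,5 is a majority vote. A tiny sketch with the standard library (the helper name `majority_vote` is just for the example):

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction and the share of votes it received."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


label, share = majority_vote([2, 2, 5])
print(label, share)
```

The vote share can double as a crude confidence score, so a second of footage could be discarded when no label wins clearly.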
that would dilute my error rate
sooo.. where should I start 😄 ?
using my current processing on images, labelling a few, then throw it into keras/pytorch and pray for the best?
i would say spend a bit more time on processing
tbh I'm out of ideas on how to improve it 😦
also are the numbers slightly transparent? if not you can use only that single color of the letters
yep.. they are
could you send like 2 images to me that aren't pre-processed 1 that gives bad results with your current method and 1 that gives good results
and with to me i mean in this chat
only way i can think of is to subtract the pixel value from the bottom right most pixel and then increase the brightness a lot
I'm trying to np.mean some group of images, seem promising for now.
hey there! Can some1 help me with nvidia DeepStream (docker)?
subtracting the most common pixel could be a way
although i would have to say it is impossible to get the second image because there is nothing there
although a function like ```py
def try_to_get_number(image):
    # subtract the image's mean value as a rough estimate of the background level
    background_level = image.mean()
    return image - background_level
```
i wouldn't change your current way
i would also argue that some images are actually "unknown" and that a good model should indicate this
should I consider "unknown" as an output node or as a low confidence on all nodes?
no as like a separate category that the model can classify
any mlops guys working on foundry (palantir)? .. looking for some best practices on workflow within that environment, ml pipeline in code repository specifically. how to keep code modular given the available toolset and proven libraries to provide reliable results while utilizing spark scalability and not wasting resources
can somebody tell me why there is no 1/2m in the cost function of L2 regularization whereas it is present in linear regression?
for i in range(5):
    print('I am going to fail the AP test')
Can anyone give me a hint about a performant search algorithm? The search is to find a word
kinda sus 
Hello, please don't post unapproved advertising. Thanks
@jaunty belfry it just a constant 1/2 * 1/m
the 1/m is used for averaging
the 1/2 was put there to cancel the square power when we take the derivative
this does not actually change or affect the derivative
it just scaling
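A quick numerical check of the "it's just scaling" point: the best slope for a toy one-parameter linear model is the same whether or not the sum of squared errors is multiplied by 1/(2m). The data points and the grid of candidate slopes are made up for the demo.

```python
import numpy as np

# Toy data for the model y ≈ w * x (values invented for the demo).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m = len(x)

candidates = np.linspace(0.0, 4.0, 401)  # candidate slopes w

# Plain sum of squared errors for every candidate slope.
sum_sq = np.array([np.sum((w * x - y) ** 2) for w in candidates])
# The same cost with the 1/(2m) factor in front.
scaled = sum_sq / (2 * m)

best_plain = candidates[np.argmin(sum_sq)]
best_scaled = candidates[np.argmin(scaled)]
print(best_plain, best_scaled)
```

The factor only changes the size of the cost and of its gradient, which matters for how a learning rate or regularization strength is tuned, not for where the minimum is.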
how can i implement Keras Tuner into my code? https://nbviewer.org/urls/bpa.st/raw/A6JA
Hey Everyone! Happy to be here been doing data science with python for about a year now, and now wanting to use Django to create an API for my website
Some questions, I will be presenting stats on a website
- Should I be updating the stats on the backend directly through the db or using post
- MySQL as the backend of the stats website, or Postgres, or another DB
- I will be doing historical analysis, can I do this through django views or should I do the analysis before and just present the already updated information
where did you see this? dividing by a constant doesn't change the argmin so it shouldn't really matter anyway
postgres has the most features and is the easiest to administer imo. mysql i think has better scaling functionality but you don't need that.
Should I be updating the stats on the backend directly through the db or using post
you can write directly to the database, but if you go through your own api endpoints then you maybe have a "safer" interface, with fewer ways to make mistakes, but then you have to deal with authentication for a privileged user that has the ability to write to the db, which maybe is more complexity than you want in your website
I will be doing historical analysis
what is historical analysis?
Say I have a list like [London, Paris, Chicago], is it better to index these in my data like [0, 1, 2] or to turn them into bools and have a new column for each? so like is_London, is_Paris, is_Chicago?
Or does this not have any effect?
neither? both? provide more context
are you talking about a series where each value is a list?
there's nothing inherently better about integers compared to strings. if anything, strings prevent you from making the mistake of putting your categorical-valued integers into a model, which will treat them incorrectly as continuous values
I was looking back at some previous work, and some of the learning material I was provided and it said to change the categorical values in to indexes like that, and in other examples they were doing it the other way using pd.get_dummies(). So I was just unsure if one of the ways was better than the other
i dont think changing to numerical indexes does anything for you, unless you're using a specific model that treats integers as categoricals (maybe some random forest implementations do this)
ideally you'll use pd.Categorical for categorical data
Thanks for getting back to me! So essentially I will be keeping track of nft collections prices and volumes and want to be able to compare the change in the past lets say 7 days
oh, i see. doing the queries and computations in the view will be the easiest from a software design perspective. it's probably the least efficient, but for a hobby project that's fine
This will be for a production website, Can I add you separately and can give you some SOL to consult me by chance?
no, sorry
i'd rather not give 1:1 private help
you might want to post these questions in #web-development or #databases , it sounds like the data science aspect of your project is unrelated to these questions
Ok any help on how I can accomplish the loading of the data to the DB in the correct way, I am currently using mysql.connector but worried about slugs and timestamps then how I can do the analysis of historical data on the backend? Thank you so much
One of the models I was testing on was actually random forest hahaha. Sounds good either way though, thanks for the help 👍
yeah, whether to one-hot encode categorical variables in a random forest is a topic of debate. imo you shouldn't do it, leave them as categoricals
if your implementation requires "numbers", use LabelEncoder, otherwise leave them as-is
keep in mind that pd.Categorical is backed by an integer array anyway
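For comparison, here is a small pandas sketch of the two encodings discussed above, using the city example. The column name `city` and the `is` prefix are just illustrative choices.

```python
import pandas as pd

cities = pd.Series(["London", "Paris", "Chicago", "London"], name="city")

# Option 1: keep it categorical; pandas stores integer codes behind the scenes.
as_category = cities.astype("category")
codes = as_category.cat.codes          # one integer per row, indexing the sorted categories

# Option 2: one-hot encode into one indicator column per city.
one_hot = pd.get_dummies(cities, prefix="is")

print(list(as_category.cat.categories))
print(codes.tolist())
print(list(one_hot.columns))
```

The integer codes are only labels for the categories. Feeding them to a model as an ordinary numeric column would imply an ordering between cities that does not exist.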
Oh for real? Damn
But yeah, I had done it because it was in one of the lectures
It also said to try change values like "yes, no" or "Risk, NoRisk" to binary
I’m going to be honest
Idk how you guys do it
I feel like my head is going to burst trying to figure all this stuff out
how do you print the number of items from a column from an excel file? while also using np.unique
because the names on that column are repeated , so i need the number of names without it being repeated?
if you have a pandas dataframe it should be something like len(np.unique(df.names))
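A small sketch of that count, using a made-up dataframe in place of the Excel file. With a real spreadsheet the dataframe would typically come from `pd.read_excel(...)`, and the column name `names` is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for the spreadsheet column with repeated names.
df = pd.DataFrame({"names": ["Ana", "Bo", "Ana", "Cy", "Bo"]})

n_with_numpy = len(np.unique(df["names"]))   # the np.unique route from the question
n_with_pandas = df["names"].nunique()        # pandas shortcut for the same count

print(n_with_numpy, n_with_pandas)
```

Both give the number of distinct names. `nunique()` ignores missing values by default, which can matter if the column has empty cells.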
I have CSV data with info on patients having MRI scans in various regions, plus population data. I want a predictive model that uses the population data to predict supply and demand, so that the occurrence of MRI scans in a region can be predicted.
Can anyone help with approaching this problem, or suggest some material/work to look at that has a similar format and problem?
how can i implement Keras Tuner into my code? https://nbviewer.org/urls/bpa.st/raw/A6JA
does somebody know what this type of chart is called?
i want to implement it in a game which works in rounds
isn't that a line graph with just its axis moved right and top
i realised it is called a bump chart
Hi all, I would like to ask a question on how filters are activated
my resource is mainly deeplizard from YT for cnn
for a filter to be able to detect patterns in an image, they mentioned that the loss function between the input image and the output channel must be maximised
which is done with gradient ascent
I am guessing this is for training a cnn, would that be right?
my question is
if they are maximising the loss function for a filter
to be able to detect a feature from the image
won't it be easy to just have the filter (3x3 for this example) to just keep increasing the element values of the filter
like say that I have a filter, and it moves over a completely white image,
won't I be able to produce a filter that has a million for each value in the filter
and that would generate a very high value
and then when that filter moves over another part of the white image that may have a black spot, that black spot would be negligible
.
it was easy to understand back-propagation when we were trying to minimise the loss function, which brings the accuracy up
but I cant get over why the filter needs to have its loss function maximised, on top of that, maximising the loss function should be something that doesn't have an end right?
another question: is it that we maximise the convolutional layer's loss function, and minimise the rest of the loss function of the cnn in two separate training events?
Man, google colab is high
I have this code
```py
def plot_pred(train_data=X_train,
              train_label=y_train,
              test_data=X_test,
              test_label=y_test,
              prediction=y_pred):
    plt.figure(figsize=(10, 7))
    plt.scatter(train_data, train_label, c='b', label="Training Data")
    plt.scatter(test_data, test_label, c='g', label="Testing Data")
    plt.scatter(test_data, prediction, c='r', label="Prediction Data")
    plt.legend()

plot_pred()
```
Which gives me an error on the ```plot_pred()``` saying that X and y need to be the same value.
But then I comment the prediction code, run it again, uncomment it, run it yet again and it works fine with no errors.
are you using global variables as default values in your function? @warm oracle
That's probably messing stuff up
Also not passing any values in your function call
Yea. I have them set that way as the only change would be the y_pred between models. So I can just go plot_pred(prediction=y_pred_1) or whichever it is lol
You shouldn't use global variables in a function to begin with
let alone use them as default values
Pass them as arguments
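One possible explanation for the "works after re-running" behaviour, sketched with toy lists rather than the real tensors: Python evaluates default argument values once, when the `def` statement runs, so a default taken from a global keeps pointing at the old object until the cell containing the `def` is run again. The function name `show_default` is made up for the example.

```python
y_pred = [1, 2, 3]

def show_default(prediction=y_pred):
    # The default is whatever y_pred referred to when `def` ran
    return prediction

y_pred = [1, 2, 3, 4]                     # rebinding the global later ...
stale = show_default()                    # ... does not change the stored default
fresh = show_default(prediction=y_pred)   # passing it explicitly does

print(len(stale), len(fresh))  # 3 4
```

In a notebook this means a plot can silently use an old `y_pred` of a different length until the function definition itself is re-executed.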
Ah I see. Thanks.
" saying that X and y need to be the same value." It would also help to just give the error traceback btw
I assume it actually said that they need to be the same length
Sorry, I had it fixed so didn't have the traceback to post.
lemme see if I can replicate the error
"ValueError: x and y must be the same size"
So the length of both arrays/lists differ
Not really, as I set both X and y to be the same size, so I can understand it better before going into an actual dataset. As I'm still new to TensorFlow.
```py
X = tf.range(-100, 100, 4)
y = X + 10
X_train = X[:40]
y_train = y[:40]
X_test = X[:10]
y_test = y[:10]
```
Well they do, otherwise you don't get an exception
If you check the error traceback you already know which line causes it
Yea. Which, like I said, got fixed by commenting one line on and off again.
Alright, well if it's fixed I don't know what you want
If it is still a problem, you aren't giving close to enough information to let us help you, otherwise I don't know why you are telling us this :/
Was just an observation. Since I said it got fixed in my initial comment.
Guess those aren't allowed lol
Yeah but commenting and uncommenting code doesn't fix anything, that makes no sense
It might be that you reran some code, that made some values change or something
Whatever it was, it made it so the length of two of those arrays/lists weren't the same size
i have a 2d array, something like:
[[0,2,3],
[1,0,3],
[2,1,0],
[3,1,2]]
and i have a pandas dataframe something like:
id val
0 a 9
1 b 8
2 c 3
3 d 7
now i want to get a new dataframe based on the indexes listed in the 2d array, for this example, it will be like:
id val id_2 val_2
0 a 9 a 9
0 a 9 c 3
0 a 9 d 7
1 b 8 b 8
1 b 8 a 9
1 b 8 d 7
2 c 3 c 3
2 c 3 b 8
2 c 3 a 9
3 d 7 d 7
3 d 7 b 8
3 d 7 c 3
That's why I was confused.
As I only reran the two cells that have the function defined (after commenting and uncommenting a line), and the one with the function call.
But don't worry about it lol.
Sorry for taking your time. And thanks again for the advice.
yeah wasn't meant to sound so angry, just a bit confused
sorry if it came off as aggressive 😛
Don't worry. I'd react the same if someone told me what I just said lol.
can you help me out with my problem?
I am not that experienced with pandas srr
ah aighty
can you please explain this, uhhh, better? edit: nvm
might be a better way to do it but I got this working
import pandas as pd
import numpy as np
index = [[0,2,3],
[1,0,3],
[2,1,0],
[3,1,2]]
idse = ['a','b','c','d']
vals = [9,8,3,7]
data = {'id': idse, 'val': vals}
df = pd.DataFrame(data=data)
newdf = pd.DataFrame(np.repeat(df.values, len(index[0]), axis=0))
flat_list = [item for sublist in index for item in sublist]
newdf['id_2'] = df.id[flat_list].values
newdf['val_2'] = df.val[flat_list].values
produces
I did use np.repeat so the length of each sublist in the index would need to remain constant or it'd break
no worries
yessss each sublist has equal length
hey so, my original dataframe has around 40 columns along with the id column, is there any way to add the columns all together instead of individually?
are you trying to repeat those columns like with id and val or index them like with id_2 and val_2
hard to understand
multiple columns of values
each id will have its own set of values
we are repeating ids
index like we did with id_2 and val_2
for each of the 40 cols?
I can't really tell what you need without an example
just like here, i have only one column of val, in the original dataframe i have 40 columns of val
let me think
not sure if this is what you need but try this modification
# newdf['id_2'] = df.id[flat_list].values
# newdf['val_2'] = df.val[flat_list].values
cols = df.columns
for col in cols:
    newdf[col] = (df[col])[flat_list].values
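An alternative sketch that handles every column in one go instead of looping. It assumes, as in the example above, that each sublist of `index` has the same length and holds row positions into `df`; the `_2` suffix is just the naming used earlier in the thread.

```python
import numpy as np
import pandas as pd

index = [[0, 2, 3], [1, 0, 3], [2, 1, 0], [3, 1, 2]]
df = pd.DataFrame({"id": list("abcd"), "val": [9, 8, 3, 7]})

# Row positions for the right-hand side, flattened row by row
flat = np.asarray(index).ravel()

# Left side: every original row repeated once per entry in its sublist
left = df.iloc[np.repeat(np.arange(len(df)), len(index[0]))].reset_index(drop=True)

# Right side: the looked-up rows, with every column renamed to <col>_2
right = df.iloc[flat].add_suffix("_2").reset_index(drop=True)

newdf = pd.concat([left, right], axis=1)
print(newdf)
```

With 40 value columns the same three lines apply unchanged, since `iloc` and `add_suffix` work on the whole frame.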
Hi, is there a way to apply a function starting with the 3rd instance of a value? I.e. ID numbers, on the third count of an id number do x
I have the following df:
ids = [1001, 1002,
1003, 1004,
1005, 1006,
1007, 1008,
1009, 1010]
numbers = list(range(1,11))
systems = ["ONE", "TWO"]
num = 40
sample1 = random.choices(ids, k=num)
sample2 = random.choices(systems, k=num)
sample3 = random.choices(numbers, k=num)
df = pd.DataFrame(zip(sample1, sample3, sample2),
columns=['id', 'seq', 'system'])
df.sort_values(by=['id', 'seq'])
If the count of the IDs >= 3, then starting at the third row, shift all the values in the system column up by one
Im making a fairly basic music content-based recommender system, following some code I used before for a movie recommender system that had a much smaller dataset. Not sure if this is relevant but the biggest change I made was that the music dataset I used is way too large so I have the system create a sample and the index is reset.
When running the function that compares an input to the rest of the data and outputs a small list of most recommended artists I get an error Ive never seen before: "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
Im unsure what this means and what I should do
```py
def recommendation(artist_name, sig=sig):
    index = indices[artist_name]
    sig_scores = list(enumerate(sig[index]))
    sig_scores = sorted(sig_scores, key = lambda x: x[1], reverse=True)
    sig_scores = sig_scores[1:11]
    spotify_indices = [i[0] for i in sig_scores]
    return spotifydf['artist_name'].iloc[spotify_indices]
```
```
ValueError Traceback (most recent call last)
<ipython-input-27-95e154b531ed> in <module>
----> 1 recommendation(spotifyRec)
<ipython-input-25-19214a8b47a6> in recommendation(artist_name, sig)
2 index = indices[artist_name]
3 sig_scores = list(enumerate(sig[index]))
----> 4 sig_scores = sorted(sig_scores, key = lambda x: x[1], reverse=True)
5 sig_scores = sig_scores[1:11]
6 spotify_indices = [i[0] for i in sig_scores]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()```
Interesting
Sounds like the rows of your original array aren't one-element, and so can't be compared. Basically, check what the elements of sig_scores are after the first line.
By the way, what you're doing can be done via np.argsort instead: it gives an array of indices such that the corresponding elements are in sorted order, which seems to be exactly what you're doing with all these lines.
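A minimal sketch of the np.argsort route, assuming the scores for one item form a 1-D array. The numbers are invented, and skipping the first hit mirrors the `[1:11]` slice in the original function:

```python
import numpy as np

# Hypothetical 1-D row of similarity scores (stand-in for sig[index])
scores = np.array([0.90, 0.10, 0.75, 0.30, 0.80])

order = np.argsort(scores)[::-1]   # indices from highest score to lowest
top = order[1:4]                   # skip the best match (usually the item itself)
print(top.tolist())                # [4, 2, 3]
```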
Yeah, looks like each element is an array, so how do you expect sorted to compare them? What's bigger, [0,1,2] or [1,-1,0]?
those are tuples, not arrays. python allows you to sort tuples
just on principle i would expect that you could define the same ordering as you would define for tuples, i.e. compare elementwise until the tie is broken. but imo numpy is right to reject that as the default behavior
yeah, sure. Anyway, sorted just does the equivalent of if a<b:, which for numpy arrays is not valid (a<b is an array of bools, and an array of bools can't be implicitly reduced to a single bool like that)
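To make that concrete, a tiny reproduction using the two example arrays mentioned above. It only shows where the error message comes from, not the recommender itself:

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([1, -1, 0])

elementwise = a < b          # array of bools, one per element
try:
    bool(elementwise)        # what sorted() effectively needs from a comparison
    message = None
except ValueError as exc:
    message = str(exc)

print(elementwise.tolist())  # [True, False, False]
print(message)
```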
im a little confused then why my original has tuples and this version uses arrays
presumably because sig is an array
this is from the movies one that works
which seems to show that sig is an array here
(the one that doesnt work is a similar result)
i would just loop over rows in this case
from collections import defaultdict
df = ...
systems = ["ONE", "TWO"]
id_counts = defaultdict(lambda: 0)
for row in df.itertuples():
    id_counts[row.id] += 1
    if id_counts[row.id] >= 3:
        df.loc[row.Index, systems] += 1
you could also do this with groupby:
df = ...
systems = ["ONE", "TWO"]
df['id_count'] = df.groupby('id').cumcount()
df.loc[df['id_count'] >= 2, systems] += 1  # cumcount starts at 0, so 2 is the third occurrence
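A quick check of how `cumcount` numbers rows, on a made-up frame. Since the numbering within each group starts at 0, the third occurrence of an id is the row where the count equals 2:

```python
import pandas as pd

# Toy data invented for the example
df = pd.DataFrame({
    "id": [1001, 1002, 1001, 1001, 1002, 1001],
    "system": ["ONE", "TWO", "TWO", "ONE", "ONE", "TWO"],
})

# cumcount numbers the rows within each id group starting from 0
df["id_count"] = df.groupby("id").cumcount()

# Rows that are the third or later occurrence of their id
third_onward = df["id_count"] >= 2
print(df[third_onward])
```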
Ah that might be what I'm looking for, let me give it a shot
or even combining the loop and groupby, which might be the best option if this dataframe is really big and you have a large number of duplicate id values:
df = ...
systems = ["ONE", "TWO"]
for _, group in df.groupby('id'):
    if len(group) > 2:
        inc_ids = group.index[2:]
        df.loc[inc_ids, systems] += 1
im not entirely sure about the semantics of += while looping over itertuples or groupby. if you want to be safer, you can construct a list of id's to modify first, and then do the modification in one shot after
df = ...
systems = ["ONE", "TWO"]
inc_ids = []
for _, group in df.groupby('id'):
    if len(group) > 2:
        inc_ids.extend(group.index[2:].tolist())
df.loc[inc_ids, systems] += 1
can you post your entire code? both the "original" you mentioned as well as the new version that gives you a problem
sure
@desert oar what I had so far was this:
def func(df_group):
    if len(df_group) >= 3:
        return df_group.system.shift(-1)
    else:
        return df_group.system
new_col = df.groupby(['id'], as_index=False).apply(func)
df['new'] = new_col.reset_index(level=0, drop=True)
oh, in this case you have different array shapes
oh, i didn't realize what you meant with the shift
that would work too but that shift would apply to the entire group, not just the values after the 3rd
Yeah I want to move the system values up one. But only if the count of ID's is 3+. And the shift needs to start from row count 3 of each ID, as they already have preset values.
I think that's where im stuck at.
My code works on id's with counts of 3+, and it applies it to all the values. But I only want to apply the shift to the values starting from the third row count
wait... is system a column? it looked like a list of columns
Yeah its a column with only 2 values inside.
no worries, hopefully the problem is abit clearer now?
huh, that's a new one
let me see what i broke
found it
let me do this offline 😆 hang on
# In[11]:
spotifydf = spotifydf.sample(frac =.1)
spotifydf = spotifydf.reset_index()
# In[12]:
spotifydf.head()
# In[13]:
spotifydf['popularity'] = spotifydf['popularity'].apply(str)
# In[14]:
spotifydf['genre'] = str(spotifydf['genre'])
spotifydf['genre'] = str(spotifydf['artist_name'])
spotifydf['genre'] = str(spotifydf['track_name'])
# In[15]:
spotifydf["content"] = spotifydf['genre'] + spotifydf['artist_name'] + spotifydf['track_name'] + spotifydf['popularity']
# In[16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3), stop_words = 'english')
spotifydf['content'] = spotifydf['content'].fillna('')
# In[17]:
tfvmatrix = tfv.fit_transform(spotifydf['content'])
# In[18]:
tfvmatrix
# In[19]:
tfvmatrix.shape
# In[20]:
from sklearn.metrics.pairwise import sigmoid_kernel
# In[21]:
sig = sigmoid_kernel(tfvmatrix, tfvmatrix)
# In[22]:
sig[0]
# In[23]:
indices = pd.Series(spotifydf.index, index=spotifydf['artist_name']).drop_duplicates()
# In[24]:
indices
# In[29]:
def recommendation(artist_name, sig=sig):
    index = indices[artist_name]
    sig_scores = list(enumerate(sig[index]))
    sig_scores = sorted(sig_scores, key = lambda x: x[1], reverse=True)
    sig_scores = sig_scores[1:11]
    spotify_indices = [i[0] for i in sig_scores]
    return spotifydf['artist_name'].iloc[spotify_indices]
# In[26]:
spotifyRec = input("Enter the artist you would like a recommendation based on!")
# In[27]:
recommendation(spotifyRec)
!paste i recommend using our paste site for longer samples like this
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
I have a dumb question, I saw they had discord.js would it be better in the long run to start and stay in python or is discord.js a good start, I plan on working on creating my own version of Carl bot and have never tried to make a project this big before.
I have done some basic bots before and would like to go bigger
and this is the new version with music instead of movies @desert oar
!eval @proper swift
import numpy as np
import pandas as pd
ids = [
1001, 1002, 1003, 1004, 1005,
1006, 1007, 1008, 1009, 1010,
]
numbers = list(range(1,11))
systems = ["ONE", "TWO"]
num = 40
rng = np.random.default_rng()
sample1 = rng.choice(ids, size=num)
sample2 = rng.choice(systems, size=num)
sample3 = rng.choice(numbers, size=num)
df = pd.DataFrame(
zip(sample1, sample3, sample2),
columns=['id', 'seq', 'system'],
)
def shift_system(group):
    if len(group) < 3:
        return group
    return pd.concat((
        group.iloc[:2],
        group.iloc[2:].shift(-1)
    ))
df['new'] = (
df.groupby('id')['system']
.apply(shift_system)
.reset_index(level=0, drop=True)
)
print(df)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
     id  seq system  new
0  1008    6    TWO  TWO
1  1002    5    ONE  ONE
2  1004    2    TWO  TWO
3  1001   10    ONE  ONE
4  1009    3    TWO  TWO
5  1008    6    ONE  ONE
6  1001    7    TWO  TWO
7  1004    2    TWO  TWO
8  1001    7    ONE  ONE
9  1008    2    ONE  ONE
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/doverelawu.txt?noredirect
i can't imagine why you want to do this though 😆
this sounds like a good question for #discord-bots
Long story haha. Need to fix some data entry issues
wait
hi can someone help me, i am trying to find a way to evaluate my model, which is created using this article https://www.ivanlai.project-ds.net/post/conditional-text-generation-by-fine-tuning-gpt-2, which uses transformers. i came up with training another model but with the rnn/lstm structure, how would i do this?
both models should convert keywords to sentences, am struggling to find a tutorial to train an rnn/lstm to do this
@desert oar thanks works as intended! Been stuck on that problem for the last 2 days
ok so I think the issue was possibly coming through a few things
so originally when using my method I got an error when concatenating columns saying: "TypeError: can only concatenate str (not "int") to str"
wait
maybe not 😂
yeah I cant work out why its making full arrays instead of tuples, I cant see anywhere why its doing this
oh wait, is it potentially because im doing it based off of artist_name, and one artist can have many songs within the dataframe, so each artist is assigned a multitude of values in sig?
@undone wind can you do print(df.head().to_dict('list')) and show the text in this chat?
and then we can talk about how to transform it to get your desired result. only text will do--no screenshots.
Please ping me when you do that and we can get into it.
{'index': [133553, 204593, 52399, 79490, 93264], 'genre': ['Reggae', 'Soundtrack', 'Blues', 'Opera', 'Indie'], 'artist_name': ['Bob Marley & The Wailers', 'Nick Glennie-Smith', 'Galactic', 'Giacomo Puccini', 'The Lagoons'], 'track_name': ['Bend Down Low - B Is Version', "Jack's Death", "You Don't know (featuring Glen David Andrews and The Rebirth Brass Band)", 'Un bel dì (From "Madama Butterfly")', 'California'], 'track_id': ['6bwr7Qgxrc0hERBOrapmVh', '34devHoJ8tjNLPgSaOpPuo', '5qh4q09WZTUMCkqXWR4l6l', '4jekropd6vkVfunMXZqwVh', '35QAUfIbfIXT3p3cWhaKxZ'], 'popularity': ['35', '28', '25', '20', '64'], 'acousticness': [0.44, 0.973, 0.0393, 0.9890000000000001, 0.276], 'danceability': [0.779, 0.14400000000000002, 0.701, 0.24, 0.7859999999999999], 'duration_ms': [213867, 98440, 244200, 296693, 261773], 'energy': [0.445, 0.203, 0.7659999999999999, 0.163, 0.6859999999999999], 'instrumentalness': [0.000151, 0.7829999999999999, 0.000816, 2.8499999999999998e-05, 0.6679999999999999], 'key': ['C', 'D#', 'G', 'C#', 'E'], 'liveness': [0.166, 0.11599999999999999, 0.23800000000000002, 0.317, 0.0416], 'loudness': [-7.791, -17.989, -6.285, -15.071, -7.18], 'mode': ['Major', 'Minor', 'Major', 'Major', 'Major'], 'speechiness': [0.0458, 0.0356, 0.0976, 0.05, 0.0289], 'tempo': [87.94, 74.194, 110.001, 89.719, 119.99700000000001], 'time_signature': ['4/4', '1/4', '4/4', '4/4', '4/4'], 'valence': [0.755, 0.0389, 0.588, 0.0388, 0.542], 'content': ['ReggaeBob Marley & The WailersBend Down Low - B Is Version', "SoundtrackNick Glennie-SmithJack's Death", "BluesGalacticYou Don't know (featuring Glen David Andrews and The Rebirth Brass Band)", 'OperaGiacomo PucciniUn bel dì (From "Madama Butterfly")', 'IndieThe LagoonsCalifornia']} @serene scaffold
content comes from spotifydf["content"] = spotifydf['genre'] + spotifydf['artist_name'] + spotifydf['track_name']
I tfvmatrix content
then sig = sigmoid_kernel(tfvmatrix, tfvmatrix) to make sig
```py
sig[index]
array([[0.7616427 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76161052, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76163003, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
...,
[0.7616156 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76163003, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.7616192 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416]])```
sig contains arrays with multiples values for some reason whereas it should only contain 1, the sigmoid value of each row in content
i dont know why and I believe this is what I am asking for
Question, how to make 2D array like this using numpy? The output wants the array starts from 137 to 166
yes, you can do arr.reshape(-1, 1) to get one like that for any array. though you have to be careful that the array actually means something when you shape it like that.
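A minimal sketch of that, assuming the values are the integers 137 through 166 inclusive. The target shape in the question isn't shown here, so both shapes below are only illustrations:

```python
import numpy as np

arr = np.arange(137, 167)       # 137, 138, ..., 166 (the stop value is exclusive)
col = arr.reshape(-1, 1)        # one column; -1 lets NumPy infer the 30 rows
grid = arr.reshape(5, 6)        # any shape works as long as rows * cols == 30
print(col.shape, grid.shape)    # (30, 1) (5, 6)
```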
thanks for posting it. I don't know that I understand what change you are trying to make
Ahh yes it works, thank you so much
@undone wind what is spotifyRec?
even the code snippets you posted don't include all of the code
don't make people guess at what you are doing here, if you can include the whole notebook please do so
this is the non-working one?
yes
@undone wind ```python
indices = pd.Series(spotifydf.index, index=spotifydf['artist_name']).drop_duplicates()
```
does `spotifydf` have a multi-index?
also it seems a bit weird that you're inverting the index and values like this
ok i see... you have this
spotifydf = pd.read_csv(r'C:\Users\cens\Downloads\archive (3)\SpotifyFeatures.csv')
so its index should just be the default RangeIndex
i see you also did spotifydf.reset_index() in cell 11
ok, so sig should be N x N where N is the number of rows in spotifydf
ahh i see why you invert the index, that's one way to do it. but sig is a plain numpy array, so it's only valid if you use a RangeIndex, you're better off with np.arange(len(spotifydf)) instead of spotifydf.index.
this notebook is also pretty messy. it's very likely that you have some weird intermediate state. did you try restarting the kernel and running it top to bottom?
if sig is indeed a 2d array, and if artist_name is a scalar (i.e. not a list/series/array), then sig[index] should be a 1d array because you deduplicated indices,
so if you get something different, then one of those assumptions is wrong
check the shape of sig and check that you are only passing plain "artist name" values, not arrays thereof
this is the full of the one that works too https://paste.pythondiscord.com/hafoyezoco
```py
sig[index]
array([[0.7616427 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76161052, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76163003, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
...,
[0.7616156 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.76163003, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416],
[0.7616192 , 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
0.76159416]])```
contains multiple arrays for 1 artist name
show index as well as spotifyRec
i bet spotifyRec itself is not a scalar
this is why i always use .at when i expect to be indexing with scalars, if i accidentally pass a non-scalar it gives an error
how do you mean
what is spotifyRec? if it's not a single scalar, then it's going to produce an index that is not a scalar, which will produce a 2d array from sig[index]
do you just mean
```py
indices
artist_name
Bob Marley & The Wailers 0
Nick Glennie-Smith 1
Galactic 2
Giacomo Puccini 3
The Lagoons 4
...
Glass Animals 23267
Night Beats 23268
Jackie Kashian 23269
311 23270
Bruce Broughton 23271
Length: 23272, dtype: int64
spotifyRec = input("Enter the artist you would like a recommendation based on!")```
?
and then index = indices[artist_name]
spotifyRec is just an input from the user
ok, can you confirm that index is also a scalar?
do this
index = indices.at[artist_name]
this way you definitely get an error if it's wrong
also can you show me sig.shape to confirm that it is definitely 2d and not 3d?
alright
and can you print the value of index too? using the same artist_name that caused the problem before
i think I was right earlier on, multiple songs with the same artist
is whats causing arrays rather than tuples
the movie one works because there arent any movies with the exact same title
would you say this is whats happening? @desert oar
yes, exactly. your "index inversion" didn't work as expected
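A miniature reproduction of what seems to be going on, with a made-up four-row table and a stand-in matrix in place of the real kernel. `drop_duplicates()` on this Series compares the values (the row numbers), not the artist labels, so an artist with several songs still maps to several positions:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the song table: artist "A" has two songs
songs = pd.DataFrame({"artist_name": ["A", "B", "A", "C"]})
sig = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for the N x N kernel

indices = pd.Series(songs.index, index=songs["artist_name"]).drop_duplicates()
# The values 0, 1, 2, 3 are already unique, so nothing is dropped
# and both "A" labels survive

unique_hit = indices["B"]        # a single integer
repeated_hit = indices["A"]      # a Series holding two row positions

print(sig[unique_hit].shape)                 # (4,)
print(sig[repeated_hit.to_numpy()].shape)    # (2, 4)
```

The 2-D result is what `enumerate` then turns into (position, array) pairs, which `sorted` cannot compare.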
i think you need to reconsider these data structures
Hi all, I have a question regarding training a multilayer perceptron for mnist using the classes/functions that I was provided with. Is this an appropriate place to reach out for some assistance?
in the short term I guess I could change the dataframe so that it is 1:1 between song and artist, just to get it working
wont make the greatest recommender system but I just want it working for now
and then is there a place anyone could recommend ( 😏 ) for learning more about content based recommenders and implementing them
no, you need to use an index that is actually unique
this has nothing to do with recommendation systems... this is a matter of being a bit smarter about numpy and pandas usage
if your recommender system works with songs, then you have a song recommender system
so you can ask for an artist, but then obviously you will get more than one song per artist
which is maybe fine, but then you need to be smarter about how you do the lookup
you need to get the list of song ids from the artist id
or you need to come up with an artist-level recommendation system, not a song-level recommendation system
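One way the "songs from an artist" lookup could look, as a rough sketch only. The helper name `recommend_for_artist`, the toy data, and the choice to average the similarity rows of all the artist's songs are assumptions, not something taken from the notebook:

```python
import numpy as np
import pandas as pd

def recommend_for_artist(songs, sig, artist_name, k=3):
    """Hypothetical helper: songs most similar, on average, to an artist's songs."""
    names = songs["artist_name"].to_numpy()
    rows = np.flatnonzero(names == artist_name)   # all row positions for the artist
    if rows.size == 0:
        raise KeyError(artist_name)
    scores = sig[rows].mean(axis=0)               # average similarity (an assumption)
    scores[rows] = -np.inf                        # never recommend the artist's own songs
    top = np.argsort(scores)[::-1][:k]
    return songs.iloc[top]

songs = pd.DataFrame({
    "artist_name": ["A", "B", "A", "C", "D"],
    "track_name": ["a1", "b1", "a2", "c1", "d1"],
})
# Toy symmetric similarity matrix, invented for the example
sig = np.array([
    [1.0, 0.2, 0.9, 0.6, 0.1],
    [0.2, 1.0, 0.3, 0.4, 0.5],
    [0.9, 0.3, 1.0, 0.8, 0.2],
    [0.6, 0.4, 0.8, 1.0, 0.3],
    [0.1, 0.5, 0.2, 0.3, 1.0],
])
result = recommend_for_artist(songs, sig, "A", k=2)
print(result["track_name"].tolist())  # ['c1', 'b1']
```

Looking rows up by position this way avoids the inverted index entirely, so duplicate artist names are no longer a problem.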
