#data-science-and-ml
yeah they are quite imbalanced, but even using imblearn the best i am getting is 50%.
It's just so discouraging when you feel like you can't do anything to solve the problem.
The field feels like such a black box
I'm only in second year of college. I really wanted to go into ML but this like broke that. Sorry im like ranting but its just so annoying.
Maybe just the more data science part of it is not for me
arg
you shouldnt get discouraged
these are tough problems and it doesnt mean you arent good at it
im a grad student and i basically get the same results as you

also this is just one domain, NLP. IRL, you usually dont have to reinvent the wheel again aka only using sklearn

im reinventing the wheel again in my DL class with minitorch
and sometimes it feels like im just banging my head against the table

interesting read talking about applications in machine learning + cybersecurity
Yes I am
I have an object detection problem with two types of images: cards, and random images without cards. The cards have bounding boxes and the others don't. I only have the random images so that my model can learn to classify that there aren't any cards in an image. How should I store this data? (I have to turn it into TFRecord files later on because I have to use the TensorFlow Object Detection API.) I have the images as numpy files such as id-1.npy and am keeping their bounding box information in a dictionary {"id-1":[],"id-2":[448,123,343,532]}. (In this case id-1 is one of the random images, so it has no bounding box information, hence the empty list.) How should I go about this?
hello
i have a question most likely related to numpy
so currently i am tasked with creating a mock schedule for an nba team and here are the rules
im having issues with the second bullet point
you can include an EarlyStopping callback to monitor whether validation loss has stopped improving over a set number of epochs
essentially i need to create a column in pandas that generates either a 0 or a 1 but it has to generate exactly 6 1's
how can i accomplish this^^^?
have you tried doing it on paper first
something like this?
!e
import numpy as np
x = np.ones((6,1))
print(x)
@misty flint :white_check_mark: Your eval job has completed with return code 0.
001 | [[1.]
002 | [1.]
003 | [1.]
004 | [1.]
005 | [1.]
006 | [1.]]
how can i add this randomly to a matrix of zeros then?
a column of zeros to be more specific
pretty sure you can stack them
lets see
!e
import numpy as np
a = np.array([0,0,0,0,0,0])
b = np.array([1,1,1,1,1,1])
c = np.column_stack([a,b])
print(c)
@misty flint :white_check_mark: Your eval job has completed with return code 0.
001 | [[0 1]
002 | [0 1]
003 | [0 1]
004 | [0 1]
005 | [0 1]
006 | [0 1]]
lets say ur col is of size 20,
col=np.zeros(20)
col[np.random.choice(np.arange(20), 6, replace=False)] = 1
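A minimal sketch of the same idea as a pandas column, assuming a 20-row schedule (the row count, the column names `game` and `flag`, and the seed are all made up for illustration). Filling the first six slots with 1 and shuffling guarantees exactly six 1s, which sampling with replacement would not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

n_rows = 20  # assumed schedule length
n_ones = 6   # exact number of 1s required

# Start with six 1s followed by zeros, then shuffle their positions in place.
flags = np.zeros(n_rows, dtype=int)
flags[:n_ones] = 1
rng.shuffle(flags)

schedule = pd.DataFrame({"game": range(1, n_rows + 1)})  # hypothetical frame
schedule["flag"] = flags

print(schedule["flag"].sum())  # the count of 1s is fixed by construction
```

Shuffling a fixed array keeps the count exact without having to remember `replace=False`.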
Hi everyone, I have a project about "scanning corona test results". Which algorithm should I use, KNN or SVM?
Try both see which gives better results
Hello, im training a vgg 19 model and im using k fold cross validation and want to plot the validation accuracy
Should i use this ?
ok thx
can you show example of what you desired output to be?
Something like this but for all my folds
just for one epch point?
so like if you train for 10 epochs, at each epoch you will have a different validation loss or other metric. you can use only the last epoch, that's a choice. just something to remember: neural nets can overfit with too many epochs, and then you will see overfitted results where some previous epochs could be better, unless you use early stopping or save the best model checkpoint
so yeah with early stopping or saving the best checkpoint you could just plot one loss value / metric per training run
then you could just append the results to a list and calculate the mean, standard deviation etc
I don't even have money hahah !
these are great (and free) courses:
https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/
course.fast.ai
Yess i am using early stopping
And i want for each fold the last epoch's accuracy and loss curve to be plotted
@tacit basin how do i get this kinda graph kfold cross validation
what do you want on x and y axis?
Accuracy vs the fold
And loss vs the fold
so you would just add the accuracy and losses for each fold to say list and then plot that. would that work?
you are probably interested in average value and st deviation for example as well?
something like that:
Some ML courses on Udemy are pocket-friendly you can get something good with less than $50 there.
Alternatively, you can try free courses and Bootcamp. Check Dphi bootcamp you might like it.
Hahaha don’t say u know tensorflow if u can’t use it 😂
something like this?
but i can't understand the plot i am getting
for i in range(len(histories)):
    # plot loss
    plt.subplot(211)
    plt.title('Cross Entropy Loss')
    plt.plot(histories[i].history['loss'], color='blue', label='train')
    plt.plot(histories[i].history['val_loss'], color='orange', label='test')
    # plot accuracy
    plt.subplot(212)
    plt.title('Classification Accuracy')
    plt.plot(histories[i].history['accuracy'], color='blue', label='train')
    plt.plot(histories[i].history['val_accuracy'], color='orange', label='test')
plt.show()
but this is only needed for test set right?
plt.scatter(folds, accuracies)
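A rough sketch of the "accuracy vs the fold" plot discussed above, with the mean and standard deviation drawn on top. The accuracy values are made-up placeholders, not results from anyone's model, and the `Agg` backend is only chosen so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder final validation accuracies, one per fold (illustrative only).
fold_accuracies = [0.81, 0.79, 0.84, 0.80, 0.82]
folds = np.arange(1, len(fold_accuracies) + 1)

mean_acc = np.mean(fold_accuracies)
std_acc = np.std(fold_accuracies)

fig, ax = plt.subplots()
ax.scatter(folds, fold_accuracies, label="per-fold accuracy")
ax.axhline(mean_acc, linestyle="--", label=f"mean = {mean_acc:.3f}")
# Shaded band: one standard deviation either side of the mean.
ax.fill_between(folds, mean_acc - std_acc, mean_acc + std_acc, alpha=0.2, label="+/- 1 std")
ax.set_xticks(folds)
ax.set_xlabel("fold")
ax.set_ylabel("validation accuracy")
ax.legend()
```

Calling `fig.savefig(...)` or `plt.show()` afterwards writes or displays the figure; the same pattern works for the per-fold loss.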
Oh okay thank you
I used this
And got this graph
@tacit basin but idk if this is correct
What graph did you get?
This
Like does it mean it's underfitting cause the curves fall flat
You had 50 folds?
Or colors are folds? And x axis epochs?
x axis is epochs, the orange is val accuracy and blue is train accuracy
Nice
@slim frigate ask here.
But i cant understand the graph
Im getting kinda confused
If it's underfitting cause it the loss curves look flat
It kind of starts overfitting from epoch 6 ish
At least on that blue orange curve that start at high loss
Not sure why the other folds start at low losses, maybe something with the setup?
Oh like?🙈
When i read these graphs
Am i only supposed to look at the loss the curve or the accuracy one as well
Yes both. Loss is what the model uses to adjust its weights, accuracy is a human-readable metric.
Anyone know in pandas how to return only rows which meet condition
.where and Iloc giving errors
I want to say where the columns are saying True, or 1 would also work as I converted to binary
one usually does something like df[df["country"] == "my cool country"] or whatever.
the idea is that you get a boolean Series from comparisons, and you can index with it.
Try this : df[df['A'] == True]
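A small sketch of the boolean-mask idea from the messages above, on a toy frame (the column names and values are invented for illustration):

```python
import pandas as pd

# Toy data; "flag" plays the role of the True/False (or 1/0) column.
df = pd.DataFrame({"country": ["A", "B", "A"], "flag": [1, 0, 1]})

mask = df["flag"] == 1   # boolean Series, one True/False per row
matching = df[mask]      # keeps only the rows where mask is True

# Equivalent for a 0/1 column: cast to bool and index with .loc.
also_matching = df.loc[df["flag"].astype(bool)]

print(matching)
```

The comparison builds the mask first, so it can be printed or combined with other conditions using `&` and `|` before indexing.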
So in the case of accuracy the train curve should be below the val curve?
data = {
"Joe": {
"math": 65,
"science": 78,
"english": 98,
"gym": 89
},
"Bill": {
"math": 55,
"science": 72,
"english": 87,
"gym": 95
},
"Tim": {
"math": 100,
"science": 45,
"english": 75,
"gym": 92
},
"Sally": {
"math": 30,
"science": 25,
"english": 45,
"gym": 100
},
"Jane": {
"math": 100,
"science": 100,
"english": 100,
"gym": 60
}
}
i want to add the values through inputs, like an app, so I don't have to keep editing the dictionary every time
I don't understand the question, sorry
are u talking about GUI's like Tkinter or something?
maybe a very broad question, but how do I decide whether I should try and write an algorithm or use an ai / machine learning approach for a task?
if the problem you're trying to solve can be solved with an exact procedure, don't use AI. if you can't, but you have training data, you can use AI.
@serene scaffold like this
def info():
    name = input('Your name ? : ')
    math = input('Your math degree ? : ')
    science = input('Your science degree ? : ')
    english = input('Your english degree ? : ')
    gym = input('Your gym degree ? : ')
info()
you mentioned over DMs that this would be a pandas question, but it would appear that it is not.
because i'm trying to export the data as excel but i can pass this step
try using a general help channel; see #❓|how-to-get-help
thanks
exactly like stelercus said. google's first rule of ML: if you can solve a problem without ML, do it. 
mm I guess I should just try an algorithmic approach first, I think it maybe makes more sense for my problem anyways. thank you for the advice
I just made a youtube video about working with image data in python for anyone interested in image processing for computer vision / machine learning. Let me know if you have feedback: https://www.youtube.com/watch?v=kSqxn6zGE0c
In this video I show how to work with image data in python! Using the popular python packages matplotlib and opencv you will learn how to open image data, how the data is formatted, some ways to manipulate the data and save it off in a different format. If you enjoy you can also check out my live twitch streams (below). Image data is extremely p...
Train loss / accuracy are usually better than test/valid. You don't want to see valid/ test loss / accuracy getting much worse and going in wrong direction as you see on your graph
refactored and rerunning a training set that took 40 mins last time... this should be fun
I didn't get you
This is the point where model started overfitting
Red curve ( test/valid) started going in 'the wrong' direction
While train (blue) decreases
a SVM is using 100 percent of a core for training for > 20 mins... anyone know how to make it multi-threaded?
underfitting happens when both training and validation accuracy are low, meaning the model doesnt have the complexity needed to fit the data
when they fall flat that just means it converged
which means its done training
Oh okayy, thank you so much
Yeah so i can't understand for kfold cross validation how can i evaluate the model when there are like 5 folds
Should i then plot the curve only using the average like you had suggested
Has anyone here used the ukbiobank, urgent research help needed
@serene scaffold perhaps?
Some amazing tweets from data twitter this week! Take a look 👇
https://twitter.com/moderndatastack/status/1505589000240701440?t=hEmXC9_15YzJX6nEwrfY1A&s=19
that's an option too. but i wonder why other plots in your graph start with such low error. is this correct?
i got the correct cwd but still comes out with an error
error is FileNotFoundError: [Errno 2] No such file or directory: 'Downloads\\sp_500_stocks(1).csv'
try using the full path. your cwd suggests otherwise. also.. not important but if you're using windows use raw-strings for less chances of errors
i tried full path but still comes out with the same error
You could try something like
os.path.expanduser("~/Downloads/whatever.csv")
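One way to narrow this down, sketched with `pathlib`: build the absolute path from the home directory and check what actually exists. The file name is copied from the error message; that the file lives in the user's `Downloads` folder is an assumption.

```python
from pathlib import Path

# File name taken from the error message; the Downloads location is assumed.
csv_path = Path.home() / "Downloads" / "sp_500_stocks(1).csv"

print("cwd:", Path.cwd())
print("looking for:", csv_path)
print("exists:", csv_path.exists())

# If it is missing, list the CSV files that are there to spot naming differences.
if not csv_path.exists() and csv_path.parent.is_dir():
    for candidate in sorted(csv_path.parent.glob("*.csv")):
        print("found:", candidate.name)
```

A relative path like `Downloads\...` is resolved against the current working directory, so it only works when the cwd is the folder that contains `Downloads`.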
is embedding just encoding?
I've just subscribed. Nice Channel.
Happy early congratulations on hitting 1k subscribers. 🎉🎉🎉
You should easily hit 1k subs if about 60 more people subscribe to your channel.
Hey guys let's get RoblksCube to 1k. Consider subscribing to his YouTube channel 🙏🙏
Medallion Data Science is a channel devoted to growing a community of people interested in learning machine learning, data science and coding in python. Also streaming live coding sessions on twitch as Medallion Stallion https://www.twitch.tv/medallionstallion_
someone pls help... i'm utterly stuck... trying to use pandas.concat feature
for i in enumerate(col_name):
    column_name = col_name[i[0]]
    pearson_coef, p_value = stats.pearsonr(df[column_name], df["SEVERITYCODE"])
    fuck = {
        "Column Name": column_name,
        "Pearson Correlation Coefficient": pearson_coef,
        "P-value of": p_value,
    }
    i_m_crying = pd.DataFrame(fuck)
    df_local_list.append(i_m_crying)
percof_smry = pd.concat(df_local_list, ignore_index=True)
ValueError: If using all scalar values, you must pass an index
i tried every solution but i keep getting errors.. how do i use pd.concat in a for loop?
One is usually done to the input a model, the other is usually the output of a model
why do we have a dedicated layer for embedding
is it trained
that's not the problem... i got the idx correct... it's the i_m_crying = pd.DataFrame(fuck) that's saying i'm using scalar value
fuck
can you create dataframe without loop with the same data?
manually create an entire dataframe of the columns + their Pearson Corr + p value?
Please don't ping me to answer questions that I haven't already started answering
yeah without the loop,
if it wants an index, give it one:
d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
i can keep using the old code
percof_smry = pd.DataFrame({'Column Name': [], 'Pearson Correlation Coefficient': [], 'P-value of': []})
for i in range(0, len(col_name)):
    pearson_coef, p_value = stats.pearsonr(df[col_name[i]], df['SEVERITYCODE'])
    percof_smry = percof_smry.append({"Column Name": col_name[i], "Pearson Correlation Coefficient": pearson_coef, "P-value of": p_value}, ignore_index=True)
print(percof_smry)
but DataFrame.append is deprecated
where is the error raised? at concat?
i'm trying to use the new pd.concat feature. right now the error is at
i_m_crying = pd.DataFrame(fuck)
it goes away if i use pd.Series but that's the wrong thing to use... i need the output in a concatenated DataFrame
If you add index as the error message suggests?
SOLVED:
df_local_list = []
for i in enumerate(col_name):
    column_name = col_name[i[0]]
    pearson_coef, p_value = stats.pearsonr(df[column_name], df["SEVERITYCODE"])
    fuck = [{
        "Column Name": column_name,
        "Pearson Correlation Coefficient": pearson_coef,
        "P-value of": p_value,
    }]
    i_m_crying = pd.DataFrame.from_dict(fuck)
    df_local_list.append(i_m_crying)
percof_smry = pd.concat(df_local_list, ignore_index=True)
monkey around long enough and you eventually produce works of shakespeare. my exasperation can be seen thru my variable names
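For reference, a small sketch of why the first attempt failed and of a shorter route. A dict whose values are all scalars gives pandas no row count, so it asks for an index; a list of such dicts is read as one row per dict. The column values below are made-up placeholders standing in for what `stats.pearsonr` would return:

```python
import pandas as pd

# Placeholder for one loop iteration's result (values are invented).
row = {"Column Name": "col_a", "Pearson Correlation Coefficient": 0.12, "P-value of": 0.03}

try:
    pd.DataFrame(row)  # all-scalar dict: pandas cannot infer the number of rows
except ValueError as exc:
    print("ValueError:", exc)

one_row = pd.DataFrame([row])               # list of dicts: one row per dict
also_one_row = pd.DataFrame(row, index=[0])  # or pass an explicit index

# Collect plain dicts in the loop and build the frame once at the end.
rows = [row, {"Column Name": "col_b", "Pearson Correlation Coefficient": -0.40, "P-value of": 0.001}]
summary = pd.DataFrame(rows)
print(summary)
```

Building the frame once from a list of dicts avoids creating a DataFrame per iteration and needs no `pd.concat` at all.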
It can be learned, or a pre-trained one can be used. It can be used on inputs to reduce computation complexity by reducing input size to the actual network since "encodings" are not typically learned
yes, it finally made sense, i will probably give a million dollar to developer if they started writing docs like how people would say while explaining to others
pd.append... simple... straight forward.. easy to use
what they replaced it with: pd.concat
how to use it:
- turn your dict into a list... x = [dict]
- pass it to a dataframe... framed = pd.DataFrame(x)
- append it to a list again... appendify = []; appendify.append(framed)
- now you can use concat... pd.concat(appendify) 👏
who comes up with these ideas?.... rewriting features just for the sake of putting out a new version
concat comes from numpy world. its good for concatenating matrices
It's been around pandas for a long time, maybe as long as append. It's also more general
what's not to understand? it's written right there
The issue is it adds several steps to appending a dict to a dataframe
Though they can be put on one line it's ugly
But python's tagline is to have one obvious way of doing things
pd.append was working fine... they didn't need to deprecate it
If you want to read the actual reasons
Series.append and DataFrame.append [are] making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place. The data for the index and values needs to be copied to create the result.
more 👎 than 👍 ... and his reasoning was flawed... what we ended up doing was pd.concat with ignore_index=True .... seems like the devs just needed to do "something" to put out a new version and forced this issue thinking it up in isolation
my old code:
percof_smry = pd.DataFrame({'Column Name': [], 'Pearson Correlation Coefficient': [], 'P-value of': []})
for i in range(0, len(col_name)):
    pearson_coef, p_value = stats.pearsonr(df[col_name[i]], df['SEVERITYCODE'])
    percof_smry = percof_smry.append({"Column Name": col_name[i], "Pearson Correlation Coefficient": pearson_coef, "P-value of": p_value}, ignore_index=True)
print(percof_smry)
my new code:
df_local_list = []
for i in enumerate(col_name):
    column_name = col_name[i[0]]
    pearson_coef, p_value = stats.pearsonr(df[column_name], df["SEVERITYCODE"])
    the_values = [
        {
            "Column Name": column_name,
            "Pearson Correlation Coefficient": pearson_coef,
            "P-value of": p_value,
        }
    ]
    pass_to_df = pd.DataFrame.from_dict(the_values)
    df_local_list.append(pass_to_df)
percof_smry = pd.concat(df_local_list, ignore_index=True)
print(percof_smry)
it has to pass thru extra steps now
What deep neural network architecture would be good for classifying large images? Both larger and smaller details will be important for the classification.
If you've ever designed architecture, you'll know popularity is not always aligned with the correct choice.
If you provide counters to each of their reasons that would be much more relevant than one code sample
if you're just looking for pointless debates just go away. don't talk to me
Go away? Is this your discord? I'm merely asking you to think beyond your specific use case
You're welcome to ignore the request and exit the discussion. You're trying to debate the deprecation of an API, if you're looking for more than blind agreement perhaps look elsewhere.
Can anyone direct me to pretrained object detection models for COCO? If anything like that exists?
Yolov5
go away means don't talk to me... some of us have real models to train rather than having pointless debates
You introduced the topic
Thanks, I'll check this out ^^
There are a lot, as COCO is one of the largest and most popular datasets
One of the biggest model zoos is detectron2. https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md
But there are surely more across all frameworks
srsl = [pd.Series((d[0], *d[1:]), index=['Colname', 'Corr', 'Pval']) for d in dat]
pd.DataFrame(srsl)
or something like that
And as I mentioned above you can roll multiple of your lines into one. List append doesn't need to be its own line, and neither does from_dict
Additionally, rolling it all into one line is less readable and maintainable, specifically the old dict creation
Thanks so much for the sub! Hope to hit 1k soon 😊
ValueError: Length of values (2) does not match length of index (3)
you would have to work on it a bit most probably, was just eyeballing it
i feel like you would do well to work on your python fundamentals though
depends on what stats.pearsonr gives back, you could unpack it in the first comprehension maybe
yes that worked.. although a bit messy in a single list comprehension line
generators can be nice if you need to do a bit too much dancing for a straight comprehension
i definitely lean towards using more lines and being clearer than trying to wrap it all in one giant thing
U have a good point that sometimes pandas has some unintuitive design
Love the package tho
Sometimes I am confused why they fix what isn’t broken
Btw I don’t see anything wrong with concat
I don’t think it’s new
i think one of the problems is, where do u draw the line about what to keep for backwards compatibility
like if they acknowledge it was a bad idea/didnt really work as expected, id rather they got rid of it
rather than having a million different ways to do things, many of which are suboptimal
also maintaining a bunch of older stuff that should have been deprecated takes time away from doing newer better things
like as an extreme, ive worked at place that still run on mainframes, the argument is 'well they still work fine' which is true on one level but in practise it means they cant do anything modern that their users/customers want to do
Hey Guys ! Can u send some examples or ur projects using AI
You know what I mean, like a face detector or whatever
@iron basalt seems Numenta is ditching HTM https://arxiv.org/abs/2201.00042
Apparently, it doesn't change things.... much for TBT, but I am currently pestering some guys about the status
anyways, its a pretty bad paper with plenty of criticisms - not to mention the DL baselines it competed against were weak, GOFAI stuff used to disguise cheating and general sussiness regarding inconsistent methodologies across experiments
Anyone any good with doc2vec? I'm wondering why in this tutorial/ example https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-auto-examples-howtos-run-doc2vec-imdb-py
They train the model on both the train and the test set. I had thought that this was not the way it was done and you should keep the sets separate
until everything breaks and you dont have anyone around able to fix it

just like if your system is run on COBOL. where are you going to find COBOL programmers nowadays 
thanks, this helped a lot
doesnt look like it
looks like it is separated even when they trained the model
so I’m close to completing LunarLander on OpenAI, how different is that from a 4 motor drone?
how do I deal with the problem that LabelEncoder only accepts input of a single datatype?
these are my modelling parameters
what kind of encoding do you suggest
oh thats weird. i have no idea then 
Having trouble getting all the elements from my dataframe into one list
I’m looking to get everything into a list in order to calculate polarity
For some reason this is not working right now
I’ve been trying to find an answer to this question and can’t as of yet
paste a sample of your code
anyone help me make a quick loop. if free for 5min just @winged grove would appreciate
want to create a loop or something. all my images are in order. like part_1, part_2, part3 . want code to automatically do code for like part[i+1].png
from skimage import io, img_as_float
import numpy as np
image= io.imread(r'C:\Users\guest\Dropbox\con1_outfolder_split_30sbeforepeak2min30safterpeak\part_1.png')
image = img_as_float(image)
print(np.mean(image))
eg:
images = [img_as_float(io.imread(rf'C:\Users\guest\Dropbox\con1_outfolder_split_30sbeforepeak2min30safterpeak\part_{i}.png')) for i in range(1, 11)]
thanks you
only thing i am thinking is will it give me in order. because i want to see printed results in order. like first part 1 then part 2 then part 3 @novel elbow
got this error
images = [img_as_float(io.imread(f'C:\Users\guest\Dropbox\con4_outfolder_split_30sbeforepeak2min30safterpeak\part_{i}.png')) for i in range(10)]
^
SyntaxError: invalid syntax
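The traceback does not say which part is invalid. Two possible causes, both guesses: an interpreter older than Python 3.6 (which has no f-strings), or the backslashes in a non-raw string, where `\U` starts a unicode escape. The sketch below sidesteps both by building the paths with `pathlib`, and it answers the ordering question: a list built from `range` is always in numeric order. The count of 3 files is made up.

```python
from pathlib import PureWindowsPath

# Folder copied from the message above; PureWindowsPath only manipulates the string.
folder = PureWindowsPath(
    r"C:\Users\guest\Dropbox\con1_outfolder_split_30sbeforepeak2min30safterpeak"
)

def part_paths(folder, count, start=1):
    """Return paths part_<start>.png ... in increasing numeric order."""
    return [folder / "part_{}.png".format(i) for i in range(start, start + count)]

paths = part_paths(folder, count=3)  # count is an assumption
for path in paths:
    # With scikit-image installed, the mean for each file would be:
    #   np.mean(img_as_float(io.imread(str(path))))
    print(path)
```

Using `str.format` instead of an f-string keeps the sketch valid on older interpreters as well.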
Yeah idk why either, i mean i did the plotting with that code
How can i know if it is wrong tho
whats the easiest way to extract the last digit from a number in pandas?
have you tried .to_list()
youll need to select a column from your dataframe first since that function only accepts pandas series
your_list = df['column A'].to_list()
yeah, double-check cross-validation set up and plotting. looks suspicious.
If everything checks out what could be the issue
I don't think they are ditching it for DL, they have pretty much always been doing some DL-ish stuff too (I think Jeff already said in his book that HTM was not right, the specifics of it, but general things like having cortical columns still persist). I have not read the paper all that much so IDK about its quality. It does not interest me that much. I'm more interested in Jeff's grid cells idea (thousand brains theory, but also just for localization / regular grid cells stuff).
While we were inspired by HTM and such, we don't do it the way they do because it never got really good results (if it does not work, we move on, though it's hard to tell since it's a multi-arm bandit problem). The big picture of the structure and such is there / modelling the neocortex, but the details of how to do that are very different.
You can also see their use of DL in their first papers on grid cells based object detection in which they use a pre-trained CNN to simply demonstrate that the grid cells can identify objects given only patches of the original image in any order (as a sequence of eye movements). Ideally the DL part would be replaced with something more biologically plausible that at least gets similar results (we have done that, which also enables online learning in our case). Numenta as a company has different things going on and for me personally it's hit and miss. Sometimes it's a really nice idea like thousand brains theory, but sometimes it's kind of meh.
maybe convert the column to string, slice it, and convert back to a number
df.AA.astype(str).str[-1].astype(int)
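For an integer column, the remainder after dividing by 10 gives the last digit without going through strings. A small sketch on toy data (the column name `AA` comes from the message above, the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"AA": [123, 40, 7, -58]})  # toy integer data

# Arithmetic route: last digit of the absolute value.
df["last_digit"] = df["AA"].abs() % 10

# String route, for comparison; .abs() drops the sign so both routes agree.
df["last_digit_str"] = df["AA"].abs().astype(str).str[-1].astype(int)

print(df)
```

The modulo version only makes sense for integers; for floats the string route (or rounding first) would need more care.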
different splits give different results, would try to understand the splitting
Ahh, so i have like 2963 images
And i split it five times
what is it, classification? i forgot sry. if classification, you can check the number of classes in each split and also in the validation set
does anyone know how to use ffmpeg? I'm following these instructions but keep getting an error saying the file can't be found, even though I saved it in the same directory as the ffmpeg-split.py file
This is the command I ran
I just realized I don't have ffmpeg itself installed lmao
suggest some architectures for complete categorial dataset
hi
from openpyxl import Workbook, load_workbook
from openpyxl.utils import get_column_letter
from openpyxl.styles import Font
data = {"rename": {
    "math": 20,
    "science": 20,
    "english": 20,
    "gym": 20}
}
def info():
    name = input('Your name ? : ')
    math = input('Your math degree ? : ')
    science = input('Your science degree ? : ')
    english = input('Your english degree ? : ')
    gym = input('Your gym degree ? : ')
    data.update({name:{"math":math,"science":science,"english":english,"gym":gym }})
    input('press any key ...')
for a in range(2):
    info()
    a+=1
wb = Workbook()
ws = wb.active
ws.title = "Grades"
headings = ['Name'] + list(data['rename'].keys())
ws.append(headings)
for person in data:
    grades = list(data[person].values())
    ws.append([person] + grades)
for col in range(1, 6):
    ws[get_column_letter(col) + '1'].font = Font(bold=True, color="0099CCFF")
wb.save("NewGrades.xlsx")
result :
but i want change the output without rename data
any suggestions can help me
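One possible way to drop the "rename" placeholder, sketched without the Excel part: keep the subject names in their own list and take the headings from that list instead of from a dummy entry. The `SUBJECTS` constant and `build_rows` helper are hypothetical names; the Joe entry reuses values from the earlier dictionary.

```python
SUBJECTS = ["math", "science", "english", "gym"]  # fixed column order

def build_rows(data):
    """Return a header row followed by one row per person."""
    rows = [["Name"] + SUBJECTS]
    for person, grades in data.items():
        rows.append([person] + [grades[subject] for subject in SUBJECTS])
    return rows

data = {}  # starts empty, so no placeholder entry ends up in the output
data["Joe"] = {"math": 65, "science": 78, "english": 98, "gym": 89}

rows = build_rows(data)
for row in rows:
    print(row)  # each row could be passed to ws.append(row) instead
```

Since the headings no longer depend on `data`, the dictionary can start empty and be filled only from user input.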
This does not appear to be data science, you may want to ask in the standard help rooms.
embedding needs encoding first right?
Is there a way for me to use drone-acquired images of water/ocean/lakes and use them to check for pollution using machine learning?
Can someone advise what filter to keep on images while making word clouds in python??
Since I am not able to get proper imprint of the person with word cloud
Do photos show pollution? Like colour?
hey, i have a cropped opencv2 image that i want to predict in a model that i made.
the model requires shape (28,28), but when i try to reshape it i get the error
"cannot reshape array of size 63948 into shape (28,28)'"
img = cv2.imread(u'/content/drive/MyDrive/Data/Project/Screenshot_44.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
hImg, wImg, _ = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    x,y,w,h = int(b[1]), int(b[2]), int(b[3]), int(b[4])
    width = w-x
    height = h-y
    n = width - height
    if (width > height):
        h = int(h + n/2)
        y = int(y - n/2)
    elif (height > width):
        w = int(w - n/2)
        x = int(x + n/2)
    crop_img = img[hImg-h:hImg-y, x:w]
    reshape = crop_img.reshape((28,28))
yes and even stuff like oil spills
It's a weird trajectory all-in-all I suppose, what with this weird hybrid of approaches. but at least now they have started competing on benchmarks 🤷♂️
Do you have thousands of these images
How high up are they taken
If exists a vast library of these images then you could do it
Hi guys, is it possible to use CNN to train a model without labels in order to query similar images?
by what standard of similarity?
Someone need help for a code?
if you want to answer questions, try checking the occupied help channels for those not being addressed.
@serene scaffold I have 20 query images which have been cropped, I need to rank 10 most similar images among 5000 images, they are not quite similar
I only know I can use some feature extraction algo like SIFT or color histogram to find those image, but some images are still not found. And I know transfer learning is a way to modify the existing CNN model like VGG. but what's the most accurate way to do that?
I have tried to list, but I need it to be a list of strings
Does it matter if I normalize my data if I am using logistic regression ? I tried using standard scaler from sklearn and I am somehow getting way better results in confusion matrix and classification report
look up contrastive learning
I can send the whole code if someone wants to take a look
sometimes it helps a lot
yeah send it if you want
normalizing data for lstm layers helps a lot since it keeps the values in the range where the tanh and sigmoid functions aren't saturated
Hey everyone! Anyone here experienced with Elasticsearch who could help me out? Im using ES 7.17 for lower level security but when I try to access the running node it tells me now that I am missing authentication credentials
elasticsearch.exceptions.AuthenticationException: AuthenticationException(401, 'security_exception', 'missing authentication credentials for REST request [/persons]')
I am basically worried that I might have leaked the data to the scaler I am sending some pics to you in dm please take a look
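On the leakage worry: a common safeguard is to fit the scaler on the training split only, which a scikit-learn pipeline does automatically. A minimal sketch on synthetic data (the dataset, split size and seeds are arbitrary choices, not the setup from this conversation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics to transform X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

If the scaler was instead fitted on the full dataset before splitting, the test rows influenced the mean and variance, which is the leak being described.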
ok
@grave frost thank you for suggestion, i'm studying contrastive learning and transfer learning to see which one is more accurate
How to make a graph in third quadrant in matplotlib ?
What Data Science and Data Analysis Skills Are Required to Become an ML Engineer?
I read in an article about the requirements to become an ML engineer, that you should have some experience with Data Science, data analysis and other major requirements.
Are you a tensorflow user? Do you know how to code models from scratch and are you a math god
actual ml engineers in research are literal gods
Check out some papers on designing new algorithms
Tbh I don’t think I’ll ever reach that level due to the hard cap on my math
Oh I'm a beginner trying to make a fresh start
I'm not bad in math 😅
Ur gona need like
Literal years of education or experience
Take CS at uni maybe, then a statistical postgrad
Do u use python? Or R
Python
In my view a ml engineer is the one who knows how to apply software eng good practices into ml model development/deployment (mainly testing and CI)
good point of course. but you might want to look into scikit-learn's dataset generators for some insight into generating somewhat "realistic" datasets https://scikit-learn.org/stable/datasets/sample_generators.html
In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. Generators for classification and clustering: Th...
Are you ML engineer?
unemployed atm, but have worked as such
Hi, how can i convert a csv file to base64 encode?
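A short sketch using only the standard library: read the file as bytes and pass them to `base64.b64encode`. The helper name `file_to_base64` and the temporary example file are made up for the demo.

```python
import base64
import tempfile
from pathlib import Path

def file_to_base64(path):
    """Return the file's raw bytes encoded as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

# Demo on a throwaway CSV so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"
    csv_path.write_bytes(b"a,b\n1,2\n")
    encoded = file_to_base64(csv_path)

print(encoded)
```

Decoding with `base64.b64decode(encoded)` gives back the original bytes unchanged.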
Good
Hi guys, i have a question, its simple and it has 2 parts, so basically
Should we train a CNN with Conv1D layers, with 2D or 3D data, or both are possible?
Should we train a LSTM model with 2D or 3D data, or both are possible?
I need someone's advice on this before I proceed.
sys:1: DtypeWarning: Columns (1) have mixed types.Specify dtype option on import or set low_memory=False.
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
error^^^^^^^^^^ on running
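The idiom the error message refers to, as a minimal sketch: anything that starts worker processes goes under the `if __name__ == "__main__":` guard, so that child processes importing the module do not start workers of their own. The `square` function and the pool size are placeholders.

```python
import multiprocessing as mp

def square(x):
    """Work done in each child process (placeholder task)."""
    return x * x

def main():
    # The pool is only created when main() is called, never at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    print(main())
```

On Windows, child processes are started by re-importing the main module, which is why code at module level that creates processes triggers this RuntimeError.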
Continuing from #help-cherries
For context, I'm a mechanical engineer without a ton of software experience that's now writing finite-element analysis code. We have an old mess of MPI-based C++ code that nobody understands. I'm currently looking to rewrite it. I have a simpler version of it implemented in Python now using Numba, to see if it works with multithreading the way we need.
I'm trying to determine if Numba is the best way to implement this, or if I should look into something else like Cython
Here's a snippet that shows the 'type' of operations that it's mostly based on.
@njit(parallel = True)
def stepAllCalcs(dx, dz, ux3, uz3, ux2, uz2, ux1, uz1, lam, mu, lam_2mu, dt2rho, weights):
    co_dxx = 1/dx**2
    co_dzz = 1/dz**2
    co_dxz = 1/(4.0 * dx * dz)
    #Ux
    dux_dxx = co_dxx * (ux2[1:-1,0:-2] - 2*ux2[1:-1,1:-1] + ux2[1:-1,2:])
    dux_dzz = co_dzz * (ux2[0:-2,1:-1] - 2*ux2[1:-1,1:-1] + ux2[2:,1:-1])
    dux_dxz = co_dxz * (ux2[0:-2,2:] - ux2[2:,2:]- ux2[0:-2,0:-2] + ux2[2:,0:-2])
    (...)
    # Stress G
    stressUX = lam_2mu * dux_dxx + lam * duz_dxz + mu * (dux_dzz + duz_dxz)
It's mostly simple array addition, with some scalar multiplication.
@desert oar
other than the horrifying 70s style variable names, this looks about as good as it's going to get
(i know i know it's math notation rendered in ascii, i've written/used code like this)
maybe there are some additional optimizations for this kind of calculation but i wouldn't know any
yeah there are certainly some improvements to be made there
writing high-performance cython is a lot closer to C than Python
you are still messing with pointers and such, and even worse you now have to worry about interacting with python, thread safety, reference counting, etc.
and at that point you're probably better off with the original c++ application
which leads me to ask: is this performing significantly worse than the C++ version?
we don't have an exact comparison between the two
it seems like numba already uses openmp for parallelization, fwiw https://numba.discourse.group/t/does-numba-support-mpi-and-or-openmp-parallelization/483/2
Hi @goldmosh, Not out of the box. You can use a lot of ctypes in Numba and could call MPI functions if you wanted to but it’d probably be a lot of work. You might be interested in trying out dask and it’s dask.distributed backend, it works well with Numba. Yes. See Automatic parallelization with @jit — Numba 0.52.0-py3.7-linux-x86_64.egg doc...
this is a 2d version that I wrote a while back and am just now trying on our HPC, while the 'actual' code is 3d
trying to figure out if it's worth rewriting in 3d
Is there any reason I'd want to use MPI? from my limited understanding it's mostly useful for distributed computing, while we're mostly running on one system
just multiple cores, which numba seems to handle fine
also if anyone has general input for how to approach writing large-scale stuff like this I'd appreciate it
Hi I want to make a project that uses object detection. I have some tf and data science experience but never used computer vision and stuff. Which libraries or frameworks do you guys recommend?
Or any courses to get started?
Guys, I have a column called "Routes" with 900 unique values. Should I have one-hot encoded it? Haha
If not, what should I have done?
for what purpose?
for feature selection (rfe), then model development (random forest)
probably not actually. although random forest does tend to over-weight high-cardinality categorical features
hmm what should I have done with it instead?
i was getting relatively decent results, do you think it may have overfitted?
probably not, i wouldn't worry about it
hashing specifically doesn't matter, because there is no "training" involved
i have some worry regarding hashing of Y
but normally you should fit/train your data transformations only on the training set, not on the test set. data transformation is part of your model!
how should i deal with Y
you probably shouldn't hash your class labels / output values
why would you?
LabelEncoder is simpler and better-designed for this purpose
assuming you're using scikit-learn, OneHotEncoder has a bunch of extra features that you don't need for labels
and hashing makes no sense for a variety of reasons; i encourage you to think about why
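A small sketch of what encoding labels with scikit-learn's `LabelEncoder` looks like. The label values here are hypothetical, not the real dataset. Unlike hashing, the mapping is collision-free and reversible:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels standing in for the real target column.
y_train = ["I", "P", "B", "P", "I"]

encoder = LabelEncoder()
# fit_transform learns the sorted set of classes and maps each to an integer.
y_encoded = encoder.fit_transform(y_train)

# inverse_transform recovers the original labels, which a hash cannot do.
decoded = encoder.inverse_transform(y_encoded)

print(list(encoder.classes_), list(y_encoded), list(decoded))
```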
yeah ok on it, i hope it doesnt give me 100 accuracy on all three sets this time
else i am skrewed
if you are getting 100% accuracy on all 3 sets, then you probably accidentally put the labels into the model as a feature
yeah tried all that stuff
i am actually not doing any mistakes
that model was just not suited for the problem i believe thats why i am switching the model
100% accuracy isn't a bad thing btw, but it does probably mean that your model is badly overfitted
exactly. thats why i am just overcomplicating model. lol to decrease accu
that's not a good idea
that's covering up the problem, not solving it
well i have a reason, its not just that 100/100/100
i have lower loss on validation then on train
thats suggest an "inconclusive performance"
that suggests bugs in your code to me, or a particularly unlucky train/test split
this is why i think nested cross validation is a much better approximation of out-of-sample performance than a plain split
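For reference, a rough sketch of nested cross-validation in scikit-learn: an inner loop picks hyperparameters, an outer loop scores the whole procedure. The dataset, model and grid are placeholders chosen only so the example runs:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Inner loop: used only for choosing hyperparameters.
inner = KFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: used only for estimating out-of-sample performance.
outer = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner,
)

# Each outer fold refits the whole search on its own training part,
# so the held-out part never influences the hyperparameter choice.
scores = cross_val_score(search, X, y, cv=outer)
print(scores.mean(), scores.std())
```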
you want to try debugging, i would appreciate it a lot
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
i have dataset and code in google drive and collab
are you comfortable with those
i can dm you, if you give me permission to
that's a bit more than i'll have time to look at
salt rock lamp, could you give me a tip how i can find the maximum value from my code:
import pandas as pd
import matplotlib.pyplot as plt
#import numpy as np
var = pd.read_excel(r'/Users/pontusskol/Desktop/data.xlsx')
print(var)
x = list(var['X values'])
y = list(var['Y values'])
plt.figure(figsize=(10,10))
plt.style.use('seaborn')
plt.plot(x,y, '-o',label='x,y')
plt.scatter(x,y,marker="o",s=100,edgecolors="black",c="yellow")
plt.title("Excel sheet to Scatter Plot")
plt.show()
Which gives me this graph as i showed you before:
Ive been searching on internet but I just cant make it work :/
by using this?
xmax = x[numpy.argmax(y)]
ymax = y.max()
Is there a way to make my twint program update its data in real time?
Use the Twitter API instead of web scraping libraries
i told you: use loess, spline, or gaussian process to interpolate. then find the max value from that
either you compute a bunch of values on a very finely-spaced grid and do a search, or use some numerical optimization routine
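A minimal sketch of the "interpolate, then search a fine grid" route, using a cubic spline from SciPy. The data points are invented, and the x values are assumed to be sorted and strictly increasing:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Made-up sample points standing in for the 'X values' / 'Y values' columns.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 2.5, 3.8, 3.1, 1.0])

# Fit a smooth curve through the points.
spline = CubicSpline(x, y)

# Evaluate it on a finely spaced grid and pick the largest value.
grid = np.linspace(x.min(), x.max(), 2001)
values = spline(grid)
best = np.argmax(values)
x_max, y_max = grid[best], values[best]

print(x_max, y_max)
```

For noisy data a smoothing method (loess, a smoothing spline) is probably a better fit than an interpolating spline, since this one passes through every point exactly.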
okay thanks, will look into that
you got any error?
@desert oar sorry for the random ping, but it seemed like you were knowledgeable about Numba - any ideas where I'd start troubleshooting LLVM / SVML? I'm trying to enable it for the project from earlier but it doesn't seem to be working
right now I'm just doing numba._try_enable_svml which always returns false despite having the libs installed
how to handle hash encoding if my column has more then one datatype.
does anyone know if I load a model through HDFS how can I load it to use like pickle.load would since it is from a connection instead of a file
hello guys how can i fix installing kivy in anaconda errors
oof, definitely no idea. sorry
i have a conceptual model about how numba works, and i know what llvm is, and that's about it
ERROR: Could not find a version that satisfies the requirement kivy-deps.sdl2 (from versions: none)
ERROR: No matching distribution found for kivy-deps.sdl2
how to fix it?
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.```
hello guys
im kinda new to learning curve analysis . does anyone know if this curve is good or not
im trying to run optimization buut I dont exactly know how I can fine tune the hyperparameters according to the learning curve
Hi I want to convert this: ```label,question,answer
label 1,pytanie 1?,odpowiedź 1
label 1-2,pytanie 2?,odpowiedź 1
label 1-2,pytanie 1?,odpowiedź 1
label 1-2,pytanie 2?,odpowiedź 2
label 1-2,pytanie 1?,odpowiedź 2
label 2,pytanie 2?,odpowiedź 2
with base64
and instead of this:
it kinda goes down so probably not so good
bGFiZWwscXVlc3Rpb24sYW5zd2VyCmxhYmVsIDEscHl0YW5pZSAxPyxvZHBvd2llZMW6IDEKbGFiZWwgMS0yLHB5dGFuaWUgMj8sb2Rwb3dpZWTFuiAxCmxhYmVsIDEtMixweXRhbmllIDE/LG9kcG93aWVkxbogMQpsYWJlbCAxLTIscHl0YW5pZSAyPyxvZHBvd2llZMW6IDIKbGFiZWwgMS0yLHB5dGFuaWUgMT8sb2Rwb3dpZWTFuiAyCmxhYmVsIDIscHl0YW5pZSAyPyxvZHBvd2llZMW6IDIK
I have that written in my csv file: ```98,71,70,105,90,87,119,115,99,88,86,108,99,51,82,112,98,50,52,115,89,87,53,122,100,50,86,121,67,109,120,104,89,109,86,115,73,68,69,115,99,72,108,48,89,87,53,112,90,83,65,120,80,121,120,118,90,72,66,118,100,50,108,108,90,77,87,54,73,68,69,75,98,71,70,105,90,87,119,103,77,83,48,121,76,72,66,53,100,71,70,117,97,87,85,103,77,106,56,115,98,50,82,119,98,51,100,112,90,87,84,70,117,105,65,120,67,109,120,104,89,109,86,115,73,68,69,116,77,105,120,119,101,88,82,104,98,109,108,108,73,68,69,47,76,71,57,107,99,71,57,51,97,87,86,107,120,98,111,103,77,81,112,115,89,87,74,108,98,67,65,120,76,84,73,115,99,72,108,48,89,87,53,112,90,83,65,121,80,121,120,118,90,72,66,118,100,50,108,108,90,77,87,54,73,68,73,75,98,71,70,105,90,87,119,103,77,83,48,121,76,72,66,53,100,71,70,117,97,87,85,103,77,84,56,115,98,50,82,119,98,51,100,112,90,87,84,70,117,105,65,121,67,109,120,104,89,109,86,115,73,68,73,115,99,72,108,48,89,87,53,112,90,83,65,121,80,121,120,118,90,72,66,118,100,50,108,108,90,77,87,54,73,68,73,75
why is that?
my code:
if not name or name == '':
print("badn name")
label_check = db.session.query(Labels.label_name,Labels.label_id).first()
if label_check == None:
print("no labels")
header = ['question', 'label', 'answer']
data = db.session.query(Labels.label_name,Questions.question,Answers.answer)\
.filter(Labels.label_id==Answers.label_id)\
.filter(Labels.label_id==Questions.label_id).all()
result = all_many_schema.dumps(data)
with open(f"dowolands/{name}.csv", 'w', newline='') as f:
header = ['label', 'question', 'answer']
writer = csv.writer(f)
writer.writerow(header)
i = 0
for range in data:
writer.writerow(data[i])
i = i+1
data = open(f"dowolands/{name}.csv", "r").read().encode('utf8')
encoded = base64.b64encode(data)
with open(f"dowolands/{name}.csv", 'w') as f:
writer = csv.writer(f)
writer.writerow(encoded)
f.close()
print(encoded)```
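The numbers in the file come from `writer.writerow(encoded)`: `encoded` is a `bytes` object, and `writerow` iterates over it, so every byte is written as its own integer "cell". A sketch of one way around it, building the CSV in memory and writing the base64 text directly; the rows here are placeholders for the database query:

```python
import base64
import csv
import io

# Iterating over bytes yields integers, which is what ended up in the file.
as_ints = list(base64.b64encode(b"label"))

# Placeholder rows standing in for the query result.
rows = [
    ("label 1", "pytanie 1?", "odpowiedź 1"),
    ("label 2", "pytanie 2?", "odpowiedź 2"),
]

# Build the CSV text in memory instead of writing and re-reading a file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["label", "question", "answer"])
writer.writerows(rows)

# Encode the CSV text to bytes, then to base64, then back to a plain string.
csv_bytes = buffer.getvalue().encode("utf-8")
encoded = base64.b64encode(csv_bytes).decode("ascii")

print(as_ints)
print(encoded)
```

To save it, something like `f.write(encoded)` on a file opened in text mode should be enough; no `csv.writer` is needed for the encoded output.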
Thanks, that's a relief to know. But do you know what should I have done instead with dealing with high-cardinality columns? I've read of PCA but I heard it's designed for continuous variables. Would very much appreciate your input!
night guys, sorry for disturbing, I feel a bit sick and can't concentrate on a simple task. How can I add points (markers) to this subplot and prevent scientific notation on the axis?
whats the ".exe" encoding bois?
The Microsoft portable executable format? Is this a DS question?
Encoding is putting something into some kind of system of signals (very generic term). Embedding's goal is to embed something into something to gain new insight about it and other things related to it. You can imagine embedding the data as analogous to Archimedes embedding Hiero's crown into liquid to measure its volume.
Whenever you change the form of some data you have technically encoded it. But whether or not that encoding is useful in that it lets you compare things is what matters.
how do i predict embedding layer attributes
can you tell it dependencies
input_dim
output_dim
and input_length
while embedding
You need to define some embedded space and then have something that embeds items into that space. The input_dim is whatever the input's dim is. The output_dim is whatever you decided.
hey squiggle
will this equation work for all reinforcement learning tasks?
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
im using free google colab and i cant train 100 epochs all in a single runtime so can i just train it 50 epochs first then save the model then next runtime load and train it for another 50 epochs?
If by works you mean works well, then no, not on every task (but a lot of them). Q-learning "works" on any RL task, as in you can try to apply it always. If you want another option see SARSA and try to figure out which tasks it would perform better on and why.
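To make the comparison concrete, a sketch of both updates side by side for a tabular Q function. It assumes discrete states and actions, so something like Lunar Lander would need discretization or a function approximator first. The constants are arbitrary:

```python
import numpy as np

LEARNING_RATE = 0.1
DISCOUNT = 0.95


def q_learning_update(q_table, state, action, reward, next_state):
    # Off-policy: bootstrap from the best action in the next state,
    # regardless of which action is actually taken next.
    max_future_q = np.max(q_table[next_state])
    current_q = q_table[state, action]
    q_table[state, action] = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (
        reward + DISCOUNT * max_future_q
    )


def sarsa_update(q_table, state, action, reward, next_state, next_action):
    # On-policy: bootstrap from the action the policy really chose next.
    next_q = q_table[next_state, next_action]
    current_q = q_table[state, action]
    q_table[state, action] = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (
        reward + DISCOUNT * next_q
    )
```

The only difference is the bootstrap term: `max_future_q` versus the value of the action actually taken.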
Sure, I will look into SARSA, since I’m working on Lunar Lander, how hard is a task similar to lunar lander but with 4 motors?
From your plot, you can see that your train set (blue line) Loss reduces as the number of Epoch increases. However, we can't say the same about the Validation loss.
The validation loss briefly reduced alongside train loss until, say, in the 7th Epoch when it slowly starts to diverge.
So, in essence the bigger/wider the resulting space caused by the divergence between your Train loss and Validation loss, the more your model overfits the data
Try to use EarlyStopping callback to prevent overfitting.
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs=250, validation_data=(X_test, y_test), callbacks=[early_stopping])
Also, for performing hyperparameter tuning in DL using an approach that's somewhat analogous to RandomizedSearchCV in sklearn, you could use the sklearn wrapper from keras
from keras.wrappers.scikit_learn import KerasClassifier
Has anyone tried out neural intent?
i'm trying to make a new column that converts the S&P ratings to numbers
import pandas as pd
grades = {
'AAA': 1,
'AA+': 2,
'AA': 3,
'AA-': 4,
'A+': 5,
'A': 6,
'A-': 7,
'BBB+': 8,
'BBB': 9,
'BBB-': 10,
'BB+': 11,
'BB': 12,
'BB-': 13,
'B+': 14,
'B': 15,
'B-': 16,
'CCC+': 17,
'CCC': 18,
'CCC-': 19,
'CC': 20,
'C': 21,
'D': 22,
}
states = pd.read_csv('./data/states_credit_scores.csv')
states_frame = pd.DataFrame(states)
number_sp = [grades[x] for x in states_frame['Rating']]
states_frame['Rating_Num'] = number_sp
states_frame.sort_values(by='Rating_Num', inplace=True)
states_frame
countries = pd.read_csv('./data/countries_credit_scores.csv')
countries_frame = pd.DataFrame(countries)
number_cp = [grades[x] for x in countries_frame['Rating']]
countries_frame['Rating_Num'] = number_cp
countries_frame.sort_values(by='Rating_Num', inplace=True)
countries_frame
this is my error:
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/snap/pycharm-professional/278/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/snap/pycharm-professional/278/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/amicharski/PycharmProjects/njBudget/main.py", line 37, in <module>
number_cp = [grades[x] for x in countries_frame['Rating']]
File "/home/amicharski/PycharmProjects/njBudget/main.py", line 37, in <listcomp>
number_cp = [grades[x] for x in countries_frame['Rating']]
KeyError: nan
turn it into a for loop and print each iteration? that or running in a debugger to see which key is causing problems.
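`KeyError: nan` suggests the 'Rating' column has missing values. One possible alternative to the list comprehension is `Series.map`, which gives NaN for anything not in the dict instead of raising. A sketch with a shortened dict and invented rows:

```python
import numpy as np
import pandas as pd

# Shortened version of the grades dict, just for the example.
grades = {"AAA": 1, "AA+": 2, "AA": 3}

# Invented rows: one missing rating and one rating not in the dict.
frame = pd.DataFrame({"Rating": ["AA+", np.nan, "AAA", "NR"]})

# map() looks each value up in the dict; unknown or missing keys become NaN.
frame["Rating_Num"] = frame["Rating"].map(grades)

# Rows that could not be converted, handy for finding the bad values.
unmapped = frame[frame["Rating_Num"].isna()]
print(frame)
print(unmapped)
```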
Anyone know of a way to invert the background/axis/label colors on a matplotlib 3d plot?
I'm trying to plot the orbits of solar system bodies, and they are color-coded with light colors because they will be on dark background (space) eventually.
I could probably convert to HSL and then just turn down the lightness, but a dark background would probably just be better.
Here is what they look like currently:
For anyone curious, the visible blue orbit is Neptune, the largest colored orbit is the dwarf planet Gonggong, and the biggest gray orbit is Comet Ikeya-Zhang.
Any suggestions on how to solve systems of differential equations on GPU using Python? Are there any packages like SciPy that offer this functionality on the GPU? I posted this in the #algos-and-data-structs channel too but it might be more appropriate for this channel.
Approximate numerically or solve analytically?
Which functions from SciPy do you want?
@iron basalt Numerical solvers. For example, the solve_ivp function in SciPy solves a system of ODEs but uses the CPU. Is there anything like that available for GPU?
Hello, has anyone used or worked on "spotify/ANNOY" machine learning nearest neighbor model?
I need held on my project!!!!
help*
Not sure, but you could implement it yourself to run on the GPU using either Numba (probably the easiest if you can get it to use the GPU), cupy (if you are using Nvidia GPUs) / pycuda, pyopencl (any GPU or CPU / device with parallel compute), or Kompute (Vulkan).
It seems SciPy's default solving method is Runge-Kutta of order 5 (4). Assuming I'm interpreting "RK45" correctly and they don't actually mean "RKF45".
Just checked the source code for it, it's Runge-Kutta order 5 (4).
If you have never written a solver for it before there are plenty of tutorials, the code is really short.
So you can first write it with numpy, then move that to numba.
SciPy's solve_ivp is basically just RK45 with some extra code for picking methods other than RK45, and parameter wrangling.
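As a starting point for the "write it with numpy first" step, a sketch of the classic fixed-step fourth-order Runge-Kutta method. This is not SciPy's adaptive RK45, just the simpler relative with no error control; the test problem (a harmonic oscillator) is arbitrary:

```python
import numpy as np


def rk4_step(f, t, y, h):
    # One classic fourth-order Runge-Kutta step for y' = f(t, y).
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(f, t0, y0, t_end, n_steps):
    # Fixed step size: no adaptive error control like solve_ivp has.
    h = (t_end - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def oscillator(t, y):
    # Harmonic oscillator x'' = -x written as a first-order system.
    return np.array([y[1], -y[0]])


result = integrate(oscillator, 0.0, [1.0, 0.0], np.pi, 1000)
print(result)
```

Since the step function is only array arithmetic, the same code should be a reasonable candidate for porting to a GPU array library later.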
can someone tell me why i need to install visual studio for cuda? i haven't installed any of the tools that it comes with, just the editor
and yet cuda seems to do just fine
what's so magical about vs 2019 that cuda wants
?
It wants the Microsoft SDK probably, not the IDE. But the SDK comes with the IDE.
Microsoft SDK
?
Windows SDK*
I think so, I have not used Windows in a while, but I think there is some SDK which is needed for development on Windows which is used by Visual Studio and others.
(unless you use mingw or something like that, but that is unofficial)
CUDA I think makes use of the visual studio SDK on windows so it may be that one (or both).
I know that whatever system you are using, CUDA hijacks your C++ compiler so you can write kernels in C++ directly.
Ah found the info: ```
Visual Studio is an IDE (Integrated Development Environment). It's the user interface.
Build Tools include the compiler that compiles your source code into machine code.
Windows SDK contains headers, libraries and sample code used to develop applications.
I would think that it needs the SDK and build tools.
But it probably wants to also integrate into visual studio.
thanks!
Windows does not have a standard way of dealing with SDKs / libs like Linux, so it's all a mess there.
oof
sounds about right tho

think i wanna try this manim library
to see if i can make a short clip about numpy's broadcasting

Hello, I'm looking to have an interactive bot that reacts to user messages in certain scenarios and tries to match up a user's response up to one of a few different request options. I'm kind of lost as to what approach I should take here. Any general pointers would be much appreciated!
So you are trying to classify their messages?
yes
Have you tried something really simple like naive bayes?
I haven't really tried anything yet. I'm looking for pointers on what I can read up on or specific libraries to use
I've done Binary Classification to use, but now I need something more
and I also dont really have that large of a training set to work on
Even more simple, you can just check for keywords in the messages.
Basically naive bayes, but hand crafted probabilities.
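A rough sketch of the keyword idea: count how many hand-picked keywords from each class appear in the message and take the class with the most hits. The classes and keyword sets are invented for illustration:

```python
# Invented request classes and keywords; a real bot would use its own.
KEYWORDS = {
    "greeting": {"hello", "hi", "hey"},
    "schedule": {"meeting", "tomorrow", "calendar"},
}


def classify(message):
    # Lowercase and split on whitespace; punctuation is not stripped here.
    words = set(message.lower().split())
    # Score each class by the number of its keywords found in the message.
    scores = {label: len(words & kws) for label, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Return None when no keyword matched at all.
    return best if scores[best] > 0 else None


print(classify("Hi there"))
print(classify("can we book a meeting tomorrow"))
```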

yeah, I've considered that, but in some cases, it would be necessary to differentiate between who is being referred to
Well that is not just classification, that is much more complicated.
yeah best to start simple. you can always iterate later
for example if the message says "I will do xyz, you should do abc", that needs to be analysed
However, you could first classify it with something dumb like naive bayes, and then try to figure out stuff from there based on that class.
Or yeah, actually learn NLP.
you could use this as an excuse to learn more

funny enough we went over RNNs today
super classic
not even LSTMs yet
or attention mechanisms

we should get to transformers eventually
and modern NLP architectures
If you are struggling with LSTM, try looking at GRU. It's better in every way.
More simple, better results.
i also looked at that today
but on my own

what i was using: https://d2l.ai/chapter_recurrent-modern/gru.html
ok, thanks for the pointers. so Naive Bayes can provide some basic classification, that can then be further analyzed, and if I need more I need to look further into NLP
since at the same time it can't be unreasonably complicated because more complex algorithms usually take more processing power, its probably better to keep it simple stupid anyway
Can someone please help me understand how n_components works in hashing?
In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed ...
I am having dimension error, when i change n_components
I understand how it works
But dont know how dimensions work
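A toy sketch of the hashing trick, to show where the dimensions come from. This is not how any particular library implements it, only the general idea: every value is hashed into one of `n_components` buckets, so the output always has exactly `n_components` columns, no matter how many input columns or categories there are.

```python
import hashlib


def hash_encode(values, n_components):
    # One output row with a fixed number of buckets.
    row = [0] * n_components
    for value in values:
        # Hash the value to a big integer, then fold it into a bucket index.
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        row[int(digest, 16) % n_components] += 1
    return row


print(hash_encode(["route_a", "route_b", "route_c"], 16))
print(hash_encode(["route_a", "route_b", "route_c"], 8))
```

So if a later step (a model input layer, a saved pipeline) expects 16 columns and `n_components` is changed to 8, a dimension error would be the expected result. That is a guess at the cause, not a diagnosis.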
Hello guys, looking for some easy to follow python repos on github (data engineering preferred) where the code is written in a modular and production appropriate manner.
Basically I have been writing code in Jupyter notebooks for data lift and shift but would like to learn how to convert the code into a more modular and reusable format.
Boys my essay and coding assignment has been set
Which supervised models should I become an expert in 🧐
wow, after good cup of tea I've improved my simple script file to plot 4 subplots
Hi there, Is there anyone who has a sample presentation/study file analyzing PCA components in terms of original variables? ( I am struggling to find a good example that explains PCA in business context)
As much as you can. There are many Supervised Learning algorithms; once you know 2 or 3, it'll be easier to grasp how the others work too. It's almost the same syntax but different algorithms, and sometimes different hyperparameters.
Knowing both Linear-based and Tree-based algorithms is quite important
Do you understand what PCA does? I think if you've understood it very well you can easily apply/implement it in any business context.
You can check this video https://youtu.be/FgakZw6K1QQ
Principal Component Analysis, is one of the most useful data analysis and machine learning methods out there. It can be used to identify patterns in highly complex datasets and it can tell you what variables in your data are the most important. Lastly, it can tell you how accurate your new understanding of the data actually is.
In this video, I...
I don't know of any but try checking this https://youtu.be/bkJZDmreIpA then put on your FBI hat and do a quick digging on their GitHub repo. You might find what you seek therein
I have recently dived into it, am trying to learn and apply. Thank you for the information and the YouTube link. I will watch it.
I meant, the assignment you have to choose two models
Which two… I already know how all of them “work” I meant on an expert level
We have to use them predictively as well as write essays
Awesome. You can pick any of your favourite algorithm. For me, I like CatBoost 😂
categorical_columns = [c for c in dataset.columns if (c != 'Slice Type (Output)')]
hs = category_encoders.HashingEncoder(cols=categorical_columns, n_components=16)
d = hs.fit_transform(dataset)```
i applied this encoding, is this actually correct, i worry about not able to see any correlation
If I want to compare a column from 2 data frames, is there a more efficient way than df1.compare(df2) ?
Sorry I can not help with your problem. I was asking about mine 😂😂😭
what do you mean by "compare" exactly?
and what do you mean by "efficient"?
i always had good results with lightgbm for quick and dirty gradient boosting
xgboost seems to need more "care and feeding" to get good model performance, and generally is slower
and catboost never gave me good results compared to lightgbm on the problems where i tried it
if there's a pandas operation that is one method call, that is almost certainly the "most efficient" way by any definition of "efficient".
imo just start reading code that isn't "data" code. production-ready just means no bugs. which means it needs to be testable, which means you are basically writing an application and all the usual recommendations about application design apply
data science code is usually bad quality
read the scikit-learn source code, their code is usually pretty decent
it's a bit "old school" in some respects, but for the most part it's a well-organized and thoughtful code base
hmm, old school in what way?
no type hints, * imports
one thing that they do which is interesting is mapping 1:1 **kwargs to instance attributes, this is actually enforced by their base classes
in a world without type annotations, that's a really nice thing
and in general it makes it impossible to accidentally discard user input
also distinguishing "generated" fields by suffixing with _ is kind of ad-hoc but a very useful convention
of course they almost certainly should have gone the R/statsmodels route of returning a "result" object instead of mutating the original "model" object and adding a bunch of fields
oh yeah, another old school thing: fields that are not initialized in __init__ and using hasattr() to check the current state of the object
the flipside of making the model fitted in-place is that you can chain transformers easily, but that's kind of a quirky thing that you don't usually need anyway
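A toy estimator that follows the conventions described above, written without scikit-learn so it stands alone. `MeanRegressor` and its `shift` parameter are invented for the example:

```python
class MeanRegressor:
    def __init__(self, shift=0.0):
        # Constructor arguments are stored 1:1 under the same name,
        # with no validation or transformation.
        self.shift = shift

    def fit(self, X, y):
        # Learned state gets a trailing underscore and only exists after fit().
        self.mean_ = sum(y) / len(y) + self.shift
        # Returning self is what makes chaining possible.
        return self

    def predict(self, X):
        # The "old school" fitted check: look for the generated attribute.
        if not hasattr(self, "mean_"):
            raise RuntimeError("call fit() before predict()")
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[0], [1]], [1.0, 3.0])
print(model.predict([[5]]))
```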
hey yall
i wanted to ask a question
if AI isnt telling the code if the person says "hello" or hi or sup or wassup or what ever word that means a welcoming action then how is AI made i mean like siri and google assastaint
programs like siri are not an AI on the whole, but they contain components that are
yea but if siri hgas components that include AI
for example, if you ask siri a factual question, it uses an information retrieval algorithm to find a statement that answers the question, and that is AI.
so when i say hello it uses a algorithm to know what does hello even mean and what to answer if a user says hello
is that what your saying?
no, just saying "hello" to siri and getting a response does not include any AI.
then how?, how does it know what hello means
if user.says() in ('hello', 'hi'):
    return random.choice(['hi', 'hello', 'greetings'])
speech recognition. but it's just mapping what you say to a string.
it's unlikely that the siri source code has whole conversations mapped out like this
but for trivial conversations, it's probably just picking from a few canned responses, just using speech recognition.
yw
well can i ask you anthor question
sure
in your code here
return random.choice
i want to put that in my code
look ill give you an example
it's my code, so you owe me 100 bucks
Hello,
Which is better for detecting fake news?
Supervised Learning? Semi-Supervised Learning? UnSupervised Learning? Deep Learning?
message = input("Type your message: ")
if message == ("hello"):
print("hello", "hi", "greetings")
will this work?
these are all broad categories of algorithms. your question can only be answered in terms of specific algorithms.
also none of those are mutually exclusive with deep learning
you don't need to wrap hello in parentheses. the indentation for the print call is wrong. that would print all three of those, not one of them randomly.
What is your suggestion?
it's in the code example you were referencing.
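A runnable sketch of that idea, with the made-up `user.says()` replaced by a plain function argument. `reply` is a hypothetical name:

```python
import random

GREETINGS = ("hello", "hi")


def reply(message):
    # Normalize the text so "Hello " and "hello" are treated the same.
    if message.strip().lower() in GREETINGS:
        # Pick one canned response at random instead of printing all of them.
        return random.choice(["hi", "hello", "greetings"])
    # Anything that is not a known greeting gets no reply.
    return None


print(reply("Hello"))
```

To hook it up to the keyboard, something like `print(reply(input("Type your message: ")))` should do.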
so i can just type
message = input("")
if message == hello:
return random.choice(['hi', 'hello'])
!indent
Indentation
Indentation is leading whitespace (spaces and tabs) at the beginning of a line of code. In the case of Python, they are used to determine the grouping of statements.
Spaces should be preferred over tabs. To be clear, this is in reference to the character itself, not the keys on a keyboard. Your editor/IDE should be configured to insert spaces when the TAB key is pressed. The amount of spaces should be a multiple of 4, except optionally in the case of continuation lines.
Example
def foo():
    bar = 'baz'  # indented one level
    if bar == 'baz':
        print('ham')  # indented two levels
    return bar  # indented one level
The first line is not indented. The next two lines are indented to be inside of the function definition. They will only run when the function is called. The fourth line is indented to be inside the if statement, and will only run if the if statement evaluates to True. The fifth and last line is like the 2nd and 3rd and will always run when the function is called. It effectively closes the if statement above as no more lines can be inside the if statement below that line.
Indentation is used after:
1. Compound statements (eg. if, while, for, try, with, def, class, and their counterparts)
2. Continuation lines
More Info
1. Indentation style guide
2. Tabs or Spaces?
3. Official docs on indentation
@serene scaffold ?
I don't have one, sorry
yea about the tabs im typing the code in discord thats why the tabs arent there
well thanks
a company I interviewed for told me about their fake news detection algorithm, but I don't think they want me repeating it.
if the indentation isn't there when you paste the code into Discord, I have no way of knowing what the actual code looks like.
oh ok
then ill just make it in VSC then send it here
well ik i have asked from you alot
but just the last question
user.says() THIS
THIS
I HAVE SUFFRED FROM THIS
sorry caps but dude please tell me
that part is entirely made up. there is no user.says()
ik
but the user.says
has the input code and stuff
please tell me how do i make it
I don't know.
I like LightGBM too. For me, it's CatBoost, XGBoost, LightGBM in that order. 😀
I have two models, and I want to compare their output. I want to know which event did they predict differently.
And since I'm dealing with massive data frames, I want to use the most efficient way to compare. Efficient in terms of memory.
Yes, there is compare. So I should just use it?
Thank you
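If only one column matters, comparing just that column avoids building the full diff frame that `compare` returns. A sketch, assuming both frames share the same index and row order; the column names and values are invented:

```python
import pandas as pd

# Invented predictions from two models for the same events.
preds_a = pd.DataFrame({"event_id": [1, 2, 3, 4], "pred": [0, 1, 1, 0]})
preds_b = pd.DataFrame({"event_id": [1, 2, 3, 4], "pred": [0, 0, 1, 1]})

# Boolean mask: True where the two models disagree.
differs = preds_a["pred"].ne(preds_b["pred"])

# Events where the predictions differ.
disagreements = preds_a.loc[differs, "event_id"]
print(disagreements.tolist())
```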
ok, i have https://www.toptal.com/developers/hastebin/epelemifuz.properties as (a* algorithm) pathfind AI, i just wanna ask, where is the (0,0) point in this list https://www.toptal.com/developers/hastebin/novaconile.ini
Hastebin is a free web-based pastebin service for storing and sharing text and code snippets with anyone. Get started now.
is there an easy way to replace values with their mean for array area. For example:
[[0,6,0],
[6,3,1]]
and id want to take mean of [0,6].[6,3],[0,1] so result will be following:
[[3,4.5,0.5],
[3,4.5,0.5]]```
with numpy or without, anything will do
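With numpy this can be done by taking the mean down each column and broadcasting it back to the original shape. A short sketch using the array from the question:

```python
import numpy as np

a = np.array([[0, 6, 0],
              [6, 3, 1]], dtype=float)

# Mean of each column: (0+6)/2, (6+3)/2, (0+1)/2.
col_means = a.mean(axis=0)

# Repeat the column means for every row; copy() makes it writable.
result = np.broadcast_to(col_means, a.shape).copy()
print(result)
```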
you can use mean imputation to replace the numeric NaNs and mode imputation to replace the string NaNs. Both of these can involve DataFrame.fillna. If you have any follow up questions, keep in mind that I will not look at any screenshots of text, only actual text in a markdown block in the chat or in the pastebin.
which values are you trying to replace?
:) which one?
Apologies for sending them in such format. I am not looking to replace the values in the dataset, but instead whenever the function encounters a NULL value just to skip over it and do nothing with it. This is where I define the column list and the function
# positive integer columns
pos_int_col = data_check[['ApplicationFinancedAmount'
                          ,'AssetHighestValueGapRatio'
                          ,'AssetHighestValueManufacturingYear'
                          ,'DeductionPercentage']]
# Creating a function to check for negative values "find_neg_index" and print the value of the row and its position
# The function takes two arguments
# df - the dataframe to validate upon
# num_col - is a predefined list of integer values only columns within the dataframe / or a single integer column
def find_neg_index(df, num_col):
    neg_dict = {}
    # Iterating on column level
    for col in num_col:
        # Creating a list within the dictionary and adding the column name as key and input an empty list as its pair
        neg_dict[col] = []
        # Getting the full length of the dataframe, row
        indx_list = range(0,len(df[col]))
        # Creating an empty list for the index position
        neg_indices = []
        # Iterating on row level
        for indx in indx_list:
            # Extracting the value on each row the loop is working on
            val = data_check.loc[indx,col]
            # Setting the condition for the validation and transforming string values to numeric
            if pd.to_numeric(val) < 0:
                print('Find ',val,'at row',indx,'for column',col)
                neg_indices.append(indx)
        neg_dict.update({col:neg_indices})
    return neg_dict ```
After that, when I pass the dataframe and the list of columns to the function, I get the following error:
#error message
Find -62242.65 at row 25 for column ApplicationFinancedAmount
ValueError: Unable to parse string "NULL" at position 0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "NULL"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<command-156159945049379> in <module>
----> 1 find_neg_index(data_trans, pos_int_col)
<command-3934177330870497> in find_neg_index(df, num_col)
26
27 # Setting the condition for the validation and transforming string values to numeric
---> 28 if pd.to_numeric(val) < 0:
29 print('Find ',val,'at row',indx,'for column',col)
30 neg_indices.append(indx)
/databricks/python/lib/python3.8/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
152 coerce_numeric = errors not in ("ignore", "raise") ```
I just want for the function to skip over the NULLS but I am not sure how to achieve that.
you can use dropna
if you call dropna on a series, it will give you a copy of the series with no NaNs. if you do it on a df, it will give you a copy without rows that had at least one NaN
Thank you!
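Another option, besides dropna, is to let `pd.to_numeric` turn unparseable strings such as "NULL" into NaN with `errors="coerce"`; NaN never compares as less than zero, so those rows are skipped without any extra check. A rough sketch, assuming the columns hold strings like in the traceback (the `demo` frame is made up):

```python
import pandas as pd

def find_neg_index(df, num_cols):
    """Return {column: [row labels with negative values]}, ignoring unparseable cells."""
    neg_dict = {}
    for col in num_cols:
        # errors="coerce" turns strings such as "NULL" into NaN instead of raising.
        values = pd.to_numeric(df[col], errors="coerce")
        # NaN < 0 evaluates to False, so missing values are skipped automatically.
        neg_dict[col] = values.index[values < 0].tolist()
    return neg_dict

# Toy data, made up for illustration only.
demo = pd.DataFrame({
    "ApplicationFinancedAmount": ["100.5", "NULL", "-62242.65"],
    "DeductionPercentage": ["NULL", "-1", "3"],
})
found = find_neg_index(demo, ["ApplicationFinancedAmount", "DeductionPercentage"])
print(found)
```

This works on a whole column at once, so it also avoids the row-by-row loop.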
how does A* determine x and y position
What rl algorithm would be smart to choose for a simple shooter game.
The agent would be given the
position and rotation of enemies (direction they're facing)
distance to the enemies, and its own
rotation, speed, position and acceleration.
Possible moves would be
Turn left, right
accelerate forwards backwards left and right, fire.
Positive points for hitting enemies, negative points for being hit.
(If for whatever reason more complexity is needed, enemies could move, the agent could accelerate in 8 directions instead of 4 (diagonal) and the agent only gets position of visible enemies.)
is qr decomposition in numpy deterministic? as in, does it involve any random factors or will it give the same output for the same matrix every time?
if i had a drone learn how to fly, what would be the reward?
what is q learning?
hm yeah Q learning is related to reinforcement learning, and qr decomposition is a very different process.
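On the QR question: as far as I know `np.linalg.qr` calls a LAPACK routine with no random component, so repeated calls on the same matrix should agree (different libraries may pick different signs for Q and R, though). A quick way to check it yourself, with an arbitrary example matrix:

```python
import numpy as np

a = np.array([[2.0, -1.0, 0.0],
              [1.0, 3.0, 4.0],
              [0.0, 5.0, 6.0]])

q1, r1 = np.linalg.qr(a)
q2, r2 = np.linalg.qr(a)

# Same input, same factors: no random seed is involved.
same = np.allclose(q1, q2) and np.allclose(r1, r2)
# Q times R reconstructs the original matrix.
reconstructs = np.allclose(q1 @ r1, a)
print(same, reconstructs)
```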
Will i have to experiment when it comes making architectures like this:
not for you. the other guy is asking about RL algorithms bud

try a #help channel
this is the channel to ask for data science help. otherwise there are instructions in #❓|how-to-get-help. just don't interrupt someone else's help channel
tysm
Oh yeah yeah lol my bad
all good
any advice ? plz
Do questions about numerical integration methods go here?
I suppose, but unfortunately I think it's unlikely that you will get an answer.
hi! is it possible to create working sound classification model for atypical speech (stuttering etc) using this dataset? https://github.com/apple/ml-stuttering-events-dataset i'm fairly new to machine learning and my attempts left me unsatisfied.
most of these audio clips contain multiple labels (in rating system 0-3)
not sure
speech recognition is not my area of expertise
but it does sound like an interesting problem

it looks promising at first, but I cannot find the short stuttering clips
only longer interviews, seems too messy to work with
artists_european = artists_european.drop(['Position','Track Name', 'URL', 'Date','Region'], axis = 1) ```
Why does this code give me this?
But i get this when running ```py
artists_european = artists_european.groupby("Artist")['Streams'].sum()
How can i make stay in the previous format but just summing the streams for each artist?
just the previous format, indexing doesnt really matter
i can just reset it
the result of groupby(...)['Streams'].sum() is a Series indexed by Artist, not a dataframe, that's why it looks different https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
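If the goal is to keep a regular table with one row per artist, one option is `as_index=False`, which keeps "Artist" as a normal column. A small sketch with made-up rows standing in for the real data:

```python
import pandas as pd

# Made-up rows standing in for the real streaming data.
artists_european = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "B"],
    "Streams": [10, 5, 7, 3, 1],
})

# as_index=False keeps "Artist" as a regular column, so the result is a DataFrame.
summed = artists_european.groupby("Artist", as_index=False)["Streams"].sum()
print(summed)
```

Calling `.reset_index()` on the summed Series should give the same table; note that `reset_index(drop=True)` would throw the artist names away.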
there's a script to extract short clips, which i already did receive
but training model with this data is kinda unclear for me, how should i correctly label those clips?
ah that will help great ^^
if for example one clip contains 4 labels, with associated number from 1 to 3
oh I don't know that much about it yet, just the very basics
I learn mostly Python and pandas lately, ML just every now and then not deep yet
is there an alternative that would do what im looking for?
I think your best bet would be to turn the groupbydf object back into a dataframe
is the dataset publicly available ?
this is what it seems like bud
3 means all 3 agree on that label
No it isnt unfortunately
yeah exactly! training my model with only clips that all reviewers agreed to give certain label resulted in 60% accuracy.. so what really could be improved?
ah well I'll make a simple mock-up version, hold on
i think i might have figured it out
suggest a functional api plz, columns 7 and 8 are highly correlated, the others comparatively much less
have you tried with 2 reviewers or even different models. i think at this point its just trying different approaches/ methods
but my font is weird especially for suicideboy, is there a way to make all of the font types to be the same?
i also dont know what you tried so theres always dif ways for improvements
i've tried multiple approaches with no success, therefore my main question is — is this dataset capable of creating actually working model?
who knows tbh
you cant really look at a dataset and know immediately without exploring and/or trying a few models

so there's hope, that's all i really needed to know lol
yes u did help! thank you 🙂

@strange stump what do you mean by just plotting and gradients?
you seem like youre at least comfortable in excel, no?
scatter plots for example
thats a good start
i see
We’re only allowed to use the basic ones
had some "useless" data i needed to clean and then identify peaks
i used python for this
physics is a good background for this tbh
since you are used to working with messy data 
ya thats probably why i got the interview
thats good
i would try to step back and try to understand the dataset when you open it up
look at the column names, see if theres any attached documentation that might give you more context
yeah ok
then you can maybe figure out what exactly you want to plot
so to my understanding
what is independent vs dependent
typically but you may also receive an already cleaned dataset
true!
if youre more comfortable in python, then feel free
whatever youre most comfortable with
then after understanding the columns, i would do some EDA (Exploratory Data Analysis)
i just gotta draw some conclusions about the variables they give
pandas is really good for EDA
i might need to look at that more
i just remember using pandas to store data in a dataframe
and then doing fourier transforms on my cleaned data
yeah artists_european.reset_index(drop=True)
the $ make it cursive, weird cannot find info about it online
since you can see avg, max, min, count, etc.
lol you probably wont have to do fourier transformations here unless there is signal processing involved 
HAHAHAHAH
yeah its just nice discrete data i hope
i wouldnt expect anything too spicy just a nice simple graph
probs otherwise it would be a bit much to expect from a data analyst role
graduate level too...
yeah how comfortable are you with python viz libraries
either matplotlib or plotly or seaborn
make sure you can
- plot graphs
- label axes and titles
- draw simple regression lines
the other channel user said seaborn is just for visualisation too but not necessary
yeah it just looks nicer is all
im partial to plotly myself
but you should be able to still convey the info with matplotlib
should i bother learning how to present data with seaborn?
i mean if its visually pleasing they might like it more?
or do you think the company cares more about the info
psychology and sht like that
i mean its not that hard to pick up tbh
so up to you
i wouldnt use matplotlib graphs in any documentation or papers but thats me

i think theyre a bit ugly too yeah xD
ok fine i got the visualisation bit
and maybe the analysis
so i should be good then?
good
i just gotta answer some questions apparently
get some practice in working with regular datasets and you should be good
just so you dont run out of time
oh yeah
ok i should be good then
thanks man
ill ping you next week IF i need it 😄
Is there a better solution to what I did?
not that I know of
Using an existing neural network model (efficient net b1) for object recognition, but untrained. Trying to fit it on dataset of 400 classes (bird images) with about 100 images per class. Is this going to take really long to train/ is it possible from an untrained efficient net b1 model?
Running it with pytorch on a 2080 gpu with cuda
according to https://keras.io/api/applications/#usage-examples-for-image-classification-models it takes around 5.6 ms for each inference, but other than that I have no idea about what to even look for
hmm
I'm more so wondering if it would even converge after a reasonable amount of epochs
it takes about 5 mins per epoch (40k images)
But I guess there's no good way to tell
Training such a massive network from scratch seems not very do-able
7.9M parameters with a depth of 186...
yeah haha
using transfer learning it's literally 1 epoch that takes about 1 minute and 90% accuracy
Doing it for a project comparing pre-trained and non-pretrained networks
maybe look into https://keras.io/api/applications/efficientnet_v2?
But otherwise i'd have to design my own network, which makes the comparison basically worthless
Why would this be better if I may ask? is it smaller?
the v2 sounds like it should be a direct improvement on the v1
right haha
same authors, two years of progress later down the line
Don't see it in pytorch, and all my code is with pytorch rn so that would complicate a lot
I'll just try efficientnet b0 for now, it seems a lot smaller and at least 3 times faster per iteration
!paste
Currently looking at this code (custom CNN model using PyTorch), And i'm not completely sure how the shapes match for a specific line (line 46)
The input shape there is 64 x 7 x 7 but in the forward pass they explain that the output after the layer before it would be 128 x 7 x 7 (line 68)
The code seems to work fine however, so is the comment wrong, or am I missing something?
And a bonus question, They seem to bundle these layers up multiple times. Does this pattern have a name? what does res stand for?
Appreciate any response!
looks like its a typical ResNet "residual block"
ResNet follows VGG’s full convolutional layer design. The residual block has two convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function.
im not even a computer vision guy

Ah that would make sense, just read about the residual block coincidentally too lol
Thx for the reply!
still wondering about this though if anyone could explain that (might reply late, going to sleep rn)
yeah maybe the comment is wrong. i believe you can always check
have it print the size at that line or something
good night


I feel like one thing that people looking to advance in their DS careers tend to not think about is DE and Devops stuff, as well as Business-related things. It's totally possible to become a staff or whatever DS without this stuff, but having a general understanding, in my opinion, makes one much more competitive in the industry and allows for a more holistic understanding of the entire data pipeline --- instead of just modeling.
But since I'm doing MLE right now, my opinion is pretty biased, ha.
hello python File "C:\Users\Admin\AppData\Local\Temp/ipykernel_2592/3914410830.py", line 1 nf_df_cur_exp = df[df['Expiry_new'] == 2022-03-24] ^ SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
how i can fix above error ?
when i try python nf_df_cur_exp = df[df['Expiry_new'] == 2022-0o3-24] i am getting empty dataframe
can anyone help me in this ? ping me when reply
2022-03-24 is not valid python syntax.
Did you mean "2022-03-24"?
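For context, a bare `2022-03-24` is parsed as integer subtraction, which is why the `0o3` version runs but matches nothing: it evaluates to 1995. A small sketch of the two usual comparisons, assuming the column holds either strings or datetimes (the toy frame is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Strike Price": [14350.0, 14400.0, 14350.0],
    "Expiry_new": ["2022-03-24", "2022-03-31", "2022-03-24"],
})

# If the column holds strings, compare against a string.
nf_df_cur_exp = df[df["Expiry_new"] == "2022-03-24"]

# If the column holds datetimes, compare against a Timestamp instead.
df["Expiry_dt"] = pd.to_datetime(df["Expiry_new"])
same_rows = df[df["Expiry_dt"] == pd.Timestamp("2022-03-24")]
print(len(nf_df_cur_exp), len(same_rows))
```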
Hii thanks for your response
I have fixed the issue
i have question about loss while training a model
Epoch 1/10
8/8 [==============================] - 14s 578ms/step - loss: 529.6362 - accuracy: 0.7676 - categorical_crossentropy: 529.6362 - val_loss: 62.1763 - val_accuracy: 0.0000e+00 - val_categorical_crossentropy: 62.1763
Epoch 2/10
8/8 [==============================] - 2s 248ms/step - loss: 466.1245 - accuracy: 0.8966 - categorical_crossentropy: 466.1245 - val_loss: 78.1461 - val_accuracy: 0.1207 - val_categorical_crossentropy: 78.1461
Epoch 3/10
8/8 [==============================] - 2s 248ms/step - loss: 201.3840 - accuracy: 0.9024 - categorical_crossentropy: 201.3840 - val_loss: 139.4732 - val_accuracy: 0.1762 - val_categorical_crossentropy: 139.4732
...
Epoch 9/10
8/8 [==============================] - 2s 252ms/step - loss: 60.5674 - accuracy: 0.9659 - categorical_crossentropy: 60.5674 - val_loss: 897.9677 - val_accuracy: 0.7778 - val_categorical_crossentropy: 897.9677
Epoch 10/10
8/8 [==============================] - 2s 245ms/step - loss: 66.1619 - accuracy: 0.9669 - categorical_crossentropy: 66.1619 - val_loss: 924.7500 - val_accuracy: 0.8333 - val_categorical_crossentropy: 924.7500
3/3 [==============================] - 1s 241ms/step - loss: 414.3506 - accuracy: 0.8426 - categorical_crossentropy: 414.3506
i used a small number of epochs (10) since my dataset is very small. I dont know if it's normal that the val_loss spiked so quickly, but the evaluate results are fine
that never happened to me
when you say the data are very small, what you really mean with that?
@inland zephyr
18 class, each class have 18 image
when using 80:20 split, so 16 train and 4 test. Although that, i also set in training phase 0.1 validation split
i think its fine when that happens
but...
I'm not 100% sure
I'm new in the machine learning
but I think its fine
but this is what happens when i call model.evaluate()
Accuracy: 0.790123462677002
AUC: 0.5
Precision: 0.0555555559694767
Recall: 0.0555555559694767
F1-Sco: 0.0555555559694767
I think this is troublesome
my dataframe this way ```python
1 Strike Price Token_x Exchange_x ... Vega_y Gamma_y Expiry_new_y
0 14350.0 102048025.0 NgdE ... None None 2022-03-24
1 14350.0 102048025.0 NSsgE ... None None 2022-03-24``` i want to make first row as header
ping me when reply
HI guyz
can anyone help me how to implement this one?
I implement this in C++ but the output seems like not correct
void train(vector<vector<double>> xy){
    int x = 0, y = 1;
    int epoch = 3;
    while (epoch--){
        random_shuffle(xy.begin(), xy.end());
        double tot_err = 0;
        while(tot_err < 0.01){
            for(vector<double> data : xy){
                double y_c = predict(data[x]);
                // a.
                err = data[y] - y_c;
                tot_err += err * err;
                // b.
                b1 = b1 + alpha * err * data[x];
                b0 = b0 + alpha * err;
            }
        }
    }
}
the epoch here happens with the variable x and y are done distributed.
Total error: 1.80167e+09
Total error: 1.45195e+09
Total error: 1.54914e+09
y = -2556 + 6608x
the good thing here is that the total error is not zero, but y hat should be something like y = -2467 + 256x, with a slope in the hundreds rather than the thousands; my output is too different.
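It's hard to say from the snippet alone what goes wrong, but two things stand out. As written, the inner `while` almost certainly runs a single pass per epoch (tot_err starts at 0 and exceeds 0.01 after the first pass), so the model only sees the data three times. Per-sample updates on unscaled x also tend to blow up unless alpha is tiny. Here is a rough Python sketch of the same update rule with standardised x and more epochs; the defaults and the toy data are assumptions, not values from the original code:

```python
import random

def sgd_linear_regression(xs, ys, alpha=0.01, epochs=200, seed=0):
    """Fit y ~ b0 + b1 * x with per-sample gradient updates."""
    # Standardise x so one learning rate works regardless of the input scale.
    mean_x = sum(xs) / len(xs)
    std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5
    zs = [(x - mean_x) / std_x for x in xs]

    rng = random.Random(seed)
    data = list(zip(zs, ys))
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for z, y in data:
            err = y - (b0 + b1 * z)
            b1 += alpha * err * z
            b0 += alpha * err

    # Convert the coefficients back to the original x scale.
    slope = b1 / std_x
    intercept = b0 - slope * mean_x
    return intercept, slope

xs = [float(i) for i in range(1, 21)]
ys = [5.0 + 3.0 * x for x in xs]  # noise-free toy data
intercept, slope = sgd_linear_regression(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

On real, noisy data the fit will not be exact, and alpha and the number of epochs usually need some tuning.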
here's code for simple linear regression code from scratch: https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/simple_linear_regression.py
def sum_of_sqerrors(alpha: float, beta: float, x: Vector, y: Vector) -> float:
    return sum(error(alpha, beta, x_i, y_i) ** 2
               for x_i, y_i in zip(x, y))
@tacit basin what is the zip(x, y) here?
it pairs up the elements of x and y one by one, like this:
>>> a = [1,2,3,4]
>>> b = [5,6,7,8]
>>> list(zip(a,b))
[(1, 5), (2, 6), (3, 7), (4, 8)]
okay thanks...
does anyone know when scikit-learn's random forest was released?
I am quite surprised that these authors did not show in their results that RF can achieve higher than their NN. On this data set I see it is 74% with RF without oversampling
is this sort of academic trickery prevalent?
And I wonder why these essentially homework level projects are being published
yeah i agree. theres def things that can make you more competitive and i feel like many DS steer clear away from the DE/DevOps stuff which is a shame tbh
model.add(Dense(8, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(4, activation='tanh'))
model.add(BatchNormalization())
model.add(Dense(3, activation='softmax'))```
comment of performance please
yeah, focussing on that
Maybe the test images are cherry-picked? @mint palm
my dataset is huge, half a million examples, i also used train_test_split and shuffle
Have another pytorch question, when using transfer learning you often see something like in this code I attached. the model.classifier = line. Is this an existing part of the model that we replace with our own layers?
Just a lucky draw on the test images maybe
maybe removing seed would do that
Does anyone know TF-IDF well? My question is: should we remove extremely rare words/features and the most common features/words when producing TF-IDF vectors by using min_df and max_df?
please i need help, i want to crop the passport of each student based on the order of the students in the album. i wrote an algorithm which can crop the passports, but it crops them in random order, whereas i want the first passport to be 001.jpg and the second passport to be 002.jpg.
We did this as well, but our teacher actually pointed out that even very rare features can be very decisive. We had to classify the review rating of recipes, and stuff like "bell pepper" could make everyone give the recipe a 1 star rating, even though bell peppers aren't super common.
Does that make sense?
hmm i see
@iron basalt Pretty disappointed in Numenta all in all, the fact that they're resorting to such base tricks to try and show the performance of their methods is...honestly appalling.
they feed an explicit one-hot-encoded vector to their model for a meta learning, multi-task RL env and they have the gall to call it a "prior" which other DL models don't have access to?
pretty much exploiting the definiton of a prior smh
i tried changing seed and validation size, it had a similar or very slightly worse effect on the difference between val and train accuracy, can i conclude my model is ok?
I mean if that test accuracy is correct then it seems fine
but it is weird that your test accuracy is higher than training accuracy
So there might be some unknown underlying problem
you can just do series1 == series2, which will give you a bool-valued Series, with True where they are equal and False otherwise
you should check the docs to see what compare does, it probably does more than you need
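A small sketch of both options, assuming the two sets of predictions are Series with the same index (the values here are made up):

```python
import pandas as pd

# Made-up predictions from two models on the same five events.
pred_a = pd.Series([0, 1, 1, 0, 1], name="model_a")
pred_b = pd.Series([0, 1, 0, 0, 0], name="model_b")

# Boolean mask: True where the two models disagree.
differs = pred_a != pred_b

# Index labels of the events that were predicted differently.
changed_events = differs[differs].index.tolist()
print(changed_events)

# Series.compare returns a DataFrame holding both values for the differing rows only.
side_by_side = pred_a.compare(pred_b)
print(side_by_side)
```

The boolean mask costs one byte per row, so it should stay cheap even for large frames; `compare` builds a new DataFrame, which is fine when only a small share of rows differ.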
Is it unconventional to not freeze a large part of the model when using transfer learning?
Can someone help in why is it not working when I try it in loop
model is a list of strings
we'd need to see the whole error message. the most salient part is off-screen. Also, text is pretty much always better than screenshots
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Anyway, it looks like you're trying to select rows where df['Model'] is an element of model, is that right?
IndexError Traceback (most recent call last)
<ipython-input-50-9ca28fa27480> in <module>
1 for m in model:
----> 2 x=df[df["Model"]==m].sort_values("Total",ascending=False).iloc[0] ##Taking the one with the maximum Total
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
893
894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)
896
897 def _is_scalar_access(self, key: Tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1499
1500 # validate the location
-> 1501 self._validate_integer(key, axis)
1502
1503 return self.obj._ixs(key, axis=axis)
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_integer(self, key, axis)
1442 len_axis = len(self.obj._get_axis(axis))
1443 if key >= len_axis or key < -len_axis:
-> 1444 raise IndexError("single positional indexer is out-of-bounds")
1445
1446 # -------------------------------------------------------------------
IndexError: single positional indexer is out-of-bounds```
yup
and why do you want that?
there are repeated rows. I wanna take the one with the maximum "Total" column. Because it's the latest
try df.groupby('Model')['Total'].max()
without the for loop. just that one statement by itself.
Then do i drop the duplicates and replace their total value from this new DF?
what I showed you is just intended to give you the max Total value for each value of Model. I don't know enough about your goal or what data you have to guide you beyond that
If I were to do so, I'd have to see the rest of the DataFrame. but I'm heading out now.
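For what it's worth, "single positional indexer is out-of-bounds" on `.iloc[0]` usually means the filtered frame was empty, i.e. some string in `model` matched no row. If the aim is to keep the whole row with the largest Total per model, a sketch like this avoids the loop entirely (the "Other" column and the values are made up):

```python
import pandas as pd

# Toy frame with repeated models; the real columns are unknown beyond "Model" and "Total".
df = pd.DataFrame({
    "Model": ["A", "A", "B", "B", "C"],
    "Total": [10, 25, 7, 3, 40],
    "Other": ["a1", "a2", "b1", "b2", "c1"],
})

# idxmax gives, per model, the index label of the row with the largest Total.
latest = df.loc[df.groupby("Model")["Total"].idxmax()]
print(latest)

# Alternative: sort by Total and keep the last row for each model.
latest_alt = df.sort_values("Total").drop_duplicates("Model", keep="last")
```

Both versions keep every column, so there is nothing to merge back afterwards.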
The way fastai fine tune works by default is to train one epoch with frozen weights except for the head, then multiple epochs with unfrozen weights, but with differential learning rate, that means some earlier layers get updated with much lower learning rate than the later layers
what does it mean by n_sample and n_feature
in above
with "except for the head" you mean put on a different head and train that right? and what do you mean with differential learning rate?
Hey, I'm trying to make my own trading bot.
Does anyone know a good API what I should use to gain stock information (not only the value/volume but also values from indicators like MACD / RSI)?
Guys is dataquest a good site to learn python? i've learned the basic so now i will move on doing projects learning about data science ( Data analysing and Machine learning) are there better place to learn or this is good? i like learning visualising and doing projects
it's the most popular thing in Python (not counting web development with Django); data manipulation with pandas and numpy, followed by visualisation with matplotlib, are the most common things today
it's pretty good, but only if you have the money to spare. You can learn similar things all over the internet for free
Yes correct, first epoch only trains the new head. Differential learning rates is when different layers in the network are trained with different learning rates.
ah alright
Using SGD now with decaying learning rates for the entire model so won't be using that I think
thx for the replies!
You get it for free with the fastai learner. You can use an LR scheduler like cosine annealing and, on top of it, differential learning rates
ah cool. Just started using pytorch for this project. I'll just probably try to wrap up this project as soon as possible to start on the report. but pytorch seems really cool.
The fact that it is a lot lower level than sklearn, which is what I mostly use, really helps understand stuff better
Fastai is layer on top of pytorch https://docs.fast.ai/callback.schedule.html#Learner.fine_tune
Sorry it's discriminative learning rates not differential
Is this still about that same paper? Can you link it again? Let me put it this way, I am impressed with Numenta's ideas, not their results or comparisons with others. For example, there are several others that have also gone and run with the grid cell idea and their stuff seems to be getting results. So I would suggest taking their ideas and trying to make them work yourself, and avoid the issues that they have. There is ofc always drama in the ML community and such. If you think the idea might still have some merit to it but they did it wrong, either in their implementation or method of testing / comparison, then you can do it the right way. You can find those that blindly follow Numenta's work, and those that are overly dismissive.
If anyone out here with experience in ai & ml field doesn’t mind specifying a solid book for Ai beginners/juniors please let he/she kindly do as I’m really confused on Ai learning
Been doing data manipulation in pandas for like 6 months now for work and just now got a strong hold on apply, and now I feel like i can both rule the world, and need to reinvent everything I wrote so far.
So full disclosure I originally asked this in a help channel but it’s kind of an open ended question so I think it fits here better.
Hello, so I have a project where I need to parse through a txt file of a classic novel, examine all lines of spoken dialogue, and (this is the hard part) decide which character speaks which line.
My teacher has not lectured us on NLP before, and I honestly don’t know where to start for the actual classification algorithm. If anyone can help guide me with any tips on what I would have to employ, links to resources (that aren’t too mathy for a HS sophomore), explain some packages that can help, etc., that would be great, thanks!
the simplest way would probably be to use words or groups of words that one character uses more than other characters
what class is this for?
My AI course in school. Also would the NLTK be any help for this?
yes, nltk can help you make ngrams. an ngram is a tuple of n words/tokens/"grams". so you might pick an n value of 3, which are trigrams. and then you'd look for trigrams that correlate with specific characters.
How does one determine the correlation though if its not the case that the same exact three words are said repeatedly, is it an ML algorithm?
Sorry I’m a noob at this I don’t really know ML
depending on the complexity of the text, simple classification models should suffice
otherwise
youre looking at maybe more advanced stuff

if you do it based on trigrams, you'd have to make trigrams for everything that every character says in your training data, and count them. and then add them up for all the characters. and then see which ones are high for a given character but low for all the others.
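The counting idea can be sketched in plain Python like this. It assumes you already have some lines labelled with their speaker; the names and sentences below are made up, and nltk has its own ngram helper if you prefer that over the hand-written `trigrams` function:

```python
from collections import Counter

def trigrams(text):
    """Return the list of word trigrams in a piece of text."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

# Hypothetical labelled lines: (speaker, line of dialogue).
lines = [
    ("Alice", "oh dear me what shall we do"),
    ("Alice", "oh dear me the tea is cold"),
    ("Bob", "what shall we do about the tea"),
]

# Count trigrams separately for each character.
counts = {}
for speaker, line in lines:
    counts.setdefault(speaker, Counter()).update(trigrams(line))

def distinctive(speaker):
    """Trigrams that this character uses and nobody else does."""
    others = Counter()
    for name, c in counts.items():
        if name != speaker:
            others.update(c)
    return {g: n for g, n in counts[speaker].items() if others[g] == 0}

print(distinctive("Alice"))
```

A new line could then be scored by how many of each character's distinctive trigrams it contains. Requiring zero uses by everyone else is a deliberately crude cut-off; a ratio of counts would be softer.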
I think it’s supposed to be unsupervised learning
uh okay. are there any other details like this?
Give me a moment
this is like when you work with a business stakeholder
no offense to business peeps

ye, https://arxiv.org/abs/2201.00042
my main issue is why the level of inconsistencies in the overall testing methodology? put it on the forum, authors won't even reply 🤷♂️
its not a major thing really, but....it does put a dent in Numenta's overall credibility
Ok so basically the project is: Build a "profile" for each character in the novel. The profile includes all of their spoken dialogue as well as a list of adjectives that would accurately describe their characterization in the novel. I'm pretty sure it's supposed to be unsupervised clustering (ie I will not go through the novel by hand and match words to characters).
This is my teacher's first year doing this lab so it's pretty open-ended, I don't need something perfect
what do you mean "by hand"? none of it will be "by hand" because you'll write a program that does it.


