#data-science-and-ml
cool, no worries
I dread the struggle but it's 100x the learning experience of anything else
Bonus if I can get her or the CEO to vouch for my implementation on my resume lol
you are fine-tuning a model - how hard can it be?
personal experience - the grind is usually only for a few days
that would look so cool on your resume!
No data pipelines and stuff? I doubt I can just toss raw CVs in there lol
you're pretty lucky to have such an opportunity - make the most of it
you do have to make them, it's just that most of it is automated by Huggingface methods
Thats the plan 😊
all in all, a simple text classification takes 60 lines
I doubt NER would push it any more
^ including data preprocessing BTW
Some simple lemmatization and tokenizing wouldn't be much code yeah
I'll look into it once we formally define the project. Thanks a lot man 😊
ner
that is a thing that I know about
what is happen
what is happen?
so wait..
is the hidden layer in a neural network just taking the values from the input layer, and repeating the same sigmoid process?
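Roughly, yes: each hidden layer takes the previous layer's outputs, applies its own weights and bias, and then the activation. A minimal sketch, assuming sigmoid is used at every layer and with made-up random weights (the layer sizes are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Feed x through a list of (W, b) pairs, applying sigmoid after each."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)  # same recipe at every layer, different W and b
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input (3 values) -> hidden (4 units)
    (rng.normal(size=(1, 4)), np.zeros(1)),  # hidden (4 units) -> output (1 unit)
]
out = forward(np.array([0.5, -1.0, 2.0]), layers)
```

So the process repeats, but each layer has its own weights, which is what lets the network learn something a single layer could not.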
Anyone good with NLTK?
will start a project related to NLTK this weekend, hope we can learn together
I was actually going to ask a question about it lol
Oh okay lol
don't ask to ask, just post your question here
Well I posted it in a help channel but:
I've got a question - I'm trying to evaluate some sets of ngrams using a training set and a data set from a corpus and write some statistics about it. I have to include two measures: accuracy (which is easy using the ".evaluate" method in NLTK), but I also have to find the "words/error" rate. So for example I'm:
using the Brown corpus' subcorpus "news"
this subcorpus has 100,554 words in total
I've split the corpus into a training set of 500 sentences, and the rest are the testing set (the subcorpus has 4,623 phrases in all).
I have a partially filled chart with an example:
The default ngram tagger has an accuracy of 30.41% (when comparing the training and testing sets) and an error rate of 1.4 words/error
But I can't figure out how this 1.4 words/error was calculated. Anyone have any ideas? I think, and I may be wrong here, that it's calculated as a function of the rest of the corpus that is NOT accurate - that is, 100% - 30.41% = 69.59% of the corpus. That number is then related to the total number of words somehow I think?
...yeah, I have no idea either...lol...
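One reading that reproduces the number in the chart: if 69.59% of words are tagged wrongly, there is on average one error every 1 / 0.6959 words. This is a guess based on the single example row, not something the assignment confirms:

```python
def words_per_error(accuracy):
    """Average number of tagged words per tagging error."""
    error_rate = 1.0 - accuracy   # fraction of words tagged wrongly
    return 1.0 / error_rate       # one error every this many words

print(round(words_per_error(0.3041), 1))  # 1.4
```

Under that reading the total word count cancels out: total_words / total_errors is the same as 1 / error_rate.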
@grave frost can I pm you man?
What is the best way to convert an excel .xlsx file to csv? Or alternatively, how to properly read in an excel .xlsx file into a python program so that you can parse it and write what you want to a separate csv?
you can read it via pandas
and you can export csv with pandas as well
So, is there a way to have pandas library work with a script that I run, or do I need to interactively be in a pandas notebook and run the commands?
pandas is Python, you can write a script
notebook is just convenient for testing
Do you happen to have a good link to something explaining this that worked for you? I tried following a couple links I found but was getting errors. These posters were using a combination of openpyxl and pandas tho, so not sure if that had something to do with it.
honestly it's difficult without actually having the files
pandas has a native excel reader
pd.read_excel
you can do it in one line too
pandas.read_excel("filename.xlsx").to_csv("out.csv")
Since this file has multiple sheets, would I just do pandas.read_excel("filename.xlsx", sheet_name=0).to_csv("out.csv") if I wanted first sheet, or is that inferred if I dont specify sheet?
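For the sheet question: `sheet_name=0` (the first sheet) is already the default in `read_excel`, so both forms read the same sheet. A small sketch as a script-friendly function; the function name and file names are placeholders, and reading .xlsx assumes openpyxl is installed:

```python
import pandas as pd

def excel_sheet_to_csv(xlsx_path, csv_path, sheet=0):
    """Read one sheet (by position or by name) and write it out as CSV."""
    df = pd.read_excel(xlsx_path, sheet_name=sheet)  # sheet_name=0 is also the default
    df.to_csv(csv_path, index=False)                 # index=False skips the row-number column
    return df
```

Calling `excel_sheet_to_csv("filename.xlsx", "out.csv")` converts the first sheet. Passing `sheet_name=None` to `read_excel` instead returns a dict of all sheets keyed by sheet name.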
hi can someone tell me after training cnn model how to predict for custom input images
im using tensorflow
Hmm, so I'm getting an error with several lines in the traceback showing directories of python packages and at the end it says TypeError: expected string or bytes-like object
your path seems to be wrong
or something
Ohh. Do I need xlrd or openpyxl installed or should this work with just pandas?
xlrd is usually needed
as the backend
pandas abstracts its usage if im not mistaken
So I honestly have no freakin clue on how to pick an adequate model. Any resources anyone would recommend?
what are you trying to do
your model depends on your problem and your data
Hmm, still getting errors, even with a new venv with pandas. \AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\openpyxl\descriptors\base.py", line 57, in _convert
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'datetime.datetime'>
I installed openpyxl as well as xlrd after running with just xlrd returned a message saying I needed openpyxl for .xlsx files
Tbh not sure a good general way to describe it. I’ve got a bunch of simulation data that changes four different initializing parameters for each generation of a grid. I want to train a model to predict two of the input parameters based on the other two input parameters and the output parameter of the simulation.
Two relations appear to be somewhat linear. Maybe another is a broken or segmented power law, no clue the shape of the final parameter.
I also know that is a piss poor explanation too
ML is not my world😅 I’ve got no idea what I’m doing lmao.
but what kind of prediction do you want. start there
Can I redirect that back to you and ask what classifications of predictions exist? Ooof
can somebody tell me how to load and predict using this model https://github.com/dasoto/skincancer/tree/master/models
Hey 👋 I’m coming from web dev background how to get started in machine learning and ai??
learn the use of numpy, pandas, keras and tensorflow and do simple projects at first I would say...
Okay can you list the math needed?
It's in the pins.
Where though
Seen ...
Thanks
Similar to what I found just wanted to confirm... and so my journey begins ...
Hey all!
I'm very much an AI beginner (if even that) and I had a small project in mind for a browser game.
Basically I want to teach an AI to generate a "good unit composition".
But I'm wondering what the cost function would look like.
A battle works as follows: both opponents start with a unit composition (e.g.: Player1: 10 light fighters, 5 heavy fighters, 2 battleships & Player2: 15 light fighters, 2 heavy fighters, 2 battleships). Some units are better against certain units than others, and some units cost more than others.
So.. after the battle you can simply sum up the units that you lost, and that is an indicator of how well your "team" performed.
Moreover you can sum the "damage" that you dealt, which is another indicator.
But how do I get value out of this?
I imagine my AI giving me some unit composition.
Then I calculate the battle.
And then feed it with what? I mean I can't tell what "the best" unit composition is
I have a question, where and how can I learn AI for a beginner?
What's the most respected data science coursera course
sure
is this the right channel to ask stats question?
Your value function should accurately track what you want the result to be. In this case, the problem is that what you actually want is probably to win, which correlates with, but isn't strictly equivalent to losing few units and killing a lot. Do you have win/loss data to train on?
Do you mean reward function? Value function has a specific meaning
Oh yeah, I actually meant utility/reward.
I read a research paper where they split the dataset into 2 parts to test out different types of model philosophies, and in their testing they proved their point via train and validation accuracy. Do you think I should consider their results to be genuine, even though their testing methods seem sketchy? I would have expected some amount of CV at least in such a paper.
Idk if I understand that correctly..
I mean.. yeah I want it to win, but I need it to win „really good“ (with little losses for example)
And hypothetical I can generate win/loss data with „randomly generated „players““
Hmm, how are you going to generate win-loss data?
If you have a way to determine how likely a configuration is to win, then you basically already have the important half of a model to generate best configurations, don't you?
Yeah I think I just realised that AI likely won’t solve my problem.. lol.
Well I can just generate two random unit compositions (of equal „price“) let them battle a hundred times, avg the data and there I am.
My problem is, that idk what’s „the best“ unit composition against something.
So I don’t know how to train the AI
I want to input one unit composition and the AI should tell me what it would use to fight it, but since I can’t train it that’s probably not going to work by just telling it „this wins against that“ right?
Sorry for the bad language only typing this on my phone rn
Deep learning datasets tend to approach big enough that cross validation isn't really necessary
2 splits end up okay
I have just a list of 43 texts. I have encoded them via the bert-base model from Huggingface, limited to 512 tokens.
tokenized = df.text.apply(lambda text: tokenizer.tokenize(text)).to_list()
inputs = tokenizer(tokenized, is_split_into_words=True, truncation=True, max_length=512, padding='max_length', return_tensors='pt')
outputs = model(**inputs) # this line fills my ram
I have 16 GB of RAM and another 16 GB of swap. When the third line runs, the RAM usage goes beyond my machine's limits. I have tried many different things but nothing helped.
What could be done? My texts have long sentences, and some of them may contain no sentences at all; could this be the problem? I have limited them to 512 tokens though.
Basically, what data do you have?
- Do you have data on real battles - unit compositions of both teams and who won, say?
- Can you run simulated battles (and have the outcome be, well, realistic, so that you can train on this data)?
In the end, I just want to get hidden states to represent my text as vectors. Via last_hidden_states = outputs.logits
ahhh, that's great, but the paper was for Low Resource Languages 😒 so I would check how much data they actually used. thanks a lot, for the info
sorry I don't have the in-depth knowledge for under-the-hood stuff 😅 ConfusedReptile or Raggy would better answer your question
Okay, thanks for the recommendation
hey guys
i am implementing siamese network
and i am trying to improve the accuracy of my model
currently i am using single-channel(b&w) images
of size (32*32)
and my training is using the Adam optimizer with learning rate = 0.001
can someone help me try improving the model
i can use transforms and optimization using optim
but i dont know what or how to use them
hey, does anyone know where I could find some English texts to classify?
i want longer than 1 page.
you mean a dataset, or just some texts without labels?
does not matter, I can make labels, I just need 4 or 5 texts long enough to build markov chains
checking kaggle now
Hi
a dataset is ok if it's big enough, I mean, if there are enough words and I can see where a sentence ends and starts
can someone help me with ai in #help-cherries
because if you don't need labels, you could maybe get free articles from arxiv or something
yeah, wiki would work too
that would also work, I wanted this, but it seems too short ;p https://www.artofmanliness.com/articles/the-35-greatest-speeches-in-history/#Theodore_Roosevelt_duties_Of_American_Citizenship
hi guys, i have a question that i hope someone can help me solve this problem.
I'm doing my capstone project, the project is about smart door lock using facial recognition. I have already built the API using Flask and OpenCV for capturing faces and training. After that, I don't know how to put the model to the Raspberry Pi for facial recognition. If you have any ideas, please let me know. Thanks for your help guys.
is the problem just with putting it on the raspberry pi?
if so, try asking in #microcontrollers
I don't know how to put the model to Raspberry Pi through Internet
#microcontrollers is the place, then
Okay thanks
oh no, not another one
its called QuestionAnswering - check out Huggingface
Hey anyone knows where I can find pandas.to_datetime sub-functions? like .month_name()?
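On a Series, the datetime methods live under the `.dt` accessor; on a single Timestamp (or a DatetimeIndex) you call them directly. A short sketch with made-up dates:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2020-08-15", "2020-12-01"]))
months = s.dt.month_name()                         # Series: "August", "December"
single = pd.Timestamp("2020-08-15").month_name()   # "August"
```

The full list (`day_name()`, `year`, `dayofweek`, and so on) is in the pandas docs for the `Series.dt` accessor.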
Can I use sigmoid as a normalization function if I only have one output activation? In theory it does the same as softmax in case of multiple output activation
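With a single output, sigmoid is the one to use: softmax over one value always returns 1, while sigmoid(z) equals the softmax probability of z against a second logit fixed at 0. A small numeric check (the value of z is arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.7
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]   # two-class softmax with the second logit pinned at 0
p_single = softmax([z])[0]         # softmax over one value is always 1.0
```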
Pretty much both yeah, there is a website where players upload their „combat logs“ with a couple thousand battles (however there might be a lot of cases where a player with a far bigger army fights someone smaller)
And yeah I can reliably calc the outcome of a battle and the outcome is realistic so I can hypothetically create data myself
Hey, I'm a PhD student and I learnt Python basics a few months ago and I'd like to develop deep learning models. From what I've read, Keras is easy to use but PyTorch can make more complex and flexible models. As I'm not really good at programming with Python (like creating classes), is it worth using PyTorch or should I stick with Keras? Thanks for the help
I have an array of N strings, and I want an encoder that can take a string and return an array of all 1s and 0s where a 1 indicates that the string at the ith position in the original array is a substring of the string being encoded
I don't believe this is one hot encoding. I'm pretty sure that would return arrays of N^2 elements
What is this called?
Tensorflow and Keras are pretty great for starters - both offer quite some flexibility, but a lot of research is done via pytorch and a lot of Businesses use Google framework (TF).
ith position should be a single number to be one-hot encoding. If you select multiple iths then its not a binary.
I never said binary. I'm not sure what you mean.
Suppose N = 3 and my tokens are "cat", "dog", and "mouse", and when I encode a string, I want an array of ['cat' in string, 'dog' in string, 'mouse' in string]
so multilabel binarizer?
So that's what that's called? Thanks!
As I'm going to work at least 3 years on deep learning, is it worth learning PyTorch instead of Keras? Won't it be too hard considering I only know the Python basics?
@severe rover
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(['x', 'y', 'z'])
mlb.transform(['a', 'b', 'x', 'z', 'c'])
# output
array([[0, 0, 0],
[0, 0, 0],
[1, 0, 0],
[0, 0, 1],
[0, 0, 0]])
The desired output is array([1, 0, 1])--is taking the columnwise sum the only way to get that?
The problem must be that it's treating every string as its own iterable
("problem" as in "incongruity between my expectations and the spec")
This is great. Thanks!
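On the output above: MultiLabelBinarizer expects each sample to be a collection of labels, so a flat list of five strings is read as five samples; wrapping the labels in an outer list gives a single row. It also matches whole labels rather than substrings, so for the original "is this token inside the string" question a plain comprehension may be closer. A sketch, with a hypothetical helper name:

```python
def substring_flags(tokens, text):
    """Return one 0/1 flag per token: 1 if the token occurs inside text."""
    return [int(tok in text) for tok in tokens]

tokens = ["cat", "dog", "mouse"]
flags = substring_flags(tokens, "the cat chased the mouse")  # [1, 0, 1]
```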
hi, i want to work with instagramgraph api anyone can help to me ?
go ahead and explain the question that you have
I don't know anything about instagramgraph, but it's not likely that anyone who does will volunteer themselves until they know what the question is.
hi all, I have a matrix that shows the rate of agreement between different participants in a survey based on how many times they voted the same in some questions. What would be a good algorithm to cluster them based on similarity?
TF/keras is not hard, but pytorch does require good python basics
ahh, my bad
@pure quiver ty
hey, what is the best course for learning data science
actually I need advanced data science to compete on kaggle
I have a background
there is also a columbia coursera course
andrew is not advance i think
sure andrew is legend
i mean advance for compete
i think andrew teach
how it works u understand
In this case, yeah, you can do a lot of nice things with this data, and especially the simulator. You can have a utility function of, say, average winrate against a representative sample of teams or (harder, and might not be a good idea if there's a counter to any team) minimaxed winrate: that is, make the utility function the lowest winrate among all possible enemy teams of the same cost, so that an optimal team would be one that doesn't lose too badly against any enemy team (even if that makes it mediocre against any team too).
After designing such a function, you can start with just a metaheuristic optimization algorithm like simulated annealing, rather than a neural network or something.
maybe you should try looking at some advanced books for ML
I don't know about advanced courses
learn the basics, read papers, learn from the LB toppers, grind some time, win
Yes, I try this; actually I can say I know it at an intermediate level
The only thing I lack is English, but I try hard at that too
you don't need to know advanced DS to win at kaggle
oh, what is your native language BTW?
Azerbaijani
Yes just model=RandomForest() hahahahaj
Xgboost :D
learning english is a pretty important skill. a lot of the stuff online is in english, so I recommend you also try to improve it little by little. google translate is always there
those are common yes, but the techniques they apply are anything but
For read papers need advance english technical based
yeah
every time I read some new paper, its like a bucket of fresh water on my head
What u think about it
^ always
feature engineering is easily the best way. model techniques come second
where would I get help with a small cython problem? i'm so lost on this discord
I find feature engineering interesting because a lot of it is common sense and thinking out-of-the-box
With ML models it's easy to adjust hyperparameters
Actually I don't like feature engineering
Now I'm much more interested in DL
Cnns object detection
But i think i should do something in kaggle for show myself that i can do this
:)))
kaggle is very hard
not to discourage you or anything, but you would have to put significant amount of effort to do anything
since there are a lot of people too, the competition also increases. but it's fun too
Sure it's not deal to get 1 st in world
But nit deal get last position in world
It's important just u know that u competitive and u can push limits u can make hard day for master's of kaggle jajaja
agreed
Not necessarily, many models are extremely hard to tune
There are entire fields of ML that have been shown to be bogus because of hyperparameter tuning
A Metric Learning Reality Check
by Kevin Musgrave, Serge Belongie, Ser-Nam Lim
Tl;dr: when properly tuned and used the same setup (architecture, augmentation, optimizers, output feature dimension), there is almost no differences between various metric learning approaches. https://t.co/Z2CMVJsTzO
“This is my favorite paper I’ve read in a long, long while. https://t.co/lf0ahvePSx”
Sometimes u need 99.9 accuracy then it must be hard to tune model
I ll read this paper
This paper looks like high level :)
I think 2.2 highlights what exactly I was saying here before lol
Hm?
the one where a paper just used test set (no val) and no CV on a Low resource language haha
and their proposal was that CNN+embedding is always better than traditional ML methods
You sure it's training + test no val instead of training + val no test?
...Isn't CNN+embedded a traditional ML technique?
I could be misreading the technical details, but from their giant flowchart that seemed pretty clear
no, I mean like NB vs CNN. (traditional algos vs Deep learning)
What is NB
Ah, I know it, but only briefly
Not the theory behind Bayes
Now you saying cnn is traditional ml technique?
no, I said that CNN is Deep Learning and NB is a traditional algo
Try sci hub
Hey guys, can you recommend a book to learn complex AI math? I want a book that'll help me understand the complex math used in AI (especially natural language processing) that is related to programming (python preferred)
I am working on a project of NLP with LSTM right now
My head is being burnt with encoder/decoder
Yes it's really hard
I really read and watched many things about RNNs, then I went to drink tea and came back
I understood that I didn't understand anything :?*
lmao Socrates
CNN is much better and much more popular too
I used VGG16 for captioning images
Kinda easier to understand
me neither
long way to learn mannnnn
but I really wish I could be smarter to understand the LSTM and RNN crap...
Yes maybe one day i ll learn thoose too
I think I would try CNN + Embedding for classification
logically, it does seem like a great choice
what do you want to understand
I am just working on a Kaggle project to make a customer service chatbot based on Twitter data using LSTM and NLP, read docs and do not understand much...
I will reach you if I have specific high-level questions
if you don't have questions right now, then don't ping me please
I might not be around 😉
What is embedding you meant here
Because i know embedding only in nlp
yeah its nlp
Oh I didn't ping you, if so I am sorry if I mistakenly did
no, I mean, in the future
I will remember that, thank you!
🙂
Hahha i know ;)))
also github
What
Hi, I have a shp file with street names and info, and a csv file with 3 million addresses. I want to know the coordinates of those addresses. Does anyone know how I should start? I don't want to use the google maps api or the here api because they are too slow and have a monthly request limit
If anyone has experience with solving systems of ODEs using SciPy's solve_ivp, I would love to get some feedback on this project: https://github.com/wigging/bfb-gasifier. It takes about 16 minutes to run the dynamic model but I get overflow warnings while the solver is running (see below). See the README in the repo for usage instructions and a reference for the differential equations. Please let me know if you have suggestions to fix the warnings and if there is a way to speed up the solver for this particular problem. Thanks for the help.
dyn-bfbgasf/solid_phase.py:133: RuntimeWarning: overflow encountered in power
+ hps * (Tp - Ts)
dyn-bfbgasf/solid_phase.py:309: RuntimeWarning: overflow encountered in power
- hps * (Tp - Ts)
dyn-bfbgasf/solid_phase.py:393: RuntimeWarning: overflow encountered in power
qwr = np.pi * Dwi * epb / ((1 - ep) / (ep * epb) + (1 - ew) / ew + 1) * sc * (Tw**4 - Tp**4)
dyn-bfbgasf/gas_phase.py:230: RuntimeWarning: overflow encountered in multiply
SmgV = SmgG + Smgs * (ug + v) - (Smgp - Smgg) * ug - SmgF
----------------------- Solver Info ------------------------
message The solver successfully reached the end of the integration interval.
success True
nfev 63758
njev 291
nlu 1188
----------------------- Results Info -----------------------
t0 0.0
tf 1000.0
len t 8707
y shape (1600, 8707)
elapsed time 16m 22s
hello guys
I have left the question in the help-croissant channel but let me leave the question here as well.
so the data contains name age city country course.
If I use .drop column age, normalize the data with rest of the columns... how can I re add the column age back to the dataset?
Do you have that column else where?
can you print out a few lines of the underlying csv so that I can replicate the issue?
you can save the column that you want to exclude from the operations you're going to do and join them back together later
there's probably a lot of ways you can do it.
Okay so.
I figured out that I have to use pd.concat. (helped by one of the users from here :))
yea so I figured out that I was basically making it way too complicated.
What I have decided to do is just leave the dataset alone and simply select the columns that I want to put inside min_max_scaler.fit_transform.
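A sketch of that "scale only the selected columns in place" idea. The column names and values are made up, and the scaling is done with plain pandas arithmetic instead of a scikit-learn scaler, which gives the same min-max result on these columns:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [20, 30, 40],
    "score": [50.0, 75.0, 100.0],
})

cols = ["score"]                          # columns to normalize; "age" is left alone
lo, hi = df[cols].min(), df[cols].max()
df[cols] = (df[cols] - lo) / (hi - lo)    # min-max scaling to [0, 1]
```

Because the untouched columns never leave the DataFrame, there is nothing to re-add afterwards.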
The data I gave you guys are just examples.
okay...?
Yeah idk seems like they said it in multiple channels
sorry its my first time asking question 😦
@thin prism no not you
Was running a script using pandas fine and then started getting exceptions related to read_excel()
Has anyone had issues related to read_excel() for opening .xlsx files recently? Weirdest thing is it wasn't working yesterday, then it was for most of today. Have tried uninstalling/reinstalling openpyxl and xlrd because it keeps asking me to.
hi i have a question about k means clustering
i have an assignment that wants me to implement k means based on the following strategy:
Strategy 2: pick the first center randomly; for the i-th center (i>1), choose a sample (among all possible samples) such that the average distance of this chosen one to all previous (i-1) centers is maximal.
i don't understand what this even means
so i pick the first point randomly from the data set and set it as the first centroid, then for the second one, i find the farthest point from the first centroid, and for the third one i find the point where the average of its distances to the two centroids is the largest?
is that correct?
please tag me if you respond thanks :)
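That reading matches the wording of the strategy. A minimal sketch of just the initialization step, assuming Euclidean distance and that an already chosen sample should not be picked again (neither is stated in the assignment text):

```python
import numpy as np

def init_centers(X, k, rng=None):
    """Pick k initial centers: the first at random, each next one maximizing
    the average distance to the centers chosen so far."""
    rng = np.random.default_rng(rng)
    chosen = [int(rng.integers(len(X)))]
    for _ in range(1, k):
        # distances from every sample to every chosen center: shape (n, len(chosen))
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2)
        avg = d.mean(axis=1)
        avg[chosen] = -np.inf          # never pick the same sample twice
        chosen.append(int(avg.argmax()))
    return X[chosen]
```

After this, the usual k-means loop (assign points to the nearest center, recompute centers) runs unchanged.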
Well.. the thing is.. I can simulate a battle. But its really difficult to figure out what works BEST against a team.
It's fairly easy to see that Unit1 is good against Unit2
But there are ~10 different units, and the composition/relative number of units of each type in the comp makes it really hard to tell what it's struggling against the most.
So the only thing I have in my mind to do that would be brute force it.
100% Unit1 -> simulate, store data
99% Unit1 + 1% Unit2 -> simulate, if better result store the data.
etc. That's like.. 100^10 simulations if I'm not mistaken; with ~1-3s per simulation that would be 3.170.979.198.376+++ years.
But that would be a monstrous task.
Especially because thats only one "perfect Unit composition against one very specific Team"
Again: I'm by no means professional but thats what I came up with.
€dit: Thats why I thought hey, maybe AI is the way to go.
Could I Input Data of "an enemy Team" (or Batches of it) (E.g.: Input Neuron1 = 3 light fighters) let the AI come up with a Team themself (Input Neurons == Output Neurons), then simulate a battle and teach it via the results of that battle?
E.g. Minimize the Amount of Ships that it lost?/Maximize the amount of Ships that the enemy(Input) lost? I'm scared that this would result in the AI just sending an "empty" team all the time because then the losses would be 0.
Again apologies for my bad language, kinda hard to get this difficult topic into proper words
100% Unit1 -> simulate, store data
99% Unit1 + 1% Unit2 -> simulate, if better result store the data.
That's brute force, but it's not the only way to find extrema of functions. Simulated annealing is more like a weird form of gradient descent - it calculates small changes from the current configuration, calculates their utility, and accepts or rejects them based on whether they are better or worse and on the current "temperature" (that's where the method's magic is).
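A rough sketch of that loop. The toy utility below is a stand-in for the real battle simulator (hypothetical, as are the function names and the cooling schedule); in practice `utility` would run simulated battles and return something like an average winrate:

```python
import math
import random

def anneal(utility, start, neighbor, steps=2000, t0=1.0, t_end=0.01, seed=0):
    """Maximize utility(state) by simulated annealing."""
    rng = random.Random(seed)
    current, cur_u = start, utility(start)
    best, best_u = current, cur_u
    for i in range(steps):
        t = t0 * (t_end / t0) ** (i / steps)   # geometric cooling schedule
        cand = neighbor(current, rng)
        cand_u = utility(cand)
        # always accept improvements; accept worse moves with probability exp(delta / t)
        if cand_u >= cur_u or rng.random() < math.exp((cand_u - cur_u) / t):
            current, cur_u = cand, cand_u
            if cur_u > best_u:
                best, best_u = current, cur_u
    return best, best_u

# Toy stand-in for the simulator: a composition is a tuple of unit counts
# with a fixed total, and the utility peaks at TARGET.
TARGET = (6, 3, 1)

def toy_utility(comp):
    return -sum((c - t) ** 2 for c, t in zip(comp, TARGET))

def swap_one_unit(comp, rng):
    """Move one unit from one type to another, keeping the total fixed."""
    comp = list(comp)
    src = rng.choice([i for i, c in enumerate(comp) if c > 0])
    dst = rng.choice([i for i in range(len(comp)) if i != src])
    comp[src] -= 1
    comp[dst] += 1
    return tuple(comp)

best, best_u = anneal(toy_utility, (10, 0, 0), swap_one_unit)
```

The point is the number of simulations: a few thousand utility evaluations instead of enumerating every composition.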
Could I Input Data of "an enemy Team" (or Batches of it) (E.g.: Input Neuron1 = 3 light fighters) let the AI come up with a Team themself (Input Neurons == Output Neurons), then simulate a battle and teach it via the results of that battle?
Yup, if you want to teach a model to generate teams to counter a specific team, that would probably require ML.
hi all ^^, I would like to make a "simple" AI to act like an ant, but I don't know how to start and where to start
But how would I train it? I can’t tell it what it should have come up with itself^^
Hey I am in need of some advice. This is probably a simple question but I am doing text sentiment analysis and I need 3 different classifiers. I currently have SGDClassifier from linear_model and MultinomialNB from naive_bayes.
I tried KNN but my input variables become inconsistent, and I ran SVM for like 2 hours but it's still running. Does anyone have any advice on this?
Hello, has anyone tried to make a poker AI with reinforcement learning? Would love some guidance
hi everyone. This probably is a really basic question, but I have 0 training in statistics or anything similar.
I have this matrix that shows the rate agreement of participants' ratings on images. So for each pair of participants, I have a number between 0 and 1. Can these numbers be used to somehow cluster participants that all tend to agree with each other? I have ~70 participants
this is the matrix
If you know the sum of each row and each column, you could know which is the most agreeable participant.
this matrix is symmetrical along the diagonal, so I guess either summing rows or columns would tell me that, right? Also, I think that is a good metric, thank you. But I'm also interested in clustering participants that voted similarly, because that may help me better detect commonalities on the stuff they voted on (images). I may be wrong but "agreeability" probably doesn't help me learn much about the images themselves?
It won't help you directly but you'll understand better the relationship between participants. I can't think of a better approach so far, so sorry I can't be of much help, but try to sort the data in different ways to check for patterns. I would do what I said before as a first step.
yes, absolutely. And I hadn't thought of that at all. Thank you @primal tulip
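One option for the clustering part is hierarchical clustering on a distance defined as 1 minus agreement. A sketch using SciPy, with a made-up 4x4 matrix standing in for the real ~70x70 one; average linkage and the number of clusters are arbitrary choices here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_by_agreement(agreement, n_clusters):
    """Group participants from a symmetric agreement matrix with values in [0, 1]."""
    dist = 1.0 - np.asarray(agreement, dtype=float)  # high agreement -> small distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)        # upper triangle as a flat vector
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

agreement = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
labels = cluster_by_agreement(agreement, n_clusters=2)
```

Plotting the same `tree` with `scipy.cluster.hierarchy.dendrogram` can help decide how many clusters make sense before fixing a number.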
I’ve seen some people do it on YouTube but it’s really really difficult from what I understand so not the best „starting point“
that's my uni topic. well, to get the hang of it I could do blackjack and then hold'em
you can just research, there is a ton of stuff on the net
searching playing poker with ai might give some video with a github repo
It is already a string. You need to change it to integers or floats
Is it a categorical data?
this the data
I want to make sure we can put in age and gender and then it will tell you the hobby
Ordinal encoder will do good for you.
pre = model.... line, you forgot to add ) at the end
i add this
Hi all. For the past few days I've been working on forage.com's ANZ data - task 2. My task is to predict the annual salary of the customers. The data includes current budget, amount withdrawn, age and date (there are other variables such as transaction location etc, don't think they will be useful for prediction).
Anyways, the transaction data has categorical data such as 'credit card', 'salary/pay check'. So I gathered monthly salaries of the customers. Then I got stuck here. For the annual salary, should I be using time series analysis? The task requires a simple regression model to predict annual salary.
After getting stuck, I looked on YouTube; the videos I found just used regression with x = amount, age and y = balance.
Am I doing something wrong or is my approach good to go?
You need to change the genre data into numeric data first. The easiest way to do this is with OrdinalEncoder.
Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding. In this tutorial, you will discover how […]
Of course, people with way more experience and knowledge will give more detailed answers than me. I'm just a learner too.
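A small sketch of the encoding step. The original data was posted as an image, so the columns and values here are made up, and `pd.factorize` is used as a pandas-only stand-in for scikit-learn's OrdinalEncoder (it does the same string-to-integer mapping, one column at a time):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 35, 52],
    "gender": ["f", "m", "f"],
    "hobby": ["chess", "golf", "chess"],
})

# pd.factorize maps each distinct string to an integer code and returns the
# lookup table, so predictions can be decoded back to names later.
df["gender_code"], gender_names = pd.factorize(df["gender"])
df["hobby_code"], hobby_names = pd.factorize(df["hobby"])

X = df[["age", "gender_code"]]   # numeric features
y = df["hobby_code"]             # numeric target
```

A predicted code `c` turns back into a label with `hobby_names[c]`.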
Hello everyone, I have a problem with Twitter sentiment analysis. I want to search for a hashtag in a specific location, but I still get an error. Does anyone know how to solve it? Here are the code and the error. Thanks in advance🙏🏻
['covid-19'] is a list
'covid-19' is a string / data[0] is a string for the string concatenation
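The original code and traceback were posted as images and are not in the log, so this is only a guess at the shape of the bug: concatenating a string with a list raises TypeError, and indexing into the list fixes it. The variable names are hypothetical:

```python
hashtags = ["covid-19"]          # a list holding one string

try:
    query = "#" + hashtags       # str + list raises TypeError
except TypeError:
    query = "#" + hashtags[0]    # index into the list to get the string itself
```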
import pandas as pd
import numpy as np
df_sample =\
pd.DataFrame([["day1","day2","day1","day2","day1","day2"],
["A","B","A","B","C","C"],
[100,150,200,150,100],
[120,160,100,180,110]] ).T
df_sample.columns = ["day_no","class","score1","score2"]
df_sample.index = [11,12,13,14,15,16]
agg = df_sample.groupby(["day_no","class"]).sum()
so for day 1 class B it currently displays no value. How can I make it fill in a 0 when no value is observed? Same for A on day 2
(e.g. I need a value for each possibilities no matter what)
This is currently what it outputs
unrelated to discord.py, maybe try #data-science-and-ml
@static owl That's where we are
haha i didn't realize i'd switched here 
gonna blame discord, it's totally not my fault
I don't really understand the problem
How can I fill None values (I assume they are None?) with 0 when using groupby
you can see the current output there: no value for ["day1", "B"] and ["day2", "A"], and I'd like to have 0s instead of nothing @serene scaffold
create a multi-index with all the possibilities, then reindex the groupby with that
In [258]: mi = pd.MultiIndex.from_product([df['day_no'].unique(), df['class'].unique()])
In [261]: df.groupby(['day_no', 'class']).sum().reindex(mi, fill_value=0)
Out[261]:
        score1  score2
day1 A     300     220
     B       0       0
     C     100     110
day2 A       0       0
     B     300     340
     C       0       0
C 0.0 0.0
I'm afk but do you know about fillna?
Awesome I'll try that, thanks
TIL df.reindex has a fill_value param
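For reference, the reindex approach above as a self-contained sketch. The frame is rebuilt column by column with only the five fully observed rows, and `full_index` is just an illustrative name, not something from the original snippet:

```python
import pandas as pd

# Five observed rows; note there is no (day1, B), (day2, A) or (day2, C) row.
df = pd.DataFrame({
    "day_no": ["day1", "day2", "day1", "day2", "day1"],
    "class": ["A", "B", "A", "B", "C"],
    "score1": [100, 150, 200, 150, 100],
    "score2": [120, 160, 100, 180, 110],
})

# Every (day_no, class) combination, whether it was observed or not.
full_index = pd.MultiIndex.from_product(
    [df["day_no"].unique(), df["class"].unique()],
    names=["day_no", "class"],
)

# Missing combinations are added as rows filled with 0 instead of NaN.
agg = df.groupby(["day_no", "class"]).sum().reindex(full_index, fill_value=0)
print(agg)
```

`fillna(0)` alone would not be enough here, because the missing combinations are absent rows rather than NaN cells.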
Is there any downside to using replit to run my AI and get it through the first couple generations?
it would be doggone slow
and replit is shit anyway
It would be, I just wanna make sure it works
just run it on your laptop then
Don’t want to let it populate for a week only for it to break without noticing
Fair enough.
Thanks
or use colab if you want to run for a few hours
Colab?
what plots are useful for time series data apart from linegraphs?
scatterplot, barplots
so line graphs are derived from scatter plots, yh i thought of bar plots, but any other plots?
one boxplot for each time period if it's not too long
for example @uncut barn
otherwise heatmaps or auto-correlation plots come to mind
depends on what your goal and data are
@tidal bronze so this is my task and I just used a line graph
also I did a bar plot for this task, but I don't know if I should add any other visualisations that wouldn't be redundant
Hey guys,
I was instructed to replace all outliers in my DF with np.NaNs.
Outliers should be numerical values which were greater than Q3 + 1.5 * IQR OR lesser than Q1 - 1.5 * IQR
For some reason it does not work as intended.
Can you spot any issues in my code?
def outlier_detection_iqr(df):
    n_df = df.select_dtypes(include=np.number)
    q1, q3 = n_df.quantile(numeric_only=True, q=[0.25, 0.75]).iloc
    iqr = q3 - q1
    n_df, iqr = n_df.align(iqr, axis=1, copy=False)
    return n_df.where((n_df > (q1 - iqr * 1.5)) & (n_df < (q3 + iqr * 1.5)), other=np.nan)\
        .join(df.select_dtypes(exclude=np.number))\
        .reindex(columns=df.columns.tolist())
I should add that it does return a DF but it doesn't contain the expected NaNs, at least for some of the columns
Works now, I forgot to change > and < to >= and <=
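A more explicit way to write the same rule, as a rough sketch only. The function name `mask_iqr_outliers` and the `k` parameter are made up for illustration; values exactly on a fence are kept, matching the `>=` / `<=` fix mentioned above:

```python
import numpy as np
import pandas as pd

def mask_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with numeric outliers replaced by NaN."""
    out = df.copy()
    numeric = out.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is kept when it lies inside [Q1 - k*IQR, Q3 + k*IQR], fences included.
    inside = numeric.ge(q1 - k * iqr, axis=1) & numeric.le(q3 + k * iqr, axis=1)
    masked = numeric.where(inside)  # everything outside the fences becomes NaN
    for col in masked.columns:
        out[col] = masked[col]
    return out

sample = pd.DataFrame({"x": [1, 2, 3, 4, 100], "name": list("abcde")})
cleaned = mask_iqr_outliers(sample)
```

Non-numeric columns are left untouched and the column order is preserved, so no join or reindex is needed afterwards.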
Does anyone have any specific architecture recommendations for classification on a vector as an input? CNN + MLP is the standard one, but there might be some that I have missed
I've gotten as far as deciding on how to encode objects for two classes that I want to classify, but the representations end up being pretty sparse. Is there an algorithm that's known to be good at binary classification on sparse vectors?
I could provide more context about what I'm doing and how I decided to encode everything, but I kind of want to see how what I came up with turns out.
I could be totally off base, but don't most architectures take vectors as input?
I've finally gotten back into AI and I'm wanting to try again at this project. I've managed to get the vanilla GPT2 model working in huggingface, but I'm unsure how to fine-tune it. Any ideas? #data-science-and-ml message
Currently, my dataset is just a bunch of text files for each user where every message is on a new line, with a few filters to avoid spam (I've also considered using list(set(...)) to remove duplicate messages from each user, but decided against it) https://paste.pythondiscord.com/gucurocago.py
Do all your imports, then do the setup stuff. don't intersperse setup statements / defining constants etc... with the imports. Also don't put multiple statements on the same line with ; in between (it's not code golf... fewer lines doesn't make it better). You might want to run this through pylint or similar...
formatData is very inefficient, it is looping over the input 8 times, and then where it returns, it looks like it's looping n^2 times where n=len(originalData)
For this, ye it's a fair point, I don't code like this anymore but this is mostly based on some of my really old code when I was experimenting with a module called textgenrnn
Efficiency shouldn't really matter, it takes less than 2 seconds to do all the filtering, most of the time is just it loading the data in the first place
Efficiency shouldn't really matter
suggest you rethink that mindset, this would not fly in most professional settings.
I am usually all about efficiency, but for a small hobbyist project, something that already only takes 2 seconds probably doesn't need to be any more efficient
....you also posted it to a public chat and asked for feedback.....
sorry, i dont mean to be too harsh, i do get what you are saying, just realize where you're coming from
I wasn't asking for feedback- I was just showing how I produce my dataset to make it clearer what I'm working with so my question about fine-tuning GPT2 could be answered 😅
So this might be a patzer question, but I just did some quantile regression, and I'm confused why each equation for each quantile has its own 95% confidence interval, I'm having trouble wrapping my head around this
I appreciate feedback, but that isn't actually what I posted it for as the code I posted works fine
hmm
For example, since I'm aware that usually when working with AI with text, you have to assign each character a unique number so the less unique characters is usually better, that code shows that I am filtering out lots of characters such as emojis to reduce this
are there any packages that can transform your image dataset?
I have been using https://app.roboflow.com/ to add a grayscale and horizontal flip but there's a limit to how much data you can upload
is there something else that can create augmented images while keeping the labels for object detection
Might be misunderstanding your question, but maybe something like OpenCV2?
maybe
It needs to keep the labels is the main thing
or move the labels accordingly
are you just looking to make some tweaks to the images to pre-process them?
yeah I guess so
PIL can do most of that pretty easily
with roboflow I could upload an image dataset and then generate "augmented" images that were resized, grayscaled, flip etc.
but also it needs to generate labels for the new images
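The "move the labels accordingly" part is the piece plain image libraries don't do for you. A minimal sketch for a horizontal flip, assuming the image is a NumPy array of shape (height, width, channels) and the labels are pixel boxes in `[x_min, y_min, x_max, y_max]` form; the function name is hypothetical, and other label formats (e.g. normalised centre/width/height) would need a different formula:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Flip an image left-to-right and mirror its bounding boxes to match."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()  # reverse the column order
    boxes = np.asarray(boxes, dtype=float)
    new_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes

image = np.zeros((4, 10, 3), dtype=np.uint8)
image[0:2, 1:3] = 255  # a white patch covering x in [1, 3), y in [0, 2)
flipped, new_boxes = hflip_with_boxes(image, [[1, 0, 3, 2]])
```

Grayscale conversion leaves the boxes unchanged, so only geometric augmentations (flips, crops, resizes, rotations) need this kind of bookkeeping.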
because quantiles can also have uncertainty?
Folks, could I please ask you about the Python packages and libraries you are using for data science and ML? In our app (shameless promotion) Devbook (https://usedevbook.com/) we added support for NumPy and PyTorch docs and right now I'm working on Pandas, but we do not have that much data about what people really use and metrics like GitHub stars seem dangerous to rely on.
Well, maybe I should rephrase the question, being in the data science channel - how do you think I should estimate what packages and libraries should I add first?
tbf, this product sounds cool, but ultimately doesn't give much lift if it's just python packages like numpy, pandas, etc
a lot of these libraries have actually fairly decent docs to begin with-- unless your product does more (isolated testing repls for tricky functions, maybe integration of blog resources)
for a lot of us, if it makes looking up pyspark docs, or pyarrow docs easier then this would be great
considering those projects have notoriously bad docs
Huh, that actually sounds more helpful than a list of the most used libraries, thank you 😄
the stack overflow integration is 😘 tho
Something that you could have, which I don't see much of, is a concept graph. When you search for something it shows the concept you searched for as a node, and all the surrounding nodes are related concepts and prerequisite knowledge. Navigation can be done by clicking on nodes.
Apart from the docs you mentioned and the SO :D, are there any other resources that you look up when you are working? Maybe some private solutions, infrastructure dashboards,...I've never done data science professionally, but is there anything to view/browse dataset repositories online?
The reason for this addition is that finding a specific function is not really the problem, it's understanding its context that lets you understand when to use it and how.
Huh, that makes sense - do you have this problem with specific libraries or languages, or do you think it is a general problem?
It's a problem for all libraries and programming projects. Many functions in a project are really just built on top of the core part of the project and tailored for specific tasks, understanding why those specific functions were added, what the specific use case was, and what it was built on is key. Following a graph backwards from the searched function to the core of the project will give a deeper understanding of it and an automatic understanding of other functions in the project (can already guess what they do).
Ideally one does not need to search anything, so getting that deeper understanding pays off in the long run by reducing lookup.
It's dependencies all the way down!
It does give me some ideas and it reminds me of a project my friend did - https://brainec.com/ (here is an actual example https://brainec.com/g/9eDLBbL9TVMb) it is just for all the information/notes not just for the docs.
This can get philosophical really quickly 😄
Here is an example of concept graph greatly improving a website: http://hyperphysics.phy-astr.gsu.edu/hbase/hph.html
Huh, can you think of a reason why such graphs are not more widely used?
It takes effort to make. Most people (sadly) can't render a graph.
And in this case it's a simple static setup. Yours would need to dynamically add stuff.
Over time.
Since libraries change rapidly, some kind of web crawler or maybe even a code parser for GitHub would be ideal.
Yeah. I'm thinking that it would be a little easier with a typed language though. The graph of ALL the packages' interconnectivity would be something else.
It starts to morph into a new "IntelliSense" - you have a data/object and you can browse the graphs of all the transformations that all the packages for the language provide. Ideally you would then just select the final form.
Of course this is all like 100x harder than this.
It would probably involve a graph database and maybe even some ML.
Basically the stuff social media websites work on all day.
You might want to ask in #databases about that though.
All right, thanks. I will think about the concept first and if I decide to explore it more I will talk about it somewhere here.
Any cool project on your side, btw?
Right now i'm making a little game with Ursina. Not data-science related.
Still wanting to know how to fine-tune GPT2
It's also worth noting I'd like to live fine-tune it, so I can actively receive data to add to it, while still generating new data from the model
i have a question regarding calculating error functions for k means clustering, i am trying to implement this objective function and either i don't understand it completely or i'm implementing it wrong
errors = []
for i in range(2,11):
    clf = K_Means()
    print("\n\nK = ", i)
    clf.fit(X, i)
    error = 0
    for j in range(i):
        for classification in clf.classifications:
            for point in clf.classifications[classification]:
                error += np.linalg.norm(point - clf.centroids[classification])**2
    errors.append(error)```
so clf has centroid attribute that are the means in the picture above
and then it has classification which contains the x's
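A sketch of the usual k-means objective (sum of squared distances from each point to its own centroid), assuming `classifications` maps a cluster label to its list of points and `centroids` maps the same label to the cluster mean, as the snippet suggests. One thing that stands out in the loop as posted: the `for j in range(i)` loop never uses `j`, so the whole sum is added `i` times and the error is inflated by a factor of K. The function name below is made up:

```python
import numpy as np

def within_cluster_sse(classifications, centroids):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for label, points in classifications.items():
        diffs = np.asarray(points, dtype=float) - np.asarray(centroids[label], dtype=float)
        total += float(np.sum(diffs ** 2))  # each point is counted exactly once
    return total

classifications = {0: [[0, 0], [2, 0]], 1: [[5, 5]]}
centroids = {0: [1, 0], 1: [5, 5]}
sse = within_cluster_sse(classifications, centroids)
```

Here both points in cluster 0 are at distance 1 from their centroid and the single point in cluster 1 sits on its centroid, so the total is 1 + 1 + 0.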
Hey @warm wharf!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
sorry bud, can't do that with today's technology/research
you could parallelize it though. one node to keep fine-tuning model and another to keep serving predictions. thus, the only delay would be the time to fine-tune (i.e about 15 mins or so)
Huh, how not? So, for example, YouTube has to redo their model constantly for recommendations? Surely they'd just add to the model?
nah man, it doesn't work like that
Then what do they do? Surely that's live fine-tuning?
Google usually re-trains your models every month. like if you own a google assistant, you might notice it having some new features every month
and your voice becomes easier for the device to understand since now it has more data
But then, how come if you make a new account, watch a video then go back to the home screen, your recommendations are already similar to what you just watched? Surely by the logic of re-training every month, it would take a month before you'd find your recommendations being more like your tastes?
no. we do not have the exact specifics of their models, but I would guess its mostly transfer learning fine-tuned for particular demographic categories that would be decided by another model. this is still a pretty naive approach (this being a guess) but theoretically it would do the trick
Hm I suppose that makes sense
So how exactly would I go about this? Would I add data to a queue and once the current fine-tuning is done, it restarts the fine-tuning again with the data in the queue? Or would I start a new fine-tuning process with every new piece of data received?
see my previous message ^ parallelization seems to be the best option IMO
That's what I'm referring to- the node that adds data to the queue would be the same one generating new data from the model, waiting for the other node to finish fine-tuning, or telling the other node to fine-tune every time new data is received
I always learn about these obscure libraries thanks to you lol
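The queue idea can be sketched with the standard library alone. Everything here is hypothetical scaffolding: `fine_tune` stands in for whatever training routine ends up being used, and in a real setup the loop would run in its own process or node while another one serves predictions:

```python
import queue

def drain(q, max_items=1000):
    """Pull everything currently waiting in the queue, up to max_items."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

def training_loop(q, fine_tune, rounds):
    """Run `rounds` fine-tuning passes, each on the data queued since the last one."""
    for _ in range(rounds):
        batch = drain(q)
        if batch:  # skip the pass when nothing new arrived
            fine_tune(batch)

# Stand-in for the real fine-tuning call: just record what it was given.
seen_batches = []
incoming = queue.Queue()
for message in ["hello", "how are you", "bye"]:
    incoming.put(message)
training_loop(incoming, seen_batches.append, rounds=2)
```

Batching like this means new data waits for at most one fine-tuning pass, rather than triggering a new pass for every single message.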
Either way, I still don't know how to fine-tune in the first place
Hey, does anyone know of a function in python for computing the linear transformation from one matrix to another?
btw the only thing I've been able to find to help me is this but the file no longer exists https://huggingface.co/transformers/v2.7.0/examples.html#language-model-training
Just realized wrong channel, sorry!
bump
>>> b = MultiLabelBinarizer().fit([{'a', 'b'}, {'c', 'd'}])
>>> b.transform(['a', 'c', 'd'])
[[1, 0],
[0, 1],
[0, 1]]
This is not the actual output. This is the desired output. What's the right encoder to encode different objects the same way?
so i was wondering why for gradient descent
we decrease each weight
by its partial derivative * the learning rate
shouldn't we just add the learning rate if the pd (partial derivative) is positive, and subtract it if negative?
wouldn't that work as well?
(ping 2 reply thx)
I think this might go here idk really. how would I find the percentage of a number between 2 numbers? like the percentage of 10 from 5 to 20.
this wouldnt really belong in here, but to answer your question you could subtract the minimum (5) from both the maximum (20) and the number (10) to get all of them from 0, then get the percentage from there, so for your example it would be 20-5 = 15 and 10-5 = 5 so 5/15 = 33.3%
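The same arithmetic as a small function (the name is made up for illustration):

```python
def percent_between(value, low, high):
    """Position of value within [low, high], as a percentage."""
    return (value - low) / (high - low) * 100

result = percent_between(10, 5, 20)  # (10 - 5) / (20 - 5) * 100
print(round(result, 1))
```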
this channel is for data science, scientific python, and AI
Dumb question here...is the target distribution here the same as a prior?
Can someone please explain this to me?
df1 = df.melt('EST').dropna(subset=['value'])
d = {k: dict(zip(v['EST'], v['value'])) for k, v in df1.groupby('variable')}
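A small worked example may help with the two lines above. The data is invented; only the column name `EST` comes from the snippet. `melt('EST')` turns every other column into (variable, value) rows keyed by `EST`, `dropna` removes the rows with no value, and the dict comprehension builds one `{EST: value}` mapping per original column:

```python
import pandas as pd

df = pd.DataFrame({
    "EST": ["t1", "t2"],
    "x": [1.0, None],   # x has no reading at t2
    "y": [3.0, 4.0],
})

# Long format: one row per (EST, original column name, value).
df1 = df.melt("EST").dropna(subset=["value"])

# One inner dict per original column, mapping EST -> value.
d = {k: dict(zip(v["EST"], v["value"])) for k, v in df1.groupby("variable")}
print(d)
```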
hi i was wondering if someone could please explain distance measurement in OpenCV? i'm trying to measure the distance from the centroid of one's face to the camera in real time, however i haven't found a way to calculate the distance correctly.. any ideas?
is it possible to upload lstm keras model into a file
what's the criteria for similarity of objects?
and can you elaborate with an example what you are trying to do?
Are there any downsides to using ensure_ascii=False when dumping a dictionary to a JSON file?
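A quick comparison of the two settings, using a made-up dictionary. With `ensure_ascii=False` the non-ASCII characters are written as-is, so the main thing to watch is that the file is opened with an encoding that can represent them (for example `open(path, "w", encoding="utf-8")`) and read back with the same one:

```python
import json

data = {"name": "café"}

escaped = json.dumps(data)                        # non-ASCII becomes \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)   # non-ASCII is kept verbatim

# Both forms decode to the same dictionary.
same = json.loads(escaped) == json.loads(readable) == data
```

The escaped form is pure ASCII and survives any encoding; the readable form is shorter and easier to inspect by eye.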
ooh, did anyone see the updated colab with the status bar? it tracks all the call functions performed and the lines being executed. noice for a jupyter notebook
Hello, I'm pretty new to machine learning and started following Andrew Ng's Machine Learning course on Coursera. As many of you may know, it is based on MATLAB and Octave, so I researched a bit and found it would be better to use NumPy instead of MATLAB. I started implementing the assignments with NumPy, but because of some datatype restrictions, my NumPy programs, which were written as almost direct translations of the MATLAB programs, weren't working right. I also searched about this on YouTube and found that the algorithms those guys were implementing were hella small, i.e. with fewer lines of code and no complex maths, yet were giving great accuracy for logistic regression. I'm in two minds about whether to keep following Andrew Ng's course or not. Can anyone help me with this?
Hi, could any expert tell me, as a starting point, what I could expect to try doing with a clustering algorithm on a database that has these features?
It depends if you want to finish that course. You might not like the tools, but you'll gain a strong understanding of the methodology and concepts you should be aware of. If you'd rather use other tools, yeah, go for it. But if it's required for the course you must decide if you want to do it or not.
So the course is great for theory, but tools wise it is out-dated ?
I wouldn't say outdated. Matlab, R are used a lot in the Data Science fields. You might encounter them more commonly at investigation and research or medicine related fields for example and less in banking fields which leans more towards python. But it's not at all outdated. I haven't used Octave so I can't say much about it.
Ohh, I was completely unaware about this, but can you explain why Numpy fails to implement some calculations which are performed in MatLab
because of restriction of datatype limits only or there are reasons, too?
I have no idea without looking at the code. Either way, try to find the part that's being weird and share it. Even if I can't help you, there are bright heads here that could.
okay
Hey @brave owl!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
Hey guys can anybody help me out with this problem,
https://discuss.pytorch.org/t/typeerror-new-received-an-invalid-combination-of-arguments-got-tensor-int-but-expected-one-of-torch-device-device-didnt-match-because-some-of-the-arguments-have-invalid-types-tensor-int/116316
Hi there, I am building my first neural net with pytorch to predict a single output from an image using a pretrained resnet18 model and keep getting this error. I don’t understand what I am doing wrong here. #ERROR TypeError: new() received an invalid combination of arguments - got (Tensor, int), but expected one of: * (*, torch.device device)...
I am trying to learn, encoder decoder for NLP by training a model that can generate docstrings for small code snippets ( Java ).
The model has 93% accuracy, but it always predicts 0 ( padding token). With mask_zero = True
If you can make a good model or tell me what's wrong, it would be of great help.
Model link: https://filetransfer.io/data-package/XneDbGur#link
https://paste.pythondiscord.com/ahegejutuh.properties
So this is my Python Code for Logistic Regression with Gradient Descent and all I know is Cost isn't decreasing and this is predicting always 1
Assume that it's arbitrary. I have a list of sets and I want anything that is in the set at the ith position to be encoded with a 1 at the ith position.
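Described that way, a plain lookup table may be simpler than bending an existing encoder. A minimal sketch, assuming the sets are disjoint (if an item appears in several sets, the last one wins here) and that every item to encode appears in some set; the function name is made up:

```python
def encode_by_group(groups, items):
    """One-hot encode each item by the position of the set that contains it."""
    lookup = {member: i for i, group in enumerate(groups) for member in group}
    rows = []
    for item in items:
        row = [0] * len(groups)
        row[lookup[item]] = 1  # raises KeyError for items that are in no set
        rows.append(row)
    return rows

encoded = encode_by_group([{"a", "b"}, {"c", "d"}], ["a", "c", "d"])
print(encoded)
```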
import xlrd
excel_workbook = xlrd.open_workbook("data.xlsx")
excel_worksheet = excel_workbook.sheet_by_index(0)
#reading data
for row in range(excel_worksheet.nrows):
    for col in range(excel_worksheet.ncols):
        if col == 0 and row !=0:
            print(excel_worksheet.cell_value(row,col), end='')
        print('\t', end='')
    print()```
it gives me a error saying
```py
Exception has occurred: XLRDError
Excel xlsx file; not supported
File "D:\everything\Legacy\Game_currency_stats\code.py", line 5, in <module>
excel_workbook = xlrd.open_workbook("data.xlsx")```
XLRD does not support xlsx files. Only xls
Try openpyxl (XlsxWriter can only write xlsx files, not read them)
is it a function or a library?
Both are libraries
ohk
Depending on what you want to do you might also use Pandas with the "read_excel" method: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
hmm
so how do i convert my xlsx file to xls
You should be able to select "save as" and choose "Microsoft Excel 97/2000/XP" .xls
In Excel
so i just need to change the extension from xlsx to xls
right?
Should work yes
Did you save it via office or just rename the extension?
just rename
Not sure if it's gonna work or not. You can try but office might need to convert some stuff depending on what's inside.
it didnt work if i just changed extension
did via office too
didnt work
:/
@dense cosmos
hi all, have a question. so i've created a script that prompts the user what column they would like to search in aka account ID, parent ID, etc. then it prompts what they are searching for. but, i've added a column that the data is all numerical, and it doesn't find them ex 188991, but it can find M188991. any idea how to fix this? coding in python using pandas, code is on a separate comp
I don't think anyone can solve it without seeing code
while True:
    variable = input(f"{bcolors.WARNING}Search by Acronym / Parent / Alert ID / Account? {bcolors.ENDC}")
    if variable == "Exit":
        sys.exit(0)
    if variable not in df.columns:
        print(f"{bcolors.FAIL}Error: Invalid Input{bcolors.ENDC}")
        continue
    if variable == "Acronym":
        while True:
            input1 = input("Please provide an Acronym: ")
            result1 = df.loc[df[variable] == input1]
            if input1 == "Back":
                break
            if len(result1) == 0:
                print(f"{bcolors.FAIL}Acronym not found. Please try again{bcolors.ENDC}")
            else:
                print(tabulate(result1, headers='keys', tablefmt='psql'))
            continue`
here is one part of the code, the rest is the same as if variable == "Acronym": just with Parent ID, Alert ID, etc @sharp prairie
is there any NN that given an image can output the object on the image on different views?
Like if it makes a 3D composition and returns u the object from different angles
@iron basalt @exotic maple
Anyone know how to search and replace within an entire data frame based on a different data frame?
I have one data frame full of email addresses and another data frame with a column for email addresses and a column for names, I’m wanting to search the first dataframe for any email addresses which match an email address in the second dataframe, and then replace that email with the persons name
You can use the apply method on the first df with a custom function that looks each email up in the second df and replaces it with the matching name
Alternatively, you could do it with merges
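One way the lookup could be written, as a sketch with invented frames and column names (`emails`, `people`, `email`, `name` are all assumptions). It turns the second frame into a Series indexed by email and maps every column of the first frame through it, falling back to the original address when there is no match:

```python
import pandas as pd

emails = pd.DataFrame({
    "to": ["ann@example.com", "bob@example.com"],
    "cc": ["zed@example.com", "ann@example.com"],
})
people = pd.DataFrame({
    "email": ["ann@example.com", "bob@example.com"],
    "name": ["Ann", "Bob"],
})

# email -> name lookup; assumes each email appears only once in `people`.
lookup = people.set_index("email")["name"]

# map() gives NaN for unknown addresses, so fillna() restores the original value.
replaced = emails.apply(lambda col: col.map(lookup).fillna(col))
print(replaced)
```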
so i was wondering why for gradient descent
we decrease each weight
by its partial derivative * the learning rate
shouldn't we just add the learning rate if the pd (partial derivative) is positive, and subtract it if negative?
wouldn't that work as well?
(ping 2 reply thx)
doing it this way makes gradient descent take smaller steps automatically when it's closer to the goal
(on the assumption that closer to the goal, the gradient is lower)
imagine u are close to the goal. If u move a lot, u may exit it. Is like if u never reach the goal cuz u move a lot
The gradient supplies the direction to the goal, the learning rate determines the step towards that goal.
imagine u are at point 1/2. And goal is 0. If ur steps are length 1, u will never reach 0
If your learning rate is too high you may either overshoot the goal and bounce back and forth across the minimum, or spiral out of control. If your learning rate is too small, you might take a very long time to reach the minimum.
they are asking why the actual update depends not only on the learning rate, but also on the magnitude of the gradient.
It's the magnitude of the gradient times the learning rate, I don't see what you are trying to say here?
I'm saying that they aren't asking why the learning rate is involved/what it does, they are asking why the magnitude of the gradient is also used, instead of only its direction.
yeah
but like say hypothetically the gradient is low for a very long period of time
___----O
min___---```just hypothetically
low gradient, but it'd take a very long time
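The two update rules being compared can be tried on a toy problem. This is only an illustration on f(w) = w², starting from w = 1/2 with a sign-only step of length 1 as in the example above; the function name and the learning rates are arbitrary choices:

```python
def descend(grad, w, lr, steps, sign_only):
    """Run `steps` updates on a single weight and return the final value."""
    for _ in range(steps):
        g = grad(w)
        if sign_only:
            g = (g > 0) - (g < 0)  # keep only the direction: -1, 0 or +1
        w = w - lr * g
    return w

def grad(w):
    return 2 * w  # derivative of f(w) = w**2, whose minimum is at w = 0

# Sign-only update: jumps between 0.5 and -0.5 forever, never reaching 0.
fixed_step = descend(grad, 0.5, lr=1.0, steps=10, sign_only=True)

# Gradient-scaled update: each step shrinks as w approaches the minimum.
scaled_step = descend(grad, 0.5, lr=0.1, steps=50, sign_only=False)
```

The flat-plateau case raised above is the flip side of this: where the gradient is tiny the scaled update is slow, and sign-based or adaptive optimizers are one response to that.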
well, network does not converge on my dataset, but an SKlearn algo works
assuming the dataset is linearly separable, why cannot the model capture that relationship?
import os
import pandas as pd
cwd = os.path.abspath('C:\\Temp\\Reports\\Combine')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.csv'):
        df = df.append(pd.read_csv(file), ignore_index=True)
df.head()
df.to_csv('C:/Temp/Reports/Combine/combined_file.csv')```
Error: FileNotFoundError: [Errno 2] No such file or directory: 'members_LISTS_NO_CONTRACTS_SENT.csv'
```FileNotFoundError Traceback (most recent call last)
<ipython-input-3-59d50ed3d83a> in <module>
3 for file in files:
4 if file.endswith('.csv'):
----> 5 df = df.append(pd.read_excel(file), ignore_index=True)
6 df.head()
7 df.to_csv('C:/Temp/Reports/Combine/combined_file.csv')```
The file is in the directory as per cwd. I am not sure what I am doing wrong :/ Can someone help?
What I am trying to achieve: Combine all CSV files into one.
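A likely cause, going by the error message: `os.listdir` returns bare file names, so `pd.read_csv(file)` looks for them relative to wherever Python is running rather than inside the listed folder. A sketch that keeps full paths via `pathlib`; the function name is hypothetical, and `pd.concat` is used because `DataFrame.append` was removed in pandas 2.0:

```python
from pathlib import Path

import pandas as pd

def combine_csvs(folder, output_name="combined_file.csv"):
    """Concatenate every CSV in `folder` and write the result back into it."""
    folder = Path(folder)
    # glob() yields full paths; skip the output file so reruns don't re-read it.
    paths = sorted(p for p in folder.glob("*.csv") if p.name != output_name)
    frames = [pd.read_csv(p) for p in paths]
    combined = pd.concat(frames, ignore_index=True)  # raises ValueError if no CSVs exist
    combined.to_csv(folder / output_name, index=False)
    return combined
```

Usage would be something like `combine_csvs(r"C:\Temp\Reports\Combine")`.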
How to display graph on my website?
I was about to say
Hi! I have a question about Pytorch tensors.
So, Is there any difference between
parameters.grad.zero_()
and
p.data -= p.grad*lr
p.grad.zero_()```
The params is a tuple containing two tensors a params tensor and a bias tensor.
no, worse.
What libraries are you using?
Is it an interactive chart, or just a Python-generated PNG?
are you working front-end or back-end?
etc
so
you can't just say "I want to put my line chart in my webpage, how to?" and expect people to know what you're talking about lol
flask. values will be added to graph every day
its recommended you operate on data directly, like in the second
so that you don't add to the computational graph
or depending on event. @exotic maple
for example, if you look at the source code for torch.optim.Adam they call p.data directly to detach it from the backprop graph
and matplotlib library for graphs
just don't answer bad questions-- they'll come around eventually
Eh, I was there too. I think we all were, if I can help, why not? :p
but answering @lapis sequoia Sorry, I don't have experience with Flask
aight
well, my network does not converge on my dataset, but an SKlearn algo works
assuming the dataset is linearly separable, why cannot the model capture that relationship?
There's documentation on the matplotlib website that explains how to integrate matplotlib plots into your webapplication @lapis sequoia
Let me see if I can find it.
aight
I will! Thanks
im trying out minimax, and this is what i have so far. been following a course on it, but i can't find what's causing this to go on forever. any ideas?
oh my goD
THANK YOU
i have been suffering with this for like 2weeks
< 3
Can anyone recommend keyboards below 100 USD (tenkeyless is fine) for a programming beginner?
Thank you! 🙂
I have a question
What's the best way to improve OCR?
Resize, then preprocess or Preprocess, then resize?
@exotic maple can you check out my q above when you're free?
what Q?
hi all, have a question. so i've created a script that prompts the user what column they would like to search in aka account ID, parent ID, etc. then it prompts what they are searching for. but, i've added a column that the data is all numerical, and it doesn't find them ex 188991, but it can find M188991. any idea how to fix this? coding in python using pandas, code is on a separate comp
while True:
    variable = input(f"{bcolors.WARNING}Search by Acronym / Parent / Alert ID / Account? {bcolors.ENDC}")
    if variable == "Exit":
        sys.exit(0)
    if variable not in df.columns:
        print(f"{bcolors.FAIL}Error: Invalid Input{bcolors.ENDC}")
        continue
    if variable == "Acronym":
        while True:
            input1 = input("Please provide an Acronym: ")
            result1 = df.loc[df[variable] == input1]
            if input1 == "Back":
                break
            if len(result1) == 0:
                print(f"{bcolors.FAIL}Acronym not found. Please try again{bcolors.ENDC}")
            else:
                print(tabulate(result1, headers='keys', tablefmt='psql'))
            continue```
here is one part of the code, the rest is the same as
if variable == "Acronym":
just with Parent ID, Alert ID, etc
@exotic maple
oh wait nvm
?
I would suggest:
- normalize column names and input. i.e -> lowercase all, strip, etc.
- df.columns is not an iterable by default (AFAIK) cast it into a list before checking, or something
aside from that, i cant think of anything else
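One more thing that might be worth checking, as a guess from the symptoms: `input()` always returns a string, and if pandas loaded the all-numeric column as integers, comparing it with the string "188991" matches nothing, while a value like "M188991" lives in a string column and matches fine. A sketch that compares everything as text; the function and column names are invented:

```python
import pandas as pd

def search_column(df, column, query):
    """Return the rows whose value in `column` equals `query`, compared as text."""
    as_text = df[column].astype(str).str.strip()
    return df.loc[as_text == query.strip()]

accounts = pd.DataFrame({
    "Account": [188991, 42],          # loaded as integers
    "Acronym": ["M188991", "X"],      # loaded as strings
})
by_number = search_column(accounts, "Account", "188991")
by_text = search_column(accounts, "Acronym", "M188991 ")
```

Reading the file with `dtype=str` would be another way to get the same effect for every column at once.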
any powerBI user?
Hi. I'm trying to use/debug an add-on for Anki, the open source Python-based flashcard app. The add-on, MorphMan, uses a Python link to communicate with spaCy, the Python-based open source NLP library, and gets POS tagging and dependency labels. I get this exception when I attempt to pass certain fields of a flashcard through the language processor via a "recalc", which computes this for all user-specified flashcards.
Asking here because it seems the project has been vacated and devs aren't actively supporting it.
Add-on
https://github.com/rteabeault/MorphMan/tree/rteabeault/spacy_support
Debug info:
Anki 2.1.35 (84dcaa86) Python 3.8.0 Qt 5.14.2 PyQt 5.14.2
Platform: Windows 10
Flags: frz=True ao=True sv=1
Add-ons, last update check: 2021-03-28 23:12:08
Caught exception:
Traceback (most recent call last):
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman__init__.py", line 17, in onMorphManRecalc
    main.main()
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman\morph\main.py", line 573, in main
    allDb = mkAllDb(cur)
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman\morph\main.py", line 195, in mkAllDb
    ms = getMorphemes(morphemizer, fieldValue, ts)
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman\morph\morphemes.py", line 166, in getMorphemes
    ms = morphemizer.getMorphemesFromExpr(expression)
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman\morph\morphemizer.py", line 51, in getMorphemesFromExpr
    morphs = self._getMorphemesFromExpr(expression)
  File "C:\Users\AppData\Roaming\Anki2\addons21\Morphman\morph\deps\spacy\morphemizer.py", line 40, in _getMorphemesFromExpr
    self.proc.stdin.flush()
OSError: [Errno 22] Invalid argument
this seems rather simple but I believe I am getting an infinite loop
nums = [1,0,4,10,14]
for i in range(len(nums)):
    if nums[i] == 0:
        nums.insert(i,99)
you keep adding elements and then iterate over them; this indeed is an infinite loop
If i scale my features before sklearn logisticregression, i should get different confusion matrices for scaled and unscaled data and yet im not. Does anyone know why this might be?
the best kind of loop
wait, wouldnt this just add a 99 at index element [1]? basically replacing the 0
I feel inclined to test it
well i just trying to add a certain number before each element in a list if the element met a certain condition like i ==0
AHHH
I see what happens
It doesn't ever delete the 0, and the loop's length is reset
interesting
Is there a replace method for lists?
ah yes, i forgot lmao
the recommended method is list comprehension duh
@carmine iron do this
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
lista2 = [n for n in nums if n != 0 else 99]
oh
mmm
but that....why...you know what, nvm lol
I... don't see a good way to do that. Any time you add an element to a Python list, every successive element's index changes, so you'll have to do massive iteration loops
for 1 element it won't be hard, but if you have large or dynamic lists...-shivers-
Maybe you can create a dummy index variable and use range dynamically with it? so instead of just range(len(list)) you can do range('dummy index here', len(list))
basically, skip all the previously seen elements
this will not work bro
there is a problem with your syntax
[x if x != 0 else 99 for x in nums]
[nums.insert(i,99) for i,j in enumerate(nums) if j == 0] are you searching for something like that?
this leads to infinite loop
yeah i am not understanding why it creates an indefinite loop
because every time you add 99 the length increases by 1, and the for loop still lands on the 0 at the next index
ah,
0 goes nowhere and for loop detects it and adds 99 infinitely
take a look at that
every time it adds 99 it goes right before 0
yeah thanks. I always need to debug list comprehension lol
it looks so much like if else in other statements
is there a clean way to do this
list comprehension is the cleanest I can think of
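For reference, a minimal sketch of the behaviour discussed above, plus one way to put a 99 in front of every 0 without mutating the list being looped over. The names `mutated` and `result` are made up for illustration.

```python
nums = [1, 0, 4, 10, 14]

# Mutating while indexing: range(len(...)) is evaluated once, so this stops
# after 5 passes, but the 0 keeps sliding right and is hit on every pass.
mutated = list(nums)
for i in range(len(mutated)):
    if mutated[i] == 0:
        mutated.insert(i, 99)
# Iterating over the list itself (for n in mutated / enumerate(mutated))
# while inserting would never finish, so that variant is not run here.

# Safer: build a new list and leave the original untouched.
result = []
for n in nums:
    if n == 0:
        result.append(99)  # lands right before the 0
    result.append(n)

print(mutated)
print(result)
```

Building a new list avoids the index-shifting problem entirely, at the cost of one extra list.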
What's a nice way to get the first row of a DataFrame matching some predicate as a Series?
def get_first_row_cond(df: pd.DataFrame, predicate):
    rows = predicate(df.index)
    first = np.where(rows)[0][0]  # np.where here will return a 1-element tuple, the first element of which are the indexes that matched the predicate
    return df.iloc[first]
first_row_with_arg = get_first_row_cond(weapons, lambda rows: rows.str.contains("arg"))  # for example
looking for an easier solution than this
also interested in the specific case when the predicate is just "the index of the row (the dataframe uses a string index) contains some substring"
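One possibly simpler take on the substring case, as a sketch only: boolean-mask the frame and take the first row with `.iloc[0]`. The `weapons` frame below is a made-up stand-in, since the real data isn't shown.

```python
import pandas as pd

# Hypothetical stand-in for the `weapons` frame, which uses a string index.
weapons = pd.DataFrame(
    {"damage": [10, 25, 40], "weight": [1.0, 3.5, 8.0]},
    index=["dagger", "target_bow", "argent_axe"],
)

mask = weapons.index.str.contains("arg")    # boolean array, one entry per row
first_row_with_arg = weapons[mask].iloc[0]  # first matching row as a Series

print(first_row_with_arg)
```

Note that `.iloc[0]` raises an IndexError when nothing matches, so a real version would probably want to handle the empty case.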
And another question: is there a way to hide some columns of a DataFrame so that they are accessible, but not shown by default when printing the dataframe (well, or whatever the fancy output Jupyter shows for dataframes is called)?
Does anyone have any idea why a network cannot grasp a linearly separable dataset that Gaussian Naive Bayes handles fine?
you can debug that yourself. for example, what does tornados['Status'] return?
do you think your replace function applies to each element?
Damn, chat's dead 😞
@lapis sequoia https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
Try using a dict and doing it in place.
Here it is for an individual series https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html?highlight=series replace#pandas.Series.replace
Hello everyone, currently trying to understand why this error keeps popping up. Any help?
AttributeError: 'int' object has no attribute 'lower'
The code is:
`import pandas as pd
from stop_words import get_stop_words
my_list=get_stop_words('english')
df=pd.read_csv("C:/Users/ymaxn/Documents/Python Data Mining/yelpreviews.csv")
#seperate x and y
x=df["stars"]
y=df["text"]
#convert x into document term matrix
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(stop_words=my_list)
cv.fit(df) ##fitting, getting features
features=cv.get_feature_names()`
Apparently SymPy can't solve x^2 = y^4 (1 - y^3) in terms of y. Anyone know how to use SymPy to solve this equation for y? The range for x and y is -1 to 1.
import sympy
x, y = sympy.symbols('x y')
expr = x**2 - y**4 * (1 - y**3)
sympy.solve(expr, y)
# this returns []
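The equation is a degree-7 polynomial in y, and polynomials of that degree generally have no closed-form solution in radicals, which is presumably why `solve` comes back empty. A numerical sketch for one fixed x at a time, using NumPy instead of SymPy (the helper name `solve_for_y` is made up):

```python
import numpy as np

def solve_for_y(x):
    """Real roots y in [-1, 1] of x**2 = y**4 * (1 - y**3) for a given x."""
    # Rearranged: -y**7 + y**4 - x**2 = 0, coefficients from y**7 down to y**0.
    coeffs = [-1, 0, 0, 1, 0, 0, 0, -x**2]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real  # keep the real roots only
    return np.sort(real[(real >= -1) & (real <= 1)])

ys = solve_for_y(0.3)
print(ys)
```

This only gives numbers for a specific x, not a symbolic expression, so it fits plotting or lookup better than algebra.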
hello
Should I install jupyternotebook per conda virtual env
or should I install it globally?
anyone have any opinions?
I usually only install things in virtual environments. I do not use conda.
Is there a standard library for solving system of equations?
you might be able to do it using a scientific library using matrices
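A small sketch of the matrix approach for a linear system, assuming NumPy (a third-party package, not part of the Python standard library) is acceptable. The two equations are invented for the example.

```python
import numpy as np

# System:  2x + 1y = 5
#          1x - 3y = -1
A = np.array([[2.0, 1.0],
              [1.0, -3.0]])  # coefficient matrix
b = np.array([5.0, -1.0])    # right-hand side

solution = np.linalg.solve(A, b)  # solves A @ [x, y] = b
print(solution)
```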
Where can I get started with machine learning?
Thanks for the awesome tip! I'm looking into this
You just slice the dataframe with []
Because it’s not linear by nature?
Check the data type for every single variable
So that you find out which variable is integer
An integer doesn’t have attribute lower() because that’s for string only
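A sketch of that check with pandas, on a made-up frame where one "text" value is deliberately an int. The column names follow the snippet above; everything else is an assumption.

```python
import pandas as pd

# Made-up stand-in for the reviews file; one "text" value is an int on purpose.
df = pd.DataFrame({"stars": [5, 1, 3], "text": ["Great food", 404, "It was ok"]})

print(df.dtypes)  # dtype of every column

# Flag the rows whose "text" value is not a Python str.
is_str = df["text"].map(lambda value: isinstance(value, str))
bad_rows = df[~is_str]

# Casting to str gives every element a .lower() method again.
text_clean = df["text"].astype(str)
lowered = text_clean.str.lower().tolist()
print(bad_rows)
```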
Are you sure this is all the code?
Global is fine
in spark I am reading in a csv file with column headers
I want one of the columns to be an index
is there a way i can do that?
what does df.iloc[:,4] actually do?
I've never seen the comma before
oops to be more specific
df.iloc[:,4].rolling(window=ma).mean()
where df is dataframe ofc
i'm trying to use pplot from seaborn-qqplot to graph a set of data and fit regression lines. however of the 5 sets of data, it's only fitting 2 regression lines. any help would be appreciated
the code i'm using:
pplot(car_data, x = "horsepower",
y = "price-sq",
hue = "body-style",
kind = 'qq',
height = 4, aspect = 2,
display_kws = {"identity":False, "fit":True})
plt.show()
you know how you can do list slicing with the : symbol on lists and strings?
and what i get is
@serene scaffold yes but idk what the comma does
lists and strings are one-dimensional, whereas dataframes are two-dimensional. So you can slice along both axes
and you separate the two with the comma
so [:,4] is saying every row and col 0 to 3
it should be every row and only the column at position 4 (the fifth column, since positions start at 0)
because 4 is an int, not a slice
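To make the int-versus-slice difference concrete, a small sketch on an invented six-column frame (the window size `ma = 2` is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [9, 10, 11, 12],
    "d": [13, 14, 15, 16],
    "e": [10.0, 20.0, 30.0, 40.0],
    "f": [0, 0, 0, 0],
})

fifth_col = df.iloc[:, 4]    # int: every row, the single column at position 4 (a Series)
first_four = df.iloc[:, :4]  # slice: every row, columns at positions 0..3 (a DataFrame)

ma = 2  # moving-average window
rolling_mean = df.iloc[:, 4].rolling(window=ma).mean()
print(rolling_mean)
```

The first `ma - 1` entries of the rolling mean are NaN because the window is not full yet.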
What's the difference between scikit-learn, PyTorch, and TensorFlow? I can't find a clear answer online. I'm just getting my hands dirty with machine learning and wanna know what would be best to start off with. What would be the best to create something like a chat bot that can have full on conversations?
those are all libraries with their specific scopes.
scikit-learn is like a general ML library with some nice utilities as well.
Pytorch and tensorflow are DL & other stuff a bit more advanced.
do you want to code it yourself?
if you do, then TensorFlow or PyTorch would be more appropriate
that said
PyTorch and Tensorflow have different approaches to similar problems, and they support numpy-like arrays (tensors) that can live on the GPU. scikit learn is more general purpose.
if you're just getting started
I'd say a chatbot is kinda over your head
I would strongly suggest you begin with something simpler
very strongly
Like wot? I thought that would be pretty simple
They could always make an Eliza bot, but I don't think that's what was wanted.
what makes you think it's simple?
Just assumed so lol
I would suggest
Since it’s just taking in input and responding to it
you spend some time reading up on the history of machine learning
and deep learning in particular
natural language processing is extremely complex
in the abstract sense, of course that is true
but think about it...
while it is an input-output operation, being able to respond to things in ways that mimic human behavior is incredibly challenging
how many animals do you know that can understand natural language? 🙂
Not many lol
and think about
how many billion years
it took those animals to evolve
natural language processing with neural networks is nowhere near a century old
let it suffice to say
that a human-level chatbot is still far beyond our abilities @ the moment
we can get close in restricted situations, yes
but only with state of the art techniques
a lot of NLP tasks have pretty narrow scopes
like, "identify all the words in this document that belongs to a certain category, even if you've never seen that word before"
it's a lot better than it was 10 years ago, yes, but it's still nowhere near perfect
so...if you want a general chatbot? you might need to wait a while
of course, if your chatbot's scope is restricted, the problem becomes more tractable
but anyway, I would say...work on your fundamentals first.
So what would be a good library for me, a noobie in machine learning but pretty experienced in Python. And where should I start off with learning it?
learn the basics of data manipulation with numpy and pandas
Alr
really...?
?
?
yeah
you should
minimally
be able to write a simple deep learning library
IMO
backpropagation, feedforward, gradient descent, fully connected layers, all the basic stuff
you could pick up something already in existence and make a basic chatbot, but it'd probz suck
oh, and don't forget your linear algebra, graph theory, calculus, statistics, probability, etc. etc.
Damn alr
Yeah
yes
machine learning is more general
it's a bit fuzzy
but basically
machine learning refers to any technique that allows a computer to, based on an algorithm, modify its responses to incoming data based on already-seen data
deep learning is a specific type of machine learning that uses neural networks with many layers
the "depth" of a neural network refers to the number of layers it has
with computational power having become increasingly cheap in the last 20 years
What's the best way to get the min value across columns ?
more complex (deeper) neural networks became increasingly viable
.min()...?
but if I only want to compare certain columns
Ohhh thank you for clearing that up
then filter the columns first
I have to type every single one out
what if there are many columns
for identifying those columns
is there a better way?
that depends
on how you determine that subset
you can't assume your program automagically knows which columns you want
I know but I was looking for a better way than just listing all of them
my columns look like this
High Low Open Close Volume Adj Close Sma_50 Ema_3 Ema_5 Ema_8 Ema_10 Ema_15 Ema_30 Ema_35 Ema_40 Ema_45 Ema_50 Ema_60
I wanted to compare across my emas
but obv not across high, low, open etc
df[[col for col in df.columns if col.startswith('Ema')]].min()?
oo
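One caveat on the snippet above, sketched on made-up numbers: `.min()` on its own gives the minimum of each column, while comparing across the EMA columns row by row needs `axis=1`. `DataFrame.filter` is used here as an alternative to the list comprehension.

```python
import pandas as pd

df = pd.DataFrame({
    "High": [10.0, 11.0],
    "Low": [8.0, 9.0],
    "Ema_3": [9.5, 10.2],
    "Ema_5": [9.1, 10.4],
    "Ema_8": [9.3, 10.0],
})

ema_cols = df.filter(regex=r"^Ema_")  # keep only columns whose name starts with "Ema_"
row_min = ema_cols.min(axis=1)        # smallest EMA in each row
col_min = ema_cols.min()              # smallest value in each EMA column
print(row_min)
```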
I reco this book
it's not as digestible as a lot of the material you will find online
but it is good
Ima check it out, thanks for all the advice 😅
yw 👋
can anyone think of a programmatically better way to add values to a pandas column? Let me explain:
My data looks like this
I want to be able to programmatically modify the movies column IF both names have movies in common (duh)
this data is held in a list
right now im doing:
- find name in either column
- obtain index location
- create copy df
- find 2nd name in either column of mirror
- obtain index location
- add value in original df at the specified iloc
but this doesn't sound very efficient to me... any hints?
It works, and I get what I want, but God save me it's...messy
ok so...I did it, much better and efficient, but It's the most disgusting piece of code my eyes have ever witnessed
relationships.index[((relationships["Name1"] == weights[0][0]) | (relationships["Name2"] == weights[0][0])) & ((relationships["Name1"] == weights[0][1]) | (relationships["Name2"] == weights[0][1]))]
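A possibly more readable equivalent of that mask, as a sketch: test each name against both columns at once with `eq(...).any(axis=1)`. The sample frame and the shape of `weights` (name, name, count) are assumptions, since the real data isn't shown.

```python
import pandas as pd

# Hypothetical data; only the column names come from the line above.
relationships = pd.DataFrame({
    "Name1": ["Ann", "Bob", "Cara"],
    "Name2": ["Bob", "Cara", "Ann"],
    "movies": [0, 0, 0],
})
weights = [("Bob", "Ann", 3)]  # assumed layout: (name, name, shared movies)

first, second = weights[0][0], weights[0][1]
names = relationships[["Name1", "Name2"]]

# A row matches when each name shows up in either of the two columns.
mask = names.eq(first).any(axis=1) & names.eq(second).any(axis=1)
matching_index = relationships.index[mask]

relationships.loc[mask, "movies"] += weights[0][2]
print(relationships)
```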
Hey everyone 👋 I have an issue here:
I get this error, and It says op error, but why is it happening?
@velvet thorn hi there I have a question also regarding pandas
if I want to filter all the results from my dataframe in that column and add a new column, how do I do it
so for example, in my column names, there is a lot of names, some also repeated
but I want to filter all the names that are repeated into 1 new column
can anyone help i want to make an ai i am new to python
so just wanted someone to help build an ai
@hasty grail
i want to make an ai so can anyone help with the code
I'm using a numpy array to track snowflakes in a grid, is there a way to create an editable "window" array that affects the bigger array?
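A quick sketch of the "window" idea: basic slicing of a NumPy array returns a view, so writing into the slice writes into the big array. The grid size and slice bounds below are arbitrary.

```python
import numpy as np

grid = np.zeros((6, 6), dtype=int)

window = grid[1:4, 2:5]  # basic slicing returns a view, not a copy
window[:] = 1            # in-place write, shows up in grid as well
window[0, 0] = 7         # same cell as grid[1, 2]

print(grid)
```

Assigning with `window = ...` rebinds the name instead of writing through, and indexing with lists or boolean masks returns a copy rather than a view.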
pls
#i need help
can anyone help
helpers
does anyone know how to use the index as the column for the plot?
for instance my data is like this
I wanna plot the index against the impliedVolatility
You can use df.index to access it
df.duplicated()
yeah so like
what I wanna do is to
group the data I have in my dataframe
and create a new column
do you know how to do that @dense cosmos
so in my dataframe I have date, stocks, prices
in the stocks column, it has not been cleaned so I have a bunch of different stocks
I want to be able to group the same stock name together
Are the stock different datapoints for dates or just duplicate rows?
different datapoints
they are not duplicated
im trying to do this right now ❤️
but the stock name is duplicated
Probably want to use df.groupby("stock")
so when I use that
how then do I retrieve a certain stock
like how do i retrieve how many groups I have
You can do it with len(df.groupby("stock").groups.keys())
And getting a certain stock can be done with df.groupby("stock").get_group("stockname") (indexing the groupby with ["..."] selects a column, not a group)
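A small sketch of counting the groups and pulling one out, on invented rows (the column names follow the description above):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-02"]),
    "stock": ["AAPL", "AAPL", "MSFT"],
    "price": [300.0, 297.0, 160.0],
})

grouped = df.groupby("stock")
n_groups = grouped.ngroups        # number of distinct stocks
aapl = grouped.get_group("AAPL")  # all rows belonging to one stock

print(n_groups)
print(aapl)
```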
@dense cosmos thanks sir
let me play around with it
I have another question like
so like I have a list of dates --- all last dates for every month for every year
so in the my dataframe I have various dates
1 Jan, 5 Jan etc etc
I want to convert all these dates in the dataframe to the ones in the list
so for example , 1 Jan will be converted to 31 Jan 20XX
same for the rest of the dates and the corresponding years
I will then add it into a new column of the dataframe
how should I go about doing that? @dense cosmos
Are you trying to aggregate by month?
so like
month and year
meaning like for every date in the dataframe that is in Jan 2020, I will convert it to 31 Jan 2020
if it is in Jan 2019 then I will convert it to 31 Jan 2019
@dense cosmos
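If the target is simply the calendar month end, one sketch is to add a `MonthEnd(0)` offset, which rolls each date forward to the end of its own month and leaves month-end dates alone. The column names here are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-05", "2019-01-17", "2020-02-10"]),
})

# MonthEnd(0) rolls forward to the end of the same month.
df["month_end"] = df["date"] + pd.offsets.MonthEnd(0)
print(df)
```

If the last business day is what's needed instead, `pd.offsets.BMonthEnd(0)` should behave the same way for business month ends.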
im a little bit confused on how to recreate this plot
I'm currently doing something like this df.plot_bokeh(kind = 'scatter', x = 'dte', y = 'impliedVolatility')
I'm gonna switch to line graph
but I have the x and y axis
how do I make it graph every different date?
like in the one above
show a few rows of your data
like 10
in text
@velvet thorn so these are the dates in my dataframe
ideally what I want is another column
that will be
2008-04-31
2009-03-31
2008-07-31
oh you weren't talking to me
```py
trading_dates = []
trading_month = []
date_ranges = pd.date_range(pf_clean['PC Date'].iloc[0], pf_clean['PC Date'].iloc[-1], freq = 'BM' )
for td in date_ranges:
    trading_dates.append(td)
```
😅
@velvet thorn in the trading_dates list
I have the dates that will be used for the respective dates
what?
in general
for data manipulation questions
the simplest way to get the answer you want
is to show a text example of input data and the expected result
ok hold on
PC Date
2008 - 01 - 12
2009 - 05 - 30
Trading Dates = [2008-01-31, 2008-02-28,2009-05-31]
so I want a new column - new dates
such that it will input the dates according to the months
and year
so the New Dates column will be:
2008-01-31
2009-05-30
@velvet thorn
your logic still isn't super clear
but what I'm guessing is
you have two Series of datetimes
the second Series is unique on a month level
so you want to match the two Series on their months and get the corresponding value from the second Series
is that right?
you need a join operation
let me think about this for a bit
alright, thanks!
>>> left = pd.to_datetime(['13/01/2010', '05/05/2010']).to_series().reset_index(drop=True)
>>> left
0 2010-01-13
1 2010-05-05
dtype: datetime64[ns]
>>> right = pd.to_datetime(['16/01/2010', '18/02/2010', '24/05/2010']).to_series().reset_index(drop=True)
>>> right
0 2010-01-16
1 2010-02-18
2 2010-05-24
dtype: datetime64[ns]
>>> pd.concat([left, left.dt.month.rename('month'), left.dt.year.rename('year')], axis=1).merge(pd.concat([right, right.dt.month.rename('month'), right.dt.year.rename('year')], axis=1), on=['month', 'year'], suffixes=['', '_new']).drop(columns=['month', 'year'])
0 0_new
0 2010-01-13 2010-01-16
1 2010-05-05 2010-05-24
real quick
you can, of course, clean it up and spread it out
but I believe this is what you want
yup this is it!
trying to understand the code haha
@velvet thorn on=['month', 'year'], suffixes=['', '_new']).drop(columns=['month', 'year'])
what does this line do
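A spread-out sketch of the same merge, with comments on what each of those arguments does. The helper `with_keys` and the column name `date` are made up for readability; the dates are the ones from the example above.

```python
import pandas as pd

left = pd.Series(pd.to_datetime(["2010-01-13", "2010-05-05"]), name="date")
right = pd.Series(pd.to_datetime(["2010-01-16", "2010-02-18", "2010-05-24"]), name="date")

def with_keys(dates):
    """Put the dates next to helper columns holding their month and year."""
    return pd.DataFrame({"date": dates, "month": dates.dt.month, "year": dates.dt.year})

merged = with_keys(left).merge(
    with_keys(right),
    on=["month", "year"],   # pair up rows whose month AND year are equal
    suffixes=("", "_new"),  # both sides have "date"; the right one becomes "date_new"
)
result = merged.drop(columns=["month", "year"])  # helper keys are no longer needed
print(result)
```

The default inner join is why February disappears: no date on the left falls in that month.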
anyone got any tips on which algorithms I should use for audio classification:
My project consists of audio files which are 32 numerals and each with a label of the style of voice i.e. bored, excited, neutral etc..
is it possible to roll a 3d array containing an image in numpy?
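It should be, via `np.roll` with an `axis` argument. A sketch on a tiny fake image, assuming a (height, width, channels) layout:

```python
import numpy as np

# Tiny fake image: height 2, width 3, 3 channels.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

shifted = np.roll(image, shift=1, axis=1)          # wrap one pixel along the width
both = np.roll(image, shift=(1, 1), axis=(0, 1))   # wrap along height and width together

print(shifted[:, 0])
```

Leaving `axis` out flattens the array before rolling, which is rarely what's wanted for an image.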
uh
```
runcell(0, 'D:/!Code/папкипитона/!!!Project_currency/spyd.py')
Traceback (most recent call last):
  File "D:\!Code\папкипитона\!!!Project_currency\spyd.py", line 55, in <module>
    from prophet import Prophet
  File "D:\!Code\папкипитона\!!!Project_currency\prophet.py", line 56, in <module>
    m = Prophet()
NameError: name 'Prophet' is not defined
```
tf is this
```py
import fbprophet
from prophet import Prophet
m = Prophet()
```
im importing fbprophet and it says that Prophet isnt defined
tho i very clearly imported it
or can it be something with spyder ide?
how do i pass a time series column to lstm. For example i've got a column with dates and column with values. Do I need to pass both?
can someone who has any experience with matplotlib please look at #☕help-coffee
Update on the problem I was facing yesterday: I found out it was all my fault for bad code. I've recoded everything and now it's working with 91% accuracy. Plugging the code - the code isn't optimized because my main focus was just to make it work
Logistic Regression with GD
Could anyone show me a example of a parameterized stored procedure used in big data?
would be pretty annoying to slice out like 20 out of 30 columns each time
how do i feed several datasets into keras lstm
i dont think just uniting them is good since its time series
well, I tried a stack of dense layers with relu activation and it didn't work - not to mention CNN'S, RNN's and transformer architecture. all of them cannot converge (hell, they have 5% accuracy) while Naive Bayes gets near top results.
it's a complete mystery - my network seems to overfit to simple data, so I doubt there is a problem there
why would they have different approaches to similar problems, if an x approach is the most efficient?
the structure of the two frameworks has become very similar now, so not much difference exists except the ease-of-use they offer
Regardless of what it sounds like I meant by that, my point is that they both support developing custom architectures with tensors.
I'm not referring to any implementation details and how optimized they might be.
Hi everyone,
I am a high school student. I have a question: what math do I need for AI topics, and how can I learn it quickly? btw I have intermediate-level Python
you can't learn anything quickly - building knowledge takes time. I recommend you keep learning your school level math and try to learn the AI concepts intuitively rather than focusing on the math at first.
you can always dive into the math later (like in college where they would teach you everything from the ground up). you would be surprised how far an intuitive understanding can get you
I would drop the expectation that you're going to learn any of this quickly. Don't worry about how fast you learn it--worry about your attitude about learning
can anyone explain OneHotEncoding to me?
Are you familiar with vectors?
So can i learn and write ai code without math?
yes, but you would need to have at least a basic understanding
yes
data represented as sparse vectors
you can get started, though keep in mind that the math is still there and you can't ignore it forever
so you have a vector of all zeros, except one element is 1. that's a one hot vector.
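A tiny sketch of that idea with NumPy, on made-up labels: each category gets its own position, and each sample becomes a row with a single 1 in the position of its category.

```python
import numpy as np

labels = np.array(["cat", "dog", "bird", "dog"])

# classes: sorted unique labels; indices: position of each label within classes
classes, indices = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes), dtype=int)[indices]

print(classes)
print(one_hot)
```

scikit-learn's `OneHotEncoder` does the same job on feature columns and can return a sparse matrix, which matters when there are many categories.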
Of course i need math i am trying to learn precalculus now should i learn linear algebra ?
pre-calculus? in which grade are you BTW?
I would focus on doing well in the classes that you're currently taking, but you should plan to take linear algebra in university.
The CS program that I'm in currently requires you to do well in math courses or they won't let you in.
I am in grade 11 and trying to learn the basic concepts from khan academy
don't they teach you full calculus in grade 11?
well leave that for a moment
No
I would say that you learn your school-level math - and learn the deep concepts in the college because you would have to do them anyways
in the high school system I attended, they taught derivative calculus to the most advanced math students in their last year
but make sure your school math is exceptionally strong


