#data-science-and-ml
He is studying dt too 
So he can't help me with it
dt?
Data science
why "t"?
Hi, i understand the reason behind creating env for each project.
I have some confusion in how these env and packages inside them works.
- Anaconda is installed in c:/programdata/anaconda3.
- And i wanna create all my projects inside my d:/
So,
I saw that anaconda environments are inside a folder "envs" in the above location.
My question is when I'm creating an environment inside the envs folder using: "conda create -n venv python"
And then referring that env for my project in d:/
Will all the packages i need for my project have to be installed again inside the new environment, even though they are already in base after the installation?
also i am pretty sure you can compute this multi-class AUC without re-fitting the model each time, but you might end up with weird threshold values
So, I need to re-fit it
you can create environments with explicit paths instead of "names" using -p instead of -n
that said, the location of your project is unrelated to the location of the env you use
Understood
so it's totally fine to put your project in d:\ and to let conda create the environment wherever it wants to with -n
What about the package installation for each project?
what do you mean?
oh
Will all the packages i need for my project have to be installed again inside the new environment, even though they are already in base after the installation?
yes
it won't re-download the packages from anaconda.org, but it will install them separately
that's the whole point
that's how conda achieves separation between environments
you can theoretically build a system that de-duplicates the actual installed contents of packages, but it would be very difficult and messy
if you really want something like that, check out nix or guix
but you will have to write a lot of package specifications yourself in that case. not worth it to save a few gb of disk space imo, although it does make for a more thoroughly reproducible environment setup
So if i create an environment inside the d:/ project folder and refer the default conda python, you are saying i would have to install every package again even though those packages with same version are already in the anaconda folder?
what do you mean "refer the default conda python"
if you create an environment with conda create, you will have to install all your required packages into that environment, yes
like i said, that's explicitly the goal of environments: separation
The python version which anaconda comes with
OK 👍
I thought since it comes with packages i wouldn't have to install again every time.
it comes with packages installed in the base environment
if you create a new environment, you have to install packages in that environment
it's really not a big deal though, i wouldn't worry about it
anaconda comes with a lot of junk imo anyway
So, what's the point of installing anaconda? i think environments and python packages can also be managed if we only use python from python.org
heck, you can even create a conda env that doesn't have python at all
that isn't true. anaconda is the conda package manager + a bunch of stuff pre-installed in the base environment
conda is fundamentally different from python + venv
you don't even have to install python in a conda environment
the only reason you need python in the base environment is that conda is itself a python application
Quite an off-topic question
Is it smart to buy a home server for data science? On a computer, it is not always possible to leave the code running for a long time, since training consumes 100% of the cores
depends. high-end servers are really expensive, and cloud compute can be surprisingly cheap for small jobs.
some people build really wild computers for doing machine learning at home, 2 gpus and xeon processors with ecc ram
but that's very expensive, especially with the gpu and other chip shortage issues
No high end, about 3060 and 2x 12 cores xeons
that's still considered high-end by a lot of standards, and fairly expensive
my home pc is some old i5 and a 1060
That's expensive
Cloud is very cheap if you are doing light weight but can empty your pocket if you dont know your exact requirements
My home pc is r9 3950x, 2070S and 64 gb ram
but even on it some datasets take more than a few days
Aliexpress 
only about 3-4k$
So, would you suggest i go with conda or normal python?
There was a mention of miniconda in their website. Is it a lighter version and without packages inside the base?
I wouldn't consider it cheap lol
yes, i recommend conda instead of python, and i use miniconda. the base env in miniconda only has the packages required to run conda, nothing else. i strongly prefer it that way
that's a lot of money for most people, even people with good jobs
i am a full time professional and i would consider $3000 a very expensive purchase
if that isn't expensive for you, then you are very fortunate, and you should go ahead and build a server
I was gonna recommend the Azure's free tier to you but nvm
It's quite expensive for me, but I could afford this purchase if I saved money for one or two years
So I am trying to understand whether it's worth buying
Thanks man!!!
Has it time lock?
Jk lol. Free tier has 2gigs of ram and can only run some stuff
Jk? I am not native speaker
Just kidding.
definitely not worth it in that case. learning to do data science on cloud platforms is probably a useful job skill anyway
I was trying some clouds on free tier and it was worse than my home pc(
Ok, thanks
It is. I only use them to schedule refresh some tasks or running bots
Ok)
can anyone give me an easy way to write folium.circlemarker in a folium map from a dataframe
nvm i got it
hi, i need help with anaconda. when i try to install a library in cmd, i type pip install 'lib name', then it gets installed into anaconda and i can't use the lib in my main python interpreter
Are you sure that you need to use Anaconda?
If not, try deleting it.
it was working with me so well but after changing windows this problem appeared
do you still have the problem if you activate a virtual environment?
yes
before changing windows i used to install each lib twice, once in main python and once in the anaconda env. now when i install in main python it gets installed in the anaconda env
since you like using anaconda, why are you trying to not use it, in this case? because you can make separate environments with anaconda and pip install stuff into those different environments.
anyway, what happens if you type which pip in the terminal
in anaconda or cmd/
try where pip in cmd
D:\anaconda3\Scripts\pip.exe
C:\Users\username\AppData\Local\Programs\Python\Python39\Scripts\pip.exe
it's showing separate locations for D and C?
I've never heard of that happening 
but as you can see, on the D drive (whatever that is), pip is pointing to the pip in anaconda3
yea thats the root folder of anaconda i installed it there
so what should i do ;/
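One way to untangle this is to ask the interpreter itself where it lives, instead of guessing from which `pip.exe` comes first on PATH. A minimal sketch using only the standard library; the function name is made up for illustration:

```python
import sys


def describe_interpreter():
    """Return the paths that identify the Python running this script."""
    return {
        "executable": sys.executable,  # full path of the running python
        "prefix": sys.prefix,          # root folder of the install or environment
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

Run it once with each Python you have. Installing with `python -m pip install <package>`, using the interpreter you actually want, ties pip to that interpreter regardless of PATH order.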
do you have gitbash installed? it comes with git, which you will eventually need as a developer anyway.
I only use gitbash and powershell on Windows. cmd is annoying.
i am a beginner and i think anaconda is beginner friendly
in my experience, the opposite is true. Also git and gitbash are unrelated to anaconda.
Can anyone tell me if a script to compare and analyze Cryptocurrency data is an AI ??
git is a version control system and gitbash is a terminal.
i will look that up
thank you for your help tho
depends on what analysis it is doing and how
Ok, i haven't started yet. i'm waiting for a friend to help me with the cryptocurrency analysis and then i will start
It shows the same for me. Not sure why there are 2 listed
its pain
Could I set up a neural network so that, instead of retraining, it uses memory to adapt to different datasets? The idea is, I feed it a hundred or so images, then it checks to see if the remaining images look similar to the first 100
I doubt that would result in comparable performance, but you can try it, I guess
also how would it "check to see if the remaining images look similar"?
I think I misread what you said. Can you be more specific about what you're trying to do? You want to train a model to do what with images?
Train it to find outlying images. At minimum say find a infographic or something among photos, but it would be quite cool, if it could pick out an image of a house in a group of car images
so you want a model that takes a set of images and returns the image that is least similar to all the other images in the set?
close enough
There isn't really "close enough" when you're trying to formally specify what something is supposed to do.
Upon more research: one-shot learning on large sets of images.
Wait, let me redefine that.
One shot learning via memory, not retraining
Ok, then does this work: Few shot learning via memory, not retraining
Hello, I'm really new to tensorflow and I'm doing the deeplearning.ai course on coursera
Could you please tell me if this is a good tutorial to follow https://youtu.be/bte8Er0QhDg
Today we use Tensorflow to build a neural network, which we then use to recognize images of handwritten digits that we created ourselves.
Hello, everyone. I work with Databricks and I have the following problem. I want to connect a Databricks cluster with my local machine. I tried Databricks Connect, but only the Spark code executes on the cluster; I want the entire code to execute on the cluster.
Hey everyone, i am trying to make a multi-output prediction model using the Keras functional API. But when i train it, i get all NaN values.
Link to colab notebook : https://colab.research.google.com/drive/14Megu0Ta2ZF-Q-57WZtNESi4bdVoejLk?usp=sharing#scrollTo=hPgKeydcZpEY
What am i doing wrong?
Hi guys, does anyone know the maths of LDA and variational inference? I have some questions. What does parameterization mean?
Hi Guys im trying to start learning about AI and creating some sort of AI assistant and wondering where there is good places to start and learn to code this in python?
@ me with any advice if possible <3
How much Python do you know, and what is your math/statistics background?
I have Alevel in maths/statistics and currently working as a software developer in python so i have a basic/goodish understanding
sounds good. what all do you want this voice assistant to be able to do? keep in mind that a "general purpose AI" isn't really an attainable goal.
I'm not too sure, I haven't put much thought into that. Could you give some examples if that's possible?
not really. I don't know your life

hahaha I'm just creating this to understand more into AI so really an assistant that can do anything if that makes sense.
if your goal is to learn more about AI, try making a K nearest neighbors classifier on some data from Kaggle
Okay I can have a look into that thank you!
@exotic edge a high-level overview of K nearest neighbors: suppose you want to predict the political affiliations of people in a city, and you have the most recent electoral results from that city broken down by household (which sounds illegal as fuck). For people who didn't vote in that election, an effective way to guess how they might have voted would be to assume they voted the same way as those closest to them
so if one didn't vote, and they're in a house with three Purple voters and two Orange voters nearby, you could naively assume that they would have voted Purple.
ah so its like prediction based off of the majority votes in that small area?
it's whoever has the top k shortest distances to the person you're trying to guess for
for an unknown person, assume they're the same as the majority of the k nearest people (where k is an integer like 4, or something)
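The voting idea above fits in a few lines of plain Python. This is a sketch, not a tuned implementation: the household coordinates and the helper name `knn_predict` are invented for the example, and distance is plain Euclidean.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbours."""
    # Pair every training point with its Euclidean distance to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the k closest labelled neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the k labels.
    tally = Counter(label for _, label in nearest)
    return tally.most_common(1)[0][0]


# Hypothetical household coordinates (x, y) and how each one voted.
households = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
votes = ["Purple", "Purple", "Purple", "Orange", "Orange", "Orange"]

print(knn_predict(households, votes, query=(0.5, 0.5), k=3))
print(knn_predict(households, votes, query=(5.5, 5.5), k=3))
```

On real data, scikit-learn's `KNeighborsClassifier` does the same job with faster neighbour search.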
okay okay thank makes sense!
i'm so confused
corr = df.corr(method = "spearman")
plt.figure(figsize=(30,30))
sns.heatmap(corr, annot = True, fmt = ".2f",cmap="Blues")
plt.title("Spearman Correlation Heatmap")
sns.set(font_scale = 2)
plt.show()
how do i make the font size of a heatmap bigger?
like this is very small
ok i got it
Hi, i got an assignment from my prof that is bothering me. If i'm not mistaken, he asked us to apply a regression algorithm to the iris dataset, which IMO should be a classification problem. I'm new to this field and don't have enough of a clue about this, so i need help determining what type of algorithm should be applied to this dataset. Anyway, the data is from sklearn.datasets, but i'm gonna send the link here
if you are trying to predict species, then yes it's a classification problem. but you can predict any of the other 4 variables too
e.g. can you predict sepal length from sepal width, petal length, and petal width?
Multiple Linear Regression
sure, there you go
Never thought of that before, dang, thanks bro
Is there someone who can help me?
import pandas as pd
df = pd.read_csv('testVersion.csv', sep='|')
showValue = df[['LeadInfo']]
x = df['Huidig type woning', 'Toekomstig adres', 'Toekomstige postcode'] = df['Leadinfo'].str.split(':',expand=True)
print(x)
One would have to know how that data is being represented in the code to know.
You're not likely to get help with screenshots of text. Copy and paste text as text.
Thanks, this is what i did so far
do you have a DataFrame or a CSV file or what?
I created a dataframe from a csv file
!paste Please use this in the future
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
ok
so i have this code
it does this
its an assistant that listens for commands
in the first code it does this
and it listens
and after a command is done , it again listens till you shut it down
i want it to listen only when i say a hotword
like ok google
```py
import speech_recognition as sr

hot_word = 'Hi'
r = sr.Recognizer()
r.pause_threshold = 5  # This waits for 5 sec after voice ends
with sr.Microphone() as source:
    text = r.listen(source)
    text = r.recognize_google(text)
    if hot_word in text:
        # do anything like calling a function or reply to it
        pass
```
the second code allows me to use hey google like feature so it listens for commands only when i say the hotword
its a simple thing to do but i dont know what and where to remove in the first code and where to add the second one
thanks for understanding, ping me when you are available to help
if you want you can also help me in my dms, or here works well too
@serene scaffold could you help me?
I'm busy today, sorry
No problem , anyone else ?
@desert oar thanks! Your solution from yesterday works. I got quite strange results, but they look correct
hi can anyone tell me why this character-by-character tokenization is happening here?
I am trying to call my preprocessor function on each value of the Qualifications feature
does Qualifications just contain text?
show us the definition of preprocessor
!paste
please do not post a screenshot of the definition
Ya
this is essentially what i am doing in my preprocessor func
@desert oar
Also another thing, does anyone here have any experience w xgboost?
Any good services that will run my code on a machine with a lot of cores and a lot of ram available?
Does anyone know what can be a good evaluation metric for it?
So far I found mse and rmse as the most common ones.. but what about the traditional ones?
Such as accuracy, precision and so on
Are they not good way of evaluating an xgb's performance?
for i in text is suspicious when text is just a string
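The reason it is suspicious: looping over a string yields one character at a time, so any per-token step inside that loop runs per character. A small sketch with a made-up qualification string and a hypothetical `tokenize` helper:

```python
def tokenize(text):
    """Split a string into word tokens on whitespace."""
    return text.split()


qualification = "BSc Computer Science"

# Iterating over a string yields one character at a time.
chars = [token for token in qualification]

# Splitting first yields whole words instead.
words = [token for token in tokenize(qualification)]

print(chars[:5])
print(words)
```

If the preprocessor is applied per row, it probably needs to split the text into words before looping, assuming word-level tokens are what is wanted.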
Can someone here help to make a sliding windows classification ?
please help im really stuck
Good night, I'm having trouble retrieving an image from a URL
import io
import requests
import pytesseract
from PIL import Image
url = 'https://resultadosgenerales2021.cne.hn/imagen_acta.html?url=https://provisorio-honduras-2021.datosoficiales.com/opt/recuentos/mesa-8766_DIP.jpg'
headers = {
'User-agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' : 'gzip,deflate,sdch',
'Referer' : 'https://resultadosgenerales2021.cne.hn/#resultados/PRE/HN'
}
response = requests.get(url, headers=headers)
response.content
This is my code
But in response.content
b'<!DOCTYPE html>\n<html>\n <script>\n document.addEventListener('DOMContentLoaded', (event) => {\n const queryString = window.location.search;\n const urlParams = new URLSearchParams(queryString);\n const imageUrl = urlParams.get('url');\n if (imageUrl) {\n document.querySelector('#imagen').setAttribute('src', imageUrl);\n }\n })\n </script>\n <body>\n <img id="imagen" src=""/>\n </body>\n</html>\n'
This is what i get
I'm trying to process the image in the url with pytesseract but I can't since it's not in the content of the request
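The HTML returned above is only a wrapper: its script reads the `url` query parameter and sets it as the `src` of the `<img>` tag. So the image address is already inside the requested URL. A minimal sketch that pulls it out with the standard library; `extract_image_url` is a name invented here:

```python
from urllib.parse import parse_qs, urlparse


def extract_image_url(page_url):
    """Return the value of the `url` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(page_url).query)
    values = query.get("url")
    return values[0] if values else None


page_url = (
    "https://resultadosgenerales2021.cne.hn/imagen_acta.html"
    "?url=https://provisorio-honduras-2021.datosoficiales.com"
    "/opt/recuentos/mesa-8766_DIP.jpg"
)
image_url = extract_image_url(page_url)
print(image_url)
```

Assuming the server hands the file out directly, requesting `image_url` with `requests.get` and wrapping `response.content` in `io.BytesIO` gives something `Image.open` can read before passing it to pytesseract.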
i am so confused rn
this is the confusion matrix in the o'reilly machine learning book
but this is the confusion matrix in a statquest video?
what?
well i guess i'll be using the python variation of it
i'm just gonna go by the book
or is the o'reilly book wrong?
is there a certain confusion matrix layout i should stick to?
Hey I was wondering if someone can help me figure out how can i make my code work, thanks!
```py
def covariance(x,y):
    # Find the mean of series x and y
    mean_x = sum(column_x) / float(len(column_x))
    mean_y = sum(column_y) / float(len(column_y))
    # Subtract the mean from the individual elements
    sous_x = [i - mean_x for i in x]
    sous_y = [i - mean_y for i in y]
    # Build the numerator and the denominator of the covariance formula
    nume = sum([sous_x[i] * sous_y[i] for i in range(len(sous_x))])
    denom = len(x) - 1
    cov = nume / denom
    return cov

with open('nicotinic1.csv') as nicotinic_1:
    fonction = covariance(x,y)
    print("La covariance du fichier nicotinic_1 est: ", fonction)
```
@boreal escarp what does it do that is different from what you want it to do?
Hello
I have a data frame
Which has date column I want to add an empty row before new date starts
For eg
01-01-2020
01-01-2020
02-02-2020
02-02-2020
02-02-2020
03-02-2020
04-02-2020
04-02-2020
This way
How I can do this?
Ping me when replying
not as I understood it?
because the classical case of federated learning is
a complete dataset divided horizontally across nodes
as opposed to vertically
i.e. target and features
I have pandas series
Which has different values like
String, float, int etc
How I can keep only string values only
Ping me when replying
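For the mixed-type series question, the core test is `isinstance(value, str)`. A sketch on a plain list with made-up values; with a pandas Series `s`, the same check could be used as a boolean mask, for example `s[s.map(lambda v: isinstance(v, str))]`.

```python
values = ["apple", 3.5, 7, "pear", None, "42"]

# Keep only the items whose type is str; "42" stays because it is a string.
only_strings = [v for v in values if isinstance(v, str)]
print(only_strings)
```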
ok, maybe I didn't fully understand the problem you were dealing with there
okay because like
problem was credit scoring with telco data
so the targets were basically default or no default, right
i.e. labels
and the features were the raw telco data like call records etc.
so the problem: when features and labels are separated, how to train model?
oh. but why can't the target labels be in the same place as the features?
like, I don't understand why labels are separated from features
is it possible to use df.iterrows() and enumerate(list) in a single for loop?
```py
for idx, row,idLst,rowLst in zip(villGdf[:5].iterrows(),enumerate(tempList)):
    print(row['id'])
    # eachPoint = geomPoints[i]
```
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19180/940817221.py in <module>
1 tempList = [10,2,5,3,55,57,1]
2
----> 3 for idx, row,idLst,rowLst in zip(villGdf[:5].iterrows(),enumerate(tempList)):
4 print(row['id'])
5 # eachPoint = geomPoints[i]
ValueError: not enough values to unpack (expected 4, got 2)
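The error comes from the shape of what `zip` yields: each item is a pair of pairs, `((idx, row), (position, value))`, so a flat four-name target sees only two values. Unpacking each side in its own parentheses fixes it. A sketch with a plain list standing in for `iterrows()`; the data is invented:

```python
# Stand-in for DataFrame.iterrows(), which yields (index, row) pairs.
rows = [(0, {"id": "a"}), (1, {"id": "b"}), (2, {"id": "c"})]
temp_list = [10, 2, 5]

collected = []
# zip() yields a pair of pairs, so each side is unpacked in its own parentheses.
for (idx, row), (position, value) in zip(rows, enumerate(temp_list)):
    collected.append((idx, row["id"], position, value))

print(collected)
```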
Does anyone know how to fit a linear regression to a data?
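For a single input variable, a least-squares line can be fitted by hand: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A sketch with made-up data points; for several input variables, `sklearn.linear_model.LinearRegression` is the usual tool.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```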
When put inside function definitions, ** indicates capturing variable kwargs. When put in other contexts, such as when calling a function, it's "unpacking" or "opening" the dictionary into kwargs
The two are like mirrored operations of each other
So in this case, **bla is saying "open up/unpack bla and pass its contents as separate kw arguments"
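Both directions of `**` side by side, as a small sketch; `describe` and `connect` are hypothetical functions made up for the example:

```python
def describe(**options):
    """`**` in a definition captures keyword arguments into a dict."""
    return dict(options)


def connect(host, port=5432, timeout=10):
    """A hypothetical function that takes ordinary keyword arguments."""
    return f"{host}:{port} (timeout={timeout}s)"


captured = describe(host="localhost", port=8000)

# `**` in a call unpacks the dict back into separate keyword arguments.
message = connect(**captured)
print(captured)
print(message)
```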
hello
I've got this error and i don't really understand what it means
i tried to put the variable time in an array with np.array and it doesn't work, even with a list
thank you in advance ! 😊
if train accuracy and test accuracy comes up 95 and 86, Should we assume that the model is underfitting ? what should be min difference between train and test accuracies so they are general fit ?
Hi, I am doing audio cleaning with U-Net (tensorflow, pytorch, soundfile, ...). It gives me an Nvidia GPU dependency and I have an AMD Radeon GPU. Does someone know how I can train a neural net with my GPU? It's my first time doing this, so I might be doing something wrong... Some context could help me too
Hello I know "python" (notice the quotes😅) and I would like to start learning artificial intelligence. Any good resources that you recommend to kick off? Thank you very much in advance
https://www.freecodecamp.org/news/machine-learning-systems-book-recommendations/ these are best I know
plus examples, guides and tutorials from tensorflow.org and pytorch.org
hello, i have a pandas dataframe with a date column:
```python
01-02-2017
01-02-2017
01-02-2017
01-02-2017
01-02-2017
02-02-2027
02-02-2017
03-03-2017
04-02-2017
04-02-2017
04-02-2017
04-02-2017
04-02-2017
...
...
...
27-02-2018
27-02-2018
27-02-2018
```
this way
i want to add a blank row before each new date starts
my expected output is
```python
01-02-2017
01-02-2017
01-02-2017
01-02-2017
01-02-2017
02-02-2027
02-02-2017
03-03-2017
04-02-2017
04-02-2017
04-02-2017
04-02-2017
04-02-2017
...
...
...
27-02-2018
27-02-2018
27-02-2018
```
this way
ping me when replying
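The blank-row idea can be shown on a plain list first: walk through runs of equal dates and put a blank entry between runs. This is a sketch with invented dates and a made-up helper name; for a DataFrame the same grouping logic would apply per date group before concatenating the pieces.

```python
from itertools import groupby


def blank_between_groups(values, blank=""):
    """Insert `blank` before each run of a new value (but not before the first)."""
    result = []
    for _, run in groupby(values):
        if result:
            result.append(blank)
        result.extend(run)
    return result


dates = ["01-02-2017", "01-02-2017", "02-02-2017", "03-03-2017", "03-03-2017"]
print(blank_between_groups(dates))
```

`groupby` only merges adjacent equal values, so the dates are assumed to be sorted or already grouped.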
Thank you very much!
I still don’t understand why confusion matrices are presented differently
Like what
OH I SEE
THE PREDICTED AND THE ACTUAL ARE FLIPPED ON THE AXES
🤯🤯🤯🤯
if only it were 100mb larger
Hello everyone, I am tryna get myself acquainted with ML pipelines using TF & Apache Beam... so I'm reading this book. When I got to the TF Transform part, the code didn't run well. So I went to the official TF website and tried to run their own example code, which also failed.
I don't know if there's a way around this.
make sure you have the same tensorflow version as the ones used in the code
can you send the error that you got?
True, the versions are different. I am on 1.3.0 and the official website's 0.24.1
Google should have updated their website.
If anyone has a workaround for this, I'd appreciate it. Thanks
use the same version
numpy.core._exceptions.MemoryError: Unable to allocate 820. KiB for an array with shape (100, 350, 3) and data type float64
I have this error, and google searches only bring me to unresolved issues or unanswered questions. Any ideas? lol
they're not gonna update their code every single time theres a new version of tensorflow, that's just unreasonable
it means you don't have enough memory to create the array
either make the array smaller (by reducing the precision or lowering the amount of values somehow) or get more memory
I gathered as much, but surely a 16GB machine can spare 820 KiB?
TFX versions below 1.0.0 were for experimental purposes which was stated, versions above 1.0.0 are what anyone should learn.
with nothing else running and more than enough spare mem
it's just a small dataset for practice, less than 2KB.
Just like on the tensorflow website,
is that the only array?
that message was directed at @lapis sequoia
TypeError: object of type 'NoneType' has no len()
okay.
That's the error I am getting.
well if the code you're using uses 0.24.1, then you should use 0.24.1 if you want to run it
unless you want to make a lot of code changes to make it work with the new versions
the simplest solution to your issue is to just use the same version
ah no, there are more - so assuming it opens them iteratively, that error isnt from the beginning it's when the memory gets full?
okay i was being very not intelligent
yeah if there are a lot of them then it would give that error whenever it runs out of memory trying to allocate for one
you can double check that by checking your memory usage and seeing if it's full
yup just re ran it there with the memory usage in front of me and it makes its way up to 100, then i get that error. Cheers!
in your case, I'd recommend going down to float32 precision, float64 is usually not necessary
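The number in the error message can be checked by hand: elements times bytes per element. A small sketch of that arithmetic, with the element sizes for float64 (8 bytes) and float32 (4 bytes) written out; the helper name is made up.

```python
import math

# Bytes per element for the two float types discussed above.
BYTES_PER_ELEMENT = {"float64": 8, "float32": 4}


def array_size_kib(shape, dtype):
    """Memory needed for a dense array of the given shape, in KiB."""
    elements = math.prod(shape)
    return elements * BYTES_PER_ELEMENT[dtype] / 1024


shape = (100, 350, 3)
print(array_size_kib(shape, "float64"))
print(array_size_kib(shape, "float32"))
```

Each array is small; the failure comes from many of them adding up, which is why halving the element size helps.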
anyone here good with encrypted strings?
It's best to always ask your actual question, rather than hope someone will volunteer to help when you've only alluded to it.
Your question might also be more relevant in #cybersecurity
does anyone know a good mlops book?
hey can someone help me figure out why my code isnt working? it has to do with the x and y variable, but i am not sure how to fix it. thanks ! ```py
def covariance(x,y):
    # Find the mean of series x and y
    mean_x = sum(column_x) / float(len(column_x))
    mean_y = sum(column_y) / float(len(column_y))
    # Subtract the mean from the individual elements
    sous_x = [i - mean_x for i in x]
    sous_y = [i - mean_y for i in y]
    # Build the numerator and the denominator of the covariance formula
    nume = sum([sous_x[i] * sous_y[i] for i in range(len(sous_x))])
    denom = len(x) - 1
    cov = nume / denom
    return cov

with open('nicotinic1.csv') as nicotinic_1:
    fonction = covariance(x,y)
    print("La covariance du fichier nicotinic_1 est: ", fonction)
```
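Two things stand out in the posted code: the function takes `x` and `y` but computes the means from `column_x` and `column_y`, and `x` and `y` are never read from the file before the call. A possible corrected sketch follows. The sample columns are invented, since the layout of `nicotinic1.csv` is not shown; in the real script they would be read from the file (for example with `csv.reader`) and converted to floats.

```python
def covariance(x, y):
    """Sample covariance of two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, at least 2")
    # Use the function's own parameters, not outer names like column_x.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return numerator / (len(x) - 1)


# Hypothetical columns standing in for two numeric columns of the CSV.
column_x = [1.0, 2.0, 3.0, 4.0]
column_y = [2.0, 4.0, 6.0, 8.0]
print(covariance(column_x, column_y))
```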
It depends on the task you're trying to solve.
For example, accuracy score isn't always a good metric for a classification problem, especially when you have imbalanced classes. F1-score or ROC AUC will be better in that case.
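A tiny illustration of why accuracy misleads on imbalanced classes: a model that always predicts the majority class scores high accuracy while finding no positives at all. The data is invented, and the two metric functions are hand-written sketches rather than the scikit-learn versions.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced data: 95 negatives, 5 positives.
actual = [0] * 95 + [1] * 5
always_negative = [0] * 100

print(accuracy(actual, always_negative))
print(f1_score(actual, always_negative))
```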
I've got a data merge problem I can't seem to find a straight forward solution to. Datetime based records, main dataset every 5 minutes. Set to be merged is every 15 minutes and timestamps don't match exact. I want to merge with existing dataset filling in blanks with
average values. I know I want to use pandas, but I'm really new to that, only a couple months experience. DB is MySQL running on a Linux server. Main app is based on Flask everything else is "pure" Python. I'm good at following rabbit holes, but I could use some advice on where to start and a direction to go in.
They are both accurate. Once you understand the concept very well you can flip the 2x2 confusion matrix whichever way you want and still be able to explain it.
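The two layouts are transposes of each other. A sketch that builds the matrix with rows as actual classes and columns as predicted classes (the convention scikit-learn's `confusion_matrix` uses), then flips it; the labels and predictions are made up.

```python
def confusion_matrix(actual, predicted, labels):
    """Count outcomes with rows = actual class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def transpose(matrix):
    """Swap rows and columns, giving rows = predicted and columns = actual."""
    return [list(row) for row in zip(*matrix)]


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 1, 0, 1]

rows_actual = confusion_matrix(actual, predicted, labels=[0, 1])
rows_predicted = transpose(rows_actual)
print(rows_actual)
print(rows_predicted)
```

The counts are identical either way; only the reading direction changes, so it is worth checking the axis labels of whichever source is in front of you.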
On a side note... What exactly were those guys that came up with "confusion matrix" really smoking when they coined that name 🤣
hello
Underfitting? Nah it's far from that. If it was underfitting your train set won't even smell a 95% accuracy 😊
Underfitting is like asking a 2 year old to solve MANOVA (Multivariate Analysis of Variance) when the child's brain is still too young to handle such complex task. Now, what do you think would happen?
The child will definitely perform woefully on the task. You can relate this to your model. If your model isn't robust and flexible enough to capture complex patterns in your dataset, then it's most likely bound to underfit.
For your train data to hit a 95% accuracy score, do you now see it's far from underfitting? 😊
So its overfitting yeah sorry my bad... but the intent is am i right that the data is overfitting??
Does your PC have a Thunderbolt port? You could use an eGPU, or better still use the free ones provided by Tesla (on cloud), Google (Colab), and Kaggle (kernels)
The task is too big for the free platforms to handle? Then you might wanna use the paid services offered by AWS, GCP, etc.
i need help
The Internet has been quite generous lately.
Books : Check Pinned Post
Video : Udemy, DataCamp, Kaggle, DeepLearning.ai, Andrew Ng's Machine Learning course on Coursera, HuggingFace courses on their website, YouTube etc
It's possible. You'd have to investigate further to confirm that your model isn't overfitting by doing a K-Fold cross validation.
You could also plot the loss per epoch to easily spot when your model starts overfitting.
Meanwhile, I prefer using a loss function like RMSE to gauge my model performance / overfitting instead of using the accuracy score.
i noticed the next day it was bc they flipped the axes
There are many methods you can use to compare two images in ML (Siamese NNs, CNNs, etc.). What I cannot figure out is comparing a large number of images (without retraining) to find images of a different object. The best way I can describe this is a few-shot learning problem without retraining. Any ideas?
My only real idea is to use an RNN and have it memorize some of the required features of an image while it parses through all of them. I would also likely have to ensemble multiple RNNs with different sets of images in case the first RNN starts off on the outliers.
Probably not a good solution
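One alternative that needs no retraining, offered as a sketch rather than a recommendation: turn every image into a feature vector with a fixed, already trained model, then flag the image whose vector sits farthest from the rest. The 2-D vectors and the helper name below are invented; real embeddings would come from a pretrained network and have hundreds of dimensions.

```python
import math


def outlier_scores(embeddings):
    """Mean Euclidean distance from each embedding to all the others."""
    scores = []
    for i, current in enumerate(embeddings):
        others = [e for j, e in enumerate(embeddings) if j != i]
        scores.append(sum(math.dist(current, e) for e in others) / len(others))
    return scores


# Hypothetical 2-D feature vectors, one per image.
embeddings = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.95), (0.12, 0.18)]

scores = outlier_scores(embeddings)
most_unusual = max(range(len(scores)), key=scores.__getitem__)
print(most_unusual)
```

This compares every pair, so it scales quadratically with the number of images; for large sets a nearest-neighbour index would be the next step.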
Hello, i have a question regarding community detection in a bipartite graph.
Let's assume, that we have a set U of elements connected to a set of elements V. We define a proximity function for (a,b)∈U^2 such that F(a,b) are close if a and b map to the same elements in the set V, the more elements in V a and b map to - the higher the proximity (or lower the distance between a and b). Then we get an adjacency matrix which shows us the weight of every edge between the elements of the set U . I assume that there are communities in this graph, but i don't know how many there are.
for info, the set U contains 500k nodes, V contains 4k nodes
How do i detect communities and what is the most accurate way to represent the results?
Since I have so many nodes that I can't simply represent this data as a graph (it would be a complete mess), I was thinking about taking a node, putting the node in a N-dimensional space, then adding the neighbouring nodes according to the proximity to the first node and repeating this process until I embed all the nodes into my N-dimensional space [but that looks kinda like an NP problem (correct me if I'm wrong)]. Then I could use UMAP to detect the communities and a projection into 2D space to represent the results
This sounds like an #algos-and-data-structs question
I'm a bit confused as to how CNNs work in terms of passing information onto other layers.
If you pass 64 feature maps to another convolutional layer, how does it interpret that?
It's a classification model, so RMSE? Along with accuracy I am checking f1-score too, which seems to be around 87
If it's a classification problem, accuracy on its own can be misleading, especially with imbalanced classes.
I'd use the following metrics, in this order:
- roc_auc
- F1-score
- log loss (RMSE is a regression metric, so it doesn't really apply to classification)
The closer #1 and #2 are to 1.00, the better your model performs. The closer #3 is to 0.00, the better your model performs.
Accuracy score i used to check train and test score,whether they differ or not 🙂 But Deciding which one did good I used F1-Score Yes 👍🏻
hi can anyone help me?
what does this do?
import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder,2)
print df.apply(np.mean)
the pipe part i mean
this looks like Python 2?
!docs pandas.DataFrame.pipe
DataFrame.pipe(func, *args, **kwargs)
Apply func(self, *args, **kwargs).
so it's the same as adder(df, 2)
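For reference, here's a Python 3 sketch of that snippet (the DataFrame values are made up for illustration). `df.pipe(adder, 2)` just calls `adder(df, 2)` and returns the result; it doesn't modify `df`:

```python
import numpy as np
import pandas as pd

def adder(ele1, ele2):
    # Works element-wise when ele1 is a DataFrame
    return ele1 + ele2

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["col1", "col2"])

piped = df.pipe(adder, 2)    # same as adder(df, 2)
direct = adder(df, 2)

print(piped.equals(direct))  # True
print(df.equals(piped))      # False: df itself is unchanged
```

The main reason to reach for `pipe` is method chaining: `df.pipe(f).pipe(g)` reads left to right, whereas `g(f(df))` reads inside out.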
np.array(arr)[:,np.newaxis]
: = selecting every element in row
np.newaxis = it is creating new object at new column
correct ?
can someone please help me understand this
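Not quite: `:` keeps every element along the existing axis, and `np.newaxis` (which is just an alias for `None`) inserts a new axis of length 1 at that position. No new data is created, only the shape changes. A small sketch with a made-up list:

```python
import numpy as np

arr = [1, 2, 3]
a = np.array(arr)            # shape (3,)
col = a[:, np.newaxis]       # shape (3, 1): a column vector
row = a[np.newaxis, :]       # shape (1, 3): a row vector

print(a.shape, col.shape, row.shape)
```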
I've been looking into CNNs lately so my info may not be the most complete, but I think I've got a good feel for them
the CNN takes its input and processes it with convolution filters: each filter is a small grid of learned weights that slides over the image, and each output value is a weighted sum of a small neighbourhood of pixels across all input channels
blurs in games or things like bokeh effects are kinda like this too
the difference is that for a CNN the filter weights are learned during training instead of being hand-picked
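On the "64 feature maps" question, a naive NumPy sketch might help. It assumes the usual layout where each filter is as deep as the number of incoming maps, so a layer fed 64 maps has filters of shape 3x3x64, and every output value sums over all 64 channels. The sizes and random values below are made up, and bias and activation are left out:

```python
import numpy as np

def conv2d(x, kernels):
    """Naive 'valid' convolution (cross-correlation, as in most DL libraries).

    x:       (height, width, in_channels)
    kernels: (kh, kw, in_channels, out_channels)
    """
    h, w, c_in = x.shape
    kh, kw, k_in, c_out = kernels.shape
    assert c_in == k_in, "kernel depth must match the number of input maps"
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]  # (kh, kw, c_in)
            # Sum over height, width and all input channels at once
            out[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(8, 8, 64))  # 64 maps from the previous layer
kernels = rng.normal(size=(3, 3, 64, 32))   # 32 filters, each 3x3x64
out = conv2d(feature_maps, kernels)
print(out.shape)  # (6, 6, 32)
```

So the next layer doesn't treat the 64 maps separately: each of its filters mixes all of them into a single new map.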
Yeah I'm using tutorials point right now
But I know the few diffs between py2 and py3.10
Then why would we want to use pipe?
Idk, I've never used it
hi, can anyone tell me why I am getting this for row 4 here?
It's supposed to be characters like the other rows values
pls lmk if you need any other info to help out
any good short and best ml course for python
Freecodecamp (:
I'm learning from there right now
Also GeeksforGeeks, Tutorialspoint, W3Schools (:
You can check pins of this channel.
k
so found a great resource for mlops and ml in general, moderators or administrators could pin it to the top https://github.com/visenger/awesome-mlops
not sure if this is the right channel but
can I get the exact number of the line when I read a line and how?
Can I check for spaces in a string?
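Both are doable with plain Python. A minimal sketch, using an in-memory `io.StringIO` as a stand-in for a real file (with a file on disk you'd use `with open(path) as f:` instead): `enumerate` gives you the line number while you iterate, and `" " in text` checks for a space.

```python
import io

# Stand-in for a real file; the contents are made up
f = io.StringIO("first line\nsecondline\nthird line\n")

lines_with_spaces = []
for line_number, line in enumerate(f, start=1):
    text = line.rstrip("\n")
    if " " in text:  # True if the line contains at least one space
        lines_with_spaces.append(line_number)

print(lines_with_spaces)  # [1, 3]
```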
yo what is the difference of test and validation samples?
is the validation dataset required?
Validation is what’s used for hyperparameter tuning, so there’s a separate test set so you don’t overfit the hyper parameters to the validation set
So unless you’re doing automated hyperparameter tuning, a validation set isn’t really necessary
can anyone explain what is data pipeline between SQL and python? what is it for and how to implement it (by luigi perhaps?)
what are considered as hyperparameters?
anything related to the model architecture or the training configuration
i.e number of layers, nodes per layer, number of conv filters, filter size, batch size, etc
would also include parameters for the optimizer like learning rate or which optimizer to use
how do i know when to normalize a dataset?
When your features aren't on the same scale or in the same unit. You could have one feature in seconds, another in kg, another in joules, etc.
You'd have to normalize your data to at least give each feature a level playing ground for optimum performance before training your model with the data.
By doing so, any feature whose contribution is subpar or insignificant to your model performance won't feel so jelly or discriminated against if you decide to use your veto power to disqualify such feature from your magnificent project moving forward (pun intended) 😀
But I hope you get the point now
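To make that concrete, here's a small sketch of one common choice, standardisation (z-scores), on made-up data with two columns in very different units:

```python
import numpy as np

# Made-up data: column 0 in seconds, column 1 in kilograms
X = np.array([[3600.0, 70.0],
              [7200.0, 82.0],
              [1800.0, 55.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std  # each column now has mean 0 and std 1

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

In practice you'd compute `mean` and `std` on the training set only and reuse them for the validation and test sets.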
oh so if i change something on the architecture i would need to use a validation dataset for everychanges i make and evaluate which one do best?
If you’re just doing a few changes it’s probably fine without one, it becomes necessary when you try to have automatic hyperparameter tuning at large scales and such
thank you very sir @austere swift
btw if i created a model with an input size of 500x500x3 and i used that model in a mobile application where the input will be camera captures, is that ok?
You will probably need to scale to that size.
why i get an error like this? AttributeError: 'numpy.ndarray' object has no attribute 'unique'
It should be
X_train['insert_column_name_here'].unique()
Your X_train is currently a numpy array, so you'd have to first convert it back to Pandas to use the .unique() method.
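A quick sketch of both options, with a made-up array standing in for your `X_train` (the column names are hypothetical): either stay in NumPy and use `np.unique`, or wrap the array in a DataFrame first.

```python
import numpy as np
import pandas as pd

X_train = np.array([[1, 10], [2, 10], [1, 20]])  # made-up data

# Option 1: stay in NumPy
unique_first_col = np.unique(X_train[:, 0])

# Option 2: wrap in a DataFrame (column names here are hypothetical)
df = pd.DataFrame(X_train, columns=["a", "b"])
unique_b = df["b"].unique()

print(unique_first_col, unique_b)
```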
so i need to include the part in the application to resize captured or uploaded images to 500x500x3 before passing it to the model?
Can someone help me out with this one
I still get an error. Any other clue?
Yep!
In this case, I want to build an image recognition model where the maximum pixel value is 255. So I'm dividing X by 255 for normalization. @odd meteor
do you guys think it'd be a decent idea to add statistics in my resume as well after I finish ML/NLP/Data science libraries? 😛
hello my dataframe this way
i want to groupby data by age column
i tried this way
print(df.groupby('age').head(10))
but i am getting
       name  marks  subject  age
0      amar     78    maths   45
1      ajay     56  physics   56
2     kiran     36  science   20
3    pankaj     41    hindi   78
4     kiran     20    maths   23
5      amar     78  physics   45
6    pankaj     63    hindi   12
7    sanket     41  science   12
8     sahil     85    maths   20
9     kiran     26    hindi   84
10     amar     45  science   45
11   pankaj     98    maths   41
12  swapnil     14    hindi   30
13     amar     21    maths   56
14     sham     40    hindi   56
15   sanket     85    maths   45
16   pankaj     42  science   23
this way
ping me when replying
Not sure about cv but random variables and processes are fun to learn😄
And it helps in ml anyways.!
What do you want to do once you have grouped it?
Find the most common subject for each age or what?
You still haven't specified the column name you'd like to get all its unique values.
df['col_name'].unique()
Could anyone please clearly distinguish between data science and data analytics? I've searched online. But the definitions available are vague.
data analytics is merely observing the data and creating reports
science, i think, means that you will be creating predictive models using Machine Learning, Deep Learning, and CNNs and all that good stuff
im not a 100 percent sure but that probably is the gist of it
Thanks for answering. Are you sure that this is data analytics? Or is it data analysis? These two terms are also confusing.
what is the difference between analytics and analysis?
That's what I'm not aware of. I've been seeing these 2 terms interchangeably in several situations. At the same time I feel that there are some differences.
They mean the same thing
code:
def show_path_line_count():
    global folder_path
    total_linecount, line_count = 0, 0
    files = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]
    for filee in files:
        try:
            with open(f"{folder_path}\\{filee}", "r", encoding="UTF-8") as file:
                for line in file.readlines():
                    line_count += 1
                filename = os.path.basename(file.name)
                total_linecount += line_count
                labels.append(filename)
                sizes.append(line_count)
        except Exception as e:
            print(f"Couldn't linecount {filee} | {e}")
            pass
    fig1, ax1 = plt.subplots()
    ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    ax1.axis('equal')
    plt.title(f'Total lines: {total_linecount}')
    plt.show()
    return total_linecount, filename
I want to make it so that the piechart only displays top 5 values and then displays the rest as "other". How would I do that?
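One way to do it, as a sketch: sort the (label, size) pairs, keep the five largest, and sum everything else into a single "other" slice before calling `ax1.pie`. The helper name `top_n_with_other` and the sample file names and counts are made up for illustration:

```python
def top_n_with_other(labels, sizes, n=5):
    """Keep the n largest entries and sum the rest into 'other'."""
    pairs = sorted(zip(labels, sizes), key=lambda p: p[1], reverse=True)
    top, rest = pairs[:n], pairs[n:]
    new_labels = [name for name, _ in top]
    new_sizes = [size for _, size in top]
    if rest:
        new_labels.append("other")
        new_sizes.append(sum(size for _, size in rest))
    return new_labels, new_sizes

labels = ["a.py", "b.py", "c.py", "d.py", "e.py", "f.py", "g.py"]
sizes = [10, 300, 25, 120, 5, 80, 40]
pie_labels, pie_sizes = top_n_with_other(labels, sizes)
print(pie_labels, pie_sizes)
```

You'd then pass `pie_sizes` and `pie_labels` to `ax1.pie(...)` in place of `sizes` and `labels`.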
My ANN model is always getting 100% accuracy. How do I get a more accurate accuracy?
do you have a validation set? also how big is your dataset?
it's likely that it's overfitting
and having a small dataset can also do that (conceptually, its easier to get 100% accuracy if you have 5 samples than if you have 5000)
My dataset has around 1 million values. By the way, my test accuracy is also 100%
make sure the way you're calculating the accuracy is correct
can you send some code?
also make sure you don't have your labels in your training inputs, i've done that before and it can be hard to debug
Okay
Oh, originally my data was over 1 million values, but I had to cut a lot of it down because there were null values
Also, I get a memory error if I try to load in the whole dataset, as it is over 100 million rows
Do you know how I could get around this?
if you don't have enough memory to load in your dataset, there isnt really much you can do about it other than just getting more memory
Alright
if it's in pandas, read_csv already uses low_memory=True by default, so reading the file in chunks with the chunksize parameter is more likely to help
By the way, when I change the number of layers and stuff, the accuracy stays the same. How many layers should I have?
I haven't tried that, thanks
Hey guys
I'm using opencv2 to find a certain colour on a map and only show it whilst blacking out everything else
but the one of the colours is just making the whole screen black
Im using BGR2HSV
What should the output layer of a object detection model look like? (Amount of nodes and activation function)
Are you talking about a mask?
send your code
yes
import cv2
import numpy as np
img=cv2.imread("img.png")
choose = input("Which area: ").lower().strip()
def richplaces():
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lowerrange = np.array([25,157,1])
    upperrange = np.array([130,255,255])
    mask = cv2.inRange(hsv,lowerrange,upperrange)
    cv2.imshow("Image", img)
    cv2.imshow("Mask", mask)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
def middleclass():
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lowerrange = np.array([0,121,255])
    upperrange = np.array([130,255,255])
    mask = cv2.inRange(hsv,lowerrange,upperrange)
    cv2.imshow("Image", img)
    cv2.imshow("Mask", mask)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
def poverty():
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lowerrange = np.array([179,149,251])
    upperrange = np.array([130,255,255])
    mask = cv2.inRange(hsv,lowerrange,upperrange)
    cv2.imshow("Image", img)
    cv2.imshow("Mask", mask)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
if choose == "rich":
    richplaces()
if choose == "middle":
    middleclass()
if choose == "poor":
    poverty()
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
so what's the problem? Your code looks fine
That means that your upper and lower values aren't right, it's not detecting pink anywhere
where did you get these HSV values?
We got the colour from the map
we got its rgb values
using colourpicker
it works well for the rich and middle class
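One thing worth checking: OpenCV's 8-bit HSV uses H in 0-179 (degrees divided by 2) and S, V in 0-255, which is not the scale most colour pickers show. Also, in `poverty()` the lower hue (179) is above the upper hue (130), so `inRange` can't match any pixel. Here's a sketch that builds bounds around a picked RGB colour using only the standard library; the pink `(255, 105, 180)` and the tolerances are placeholders, not values from your map:

```python
import colorsys

def rgb_to_opencv_hsv(r, g, b):
    """Convert 0-255 RGB to OpenCV's 8-bit HSV scale (H: 0-179, S/V: 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # each in 0..1
    return int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255))

def hsv_bounds(r, g, b, hue_tol=10, sat_tol=60, val_tol=60):
    """Lower/upper bounds around a colour, clipped to the valid ranges.

    Does not handle hue wrap-around (reds near 0/179 need two ranges).
    """
    h, s, v = rgb_to_opencv_hsv(r, g, b)
    lower = (max(h - hue_tol, 0), max(s - sat_tol, 0), max(v - val_tol, 0))
    upper = (min(h + hue_tol, 179), min(s + sat_tol, 255), min(v + val_tol, 255))
    return lower, upper

# Placeholder pink, given as RGB (not BGR)
lower, upper = hsv_bounds(255, 105, 180)
print(lower, upper)
```

You'd wrap the two tuples in `np.array(...)` and pass them to `cv2.inRange` as the lower and upper range.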
This is a long shot lol, but would anyone wanna work on some sort of software product that incorporates ai,ml,NLP, data science into it? I mean it could look cool on your resume, and you'd gain some hands on exp lol
Is there any nice and precise website where we can find out info about state-of-the-art models for various tasks? as an example state-of-the-art model for machine translation. (please ping me if answered. thanks.)
.bm
Thanks a lot sir!!!
hello i have a dataframe with date column in it
this way
i want to add a blank line before new date
my expected output this way
how i can do this ping me when replying
blank row before new date
Guys can u suggest me a free data analytics course
you might look into openpyxl instead of pandas, because that isn't the kind of thing that pandas expects you to want to do
from pandas's perspective, this is like trying to add entries with all empty strings to an SQL database.
if you provide the code as text, we could offer alternatives.
however I don't see a "yes" outcome, so the function might as well not do any boolean logic and just return "no".
hii i tried this way ```python
def add_blank_rows(df, no_rows):
    df_new = pd.DataFrame(columns=df.columns)
    for idx in range(len(df)):
        df_new = df_new.append(df.iloc[idx])
        for _ in range(no_rows):
            df_new=df_new.append(pd.Series(), ignore_index=True)
    return df_new
df = pd.read_csv('pandas_dataframe.csv', names=['date', 'names', 'age', 'city'])
df_with_blank_rows = add_blank_rows(df, 1)
print(df_with_blank_rows)```
but i am getting
          date   names  age    city
0         date   names  age    city
1          NaN     NaN  NaN     NaN
2   01-01-2017    amar   23  mumbai
3          NaN     NaN  NaN     NaN
4   01-01-2017   ankit   24     goa
5          NaN     NaN  NaN     NaN
6   02-01-2017    ajay   25    pune
7          NaN     NaN  NaN     NaN
8   02-01-2017  sameer   26  nashik
9          NaN     NaN  NaN     NaN
10  02-01-2017   ankit   24     goa
11         NaN     NaN  NaN     NaN
12  02-01-2017    ajay   25    pune
13         NaN     NaN  NaN     NaN
14  03-01-2017    ajay   25    pune
15         NaN     NaN  NaN     NaN
16  04-01-2017  sameer   26  nashik
17         NaN     NaN  NaN     NaN
18  05-01-2017   ankit   24     goa
19         NaN     NaN  NaN     NaN
20  05-01-2017    ajay   25    pune
21         NaN     NaN  NaN     NaN
yes, it's going to fill them with NaNs instead of empty strings
yes, but it is adding nan after each row
i want to add a blank row before each new date starts
it doesn't look like your code does anything to check if the date for the current row is different from the most recent one
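A possible sketch of that check, assuming the rows are already sorted by date: group by the date column and put one blank row between consecutive groups. The helper name `add_blank_row_per_date` and the sample data are made up, and the blank rows hold empty strings instead of NaN:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["01-01-2017", "01-01-2017", "02-01-2017", "02-01-2017", "03-01-2017"],
    "names": ["amar", "ankit", "ajay", "sameer", "ajay"],
})

def add_blank_row_per_date(df, date_col="date"):
    """Insert one empty-string row before each new date (except the first)."""
    blank = pd.DataFrame([{col: "" for col in df.columns}])
    pieces = []
    # sort=False keeps the dates in the order they first appear
    for i, (_, group) in enumerate(df.groupby(date_col, sort=False)):
        if i > 0:
            pieces.append(blank)
        pieces.append(group)
    return pd.concat(pieces, ignore_index=True)

result = add_blank_row_per_date(df)
print(result)
```

This uses `pd.concat` instead of `DataFrame.append`, which was removed in pandas 2.0.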
Since you mentioned your data is too much, how about loading your data in batches in Pandas and subsequently doing Batch Training?
Presuming you're working with a tabular data, this should solve the problem in TensorFlow
import pandas as pd
import numpy as np
for batch in pd.read_csv('Jubitron.csv', chunksize=10000):
    target = np.array(batch['your_target_column'], np.float32)
    feats = np.array(batch['your_feature_column'], np.float32)
You can increase or decrease the chunksize if you so wish.
You can also check online for how to do batch training with an image or sound dataset. I think this should solve your memory problem.
It's not guaranteed. You gotta try out different stuff then you compare and contrast.
Also, check HuggingFace and spaCy for NLP-specific projects. They have some pretty cool pretrained models capable of churning out "state-of-the-art" results.
I am trying to get into AI and I am wondering if there are any online resources to help me start coding AI programs such as neural networks or linear regression type stuff. like some sort of youtube video series or some course I can possibly pay for
im looking for specifics so I can actually begin coding
ive watched stuff on neural networks and such
and i know what they are
i just need some direction in the realm of actually applying such things
ah, for applying, as in making your own from scratch.
I find https://www.manning.com/books/deep-learning-with-python-second-edition pretty good as an introduction to deep learning.
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow-dp-1492032646/dp/1492032646/ is also a nice introduction to the field in general, but it covers broader things
The videos from andrew ng are also popular in that regard, but I have never watched them. I also have heard of some udemy courses but never watched them either
Can someone help me make a Siamese NN with a variable number of inputs?
I found this paper, but I don't understand how they implement it. https://www.tandfonline.com/doi/pdf/10.1080/13658816.2018.1542698?needAccess=true
.paste
Hello, does anyone have any ideas on how best to load images in for an image batching process?
I'm running a model on the MSCOCO dataset and have preprocessed image features to (256,256,3) which has given me .npy files that are 292 KB in size each, I'm loading in over 240,000 images each 292 KB which as you can expect, is bottlenecking my performance
(Those 240,000 to train on come from 80,000 input images and each image is used 3 times, realistically I load ~80,000 and duplicate the loaded array 3 times as opposed to loading in 240,000 images which I would definitely never do 👀👀)
Are you using tensorflow?
I am indeed, sorry should've said
My times
I have a sort of work-around, namely that every 2 epochs (1h20m) I save the model weights and load them in the next runthrough
so instead of ~1600 minutes to train it fully on 40 epochs * 40 minutes per epoch, I can chunk it
Isn't it a standard practice that one thread is reading data while another feeds it to the GPU? Like suppose your GPU can do a train step for a batch in one second, ideally you will have the next batch ready to feed immediately from the other thread?
I tried to use pooling but I didn't see any improvements which may be due to the structure of my loop
It's been a while since I messed with it, but I think tf datasets can be configured to have this behavior
i tried to use a parallel map dataset
but that kept giving me incompatible errors
Also if I recall with some of the configurations, they require one full pass before they take effect. I might be thinking of .cache() with that
iirc there was a type of batch = Dataset.from_tensor_slices()
and then a batch.map(lambda item1 : numpy_function( load_function, [item1], [float32]))
but on trying to do .prefetch(batch_size) and passing that to my .fit/.train_on_batch it just wasn't a fan
hmm
in fact, more information that may help
but when i did it before
everytime I called .prefetch it would just never finish prefetching
and that may have been something i did wrong but it made me too scared to try it at all so i avoided it, so i should definitely have a look though i havent really changed my structure much except I do an inner loop to train x times on the batch (as it's already loaded)
sorry to ping but if you're looking this may speed up your search https://www.tensorflow.org/tutorials/text/image_captioning#create_a_tfdata_dataset_for_training
so if I do im_batch = img_ds.batch(batch_size)
if i do im_batch = img_ds.prefetch(buffer_size = batch_size)
If i do the batch and prefetch earlier and try to subscript it
Thank you for trying to help, think I'll have to give up on that idea, tried about 30 different methods and none would work, most of the time it was giving me this error: (my input list to the mapping function would be a list of files i.e. "dir/numpyfile.npy"
why not batch the images within the files
because im dumb 😦
so if you have a batch size of 32, rather than loading in 32 files, put 32 images in each file and only load in one file
it would be much faster
I want my batch sizes to be variable
as it depends on the PC that runs it
on Google Colab I could do batch sizes of 1 before it crashed
on my PC I'm doing 513
I've just figured out how to do Dataset.map
even so, you can chunk it to a specific chunk size (that's able to fit in your memory) then have batch sizes that can span multiple of those chunks
though it depresses me because the thing it replaces was very impressive haha
or batch sizes that are less than one chunk
best case would be to just put them all in one file, although that would take quite a bit of memory
memory as in ram, not storage
how much ram do you have?
that wont handle all of it
240,000 * 292kb = 70gb
yeah but 3 questions per image so the image appears 3 times
so if i load the image once it fits in ram
what i meant by this was to combine all the files into a single numpy array and save that as a single file, which would be fastest in terms of read times but it would take ~70gb ram to load in that single file
yeah true, unless I do what i sort of did in one of my functions that counts how many questions each image has and copies it that many times
so that when it loads in from that file it would copy it three times
and something like np.memmap from what ive read might actually work with that
(but i havent used it)
what I would do is have it saved as a .npz file with all of the numpy arrays inside of that, since loading a .npz file doesn't read an array into memory until you access it by name (npz files act similar to dicts, with names of arrays as keys and the arrays as values)
so it would be a single file still, but you'd be able to load in individual arrays without loading everything into memory
although I haven't experimented with having more than a couple arrays in a single .npz file so i'm not sure how it will handle 240k
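A tiny sketch of the .npz idea, with small made-up arrays standing in for the preprocessed image features and a temporary directory standing in for wherever the real file would live. How well this scales to 240k entries is untested here:

```python
import os
import tempfile
import numpy as np

# Small made-up arrays standing in for preprocessed image features
images = {f"img_{i}": np.full((4, 4, 3), i, dtype=np.float32) for i in range(5)}

path = os.path.join(tempfile.mkdtemp(), "features.npz")
np.savez(path, **images)        # one file, one named array per image

with np.load(path) as archive:  # opening does not read the arrays yet
    names = archive.files       # list of stored array names
    img_3 = archive["img_3"]    # only this array is read from disk

print(names, img_3.shape)
```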
npz is apparently on par with hdf5 for it
which are both meant for huge datasets
however im still not sure
part of me wants to try loading it in batches though i do know thats inefficient in the long run
Thank you very much, if the method im trying out now doesn't work i may look into that though, hope I didn't sound ungrateful 🙂
Does anyone know how I can make my map function for img_ds take multiple inputs? My dataset is 240,000 images but each one is used 3 times in a row, therefore I only need to load 80,000 images and just need to "copy" or duplicate the first value two more times. Any help is much appreciated as this .map function has seriously improved my I/O bottleneck issues by 4x the speed, though I can only load them 1 by 1.
Don't want to needlessly ping you but if you read this peace_within_reach I cannot thank you enough haha, 1 epoch now takes about 1/3rd or 1/4th of the time it took before and that's loading 240,000 images!
Hello I want to insert a row in data frame where condition becomes false
How I can do
My code this way
I am getting this error
Ping me when replying
Can anyone please look into this
Ping me when replying
The speed of python in this area is super fast right? Matching c?
no probably
the syntax is easy for python
but python is slower than the other popular languages
In this area? In machine learning and data analysis?
I thought the AI libraries were made in C?
Hi there, i'm doing a Project part for a Masters Course. The Project in general is about digital quality management: A 5-Axis cnc machine is cutting a Part. Machine Data is collected through a Edge Device and then used to create a dot cloud/stl that itself is then surveyed/measured and compared with the measurements of the real part. My part in the project is using the raw machine data and creating a ML algorithm that does a predictive decision on "if the collected Data is sufficient to create accurate measurements through the digital twin". Sadly i feel ill prepared through previous courses for handling such a specific topic and do not even know where to begin. Thus i could use some pointers on how to proceed with checking/preparing the Data, what algorithms could be used to get a useful result, and so on. If you have questions or suggestions (for possibly useful tutorials on how to get started or such) feel free to respond here or in a DM. Any help is highly appreciated.
If you use numpy then vectorised calculations over numpy arrays are fast (it's all implemented in C under the hood). If you do stuff in pure python then you're probably paying some constant factor overhead compared with implementing the same algorithm in C. But often with big data the real question is how things scale with the size of your data (big - O behaviour), rather than what the details of the constant factors are.
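If you want to see the constant-factor difference for yourself, here's a rough sketch that times a pure-Python loop against the vectorised equivalent on a made-up array. The exact numbers depend on your machine, so none are quoted here:

```python
import timeit
import numpy as np

values = np.arange(200_000, dtype=np.float64)

def python_loop(xs):
    # Sum of squares, one element at a time in the interpreter
    total = 0.0
    for x in xs:
        total += x * x
    return total

def vectorised(xs):
    # Same sum of squares, computed inside NumPy's compiled code
    return float(np.dot(xs, xs))

loop_time = timeit.timeit(lambda: python_loop(values), number=1)
vec_time = timeit.timeit(lambda: vectorised(values), number=1)
print(f"loop: {loop_time:.4f}s, vectorised: {vec_time:.6f}s")
```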
I don't know how you're studying, but I highly recommend you watch machine learning crash courses on the internet and practice with projects (:
Would this have a huge time difference?
A = pd.Series([1,2,3])
Vs.
A = pd.Series(np.array([1, 2,3]))
If I used them in a large scale proj where I manipulate them
Thank you for your reply, i am doing exactly that at the moment, but my big(gest) issue is that basically all crash courses or test-projects are about survey data, text/image recognition/classification. Thus i took my chances and was hoping to find someone here, that may have personal experience with handling similar data as me.
for a given row in Pandas dataframe, how do I return the column name that has the highest value?
the code i have right now only returns the highest value for a given row, but not the name of the column, which the value is belongs to
print(df.loc[139, :].max(axis = 0))
Hi there, i'm currently learn about assumption tests for linear regression and use durbin watson test as one of the tests, but the problem i encountered is that i have to compare d value to the durbin watson table (dL and dU) manually, my question is , is it possible to automatically compare d value i got in Python? Btw i use stats.stattools.durbin_watson( ) function from statsmodels to get d value
This kinda reminded me of my parametric and non-parametric test class 😊. Sadly, we did ours manually with pen and paper, and kinda played around it with SPSS (nothing too serious then)
Unfortunately, I don't know how to navigate statstool to carry out Durbin Watson test... However, If you wouldn't mind using SPSS to figure this out, I'm sure there's plethora of YouTube videos that explained it concisely using softwares like SPSS.
All the best ✌️
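If you'd rather stay in Python: as far as I know statsmodels doesn't ship the dL/dU table, so you'd still look those two numbers up for your sample size and number of regressors, but the comparison itself is easy to automate. A sketch of the statistic and the classic bounds test; the residuals and the `d_lower`/`d_upper` values below are placeholders, not real table entries:

```python
import numpy as np

def durbin_watson_stat(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def dw_decision(d, d_lower, d_upper):
    """Bounds test; d_lower/d_upper must come from the table for your n and k."""
    if d < d_lower:
        return "positive autocorrelation"
    if d > 4 - d_lower:
        return "negative autocorrelation"
    if d_upper < d < 4 - d_upper:
        return "no evidence of autocorrelation"
    return "inconclusive"

residuals = [1, 1, 1, -1, -1, -1]  # made-up residuals
d = durbin_watson_stat(residuals)
print(d, dw_decision(d, d_lower=1.0, d_upper=1.5))  # placeholder bounds
```

If a p-value would suit you better than a table lookup, the Breusch-Godfrey test in statsmodels (`acorr_breusch_godfrey`) might be worth a look.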
Instead of sifting through rows to get the column with the highest value, why not use the column directly?
print(df.loc[0:].max())
Or maybe I don't understand your question perfectly.. 🤷🏾♂️
I want to get the name of the column with the highest value for each row
it's in relation to tf-idf where each row is a sentence and every column is a word, so i'd want to get the words with highest score for each sentence
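For that, `idxmax` is probably what you're after: it returns the label of the maximum instead of the value. A sketch on a made-up score table:

```python
import pandas as pd

# Made-up tf-idf-like scores: rows are sentences, columns are words
df = pd.DataFrame(
    {"cat": [0.1, 0.7, 0.0], "dog": [0.5, 0.2, 0.3], "fish": [0.4, 0.1, 0.9]}
)

top_word = df.idxmax(axis=1)  # column name of the max in each row
top_score = df.max(axis=1)    # the max value itself

print(top_word.tolist())      # ['dog', 'cat', 'fish']
print(df.loc[1].idxmax())     # cat (single row)
```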
Oohh. You might wanna use Gensim to get that.
Oh, okay, thanks btw, as you mentioned you got parametric and non parametric test class do you have books or resources about assumption tests? I want to look over manual steps to get each tests value cause many resources i got so far are using SPSS
Ahh I don't really have a textbook on it. I only have my Statistics notebook. Everything you seek can be found online. If you still need extra resources DM me, I can screenshot my university notebook to you by tomorrow night (if I don't become too lazy to look for it.) 😂
anyone know how to calculate percent similarity between two columns in dataframe
rebrushing on python
I am trying to make a linear regression model and getting an error
"Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
This is my code, can anyone explain what is happening:
x=df1['Add2(in Thousands)']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42,test_size=0.25)
from sklearn import linear_model
lr=linear_model.LinearRegression()
model= lr.fit(x_train,y_train)
predictions=model.predict(x_test)
Just as the error message reads, you gotta fix the shape of your x variable. scikit-learn expects the features as a 2-D array of shape (n_samples, n_features), but df1['Add2(in Thousands)'] is a 1-D Series. y can stay 1-D.
You can confirm this by printing x.shape: it will show a single dimension, like (n,), instead of (n, 1).
x_reshaped = np.array(x).reshape(-1, 1)
(or select the column with double brackets, x = df1[['Add2(in Thousands)']], which gives a one-column DataFrame)
should fix it. Then rerun the train-test split and train your model again
@odd meteor Thanks a lot..
oh man im almost done with the data analysis libraries like numpy, pandas, matplotlib and seaborn, im so pumped to be learning machine learning (:
im just getting started with data analysis libraries and i wanted to know how i could make a pie chart out of a csv file
for example making a pie chart out of the amount of times the diff groups in exgrupo (second column) appear, im still struggling with this whole concept hehe
I would like someone to look over my data science curriculum
And say what they think of the work
guys im new
and im a kid
i want to develope a programming language in Python
how do i do it
i built the parser
and the lexer
can some1 help me please
and i build ai
too
Roughly, I think you could use Pandas. Something like this:
import pandas as pd
# Load file
path_to_file = '...'
df = pd.read_csv(path_to_file)
# Count how often each group appears, then plot the counts as a pie chart
df['exgrupo'].value_counts().plot.pie()
ohhh tyy!!
When creating a Siamese Network you create a model and instantiate it twice.
I wish to create a layer that will instantiate the model a variable number of times based on the input. It will then take the outputs of all of those models and do the distance math returning a fixed number of outputs.
Important note: The model is not pretrained, the goal is training that model.
!paste
Hi everyone, I have a bit of an issue with List assignment index out of range.
I'm trying to fix it, but so far I got nothing.
here is my code https://paste.pythondiscord.com/ziqibadece.sql
Thank you in advance
Please always share the whole error message. It would take too long to infer where in your code that exception is being raised.
!paste
https://paste.pythondiscord.com/osoxosejec.sql
here is the error. I expect it to run and predict the value whether 0 or 1
@rigid zodiac the problem is that row is an empty list, as row[-1] would work if there was at least one element.
that, or row[-1] returns an integer that is out of range for expected
so how can I fix it?
I suspect it has do with this error. I have change it to different number but still 😦
I'm not sure I understand you properly. Python is already a programming language, how can you develop another programming language inside Python?
Or do you mean, you wanna develop another programming language like python? If so? Why do you wanna embark on such journey? Any specific reason? 😀
PS: IDK how to develop a new programming language myself
if someone comes up with a language spec, it is possible to implement that language in Python, just as the official implementation of Python is in C.
but that's not really a data science question 😛
Ooh I understand now 😀.
this is still off-topic 
right sorry
#algos-and-data-structs would be the nearest equivalent as it's also for the discussion of theoretical computer science in general.
Do you know how to make Poetry and Torch behave? Or do you know of another PEP517 compliant tool to handle dependency/environment management?
I just really dislike the idea of splitting those two tasks between different tools, but Torch has some cursed packaging practices that doesn't seem to work well with a pyproject.toml file.
Basically cuda support relies on a specific pip flag to fetch the wheels from another URL
datos = datos[datos["OcupacionEconomica"]!="SIN DATO"]
datos = datos[datos["OcupacionEconomica"]!="SIN DATO MINDEFENSA"]
df=pd.DataFrame(datos, columns=["TipoDeDesmovilizacion","ExGrupo","AnioDesmovilizacion","Sexo","SituacionFinalFrenteAlProceso","DepartamentoDeResidencia","MunicipioDeResidencia","BeneficioTRV","BeneficioFA","BeneficioFPT","BeneficioPDT","OcupacionEconomica","DesembolsoBIE","NumDeHijos","TotalIntegrantesGrupoFamiliar"])
grafico= df.pivot_table(columns=["OcupacionEconomica"], aggfunc="size")
plt.pyplot.bar(grafico)
plt.pyplot.show()
print(grafico)
im trying to make a bar graph with the code above
but i get an error that says TypeError: bar() missing 1 required positional argument: 'height'
how do i solve that?
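A possible fix, sketched on made-up data: `bar()` takes the bar labels first and the bar heights second, so passing the counts alone leaves `height` unfilled. The tiny `datos` frame and the `import matplotlib.pyplot as plt` style below are assumptions for illustration, and `groupby(...).size()` stands in for the pivot table.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the real `datos` table.
datos = pd.DataFrame(
    {"OcupacionEconomica": ["A", "B", "A", "SIN DATO", "C", "A", "B"]}
)
datos = datos[datos["OcupacionEconomica"] != "SIN DATO"]

# One count per category, as a Series indexed by category.
grafico = datos.groupby("OcupacionEconomica").size()

fig, ax = plt.subplots()
# bar() wants the labels/positions first and the heights second.
bars = ax.bar(list(grafico.index), grafico.to_numpy())
ax.set_xlabel("OcupacionEconomica")
ax.set_ylabel("count")
# plt.show() would display the figure in an interactive session.
```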
1 gram model
what does that mean?
do yall know if pivot tables can be assigned x and y axes for graphs?
Anyone know a place where I can learn reinforcement learning OTHER THAN Q-LEARNING
maybe check out https://www.coursera.org/learn/practical-rl#syllabus , it's not just q-learning I suppose
Alright I'll see
Wondering because I want to learn better techniques since I doubt any good reinforcement learning trained AI used Q-Learning
the alpha-zero paper https://arxiv.org/pdf/1712.01815.pdf is maybe worth reading about
datos = datos[datos["OcupacionEconomica"]!="SIN DATO"]
datos = datos[datos["OcupacionEconomica"]!="SIN DATO MINDEFENSA"]
grupos =sorted(datos["OcupacionEconomica"].unique())
grupos_dict = dict(list(enumerate(grupos)))
datos.columns = ["TipoDeDesmovilizacion","ExGrupo","AnioDesmovilizacion","Sexo","SituacionFinalFrenteAlProceso","DepartamentoDeResidencia","MunicipioDeResidencia","BeneficioTRV","BeneficioFA","BeneficioFPT","BeneficioPDT","OcupacionEconomica","DesembolsoBIE","NumDeHijos","TotalIntegrantesGrupoFamiliar"]
conteo = datos.groupby(["OcupacionEconomica"]).count()
print(conteo)
conteo.plot.bar()
plt.pyplot.show()
can someone tell me whats wrong with this code?
im trying to do what the person who responded showed
but instead im getting this kind of graph
which shows all the columns within each bar
Maybe you should do
conteo = datos['OcupacionEconomica'].value_counts()
instead and try plotting that
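To see why the two approaches plot differently, here is a small sketch on invented data (the column values are made up). `groupby(...).count()` returns one count per remaining column, so `.plot.bar()` draws a bar for each of them, while `value_counts()` returns a single Series.

```python
import pandas as pd

# Invented rows; only the column names come from the code above.
datos = pd.DataFrame({
    "OcupacionEconomica": ["A", "B", "A", "C", "A"],
    "Sexo": ["F", "M", "M", "F", "F"],
    "NumDeHijos": [0, 2, 1, 3, 0],
})

# One row per category, but one column per *other* column in the frame.
por_columna = datos.groupby("OcupacionEconomica").count()
print(por_columna.shape)

# A single Series of counts: one number per category.
conteo = datos["OcupacionEconomica"].value_counts()
print(conteo)
# conteo.plot.bar() would then draw exactly one bar per category.
```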
Hi, does a function definition with a nested loop work inside a for loop? I was trying that, but it only plotted graphs when I took out the definition and return
yes, python loops do not introduce a new scope for variables
could someone explain what K-means clustering does and what types of dataset we need to do clustering?
@hallow sparrow it's where you have points in space and the algorithm tries to figure out which groups of points are close together
Here's an example for k = 5
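For anyone who wants to try it, a minimal sketch with scikit-learn's `KMeans`. The three synthetic blobs and k = 3 are assumptions chosen to keep the example small; they are not the k = 5 example above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated, made-up groups of 30 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Ask for 3 clusters; the algorithm only sees the coordinates, not the groups.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = model.labels_

print(model.cluster_centers_)  # one learned centre per cluster
```

Any numeric dataset where "distance between rows" is meaningful can be clustered this way; features on very different scales usually need scaling first.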
Could someone link me to a keras implementation of magnet loss?
I suspect your googling thereof would be as good as anyone else's
my issue there is googling did not turn up the results I need
This here is both pure tensorflow and also
imports from non-existent files.
from magnet_ops import *
from magnet_tools import *
I found this https://github.com/pumpikano/tf-magnet-loss
But that's both tensorflow 1.0 and even more crazy.
Here is a far more sane implementation, but it runs on pytorch not keras https://github.com/vithursant/MagnetLoss-PyTorch/blob/master/magnet_loss/magnet_loss.py
I wanna develop language like python
but easier to learn
I think languages like lisp, ocaml, scheme, haskell, etc are popular choices for writing new language. Scala if you want something that runs on the JVM
Python is one of the easiest to learn fully featured languages out there. Also is this the right channel for that?
IDK
@pallid bison this channel isn't for discussing language design
So I’m applying for a school, and they want to know if my data science program now covers the math topics that are necessary for me to be admitted to the course
I’d love if someone could help me review this syllabus so I could talk about what I should know and what my expectations should be
python is the easiest..
but...
don't look here lol, start from the very basics...go to freecodecamp on youtube and type in python tutorial for beginners
anyone got any knowledge on temporal difference q learning
anyone have any experience scraping data from a forum? What type of backend would I need to scrape data from a forum every day and automatically push it to my website? Also, what technologies would I use to scrape the data? Beautifulsoup?
Hello I am using pandas between time function
I want to check for two different time intervals in my data frame how I can do this?
For eg i want to check for time interval between 09:15:00 to 15:28:00 and 18:30:00 to 19:28:00
This two time interval data i need
How I can get this?
Ping me when replying
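One way this could be done, assuming the frame has a `DatetimeIndex` (which `between_time` requires): call `between_time` once per interval and concatenate the results. The index and the `price` column below are made up.

```python
import pandas as pd

# Made-up intraday data on a 15-minute grid.
idx = pd.date_range("2021-12-06 09:00", "2021-12-06 20:00", freq="15min")
df = pd.DataFrame({"price": range(len(idx))}, index=idx)

# between_time includes both endpoints by default.
day = df.between_time("09:15:00", "15:28:00")
evening = df.between_time("18:30:00", "19:28:00")

# Stack the two slices and restore chronological order.
both = pd.concat([day, evening]).sort_index()
print(both.index.min(), both.index.max())
```

If the times live in an ordinary column instead, setting that column as the index first (`df.set_index(...)`) is one option.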
@errant path Please don't advertise without getting prior permission from the admins. Thanks.
Hi so I don't have a master's in data science but I'm learning machine learning on my own. Would that get me a role of data scientist or would it be better for my if I decided to switch to software dev?
Can someone tell me the solution for this problem ?
all made with https://colab.research.google.com/drive/12CnlS6lRGtieWujXs3GQ_OlghmFyl8ch?usp=sharing with prompts like "hyperrealistic cyberpunk art deco skyline at dusk" minimizing "minimalism"
Hi There! Does someone of you have experience with calculating post-hoc tests in python and could help me out? 🙂
When would we use map or applymap over apply?
Apply just seems like a superset to me right now
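A rough way to see the difference, on a made-up frame: `Series.map` works value by value on one column, `DataFrame.apply` hands the function a whole column (or row), and the element-wise DataFrame method (`applymap`, renamed `DataFrame.map` in pandas 2.1) works value by value on every cell.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: one value in, one value out, for a single column.
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply: the function receives a whole column (axis=0, the default).
col_ranges = df.apply(lambda col: col.max() - col.min())

# Element-wise over every cell. Newer pandas calls this DataFrame.map,
# older versions only have DataFrame.applymap, so pick whichever exists.
elementwise = getattr(df, "map", None) or df.applymap
as_text = elementwise(lambda x: str(x) + "!")
```

So `apply` is the more general tool, while the element-wise methods state the intent more directly when the function only ever looks at one value.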
How to drop rows of pandas dataframe which contains specific time value
For eg i want to drop rows which has 15:29:00 value in time column
Ping me when replying
anyone know q learning
I mean I've only just started the ai libraries but can't you just use an if loop?
Or you could use a filter mask
Like this
df[df[time_column] != '15:29:00']
I think that should work?
I tried this but not worked
Rows are not getting removed
Ping me when u reply
Can anyone help me in this?
!e
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
print(df[df.a <3])
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | a
002 | 0 1
003 | 1 2
Well i can see this working in my case.
Can u use same value as I am using
I tried this way but rows not getting removed
Rows are not getting removed
Ping me when reply
what are you trying to accomplish?
I want to remove rows which has time value '15:29:00'
Ping me when u reply
Learn how to drop or delete rows & columns from Python Pandas DataFrames using "pandas drop". Delete rows and columns by number, index, or by boolean values.
probably will be something like data[data['time']== ####].drop(axis=0,inplace = False)
ping
:incoming_envelope: :ok_hand: applied mute to @vital dove until <t:1638881921:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
This is not really the appropriate place for this.
Is there an equivalent function in Python to R's emmeans ?
@shrewd lily what does that do
Calculate the estimated marginal means
Is there any way I can pull the list items in jupyter notebook's markdown shell rather than putting it manually?
can you be more specific
for example list_a = [2, 4, 2, 4, 6, 6]
I want to loop over to this list inside a markdown shell
@rigid zodiac
basically, I want to generate a table with outputs
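Markdown cells cannot loop, but a code cell can build the Markdown text and render it. A minimal sketch: the table layout is an assumption, and `IPython.display.Markdown` is only mentioned in a comment so the snippet also runs outside a notebook.

```python
list_a = [2, 4, 2, 4, 6, 6]

# Build a Markdown table: header row, separator row, then one row per item.
rows = ["| index | value |", "| --- | --- |"]
rows += [f"| {i} | {v} |" for i, v in enumerate(list_a)]
table_md = "\n".join(rows)

print(table_md)
# In a notebook code cell, this renders it as a formatted table:
#   from IPython.display import Markdown, display
#   display(Markdown(table_md))
```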
How to fix this error?
@bold timber what do you think the error message is trying to tell you?
when i change the parameter size to vector_size and iter to epoch, it can works. but, why did happened?
@bold timber you passed key word arguments that don't do anything
sorry I havent do anything with Markdown shell. Is markdown shell is a type of comment in jupyter?
no, I think you didn't get me. What do you use when you have to write headings and paragraphs in Jupyter?
That's right. I forget the library has change and that parameter is doesn't exist now hahahaha
do you see how the error message was communicating that?
Yes I know it
Is it nice and worthy to create a app that can find rent rooms and house around someone and it can show people the map to go to that place
We can also filter prize and location
This app can be really helpful for both buyer and room seller
I can also add more things then just rooms like hotels, flats or apartments
Is it a good idea
😋
anyone wanna work a data science/ai project? I've always love the idea of creating AI so after I finish up the theory of machine learning, i defintely wanna invest my time in making stuff (:
what sort of project 😄 Keep in mind that we only let people recruit for open source projects.
oh yeah for sure, open source, i just wanna get some hands on exp (:
I would like to participate
where are you with theory?
I mean im not making anything new, im just gonna look at what idea looks coolest on the internet 😛
I'm facing a little problem. using beautiful soup, i'm not able to create the content of the soup outside the function/loop but was able to output part of the soup tags inside the function, which is strange, it doesn't work that way with i.e list and dictionary.
what am i doing wrong?
https://colab.research.google.com/drive/1Xene9c_5XWBYANtCqRRt51NOD50Yjr1o?usp=sharing
im learning pandas from https://www.tutorialspoint.com/python_pandas/python_pandas_reindexing.htm as of now
i tried this way ```python
Traceback (most recent call last):
File "F:\nifty_banknifty\remove time values.py", line 4, in <module>
df = df[df['time']== '15:29:00'].drop(axis=0,inplace = False)
File "C:\Users\shubh\anaconda3\lib\site-packages\pandas\core\frame.py", line 4308, in drop
return super().drop(
File "C:\Users\shubh\anaconda3\lib\site-packages\pandas\core\generic.py", line 4145, in drop
raise ValueError(
ValueError: Need to specify at least one of 'labels', 'index' or 'columns' ``` @rigid zodiac
@dreamy bone hello , can we discuss again?
uh sure (:
just do df = df[df.time != '15:29:00']
(This filters out the dataframe so it doesn't have that value in it)
i want to remove rows which has this value
how about you?
mostly repeating, didn't do ML for a while, but know a bit of scikit-learn, pandas, numpy , tensorflow and pytorch
ah im going through each thing thoroughly and making notes so i can just refer to that, takes a little bit longer, but its really efficient studying
problem is that each topic is quite deep and you should stop at some point and learn new things in the sphere doing the work, otherwise it will consume lots of time
this worked thanks
is it okay if i show you a list of things i've learned in core python? and if possible let me know what else i should do? (i dont mean the ai libraries yet)
bring it on 🙂
im pm or should i show here?
here, maybe other people will give better recommendations
you forgot to assign the returned value, i think. Basically doing df[df.time!='date'] doesnt change in place, it returns a different value so you gotta store or print out to see it
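A small sketch of that point on made-up data, assuming the `time` column holds strings. It also shows a `drop` variant that passes index labels, which is what the earlier `ValueError` was asking for.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["15:28:00", "15:29:00", "15:30:00"],
    "close": [1.0, 2.0, 3.0],
})

# Filtering returns a new DataFrame; on its own this line changes nothing.
df[df["time"] != "15:29:00"]
print(len(df))  # df still has all of its rows

# Keep the result by assigning it.
filtered = df[df["time"] != "15:29:00"]

# Equivalent with drop(): pass the index labels of the rows to remove.
dropped = df.drop(index=df.index[df["time"] == "15:29:00"])
```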
sure
just give me a sec
i checked in df that all rows with 15:29:00 get removed
see in this output df
1) Comments
2) Variables
3) Data Types
4) Numbers
5) Casting
6) Strings
7) Booleans
8) Operators
9)Lists
10) Tuples
11) Sets
12) Dictionaries
13) Loops
14) Functions (declared and anonymous, i.e. lambda), Generators
----> stuff like return, continue, break, pass
15) Objects and Classes
16) OOP stuff
17) Data structures and algorithms
18) Multi-threading and multi-processing
19) Modules like math, cmath, os, file, string, random, statistics, collections, itertools, sys, formatting strings (i just realized i still have to do JSON - i made the stupid mistake of saving my file in the beginning (like the first day i had begun) as json.py so i couldn't use the actual json library)
20)number methods like bin, oct, hex, etc.
21)datetime, time modules
22) list, dict comprehensions
23)wrapper functions
24) I regret wasting time in GUIs tho :(
25) Also Exception Handling
So, basically from all that, I conclude that i still gotta learn about regex, JSON and sockets.
I'm doing all the AI libraries now, and after I finish the theory ill do django before doing projects in Data Science/AI with Python. (:
Any advice? 😛 
you probably left haha, it took a while to write 😛 @teal mortar
practice OOP more, it's quite useful in ML when writing custom layers or models. django is good to know, you can start with https://djangoforbeginners.com/
and there are more advanced books after
i use freecodecamp :P, it's got a load of content for django and i think for ml also
but first i read through the text
ill save the django link on my notepad then
i just bookmarked nvm lol
from ML you can start with Andrew NG course on coursera, I believe it is now free
but I would go with the a book too
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow is quite good for beginners
could you share the link?
ill use that too if i can get some hands on even though freecodecamp has i think 3 ml models we can make in the process of watching the ml vid
well im gonna take a break from learning and get back to it tomorrow
feel free to add me if you wanna make some cool ML models in the future (:
this one https://learning.edx.org/course/course-v1:ColumbiaX+CSMM.102x+1T2017/home is also good to learn the concepts
yeah, don't rush it, but practice more
learned something new, try to apply it
yeah first solid understanding of theory, i.e., syntax and concepts like k-means clustering, linear regression and stuff is really key
first ill try to understand all the theory, and then ill go for implementation (:
in case of python visit codewars.com to get used to solving some problems
or leetcode
i use leetcode (:
Basically my end goal by next year October is :
1) Python (OOP, DSA, Django, AI libraries)
2) Java (OOP, DSA, Hibernate, Spring, Springboot)
3) SQL
4) Frontend stuff like Typescript, HTML, CSS
5) Rust in my free time, no rush as jobs in this are usually for seniors (:
and C++
C++ could take all that time :), I wouldn't go into Rust if you don't need it
I suppose, the sequence is given in priority-wise and im not gonna rush anything for sure
C++ would be above rust tho
in case of SQL learn postgreSQL and you'll be fine
i use sql server at work actually
well, in that case you know better 🙂
I really think using pandas would be more efficient than sql tho, we have millions of rows of data, so i think pandas could reduce the query time from like 50 seconds to maybe like 10 -20 secs idk
let's see how it goes, either way, big mncs that have java software dev jobs ask for like 2 years of hands-on exp so ive got plenty of time
and python is solid for data science
I would go with javascript instead of java, if you want to go the django route
typescript is actually a superset of javascript
ts is statically typed
imma add you and we can get to making ml models from next year then (:
ok 🙂
hey guys, I'm a data science fresher, just wanted to know if adding Linux in my resume would be beneficial or not?
under technical skills*
i mean i added ms excel lol
Hello, could you please recommend some sources for tensorflow
but ms excel does need to be learnt :3
yeah, i do know MS excel
Im currently doing andrew ng's deeplearning.ai course
But my daily driver is Arch Linux, would this add any weightage? or should i just not mention it?
if you dont mind could you tell me what are the class hours for that?
do you get a certification afterwards?
Yess you do
Umm no im doing it whenever i get the time
Some days i finish a week's work in a day or 2
oh its paid, ill do it after i learn everything then lol
Yeahh it's pretty good
I think you have to pay only if you want the certificate
good luck then (:
btw you should check this website out. it's got a nice amount of theory on the matter, might help
and use youtube (:
Thanks:)) you too!!
Tysmm, ill check it outt
Yes, i used to watch sentdex's
never heard of it!
thanks for your kind help buddy, i learned new small thing today
No prob
plus the site itself has good guides and tutorials tensorflow.org
you are welcome
the concept of svm is like a vector and distance to something right?
it's where you have all your training data as points in space (each point represented by a vector of its coordinates), and each point has a class assigned to it. so it figures out which regions of that space "belong" to each class
and then when you go to make predictions, it uses the vectors at the edges of the regions (the "support vectors") to make decisions about which region the point you're trying to predict for is in
for it to classify it chooses the region of which vector is close to that input?
it classifies vectors by telling you which region of the space that vector is in. and it uses the vectors near the edges of the regions to figure that out
because what ultimately matters is where the boundaries are for each region
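A tiny illustration of the description above, using scikit-learn's `SVC` with a linear kernel. The six points and their classes are invented; after fitting, `support_vectors_` holds the training points near the boundary that the model actually relies on.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up classes of points in the plane.
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# The points closest to the boundary between the two regions.
print(clf.support_vectors_)

# New points are classified by which side of the boundary they fall on.
predictions = clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```

Note that the decision is about which side of the boundary a point lies on, not simply which single training point is nearest.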
sry my imagination is just poor please bear with me 😅
here
the violet will go to green class because the nearest point is the green on that region?
oh i see nice nice thank you sir
for ppl who are currently in uni: would you advise someone to choose computer science or data science as a major?
has anyone used the python API for apache flink?
any roadbumps opposed to the scala API?
(would rather avoid pure java if possible)
comp sci or math, as data sci major sounds like it doesn't teach you jack shiet
what are u studying/studied?
pure math in undergrad/grad
Dont study pure math
do a lot of applied statistics
and computer science
ok thanks @bronze skiff
Hi Majnu, I'm not sure you'll get much help this way. Why not start by mentioning + showing what you've tried on your own, where you are having challenges etc...
I'm not in University anymore (although I'm kinda thinking maybe I should return for my graduate studies 😋) but I'd pick a major in Computer Science or Statistics over Data Science.
My 2 cents ✌️
@odd meteor I've done that training part, the thing left is making 2 plots from crime dataset, what exactly should I plot? I can't get.. can u suggest?
how do you replace one column value with another
for example i'm setting a for loop like:
for i in df['a']:
replace (df['a'], df['b']
df[['a', 'b']] = df[['c', 'd']]
any time you're trying to do something with a DataFrame, start with the assumption that there are no loops involved and wait to be proven wrong
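A short sketch of the loop-free version on made-up frames. The first assignment copies a whole column; the second, using `Series.where`, is one possible reading of "replace" in which only some values are swapped (the `>= 0` condition is an assumption for illustration).

```python
import pandas as pd

# Whole-column replacement: a takes b's values, no loop needed.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
df["a"] = df["b"]

# Conditional replacement: keep a where the condition holds, else take b.
df2 = pd.DataFrame({"a": [1, -1, 3], "b": [10, 20, 30]})
df2["a"] = df2["a"].where(df2["a"] >= 0, df2["b"])
```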
This is where you'll let your data visualization skills shine. So you might wanna visualize X3 using a bar chart, visualize the trend in reported violent crime (X2) using a line plot, visualize the distribution of X1 or X2, visualize X5 using a scatter plot etc...
You can just think of any other useful visualization and plot it. You could add a little flex by using plotly to make your visualizations interactive
Someone with science / math background able to help with this: https://stackoverflow.com/questions/70264206/golden-section-moving-average-with-python-numpy - Trying to reproduce the Golden Section Moving Average with Python/NumPy. Thank you for your time.
in temporal difference q learning what does a max mean
i got a cat and a dog in a four square room. the first trial states they're in the same square and the dog moves down
the reward stated earlier is +1 if they're not together and -1 if they are
is the initial reward +1 since the action was moving the dog out or -1 since they were together
Computer science. Data science is too niche to have demand I feel
Hello I'm relatively new to AI and I've been learning from resources that you facilitated some days ago. I saw that there is a common pattern which consists of prototyping and modelling machine learning algorithms using tools like Octave or MATLAB instead of implementing those algorithms straight away in your desired programming language (Python). The purpose of doing this is to create a functional solution and then translate it into your desired programming language code so that you don't have to start from the very beginning which is a bit time consuming. Is that correct? I'm currently playing around with Octave and it looks cool but I'm afraid that I might not be using it considering that Python has great external libraries like tensorflow and so on...
damn
Octave was built with Matlab compatibility in mind. I'm presuming you're using Andrew Ng's ML course on Coursera to learn. 😀
Coding in octave was actually one of the things that threw me off Andrew Ng's course from the get go plus I struggled to understand at the beginning. He was using Octave to code and I wasn't particularly interested in being language agnostic when I started learning ML.
In my opinion, Andrew Ng's Coursera course is good for understanding the core Math and Statistics + Theory behind most fancy ML algorithms we use.
If you're interested in Python, then you're probably better off starting with Udemy courses or Kaggle.
Yes you are right, I'm learning from Andrew's course
But I will swap to kaggle or udemy as you said I don't like the fact that is being language agnostic
@odd meteor u got any knowledge on Q learning in RL
No bro. I don't know Reinforcement Learning yet. I'm also learning myself. Currently learning Machine Translation in NLP.
nice
I'm really struggling with understanding and implementing mini-batch gradient descent in Python (without scikit-learn) for a class. If anyone has experience and would be willing to help me out or give any sort of guidance, it would be greatly appreciated. Feel free to drop me a message if you do not want to clog the chat here
To give more context:
I have my vector of targets and a design matrix containing my input variables. I have an initial grid of variables to try for the learning rate, regularization parameter, and num of data points. I believe this is a correct approach? I just don't really understand the math behind the algorithm and thus how to convert it to python code.
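A minimal NumPy sketch of the idea, assuming a linear model with squared-error loss and L2 regularization (the class assignment may use a different loss). Each step takes a random slice of the data, computes the gradient on that slice only, and nudges the weights against it. The function name, defaults, and synthetic data are all made up.

```python
import numpy as np

def minibatch_gd(X, t, lr=0.05, lam=0.0, batch_size=16, epochs=200, seed=0):
    """Fit weights w for t ~ X @ w with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, tb = X[batch], t[batch]
            # Gradient of mean squared error on the batch, plus L2 penalty.
            grad = 2 * Xb.T @ (Xb @ w - tb) / len(batch) + 2 * lam * w
            w -= lr * grad                      # step against the gradient
    return w

# Synthetic check: a bias column plus one feature, known true weights.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_w = np.array([1.0, 3.0])
t = X @ true_w + rng.normal(scale=0.1, size=200)

w = minibatch_gd(X, t)
print(w)
```

The grid search described above would then call this function once per combination of `lr`, `lam`, and `batch_size` and compare the error on held-out data.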
Whats the preferred solution for making dynamic type dashboard for data presentation and manipulation?
I am looking at things like Flask but not sure if there are easier tools
Something I could maybe share with someone who isnt familiar with py or jupyter
I have a dataframe with multi indexed columns
I can get a new dataframe from the lower level columns with df['c1']
However I can't do that to filter on the higher level column. Any suggestions?
Dataframe creation is from an API, so I can't change that
Simple way is just calling df['c1']['a1'] for each lower level column and creating a new dataframe with that, but I'm wondering if there's a more built in method of doing this?
Found a solution to my own problem, can use df.swaplevel(axis=1)
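Besides `swaplevel`, `DataFrame.xs` can select on the second column level directly. A small sketch with invented level names `c1`/`c2` and `a1`/`a2` that mirror the ones mentioned above:

```python
import pandas as pd

# Made-up frame with two column levels.
columns = pd.MultiIndex.from_product([["c1", "c2"], ["a1", "a2"]])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)

# Every "a1" column, with the second level dropped from the result.
inner = df.xs("a1", axis=1, level=1)

# Same selection, but keeping both column levels.
kept = df.loc[:, (slice(None), "a1")]
```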
Hi! Is there a way to pull strings from a dataframe column and have them return as normal strings?
Do you want to pull the column names?
No, just the values inside that column
So I have a column of 200k+ titles (they are strings) but only want a random sample of 10. The catch is I need them only in basic strings. When I do to_string, it returns them as one large string instead of their individual titles so that's not what I want
Anyone used 'streamlit' before?
Could use to_numpy() to get an array of the strings
That doesn't work either. It can't be in an array, list, tuple, etc
Only as a string itself
Do you want one string with 10 title names?
I want 10 strings of the title names
titles = df['title'].to_numpy()
selection = np.random.choice(titles, size=10, replace=False)
If this isn't what you want, could you give a small example of the format the data should be in?
That wasn't it, unfortunately.
This is an example from the docs
```
[
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
```
Looks like a json file?
titles = df['title'].to_json(orient='records')
It's not clean so hopefully someone has a better solution, but I'd do it this way to get 10
import json
titles = df['title'].to_numpy()
selection = np.random.choice(titles, size=10, replace=False)
json_file = json.dumps(selection.tolist())
I think I will do this
titles = df['title'].sample(10).to_json(orient='records')
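For comparison, a sketch of what that call returns next to `tolist()`, on a made-up `title` column. `to_json(orient='records')` gives a single string containing a JSON array, whereas `tolist()` gives a Python list of ten separate `str` objects like the docs example.

```python
import json

import pandas as pd

# Made-up titles standing in for the real 200k+ column.
df = pd.DataFrame({"title": [f"Title {i}" for i in range(100)]})

picked = df["title"].sample(10, random_state=0)

as_json = picked.to_json(orient="records")  # one str holding a JSON array
as_list = picked.tolist()                   # list of ten Python strings

print(type(as_json).__name__, type(as_list).__name__)
```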
yup, much cleaner, lol
Lol, thanks again!
Yea, I'm not familiar enough with pandas to always know what to use, lol
Yeah, I just started learning too and someone else told me about sample() lol
was probably me on an alt
Are you my mentor 👀
no but all staff members are lemons' alts
Was that 2019? 2018? Can't even remember
idk. 2020 was where we had social distancing in the off-topic channels
do you have a few test samples to see what the output should look like?
Oh, that looks interesting. I'll give it a shot if not answered yet tomorrow
Anyone know why I would be getting ValueError: Unknown label type: (array([...]), )? I'm using sklearn and it keeps on getting this error
the actual array is float64 and shape (400, )
for some reason when I put it in as the y of a .fit then it doesn't work
Can you show the whole error message (you've shown only the last line) and the related code?
File "filepath\main.py", line 166, in get_model
model.fit(X_train, y_train)
File "filepath\venv\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 752, in fit
return self._fit(X, y, incremental=False)
File "filepath\venv\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 393, in _fit
X, y = self._validate_input(X, y, incremental, reset=first_pass)
File "filepath\venv\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py", line 1131, in _validate_input
self._label_binarizer.fit(y)
File "filepath\venv\lib\site-packages\sklearn\preprocessing\_label.py", line 301, in fit
self.classes_ = unique_labels(y)
File "filepath\venv\lib\site-packages\sklearn\utils\multiclass.py", line 102, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
Remember to show the code also.
which parts?
some region of the code that includes line 166
enough to establish the context.
like what X_train and y_train are
X_train, X_test, y_train, y_test = train_test_split(data["X"], data["y"], test_size=0.2)
model = MLPClassifier(hidden_layer_sizes=(2, 2), max_iter=1000)
model.fit(X_train, y_train)
like that?
yep
do print(data.head().to_dict('list')) and show the text, please
{'X': [[9.905686378], [66.33001336], [21.75396634], [33.24767143], [2.689293441]], 'y': [[99.59781655], [8.039968478], [13.02439705], [86.38519375], [57.15746171]]}
X y
0 [9.905686378] [99.59781655]
1 [66.33001336] [8.039968478]
2 [21.75396634] [13.02439705]
3 [33.24767143] [86.38519375]
4 [2.689293441] [57.15746171]
Why is each cell in your DataFrame a list?
do data = data.applymap(lambda x: x[0]) to get everything out of the lists
then try again
{'X': [9.905686378, 66.33001336, 21.75396634, 33.24767143, 2.689293441], 'y': [99.59781655, 8.039968478, 13.02439705, 86.38519375, 57.15746171]}
yes, that looks better
Oh I just remembered why they're in lists
I did a thing where the data might actually look like this:
{'X': [[9.905686378, 99.59781655], [66.33001336, 8.039968478], [21.75396634, 13.02439705], [33.24767143, 86.38519375], [2.689293441, 57.15746171]], 'y': [[1079.75072], [677.9094591], [745.1665871], [1196.777592], [1132.214797]]}
I'm pretty sure that the shape of x for .fit is supposed to be 2d
found it in docs
shape of X_train is (400, 2) and y_train is (400, 1) when it's put into .fit
why are the names of your columns X and y, anyway? if the X data has two features, it should be two separate columns rather than one column of lists.
if X is a Series of lists, that's not the same thing as it being a 2d-array-like.
I know, but I do some stuff to data before putting it in so that it is a good shape
having a column of lists in a dataframe breaks the data model
and it won't interface with sklearn correctly
so, don't do that before
It should be noted that the X column is still one dimensional. The fact that it contains lists does not make it two-dimensional.
Right now, X_train looks like this right before going into fit:
```
[[54.39736997, 99.64921956],
 [53.00488272, 46.58886973],
 [24.22203264, 88.99648647],
 [71.8330977, 28.51141576],
 ...]
```
can you do print(type(X_train))?
<class 'numpy.ndarray'>
okay, what about print(X_train.shape)?
(400, 2)
try doing model.fit(X_train, y_train.reshape(-1)) @limpid root
I am reading this: https://stackoverflow.com/questions/45346550/valueerror-unknown-label-type-unknown
same error
do print(y_train)
[1187.151578, 568.100153, 685.7766812, 626.1073536, 1199.412295, 1543.641543, 1261.285556, 350.4392658 ...]
this is y_train after .reshape(-1)
shape of y_train is (400, )
and you're still getting the same error that you showed at the beginning? if so, I don't think I'll be able to debug this remotely.
I think so
Error is still ValueError: Unknown label type: (array([1116.166069 , 830.6421689, 1152.047414 ...]), )
why is that a tuple
I have no idea
I'm just passing y_train into fit
and y_train looks the same as what I sent just now
aight I gotta go for now, I'll continue banging my head on this tomorrow
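One thing the thread never checks: `MLPClassifier` expects discrete class labels, while the `y` values shown are continuous floats. scikit-learn raises "Unknown label type" in that situation, so a regressor may be what is needed here. A hedged sketch on made-up data (the formula generating `y` is an assumption, not the real dataset):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(400, 2))
y = 10 * X[:, 0] + X[:, 1] + rng.normal(size=400)  # continuous targets

print(type_of_target(y))  # how scikit-learn categorises these targets

# A classifier refuses continuous targets with a ValueError.
classifier_failed = False
try:
    MLPClassifier(hidden_layer_sizes=(2, 2), max_iter=50).fit(X, y)
except ValueError as err:
    classifier_failed = True
    print("classifier rejected y:", str(err)[:60])

# A regressor accepts the same X and y (it may warn about convergence).
reg = MLPRegressor(hidden_layer_sizes=(2, 2), max_iter=50, random_state=0)
reg.fit(X, y)
preds = reg.predict(X)
```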
what do you mean "split each row into a dictionary"?
there's the to_dict method
I think you mean column?
i got it
um for left left column how do i check if the values in the string are the same
so WW returns +1
XV return -1
Hi ^ don't mean to interrupt your comment but I have an important question
Python is probably the best lang to make ai with any amount of complexity right? Cause someone said python was for basic ai and I'm pretty sure they're wrong? I mean python has a lotttttt of stuff in its ai Libraries
:incoming_envelope: :ok_hand: applied mute to @sour dew until <t:1638937557:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
i have this equation
TD q-learning equation
i wanna make a function to get the td q value
i have no idea how to implement this
hey guys im trying to do a project analyzing a gene expression profile matrix from a scRNAseq dataset from cancer cells.
i have about 3589 cells, ~2000 genes, and 6 cell types in the dataset. the tSNE graph would show the cell types. my plan is to use PCA and tSNE, followed by logistic regression to characterize the differences in expression between the clusters that the tSNE graphs gave.
does anyone have tips on how i can perform the logistic regression portion?
Just the examples provided in the paper for phi(p,q) and the visual graph in the end on the trading chart. Looked at it with a friend of mine who had a math module in school and he couldn't understand it too 🙈 they probably made it difficult on purpose 😬
@regal ingot Have a look at some implementations of Q-learning in Python on GitHub. That equation is just the Bellman equation. You need to set up an MDP around your environment in order to define an iterative process that you can apply that equation to
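A minimal tabular sketch, assuming the equation in question is the standard update Q(s, a) ← Q(s, a) + α [r + γ max over a' of Q(s', a') − Q(s, a)]. The "max" means: look at every action available in the next state and take the largest Q-value among them. The state names, action names, and reward below are hypothetical stand-ins for the cat-and-dog room.

```python
from collections import defaultdict

def td_q_update(Q, state, action, reward, next_state, actions,
                alpha=0.1, gamma=0.9):
    """Apply one temporal-difference Q-learning update and return the new value."""
    # Best value achievable from the next state, over all possible actions.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha towards the target.
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
actions = ["up", "down", "left", "right"]

# Hypothetical first trial: the reward value used here is an assumption.
new_q = td_q_update(Q, state="same_square", action="down",
                    reward=1.0, next_state="apart", actions=actions)
print(new_q)
```

Which reward applies to the first move depends on how the exercise defines it (reward for the state left or for the state entered), so that value should come from the problem statement.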
yo anyone here can help me understand eigenfaces?

