#data-science-and-ml
@velvet thorn
Given two columns, a and b.
Grouping by a, we're going to have a list of values from b for each group in a.
then I want to compare across the different groups in a to see how many elements from b intersect for each group in a.
ie.
a b
0 4
0 5
0 6
1 4
1 5
2 5
2 6
Then
0: [4, 5, 6]
1: [4, 5]
2: [5, 6]
0 and 1 share two elements, 4 and 5
0 and 2 share two elements, 5 and 6
1 and 2 share one element, 5
!e
from io import StringIO
import itertools as it
import pandas as pd
df = pd.read_csv(StringIO("""
a b
0 4
0 5
0 6
1 4
1 5
2 5
2 6
"""), sep=' ')
for a1, a2 in it.combinations(df['a'].unique(), 2):
    intersection = set(df.loc[df['a'] == a1, 'b']) & set(df.loc[df['a'] == a2, 'b'])
    print(f"{a1} and {a2} share {len(intersection)} elements: {intersection}")
@paper niche :white_check_mark: Your eval job has completed with return code 0.
001 | 0 and 1 share 2 elements: {4, 5}
002 | 0 and 2 share 2 elements: {5, 6}
003 | 1 and 2 share 1 elements: {5}
@heady hatch do you mean something like this?
@paper niche right right. Is this going to run through at O(N^2)?
I'm trying to find a solution quicker than it.
@heady hatch hm I think that's probably not the way to do it
let me think
like grouping by a is not the most efficient solution
do you need to count all possible intersections?
Seems like a classic map reduce problem
I don’t know if I need to count all possible intersections, but just the ones with at least one intersection.
Oh? How would you go about it in terms of map reduce?
Well you would just map all nodes based on column a, the reduce would be counting the overlaps in b
I think I’m not familiar enough with map reduce.
I’m thinking of map function and reduce function from python.
On the other hand, isn’t that also O(n^2)?
Since when counting all the overlaps with b values, you still need to go through each a value n times.
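One way to skip pairs that share nothing is to invert the mapping: for each value of b, list the groups of a that contain it, then count group pairs per value. A minimal sketch using the same toy data as the eval above; `by_b` and `pair_counts` are illustrative names, and the worst case is still quadratic if a single b value appears in every group.

```python
from collections import Counter
from io import StringIO
from itertools import combinations

import pandas as pd

df = pd.read_csv(StringIO("""a b
0 4
0 5
0 6
1 4
1 5
2 5
2 6
"""), sep=" ")

# Invert the mapping: for each value of b, the sorted distinct groups of a that contain it.
by_b = {b: sorted(g["a"].unique().tolist()) for b, g in df.groupby("b")}

# Every pair of groups listed under the same b shares that element.
pair_counts = Counter()
for groups in by_b.values():
    pair_counts.update(combinations(groups, 2))

for (a1, a2), n in sorted(pair_counts.items()):
    print(f"{a1} and {a2} share {n} elements")
```

Pairs with no overlap never show up in `pair_counts`, which matches the "only the ones with at least one intersection" requirement.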
@heady hatch hm.
that makes the problem different
so basically you want to find the values of b that correspond to more than one unique value of a?
Yes(I think.)
To give you guys some context.
Each product has a category. And I have two stores and each store has their own set of categories.
The idea is to find the categories from each store that share the most products with a category from the other store.
or do you mean the unique values of a which correspond to at least one value of b that is shared with another unique value of a
Ahh I think it sounds like the latter.
okay maybe with an example
it would be easier
can you provide some sample data and your expected result
Yea definitely.
If you guys don’t mind, I have to anonymize a couple of things.
But data is something like this.
Two datasets, each one with a product and the categories they’re in. They can be in more than one category.
So dataset is something like ...
Dataset 1
item -> category
apple -> [a, b, c]
orange -> [a, d, e]
Dataset 2
Item -> category
Apple -> [1, 7, 9]
Watermelon -> [1, 4]
Banana -> [1, 5,6]
Orange -> [1, 2]
And the result that we want to get is something like a utility matrix of sort.
Category from dataset 1 vs category from dataset 2
     a                b        c   d   e
1    [apple, orange]  [apple]  ...
4    []               []       []  ...
5
6
7
9
I don’t know if this helps.
The way I was thinking of was compute all the items in the categories and go through each category in the other dataset to see how many items they would share.
But it would be O(cate1 * cate2).
However thinking about it, I can filter down a bit of the categories.
Not sure how I would filter, now thinking about it twice. Hahaha
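A rough sketch of one way to build that category-vs-category matrix without looping over every category pair: explode each dataset to one row per (item, category), join the two on item, and count. The data is the toy example above with item names lower-cased so they match; `store1`, `store2`, `cat1` and `cat2` are made-up names.

```python
import pandas as pd

# Toy data from the messages above, with item names lower-cased so they match.
store1 = pd.DataFrame({
    "item": ["apple", "orange"],
    "category": [["a", "b", "c"], ["a", "d", "e"]],
})
store2 = pd.DataFrame({
    "item": ["apple", "watermelon", "banana", "orange"],
    "category": [[1, 7, 9], [1, 4], [1, 5, 6], [1, 2]],
})

# One row per (item, category) pair in each store.
long1 = store1.explode("category").rename(columns={"category": "cat1"})
long2 = store2.explode("category").rename(columns={"category": "cat2"})
long2["cat2"] = long2["cat2"].astype(int)

# Joining on item yields one row per (cat1, cat2) pair that shares that item.
pairs = long1.merge(long2, on="item")

# Count shared items per category pair; pairs with nothing in common never appear.
shared = pd.crosstab(pairs["cat2"], pairs["cat1"])
print(shared)
```

The work is proportional to the number of (item, cat1, cat2) combinations that actually occur rather than cate1 * cate2, so categories 4, 5 and 6 simply drop out here because none of their items exist in dataset 1.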
Hello, I'm having some trouble with my dataframes. I have tried playing around with indexes, transposing and the like. For now I just want to plot either of the points in the first row.
Release 0.0.2 of NLP Profiler is now available, see https://pos.li/2h39ue
PyPi: https://pos.li/2h39uf
Github: https://pos.li/2h39ug
Gitter: https://pos.li/2h3b1o
can anyone take a look at the train function and does anyone know how to fill in self.params['W']? this is for my class on linear regression 😭
```py
# TODO: Use the gradients in the grads dictionary to update the
# parameters of the model (stored in the dictionary self.params) #
# using stochastic gradient descent. You'll need to use the gradients #
# stored in the grads dictionary defined above. #
self.params['W'] = ???
# END OF YOUR CODE
```
this is my attempt at it, but it doesnt yield the supposed values from the notebook (loss plot is increasing instead of decreasing XD) https://github.com/poisonivysaur/ml-class/blob/main/Linear%20Regression/linear_regression.py
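For reference, the usual SGD step moves the weights against the gradient: W = W - learning_rate * grad. A loss that climbs often means the sign is flipped or the step is too large. The sketch below is a self-contained toy, not the notebook's code: the data, `learning_rate` and the mean-squared-error loss are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical, not the course dataset): y = X @ true_W + noise.
X = rng.normal(size=(200, 3))
true_W = np.array([2.0, -1.0, 0.5])
y = X @ true_W + 0.01 * rng.normal(size=200)

params = {"W": np.zeros(3)}
learning_rate = 0.1  # assumed value; too large a step can also make the loss grow
losses = []

for step in range(100):
    # Sample a mini-batch, as in stochastic gradient descent.
    idx = rng.choice(len(X), size=32, replace=False)
    X_batch, y_batch = X[idx], y[idx]

    error = X_batch @ params["W"] - y_batch
    losses.append(np.mean(error ** 2))

    # Gradient of the mean squared error with respect to W.
    grads = {"W": 2 * X_batch.T @ error / len(X_batch)}

    # The update: step against the gradient.
    params["W"] -= learning_rate * grads["W"]

print(losses[0], losses[-1])
```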
@velvet thorn kind of messy but yea. Hahaha
For each item I was going to explode the list of categories and regroup the categories. So it would be grouped by categories instead by items.
so I just read that the Andrew Ng class is on Octave
what the hell is Octave guys
matlab but free
can't I just use Jupyter notebook like any sane human being
jupyter notebook is my baby
ugh
do u think u can take a look at my linear regression code? above ^
i tried that course
did you find it good
second week exercises are rigged
mostly because octave is soooo hard to use
i heard answers are on github repositories
@hollow sentinel if they find out u use that they remove u from the course
lamo ok sorry
i know
I don't get it
what is the point of making you handwrite your own linear regression without sci kit learn
exactly
sci kit learn is there for a reason lol
lol
yeah um I may try the columbia course first
all this time i thought it was just python
^
and if a programming language is paid dont even bat an eye (matlab)
thanks
wlcm
hey guys
hello
I am trying to learn data analysis can I ask questions related to it here?
yes this is a data science chat
if you're training an ML model then u can use their public GPUs too
@wheat seal that's cool
lmao I've stuck w Jupyter notebook so far idek how to help
are you talking about this
no sign up option
yes
does google colab do that
wait I can't even sign in xD
lmaooooooo
ye
How to do that?
do I need to install anything other than jupyterlab?
u dont need to install ANYTHING to use google colab
ok
so I choose new notebook to start right
yes
if you're using google colab i would recommend google's machine learning crash course
oh Portilla uses jupyter notebook lol
🤮
heyyyy
bruh this was easier than I thought xD idk why someone recommended me notebooks.ai
he's good
lmao
thx guys
np
no problem
now I can finally start coding xD
is this your first time doing machine learning? @undone flare
yea
same
when I first came here I couldn't make a matplotlib pie chart properly
so I got an internship interview w CUNA Mutual group and they asked me if i knew any algorithms and i said uhhhhhhhh i make graphs
can relate
yeah safe to say I didn't get the job
does executing things take time firstly or it's just my laptop
not the job part im just a kid
depends on what you're executing
ye
I executed 2+3 xD
uhhhhhhhhhhh
u need ur lapy checked
idk it still works after 7 years
even more reason to get it checked
yeah i would be sus if my machine lasts that long
mine is windows
omg
F
we have met the messiah
that also Win7 Ultimate lol
lol i got bullied in college for having a mac
bruh
everyone was like who uses a mac to code
i still get bullied by my friends while playing mc even tho i get higher fps than them
with 4gb ram oof
yes
mine would melt
mine does not like gaming
even flash games
i can hear the fans go off
anyways back to ML
me knowing I won't understand TF without calculus and lin alg
bruh do I have to learn jupyter notebook before other stuff?
jupyter notebook and google colab is like the same thing
except google colab is the cloud
I mean that only
well normally you would learn how to visualize data
then you would learn how to clean data
and finally machine learning
and then all those niche topics like NN, NLP
👍
yes
nice what course
yt freecodecamp
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)
nope i just went straight to udemy
Does udemy have free courses?
some but i wouldn't call them amazing
i learned python from freecodecamp course
oof
i would recommend downloading a Kaggle dataset and trying to do visualizations off that
while you're following the video
Ctrl+M B what does that mean
pressing M B keys together?
ok I got it
is datacamp good?
lol idk
nah it's boring I just checked lol
df["term"] = df["term"].apply(lambda term: int(term[:3]))
TypeError: 'int' object is not subscriptable
anyone see what's wrong here
I don't get it this was exactly what Portilla typed
visible confusion
see that is the same line
I need to learn lambda expressions
yes
ok
it might be bc i ran the cell more than once
but it still doesn't work lmao
idk how to fix it
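That TypeError fits the "ran the cell more than once" theory: after the first run the column already holds ints, and `term[:3]` cannot slice an int. A small sketch of a conversion that is safe to re-run, assuming the original values look like " 36 months" (the real data isn't shown here); `to_months` is a made-up helper.

```python
import pandas as pd

# Hypothetical sample; the real column is assumed to hold strings like " 36 months".
df = pd.DataFrame({"term": [" 36 months", " 60 months", " 36 months"]})

def to_months(term):
    # Already converted on an earlier run of the cell: leave the number alone.
    if not isinstance(term, str):
        return int(term)
    return int(term[:3])

df["term"] = df["term"].apply(to_months)
df["term"] = df["term"].apply(to_months)  # running it a second time no longer raises
print(df["term"].tolist())
```

The other fix is to re-run the cell that loads the dataframe before re-running the conversion.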
should i go to a help channel
does numpy or scipy have any functions to add values with some masks but masks like 0 to 1 ?
mask = array > 10
nah, I want to add 2 arrays, center is full new, and outerring is some average ;d
this
it is just.... numpy ;D
idk lmao maybe someone else knows
yep
X = df.drop('loan_repaid',axis=1).values
y = df['loan_repaid'].values
KeyError: "['loan_repaid'] not found in axis"
I ran the cell more than once and now I can't get it to work properly
train_test_split needs an X and a y
haha nvm i fixed it
just restarted the cell and ran everything again
fun fact
@bitter harbor was that directed at me
sort of
i've never read the pandas docs but that's the first thing that shows up
I hate reading doc it’s so boring
I like to read the doc and then just use methods from it on data from Kaggle
I’m getting better at reading doc tho
I just use stack overflow
Do they still teach rtfm in cs schools?
the pandas documentation is often confusing
Yes, I can definitely agree with that and I think the scikit-learn documentation is rather informative and easy to read.
Hello, anyone know in jupyter notebook how can I get the cleaner looking histogram shown? Mine (up) is hard to see
have you tried the edgecolor parameter?
@austere swift That did the trick. Thanks ^^
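For anyone else with the same hard-to-read histogram, a minimal sketch of the `edgecolor` suggestion on made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)  # made-up data

fig, ax = plt.subplots()
# edgecolor draws an outline around every bar, so neighbouring bins stay distinguishable.
counts, bin_edges, patches = ax.hist(values, bins=20, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("count")
```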
Np
scraping part of data science?
err no im trying the basics of scraping so im trying html scraping. but i keep getting 0 data no matter what i try
just send your code
🙂 just wondered if it was right category
read the beautiful soup documentation but my experience differs from the docs.
this might just give you a general guide on how to do bs4 scraping
ofc
oh man
im sleepy from studying and streaming
finally learning NN
learned that each layers is a logistic regression function.
essentially
idk man
I don't know if i want to spend time learning octave
people don't use it
it's either jupyter notebook or google co lab
hey guys if you want to brush up on your python basics automate the boring stuff with Python is free on Udemy with this code: NOV2020FREE
be careful the code only works for a limited amount of time
I've been sceptical about notebook but once I tried it I'm not going back
Makes data manipulation and presentation waaaaaaaaay better than any IDE out there
has anyone tried the IBM data science course
AAAAAAAAAAAAH THEY'RE ASKING FOR CREDIT CARD INFO ON COURSERA
Has anyone come up on this problem?
I download a .csv file directly from a local server, but it doesn't import properly into pandas
The column headers are shifted two over. Only workaround i've figured out was opening the file in numbers/excel and resaving. Then it imports fine
but this will run on a scheduler... anyway to fix the headings?
wondering if its a delimiter problem in the csv file
in other words, there may not be a comma between those columns?
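One possible cause that matches the symptom (a guess, since the file itself isn't shown): if every data row has more fields than the header, for example because of trailing delimiters, pandas uses the extra leading fields as the index and the headers appear shifted. `index_col=False` tells `read_csv` not to do that. The sample text below is made up.

```python
from io import StringIO

import pandas as pd

# Made-up file: each data row ends with a stray delimiter,
# so rows have one more field than the header.
raw = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

shifted = pd.read_csv(StringIO(raw))                 # first field becomes the index
fixed = pd.read_csv(StringIO(raw), index_col=False)  # headers stay over their own columns

print(shifted)
print(fixed)
```

If that is not the cause, printing the first few raw lines of the file with plain `open()` usually shows what the delimiter and header row really look like.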
i have the andrew ng machine learning python homework assignments
what are you trying to learn
someone created the python homework set, and basically it's approved when submitting it, that's why I'm using that version anyways
columbia ml course?
ok
there's a columbia machine learning course on edx
this professor puts me to sleep tho
i like coursera the best
yeah but i have to pay for that
i think i should do the google ml crash course
i can't stand these boring machine learning theory lessons
that's unfortunate
well I'm not gonna do Ng anytime soon
and learn Octave
lol the columbia course is lame not doing it
If I wanted to learn the math I would’ve done 3b1b
Where would we talk about AI?
you dont need to learn octave
just get the python homework version for andrew ng's class
there's actually a github of someone who created all of it in python for homework exercises
if you wanna do data science jupyter notebook and google colab is good @jolly folio
but other than that I would recommend VSC
@lapis sequoia here lol
@agile wing thanks man I found one that does it entirely in Python
I just wish there were more courses like Portilla's
i don't understand everywhere I read it says the Ng course is free
but it requires credit card info??
I have a numpy array a = [0, 0.5, 1, 1.5, 2] and when I print a[0] it gives 0.0, why?
I edited it
yes because 0 is the zeroth index of the list
no I mean why did it get converted to float?
In array it was 0 but when I print that element it becomes 0.0
oh idk lmao
and if I do a[-1] it will give 2.0 and not 2
is it because the data type of a is float?
cuz arrays can have only same data type values
can that be the reason?
because NumPy arrays can only hold one dtype, and I think it casts everything to float if there's a float in there.
you can check array's dtype with array.dtype.
yea thx got it
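A quick check of that explanation, using the array from the question:

```python
import numpy as np

a = np.array([0, 0.5, 1, 1.5, 2])
print(a.dtype)  # float64: the ints were upcast to match the floats
print(a[0])     # 0.0

b = np.array([0, 1, 2])
print(b.dtype)  # an integer dtype, since every element is an int
```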
Anyone know of any Video upscaling tutorials? I want to take a 1080p movie and upscale it to 4k.
Thought I could make a fun project out of it.
how do you guys deal with loading xls files into mysql?
I'm working on a flask app
trying to come up with a way to load new user data in bulk, into my db
guys was just wondering: tensorflow2 only supports up to python 3.8, but new distros like Fedora 33 come with py3.9, so is there any news on a tensorflow update? or can you share any updates that exist
@vital cipher "when all of our dependencies support py3.9" is when it will be available
yup i agree with you @lapis sequoia but it was posted like 26 days ago and wanted to know like whats new thats all... 🙂
You can always check what dependencies support python 3.9 @vital cipher
Maybe one is holding them back
heya anyone there??
yes ?
When we may require to create an array initialised to zeros or ones?
im sorry but i did not understand that 🙂
like
you create an numpy array
np.zeros((3,4))
or
np.ones((3,2))
Why we need an array with only zeros and ones
im afraid i do not know the answer to that.
me neither xD
@undone flare if you need to init an array with values, you'd use one of those. Zeros is used pretty commonly in networks and stats related stuff, but the usage depends on the values you need. np.full works as well and can fill an array with values such as inf and NaN: values that aren't really values. np.empty can be used as well and is faster because it doesn't initialize anything, but the contents are arbitrary, so you have to set every element yourself before reading it.
they're just different ways to init an array
np.(zeros/ones/full/empty)_like is similar to all that too, but it's used to copy the shape + data-type + order
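A short sketch of those constructors side by side; the use cases in the comments are only typical examples, and `template` is an illustrative name.

```python
import numpy as np

zeros = np.zeros((3, 4))          # e.g. an accumulator to add into
ones = np.ones((3, 2))            # e.g. a neutral start for products, or a bias column
filled = np.full((2, 2), np.nan)  # any fill value, including nan or inf
scratch = np.empty((2, 3))        # uninitialized: contents are arbitrary until assigned
scratch[:] = 7.0

template = np.arange(6, dtype=np.int32).reshape(2, 3)
same_shape = np.zeros_like(template)  # copies shape and dtype from `template`
print(same_shape.shape, same_shape.dtype)
```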
yea I just learnt about that
Hey @velvet thorn I think I finally have a better way of describing what I was looking for.
So I'm grabbing the cartesian product of two lists of categories, and under each category, it's a list of products.
For each product of categories, I wanted to find the length of shared products under the categories.
I was able to do something along the lines of for each product of categories, find the set intersection of the two.
but I'm curious if there's a way to do it more efficiently.
I would love to hear what others' advice as well.
Can anyone give me some pattern to print using arrays (simple one as I am still learning)
import numpy as np
array = np.array([1, 2, 3, 4, 5])
print(array)
for x in array:
    print(x)
something like this
https://www.machinelearningplus.com/python/101-numpy-exercises-python/ that looks half decent
@chrome barn wdym
like printing out elements of an array
I am already learning it
@hollow sentinel Thanks
!e <head> Hello world </head>
You are not allowed to use that command here. Please use the #bot-commands channel instead.
matplotlib: i want to grab the array of pixels for a plot, manipulate that array, and then write it again with ax.imshow(arr) but I don't see any way to get a plot (bar in this case) as an array of pixels
hi anyone here?
@remote valley that seems like an odd way of editing a graph. What's the specific use case of this vs using mpl's normal functions for changing a graph?
@spiral peak oh yeah it's odd. I'm working on some procedural art stuff. not an intended use case for sure.
made some bar charts with polar coordinates and wanted to use them as patterns to fill voronoi cells 🙂
Aaaah, okay. I'm not sure, let me do some research
@hollow sentinel do you know if the determinant of the identity matrix is always 1 or not?
nah
uhhhhhh
What do you mean by pattern? As in you want it to be a 2d array?
If so convert img to array.
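On grabbing a plot as pixels: with the Agg backend the rendered figure can be read back as an RGBA array after drawing. A minimal sketch, assuming a raster backend is acceptable; the bar data and the colour inversion are placeholders for the real manipulation.

```python
import matplotlib
matplotlib.use("Agg")  # raster backend; buffer_rgba() below is provided by the Agg canvas
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
ax.bar(["a", "b", "c"], [3, 1, 2])  # placeholder bar chart

fig.canvas.draw()                              # render the figure into the canvas buffer
pixels = np.asarray(fig.canvas.buffer_rgba())  # shape (height, width, 4), dtype uint8

inverted = pixels.copy()
inverted[..., :3] = 255 - inverted[..., :3]    # example manipulation: invert the colours

fig2, ax2 = plt.subplots()
ax2.imshow(inverted)
ax2.axis("off")
```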
I solved it with #algos-and-data-structs but thanks
@heady hatch It cant be done that way, it has to be dynamic... you want size 8, or 10 etc. 😄
How much better is automating with openpyxl than with Macros/VBA?
Hey fam, anyone here a wiz at matplotlib?
or know anything about plotting antenna radiation patterns?
what is the criteria for choosing a neural network library like scikit lern or TF for NLP?
learn*
im having an issue during numpy import, referenced here https://github.com/xianyi/OpenBLAS/issues/2709
anyone familiar with that? im not sure where to go from here, running windows 10.0.19041 Build 19041
"RuntimeError: The current Numpy installation ('venv\lib\site-packages\numpy\init.py') fails to
pass a sanity check due to a bug in the windows runtime. See this issue for more information: https://tinyurl.com/y3dm3h86"
Never mind, got it.
dataset['b'].head()
Oh but that doesn't include group by.
I had to do df.groupby('a')['c'].nlargest.
In regards to your question, @fallow prism . The kind of algorithms you want to use for NLP depends on your problem and constraint and how you want to go about it.
can you resend the problem please?
Scikit learn isn't a neural network library.
So if you need NN, you would look into NN framework libraries such as PyTorch or TensorFlow.
But if you need to use classical machine learning, Scikit-Learn is there.
But there's also more nlp focused ones like NLTK or SpaCy.
Scikit-Learn is a general library for ML; it only ships a basic multi-layer perceptron, not a full NN framework.
thank you, I going to study that better
my problem in NLP is interpretation and classification of description made by people about car accidents
NLP problem*
and i need train a NN to do that
or i think that
also whoops, apparently I didn't solve my issue.
The data is
So I'm looking to group by a, and sort by c.
But retaining the values of b.
Regarding your issue of classification of description made by people about car accidents.
NN could work.
But there's nothing wrong with trying out classical algorithms as well.
Or I guess I'm curious, why do you think you need to jump to NN right away?
noob question here, but this is my data set and I am trying to display the movie name and number of genres for each movie in the dataframe, and also print the total number of movies which have more than one genre... any idea where to start here? I looked up the documentation of the .sum() function but can't seem to get it to work...
@tall seal to clarify, are those three different requests?
Would something like df.sum(axis=1) be what you're looking for?
imagine you crash your car and you describe me the accident, i have to be able to classify the accident and know what part of your car was damaged, know how occurs the accident
and who is responsible
in a few words
So from my understanding you're trying to extract information from text?
basically
Nine, did you try a sort_values on the group by object yet?
and my set of texts doesn't have structure
Hey @ripe forge , thanks for responding.
I did and this was the result I got.
But the issue I'm encountering now is the b column is in index form instead of its actual value.
And so now I'm not too sure how to go about it.
Oh just use reset index after that
Would I just index into column by with level_1?
This is after reset index yeah?
Mhm.
Then I think the only part left is top 5,yeah?
Oh then yep, you're done
Is there an elegant solution to do it without remapping?
Not sure off the top of my head. I'm a bit surprised why it came as level1
Can you change nlargest to head and see if it still comes?
Oh good point.
Oh wait I remember trying that and needed to sort beforehand.
So this was something else I've also tried.
But the issue here is a isn't in groups and c is just based off of the absolute sort instead of within groups.
This should still logically contain all the rows that you're interested in. But yeah, this one aside, I was thinking group by, sort values, and head. What's the output of the operations in that order?
I remember hitting error on using sort_values after groupby.
Or did you mean
df.groupby('col')['col2'].sort_values...
On the other hand I just tried a new one.
This seems to get me there.
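The screenshots didn't survive, so the following is only a guess at an equivalent: a sketch of "top n rows per group of a by c, keeping b", on a hypothetical frame with the same column names. The `level_1` column seen earlier comes from `groupby(...)[col].nlargest`, which returns a MultiIndex of (group key, original row index); the unnamed second level becomes `level_1` after `reset_index`.

```python
import pandas as pd

# Hypothetical frame with the column names used in the discussion.
df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y", "y"],
    "b": ["p", "q", "r", "s", "t", "u"],
    "c": [3, 9, 5, 7, 1, 8],
})

# Sort once by c, then keep the first n rows of each group.
# Whole rows survive, so b is retained without any remapping.
top2 = (
    df.sort_values("c", ascending=False)
      .groupby("a")
      .head(2)
      .sort_values(["a", "c"], ascending=[True, False])
      .reset_index(drop=True)
)
print(top2)
```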
hey guys if I asked questions in octave would you be able to help me
I haven’t started the Ng course bc I’ve been busy w school
@heady hatch I tried this and it didn't seem to work
What results were you trying to get to? and could you clarify what you were trying to achieve?
@heady hatch 2 requests, to display movie name and number of genres for the movie and then print total number of movies with more than one genre.
this is the result I got with item.sum(axis = 1)
Right right, if you don't mind let me try to lead you through what you're seeing here.
Oh ops.
I realized why it's adding random things, it's adding the id.
so you might need to do something like
df.iloc[:, 3:].sum(axis=1) or df.loc[:, 'Action':].sum(axis=1)
What this does is add up all the values in your genre columns. Since your data is a boolean encoding of the genres, adding up the values gives the total number of genres per movie.
From there, then if you want to filter to movies with more than 1 genre, you would then need to do
col > 1
If it's too much, I guess let me know what questions you have.
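Put together, the two requests could look like the sketch below. The frame is a made-up stand-in for the movie table (an id, a title, then one 0/1 column per genre), so the column names and positions are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the movie table.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "Action": [0, 1, 0],
    "Comedy": [1, 0, 0],
    "Thriller": [0, 1, 1],
})

# Sum only the genre columns so the id is not added in.
movies["n_genres"] = movies.loc[:, "Action":"Thriller"].sum(axis=1)
print(movies[["title", "n_genres"]])

# A boolean mask picks the movies with more than one genre; its sum is the count.
multi = movies["n_genres"] > 1
print(multi.sum())
```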
Anyone here use statsmodels?
is that like scikit learn
Yeah similar concept.
yeah I don't use it but I remember you asking a question about it before
I've only used statsmodels tbh, I should probably try SciKit Learn too.
Yeah was just curious.
have you used Octave lol
Haven't had time to tweak my linear model I did from last time.
Nah what is that?
I'm pretty new to coding lol.
it's like matlab
Andrew Ng uses it for his machine learning course
on coursera
Do you have to pay for it?
@hollow sentinel Do you work in ML?
@lapis sequoia lol no i'm just a college business student who thinks ML is cool
Ah gotcha lol. Same here, ML seems dope. I work in corporate finance and some of our financial models and tools we use are starting to become ML.
that's very cool
We now have ML dashboards to predict and model out future costs.
that sounds very cool
I have been trying to create some sort of regression model for our labor hours/direct labor costs to find the drivers and create a thing where we can choose each component of the product and see how many hours get added but been struggling since the initial regression.
That was the one I asked about a few weeks ago.
just be careful about who you give the data to on the internet
that kind of stuff in the wrong hands is really bad
Yeah I always hide company info.
good
hi @weak kiln bro can i get access to the voice chat to interact with you guyz
weird place to ping me, dude. see #voice-verification for information on why you don't have speaking permissions.
lol
lol
@hollow sentinel i'm new at this place
haha welcome
I've been here for a couple weeks. This thread is for DS/ML questions so if you have any just ask
@hollow sentinel Yes i have a lot of...
uhhhh then ask them?
@fading wigeon hey, remember that presentation about how notebooks are sucky?
https://www.youtube.com/watch?v=9Q6sLbz37gk here's one from one of the authors of fast.ai on how notebooks might actually be interesting and attempts to refute some of the arguments of that other presentation
I like using Jupyter Notebooks (https://jupyter.org/). Particularly when combined with nbdev (https://nbdev.fast.ai/). In this video, I explain why, and explain why I have a different opinion to Joel Grus, who discussed in another talk why he doesn't like using Jupyter Noteboo...
TBH, I wasn't that convinced by the rebuttals but nbdev caught my attention
@hollow sentinel bro can we integrate osint with ai??
uhhhhhh
@charred blaze Cool, thanks, will check it out
also, this was the presentation where Jeremy Howard was somewhat... "canceled"
and a schism is starting to brew that involves Jupyter (really)
jupyter was the first thing i used for DS/ML
hello i already have python installed in my windows system
is it good to register anaconda3's python as default?
anacondas python has many inbuilt libraries so you can make it default to use them in ease
octave is a pain jus saying
But if you are more familiar with general IDE then ignore
ye but it takes up a lot of space
anaconda is just venv or virtualenv module but with a fancy unnecessary gui and all libraries install by default
even installing the libraries on your own takes up less space
i have a api question.
with a get request the api returns more than 10000 records, with pagination. to use that data do I loop and request every page, or is there a more efficient way?
i googled a lot but i m not getting the answer
filedata = np.genfromtxt('data.txt', delimiter=",") this gives an error data.txt not found but I have saved it
I am using google colab
ping me pls if you know what's wrong
np.loadtxt() gives the same error too
hello, currently my code is giving me output like
albania_passport: 100.00%
confidence_level: 100.00
this way . how i can make more changes to get my output like
predictions [[0.03083993 0.9471298 0.02203036]]
albania_driving_licence : 0.03083993
albania_passport : 0.9471298
invalid : 0.02203036
this way
i want to get prediction for all labels
my code here
print("label:", label)
predictions = np.argmax(predictions)
print(predictions)
if (label == prediction):
    print(f"{label}: {(predictions)*100:.2f}%")
    logger.debug("{}: {:.2f}%".format(label, predictions * 100))
    confidence_level = predictions * 100
    confidence_level1 = "{:.2f}".format(confidence_level)
    print("confidence_level: ", confidence_level1)
    logger.debug(f"confidence_level: {confidence_level1}")
my code here https://paste.pythondiscord.com/rogokizezo.py
predictions are from an sklearn model?
predictions [[0.03083993 0.9471298 0.02203036]]
@mild topaz if so you can use model.predict_proba() it gives the confidence probabilities directly
@lapis sequoia hello
@lapis sequoia no i am using predictions = model.predict(img) see line 242
line 245 is what you need
prediction_prob = model.predict_proba(img) this ? @lapis sequoia
predict gives label and predict_proba gives probability. which im assuming this is "predictions [[0.03083993 0.9471298 0.02203036]]"
yes
prediction_prob: [[0.03083993 0.9471298 0.02203036]] this one u were talking about
i want to get my output this way
albania_driving_licence : 0.03083993
albania_passport : 0.9471298
invalid : 0.02203036
...its the same isnt it
ya but not get the required output @lapis sequoia
prediction_prob = model.predict_proba(img) did you print this?
see prediction_prob: [[0.03083993 0.9471298 0.02203036]] this is what i get @lapis sequoia
yes but i dont understand your problem
you have your probabilties you just need to print it in label : confidence format
see first i explain what i want to achieve
i want that each label has its corresponding prediction_probability prediction_prob: [[0.03083993 0.9471298 0.02203036]]
this way i want my output
albania_driving_licence : 0.03083993
albania_passport : 0.9471298
invalid : 0.02203036
i want to show probability for each label using prediction_prob: [[0.03083993 0.9471298 0.02203036]] this
@lapis sequoia
each label has its own prediction_prob value
its possible you are treating a multiclass problem as a multilabel classification.
Cant help you more than this without seeing model training
can u atleast give some suggestion how i can fix this issue ? @lapis sequoia
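A sketch of the "label : confidence" printing being asked for. The numbers are copied from the messages above; the order of `labels` is an assumption and has to match the class order the model was trained with. One likely reason the earlier output showed 100.00%: `np.argmax(predictions)` returns the index of the best class (1 here), so multiplying it by 100 prints 100 rather than a probability.

```python
import numpy as np

# Values copied from the messages above; the label order is an assumption.
labels = ["albania_driving_licence", "albania_passport", "invalid"]
predictions = np.array([[0.03083993, 0.9471298, 0.02203036]])

# predictions has shape (1, n_classes); take the row for the single image.
probs = predictions[0]
for label, prob in zip(labels, probs):
    print(f"{label} : {prob}")

# Keep the probabilities in their own variable; argmax only gives the winning index.
best_index = int(np.argmax(probs))
print("predicted:", labels[best_index])
```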
Hello, so I have a question that's rather about statistics, but maybe someone can help me out here:
Ok, so I have a list of normalized values, asking which year was the worst for their allergies
0 means nearer to present
1 means nearer to the start of their allergies
If they answered, that they didn't notice any change, can I just use 0.5?
I calculated the values by: (2020 - worstyear) / (2020 - first year)
hey anyone here?
👻
Lets say I got matrix and some kernel
#print(mat)
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
# kernel
kernel = np.array([[1,1,1],
[1,1,1],
[1,1,1]])
out = np.correlate(mat, kernel)
and I want to use correlate to count sum of each square in matrix 3x3
ValueError: object too deep for desired array
its only 1d 😐
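`np.correlate` only accepts 1-D inputs, which is where that ValueError comes from. A sketch of the 3x3 window sums using NumPy alone (needs NumPy 1.20+ for `sliding_window_view`); `scipy.signal.correlate2d(mat, kernel, mode="valid")` is the SciPy route to the same result.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

mat = np.arange(25).reshape(5, 5)
kernel = np.ones((3, 3), dtype=int)

# Every 3x3 window of mat; result has shape (3, 3, 3, 3):
# (window row position, window column position, row inside window, column inside window).
windows = sliding_window_view(mat, kernel.shape)

# Weight each window by the kernel and sum over the two window axes.
out = (windows * kernel).sum(axis=(-2, -1))
print(out)
```

Only windows that fit entirely inside the matrix are produced, so a 5x5 input with a 3x3 kernel gives a 3x3 result.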
So I have a text file named data_ and when I try to do np.genfromtxt('data_.txt', delimiter = ',') or np.loadtxt('data.txt', delimiter = ',') it gives an OSError : data_.txt not found
ik but I have a file named data_.txt
not in this workpath*
I am using google colab
!d numpy.genfromtxt
so do I add that file in google colab?
@undone flare you can try is_file = os.path.isfile(file_path) and then print(is_file) you will get True or False depending on whether it can see the file
ok
bruh now I am getting new error
ValueError: Some errors were detected !
Line #2 (got 1 columns instead of 5)
Line #4 (got 1 columns instead of 5)
If all my independent variables are dummy variables should I use something other than linear regression?
which cloud should I learn if i want to get into machine learning and ds?
word-clouds, at most 
Hi everyone
I made a project on AI for high school
Can you help me by giving suggestions?
Hey @tawny cradle!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .flac, .afdesign, .m4a, .csv.
Feel free to ask in #community-meta if you think this is a mistake.
you click the + button if you're on a machine
@undone flare yes
@hollow sentinel can you link me the course?
paid oof
yeah my prof i was doing research w paid for it
but this does help
I liked it more than I liked the Ng course
the Ng course is free
but he teaches ML in octave
but there's githubs with everything in python
ok thx for suggestion
no prob
@lapis sequoia have you considered ANOVA? I'm assuming your dependent variable is continuous
@lapis sequoia That's what someone else suggested, but I have no idea how to do/implement that. And yes my dependent variable is continuous.
might be helpful idk
https://github.com/mohan-mj/ANOVA---Sales-Volume @lapis sequoia
this github does ANOVA it might be helpful to look at
Let me check those out, thank you.
no problem
Is it supposed to be that long?
does anyone have a guide/paper/link to some websites or packages that checks on data anomaly or data fidelity?
@hollow sentinel Yeah trying to make it work now.
with beautiful soup how could I find all links anchored within h3 headers?
This is what I've tried so far:
self.all_links = soup.find_all("h3",
{"class": "entity-title"},
limit=41).a.get("href")
seems that the .a is not working
https://www.w3resource.com/python-exercises/BeautifulSoup/python-beautifulsoup-exercise-11.php @tidal bronze
Python BeautifulSoup Exercises, Practice and Solution: Write a Python program to a list of all the h1, h2, h3 tags from the webpage python.org.
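On the .a question: find_all returns a list-like ResultSet, so .a has to be applied to each h3 rather than to the whole result. A minimal sketch, assuming the links are direct <a> children of h3.entity-title; the HTML snippet is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the question.
html = """
<h3 class="entity-title"><a href="/first">First</a></h3>
<h3 class="entity-title"><a href="/second">Second</a></h3>
<h3 class="entity-title">No link here</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all gives back many tags; take .a on each one and skip headers without a link.
headers = soup.find_all("h3", {"class": "entity-title"}, limit=41)
all_links = [h3.a.get("href") for h3 in headers if h3.a is not None]
print(all_links)
```

A CSS selector such as soup.select("h3.entity-title a[href]") is another way to get the same tags in one call.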
Hi everyone. I'm trying to deploy a flask app with my custom BERT keras model which takes tweets as input. The model runs perfectly by itself but whenever I try to make the model.predict() function call within the flask app, it always results in the flask app being terminated. Any help/suggestions would be appreciated. Thanks!
I specifically need help understanding why pycache is reloading and how can I prevent that
@summer holly one thing you could try is to execute flask run --no-reload and see if it solves the problem or specify use_reloader=False in the app.run() argument
Hi, I'm starting in Data Science and First I'm studying CSV
I'm trying to edit a CSV and it always gets messy when I use to_csv
what gets messy?
Before
use pathlib to get the path of the object btw
After
@lapis sequoia Ok, I'm using Visual Studio, can this influence?
the IDE doesn't influence the libraries you import
This problem is not related to what you're asking btw.
I'm still not sure what the problem is
because I don't know how the .csv looks like, how you delimit the rows and columns etc.
maybe use the same encoding utf-8
I have multiple columns before importing
I import the file, convert it to a dataframe and after saving it it messes everything up in just two columns
it's likely something to do with the delimiter setting, change it to delimiter=" " or delimiter="\t" and see if that helps
and use pandas.read_csv()
Like this:
arquivo = open('c:\Users\Pichau\Downloads\caso_full.csv', encoding="utf-8", delimiter="\t")?
arquivo = pd.read_csv(r'c:\Users\Pichau\Downloads\caso_full.csv')  # raw string, otherwise \U is read as an escape
try this first
then save it using arquivo.to_csv() and see if the columns are preserved
ok
if not then try pd.read_csv('...', delimiter=' ') and arquivo.to_csv('...', sep=' ')
@lapis sequoia
--no-reload worked. Thanks alot!
hey guys how do you invite people to this server
thank you @bitter harbor
Hey guys, I'm using matplotlib for an image processing assignment, and we're supposed to write a function called chromeKey() which, as you can guess, removes a green-screen background. Here's my code.
I have no idea what to do when I'm calling my function in order to combine the two images
I don't really expect an answer I just need help
I’m not sure what your question is
The usual way to do this is to take a linear combination of the two images at each pixel, with the weight being a function of the pixel data at that point and possibly its neighbors to smooth the edges
It looks like right now you are simply choosing one or the other based on the intensity of the color channel
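To make the linear-combination idea concrete, here is a rough sketch. It is not the assignment's chromeKey(); the function name, the softness parameter, and the "green minus the larger of red and blue" measure are all assumptions, and it works per pixel only (no neighbor smoothing).

```python
import numpy as np

def chroma_key(foreground, background, softness=0.25):
    """Blend two float RGB images of shape (H, W, 3) with values in [0, 1]."""
    r, g, b = foreground[..., 0], foreground[..., 1], foreground[..., 2]
    # How much greener a pixel is than its other two channels.
    greenness = g - np.maximum(r, b)
    # alpha = 1 keeps the foreground, alpha = 0 shows the background;
    # values in between give a soft edge instead of a hard either/or switch.
    alpha = 1.0 - np.clip(greenness / softness, 0.0, 1.0)
    alpha = alpha[..., np.newaxis]
    return alpha * foreground + (1.0 - alpha) * background

# One pure-green pixel and one red pixel over an all-blue background.
foreground = np.array([[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]])
background = np.zeros_like(foreground)
background[..., 2] = 1.0
result = chroma_key(foreground, background)
print(result)
```

Images loaded as uint8 (0 to 255) would need dividing by 255 first, and both images have to share the same shape.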
Hey, does anyone know how to load a bunch of random seaborn subplots into one plot?
@steel talon hey you have arguments in your chromekey function which are not optional so if you just call the function you eventually get error
hi
hello
What exactly is data science?
Data science is a field that uses scientific methods, algorithms, systems, etc. to extract knowledge from structured and unstructured data. (Big data)
Something that's pretty strange to me is that the recommended way to preprocess text in Deep Learning with Python for multi-class classification is to do one-hot on the encoded text. Isn't that what the Embedding layer is supposed to do?
I just loaded a dataset using read_csv in pandas however I realised there is '?' instead of null how do i get rid of all the rows that contain '?'
df = df[(df.T != '?').all()] should work
what is T?
transpose
its alright now I tried some from stackoverflow but finally found one that works
data = data.replace('?', np.nan)
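Replacing '?' only marks the cells as missing; the rows still have to be dropped afterwards. A small sketch of two ways to do both steps, using a made-up three-column CSV in place of the real dataset:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Invented sample; rows B and C each contain a '?'.
csv_text = "name,hp,attack\nA,45,49\nB,?,62\nC,80,?\nD,39,52\n"

# Option 1: treat '?' as missing while reading, then drop incomplete rows.
data = pd.read_csv(StringIO(csv_text), na_values="?").dropna()

# Option 2: replace after loading, then drop.
raw = pd.read_csv(StringIO(csv_text))
cleaned = raw.replace("?", np.nan).dropna()

print(data)
print(cleaned)
```

Option 1 has the side benefit that the affected columns are parsed as numbers, whereas with option 2 they stay as text until converted.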
am I getting this because I am not printing it?
@undone flare what is the question?
but this looks cool
if you plot something and do not show() jupyter will display image anyway
if I print it is messy
but this table form looks sick
idk
I just downloaded this dataset
idk anything about pokemon lol
just learning pandas
👍
Anyone here have experience with openpyxl?
Linear Algebra, Stats, Prob, Calculus
in which grade are these taught?
not in my grade
which grade
i am in 5th grade
oh then yea
yea
It is too soon for you to learn ML
i know
bcuz you might not understand high level math
o dont worry...i already know most high level math
huh..
my elder sister tutors me after school
do you know matrix multiplication
no ;-;
bcuz that's the first thing in Linear Algebra
ok
do u have any tips for ML?
(Keras) Input, TextVectorization, and Embedding layers are a bit difficult to wrap my head around. Every time I feel like I get it, something throws a wrench in my mental model and I have to start over.
@gritty wedge Python for DS/ML Bootcamp by Jose Portilla on Udemy
Yea that's what it's meant for
The only reason I'd highly suggest you learn what's going on behind the scenes is that a lot of concepts (universal across ML) can't be fully understood without knowing what's actually going on
oh yeah what @bitter harbor said too
If you want to have a look at the 'basics' of ML, I'd highly recommend watching 3b1b's series on the topics
That'd be a good course too
import pandas as pd
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv("iowa_file_path")
# Call line below with no argument to check that you've loaded the data correctly
step_1.check()
does anyone see what's wrong with that
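A likely culprit in the snippet above is the quotes: "iowa_file_path" in quotes is a string, so pandas looks for a file literally called iowa_file_path instead of using the variable. A sketch with a made-up temp file standing in for the course's train.csv:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for '../input/home-data-for-ml-course/train.csv'.
iowa_file_path = os.path.join(tempfile.mkdtemp(), "train.csv")
pd.DataFrame({"LotArea": [8450, 9600], "YearBuilt": [2003, 1976]}).to_csv(
    iowa_file_path, index=False
)

# Quoted: pandas is asked for a file literally named "iowa_file_path".
try:
    pd.read_csv("iowa_file_path")
except FileNotFoundError as error:
    print("quoted name failed:", error)

# Unquoted: the variable's value (the real path) is passed.
home_data = pd.read_csv(iowa_file_path)
print(home_data.head())
```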
Not to discourage you but ml's got quite a few layers that you should learn before jumping in, that course will help with python implementations of the basic mechanics
any course recommendations then?
Na there're quite a few ways to learn it, I learnt what I know about it through 3b1b
who is 3b1b?
It's pretty heavy tho regardless
he's a youtuber
oh
3 blue 1 brown
o lol....among us memes 
Na lol he's been around for a bit longer than the game :)
lol
thnx 🙂 .....i cant take u seriously with that profile pic....no offence
it's just tony stark lmao
but it looks funny too
that's the point
lmao
What I'm trying to figure out these days is how exactly these model constants are determined. I suppose I just need to find more tutorials about multiclassification models for things like the Reuters dataset.
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = home_data["LotArea"].mean()
#print(avg_lot_size)
# As of today, how old is the newest home (current year - the date in which it was built)
newest_home_age = home_data["YearBuilt"].mean()
# Checks your answers
step_2.check()
Incorrect: Incorrect value for avg_lot_size: 10516.828082191782
how is lot area not the lot size
Kaggle is so stupid
Would you not want the standard deviation of the lot sizes?
Also I'm assuming you'd want some form of Sig digs
AttributeError: 'Series' object has no attribute 'stdev'
Do you have to call it in 1 line?
statistics.stdev(dataset)
i have to import statistics too right
Yea
TypeError: can't convert type 'str' to numerator/denominator
there's strings in the dataset
dataset["lot_size"]
How did you call the mean then?
avg_lot_size = home_data["LotArea"].mean()
I got that part but how did you call it if there's a string
oh i meant when you do stdev the whole dataset there's strings in the dataset
Oh ya you're only doing it to the lot size
statistics.stdev(dataset, xbar)
That's with both args
Xbar being the mean
i don't think they're taking a standard dev
:(
i looked up the answers
and they just want me to round up the 10516.828082191782
idek how to do that
i tried calling .round()
To how many decimal points
Also this is a prime example of where you should use stdev why is that wrong
Or is it "too advanced"
#bot-commands message
casting to an int drops the decimal part instead of rounding, so be careful with that
thanks
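For reference, a quick sketch of the difference between rounding and casting, using the mean that was printed earlier:

```python
avg = 10516.828082191782

rounded = round(avg)   # nearest integer
truncated = int(avg)   # fractional part is simply dropped

print(rounded, truncated)
```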
newest_home_age = 2020 -(home_data["YearBuilt"].mean())
the correct answer is 8
8
what're you getting?
48.73219178082195
niccce
Are you supposed to take the mean of the year built or the subtraction?
the directions say current year - the date in which it was built
ohhhh
nope still wrong
On the other hand nice, only +/- 41.
what does home_data["YearBuilt"].mean() return?
1971.267808219178
humor me and try with stdev
AttributeError: 'Series' object has no attribute 'stdev'
you mean home_data["YearBuilt"].stdev() right
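About the AttributeError: as far as I know the pandas method is called .std(), not .stdev(); stdev is the name in the standard library's statistics module. A small sketch with made-up years showing that both give the same sample standard deviation:

```python
import statistics

import pandas as pd

# Invented values; only the column name matches the course data.
home_data = pd.DataFrame({"YearBuilt": [1872, 1950, 2003, 2010]})

pandas_std = home_data["YearBuilt"].std()                        # sample std (ddof=1)
stdlib_std = statistics.stdev(home_data["YearBuilt"].tolist())   # also sample std

print(pandas_std, stdlib_std)
```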
I mean we're getting closer ¯_(ツ)_/¯
lmao i'm sorry
I thought this would be easy and I would get it done in like 2-3 days
it's a mini course lmao
it's all good lol this takes time, are there any outliers you're expected to clean?
no they didn't ask me to
there are some columns i'd drop
but they didn't ask so
weird idk sorry
it's ok
i have the answers so
lmao the correct answer was 10
how tf they be getting 10
i feel like i'm in math class rn
kaggle is lame
@hollow sentinel What's the min? like home_data['YearBuilt'].min()
1872 @heady hatch
Oh what about the max?
2010
that'd do it
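Putting the thread together: the newest home is the one with the largest YearBuilt, so its age is the current year minus the max, not the mean. A sketch with three invented rows (only the 2010 max and the 2020 year come from the chat):

```python
import pandas as pd

# Made-up rows; the real data comes from the course's train.csv.
home_data = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250], "YearBuilt": [1872, 1971, 2010]}
)

current_year = 2020  # the year used in the conversation

# Average lot size, rounded to the nearest integer.
avg_lot_size = round(home_data["LotArea"].mean())

# Age of the newest home: current year minus the latest build year.
newest_home_age = current_year - home_data["YearBuilt"].max()

print(avg_lot_size, newest_home_age)
```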
More or less, if I would want to compute the Tf-IDF vectorizer for 12 GBs of pdfs, how much time will that take? should I consider cloud computing?
If you have the resource, I would go cloud.
But tfidf can also be done on regular machine itself. Probably need to use a generator instead if you don't have the memory.
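A minimal sketch of the generator idea, assuming scikit-learn's TfidfVectorizer and assuming the PDF text has already been extracted somehow. iter_documents is a hypothetical loader; a real one would read and yield one file's text at a time so the raw text never sits in memory all at once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def iter_documents(texts):
    """Hypothetical loader: yield one document's text at a time."""
    for text in texts:
        yield text

# Tiny stand-in corpus; real input would come from the extracted PDFs.
sample_texts = ["the cat sat", "the dog barked", "cats and dogs"]

vectorizer = TfidfVectorizer()
# fit_transform consumes the iterable once and returns a sparse matrix.
tfidf = vectorizer.fit_transform(iter_documents(sample_texts))

print(tfidf.shape)
```

The vocabulary and the sparse matrix still have to fit in memory, so for 12 GB of PDFs options like max_features or min_df may be worth a look.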
https://stackoverflow.com/questions/64686302/using-pickle-object-of-model-to-predict-output
Can anyone help me with this?
@hollow sierra what do you need help with exactly?
I have built an ML model and exported it to a pickle file. Now I want to use it in a web app to make predictions. I want to use Node.js in the web app, so is it possible to use this pickle model in a JavaScript environment?
@heady hatch
Right right.
What kind of ML model is it? and from what library? Or was it written with native Python?
So you have couple options here.
- build Python microservice instead of going straight into NodeJS
- https://github.com/nok/sklearn-porter
- https://www.npmjs.com/package/scikit-learn
There are also couple pickle converters.
Automatically exported from code.google.com/p/pickle-js - sciyoshi/pickle-js
But pickle is a finicky object.
You can try to convert it first and see how it goes.
If not then I would try one of the options above.
Ok thanks @heady hatch. I have seen articles and tutorials on deploying pickled ML models, and Flask was used every time. Is it easier or recommended to use a Python-based web library when you have Python ML models?
It's recommended to use Python because of the consistency, pickling is weird because it takes environment into account.
I don't know how Python environment will work transitioning to non-Python environments.
Plus you don't have a language switch.
You don't necessarily need Python for the full backend.
you can set ML up as a microservice
and have your backend call the ML API.
Have you deployed ML models? If yes, how?
Are you asking if I have deployed?
yes
Those articles and tutorials should serve as a good entry point.
You can create a simple backend with Flask and serve it with gunicorn (uvicorn is the equivalent for ASGI frameworks such as FastAPI).
load ml model and set prediction as an api endpoint.
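Roughly what that could look like, as a sketch rather than a production setup. The /predict route, the JSON shape, and DummyModel are all assumptions; in practice the model would be loaded once at startup (for example with pickle.load) and passed to create_app.

```python
from flask import Flask, jsonify, request

def create_app(model):
    """Build a Flask app that exposes model.predict as a JSON endpoint."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        # Assumes one sample per request and a numeric prediction.
        prediction = model.predict([payload["features"]])
        return jsonify({"prediction": [float(value) for value in prediction]})

    return app

class DummyModel:
    """Stand-in for an unpickled model; only needs a predict() method."""
    def predict(self, rows):
        return [sum(row) for row in rows]

app = create_app(DummyModel())
```

If this lived in a file called, say, service.py (a hypothetical name), gunicorn could serve it with gunicorn service:app, and the Node.js backend would just POST JSON to /predict.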
I got your point and it cleared some doubts of mine. Thanks @heady hatch for your time and help.
Sanity check: if different approaches to how we structure our layers yield the same accuracy results, then doesn't that imply that our dataset isn't good enough in quality to be able to predict what you're trying to do?
hi
looking for some feedback of how dumb it would be to use a function like this to shrink a df's memory footprint
def auto_cats(df):
    for col in df.columns:
        curr_usage = df[col].memory_usage(deep=True)
        if curr_usage > df[col].astype('category').memory_usage(deep=True):
            df[col] = df[col].astype('category')
    return df.info(memory_usage='deep')
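Some feedback in code form, as one possible variant rather than the right answer: the function above mutates its input, returns None (df.info() only prints), and may turn numeric columns into categories, which changes how arithmetic and comparisons behave. The sketch below keeps the same memory comparison but only touches string-like columns and returns a new frame; restricting to string columns is my assumption about what is wanted.

```python
import pandas as pd

def auto_cats(df):
    """Return a copy where string columns are stored as 'category' if that is smaller."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_string_dtype(out[col]):
            continue  # leave numeric, datetime, etc. columns alone
        as_category = out[col].astype("category")
        if as_category.memory_usage(deep=True) < out[col].memory_usage(deep=True):
            out[col] = as_category
    return out

# Made-up frame: 'city' repeats a lot, 'id' is unique per row.
df = pd.DataFrame(
    {
        "city": ["a", "b"] * 50,
        "id": [str(i) for i in range(100)],
        "n": range(100),
    }
)
smaller = auto_cats(df)
print(smaller.dtypes)
```

Calling smaller.info(memory_usage='deep') afterwards gives the same report the original function printed.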
Hey guys, I was wondering if anyone here has some half decent experience with using seaborn / pandas
I'm trying to complete some tasks but I'm unsure if the way I'm representing the data is the best way? Even I'm not really understanding the graphs that this spits out - would've thought they'd have to be somewhat interpretable
For example
I'm quite new to both of these libraries so just figuring things out - any advice, ideas, anything really would be greatly appreciated
@jade lava Hey, just saw your reaction now for my openpyxl question. Does it take long to create a code to automate reports/tasks? I have weekly reports I have to send out and it's a headache to go through the process of having to clean them up, same repetitive task for the most part.
Not enough info to answer
@jade lava What info do you need?
You have to use VBA for macros, but you can read and write Excel documents, sure...
Not asking about VBA
# from _ import _
from sklearn.tree import DecisionTreeRegressor
#specify the model.
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor()
# Fit the model
iowa_model.fit(X,y)
# Check your answer
step_3.check()
so I have this
Incorrect: You forgot to set the random_state.
sike i figured it out sorry
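For anyone hitting the same check: the fix is presumably just passing random_state when creating the model. A tiny sketch with made-up X and y; the value 1 is only an example, any fixed integer makes the result reproducible.

```python
from sklearn.tree import DecisionTreeRegressor

# Invented toy data standing in for the course's X and y.
X = [[1], [2], [3], [4]]
y = [10.0, 20.0, 30.0, 40.0]

# A fixed random_state makes repeated fits give the same tree.
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(X, y)

print(iowa_model.predict([[2]]))
```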
Hello
I am new to data science but interested
Anyone have any advice on how to start or where to go?
Do you like books or courses?
Well, I would prefer courses but books are fine
I have good experience with python itself already, I am just new to the topic of data science
hmm in terms of courses, I'm not too sure about MOOC. But I've heard people liking Data Camp.
But I'm sure there are tons of MOOC.
pinging @hollow sentinel , they've had lots of experience there.
Hi, thanks a lot, I will try them out. Any good books too?
I don't like Data Camp, it's like more of fill in the blank type
I think this is my go to for intro to data science with Python.
You're implementing algorithms from scratch.
Brilliant, thanks a lot
I’d recommend python for data science & machine learning bootcamp
and Kaggle mini courses
hello does anyone know well about Matplotlib in python?
thanks @heady hatch for the book rec bc this looks really good
i might do this over the stanford ML course
can someone help me out with pandas
The following code does save and add a new row into the file
but when I rerun the program, it does not create a new row; it just replaces the row of data that was created previously
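Without seeing the code this is a guess, but a common cause is that to_csv writes in overwrite mode by default, so every run starts the file from scratch. A sketch of appending instead; the file path, column names, and the append_row helper are all made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical output file.
csv_path = os.path.join(tempfile.mkdtemp(), "log.csv")

def append_row(row, path=csv_path):
    """Append one row, writing the header only when the file is first created."""
    frame = pd.DataFrame([row])
    write_header = not os.path.exists(path)
    frame.to_csv(path, mode="a", header=write_header, index=False)

# Two calls simulate two separate runs of the program.
append_row({"name": "a", "score": 1})
append_row({"name": "b", "score": 2})

saved = pd.read_csv(csv_path)
print(saved)
```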
Can anyone help me with Matplotlib
whats your question
Can we search for index by values in pandas DataFrame?
ie given a dataframe
a b
1 0
3 2
5 4
get the location of value 0.
I'm also curious how would duplicate values work.
From my brief search on Google it seems that you're supposed to convert it into np.ndarray and then use np.ndarray.nonzero to get the indices you're interested in
I'm a bit confused. How would I use nonzero to search?
in your example you would first get the column 'b' from your DataFrame, then convert it to NumPy
afterwards you can create a boolean mask and get nonzero of that mask
Oh but how would I know what column it is in?
if you look at the docs for nonzero you'd understand
Maybe I'm misunderstanding something.
you would use nonzero after getting the b column, right?
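One way to do the search without knowing the column in advance, sketched under the assumption that the whole frame can be compared at once: build a boolean mask over every cell and let np.nonzero return the row and column positions of the matches.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"a": [1, 3, 5], "b": [0, 2, 4]})

# True wherever a cell equals the value being searched for.
mask = (df == 0).to_numpy()

# nonzero returns one array of row positions and one of column positions.
row_pos, col_pos = np.nonzero(mask)

# Translate positions back into index and column labels.
locations = [(df.index[r], df.columns[c]) for r, c in zip(row_pos, col_pos)]
print(locations)
```

With duplicate values, each matching cell simply shows up as its own (row label, column name) pair in the list.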
