#data-science-and-ml
i'm sorry you are under such pressure. the market is hot, but not hot enough to warrant making yourself insane for it. tell your family that a professional data scientist said that
the market isn't going anywhere. there might be a slump if companies start folding and/or laying people off after the current hype wave
but that will be short-lived
Seconded by another professional data scientist.
data strategy is becoming a necessity for pretty much all medium+-sized companies, and data science is not going to be automated away for at least a decade
im not worried about killing the project
probably more like 5 decades lol
i want to learn
i just want to finish this fucking project and have SOMETHING that says hey i tried
great
i can share my last project
so, what do you currently know?
if that could show insights to what im capable of better than me explaining it
sure, that might give us a starting point
which i got an ASS fucking score on
but you should also explain anything you've learned between now and then
I've learned EDA functions with pandas and seaborn for visualizing concepts
ill link you my notebook one second...
Hey @lapis sequoia!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
idk how to get this to yall through here
its an ipynb
shit like this basically
nothing special. just what I have time to put together
that was my first project
Yes ill link SS of it
and it sucks too because
ill be in "class"
and my professor will be trying to correct his own errors for 30+ minutes
yikes
that's really bad
i would definitely attempt to ask for your money back somehow
maybe go through your credit card company or bank if you have to
that's... not worth your time to suffer though, and certainly not your money
i'm sorry this was a bad experience
so do you know what "linear regression" is?
if it makes you feel better, i had no idea what "differential pricing" meant, and i had to look it up (as per the problem statement)
my bad. wrong one on top
Yes i just have trouble executing the approaches they outline
I sincerely wish i had more time under my belt
its a quantitative representation of two variables relationship
Such as brand name and price
or mileage and price
etc
idk how to generate something to predict that though..
well i think you need to start by understanding the feedback
unfortunately a lot of topics in data science heavily depend on prior topics
i know it's difficult because you are behind in the class
the "better conclusions" comment is interesting - this suggests that you might also be struggling to interpret the stuff you do know how to produce
so let's start at the very very beginning. you know what a mean (aka average) is, right? median? standard deviation?
of course
spread from the average
high Standard of deviation means more outliers if im not mistaken
high standard deviation means higher average spread
low standard is more consolidated around the mean no
maybe that means outliers, or maybe it means that the data is just spread out
but ok, you know that much. good
and do you know what a "mode" is?
do you know what a "frequency table" is?
not entirely sure what a freq table is
i know what the mode is
most occurring value
I see that it can be one or the other
yep
a frequency table is just a table of how often each value appears in the data
so if you have the data H H H T T H T H H T of coin flips, then the frequencies are H: 6 and T: 4
in pandas you would invoke the .value_counts() method to compute this on a column in a dataframe
you might have seen that one
havent.. im most familiar w
note that this only makes sense for categorical variables
oh i see what you mean there
it makes no sense to compute a frequency table for a continuous variable
you'd get all counts of 1, for the most part
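A minimal sketch of the coin-flip frequency table described above, assuming pandas is installed. The Series name `flips` is made up for illustration:

```python
import pandas as pd

# The ten coin flips from the example above
flips = pd.Series(list("HHHTTHTHHT"), name="flip")

# Frequency table: how often each value appears in the column
counts = flips.value_counts()
print(counts)
```

Calling it on one column at a time keeps the output short and readable.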
but you can extend this to two categorical variables, and this is known as a "cross-tabulation", or "crosstab" for short, and it is implemented with pandas.crosstab
okay
So
i just implemented this in my last project
data.value_counts()
just taking a look at it right quick
good, so you are familiar with that much
and I see
i believe this is what they were asking for when they asked for the "proportion of customers" across different categories
with my repeated values such as
for example i have states that sales are made from
and those values have the 1
what do you mean by "those values have the 1"?
ah, you called it on the entire dataframe
that counts each unique combination of values across all the columns at once (i.e. unique rows), not each column separately
as you can see it's quite messy, and the ... means that it's cutting off rows in the middle
also, does it make sense to compute .value_counts() on order id?
it's not a continuous variable
it is clearly categorical
but does it make sense to measure the counts of order ids?
probably not, unless you are expecting lots of repeated order ids!
do you see why that is the case?
also in general it's much more useful to look at counts of individual variables: each variable is worth examining on its own before trying to examine their relationships
Im wrapping my head around your comments
and let me say; I appreciate immensely your help
this is what actual professional data scientists do btw, i am not dumbing anything down for you. this is literally how i start every project
it's cutting off rows to avoid dumping a huge amount of data
i see what you mean about order id
good!
now there are cases when you are interested in the counts of order ids
for example, you are given a dataset by some finance person and they claim that every row has a unique order id
the first thing you should do with that dataset, is verify that assertion
because people make mistakes
in that case, you are checking for the counts of order ids to assert that they all are indeed 1
otherwise there is a problem w/ the dataset and you need to either 1) make a decision about how to deduplicate rows, or 2) send the data back and tell them to fix it
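A rough sketch of that uniqueness check, assuming pandas. The `orders` table and its column names are hypothetical, not the actual dataset from the project:

```python
import pandas as pd

# Hypothetical orders table; "order_id" and "state" are assumed column names
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 103, 104],
    "state": ["PA", "OH", "PA", "PA", "NY"],
})

# True only if every order id appears exactly once
ids_are_unique = orders["order_id"].is_unique

# Every row whose order id appears more than once, for inspection
repeated = orders[orders["order_id"].duplicated(keep=False)]

print(ids_are_unique)
print(repeated)
```

If `repeated` is not empty, that is the point where you either decide how to deduplicate or send the data back.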
de-duplication is a big serious topic in applied data science and i will assume for the most part that you won't have to do it (and you don't have time to worry about it)
but i hope it's clear why order id isn't usually useful for data analysis, in this intro-level scenario
so let's pause. given what i said above, how could you improve that one output you generated?
give me a couple of ideas
and yes i am happy to help, i happen to have a few minutes of free time and i get really frustrated when i see that people got sucked into stuff like this and are struggling
sure, take your time
I dont feel incapable of doing or learning these things
Its just
ends gotta be met and I'm stressed as fuck. but Im following - give me a sec to
confused by what you are saying here
sure, you definitely seem capable under less-bad circumstances. but even smart people have limits of what they can do with limited time. so don't feel bad that you're struggling
need to understand why 1 is what it is in this situation
i generally think that most people are capable of learning most things, at least conceptually
precisely!
but if its not - im combing it to ensure that they are indeed unique
if theyre repeated then we have an error
right
bc each order should be its own
if any order id is repeated, then it will have a count of >1
right, but i hope you also see why this is different from "data analysis" as such. this is more like checking that the data is "clean" before trying to work with it
exactly
and therefore why you should treat it separately from variables that you intend to use in the model
heard and understood
its the opposite of scanning missing values almost
just looking for repeats where they shouldnt be
yep! and great, i'm glad you know enough to look for missing values
Im not a complete idiot I just feel like one because
i paid out my ass for this
i feel like im disappointing myself and others when im just trying to make shit work out
i basically spent 2021-2022 so far paying out the ass for things that ended up being kind of a bust, or overpaying for things that i should have paid less for
it sucks
if i didnt care i wouldnt be embarrassing myself on discord
you are not embarrassing yourself and you shouldn't be embarrassed
taking a course and working a full time job is hard enough
it's worse when the course is clearly bad and you are unsupported as a student
I work for a voting machine company and
and the fact that it's disgustingly expensive is adding insult to injury. i feel you, and no you shouldn't feel bad about asking for help
basically its crunch time 100% of the time
that sucks too. nobody should have to work like that
im really hoping tomorrow we have the day off bc
im in pittsburgh rn but we have a huge winter storm coming in
praying the county closes the warehouse so i can actually work on this!
otherwise ima be doing other shit from 6am to 5pm
and lemme tell you trying to do this shit when ur dead exhausted is
can you take a sick day? don't even get me started on how awful us labor laws are...
I could but itd compromise my future work
well that's fucked up in and of itself
they flew me out here for two weeks
so im staying in a hotel and riding airfares on their tab
ah i see, you're onsite somewhere
that's rough for sure
and it seems like you work enough hours that even applying for other jobs is a big chore
ok, so let me try to help you a bit more and at least you can think about this stuff tomorrow a little, even if you can't get hands on
the assignment asked for proportions, i.e. the fraction of data points, which you can always convert to a percentage. you can easily get a % from a count by dividing by the total number of data points. so in the heads/tails example above, you have 10 data points, and therefore you have proportions 60% H and 40% T
6/10 and 4/10 are 0.6 and 0.4, i.e. 60% and 40%
counts and proportions are basically equivalent, but humans are generally bad at numbers so i like to present both when it isn't cumbersome to present both
e.g. "our experiment ran 10 times, and we found 6 heads (60%) and 4 tails (40%)"
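The count-to-proportion step can be sketched like this, assuming pandas. `normalize=True` is the `value_counts` option that divides each count by the total:

```python
import pandas as pd

# Same ten coin flips as in the heads/tails example
flips = pd.Series(list("HHHTTHTHHT"))

counts = flips.value_counts()                     # raw counts
proportions = flips.value_counts(normalize=True)  # counts divided by the total

# Present counts and percentages side by side
summary = pd.DataFrame({"count": counts, "percent": proportions * 100})
print(summary)
```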
I see
and this is what i was going to say before, about extending a frequency table to two categorical variables:
let's say you are looking at people's clothing, and you are writing down 2 binary variables for each person. (binary means just "yes" and "no", which are represented in python as True and False or 1 and 0, depending on the situation). the two variables in this case are "is the person wearing boots?" and "is the person carrying an umbrella?" so the data might look like this:
boots?  umbrella?
True    False
True    True
False   False
False   True
True    True
False   False
True    False
True    True
False   False
so the crosstab of boots? and umbrella? would look like this:
         umbrella
boots    True  False
True        3      2
False       1      3
that's 9 data points, so you should confirm that the sum of all the numbers in the crosstab is indeed 9
of course, if you have more than 2 categories in each variable, the crosstab will have more rows or columns
and if you have a lot of categories, cross tabs and frequency tables start to get a bit messy and hard to read, in which case you would fall back to other techniques that you probably don't need to worry about right now
and of course you can compute proportions for a crosstab too, e.g.:
         umbrella
boots    True  False
True      33%    22%
False     11%    33%
which should add up (approximately) to 100 (in this case it adds up to 99 because of rounding)
the crosstab is a "bivariate" analysis, meaning "two variables". whereas the frequency table for a single variable is called "univariate", meaning "one variable".
another name for a crosstab that you might see in statistics is a "contingency table"
and what's really interesting is that you can recover the frequency table for each variable individually from the crosstab!
         umbrella
boots    True  False | Total
True        3      2 |     5
False       1      3 |     4
---------------------+------
Total       4      5 |     9
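For reference, a sketch of the boots/umbrella crosstab with `pandas.crosstab`, assuming pandas is installed. The dataframe name `people` is made up; the nine rows are the ones listed above:

```python
import pandas as pd

# The nine observations from the boots/umbrella example
people = pd.DataFrame({
    "boots":    [True, True, False, False, True, False, True, True, False],
    "umbrella": [False, True, False, True, True, False, False, True, False],
})

# Counts, plus row and column totals (the two univariate frequency tables)
counts = pd.crosstab(people["boots"], people["umbrella"],
                     margins=True, margins_name="Total")

# Each cell as a share of all nine data points
shares = pd.crosstab(people["boots"], people["umbrella"], normalize="all")

print(counts)
print(shares.round(2))
```

pandas sorts the labels, so False comes before True in both the rows and the columns; the numbers are the same as in the hand-written tables, just in a different order.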
maybe just ruminate on that for a while
ruminating
feel free to @ me with questions, i need to work on something for a bit
salt rock lamp is so fucking good
is real analysis important for ml
i googled it and it says it's not that important, but idk
also did not know what real analysis was before you asked
sounds like proof-based stuff
Yes and no. It might come up, but its benefits are more indirect (but not insignificant). It will teach you how to think and make you more comfortable with mathematics in general. An important skill to have when reading / understanding others' work. It also depends on what you consider to be part of "real analysis".
(If you want to really understand how probability works (which can come up if you are doing very experimental ML), then you need it)
i cant believe you saved this 😆
Where is best to get started with data science?
Hope someone mentors me from here
So while it's not directly needed, I would still recommend it, just for getting into the right head space.
(Also it's fun, if you like math)
wait wat degree did u guys have
before going data science
or mle
im thinking of doing either
im from australia
It depends on what your background is. Give people here more information about your training and degree/major, and someone will respond.
search for O’REILLY
You should find many books
look for the 2019-2021 ones
Data Science is an interdisciplinary field so a lotta people in this field started off from different backgrounds. I have a friend who has a major in Fishery but he's working as a Data Scientist now 😀
In essence, you might as well study Human Kinetics and Sports Education and still end up working as a Data Scientist if you put in the much needed work to learn it.
Notwithstanding, going for a major in Mathematics, Computer Science or Statistics will definitely offer you more options and give you an edge over others.
Hey guys,
I'm trying to use the GPT-3 question answering function.
Does anyone have a clue how to use it?
For example, I want to create a bot that acts like a real human with a consistent personality, so if someone asks its age, it always gives the same age, just phrased in a different way
Im stats cs
how much do most data scientists make in aud
stats is hard
im doing thesis
I have tried installing the chess module using pip. It said dependency satisfied, but when I use any of its functions, it does not work?
please help someone
Hey, i am having a weird problem with training a NN. When it is about 3/4 of the way through training, the loss suddenly gets the value nan
at around the point of the red square. before that it has normal numerical value
any idea why this happens?
Has any of you any experience in combining genetic programming with ML?
I'm not sure I understand what 'lm' means though. I have a major in Statistics. And I believe it's not really that hard.
I enjoyed Stats more than Math. I even picked more CS electives than Math electives when I was in school.
Goodluck on your thesis ✌️
i am stats is hard
r u australian
how is stats not hard theres so much proof writing
i struggle in bayesian the most
how bout stochastic calculus the list goes on
Go to your windows command prompt and ensure the library you installed is pointing to the right PATH.
Also, check if you're working in the same environment where the library was installed
No. I'm Nigerian
how do I check where the library is installed
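One way to check this from inside Python, as a sketch using only the standard library. The helper name `where_is` is made up for illustration:

```python
import importlib.util
import sys

def where_is(module_name):
    """Return the file a module would be imported from, or None if it can't be found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None

# The interpreter actually running this code; pip has to install into this one
print(sys.executable)

# "json" ships with Python, so it is always found; swap in "chess" to check that install
print(where_is("json"))
print(where_is("surely_not_installed_module_xyz"))
```

If the module comes back as None, installing with `python -m pip install chess` (using the same `python` you run your code with) makes sure pip and the interpreter match.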
just celebrated new year
It's probably an exploding gradient problem, since that is what usually pushes the loss to nan. Investigate further to rule out exploding or vanishing gradients
?
hmm okay. Do you know any resources that talk about this or do you know how i can find that out?
im struggling in titanic
To be honest I ain't gon lie, the proving part is what I love most (especially when it's going well) 😀
You could legit use up 3 sheets of paper to prove a stats equation. Of all the proofs I did I enjoyed Experimental Design, Confounding, and Gambler's Ruin class the most.
That's not to say there are no topics in Stats I don't enjoy. I particularly don't enjoy sample survey classes lol.
One of the ways to get past what you struggle with is to:
-
Make friends with the brilliant guys in your class, ask them to help you understand the concepts you struggle with.
-
Attend after-lecture tutorials (if such exist in your class)
In your command prompt, search for that chess library in maybe the scripts folder (could be different on your pc) and ensure it's loaded in the right directory where your python is installed.
There are different way to solve this problem tho. Check stackoverflow.com
Are you working on Jupyter Notebook? When you try importing this chess library in JNB do you get any error message?
I'm lazy to type now but do check this https://www.analyticsvidhya.com/blog/2021/06/the-challenge-of-vanishing-exploding-gradients-in-deep-neural-networks/
Thanks. Tho i was dumb and just realised i had some nan values in my data 🤦
thanks for your help tho
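A quick sanity check for this kind of problem, sketched with the standard library only. The `rows` data is hypothetical:

```python
import math

# Hypothetical feature rows; one value is missing (nan)
rows = [
    [0.5, 1.2, 3.3],
    [0.1, float("nan"), 2.0],
    [0.9, 0.4, 1.1],
]

# Indices of rows that contain at least one nan
bad_rows = [i for i, row in enumerate(rows) if any(math.isnan(v) for v in row)]
print(bad_rows)

# nan propagates through arithmetic, which is how one bad input can turn the loss into nan
print(sum(rows[1]))
```

With a pandas dataframe, `df.isna().sum()` gives the same information per column.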
tbh i find stats better than a cs degree
for ml
is ai/ml better for maths degrees
than cs degrees
Yeah i split it into train test and validation. The apple one is more reliable since it has more data and i trained it using the same parameters.
It isnt a forecast. Its a generated trading strategy
I used around 13 different indicators to get the signals
Here is the apple ticker used only on test data which wasnt used in the model
The accuracy when matching up with the 1s and 0s of the position signal method i used is around 72% on the unseen test data which is shown above
Tesla went public around 2010 where the graph started and ended in 2022 and the graph includes the training testing and validation data
Yeah but the reason I point that out is that the success of the strategy should not be measured by the return on your investment. as simply buying at the beginning and selling somewhere in the last few years would have a 100x+ return
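To make the buy-and-hold point concrete, here is a small sketch with made-up numbers. The prices, the positions, and both function names are hypothetical and not taken from the model being discussed:

```python
# Hypothetical prices and positions, for illustration only
prices = [100.0, 110.0, 105.0, 120.0]
positions = [1, 0, 1]  # 1 = hold the stock over that interval, 0 = stay in cash

def buy_and_hold_return(prices):
    """Return from buying at the first price and selling at the last."""
    return prices[-1] / prices[0] - 1

def strategy_return(prices, positions):
    """Compound the price change only over the intervals where a position is held."""
    growth = 1.0
    for i, held in enumerate(positions):
        if held:
            growth *= prices[i + 1] / prices[i]
    return growth - 1

baseline = buy_and_hold_return(prices)
strategy = strategy_return(prices, positions)
print(f"buy and hold: {baseline:.1%}, strategy: {strategy:.1%}, excess: {strategy - baseline:.1%}")
```

The number worth reporting is the excess over the baseline, since on a stock that mostly goes up almost any strategy shows a large raw return.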
And how does the image show the training and testing data?
Did you see the year roi and the strat year roi
That image is only on data that was unseen by the model
It wasnt used in validation either
how does the model perform on test data that does not increase heavily over a long time?
It outperforms by around 2 to 10%
I am trying to update the y data generation to get better results
It's just pretty hard to get reliable tests on these types of prediction models. It looks like a fun project though, but I wouldn't be so sure about the performance by testing it on a few test sets.
It's still impressive btw not trying to belittle you or anything
Thanks
I wouldn't know yet, ask the question first 😛
Please don't dm me. And think about why one would normalize data and if the amount of figures would be relevant.
Hello I am trying to combine the results of a couple masks to make another column in my dataframes indicating whether revenue is recurring or non recurring
```
pcn_mask = x['prod_code_name'].isin(gfs['prod_code_name']).any()
#print(pcn_mask)
pidmap_mask = x['PRODUCT_ID_MAP'].iloc[i].isin(gfs['PRODUCT_ID_MAP']).any()
#print(pidmap_mask)
sector_mask = x['Sector'].isin(gfs['Sector']).any()
#print(sector_mask)
relevant_product_codes = ["Usf33", "Usf34", "Us756", "Usf37", "Usf40", "Usf29"]
product_code_mask = gfs['Product_Code'].isin(relevant_product_codes).any()
#print(product_code_mask)
relevant_company_codes = ["Us05", "Us1b", "Usm6"]
company_code_mask = gfs['Company_Code'].isin(relevant_company_codes).any()
#print(company_code_mask)
all_masks = (
    (pcn_mask and pidmap_mask and sector_mask) and (product_code_mask or company_code_mask)
).all()
return all_masks
```
I keep getting the following error when calling the function “ ‘str’ object has no attribute ‘isin’ “
Could I use just ‘in’ and still get the same comparison?
Hello guys I have a doubt that
In the cost function of linear regression we are multiplying the SSE by 1/(2m).
So what's the use of doing 1/m ie. the average??
If I will not do this then what will happen??
thats a good question. the 1/m is there for averaging: without it the cost would grow with the size of the dataset, and taking 1/m removes that dependency and makes the math easier to work with
the 1/2 i believe is there to simplify the partial derivative: it cancels the 2 that comes down from the square. its a constant, so it scales the cost value but doesnt change where the minimum is.
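A small numeric sketch of both points, using a one-parameter model h(x) = theta * x with no intercept. That model and the data are simplifications assumed here for illustration:

```python
def cost(theta, xs, ys):
    """Squared-error cost with the 1/(2m) factor, for the model h(x) = theta * x."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient(theta, xs, ys):
    """Derivative of cost with respect to theta; the 2 from the square cancels the 1/2."""
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# Doubling the dataset leaves the averaged cost unchanged
print(cost(1.0, xs, ys), cost(1.0, xs * 2, ys * 2))

# The minimum is at theta = 2, where the gradient is zero
print(gradient(2.0, xs, ys))
```

Without the 1/m, the first line would print two different numbers for what is really the same fit.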
in apriori, which one do i value more. Confidence or lift?
Also, the full derivation shows where these constants come from, namely the Gaussian likelihood. You can check out the link below, or google "linear regression MAP derivation", which is an alternative to the traditional linear regression approach:
https://math.stackexchange.com/questions/884887/why-divide-by-2m
can I please get some help with this when someone gets a chance
what?
Someone asked if I could help with a question but didn't ask their question, wasn't in response to you
they removed their message
oh sorry for pinging then
oh nw lol
Hi, im having a problem. My code is supposed to highlight the nuclei of several blood cells of specific animals and give me their diameter and radius, which will be used later. The code works fine until it stops and ends at this error
heres the code itself
ive ran my code through cmd on admin but it still doesnt work, and it does run for a couple seconds before the code just stops working
any form of help is appreciated
So basically I m trying to work my way through projects on ml and ai
I did a few projects on them but the hands on projects usually don't bother with the mathematics and understanding level
Like for neural networks in tensorflow and keras they build models based on layers like dense and models like sequential
With optimisers like adam
So are there any courses which actually explain these things in some detail so that at least i can judge things by myself
And know when to use what and how to use them actually
You'd probably want to start with a basic course on linear algebra
and statistics
and after that try to use this information for understanding a machine learning/ neural networks course
"Data Science from Scratch" is a book that I recommend in that it goes over some of the fundamentals that PcCamel just mentioned
if you are a student, I would see if you can get the ebook through your library.
@mild dirge I think I do understand the basics of linear algebra and statistics plus a bit calculus but i need some good machine learning courses i guess
I did take andrew ngs coursera course on machine learning and it was quite great
He explained the mathematical ideas along with the implementation sections
@serene scaffold oh that's great I will try to get that book and go through it. Thanks!
it might actually not be advanced enough if you feel comfortable with the material in the andrew ng course
I'm not really sure what to suggest 
@serene scaffold it was difficult for me to implement those things in octave but the videos were good
Like any material which can explain things like neural nets in a similar fashion
please?
isin is a method of a pandas.Series, but you're using it on a string. Which line, exactly, is the one that causes the error? (Be sure to always share the whole error message, starting from Traceback, as that would answer the question.)
```
AttributeError                            Traceback (most recent call last)
C:\Users\PBWEWU~1\AppData\Local\Temp/ipykernel_9948/2090180030.py in <module>
      1 x = []
----> 2 x = get_rec_value(pipe_short)

C:\Users\PBWEWU~1\AppData\Local\Temp/ipykernel_9948/3247770990.py in get_rec_value(x)
      4
      5     for i in range(x.shape[0]):
----> 6         pcn_mask = x['prod_code_name'].iloc[i].isin(gfs['prod_code_name']).any()
      7         #gfs['prod_code_name'] == x['prod_code_name']
      8         print(pcn_mask)

AttributeError: 'str' object has no attribute 'isin'
```
In Resnets, do we really skip calculation of a[l+1] layer??
x['prod_code_name'].iloc[i] must be an individual string, whereas isin is a Series method, like I said.
You can probably accomplish what you're trying to do without any loops. What are x and gfs?
hey i'm having a very strange problem
i cannot graph simple data anymore on my jupyter notebook
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use("seaborn")
x = [5, 7, 8, 5, 6, 7, 9, 2, 3, 4, 4, 4, 2, 6, 3, 6, 8, 6, 4, 1]
y = [7, 4, 3, 9, 1, 3, 2, 5, 2, 4, 8, 7, 1, 6, 4, 9, 7, 7, 5, 1]
colors = [7, 5, 9, 7, 5, 7, 2, 5, 3, 7, 1, 2, 8, 1, 9, 2, 5, 6, 7, 5]
plt.scatter(x, y, s=100, c="green", edgecolor = "black", linewidth = 1, alpha=0.75)
plt.show()
i keep getting a "dead kernel" error
what can i do to fix this?
you have to restart the kernel.
the problem is unrelated to your code; the jupyter environment has stopped.
can i do that with restart and run all code?
bc i tried that
and it still wouldn't work
did you get a more substantial error message than "dead kernel"? or is that the only text that displayed?
"The kernel appears to have died. It will restart automatically."
this error just constantly pops up
shit, is it related to the millions of rows of data i was dealing w before for that internship?
ugh
did i melt my computer?
I googled that error message, and it appears that there's a few possible causes. try looking at them to see if any relate to something you're doing.
should i try hitting ctrl + C on the terminal, closing out of the conda environment, and then opening it up again?
if it has to do with having too much data, you would have gotten a memory error.
i see
that might fix it
I don't use conda.
i am beginning to not like conda too
i don't like how my code is in sep cells and i have to run the entire thing over and over
ik there is a restart and run all code thing
that's a jupyter thing, not a conda thing
oh
but yes, jupyter notebooks are overused among data scientists as well.
yeah i see why you dislike it now
I'm so proud 
i'm gonna do some more googling and figure out what's going on
i think i wanna switch to sublime text
by the way, if you want a nice environment for quickly testing stuff, but don't want the false sense of reproducibility that you get from jupyter notebooks, try python -m IPython
it's basically the regular python console, but with lots of quality-of-life features
python -m IPython in the terminal?
yes
i will check it out
if you have jupyter, you should already have it
jupyter is basically IPython but with a gui. and cells. ||and sadness||
also don't get angry but i have been using excel lately
use pandas.
ik ik ik
but for some reason it's like putting honey out for bees
for recruiters
yeah, I had excel on my resume
ok, so this is strange printing hello world in a notebook prints hello world
but I've never been asked to use it, so I'll probably delete it if I ever job hunt again
don't think you gotta job hunt for a while w that mitre job you got now
congrats on that btw
thx
ok, so strangely this particular notebook will not produce the intended behavior even tho a separate notebook with me printing hello world will
lemme see if i can just copy paste it into another notebook for funsies
f u n s i e s
ok, now i'm stumped
is there something wrong with the import statements?
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use("seaborn")
x = [5, 7, 8, 5, 6, 7, 9, 2, 3, 4, 4, 4, 2, 6, 3, 6, 8, 6, 4, 1]
y = [7, 4, 3, 9, 1, 3, 2, 5, 2, 4, 8, 7, 1, 6, 4, 9, 7, 7, 5, 1]
colors = [7, 5, 9, 7, 5, 7, 2, 5, 3, 7, 1, 2, 8, 1, 9, 2, 5, 6, 7, 5]
plt.scatter(x, y, s=100, c="green", edgecolor = "black", linewidth = 1, alpha=0.75)
plt.show()
looks fine to me
oh, you can also do python -m IPython --matplotlib
and it will show figures in a separate window when you .show them
try that
interesting, command not found
typo?
oh yes it is a typo lol
"no module named IPython"
pip install IPython, I guess
sudo pip install ipython
yeah
oh quick question
what's the diff b/w sudo and sudo pip
i never asked before
sudo and pip are unrelated. sudo means "super user do"
so does it do anything special?
you put it before commands that are restricted to administrators. on your own computer, you presumably have that. whereas on a production system, it's usually limited to only a few people.
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
you probably don't need it for pip.
x is the pseudo name for whatever dataframe i place in the function
yeah, i have no idea what's going on here
gfs is another dataframe that holds those product and company codes
a "pseudo name for whatever you put in a function" is called a parameter. what dataframe did you pass, in this case?
another dataframe i have called pipeline
it's case sensitive
the iloc i is 100% a string
right, so strings don't have an isin method. you have to call it from a whole Series
wait what was the original command? pip install IPython?
I’m looking to rewrite this function in some type of way so that I could use a ‘.apply’ to create a new column that has whether the result of these masks are true or false
python -m IPython --matplotlib
hey, it worked
i'm not used to seeing code in my own terminal lol
this is cool
apply is bad because it doesn't benefit from any of pandas' optimizations
can you show all the dataframes involved here with print(df.head().to_dict('list'))? that way I can copy and paste them directly.
oh yeah do not mess w apply
I’m not following
it's necessary sometimes.
Why do you need to know the dataframes involved?
I tried that command and got an error because I don’t have a dataframe named “df”
I can't help you with dataframe operations unless I know what is actually in the dataframes that you're working with, because every dataframe is different. the columns, their names, what types of data they have. if I don't know that, there's nothing I can do.
you have to replace df with the name of a dataframe.
in "normal python", this isn't usually necessary, since "nums is a list of ints" pretty much tells you anything you'd need to know. not as simple with dataframes.
i'm confused on when you have to do reassignments with dataframes
and when you don't
you'll just have to refer to the docs to see if the operation returns a new dataframe or modifies it in place. most return a new one.
modifies in place would mean that i would have to reassign it, right?
or no
i actually don't know the answer to that
🥲
stuff = [1, 2, 3]
stuff.append(5)
list.append modifies a list in-place and returns none
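A short sketch of the difference, assuming pandas is installed. The dataframe here is made up for illustration. In-place operations change the object and usually return None, so you do not reassign; operations that return a new object leave the original alone, so you do:

```python
import pandas as pd

stuff = [1, 2, 3]
result = stuff.append(5)       # modifies the list in place...
print(result, stuff)           # ...and returns None, so never write stuff = stuff.append(5)

df = pd.DataFrame({"a": [3, 1, 2]})
df.sort_values("a")            # returns a new dataframe; the result is thrown away here
unchanged = df["a"].tolist()   # df still has its original order

df = df.sort_values("a")       # reassign to keep the sorted result
reassigned = df["a"].tolist()
print(unchanged, reassigned)
```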
i see
oh, so i was right
i had a feeling i was right i saw a bunch of leetcode problems w solving things in place
output is too long
!paste
...:
...: plt.style.use('seaborn')
...:
...: x = [5, 7, 8, 5, 6, 7, 9, 2, 3, 4, 4, 4, 2, 6, 3, 6, 8, 6, 4, 1]
...: y = [7, 4, 3, 9, 1, 3, 2, 5, 2, 4, 8, 7, 1, 6, 4, 9, 7, 7, 5, 1]
...:
...:
...: # colors = [7, 5, 9, 7, 5, 7, 2, 5, 3, 7, 1, 2, 8, 1, 9, 2, 5, 6, 7, 5]
...:
...: # sizes = [209, 486, 381, 255, 191, 315, 185, 228, 174,
...: # 538, 239, 394, 399, 153, 273, 293, 436, 501, 397, 539]
...:
...: # data = pd.read_csv('2019-05-31-data.csv')
...: # view_count = data['view_count']
...: # likes = data['likes']
...: # ratio = data['ratio']
...:
...: # plt.title('Trending YouTube Videos')
...: # plt.xlabel('View Count')
...: # plt.ylabel('Total Likes')
...:
...: plt.tight_layout()
...:
...: plt.show()
In [2]: Segmentation fault: 11
the file isn't there
what file
this is my code snippet i’m not sure if you saw from above
i mean
there is no file
it's just plotting from two lists alone
one as the x and one as the y
yes but I need to see the dataframes themselves. I showed you the statement that prints them in a useable way.
actually i'm dumb
just seeing code that has dataframes in them doesn't give me enough information to know what to do.
hastebin - oyecufetiq
one moment
@warm raven
In [16]: pipeline['PRODUCT_ID_MAP'].isin(gfs['PRODUCT_ID_MAP'])
Out[16]:
0 False
1 False
2 False
3 False
4 False
Name: PRODUCT_ID_MAP, dtype: bool
In [17]: pipeline['PRODUCT_ID_MAP'].isin(gfs['PRODUCT_ID_MAP']).any()
Out[17]: False
see how isin returns a boolean Series?
you don't want to do it for individual values, as that won't work.
right so should I use IN?
no, you should restructure the solution to use isin
how would I do that though, to get the result value for every row, or essentially create a new column in the dataframe for the result of “all_masks”
all_masks = (
(pcn_mask and pidmap_mask and sector_mask) and (product_code_mask or company_code_mask)
).all()
this won't work because you can't use and and or for pandas objects. you have to use the & and | operators.
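a minimal sketch of the difference, using a made-up dataframe (the column names are only for illustration): `&` and `|` work element-wise and give one bool per row, while `and` / `or` ask Python for a single truth value of the whole Series, which pandas refuses to give.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 20, 35], "seats": [2, 5, 7]})

cheap = df["price"] < 25
roomy = df["seats"] >= 5

# element-wise: wrap each comparison in parentheses, because & binds tighter than <
both = (df["price"] < 25) & (df["seats"] >= 5)
either = cheap | roomy

try:
    cheap and roomy           # asks the whole Series for one True/False
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

`and` / `or` do work on plain bools, for example the result of `.any()`, which may be why switching to them can look like a fix: the masks have already been collapsed to a single value.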
I disagree
I was stuck on this about a week or so ago
I was using bitwise operators
you disagree. I said that you can't use and and or for pandas objects, and that is a fact.
was getting errors until i switched over to and and or
listen i’m not trying to be rude i’m telling you what I’ve tried
well, chaining bitwise operators with pandas objects only works if every comparison is wrapped in parentheses, since & and | bind tighter than == or <, so that might be why you were having an issue
another option is to concatenate the Series into a DataFrame and use any or all.
did you remove the comment to print the product ID mask every time?
no
okay sorry I re-read a bit and I see our discrepancy
I accidentally sent an old snippet, although my error is the same
[Finished in 2.2s with exit code -11]
[shell_cmd: python -u "/Users/rahuldas/Desktop/Project Folder Sublime Text/matplotlibstuff.py"]
[dir: /Users/rahuldas/Desktop/Project Folder Sublime Text]
[path: /usr/bin:/bin:/usr/sbin:/sbin]
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use("seaborn")
x = [5, 7, 8, 5, 6, 7, 9, 2, 3, 4, 4, 4, 2, 6, 3, 6, 8, 6, 4, 1]
y = [7, 4, 3, 9, 1, 3, 2, 5, 2, 4, 8, 7, 1, 6, 4, 9, 7, 7, 5, 1]
plt.tight_layout()
plt.show()
um i googled it, idk what exit code -11 is
This function works when not using a .apply, although it returns one result.
my fault bro
i’m trying to get it to return a result for every row of the dataframe
pcn_mask = x['prod_code_name'].isin(gfs['prod_code_name']).any()
pidmap_mask = x['PRODUCT_ID_MAP'].isin(gfs['PRODUCT_ID_MAP']).any()
sector_mask = x['Sector'].isin(gfs['Sector']).any()
first = pd.concat(
(pcn_mask, pidmap_mask, sector_mask),
axis=1
).all(axis=1)
product_code_mask = gfs['Product_Code'].isin(["Usf33", "Usf34", "Us756", "Usf37", "Usf40", "Usf29"]).any()
company_code_mask = gfs['Company_Code'].isin(["Us05", "Us1b", "Usm6"]).any()
second = product_code_mask | company_code_mask
return first & second
I think this is the solution but I didn't test it.
Hey again
added axis=1 to .all in one of them
if I’m working on this linear regression model. Do I need to remove null values from my set ?
I think so.
okay
I’m going to try and get this shit running here within the next day or so
Well I have to essentially finish it today smfh
Gave an error on the pd.concat of first
“cannot concatenate object of type '<class 'numpy.bool_'>'; only Series and DataFrame objs are valid”
did you remove all the calls to iloc?
oh, I see the problem
I fixed it
so your function works
but it still does not do exactly what I’ve been asking for

i’ve made a short dataframe to test it with, it’s still returning one result
The short dataframe has 5 rows
it occurs to me that all the calls to .any() are probably wrong.
since if you call any or all on a series, that reduces it to a stand-alone bool
yeah makes sense
that’s why I had the iloc in there initially thinking I’d compare the row values
or rather that one row of the input dataframe to compare against every row of gfs
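a sketch of what dropping the `.any()` calls looks like, with made-up stand-ins for `x` and `gfs` (it assumes the code columns live on `x`, which may not match the real data). each mask stays a boolean Series with one value per row of `x`, so the combined result can be stored as a new column.

```python
import pandas as pd

# made-up stand-ins for the two frames in the discussion
gfs = pd.DataFrame({"PRODUCT_ID_MAP": [10, 20], "Sector": ["A", "B"]})
x = pd.DataFrame({
    "PRODUCT_ID_MAP": [10, 30, 20],
    "Sector": ["A", "A", "C"],
    "Company_Code": ["Us05", "Us05", "Usm6"],
})

# no .any(): each mask is a boolean Series aligned with the rows of x
pidmap_mask = x["PRODUCT_ID_MAP"].isin(gfs["PRODUCT_ID_MAP"])
sector_mask = x["Sector"].isin(gfs["Sector"])
company_code_mask = x["Company_Code"].isin(["Us05", "Us1b", "Usm6"])

x["all_masks"] = pidmap_mask & sector_mask & company_code_mask
print(x)
```

note that `isin` checks each column on its own; it does not require the matching values to come from the same row of `gfs`. if that matters, a merge on those columns is probably the better tool.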
whats the best way to go about
removing null values
after running my df.isna().sum()
S.No. 0
Name 0
Location 0
Year 0
Kilometers_Driven 0
Fuel_Type 0
Transmission 0
Owner_Type 0
Mileage 2
Engine 46
Power 175
Seats 53
New_Price 0
Price 1234
dtype: int64
"Sublime Text is not a Python Package installer, just a text editor. With it, you can edit a python script. When you are done editing, you just launch your script using python script.py"
shit bro
i might just use spyder
i'll mess around w spyder when i have the time
rn i got bigger fish to fry
ask yourself why the values are null in the first place
lemme think,
the reason for them being null can significantly change how you handle it
sometimes you just want to drop those rows entirely
other times it makes sense to "impute" a value - basically replace the null with an educated guess
missing data imputation is a huge field too, and something that you don't want to spend a lot of time on right now probably
i assume this was at least mentioned in your course?
Imputing was but
Right now we've covered preprocessing but I havent had that much time to dive in
What im HOPING for is that we get this freeze here in pittsburgh so i can work tomorrow
well what did they talk about with respect to imputing data? the most basic choices include filling the missing values with the mean, median, or mode
usually missing data imputation is a matter of understanding what the data means and where it comes from, and letting that guide you to a sensible approach
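for reference, the most basic choices mentioned above look roughly like this. the values are made up; only the column names echo the ones in the `isna()` output, and whether mean, median, or mode is sensible depends on what the column means.

```python
import pandas as pd

df = pd.DataFrame({
    "Seats": [5.0, None, 7.0, 5.0],
    "Power": [74.0, 88.5, None, 120.0],
})

# a count-like column: fill with the most common value
df["Seats"] = df["Seats"].fillna(df["Seats"].mode()[0])

# a continuous column: the median is less sensitive to extreme values than the mean
df["Power"] = df["Power"].fillna(df["Power"].median())

print(df)
```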
you might want to think a little about your actual task
What's up Data Science gang, I have a question about concat in Pandas
when I concat two data frames why does the final column in the data frame I connect have NaN?
@lapis sequoia ultimately you need to come up with some kind of "willingness to pay" estimate based on different attributes of the car, and use that to segment customers into 2 different price tiers. that's what "differential pricing" is
for example:
trying to one hot encode "City" in our original DataFrame known as "Feature" but when I concat the one hot encoded dataframe it produces the NaN?
any thoughts?
predicting the price of a used cars rn
this doesn't have to do with the location of the column in the dataframe. if you can post a minimal example that someone can copy and paste and reproduce the problem, then we can figure out what the actual problem is
in general, if you get unexpected missing values after a concat operation, it's because either the column names or row index labels don't match up
but it's pretty hard to debug someone else's screenshots
gotcha sorry for the complication
example data + runnable code are ideal
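a small reproduction of the usual cause, assuming the original frame was filtered before encoding (that step is a guess, not something shown in the question). `pd.concat(..., axis=1)` lines rows up by index label, so labels that exist on only one side produce NaN.

```python
import pandas as pd

feature = pd.DataFrame({"City": ["NY", "LA", "SF"], "Price": [1, 2, 3]})
feature = feature[feature["Price"] > 1]            # index labels are now 1 and 2

# the encoded frame ends up with a fresh 0..n-1 index
dummies = pd.get_dummies(feature["City"]).reset_index(drop=True)

misaligned = pd.concat([feature, dummies], axis=1)   # labels 0, 1, 2 -> NaN where one side is missing
aligned = pd.concat([feature.reset_index(drop=True), dummies], axis=1)

print(misaligned)
print(aligned)
```

comparing `feature.index` with `dummies.index` before the concat is a quick way to check whether this is what is happening.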
@desert oar Could you help me see if my understanding that a decision tree model performed better than a logistic regression model is correct?
I already trained both models and measured for precision, plus made a confusion matrix
for what it's worth, sometimes the process of constructing such an example is enough to guide you to the right solution without even having to ask
i was about to log off, but you shouldn't "ask to ask". just post the question, etc. you know the deal!
i can answer if it's quick
It's okay if you're logging off though
df.dropna()
how can i take that result
and call that as my dataframe going forward
would It be x = df.dropna()
then call that going forward?
yes, or you can do df = df.dropna()
in addition to df.dropna(), consider the general pattern for filtering rows:
row_is_ok = df.notna().all(axis=1)  # or any other operation that returns a boolean Series, one value per row
df = df.loc[row_is_ok].copy()
yessss!!!!
worth reading when you have more time https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
which is the relevant part here?
they will probably ding you for this, given their feedback on your other project. maybe it's fine for the sake of just getting it done
personally i'd rather write explicitly "for the sake of simplicity, i avoided dealing with missing data imputation and i just removed all rows with missing values. i know this isn't really the right thing to do, but i did it in the interest of getting something done."
depends on how kind the grader is
Close to the end
When I trained both models, I tuned them for precision and made confusion matrix
Since the tree's precision was higher, but the confusion matrix looked worse, I wanted to know if I'm correct thinking it is better suited for the task
what do you mean by "looked worse"? how did you tune them? are you aware of the "bias-variance tradeoff"?
The tree was more pessimistic in predicting true positives (which is what I wanted), but the logistic regression model looked like it performed better "overall" (albeit risking false positives).
I tuned them both using GridSearchCV.
I am not aware of bias-variance tradeoff.
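a sketch of how a model can win on precision while its confusion matrix looks worse overall. the counts below are made up purely to illustrate the pattern described here; they are not from either model.

```python
def precision(tp, fp):
    # of everything predicted positive, how much really was positive
    return tp / (tp + fp)

def recall(tp, fn):
    # of everything really positive, how much was found
    return tp / (tp + fn)

# made-up confusion-matrix counts
tree = {"tp": 30, "fp": 2, "fn": 20}
logreg = {"tp": 45, "fp": 10, "fn": 5}

for name, cm in [("tree", tree), ("logreg", logreg)]:
    p = precision(cm["tp"], cm["fp"])
    r = recall(cm["tp"], cm["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

if false positives are the expensive mistake, the cautious model can be the better fit even though it misses more true positives.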
https://mecha-mind.medium.com/explaining-bias-variance-tradeoff-to-a-ml-engineer-d747bdbb1f1d this article has a good explanation, although i find it both funny and depressing that they went all the way up to gradient boosting without so much as giving logistic regression a nod
So now another problem im having..
this article is probably better @main fox https://medium.com/@venkatavinay222/understanding-bias-variance-trade-off-in-machine-learning-952d2d7a86ba
mileage is listed as kmpl
gas mileage? that looks like "kilometers per liter"
so if you want "miles per gallon" you need to do a bit of math
I mean
you need to make sure they're both the same unit
I suppose it doesnt matter bc the problem statement is referring to an indian market so
distance per liter is all good yeah
so step 1 done. nix all null values - imputing would be nice but in the name of time i think its best to proceed from here
Im trying to figure out how to exclude outliers now...
quick and dirty approach: remove anything more than 2 standard deviations from the mean. however i very strongly suggest that you plot a univariate distribution (boxplot and kernel density plot) to see the shape of each variable
mostly outlier removal isn't needed in real data except for a few really egregious data points
just because something is "far from the average" doesn't make it an "outlier" in the sense of "this is a measurement error or something else weird that i need to exclude from my model"
extreme values do happen in real life, and you don't want to remove them just because they seem unusual
in that case then yes, do look for it
e.g. if a car price is 5 or 0, that's clearly not right and probably should be re-coded as "null"
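a rough sketch of both ideas, on made-up prices: re-code obviously impossible values as missing first, then apply the quick and dirty 2-standard-deviation cut. the cutoff of 0 for "impossible" is an assumption for the example, not a rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [4.5, 5.2, 6.0, 5.5, 0.0, 4.8, 5.9, 5.1, 4.9, 60.0]})

# clearly impossible values become missing instead of staying in as real prices
df.loc[df["Price"] <= 0, "Price"] = np.nan

# quick and dirty: keep rows within 2 standard deviations of the mean
z = (df["Price"] - df["Price"].mean()) / df["Price"].std()
trimmed = df[z.abs() <= 2]

print(trimmed)
```

rows with a missing price drop out here as well, because a NaN z-score never passes the comparison. looking at a boxplot first is still the safer way to decide whether 60.0 is an error or just an expensive car.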
yeah...
is there a command that will
no
display the range of values?
i guess the variance?
guess? more like actually compute
nahhhh I just want to see what the lowest / highest values are
and then find the average
well there's .min() and .max()
the range is the difference thereof
variance is .var() and standard deviation is .std()
so i tried doing
you might want to skim the list of functions available to use on Series objects: https://pandas.pydata.org/docs/reference/series.html
i get series is not defined
hm? those are methods on the Series class
if you get a single column from a dataframe, that object is of type Series
Thanks, I'll give it a read. Is there any other resource or book you'd recommend?
other than a full machine learning course? not that i can think of offhand
or if i run "Price"
and if they don't discuss bias-variance tradeoff in your machine learning course, then you were robbed. it's one of the most important concepts to understand
well yeah, what did you expect? there's no variable Price
presumably your data is called df based on your examples
so you'd do df['Price'] to get the Price column
df['Price'] gives you a Series instance, representing the Price column in df
df is an instance of DataFrame
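putting those pieces together on a tiny made-up frame: pull the column out as a Series, then call the Series methods on it.

```python
import pandas as pd

df = pd.DataFrame({"Price": [1.75, 12.5, 4.5, 6.0]})

price = df["Price"]                 # a Series; `Price` on its own is not a variable
lowest, highest = price.min(), price.max()
value_range = highest - lowest      # the range is max minus min
average = price.mean()
spread = price.std()

print(price.describe())             # count, mean, std, min, quartiles, max in one call
```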
So that last column doesn't act that way
act what way?
Like i cant call it where it's defined?
what do you mean by that?
what happens when you try that?
is it different from what you expect?
Price isn't a stand-alone variable
it's a column in the dataframe
I see I see.
you can assign it to a separate variable if you want
Thank you for the advice. I didn't formally have a ML course, just a deep interest that I'm trying to pursue. The math and theory can go as deep as one is willing to dive so it can be a timesink to cover topics.
Did you take an ML course you'd recommend?
yes
Word!
no, i went to school for quantitative social science and learned the rest along the way
it's small concepts like these that help me clear the much larger picture
good! building a foundation of concepts is very important
i do need to run now though
You can think of [] as a function, that takes some arguments and returns something. It's just a special function that uses [] syntax (in Python you can change what operators like [] do on certain objects, which is what Pandas is doing).
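a tiny illustration of that idea in plain Python. `Catalog` is a made-up class, not anything from pandas; the point is only that `obj[key]` is a call to `obj.__getitem__(key)`, and a class can define what that returns.

```python
class Catalog:
    """Toy container that decides for itself what [] means."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, key):
        # Python calls this method for catalog[key]
        return self._columns[key]


catalog = Catalog({"Price": [1.75, 12.5]})
print(catalog["Price"])
print(catalog.__getitem__("Price"))   # same call, written out explicitly
```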
can someone explain why this question's answer is this? and how can i find the x1 and x2 features?
The +ve and -ve examples can be arranged either way around, but you can read the features off the plot like this.
As we know, the SVM tries to find the largest margin between the +ve and -ve examples so that it can separate the two.
The decision boundary will be a vertical line through 3 on the x1 axis, and only in that case is the margin at its maximum.
As far as I know, the optimization tries to find the smallest possible values of theta, subject to these constraints:
(a) for a -ve example, ||theta|| * p(i) <= -1 (i runs over the -ve examples); then it classifies y as 0 (-ve)
(b) whereas for a +ve example, ||theta|| * p(i) (i runs over the +ve examples) should be greater than or equal to 1; then it classifies the plotted example as +ve
Here I have assumed theta0 = 0, so the decision boundary passes through the origin.
Since the cost function favors small values of theta, p(i) (the distance of example i from the decision boundary) should be as large as possible; only then are the constraints in (a) and (b) satisfied. No other decision boundary is allowed.
I am also confused about selecting theta, because ||theta|| comes out the same for every value of theta.
If you figure out how to select theta, do tell.
Does anyone here know how (and can either tell me how, or show me an example of how) to create a model for an image generation AI?
How can i learn gpt 3 as a complete novice?
Learn to make a gpt 3 clone?
Learn to use gpt-3?
hello.
i have a question.
in a cnn there is a convolution block and a dense block. why make a block?
i mean, you can just write the code directly, so why make a block?
Please where can I get data science and machine learning projects and exercises with source code
would y'all explain keras as a better interface to control the tensorflow framework?
CodeBasics, he has whole playlists on Machine learning and Deep learning with many projects and exercises
Any good material on directed attention?
Idk
is this model over-fitting or under-fitting?
how to make a function of activation functions?
i get an error when trying to do
# return nn.Sin()
# return nn.Tanh()
# return nn.Sigmoid()
# return nn.Tanhshrink()
return nn.HardTanh(-1,1)
# return nn.Hardswish()
# return nn.functionnal.silu()
does the activation function contains learnable parameters?
if yes you have to inherit from nn.Module if no then you can apply it directly to the output
hello, I need help to validate if there are duplicates value in csv column and items which failed the validation should be logged (e.g. stderr) and ignored for the next processes
the function f returns an object and it does not take any args the correct way is to do the following
import torch.nn as nn
import torch
def f():
    return nn.Tanh()
x = torch.randn(5)
result = f()(x)
if you are using pandas you could groupby that column and all the duplicates will be grouped together
how can i log to stderr each duplicate item
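one possible sketch with pandas, assuming the column is called `item_id` (a made-up name) and that every row with a repeated value should be reported and left out of later steps.

```python
import sys

import pandas as pd

df = pd.DataFrame({"item_id": ["a1", "b2", "a1", "c3", "b2"]})

# keep=False flags every member of a duplicated group, not only the repeats
is_duplicate = df["item_id"].duplicated(keep=False)

for value in df.loc[is_duplicate, "item_id"].unique():
    print(f"duplicate value skipped: {value}", file=sys.stderr)

valid = df[~is_duplicate]          # only these rows go on to the next steps
print(valid)
```

if the first occurrence should survive and only the repeats be dropped, `duplicated(keep="first")` does that instead.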
spyder is fantastic lol
Depends on application, but i would say neither. The training accuracy is pretty high and doesn't have a huge gap with the val_acc
There should be little to no dependence on x2, since a vertical line would separate things nicely, so θ₂ should be small. Inputting (0, 0) should be a negative case, i.e. θ₀ < 0. This leaves only one option
12 params is all the nodes
ah i understand now.
how many dense layer and number of neurons? any tips where to learn how to describe this cnn model?
#Augmented Layer
model.add(augmented)
#Input shape Layer
model.add(Input(shape=(WIDTH,HEIGHT,3)))
#Conv2D and MaxPool2D Layers
model.add(Conv2D(16, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(32, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
model.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
model.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
#Flatten Layer
model.add(Flatten())
#Fully connected layer, OUTPUT Layer
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=class_size, activation='softmax'))
A Conv2D layer has (kernel height * kernel width * input channels + 1) * filters learnable parameters; the +1 is the bias of each filter
filters is essentially how many kernels you have
a dense layer has (number of inputs + 1) * units parameters
depending on what your teacher meant by neuron specifically it could be just the learnable params in the dense layers or including the kernels as well
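a sketch of the parameter arithmetic for the convolution layers in the model above, counting weights plus biases (this assumes the Keras default of one bias per filter or unit, and a 3-channel input as in the Input layer). the helper names are made up for the example.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # each filter has kernel_h * kernel_w * in_channels weights plus one bias
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    # each unit has one weight per input plus one bias
    return (inputs + 1) * units


filters_per_layer = [16, 32, 64, 64, 128, 128, 128]   # Conv2D layers, in order
in_channels = 3                                        # RGB input
conv_counts = []
for filters in filters_per_layer:
    conv_counts.append(conv2d_params(3, 3, in_channels, filters))
    in_channels = filters      # pooling changes width and height, not channels

print(conv_counts)   # [448, 4640, 18496, 36928, 73856, 147584, 147584]
```

the first Dense layer depends on the flattened size, which depends on WIDTH and HEIGHT, so it is left out here; `model.summary()` reports the real numbers for the whole model.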
thank you pontifex, might as well throw question to the instructor on which neurons is asked
Sounds good
if you need help with data science or AI, please ask a question directed to the whole channel.
clueless tbh
I'm parikshith. Stream ECE branch section B. I was wondering if you were available to help me with Lec 3: Image formation: Radiometry which is a little bit vast as per my knowledge in math.
And I would like to share the resources I'd gone through to figure out a solution
Below link video of Lec 3: Image formation: Radiometry (NPTEL course video)
https://www.youtube.com/watch?v=ch1xdUFABA8
Another same concept video explanation in YouTube link I'd gone through
https://www.youtube.com/watch?v=kPIqO929pIc
Questions:
- The light has a radiant flux of 100 watts, what is the irradiance on an object which is placed at 2 meters from the light (assuming object is perpendicular to the night light)? Wm−2
2.99
1.25
1.99
0.55
- A light source has a radiant flux of 100 watts, what is the flux on a rectangular object of size 20 cm by 30 cm placed 2 meters away (perpendicular to the light)?
0.1194 mW
0.1163 mW
0.1189 mW
0.1123 MW
- Given the 10-watt source coming in from 2π3 solid angle (in sr) of a radius 3 meter, the corresponding source of energy carried by the ray is
52π2
12π2
π2
10
- Light source has a radiant intensity of 60 W sr−1. Determine the irradiance on a sign board 2 meters away.
10
15
20
30
- Suppose a source with an area of 4 m−2 is viewed at an angle of 30 degree and has a radiance of 0.3 Wm−2sr−1. Calculate the radiant intensity of the source?
1.65 Wsr−1
1.04 Wsr−1
2.78 Wsr−1
2.11 Wsr−1
- Suppose the source in question 9 is viewed from a perfectly reflecting Lambertian surface. Then find the value of radiosity.
0.3145 Wm−2
0.1645 Wm−2
0.2598 Wm−2
0.4768 Wm−2
Thank you for your time
did you put import tensorflow as tf at the top of your file? Also, please copy and paste actual text instead of screenshots as this is a lot more useful for answerers.
sorry, is this a data science question? it sounds like it might be physics.
Ya it's a radiometry concept and underpinning of math and physics of CV
import tensorflow as ts
string = tf.Variable("this is a string", tf.string)
print(string)
I would ask in a different discord server. Sorry
@lapis sequoia you did import tensorflow as ts, with a ts instead of tf.
No; please ask your questions in the channel.
Oh ok
but how can i get rid of this error
Skipping registering GPU devices...
2022-02-04 19:15:31.196989: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<tf.Variable 'Variable:0' shape=() dtype=string, numpy=b'this is a string'>
Do you understand what this error message is telling you?
I'm so glad you said that. I'm looking for a server focused specifically on CV and IP, because I've taken a course on computer vision and image processing fundamentals and applications. I'm really excited to learn from that course, but my exams are going on, so I need to get by on assignments for a few weeks; if I don't get good marks in the assignments, my score will drop. I've completed two assignments, and the questions I sent above are the ones I have left for this week.
Thank you for your time
So plz let me know sir
hey! i have a pandas dataframe of yearly population projections, and the format is a bit iffy to process. i'd like to merge the rows like this:
i have never worked with pandas before so i don't really know where and how to look for guidance lol
I think you need to pivot it. Can you do print(df.head().to_dict('list')) so that I can copy and paste it and experiment?
alternatively, you can look at the docs and try to figure it out.
!docs pandas.DataFrame.pivot_table
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
let me take a look, the above was a quick excel mock of the larger messier thing i have so i'd rather spare your time and effort and see if this does it
I don't mind as long as you provide the data in a format that I can use immediately, like a CSV, or something.
Hi guys, I'm currently facing a task from MLSS2020 regarding RL environments and agents and I'm kinda stuck on one issue: I have to adjust the environment or the agent so that its actions reflect the given probability, like in the AIMA example. I have the environment defined, but the actions of the agent are where I'm stuck. anyone willing to take a look?
yeah i'm a bit lost still, here's the first three years or so, {"column": ["row", ...]}
{'City': ['Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total', 'Country total'], 'Year': ['2021', '2021', '2021', '2021', '2021', '2021', '2021', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2023', '2023', '2023', '2023', '2023', '2023', '2023', '2024', '2024', '2024', '2024', '2024', '2024'], 'Age': ['Total', '0 - 14', '15 - 24', '25 - 44', '45 - 64', '65 - 74', '75 -', 'Total', '0 - 14', '15 - 24', '25 - 44', '45 - 64', '65 - 74', '75 -', 'Total', '0 - 14', '15 - 24', '25 - 44', '45 - 64', '65 - 74', '75 -', 'Total', '0 - 14', '15 - 24', '25 - 44', '45 - 64', '65 - 74'], 'value': [5547045, 852577, 608053, 1423098, 1383040, 703342, 576935, 5555002, 843285, 609690, 1422508, 1377411, 695173, 606935, 5562569, 833001, 614092, 1420517, 1374381, 684221, 636357, 5569645, 821592, 618233, 1419647, 1369556, 677181]}
i dont have a gpu?
you can do this
In [25]: df.pivot_table(index=['City', 'Year'], columns='Age')
Out[25]:
value
Age 0 - 14 15 - 24 25 - 44 45 - 64 65 - 74 75 - Total
City Year
Country total 2021 852577.0 608053.0 1423098.0 1383040.0 703342.0 576935.0 5547045.0
2022 843285.0 609690.0 1422508.0 1377411.0 695173.0 606935.0 5555002.0
2023 833001.0 614092.0 1420517.0 1374381.0 684221.0 636357.0 5562569.0
2024 821592.0 618233.0 1419647.0 1369556.0 677181.0 NaN 5569645.0
jesus
alright let me see
wait holup! this might be exactly how i imagined it should be in my head, thanks!! now i just have to figure out how to work the MultiIndexes(?)
the multiindexes. did you want to "flatten" them?
In [27]: df.pivot_table(index=['City', 'Year'], columns='Age', values='value').reset_index()
Out[27]:
Age City Year 0 - 14 15 - 24 25 - 44 45 - 64 65 - 74 75 - Total
0 Country total 2021 852577.0 608053.0 1423098.0 1383040.0 703342.0 576935.0 5547045.0
1 Country total 2022 843285.0 609690.0 1422508.0 1377411.0 695173.0 606935.0 5555002.0
2 Country total 2023 833001.0 614092.0 1420517.0 1374381.0 684221.0 636357.0 5562569.0
3 Country total 2024 821592.0 618233.0 1419647.0 1369556.0 677181.0 NaN 5569645.0
also why is every city named "country total"?
because the first one is the whole country
up until 2040
so the first city is like, 200 rows down?
ah
💚
got a task at work and the library i use returns organized data as a DataFrame
turned out to be quite the can of worms lol
We are designing an underwater robot. While the underwater robot is in autonomous driving, it will search for a certain object. But while searching for the object, it should not hit the walls around it. Can I detect walls with OpenCV? Or how do I make sure it doesn't crash into walls?
Hi, I am trying to make a database model and i am running into memory error
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 868. MiB for an array with shape (10669, 10669) and data type float64
this thing works when i try it in IPython (jupyter lab launched from anaconda) without any errors
but when I try it in plain python, it gives error
data = Product.objects.all() # gets data
df = pd.DataFrame(data.values())
tfidf = TfidfVectorizer(stop_words='english')
df['product_name'] = df['product_name'].fillna('')
overview_matrix = tfidf.fit_transform(df['product_name'])
similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
i get the error on similarity_matrix line
i am on a 64bit machine, and it gives error for 868 MB
is there a way to make the result of tfidf.fit_transform a sparse array?
nope
i think it is some Kind of numpy problems in which some kind of limit is set for memory allocation
it actually does return a sparse array; can you show the whole error message starting from Traceback?
just a min
i am running it in thread btw
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\LAKSHYA\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\LAKSHYA\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\LAKSHYA\PycharmProjects\mega-env\PC website\website\backend\main\views.py", line 39, in product_recommendations_variables
similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
File "C:\Users\LAKSHYA\PycharmProjects\mega-env\venv\lib\site-packages\sklearn\metrics\pairwise.py", line 1073, in linear_kernel
return safe_sparse_dot(X, Y.T, dense_output=dense_output)
File "C:\Users\LAKSHYA\PycharmProjects\mega-env\venv\lib\site-packages\sklearn\utils\extmath.py", line 161, in safe_sparse_dot
return ret.toarray()
File "C:\Users\LAKSHYA\PycharmProjects\mega-env\venv\lib\site-packages\scipy\sparse\compressed.py", line 1039, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\LAKSHYA\PycharmProjects\mega-env\venv\lib\site-packages\scipy\sparse\base.py", line 1202, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 868. MiB for an array with shape (10669, 10669) and data type float64
so this is the part that causes the error: similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
(while that may be obvious to you, I had no way of knowing that before you provided the whole error message)
sorry 😅 , my bad
try linear_kernel(overview_matrix, overview_matrix, dense_output=False)
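a small sketch of what that flag changes, on three made-up product names. with `dense_output=False` and sparse inputs, `linear_kernel` hands back a SciPy sparse matrix instead of building the full dense array.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

names = ["red cotton shirt", "blue cotton shirt", "steel water bottle"]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(names)

similarity = linear_kernel(tfidf_matrix, tfidf_matrix, dense_output=False)
print(type(similarity), similarity.shape)

# convert only the row you need instead of the whole matrix
row0 = similarity[0].toarray().ravel()
print(row0)
```

code further down that expects a NumPy array (sorting a row, enumerating it, and so on) will probably need a conversion like the `toarray()` call above.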
just a min
I'm in a meeting now, so I may become unresponsive
the pandas documentation and user guides are much better than they used to be. i recommend reading through the user guide material if you are feeling stuck. stackoverflow also has a lot of pandas questions now
64bit isn't relevant here. if you don't have 868 MB of free RAM, you won't be able to allocate an array of that size
many people ask 32bit / 64bit in memory type errors cause of the 4gb memory limit, so i just gave it :)
that might have to do with float32 vs float64, which has less to do with your operating system and more to do with how your data has been stored
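the number in the error message is just the size of the dense similarity matrix, which can be checked by hand:

```python
rows = cols = 10669
bytes_per_float64 = 8
bytes_per_float32 = 4

size_mib_float64 = rows * cols * bytes_per_float64 / 2**20
size_mib_float32 = rows * cols * bytes_per_float32 / 2**20

print(round(size_mib_float64, 1))   # 868.4, the figure in the error message
print(round(size_mib_float32, 1))   # half of that
```

the size grows with the square of the number of products, so doubling the catalog would quadruple the allocation.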
btw if you are trying to compute cosine distance, you might also be interested in https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
thanks alot :)
yeah i figured this along the way, but i ended up asking here as i had no idea of what to even look for in the docs
hey @serene scaffold, sorry to disturb you while you are in a meeting,
but after turning dense_output to false, the model stopped working altogether
I can't help; sorry
try restating what information might help someone debug this with you.
"the model stopped working altogether" is uninformative; what happened instead? how do you know it isn't working?
it didn't give any output
ok, thanks
Unless this means that linear_kernel(overview_matrix, overview_matrix, dense_output=False) returned None, you have not yet divulged enough information for anyone to assist.
result_dict = {}
for index, row in df.iterrows():
    if row["Column value"] in result_dict:
        result_dict[row["Column value"]].append(row)
    else:
        result_dict[row["Column value"]] = [row]
anyone happen to know how I could do this properly, i.e declaratively, with pandas/python? this works but you aren't supposed to iterate imperatively like that with pandas.
basically im trying to get a key value dict, where the keys are the unique values of a column in the dataframe (table), and the dict's values are the rows of the data frame with that column value. just trying to figure out the approach I should take to do it declaratively but it's not coming to me for some reason
sorry if there is a better channel for this
can you show df.head().to_dict('list')? This is the right channel for this.
yeah give me a sec, it takes me a few minutes to run the function. also this is for work so trying to keep the actual data anonymized if that is ok
thanks
you'd have to make a copy of the dataframe with fake data that captures the schema of the real dataframe.
ok might just share the real output and delete after
you can DM it to me, if you must.
I'm interested on getting into Data Science, is there anything I should know before start messing with it?
it's mostly math
Just that?
that's the part that a lot of people end up being disappointed about
@brave latch what if the same Deal appears more than once? you can't have the same key twice in a dict.
you want a nested list?
yes I want unique deals -> rows containing that deal
result_dict = {}
for index, row in df.iterrows():
    if row["Deal"] in result_dict:
        result_dict[row["Deal"]].append(row)
    else:
        result_dict[row["Deal"]] = [row]
my imperative code checks for that
I want to make this declarative
because pandas
Not really. If you're into tabular data, then there's a very good course starting soon. It's on machine learning with Python and scikit-learn, taught by scikit-learn core devs. It's free btw. https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/
@brave latch
In [40]: df.groupby('Deal').apply(lambda d: [row for _, row in d.iterrows()])
try that.
well, I guess that's still a dataframe
thats the exact schema I need though
In [42]: df.groupby('Deal').apply(lambda d: [list(row) for _, row in d.iterrows()]).to_dict()
there you go.
the trick is that df.groupby is like a magical amalgamation of individual dataframes
and then apply does a function to each of those
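A minimal sketch of the same idea without `iterrows`, assuming a made-up two-column frame (the `Amount` column and its values are hypothetical, only `Deal` comes from the thread). Iterating over a `groupby` yields `(key, sub-DataFrame)` pairs, so a dict comprehension gets you unique deals -> rows directly:

```python
import pandas as pd

# Hypothetical stand-in for the real (anonymized) data
df = pd.DataFrame({"Deal": ["A", "B", "A"], "Amount": [10, 20, 30]})

# Each iteration gives the group key and the sub-DataFrame for that key
rows_by_deal = {
    deal: group.to_dict("records")  # one dict per row in the group
    for deal, group in df.groupby("Deal")
}

print(rows_by_deal)
```

If you'd rather keep each group as a DataFrame instead of a list of dicts, drop the `.to_dict("records")` call.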
and i just wasn't grokking the api
I would argue that you don't need lots of math for applied DS/ML. Libraries like scikit-learn abstract a lot of that away so you can focus on applying the tools.
this worked perfectly, appreciate it, and understand how it works now. mind if i ask you to delete the data?
Hello! You probably won't remember, but in early December I posted a message asking for help choosing a machine learning/optimization algorithm to solve referee assignment for basketball matches. You provided pretty solid answers without knowing the actual datasets we'd work with. Now that we know the datasets, it turns out that there are too many restrictions to implement ML or optimization algorithms. My coworkers decided to use a rules-based AI algorithm. I've been surfing the net trying to find some implementations of this approach, but I keep finding posts explaining the differences between rules-based and ML algorithms and so on. I wonder if you know an example of a rules-based AI algorithm so that I don't look like an absolute beginner whenever I have to code things that interact with it.
Or even coding it myself😅
rule-based AI is the subset of AI that isn't machine learning. instead of having parameters that are learned from data, someone decides how the output should be determined based on the data.
If you hear someone say "AI is glorified if statements", that is the subset of AI that they're referring to.
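To make that concrete, here is a rough sketch of what a rule-based assigner could look like. Everything in it is hypothetical (the `Referee` and `Match` fields, the certification levels, the "prefer same city" rule); the point is only that every decision is a condition a person wrote down, with nothing learned from data:

```python
from dataclasses import dataclass

@dataclass
class Referee:
    name: str
    level: int   # hypothetical certification level
    city: str

@dataclass
class Match:
    home: str
    city: str
    min_level: int  # hypothetical minimum level required for this match

def eligible(ref, match, busy):
    # Hard rules: each one is a hand-written condition
    if ref.name in busy:
        return False
    if ref.level < match.min_level:
        return False
    return True

def assign(matches, referees):
    busy = set()
    result = {}
    for match in matches:
        candidates = [r for r in referees if eligible(r, match, busy)]
        # Soft rule: prefer referees from the match's city (False sorts first)
        candidates.sort(key=lambda r: r.city != match.city)
        if candidates:
            result[match.home] = candidates[0].name
            busy.add(candidates[0].name)
        else:
            result[match.home] = None  # no referee satisfies the rules
    return result

matches = [Match("Lions", "Madrid", 2), Match("Bears", "Bilbao", 1)]
referees = [
    Referee("Ana", 1, "Madrid"),
    Referee("Luis", 3, "Bilbao"),
    Referee("Marta", 2, "Madrid"),
]
assignments = assign(matches, referees)
print(assignments)
```

A real system would have many more rules and probably some backtracking when a greedy choice blocks a later match, but the structure stays the same.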
I see. I've been searching for implementations of rule-based AI, but I haven't come across a realistic example
Why is matrix multiplication significantly faster with numpy than with tensorflow? Should it not run faster on my gpu with tensorflow?
The GPU has data transfer and setup overhead. Try very large matrices.
(like 1024x1024)
ok i will try it
Also the times will be different when you actually do something with the results. Since right now you are just calculating it and throwing it away immediately.
(Which has different destruction times, and GPUs are faster if you keep the data there and use it for something else as well)
(Avoid going back and forth between the CPU and GPU)
Your CPU can calculate many small matrix multiplications before a single small matrix reaches the GPU.
(If the matrix data is already in the CPU's local memory)
However, with enough of them, it becomes worth it again, but you have to send them all in one batch to the GPU.
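The batching point can be illustrated on the CPU alone. This is only an analogy, not a GPU benchmark: it compares a Python loop of many small products against one batched `np.matmul` call over the same data, so the fixed per-call overhead is paid once instead of 10,000 times. The sizes are arbitrary.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((10_000, 4, 4))  # 10,000 small 4x4 matrices
b = rng.random((10_000, 4, 4))

# One call per pair: the per-call overhead dominates
start = time.perf_counter()
looped = np.stack([x @ y for x, y in zip(a, b)])
loop_seconds = time.perf_counter() - start

# One call for the whole batch: matmul broadcasts over the leading axis
start = time.perf_counter()
batched = np.matmul(a, b)
batch_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  batched: {batch_seconds:.4f}s")
```

On a GPU the fixed cost per call is much larger (transfer plus kernel launch), which is why batching matters even more there.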
Ok, thank you, now I understand why it took so long.
They probably want hand-crafted fuzzy logic (with hand-crafted fuzzification functions).
It's still "glorified if-statements", but the input can be vague and it's its own programming style (you can make a DSL for it, but don't need to).
Yeah I feel like they are pretending to create an ad hoc algorithm for this use case in particular
Fuzzy logic can make use of ML later, since the fuzzification / input process can be whatever.
I have even seen it slapped on top of spiking neural networks.
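A tiny sketch of hand-crafted fuzzy logic, to show what "glorified if-statements with vague input" can look like. The membership shapes, the two rules, and the `match_importance` input are all made up for illustration:

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzify_experience(years):
    # Hand-crafted fuzzification: how "junior" / "senior" is this referee?
    return {
        "junior": triangular(years, -1, 0, 5),
        "senior": triangular(years, 2, 10, 30),
    }

def suitability(years, match_importance):
    """match_importance is a degree in [0, 1] (hypothetical input)."""
    m = fuzzify_experience(years)
    # Rule 1: IF referee is senior THEN suitable
    suitable = m["senior"]
    # Rule 2: IF referee is junior AND match is important THEN unsuitable
    unsuitable = min(m["junior"], match_importance)  # fuzzy AND as min
    total = suitable + unsuitable
    # Defuzzify to a single score; 0.5 when no rule fires
    return 0.5 if total == 0 else suitable / total

print(suitability(10, 1.0), suitability(0, 1.0), suitability(3.5, 1.0))
```

The fuzzification functions are the part that could later be swapped for something learned, which is the point made above.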
Many early AI research projects involved constructing a representation of a domain using first-order logic predicates, or something similar. For example you would have a description of a restaurant domain as follows:
at(restaurant,Alice)
at(restaurant,Bob)
at(restaurant,Carol)
works_at(restaurant,Carol)
has_job(restaurant,waitress,Carol)
orders(Bob,pizza)
orders(Alice,sushi)
along with rules for reasoning about the domain, such as:
forall X,Y,Z. orders(X,Y) and has_job(restaurant,waitress,Z) -> serves(Z,X,Y)
which attempts to encode the rule that if person X orders food Y and Z is a waitress at the restaurant then Z will serve food Y to person X.
From the above representation we can deduce:
serves(Carol,Bob,pizza)
serves(Carol,Alice,sushi)
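The deduction above can be sketched in a few lines of Python, assuming facts are stored as plain tuples (that encoding and the function name are my own, not from the quoted text). The single rule is applied by matching every `orders` fact against every waitress:

```python
# Facts from the restaurant domain, encoded as (predicate, arg1, arg2, ...)
facts = {
    ("at", "restaurant", "Alice"),
    ("at", "restaurant", "Bob"),
    ("at", "restaurant", "Carol"),
    ("works_at", "restaurant", "Carol"),
    ("has_job", "restaurant", "waitress", "Carol"),
    ("orders", "Bob", "pizza"),
    ("orders", "Alice", "sushi"),
}

def apply_serving_rule(facts):
    """orders(X,Y) and has_job(restaurant,waitress,Z) -> serves(Z,X,Y)"""
    orders = [f for f in facts if f[0] == "orders"]
    waitresses = [
        f[3] for f in facts
        if f[0] == "has_job" and f[1] == "restaurant" and f[2] == "waitress"
    ]
    return {("serves", z, x, y) for (_, x, y) in orders for z in waitresses}

derived = apply_serving_rule(facts)
print(sorted(derived))
```

A real inference engine would handle arbitrary rules and keep applying them until no new facts appear; this hard-codes the one rule to keep it short.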
Now GPU is faster with a 2000 by 2000 matrix
Symbolic AI has always had a huge flaw which GPT shares: neither anchors symbols / words / etc. to physical objects. They have no "world model". In this sense they are both naive algorithms. The real hard work is getting that world model, especially since it's a very complex world we live in, way more complex and messy than any simulation.
(GPT is way more efficient than the pure symbolic methods though, so it worked out better in being able to go through way more data and use induction)
(Training on text alone will never be enough for an AI to understand language, since language's meaning comes from our physical world)
(In addition, not only does it need to train on the real world (or at least a simulation of it), it also needs to be human-aligned, in the sense that it needs to assign meaning in the same way we do. We care about certain things that matter to us and therefore label them; it might not care about the same things (where does the chair object begin and end? well, for us it begins and ends where it's useful for us, but a computer does not need to sit, so where does it begin and end for it?))
I read it, pretty interesting tbh
I'm grateful, learned the term GOFAI as well as the interesting/educational reasoning about what it was and how it was thought about.
Also kind of doubtful about 'you can't learn language through just language' but that's just my intuition and I don't want to derail. 😅
I have a book on that approach at home
They use LISP and variants
One time a Dutch prof lectured at our uni on some of those topics, fuzzy logic, expert systems and neural net...he kept pronouncing Variables Var eyeable
It's on topic, but think about this, with text alone, how will an AI ever truly know what a chair is? Can I ask it to simulate one falling over?
I can simulate one falling over in my head.
On AI Dungeon (iirc running on OpenAI DaVinci) it literally described the process for building a chair to completion.
So it seemed to be able to verbalize what parts are and a finished project, inferring process from token inference.
To me that's understanding and I'm pretty sure that's both naive and subjective at the same time.
That's interesting
Being able to chain together words does not mean that it understands what a chair is. It never experienced a chair and never will. It has its own understanding you could say, but when we mean that it understands a chair, it means our understanding, not its own made up universe of text only.
I think a made-up universe of text (the kind of knowledge that the GOFAI critics pointed out, abstract from anything the computer might be able to understand beyond boiling it down to algebra) can still represent knowledge.
And if you're asking me to argue why I think a representation of knowledge is indistinguishable from knowledge I'm probably being the naive one. I don't really know enough to defend that position, just my instinct. 😅
So just to make sure that I got it right, when we talk about rule-based AI it's nothing more than plain if statements, aka glorified if statements
I guess I'm still stuck on primitive thoughts like even though all maps are inaccurate, Google Maps sure seems useful.
Yes it can represent knowledge, but it's not language like how humans have language, which is the goal. Humans have language linked to objects like being able to build a chair physically, not just give the instructions as words on how to do it. The issue is that it's only basing its knowledge on the language structure / predictability, which can get you decently far, but it's not enough for some cases, plus if I tell it the word "chair", it can't feel the chair with for example touch, there is no shared sensory organ / word relationship.
When it gives you the instructions to build a chair, it's not simulating the chair, it's just regurgitating the instructions that someone typed in at some point (with induction / mixed together responses).
like how humans have language, which is the goal
Honestly I call myself an NLP enthusiast at best but I never thought this was the goal.
If the computer operates with underlying abstraction layers that are totally different than our meat brains but we can still interrelate complicated language processes to each other, doesn't seem far-fetched to say that the computer's completely insular universe of text was still able to provide usefulness to our real world based one.
Well if you are into AI/AGI it's the goal, but GPT can be your end goal too, whatever floats your boat.
A decision tree would be an example of a rule-based system https://en.wikipedia.org/wiki/Decision_tree
You're correct that it's not AGI but it's still AI 🤔
Like pretend I put a baby in front of GPT and ask it to tell the baby how to build a chair, you're saying if the baby learned how to build the chair that's not AI?
I would certainly count it as AI.
Seems to me like possibly we don't disagree then, except on what the goal of NLP is.
Yeah cause you can represent the logic behind a decision tree with if statements and so on
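For instance, a hand-written decision tree really is just nested conditionals. The features and thresholds below are invented for the referee example; nothing here is fitted to data:

```python
def referee_decision(experience_years, distance_km, has_conflict):
    """Hand-written decision tree: every branch is an if statement."""
    if has_conflict:            # root node: conflict of interest?
        return "reject"
    if experience_years >= 5:   # second level: enough experience?
        return "assign"
    if distance_km <= 50:       # third level: close enough to assist?
        return "assign as assistant"
    return "reject"

print(referee_decision(experience_years=7, distance_km=120, has_conflict=False))
```

The ML version of a decision tree has the same shape; the difference is that the split features and thresholds are chosen by a training algorithm instead of a person.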
NLP in general could be anything involving natural language, including taking natural language as input and outputting random numbers.
Yes exactly 🙂
can anyone help me get back propagation working in my program, I understand the calculus, but not how to apply it
(It's about what the input is)
Thank you so much I feel like I'm fully aware of what rules-based AI is and I think that I can tackle the referee assignment problem myself👌
You are coding backprop in python?
trying to :)
Are you storing all your weights in numpy at least?
ofc
im just not using tf or keras
And I'm ready to stay up like a crackhead to finish this project
I have the whole thing pretty much done, I can create a model with some Dense layers and it will feed forward fine, just the back prop i cant get my head around
What do you have so far
in terms of code
okay
import numpy as np

class Dense():
    def __init__(self, units, activation):
        self.units = units
        self.activation = activation
        self.type = "Dense"

    def initialise(self, num_inputs):
        self.weights = (np.random.rand(num_inputs, self.units) * 2) - 1
        self.bias = np.random.rand()

    def forward_propagate(self, inputs):
        self.z = np.dot(inputs, self.weights) + self.bias
        self.a = self.activation(self.z)
        return self.a

    def back_propagate(self):
        pass
This is my dense layer
I have written then deleted attempts at back prop many times
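One possible way to fill in `back_propagate`, as a sketch and not the only design. It departs from the class above in a few labeled ways: the layer also takes the derivative of its activation (`activation_prime`), it caches its inputs during the forward pass, and the bias is one value per unit instead of a single scalar. The chain rule then splits into three pieces: gradient w.r.t. `z`, gradient w.r.t. the parameters, and the gradient handed back to the previous layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

class Dense:
    def __init__(self, units, activation, activation_prime):
        self.units = units
        self.activation = activation
        self.activation_prime = activation_prime  # assumption: passed in

    def initialise(self, num_inputs):
        self.weights = (np.random.rand(num_inputs, self.units) * 2) - 1
        self.bias = np.zeros(self.units)  # assumption: one bias per unit

    def forward_propagate(self, inputs):
        self.inputs = inputs  # cached for the backward pass
        self.z = np.dot(inputs, self.weights) + self.bias
        self.a = self.activation(self.z)
        return self.a

    def back_propagate(self, grad_a, learning_rate):
        # grad_a is dLoss/da from the next layer, shape (batch, units)
        grad_z = grad_a * self.activation_prime(self.z)  # dLoss/dz
        grad_w = self.inputs.T @ grad_z                  # dLoss/dW
        grad_b = grad_z.sum(axis=0)                      # dLoss/db
        grad_inputs = grad_z @ self.weights.T            # for previous layer
        self.weights -= learning_rate * grad_w
        self.bias -= learning_rate * grad_b
        return grad_inputs

# Tiny demo: one layer learning logical OR with mean squared error
np.random.seed(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

layer = Dense(units=1, activation=sigmoid, activation_prime=sigmoid_prime)
layer.initialise(num_inputs=2)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(layer.forward_propagate(X), y)
for _ in range(2000):
    pred = layer.forward_propagate(X)
    grad_a = 2.0 * (pred - y) / len(X)  # dLoss/da for mean squared error
    layer.back_propagate(grad_a, learning_rate=1.0)
final_loss = mse(layer.forward_propagate(X), y)
print(initial_loss, final_loss)
```

With several layers, call `back_propagate` on the last layer first and feed the returned `grad_inputs` into the layer before it, walking the list in reverse.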
Heyy. How can I use matplotlib in VS Code on Mac? I need it to make graphs for a physics project
I'm trying to predict product sales (of different products in different times) in relation to stock.
I know I can use fbprophet - but I'm not sure how I'd set up a relation between the regressor (the stock) and the sales (the predicted timeseries) so that every time a sale is predicted, the stock is reduced and a new prediction is run with the new input.
Does anyone know easier ways of doing this using other models? Is there an easier way to do it using fbprophet?
Data frames are always two dimensional. What do you mean?
Like i have one x but two y data overlayed
You can have one dataframe for each xy combination
how do you impute missing values ><
Im looking to impute Mean for a few and Mode for frequency on others
I already removed all null values, but I want to at least attempt imputing them before I move on with my model
i want to do mode for horsepower
What's up Python gang, when I concat two pandas DataFrames with the same number of rows, why does it end up giving me NaN in the last column and adding a row?
I know why.. it's because 128 rows and 127 rows are not same length
but so weird, how is a row missing from the same data?
u can remove all nulls from ur data set
I have this dataset, df, which is 127 rows by 13 columns. The first column, Lender, is going to be one-hot encoded (OHE), and I'd like to make columns for each of the unique lender names.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Lender.values.reshape(-1, 1)).toarray()
dfOneHot = pd.DataFrame(ohe, columns=['Lender_' + str(cat) for cat in ohc.categories_[0]])
dfh = pd.concat([df, dfOneHot], axis=1)
```
this is the result
I'd like to drop my Lenders column and put these in instead. Simple concat works but when I do concat on the last line where I set the value equal to dfh it adds another row?
@lapis sequoia you can use fillna with the mean. It's slightly more complicated for mode. I'm on mobile so I can't show you
Don't normalize sharing screenshots of code. Use the !code command.
What does that do
Tells you how to post code
okay
Right
df['Price'].min().fillna.mean()
had an issue there
df['Price'].fillna(df['Price'].mean())
Yeah that spat out an error
I'm on mobile but I typed it anyway bc I appreciate you
Show whole error from traceback
Saying that something "caused an error" is opaque.
what you wrote worked
Yay
so now i have this small dataframe but
```python
df = df['Price'].fillna(df['Price'].mean())
```
OH
this should now have imputed average into the missing values. time to check
Also you're replacing the whole df variable with one column
it would be great if you could look at the distribution of the data to make sure that filling in Price with the mean doesn't bias it, which can lead to an overfit model
but again, this depends on how many NaNs the Price column has; if it's just a few then it's no biggie
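A small sketch of both imputations, using a made-up four-row frame (the values are invented; the `Price` and `horsepower` column names come from the messages above). The key detail is assigning back to the column, not to `df`, so the rest of the frame survives:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "Price": [10000.0, np.nan, 14000.0, 12000.0],
    "horsepower": [110.0, 110.0, np.nan, 95.0],
})

# Mean imputation: assign to df["Price"], not to df
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Mode imputation: mode() returns a Series (ties are possible),
# so take its first entry
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mode().iloc[0])

print(df)
```

Checking `df["Price"].isna().sum()` before imputing tells you how many values you are about to invent, which is the "how many NaNs" question raised above.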
oh I see.
hmmm im still digging myself into a larger hole here.
It's way easier to just eliminate all null values
Why is it that I'm one-hot encoding a DataFrame column with 127 rows but it spits out a 126-row DataFrame??
Sounds like that's not going to work too well
Can you elaborate on that please?
so youre trying to basically
take one column and turn that into x amount of columns instead?
Noooo, there's 34 unique values inside this specific column. So we'll make 34 new columns but why is it losing a row?
here's my OHE code:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Lender.values.reshape(-1, 1)).toarray()
dfOneHot = pd.DataFrame(ohe, columns=['Lender_' + str(cat) for cat in ohc.categories_[0]])
dfh = pd.concat([df, dfOneHot], axis=1)
```
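One likely cause of the extra row, though this is a guess since the data isn't shown: `pd.concat(..., axis=1)` aligns on the index, not on position. If rows were dropped from `df` earlier (for example with `dropna`), its index has gaps, while a freshly built DataFrame is numbered 0..n-1, so the union of the two indexes has more rows and NaNs fill the holes. The sketch below reproduces that with a made-up frame and swaps in `pd.get_dummies`, which keeps the original index, in place of `OneHotEncoder`:

```python
import pandas as pd

# Hypothetical data; dropping a row leaves the index as 0, 2, 3
df = pd.DataFrame({"Lender": ["A", "B", "C", "A"], "Amount": [1, 2, 3, 4]})
df = df.drop(index=1)

# get_dummies keeps df's index, so concat lines up row for row
one_hot = pd.get_dummies(df["Lender"], prefix="Lender")
encoded = pd.concat([df.drop(columns="Lender"), one_hot], axis=1)

# Reproducing the problem: a fresh 0..n-1 index no longer matches df's index
misaligned = pd.concat([df, one_hot.reset_index(drop=True)], axis=1)

print(len(df), len(encoded), len(misaligned))
```

If you want to keep `OneHotEncoder`, passing `index=df.index` when building the one-hot DataFrame, or calling `df.reset_index(drop=True)` first, should line the two frames up the same way.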
You have to one hot encode each feature separately. But it sounds like you have too many features
No way. As a programmer we're too robust to be OHE each individual unique value?
@thin palm well, you only one hot encode nominal features. I don't know what your features are.
can u not manually introduce another row or something
instead of one-hot encoding you can apply feature engineering
such as PCA, or creating a new feature column by grouping similar values
Do you understand when you would or wouldn't use one hot encoding?
can someone help me understand this code?
```python
import torch
import torch.nn as nn

class MyRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyRNN, self).__init__()
        self.hidden_size = hidden_size
        # both layers read the input concatenated with the previous hidden state
        self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.in2output = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, x, hidden_state):
        combined = torch.cat((x, hidden_state), 1)
        hidden = torch.sigmoid(self.in2hidden(combined))
        output = self.in2output(combined)
        return output, hidden

    def init_hidden(self):
        return nn.init.kaiming_uniform_(torch.empty(1, self.hidden_size))
```
so RNNs take an input and a hidden state
and then they give an output and a hidden state
the neural net that produces the output is this:
```python
self.in2output = nn.Linear(input_size + hidden_size, output_size)
```
the neural net that produces the new hidden state is this:
```python
self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
```
but this makes no sense to me
in this youtube tutorial explaining RNNs, the food represents the input
and the weather represents the hidden state
Yes I understand the difference between different encoders and in the case of text we need to One Hot encode so our computer can understand text
@plush jungle are you sure? I would expect both to be the input
It's more complicated than just "understand text"
oh
then where is the hidden state in this diagram
Let me check so I don't mislead you. But I'm pretty sure all the nodes in the middle are the hidden state and the food and the weather are features
that would make total sense
Also it might be a while before I can look into it.
except that the code I posted has two neural net layers
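A NumPy sketch of what the two layers do at each step, in case it helps separate "hidden state" from "hidden layer". It mirrors the `forward` method above but is not the PyTorch code itself: biases are left out, the weights are random, and the sizes are arbitrary. Both matrices see the same concatenated vector; one produces the output for this step, the other produces the hidden state that is carried to the next step.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# Two weight matrices, both applied to [input, previous hidden state]
W_hidden = rng.normal(size=(input_size + hidden_size, hidden_size))
W_output = rng.normal(size=(input_size + hidden_size, output_size))

def rnn_step(x, hidden):
    combined = np.concatenate([x, hidden], axis=1)
    new_hidden = 1.0 / (1.0 + np.exp(-(combined @ W_hidden)))  # sigmoid
    output = combined @ W_output
    return output, new_hidden

hidden = np.zeros((1, hidden_size))            # initial hidden state
sequence = rng.normal(size=(5, 1, input_size))  # 5 time steps, batch of 1
for x in sequence:
    # the hidden state returned here is fed back in on the next step
    output, hidden = rnn_step(x, hidden)

print(output.shape, hidden.shape)
```

So the hidden state is the vector that loops back into the network between steps; it is not a separate input feature like the food or the weather.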