#data-science-and-ml
I'm checking the csv
I don't see any nan values
even if there are I avoided those columns
uh
why aren't you doing it programmatically
it's in a DataFrame
you can query it
okay...?
lol I know I'm not the best
Oh i see the error now
ok yeah there is a loss now
no more nans
yup
that's good
btw for future reference
you can use df.isna().sum()
to see whether there are nulls on a column-wise basis
or df.isna().sum().sum() to get the total number of nulls
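A minimal sketch of the two calls mentioned above, run on a made-up frame (the column names and values are assumptions for illustration, not the CSV being discussed):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "income": [50_000, 62_000, np.nan],
    "city": ["Oslo", None, "Lima"],
})

nulls_per_column = df.isna().sum()         # Series: one null count per column
total_nulls = int(df.isna().sum().sum())   # single number for the whole frame

# Rows containing at least one null, handy for inspecting the offenders.
rows_with_nulls = df[df.isna().any(axis=1)]

print(nulls_per_column)
print("total nulls:", total_nulls)
print(rows_with_nulls)
```

Note that `isna()` does not flag infinities, so an `inf` in a loss would need a separate check such as `np.isinf`.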
yeah...
well.
pandas knowledge is really important IMO
scanning a CSV by hand is hell
yeah I can see that
I was looking at the tutorial before I fell asleep
the way you can handle data is really useful
IMO you should be @ least intermediate with pandas and numpy
before you even think of touching TF/torch
yeah I'm gonna take a look at those
100%
I just thought it would be cool to learn ML
I went in without really knowing what I needed to know
I also got an inf in one of my losses
The values of loss change depending on what optimizer I use
it's cool
but
not many people can pick it up right off the bat.
and if your fundamentals are weak
you run into a lot of problems
without knowing how to solve them
Well I just read the errors
and mess with the code for like 30 minutes or something
before I ask for help
yeah, I think you depend way too much on getting help
and if your fundamentals are weak
@velvet thorn largely because of this
to be fair, part of it is about experience
Well I can't say much about that since I'm new to this
I just mess around with stuff I know and read the docs
Not that I can understand a lot of it
I would suggest at least 2-3 months of quality Python experience before beginning to touch deep learning
What would you consider quality then
Because I feel like i'm good at python but there are things that I miss
@desert parcel don't think so TBH
I know I'm not good at it but I just feel like I'm good at it lol
even on the knowledge level
I did say feel
for example, are you familiar with decorators, context managers, or the descriptor protocol?
never heard of the last one
__get__?
lol then nvm
the descriptor protocol underlies properties
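For readers who have not met the term: a descriptor is a class attribute whose type defines `__get__` (and optionally `__set__`), and `property` is one built-in example. The sketch below is only an illustration; the `Positive` and `Layer` classes are hypothetical names and are not taken from the conversation.

```python
class Positive:
    """A minimal data descriptor that rejects non-positive values."""

    def __set_name__(self, owner, name):
        # Called once at class creation; remember where to store the value.
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not on an instance
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError("value must be positive")
        setattr(instance, self.private_name, value)


class Layer:
    units = Positive()  # instance attribute access is routed through the descriptor

    def __init__(self, units):
        self.units = units  # triggers Positive.__set__

    @property
    def doubled(self):
        # property objects implement __get__ too, which is why this works
        return self.units * 2


layer = Layer(8)
print(layer.units, layer.doubled)
print(hasattr(property, "__get__"))
```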
anyway, IMO quality is about building things that stretch your capabilities
and developing knowledge
This being python for beginners
breadth is really useful.
not the tut for getting into ml
because everything is connected
a bit of computer science knowledge is also really nice for ML
I have none of that
mostly because I'm still in highschool and I haven't really searched up any vids on CS
Maybe I should do that
the nice thing
about life nowadays
is that you're no longer locked into your major
I don't have a CS background either
nothing even close
but, yeah, knowing the answer to questions like "why is it faster to tell if an element is in a set vs a list" will come in handy someday.
An element in a set doesn't have repeating values
right?
like if there are repeats only one instance will be printed, not sure if the right language is used
uh...
yes but no
I mean, nothing of what you have said is wrong
but what I mean is
3 in {1, 2, 3} vs 3 in [1, 2, 3]
the former is faster; why?
and that is a CS question
no point in me telling you
ik
the thing is that these things are not obvious, but they will be important sometime in the future
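One way to explore the set-vs-list question yourself is to count how many equality checks each membership test performs. This is a sketch under assumptions: `Counted` is a hypothetical helper class, and the counts reflect how CPython implements `in` for lists (linear scan) and sets (hash lookup).

```python
class Counted:
    """Wraps an int and counts how many equality checks are made."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self.value == other.value


items_list = [Counted(i) for i in range(1000)]
items_set = set(items_list)
target = Counted(999)  # equal to the last element, but a different object

Counted.eq_calls = 0
found_in_list = target in items_list  # scans element by element
list_checks = Counted.eq_calls

Counted.eq_calls = 0
found_in_set = target in items_set    # jumps to the bucket given by the hash
set_checks = Counted.eq_calls

print(found_in_list, list_checks)
print(found_in_set, set_checks)
```

The list has to compare against every element until it finds a match, while the set uses the hash to go almost directly to the right slot.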
I'm checking it out
just an illustration
anyway I'm out
have fun learning!
I was in your position like a year ago
it's a great journey
Well thanks for the constructive criticism
Really did bring some things to light
or shine
Do I need to know the foundational linear regression and kNN algorithms, plus some algebra, matrices, probability and statistics, calculus, and numpy completely in order to start out in machine learning? I'm interested in computer vision
You don't need it to implement basic algorithms, but it's very handy if you want to know what you're doing.
I mean, you don't need a huge maths course, but you need to know a thing or two about probabilities and calculus.
@lapis sequoia probability, calculus, linear algebra. i agree you dont need to learn it all at once, but you should definitely start learning it and seek to keep learning it over time.
@desert oar Thanks buddy
COMMANDLINE Video Player - convert video files to ascii art
upvotes 263 comments 21 user Slingerhd
What is the most impressive Python based project you have seen?
Sometimes I find that Python can be so much more, but people use it mainly in data science (which is fine). Wonder any...
upvotes 38 comments 37 user vitsensei
Crime Watch: An Interactive Way To View Crime
​ A Demonstration Of Crime Watch Github Link...
upvotes 29 comments 8 user python959
Python logo in colored ASCII art!
upvotes 28 comments 2 user Honno
A DoS attack in 15 lines of code.
Hi, I have tried to create the simplest possible denial of service attack; for this script, I have not used more than 15...
upvotes 6 comments 26 user progsNyx
Hi Guys,
I have a pandas data frame like this
The driverBreakdown column here is a nested dict
'environment': {'average': 5,
'questions': {'Question 1': 5}},
'peerRelationship': {'average': 4,
'questions': {'Question 2': 4}}},
'Mood': {'average': 5.0,
'mood': {'average': 5,
'questions': {'Question 3': 5}}},
'RewardsAndRecognition': {'average': 1.0,
'recognition': {'average': 1,
'questions': {'Question 4': 1}}}}
I would like to convert the driverBreakdown column into multiple rows in this way
is there any way to achieve this directly via pandas
and by not using multiple python iterators
Hello Everyone
@viral scroll i would do a combination
- write a function to "flatten" each nested dict
- "explode" the flattened dicts into dataframe rows
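The two steps above can be sketched as follows. This is only an illustration under assumptions: the target layout (one row per question) is a guess because the desired output was not shown, `employee_id` is a hypothetical key column, and the sample dict uses only the two drivers that are fully visible in the question.

```python
import pandas as pd


def flatten_breakdown(breakdown):
    """Turn one nested driverBreakdown dict into a list of flat records."""
    records = []
    for driver, driver_info in breakdown.items():
        for sub_driver, sub_info in driver_info.items():
            if sub_driver == "average":  # driver-level average, not a sub-driver
                continue
            for question, score in sub_info["questions"].items():
                records.append({
                    "driver": driver,
                    "driver_average": driver_info["average"],
                    "sub_driver": sub_driver,
                    "question": question,
                    "score": score,
                })
    return records


df = pd.DataFrame({
    "employee_id": [1],  # hypothetical key column
    "driverBreakdown": [{
        "Mood": {"average": 5.0,
                 "mood": {"average": 5, "questions": {"Question 3": 5}}},
        "RewardsAndRecognition": {"average": 1.0,
                                  "recognition": {"average": 1,
                                                  "questions": {"Question 4": 1}}},
    }],
})

# Step 1: one list of flat records per row. Step 2: one row per record.
exploded = (
    df.assign(record=df["driverBreakdown"].map(flatten_breakdown))
      .drop(columns="driverBreakdown")
      .explode("record", ignore_index=True)
)

# Expand each record dict into its own columns (assumes no row had an empty list).
flat = pd.concat(
    [exploded.drop(columns="record"), pd.DataFrame(exploded["record"].tolist())],
    axis=1,
)
print(flat)
```

The flattening itself is still a Python-level loop per row, so on very large frames it is worth timing on a sample first.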
hello, in which channel can i clear my doubts about python?
@desert oar
Any suggestions on how to optimize the flatten part? My data set can contain up to a million rows and I am afraid that flattening each row could be time-consuming.
Also, Thanks for letting me know about the explode function.
@tribal hornet you should clarify what exactly your doubts are, then ask in a help channel: #how-to-get-help
where can i learn Linear Algebra
Hello, can somebody help? I am planning to build driver distraction detection using OpenCV, and I am thinking of adding a feature that shows the number of alerts the driver gets while driving. How can I get the data?
you convince a lab or some agency to fund your research
because I am 99.9% sure those data are not publicly available if they exist at all
I've got a matplotlib question:
I have a plot using 2x2 subplots, and things are laid out properly
But when I add a column there is a large gap between the rows
and changing the figsize doesn't help
let me take some screenshots and show examples
```
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]]
```
anyone have an idea how to make this mask in numpy
please ping me if you have an answer
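A possible answer, offered as a sketch: the start of the pasted mask is cut off, so the code assumes either a 4x4 lower-triangular mask (of which the visible rows are the last three) or a 3x4 mask that starts at `[1 1 0 0]`. `np.tri` covers both cases.

```python
import numpy as np

# Assumption 1: the full mask is 4x4 lower-triangular.
mask = np.tri(4, 4, dtype=int)
# Equivalent: np.tril(np.ones((4, 4), dtype=int))

# Assumption 2: the mask is 3x4 and starts at [1 1 0 0];
# k=1 shifts the diagonal one column to the right.
mask_3x4 = np.tri(3, 4, k=1, dtype=int)

print(mask)
print(mask_3x4)
```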
Pre-adding a column:
the pics are intentionally low-quality
the content of the plots is irrelevant here
post-adding a column:
why is there a large gap between the two rows?
I ran plt.figure(figsize=(20, 20)) before both
I'm assuming that the figsize affects the subplots, since the bottom pic is clearly not square
because the total figure size is now 20 by 20. that's an aspect ratio of 1:1. for those you'd need an aspect ratio of like 2:3 i believe
wait is there whitespace after the lower ones?
@last wind if the figsize affects the total size then why is the bottom picture clearly non-square?
no
that's the full figure
thanks
I set figsize=(20, 30) and got this:
lemme try with (30, 20)
oh yea that solved it
still don't fully understand how that works
but ¯\_(ツ)_/¯
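A likely explanation, sketched under the assumption that each subplot has a fixed square aspect (for example an image drawn with `imshow`): `figsize` is `(width, height)` in inches for the whole figure, so a grid with 2 rows and 3 columns of square plots wants a width-to-height ratio of 3:2. In a 20 x 20 figure each cell is taller than it is wide, and the leftover height shows up as a gap between the rows. The `cell` size below is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rows, cols = 2, 3
cell = 10  # inches per subplot; arbitrary value chosen for illustration

# figsize is (width, height): 3 columns wide and 2 rows tall gives (30, 20).
fig, axes = plt.subplots(rows, cols, figsize=(cols * cell, rows * cell))
for ax in axes.flat:
    # imshow uses equal aspect by default, so each axes is drawn square.
    ax.imshow(np.zeros((8, 8)))
fig.tight_layout()
```

Passing `figsize` to `plt.subplots` directly also avoids depending on an earlier `plt.figure(...)` call.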
If I have an intake csv file from an animal shelter and an outcomes csv file from an animal shelter, but there are about 200 more records in the outcomes csv, how could I remove those so I can nicely join dataframes from the two csv files with pandas?
there are types of joins you can use
idr how exactly pandas does joins
but shouldn't you be able to do a Left Join (assuming intake is on the left)?
@jolly sinew ^
Hi guys, I'm learning numpy and want to make face recognition using opencv, how do i do that
ik numpy basics and some essentials, so should i start learning opencv?
or do i need to know how ml works
@solid aurora I tried a merge on the animal ID column, but it is not a unique column because sometimes an animal with the same animal ID is recorded / admitted multiple times, so merging on animal ID multiplied those records. However, I really appreciated your advice and I'll try the left join.
So there's not really a good primary key
@jolly sinew hmm maybe generate a new column which is f"{animal_id}-{visit_number}"
so if animal 1 gets seen 3 times, the column will be 1-1, 1-2, and 1-3
Oh nice, that's a good idea
that's assuming that the intake and outtake happens sequentially and there are no missing records
that would totally break if there is an outtake that's not recorded
I'm going to give it a shot
Am I allowed to post links to the datasets here? It might make more sense if you could see their general shape
Is anyone familiar with the package MIP? I have a somewhat complex mixed integer LP problem I am trying to solve and the package seems to be running into numerical artifacts, or getting stuck trying to find a solution without terminating. I can't find much information online on how to troubleshoot it.
uhh
what exactly do you want to achieve?
maybe it is using wrong numeric methods
@solid aurora I found a solution thanks to your help! I used cumcount to generate a new column of occurrences for each animal id and then did a left join on both the animal id and the occurrences columns
outcomesdf["Occ_Number"] = outcomesdf.groupby("Animal ID").cumcount()+1
intakesdf["Occ_Number"] = intakesdf.groupby("Animal ID").cumcount()+1
fulldf = pd.merge(intakesdf, outcomesdf, on=['Animal ID', 'Occ_Number'],
                  how='left', validate="1:1")
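To see why the occurrence counter helps, here is the same approach on made-up toy data (the animal IDs and the `Intake Type` / `Outcome Type` values are invented). It assumes both files list each animal's visits in chronological order; if they do not, sort by the relevant date column before calling `cumcount`.

```python
import pandas as pd

# Toy data: animal A1 is admitted twice, and outcomes has one extra record (A3).
intakesdf = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A1"],
    "Intake Type": ["Stray", "Surrender", "Stray"],
})
outcomesdf = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A1", "A3"],
    "Outcome Type": ["Adoption", "Transfer", "Return", "Adoption"],
})

# Merging on the ID alone pairs every A1 intake with every A1 outcome.
naive = pd.merge(intakesdf, outcomesdf, on="Animal ID", how="left")

# Number each animal's visits in row order, then merge on (ID, visit number).
intakesdf["Occ_Number"] = intakesdf.groupby("Animal ID").cumcount() + 1
outcomesdf["Occ_Number"] = outcomesdf.groupby("Animal ID").cumcount() + 1
fulldf = pd.merge(intakesdf, outcomesdf, on=["Animal ID", "Occ_Number"],
                  how="left", validate="1:1")

print(len(naive), len(fulldf))
print(fulldf)
```

With `validate="1:1"`, pandas raises an error if the key pair turns out not to be unique on either side, which makes the assumption explicit.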
@jolly sinew glad to hear that!
I am trying to solve a convex mixed integer LP problem. I'm not sure how else I can describe it without going into extreme detail.
I read a recent paper by Garvie and Burkardt that showed their (unrelated) LP problem would not converge using gurobi, but would converge reliably with other solvers. I'm not sure how often something like that occurs in practice before I try to reimplement this entire thing.
what's the best approach to store data to access it afterwards?
i want to build a face recognition db, which would store the id, entrances on screen and such stuff, now i'm confused on how to store it better, would an object be appropriate?
@upbeat ore use a proper database such as sqlite or postgresql or smth
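As a rough illustration of the sqlite suggestion, the sketch below uses Python's built-in `sqlite3` module. The schema (tables `person` and `entrance`, and their columns) and the `record_entrance` helper are hypothetical; the real layout depends on what the recognition pipeline produces.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # use a file path instead to persist between runs
conn.executescript("""
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE entrance (
    id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(id),
    seen_at TEXT NOT NULL
);
""")


def record_entrance(conn, person_id, when=None):
    """Store one on-screen appearance for a known person."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO entrance (person_id, seen_at) VALUES (?, ?)",
        (person_id, when.isoformat()),
    )


person_id = conn.execute(
    "INSERT INTO person (label) VALUES (?)", ("visitor-1",)
).lastrowid
record_entrance(conn, person_id)
record_entrance(conn, person_id)
conn.commit()

count = conn.execute(
    "SELECT COUNT(*) FROM entrance WHERE person_id = ?", (person_id,)
).fetchone()[0]
print(count)
```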
@merry ridge tbh I don't think #data-science-and-ml is the best place to find linear programming advice
I'm not even sure how many machine learning engineers have any linear programming experience
I for sure have none, but then I'm just a high schooler
========================
Anyway I came here to ask
I'm not sure I agree with that statement, but I'm willing to admit I'm wrong.
Is there a functional difference between a high figsize and a high DPI value in matplotlib?
@merry ridge you may be more likely right than wrong - the machine learning engineers I've interacted with are all fresh out of college and focusing more on the business/analytics side than the math side
@merry ridge this is a perfectly fine place to ask, but i dont know how many people anywhere on this discord have experience with that
@solid aurora it's not that they don't know math. it's that linear programming isn't typically an important part of machine learning or data science nowadays, so it's not typically taught much. especially not to ML engineers who don't need the full methodological breadth that a researcher might need
it's probably good for a generalist to at least be aware of LP solvers and such, but i've certainly never needed it in my career
I figured that a lot of ML practitioners would frequently run into a problem where their black box algorithm fails and they need to troubleshoot it. Are the packages used more resilient than I think they are?
@desert oar oh yea I'm sure ML engineers know math, and are at least aware of what LP can do
just I doubt they have experience utilizing LP to solve such problems
@merry ridge oh no you're entirely right, black box algorithms fail often, just that i've never heard of someone turning to LP to solve them
@merry ridge ML practitioners rarely use black box algorithms
99% of the time machine learning is differentiable and solved with convex optimization methods like gradient descent
Could anyone advise me on how to proceed with building something like this? -- We got a surveillance system in a bar, and people sometimes fight here, bring weapons and stuff, we want to be able to identify weapons and people that are on ban list from cameras. My first idea, was using facenet, for face detection and recognition, what about weapons?, + if people come in with masks or hats, should i be looking at another data set? All suggestions are welcome. Thank you.
I think they mean things like neural networks where it's difficult to understand why something misperforms @desert oar
even so, explicit LP solvers just aren't used for that
mmhm ^
@upbeat ore I always like to say that obtaining good data is 80% of the work of creating an ML model
you're going to need to find a dataset of images where people are holding/concealing weapons
and then label it so you can create ground truth
that can help build a model where you detect weapons on people
^ this. you need a big labeled dataset of people holding weapons in various poses etc. and you need to make sure you arent accidentally training a racist model. basically this is a huge task that even major well-funded police departments have completely failed to successfully tackle, and i doubt you will be able to do it on your own.
and you need to make sure you arent accidentally training a racist model
THIS ^^^
but if you really feel like you want to try it, someone will have to sit down and label potentially thousands of still frames of security footage
what's the fastest face recognition right now ?
there are data labeling tools you can use or purchase for that task. then you gotta actually build a model on top of it which will likely require significant gpu computing.
that said apparently you can do OK fine-tuning existing models like YOLO v3 https://eng-memo.info/blog/yolo-original-dataset-en/
i would start there
Do you mind telling me a bit more about what you do salt rock lamp? I'm just curious because the kind of work I see actively used and considered within the realm of ML sounds very different from yours.
@upbeat ore you don't want fastest, you want most accurate
I can describe the last few problems I've worked on if that helps. I just consider myself in data science
I can make a "face recognition toolkit" that just assigns labels randomly
it can run in 0 ms, but it will have absolutely terrible performance
There's always a good medium balance between speed and accuracy
well, i was thinking that i need a fast one to be able track the faces and weapons in between
and you 100% want to err on the side of accuracy
and still be able to output stuff on monitor
well what device is this running on?
a desktop computer with some sort of GPU should easily be able to handle 60fps if it's like a convnet
easily is an overstatement tbh
would you mind explaining how to go about this? so basically i use yolo to detect the human, then extract the human and detect the face with facenet, then search in db for banlist, then look back at the full human and try to find the weapon there
or there's a better approach, sorry if this sounds stupid
you might have a lot of false positives with that system, although i guess thats good enough to go and send a guard to visually inspect
yeah that was the idea, just to know before hand
as people with weapons sit near the gambling machines, we got 3 of those
and usually its like +10 minutes before the real stuff happens, so this is the frame we would like to catch and disable the person to not harm himself or any others
i mean this is not a small task and you'll have to test it a lot
but we are just now at a point where maybe this tech is within reach for a bar
there is no timeline for it
and not like, a giant company
@merry ridge sure
data science is a pretty broad range of job titles and tasks
i always like to know what other people work on
@desert oar The last three projects I've had were classifying electricity prices to detect moments when a player in the market could change their bidding strategy to alter spot prices in a significant way; looking at modeling strategies to predict mean electricity price levels given some anticipated regulatory changes next year; and modelling how congestion affects crude oil prices in the US Gulf Coast. The last several data science positions in oil & gas, pipeline, electricity and other commodity markets I applied for all required very strong LP programming knowledge (which I am not that great at) because they still use quite a lot of excel models in house.
It's mainly because a lot of this stuff depends on economic factors, so the LP part mostly handles finding price equilibrium
i see. im mostly doing nlp classification nowadays, although my background was more in social science and statistics
But I don't exactly enjoy it. It is finicky and frustrating to work on
hm yeah. honestly i only learned the simplex method in school
like i said its just not something ive ever needed
but i can probably think of times in the past where maybe it might have come in handy
e.g. i used to work in business travel, there were some problems i had at that job that i couldnt easily solve with standard "fit a predictive model" techniques
A lot of this is being handed a paper that showed great results at some conference I've never been to
and being told to implement it in some completely different context that doesn't always even make sense
To fish out some competitive advantage and it is pretty tiring having nothing ever work
Hey is this the right chat for machine learning based questions?
@iron rampart yes
Alright, so is it possible to create an machine learning script that can learn how to use a computer?
@iron rampart uhh....woah good question, I think so in terms of the operating system but how far would you want it to go, like opening notepad or...
Well, doing tasks on its own
Lets start simple
So ive created a "bot" that can open spotify by moving the cursor to the right x and y coords. And then click
hmm....I believe that's possible but, im not sure how to go about that
But when i move the spotify icon it will be completely useless. So is there a way it can learn by itself where it is?
gonna have to research
Or should i start with a simpler idea
nah that sounds good, doing automation with spotify but I'm new myself so I'm not sure how I would go about it
what library do you use? @iron rampart
Euhm tensorflow?
@merry ridge fair enough, at least you have people with domain expertise guiding you. in most of my work im completely doing it all from scratch and i have no clue whats going on
grass is greener i suppose
Oh I certainly have no idea what I am doing
@merry ridge
@iron rampart "learn how to use a computer" is a big and ill-defined task. this is really more of an AI question than a ML question anyway
they are kinda tbh I believe machine learning has to do with models and coming up with algorithms to train them
but that's just my observation so far other than that it seems similar
id say that machine learning is "lower level"
AI would be like carrying out sequences of tasks, and reacting to unexpected input
whereas ML is simpler clearly-defined tasks
e.g. something like "identify individuals with weapons in surveillance video" is machine learning
but "identify threatening individuals in surveillance video" is AI, because then the model needs to learn the general concept of "threatening"
@desert oar ahh I see, that makes better sense
however im not an AI practitioner so i can't claim to speak for the industry
but that's how i separate the two in my mind
in common data science practice, ML usually means making one-off predictions in a live/production setting
or generally just making predictions without human input
or even more loosely, it's sometimes just used to refer to techniques for building models that aren't "traditional" statistics
or even just for building models without really being concerned with statistical inference
its weird because its used all the time but nobody seems to have a good definition for what ML really is
@desert oar Wow you seem pretty into machine learning... could you tell me where to start learning it?
Can anyone suggest some good projects which can be done by an intermediate data science learner but like mainly about data cleaning and preprocessing??
how much level of code should a data scientist know vs a programmer/coder
pretty much comparable to any programmer if you are dealing with deep learning per se
otherwise, just a fundamental understanding of algorithms and statistics is sufficient
im pretty new and interested in this field.. im learning code right now and I finished a stats course last year in college.. is deep learning a masters level thing?
oh thats cool, no deep learning is not really a masters thing. It just happens to demand some prerequisites that are based in linear algebra, diff calc. And it has more to do with neural networks.
but otherwise like machine learning concepts it is pretty easy to learn
https://www.coursera.org/learn/machine-learning
I recommend this introductory ML course to everyone. It covers linear and logistic regression, basic unsupervised learning, some Support Vector Machine stuff and neural networks (including personally implementing backpropagation).
So essentially the basics of all fields, and with very little required knowledge - only basic linear algebra, which the course provides materials and a refresher for.
free, too.
damn i haven't taken linear algebra in college
@tidal bough yes, that is a good place. If any one is interested in deep learning I would highly recommend http://www.deeplearningbook.org/ and the UCL deepmind lecture series on youtube
i only took differential calculus and even then i got a bad mark rip
...how did you take differential calculus without linear algebra? All the stability theorems are about matrix eigenvalues and stuff.
@crude karma dont worry it is not that difficult, just put your mind to it. you can easily learn so many concepts quickly. Just don't think of it as some advanced concept
uh
our courses are split
so like
differential calculus, integral calculus, then linear algebra
thats the progression
ah, I got it now
I thought you mean differential equations
differential calculus is indeed a lot earlier
i really have to re take diff calculus
i mean it depends on what you plan to do. @crude karma
you really won't need diff calculus to become an entry ML engineer or a data scientist.
Most of the linear algebra, differential calculus, statistics stuff only gets really important once you start going into research. Until then it's just familiarizing yourself with the techniques or models that have already been researched and proven.
thinking of going into industry rather than academia
well there's research in industry as well. Companies hire a number of ML researchers
i kinda disagree
linalg and stats are essential problem solving tools in my work
and calculus is just a necessary prerequisite for understanding pretty much anything
do you need it all on day 1? no. will you need it to actually make it through and understand the material? yes. will you be worse off without it? yes.
take it from me, who tried to get by for a long time learning as little "theory" as possible
if you don't know the underlying math at least roughly, you can't really know what you're doing or why it works / doesn't work.
intuition is nice but not enough
especially once you're past the entry level and the problems in front of you no longer resemble exactly things that you saw in your coursework and textbooks
maybe if you're lucky enough that your work allows you to dump everything into keras and call it a day, then fine
but i dont know of many people whose jobs are actually like that
^
this a thousand times
if you do not understand the concepts underlying the code you use (and I don't just mean the programming abstractions, but also the mathematics)
your work will be slow, inefficient and of low quality
and you will spend a ton of time not understanding the errors you get and the problems you face
I have trained data scientists and done freelance teaching
I cannot overstate the importance of a strong foundation
(where "errors" includes "my model kinda works but it performs badly sometimes", not just "ValueError")
What's the best and easiest python package to implement plots on my website, including playable interface and 3D graphs?
(where "errors" includes "my model kinda works but it performs badly sometimes", not just "ValueError")
@desert oar edited to clarify
why cant i do if (var = some_function): continuewithcode else: break
goal is to continue the code if a value can be assigned to var. If there's an error assigning a value to var, it should break the loop
@frank bone because that's not what `if` is for
but try-except
@frank bone in python 3.8+ you can do
if (var := some_function()):
    ...
tried except, worked well, thanks!
oh, i see
it wont help with your specific question, i realized
but its a useful new feature otherwise
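The difference between the two suggestions can be sketched like this. `int(token)` merely stands in for the asker's `some_function()`, and both helper functions are hypothetical names: `try`/`except` handles the "assignment raised an error" case, while the walrus operator only tests whether the assigned value is truthy.

```python
def parse_all(tokens):
    """Convert tokens to ints, stopping at the first one that cannot be converted."""
    values = []
    for token in tokens:
        try:
            var = int(token)  # stands in for some_function(); may raise
        except ValueError:
            break  # the assignment failed, so leave the loop
        values.append(var)
    return values


def first_words(lines):
    """Walrus variant: assign and test truthiness in one expression (Python 3.8+)."""
    words = []
    for line in lines:
        if (stripped := line.strip()):
            words.append(stripped.split()[0])
        else:
            break  # empty result, not an exception
    return words


print(parse_all(["1", "2", "x", "4"]))
print(first_words(["hello world", "  ", "never reached"]))
```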
Yeah I agree, but it depends on what exactly your work is, and how far you plan to go in terms of an ML/DS career @desert oar
By all means, actually understanding the underlying mathematics is vital to get far in ML. But an entry engineer doesn't really need to really know the underlying details.
For new people it might be worth just spending 4 - 5 months learning ML, and then getting an entry job for a while. And then come back and learn all those topics in depth
If you start getting into doing research, writing your own libraries, CUDA software, or just encountering problems that are very specific to a company, then those will be necessary.
In my opinion, at the absolute minimum even a very basic crash course in linear algebra provides a mountain of intuition that is helpful at all levels of skill. It is the least forgivable of the mathematical corners to cut.
yeah prob that, stats, and gradients
I've found stats to be a bit less useful when building nn's but it's still 100% important for ml
its more so for DS
Actual neural nets it's not as important unless you're working with probabilistic models
that's what I mean, it's more used in pre/post-processing
So I'll be honest. I have no idea why stats is important.
Are you rolling regression techniques into the area of stats? Probability theory into stats? In my mind they are different things.
stats and machine learning have the same problem
nowadays theyre mostly just references to different sets of problem solving approaches
the far ends of stats are very different from the far ends of machine learning, but its really hard to draw a line between them. to the extent that theyre even different things
so yeah, id say that fitting a probability model counts as stats
as do all the various forms of hypothesis testing
i can't say that NHST is that useful in industry nowadays, but the concept i think is very important to understand
Fair enough. At least in my experience, I feel like people cheat on the hypothesis testing part like crazy
true. depends on what industry
basically any time youre inferring parameters of a probability model, i have a hard time resisting the temptation to call that stats
fun debate topic: is holt winters smoothing stats or machine learning?
Here, you learn linear regression etc as part of your calculus curriculum before Lagrange multipliers, and most of the probabilistic tools are taught in a proper probability theory course separate from stats, which is why I ask.
I mean neural networks are just stats in a nutshell. We just made the process more incremental.
We're still finding a distribution of generated values that closely match the original dataset. You won't likely use direct stat techniques in a conventional neural net though, at least not directly.
For Bayesian/Probabilistic models, though, those are pretty much grounded on distributions.
And then stats also helps with data analysis. You need to know what you're working with before you do any feature engineering or data cleaning.
I mean neural networks are just stats in a nutshell. We just made the process more incremental.
i dont know if this is true
fun debate topic: is holt winters smoothing stats or machine learning?
ml
"machine learning is making predictions without explicitly making statistical inferences, except where making such inferences is convenient for making better predictions"
its way past my bedtime
means and variances are probability theory... but estimating them from data is statistics
what other aphoristic platitudes can i come up with at 2:30 AM
well ML was largely built on the basis of stats.
I guess its difficult to say that they're exactly the same, but they follow very similar principles.
The line is very blurred. I guess the question would be what is stats in the first place, cause statistics is a very large encompassing term.
arguing that it's just stats when a computer makes an inference on a dataset with a simple equation is dumb; even a basic neural net is a series of equations that allows those inferences to be made, so just because something has 'lower complexity' and can be done by hand doesn't mean it's not machine (or ig if you really want to be specific: mathematical) learning
I think the 'learning' part of it is misleading
but then again, mathematically(/statistically) induced inferences doesn't roll off the tongue
in all seriousness, ML is fundamentally a problem domain, whereas statistics is a set of techniques for fitting and making inferences from probabilistic models. it just so happens that we now have a lot of methods for making predictions that arent inherently statistical, but still fulfill the task of ML (and happen to be useful in many other places), so we call them "ML techniques" and the whole thing becomes a terminological mess
statistics is one tool for approaching the task of ML
but you can minimize loss functions without explicitly appealing to statistics
whereas much of statistics does depend on minimizing loss functions
basically its all historical nonsense
From my perspective, ML is mainly Numerical Analysis (which is basically taylor series and secant lines) and linear algebra with comparatively little stats sprinkled on the shoulders of those two giants.
see i'd argue ml is what happens when statistical models get combined with math
i dont see much numerical analysis used in ML at all @merry ridge , i think your experience is somewhat unique (and very interesting)
What about something like gradient descent?
sure
that is a core topic in every numerical analysis course
then, granted
its funny that traditional statistics historically depended on 2nd order optimization methods
when it turned out that gradient descent was good enough all along. maybe its just that computers used to be so much slower so you needed the faster convergence of 2nd order methods?
I mean I guess it depends on what you see ML models are doing. They fit to distributions on the data at the end of the day, whether they're good or not, using a number of statistical based techniques.
Well gradient descent on its own is not necessarily enough. We don't directly use the gradient anymore.
a neural network is not explicitly fitting any distribution, at least in the general case
is there always an implied model for the conditional expectation and an error distribution around said expectation? yeah
does that mean you can actually know or make use of that information? maybe, maybe not
One of the problems with higher order methods is that they may be faster, but they may be less numerically stable
doesnt computing hessians also get pretty gnarly
Truncating at x^2 is a bit of a trade off
what were you saying about the secant lines?
I say secant lines because a lot of derivatives are just approximated by one
ah
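to make the secant-line point concrete, a small sketch: the derivative at x is approximated by the slope of the line through two nearby points on the curve. the function name `secant_slope` and the step size are my own choices here:

```python
import math

def secant_slope(f, x, h=1e-5):
    """Slope of the secant line through (x - h, f(x - h)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x - h)) / (2 * h)

approx = secant_slope(math.sin, 1.0)  # numerical derivative of sin at x = 1
exact = math.cos(1.0)                 # analytic derivative for comparison
print(approx, exact)
```

shrinking h reduces the truncation error but eventually floating-point rounding takes over, which is part of why step size matters in numerical analysis.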
I mean the goal of a neural network in the majority of cases is to create values that are in the best case basically exactly matching to the real global dataset.
Unless I'm making a major error here, it seems to me that we're trying to approximate distributions
i dont think thats the case. you can say that your predictions should roughly follow the distribution of the target in the training data (assuming the features have the same distribution as in the training data), but that isnt really the goal
well roughly, because we don't have a true dataset encompassing all possible data points.
If we did we wouldn't need to worry about overfitting as much as we do now
however im pretty sure you can safely say without loss of generality that a neural network in a regression problem typically makes predictions of the form f(x) = E(Y|X=x)
right
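for reference, the usual argument behind that claim, assuming squared-error loss (the chat doesn't say which loss is meant, so that part is an assumption): the conditional expected loss splits into a term that doesn't depend on f and a term that vanishes at the conditional mean.

```latex
\mathbb{E}\big[(Y - f(x))^2 \mid X = x\big]
  = \operatorname{Var}(Y \mid X = x)
  + \big(\mathbb{E}[Y \mid X = x] - f(x)\big)^2
```

so the minimizer is f(x) = E[Y | X = x]. with a different loss the target changes, e.g. absolute error leads to the conditional median instead.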
also on the topic of gradients, it is worthwhile wondering if the gradient is really the most useful thing. I mean the brain is using some form of learning, and its pretty good. I don't know if its using gradient descent and backprop - there was a recent paper on this using inverse functions - or something else
Tho current research on how synapses form and strengthen between neurons seems to suggest it's based on repeated firing.
i mean, gradient descent is just the optimization algorithm that happens to work on networks
and it also lends itself to this really elegant formulation in terms of computation graphs and layers
im sure hexicle can attest to all kinds of other specialized ways to find the parameters of a function that minimize average loss on a dataset
true, but its largely suited for supervised learning
I mean, I know of them, but they are mostly a gigantic pain in the ass to use
if someone comes up with a better technique, then i dont think anyone is going to reject it out of hand, but youre going to have to make a case for why it beats the elegance and convenience of stochastic or batch gradient descent
oh yeah, there's been attempts at more biologically plausible neurons before
They haven't seemed to work as of yet though. No reason to use it unless it's actually performing above our current SOTA
right
hence all of these weird and interesting machine learning techniques have been relegated to history. goodbye to the radial basis functions and kernels and coordinate descent and support vectors
and this is why i dont know anything about anything
because i just dont need to anymore
A lot of the fancier techniques get increasingly complicated to the point of absurdity
One of my former Numerical Analysis professors would tell us stories of specialized rooms with paper that hung from the top of the walls to the floor and they would write high order methods from step ladders to figure out all the coefficients needed in a formula.
i mean everything's getting more complicated in general, until someone goes and abstracts portions of it
^^
i wonder how long the actual equations were...
It's been too long to remember the exact topic at the time
I'd have to ask them, but it is nearly impossible during the pandemic
yeah for sure
wait are u a grad student?
No
I keep an office at my alma mater because I am an editor for a discrete mathematics journal and I didn't want to "take my work home". Also it is nice having a space at the university even if I show up once every 4 months.
ah gotcha gotcha
salt I want your opinion on something ;)
heh sure
I've started exploring linux based systems and found a laptop called the Kubuntu Focus
which claims:
is that a thing?
i have no idea
like it's a i7-9750H 6c/12t 4.5GHz, 2060-80 rtx laptop for like 2.5k
and I get branding and all that
but I can't tell if that's bs or not
would you happen to know what an ai score is?
that sounds super dodgy.
do not buy a laptop for deep learning
like, ever ever ever
it will have a combination of shitty gpu, shitty ventilation, and being overpriced
as well as generally not being upgradable
do yourself a favor and get a better desktop for half the price
Hello - I would like to get some help condensing a large dataset. Was wondering if anyone has experience with this?
Hi @willow kernel, continuing from where we left off, let's write a little script to break up your dataset into separate files.
Sounds good!
You have a csv file yes?
Yes
Actually we should probably move to a help channel so we don't flood this one. Head over to #help-broccoli
any idea how you would go about converting "requests.models.Response" type of data (coming from an API call) into normal python type of format...dict, list, pandas?
i got it, converting it into json then going from there
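a rough sketch of that route, for reference: call `.json()` on the response to get plain dicts/lists, then hand those to pandas. `response_to_frame` and `FakeResponse` are hypothetical names, and the payload shape (a list of flat objects) is an assumption; a real API may need more unpacking:

```python
import pandas as pd

def response_to_frame(response):
    """Turn a JSON API response into a DataFrame.

    `response` is anything with a .json() method, such as the
    requests.models.Response returned by requests.get(...).
    Assumes the payload is a list of flat JSON objects.
    """
    payload = response.json()   # now ordinary Python lists and dicts
    return pd.DataFrame(payload)


class FakeResponse:
    """Stand-in for a real response so the sketch runs offline."""

    def json(self):
        return [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


frame = response_to_frame(FakeResponse())
print(frame)
```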
Hey @vague portal!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
Hmm how can I share an excel file with you guys? I need some help with analysing this dataframe
Hello all, I'm currently a comp sci student in college actively looking for motivated people to work on AI based projects with. TensorFlow would be the main API. I am always looking for like-minded people to collaborate with. Pardon me if this is an inappropriate place for a message like this. Pointers to other places where I could look for people would be great. Networking during a pandemic is hard and this was the first place that came to mind. Thanks!
Basically, I have a large dataframe (200,000 x 10) that I want to analyse. I want to be able to group the data by one of the columns, and then make sub-groups within these groups. I think I could do this manually, but it will take me forever, is there a quicker way to do this than using for loops?
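one way to avoid the for loops, sketched on made-up data: pass a list of columns to `groupby` so each group is split into sub-groups in one go. the column names below are hypothetical:

```python
import pandas as pd

# Toy stand-in for the 200,000 x 10 frame.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south"],
    "product": ["a", "a", "b", "a", "b"],
    "sales": [10, 20, 5, 7, 3],
})

# Group by region, then by product within each region, and aggregate.
summary = df.groupby(["region", "product"])["sales"].agg(["count", "sum", "mean"])
print(summary)
```

the result is indexed by (region, product), so `summary.loc["north"]` gives the sub-groups of one group.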
@charred agate check out R42 institute https://www.r42group.com/r42institutefellows
I'm on the water data science project and it would be helpful to have someone with machine learning experience to help us out
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
Hey thank you, I really appreciate it. I will check it out later today, it's 5am for me. Thank you again @vague portal ! Also @atomic forge , thanks for the resources.
np
I'm planning on buying and putting together a PC build with the new 3080 when it comes out. Do you guys know if it's possible to train small models on GPUs that aren't something like a Titan? I want to work on building a deep learning model to play chess.
has anyone used eta-squared in here (ideally in python)? have a question about it
Hi! I've been trying to screenshot data from a program so I could then convert it to text, because I wanted to automate the whole process. But for some reason, four separate screenshots are taken at the same time even though I set time.sleep() in multiple places. When I do the same thing without the program running (just the desktop visible), the screenshots are taken separately. How can I delay screenshots while inside the program?
this is the part of the code:
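```python
import os
import subprocess
import time
from datetime import date, datetime

import pyautogui as pag     # assumption: `pag` is pyautogui (import not shown in the original)
import pyscreenshot as scr  # assumption: `scr` is pyscreenshot (import not shown in the original)
import pytesseract

try:
    from PIL import Image
except ImportError:
    import Image


def ocr_core(filename):
    # Point pytesseract at the local Tesseract install, then OCR the image.
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    text = pytesseract.image_to_string(Image.open(filename))
    return text


# Raw string so the backslashes in the Windows path are not read as escapes.
os.startfile(r'C:\Program Files\Stellarium\stellarium.exe')
time.sleep(8)
pag.hotkey('f3')
pag.typewrite('Delta Cep')
pag.hotkey('enter')
time.sleep(4)

now_cep = datetime.now()
vrijeme_cep = now_cep.strftime('%m-%d-%Y %H-%M-%S')
folder = 'images/'
filename = ' Delta Cep.png'
output_cep = folder + vrijeme_cep + filename
time.sleep(2)
im = scr.grab(bbox=(0, 0, 1919, 1079))
im.save(output_cep)
```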
hey @charred agate
Yeah looking for motivated AI/ML people as well, maybe we can work on a project together
Guys, what is one of the fastest ways for a newcomer in AI to start writing machine learning code?
Where should i start with machine learning?
@proud iron doing a coursera course, probably.
@iron rampart @proud iron https://www.coursera.org/learn/machine-learning I highly recommend this one as an introductory ML course.
It has little to no needed background knowledge, yet covers most of the fields.
@tidal bough do you remember any resource that assumes background Python knowledge?
There probably are introductory ML courses in Python, but I don't know them.
Doesn't really matter though, since all the courses in the https://www.coursera.org/specializations/aml specialization use Python, pretty much.
I've only done https://www.coursera.org/learn/practical-rl?specialization=aml of them, mind - wasn't very interested in the rest.
Cheers @tidal bough .
yo, so I got 2 kind of information, RGB map and height map,
how would I detect water groups?
lets say I want to count water areas
well, here it looks like water is just all the blue, lol
xD
genius
its kinda ml question
how to find and count objects
xd
orrr
I will simply create mask and adios
you can just find contiguous areas of a color
flood fill
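a minimal flood-fill sketch for counting separate water areas, assuming you've already turned the RGB map into a boolean mask (e.g. by thresholding on blue; that step isn't shown and the threshold is up to you). `count_regions` is a name I made up:

```python
from collections import deque

import numpy as np

def count_regions(mask):
    """Count 4-connected regions of True cells in a 2D boolean mask via flood fill."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    rows, cols = mask.shape
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Found an unvisited water cell: flood outwards from it.
            regions += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    in_bounds = 0 <= ny < rows and 0 <= nx < cols
                    if in_bounds and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
    return regions

water = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=bool)
print(count_regions(water))  # 3
```

for real images `scipy.ndimage.label` does the same labelling much faster; the loop above is only meant to show the idea.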
hey
i was wondering how i could (with pandas or numpy if needed) detect categorical data ?
WDYM by detecting categorical data?
get every categorical columns inside a new dataset
You could check how many unique values there are in that column, and if it's, say, below 100, assume it's categorical.
it wont work everytime
cause if we have the speed of a car it can be unique every time (like: 140.274km/h)
If your dataset has only like a dozen different values for a column, you can consider that column categorical even if the dataset's creators considered it continuous
ok and what about that:
the passenger class is a categorical column
however all the values are not unique
There are only 2 unique values, so by my definition it'd be a categorical column.
Same with price, here.
dataset with 3 rows isn't much of a one, though
@haughty turtle if the column contains only integers thats one potential sign
moreso if they're all consecutive integers
whereas very large nonconsecutive nonnegative integers all of the same digit length could be id numbers
look for "type" or "category" or "class" in the column name
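a sketch of the unique-count heuristic discussed above. `guess_categorical` and the threshold are made up, and as pointed out it's only a guess: a continuous column with few rows, or an id column, can fool it either way:

```python
import pandas as pd

def guess_categorical(df, max_unique=3):
    """Return names of columns with at most `max_unique` distinct non-null values."""
    return [name for name in df.columns if df[name].nunique(dropna=True) <= max_unique]

cars = pd.DataFrame({
    "speed_kmh": [140.274, 98.1, 120.5, 77.3, 101.9, 133.0],  # continuous
    "pclass": [1, 3, 2, 3, 1, 2],                              # few integer levels
    "sex": ["m", "f", "f", "m", "m", "f"],                     # two labels
})
print(guess_categorical(cars))
```

in practice you'd pick `max_unique` relative to the number of rows and still eyeball the result.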
this seems like a very weird problem to have
its not haha
what is this data? where is it retrieved from? how is it stored?
its pretty unusual to have to perform automated data processing on 100s of tables with unknown schemata...
i want to fillna
or rather, tables with 100s of columns with unknown schemata
i know the schemata
"i got the instruction manual but its long and i dont want to read it, how can i detect the right screws to tighten on my car?"
that's why im doing a lib
i have plenty of other project to do and i want to create a lib to go faster
thats what code is supposed to do
by the time you figure out an algorithm for this you could have just typed out a list of categorical/numeric indicators by hand
if you want to do it for fun, then go for it. but this doesnt sound like an optimal use of your time
i'll try, and if it shows results i'll publish it... maybe it will help others save time
@tidal bough The course is in Python lang right?
if you mean the coursera ML one - no, it uses Octave.
The programming assignments there all require manually writing code rather than using existing libraries anyway, so it wouldn't be much different if it were in Python.
Octave has builtin linear algebra support, so it's kinda like numpy
```python
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
```
can anyone help me with this?
The function checks, whether one row of a DataFrame should belong to the test-set or the train-set.
I don't know how it works
For deep learning, what OS do you guys recommend and most people use? I've heard that Windows is pretty bulky and Linux is the best. I cant use Linux for school since I need some software, so any recommendations? (I have a macbook pro as well)
use your macbook pro, that's what I would go for. besides, linux has most of the software that you need for school
Is windows not good? I've seen some ml people use windows @crisp jewel
the specs of my macbook pro are pretty bad compared to my pc's (its a desktop pc not a laptop btw which i built this summer)
use the macbook
leave the desktop as windows
macbook for working
the other one sell it lmao
dont sell the monitor tho
k
macbooks are pretty bad tbh, im not doing anything and the fan runs really fast
and for deep learning a macbook isnt the best idea because it doesnt even have a gpu
They have gpus tho
In this video we look at a paper which proposes with theoretical and empirical evidence to use tempered sigmoids instead of ReLU (or in general exploding activation functions) to improve on differentially private stochastic gradient descent (DP-SGD). I would love to spark discussions here or on the youtube comment section about this paper!
Video: https://www.youtube.com/watch?v=g2acvGl99-k
Paper: https://arxiv.org/abs/2007.14191
Abstract: Because learning sometimes involves sensitive data, machine learning algorithms have been extended to offer privacy for training data. In practice, this has been mostly an afterthought, with privacy-preserving models obtained by re-running training with a different optimizer, but using the model architectures that already performed well in a non-privacy-preserving setting. This approach leads to less than ideal privacy/utility tradeoffs, as we show here. Instead, we propose that model architectures are chosen ab initio explicitly for privacy-preserving training. To provide guarantees under the gold standard of differential privacy, one must bound as strictly as possible how individual training points can possibly affect model updates. In this paper, we are the first to observe that the choice of activation function is central to bounding the sensitivity of privacy-preserving deep learning. We demonstrate analytically and experimentally how a general family of bounded activation functions, the tempered sigmoids, consistently outperform unbounded activation functions like ReLU. Using this paradigm, we achieve new state-of-the-art accuracy on MNIST, FashionMNIST, and CIFAR10 without any modification of the learning procedure fundamentals or differential privacy analysis.
"the best tempered sigmoid achieves 98.1% test accuracy whereas the baseline ReLU model trained to provide identical privacy guarantees (ε = 2.93) achieved 96.6% accuracy."
I'd like to see some proof of that
Also what does the "heat" in figure 2 represent?
i see we have another Yannic Kilcher @raven mulch
```python
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
```
@crisp jewel this is WILD
why would anyone do that
this is like a great example of "trying to be smart" IMO
alright so when i run that code this is what i get
where these are all columns in the csv file
that i'm reading from
sorry if this is too vague im just starting with data science
ah, this is the infamous Titanic dataset
indeed
incidentally, I would suggest data.corr().abs()['survived'].sort_values() instead
hm
are you asking about the meaning of the correlation coefficient?
lol no okay lemme try to word this better
okay first of all why are there [[]] around the "survived"
which conceptually represents 2D data
mhm
okay, so you know what a Series is?
a Series represents 1D data
either a row or a column
so, for example, if you do data['survived']
you get the column representing whether each person survived or not
because a column is 1D, that's a Series
ohh
so you can think of a DataFrame as a collection of Series
like a vector?
yes
now, we used square brackets above
to access a single column of data
but what if we want to take multiple columns?
then we would pass a list
say we wanted the sex and age columns
data[['sex', 'age']]
which you can break down as:
columns = ['sex', 'age']
data[columns]
make sense?
yes
so now
that's the difference between 2D data with one unit dimension and 1D data
in other words
if you did ['survived']
you would get a Series
but with [['survived']] you have a DataFrame with one column.
and the two are different things
ohh so data[['sex', 'age']] returns a dataframe okayy
so data.corr().abs() without the [['survived']] would give a very large dataframe with every pairwise correlation coefficient i assume
lemme try it out
yes
yw
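a tiny sketch of the Series vs one-column DataFrame difference, plus the suggested correlation ranking. the frame here is a made-up stand-in for the Titanic data:

```python
import pandas as pd

data = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "age": [22, 38, 26, 35, 27],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

col = data["survived"]      # single brackets: a Series (1D)
frame = data[["survived"]]  # double brackets: a DataFrame with one column (2D)
print(type(col).__name__, col.shape)
print(type(frame).__name__, frame.shape)

# Absolute correlation of every column with 'survived', weakest first.
ranked = data.corr().abs()["survived"].sort_values()
print(ranked)
```

'survived' always ends up last because its correlation with itself is 1.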
Hello ! I have a csv which represents the temperature data for the 4 seasons, I want to add a precise number for each 90 iterations and I am a little stuck doing it with pandas
@versed violet what do you mean precise number for each 90 iterations
like first 90 rows one number, next 90 rows one number, etc.?
I want to count 94 rows, for example, and add a number to each of those 94 rows
My csv looks like this
From your image, do you mean
- Add 3 to each of the first 79 rows
- Add 2 to each of the next 93 rows
- Add 5 to each of the 94 rows after that
- ...
?
Yes !
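one possible way to do that in pandas without a loop, sketched with small made-up numbers (the real block lengths would be 79, 93, 94, ... and the column name is an assumption): build one offset per row with `np.repeat` and add it to the column.

```python
import numpy as np
import pandas as pd

temps = pd.DataFrame({"temperature": np.arange(10, dtype=float)})

block_lengths = [3, 5, 2]        # hypothetical: number of rows in each season
block_offsets = [3.0, 2.0, 5.0]  # hypothetical: value to add within each season
assert sum(block_lengths) == len(temps)  # the blocks must cover every row

# Repeat each offset once per row of its block, then add element-wise.
per_row_offset = np.repeat(block_offsets, block_lengths)
temps["adjusted"] = temps["temperature"] + per_row_offset
print(temps)
```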
Can you claim a help channel (read #how-to-get-help ) so that we can go more in-depth about it?
Yes 1mn just to read how to claim the help channel thanks !
quick question, how come if i replace '?' with numpy.nan i can still use .dropna() on the dataframe? does python have a built in nan data type? my intuition tells me that numpy.nan is different but idk
@fathom raptor yes
@hasty grail it tells me I'm in a "Cool Down" except I've never opened a help channel
yes to python having a builtin nan?
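quick sketch on the nan question: `numpy.nan` is an ordinary Python float holding the IEEE NaN value, the same kind of thing `float("nan")` gives you, and pandas treats it as missing, which is why `.dropna()` still works after the replace. the toy frame is made up:

```python
import math

import numpy as np
import pandas as pd

print(type(np.nan))               # <class 'float'>
print(math.isnan(float("nan")))   # True: plain Python can produce NaN too

df = pd.DataFrame({"a": ["1", "?", "3"], "b": ["x", "y", "?"]})
cleaned = df.replace("?", np.nan).dropna()  # drops every row that had a '?'
print(cleaned)
```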
hmm not sure what that means, maybe one of the helpers/mods can elaborate?
yeah that is strange...
But to answer your question, a simple way would be just to read the entire file and put the contents into a list
Edit the list
then overwrite the file with the contents of the new list
Oh i see, and to write a loop to add the numbers right ?
yes
That's where i have a problem, I can't really see how i can write the loop, like do I do it with a count or with a len ?
you can use enumerate()
wait idk if this is a datascience question or just a noob programming question, but how come both of these syntaxes work?
!eval
```python
for i, v in enumerate(['a', 'b', 'c']):
    print(i, v)
```
You are not allowed to use that command here. Please use the #bot-commands channel instead.
This is why being in a help channel would be helpful...
you can take a look at the output in #bot-commands
@fathom raptor It's analogous to calling class instance methods. You can call it by <obj>.<func> or <func>(<obj>)
@hasty grail I see yeah, thanks a lot I will try with what you gave me until now and wait for this "cooldown" to finish, thanks a lot for the help !
@fathom raptor the latter is more idiomatic
can anyone whos familiar with pandas tell me the difference between .nunique and .value_counts?
@graceful glacier have you tried calling both of them on the same data
the difference should be quite apparent
ok i got it after looking it up. just starting out with pandas so some of the concepts blur together for me
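for anyone else mixing them up, a minimal sketch on a made-up Series: `.nunique()` returns a single number (how many distinct values there are), while `.value_counts()` returns a Series with one count per distinct value.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

n_distinct = s.nunique()   # one integer: how many different values
counts = s.value_counts()  # a Series: how often each value appears
print(n_distinct)
print(counts)
```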
is it even possible to save a matplotlib animation as a html file?
is the variable "axis" a built in variable in numpy???
idk about axis, but row and column are
like:
```python
for row in array:
    ...
```
but axis is a argument for most np functions
ig it is
probably through some properties but I'm not completely sure
but how does python recognize axis
@crude karma 0 is rows and 1 is columns
I don't understand what you mean by "built in"
ok i'm super sleep deprived but is there a clean, elegant way of iterating through "square sections" in a numpy 2d array?
i.e. if n=2 I would want to look at the four "quarters" of the array:
```python
>>> squares(np.reshape(np.arange(16), (4, 4)), n=2)
np.array([
    [[ 0,  1],
     [ 4,  5]],
    [[ 2,  3],
     [ 6,  7]],
    [[ 8,  9],
     [12, 13]],
    [[10, 11],
     [14, 15]]
])
```
there doesn't happen to be a built-in numpy way of doing this, right?
I just have to use slices?
I think there is a function for that
let me see..
ok no, apparently this is one of the things Tensorflow has but NumPy doesn't -_-
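fwiw, one way to get those blocks with plain NumPy is a reshape plus an axis swap, no explicit slicing. this is just a sketch: I'm assuming `n` means the number of splits per side (so n=2 gives four quarters) and that the shape divides evenly.

```python
import numpy as np

def squares(arr, n):
    """Split a 2D array into n*n equal blocks, ordered left-to-right, top-to-bottom."""
    rows, cols = arr.shape
    if rows % n or cols % n:
        raise ValueError("array shape must be divisible by n")
    bh, bw = rows // n, cols // n
    # (rows, cols) -> (block_row, row_in_block, block_col, col_in_block),
    # then bring the two block indices together and flatten them.
    return arr.reshape(n, bh, n, bw).swapaxes(1, 2).reshape(n * n, bh, bw)

blocks = squares(np.arange(16).reshape(4, 4), n=2)
print(blocks)
```

the final reshape copies the data, so edits to `blocks` won't write back to the original array.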
like
how does axis know its 0 for row and 1 for column
you can name anything other than axis and have it assign 0 for row and 1 for column right?
@crude karma convention
by default, axis 0 is rows
convention?
@crude karma axis 0 is the outermost axis
i.e. you enter 0 lists before hitting the 0th axis
axis 1 is the second-outermost axis
you must enter one list before you hit the 1st axis list
only under C-order
true
ah
which is why, if you change the order, you notice that the speed of iteration across specific axes changes
since memory contiguity is actually affected
mmhm
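small sketch to make the axis talk concrete. `axis` is just the name of a keyword argument that NumPy functions accept, not a special variable: 0 refers to the outermost dimension and 1 to the next one in. as far as I can tell the numbering stays the same whatever the memory order is; what changes with C vs Fortran order is the layout in memory (and so iteration speed).

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

col_sums = a.sum(axis=0)  # collapse axis 0 (the rows): one value per column
row_sums = a.sum(axis=1)  # collapse axis 1 (the columns): one value per row
print(col_sums, row_sums)

f = np.asfortranarray(a)  # same values, column-major memory layout
print(a.flags["C_CONTIGUOUS"], f.flags["F_CONTIGUOUS"])
same_result = np.array_equal(f.sum(axis=0), col_sums)
print(same_result)
```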
purely for research purposes, somebody could probably make a wrapper that lets you specify a custom axis order to persist on disk
it would probably just permute the axes it passes to numpy
and use numpy's C-order internally
anyway @hasty grail why is that implemented in tensorflow? lol
seems like something they should have contributed to numpy
UnimplementedError: 2 root error(s) found.
(0) Unimplemented: {{function_node __inference_train_function_16567}} File system scheme '[local]' not implemented (file: '../input/birdsong-resampled-train-audio-03/redcro/XC143214.wav')
[[{{node ReadFile}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNextAsOptional_5]]
(1) Unimplemented: {{function_node __inference_train_function_16567}} File system scheme '[local]' not implemented (file: '../input/birdsong-resampled-train-audio-03/purfin/XC171695.wav')
[[{{node ReadFile}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNextAsOptional_3]]
0 successful operations.
7 derived errors ignored.
whats this error?
looks like you're running a TF model with distributed training?
or using a tf.data.Dataset object
in any case you need to provide more info than that for us to help, such as your actual code
:p
does pastebin not work?
hey all, I need help figuring out where to go in my project. Basically I have gathered a TON of data of "if x then y's" and have a bunch of probabilities regarding such. I need help figuring out how to pick the next candidates given a set of x values.
@hasty grail https://notepad.pw/47i2f3b1
linear regression?
I think so, thats what I found when I googled it. Guess I'll have to look back into my AI homework
I'm assuming y's are continuous labels and x are features and you want to predict a y for x?
@lapis sequoia do you have any experience with it? Could I dm you?
Oh, you're using TPU strategy to train the model
don't use TPU training then
Hmm I've used TPU on kaggle before in the same way. Let me remove it and try
@hasty grail OOM error with GPU xD
umm how many params does the model have?
350,000
what does nvidia-smi return?
what worked?
your GPU must be tiny
350k params is not that much in terms of GPU usage unless you got a bunch of attention layers...
also removed Residual connections
not what i wanted but welp something to build on
wait I see you are doing some signal transformations
so that's probably what ate up so much GPU memory
Kaggle GPU tho. yes, I'm using mel spectrograms on audio
but it's gonna take 50 hours to train ffs
I would like to hear your thoughts on this: first perform DETR object detection, and if the detected object is the desired one (e.g. a person), perform face detection with facenet on the cropped object. Would that be efficient in terms of speed?
DETR would also be really nice for finding other objects, such as weapons, which I'd need to detect in the same image as well
I have a Pandas related question in #help-popcorn. Could someone please take a look?
@hidden halo did you get an answer?
Not entirely
One sec, I'll post it here
I have dataframe in Pandas which looks like this:
+--------+------------+---------+-------+
|class | student_id | subject | score |
+--------+------------+---------+-------+
|4 | 5| Maths | 56|
|4 | 5| English | 65|
|4 | 6| Maths | 73|
|4 | 6| English | 78|
+--------+------------+---------+-------+
I want to convert the subject column to column headers, while retaining all other columns as is, like this:
+--------+------------+---------+-------+
|class | student_id | English | Maths |
+--------+------------+---------+-------+
|4 | 5| 65| 56|
|4 | 6| 78| 73|
+--------+------------+---------+-------+
I got this. But I can't seem to be able to flatten it. Any ideas
df.set_index(['class', 'id', 'sub']).unstack('sub').reset_index()
oh, yeah thats a thing
i think if you do rename with a callable, it receives a tuple
let me check
@hidden halo
```python
data = data.pivot(index=['class', 'student_id'], columns='subject').reset_index()
flat_colnames = ['_'.join(filter(None, ctup)) for ctup in data.columns.to_flat_index()]
data.columns = flat_colnames
print(data)
```
.to_flat_index() seems to be the trick you were missing
maybe I'm missing something, but couldn't you just have selected 'score' before resetting the index? as in df.set_index(['class', 'id', 'sub']).unstack('sub')['score'].reset_index()
also, pretty interesting point about .to_flat_index(), didn't know that was a thing
yeah that would work too @paper niche
seems more "opaque" though. if i wrote that code i would want to leave a comment like # selecting the column avoids creating a multi-index
sure, I agree. I like the pivot table solution better, I think the intention is clearer than setting index and unstacking
i do wish pivot wouldn't mess w/ the index though
e.g. if you already have a meaningful index you then have to keep track of the indexes to reset
it gets messy
data = data.pivot(index=['class', 'student_id'], columns='subject').reset_index()
This did not work for me, it threw an error saying Length mismatch: Expected 4 rows, received array of length 2
However, I tried the second part with my unstack method and that did the trick. Thanks a lot.
Now I'll go try to unpack what happened in that line.
@hidden halo this was my whole script, maybe your real data is different
```python
import io
from operator import methodcaller
import pandas as pd

data_txt = '''
class   | student_id | subject | score
4       |           5| Maths   |    56
4       |           5| English |    65
4       |           6| Maths   |    73
4       |           6| English |    78
'''
data = pd.read_csv(io.StringIO(data_txt), sep='|') \
    .rename(columns=methodcaller('strip'))
data['subject'] = data['subject'].str.strip()
data1 = data.pivot(index=['class', 'student_id'], columns='subject').reset_index()
flat_colnames = ['_'.join(filter(None, ctup)) for ctup in data1.columns.to_flat_index()]
data1.columns = flat_colnames
print(data1)
```
@paper niche And apparently this does the trick too, in a much simpler manner as well.
Thanks a lot fickletofu
i would definitely leave comments explaining what this is doing
if i had to read that code id be confused
and also w/ fickle's method you have to prepend subject_ to the unstacked column names
yeah if you'd like to add a prefix to the pivoted columns, just go with salt rock lamp's solution
oooh wait
Actually, this works fine for my usecase
fickle's method names the axis
that's kind of nice
unstack is nice because it names the column index itself subject which i think is cool
I haven't worked with multi-index data frames much, so I find this very confusing. Still trying to figure out what fickle's method does exactly by running it step by step
@hidden halo selecting score selects a Series from the dataframe
so when you reset_index, that promotes the Series to a DataFrame with flat column names
rather, unstack creates a DataFrame with a multiindex column
the "outer" layer of the column axis has a score label
selecting that gives you just the "inner" layer, which is a DataFrame with a non-multi column index
then you reset_index on that, and the index "columns" become regular DataFrame columns
comparison of both methods
@desert oar Yes, I get this now. I wasn't able to understand what was inside score. After looking at it in multiple ways, I've figured it out now
i was actually wrong in those first 2 lines
look at the next
i crossed out the wrong parts
Yeah, it makes sense. It's not very clear yet, I guess it will take a little more work with multi-index DFs for this to seem familiar. But I get the general idea.
One more question actually, I'm not able to get rid of the first column, which is basically the index, titled sub. I don't want the sub there, but reset index doesn't remove it.
It doesn't change at all
that isnt the first column
that's the name of the column index
which is what i was saying before
see the example repl i posted? you need to .rename_axis(columns=None)
if you do df2.columns you will see that the result is an Index object with name='sub'
this is an artifact of selecting a single key from MultiIndex columns
Oh
Got it, working now. Thanks. Will probably take some time till I fully understand these methods.
Hi all. I have a question about the pandas module. I'm trying to delete rows from a series, but it looks like the drop() documentation only allows me to do this by making a completely new series. Is there a way to edit the current series I have? Because trying to do this in iterations runs into nightmare keyerrors, because I can't simply write series = series.drop([2])
I'm essentially asking if there's something in pandas that is the equivalent of .append or .remove in python's lists
.drop() has an inplace argument if you need to modify the series in place.
There is also a .append() method on pandas series that will let you effectively concatenate one series to another.
@solar bluff are you good with pandas ?
I use it every day in my job. I'm no world class expert or anything but I get by
@lapis sequoia note that drop drops by row label, not by numerical position
Well I'm currently trying to add the inplace=True argument, but now printing out the series is printing "None"
df.drop(index=[2], inplace=True) might drop the 100th row, or even multiple rows, whichever ones have the label 2
inplace=True makes .drop return None
Oh, is there a way to drop by numerical position? the documentation is confusing to me
@solar bluff Yesterday @hasty grail helped me a lot to write code that goes through every line of a csv and adds a value depending on which season the line is in. Right now I have a little bug in the code and I can't find the error https://repl.it/repls/FamiliarMinorBases#main.py
"inplace bool, default False
If True, do operation inplace and return None." sure enough
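A quick sketch of why the print showed None, using a throwaway Series (the values are made up): with inplace=True the Series itself is changed and the call returns None, so assigning the result back overwrites the Series with None.

```python
import pandas as pd

s = pd.Series(list("abcd"))
result = s.drop(index=[2], inplace=True)   # modifies s, returns None
print(result)                              # None
print(list(s))                             # ['a', 'b', 'd']

# Without inplace, drop returns a new Series and reassignment works as expected.
t = pd.Series(list("abcd"))
t = t.drop(index=[2])
print(list(t))                             # ['a', 'b', 'd']
```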
I was doing some testing and series.drop([2]) seemed to remove the 3rd column (as I would expect)
import pandas as pd
s = pd.Series(list('abcdefghijklmnop'))
pos = [2]
s.drop(index=s.index[pos], inplace=True)
print(s)
wait, columns?
or rows
Row
it works if the row labels happen to be the same as the row numbers
which is only true sometimes or by default
I pretty much always avoid inplace so I'm not very well skilled with using that as an argument
In that case, is there no way to essentially remove a row without having to make a new variable?
i just showed you
Because again this creates so many keyerror problems
the keyerror problems have nothing to do with creating a new variable
the keyerror has to do with you confusing row numbers and row labels
s.drop(index=s.index[2], inplace=True)
should work
s.index gives you the row labels
so you can index that to get the relevant labels
then drop using that
also note that pandas doesn't make a copy of all the data even when you copy the Series
however if you are 100% sure that your row labels and row numbers are identical then you can just use drop(index=[2])
but if you for example do .sort_values() then the row labels will be out of order because the row labels stay attached to the rows
and then you'd have to .reset_index(drop=True) to remove the out-of-order index and create a new correctly ordered index
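A sketch of how labels and positions drift apart after a sort, on a made-up three-element Series:

```python
import pandas as pd

s = pd.Series([30, 10, 20])      # default labels 0, 1, 2
sorted_s = s.sort_values()       # values 10, 20, 30; each label stays with its row
print(list(sorted_s.index))      # [1, 2, 0]

# Dropping label 0 removes the value 30, which is now the LAST row.
by_label = sorted_s.drop(index=[0])
print(list(by_label))            # [10, 20]

# Dropping position 0 means looking up the label at that position first.
by_position = sorted_s.drop(index=sorted_s.index[[0]])
print(list(by_position))         # [20, 30]

# reset_index(drop=True) throws away the old labels and renumbers from 0.
renumbered = sorted_s.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1, 2]
```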
So what's the point of returning none?
Ah I see
Okay, that seems to work. And evidently I need to read up on how pandas defines indexes and labels because I'm getting confused
im using the term "labels" loosely
a DataFrame has two "axes": the index (i.e. row labels) and the columns (i.e. column labels)
each "axis" is represented by an Index object, which is similar to but not the same as a Series
an Index has a dtype and can contain strings, numbers, dates, etc.
and you can do row and/or column lookups on DataFrames using the Index values
the .loc accessor does index lookups. the .iloc accessor does positional lookups
if you create a DataFrame and don't specify the index, you get a default RangeIndex which is just 1:1 with the row numbers
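A short sketch of .loc versus .iloc, on a made-up frame with string row labels (the column name price is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 7, 9]}, index=["a", "b", "c"])

print(df.loc["b", "price"])   # 7  (lookup by row label and column label)
print(df.iloc[1, 0])          # 7  (lookup by row position and column position)

# With no index given, pandas supplies a RangeIndex: labels 0, 1, 2 match positions.
default = pd.DataFrame({"price": [5, 7, 9]})
print(type(default.index).__name__)   # RangeIndex
print(list(default.index))            # [0, 1, 2]
```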
So an index is not always numerical?
I see
So s.drop(index=s.index[2], inplace=True) seems to work on its own, but when doing it in an iteration I still seem to be getting keyerrors
keyerror happens if the index value is missing
So I'm trying to drop a row at row index value x, but it can't find x
Either because it's out of bounds or there's nothing there
Oh, I think it's happening because when you remove something from a series, no index values change
"hey pandas, do this thing at this location"
pandas: "that location? what location? i don't see no location like that. KeyError"
thats correct @lapis sequoia
when you remove the a row from the dataframe in my example, there is no longer an a row
Yeah, which I believe is something .remove would automatically take care of in python
I suppose I could find a way to work around that though
im not sure what you mean
can you show more of your code
what are you even trying to do
Well, in a list, say [a, b, c]. if you do list.pop(2), list would be [a, b], with the indexes adjusting automatically
thats because the indexes are positional in a list
a DataFrame is more like an OrderedDict
it's on another computer, but let me write something out to make it easier
it has both positional indexes and named indexes, i.e. keys
the keys in a dict don't adjust when you delete an entry
if you want to keep the row labels in sync with the row positions you need to .reset_index(drop=True) after every row deletion
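A sketch of the list versus Series difference described above, with made-up values. It also shows the KeyError you get when dropping a label that is already gone.

```python
import pandas as pd

items = ["a", "b", "c"]
items.pop(0)                   # list positions shift down automatically
print(items[0])                # b

s = pd.Series(["a", "b", "c"])
s = s.drop(index=0)            # label 0 is gone; labels 1 and 2 stay as they were
print(list(s.index))           # [1, 2]

try:
    s.drop(index=0)            # dropping the same label again
except KeyError as err:
    print("KeyError:", err)

s = s.reset_index(drop=True)   # relabel so labels match positions again
print(s.loc[0])                # b
```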
for j in range(0, len(products_b)):
    if products_b[j] == products_a[i]:
        deleted_indices.append(i)
        products_b.drop(index=products_b.index[j], inplace=True)
        break
products_b and products_a are series objects? and the indexes are unique?
they are both series objects, yes
As for the indices, I've just been using numerical positions
as in, you never call set_index on these right? and you never otherwise explicitly specified an index?
right
deleted_indices = []
for i_a, val_a in products_a.items():
    for i_b, val_b in products_b.items():
        if val_a == val_b:
            deleted_indices.append(i_b)
            products_b.drop(index=i_b, inplace=True)
            break
does this work?
that's what's giving the keyerror
even after adding the reset_index line
what if the key was deleted in a previous iteration?
Yeah, thats the problem
you shouldnt modify something you're iterating over
you want to just delete the elements from b where they occur in a?
Yeah, I'm trying to delete the elements from b when they occur in a so that the iteration doesn't take as long
for i, val in products_b[~products_b.isin(products_a)].items():
    ...  # do something
but you should always question iterating manually over a Series
usually you can go a lot faster by using .map or .apply
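A rough sketch of the same calculation written three ways, on a made-up Series. How much faster .map is than a manual loop depends on the function and the data, so treat this as an illustration of the shape of the code, not a benchmark. For simple arithmetic, a plain vectorized expression avoids calling a Python function per element at all.

```python
import pandas as pd

prices = pd.Series([1.0, 2.5, 4.0])

# Manual iteration: one Python-level step per element.
looped = []
for _, value in prices.items():
    looped.append(value * 2)

# .map applies a function to every element and returns a new Series.
mapped = prices.map(lambda value: value * 2)

# Vectorized arithmetic: no per-element Python function call.
vectorized = prices * 2

print(looped == list(mapped))      # True
print(mapped.equals(vectorized))   # True
```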
I was wondering if there were better ways, because this is taking a very long time. the series are both pretty long
I'll look into map and apply
iterating over pandas series is very slow
compared to iterating over a list
what are you trying to do more generally?
Basically I'm taking a master list of items, comparing them with another list of discontinued items
Ideally all of the discontinued items will be in the masterlist. If that's true it should be relatively simple to remove the discontinued items from the masterlist
Iteration like this was just the first thing that came to me
it's just going to take a long time when the masterlist has about 75k entries
yeah just use .isin
products_current = products_master[~products_master.isin(products_discontinued)]
yeah I figured I was overthinking something to the very bone
I suppose that line would imply iteration as well, though?
Ah, i see
although it is a shame how slow iteration over a Series is
but thats a bigger design issue
So I'm looking at the .isin documentation now. Once you get the series of booleans I suppose you can just filter out all of the "True"s
I would imagine with the drop() function
Oh you can't actually pass a series object into .isin(). Guess I'll have to convert products_b to a list
products_discontinued is a series itself
Oh, you can
the documentation says it only accepts a set or list-like
pd.testing.assert_series_equal(
    pd.Series([1,2,3]).isin(pd.Series([1,2,4,7,-5])),
    pd.Series([True, True, False])
)
does that include a series object?
yes, pandas docs do a poor job of defining their terms
aha, it's simple then
a "list-like" is a list, pandas series, numpy array, and a handful of other things
and ~ is logical negation on a Series
so
pd.Series([1,2,3], index=['a', 'b', 'c']).isin(pd.Series([1,2,4,7,-5]))
is
pd.Series([True, True, False], index=['a', 'b', 'c'])
and therefore
~pd.Series([1,2,3], index=['a', 'b', 'c']).isin(pd.Series([1,2,4,7,-5]))
is
pd.Series([False, False, True], index=['a', 'b', 'c'])
itd be nice if there was a .notin method for efficiency but this is still a lot faster than iterating
So my thought is that once I have the series of booleans, I could use the .iloc method to return the indices of all "True"s
you dont need that
Oh, that's not true
you can index/subset a series with a boolean series
again:
products_current = products_master[~products_master.isin(products_discontinued)]
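That one-liner, broken into steps on made-up product names (the sample values are assumptions, only the variable names come from the conversation):

```python
import pandas as pd

products_master = pd.Series(["apple", "banana", "cherry", "date"])
products_discontinued = pd.Series(["banana", "date"])

# Boolean Series: True where the master item appears in the discontinued list.
is_discontinued = products_master.isin(products_discontinued)
print(list(is_discontinued))           # [False, True, False, True]

# Indexing with the negated mask keeps only the rows where the mask is False.
products_current = products_master[~is_discontinued]
print(list(products_current))          # ['apple', 'cherry']

# The surviving rows keep their original labels.
print(list(products_current.index))    # [0, 2]
```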
So is the ~ operator specific to pandas? ive never seen it before
no, the ~ is bitwise inversion
It's the bitwise NOT operator. pandas overloads it to work elementwise on Series.
okay, so if i wanted to keep an archive of all the discontinued products i can just delete the ~
well the discontinued products are already in their own series..
oh, getting carried away there
unless you want the intersection of products_master and products_discontinued, in which case yes
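A sketch of the intersection case, plus one extra check that goes slightly beyond what was asked: listing discontinued items that never appear in the master list. The sample values are made up.

```python
import pandas as pd

products_master = pd.Series(["apple", "banana", "cherry"])
products_discontinued = pd.Series(["banana", "kiwi"])

# Without the ~, the mask keeps master items that ARE discontinued (the intersection).
in_both = products_master[products_master.isin(products_discontinued)]
print(list(in_both))          # ['banana']

# Swapping the roles finds discontinued items missing from the master list.
not_in_master = products_discontinued[~products_discontinued.isin(products_master)]
print(list(not_in_master))    # ['kiwi']
```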
that said... what format does this data originally arrive in?
Excel files
(note that in python, custom classes can override the behavior of various operators including +, -, /, *, &, |, ~)
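A toy sketch of that overloading. The Flags class is hypothetical and only loosely imitates what pandas does: defining __invert__ is what gives ~ its meaning for a custom class.

```python
class Flags:
    """Hypothetical toy container, only here to show how ~ can be overridden."""

    def __init__(self, values):
        self.values = list(values)

    def __invert__(self):
        # Called for ~flags: flip every element and return a new object.
        return Flags(not v for v in self.values)


print((~Flags([True, False])).values)   # [False, True]

# On a plain int, ~ is the built-in bitwise inversion: ~x == -x - 1.
print(~5)                               # -6
```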
sure, but they are in 2 different files?
or different sheets
different files