#data-science-and-ml
1 messages · Page 248 of 1
Is it possible to use the .map function on a series of booleans?
Because now I might be thinking of making a new series on whether or not something is discontinued, and I was going to replace "False" with "No" and "True" with "Yes"
don't see why it wouldn't be possible
Well, I'm trying, but it's replacing the entire series with NaN, and in the documentation they show it only working with object dtypes, not booleans
pd.Series([True, True, False]).map({True: 'hello', False: 'goodbye'})
yeah, True and "True" are different
should have known, haha
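For reference, a minimal sketch of the difference discussed above: the keys of the mapping have to be the boolean values themselves, not their string spellings. The series name `discontinued` is made up for illustration.

```python
import pandas as pd

discontinued = pd.Series([True, True, False])

# Boolean keys match the boolean values in the series.
as_text = discontinued.map({True: "Yes", False: "No"})

# String keys never match a boolean value, so every entry becomes NaN.
all_missing = discontinued.map({"True": "Yes", "False": "No"})
```

This matches the symptom described: a series that turns entirely into NaN usually means none of the dictionary keys matched.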
I use .map all the time to map series that contain enumerated values into their corresponding strings
(i look at a lot of data that's produced by C++ code pushing structs into hdf5 files)
is there a way to use the inplace argument for mapping as well? or do i have to create a new series?
The documentation seems to point to there not being a method
df["series"] = df["series"].map(dict) works just fine
hah, i feel strongly about helping people w/ pandas
because its really hard to learn from the official docs..
this might be a good learning resource https://tomaugspurger.github.io/modern-1-intro.html @lapis sequoia
this is the most recent post in the series https://tomaugspurger.github.io/modern-8-scaling.html
you too? whew, i thought i was the only one that struggled with documentation reading here
its awful
they tried, but
its amazing anyone knows anything about pandas
it needs a serious overhaul imo, ive been wanting to write my own guide for a while
I LOVE that series from Tom Augspurger
I also feel that the McKinney book is confusing, and that's sad because he's the creator
best book I know of on how to use Pandas is the Pandas 1.x Cookbook by Ted Petrou and Matt Harrison
Does anyone have experience with sparse matrices? I want to solve some linear systems of equations. The coefficient matrix is predominantly diagonal and the rest of the elements are 0 thus I think sparse matrices are the way to go. Thanks!
@glass wyvern yeah, what's your question exactly?
sparse matrices can be good for what you just described, if the coefficient matrix is very big
not much value in it for a small matrix, the main benefit of a sparse matrix is saving memory
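A rough sketch of what solving such a system could look like with SciPy, assuming a tridiagonal coefficient matrix; the size `n` and the matrix entries are invented for illustration and are not from the question.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 1000  # hypothetical system size

# Tridiagonal coefficient matrix: 4 on the main diagonal, -1 on its neighbours.
A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# spsolve works on the sparse structure directly; no dense n x n array is built.
x = spsolve(A, b)

# Size of the residual A x - b, as a sanity check on the solution.
residual = np.linalg.norm(A @ x - b)
```

Only the non-zero diagonals are stored, which is where the memory saving mentioned above comes from.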
This is kind of a dumb question but is anyone familiar with the term imputation? How is this conjugated? Is the base word Impute? It is very difficult for me to see it as anything less than a misspelling of the word "input" but other people in data science in my work group insists that this is a common word.
None of them speak English as a first language, and I am skeptical of the way they are using it in a sentence
yes, imputation is the "verbal action" form of "impute"
"missing data imputation" is the act of "imputing missing data"
Alright, thank you for the reassurance
I have been in a constant battle of made-up words until now.
does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
Not sure if this is the right sub-server, but can someone maybe help me with a little problem i'm encountering with pandas?
It's a problem with data frames: I'm trying to map a column with numbers N. These N are also present as keys (K) in a dictionary with values V. Whenever I try to substitute the N in the dataframe with the V from the dictionary using df.map, the indexes of the dict are mapped and not the values that belong to the keys...
Anyone know how to solve this?
Are you using df = df.map?
I think df.map creates a copy of the dataframe off the top of my head
There should be an optional argument inplace = True to make it update the dataframe itself otherwise, but I would need to check the documentation. I have the memory of a goldfish
main_df['column_with_N'].replace(dictionary, inplace=True) works, but its super slow. from what i read df.map would be faster, but its not doing the job rn
Looking at the documentation, it is clear I have no idea what I am talking about. So disregard
main_df['column_with_N'].map(dictionary) shows a proper mapping in a series, but it does not substitute the values in main_df
ahah no worries, thx for ur help
🙂
When I try writing main_df['column_with_N'] = main_df['column_with_N'].map(dictionary) it seems to output what you want
Not sure if this is what you are after @amber anvil I am more of a Matlab person.
@amber anvil you need to assign the result back to the original, as hexicle pointed out
.map does not work "in place"
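A small sketch of the point above, using hypothetical stand-ins for `main_df` and `dictionary`: `.map` returns a new Series and leaves the frame untouched until the result is assigned back.

```python
import pandas as pd

# Hypothetical data standing in for main_df and the lookup dictionary.
main_df = pd.DataFrame({"column_with_N": [1, 2, 3, 2]})
dictionary = {1: "one", 2: "two", 3: "three"}

# .map returns a new Series; main_df itself has not changed yet.
mapped = main_df["column_with_N"].map(dictionary)
unchanged = main_df["column_with_N"].tolist()

# Assigning the result back is what actually updates the column.
main_df["column_with_N"] = mapped
```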
I want to ask a very simple question regarding something I encountered in anaconda, but I dont know if this is the correct place. May I or should I move to a help channel?
(I ask here because anaconda is considered a popular python distribution amongst data scientists)
@muted sapphire what's the question? (I am not an admin or mod but if I can help I will)
@muted sapphire yep this place is fine, otherwise #tools-and-devops is ok
Hey thanks greghouse. I just wanted to know if its normal, every time that you create a new environment, to reinstall jupyter notebook?
It happened to me yesterday and it seemed weird that I had to reinstall whats already in my pc
Thanks guys
@muted sapphire that’s what a virtual environment is
it effectively acts like a new “container” for installed packages
Packages I can understand. But jupyter, i mean its like an IDE, isnt it?
Jupyter is a package too
And to be honest a friend of mine doesnt have to install it when he makes a new environment so I was unsure whether i made a mistake or not
it’s possible to do that too
I see. Do you know how? I didnt know jupyter behaves like a package tbh. I just considered it an IDE, like pycharm
okay, first
“package” and “IDE” are not mutually exclusive
a package is just a Python module container
and you can write an IDE in Python
which would make it a package too
anyway, to answer your question...
Thank you for the valuable information, I hadnt thought about it this way but makes sense.
Yes please, go on
I believe you cannot customise it directly, but it depends on your version of Anaconda...? (I’ve never had a need to do this)
I have the latest, he doesnt
Maybe thats a reason, i dont know
As long as it is "normal" and its not a mistake by me, i dont mind it installing it
IMO
new environments coming with stuff that is not necessary is an antipattern
you won’t always be doing stuff that needs Jupyter
and by “necessary” I mean for Python to run
This is true, it makes sense for it NOT to come with it installed.
Perhaps I just want to test something in the console or w/e.
You are right, i was mainly confused because I didnt consider jupyter as a package you know?
does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
as in save that data genomes data and only run it by itself
Thank you anyway 🙂 @velvet thorn You were very helpful
@muted sapphire np!
i wrote code to filter words from a given Pandas series that contain at least two vowels
import pandas as pd
from collections import Counter
color_series = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
print("Original Series:")
print(color_series)
print("\nFiltered words:")
result = mask = color_series.map(lambda c: sum([Counter(c.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
print(color_series[result])
any suggestions on whether i can use regex to solve this?
>>> import re
>>> colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
>>> colours.str.count('[aeiou]', flags=re.I)
0 1
1 2
2 3
3 1
4 2
5 2
dtype: int64
there you go
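To finish the original task with the regex count shown above, the counts can be turned into a boolean mask. A minimal sketch, reusing the same series:

```python
import re
import pandas as pd

colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])

# Count vowels case-insensitively, then keep the words with at least two.
vowel_counts = colours.str.count('[aeiou]', flags=re.I)
filtered = colours[vowel_counts >= 2]
```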
thanks
yw
is Pandas worth learning
hm
u need to learn that
and
matplotlib
for graphing
then again mysql does the job of pandas so
no
SQL does not do the job of pandas...
and pandas doesn't do the job of SQL either
then how would u describe a pandas dataframe
in particular, SQL focuses strongly on guarantees that databases provide, like ACID
the DataFrame is an abstraction representing tabular data
and dont say a dictionary of series cuz it rly isnt :/
well i mean it IS but
acc yea
and
ic where ur comming from
pandas doesn't need a database
it's (more or less) source-agnostic
SQL deals only with databases
huh
ic ic
well imma need a resource to learn pandas anyway so if u dont mind
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
hm I don't really have one, sorry
also pandas is a lot more suited to quick experimentation than SQL
because the name of the package is pandas
well
ok so
if im not wrong
from what ive learned
if u have a dataframe and u wanna only select the rows where a certain condition is true
df.iloc[df["column"] > 5]
?
or was it df.loc
damn ot
okay
so, if you just want to filter on rows
you can do df[df['column'] > 5]
.loc is for when you want to filter on rows and columns
so say you want all the rows where column_1 > 5, and only the column column_2
that would be df.loc[df['column_1'] > 5, ['column_2']]
df.loc[row_indexer, col_indexer]
.iloc, on the other hand, is for positional indexing
ic
so say you want the 3rd row
ohhh
df.iloc[2] (df.iloc[2, 0] would give just the first column of that row)
ohhhh
ic ic
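The three indexing styles from this exchange, side by side. The frame is made up; `column_1` and `column_2` mirror the names used in the chat.

```python
import pandas as pd

df = pd.DataFrame({"column_1": [3, 7, 9], "column_2": ["a", "b", "c"]})

# Boolean filter on rows only; all columns are kept.
rows_only = df[df["column_1"] > 5]

# .loc takes a row indexer and a column indexer.
rows_and_cols = df.loc[df["column_1"] > 5, ["column_2"]]

# .iloc is purely positional.
third_row = df.iloc[2]        # whole third row, as a Series
single_cell = df.iloc[2, 0]   # third row, first column
```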
and if u want to get by row name?
ok ok i got it thz
rhx
rhx
thx
rows don't have names, normally.
but if u want
like say
u have a list of states
and their population
and area
and u want
the U.S's row
you would have a column
called "state" or something like that
then df[df['state'] == 'US']
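A short sketch of that lookup with made-up numbers. As an alternative that is not mentioned in the chat, the `state` column can also be promoted to the index, after which `.loc` finds rows by label.

```python
import pandas as pd

# Made-up numbers, only for illustration.
df = pd.DataFrame({
    "state": ["US", "CA", "MX"],
    "population": [331, 38, 128],
    "area": [9.8, 10.0, 2.0],
})

# Boolean filter on the column, as suggested above.
by_filter = df[df["state"] == "US"]

# Alternatively, make the column the index and look the row up by label.
by_label = df.set_index("state").loc["US"]
```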
yay im understanding this
yup, good job
this might be a lousy question but can you treat data frames like arrays
liek if i import a .csv file into jupyter... can i treat the data as an array and index stuff out of it
if you read the csv using pandas this might help you. Take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
thxx
Im learning ds now. would any of you be able to send me intro classes udemy or whatever it may be?
i used freecodecamp
Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?
same question
@lapis sequoia why do you want someone to tell you what to do
the world is full of interesting questions
find one close to your heart.
that's not how you should end your university journey IMO
@velvet thorn i need some Suggestions only
then maybe you should tell us what you're interested in
because AI is so wide
like games? make a game AI
interested in photography? how about some kind of smart filter
food? maybe an NLP project for parsing recipes?
health? ML for mental health, given a daily questionnaire?
there are a ton of ideas out there; this is just what I came up with off the top of my head.
I think when he asked that he meant what kind of project would be respectable enough to pass, sure there's lots of ideas but skills+cliche+some more factors narrow down the spectrum
Maybe ask some final-years what they made and get to know what kind of stuff works, usually it should be deployable too
@lapis sequoia it's really hard to say because standards vary widely across institutions
That's true
but, yeah, honestly
which is why it's best to ask around
I have seen way too many people whose first instinct is to come to a community and ask for help with something they should have spent some time thinking about first
so I suppose I'm a little jaded
need some help with webscraping in #help-burrito !
I agree with gm
Learning to ask for help is good. But learn to try and think for yourself first.
If they said "hi im debating between X Y and Z topics and my advisor is ambivalent, can someone give me insight into any of these domains for an undergrad thesis?"
Im sure we would all be happy to help
Their question is one step above the people who just ask for homework answers
Part of writing a thesis is picking a topic, that's part of research
Sounds weird but anyone from India here?
We've got folks from all over the world, though that question doesn't seem on topic for here.
how could i update a dataframe in real time
if i am passing in input from a file and adding to it
Is it possible to reference "NaN" in pandas? it's automatically filling in blank cells as such and I would like to map it to "" because that seems easier to work with outside of the pandas module. I can't seem to find out how to reference "NaN", though
so you want to replace NaN cells with empty strings?
What's the datatype of those cells?
Yes, simply because I can't seem to reference the "NaN" in other statements, like if statements or loops
objects
If there's a way to reference "NaN" outside of pandas, that would be nice
I tried numpy.nan, but no dice
just tried that, but apparently the module doesnt have that attribute
hello, do you guys have any resource recommendations for the math side of data science which will accompany me throughout my data science learning journey? i know linear algebra, calculus and linear programming, however, i really need help with the statistics
I need to map it to a string because I'm using pandas with regular expressions
Oh, sorry.
found it. there's a .fillna method
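A minimal sketch of the `.fillna` approach, assuming an object-dtype column with one blank cell. Because NaN never compares equal to anything (including itself), `isna()` is the usual way to test for it.

```python
import numpy as np
import pandas as pd

cells = pd.Series(["abc", np.nan, "def"], dtype=object)

is_missing = cells.isna()     # element-wise test for missing values
cleaned = cells.fillna("")    # missing cells become empty strings
```

After this, string and regex operations on `cleaned` see an empty string instead of a float NaN.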
the book data science from scratch goes a bit into it and has additional resources if you want to learn more which probably answers your question @onyx juniper
Is there a way in pandas to get the index of a value, column, or row?
Hi, i am trying to plot a stock market graph on python with the date on the x axis and the price on the y axis. However I get an error that says KeyError: 'Date'.. but in my CSV file there is a column called date? Could it be that the jupyter notebook cannot recognize my DTG format?
@crude karma make sure capitalization is the same, but for debugging you might need to post a code snippet, like the section of your graphing code and what your df.head() looks like
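One way to check the capitalization point, sketched with an in-memory stand-in for the CSV (the column names and values here are invented): printing `df.columns` shows the exact spelling pandas sees.

```python
import io
import pandas as pd

# Stand-in for the CSV file; note the lower-case "date" header.
csv_text = "date,price\n2020-01-02,10.5\n2020-01-03,11.0\n"
df = pd.read_csv(io.StringIO(csv_text))

columns = list(df.columns)         # exact header names as pandas read them
has_upper = "Date" in df.columns   # df["Date"] would raise KeyError here
has_lower = "date" in df.columns
```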
okay i figured it out but this doesnt look like a stock chart.. how do i combine both highs and lows
Is this close to what you’re looking for? https://www.byteacademy.co/blog/time-series-python?hs_amp=true
oh danng okay ill read that and figure it out thanks buddy
hey, this is my situation: i have a dataframe and need to update each row. for each row i need to make a request to retrieve the new data and replace the old data. The thing is that if I do this sequentially, it will probably take 15-20 days. That is why I want to use multithreading, so that it will only take a few hours if I parallelize the requests. I know this is probably some basic stuff for you, but what is the best way to pass the data from a pandas dataframe to each thread in python?
it is not good practice to create variables in a for loop for each file and row, right?
that's why i was thinking of creating a variable for each row manually instead of doing it dynamically
then i would pass each variable with the datarow to a thread, make the request, update and then replace the row in the dataframe with the updated row
but that would mean I would need to create 200 variables by hand... so i am sure there must be some better way to do this if creating them dynamically is bad practice
how would you go about this?
@lapis sequoia which do you want?
I guess either. or are the methods really different from each other?
@lapis sequoia does any row depend on any other row?
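If the rows are independent of each other, one possible approach is a thread pool over a list of row dicts, which avoids creating per-row variables entirely. This is only a sketch: `fetch_update` is a hypothetical placeholder for the real request, and the column names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_update(row: dict) -> dict:
    """Hypothetical stand-in for the real HTTP request; returns the updated row."""
    return {**row, "value": row["value"] * 2}


df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

# One plain dict per row; no per-row variables are needed.
rows = df.to_dict("records")

# executor.map preserves input order, so results line up with the original rows.
with ThreadPoolExecutor(max_workers=8) as executor:
    updated_rows = list(executor.map(fetch_update, rows))

updated_df = pd.DataFrame(updated_rows, index=df.index)
```

Threads suit this case because the work is waiting on network I/O; the worker count of 8 is arbitrary and would need tuning against whatever the remote service tolerates.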
@lapis sequoia hm...let's take a step back
why do you want to do that?
I just think it could be useful sometimes, like if you want to sort something
.sort_values()?
Oh, I guess there's a method for that but
Is there really never a good time to return the index of something? just feels like something that could come in handy
I don't think I have ever needed to do that, but you could filter and then access .index
like literally ever as far as I can remember
oh, I didn't realize the .index method returned a value
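A tiny sketch of "filter and then access .index", with an invented frame. Note that `.index` is an attribute rather than a method, so it is read without parentheses.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 12, 7, 20]}, index=["a", "b", "c", "d"])

# Filter first, then read the labels of the surviving rows from .index.
labels = df[df["price"] > 6].index.tolist()
```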
also native Python nan is float('nan')
I have a dataset with 100 labels how do i calculate the accuracy?
how come when i specify a figsize, it says 'list' object has no attribute 'loc'
when i do df= plt.plot(df.loc[:,'Time'],df.loc[:,'VO2']) it works but when i add a figsize, that error shows up
where are you adding figsize?
😄
so I'm trying to implement keras tuner as an automatic hyperparameter tuner in my model and for the weight regularization I was wondering what would be a good minimum and maximum value to have?
and a good value to step too
Ping me if you have an answer and thank you
hello, i have a dataframe with a column which takes only two values, say A and B, and want to create a column A_1,A_2,A_3....A_countA,B_1,B_2,....B_countB
how do I achieve this?
t = pd.Series(["a", "b", "b", "b", "b", "a"])
t
0    a
1    b
2    b
3    b
4    b
5    a
dtype: object
func(t)
0    a    a_1
1    b    b_1
2    b    b_2
3    b    b_3
4    b    b_4
5    a    a_2
dtype: object
can someone tell me how i can achieve func?
try determining the indices of all the letters and store that into another list
so that you can use those indices to append to the letters
i have trouble getting the indices
prob best if you had a function that went through the list and kept individual counters
and appending them to a larger list
you mean have a global counter
did it thanks
for reference
t = pd.DataFrame({'A': ["a", "b", "b", "b", "b", "a"]})
counter = {}
def func(x):
    # running count per value, starting at 1 so the first "a" becomes "a_1"
    ix = counter.get(x, 0) + 1
    counter[x] = ix
    return '{0}_{1}'.format(x, ix)
t.A.apply(func)
uh...
@solid lagoon a bit late but well
you should actually use cumcount
>>> t + (t.groupby(t).cumcount() + 1).map(lambda v: f'_{v}')
0 a_1
1 b_1
2 b_2
3 b_3
4 b_4
5 a_2
dtype: object
thanks man, I knew I had seen this somewhere way before
yeah I know because I myself spent time coding exactly that
and then a while later I found there was something for this
Hey guys is SQLite common for data analysis? I’ve just learned yesterday that Python has a sqlite library built in. Really only need a database to store data in and query what I need. I don’t have admin access on my work laptop so can’t try others without requesting, but is it at least common use?
@lapis sequoia pandas?
@velvet thorn I’m trying to avoid reading in the data every time and then selecting what I need. So came across this SQLite database I could potentially use to store the data and then query what I need. Was just wondering if SQLite is commonly used?
for small datasets
What’s generally considered small?
well
anything under a gigabyte
but honestly
I don't really see the problem with reading data into memory every time...?
although if you don't have to do interactive analysis
SQLite might be just what you need
I presume you're good with SQL so why not
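A minimal sketch of the store-once, query-later workflow using the built-in `sqlite3` module together with pandas. The table and column names are hypothetical, and an in-memory database is used so the example is self-contained; a file path would persist between sessions.

```python
import sqlite3

import pandas as pd

# In-memory database for the sketch; pass a file path to keep the data on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names.
sales = pd.DataFrame({"region": ["north", "south", "north"], "amount": [100, 250, 75]})
sales.to_sql("sales", conn, index=False)

# Pull back only the rows that are needed instead of loading everything.
north = pd.read_sql_query(
    "SELECT region, amount FROM sales WHERE region = ?", conn, params=("north",)
)
conn.close()
```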
I’ve just been finding it super slow and there is certain repetitive analysis I do, that I know exactly what I need.
Well, I know the basics, but it can’t be that hard to pick up!
if you find pandas slow
generally one of two things is true
- you're using it wrongly
- your data is too big
anything above a gigabyte (on disk) starts to poke into "bad for pandas" territory (you can consider something like dask I suppose)
Right I see. Yes I’m going over a gigabyte. I’m super new to this kind of stuff so more than likely not being optimal! Literally just ordered Python for Data Analysis by Wesley McKinney!
okay so
very simple rule of thumb
if you have a for loop in your pandas code, you're probably doing something wrong
+1 although i do tend to loop over .columns occasionally
Definitely no for loops!
@lapis sequoia can you give an example of something that's slow
And can you give an example of a different tool where the same operation is not slow
@desert oar oh yeah that's perfectly fine
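To make the rule of thumb concrete, here is an illustrative comparison on an invented frame: a row-by-row loop versus the equivalent column-wise expression.

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "quantity": [10, 4, 5]})

# Row-by-row loop: works, but gets slow on large frames.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorised: one expression over whole columns.
df["total"] = df["price"] * df["quantity"]
```

Both give the same numbers; the vectorised form pushes the arithmetic into compiled code instead of the Python interpreter.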
any1 have a good pyspark resource for me to learn? Im thinking w3 schools for hiveql first.
@bitter fiber hiveql is basically just sql. i wouldnt start there
i dont know of specific resources, but it helps if you think of pyspark as a declarative interface to a query engine
I know sql just wanted to learn how to setup the environment and special quirks
ah, i cant say i know much about setting up the env
Right.. I have 6 raspberry pi's and 1 main computer that i wanted to interface together for a hobby of mine
but yeah, spark is weird because you have to think of it more like constructing a query or constructing a program that is to be compiled and executed, rather than executing code line by line as in python
I was thinking maybe it would be useful to create a datamine
i think typically you deploy on yarn or mesos, although it does support "standalone" cluster mode
and i guess it supports k8s too
Apache Spark 3.0.0 documentation homepage
the docs are decent albeit sometimes disorganized
i would start by practicing w/ pyspark itself on a local cluster before you try to actually deploy on your rpi farm
What does standalone cluster mode mean?
ok so on my own computer in a local cluster meaning running just on my workstation?
my workstation that im working on first has 16 physical and 32 total with virtual cpus I want to learn how to utilize everything.
"standalone cluster" would be spark running directly on the machines without an engine like yarn/mesos/kubernetes underneath it
ah..
"local cluster" is 1 machine
i think for making use of a single high-core workstation spark probably isn't the best unless you have tons of RAM
256 GBS of ram
I bought a 1500 dollar refurbished machine
i have a similar machine at work, its nice but we never use spark on it
for big stuff there i just use dask or i just yolo 30 GBs of data into memory with pandas or data.table
Thats what I do for work. pandas
i would like to start a data mine in my house that consumes many public apis
thats a fun project
I was originally thinking of running with a LAN mongodb to not worry about schemas
and just ingest everything into my big computer
Any tips or starting points on downloading main posts and comments from a Facebook group I’m a member of? Doing some text analysis and word cloud type stuff - want to know if it’s doable, link examples, and see if anyone’s aware if it’s against any sort of TOS
read in with rpis and save on hdfs running on a NAS or something? idk
@floral mantle that's probably against facebook's TOS and you should check to see if they have any provisions about "automation" or "crawling"
I think you’d have to use their Graph API to do it and I see references for it
only 2 TB of harddisk on my workstation though..
So need a Dev key etc.
facebook is tough. you need to get verified app permission
its not like twitter which is more open
yeah i would say learning the Graph api is very valuable in the marketplace though
i dont know what goes into getting that kind of permision
@bitter fiber or just work for Big Corp where they contract out all that stuff 😛
(but then you end up doing half the work for the contractor anyway because they dont know wtf theyre doing)
Lol they hired some guy to just maintain the facebook api and he barely works now
at my job and they cant fire him because everyone else is too lazy to work on that.
its more legal stuff than anything lmao
@desert oar i had another question; a claim that people say about hadoop is that you use 1/10th the server cost; is that because you compress the data more or something across a cluster?
built in backups?
i dont know what that even means
like if you pay for 300 GB of storage hadoop only lets you use 30 GB?
i dont work with hadoop directly ever so i honestly have no idea, but that's a questionable claim
Gotcha..
Might be true if you have a lot of replication going on. Storing 30 GB of data might end up taking up a lot more than just that.
Not sure how this applies to hadoop directly but this blog post gives an overview of why storing 30 GB of data might need 300 GB. https://jrs-s.net/2016/11/08/depressing-storage-calculator/
Hi all, I have a machine learning algorithm that I am trying to code. I have had very little experience with it so I am getting stuck on what type of algorithm I should use. I am trying to make a program that if a song is playing (SONG X), it recommends the next song (song Y). In order to do so, I have a set of variables that song Y should fit or be closest to (variables a,b,c,d,e,f....). All of the variables are percentages. Given a list of songs that song Y could be in, I want to find the best match for song Y in the list. If it was only one variable, all I would do is find the song in the list of songs that has the closest variable value. But what do I do once I start comparing multiple variables?
So, find the closest point to a given one in a high-dimensional space, based on a predefined metric?
I believe so yes ?
I mean, that's it. It's no different from the one-variable case. The only thing you need is to design the metric function. For that, you could use just sum of squared differences (euclidean metric), but note that you'd want to normalize all of your parameters then (so that they're all 0 mean and 1 variance), otherwise certain parameters will affect the distance more than others.
So sum of the squared differences and then find the song that has the smallest sum. Which should be the song that is least different?
Pretty much. I mean, that's just finding the closest point in space to this one.
To calculate the distances efficiently, you can use https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html#scipy.spatial.distance.cdist
EDIT: fixed link, it's cdist you want.
Wow that is literally so much help, thank you
I have been trying a bunch of different complicated algorithms for the past few days
So there is one caveat, one of the variables isnt a percentage like the others are. That variable being bpm(beats per minute) which doesnt really have a max or a min so there isnt a way for me to represent it in a percentile manner.
Sure there is 🙂
How so?
For your entire dataset, for each variable, calculate the mean and standard deviation for that variable
Then subtract the mean and divide by std.
Every variable will then end up with 0 mean and 1 std.
It means they'll then lose obvious meaning (having 1 on the bpm score would mean "it's around 1 standard deviation more than the mean among all songs"), but that'd put them all into a similar range.
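A sketch of the whole recipe (standardise, then pick the song with the smallest sum of squared differences) in plain NumPy. The feature columns and all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical features per song: [energy %, danceability %, bpm].
songs = np.array([
    [80.0, 60.0, 128.0],
    [30.0, 40.0, 90.0],
    [75.0, 65.0, 124.0],
    [20.0, 90.0, 170.0],
])
target = np.array([78.0, 62.0, 126.0])  # what song Y should look like

# Standardise every column using the dataset's own mean and standard deviation.
mean = songs.mean(axis=0)
std = songs.std(axis=0)
songs_z = (songs - mean) / std
target_z = (target - mean) / std

# Squared Euclidean distance from the target to every song; the smallest wins.
distances = ((songs_z - target_z) ** 2).sum(axis=1)
best_match = int(np.argmin(distances))
```

The target is scaled with the same mean and standard deviation as the dataset, so bpm and the percentage columns end up contributing on a comparable scale.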
scikit-learn has a function for this transformation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
wait
wrong one, hold on
I'm writing all of this down. I think that is all I need so really really thank you
and scikit-learn is nice in that it has detailed User Guides
Here's one for data standardization: https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
oh, and by the way, scikit-learn actually has an entire module for efficiently (without considering every single other point) finding the closest neighbours for a point:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors - list of functions
https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors - User Guide for it.
So, I am trying to have this recommender program iterate through the entire list so that each song is "perfectly" played after another and that it creates a playlist/mix. The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist. I feel like that would be too brute force. Do you have any advice on how I should do that?
Why not just start from a random (or user-defined) point and then traverse the graph of songs, always choosing the closest non-explored point?
Holy crap, okay, better to look back into my algorithm textbooks haha
The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist.
In geometry terms, this can be rephrased as "I want to find the shortest-length path that visits all of my points exactly once". Do you happen to know how that task is called, perhaps? 🙂
That's how the path traversing all nodes exactly once is called (I think), but the problem of finding the one with the shortest length is very (in?)famous under the name of the https://en.wikipedia.org/wiki/Travelling_salesman_problem
it's, uhm, a very very hard problem. NP-hard, even.
So you probably shouldn't bother. Just always go to the closest unexplored neighbour or something.
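The "always go to the closest unexplored neighbour" heuristic can be sketched as below. It is a greedy approximation, not an optimal ordering, and the feature values are made up and assumed to be already standardised.

```python
import numpy as np


def greedy_playlist(features: np.ndarray, start: int = 0) -> list[int]:
    """Order songs by repeatedly jumping to the closest song not yet played."""
    remaining = set(range(len(features)))
    order = [start]
    remaining.remove(start)
    while remaining:
        current = features[order[-1]]
        # Closest unvisited song by Euclidean distance.
        nxt = min(remaining, key=lambda i: np.linalg.norm(features[i] - current))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical, already-standardised features for five songs.
features = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [4.0, 5.0], [1.0, 1.0]])
playlist = greedy_playlist(features, start=0)
```

Each step only compares the current song against the ones left, so the cost grows roughly quadratically with the number of songs rather than factorially.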
yeah, just that every single other song is technically a neighbor
unless I could categorize the bpms as neighbors since I want the songs to flow into eachother...
yup, it's like Euclidean TSP on a plane (where you can go to any city you want and the distance is just the euclidean distance between them), but it's in n-dimensional space instead 😅
Nevertheless, whenever your problem turns out to be a subclass of TSP, that's generally a sign that you might want to simplify it.
(TSP isn't easily solved even for points on a plane)
Okay yeah, I think this is a great starting point, thank you
Mind if I add you in case I have any other questions?
sure
I am having a long back and forth on the coursera forums for Andrew Ng's machine learning course. Either I am just dense and need someone else to explain things to me (most likely), or the other person is wrong. Anyone on here willing to help me out. Here is the discussion (I had to save it to a PDF since the forum post is behind a user/password wall on coursera).
Here is the coursera link, in case you have credentials and can view the forum (Just in case you are wary about opening some random dude's PDF from a file sharing site you may not be familiar with)
https://www.coursera.org/learn/machine-learning/discussions/weeks/1/threads/5WdAbuk8EeqXNhLj2fFeZQ
I don't think either of you are really wrong. You're basically asking why use the least squared error function of all things. The answer is something like "it's the provably best way under certain assumptions to minimize the mean error".
The data you made doesn't really fit these assumptions, so it unsurprisingly is a very bad fit under LSE. You could potentially achieve that orange line by detecting outliers - for example, if one searched for a subset of points of size around 70% of the total that had the least average squared error when fitting a line to it, then one would obtain the red line:
So, I guess, I could also say that your concerns are valid, but they pretty much never occur in practice. You don't usually have to fit a line to a dataset that's obviously non-linear.
Any suggestions on how to use NumPy to get rid of the for-loops in the function shown below?
def mu_davidson(mus, mws, xs):
    mus = np.asarray(mus)
    mws = np.asarray(mws)
    xs = np.asarray(xs)
    a = 0.375
    e = (2 * np.sqrt(mws * np.array([mws]).T)) / (mws + np.array([mws]).T)
    f = 0.0
    n = len(mus)
    for i in range(n):
        for j in range(n):
            f = f + xs[i] * xs[j] * e[i, j]**a / np.sqrt(mus[i] * mus[j])
    mu_mix = 1 / f
    return mu_mix
Here's an example of using the function:
mus = [179.75, 363.87]
mws = [2.016, 28.014]
xs = [0.85, 0.15]
mu_mix = mu_davidson(mus, mws, xs)
print(f'mu_mix = {mu_mix}')
@lapis sequoia what's that supposed to do?
@tidal bough thanks! FYI, I am taking Professor Ng's course so I can make sense of what you told me a few days ago. I almost have it all sorted out.
Regarding the recent discussion, this is the latest update, which I think clears up my confusion and explains the other dude's opinion:
Found this, and I think it answers my question:
"PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as Orthogonal Regression or Total Least Squares, and is appropriate when there is no natural distinction between predictor and response variables, or when all variables are measured with error. This is in contrast to the usual regression assumption that predictor variables are measured exactly, and only the response variable has an error component."
My interpretation of this statement is... Normally, we assume zero error in the X axis (the input), only error in the Y axis (the output). But, in the case that the X value is also susceptible to error, then PCA is a better fit.
So, for the example of the square feet of a home vs predicted home price. There is a negligible error in the square feet measurement that can be assumed to be zero, while there is much error in the price values. In that case, do not use PCA.
However, if the city required the use of a specific contractor to make square feet measurements on homes and that contractor was known to intentionally add error into their measurements just to throw everyone off, then PCA would be the better method to use.
IF I understood everything correctly, that explanation clears up my confusion. Please let me know if I am understanding this correctly.
This example shows how to use Principal Components Analysis (PCA) to fit a linear regression.
@velvet thorn See my edit. I added an example of using the function.
As far as I can tell PCA matches what I was trying to do with my intuitive fitting using the shortest perpendicular distance.
@lapis sequoia I mean, what are the for loops intended to achieve?
understanding how to optimise your algorithm from a high-level description would be simpler than trying to figure it out from your code
The for-loops are part of a summation equation.
okay, it is too early for me to read math or iterative numpy code, so I will leave this to someone else...
hopefully someone else will come along
never mind I got bored and did it
@lapis sequoia ((xs * xs.T) * (e ** a) / ((mus * mus.T) ** 0.5)).sum(axis=None)
you need to make xs and mus 2D
[:, np.newaxis]
@modest rune Pretty much. PCA is used for dimensionality reduction - basically, take a lot of points in n-dimensional space, and find an m-dimensional (m<n) subspace to project the points onto such that the distances between the points and their projections are minimized. For n=3, m=2, it's finding a plane in 3d space that the data most closely matches. For n=2, m=1, it's your example. Unlike LSE, PCA indeed doesn't have any direction bias - in fact, PCA is perfectly fine with fitting a vertical line, something LSE can't do at all, because, well, the latter assumes that y is a function of x.
@lapis sequoia
I'd say like this:
prod = np.multiply.outer(xs,xs) # prod[i,j] = xs[i]*xs[j]
prod *= e**a
prod /= np.sqrt(np.multiply.outer(mus,mus))
f = np.sum(prod)
outer is one of my favorite numpy features; I've done so much stuff to mimic its behavior before I found it 🙂
yeah, I should have used the outer product
...too early.
perhaps I should go back to sleep
@tidal bough and @velvet thorn thanks, I had no idea outer existed
it can be applied to any numpy ufunc
What's the difference between np.outer and np.multiply.outer? They appear to do the same thing.
@lapis sequoia effectively, nothing
outer is a method on all numpy ufuncs
so, for example, you could have np.add.outer
however, because np.multiply.outer is a special operation known as the outer product, it is given a top-level alias np.outer
you can tell this if you look at their signatures.
np.multiply.outer takes the generic ufunc.outer signature
whereas np.outer is different (it also flattens its inputs to 1-D first, so the two only match for 1-D arrays)
Ah, I see. Thanks again.
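A small sketch of the point above, using made-up arrays: `outer` is available on any ufunc, and `np.outer` agrees with `np.multiply.outer` for 1-D inputs but flattens higher-dimensional ones first.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

# For 1-D inputs the two calls give the same 3x2 table of products.
products = np.multiply.outer(a, b)
same_for_1d = np.array_equal(products, np.outer(a, b))

# outer works on other ufuncs too, e.g. a table of pairwise sums.
sums = np.add.outer(a, b)

# With a 2-D input the results differ in shape:
m = np.ones((2, 2))
flattened_shape = np.outer(m, b).shape         # np.outer ravels m first
generic_shape = np.multiply.outer(m, b).shape  # ufunc.outer keeps m's dimensions

print(same_for_1d, sums.tolist(), flattened_shape, generic_shape)
```

So for the 1-D arrays in mu_davidson either spelling is fine.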
np
I revised my previous function based on help from @tidal bough and @velvet thorn. This looks much cleaner.
def mu_davidson(mus, mws, xs):
    a = 0.375
    e = 2 * np.outer(mws, mws)**0.5 / np.add.outer(mws, mws)
    f = np.sum(np.outer(xs, xs) * e**a / np.outer(mus, mus)**0.5)
    mu_mix = 1 / f
    return mu_mix
can anyone who's familiar with SQL help me understand why the last statement is printing 1?
the first table is derived from the 'hacker_news' table. it shows the top users and their score
update: i found my mistake, it was a miscalculation i made with the JOIN statement
Hi guys, I make youtube videos where I explain Artificial Intelligence terms and news for non-experts. My goal is to demystify the AI “Black box” for everyone and raise awareness about the risks. Give it a check if you can, I am actually posting a new video in 2 hours! 😁 I would love any feedback (especially negative, but pertinent) in order to improve my videos and explanation skills! Thank you!
Here's the channel: https://www.youtube.com/c/WhatsAI
Got anything on LSTMs? Cause I'd love to know wtf those are doing
yeah LSTM is amazing but I'm looking for YOLO v4 @hallow briar
Hi All- I am fairly new to discord and have a question regarding time series analysis and handling missing data in a df
I have a df, indexed by date, to capture the spread history of various bonds. However, for some bonds, the spread levels become unstable as the bond reaches maturity/is close to being paid off (as shown by the sudden drops in the plot)
sample spreads
Now, I want to apply some data quality checking to stabilize such bonds
specifically, the DQ check that I want to apply is that across the bonds (which are columns in the df), each time there are 10 consecutive NaN i.e. missing data, cut off the rest of the data as the last 20 days of data
However, I am struggling to find a clean way of defining a function to perform this DQ check. Any thoughts on what an ideal approach may be?
corresponding df
*as well as the last 20 days of data
Hi. The binned distribution of one of the columns in a dataframe is shown below in blue. I've tried removing outliers using IQR and variations of IQR (tuning the quantiles) and in red you see the binned distribution of the subset of elements which lie in the quantiles [0.05, 0.95]
My question is why the red distribution is so much smaller. The filtering removed only about 100 elements. Shouldn't the red be about as high as the blue distribution?
Zoomed in on x ∈ [0, 5]
@lapis sequoia It looks like the bins of the red one are smaller and the histogram is not normalized (density=True isn't passed), so the smaller the bins, the lower they will be (because fewer elements fall into them).
Pass density=True when examining distributions. Here's a comparison:
import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0, 500, 10000)
plt.close()
plt.figure()
plt.hist(X, bins=50)   # raw counts: wider bins give taller bars
plt.hist(X, bins=100)
plt.show()
produces:
import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0, 500, 10000)
plt.close()
plt.figure()
plt.hist(X, bins=50, density=True)   # normalized: heights are comparable
plt.hist(X, bins=100, density=True)
plt.show()
produces:
A more specific question to DL/ML:
Are there any publications/projects/demos to a program that involves hand-gestures to control a e.g web-page or similar?
gesture control on google scholar gives a lot of promising patents:
https://patents.google.com/patent/US9640181B2/en
https://patents.google.com/patent/US8448083B1/en
but I'm actually unable to find studies, huh.
hey beginner question on good algorithms to try for classifying text into one of two categories
scikit-learn is an amazing library for ready solutions (rather than making your own). Check out logistic regression, and every topic with "classification" in it:
https://scikit-learn.org/stable/supervised_learning.html
that page is exactly what i needed thanks
hey guys can anyone link me some sites where i can find some data sets, specifically im looking for server logs
@tidal bough Wow, thanks for those. Exactly what I am searching for
Hello. I'm testing some Pandas and threading/multiprocessing. I find it odd that threading is a bit faster than multiprocessing. The function I passed to multiprocessing.Process and threading.Thread sums() a dataframe and threading finished first. Is this right? I thought multiprocessing would finish faster.
are you actually doing parallel work?
@still otter I'm counting the number of votes in the dataframe per candidate and I pass it on to a function that filters the dataframe by candidate name and sum() them up.
This is the threading version. The multiprocessing one is similar.
hm. well in general Thread is faster because it has less overhead, but Thread is not capable of parallel computation in pure python. So which is faster depends on how you are doing the computation and how much data you're working with
i don't know much about pandas but it's possible that sum() is run in native code that releases the GIL, which means it can be run concurrently with Threads, in which case the main downside of Thread is sidestepped and Thread will almost certainly be faster in this case
Thank you!
anyone know a method to reduce mode collapse in GANs, without adding another neural network?
does anyone know how to add a custom function into the model in Keras? as in I want to pass the output of a layer through my function and use its output for another layer
@frail arch Can't you just use the functional API? Can you provide an example of what you want to achieve?
@hasty grail for eg. say, I want to take output of a layer, add something to it, pass it to a dictionary and use the dictionary's output as input for the next layer
Does the functional API not work for that?
@tidal bough Thanks I also needed this page
Anyone got any good resources on reinforcement learning?
hello, i am going in to year 12 and am looking to do CS at uni, can someone explain what a job in data-science would entail
I'm working on a pixel art editing program and wanted to know what a good method for finding similar neighbors with bucket fill? I have the pixels mapped with a dictionary in f"{x}x{y}" format. I made my own function which figures out all the valid neighbors recursively but don't know if there is a more efficient method.
def bucket_fill(id, layer):
    to_fill = [id]

    def find_neighbors(neighbor_id):
        x, y = neighbor_id.split("x")
        x, y = int(x), int(y)
        l = None if x == 0 else f"{x-1}x{y}"
        r = None if x == layer.width - 1 else f"{x+1}x{y}"
        t = None if y == 0 else f"{x}x{y-1}"
        b = None if y == layer.height - 1 else f"{x}x{y+1}"
        neighbors = [l, r, t, b]
        return [n for n in neighbors if n]

    def check_neighbors(neighbor_list):  # Check if color matches, and not already in the to-fill list, returns new pixels to check after adding them to to-fill
        new_neighbors = []
        for n in neighbor_list:
            if n not in to_fill:
                if layer.pixeldict[n].color == layer.pixeldict[id].color:
                    to_fill.append(n)
                    new_neighbors.append(n)
                    print(f"Added {n}")
        return new_neighbors

    def check(neighbor_id, i):  # Recursively check a pixel and its neighbors
        print(f"Check recursion {i}")
        neighbors = find_neighbors(neighbor_id)
        neighbor_list = check_neighbors(neighbors)
        print(neighbor_list)
        for n in neighbor_list:
            check(n, i + 1)

    check(id, 0)
    return to_fill
Am not entirely sure why you need to store them in a dictionary. Wouldn't a 2-D array do pretty much the same thing?
Seems that you are performing a depth-first search (the recursion follows one chain of neighbors before moving on to the next), which is perfectly valid imo
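For comparison, here is a minimal sketch of an iterative flood fill on a 2-D grid, which is one way to follow the "2-D array" suggestion. It is not the poster's code: the list-of-lists `grid` of color values and the `(x, y)` tuples are assumptions made for the example. Using a queue avoids Python's recursion limit on large regions, and a set makes the "already seen" check constant-time instead of scanning a list.

```python
from collections import deque


def bucket_fill(grid, start):
    """Return the set of (x, y) cells connected to `start` that share its color."""
    height, width = len(grid), len(grid[0])
    start_x, start_y = start
    target = grid[start_y][start_x]

    seen = {start}          # set membership is O(1), unlike `n in some_list`
    queue = deque([start])  # cells whose neighbors still need checking

    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and (nx, ny) not in seen and grid[ny][nx] == target:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return seen


# Hypothetical 3x3 canvas: "r" and "b" stand in for pixel colors.
canvas = [
    ["r", "r", "b"],
    ["b", "r", "b"],
    ["r", "b", "b"],
]
filled = bucket_fill(canvas, (0, 0))
print(sorted(filled))
```

The bottom-left "r" is not included because it only touches the filled region diagonally.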
This I think is a super easy question. In numpy, what is the best way to create a 2D array (let's call it G) with dimensions Mx2, where each column is a feature that has a defined linspace representing values I want to predict, and G needs to be every possible combination of the 2 features' linspace values.
for example:
# probably would create these using linspace to create these, unless a function exists that does
# everything at once.
ages = [45;50;55;60]
nose_pimples = [0;1;2]
# Desired Result
G =
[ 0, 45;
0, 50;
0, 55;
0, 60;
1, 45;
1, 50;
1, 55;
1, 60;
2, 45;
2, 50;
2, 55;
2, 60 ]
np.stack(list(itertools.product(nose_pimples, ages)))
or you can use meshgrid I guess
!eval
import numpy as np
print(np.mgrid[0:3:1, 45:61:5].reshape(2, -1).T)
^
thanks
@gaunt tusk https://www.coursera.org/learn/practical-rl/ I'm doing this coursera course on it.
Also:
• Sutton, Barto - Reinforcement Learning: An Introduction
• Berkeley - CS285: Deep Reinforcement Learning
(copied from the AI discord server)
how to install caffe in windows? I am getting error CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.
I have latest CMake installed
In this video we go over the distinction between invariance and sensitivity based adversarial perturbations. The former being a much less studied attack which is able to break "robust" models!
I encourage you to create discussions here or on the youtube comment section about the paper and share related work, we can all learn from each other!
Paper: https://arxiv.org/abs/2002.04599
Abstract: Adversarial examples are mal...
anyone know where i can get lecture videos and slides for the latest cs109 courses with a recent Python 3.x version
Or is the 2015 version the only one that's free for all
If you do enjoy it please consider subscribing and promoting the channel! It encourages me to put more effort into these videos I have other videos which span related topics.
@raven mulch
I think it's great that you're creating YouTube content and sharing it with our members in a channel that has a relevant topic - but "remember to subscribe" crosses the line over into straight up advertising, and violates our rules. Maybe you can use this channel to ask for feedback, instead. I wouldn't have any problem with that.
Just try to keep that in mind for the next video, though. We technically don't allow advertising, but I think it's a shame to completely block content creators who are making things that may be relevant to the interests of our members - so you're basically walking a bit of a tightrope with these posts.
Yep definitely. I appreciate it! I’m mainly looking to gain a following of people to discuss papers I make videos on, I understand how advertising can become annoying though
@raven mulch honestly, it's a p good topic
but I was not sure if it was against the rules
Hi, I have a dataframe with multiple columns and NaN values at the beginning. And I try this:
I need the timestamp of each column's first value
any idea?
do you want the timestamp (which is in index?) for the first non nan value per dataframe column?
@tidal bough i finally figured out how to generate a surface plot for IV. I mean, I actually understand what the heck I am doing and how the math works under the hood. Thanks for your help earlier. And you were right, I was a few characters away from having working code. But, I was about 10 hours of learning away from actually understanding what was going on. Anywho, coursera, for free, has an excellent intro course on machine learning by Stanford's Dr. Andrew Ng. I'm only 2 weeks into the 8 Week course,but prof. Ng explains everything in a way I can understand and doesn't make assumptions about my math background.
yeah, I very much liked how that course gives you an understanding of how it works under the hood
you won't need to actually implement these algorithms, most likely - just use premade algorithms from libraries like scikit-learn or pytorch - but it's going to be useful if you would want to understand ML articles or code some advanced (and so non-standard) algorithm.
Yeah. I hate blindly using a library without enough understanding of the underlying principles. I just can't be confident in my usage.
@novel remnant yes i need a new dataframe that contains: column names and first non nan value's index
Anyone here experienced with tensorflow?
@tawny pivot
something like this then?
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-05', freq='d'),
    'a': [np.nan, np.nan, 1, 2, 3],
    'b': [np.nan, 1, 2, 3, 4],
    'c': [1, 2, 3, 4, 5]
})
df.set_index('time', inplace=True, drop=True)

# This is the part that you want
new_dict = {}
for col in df.columns:
    new_dict[col] = df[~pd.isna(df[col])].index[0]
pd.DataFrame.from_dict(new_dict, orient='index').T
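A shorter variant of the same idea, as a sketch on the same toy data: pandas has `Series.first_valid_index()`, so the loop can be replaced by applying it to each column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [np.nan, np.nan, 1, 2, 3],
        "b": [np.nan, 1, 2, 3, 4],
        "c": [1, 2, 3, 4, 5],
    },
    index=pd.date_range("2020-01-01", "2020-01-05", freq="D"),
)

# One entry per column: the index label of its first non-NaN value.
first_valid = df.apply(lambda col: col.first_valid_index())
print(first_valid)
```

For a column that is entirely NaN, `first_valid_index()` returns None rather than raising, which the `.index[0]` approach would do.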
@lapis sequoia pure tensorflow or tensorflow keras?
uhh like a simple doubt
first i download the dataset using this
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
now i need to use this function to resize the images to 64 x 64
but i am getting an error 🤔
what error are you getting?
the images are grayscale do you reshape them first to shape (-1, 28, 28, 1) before resizing?
Are there best practices for text pre-processing?
This is what I need to do
this is what i'm doing
but applying this using a pandas transform is super slow on 2k text bodies
and i will need to do it on 16k on the out-of-sample texts
I'm not getting any errors on my part, can you share the part of your code that throws the error?
@novel remnant this works for me thank you ^_^
cheers!
What is the best technique for finding feature importance in a dataset?
Let's say I have a trained SKLearn model with a good enough (~80%) accuracy
There seem to be several ways I can find feature importance:
- sklearn's .feature_importances_, which I'm not sure how it works
- Recursive Feature Elimination
- Permutation feature importance
Which of the above will give the "best" results?
And when would I want to use one over the other?
And should I be doing RFE/PFI with a cross-validation set? or using accuracy from the training set itself?
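To make the options in the question concrete, here is a hedged sketch on synthetic data (the dataset, the random forest, and all parameter values are assumptions for illustration). It compares the impurity-based `feature_importances_` of a tree ensemble with `sklearn.inspection.permutation_importance` computed on a held-out split, which is the usual way to avoid measuring importance on data the model has already memorized.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_importance = model.feature_importances_

# Permutation importance: drop in validation score when one feature is shuffled.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
perm_importance = result.importances_mean

for i, (imp, perm) in enumerate(zip(impurity_importance, perm_importance)):
    print(f"feature {i}: impurity={imp:.3f} permutation={perm:.3f}")
```

Permutation importance works with any fitted estimator, whereas `feature_importances_` only exists on some model types (mainly tree-based ones).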
can someone help me understand sparse matrices and how to manipulate them. from what i understand a sparse matrix basically only gives the non-zero entries to save memory
can i use standard numpy functions on a sparse matrix?
particularly i want to do something like
np.sum(np.multiply(x!=0,(y>0)[:,None]),axis=0)
Yup, you pretty much can.
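One caveat worth a sketch: not every NumPy function accepts a `scipy.sparse` matrix directly, so it is often safer to use the sparse matrix's own methods. The example below assumes `x` is a `csr_matrix` and `y` a 1-D label array (both made up here), and counts, per column, the nonzero entries in rows where `y > 0`, which is what the dense expression in the question computes.

```python
import numpy as np
from scipy.sparse import csr_matrix

x = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3],
                         [4, 0, 0]]))
y = np.array([1, -1, 2])

# Keep only the rows with a positive label (integer row indexing stays sparse).
rows = np.flatnonzero(y > 0)
x_pos = x[rows]

# getnnz counts stored entries, so drop any explicitly stored zeros first.
x_pos.eliminate_zeros()
counts = x_pos.getnnz(axis=0)

# Dense reference, using the expression from the question.
dense = x.toarray()
expected = np.sum(np.multiply(dense != 0, (y > 0)[:, None]), axis=0)

print(counts, expected)
```

The sparse version never materializes the full dense matrix, which is the point of keeping a bag-of-words matrix sparse.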
okay, can i also ask why after using a count_vectorizer i would have columns that sum to zero?
that would mean the word doesn't show up in any documents right?
something like this
cv = CountVectorizer()
bagofwords = cv.fit_transform(text)
np.min(np.sum(bagofwords,axis=0))
returns zero
I think so, yeah. It's weird if that's the case
hmm, must mean something weird is going on
I'm having some trouble wrapping my head around how to approach the problem I am currently having with pandas and my dataframe.
Basically I have 4 columns of daily values from different shop locations, using a datetime index. I want to resample it into monthly columns, but without losing each daily value, which is what happens if I just use resample().mean(). I have several years' worth of data, and it would be nice to have each column in the final df be labeled Month Year. I'm a little stuck. Any help would be appreciated.
would be easier to visualise what you want if you showed us a sample of your data and how you want it to look like
One moment
raw data is formatted like this
I want to turn it into this
I can do it manually via
a = df.loc['2011-08']
a = a.unstack().reset_index(drop=True)
But its a huge hassle to do for large datasets and I know there is some way my beginner brain isn't seeing
The key is to preserve the data and not just use resample().mean() or some other thing that doesn't allow me to keep all data.
so you basically want to reshape all rows from 2016-04 in the original df, to a single column in the new df
yes, but my data goes back to 1993, till today
so I need a solution that isnt using .loc 444 times
I have a sample csv with data from 2006 till 2020 with random int in it to try to figure this out
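A possible way to avoid repeating `.loc` for every month, sketched on made-up data (the column names and date range are assumptions): group the rows by calendar month, apply the same `unstack().reset_index(drop=True)` step to each group, and collect the results as columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shop data: two columns of daily values.
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
df = pd.DataFrame(
    {"shop_a": np.arange(len(idx)), "shop_b": np.arange(len(idx)) + 100},
    index=idx,
)

columns = {}
for period, chunk in df.groupby(df.index.to_period("M")):
    # Same manual step as before, applied to one month at a time:
    # stack all shop columns of that month into one long Series.
    label = period.strftime("%B %Y")
    columns[label] = chunk.unstack().reset_index(drop=True)

# Months have different lengths, so shorter ones are padded with NaN.
monthly = pd.DataFrame(columns)
print(monthly.shape)
```

Each output column holds every daily value from that month (all shops, one after another), so nothing is averaged away.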
@tidal bough Thank you for those resources, both look nice
anyone have pyspark experience and want to share what it was like for you?
@still verge what was what like?
working with PySpark?
like working with pandas but much more tiring and bothersome
it being distributed means that stuff runs slower
on small datasets
of course, you wouldn't be able to do that kind of stuff on large datasets with native pandas (would need, like, dask or something)
but, yeah.
the abstractions are not as convenient
e.g. selecting specific rows and columns
many people told me not to use it if the dataset is small, is it that bad?
you have to litter your code with a lot of the function operators
@still verge without a reason, I'd say you shouldn't
thanks for the input!
hello, if you are a little bit familiar with multithreading, can you help me understand what i am doing wrong here?
import _thread
from threading import Thread, Lock
import pandas as pd

# `reddit` is assumed to be an API client object created elsewhere (not shown)
mutex = Lock()
df = pd.read_csv(f"dftest_1.csv")
df = df.reset_index(drop=True)
df['id']='NaN'
df['new_score'] = 'NaN'
for index, row in df.iterrows():
    s = row['full_link']
    s = s[38:44]
    df.at[index, 'id'] = s

def get_new_data(index, row):
    global df
    submission = reddit.submission(row['id'])
    print(submission.score)
    mutex.acquire()
    df.at[index, 'new_score'] = submission.score
    mutex.release()

for index, row in df.iterrows():
    _thread.start_new_thread(get_new_data, (index, row))
I am loading a csv, create two new columns filled with 'NaN'. then i create the ID from the full link. so far so good
now, I try to update the column 'new_score'. I do this using _thread so the requests i make with reddit.submission() happen all at the same time.
in the get_new_data() function I make the request and print the submission.score. it works and i can see the scores one after another and almost instantly - so the multithreading seems to work
then i lock the dataframe, write the new value and release it again
but the dataframe that is returned doesnt have the new values
no error
but also no new values
Try to use Ray for multithreading
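One likely cause (an assumption, since the rest of the script isn't shown) is that `_thread.start_new_thread` returns immediately and nothing waits for the threads, so the dataframe is read before the writes happen. A minimal sketch with the standard library's `ThreadPoolExecutor`, which blocks until all requests are done; `fetch_score` is a hypothetical stand-in for `reddit.submission(...).score`:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_score(submission_id):
    # Hypothetical placeholder for the network call,
    # e.g. reddit.submission(submission_id).score
    return len(submission_id) * 10


df = pd.DataFrame({"id": ["abc123", "de45", "f6"]})

# map() runs fetch_score concurrently and returns results in input order;
# leaving the `with` block waits for every worker to finish.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(fetch_score, df["id"]))

# Write once, from the main thread, after all results are in (no lock needed).
df["new_score"] = scores
print(df)
```

Collecting the results and assigning the column in one step also avoids mutating the dataframe from several threads.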
Any idea how to convert this data frame into a time series, taking the months' columns as index?
@ivory panther which is the month column?
Enero, Febrero, Marzo ... (January, February, March, etc)
Looks like murder counts to me
The number of crimes
no, I mean
what do you expect the result to look like
in general for "how do I convert this to that" questions sample output is very useful in helping people understand what you expect
because "time series" is rather vague
Have date instead of just year. For example 2015/January, Aguascalientes, Homicidio, 2 (crimes)
Something similar to this
i pressed anaconda navigator but it isnt working
(i cant open my anaconda navigator how to fix?)
what are you trying to get to anaconda for?
I don't really use the navigator, but you can open a terminal on a mac and type jupyter notebook and it'll open notebooks
open anaconda prompt / miniconda prompt on windows to do the same thing
thanks
I don't know that I like matplotlib
For that matter, what nice plotting libraries that are higher-level than matplotlib are there? I don't quite get how they can be easier to use than the latter.
import typing as t
from plotter import Plot, Point, Color

class Car(Point):
    value: int
    speed: int
    type: Color

cars: t.List[Car]
car_chart = Plot(cars)
car_chart.show()
that's the kind of API I'd expect to see
ah, interesting
I guess I could make one but that requires work.
hi Everyone
so I am a python coder and
I have experience with Numpy but
Im trying to learn Pandas
and eventually matplotlib
does anyone have any good resources for learning these libraries
book, yt tutorials, anything
uhh
then what do i use
for pandas
well matplotlib as well if smth makes it easier to learn
In this video, we will be learning how to get started with Pandas using Python.
?
Hello, I want to iterate over a period of time. how can i compare a day's information with its previous day? One of the things i want to compare is the close of a stock with its previous day
Hey @marsh seal
I think I am doing something similar
do you know how to call on the previous row data?
hey thanks for a quick reply @wintry sapphire no i don't
use shift and create a new column with the shifted values for which you can compare with the original values
this way you can vectorize the operations for quick results
@novel remnant Hi potaki, could you show me an example please
sure one moment
for example if you want to subtract the value of the previous day from the value of the current day
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', '2020-01-10', freq='d'),
    'a': np.arange(10)
})
df.set_index('date', inplace=True, drop=True)
df['a_previous'] = df.a.shift()
df['a_minus_previous'] = df.a - df.a_previous
df
Series of Data Science Articles for getting started with Data Science / Machine Learning, includes step-by-step implementations:
Please consider reading if you are interested and subbing to my channel to help build your knowledge and skills in data science
hey, im kinda starting out on python, ik some basics n stuff and after some help from ppl i wanna get into machine learning, i think? xd, i dont rly know what it is, anyone got resources on what it is
hey @silk saddle
I release at least one episode every week
if you guys arent sure about anything or don't understand any concepts leave a comment on my channel or article
and i will get back to you as soon as possible
tysm ❤️
Sup guys, I'm learning the math that is needed for ML (which will take 2-3 months), but in these 2-3 months I don't want to just learn math without programming (I already have about 6 months of experience with Python). Can you give me any advice on what kind of projects I should work on right now? I'm kinda lost.
since you are learning machine learning, you can try implementing some algorithms from scratch
maybe start with linear regression with OLS and/or gradient descent
principal components
things like that
maximum likelihood even
but things like this don't need math?
sure they do
you need to know and understand the equations in order to implement them
but Im still learning the math so how I can deal with them?
you can do the simpler things.
what specifically are you learning now?
alternatively, you can work on general projects that are not specifically related to ML
currently learning calculus
calculus* btw thanks for the advice, I will try to work on projects not about ML.
if you mean with math nothing much but If you mean python projects, I have made a website and some automation stuff
yup, that's cool!
if you wanna do ML
it's important to also be a good programmer.
have you worked with visualisation tools?
in particular, matplotlib
if you want
just thought it might fit into calculus
it's kind of hard to think of a programming project that can focus on that
yeah I think my question is kinda weird haha, thanks for answering. I think I will try making bots for some platforms, that is the idea that just came into my mind.
the https://www.coursera.org/learn/machine-learning course doesn't really assume any background knowledge - it needs linear algebra, but it teaches it in process. It has plenty of implementation tasks.
no Reinforcement Learning there, sadly, for that you need a more serious course, from the Advanced ML specialization.
hi. i am trying to install a local package on disk
pip install /directory/my_package
but when i run jupyter, and import my_package , it says it is not found
is there a work around to fix this?
i tried installing the package with the same pip as in the which jupyter directory
@arctic canopy that's fine too! as long as you're practicing programming and learning new things, don't worry
there are a ton of interesting concepts that I picked up along the way that became relevant months later
thanks a lot man, that means a lot
If you had $5.2K in tuition reimbursement from your employer for accredited coursework, what course/degree/boot camp would you use it for?
^for someone who has Python basics down but wants to pivot careers into data science
@nimble solar are you using a venv or other environment?
Hi guys, I am trying to achieve this in a Dataframe
but i keep getting NaN
does anyone know how to do it?
@arctic canopy. Yeah like salt was saying try reimplementing some algorithms or papers. For the first few you might want to follow a guide.
As for the actual math, if you've covered calculus you can reimplement many of the basic algorithms without much difficulty. linear, logistic should all be doable
I mean you're dividing by zero @wintry sapphire. In pandas 0/0 gives NaN, and any other number divided by 0 gives inf or -inf
well yeah but depends on the problem
Hmm alright
cause I want to find the percentage change
@flat quest
How do I
fill my 1 Jan number with my second Jan?
Hey @flat quest, do you happen to know why
even after I fill my 1 Jan with a number
i still get an error?
not sure i totally follow what ur trying to do
fill 1jan number with second jan?
@flat quest Alright so here is my output
Basically in the option 1
column
I want 2019-01-01
to be my initial value, which is 10,000
for the next date, 2019-01-02, I would want it to be the value under StkB_close 2Jan - 1 Jan divide by 1 Jan
times the value above in option 1 in 1 jan
Meaning
In my option 1, for 2 Jan
the value would be
(101.12 - 101.12) / 101.12 * 10000
Assuming the final value is 30000
Then for 3 Jan
it would be
(97.40 - 101.12) / 101.12 * 30000
@flat quest this is what I'm trying to do
But I keep getting NaN
@wintry sapphire df / df.shift(1) - 1 (same as df.pct_change())
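A small sketch of the percentage-change calculation being discussed. The three prices and the column name `StkB_close` come from the messages above; the compounding step is an assumption about what the portfolio column is meant to hold, not something the thread confirms.

```python
import pandas as pd

close = pd.Series(
    [101.12, 101.12, 97.40],
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    name="StkB_close",
)

# (today - yesterday) / yesterday; the first row has no previous day, hence NaN.
returns = close.pct_change()
same_thing = close / close.shift(1) - 1

# Assumption: treat the first day's return as 0 so the portfolio starts at the
# initial value, then compound the daily returns.
initial = 10_000
portfolio = initial * (1 + returns.fillna(0)).cumprod()
print(portfolio)
```

The NaN on the first row is expected: there is no earlier price to compare against, so it has to be filled or dropped explicitly.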
ohhh
@velvet thorn do you know
How to print out
certain rows and columns only
like in my DF, I have 5 columns = A B C D E
But I only want rows 5, 6 from columns C D E
is it
the .loc?
@velvet thorn so what I did was
for i, one_d in enumerate(date_check):
    print(portfolios.loc[one_d, 'Option_1'])
where date_check is the dates which I want to find
dates are the
index
but I want it to be from
several columns
not just
option_1
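For picking certain rows and several columns at once, `.loc` accepts a list of row labels and a list of column names, and `.iloc` does the same by position. A sketch with made-up data; the frame contents and the dates in `date_check` are placeholders, only the names `portfolios` and `date_check` come from the question.

```python
import pandas as pd

dates = pd.date_range("2019-01-01", periods=8, freq="D")
portfolios = pd.DataFrame(
    {col: range(i * 10, i * 10 + 8) for i, col in enumerate("ABCDE")},
    index=dates,
)

date_check = ["2019-01-03", "2019-01-05"]

# Label-based: a list of index labels and a list of column names, no loop needed.
by_label = portfolios.loc[pd.to_datetime(date_check), ["C", "D", "E"]]

# Position-based: rows 5 and 6, columns at positions 2 to 4 (C, D, E).
by_position = portfolios.iloc[[5, 6], 2:5]

print(by_label)
print(by_position)
```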
i can't seem to calculate the input features for my linear layer in pytorch
i apply the formula but i get size mismatch error
self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
self.conv3 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=5)
self.fc1 = nn.Linear(in_features=?????, out_features=360)
#forward method for pooling
tensor = F.relu(self.conv1(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
print(tensor.size())
tensor = F.relu(self.conv2(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
tensor = F.relu(self.conv3(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
can someone ELI5?
CxHxW = 1x40x40
according to my calculations it's supposed to be 24*1*1 but i get size mismatch error
nvm im dumb, i was calculating correct, error was elsewhere 😄
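The size arithmetic can be checked without PyTorch. A sketch using the standard output-size formula for convolution and pooling with dilation 1, applied to the 1x40x40 input and the three conv + pool stages above; `conv_out` is a helper written for this example, not a PyTorch function.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one dimension for Conv2d / max_pool2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 40                      # input is 1x40x40
for _ in range(3):             # three conv + pool stages
    size = conv_out(size, kernel=5)            # Conv2d(kernel_size=5)
    size = conv_out(size, kernel=2, stride=2)  # max_pool2d(kernel_size=2, stride=2)

in_features = 24 * size * size  # conv3 has 24 output channels
print(size, in_features)
```

The spatial size goes 40, 36, 18, 14, 7, 3, 1, so the flattened tensor has 24 features per sample. The tensor still has to be flattened (for example with `tensor.flatten(1)`) before it reaches `fc1`, which is a frequent source of size mismatch errors even when the number itself is right.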
Wow
Hello guys, what do i need to know to getting a start with data science? i learned a little bit of pandas and numpy
scikitlearn is nice to know and is fun to work with
@supple frigate it contains also datasets that you can work with and make predictions or stuff like that
I have another question though:
I have two different CSVs with time series data. One Table is continuous, starting in 01.01.2017 at 00:00. From there each row represents one hour (1. Table). The data looks kind of like this:
- Table aka df1:
Date, Volume
2017-02-03 12-PM, 9787.51
2017-02-03 01-PM, 9792.01
2017-02-03 02-PM, 9803.94
2017-02-03 03-PM, 9573.99
The other table contains events that happened and are serialized by UNIX datetime in seconds. I was able to convert it to datetime and group it by hour with this code:
df['datetime'] = pd.to_datetime(df['created_utc'], unit='s')
df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")
df['date_by_hour'] = df['datetime'].apply(lambda x: x.strftime('%Y-%m-%d %H:00'))
This resulted in this data:
- Table aka df2:
created_utc, score, compound, datetime, date_by_hour
1486120391, 156, 0.125, 2017-02-03 12:13:11, 2017-02-03 12:00:00
1486125540, 1863, 0.475, 2017-02-03 13:39:00, 2017-02-03 13:00:00
1486126013, 863, 0.889, 2017-02-03 13:46:53, 2017-02-03 13:00:00
1486130203, 23, 0.295, 2017-02-03 14:56:43, 2017-02-03 14:00:00
Now I need to map the events (2.table) to the Time Series of the 1. Table. If multiple events happened in one hour, i need to make an addition of the scores and calculate the mean average of the compound. In the end i want to have a dataframe like this:
- Final Dataframe
Date, Volume, score, compound,
2017-02-03 12-PM, 9787.51, 156, 0.125,
2017-02-03 01-PM, 9792.01, 2726, 0.682,
2017-02-03 02-PM, 9803.94, 23, 0.295,
2017-02-03 03-PM, 9573.99, 0, 0,
I know my code below does not work and is wrong, but I wanted to show what I was thinking how I could achieve this. I thought I could loop through each row of my events table df2 and compare if the datetime matches. If so, I would calculate score and compound. The issue is that I know that one should not loop through a dataframe and I don't know how to loop through another dataframe at the same time and perform the right calculations based on the previous rows...
for index, row in df2.iterrows():
    memory_score = 0
    memory_compound = 0
    if df1['Date'] == df2['date_by_hour']:
        df1['score'] = row['score'] + memory_score
        df1['compound'] = (row['compound'] + memory_compound) / 2
How can I get to my Final Dataframe? There must be some pandas magic that I could use to make this work and map the time series data to the right hours.
@lapis sequoia how about a join?
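One way the join could look, as a sketch: aggregate the events per hour first, then left-join them onto the hourly table. The data is copied from the tables in the question (using the `datetime` column as shown there); the helper column `hour` and the frame names `hourly` and `final` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2017-02-03 12-PM", "2017-02-03 01-PM",
             "2017-02-03 02-PM", "2017-02-03 03-PM"],
    "Volume": [9787.51, 9792.01, 9803.94, 9573.99],
})
df2 = pd.DataFrame({
    "datetime": ["2017-02-03 12:13:11", "2017-02-03 13:39:00",
                 "2017-02-03 13:46:53", "2017-02-03 14:56:43"],
    "score": [156, 1863, 863, 23],
    "compound": [0.125, 0.475, 0.889, 0.295],
})

# Give both tables a real datetime key truncated to the hour.
df1["hour"] = pd.to_datetime(df1["Date"], format="%Y-%m-%d %I-%p")
df2["hour"] = pd.to_datetime(df2["datetime"]).dt.floor("h")

# One row per hour: sum of scores, mean of compound.
hourly = df2.groupby("hour", as_index=False).agg(
    score=("score", "sum"),
    compound=("compound", "mean"),
)

# Left join keeps every hour of df1; hours without events become 0.
final = df1.merge(hourly, on="hour", how="left").fillna({"score": 0, "compound": 0})
print(final[["Date", "Volume", "score", "compound"]])
```

Aggregating before merging keeps the result at one row per hour, so no row-by-row loop over either frame is needed.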
Can you suggest another way of having the functionality the class searchNodeMaker(dict) provides here https://py3.codeskulptor.org/#user305_MIkOQUGRta9iO5Z_1.py ? I am trying to write that code into C++ and I don't know how to convert that.
what is ξ(never defined before) in this context?
hey every1, im using xlsxwriter to make a report in excel. On that i have a pie chart, and i'm not able to put the legend in the circular instead of being outside, like this image: Idk if its because xlsxwriter uses chart styles from Excel 2007.
Hey everyone, is there any way I can turn my deep learning model (regression) in google colab into a coreML model
@distant moss what paper is that?
i dont know the answer btw but thats pretty bad to just not define notation like that
I would think it's some kind of a known operator in matrix computations or smth
https://people.eecs.berkeley.edu/~jrs/meshpapers/Sorkine.pdf @desert oar page 4
or maybe the ξ is the sparse Cholesky factorization they're computing....
yeah, I'd think it's the decomposition or something
Do you guys know how to remove the bold and cell borders when you use pandas to_excel? I'm trying to overwrite it but the first column doesnt change
Does anyone have experience with connecting microsoft forms and answers to python?
How do I make line charts and how do I save them as .png and send them into an embed?
How do I make line charts
With matplotlib, probably.
how do I save them as .png
plt.savefig
you can set a different matplotlib backend if all you want is images
yup, matplotlib.use("AGG") if you only want pngs
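Putting the answers above together, a minimal sketch of drawing a line chart and writing it to a PNG with the non-interactive Agg backend. The data and the file name `line_chart.png` are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("line_chart.png", dpi=150)
plt.close(fig)  # free the figure when generating many charts in a long-running process
```

The saved file can then be attached wherever an image is accepted; how to place it in an embed depends on the library doing the sending and is not covered here.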
alright
def Diff2(old_list, new_list):
    li_dif = [[i for i in old_list if i not in new_list],[i for i in new_list if i not in old_list]]
    return li_dif
Is there a more efficient way to find differences in 2 lists than my current function? I want to know added and removed items separately^
I'll post it in general :/
@bold bane not in and of itself. but it can be useful in data science projects
@unique wolf you could use a set maybe? return list(set(old_list) ^ set(new_list))
well, for the exact same output(as sets), use
def Diff2(old_list, new_list):
    s1,s2 = map(set,(old_list,new_list))
    return s1 - s2, s2 - s1
symmetric difference s1 ^ s2 is equivalent to (s1-s2) | (s2-s1) (elements that are in one of the sets)
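If the order of the original lists matters, a variant that uses sets only for the membership test keeps it. This is a sketch assuming the items are hashable; `diff_lists` is a name chosen for this example.

```python
def diff_lists(old_list, new_list):
    """Return (removed, added), keeping the original order of each list."""
    old_set, new_set = set(old_list), set(new_list)
    # Set lookups are O(1) on average, so this is linear instead of quadratic.
    removed = [item for item in old_list if item not in new_set]
    added = [item for item in new_list if item not in old_set]
    return removed, added


removed, added = diff_lists(["a", "b", "c"], ["b", "c", "d", "e"])
print(removed, added)
```

Unlike the pure set version, duplicates in the input lists are preserved in the output.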
Any ideas on how I can visualize a JSON object in a dynamic tree in a Jupyter Notebook? Javascript (more specifically React) has visualization components like https://github.com/storybookjs/react-treebeard. Is there something similar for Jupyter?
hey, quick question: how do i convert AM/PM datetime to 24 hour datetime? Like, I want to convert 2020-05-19 01-PM to 2020-05-19 13:00:00
do i use also something like
df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")
@lapis sequoia keep it stored as a "datetime" internally. when you want to print it, use df['datetime'].dt.strftime https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html?highlight=strftime#pandas.Series.dt.strftime
the strftime format spec is here https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
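A short sketch of that advice: parse the AM/PM strings once, keep the column as datetimes, and format to 24-hour text only for display. The column name `as_text` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2020-05-19 01-PM", "2020-05-19 12-AM"]})

# %I is the 12-hour clock and %p the AM/PM marker; the result is a datetime column.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %I-%p")

# Only convert to strings when a specific text layout is needed.
df["as_text"] = df["datetime"].dt.strftime("%Y-%m-%d %H:%M:%S")
print(df)
```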
@rare ice I think it's not possible. But you can load the JSON object into a dictionary with the json module and print it. It'll look like a json tree
it would be a nice notebook or lab extension though!
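The printing approach suggested above, as a sketch with a made-up JSON string. Separately, IPython ships `IPython.display.JSON`, which JupyterLab can render as a collapsible tree; whether the classic notebook renders it the same way is not something this sketch checks.

```python
import json

raw = '{"name": "root", "children": [{"name": "a"}, {"name": "b", "children": []}]}'
data = json.loads(raw)

# indent=2 produces a nested, tree-like layout; sort_keys makes the output stable.
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)

# In JupyterLab, an interactive alternative would be:
#   from IPython.display import JSON
#   JSON(data)
```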
@velvet thorn i just found a use case for "auxiliary" pandas indexes, a query like this:
all_urls.groupby(['url_source'])['homepage_eval'].value_counts()
although like you said you can always set_index first
@desert oar can you elaborate
@velvet thorn all_urls.set_index('url_source').groupby(level=0)
but that has a lot of less-desirable properties e.g. if you want to use the original index inside each group
also i currently have a 3-level column index with names, yikes
is there a jupyter notebook or lab extension that allows you to bookmark a cell?
hi, so I want to be able to input 3 images to python, then decide one of them is the main one. then i want to find which one is the closest to the main one. does anyone know how i would do that?
@desert oar not sure about this but you can use HTML
like how you would do a table of contents but for one cell
@dusk aspen you can try a Siamese network
ok, i can try it
@lapis sequoia you can try asking here too next time
anyway, I believe you have that problem because you create a new BeautifulSoup for each page (in your loop)...but you only ever extract stuff after the loop ends?
I have a sort of design question. I have a lot of separate files that are being generated in real time. I will be making scripts to extract some plottable data out of these files. What can I use/do to save this arbitrary plottable data in a central location which can also notify plotters that there is new data to plot?
@still otter what does "central" mean in this case? where/how are the plots being generated?
I have a 300 mb text file (glove embeddings), what is the fastest way to upload it in colab everytime? my google drive is full so that is out of option.. Does hadoop or spark help for this?
@chilly pasture google cloud storage?
This is the first time i am hearing about it. thanks
i mean i thought it was a general term for google drive
Hewo, can I ask about if I can get the tweets from Tweepy package by year? Or is it just by tags
Hi everyone. Not sure if this is the place to ask - How often is Cython used in data science for computationally expensive programs?
Hello everyone!!
I never heard of it as a total beginner, despite lots of people's dislike for Python's performance. But as soon as I started my position in computational research, Cython was immediately brought up.
Hey guys, if you were to use an algorithm to make a text based on a large dataset, in a Markov chain fashion that'd actually yield (mostly) grammatically correct results and could be tuned based on user inputs (supervised learning?), what algorithm would you choose?
@tall bronze i use it at work occasionally. sometimes numba can help improve performance too
cython is good when you have to process a lot of data in a "production" setting and/or for writing libraries with better performance than pure python alone. it's not necessarily that useful in data science, moreso in ML engineering
hey guys I posted a data science related question in help-nitrogen, any help would be highly appreciated!
@eager heath Why not explore Deep Learning models? They are pretty efficient and highly accurate. Personally, I don't think general algos ever provide great performance or accuracy.....
Because it is way more work, I never did anything deep learning related
But I think it would be a good introduction, wouldn't it?
Would you have any great resource to get me started please?
Hi, I know basic python and some external libraries for research. But I am interested in building AIs. I am not aware of how to proceed using python. Can someone suggest online sources where i can begin learning AI using Python?
@spring flame maybe https://course.fast.ai/ could be a good place to start
Thanks a lot
Hi All,
Is there any function in pandas to calculate cumulative average/mean just like there is cumsum for cumulative sum
?
expanding average
it's simple then call series.expanding().mean()
like for every month all the rows of previous months should be included in the mean
expanding does that
great thanks....let me try
alright cheers
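A tiny sketch of the expanding mean mentioned above, next to the equivalent built from `cumsum` for comparison. The numbers are made up.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Mean of all values up to and including each row.
running_mean = s.expanding().mean()

# The same thing spelled out: cumulative sum divided by the number of rows so far.
by_hand = s.cumsum() / pd.Series(range(1, len(s) + 1), index=s.index)

print(running_mean.tolist())
```

For the per-month case, the data would need to be sorted by date first so that "previous rows" really means "previous months".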
What would you call seaborn, is it a wrapper for matplotlib? If so, please define what a wrapper is.
I am searching for a technical term
Please answer me in #help-lollipop
@desert oar By "central" I mean if a plotter client (backed by matplotlib/bokeh/whatever) wants to get some data to plot, regardless of what data it wants or which script is creating the data, the plotter client will always access the same location to get its data.
For example, I was thinking maybe I can use a single sqlite db to store all this extracted data. If I want to make a new script to extract some additional data, the script would add a new table to the db, which it would then populate. Plotters can then be easily updated to plot the new data, without caring about the scripts that made it. Does this design seem good for this goal? Or is there perhaps a tool/library that is better fit for this purpose?
hello! can anyone help me with excel filtering?
@still otter ok. what are the plotters though? are they all separate programs? how are they accessing the data? is this all happening on a single machine? on a network?
why not just keep a bunch of parquet files somewhere and watch for updates?
@still otter so basically you want a central store for data + push notifications?
everything is on a single machine for now, and probably will be for a while
@velvet thorn yeah, basically. it's probably a simple thing to solve but i'm unfamiliar with what's available to me
hm.
to give good advice on design it would be important to understand (quite a bit) more about the architecture and your needs
like so you have another script running
that's constantly updating plots?
so, i haven't really decided on the plotter quite yet
i have a basic plotter right now which is just a simple python script that makes a matplotlib plot manually from the files i have
basically parses the relevant data from the files without storing it anywhere
but it's a bit slow
and has to re-parse everything if i close and reopen it
also, this is kind of poor timing for me, i need to leave for a while 😓
why do you want push notifications then
thanks for the responses though, i'll have to read about parquet
sounds like pull would be a better strategy
maybe, i just like the sound of plots updating immediately when a new file is created
anyway, thanks for now, will be back later if anyone is around
Any tips on how to get rid of the for-loops in the first function? My attempt in the second function gives the wrong result because I'm not accounting for j != i. But I'm not aware of anything in NumPy to account for that.
import numpy as np

def mu_brokaw(mus, mws, xs):
    n = len(mus)
    mu_mix = 0.0
    for i in range(n):
        d = 0.0
        for j in range(n):
            if j != i:
                mij = ((4 * mws[i] * mws[j]) / ((mws[i] + mws[j])**2))**0.25
                num = (mws[i] / mws[j]) - (mws[i] / mws[j])**0.45
                den = 2 * (1 + mws[i] / mws[j]) + ((1 + (mws[i] / mws[j])**0.45) / (1 + mij)) * mij
                aij = mij * ((mws[j] / mws[i])**0.5) * (1 + num / den)
                sij = 1.0
                d = d + sij * aij * xs[j] / np.sqrt(mus[j])
        mu_mix = mu_mix + (xs[i] * np.sqrt(mus[i])) / (xs[i] / np.sqrt(mus[i]) + d)
    return mu_mix

def mu_brokaw2(mus, mws, xs):
    mij = ((4 * np.outer(mws, mws)) / ((np.add.outer(mws, mws))**2))**0.25
    num = np.divide.outer(mws, mws) - np.divide.outer(mws, mws)**0.45
    den = 2 * (1 + np.divide.outer(mws, mws)) + (1 + np.divide.outer(mws, mws)**0.45) / (1 + mij) * mij
    aij = mij * (np.divide.outer(mws, mws)**0.5) * (1 + num / den)
    sij = 1.0
    d = np.sum(sij * aij * xs / np.sqrt(mus))
    mu_mix = np.sum((xs * np.sqrt(mus)) / (xs / np.sqrt(mus) + d))
    return mu_mix

if __name__ == '__main__':
    # dynamic gas viscosity in µP
    mu_h2 = 179.75
    mu_n2 = 363.87
    # molecular weight in g/mol
    mw_h2 = 2.016
    mw_n2 = 28.014
    # mole fraction
    x_h2 = 0.85
    x_n2 = 0.15
    mu_mix = mu_brokaw([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix = {mu_mix:.4f}')
    mu_mix2 = mu_brokaw2([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix2 = {mu_mix2:.4f}')
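One possible vectorized version, offered as a sketch rather than a reviewed implementation. It differs from `mu_brokaw2` in three places: the diagonal of `aij` is zeroed with `np.fill_diagonal` to drop the `j == i` terms, `d` is kept as one value per `i` (a matrix-vector product) instead of a single sum over everything, and the square-root factor uses `mws[j] / mws[i]` as in the loop. The constant `sij = 1.0` is folded away. The loop version is repeated under the name `mu_brokaw_loop` only so the block is self-contained and the two can be compared.

```python
import numpy as np


def mu_brokaw_loop(mus, mws, xs):
    """Reference implementation with explicit loops (same math as the question)."""
    n = len(mus)
    mu_mix = 0.0
    for i in range(n):
        d = 0.0
        for j in range(n):
            if j != i:
                r = mws[i] / mws[j]
                mij = ((4 * mws[i] * mws[j]) / (mws[i] + mws[j])**2)**0.25
                num = r - r**0.45
                den = 2 * (1 + r) + ((1 + r**0.45) / (1 + mij)) * mij
                aij = mij * (mws[j] / mws[i])**0.5 * (1 + num / den)
                d += aij * xs[j] / np.sqrt(mus[j])
        mu_mix += (xs[i] * np.sqrt(mus[i])) / (xs[i] / np.sqrt(mus[i]) + d)
    return mu_mix


def mu_brokaw_vec(mus, mws, xs):
    """Same calculation with outer products instead of nested loops."""
    mus, mws, xs = (np.asarray(a, dtype=float) for a in (mus, mws, xs))
    r = np.divide.outer(mws, mws)                 # r[i, j] = mws[i] / mws[j]
    mij = (4 * np.outer(mws, mws) / np.add.outer(mws, mws)**2)**0.25
    num = r - r**0.45
    den = 2 * (1 + r) + (1 + r**0.45) / (1 + mij) * mij
    aij = mij * (1 / r)**0.5 * (1 + num / den)    # (1 / r)[i, j] = mws[j] / mws[i]
    np.fill_diagonal(aij, 0.0)                    # drop the j == i terms
    d = aij @ (xs / np.sqrt(mus))                 # d[i] = sum_j aij[i, j] * xs[j] / sqrt(mus[j])
    return np.sum(xs * np.sqrt(mus) / (xs / np.sqrt(mus) + d))


mus, mws, xs = [179.75, 363.87], [2.016, 28.014], [0.85, 0.15]
print(mu_brokaw_loop(mus, mws, xs), mu_brokaw_vec(mus, mws, xs))
```

Zeroing the diagonal after computing the full matrix does a little wasted work on the `i == j` entries, but it keeps every step a plain array operation.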
honestly this seems like it's really overengineered @still otter
whatever the plotter process is, i would just poll the directory every 30 seconds for new files
you can get fancier by sending notifications over a socket or something but i'd start with something really simple like polling for new files
i think inotify can watch for directories too
yep
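import os

import inotify.adapters
from my_library import plot_from_file

def main():
    i = inotify.adapters.Inotify()
    i.add_watch('./plot-data')
    # event_gen yields (header, type_names, path, filename) tuples
    for _, type_names, path, filename in i.event_gen(yield_nones=False):
        # only react once a file has been fully written and closed
        if 'IN_CLOSE_WRITE' in type_names:
            plot_from_file(os.path.join(path, filename))

if __name__ == '__main__':
    main()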
^ agree about the overengineered part
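For comparison, the plain polling idea suggested earlier needs only the standard library. A sketch; the function names `new_files` and `poll`, and the `max_rounds` parameter (there so the loop can be stopped in a test), are made up for this example.

```python
import os
import time


def new_files(directory, seen):
    """Return file names in `directory` not yet in `seen`, and record them."""
    current = set(os.listdir(directory))
    fresh = sorted(current - seen)
    seen.update(fresh)
    return fresh


def poll(directory, handle, interval=30.0, max_rounds=None):
    """Call handle(path) for each new file, checking every `interval` seconds."""
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for name in new_files(directory, seen):
            handle(os.path.join(directory, name))
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval)


# Example: poll("./plot-data", print) would print each new path every 30 seconds.
```

One caveat with polling: a file can be picked up while it is still being written, so writers may want to write to a temporary name and rename when done.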
How do I make a table like this in python? I have all the data already in a couple of df's. I just need it to print out a decent looking table.
Ok I'm big noob but I need some help. I'm working on a project to monitor market data and I'm running into a stupid problem. I already know what's causing it, just not sure how to get around it
ts = ForeignExchange(key='secret',output_format='pandas')
avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
#This returns a tuple which is the root of all my issues (I think)
avdf.drop(['open','high','low'])
#Returns "tuple object has no attribute drop"
#So I tried converting it into a list a few ways
df = list(avdf)
#This worked but I'm still having the same issue
df.drop(['open','high','low'])
#Returns "list object has no attribute drop"
#So I thought maybe it was because I had to directly link it to pandas. So I tried this
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns=avdf.columns, index=avdf.index)
#But still no luck.. It returns "tuple has no object columns"
#Getting annoyed and it's probably a super easy fix so if anyone could help me out, that would be greatly appreciated :D
I'm on python 3.8 using pandas 1.0.5 and alpha_vantage 2.2.0
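A sketch of one likely fix, assuming (as alpha_vantage's pandas output normally does) that the call returns a `(data, meta_data)` tuple. The tuple below is a stand-in built by hand so the example runs without an API key, and the numbered column names such as `'1. open'` are an assumption to verify against the real `avdf.columns`.

```python
import pandas as pd

# Stand-in for ts.get_currency_exchange_intraday(...): a (data, meta_data) tuple.
fake_response = (
    pd.DataFrame(
        {
            "1. open": [1.0, 1.1],
            "2. high": [1.2, 1.3],
            "3. low": [0.9, 1.0],
            "4. close": [1.1, 1.2],
        },
        index=pd.to_datetime(["2020-08-21 10:00", "2020-08-21 10:01"]),
    ),
    {"note": "placeholder metadata"},
)

# Unpack the tuple instead of wrapping it in list() or pd.DataFrame().
avdf, meta = fake_response
print(avdf.columns.tolist())  # check the real column names before dropping

# drop() removes rows by default, so name the columns explicitly.
closes = avdf.drop(columns=["1. open", "2. high", "3. low"])
print(closes)
```

Note that even on a real DataFrame, `avdf.drop(['open','high','low'])` without `columns=` looks for those labels in the index, which would raise a KeyError.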
Im gonna head to bed. If anyone has the time to respond please @ me in the message
Added " to the pd line
Now I get this traceback
Hey @fast bluff!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
Would rather not use that so hopefully this works
Hey guys! Anyone who works in analytics? Maybe you can help me out here: https://stackoverflow.com/questions/63513587/best-way-to-get-selling-windows-for-each-product-category-in-pandas I'm stuck with this for 12 days
Hi everyone, I'm trying to get a count column that get total 'Unit Sold' by tags from 'tags' in this df:
I want to make new two columns one for tag names and second for how much unit sold
For example:
How can I do that? can I split tags by ","?
yes I want to split the tags.. and count the unit sold for each tag
Hi everyone.
I'm new to python and I've just started to follow a youtube channel by Corey Schafer, trying to learn Pandas. I use Pycharm Community.
I'm kind of looking for a dataframe interface that resembles the Jupyter Notebook or if it's possible, any kind of an external window like the one on matplotlib. Not really liking the one that's there on Pycharm.
Is there any way I can do that?
Thanks!
@dense wharf PyCharm has Jupyter integration, but only for the Professional version
if you're a student you can get it for free though
@grizzled saffron df['tags'].str.get_dummies(','), then groupby columns
Thankyou! I'll look into it
uh.
do you know what that does?
okay
just run df['tags'].str.get_dummies(',') by itself
and you should understand what you're doing wrong
it made every values=0
and..
it made new columns by the tag names
I want to make one column named 'tag' then get this dummies to the row of 'tag'
then count the values of each tags in all rows
@velvet thorn Im sorry Im pretty new with pandas.. can you write here an example for the code..
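An alternative to the dummies route, sketched with made-up rows: split the tag string into a list, give each tag its own row with `explode`, then sum per tag. The column names `tags` and `Unit Sold` come from the question; the sample values and the output name `per_tag` are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "tags": ["red,cotton", "red", "blue,cotton,summer"],
    "Unit Sold": [10, 5, 3],
})

per_tag = (
    df.assign(tag=df["tags"].str.split(","))   # "red,cotton" -> ["red", "cotton"]
      .explode("tag")                          # one row per (original row, tag)
      .groupby("tag", as_index=False)["Unit Sold"]
      .sum()
      .sort_values("Unit Sold", ascending=False)
)
print(per_tag)
```

If the real data has spaces after the commas, the tags would need a `.str.strip()` after exploding so that `"cotton"` and `" cotton"` are not counted separately.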
hey guys. I'm trying to write an implementation of the NEAT algorithm, and there's something i don't quite understand:
between speciation, killing off the weakest genomes and repopulation, what happens to the existing species? i know that none of the genomes from the previous generation survives, but do the species?
i mean, do i just wipe all the species and respeciate each new generation from scratch, or do i somehow keep a representative genome from each and let them live until they have been underperforming for too long?
i haven't personally dived into genetic algorithms so can't really say that much here. But surely ur keeping the best genomes from the previous generation?
@wild pine
unless you're implementing some sort of elitism, where you let a couple of the best performing genomes survive unchanged, i don't think that's usually the case.
i mean if you think about it, evolution is about finding out who gets to reproduce, not finding out who gets to live forever.
generally the idea is that everyone dies, but only the best performing organisms pass on their genes to the next generation
Hi all
how do I print out this
from my dataframe
currently this is my dataframe
I want it to print out
On 2019-03-29, Option A is XXX, Option B is XXX, Option C is XXX
@velvet thorn any suggestions? 🙂
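The dataframe itself is not shown in the log, so this is only a sketch under assumptions: a date index and one column per option, with made-up values. The helper `describe_row` is hypothetical.

```python
import pandas as pd

portfolios = pd.DataFrame(
    {
        "Option A": [10000.0, 10250.5],
        "Option B": [10000.0, 9980.0],
        "Option C": [10000.0, 10100.25],
    },
    index=pd.to_datetime(["2019-03-28", "2019-03-29"]),
)


def describe_row(frame, date):
    """Format one row as 'On <date>, <column> is <value>, ...'."""
    stamp = pd.Timestamp(date)
    row = frame.loc[stamp]
    parts = ", ".join(f"{col} is {value:.2f}" for col, value in row.items())
    return f"On {stamp.date()}, {parts}"


print(describe_row(portfolios, "2019-03-29"))
```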
does anybody have any experience with DEAP library? I wanna know how can u set a specific chromosome in already define population for genetic algorithm?
Hey all! So I was thinking about making a project related to AI. Anybody want to collaborate? Actually I'm a final undergrad student and a project would really help me get a hands on experience about the topics I learnt. Thoughts?
where can i get data? Like is there a site or something that i can use?
@steel roost Kaggle, Data.world
Could someone please peep my message from last night if you have the time
to sum up
avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
#This works fine but returns a tuple which I can't work with.
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns="avdf.columns", index="avdf.index")
#This returns an error involving my arguments w/ index (full traceback posted w/ pastebin link above)
df = list(avdf)
#Tried this
df.drop(['open','high','low'])
#"list has no attribute drop"
See earlier post for further explanation
I tried a bunch of stuff and I think I'm on the right track with the pd.DataFrame but I think I'm having a problem passing the columns and index
well yes @wild pine the one that reproduces is the one that continues. But with most ML problems, u want the best producing rather than the one that happens to survive the best.
Ah so took a brief look at NEAT. So it's an evolution of the neural architecture itself. Based on what I can tell it looks like they're making mutations and then forming species based on a certain threshold difference. It looks like organisms are only eliminated based on their performance compared to individuals within their same species.
So by that logic, no species should completely die out. However, it might be beneficial to at some point remove the species completely if their performance is too terrible.
https://www.coursera.org/learn/machine-learning/
Is this course still good to learn ML? I mean, it is 9 years old, lots of things have changed
Complete guess, but I assume the fundamentals are still about the same so it couldn't hurt
is there a way to speed up pandas readers?
i have a data file that's extremely huge, and it appears to freeze when just trying to print the dataframe
import pandas as pd
import numpy
data_file = '/home/doomedapple7565/Downloads/Parking_Violations_Issued_-_Fiscal_Year_2017.csv'
# want sheet 1 to be new york
# want sheet 2 to be new jersey
#and i want a count of the number of tickets for each license plate
#and i want the first and last ticket of each license plate
df = pd.read_csv(data_file)
print(df.head())
i took this file: https://www.kaggle.com/new-york-city/nyc-parking-tickets?select=Parking_Violations_Issued_-_Fiscal_Year_2017.csv
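For a file this size, reading only the needed columns in chunks keeps memory use flat. A sketch that counts tickets per plate for one state; the in-memory CSV stands in for the real file, and the column names `Plate ID` and `Registration State` are assumptions about that dataset to check against its header.

```python
import io
from collections import Counter

import pandas as pd

# Stand-in for the real CSV; replace with the file path when running on real data.
sample_csv = io.StringIO(
    "Summons Number,Plate ID,Registration State,Issue Date\n"
    "1,ABC123,NY,07/01/2016\n"
    "2,XYZ789,NJ,07/02/2016\n"
    "3,ABC123,NY,07/05/2016\n"
    "4,DEF456,NY,07/06/2016\n"
)

ticket_counts = Counter()
reader = pd.read_csv(
    sample_csv,
    usecols=["Plate ID", "Registration State"],  # skip columns that are not needed
    chunksize=2,                                 # use something like 500_000 for the real file
)
for chunk in reader:
    ny = chunk[chunk["Registration State"] == "NY"]
    ticket_counts.update(ny["Plate ID"])         # count one ticket per row

print(ticket_counts.most_common(3))
```

Passing `nrows=1000` to `pd.read_csv` is another quick way to look at the structure of a large file before processing all of it.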
@flat quest so basically each generation will consist of a mix of survivors from the previous generation, along with their offspring?
tbh that also makes more sense to me and was my first intuition, until i read this response on a related question on stackexchange:
The neural networks with the worst performance are killed off after speciation. None of the neural networks survive - the entire population is replaced with the offspring of the nets remaining after the culling stage.
i suppose there're several different approaches.
I guess i'll try to eliminate half of each species and let them reproduce until the population limit is reached, keeping the best performing genomes from each generation (possibly mutating them slightly). I can always just rewrite it if it turns out to be a disaster.
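The plan described above (cull half of each species, keep the best genome of each, refill with offspring) could be sketched like this. It is not the canonical NEAT procedure: species are picked uniformly for reproduction instead of by adjusted fitness, children stay in their parents' species instead of being respeciated, and genomes are plain floats. All names here are made up for the example.

```python
import random


def next_generation(species, pop_size, mutate, crossover, rng):
    """Cull each species to its top half, keep each champion, refill with offspring.

    species maps a species name to a list of (fitness, genome) pairs.
    Returns a dict mapping species name to the new list of genomes.
    """
    survivors = {}
    for name, members in species.items():
        ranked = sorted(members, key=lambda m: m[0], reverse=True)
        keep = max(1, len(ranked) // 2)
        survivors[name] = [genome for _, genome in ranked[:keep]]

    # Elitism: the best genome of every species is copied over unchanged.
    new_pop = {name: [genomes[0]] for name, genomes in survivors.items()}
    names = list(survivors)
    total = len(names)
    while total < pop_size:
        name = rng.choice(names)  # simplification: species picked uniformly
        parents = survivors[name]
        child = mutate(crossover(rng.choice(parents), rng.choice(parents)))
        new_pop[name].append(child)
        total += 1
    return new_pop


# Toy genomes are floats; real NEAT genomes hold node and connection genes.
rng = random.Random(0)
species = {
    "s1": [(3.0, 0.3), (1.0, 0.1), (2.0, 0.2), (0.5, 0.05)],
    "s2": [(5.0, 0.5), (4.0, 0.4)],
}
new_pop = next_generation(
    species,
    pop_size=6,
    mutate=lambda g: g + rng.gauss(0, 0.01),
    crossover=lambda a, b: (a + b) / 2,
    rng=rng,
)
print({name: len(genomes) for name, genomes in new_pop.items()})
```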
thanks a lot for your input! I've been working really hard on the rest of the code, and it's been so frustrating to be stuck on such a minor detail. at least i feel like i can get back to coding now.
Can someone please help me ;-;