#data-science-and-ml
really? i didnt know that at all
so inside the env i should use conda install, not pip install?
Yes always conda install
If you cant run it that way, run it in other env
Unfortunately, issues can arise when conda and pip are used together to create an environment, especially when the tools are used back-to-back multiple times, establishing a state that can be hard to reproduce. Most of these issues stem from that fact that conda, like other package managers, has limited abilities to control packages it did not install. Running conda after pip has the potential to overwrite and potentially break packages installed via pip. Similarly, pip may upgrade or remove a package which a conda-installed package requires. In some cases these breakages are cosmetic, where a few files are present that should have been removed, but in other cases the environment may evolve into an unusable state.
me too lmao. rip env
one always learns something
-destroys env-
honestly
it doesn’t really matter for smaller environments where all dependencies are Python
for convenience you can just use pip
problems arise when you keep mixing the two over time
which you can generally avoid with a bit of planning

although using conda all the time is the safe play
do you happen to know where/how this is done?
ik I keep forgetting how bad it looks
but i want keep the lines
i was thinking about draw another graphic, and hide the legend of the lineplot and try to hide the other graphic, but i think it will take too much performance to render 2 graphic
thats what id do but only bc im not as familiar with seaborn
seaborn is default gorgeous but sometimes you want something a biiit more specific
and you end up accessing the matplotlib background anyways
honestly the 1 and only thing I miss about R is the plotting
just python 4 things
based on ggplot2
already hooked

oh this is a really cool approach

might play with this lib next time i have to plot something
Is there a library/tool that lets you run custom simulations with AI agents? For example, I wrote a gridworld simulation, and I want to test AI models with it, and train them/evolve them.
I think there is one library for visual simulations but i cant remember it lol
Dude I love that dog sticker you use lol
It's a grid world, shouldn't be hard to simulate yourself
yeah, I've already written one as a sort of prototype
but I want to see if there are other options for me before I start "finalizing" things
specifically the "connecting AI agents" part, right now my ai is baked into the simulation code and I'd want to change that going forward
but if something already exists that does that for me i'd rather use that than reinvent the wheel
its my favorite one. i also use it WAY too much. literally 50% of the time.
I'm looking at openai/gym right now, but it's difficult to know where I'd find things similar to it
especially since gym seems to be built around reinforcement learning, while i'm focusing more on unsupervised learning right now
simulations with AI agents most of the time will be focused on RL
since you need it to actually do RL
lol
i think a cool side project to do would be to teach an AI to play a small game with Q-Learning
like that flappy bird example
actually maybe i have my terms messed up and i am doing reinforcement learning lol
brute force reinforcement learning
or monte carlo might be the closest I guess
anyway, I see now that openai/gym provides specific environments and wasn't really made to handle custom environments
actually I was thinking about doing that for my bots if I ever get there
I'm also thinking it'd be easier just to code in how to play the game well (with an LSTM?)
send all the bot's vars for detection/movement/whatever through an LSTM but clamp the values first based on conditions?
that would be a cool idea
i only bring up q-learning since thats the easiest to implement
less parameters
also have you seen the code for some q-learning stuff?
its SO short
its wild
@misty flint whats q learning?
"Q-learning is a model-free reinforcement learning algorithm to learn quality of actions telling an agent what action to take under what circumstances."
you send a bunch of actions to the algorithm, those actions change the algorithm's state, a reward is given + the algo optimises that reward
a type of RL that is model-FREE but still performs very, very well
2 main parameters is alpha (learning rate) and gamma (temporal discounting)
thats it
only downside is it requires lots of repetitions
to "learn"
performs alright if you have a small discrete state space*
the model-free part i would say is the biggest advantage
hahaha
yep
lets add that caveat
what are all the environmental constraints again
like it has to be deterministic ,etc.

doesn't it also like just not work sometimes?
single-agent
yeah
sometimes
lol
q-learning is like the linear regression of reinforcement learning
so dont put your hopes on it

q learning is brute forcing it
there's X actions you can take, Y states you can be in
im sure it'd be fine for flappy bird tho
Make an X*Y table of actions-state combination
choose them randomly
as you learn more you can update your actions to exploit whatever information about the reward you gained
it's been a while but isn't one of the problems that it's maximising not optimising?
maximising is optimising
maximising is a form of optimizing
any optimization is, by definition, max or min of a cost function.
in English?
I usually struggle more with terms than with understanding lol
i want to be a data scientist and i will pay cash anyone who can teach me and be my mentor..
it actually works amazing for flappy bird. take a look http://sarvagyavaish.github.io/FlappyBirdRL/
If there's only a few states the agent can be in and a few different types of actions. If the state of the agent is continuous, or there's a lot of states it can occupy and lots of actions it can take, it becomes incredibly inefficient
Remember you're making a table where the columns are the action you take and the rows are the states you can be in, and you're randomly taking actions to see how the reward changes
As action space and state space become large, it becomes a massive table
Or if they're continuous spaces, that's not a table anymore
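A minimal sketch of the table idea described above. The five-cell corridor, the reward of 1.0 at the goal, the epsilon value and all names are made up for illustration; this is not the gridworld or the Flappy Bird setup from the chat. Alpha and gamma are the two parameters mentioned earlier.

```python
import random

N_STATES = 5          # cells 0..4; cell 4 is the goal (hypothetical toy world)
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
ALPHA = 0.5           # learning rate
GAMMA = 0.9           # temporal discounting
EPSILON = 0.3         # how often to act randomly instead of exploiting


def step(state, action):
    """Move one cell; reward 1.0 only when the goal cell is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    # The table: one row per state, one column per action.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            best = max(q[state])
            greedy = [a for a in ACTIONS if q[state][a] == best]
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)   # explore
            else:
                action = rng.choice(greedy)    # exploit, ties broken randomly
            nxt, reward, done = step(state, action)
            # Q-learning update: move the entry toward reward + discounted best next value.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
# Greedy action per non-goal state, read straight off the table.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The table has `N_STATES * len(ACTIONS)` entries, which is why it stops being practical once either space gets large or continuous.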
This might be a dumb question with an obvious answer, but what should I do if my dataset contains names such as St. Joseph's, Pennsylvania, Virginia Commonwealth, or Loyola, Illinois but I'm trying to use these as parts of a URL for scraping which contain these names under different aliases (St-josephs, VCU, Loyola-IL)? there's no dictionary that I know of which contains a list of aliases to try for each instance, so is the only option to manually rename each?
Unless there's a lot of different names it can be, it might just be easier to manually search for key terms
ooh lemme see
how big is your dataset
you always have the most interesting issues joseph. i like it
If it's only 3 names and variations of those, pick key words like "joseph" and find all the ones that have it, add it in. Keep doing it for them until there's none left.
186 different names, idk how many are inconsistent with the URL
hmm 186 is not small
oof a lot i guess lol

😔
idk how many are inconsistent with the URL
maybe take a quick look through the data
its probably not that many
idk i mean
they're all college names
so like half are acronyms
i gtg for another thing 😩 sorry bye
ok it was only like 3 lmao
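With only a handful of mismatches, a small override dictionary plus a mechanical fallback is probably enough. A rough sketch: the slugs below just mirror the aliases mentioned earlier, lowercased as an assumption, and would need checking against the real site. `SLUG_OVERRIDES` and `school_slug` are made-up names.

```python
# Hypothetical overrides: only the names whose URL slug cannot be derived
# mechanically. Values are assumptions based on the aliases in the chat.
SLUG_OVERRIDES = {
    "St. Joseph's, Pennsylvania": "st-josephs",
    "Virginia Commonwealth": "vcu",
    "Loyola, Illinois": "loyola-il",
}


def school_slug(name):
    """Return the URL slug for a school, preferring a manual override."""
    if name in SLUG_OVERRIDES:
        return SLUG_OVERRIDES[name]
    # Fallback: lowercase and swap spaces for dashes.
    return name.lower().replace(" ", "-")


print(school_slug("Temple"), school_slug("Virginia Commonwealth"))
```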
it now looks like this
what i want to do is take the winner name, for example 1985 Temple, and get a URL like https://www.sports-reference.com/cbb/schools/temple/1985.html
df['url'] = df.apply[lambda x: x['Winner'].lower().replace(' ', '-') + "/" + str(x['Date']) + ".html"] is what I have, but i'm getting TypeError: 'method' object is not subscriptable
any ideas how I should do this?
(this is just for converting the columns Winner and Date to winner/date with dashes instead of whitespace
you need regular brackets after df.apply, not square brackets
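For reference, a small sketch of the same line with parentheses. The two rows are toy data, and `axis=1` is an addition here: without it, `apply` hands the lambda whole columns rather than rows, so `x['Winner']` would not be a single name.

```python
import pandas as pd

# Toy rows standing in for the real data; column names follow the chat.
df = pd.DataFrame({"Winner": ["Temple", "North Carolina"], "Date": [1985, 2016]})

# df.apply is a method, so it is called with parentheses; axis=1 passes each
# row to the lambda so that x["Winner"] and x["Date"] are single values.
df["url"] = df.apply(
    lambda x: x["Winner"].lower().replace(" ", "-") + "/" + str(x["Date"]) + ".html",
    axis=1,
)
print(df["url"].tolist())
```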
Excuse me, i have an question here.
uid category
110 banana
101 banana
.
.
001 apple
010 apple
when i train this dataset with datasplit 80% training and 20% testing, and then when my classification program running, I input manually test data with "uid 000" which is this data not on data train, and then the result is apple. my question is, how come the data that not include on data train can be classified as apple? i want to know how does the classification tell us that this data is classified as an apple?
sorry for bad writing, my english is not good.
How do you preprocess the "uid" into vector?
huh so
i got a column like this
and i used df['Date'] = df['Date'].str.replace('85', '1985') to change 85 to 1985 and df['Date'] = df['Date'].str.replace('16', '2016') for 16 to 2016, and a same one for all years in between
this is what's returned
'2011199989', '20119990', '20119991', '20119992', '20119993',
'20119994', '20119995', '20119996', '20119997', '20119998', '1999',
'2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
'2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
'2016'], dtype=object)
these are all the unique values of the column
what am I doing wrong?
it's a problem with all before 1999
wait nvm im a dumbass
Its just an example, the real data have 8 features, one of them is uid... Something like that.
the 99 and 98 functions change the 98 and 99 in 198x and 199x to something else
how should I rewrite it to not have this bug?
Strange, but usually if it is a unique id for each sample. That feature needs to be drop.
What does Date column look like? is it consistently two-digit that represents year from 85 to 16?

yeah so like 85, 94, 00, 06, 12, etc...
honestly i would probably run into the same problem bc id do it your way. its just funny seeing it joseph

lol
def f(x):
    if int(x[0]) == 0 | int(x[0]) == 1:
        x = str("20" + x)
    if int(x[0]) == 8 | int(x[0]) == 9:
        x = str("19" + x)
df['Date'] = df['Date'].apply(lambda x: f(x))
also tried it this way but it all ended up as None
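The `None` values come from the function never returning anything: it reassigns `x` locally and then falls off the end. Separately, `|` binds more tightly than `==`, so those conditions are not the plain "or" they look like. A sketch of one way to write it, assuming every value is a two-digit string between 85 and 16 as described; `expand_year` is a made-up name.

```python
def expand_year(two_digit):
    """Turn '85' into '1985' and '06' into '2006' (data runs from 85 to 16)."""
    first = int(two_digit[0])
    if first in (0, 1):
        return "20" + two_digit
    if first in (8, 9):
        return "19" + two_digit
    # Anything else is outside the range described in the chat.
    raise ValueError(f"unexpected year value: {two_digit!r}")


print([expand_year(y) for y in ["85", "94", "00", "06", "12"]])
```

On the dataframe this would be used as `df['Date'] = df['Date'].apply(expand_year)`.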
try this pd.to_datetime(df["Date"], format="%y").dt.year

will it work if you feed it as a string
idk the datetime module as well as i should
YESSSSS
this worked

THANK YOU
party time
im building a march madness predictor btw if y'all couldn't tell
That column type changes to Date by the way, need to convert it back to string if you somehow need it.
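A quick check of the `to_datetime` suggestion on toy values. As far as I know, `%y` maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, which happens to fit a range of 85 to 16. The intermediate result is a datetime column and `.dt.year` pulls integers out of it; the last line shows one way back to strings.

```python
import pandas as pd

# Toy column of two-digit year strings, as described in the chat.
dates = pd.Series(["85", "94", "00", "06", "12", "16"], name="Date")

# %y parses a two-digit year; .dt.year then extracts the four-digit year.
years = pd.to_datetime(dates, format="%y").dt.year
print(years.tolist())

# If the year is needed as text again (for building a URL, say):
as_text = years.astype(str)
print(as_text.tolist())
```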
i got a $2000 pool and im the only one in a stem major lol
ok i'll make sure to do that

march madness leggo
actually ken jee is building something similar

since hes a sports analytics guy and all
thank u!!
oof i never have time for podcasts
lol i'm basically doing homework 18 hours of the day
i tried it and i either cant focus on the podcast or i cant focus on the hw 😔
hw all day everyday...

anyway i will leave this here for the lurkers
grazie i'll check it if i ever find time!!
is there a way to use a rolling window on multiple columns in pandas?
i have this list of lists i want to put in my dataframe, i have the row number and the column name but it doesnt work the way google tells me to do it, what am i doing wrong?
ls = df_sub.values.tolist()
print(ls)
df.loc[i,'trains'] = ls
i keep getting this value error: ValueError: Must have equal len keys and value when setting with an ndarray
I am surprised anyone even uses Q-learning now. Weren't DQN's the go-to "always better" method than Q-learning, since Q-learning is such a naive method
DQN is still Q learning, the table is just replaced with an NN
It's still relatively naive
I was talking about the ease of implementation. If you could implement a DQN In the same amount of time as Q-learning, why would you use the more naive/simpler method?
what happens if I only standardize my features but not normalize to range(0, 1)
for neural network
my model training was stalling at 10 epochs before standardizing and now it seems to be improving past that after standardizing. is there a need to normalize then?
Assuming you have an NN library, DQN doesn't take that many more lines than Q Learning
And also because Q learning is probably faster to train and computationally faster
Hey guys,
I have one of my thesis topic as 'Video captioning system to search through videos if an digital assistant needs to find an answer.'
Could anyone guide me on how do i start research on this topic and the concepts that i need to learn for this?
Currently I'm going through papers with video captioning system. Any help would be really helpful.
Thanks.
Hello guys
I am thinking of buying a new laptop for training deep learning models
Can anyone tell the minimum requirements (especially the CPU like no of cores) should I consider before buying
are you gonna be training on the laptop? or coding on the laptop and using something like colab or kaggle
if you're gonna be training locally on the laptop you definitely need a gpu
I will upgrade it to 16
bro that bottleneck is gonna be insane
i think so
@pearl vault i actually think u should try to config a pc
that would be cheeper and you can freely choose ur components
oh ok
how muck dollars are that?
i think there are lots of videos for laptops to train ur ai
Other option is
still to less RAM
I can upgrade ram
One slot will be empty I can buy an another ram stick but the CPU gpu choice is the problem
this could help
Ty
ur wlcm
Can someone help me with this please ?
I do discourage using laptop tho - you can easily use it up in a few months.
@grave frost I just want to know wether the CPU is a bottle neck or not
Currently Laptop is the only option I have now
What happens if there is multicollinearity or useless features in a neural network? Is the algorithm smart enough to fix them? Sorry if this sounds stupid I'm a complete beginner
are df.loc and .iloc O(1) lookup if you only use indices/column names (or any for iloc)?
I mean I guess they're both O(n) for the number of rows or columns you ask them to fetch
#Create a function to combine the values of the important columns into a single string
def get_important_features(data):
important_features = []
for i in range(0, data.shape[0]):
important_features.append(data['Actors'][i]+' '+data['Director'][i]+' '+data['Genre'][i]+' '+data['Title'][i])
return important_features
#Create a column to hold the combined strings
df['important_features'] = get_important_features(df)
#Show the data
df.head(3)
This is my code although i am getting below error: Traceback (most recent call last)
<ipython-input-10-50d23e3e0015> in <module>()
1 #Create a column to hold the combined strings
----> 2 df['important_features'] = get_important_features(df)
3
4 #Show the data
5 df.head(3)
3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
746 if len(data) != len(index):
747 raise ValueError(
--> 748 "Length of values "
749 f"({len(data)}) "
750 "does not match length of index "
ValueError: Length of values (1) does not match length of index (1000)
Please help
!code @civic ferry
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
also what is data?
Its data of movie reccomendation collected from kaggle
can you run print(data)?
Yes ,should i send you the whole code?
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
^ please paste it like that
Okay 1 sec
#Store the data
df = pd.read_csv('IMDB-Movie-Data.csv')
#Show the first 3 rows of data
df.head(3)
I havent tried the print function though
This prints 3 columns of data
def get_important_features(data):
important_features = []
for i in range(0, data.shape[0]):
important_features.append(data['Actors'][i]+' '+data['Director'][i]+' '+data['Genre'][i]+' '+data['Title'][i])
I can't guess what this function does unless I know what data looks like. Please run print(data)
I still don't know what data is. I really need to know that or I can't continue.
yes
please have that be the first line of get_important_features, and then copy and paste what it prints as text.
Its giving an error
Hey @civic ferry!
It looks like you tried to attach file type(s) that we do not allow (.csv). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
@serene scaffold
did it print out what data is before you got the error?
Nope
can you copy and paste the error as text into this chat?
@civic ferry I usually try to avoid any sort of loop when I'm working with dataframes. If you need to create a new column that combines other columns, have you tried something like:
df['combined'] = df['actors'] + ' ' + df['director'] + ' ' + df['genre'] + ' ' + df['title']
Though you may want to use a better delimiter than space, otherwise parsing the new column may get a little ugly (the names from actors, director and title are all likely going to have spaces)
oh...and replace df with data (since I think your dataframe is called data
out of habit, I called mine df
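A small sketch of the loop-free version, using the capitalised column names from the original function and a toy frame instead of the Kaggle CSV. The " | " delimiter is just one possible choice. The "Length of values (1)" error suggests the original function returned after its first loop iteration, possibly because `return` was indented inside the `for`, but that is a guess since the indentation is not visible here.

```python
import pandas as pd

# Toy stand-in for the IMDB CSV; only the four columns used in the chat.
df = pd.DataFrame({
    "Actors": ["Actor A, Actor B", "Actor C"],
    "Director": ["Director X", "Director Y"],
    "Genre": ["Action,Sci-Fi", "Drama"],
    "Title": ["First Film", "Second Film"],
})

# Column-wise string concatenation: no Python loop, one value per row.
sep = " | "
df["important_features"] = (
    df["Actors"] + sep + df["Director"] + sep + df["Genre"] + sep + df["Title"]
)
print(df["important_features"].tolist())
```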
focus more on your graphics card as machine learning and ai are graphics card intensive, if you are looking forward for reinforcement learning and stuff then 16gb+ ram is also recommended
1050ti is kinda bad choice for ai, I would recommend you going for 1650 super or better
@fierce shadow 1660ti sir
thats pretty nice then
What about processor? I5 10300 h is a quad core processor many people are telling it will bottle neck the gpu
I don't think so that it would bottle neck
Ok ty
no a 1660ti and i5 10300h would not bottleneck the gpu
What's a good "sparsity" to switch a dense matrix to a sparse one?
I am dealing with a matrix with roughly 50% zeros and want to reduce ram usage
not sure if 50% is low enough - you can test it yourself, compare the memory usage of numpy.ndarrays with scipy.sparse.csr_matrixes, say.
often you have to deal with matrixes with a density of like one percent, in which case it's definitely profitable
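A back-of-the-envelope estimate before measuring for real. It assumes 8-byte float values and 4-byte indices, and that CSR stores one value plus one column index per nonzero and one row pointer per row (plus one). Actual numbers depend on dtypes and library overhead, so treat this as a rough guide only; both function names are made up.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory for a dense array: every entry is stored."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, n_cols, density, itemsize=8, index_size=4):
    """Rough CSR memory: values + column indices + row pointers."""
    nnz = round(n_rows * n_cols * density)
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


for density in (0.5, 0.01):
    d = dense_bytes(1000, 1000)
    s = csr_bytes(1000, 1000, density)
    print(f"density={density}: dense={d} bytes, csr~{s} bytes, ratio={s / d:.2f}")
```

Under these assumptions a half-full matrix saves only about a quarter of the memory, while a 1% dense one saves nearly all of it.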
Deep Learning shouldn't be done on a laptop usually.
Your CPU doesn't matter much. What takes priority is memory and GPU. Since mobile GPUs (like the ones in laptops) have to balance power for battery life against thermals, you will likely face a lot of crashes and won't be able to use it for long (speaking from experience). Very few laptops have good thermals, like Razer laptops that cost 1.5 lakh+
Desktop is usually the best choice of Deep Learning. Laptops are good only for light stuff
quick stats question: what metric should I use for measuring how spread out the majority of a variable is (if that makes sense?
Variance?
for example
you can see that pts drops off after 4 rows
Alternatively, you may be looking for kurtosis.
what's kurtosis?
@tidal bough Isnt kurtosis only used when you want to know how one-sided your distribution is?
Nah, you're thinking of skewness
as in, you want to know if you can assume a normal distribution
kurtosis is for measuring how thicc the tails are, basically
oh yeah skewness. i dont remember what kurtosis is lol
E X T R A T H I C C
as a more broad question, I'm trying to figure out if a team relies very heavily on a few players to score or if its scoring is very evenly distributed throughout the entire team
so like how top-heavy it is
would kurtosis apply ?
I think that would be skewness
but honestly, you should be able to find it with a basic histogram and / or kde curve of goals
basically what you want to know is if a player or group of players has the majority of goals right?
yeah basically
also consider: you need to normalize it by position. I dont know much about american "football" but
positions don't have a huge influence because all 5 posisitions are on the court the same amt of time for 90% of teams
yeah for basketball i dont think you will have that consideration
how thicc the tails
I honestly never though confusedreptile would ever use smthing like that lol
lol
reminds of Dani 😁
so skew is the way to go?
I would think so, yes. perhaps someone else can chime in
remember that skewness measures "symmetry" basically
so, if every position / person scored the same, it would be symmetrical
ah yeah that makes sense
but if one person scores most, your distribution of scores will be asymmetrical
ok, that really helps a lot!
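To make the symmetry point concrete, here is a sketch with two made-up five-player teams that score the same total. `skewness` is the population version (third central moment over the standard deviation cubed). `top_share` is an extra, simpler measure that was not suggested in the chat: the fraction of points coming from the top k scorers.

```python
def skewness(values):
    """Population skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def top_share(values, k=2):
    """Fraction of all points scored by the k highest scorers."""
    return sum(sorted(values, reverse=True)[:k]) / sum(values)


balanced = [12, 11, 10, 9, 8]     # made-up points per game for five players
top_heavy = [30, 8, 5, 4, 3]      # same total, one dominant scorer

for name, pts in (("balanced", balanced), ("top_heavy", top_heavy)):
    print(name, round(skewness(pts), 2), round(top_share(pts), 2))
```

The balanced team comes out with skewness around zero and the top-heavy one clearly positive, which matches the intuition above.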
I really need to have a statistics book to keep on hand ffs.
im really thinking i should get at least a minor in statistics, i couldn't imagine only taking two courses and then going into industry
I love hypothesis testing but i always forget when to use what
-rages in 5 types of t-tests-
funny story
my data science class was supposed to be for stats majors only and i'm a data science major, and my university forgot to put a stats prereq on it even though most students are DS majors in the class and none of us knew about the prereq
so me and like 80% of the class went halfway thru without knowing a bit of statistics when the prof thought there was a stats prereq
luckily she caught on and removed the stats part of assignments cus i was lost lol
yeah i never took it in high school and my first semester was filled with calc, CS, and gen ed
oof
oof
do you know it now?
look at this guy not knowing what's the difference between mean, median and mode :v
-calls them all average-
lmao

no lmao
lol that and stdev is about the extent of what i know
that's basically descriptive statistics :p
add range, and quartiles and you're set on descriptives
and stdev is stretching it because i was only told that it was "the average of how far away from the average things are" in sophomore year of HS
hehe my prof called the central limit theorem basic stats and i was like 😳
depending on how elitist the person is, some will claim stddev is useless and you should always speak in variances instead
i think each has their use :v
nah she's a really down to earth prof who understands the students, she just didnt' know we didnt know anything about stats
central limit theorem isnt 'basic' but its definetely not the toughjest thing out there
yeah i just have a hard time when she put an integral into stats when i dont even get the stats without calculus in there lmao
i've literally never, ever used calculus for stats
its in the definition of the density and all, but ive never, ever used it
how do you calculate conditional distributions and stuff like that, then?
if I knew what that is, i'd have an answer .:p
i cant go on in my major like this
in my probability theory course, we eventually switched to a notation where even sums over discrete distributions are written via integrals
because it's just nicer to have things consistent
that sounds a bit overkill lol
I can see where its coming from thou
well its been nice talkin but i gtg to said data science class in 1 minute
I've always found the definition of Riemann integrals to be so succintly beautiful yet complex
https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction
can anyone help me with this? i have some doubts
explain your doubts
im not sure which is bankrupt and which is not
like are 0=bankrupt and 1=not in the 'Bankrupt?' column?
also since im new its hard and takes time understanding the code
hmmm
but (1min)
your data is skewed a lot
0 normally is no
and lso, use logic
do you believe most companies would be bankrupt?
my friend online believed so and i fell into confusion
go to google datasets...you might find an explanation
he knows ml so maybe i mightve trust him :/
oh thanks!
you can know something and still screw up
always curate things with a bit of common sense as well lol. It's unlikely 90% of taiwanese companies are bankrupt...
if they are, i'll send you some money :v
ill check that out
yes yes i asked the same thing and he was like it might be an error in the dataset
i didnt know if that was possible so here i am 🙂
wait ew no i dont want the emoji
thx!
welcome anytime
for the prev dataset(one im using)
i put an example to check but it gives the same output for any data i put in there
check all ur variables
what is the best place to take a refresher in calculus? I think i need to review a lot of things.
coursera
any specifics?
i tried khan academy in the past and liked it, but im thinking something a bit more...applied
i might be able to hook you up with some of my calc 3 lab pdfs
variables as in all the atributes?
they're pretty applied to engineering problems
Calc3 as in vector calculus?
we're getting into that right now actually so yeah
no like the variables that you have defined
but start of the semester was series
I think i have my college notebook for that one. The ones i've lost are calc 1 and 2 lol
ohhh lemme check if im still in the canvas course for those
thanks man 🙂
alright
ye im in it!
anything specific?
something like this?
it has some calculus
@uncut orbit my friend asked me to run this and came to the conclusion that majority are bankrupt
id say not...
that could do
Hey @astral path!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
thanks a lot man 😄 you too @uncut orbit
welcome
how should i download the dataset from the link you sent? the dataset folder has only parent dictionary
and i dont know how to go after that because in whatever ive learned there'll be .data files there
the kaggle one is the same
to my knowledge
yes thats what i did
vs code
hmmm do you have anaconda?
Hi there, I am interested in Simulation of Physics/Engineering problems by using python. For example: Simulation in Fluid Dynamics and Thermodynamics . So, which modules i have to learn and practice and get projects. Now days, I am learning Pandas, Numpy, Matplotlib and seaborn. My background is Mechanical Engineer. Please suggest modules / Techniques , I would like to hear and apply.
yo how much does tensorRT help
im also right there with you

havent done calc in 10 years

but looked at partial derivatives just so i could understand backpropagation

8 years here
I remember some things about derivatives integration and calculus
But if u ask me to solve one ill embarass myself lol
I am learning Pandas, Numpy, Matplotlib and seaborn
Uh, how does that have anything to do with fluid simulations?
itll start to come back if you start looking at examples again
## Import Libraries
import pandas as pd # pandas is a dataframe library
import matplotlib.pyplot as plt # matplotlib.pyplot plots data
import numpy as np # numpy provides N-dim object support
# do ploting inline instead of in a separate window
%matplotlib inline
## Load and review data
df = pd.read_csv("./tree/Notebooks/data/pima-data.csv") # load Pima data. Adjust path as necessary
df.shape
Error that I'm getting:
NameError Traceback (most recent call last)
<ipython-input-8-633337079cd0> in <module>
----> 1 df.shape
NameError: name 'df' is not defined
Also, I am using Jupyter notebook
what happens when you run 'df' by itself in a cell
if it doesnt show up, you probably didnt import pandas
movie recomodation is based on KNN??
what is the context for this question? you could make a movie recommendation system that uses KNN in some way.
like it is based on matrix factorization ?
this was interesting to see
we're talking about k nearest neighbors, right?
then for those in the states
lmao wyoming
california, texas, and new york seem to have most jobs postings
rip
ay my state is second highest %
washington state not DC!
my state is 2nd highest in postings
spent some time in austin tho. at least ik there will be jobs for me in my state after i graduate
yeah gonna try to move there after my first job
we shall see
slc is on the rise in tech jobs overall
they have a ton jobs down there
which is where i go to school
noice
I'm building a function to scrape from a table and one part is dropping certain columns from a dataframe at some point. however, not all tables will have a specific column. how would I make it so this one column is only dropped if it exists?
rows_df = rows_df.drop(columns=['#', 'Weight', 'Hometown', 'High School', 'RSCI Top 100']) is what i have right now, and not all tables have High School as a column
what happens if you try to drop the column but it doesnt exist?
does it return an error?
i would just do a small try/except code right before dropping the column
i thought about it but
see if it exists
wait nvm

i'm ALMOST done with data collection ! !
np sometimes its easy to figure out once you talk your ideas out
and nice

i believe in you
$350 now on the line here!
💸
💰
@serene scaffold yes like i was searching for the project like what it is based upon
am not able to understand is it based o knn or matrix....
hmm so it's happening on other columns now in debugging
so it's not just one column
is there a way to just conditionally do it for any col?
@astral path You can try creating a master list of cols, then retrieve the columns names from the actual table, evaluate which ones match, and then only delete the matche
matches
for example
cols_to_delete = ["PLOP1", "PLOP2", "COL3"]
# here you get the cols from the df
df_cols = list(df.columns)
# evlauate and keep only those that are found
to_del = [col for col in df_cols if col in cols_to_delete
#finally, drop them
df.drop(to_del, axis=1)
I did it via lists but there are other ways to do it
that seems like it would work yeah
just dont forget the ending brackets for the list comprehension
that ALWAYS trips me up

"why doesnt this list comprehension work"
forgets the brackets

I sometimes forget that list comprehension needs them at all lol
thank you! this worked
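For anyone reading later, a sketch of two ways to drop columns only if they exist, on a toy table that lacks "High School". The second option relies on the `errors="ignore"` argument of `DataFrame.drop`. Note that `drop` returns a new frame, so the result has to be assigned.

```python
import pandas as pd

# Toy table that is missing the "High School" and "RSCI Top 100" columns.
rows_df = pd.DataFrame({"#": [1], "Player": ["A"], "Weight": [200], "Hometown": ["X"]})

unwanted = ["#", "Weight", "Hometown", "High School", "RSCI Top 100"]

# Option 1: keep only the names that are actually present, then drop them.
present = [col for col in unwanted if col in rows_df.columns]
trimmed = rows_df.drop(columns=present)

# Option 2: let pandas skip missing labels itself.
trimmed_too = rows_df.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns), list(trimmed_too.columns))
```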
what would be a good, non iteration way to find the largest string on a list
?
right now I have this
long_word = ""
for word in unique_words:
    if len(word) > len(long_word):
        long_word = word
    else:
        continue
but im sure there has to be more optimal way to do it 😦
>>> strings = ["abcd", "12345", "this_is_a_very_long_string"]
>>> max(strings, key=len)
'this_is_a_very_long_string'
max will implicitly iterate over the list but if the list is in no way sorted, you would need to iterate over the list to find the longest string
if I have a dataframe which looks like this
and i want to append a row with new columns that looks like this for each row in the dataframe
ho would I do that?
so the first row with 2016 Villanova vs. 2016 North Carolina would have that all the data from the second image as data in new columns
and all the other rows in the dataframe would have the same columns
probably would be something called a "join"
what if there's no common columns?
then how do you decide which row gets to which row?
i'm iterating over each row
for index, row in df.iterrows():
    try:
        # a should be the two rows combined, get_stats is a function that creates the new columns
        a = get_stats(row['url']).join(row)
        display(a)
    except Exception:
        print("ERROR: ", row['url'])
    time.sleep(1.2)
    break
so just by the order?
yeah just by order
you can concat with axis=1
i'll try it
pd.concat([row, get_stats(row['url'])], axis=1) gives, although it should be a single row
I don't know what you are doing
I thought you had two dataframes: df1 and df2
and you want them smooshed together horizontally
that is done by pd.concat([df1, df2], axis=1)
i have a dataframe df with a column url, and I have a function get_stats(url) which takes the value of url for a given row and returns a new dataframe which is a single row. I want to append get_stats(url) to the dataframe for each value of url
use row.append
or you could use df['url'].apply and create the entire dataframe of stats from the urls
aight i'm running that
In machine learning, is it important to remove outliers?
I've got this column where the majority of values are in the range of 50-100, however the column has a max value of 400. So when i normalise the column, i get a bunch of low numbers because of only one outlier
This would presumably have an effect on how the machine learns right? Is there a scheme for removing outliers? Like looking at the standard deviation or something?
You can normalise using standard deviation instead of max value
Yeah standard deviations for finding outliers works fine
oh hi Raggy!

that's just normalizing it by min and max values
yeah what raggy pasted is better for you
physics wizards pulling through once again 👏 cheers Raggy ^^
So i gave this a whirl, and I just want to know how this will affect learning:
So there are negatives, so if i put this through a MLP i imagine this will "slacken" weights instead of "tightening" them (does that make sense)?
Is this better or worse than using normalised values from 0 to 1?
Moreover, do i still need to remove these outliers? Or is it fine to keep them as this data is "standardised" instead of normalised?
the outliers should be removed i think
or replaced
by median or mean or imputed with whatever statistic you see fit
so if i remove them i get rid of the whole row right?
you can get rid of the whole row, column, or just the value itself
I thought i had to get rid of NaN data?
but if u were to get rid of just the value itself you should replace it with another value that wont disproportionately affect training
yea
hmmm i didn't think of that before. In this case it doesn't matter too much since it's only 2 values. But i have removed a decent amount of NaN data before
Didn't think to replace the values with the mean/median
is removing it the "safer" option?
depends on how important your row/column is
if you identify a column as significant or you think it will be useful during training, then perhaps u should keep the column
heres a stackoverflow question explaining the difference between minmaxscaler and standardscaler
https://stackoverflow.com/questions/51237635/difference-between-standard-scaler-and-minmaxscaler
the answer would be that I don't know, since im not sure how much of an effect this particular column has on the data
u can try estimating how important it is, if its not something blatantly irrelevant then it should be kept for the network to find hidden patterns in
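To make the normalise vs. standardise discussion above concrete, here is a minimal sketch using only the standard library. The column values are invented to mimic the case described (most values in 50-100, one at 400), and the cutoff of 3 standard deviations is just a common rule of thumb, not something stated above.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale to [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    return [v for v, z in zip(values, standardise(values)) if abs(z) > threshold]

# Hypothetical column: most values in 50-100, one outlier at 400.
column = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 400]
scaled = min_max_scale(column)       # everything but the outlier lands near 0
z_scores = standardise(column)       # centred on zero, so negatives are expected
outliers = flag_outliers(column)     # values more than 3 std from the mean
```

With min-max scaling the single large value squeezes the rest into a narrow band near zero, which is the effect described above. Standardised values are centred on zero, so negative numbers are normal, but the outlier still inflates the standard deviation, so removing or replacing it can still be worth considering.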
I can't seem to find the explanation for when you use a ! or a % in google colab....when do you use which? why do I have to use ! for ls and % for cd?
its the same as for jupyter magic commands
since colab is built on jupyter
look up jupyter notebook magic

Got it! I am a magician now

RuntimeError: Input type (torch.FloatTensor) and weight type (XLAFloatType) should be the same
i have no idea
what to do
there's no way to guess what the problem is without more context.
though I'll make one uninformed guess anyway: is there a way to convert whatever your weights are to a FloatTensor?
Hi, anyone knows what should i do here? The only thing that I know is that I should make a py code using a montecarlo and the maximum likelihood estimation, but not really know how.
ooh a confidence interval question
I haven't seen it with the std of the random error though lol
Confidence interval is
mean ± t-score(at the chosen confidence) * std / sqrt(samples)
t-score also requires degrees of freedom, which normally is samples - 1
but i have never seen a case where you're given the standard deviation of the error
in your case you can calculate the sample mean from the observations, the t-score from the confidence interval and degrees of freedom
but i have no clue if you can just input the standard error of the mean there lol
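A rough sketch of the interval formula above, using only the standard library. Note the swap: the standard library has no t-distribution, so this uses a normal quantile as the critical value instead of a t-score, which is only a reasonable approximation for larger samples. The observations are made up, and a proper t-score (for example from `scipy.stats.t.ppf`) would give a somewhat wider interval for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(samples, confidence=0.95):
    """mean ± critical value * std / sqrt(n), with a normal critical value."""
    n = len(samples)
    centre = mean(samples)
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = critical * stdev(samples) / sqrt(n)
    return centre - half_width, centre + half_width

observations = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0]  # made-up data
low, high = confidence_interval(observations)
print(low, high)
```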
Oh ok ok, I will give it a try. Thanks for the help tho @exotic maple 🙂
Thank you so much!
can someone please help me to re-arrange this data frame?
want a matrix that is 50 (states) x months
i have tried to group by date and then transpose
You want a multilevel index of state and month?
yes
```
month1 val val val
month2 val val val
```
maybe i can export it as a csv and reshape it in R lol
@mortal flicker are you able to share your data? or a snippet? Just easier to try some things on.
If not, have you tried pivot? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot
im just stumped at filling in the columns in this dataframe
im at the point where it looks like this, it has the desired columns and the original columns
and i have a function
def stats(x):
    try:
        a = get_stats(x)
    except Exception:
        return np.nan
    display(x)
    return a
which takes a URL and returns a dataframe with one row containing all the stats in the columns from ppg on
like this
i just have no clue how to loop over each url and insert the desired values from stats(url) into the desired columns
any help? Thanks!!
oh nice function. def taking note of that one for later use

Posted a Question about Normalizing at #help-kiwi , Please help if possible
Hi everyone, I'm Anggita. Currently I'm doing sentiment analysis on Twitter. I want to analyze the hashtags of trending tweets in different countries. Does anyone know what I should do?
Trade signals xD
getting precision as 0 with this dataset. help?
Hello guys, I'm starting out in Python and I'm trying to detect anomalies in a list. I searched for a way to do this and tested some code, but I'm getting an error:
ModuleNotFoundError: No module named 'numpy'
Can anyone help me?
you have to import numpy
You have to install numpy first
@lean ledge would you mind helping me out?
TY
TY
can anyone help me to set up gpu for jupyter notebook
Greetings,
I have a pandas dataframe with a Datetimeindex and various discrete values.
For simplification:
index p q
2020-1-1 6 9
2020-1-8 4 19
2020-1-12 4 17
Every time a value p or q changes, a record is added. That means that at 2020-1-2 the values are essentially the same as 2020-1-1.
I want to plot these data but face a problem:
- Plotting the sparse data structure as a line, there's a slope between measurements which is misleading.
- Plotting the sparse data structure as a bar leaves countless gaps.
- Filling the missing dates and plotting line or bar is slow, because the real dataset spans several years and has 40 columns.
Essentially I want a bar chart where bars have different widths, until the next measure, or a line plot that moves "like a staircase" with only horizontal and vertical segments.
What would you recommend?
@rich reef this may help you: https://stackoverflow.com/questions/66545695/python-plotly-dash-question-custom-labels-and-color-based-on-values
@rich reef I like the line option
Oh god stupid Jupyter notebooks. I put my model to train for 48 hours on cloud time and it didn't say anything at all. the next time when I used the command line I got killed. Jupyter sucks 🤬
Now I have to spend more money to train it for the 3rd time with more RAM 😞
😔
I'll try that, then. I think I'll duplicate the index, offset by a second, and merge and forwardfill.
That should get
index p q
2019-12-31 23:59:59 NA NA
2020-01-01 00:00:00 6 9
2020-01-07 23:59:59 6 9
2020-01-08 00:00:00 4 19
2020-01-11 23:59:59 4 19
2020-01-12 00:00:00 4 17
A line plot should then practically be a staircase. The NA at the start doesnt matter because its outside the interval.
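The same staircase idea can be sketched without timestamps offset by a second: repeat the previous value at each new x position, so every change becomes a horizontal segment followed by a vertical one. `to_staircase` is a hypothetical helper name, and only the `p` column from the simplified example is used. matplotlib also has a step plot (`pyplot.step` with `where="post"`) that may make this unnecessary.

```python
from datetime import date

def to_staircase(records):
    """Turn sparse (x, y) change records into step-shaped line coordinates.

    Each value is held until the next record, so every change becomes a
    horizontal segment followed by a vertical one.
    """
    xs, ys = [], []
    for i, (x, y) in enumerate(records):
        if i > 0:
            # Repeat the previous value at the new x: the horizontal segment.
            xs.append(x)
            ys.append(records[i - 1][1])
        xs.append(x)
        ys.append(y)
    return xs, ys

# The p column from the simplified example above.
p_records = [(date(2020, 1, 1), 6), (date(2020, 1, 8), 4), (date(2020, 1, 12), 4)]
xs, ys = to_staircase(p_records)
```

Plotting `xs` against `ys` as an ordinary line then gives only horizontal and vertical segments, with at most two points per record instead of one per day.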
I would have done it in pure numpy but that works too (and is arguably better!)
How would you do that? Honestly I always grab pandas out of convenience, and I work with Dataframe's in the rest of my data wrangling
https://www.youtube.com/watch?v=7jiPeIFXb6U I'll just leave this here ...
I have been using and teaching Python for many years. I wrote a best-selling book about learning data science. And here’s my confession: I don’t like notebooks. (There are dozens of us!) I’ll explain why I find notebooks difficult, show how they frustrate my preferred pedagogy, demonstrate how I prefer to work, and discuss what Jupyter could do ...
AI is cool but cyber security is better
im doing cyber sec currently
just hopefully i can get a good gpa
but im working on it
i'm building a bot that multiple ppl are going to use to generate an img and have it sent to them, so i was thinking about how the file should be rendered. is it possible to send the XY plot without saving it, keeping it in ram so the script can tell the different ppl's images apart?
What i mean is: instead of saving the file with an ID, sending it, and deleting it afterwards, is it possible to skip saving and just send each file to the person who requested it?
Anyone got any recommends on using Python and SLURM? I got a NLP pipeline I am working on and I am wondering if anyone here got some tips on trying to scale Python on a supercomputer.
Specifically I am using Spacy and pipe multiprocessing but it seems like I am going to hit some road blocks with that but I can figure that out on my end.
What is slurm?
There are plenty of projects to automate CyberSec with AI. currently they are experimental so it required bombarding the server with requests - however I read about some corporate AI projects that can query a server and construct a full profile of it to aid another model to find vulnerabilities in it. pretty cool if you ask me
If you are using SLURM I guess you'll be deploying your program on a Multi-Mode scale? The easiest would be to start using MPI via mpi4py to communicate your processes in the cluster
It's a job scheduler for super computers. It's not like Hadoop with map reduce. It looks like a very fancy bash script.
multinode rather
I am worried about figuring out how to get Nodes to talk to each other. So far my code is basically what I run on my computer. Single node basically and spawning a bunch of processes.
How much of a pain is it to run things with MPI?
Horovod works well
Very very automatic
Essentially you just need to add a few lines to your code and it's automatically capable of running on a cluster
Does that work with SLURM? It looks like I have to drag the admins to add that.
It should work fine with slurm. Just need to load the openmpi and then set up relevant Python environments with horovod installed
Admins not needed unless for some reason there's no ompi
Edit your few lines of code, then you can mpirun within your sbatch
There is definitely OMPI.
Isn't this for TensorFlow stuff? I am not running any GPU stuff just Python multiprocessing stuff.
Oh you're not distributed NLP training?
Oh, I am not doing training. I am running Spacy models. I should have mentioned that.
There is already a trained model. The challenge is to run it in a reasonable amount of time. I got thousands of biomedical papers to work through so clearly I need to think harder about running things.
Do you have your NLP corpus stored in a pandas df at any point?
Yea. But the dataset itself is a bunch of little json files with the plaintext in there. It's the CORD19 dataset if you know about it.
So you're going to predict over some data? Can't you just divide it up between however many cores you have?
Honestly there's a good chance you're better off launching multiple sbatch runs with different number arguments that pick which data they end up processing and then combining the information later
That's the idea so far. It's "embarrassingly parallelized" since the docs don't have anything to do with each other. We aren't trying to predict things but mining them for linguistic features.
Just make a script to split up the data into X directories, have a generic sbatch with an argument for which directory to look at, then launch X sbatch runs
They can be squeued and executed on their own
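A minimal sketch of the "split the data and let each run pick its part" idea, done by index rather than by moving files into directories. The file names and the `shard` helper are hypothetical; in practice the shard index would be whatever number argument each sbatch run receives.

```python
from pathlib import Path

def shard(paths, n_shards, shard_index):
    """Return the slice of paths that job number shard_index should process."""
    ordered = sorted(paths)  # same order in every job, so shards never overlap
    return ordered[shard_index::n_shards]

# Hypothetical file names standing in for the per-paper JSON files.
all_files = [Path(f"paper_{i:04d}.json") for i in range(10)]
n_jobs = 3
shards = [shard(all_files, n_jobs, k) for k in range(n_jobs)]
for k, files in enumerate(shards):
    print(k, [f.name for f in files])
```

Because every job sorts the same listing and takes every `n_shards`-th file, the jobs cover all files exactly once without needing to talk to each other.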
Makes sense. The bash scripts to do this shouldn't be hard to write.
While your first test run seems to be running, you can write a script to combine the output of the runs
Yep
Much easier than trying to parallelise the actual work on the application level using MPI
You can also use more_itertools.chunked and joblib
I still need to make sure the pipeline itself runs with multiprocessing though, but that's already done.
Psst, if in the future you need to do distributed training, horovod is lit 🔥 🔥
Implemented it on prod for all AutoML clients to use
Was fun
I'll keep that in mind. I know the cluster has some spicy GPUs on it but I don't think we are at that point yet.
I sorta miss HPC stuff. The power rush you get from launching 64 GPUs for training is like a drug to my ego
Plus module load tensorflow is the chillest experience I've had with dependency management ever
Hmm... is there any way we can use pre-trained word embeddings with large documents (except averaging them)?
coz I don't think averaging would be very useful or retain information tbh
Got it. Thanx a ton!
stuff like sentence-BERT isn't viable, and I only have the word2vec embeddings for a particular language. averaging loses the order, so that's not so good either
What d'you reckon might work better :-
Doc2Vec trained from scratch on dataset (which is small but for a specific domain)
OR
Word2Vec trained on Wiki + average + Tf-Idf
Is it fine to return a variable from a function where the variable has the same name as the function it is contained in? For example:
def ds(x, y):
    z = x + y**3
    a = 2 * (z / 3.14)
    ds = a + x / y
    return ds
# use the function as follows
>>> ds = ds(4.1, 9)
Or would something like this be better:
def calc_ds(x, y):
    z = x + y**3
    a = 2 * (z / 3.14)
    ds = a + x / y
    return ds
# use the function as follows
>>> ds = calc_ds(4.1, 9)
I have a bunch of functions like this defined in a module called solid. So I would actually call the function as
import solid
# approach 1
ds = solid.ds(4.1, 9)
# or using approach 2
ds = solid.calc_ds(4.1, 9)
I'm just wondering if there is a preferred naming convention for functions contained in a module.
@lapis sequoia functions contain references to themselves in their own namespace via the same name of the function in the outer namespace. That's what enables us to use recursion in the language. However in your first example, the statement ds = a + x / y re-assigns that name within the function's namespace.
however you could just do
def ds(x, y):
    z = x + y**3
    a = 2 * (z / 3.14)
    return a + x / y
no need to assign a name to it if you're just going to return it right away
please DM @sonic vapor about that
Anyway, it's bad practice to overwrite existing variable names unless you're trying to update the data that that name is supposed to represent.
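A small sketch of where reusing the name actually bites, assuming the function and the result live in the same namespace (as in the `>>> ds = ds(4.1, 9)` usage above). With `import solid` and `solid.ds(...)` the module's function is not affected by a local variable called `ds`.

```python
def ds(x, y):
    z = x + y**3
    a = 2 * (z / 3.14)
    return a + x / y

first = ds(4.1, 9)   # works: ds still names the function

ds = ds(4.1, 9)      # rebinds the name ds to a float in this namespace

try:
    ds(4.1, 9)       # ds is now a number, so calling it fails
except TypeError as exc:
    error = str(exc)
    print(error)
```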
Gotcha.
a common convention to avoid doing that is to put a trailing underscore at the end of the variable name. So you could have ds_ = a + x / y, though in this case that wouldn't be useful since you can just return that expression right away
I like to assign whatever I return in a function to a variable. So the return statement is something like return x where x can be a small calculation (as in my example above) or a large calculation that spans 2 or 3 lines of code. I like the idea of using an underscore. Something like return _ds would work for my example.
You can do it that way if you'd like, though I'm not sure I see the advantage.
The convention is usually to use a trailing underscore to avoid name overwriting. Leading underscores indicate that an attribute isn't part of an object's interface, but variables internal to a function aren't exposed anyway.
My second approach seems to avoid these issues by providing a more descriptive function name which avoids the naming problems.
# in solid.py
def calc_ds(x, y):
    z = x + y**3
    a = 2 * (z / 3.14)
    ds = a + x / y
    return ds
# use the function
import solid
ds = solid.calc_ds(4.1, 9)
my suggestion is for the last two lines to be:
ds_ = a + x / y
return ds_
rather than
_ds = a + x / y
return _ds
by changing the location of the underscore.
im getting my precision = 0
searched up online and realised i have too many values for class 1 (6k+) compared to class 2(around 200)
now the ratio is 500 to 200 ish but im still getting precision = 0
help?
I don't agree with the second approach. if your function already has a descriptive name for the namespace that it's in, prepending it with "calc_" isn't very informative. most functions do a calculation.
@bronze jacinth either your model doesn't work, or the system you're using to calculate your precision score doesn't work. Or some other third thing. Can you be more descriptive about how you arrived at this point?
sure yes
(im new so i apologise for any mistakes or misunderstandings)
i think the model is working because im getting decent accuracy and confidence
no problem. do you understand what the precision score is telling you?
like, what is the formula, and why does it matter?
is it like the number of predicted values that turned out to be right?
Help, my code keeps crashing when it reaches model.fit(X,Y)
https://paste.pythondiscord.com/fanihamipe.py
"right" in what sense? 
our teacher didnt explain the code much so i have to refer to youtube to understand syntax
what course is this?
¯_(ツ)_/¯
on second thought though, it sounds like you do understand it
just a college course where our seniors teach us stuff
because precision tells you how often the predictions that you actually make are right
oOOo how often alright
if it is then I should leave
im here so i drop the average here by a couple
do you know the formula?
I'm looking for one expression with tp, fp, fn, tn in there. but you won't use two of those
nope im afraid that wasnt taught
but i did see something like that while learning online
no 😦
suppose you're making a classifier that tells you if something is a sandwich
if something is a sandwich, and your model says it's a sandwich, that's a true positive
ooo confusion matrix
if something is a sandwich, and your model says it's a salad, that's a false negative. it said it wasn't something (negative) and it was wrong (false)
but if it was a salad, and your model said it was a sandwich, that's a false positive. It said it was the thing you were looking for, but it's wrong.
speaking of stats, this was a good statement on p values and common misconceptions from the ASA https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
sir that explanation is beautiful
you're beautiful 💚
so basically, precision means "out of all the times my model said it was a sandwich, how many of them were actually sandwiches?"
as opposed to "what percentage of the sandwiches in the data did the model find?"
the latter is called recall
oOhh yes
So to your question, why are you getting 0 for your precision score
and true predictions over all predictions made is accuracy right?
yes, though the accuracy score isn't very useful if true negatives aren't helpful
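The sandwich definitions above, written out as a small sketch. The labels and predictions are invented to show an imbalanced case where accuracy looks decent while precision is zero; returning 0.0 when nothing was predicted positive is a choice made for this sketch.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives/negatives for one class of interest."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1   # said sandwich, was a sandwich
            else:
                fp += 1   # said sandwich, was a salad
        else:
            if truth == positive:
                fn += 1   # said salad, was a sandwich
            else:
                tn += 1   # said salad, was a salad
    return tp, fp, fn, tn

def precision(tp, fp):
    # Of everything the model called a sandwich, how much really was one?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of all the real sandwiches, how many did the model find?
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Made-up, imbalanced data: 8 salads, 2 sandwiches.
y_true = ["salad"] * 8 + ["sandwich"] * 2
y_pred = ["sandwich"] + ["salad"] * 9
tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="sandwich")
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

Here the model gets 7 of 10 predictions right just by saying "salad" most of the time, yet its one "sandwich" call is wrong, so precision and recall for the sandwich class are both zero.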
Well it depends on the use case. Pandas has a bunch of functions that start with read_. So knowing if a function performs a calculation or parses some data or provides a file path by just looking at the name of the function can be useful.
my model says that these are the sandwiches, but actually none of them are
and thats why precision is coming 0
so if your precision is quite literally zero, and this isn't an error with the precision calculator
I guess you'll have to adjust the parameters of your model 🙃
but not having run your code, I can't rule out that the precision calculator is broken
what does support mean?
in what context?
while using classification report
I assume ds has an established meaning
like in this other conversation, if I had a function that calculates precision, I would just call it precision because it's known that precision is a metric
so it's some other metric?
I've never heard of it
yea maybe
i printed the confusion matrix because it sounded cool (i still have to learn that tho)
It might be that support is the number of instances of that class for a given calculation
but I'm not completely sure
hmm
anyway, was this informative for you? I was going to study for my midterm, and then everything changed.
miss me with that operating systems shit
im just stumped at filling in the columns in this dataframe
im at the point where it looks like this, it has the desired columns and the original columns
and i have a function
def stats(x):
    try:
        a = get_stats(x)
    except Exception:
        return np.nan
    display(x)
    return a
which takes a URL and returns a dataframe with one row containing all the stats in the columns from ppg on
like this
i just have no clue how to loop over each url and insert the desired values from stats(url) into the desired columns
any help? Thanks!!
You can use the map method: df["url"] = df["url"].map(stats)
aight that's running right now
it'll take a while to see if it worked tho, uses a webscraper
Oh it’s probably a good idea to make a simple test case so you can see if it works quickly
oop yeah thats true
this is what's returned
vs. before
df_test = df.head(5)
df_test['more_stats'] = df_test['url'].map(stats)
df_test
ignoring the warning for a moment, what's in the more_stats column(s)?
this is it
looks like it worked but created a nested structure
my bad i didnt realize there were multiple columns at first
instead of redefining df["more_stats"] you can just append the new dataframe as extra columns
would that just append new columns for each iteration or just for the first time i iterate
it will only append once if you use pd.concat([df1, df2], axis = 1) after creating the more_stats df
as an intermediate variable I mean
hmm so just df_test['url'].map(stats) still returns a nested structure
whats the difference between a 3d array and a tensor
is it like the difference between a computer science vector and a physics vector
different definitions depending on the field?

Hi everyone, can anyone know how to fetch_tweets in arabic word?
isn't a tensor just an array that can be on the GPU?
would changing tweet.lang == 'en' to 'ar' work?
@astral path maybe use .loc again to pull out the data frame you want and then concatenate
could you make one dataframe out of da and concatenate it onto the bottom?
rather than doing len(da) append operations?
@astral path can you copy and paste the code in that cell into this chat as text?
ya
da = df_test['url'].map(stats)
dw = pd.DataFrame()
for i in da:
    dw = dw.append(i, ignore_index=True)
dw
da = df_test['url'].map(stats)
dw = pd.concat(da, ignore_index=True)
see if that's what you want. it might be faster.
i'll try
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
it's a series of dataframes, right?
try dw = pd.concat(da.iteritems(), ignore_index=True)
@astral path does that work?
thats my working definition. mathematicians have a slightly dif variant but i dont envision it changing much
I didn't think "tensor" had a mathematical definition
I thought it was just a way of specifying how a mathematical data structure is being stored in a machine 😛
I stand corrected, but then I'm not a mathematician: https://en.wikipedia.org/wiki/Tensor
lemme check
nah TypeError: cannot concatenate object of type '<class 'tuple'>'; only Series and DataFrame objs are valid
it didn't
so it's not a series of dataframes
It’s error and says “failed to parse JSON payload: Unterminated string starting at: line 1 column 644416”
I'd need to see the whole error message and the related code
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

how about you do print(da) and copy/paste the text?
Can you instead do print(da.iloc[:5].to_csv())
ya
,url
2049," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/v... 80.6 ... 1 1.7
[1 rows x 20 columns]"
2048," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/v... 80.6 ... 1 1.7
[1 rows x 20 columns]"
2047," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/n... 87.5 ... 0.833333 1.8
[1 rows x 20 columns]"
2046," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/s... 72.7 ... 0.666667 1.6
[1 rows x 20 columns]"
2045," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/n... 87.5 ... 0.833333 1.8
[1 rows x 20 columns]"
These are the error
@sudden panther I can't help with this, unfortunately
Ok, no problem and thank you🙏🏻
Hi folks, I have imported a CSV file and I want to drop every column whose name equals alertXYZ, where XYZ is [0-9]
can you copy and paste the first few rows of the CSV as text?
i think i'll just keep it with the O(N) implementation instead of vectorized unless it takes too long
I think your performance issue is that each append operation has to copy all the data in a dataframe, since append isn't a mutator method
In fact I think that might mean that it's O(n^2)
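A sketch of the usual way around the repeated copying: collect the one-row frames in a plain list and concatenate once at the end. `fake_stats` is a hypothetical stand-in for the scraping function, and its column values are invented.

```python
import pandas as pd

def fake_stats(i):
    """Hypothetical stand-in for a function that returns a one-row DataFrame."""
    return pd.DataFrame({"ppg": [70.0 + i], "experience": [1.5 + 0.1 * i]})

# Collect the pieces in a plain list (cheap), then concatenate a single time.
pieces = [fake_stats(i) for i in range(5)]
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Appending to a Python list does not copy the existing frames, so the data is only copied once, inside the single `pd.concat` call.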
oh yikes
I'd still like to know what print(da.iloc[:5].to_csv()) looks like
I can probably help you solve it if you are able to provide that
OH i sent it like 10 minutes ago
turns out it errored
,url
2049," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/v... 80.6 ... 1 1.7
[1 rows x 20 columns]"
2048," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/v... 80.6 ... 1 1.7
[1 rows x 20 columns]"
2047," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/n... 87.5 ... 0.833333 1.8
[1 rows x 20 columns]"
2046," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/s... 72.7 ... 0.666667 1.6
[1 rows x 20 columns]"
2045," url ppg ... momentum experience
0 https://www.sports-reference.com/cbb/schools/n... 87.5 ... 0.833333 1.8
[1 rows x 20 columns]"
it didn't say it was an error until i hovered over the message...
this doesn't look right
actually
hmm
@astral path join the code help 0 voice chat
ok
Have to bring tomorrow from work. Computer is not connected to Internet
ok im there now
@astral path this avoids the append issue:
df2 = pd.concat([stats(url) for url in df['url']], ignore_index=True)
full_df = pd.concat([df.reset_index(drop=True), df2], axis=1)
oh i should have specified, stelercus helped me in the VC to fix it
oh no worries -- glad you got it working
thanks!
I have a dataframe like this, I want to create a new column C That has the first value of A for each value of B(table is sorted by B)
A B C
0 150 0
1 153 0
2 157 0
3 160 1
4 165 1
So, when populated it would look like this:
A B C
0 150 0 150
1 153 0 150
2 157 0 150
3 160 1 160
4 165 1 160
Any ideas?
It's not just a copy of A
ohhhh
I would do it like this (logic, not code)
- find unique values of B
- define a function that, for every value of B, determines
  - range; samples of that B value
  - minimum value of A for that range (this is the value you want repeated)
- create a new column C that pastes min(a) for range B
You can probably implement something like that with df.apply()
df["C"] = df.apply(YOUR FUNCTION HERE, axis=1 -> since you want it over columns)
Ah, dataframe isn't sorted on A, so it wouldn't always be the smallest value that has to be put into C, needs to be the first occurring value(time series index).
I hacked my way through with a loop, but it's pretty slow since I have 61k rows
you can still do it
instead of min you can just cast list(a) and retrieve index 0
but since you loop already nvm :p
I think groupby with transform would work here
df.groupby('b').transform(min)
oh my b not min but your function for getting the first index
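A sketch of the groupby/transform idea on the example table, assuming the first occurring value is wanted rather than the minimum. Passing the string `"first"` to `transform` takes the first non-null value of each group and broadcasts it back to every row of that group.

```python
import pandas as pd

# The example table from above (column C not yet filled in).
df = pd.DataFrame({"A": [150, 153, 157, 160, 165], "B": [0, 0, 0, 1, 1]})

# For each group of B, take the first A and repeat it on every row of the group.
df["C"] = df.groupby("B")["A"].transform("first")
print(df)
```

This keeps the original row order and avoids the Python-level loop, which should help with tens of thousands of rows.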
Hello I need a little help with cv2
So cv2 does not support gifs. How can I read gifs from Url's to manipulate them
I tried reading frame by frame and storing them in a np array but it didn't work
@deft ruin to this day i still dont know wtf does transform do lol
@exotic maple at least in this case it will broadcast the grouped df back to the original size
nice when you want e.g. a column with group means
so if i wanted a column
with the mean of A
i can do
df["Mean"] = df.groupby("A").transform(np.average) ?
yeah exactly
thanks man. I never knew wtf that was for lol. never got the hang of it
btw question, just to confirm if im right.
apply() on a dataframe is applied as a vector or as functional programming? That is, to the whole column at once or value-by-value
i was having a debate about that in another forums
last i remember apply is also vectorized
applies the function to the whole column, at once
yeah I think it's confusing because the dataframe method applies the function to each column or row (i.e. vector) but the apply method on a series is element by element
so you have to be careful with types
It depends on the function that you pass to it. Pandas will do different things depending on which function is given.
To understand apply vs transform: https://pbpython.com/pandas_transform.html
Why is "it depends" the most common programming answer for almost everything?
it depends
I'm having some trouble in configuring my environment for Apache spark - when I try to run things like connecting to a postgresql, I get streams of errors
Sexy
Do any of you, guys know about sympy?



