#data-science-and-ml
how can i impute categorical missing values using knn?
Oh dear, sounds fancy. If you have enough data points, dropping rows might be easier
i only have 6000 rows
and some columns are missing 50% of the data
actually only one column is
If a column is missing that much, drop the column
Anyways just for the general idea though, the column which you want to impute, treat it like target column for a knn
Train the knn using the rows for which this column in question has values present. Predict the values for rows where this column (which is now the target for the knn) is missing.
But if you have a lot of values missing, you can't blindly run knn on it. I personally don't think you should be trying to impute any column with more than 50% data missing at all. Drop the column, or figure out some different style of handling the missing values if your data understanding offers some way
A column like that should probably be dropped honestly
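A minimal sketch of the idea above (treat the column with gaps as the target of a KNN classifier), assuming scikit-learn is available. The column names and toy values are made up for illustration, and the feature columns are assumed to be numeric and complete:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: "size" is the categorical column with missing values.
df = pd.DataFrame({
    "height": [150, 152, 155, 180, 182, 185, 151, 183],
    "weight": [50, 52, 55, 80, 82, 85, 51, 84],
    "size": ["S", "S", "S", "L", "L", "L", np.nan, np.nan],
})

features = ["height", "weight"]
known = df["size"].notna()

# Train on the rows where the column is present...
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(df.loc[known, features], df.loc[known, "size"])

# ...and predict it for the rows where it is missing.
df.loc[~known, "size"] = knn.predict(df.loc[~known, features])
print(df)
```

With real data the features would usually need scaling first, since KNN works on raw distances.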
Is anyone here familiar with plotly? Specifically buttons / dropdowns ?
alright thank you @ripe forge
what do you think is the missing data threshold proportion for a column to be dropped
30%? 50%?
i already dropped all columns missing more than 60% of values
alright, im just gonna look at the distribution of missing values in my data, and see if there's a falloff at a particular point
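Looking at the distribution of missing values per column can be sketched like this with pandas. The toy frame and the 0.6 cutoff are only illustrative (the cutoff mirrors the 60% mentioned above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, np.nan, np.nan],
    "c": ["x", None, "y", "z"],
})

# Fraction of missing values in each column, largest first.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Keep only the columns at or below the chosen threshold.
threshold = 0.6
kept = df.loc[:, df.isna().mean() <= threshold]
print(list(kept.columns))
```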
If you can sensibly fill or aggregate some of the missing values, it becomes easier to manage
also i probably need to impute like 50 columns lol. would i have to run 50 knns?
if theres no easier way than that im just gonna replace with the mode^
You know, I'm not a 100% sure on that one.
alright
well for now im just gonna replace with mode since that's a lot easier. i'll look into knn replacement later
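For the mode replacement, one pass over all columns is enough, so 50 columns does not mean 50 separate steps. A small sketch with made-up columns:

```python
import pandas as pd

# Hypothetical categorical columns with gaps.
df = pd.DataFrame({
    "color": ["red", "red", None, "blue"],
    "shape": ["round", None, "square", "square"],
})

# df.mode() can return several rows when there are ties; take the first one.
modes = df.mode().iloc[0]

# fillna with a Series fills each column with its own mode.
filled = df.fillna(modes)
print(filled)
```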
thank you
Sounds good, cheers
@lapis sequoia im more of a stats guy than programming guy, but about missing data -- it really depends on why the data is missing. MCAR vs MAR vs MNAR (missing completely at random, missing at random, missing not at random) etc https://en.wikipedia.org/wiki/Missing_data . id personally try to figure out why the data is missing first more than anything. if speed is a concern, just drop it unless you have domain knowledge saying that column really matters... and if it does really matter... then you have a bigger problem than DS if your dataset is missing a critical component
thank you
^ yep great point. i was remiss in not mentioning MCAR,MAR,MNAR
some imputation methods only make sense for certain types of missingness
Hi, is this chat for neural networks?
Well, it says machine learning
So, I'm looking forward to learning more about it
if i want to see association between categorical variables, what are my best options? i was thinking use a contingency table to see frequency distributions
obviously i cant use pearson correlation
and what if my variables have high cardinality (i.e. categorical level with 200+ levels) and i want to see association
mutual information
or chi square
mutual information has better properties imo but chi square is a lot faster to compute
the built-in scikit-learn MI is single-threaded and very slow
you can also one-hot-encode and do stuff like hamming or jaccard similarity. depending on the type of feature
alright thanks. ill look up mutual information
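As a rough illustration of mutual information between two categorical columns, here is a sketch that computes it directly from a contingency table with pandas and numpy. The function name and toy columns are made up; scikit-learn's `mutual_info_score` should give the same quantity (in nats) for label columns:

```python
import numpy as np
import pandas as pd

def mutual_information(a, b):
    """Mutual information (in nats) between two categorical Series."""
    joint = pd.crosstab(a, b).to_numpy(dtype=float)
    joint /= joint.sum()                    # joint probabilities p(a, b)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (r, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, c)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

a = pd.Series(["x", "x", "y", "y"])
b = pd.Series(["p", "q", "p", "q"])   # unrelated to a

mi_same = mutual_information(a, a)    # a column tells you everything about itself
mi_indep = mutual_information(a, b)   # independent columns share no information
print(mi_same, mi_indep)
```

Higher values mean knowing one column tells you more about the other; 0 means no association in the sample.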
has anyone worked with pandas' series before
theres a part of the documentation that's worded a little bit funny
Yes
wait nvm i think i got it...
"R² is the percentage variation in y explained by all the x variables together.
For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 - that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."
What does variation in y mean in simple terms
anyone know why the below line gives me an error? ws.range('A2').value = results
import pyodbc
import xlwings as xw

wb = xw.Book('Book1.xlsx')
ws = wb.sheets['Sheet1']

def read(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM people")
    results = cursor.fetchall()
    print(results)  # prints a list of tuples, each tuple is a record
    # works fine
    ws.range('A2').value = [(1, 'joe', 'emmets', 23),
                            (2, 'sally', 'jacobs', 37),
                            (3, 'katie', 'falcone', 42)]
    # does not work - why?
    ws.range('A2').value = results

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 11.0};"
    "Server=(LocalDB)\\MSSQLLocalDB;"
    "Database=testdb;"
    "Trusted_Connection=yes;"
)
read(conn)
conn.close()
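Nobody answered this in the chat, so the following is only a likely explanation: pyodbc's `fetchall()` returns `pyodbc.Row` objects, which print like tuples but are not tuples, and xlwings may not know how to convert them. Converting each row to a plain tuple first is a common workaround. The sketch below uses a made-up `FakeRow` class as a stand-in so it runs without a database:

```python
class FakeRow:
    """Hypothetical stand-in for a driver row object: iterable, prints like a tuple."""

    def __init__(self, *values):
        self._values = values

    def __iter__(self):
        return iter(self._values)

    def __repr__(self):
        return repr(self._values)


def rows_to_tuples(rows):
    """Convert row-like objects into plain tuples."""
    return [tuple(row) for row in rows]


results = [FakeRow(1, "joe", "emmets", 23), FakeRow(2, "sally", "jacobs", 37)]
print(results)                  # looks like a list of tuples, but is not one
plain = rows_to_tuples(results)
print(plain)
# In the original function this would become:
# ws.range('A2').value = rows_to_tuples(results)
```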
@blazing bridge r-squared ranges from 0 to 1. higher r-squared typically means better. a cynical and basic way of looking at it is it just tells you how good or bad a line fits to your points. compare these plots
in your 0.72 example, it means that 28% of the variance is unaccounted for. this is just fancy stats speak for saying "hey like 30% of the variation in y can't be explained using the variables we have, so thats why our predictions might be off if we were to use the parameters from this model"
anyone know why this isn't working in Jupyter Notebooks?
@blazing bridge if it helps you wrap your head around it conceptually... for a simple linear regression with one x, r-squared is literally just the square of the correlation between x and y. but correlation can range from -1 to 1 while r-squared is only positive (0 - 1)
@turbid hazel You should provide errors If you have one, or some content
so @tranquil falcon this basically means the line of best fit is only 72% accurate and the rest is not accurate with our x values. So 28% is inaccurate, therefore the variation in y is 28%.
is this correct
hi there everyone! in my code i have an error ```python
import numpy as np
import matplotlib.pyplot as plt
incomes = np.random.normal(100.0, 50.0, 10000)
incomes.append(900)
plt.hist(incomes, 50)
plt.show()
the error is
AttributeError Traceback (most recent call last)
<ipython-input-4-9b18b9dd3278> in <module>
4
5 incomes = np.random.normal(100.0, 50.0, 10000)
----> 6 incomes.append(900)
7
8 plt.hist(incomes, 50)
AttributeError: 'numpy.ndarray' object has no attribute 'append'```
it says numpy doesnt have an append attribute?
how can i add a data point to my array
@uncut shadow for some reason literally nothing is outputted when i run it. no errors, nothing
@blazing bridge not quite. i think conceptually youre getting there but im not sure id use the word 'accurate' here. when we're talking about variance, which is what r-squared is trying to describe for a linear model, i think of it more about fuzziness in our actual guesses / predictions
like if we had a model to guess where a ball will land, if we had a model with 0.72 r-squared vs a model that is 0.95 r-squared, both COULD be accurate in a real-world scenario, but the 0.95 r-squared one would be more likely to have better results and less error on where the ball would land
@blazing bridge but long story short, you could make an argument that a model with 0.72 r-squared is probably more likely to be more accurate than a model with 0.65 r-squared, but that's not necessarily a fact. thats why i wouldnt use the words accurate vs inaccurate here because usually 'accurate' has to do with predictions while r-squared really just has to do with explaining variance
@tranquil falcon Thank you for clearing it up, one more question what does variance mean
is that the error that the line produces
like in simple linear regression we use the Squared error to see how well our line has fit the data
in multiple linear regression is r-squared the equivalent of the loss with two or more variabels
*variables
@livid turret no prob. variance is basically the center of the universe for statistics -- its really just a fancy word for how much variety/range there is in your data set. high variance = wider range of values low variance = smaller range of values.
so you mentioned error -- when you calculate error you have usually included standard deviation somewhere in your calculation. standard deviation is literally the square root of variance. so higher variance = higher standard deviation which would also mean higher error. but bear in mind higher calculated error doesnt necessarily mean less accurate (it all depends on context!!), it just means a wider range of expected values.
and to answer your question if im interpreting it correctly, then yes conceptually r-squared is showing the equivalent of the incorrect guesses. r-squared, mathematically underneath the hood, is actually calculating how far each point is from your best fitted line and aggregating that data. so if you have a lot of points that are really far away from the line that would indicate maybe the line, even if its the best fitted line, isnt taking into account other factors (thus the whole it explains x% of variance) that are relevant to the real world
but bear in mind r-squared is really only used for linear models. and there are many, many, many, many real-world scenarios where a linear model just doesn't work or doesnt make sense
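To make the r-squared discussion concrete, here is a small numpy sketch with made-up points. It fits a line, computes r-squared as 1 minus the ratio of leftover (residual) variation to total variation in y, and checks that for a one-variable fit this matches the squared correlation:

```python
import numpy as np

# Made-up data that roughly follows a line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line of best fit.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # variation the line does not explain
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y around its mean
r2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation between x and y
print(r2, r ** 2)                      # the same number for a one-variable fit
```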
(bear in mind a true academic statistician would be murdering me for some of the things ive said but im just trying to explain things conceptually and how they relate)
not sure if this is the spot for this, but im doing some plotly data visualizations and running into the issue of a slider I implemented not being displayed at all. Is anyone familiar with plotly / networkX and would willing to assist me for a bit. Thanks!
ok thank you so much @tranquil falcon
the data that is being passed into your cum_freq_fn isn't a column name, it's the element(s) themselves, that's why you're getting a KeyError
It's probably worth noting, there's already a method of pandas series to calculate the quantile value
^
yep! salt rock explained it to me, thank you both for helping out though
btw just a tip for ML.
Might be good to have a baseline knowledge of statistics @blazing bridge
That should help you better understand how ML models actually work, and what concepts like std, gaussian, variance, etc. are.
yeah @flat quest I was thinking about that but I dont know where to start
Im only in grade 10 so i dont know what would make sense for me
ah nice
yeah best place to start is prob something like khanacademy and then go to like opencourseware. @blazing bridge . YT is also a pretty good place to look at
ok thank you @flat quest what would you say variance is in simple terms
well mathematically variance is the square of the standard deviation
like zammy said, variance is a measure of spread of your data. How much distance or change in numeric value is between your data points.
Are they clustered around certain values? thats small variance or spread.
If they're widely spread out, you have high spread or variance.
"For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 - that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."
could you explain what they mean over here
I like to think of the variance as the distance between your dataset and a dataset where every data point is equal to the mean
Or something not unlike the average distance from the mean although the whole squared part overweights points that are farther away vs a true mean
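A quick sketch of variance as "average squared distance from the mean", using two made-up datasets with the same mean but different spread. This is the population version (divide by n), which is also what `np.var` does by default:

```python
import numpy as np

tight = np.array([9.0, 10.0, 11.0])   # clustered around 10
wide = np.array([0.0, 10.0, 20.0])    # same mean, much more spread

def variance(values):
    """Average squared distance of each point from the mean."""
    return np.mean((values - values.mean()) ** 2)

var_tight = variance(tight)
var_wide = variance(wide)
std_wide = np.sqrt(var_wide)          # standard deviation = square root of variance
print(var_tight, var_wide, std_wide)
```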
Hello everyone,
I have just started with data science.
I am thorough with the basics of pandas and matplotlib, also for web scraping I know selenium, also beautiful Soup . Please advise me on what to do next.
@lapis sequoia That depends on what areas of data science you're interested in. But I'd encourage you to also take on some real world projects. If you're through the basics of pandas/matplotlib and BS4 there's plenty of opportunities to visualize and analyze scraped data.
I just used them to make a project,
So basically what it does it looks up for the input-ed hashtags on instagram and scrapes all the post's hashtags
Further on I have added some keywords
The program filters all the hashtags with the given keywords and presents them to me in sets of 25 in a word file
So where's the data science? That sounds like pure scraping.
Basic pandas
I dont know much about data science, just wanna know how to get into it
Well there's tons of resources like: https://towardsdatascience.com/the-ultimate-guide-to-getting-started-in-data-science-234149684ef7
Data science encompasses a lot of things (as mentioned in that article) if you could narrow it down that would be helpful.
@brazen cloud Hi you can use the tolist() method to convert your array into a list, then use append on the list to add your data point, and then convert it back to an array
What do you think I should take up next? @slate scroll
@brazen cloud or you can use np.append method. Its easier to do it this way
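A minimal sketch of the `np.append` route for the histogram question. Note that `np.append` returns a new array instead of changing the original, so the result has to be assigned back. The seeded generator is only there to make the example repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(100.0, 50.0, 10000)

# ndarray has no .append method; np.append builds and returns a new array.
incomes = np.append(incomes, 900)

print(incomes.shape, incomes[-1])
```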
@lapis sequoia That really depends on what kind of career you're interested in. With some basic knowledge under your belt, what parts of it did you most enjoy?
Oh I really enjoy automating with selenium and pulling off data from internet
I just dont know what's next @slate scroll
If data collection and augmentation is what you enjoy then data engineering would be a great path.
For most real-world examples that'll include things like streaming data, Kafka, and big data tools like Hadoop, Hive, HBase, BigQuery, etc...
SQL would be a great next step towards that world too
So what is data engineering all about?
Well data engineering is all about collecting data and augmenting it with other data sources. Think about a website (my work so an easy example). They're constantly collecting tons of data on how their users are interacting. The data engineers are responsible for turning that stream of interactions into nice tables that can be used for reporting and modeling.
I am also interested in AI/ML (at least wanna give it a try), it always fascinated me
AI/ML is also a wide field, there's R&D and algorithm development (traditional Data Science) and then there's ML engineering or applied ML. Basically turning ML into products and affecting consumers
Ohh so is data engg. About data representation for making decisions?
I dont have good idea of where to go, I just wanna start with basics and let's see what I like on the way
Yeah data engineering can be about how to collect and represent data. so let's say my website adds a new feature and the execs want to know how many people click on it. The data engineers will work with tracking engineers on how to collect that data and transform it into reportable data locations
Not sure if you have a blog or website or anything but if you want to play with some streaming data I have a demo on my Github: https://github.com/Raab70/serverless-streaming-web-analytics
Cloud stuff is also always a great addition to any resume. Most cloud providers have some amount of resources you can use for free it just depends on which one you use.
Sure so that should be plenty of directions to head in, cloud computing, and SQL would be great starting off points that will be applicable to any data science role.
ok , i will start with sql soon
Learning Docker is a great addition too
so whats the use of sql, heard it is for programming databases
but I don't want to throw too much your way
SQL is a query language for relational databases
mySql was something i read of
MySQL is a type of database which is relational and therefore uses SQL. Postgres would be another. Those are both OLTP (online transaction processing) databases.
There are also OLAP (online analytical processing) databases, like BigQuery or Redshift. But they're all based on SQL
how will sql be useful in a datascience career?
Well in data science you'll need to be collecting data from different sources, most of them will be SQL.
ok so if i wanna learn sql, how shall i go about it on python?
Well you can get into more topics like ORMs, which are basically a mapping of SQL into python objects. But a knowledge of base sql is super useful. Maybe try just adding it into your existing dataset
So make your scraper work for multiple runs. How would it work if you ignored posts that were already seen
Store them to a DB instead of just pandas
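One way to try this without installing anything is Python's built-in `sqlite3` module. The sketch below is only an illustration: the table layout, the post ids and the `save_posts` helper are made up, and it uses an in-memory database (a file path would keep the data between runs). A primary key plus `INSERT OR IGNORE` makes already-seen posts get skipped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    "CREATE TABLE IF NOT EXISTS posts (post_id TEXT PRIMARY KEY, hashtags TEXT)"
)

def count_posts(conn):
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

def save_posts(conn, posts):
    """Store (post_id, hashtags) pairs; return how many were actually new."""
    before = count_posts(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, hashtags) VALUES (?, ?)", posts
    )
    conn.commit()
    return count_posts(conn) - before

# Two hypothetical scraper runs; "p2" shows up in both.
first = save_posts(conn, [("p1", "#cats #dogs"), ("p2", "#python")])
second = save_posts(conn, [("p2", "#python"), ("p3", "#sql")])
print(first, second, count_posts(conn))
```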
nah, it would re-scrape it
ok will keep that in mind
will research more about sql
It is essential for any data science role
thankyou for your time rob
variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean. The goal is to have a value that is low. What low means is quantified by the r2 score (explained below).
is this a correct definition
ok so what is the difference between sum of squared residuals and r-squared
Hi have someone used the Yoga-82 dataset?
or knows datasets that are public domain related to yoga?
Is anybody available to help? I need some help figuring out how to import a excel file into python and then combining a few of the columns and running some simple calculations. Just a little rusty and on a bit of a time crunch
install openpyxl
Does anyone know if it's possible, without using TensorFlow and only with numpy and opencv, to transform different images into arrays, classify them and compare them using sklearn?
Hey, is anyone familiar with the method of turning a linear program into standard form, and is willing to walk through an example with me?
@gritty vapor yes but it's not a beginner level task by any means
Oh sklearn. Mayyybe
Image classification existed before deep learning
So yes you can go back and dig up all the old literature on creating features for classifying images with SVMs
I don't see why you would want to do that
I would like to take a lot of pictures of brain tumors, and when I give one to the program it tells me if it's a tumor or not, without 500 lines of code etc
is anyone aware of a simple method or something similar to make a seaborn heatmap axis show all of its labels when zooming in? I have a 67 by 10 heatmap, and want to be able to see the labels of any part of the heatmap that I zoom in on. Unfortunately, it does not automatically do that. any suggestions?
here is what the heatmap looks like
@gritty vapor Image classification is a complicated difficult task, you're asking too much to not have to at least write some code in order to get your complicated difficult task done using modern tools
Maybe yolo object detection can work on brain tumors
I would be much less concerned about lines of code than actually training and validating a good model
I think that's not a good attitude at all
Okay thanks
@barren topaz you would need some kind of fancy dynamic graph with a slider or something like that
I don't think seaborn has that ability by itself
Seems like something you can do with D3.js, maybe you can do it with Bokeh
@desert oar awesome, thanks! I'll try those out. Worst come to worst, and I'll just split it into 3 or 4 different subplots.
I can't get it working. I've never used bokeh before, so I'm learning it now. will probably take a while to figure out. Thanks for pointing me in the right direction though!
Looking for some help with datatables (Dash) and callbacks in Dash. Basically i want to auto generate an editable Datatable and after editing it, click on a button to update some other tables.
Anybody can help?
I have got the first part working but i cant call the data from the DataTable to update the rest afterwards
Anyone know how to reshape a Tensor while keeping it on the same GPU?
@desert oar i had a followup question to yesterday
do you know why the code works for an individual column but not on the entire dataframe in my for loop?
let me know if you want me to type the code out
this is the function that captures 80% of the cumulative values that we discussed yesterday
can you post your code as code and not a screenshot
yea. sorry i code & use discord on different computers
def cum_freq(df, colname, threshold=0.80):
    x = df[colname]
    fractions = x.value_counts() / len(x)
    frac_cum = fractions.cumsum().sort_values()
    if frac_cum[0] > threshold:
        return frac_cum.index[0]
    else:
        return frac_cum.loc[frac_cum <= threshold].index.tolist()

for colname in df_new.columns:
    cum_freq(df_new, colname)
what's this
frac_cum[0] thing
@desert oar because some values have like the first value count as 0.85 and then all the columns after are like 0.01
ill try iloc
ah iloc worked
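For reference, a sketch of what the function might look like with the `.iloc` fix applied. This is a guess at the intended behavior, not the poster's final code: it uses the `threshold` argument instead of a hard-coded 0.8 and always returns a list, and the toy frame is made up:

```python
import pandas as pd

def cum_freq(df, colname, threshold=0.80):
    """Return the most frequent values that together cover up to `threshold` of rows."""
    # value_counts sorts from most to least frequent, so cumsum is increasing.
    fractions = df[colname].value_counts(normalize=True)
    frac_cum = fractions.cumsum()
    if frac_cum.iloc[0] > threshold:   # positional: first entry, whatever its label
        return [frac_cum.index[0]]
    return frac_cum.loc[frac_cum <= threshold].index.tolist()

# Hypothetical frame: one column dominated by a single value, one more mixed.
df_new = pd.DataFrame({
    "dominant": [1] * 9 + [2],
    "mixed": list("aaaabbbccd"),
})

results = {col: cum_freq(df_new, col) for col in df_new.columns}
print(results)
```

The "dominant" column is the kind of case where `frac_cum[0]` fails: its labels are 1 and 2, so there is no key 0.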
why did i have to use iloc though? why wouldnt it work without it
it's... complicated
.iloc is positional indexing
[] on a Series is label-based by default (like .loc), which is key-based indexing
if your keys are floats, it's going to convert 0 to float then say "there's no key 0.0"
which is what happened
columns are keys
wait...
yeah
df.columns is a list of column keys
i.e. column names
dataframes have their own behavior
i try to use .loc and .iloc whenever possible to eliminate ambiguity
the only exception is using df[colname] for column access
otherwise i always try to use .loc and .iloc
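A tiny sketch of the label vs position difference described above, using a made-up Series whose keys are floats:

```python
import pandas as pd

# The index (keys) holds floats; the values are cumulative fractions.
s = pd.Series([0.85, 0.95, 1.0], index=[2.5, 7.0, 1.5])

first_by_position = s.iloc[0]   # first element, regardless of its key
by_label = s.loc[7.0]           # element whose key is 7.0

try:
    s[0]                        # looked up as a key; there is no key 0.0
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, by_label, lookup_failed)
```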
iloc is really nice when you do sortings
can someone help me understand how 2^g(n) shoots up after a certain value while 2^f(n) remains flat even though f(n) > g(n)?
Hi! I dont know if this is a place to ask but I just am getting started and i basically just want to learn to program a graph and have it track or plot a dataset and i dont know where to begin any ideas? Thanks in advance
You should ask your question and provide code/errors required for us to solve your problem
could someone explain this to me. Like what does variance mean in terms of linear regression. Someone told me it is the spread of the data but how does that relate to this in that case.
they say 72% variation in y, im not sure what they mean. Is it its 72% accurate
Last thing what is mean squared error and whats the difference between R-squared and mean squared error
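A rough sketch of how sum of squared residuals, mean squared error and r-squared relate, using made-up actual and predicted values. MSE is in the (squared) units of y, so it changes when y is rescaled, while r-squared is a unitless fraction:

```python
import numpy as np

def fit_metrics(y, y_hat):
    """Return (sum of squared residuals, mean squared error, r-squared)."""
    residuals = y - y_hat
    ssr = np.sum(residuals ** 2)           # total squared miss
    mse = ssr / len(y)                     # average squared miss
    sst = np.sum((y - y.mean()) ** 2)      # total variation in y
    r2 = 1 - ssr / sst                     # share of variation explained
    return ssr, mse, r2

y = np.array([3.0, 5.0, 7.0, 9.0])         # actual values
y_hat = np.array([2.5, 5.5, 6.5, 9.5])     # predictions

ssr, mse, r2 = fit_metrics(y, y_hat)

# Same data expressed in different units (e.g. cents instead of dollars).
ssr_c, mse_c, r2_c = fit_metrics(y * 100, y_hat * 100)
print(ssr, mse, r2, mse_c, r2_c)
```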
i think you really need to spend some time learning statistics
it makes these concepts easier to understand
yeah ik. I searched this up on khan academy, didnt really make sense to me
any recommendations. I know I've been really annoying and probably pissing you guys off
im not sure what the standard stats textbooks are nowadays
but a good book should help
imo you shouldnt even try to learn regression until you know basic statistics
you can do it, but... this is what happens
what would be a good book suitable for someone in high school or maybe a course
let me look
maybe The Statistical Sleuth by Ramsey & Schafer
or Statistics by Freedman, Pisani, & Purves
however they are both "textbooks" and might be expensive
oh, im sorry to bother you, anything thats free?
ok, once again sorry
would you think this is a good definition: in terms of linear regression, variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean.
like variation in y
have you heard of a "random variable"
no
ok, that is the big missing piece for you, i think
oh, wait is it something that has infinite number of possible values
height, weight
for example
it can have a finite number of values
the very short version is that a random variable describes a data-generating process. a random variable, let's say Y, is not a single value, but a description of the kinds of values that it can have, and how probable each value is
for example: when you flip a coin, the outcome of the coin flip can be Heads or Tails
"A random variable is a variable whose value is unknown"
thats an incomplete definition
the outcome of the coin flip can also be described as a random variable, with outcomes Heads (0.5 probability) and Tails (0.5 probability)
a random variable is a description of a data generating process
it's a relationship between "possible outcomes" and "probabilities"
i am glossing over a lot of things here, but that's the basic concept at the bottom of pretty much all of stats and probability, and by extension much of machine learning
variance tells you how spread out the possible outcomes are. high variance means a lot of spread in the possible outcomes, so if you were to re-generate many data points with high variance, there would be a lot of variation among the data points
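The "possible outcomes and probabilities" picture can be written down directly. In this sketch a random variable is just a dict from outcome to probability; the coin coding (tails = 0, heads = 1) and the lottery example are made up for illustration:

```python
# A random variable as a mapping: outcome -> probability.
coin = {0: 0.5, 1: 0.5}          # tails = 0, heads = 1
lottery = {0: 0.99, 100: 0.01}   # almost always 0, rarely 100

def mean(rv):
    """Probability-weighted average of the outcomes."""
    return sum(x * p for x, p in rv.items())

def variance(rv):
    """Probability-weighted average squared distance from the mean."""
    m = mean(rv)
    return sum(p * (x - m) ** 2 for x, p in rv.items())

coin_var = variance(coin)
lottery_var = variance(lottery)
print(mean(coin), coin_var, mean(lottery), lottery_var)
```

The lottery has a much higher variance because its outcomes are far more spread out around its mean.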
yeah I just watched a video on that and its the spread of the data around the mean
yeah, sure
so the variance of Y in regression could mean a few different things
one meaning is: the actual variance of Y, which generated the data
we can never know that answer
another meaning is: the variance of the predictions, which is the definition you gave
there are other variances to consider as well, but you can probably ignore those for now
so the definition in this case would be how far the actual values are from the predicted values, that is the variance
no, that is something else
those are residuals
you can compute the variance of the residuals
oh ok
that document talks about "variation in y"
in this case they mean "the variation in the data"
i.e. the observed variance in the data
so R^2 is the amount of the variance of Y that is "explained" by the variance of X
...kind of
the meaning of "explained" is fuzzy without equations and a more coherent understanding of random variables
but its good enough to interpret results
also @blazing bridge https://www.reddit.com/r/statistics/comments/gzv74a/q_learning_statistics_by_yourself/
STAT 100 in there might be useful
Enroll today at Penn State World Campus to earn an accredited degree or certificate in Statistics.
OpenIntro's mission is to make educational products that are free, transparent, and lower barriers to education. We're a registered 501(c)(3) nonprofit.
I am using pandas to find the mean of a 45k line column in a dataset. I noticed some of the cells in the column have text content while the column is intended for integer based values. Still, pandas is able to provide me with the mean. I extracted a small section of the column and copied it to a separate spreadsheet for analysis while using the exact same pandas code. However, when I run the code against the column in a different dataset, i get a "can not convert string to float" error
I would like to understand why I am not also receiving this string to float error when I run the exact same code in the 45k line column which contains many strings, but receive the string to float error when I attempt to acquire the mean for a smaller section of the column.
its almost as if pandas decides that it doesn't care about the string values when we are dealing with 45k lines, as opposed to 20. When the amount of cells in the column is lesser, then it wants to get real righteous
that is weird
it is possible that it's using a different algorithm on a bigger series
however i can't say i've ever had that experience
makes me wonder at what point pandas throws out the book. lol
disregard, identified that I am seeing the information incorrectly because I am using shit software to open the spreadsheets
that was my next question
hi i want to start with machine learning i found a tutorial series on youtube but its from 2018 and since then many changes were made to the pytorch framework so i would like to ask if https://www.youtube.com/watch?v=GIsg-ZUy0MY would be a good starting point? This is probably the most recent video about pytorch
Is there a updated PyTorch chatbot?
does anyone know how many hidden layers to use in a neural network using tf.keras.models.Sequential()?
Hi guys - I am trying to develop a model to solve a multi label problem (with tf and keras) and I see this is solved differently sometimes when building the model layers...
I see this (1 sigmoid layer):
layers.Dense(number_of_classes, activation='sigmoid', name='output')
and this (1 sigmoid layer per class):
output2 = Dense(1, activation = 'sigmoid')(x)
output3 = Dense(1, activation = 'sigmoid')(x)
What are the differences between these two options?
Or maybe a better question would be -> when to use only 1 output layer and when to use multiple?
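This went unanswered in the chat, so take the following as a hedged sketch rather than a definitive answer. In terms of the forward computation, one sigmoid layer with N units and N separate one-unit sigmoid layers on the same input compute the same thing; the numpy example below (random made-up weights, no Keras) shows that. The practical difference in Keras is mostly bookkeeping: separate outputs let you attach a separate loss, loss weight and metrics to each label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))   # 4 samples, 5 features from the previous layer
W = rng.normal(size=(5, 3))   # weights for 3 labels
b = rng.normal(size=3)        # biases for 3 labels

# Option 1: a single dense layer with 3 sigmoid units.
joint = sigmoid(x @ W + b)

# Option 2: three dense layers with 1 sigmoid unit each, using the same weights.
separate = np.hstack([sigmoid(x @ W[:, [k]] + b[k]) for k in range(3)])

print(np.allclose(joint, separate))
```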
If I need to find the cosine distance of two vectors, and one is longer, do I pad the other with zeros until they're the same len?
Well that worked but something about it feels wrong.
@serene scaffold uuuh if you know that the shorter vector of size n is the first n dims in the longer vector (in the same order) then sounds like it should work
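A small sketch of the zero-padding approach, with a made-up helper name. It only gives a meaningful number under the assumption stated above, namely that the shorter vector's dimensions line up with the first dimensions of the longer one:

```python
import numpy as np

def padded_cosine_distance(a, b):
    """Cosine distance after zero-padding the shorter vector at the end."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = max(len(a), len(b))
    a = np.pad(a, (0, n - len(a)))   # np.pad fills with zeros by default
    b = np.pad(b, (0, n - len(b)))
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

same_direction = padded_cosine_distance([1, 2], [1, 2, 0])
orthogonal = padded_cosine_distance([1, 0], [0, 0, 5])
print(same_direction, orthogonal)
```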
The vectors are actually from completely different spaces. The goal is to learn the mapping.
How do you measure the distance if they're in different spaces?
Haven't gotten that far.
I mean you shouldn't be able to.. you need some kind of information to learn the mapping
This is an nlp thing. We're trying to put things into an ontology based on embeddings for the tokens and embeddings for the ontology.
Right. That's the goal.
We have mention-label mappings for a training corpus.
If the goal is to use some kind of regression to learn a transformation then I guess your method might work
Because in that case you're defining how you treat shorter vectors
A paper has already been published describing how they were successful with this process
But they didn't describe it thoroughly or release the code
Isnt this a pretty typical case for a NN
Multi input multi output
Distance doesn't exist or make mathematical sense between things in different spaces
Maybe it's used as the loss function? between the output and the label
it really depends on what the method used is
I am working with some grad students at my school on a project that involves getting two programs to talk to each other. they have some of the code written, but i can't find any documentation on the syntax (or very, very little). any idea where to look for something like this?
Hi, do you guys know the best method to interpolate missing financial data before correlating them?
How hard is the maths in data science and AI?
@late jackal talk how?
@iron escarp its not usually "hard" but there is a lot to know
@desert oar developing a python based GUI, so being able to write to HYSYS
what is HYSYS
A process engineering modeling software
So like it can model a gas adsorbed and run a bunch of simulations
I know it can be done since we have some working code but there's like no documentation on the API
Seems to be the case
And you need a gui app that can take user commands and then interact with the hysys api
Ok. Thats not really on topic here, but you can use tkinter, kivy, pyqt, or toga for that
#user-interfaces can help you
Ah ok I'll maybe send some info in there thank you
I guess obviously I'll get better at python
To get into data science later on
But how do ik if I'll like data science?
Without trying it?
how can i utilize categorical variables with high cardinality in my data to make predictive models?
(i.e i have a column called business units with 1000+ levels, and several other variables with 100+ levels in a 6000-row dataset)
i probably cant one hot encode that because it will introduce extremely large dimensionality
i was thinking of potentially k-means clustering into smaller groups
any other ideas come to mind?
it depends.
first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem
you can consider some form of clustering, as you noted
you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)
depending on the problem, you could consider some other form of encoding
e.g. think about how words can be vectorised using an embedding vs one hot encoding
you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)
@velvet thorn yeah i did this
is it a bad idea to just do that for all the columns though?
since you mentioned domain knowledge is important
i just put everything with 80% of the value counts as long as the length was < 20, and put the remaining 20% as 'other'
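A minimal sketch of the "keep the common levels, lump the rest into 'other'" idea described above, assuming pandas. The helper name `lump_rare`, the 80% coverage cut-off and the 20-level cap are illustrative assumptions, not the code actually used in this conversation.

```python
import pandas as pd

def lump_rare(col: pd.Series, coverage: float = 0.80, max_levels: int = 20,
              other_label: str = "other") -> pd.Series:
    """Keep the most frequent levels until `coverage` of the rows is reached
    (capped at `max_levels`); relabel every other level as `other_label`."""
    counts = col.value_counts()                  # sorted, most frequent first
    rows_before = counts.cumsum() - counts       # rows covered before each level
    mask = (rows_before < coverage * len(col)).to_numpy()
    keep = counts.index[mask][:max_levels]
    return col.where(col.isin(keep), other_label)

# Hypothetical column: two common levels and two rare ones.
units = pd.Series(["A"] * 6 + ["B"] * 3 + ["C", "D"])
lumped = lump_rare(units)
print(lumped.value_counts().to_dict())
```

Whether lumping is appropriate still depends on how much signal the rare levels carry, which is the domain-knowledge point made in the replies.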
in layman's terms, you are saying that all these different things are in fact "the same"
right
so the question is - how much information does this feature hold, relative to the rest of the dataset?
now, imagine this.
say your dataset is one of names and you want to predict if it's a male or female name
if you used one hot encoding and turned all the minorities into the same category
your predictions for them would suck, right?
ya
so that wouldn't be an appropriate method
i see
and the reason is that you have only one feature
mhmm
on the other hand, if, say, your dataset is a housing one and the high cardinality feature is road name
that probably won't be an issue because road name is unlikely to have a large impact anyway
first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem
@velvet thorn can you explain this part though
no need to tag me by the way
i did one hot encoding for columns with nunique <= 15
no need to tag me by the way
alright sorry, when i quote you, it tags you by default but i'll delete it
so how can i one hot encode with some cat variables with 100+ levels
in a sparse matrix
oh wait
is quoting a new feature or have I just been living under a rock
LOL
anyway
are you using sklearn?
you right click and press quote
yeah I just saw
are you using
sklearn?
yes
wait before you answer that
when you do one hot encoding
and you do pd.get_dummies
you dont need to drop the numeric values right
okay
more accurately
you can but
well this is kinda an opinion thing so ignore me
but
let's go back a bit
do you know what a sparse array is?
nope lol
i know about curse of dimensionality
does being sparse have to do with that
i looked up sparse array
A sparse array is an array of data in which many elements have a value of zero
makes sense
okay
so a normal array
stores each value
but a sparse array stores only the nonzero values
the number of "1"s in a one hot encoded array is always equal to the number of rows
right
and the number of "0"s is equal to the number of rows * (the number of categories - 1)
so the more categories you have, the bigger the space savings of a sparse array
i see
so isn't that like label encoding?
because label encoding only stores 1 through n values in one column
label encoding is non-zero values for a particular column .. i.e. dog,cat,rat would just be 1,2,3 in one column
ok nvm just ignore that
but what does this have to do with pd get dummies
are you suggesting to just use sparse array of OHE?
pd.get_dummies(..., sparse=True)
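A rough sketch of what that call changes, assuming pandas and numpy and a made-up column with up to 1000 levels in 6000 rows. The shape is identical either way; only the storage differs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 6000 rows, one categorical column with up to 1000 levels.
units = pd.Series(rng.integers(0, 1000, size=6000).astype(str), name="unit")

dense = pd.get_dummies(units, sparse=False)
sparse = pd.get_dummies(units, sparse=True)

dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()

print(dense.shape, sparse.shape)             # same shape either way
print(dense_bytes, sparse_bytes)             # the sparse frame should be much smaller here
print(int(dense.to_numpy().sum()))           # number of 1s equals the number of rows
```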
so how does using sparse affect the model performance
im using a random forest, idk if that matters
ah
when you said curse of dimensionality
you meant as a modelling problem
not as a storage problem
I misunderstood you I think
no i think this is actually helpful to know
because i ran two models:
1 model where features nunique <=15 and i one hot encoded
and a model where im gonna use all the categorical variables with cardinality
but as i mentioned some of the categorical variables have 100+ levels. should i just do pd.get_dummies(sparse=True)
it's a good start
think I went through them earlier?
alright ill try clustering and ohe with sparse = true
thanks
also one more question
if sparse = true reduces storage, why dont people just always use sparse = true
there should be some drawbacks to it?
yes
a few
you can't do certain things with a sparse array
because of the constraint that many values must be 0
e.g. imagine standardisation (- mean, then / std)
mhmm
an array that was sparse
most likely cannot be standardised
because the majority value (0), after subtracting the mean, will not be 0
a sparse array can take more memory than a dense array if the data is not actually sparse
because of the way nonzero values are stored
one implementation of a sparse array
is storing the indices and values of non-sparse values
so you need to store two things instead of one per value
and the tradeoff is that you need not store all the values
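The storage trade-off described above can be sketched in plain Python. This is only an illustration of the "positions plus values" idea, not how pandas or scipy actually lay out memory; `to_sparse` and `stored_items` are hypothetical helper names.

```python
def to_sparse(column):
    """Store only the non-zero entries as (positions, values)."""
    positions = [i for i, v in enumerate(column) if v != 0]
    values = [column[i] for i in positions]
    return positions, values

def stored_items(column):
    positions, values = to_sparse(column)
    return len(positions) + len(values)      # two things per non-zero value

mostly_zero = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
mostly_full = [3, 1, 4, 1, 5, 9, 2, 6, 0, 3]

print(len(mostly_zero), stored_items(mostly_zero))   # 10 dense items vs 2 sparse items
print(len(mostly_full), stored_items(mostly_full))   # 10 dense items vs 18 sparse items

# Subtracting the mean turns every 0 into a non-zero number,
# so the centred column is no longer sparse.
mean = sum(mostly_zero) / len(mostly_zero)
centred = [v - mean for v in mostly_zero]
print(stored_items(centred))                         # 20
```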
i see
so how are the actual values stored with sparse = true vs pd.get_dummies(sparse=false)? bc if you have labels like cat,dog,rat, then if you one hot encode with sparse= false, it would be
1, 0, 0
0, 1, 0
0, 0, 1
ye
is it just gonna show as
1
1
1?
uh
I'm not sure how they're displayed
but
that's an entirely different concern from how they're stored
it's more like
ohh ok
ok thanks
i'll take a stab at this tomorrow
also just to make sure, the main advantage/use of sparse is reduction in storage/memory consumption correct?
vs numpy array
yes
ty
@hidden halo i fixed the problem, but i still don't really know what the problem was
i ended up hardcoding it by turning the dates into strings and chopping off the time section
then converting it back to a datetime object
🤷‍♂️
how can i find correlations amongst ALL variables in a dataframe containing categorical and numeric variables
@velvet thorn so using pd.get_dummies(sparse=true) yielded a 5718 row x 8661 column dataset
is this ok to put through a random forest lol
i ended up hardcoding it by turning the dates into strings and chopping off the time section
@marsh chasm I was on phone earlier, so couldn't try. Now I tried and df['date'].dt.date seemed to work for me. It removed the time part. I just did:
df = pd.read_csv("csv_file.csv", parse_dates=[0])
df['date'] = df['date'].dt.date
why are there more columns than rows
@velvet thorn uh im not sure
the rows are the exact same
5718 are the number of rows i had
but the original dataset had like 63 columns i think
are the number of rows supposed to stay the same like in one hot encoding?
the thing is
you should have
1 new column
for each unique value
and clearly the number of unique values cannot be more than the number of rows
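That bound holds per column. When several categorical columns are encoded at once, the new columns add up across them, so the total can exceed the row count. A small sketch with made-up data, assuming pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "unit":   ["a", "b", "c", "d"],     # 4 levels
    "region": ["n", "s", "e", "w"],     # 4 levels
    "size":   [1.0, 2.0, 3.0, 4.0],     # numeric, passed through unchanged
})

cat_cols = ["unit", "region"]
# One dummy column per level of each categorical column, plus the untouched columns.
expected_cols = df[cat_cols].nunique().sum() + (df.shape[1] - len(cat_cols))
dummies = pd.get_dummies(df, columns=cat_cols)

print(df.shape, dummies.shape, expected_cols)   # 4 rows, but 9 columns after encoding
```

Running `df[cat_cols].nunique().sort_values()` on the real data would show which columns contribute most of the new width.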
yeah this is what i did
mod3_dummies = pd.get_dummies(df_mod3, sparse=True)
df_mod3.shape, mod3_dummies.shape
returns
```python
((5718, 59), (5718, 8661))
```
also there are 0s in the sparse dataframe. didnt you say there werent supposed to be any 0s?
mod3_dummies = pd.get_dummies(df_mod3, sparse = True)
df_mod3.shape, mod3_dummies.shape
@velvet thorn any tips?
what kind of data are you trying to OHE here? if you went from 59 columns to 8.6k columns, it's clearly some columns with a tad too many "unique" values, like an address or something
so I just saw your previous msg, you said some cat columns had 100+ levels and you're wondering how to OHE it? It probably depends on what the data is representing, but I wouldn't OHE with that many categories.
If for example, you had an address column that had 100+ unique values, you could, for example, extract only the district/province information and OHE that instead.
well for instance
i have business units with over 1400 levels
in a 6000-row data set
theres another column with 2500 levels
find some meaningful way to bin them
with that many levels, it's like 2-4 rows per level on average
yea
amongst the categorical data, my data has like 40 columns with <20 unique values and like 10-15 with >150
theres nothing between 20-150
even the 40 columns with 20 unique values is kinda ridiculous once you OHE them. gotta be careful with curse of dimensionality here, considering you have so few training examples
okay so i built 2 separate models --
- baseline containing only categorical columns with <10 unique values + numeric variables
- keeping all unique values that capture 80% of column value counts less than 20 and storing remaining 20% as 'other' + numeric vars
- im gonna try somehow using all categorical variables + numeric vars
however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that
you could do feature selection / dimensionality reduction afterwards of course, but some domain knowledge to clear out some useless features (or even binning the categories in a way that makes sense) might be a good start
I want to make a roulette game, how would I calculate the odds (I have 3 colors rn) so that "the house" has a good edge and would be profiting instead of losing money
alright thanks tofu
however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that
@lapis sequoia do you know anything about this question though?
tips to circumvent what? doing OHE?
i guess having accurate variable importances
because one hot encoding is skewing the variable importances of random forests towards the non-OHE data (numeric columns)
so the numeric columns tend to be at the top
and the one hot encoded features are below them
i have business units with over 1400 levels
@lapis sequoia is it absolutely impossible to encode them with more meaningful numerical values?
like you can not establish any kind of numerical relationship between categories, i.e. category "A" can be represented with 0, "B" with 1 and "C" with 5?
yeah I was hoping maybe you can somehow establish it
to be honest I simply have doubts that 1400 unique values with that size of data will contribute positively to your ML model results
I would not be surprised if dropping the entire column would yield better results than OHE 1400 values
i was thinking the same
i believe that there is some sort of association/correlation between that column and a few others as well
im calculating a cramer v correlation
among categorical columns
and will drop any with high association
i guess having accurate variable importances
@lapis sequoia no, not particularly. that issue is intrinsic with random forests & having high cardinality data, I think.
so is it ill-advised to use a random forest in this scenario?
It can work fine as a meta-model, just gotta do some preprocessing beforehand
like I said, binning, or you could try doing embedding the high cardinality features into a smaller space
though the latter would not be useful if you're trying to do inference
alright thank you
i already tried binning
any particular methods you'd recommend about the latter option?
embedding the high cardinality features into a smaller space
Did you try target encoding
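For reference, a minimal sketch of smoothed target encoding, assuming pandas. The function name `target_encode`, the `smoothing` parameter and the toy data are assumptions made for illustration; dedicated libraries offer more careful variants.

```python
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str, smoothing: float = 10.0):
    """Map each level of `col` to a smoothed mean of `target`."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink levels with few rows towards the global mean.
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoding, global_mean

train = pd.DataFrame({
    "unit": ["a", "a", "a", "b", "c", "c"],
    "y":    [1, 1, 0, 0, 1, 1],
})
encoding, fallback = target_encode(train, "unit", "y", smoothing=1.0)
train["unit_te"] = train["unit"].map(encoding)

# Levels never seen in training fall back to the global mean.
new = pd.Series(["a", "z"]).map(encoding).fillna(fallback)
print(encoding.to_dict(), new.tolist())
```

The encoding should be fitted on training folds only; computing it on rows that are later used for evaluation leaks the target into the feature.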
Another option is to train your model w/ a Factorization Machine
https://medium.com/building-ibotta/reg2vec-learning-embeddings-for-high-cardinality-customer-registration-features-faf712f12842 heres an embedding method
Will do thank you
I have a file that looks like this
--------------Time step: 1 ---------------
Accumulated rewards: 1.5
Alpha: 660
Beta: 173
TCP_Friendliness: 1
Fast_Convergence: 1
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 2 ---------------
Accumulated rewards: 2.724744871391589
Alpha: 193
Beta: 0
TCP_Friendliness: 0
Fast_Convergence: 0
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 3 ---------------
Accumulated rewards: 3.869459113944921
I'd like to extract the time step values into an X array and the Accumulated rewards value into a Y array, I have no idea how to do that as I have 0 python experience, but this is my initial loop i've written that skips the first couple of lines that I have not included in the example(gibberish data)
with open('Tuner_result_1.txt') as f:
    for _ in range(11):
        next(f)
    for line in f:
        x = [line.split()[0] for line in lines]
        y = [line.split()[1] for line in lines]
obviously the actions inside the 2nd for are incorrect, idk how to read the lines i want properly.
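One possible way to do this, sketched with the standard library only. It assumes the file keeps the "Time step: N" and "Accumulated rewards: V" layout shown above; the function name `parse_rewards` and the in-memory sample are made up for the example.

```python
import io
import re

STEP_RE = re.compile(r"Time step:\s*(\d+)")
REWARD_RE = re.compile(r"Accumulated rewards:\s*([-+0-9.eE]+)")

def parse_rewards(lines):
    """Collect time steps into x and accumulated rewards into y."""
    x, y = [], []
    for line in lines:
        step = STEP_RE.search(line)
        if step:
            x.append(int(step.group(1)))
            continue
        reward = REWARD_RE.search(line)
        if reward:
            y.append(float(reward.group(1)))
    return x, y

# Stand-in for the real file, using lines copied from the example above.
sample = io.StringIO(
    "--------------Time step: 1 ---------------\n"
    "Accumulated rewards: 1.5\n"
    "Alpha: 660\n"
    "---------------Time step: 2 ---------------\n"
    "Accumulated rewards: 2.724744871391589\n"
)
x, y = parse_rewards(sample)
print(x, y)
```

With the real file this would be `with open('Tuner_result_1.txt') as f: x, y = parse_rewards(f)`. Lines that match neither pattern are ignored, so skipping the first 11 lines is probably unnecessary as long as the header does not contain those two phrases.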
After doing some research would you say this is correct:
The r-squared coefficient is the percentage of y-variation that the line "explained" by the line compared to how much the average y-explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you the how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average.
Yeah sounds good
Ok it took me 3 days. One question y average and SEy are the same thing which is just a line with no slope and only intercept so if the r-squared is 0.72 it's 72% better than the line.
Which is the mean line
No? SE is standard error
Oh i see
Hmm
I guess yeah its the mean squared error of an estimate that's just the sample mean
They used that in the khan academy video
Yeah that's "good enough" for now
A more practical understanding is: the better the R^2, the closer your data is to a straight line fit
In the case with 1 x and 1 y you can compute it from the correlation
So if the r-squared is 72% it means that 72% of the data is a straight line?
It means if you replace your data with that straight line it would still keep 72% of the variation of your data. Not necessarily that 72% of the data lies on that straight line.
Specifically 72% of the variation in Y
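To make the "share of variation in Y" reading concrete, here is a small pure-Python sketch with made-up numbers. It computes R^2 as 1 - SS_res / SS_tot for a least-squares line and checks that, with one x and one y, it matches the squared correlation. The function name `r_squared` is an assumption for the example.

```python
def r_squared(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)    # variation around the mean of y
    slope = sxy / sxx                              # least-squares line
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_res / ss_tot                       # share of variation the line accounts for
    corr = sxy / (sxx * ss_tot) ** 0.5             # Pearson correlation
    return r2, corr

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2, corr = r_squared(xs, ys)
print(round(r2, 4), round(corr ** 2, 4))           # the two numbers agree
```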
do yall think dropping columns with correlations > 0.70 is a good threshold
Why does it have such a high correlation
idk there are like 14/63 columns that have > 0.80 correlation
this is a cramer v correlation
so between categorical variables
idk if that matters
and they have like 100+ levels
The reason you drop high correlations is in case they are effectively duplicates of each other
So it depends on what they are
Eg SIC and NAICS codes
SIC is basically redundant if you know NAICS
But in a lot of data you have multiple SIC codes and only one NAICS
So you keep both despite high corr
(SIC and NAICS are two different business classification systems in the USA)
16 features is not too many for you to manually review
so i'd have to look at the data individually to determine if i need to drop it?
alright
If its 160 features out of 1000 then id say you need automated methods
Or if you need to dynamically retrain a model on unknown data like if you were building some kind of auto ML solution
And this is actually a great example of why auto ML solutions are a long way from perfect
Understanding the domain and business context and the meaning of the data, and applying that understanding towards solving your problem. That is the added value of a good data scientist, more so than whatever mechanical skills they have in implementing models
Look at something like alphago, they didn't just throw a big generic neural network at the problem, they built a whole solution specifically oriented around Go
alright thanks for the help
Sorry for ranting
"The r-squared coefficient is the percentage of y-variation that the line "explained" by the line compared to how much the average y-explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you the how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average."
@desert oar Can you tell me if this is a correct interpretation of this if I explain it. What r-squared is telling us, is how much closer the data points are to the line of best fit compared to the average y value line, referred to as SEy or variation in y. So if the r-squared is 0.72, it means that the line of best fit is 72% better than the average values of y mean. And the other 28% is missing due to us not including other variables. For example if we have rent and square feet. The other 28% could be in something like age of building and etc.
@desert oar again sorry for bothering you so much
72% better than the mean..... eh
Yes the other 28% could be omitted variables
It could also be natural random variation in Y with nothing to account for it
It could also be that the remaining relationship is nonlinear
It could also be that the variance of Y is not constant over the range of X so the whole model is invalid
so i am partially correct
but for the part where what % of variation in y is described by x. The percentage we get is checking with the x variables we have , the line of best fit is better than the y mean line
quick python question: right here, what exactly are the loc arguments specifying? as I understand it, ride_sharing['tire_sizes'] > 27 is specifying this column where this variable is greater than 27, but whats the second argument for?
the second 'tire_sizes'
lmao, turns out it wasnt needed after all.
no wonder i was so confused.
The conditional returns a boolean array of all rows that satisfy the condition. The second argument tire_sizes specifies which column to return.
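A small sketch of that behaviour, assuming pandas. The data and the `duration` column are made up; only `ride_sharing` and `tire_sizes` come from the question.

```python
import pandas as pd

ride_sharing = pd.DataFrame({
    "tire_sizes": [26, 27, 29, 28],
    "duration":   [10, 12, 9, 15],      # hypothetical second column
})

mask = ride_sharing["tire_sizes"] > 27                 # boolean Series, one entry per row
print(mask.tolist())                                   # [False, False, True, True]
print(ride_sharing.loc[mask, "tire_sizes"].tolist())   # [29, 28]
print(ride_sharing.loc[mask].shape)                    # (2, 2): all columns kept

# The two-argument form is what makes assignment to a subset work:
ride_sharing.loc[mask, "tire_sizes"] = 27
print(ride_sharing["tire_sizes"].tolist())             # [26, 27, 27, 27]
```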
Hi, in order to get the correlation coefficient, I performed a quadratic interpolation to fill in the missing values. However, it seems that I still have missing data.
Is that normal, or does the problem come from my code?
Thank you for your responses
Hi i am creating a chatbot and i have got struggles with my bots responses. For example if i type: "What is the weather in London" I want from the bot to go to some webpage and get the data of the weather. Is there someone who knows how to do this? Thanks a lot for your responses
Pandas read_csv() issues. Not sure where this belongs, but please help me, nothing is working here
@ripe marlin i can't read that, can you post your code and full error output as text
!code-block
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
@lapis sequoia impossible to say without seeing your code
@blazing bridge it's the % of variation in y described by our model
I performed a quick test to see if the firebase_admin package was imported
@hearty jewel
data['hello'] # select the "hello" column
data.loc['hello'] # select the row(s) with index value "hello"
data.loc['hello', 'hello'] # select the row(s) with index value "hello" and the "hello" column
@lapis sequoia please post your code as text. you are not new here, you know this
also this has nothing to do with your previous issue
any time you see "no module named X" it means you have the wrong environment active
whether that's venv or conda or whatever
that is always the solution
that, or it had an error during installation
@desert oar nothing that complicated. I'm just trying to use read_csv() to open a csv file. I'm getting a unicode error. And when I use r to declare a raw string, it says that the file doesn't exist
@ripe marlin show the error? it's probably in the file itself, not the filename
how was the file created? if it came from Excel, use encoding='windows-1252'
From excel
windows-1252 is a derivative of iso-8859-1, which encodes non-ASCII characters differently from UTF-8
so you probably hit one of those different characters
and it can't decode the bytes to text
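A tiny standard-library illustration of that failure mode, with a made-up string. As the next messages show, the error in this particular case turned out to be about the path string rather than the file contents, but this is what a real encoding mismatch looks like.

```python
text = "café, 25€"                        # characters outside ASCII
raw = text.encode("windows-1252")         # one byte per character in this encoding

try:
    raw.decode("utf-8")                   # the same bytes are not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)                            # False
print(raw.decode("windows-1252") == text) # True
```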
Wait i'll just post the code
@desert oar thanks for your feedback, I tried installing the environment through the terminal, as shown on the anaconda website https://anaconda.org/auto/python-firebase
I got this error: PackagesNotFoundError: The following packages are not available from current channels:
- python-firebase
Current channels:
- https://conda.anaconda.org/auto/osx-64
- https://conda.anaconda.org/auto/noarch
- https://repo.anaconda.com/pkgs/main/osx-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/osx-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
just use pip with your conda env activated @lapis sequoia
that's a non-standard channel
you would need to conda install -c auto python-firebase but again i dont know who or what auto is so i dont trust it
@ripe marlin try pd.read_csv(..., encoding='windows-1252')
just try it
Right
Still shows Unicode error and FileNotFoundError if I add a r for rawstring
if you are on windows you are probably writing '\\' so yes raw would fail
ok show the full error
@lapis sequoia
conda activate <my env>
pip install <package>
@desert oar instead of <my env> I should put the IDE I work on? For example Spyder?
File "<ipython-input-13-343da795340c>", line 1
df=pd.read_csv('C:\Users\dell\Downloads\py-master.zip\py-master\ML\1_linear_reg\homeprices.csv',encoding='windows-1252')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
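That SyntaxError comes from the string literal itself: in a normal Python string, `\U` starts an eight-digit unicode escape, so `'C:\Users...'` fails before pandas is even called. A short sketch of three ways to write a Windows path safely, using a shortened hypothetical path:

```python
from pathlib import PureWindowsPath

escaped = "C:\\Users\\dell\\Downloads\\homeprices.csv"   # every backslash doubled
raw     = r"C:\Users\dell\Downloads\homeprices.csv"      # raw string: backslashes kept as-is
forward = "C:/Users/dell/Downloads/homeprices.csv"       # forward slashes also work on Windows

print(escaped == raw)                                     # True
print(PureWindowsPath(forward) == PureWindowsPath(raw))   # True
```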
I did, it shows a Filenotfound error
then you have the wrong filename
you can't just get a file out of a zip
windows explorer cheats and lets you think you can
you have to unzip first
either unzip thru windows or unzip inside python
How do i unzip inside Python?
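from zipfile import ZipFile
import pandas as pd

# Read the CSV straight out of the archive without extracting it to disk.
with ZipFile(r'C:\Users\dell\Downloads\py-master.zip') as archive:
    # Paths inside a zip use forward slashes, even on Windows.
    with archive.open('py-master/ML/1_linear_reg/homeprices.csv') as fp:
        data = pd.read_csv(fp, encoding='windows-1252')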
@ripe marlin
!d g zipfile
Source code: Lib/zipfile.py
The ZIP file format is a common archive and compression standard. This module provides tools to create, read, write, append, and list a ZIP file. Any advanced use of this module will require an understanding of the format, as defined in PKZIP Application Note.
This module does not currently handle multi-disk ZIP files. It can handle ZIP files that use the ZIP64 extensions (that is ZIP files that are more than 4 GiB in size). It supports decryption of encrypted files in ZIP archives, but it currently cannot create an encrypted file. Decryption is extremely slow as it is implemented in native Python rather than C.
The module defines the following items:
how about gunzip
@desert oar yeah i found out there actually is high association between the variables providing redundant information. is correlation of 0.6 a good threshold to drop predictors?
these are all categorical?
yea
yea'
im still skeptical of discarding features based on that
.6 doesnt seem high
high collinearity is usually only a problem in extreme cases
and even then, with regularization you typically dont need to care much
even moreso in ensembled decision tree models like a random forest
which are selecting random features anyway
i was gonna drop high predictor variables that had high cardinality & high correlation lol
but this wont fix it it seems
no you need to fix the high cardinality issue
i posted a bunch of options the other day
welcome to data science
@lapis sequoia not sure what your context is and how many variables you are dealing with, if general cramers V thresholds are not desired you could consider other dimensionality reduction methods like PCA or PFA
trial and error
@sullen kiln they have several very high cardinality features that are all moderately associated with each other
like high cardinality as in 1000 distinct categories
we have suggested several options like target encoding, vector embedding, building a sparse or reduced model like a factorization machine
etc
so even if i have 2 variables with >0.80 correlation i shouldnt drop it?
you said 0.6
ik but im increasing the threshold now
i have like 10-15 variables with >0.80 correlation
what is the nature of the cardinality? what data are you dealing with?
these correlations are amongst categorical variables
they have a shitton of unique values which is why im tryna drop them
not rly correlations. cramer's v chi square stats
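For reference, a plain-Python sketch of Cramér's V computed from the chi-square statistic, without any bias correction. The function name and the toy data are assumptions for the example.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_tot = Counter(xs)
    col_tot = Counter(ys)
    chi2 = 0.0
    for r, r_count in row_tot.items():
        for c, c_count in col_tot.items():
            expected = r_count * c_count / n       # count expected under independence
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

a = ["x", "x", "y", "y", "z", "z"]
v_same = cramers_v(a, a)                                          # identical columns
v_unrelated = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])  # unrelated columns
print(round(v_same, 6), round(v_unrelated, 6))
```

The uncorrected statistic tends to be inflated when there are many levels and few rows per level, so values above 0.8 between columns with hundreds of levels in a 6000-row dataset may overstate how redundant they really are.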
there are other methods suggested in here but this is one i thought of
so i have a 6000row dataset and these categorical variables have 300-1000 unique values and correlations >0.80 or >0.90
that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.
One thing ive done is. train a univariate model with them separately and together. If you dont see big lift, discard one
Its hacky
.8 is high
so why cant i just drop some of the .8 corr variables
that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.
@sullen kiln i'll check out doing a pca
Yeah just do it tbh
reason is im running low on time and i have to turn this over soon lol
PCA is a good idea
PCA features before regression is an old school ML technique anyway
yes, start with a more parsimonious model specification, and add/subtract each variable, try to establish what has more predictive power (and what makes more theoretical sense)
I wouldn't usually go manual like this, but its not a bad thing, its just best to do a PCA as it does so much legwork
havent played around with pca much aside from learning about it in school
so i can use it to reduce cardinality then run a random forest on the pca data?
that's gonna reduce the interpretability a lot right
its very hard to quantitatively distinguish between variables of extreme cardinality, you will inevitably have bias, I would favour a proper dim reduction method over model tinkering
not necessarily, you will likely find several principal components that exhibit theoretical themes, if anything, you will be able to explain your data better
interpretability is increased
ok thank you. should i drop the highly correlated variables (>.80) then run pca?
no
just run pca on the entire dataset?
let the PCA analyse the data in its natural form, after all measures are normalized of course
PFA might be needed with categorical variables, its not very fresh in the mind
Could I ask someone for a bit of syntax help here? Simple Python/R stuff
#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)
I want add to my function here, computing another mean with the range of q_at(0.10) and q_at(0.90)
I was going to try extract the statistics and merge them back into the data, but that is not elegant and not very viable with big data
what do you mean extract
you want to join the aggregated data back to the original data?
e.g. join df1 to incdata? that's pretty much the only way to do it
also
!code-block
hhmm, ok, I was hoping for a cleaner way. As I am extracting the descriptives from incdata. I wanted to extract a conditional descriptive - the mean between the inter-decile range, without adding more steps/merges
what do you mean, "the mean between the inter-decile range"
The mean between the 10th quantile and the 90th quantile. Basically a more generous IQR
Though I don't think there's an easier way to do that than what you're saying (if I understood you right)
oh
idk what the automatic variable names for those columns will be, but
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
df1['iqr'] = df1['q75'] - df1['q25']
df1['idr'] = df1['q10'] - df1['q90']
no?
@sullen kiln is PFA = factor analysis?
PFA might be needed with categorical variables, its not very fresh in the mind
yes
yes @lapis sequoia and @desert oar , I want an IDR mean, without doing another merry-go-round and merging my statistics back into the data to re-compute an IDR mean
when I deploy this program to my data in work, i am dealing with a lot of data and limited pc memory, so I am trying to keep it concise
so does my code make sense or no
bear with
(btw this might be better to do in sqlite which is definitely probably going to be more memory efficient)
thanks for the tip, yeah, for work it might be best, I am writing the program in python and R and will do it in sql too, I just want options going forward
i think you should at least try my code though. sql is good because its the same in both python and r
so you connect to the same sqlite db and use the same query
There's a library that works similar to pandas that doesn't store the entire df in memory and streams it as needed. It's quite cool, though I forgot its name
as far as I know it has all the same methods as pandas
dask?
Yeah that sounds familiar
dask and vaex are 2 options
#The rename decorator renames the function so that the pandas agg function
#can deal with the reuse of the quantile function returned.
def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator

def q_at(y): #define a function q, for values y
    @rename(f'q{y:0.2f}') #define format renaming of new quantiles returned
    def q(TOTAL_MINUTES):
        return TOTAL_MINUTES.quantile(y)
    return q

#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1['iqr'] = df1['q0.75'] - df1['q0.25']
df1['idr'] = df1['q0.10'] - df1['q0.90']
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)
@desert oar this solution would require me to compute the values within the ranges for every row, whereas, I want to build in the adjusted means into the function
the function above exports a nice little descriptive table, df1, rather than transforming new columns in what will be a large dataset
im still not sure what you mean
oh
i just flipped the lines
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
df1['iqr'] = df1['q0.75'] - df1['q0.25']
df1['idr'] = df1['q0.10'] - df1['q0.90']
the usual caveats apply about untested code written by volunteers on the internet
df1 didnt even exist in those first 2 lines
so the code clearly wouldnt have worked as written
wait
haha
you flipped them. i did it right. scroll up.
apologies
this is the result i am after
with the IDR and IQR means also being computed in the function, without the need to merge quantiles back into the dataset
yes
what i wrote gives you what you show in the screenshot
btw you also flipped q0.10 and q0.90
oh nvm i did that
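If the goal is a mean of only the values between the 10th and 90th percentiles, computed inside the same groupby, one option is a custom aggregation function. This is a sketch with made-up data, assuming pandas; `idr_mean`, `q10` and `q90` are hypothetical helper names, and the range is taken as 90th minus 10th so it comes out positive.

```python
import pandas as pd

def q10(s: pd.Series) -> float:
    return s.quantile(0.10)

def q90(s: pd.Series) -> float:
    return s.quantile(0.90)

def idr_mean(s: pd.Series) -> float:
    """Mean of the values lying between the 10th and 90th percentiles."""
    return s[s.between(q10(s), q90(s))].mean()

incdata = pd.DataFrame({
    "INC_TYPE": ["fire"] * 10,
    "TOTAL_MINUTES": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

# Named aggregation: one output column per keyword, computed per group.
df1 = incdata.groupby("INC_TYPE")["TOTAL_MINUTES"].agg(
    mean="mean", q10=q10, q90=q90, idr_mean=idr_mean,
)
df1["idr"] = df1["q90"] - df1["q10"]      # inter-decile range
print(df1)
```

The result is still one row per incident type, so nothing has to be merged back into the original data.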
wait my data isnt ordinal so i dont think factor analysis will work
just so yall know lol
ive done PCA on one hot encoded categorical before
i read that doesnt work well
yeah its kinda sus from a theoretical perspective
you think i should do it anyways lol
id still be curious if target encoding works well
theres also this https://github.com/esafak/mca
idk how well it works for very high cardinality
theres also https://en.wikipedia.org/wiki/Random_projection
In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are known for their power, simplicity, and low error rates when compared to other methods. According to ...
hah
RIP MCA
so id have to drop the numerical variables before doing MCA~~/PCA~~?
alright im just gonna try PCA
and if that doesnt work im gonna do MCA
then MCA'ed categorical features
dont think about it too hard
youre just playing with lego at this point
^
ok thanks. but for pca all i have to do is just one hot encode then apply pca on one hot encoded + numeric data right
normalize numeric data first*
then just numbers?
aahhhhhh okkk thx
yeah otherwise PCA will likely throw off those categoricals
is it bc u dont wanna lose info from the numbers?
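A minimal sketch of the "normalize numerics, one hot encode categoricals, then PCA" recipe discussed above, using scikit-learn. The column names and data are hypothetical, and `n_components=0.90` asks PCA to keep enough components to explain roughly 90% of the variance. Whether mixing the two kinds of columns in one PCA is a good idea is debated later in the chat, so treat this as one option to try rather than the answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# hypothetical data: two numeric and two nominal columns
df = pd.DataFrame({
    "revenue": rng.normal(1000, 250, n),
    "headcount": rng.integers(1, 500, n),
    "business_unit": rng.choice(["retail", "wholesale", "online"], n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
numeric_cols = ["revenue", "headcount"]
categorical_cols = ["business_unit", "region"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),                            # normalize numerics
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one hot encode
    ],
    sparse_threshold=0.0,  # always return a dense array, which PCA expects
)

pipeline = Pipeline([
    ("prep", preprocess),
    ("pca", PCA(n_components=0.90)),  # keep ~90% of the variance
])

components = pipeline.fit_transform(df)
explained = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(components.shape, round(explained, 3))
```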
another option is feature hashing
if they have 1000s classes
you can combine all of your 15 categorical features into 1 big "multi categorical" feature
and hash it all down to 1000 buckets
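Something like the following is one way to read the feature hashing suggestion, using scikit-learn's `FeatureHasher`. The rows and column names are made up. Each row is turned into a list of `"column=value"` strings so that equal values in different columns do not collide on purpose, and everything is hashed into 1000 buckets.

```python
from sklearn.feature_extraction import FeatureHasher

# made-up rows; in practice there would be ~15 categorical columns per row
rows = [
    {"business_unit": "retail", "region": "north", "vendor": "acme"},
    {"business_unit": "online", "region": "south", "vendor": "globex"},
    {"business_unit": "retail", "region": "east", "vendor": "initech"},
]

# one big "multi categorical" feature per row: a list of column=value tokens
tokens = [[f"{col}={val}" for col, val in row.items()] for row in rows]

hasher = FeatureHasher(n_features=1000, input_type="string")
X = hasher.transform(tokens)  # sparse matrix, one row per input row

print(X.shape)
```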
alright im gonna try pca first
what about the strong correlations >0.80? is it ok to leave those columns in the pca
hey can someone help me understand this list comprehension:
counts = dict([[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet])
yuck, a list comprehension being passed to dict
dict(
    [
        [letter, sentence.count(letter)]
        for letter in set(sentence)
        if letter in alphabet
    ]
)
not sure if that helps...
seems to be Counter(sentence) with all keys that are not in alphabet filtered out
yeah sentence might be some special thing
im not going to second guess that
this is kind of amateur code
id write it like this
dict((letter, sentence.count(letter)) for letter in set(sentence) & set(alphabet))
or
dict((letter, sentence.count(letter)) for letter in set(sentence) if letter in alphabet)
at least
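To make the "Counter with non-alphabet keys filtered out" reading concrete, here is a small sketch. It assumes `sentence` is a plain string and `alphabet` is the lowercase ASCII letters; neither is shown in the original snippet.

```python
from collections import Counter
import string

sentence = "Hello, World"             # assumed: an ordinary string
alphabet = string.ascii_lowercase     # assumed: lowercase letters only

# the original comprehension: one [key, value] pair per distinct letter
counts_original = dict(
    [[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet]
)

# same result via Counter: count everything once, then keep alphabet keys
counts_counter = {
    letter: n for letter, n in Counter(sentence).items() if letter in alphabet
}

print(counts_counter)
```

The Counter version scans the sentence once instead of calling `sentence.count` for every distinct letter.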
wait
in pca
when i one hot encode
it wont have the problem where the model assumes 1>0 right
well it's not an issue right
even if it's nominal
(i.e. i have several business units i one hot encoded and will do pca on)
eh?
1 = yes, 0 = no
so after you one hot encode, 1 is greater than 0
w/out loss of generality you can flip 1 and 0 but
yea i was just making sure
ty
I need some help :(
I'm using a dataset that has location information in the first few columns, as well as a column for each date in the data, with a corresponding value
Example:
+------+-----------+-----------+--------+--------+--------+
| Org  | Org Owner | Building# | 4/1/20 | 5/1/20 | 6/1/20 |
+------+-----------+-----------+--------+--------+--------+
| OrgA | John Doe  | 1234      | $1,256 | $987   | $1,562 |
+------+-----------+-----------+--------+--------+--------+
There are a few more columns for identifying more specifics about the org that I need to retain, and there are tens of date columns, each corresponding to a specific date.
The values for these columns are the corresponding sales.
I'm trying to pivot the table to have a single row with each date, and its value, and the org information
How can I achieve that with pandas?
IE:
+------+-----------+-----------+------+-------+
| Org  | Building# | Org Owner | Date | Sales |
+------+-----------+-----------+------+-------+
pd.pivot_table
Which I'm aware of, but can't seem to find any way to get my desired results
IE: use columns [:3] as the columns, [3:] as index, and the values per row of the date columns as the values in the new column 'date'
Here's an example. Some columns are removed, and all data is false
It's like a hybrid of transpose and pivot I think..?
You can probably create the desired output by slicing the data frames into two parts
How many rows are there in the top dataframe
Currently 3
If you could make code that generates a dummy dataframe with at least 2 rows just like that, that will be super helpful
Meanwhile let me get to a laptop
sure thing, one sec
import pandas as pd
dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})
for date in dates:
    df[date] = '$1200'
df.head()
Checking now @desert oar
looks like that melt probably does the trick
god
bless
That looks like it did the trick
Thank you guys so much!
final code to help those who need it (using the sample data)
import pandas as pd
dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})
for date in dates:
    df[date] = '$1200'
# the first 3 columns stay as identifiers; every date column becomes a row
df.melt(id_vars=df.columns[:3])
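A possible follow-up sketch, using the column layout from the original question: name the melted columns `Date` and `Sales`, then turn the date strings into real dates and the `$1,256` style strings into numbers. The second org row is made up, and the `%m/%d/%y` format is an assumption about how the date headers are written.

```python
import pandas as pd

# sample data shaped like the table in the question (second row is made up)
df = pd.DataFrame({
    "Org": ["OrgA", "OrgB"],
    "Org Owner": ["John Doe", "Sal Roe"],
    "Building#": ["1234", "5678"],
    "4/1/20": ["$1,256", "$300"],
    "5/1/20": ["$987", "$450"],
    "6/1/20": ["$1,562", "$75"],
})

id_cols = ["Org", "Org Owner", "Building#"]

# wide -> long: one row per (org, date) pair
long_df = df.melt(id_vars=id_cols, var_name="Date", value_name="Sales")

# parse the former column headers as dates (assumed month/day/2-digit-year)
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

# strip "$" and "," so the sales values become floats
long_df["Sales"] = long_df["Sales"].str.replace(r"[$,]", "", regex=True).astype(float)

long_df = long_df.sort_values(["Org", "Date"]).reset_index(drop=True)
print(long_df)
```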
using pca with 90% variance reduced columns from 9000 to 420
so thats fine for a random forest right
6000 rows x 420 cols?
yeah but you might need a large number of trees to get a good sample of columns
depending on tree depth
oh shoot. i used the default of 100 in my previous models (which had ~60-80 columns)
im running the random forest right now with 100 trees. i'll change it
100 is (probably) too low even for 80 columns
you should always start with like 10 though
should i use a grid search?
is there a rule of thumb for number of trees you should use
i looked it up and people said 64-128 but obv it depends on data size
LOL wtf i got an 18% R-squared
my baseline got like 35
im gonna try 200 trees
id look at tree depth first
at 100
if its a fast model to train you probably can just grid search and go get a cup of coffee
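A small sketch of the "just grid search it" idea with scikit-learn, searching over the number of trees and tree depth together. The data is synthetic and the grid values are arbitrary placeholders, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic regression data standing in for the real table
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# placeholder grid: number of trees and maximum tree depth
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [5, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,            # 3-fold cross-validation per combination
    scoring="r2",    # same metric being compared in the chat
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```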
lol yeah im gonna play around with the parameters a bit
any ideas as to why PCA with 90% variance explained yielded such shitty performance?
maybe the resulting features are junk
or bc i used it after one hot encoding?
yeah the principal components?
i mean they look like principal components to me lol
ill try pca after playin around with parameters
yeah im gonna do MCA now
anyone here working on object detection using yolov5??
Hello guys, I'm new to this group and learning data science
I'm getting ready to slam my head into a wall.
I found a cython implementation for a KD tree, and I made a subclass of np.ndarray that contains a reference to what each ndarray represents
but when the kd tree is constructed, it gets rid of that wrapping
the pure python kd tree is too slow.
but when the kd tree is constructed, it gets rid of that wrapping
what do you mean
and show your code as usual
warning: it's terrible code
its cython, nobody writes good cython code
sorry salt rock quick question -- the problem cant be because i didnt normalize the numerical vars right?
it absolutely can be
because i normalized the one hot encoded and not numerics
should i use pca with numerics as well or move onto mca?
dont mix them in the pca
i just want to understand why we handled them separately
the numerics vs categoricals
This part is pure Python.
import numpy as np
import scipy as sp
import scipy.spatial
import torch
from gensim.models import KeyedVectors

class Vocab(np.ndarray):
    # ndarray subclass so each vector can carry a `token` attribute
    pass

def create_tensor(array, token, padding):
    # pad the embedding with `padding` zeros (np.float was removed from NumPy, use np.float64)
    many_zeroes = np.zeros((padding,), np.float64)
    tensor = np.concatenate((array, many_zeroes))
    tensor = tensor.view(Vocab)
    tensor.token = token
    return tensor

cuis = KeyedVectors.load_word2vec_format('/home/farnsworthsw/datasets/s1975.cui.200.bin', binary=True)
print('make tree')
vocab = [create_tensor(cuis[k], k, 568) for k in cuis.vocab.keys()]
print('vocab made')
tree = sp.spatial.cKDTree(vocab)
print('tree made')

def learn(mention: str) -> str:
    tensor = torch.cuda.LongTensor(tokenizer.encode(mention)).unsqueeze(0)
    bert_output = model(tensor)[0][0]
    bert_output = bert_output.cpu().detach().numpy()
    best = tree.query(bert_output)
    print(vocab_lookup[best[0]])
    return best
you could try it i guess. im thinking that it will be weird because the variance of a numerical variable might or might not be on par with the variances of a bunch of one hot columns @lapis sequoia
just seems incongruous
alright thank you
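A quick numeric illustration of the variance mismatch mentioned above, on simulated data. A standardized numeric column has variance 1 by construction, while a 0/1 one hot column that is "on" with proportion p has variance p(1 - p), which is at most 0.25 and much smaller for rare categories. The 5% frequency below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6000

# simulated numeric column, standardized to mean 0 and variance 1
numeric = rng.normal(loc=50, scale=10, size=n)
standardized = (numeric - numeric.mean()) / numeric.std()

# simulated one hot column for a category present in about 5% of rows
one_hot = (rng.random(n) < 0.05).astype(float)

print("standardized numeric variance:", standardized.var())
print("one hot column variance:      ", one_hot.var())
```

Since PCA chases variance, columns on such different scales will not be weighted evenly, which is one reason to treat the two groups separately.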
oh i thought you wrote the kd yourself @serene scaffold
no
C isn't flexible like that. presumably it's just stripping out the data into some kind of lower level buffer data structure thing
ahhh
wait
you didnt read the docs
it returns a tuple
1st elem, the distances to the nearest neighbors
2nd element, the integer indexes of the vectors
unless it's shuffling the order of the data in which case the whole thing is kind of useless anyway
another possibility is to use a dict where the key is the vector
because hashing magic
so have a separate lookup table for what the vector at each index represents
and look that up?
yeah i think youd have to
that or just a list/array
so you can look them up by position
from the 2nd returned element
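A minimal sketch of the lookup-by-position idea with SciPy's `cKDTree`: keep the tokens in a plain list in the same order as the vectors, and use the index returned by `query` to find the token. The tokens and vectors here are made up. `query` returns a `(distance, index)` pair, and the index refers to the row position in the array the tree was built from.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# made-up vocabulary: tokens[i] is the label for vectors[i]
tokens = ["C001", "C002", "C003", "C004"]
vectors = rng.normal(size=(4, 8))

# the tree only stores raw numbers, so labels live in the separate list
tree = cKDTree(vectors)

# query with a point very close to the third vector
query = vectors[2] + 0.001
distance, index = tree.query(query)

nearest_token = tokens[index]
print(nearest_token, distance)
```

This avoids the ndarray subclass entirely, since the tree never needs to know what each vector represents.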