#data-science-and-ml | Python | Page 229

flint root Jun 18, 2020, 3:11 PM

#

Hi. Does anyone know how I can use compression algorithms in data science?

#

I would like to have a few examples 🙂

lapis sequoia Jun 18, 2020, 3:39 PM

#

how can i impute categorical missing values using knn?

ripe forge Jun 18, 2020, 3:59 PM

#

Oh dear, sounds fancy. If you have enough data points, dropping rows might be easier

lapis sequoia Jun 18, 2020, 3:59 PM

#

i only have 6000 rows

#

and some columns are missing 50% of the data

#

actually only one column is

ripe forge Jun 18, 2020, 3:59 PM

#

If a column is missing that much, drop the column

#

Anyways just for the general idea though, the column which you want to impute, treat it like target column for a knn

#

Train the knn using the rows for which this column in question has values present. Predict the values for rows where this column (which is now the target for the knn) is missing.

#

But if you have a lot of values missing, you can't blindly run knn on it. I personally don't think you should be trying to impute any column with more than 50% data missing at all. Drop the column, or figure out some different style of handling the missing values if your data understanding offers some way

#

A column like that should probably be dropped honestly
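
The "treat the column as the target for a knn" idea above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy rows, the `knn_impute` helper and the Hamming distance are all made up for the example, and in practice a library classifier such as scikit-learn's `KNeighborsClassifier` on encoded features would be the more usual choice.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two lists of categorical features differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_impute(rows, target, k=3):
    """Fill missing (None) values of `target` by majority vote among the
    k nearest rows that do have a value for that column."""
    def features(row):
        # Every column except the one being imputed, in a fixed key order.
        return [v for key, v in sorted(row.items()) if key != target]

    known = [r for r in rows if r[target] is not None]  # "training" rows
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        nearest = sorted(known, key=lambda r: hamming(features(r), features(row)))[:k]
        vote = Counter(r[target] for r in nearest).most_common(1)[0][0]
        filled.append({**row, target: vote})
    return filled

# Hypothetical toy data: "shape" is the column with missing values.
rows = [
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "red", "size": "M", "shape": "round"},
    {"color": "blue", "size": "L", "shape": "square"},
    {"color": "blue", "size": "L", "shape": "square"},
    {"color": "red", "size": "S", "shape": None},
    {"color": "blue", "size": "L", "shape": None},
]
imputed = knn_impute(rows, target="shape", k=3)
print(imputed[5]["shape"], imputed[6]["shape"])  # round square
```

Imputing many columns this way means one such pass per column, each trained only on the rows where that column is present.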

marsh fog Jun 18, 2020, 4:07 PM

#

Is anyone here familiar with plotly? Specifically buttons / dropdowns ?

lapis sequoia Jun 18, 2020, 4:32 PM

#

alright thank you @ripe forge

#

what do you think is the missing data threshold proportion for a column to be dropped

#

30%? 50%?

#

i already dropped all columns missing more than 60% of values

ripe forge Jun 18, 2020, 4:33 PM

#

Tough to say, but that's a good question to ask

#

Sadly the answer is : it depends.

lapis sequoia Jun 18, 2020, 4:33 PM

#

alright, im just gonna look at the distribution of missing values in my data, and see if there's a falloff at a particular point

ripe forge Jun 18, 2020, 4:34 PM

#

If you can sensibly fill or aggregate some of the missing values, it becomes easier to manage

lapis sequoia Jun 18, 2020, 4:34 PM

#

also i probably need to impute like 50 columns lol. would i have to run 50 knns?

#

if theres no easier way than that im just gonna replace with the mode^

ripe forge Jun 18, 2020, 4:34 PM

#

You know, I'm not a 100% sure on that one.

lapis sequoia Jun 18, 2020, 4:34 PM

#

alright

#

well for now im just gonna replace with mode since that's a lot easier. i'll look into knn replacement later

#

thank you

ripe forge Jun 18, 2020, 4:35 PM

#

Sounds good, cheers

tranquil falcon Jun 18, 2020, 5:18 PM

#

@lapis sequoia im more of a stats guy than programming guy, but about missing data -- it really depends on why the data is missing. MCAR vs MAR vs MNAR etc https://en.wikipedia.org/wiki/Missing_data . id personally try to figure out why the data is missing first more than anything. if speed is a concern, just drop it unless you have domain knowledge saying that column really matters... and if it does really matter... then you have a bigger problem than DS if your dataset is missing a critical component 🙂

Missing data

lapis sequoia Jun 18, 2020, 5:27 PM

#

thank you

desert oar Jun 18, 2020, 5:40 PM

#

^ yep great point. i was remiss in not mentioning MCAR,MAR,MNAR

#

some imputation methods only make sense for certain types of missingness

fleet rose Jun 18, 2020, 5:52 PM

#

Hi, is this chat for neural networks?

#

Well, it says machine learning

#

So, I'm looking forward to learning more about it

lapis sequoia Jun 18, 2020, 6:12 PM

#

if i want to see association between categorical variables, what are my best options? i was thinking use a contingency table to see frequency distributions

#

obviously i cant use pearson correlation

#

and what if my variables have high cardinality (i.e. categorical level with 200+ levels) and i want to see association

desert oar Jun 18, 2020, 6:14 PM

#

mutual information

#

or chi square

#

mutual information has better properties imo but chi square is a lot faster to compute

#

the built-in scikit-learn MI is single-threaded and very slow

#

you can also one-hot-encode and do stuff like hamming or jaccard similarity. depending on the type of feature
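
For reference, both measures can be computed straight from the contingency counts. The sketch below is plain Python with a made-up helper name (`chi_square_and_mi`) and toy data; it returns the chi-square statistic only (no p-value) and mutual information in nats. Library versions such as `scipy.stats.chi2_contingency` and `sklearn.metrics.mutual_info_score` are the usual route on real data.

```python
import math
from collections import Counter

def chi_square_and_mi(xs, ys):
    """Chi-square statistic and mutual information (in nats) for two
    equally long lists of categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # contingency table as {(x, y): count}
    count_x = Counter(xs)          # row totals
    count_y = Counter(ys)          # column totals
    chi2 = 0.0
    mi = 0.0
    for x in count_x:
        for y in count_y:
            observed = joint.get((x, y), 0)
            expected = count_x[x] * count_y[y] / n  # count if independent
            chi2 += (observed - expected) ** 2 / expected
            if observed:
                mi += (observed / n) * math.log(observed * n / (count_x[x] * count_y[y]))
    return chi2, mi

# Toy data: ys_linked is fully determined by xs, ys_indep is unrelated.
xs = ["a", "a", "b", "b"]
ys_linked = ["u", "u", "v", "v"]
ys_indep = ["u", "v", "u", "v"]

print(chi_square_and_mi(xs, ys_linked))  # chi2 = 4.0, MI = ln(2)
print(chi_square_and_mi(xs, ys_indep))   # (0.0, 0.0)
```

With 200+ levels the table gets sparse, so many expected counts become tiny and the chi-square statistic is less trustworthy.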

lapis sequoia Jun 18, 2020, 6:17 PM

#

alright thanks. ill look up mutual information

marsh chasm Jun 18, 2020, 7:27 PM

#

has anyone worked with pandas' series before

#

theres a part of the documentation that's worded a little bit funny

desert oar Jun 18, 2020, 7:37 PM

#

Yes

marsh chasm Jun 18, 2020, 8:00 PM

#

wait nvm i think i got it...

blazing bridge Jun 18, 2020, 8:01 PM

#

"R² is the percentage variation in y explained by all the x variables together.

For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 — that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."

#

What does variation in y mean in simple terms

lapis sequoia Jun 18, 2020, 8:36 PM

#

anyone know why the below line gives me an error? ws.range('A2').value = results

#

import pyodbc
import xlwings as xw

wb = xw.Book('Book1.xlsx')
ws = wb.sheets['Sheet1']

def read(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM people")
    results = cursor.fetchall()
    print(results) # prints a list of tuples, each tuple is a record
    
    # works fine
    ws.range('A2').value = [(1, 'joe', 'emmets', 23),
                            (2, 'sally', 'jacobs', 37),
                            (3, 'katie', 'falcone', 42)]
    
    # does not work - why?
    ws.range('A2').value = results

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 11.0};"
    "Server=(LocalDB)\MSSQLLocalDB;"
    "Database=testdb;"
    "Trusted_Connection=yes;"
)

read(conn)
conn.close()

#

😦

tranquil falcon Jun 18, 2020, 8:47 PM

#

@blazing bridge r-squared ranges from 0 to 1. higher r-squared typically means better. a cynical and basic way of looking at it is it just tells you how good or bad a line fits to your points. compare these plots

📎 Good-and-bad-fits2.png

#

in your 0.72 example, it means that 28% of the variance is unaccounted for. this is just fancy stats speak for saying "hey, about 28% of the variation in rent can't be explained using the variables we have, so thats why our predictions might be off if we were to use the parameters from this model"

turbid hazel Jun 18, 2020, 8:58 PM

#

anyone know why this isn't working in Jupyter Notebooks?

📎 Screen_Shot_2020-06-18_at_1.52.34_PM.png

tranquil falcon Jun 18, 2020, 8:59 PM

#

@blazing bridge if it helps you wrap your head around it conceptually... for a simple regression with one x variable, r-squared is literally just the square of the data's correlation. but correlation can range from -1 to 1 while r-squared is never negative (0 to 1)

uncut shadow Jun 18, 2020, 8:59 PM

#

@turbid hazel You should provide errors If you have one, or some content

blazing bridge Jun 18, 2020, 9:03 PM

#

so @tranquil falcon this basically means the line of best fit is only 72% accurate and the rest is not accurate with our x values. So 28% is inaccurate, therefore the variation in y is 28%.

#

is this correct

brazen cloud Jun 18, 2020, 9:05 PM

#

hi there everyone! in my code i have an error
```python
import numpy as np
import matplotlib.pyplot as plt

incomes = np.random.normal(100.0, 50.0, 10000)
incomes.append(900)

plt.hist(incomes, 50)
plt.show()
```
the error is
```
AttributeError Traceback (most recent call last)
<ipython-input-4-9b18b9dd3278> in <module>
      4
      5 incomes = np.random.normal(100.0, 50.0, 10000)
----> 6 incomes.append(900)
      7
      8 plt.hist(incomes, 50)

AttributeError: 'numpy.ndarray' object has no attribute 'append'
```

#

it says numpy doesnt have an append attribute?

#

how can i add a data point to my array

lapis sequoia Jun 18, 2020, 9:06 PM

#

Anyone know how to append values to a column?

📎 image0.png

#

In a dataframe

turbid hazel Jun 18, 2020, 9:07 PM

#

@uncut shadow for some reason literally nothing is outputted when i run it. no errors, nothing

tranquil falcon Jun 18, 2020, 9:13 PM

#

@blazing bridge not quite. i think conceptually youre getting there but im not sure id use the word 'accurate' here. when we're talking about variance, which is what r-squared is trying to describe for a linear model, i think of it more about fuzziness in our actual guesses / predictions

like if we had a model to guess where a ball will land, if we had a model with 0.72 r-squared vs a model that is 0.95 r-squared, both COULD be accurate in a real-world scenario, but the 0.95 r-squared one would be more likely to have better results and less error on where the ball would land

#

@blazing bridge but long story short, you could make an argument that a model with 0.72 r-squared is probably more likely to be more accurate than a model with 0.65 r-squared, but that's not necessarily a fact. thats why i wouldnt use the words accurate vs inaccurate here because usually 'accurate' has to do with predictions while r-squared really just has to do with explaining variance

blazing bridge Jun 18, 2020, 9:25 PM

#

@tranquil falcon Thank you for clearing it up, one more question what does variance mean

#

is that the error that the line produces

#

like in simple linear regression we use the Squared error to see how well our line has fit the data

#

in multiple linear regression is r-squared the equivalent of the loss with two or more variabels

#

*variables

tranquil falcon Jun 18, 2020, 9:42 PM

#

@livid turret no prob. variance is basically the center of the universe for statistics -- its really just a fancy word for how much variety/range there is in your data set. high variance = wider range of values low variance = smaller range of values.

so you mentioned error -- when you calculate error you usually have standard deviation somewhere in your calculation. standard deviation is literally the square root of variance. so higher variance = higher standard deviation which would also mean higher error. but bear in mind higher calculated error doesnt necessarily mean less accurate (it all depends on context!!), it just means a wider range of expected values.

and to answer your question if im interpreting it correctly, then yes conceptually r-squared is showing the equivalent of the incorrect guesses. r-squared, mathematically underneath the hood, is actually calculating how far each point is from your best fitted line and aggregating that data. so if you have a lot of points that are really far away from the line that would indicate maybe the line, even if its the best fitted line, isnt taking into account other factors (thus the whole it explains x% of variance) that are relevant to the real world

#

but bear in mind r-squared is really only used for linear models. and there are many, many, many, many real-world scenarios where a linear model just doesn't work or doesnt make sense 🙂

#

(bear in mind a true academic statistician would be murdering me for some of the things ive said but im just trying to explain things conceptually and how they relate)
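
To make the "how far each point is from the best fitted line" description concrete, here is a small plain-Python sketch. The data points and helper names are invented for illustration. It computes R² as 1 minus (squared distances to the line) / (squared distances to the mean of y), and checks that with a single x variable this matches the squared correlation.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single x variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """1 - (squared distances to the line) / (squared distances to the mean)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson_r(xs, ys):
    """Pearson correlation, which can be negative."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]
ys_up = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up, rising with x
ys_down = [10.1, 7.8, 6.2, 3.9, 2.1]  # same values, falling with x

for ys in (ys_up, ys_down):
    print(round(pearson_r(xs, ys), 4), round(r_squared(xs, ys), 4))
```

The correlation flips sign between the two series while R² stays the same positive number, which is the -1 to 1 versus 0 to 1 distinction mentioned earlier.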

icy flax Jun 18, 2020, 11:37 PM

#

not sure if this is the spot for this, but im doing some plotly data visualizations and running into the issue of a slider I implemented not being displayed at all. Is anyone familiar with plotly / networkX and would be willing to assist me for a bit? Thanks!

blazing bridge Jun 19, 2020, 12:25 AM

#

ok thank you so much @tranquil falcon

paper niche Jun 19, 2020, 12:44 AM

#

the data that is being passed into your cum_freq_fn isn't a column name, it's the element(s) themselves, that's why you're getting a KeyError

#

It's probably worth noting, there's already a method of pandas series to calculate the quantile value

desert oar Jun 19, 2020, 12:46 AM

#

^

lapis sequoia Jun 19, 2020, 12:53 AM

#

yep! salt rock explained it to me, thank you both for helping out though

flat quest Jun 19, 2020, 1:51 AM

#

btw just a tip for ML.

Might be good to have a baseline knowledge of statistics @blazing bridge
That should help you better understand how ML models actually work, and what concepts like std, gaussian, variance, etc. are.

blazing bridge Jun 19, 2020, 2:04 AM

#

yeah @flat quest I was thinking about that but I dont know where to start

#

Im only in grade 10 so i dont know what would make sense for me

flat quest Jun 19, 2020, 2:08 AM

#

ah nice

yeah best place to start is prob something like khanacademy and then go to like opencourseware. @blazing bridge . YT is also a pretty good place to look at

blazing bridge Jun 19, 2020, 2:09 AM

#

ok thank you @flat quest what would you say variance is in simple terms

flat quest Jun 19, 2020, 2:11 AM

#

well mathematically variance is the square of the standard deviation

like zammy said, variance is a measure of spread of your data. How much distance or change in numeric value is between your data points.

Are they clustered around certain values? thats small variance or spread.
If they're widely spread out, you have high spread or variance.

blazing bridge Jun 19, 2020, 2:15 AM

#

"For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 — that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."

#

could you explain what they mean over here

desert oar Jun 19, 2020, 2:43 AM

#

I like to think of the variance as the distance between your dataset and a dataset where every data point is equal to the mean

#

Or something not unlike the average distance from the mean although the whole squared part overweights points that are farther away vs a true mean
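
The "average squared distance from the mean" reading is easy to check with the standard library. The two small datasets below are made up so that both have mean 10; `mean_squared_distance` is a hypothetical helper written out by hand to compare against `statistics.pvariance`.

```python
from statistics import fmean, pstdev, pvariance

def mean_squared_distance(values):
    """Average of the squared distances between each value and the mean."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])

clustered = [9, 10, 10, 11]    # values close to the mean of 10
spread_out = [0, 5, 15, 20]    # same mean, much wider spread

print(mean_squared_distance(clustered))   # 0.5
print(mean_squared_distance(spread_out))  # 62.5

# The hand-written version agrees with the library's population variance,
# and the standard deviation is its square root.
print(pvariance(spread_out), pstdev(spread_out) ** 2)
```

Squaring is what makes far-away points count more: the value 20 contributes 100 to the sum while 15 contributes only 25.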

lapis sequoia Jun 19, 2020, 4:50 AM

#

Hello everyone,
I have just started with data science.
I am thorough with the basics of pandas and matplotlib, also for web scraping I know selenium, also beautiful Soup . Please advise me on what to do next.

slate scroll Jun 19, 2020, 4:56 AM

#

@lapis sequoia That depends on what areas of data science you're interested in. But I'd encourage you to also take on some real world projects. If you're through the basics of pandas/matplotlib and BS4 there's plenty of opportunities to visualize and analyze scraped data.

lapis sequoia Jun 19, 2020, 4:59 AM

#

I just used them to make a project,
So basically what it does it looks up for the input-ed hashtags on instagram and scrapes all the post's hashtags
Further on I have added some keywords
The program filters all the hashtags with the given keywords and presents them to me in sets of 25 in a word file

slate scroll Jun 19, 2020, 5:00 AM

#

So where's the data science? That sounds like pure scraping.

lapis sequoia Jun 19, 2020, 5:01 AM

#

Basic pandas

#

I dont know much about data science, just wanna know how to get into it

slate scroll Jun 19, 2020, 5:04 AM

#

Well there's tons of resources like: https://towardsdatascience.com/the-ultimate-guide-to-getting-started-in-data-science-234149684ef7

Data science encompasses a lot of things (as mentioned in that article) if you could narrow it down that would be helpful.

Medium

The Ultimate Guide to Getting Started in Data Science

How I started getting data science job offers in under 6 months

untold anvil Jun 19, 2020, 5:11 AM

#

@brazen cloud Hi you can use the tolist() method to convert your array into a list, then use append on the list to add your data point, and then convert it back to an array

lapis sequoia Jun 19, 2020, 5:11 AM

#

What do you think I should take up next? @slate scroll

untold anvil Jun 19, 2020, 5:17 AM

#

@brazen cloud or you can use the np.append function. It's easier to do it this way
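
A short sketch of the `np.append` route for the incomes example, leaving the plotting out. The point to watch is that `np.append` returns a new array instead of changing the original in place, so the result has to be assigned back.

```python
import numpy as np

incomes = np.random.normal(100.0, 50.0, 10000)

# np.append does not modify `incomes`; it builds and returns a new array.
incomes = np.append(incomes, 900.0)

print(incomes.shape)  # (10001,)
print(incomes[-1])    # 900.0
```

Because each call copies the whole array, appending inside a long loop gets slow; collecting values in a Python list and converting once at the end is usually cheaper.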

slate scroll Jun 19, 2020, 5:20 AM

#

@lapis sequoia That really depends on what kind of career you're interested in. With some basic knowledge under your belt, what parts of it did you most enjoy?

lapis sequoia Jun 19, 2020, 5:20 AM

#

Oh I really enjoy automating with selenium and pulling off data from internet

#

I just dont know what's next @slate scroll

slate scroll Jun 19, 2020, 5:21 AM

#

If data collection and augmentation is what you enjoy then data engineering would be a great path.

#

For most real-world examples that'll include things like streaming data, Kafka, and big data tools like hadoop, hive, hbase, bigquery, etc...

#

SQL would be a great next step towards that world too

lapis sequoia Jun 19, 2020, 5:23 AM

#

So what is data engineering all about?

slate scroll Jun 19, 2020, 5:24 AM

#

Well data engineering is all about collecting data and augmenting it with other data sources. Think about a website (my work so an easy example). They're constantly collecting tons of data on how their users are interacting. The data engineers are responsible for turning that stream of interactions into nice tables that can be used for reporting and modeling.

lapis sequoia Jun 19, 2020, 5:25 AM

#

I am also interested in AI/ML (at least wanna give it a try), it has always fascinated me

slate scroll Jun 19, 2020, 5:25 AM

#

AI/ML is also a wide field, there's R&D and algorithm development (traditional Data Science) and then there's ML engineering or applied ML. Basically turning ML into products and affecting consumers

lapis sequoia Jun 19, 2020, 5:26 AM

#

Ohh so is data engg. About data representation for making decisions?

#

I dont have good idea of where to go, I just wanna start with basics and let's see what I like on the way

slate scroll Jun 19, 2020, 5:27 AM

#

Yeah data engineering can be about how to collect and represent data. so let's say my website adds a new feature and the execs want to know how many people click on it. The data engineers will work with tracking engineers on how to collect that data and transform it into reportable data locations

#

Not sure if you have a blog or website or anything but if you want to play with some streaming data I have a demo on my Github: https://github.com/Raab70/serverless-streaming-web-analytics

Cloud stuff is also always a great addition to any resume. Most cloud providers have some amount of resources you can use for free it just depends on which one you use.

lapis sequoia Jun 19, 2020, 5:30 AM

#

I am just a beginner

#

Starting from scratch

slate scroll Jun 19, 2020, 5:31 AM

#

Sure so that should be plenty of directions to head in, cloud computing, and SQL would be great starting off points that will be applicable to any data science role.

lapis sequoia Jun 19, 2020, 5:32 AM

#

ok , i will start with sql soon

slate scroll Jun 19, 2020, 5:32 AM

#

Learning Docker is a great addition too

lapis sequoia Jun 19, 2020, 5:33 AM

#

so whats the use of sql, heard it is for programming databases

slate scroll Jun 19, 2020, 5:33 AM

#

but I don't want to throw too much your way

#

SQL is a query language for relational databases

lapis sequoia Jun 19, 2020, 5:33 AM

#

mySql was something i read of

slate scroll Jun 19, 2020, 5:34 AM

#

MySQL is a type of database which is relational and therefore uses SQL. Postgres would be another. Those are both OLTP (online transaction processing) databases.

#

There are also OLAP (online analytical processing) databases, like bigquery or redshift. But they're all based on SQL

lapis sequoia Jun 19, 2020, 5:35 AM

#

how will sql be useful in a datascience career?

slate scroll Jun 19, 2020, 5:35 AM

#

Well in data science you'll need to be collecting data from different sources, most of them will be SQL.

lapis sequoia Jun 19, 2020, 5:35 AM

#

ok so if i wanna learn sql, how shall i go about it on python?

slate scroll Jun 19, 2020, 5:37 AM

#

Well you can get into more topics like ORMs, which are basically a mapping of SQL into python objects. But a knowledge of base sql is super useful. Maybe try just adding it into your existing dataset

#

So make your scraper work for multiple runs. How would it work if you ignored posts that were already seen

#

Store them to a DB instead of just pandas
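
One way to try this without installing anything is Python's built-in `sqlite3` module. The sketch below is only an assumption about what the scraper's data might look like: the `posts` table, its columns and the sample rows are invented. A primary key on the post id plus `INSERT OR IGNORE` makes a second run skip posts that were already stored.

```python
import sqlite3

# In-memory database for the demo; a file path would persist between runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS posts (
        post_id  TEXT PRIMARY KEY,  -- uniqueness is what enables the skip
        hashtags TEXT NOT NULL
    )
    """
)

def count_posts(conn):
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

def save_posts(conn, posts):
    """Insert scraped posts, skipping ids that are already in the table.
    Returns how many rows were actually new."""
    before = count_posts(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, hashtags) VALUES (?, ?)", posts
    )
    conn.commit()
    return count_posts(conn) - before

first_run = [("p1", "#python #pandas"), ("p2", "#ml")]
second_run = [("p2", "#ml"), ("p3", "#sql")]  # p2 was already seen

new_first = save_posts(conn, first_run)
new_second = save_posts(conn, second_run)
print(new_first, new_second, count_posts(conn))  # 2 1 3
```

The `?` placeholders let the driver handle quoting, which is safer than building the SQL string by hand.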

lapis sequoia Jun 19, 2020, 5:37 AM

#

nah, it would re-scrape it

#

ok will keep that in mind

#

will research more about sql

slate scroll Jun 19, 2020, 5:38 AM

#

It is essential for any data science role

lapis sequoia Jun 19, 2020, 5:43 AM

#

thankyou for your time rob

blazing bridge Jun 19, 2020, 6:44 AM

#

variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean. The goal is to have a value that is low. What low means is quantified by the r2 score (explained below).

#

is this a correct definition

#

ok so what is the difference between sum of squared residuals and r-squared

earnest meteor Jun 19, 2020, 7:36 AM

#

Hi, has anyone used the Yoga-82 dataset?

#

or does anyone know of public domain datasets related to yoga?

sacred sail Jun 19, 2020, 8:05 AM

#

Is anybody available to help? I need some help figuring out how to import an Excel file into python and then combine a few of the columns and run some simple calculations. Just a little rusty and on a bit of a time crunch

earnest meteor Jun 19, 2020, 10:07 AM

#

install openpyxl
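
With openpyxl installed, pandas can read the workbook directly. A rough sketch follows, where the file name, the column names and the sample rows are all hypothetical stand-ins for whatever the real sheet contains. The loading step is wrapped in a function so the combining logic can be tried on a small in-memory frame first.

```python
import pandas as pd

def load_sheet(path):
    """Read the first sheet of an .xlsx file (pandas uses openpyxl for this)."""
    return pd.read_excel(path)

def add_derived_columns(df):
    """Combine two text columns and compute a simple total.
    Column names here are assumptions for the example."""
    out = df.copy()
    out["full_name"] = out["first"] + " " + out["last"]
    out["total"] = out["price"] * out["quantity"]
    return out

# Stand-in for load_sheet("orders.xlsx"), with made-up rows.
df = pd.DataFrame(
    {
        "first": ["Ada", "Alan"],
        "last": ["Lovelace", "Turing"],
        "price": [10, 3],
        "quantity": [2, 5],
    }
)
result = add_derived_columns(df)
print(result[["full_name", "total"]])
```

Calling `result.to_excel("out.xlsx", index=False)` would write the combined table back out, again relying on openpyxl.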

gritty vapor Jun 19, 2020, 10:31 AM

#

Does anyone know if it's possible, without using TensorFlow and only with numpy and opencv, to transform different images into arrays, classify them and compare them using sklearn?

wicked bloom Jun 19, 2020, 11:04 AM

#

Hey, is anyone familiar with the method of turning a linear program into standard form, and is willing to walk through an example with me?

desert oar Jun 19, 2020, 11:39 AM

#

@gritty vapor yes but it's not a beginner level task by any means

#

Oh sklearn. Mayyybe

#

Image classification existed before deep learning

#

So yes you can go back and dig up all the old literature on creating features for classifying images with SVMs

#

I don't see why you would want to do that

gritty vapor Jun 19, 2020, 11:42 AM

#

I would like to take a lot of pictures of brain tumors, and when I give one to the program it tells me if it's a tumor or not, without 500 lines of code etc

barren topaz Jun 19, 2020, 11:50 AM

#

is anyone aware of a simple method or something similar to make a seaborn heatmap axis show all of its labels when zooming in? I have a 67 by 10 heatmap, and want to be able to see the labels of any part of the heatmap that I zoom in on. Unfortunately, it does not automatically do that. any suggestions?

#

here is what the heatmap looks like

📎 Heatmap.png

desert oar Jun 19, 2020, 11:52 AM

#

@gritty vapor Image classification is a complicated difficult task, you're asking too much to not have to at least write some code in order to get your complicated difficult task done using modern tools

#

Maybe yolo object detection can work on brain tumors

#

I would be much less concerned about lines of code than actually training and validating a good model

#

I think that's not a good attitude at all

gritty vapor Jun 19, 2020, 11:53 AM

#

Okay thanks

desert oar Jun 19, 2020, 11:54 AM

#

@barren topaz you would need some kind of fancy dynamic graph with a slider or something like that

#

I don't think seaborn has that ability by itself

#

Seems like something you can do with D3.js, maybe you can do it with Bokeh

barren topaz Jun 19, 2020, 11:57 AM

#

@desert oar awesome, thanks! I'll try those out. If worst comes to worst, I'll just split it into 3 or 4 different subplots.

desert oar Jun 19, 2020, 11:58 AM

#

Try Bokeh first

#

Ping me if you get it to work

mellow creek Jun 19, 2020, 12:33 PM

#

https://gist.github.com/sPyOpenSource/5fe6e962da56a1a6218d17a490325173#file-forecast-py

Gist

short programs

barren topaz Jun 19, 2020, 1:12 PM

#

I can't get it working. I've never used bokeh before, so I'm learning it now. will probably take a while to figure out. Thanks for pointing me in the right direction though!

rich silo Jun 19, 2020, 2:15 PM

#

Looking for some help with datatables (Dash) and callbacks in Dash. Basically i want to auto generate an editable Datatable and after editing it, click on a button to update some other tables.
Anybody can help?

#

I have got the first part working but i cant call the data from the DataTable to update the rest afterwards

serene scaffold Jun 19, 2020, 3:17 PM

#

Anyone know how to reshape a Tensor while keeping it on the same GPU?

lapis sequoia Jun 19, 2020, 3:40 PM

#

@desert oar i had a followup question to yesterday

#

📎 image0.jpg

#

do you know why the code works for an individual column but not on the entire dataframe in my for loop?

#

let me know if you want me to type the code out

#

this is the function that captures 80% of the cumulative values that we discussed yesterday

desert oar Jun 19, 2020, 3:43 PM

#

can you post your code as code and not a screenshot

lapis sequoia Jun 19, 2020, 3:43 PM

#

yea. sorry i code & use discord on different computers

desert oar Jun 19, 2020, 3:44 PM

#

what's this frac_cum[0] thing

#

try frac_cum.iloc[0]

lapis sequoia Jun 19, 2020, 3:45 PM

#

def cum_freq(df, colname, threshold=0.80):
    x = df[colname]
    fractions = x.value_counts() / len(x)
    frac_cum = fractions.cumsum().sort_values()
    if frac_cum[0] > 0.8:
        return frac_cum.index[0]
    else:
        return frac_cum.loc[frac_cum <= 0.8].index.tolist()

#

for colname in df_new.columns:
    cum_freq(df_new, colname)

#

@desert oar because for some columns the first value's fraction is like 0.85 and then all the values after are like 0.01

#

ill try iloc

#

ah iloc worked

#

why did i have to use iloc though? why wouldnt it work without it

desert oar Jun 19, 2020, 3:48 PM

#

it's... complicated

#

.iloc is positional indexing

#

[] on a Series mostly behaves like .loc, which is key-based indexing

#

if your keys are floats, it's going to convert 0 to float then say "there's no key 0.0"

#

which is what happened

lapis sequoia Jun 19, 2020, 3:49 PM

#

ohh

#

but how come it worked for the individual columns?

desert oar Jun 19, 2020, 3:49 PM

#

columns are keys

#

wait...

#

yeah

#

df.columns is a list of column keys

#

i.e. column names

#

dataframes have their own behavior

#

i try to use .loc and .iloc whenever possible to eliminate ambiguity

#

the only exception is using df[colname] for column access

#

otherwise i always try to use .loc and .iloc
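
A small sketch of the difference, using an invented integer column. After `value_counts()`, the index holds the column's own values, so `[0]` is treated as "the row labelled 0" and not "the first row".

```python
import pandas as pd

col = pd.Series([5, 5, 5, 7])              # made-up column of integers
fractions = col.value_counts() / len(col)  # index: 5 and 7, values: fractions

print(fractions.iloc[0])   # 0.75, the first row by position
print(fractions.index[0])  # 5, the label of the first row
print(fractions.loc[5])    # 0.75, the row whose label is 5

# Plain [] on a Series with a numeric index looks for the label 0,
# which does not exist here, so it raises KeyError.
try:
    fractions[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
print(label_lookup_failed)  # True
```

How `[]` behaves for other index types, such as strings, has changed between pandas versions, which is one more reason to spell out `.loc` or `.iloc`.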

lapis sequoia Jun 19, 2020, 3:52 PM

#

i see

#

that makes sense

#

thank you

slim fox Jun 19, 2020, 4:16 PM

#

iloc is really nice when you do sortings

modern canyon Jun 19, 2020, 5:58 PM

#

can someone help me understand how 2^g(n) shoots up after a certain value while 2^f(n) remains flat even though f(n) > g(n)?

📎 Capture1.PNG

pallid mica Jun 19, 2020, 6:52 PM

#

Hi! I dont know if this is a place to ask but I just am getting started and i basically just want to learn to program a graph and have it track or plot a dataset and i dont know where to begin any ideas? Thanks in advance

pliant wasp Jun 19, 2020, 7:27 PM

#

hi everyone !

#

( I need help )

uncut shadow Jun 19, 2020, 7:38 PM

#

You should ask your question and provide code/errors required for us to solve your problem

blazing bridge Jun 19, 2020, 8:10 PM

#

📎 fhfqaNWtmm222mRLPi94aeqAI1FwCElM1t1UchEQgTwJ4J0ipMcu6EsvvbTzTq277rru3zpEQAREICkBiamkpHSeCIiACIiACIiA.png

#

📎 lxFqZAiMHagAAAABJRU5ErkJggg.png

#

📎 ZEIBFIBBKBRCAR6A2BJFW9QZkNJQKJQCKQCCQCicAoI5CkapRHP21PBBKBRCARSAQSgd4QSFLVG5TZUCKQCCQCiUAikAiMMgJJqk.png

#

could someone explain this to me. Like what does variance mean in terms of linear regression. Someone told me it is the spread of the data but how does that relate to this in that case.

#

they say 72% variation in y, im not sure what they mean. does it mean its 72% accurate

#

Last thing what is mean squared error and whats the difference between R-squared and mean squared error

desert oar Jun 19, 2020, 8:13 PM

#

i think you really need to spend some time learning statistics

#

it makes these concepts easier to understand

blazing bridge Jun 19, 2020, 8:14 PM

#

yeah ik. I searched this up on khan academy, didnt really make sense to me

#

any recommendations. I know I've been really annoying and probably pissing you guys off

desert oar Jun 19, 2020, 8:16 PM

#

im not sure what the standard stats textbooks are nowadays

#

but a good book should help

#

imo you shouldnt even try to learn regression until you know basic statistics

#

you can do it, but... this is what happens

blazing bridge Jun 19, 2020, 8:17 PM

#

what would be a good book suitable for someone in high school or maybe a course

desert oar Jun 19, 2020, 8:21 PM

#

let me look

#

maybe The Statistical Sleuth by Ramsey & Schafer

#

or Statistics by Freedman, Pisani, & Purves

#

however they are both "textbooks" and might be expensive

blazing bridge Jun 19, 2020, 8:23 PM

#

oh, im sorry to bother you, anything thats free?

desert oar Jun 19, 2020, 8:26 PM

#

i asked in another server

#

i will let you know if i get any recommendations

blazing bridge Jun 19, 2020, 8:27 PM

#

ok, once again sorry

#

would you think this is a good definition: in terms of linear regression, variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean.

desert oar Jun 19, 2020, 8:28 PM

#

no, i dont think so

#

the variance of what

blazing bridge Jun 19, 2020, 8:29 PM

#

like variation in y

desert oar Jun 19, 2020, 8:29 PM

#

have you heard of a "random variable"

blazing bridge Jun 19, 2020, 8:30 PM

#

no

desert oar Jun 19, 2020, 8:30 PM

#

ok, that is the big missing piece for you, i think

blazing bridge Jun 19, 2020, 8:31 PM

#

oh, wait is it something that has infinite number of possible values

#

height, weight

#

for example

desert oar Jun 19, 2020, 8:31 PM

#

it can have a finite number of values

#

the very short version is that a random variable describes a data-generating process. a random variable, let's say Y, is not a single value, but a description of the kinds of values that it can have, and how probable each value is

#

for example: when you flip a coin, the outcome of the coin flip can be Heads or Tails

blazing bridge Jun 19, 2020, 8:32 PM

#

"A random variable is a variable whose value is unknown"

desert oar Jun 19, 2020, 8:33 PM

#

thats an incomplete definition

#

the outcome of the coin flip can also be described as a random variable, with outcomes Heads (0.5 probability) and Tails (0.5 probability)

#

a random variable is a description of a data generating process

#

it's a relationship between "possible outcomes" and "probabilities"

#

i am glossing over a lot of things here, but that's the basic concept at the bottom of pretty much all of stats and probability, and by extension much of machine learning

#

variance tells you how spread out the possible outcomes are. high variance means a lot of spread in the possible outcomes, so if you were to re-generate many data points with high variance, there would be a lot of variation among the data points
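
The coin flip can be written down as exactly that kind of outcome-to-probability table. The sketch below is only an illustration: coding Heads as 1 and Tails as 0 is an arbitrary choice, made so the mean and variance can be computed, and the seed is fixed so the simulated flips are repeatable.

```python
import random

# The random variable: possible outcomes and how probable each one is.
coin = {"Heads": 0.5, "Tails": 0.5}

# Numeric coding of the outcomes (an assumption for this example).
value = {"Heads": 1, "Tails": 0}

mean = sum(value[o] * p for o, p in coin.items())
variance = sum(p * (value[o] - mean) ** 2 for o, p in coin.items())
print(mean, variance)  # 0.5 0.25

# "Re-generating" data points from the same data-generating process.
rng = random.Random(0)
flips = rng.choices(list(coin), weights=list(coin.values()), k=10_000)
heads_share = flips.count("Heads") / len(flips)
print(heads_share)  # close to 0.5, but not exactly
```

The table itself never changes; only the generated flips differ from run to run, which is the distinction between the random variable and the data it produces.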

blazing bridge Jun 19, 2020, 8:36 PM

#

yeah I just watched a video on that and its the spread of the data around the mean

desert oar Jun 19, 2020, 8:36 PM

#

yeah, sure

#

so the variance of Y in regression could mean a few different things

#

one meaning is: the actual variance of Y, which generated the data

#

we can never know that answer

#

another meaning is: the variance of the predictions, which is the definition you gave

#

there are other variances to consider as well, but you can probably ignore those for now

blazing bridge Jun 19, 2020, 8:38 PM

#

so the definition in this case would be how far the actual values are from the predicted values, that is the variance

desert oar Jun 19, 2020, 8:38 PM

#

no, that is something else

#

those are residuals

#

you can compute the variance of the residuals

blazing bridge Jun 19, 2020, 8:39 PM

#

oh ok

desert oar Jun 19, 2020, 8:39 PM

#

that document talks about "variation in y"

#

in this case they mean "the variation in the data"

#

i.e. the observed variance in the data

#

so R^2 is the amount of the variance of Y that is "explained" by the variance of X

#

...kind of

#

the meaning of "explained" is fuzzy without equations and a more coherent understanding of random variables

#

but its good enough to interpret results
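A rough sketch of how the residuals and R^2 mentioned above fit together, on made-up numbers (the data and every variable name here are assumptions for illustration). It fits a least squares line by hand, then computes R^2 as 1 minus the share of the variation in y that the line leaves unexplained:

```python
# Toy data, made up for illustration: x could be square feet, y could be rent.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares fit of y = intercept + slope * x.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean
predictions = [intercept + slope * xi for xi in x]

# Residuals: actual value minus predicted value.
residuals = [yi - pi for yi, pi in zip(y, predictions)]

ss_total = sum((yi - y_mean) ** 2 for yi in y)  # observed variation in y
ss_residual = sum(r ** 2 for r in residuals)    # variation the line leaves unexplained
r_squared = 1 - ss_residual / ss_total

print(round(r_squared, 3))  # 0.6
```

Here the line accounts for 60% of the observed variation in y; the remaining 40% sits in the residuals.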

#

also @blazing bridge https://www.reddit.com/r/statistics/comments/gzv74a/q_learning_statistics_by_yourself/

r/statistics - [Q] Learning statistics by yourself

117 votes and 38 comments so far on Reddit

#

STAT 100 in there might be useful

#

https://online.stat.psu.edu/statprogram/stat100

PennState: Statistics Online Courses

STAT 100: Statistical Concepts and Reasoning | STAT ONLINE

Enroll today at Penn State World Campus to earn an accredited degree or certificate in Statistics.

blazing bridge Jun 19, 2020, 8:43 PM

#

yeah i was just looking at that

#

thank you so much

desert oar Jun 19, 2020, 8:43 PM

#

http://greenteapress.com/thinkstats/
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/

#

https://www.openintro.org/book/stat/

Help Me Decide

OpenIntro's mission is to make educational products that are free, transparent, and lower barriers to education. We're a registered 501(c)(3) nonprofit.

#

https://www.mybooksucks.com/

MYBOOKSUCKS.COM

I posted my first Youtube video in October of 2007. Over the years, I have continued to post educational videos on statistics, economics, calculus, and algebra. My Youtube videos have been viewed...

eternal pagoda Jun 19, 2020, 9:27 PM

#

I am using pandas to find the mean of a 45k line column in a dataset. I noticed some of the cells in the column have text content while the column is intended for integer based values. Still, pandas is able to provide me with the mean. I extracted a small section of the column and copied it to a separate spreadsheet for analysis while using the exact same pandas code. However, when I run the code against the column in a different dataset, i get a "can not convert string to float" error

#

I would like to understand why I am not also receiving this string to float error when I run the exact same code in the 45k line column which contains many strings, but receive the string to float error when I attempt to acquire the mean for a smaller section of the column.

#

📎 unknown.png

#

📎 unknown.png

#

its almost as if pandas decides that it doesn't care about the string values when we are dealing with 45k lines, as opposed to 20. When the amount of cells in the column is lesser, then it wants to get real righteous

#

📎 unknown.png

desert oar Jun 19, 2020, 9:32 PM

#

that is weird

#

it is possible that it's using a different algorithm on a bigger series

#

however i can't say i've ever had that experience

eternal pagoda Jun 19, 2020, 9:33 PM

#

makes me wonder at what point pandas throws out the book. lol

#

disregard, identified that I am seeing the information incorrectly because I am using shit software to open the spreadsheets
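For anyone who does hit text cells in a numeric column, one way to check from pandas itself rather than from a spreadsheet viewer is sketched below. The column contents are made up; `pd.to_numeric(..., errors="coerce")` turns anything unparseable into NaN, which `mean()` then skips:

```python
import pandas as pd

# Made-up column: mostly integers, with one stray text cell.
col = pd.Series([10, 20, "n/a", 30])

# Mixed content means pandas does not treat the column as numeric.
print(col.dtype)
print(pd.api.types.is_numeric_dtype(col))  # False

# Coerce anything that cannot be parsed as a number to NaN, then average.
cleaned = pd.to_numeric(col, errors="coerce")
print(cleaned.isna().sum())  # how many cells were not numeric
print(cleaned.mean())        # NaN is skipped by default
```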

desert oar Jun 19, 2020, 9:42 PM

#

that was my next question

spiral bane Jun 19, 2020, 11:02 PM

#

hi i want to start with machine learning i found a tutorial series on youtube but its from 2018 and since then many changes were made to the pytorch framework so i would like to ask if https://www.youtube.com/watch?v=GIsg-ZUy0MY would be a good starting point? This is probably the most recent video about pytorch

YouTube

freeCodeCamp.org

PyTorch for Deep Learning - Full Course / Tutorial

In this course, you will learn how to build deep learning models with PyTorch and Python. The course makes PyTorch a bit more approachable for people starting out with deep learning and neural networks.

💻 Code:
https://jovian.ml/aakashns/01-pytorch-basics
https://jovian.ml/aa...

wheat wolf Jun 19, 2020, 11:29 PM

#

Is there an updated PyTorch chatbot?

lapis sequoia Jun 20, 2020, 1:36 AM

#

does anyone know how many hidden layers to use in a neural network using tf.keras.models.Sequential() ?

uncut shadow Jun 20, 2020, 7:21 AM

#

it depends

#

you can have 1 or you can choose to have more

#

it's up to you

umbral aspen Jun 20, 2020, 2:47 PM

#

Hi guys - I am trying to develop a model to solve a multi label problem (with tf and keras) and I see this is solved differently sometimes when building the model layers...

I see this (1 sigmoid layer):
layers.Dense(number_of_classes, activation='sigmoid', name='output')

and this (1 sigmoid layer per class):

```python
output2 = Dense(1, activation = 'sigmoid')(x)
output3 = Dense(1, activation = 'sigmoid')(x)
```

#

What are the differences between these two options?

#

Or maybe a better question would be -> when to use only 1 output layer and when to use multiple?

serene scaffold Jun 20, 2020, 3:48 PM

#

If I need to find the cosine distance of two vectors, and one is longer, do I pad the other with zeros until they're the same len?

serene scaffold Jun 20, 2020, 4:19 PM

#

Well that worked but something about it feels wrong.

devout sail Jun 20, 2020, 5:01 PM

#

@serene scaffold uuuh if you know that the shorter vector of size n is the first n dims in the longer vector (in the same order) then sounds like it should work
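A small sketch of the zero-padding idea discussed here, in plain Python (the function name and the example vectors are made up). Padding only changes the result through the longer vector's norm, since the padded positions contribute nothing to the dot product, which is why it only really makes sense when the shared dimensions mean the same thing in both vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity after zero-padding the shorter vector.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    length = max(len(a), len(b))
    a = list(a) + [0.0] * (length - len(a))
    b = list(b) + [0.0] * (length - len(b))
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

short = [1.0, 2.0]
long = [1.0, 2.0, 2.0]

similarity = cosine_similarity(short, long)
distance = 1.0 - similarity
print(similarity, distance)
```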

serene scaffold Jun 20, 2020, 5:02 PM

#

The vectors are actually from completely different spaces. The goal is to learn the mapping.

devout sail Jun 20, 2020, 5:04 PM

#

How do you measure the distance if they're in different spaces?

serene scaffold Jun 20, 2020, 5:04 PM

#

Haven't gotten that far.

devout sail Jun 20, 2020, 5:05 PM

#

I mean you shouldn't be able to.. you need some kind of information to learn the mapping

serene scaffold Jun 20, 2020, 5:05 PM

#

This is an nlp thing. We're trying to put things into an ontology based on embeddings for the tokens and embeddings for the ontology.

#

Right. That's the goal.

#

We have mention-label mappings for a training corpus.

devout sail Jun 20, 2020, 5:08 PM

#

If the goal is to use some kind of regression to learn a transformation then I guess your method might work

#

Because in that case you're defining how you treat shorter vectors

serene scaffold Jun 20, 2020, 5:10 PM

#

A paper has already been published describing how they were successful with this process

#

But they didn't describe it thoroughly or release the code

desert oar Jun 20, 2020, 5:25 PM

#

Isnt this a pretty typical case for a NN

#

Multi input multi output

#

Distance doesn't exist or make mathematical sense between things in different spaces

devout sail Jun 20, 2020, 5:37 PM

#

Maybe it's used as the loss function? between the output and the label

#

it really depends on what the method used is

late jackal Jun 20, 2020, 5:57 PM

#

I am working with some grad students at my school on a project that involves getting two programs to talk to each other. They have some of the code written, but I can't find any documentation on the syntax, or only very, very little. Any idea where to look for something like this?

lapis sequoia Jun 20, 2020, 6:03 PM

#

Hi, do you guys know the best method to interpolate missing financial data before correlating them?

iron escarp Jun 20, 2020, 6:28 PM

#

How hard is the maths in data science and AI?

desert oar Jun 20, 2020, 6:51 PM

#

@late jackal talk how?

#

@iron escarp its not usually "hard" but there is a lot to know

late jackal Jun 20, 2020, 6:54 PM

#

@desert oar developing a python based GUI, so being able to write to HYSYS

desert oar Jun 20, 2020, 6:55 PM

#

what is HYSYS

late jackal Jun 20, 2020, 6:56 PM

#

A process engineering modeling software

#

So like it can model a gas adsorbed and run a bunch of simulations

#

I know it can be done since we have some working code but there's like no documentation on the API

desert oar Jun 20, 2020, 6:57 PM

#

So hysys has an api

#

But its undocumented or underdocumented

late jackal Jun 20, 2020, 6:57 PM

#

Seems to be the case

desert oar Jun 20, 2020, 6:57 PM

#

And you need a gui app that can take user commands and then interact with the hysys api

late jackal Jun 20, 2020, 6:57 PM

#

Yes

#

We are trying to build out that gui

desert oar Jun 20, 2020, 6:58 PM

#

Ok. Thats not really on topic here, but you can use tkinter, kivy, pyqt, or toga for that

#

#user-interfaces can help you

late jackal Jun 20, 2020, 6:58 PM

#

Ah ok I'll maybe send some info in there thank you

iron escarp Jun 20, 2020, 7:20 PM

#

I guess obviously I’ll get better at python

#

To get into data science later on

#

But how do ik if I’ll like data science?

#

Without trying it?

lapis sequoia Jun 20, 2020, 11:41 PM

#

how can i utilize categorical variables with high cardinality in my data to make predictive models?

#

(i.e i have a column called business units with 1000+ levels, and several other variables with 100+ levels in a 6000-row dataset)

#

i probably cant one hot encode that because it will introduce extremely large dimensionality

#

i was thinking of potentially k-means clustering into smaller groups

#

any other ideas come to mind?

velvet thorn Jun 20, 2020, 11:51 PM

#

it depends.

#

first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem

#

you can consider some form of clustering, as you noted

#

you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)

#

depending on the problem, you could consider some other form of encoding

#

e.g. think about how words can be vectorised using an embedding vs one hot encoding
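The "threshold plus other category" option from the list above could look roughly like this in pandas. The data, the column name `business_unit`, and the threshold `min_count` are all made up; picking the threshold is where the domain knowledge comes in:

```python
import pandas as pd

# Made-up data; "business_unit" stands in for a high-cardinality column.
df = pd.DataFrame({"business_unit": ["A"] * 5 + ["B"] * 3 + ["C", "D"]})

min_count = 2  # hypothetical threshold; choosing it usually needs domain knowledge
counts = df["business_unit"].value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent levels as they are, replace everything else with "other".
df["business_unit_grouped"] = df["business_unit"].where(
    df["business_unit"].isin(frequent), "other"
)

print(df["business_unit_grouped"].value_counts())
```

Here "C" and "D" each appear once, so both end up in "other" and the column drops from four levels to three.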

lapis sequoia Jun 20, 2020, 11:57 PM

#

you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)
@velvet thorn yeah i did this

#

is it a bad idea to just do that for all the columns though?

#

since you mentioned domain knowledge is important

velvet thorn Jun 20, 2020, 11:57 PM

#

well

#

again

#

it depends.

#

basically

lapis sequoia Jun 20, 2020, 11:58 PM

#

i just kept the values that make up 80% of the value counts, as long as there were fewer than 20 of them, and put the remaining 20% as 'other'

velvet thorn Jun 20, 2020, 11:58 PM

#

in layman's terms, you are saying that all these different things are in fact "the same"

lapis sequoia Jun 20, 2020, 11:58 PM

#

right

velvet thorn Jun 20, 2020, 11:58 PM

#

so the question is - how much information does this feature hold, relative to the rest of the dataset?

#

now, imagine this.

#

say your dataset is one of names and you want to predict if it's a male or female name

#

if you used one hot encoding and turned all the minorities into the same category

#

your predictions for them would suck, right?

lapis sequoia Jun 20, 2020, 11:59 PM

#

ya

velvet thorn Jun 20, 2020, 11:59 PM

#

so that wouldn't be an appropriate method

lapis sequoia Jun 20, 2020, 11:59 PM

#

i see

velvet thorn Jun 20, 2020, 11:59 PM

#

and the reason is that you have only one feature

lapis sequoia Jun 20, 2020, 11:59 PM

#

mhmm

velvet thorn Jun 20, 2020, 11:59 PM

#

on the other hand, if, say, your dataset is a housing one and the high cardinality feature is road name

#

that probably won't be an issue because road name is unlikely to have a large impact anyway

lapis sequoia Jun 21, 2020, 12:00 AM

#

first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem
@velvet thorn can you explain this part though

velvet thorn Jun 21, 2020, 12:00 AM

#

no need to tag me by the way

lapis sequoia Jun 21, 2020, 12:00 AM

#

i did one hot encoding for columns with nunique <= 15

#

no need to tag me by the way
alright sorry, when i quote you, it tags you by default but i'll delete it

#

so how can i one hot encode with some cat variables with 100+ levels

#

in a sparse matrix

velvet thorn Jun 21, 2020, 12:01 AM

#

oh wait

#

is quoting a new feature or have I just been living under a rock

#

LOL

#

anyway

#

are you using sklearn?

lapis sequoia Jun 21, 2020, 12:02 AM

#

you right click and press quote

velvet thorn Jun 21, 2020, 12:02 AM

#

yeah I just saw

lapis sequoia Jun 21, 2020, 12:02 AM

#

are you using sklearn?
yes

#

wait before you answer that

#

when you do one hot encoding

#

and you do pd.get_dummies

#

you dont need to drop the numeric values right

velvet thorn Jun 21, 2020, 12:02 AM

#

ah...

#

don't use pd.get_dummies

lapis sequoia Jun 21, 2020, 12:02 AM

#

shit

#

why

#

i just ran both my models using that

#

lol

velvet thorn Jun 21, 2020, 12:03 AM

#

okay

#

more accurately

#

you can but

#

well this is kinda an opinion thing so ignore me

#

but

#

let's go back a bit

#

do you know what a sparse array is?

lapis sequoia Jun 21, 2020, 12:03 AM

#

nope lol

#

i know about curse of dimensionality

#

does being sparse have to do with that

#

i looked up sparse array

#

A sparse array is an array of data in which many elements have a value of zero

#

makes sense

velvet thorn Jun 21, 2020, 12:04 AM

#

okay

#

so a normal array

#

stores each value

#

but a sparse array stores only the nonzero values

#

the number of "1"s in a one hot encoded array is always equal to the number of rows

lapis sequoia Jun 21, 2020, 12:05 AM

#

right

velvet thorn Jun 21, 2020, 12:05 AM

#

and the number of "0"s is equal to the number of rows * (the number of categories - 1)

#

so the more categories you have, the bigger the space savings of a sparse array
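The counting argument above can be checked on a tiny made-up example with NumPy (the labels and variable names are assumptions for illustration):

```python
import numpy as np

# Made-up labels: 6 rows, 3 categories.
labels = ["cat", "dog", "rat", "dog", "cat", "dog"]
categories = sorted(set(labels))
codes = [categories.index(label) for label in labels]

# Dense one hot encoding: one row per sample, one column per category.
one_hot = np.eye(len(categories), dtype=int)[codes]

n_rows, n_categories = one_hot.shape
n_ones = int(one_hot.sum())
n_zeros = int(one_hot.size - n_ones)

print(n_ones == n_rows)                        # True
print(n_zeros == n_rows * (n_categories - 1))  # True
```

With 1000+ categories the same arithmetic means well over 99.9% of the dense array is zeros, which is the part a sparse array does not store.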

lapis sequoia Jun 21, 2020, 12:05 AM

#

i see

#

so isn't that like label encoding?

#

because label encoding only stores 1 through n values in one column

velvet thorn Jun 21, 2020, 12:06 AM

#

uh

#

no

#

well

#

I mean

#

could you clarify your question

lapis sequoia Jun 21, 2020, 12:06 AM

#

label encoding is non-zero values for a particular column .. i.e. dog,cat,rat would just be 1,2,3 in one column

#

ok nvm just ignore that

#

but what does this have to do with pd get dummies

#

are you suggesting to just use sparse array of OHE?

velvet thorn Jun 21, 2020, 12:08 AM

#

pd.get_dummies(..., sparse=True)
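A quick sketch of what that call changes, on a made-up column with 200 levels (the frame and column name are assumptions). The encoded values are the same either way; only the storage differs:

```python
import pandas as pd

# Made-up column with 200 distinct levels over 1000 rows.
df = pd.DataFrame({"unit": [f"unit_{i % 200}" for i in range(1000)]})

dense = pd.get_dummies(df)
sparse = pd.get_dummies(df, sparse=True)

# Same shape and same information, stored differently.
print(dense.shape, sparse.shape)
print(dense.memory_usage().sum(), sparse.memory_usage().sum())
```

Note that this only addresses memory. The model still sees 200 columns, so it does nothing about the curse of dimensionality as a modelling problem.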

lapis sequoia Jun 21, 2020, 12:08 AM

#

so how does using sparse affect the model performance

#

im using a random forest, idk if that matters

velvet thorn Jun 21, 2020, 12:09 AM

#

ah

#

when you said curse of dimensionality

#

you meant as a modelling problem

#

not as a storage problem

#

I misunderstood you I think

lapis sequoia Jun 21, 2020, 12:10 AM

#

no i think this is actually helpful to know

#

because i ran two models:
1 model where features nunique <=15 and i one hot encoded

#

and a model where im gonna use all the categorical variables with cardinality

#

but as i mentioned some of the categorical variables have 100+ levels. should i just do pd.get_dummies(sparse=True)

velvet thorn Jun 21, 2020, 12:12 AM

#

it's a good start

lapis sequoia Jun 21, 2020, 12:13 AM

#

alright

#

thanks

#

any other ideas i could try in case that doesnt work?

velvet thorn Jun 21, 2020, 12:13 AM

#

think I went through them earlier?

lapis sequoia Jun 21, 2020, 12:14 AM

#

alright ill try clustering and ohe with sparse = true

#

thanks

#

also one more question

#

if sparse = true reduces storage, why dont people just always use sparse = true

#

there should be some drawbacks to it?

velvet thorn Jun 21, 2020, 12:14 AM

#

yes

#

a few

#

you can't do certain things with a sparse array

#

because of the constraint that many values must be 0

#

e.g. imagine standardisation (- mean, then / std)

lapis sequoia Jun 21, 2020, 12:15 AM

#

mhmm

velvet thorn Jun 21, 2020, 12:15 AM

#

an array that was sparse

#

most likely cannot be standardised

#

because the majority value (0), after subtracting the mean, will not be 0
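A tiny NumPy illustration of that point, using one made-up mostly-zero column:

```python
import numpy as np

# One mostly-zero column, as you would get from one hot encoding.
column = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Standardise: subtract the mean, then divide by the standard deviation.
standardised = (column - column.mean()) / column.std()

print(np.count_nonzero(column))        # 1
print(np.count_nonzero(standardised))  # 8
```

After subtracting the mean (0.125 here) every former zero becomes a small negative number, so there is nothing left for a sparse format to skip.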

#

a sparse array can take more memory than a dense array if the data is not actually sparse

#

because of the way nonzero values are stored

#

one implementation of a sparse array

#

is storing the indices and values of the nonzero values

#

so you need to store two things instead of one per value

#

and the tradeoff is that you need not store all the values

lapis sequoia Jun 21, 2020, 12:17 AM

#

i see

#

so how are the actual values stored with sparse = true vs pd.get_dummies(sparse=false)? bc if you have labels like cat,dog,rat, then if you one hot encode with sparse= false, it would be
1, 0, 0
0, 1, 0
0, 0, 1

velvet thorn Jun 21, 2020, 12:18 AM

#

ye

lapis sequoia Jun 21, 2020, 12:18 AM

#

is it just gonna show as
1
1
1?

velvet thorn Jun 21, 2020, 12:18 AM

#

uh

#

I'm not sure how they're displayed

#

but

#

that's an entirely different concern from how they're stored

#

it's more like

lapis sequoia Jun 21, 2020, 12:18 AM

#

ohh ok

velvet thorn Jun 21, 2020, 12:19 AM

#

non_sparse_values = [(0, 0, 1), (1, 1, 1), (2, 2, 1)]

#

something like that?
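Spelling that idea out in plain Python: each triplet is read here as (row, column, value), which is an assumption about the intended layout, and the dense array is rebuilt by starting from zeros and filling in only the stored entries. `scipy.sparse` has a coordinate (COO) format that works along similar lines.

```python
# (row, column, value) triplets for the cat/dog/rat example:
# row 0 is "cat" (column 0), row 1 is "dog" (column 1), row 2 is "rat" (column 2).
non_sparse_values = [(0, 0, 1), (1, 1, 1), (2, 2, 1)]
shape = (3, 3)

# Rebuild the dense array: start from all zeros, fill in only the stored entries.
dense = [[0] * shape[1] for _ in range(shape[0])]
for row, col, value in non_sparse_values:
    dense[row][col] = value

for line in dense:
    print(line)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

Three triplets are stored instead of nine cells. With three numbers per stored entry, this only pays off when most cells are zero.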

lapis sequoia Jun 21, 2020, 12:19 AM

#

ok thanks

#

i'll take a stab at this tomorrow

#

also just to make sure, the main advantage/use of sparse is reduction in storage/memory consumption correct?

#

vs numpy array

velvet thorn Jun 21, 2020, 12:23 AM

#

yes

lapis sequoia Jun 21, 2020, 12:23 AM

#

ty

marsh chasm Jun 21, 2020, 1:00 AM

#

@hidden halo i fixed the problem, but i still don't really know what the problem was

#

i ended up hardcoding it by turning the dates into strings and chopping off the time section

#

then converting it back to a datetime object

#

🤷‍♀️

lapis sequoia Jun 21, 2020, 2:01 AM

#

how can i find correlations amongst ALL variables in a dataframe containing categorical and numeric variables

lapis sequoia Jun 21, 2020, 2:19 AM

#

@velvet thorn so using pd.get_dummies(sparse=true) yielded a 5718 row x 8661 column dataset

#

is this ok to put through a random forest lol

velvet thorn Jun 21, 2020, 3:35 AM

#

that doesn't seem right

#

why are there more columns than rows

hidden halo Jun 21, 2020, 3:41 AM

#

i ended up hardcoding it by turning the dates into strings and chopping off the time section
@marsh chasm I was on phone earlier, so couldn't try. Now I tried and df['date'].dt.date seemed to work for me. It removed the time part. I just did:

df = pd.read_csv("csv_file.csv", parse_dates=[0])
df['date'] = df['date'].dt.date

marsh chasm Jun 21, 2020, 3:42 AM

#

o wtf

#

o well

lapis sequoia Jun 21, 2020, 4:14 AM

#

why are there more columns than rows
@velvet thorn uh im not sure

#

the rows are the exact same

#

5718 are the number of rows i had

#

but the original dataset had like 63 columns i think

#

are the number of rows supposed to stay the same like in one hot encoding?

velvet thorn Jun 21, 2020, 4:19 AM

#

the thing is

#

you should have

#

1 new column

#

for each unique value

#

and clearly the number of unique values cannot be more than the number of rows

lapis sequoia Jun 21, 2020, 4:27 AM

#

📎 image0.jpg

#

yeah this is what i did

#

mod3_dummies = pd.get_dummies(df_mod3, sparse = True)
df_mod3.shape, mod3_dummies.shape

#

returns ```python
((5718, 59), (5718, 8661))
```

#

also there are 0s in the sparse dataframe. didnt you say there werent supposed to be any 0s?

static gull Jun 21, 2020, 2:39 PM

#

I cant seem to be able to install Tensorflow Object Detection API

#

someone help pls

lapis sequoia Jun 21, 2020, 2:51 PM

#

mod3_dummies = pd.get_dummies(df_mod3, sparse = True)
df_mod3.shape, mod3_dummies.shape

@velvet thorn any tips?

paper niche Jun 21, 2020, 3:44 PM

#

what kind of data are you trying to OHE here? if you went from 59 columns to 8.6k columns, it's clearly some columns with a tad too many "unique" values, like an address or something

#

so I just saw your previous msg, you said some cat columns had 100+ levels and you're wondering how to OHE it? It probably depends on what the data is representing, but I wouldn't OHE with that many categories.

If for example, you had an address column that had 100+ unique values, you could, for example, extract only the district/province information and OHE that instead.
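One way to see where the extra columns come from is sketched below on a made-up frame (column names and values are assumptions). When `pd.get_dummies` is given a whole DataFrame, numeric columns pass through and every text column adds one column per unique value, so the total can exceed the number of rows even though no single column has more levels than rows:

```python
import pandas as pd

# Made-up frame: one numeric column and two text columns.
df = pd.DataFrame({
    "price": [10, 12, 9, 11],
    "city": ["a", "b", "c", "d"],
    "street": ["w", "x", "y", "z"],
})

dummies = pd.get_dummies(df)

# Numeric columns pass through; each text column adds one column per unique value.
text_cols = df.select_dtypes(exclude="number").columns
levels = df[text_cols].nunique()
expected_width = (df.shape[1] - len(text_cols)) + int(levels.sum())

print(levels.sort_values(ascending=False))  # which columns contribute the most
print(dummies.shape[1], expected_width)
```

Sorting `levels` like this on the real data would show which handful of columns account for most of the 8661.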

lapis sequoia Jun 21, 2020, 3:47 PM

#

well for instance

#

i have business units with over 1400 levels

#

in a 6000-row data set

#

theres another column with 2500 levels

paper niche Jun 21, 2020, 3:48 PM

#

find some meaningful way to bin them

lapis sequoia Jun 21, 2020, 3:48 PM

#

k means clustering?

#

actually idk if that'd work

#

alright

paper niche Jun 21, 2020, 3:48 PM

#

with that many levels, it's like 2-4 rows per level on average

lapis sequoia Jun 21, 2020, 3:48 PM

#

yea

#

amongst the categorical data, my data has like 40 columns with <20 unique values and like 10-15 with >150

#

theres nothing between 20-150

paper niche Jun 21, 2020, 3:50 PM

#

even the 40 columns with 20 unique values is kinda ridiculous once you OHE them. gotta be careful with curse of dimensionality here, considering you have so few training examples

lapis sequoia Jun 21, 2020, 3:51 PM

#

okay so i built 2 separate models --

1. baseline containing only categorical columns with <10 unique values + numeric variables
2. keeping the unique values that capture 80% of each column's value counts (when there are fewer than 20 of them) and storing the remaining 20% as 'other' + numeric vars
3. im gonna try somehow using all categorical variables + numeric vars

#

however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that

paper niche Jun 21, 2020, 3:52 PM

#

you could do feature selection / dimensionality reduction afterwards of course, but some domain knowledge to clear out some useless features (or even binning the categories in a way that makes sense) might be a good start

twilit arch Jun 21, 2020, 3:53 PM

#

I want to make a roulette game, how would I calculate the odds (I have 3 colors rn) so that "the house" has a good edge and would be profiting instead of losing money

lapis sequoia Jun 21, 2020, 3:53 PM

#

alright thanks tofu

#

however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that
@lapis sequoia do you know anything about this question though?

paper niche Jun 21, 2020, 3:56 PM

#

tips to circumvent what? doing OHE?

lapis sequoia Jun 21, 2020, 3:56 PM

#

i guess having accurate variable importances

#

because one hot encoding is skewing the variable importances of random forests towards the non-OHE data (numeric columns)

#

so the numeric columns tend to be at the top

#

and the one hot encoded features are below them

slim fox Jun 21, 2020, 4:07 PM

#

i have business units with over 1400 levels
@lapis sequoia is it absolutely impossible to encode them with more meaningful numerical values?

#

like you can not establish any kind of numerical relationship between categories, i.e. category "A" can be represented with 0, "B" with 1 and "C" with 5?

lapis sequoia Jun 21, 2020, 4:08 PM

#

there isn't any inherent order

#

if thats what you're asking

#

it's nominal

slim fox Jun 21, 2020, 4:09 PM

#

😕 yeah I was hoping maybe you can somehow establish it

#

to be honest I simply have doubts that 1400 unique values with that size of data will contribute positively to your ML model results

#

I would not be surprised if dropping the entire column would yield better results than OHE 1400 values

lapis sequoia Jun 21, 2020, 4:15 PM

#

i was thinking the same

#

i believe that there is some sort of association/correlation between that column and a few others as well

#

im calculating a cramer v correlation

#

among categorical columns

#

and will drop any with high association
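For reference, a bare-bones version of Cramér's V with pandas and NumPy, without the bias correction some implementations apply. The helper name `cramers_v` and the toy columns are made up; the statistic is sqrt(chi2 / (n * (min(rows, cols) - 1))) computed on the contingency table:

```python
import numpy as np
import pandas as pd

def cramers_v(a, b):
    """Cramér's V between two categorical columns (no bias correction).

    Assumes both columns have at least two distinct levels.
    """
    observed = pd.crosstab(a, b).to_numpy().astype(float)
    n = observed.sum()
    # Expected counts if the two columns were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

df = pd.DataFrame({
    "unit":   ["a", "a", "b", "b", "c", "c"],
    "region": ["p", "p", "q", "q", "r", "r"],  # fully determined by unit
    "flag":   ["x", "y", "x", "y", "x", "y"],  # unrelated to unit
})

print(cramers_v(df["unit"], df["region"]))  # 1.0 (up to rounding)
print(cramers_v(df["unit"], df["flag"]))    # 0.0
```

One caveat: with hundreds of levels and only a few rows per level, many cells of the table are nearly empty and the uncorrected value tends to come out high, so a large V between two such columns is not strong evidence on its own.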

paper niche Jun 21, 2020, 4:17 PM

#

i guess having accurate variable importances
@lapis sequoia no, not particularly. that issue is intrinsic with random forests & having high cardinality data, I think.

lapis sequoia Jun 21, 2020, 4:18 PM

#

so is it ill-advised to use a random forest in this scenario?

paper niche Jun 21, 2020, 4:19 PM

#

It can work fine as a meta-model, just gotta do some preprocessing beforehand

#

like I said, binning, or you could try embedding the high cardinality features into a smaller space

#

though the latter would not be useful if you're trying to do inference

lapis sequoia Jun 21, 2020, 4:21 PM

#

alright thank you

#

i already tried binning

#

any particular methods you'd recommend about the latter option?

#

embedding the high cardinality features into a smaller space

desert oar Jun 21, 2020, 5:29 PM

#

Did you try target encoding

#

Another option is to train your model w/ a Factorization Machine

#

https://medium.com/building-ibotta/reg2vec-learning-embeddings-for-high-cardinality-customer-registration-features-faf712f12842 heres an embedding method

Medium

Reg2Vec: Learning Embeddings for High Cardinality Customer Registra...

Supervised Feature Learning with PyTorch
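The target encoding mentioned above, in its simplest form, replaces each level with the average target value seen for that level. A minimal sketch on made-up data (column names are assumptions; real use would compute the means on the training split only, and usually with some smoothing, to avoid leaking the target):

```python
import pandas as pd

# Made-up data: a categorical column and a numeric target.
df = pd.DataFrame({
    "unit":   ["a", "a", "b", "b", "b", "c"],
    "target": [1.0, 3.0, 10.0, 20.0, 30.0, 5.0],
})

# Replace each level with the mean target observed for that level.
level_means = df.groupby("unit")["target"].mean()
df["unit_encoded"] = df["unit"].map(level_means)

print(df)
```

This turns a column with any number of levels into a single numeric column, which is why it comes up for high-cardinality features.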

lapis sequoia Jun 21, 2020, 6:39 PM

#

Will do thank you

lapis sequoia Jun 21, 2020, 7:10 PM

#

I have a file that looks like this

--------------Time step: 1 ---------------
Accumulated rewards: 1.5
Alpha: 660
Beta: 173
TCP_Friendliness: 1
Fast_Convergence: 1
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 2 ---------------
Accumulated rewards: 2.724744871391589
Alpha: 193
Beta: 0
TCP_Friendliness: 0
Fast_Convergence: 0
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 3 ---------------
Accumulated rewards: 3.869459113944921

I'd like to extract the time step values into an X array and the Accumulated rewards value into a Y array, I have no idea how to do that as I have 0 python experience, but this is my initial loop i've written that skips the first couple of lines that I have not included in the example(gibberish data)

with open('Tuner_result_1.txt') as f:
    for _ in range(11):
        next(f)
    for line in f:
        x = [line.split()[0] for line in lines]
        y = [line.split()[1] for line in lines]

obviously the actions inside the 2nd for are incorrect, idk how to read the lines i want properly.
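One possible way to do this, sketched with regular expressions. The function name `parse_rewards` is made up, and the sample text is a shortened copy of the format shown above. Lines that match neither pattern are simply ignored:

```python
import re

def parse_rewards(lines):
    """Collect time steps (x) and accumulated rewards (y) from the log lines."""
    x, y = [], []
    for line in lines:
        step = re.search(r"Time step:\s*(\d+)", line)
        reward = re.search(r"Accumulated rewards:\s*(\S+)", line)
        if step:
            x.append(int(step.group(1)))
        elif reward:
            y.append(float(reward.group(1)))
    return x, y

sample = """--------------Time step: 1 ---------------
Accumulated rewards: 1.5
Alpha: 660
---------------Time step: 2 ---------------
Accumulated rewards: 2.724744871391589
Alpha: 193
""".splitlines()

x, y = parse_rewards(sample)
print(x)  # [1, 2]
print(y)  # [1.5, 2.724744871391589]
```

With the real file this would be `with open('Tuner_result_1.txt') as f: x, y = parse_rewards(f)`. Skipping the first 11 lines should not be necessary as long as they do not contain the text "Time step:" or "Accumulated rewards:".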

blazing bridge Jun 21, 2020, 7:12 PM

#

After doing some research would you say this is correct:

#

The r-squared coefficient is the percentage of y-variation "explained" by the line, compared to how much the average y explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average.

desert oar Jun 21, 2020, 7:13 PM

#

Yeah sounds good

blazing bridge Jun 21, 2020, 7:25 PM

#

Ok it took me 3 days. One question: y average and SEy are the same thing, which is just a line with no slope and only an intercept, so if the r-squared is 0.72 it’s 72% better than the line.

#

Which is the mean line

desert oar Jun 21, 2020, 7:33 PM

#

No? SE is standard error

#

Oh i see

#

Hmm

#

I guess yeah its the mean squared error of an estimate that's just the sample mean

blazing bridge Jun 21, 2020, 7:41 PM

#

They used that in the khan academy video

desert oar Jun 21, 2020, 7:46 PM

#

Yeah that's "good enough" for now

#

A more practical understanding is: the better the R^2, the closer your data is to a straight line fit

#

In the case with 1 x and 1 y you can compute it from the correlation
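A quick numerical check of that statement on made-up data (all names and numbers are assumptions for illustration): with one x and one y, the squared Pearson correlation and the R^2 of the least squares line come out the same.

```python
import math

# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)

# Pearson correlation between x and y.
r = sxy / math.sqrt(sxx * syy)

# R^2 of the least squares line y = intercept + slope * x.
slope = sxy / sxx
intercept = y_mean - slope * x_mean
ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_residual / syy

print(round(r ** 2, 6), round(r_squared, 6))  # 0.6 0.6
```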

blazing bridge Jun 21, 2020, 8:09 PM

#

So if the r-squared is 72% it means that 72% of the data is a straight line?

polar acorn Jun 21, 2020, 8:41 PM

#

It means if you replace your data with that straight line it would still keep 72% of the variation of your data. Not necessarily that 72% of the data lies on that straight line.

desert oar Jun 21, 2020, 9:45 PM

#

Specifically 72% of the variation in Y

lapis sequoia Jun 21, 2020, 11:36 PM

#

do yall think dropping columns with correlations > 0.70 is a good threshold

desert oar Jun 21, 2020, 11:37 PM

#

Why does it have such a high correlation

lapis sequoia Jun 21, 2020, 11:37 PM

#

idk there are like 14/63 columns that have > 0.80 correlation

#

this is a cramer v correlation

#

so between categorical variables

#

idk if that matters

#

and they have like 100+ levels

desert oar Jun 21, 2020, 11:38 PM

#

The reason you drop high correlations is in case they are effectively duplicates of each other

lapis sequoia Jun 21, 2020, 11:38 PM

#

yea

#

they seem to be correlated

#

in this business context

desert oar Jun 21, 2020, 11:38 PM

#

So it depends on what they are

#

Eg SIC and NAICS codes

#

SIC is basically redundant if you know NAICS

#

But in a lot of data you have multiple SIC codes and only one NAICS

#

So you keep both despite high corr

#

(SIC and NAICS are two different business classification systems in the USA)

lapis sequoia Jun 21, 2020, 11:40 PM

#

i see

#

hmm

desert oar Jun 21, 2020, 11:41 PM

#

16 features is not too many for you to manually review

lapis sequoia Jun 21, 2020, 11:41 PM

#

so i'd have to look at the data individually to determine if i need to drop it?

#

alright

desert oar Jun 21, 2020, 11:41 PM

#

If its 160 features out of 1000 then id say you need automated methods

#

Or if you need to dynamically retrain a model on unknown data like if you were building some kind of auto ML solution

#

And this is actually a great example of why auto ML solutions are a long way from perfect

#

Understanding the domain and business context and the meaning of the data, and applying that understanding towards solving your problem. That is the added value of a good data scientist, more so than whatever mechanical skills they have in implementing models

#

Look at something like alphago, they didn't just throw a big generic neural network at the problem, they built a whole solution specifically oriented around Go

lapis sequoia Jun 21, 2020, 11:44 PM

#

alright thanks for the help

desert oar Jun 21, 2020, 11:49 PM

#

Sorry for ranting

blazing bridge Jun 22, 2020, 12:57 AM

#

"The r-squared coefficient is the percentage of y-variation that the line "explained" by the line compared to how much the average y-explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you the how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average."

#

@desert oar Can you tell me if this is a correct interpretation of this if I explain it. What r-squared is telling us, is how much closer the data points are to the line of best fit compared to the average y value line, referred to as SEy or variation in y. So if the r-squared is 0.72, it means that the line of best fit is 72% better than the average values of y mean. And the other 28% is missing due to us not including other variables. For example if we have rent and square feet. The other 28% could be in something like age of building and etc.

#

@desert oar again sorry for bothering you so much

desert oar Jun 22, 2020, 4:07 AM

#

72% better than the mean..... eh

#

Yes the other 28% could be omitted variables

#

It could also be natural random variation in Y with nothing to account for it

#

It could also be that the remaining relationship is nonlinear

#

It could also be that the variance of Y is not constant over the range of X so the whole model is invalid
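
A small worked example may make the "% of variation" idea concrete. This is only a sketch with made-up toy numbers and a single predictor: r-squared is computed as 1 minus the squared error around the fitted line divided by the squared error around the mean of y.

```python
# Toy data, invented purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least-squares slope and intercept for a single predictor.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Squared error around the fitted line vs. around the mean of y.
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

So the number says how much of the squared error around the mean line goes away once you use the fitted line instead. It says nothing about why the rest remains.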

blazing bridge Jun 22, 2020, 4:22 AM

#

so i am partially correct

#

but for the part where what % of variation in y is described by x. The percentage we get is checking with the x variables we have , the line of best fit is better than the y mean line

hearty jewel Jun 22, 2020, 7:06 AM

#

quick python question: right here, what exactly are the loc arguments specifying? as I understand it, ride_sharing['tire_sizes'] > 27 is specifying this column where this variable is greater than 27, but whats the second argument for?

📎 unknown.png

#

the second 'tire_sizes'

#

lmao, turns out it wasnt needed after all.

#

no wonder i was so confused.

lapis sequoia Jun 22, 2020, 9:43 AM

#

The conditional returns a boolean Series that is True for every row satisfying the condition, and loc keeps only those rows. The second argument, 'tire_sizes', specifies which column to return.

lapis sequoia Jun 22, 2020, 12:05 PM

#

Hi, in order to get the correlation coefficient, I performed a quadratic interpolation to fill in the missing values. However, it seems that I still have missing data.
Is that normal, or does the problem come from my code?

Thank you for your responses

jade hazel Jun 22, 2020, 12:56 PM

#

Hi, I am creating a chatbot and I am struggling with my bot's responses. For example, if I type "What is the weather in London", I want the bot to go to some webpage and get the weather data. Does anyone know how to do this? Thanks a lot for your responses

ripe marlin Jun 22, 2020, 2:08 PM

#

Pandas read_csv() issues. Not sure where this belongs, but please help me, nothing is working here

📎 PicsArt_06-22-07.28.09.jpg

desert oar Jun 22, 2020, 2:11 PM

#

@ripe marlin i can't read that, can you post your code and full error output as text

#

!code-block

arctic wedgeBOT Jun 22, 2020, 2:11 PM

#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')

desert oar Jun 22, 2020, 2:11 PM

#

@lapis sequoia impossible to say without seeing your code

#

@blazing bridge it's the % of variation in y described by our model

lapis sequoia Jun 22, 2020, 2:13 PM

#

I performed a quick test to see if the firebase_admin package was imported

📎 Screen_Shot_2020-06-22_at_16.12.20.png

desert oar Jun 22, 2020, 2:13 PM

#

@hearty jewel

data['hello']  # select the "hello" column
data.loc['hello']  # select the row(s) with index value "hello"
data.loc['hello', 'hello']  # select the row(s) with index value "hello" and the "hello" column

#

@lapis sequoia please post your code as text. you are not new here, you know this

#

also this has nothing to do with your previous issue

#

any time you see "no module named X" it means you have the wrong environment active

#

whether that's venv or conda or whatever

#

that is always the solution

#

that, or it had an error during installation
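
One quick way to check which environment is actually active is to ask the running interpreter itself. A minimal sketch using only the standard library; `describe_module` is a hypothetical helper name, and `firebase_admin` is just the package from the question above.

```python
import importlib.util
import sys

def describe_module(name):
    """Return where `name` would be imported from, or None if it is not installed here."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# The interpreter that is actually running this code (venv, conda env, system Python...).
print(sys.executable)

# A standard-library module is always found; a missing package gives None.
print(describe_module("json"))
print(describe_module("firebase_admin"))
```

If the printed interpreter path is not inside the env where you ran the install, that mismatch is the likely cause of "no module named X".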

ripe marlin Jun 22, 2020, 2:14 PM

#

@desert oar nothing that complicated. I'm just trying to use read_csv() to open a csv file. I'm getting a unicode error. And when I use r to declare a raw string, it says that the file doesn't exist

desert oar Jun 22, 2020, 2:15 PM

#

@ripe marlin show the error? it's probably in the file itself, not the filename

#

how was the file created? if it came from Excel, use encoding='windows-1252'

ripe marlin Jun 22, 2020, 2:15 PM

#

From excel

desert oar Jun 22, 2020, 2:15 PM

#

windows-1252 is a derivative of iso-8859-1, and both only agree with UTF-8 on the plain ASCII characters

#

so you probably hit one of those different characters

#

and it can't decode the bytes to text
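
To see the failure in isolation, here is a small sketch (the sample text is made up): a curly apostrophe saved as windows-1252 becomes the single byte 0x92, which is not valid UTF-8 on its own.

```python
# A curly apostrophe (U+2019) encoded the way Excel on Windows would typically save it.
raw = "it\u2019s".encode("windows-1252")
print(raw)

# Decoding the same bytes as UTF-8 fails...
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ...while decoding with the encoding the file was written in works.
text = raw.decode("windows-1252")
print(utf8_ok, text == "it\u2019s")
```

Passing `encoding='windows-1252'` to `read_csv` asks pandas to do the second kind of decode for the whole file.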

ripe marlin Jun 22, 2020, 2:16 PM

#

Wait i'll just post the code

lapis sequoia Jun 22, 2020, 2:16 PM

#

@desert oar thanks for your feedback, I tried installing the environment through the terminal, as shown on the anaconda website https://anaconda.org/auto/python-firebase

#

I got this error: PackagesNotFoundError: The following packages are not available from current channels:

python-firebase

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

desert oar Jun 22, 2020, 2:18 PM

#

just use pip with your conda env activated @lapis sequoia

#

that's a non-standard channel

#

you would need to conda install -c auto python-firebase but again i dont know who or what auto is so i dont trust it

#

@ripe marlin try pd.read_csv(..., encoding='windows-1252')

#

just try it

ripe marlin Jun 22, 2020, 2:19 PM

#

Right

lapis sequoia Jun 22, 2020, 2:19 PM

#

ok thanks

#

One last question what do you mean by env activated?

ripe marlin Jun 22, 2020, 2:21 PM

#

Still shows Unicode error and FileNotFoundError if I add a r for rawstring

desert oar Jun 22, 2020, 2:21 PM

#

if you are on windows you are probably writing '\\' so yes raw would fail

#

ok show the full error

#

@lapis sequoia

conda activate <my env>
pip install <package>

lapis sequoia Jun 22, 2020, 2:25 PM

#

@desert oar instead of <my env> I should put the IDE I work on? For example Spyder?

ripe marlin Jun 22, 2020, 2:25 PM

#

File "<ipython-input-13-343da795340c>", line 1
    df=pd.read_csv('C:\Users\dell\Downloads\py-master.zip\py-master\ML\1_linear_reg\homeprices.csv',encoding='windows-1252')
                  ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

desert oar Jun 22, 2020, 2:25 PM

#

@ripe marlin oh, you need r

#

because you are using single \s
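
For reference, a sketch of the equivalent ways to spell a Windows path in Python (the path is shortened from the one in the traceback). In a normal string `\U` starts a unicode escape, which is what the SyntaxError above is complaining about.

```python
raw_path = r"C:\Users\dell\Downloads"        # raw string: backslashes are kept as-is
escaped_path = "C:\\Users\\dell\\Downloads"  # normal string: every backslash doubled
forward_path = "C:/Users/dell/Downloads"     # Python on Windows also accepts forward slashes

# The raw and the escaped spelling produce exactly the same string.
print(raw_path == escaped_path)
print(raw_path)
```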

ripe marlin Jun 22, 2020, 2:26 PM

#

I did, it shows a Filenotfound error

desert oar Jun 22, 2020, 2:26 PM

#

then you have the wrong filename

ripe marlin Jun 22, 2020, 2:26 PM

#

I literally copy pasted its path

#

I didn't type a thing

desert oar Jun 22, 2020, 2:26 PM

#

you can't just get a file out of a zip

#

windows explorer cheats and lets you think you can

#

you have to unzip first

#

either unzip thru windows or unzip inside python

ripe marlin Jun 22, 2020, 2:27 PM

#

How do i unzip inside Python?

desert oar Jun 22, 2020, 2:29 PM

#

from zipfile import ZipFile
import pandas as pd

# Member names inside a zip archive use forward slashes, even on Windows
with ZipFile(r'C:\Users\dell\Downloads\py-master.zip') as archive:
    with archive.open('py-master/ML/1_linear_reg/homeprices.csv') as fp:
        data = pd.read_csv(fp, encoding='windows-1252')

@ripe marlin
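
If the member name is in doubt, `namelist()` shows exactly how the archive stores it. A self-contained sketch that builds a tiny archive in memory (the CSV contents are invented) and reads it back with only the standard library:

```python
import csv
import io
from zipfile import ZipFile

# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "py-master/ML/1_linear_reg/homeprices.csv",
        "area,price\n2600,550000\n3000,565000\n",
    )

with ZipFile(buffer) as archive:
    names = archive.namelist()          # member names as stored in the archive
    with archive.open(names[0]) as fp:  # binary file-like object
        rows = list(csv.DictReader(io.TextIOWrapper(fp, encoding="utf-8")))

print(names)
print(rows[0]["price"])
```

The same file-like object returned by `archive.open` is what gets handed to `pd.read_csv` in the snippet above.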

#

!d g zipfile

arctic wedgeBOT Jun 22, 2020, 2:29 PM

#

`zipfile`

Source code: Lib/zipfile.py

The ZIP file format is a common archive and compression standard. This module provides tools to create, read, write, append, and list a ZIP file. Any advanced use of this module will require an understanding of the format, as defined in PKZIP Application Note.

This module does not currently handle multi-disk ZIP files. It can handle ZIP files that use the ZIP64 extensions (that is ZIP files that are more than 4 GiB in size). It supports decryption of encrypted files in ZIP archives, but it currently cannot create an encrypted file. Decryption is extremely slow as it is implemented in native Python rather than C.

The module defines the following items:

ripe marlin Jun 22, 2020, 2:29 PM

#

I see

#

I'll try this

#

Hope it works

#

Thanks a lot

lapis sequoia Jun 22, 2020, 2:50 PM

#

how about gunzip

lapis sequoia Jun 22, 2020, 3:12 PM

#

@desert oar yeah i found out there actually is high association between the variables providing redundant information. is correlation of 0.6 a good threshold to drop predictors?

desert oar Jun 22, 2020, 3:13 PM

#

these are all categorical?

lapis sequoia Jun 22, 2020, 3:13 PM

#

yea

desert oar Jun 22, 2020, 3:15 PM

#

how are you computing correlation

#

cramers v?

lapis sequoia Jun 22, 2020, 3:16 PM

#

yea

desert oar Jun 22, 2020, 3:21 PM

#

im still skeptical of discarding features based on that

#

.6 doesnt seem high

#

high collinearity is usually only a problem in extreme cases

#

and even then, with regularization you typically dont need to care much

#

even moreso in ensembled decision tree models like a random forest

lapis sequoia Jun 22, 2020, 3:23 PM

#

fffffffffffffffffffffffffffffffffffffffff

desert oar Jun 22, 2020, 3:23 PM

#

which are selecting random features anyway

lapis sequoia Jun 22, 2020, 3:23 PM

#

i was gonna drop high predictor variables that had high cardinality & high correlation lol

#

but this wont fix it it seems

desert oar Jun 22, 2020, 3:23 PM

#

no you need to fix the high cardinality issue

#

i posted a bunch of options the other day

#

welcome to data science

sullen kiln Jun 22, 2020, 3:24 PM

#

@lapis sequoia not sure what your context is and how many variables you are dealing with, if general cramers V thresholds are not desired you could consider other dimensionality reduction methods like PCA or PFA

desert oar Jun 22, 2020, 3:24 PM

#

trial and error

#

@sullen kiln they have several very high cardinality features that are all moderately associated with each other

#

like high cardinality as in 1000 distinct categories

sullen kiln Jun 22, 2020, 3:25 PM

#

bloody hell

#

😆

desert oar Jun 22, 2020, 3:25 PM

#

we have suggested several options like target encoding, vector embedding, building a sparse or reduced model like a factorization machine

#

etc

lapis sequoia Jun 22, 2020, 3:25 PM

#

so even if i have 2 variables with >0.80 correlation i shouldnt drop it?

desert oar Jun 22, 2020, 3:25 PM

#

you said 0.6

lapis sequoia Jun 22, 2020, 3:25 PM

#

ik but im increasing the threshold now

#

i have like 10-15 variables with >0.80 correlation

sullen kiln Jun 22, 2020, 3:26 PM

#

what is the nature of the cardinality? what data are you dealing with?

lapis sequoia Jun 22, 2020, 3:26 PM

#

these correlations are amongst categorical variables

#

they have a shitton of unique values which is why im tryna drop them

desert oar Jun 22, 2020, 3:26 PM

#

not rly correlations. cramer's v chi square stats
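
For anyone following along, a rough sketch of what Cramér's V computes, in plain Python. `cramers_v` is a hypothetical helper written for illustration, with no bias correction, so it is not necessarily what the poster used. Keep in mind the uncorrected statistic tends to run high when there are many categories relative to the number of rows.

```python
from collections import Counter
from math import sqrt

def cramers_v(a, b):
    """Cramér's V for two equal-length sequences of category labels (no bias correction)."""
    n = len(a)
    row_totals = Counter(a)
    col_totals = Counter(b)
    observed = Counter(zip(a, b))

    # Chi-square statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else float("nan")

perfect = cramers_v(list("aabbcc"), list("xxyyzz"))   # one column fully determines the other
unrelated = cramers_v(list("aabb"), list("xyxy"))     # no association at all
print(perfect, unrelated)
```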

lapis sequoia Jun 22, 2020, 3:26 PM

#

there are other methods suggested in here but this is one i thought of

#

so i have a 6000row dataset and these categorical variables have 300-1000 unique values and correlations >0.80 or >0.90

sullen kiln Jun 22, 2020, 3:28 PM

#

that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.

desert oar Jun 22, 2020, 3:28 PM

#

One thing ive done is train a model with each of them separately and then with both together. If you dont see a big lift from having both, discard one

#

Its hacky

#

.8 is high

lapis sequoia Jun 22, 2020, 3:29 PM

#

so why cant i just drop some of the .8 corr variables

#

that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.
@sullen kiln i'll check out doing a pca

desert oar Jun 22, 2020, 3:29 PM

#

Yeah just do it tbh

lapis sequoia Jun 22, 2020, 3:29 PM

#

reason is im running low on time and i have to turn this over soon lol

desert oar Jun 22, 2020, 3:29 PM

#

PCA is a good idea

#

PCA features before regression is an old school ML technique anyway

sullen kiln Jun 22, 2020, 3:29 PM

#

yes, start with a more parsimonious model specification, and add/subtract each variable, try to establish what has more predictive power (and what makes more theoretical sense)

#

I wouldn't usually go manual like this, but its not a bad thing, its just best to do a PCA as it does so much legwork

lapis sequoia Jun 22, 2020, 3:30 PM

#

havent played around with pca much aside from learning about it in school

#

so i can use it to reduce cardinality then run a random forest on the pca data?

#

that's gonna reduce the interpretability a lot right

sullen kiln Jun 22, 2020, 3:31 PM

#

its very hard to quantitatively distinguish between variables of extreme cardinality, you will inevitably have bias, I would favour a proper dim reduction method over model tinkering

#

not necessarily, you will likely find several principal components that exhibit theoretical themes, if anything, you will be able to explain your data better

#

interpretability is increased

lapis sequoia Jun 22, 2020, 3:34 PM

#

ok thank you. should i drop the highly correlated variables (>.80) then run pca?

sullen kiln Jun 22, 2020, 3:34 PM

#

no

lapis sequoia Jun 22, 2020, 3:34 PM

#

just run pca on the entire dataset?

sullen kiln Jun 22, 2020, 3:34 PM

#

let the PCA analyse the data in its natural form, after all measures are normalized of course

lapis sequoia Jun 22, 2020, 3:34 PM

#

ok

#

ty

sullen kiln Jun 22, 2020, 3:34 PM

#

PFA might be needed with categorical variables, its not very fresh in the mind

lapis sequoia Jun 22, 2020, 3:35 PM

#

yea i was boutta ask how would i normalize cat variables

#

but i'll look it up

sullen kiln Jun 22, 2020, 3:36 PM

#

Could I ask someone for a bit of syntax help here? Simple Python/R stuff

#

#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)

#

I want to add to my function here, computing another mean over the range between q_at(0.10) and q_at(0.90)

#

I was going to try extract the statistics and merge them back into the data, but that is not elegant and not very viable with big data

desert oar Jun 22, 2020, 3:39 PM

#

what do you mean extract

#

you want to join the aggregated data back to the original data?

#

e.g. join df1 to incdata? that's pretty much the only way to do it

#

also

#

!code-block

arctic wedgeBOT Jun 22, 2020, 3:39 PM

#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')

sullen kiln Jun 22, 2020, 3:40 PM

#

hhmm, ok, I was hoping for a cleaner way. As I am extracting the descriptives from incdata. I wanted to extract a conditional descriptive - the mean between the inter-decile range, without adding more steps/merges

desert oar Jun 22, 2020, 3:44 PM

#

what do you mean, "the mean between the inter-decile range"

lapis sequoia Jun 22, 2020, 3:47 PM

#

The mean of the values between the 10th percentile and the 90th percentile. Basically a more generous IQR

#

Though I don't think there's an easier way to do that than what you're saying (if I understood you right)

desert oar Jun 22, 2020, 3:48 PM

#

oh

#

idk what the automatic variable names for those columns will be, but

#

f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
         q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f) 
df1['iqr'] = df1['q75'] - df1['q25']
df1['idr'] = df1['q10'] - df1['q90']

#

no?

lapis sequoia Jun 22, 2020, 3:51 PM

#

@sullen kiln is PFA = factor analysis?

#

PFA might be needed with categorical variables, its not very fresh in the mind

sullen kiln Jun 22, 2020, 3:53 PM

#

yes

#

yes @lapis sequoia and @desert oar, I want an IDR mean, without doing another merry-go-round and merging my statistics back into the data to re-compute an IDR mean

#

when I deploy this program to my data in work, i am dealing with a lot of data and limited pc memory, so I am trying to keep it concise

desert oar Jun 22, 2020, 3:54 PM

#

so does my code make sense or no

sullen kiln Jun 22, 2020, 3:55 PM

#

bear with

desert oar Jun 22, 2020, 3:55 PM

#

(btw this might be better to do in sqlite which is ~~definitely~~ probably going to be more memory efficient)

sullen kiln Jun 22, 2020, 3:56 PM

#

thanks for the tip, yeah, for work it might be best, I am writing the program in python and R and will do it in sql too, I just want options going forward

desert oar Jun 22, 2020, 3:56 PM

#

i think you should at least try my code though. sql is good because its the same in both python and r

#

so you connect to the same sqlite db and use the same query

lapis sequoia Jun 22, 2020, 3:57 PM

#

There's a library that works similar to pandas that doesn't store the entire df in memory and streams it as needed. It's quite cool, though I forgot its name

#

as far as I know it has all the same methods as pandas

slim fox Jun 22, 2020, 3:59 PM

#

dask?

lapis sequoia Jun 22, 2020, 3:59 PM

#

Yeah that sounds familiar

desert oar Jun 22, 2020, 4:01 PM

#

dask and vaex are 2 options

sullen kiln Jun 22, 2020, 4:05 PM

#

#The rename decorator renames the function so that the pandas agg function
#can deal with the reuse of the quantile function returned.

def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator

def q_at(y): #define a function q, for values y
    @rename(f'q{y:0.2f}') #define format renaming of new quantiles returned
    def q(TOTAL_MINUTES):
        return TOTAL_MINUTES.quantile(y)
    return q

#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1['iqr'] = df1['q0.75'] - df1['q0.25']
df1['idr'] = df1['q0.10'] - df1['q0.90']
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)

#

@desert oar this solution would require me to compute the values within the ranges for every row, whereas, I want to build in the adjusted means into the function

#

the function above exports a nice little descriptive table, df1, rather than transforming new columns in what will be a large dataset

desert oar Jun 22, 2020, 4:07 PM

#

im still not sure what you mean

#

oh

#

i just flipped the lines

#

f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
         q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
df1['iqr'] = df1['q0.75'] - df1['q0.25'] 
df1['idr'] = df1['q0.10'] - df1['q0.90']

#

the usual caveats apply about untested code written by volunteers on the internet

#

df1 didnt even exist in those first 2 lines

#

so the code clearly wouldnt have worked as written

#

wait

sullen kiln Jun 22, 2020, 4:08 PM

#

🙂 haha

desert oar Jun 22, 2020, 4:08 PM

#

you flipped them. i did it right. scroll up.

sullen kiln Jun 22, 2020, 4:08 PM

#

apologies

#

📎 unknown.png

#

this is the result i am after

#

with the IDR and IQR means also being computed in the function, without the need to merge quantiles back into the dataset
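
One way this could be done inside the same `agg` call, sketched under assumptions: `mean_between` is a hypothetical helper in the style of `q_at`, the toy `incdata` is invented, and "IDR mean" is read here as the mean of the values lying between the 10th and 90th percentiles of each group.

```python
import pandas as pd

def mean_between(lo, hi):
    """Build an aggregation: mean of the values between the lo and hi quantiles."""
    def f(s):
        lower, upper = s.quantile(lo), s.quantile(hi)
        return s[(s >= lower) & (s <= upper)].mean()
    # agg uses the function name as the output column name.
    f.__name__ = f"mean_q{lo:0.2f}_q{hi:0.2f}"
    return f

# Toy stand-in for the real data.
incdata = pd.DataFrame({
    "INC_TYPE": ["a"] * 5 + ["b"] * 5,
    "TOTAL_MINUTES": [1, 2, 3, 4, 100, 10, 20, 30, 40, 50],
})

df1 = incdata.groupby("INC_TYPE")["TOTAL_MINUTES"].agg(["mean", mean_between(0.10, 0.90)])
print(df1)
```

Because the quantiles are computed per group inside the aggregation, nothing has to be joined back onto the original rows. `mean_between(0.25, 0.75)` would give the IQR version the same way.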

desert oar Jun 22, 2020, 4:11 PM

#

you arent merging anything

#

df1 is your grouped summary table

sullen kiln Jun 22, 2020, 4:11 PM

#

yes

desert oar Jun 22, 2020, 4:12 PM

#

what i wrote gives you what you show in the screenshot

#

btw you also flipped q0.10 and q0.90

#

oh nvm i did that

lapis sequoia Jun 22, 2020, 4:13 PM

#

wait my data isnt ordinal so i dont think factor analysis will work

#

just so yall know lol

desert oar Jun 22, 2020, 4:14 PM

#

ive done PCA on one hot encoded categorical before

lapis sequoia Jun 22, 2020, 4:14 PM

#

i read that doesnt work well

desert oar Jun 22, 2020, 4:14 PM

#

yeah its kinda sus from a theoretical perspective

lapis sequoia Jun 22, 2020, 4:14 PM

#

you think i should do it anyways lol

desert oar Jun 22, 2020, 4:16 PM

#

id still be curious if target encoding works well

#

theres also this https://github.com/esafak/mca

GitHub

esafak/mca

Multiple correspondence analysis. Contribute to esafak/mca development by creating an account on GitHub.

#

idk how well it works for very high cardinality

#

theres also https://en.wikipedia.org/wiki/Random_projection

Random projection

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are known for their power, simplicity, and low error rates when compared to other methods. According to ...

lapis sequoia Jun 22, 2020, 4:17 PM

#

thanks

#

i looked up mca on google and it showed this lol

#

📎 unknown.png

desert oar Jun 22, 2020, 4:17 PM

#

hah

slim fox Jun 22, 2020, 4:17 PM

#

ive done PCA on one hot encoded categorical before
I did it too

#

surprisingly it can work

desert oar Jun 22, 2020, 4:17 PM

#

RIP MCA

lapis sequoia Jun 22, 2020, 4:18 PM

#

so id have to drop the numerical variables before doing MCA~~/PCA~~?

#

alright im just gonna try PCA

desert oar Jun 22, 2020, 4:18 PM

#

yeah you can concatenate them later

#

numerical features separately

lapis sequoia Jun 22, 2020, 4:18 PM

#

and if that doesnt work im gonna do MCA

desert oar Jun 22, 2020, 4:18 PM

#

then MCA'ed categorical features

#

dont think about it too hard

#

youre just playing with lego at this point

sullen kiln Jun 22, 2020, 4:18 PM

#

^

lapis sequoia Jun 22, 2020, 4:18 PM

#

ok thanks. but for pca all i have to do is just one hot encode then apply pca on one hot encoded + numeric data right

#

normalize numeric data first*

desert oar Jun 22, 2020, 4:19 PM

#

no thats what im saying. dont mix them

#

PCA the one-hot encoded categorical

lapis sequoia Jun 22, 2020, 4:19 PM

#

then just numbers?

desert oar Jun 22, 2020, 4:19 PM

#

then concatenate that with the numerical data

#

yes
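
A rough sketch of that recipe, using NumPy only and random made-up data (one categorical column with 30 levels, two numeric columns). PCA is done via SVD on the one-hot block alone, and the numeric block is scaled separately before the two are concatenated. The 90% cutoff mirrors the threshold mentioned in the chat; everything else is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical data: one categorical column with 30 levels, two numeric columns.
categories = rng.integers(0, 30, size=n_rows)
numeric = rng.normal(size=(n_rows, 2))

# One-hot encode the categorical column.
one_hot = np.zeros((n_rows, 30))
one_hot[np.arange(n_rows), categories] = 1.0

# PCA via SVD on the centered one-hot block only.
centered = one_hot - one_hot.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)  # fewest components reaching 90%
cat_components = centered @ vt[:k].T

# Standardize the numeric block separately, then concatenate the two blocks.
numeric_scaled = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)
features = np.hstack([cat_components, numeric_scaled])
print(k, features.shape)
```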

lapis sequoia Jun 22, 2020, 4:19 PM

#

aahhhhhh okkk thx

slim fox Jun 22, 2020, 4:19 PM

#

yeah otherwise PCA will likely throw off those categorical

lapis sequoia Jun 22, 2020, 4:19 PM

#

is it bc u dont wanna lose info from the numbers?

desert oar Jun 22, 2020, 4:19 PM

#

another option is feature hashing

slim fox Jun 22, 2020, 4:19 PM

#

if they have 1000s classes

desert oar Jun 22, 2020, 4:20 PM

#

you can combine all of your 15 categorical features into 1 big "multi categorical" feature

#

and hash it all down to 1000 buckets
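
A minimal sketch of the hashing idea, standard library only. The feature names, values and the helpers `bucket` and `hash_row` are all hypothetical, and real libraries differ in the hash function and in how they handle collisions.

```python
import hashlib

N_BUCKETS = 1000

def bucket(feature_name, value, n_buckets=N_BUCKETS):
    """Map a (feature, value) pair to a stable bucket index in [0, n_buckets)."""
    key = f"{feature_name}={value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_row(row, n_buckets=N_BUCKETS):
    """Turn a dict of categorical features into a sparse {bucket: count} mapping."""
    hashed = {}
    for name, value in row.items():
        idx = bucket(name, value, n_buckets)
        hashed[idx] = hashed.get(idx, 0) + 1
    return hashed

row = {"business_unit": "BU-17", "vendor": "ACME", "region": "EMEA"}
print(hash_row(row))
```

Prefixing the value with its feature name is what merges all the columns into one shared "multi categorical" space. Two different values can land in the same bucket, which is the accepted trade-off for a fixed width.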

lapis sequoia Jun 22, 2020, 4:20 PM

#

alright im gonna try pca first

desert oar Jun 22, 2020, 4:20 PM

#

that's basically how vowpal wabbit works

#

and fasttext

lapis sequoia Jun 22, 2020, 4:20 PM

#

what about the strong correlations >0.80? is it ok to leave those columns in the pca

desert oar Jun 22, 2020, 4:20 PM

#

yes thats the whole point of pca

#

or MCA

lapis sequoia Jun 22, 2020, 4:21 PM

#

yea ur right lol

#

thx yall

robust dome Jun 22, 2020, 4:22 PM

#

hey can someone help me understand this list comprehension:
counts = dict([[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet])

desert oar Jun 22, 2020, 4:22 PM

#

yuck, a list comprehension being passed to dict

#

dict(
    [
        [letter, sentence.count(letter)]
        for letter in set(sentence)
        if letter in alphabet
    ]
)

not sure if that helps...

pale thunder Jun 22, 2020, 4:25 PM

#

seems to be Counter(sentence) with all keys that are not in alphabet filtered out

desert oar Jun 22, 2020, 4:25 PM

#

yeah sentence might be some special thing

#

im not going to second guess that

#

this is kind of amateur code

#

id write it like this

dict((letter, sentence.count(letter)) for letter in set(sentence) & set(alphabet))

#

or

dict((letter, sentence.count(letter)) for letter in set(sentence) if letter in alphabet)

at least
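
To check that the Counter reading is right, a quick sketch with made-up inputs (assuming `sentence` is a plain string and `alphabet` is the lowercase letters):

```python
from collections import Counter
import string

sentence = "Hello, World"
alphabet = string.ascii_lowercase

# The original one-liner from the question.
original = dict([[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet])

# Same result: count everything once, then keep only keys found in alphabet.
with_counter = {letter: n for letter, n in Counter(sentence).items() if letter in alphabet}

print(original == with_counter)
print(sorted(with_counter.items()))
```

The Counter version walks the sentence once, while `sentence.count(letter)` rescans it for every distinct letter.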

lapis sequoia Jun 22, 2020, 4:32 PM

#

wait

#

in pca

#

when i one hot encode

#

it wont have the problem where the model assumes 1>0 right

#

well it's not an issue right

#

even if it's nominal

#

(i.e. i have several business units i one hot encoded and will do pca on)

desert oar Jun 22, 2020, 4:34 PM

#

eh?

#

1 = yes, 0 = no

#

so after you one hot encode, 1 is greater than 0

#

w/out loss of generality you can flip 1 and 0 but

lapis sequoia Jun 22, 2020, 4:34 PM

#

yea i was just making sure

desert oar Jun 22, 2020, 4:34 PM

#

then you run into things like "actually computing the result"

#

yeah

#

youre fine

lapis sequoia Jun 22, 2020, 4:35 PM

#

ty

gritty solstice Jun 22, 2020, 4:54 PM

#

I need some help :(

I'm using a dataset that has location information in the first few columns, as well as a column for each date in the data, with a corresponding value

Example:

+------+-----------+------------+---------+---------+---------+
| Org | Org Owner | Building# | 4/1/20 | 5/1/20 | 6/1/20 |
+------+-----------+------------+---------+---------+---------+
| OrgA | John Doe | 1234 | $1,256 | $987 | $1,562 |
+------+-----------+------------+---------+---------+---------+

There are a few more columns identifying specifics about the org that I need to retain, and there are tens of date columns, one per date.

The values in these date columns are the corresponding sales.

I'm trying to pivot the table so that each row has a single date, its value, and the org information

How can I achieve that with pandas?

IE:
+------+-----------+--------------+------+------+
| Org | Building# | Org Owner | Date | Sales |
+------+-----------+--------------+------+------+

lapis sequoia Jun 22, 2020, 5:01 PM

#

pd.pivot_table

gritty solstice Jun 22, 2020, 5:02 PM

#

Which I'm aware of, but can't seem to find any way to get my desired results

#

IE: use columns [:3] as the columns, [3:] as index, and the values per row of the date columns as the values in the new column 'date'

#

Here's an example. Some columns are removed, and all data is false

#

📎 unknown.png

#

It's like a hybrid of transpose and pivot I think..?

ripe forge Jun 22, 2020, 5:22 PM

#

You can probably create the desired output by slicing the data frames into two parts

#

How many rows are there in the top dataframe

gritty solstice Jun 22, 2020, 5:22 PM

#

Currently 3

ripe forge Jun 22, 2020, 5:23 PM

#

If you could make code that generates a dummy dataframe with at least 2 rows just like that, that will be super helpful

#

Meanwhile let me get to a laptop

gritty solstice Jun 22, 2020, 5:24 PM

#

sure thing, one sec

desert oar Jun 22, 2020, 5:27 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html @gritty solstice

gritty solstice Jun 22, 2020, 5:28 PM

#

import pandas as pd

dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})

# one column per date, every cell holding the same dummy sales value
for date in dates:
    df[date] = '$1200'
df.head()

Checking now @desert oar

ripe forge Jun 22, 2020, 5:29 PM

#

looks like that melt probably does the trick

gritty solstice Jun 22, 2020, 5:30 PM

#

god

#

bless

#

That looks like it did the trick

#

Thank you guys so much!

#

final code to help those who need it (using the sample data)

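import pandas as pd

dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})

for date in dates:
    df[date] = '$1200'
# keep the first 3 columns as identifiers; every date column becomes a (Date, Sales) row
df.melt(id_vars=list(df.columns[:3]), var_name='Date', value_name='Sales')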

desert oar Jun 22, 2020, 5:32 PM

#

👍

lapis sequoia Jun 22, 2020, 5:35 PM

#

using pca with 90% variance reduced columns from 9000 to 420

#

so thats fine for a random forest right

#

6000 rows x 420 cols?

desert oar Jun 22, 2020, 5:38 PM

#

yeah but you might need a large number of trees to get a good sample of columns

#

depending on tree depth

lapis sequoia Jun 22, 2020, 5:38 PM

#

oh shoot. i used the default of 100 in my previous models (which had ~60-80 columns)

#

im running the random forest right now with 100 trees. i'll change it

desert oar Jun 22, 2020, 5:39 PM

#

100 is (probably) too low even for 80 columns

#

you should always start with like 10 though

lapis sequoia Jun 22, 2020, 5:39 PM

#

should i use a grid search?

desert oar Jun 22, 2020, 5:39 PM

#

just to make sure it works

#

no

#

grid search over tree params, not number of trees

lapis sequoia Jun 22, 2020, 5:40 PM

#

is there a rule of thumb for number of trees you should use

#

i looked it up and people said 64-128 but obv it depends on data size

#

LOL wtf i got an 18% R-squared

#

my baseline got like 35

desert oar Jun 22, 2020, 5:41 PM

#

welcome to data science

#

for real this time

#

lol

lapis sequoia Jun 22, 2020, 5:42 PM

#

im gonna try 200 trees

desert oar Jun 22, 2020, 5:43 PM

#

id look at tree depth first

#

at 100

#

if its a fast model to train you probably can just grid search and go get a cup of coffee

lapis sequoia Jun 22, 2020, 5:44 PM

#

lol yeah im gonna play around with the parameters a bit

#

any ideas as to why PCA with 90% variance explained yielded such shitty performance?

desert oar Jun 22, 2020, 5:45 PM

#

maybe the resulting features are junk

lapis sequoia Jun 22, 2020, 5:47 PM

#

or bc i used it after one hot encoding?

desert oar Jun 22, 2020, 5:48 PM

#

could be, who knows

#

try MCA

#

did you look at the features?

#

do they look sane?

lapis sequoia Jun 22, 2020, 5:48 PM

#

yeah the principal components?

#

i mean they look like principal components to me lol

#

ill try pca after playin around with parameters

desert oar Jun 22, 2020, 5:51 PM

#

look at the loadings

#

id seriously consider MCA instead

lapis sequoia Jun 22, 2020, 5:57 PM

#

yeah im gonna do MCA now

vital cipher Jun 22, 2020, 6:06 PM

#

anyone here working on object detection using yolov5?? 🙂

limber pollen Jun 22, 2020, 6:08 PM

#

Hello guys i m new to this group and learning data science

serene scaffold Jun 22, 2020, 6:10 PM

#

I'm getting ready to slam my head into a wall.

#

I found a cython implementation for a KD tree, and I made a subclass of np.ndarray that contains a reference to what each ndarray represents

#

but when the kd tree is constructed, it gets rid of that wrapping

#

the pure python kd tree is too slow.

desert oar Jun 22, 2020, 6:11 PM

#

but when the kd tree is constructed, it gets rid of that wrapping
what do you mean

#

and show your code as usual

serene scaffold Jun 22, 2020, 6:11 PM

#

warning: it's terrible code

desert oar Jun 22, 2020, 6:12 PM

#

its cython, nobody writes good cython code

lapis sequoia Jun 22, 2020, 6:12 PM

#

sorry salt rock quick question -- the problem cant be because i didnt normalize the numerical vars right?

desert oar Jun 22, 2020, 6:13 PM

#

it absolutely can be

lapis sequoia Jun 22, 2020, 6:13 PM

#

because i normalized the one hot encoded and not numerics

desert oar Jun 22, 2020, 6:13 PM

#

well... less likely for random forest

#

but still possible

lapis sequoia Jun 22, 2020, 6:13 PM

#

should i use pca with numerics as well or move onto mca?

desert oar Jun 22, 2020, 6:13 PM

#

dont mix them in the pca

lapis sequoia Jun 22, 2020, 6:13 PM

#

i just want to understand why we handled them separately

#

the numerics vs categoricals

serene scaffold Jun 22, 2020, 6:13 PM

#

This part is pure Python.

import numpy as np
import scipy as sp
import scipy.spatial  # makes sp.spatial available
import torch
from gensim.models import KeyedVectors

class Vocab(np.ndarray):
    """ndarray subclass meant to carry a `token` attribute."""

def create_tensor(array, token, padding):
    # Append `padding` zeros to the embedding, then tag it with its token.
    many_zeroes = np.zeros((padding,), np.float64)  # np.float is a removed alias
    tensor = np.concatenate((array, many_zeroes))
    tensor = tensor.view(Vocab)
    tensor.token = token
    return tensor

cuis = KeyedVectors.load_word2vec_format('/home/farnsworthsw/datasets/s1975.cui.200.bin', binary=True)
print('make tree')
vocab = [create_tensor(cuis[k], k, 568) for k in cuis.vocab.keys()]
print('vocab made')
tree = sp.spatial.cKDTree(vocab)
print('tree made')

# tokenizer, model and vocab_lookup are defined elsewhere (not shown here)
def learn(mention: str) -> str:
    tensor = torch.cuda.LongTensor(tokenizer.encode(mention)).unsqueeze(0)
    bert_output = model(tensor)[0][0]
    bert_output = bert_output.cpu().detach().numpy()
    best = tree.query(bert_output)
    print(vocab_lookup[best[0]])
    return best

desert oar Jun 22, 2020, 6:14 PM

#

you could try it i guess. im thinking that it will be weird because the variance of a numerical variable might or might not be on par with the variances of a bunch of one hot columns @lapis sequoia

#

just seems incongruous
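To make the variance point concrete, here is a minimal sketch using only NumPy. The data is synthetic and the column names (`income`, a 3-level category) are made up for illustration; it is not the dataset discussed in the chat. It compares the first principal component's loadings with and without per-column scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: one large-variance numeric column plus a 3-level one-hot block.
income = rng.normal(50_000, 15_000, size=n)
category = rng.integers(0, 3, size=n)
one_hot = np.eye(3)[category]
X = np.column_stack([income, one_hot])

def first_pc_loadings(X):
    """Loadings of the first principal component via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

raw = first_pc_loadings(X)                     # unscaled columns
scaled = first_pc_loadings(X / X.std(axis=0))  # every column scaled to unit variance

print("raw loadings:   ", np.round(raw, 3))
print("scaled loadings:", np.round(scaled, 3))
```

On the raw data the first component is almost entirely the numeric column, because its variance is many orders of magnitude larger than that of any 0/1 column. After scaling, the one-hot block carries most of the first component instead. Neither result is obviously "right", which is one reason to look at the loadings or to consider MCA for the categorical part.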

lapis sequoia Jun 22, 2020, 6:14 PM

#

alright thank you

desert oar Jun 22, 2020, 6:14 PM

#

oh i thought you wrote the kd yourself @serene scaffold

serene scaffold Jun 22, 2020, 6:14 PM

#

no

desert oar Jun 22, 2020, 6:15 PM

#

C isn't flexible like that. presumably it's just stripping out the data into some kind of lower level buffer data structure thing
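A small sketch of what "stripping out the data" looks like from the Python side. It uses plain NumPy rather than the KD tree itself: converting a list of subclass instances into one array (which is, as an assumption, roughly what a compiled consumer does with its input) yields a base `ndarray` with no custom attributes. The `tag` helper and the token strings are hypothetical.

```python
import numpy as np

class Vocab(np.ndarray):
    """ndarray subclass that can carry a `token` attribute."""

def tag(array, token):
    # View the data as a Vocab and attach the token to that instance.
    tagged = np.asarray(array, dtype=np.float64).view(Vocab)
    tagged.token = token
    return tagged

vocab = [tag([0.0, 1.0], "C001"), tag([2.0, 3.0], "C002")]  # hypothetical tokens

# Stacking the rows into a single 2-D array copies the numbers only.
stacked = np.asarray(vocab)

print(type(vocab[0]).__name__, vocab[0].token)  # the individual row keeps its tag
print(type(stacked).__name__, hasattr(stacked, "token"))
```

The per-row `token` attribute lives on each Python object, not in the numeric buffer, so anything that only copies the buffer cannot preserve it.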

serene scaffold Jun 22, 2020, 6:15 PM

#

😦

#

I'm not sure how else to have vectors that represent specific things.

desert oar Jun 22, 2020, 6:16 PM

#

what does the tree return from a query?

#

a number?

serene scaffold Jun 22, 2020, 6:16 PM

#

an ndarray

#

actually a tuple of them, but sometimes there's only one.

desert oar Jun 22, 2020, 6:17 PM

#

ahhh

#

wait

#

🗞️ you didnt read the docs

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.query.html#scipy.spatial.cKDTree.query

#

it returns a tuple

#

1st elem, the distances to the nearest neighbors

#

2nd element, the integer indexes of the vectors

#

unless it's shuffling the order of the data in which case the whole thing is kind of useless anyway

#

another possibility is to use a dict where the key is the vector (converted to a tuple or bytes, since ndarrays themselves aren't hashable)

#

because hashing magic
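A minimal sketch of the dict idea, with made-up vectors and token names. A raw `ndarray` cannot be used as a dict key, so this version keys on the row's bytes; that only works for exact, bit-for-bit matches of the stored vector.

```python
import numpy as np

vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tokens = ["C001", "C002", "C003"]  # hypothetical labels, one per row

# Key each token by the raw bytes of its vector.
token_by_vector = {row.tobytes(): tok for row, tok in zip(vectors, tokens)}

found = token_by_vector[vectors[1].tobytes()]
print(found)

# An ndarray itself is unhashable, which is why the conversion is needed.
try:
    hash(vectors[1])
    ndarray_is_hashable = True
except TypeError:
    ndarray_is_hashable = False
print(ndarray_is_hashable)
```

Because floating-point vectors rarely match exactly after any arithmetic, looking tokens up by integer index is usually the more robust choice.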

serene scaffold Jun 22, 2020, 6:18 PM

#

so have a separate lookup table for what the vector at each index represents

#

and look that up?

desert oar Jun 22, 2020, 6:19 PM

#

yeah i think youd have to

#

that or just a list/array

#

so you can look them up by position

#

from the 2nd returned element
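The lookup-by-position idea can be sketched like this, assuming a parallel list of tokens kept in the same order as the rows handed to the tree. The vectors, the token names and the `nearest_token` helper are invented for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Row i of `vectors` corresponds to tokens[i]; the order must stay in sync.
tokens = ["C001", "C002", "C003"]  # hypothetical labels
vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = cKDTree(vectors)

def nearest_token(query):
    """Return the token of the nearest stored vector and its distance."""
    distance, index = tree.query(query)  # second element is the row index
    return tokens[index], distance

token, distance = nearest_token(np.array([0.9, 1.2]))
print(token, distance)
```

The returned indices refer to the rows in the order they were passed to the constructor, so a plain list or array of labels is enough and no `ndarray` subclass is needed.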

serene scaffold Jun 22, 2020, 6:20 PM

#

😄

#

I'll try to make that work

#

Thanks 😄

#

however, the docs say that it returns only the 1 nearest neighbor by default

#

but I'm getting 3

#

Actually it's just representing the location of the vector as a 3d thing

desert oar Jun 22, 2020, 6:29 PM

#

eh? i thought you select that with k

#

it returns 1) the distances, and 2) the indexes
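A short sketch of how `k` and the query shape affect what comes back, using random points rather than the embeddings from the chat. The last call shows one possible reason for seeing three results with the default `k=1`: passing a 2-D array of three query vectors (for example one vector per token). That explanation is an assumption, not something confirmed in the conversation.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((50, 4))
tree = cKDTree(points)

single = points[7]

# Default k=1 with one query vector: a scalar distance and a scalar index.
d1, i1 = tree.query(single)

# k=3 with one query vector: arrays of 3 distances and 3 indices.
d3, i3 = tree.query(single, k=3)

# Three query vectors at once with k=1: one distance and one index per query.
batch_d, batch_i = tree.query(points[:3])

print(np.ndim(d1), i1)
print(d3.shape, i3)
print(batch_d.shape, batch_i)
```

Checking the shape of the array passed to `query` is a quick way to tell whether extra results come from `k` or from querying several vectors at once.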