#data-science-and-ml

flint root
#

Hi. Does anyone know how I can use compression algorithms in data science?

#

I would like to have a few examples 🙂

lapis sequoia
#

how can i impute categorical missing values using knn?

ripe forge
#

Oh dear, sounds fancy. If you have enough data points, dropping rows might be easier

lapis sequoia
#

i only have 6000 rows

#

and some columns are missing 50% of the data

#

actually only one column is

ripe forge
#

If a column is missing that much, drop the column

#

Anyways just for the general idea though, the column which you want to impute, treat it like target column for a knn

#

Train the knn using the rows for which this column in question has values present. Predict the values for rows where this column (which is now the target for the knn) is missing.

#

But if you have a lot of values missing, you can't blindly run knn on it. I personally don't think you should be trying to impute any column with more than 50% data missing at all. Drop the column, or figure out some different style of handling the missing values if your data understanding offers some way

#

A column like that should probably be dropped honestly
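
A minimal sketch of the idea described above (treat the column to impute as a k-NN target), using only the standard library. The column names, the toy rows, and the `knn_impute` helper are hypothetical, and the distance is plain Euclidean on unscaled numeric features, which real data would usually need to standardise first.

```python
from collections import Counter
import math

def knn_impute(rows, target, features, k=3):
    """Fill missing (None) values of `target` by majority vote of the k nearest complete rows."""
    # "Training" set: rows where the target column is present
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        # Sort complete rows by Euclidean distance on the numeric features
        nearest = sorted(
            known,
            key=lambda other: math.dist(point, [other[f] for f in features]),
        )[:k]
        vote = Counter(n[target] for n in nearest).most_common(1)[0][0]
        filled.append({**r, target: vote})
    return filled

rows = [
    {"age": 25, "income": 30, "segment": "a"},
    {"age": 27, "income": 32, "segment": "a"},
    {"age": 26, "income": 31, "segment": "a"},
    {"age": 60, "income": 90, "segment": "b"},
    {"age": 62, "income": 95, "segment": "b"},
    {"age": 61, "income": 93, "segment": None},
    {"age": 24, "income": 29, "segment": None},
]
imputed = knn_impute(rows, target="segment", features=["age", "income"], k=3)
```

Each column to impute needs its own pass like this, which is part of why it gets expensive when many columns have gaps.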

marsh fog
#

Is anyone here familiar with plotly? Specifically buttons / dropdowns ?

lapis sequoia
#

alright thank you @ripe forge

#

what do you think is the missing data threshold proportion for a column to be dropped

#

30%? 50%?

#

i already dropped all columns missing more than 60% of values

ripe forge
#

Tough to say, but that's a good question to ask

#

Sadly the answer is : it depends.

lapis sequoia
#

alright, im just gonna look at the distribution of missing values in my data, and see if there's a falloff at a particular point

ripe forge
#

If you can sensibly fill or aggregate some of the missing values, it becomes easier to manage

lapis sequoia
#

> Anyways just for the general idea though, the column which you want to impute, treat it like target column for a knn
also i probably need to impute like 50 columns lol. would i have to run 50 knns?

#

if theres no easier way than that im just gonna replace with the mode^

ripe forge
#

You know, I'm not a 100% sure on that one.

lapis sequoia
#

alright

#

well for now im just gonna replace with mode since that's a lot easier. i'll look into knn replacement later
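
For reference, mode replacement across many columns can be done in one loop with pandas. This is a sketch on a made-up two-column frame. When a column has several equally common values, `mode()` returns them sorted and the code simply takes the first.

```python
import pandas as pd

# Hypothetical toy data with gaps in both categorical columns
df = pd.DataFrame({
    "color": ["red", "blue", None, "red", None],
    "size": ["S", None, "M", "M", "M"],
})

# Fill each column's missing entries with that column's most frequent value
for col in df.columns:
    mode = df[col].mode(dropna=True)
    if not mode.empty:  # a column that is entirely missing has no mode
        df[col] = df[col].fillna(mode.iloc[0])
```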

#

thank you

ripe forge
#

Sounds good, cheers

tranquil falcon
#

@lapis sequoia im more of a stats guy than programming guy, but about missing data -- it really depends on why the data is missing. MCAR vs MAR vs MNAR etc https://en.wikipedia.org/wiki/Missing_data . id personally try to figure out why the data is missing first more than anything. if speed is a concern, just drop it unless you have domain knowledge saying that column really matters... and if it does really matter... then you have a bigger problem than DS if your dataset is missing a critical component 🙂


lapis sequoia
#

thank you

desert oar
#

^ yep great point. i was remiss in not mentioning MCAR,MAR,MNAR

#

some imputation methods only make sense for certain types of missingness

fleet rose
#

Hi, is this chat for neural networks?

#

Well, it says machine learning

#

So, I'm looking forward to learning more about it

lapis sequoia
#

if i want to see association between categorical variables, what are my best options? i was thinking use a contingency table to see frequency distributions

#

obviously i cant use pearson correlation

#

and what if my variables have high cardinality (i.e. categorical level with 200+ levels) and i want to see association

desert oar
#

mutual information

#

or chi square

#

mutual information has better properties imo but chi square is a lot faster to compute

#

the built-in scikit-learn MI is single-threaded and very slow

#

you can also one-hot-encode and do stuff like hamming or jaccard similarity. depending on the type of feature
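
A small standard-library sketch of the two measures mentioned above, both computed from the contingency counts of two categorical columns. The function names and toy data are illustrative only. With very high-cardinality columns many expected counts become tiny, so the chi-square statistic in particular should be read with caution there.

```python
from collections import Counter
import math

def chi_square(xs, ys):
    """Pearson chi-square statistic for the contingency table of xs vs ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    stat = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n      # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def mutual_information(xs, ys):
    """Mutual information in nats, estimated from observed frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

xs = ["a", "a", "b", "b"]
dependent_ys = ["u", "u", "v", "v"]     # fully determined by xs
independent_ys = ["u", "v", "u", "v"]   # unrelated to xs

chi_dep = chi_square(xs, dependent_ys)
mi_dep = mutual_information(xs, dependent_ys)
chi_ind = chi_square(xs, independent_ys)
mi_ind = mutual_information(xs, independent_ys)
```

Both measures are zero for the unrelated pair and positive for the dependent pair.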

lapis sequoia
#

alright thanks. ill look up mutual information

marsh chasm
#

has anyone worked with pandas' series before

#

theres a part of the documentation that's worded a little bit funny

desert oar
#

Yes

marsh chasm
#

wait nvm i think i got it...

blazing bridge
#

"R² is the percentage variation in y explained by all the x variables together.

For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72; that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."

#

What does variation in y mean in simple terms

lapis sequoia
#

anyone know why the below line gives me an error? ws.range('A2').value = results

#
import pyodbc
import xlwings as xw

wb = xw.Book('Book1.xlsx')
ws = wb.sheets['Sheet1']

def read(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM people")
    results = cursor.fetchall()
    print(results) # prints a list of tuples, each tuple is a record
    
    # works fine
    ws.range('A2').value = [(1, 'joe', 'emmets', 23),
                            (2, 'sally', 'jacobs', 37),
                            (3, 'katie', 'falcone', 42)]
    
    # does not work - why?
    ws.range('A2').value = results

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 11.0};"
    # raw string so the backslash is not read as an escape sequence
    r"Server=(LocalDB)\MSSQLLocalDB;"
    "Database=testdb;"
    "Trusted_Connection=yes;"
)

read(conn)
conn.close()
#

😦

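One likely cause, offered as an assumption rather than a confirmed diagnosis: `cursor.fetchall()` in pyodbc returns `pyodbc.Row` objects, which print like tuples but are not tuples, and the hard-coded list that works is made of real tuples. Converting each row first is a common workaround. The sketch below uses a hypothetical `FakeRow` stand-in so it runs without a database or Excel.

```python
class FakeRow:
    """Hypothetical stand-in for pyodbc.Row: iterable and indexable, but not a tuple."""

    def __init__(self, *values):
        self._values = values

    def __iter__(self):
        return iter(self._values)

    def __getitem__(self, i):
        return self._values[i]

# What fetchall() hands back, approximately
results = [FakeRow(1, "joe", "emmets", 23), FakeRow(2, "sally", "jacobs", 37)]

# Convert every row object to a plain tuple before assigning it to the sheet range
plain_rows = [tuple(row) for row in results]
```

In the original script that would mean assigning `[tuple(row) for row in results]` to `ws.range('A2').value` instead of `results`.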
tranquil falcon
#

@blazing bridge r-squared ranges from 0 to 1. higher r-squared typically means better. a cynical and basic way of looking at it is it just tells you how good or bad a line fits to your points. compare these plots

#

in your 0.72 example, it means that 28% of the variance is unaccounted for. this is just fancy stats speak for saying "hey like 30% of the model can't be explained using the variables we have, so thats why our predictions might be off if we were to use the parameters from this model"

turbid hazel
tranquil falcon
#

@blazing bridge if it helps you wrap your head around it conceptually... for a simple linear regression, r-squared is literally just the square of the data's correlation. but correlation can range from -1 to 1 while r-squared is only positive (0 - 1)

uncut shadow
#

@turbid hazel You should provide errors If you have one, or some content

blazing bridge
#

so @tranquil falcon this basically means the line of best fit is only 72% accurate and the rest is not accurate with our x values. So 28% is inaccurate, therefore the variation in y is 28%.

#

is this correct

brazen cloud
#

hi there everyone! in my code i have an error
```python
import numpy as np
import matplotlib.pyplot as plt

incomes = np.random.normal(100.0, 50.0, 10000)
incomes.append(900)

plt.hist(incomes, 50)
plt.show()
```
the error is
```
AttributeError Traceback (most recent call last)
<ipython-input-4-9b18b9dd3278> in <module>
4
5 incomes = np.random.normal(100.0, 50.0, 10000)
----> 6 incomes.append(900)
7
8 plt.hist(incomes, 50)

AttributeError: 'numpy.ndarray' object has no attribute 'append'
```

#

it says numpy doesnt have an append attribute?

#

how can i add a data point to my array

lapis sequoia
#

In a dataframe

turbid hazel
#

@uncut shadow for some reason literally nothing is outputted when i run it. no errors, nothing

tranquil falcon
#

@blazing bridge not quite. i think conceptually youre getting there but im not sure id use the word 'accurate' here. when we're talking about variance, which is what r-squared is trying to describe for a linear model, i think of it more about fuzziness in our actual guesses / predictions

like if we had a model to guess where a ball will land, if we had a model with 0.72 r-squared vs a model that is 0.95 r-squared, both COULD be accurate in a real-world scenario, but the 0.95 r-squared one would be more likely to have better results and less error on where the ball would land

#

@blazing bridge but long story short, you could make an argument that a model with 0.72 r-squared is probably more likely to be more accurate than a model with 0.65 r-squared, but that's not necessarily a fact. thats why i wouldnt use the words accurate vs inaccurate here because usually 'accurate' has to do with predictions while r-squared really just has to do with explaining variance

blazing bridge
#

@tranquil falcon Thank you for clearing it up, one more question what does variance mean

#

is that the error that the line produces

#

like in simple linear regression we use the Squared error to see how well our line has fit the data

#

in multiple linear regression is r-squared the equivalent of the loss with two or more variabels

#

*variables

tranquil falcon
#

@livid turret no prob. variance is basically the center of the universe for statistics -- its really just a fancy word for how much variety/range there is in your data set. high variance = wider range of values low variance = smaller range of values.

so you mentioned error -- when you calculate an error, you have usually included a standard deviation somewhere in your calculation. standard deviation is literally the square root of variance. so higher variance = higher standard deviation, which would also mean higher error. but bear in mind higher calculated error doesnt necessarily mean less accurate (it all depends on context!!), it just means a wider range of expected values.

and to answer your question if im interpreting it correctly, then yes conceptually r-squared is showing the equivalent of the incorrect guesses. r-squared, mathematically underneath the hood, is actually calculating how far each point is from your best fitted line and aggregating that data. so if you have a lot of points that are really far away from the line that would indicate maybe the line, even if its the best fitted line, isnt taking into account other factors (thus the whole it explains x% of variance) that are relevant to the real world

#

but bear in mind r-squared is really only used for linear models. and there are many, many, many, many real-world scenarios where a linear model just doesn't work or doesnt make sense 🙂

#

(bear in mind a true academic statistician would be murdering me for some of the things ive said but im just trying to explain things conceptually and how they relate)

icy flax
#

not sure if this is the spot for this, but im doing some plotly data visualizations and running into the issue of a slider I implemented not being displayed at all. Is anyone familiar with plotly / networkX and would be willing to assist me for a bit? Thanks!

blazing bridge
#

ok thank you so much @tranquil falcon

paper niche
#

the data that is being passed into your cum_freq_fn isn't a column name, it's the element(s) themselves, that's why you're getting a KeyError

#

It's probably worth noting, there's already a method of pandas series to calculate the quantile value

desert oar
#

^

lapis sequoia
#

yep! salt rock explained it to me, thank you both for helping out though

flat quest
#

btw just a tip for ML.

Might be good to have a baseline knowledge of statistics @blazing bridge
That should help you better understand how ML models actually work, and what concepts like std, gaussian, variance, etc. are.

blazing bridge
#

yeah @flat quest I was thinking about that but I dont know where to start

#

Im only in grade 10 so i dont know what would make sense for me

flat quest
#

ah nice

yeah best place to start is prob something like khanacademy and then go to like opencourseware. @blazing bridge . YT is also a pretty good place to look at

blazing bridge
#

ok thank you @flat quest what would you say variance is in simple terms

flat quest
#

well mathematically variance is the square of the standard deviation

like zammy said, variance is a measure of spread of your data. How much distance or change in numeric value is between your data points.

Are they clustered around certain values? thats small variance or spread.
If they're widely spread out, you have high spread or variance.

blazing bridge
#

"For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72; that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent)."

#

could you explain what they mean over here

desert oar
#

I like to think of the variance as the distance between your dataset and a dataset where every data point is equal to the mean

#

Or something not unlike the average distance from the mean although the whole squared part overweights points that are farther away vs a true mean
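
The "average squared distance from the mean" reading can be checked by hand. A short sketch with made-up numbers, computing the population variance (dividing by n) and the standard deviation as its square root:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.fmean(data)

# Population variance: the average squared distance of each point from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance
std_dev = variance ** 0.5

# The library function gives the same population variance
library_variance = statistics.pvariance(data)
```

Squaring is what gives far-away points extra weight: a point 4 units from the mean contributes 16, while a point 1 unit away contributes only 1.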

lapis sequoia
#

Hello everyone,
I have just started with data science.
I am thorough with the basics of pandas and matplotlib, also for web scraping I know selenium, also beautiful Soup . Please advise me on what to do next.

slate scroll
#

@lapis sequoia That depends on what areas of data science you're interested in. But I'd encourage you to also take on some real world projects. If you're through the basics of pandas/matplotlib and BS4 there's plenty of opportunities to visualize and analyze scraped data.

lapis sequoia
#

I just used them to make a project,
So basically what it does is look up the inputted hashtags on Instagram and scrape all the posts' hashtags
Further on I have added some keywords
The program filters all the hashtags with the given keywords and presents them to me in sets of 25 in a word file

slate scroll
#

So where's the data science? That sounds like pure scraping.

lapis sequoia
#

Basic pandas

#

I dont know much about data science, just wanna know how to get into it

slate scroll
untold anvil
#

@brazen cloud Hi, you can use the tolist() method to convert your array into a list, then use append on the list to add your data point, and then convert it back to an array

lapis sequoia
#

What do you think I should take up next? @slate scroll

untold anvil
#

@brazen cloud or you can use the np.append function. It's easier to do it this way
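
A short sketch of the `np.append` route applied to the `incomes` array from the question. The point to keep in mind is that `np.append` returns a new array instead of changing the existing one, so the result has to be assigned back.

```python
import numpy as np

incomes = np.random.normal(100.0, 50.0, 10000)

# np.append does not modify `incomes` in place; it builds and returns a new array
incomes = np.append(incomes, 900.0)
```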

slate scroll
#

@lapis sequoia That really depends on what kind of career you're interested in. With some basic knowledge under your belt, what parts of it did you most enjoy?

lapis sequoia
#

Oh I really enjoy automating with selenium and pulling off data from internet

#

I just dont know what's next @slate scroll

slate scroll
#

If data collection and augmentation is what you enjoy then data engineering would be a great path.

#

For most real-world examples that'll include things like streaming data, Kafka, and big data tools like hadoop, hive, hbase, bigquery, etc...

#

SQL would be a great next step towards that world too

lapis sequoia
#

So what is data engineering all about?

slate scroll
#

Well data engineering is all about collecting data and augmenting it with other data sources. Think about a website (my work so an easy example). They're constantly collecting tons of data on how their users are interacting. The data engineers are responsible for turning that stream of interactions into nice tables that can be used for reporting and modeling.

lapis sequoia
#

I am also interested in AI/ML (at least wanna give it a try), it has always fascinated me

slate scroll
#

AI/ML is also a wide field, there's R&D and algorithm development (traditional Data Science) and then there's ML engineering or applied ML. Basically turning ML into products and affecting consumers

lapis sequoia
#

Ohh so is data engg. About data representation for making decisions?

#

I dont have good idea of where to go, I just wanna start with basics and let's see what I like on the way

slate scroll
#

Yeah data engineering can be about how to collect and represent data. so let's say my website adds a new feature and the execs want to know how many people click on it. The data engineers will work with tracking engineers on how to collect that data and transform it into reportable data locations

#

Not sure if you have a blog or website or anything but if you want to play with some streaming data I have a demo on my Github: https://github.com/Raab70/serverless-streaming-web-analytics

Cloud stuff is also always a great addition to any resume. Most cloud providers have some amount of resources you can use for free it just depends on which one you use.

lapis sequoia
#

I am just a beginner

#

Starting from scratch

slate scroll
#

Sure so that should be plenty of directions to head in, cloud computing, and SQL would be great starting off points that will be applicable to any data science role.

lapis sequoia
#

ok , i will start with sql soon

slate scroll
#

Learning Docker is a great addition too

lapis sequoia
#

so whats the use of sql, heard it is for programming databases

slate scroll
#

but I don't want to throw too much your way

#

SQL is a query language for relational databases

lapis sequoia
#

mySql was something i read of

slate scroll
#

MySQL is a type of database which is relational and therefore uses SQL. Postgres would be another. Those are both OLTP (online transactional processing) databases.

#

There are also OLAP (online analytical processing) databases, like bigquery or redshift. But they're all based on SQL

lapis sequoia
#

how will sql be useful in a datascience career?

slate scroll
#

Well in data science you'll need to be collecting data from different sources, most of them will be SQL.

lapis sequoia
#

ok so if i wanna learn sql, how shall i go about it on python?

slate scroll
#

Well you can get into more topics like ORMs, which are basically a mapping of SQL into python objects. But a knowledge of base sql is super useful. Maybe try just adding it into your existing dataset

#

So make your scraper work for multiple runs. How would it work if you ignored posts that were already seen

#

Store them to a DB instead of just pandas
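
One way to try this without installing a database server is Python's built-in `sqlite3` module. The sketch below assumes each scraped post has some unique id. The table layout, the `save_posts` helper, and the sample rows are all hypothetical. Making the id a primary key and using `INSERT OR IGNORE` lets a second run skip posts that were already stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to keep data between runs
conn.execute("CREATE TABLE posts (post_id TEXT PRIMARY KEY, hashtags TEXT)")

def save_posts(conn, posts):
    """Insert scraped posts, skipping ids already stored. Returns how many rows were new."""
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, hashtags) VALUES (?, ?)", posts
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before

first_run = save_posts(conn, [("p1", "#python #data"), ("p2", "#sql")])
second_run = save_posts(conn, [("p2", "#sql"), ("p3", "#pandas")])  # p2 was already seen
total = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```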

lapis sequoia
#

nah, it would re-scrape it

#

ok will keep that in mind

#

will research more about sql

slate scroll
#

It is essential for any data science role

lapis sequoia
#

thankyou for your time rob

blazing bridge
#

variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean. The goal is to have a value that is low. What low means is quantified by the r2 score (explained below).

#

is this a correct definition

#

ok so what is the difference between sum of squared residuals and r-squared

earnest meteor
#

Hi, has anyone used the Yoga-82 dataset?

#

or know of any public domain datasets related to yoga?

sacred sail
#

Is anybody available to help? I need some help figuring out how to import an Excel file into python and then combine a few of the columns and run some simple calculations. Just a little rusty and on a bit of a time crunch

earnest meteor
#

install openpyxl

gritty vapor
#

Does anyone know if it's possible, without using TensorFlow and only with numpy and opencv, to transform different images into arrays, classify them and compare them using sklearn?

wicked bloom
#

Hey, is anyone familiar with the method of turning a linear program into standard form, and is willing to walk through an example with me?

desert oar
#

@gritty vapor yes but it's not a beginner level task by any means

#

Oh sklearn. Mayyybe

#

Image classification existed before deep learning

#

So yes you can go back and dig up all the old literature on creating features for classifying images with SVMs

#

I don't see why you would want to do that

gritty vapor
#

I would like to take a lot of pictures of brain tumors, and when I give one to the program it tells me if it's a tumor or not, without 500 lines of code etc

barren topaz
#

is anyone aware of a simple method or something similar to make a seaborn heatmap axis show all of its labels when zooming in? I have a 67 by 10 heatmap, and want to be able to see the labels of any part of the heatmap that I zoom in on. Unfortunately, it does not automatically do that. any suggestions?

desert oar
#

@gritty vapor Image classification is a complicated difficult task, you're asking too much to not have to at least write some code in order to get your complicated difficult task done using modern tools

#

Maybe yolo object detection can work on brain tumors

#

I would be much less concerned about lines of code than actually training and validating a good model

#

I think that's not a good attitude at all

gritty vapor
#

Okay thanks

desert oar
#

@barren topaz you would need some kind of fancy dynamic graph with a slider or something like that

#

I don't think seaborn has that ability by itself

#

Seems like something you can do with D3.js, maybe you can do it with Bokeh

barren topaz
#

@desert oar awesome, thanks! I'll try those out. Worst come to worst, and I'll just split it into 3 or 4 different subplots.

desert oar
#

Try Bokeh first

#

Ping me if you get it to work

mellow creek
barren topaz
#

I can't get it working. I've never used bokeh before, so I'm learning it now. will probably take a while to figure out. Thanks for pointing me in the right direction though!

rich silo
#

Looking for some help with DataTables (Dash) and callbacks in Dash. Basically I want to auto-generate an editable DataTable and, after editing it, click on a button to update some other tables.
Can anybody help?

#

I have got the first part working but I can't read the data from the DataTable to update the rest afterwards

serene scaffold
#

Anyone know how to reshape a Tensor while keeping it on the same GPU?

lapis sequoia
#

@desert oar i had a followup question to yesterday

#

do you know why the code works for an individual column but not on the entire dataframe in my for loop?

#

let me know if you want me to type the code out

#

this is the function that captures 80% of the cumulative values that we discussed yesterday

desert oar
#

can you post your code as code and not a screenshot

lapis sequoia
#

yea. sorry i code & use discord on different computers

desert oar
#

what's this frac_cum[0] thing

#

try frac_cum.iloc[0]

lapis sequoia
#
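def cum_freq(df, colname, threshold=0.80):
    x = df[colname]
    # share of rows taken by each distinct value, most frequent first
    fractions = x.value_counts() / len(x)
    frac_cum = fractions.cumsum().sort_values()
    # frac_cum[0] is the line under discussion: [] looks up the label 0, not the first element
    if frac_cum[0] > threshold:
        return frac_cum.index[0]
    else:
        return frac_cum.loc[frac_cum <= threshold].index.tolist()
#
for colname in df_new.columns:
    cum_freq(df_new, colname)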
#

> what's this frac_cum[0] thing
@desert oar because some values have like the first value count as 0.85 and then all the columns after are like 0.01

#

ill try iloc

#

ah iloc worked

#

why did i have to use iloc though? why wouldnt it work without it

desert oar
#

it's... complicated

#

.iloc is positional indexing

#

[] by default is .loc which is key-based indexing

#

if your keys are floats, it's going to convert 0 to float then say "there's no key 0.0"

#

which is what happened
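
A small reproduction of the behaviour described above, on a made-up float column. `value_counts()` puts the column's own values in the index, so `[0]` is looked up as a label, while `.iloc[0]` always means "first row".

```python
import pandas as pd

# value_counts() on a float column yields a Series indexed by those float values
col = pd.Series([2.5, 2.5, 2.5, 7.0])
fractions = col.value_counts() / len(col)

# Positional access: the first row, whatever its label is
first_by_position = fractions.iloc[0]

# Label access: there is no key 0 (or 0.0) in the index, so this raises KeyError
try:
    fractions[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
```

This also suggests why the function could work on some columns and fail on others: whether `frac_cum[0]` finds anything depends on what kind of values that column holds.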

lapis sequoia
#

ohh

#

but how come it worked for the individual columns?

desert oar
#

columns are keys

#

wait...

#

yeah

#

df.columns is a list of column keys

#

i.e. column names

#

dataframes have their own behavior

#

i try to use .loc and .iloc whenever possible to eliminate ambiguity

#

the only exception is using df[colname] for column access

#

otherwise i always try to use .loc and .iloc

lapis sequoia
#

i see

#

that makes sense

#

thank you

slim fox
#

iloc is really nice when you do sortings

modern canyon
#

can someone help me understand how 2^g(n) shoots up after a certain value while 2^f(n) remains flat even though f(n) > g(n)?

pallid mica
#

Hi! I dont know if this is a place to ask but I just am getting started and i basically just want to learn to program a graph and have it track or plot a dataset and i dont know where to begin any ideas? Thanks in advance

pliant wasp
#

hi everyone !

#

( I need help )

uncut shadow
#

You should ask your question and provide code/errors required for us to solve your problem

blazing bridge
#

could someone explain this to me. Like what does variance mean in terms of linear regression. Someone told me it is the spread of the data but how does that relate to this in that case.

#

they say 72% variation in y, im not sure what they mean. Is it its 72% accurate

#

Last thing what is mean squared error and whats the difference between R-squared and mean squared error

desert oar
#

i think you really need to spend some time learning statistics

#

it makes these concepts easier to understand

blazing bridge
#

yeah ik. I searched this up on khan academy, didnt really make sense to me

#

any recommendations. I know I've been really annoying and probably pissing you guys off

desert oar
#

im not sure what the standard stats textbooks are nowadays

#

but a good book should help

#

imo you shouldnt even try to learn regression until you know basic statistics

#

you can do it, but... this is what happens

blazing bridge
#

what would be a good book suitable for someone in high school or maybe a course

desert oar
#

let me look

#

maybe The Statistical Sleuth by Ramsey & Schafer

#

or Statistics by Freedman, Pisani, & Purves

#

however they are both "textbooks" and might be expensive

blazing bridge
#

oh, im sorry to bother you, anything thats free?

desert oar
#

i asked in another server

#

i will let you know if i get any recommendations

blazing bridge
#

ok, once again sorry

#

would you think this is a good definition: in terms of linear regression, variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean.

desert oar
#

no, i dont think so

#

the variance of what

blazing bridge
#

like variation in y

desert oar
#

have you heard of a "random variable"

blazing bridge
#

no

desert oar
#

ok, that is the big missing piece for you, i think

blazing bridge
#

oh, wait is it something that has infinite number of possible values

#

height, weight

#

for example

desert oar
#

it can have a finite number of values

#

the very short version is that a random variable describes a data-generating process. a random variable, let's say Y, is not a single value, but a description of the kinds of values that it can have, and how probable each value is

#

for example: when you flip a coin, the outcome of the coin flip can be Heads or Tails

blazing bridge
#

"A random variable is a variable whose value is unknown"

desert oar
#

thats an incomplete definition

#

the outcome of the coin flip can also be described as a random variable, with outcomes Heads (0.5 probability) and Tails (0.5 probability)

#

a random variable is a description of a data generating process

#

it's a relationship between "possible outcomes" and "probabilities"

#

i am glossing over a lot of things here, but that's the basic concept at the bottom of pretty much all of stats and probability, and by extension much of machine learning

#

variance tells you how spread out the possible outcomes are. high variance means a lot of spread in the possible outcomes, so if you were to re-generate many data points with high variance, there would be a lot of variation among the data points
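
To make the coin example concrete, here is a sketch that writes the random variable down as a mapping from outcomes to probabilities (coding Heads as 1 and Tails as 0, which is an arbitrary choice), computes its variance from that description, and then re-generates data from the same process to compare.

```python
import random

# The random variable: possible outcomes and how probable each one is
coin = {1: 0.5, 0: 0.5}  # 1 = Heads, 0 = Tails

# Mean and variance computed from the description itself, no data needed
mean = sum(value * p for value, p in coin.items())
variance = sum(p * (value - mean) ** 2 for value, p in coin.items())

# Generate many data points from this process and measure their spread
random.seed(0)
flips = random.choices(list(coin), weights=list(coin.values()), k=10_000)
sample_mean = sum(flips) / len(flips)
sample_variance = sum((f - sample_mean) ** 2 for f in flips) / len(flips)
```

The variance of the description is fixed at 0.25. The sample variance is only an estimate of it and changes slightly with every new batch of flips.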

blazing bridge
#

yeah I just watched a video on that and its the spread of the data around the mean

desert oar
#

yeah, sure

#

so the variance of Y in regression could mean a few different things

#

one meaning is: the actual variance of Y, which generated the data

#

we can never know that answer

#

another meaning is: the variance of the predictions, which is the definition you gave

#

there are other variances to consider as well, but you can probably ignore those for now

blazing bridge
#

so the definition in this case would be how far the actual values are from the predicted values, that is the variance

desert oar
#

no, that is something else

#

those are residuals

#

you can compute the variance of the residuals

blazing bridge
#

oh ok

desert oar
#

that document talks about "variation in y"

#

in this case they mean "the variation in the data"

#

i.e. the observed variance in the data

#

so R^2 is the amount of the variance of Y that is "explained" by the variance of X

#

...kind of

#

the meaning of "explained" is fuzzy without equations and a more coherent understanding of random variables

#

but its good enough to interpret results
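
Since the difference between residuals, mean squared error and R² came up several times, here is a sketch that computes all of them for a one-variable least-squares line on made-up data. The last lines check a fact that holds only for simple (one x) linear regression: R² equals the squared Pearson correlation.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Ordinary least squares fit of y = slope * x + intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predictions = [slope * x + intercept for x in xs]

# Residuals are actual minus predicted values
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # sum of squared residuals
ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total variation in y

mse = ss_res / n                 # average squared residual, in squared units of y
r_squared = 1 - ss_res / ss_tot  # unitless share of the variation in y that is explained

# Simple regression only: R^2 is the square of the Pearson correlation r
r = sxy / (sxx * ss_tot) ** 0.5
```

MSE depends on the units of y, so it cannot be compared across different targets. R² rescales the same residuals by the total variation in y, which is why it reads as a fraction.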

#

STAT 100 in there might be useful

blazing bridge
#

yeah i was just looking at that

#

thank you so much

desert oar
#
eternal pagoda
#

I am using pandas to find the mean of a 45k line column in a dataset. I noticed some of the cells in the column have text content while the column is intended for integer based values. Still, pandas is able to provide me with the mean. I extracted a small section of the column and copied it to a separate spreadsheet for analysis while using the exact same pandas code. However, when I run the code against the column in a different dataset, i get a "can not convert string to float" error

#

I would like to understand why I am not also receiving this string to float error when I run the exact same code in the 45k line column which contains many strings, but receive the string to float error when I attempt to acquire the mean for a smaller section of the column.

#

its almost as if pandas decides that it doesn't care about the string values when we are dealing with 45k lines, as opposed to 20. When the amount of cells in the column is lesser, then it wants to get real righteous

desert oar
#

that is weird

#

it is possible that it's using a different algorithm on a bigger series

#

however i can't say i've ever had that experience

eternal pagoda
#

makes me wonder at what point pandas throws out the book. lol

#

disregard, identified that I am seeing the information incorrectly because I am using shit software to open the spreadsheets

desert oar
#

that was my next question

spiral bane
#

hi i want to start with machine learning i found a tutorial series on youtube but its from 2018 and since then many changes were made to the pytorch framework so i would like to ask if https://www.youtube.com/watch?v=GIsg-ZUy0MY would be a good starting point? This is probably the most recent video about pytorch

wheat wolf
#

Is there an updated PyTorch chatbot?

lapis sequoia
#

does anyone know how many hidden layers to use in a neural network using tf.keras.models.Sequential() ?

uncut shadow
#

it depends

#

you can have 1 or you can choose to have more

#

it's up to you

umbral aspen
#

Hi guys - I am trying to develop a model to solve a multi label problem (with tf and keras) and I see this is solved differently sometimes when building the model layers...

I see this (1 sigmoid layer):
layers.Dense(number_of_classes, activation='sigmoid', name='output')

and this (1 sigmoid layer per class):

```python
output2 = Dense(1, activation = 'sigmoid')(x)
output3 = Dense(1, activation = 'sigmoid')(x)
```
#

What are the differences between these two options?

#

Or maybe a better question would be -> when to use only 1 output layer and when to use multiple?

serene scaffold
#

If I need to find the cosine distance of two vectors, and one is longer, do I pad the other with zeros until they're the same len?

serene scaffold
#

Well that worked but something about it feels wrong.

devout sail
#

@serene scaffold uuuh if you know that the shorter vector of size n is the first n dims in the longer vector (in the same order) then sounds like it should work
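
A standard-library sketch of what zero-padding does to cosine similarity, with made-up vectors and hypothetical helper names. Padding adds zeros to the dot product and leaves the shorter vector's norm unchanged, so the result is the same as comparing against only the first n dimensions of the longer vector, which is only meaningful under the assumption stated above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def pad(v, length):
    """Right-pad v with zeros up to the given length."""
    return list(v) + [0.0] * (length - len(v))

long_vec = [1.0, 2.0, 3.0, 4.0]
short_vec = [1.0, 2.0, 3.0]

similarity = cosine_similarity(long_vec, pad(short_vec, len(long_vec)))
distance = 1.0 - similarity  # cosine distance
```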

serene scaffold
#

The vectors are actually from completely different spaces. The goal is to learn the mapping.

devout sail
#

How do you measure the distance if they're in different spaces?

serene scaffold
#

Haven't gotten that far.

devout sail
#

I mean you shouldn't be able to.. you need some kind of information to learn the mapping

serene scaffold
#

This is an nlp thing. We're trying to put things into an ontology based on embeddings for the tokens and embeddings for the ontology.

#

Right. That's the goal.

#

We have mention-label mappings for a training corpus.

devout sail
#

If the goal is to use some kind of regression to learn a transformation then I guess your method might work

#

Because in that case you're defining how you treat shorter vectors

serene scaffold
#

A paper has already been published describing how they were successful with this process

#

But they didn't describe it thoroughly or release the code

desert oar
#

Isnt this a pretty typical case for a NN

#

Multi input multi output

#

Distance doesn't exist or make mathematical sense between things in different spaces

devout sail
#

Maybe it's used as the loss function? between the output and the label

#

it really depends on what the method used is

late jackal
#

I am working with some grad students at my school on a project that involves getting two programs to talk to each other, and while they have some of the code written, I can't find any documentation on the syntax, or very very little. Any idea where to look for something like this?

lapis sequoia
#

Hi, do you guys know the best method to interpolate missing financial data before correlating them?

iron escarp
#

How hard is the maths in data science and AI?

desert oar
#

@late jackal talk how?

#

@iron escarp its not usually "hard" but there is a lot to know

late jackal
#

@desert oar developing a python based GUI, so being able to write to HYSYS

desert oar
#

what is HYSYS

late jackal
#

A process engineering modeling software

#

So like it can model a gas adsorbed and run a bunch of simulations

#

I know it can be done since we have some working code but there's like no documentation on the API

desert oar
#

So hysys has an api

#

But its undocumented or underdocumented

late jackal
#

Seems to be the case

desert oar
#

And you need a gui app that can take user commands and then interact with the hysys api

late jackal
#

Yes

#

We are trying to build out that gui

desert oar
#

Ok. Thats not really on topic here, but you can use tkinter, kivy, pyqt, or toga for that

late jackal
#

Ah ok I'll maybe send some info in there thank you

iron escarp
#

I guess obviously I’ll get better at python

#

To get into data science later on

#

But how do ik if I’ll like data science?

#

Without trying it?

lapis sequoia
#

how can i utilize categorical variables with high cardinality in my data to make predictive models?

#

(i.e i have a column called business units with 1000+ levels, and several other variables with 100+ levels in a 6000-row dataset)

#

i probably cant one hot encode that because it will introduce extremely large dimensionality

#

i was thinking of potentially k-means clustering into smaller groups

#

any other ideas come to mind?

velvet thorn
#

it depends.

#

first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem

#

you can consider some form of clustering, as you noted

#

you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)
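
#

As a rough sketch of that "other" bucket, using only the standard library (the function name, the `min_count` cutoff and the sample values are all made up; a sensible cutoff depends on the data):

```python
from collections import Counter

def bucket_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]

# Hypothetical business-unit column
units = ["retail", "retail", "retail", "legal", "legal", "hr", "it"]
bucketed = bucket_rare(units, min_count=2)
```

Here "hr" and "it" each appear once, so both collapse into "other" while "retail" and "legal" keep their own level.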

#

depending on the problem, you could consider some other form of encoding

#

e.g. think about how words can be vectorised using an embedding vs one hot encoding

lapis sequoia
#

you could set a threshold and put everything below that in an "other" category (generally domain knowledge is important for this)
@velvet thorn yeah i did this

#

is it a bad idea to just do that for all the columns though?

#

since you mentioned domain knowledge is important

velvet thorn
#

well

#

again

#

it depends.

#

basically

lapis sequoia
#

i just put everything with 80% of the value counts as long as the length was < 20, and put the remaining 20% as 'other'

velvet thorn
#

in layman's terms, you are saying that all these different things are in fact "the same"

lapis sequoia
#

right

velvet thorn
#

so the question is - how much information does this feature hold, relative to the rest of the dataset?

#

now, imagine this.

#

say your dataset is one of names and you want to predict if it's a male or female name

#

if you used one hot encoding and turned all the minorities into the same category

#

your predictions for them would suck, right?

lapis sequoia
#

ya

velvet thorn
#

so that wouldn't be an appropriate method

lapis sequoia
#

i see

velvet thorn
#

and the reason is that you have only one feature

lapis sequoia
#

mhmm

velvet thorn
#

on the other hand, if, say, your dataset is a housing one and the high cardinality feature is road name

#

that probably won't be an issue because road name is unlikely to have a large impact anyway

lapis sequoia
#

first, since you can use sparse matrices to handle one hot encoded output, that is not in itself a problem
@velvet thorn can you explain this part though

velvet thorn
#

no need to tag me by the way

lapis sequoia
#

i did one hot encoding for columns with nunique <= 15

#

no need to tag me by the way
alright sorry, when i quote you, it tags you by default but i'll delete it

#

so how can i one hot encode with some cat variables with 100+ levels

#

in a sparse matrix

velvet thorn
#

oh wait

#

is quoting a new feature or have I just been living under a rock

#

LOL

#

anyway

#

are you using sklearn?

lapis sequoia
#

you right click and press quote

velvet thorn
#

yeah I just saw

lapis sequoia
#

are you using sklearn?
yes

#

wait before you answer that

#

when you do one hot encoding

#

and you do pd.get_dummies

#

you dont need to drop the numeric values right

velvet thorn
#

ah...

#

don't use pd.get_dummies

lapis sequoia
#

shit

#

why

#

i just ran both my models using that

#

lol

velvet thorn
#

okay

#

more accurately

#

you can but

#

well this is kinda an opinion thing so ignore me

#

but

#

let's go back a bit

#

do you know what a sparse array is?

lapis sequoia
#

nope lol

#

i know about curse of dimensionality

#

does being sparse have to do with that

#

i looked up sparse array

#

A sparse array is an array of data in which many elements have a value of zero

#

makes sense

velvet thorn
#

okay

#

so a normal array

#

stores each value

#

but a sparse array stores only the nonzero values

#

the number of "1"s in a one hot encoded array is always equal to the number of rows

lapis sequoia
#

right

velvet thorn
#

and the number of "0"s is equal to the number of rows * (the number of categories - 1)

#

so the more categories you have, the bigger the space savings of a sparse array

lapis sequoia
#

i see

#

so isn't that like label encoding?

#

because label encoding only stores 1 through n values in one column

velvet thorn
#

uh

#

no

#

well

#

I mean

#

could you clarify your question

lapis sequoia
#

label encoding is non-zero values for a particular column .. i.e. dog,cat,rat would just be 1,2,3 in one column

#

ok nvm just ignore that

#

but what does this have to do with pd get dummies

#

are you suggesting to just use sparse array of OHE?

velvet thorn
#

pd.get_dummies(..., sparse=True)

lapis sequoia
#

so how does using sparse affect the model performance

#

im using a random forest, idk if that matters

velvet thorn
#

ah

#

when you said curse of dimensionality

#

you meant as a modelling problem

#

not as a storage problem

#

I misunderstood you I think

lapis sequoia
#

no i think this is actually helpful to know

#

because i ran two models:
1 model where features nunique <=15 and i one hot encoded

#

and a model where im gonna use all the categorical variables with cardinality

#

but as i mentioned some of the categorical variables have 100+ levels. should i just do pd.get_dummies(sparse=True)

velvet thorn
#

it's a good start

lapis sequoia
#

alright

#

thanks

#

any other ideas i could try in case that doesnt work?

velvet thorn
#

think I went through them earlier?

lapis sequoia
#

alright ill try clustering and ohe with sparse = true

#

thanks

#

also one more question

#

if sparse = true reduces storage, why dont people just always use sparse = true

#

there should be some drawbacks to it?

velvet thorn
#

yes

#

a few

#

you can't do certain things with a sparse array

#

because of the constraint that many values must be 0

#

e.g. imagine standardisation (- mean, then / std)

lapis sequoia
#

mhmm

velvet thorn
#

an array that was sparse

#

most likely cannot be standardised

#

because the majority value (0), after subtracting the mean, will not be 0
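
#

A tiny illustration of that point with a made-up one-hot column, standard library only: before standardising, most entries are zero; afterwards none are, so there is nothing left for a sparse format to skip.

```python
from statistics import mean, pstdev

column = [0, 0, 0, 0, 1]   # hypothetical one-hot column: mostly zeros
mu = mean(column)
sigma = pstdev(column)     # population standard deviation

# Standardise: subtract the mean, then divide by the standard deviation
standardised = [(v - mu) / sigma for v in column]

zeros_before = sum(v == 0 for v in column)
zeros_after = sum(v == 0 for v in standardised)
```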

#

a sparse array can take more memory than a dense array if the data is not actually sparse

#

because of the way nonzero values are stored

#

one implementation of a sparse array

#

is storing the indices and values of non-sparse values

#

so you need to store two things instead of one per value

#

and the tradeoff is that you need not store all the values

lapis sequoia
#

i see

#

so how are the actual values stored with sparse = true vs pd.get_dummies(sparse=false)? bc if you have labels like cat,dog,rat, then if you one hot encode with sparse= false, it would be
1, 0, 0
0, 1, 0
0, 0, 1

velvet thorn
#

ye

lapis sequoia
#

is it just gonna show as
1
1
1?

velvet thorn
#

uh

#

I'm not sure how they're displayed

#

but

#

that's an entirely different concern from how they're stored

#

it's more like

lapis sequoia
#

ohh ok

velvet thorn
#

non_sparse_values = [(0, 0, 1), (1, 1, 1), (2, 2, 1)]
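
#

To make that concrete, here is a sketch of the (row, column, value) idea in plain Python. It is not how pandas stores sparse columns internally, just an illustration; the labels and the `storage` helper are made up:

```python
labels = ["cat", "dog", "rat", "dog"]
categories = sorted(set(labels))
col_of = {c: j for j, c in enumerate(categories)}

# Sparse "coordinate" form: one (row, column, value) triple per nonzero entry
triples = [(i, col_of[lab], 1) for i, lab in enumerate(labels)]

# The dense one-hot table can be rebuilt from the triples at any time
dense = [[0] * len(categories) for _ in labels]
for i, j, v in triples:
    dense[i][j] = v

def storage(n_rows, n_categories):
    """Numbers stored: 3 per row in triple form vs one per cell in dense form."""
    return 3 * n_rows, n_rows * n_categories
```

With 4 rows and 3 categories the two forms store the same amount (12 numbers each), and with only 2 categories the triple form is actually larger. With 6000 rows and 1000 categories it is 18,000 numbers against 6,000,000.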

#

something like that?

lapis sequoia
#

ok thanks

#

i'll take a stab at this tomorrow

#

also just to make sure, the main advantage/use of sparse is reduction in storage/memory consumption correct?

#

vs numpy array

velvet thorn
#

yes

lapis sequoia
#

ty

marsh chasm
#

@hidden halo i fixed the problem, but i still don't really know what the problem was

#

i ended up hardcoding it by turning the dates into strings and chopping off the time section

#

then converting it back to a datetime object

#

🤷‍♀️

lapis sequoia
#

how can i find correlations amongst ALL variables in a dataframe containing categorical and numeric variables

lapis sequoia
#

@velvet thorn so using pd.get_dummies(sparse=true) yielded a 5718 row x 8661 column dataset

#

is this ok to put through a random forest lol

velvet thorn
#

that doesn't seem right

#

why are there more columns than rows

hidden halo
#

i ended up hardcoding it by turning the dates into strings and chopping off the time section
@marsh chasm I was on phone earlier, so couldn't try. Now I tried and df['date'].dt.date seemed to work for me. It removed the time part. I just did:

df = pd.read_csv("csv_file.csv", parse_dates=[0])
df['date'] = df['date'].dt.date
marsh chasm
#

o wtf

#

o well

lapis sequoia
#

why are there more columns than rows
@velvet thorn uh im not sure

#

the rows are the exact same

#

5718 are the number of rows i had

#

but the original dataset had like 63 columns i think

#

are the number of rows supposed to stay the same like in one hot encoding?

velvet thorn
#

the thing is

#

you should have

#

1 new column

#

for each unique value

#

and clearly the number of unique values cannot be more than the number of rows

lapis sequoia
#

yeah this is what i did

#
mod3_dummies = pd.get_dummies(df_mod3, sparse = True)
df_mod3.shape, mod3_dummies.shape
#

returns
```python
((5718, 59), (5718, 8661))
```

#

also there are 0s in the sparse dataframe. didnt you say there werent supposed to be any 0s?

static gull
#

I cant seem to be able to install Tensorflow Object Detection API

#

someone helps pls

lapis sequoia
#
mod3_dummies = pd.get_dummies(df_mod3, sparse = True)
df_mod3.shape, mod3_dummies.shape

@velvet thorn any tips?

paper niche
#

what kind of data are you trying to OHE here? if you went from 59 columns to 8.6k columns, it's clearly some columns with a tad too many "unique" values, like an address or something

#

so I just saw your previous msg, you said some cat columns had 100+ levels and you're wondering how to OHE it? It probably depends on what the data is representing, but I wouldn't OHE with that many categories.

If for example, you had an address column that had 100+ unique values, you could, for example, extract only the district/province information and OHE that instead.
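
#

A small sketch of why the column count can end up larger than the row count: each categorical column adds one new column per distinct value, and those add up across columns. The toy rows below are made up:

```python
rows = [
    {"unit": "a", "region": "north"},
    {"unit": "b", "region": "north"},
    {"unit": "c", "region": "south"},
    {"unit": "d", "region": "east"},
]
columns = ["unit", "region"]

# Distinct values per column; no single column can have more levels than rows...
levels = {c: len({r[c] for r in rows}) for c in columns}

# ...but the one-hot width is the sum over all columns
n_onehot_columns = sum(levels.values())
```

Here 4 rows produce 7 one-hot columns (4 for `unit`, 3 for `region`).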

lapis sequoia
#

well for instance

#

i have business units with over 1400 levels

#

in a 6000-row data set

#

theres another column with 2500 levels

paper niche
#

find some meaningful way to bin them

lapis sequoia
#

k means clustering?

#

actually idk if that'd work

#

alright

paper niche
#

with that many levels, it's like 2-4 rows per level on average

lapis sequoia
#

yea

#

amongst the categorical data, my data has like 40 columns with <20 unique values and like 10-15 with >150

#

theres nothing between 20-150

paper niche
#

even the 40 columns with 20 unique values is kinda ridiculous once you OHE them. gotta be careful with curse of dimensionality here, considering you have so few training examples

lapis sequoia
#

okay so i built 2 separate models --

  1. baseline containing only categorical columns with <10 unique values + numeric variables
  2. keeping all unique values that capture 80% of column value counts less than 20 and storing remaining 20% as 'other' + numeric vars
  3. im gonna try somehow using all categorical variables + numeric vars
#

however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that

paper niche
#

you could do feature selection / dimensionality reduction afterwards of course, but some domain knowledge to clear out some useless features (or even binning the categories in a way that makes sense) might be a good start

twilit arch
#

I want to make a roulette game, how would I calculate the odds (I have 3 colors rn) so that "the house" has a good edge and would be profiting instead of losing money
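
#

One way to reason about it, as a sketch with made-up numbers (the pocket counts and payout multipliers are hypothetical, not from any real game): for a 1-unit bet on a colour, the expected amount returned is the win probability times the total payout, and the house edge is whatever is left over from 1.

```python
from fractions import Fraction

# Hypothetical wheel: 15 pockets in three colours
pockets = {"red": 7, "black": 7, "green": 1}
# Hypothetical total returned per unit staked on a win (stake included)
payout = {"red": 2, "black": 2, "green": 14}

total = sum(pockets.values())

def house_edge(colour):
    """Expected fraction of each unit bet on this colour that the house keeps."""
    p_win = Fraction(pockets[colour], total)
    return 1 - p_win * payout[colour]

edges = {c: house_edge(c) for c in pockets}
```

With these numbers every bet has the same edge of 1/15 (about 6.7%). An edge above zero on every bet means the house profits on average over many rounds.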

lapis sequoia
#

alright thanks tofu

#

however one hot encoding is tainting my variable importance in random forests bc of curse of dimensionality. do you have any tips to circumvent that
@lapis sequoia do you know anything about this question though?

paper niche
#

tips to circumvent what? doing OHE?

lapis sequoia
#

i guess having accurate variable importances

#

because one hot encoding is skewing the variable importances of random forests towards the non-OHE data (numeric columns)

#

so the numeric columns tend to be at the top

#

and the one hot encoded features are below them

slim fox
#

i have business units with over 1400 levels
@lapis sequoia is it absolutely impossible to encode them with more meaningful numerical values?

#

like you can not establish any kind of numerical relationship between categories, i.e. category "A" can be represented with 0, "B" with 1 and "C" with 5?

lapis sequoia
#

there isn't any inherent order

#

if thats what you're asking

#

it's nominal

slim fox
#

😕 yeah I was hoping maybe you can somehow establish it

#

to be honest I simply have doubts that 1400 unique values with that size of data will contribute positively to your ML model results

#

I would not be surprised if dropping the entire column would yield better results than OHE 1400 values

lapis sequoia
#

i was thinking the same

#

i believe that there is some sort of association/correlation between that column and a few others as well

#

im calculating a cramer v correlation

#

among categorical columns

#

and will drop any with high association

paper niche
#

i guess having accurate variable importances
@lapis sequoia no, not particularly. that issue is intrinsic with random forests & having high cardinality data, I think.

lapis sequoia
#

so is it ill-advised to use a random forest in this scenario?

paper niche
#

It can work fine as a meta-model, just gotta do some preprocessing beforehand

#

like I said, binning, or you could try doing embedding the high cardinality features into a smaller space

#

though the latter would not be useful if you're trying to do inference

lapis sequoia
#

alright thank you

#

i already tried binning

#

any particular methods you'd recommend about the latter option?

#

embedding the high cardinality features into a smaller space

desert oar
#

Did you try target encoding
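
#

For reference, a minimal sketch of smoothed target encoding in plain Python. The function name, the `smoothing` parameter and the toy data are all illustrative, and this is one common variant rather than the only one. Each category is replaced by its mean target, pulled towards the global mean when the category has few rows:

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Return (encoded column, category -> value table) using a smoothed mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    table = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [table[c] for c in categories], table

cats = ["a", "a", "a", "a", "b"]   # hypothetical high-cardinality column
y = [1, 1, 1, 0, 0]                # hypothetical binary target
encoded, table = target_encode(cats, y, smoothing=1.0)
```

The table should be fitted on training rows only and then applied to validation rows, otherwise the target leaks into the feature.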

#

Another option is to train your model w/ a Factorization Machine

lapis sequoia
#

Will do thank you

lapis sequoia
#

I have a file that looks like this

--------------Time step: 1 ---------------
Accumulated rewards: 1.5
Alpha: 660
Beta: 173
TCP_Friendliness: 1
Fast_Convergence: 1
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 2 ---------------
Accumulated rewards: 2.724744871391589
Alpha: 193
Beta: 0
TCP_Friendliness: 0
Fast_Convergence: 0
State: 3
Retries: 0.0
---------------------------------------------------------------
---------------Time step: 3 ---------------
Accumulated rewards: 3.869459113944921

I'd like to extract the time step values into an X array and the Accumulated rewards value into a Y array, I have no idea how to do that as I have 0 python experience, but this is my initial loop i've written that skips the first couple of lines that I have not included in the example(gibberish data)

with open('Tuner_result_1.txt') as f:
    for _ in range(11):
        next(f)
    for line in f:
        x = [line.split()[0] for line in lines]
        y = [line.split()[1] for line in lines]

obviously the actions inside the 2nd for are incorrect, idk how to read the lines i want properly.
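
One possible approach, sketched with regular expressions and assuming every block has exactly the "Time step:" and "Accumulated rewards:" lines shown above (the `parse_log` name is made up, and the sample is a shortened copy of the example):

```python
import re

STEP = re.compile(r"Time step:\s*(\d+)")
REWARD = re.compile(r"Accumulated rewards:\s*([-+0-9.eE]+)")

def parse_log(lines):
    """Collect time steps into x and accumulated rewards into y."""
    x, y = [], []
    for line in lines:
        m = STEP.search(line)
        if m:
            x.append(int(m.group(1)))
            continue
        m = REWARD.search(line)
        if m:
            y.append(float(m.group(1)))
    return x, y

sample = """\
--------------Time step: 1 ---------------
Accumulated rewards: 1.5
Alpha: 660
---------------------------------------------------------------
---------------Time step: 2 ---------------
Accumulated rewards: 2.724744871391589
""".splitlines()

x, y = parse_log(sample)
```

For the real file, passing the open file object works the same way: `with open('Tuner_result_1.txt') as f: x, y = parse_log(f)`. Lines that match neither pattern are ignored, so the header lines only need skipping if they happen to contain those phrases.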

blazing bridge
#

After doing some research would you say this is correct:

#

The r-squared coefficient is the percentage of y-variation "explained" by the line compared to how much the average y explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average.

desert oar
#

Yeah sounds good

blazing bridge
#

Ok it took me 3 days. One question: y average and SEy are the same thing, which is just a line with no slope and only an intercept, so if the r-squared is 0.72 it’s 72% better than that line?

#

Which is the mean line

desert oar
#

No? SE is standard error

#

Oh i see

#

Hmm

#

I guess yeah its the mean squared error of an estimate that's just the sample mean

blazing bridge
#

They used that in the khan academy video

desert oar
#

Yeah that's "good enough" for now

#

A more practical understanding is: the better the R^2, the closer your data is to a straight line fit

#

In the case with 1 x and 1 y you can compute it from the correlation

blazing bridge
#

So if the r-squared is 72% it means that 72% of the data is a straight line?

polar acorn
#

It means if you replace your data with that straight line it would still keep 72% of the variation of your data. Not necessarily that 72% of the data lies on that straight line.

desert oar
#

Specifically 72% of the variation in Y
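
#

A worked sketch of the arithmetic with made-up data: R^2 is 1 minus (squared error around the fitted line) / (squared error around the mean of y), and with one x and one y it equals the squared correlation. The function name and sample points are illustrative:

```python
def r_squared(xs, ys):
    """Return (R^2 of the least-squares line, squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)   # variation around the mean line
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_res / ss_tot                  # share of variation the line keeps
    corr_sq = sxy ** 2 / (sxx * ss_tot)
    return r2, corr_sq

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2, corr_sq = r_squared(xs, ys)
```

For these points the variation around the mean is 6 and the variation left around the line is 2.4, so R^2 = 1 - 2.4/6 = 0.6, which matches the squared correlation.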

lapis sequoia
#

do yall think dropping columns with correlations > 0.70 is a good threshold

desert oar
#

Why does it have such a high correlation

lapis sequoia
#

idk there are like 14/63 columns that have > 0.80 correlation

#

this is a cramer v correlation

#

so between categorical variables

#

idk if that matters

#

and they have like 100+ levels

desert oar
#

The reason you drop high correlations is in case they are effectively duplicates of each other
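
#

A sketch of Cramér's V in plain Python to show what "effectively duplicates" looks like. It has no bias correction and assumes each column has at least two levels; the column names and values are made up:

```python
from collections import Counter
import math

def cramers_v(a, b):
    """Cramér's V from the chi-square statistic of the a-by-b contingency table."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    chi2 = 0.0
    for va, na in count_a.items():
        for vb, nb in count_b.items():
            expected = na * nb / n
            observed = joint.get((va, vb), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k))

col_a      = ["x", "x", "y", "y", "z", "z"]
relabelled = ["1", "1", "2", "2", "3", "3"]   # same grouping, different names
unrelated  = ["p", "q", "p", "q", "p", "q"]   # independent of col_a

v_duplicate = cramers_v(col_a, relabelled)
v_unrelated = cramers_v(col_a, unrelated)
```

A pure relabelling gives V = 1 and an unrelated column gives V = 0. With hundreds of levels and only a few rows per level, V tends to come out high without the columns being true duplicates, which is one reason to look at the columns before dropping them.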

lapis sequoia
#

yea

#

they seem to be correlated

#

in this business context

desert oar
#

So it depends on what they are

#

Eg SIC and NAICS codes

#

SIC is basically redundant if you know NAICS

#

But in a lot of data you have multiple SIC codes and only one NAICS

#

So you keep both despite high corr

#

(SIC and NAICS are two different business classification systems in the USA)

lapis sequoia
#

i see

#

hmm

desert oar
#

16 features is not too many for you to manually review

lapis sequoia
#

so i'd have to look at the data individually to determine if i need to drop it?

#

alright

desert oar
#

If its 160 features out of 1000 then id say you need automated methods

#

Or if you need to dynamically retrain a model on unknown data like if you were building some kind of auto ML solution

#

And this is actually a great example of why auto ML solutions are a long way from perfect

#

Understanding the domain and business context and the meaning of the data, and applying that understanding towards solving your problem. That is the added value of a good data scientist, more so than whatever mechanical skills they have in implementing models

#

Look at something like alphago, they didn't just throw a big generic neural network at the problem, they built a whole solution specifically oriented around Go

lapis sequoia
#

alright thanks for the help

desert oar
#

Sorry for ranting

blazing bridge
#

"The r-squared coefficient is the percentage of y-variation "explained" by the line compared to how much the average y explains. You could also think of it as how much closer the line is to any given point when compared to the average value of y. SEy is the total variation in y (sum of squared distances from the mean of y) and tells you how much the data deviates from the mean of y. The variation in y gives you a baseline by which to judge how much better the best fit line fits the data compared to the y average."

#

@desert oar Can you tell me if this is a correct interpretation of this if I explain it. What r-squared is telling us, is how much closer the data points are to the line of best fit compared to the average y value line, referred to as SEy or variation in y. So if the r-squared is 0.72, it means that the line of best fit is 72% better than the average values of y mean. And the other 28% is missing due to us not including other variables. For example if we have rent and square feet. The other 28% could be in something like age of building and etc.

#

@desert oar again sorry for bothering you so much

desert oar
#

72% better than the mean..... eh

#

Yes the other 28% could be omitted variables

#

It could also be natural random variation in Y with nothing to account for it

#

It could also be that the remaining relationship is nonlinear

#

It could also be that the variance of Y is not constant over the range of X so the whole model is invalid

blazing bridge
#

so i am partially correct

#

but for the part where what % of variation in y is described by x. The percentage we get is checking with the x variables we have , the line of best fit is better than the y mean line

hearty jewel
#

quick python question: right here, what exactly are the loc arguments specifying? as I understand it, ride_sharing['tire_sizes'] > 27 is specifying this column where this variable is greater than 27, but what's the second argument for?

#

the second 'tire_sizes'

#

lmao, turns out it wasnt needed after all.

#

no wonder i was so confused.

lapis sequoia
#

The conditional returns a boolean array marking which rows satisfy the condition. The second argument, tire_sizes, specifies which column to return.

lapis sequoia
#

Hi , in order to get the correlation coefficient, I performed a quadratic interpolation in order to fill in the missing values. However it seems that I still have missing data.
Is it normal or does the problem come from my code?

Thank you for your responses

jade hazel
#

Hi i am creating a chatbot and i have got struggles with my bots responses. For example if i type: "What is the weather in London" I want from the bot to go to some webpage and get the data of the weather. Is there someone who knows how to do this? Thanks a lot for your responses

ripe marlin
desert oar
#

@ripe marlin i can't read that, can you post your code and full error output as text

#

!code-block

arctic wedgeBOT
#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')
desert oar
#

@lapis sequoia impossible to say without seeing your code

#

@blazing bridge it's the % of variation in y described by our model

lapis sequoia
desert oar
#

@hearty jewel

data['hello']  # select the "hello" column
data.loc['hello']  # select the row(s) with index value "hello"
data.loc['hello', 'hello']  # select the row(s) with index value "hello" and the "hello" column
#

@lapis sequoia please post your code as text. you are not new here, you know this

#

also this has nothing to do with your previous issue

#

any time you see "no module named X" it means you have the wrong environment active

#

whether that's venv or conda or whatever

#

that is always the solution

#

that, or it had an error during installation

ripe marlin
#

@desert oar nothing that complicated. I'm just trying to use read_csv() to open a csv file. I'm getting a unicode error. And when I use r to declare a raw string, it says that the file doesn't exist

desert oar
#

@ripe marlin show the error? it's probably in the file itself, not the filename

#

how was the file created? if it came from Excel, use encoding='windows-1252'

ripe marlin
#

From excel

desert oar
#

windows-1252 is a derivative of iso-8859-1 which has a few differences from UTF-8

#

so you probably hit one of those different characters

#

and it can't decode the bytes to text
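
#

A small standard-library demonstration of what goes wrong, using a made-up string. The same bytes decode fine as windows-1252, fail as UTF-8, and decode to a different character as iso-8859-1:

```python
# "café" plus a curly apostrophe, written the way a windows-1252 file stores it
raw = "caf\u00e9 \u2019".encode("windows-1252")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False        # these bytes are not a valid UTF-8 sequence

text = raw.decode("windows-1252")   # round-trips to the original string
latin1 = raw.decode("iso-8859-1")   # never fails, but 0x92 becomes a control char
```

The curly apostrophe is byte 0x92 in windows-1252; iso-8859-1 maps that byte to an invisible control character instead, which is one of the places the two encodings differ.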

ripe marlin
#

Wait i'll just post the code

lapis sequoia
#

I got this error: PackagesNotFoundError: The following packages are not available from current channels:

  • python-firebase

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

desert oar
#

just use pip with your conda env activated @lapis sequoia

#

that's a non-standard channel

#

you would need to conda install -c auto python-firebase but again i dont know who or what auto is so i dont trust it

#

@ripe marlin try pd.read_csv(..., encoding='windows-1252')

#

just try it

ripe marlin
#

Right

lapis sequoia
#

ok thanks

#

One last question what do you mean by env activated?

ripe marlin
#

Still shows Unicode error and FileNotFoundError if I add a r for rawstring

desert oar
#

if you are on windows you are probably writing '\\' so yes raw would fail

#

ok show the full error

#

@lapis sequoia

conda activate <my env>
pip install <package>
lapis sequoia
#

@desert oar instead of <my env> I should put the IDE I work on? For example Spyder?

ripe marlin
#
File "<ipython-input-13-343da795340c>", line 1
    df=pd.read_csv('C:\Users\dell\Downloads\py-master.zip\py-master\ML\1_linear_reg\homeprices.csv',encoding='windows-1252')
                  ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
desert oar
#

@ripe marlin oh, you need r

#

because you are using single \s
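
#

A quick sketch of the options for Windows paths in string literals, using the folder from the traceback above. `compile()` is only there so the bad literal can be shown without crashing the script:

```python
raw_path = r"C:\Users\dell\Downloads"          # raw string: backslashes kept as-is
escaped_path = "C:\\Users\\dell\\Downloads"    # each backslash escaped by hand
forward_path = "C:/Users/dell/Downloads"       # forward slashes also work on Windows

same = raw_path == escaped_path

# A plain literal with single backslashes: \U starts a unicode escape,
# and "sers" is not valid hex, so Python refuses to compile it.
bad_source = "p = 'C:" + "\\" + "Users'"
try:
    compile(bad_source, "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False
```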

ripe marlin
#

I did, it shows a Filenotfound error

desert oar
#

then you have the wrong filename

ripe marlin
#

I literally copy pasted it's path

#

I didn't type a thing

desert oar
#

you can't just get a file out of a zip

#

windows explorer cheats and lets you think you can

#

you have to unzip first

#

either unzip thru windows or unzip inside python

ripe marlin
#

How do i unzip inside Python?

desert oar
#
from zipfile import ZipFile
import pandas as pd

with ZipFile(r'C:\Users\dell\Downloads\py-master.zip') as archive:
    # Names inside a zip archive always use forward slashes, even on Windows
    with archive.open('py-master/ML/1_linear_reg/homeprices.csv') as fp:
        data = pd.read_csv(fp, encoding='windows-1252')

@ripe marlin
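
#

If the member name is in doubt, `namelist()` shows exactly what the archive contains. A self-contained sketch using only the standard library: the archive is built in memory, and the shortened file name and CSV contents are made up for illustration.

```python
import csv
import io
from zipfile import ZipFile

# Build a tiny archive in memory so the example runs anywhere
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("py-master/ML/homeprices.csv", "area,price\n2600,550000\n")

with ZipFile(buffer) as archive:
    names = archive.namelist()           # exact member names, forward slashes
    with archive.open(names[0]) as fp:   # opens in binary mode
        reader = csv.DictReader(io.TextIOWrapper(fp, encoding="utf-8"))
        rows = list(reader)
```

Copying a name straight out of `namelist()` avoids guessing at the path inside the archive.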

#

!d g zipfile

arctic wedgeBOT
#

Source code: Lib/zipfile.py

The ZIP file format is a common archive and compression standard. This module provides tools to create, read, write, append, and list a ZIP file. Any advanced use of this module will require an understanding of the format, as defined in PKZIP Application Note.

This module does not currently handle multi-disk ZIP files. It can handle ZIP files that use the ZIP64 extensions (that is ZIP files that are more than 4 GiB in size). It supports decryption of encrypted files in ZIP archives, but it currently cannot create an encrypted file. Decryption is extremely slow as it is implemented in native Python rather than C.

The module defines the following items:

ripe marlin
#

I see

#

I'll try this

#

Hope it works

#

Thanks a lot

lapis sequoia
#

how about gunzip

lapis sequoia
#

@desert oar yeah i found out there actually is high association between the variables providing redundant information. is correlation of 0.6 a good threshold to drop predictors?

desert oar
#

these are all categorical?

lapis sequoia
#

yea

desert oar
#

how are you computing correlation

#

cramers v?

lapis sequoia
#

yea

desert oar
#

im still skeptical of discarding features based on that

#

.6 doesnt seem high

#

high collinearity is usually only a problem in extreme cases

#

and even then, with regularization you typically dont need to care much

#

even moreso in ensembled decision tree models like a random forest

lapis sequoia
#

fffffffffffffffffffffffffffffffffffffffff

desert oar
#

which are selecting random features anyway

lapis sequoia
#

i was gonna drop high predictor variables that had high cardinality & high correlation lol

#

but this wont fix it it seems

desert oar
#

no you need to fix the high cardinality issue

#

i posted a bunch of options the other day

#

welcome to data science

sullen kiln
#

@lapis sequoia not sure what your context is and how many variables you are dealing with, if general cramers V thresholds are not desired you could consider other dimensionality reduction methods like PCA or PFA

desert oar
#

trial and error

#

@sullen kiln they have several very high cardinality features that are all moderately associated with each other

#

like high cardinality as in 1000 distinct categories

sullen kiln
#

bloody hell

#

😆

desert oar
#

we have suggested several options like target encoding, vector embedding, building a sparse or reduced model like a factorization machine

#

etc

lapis sequoia
#

so even if i have 2 variables with >0.80 correlation i shouldnt drop it?

desert oar
#

you said 0.6

lapis sequoia
#

ik but im increasing the threshold now

#

i have like 10-15 variables with >0.80 correlation

sullen kiln
#

what is the nature of the cardinality? what data are you dealing with?

lapis sequoia
#

these correlations are amongst categorical variables

#

they have a shitton of unique values which is why im tryna drop them

desert oar
#

not rly correlations. cramer's v chi square stats

lapis sequoia
#

there are other methods suggested in here but this is one i thought of

#

so i have a 6000row dataset and these categorical variables have 300-1000 unique values and correlations >0.80 or >0.90

sullen kiln
#

that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.

desert oar
#

One thing ive done is train a model with each of them separately and then with both together. If you dont see a big lift from adding the second one, discard one

#

Its hacky

#

.8 is high

lapis sequoia
#

so why cant i just drop some of the .8 corr variables

#

that cardinality is quite extreme, maybe you can devise a dimensionality reduction method that deals with that. You may find a lot of those categorical responses barely contribute any variance to the data.
@sullen kiln i'll check out doing a pca

desert oar
#

Yeah just do it tbh

lapis sequoia
#

reason is im running low on time and i have to turn this over soon lol

desert oar
#

PCA is a good idea

#

PCA features before regression is an old school ML technique anyway

sullen kiln
#

yes, start with a more parsimonious model specification, and add/subtract each variable, try to establish what has more predictive power (and what makes more theoretical sense)

#

I wouldn't usually go manual like this, but its not a bad thing, its just best to do a PCA as it does so much legwork

lapis sequoia
#

havent played around with pca much aside from learning about it in school

#

so i can use it to reduce cardinality then run a random forest on the pca data?

#

that's gonna reduce the interpretability a lot right

sullen kiln
#

its very hard to quantitatively distinguish between variables of extreme cardinality, you will inevitably have bias, I would favour a proper dim reduction method over model tinkering

#

not necessarily, you will likely find several principal components that exhibit theoretical themes, if anything, you will be able to explain your data better

#

interpretability is increased

lapis sequoia
#

ok thank you. should i drop the highly correlated variables (>.80) then run pca?

sullen kiln
#

no

lapis sequoia
#

just run pca on the entire dataset?

sullen kiln
#

let the PCA analyse the data in its natural form, after all measures are normalized of course

lapis sequoia
#

ok

#

ty

sullen kiln
#

PFA might be needed with categorical variables, its not very fresh in my mind

lapis sequoia
#

yea i was boutta ask how would i normalize cat variables

#

but i'll look it up

sullen kiln
#

Could I ask someone for a bit of syntax help here? Simple Python/R stuff

#

#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)

#

I want to add to my function here: computing another mean over only the values that fall between q_at(0.10) and q_at(0.90)

#

I was going to try extract the statistics and merge them back into the data, but that is not elegant and not very viable with big data

desert oar
#

what do you mean extract

#

you want to join the aggregated data back to the original data?

#

e.g. join df1 to incdata? that's pretty much the only way to do it

#

also

#

!code-block

arctic wedgeBOT
#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')
sullen kiln
#

hhmm, ok, I was hoping for a cleaner way. As I am extracting the descriptives from incdata. I wanted to extract a conditional descriptive - the mean between the inter-decile range, without adding more steps/merges

desert oar
#

what do you mean, "the mean between the inter-decile range"

lapis sequoia
#

The mean of the values between the 10th and 90th percentiles. Basically a more generous IQR

#

Though I don't think there's an easier way to do that than what you're saying (if I understood you right)

desert oar
#

oh

#

idk what the automatic variable names for those columns will be, but

#
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
         q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f) 
df1['iqr'] = df1['q75'] - df1['q25']
df1['idr'] = df1['q10'] - df1['q90']
#

no?

lapis sequoia
#

@sullen kiln is PFA = factor analysis?

#

PFA might be needed with categorical variables, its not very fresh in my mind

sullen kiln
#

yes

#

yes @lapis sequoia and @desert oar , I want an IDR mean, without doing another merry-go-round and merging my statistics back into the data to re-compute an IDR mean

#

when I deploy this program to my data in work, i am dealing with a lot of data and limited pc memory, so I am trying to keep it concise

desert oar
#

so does my code make sense or no

sullen kiln
#

bear with me

desert oar
#

(btw this might be better to do in sqlite which is definitely probably going to be more memory efficient)

sullen kiln
#

thanks for the tip, yeah, for work it might be best, I am writing the program in python and R and will do it in sql too, I just want options going forward

desert oar
#

i think you should at least try my code though. sql is good because its the same in both python and r

#

so you connect to the same sqlite db and use the same query

lapis sequoia
#

There's a library that works similarly to pandas that doesn't store the entire df in memory and streams it as needed. It's quite cool, though I forgot its name

#

as far as I know it has all the same methods as pandas

slim fox
#

dask?

lapis sequoia
#

Yeah that sounds familiar

desert oar
#

dask and vaex are 2 options

sullen kiln
#

#The rename decorator renames the function so that the pandas agg function
#can deal with the reuse of the quantile function returned.

def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator

def q_at(y): #define a function q, for values y
    @rename(f'q{y:0.2f}') #define format renaming of new quantiles returned
    def q(TOTAL_MINUTES):
        return TOTAL_MINUTES.quantile(y)
    return q

#Perform set of statistical functions for each value in 'TOTAL_MINUTES'
#Grouped by incident type
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1['iqr'] = df1['q0.75'] - df1['q0.25']
df1['idr'] = df1['q0.10'] - df1['q0.90']
df1 = incdata.groupby('INC_TYPE').agg(f)
print(df1)

#

@desert oar this solution would require me to compute the values within the ranges for every row, whereas, I want to build in the adjusted means into the function

#

the function above exports a nice little descriptive table, df1, rather than transforming new columns in what will be a large dataset

desert oar
#

im still not sure what you mean

#

oh

#

i just flipped the lines

#
f = {'TOTAL_MINUTES': ['mean','median', 'std', 'min', 'max', 'var',
         q_at(0.10) ,q_at(0.25), q_at(0.50), q_at(0.75), q_at(0.90)]}
df1 = incdata.groupby('INC_TYPE').agg(f)
df1['iqr'] = df1['q0.75'] - df1['q0.25'] 
df1['idr'] = df1['q0.10'] - df1['q0.90']
#

the usual caveats apply about untested code written by volunteers on the internet

#

df1 didnt even exist in those first 2 lines

#

so the code clearly wouldnt have worked as written

#

wait

sullen kiln
#

🙂 haha

desert oar
#

you flipped them. i did it right. scroll up.

sullen kiln
#

apologies

#

this is the result i am after

#

with the IDR and IQR means also being computed in the function, without the need to merge quantiles back into the dataset

desert oar
#

you arent merging anything

#

df1 is your grouped summary table

sullen kiln
#

yes

desert oar
#

what i wrote gives you what you show in the screenshot

#

btw you also flipped q0.10 and q0.90

#

oh nvm i did that
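If the goal is the mean of the values that fall between the 10th and 90th percentiles inside each group, one possible sketch is a custom aggregation function passed straight to `agg`, so nothing has to be merged back into the row-level data. The helper name `idr_mean` and the toy `incdata` below are assumptions, not the real dataset.

```python
import pandas as pd

def idr_mean(s: pd.Series) -> float:
    """Mean of the values lying between the 10th and 90th percentiles (inclusive)."""
    lo, hi = s.quantile(0.10), s.quantile(0.90)
    return s[(s >= lo) & (s <= hi)].mean()

# Toy stand-in for incdata: group "b" has one extreme outlier
incdata = pd.DataFrame({
    "INC_TYPE": ["a"] * 10 + ["b"] * 10,
    "TOTAL_MINUTES": list(range(1, 11)) + [5] * 9 + [500],
})

# The custom function sits next to the built-in statistics;
# the result has one row per INC_TYPE and a column named after the function.
df1 = incdata.groupby("INC_TYPE")["TOTAL_MINUTES"].agg(["mean", "median", idr_mean])
print(df1)
```

Selecting the column before calling `agg` keeps the result columns flat; with the dict form `{'TOTAL_MINUTES': [...]}` pandas returns two-level column labels instead.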

lapis sequoia
#

wait my data isnt ordinal so i dont think factor analysis will work

#

just so yall know lol

desert oar
#

ive done PCA on one hot encoded categorical before

lapis sequoia
#

i read that doesnt work well

desert oar
#

yeah its kinda sus from a theoretical perspective

lapis sequoia
#

you think i should do it anyways lol

desert oar
#

id still be curious if target encoding works well

#

idk how well it works for very high cardinality

#

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are known for their power, simplicity, and low error rates when compared to other methods. According to ...

lapis sequoia
#

thanks

#

i looked up mca on google and it showed this lol

desert oar
#

hah

slim fox
#

ive done PCA on one hot encoded categorical before
I did it too

#

surprisingly it can work

desert oar
#

RIP MCA

lapis sequoia
#

so id have to drop the numerical variables before doing MCA~~/PCA~~?

#

alright im just gonna try PCA

desert oar
#

yeah you can concatenate them later

#

numerical features separately

lapis sequoia
#

and if that doesnt work im gonna do MCA

desert oar
#

then MCA'ed categorical features

#

dont think about it too hard

#

youre just playing with lego at this point

sullen kiln
#

^

lapis sequoia
#

ok thanks. but for pca all i have to do is just one hot encode then apply pca on one hot encoded + numeric data right

#

normalize numeric data first*

desert oar
#

no thats what im saying. dont mix them

#

PCA the one-hot encoded categorical

lapis sequoia
#

then just numbers?

desert oar
#

then concatenate that with the numerical data

#

yes

lapis sequoia
#

aahhhhhh okkk thx
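A rough sketch of the recipe described above, using scikit-learn: one-hot encode the categoricals, run PCA on that block alone, scale the numeric columns separately, then concatenate. The column names and the random toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "unit": rng.choice([f"u{i}" for i in range(30)], size=n),
    "region": rng.choice([f"r{i}" for i in range(20)], size=n),
    "amount": rng.normal(100, 20, size=n),
    "age": rng.normal(40, 10, size=n),
})
cat_cols = ["unit", "region"]
num_cols = ["amount", "age"]

# 1) One-hot encode the categorical columns only
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[cat_cols]).toarray()

# 2) PCA on the one-hot block, keeping enough components for 90% of its variance
cat_pcs = PCA(n_components=0.90, svd_solver="full").fit_transform(onehot)

# 3) Scale the numeric columns on their own
num_scaled = StandardScaler().fit_transform(df[num_cols])

# 4) Concatenate: reduced categorical block + untouched numeric block
X = np.hstack([cat_pcs, num_scaled])
print(onehot.shape, "->", cat_pcs.shape, "| final:", X.shape)
```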

slim fox
#

yeah otherwise PCA will likely throw off those categorical

lapis sequoia
#

is it bc u dont wanna lose info from the numbers?

desert oar
#

another option is feature hashing

slim fox
#

if they have 1000s classes

desert oar
#

you can combine all of your 15 categorical features into 1 big "multi categorical" feature

#

and hash it all down to 1000 buckets
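A minimal sketch of that hashing idea with scikit-learn's `FeatureHasher`: every `column=value` pair becomes one string token, and all tokens from all categorical columns share the same 1000 buckets. The toy columns are assumptions; collisions between tokens are expected and accepted with this technique.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy categorical data, hypothetical column names
df = pd.DataFrame({
    "unit": ["u1", "u2", "u1", "u3"],
    "vendor": ["acme", "globex", "acme", "initech"],
    "region": ["north", "south", "south", "east"],
})

# One list of "column=value" tokens per row, so equal values in
# different columns do not land on the same token
tokens = [
    [f"{col}={val}" for col, val in row.items()]
    for row in df.to_dict(orient="records")
]

hasher = FeatureHasher(n_features=1000, input_type="string")
X = hasher.transform(tokens)  # scipy sparse matrix, one row per record

print(X.shape, X.nnz)
```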

lapis sequoia
#

alright im gonna try pca first

desert oar
#

that's basically how vowpal wabbit works

#

and fasttext

lapis sequoia
#

what about the strong correlations >0.80? is it ok to leave those columns in the pca

desert oar
#

yes thats the whole point of pca

#

or MCA

lapis sequoia
#

yea ur right lol

#

thx yall

robust dome
#

hey can someone help me understand this list comprehension:
counts = dict([[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet])

desert oar
#

yuck, a list comprehension being passed to dict

#
dict(
    [
        [letter, sentence.count(letter)]
        for letter in set(sentence)
        if letter in alphabet
    ]
)

not sure if that helps...

pale thunder
#

seems to be Counter(sentence) with all keys that are not in alphabet filtered out

desert oar
#

yeah sentence might be some special thing

#

im not going to second guess that

#

this is kind of amateur code

#

id write it like this

dict((letter, sentence.count(letter)) for letter in set(sentence) & set(alphabet))
#

or

dict((letter, sentence.count(letter)) for letter in set(sentence) if letter in alphabet)

at least
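For comparison, a small sketch of the `Counter` version mentioned above, assuming `sentence` is a plain string and `alphabet` is the lowercase ASCII letters (neither is shown in the original question).

```python
from collections import Counter
import string

sentence = "hello, world!"           # assumed example input
alphabet = string.ascii_lowercase    # assumed definition of alphabet

# Original one-liner, unchanged in behaviour
counts_original = dict(
    [[letter, sentence.count(letter)] for letter in set(sentence) if letter in alphabet]
)

# Counter walks the string once instead of calling .count() per letter
counts = {
    letter: n
    for letter, n in Counter(sentence).items()
    if letter in alphabet
}
```

Both give the same mapping; the `Counter` version avoids rescanning the whole string for every distinct letter.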

lapis sequoia
#

wait

#

in pca

#

when i one hot encode

#

it wont have the problem where the model assumes 1>0 right

#

well it's not an issue right

#

even if it's nominal

#

(i.e. i have several business units i one hot encoded and will do pca on)

desert oar
#

eh?

#

1 = yes, 0 = no

#

so after you one hot encode, 1 is greater than 0

#

w/out loss of generality you can flip 1 and 0 but

lapis sequoia
#

yea i was just making sure

desert oar
#

then you run into things like "actually computing the result"

#

yeah

#

youre fine

lapis sequoia
#

ty

gritty solstice
#

I need some help :(

I'm using a dataset that has location information in the first few columns, as well as a column for each date in the data, with a corresponding value

Example:

+------+-----------+-----------+--------+--------+--------+
| Org  | Org Owner | Building# | 4/1/20 | 5/1/20 | 6/1/20 |
+------+-----------+-----------+--------+--------+--------+
| OrgA | John Doe  | 1234      | $1,256 | $987   | $1,562 |
+------+-----------+-----------+--------+--------+--------+

There are a few more columns identifying specifics about the org that I need to retain, and there are dozens of date columns, each corresponding to a specific date.

The values for these columns are corresponding sales.

I'm trying to pivot the table to have a single row for each date, with its value and the org information

How can I achieve that with pandas?

IE:
+-----+-----------+-----------+------+-------+
| Org | Building# | Org Owner | Date | Sales |
+-----+-----------+-----------+------+-------+

lapis sequoia
#

pd.pivot_table

gritty solstice
#

Which I'm aware of, but can't seem to find any way to get my desired results

#

IE: use columns [:3] as the columns, [3:] as index, and the values per row of the date columns as the values in the new column 'date'

#

Here's an example. Some columns are removed, and all data is false

#

It's like a hybrid of transpose and pivot I think..?

ripe forge
#

You can probably create the desired output by slicing the data frames into two parts

#

How many rows are there in the top dataframe

gritty solstice
#

Currently 3

ripe forge
#

If you could make code that generates a dummy dataframe with at least 2 rows just like that, that will be super helpful

#

Meanwhile let me get to a laptop

gritty solstice
#

sure thing, one sec

desert oar
gritty solstice
#
import pandas as pd

dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})

# one column per date, all filled with the same dummy sales value
for date in dates:
    df[date] = '$1200'
df.head()

Checking now @desert oar

ripe forge
#

looks like that melt probably does the trick

gritty solstice
#

god

#

bless

#

That looks like it did the trick

#

Thank you guys so much!

#

final code to help those who need it (using the sample data)

import pandas as pd

dates = ['01/01/2020','01/02/2020','01/03/2020','01/04/2020','01/05/2020','01/06/2020', '01/07/2020']
df = pd.DataFrame(data={'Org':['A','B','C'], 'Org Owner':['John', 'Sal', 'Chris'], 'Building':['1234','5678','1298']})

# one column per date, all filled with the same dummy sales value
for date in dates:
    df[date] = '$1200'
df.melt(id_vars=df.columns[:3])
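A slightly fuller sketch of the same `melt` call, naming the output columns `Date` and `Sales` to match the desired layout and converting the strings to real dates and numbers. The toy frame and its values are assumptions modelled on the example table from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Org": ["A", "B"],
    "Org Owner": ["John", "Sal"],
    "Building": ["1234", "5678"],
    "4/1/20": ["$1,256", "$987"],
    "5/1/20": ["$987", "$1,562"],
})

id_cols = ["Org", "Org Owner", "Building"]

# Every non-id column becomes a (Date, Sales) pair on its own row
long_df = df.melt(id_vars=id_cols, var_name="Date", value_name="Sales")

# Column headers were strings like "4/1/20"; parse them as month/day/year
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

# Strip "$" and thousands separators so Sales is numeric
long_df["Sales"] = long_df["Sales"].str.replace(r"[$,]", "", regex=True).astype(float)

print(long_df)
```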
desert oar
#

👍

lapis sequoia
#

using pca with 90% variance reduced columns from 9000 to 420

#

so thats fine for a random forest right

#

6000 rows x 420 cols?

desert oar
#

yeah but you might need a large number of trees to get a good sample of columns

#

depending on tree depth

lapis sequoia
#

oh shoot. i used the default of 100 in my previous models (which had ~60-80 columns)

#

im running the random forest right now with 100 trees. i'll change it

desert oar
#

100 is (probably) too low even for 80 columns

#

you should always start with like 10 though

lapis sequoia
#

should i use a grid search?

desert oar
#

just to make sure it works

#

no

#

grid search over tree params, not number of trees

lapis sequoia
#

is there a rule of thumb for number of trees you should use

#

i looked it up and people said 64-128 but obv it depends on data size

#

LOL wtf i got an 18% R-squared

#

my baseline got like 35

desert oar
#

welcome to data science

#

for real this time

#

lol

lapis sequoia
#

im gonna try 200 trees

desert oar
#

id look at tree depth first

#

at 100

#

if its a fast model to train you probably can just grid search and go get a cup of coffee
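A small sketch of what that could look like in scikit-learn: hold the number of trees fixed and grid search over the tree parameters instead. The synthetic data, the grid values and the tree count are placeholders, not recommendations for the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features and target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Search over tree shape and feature sampling, not over n_estimators
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),  # tree count held fixed
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Once the tree parameters look reasonable, the tree count can be raised separately, since more trees mostly reduce variance at the cost of training time.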

lapis sequoia
#

lol yeah im gonna play around with the parameters a bit

#

any ideas as to why PCA with 90% variance explained yielded such shitty performance?

desert oar
#

maybe the resulting features are junk

lapis sequoia
#

or bc i used it after one hot encoding?

desert oar
#

could be, who knows

#

try MCA

#

did you look at the features?

#

do they look sane?

lapis sequoia
#

yeah the principal components?

#

i mean they look like principal components to me lol

#

ill try pca after playin around with parameters

desert oar
#

look at the loadings

#

id seriously consider MCA instead
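A quick sketch of one way to look at the loadings after PCA on one-hot columns: put `components_` into a DataFrame labelled with the dummy column names and sort by absolute value. The toy columns and the choice of three components are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "unit": rng.choice(list("abcde"), size=300),
    "region": rng.choice(list("xyz"), size=300),
})

onehot = pd.get_dummies(df, dtype=float)   # columns like unit_a, region_x
pca = PCA(n_components=3).fit(onehot)

# Rows are components, columns are the original one-hot features
loadings = pd.DataFrame(
    pca.components_,
    columns=onehot.columns,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

# Which dummy columns dominate the first component?
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
print(pca.explained_variance_ratio_)
```

If each component is dominated by one or two dummy columns, the PCA is mostly re-labelling categories rather than finding shared structure.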

lapis sequoia
#

yeah im gonna do MCA now

vital cipher
#

anyone here working on object detection using yolov5?? 🙂

limber pollen
#

Hello guys i m new to this group and learning data science

serene scaffold
#

I'm getting ready to slam my head into a wall.

#

I found a cython implementation for a KD tree, and I made a subclass of np.ndarray that contains a reference to what each ndarray represents

#

but when the kd tree is constructed, it gets rid of that wrapping

#

the pure python kd tree is too slow.

desert oar
#

but when the kd tree is constructed, it gets rid of that wrapping
what do you mean

#

and show your code as usual

serene scaffold
#

warning: it's terrible code

desert oar
#

its cython, nobody writes good cython code

lapis sequoia
#

sorry salt rock quick question -- the problem cant be because i didnt normalize the numerical vars right?

desert oar
#

it absolutely can be

lapis sequoia
#

because i normalized the one hot encoded and not numerics

desert oar
#

well... less likely for random forest

#

but still possible

lapis sequoia
#

should i use pca with numerics as well or move onto mca?

desert oar
#

dont mix them in the pca

lapis sequoia
#

i just want to understand why we handled them separately

#

the numerics vs categoricals

serene scaffold
#

This part is pure Python.

class Vocab(np.ndarray):
    pass

def create_tensor(array, token, padding):
    many_zeroes = np.zeros((padding,), float)  # np.float was removed in newer NumPy; plain float is equivalent
    tensor = np.concatenate((array, many_zeroes))
    tensor = tensor.view(Vocab)
    tensor.token = token
    return tensor

cuis = KeyedVectors.load_word2vec_format('/home/farnsworthsw/datasets/s1975.cui.200.bin', binary=True)
print('make tree')
vocab = [create_tensor(cuis[k], k, 568) for k in cuis.vocab.keys()]
print('vocab made')
tree = sp.spatial.cKDTree(vocab)
print('tree made')

def learn(mention: str) -> str:
    tensor = torch.cuda.LongTensor(tokenizer.encode(mention)).unsqueeze(0)
    bert_output = model(tensor)[0][0]
    bert_output = bert_output.cpu().detach().numpy()
    best = tree.query(bert_output)
    print(vocab_lookup[best[0]])
    return best
desert oar
#

you could try it i guess. im thinking that it will be weird because the variance of a numerical variable might or might not be on par with the variances of a bunch of one hot columns @lapis sequoia

#

just seems incongruous

lapis sequoia
#

alright thank you

desert oar
#

oh i thought you wrote the kd yourself @serene scaffold

serene scaffold
#

no

desert oar
#

C isn't flexible like that. presumably it's just stripping out the data into some kind of lower level buffer data structure thing

serene scaffold
#

😦

#

I'm not sure how else to have vectors that represent specific things.

desert oar
#

what does the tree return from a query?

#

a number?

serene scaffold
#

an ndarray

#

actually a tuple of them, but sometimes there's only one.

desert oar
#

ahhh

#

wait

#

🗞️ you didnt read the docs

#

it returns a tuple

#

1st elem, the vectors

#

2nd element, the integer indexes of the vectors

#

unless it's shuffling the order of the data in which case the whole thing is kind of useless anyway

#

another possibility is to use a dict where the key is the vector

#

because hashing magic

serene scaffold
#

so have a separate lookup table for what the vector at each index represents

#

and look that up?

desert oar
#

yeah i think youd have to

#

that or just a list/array

#

so you can look them up by position

#

from the 2nd returned element

serene scaffold
#

😄

#

I'll try to make that work

#

Thanks 😄

#

however, the docs say that it returns only the 1 nearest neighbor by default

#

but I'm getting 3

#

Actually it's just representing the location of the vector as a 3d thing

desert oar
#

eh? i thought you select that with k

#

it returns 1) the distances, and 2) the indexes
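A minimal sketch of the lookup-by-position idea: keep the labels in a plain list in the same order as the rows handed to the tree, and use the index that `query` returns to find the label. The token names and vectors here are made up.

```python
import numpy as np
from scipy.spatial import cKDTree

# Parallel structures: tokens[i] is the label for vectors[i]
tokens = ["C001", "C002", "C003"]
vectors = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
])

tree = cKDTree(vectors)

# query returns (distances, indexes); k=1 asks for the single nearest neighbour
dist, idx = tree.query(np.array([0.9, 1.1, 1.0]), k=1)
nearest = tokens[idx]

# A 2-D query of shape (m, d) gives m answers, one per row
dists, idxs = tree.query(np.array([[0.1, 0.0, 0.0], [4.0, 5.0, 5.0]]), k=1)
nearest_many = [tokens[i] for i in idxs]
```

This avoids subclassing `np.ndarray` at all, since the tree only needs the raw numbers and the labels live outside it. Passing a 2-D array (for example one row per token of a model output) returns one result per row, which may be why more than one result shows up.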