#data-science-and-ml

chrome rampart
#

ik

#

It's mostly theoretical stuff, right?

lapis sequoia
#

been studying python and compsci for 6 months and I still dunno jack shit compared to someone who's been doing it for 1 year or less

#

yes

#

they use pseudo languages

#

if this then that

chrome rampart
#

like C?

#

Oh, so pseudocode is just code to explain programs simply, not real languages

lapis sequoia
#

yep

chrome rampart
#

Welp, I think I started too quickly, I'm just gonna take tge CS50 course, maybe gonna take three or four months

#

the*

lapis sequoia
#

I thought I'd be done in a month

#

still here 6 months later

chrome rampart
#

CS50?

lapis sequoia
#

No just python courses and books

velvet thorn
#

@merry portal like this?

#
>>> list(zip(*np.nonzero(a < 5)))
[(0, 0), (0, 1), (0, 2), (1, 1)]
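A self-contained version of that snippet, as a minimal sketch. The array `a` is not shown in the chat, so the one below is a made-up stand-in chosen to give the same result.

```python
import numpy as np

# Hypothetical 2x3 array for illustration; the original `a` is not shown in the chat.
a = np.array([[1, 2, 3],
              [7, 4, 9]])

# np.nonzero returns one index array per axis; zip(*...) pairs them into (row, col).
coords = [(int(r), int(c)) for r, c in zip(*np.nonzero(a < 5))]
print(coords)

# np.argwhere returns the same coordinates as an (n, 2) array.
print(np.argwhere(a < 5).tolist())
```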
slim elm
#

No just python courses and books
@lapis sequoia I would play around with a lot of the python packages before you move too far from Python. Maybe start diving into SQL if you are looking to do some data work. You can essentially run SQL syntax in pandas anyways.
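To make the SQL comparison concrete, here is a rough sketch of how a typical query maps onto pandas calls. The `orders` table and its columns are invented for the example.

```python
import pandas as pd

# Hypothetical table, for illustration only.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat", "bob"],
    "amount": [10, 25, 5, 40, 20],
})

# SQL: SELECT customer, SUM(amount) AS total FROM orders
#      WHERE amount >= 10 GROUP BY customer ORDER BY total DESC
totals = (
    orders[orders["amount"] >= 10]                      # WHERE
    .groupby("customer", as_index=False)["amount"].sum()  # GROUP BY + SUM
    .rename(columns={"amount": "total"})                # AS total
    .sort_values("total", ascending=False)              # ORDER BY
    .reset_index(drop=True)
)
print(totals)
```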

agile monolith
slim elm
#

r u trying to run data viz?

agile monolith
#

me?

#

i started r studio a couple of weeks ago

#

i think this is data viz doe yh

#

im pretty sure it is

slim elm
#

ya, I mean are you trying to create charts and shit

#

or run stats calcs

agile monolith
#

yh

#

i have been given a task that is composed of 5 stages

#

the task is to simulate the double slit experiment using R

#

if i were to do this from scratch i would be clueless

#

but the tasks build up to give the result

#

can you help me with this? @slim elm

slim elm
#

rippin a game of HoTS quick then i will look at it

agile monolith
#

i might take a powernap meanwhile

agile monolith
lapis sequoia
#
>>> y_max = 200
>>> -y_max
-200
#

Since the screen goes from -y_max to y_max, you only have to define one of them. just negate it

agile monolith
#

is -y.max automatically equal to -200 when y.max=200?

lapis sequoia
#

If it were from [-300, 200], you'd have to define the lower bound separately

#

I mean yeah, the negative of 200 is -200. Adding a - to a variable negates it

agile monolith
#

damn didnt know that worked in R

#

mind if i ping you in the future for help

#

i just started this lang

#

its my first lang

lapis sequoia
#

sure, though I'm not very knowledgable in R

agile monolith
#

neither am i

hollow quartz
#

hi I use pandas
I have two columns i and j, for example {(1,2), (3,4), (2,1)}. I want to count the pairs of i and j, where (1,2) = (2,1). Can I do this with a single groupby? Or do I use groupby and then iterate?
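One possible approach, sketched under the assumption that the columns are literally named `i` and `j`: sort each row's pair first so that (2,1) and (1,2) get the same key, then a single groupby is enough and no iteration is needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [1, 3, 2], "j": [2, 4, 1]})

# Sort each row's pair so that (2, 1) and (1, 2) share the same key.
pairs = np.sort(df[["i", "j"]].to_numpy(), axis=1)
key = pd.DataFrame(pairs, columns=["lo", "hi"])

# One groupby over the normalised pair gives the unordered counts.
counts = key.groupby(["lo", "hi"]).size()
print(counts)
```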

lapis sequoia
#

Hey guys, does anyone have some tips on how to optimize randomforest regressions most effectively? random gridsearch? gridsearchcv? a combination thereof? I know how to do those, but I'm an absolute beginner in terms of randomforests... i only did a parameter grid test for n_estimators and max_features and the results are rather unsatisfactory. Thanks
bump
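For reference, a small sketch of a randomized search over more than two random forest parameters with scikit-learn. The data is synthetic and the candidate values are placeholders, not recommendations; a common pattern is to run a coarse randomized search first and then a narrower grid search around the best result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Placeholder candidate values; real ranges depend on the dataset.
param_distributions = {
    "n_estimators": [20, 50],
    "max_features": [0.5, 1.0],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                                 # number of sampled combinations
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
    n_jobs=1,                                 # raise to use more cores
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```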

agile monolith
#

```r
# distance (from apertures) to screen
L <- 1000

# size of each aperture
b <- 0.5

# wavelength of light
lambda <- 0.05

# separation of the slits
d <- 3

# wavenumber
k <- 2 * pi / lambda

# all of these define the physical parameters

# define distance along the screen, y, from -y.max to y.max
# size of the screen - extends (-y.max, y.max)
y.max <- 200

# sets the number of pixels on the screen
n.screen <- 501

# -y.max is the negation of y.max, so once y.max is defined, -y.max is its negative counterpart
y <- seq(-y.max, y.max, length.out = n.screen)

# the units for these values are in mm and the centre of the screen is at y = 0

# defines theta
theta <- atan(y / L)

# defines sinc
sinc <- function(x) {
  y <- sin(x) / x
  y[x == 0] <- 1  # "x == 0" checks if x is equal to zero; where it is, the value is replaced by 1
  return(y)
}

# note: a (the slit width) is not defined anywhere above
phi <- 2 * pi * a * sin(theta) / lambda
```

#

this is my code

#

a is slit width

#

but idk how to work it out

#

@lapis sequoia

#

can u help me find out what "a" (the slit width) is

#

everything is above

lapis sequoia
#

That image is blurry on my phone so I can't make out the figures. But I would assume that your function to calculate phi is supposed to accept a as an argument

agile monolith
#

that is probably correct

#

how do i make it accept a as an argument?

lapis sequoia
#

im trying to grok L2 regularization from a mathematical perspective

#

how come the regularization provides a lesser update, according to Ng, than an un-regularized weight update?

#

it seems like it would almost always increase the update magnitude unless lambda is negative xor the derivative is negative
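A numeric sketch of the update being asked about, assuming the formulation from Ng's course where the cost gains a term (lambda / 2m) * ||w||^2. Under that assumption the regularised step can be rewritten as "shrink w by a factor slightly below 1, then take the usual gradient step", which is why it is called weight decay: it is the weight that ends up smaller, not necessarily the size of the step. All numbers below are made up.

```python
import numpy as np

alpha, lam, m = 0.1, 0.7, 50      # learning rate, L2 strength, number of examples (made-up values)
w = np.array([2.0, -3.0])
grad = np.array([0.5, -0.2])      # stand-in for the gradient of the unregularised cost

plain = w - alpha * grad
regularised = w - alpha * (grad + (lam / m) * w)

# Same step rewritten: shrink w by (1 - alpha * lam / m), then apply the usual gradient step.
decayed = (1 - alpha * lam / m) * w - alpha * grad

print(plain, regularised)
```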

deft harbor
#

Say I have a 32 x 32 RGB image that I am passing through two convolution kernels. On the other end, I have a tensor with depth 2. Now I need to apply a bias, but does the bias just get added to each specific element in the output layer?

#
[[0, 1, 0],
 [1, 0, 1],
 [1, 3, 4]]
#

So adding a bias of 3 would result in:

#
[[3, 4, 3],
 [4, 3, 4],
 [4, 6, 7]]
#

correct?

#

Then I would add another bias to the output from the other kernel

#

ignore dimensions, I just slapped something together

velvet thorn
#

each filter has one bias, yes
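A NumPy sketch of "one bias per filter", reusing the 3x3 matrix from the question as the first feature map. The second feature map and the second bias value are invented.

```python
import numpy as np

# Output of two filters: shape (num_filters, height, width).
feature_maps = np.array([
    [[0, 1, 0], [1, 0, 1], [1, 3, 4]],   # filter 0 (the matrix from the chat)
    [[2, 0, 1], [0, 1, 0], [1, 1, 2]],   # filter 1 (made-up values)
])
bias = np.array([3, -1])                 # one scalar per filter

# Broadcasting adds bias[k] to every element of feature_maps[k].
biased = feature_maps + bias[:, None, None]
print(biased[0])
print(biased[1])
```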

valid drum
#

Hi, I've recently started to learn neural nets and now I'm reading this book: http://static.latexstudio.net/article/2018/0912/neuralnetworksanddeeplearning.pdf
When I'm using the first neural net there I'm getting over 97% accuracy but only with pictures from the data set and not with pictures that I create (28x28 with MS Paint). What am I missing regarding to creating/processing the pictures?
The network is implemented here:
https://github.com/MichalDanielDobrzanski/DeepLearningPython35/blob/master/network.py
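A frequent cause of this symptom is an input format mismatch rather than the network itself: MNIST digits are light strokes on a dark background with intensities scaled to [0, 1], while a Paint drawing is usually dark on white with values 0 to 255. The sketch below shows one possible conversion, assuming the image has already been loaded as a 28x28 grayscale array and that the network expects a (784, 1) column; `to_mnist_input` is a hypothetical helper, and centring the digit the way MNIST does is not handled here.

```python
import numpy as np

def to_mnist_input(gray):
    """Convert a 28x28 grayscale image (0 = black, 255 = white) to a (784, 1) float column."""
    gray = np.asarray(gray, dtype=np.float64)
    if gray.shape != (28, 28):
        raise ValueError("expected a 28x28 image")
    inverted = 255.0 - gray              # dark ink on white -> bright ink on black
    return (inverted / 255.0).reshape(784, 1)

# Stand-in for a Paint drawing: white canvas with one black vertical stroke.
canvas = np.full((28, 28), 255, dtype=np.uint8)
canvas[5:23, 14] = 0

x = to_mnist_input(canvas)
print(x.shape, x.min(), x.max())
```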

oblique belfry
#

I know tf/keras has built-in distributed training strategies (PyTorch also has some, but I do not know enough about them), but I also know about Horovod by Uber, and that many people use this library as well. Are there any good benchmarks showing whether there is a difference between Horovod and these other strategies? The ring-allreduce method used by Horovod is interesting, but I am unsure how it could be better than using a parameter server. I do not know much about MPI.

fringe quiver
#

Hi, guys! I have the following problem:
consider the dataset

  • day
  • month
  • year
  • hour
  • some per-hour variable values (like temperature, etc.)
  • XValue (real value in range 0.0 - 30000.0)
  • Is XValue Maximum for this day (0-1)
So, XValue changes every hour of each day and always takes different values. It also reaches its peak value at some hour. Given day, month, year and the per-hour variable values for each hour, I need to guess when XValue will reach its peak value (basically get an int in range 0 - 23). What approaches should I look into to solve such a problem?
thorny pasture
#

Looking at potentially getting a MBP, opinion on how much ram I should get? It's a toss up between 16GB and 32GB likewise with 4GB vs 8GB for graphics. My gaming desktop only has 16GB and I've never had an issue. I plan to get into Data Science more so that's why I'm asking here.

velvet thorn
#

@thorny pasture depends, what do you want to do?

#

16 GB is plenty for many data science uses

thorny pasture
#

I'm still too new to say for certain, could you maybe touch on some cases when 16 wouldn't be?

#

@velvet thorn

velvet thorn
#

so the main use of RAM is storing your dataset to work with interactively

#

in this context.

#

there may be a point where 16 GB is too little and you need 32 GB, but that is rarely the case

#

in my experience, if it's too big for 16 GB, it will likely be too big for 64 GB, and you'll need to use a cluster or something

thorny pasture
#

@velvet thorn what kind of specs does your computer have?

velvet thorn
#

the laptop that I use for DS work, no DL, is some mid-range Lenovo ThinkPad

#

16 GB RAM, in particular

#

I use my desktop for gaming + DL, and it has a GTX 1080 (planning to upgrade in a few months)

#

also 16 GB of RAM

prisma imp
#

any way to write text to a pdf file?

jade cloud
#

@prisma imp reportlab

lapis sequoia
#

can somebody suggest how to optimize the following parameters in random forest?

```
min_impurity_decrease
min_impurity_split
n_jobs
random_state
```
#

is random_state even necessary? I obviously used it for train_test split, but I'm not sure whether it is necessary for using bootstrapping

#

as for the other 3 commands, i don't really understand their explanations very well in the userguide...

#

@velvet thorn i recently updated my PC from 8 to now 24Gb because of the python project and I still have experienced crashes with randomforest gridsearch on higher n_estimators parameter values ^^ my working dataset is "only" 1-1.5m rows and less than 1gb after cleaning. the process took like 48+ hours and at some point crashed

velvet thorn
#

bet it was in IPython

lapis sequoia
#

jupyter lab

#

is there nobody who has experience with optimizing randomforests? :[

velvet thorn
#

yeah, that's built on IPython

#

@lapis sequoia are you sure you know what you're doing?

#

n_jobs is the number of parallel jobs to run (e.g. fitting multiple trees at once).

#

it is not a hyperparameter.

#

neither is random_state.

#

you might as well optimise the time of day at which you run your code

#

min_impurity_decrease just means how much impurity (a measure of how good a tree is at differentiating between values) must go down by before your tree splits

#

min_impurity_split is deprecated, and serves roughly the same purpose as min_impurity_decrease.

#

don't touch it.

lapis sequoia
#

i know random_state is not a hyper parameter, but it is suggested to use a random state when using bootstrapping in some tutorials... I didn't use a specific random_state though. My grid search, however, showed that using bootstrapping is better than not using it.

velvet thorn
#

yes

#

you should set random_state

lapis sequoia
#

thanks 🙂

velvet thorn
#

but you should not be searching over it

#

which was what you asked, I believe

lapis sequoia
#

my question was not well specified I guess, sorry for that

velvet thorn
#

np

lapis sequoia
#

it was aimed to get some explanations for the parameters, and whether random_state is needed when using bootstrapping

velvet thorn
#

yes

#

in general, when you do something involving randomness

#

you should ensure reproducibility

#

which is done by, in this case, fixing random_state
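A small sketch of what fixing random_state buys you, on synthetic data: two fits with the same seed give identical predictions, while a different seed will usually give slightly different ones. `fit_predict` is a helper made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=150)

def fit_predict(seed):
    # n_jobs only controls parallelism; random_state seeds the bootstrap samples
    # (and the feature sampling at splits when max_features < n_features).
    model = RandomForestRegressor(n_estimators=20, random_state=seed, n_jobs=1)
    return model.fit(X, y).predict(X[:5])

same_a, same_b = fit_predict(42), fit_predict(42)
other = fit_predict(7)

print(np.allclose(same_a, same_b))   # same seed, same forest
print(np.abs(same_a - other).max())  # different seed, typically a small difference
```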

lapis sequoia
#

i tried fine-tuning only max_features and n_estimators in the beginning, but the results are very disappointing compared to rbf-SVM

#

I honestly expected (from the papers i read), that randomforest will have the best results for my study... but seems like it doesn't

velvet thorn
#

IME

#

random forest is not very responsive to hyperparameter tuning

lapis sequoia
#

you should ensure reproducibility
yup, that's logical, however, while train_test splits should obviously be reproducible, the random state of the bootstrapping-hyperparameter in my view shouldn't make any difference... because as you said, it's not like the result of hyperparameter tuning will change if you change the random_state

#

random forest is not very responsive to hyperparameter tuning
yeah, I agree

velvet thorn
#

you'll get different bootstrapped samples

#

which will slightly change the fit of the individual trees

lapis sequoia
#

aaah

#

right

velvet thorn
#

of course, being a rather low variance method in the aggregate, this shouldn't change much, but it will make a difference, still.

lapis sequoia
#

but then again.... bootstrapping is a yes/no option

#

so even if there is slight difference, the outcome should remain consistent right

velvet thorn
#

huh?

#

no, each bootstrapped sample will vary slightly

#

because it's drawn randomly from the original data, right

#

the random state influences this
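A NumPy-only sketch of that point: a bootstrap sample is a draw with replacement, so which rows end up in it depends entirely on the seed. `bootstrap_sample` is a hypothetical helper, not scikit-learn's internal code.

```python
import numpy as np

data = np.arange(10)

def bootstrap_sample(values, seed):
    # Draw len(values) items with replacement; the seed determines which ones.
    rng = np.random.default_rng(seed)
    return rng.choice(values, size=len(values), replace=True)

print(bootstrap_sample(data, seed=0))
print(bootstrap_sample(data, seed=1))
```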

lapis sequoia
#

sure sure

#

i mean... when you're using grid-search, then your goal is to check which is best for your specific model as in bootstrapping=[true, false]

velvet thorn
#

ah

#

yes

lapis sequoia
#

and if bootstrapping=true is the better option, that should not change whatever your random_state is

velvet thorn
#

no, not necessarily

#

at least, not theoretically.

#

and not practically as well, if you're using sklearn

lapis sequoia
#

ugh... ok 😦 😄

velvet thorn
#

because the individual trees also are influenced by randomness, entirely apart from bootstrapping

#

see the documentation.

#

and, this aside

#

if you agree that bootstrapping performance varies with random_state

#

surely you can agree that it is conceivable that when random_state takes certain values, bootstrapping may perform worse than not bootstrapping.

#

on the other hand, when random_state takes other values, bootstrapping may perform better.

#

because, remember, it may influence the performance of bootstrapping.

#

if you accept this, then it does matter.

lapis sequoia
#

but would that not make random_state a hyperparameter then? lol

#

if the performance difference was that meaningful, i mean

velvet thorn
#

hm

#

no

#

or rather, practically speaking, perhaps

#

but theoretically, I'd say no

lapis sequoia
#

haha 😄

velvet thorn
#

this is kind of a philosophical question

lapis sequoia
#

hahaha

#

ok, so you suggest i don't touch min_impurity_split and min_impurity_decrease, but max_depth, min_samples_split, min_samples_leaf do make sense to finetune, right? on top of max_features and n_estimators, obviously

velvet thorn
#

no I said

#

don't touch min_impurity_split because it's deprecated

agile wing
#

hi @lapis sequoia, where did you learn machine learning with python?

velvet thorn
#

in favour of min_impurity_decrease

#

anyway, go ahead

#

no harm trying.

lapis sequoia
#

@agile wing self-taught learning by doing and using lots of tutorials on the web and youtube, plus 1-2 books. I had to learn it for my thesis-project... but i'm literally a beginner. this chat was a huge help for me, when I was stuck... gm certainly knows what he's talking about, I on the other hand do absolutely not 🙂

agile wing
#

oh, I'm trying to find a pretty good MOOC on python and ml

lapis sequoia
#

in favour of min_impurity_decrease
@velvet thorn any suggestions what values to use? since it's a threshold i have absolutely no idea... except that default is (default=1e-7)

#

@agile wing sorry what's MOOC?

agile wing
#

online class, like coursera, edx...

#

online class platforms

lapis sequoia
#

mh scikit has lots of data online, github too.... i never really had the time to play around with other data, because my own project had so much data to work with and my project obviously has a deadline

#

@velvet thorn oh and for n_jobs yeah i read the documentation... i was asking, because i thought it might help accelerating the gridsearch process... but i don't know and i don't really want to risk jupyter to crash my PC again :[

agile wing
#

ic thanks

lapis sequoia
#

@agile wing what's your current state of knowledge though? if you're starting from scratch, i would recommend to watch Daniel Chen`s python beginner tutorials on youtube...

agile wing
#

i know my python, but Im new in ML

tame sedge
silk frigate
#

Can someone help me get started with data science? I'm learning Python and I have to make a bar chart of the first column of a csv file but I'm really struggling with it

#

Pls tag me if you can

lapis sequoia
#

@silk frigate what have you done so far?

silk frigate
#

Well

#

I have downloaded the csv file (I opened in Excel now)

#

It's about the McDonalds menu

#

so its looks something like this:

#

And now I want to make a bar chart of the first column (category)

#

At least that's what the assignment asks me to do haha 😂

#

They gave me the start of the program

#
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import menu and display the first two rows of the dataframe
menu = pd.read_csv("mcdonalds_menu.csv")
```
#

but I don't really understand how to make bar charts etc.

#

It's all new to me

lapis sequoia
#

how new are you to programming?

silk frigate
#

I started about 5 weeks ago I think. In my class we started with the basics (data types, etc.)

#

I have worked my way through module 1 till 8

lapis sequoia
#

Ok. So you're using a library called pandas to read the CSV file. pandas is very useful for anything related to data manipulation and analysis

silk frigate
#

Yes

lapis sequoia
#

matplotlib is usually used for plotting, but pandas has some built-in plotting methods

silk frigate
#

Yes I found that link

lapis sequoia
#

(hold on, someone is calling me, brb in 2 min)

silk frigate
#

sure thing

lapis sequoia
#

ok im back

silk frigate
#

hii

lapis sequoia
#

anyways as I was saying. As you see pandas has a built-in method for creating a bar-plot

silk frigate
#

plot.bar()

#

?

lapis sequoia
#

yes

#

However since you want to make a plot from strings (category) you need to find a way to quantify them numerically

#

And in this case I assume it's the number of items in each category

silk frigate
#

yes

lapis sequoia
silk frigate
#

it says it should look like this

lapis sequoia
#

oh

#

lol I kinda misunderstood the task

silk frigate
#

what did you have in mind then? haha

lapis sequoia
#

no never mind I had it right

#

just got confused by the ordering

silk frigate
#

ohh haha

lapis sequoia
#

anyways did you see the above link?

silk frigate
#

yea it's alphabetically

#

not yet let me see

#

how did you find that link?

#

Did you just google "python pandas counting values in column" or something?

lapis sequoia
#

yes

silk frigate
#

hmm okay

#

I'm kinda scared with the formula that's on the top of that page haha

#

all the paramters

#

meters*

lapis sequoia
#

don't worry about that yet

#

The only thing you should take note of are the parameters that don't have a default value

#

Basically parameters that dont have =something at the end of them

silk frigate
#

but all of them have a default value right?

lapis sequoia
#

Yes exactly. With the exception of self. But that can be ignored

#

So that means you can just call .value_counts() without providing any parameters.

silk frigate
#

hmm okay

lapis sequoia
#

Which is great as that means we can just focus on getting some plot working, then finetuning it with parameters as needed

silk frigate
#

haha yes

lapis sequoia
#

Anyways. in pandas you have something called DataFrame and something called Series. The boiled down version is that a DataFrame is just a collection of Series

#

And the link above shows pandas.Series.value_counts()
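A tiny sketch of what that call returns. The rows are made up, since the actual CSV is not shown in the chat.

```python
import pandas as pd

# Made-up rows standing in for the menu CSV.
menu = pd.DataFrame({
    "Category": ["Breakfast", "Desserts", "Breakfast", "Salads", "Breakfast", "Salads"],
})

# value_counts() needs no arguments: it counts how often each value occurs
# and returns a Series sorted from most to least frequent.
counts = menu["Category"].value_counts()
print(type(counts))
print(counts)
```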

silk frigate
#

is a series one column?

#

and a dataframe multiple columns?

lapis sequoia
#

yes

silk frigate
#

ahh okay

#

that kind of makes sense

lapis sequoia
#

yeah. So in your example, if you were to do:

df = pd.read_csv(...)

print(type(df['category']))

you should get <class 'pandas.core.series.Series'>

#

You can access any column in a dataframe like you would a dictionary.

silk frigate
#

but aren't columns always series?

lapis sequoia
#

And if the name of the column doesn't have any spaces you can also do it without the brackets: df.category

#

Yes thats right. But that means that we can do this: df.category.value_counts()

silk frigate
#

so `print(type(df.category))`
would also work?

lapis sequoia
#

yes

silk frigate
#

ah okay

#

Yes so then we're counting all values

#

in that series

#

called category

lapis sequoia
#

Now here's a task for you. Using the documentation, see if you can find out how to plot from a series (instead from a dataframe)

#

Link me the page about plotting a bar-plot from a pandas series

lapis sequoia
#

yes

#

So how would you use this?

silk frigate
#

well pd.plot.bar() is a function from pandas that you can use

#

although I don't really get why there's pandas**.series**.plot.bar() in the title

#

because in the example they don't do series

lapis sequoia
#

That means you can use it on series as well

silk frigate
#

but you shouldn't use that in the actual function?

lapis sequoia
#

What do you mean?

silk frigate
#

well if you are going to use this function

#

should you do: pd.series.plot.bar()

#

or pd.plot.bar()

lapis sequoia
#

Actually what pandas.Series.plot.bar means is that within the pandas library, any Series has an attribute called plot which has a bar method

#

And if you run this: print(type(df.category.value_counts())) you'll see that the output is a series as well

#

And as mentioned, since it's a series we can do this: df.category.value_counts().plot.bar()

silk frigate
#

yes

lapis sequoia
#

Lastly, what the pandas plotting methods do, is just create the figure. To display it we have to use matplotlib after by calling plt.show()
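Put together, the steps above look roughly like this. It is a sketch with made-up rows in place of the CSV; the Agg backend and `savefig` are used only so it also runs without a display, and the output filename is arbitrary. In a normal interactive session you would drop the backend line and call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to get a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the menu CSV.
menu = pd.DataFrame({"Category": ["Breakfast", "Desserts", "Breakfast", "Salads"]})

# Count the categories (a Series), then draw one bar per category.
ax = menu["Category"].value_counts().plot.bar()
ax.set_ylabel("Number of items")
plt.tight_layout()
plt.savefig("category_counts.png")
```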

silk frigate
#

yeah that's weird haha

#

why doesn't python just do that automatically 😭

#

it works haha

#

I'm already proud of this

lapis sequoia
#

good job! 🙂

#

I hope everything made sense to you, and hopefully you learned not just how to plot a bar-chart, but also how to navigate the documentation and understand it

silk frigate
#

haha yes, I really like that I can use discord to learn things like this. The lecture notes of my teacher are sometimes useful but do not really explain things in-depth

#

I do have 2 more questions then tho, how did they manage to sort the categories alphabetically and give cool colours to the bars?

#

If I do `plt.show(df.Category.value_counts(sort=False).plot.bar())`
It shows this graph:

#

so first it sorts it with the value counts descending

#

but how does it come up with the order of categories in this second one?

#

It's not the order in which they appear in the csv file

#

@lapis sequoia could you maybe help with this? It's also completely okay if you don't want to, I get that it takes a long time to help beginners like me get started 😂

lapis sequoia
#

I tried looking into how the sort argument was used but couldn't

silk frigate
#

what do you mean?

#

Can I use this function to sort my values from that series alphabetically?

lapis sequoia
#

oh I looked at the wrong method

#

lol

#

oh man I read your line wrong and thought it was .bar(sort=False)

silk frigate
#

ohh haha

#

that's probably not possible in that function

lapis sequoia
#

Ah I found it

#

So value_counts() uses something called a hashtable which means the ordering is essentially random

#

By default sort=True and so after counting the elements it will sort the result

silk frigate
#

so when that's False, the order is completely random?

lapis sequoia
#

essentially yes

silk frigate
#

so is there a way to sort the values alphabetically?

lapis sequoia
#

Sure. every series and dataframe has an attribute named index which, as you can imagine, holds the index label of every row

silk frigate
#

Yes

#

it starts with 0

lapis sequoia
#

So if you want to change the order of the plot, we have to reindex the series

silk frigate
#

because the initial index was alphabetically?

lapis sequoia
#

not alphabetically, in the documentation is says The resulting object will be in descending order so that the first element is the most frequently-occurring element

silk frigate
#

yes

#

and it will probably create a new series with new indexes

#

so 'Coffee & Tea' has index number 1

#

or 0

#

since that's the most frequently-occurring element

lapis sequoia
#

Yes so to change the ordering from most-frequent to alphabetical I would do this:

category_value_counts = df.category.value_counts()
alphabetically_sorted_indices = sorted(category_value_counts.index)
category_value_counts.reindex(alphabetically_sorted_indices).plot.bar()
plt.show()
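As a side note, pandas also has `Series.sort_index()`, which gives the same alphabetical ordering in one call. A minimal sketch with made-up categories:

```python
import pandas as pd

df = pd.DataFrame({"category": ["Snacks", "Beef", "Snacks", "Coffee", "Snacks", "Beef"]})

counts = df["category"].value_counts()               # most frequent first
by_reindex = counts.reindex(sorted(counts.index))    # the approach from the chat
by_sort_index = counts.sort_index()                  # built-in shortcut

print(by_sort_index)
```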
silk frigate
#

how did u know to use .index?

#

I can't find a webpage that explains that

lapis sequoia
#

Well I knew it already / used the documentation. But if you google pandas change series index it should be first hit

#

oh you meant the .index

silk frigate
#

ehh yes well both of them actually

#

with my question I meant the .index indeed but I'm confused by your 3rd line as well haha

#

at least I understand half of it

lapis sequoia
#

so the same applies. I knew I could get the index by calling .index, but I also found it by googling pandas get index from series

#

I recommend you play around a bit. Try to print out the output of .index and the other methods

silk frigate
#
```python
alphabetically_sorted_indices = sorted((df.category.value_counts().index)

plt.show(df.alphabetically_sorted_indices.plot.bar())
```
@lapis sequoia do you know why my python interpreter gives an `invalid syntax` error for the 2nd line?
lapis sequoia
#

alphabetically_sorted_indices is just a normal python list

#

not a pandas.Series object

#

So .plot.bar() will not work
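On the `invalid syntax` message itself: the first line of the snippet opens two parentheses after `sorted` but closes only one, so Python keeps reading and reports the error when it reaches the next statement. The sketch below reproduces that with `compile()` on a shortened stand-in for the snippet, then shows a balanced version that keeps the Series, which is what `.plot.bar()` needs. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"category": ["Snacks", "Beef", "Snacks"]})

# One opening parenthesis too many leaves the first statement unfinished,
# so Python only notices the problem when it reaches the next line.
broken = "x = sorted((df.category.value_counts().index)\nprint(x)\n"
try:
    compile(broken, "<example>", "exec")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False
print(syntax_ok)

# Balanced version: keep the Series, reorder it, then plot the Series (not the list).
counts = df.category.value_counts()
sorted_counts = counts.reindex(sorted(counts.index))
print(sorted_counts)
```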

#

@silk frigate

silk frigate
#

hmm okay well I'm working on another assignment now

#

I almost finished it

#

but I need to print 3 specific columns and now it prints all of them

#

Do you know how I can manage my code to print only 3 columns @lapis sequoia ?

#
```python
healthy = df.loc[(df["Trans Fat"]==0) & (df["Cholesterol (% Daily Value)"]==0) & (df["Total Fat (% Daily Value)"]<=20) & (df["Sugars"]<=20)].sort_values('Calories')

print(healthy[(healthy['Category']!='Beverages') & (healthy['Category']!='Coffee & Tea')])
```
#

I only want to print the columns 'Category', 'Item' and 'Calories'

lapis sequoia
#

df[['Category', 'Item', 'Calories']]

silk frigate
#

inside of my first line or second line?

#

and why the double square brackets

lapis sequoia
#

That's just an example of the syntax

#

use it on whatever dataframe you want. In your case it's probably on healthy

silk frigate
#

I have difficulties with where to put such an argument inside of my function

#

So I have used df.loc to address specific columns

#

and I gave 4 arguments which need to be true

#
```python
healthy = df.loc[(df["Trans Fat"]==0) & (df["Cholesterol (% Daily Value)"]==0) & (df["Total Fat (% Daily Value)"]<=20) & (df["Sugars"]<=20)].sort_values('Calories')
healthy2 = healthy['Category', 'Item', 'Calories']

print(healthy2[(healthy['Category']!='Beverages') & (healthy2['Category']!='Coffee & Tea')])
```
#

I get an KeyError now

lapis sequoia
#

double [

#

healthy2 = healthy[['Category', 'Item', 'Calories']]
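Why the double brackets: the outer pair is the indexing operator, the inner pair is an ordinary Python list of column names. With a single pair, pandas treats `'Category', 'Item', 'Calories'` as one tuple key and looks for a single column with that name, hence the KeyError. A sketch with invented rows (the item names and numbers are not from the real menu):

```python
import pandas as pd

# Invented rows, for illustration only.
healthy = pd.DataFrame({
    "Category": ["Salads", "Beverages", "Snacks"],
    "Item": ["Side Salad", "Water", "Apple Slices"],
    "Calories": [20, 0, 15],
    "Sugars": [2, 0, 3],
})

# A list of names inside the indexing brackets selects several columns -> DataFrame.
subset = healthy[["Category", "Item", "Calories"]]

# Without the inner list, pandas looks for ONE column named by the tuple and fails.
try:
    healthy["Category", "Item", "Calories"]
    raised = False
except KeyError:
    raised = True

# isin combined with ~ is a compact way to drop several categories at once.
food_only = subset[~subset["Category"].isin(["Beverages", "Coffee & Tea"])]
print(food_only)
```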

silk frigate
#

ahh yes it works

#

how did you know about the double brackets?

lapis sequoia
#

I've worked with pandas before

lapis sequoia
#

what does verbose mean? verbose: int, optional (default=0) Controls the verbosity when fitting and predicting.

oblique belfry
#

It's similar to logging level. As you run the fit method, it will log to the console the stats on each step of training.

lapis sequoia
#

thanks

#

so if you set verbose=2 it would only log every other or every third step?

#

why would anyone want to change that option? it's unnecessary basically, right?

oblique belfry
#

For the Keras API, it is quite helpful.

#

You can read the docs for the functionality, but it's like python logging levels (debug, info, error, warning, critical)

lapis sequoia
#

thanks, i didn't find the explanation regarding verbose in random forest very helpful in the docs... but i don't have time to get too deep into it anyway. i really need to get this project done a.s.a.p

#

what would be a professional way of visualizing predictions vs actual values in a time series model, when there are like 100s of actual values and predictions to be considered for each day?
is there anything other than just comparing the RMSE or MAE with bar charts for different prediction models?

velvet thorn
#

"professional" is a bit of a weird term

#

you could consider an interactive visualisation

#

or some sort of moving average

#

(don't @ me)

#

bar charts are like a one-stop comparison

#

you can use that as an intro

#

and go further into detail with other visualisations

#

there's also a middle ground

#

e.g. group by, say, month, and plot error bar charts

#

maybe plot the residuals
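A sketch of the "group by month" idea with pandas. Everything here is synthetic: the column names (`pred_linreg`, `pred_rf`), the timestamps and the error sizes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600

# Synthetic observations spread over roughly three months.
minutes = rng.integers(0, 90 * 24 * 60, size=n)
results = pd.DataFrame({
    "timestamp": pd.Timestamp("2019-01-01") + pd.to_timedelta(minutes, unit="min"),
    "actual": rng.normal(5, 2, size=n),
})
# Hypothetical predictions from two models.
results["pred_linreg"] = results["actual"] + rng.normal(0, 1.5, size=n)
results["pred_rf"] = results["actual"] + rng.normal(0, 1.0, size=n)

# Residual = prediction - actual, one column per model.
residuals = results[["pred_linreg", "pred_rf"]].sub(results["actual"], axis=0)
month = results["timestamp"].dt.to_period("M")

# RMSE per month and model: square, average within each month, take the root.
monthly_rmse = (residuals ** 2).groupby(month).mean() ** 0.5
print(monthly_rmse)
```

`monthly_rmse.plot.bar()` would then give one group of bars per month, with one bar per model.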

lapis sequoia
#

mhh it's supposed to be in a paper, so interactive visualization is cool as i have to turn in the code, too.... but i'd prefer something that visualizes the results on paper

velvet thorn
#

you're limited only by your creativity

lapis sequoia
#

ok, so i did all the grouping stuff in the description part, to show what the data looks like etc

#

but for the results part, I'd like to have something more than just "here's the RMSEs for the different models, enjoy!" you know?

velvet thorn
#

yeah, that's why I said

#

wait

#

what do you mean

#

ok, so i did all the grouping stuff in the description part, to show what the data looks like etc

lapis sequoia
#

the problem i'm facing is, that the project is about delays that are predicted like 100s of times a day. So i can't just plot a time series on the x-axis and and predicted values on y, like when you predict a max temperature for a day or anything like that.

#

I could try to plot averages over all daily errors for each model-type (I'm comparing 3 model types) maybe... and print it against average actual errors (delay - schedule)

#

in a boxplot maybe?

velvet thorn
#

yeah, moving average...

lapis sequoia
#

what do you mean
in the description part, i showed how the actual delays are distributed for days of the week, hour of day etc.
sure, i could do the same with predictions, but goal is more like making good predictions in general... like for any day over the complete dataset of different trains etc

#

yeah, moving average...
mhh, can you elaborate? I'm considering a moving average a plot that takes previous average errors (let's say, for example, the past 4 days) into account when calculating "todays" error... how would that be an adequate solution?

#

boxplots would probably overlap like 90%. So plotting the average errors for any day for 3 models plus the actual delays from schedule and the baseline prediction (that i'm ultimately trying to beat) I'd probably end up with a very cluttered chart.

#

you're limited only by your creativity
haha, that could be the problem... guess my creativity is just really limited on this 😩 😂 despite having plotted lots of stuff in the description part, i just don't seem to find any good looking way to visualize the results. only bar chart for feature importances and bar chart for RMSE is kinda lame

velvet thorn
#

okay I think I don't really understand what you're trying to do

#

which is why maybe my suggestions don't make sense

#

visualisation is quite an intimate art

lapis sequoia
#

i don't know how to better explain what i'm trying to do :[ but I'll try

#

i have:

```
2. original baseline prediction (which is nothing but a linear shift of previous delays into the future)
3. predicted delays using linear regression
4. predicted delays using SVM
5. predicted delays using RF
```
and I'd like to have some sort of visualization of the model performance results with high information content, other than just RMSE bar plots and residual plots and maybe feature importances and significance/t-stat in case of RF and lin reg, respectively
#

I'd very much like to incorporate the time-series aspect in one plot

burnt topaz
#

hi data scientists of discord 😃, i have a question for you: are pandas and geopandas the same? or do they have nothing to do with each other

summer plover
#

geopandas builds on pandas but they are different. @burnt topaz

burnt topaz
#

@summer plover oh okay

#

i have a problem on #help-coconut maybe you could help me with that

velvet thorn
#

why can't you plot actual delays and error against time?

#

am I missing something

lapis sequoia
#

@ gm how would you do it? i can't plot a line chart against time to compare the delays with predicted delays, because there is not just one delay per day, but hundreds or thousands actually.
So I could, if at all, do a scatter plot. But then again, the delays are in general all very similar... and plotting thousands of points on top of thousands of points in other colors doesn't help anybody to understand what's going on, right?

tough egret
#

Could someone help me? I'm reading a 1.5 GB file and when it gets to about 2kk (2 million) lines it raises "ValueError: Length of values does not match length of index"

mild topaz
#
Traceback (most recent call last):
  File "modeltest.py", line 26, in <module>
    model = load_model("E:/PanModel.model")
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\saving.py", line 583, in load_model
    with H5Dict(filepath, mode='r') as h5dict:
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\utils\io_utils.py", line 191, in __init__
    self.data = h5py.File(path, mode=mode)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\h5py\_hl\files.py", line 408, in __init__
    swmr=swmr)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\h5py\_hl\files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'E:/PanModel.model', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
echo monolith
#

Please answer this

round crane
#

try model = load_model("E:\\PanModel.model")

#

because windows sucks
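
A quick way to rule out the path itself, sketched under the assumption that the model was saved to `E:/PanModel.model` as in the traceback: Python on Windows normally accepts forward slashes, and errno 2 means nothing was found at that location, so it may be worth checking that the file exists before calling `load_model`. The helper name `model_file_exists` is made up for illustration.

```python
from pathlib import Path


def model_file_exists(path_str):
    """Return True only if path_str points at an existing regular file."""
    return Path(path_str).is_file()


# Path taken from the traceback above; adjust to wherever the model was saved.
model_path = "E:/PanModel.model"
if not model_file_exists(model_path):
    print(f"Nothing found at {model_path}; check the drive letter and file name.")
```

If the check fails, the slash style is probably not the culprit; the drive, folder, or file name is.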

echo monolith
oblique belfry
#

I hate to be a dick, but this is where I just google what the ImageDataGenerator does. There are some things I just don't remember.

jolly briar
#

that's not being a dick - it's showing how you'd solve the problem

echo monolith
#

oh I got that thing done... it does all of it.. all options were to be checked

#

Anyways... could you help me with

#

What are the input and output shapes of an embedding layer with vocab_size = 1000 and embedding dimension = 25

oblique belfry
#

Well, I try to not answer questions with just "google it", but that is how I got through grad school, so it works. lol

echo monolith
#

I've been googling it for the past hour... thanks anyways

oblique belfry
#

Sometimes reading the documentation is the best answer I can give.

echo monolith
#

thank you very muchhh

lapis sequoia
#

Somebody there who is familiar with xarray and netCDF?

silent current
#

I'm getting SystemError: <built-in function imread> returned NULL without setting an error

#

I'm passing a jpg file path to cv2.imread()

#

It was working a few minutes ago lol

#

literally didn't touch that portion of the code

eager heath
#

Still using the same jpg file?

ripe forge
#

Did the file poof? Is the file open? Did your computer turn against you?

silent current
#

My computer has turned against me

ripe forge
#

Abort abort!

silent current
#

printing out the path right before the call to imread():

#

datasets\animals\cat\cats_00001.jpg

ripe forge
#

Is the file still actually a proper image?

#

Try opening it outside python

silent current
ripe forge
#

Pretty. Hmm, I'm not sure. Try restarting python I guess

silent current
#

¯\_(ツ)_/¯

cosmic crater
#

What are good Data Science Ideas for Beginners looking to build a portfolio? Do you have any suggestions for Environment or Finance related projects?

lapis sequoia
#

I am interested in finding out exactly the same thing @cosmic crater

cosmic crater
#

@lapis sequoia I would like projects that stick strictly to data analytics and statistics, and not everything about machine learning (for example linear algebra, differential calculus, practically the rest of data science), as I feel that's too advanced for me right now.

What about you?

lapis sequoia
#

Same, but I am also curious about machine learning @cosmic crater

jolly briar
#

@cosmic crater being able to build a clean dataset from open data sources

shell mirage
#

Hi,

I am not sure if this is the right channel to ask questions. If it’s not let me know.

So I have a file with X,Y,Z data which I managed to load.

I wanted to visualize it in a 2D plot with z as the color scale. Tried googling but nothing seems to be working from what I tried.

Any suggestions?

Thanks

cosmic crater
#

@jolly briar Can you give example of open data sources. I think for me that's the first step. Finding Open Data Sources, I would like anything Environmental, Any-Science Based Sources, Finance, or Economic.

jolly briar
#

@cosmic crater gov sites often have them

#

often buried in excel sheets et

#

*c

cosmic crater
#

@jolly briar Thanks

jolly briar
#

@cosmic crater search UK gov transport data or something

#

You'll bump into portals and there's loads of stuff

cosmic crater
#

@jolly briar I'm going to search for U.S. based data since I'm from the United States. Unless it's climate change related, cuz I know the Trump Administration took down the data on climate change from the EPA and NOAA (I think).

jolly briar
#

Whatevers clever 👍

nova nest
#

Can anyone explain to me the difference between MarkovChain and Word2Vec?? They seem to do the same thing: group up words with the closest context.

thin terrace
#

Which models other than neural networks use gradient descent?

desert cradle
#

evolutionary programming

#

maybe - i don't know much about this stuff tbh

deft harbor
#

Im working with a keras multi-class classification problem. After fitting the model, how do I get the predictions? When I use model.predict, I'm not ending up with ones and zeros.

#

Is there another tool I should use?

thorny pasture
#

How much memory do you think is needed for this field? I know another person was helping me the other day and said it doesn't matter if I do 16, 32, etc cause I will have to spin up a service no matter what when it gets big enough. I just wanted to confirm with some other Data Science people how they feel on that?

I'm looking at getting a 16" Macbook Pro, but I'm not sure what configuration. A lot of friends say 32GB, but they might just be thinking about themselves and what they do.

oblique belfry
#

@deft harbor I assume the last layer in your model is a Softmax layer. Run an argmax on the output vector. It will tell you the class with the highest probability. That’s your output.

Note. If there are ten labels you want to predict, the output vector is either (10,) or (1, 10). Make sure argmax is working on the correct dimension.
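
A minimal sketch of the argmax step, assuming the model ends in a softmax and that `model.predict` returns one row of class probabilities per sample. The `probs` array below is made-up data standing in for that output.

```python
import numpy as np

# Hypothetical softmax output for 3 samples and 4 classes (each row sums to 1).
probs = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.90, 0.05, 0.03, 0.02],
])

# axis=1 picks the most probable class within each row, one label per sample.
predicted_classes = np.argmax(probs, axis=1)
print(predicted_classes)  # [1 2 0]

# A single prediction of shape (4,) needs no axis argument.
single_class = np.argmax(probs[0])
print(single_class)  # 1
```

Using `axis=0` on the 2D array would instead compare samples against each other per class, which is the dimension mix-up the note above warns about.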

deft harbor
#

Thanks, I found that after a lot of digging.

#

Should have thought about it to be honest.

thorny pasture
#

Anyone have input on my earlier question?

silent swan
#

it depends on your usage

#

I don't do any compute on my laptop, but I'm always watching stuff in the background and have tons of tabs open

#

also pycharm eats memory

#

so I'm beyond stretching my 16gb

thorny pasture
#

So you regret only buying 16?

#

@silent swan

silent swan
#

yes but also my computer is 5 years old now

#

and I have permanent access to a cluster meaning I don't do any analysis on my laptop at all

thorny pasture
#

I'm new, but how did you get permanent access to a cluster @silent swan

silent swan
#

university researcher

thorny pasture
#

Would you also then recommend I go for the 32GB over 16GB? On my personal PC for gaming I have 16GB and have maybe twice slowed down

#

Oh gotcha

silent swan
#

really comes down to your own usage. I guess in my case most of what I'm using the memory for isn't really work related (except pycharm eating memory)

thorny pasture
#

Normal usage would be fine, but alas I'm too new to say how much I'd use

#

But worst case I buy aws instance time

silent swan
#

well monitor your own ram use now and see how much of a margin you have

thorny pasture
#

generally speaking 50-60%

cinder prawn
#

16 is most def good enough for data science

oblique belfry
#

I have to give my team lead a write up of the ML lifecycle so we can make sure our project managers truly understand machine learning and how it will fit in our company's workflow.

I'd appreciate it if you can give this a quick overview and let me know if there is anything I have forgotten that I need to add. Tried to really condense this down to easier talking. I could write an entire manifesto on the machine learning lifecycle.

https://docs.google.com/document/d/1slmpunUPAjR_bC8G4LjE8x6bKN9iijrI3SVeE4vMtpU/edit?usp=sharing

Thanks.

thin terrace
#

Which metric of accuracy, precision, recall, f1 and auc should I look at when tuning the hyperparameters of my model? I have an imbalanced dataset 4:1 ratio but I do resample the training data with SMOTEENN to make it balanced, however the test/validation remains imbalanced as it should not be resampled.

royal lodge
#

I'm thinking it's because the estimator stored in grid was only fitted with 90% of the dataset (cv=10). Is this assumption correct?

oblique belfry
#

@thin terrace A choice of metric(s) will be influenced by the business objectives.

However, keep an eye on the validation loss as well. You want to minimize the loss as you optimize for hyperparams.

thin terrace
#

@oblique belfry business metrics? Why is minimizing loss important and how important is it compared to the mentioned metrics?

oblique belfry
#

I meant "business objectives." I corrected that mistake.

#

I have personally chased down metrics to be where I wanted them without realizing the loss was going up. I manipulated the data and the model to get me the metric that I wanted without realizing it was learning less.

thin terrace
#

So I'm classifying the default credit card tabular dataset (binary).

oblique belfry
#

I have a similar skepticism of p-values. One can "p-hack" all day, but does that mean that their model(s) is/are effective?

Just try to have a holistic view on everything.

#

Is this for practice or for your business?

thin terrace
#

just a part of a ML experiment im running

#

Trying to learn some hyperparam optimization

oblique belfry
#

Ah....

#

I was probably a bit overkill with my advice.

thin terrace
#

I'm basically doing random grid search and I compile the metrics for each param-setting before I go on and actually use the best model

#

Now I just need to know which metric to look for when deciding the best one

oblique belfry
#

I'd watch the true positive / false positive rates (the true positive rate is the same thing as recall). For fraud, classifying a fraudulent transaction as okay is more costly than over-identifying transactions as fraud. I'd go for a large gap between the true positive and false positive rates. But that is just my approach to it.

#

I'd go with this approach because you are looking at how the model does on each class, and not in its entirety.

#

If you have a 4/1 split of non-fraud/fraud data and you get 80% accuracy, this might seem great. But if you look at precision/recall, you might see that you labeled 99% of the non-fraud labels correctly, and got 1% of the fraud labels correct. Obviously, this is not good.
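
To make that concrete, here is a small sketch with made-up labels: a 4:1 split and a degenerate model that always predicts the majority class. The helper functions are written out by hand for illustration; scikit-learn's metric functions would normally be used instead.

```python
# Hypothetical labels: 80 negatives (0) and 20 positives (1), a 4:1 split.
y_true = [0] * 80 + [1] * 20
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


print(accuracy(y_true, y_pred))  # 0.8
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks respectable while the minority class is never detected, which is why per-class metrics matter on imbalanced data.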

thin terrace
#

Yes I'm aware of that part

#

However, this dataset does not classify frauds

#

it classifies whether a clients credit card will be defaulted next month or not

#

Maybe I should just go for f1-score?

oblique belfry
#

Once again, another assumption. My bad. But the logic is still sound.

#

yeah. F1 seems like a good start. Should get you where you need to go.

This wiki might be overkill, but definitely helped me get that there were more metrics than just the normal accuracy. https://en.wikipedia.org/wiki/Sensitivity_and_specificity

#

What library are you using to train?

thin terrace
#

keras

oblique belfry
#

There should be an F1 metric callback. Probably in the tf.keras.metrics in tensorflow 2.x.

thin terrace
#

didnt keras remove their metrics because it's misleading to use during training?

oblique belfry
#

So....I know keras does not have metrics. tf.keras does.

thin terrace
#

"Basically these are all global metrics that were approximated
batch-wise, which is more misleading than helpful. This was mentioned in
the docs but it's much cleaner to remove them altogether. It was a mistake
to merge them in the first place."

#

As I understand it, they should only be calculated once training is done?

oblique belfry
#

Lol.

So....you bring up a good point. I had to modify my own keras ones to be stateful for my old work. I think tf.keras.metrics does this automatically.

#

So to have less work, probably ought to do it at the end. lol Will def be easier for you.

#

Thank goodness scikit-learn has nice helper functions for F1 score and Confusion Matrices. You can def use that.

thin terrace
#

yeah, atm I calculate the metrics after the training at each fold of my 10-fold cross-validation, then I take the average of all 10 folds

#

using sklearn yes

oblique belfry
#

I laugh because I learned that metrics thing the hard way and spent MANY hours reading Keras source code to get what I wanted working correctly.

thin terrace
#

where each row is the metrics for a setting of parameters

lapis sequoia
#

can anybody recommend some brief but useful NN tutorials? xD

#

i need to implement emoji2vec into my project and realising I have a limited timeframe to complete it

lapis sequoia
#

What does this exactly mean?

#

1 threads and 100 connections

ripe forge
#

Without context, tough to say

#

If I had to guess, it means a server that can only handle 1 request at a time, since it only runs on 1 thread. And then there's like 100 connection requests.

lapis sequoia
#

I need someone to explain this to me so it'll stick

#

I want to understand why I need to include continent in the SELECT here

#

intuitively it makes sense, but I want to remember for the long run

#
SELECT continent, max(women_in_parliament)
FROM countries
GROUP BY continent
ORDER BY continent
velvet thorn
#

when you group by a column and aggregate, you get one aggregated value for every unique value in the column.

#

max(women_in_parliament) is the aggregated value...

#

...and continent is the unique value it corresponds to.
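
A small runnable sketch of the same query, using Python's built-in `sqlite3` with an in-memory table. The table contents are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (name TEXT, continent TEXT, women_in_parliament REAL)"
)
# Made-up rows: two countries per continent.
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("A", "Africa", 20.0),
        ("B", "Africa", 35.5),
        ("C", "Europe", 40.0),
        ("D", "Europe", 28.0),
    ],
)

# One output row per unique continent, each paired with its aggregated value.
rows = conn.execute(
    """
    SELECT continent, max(women_in_parliament)
    FROM countries
    GROUP BY continent
    ORDER BY continent
    """
).fetchall()
print(rows)  # [('Africa', 35.5), ('Europe', 40.0)]
conn.close()
```

Without `continent` in the SELECT you would still get two maxima, but nothing in the output would say which continent each one belongs to.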

lapis sequoia
#

thanks.. now I can remember this

slow yew
jaunty basin
#

14 year old trying to figure out how i could get a p-value from quantum random numbers with a range of -2, and 2 in python. if someone can help me that would be much appreciated. my goal here is to see if consciousness intent has any sway over quantum random numbers. kinda like what this university is doing. http://noosphere.princeton.edu/
add me on discord: leyland124#3364

lapis sequoia
#

anyone knows how to print predictions on an SVM in a for loop?

#
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_excel('C:\\Users\\GentB\\OneDrive\\Documents\\Python\\2020\\FootballPredictions.py\\data.xlsx', 
                         sheet_name='Dataset')
dataset = dataset.head(500)

X = dataset.drop('Result', axis=1)
y = dataset['Result']

X_train, X_test, y_train, y_test = tts(X, y, test_size = 0.20)

svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

y_pred = svclassifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
#

the output I get is only the accuracy

 [ 1 39]]
              precision    recall  f1-score   support

          -1       0.98      0.98      0.98        60
           1       0.97      0.97      0.97        40

    accuracy                           0.98       100
   macro avg       0.98      0.98      0.98       100
weighted avg       0.98      0.98      0.98       100
hollow shard
#

Does anyone know why this function with the guvectorize target set to cpu works, but when I set it to cuda, it gives me this error:

Invalid use of Function(<built-in function sub>) with argument(s) of type(s): (array(float64, 1d, A), array(float64, 1d, A))
Known signatures:
 * (int64, int64) -> int64

[The known signatures list goes on for a bit but you get the point]
Code:

@guvectorize([(float64[:], float64[:], float64, float64, float64[:])], "(n),(n),(),() -> (n)", nopython=True, target="cuda")
def calc1(a, b, g, m, out):
    vec = a-b
    r = ((a[0]-b[0])**2+(a[1]-b[1])**2)**0.5
    out = m*g*vec/(r*r*r)
#

I think I shouldn't be getting, as the two numbers I'm subtracting are float64, which is supported like it says in the error

#

Thought you guys might know since this is a numba question

paper niche
#

@lapis sequoia what do you mean? just print(y_pred)?

lapis sequoia
#

@paper niche yeah figured it out later on

wanton wasp
#

I have a question if anybody can help me...
Im working on a project and i want to gather twitter data. Now i want to refrain from using the twitter API and i stumbled upon a module on github called twint.
Its more or less perfect for what i need but i always get an error after about some 8000 scraped tweets.
Does anybody know any other way of going about this?

silent current
#

So sklearn.KNeighborsClassifier has parameters for both metric and p. I'd like to test my model using the manhattan metric (which I was always under the impression implied p=1), and the euclidean metric (again, I assumed this implied p=2...). Do I need to be changing both parameters?

velvet thorn
#

no

#

you don't

#

just change either
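
For context, scikit-learn's `KNeighborsClassifier` defaults to `metric='minkowski'` with `p=2`, and the Minkowski distance reduces to Manhattan at `p=1` and Euclidean at `p=2`. A hand-written sketch of that relationship (the `minkowski` helper is for illustration only, not the library's implementation):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3**2 + 4**2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```

So either leaving the metric at its Minkowski default and switching `p`, or naming the metric directly, should select the same distance.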

lapis sequoia
#

I need some advice calculating vCPUs

#

can someone help me calculate, I'm not familiar with how to read machine sizing

granite steppe
#

so i have found these MIT courses for linear algebra and single variable calculus

#

was wondering is this enough for understanding basic level of linear algebra and calculus or should i look for more resources regarding these topics ?

#

im trying to revise on the maths needed for data science ....

crimson flame
#

should also learn some multivariable calculus

#

but it's mostly the easy stuff there that's useful (partial derivatives, directional derivatives, gradient, etc)

jolly briar
#

@granite steppe most people who've been through uni won't remember most of their courses

granite steppe
#

sounds fair enough..thnx for the info @crimson flame @jolly briar

crimson flame
#

and probability/stats

small tartan
#

Hey guys! I need a push in the right direction

#

I have 2 tables:

#

This is just proof of concept. will be built in sql and displayed in tableau

#

Table1 is by week. Table2 is by month

#

This is content usage data and quota attainment with dummy data

#

I need to perform some kind of operation to get the desired outcome.

#

Desired outcome: Sort content by 'best' to 'worst' judging by the quota that was attained during its months usage.

#

Any ideas or directions to research? I'm kind of at a standstill and its late in the day so my head is not worth much

real wigeon
#

heck I just wanna know how to set a condition so that if it's the first cell in a specific row, then do this

velvet thorn
#

cell, meaning Excel?

real wigeon
#

yea

#

iloc?

velvet thorn
#

huh?

#

didn't you say Excel?

real wigeon
#

pandas df

#

should I use iloc?

velvet thorn
#

so you mean reading a spreadsheet into a pandas dataframe

#

and working with it with the pandas API?

real wigeon
#

ya

velvet thorn
#

and not openpyxl?

real wigeon
#

correct

velvet thorn
#

okay, then that's not Excel, that's pandas

#

anyway

#

how do you identify the row?

real wigeon
#

I can show my code

#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

real wigeon
velvet thorn
#

so what do you want to do?

real wigeon
#

lines 13 - 20

#

basically my script spits out some strings into a .txt file

#

and i need to do certain things depending on the type of information in certain columns from the df

#

at the start of the df I need to print a certain set of strings

#

and I don't know how to articulate that into python code

#

so if on index 1 I guess?

#

does that make any sense?

velvet thorn
#

a little...?

#

so basically

#

that's this line, right

#

if branch == df.iloc[0,0]:

real wigeon
#

yeah

velvet thorn
#

which is the first cell of the first row

real wigeon
#

I believe

#

IDK if that includes the header though?

#

but either way i need to make it so that on the first row/cell adpo_x_syntax saves a set of instructions

#

that script has 3 sets of strings depending on conditions, but I only need help with what I just wrote

velvet thorn
#

uh...

#

why don't you just if on i?

real wigeon
#

I actually tried that as I was explaining this to you

#
            if i == 1:
                adpo_x_syntax = [
                    'Key tab',
                    'Type ' + str(buyer),
                    'Type ' + str(int(branch)),
                    'Type ' + str(int(vendor)),
                    'Key Enter',
                ]
velvet thorn
#

also I don't really see why you wouldn't use itertuples, too

real wigeon
#

uh because the df reuses branch values

velvet thorn
#

also, that should be 0, not 1

#

unless your index specifically starts from 1

#

for the general case you can just use if i == df.index[0]

real wigeon
#

well is index 0 the header

#

?

velvet thorn
#

depends on what you mean by "header" and how your file is formatted

real wigeon
#

it's an xls

#

and it has a header

#

and it's still not printing the right condition

velvet thorn
#

so you mean the actual Excel header

real wigeon
#

I'm asking if df.index[0] corresponds to the header row, because my data starts in the 2nd row

velvet thorn
#

wait.

#

do you mean the actual HEADER, or the first row which happens to contain the column names?

real wigeon
#

row which contains the column names

velvet thorn
#

okay, then that's not a header

#

it has a specific meaning in Excel

#

anyway, by default, the first row will be parsed as column names.

#

and therefore not part of the data

#

therefore the row you access with .iloc[0] refers to the second row in the Excel spreadsheet

real wigeon
#

I see

velvet thorn
#

try playing with it in an interactive session

#

it'll be easier for you to understand

real wigeon
#

I can conceptualize it

#

but regardless running a if i == df.index[0]: still doesn't print the right set

#

it is printing the lines 49 and greater

velvet thorn
#

print both values and see what you get

real wigeon
#

please be more specific, which values

#

im referring to the variable adpo_x_syntax

velvet thorn
#

i

#

actually, just play with the dataframe interactively

#

this should be pretty simple to debug if you actually have the data...?

#

like

#

print(list(df.iterrows())[0])

real wigeon
#

?

#

you're asking me to print the i of an else condition

velvet thorn
#

no

#

in the for loop

#

like

#

print(list(df.iterrows())[0][0]) this gives you the first value of i

#

compare that with df.index[0]

real wigeon
#
with open('excel_po.txt', 'a+') as f:
        for i, row in df.iterrows():
            branch, item, distro_size, delivery_date, buyer, vendor = row
            print(df.index[0])
            print("---------")
            print(list(df.iterrows())[0][0])
            if pd.isnull(branch):
#
0
---------
0
0
0
---------
0
1
0
---------
0
2
0
---------
0
3
0
---------
#

so on and so forth

velvet thorn
#

uh

#

your if conditions are wrong

#

I think you want an elif in the middle

#

also I actually meant to have those print statements outside, since the input doesn't change

real wigeon
#

outside the?

velvet thorn
#

loop

#

doesn't matter

real wigeon
#

well df.index[0] corresponds to 0

#

and

#

print(list(df.iterrows())[0][0])

#

the first value is 0

velvet thorn
#

uh

#

the point is

#

they don't change

#

df.index[0] and list(df.iterrows())[0][0] are constants.

#

but anyway

#

like I said

#

you have if followed by an if-else

#

and your conditions are such that

#

the else will trigger if the if does, from what I see

#

if i == 0:

and then:

if branch != df.iloc[i - 1, 0] and i != 0:
    ...
else:
    ...

logically, therefore, the else will trigger if at least one of those conditions is not true

#

i.e. it will trigger in all cases where i == 0

#

since you use the same variable in all branches, the value in it will be overwritten
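
A stripped-down sketch of that overwrite, with hypothetical function names and placeholder strings standing in for the real `adpo_x_syntax` lists. `branch_changed` stands in for the `branch != df.iloc[i - 1, 0]` comparison.

```python
def choose_syntax_buggy(i, branch_changed):
    # Two separate statements: the second if/else always runs.
    syntax = None
    if i == 0:
        syntax = "first-row"
    if branch_changed and i != 0:
        syntax = "new-branch"
    else:
        syntax = "same-branch"  # overwrites "first-row" when i == 0
    return syntax


def choose_syntax_fixed(i, branch_changed):
    # One if/elif/else chain: exactly one branch runs per row.
    if i == 0:
        syntax = "first-row"
    elif branch_changed:
        syntax = "new-branch"
    else:
        syntax = "same-branch"
    return syntax


print(choose_syntax_buggy(0, True))  # same-branch
print(choose_syntax_fixed(0, True))  # first-row
```

Chaining the conditions with `elif` means the first-row case can no longer fall through into the `else`.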

real wigeon
#

i mean i had if pd.isnull(branch):

#

if i == df.index[1]:

#

elif branch != df.iloc[i - 1, 0] and i != 0:

#

else:

real wigeon
#

i feel like you are mistaken, these aren't nested

serene scaffold
#

Are there any O'Reilly books that go into word embeddings? I decided to write a paper on them for my linear algebra class and I want to write some pseudocode (which will probably just be Python) in the paper for how they're created and used mathematically.

velvet thorn
#

yes, precisely, they aren't nested

#

in the first one

#

and in the second one you're comparing to df.index[1]

#

which is the second row

oblique belfry
#

Has anyone successfully integrated Metaflow and DVC together?

flint hamlet
#

Hey anyone here that visualized WhatsApp Chat Data with Python? A few weeks ago I tried it but had some problems. I wanna try it again later and I was curious if anyone's willing to help me out with it when the problems occur again.

#

Not sure if I should ask it here or in any of the help channels. Just tell me when I'm doing something wrong

brazen canyon
#

I'm interested. I'm not so much of a data science guru tho 🧐

flint hamlet
#

Cool. Mind if I add or dm you so I remember you later this evening? I still have to finish some stuff so I might do it in a few hours or tomorrow. @brazen canyon

brazen canyon
#

I don't mind. Feel free to dm or add me

thorny pasture
#

What editors do you guys use? Someone recommended me Anaconda, but I'm pretty sure that's super bloated no?

oblique belfry
#

VSCode.

ripe forge
#

i like my anaconda. anaconda isnt an editor though

#

anaconda is a "batteries included" approach to python/data science. comes with a lot of goodies. So, it depends on whether you want that or not

#

(the editors anaconda would ship with would be jupyter notebook and spyder)

thorny pasture
#

@ripe forge @oblique belfry The guy is saying he personally uses a Text Editor and a separate IPython REPL. Do you need a REPL, is there a reason for it?

ripe forge
#

that's where python runs

#

without a place to run python, you're basically writing in the equivalent of a notepad or word document

#

(with some fancy features.) 😛

#

anyways when i was learning this stuff, i personally liked using things that didn't make me worry about these kinds of details

oblique belfry
#

Ah. I use VSCode for most things. If I need to interact with stuff, I'll use IPython.

I like Jupyter/IPython, but I tend to just run everything as scripts.

thorny pasture
#

Im so confused

ripe forge
#

if you're confused, download ONE thing

#

draw a chit, whatever.

thorny pasture
#

Why do you need multiple really though

ripe forge
#

they all work.

#

don't ask that question right now 😛

thorny pasture
#

no like repl and editor

oblique belfry
#

If I need to make plots or visualize images, then I'll use Jupyter.

#

Ah.

thorny pasture
#

Nothing wrong with viewing it here

ripe forge
#

then you're "running python" behind the scenes

thorny pasture
#

behind the scenes?

#
from bs4 import BeautifulSoup as Soup
import requests
from pandas import DataFrame

ffc_response = requests.get(
    "https://fantasyfootballcalculator.com/adp/ppr/12-team/all/2017"
)


adp_soup = Soup(ffc_response.text, "html.parser")

# adp_soup is a nested tag, so call find_all on it

tables = adp_soup.find_all("table")

# find_all always returns a list, even if there's only one element, which is
# the case here
len(tables)

# get the adp table out of it
adp_table = tables[0]

# adp_table another nested tag, so call find_all again
rows = adp_table.find_all("tr")

# this is a header row
rows[0]

# data rows
first_data_row = rows[1]
first_data_row

# get columns from first_data_row
first_data_row.find_all("td")

# comprehension to get raw data out -- each x is simple tag
[str(x.string) for x in first_data_row.find_all("td")]

# put it in a function
def parse_row(row):
    """
    Take in a tr tag and get the data out of it in the form of a list of
    strings.
    """
    return [str(x.string) for x in row.find_all("td")]


# call function
list_of_parsed_rows = [parse_row(row) for row in rows[1:]]

# put it in a dataframe
df = DataFrame(list_of_parsed_rows)
df.head()

# clean up formatting
df.columns = [
    "ovr",
    "pick",
    "name",
    "pos",
    "team",
    "adp",
    "std_dev",
    "high",
    "low",
    "drafted",
    "graph",
]

float_cols = ["adp", "std_dev"]
int_cols = ["ovr", "drafted"]

df[float_cols] = df[float_cols].astype(float)
df[int_cols] = df[int_cols].astype(int)

df.drop("graph", axis=1, inplace=True)

# done
print(df.head())
oblique belfry
#

I read this question wrong. So sorry to give a confusing answer.

ripe forge
#

when you invoke python, it goes and does its running and stuff, and then just throws you the output back

#

aka, you already actually have a REPL

#

so yeah...you dont need anything else 😛

#

ipython is nice though

thorny pasture
#

what for exactly?

ripe forge
#

so, the big power of python comes when you dont just run it as a script

#

but rather run it interactively

#

ipython does wonders when you're trying to run stuff back and forth

thorny pasture
#

back and forth

#

?

ripe forge
#

(as in, imagine running just first 3 lines of your program, getting the output, then continuing to work, writing couple more lines, but selecting and choosing whatever you really want to run)

#

it's one of the things i loved most about python when paired with a good IDE

thorny pasture
#

Sounds very weird!

ripe forge
#

it IS

#

and you'll get the outputs wrong so many times initially

thorny pasture
#

DS sounds so complex haha

ripe forge
#

but there's just a charm of just, you know..instantly selecting a variable, and running it in REPL, and it spits out its value

#

without having to rerun the whole script

#

you can even run code out of order. not recommended initially, at all!

#

it lets you essentially "Experiment" with writing the logic of the code, and quickly running just that line

#

leads to some insane boost in productivity once you get used to it

#

if it all sounds like hand wavey and fancy, don't worry, it's probably meant to be hand wavey and fancy. just use python any way you prefer.

thorny pasture
#

Thats what im told from someone else

#

lol

ripe forge
#

mhm

#

opinions everywhere

thorny pasture
#

yeah.

ripe forge
#

fwiw, i give vs code full points too. it's not bad at all

#

just, my personal first choice is spyder still. somehow vs code makes me feel "cramped"

thorny pasture
#

conflicted as hell

ripe forge
#

(and in terms of simply market share on IDE, actually pycharm is on top. but again, pick whatever. they all do the same thing)

#

literally, pick one at random.

#

not like your choice is locked for life 😛

thorny pasture
#

but its like totally different anaconda has that spyder thing and code one side

ripe forge
#

anaconda is a pretty painless introduction to python on windows imo

thorny pasture
#

Well I know how to install pandas and whatnot

#

and I have a Mac as well

ripe forge
#

cool. in that case, pick whatever!

thorny pasture
#

is the only reason to use anaconda cause it installs pandas and numpy, etc

ripe forge
#

hmm

#

well, there's the conda environment/package manager as well

#

also the fact that it gives everything you need out of the box i suppose

#

those really are the big things. you can achieve the same kind of setup if you like without anaconda too.

#

though, the dependency resolution of conda packages is pretty amazing

#

makes installing some stuff a breeze, that would have been a pain to manage manually

#

geopandas and tensorflow come to mind. though i believe tensorflow fixed their issues and now pip install works just fine too

#

(also, not to mention, you can have anaconda, and then use vscode or some other editor too)

oblique belfry
#

I personally am not a fan of Conda package management.

#

I start to use repls when I start adding a bunch of print statements everywhere.

thorny pasture
#

What do you use as a REPL

#

@oblique belfry

#

I see VS Code has a REPL like Jupyter inside it

oblique belfry
#

A mix between IPython and Jupyter.

thorny pasture
#

Why a mix?

oblique belfry
#

If I need to check the functionality of something quickly, I will use IPython3. But if I am doing some sort of exploratory data analysis, then I will spin up Jupyter.

#

Most of the time, I am ssh-ing into servers, and I don't feel like setting up Jupyter.

thorny pasture
#

So you do all the actual coding in vscode or another and then sometimes you open up a IPython Notebook like the web based ones?

oblique belfry
#

If I am doing any type of data analysis, I will use Jupyter. Then I will move to vscode as I get more familiar with the data.

If there are some functions or classes I want to quickly test, I will open up ipython.

thorny pasture
#

So you dont start in text editor you start in jupyter

#

What makes you not use Anaconda Tony?

#

And instead do it how you mentioned

oblique belfry
#

It depends on what I am doing.

I'd say I spend 85% of my time in vscode, 10% in ipython, and 5% in Jupyter.

#

It is probably a lot simpler now, but when I first tried the whole anaconda stack 2-3 years ago, it was a pain. I could create a virtualenv and just use pip just as easily.

thorny pasture
#

create a virtualenv?

#

I have like no DS experience whatsoever, what is that needed for?

#

I agree with pip though

oblique belfry
#

Do you have much experience with Python? Virtual environments are a way to make sure you have the correct dependencies per project. Instead of installing everything in the global python environment, you can install the dependencies you need per project in this "virtual environment".

thorny pasture
#

I'm new to Python as a whole really

oblique belfry
#

Have you used any programming languages?

thorny pasture
#

C#, JS

oblique belfry
#

Some people will disagree with this, but virtual environments are akin to local node_modules folders. Instead of installing a node package globally, you can install it per project.

thorny pasture
#

Basic JS*

#

lol

#

Why not install globally though?

oblique belfry
#

Good question. Some packages require certain versions of a dependency. Package A may require version 1.13 of numpy. Package B may require version 1.14 or greater. Clearly, an issue will arise.

#

Package A's dependencies are incompatible with Package B's.

#

Well, if each package had a local dev environment where they can run any version of the dependency, then you wouldn't have this issue.
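For anyone following along, here is a rough sketch of what "a virtual environment per project" looks like using only the standard library `venv` module. The temp folder and the `.venv` name are just assumptions for the demo; most people simply run `python -m venv .venv` from a terminal inside the project folder.

```python
import sys
import tempfile
import venv
from pathlib import Path

# Throwaway folder standing in for a project directory (hypothetical).
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# with_pip=False keeps this demo fast and offline; pass with_pip=True
# when you actually want to pip install packages into the environment.
venv.create(env_dir, with_pip=False)

# Each environment has its own config file and its own interpreter directory,
# so packages installed here do not touch the global Python install.
bin_dir = env_dir / ("Scripts" if sys.platform == "win32" else "bin")
print("config file:", env_dir / "pyvenv.cfg")
print("interpreter dir:", bin_dir)
```

Package A's project and Package B's project would each get their own folder like this, so each can pin a different numpy version without clashing.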

thorny pasture
#

So Anaconda you don't need to do virtual env?

oblique belfry
#

If you were using Docker or something to deploy a project, then you might not want to. Since the container is single purpose, you won't have these issues. But my desktop is multi-purpose, thus I will run into issues with this.

#

Anaconda does something similar to virtual environments. They accomplish the same thing. conda can make sure whatever project you are in has the correct dependencies.

#

Now, conda can do MORE than that. But, that is a quick breakdown of my take on it.

thorny pasture
#

Nothing's easy to get into haha

oblique belfry
#

It's easier than you think. I promise. lol

thorny pasture
#

IPython says it's Jupyter

#

But you said you have both IPython and Jupyter lol

oblique belfry
#

Jupyter uses ipython internally.

Ipython is a repl that runs in the shell.
Jupyter uses Ipython, but runs in the browser.

#

*actually, Jupyter is a web UI, and sends the data/commands/whatever to the ipython kernel.

thorny pasture
#

so you downloaded a separate application IPython or do you use vscode interactive?

#

when you say Jupyter you mean Jupyter lab as well right?

#

https://jupyter.org/try I see Classic and Lab which says it's newer

oblique belfry
#

For the purpose of this discussion, yes. There is a difference, but not for this discussion.

thorny pasture
#

<ipython-input-1-4cbb279a3e44> in <module>
----> 1 from bs4 import BeautifulSoup as Soup
2 import requests
3 from pandas import DataFrame
4
5 ffc_response = requests.get(

ModuleNotFoundError: No module named 'bs4'

#

When I try to run my code in Jupyter I get this

oblique belfry
#

Currently, I have been researching Metaflow to use at work. I am looking at the metaflow repo I downloaded from git. As I am following the online instructions, I have VSCode open with the source code of the tutorials. On the right, I have ipython open. I am exploring previous metaflow runs. (Don't worry about what is there. Just know that I am interactively stepping through the code and executing things one at a time.)

#

Yeah. You do not have bs4 downloaded.
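A quick way to check this kind of thing, sketched with the standard library only: print which interpreter the current session is running on and whether it can see the module. The helper name `is_importable` is made up for this example. A hosted notebook runs on someone else's machine, so packages installed on your own PC won't show up there.

```python
import importlib.util
import sys

# The interpreter behind this session (notebook kernel, IPython, or plain script).
print("interpreter:", sys.executable)


def is_importable(module_name):
    """Return True if this interpreter can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the traceback above.
for name in ("bs4", "requests", "pandas"):
    print(name, "->", "found" if is_importable(name) else "missing")
```

If something shows up as missing, installing it with that same interpreter (`python -m pip install beautifulsoup4`, or `%pip install beautifulsoup4` in a notebook cell) usually sorts it out.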

thorny pasture
#

I have bs4 downloaded on my pc

oblique belfry
#

Is it globally installed?

thorny pasture
#

I did pip install bs4

#

it works in vs code, sublime, etc

#

but that error is jupyter website thing

oblique belfry
#

How did you install jupyter?

thorny pasture
#

I'm doing a web test thing it's not installed

#

How did you get your Terminal to look like that in VSCode?

oblique belfry
#

That is called ipython. It is the REPL Jupyter uses

#

It is just another package.

#

Are you running jupyter in a conda environment?

thorny pasture
#

Disregard conda

#

I have VS Code open

#

I'm asking like 5 questions at once, so let's focus it down to one thing at a time cause Im an idiot

#

Do you have the Extension Open in IPython by Ilya Vouk?

oblique belfry
#

For my terminal, I ran ipython instead of python. It is a package.

#

Nope.

thorny pasture
#

so I need to pip install ipython?

oblique belfry
#

No. Unless you want to.

#

What are you trying to do.

thorny pasture
#

Currently make my terminal window on vscode look like yours

#

yours looks like spyder [1] etc whereas mine is >>>

oblique belfry
#

If you are using pip, pip install ipython.

I think you need to better research how Python works first before you delve into data science. Understanding how package management works is VERY important. I can say from personal experience that certain versions of keras, tensorflow, and numpy do not play well together.

#

It is important to know how to correct that stuff.

I had to pin the keras version and not just download the newest stuff.

thorny pasture
#

I've yet to have to, but I'm sure it'll happen

oblique belfry
#

Yeah. But, you have some fundamental gaps that are going to only get larger as you keep going. Not all data scientists/data engineers/ml engineers need to be the best software developers, but it is important.

#

Looks fun. But review the basics of Python first.

thorny pasture
#

Well I've done some projects and stuff, nothing crazy

#

I've been learning from Corey Schafer

#

Didn't watch Ep22 Pipenv

thorny pasture
#

@oblique belfry venv isn't hard at all, you weren't wrong

trail kite
#

guys I have question about parsing really extra nested json. am I in the right place?

ripe forge
#

you can just use the help channel

trail kite
#

didn't get enough help 😦

tacit jewel
#

Hey could someone offer some insight on datasets? I've only used free datasets that are available online, but what would the process of gathering your own dataset look like?
I suspect connecting to different sites' API would be the way to go

#

Total beginner question but would love if someone could point me in the right direction.

vital cipher
#

@tacit jewel you can use a scraper to scrape whichever information you need from any given website....

tacit jewel
#

Thank you @vital cipher . I think I will try saving in just a python dictionary rather than sql or sqlite

#

for now since I'm just starting out and it's not a huge or complex dataset

vital cipher
#

cool

undone shard
#

going to be working on some AMD Vega optimized witchery with pyopencl, wheeeee

uncut shadow
#

witchery? GWcorbinMonkaGIGA

orchid lintel
#

So, what's the deal with compress in itertools? Seems to basically do the same thing as filter?

velvet thorn
#

not...really?

#

filter processes an iterable, removing elements that evaluate to False

#

compress processes two iterables, removing elements that come from the first iterable paired with elements from the second iterable evaluating to False

orchid lintel
#

@velvet thorn Could you give an example of when you'd use compress?

#


```python
from itertools import compress


def name_selection(names):
    name_selectors = []

    for name in names:
        if name.startswith("A"):
            name_selectors.append(1)
        else:
            name_selectors.append(0)

    return name_selectors


names = ["Albert", "Alexandra", "Miriam", "Sascha"]
filtered_names = list(compress(names, name_selection(names)))
```
#

that just seems like a clunkier way of doing filter(lambda x: x.startswith("A"), names)

#

but maybe that's just a bad example that doesn't really show what compress is good for

velvet thorn
#

yes, it actually is a bad example

#

in general, you would use compress only in two cases (where data refers to the first iterable and mask to the second):

  1. the mask is not reproducible from the data alone
  2. the mask has already been calculated
#

so, for example, say you have a list of names and a list of ethnicities

#

you could use compress to get only, say, the Gaelic names (along with a generator expression on the second list)

#

compress is like a slightly neutered form of filtering by something else

#

if you have worked with languages that have something like filterBy...that's basically it.
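To make the names/ethnicities case concrete, here is a small sketch. The two lists and their labels are invented for illustration. The point is that the mask comes from a second iterable, which `filter` on `names` alone could not reproduce.

```python
from itertools import compress

# Made-up sample data: the second list labels each entry of the first.
names = ["Aoife", "Miriam", "Niamh", "Sascha"]
ethnicities = ["Gaelic", "Hebrew", "Gaelic", "German"]

# Case 1: the mask is built from *other* data with a generator expression.
gaelic_names = list(compress(names, (e == "Gaelic" for e in ethnicities)))
print(gaelic_names)  # ['Aoife', 'Niamh']

# Case 2: the mask was already calculated somewhere else.
precomputed_mask = [1, 0, 0, 1]
print(list(compress(names, precomputed_mask)))  # ['Aoife', 'Sascha']
```

The same result is possible with `zip` plus a comprehension; `compress` just saves the unpacking.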

orchid lintel
#

@velvet thorn Aha! Awesome, thanks!

lapis sequoia
#

I am trying to learn how to make neural networks but don't know where to start

velvet thorn
#

@lapis sequoia how much do you know?

#

like what's your level of ability in linear algebra, probability, and programming

lapis sequoia
#

high in programming but really low in math

#

where do i learn the math ?

#

@velvet thorn

velvet thorn
#

I suggest this

lapis sequoia
#

do i have to learn more numpy and graphing stuff

#

@velvet thorn

velvet thorn
#

no need

#

to keep tagging me

#

knowing numpy concepts is good

#

visualisation is not mandatory, but it is very helpful

lapis sequoia
#

ok so is there a course u can show me for the programming

velvet thorn
#

huh

#

if you're good at programming it should be really simple

#

just pick up Keras/Tensorflow/PyTorch and start hacking

#

don't think a course is necessary

lapis sequoia
#

KK

south dagger
#

A bit confused, not sure what I did wrong, the Annual Revenue doesn't seem right. For index 522 it seems right but like for index 1476 shouldn't it be 33.333*

*please @ me *

polar acorn
#

@south dagger you are dividing years by Total in Millions which is wrong. You should divide Total in Millions by years. Also for calculating years you don't need the apply and lambda, you can just do 2019 - sales['Year published'], same for calculating Annual Revenue: just divide one column by the other as you would any other variable.
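A minimal sketch of that column arithmetic, assuming a `sales` frame with the two columns mentioned above. The rows are invented since the real data only appears in a screenshot, and the intermediate `Years` column is a name picked for this example.

```python
import pandas as pd

# Hypothetical rows standing in for the real `sales` DataFrame.
sales = pd.DataFrame({
    "Year published": [2009, 2016],
    "Total in Millions": [250.0, 100.0],
})

# Whole-column arithmetic, no apply/lambda needed.
# Years on sale = 2019 minus the publish year.
sales["Years"] = 2019 - sales["Year published"]

# Revenue per year = total divided by years (not the other way round).
sales["Annual Revenue"] = sales["Total in Millions"] / sales["Years"]
print(sales)
```

One thing to watch: anything published in 2019 gives `Years == 0`, so the division returns `inf` for that row.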

lapis sequoia
#

I have this MT Dataset, what can be done with such dataset, as a Data scientist, what questions can be asked regarding this dataset, what can be the yields, what to analyze, etc.

So far I thought to make a classification on age, gender.

To find the reasons why patients undergo, surgery, allergy tests, etc. what else can be done with it, any suggestions, please?

Any suggestion is appreciated.

thorny pasture
#

What ecosystem do you guys program on? I'm curious.

obsidian copper
#

Can I train YOLO (darkflow) on hand gestures and recognise hand gestures in real time?

#

Ping me in replies

ripe forge
#

@obsidian copper as long as you have a dataset for hand gestures, and as long as the actual prediction of a gesture is done as if it was on a "Frame", then absolutely.

obsidian copper
#

@ripe forge ok thank you

ripe forge
#

@lapis sequoia some kind of nlp around the column that gives description. could be classification to one of the categories under 2nd column, or just a simple patient clustering algo to try to fit new cases in

#

having said that, that pic doesn't give a lot to go on

#

ecosystem..not really sure what that means tbh. just python i guess 😛

sand timber
#

having some trouble with an implementation of sparse pegasos svm... it looks like the more i train, the worse it is :(

ripe forge
#

are you measuring performance on both train and test? is it improving on train?

sand timber
#

i'm just measuring performance on train and it's getting worse the more i train

ripe forge
#

interesting

#

sadly, i know nothing specific about how sparse pegasos svm works

sand timber
#

it's supposed to be linearly separable

#

but i'm not really sure how to prove it is

#

sigh never mind i think i figured it out

#

silly mistakes

south dagger
#

@polar acorn thank you !

ripe forge
#

curious, if you don't mind sharing, what was it? @sand timber

sand timber
#

regularization lambda was too strong

#

classic self-implementation move

ripe forge
#

ah

sand timber
#

you never know if you implemented it right or if the hyperparameters are bogus

ripe forge
#

heh. yep, that'd do it

oblique belfry
#

@obsidian copper What type of gestures are you doing?

obsidian copper
#

@oblique belfry actually I want to detect hand gestures in real time. Gestures like thumbs up or fist etc. I'll later use these classification to perform various mouse functions.

oblique belfry
#

Yolo might work for you.

If you find that training on still images is not good enough, you can use SlowFast or Conv3Ds and train on multiple videos.

obsidian copper
#

I thought of Conv3Ds but idk much about it. And I've to develop this project in like 2 weeks max.

#

I'm still learning cnns

oblique belfry
#

You might have to leave the "image classification" approach to things and go to "action recognition." I did a similar problem. Spent hours trying to do a Conv2D + LSTM approach and got nowhere.

#

Gotcha. Just was scrolling through and saw your comment.

obsidian copper
#

Can u help me with this thing? I may ask for more help

#

😬

oblique belfry
#

I'll do my best

south dagger
lapis sequoia
#

well hello there
i plotted this with matplotlib

#

i think we all know what this is.
However, as you may have seen, you can't read s**t from it.
i did the dpi at 150 and figsize at (20,10)
but its like unreadable
also lags of course a lot, hence i am converting them in pictures instead of plotting them
so i need it readable on an image
any methods?

lapis sequoia
#

Sad, can nobody give more insights?

velvet thorn
#

@south dagger do you want to center or wrap?

south dagger
#

would center auto wrap it ?

#

both ?

velvet thorn
#

did you Google

south dagger
#

yes but it kept giving me col selection stuff which wasnt working

chilly fog
#

Do you guys know anything about normalization by chance

#

i need someone rn

velvet thorn
#

try df.style.set_table_styles([{'selector': "th", 'props': [('max-width', '50px')]}]) @south dagger

#

@chilly fog

#

!ask

arctic wedgeBOT
#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

lapis sequoia
#

hi

undone shard
#

Hi bon, having fun with this N-body particle thing I made in PyOpenCL, glad I fixed a bug

#

Vector fields!

#

I'll show an image.

hexed aurora
#

Hi all, I'm very new to coding in python!! great to be here!

arctic wedgeBOT
#

Hey @burnt wharf!

It looks like you tried to attach file type(s) that we do not allow (.txt). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .md.

Feel free to ask in #community-meta if you think this is a mistake.

hexed aurora
#

Can anyone point me to a good reading material. I want to compare snapshots of the same table(s) at different points in time and identify the differences and make some meaningful inference

wheat frost
#

Well numpy sounds good for that sort of thing

#

Depends what sort of meaningful inference you want to make

dusky cairn
#

Hey, i am kind of new to Python, and i need some help. I am trying to do a polynomial regression

lapis sequoia
hexed aurora
#

@wheat frost I look at the snapshot of the table every 10 minutes

#

new rows can be added to a table, or existing rows could have changed

#

but the volume of data is very high, so RDBMS like comparisons take a bunch of time

fresh cedar
#

Hi All, Pandas question: how would I go about adding new rows to a DataFrame obtained using groupby() and count(). I have cumulative sums of items grouped by date. My resulting dataframe looks like in the screenshot. I'd like to add additional rows to it e.g. to predict future growth.

hollow orbit
#

wow

jolly briar
#

@fresh cedar if you put .reset_index() after it should return a dataframe.
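Roughly what that looks like end to end, as a sketch on made-up data. The column names `date`, `item`, and `total` are assumptions, since the real frame is only in the screenshot, and the "future" row is a placeholder value, not a prediction.

```python
import pandas as pd

# Invented sample: one row per item, with the date it was added.
items = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"],
    "item": ["a", "b", "c", "d", "e"],
})

# groupby + count gives a Series indexed by date; cumsum makes it cumulative.
counts = items.groupby("date")["item"].count()

# reset_index turns the date index back into a normal column.
cumulative = counts.cumsum().reset_index(name="total")

# With a plain DataFrame, extra rows can be appended via concat.
future = pd.DataFrame({"date": ["2020-01-04"], "total": [7]})
extended = pd.concat([cumulative, future], ignore_index=True)
print(extended)
```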

fresh cedar
#

Thanks! I'll try it out

crimson umbra
real wigeon
#

how can i tell my script to do something, if it is on the first row of a pandas df

next cairn
#

How to deal with spatial data ?

jolly briar
#

@real wigeon what do you mean, can you give an example?

silk forge
#

so i was watching andrew ngs course

#

and i decided to code a linear regression model myself

#

and so i have a doubt

#
import sklearn.linear_model as lin
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib
from math import sqrt

data = pd.read_csv(r"C:\Users\home\Desktop\Artifcial intelligence\ML\data\Regr\FuelConsumptionCo2.csv")

x = data[['ENGINESIZE']]
y = data[['CO2EMISSIONS']]

trainx,testx,trainy,testy = train_test_split(x,y,test_size=0.2,train_size=0.8,random_state=7)

regr = lin.LinearRegression()
regr.fit(trainx,trainy)

# h(x) = o0 + o1 * x

o0 = regr.intercept_
o1 = regr.coef_

print(f"o0 shape = {o0.shape}")
print(f"o1 shape = {o1.shape}")
#

o0 shape = (1,)
o1 shape = (1, 1)

#

why is my theta 1 a 2d array

#

while my theta0 is not

#

nvm makes sense now
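For reference, the shapes come from how `y` is passed in: `data[['CO2EMISSIONS']]` is a 2D DataFrame, so scikit-learn treats it as "one target column" and reports `coef_` as `(n_targets, n_features)`. A small sketch on synthetic numbers (not the fuel consumption CSV) shows both cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: one feature, exact linear relation.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 6.0, size=(50, 1))
y_1d = 40.0 * X[:, 0] + 125.0   # shape (50,)
y_2d = y_1d.reshape(-1, 1)      # shape (50, 1), like data[['CO2EMISSIONS']]

model_2d = LinearRegression().fit(X, y_2d)
model_1d = LinearRegression().fit(X, y_1d)

# 2D target: intercept_ is (n_targets,), coef_ is (n_targets, n_features).
print(model_2d.intercept_.shape, model_2d.coef_.shape)      # (1,) (1, 1)

# 1D target: intercept_ is a plain float, coef_ is (n_features,).
print(np.shape(model_1d.intercept_), model_1d.coef_.shape)  # () (1,)
```

Passing `data['CO2EMISSIONS']` (single brackets) would give the 1D case.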

#
import numpy as np

npar = np.array([o0,
                 o1])
print(npar)
#

when i put them in an array

#

they become a 2d vector