#data-science-and-ml
been studying python and compsci for 6 months and I still dunno jack shit compared to someone who's been doing it for 1 year or less
yes
they use pseudo languages
if this then that
like C?
Oh, so pseudocode is just code to explain programs simply, not real languages
yep
Welp, I think I started too quickly, I'm just gonna take the CS50 course, maybe gonna take three or four months
CS50?
No just python courses and books
@merry portal like this?
>>> list(zip(*np.nonzero(a < 5)))
[(0, 0), (0, 1), (0, 2), (1, 1)]
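For reference, a minimal self-contained version of the snippet above. The array `a` here is a made-up example (the original array is not shown in the chat), chosen so that the same four positions satisfy `a < 5`.

```python
import numpy as np

# Hypothetical array; the chat does not show the original `a`.
a = np.array([[1, 2, 3],
              [7, 4, 9]])

# np.nonzero returns one index array per dimension (rows, cols);
# zipping them pairs the entries up into (row, col) tuples.
rows, cols = np.nonzero(a < 5)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 0), (0, 1), (0, 2), (1, 1)]

# np.argwhere returns the same positions as a 2-D array, one row per match.
positions = np.argwhere(a < 5)
```

`np.argwhere` is usually the shorter spelling when only the coordinates are needed.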
@lapis sequoia I would play around with a lot of the python packages before you move too far from Python. Maybe start diving into SQL if you are looking to do some data work. You can essentially run SQL syntax in pandas anyways.
does anybody here know R?
r u trying to run data viz?
me?
i started r studio a couple of weeks ago
i think this is data viz doe yh
im pretty sure it is
yh
i have been given a task that is composed of 5 stages
the task is to simulate the double slit experiment using R
if i were to do this from scratch i would be clueless
but the tasks build up to give the result
can you help me with this? @slim elm
rippin a game of HoTS quick then i will look at it
i might take a powernap meanwhile
how is the size of the screen defined here without defining "-y.max"??
>>> y_max = 200
>>> -y_max
-200
Since the screen goes from -y_max to y_max, you only have to define one of them. just negate it
is -y.max automatically equal to -200 when y.max=200?
If it were from [-300, 200], you'd have to define both bounds separately
I mean yeah, the negative of 200 is -200. Adding a - to a variable negates it
damn didnt know that worked in R
mind if i ping you in the future for help
i just started this lang
its my first lang
sure, though I'm not very knowledgeable in R
neither am i
hi I use pandas
I have two columns `i` and `j`, for example {(1,2), (3,4), (2,1)}. I want to count the pairs of `i` and `j`, treating (1,2) and (2,1) as the same pair. Can I do this with a single `groupby`, or do I use `groupby` and then iterate?
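One possible approach, as a sketch: sort each row so that (2,1) becomes (1,2), then a single `groupby` counts the unordered pairs without any iteration. The column names `a` and `b` for the sorted pair are placeholders, not something from the question.

```python
import numpy as np
import pandas as pd

# The example pairs from the question.
df = pd.DataFrame({"i": [1, 3, 2], "j": [2, 4, 1]})

# Sort within each row so the smaller value always comes first;
# (2, 1) and (1, 2) then become the same key.
sorted_pairs = np.sort(df[["i", "j"]].to_numpy(), axis=1)

# "a" and "b" are hypothetical names for the normalised pair.
counts = (
    pd.DataFrame(sorted_pairs, columns=["a", "b"])
    .groupby(["a", "b"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

This assumes the two columns hold comparable values (for example both integers), since the row-wise sort has to compare them.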
Hey guys, does anyone have some tips on how to optimize randomforest regressions most effectively? random gridsearch? gridsearchcv? a combination thereof? I know how to do those, but I'm an absolute beginner in terms of randomforests... i only did a parameter grid test for n_estimators and max_features and the results are rather unsatisfactory. Thanks
bump
```r
# distance (from apertures) to screen
L <- 1000
# size of each aperture
b <- 0.5
# wavelength of light
lambda <- 0.05
# separation of the slits
d <- 3
# wavenumber
k <- 2 * pi / lambda
# all of these define the physical parameters
# define distance along the screen, y, from -y.max to y.max
# size of the screen - extends (-y.max, y.max)
y.max <- 200
# sets the number of pixels on the screen
n.screen <- 501
# -y.max negates y.max, so once y.max is defined, -y.max is its negative counterpart
y <- seq(-y.max, y.max, length.out = n.screen)
# the units for these values are mm and the centre of the screen is at y = 0
# defines theta
theta <- atan(y / L)
sinc <- function(x) {  # defines sinc
  y <- sin(x) / x
  y[x == 0] <- 1  # where x is exactly zero, replace the result with 1
  return(y)
}
# a (the slit width) is not defined anywhere above yet
phi <- 2 * pi * a * sin(theta) / lambda
```
this is my code
and i have no clue what a is supposed to be
this is the task
this is the theory
a is slit width
but idk how to work it out
@lapis sequoia
can u help me find out what "a" (the slit width) is
everything is above
That image is blurry on my phone so I can't make out the figures. But I would assume that your function to calculate phi is supposed to accept a as an argument
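A rough NumPy sketch of where `a` could enter, written in Python rather than R. It assumes the task uses the standard Fraunhofer double-slit formula and that the slit width `a` is the same quantity the code above calls `b` (the size of each aperture). Both are assumptions, since the task sheet is not readable here.

```python
import numpy as np

# Physical parameters copied from the R code above (units: mm).
L = 1000.0      # distance from apertures to screen
b = 0.5         # size of each aperture
lam = 0.05      # wavelength
d = 3.0         # slit separation
k = 2 * np.pi / lam

y_max, n_screen = 200.0, 501
y = np.linspace(-y_max, y_max, n_screen)
theta = np.arctan(y / L)


def sinc(x):
    """sin(x)/x with the value at x == 0 set to 1."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nonzero = x != 0
    out[nonzero] = np.sin(x[nonzero]) / x[nonzero]
    return out


a = b  # ASSUMPTION: slit width equals the aperture size defined above

# Standard Fraunhofer result (assumed to be what the task wants):
# single-slit envelope sinc^2 times two-slit interference cos^2.
envelope = sinc(0.5 * k * a * np.sin(theta)) ** 2
interference = np.cos(0.5 * k * d * np.sin(theta)) ** 2
intensity = envelope * interference  # normalised to 1 at the centre
```

If the task sheet defines `a` differently, only the line `a = b` needs to change.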
anybody here use deeplearning.ai?
im trying to grok L2 regularization from a mathematical perspective
how come the regularization provides a lesser update, according to Ng, than an un-regularized weight update?
it seems like it would almost always increase the update magnitude unless lambda is negative xor the derivative is negative
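One way to look at this numerically. The sketch assumes the usual L2 penalty `(lambda / (2 * m)) * ||w||^2`, so the gradient gains an extra `(lambda / m) * w` term; the numbers are arbitrary. The regularized step itself can be larger than the plain step, but rewritten it first scales `w` by `(1 - alpha * lambda / m)`, which is below 1, and that is the sense in which the weights shrink ("weight decay").

```python
import math

# Arbitrary illustrative values.
alpha, lam, m = 0.1, 0.7, 10   # learning rate, L2 strength, batch size
w, grad = 2.0, 0.5             # current weight and dJ/dw of the unregularized cost

# Plain gradient step.
w_plain = w - alpha * grad

# Regularized step: the gradient has the extra (lam / m) * w term.
w_reg = w - alpha * (grad + (lam / m) * w)

# The same step rewritten: shrink w first, then apply the plain gradient.
w_decay_form = (1 - alpha * lam / m) * w - alpha * grad

print(w_plain, w_reg, w_decay_form)
```

Here the regularized weight ends up closer to zero than the unregularized one, even though the step taken was bigger.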
Say I have a 32 x 32 RGB image that I am passing through two convolution kernels. On the other end, I have a tensor with depth 2. Now I need to apply a bias, but does the bias just get added to each specific element in the output layer?
[[0, 1, 0],
[1, 0, 1],
[1, 3, 4]]
So adding a bias of 3 would result in:
[[3, 4, 3],
[4, 3, 4],
[4, 6, 7]]
correct?
Then I would add another bias to the output from the other kernel
ignore dimensions, I just slapped something together
each filter has one bias, yes
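A small NumPy sketch of "one bias per filter". The first feature map is the 3 x 3 example from above; the second map and the second bias value are made up for illustration.

```python
import numpy as np

# Output of two kernels, shape (num_filters, height, width).
feature_maps = np.array([
    [[0, 1, 0],
     [1, 0, 1],
     [1, 3, 4]],      # example from the chat
    [[2, 0, 1],
     [0, 1, 0],
     [1, 1, 1]],      # hypothetical second feature map
], dtype=float)

# One scalar bias per filter; the second value is arbitrary.
biases = np.array([3.0, -1.0])

# Reshape to (2, 1, 1) so each bias broadcasts over its whole feature map.
biased = feature_maps + biases[:, None, None]
print(biased[0])
```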
Hi, I've recently started to learn neural nets and now I'm reading this book: http://static.latexstudio.net/article/2018/0912/neuralnetworksanddeeplearning.pdf
When I'm using the first neural net there I'm getting over 97% accuracy but only with pictures from the data set and not with pictures that I create (28x28 with MS Paint). What am I missing when creating/processing the pictures?
The network is implemented here:
https://github.com/MichalDanielDobrzanski/DeepLearningPython35/blob/master/network.py
I know tf/keras has built-in distributed training strategies (PyTorch also has some but I do not know enough about them), but I also know about Horovod by Uber, and how many people use this library as well. Are there any good benchmarks showing if there is a difference between Horovod and these other strategies? The ring-allreduce method used by Horovod is interesting, but I am unsure how that could be better than using a parameter server. I do not know much about the MPI protocol.
Hi, guys! I have the following problem:
consider the dataset
- day
- month
- year
- hour
- some per-hour variable values (like temperature, etc.)
- XValue (real value in range 0.0 - 30000.0)
- Is XValue Maximum for this day (0-1)
So, XValue changes every hour of each day and always takes different values. Also, it reaches its peak value at some hour. Given day, month, year and the per-hour variable values for each hour, I need to guess when XValue will reach its peak value (basically get an int in range 0 - 23). What approaches should I look into to solve such a problem?
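Whatever model ends up being used, one possible first step is to turn the hourly rows into one label per day: the hour at which XValue peaks. The sketch below shows that with pandas; the column names follow the description above and the values are made up. The problem could then be framed as, for example, a 24-class classification per day, which is only one option among several.

```python
import pandas as pd

# Made-up hourly rows for two days (only three hours each, for brevity).
df = pd.DataFrame({
    "year":   [2020] * 6,
    "month":  [1] * 6,
    "day":    [1, 1, 1, 2, 2, 2],
    "hour":   [0, 1, 2, 0, 1, 2],
    "XValue": [10.0, 250.0, 90.0, 300.0, 20.0, 5.0],
})

# Row label of the maximum XValue within each day.
peak_rows = df.groupby(["year", "month", "day"])["XValue"].idxmax()

# One row per day with the hour of the peak as the target.
peak_hour = (
    df.loc[peak_rows, ["year", "month", "day", "hour"]]
    .rename(columns={"hour": "peak_hour"})
    .reset_index(drop=True)
)
print(peak_hour)
```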
Looking at potentially getting a MBP, opinion on how much ram I should get? It's a toss up between 16GB and 32GB likewise with 4GB vs 8GB for graphics. My gaming desktop only has 16GB and I've never had an issue. I plan to get into Data Science more so that's why I'm asking here.
@thorny pasture depends, what do you want to do?
16 GB is plenty for many data science uses
I'm still too new to say for certain, could you maybe touch on some cases when 16 wouldn't be?
@velvet thorn
so the main use of RAM is storing your dataset to work with interactively
in this context.
there may be a point where 16 GB is too little and you need 32 GB, but that is rarely the case
in my experience, if it's too big for 16 GB, it will likely be too big for 64 GB, and you'll need to use a cluster or something
@velvet thorn what kind of specs does your computer have?
the laptop that I use for DS work, no DL, is some mid-range Lenovo ThinkPad
16 GB RAM, in particular
I use my desktop for gaming + DL, and it has a GTX 1080 (planning to upgrade in a few months)
also 16 GB of RAM
any way to write text to a pdf file?
@prisma imp reportlab
can somebody suggest how to optimize the following parameters in random forest?
```
min_impurity_decrease
min_impurity_split
n_jobs
random_state
```
is random_state even necessary? I obviously used it for train_test split, but I'm not sure whether it is necessary for using bootstrapping
as for the other 3 commands, i don't really understand their explanations very well in the userguide...
@velvet thorn i recently updated my PC from 8 to now 24Gb because of the python project and I still have experienced crashes with randomforest gridsearch on higher n_estimators parameter values ^^ my working dataset is "only" 1-1.5m rows and less than 1gb after cleaning. the process took like 48+ hours and at some point crashed
bet it was in IPython
yeah, that's built on IPython
@lapis sequoia are you sure you know what you're doing?
n_jobs is the number of parallel jobs to run (e.g. fitting multiple trees at once).
it is not a hyperparameter.
neither is random_state.
you might as well optimise the time of day at which you run your code
min_impurity_decrease just means how much impurity (a measure of how good a tree is at differentiating between values) must go down by before your tree splits
min_impurity_split is deprecated, and serves roughly the same purpose as min_impurity_decrease.
don't touch it.
i know random_state is not a hyper parameter, but it is suggested to use a random state when using bootstrapping in some tutorials... I didn't use a specific random_state though. My grid search, however, showed that using bootstrapping is better than not using it.
thanks
my question was not well specified I guess, sorry for that
np
it was aimed to get some explanations for the parameters, and whether random_state is needed when using bootstrapping
yes
in general, when you do something involving randomness
you should ensure reproducibility
which is done by, in this case, fixing random_state
i tried fine-tuning only max_features and n_estimators in the beginning, but the results are very disappointing compared to rbf-SVM
I honestly expected (from the papers i read), that randomforest will have the best results for my study... but seems like it doesn't
yup, that's logical, however, while train_test splits should obviously be reproducible, the random state of the bootstrapping-hyperparameter in my view shouldn't make any difference... because as you said, it's not like the result of hyperparameter tuning will change if you change the random_state
random forest is not very responsive to hyperparameter tuning
yeah, I agree
you'll get different bootstrapped samples
which will slightly change the fit of the individual trees
of course, being a rather low variance method in the aggregate, this shouldn't change much, but it will make a difference, still.
but then again.... bootstrapping is a yes/no option
so even if there is slight difference, the outcome should remain consistent right
huh?
no, each bootstrapped sample will vary slightly
because it's drawn randomly from the original data, right
the random state influences this
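A tiny NumPy illustration of that point. This is not scikit-learn's internal code, only a sketch of what "drawing a bootstrap sample with a seed" means; the helper name `bootstrap_indices` is made up.

```python
import numpy as np


def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement, using a fixed seed."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_rows, size=n_rows)


same_seed_a = bootstrap_indices(1000, seed=0)
same_seed_b = bootstrap_indices(1000, seed=0)
other_seed = bootstrap_indices(1000, seed=1)

# Same seed: identical sample, so a rerun is reproducible.
print(np.array_equal(same_seed_a, same_seed_b))
# Different seed: a different sample, hence slightly different trees.
print(np.array_equal(same_seed_a, other_seed))
```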
sure sure
i mean... when you're using grid-search, then your goal is to check which is best for your specific model as in bootstrapping=[true, false]
and if bootstrapping=true is the better option, that should not change whatever your random_state is
no, not necessarily
at least, not theoretically.
and not practically as well, if you're using sklearn
ugh... ok 😦
because the individual trees also are influenced by randomness, entirely apart from bootstrapping
see the documentation.
and, this aside
if you agree that bootstrapping performance varies with random_state
surely you can agree that it is conceivable that when random_state takes certain values, bootstrapping may perform worse than not bootstrapping.
on the other hand, when random_state takes other values, bootstrapping may perform better.
because, remember, it may influence the performance of bootstrapping.
if you accept this, then it does matter.
but would that not make random_state a hyperparameter then? lol
if the performance difference was that meaningful, i mean
haha
this is kind of a philosophical question
hahaha
ok, so you suggest i don't touch the min_impurity_split and _decrease, but max_depth, min_samples_split, min_samples_leaf does make sense to finetune, right? on top of max_features and n_estimators, obviously
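A minimal sketch of such a search with scikit-learn's `RandomizedSearchCV`, on synthetic data. The candidate values are arbitrary placeholders, not recommendations. Following the discussion above, `n_jobs` and `random_state` are set once on the estimator and are not part of the search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Hyperparameters to search; the values are illustrative only.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_features": [0.3, 0.6, 1.0],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# random_state fixes the randomness for reproducibility; n_jobs only
# controls parallelism. Neither is tuned.
forest = RandomForestRegressor(random_state=0, n_jobs=-1)

search = RandomizedSearchCV(
    forest,
    param_distributions,
    n_iter=5,                                # try 5 random combinations
    cv=3,
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

A randomized search over a wide space followed by a narrower `GridSearchCV` around the best candidates is a common combination, and it keeps the number of fits far below an exhaustive grid.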
hi @lapis sequoia, where did you learn machine learning with python?
@agile wing self-taught learning by doing and using lots of tutorials on the web and youtube, plus 1-2 books. I had to learn it for my thesis-project... but i'm literally a beginner. this chat was a huge help for me, when I was stuck... gm certainly knows what he's talking about, I on the other hand do absolutely not
oh, I'm trying to find a pretty good MOOC on python and ml
in favour of
min_impurity_decrease
@velvet thorn any suggestions what values to use? since it's a threshold i have absolutely no idea... except that default is (default=1e-7)
@agile wing sorry what's MOOC?
mh scikit has lots of data online, github too.... i never really had the time to play around with other data, because my own project had so much data to work with and my project obviously has a deadline
@velvet thorn oh and for n_jobs yeah i read the documentation... i was asking, because i thought it might help accelerating the gridsearch process... but i don't know and i don't really want to risk jupyter to crash my PC again :[
ic thanks
@agile wing what's your current state of knowledge though? if you're starting from scratch, i would recommend to watch Daniel Chen's python beginner tutorials on youtube...
i know my python, but Im new in ML
Hey! I love solving challenges like Project Euler, Codewars, Codingame...
Do you have something similar for Data Science/Machine Learning?
I know about Kaggle of course but it's a bit hard for beginners. So far, I have found https://www.machinelearningplus.com/python/101-numpy-exercises-python/, https://www.machinelearningplus.com/python/101-pandas-exercises-python/ and a sub-subsection about Machine Learning on Hackerrank (about 20 questions). Do you have more cool stuff?
Can someone help me get started with data science? I'm learning Python and I have to make a bar chart of the first column of a csv file but I'm really struggling with it
Pls tag me if you can
@silk frigate what have you done so far?
Well
I have downloaded the csv file (I opened in Excel now)
It's about the McDonalds menu
so its looks something like this:
And now I want to make a bar chart of the first column (category)
At least that's what the assignment asks me to do haha
They gave me the start of the program
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import menu and display the first two rows of the dataframe
menu = pd.read_csv("mcdonalds_menu.csv")
```
but I don't really understand how to make bar charts etc.
It's all new to me
how new are you to programming?
I started about 5 weeks ago I think. In my class we started with the basics (data types, etc.)
I have worked my way through module 1 till 8
Ok. So you're using a library called pandas to read the CSV file. pandas is very useful for anything related to data manipulation and analysis
Yes
matplotlib is usually used for plotting, but pandas has some built-in plotting methods
If you google pandas barchart you'll hopefully find this link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.bar.html#pandas.Series.plot.bar
Yes I found that link
(hold on, someone is calling me, brb in 2 min)
sure thing
ok im back
hii
anyways as I was saying. As you see pandas has a built-in method for creating a bar-plot
yes
However since you want to make a plot from strings (category) you need to find a way to quantify them numerically
And in this case I assume it's the number of items in each category
yes
If you google pandas count of unique values in column you'll hopefully find this link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
what did you have in mind then? haha
ohh haha
anyways did you see the above link?
yea it's alphabetically
not yet let me see
how did you find that link?
Did you just google "python pandas counting values in column" or something?
yes
hmm okay
I'm kinda scared with the formula that's on the top of that page haha
all the parameters
don't worry about that yet
The only thing you should take note of are the parameters that don't have a default value
Basically parameters that dont have =something at the end of them
but all of them have a default value right?
Yes exactly. With the exception of self. But that can be ignored
So that means you can just call .value_counts() without providing any parameters.
hmm okay
Which is great as that means we can just focus on getting some plot working, then finetuning it with parameters as needed
haha yes
Anyways. in pandas you have something called DataFrame and something called Series. The boiled down version is that a DataFrame is just a collection of Series
And the link above shows pandas.Series.value_counts()
yes
yeah. So in your example, if you were to do:
df = pd.read_csv(...)
print(type(df['category']))
you should get <class 'pandas.core.series.Series'>
You can access any column in a dataframe like you would a dictionary.
but aren't columns always series?
And if the name of the column doesn't have any spaces you can also do it without the brackets: df.category
Yes thats right. But that means that we can do this: df.category.value_counts()
so `print(type(df.category))`
would also work?
yes
Now here's a task for you. Using the documentation, see if you can find out how to plot from a series (instead of from a dataframe)
Link me the page about plotting a bar-plot from a pandas series
well pd.plot.bar() is a function from pandas that you can use
although I don't really get why there's pandas**.series**.plot.bar() in the title
because in the example they don't do series
That means you can use it on series as well
but you shouldn't use that in the actual function?
What do you mean?
well if you are going to use this function
should you do: pd.series.plot.bar()
or pd.plot.bar()
Actually what pandas.Series.plot.bar means is that within the pandas library, any Series has an attribute called plot which has a bar method
And if you run this: print(type(df.category.value_counts())) you'll see that the output is a series as well
And as mentioned, since it's a series we can do this: df.category.value_counts().plot.bar()
yes
Lastly, what the pandas plotting methods do is just create the figure. To display it we have to use matplotlib afterwards by calling plt.show()
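Putting the steps above together in one runnable sketch. The small DataFrame stands in for the real `mcdonalds_menu.csv` (its rows are made up), and the `Agg` backend is only selected so the example also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for pd.read_csv("mcdonalds_menu.csv"); the rows are hypothetical.
menu = pd.DataFrame({
    "Category": ["Coffee & Tea", "Breakfast", "Coffee & Tea",
                 "Beverages", "Coffee & Tea", "Breakfast"],
})

# A column of a DataFrame is a Series, and value_counts() returns a Series
# of counts, most frequent first.
counts = menu["Category"].value_counts()

# Any Series has .plot.bar(); it returns the matplotlib Axes it drew on.
ax = counts.plot.bar()
ax.set_ylabel("Number of items")
plt.tight_layout()
```

In an interactive session, calling `plt.show()` after this displays the figure.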
yeah that's weird haha
why doesn't python just do that automatically
wow
it works haha
I'm already proud of this
good job!
I hope everything made sense to you, and hopefully you learned not just how to plot a bar-chart, but also how to navigate the documentation and understand it
haha yes, I really like that I can use discord to learn things like this. The lecture notes of my teacher are sometimes useful but do not really explain things in-depth
I do have 2 more questions then tho, how did they manage to sort the categories alphabetically and give cool colours to the bars?
If I do `plt.show(df.Category.value_counts(sort=False).plot.bar())`
It shows this graph:
so first it sorts it with the value counts descending
but how does it come up with the order of categories in this second one?
It's not the order in which they appear in the csv file
@lapis sequoia could you maybe help with this? It's also completely okay if you don't want to, I get that it takes a long time to help beginners like me get started
I tried looking into how the sort argument was used but couldn't
what do you mean?
Can I use this function to sort my values from that series alphabetically?
oh I looked at the wrong method
lol
oh man I read your line wrong and thought it was .bar(sort=False)
Ah I found it
So value_counts() uses something called a hashtable which means the ordering is essentially random
By default sort=True and so after counting the elements it will sort the result
so when that's False, the order is completely random?
essentially yes
so is there a way to sort the values alphabetically?
Sure. every series and dataframe has an attribute named index which, as you can imagine, holds the label of every row
So if you want to change the order of the plot, we have to reindex the series
because the initial index was alphabetically?
not alphabetically, in the documentation it says The resulting object will be in descending order so that the first element is the most frequently-occurring element
yes
and it will probably create a new series with new indexes
so 'Coffee & Tea' has index number 1
or 0
since that's the most frequently-occuring element
Yes so to change the ordering from most-frequent to alphabetical I would do this:
category_value_counts = df.category.value_counts()
alphabetically_sorted_indices = sorted(category_value_counts.index)
category_value_counts.reindex(alphabetically_sorted_indices).plot.bar()
plt.show()
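The reindexing step can be checked without any plotting. The sketch below uses a made-up list of categories and also shows `sort_index()`, which should give the same alphabetical ordering in one call.

```python
import pandas as pd

# Hypothetical category column.
categories = pd.Series(["Coffee & Tea", "Breakfast", "Coffee & Tea", "Beverages"])

# Counts, ordered from most to least frequent.
counts = categories.value_counts()

# Approach from the chat: sort the labels, then reindex the counts by them.
by_reindex = counts.reindex(sorted(counts.index))

# Shortcut: sort the Series by its index directly.
by_sort_index = counts.sort_index()

print(by_sort_index)
```

Either result can then be passed on to `.plot.bar()` as before.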
Well I knew it already / used the documentation. But if you google pandas change series index it should be first hit
oh you meant the .index
ehh yes well both of them actually
with my question I meant the .index indeed but I'm confused by your 3rd line as well haha
at least I understand half of it
so the same applies. I knew I could get the index by calling .index, but I also found it by googling pandas get index from series
I recommend you play around a bit. Try to print out the output of .index and the other methods
```python
alphabetically_sorted_indices = sorted((df.category.value_counts().index)
plt.show(df.alphabetically_sorted_indices.plot.bar())
```
@lapis sequoia do you know why my python interpreter gives an `invalid syntax` error for the 2nd line?
alphabetically_sorted_indices is just a normal python list
not a pandas.Series object
So .plot.bar() will not work
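A short sketch of the two separate problems in that snippet. The syntax error most likely comes from the first line, which opens two parentheses after `sorted` but closes only one, so the interpreter only notices when it reaches the next line. Independently of that, the sorted labels are a plain list with no `.plot`. The names and data below are placeholders.

```python
import pandas as pd

# 1) Unbalanced parentheses: compiling this text raises SyntaxError.
broken_source = "labels = sorted((counts.index)\nprint(labels)\n"
try:
    compile(broken_source, "<example>", "exec")
    compiles = True
except SyntaxError:
    compiles = False

# 2) sorted(...) returns a list, which has no .plot attribute;
#    the reindexed Series does.
counts = pd.Series({"Coffee & Tea": 2, "Breakfast": 1})  # hypothetical counts
labels = sorted(counts.index)
list_has_plot = hasattr(labels, "plot")
series_has_plot = hasattr(counts.reindex(labels), "plot")

print(compiles, list_has_plot, series_has_plot)
```

So the plot has to be called on `counts.reindex(labels)`, not on the list of labels and not as an attribute of `df`.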
@silk frigate
hmm okay well I'm working on another assignment now
I almost finished it
but I need to print 3 specific columns and now it prints all of them
Do you know how I can manage my code to print only 3 columns @lapis sequoia ?
```python
healthy = df.loc[(df["Trans Fat"]==0) & (df["Cholesterol (% Daily Value)"]==0) & (df["Total Fat (% Daily Value)"]<=20) & (df["Sugars"]<=20)].sort_values('Calories')
print(healthy[(healthy['Category']!='Beverages') & (healthy['Category']!='Coffee & Tea')])
```
I only want to print the columns 'Category', 'Item' and 'Calories'
it prints this now (all columns)
but it should be this
df[['Category', 'Item', 'Calories']]
That's just an example of the syntax
use it on whatever dataframe you want. In your case Its probably on healthy
I have difficulties with where to put such an argument inside of my function
So I have used df.loc to address specific columns
and I gave 4 arguments which need to be true
```python
healthy = df.loc[(df["Trans Fat"]==0) & (df["Cholesterol (% Daily Value)"]==0) & (df["Total Fat (% Daily Value)"]<=20) & (df["Sugars"]<=20)].sort_values('Calories')
healthy2 = healthy['Category', 'Item', 'Calories']
print(healthy2[(healthy['Category']!='Beverages') & (healthy2['Category']!='Coffee & Tea')])
```
I get a KeyError now
I've worked with pandas before
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
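The KeyError most likely comes from `healthy['Category', 'Item', 'Calories']`: with single brackets pandas looks for one column whose name is that whole tuple. Selecting several columns needs a list inside the brackets. A sketch with made-up rows and only one of the four filters, to keep it short:

```python
import pandas as pd

# Hypothetical stand-in for the menu data.
df = pd.DataFrame({
    "Category": ["Breakfast", "Beverages", "Salads", "Coffee & Tea"],
    "Item": ["Egg Muffin", "Cola", "Side Salad", "Latte"],
    "Calories": [300, 140, 20, 190],
    "Sugars": [3, 39, 2, 17],
})

# Row filter (only the sugar condition here), sorted by calories.
healthy = df.loc[df["Sugars"] <= 20].sort_values("Calories")

# Boolean mask built from the same frame it is applied to.
not_drinks = ~healthy["Category"].isin(["Beverages", "Coffee & Tea"])

# .loc[rows, columns]: a list of column names selects just those columns.
result = healthy.loc[not_drinks, ["Category", "Item", "Calories"]]
print(result)
```

`healthy[["Category", "Item", "Calories"]]` with double brackets would do the column selection on its own.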
what does verbose mean? verbose: int, optional (default=0) Controls the verbosity when fitting and predicting.
It's similar to logging level. As you run the fit method, it will log to the console the stats on each step of training.
thanks
so if you set verbose=2 it would only log every other or every third step?
why would anyone want to change that option? it's unnecessary basically, right?
For the Keras API, it is quite helpful.
You can read the docs for the functionality, but it's like python logging levels (debug, info, error, warning, critical)
thanks, i didn't find the explanation regarding verbose in random forest very helpful in the docs... but i don't have time to get too deep into it anyway. i really need to get this project done a.s.a.p
what would be a professional way of visualizing predictions vs actual values in a time series model, when there are like 100s of actual values and predictions to be considered for each day?
is there anything other than just comparing the RMSE or MAE with bar charts for different prediction models?
"professional" is a bit of a weird term
you could consider an interactive visualisation
or some sort of moving average
(don't @ me)
bar charts are like a one-stop comparison
you can use that as an intro
and go further into detail with other visualisations
there's also a middle ground
e.g. group by, say, month, and plot error bar charts
maybe plot the residuals
mhh it's supposed to be in a paper, so interactive visualization is cool as i have to turn in the code, too.... but i'd prefer something that visualizes the results on paper
you're limited only by your creativity
ok, so i did all the grouping stuff in the description part, to show what the data looks like etc
but for the results part, I'd like to have something more than just "here's the RMSEs for the different models, enjoy!" you know?
yeah, that's why I said
wait
what do you mean
the problem i'm facing is, that the project is about delays that are predicted like 100s of times a day. So i can't just plot a time series on the x-axis and and predicted values on y, like when you predict a max temperature for a day or anything like that.
I could try to plot averages over all daily errors for each model-type (I'm comparing 3 model types) maybe... and print it against average actual errors (delay - schedule)
in a boxplot maybe?
yeah, moving average...
what do you mean
in the description part, i showed how the actual delays are distributed for days of the week, hour of day etc.
sure, i could do the same with predictions, but goal is more like making good predictions in general... like for any day over the complete dataset of different trains etc
mhh, can you elaborate? I'm considering a moving average a plot that takes previous average errors (let's say, for example, the past 4 days) into account when calculating "todays" error... how would that be an adequate solution?
boxplots would probably overlap like 90%. So plotting the average errors for any day for 3 models plus the actual delays from schedule and the baseline prediction (that i'm ultimately trying to beat) I'd probably end up with a very cluttered chart.
haha, that could be the problem... guess my creativity is just really limited on this 😩 despite having plotted lots of stuff in the description part, i just don't seem to find any good looking way to visualize the results. only bar chart for feature importances and bar chart for RMSE is kinda lame
okay I think I don't really understand what you're trying to do
which is why maybe my suggestions don't make sense
visualisation is quite an intimate art
i don't know how to better explain what i'm trying to do :[ but I'll try
i have:
```
2. original baseline prediction (which is nothing but a linear shift of previous delays into the future)
3. predicted delays using linear regression
4. predicted delays using SVM
5. predicted delays using RF
```
and I'd like to have some sort of visualization of the model performance results with high information content, other than just RMSE bar plots and residual plots and maybe feature importances and significance/t-stat in case of RF and lin reg, respectively
I'd very much like to incorporate the time-series aspect in one plot
hi data scientists of discord, i have a question for you: is pandas and geopandas the same? or they have nothing to do with each other
geopandas builds on pandas but they are different. @burnt topaz
@summer plover oh okay
i have a problem on #help-coconut maybe you could help me with that
@ gm how would you do it? i can't plot a line chart against time to compare the delays with predicted delays, because there is not just one delay per day, but hundreds or thousands actually.
So I could, if at all, do a scatter plot. But then again, the delays are in general all very similar... and plotting thousands of points on top of thousands of points in other colors doesn't help anybody to understand what's going on, right?
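One way to act on the earlier "group by" and "moving average" suggestions: collapse the hundreds of predictions per day into one error number per model per day, then smooth over a week. That gives one line per model over time instead of thousands of overlapping points. Everything below is synthetic; the model names mirror the list above and the noise levels are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 3000
timestamps = pd.date_range("2020-01-01", periods=n, freq="15min")

# Synthetic "actual delays" and four sets of predictions with invented noise.
actual = pd.Series(rng.gamma(2.0, 2.0, size=n), index=timestamps)
noise_scale = {"baseline": 3.0, "linreg": 2.0, "svm": 1.5, "rf": 1.0}
predictions = pd.DataFrame(
    {name: actual + rng.normal(0.0, scale, size=n)
     for name, scale in noise_scale.items()}
)

# Absolute error of every single prediction.
abs_error = predictions.sub(actual, axis=0).abs()

# One MAE value per model per day.
daily_mae = abs_error.resample("D").mean()

# 7-day moving average to smooth out day-to-day noise.
smoothed = daily_mae.rolling(window=7, min_periods=1).mean()
print(smoothed.tail())
```

Calling `smoothed.plot()` would draw the time-series comparison; the same `daily_mae` table could also feed boxplots per model or per weekday.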
Could someone help me? I'm reading a 1.5gb file and when it gets to 2kk (2 million) lines it returns "ValueError: Length of values does not match length of index"
Traceback (most recent call last):
File "modeltest.py", line 26, in <module>
model = load_model("E:/PanModel.model")
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
return load_function(*args, **kwargs)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\saving.py", line 583, in load_model
with H5Dict(filepath, mode='r') as h5dict:
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\utils\io_utils.py", line 191, in __init__
self.data = h5py.File(path, mode=mode)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\h5py\_hl\files.py", line 408, in __init__
swmr=swmr)
File "C:\Users\Admin\AppData\Local\Programs\Python\Python37\lib\site-packages\h5py\_hl\files.py", line 173, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'E:/PanModel.model', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Please answer this
thisssssssssss
I hate to be a dick, but this is where I just google what the ImageDataGenerator does. There are some things I just don't remember.
that's not being a dick - it's showing how you'd solve the problem
oh I got that thing done... it does alll of it.. all options were to be checked
Anyways... could you help me with
What are the input and output shapes of an embedding layer with vocab_size = 1000 and embedding dimension = 25
Well, I try to not answer questions with just "google it", but that is how I got through grad school, so it works. lol
ive been googling it past an hour... thanks anyways
Sometimes reading the documentation is the best answer I can give.
@echo monolith Going off of https://keras.io/layers/embeddings/, the input shape is (batch_size, sequence_length) and the output shape is (batch_size, sequence_length, output_dim), where output_dim is the embedding dimension (25 here).
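The shapes can be checked with plain NumPy, since an embedding layer is essentially a lookup into a `(vocab_size, embedding_dim)` table. The batch size and sequence length below are arbitrary; this is a sketch of the idea, not Keras code.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 25     # from the question
batch_size, sequence_length = 4, 12      # arbitrary, for illustration

rng = np.random.default_rng(0)

# The layer's weights: one 25-dimensional vector per vocabulary entry.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Input: integer token ids, shape (batch_size, sequence_length).
token_ids = rng.integers(0, vocab_size, size=(batch_size, sequence_length))

# Lookup: every id is replaced by its vector.
embedded = embedding_table[token_ids]

print(token_ids.shape, embedded.shape)  # (4, 12) (4, 12, 25)
```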
thank you very muchhh
Somebody there who is familiar with xarray and netCDF?
I'm getting SystemError: <built-in function imread> returned NULL without setting an error
I'm passing a jpg file path to cv2.imread()
It was working a few minutes ago lol
literally didn't touch that portion of the code
Still using the same jpg file?
Did the file poof? Is the file open? Did your computer turn against you?
My computer has turned against me
Abort abort!
printing out the path right before the call to imread():
datasets\animals\cat\cats_00001.jpg
Pretty. Hmm, I'm not sure. Try restarting python I guess
¯\_(ツ)_/¯
What are good Data Science Ideas for Beginners looking to build a portfolio? Do you have any suggestions for Environment or Finance related projects?
I am interested in finding out the exactly same thing @cosmic crater
@lapis sequoia I would like projects that stick strictly to data analytics and statistics, not machine learning that needs, for example, linear algebra or differential calculus, as I feel that's too advanced right now.
What about you?
Same, but I am also curious about machine learning @cosmic crater
@cosmic crater being able to build a clean dataset from open data sources
Hi,
I am not sure if this is the right channel to ask questions. If it's not let me know.
So I have a file with X,Y,Z data which I managed to load.
I wanted to visualize it in a 2D plot with z as the color scale. Tried googling but nothing seems to be working from what I tried.
Any suggestions?
Thanks
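One common way to do this is a matplotlib scatter plot with `c=z` and a colorbar. The data below is random, standing in for the loaded X, Y, Z columns, and the `Agg` backend is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the X, Y, Z columns of the file.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = rng.uniform(0, 10, 200)
z = np.sin(x) * np.cos(y)

fig, ax = plt.subplots()
# c=z maps each point's z value onto the colour scale.
points = ax.scatter(x, y, c=z, cmap="viridis")
fig.colorbar(points, ax=ax, label="z")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

If the points lie on a regular grid, `ax.pcolormesh` may give a cleaner picture; for scattered points, `ax.tricontourf(x, y, z)` is another option.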
@jolly briar Can you give example of open data sources. I think for me that's the first step. Finding Open Data Sources, I would like anything Environmental, Any-Science Based Sources, Finance, or Economic.
@jolly briar Thanks
@cosmic crater search UK gov transport data or something
You'll bump into portals and there's loads of stuff
@jolly briar I'm going to search for U.S. based data since I'm from the United States. Unless it's climate change related, cuz I know the Trump Administration took down the data on climate change from the EPA and NOAA (I think).
Whatever's clever
Can anyone explain to me the difference between a Markov chain and Word2Vec? They seem to do the same thing: group words with the closest context.
Which models other than neural networks use gradient descent?
Im working with a keras multi-class classification problem. After fitting the model, how do I get the predictions? When I use model.predict, I'm not ending up with ones and zeros.
Is there another tool I should use?
How much memory do you think is needed for this field? I know another person was helping me the other day and said it doesn't matter if I do 16, 32, etc cause I will have to spin up a service no matter what when it gets big enough. I just wanted to confirm with some other Data Science people how they feel on that?
I'm looking at getting a 16" Macbook Pro, but I'm not sure what configuration. A lot of friends say 32GB, but they might just be thinking about themselves and what they do.
@deft harbor I assume the last layer in your model is a Softmax layer. Run an argmax on the output vector. It will tell you the class with the highest probability. That's your output.
Note. If there are ten labels you want to predict, the output vector is either (10,) or (1, 10). Make sure argmax is working on the correct dimension.
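A minimal sketch of the argmax step, assuming the model ends in a softmax and predict returns one row of class probabilities per sample. The numbers below are made up, and the small argmax helper is only there to keep the example dependency-free.

```python
# Hypothetical softmax outputs for 3 samples and 4 classes (made-up numbers).
probs = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]


def argmax(row):
    """Return the index of the largest value in a 1-D sequence."""
    return max(range(len(row)), key=row.__getitem__)


# Take the argmax along the class axis, giving one class index per sample.
# With a NumPy array from model.predict this would be np.argmax(probs, axis=-1).
predicted_classes = [argmax(row) for row in probs]
print(predicted_classes)
```

Working along the last axis is what handles both the (10,) and the (1, 10) case: the result is always one index per sample rather than one index for the whole batch.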
Thanks, I found that after a lot of digging.
Should have thought about it to be honest.
Anyone have input on my earlier question?
it depends on your usage
I don't do any compute on my laptop, but I'm always watching stuff in the background and have tons of tabs open
also pycharm eats memory
so I'm beyond stretching my 16gb
yes but also my computer is 5 years old now
and I have permanent access to a cluster meaning I don't do any analysis on my laptop at all
I'm new, but how did you get permanent access to a cluster @silent swan
university researcher
Would you also then recommend I go for the 32GB over 16GB? On my personal PC for gaming I have 16GB and have maybe twice slowed down
Oh gotcha
really comes down to your own usage. I guess in my case most of what I'm using the memory for isn't really work related (except pycharm eating memory)
Normal usage would be fine, but alas I'm too new to say how much I'd use
But worst case I buy aws instance time
well monitor your own ram use now and see how much of a margin you have
generally speaking 50-60%
16 is most def good enough for data science
I have to give my team lead a write up of the ML lifecycle so we can make sure our project managers truly understand machine learning and how it will fit in our company's workflow.
I'd appreciate it if you can give this a quick overview and let me know if there is anything I have forgotten that I need to add. Tried to really condense this down to easier talking. I could write an entire manifesto on the machine learning lifecycle.
https://docs.google.com/document/d/1slmpunUPAjR_bC8G4LjE8x6bKN9iijrI3SVeE4vMtpU/edit?usp=sharing
Thanks.
Machine Learning Lifecycle The ML lifecycle is a continuous feedback loop; it is not a sequential operation. You have: Defining Business Objectives Data Acquisition Modeling Auditing Deployment Monitoring The business objectives define the data. The data define the model. The ...
Which metric of accuracy, precision, recall, f1 and auc should I look at when tuning the hyperparameters of my model? I have an imbalanced dataset 4:1 ratio but I do resample the training data with SMOTEENN to make it balanced, however the test/validation remains imbalanced as it should not be resampled.
Hi question about gridsearchCV
I thought grid.predict should behave as the predict of the best estimator, but it gives a different output (see first 3 columns)
I'm thinking it's because the estimator stored in grid was only fitted with 90% of the dataset (cv=10). Is this assumption correct?
@thin terrace A choice of metric(s) will be influenced by the business objectives.
However, keep an eye on the validation loss as well. You want to minimize the loss as you optimize for hyperparams.
@oblique belfry business metrics? Why is minimizing loss important and how important is it compared to the mentioned metrics?
I meant "business objectives." I corrected that mistake.
I have personally chased down metrics to be where I wanted them without realizing the loss was going up. I manipulated the data and the model to get me the metric that I wanted without realizing it was learning less.
So I'm classifying the default credit card tabular dataset (binary).
I have a similar skepticism of p-values. One can "p-hack" all day, but does that mean that their model(s) is/are effective?
Just try to have a holistic view on everything.
Is this for practice or for your business?
just a part of a ML experiment im running
Trying to learn some hyperparam optimization
I'm basically doing random grid search and I compile the metrics for each param-setting before I go on and actually use the best model
Now I just need to know which metric to look for when deciding the best one
I'd watch the true positive / false positive rates (which are captured by precision/recall but I can't remember which.) For fraud, classifying a fraudulent transaction as okay is more costly than over-flagging legitimate transactions as fraud. I'd go for a large true positive - false positive gap. But, that is just my approach to it.
I'd go with this approach because you are looking at how the model does on each class, and not in its entirety.
If you have a 4/1 split of non-fraud/fraud data and you get 80% accuracy, this might seem great. But if you look at precision/recall, you might see that you labeled 99% of the non-fraud labels correctly, and got 1% of the fraud labels correct. Obviously, this is not good.
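To make that concrete, here is a small sketch with made-up counts that match the 4:1 example (800 negatives, 200 positives) and a hypothetical model that gets 99% of the negatives and 1% of the positives right. Everything is computed by hand so nothing beyond plain Python is assumed.

```python
# Made-up labels matching the 4:1 example: 800 negatives (0), 200 positives (1).
y_true = [0] * 800 + [1] * 200
# Hypothetical predictions: 792 of 800 negatives right, 2 of 200 positives right.
y_pred = [0] * 792 + [1] * 8 + [0] * 198 + [1] * 2

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives caught
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives flagged by mistake
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negatives left alone

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy lands near 80% while recall on the positive class is 1%, which is why the per-class numbers are worth looking at on an imbalanced test set.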
Yes I'm aware of that part
However, this dataset does not classify frauds
it classifies whether a client's credit card will default next month or not
Maybe I should just go for f1-score?
Once again, another assumption. My bad. But the logic is still sound.
yeah. F1 seems like a good start. Should get you where you need to go.
This wiki might be overkill, but definitely helped me get that there were more metrics than just the normal accuracy. https://en.wikipedia.org/wiki/Sensitivity_and_specificity
What library are you using to train?
keras
There should be an F1 metric callback. Probably in the tf.keras.metrics in tensorflow 2.x.
didnt keras remove their metrics because it's misleading to use during training?
"Basically these are all global metrics that were approximated
batch-wise, which is more misleading than helpful. This was mentioned in
the docs but it's much cleaner to remove them altogether. It was a mistake
to merge them in the first place."
As I understand it, they should only be calculated once training is done?
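A toy illustration of why a batch-wise approximation can mislead, assuming binary labels and a hypothetical batch size of 4. The labels are made up, and f1 here is a hand-written helper, not a Keras or scikit-learn function.

```python
def f1(y_true, y_pred):
    """Binary F1 for the positive class, computed from raw counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels and predictions for 8 samples.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
batch_size = 4  # hypothetical

# F1 per batch, then averaged (what a batch-wise metric effectively reports).
batch_scores = [
    f1(y_true[i:i + batch_size], y_pred[i:i + batch_size])
    for i in range(0, len(y_true), batch_size)
]
batch_mean_f1 = sum(batch_scores) / len(batch_scores)

# F1 computed once over every prediction (the global metric).
global_f1 = f1(y_true, y_pred)

print(batch_scores, batch_mean_f1, global_f1)
```

The two numbers disagree because F1 is a ratio of counts, so averaging per-batch ratios is not the same as computing the ratio over all samples.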
Lol.
So... you bring up a good point. I had to modify my own keras ones to be stateful for my old work. I think the tf.keras.metrics does this automatically.
So to have less work, probably ought to do it at the end. lol Will def be easier for you.
Thank goodness scikit-learn has nice helper functions for F1 score and Confusion Matrices. You can def use that.
yeah, atm I calculate the metrics after the training at each fold of my 10-fold cross-validation, then I take the average of all 10 folds
using sklearn yes
then I get tables like this one
I laugh because I learned that metrics thing the hard way and spent MANY hours reading Keras source code to get what I wanted working correctly.
where each row is the metrics for a setting of parameters
can anybody recommend some brief but useful NN tutorials? xD
i need to implement emoji2vec into my project and realising I have a limited timeframe to complete it
Without context, tough to say
If I had to guess, it means a server that can only handle 1 request at a time, since it only runs on 1 thread. And then there's like 100 connection requests.
I need someone to explain this to me so it'll stick
I want to understand why I need to include continent in the SELECT here
intuitively it makes sense, but I want to remember for the long run
SELECT continent, max(women_in_parliament)
FROM countries
GROUP BY continent
ORDER BY continent
when you group by a column and aggregate, you get one aggregated value for every unique value in the column.
max(women_in_parliament) is the aggregated value...
...and continent is the unique value it corresponds to.
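A small sketch of the same query run against an in-memory SQLite database, to show what the continent column adds. The table contents are made up; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (name TEXT, continent TEXT, women_in_parliament REAL)"
)
# Made-up rows purely for illustration.
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("A", "Africa", 20.0),
        ("B", "Africa", 35.5),
        ("C", "Europe", 40.0),
        ("D", "Europe", 25.0),
    ],
)

# With continent in the SELECT, each maximum is labelled with its group.
with_label = conn.execute(
    "SELECT continent, MAX(women_in_parliament) "
    "FROM countries GROUP BY continent ORDER BY continent"
).fetchall()

# Without it, you still get one row per group, but no way to tell which is which.
without_label = conn.execute(
    "SELECT MAX(women_in_parliament) "
    "FROM countries GROUP BY continent ORDER BY continent"
).fetchall()

print(with_label)
print(without_label)
conn.close()
```

Both queries return one row per continent; the only difference is whether the row tells you which continent it belongs to.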
thanks.. now I can remember this
Here, we teach a neural network to learn to play that classic Ice Fishing game from Club Penguin!
We first try using a fully-connected recurrent network with LSTM nodes, trained on human gameplay. We then use the NEAT genetic algorithm to evolve the neural connections from s...
14 year old trying to figure out how i could get a p-value from quantum random numbers with a range of -2, and 2 in python. if someone can help me that would be much appreciated. my goal here is to see if consciousness intent has any sway over quantum random numbers. kinda like what this university is doing. http://noosphere.princeton.edu/
add me on discord: leyland124#3364
The Global Consciousness Project, home page, scientific research network studying global consciousness
anyone knows how to print predictions on an SVM in a for loop?
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_excel('C:\\Users\\GentB\\OneDrive\\Documents\\Python\\2020\\FootballPredictions.py\\data.xlsx',
sheet_name='Dataset')
dataset = dataset.head(500)
X = dataset.drop('Result', axis=1)
y = dataset['Result']
X_train, X_test, y_train, y_test = tts(X, y, test_size = 0.20)
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
the output I get is only the accuracy
[ 1 39]]
precision recall f1-score support
-1 0.98 0.98 0.98 60
1 0.97 0.97 0.97 40
accuracy 0.98 100
macro avg 0.98 0.98 0.98 100
weighted avg 0.98 0.98 0.98 100
Does anyone know why this function with the guvectorize target set to cpu works, but when I set it to cuda, it gives me this error:
Invalid use of Function(<built-in function sub>) with argument(s) of type(s): (array(float64, 1d, A), array(float64, 1d, A))
Known signatures:
* (int64, int64) -> int64
[The known signatures list goes on for a bit but you get the point]
Code:
@guvectorize([(float64[:], float64[:], float64, float64, float64[:])], "(n),(n),(),() -> (n)", nopython=True, target="cuda")
def calc1(a, b, g, m, out):
    vec = a-b
    r = ((a[0]-b[0])**2+(a[1]-b[1])**2)**0.5
    out = m*g*vec/(r*r*r)
I think I shouldn't be getting this error, as the two numbers I'm subtracting are float64, which is supported like it says in the error
Thought you guys might know since this is a numba question
@lapis sequoia what do you mean? just print(y_pred)?
@paper niche yeah figured it out later on
I have a question if anybody can help me...
Im working on a project and i want to gather twitter data. Now i want to refrain from using the twitter API and i stumbled upon a module on github called twint.
Its more or less perfect for what i need but i always get an error after about some 8000 scraped tweets.
Does anybody know any other way of going about this?
So sklearn.KNeighborsClassifier has parameters for both metric and p. I'd like to test my model using the manhattan metric (which I was always under the impression implied p=1), and the euclidean metric (again, I assumed this implied p=2...). Do I need to be changing both parameters?
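For reference, the scikit-learn docs describe p as the power parameter of the Minkowski metric, with p=1 equivalent to Manhattan and p=2 to Euclidean. The sketch below only checks that relationship numerically in plain Python; minkowski is a hand-written helper and the two points are made up. As far as I can tell, that means changing p alone (with the default metric) or naming the metric directly should give the same distances, but it is worth confirming against your installed version.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Two made-up points.
a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = sum(abs(x - y) for x, y in zip(a, b))  # L1 distance
euclidean = math.dist(a, b)                        # L2 distance (Python 3.8+)

print(minkowski(a, b, 1), manhattan)
print(minkowski(a, b, 2), euclidean)
```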
I need some advice calculating vCPUs
can someone help me calculate, I'm not familiar with how to read machine sizing
so i have found these MIT courses for linear algebra and single variable calculus
This course covers matrix theory and linear algebra, emphasizing topics useful in other disciplines such as physics, economics and social sciences, natural sciences, and engineering. It parallels the combination of theory and applications in Professor Strang's textbook Introdu...
was wondering is this enough for understanding basic level of linear algebra and calculus or should i look for more resources regarding these topics ?
im trying to revise on the maths needed for data science ....
should also learn some multivariable calculus
but it's mostly the easy stuff there that's useful (partial derivatives, directional derivatives, gradient, etc)
@granite steppe most people who've been through uni won't remember most of their courses
sounds fair enough..thnx for the info @crimson flame @jolly briar
and probability/stats
Hey guys! I need a push in the right direction
I have 2 tables:
Table2:
This is just proof of concept. will be built in sql and displayed in tableau
Table1 is by week. Table2 is by month
This is content usage data and quota attainment with dummy data
I need to perform some kind of operation to get the desired outcome.
Desired outcome: Sort content by 'best' to 'worst' judging by the quota that was attained during its months usage.
Any ideas or directions to research? I'm kind of at a standstill and its late in the day so my head is not worth much
heck I just wanna know how to set a condition for if it's the first cell in a specific row then do this
cell, meaning Excel?
so you mean reading a spreadsheet into a pandas dataframe
and working with it with the pandas API?
ya
and not openpyxl?
correct
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
so what do you want to do?
lines 13 - 20
basically my script spits out some strings into a .txt file
and i need to do certain things depending on the type of information in certain columns from the df
at the start of the df I need to print a certain set of strings
and I don't know how to articulate that into python code
so if on index 1 I guess?
does that make any sense?
yeah
which is the first cell of the first row
I believe
IDK if that includes the header though?
but either way i need to make it so that on the first row/cell adpo_x_syntax saves a set of instructions
that script has 3 sets of strings depending on conditions, but I only need help with what I just wrote
I actually tried that as I was explaining this to you
if i == 1:
    adpo_x_syntax = [
        'Key tab',
        'Type ' + str(buyer),
        'Type ' + str(int(branch)),
        'Type ' + str(int(vendor)),
        'Key Enter',
    ]
also I don't really see why you wouldn't use itertuples, too
uh because the df reuses branch values
also, that should be 0, not 1
unless your index specifically starts from 1
for the general case you can just use if i == df.index[0]
depends on what you mean by "header" and how your file is formatted
so you mean the actual Excel header
I'm asking if df.index[0] corresponds to the header row, because my data starts in the 2nd row
wait.
do you mean the actual HEADER, or the first row which happens to contain the column names?
row which contains the column names
okay, then that's not a header
it has a specific meaning in Excel
anyway, by default, the first row will be parsed as column names.
and therefore not part of the data
therefore the row you access with .iloc[0] refers to the second row in the Excel spreadsheet
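A quick way to see this for yourself, sketched with pd.read_csv on an in-memory string as a stand-in for pd.read_excel (which, as far as I know, treats the first row as column names by default in the same way). The column names and values are made up.

```python
import io

import pandas as pd

# Hypothetical sheet contents: one row of column names, then three data rows.
raw = "branch,item\n10,apple\n10,pear\n20,plum\n"
df = pd.read_csv(io.StringIO(raw))

# The first line became the column names, so it is not part of the data.
print(list(df.columns))

# df.index[0] is the label of the first data row (0 with the default index),
# and df.iloc[0] is that row, i.e. the second line of the file.
first_label = df.index[0]
first_item = df.iloc[0]["item"]

for i, row in df.iterrows():
    if i == df.index[0]:
        print("first data row:", row["branch"], row["item"])
```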
I see
try playing with it in an interactive session
it'll be easier for you to understand
I can conceptualize it
but regardless running a if i == df.index[0]: still doesn't print the right set
it is printing the lines 49 and greater
print both values and see what you get
i
actually, just play with the dataframe interactively
this should be pretty simple to debug if you actually have the data...?
like
print(list(df.iterrows())[0])
no
in the for loop
like
print(list(df.iterrows())[0][0]) this gives you the first value of i
compare that with df.index[0]
with open('excel_po.txt', 'a+') as f:
    for i, row in df.iterrows():
        branch, item, distro_size, delivery_date, buyer, vendor = row
        print(df.index[0])
        print("---------")
        print(list(df.iterrows())[0][0])
        if pd.isnull(branch):
0
---------
0
0
0
---------
0
1
0
---------
0
2
0
---------
0
3
0
---------
so on and so forth
uh
your if conditions are wrong
I think you want an elif in the middle
also I actually meant to have those print statements outside, since the input doesn't change
outside the?
well df.index[0] corresponds to 0
and
print(list(df.iterrows())[0][0])
the first value is 0
uh
the point is
they don't change
df.index[0] and list(df.iterrows())[0][0] are constants.
but anyway
like I said
you have if followed by an if-else
and your conditions are such that
the else will trigger if the if does, from what I see
if i == 0:
and then:
if branch != df.iloc[i - 1, 0] and i != 0:
    ...
else:
    ...
logically, therefore, the else will trigger if at least one of those conditions is not true
i.e. it will trigger in all cases where i == 0
since you use the same variable in all branches, the value in it will be overwritten
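A stripped-down sketch of that control-flow problem, with made-up conditions and labels standing in for the real ones in the script. The only point is the difference between a separate if followed by if/else and a single if/elif/else chain.

```python
def pick_unchained(i):
    """An `if`, then a separate `if`/`else`: the else also runs when i == 0."""
    label = None
    if i == 0:
        label = "first-row block"
    if i != 0 and i % 2 == 0:  # hypothetical stand-in for the branch check
        label = "new-branch block"
    else:
        label = "default block"  # overwrites "first-row block" when i == 0
    return label


def pick_chained(i):
    """One if/elif/else chain: exactly one branch runs."""
    if i == 0:
        label = "first-row block"
    elif i % 2 == 0:
        label = "new-branch block"
    else:
        label = "default block"
    return label


print(pick_unchained(0), "|", pick_chained(0))
```

With the unchained version the first-row value is assigned and then immediately replaced, which matches the symptom of the wrong set of lines being printed for the first row.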
i mean i had if pd.isnull(branch):
if i == df.index[1]:
elif branch != df.iloc[i - 1, 0] and i != 0:
else:
i feel like you are mistaken, these aren't nested
Are there any O'Reilly books that go into word embeddings? I decided to write a paper on them for my linear algebra class and I want to write some pseudocode (which will probably just be Python) in the paper for how they're created and used mathematically.
yes, precisely, they aren't nested
in the first one
and in the second one you're comparing to df.index[1]
which is the second row
Has anyone successfully integrated Metaflow and DVC together?
Hey anyone here that visualized WhatsApp Chat Data with Python? A few weeks ago I tried it but had some problems. I wanna try it again later and I was curious if anyone's willing to help me out with it when the problems occur again.
Not sure if I should ask it here or in any of the help channels. Just tell me when I'm doing something wrong
I'm interested. I'm not so much of a data science guru tho
Cool. Mind if I add or dm you so I remember you later this evening? I still have to finish some stuff so I might do it in a few hours or tomorrow. @brazen canyon
I don't mind. Feel free to dm or add me
What editors do you guys use? Someone recommended me Anaconda, but I'm pretty sure that's super bloated no?
VSCode.
i like my anaconda. anaconda isnt an editor though
anaconda is a "batteries included" approach to python/data science. comes with a lot of goodies. So, it depends on whether you want that or not
(the editors anaconda would ship with would be jupyter notebook and spyder)
@ripe forge @oblique belfry The guy is saying he personally uses a Text Editor and a separate IPython REPL. Do you need a REPL, is there a reason for it?
that's where python runs
without a place to run python, you're basically writing in the equivalent of a notepad or word document
(with some fancy features.)
anyways when i was learning this stuff, i personally liked using things that didn't make me worry about these kinds of details
Ah. I use VSCode for most things. If I need to interact with stuff, I'll use IPython.
I like Jupyter/IPython, but I tend to just run everything as scripts.
Im so confused
Why do you need multiple really though
no like repl and editor
For instance I run the code on Sublime and get
Nothing wrong with viewing it here
then you're "running python" behind the scenes
behind the scenes?
from bs4 import BeautifulSoup as Soup
import requests
from pandas import DataFrame
ffc_response = requests.get(
"https://fantasyfootballcalculator.com/adp/ppr/12-team/all/2017"
)
adp_soup = Soup(ffc_response.text, "html.parser")
# adp_soup is a nested tag, so call find_all on it
tables = adp_soup.find_all("table")
# find_all always returns a list, even if there's only one element, which is
# the case here
len(tables)
# get the adp table out of it
adp_table = tables[0]
# adp_table another nested tag, so call find_all again
rows = adp_table.find_all("tr")
# this is a header row
rows[0]
# data rows
first_data_row = rows[1]
first_data_row
# get columns from first_data_row
first_data_row.find_all("td")
# comprehension to get raw data out -- each x is simple tag
[str(x.string) for x in first_data_row.find_all("td")]
# put it in a function
def parse_row(row):
    """
    Take in a tr tag and get the data out of it in the form of a list of
    strings.
    """
    return [str(x.string) for x in row.find_all("td")]
# call function
list_of_parsed_rows = [parse_row(row) for row in rows[1:]]
# put it in a dataframe
df = DataFrame(list_of_parsed_rows)
df.head()
# clean up formatting
df.columns = [
"ovr",
"pick",
"name",
"pos",
"team",
"adp",
"std_dev",
"high",
"low",
"drafted",
"graph",
]
float_cols = ["adp", "std_dev"]
int_cols = ["ovr", "drafted"]
df[float_cols] = df[float_cols].astype(float)
df[int_cols] = df[int_cols].astype(int)
df.drop("graph", axis=1, inplace=True)
# done
print(df.head())
I read this question wrong. So sorry to give a confusing answer.
when you invoke python, it goes and does its running and stuff, and then just throws you the output back
aka, you already actually have a REPL
so yeah...you dont need anything else
ipython is nice though
what for exactly?
so, the big power of python comes when you dont just run it as a script
but rather run it interactively
ipython does wonders when you're trying to run stuff back and forth
(as in, imagine running just first 3 lines of your program, getting the output, then continuing to work, writing couple more lines, but selecting and choosing whatever you really want to run)
it's one of the things i loved most about python when paired with a good IDE
Sounds very weird!
DS sounds so complex haha
but there's just a charm of just, you know..instantly selecting a variable, and running it in REPL, and it spits out its value
without having to rerun the whole script
you can even run code out of order. not recommended initially, at all!
it lets you essentially "Experiment" with writing the logic of the code, and quickly running just that line
leads to some insane boost in productivity once you get used to it
if it all sounds like hand wavey and fancy, don't worry, it's probably meant to be hand wavey and fancy. just use python any way you prefer.
yeah.
fwiw, i give vs code full points too. it's not bad at all
just, my personal first choice is spyder still. somehow vs code makes me feel "cramped"
conflicted as hell
(and in terms of simply market share on IDE, actually pycharm is on top. but again, pick whatever. they all do the same thing)
literally, pick one at random.
not like your choice is locked for life
but its like totally different anaconda has that spyder thing and code one side
anaconda is a pretty painless introduction to python on windows imo
cool. in that case, pick whatever!
is the only reason to use anaconda cause it installs pandas and numpy, etc
hmm
well, there's the conda environment/package manager as well
also the fact that it gives everything you need out of the box i suppose
those really are the big things. you can achieve the same kind of setup if you like without anaconda too.
though, the dependency resolution of conda packages is pretty amazing
makes installing some stuff a breeze, that would have been a pain to manage manually
geopandas and tensorflow come to mind. though i believe tensorflow fixed their issues and now pip install works just fine too
(also, not to mention, you can have anaconda, and then use vscode or some other editor too)
I personally am not a fan of Conda package management.
I start to use repls when I start adding a bunch of print statements everywhere.
What do you use as a REPL
@oblique belfry
I see VS Code has a REPL like Jupyter inside it
A mix between IPython and Jupyter.
Why a mix?
If I need to check the functionality of something quickly, I will use IPython3. But if I am doing some sort of exploratory data analysis, then I will spin up Jupyter.
Most of the time, I am ssh-ing into servers, and I don't feel like setting up Jupyter.
So you do all the actual coding in vscode or another and then sometimes you open up a IPython Notebook like the web based ones?
If I am doing any type of data analysis, I will use Jupyter. Then I will move to vscode as I get more familiar with the data.
If there are some functions or classes I want to quickly test, I will open up ipython.
So you dont start in text editor you start in jupyter
What makes you not use Anaconda Tony?
And instead do it how you mentioned
It depends on what I am doing.
I'd say I spend 85% of my time in vscode, 10% in ipython, and 5% in Jupyter.
It is probably a lot simpler now, but when I first tried all the anaconda stack 2-3 years ago, it was a pain. I could create a virtualenv and just use pip just as easily.
create a virtualenv?
I have like no DS experience whatsoever, what is that needed for?
I agree with pip though
Do you have much experience with Python? Virtual environments are a way to make sure you have the correct dependencies per project. Instead of installing everything in the global python environment, you can install the dependencies you need per project in this "virtual environment".
I'm new to Python as a whole really
Have you used any programming languages?
C#, JS
Some people will disagree with this, but virtual environments are akin to local node_modules folders. Instead of installing a node package globally, you can install it per project.
Good question. Some packages require certain versions of a dependency. Package A may require version 1.13 of numpy. Package B may require version 1.14 or greater. Clearly, an issue will arise.
Package A's dependencies are incompatible with Package B's.
Well, if each package had a local dev environment where they can run any version of the dependency, then you wouldn't have this issue.
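A minimal sketch of creating such an isolated environment with the standard-library venv module, which does what `python -m venv <dir>` does from the shell. The directory name is hypothetical, and with_pip=False is used only to keep the example quick; a real project environment would normally include pip.

```python
import sys
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical per-project environment directory.
    env_dir = Path(tmp) / "project-a-env"

    # Create the environment; packages installed into it stay inside env_dir
    # instead of going into the global Python installation.
    venv.create(env_dir, with_pip=False)

    # Every venv carries a pyvenv.cfg describing the interpreter it is based on.
    has_cfg = (env_dir / "pyvenv.cfg").exists()
    print("environment created:", has_cfg)

# Inside an activated venv these two differ; in the global interpreter they match.
print("running inside a venv:", sys.prefix != sys.base_prefix)
```

Each project gets its own directory like this, so Package A's numpy pin and Package B's numpy pin never have to share one installation.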
So Anaconda you don't need to do virtual env?
If you were using Docker or something to deploy a project, then you might not want to. Since the container is single purpose, then you won't have these issues. But my desktop is multi-purpose, thus I will run into issues with this.
Anaconda does something similar to virtual environments. They accomplish the same thing. conda can make sure whatever project you are in has the correct dependencies.
Now, conda can do MORE than that. But, that is a quick breakdown of my take on it.
Nothing's easy to get into haha
It's easier than you think. I promise. lol
Jupyter uses ipython internally.
Ipython is a repl that runs in the shell.
Jupyter uses Ipython, but runs in the browser.
*actually, Jupyter is a web UI, and sends the data/commands/whatever to the ipython kernel.
so you downloaded a separate application IPython or do you use vscode interactive?
when you say Jupyter you mean Jupyter lab as well right?
https://jupyter.org/try I see Classic and Lab which says it's newer
For the purpose of this discussion, yes. There is a difference, but not for this discussion.
<ipython-input-1-4cbb279a3e44> in <module>
----> 1 from bs4 import BeautifulSoup as Soup
2 import requests
3 from pandas import DataFrame
4
5 ffc_response = requests.get(
ModuleNotFoundError: No module named 'bs4'
When I try to run my code in Jupyter I get this
Currently, I have been researching Metaflow to use at work. I am looking at the metaflow repo I downloaded from git. As I am following the online instructions, I have VSCode open with the source code of the tutorials. On the right, I have ipython open. I am exploring previous metaflow runs. (Don't worry about what is there. Just know that I am interactively stepping through the code and executing things one at a time.)
Yeah. You do not have bs4 downloaded.
I have bs4 downloaded on my pc
Is it globally installed?
I did pip install bs4
it works in vs code, sublime, etc
but that error is jupyter website thing
How did you install jupyter?
I'm doing a web test thing it's not installed
How did you get your Terminal to look like that in VSCode?
That is called ipython. It is the REPL Jupyter uses
It is just another package.
Are you running jupyter in a conda environment?
Disregard conda
I have VS Code open
I'm asking like 5 questions at once, so let's focus it down to one thing at a time cause Im an idiot
Do you have the Extension Open in IPython by Ilya Vouk?
so I need to pip install ipython?
Currently make my terminal window on vscode look like yours
yours looks like spyder [1] etc whereas mine is >>>
If you are using pip, pip install ipython.
I think you need to better research how Python works first before you delve into data science. Understanding how package management works is VERY important. I can say from personal experience that certain versions of keras , tensorflow, and numpy do not play well together.
It is important to know how to correct that stuff.
I had to pin the keras version and not just download the newest stuff.
I've yet to have to, but I'm sure it'll happen
I was going to get into stuff trying this out https://fantasycoding.com/
Yeah. But, you have some fundamental gaps that are going to only get larger as you keep going. Not all data scientists/data engineers/ml engineers need to be the best software developers, but it is important.
Looks fun. But review the basics of Python first.
Well I've done some projects and stuff, nothing crazy
I've been learning from Corey Schafer
Didn't watch Ep22 Pipenv
@oblique belfry venv isn't hard at all, you weren't wrong
guys I have question about parsing really extra nested json. am I in the right place?
you can just use the help channel
didn't get enough help
Hey could someone offer some insight on datasets? I've only used free datasets that are available online, but what would the process of gathering your own dataset look like?
I suspect connecting to different sites' API would be the way to go
Total beginner question but would love if someone could point me in the right direction.
@tacit jewel you can use a scraper to scrape whichever information you need from any given website....
Thank you @vital cipher . I think I will try saving in just a python dictionary rather than sql or sqlite
for now since I'm just starting out and it's not a huge or complex dataset
cool
going to be working on some AMD Vega optimized witchery with pyopencl, wheeeee
witchery? 
So, what's the deal with compress in itertools? Seems to basically do the same thing as filter?
not...really?
filter processes an iterable, removing elements that evaluate to False
compress processes two iterables, removing elements that come from the first iterable paired with elements from the second iterable evaluating to False
@velvet thorn Could you give an example of when you'd use compress?
like I'm looking at the example given here: https://florian-dahlitz.de/blog/introduction-to-itertools
Introducing the itertools functions using real world examples
from itertools import compress
def name_selection(names):
    name_selectors = []
    for name in names:
        if name.startswith("A"):
            name_selectors.append(1)
        else:
            name_selectors.append(0)
    return name_selectors
names = ["Albert", "Alexandra", "Miriam", "Sascha"]
filtered_names = list(compress(names, name_selection(names)))
that just seems like a clunkier way of doing filter(lambda x: x.startswith("A"), names)
but maybe that's just a bad example that doesn't really show what compress is good for
yes, it actually is a bad example
in general, you would use compress only in two cases (where data refers to the first iterable and mask to the second):
- the mask is not reproducible from the data alone
- the mask has already been calculated
so, for example, say you have a list of names and a list of ethnicities
you could use compress to get only, say, the Gaelic names (along with a generator expression on the second list)
compress is like a slightly neutered form of filtering by something else
if you have worked with languages that have something like filterBy...that's basically it.
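A minimal sketch of the names/ethnicities case described above. The two lists are made up for illustration; the point is only that the mask comes from a second iterable, so filter on the names alone could not reproduce it.

```python
from itertools import compress

# Hypothetical parallel lists, invented for this example.
names = ["Aoife", "Hans", "Niamh", "Yuki"]
ethnicities = ["Gaelic", "German", "Gaelic", "Japanese"]

# The mask is a generator expression over the second list.
gaelic_names = list(compress(names, (e == "Gaelic" for e in ethnicities)))
print(gaelic_names)  # ['Aoife', 'Niamh']
```

If the mask can be computed from each name by itself, filter or a list comprehension is usually the simpler choice.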
@velvet thorn Aha! Awesome, thanks!
I am trying to learn how to make neural networks but don't know where to start
@lapis sequoia how much do you know?
like what's your level of ability in linear algebra, probability, and programming
high in programming but really low in math
where do i learn the math ?
@velvet thorn
no need
to keep tagging me
knowing numpy concepts is good
visualisation is not mandatory, but it is very helpful
ok so is there a course u can show me for the programming
huh
if you're good at programming it should be really simple
just pick up Keras/Tensorflow/PyTorch and start hacking
don't think a course is necessary
KK
A bit confused, not sure what I did wrong: the Annual Revenue doesn't seem right. For index 522 it seems right, but for index 1476 shouldn't it be 33.333*
*please @ me *
@south dagger you are dividing years by Total in Millions, which is the wrong way round. You should divide Total in Millions by years. Also, for calculating years you don't need the apply and lambda, you can just do 2019 - sales['Year published']; same for calculating Annual Revenue, just divide one column by the other as you would any other variable.
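A rough sketch of the column arithmetic suggested above. The rows are invented, the "Years" column name is an assumption, and "years" is assumed to mean 2019 minus the publication year; only 'Year published', 'Total in Millions' and 'Annual Revenue' come from the conversation.

```python
import pandas as pd

# Hypothetical rows; the real data was only shown in a screenshot.
sales = pd.DataFrame({
    "Year published": [2009, 2016],
    "Total in Millions": [82.5, 100.0],
})

# Whole-column arithmetic, no apply/lambda needed.
sales["Years"] = 2019 - sales["Year published"]
sales["Annual Revenue"] = sales["Total in Millions"] / sales["Years"]
print(sales)
```

Rows published in 2019 itself would give a zero in "Years" and an infinite revenue, so those may need separate handling.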
I have this MT dataset. As a data scientist, what can be done with it? What questions can be asked of it, what could it yield, what is worth analyzing, etc.?
So far I've thought of doing a classification on age and gender.
And to find the reasons why patients undergo surgery, allergy tests, etc. What else can be done with it? Any suggestions, please?
Any suggestion is appreciated.
What ecosystem do you guys program on? I'm curious.
Can I train YOLO (darkflow) on hand gestures and recognise hand gestures in real time?
Ping me in replies
@obsidian copper as long as you have a dataset for hand gestures, and as long as the actual prediction of a gesture is done as if it was on a "Frame", then absolutely.
@ripe forge ok thank you
@lapis sequoia some kind of nlp around the column that gives description. could be classification to one of the categories under 2nd column, or just a simple patient clustering algo to try to fit new cases in
having said that, that pic doesn't give a lot to go on
ecosystem..not really sure what that means tbh. just python i guess
having some trouble with an implementation of sparse pegasos svm... it looks like the more i train, the worse it is :(
are you measuring performance on both train and test? is it improving on train?
i'm just measuring performance on train and it's getting worse the more i train
it's supposed to be linearly separable
but i'm not really sure how to prove it is
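One possible way to check separability, sketched here with a plain perceptron rather than Pegasos (a swapped-in technique, not what was used in the conversation). If it finds weights that classify every training point correctly, the data is linearly separable. If it gives up after max_epochs, that proves nothing either way. The function name, the toy data and the max_epochs default are all assumptions.

```python
import numpy as np

def perceptron_separates(X, y, max_epochs=1000):
    """Return (w, b) if a separating hyperplane is found, else None.

    Labels in y must be -1 or +1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified or on the boundary
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:
            return w, b  # every point is strictly on its correct side
    return None

# Toy data invented for illustration.
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
result = perceptron_separates(X, y)
print(result)
```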
sigh never mind i think i figured it out
silly mistakes
@polar acorn thank you !
curious, if you don't mind sharing, what was it? @sand timber
ah
you never know if you implemented it right or if the hyperparameters are bogus
heh. yep, that'd do it
@obsidian copper What type of gestures are you doing?
@oblique belfry actually I want to detect hand gestures in real time. Gestures like thumbs up or fist etc. I'll later use these classifications to perform various mouse functions.
Yolo might work for you.
If you find that training on still images is not good enough, you can use SlowFast or Conv3Ds and train on multiple videos.
I thought of Conv3Ds but idk much about it. And I've to develop this project in like 2 weeks max.
I'm still learning cnns
You might have to leave the "image classification" approach to things and go to "action recognition." I did a similar problem. Spent hours trying to do a Conv2D + LSTM approach and got nowhere.
Gotcha. Just was scrolling through and saw your comment.
I'll do my best
how would i center or word wrap column names?
well hello there
i plotted this with matplotlib
i think we all know what this is.
However, as you may also have seen, you can't read s**t from it.
i did the dpi at 150 and figsize at (20,10)
but its like unreadable
it also lags a lot of course, hence i am converting them into pictures instead of plotting them
so i need it readable on an image
any methods?
Sad, can nobody give more insights?
@south dagger do you want to center or wrap?
did you Google
yes but it kept giving me col selection stuff which wasnt working
try df.style.set_table_styles([{'selector': "th", 'props': [('max-width', '50px')]}]) @south dagger
@chilly fog
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.
You can find a much more detailed explanation on our website.
hi
Hi bon, having fun with this N-body particle thing I made in PyOpenCL, glad I fixed a bug
Vector fields!
I'll show an image.
Hi all, im very new to coding in python!! great to be here!
Hey @burnt wharf!
It looks like you tried to attach file type(s) that we do not allow (.txt). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .md.
Feel free to ask in #community-meta if you think this is a mistake.
Can anyone point me to some good reading material? I want to compare snapshots of the same table(s) at different points in time, identify the differences, and make some meaningful inferences.
Well numpy sounds good for that sort of thing
Depends what sort of meaningful inference you want to make
Hey, i am kind of new to Python, and i need some help. I am trying to do a polynomial regression
@wheat frost I look at the snapshot of the table every 10 minutes
new rows can be added to a table, or existing rows could have changed
but the volume of data is very high, so RDBMS like comparisons take a bunch of time
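One way this kind of snapshot comparison could look in pandas, as a sketch only. It assumes both snapshots fit in memory and share a key column; the "id" and "value" names and all the data are hypothetical.

```python
import pandas as pd

# Two hypothetical snapshots of the same table, keyed by "id".
old = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
new = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, 25, 30, 40]})

# An outer merge with indicator=True records which side each row came from.
merged = old.merge(
    new, on="id", how="outer", suffixes=("_old", "_new"), indicator=True
)

added = merged[merged["_merge"] == "right_only"]    # rows only in the new snapshot
removed = merged[merged["_merge"] == "left_only"]   # rows only in the old snapshot
both = merged[merged["_merge"] == "both"]
changed = both[both["value_old"] != both["value_new"]]

print(added["id"].tolist(), changed["id"].tolist())
```

For very large tables, comparing a per-row hash or a last-modified column may be cheaper than merging every column, but that depends on what the source tables offer.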
Hi All, Pandas question: how would I go about adding new rows to a DataFrame obtained using groupby() and count(). I have cumulative sums of items grouped by date. My resulting dataframe looks like in the screenshot. I'd like to add additional rows to it e.g. to predict future growth.
wow
@fresh cedar if you put .reset_index() after it, it should return a dataframe.
Thanks! I'll try it out
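A sketch of the reset_index() idea followed by appending extra rows. The data, the column names and the "predicted" value are all placeholders, since the real frame was only shown in a screenshot.

```python
import pandas as pd

# Hypothetical items with the date each one was added.
items = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "item": ["a", "b", "c"],
})

# Count per date, accumulate, then turn the date index back into a column.
counts = (
    items.groupby("date")["item"]
    .count()
    .cumsum()
    .reset_index(name="total")
)

# Extra rows are just another DataFrame with the same columns.
# The value 4 is a placeholder, not a real prediction.
future = pd.DataFrame({"date": pd.to_datetime(["2020-01-03"]), "total": [4]})
extended = pd.concat([counts, future], ignore_index=True)
print(extended)
```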
“Top 20 YouTube Channels for Data Science in 2020” by Benedict Neo https://towardsdatascience.com/top-20-youtube-channels-for-data-science-in-2020-2ef4fb0d3d5
how can i tell my script to do something, if it is on the first row of a pandas df
How to deal with spatial data ?
@real wigeon what do you mean, can you give an example?
so i was watching andrew ngs course
and i decided to code a linear regression model myself
and so i have a doubt
import sklearn.linear_model as lin
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib
from math import sqrt
data = pd.read_csv(r"C:\Users\home\Desktop\Artifcial intelligence\ML\data\Regr\FuelConsumptionCo2.csv")
x = data[['ENGINESIZE']]
y = data[['CO2EMISSIONS']]
trainx,testx,trainy,testy = train_test_split(x,y,test_size=0.2,train_size=0.8,random_state=7)
regr = lin.LinearRegression()
regr.fit(trainx,trainy)
# h(x) = O0 + O1 * x
o0 = regr.intercept_
o1 = regr.coef_
print(f"o0 shape = {o0.shape}")
print(f"o1 shape = {o1.shape}")
o0 shape = (1,)
o1 shape = (1, 1)
why is my theta 1 a 2d array
while my theta0 is 1d
nvm makes sense now
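A likely explanation for the shapes above, sketched with synthetic data because the CSV is not available here. With a 2D target such as data[['CO2EMISSIONS']], scikit-learn stores coef_ as (n_targets, n_features) and intercept_ as (n_targets,). With a 1D target such as data['CO2EMISSIONS'], coef_ is (n_features,) and intercept_ is a scalar.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the engine size / CO2 columns.
rng = np.random.default_rng(0)
x = rng.uniform(1, 8, size=(50, 1))   # 2D: (n_samples, n_features)
y_1d = 40 * x[:, 0] + 125             # 1D target, like data['CO2EMISSIONS']
y_2d = y_1d.reshape(-1, 1)            # 2D target, like data[['CO2EMISSIONS']]

fit_2d = LinearRegression().fit(x, y_2d)
fit_1d = LinearRegression().fit(x, y_1d)

print(fit_2d.intercept_.shape, fit_2d.coef_.shape)      # (1,) (1, 1)
print(np.shape(fit_1d.intercept_), fit_1d.coef_.shape)  # () (1,)
```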
npar = np.array([o0,
o1])
print(npar)
when i put them in an array
they become a 2d vector