#data-science-and-ml
they got calculus 2 also about 7 hours course
wondering how many days it takes for me to become good in machine learning
i had my friends say like 3 months and more
thanks for maths man
I suppose this is more of a software best-practises type question but why doesnt numpy use some sort of abstract representation to evaluate the expression after it is required to be calculated (without an @ operation or maybe there are even more gains to efficiency you could make with this knowledge) which would allow numpy to always minimize the complexity? or maybe there is already a wrapper to array that does this?
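If the question is about chained matrix products: NumPy evaluates eagerly and does not build an expression graph, but `np.linalg.multi_dot` does choose the cheapest multiplication order for a chain. A small sketch, with made-up shapes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: the multiplication order changes the cost a lot here.
A = rng.standard_normal((200, 5))
B = rng.standard_normal((5, 300))
C = rng.standard_normal((300, 2))

# Plain @ evaluates left to right: (A @ B) @ C.
left_to_right = A @ B @ C

# multi_dot picks the parenthesization with the fewest scalar multiplications.
optimized = np.linalg.multi_dot([A, B, C])

same_result = np.allclose(left_to_right, optimized)
```

Both give the same product; only the order of the intermediate multiplications differs.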
idk what exactly ur asking but hope someone answers ur q my man
yo so idk if anyone here remembers but im working on a gan
i been trying to run it
i get this error
```
InvalidArgumentError Traceback (most recent call last)
<ipython-input-16-41501876ca3c> in <module>()
15
16 for example_input, example_target in t_in.take(1):
---> 17 generate_images(generator, example_input, example_target)
18
19 EPOCHS = 150
7 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: Index out of range using input dim 0; input has only 0 dims [Op:StridedSlice] name: strided_slice/
<Figure size 1080x1080 with 0 Axes>```
here's the code with the model and train step https://pastebin.com/dA6dS27C
yes
for that i need a speech recognition module
the current one is not compatible with python3.8
because pyaudio stopped at 2017
so use python 3.7?
no it is only compatible with 3.6
u can just use that lol
i just looked this up https://pypi.org/project/SpeechRecognition/
never used it tho
so i guess i should downgrade
this need pyaudio as well?
Hey guys, just wondering how I could make a function to check if these cities are present in the 'Kommun_name' column:
("Borlänge", "Gävle", "Göteborg", "Haparanda", "Helsingborg" , "Jönköping", "Kalmar", "Karlstad", "Linköping", "Malmö", "Stockholm", "Sundsvall", "Uddevalla", "Umeå", "Uppsala", "Västerås", "Älmhult", "Örebro")
I did a test to see that it works by checking if 'Haparanda' exists which it does as you can see
df.Kommun_name.is_in(YOUR_TUPLE)
Nice!
#gladtohelp
Thanks!
#gladtohelp
@native ridge Like this? ```#Tuple with all the cities that already has an Ikea store
ikea_stores = ("Borlänge", "Gävle", "Göteborg", "Haparanda", "Helsingborg" , "Jönköping", "Kalmar", "Karlstad", "Linköping", "Malmö", "Stockholm", "Sundsvall", "Uddevalla", "Umeå", "Uppsala", "Västerås", "Älmhult", "Örebro")
df_dropped.Kommun_name.is_in(ikea_stores)```
No
Sorry, try isin, without the "_".
Hard to remember every single name...
This should return a Series of bool, with which you can index the DataFrame.
yup, it's isin
also, try to use a set ({}) with isin
{"Borlänge", "Gävle", "Göteborg", "Haparanda", "Helsingborg" , "Jönköping", "Kalmar", "Karlstad", "Linköping", "Malmö", "Stockholm", "Sundsvall", "Uddevalla", "Umeå", "Uppsala", "Västerås", "Älmhult", "Örebro"}
No worries, thanks. It worked but is there a more intuitive way so that it outputs just like the above?
So that the output shows the complete row for each city, like with 'Haparanda'
filter on it
df_dropped[df_dropped.Kommun_name.isin(ikea_stores)]
exactly as you did in the previous cell
if you take the df[] out of it
you'll see that it also returns a boolean Series
I want to create a Has_store variable and assign value 1 for city having Ikea store and 0 otherwise. The cities in my tuple are the cities that has an Ikea store
Should I create a conditional function for this or what is the best way?
It sets all to yes now, what is the problem here?
your condition checks if the name is equal to the whole tuple
not if the name is in the tuple
how do I check if the values in the tuple are present
in operator
and if that's the only purpose of that tuple you get a bit of performance by making it a set
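Putting the `isin` advice together, a minimal sketch of both the row filter and the Has_store column. The three-row frame and the shortened set are made up for the example; only the names `Kommun_name`, `ikea_stores`, `df_dropped` and `Has_store` come from the chat:

```python
import pandas as pd

# Shortened set for the example; the real one has all the store cities.
ikea_stores = {"Haparanda", "Kalmar", "Uppsala"}

# Hypothetical stand-in for df_dropped.
df_dropped = pd.DataFrame({"Kommun_name": ["Haparanda", "Kiruna", "Uppsala"]})

# Boolean Series: True where the city is in the set.
mask = df_dropped["Kommun_name"].isin(ikea_stores)

# 1 for cities with a store, 0 otherwise.
df_dropped["Has_store"] = mask.astype(int)

# Complete rows for the cities that have a store.
with_store = df_dropped[mask]
```

This avoids writing a conditional function: the membership test runs on the whole column at once.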
how to make a dataset through webpages?
@royal thunder do you mean scraping data from web ?
yeah like that
here is an example
i wanna scrape some data from that website and make some csv file tho
@paper niche that was the ticket, thank you.
https://www.dataquest.io/blog/web-scraping-beautifulsoup/
this site might help .. though it uses beautifulsoup
it teaches you to scrape data from web and make a visual representation of the data you scraped
if you got your dataframe
you may do this:
dataframe.to_csv("data.csv",sep=',',index=False)
this should get you your csv file
thanks man
Hi guys,
I have a time series and I'm trying to figure out which model to use for forecasting it. I've run several analysis techniques and concluded that the series is stationary and normally distributed, but I can't tell which model would be right for forecasting. Here are pictures of my seasonal decomposition, ACF and PACF:
Looking at these graphs, what is the first conclusion that comes to mind? Can we say that a moving average is a good model here?
thanks in advance for ur help 🙂
(I know for most of you this is pretty basic but not for me so any help or insight to put me in the right path is highly appreciated
)
@tidal sonnet how did you find the values of a, b, c to be 3, -1/2, and 0?
@heady hatch They were from a previous question, which I got correct, then they said to take the answer and plug them into the [a,b, c], then get the echelon form :(
hi good morning
a little question: how can i select a title of a column without using the name column
thats the dataset
i want select 'BTC Returns' as string
maybe iloc?
df.iloc[:0, 5:6]
but can i select it as string
how*
@slender nymph Hey I'm not too sure what you're asking for.
Are you looking to grab one particular cell as a string? Or did you want them as a list of strings?
@tidal sonnet
from what I've found, A in REF ended up to be
A = [
1, 1, 1,
0, 1, 2,
0, 0, 1
]
Then I would probably manipulate S the way A was manipulated.
I'm not super sure where a, b, c is supposed to come in.
can anyone suggest any good machine learning courses?
Interesting... Can you explain the method you used?
https://hatebin.com/hortmvvtid
anyone know why i can't install transformers? i could use a bit of help... thanks.
try running cmd as admin, had the same problem before when I installed scrapy, got it installed when I ran cmd as admin
@slender nymph col_headers = dataframe.head()
try this
i use that for selecting column headers of csv
Ahh that makes a lot of sense that they want the column name. hahaha I thought they wanted the values from the column.
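For the column name itself, note that `head()` returns the first rows of the frame, while the labels live in `df.columns`. A small sketch, assuming 'BTC Returns' sits at position 5 as in the `iloc` attempt; the other column names are invented:

```python
import pandas as pd

# Hypothetical frame; only the position of 'BTC Returns' (index 5) follows the chat.
df = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "BTC Returns"])

# Select the label by position; this is a plain Python string.
name = df.columns[5]

# All labels as a list of strings.
all_names = list(df.columns)
```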
@tidal sonnet
Yea so I started with
A = [
[1, 1, 1],
[3, 2, 1],
[2, 1, 2]
]
Then from there,
-3 * first row + second
-2 * first + third
then keep going from there to reach REF.
You can do it multiple times??
so that would give you
```
A = [
[1, 1, 1],
[0, -1, -2],
[0, -1, 0]
]```
That's what i got as well, difference being I had multiplied first row by 3 and 2 and subtracted it from the others. But where did you go next?
seeing that in row 3, the [a] and [c] are both the same? @heady hatch ?
@tidal sonnet and then I
multiplied -1 * second row + third
then divide third by 2.
[0, 1, 2],
[0, 0, 2]
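The row operations described above can be replayed in NumPy as a quick check. This is just a sketch of the same steps, ending with a sign flip on the second row to match the REF posted earlier:

```python
import numpy as np

A = np.array([[1, 1, 1],
              [3, 2, 1],
              [2, 1, 2]], dtype=float)

A[1] += -3 * A[0]   # -3 * first row + second  -> [0, -1, -2]
A[2] += -2 * A[0]   # -2 * first row + third   -> [0, -1,  0]
A[2] += -1 * A[1]   # -1 * second row + third  -> [0,  0,  2]
A[2] /= 2           # divide third row by 2    -> [0,  0,  1]
A[1] *= -1          # flip sign of second row  -> [0,  1,  2]
```

To carry S along, the same operations would be applied to it in the same order.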
MY GOSH
@heady hatch THAT'S SO COOL
i didn't know that you didn't HAVE to use the first row
thank youuuuuuuuu
hahaha glad to be of help.
@eternal fractal oh i didn't think about that
@eternal fractal https://hatebin.com/qkqzgoqemd
no success
Did I do this correctly? @surreal ingot
If so, how can i get rid of the negatives that I get out for S?
Using Back Substitution
:(
I am genuinely confused
I think you might have gotten the wrong @heady hatch . hahaha
On the other hand, the first S should be [15, -17, -7] I think.
Because 15 * -2 + 23 = -7.
@tidal sonnet
yea it's -7, i realized that just now
But I also can't figure out where the (r) is supposed to come in
And this is correct
then they tell me to take this answer, and plug it back into question 1
So i end up with something like
A =[[1, 1, 1], [3, 2, 1], [2, 1, 2]]
r = [3, -0.5, 0]
S = [15, 28, 23]
Hmm. let me think.
Yea same.
hahaha
Can you give me the problem in sequence. Like in screenshots? I might be able to better give some suggestions.
🤔
Because I feel like I'm getting bits and pieces and can't really connect back together.
i'm getting different numbers since i fixed me having -8 instead of -7
so it's actually looking like it's start to make sense... a bit
Okay cool cool cool.
:^)
Yea
that's what I meant...
like if i'd have to set it to
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
they wanted me to solve it 😁
THANK YOU SOOO MUCH
i've been on this practice quiz for 3 days now :(
Hey, we've learned something new today!
I checked the thing just now...
hahaha congratulations.
thank you alot m8
Glad to be of help.
i couldn't figure out at all how to get row 3...
but now i know that you don't have to use the first row
something else i'm curious about
If I hadn't multiplied by a negative scalar and added
but instead multiplied by a positive scalar and subtracted, would that have been the same?
It is the same.
You can think of it as 2 - 3 => 2 + (-3).
I prefer the add-a-negative notation just to keep things uniform.
hello data scientist. someone had made a OLS regresssion without statsmodels module only with numpy and pandas
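In case it helps, one way to do OLS with NumPy alone is a least-squares solve on a design matrix with an intercept column. A minimal sketch on synthetic, noise-free data (the numbers are made up):

```python
import numpy as np

# Synthetic data for illustration: y = 1.5 + 2.0 * x, no noise.
x = np.arange(10, dtype=float)
y = 1.5 + 2.0 * x

# Design matrix: a column of ones for the intercept, then x.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta = y.
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

With a pandas DataFrame, the columns could be passed in via `.to_numpy()` first.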
hello could somebody help me with scraping a web page? Basically I am trying to get this picture off a website but it's labeled as an event, which I think means that some JavaScript is being executed or something, so Beautiful Soup doesn't read it. Any ideas on what to do?
But i'm not sure where I went wrong
I tried finding the inverse...
They say that the above answer is the right one
But this is what I got out
ok
Trying to do a histogram with array([13., 23., 33., 48., 52., 48., 33.]).
So every element is one column. Instead, I get these elements sorted to their numerical value.
How do I fix this?
let me think for a second its been a while since I've worked with this
Alright, thanks sir
no need to call me sir haha im only a teenager
hahaha alrighty
are you using google colab or what?
jupyter
hmm i never used jupyter but its similar to colab i think
what was your code for this line?
ax1.hist(DS, density=True)
DS stands for the array
density, I tried turning off and on
ok and what do you want this to be
yeah what do u want the histogram to represent
so every element in the array has to represent one column
if I have an array of all 10, then all the columns have to be same height of 10
Is this clear?
yes, I think I understand what you're asking
sorry I just haven't done these in a while, almost a year
Oh, i see, well if u think u cant help me, dont worry
but if u have any hints at least of how to tweak
no no I think I can it will take me just a bit to remember some things
alrighty
I might not be able to give the solution but I could definitely point you in the right direction
thats more than enough
I believe it might be a problem with the axis because your array seems to be only for the x - axis, you might have to make it to where the y -axis has the same array as your x - axis if you want the histogram to have the same height as its location on the x - axis
did that make sense?
I don't know if that is correct though
let me digest that
yeah go ahead, I am not the best at clearly explaining stuff but if you need clarity go ahead and ask
oh, I think i get it. Because there is no linearity (correlation) between the two variables, the y axis is misrepresented
thus, showing that funny 0.00 to 0.07 value
on the y-axis
yup
oh, let me tweak on that, thanks mate
I don't know why you got 0.00 - 0.07
that was given by the program
but im trying to look for the y axis parameter
but cant find
hmm well try tweaking around with the axis and if u need anymore help just ping me in this channel
uhum, thanks mate
@old thorn Hey, couldnt really find a way through this.. Is there more that you know?
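One thing that might be going on: `hist` bins the values and counts how many fall in each bin, and `density=True` normalizes those counts, which would explain a y-axis running from 0.00 to 0.07. If every element should be its own column with height equal to its value, a bar chart is probably what's wanted. A minimal sketch, assuming matplotlib and the array from above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

DS = np.array([13., 23., 33., 48., 52., 48., 33.])

fig, ax1 = plt.subplots()

# One bar per element: x is the position in the array, height is the value.
bars = ax1.bar(np.arange(len(DS)), DS)

heights = [bar.get_height() for bar in bars]
```

In Jupyter the backend line can be dropped; it is only there so the example runs headless.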
Hi, I need some advice: I want to try to train a very (VERY) simple network, a simple perceptron, and for that there is an analytical solution, which involves the Moore-Penrose pseudoinverse. However, my input data is a bunch of binary strings like "00010111". Now, calculating that inverse through np.linalg.pinv(X_train) sometimes gives me a convergence error, but if I run it a second time then that error does not appear (no idea why). But if on the other hand I decide to build a Keras model like this:
model_y=keras.models.Sequential([keras.layers.Dense(100, activation="relu",input_shape=[8]),
keras.layers.Dense(1, activation="linear", name="l3")])
model_y.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss='mse', metrics=[tf.keras.metrics.RootMeanSquaredError()])
I get "no learning" at all. My first guess is that this has something to do with the fact that my input consists of binary data, but does anyone have any ideas what can be done?
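For reference, the analytical route for a single linear neuron might look like the sketch below. The data, weights and bias are all invented; the only things taken from the question are `np.linalg.pinv`, `X_train` and the 8-bit binary inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 binary strings of length 8 as 0/1 arrays.
X_train = rng.integers(0, 2, size=(200, 8)).astype(float)
true_w = np.arange(1, 9, dtype=float)   # made-up weights
y_train = X_train @ true_w + 0.5        # made-up continuous target

# Append a column of ones so the bias is learned with the weights.
Xb = np.column_stack([X_train, np.ones(len(X_train))])

# Least-squares solution via the pseudoinverse.
w = np.linalg.pinv(Xb) @ y_train
```

Binary inputs need no special handling here: once they are 0/1 floats they are ordinary numeric features. If `pinv` keeps failing on the real data, checking `X_train` for NaN or infinite values would be a reasonable first step.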
Hmm How come you chose 100 units for the first layer?
oh, my bad, that was a mistake. It should only have the 1 neuron layer with the input shape specified
Ahh okay okay. What's the input/input shape?
the input are binary integers in the neurons like an array of [0,1,1,0,0,1,1,0] and the output will be a continuous variable
its a regression problem
That's what I was thinking of too. Any reason why you're using relu?
no reason at all. I'm way too inexperienced for it
to predict an answer to something i should just take the mean?
That is one form of prediction.
what others are there?
Depends on the context, let's say given some data you want to find some form of predictor for this set of data.
You can choose mean or median.
or maybe even mode.
wouldn't mean be the same as mode in the context of YES or NO?
and how would median be relevant for prediction?
So prediction without any kind of context is vague.
Could you clarify what you mean by prediction?
like say with a given age, the program tells you if it's more likely the person will say yes or no to something
like idk... do you have a bedtime?
i'm not working on a project. just trying to understand the basics
@keen root I think we can start really simple.
Just maybe something simple like
model = Sequential()
model.add(Dense(1, activation='linear', input_dim=input_len))
model.compile(...)
I don't know if this will work or not. Would start simple.
Yea of course. @lapis sequoia
So in your example, the prediction would be a yes or a no.
And I'm assuming what you're basing that prediction is on some stats or measurement, like mean?
yeah mean ig
but wouldn't mean just get the same result as mode? in that situation
Depends on your data.
If I have a 4x4 np.array and I want to add all of the rows horizontally so that I end up with a single column, what is a way to do it?
[[69,0,86,8],
[45,52,87,29],
[42,38,81,43],
[63,73,60,0]]
to
[[163]
[213]
[204]
[196]]
Let's say your data is 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 4.
I don't think the mean is the same as the mode here.
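A quick check of that example with the standard library, just to make the difference concrete:

```python
import statistics

# Eight 1s, five 2s, four 3s and one 4, as in the message above.
data = [1] * 8 + [2] * 5 + [3] * 4 + [4]

mean = statistics.mean(data)      # 34 / 18, roughly 1.89
median = statistics.median(data)  # middle of the sorted values
mode = statistics.mode(data)      # most frequent value
```

Here the mode is 1 while the mean sits close to 2, so they give different answers even before any rounding.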
that is true but what i am saying is data will either be 1 or 2
@heady hatch I've tried it before, though I'll try it again
@past raptor Are you using regular python or some library like NumPy?
as in yes or no
numpy sir
My main concern is the fact that there is binary data at the entry. Is there anything special about it? Would I have to have any special care?
@past raptor You can do np.sum(data, axis=1)
Let me try that!
@lapis sequoia Ahh okay so like 1 = yes and 2 = no?
So what would the mean represent?
the same as mode no?
@keen root I think once you've changed it into an array representation, it's not in binary format anymore. Instead it is an array/tensor of numbers.
@lapis sequoia Oh uh just to make sure we're on the same page, how do you calculate the mode?
@heady hatch thanks, it worked!
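For the record, a sketch of that call on the array from the question. `keepdims=True` is an optional extra that keeps the result as a single column:

```python
import numpy as np

data = np.array([[69, 0, 86, 8],
                 [45, 52, 87, 29],
                 [42, 38, 81, 43],
                 [63, 73, 60, 0]])

# Sum across each row; result has shape (4,).
row_sums = np.sum(data, axis=1)

# Same sums, kept as a 4x1 column.
column = np.sum(data, axis=1, keepdims=True)
```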
@lapis sequoia right, you might need to connect the dots for me.
I'm not sure how you're getting the mean and the mode to be the same here if all you have is 1 = yes and 2 = no.
Right.
1.6 i mean
To focus a bit on the details here, so rounding the mean isn't the same as the mean itself.
yes i know that, but to make the mean into an answer wouldn't i have to round it?
i'm probably being dumb 😅
No no you're not, you're learning and we're discussing.

So straight up taking the mean and rounding it is very crude but it's one way to get predictions.
Do you know anything about linear regression or logistic regression?
i don't
Ahh.
To give you analogy.
Let's say we're creating an algorithm to predict whether someone will be asleep or not.
Our simple algorithm is just to take the mean of the data and probably apply some function.
hahaha sorry don't want to throw too many things at you.
So linear or logistic regression are another kind of algorithms.
and what do they do?
Similar to how we grab the mean and apply some function.
So linear regression will predict a number of some kind given an x.
I don't know how familiar you are with math.
but like y = mx + b.
mx being...
m = slope, x = data.
oh no worries, it's more of a math term. hahaha
Imagine the line y = x.
You know how it's just a diagonal line?
Right.
unless x has got something to do with y
And the slope of the function is ratio of the vertical change over the horizontal change.
In y = x
slope = 1
But let's say y = 2 * x + 3
slope here is 2
So the +3 is something called the intercept.
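To see slope and intercept come out of data, here is a small sketch that fits a line to points generated from y = 2 * x + 3. The x values are arbitrary, and `np.polyfit` is just one convenient way to do the fit:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = 2 * x + 3

# Fit a degree-1 polynomial: returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Predict y for a new x using y = m * x + b.
prediction = slope * 10 + intercept
```

Linear regression does the same thing on noisy data: it picks the m and b that fit the points best.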
You might need to learn some basic algebra if you're unsure of all this.
i might know what these terms are in my language
i just don't recognize them in english that well
Ahh. hmm what language are you familiar with?
portuguese
Let me google it.
This is from google translation.
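A inclinação de uma função linear (The slope of a linear function)
A inclinação de uma colina é chamada de declive. O mesmo vale para a inclinação de uma linha. A inclinação é definida como a razão entre a mudança vertical entre dois pontos, a elevação, e a mudança horizontal entre os mesmos dois pontos, a corrida. (The steepness of a hill is called a slope. The same goes for the steepness of a line. The slope is defined as the ratio of the vertical change between two points, the rise, to the horizontal change between the same two points, the run.)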
Let me know if that makes sense. hahaha
yes
when predicting something should i add that for future learning or not? bc it might not be 100% correct
I have a list of stocks I’ve kept track of over a while. I want to get the stock price per cell, per day, and then see what the price was.
Hello guys, I'd like to present a repo for a good cause. This repository is an initiative to share data science knowledge with a community of Spanish-speaking practitioners, since most of the content on this subject is in English. If you know data science and machine learning techniques and methods, you can share them with our study group through a pull request; they will be translated to serve as study material and expand the amount of accessible material. Contributions can explain how a machine learning model works, a data cleaning or exploration technique, a tutorial on how to use a module, etc. Apart from participating in Hacktoberfest and winning a shirt or planting a tree, you are helping a community of people who want to learn.
https://github.com/LATAM-Data-Science-Study-Group/Data-Science-Notebooks
```
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 89, in <module>
assert (x_train.shape[1:] == (imageDimensions)), "the dimension of training images are wrong"
AssertionError: the dimension of training images are wrong```
i am following https://www.youtube.com/watch?v=SWaYRyi0TTs this tutorial
i am facing an AssertionError in my code
I think the dimensions of your training images are wrong. How does the tutorial setup the input?
@heady hatch in tutorial they have used imageDimensions = (32, 32, 3)
but when i pass the the same imageDimensions then it is giving an error
And you both are using the same version of Tensorflow, right?
What about the data input? Similar shape?
Because that would be my guess, your data input might not be of right shape.
i think he has not mentioned about shape
So I would take it as an assumption here from the variable imageDimension.
Because it's looking for (32, 32, 3).
I would set that to be your shape.
Sure.
https://paste.pythondiscord.com/usuxakuxis.coffeescript my code here ,please check line 30 and line 37 @heady hatch
From what I'm seeing on line 89, you're checking
assert (x_train.shape[1:] == (imageDimensions)), "the dimension of training images are wrong"
right?
yes
So I think if I'm following the code correctly,
you're checking if (32, 3) == (32, 32, 3)?
Because images are of the shape, 32 x 32 x 3?
I guess I'm wondering how come you're checking the shape index 1 and on instead of just x_train.shape == imageDimensions?
in tutorial 6:06 please check line 66
i am following the same as shown in tutorial @heady hatch
Right right, I would ignore the tutorial real quick.
Try changing x_train.shape == imageDimensions real quick.
And let me know how that goes.
Oh wait.
probably something like
x_train[0].shape == imageDimensions
OH
Wait I get it now.
print your x_train.shape before the assert.
Sorry I'm not thinking too clearly.
So before the line
assert ..., add a print(x_train.shape)
```
(378,)
```
this is my console output
```python
total classs detected : 24
noofClasses: 24
importing classes...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
data shapes
train (378,) (378,)
validation (95,) (95,)
test (119,) (119,)
(378,)
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 90, in <module>
assert (x_train.shape == (imageDimensions)), "the dimension of training images are wrong"
AssertionError: the dimension of training images are wrong```
Right right, sorry about the x_train.shape == imageDimensions.
I think there's something wrong with your images.
these lines.
count = 0
images = []
classNo = []
#mylist = os.listdir(path)
p = pl.Path(path)
mylist = [x for x in p.iterdir() if x.is_dir()]
print("total classs detected :", len(mylist))
noofClasses = len(mylist)
print("noofClasses:", noofClasses)
print("importing classes...")
for x in range(0, len(mylist)):
    myPicList = os.listdir(path+"/"+str(count))
    for y in myPicList:
        curImg = cv2.imread(path+"/"+str(count)+y)
        images.append(curImg)
        classNo.append(count)
    print(count, end = " ")
    count+=1
print(" ")
images = np.array(images)
classNo = np.array(classNo)
np.array(images) gives None
After images = np.array(images)
can you print images[0]?
and double check if that's what you expect it to be.
None
@heady hatch
Yea.
I think you're not reading in the images properly.
I'm not familiar enough with the cv library, but I can help you debug.
@heady hatch ok
So on line 59.
after curImg = cv2.imread(path+"/"+str(count)+y)
add a print curImg.
and then add a break.
on line 57.
after myPicList = os.listdir(path+"/"+str(count)) add a print myPicList and then add a break.
add a print curImg.
@heady hatch
```python
None 0 None 1 None 2 None 3 None 4 None 5 None 6 None 7 None 8 None 9 None 10 None 11 None 12 None 13 None 14 None 15 None 16 None 17 None 18 None 19 None 20 None 21 None 22 None 23
```
Be sure to add a break.
so something like
print(curImg)
break
So double check
is this where your images are?
path = r'E://demo3//india'
@heady hatch yes
okay now, since you've imported os.
You can try something like
os.path.isfile(path_to_image)
and check to see if you have the right path.
You can print it anywhere.
@heady hatch ok let me try...
```
os.path.isfile(r'E://demo3//india//0//a.jpg')
True
```
@heady hatch
Okay okay cool.
i think the input shape or dimensions is not proper i guess
I think it's your images.
Because you saw up there that it's printing None.
I think curImg = cv2.imread(path+"/"+str(count)+y) is incorrect.
okay, means my images are not in correct format?
Hmm.
Or maybe you're not reading them correctly.
like
I guess print path+"/"+str(count)+y
to make sure it's path to actual image.
Or actually no I think you might have a point, sorry for jumping the gun.
I think either path to images isn't correct
or something wrong with the images.
Since cv2.imread isn't reading them properly.
my images consists of rotated images also
but they're still in readable formats, right?
Oh wait.
I think
I might have an idea.
path+"/"+str(count)+y isn't this supposed to be path + "/" + str(count) + "//" + y?
Seeing how the files live in 'E://demo3//india//0//a.jpg'.
Double check if your path is correct.
It might also be path + "//" + str(count) + "//" + y
let me check path + "/" + str(count) + "//" + y this?
Yea
or even check path+"/"+str(count)+y.
Like add a line above curImg,
print(path+"/"+str(count)+y)
See if that's what you think it is.
```
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 71, in <module>
print(images[0])
IndexError: index 0 is out of bounds for axis 0 with size 0```
Can some one please help me with this:
```
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 78, in <module>
x_train, x_test, y_train, y_test = train_test_split(images, classNo, test_size = testRatio)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2122, in train_test_split
default_test_size=0.25)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1805, in _validate_shuffle_split
train_size)
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.``` @heady hatch
I don't know what you want me to say @mild topaz . hahaha
Hey @sage palm , what do you need help with?
Are you allowed to use libraries?
@heady hatch sorry, i got confused can u explain again
Thanks for answering! Yes, I'm allowed to use numpy
```
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 59, in <module>
print(path+"/"+str(count)+y)
NameError: name 'y' is not defined``` @heady hatch
@mild topaz
Sorry looking at your code, your line 58 and 59 are these.
for y in myPicList:
curImg = cv2.imread(path+"/"+str(count)+y)
So instead of that
add a print statement there.
for y in myPicList:
print(path+"/"+str(count)+y)
curImg = cv2.imread(path+"/"+str(count)+y)
```
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 78, in <module>
x_train, x_test, y_train, y_test = train_test_split(images, classNo, test_size = testRatio)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2122, in train_test_split
default_test_size=0.25)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1805, in _validate_shuffle_split
train_size)
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.``` @heady hatch
@sage palm Hmm I have some idea but what do you have in mind? Sorry I was typing a bunch of stuff but realized I should asked you first.
Hey @mild topaz I really think you should check the path. Because I think your data is just filled with Nones.
```
total classs detected : 24
noofClasses: 24
importing classes...
['a.jpg', 'aa.jpg', 'aaa.jpg', 'aaaa.jpg', 'b.jpg', 'bb.jpg', 'bbb.jpg', 'bbbb.jpg', 'c1.jpg', 'cc.jpg', 'ccc.jpg', 'ccccc.jpg', 'download (7).jpg', 'download.jpg', 'ges.jpg', 'images (1).jpg', 'rfg.jpg', 't.jpg', 'tt.jpg', 'ttt.jpg', 'ttttt.jpg', 'z.jpg', 'z1.jpg', 'zz.jpg', 'zzzzz.jpg']
E://demo3//india///0a.jpg
0 ['a.jpg', 'aa.jpg', 'aaa.jpg', 'aaaa.jpg', 'bbb.jpg', 'bbbb.jpg', 'bbbbb.jpg', 'cdfg.jpg', 'cfd.jpg', 'download (3).jpg', 'download (4).jpg', 'download (5).jpg', 'download (6).jpg', 'download (7).jpg', 'g.jpg', 'gg.jpg', 'ggg.jpg', 'images (1).jpg', 'qqq.jpg', 'r.jpg', 'rr.jpg', 'rrr.jpg', 's.jpg', 'ss.jpg', 'sss.jpg', 'ssss.jpg', 'z.jpg', 'zz.jpg', 'zzz.jpg']
E://demo3//india///1a.jpg
1``` @heady hatch
Okay yea. I think it's your path.
You noticed E://demo3//india///0a.jpg?
But your files live in E://demo3//india///0//a.jpg?
You're missing //.
@heady hatch a.jpg is the name of my image file
and neither is 1a.jpg, right?
@heady hatch correct
So I guess I'm wondering, how come you're trying to read those files if they don't exist?
see this directory of images @heady hatch
Yes.
But you see how
You're printing out
'download (7).jpg', 'g.jpg', 'gg.jpg', 'ggg.jpg', 'images (1).jpg', 'qqq.jpg', 'r.jpg', 'rr.jpg', 'rrr.jpg', 's.jpg', 'ss.jpg', 'sss.jpg', 'ssss.jpg', 'z.jpg', 'zz.jpg', 'zzz.jpg']
E://demo3//india///1a.jpg
1
@heady hatch No problem, I will wait. This is a problem sheet which I'm working on for the upcoming exam. There are 6 pure math problems which I have done, but the ones in python I simply can't figure it out on my own. I'm not good at Python and the course has been a bit of a nightmare, so we have not learned what we should.
@heady hatch OH i see
but why am i getting this?
@sage palm I can help you with the python but my linear algebra is a bit rusty. hahaha
I'm reading up on the converging series right now. But I would love to hear you breaking down the math portion if you can.
i am getting this with every folder
Yes.
Because you wrote curImg = cv2.imread(path+"/"+str(count)+y)
Your path is wrong.
I think it's supposed to be
curImg = cv2.imread(path+"/"+str(count)+"//"+y)
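One way to avoid the missing-separator problem altogether is to build paths with `os.path.join` instead of string concatenation. A minimal sketch, using a temporary directory with made-up class folders `0` and `1` rather than the real `E://demo3//india` tree; `list_image_paths` is a hypothetical helper, not part of the tutorial. Note that `cv2.imread` returns `None` instead of raising when it cannot read a path, which is why the bad paths went unnoticed:

```python
import os
import tempfile

def list_image_paths(root, n_classes):
    """Return (class_index, file_path) pairs for numbered class folders under root."""
    paths = []
    for count in range(n_classes):
        class_dir = os.path.join(root, str(count))
        for name in sorted(os.listdir(class_dir)):
            # os.path.join inserts the separator between folder and file name.
            paths.append((count, os.path.join(class_dir, name)))
    return paths

with tempfile.TemporaryDirectory() as root:
    # Build a tiny fake dataset: root/0/a.jpg and root/1/a.jpg (empty files).
    for count in range(2):
        os.makedirs(os.path.join(root, str(count)))
        open(os.path.join(root, str(count), "a.jpg"), "wb").close()

    good = list_image_paths(root, 2)
    good_exists = all(os.path.isfile(p) for _, p in good)

    # The concatenation from the chat drops the separator: ".../0a.jpg".
    bad = root + "/" + str(0) + "a.jpg"
    bad_exists = os.path.isfile(bad)
```

Checking `os.path.isfile` on each path (or checking `curImg is None` right after reading) catches this kind of mistake early.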
okay, let me check
see this way i am getting here```python
14 ['1021.jpg', '123.jpg', '152.jpg', '52.jpg', '7856.jpg', 'a.jpg', 'aa.jpg', 'aaa.jpg', 'b.jpg', 'bb.jpg', 'bbb.jpg', 'c.jpg', 'cc.jpg', 'ccc.jpg', 'd.jpg', 'dd.jpg', 'ddd.jpg', 'e.jpg', 'ee.jpg', 'eee.jpg', 'images (1).jpg', 'images (2).jpg', 'images (4).jpg', 'images.jpg', 'pn_dl2.jpg', 'pn_dl9.jpg', 'x.jpg', 'xx.jpg', 'xxx.jpg']
E://demo3//india///151021.jpg
[[[184 181 150]
[184 181 150]
[161 158 127]
...
[198 201 199]
[ 56 61 62]
[ 0 0 3]]
[[162 165 139]
[190 193 168]
[175 178 153]
...```
Congratulations.
```
[[255 255 255]
[255 255 255]
[255 255 255]
...
[255 255 255]
[255 255 255]
[255 255 255]]]
23
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 78, in <module>
x_train, x_test, y_train, y_test = train_test_split(images, classNo, test_size = testRatio)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2122, in train_test_split
default_test_size=0.25)
File "C:\Users\Admin\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1805, in _validate_shuffle_split
train_size)
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.``` @heady hatch
@heady hatch i will! But please give me a minute or two, i’m on my phone because my mac has frozen. Sorry about that
No worries, I can wait.
what is wrong in my case , can u plz help me to understand? @heady hatch
i have changed this only
```python
for y in myPicList:
    print(path+"/"+str(count)+y)
    curImg = cv2.imread(path+"/"+str(count)+"//"+y)
```
@heady hatch can i share my code again?
Sure.
@mild topaz Oh remove line 64 and 65.
print(curImg)
break
see this
```
E://demo3//india///23download (16).jpg
E://demo3//india///23download (18).jpg
E://demo3//india///23download (19).jpg
E://demo3//india///23download (21).jpg
E://demo3//india///23download.jpg
E://demo3//india///23gfd.jpg
E://demo3//india///23gh.jpg
E://demo3//india///23images (10).jpg
E://demo3//india///23images (11).jpg
E://demo3//india///23images (12).jpg
E://demo3//india///23images.jpg
E://demo3//india///23iu.jpg
E://demo3//india///23ry.jpg
E://demo3//india///23uiop.jpg
E://demo3//india///23y.jpg
23
data shapes
train (377,) (377,)
validation (95,) (95,)
test (119,) (119,)
(377,)
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 96, in <module>
assert (x_train.shape[1:] == (imageDimensions)), "the dimension of training images are wrong"
AssertionError: the dimension of training images are wrong``` @heady hatch
before that can you print x_train[0]?
on which line?
Probably on line 80 or something.
after
x_train, x_test, y_train, y_test = train_test_split(images, classNo, test_size = testRatio)
x_train, x_validation, y_train, y_validation = train_test_split(x_train, y_train , test_size = validationRatio)
[[232 245 253]
[232 245 253]
[232 245 253]
...
[220 230 237]
[221 231 241]
[222 231 245]]]
data shapes
train (377,) (377,)
validation (95,) (95,)
test (119,) (119,)
(377,)
Traceback (most recent call last):
File "E:\demo3\image_classification.py", line 96, in <module>
assert (x_train.shape[1:] == (imageDimensions)), "the dimension of training images are wrong"
AssertionError: the dimension of training images are wrong``` @heady hatch
@heady hatch As I see there is no LaTeX bot on the server, but I can probably explain it without-
So
exp(xA) = sum over n = 0, 1, 2, ... of x^n/n!·A^n
is how you define the exponential function for a matrix.
It is very similar to the one for numbers, exp(kx), where k is a real number. In our definition of the exponential function this constant is a square matrix! Let us say an m x m matrix.
Taking a square matrix to the power of n means: A^n = A · A · ... ·A (n times)
Let's take an example
A=[[1,0],[0,1]]. So this is the Identity matrix of dimension 2 x 2. And A^3 = A · A · A.
The dot is just a symbol for a matrix product.
So let us look at the rest of the term: x^n/n!·A^n.
x is the variable, it can be 0 or negative. x^n just means x·...·x (n times)
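A small sketch of the series described above, in plain Python with matrices as lists of rows. It simply adds up the first `terms` terms of x^n/n!·A^n; the function names and the cut-off of 30 terms are assumptions for illustration, not a method taken from this conversation. On the convergence question: each term is bounded in norm by (|x|·‖A‖)^n/n!, which is a term of the ordinary exponential series, so the sum converges for every square matrix.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, x=1.0, terms=30):
    """Approximate exp(x*A) by the first `terms` terms of sum x^n/n! * A^n."""
    n = len(a)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0 = I
    result = [[0.0] * n for _ in range(n)]
    coeff = 1.0  # x^0 / 0!
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                result[i][j] += coeff * power[i][j]
        power = mat_mul(power, a)   # A^(k+1)
        coeff *= x / (k + 1)        # x^(k+1) / (k+1)!
    return result

identity = [[1.0, 0.0], [0.0, 1.0]]
E = mat_exp(identity, x=2.0)
print(E)  # diagonal entries close to e**2, off-diagonal entries 0
```

For the identity matrix every power A^n is the identity again, so the result is just the scalar series for e^x on the diagonal, which makes it a handy sanity check.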
@heady hatch are you there? 🙂
I am, I'm also worried about time since I will have to sleep soon.
I think I kinda understood everything so far.
I guess I was wondering, this is an infinite series that converges.
How do you calculate the convergence?
Alright, then just go to bed 🙂 sleep is important. Can we discuss it later, when you have time?
I would love that!
@mild topaz , we will have to deal with your issue tomorrow as well.
Good night you both.
if u dont mind can u give some hint to me so i can try something @heady hatch
Good night! (just woke up. lol)
Alrighty. @mild topaz
So I'm not sure why your images have the shape of (377,).
Because if they were an ndarray, they should have multi-dimensions.
I would look into your images and see how you can make them (32, 32, 3).
so do stuff like print(images[0]) and stuff and try to track it down and see if they're what you expect them to be.
377 is the number of training images @heady hatch
Hey, I implemented a GAN and will that be considered as a final year project?
@heady hatch I have found a very nice method of implementing our problem in python. I will tell you about it when you wake up. I can also use one of the voice channels if you like.
@royal thunder Are you confused about the sudden cut in the line plot?
It's a zoomed plot so you can imagine them connecting at infinite or something.
yeah
print("Hello World")
can I ask questions related to tensorflow here?
ye
@royal thunder overfitting is where the algorithm trying to learn patterns in the data becomes too specialised to the data its training on, as you can see in the picture the predictions are very accurate on those data points but if you consider a point halfway between the 2 right most points, you can see that the general trend is a straight line but the line the algorithm has generated for that data has the prediction for that input far off the charts, thats where it can be sometimes better to use a simpler algorithm / simpler structure because a simple straight line can describe the data there quite well and as the text says, the predictions from a linear model are more likely to be accurate on new data than that line which has horribly overfitted to the training data
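To make the point above concrete, here is a sketch on made-up data: a polynomial forced through every training point versus a plain least-squares line. The data, the helper names and the held-out point are assumptions chosen for illustration; the picture being discussed is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a function x -> prediction."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (zero training error)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

# Made-up data: the trend is y = x, with alternating noise of +/- 0.5.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
line, wiggly = fit_line(xs, ys), fit_interpolant(xs, ys)

x_new, y_new = 4.5, 4.5  # unseen point between the two right-most samples
print("line error:", abs(line(x_new) - y_new))
print("interpolant error:", abs(wiggly(x_new) - y_new))
```

The interpolant is perfect on the training points but overshoots between them, while the simpler line misses every training point slightly and still lands much closer on the new one.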
thanks @spark stag
anyone know how to plot 3d vector fields without using quiver in matplotlib?
@marble bison You can try other libraries like plotly.
https://plotly.com/python/3d-charts/
See if they fit your requirements.
Also check the 3D plot section of matplotlib. Everything that is possible is documented.
https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html
Looking at this graph of outliers for my timeseries, can we consider the series normally distributed ?
Is this an accurate description of logistic regression?
A model that projects each object into an n-dimensional space and solves for the n-dimensional plane that best separates objects from different classes.
I guess it should be (n-1)dimensional
looks like I may have inadvertently described SVM instead.
The above definition is of any general linear classifier.
Both SVM and Logistic regression can separate different classes that are in N-Dimensional Hyperspace using an (N-1) - Dimensional HyperPlane.
@serene scaffold
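A tiny sketch of what "separating with a hyperplane" means in code, assuming hand-picked weights rather than trained ones. The names `score` and `predict_proba` are illustrative. In 2-D the boundary w·x + b = 0 is a line, which is the (N-1)-dimensional hyperplane for N = 2; logistic regression additionally passes the score through a sigmoid to get a probability.

```python
import math

def score(w, b, x):
    """Linear score w.x + b; it is zero exactly on the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_proba(w, b, x):
    """Logistic regression turns the score into a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-score(w, b, x)))

# Hypothetical weights for 2-D inputs: the boundary is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0
print(predict_proba(w, b, [0.5, 0.5]))        # a point on the boundary
print(predict_proba(w, b, [2.0, 2.0]) > 0.5)  # a point on the positive side
print(predict_proba(w, b, [0.0, 0.0]) < 0.5)  # a point on the negative side
```

An SVM with a linear kernel uses the same kind of score; the two methods differ in how the weights are chosen, not in the shape of the boundary.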
I see
I have this list of stocks I wrote down years ago. I just wrote down the ticker and if it was up or down. How would you use this? I want to display the stocks on a grid of how many I found per day, how often a “same” stock was written down when.... also the actual prices on a graph. What will I need?
@lapis sequoia hey yeah thanks, ill have look at plotly. its just when i try to make a 3d vector field function with quiver it doesn't like having the arrow directions as an input
Hey @sage palm , I would love to hear it. Voice channel will definitely work as well, let me know when you're free!
Good morning to you too.
Were you able to solve the matrix exponential problem? Or are you in the debugging stage?
No, unfortunately. I have only solved the proof-based math question. I do not know python, so I need some help to get my thoughts implemented.
😄
Oh man, I'm totally excited to help you implement it.
thanks!
I'm kinda slow mentally because of lack of sleep, so bear with me
Which channel to join?
Give me about 30 minutes, I'm going to get ready and be back!
yes
@sage palm Alrighty, I'm back!
👍
Yes indeed.
I'm not super familiar with Discord so I'll be figuring things out with you.
Lol, I was about to say the same 😄
But I think a private discord call will be easier!
do you mind?
Nope not at all.
What do I do if I want to apply an undersampling/oversampling technique on a different target column and then train the model with a different column as the label?
All the imbalanced-learn methods I have seen are applied after the training-test split, so at that point y is already defined.
Basically, I have two columns - cancer yes/no & gender M/F. I want to sample the dataset so that there are equal instances of M and F, and then proceed with my classification problem: cancer yes or no (irrespective of the no. of instances).
do you think I need to scale my data or just use it as it is ?
@bold olive It's not a good practice to apply under-sampling or oversampling before the train and test split. You should first do a random split and then resample only the training part to create a balanced training set.
I understand but then how do you balance the dataset according to a different label when y is different in the split?
@lapis sequoia Scaling helps in faster convergence to the optimal result. So you should do it almost all the time.
@lapis sequoia Robust scaling here is correct right ?
@bold olive I'm not able to understand your statement can you rephrase it or describe it more.
Do you want to know how to do the sampling?
No.
Basically, I have two columns - cancer yes/no & gender M/F. I want to sample the dataset so that there are equal instances of M and F, and then proceed with my classification problem: cancer yes or no (irrespective of the no. of instances).
I know how to sample using the existing target label, but how do I sample it according to a different label in the dataset while being in the same classification problem?
@lapis sequoia how to know which one to use between robust and standard ? as both they look the same
@lapis sequoia Both are good choice and you should get very similar result from them. You can choose anyone.
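A sketch of where the two scalers agree and where they part ways, written with the standard library so the arithmetic is visible. It mirrors what scikit-learn's `StandardScaler` (mean and standard deviation) and `RobustScaler` (median and interquartile range, with default settings) compute per column; the sample values are made up.

```python
import statistics

def standard_scale(values):
    """(x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """(x - median) / interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dirty = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # same data with one extreme outlier

print(standard_scale(clean)[:3], standard_scale(dirty)[:3])
print(robust_scale(clean)[:3], robust_scale(dirty)[:3])
```

Without strong outliers the two give similar pictures, which matches the advice above. With an outlier, the mean and standard deviation shift and squeeze the ordinary points together, while the median and quartiles barely move, so robust scaling is the safer pick when the data has extreme values.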
@lapis sequoia thanks 🙂
@bold olive One option will be to increase the weight such that Male and Female have same count in dataset.
Increase the weight where exactly?
are you using scikit-learn ?
Yes.
```python
X = dataa.iloc[:, 10:26]
y = dataa.iloc[:, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
```
y is the cancer yes/no column, there is one more gender column and I want to balance the whole dataset according to that.
X = [Gender, Cancer]
stratify=X[Gender],
test_size=0.25)```
try this once and tell me if this works.
Sure, hang on.
I don't think it will work.
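import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
# Put the label back beside the features so both are resampled together
# (assumes X_train has a "Gender" column and y_train is the cancer label)
X_temp = pd.concat([X_train, y_train.rename("Cancer")], axis=1)
rus = RandomUnderSampler(random_state=42)
# Gender is the resampling target here; the returned gender labels are not needed
X_temp_rus, _ = rus.fit_resample(X_temp, X_temp["Gender"])
X_train_new = X_temp_rus.drop(columns="Cancer")
Y_train_new = X_temp_rus["Cancer"]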
You'll have to make the X[Gender] column as Y and concat the label into X_temp.
then do the sampling and extract your X_train and Y_train.
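The idea in the last few messages, balance on one column and train on another, can be shown without any library. This is a sketch on made-up rows with a hypothetical helper `undersample_by`; it is applied to the training split only, and the test split is left untouched.

```python
import random
from collections import defaultdict

def undersample_by(rows, key, seed=42):
    """Randomly drop rows so every value of `key` appears equally often."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    smallest = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, smallest))
    rng.shuffle(balanced)
    return balanced

# Made-up training rows: "gender" is only used for balancing,
# "cancer" stays the label the classifier is trained on.
train_rows = (
    [{"gender": "M", "cancer": i % 2, "feature": i} for i in range(8)]
    + [{"gender": "F", "cancer": int(i % 3 == 0), "feature": i} for i in range(3)]
)
balanced = undersample_by(train_rows, key="gender")
X_train_new = [[r["feature"]] for r in balanced]
y_train_new = [r["cancer"] for r in balanced]
print(len(balanced), sum(r["gender"] == "M" for r in balanced))
```

With a pandas DataFrame the same effect should be reachable with `groupby` on the gender column followed by `sample(n=...)` using the size of the smaller group, after which the cancer column is split off as the label as usual.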
Is this making sense ?
@bold olive
So in this way the sampling is done according to the gender but target label in the classification is still cancer? @lapis sequoia
hello can someone help me to solve this issue ? idk if the csv i downloaded is broken or my utf-8 encoding is not working ?
it is supposed to be "système"
i have the same for this where it is supposed to be a " ' "
Not working out as I want it to unfortunately.
ok i found a solution using a Western Europe ISO encoding
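A short sketch of why that fix works, assuming the file was saved in a Western European single-byte encoding such as ISO-8859-1. The bytes for "è" in that encoding are not valid UTF-8, so a UTF-8 reader either fails or substitutes a replacement character.

```python
raw = "système".encode("iso-8859-1")  # bytes as a Western European file stores them

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)

print(raw.decode("utf-8", errors="replace"))  # the accented letter is lost
print(raw.decode("iso-8859-1"))               # decodes cleanly
```

When reading the CSV with pandas, the matching option is the `encoding` argument of `read_csv`, for example `encoding="iso-8859-1"`.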
somewhat random question, but I had a thought. is a linter basically the same as a compiler stopping half way, or are they completely different beasts? I only have a very superficial understanding of compilers, but it seems to me a linter will need to do the same work of tokenization and building some kind of syntax tree.
there was something I noticed recently where the kotlin linter in intellij would warn me that while I did a check to see if something was null, the variable is mutable and thus the value could change to null at any time. I don't see how it could know that without doing all that stuff and working through the program in its entirety
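That intuition is roughly right: a linter usually does the front half of a compiler's work (tokenizing, parsing into a syntax tree, sometimes type or flow analysis) and then reports findings instead of generating code. As a sketch of the idea, here is a toy lint rule built on Python's own `ast` module. The rule and the name `find_none_comparisons` are made up for illustration and say nothing about how the Kotlin tooling is implemented.

```python
import ast

def find_none_comparisons(source):
    """Return line numbers where code compares to None with == or !=."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            uses_eq = any(isinstance(op, (ast.Eq, ast.NotEq)) for op in node.ops)
            has_none = any(
                isinstance(c, ast.Constant) and c.value is None
                for c in [node.left, *node.comparators]
            )
            if uses_eq and has_none:
                hits.append(node.lineno)
    return sorted(hits)

sample = "x = 1\nif x == None:\n    pass\nif x is None:\n    pass\n"
print(find_none_comparisons(sample))
```

The check never runs the sample program; it only walks the tree the parser produced, which is the same tree a compiler would go on to translate.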
@rustic fern dutch
Out of curiosity, how long does it take to deploy a machine learning model for you guys?
@marble jasper So your models are deployed automatically? Is what I am hearing?
our pipelines that automatically ingest data for unsupervised learning, pretty much do it automatically, they're run in Airflow and I dunno, process takes a few minutes to push the models to the models store API, and update some database values. Next time something tries to use a model, the endpoint reads the latest model version from database and realises it doesn't have the model cached, so pulls it, and now it's in the models cache to be used for this and new requests.
for other kinds of model, someone has to compress it, and upload it to the models bucket, and update the path in the project that's using it so it downloads the model. Tag a build, CI takes care of deploying it
sorry, pasting message from earlier
depends on the infrastructure of the project i'm working on
at my current work, about 6 months
yes, I assume you're talking after all the training is done and there's a model ready to go into production
we have a bunch of stuff on Airflow that's ingesting data and running unsupervised learning to create new models, and then uploading it to our models store
How helpful would a feature like that be? Beneficial or useless?
you're describing automl @lapis sequoia
so that pipeline is pretty much zero-touch. we have an internal model store API that you can post models to, and it tracks the latest version of a model for a given task; the Airflow ETL is pushing models in there and bumping the version number. Production systems that use the model query for the latest model, and checks in their local cache for the model, and pulls it if it's newer
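A sketch of the "check the latest version, pull only if newer" pattern described above. Everything here is a stand-in: `ModelStore` and `CachedModelClient` are hypothetical classes that keep data in memory, not the internal API being discussed.

```python
class ModelStore:
    """Stand-in for a model store API (hypothetical interface)."""
    def __init__(self):
        self._models = {}  # task -> (version, model bytes)

    def publish(self, task, blob):
        version = self._models.get(task, (0, None))[0] + 1
        self._models[task] = (version, blob)
        return version

    def latest_version(self, task):
        return self._models[task][0]

    def download(self, task):
        return self._models[task]

class CachedModelClient:
    """Keeps a local copy and only pulls when the store has a newer version."""
    def __init__(self, store):
        self.store = store
        self.cache = {}  # task -> (version, model bytes)
        self.downloads = 0

    def get(self, task):
        latest = self.store.latest_version(task)
        cached = self.cache.get(task)
        if cached is None or cached[0] < latest:
            self.cache[task] = self.store.download(task)
            self.downloads += 1
        return self.cache[task][1]

store = ModelStore()
store.publish("clustering", b"model-v1")
client = CachedModelClient(store)
client.get("clustering")
client.get("clustering")                  # second call is served from the cache
store.publish("clustering", b"model-v2")  # the pipeline pushes a new model
print(client.get("clustering"), client.downloads)
```

The version lookup is cheap compared with downloading a model, which is why the client can afford to do it on every request.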
automl is pretty good
For side projects, it is a large platform to incorporate into
well, it's a google cloud product, they're upselling their entire cloud
What could make autoML better, if there was anything that could improve it?
infrastructure
automl as it is, at least from my point of view, is perfectly suited for companies without a data science department, that want to deploy simple-ish models, in such cases, having it managed for you on some cloud platform is a perfect fit imo
That is a good point
But doesn't it take forever for data science teams to deploy models?
if you're building the entire platform from scratch, i think using NAS is about the same as developing a "traditional" model
in a perfect world, all you'd have to do is to provide the saved model, define endpoints, and you're good
probably they don't automate that part because the time cost of deploying the model is minuscule compared to the time it takes to do all the other stuff
I thought it usually takes weeks to months to deploy a single model
And then cleaning beforehand
depends what you mean by "deploy"
yeah, i think any sufficiently large company with a good devops team with ml engineers can do that part in a small proportion of the total time
I think I phrased this poorly
I am not an ML engineer haha I am new to data science
i have a friend working in a data science consulting company, and they can deploy models for clients in less than a day after the model is ready
because they invest a lot in devops
the company i work at, on the other hand, does not have any dev ops dpt because "we don't do development", and it takes literal months to get it ready on a project
yeah, ours is probably about 20 minutes, depending on how quickly you can convince someone to review the PR for the model version change, due to PR review policy. Assuming the model has been vetted already
generally we just quickly patch it together and leave it as is
the full release process depends on exactly what, but usually:
- upload some files to bucket
- change some docker files or env vars
- commit and tag to trigger a CI build of it
- go and edit the version you want in production in a different repo, and PR that
- wait for someone to accept PR
- someone has to run that deploy because that's not automatic (but could be, just a human-in-the-loop thing)
Ohh
this is assuming everyone agrees that the model is ready
I'm not sure what problem you're solving, because it sounds like to use your system it would require someone to hook up an API
sure
Also this is very helpful
I think for some companies that have a separate team for data pipelines and devops, this is probably not that useful, because our model deployment process isn't that different from other CD tasks (there's just an extra big model file somewhere to handle). Maybe for smaller teams like Igneous mentioned, who don't have dedicated devops?
Again I am not a data engineer or data scientists
or data analyst
@marble jasper What is something that would speed up the process within your work? What takes the longest?
Also, I should have asked this before, but are you a data scientist?
or data engineer
no, I lean more on the devops and backend side, but I manage some ML engineers
Ohh I see
my main gripe with our systems is Apache Airflow kind of sucks
Also, I am also new to Discord haha I didnt know I can jump into communities just recently
it just doesn't FEEL like a modern app, what with the weird limitations like not being able to schedule two tasks at the same time, etc
Do many companies use Apache Airflow?
probably quite a lot
I mean, AirBnB use it
most likely. it would be weird if they didn't
Ah I see, if you don't mind, and I understand if youre not comfortable with sharing this, but which company do you work at?
Idk if on discord, people share those details or that is not really a common
Could you elaborate on apache airflow process?
Airflow runs pipelines. Our pipeline stages are mostly either docker images running in kubernetes, or calls to internal APIs. We use Airflow for some periodic data collection, and periodic generation of some models that use unsupervised learning
some runs have external triggers, most run on a timer (and pull available data from an ingestion API that collects data to be processed)
they're not cron jobs, but those processes are on a timer
Ohh I see
Airflow uses python
When do you guys agree that a model is ready?
you define your pipeline in python. for the tasks on the timer, it's part of your DAG definition. Those python DAGs live in a repository. When you push to the repository, CI/CD takes it and inserts it into a folder that Airflow watches, and Airflow automatically loads this
so the process of defining a new pipeline is you just create a new python file in the DAGs repo, and commit it and tag for CI/CD build
So everything is mostly being automated by airflow
unsupervised learning, yes
there's a data ingestion API that handles collecting data and making clean formatted data available to some of the pipelines
it's slightly decoupled, because there are Airflow processes that perform the data collection, pushing the data to the ingestion API, and other pipelines that get data from the API. This is because Airflow isn't responsible for raw data storage, and also we get a data stream from elsewhere as well
but yes, this is unsupervised learning on ML that's already been defined. Everything else - exploring new ML algorithms and anything that requires supervised learning, that's all offline
someone has to design the experiments, do the labelling, etc. etc. that's all desktop stuff
It was really not a question but rather a comment. That is insanely impressive
But this is really helpful information
A lot of people don't really help out this much with the process of how it flows or put in time to write it out. I truly appreciate it
I hope I didn't sound weird
idk where i
lol
idk where i'd put this but i made a calculator function
just found it
Also, is there any way that I can reach out @marble jasper whenever ? That was really helpful
What’s the better alternative to web scraping?
@marble jasper Also, there is no devops team that detects data drif
drift
detecting data drift
I wrote down stocks that caught my eye years ago. As a test, I want to display the price per day and some more info. What type of way would you display stocks like this? There are duplicates
@rustic apex it depends.
if the data is publicly available through an API, that usually is better
@velvet thorn how can I display the gain/loss difference from my list? I have a lot of stock tickers listed
@rustic apex I don't understand the question
@velvet thorn it’s ticker names for stocks. How can I cycle through the list to show the +- of each to now?
@velvet thorn I want to display a graph/line per stock to see how the trend has been since I wrote them down.
but where are the numbers
@velvet thorn I didn’t write them down, just the date. That’s what I want to have added to them
okay, what does each row represent?
Each row is a day
I'm pretty sure each COLUMN is a day?
what an odd sheet
@velvet thorn yes
@velvet thorn I guess not really anything, it’s a list of stocks by day. The day is at the top
for real.. columns for days, u expect then rows to be symbols? but it isnt
@rustic apex so why does the day matter?
does it matter at all?
or do you just want to get all the ticker symbols in that DataFrame
@velvet thorn it’s when I found a stock, and want to know the difference between when I wrote it down again
okay
so
for each stock
you want the difference between
the day it was entered (from the column)
and the present
and I'm assuming
you will get the prices
from some external API?
@velvet thorn yes
okay
got it
no need to mention me if you're not replying to a specific message btw
@velvet thorn ok 👍 there’s sticks I wrote down at +100%, that then shot up even more at +800%, so I want to see how this list still holds up.
Stocks.... not sticks
hm.
a CSV is a bit of a bad choice for this
okay what I would do
is apply some data transformation
so you have a 2-column DataFrame
ticker and date
then you can iterate through it and call an API
to get those prices
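A sketch of that plan in plain Python, assuming the sheet has one column per day with tickers listed underneath. The tickers and dates are made up, and `fetch_close_price` is a placeholder for whichever price API ends up being used; it returns `None` here so the sketch runs without network access.

```python
from datetime import date

# Hypothetical sheet: one column per day, tickers underneath, with empty cells.
sheet = {
    "2020-03-02": ["AAA", "BBB", ""],
    "2020-03-03": ["BBB", "CCC", "DDD"],
}

# Long format: one (ticker, date) row per sighting, skipping empty cells.
sightings = [
    (ticker, date.fromisoformat(day))
    for day, tickers in sheet.items()
    for ticker in tickers
    if ticker
]

def fetch_close_price(ticker, day):
    """Placeholder for a real price API call; returns None in this sketch."""
    return None

for ticker, day in sightings:
    then = fetch_close_price(ticker, day)
    now = fetch_close_price(ticker, date.today())
    change = None if then is None or now is None else (now - then) / then
    print(ticker, day.isoformat(), change)
```

With pandas, the reshaping step corresponds to `DataFrame.melt` followed by dropping the empty rows. A ticker that was written down on several days simply appears in several rows, which keeps the duplicates instead of hiding them.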
Should I use web scraping from yahoo finance?
@rustic apex that's a separate question
one thing at a time
Or should I just go by day?
@rustic apex what do you mean by that?
isn't that implied here
ticker and date
@velvet thorn this
that would be up to you
you'd also need to think about what you're plotting in the first place
simple price?
some sort of moving average?
comparison to an index?
etc.
Ok, all of those 👍, I’ve seen tutorials to predict a stock, I want to try that later as well. There’s been a lot of stocks I’ve found at around 50¢/1$, and they ended up being $5, $10, $25
yup
so like
you have a lot on your plate right now
I suggest you make a list of things you want to do
and work on them bit by bit
When I’ve watched tutorials and also some samples on Kaggle, it doesn't show an import from any API or url, it just has an analysis of the data
okay
what are you getting at though
I don't really understand
where are you going to get the prices then
Well, yes I want the prices 👍, but which api or web scraping is best?
What are some beginner projects for open cv that people have done
Are there any for just simply video classification ones or integrations within scale.ai
@rustic apex depends on what you want.
I suggest you do some original research
@whole roost you can ask about matplotlib here
anyway, to answer your question, a for loop would be appropriate
@snow flax I made a bunch of a simple filters, like negative, pixelate, posterize, and just a whole bunch of aliases to cv2 built ins, like edge detection, blur, ...
anyone ?
i am currently learning machine learning
from hands on machine learning
i have this huge doubt anyway
either to learn the math first and then continue on, or to learn machine learning and the math for it in parallel?
What do I do if I want to apply an undersampling/oversampling technique on a different target column and then train the model with a different column as the label?
All the imbalanced-learn methods I have seen are applied after the training-test split, so at that point y is already defined.
Basically, I have two columns - cancer yes/no & gender M/F. I want to sample the dataset so that there are equal instances of M and F, and then proceed with my classification problem: cancer yes or no (irrespective of the no. of instances).
Currently, I have this:
X = dataa.iloc[:, 10:26]
y = dataa.iloc[:, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)```
But this balances the dataset according to the cancer yes/no (column 2) label as y is defined that way. I want to perform the sampling with column 3 (gender) and then perform classification with 2.
Hey
i am new here
can anyone guide me from where i should learn machine learning
It's be awesome if you guys can help me 🙂
Hey guys, so im wondering what the best way is to fill those missing values. Dtypes returns as objects. I dropped all rows that all have NAN's. This is the output
i dunno much but may be putting mean values instead of dropping them might be better @lapis sequoia ?
I have made Tic Tac Toe in Python
Hi : ) Thanks for the recommendation that I use this channel! I'm trying to figure out how to use ax.bar to, eventually, make a histogram. This is for homework, so much of the code I'm trying to use was provided and I'm modifying it. The error I'm getting is this: AttributeError: 'numpy.ndarray' object has no attribute 'bar'
The code I'm using is this: https://repl.it/repls/GrowingScientificService#main.py
I'm still quite unsure what ax.bar is technically supposed to do.
@whole roost okay, I looked at your code
what do you want to return from plot_f_sampled?
it seems like you're returning an array
Hm, how could I share a Word document that provides a lot of background information?
never mind
@whole roost the main thing is:
pmf_for_test_plot = plot_f_sampled(n=15)
print("\nBegin homework 1, problem 3")
plot_pmf_samples(pmf_for_test_plot, x_lim=(0, 1), n=200)
here
you're passing pmf_for_test_plot into plot_pmf_samples, right?
so eventually it calls this:
plot_pmf(keep_count,bins)
however, the signature of plot_pmf is def plot_pmf(ax, pmf=(0.1, 0.8, 0.1), x_vals=(-1, 0, 1), title='No title')
so you're passing keep_count as the ax argument, which expects an Axes
An Axes instead of a y-array (like keep_count currently is, if I'm understanding it right)? Or is it feasible that I tell keep_count to add an Axes argument in addition?
yes, and no
look up keyword arguments
to specify how to make each argument go where you want
So I need to convert keep_count into an ax argument ...
mp
no
you need to tell plot_pmf that you're not going to provide it an Axes
and for it to create its own
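A sketch of what goes wrong with the positional call and how keyword arguments fix it. `plot_pmf` below is a stub that only copies the signature quoted above, and the values of `keep_count` and `bins` are made up; the real homework function draws a plot instead of returning a dictionary.

```python
import inspect

def plot_pmf(ax, pmf=(0.1, 0.8, 0.1), x_vals=(-1, 0, 1), title="No title"):
    """Stub with the same signature as the homework function (body omitted)."""
    return {"ax": ax, "pmf": pmf, "x_vals": x_vals, "title": title}

keep_count = [3, 9, 3]  # made-up bar heights
bins = [0.0, 0.5, 1.0]  # made-up bar positions

# Positional call: the first value lands in `ax`, the second in `pmf`.
wrong = inspect.signature(plot_pmf).bind(keep_count, bins).arguments
print(dict(wrong))

# Keyword call: each value goes where it is meant to go. ax=None relies on the
# docstring's promise that the real function then creates its own subplots.
right = plot_pmf(ax=None, pmf=keep_count, x_vals=bins)
print(right["pmf"], right["x_vals"])
```

Passing an existing Axes works the same way: put it in the `ax` slot and keep the data in `pmf` and `x_vals`.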
Ah! So, make Axes an optional argument with Axes='default axes'?
uh.
not exactly
look at its code
and think about how that would work
plot_pmf
I'm ... trying. Unfortunately, due to lack of sleep, my initiative is pretty shot : /
def plot_pmf(ax, pmf=(0.1, 0.8, 0.1), x_vals=(-1, 0, 1), title='No title'):
"""
Plot a pmf as a set of bars
:param ax: Figure axes. If None, will call subplots
this :param ax: comment seems to imply that ax should already default to something if not provided ... oh. plot_pmf_samples has axis, I should just provide them to plot_pmf as an argument, yeah?
Unsure how to reword this to be able to get the axes from it:
Make the subplots
f, (ax1, ax2, ax3) = plt.subplots(1, 3)
I'm a little reluctant to alter it or move it around, as it's part of the code I was provided.
I'm looking at the documentation, and can I call 'ax' within plot_pmf_samples and have it know it's referring to the
f, (ax1, ax2, ax3) = plt.subplots(1, 3)
this :param ax: comment seems to imply that ax should already default to something if not provided
nope
ax1, ax2, ax3 from here?
hint: you can just modify how you call plot_pmf
you don't need to modify plot_pmf itself
Oh, three subplots, three axis, right?
(The homework has this as an example that my results should roughly resemble:
So the idea is that it provides me with axis for each subplot, and then I call the appropriate axis when plotting?
Hah, fixed the error! But the function still doesn't quite return my plots when I run it.
Do you guys recommend anything for learning Machine Learning? I'm trying Codecademy for K-Means clustering and I just don't understand it.
I've a bit of an odd numpy question -- given a list of 2d matrices, what would be a simple way of removing all transposed copies of a matrix, leaving only one (any version)
@lapis sequoia
Hi ! Do you know this site :
https://towardsdatascience.com/complete-guide-to-data-visualization-with-python-2dd74df12b5e
Do you learn on scikit learn ?
https://scikit-learn.org/stable/search.html?q=KMEAN+CLUSTERING
Hi,
Trying to get some outside perspective. I'm working on a project about housing prices. I am using a dataset that has 500 entries. With attributes of ('Monthly Mortage Payment', "Sq Ft", etc).
The question is, "How much monthly payment can one afford?" (Taking into account average income and debt).
I'm brainstorming ideas of how to answer it, and open for suggestions.
Restricted to (Pandas, Numpy, Seaborn, Matplotlib, and Scikit Learn).
@rocky fjord Have you tried linear regression?
How do you guys detect data drift while monitoring the quality of your models?
And in what cases do people run object classification algorithms on videos/images? For what purpose?
@lapis sequoia lots of stuff.
are you talking about just simple classification?
i.e. into one of several mutually exclusive classes
e.g. DOG or CAT or HAMSTER
Like
Is there any tool out there that runs different models on videos/images at once?
For instance, I want to run a simple object classification model with tensorflow
Or pytorch with faster rcnn
@lapis sequoia faster R-CNN isn't really
simple classification
Not a bad feature, but they should really make it in depth
Also I am new to data science
not really sure if there's a better term for it
Correct
Do you think it would help others out
Because if Scale.ai
beside data labelling had that next step of now running predictions on your video footages
That would be the best integration I would ever see
Could be completely wrong
what do you mean by "running predictions on your video footages" ?
Also, how do you guys monitor your models?
as in, what's special about video here
I am wording all of this poorly, I apologize for that. Meaning, running classification models on the videos
at my current workplace, we have a dashboard that shows our metrics along with the pictures that were last taken, we can specify timeframes and stuff, but it's all really basic stuff
the hardest part is making the frontend pretty to be fair
and to ensure the model is still relevant, we run campaigns every N months
yes, that's a problem
Oh wow every N months
you have to re-annotate every N <time unit> to ensure the data can still be represented by the algorithm
Are you guys alerted whenever the quality goes poor?
no, we can't know whether or not it goes bad
Ohhh
hence why we have to manually monitor, our annotation process is also extremely tedious, so we can't do it continuously
our clients don't want to hire annotators
So how can you guys conclude certain decisions,
Oh I see
And if you don't mind me asking, do you work as an ML engineer, or which side are you on?
Just so I am aware of the perspective you're speaking from
My title is a bit unclear, but I guess that would be close to ML engineer?
My official title is something along the lines of "Image Processing Engineer" so not very helpful
Ah I see, and why don't companies use SageMaker's model monitoring?
Because I heard some do but some don't. Is there a reason behind that?
I'm not experienced with sagemaker, what does the model monitoring aspect of it do ?
I honestly just heard about it earlier today. I was speaking to a data analyst, and she said that for her job, she deploys models on Amazon's SageMaker
And because she does not write large Python scripts, she can easily mimic data scientists' tools using SageMaker
So then I asked how she monitored the quality continuously. She told me that SageMaker has that feature?? I may be wrong
I don't know how representative her perspective is, though, because she is a data analyst at a consulting firm, so I cannot tell
sagemaker is probably very useful, it can be used for any part in the pipeline: annotation, analysis, training, verification, deployment
tho it's only for very generic problems last we checked, and it was very hard to customize models and stuff iirc
I'm only speaking from what my colleague that was supposed to explore sagemaker told us
Oh wow
i thought you were cheeringly agreeing with me lol
But she told me how it cannot automate classifications for training data
If I am not wrong on what it does
and many customers requested that feature
I assumed Amazon would have such a service by that point
like creating versions and stuff
Ohhh
there are many annotation tools, but like, every single one we tried was missing something
so we made our own
Also, if your company was able to detect and get alerted by data drift, how impactful would it be to the overall decision making and makeup?
Oh thats smart
the impact would be huge
for the projects where it matters
i feel like it's either a non-issue, or it's crippling
That they dont even monitor carefully and all they do is ask their ML engineers to redeploy
Who makes the decision for you guys to deploy? Is it just by an automated timer?
Okay
so, we have clauses in the contract with our client that say the price of the product includes maintenance; this includes going to the site and collecting data every N months (depends on the project, the client, etc) to evaluate the existing models and see if they need to be retrained on new data
Ohh, and just for background, do you work at a large enterprise company or for a firm for private clients?
it's very large
Ohh okay
but we're a research branch, so we're a small team
So that is super interesting how they wait for a certain time period
Do people just put it aside??
put what aside ?
i'm certain some do yea
it's one of the annoying parts of ML that you only see after a lot of time has passed
A lot of others have been telling me about that same sort of issue
I never thought that other people experienced it
i'm sure many ML solutions out there don't check for quality over time, either from ignorance, laziness, or malice
Why would you say malice?
i think in a lot of fields monitoring data/model quality is important
ie distributional shifts leading to biased predictions, etc.
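As a rough illustration of checking for a distributional shift in one numeric feature, here is a minimal sketch of the Population Stability Index (PSI), which compares binned proportions of a reference sample against a newer sample. Nothing in the conversation says this is what anyone here uses; the function name `psi`, the bin count, the `eps` floor for empty bins, and the toy data are all assumptions made for the example.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bin edges come from the reference sample; values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below is defined
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Toy data (hypothetical): the newer sample only covers the upper half of the range
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print("same data:", psi(reference, reference))
print("shifted  :", psi(reference, shifted))
```

A larger value means the two samples disagree more. A commonly quoted rule of thumb treats values above roughly 0.2 as worth investigating, but a sensible threshold depends on the project. A check like this needs no labels, which matters when re-annotation is expensive.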
you can promise a product, develop and ship it, then forget about it, because you know that's where the real challenge is
And how does one check the quality of a model over time?
and you still got the client's money
Like store information of that model
for example, our fully productionized pipelines run on kubeflow set to a job that triggers during periodic data ingests
Just save it over time
That is just crazy to me how no one yet fixed this
So it persists in large companies, but people havent yet found a way to just simply manage their model's infrastructure over time
Because this isn't the first time I've heard about this
why do you say that? there are a shit ton of tools that allow you to trigger retrainings based on shifting metrics
it's not a new problem, and it has ways to go about it
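To make "trigger retrainings based on shifting metrics" concrete, here is a minimal sketch of the idea, not the API of any particular tool: keep a rolling window of recent labelled outcomes and flag the model once rolling accuracy falls a set margin below a baseline. The class name `MetricMonitor`, the baseline, the tolerance and the window size are all hypothetical, and the sketch assumes labelled feedback is available, which the conversation above notes is not always the case.

```python
from collections import deque

class MetricMonitor:
    """Flags a model for retraining when rolling accuracy drops below a baseline."""

    def __init__(self, baseline, tolerance=0.05, window=50):
        self.baseline = baseline            # accuracy measured at deployment time
        self.tolerance = tolerance          # allowed drop before flagging
        self.scores = deque(maxlen=window)  # most recent outcomes only

    def update(self, correct):
        """Record one labelled prediction and return the current retraining flag."""
        self.scores.append(1.0 if correct else 0.0)
        return self.needs_retraining()

    def rolling_accuracy(self):
        return sum(self.scores) / len(self.scores) if self.scores else None

    def needs_retraining(self):
        # Wait until the window is full so a few early mistakes do not trigger
        if len(self.scores) < self.scores.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance

monitor = MetricMonitor(baseline=0.9, tolerance=0.05, window=10)
for outcome in [True] * 10 + [False] * 5:
    flagged = monitor.update(outcome)
print("rolling accuracy:", monitor.rolling_accuracy(), "retrain:", flagged)
```

In a real pipeline the flag would typically kick off a retraining job or an alert rather than a print, and as said above, the metric and threshold would need to be tailored per project.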
yeah, but i would say that's domain specific
the MLops solutions are there
it needs to be tailored per project
I also heard someone say something about the metrics
beginner here . how to install tenforlow
I dont know if I heard it correctly from someone. But they did say it was about receiving the metrics as well
tamserlow
try again