#data-science-and-ml
thats exactly what i dont understand
@lyric canopy
Regression is so complex
Classification way easier
Your model predicts values, right?
Let's call those predicted values y-hat (ŷ).
We also have the "true" values, let's call those y
Now we want to compare the predictions our model makes to the true values to see how well it does
That's what the formula does
It computes the difference between the predicted values and the true values, ŷ - y, and we call that the error of the model
Or, how wrong each prediction is
We then square those errors to get the squared error between each individual prediction and "true" value
And then we calculate the mean of those squared errors by summing them together and dividing them by the number of predictions we have
And that's what the formula does here
So, it's the mean squared error
In this case, there's an additional 1/2 involved (that's why it's 1/2m), but that's just to make differentiating easier
It doesn't have influence on the optimization
Just to make a term drop out when you take the derivative
It doesn't influence the minimum/optimization, so it doesn't matter
Now, think of your model: If the MSE is some kind of measurement for how accurate the model is, then it would be a good thing to minimize it
so i dont need to care about that right?
And that's what a lot of algorithms do: Find the model with the lowest MSE (or another measure of accuracy)
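A minimal sketch of the cost described above, in plain Python. The sample values and the function name `half_mse` are made up for illustration; the only thing taken from the chat is the 1/(2m) form of the formula.

```python
def half_mse(y_hat, y):
    """Return 1/(2m) * sum((y_hat_i - y_i)^2) for m predictions."""
    m = len(y)
    # error of each prediction, squared
    squared_errors = [(p - t) ** 2 for p, t in zip(y_hat, y)]
    # mean of the squared errors, with the extra 1/2 factor
    return sum(squared_errors) / (2 * m)


# Hypothetical predictions and "true" values
y_hat = [2.0, 4.0, 7.0]
y = [1.0, 4.0, 5.0]

cost = half_mse(y_hat, y)  # errors 1, 0, 2 -> squares 1, 0, 4 -> 5 / 6
mse = 2 * cost             # the plain MSE is just twice the halved version
print(cost, mse)
```

Since the 1/2 only scales the cost by a constant, the parameters that minimize `cost` also minimize `mse`.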
I'm not sure. I only know of a really good book, but that book probably contains more than what you're looking for
Fox (2015) Applied Regression Analysis and Generalized Linear Models (third edition)
hmm
thanks
ill see if i can get the book
yo wait
@lyric canopy
in an RMSE
do we need to do formula stuff in every point
i mean
in all these green points
do you see that red line in there labeled error? That's the ŷ - y of earlier
The difference between the predicted value of the model (represented by the line), and the actual value (the green dot)
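On the "every point" question: each green point contributes one error, and the RMSE summarises all of them in one number. A rough sketch, assuming a made-up line y = 2x + 1 and made-up points (neither is from the plot being discussed):

```python
import math


def predict(x):
    """Hypothetical fitted line, standing in for the model."""
    return 2.0 * x + 1.0


# (x, actual y) pairs, standing in for the green points
points = [(0.0, 1.5), (1.0, 2.5), (2.0, 5.0), (3.0, 8.0)]

# One error per point: predicted value minus actual value
errors = [predict(x) - y for x, y in points]

# RMSE: square every error, average them, take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(errors, rmse)
```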
Anyone have any of the following books? (Ping me)
https://smile.amazon.com/Machine-Learning-Deep-Algorithms-Techniques/dp/9388511131/ref=sr_1_3?keywords=deep+learning+matlab&qid=1561045671&s=books&sr=1-3
https://smile.amazon.com/MATLAB-Deep-Learning-Artificial-Intelligence/dp/1484228448/ref=sr_1_3?keywords=deep+learning+matlab&qid=1561046115&s=books&sr=1-3
https://smile.amazon.com/MATLAB-Machine-Learning-Recipes-Problem-Solution/dp/1484239156/ref=sr_1_9?keywords=deep+learning+matlab&qid=1561046632&s=books&sr=1-9
> Matlab
@lean ledge given an Array<Sample> how would I combine those samples back into a proper waveform?
Hey everyone. I want to do brand detection on live video. How can I do it? What is your advice? I am trying with a CNN
can someone tell me what this 3 means here
energy_series = df.loc[:, ('Energy', '3')]
Energy is a column name..
but not sure what '3' refers to
hmm maybe this is a tuple of columns
im having trouble coercing my dataframe column to datetime values
it's in unicode
I tried pd.to_datetime
@weary rose fortunately for you, 1 million isnt considered big
yes you "can" do that
how is the data stored?
@lapis sequoia it could be one of two scenarios: 1) the column label is actually a tuple, so this tuple is accessing the column called ('Energy', '3'), or 2) the dataframe has a "multiindex" instead of simple column names, so ('Energy', '3') is accessing the key 'Energy' in the outer level of the index, and the key '3' in the next level of the index
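A small sketch of the second scenario (a column MultiIndex), using made-up data. Checking `df.columns.nlevels` is one way to tell which of the two cases applies to a given frame.

```python
import pandas as pd

# Hypothetical frame whose columns have two levels
columns = pd.MultiIndex.from_tuples(
    [("Energy", "1"), ("Energy", "3"), ("Price", "3")]
)
df = pd.DataFrame([[10, 30, 5], [11, 31, 6]], columns=columns)

# 'Energy' selects in the outer level, '3' in the inner level
energy_series = df.loc[:, ("Energy", "3")]

print(df.columns.nlevels)      # number of levels in the column index
print(energy_series.tolist())
```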
@jagged stump i have no experience in this area myself but i found this which could help https://wiki.tum.de/display/lfdv/Video+Analysis
What would be the best way to convert a text file of cells into csv?
@wide gyro pd.read_csv ?
@desert oar well I want to set it to a csv file that allows a dataframe to come in after and take the data but I'm struggling with getting the formatting down for csv
create the csv using pandas in the first place?
what do you mean "getting the formatting down"
you shouldnt ever be manually constructing CSV data except in very simple cases
I am using iwlist scan and trying to get that output into csv file
what is "iwlist scan"
But my csv file is setting it row by row, with no column headers
can you share some code
oops forgot iwlist is in linux, but it scans for access points nearby
can you share some code
Sure gimme a sec
ye
trying to do something similar to wifimap but a beginner version of it
if that makes sense
im not familiar w/ wifimap. but how are you parsing the output of iwlist
can i show a terminal script here? im not doing it off python rn
yeah that's what im stuck on right now
bash highlighting:
echo 1
x=3
echo $x
wifimap . io I think it is @desert oar it's like a hotspot finder for restaurants or stores
good if you travel and stuff
ok. yeah go ahead and share your shell script
the goal is to create a CSV that you can read to pandas right
echo "Scan?"
select yn in "Yes" "No"; do
    case $yn in
        Yes) iwlist wlan0 scan | tee ~/output.txt;;
        No) exit;;
    esac
done
Yes
I receive a block of cells that all have their respective data, but not sure how to then parse it
Maybe I could bring all the data onto one line without using the category for it, and then when setting it to csv I could set the column headers manually that match up to the data?
or would that not be smart @desert oar
can you show some example output from iwlist wlan0 scan?
Actually I figured it out, there's a command called awk that will take care of it
yes i've used awk a lot
still
if you make bad CSV files you make bad CSV files
also awk isn't that easy to use
and there is a lot of really really bad awk advice out there
so i'd still like to see the iwlist output
so you don't waste time on someone's over-clever awk solution that you don't understand
Is there a formatting for txt files?
ye
Cell 01 - Address: 00.something
Channel: 10
Quality: 57/70
ESSID: "Main"
Cell 02 - Address: 00.something
Channel: 10
Quality: 57/70
ESSID: "Main"
Essentially that for txt file, and on csv its that but each one is a new row
I figured I would get rid of the cell categories and just keep the data, get all the data on same line separated by a comma or tab, and then when setting it to a csv, I would put the column headers in manually
I saw online something like sed -e "s/\tsignal: //" -e "s/\tSSID: //" which I figured I would do for all of them
can you demonstrate the output format you want to see
00.something 10 57/70 "Main"
something like that, or could be separated by tabs or commas
since you already know python id personally suggest using python
But would you suggest trying to clean it up a bit in linux before?
if you already know the tools, or want an excuse to learn them, then by all means let's get into it
but if you just wanna get the data processed, use what you already know
I might wanna try it a couple ways just to learn, but I'll head down the path of what I already know just to get it running first
and then have a backup plan ready in case I end up not understanding the other methods
What do you mean haha
i mean, learning a bunch of new tools while trying to actually get something done, without having a backup in place first
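For the "use what you already know" route, here is a rough Python sketch that turns the simplified cell listing posted above into CSV with a header row. It assumes the text looks exactly like that example (real `iwlist` output has more fields and different punctuation), and `parse_cells` and `FIELDS` are hypothetical names.

```python
import csv
import io
import re

# The simplified example from the chat, not real iwlist output
sample = '''Cell 01 - Address: 00.something
Channel: 10
Quality: 57/70
ESSID: "Main"
Cell 02 - Address: 00.something
Channel: 10
Quality: 57/70
ESSID: "Main"
'''

FIELDS = ["Address", "Channel", "Quality", "ESSID"]


def parse_cells(text):
    """Collect one dict per 'Cell NN' block."""
    cells = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Cell "):
            cells.append({})                       # start a new record
            line = re.sub(r"^Cell \d+ - ", "", line)
        if ":" not in line or not cells:
            continue
        key, value = line.split(":", 1)            # split on the first colon only
        if key.strip() in FIELDS:
            cells[-1][key.strip()] = value.strip().strip('"')
    return cells


rows = parse_cells(sample)

# Let the csv module handle quoting and the header row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of `io.StringIO` gives something `pd.read_csv` can load directly, with the column headers already in place.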
hello, I fail to create a .mplstyle for matplotlib with Spyder. Python doesn't find the new_style in the style library. And even if I modify a present default style, it doesn't change the output. I don't really know where I'm missing something :/
I missed this: matplotlib.style.reload_library()
import sklearn.naive_bayes as bae
import numpy
import pandas as pd
data = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/nibbeh.csv")
x = pd.DataFrame.as_matrix(data.Height,data.Weight)
y = [data["Gender"]]
clf = bae.GaussianNB()
clf.fit(x,y)
n = clf.predict([[140,120]])
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
help?
the error message is pretty clear in this case
the x is expected to be a 2d array
you have a 1d array
fit and predict both expect sequences of data points, not just a single data point
so you always gotta wrap them in an additional array
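To make the shapes concrete, a sketch with NumPy only; the arrays are made-up stand-ins for `data.Height` and `data.Weight`. With the dataframe above, `data[["Height", "Weight"]]` should already be 2D, and `data["Gender"]` without the extra list around it would be the usual choice for `y` (`as_matrix` has also been removed from newer pandas versions).

```python
import numpy as np

# Hypothetical stand-ins for the Height and Weight columns
height = np.array([150, 160, 170])
weight = np.array([50, 60, 70])

# Training data: one row per sample, one column per feature
X = np.column_stack([height, weight])

# A single sample to predict on still needs the outer dimension
one_sample = np.array([140, 120]).reshape(1, -1)

# A single feature for many samples goes the other way
one_feature = height.reshape(-1, 1)

print(X.shape, one_sample.shape, one_feature.shape)
```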
@drowsy marsh
oh shoot
@silk forge
Guys, how do I make selenium read multiple pages and export an Excel file afterwards?
I've successfully made it read one page and then export a file with the results
Let me show MWE:
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import Workbook
import numpy as np
import pandas as pd
url = "https://scon.stj.jus.br/SCON/legaplic/toc.jsp?materia=%27Lei+8.429%2F1992+%28Lei+DE+IMPROBIDADE+ADMINISTRATIVA%29%27.mat.&b=TEMA&p=true&t=&l=1&i=18&ordem=MAT,@NUM"
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)
python_button = driver.find_element_by_xpath('/html/body/div[2]/div[6]/div/div/div[3]/div[2]/div/div/div/div[16]/a')
python_button.click()
driver.switch_to.window(driver.window_handles[-1])
python_button = driver.find_element_by_xpath('/html/body/div[2]/div[6]/div[1]/div/div[3]/div[2]/div/div/div/div[3]/div[2]/span[2]/a')
python_button.click()
driver.switch_to.window(driver.window_handles[-1])
textList = driver.find_elements_by_class_name("docTexto")
resultados = BeautifulSoup(driver.page_source, 'lxml')
parse = resultados.find('div', {'id':'listadocumentos'})
paragrafoBRS = parse.find_all('div',{'class':'paragrafoBRS'})
header = []
content = []
for each in paragrafoBRS:
    header.append(each.find('h4', {'class':'docTitulo'}).text.strip())
    content.append(each.find(['div','pre'], {'class':'docTexto'}).text.strip())
df = pd.DataFrame([content], columns = header)
df.to_excel('dados.xlsx')
driver.quit()
So, selenium opens up a page, then goes through some links and gets to the point I want (a page that displays the data I want to scrape)
The problem is that there are 5 pages of data
And I'm struggling to make it read all 5 pages and then export the Excel file
Guys, nvm
I've figured it out
I'm struggling how to merge columns that have the same name
Any tips?
Example: I have 3 columns named "Processo", but I want to merge those columns to instead of having 1 line and 3 columns, have 1 column and 3 lines of data
Using Pandas DataFrame, ofc
rephrase your question so it makes sense
Hello?
hello
I cannot figure out what the second part of the prompt is asking for: Given the root of a binary search tree with distinct values, modify it so that every node has a new value equal to the sum of the values of the original tree that are greater than or equal to node.val.
I can post a picture of an example if that helps
I dont really see how much clearer you could express this question, what exactly is your problem with it?
I do not understand how to modify the tree
node.val = 4343434?
Input: [4,1,6,0,2,5,7,null,null,null,3,null,null,null,8]
Output: [30,36,21,36,35,26,15,null,null,null,33,null,null,null,8]
This is the example. There is a picture of the tree as well. But I do not understand where the values are coming from
show me the picture then
oh those lists are actually equally long i see
that confused me
well then its quite obvious where the new numbers are coming from, the task describes exactly how to calculate them
What does node.val mean?
the value associated with the node?
I see
@inland viper just a recursive solution should work. It's not a hard problem by any measure. It recurses down and returns the sum to the upper tree. Upper tree takes the sum to the left and assigns it as its value and returns back next sum by adding its old value to the sum
This is also not data science
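One way to write that recursion is a reverse in-order walk (right, node, left) that keeps a running total. This is a sketch, not necessarily the exact scheme described above; `TreeNode` and `bst_to_greater_sum` are hypothetical names, and the tree is built by hand from the example input.

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def bst_to_greater_sum(root):
    """Replace each value with the sum of all original values >= it."""
    running = 0

    def visit(node):
        nonlocal running
        if node is None:
            return
        visit(node.right)      # larger values first
        running += node.val    # running now includes this node's old value
        node.val = running
        visit(node.left)       # then smaller values

    visit(root)
    return root


# Tree for [4,1,6,0,2,5,7,null,null,null,3,null,null,null,8]
root = TreeNode(
    4,
    TreeNode(1, TreeNode(0), TreeNode(2, None, TreeNode(3))),
    TreeNode(6, TreeNode(5), TreeNode(7, None, TreeNode(8))),
)
bst_to_greater_sum(root)
print(root.val, root.left.val, root.right.val)
```

Visiting 8, 7, 6, 5, 4, 3, 2, 1, 0 in that order gives the running totals 8, 15, 21, 26, 30, 33, 35, 36, 36, which is where the numbers in the example output come from.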
Hi, may I know how I can make all the x-axis labels fit in the graph?
wanna show us the code? @strong flare
@silk forge hi fwiz here is the code
# Plot with differently-colored markers.
plt.plot(combine_month.index, combine_month.April, 'b-', label='April')
plt.plot(combine_month.index, combine_month.March, 'g-', label='March')
plt.plot(combine_month.index, combine_month.May, 'r-', label='May')
# Create legend.
plt.legend(loc='right')
plt.xlabel('Date')
plt.ylabel('Month trend')
plt.show()
@lapis sequoia when I get the xlsx file with all the data I've collected with selenium, instead of having multiple lines under 11 different columns, it creates multiple columns and one single line. I'll post a screenshot over here so you can check it out.
As you can see, column B "Processo" is repeated at column L, but they correspond to different data sets. I want to get them both in the same column, but on two different lines.
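A possible approach, sketched under the assumption that the scraped `header` list repeats the same field names in the same order for every record: cut `header` and `content` into chunks of one record each and build one row per chunk, instead of passing everything as a single row. The data and the `n_fields` value below are made up (the real case has 11 fields).

```python
import pandas as pd

# Hypothetical scraped lists: two records with two fields each
header = ["Processo", "Relator", "Processo", "Relator"]
content = ["REsp 1", "Min. A", "REsp 2", "Min. B"]

n_fields = 2  # number of distinct columns per record

# One dict per record, pairing each field name with its value
rows = [
    dict(zip(header[i:i + n_fields], content[i:i + n_fields]))
    for i in range(0, len(header), n_fields)
]

df = pd.DataFrame(rows)
print(df)
```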
Hi, does anyone know how to show a line plot's Y-axis in plain numbers, not in scientific notation?
Hello everyone
Does someone know why my scipy optimize minimize doesn't give the most optimal solution
Things like this
When you're working with a jupyter notebook, do you still tend to put all imports at the top, or does it make sense to group imports around the cells they are used? What's the common pattern?
I put them on top. If you're sharing the notebook with someone later they have all the requirements in one cell instead of scattered around.
yes you want people to run into any errors immediately, not later down the line. It's always good to put imports first, especially since it wont give you a compile time error, and mostly runtime @serene crane
@outer marsh what's wrong with that output and what were you expecting instead
@desert oar The second one could've been a bit more down
Same for the first one
I could draw a better line by hand
So you're trying to fit a linear regression line with least squares?
Oh I see, power law
How about this, punch in the parameters for what you think is a better line and compute the log likelihood of both your "manual" solution and the solution the computer found
If it turns out that you can in fact draw a better line then clearly it failed to converge for some reason
Or didn't find a global minimum
Also make sure to rule out any bugs in your log likelihood
That log likelihood looks kind of off to me
What expression are you actually trying to optimize, zipfs law log likelihood?
has anyone here worked with Python cv2, computer vision. for face detection or anything
@desert oar Yeah
Hi guys, I'm in a spot of bother with a 1 hidden layer neural net. I created a very basic neural network after the one I created for MNIST didnt work, training on a very simple generated dataset. However, it spits out nonsense results, and nothing that ive tried works. Could anyone take a look? Apologies in advance for rubbish code and the probably basic errors, I'm completely new.
@hollow shard it looks like you're re-initializing everything inside the loop
so every single step all the parameters get re-initialized to 0
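A toy illustration of the point, not the network from the question: a single weight fitted by gradient descent on made-up data. The weight is initialised once before the loop; moving that line inside the loop would throw away the progress on every step.

```python
# Hypothetical 1-D dataset where y = 3 * x
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

learning_rate = 0.05
w = 0.0  # initialise ONCE, before the training loop

for step in range(200):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # the update carries over to the next step

print(w)
```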
I have a problem that I can't seem to figure out, I am trying to create a boosting model based off of car accidents for a particular city. I have about 30-50k records, since this is a classic imbalanced data case, I am "creating" my own training set by sampling accidents, changing a feature, then if it is non-negative, add it to my trainset as a 0/non-accident
I can't seem to get any good recall scores, since I am more focused on actually preventing type II errors
anyone who can help would be appreciated
i have about 15-20 different features as well btw, I can go into more detail if needed
strange, I modified the code to put the resets outside of the loop, and its still playing up.
Anybody on? I have a smol question.
Can increasing the resolution of a picture reduce the accuracy of the model?
Cuz at 45x45 the model was performing flawlessly. When I made it 100x100, suddenly the accuracy got stuck at 47%
And wouldn't rise even with 10 epochs. Btw the training data size of 73k images.
seems weird
Yeah. I too am wondering what went wrong. From what I found online, they say that low resolution helps in gathering global features better and high helps in finer features. But as you increase the resolution, it's harder to grasp global features.
So I am not sure if OCT scans seem to be classified on the basis of global features though.
maybe you can visualize the middle layers to see what's being learned
what was accuracy in 45x45?
88
oh. huh
im really not a image/vision guy but that aint right
the usual "check your code" and "check your data" caveats apply
Well. Can't really check my data. It's just an image.
Code wise. We could try out different designs though.
well how are you producing the images
i assume these arent actually 45x45 images
and youre downsampling them somehow
so do the 100x100 images actually look right?
Yeah. That's why I was like, 45 might be very low resolution.
its kind of a long shot but worth checking. sorry im not more helpful on the actual neural network side of things
And guess what, with 45x45, the test dataset accuracy is 91%
it's possible that 100x100 is a "bad zone" where the easy blob-like features start to break up but the image itself hasnt resolved into anything meaningful yet
but again im speculating
cause 45x45 is just like what, blobs of dark and light?
Yus
So I guess then I'll just do it with multiple image sizes and networks
And plot it on tensorboard
seems reasonable as long as training isnt too intensive
Well. Not much. 76k images 100x100 takes 1. 10 min per epoch.
*1min 10s
I am pretty happy at the fact that this actually works fast with a gpu. With a cpu every epoch would've taken 5-10min
Say. Another smol question. I have some work related to taking features out of an image to generate a vector with features. I need to take texture, a little bit of geometry and some morphological features. Which library has these functions and can I get a source where I can look up such algorithms?
Stuff like glcm, drlbp and all.
Hey everyone, I want to ask a question. Is data engineering good? It looks like it focuses on the architecture behind the data, with less work on ML, DL, etc. Is that right?
Because I got a call about a data engineer position and I don't know what I'd be doing. Can anyone give me some info about it?
thanks
"good" is up to you. it's certainly important work, and the industry can always use competent data engineers
the work you'd do depends on the company, but you will generally be building or maintaining database systems and data pipelines
so maybe a data scientist or ML researcher will produce a model, maybe in a docker container
and the data engineers might be the people who connect that container to an application, which other programmers can use
or maybe you would be administering a data warehouse
stuff like that
possibly doing some "light duty" data analysis of your own
Thanks for the answer. I am talking about a really big and good company
The data comes from autonomous cars
and they want Hadoop, Spark, also Kubernetes, Docker, etc. They also mention creating new pipelines, and ML/DL is a plus
my mind is such a mess now. I don't know, the company is awesome and the job would be better if it were data scientist, but it still looks cool because of the autonomous cars
yeah that would be a serious data engineering job
low latency pipelines handling huge amounts of data
because the data will be real time, so it's level 5
also level 4 I guess, because they're also supposed to predict
so clearly level4-5 data engineer for autonomous cars
i dont know what levels you refer to
in fact I don't know much about the levels, I just know that level 4 is like predictable data and level 5 is like real-time data
ive never heard of these levels
So. I am still waiting if anyone knows about such sources. Please do answer.
@sand reef yes, increasing resolution can reduce accuracy in some scenarios. Might be a good idea to learn about Laplacian pyramids and multi-scale features from traditional computer vision. But you would expect a deeper CNN to generally learn Gaussian blurring filters to get over that often, so it might have to do with your hyperparameters and how they're suboptimal for the new loss landscape
I see. Thanks! And @lean ledge do you know of a library with feature extraction algorithms for images like glcm, drlbp, etc.?
I do not, sorry
Oh ok.
Data engineering for that is basically making high speed pipelines into ML models along with data validation and cleaning.
Personally that's my least favorite part of the process, but it's super valuable
What you want to know really well is data structures, pipeline methods, and outlier detection for the interview
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
i dont understand this line ^
why the [0][0] and stuff
Its bad
coef_ is a 2d array, i.e. a matrix
Looks like they're plotting the 0,0th element of coef_
Should be [0,0] not [0][0] - the second way works but it's not recommended or idiomatic
Meanwhile intercept is a 1d array, i.e. a vector
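To see the indexing without fitting anything, here is a sketch with NumPy arrays that have the same shapes `coef_` and `intercept_` get when `y` is passed as a 2D column. The numbers are invented.

```python
import numpy as np

# Stand-ins with the shapes discussed above (values are made up)
coef_ = np.array([[39.2]])       # 2D: (1 target, 1 feature)
intercept_ = np.array([125.3])   # 1D: (1 target,)

slope = coef_[0, 0]        # idiomatic: one indexing operation
same_slope = coef_[0][0]   # works too: first row, then first element
offset = intercept_[0]

print(coef_.shape, intercept_.shape, slope, offset)
```

Both forms return the same scalar, which is what the plotting line multiplies `train_x` by.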
Hey I was thinking about to get started with machine learning, just wanted to know what math is required ?
anything more ?
thanks
@fierce shadow there's a "mathematics of machine learning" post pinned on this channel that summarises the maths used
Right here actually https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568
@sand reef sry for ping but i got a life or death situation again
i am loading few images from a directory but their shape is (250, 250, 2)
why is an images shape 3?
when i printed it
it has one pair of useless brackets
how can i remove it?
The image's shape is 3D because you have pixel x × pixel y × colour channels
how can i reshape it?
You usually don't want to?
oh
Why do you want to reshape it?
so its fine?
Yeah it's perfectly fine
okay thanks
Image data is spatial. CNNs are spatial. Can't reshape it in any way without screwing up how CNNs naturally work
I mean you could certainly make it smaller or bigger or just make it grey scale (aka 1 channel) if you wanted to
You can up or down sample etc or do other similar preprocessing but you can't reshape the network layer without losing properties
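A quick sketch of those options on a made-up array, assuming a 3-channel image. The grey-scale step here is a plain channel average, which is a simplification rather than a perceptual conversion, and the slicing is the crudest possible downsample.

```python
import numpy as np

# Hypothetical image: height x width x colour channels
image = np.zeros((250, 250, 3), dtype=np.uint8)
image[..., 0] = 255  # fill the first channel

# Grey scale by averaging the channels: the channel axis disappears
gray = image.mean(axis=2)

# Many CNN inputs still expect an explicit channel axis of size 1
gray_with_channel = gray[..., np.newaxis]

# Naive downsample: keep every second pixel in both directions
smaller = image[::2, ::2]

print(image.shape, gray.shape, gray_with_channel.shape, smaller.shape)
```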
okay
anyone knows how to make pie charts with matplotlib.figure?
you mean matplotlib.pyplot ? @round crane
first line import matplotlib.pyplot as plt
idk I clicked figure
sry
so you mean this kind of import https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.figure.html
right this one
cuz what I clicked was as an example of it
really
oohhh you want it to be in a flask site?
I don't think it makes a difference
@earnest prawn please explain this
from matplotlib.figure import Figure
right
import matplotlib.pyplot.figure
what's the difference
I think that the first one only imports the figure module
and the second one too
the FAQ says without pyplot
idk
pyplot can actually generate figures if you want it to
I think that they're the same
so pyplot is just a bit higher level
oohhhh so like more functionality
but internally (and with the right way also externally) you can create figures with it
no its less functionality
pyplot internally uses figures
I mean pyplot
yeah basically
i in fact am guessing this based on seeing other matplotlib code which mentioned the creation of a fig variable
and as in this code they also do it i am just making an assumption
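For the original pie chart question, a sketch that uses `Figure` directly without pyplot and renders to an in-memory PNG, which is the pattern usually suggested for web apps. The values and labels are invented, and this assumes a reasonably recent matplotlib (3.1 or later).

```python
import io

from matplotlib.figure import Figure

# Build the figure object directly, no pyplot state involved
fig = Figure(figsize=(4, 4))
ax = fig.subplots()
ax.pie([45, 30, 25], labels=["A", "B", "C"], autopct="%1.0f%%")
ax.set_title("Example pie chart")

# Render into memory, e.g. to return from a web view
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes))
```

With pyplot the same chart would be `plt.pie(...)` followed by `plt.show()`; pyplot creates and tracks the `Figure` for you.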
I have a question. I want to make a face detection system. Just detection, nothing else. How do I get a good dataset to train my network in? I want to do it with live video feed.
And what model is good for it?
I want to do something for a small self project.
I tried training a small network with 32 conv net filters, dropout layers and a 36 neuron network. With just a resolution of 100x100. I used 1k images with face and 1k without.
Which I captured using the video camera.
How do I proceed? I tried doing it and the issues I see is, the model kind of does and doesn't recognize the face at different angles and distances.
I thought of using the hidden layers and generating Euclidean distance to find similarity between images and not, but I am not sure if that will work. Will it?
And I am also not sure how would I implement it.
Sorry. The resolution is 50x50
import re
pattern = r"^gr.y$"
if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "stingray"):
    print("Match 3")
this outputs match 1 and 2
can someone explain metacharacters to me??
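A short sketch of what the three metacharacters in that pattern do: `^` anchors the match to the start of the string, `.` matches any single character except a newline, and `$` anchors it to the end. The extra test words are added for illustration.

```python
import re

pattern = r"^gr.y$"

words = ["grey", "gray", "stingray", "gry"]
matches = [w for w in words if re.match(pattern, w)]
print(matches)

# Without the ^ anchor, the same idea does find "gray" inside "stingray"
inside = re.search(r"gr.y", "stingray")
print(inside.group())
```

"stingray" fails because of `^`, and "gry" fails because `.` needs exactly one character between `gr` and `y`.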
@hallow pendant wrong channel. Go to one of the help channels
oh sorry
I want to ask something. For example, I want data for a CNN: the Mercedes-Benz brand logo. I didn't want a ready-made logo dataset, so I wanted to make my own. I automatically downloaded the first 40 images from Google Images for different keywords, like "mercedes benz logo", "mercedes benz brand", etc. But some of them really aren't about Mercedes-Benz, so it's not well-picked data. What is the best way to classify this data and filter out the non-Mercedes-Benz images? Thanks a lot
well...that's what you're building the CNN for, right?
you could a. do it yourself or b. pay someone to do it for you
you could run unsupervised classification
@jagged stump if you just want to filter images between sets, you can use microsoft computer vision api.. it's easy, just upload a couple of images to train it
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 2000, 10000)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()
plt.show()
Hi guys, may I know how I can set the y ticks to plain numbers, not scientific notation?
this is the graph look like
you want the y axis to write 1 followed by 9 zeroes????
1000000000
2000000000
3000000000
ax.set_yticklabels should do it
what's the support? (in classification tasks) i'm trying to make sense of my classification report..
I trained a classifier for 3 classes, I have the precision recall f1 score and support for 120k examples.. and it shows for individual classes, the support is 40k
not sure what that means
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
def tickformat(x):
    if int(x) == float(x):
        return str(int(x))
    else:
        return str(x)
fig, ax = plt.subplots(figsize=(8, 6))
x = np.linspace(0, 2000, 10000)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
fmt = FuncFormatter(lambda x, pos: tickformat(x))
ax.yaxis.set_major_formatter(fmt)
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()
plt.show()
do you want your y in that scale?
@desert oar @lean ledge I just found this on Google and it works
ok seems like you did
@lapis sequoia Yes, this is just a sample. I need those numbers as plain figures, not in scientific notation
@lapis sequoia I prefer use python script
@velvet thorn so something like K-means? But how? HoG? SIFT? What is the way?
??
@lapis sequoia you suggested the Microsoft computer vision API. I have no idea what it even is. Can you give me more details?
oh it seems like you want to use python.. so it's not an option
It must be possible somehow. I have 100 images, 50 of them are Mercedes-Benz and the others are something different. Can't I somehow take only those 50 Mercedes-Benz ones? Maybe with some ML algorithm?
any idea?
Hi. I have a numerical variable that varies with a date-time variable. Can I use an ML algorithm to predict the numerical variable?
What's a good model to use for face detection? For live video feed?
Switching from iterrows() to get_value for efficiency, but now I'm running into a problem with appending my copied dataframe
I have everything inside a for loop, for i in df.index:
and inside that loop I have a check statement and then dfCopy = dfCopy.append(i)
in which I am now getting TypeError: cannot concatenate object of type "<class 'int'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I previously used for index, row in df.iterrows():
with the append statement being dfCopy = dfCopy.append(row)
which worked fine
nevermind seeing its being removed in a future release
chances are you don't actually need to loop over rows manually
what are you trying to achieve?
most of the time you think you need iterrows() you probably just want .apply(..., axis=1) instead
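Since the goal in the question was copying rows that pass a check, here is a sketch of the loop-free versions on a made-up frame: a boolean mask for filtering, and `.apply(..., axis=1)` for a row-wise computation. Column names and the condition are hypothetical.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [5, 12, 20]})

# Instead of looping over df.index and appending matching rows one by one,
# build a boolean mask and select every matching row at once
mask = df["score"] > 10
df_copy = df[mask].copy()

# When a per-row function really is needed, apply it along axis=1
df["label"] = df.apply(lambda row: f"{row['name']}:{row['score']}", axis=1)

print(df_copy)
print(df)
```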
Does anyone have a starting point for voice authentication?
I mean recognizing a person's voice
doesn't matter if initially its only for numbers
"initially only for numbers"??? a computer cannot make predictions on inputs which are not numeric
I have too many values in my scatter plot. How can I reduce the number of values shown on the x axis?
ax.set_xticks @hollow quartz
has anyone here participated, or is anyone participating, in a Two Sigma competition on Kaggle? I was interested in what the data set looks like, as the deadline has passed. If it's from an older one that's no problem either, please feel free to DM me
I would love to learn data science, but I am still struggling with learning it.
Could I get a recommendation for a project to do? A project to do so that ppl will hire me for internship?
I know, wxpython, pyqt5, decent amount of tensorflow and keras, SQL and currently doing Django.
And I have been doing competitive in python and C++.
And I know cv2 and Pygame. And SFML in C++.
I checked online for good projects to do, and ppl were like reimplement GNU ld, objdump and make a binary recompiler or something. And I thought maybe I could do something with Facenet, turns out there is a library already with what I want to do. I wanted to use Facenet with live video feed. I am completely lost.
@sand reef scrape the weather station info from NOAA and try to do 5 minute ahead wind speed, direction, and day-ahead high / low temperatures
Oh. That's almost perfect. Thanks!
Common weather forecasts are: 5 minute ahead, 1 hour ahead, 1-7 days ahead, 90 day average temp
That actually sounds really interesting
Hi guys, I have a dataframe with dates. May I know the easiest way to get the moving average?
I already have the min & max date
the moving averages I want are 3-day and 7-day
some of the solutions use pandas_datareader, but is that only for fetching online data?
https://stackoverflow.com/questions/40060842/moving-average-pandas
https://youtu.be/XWAPpyF62Vg
ok I get it, I have to use df.rolling().mean()
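A minimal sketch of that on an existing frame, with no online data source involved. The dates, values, and column names are made up.

```python
import pandas as pd

# Hypothetical daily data
dates = pd.date_range("2019-06-01", periods=7, freq="D")
df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7]}, index=dates)

# Rolling mean over the last 3 and the last 7 rows
df["ma_3"] = df["value"].rolling(window=3).mean()
df["ma_7"] = df["value"].rolling(window=7).mean()

print(df)
```

The first rows of each moving average are NaN until the window is full; `rolling(window=3, min_periods=1)` is an option if partial windows are acceptable.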
Can I get some advice on a series of algos I got from scikit image?
I need to take out texture features.
And there are a multitude of algos for it.
But what I see is: SIFT, SURF, Daisy, LBP, HoG and a binary descriptor extractor called ORB.
Is it of any use if I use multiple of them? Or just one suffices?
Here is the documentation : https://scikit-image.org/docs/dev/api/skimage.feature.html#skimage.feature.hog
And could anyone point me to a source where I can understand why should I use any of them, and which one is preferred when?
Computer vision is a rather large field in and of itself, with a very large body consisting of stuff inspired by signal processing, traditional algorithms, statistics and machine learning based techniques
Seriously though, it's a large field.
Take one example, SIFT. It's "Scale Invariant Feature Transform"
It takes an image and converts it into a difference of gaussian scale space representation for scale invariance (signal and image processing)
It finds features using generic multivariate calculus techniques + a discretised taylor series on the image, doing that over the entire new 3D scale space (general multivariate calculus)
It does feature matching (backed by a kd tree supported nearest neighbour searches) (traditional algorithms)
Uses hough transformations to identify clusters in the feature space and do voting (generic computer vision optimisation stuff)
There is no possible reference you can give that goes over the nuances of the techniques without a lot of background and explanation
Hmm. So that means back to square one.
And any directions for if I need to find some geometric features like depth of the retina, and for morphological features like lesions in the retina?
Something like the depth(or height? whichever makes more sense, its 2D) of the whitish region.
Or abnormalities on the surface, like a little part of it being partially peeled.
And from the looks of it, I think that computer vision might be a field gradually explored by experience? Like very very gradual exploration
Or just like, reading a book or going through a course? @sand reef
There's no need to be slow and gradual but also don't expect the field to be reducible to a cheat sheet
@sand reef https://www.edx.org/course/robotics-vision-intelligence-machine-pennx-robo2x Here's how I learnt it
There's little specific to robots there so be scared not
Thanks! I'll look it up! Although, any pointers on what would I do for those requirements in my above question?
Hey guys, I have a pandas question. I am working with a dataset that has 'Download Date' and 'Customer Name' fields built into it. What I want to do is remove duplicate Customer Names based on the Download Date. For example, if this is the starting point:
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 06/10/2017 Customer Name: Mark Johnson
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 09/15/2018 Customer Name: Jim Jacobs
I want it to end like this:
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 06/10/2017 Customer Name: Mark Johnson
Download Date: 09/15/2018 Customer Name: Jim Jacobs
I have been trying to do it without iterating (I don't think I have to, I could be wrong). Can anyone point me in the right direction? My initial thoughts were to use .groupby() and do stuff with the groups but for some reason I cannot get that to work how I want it to.
Edit: I got it. I needed to pass the download date into the subset parameter of drop_duplicates.
you can use drop_duplicates
the more general approach is to use groupby to aggregate by whatever fields you need and perform whatever operation you need
The only reason why I swayed from that is because the excel file that I was using was already grouped by the download date. Forgot to mention that D:. But I just ended up passing download date and customer name into drop_duplicates and that got me the results that I was looking for
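For reference, a small sketch of the drop_duplicates call described above, using the four example rows from the question. The groupby line is only there to show the more general route and is not what the original poster ran:

```python
import pandas as pd

# The four example rows from the question above.
df = pd.DataFrame({
    "Download Date": ["06/10/2017", "06/10/2017", "06/10/2017", "09/15/2018"],
    "Customer Name": ["Jim Jacobs", "Mark Johnson", "Jim Jacobs", "Jim Jacobs"],
})

# A row counts as a duplicate only if BOTH columns match an earlier row; keep the first one.
deduped = df.drop_duplicates(subset=["Download Date", "Customer Name"], keep="first")

# The groupby route: one group per (date, name) pair, here just counting rows per group.
grouped = (
    df.groupby(["Download Date", "Customer Name"], sort=False)
      .size()
      .reset_index(name="rows")
)
```

Here deduped keeps three rows, matching the desired output in the question.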
Do distances in PCA 2D space mean anything?
Numerically, I don't think so i.e. a distance of 5 or 10 has no interpretation. But relative to each other I would think that a distance has some interpretation i.e. point pairs that have smaller distances are in some way more similar than point pairs with large distances, but I don't think there is a clear interpretation other than this vague idea of similarity. At least not without carefully examining the principal components.
Okay, so I want to use machine learning to try to categorize measurement series, where every data point is essentially a group of measurements (somewhere between 10 000 and 100 000 measurements for each data point). Does anyone know if there exists machine learning algorithms for these types of tasks, that will derive attributes from the data on its own, or will that have to be done manually?
This could be done with an LSTM or a TCN. Do I understand it correctly that each timestep in your series has between 10 000 and 100 000 values associated with it? If that's the case maybe you should look into some form of dimensionality reduction before feeding it to a classification model?
Each timestep basically has as many values as i want. The more i use, the more accurate it becomes, as the measurements are prone to noise
I am basically looking at RTT delay measurements.
An example of what i want to be able to do is getting measurements between two points A and B, and a Server C. Then i want to be able to differentiate between the RTT of A-C and B-C. Most of the time the minimum value should be sufficient if i have enough measurements, but i want to be able to reduce the number of measurements to a point where i may not be able to tell from the minimum delay alone.
Does that clarify it a little bit? @polar acorn
Sure! Now I do in no way have enough domain knowledge to actually be of use here so i'll try making some assumptions. If I understand this correctly you have a series of measurement sets for example lets say [[3,4,3], [4,5,4,5], [5,5,2]] etc of some arbitrary length. And you want to estimate if this series belong to the A-C class or the B-C class. I assume that you have several examples of both classes. I assume what sets A-C and B-C apart is some attribute of the series it self. Maybe A-C rises quicker or B-C oscillates more over time or whatever. In that case I would for each time step find the mean, std, minimum and maximum of the values sampled at that time and create a new time series with the same length but those same measurements at each time step. And then feed that to an LSTM model.
Yea that was basically what i was wondering: If i had to derive values such as mean, std, minimum, etc. on my own, or if i could just feed the whole measurement series into a ML method that would do it on its own.
In regards to domain i am measuring RTT in a network with low traffic, so network traffic is generally considered noise i want to sort out, so maximum values wont be very interesting
That is an example of the data i can get between 2 different points and a Server C
Anyway i should probably look into RNN and LSTM
Another way could be to first compute summary stats for each time step and then use something like https://github.com/blue-yonder/tsfresh to automatically extract features from the whole of those time series and then throw those features into whatever classifier you like. I'm afraid you have to create those summary stats yourself, but I'm sure you can set them up nicely in some pipeline together with your training.
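A rough sketch of the per-time-step summary stats mentioned above, using the toy series from earlier in the conversation. The function name summarize and the choice of four statistics are assumptions for illustration, and the tsfresh step is left out:

```python
import numpy as np

# Toy series from the discussion: each time step holds a variable number of samples.
series = [[3, 4, 3], [4, 5, 4, 5], [5, 5, 2]]

def summarize(series):
    """Turn a ragged series into a fixed-width array of (mean, std, min, max) per time step."""
    return np.array([
        [np.mean(step), np.std(step), np.min(step), np.max(step)]
        for step in series
    ])

features = summarize(series)  # shape: (time steps, 4)
```

The result has one row per time step, so it can be fed to a sequence model or to a feature extractor regardless of how many raw samples each step had.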
Yea, I was actually about to do something similar before i stopped myself and wondered if this was something ML could take care of entirely
Anyway, thanks so much for the input!
deep learning would do a lot better of a job than machine learning ever could @lapis sequoia
Deep learning is a sub discipline of machine learning so your statement does simply not make any sense
hey all can anyone look at some sample code ive been trying to get working? Im working on a simple trading algorithm using pandas and numpy and i feel like ive been looking at the code so long i cant see whats in front of my face
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
@lapis sequoia that is objectively false
Deep learning is rarely better than traditional ML in reality for a lot of reasons
not from what I've read up about it js @lean ledge
from what I've read, deep learning is machine learning, but better cause of the neural networks @earnest prawn
Can anyone tell me what I'm doing wrong when I'm trying to get my signal to buy in this simple trading algorithm application? https://docs.google.com/document/d/1d3LzcXByLDlF6a-amt-U4b_bOy8l6c2L109i9xU7OvE/edit?usp=sharing
why're you trying to get your signal to do that? @eternal flare
@lapis sequoia neural networks doesnt make it better
Deep learning is very rarely better
Outside of the domains of CV and NLP, it's not even used much
it's supposed to be a basic crossover investment strategy. When the simple moving average for 10 or 30 days is reached, a signal is put out to either buy or sell.
I concur.. DL hasn't had major applications over ML in applications outside of NLP.. I'm not sure about what they do in CV.. but image processing otherwise has had lot of impact
well then why exactly isn't it ever used much, in that case? @lean ledge
Time to train
Amount of data needed
Being finicky in training
Getting ample clean data is ridiculously hard in the real world
Unlike traditional models, getting training right is a hard part of DL
yeah, but yet it still manages to be better in terms of accuracy @lean ledge
there are literally full on articles online talking about how deep learning's better than machine learning @lean ledge
Congrats there's articles saying the opposite also. Your point?
I've listed out the reasons why deep learning is rarely the solution for things outside CV and NLP. Think you know better? Tell me why I'm wrong instead of referring to "there's articles"
@lapis sequoia
If I am not wrong, the amount of data needed for DL is the biggest drawback, right?
there're the generic things like: lots of data, big compute (varies depending on algo), large memory requirement
there're more hand-wavey things specific to specific areas like "imposing structure on your problem/solution"
DL is big and very good and progressing quickly. It's not all of ML, but it's certainly the most exciting and quickly progressing part of ML right now
but it's also because DL is big and exciting that you don't hear much about regular ML (which is still everywhere, and growing), because it's not that exciting to read about
no one wants to read about 1000 small companies using collaborative filtering or gradient boosted trees
everyone wants to read about deepnudes
lol deepnudes
anyone
wanna do a project
on Speaker Verification ?
creating a complete library where we just take voice of user say from number 1 to 10
and create a model that works for that user only
there are no ready made material for this so more the merrier
maybe a publish a good research paper on our technique.
guys i have an image with shape (120, 120, 2)
how do i show it in matplotlib?
```python
import os
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

DOGS_DIRECTORY = "training_set/dogs/"
CATS_DIRECTORY = "training_set/cats/"
cats = os.listdir(CATS_DIRECTORY)
dogs = os.listdir(DOGS_DIRECTORY)

def load_data_dogs():
    x = []
    for dog in dogs:
        if dog[-4] == ".":
            img = Image.open(DOGS_DIRECTORY + dog)
            img = img.resize((120, 120))
            img = img.convert('LA')
            x.append(np.array(img))
    return x

x = load_data_dogs()
plt.imshow(x[0])
```
when i run this code
```
TypeError                                 Traceback (most recent call last)
<ipython-input-30-2e8a46bc51f4> in <module>()
     20 
     21 x = load_data_dogs()
---> 22 plt.imshow(x[0])
     23 

3 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/image.py in set_data(self, A)
    636         if not (self._A.ndim == 2
    637                 or self._A.ndim == 3 and self._A.shape[-1] in [3, 4]):
--> 638             raise TypeError("Invalid dimensions for image data")
    639 
    640         if self._A.ndim == 3:

TypeError: Invalid dimensions for image data
```
please ping me when answering Thank you
how're you able to program an image with shape? @lapis sequoia
@lapis sequoia images always have a shape. What plt.imshow expects, however, is either a 2D array or an image with 3 or 4 channels (the format being (X pixels, Y pixels, channels)). @lapis sequoia what type of image is this to only have two channels?
@naive shore have you tried to change dtype parameter ?
no, I mean I was taking a bit aback that was even able to literally program his own entire image too, although I shouldn't have been, now thinking back @earnest prawn
What
I was referring to @lapis sequoia with my comment @earnest prawn
@lapis sequoia Can you stop pinging random people with dumb questions? You've been asked to do so by an admin already
sorry, didn't mean to ping you asshole @lean ledge
@lapis sequoia That's not an appropriate way to address others, you've already been warned for your behaviour, watch it.
well tbqf, mr. "dumb questions" wasn't being appropriate himself, with the tone he was conveying, so @south quest
could agree that was also not polite. Still using '@' on every post is not necessary and very annoying
In rooms with so little traffic it should not be needed
It triggers a notification
it's how I usually directly convey a message to someone in a public chat, although I've been trying to refrain from doing it as much recently
Probably a good thing. Using @ notifies the user directly, sometimes even with sound. Can be very annoying
It adds importance and urgency to your message, something that is almost never needed
I'm running some custom data feature creation on WSL. I'm getting a "kernel has died" error pretty consistently on the run through. Individually, each function runs through, and the resulting dataframe when dumped to .csv should be 1.7 GB (~1.4M rows, ~100 columns) and fits in memory. I've created this for a slightly smaller data set (~1.2M rows, which is what the estimates are based on). Is this happening due to an out of memory error for the WSL subsystem?
Error dump from the console:
https://pastebin.com/X1WrHrLS
I am looking to use Grafana to plot some data via influxDB and am not sure how to structure my data
I have a CSV file with a column of scores (0,10) and a timestamp for each score
Does anyone here have any experience with Grafana/InfluxDB and how I should go about posting my data?
Not sure if I should use a data_frame
@silent swan you seem to know your stuff in terms of machine & deep learning, you a dev for either?
(referring to this):
@lapis sequoia it should always be a 3D image. When it's black and white, it has 1 channel. Meaning its dimension is [X, Y, 1]
So try using numpy.reshape
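A small sketch of one way to get from the (120, 120, 2) array to something displayable. It assumes the two channels are luminance and alpha, which is what PIL's 'LA' mode produces, and it uses a zero-filled stand-in array instead of a real image. Note that a plain reshape from (120, 120, 2) to (120, 120, 1) cannot work because the element counts differ, so the alpha channel has to be dropped first. Going by the check shown in the traceback above, imshow accepts a 2D array:

```python
import numpy as np

# Stand-in for np.array(img) after img.convert('LA'): height x width x (luminance, alpha).
la = np.zeros((120, 120, 2), dtype=np.uint8)
la[..., 0] = 128   # luminance channel
la[..., 1] = 255   # alpha channel (fully opaque)

# Drop the alpha channel to get a plain 2D grayscale array.
gray = la[..., 0]

# For a network that expects a channel axis, add it back as a single channel.
gray_1ch = gray.reshape(120, 120, 1)

# Display (not run here): plt.imshow(gray, cmap="gray")
```

Converting with img.convert('L') instead of 'LA' when loading would avoid the second channel in the first place.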
Solved it yesterday but thanks
But I get another problem as usual
I don't think this is good
How can I reduce the over fitting
```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Conv2D, MaxPooling2D, Activation

x_train, y_train = load_train_data()
x_test, y_test = load_test_data()
x_train = x_train.reshape((-1, 120, 120, 1))
print(x_train.shape, y_train.shape)

model = Sequential()
model.add(Conv2D(120, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Dense(1))
model.add(Activation("sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, shuffle=True)
```
I already normalised the data
Maybe a dropout between the conv and the linear part of the model could help @lapis sequoia
so a dropout after the first layer?
I'd say after the max pooling but i am honestly not exactly sure
okay lemme try after the maxpooling
@lapis sequoia gtg now, I'll be back later but I'm sure if this doesn't help you'll find other people with more ideas
whats a good way to visualize pandas data sets
python interpreter can't fit all the data
what if i wanted to see all the data
That would depend on the data
@silk forge pandas profiling? or you can do a group by and do size()
for each group
so what about sparkSQL does anyone knows it very well?
I have a case study from a company about data analytics. I've never done anything with big data, but this looks like a fairly big dataset from OpenStreetMap (OSM).
So I must do something with sparkSQL (I know basic SQL) and/or Spark's MLlib
I am looking for examples and a good introduction to these topics
if your purpose is to dashboard the data or use sql for running queries on it.. then you're better off using big query, etc..
if you want to write nested queries on a large dataset, go with sparksql
hey
I have to go with sparkSQL because, as I already said, it's such a big dataset
sparksql is not the only tool for handling big data..
I want explore data with it
yeah what do people prefer with big data and python?
i personally hate spark
good for its time but i dont like it
in fact I should learn Scala, but I'm too lazy to learn it
not a big a fan of scala either
there are lots of other options too, in python. Maybe some that have CPython involved too.
How do you calculate a probability vector in Python? https://scipy.github.io/devdocs/generated/scipy.spatial.distance.jensenshannon.html
Row(tags=[Row(key=bytearray(b'highway'), value=bytearray(b'residential')), Row(key=bytearray(b'name'), value=bytearray(b'Honv\xc3\xa9d utca'))])
That's a sample of the map data, from the first row. So how can I get the data of all rows where key=highway and value=residential with sparkSQL?
any idea
Hey guys ! I am new to machine learning. I want to understand the math used in the sklearn library. Can you recommend any books ?
Type like an adult
@celest moss KF Riley's Mathematical methods for physics and engineering has always been my maths reference of choice
Does anyone here know a good tutorial where the derivatives for the first layer of a one-hidden-layer neural net are derived?
All the ones that I've read so far gloss over it
Also, really sorry for the off-topic question, but does anyone know where I could ask a broad question about fluid simulations with python?
|-- version: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- changeset: long (nullable = true)
|-- uid: integer (nullable = true)
|-- user_sid: binary (nullable = true)
|-- tags: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: binary (nullable = true)
| | |-- value: binary (nullable = true)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: integer (nullable = true)
| | |-- nodeId: long (nullable = true)
I want to parse this data with sparkSQL, does anyone know how?
Row(tags=[Row(key=bytearray(b'highway'), value=bytearray(b'residential')), Row(key=bytearray(b'name'), value=bytearray(b'Honv\xc3\xa9d utca'))])
Sample of data:
+--------------------+
|                tags|
+--------------------+
|[[highway, reside...|
|[[highway, reside...|
|[[highway, reside...|
|[[highway, second...|
|[[highway, second...|
|[[highway, primar...|
|[[highway, tertia...|
|[[cycleway:both, ...|
|[[highway, second...|
|[[highway, reside...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, primar...|
|[[highway, reside...|
|[[highway, reside...|
|[[highway, tertia...|
+--------------------+
only showing top 20 rows
As you can see, the tags column holds key-value pairs. For example, I want to get the rows where key=highway and value=primary
Offtopic, and MATLAB. So thing is I'm interfacing with ros and subscribing to some topics in matlab
When I'm using the rossubscriber command in terminal, everything is working fine, but when I'm trying to put it in a .m code, for example
Sub=rossubscriber('point_data')
Recv=Sub.LatestMessage
Recv.Data
The code is not printing anything
The same code works in the terminal. A note here, the point_data has a custom message, I've tried the same code with other topics with standard messages and they worked in both terminal and .m file
I can't understand why something will behave differently in terminal and code
@hollow shard https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ explain backpropagation very well
This invite was removed by our filters, but it doesn't quite fit what we mean by advertisement. I think it was meant for @proud iris by @polar acorn
Thanks!
Hey guys, I have a small question about Arrays and Lists. Is this the right place to ask?
That probably depends on your question. If it's a more general Python question, you're probably better off in a help channel, since you'll get a much quicker response there. If it's more touching on data science or a related field, then this channel is probably a better fit.
Ok, it is actually very basic. I need to compare lists to arrays for an exam. If I want to use an array I have to declare it first, right? So I would need to specify the data type it uses and how big it will be
It depends a bit on what kind of array you're talking about. When you construct a Python array.array, you do indeed set the type, but it doesn't have a fixed size. You can still append to it just like a regular list. A numpy.ndarray also has a set type, but its size is fixed once it's created; numpy.append returns a new array instead of growing the existing one.
There's another difference that's more important
Because that was my understanding of Arrays and then I saw the following code
a = arr.array('d', [1.1, 3.5, 4.5])
and there was no specification of size
Yes, this sets the type (first argument) and initializes it with three elements (from the iterable that's provided as the second argument)
But, you can still append to it
a.append(9.2)
It's just like a python list: a = list([1.1, 3.5, 4.5])
It's just that if you were to look at how the two are defined in memory, then you'd see something entirely different
yeah
thats what I am getting at
so isn't that type of array more of a linked list?
No
A Python list doesn't actually contain the values in memory, but rather references to values. In this case, you'd have three float objects somewhere in memory, 1.1, 3.5, and 4.5, and the list object holds references to those objects
So, the list itself doesn't really care for the size of objects, because it only ever holds references to objects "somewhere" else
hm
An array.array, in contrast, doesn't store references, but rather the binary representation of the values itself
So, if you were to look in memory, you'd really see those three floats represented as bits, sequentially, in an array
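To make the contrast a bit more concrete, a small sketch using only the standard library. The variable names are arbitrary, and the object size in the comment is what a typical 64-bit CPython build reports, not a guarantee:

```python
import array
import sys

values = [1.1, 3.5, 4.5]

# array.array stores the raw 8-byte doubles back to back.
a = array.array("d", values)
a.append(9.2)                      # no size was declared; it grows like a list
raw = a.tobytes()                  # len(a) * a.itemsize bytes

# A list stores references; each element is a full Python float object elsewhere in memory.
lst = list(values)
lst.append(9.2)
per_float_object = sys.getsizeof(lst[0])   # size of one float object (typically 24 bytes)

# Typed storage also means the array rejects other types, while the list accepts anything.
lst.append("text")
rejected = False
try:
    a.append("text")
except TypeError:
    rejected = True
```

So the array holds four doubles in 32 contiguous bytes, while the list holds references to objects that each carry their own overhead.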
Yes
so there are 3 different types of ways to store lists/ collections of data
In a linked list, there is no sequence of values in memory, one after the other. Rather, you have an element that holds a reference to the next element, which holds a reference to the next, and so on
So, you've got separated things in memory, but they reference each other in a linked chain, so a linked list.
ok, so the reference is stored with the pieces of data?
Yes and each value just references the next one (singly linked list) or both the previous and the next one (doubly linked list)
But, to get to the third item, you need to start with the first, get the address of the second, then look at the second to get the third
Since the first doesn't know where the third is, it only knows where the second is
jeez. I didn't think there were this many caveats to something as simple as storing packets of data. That's so interesting
The problem with a regular list or array.array is that when you want to insert or remove something from the middle or the beginning, you need to move all the values after it, since you can't have a gap
With a linked list, if you want to cut out, say the sixth element, you only need to change the address the fifth refers to: let it point to what was the seventh instead of the sixth
This makes for very fast "popping" at both ends and that's why the Python deque is such a doubly linked list.
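The relinking described above can be sketched like this. Node, build, remove_at and to_list are hypothetical helper names written for this example; they are not how the Python deque is implemented internally:

```python
from collections import deque

class Node:
    """One element of a singly linked list: a value plus a reference to the next node."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def build(values):
    """Build a linked list from an iterable and return its head node."""
    head = None
    for value in reversed(list(values)):
        head = Node(value, head)
    return head

def remove_at(head, index):
    """Remove the node at `index` by relinking its predecessor; return the (possibly new) head."""
    if index == 0:
        return head.next
    prev = head
    for _ in range(index - 1):     # walk node by node; there is no direct jump to `index`
        prev = prev.next
    prev.next = prev.next.next     # skip over the removed node
    return head

def to_list(head):
    """Collect the values by following the chain of references."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = build([1, 2, 3, 4, 5, 6, 7])
head = remove_at(head, 5)          # cut out the sixth element
remaining = to_list(head)

# The standard-library deque gives the fast pops at both ends mentioned above.
d = deque([1, 2, 3])
first, last = d.popleft(), d.pop()
```

Removing the node itself is a single reference change, but finding it still means walking from the head, which is the trade-off against index-based access in a list or array.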
Can I recap if I got everything right?
Sure
so there are more or less 3 diffrent ways to store collections of data
An array just takes some space in memory and then stores them one after another
A python list, doesn't store them in the same place, but stores the location of where the actual data is stored
And a linked list is stored "all over the place" with references from each element to the next one?
And sometimes from the current one to the previous one as well (for a doubly linked list)
But, yeah, that sounds about right.
nice
Im going to get to 10 minutes from this stuff alone :D
I will have to hand in some sources for where I know all of this stuff from. Do you know any by chance?
No, I'm sorry, I don't have any handy
ok
oh and one more thing. Those things in linked lists that reference to the next/previous element, are these called pointers or am I getting things mixed up?
Yes,
Although we don't usually use that term in Python, as Python abstracts that layer away
Yeah, but my exam requires me to dig a little deeper into the stuff we covered in class
If you're interested, the implementation is here: https://github.com/python/cpython/blob/master/Modules/_collectionsmodule.c
Thank you. You are a great help...
Another question, what advantage does the method bring in which we store the address of the data in a list? Wouldn't it itself need to store the data in some type of array or linked list? Why not skip this extra step?
Because it's faster to go to 6 than to 1,2,3,4,5,6
What's the difference between tensorflow and keras?
I mean what are they, algorithms or libraries.?
Tensorflow is a c++ library for ml related algorithms with a python frontend
An ugly one
Or rather hard to use for the average guy
Keras is part of tf (tf.keras) and exposes an easier to use and more simplistic python API
@lapis sequoia
Why is keras separate while importing?
One can just say import keras
Without importing tf
Keras can be used for more than just tf
Like for most of neural networks algorithms
And I'm running TF in Spyder, and I can't import the MNIST dataset.
Does it need internet to work with TF?
Would this be an appropriate channel for help with pandas?
or are the general help channels where I should go?
Go ahead, I don't know how appropriate it is but you're hardly alone in doing it.
pandas fits here I believe
Gotcha
I'm trying to do some data processing with pandas in Python, but when I give the command to split my data into the dependent variable and independent variable I'm having an issue. Code below:
```python
import matplotlib.pyplot as plt
import pandas as pd

# Import Dataset
dataset = pd.read_csv('CyberData.csv')
x = dataset.iloc[:, :-1 ].values #not viewable object?
y = dataset.iloc[:, 44].values
```
```python
code
```
Thanks. Both my dataset variable and my y variable are correct. But my x variable turns into an n-dimensional array (ndarray object of numpy module, as my IDE calls it). I've no idea why it is turning this into that instead of a 2D array like my y variable.
:-1 -> -1?
I can provide the file if you would like as well
But I need every column besides the last column
well
If you're slicing over 3 dimensions you'll get an ndarray
your second line slices over 2 dimensions
If I just do -1 it only gives me the last column, essentially making x the same as y
what's the shapes of the outputs
dataset.head():
```
0  Copper  24.2  43  ...  -2370  -262.368  -2570
1  Copper  24.2  44  ...  -2320  -263.158  -2580
2  Copper  24.2  45  ...  -2310  -263.158  -2580
3  Copper  23.7  43  ...  -2380  -262.632  -2580
4  Copper  23.8  43  ...  -2380  -263.158  -2590

[5 rows x 45 columns]
```
y is just a 1x 10 array, while I need x to be a 44x10
dataset.iloc[:, :-1 ].shape
(10,)
right, so one is a one-dimensional array, and one is a two dimensional array
But when I split it into x, the type listed is different than y and I can see a preview of y, but not x
whereas in the videos I was watching, they did what I'm doing (with different but similar data) and x was the same as y
this just seems like an ide issue. what ide?
Spyder
It may be, I'm not really sure. They are using spyder in the video I'm watching without an issue, so I can only assume it's something on my end
well it's easy to check I think. generate a random 2D numpy array and see if you can view it
Yep, I can
So as far as data types go, 1D and 2D arrays say either int64 or float64 (respectively), while my x variable just says 'object'. It also won't list the data inside the sidebar like a 1D or 2D does, instead it says 'ndarray object of numpy module' which tells me its like a 3D array or something?
Seems like a Spyder issue. And seems your code should work fine despite it.
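A guess at what is going on with the 'object' type, sketched on a tiny made-up frame (the column names and values are invented, only the shape of the problem matches). The first column in the head() output above is text, and mixing text with numbers makes NumPy fall back to dtype=object, which may be why the variable explorer shows x differently:

```python
import pandas as pd

# Tiny made-up stand-in for the CSV: a text column, numeric features, and a target in the last column.
dataset = pd.DataFrame({
    "material": ["Copper", "Copper", "Copper"],
    "temp": [24.2, 24.2, 23.7],
    "reading": [43, 44, 43],
    "target": [-2570, -2580, -2580],
})

x = dataset.iloc[:, :-1].values   # every column except the last -> 2D array
y = dataset.iloc[:, -1].values    # only the last column -> 1D array

# Mixing strings and numbers forces NumPy to fall back to dtype=object for x.
x_numeric = dataset.iloc[:, 1:-1].values   # numeric columns only -> float64
```

x is still an ordinary 2D array here, so the slicing itself is fine; only its dtype differs from y.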
That doesn't really make sense, as in the video his array has multiple types and works just fine. Regardless, I suppose I'll have to switch. What IDE do you all recommend?
In the thread they mentioned it might be a problem with a specific Spyder version. Maybe check that you're using the latest one?
I just downloaded it a few weeks back. So I would think it should be newer than whatever version they were using in 2018 lol
I'll check
I use PyCharm for the record. ¯\_(ツ)_/¯
I've tried like 5 different ways of updating this IDE. It refuses to change, so PyCharm it is
I'm trying to fit a neural net on my laptop, which has an NVIDIA GTX 1050
For now, I have ~36k training images, and I'm using VGG19 and doing some transfer learning so I don't have to train the net from scratch
Problem is, when I run the code I get a resource exhausted error
Is this reasonable, given the amount of data I have, my GPU, and the RAM (24 GB)?
Do I have to acquire a dedicated machine for this, or is there a way to optimize my data/training procedure?
The issue here is that your GPU only has a limited amount of memory itself (GPUs have dedicated, so-called VRAM, which at least on normal ones doesn't exceed single-digit GB). What you can do to stop this error from happening with limited resources is split your data up into so-called mini-batches: instead of fitting on all 36k images at once, fit on only as many images as your GPU can hold at a time, then on the next mini-batch, and so on until you've gone through every image, and then continue with the next epoch @haughty wind
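The mini-batch idea above, sketched in plain Python. iter_minibatches and train_on are hypothetical names made up for this example, and the batch size of 32 is just an assumed starting point; in practice the batch_size argument of the framework's fit call usually does this for you:

```python
def iter_minibatches(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover range(n_samples) in chunks of batch_size."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples = 36_000      # roughly the dataset size mentioned above
batch_size = 32         # assumption; lower it if the GPU still runs out of memory

batches = list(iter_minibatches(n_samples, batch_size))

# In a training loop each pair would select one slice of the data, e.g.
# for start, stop in batches: train_on(images[start:stop], labels[start:stop])
# (train_on is a hypothetical placeholder, not a real library call).
```

One pass over all the batches is one epoch; only one slice has to sit in GPU memory at a time.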
You're probably running out of vram. Try monitoring your vram usage using nvidia-smi and reduce your batch size to 1
(for now)
If it still can't fit on your VRAM, you'll need to use a different GPU
(if VRAM is actually the issue that's highly unlikely as the model you have doesnt have a version with less than 2 GB ....and youd really have to have a big image to exceed that with one image at a time)
Right now my images are somewhere around 70x70 pixels or so
yeah if VRAM is actually the issue the batch size fix should work out perfectly
I dropped my batch size to 4 as well as samples per epoch
I forgot I also dropped my epochs to 5 so my net understands nothing, but at least it made it through
cool
keep increasing the batch size until it doesn't work anymore, then decrease it by a little and you should be ready to go
ok cool, thanks for the help
what's the recommended approach for deploying ML models? specifically just a normal pickle/joblib? other than using AWS, Azure, Google Cloud
Flask + Celery + Redis?
Hey
I have a dataset
Can anyone help me in Modifying it
like every row is in form of id;message;label in one column
i need them in different column
Is your dataset stored in a csv?
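In case it helps, a sketch of two ways to split that kind of data, on made-up sample rows. It assumes the message text itself never contains a ';', and the column name data is invented:

```python
import io
import pandas as pd

# Made-up sample in the layout described above: one column holding "id;message;label".
raw = pd.DataFrame({"data": ["1;hello there;ham", "2;win a prize now;spam"]})

# Split the single column on ';' into separate columns.
split = raw["data"].str.split(";", expand=True)
split.columns = ["id", "message", "label"]

# If the data lives in a file, reading it with the right separator avoids the problem.
csv_text = "id;message;label\n1;hello there;ham\n2;win a prize now;spam\n"
from_csv = pd.read_csv(io.StringIO(csv_text), sep=";")
```

With a real file, the io.StringIO part would be replaced by the file path.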
Hey in my data set i have two feature, one of type datetime and one double values. I want to predict double values. Is it possible?
Has anyone here done a proper live company project?
I have started to work in a small startup, it's basically a online event ticketing platform website.
In order for me to join this company they have given me a task to come up with a model that can help their business.
I want to know how can I use machine learning or any similar technology in python to create a live project which can be deployed to help that company.
Any ideas?
Come up with an idea or create and deploy a model to carry it out?
I need help with keras model
I'm getting "must be from the same graph as Tensor" when I'm trying to use it from python, loading a reshaped cv image
@odd osprey could you show us your code and your exact error?
ahm, just fixed finally -_-'
my issue was actually very stupid - I got the singleton wrong in python :/
@earnest prawn Well well well
what
ive been here for ages lol
I just joined
server in nice structure
The Authentication is nice
Best to avoid spam
Does anyone use the vim mappings in jupyterlab?
I'm trying to figure out how to do some remapping.
Hello anyone here have studied the sugarscape or the Think Complexity book?
I have a problem i'm not sure how to solve
there's labelled data, random garbled text sequences and they're split into four classes.. how would I go about running a classifier on them
they're not meaningful words but all the words in each sequence are five characters, all alphanumeric
so like example: row 1: sdasa tyghs ileis 1 row 2: ffasa byghs glets 2
?
1,2 being labels
meaningless meaning they are meaningless but have a certain pattern
or meaningless meaning you just dont know the pattern
or meaningless meaning literally meaningless
because the last part wont really let you do much with the data
meaningless meaning I probably don't know the pattern
or what it's about
also, the words are separate by comma
if you think there might be a pattern and you just dont know it, usually if it's dealing with words or characters it's more NLP based
fair warning i am not data scientist
but usually something related to a multi class classifier like naive bayes is a possible starting point
or if you can use some type of unsupervised learning it might reveal a pattern
i would look up like NLP algorithms that can do multi class classification
then just create your normal train, test, validation sets
there is also some for document classification, like if you have 30 different text corpuses from different letters you are reading, it can detect which ones belong to which areas
but that's more unsupervised, you said they have classes already
the hints do say that the words are from a text document
but it is a classification task
I will use unsupervised for eda
try something like naive bayes first
then move up to logistic regression + word2vec or something like that
then if you need even better results, move towards something more advanced
honestly logistic regression works very well for like sentiment analysis and such, even text classification in general
but obviously wont be as good as something like a NN
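As a rough sketch of the "naive bayes first" suggestion: treat each comma-separated token as a word, count them, and fit a multinomial naive Bayes model. The toy sequences and labels below are invented, and the tokenizer assumes tokens are separated by commas.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: comma-separated 5-character tokens and a class label.
sequences = [
    "sdasa,tyghs,ileis",
    "ffasa,byghs,glets",
    "sdasa,ileis,tyghs",
    "glets,ffasa,byghs",
]
labels = [1, 2, 1, 2]

# Bag of words: each comma-separated token becomes one count feature.
model = make_pipeline(
    CountVectorizer(tokenizer=lambda s: s.split(","), token_pattern=None, lowercase=False),
    MultinomialNB(),
)
model.fit(sequences, labels)

# Predict on two unseen (also made-up) sequences.
pred = model.predict(["tyghs,sdasa", "byghs,glets"])
```

Swapping `MultinomialNB()` for `LogisticRegression()` in the same pipeline is one way to try the next step mentioned above.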
thanks man
@lapis sequoia what kind of data is this?
it's "garbled" but is there any meaning to it?
if it's truly random then you won't have much luck with this
@desert oar i'm not sure where it's from.. I was only told "training data is tab separated with a label and tokenized sequential data..the tokenized data are actually words of a text document. Each token is a length of 5 (alpha-numeric letters)."
but it is a classification task
I see
So they are words but you arent allowed to see the words? Interesting
Yes bag of words classification seems to be the best (maybe only) option
yeah.. and there's 4 classes, so I'm guessing if there are common words between labels, I should drop them and check accuracy of classifying again
they might be stop words, etc
Yeah although i wouldnt drop all common words
Could use eg chi square or mutual information for feature selection
How many records and how many unique words?
i'm not sure, let me reduce it and check
reduce it to a csv I mean, with the labels removed
I wonder if it makes sense to check if there are duplicate rows (same sequences repeated) but i'm not able to view it as rows next to each other when I do :
How is your data provided?
the data is in a tsv, label separated by sequence of words
And are they actually sequential or just a bag of words already
Use df['sequence'].str.split(',') to get a column of lists of words
No need to save again
However you can save as Parquet format in the future to avoid having to split again every time you load
I have the column of words now, it's a series and each series seems to be now converted to a list
I might export to csv and check it in something like answer minor to get quick eda
Each element of the series will be list yes
You can use .map(len) on that now to get the length of each list, etc
they're all varying lengths, I see 19 and 45.. etc.
63799 count, min 14, max 204, mean 44
@lapis sequoia .tolist() and .map() are Series methods
its much faster to iterate over a list than a series
Btw that was wrong anyway, it should be Counter(chain.from_iterable(df['sequences_split'].tolist()))
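Putting the pieces from the last few messages together, a minimal sketch (the two-row frame is made up; the column names follow the ones used above):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the real TSV: a label and a comma-separated sequence.
df = pd.DataFrame({"label": [1, 2], "sequence": ["sdasa,tyghs,ileis", "ffasa,tyghs"]})

# Column of lists of words.
df["sequences_split"] = df["sequence"].str.split(",")

# Number of words per row.
lengths = df["sequences_split"].map(len)

# Word frequencies over the whole column; chain flattens the lists lazily.
word_counts = Counter(chain.from_iterable(df["sequences_split"].tolist()))
```

`len(word_counts)` then gives the number of unique words, and `word_counts.most_common(20)` is a quick way to spot candidate stop words.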
neat:)
im just cleaning up my code.. seeing if some of the test sequences are already in train, etc.. haven't gotten to training part yet
I am trying to implement Naive Bayes classification on the Iris dataset, but the distributions of some features are not normal, so how do I proceed? Should I just ignore it and use a Gaussian distribution?
@celest moss I think you assume normality for each feature conditioned on each class. So if the PetalLength for each class looks somewhat normal that would be good enough.
@pptt, Thank you ! Implemented with normal distribution and seems like you are right, I got an accuracy of 93.33 %. So you are saying that P(feature_vector | class) must be normally distributed right ?
Yes, that is the assumption we are making at least. I'm sure you could often get good enough results even if that is not entirely the case.
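For reference, a quick way to sanity-check this with scikit-learn's built-in Gaussian naive Bayes. This is a sketch using 5-fold cross-validation, not the setup used above, so the exact accuracy will differ from the 93.33 % reported.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB models each feature as normal *within each class*,
# i.e. P(feature | class), not the marginal distribution of the feature.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
mean_accuracy = scores.mean()
```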
Thank you very much !
np
@heavy crow @digital meteor @digital crescent
The data science channel could use some of this discussion
hi
okay so what I am saying to do it
fit a polynomial to it
in a linear regression
go ahead and do it. i cant get any good results with it
i.e the cubic fit that you already did
@heavy crow the thing you should do to make sure that your LSTM actually performs well is to compare it to a simple model. So for instance, split the data at several points, train an LSTM on the data prior to each point, and fit a linear model to the same data. Then compare how wrong their predictions are on the next three months.
What Apex is trying to tell you is that the LSTM likely isn't doing better than the linear trend in such a scenario.
the obv hits almost all the points in training and hits most points in validation
Also if you're into time series have a look at this free e-book https://otexts.com/fpp2/ It's to the point and a good intro.
essentially a summary of what I have been trying to say is that
I think deviations from that quadratic curve are random noise
and when I say random noise what I mean is "a random variable that is normally distributed with constant variance"
that's how it was defined in my uni lectures anyway
ok cool but that doesnt help at all when you are trying to predict values not trends
you may be able to improve the quadratic fit with some time-lagged variables i.e. autoregression
but I don't think the beta-coefficient on time-lagged variables will be that big in this case
well what the quadratic fit allows you to do
is predict future values using the assumption that the quadratic trend will continue
so essentially if you are using the quadratic fit to predict a value 3 months in the future, you can do that by carrying on the red line and taking the y-value from that date
this is essentially just the regular multivariate regression method
what I don't think is that you can make a model (including non-regression-based models) that can predict in the future all those bumps and dips. I don't think its possible to accurately predict those with the data that you have
because, as I was saying, the bumps and dips don't seem to have much of a pattern, they just look like random noise on top of the quadratic curve
(random noise being a random variable that is normally distributed, mean of zero and constant variance)
so the way I was taught multivariate analysis was that in this situation you can't capture that random noise, and that you should settle for a polynomial fit, maybe with some lagged variables (autoregression)
even if there is an underlying pattern in the deviations from the quadratic, it hasn't repeated itself that many times. This means that if you do extend your model to try to capture the deviations from the quadratic (under the assumption that they are non-random) the model will have a lot of trouble capturing the pattern.
so its possible that a machine learning algo or other recursive algo will capture those deviations from quadratic a bit better, if they are non-random, but it won't beat the quadratic regression by much even if it does. Especially if lagged variables are introduced to the regression because they have some limited ability to capture patterns themselves.
that's essentially what I was trying to say earlier
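To make the "carry on the red line" idea concrete, here is a small sketch on synthetic data. The curve, noise level, and monthly time axis are all invented; it only illustrates fitting a quadratic and reading a forecast 3 steps ahead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series: a quadratic trend plus normally distributed noise
# with mean zero and constant variance.
t = np.arange(120, dtype=float)  # e.g. months
true_curve = 0.002 * t**2 - 0.1 * t + 5.0
y = true_curve + rng.normal(0, 0.2, size=t.size)

# Least-squares quadratic fit (highest-degree coefficient first).
coeffs = np.polyfit(t, y, deg=2)

# Forecast 3 months past the last observation by extending the fitted curve.
t_future = t[-1] + 3
forecast = np.polyval(coeffs, t_future)

# Deviations from the fit; if they look like noise there is little left to model.
residuals = y - np.polyval(coeffs, t)
```

Plotting `residuals` against `t`, or looking at their autocorrelation, is one way to check whether lagged variables would add anything.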
@heavy crow
Because I'm procrastinating and should be doing useful stuff I did this instead. I wrote a short script that tests your model (the LSTM for instance) vs a linear model with time series cross validation. You just have to implement two functions. One that extracts the features you want from x_train and trains your model from scratch. And one that extracts the features you want from x_test and predicts on it. Feel free to try it out. If you are interested in predictive power, comparing predictions might be a good idea
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
df = pd.read_csv("water-levels.csv")
# Baseline model, we will try to beat this. (if we can you were right and the "noise" was not noise!)
lr = LinearRegression()
# Your alternative model, implement a train and predict function for your model.
def train_alt_model(x_train, y_train):
    # Your code here
    return alt_model
def predict_alt_model(alt_model, x_test):
    # Your code here
    return predictions
# Set up experiment.
tscv = TimeSeriesSplit(max_train_size=None, n_splits=26)
results_df = pd.DataFrame()
# Time series cross validation.
for train_index, test_index in tscv.split(df):
    x_train = df.date[train_index]
    y_train = df.waterlevel[train_index]
    x_test = df.date[test_index]
    y_test = df.waterlevel[test_index]
    results_df_tmp = df.iloc[test_index].copy()
    # Fit and predict with linear regression
    lr.fit(x_train.values.reshape(-1, 1), y_train)
    results_df_tmp['linear_regression_prediction'] = lr.predict(x_test.values.reshape(-1, 1))
    # Fit and predict with alternative model
    alt_model = train_alt_model(x_train, y_train)
    results_df_tmp['alternative_model_prediction'] = predict_alt_model(alt_model, x_test)
    # Update results data frame
    results_df = pd.concat([results_df, results_df_tmp], axis=0)
# Compute t statistic, we assume here that both model's residuals are standard normal dist.
linear_regression_residuals = (results_df.waterlevel-results_df.linear_regression_prediction)**2
alternative_model_residuals = (results_df.waterlevel-results_df.alternative_model_prediction)**2
residual_deltas = linear_regression_residuals - alternative_model_residuals
t_stat = np.mean(residual_deltas)/(np.std(residual_deltas)/np.sqrt(len(results_df)))
# See if we beat linear regression.
if t_stat >= 1.96: # We're assuming here that len(results_df) > 1000
    print("Congrats! Your model beats linear regression!")
elif t_stat > 0:
    print("Your model predicts slightly better than linear regression but the improvement is not statistically significant.\n"
          "So we can not say for sure that you beat linear regression.")
else:
    print("Your model predicts worse than linear regression. Did you even try?")
plt.plot(df.date, df.waterlevel)
plt.plot(results_df.date, results_df.linear_regression_prediction)
plt.plot(results_df.date, results_df.alternative_model_prediction)
plt.legend(['waterlevel', 'linear reg', 'alt model'])
plt.show()
n_splits is set to 26, which is equivalent to predicting a year ahead into the future at each split. Set it to 100 to predict 3 months ahead, although this might take some time. If it takes too long you can set it to 10, which is predicting 2.5 years into the future. Or you could set max_train_size to a smaller value to cut down on the time.
Also if you want to compare linear and quadratic regression I made the two functions you have to implement:
def train_alt_model(x_train, y_train):
    alt_model = LinearRegression()
    x_train_reshaped = x_train.values.reshape(-1, 1)
    # Features are [x, x^2], so the linear model fits a quadratic in x
    x_train_full = np.stack([x_train_reshaped[:,0], x_train_reshaped[:,0]**2], axis=1)
    alt_model.fit(x_train_full, y_train)
    return alt_model
def predict_alt_model(alt_model, x_test):
    x_test_reshaped = x_test.values.reshape(-1, 1)
    x_test_full = np.stack([x_test_reshaped[:,0], x_test_reshaped[:,0]**2], axis=1)
    predictions = alt_model.predict(x_test_full)
    return predictions
Good luck!
hi
HI Guys may i know what error is this ? i cant do the prediction on my data it showing me keyError
It's says pretty clearly it's a keyError. You are asking where in the index is the key, ''20150217". But that key isn't in the index and so you get an error.
@polar acorn i try to check the index but it show this
The result doesn't have a index. When you call result.predict somewhere inside that function it checks some index for your key. I don't know what index that is or how the insides of result.predict looks. Can you try to remove ''20150217" and only predict on "20170301"?
@polar acorn it show another error now
Ah I see now, those are the start and end values. Okay you should put that value back. Hmmm.
in a deadlock now, can't figure out which part has the error
What is the best way for real time logo detection ? ANy suggestion?
I have a big dataframe to clean and some columns have
'200, 334'
'200/334'
'200/ 334'
ADJ:Enc'
>Faul```
first i want to clean them by a space, afterwards i want to split them into two different columns to train the model
@quartz monolith
This should work, replace col with your column name.
import pandas as pd
df = pd.DataFrame({'col':['203,608', '200, 334', '200/334', '200/ 334']})
df.col = df.col.str.replace(' ', '')
# Raw string avoids invalid-escape warnings; [,/] matches a comma or a slash
df[['col_split_1', 'col_split_2']] = df.col.str.extract(r"(\d+)[,/](\d+)")
@polar acorn Thanks that exactly what I'm looking for
np
What's the best method for dealing with NaN values in machine learning? Dropping the rows, converting them to an int like 999, or just replacing them?
if i drop my rows i will lose the better result so imo that's not an option
It depends on what the missing values mean
Sometimes it's better to "impute" the missing values
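A minimal imputation sketch with scikit-learn, on a made-up array. Median imputation is just one choice here, not a recommendation for your data; `add_indicator=True` keeps a flag for which values were missing, in case missingness itself is informative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one missing value in each column.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

# Replace NaN with the column median and append one 0/1 "was missing"
# column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X)
```

Fit the imputer on the training split only and reuse it on the test split, otherwise test information leaks into the fill values.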
How do I display all the values in the column? I use pyspark
Why dont you round the values? @hollow quartz
ok i try
@hollow quartz also string formatting might help, if that applies in your case
I'm training a DecisionTreeClassifier and I'm really baffled... Why does a 5500-row dataset perform better than a 130k one? Does somebody have experience with DTs?
              precision  recall  f1-score  support
weighted avg  0.57       0.59    0.57      1383     (small)
weighted avg  0.56       0.58    0.57      33584    (big)
your smaller dataset might be overfitting due to less samples that are more representative of your actual dataset
also I think DTs are very much influenced by the data size more so than others
also be sure you dont have a class imbalance issue somewhere
just a few things off the top of my head
Hi, I am trying to deploy a model to Google AI Platform and running into some generic errors, was wondering if anyone has run into the error I am getting
Create Version failed. Bad model detected with error: "Failed to load model: Could not load the model: /tmp/model/0001/model.joblib. 47. (Error code: 0)"
is anyone alive
I'm using sns.distplot to compare two features of a training dataset!
does it work only for binary datasets?
do the histograms get corrupted if the dataset is too large?
another question
how important are system features? if I'm using only cloud platforms?
like colab
https://gyazo.com/db268fec4c046c6203f94e0b7799a212
i feel alive, but i might be biased, need to collect more data
Hi I want to normalize a column. I can normalize a column of Vector but not a column of value
what do you mean
@lapis sequoia the difference between featurescolumn and scaled_featurescolumn
which column are you having trouble normalizing
conso_total column because it is not a vector
what does conso total represent
it's my prediction target
my code is here df_train.withColumn('Scalerconso_total', ((df_train.conso_total)-_min)/(_max-_min))
the problem is here AnalysisException: "Can't extract value from conso_total#16: need struct type but got double;"
sorry it's like 1 am here.. I'll get back to you in the morning
ok thanks
I am currently focusing on making a "Python Bot" for an online flash game. I've been looking it up on the internet, and most results say that I have to use AutoPy in order to handle the automated mouse clicks and keyboard input. But I'm having problems sniffing the output that comes from the Flash game to my IP address, and if I successfully sniff the packets, how can I use AutoPy in Python to make the bot capture a screenshot every 0.5 seconds? (the bot should be able to do it because it can then convert the captured screen into a numpy array for analysis)
I think screenshots can be taken with the ImageGrab module from PIL.
Keyboard press and Mouse clicks are also supported by the module: pyautogui
LabelEncoder with never seen before values someone encounter this kind of stuff when im transforming my labels?
found the answer maybe but cant implement it into my code:
https://stackoverflow.com/a/52505373
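One possible workaround, which is not necessarily what the linked answer does: for input features, `OrdinalEncoder` can map categories it never saw during fit to a sentinel value instead of raising. The colour values below are made up, and this needs a scikit-learn version that supports `handle_unknown="use_encoded_value"`. For the target labels themselves, an unseen class usually means the training data has to be extended.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical feature; OrdinalEncoder expects a 2D array.
train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["green"], ["purple"]])  # "purple" was never seen in training

# Unknown categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)
codes = enc.transform(new)
```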
Hi guys, sorry to bother you again but my mnist 1 hidden layer neural net is still refusing to work. I'm completely stumped
here's the code, thanks in advance to those who look at it
although i have helper installed, this line of code is giving me
https://gyazo.com/8c378d412bb0fe0a4bdde8fd1db10475
can anyone help me solve that
is anyone alive
I need to evaluate my model on a suitable metric, but I'm wondering if k-fold CV is enough..
usually I figure out a good heuristic that's specific for the problem im solving, but i'm tired and facing a deadline..
I guess what im asking is.. how do I cover my ass
lol
I saved my model, how do I know the properties of one of my layers
I want to find the filter size (kernel size)
I got it
.get_layer(layername).kernel_size
How to inverse transform on the True label and Predicted label?
`# Accuracy per Class
# Confusion matrix heatmap between RootCause
y_test_it = new_le.inverse_transform(y_test)
y_pred_it = new_le.inverse_transform(y_pred)
conf_mat = confusion_matrix(y_test_it, y_pred_it)
f, ax = plt.subplots(figsize=(50, 35))
#cmap=cmap
conf_mat_normalized = conf_mat.astype('int') / conf_mat.sum(axis=1)[:, np.newaxis]
mask = np.zeros_like(conf_mat)
sns.heatmap(conf_mat_normalized, vmax=1, center=0,
square=True, linewidths=2,
annot=True, annot_kws={"size": 15}, fmt=".1f", mask=mask)
plt.ylabel('True label')
plt.xlabel('Predicted label')`
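A stripped-down sketch of the same idea without the plotting, in case it helps to check the numbers first. The encoder name `le` and the three class names are made up; passing `labels=le.classes_` fixes the row/column order so it matches the tick labels you put on the heatmap.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder and encoded labels standing in for new_le, y_test, y_pred.
le = LabelEncoder().fit(["disk", "network", "power"])
y_test = le.transform(["disk", "disk", "network", "power"])
y_pred = le.transform(["disk", "network", "network", "power"])

# Map integer codes back to the original class names.
y_test_names = le.inverse_transform(y_test)
y_pred_names = le.inverse_transform(y_pred)

# Rows are true labels, columns are predicted labels, both in le.classes_ order.
conf_mat = confusion_matrix(y_test_names, y_pred_names, labels=le.classes_)

# Normalize each row so it sums to 1 (per-class recall sits on the diagonal).
conf_mat_normalized = conf_mat / conf_mat.sum(axis=1, keepdims=True)
```

With seaborn, `xticklabels=le.classes_` and `yticklabels=le.classes_` on `sns.heatmap` would then show the names on the axes.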
hey guys I'm not sure if i should post here or help
im following a video and the n_jobs =1 in output
not sure how that can be
n_jobs is just the number of parallel CPU jobs scikit-learn uses; it doesn't change any properties of your linear model. If you have n_jobs=None, then it defaults to 1 anyway, so there's no difference at all @tight sparrow
If your output is different from the video, then you might have a slightly different scikit version or something, but that's fine probably
Awesome thanks for the clarification
Hey team, I have a 3d numpy array (e.g. 218x100x2502). I run np.argmax(..., axis=1) to get indices along the middle axis. I want to index another array with the same shape, using these indices. Essentially:
for ii in range(a.shape[0]):
    for kk in range(a.shape[2]):
        out[ii,kk] = a[ii, indices[ii,kk], kk]
I thought maybe np.take would do this (I tried np.take(a, indices, axis=1) but this is definitely not the right thing, since it hangs Python...)
Looks like this: https://stackoverflow.com/questions/32089973/numpy-index-3d-array-with-index-of-last-axis-stored-in-2d-array is almost what I want. Couldn't get the right search terms!
np.take_along_axis(a, indices[:,None,:], axis=1).squeeze() -- easy! thanks team
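For anyone finding this later, a small self-contained check that the one-liner matches the double loop. The shapes are shrunk to 4x5x6 and the random arrays are stand-ins; `squeeze(axis=1)` is used so other length-1 axes would not be dropped by accident.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((4, 5, 6))  # array the argmax is taken over
a = rng.random((4, 5, 6))       # array to index, same shape

# Index of the max along the middle axis, shape (4, 6).
indices = np.argmax(scores, axis=1)

# Re-insert the middle axis so indices broadcasts against a, then drop it again.
out = np.take_along_axis(a, indices[:, None, :], axis=1).squeeze(axis=1)

# Reference result from the explicit double loop.
expected = np.empty((a.shape[0], a.shape[2]))
for ii in range(a.shape[0]):
    for kk in range(a.shape[2]):
        expected[ii, kk] = a[ii, indices[ii, kk], kk]
```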
yo