#data-science-and-ml


lyric canopy
#

@silk forge Are you familiar with the term "mean squared error"?

silk forge
#

that's exactly what i don't understand

#

@lyric canopy

#

Regression is so complex

#

Classification way easier

lyric canopy
#

Your model predicts values, right?

#

Let's call those predicted values y-hat (ŷ).

#

We also have the "true" values, let's call those y

#

Now we want to compare the predictions our model makes to the true values to see how well it does

#

That's what the formula does

#

It computes the difference between the predicted values and the true values, ŷ - y, and we call that the error of the model

#

Or, how wrong each prediction is

#

We then square those errors to get the squared error between each individual prediction and "true" value

#

And then we calculate the mean of those squared errors by summing them together and dividing them by the number of predictions we have

#

And that's what the formula does here

#

So, it's the mean squared error
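
The steps described above can be sketched in plain Python. This is only an illustration: the function name and the sample numbers are made up, and the extra halving at the end mirrors the 1/2 factor from the formula under discussion.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Error of each prediction (y-hat minus y), squared.
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    # Mean: sum the squared errors and divide by the number of predictions.
    return sum(squared_errors) / len(squared_errors)


y = [3.0, 5.0, 7.0]      # "true" values (made-up numbers)
y_hat = [2.5, 5.0, 9.0]  # model predictions (made-up numbers)

mse = mean_squared_error(y, y_hat)
half_mse = mse / 2       # the 1/(2m) variant of the same quantity
print(mse, half_mse)
```

A lower value means the predictions sit closer to the true values, which is why many fitting algorithms try to minimise it.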

silk forge
#

Oh

#

so they try to find accuracy

lyric canopy
#

In this case, there's an additional 1/2 involved (that's why it's 1/2m), but that's just to make differentiating easier

#

It doesn't have influence on the optimization

silk forge
#

but why 1/2

#

?

lyric canopy
#

Just to make a term drop out when you take the derivative

#

It doesn't influence the minimum/optimization, so it doesn't matter
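
Why the 1/2 is convenient can be seen by writing the derivative out. The symbols here are assumptions for illustration: J is the cost, m the number of predictions, and θ stands for any parameter of the model.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2
\qquad\Longrightarrow\qquad
\frac{\partial J}{\partial \theta}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)
    \frac{\partial \hat{y}_i}{\partial \theta}
```

The 2 that comes down from the square cancels the 1/2. Multiplying a function by a positive constant does not move the location of its minimum, so the fitted parameters are the same either way.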

#

Now, think of your model: If the MSE is some kind of measurement for how accurate the model is, then it would be a good thing to minimize it

silk forge
#

so i dont need to care about that right?

lyric canopy
#

And that's what a lot of algorithms do: Find the model with the lowest MSE (or another measure of accuracy)

silk forge
#

also is there any good linear regression tutorials?

#

that i can learn from

lyric canopy
#

I'm not sure. I only know of a really good book, but that book probably contains more than what you're looking for

#

Fox (2015) Applied Regression Analysis and Generalized Linear Models (third edition)

silk forge
#

hmm

#

thanks

#

ill see if i can get the book

#

yo wait

#

@lyric canopy

#

in an RMSE

#

do we need to do formula stuff in every point

#

i mean

#

in all these green points

lyric canopy
#

do you see that red line in there labeled error? That's the ŷ - y of earlier

#

The difference between the predicted value of the model (represented by the line), and the actual value (the green dot)
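
To make "every point" concrete, here is a small sketch with made-up numbers: one error is computed per data point, and the RMSE is the square root of the mean of their squares.

```python
import math

y = [2.0, 4.0, 6.0, 8.0]      # actual values, the "green dots" (made up)
y_hat = [2.5, 3.5, 6.0, 9.0]  # values predicted by the line (made up)

# One error per point: predicted minus actual.
errors = [p - t for t, p in zip(y, y_hat)]

# RMSE: square each error, average them, then take the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(errors, rmse)
```

In practice a library does this over the whole array at once, so nothing has to be computed point by point by hand.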

wide echo
#

> Matlab

spark nimbus
#

@lean ledge given an Array<Sample> how would I combine those samples back into a proper waveform?

jagged stump
#

Hey everyone. I want to do brand detection on live video. How can I do it? What is your advice? I am trying to use a CNN

lapis sequoia
#

can someone tell me what this 3 means here

#

energy_series = df.loc[:, ('Energy', '3')]

#

Energy is a column name..

#

but not sure what '3' refers to

#

hmm maybe this is a tuple of columns

#

im having trouble coercing my dataframe column to datetime values

#

it's in unicode

#

I tried pd.to_datetime

desert oar
#

@weary rose fortunately for you, 1 million isnt considered big 😛

#

yes you "can" do that

#

how is the data stored?

#

@lapis sequoia it could be one of two scenarios: 1) the column label is actually a tuple, so this tuple is accessing the column called ('Energy', '3'), or 2) the dataframe has a "multiindex" instead of simple column names, so ('Energy', '3') is accessing the key 'Energy' in the outer level of the index, and the key '3' in the next level of the index
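
The second scenario (a MultiIndex on the columns) can be reproduced with a tiny made-up frame. The column labels and values below are assumptions chosen only to mirror the `('Energy', '3')` lookup.

```python
import pandas as pd

# Two-level column index: outer level "Energy"/"Temp", inner level "3"/"4".
columns = pd.MultiIndex.from_tuples(
    [("Energy", "3"), ("Energy", "4"), ("Temp", "3")]
)
df = pd.DataFrame([[1.0, 2.0, 20.5], [1.5, 2.5, 21.0]], columns=columns)

# The tuple picks "Energy" in the outer level and "3" in the inner level.
energy_series = df.loc[:, ("Energy", "3")]

# Selecting only the outer key returns every inner column under it.
energy_block = df["Energy"]

print(df.columns.nlevels)
print(energy_series)
```

Checking `df.columns.nlevels` (or just printing `df.columns`) on the real frame is a quick way to tell which of the two scenarios applies.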

wide gyro
#

What would be the best way to convert a text file of cells into csv?

desert oar
#

@wide gyro pd.read_csv ?

wide gyro
#

@desert oar well I want to set it to a csv file that allows a dataframe to come in after and take the data but I'm struggling with getting the formatting down for csv

desert oar
#

create the csv using pandas in the first place?

#

what do you mean "getting the formatting down"

#

you shouldnt ever be manually constructing CSV data except in very simple cases

wide gyro
#

I am using iwlist scan and trying to get that output into csv file

desert oar
#

what is "iwlist scan"

wide gyro
#

But my csv file is setting it row by row, with no column headers

desert oar
#

can you share some code

wide gyro
#

oops forgot iwlist is in linux, but it scans for access points nearby

desert oar
#

can you share some code

wide gyro
#

Sure gimme a sec

desert oar
#

iwlist is a command line tool right

#

so what are you doing with it

wide gyro
#

ye

desert oar
#

👍

wide gyro
#

trying to do something similar to wifimap but a beginner version of it

#

if that makes sense

desert oar
#

im not familiar w/ wifimap. but how are you parsing the output of iwlist

wide gyro
#

can i show a terminal script here? im not doing it off python rn

desert oar
#

is this a shell script?

#

yes ofc

#

use "bash" instead of "python" for highlighting

wide gyro
#

yeah that's what im stuck on right now

desert oar
#

bash highlighting:

echo 1
x=3
echo $x
wide gyro
#

wifimap . io I think it is @desert oar it's like a hotspot finder for restaurants or stores

#

good if you travel and stuff

desert oar
#

ok. yeah go ahead and share your shell script

#

the goal is to create a CSV that you can read to pandas right

wide gyro
#
echo "Scan?"
select yn in "Yes" "No"; do
    case $yn in
        Yes) iwlist wlan0 scan | tee ~/output.txt;;
        No) exit;;
    esac
done
#

Yes

#

I receive a block of cells that all have their respective data, but not sure how to then parse it

#

Maybe I could bring all the data onto one line without using the category for it, and then when setting it to csv I could set the column headers manually that match up to the data?

#

or would that not be smart @desert oar

desert oar
#

can you show some example output from iwlist wlan0 scan?

wide gyro
#

Actually I figured it out, there's a command called awk that will take care of it

desert oar
#

yes i've used awk a lot

#

still

#

if you make bad CSV files you make bad CSV files

#

also awk isn't that easy to use

#

and there is a lot of really really bad awk advice out there

#

so i'd still like to see the iwlist output

#

so you don't waste time on someone's over-clever awk solution that you don't understand

wide gyro
#

Is there a formatting for txt files?

desert oar
#

nothing standard

#

can you just share some output

wide gyro
#

ye

#
Cell 01 -    Address: 00.something
        Channel:    10
        Quality: 57/70
        ESSID: "Main"

Cell 02 -    Address: 00.something
        Channel:    10
        Quality: 57/70
        ESSID: "Main"
#

Essentially that for txt file, and on csv its that but each one is a new row

#

I figured I would get rid of the cell categories and just keep the data, get all the data on same line separated by a comma or tab, and then when setting it to a csv, I would put the column headers in manually

#

I saw online something like sed -e "s/\tsignal: //" -e "s/\tSSID: //" which I figured I would do for all of them

desert oar
#

can you demonstrate the output format you want to see

wide gyro
#

00.something 10 57/70 "Main"

#

something like that, or could be separated by tabs or commas

desert oar
#

got it. ok awk is actually a good tool for that

#

you could also do it in python

wide gyro
#

Which would you suggest?

#

I just saw awk first and jumped on it

#

but open to anything

desert oar
#

since you already know python id personally suggest using python
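
A minimal sketch of the Python route, using only the standard library. It assumes the text looks like the simplified sample pasted above (real `iwlist` output has more fields and may format them differently), and the function names are made up for illustration.

```python
import csv
import io
import re

SAMPLE = '''Cell 01 -    Address: 00.something
        Channel:    10
        Quality: 57/70
        ESSID: "Main"

Cell 02 -    Address: 00.something
        Channel:    10
        Quality: 57/70
        ESSID: "Main"
'''

FIELDS = ["Address", "Channel", "Quality", "ESSID"]


def parse_cells(text):
    """Turn 'Cell NN - ...' blocks into one dict per access point."""
    cells = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Cell "):
            cells.append({})                      # a new access point starts here
            line = line.split("-", 1)[1].strip()  # drop the "Cell 01 -" prefix
        match = re.match(r"(\w+):\s*(.*)", line)
        if match and cells and match.group(1) in FIELDS:
            cells[-1][match.group(1)] = match.group(2).strip('"')
    return cells


def to_csv(cells):
    """Write the parsed cells as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(cells)
    return buffer.getvalue()


rows = parse_cells(SAMPLE)
csv_text = to_csv(rows)
print(csv_text)
```

Letting the `csv` module write the file takes care of quoting and the header row, and the result can then be loaded with `pd.read_csv`.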

wide gyro
#

But would you suggest trying to clean it up a bit in linux before?

desert oar
#

eh

#

no point imo

wide gyro
#

Or everything in python

#

ye I understand

desert oar
#

if you already know the tools, or want an excuse to learn them, then by all means let's get into it

#

but if you just wanna get the data processed, use what you already know

wide gyro
#

I might wanna try it a couple ways just to learn, but I'll head down the path of what I already know just to get it running first

#

and then have a backup plan ready in case I end up not understanding the other methods

desert oar
#

thats a good idea

#

i learned not to do that the hard way

wide gyro
#

What do you mean haha

desert oar
#

i mean, learning a bunch of new tools while trying to actually get something done, without having a backup in place first

drowsy marsh
#

hello, I fail to create a .mplstyle for matplotlib with Spyder. Python doesn't find the new_style in the style library. And even if I modify a present default style, it doesn't change the output. I don't really know where I'm missing something :/

#

I missed this: matplotlib.style.reload_library()

silk forge
#
import sklearn.naive_bayes as bae
import numpy
import pandas as pd




data = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/nibbeh.csv")


x = pd.DataFrame.as_matrix(data.Height,data.Weight)
y = [data["Gender"]]

clf = bae.GaussianNB()
clf.fit(x,y)
n = clf.predict([[140,120]])
#
ValueError: Expected 2D array, got 1D array instead:
#
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#

help?

desert oar
#

the error message is pretty clear in this case

#

the x is expected to be a 2d array

#

you have a 1d array

earnest prawn
#

predict and fit expect sequences of data points, not just a single data point

#

so you always gotta wrap them in an additional array
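
A small sketch of the shapes scikit-learn expects, with made-up measurements in place of the CSV: the features form a 2D array with one row per sample, and the labels form a 1D array with one entry per row.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: each row is one sample, the columns are [height, weight].
X = np.array([[180, 80], [175, 77], [160, 55], [158, 52]])
# One label per row, as a flat (1D) array.
y = np.array(["male", "male", "female", "female"])

print(X.shape)  # (4, 2)

clf = GaussianNB()
clf.fit(X, y)

# A single new sample still has to be wrapped in an outer list: shape (1, 2).
prediction = clf.predict([[140, 120]])
print(prediction)
```

With a DataFrame, something like `data[["Height", "Weight"]].to_numpy()` for the features and `data["Gender"]` for the labels should give these shapes; `as_matrix` is no longer available in recent pandas versions.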

#

@drowsy marsh

#

oh shoot

#

@silk forge

bitter pewter
#

Guys, how do I make selenium read multiple pages and then export an Excel file afterwards?

#

I've successfully made it read one page and then export a file with the results

#

Let me show MWE:

#
from selenium import webdriver  # needed for webdriver.Firefox() below
from bs4 import BeautifulSoup
from openpyxl import Workbook
import numpy as np
import pandas as pd

url = "https://scon.stj.jus.br/SCON/legaplic/toc.jsp?materia=%27Lei+8.429%2F1992+%28Lei+DE+IMPROBIDADE+ADMINISTRATIVA%29%27.mat.&b=TEMA&p=true&t=&l=1&i=18&ordem=MAT,@NUM"

driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

python_button = driver.find_element_by_xpath('/html/body/div[2]/div[6]/div/div/div[3]/div[2]/div/div/div/div[16]/a')
python_button.click()

driver.switch_to.window(driver.window_handles[-1])

python_button = driver.find_element_by_xpath('/html/body/div[2]/div[6]/div[1]/div/div[3]/div[2]/div/div/div/div[3]/div[2]/span[2]/a')
python_button.click()

driver.switch_to.window(driver.window_handles[-1])

textList = driver.find_elements_by_class_name("docTexto")

resultados = BeautifulSoup(driver.page_source, 'lxml')

parse = resultados.find('div', {'id':'listadocumentos'})
paragrafoBRS = parse.find_all('div',{'class':'paragrafoBRS'})

header = []
content = []
for each in paragrafoBRS:
    header.append(each.find('h4', {'class':'docTitulo'}).text.strip())
    content.append(each.find(['div','pre'], {'class':'docTexto'}).text.strip())

    df = pd.DataFrame([content], columns = header)

df.to_excel('dados.xlsx')

driver.quit()
#

So, selenium opens up a page, then goes through some links and gets to the point I want (a page that displays the data I want to scrape)

#

The problem is that there are 5 pages of data

#

And I'm struggling to make it read all 5 pages and then export the Excel file

bitter pewter
#

Guys, nvm

#

I've figured it out

#

I'm struggling with how to merge columns that have the same name

#

Any tips?

#

Example: I have 3 columns named "Processo", but I want to merge those columns to instead of having 1 line and 3 columns, have 1 column and 3 lines of data

#

Using Pandas DataFrame, ofc

lapis sequoia
#

rephrase your question so it makes sense

inland viper
#

Hello?

earnest prawn
#

hello

inland viper
#

I cannot figure out what the second part of the prompt is asking for: Given the root of a binary search tree with distinct values, modify it so that every node has a new value equal to the sum of the values of the original tree that are greater than or equal to node.val.

#

I can post a picture of an example if that helps

earnest prawn
#

I dont really see how much clearer you could express this question, what exactly is your problem with it?

inland viper
#

I do not understand how to modify the tree

earnest prawn
#

node.val = 4343434?

inland viper
#

Input: [4,1,6,0,2,5,7,null,null,null,3,null,null,null,8]
Output: [30,36,21,36,35,26,15,null,null,null,33,null,null,null,8]

#

This is the example. There is a picture of the tree as well. But I do not understand where the values are coming from

earnest prawn
#

show me the picture then

inland viper
earnest prawn
#

oh those lists are actually equally long i see

#

that confused me

#

well then it's quite obvious where the new numbers are coming from, the task describes exactly how to calculate them

inland viper
#

What does node.val mean?

earnest prawn
#

the value associated with the node?

inland viper
#

I see

lean ledge
#

@inland viper just a recursive solution should work. It's not a hard problem by any measure. It recurses down and returns the sum to the upper tree. Upper tree takes the sum to the left and assigns it as its value and returns back next sum by adding its old value to the sum
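
One possible form of that recursion, as a sketch: visit the tree from the largest value to the smallest (right subtree, node, left subtree) while keeping a running total. The class and function names are made up; the tree is the one from the example input.

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right


def to_greater_sum_tree(root):
    """Replace each value with the sum of all original values >= it."""
    running_total = 0

    def visit(node):
        nonlocal running_total
        if node is None:
            return
        visit(node.right)          # larger values first
        running_total += node.val  # add this node's original value
        node.val = running_total   # everything >= the original value
        visit(node.left)           # then the smaller values

    visit(root)
    return root


# Tree for the input [4,1,6,0,2,5,7,null,null,null,3,null,null,null,8].
root = TreeNode(
    4,
    TreeNode(1, TreeNode(0), TreeNode(2, None, TreeNode(3))),
    TreeNode(6, TreeNode(5), TreeNode(7, None, TreeNode(8))),
)
to_greater_sum_tree(root)
print(root.val, root.left.val, root.right.val)
```

For the root, the values greater than or equal to 4 are 4, 5, 6, 7 and 8, which sum to 30, matching the first entry of the expected output.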

#

This is also not data science

strong flare
silk forge
#

wanna show us the code? @strong flare

strong flare
#

@silk forge hi fwiz here is the code 😀

# Plot with differently-colored markers.
plt.plot(combine_month.index, combine_month.April, 'b-', label='April')
plt.plot(combine_month.index, combine_month.March, 'g-', label='March')
plt.plot(combine_month.index, combine_month.May, 'r-', label='May')

# Create legend.
plt.legend(loc='right')
plt.xlabel('Date')
plt.ylabel('Month trend')
plt.show()
bitter pewter
#

@lapis sequoia when I get the xlsx file with all the data I've collected with selenium, instead of having multiple rows under 11 different columns, it creates multiple columns and one single row. I'll post a screenshot over here so you can check it out.

#

As you can see, column B, "Processo", is repeated at column L, but they correspond to different data sets. I want to get them both in the same column, but in two different rows.
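
One way to get one row per record, sketched in plain Python: walk the flat `header`/`content` lists together and start a new row whenever a header name repeats. The helper name, the "Relator" column and all values are hypothetical; only "Processo" comes from the description.

```python
def rows_from_pairs(headers, contents):
    """Group flat (header, value) pairs into rows; a repeated header starts a new row."""
    rows = [{}]
    for name, value in zip(headers, contents):
        if name in rows[-1]:
            rows.append({})  # this header was already filled, so begin the next record
        rows[-1][name] = value
    return rows


# Made-up stand-ins for the scraped lists.
header = ["Processo", "Relator", "Processo", "Relator", "Processo", "Relator"]
content = ["REsp 1", "Min. A", "REsp 2", "Min. B", "REsp 3", "Min. C"]

rows = rows_from_pairs(header, content)
print(rows)
```

Passing the resulting list of dicts to `pd.DataFrame(rows)` gives one column per distinct name and one row per record, which can then be written with `to_excel`.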

strong flare
#

Hi, does anyone know how to change a line plot's Y-axis to normal numbers? Not in scientific notation mode

outer marsh
#

Hello everyone

#

Does someone know why my scipy optimize minimize doesn't give the most optimal solution

lean ledge
serene crane
#

When you're working with a jupyter notebook, do you still tend to put all imports at the top, or does it make sense to group imports around the cells they are used? What's the common pattern?

polar acorn
#

I put them on top. If you're sharing the notebook with someone later they have all the requirements in one cell instead of scattered around.

onyx granite
#

yes you want people to run into any errors immediately, not later down the line. It's always good to put imports first, especially since it wont give you a compile time error, and mostly runtime @serene crane

desert oar
#

@outer marsh what's wrong with that output and what were you expecting instead

outer marsh
#

@desert oar The second one could've been a bit more down

#

Same for the first one

#

I could draw a better line by hand

desert oar
#

So you're trying to fit a linear regression line with least squares?

#

Oh I see, power law

#

How about this, punch in the parameters for what you think is a better line and compute the log likelihood of both your "manual" solution and the solution the computer found

#

If it turns out that you can in fact draw a better line then clearly it failed to converge for some reason

#

Or didn't find a global minimum

#

Also make sure to rule out any bugs in your log likelihood

#

That log likelihood looks kind of off to me

#

What expression are you actually trying to optimize, zipfs law log likelihood?

lapis sequoia
#

has anyone here worked with Python cv2, computer vision. for face detection or anything

outer marsh
#

@desert oar Yeah

hollow shard
#

Hi guys, I'm in a spot of bother with a 1 hidden layer neural net. I created a very basic neural network after the one I created for MNIST didnt work, training on a very simple generated dataset. However, it spits out nonsense results, and nothing that ive tried works. Could anyone take a look? Apologies in advance for rubbish code and the probably basic errors, I'm completely new.

desert oar
#

@hollow shard it looks like you're re-initializing everything inside the loop

#

so every single step all the parameters get re-initialized to 0
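
The effect can be illustrated with a toy one-parameter gradient descent. This is not the network from the question (its code is not in the log), just a sketch with made-up data showing where the initialisation belongs.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x, so the ideal weight is 2
learning_rate = 0.01

w = 0.0  # initialise the parameter ONCE, before the training loop
for step in range(500):
    # Putting `w = 0.0` here instead would throw away every previous update.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(w)
```

With the initialisation inside the loop, `w` would only ever reflect a single update step from zero, no matter how many iterations run.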

hollow shard
#

oh crap hahaha

#

thanks, I appreciate it

onyx granite
#

I have a problem that I can't seem to figure out, I am trying to create a boosting model based off of car accidents for a particular city. I have about 30-50k records, since this is a classic imbalanced data case, I am "creating" my own training set by sampling accidents, changing a feature, then if it is non-negative, add it to my trainset as a 0/non-accident

#

I can't seem to get any good recall scores, since I am more focused on actually preventing type II errors

#

anyone who can help would be appreciated

#

i have about 15-20 different features as well btw, I can go into more detail if needed

hollow shard
#

strange, I modified the code to put the resets outside of the loop, and its still playing up.

sand reef
#

Anybody on? I have a smol question.

#

Can increasing the resolution of a picture reduce the accuracy of the model?

#

Cuz at 45x45 the model was performing flawlessly. When I made it 100x100, suddenly the accuracy got stuck at 47%

#

And wouldn't rise even with 10 epochs. Btw the training data size of 73k images.

desert oar
#

seems weird

sand reef
#

Yeah. I too am wondering what went wrong. From what I found online, they say that low resolution helps in gathering global features better and high helps in finer features. But as you increase the resolution, it's harder to grasp global features.

#

So I am not sure if OCT scans seem to be classified on the basis of global features though.

desert oar
#

maybe you can visualize the middle layers to see what's being learned

#

what was accuracy in 45x45?

sand reef
#

88

desert oar
#

oh. huh

#

im really not a image/vision guy but that aint right

#

the usual "check your code" and "check your data" caveats apply

sand reef
#

Well. Can't really check my data. It's just an image.

#

Code wise. We could try out different designs though.

desert oar
#

well how are you producing the images

#

i assume these arent actually 45x45 images

#

and youre downsampling them somehow

sand reef
#

Nope.

#

Those are 360x760

desert oar
#

so do the 100x100 images actually look right?

sand reef
#

Yeah. That's why I was like, 45 might be very low resolution.

desert oar
#

its kind of a long shot but worth checking. sorry im not more helpful on the actual neural network side of things

sand reef
#

And guess what, with 45x45, the test dataset accuracy is 91%

desert oar
#

it's possible that 100x100 is a "bad zone" where the easy blob-like features start to break up but the image itself hasnt resolved into anything meaningful yet

#

but again im speculating

#

cause 45x45 is just like what, blobs of dark and light?

sand reef
#

Yus

#

So I guess then I'll just do it with multiple image sizes and networks

#

And plot it on tensorboard

desert oar
#

seems reasonable as long as training isnt too intensive

sand reef
#

Well. Not much. 76k images 100x100 takes 1. 10 min per epoch.

#

*1min 10s

#

I am pretty happy at the fact that this actually works fast with a gpu. With a cpu every epoch would've taken 5-10min

#

Say. Another smol question. I have some work related to taking features out of an image to generate a vector with features. I need to take texture, a little bit of geometry and some morphological features. Which library has these functions and can I get a source where I can look up such algorithms?

#

Stuff like glcm, drlbp and all.

jagged stump
#

Hey everyone, I want to ask a question. Is data engineering good? It looks like it focuses on the architecture behind the data, with less work on ML, DL, etc. Is that right?

#

because I got a call about a data engineer position and idk what I would be doing. Can anyone give me some good info about it?

#

thanks

desert oar
#

"good" is up to you. it's certainly important work, and the industry can always use competent data engineers

#

the work you'd do depends on the company, but you will generally be building or maintaining database systems and data pipelines

#

so maybe a data scientist or ML researcher will produce a model, maybe in a docker container

#

and the data engineers might be the people who connect that container to an application, which other programmers can use

#

or maybe you would be administering a data warehouse

#

stuff like that

#

possibly doing some "light duty" data analysis of your own

jagged stump
#

Thanks for the answer. I am talking about a really big and good company

#

The data is coming from autonomous cars

#

and they want Hadoop, Spark, also Kubernetes, Docker, etc. They also mention creating new pipelines, and ML/DL is a plus

#

my mind is such a mess now, idk. That company is awesome, and the job would be better if it were data scientist, but it still looks cool because of the autonomous cars

desert oar
#

yeah that would be a serious data engineering job

#

low latency pipelines handling huge amounts of data

jagged stump
#

because the data will be real time. So it's level 5

#

also level 4 I guess, because they are also supposed to predict

#

so clearly level4-5 data engineer for autonomous cars

desert oar
#

i dont know what levels you refer to

jagged stump
#

in fact I don't know much about the levels, I just know that level 4 is like predictable data and level 5 is like real-time data

desert oar
#

ive never heard of these levels

jagged stump
sand reef
#

So. I am still waiting if anyone knows about such sources. Please do answer.

lean ledge
#

@sand reef yes, increasing resolution can reduce accuracy in some scenarios. Might be a good idea to learn about Laplacian pyramids and multi-scale features from traditional computer vision. But you would expect a deeper CNN to generally learn Gaussian blurring filters to get over that often, so it might have to do with your hyperparameters and how they're suboptimal for the new loss landscape

sand reef
#

I see. Thanks! And @lean ledge do you know of a library with feature extraction algorithms for images like glcm, drlbp, etc.?

lean ledge
#

I do not, sorry

sand reef
#

Oh ok.

void anvil
#

Data engineering for that is basically making high speed pipelines into ML models along with data validation and cleaning.

#

Personally that's my least favorite part of the process, but it's super valuable

#

What you want to know really well is data structures, pipeline methods, and outlier detection for the interview

silk forge
#
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
#

plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')

#

i dont understand this line ^

#

why the [0][0] and stuff

desert oar
#

Its bad

#

coef_ is a 2d array, i.e. a matrix

#

Looks like they're plotting the 0,0th element of coef_

#

Should be [0,0] not [0][0] - the second way works but it's not recommended or idiomatic

#

Meanwhile intercept is a 1d array, i.e. a vector
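
The shapes involved can be checked with plain NumPy arrays standing in for the fitted attributes. The numbers are made up; only the shapes, (1, 1) for the coefficients and (1,) for the intercept, mirror the case where the target is passed as a single-column 2D array.

```python
import numpy as np

coef = np.array([[39.0]])       # 2D, shape (1, 1): one target, one feature
intercept = np.array([125.0])   # 1D, shape (1,): one target

slope = coef[0, 0]       # idiomatic: row 0, column 0 in a single indexing step
same_slope = coef[0][0]  # works too: first takes row 0, then element 0 of that row

x = np.array([1.0, 2.0])
line = slope * x + intercept[0]  # y = slope * x + intercept, as in the plot call
print(coef.shape, intercept.shape, line)
```

Both indexing styles return the same scalar here; `[0, 0]` avoids creating the intermediate row and is the form NumPy users generally expect.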

fierce shadow
#

Hey, I was thinking about getting started with machine learning, just wanted to know what math is required?

silk forge
#

linear algebra

#

differential calculus and linear equations

#

@fierce shadow

fierce shadow
#

anything more ?

silk forge
#

probability

#

ik a course for you

fierce shadow
#

hmm...okay...really ?

#

can you suggest one please ?

silk forge
#

Essential math for machine learning

fierce shadow
#

thanks

lean ledge
#

@fierce shadow there's a "mathematics of machine learning" post pinned on this channel that summarises the maths used

lapis sequoia
#

@sand reef sry for ping but i got a life or death situation again

#

i am loading few images from a directory but their shape is (250, 250, 2)

#

why is an images shape 3?

#

when i printed it

#

it has one pair of useless brackets

#

how can i remove it?

earnest prawn
#

The image's shape is 3D because you have pixels in x, pixels in y, and colour channels

lapis sequoia
#

how can i reshape it?

earnest prawn
#

You usually don't want to?

lapis sequoia
#

oh

lean ledge
#

Why do you want to reshape it?

lapis sequoia
#

so its fine?

earnest prawn
#

Yeah it's perfectly fine

lapis sequoia
#

okay thanks

lean ledge
#

Image data is spatial. CNNs are spatial. Can't reshape it in any way without screwing up how CNNs naturally work

earnest prawn
#

I mean you could certainly make it smaller or bigger or just make it grey scale (aka 1 channel) if you wanted to
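
A quick sketch of what those three dimensions look like in NumPy, using a blank made-up image with three colour channels. Averaging over the channel axis is only one simple way to get a single-channel version; image libraries offer weighted conversions.

```python
import numpy as np

# Made-up blank image: 250 rows, 250 columns, 3 colour channels.
image = np.zeros((250, 250, 3), dtype=np.uint8)
print(image.shape)

# Collapse the channel axis to get one grey value per pixel.
gray = image.mean(axis=2)
print(gray.shape)
```

The last axis is the one holding the channels, so dropping or averaging it changes the colour information but leaves the spatial layout intact.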

lean ledge
#

You can up or down sample etc or do other similar preprocessing but you can't reshape the network layer without losing properties

lapis sequoia
#

okay

round crane
#

anyone knows how to make pie charts with matplotlib.figure?

olive willow
#

you mean matplotlib.pyplot ? @round crane

round crane
#

no

#

i don't mean that

#

i'm in fact explicitly avoiding that

round crane
#

first line import matplotlib.pyplot as plt

olive willow
#

idk I clicked figure

#

sry

round crane
#

right this one

olive willow
#

cuz what I clicked was as an example of it

round crane
#

really

olive willow
#

yh

#

look

#

it last column 5th row

#

bar or pie

round crane
#

this mentions figure.figure not pyplot.figure

olive willow
#

oohhh you want it to be in a flask site?

round crane
#

not necessarily

#

but yes a web-app

olive willow
#

I don't think it makes a difference

#

@earnest prawn please explain this

#
from matplotlib.figure import Figure
round crane
#

right

olive willow
#
import matplotlib.pyplot.figure
#

what's the difference

#

I think that the first one only imports the figure module

#

and the second one too

round crane
#

the FAQ says without pyplot

olive willow
#

idk

earnest prawn
#

whats there to explain

#

you got your examples and youre done

olive willow
#

what's the difference ?

#

or is there none?

earnest prawn
#

pyplot can actually generate figures if you want it to

olive willow
#

I think that they're the same

earnest prawn
#

so pyplot is just a bit higher level

olive willow
#

oohhhh so like more functionality

earnest prawn
#

but internally (and with the right way also externally) you can create figures with it

#

no its less functionality

#

pyplot internally uses figures

olive willow
#

I mean pyplot

earnest prawn
#

yeah basically

olive willow
#

Danke!

#

you see, you know everything

#

they should make a nix bot

earnest prawn
#

i in fact am guessing this based on seeing other matplotlib code which mentioned the creation of a fig variable

#

and as in this code they also do it i am just making an assumption
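
For the original question (a pie chart without pyplot), a minimal sketch that builds the figure from `matplotlib.figure.Figure` directly and renders it to PNG bytes in memory, as one might in a web app. The data, labels and title are made up.

```python
from io import BytesIO

from matplotlib.figure import Figure

# Create the figure object directly; pyplot is never imported.
fig = Figure(figsize=(4, 4))
ax = fig.subplots()

# Made-up shares; autopct prints the percentage inside each wedge.
ax.pie([45, 30, 25], labels=["A", "B", "C"], autopct="%1.0f%%")
ax.set_title("Example shares")

# Render into an in-memory buffer instead of opening a window.
buffer = BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes))
```

The bytes can be returned from a web handler or base64-encoded into an `<img>` tag. Avoiding pyplot matters there because pyplot keeps global references to every figure it creates.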

sand reef
#

I have a question. I want to make a face detection system. Just detection, nothing else. How do I get a good dataset to train my network in? I want to do it with live video feed.

#

And what model is good for it?

#

I want to do something for a small self project.

#

I tried training a small network with 32 conv net filters, dropout layers and a 36 neuron network. With just a resolution of 100x100. I used 1k images with face and 1k without.

#

Which I captured using the video camera.

#

How do I proceed? I tried doing it and the issues I see is, the model kind of does and doesn't recognize the face at different angles and distances.

#

I thought of using the hidden layers and generating Euclidean distance to find similarity between images and not, but I am not sure if that will work. Will it?

#

And I am also not sure how would I implement it.

#

Sorry. The resolution is 50x50

hallow pendant
#

import re

pattern = r"^gr.y$"

if re.match(pattern, "grey"):
    print("Match 1")

if re.match(pattern, "gray"):
    print("Match 2")

if re.match(pattern, "stingray"):
    print("Match 3")

#

this outputs match 1 and 2

#

and someone explain meta characters to me??
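
A short sketch of what the three metacharacters in that pattern do: `^` anchors the match at the start of the string, `.` stands for exactly one arbitrary character (other than a newline), and `$` anchors at the end. The extra test words are made up.

```python
import re

pattern = r"^gr.y$"  # start, "gr", any single character, "y", end

words = ["grey", "gray", "stingray", "gry", "greyish"]
matches = [w for w in words if re.match(pattern, w)]
print(matches)

# Without the ^ anchor, re.search can find "gray" at the end of "stingray".
ends_with_gray = re.search(r"gr.y$", "stingray") is not None
print(ends_with_gray)
```

"stingray" fails the full pattern because it does not start with "gr", "gry" fails because `.` needs a character to consume, and "greyish" fails because `$` requires the string to end right after the "y".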

sand reef
#

@hallow pendant wrong channel. Go to one of the help channels

hallow pendant
#

oh sorry

jagged stump
#

I want to ask something. For example, I want data for a CNN: the Mercedes-Benz brand logo. I didn't want a ready-made dataset, so I wanted to make my own. I automatically downloaded the first 40 images from Google Images for different keywords, like "mercedes benz logo", "mercedes benz brand", etc. But some of them are really not about Mercedes-Benz, so the data is not well picked. What is the best way to classify this data and filter out the non-Mercedes-Benz images? Thanks a lot

velvet thorn
#

well...that's what you're building the CNN for, right?

#

you could a. do it yourself or b. pay someone to do it for you

#

you could run unsupervised classification

lapis sequoia
#

@jagged stump if you just want to filter images between sets, you can use microsoft computer vision api.. it's easy, just upload a couple of images to train it

strong flare
#
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2000, 10000)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')

plt.xlabel('x label')
plt.ylabel('y label')

plt.title("Simple Plot")

plt.legend()

plt.show()

Hi guys, may i know how can i set ytick to number not in scientific notation ?

lean ledge
#

you want the y axis to write 1 followed by 9 zeroes????

#

1000000000
2000000000
3000000000

strong flare
#

yes

#

you have any idea ? 😀

desert oar
#

ax.set_yticklabels should do it

lapis sequoia
#

what's the support? (in classification tasks) i'm trying to make sense of my classification report..

#

I trained a classifier for 3 classes, I have the precision recall f1 score and support for 120k examples.. and it shows for individual classes, the support is 40k

#

not sure what that means
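
In scikit-learn's classification report, support is the number of true samples of each class in the data being evaluated. With 120k examples spread evenly over 3 classes, that gives 40k per class. A tiny sketch with made-up labels shows the same count:

```python
from collections import Counter

# Made-up ground-truth labels: 3 classes with 4 examples each.
y_true = ["cat"] * 4 + ["dog"] * 4 + ["bird"] * 4

# Support per class is simply how often each class occurs in y_true.
support = Counter(y_true)
print(support)
```

Support says nothing about how good the predictions are; it only tells how many examples each precision, recall and F1 figure is based on.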

strong flare
#
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np

def tickformat(x):
    if int(x) == float(x):
        return str(int(x))
    else:
        return str(x)  

    
# Create the figure and its axes in one call; a separate plt.figure() would leave an extra empty figure.
fig, ax = plt.subplots(figsize=(8, 6))
    
x = np.linspace(0, 2000, 10000)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')

fmt = FuncFormatter(lambda x, pos: tickformat(x))
ax.yaxis.set_major_formatter(fmt)

plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")

plt.legend()

plt.show()
lapis sequoia
#

do you want your y in that scale?

strong flare
#

@desert oar @lean ledge I just found this on Google and it works xd

lapis sequoia
#

ok seems like you did

strong flare
#

@lapis sequoia Yes, this is just a sample i need those number in figure not in scientific notation
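
A shorter alternative sketch using `matplotlib.ticker.StrMethodFormatter` with a plain fixed-point format string. It builds the figure through `matplotlib.figure.Figure` so it runs without a display; with pyplot, the same formatter line works on the axes returned by `plt.subplots()`.

```python
import numpy as np
from matplotlib.figure import Figure
from matplotlib.ticker import StrMethodFormatter

x = np.linspace(0, 2000, 100)

fig = Figure(figsize=(8, 6))
ax = fig.subplots()
ax.plot(x, x ** 3, label="cubic")

# Format every y tick as a whole number, with no exponent or offset text.
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:.0f}"))

# Ask the formatter directly how it would label the value 3e9.
label = ax.yaxis.get_major_formatter()(3e9)
print(label)
```

Changing the format string to `"{x:,.0f}"` would add thousands separators. `ax.ticklabel_format(style="plain", axis="y")` is another option when the default formatter is still in place.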

lapis sequoia
#

ok

#

@lapis sequoia I'm having trouble understanding Support

jagged stump
#

@lapis sequoia I prefer use python script

#

@velvet thorn so we can use something like K-means? But how? HOG? SIFT? What is the way?

lapis sequoia
#

??

jagged stump
#

@lapis sequoia you suggested the Microsoft computer vision API. I have no idea what it even is. Can you give me more details?

lapis sequoia
#

oh it seems like you want to use python.. so it's not an option

jagged stump
#

There must be some way. I have 100 data points; 50 of them are Mercedes-Benz and the others are something different. Can't I somehow take only those 50 Mercedes-Benz ones? Maybe with some ML algorithm?

#

any idea?

hollow quartz
#

Hi. I have a numerical variable that varies with a date-time variable. Can I use an ML algorithm to predict the numerical variable?

sand reef
#

What's a good model to use for face detection? For live video feed?

wide gyro
#

Switching from iterrows() to get_value for efficiency, but now I'm running into a problem with appending my copied dataframe

#

I have everything inside a for loop: for i in df.index:

#

and inside that loop I have a check statement and then dfCopy = dfCopy.append(i)

#

in which I am now getting TypeError: cannot concatenate object of type "<class 'int'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

#

I previously used for index, row in df.iterrows():

#

with the append statement being dfCopy = dfCopy.append(row)

#

which worked fine

#

never mind, I see it's being removed in a future release

desert oar
#

chances are you don't actually need to loop over rows manually

#

what are you trying to achieve?

#

most of the time you think you need iterrows() you probably just want .apply(..., axis=1) instead
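
A minimal sketch of that idea, assuming the check inside the loop is a simple condition on a column (the `name` and `score` columns and the threshold are made up, since the original check isn't shown). Note that `dfCopy.append(i)` failed because `i` is just the index label, an int, not a row; and `DataFrame.append` itself was later removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data; the column names are made up for illustration.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 8, 5, 9]})

# Instead of looping over df.index and appending row by row,
# build a boolean mask and select all matching rows at once.
mask = df["score"] > 4
df_copy = df[mask].copy()

# Row-wise logic that really needs several columns can use apply(axis=1).
df_copy["label"] = df_copy.apply(lambda row: f"{row['name']}:{row['score']}", axis=1)
print(df_copy)
```

The mask version avoids growing a DataFrame inside a loop, which copies the whole frame on every iteration.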

quartz stream
#

Does anyone know where to start with voice authentication?

#

I mean recognizing a person's voice

#

doesn't matter if initially its only for numbers

earnest prawn
#

"initially only for numbers"??? a computer cannot make predictions on inputs which are not numeric

hollow quartz
desert oar
#

ax.set_xticks @hollow quartz

vital bison
#

Has anyone here participated (or is anyone participating) in a Two Sigma competition on Kaggle? I'm interested in what the data set looks like, as the deadline has passed. If it's from an older one, that's no problem either. Please feel free to DM me

lapis sequoia
#

I would love to learn data science, but I am still struggling with learning it.

sand reef
#

Could I get a recommendation for a project to do? A project to do so that ppl will hire me for internship?

#

I know, wxpython, pyqt5, decent amount of tensorflow and keras, SQL and currently doing Django.

#

And I have been doing competitive in python and C++.

#

And I know cv2 and Pygame. And SFML in C++.

#

I checked online for good projects to do, and ppl were like reimplement GNU ld, objdump and make a binary recompiler or something. And I thought maybe I could do something with Facenet, turns out there is a library already with what I want to do. I wanted to use Facenet with live video feed. I am completely lost.

void anvil
#

@sand reef scrape the weather station info from NOAA and try to do 5 minute ahead wind speed, direction, and day-ahead high / low temperatures

sand reef
#

Oh. That's almost perfect. Thanks!

void anvil
#

Common weather forecasts are: 5 minute ahead, 1 hour ahead, 1-7 days ahead, 90 day average temp

lapis sequoia
#

That actually sounds really interesting

strong flare
#

Hi guys, I have a dataframe with dates. What's the easiest way to get the moving average?
I already have the min & max date

#

the moving avg like 3 day and 7 day

#

Some of the solutions use pandas_datareader, but isn't that only for fetching data online?
https://stackoverflow.com/questions/40060842/moving-average-pandas
https://youtu.be/XWAPpyF62Vg

Welcome to the python pandas programming tutorial part 2. In this python pandas episode, we are going to calculate moving averages for a stock using python p...

strong flare
#

xd ok I got it, I have to use df.rolling().mean()
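
A minimal sketch of the 3-day and 7-day moving averages with `rolling().mean()`, assuming one row per day (the `date` and `value` column names and the numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical daily values; "date" and "value" are illustrative column names.
df = pd.DataFrame(
    {
        "date": pd.date_range("2019-01-01", periods=10, freq="D"),
        "value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    }
).set_index("date")

# Row-based windows: the first window-1 rows are NaN.
df["ma_3"] = df["value"].rolling(window=3).mean()
df["ma_7"] = df["value"].rolling(window=7).mean()
print(df)
```

If some days are missing, a row-based window of 3 no longer means 3 calendar days; with a datetime index, a time-based window such as `rolling("3D")` may be closer to what is wanted.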

sand reef
#

Can I get some advice on a series of algos I got from scikit image?

#

I need to take out texture features.

#

And there are a multitude of algos for it.

#

But what I see is: SIFT, SURF, Daisy, LBP, HoG and a binary descriptor extractor called ORB.

#

Is it of any use if I use multiple of them? Or just one suffices?

#

And could anyone point me to a source where I can understand why should I use any of them, and which one is preferred when?

lean ledge
#

Computer vision is a rather large field in of itself, with a very large body consisting of stuff inspired by signal processing, traditional algorithms, statistics and machine learning based techniques

#

Seriously though, it's a large field.
Take one example, SIFT. It's "Scale Invariant Feature Transform"
It takes an image and converts it into a difference of gaussian scale space representation for scale invariance (signal and image processing)
It finds features using generic multivariate calculus techniques + a discretised taylor series on the image, doing that over the entire new 3D scale space (general multivariate calculus)
It does feature matching (backed by a kd tree supported nearest neighbour searches) (traditional algorithms)
Uses hough transformations to identify clusters in the feature space and do voting (generic computer vision optimisation stuff)

#

There is no possible reference you can give that goes over the nuances of the techniques without a lot of background and explanation

sand reef
#

Hmm. So that means back to square one.

#

And any directions for if I need to find some geometric features like depth of the retina, and for morphological features like lesions in the retina?

#

Or abnormalities on the surface, like a little part of it being partially peeled.

#

And from the looks of it, I think that computer vision might be a field gradually explored by experience? Like very very gradual exploration

lean ledge
#

Or just like, reading a book or going through a course? @sand reef

#

There's no need to be slow and gradual, but also don't expect the field to be reducible to a cheat sheet

#

There's little specific to robots there so be scared not

sand reef
#

Thanks! I'll look it up! Although, any pointers on what would I do for those requirements in my above question?

hollow hearth
#

Hey guys, I have a pandas question. I am working with a dataset that has 'Download Date' and 'Customer Name' fields built into it. What I want to do is remove duplicate Customer Names based on the Download Date. For example, if this is the starting point:
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 06/10/2017 Customer Name: Mark Johnson
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 09/15/2018 Customer Name: Jim Jacobs

I want it to end like this:
Download Date: 06/10/2017 Customer Name: Jim Jacobs
Download Date: 06/10/2017 Customer Name: Mark Johnson
Download Date: 09/15/2018 Customer Name: Jim Jacobs

I have been trying to do it without iterating (I don't think I have to, I could be wrong). Can anyone point me in the right direction? My initial thoughts were to use .groupby() and do stuff with the groups but for some reason I cannot get that to work how I want it to.

Edit: I got it. I needed to pass the download date into the subset parameter of drop_duplicates.

chilly shuttle
#

you can use drop_duplicates

#

the more general approach is to use groupby to aggregate by whatever fields you need and perform whatever operation you need

hollow hearth
#

The only reason why I swayed from that is because the excel file that I was using was already grouped by the download date. Forgot to mention that D:. But I just ended up passing download date and customer name into drop_duplicates and that got me the results that I was looking for
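
For anyone landing here later, a minimal sketch of that `drop_duplicates` call on the four example rows from the question (the column names are taken from the question; everything else is illustrative):

```python
import pandas as pd

# The four example rows from the question above.
df = pd.DataFrame(
    {
        "Download Date": ["06/10/2017", "06/10/2017", "06/10/2017", "09/15/2018"],
        "Customer Name": ["Jim Jacobs", "Mark Johnson", "Jim Jacobs", "Jim Jacobs"],
    }
)

# Keep the first row for each (date, name) pair and drop the repeats.
deduped = df.drop_duplicates(subset=["Download Date", "Customer Name"])
print(deduped)
```

By default the first occurrence is kept; `keep="last"` would keep the last one instead.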

true badger
#

Do distances in PCA 2D space mean anything?

polar acorn
#

Numerically, I don't think so i.e. a distance of 5 or 10 has no interpretation. But relative to each other I would think that a distance has some interpretation i.e. point pairs that have smaller distances are in some way more similar than point pairs with large distances, but I don't think there is a clear interpretation other than this vague idea of similarity. At least not without carefully examining the principal components.

lapis sequoia
#

Okay, so I want to use machine learning to try to categorize measurement series, where every data point is essentially a group of measurements (somewhere between 10 000 and 100 000 measurements for each data point). Does anyone know if there exists machine learning algorithms for these types of tasks, that will derive attributes from the data on its own, or will that have to be done manually?

polar acorn
#

This could be done with an LSTM or a TCN. Do I understand correctly that each timestep in your series has between 10 000 and 100 000 values associated with it? If that's the case, maybe you should look into some form of dimensionality reduction before feeding it to a classification model?

lapis sequoia
#

Each timestep basically has as many values as i want. The more i use, the more accurate it becomes, as the measurements are prone to noise

#

I am basically looking at RTT delay measurements.

An example of what i want to be able to do is getting measurements between two points A and B, and a Server C. Then i want to be able to differentiate between the RTT of A-C and B-C. Most of the time the minimum value should be sufficient if i have enough measurements, but i want to be able to reduce the number of measurements to a point where i may not be able to tell from the minimum delay alone.

#

Does that clarify it a little bit? @polar acorn

polar acorn
#

Sure! I don't have anywhere near enough domain knowledge to actually be of use here, so I'll make some assumptions. If I understand this correctly, you have a series of measurement sets, for example [[3,4,3], [4,5,4,5], [5,5,2]] etc., of some arbitrary length, and you want to estimate whether this series belongs to the A-C class or the B-C class. I assume that you have several examples of both classes. I assume what sets A-C and B-C apart is some attribute of the series itself. Maybe A-C rises quicker or B-C oscillates more over time or whatever. In that case, for each time step I would compute the mean, std, minimum and maximum of the values sampled at that time and create a new time series with the same length but with those four statistics at each time step. And then feed that to an LSTM model.
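
A minimal sketch of that per-time-step summary, using the example series from the message (the `summarize` helper is a hypothetical name, and population standard deviation is an arbitrary choice here):

```python
import statistics

# The example series from the message above: one list of samples per time step.
series = [[3, 4, 3], [4, 5, 4, 5], [5, 5, 2]]


def summarize(samples):
    """Collapse one time step's samples into (mean, std, min, max)."""
    return (
        statistics.mean(samples),
        statistics.pstdev(samples),  # population std; use stdev() for the sample std
        min(samples),
        max(samples),
    )


# Same length as the input series, but a fixed 4 features per time step.
features = [summarize(step) for step in series]
for row in features:
    print(row)
```

The point is only that every time step ends up with the same number of features, whatever the number of raw samples, which is what a sequence model needs as input.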

lapis sequoia
#

Yea, that was basically what I was wondering: whether I had to derive values such as mean, std, minimum, etc. on my own, or whether I could just feed the whole measurement series into an ML method that would do it on its own.

#

In regards to domain i am measuring RTT in a network with low traffic, so network traffic is generally considered noise i want to sort out, so maximum values wont be very interesting

#

That is an example of the data i can get between 2 different points and a Server C

#

Anyway i should probably look into RNN and LSTM

polar acorn
#

Another way could be to first compute summary stats for each time step and then use something like https://github.com/blue-yonder/tsfresh to automatically extract features from the whole of those time series and then throw those features into whatever classifier you like. I'm afraid you have to create those summary stats yourself, but I'm sure you can set them up nicely in some pipeline together with your training.

lapis sequoia
#

Yea, I was actually about to do something similar before i stopped myself and wondered if this was something ML could take care of entirely

#

Anyway, thanks so much for the input!

lapis sequoia
#

deep learning would do a lot better of a job than machine learning ever could @lapis sequoia

earnest prawn
#

Deep learning is a subdiscipline of machine learning, so your statement simply does not make any sense

eternal flare
#

hey all can anyone look at some sample code ive been trying to get working? Im working on a simple trading algorithm using pandas and numpy and i feel like ive been looking at the code so long i cant see whats in front of my face

lapis sequoia
#

!ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

eternal flare
#

Whats the simplest way to post code here?

#

Its over the 2,000 limit

lean ledge
#

@lapis sequoia that is objectively false

#

Deep learning is rarely better than traditional ML in reality for a lot of reasons

lapis sequoia
#

not from what I’ve read up about it js @lean ledge

#

from what I’ve read, deep learning is machine learning, but better cause of the neural networks @earnest prawn

eternal flare
#

Can anyone tell me what I'm doing wrong when I'm trying to get my signal to buy in this simple trading algorithm application? https://docs.google.com/document/d/1d3LzcXByLDlF6a-amt-U4b_bOy8l6c2L109i9xU7OvE/edit?usp=sharing

lapis sequoia
#

why’re you trying to get your signal to do that? @eternal flare

lean ledge
#

@lapis sequoia neural networks don't make it better

#

Deep learning is very rarely better

#

Outside of the domains of CV and NLP, it's not even used much

eternal flare
#

It's supposed to be a basic crossover investment strategy. When the simple moving average for 10 or 30 days is reached, a signal is put out to either buy or sell.
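
One common reading of that description is a moving-average crossover: buy when the short average crosses above the long one, sell when it crosses below. A minimal sketch under that assumption (this is not the poster's code; `sma`, `crossover_signals`, the prices, and the tiny windows are all made up for illustration):

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def crossover_signals(prices, short_window, long_window):
    """Return 'buy'/'sell'/None per step based on short vs long SMA crossings."""
    short, long_ = sma(prices, short_window), sma(prices, long_window)
    signals = [None] * len(prices)
    for i in range(1, len(prices)):
        if None in (short[i - 1], long_[i - 1], short[i], long_[i]):
            continue  # not enough history yet
        if short[i - 1] <= long_[i - 1] and short[i] > long_[i]:
            signals[i] = "buy"    # short average crossed above the long one
        elif short[i - 1] >= long_[i - 1] and short[i] < long_[i]:
            signals[i] = "sell"   # short average crossed below the long one
    return signals


# Made-up prices and tiny windows (2 and 4) so the example stays readable;
# the strategy described above would use 10 and 30 days.
prices = [10, 9, 8, 7, 8, 9, 10, 9, 7, 5]
signals = crossover_signals(prices, short_window=2, long_window=4)
print(signals)
```

A signal only fires on the step where the two averages actually cross, which is a frequent source of "my buy signal never triggers" bugs when the comparison is made on levels instead of on the change.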

lapis sequoia
#

I concur.. DL hasn't had major applications over ML in applications outside of NLP.. I'm not sure about what they do in CV.. but image processing otherwise has had lot of impact

lapis sequoia
#

well then why exactly isn’t it ever used much, in that case? @lean ledge

lean ledge
#

Time to train
Amount of data needed
Being finicky in training

#

Getting ample clean data is ridiculously hard in the real world

#

Unlike traditional models, getting training right is a hard part of DL

lapis sequoia
#

yeah, but yet it still manages to be better in terms of accuracy @lean ledge

lean ledge
#

@lapis sequoia please don't comment on stuff you don't know.

#

It's best to not

lapis sequoia
#

there are literally full on articles online talking about how deep learning's better than machine learning @lean ledge

lean ledge
#

Congrats there's articles saying the opposite also. Your point?

#

I've listed out the reasons why deep learning is rarely the solution for things outside CV and NLP. Think you know better? Tell me why I'm wrong instead of referring to "there's articles"

#

@lapis sequoia

sand reef
#

If I am not wrong, the amount of data needed for DL is the biggest drawback, right?

silent swan
#

there're the generic things like: lots of data, big compute (varies depending on algo), large memory requirement

#

there're more hand-wavey things specific to specific areas like "imposing structure on your problem/solution"

#

DL is big and very good and progressing quickly. It's not all of ML, but it's certainly the most exciting and quickly progressing part of ML right now

#

but it's also because DL is big and exciting that you don't hear much about regular ML (which is still everywhere, and growing), because it's not that exciting to read about

#

no one wants to read about 1000 small companies using collaborative filtering or gradient boosted trees

#

everyone wants to read about deepnudes

lapis sequoia
#

lol deepnudes

quartz stream
#

anyone

#

wanna do a project

#

on Speaker Verification ?

#

creating a complete library where we just take voice of user say from number 1 to 10

#

and create a model that works for that user only

#

there is no ready-made material for this, so the more the merrier

#

maybe publish a good research paper on our technique.

lapis sequoia
#

guys I have an image with shape (120, 120, 2)
how do I show it in matplotlib?

#
```python
import os
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np


DOGS_DIRECTORY = "training_set/dogs/"
CATS_DIRECTORY = "training_set/cats/"
cats = os.listdir(CATS_DIRECTORY)
dogs = os.listdir(DOGS_DIRECTORY)
def load_data_dogs():
  x = []
  for dog in dogs:
    if dog[-4] == ".":
      img = Image.open(DOGS_DIRECTORY + dog)
      img = img.resize((120, 120))
      img = img.convert('LA')
      x.append(np.array(img))
  return x

x = load_data_dogs()
plt.imshow(x[0])
```
#

when i run this code

#
```
TypeError                                 Traceback (most recent call last)
<ipython-input-30-2e8a46bc51f4> in <module>()
     20 
     21 x = load_data_dogs()
---> 22 plt.imshow(x[0])
     23 

3 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/image.py in set_data(self, A)
    636         if not (self._A.ndim == 2
    637                 or self._A.ndim == 3 and self._A.shape[-1] in [3, 4]):
--> 638             raise TypeError("Invalid dimensions for image data")
    639 
    640         if self._A.ndim == 3:

TypeError: Invalid dimensions for image data
```
#

please ping me when answering Thank you

lapis sequoia
#

how're you able to program an image with shape? @lapis sequoia

earnest prawn
#

@lapis sequoia images always have a shape. What plt.imshow expects, however, is either a 2-D array (X pixels, Y pixels) or a 3-D array (X pixels, Y pixels, channels) with 3 or 4 channels, as the check in the traceback shows. @lapis sequoia what type of image is this to only have two channels?
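
The two channels most likely come from `img.convert('LA')` in the code above, which produces luminance plus alpha. A minimal sketch of one way around it, using a zero-filled array as a stand-in for the real image (so nothing here is the poster's actual data): keep only the luminance channel before plotting.

```python
import numpy as np

# Stand-in for np.array(img) after img.convert("LA"): luminance + alpha.
la_image = np.zeros((120, 120, 2), dtype=np.uint8)

# Option 1: keep only the luminance channel -> 2-D array, which imshow accepts.
gray = la_image[:, :, 0]

# Option 2 (assumption: you want the shape a Conv2D layer expects):
# add a trailing channel axis back, giving (120, 120, 1).
gray_with_channel = gray[:, :, np.newaxis]

print(gray.shape, gray_with_channel.shape)
```

The 2-D array can then be shown with `plt.imshow(gray, cmap="gray")`. Converting with mode `"L"` instead of `"LA"` when loading should avoid the extra alpha channel in the first place.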

dull fern
#

@naive shore have you tried to change dtype parameter ?

lapis sequoia
#

no, I mean I was taken a bit aback that he was even able to literally program his own entire image too, although I shouldn't have been, now thinking back @earnest prawn

earnest prawn
#

What

lapis sequoia
#

I was referring to @lapis sequoia with my comment @earnest prawn

lean ledge
#

@lapis sequoia Can you stop pinging random people with dumb questions? You've been asked to do so by an admin already

lapis sequoia
#

sorry, didn’t mean to ping you asshole @lean ledge

south quest
#

@lapis sequoia That's not an appropriate way to address others, you've already been warned for your behaviour, watch it.

lapis sequoia
#

well tbqf, mr. “dumb questions” wasn’t being appropriate himself, with the tone he was conveying, so @south quest

obsidian linden
#

could agree that was also not polite. Still using '@' on every post is not necessary and very annoying

#

In rooms with so little traffic it should not be needed

#

It triggers a notification

lapis sequoia
#

it’s how I usually directly convey a message to someone in a public chat usually, although I’ve been trying to refrain from doing it as much recently

obsidian linden
#

Probably a good thing. Using @ notifies the user directly even sometimes with sound. Can be very annoying 😃

#

It adds importance and urgency to your message, something that is almost never needed

void anvil
#

I'm running some custom data feature creation on WSL. I'm getting a "kernel has died" error pretty consistently on the run-through. Individually, each function runs through, and the resulting dataframe, when dumped to .csv, should be 1.7 GB (~1.4M rows, ~100 columns) and fits in memory. I've created this for a slightly smaller data set (~1.2M rows, which the estimates are based on). Is this happening due to an out-of-memory error in the WSL subsystem?

Error dump from the console:
https://pastebin.com/X1WrHrLS

grave void
#

I am looking to use Grafana to plot some data via influxDB and am not sure how to structure my data

#

I have a CSV file with a column of scores (0,10) and a timestamp for each score

#

Does anyone here have any experience with Grafana/InfluxDB and how I should go about posting my data?

#

Not sure if I should use a data_frame

lapis sequoia
versed wyvern
#

AYE

#

im here for the DS!

sand reef
#

@lapis sequoia it should always be a 3D image. When it's black and white, it has 1 channel, meaning its dimensions are [X, Y, 1]

#

So try using numpy.reshape

lapis sequoia
#

👍

#

Solved it yesterday but thanks

#

But I get another problem as usual

#

How can I reduce the overfitting

#
```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Conv2D, MaxPooling2D, Activation


x_train, y_train = load_train_data()
x_test, y_test = load_test_data()
x_train = x_train.reshape((-1, 120, 120, 1))
print(x_train.shape, y_train.shape)

model = Sequential()
model.add(Conv2D(120, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Dense(64))
model.add(Activation("relu"))
model.add(Dense(1))
model.add(Activation("sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, shuffle=True)
```
I already normalised the data
earnest prawn
#

Maybe a dropout between the conv and the linear part of the model could help @lapis sequoia

lapis sequoia
#

so a dropout after the first layer?

earnest prawn
#

I'd say after the max pooling but i am honestly not exactly sure
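
In Keras this would be a `Dropout` layer from `keras.layers` added after the pooling layer. To show what such a layer does during training, here is a minimal pure-Python sketch of inverted dropout (the `dropout` helper, the rate of 0.5, and the activations are made up for illustration and are not Keras internals):

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)           # fixed seed so the run is repeatable
activations = [1.0] * 10         # made-up layer outputs
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which tends to reduce overfitting. At prediction time dropout is switched off.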

lapis sequoia
#

okay lemme try after the maxpooling

earnest prawn
#

@lapis sequoia gtg now, I'll be back later but I'm sure if this doesn't help you'll find other people with more ideas

lapis sequoia
#

okay thanks

#

it did show slight improvement but very minute

silk forge
#

whats a good way to visualize pandas data sets

#

python interpreter can't fit all the data 😫

#

what if i wanted to see all the data

brazen wing
#

That would depend on the data

lapis sequoia
#

@silk forge pandas profiling? or you can do a group by and do size()

#

for each group

jagged stump
#

so what about Spark SQL, does anyone know it well?

lapis sequoia
#

what about it

#

depends why you want to use it

jagged stump
#

I have a case study from a company about data analytics. I've never done anything with big data, but this looks fairly big; it's on OpenStreetMap (OSM).

#

So I must do something with sparkSQL(I know basic SQL) or/and MLlib of Spark

#

I am looking for examples and a good introduction to these topics

lapis sequoia
#

if your purpose is to dashboard the data or use sql for running queries on it.. then you're better off using big query, etc..

#

if you want to write nested queries on a large dataset, go with sparksql

silk forge
#

hey

jagged stump
#

I have to go with Spark SQL because, as I already said, it's such a big dataset

lapis sequoia
#

sparksql is not the only tool for handling big data..

jagged stump
#

I want to explore the data with it

crude parcel
#

yeah what do people prefer with big data and python?

#

i personally hate spark

#

good for its time but i dont like it

jagged stump
#

in fact I should know Scala, but I'm too lazy to learn it 😛

crude parcel
#

not a big fan of scala either

#

there are lots of other options too, in python. Maybe some that have CPython involved too.

stoic rune
jagged stump
#

Row(tags=[Row(key=bytearray(b'highway'), value=bytearray(b'residential')), Row(key=bytearray(b'name'), value=bytearray(b'Honv\xc3\xa9d utca'))]) is a sample of the map data; it's the data from the first row. So how can I get all rows with, e.g., key=highway and value=residential using Spark SQL?

#

any idea

celest moss
#

Hey guys ! I am new to machine learning. I want to understand the math used in the sklearn library. Can you recommend any books ?

safe merlin
#

EVERYONE LISTEN, WHICH IS BETTER JUPYTER NOTEBOOK OR JUPYTER LABS

#

OOOF

simple crag
#

Type like an adult

lean ledge
#

@celest moss KF Riley's Mathematical methods for physics and engineering has always been my maths reference of choice

hollow shard
#

Does anyone here know a good tutorial where the derivatives for the first layer of a one-hidden-layer neural net are derived?

#

All the ones that I've read so far gloss over it

#

Also, really sorry for the off-topic question, but does anyone know where I could ask a broad question about fluid simulations with python?

jagged stump
#
 |-- version: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- changeset: long (nullable = true)
 |-- uid: integer (nullable = true)
 |-- user_sid: binary (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: binary (nullable = true)
 |    |    |-- value: binary (nullable = true)
 |-- nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: integer (nullable = true)
 |    |    |-- nodeId: long (nullable = true)
#

I want to parse this data with Spark SQL, does anyone know how?

#

Row(tags=[Row(key=bytearray(b'highway'), value=bytearray(b'residential')), Row(key=bytearray(b'name'), value=bytearray(b'Honv\xc3\xa9d utca'))])

#

Sample of data:
+--------------------+
|                tags|
+--------------------+
|[[highway, reside...|
|[[highway, reside...|
|[[highway, reside...|
|[[highway, second...|
|[[highway, second...|
|[[highway, primar...|
|[[highway, tertia...|
|[[cycleway:both, ...|
|[[highway, second...|
|[[highway, reside...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, second...|
|[[highway, primar...|
|[[highway, reside...|
|[[highway, reside...|
|[[highway, tertia...|
+--------------------+
only showing top 20 rows

#

Sample of data. The tags column holds key-value pairs, as you can see. For example, I want to get the rows where key=highway and value=primary

proud iris
#

Offtopic, and MATLAB. So thing is I'm interfacing with ros and subscribing to some topics in matlab

#

When I'm using the rossubscriber command in terminal, everything is working fine, but when I'm trying to put it in a .m code, for example

Sub=rossubscriber('point_data')
Recv=Sub.LatestMessage
Recv.Data

The code is not printing anything

#

The same code works in the terminal. A note here, the point_data has a custom message, I've tried the same code with other topics with standard messages and they worked in both terminal and .m file

#

I can't understand why something will behave differently in terminal and code

dull fern
hollow shard
#

oh thanks

#

much obliged 👍

lyric canopy
#

This invite was removed by our filters, but it doesn't quite fit what we mean by advertisement. I think it was meant for @proud iris by @polar acorn

proud iris
#

Thanks!

faint spruce
#

Hey guys, I have a small question about Arrays and Lists. Is this the right place to ask?

lyric canopy
#

That probably depends on your question. If it's a more general Python question, you're probably better off in a help channel, since you'll get a much quicker response there. If it's more touching on data science or a related field, then this channel is probably a better fit.

faint spruce
#

Ok, it is actually very basic. I need to compare lists to arrays for an exam. If I want to use an array I have to declare it first, right? So I would need to specify the data type it uses and how big it will be

lyric canopy
#

It depends a bit on what kind of array you're talking about. When you construct a Python array.array, you do indeed set the type, but it doesn't have a fixed size. You can still append to it just like a regular list. A numpy.ndarray also has a fixed element type, but its size is fixed once it's created; numpy.append returns a new array rather than growing the existing one.

faint spruce
#

yeah

#

thats why I am confused

lyric canopy
#

There's another difference that's more important

faint spruce
#

Because that was my understanding of Arrays and then I saw the following code

#

a = arr.array('d', [1.1, 3.5, 4.5])

#

and there was no specification of size

lyric canopy
#

Yes, this sets the type (first argument) and initializes it with three elements (from the iterable that's provided as the second argument)

#

But, you can still append to it

#

a.append(9.2)

#

It's just like a python list: a = list([1.1, 3.5, 4.5])

#

It's just that if you were to look at how the two are defined in memory, then you'd see something entirely different
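
A minimal sketch of the visible difference so far: both grow with `append`, but only the list accepts arbitrary objects (the string value is just a placeholder to trigger the type check):

```python
from array import array

a = array("d", [1.1, 3.5, 4.5])   # "d" = C double; no size is declared
a.append(9.2)                      # grows like a list

lst = [1.1, 3.5, 4.5]
lst.append("anything")             # a list accepts any object

try:
    a.append("anything")           # an array only accepts its declared type
    rejected = False
except TypeError:
    rejected = True

print(len(a), rejected)
```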

faint spruce
#

yeah

#

thats what I am getting at

#

so isn't that type of array more of a linked list?

lyric canopy
#

No

#

A Python list doesn't actually contain the values in memory, but rather references to values. In this case, you'd have three float objects somewhere in memory, 1.1, 3.5, and 4.5, and the list object holds references to those objects

#

So, the list itself doesn't really care for the size of objects, because it only ever holds references to objects "somewhere" else

faint spruce
#

hm

lyric canopy
#

An array.array, in contrast, doesn't store references, but rather the binary representation of the values itself

faint spruce
#

ok

#

and that binary representation is stored all in the same place, right?

lyric canopy
#

So, if you were to look in memory, you'd really see those three floats represented as bits, stored sequentially in an array
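
A minimal sketch that makes this visible from Python, assuming the usual 8-byte C double: the array's buffer is exactly three doubles back to back, and decoding those bytes gives the values again.

```python
from array import array
import struct

a = array("d", [1.1, 3.5, 4.5])

raw = a.tobytes()                    # the array's underlying buffer
print(a.itemsize, len(raw))          # bytes per element, bytes in total

# Decoding the buffer as three native doubles gives the values back.
decoded = struct.unpack("3d", raw)
print(decoded)
```

A list offers nothing comparable, because its own storage holds references to float objects that live elsewhere.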

faint spruce
#

ah

#

And a linked list is something different, right?

lyric canopy
#

Yes

faint spruce
#

so there are 3 different types of ways to store lists/ collections of data

lyric canopy
#

In a linked list, there is no sequence of values in memory, one after the other. Rather, you have an element that holds a reference to the next element, which holds a reference to the next, and so on

#

So, you've got separated things in memory, but they reference each other in a linked chain, so a linked list.

faint spruce
#

ok, so the reference is stored with the pieces of data?

lyric canopy
#

Yes and each value just references the next one (singly linked list) or both the previous and the next one (doubly linked list)

#

But, to get to the third item, you need to start with the first, get the address of the second, then look at the second to get the third

#

Since the first doesn't know where the third is, it only knows where the second is

faint spruce
#

jeez. I didn't think there were this many caveats to something as simple as storing packets of data. That's so interesting

lyric canopy
#

The problem with a regular list or array.array is that when you want to insert or remove something from the middle or the beginning, you need to move all the values after it, since you can't have a gap

#

With a linked list, if you want to cut out, say the sixth element, you only need to change the address the fifth refers to: let it point to what was the seventh instead of the sixth
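
A minimal sketch of a singly linked list showing both points made above: reaching an element means walking the chain, while removing one only rewires a single reference (the `Node` class and `get` helper are illustrative, not a standard library type):

```python
class Node:
    """One element of a singly linked list: a value plus a reference to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


# Build 1.1 -> 3.5 -> 4.5 by nesting the nodes.
head = Node(1.1, Node(3.5, Node(4.5)))


def get(head, index):
    """Walk `index` links from the head; cost grows with the index."""
    node = head
    for _ in range(index):
        node = node.next
    return node.value


third = get(head, 2)   # must pass through the first and second nodes

# Removing the second element only rewires one reference.
head.next = head.next.next
remaining = [get(head, 0), get(head, 1)]
print(third, remaining)
```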

#

This makes for very fast "popping" at both ends, and that's why Python's collections.deque is built as a doubly linked list (in CPython, a doubly linked list of fixed-size blocks).
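
A minimal sketch of the deque operations being described, next to the list equivalent for comparison:

```python
from collections import deque

d = deque([1, 2, 3, 4])

d.appendleft(0)        # O(1) at the left end
d.append(5)            # O(1) at the right end
first = d.popleft()    # removes 0
last = d.pop()         # removes 5

# The list equivalent of popleft() has to shift every remaining element.
lst = [1, 2, 3, 4]
lst_first = lst.pop(0)

print(first, last, list(d), lst_first, lst)
```

The trade-off is that indexing into the middle of a deque is slower than indexing into a list.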

faint spruce
#

Can I recap if I got everything right?

lyric canopy
#

Sure

faint spruce
#

so there are more or less 3 different ways to store collections of data

#

An array just takes some space in memory and then stores them one after another

#

A python list, doesn't store them in the same place, but stores the location of where the actual data is stored

#

And a linked list is stored "all over the place" with references from each element to the next one?

lyric canopy
#

And sometimes from the current one to the previous one as well (for a doubly linked list)

#

But, yeah, that sounds about right.

faint spruce
#

nice

#

Im going to get to 10 minutes from this stuff alone :D

#

I will have to hand in some sources showing where I know all of this stuff from. Do you know any by chance?

lyric canopy
#

No, I'm sorry, I don't have any handy

faint spruce
#

ok

#

oh and one more thing. Those things in linked lists that reference to the next/previous element, are these called pointers or am I getting things mixed up?

lyric canopy
#

Yes,

#

Although we don't usually use that term in Python, as Python abstracts that layer away

faint spruce
#

Yeah, but my exam requires me to dig a little deeper into the stuff we covered in class

lyric canopy
faint spruce
#

Thank you. You are a great help...

#

Another question: what advantages does the method bring in which we store the address of the data in a list? Wouldn't it itself need to store the data in some type of array or linked list? Why not skip this extra step?

void anvil
#

Because it’s faster to go to 6 than to 1,2,3,4,5,6

lapis sequoia
#

What's the difference between tensorflow and keras?
I mean what are they, algorithms or libraries.?

earnest prawn
#

Tensorflow is a c++ library for ml related algorithms with a python frontend

#

An ugly one

#

Or rather hard to use for the average guy

#

Keras is part of tf (tf.keras) and exposes an easier to use and more simplistic python API

#

@lapis sequoia

lapis sequoia
#

Why is keras separate while importing?

#

One can just say import keras

#

Without importing tf

lean ledge
#

Keras can be used for more than just tf

lapis sequoia
#

Like for most of neural networks algorithms

#

And I'm running TF on Spyder, and I can't import the MNIST dataset.
Does it need internet to work with TF?

vale arrow
#

Would this be an appropriate channel for help with pandas?

#

or are the general help channels where I should go?

polar acorn
#

Go ahead, I don't know how appropriate it is but you're hardly alone in doing it.

zenith nova
#

pandas fits here I believe

vale arrow
#

Gotcha

#

I'm trying to do some data processing with pandas in Python, but when I give the command to split my data into the dependent variable and independent variable I'm having an issue. Code below:

#
```python
import matplotlib.pyplot as plt
import pandas as pd

# Import Dataset
dataset = pd.read_csv('CyberData.csv')
x = dataset.iloc[:, :-1].values  # not viewable object?
y = dataset.iloc[:, 44].values
```
zenith nova
#

```python
code
```

vale arrow
#

Thanks. Both my dataset variable and my y variable are correct. But my x variable turns into an n-dimensional array (ndarray object of numpy module, as my IDE calls it). I've no idea why it is turning this into that instead of a 2D array like my y variable.

zenith nova
#

:-1 -> -1?

vale arrow
#

I can provide the file if you would like as well

#

But I need every column besides the last column

zenith nova
#

well

#

If you're slicing over 3 dimensions you'll get an ndarray

#

your second line slices over 2 dimensions

vale arrow
#

how so?

#

Does :-1 not just mean all columns but 1?

zenith nova
#

it does

#

but :, : means all/all

#

whereas the other means all/one

#

...I think

vale arrow
#

If I just do -1 it only gives me the last column, essentially making x the same as y

zenith nova
#

what are the shapes of the outputs

vale arrow
#

dataset.head():

#
```
0 Copper 24.2 43 ... -2370 -262.368 -2570
1 Copper 24.2 44 ... -2320 -263.158 -2580
2 Copper 24.2 45 ... -2310 -263.158 -2580
3 Copper 23.7 43 ... -2380 -262.632 -2580
4 Copper 23.8 43 ... -2380 -263.158 -2590
[5 rows x 45 columns]
```
#

y is just a 1x10 array, while I need x to be a 44x10

zenith nova
#

dataset.iloc[:, :-1 ].shape

vale arrow
#

ah okay my bad. one sec

#

(10,44)

zenith nova
#

and the other lines .shape?

#

dataset.iloc[:, 44].shape

vale arrow
#

(10,)

zenith nova
#

right, so one is a one-dimensional array, and one is a two dimensional array

vale arrow
#

But when I split it into x, the type listed is different than y and I can see a preview of y, but not x

#

whereas in the videos I was watching, they did what I'm doing (with different but similar data) and x was the same as y

paper niche
#

this just seems like an ide issue. what ide?

vale arrow
#

Spyder

#

It may be, I'm not really sure. They are using spyder in the video I'm watching without an issue, so I can only assume it's something on my end

paper niche
#

well it's easy to check I think. generate a random 2D numpy array and see if you can view it

vale arrow
#

Yep, I can

#

So as far as data types go, 1D and 2D arrays say either int64 or float64 (respectively), while my x variable just says 'object'. It also won't list the data inside the sidebar like a 1D or 2D does, instead it says 'ndarray object of numpy module' which tells me it's like a 3D array or something?

polar acorn
#

Seems like a Spyder issue. And seems your code should work fine despite it.
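
One possible explanation for the 'object' type, offered as a guess since the thread doesn't confirm it: the first column holds text ("Copper") while the rest are numbers, so `.values` has to fall back to `dtype=object` for x. The array is still 2D. A sketch with a made-up stand-in for the CSV:

```python
import pandas as pd

# Made-up stand-in for the CSV: one text column, the rest numeric.
dataset = pd.DataFrame({
    "material": ["Copper", "Copper", "Copper"],
    "temp": [24.2, 24.2, 23.7],
    "count": [43, 44, 45],
    "target": [-2570, -2580, -2580],
})

x = dataset.iloc[:, :-1].values   # every column except the last
y = dataset.iloc[:, -1].values    # only the last column

x_shape, x_dtype = x.shape, x.dtype   # 2D, mixed types -> object
y_shape = y.shape                     # 1D

# Dropping the text column gives a plain numeric 2D array again.
x_numeric = dataset.iloc[:, 1:-1].values
```

If that is what is going on, the data itself is fine and only the variable viewer struggles with object arrays.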

vale arrow
#

That doesn't really make sense, as in the video his array has multiple types and works just fine. Regardless, I suppose I'll have to switch. What IDE do you all recommend?

polar acorn
#

In the thread they mentioned it might be a problem with a specific Spyder version. Maybe check that you're using the latest one?

vale arrow
#

I just downloaded it a few weeks back. So I would think it should be newer than whatever version they were using in 2018 lol

#

I'll check

polar acorn
#

I use PyCharm for the record. ¯\_(ツ)_/¯

vale arrow
#

I've tried like 5 different ways of updating this IDE. It refuses to change, so PyCharm it is

haughty wind
#

I'm trying to fit a neural net on my laptop, which has an NVIDIA GTX 1050

#

For now, I have ~36k training images, and I'm using VGG19 and doing some transfer learning so I don't have to train the net from scratch

#

Problem is, when I run the code I get a resource exhausted error

#

Is this reasonable, given the amount of data I have, my GPU, and the RAM (24 GB)?

#

Do I have to acquire a dedicated machine for this, or is there a way to optimize my data/training procedure?

earnest prawn
#

The issue here is that your GPU itself only has a limited amount of memory (GPUs have dedicated so-called VRAM, which at least on normal ones doesn't exceed single-digit GB). What you can do to stop this error from happening with limited resources is split your data up into so-called mini-batches: instead of fitting on all 36k images at once, only fit on as many images as your GPU can hold at a time, then on the next mini-batch, and so on until you have gone through every image, and then continue with the next epoch @haughty wind

lean ledge
#

👍

#

You're probably running out of vram. Try monitoring your vram usage using nvidia-smi and reduce your batch size to 1

#

(for now)

#

If it still can't fit on your VRAM, you'll need to use a different GPU

earnest prawn
#

(if VRAM is actually the issue, that's highly unlikely, as the GPU model you have doesn't come in a version with less than 2 GB... and you'd really have to have a big image to exceed that with one image at a time)

haughty wind
#

Right now my images are somewhere around 70x70 pixels or so

earnest prawn
#

yeah if VRAM is actually the issue the batch size fix should work out perfectly

haughty wind
#

I dropped my batch size to 4 as well as samples per epoch

#

I forgot I also dropped my epochs to 5 so my net understands nothing, but at least it made it through

earnest prawn
#

cool

#

keep increasing the batch size until it doesn't work anymore, then decrease it by a little and you should be ready to go 👍
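
For anyone wondering what "mini-batches" means mechanically, a rough sketch in plain NumPy. `iter_minibatches` is a hypothetical helper written for this example and the data is made up:

```python
import numpy as np

def iter_minibatches(x, y, batch_size):
    """Yield consecutive (x, y) slices of at most batch_size samples."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

# Made-up stand-in data: 10 tiny "images" of 4x4 pixels with binary labels.
images = np.zeros((10, 4, 4), dtype=np.float32)
labels = np.arange(10) % 2

# Only one batch at a time would need to sit in GPU memory.
batch_sizes = [len(xb) for xb, _ in iter_minibatches(images, labels, batch_size=4)]
```

In Keras you normally don't write this loop yourself; passing `batch_size` to `model.fit` has the same effect.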

haughty wind
#

ok cool, thanks for the help

onyx granite
#

what's the recommended approach for deploying ML models? specifically just a normal pickle/joblib? other than using AWS, Azure, Google Cloud

#

Flask + Celery + Redis?

quartz stream
#

Hey

#

I have a dataset

#

Can anyone help me in Modifying it

#

like every row is in form of id;message;label in one column

#

i need them in different column

polar acorn
#

Is your dataset stored in a csv?
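
If it is a CSV-like file, two options that might work, sketched with made-up rows in the `id;message;label` layout. The column names are assumptions:

```python
import io
import pandas as pd

# Made-up rows in the "id;message;label" layout described above.
raw = io.StringIO("1;hello there;ham\n2;win a prize now;spam\n")

# Option 1: tell read_csv that the delimiter is a semicolon.
df = pd.read_csv(raw, sep=";", header=None, names=["id", "message", "label"])

# Option 2: the data is already loaded as a single text column.
single = pd.DataFrame({"raw": ["1;hello there;ham", "2;win a prize now;spam"]})
parts = single["raw"].str.split(";", expand=True)
parts.columns = ["id", "message", "label"]
```

Note that option 2 leaves every column as text, and both options assume the messages themselves contain no semicolons.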

hollow quartz
#

Hey in my data set i have two feature, one of type datetime and one double values. I want to predict double values. Is it possible?

lapis sequoia
#

Has anyone here done a proper live company project?

I have started to work in a small startup, it's basically a online event ticketing platform website.
In order for me to join this company they have given me a task to come up with a model that can help their business.
I want to know how can I use machine learning or any similar technology in python to create a live project which can be deployed to help that company.
Any ideas?

polar acorn
#

Come up with an idea or create and deploy a model to carry it out?

odd osprey
#

I need help with keras model

#

I'm getting "must be from the same graph as Tensor" when I'm trying to use it from Python, loading a reshaped cv image

earnest prawn
#

@odd osprey could you show us your code and your exact error?

odd osprey
#

ahm, just fixed finally -_-'

#

my issue was actually very stupid - I was wrong with singleton in Python :/

distant merlin
#

@earnest prawn Well well well

earnest prawn
#

what

distant merlin
#

I joined a server and here you are

#

This was unexpected One in random

earnest prawn
#

ive been here for ages lol

distant merlin
#

I just joined

#

server in nice structure

#

The Authentication is nice

#

Best to avoid spam

dense rose
#

Does anyone use the vim mappings in jupyterlab?

#

I'm trying to figure out how to do some remapping.

gusty maple
#

Hello, has anyone here studied Sugarscape or the Think Complexity book?

lapis sequoia
#

I have a problem i'm not sure how to solve

#

there's labelled data, random garbled text sequences and they're split into four classes.. how would I go about running a classifier on them

#

they're not meaningful words but all the words in each sequence are five characters, all alphanumeric

onyx granite
#

so like example: row 1: sdasa tyghs ileis 1 row 2: ffasa byghs glets 2

#

?

#

1,2 being labels

lapis sequoia
#

yeah

#

I have no clue how to approach this.. because they are meaningless

onyx granite
#

meaningless meaning they are meaningless but have a certain pattern

#

or meaningless meaning you just dont know the pattern

#

or meaningless meaning literally meaningless

#

because the last part wont really let you do much with the data

lapis sequoia
#

meaningless meaning I probably don't know the pattern

#

or what it's about

#

also, the words are separated by commas

onyx granite
#

if you think there might be a pattern and you just dont know it, usually if it's dealing with words or characters it's more NLP based

#

fair warning i am not data scientist

#

but usually something related to a multi class classifier like naive bayes is a possible starting point

#

or if you can use some type of unsupervised learning it might reveal a pattern

#

i would look up like NLP algorithms that can do multi class classification

#

then just create your normal train, test, validation sets

#

there is also some for document classification, like if you have 30 different text corpuses from different letters you are reading, it can detect which ones belong to which areas

#

but that's more unsupervised, you said they have classes already

lapis sequoia
#

the hints do say that the words are from a text document

#

but it is a classification task

#

I will use unsupervised for eda

onyx granite
#

try something like naive bayes first

#

then move up to logistic regression + word2vec or something like that

#

then if you need even better results, move towards something more advanced

#

honestly logistic regression works very well for like sentiment analysis and such, even text classification in general

#

but obviously wont be as good as something like a NN

lapis sequoia
#

thanks man

desert oar
#

@lapis sequoia what kind of data is this?

#

it's "garbled" but is there any meaning to it?

#

if it's truly random then you won't have much luck with this

lapis sequoia
#

@desert oar i'm not sure where it's from.. I was only told "training data is tab separated with a label and tokenized sequential data..the tokenized data are actually words of a text document. Each token is a length of 5 (alpha-numeric letters)."

#

but it is a classification task

desert oar
#

I see

#

So they are words but you arent allowed to see the words? Interesting

#

Yes bag of words classification seems to be the best (maybe only) option
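
A minimal sketch of bag-of-words plus naive Bayes with scikit-learn, assuming comma-separated tokens. The sequences and labels below are invented, and `split_tokens` is a hypothetical helper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up token sequences: comma-separated 5-character tokens with a class label.
sequences = [
    "sdasa,tyghs,ileis", "sdasa,ileis,sdasa", "tyghs,sdasa,ileis",
    "ffasa,byghs,glets", "glets,ffasa,ffasa", "byghs,glets,ffasa",
]
labels = [1, 1, 1, 2, 2, 2]

def split_tokens(text):
    """Treat each comma-separated token as one 'word'."""
    return text.split(",")

model = make_pipeline(
    CountVectorizer(analyzer=split_tokens),  # bag-of-words counts per token
    MultinomialNB(),
)
model.fit(sequences, labels)

predicted = model.predict(["ileis,sdasa", "ffasa,glets"]).tolist()
```

The tokens never need to mean anything to a human; the classifier only cares how often each one shows up per class.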

lapis sequoia
#

yeah.. and there's 4 classes, so I'm guessing if there are common words between labels, I should drop them and check accuracy of classifying again

#

they might be stop words, etc

desert oar
#

Yeah although i wouldnt drop all common words

#

Could use eg chi square or mutual information for feature selection

#

How many records and how many unique words?

lapis sequoia
#

i'm not sure, let me reduce it and check

#

reduce it to a csv I mean, with the labels removed

#

I wonder if it makes sense to check if there are duplicate rows (same sequences repeated) but i'm not able to view it as rows next to each other when I do :

desert oar
#

How is your data provided?

lapis sequoia
#

the data is in a tsv, label separated by sequence of words

desert oar
#

And are they actually sequential or just a bag of words already

lapis sequoia
#

let me show an example row

#

printed from a df

desert oar
#

Use df['sequence'].str.split(',') to get a column of lists of words

#

No need to save again

#

However you can save as Parquet format in the future to avoid having to split again every time you load

lapis sequoia
#

I have the column of words now, it's a series and each series seems to be now converted to a list

#

I might export to csv and check it in something like AnswerMiner to get quick eda

desert oar
#

Each element of the series will be a list, yes

#

You can use .map(len) on that now to get the length of each list, etc

lapis sequoia
#

they're all varying lengths, I see 19 and 45.. etc.

desert oar
#

Use matplotlib or .plot to plot them, eg hist or kernel density

#

Or .describe

lapis sequoia
#

63799 count, min 14, max 204, mean 44

desert oar
#

Use co.Counter(_.tolist()) to get word counts

#

import collections as co

lapis sequoia
#

to_list on the series?

#

I can do the map on the co.Counter()

desert oar
#

@lapis sequoia .tolist() and .map() are Series methods

#

its much faster to iterate over a list than a series

#

Btw that was wrong anyway, it should be Counter(chain.from_iterable(df['sequences_split'].tolist()))
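
Putting those steps together, a short sketch on a made-up frame. The column names `sequence` and `sequences_split` follow the ones used above, and the rows are invented:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Made-up frame with the comma-separated layout from the thread.
df = pd.DataFrame({
    "label": [1, 2, 1],
    "sequence": ["sdasa,tyghs,ileis", "ffasa,byghs", "sdasa,ileis"],
})

df["sequences_split"] = df["sequence"].str.split(",")   # column of lists
lengths = df["sequences_split"].map(len)                 # tokens per row

# Flatten all the lists into one stream of tokens and count them.
word_counts = Counter(chain.from_iterable(df["sequences_split"].tolist()))
```

`len(word_counts)` then gives the number of unique words, and `word_counts.most_common(20)` is a quick way to spot candidate stop words.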

lapis sequoia
#

neat:)

#

im just cleaning up my code.. seeing if some of the test sequences are already in train, etc.. haven't gotten to training part yet

celest moss
#

I am trying to implement Naive Bayes classification on the Iris dataset, but the distributions of some features are not normal, so how do I proceed? Should I just ignore it and use a Gaussian distribution?

polar acorn
#

@celest moss I think you assume normality for each feature conditioned on each class. So if the PetalLength for each class looks somewhat normal that would be good enough.

celest moss
#

@pptt, Thank you ! Implemented with normal distribution and seems like you are right, I got an accuracy of 93.33 %. So you are saying that P(feature_vector | class) must be normally distributed right ?

polar acorn
#

Yes, that is the assumption we are making at least. I'm sure you could often get good enough results even if that is not entirely the case.
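
A rough sketch of that with scikit-learn's `GaussianNB` on Iris. The split size and `random_state` are arbitrary choices for this example, so the accuracy will not necessarily match the 93.33% above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# GaussianNB fits one mean and variance per feature *per class*,
# i.e. it models P(feature | class) as a normal distribution.
model = GaussianNB().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Per-class feature means, shape (n_classes, n_features).
class_means = model.theta_
```

So the thing to eyeball is a histogram of each feature within each class, not the histogram of the feature over the whole dataset.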

celest moss
#

Thank you very much !

polar acorn
#

np

#

@heavy crow @digital meteor @digital crescent
The data science channel could use some of this discussion 😃

digital meteor
#

hi

heavy crow
#

this is the dataset

#

feel free to download it and play with it

digital meteor
#

okay so what I am saying to do is
fit a polynomial to it
in a linear regression

heavy crow
#

go ahead and do it. i cant get any good results with it

digital meteor
#

i.e the cubic fit that you already did

heavy crow
#

because for the 5th time. i dont want the trend

#

i want values

polar acorn
#

@heavy crow the thing you should do to make sure that your LSTM actually performs well is to compare it to a simple model. So for instance, split the data at several points, train an LSTM on the data prior to each point, and fit a linear model to the same data. Then compare how wrong their predictions are on the next three months.

heavy crow
#

yeah. does way better than the linear

#

the linear does terrible.

polar acorn
#

What Apex is trying to tell you is that the LSTM likely isn't doing better than the linear trend in such a scenario.

digital meteor
#

^

#

or the quadratic/cubic

#

i.e. this

heavy crow
#

the obv hits almost all the points in training and hits most points in validation

polar acorn
heavy crow
#

mk gonna take a look at tha

#

t

digital meteor
#

essentially a summary of what I have been trying to say is that
I think deviations from that quadratic curve are random noise
and when I say random noise what I mean is "a random variable that is normally distributed with constant variance"
that's how it was defined in my uni lectures anyway 🤷

heavy crow
#

ok cool but that doesnt help at all when you are trying to predict values not trends

digital meteor
#

you may be able to improve the quadratic fit with some time-lagged variables i.e. autoregression
but I don't think the beta-coefficient on time-lagged variables will be that big in this case

#

well what the quadratic fit allows you to do
is predict future values using the assumption that the quadratic trend will continue

#

so essentially if you are using the quadratic fit to predict a value 3 months in the future, you can do that by carrying on the red line and taking the y-value from that date

#

this is essentially just the regular multivariate regression method

#

what I don't think is that you can make a model (including non-regression-based models) that can predict in the future all those bumps and dips. I don't think it's possible to accurately predict those with the data that you have

#

because, as I was saying, the bumps and dips don't seem to have much of a pattern, they just look like random noise on top of the quadratic curve
(random noise being a random variable that is normally distributed, mean of zero and constant variance)

#

so the way I was taught multivariate analysis was that in this situation you can't capture that random noise, and that you should settle for a polynomial fit, maybe with some lagged variables (autoregression)

#

even if there is an underlying pattern in the deviations from the quadratic, it hasn't repeated itself that many times. This means that if you do extend your model to try to capture the deviations from the quadratic (under the assumption that they are non-random) the model will have a lot of trouble capturing the pattern.

#

so it's possible that a machine learning algo or other recursive algo will capture those deviations from quadratic a bit better, if they are non-random, but it won't beat the quadratic regression by much even if it does. Especially if lagged variables are introduced to the regression because they have some limited ability to capture patterns themselves.

#

that's essentially what I was trying to say earlier
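
A sketch of "carrying on the line" with `np.polyfit`, using a synthetic series since the real water-level data isn't shown here. The coefficients and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up series: a quadratic trend plus zero-mean, constant-variance noise.
t = np.arange(100, dtype=float)
y = 0.02 * t**2 - 1.0 * t + 5.0 + rng.normal(0.0, 0.5, size=t.size)

# Least-squares quadratic fit; coefficients come back highest power first.
coeffs = np.polyfit(t, y, deg=2)

# "Carrying on the line": evaluate the fitted curve at a future time.
t_future = 110.0
forecast = np.polyval(coeffs, t_future)

# What remains after removing the trend is treated as noise here.
residuals = y - np.polyval(coeffs, t)
```

Plotting `residuals` is a quick way to check the noise assumption: if they show obvious structure, there may be something left for a fancier model to pick up.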

polar acorn
#

@heavy crow

Because I'm procrastinating and should be doing useful stuff I did this instead. I wrote a short script that tests your model (the LSTM for instance) vs a linear model with time series cross validation. You just have to implement two functions. One that extracts the features you want from x_train and trains your model from scratch. And one that extracts the features you want from x_test and predicts on it. Feel free to try it out. If you are interested in predictive power, comparing predictions might be a good idea 😃

#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

df = pd.read_csv("water-levels.csv")

# Baseline model, we will try to beat this. (if we can you were right and the "noise" was not noise!)
lr = LinearRegression()

# Your alternative model, implement a train and predict function for your model.
def train_alt_model(x_train, y_train):
    # Your code here
    return alt_model
    
def predict_alt_model(alt_model, x_test):
    # Your code here
    return predictions

# Set up experiment.
tscv = TimeSeriesSplit(max_train_size=None, n_splits=26)
results_df = pd.DataFrame()

# Time series cross validation.
for train_index, test_index in tscv.split(df):
    x_train = df.date[train_index]
    y_train = df.waterlevel[train_index]
    x_test = df.date[test_index]
    y_test = df.waterlevel[test_index]
    
    results_df_tmp = df.iloc[test_index].copy()
    
    # Fit and predict with linear regression
    lr.fit(x_train.values.reshape(-1, 1), y_train)
    results_df_tmp['linear_regression_prediction'] = lr.predict(x_test.values.reshape(-1, 1))
    
    # Fit and predict with alternative model
    alt_model = train_alt_model(x_train, y_train)
    results_df_tmp['alternative_model_prediction'] = predict_alt_model(alt_model, x_test)
    
    # Update results data frame
    results_df = pd.concat([results_df, results_df_tmp], axis=0)

# Compute t statistic, we assume here that both model's residuals are standard normal dist.
linear_regression_residuals = (results_df.waterlevel-results_df.linear_regression_prediction)**2
alternative_model_residuals = (results_df.waterlevel-results_df.alternative_model_prediction)**2
residual_deltas = linear_regression_residuals - alternative_model_residuals
t_stat = np.mean(residual_deltas)/(np.std(residual_deltas)/np.sqrt(len(results_df)))
#
# See if we beat linear regression.
if t_stat >= 1.96: # We're assuming here that len(results_df) > 1000
    print("Congrats! Your model beats linear regression!")
elif t_stat > 0:
    print("Your model predicts slightly better than linear regression but the improvement is not statistically significant.\n" 
          "So we can not say for sure that you beat linear regression.")
else:
    print("Your model predicts worse than linear regression. Did you even try?")
    
plt.plot(df.date, df.waterlevel)
plt.plot(results_df.date, results_df.linear_regression_prediction)
plt.plot(results_df.date, results_df.alternative_model_prediction)
plt.legend(['waterlevel', 'linear reg', 'alt model'])
plt.show()
#

n_splits is set to 26, which is equivalent to predicting a year ahead into the future at each split. Set it to 100 to predict 3 months ahead, although this might take some time. If it takes too long you can set it to 10, which is predicting 2.5 years into the future. Or you could reduce max_train_size to cut down on the time.
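
To see what the splitter actually hands back, a tiny sketch on a made-up series of 12 points rather than the real data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)   # made-up series of 12 time-ordered observations

splits = [
    (train.tolist(), test.tolist())
    for train, test in TimeSeriesSplit(n_splits=3).split(samples)
]
# Every test fold lies strictly after its training data,
# and the training window grows from split to split.
```

More splits means shorter test folds, which is why raising n_splits shortens the forecast horizon.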

Also if you want to compare linear and quadratic regression I made the two functions you have to implement:

def train_alt_model(x_train, y_train):
    alt_model = LinearRegression()
    x_train_reshaped = x_train.values.reshape(-1, 1)
    x_train_full = np.stack([x_train_reshaped[:,0], x_train_reshaped[:,0]**2], axis=1)
    alt_model.fit(x_train_full, y_train)
    return alt_model
    
def predict_alt_model(alt_model, x_test):
    x_test_reshaped = x_test.values.reshape(-1, 1)
    x_test_full = np.stack([x_test_reshaped[:,0], x_test_reshaped[:,0]**2], axis=1)
    predictions = alt_model.predict(x_test_full)
    return predictions

Good luck!
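
For reference, the significance check in the script boils down to a paired t statistic on the squared errors. A standalone sketch with synthetic errors (the numbers are made up, and `ddof=1` is used here where the script uses NumPy's default):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up prediction errors for two models on the same 2000 test points.
baseline_errors = rng.normal(0.0, 2.0, size=2000)
alt_errors = rng.normal(0.0, 1.0, size=2000)

# Positive deltas mean the alternative model had the smaller squared error.
deltas = baseline_errors**2 - alt_errors**2

# Paired t statistic: mean difference over its standard error.
t_stat = deltas.mean() / (deltas.std(ddof=1) / np.sqrt(deltas.size))
alt_is_better = t_stat >= 1.96   # roughly the 5% two-sided cutoff for large samples
```

Treat the threshold as a rough guide only: forecast errors in a time series are usually correlated, which this simple test ignores.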

lapis sequoia
#

hi

lapis sequoia
#

this is weird

#

my numpy array shows datatype as

#

dtype='|S1181'

strong flare
#

Hi guys, may I know what error this is? I can't do the prediction on my data; it's showing me a KeyError

polar acorn
#

It says pretty clearly that it's a KeyError. You are asking where in the index the key "20150217" is. But that key isn't in the index, so you get an error.

strong flare
polar acorn
#

The result doesn't have an index. When you call result.predict, somewhere inside that function it checks some index for your key. I don't know what index that is or what the insides of result.predict look like. Can you try to remove "20150217" and only predict on "20170301"?

strong flare
polar acorn
#

Ah I see now, those are the start and end values. Okay you should put that value back. Hmmm.

strong flare
#

😂 in a deadlock now, can't figure out which part has the error

jagged stump
#

What is the best way to do real-time logo detection? Any suggestions?

quartz monolith
#

I have a big dataframe to clean and some columns have

```
'200, 334'
'200/334'
'200/ 334'
ADJ:Enc'
>Faul
```
First I want to clean them by removing the spaces; afterwards I want to split them into two different columns to train the model
heavy crow
#

.replace(" ","")

#

And by what do you want to split them?

polar acorn
#

@quartz monolith
This should work, replace col with your column name.

import pandas as pd
df = pd.DataFrame({'col':['203,608', '200, 334', '200/334', '200/ 334']})

df.col = df.col.str.replace(' ', '')
df[['col_split_1', 'col_split_2']] = df.col.str.extract(r"(\d*)[,/](\d*)")
quartz monolith
#

@polar acorn Thanks that exactly what I'm looking for

polar acorn
#

np

quartz monolith
#

What's the best method for dealing with NaN values in machine learning? Dropping the rows, converting them to an int like 999, or just replacing them?

#

If I drop the rows I will lose data and get worse results, so IMO that's not an option

desert oar
#

It depends on what the missing values mean

#

Sometimes it's better to "impute" the missing values
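
One common way to impute, sketched with pandas on a made-up column (the name `pressure` and the values are invented). Whether the median is the right fill depends on what the missing values mean, as said above:

```python
import numpy as np
import pandas as pd

# Made-up column with gaps.
df = pd.DataFrame({"pressure": [200.0, np.nan, 334.0, np.nan, 210.0]})

# Keep a flag so the model can still see which values were missing.
df["pressure_missing"] = df["pressure"].isna()

# Replace the gaps with the median of the observed values.
median = df["pressure"].median()
df["pressure_imputed"] = df["pressure"].fillna(median)
```

Compared with a sentinel like 999, this keeps the filled values inside the normal range of the column, and the extra flag column preserves the information that something was missing.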

hollow quartz
quartz monolith
#

Why dont you round the values? @hollow quartz

hollow quartz
#

ok i try

vestal gazelle
#

@hollow quartz also string formatting might help, if that applies in your case

quartz monolith
#

I'm training a DecisionTreeClassifier and I'm really baffled... Why does a 5500-row dataset perform better than a 130k one? Does somebody have experience with DTs?

#

dataset  metric        precision  recall  f1-score  support
small    weighted avg  0.57       0.59    0.57      1383
big      weighted avg  0.56       0.58    0.57      33584

onyx granite
#

your smaller dataset might be overfitting due to fewer samples that are more representative of your actual dataset

#

also I think DTs are very much influenced by the data size more so than others

#

also be sure you dont have a class imbalance issue somewhere

#

just a few things off the top of my head

onyx granite
#

Hi, I am trying to deploy a model to Google AI Platform and running into some generic errors, was wondering if anyone has run into the error I am getting

#
Create Version failed. Bad model detected with error: "Failed to load model: Could not load the model: /tmp/model/0001/model.joblib. 47. (Error code: 0)"
lapis sequoia
#

is anyone alive

vital bison
#

I'm using sns.distplot to compare two features of a training dataset!
does it work only for binary datasets?

do the histograms get corrupted if the dataset is too large?
another question
how important are system features? if I'm using only cloud platforms?
like colab
https://gyazo.com/db268fec4c046c6203f94e0b7799a212

fringe timber
#

i feel alive, but i might be biased, need to collect more data

hollow quartz
#

Hi I want to normalize a column. I can normalize a column of Vector but not a column of value

lapis sequoia
#

what do you mean

hollow quartz
lapis sequoia
#

which column are you having trouble normalizing

hollow quartz
#

conso_total column because it is not a vector

lapis sequoia
#

what does conso total represent

hollow quartz
#

it's my prediction target

#

my code is here df_train.withColumn('Scalerconso_total', ((df_train.conso_total)-_min)/(_max-_min))

#

the problem is here AnalysisException: "Can't extract value from conso_total#16: need struct type but got double;"

lapis sequoia
#

sorry it's like 1 am here.. I'll get back to you in the morning

hollow quartz
#

ok thanks

nova crow
#

I am currently focusing on making a "Python bot" for an online Flash game. I've been looking it up on the internet, and most results say that I have to use AutoPy in order to handle the automated mouse clicks and keyboard input. But I'm having problems sniffing the output that comes from the Flash game to my IP address, and if I successfully sniff the packets, how can I use AutoPy in Python to make the bot capture a screenshot every 0.5 seconds? (the bot should be able to do it because it can then convert the captured screen into a numpy array for analysis)

rocky dust
#

I think screenshots can be taken with the ImageGrab module from PIL.

#

Keyboard press and Mouse clicks are also supported by the module: pyautogui

quartz monolith
#

Has anyone run into LabelEncoder failing on never-before-seen values when transforming labels?
I may have found the answer but can't implement it in my code:
https://stackoverflow.com/a/52505373
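
Not necessarily what the linked answer does, but one simple way to sidestep the problem is to build the mapping yourself and send unseen labels to a reserved code. Both functions and the label names below are hypothetical:

```python
def fit_label_map(labels):
    """Map each label seen during training to an integer code."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def encode(labels, label_map, unknown_code=-1):
    """Encode labels, sending anything not seen in training to unknown_code."""
    return [label_map.get(label, unknown_code) for label in labels]

train_labels = ["valve", "pump", "valve", "sensor"]   # made-up categories
label_map = fit_label_map(train_labels)

encoded = encode(["pump", "motor", "valve"], label_map)
```

Whether -1 is a sensible code depends on the model downstream; for features (as opposed to targets) an explicit "unknown" category is often the safer choice.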

hollow shard
#

Hi guys, sorry to bother you again but my mnist 1 hidden layer neural net is still refusing to work. I'm completely stumped

#

here's the code, thanks in advance to those who look at it 👍

vital bison
lapis sequoia
#

is anyone alive

#

I need to evaluate my model on a suitable metric, but I'm wondering if k-fold CV is enough..

#

usually I figure out a good heuristic that's specific for the problem im solving, but i'm tired and facing a deadline..

#

I guess what im asking is.. how do I cover my ass

#

lol

lapis sequoia
#

I saved my model, how do I know the properties of one of my layers

#

I want to find the filter size (kernel size)

#

I got it

#

.get_layer(layername).kernel_size

quartz monolith
#

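```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Accuracy per Class
# ConfusionMatrix heatmap between RootCause
# (new_le, y_test and y_pred come from earlier in the script)

y_test_it = new_le.inverse_transform(y_test)
y_pred_it = new_le.inverse_transform(y_pred)
conf_mat = confusion_matrix(y_test_it, y_pred_it)

f, ax = plt.subplots(figsize=(50, 35))
# cmap=cmap
# Divide each row by its total so every row sums to 1.
conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]
mask = np.zeros_like(conf_mat)  # all zeros, so no cell is hidden
sns.heatmap(conf_mat_normalized, vmax=1, center=0,
            square=True, linewidths=2,
            annot=True, annot_kws={"size": 15}, fmt=".1f", mask=mask)

plt.ylabel('True label')
plt.xlabel('Predicted label')
```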
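
For the "accuracy per class" part, the numbers can be read off the row-normalized matrix without plotting anything. A sketch with a made-up 3-class confusion matrix:

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true labels, columns are predictions.
conf_mat = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 5],
])

# Divide each row by its total so that every row sums to 1.
row_totals = conf_mat.sum(axis=1, keepdims=True)
conf_mat_normalized = conf_mat / row_totals

# The diagonal now holds the fraction of each true class predicted correctly.
per_class_recall = np.diag(conf_mat_normalized)
```

Strictly speaking that diagonal is per-class recall; dividing by the column totals instead would give per-class precision.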

tight sparrow
#

hey guys I'm not sure if i should post here or help

#

im following a video and the n_jobs =1 in output

#

not sure how that can be

feral lodge
#

n_jobs is just the number of parallel CPU threads scikit used; it doesn't change any properties of your linear model. If you have n_jobs=None, then it defaults to 1 anyway, so there's no difference at all @tight sparrow

#

If your output is different from the video, then you might have a slightly different scikit version or something, but that's fine probably

tight sparrow
#

Awesome thanks for the clarification 👍

grizzled folio
#

Hey team, I have a 3d numpy array (e.g. 218x100x2502). I run np.argmax(..., axis=1) to get indices along the middle axis. I want to index another array with the same shape, using these indices. Essentially:

for ii in range(a.shape[0]):
  for kk in range(a.shape[2]):
    out[ii,kk] = a[ii, indices[ii,kk], kk]

I thought maybe np.take would do this (I tried np.take(a, indices, axis=1) but this is definitely not the right thing, since it hangs Python...)

#

np.take_along_axis(a, indices[:,None,:], axis=1).squeeze() -- easy! thanks team
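
A quick check that the one-liner matches the double loop, on small random arrays. The shapes here are shrunk for the example and `scores` is a made-up name for the array that argmax runs on:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((4, 5, 6))      # made-up array used to pick the indices
a = rng.random((4, 5, 6))           # array of the same shape to index into

indices = np.argmax(scores, axis=1)           # shape (4, 6)

# Re-insert the reduced axis so indices lines up with a, then drop it again.
out = np.take_along_axis(a, indices[:, None, :], axis=1).squeeze(axis=1)

# Reference result from the explicit double loop.
expected = np.empty((4, 6))
for ii in range(a.shape[0]):
    for kk in range(a.shape[2]):
        expected[ii, kk] = a[ii, indices[ii, kk], kk]
```

Passing `axis=1` to `squeeze` is a small safety net: a bare `.squeeze()` would also drop any other dimension that happens to have length 1.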

silk forge
#

yo