#data-science-and-ml | Python | Page 206

surreal nacelle Jul 28, 2019, 5:18 PM

#

Started it, and it seems alright so far

teal veldt Jul 28, 2019, 5:58 PM

#

Hello folks, hope you're having a nice evening.

#

I wanted to ask about the best option for choropleth maps in Python. I was looking at Folium, but it seems that it doesn't have a lot of functionality at the moment

#

Do you know of better options? Also, while I'm at it, is anyone aware of a dedicated Data Visualization Discord server?

glass wyvern Jul 28, 2019, 6:22 PM

#

Hello all! I have a question about plotting data which I hope fits here. I have a list of 3D points and some lines connecting these points. How would you go about plotting them? Matplotlib was the first option but it's a bit too slow in displaying the points and lines. Is there anything a bit more efficient? I also tried plotly but the whole web interface is a bit too much. I want to see the data and move the POV a bit. Thanks!

grizzled folio Jul 28, 2019, 9:11 PM

#

@glass wyvern How many points/lines are you plotting? Matplotlib performance can vary a lot depending on how you're trying to do it

muted garden Jul 28, 2019, 9:14 PM

#

Hello, is this program related to the ML field? https://www.udacity.com/course/intro-to-self-driving-cars--nd113

Intro to Self-Driving Cars Nanodegree | Udacity

This introductory program is the perfect way to start your journey.

silent swan Jul 28, 2019, 10:53 PM

#

isn't seaborn built on matplotlib

desert oar Jul 28, 2019, 10:58 PM

#

yes

serene scaffold Jul 29, 2019, 12:35 AM

#

@desert oar, scikit-crfsuite, yes.

crude bloom Jul 29, 2019, 1:33 AM

#

looking for honest impressions, does a deep learning library written in pure python (without numpy, even) sound more like an interesting gimmick or something a depraved mind would come up with?

#

asking for a friend...

earnest prawn Jul 29, 2019, 1:40 AM

#

if its for a learning experience (a very intense one) it is probably okay...if youre actually trying to do stuff with it youll soon, very soon notice the performance impacts of using pure python

crude bloom Jul 29, 2019, 1:42 AM

#

yeah I'm already finding out how incredibly slow it is, it's only instantaneous for <100 parameters

#

half for learning, half for experimenting with something that I can't figure out how to do with pytorch/tensorflow

desert oar Jul 29, 2019, 1:59 AM

#

it might work in pypy

#

should get a few x speedup at least... but ultimately no its not a good idea

#

it will be educational, but not useful beyond that

crude bloom Jul 29, 2019, 2:02 AM

#

ya it's definitely been educational being up close and personal with weights, biases, gradients etc. thanks 😃

silent swan Jul 29, 2019, 2:08 AM

#

funky fancy indexing with numpy breaks my brain 🙃 🙃 🙃

grizzled folio Jul 29, 2019, 3:15 AM

#

hah, yup

gilded notch Jul 29, 2019, 3:35 AM

#

Anyone want to work on building a text-extraction suite? I can't seem to find a decent one that works on Windows. Even the one that works on Linux is a bit dodgy

#

The idea is to extract text so it can be fed into ElasticSearch for further analysis

silent root Jul 29, 2019, 5:03 AM

#

@gilded notch extract text from? also it would be a better idea to get the project started, showcase it in #303934982764625920 and ask for contributors there

gilded notch Jul 29, 2019, 5:04 AM

#

@silent root Thanks, I will do. I'm working on it now but its far from finished.

#

@silent root Oh and extract text from as much as possible: News Sites, Google, Wiki, PDFs, PTT, CSV, Excel, OCR for Images (PDF, GIF, JPEG etc). Also extract text from audio. Ideally in a way that doesn't need any paywalls or outside resources, so no models that are not downloadable.

silent swan Jul 29, 2019, 6:19 AM

#

oh man I'm in such a coding high right now

#

translated some tedious matrix/prob manip code to numpy

#

then rewrote and cleaned up the logic to scale up to massive sizes in pytorch

#

and it works!

gilded notch Jul 29, 2019, 6:21 AM

#

Nice, I'm always suspicious when I do something quite complicated and it just works. It makes me think there is a runtime error that I just haven't encountered yet.

silent swan Jul 29, 2019, 6:28 AM

#

there's that old meme/comic

#

with two panes of the guy being kept awake at night

#

"My code doesn't work and I don't know why"

#

"My code works and I don't know why"

gilded notch Jul 29, 2019, 6:31 AM

#

thats me

quartz stream Jul 29, 2019, 6:54 AM

#

Quick Question

#

%%time
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data1, y_train1)
naive_bayes.fit(training_data2, y_train2)

#

will this train the naive bayes on both training data1 and training data2

#

or will it get trained only on 2

#

Hmm looks like only training data 2 is getting trained

#

how do I do

#

if i wanna train on both

#

obviously training data 1 and 2 have different dimensions

gilded notch Jul 29, 2019, 7:15 AM

#

@quartz stream you can't if they have different dimensions.

#

You can append them onto each other and then just null the dimensions you're not using

quartz stream Jul 29, 2019, 7:16 AM

#

or what if I reshape both into common shapes

#

and add together

#

is it possible ?

gilded notch Jul 29, 2019, 7:33 AM

#

yes they have to be the same shape.

quartz stream Jul 29, 2019, 7:33 AM

#

lol

#

ok

desert oar Jul 29, 2019, 12:34 PM

#

@quartz stream they need the same number of columns. also you can use .partial_fit() to incrementally train on 2 different data sets. only some models support that, however

#

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB
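
A minimal sketch of the incremental approach, assuming both batches share the same columns (for example, text vectorized with one fitted vectorizer). The arrays are made-up toy counts, not the data from the question; `MultinomialNB` supports `partial_fit` just like the `GaussianNB` class linked above.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: both batches MUST have the same number of columns.
rng = np.random.default_rng(0)
X1, y1 = rng.integers(0, 5, size=(20, 6)), rng.integers(0, 2, size=20)
X2, y2 = rng.integers(0, 5, size=(30, 6)), rng.integers(0, 2, size=30)

naive_bayes = MultinomialNB()
# The first partial_fit call needs the full list of classes up front.
naive_bayes.partial_fit(X1, y1, classes=[0, 1])
# Later calls update the same model instead of resetting it, unlike fit().
naive_bayes.partial_fit(X2, y2)

# class_count_ accumulates across calls, so it covers all 50 samples.
print(naive_bayes.class_count_.sum())
```

Calling `fit()` twice, as in the original snippet, discards the first training run; `partial_fit()` keeps the accumulated counts.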

hollow quartz Jul 29, 2019, 1:06 PM

#

I want to delete the multi index.

📎 Capture.PNG

desert oar Jul 29, 2019, 1:13 PM

#

@hollow quartz multi index in the columns?

#

what's the final data format you're going for?

hollow quartz Jul 29, 2019, 1:16 PM

#

@desert oar i want to align Jour, Région, 00:00:00, 01:00:00,.....

desert oar Jul 29, 2019, 1:56 PM

#

@hollow quartz what is the original format of the data and what did you use to put it in that format

hollow quartz Jul 29, 2019, 2:26 PM

#

@desert oar the original data

📎 Capture.PNG

#

I have used the function pivot_table()

quartz stream Jul 29, 2019, 2:51 PM

#

Bro ! @desert oar

#

You are freaking awesome

#

This is the thing I was looking for

#

Spent almost 9 hours with no progress but a workaround

desert oar Jul 29, 2019, 2:52 PM

#

@hollow quartz you can just re-assign to .columns

#

@hollow quartz can you show your whole code

hollow quartz Jul 29, 2019, 2:56 PM

#

ok

#

@desert oar

📎 Capture.PNG

desert oar Jul 29, 2019, 3:04 PM

#

@hollow quartz write values='Total energie soutiree (Wh)' without the []
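
A small sketch of why dropping the `[]` helps, using a made-up frame (the `Jour` and `Heure` columns and the numbers are assumptions, since the real data was only shared as a screenshot). Passing `values` as a list adds an extra column level; passing a plain string does not.

```python
import pandas as pd

# Hypothetical long-format data, loosely shaped like the screenshot described.
df = pd.DataFrame({
    "Jour": ["2019-07-01", "2019-07-01", "2019-07-02", "2019-07-02"],
    "Heure": ["00:00:00", "01:00:00", "00:00:00", "01:00:00"],
    "Total energie soutiree (Wh)": [10, 20, 30, 40],
})

# values as a list -> columns become a two-level MultiIndex
nested = df.pivot_table(index="Jour", columns="Heure",
                        values=["Total energie soutiree (Wh)"])
# values as a plain string -> columns are just the hour labels
flat = df.pivot_table(index="Jour", columns="Heure",
                      values="Total energie soutiree (Wh)")

print(nested.columns.nlevels, flat.columns.nlevels)
# reset_index() turns the index back into an ordinary "Jour" column
print(flat.reset_index())
```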

hollow quartz Jul 29, 2019, 3:06 PM

#

it's working thanks @desert oar

wind marlin Jul 29, 2019, 3:16 PM

#

Hey guys when you were first starting learning ML what did you get stuck on the most?

desert oar Jul 29, 2019, 3:32 PM

#

making sense of all the different tools and models available. it's easier nowadays imo than it was a few years ago, a lot of problems have a "best" solution now, whereas in the past you often had to guess and try a million different things

wind marlin Jul 29, 2019, 3:40 PM

#

What are some best in the game tools for financial market predictions? I already build algorithms but I think they can assist AI. My algorithms aren't all time series based so that might be a challenge if I'm training the AI on tick data and indicator data correct? Would I be looking for a specific genre of ML for this purpose?

#

I figure I will need to find an easy way to port Ninjatrader indicators into Python.

#

That might involve converting C# code into Python. I don't know how doable that idea is though.

silent swan Jul 29, 2019, 5:11 PM

#

We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. gg

desert oar Jul 29, 2019, 5:21 PM

#

yeah BERT is wild

silent swan Jul 29, 2019, 5:23 PM

#

yea but this is saying "lol we pwned BERT six ways from sunday"

desert oar Jul 29, 2019, 5:28 PM

#

which model was this

#

the msoft one?

#

i thought you meant someone just retrained bert w/ different parameters

silent swan Jul 29, 2019, 5:31 PM

#

https://arxiv.org/abs/1907.11692

arXiv.org

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Language model pretraining has led to significant performance gains but
careful comparison between different approaches is challenging. Training is
computationally expensive, often done on private...

#

the new hotness

#

depending on which msoft one you were talking about, that was based on fine-tuning BERT

#

now the two big contenders are XLNet (Google) and RoBERTa (Facebook)

desert oar Jul 29, 2019, 5:36 PM

#

nice they replicated bert too

surreal nacelle Jul 29, 2019, 5:36 PM

#

@wind marlin You could use tradingview, or simply pull the charts from your exchanges, and then use a library like TA-lib to calculate the indicators from there

desert oar Jul 29, 2019, 5:36 PM

#

gotta love a good replication and SOTA result in one paper

silent swan Jul 29, 2019, 5:37 PM

#

otoh

#

"We pretrain our model using 1024 V100 GPUs for approximately one day"

desert oar Jul 29, 2019, 5:37 PM

#

lol

#

this is why fine-tuning is a blessing

silent swan Jul 29, 2019, 5:37 PM

#

just gotta go look around for some spare v100s

#

maybe I got a couple hundred lying here or there

swift karma Jul 29, 2019, 6:36 PM

#

Anyone have experience with APIs?

#

I have a few questions

earnest prawn Jul 29, 2019, 6:39 PM

#

with what APIs? API is just short for application programming interface

swift karma Jul 29, 2019, 6:42 PM

#

Well, I'd like to create an API so I can pull products from other people's websites

#

Like an API with something like Magento.

swift karma Jul 29, 2019, 7:32 PM

#

jinja2.exceptions.UndefinedError: 'form' is undefined

#

anyone know why I'm getting this error

quartz stream Jul 30, 2019, 6:48 AM

#

Is it possible to load csv files

#

without loading it in memory

#

like pandas use memory

#

but my csv is 3GB

silent swan Jul 30, 2019, 7:26 AM

#

you can load csv files row by row
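
A minimal sketch of the row-by-row idea with the standard library `csv` module. The column names and the in-memory stand-in file are assumptions for illustration; with a real file you would pass `open(path, newline="")` instead.

```python
import csv
import io

# Stand-in for a large file on disk; with a real file use open(path, newline="").
fake_file = io.StringIO("Company_ID,amount\nA1,10\nB2,20\nA1,30\n")

def matching_rows(handle, column, wanted):
    """Yield matching rows one at a time, so the whole file is never held in memory."""
    for row in csv.DictReader(handle):
        if row[column] == wanted:
            yield row

rows = list(matching_rows(fake_file, "Company_ID", "A1"))
print(rows)
```

Every value comes back as a string here, so numeric columns need an explicit conversion.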

supple ferry Jul 30, 2019, 7:29 AM

#

@quartz stream you can load the file in chunks. There is a parameter for that (chunksize) in pandas read_csv

quartz stream Jul 30, 2019, 7:32 AM

#

Lol

#

Yeah

lapis sequoia Jul 30, 2019, 11:16 AM

#

Can you also use pandas to load csvs?

#

Sorry if that is wrong, I'm all new to this!

grizzled folio Jul 30, 2019, 11:38 AM

#

I think that was the suggestion: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking

desert oar Jul 30, 2019, 12:41 PM

#

pandas has a very fast and robust csv reader

#

i recommend using it for csvs even if you dont want to do data analysis

#

pd.read_csv('mydata.csv').to_dict(orient='records')

that gives you a list of dicts, what could be better?

#

(obviously consider the cost of adding a large compiled dependency)

quartz stream Jul 30, 2019, 1:02 PM

#


import pandas as pd

df_chunk = pd.read_csv(r'data.csv', chunksize=30000)

def chunk_preprocessing(chunk):
  # Find all the rows where it matches a particular company_id
  data = chunk.loc[lambda df: df.Company_ID == '123345', :]
  return data

%%time

chunk_list = []  # append each chunk df here 

# Each chunk is in df format
for chunk in df_chunk:  
    # perform data filtering 
    chunk_filter = chunk_preprocessing(chunk)
    
    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)
    
# concat the list into dataframe 
df_concat = pd.concat(chunk_list)

#

It is taking around 5s to get entries

#

is there any better way than this

#

I'm thinking the way I select

#

is slow

desert oar Jul 30, 2019, 1:22 PM

#

@quartz stream why use chunksize if you just concatenate everything anyway

#

that feature is for when you need streaming processing

#

i.e. you dont keep it all in memory

#

just delete chunksize= and process all at once

quartz stream Jul 30, 2019, 1:33 PM

#

@desert oar It takes more time to load the csv

desert oar Jul 30, 2019, 1:33 PM

#

you still have to load the whole csv

quartz stream Jul 30, 2019, 1:33 PM

#

actually csv is of 6GB

desert oar Jul 30, 2019, 1:33 PM

#

oh

#

so chunk_preprocessing reduces the size of the data?

quartz stream Jul 30, 2019, 1:33 PM

#

yes

desert oar Jul 30, 2019, 1:33 PM

#

what does the chunk processing do?

quartz stream Jul 30, 2019, 1:33 PM

#

like i want few rows outta

#

6gb

desert oar Jul 30, 2019, 1:33 PM

#

because processing a 6 gb file is just going to be slow

quartz stream Jul 30, 2019, 1:33 PM

#

so why load complete

desert oar Jul 30, 2019, 1:34 PM

#

thats a lot of data. it will be slow

quartz stream Jul 30, 2019, 1:34 PM

#

and it will use memory too

#

loading the 6gb

desert oar Jul 30, 2019, 1:34 PM

#

but you can share your chunk_processing code and maybe it can be made faster

quartz stream Jul 30, 2019, 1:34 PM

#

everything is there

#

see above

#

so the problem is I have a 6GB file, need to filter some rows, and use the filtered output

#

if I load 6gb it will take time to load

#

plus memory usage

#

so i thought with chunksize I will at least save memory usage

desert oar Jul 30, 2019, 1:35 PM

#

you did the right thing

#

is there any better way than this
I'm thinking the way I select
is slow
my response: no, it's going to be slow no matter what. but maybe you can share your chunk_processing code and we could make it faster

quartz stream Jul 30, 2019, 1:36 PM

#

def chunk_preprocessing(chunk):
  data = chunk.loc[lambda df: df.Company_ID == 'Z716683', :]
  return data

#

i also want company id to be provided by user

#

any way for that

#

like I am mentioning Company_ID explicitly here

#

is there any way else

desert oar Jul 30, 2019, 1:38 PM

#

not much you can do for that... have you considered using a streaming command line tool like XSV?

quartz stream Jul 30, 2019, 1:38 PM

#

is it fast

#

?

desert oar Jul 30, 2019, 1:38 PM

#

im not sure if its faster

#

you have to try it and benchmark

#

pass in the company id as a function parameter

def chunk_preprocessing(chunk, company_id):
  return chunk.loc[chunk['Company_ID'] == company_id]

quartz stream Jul 30, 2019, 1:39 PM

#

im talkin about column name

#

not row value

desert oar Jul 30, 2019, 1:39 PM

#

oh.

#

yeah

#

same thing

quartz stream Jul 30, 2019, 1:40 PM

#

i guess this would work

#

yeah

#

thanks

desert oar Jul 30, 2019, 1:40 PM

#

def chunk_preprocessing(chunk, id_colname, company_id):
  return chunk.loc[chunk[id_colname] == company_id]
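
Putting the pieces from this exchange together, a sketch of a streaming filter where both the column name and the value come from the caller. `filter_csv`, the sample data, and the `dtype` choice are assumptions added for illustration.

```python
import io
import pandas as pd

def chunk_preprocessing(chunk, id_colname, company_id):
    return chunk.loc[chunk[id_colname] == company_id]

def filter_csv(source, id_colname, company_id, chunksize=30000):
    """Stream the CSV in chunks and keep only the matching rows."""
    # dtype=str keeps IDs such as '00123' from being parsed as numbers
    reader = pd.read_csv(source, chunksize=chunksize, dtype={id_colname: str})
    parts = [chunk_preprocessing(c, id_colname, company_id) for c in reader]
    return pd.concat(parts, ignore_index=True)

# Tiny in-memory stand-in for the 6 GB file
csv_text = "Company_ID,value\nZ716683,1\nA000001,2\nZ716683,3\n"
result = filter_csv(io.StringIO(csv_text), "Company_ID", "Z716683", chunksize=2)
print(result)
```

Only the filtered rows of each chunk are kept, so peak memory stays near one chunk plus the matches.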

quartz stream Jul 30, 2019, 1:41 PM

#

yeah

#

I'm not at beginner level

#

Thanks !

#

😛

#

You Really Rock

#

BTW XSV doesnt exist for python @desert oar

#

https://github.com/BurntSushi/xsv

desert oar Jul 30, 2019, 1:44 PM

#

@quartz stream it's a command line tool

#

you dont use it in python

quartz stream Jul 30, 2019, 1:46 PM

#

Okay

#

Thanks

#

I'll try and update

#

if it's faster

quartz stream Jul 30, 2019, 2:19 PM

#

@desert oar

#

✌ 😄

#

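%%time
import psutil
import dask.dataframe as dd
print(psutil.cpu_percent())
# dask is lazy: the full file is not read at this point
df = dd.read_csv('data.csv')
data = df.loc[df["Company_ID"] == "12341234"]
# rows are only read and filtered once data.compute() is called
print(psutil.cpu_percent())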

#

1.2
75.0
CPU times: user 37.8 ms, sys: 1.03 ms, total: 38.8 ms
Wall time: 43.6 ms

desert oar Jul 30, 2019, 2:38 PM

#

oh nice

#

i was just using dask

#

didnt think about dask dataframe here

#

anyone ever get the "buffer source array is read-only" error in pandas + joblib?

#

it looks like pandas is trying to mutate itself while joblib is using a memmapped array

surreal nacelle Jul 30, 2019, 2:45 PM

#

while loading something ?

desert oar Jul 30, 2019, 2:45 PM

#

sorta. my code is basically this

    with joblib.parallel_backend('loky'):
        with Parallel(n_jobs=5, pre_dispatch=5, verbose=10) as parallel:
            results = parallel([
                delayed(do_fit_and_score)(k, p, model, x_trans, x, y, ix_train, ix_test, params)
                for p, params in enumerate(grid)
                for k, (ix_train, ix_test) in enumerate(splits)
            ])

joblib detects and automatically caches large matrices and data frames, and then read-only memmaps them in the worker processes

#

but something in my code is trying to mutate the underlying data
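
A sketch of this class of error without joblib, assuming the cause is an in-place write into a read-only buffer. Marking a NumPy array as non-writeable mimics a read-only memmap; copying before mutating is one workaround. joblib's `Parallel` also accepts `max_nbytes=None`, which disables the automatic memmapping of large inputs. The function names are made up for illustration.

```python
import numpy as np

data = np.arange(5, dtype=float)
data.flags.writeable = False  # mimics a read-only memmapped input

def center_in_place(arr):
    arr -= arr.mean()  # in-place update: fails on a read-only buffer
    return arr

def center_copy(arr):
    arr = arr.copy()   # private, writeable copy for this worker
    arr -= arr.mean()
    return arr

try:
    center_in_place(data)
except ValueError as exc:
    print("in-place failed:", exc)

print(center_copy(data))
```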

surreal nacelle Jul 30, 2019, 2:47 PM

#

I'm afraid I can't help you 😄

desert oar Jul 30, 2019, 2:47 PM

#

yeah np. if i figure out a workaround i'll post an update

#

might have to file a bug report w/ pandas

silent swan Jul 30, 2019, 5:19 PM

#

sometimes pandas does some weird stuff under the hood

grizzled folio Jul 30, 2019, 9:18 PM

#

joblib seems neat

#

maybe a more appropriate method than dask to do this outer-loop parallel?

desert oar Jul 30, 2019, 10:49 PM

#

If you're operating on data frames I would use dask

#

Rather, operating on subsets of a very large data frame

#

Joblib is more of a general parallelism library

#

Basically an equivalent to multiprocessing

#

Except it does smart caching of its inputs

#

has a higher level API, and also can use different scheduler back ends

#

sklearn uses it internally in order to parallelize things like grid search

tight sparrow Jul 30, 2019, 10:53 PM

#

any tensorflow 2.0 users?

desert oar Jul 30, 2019, 11:16 PM

#

tensorflow 2.0 noob here

tight sparrow Jul 30, 2019, 11:27 PM

#

@desert oar sorry for the delay

#

so i figured out constants and placeholders are gone

#

what do i use instead?

#

u there dude?

#

or dudette?

#

???

desert oar Jul 30, 2019, 11:34 PM

#

tf.Variable I think?

tight sparrow Jul 30, 2019, 11:35 PM

#

tried and failed

desert oar Jul 30, 2019, 11:35 PM

#

Share code?

tight sparrow Jul 30, 2019, 11:35 PM

#

hang on

#

📎 unknown.png

#

📎 unknown.png

#

restarted kernel and ran latest code

#

any ideas?

desert oar Jul 30, 2019, 11:50 PM

#

its easier if you write the code in discord or on a paste site like https://paste.pydis.com

#

im not sure what im looking at either

#

what are you trying to do

#

(im not going to pretend like the tf 2.0 docs are any good or that the api makes any sense)

#

but yeah placeholders are just gone

#

https://www.tensorflow.org/beta/guide/migration_guide#low-level_variables_operator_execution

TensorFlow

Convert Your Existing Code to TensorFlow 2.0 | TensorFlow Core...

tight sparrow Jul 30, 2019, 11:58 PM

#

yeah looks like just use functions and args

#

that makes sense

desert oar Jul 31, 2019, 12:00 AM

#

i just wish you didnt have to dig to find that page

#

theres literally no API docs for 2.0

#

rather, no link to it

tight sparrow Jul 31, 2019, 12:04 AM

#

well i found it earlier

#

i just didn't get it

#

one of those cases where i was looking at the forest instead of the trees

#

documentation is my kryptonite

#

IMO every documentation should be the vocab, a brief description and several examples of the process in action as simple as possible

#

    print('Operations with Placeholders')
    print('Addition:', np.add(x.numpy(),y.numpy()))
    print('Subtraction:', np.subtract(x.numpy(),y.numpy()))
    print('Multiplication:', np.multiply(x.numpy(),y.numpy()))
    print('Division:', np.divide(x.numpy(),y.numpy()))

#

this is what i ended up doing

lapis sequoia Jul 31, 2019, 12:40 AM

#

Anyone know of any good resources (like videos or websites) that kinda give you a guide to Tensor Flow?

grizzled folio Jul 31, 2019, 1:16 AM

#

@desert oar re: joblib, my problem isn't really amenable to dask. I guess it would be nice for different tasks in parallel to share some data, but it's hard to come up with a reasonable way to do that. failing that, I am just looking for a way to parallelise the outer loop

#

(actually I use dask to build up a computation from temporary on-disk arrays)

quartz stream Jul 31, 2019, 11:17 AM

#

@desert oar

#

Hey

#

remember the function you created yesterday for getting values

#

def chunk_preprocessing(chunk, id_colname, col_value):
  return chunk.loc[chunk[id_colname] == col_value]

#

what if i want multiple values of the same column

lapis sequoia Jul 31, 2019, 3:20 PM

#

Anyone that could perhaps help me with implementing LSTM in RL?

#

I am not sure how it would be implemented

#

The goal i am looking for is that the agent/rl network can "see" previous states with a rollback window of 10

#

however i can't get how it all works and the internet hasn't been much help yet either

#

i am using https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py as my main rl code

GitHub

MorvanZhou/Reinforcement-learning-with-tensorflow

Simple Reinforcement learning tutorials. Contribute to MorvanZhou/Reinforcement-learning-with-tensorflow development by creating an account on GitHub.

#

a direction of objective would be nice, if someone knows what you'd need to achieve such goal in RL? because i heard there are several methods to LSTM

desert oar Jul 31, 2019, 5:24 PM

#

@quartz stream what do you mean?

true badger Jul 31, 2019, 6:50 PM

#

Hi all, a quick question regarding Python multiprocessing. I want to call a function, say print(x) on different cores and want to provide different x variables to each print call. How would this be done?

native lark Jul 31, 2019, 7:04 PM

#

@true badger well you could do this using the multiprocessing module, but my personal favourite is concurrent.futures
https://docs.python.org/3/library/concurrent.futures.html
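
A minimal sketch of that approach with `concurrent.futures`. `work` and `run_all` are placeholder names; `work` stands in for whatever function should run once per input value.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def work(x):
    # Stand-in for the real function; it runs in a worker.
    return f"got {x}"

def run_all(values, executor_cls=ProcessPoolExecutor):
    # map() hands each value to a worker and returns results in input order.
    with executor_cls() as pool:
        return list(pool.map(work, values))

if __name__ == "__main__":  # required on platforms that spawn worker processes
    print(run_all([1, 2, 3]))
```

Passing `executor_cls=ThreadPoolExecutor` runs the same code on threads, which tends to suit I/O-bound work better than CPU-bound work.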

true badger Jul 31, 2019, 7:05 PM

#

Noob question, but why are there so many different libraries for this?

#

There's multiprocessing, threading, concurrent

#

What are the differences?

native lark Jul 31, 2019, 7:06 PM

#

there aren't many,
threading and multiprocessing offer very similar APIs, only one uses threads and the other uses separate processes
it's a complex task, so there are different implementations

#

concurrent.futures is actually a high-level interface for multiprocessing/threading

true badger Jul 31, 2019, 7:06 PM

#

Got it. What's the difference between threads and processes?

#

Ohhh alright, gotcha

native lark Jul 31, 2019, 7:08 PM

#

threads run in the same process

if you want to know more, check out
https://medium.com/contentsquare-engineering-blog/multithreading-vs-multiprocessing-in-python-ece023ad55a

Medium

Multithreading VS Multiprocessing in Python

Revealing the true face of Multithreading

silent swan Jul 31, 2019, 7:08 PM

#

good illustration

true badger Jul 31, 2019, 7:08 PM

#

Thanks!

#

Haha nice pic

long jacinth Jul 31, 2019, 9:20 PM

#

I made a package to transform lists in tables!

desert oar Jul 31, 2019, 9:31 PM

#

interesting

#

put it up in #303934982764625920 maybe

gilded dagger Aug 1, 2019, 12:43 AM

#

so

#

A friend told me he was using Docker to set up his python environments for machine learning

#

He finds it easier to make a docker image that has all dependencies and use this to make sure his code can run on any machine

#

What do you guys think of it?

desert oar Aug 1, 2019, 1:14 AM

#

id rather use a conda environment, but docker works i guess

#

seems like a pain in the ass to reconfigure if you want to install a new package, gotta tear down the container and rebuild/restart

gilded dagger Aug 1, 2019, 1:22 AM

#

Docker has a conda environment already, https://hub.docker.com/r/continuumio/anaconda3/

#

You can just start from there

desert oar Aug 1, 2019, 1:32 AM

#

why use both

gilded dagger Aug 1, 2019, 1:45 AM

#

Portability? I have to run my code on Linux, Mac OS, and Windows regularly, and handling python environments is a bit of a mess

#

I'm trying to see what's the best to combine convenient dev and easy deployment

silent swan Aug 1, 2019, 1:54 AM

#

I need to learn this docker biz

desert oar Aug 1, 2019, 2:07 AM

#

yeah thats fair @gilded dagger

#

conda is somewhat portable too but it depends on a binary package being available

#

with windows 10 you can just docker up

#

im not sure why portability is that important for development

#

for reproducibility, i get

#

still seems like a pain for day-to-day

gilded dagger Aug 1, 2019, 2:13 AM

#

I dev mainly on Mac OS tho

#

Well I code on Mac OS, run parsers on Linux, and run GPU based stuff on Windows

#

currently I manage my environments by hand pretty much

#

Which is maybe the dirtiest possible way?

#

I feel like working in Docker containers for everything should simplify it a great deal, right?

desert oar Aug 1, 2019, 2:45 AM

#

yeah

#

somewhat declarative config

#

you know actually

#

some kind of dockerfile generator

#

that would be interesting

#

so you can "install" packages by adding them to your dockerfile, then run some command to rebuild and restart the container in one shot

gilded dagger Aug 1, 2019, 2:46 AM

#

https://devblogs.microsoft.com/python/remote-python-development-in-visual-studio-code/

Python

Remote Python Development in Visual Studio Code | Python

Today at PyCon 2019, Microsoft’s Python and Visual Studio Code team announced remote development in Visual Studio Code, enabling Visual Studio Code developers to work in development setups where their code and tools are running remotely inside of docker containers, remote S...

#

like this?

#

(I'm actually searching around atm)

#

Looks like VS code has great integration of docker for development

desert oar Aug 1, 2019, 2:46 AM

#

does docker have incremental builds?

gilded dagger Aug 1, 2019, 2:47 AM

#

I mean if I have to re-build whenever I change dependencies and that's it it's not too bad tbh

desert oar Aug 1, 2019, 2:47 AM

#

yeah and you probably shouldnt change deps that often

#

if you need to experiment use a virtualenv

#

then do actual dev and research work in a container

#

i dont hate that

#

easier to reproduce than trying to keep a lockfile version controlled

quartz stream Aug 1, 2019, 6:29 AM

#

@desert oar

#

I mean like say temperature column I want to access two values

hollow quartz Aug 1, 2019, 2:00 PM

#

I want to do linear regression but I have a nominal variable. Do I have to use OneHotEncoder or is StringIndexer sufficient? I use pyspark

dense rose Aug 1, 2019, 5:52 PM

#

How would I normalize values such that they lie on a logarithmic curve between 0 and 1?

#

Okay it seems that applying a log first and then just normalizing works pretty well.
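
The approach described above, sketched with the standard library only. `log_normalize` is a made-up name, and the sketch assumes every input value is strictly positive.

```python
import math

def log_normalize(values):
    """Log-transform positive values, then min-max scale them into [0, 1]."""
    logs = [math.log(v) for v in values]   # requires every value > 0
    lo, hi = min(logs), max(logs)
    if hi == lo:                            # all values equal: avoid dividing by zero
        return [0.0 for _ in logs]
    return [(x - lo) / (hi - lo) for x in logs]

print(log_normalize([1, 10, 100, 1000]))
```

Inputs spaced by a constant factor end up evenly spaced between 0 and 1.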

velvet compass Aug 1, 2019, 6:15 PM

#

New article by Kirit Thadaka. 👀 Do the rewards of Data Science outweigh the risks? 🤔 What do you think?
https://www.kite.com/blog/python/future-of-data-science

Kite - AI-Powered Python Copilot

Kirit Thadaka

Data Science, the Good, the Bad, and the... Future - Kite Blog

Data scientists walk a fine line moving forward. Technological advances are often countered by specific concerns, like gender bias and privacy.

desert oar Aug 1, 2019, 6:40 PM

#

yes?

#

did you write this?

velvet compass Aug 1, 2019, 8:20 PM

#

I did not (: my posts are more nuts-and-bolts

desert oar Aug 1, 2019, 9:51 PM

#

this would be like in 1835 writing an article about the rewards and risks of chemistry

velvet compass Aug 1, 2019, 11:16 PM

#

^^ That's a pretty darn good analogy!

silent swan Aug 2, 2019, 12:23 AM

#

but salt, don't you know the singularity is just around the corner?

quartz stream Aug 2, 2019, 6:45 AM

#

import pandas as pd
df = pd.read_csv('sample_data/california_housing_train.csv')
df = df.loc[]

what do I write if i wanna find multiple values
eg from longitude I want -114.56 and 114.57
and from latitude I want 33.69 and 32.76
i want all the columns of the original df to be displayed in new df

wispy glacier Aug 2, 2019, 8:45 AM

#

So I made a simple weight penalty mechanism on my neural network and each layer of nodes created each of these lines. Interesting, So I can mimic a linear neural network by using the same mechanism on all the layers.

📎 Skarmklipp.PNG

surreal nacelle Aug 2, 2019, 10:39 AM

#

Hey guys, I'm still a newbie, but I'd like to find an internship in the field in 2 months or so, (to start in 3-4), I looked at the offers online, and I feel like it gives me more than enough time to reach what the companies are looking for (for internship at least), but I'd like to build a little portfolio meanwhile, do you have some ideas about projects that wouldn't take too long to do, but have a decent weight on a resume ?
I was thinking about writing most of the basic algo from scratch, to show my understanding of them and of the maths behind, what do you think ? (maybe make a nice notebook for each algo or something like that)

#

Thanks 😃

quartz stream Aug 2, 2019, 12:05 PM

#

@desert oar

#

Can you please help with the above problem

desert oar Aug 2, 2019, 12:06 PM

#

@quartz stream 1) its still unclear what youre asking, and 2) i am a volunteer and i can't help everyone with everything, nor should i be expected to

quartz stream Aug 2, 2019, 12:17 PM

#

Lol

#

okay

#

def chunk_preprocessing(chunk, id_colname, col_value):
  if len(col_value) == 1:
    return chunk.loc[chunk[id_colname] == col_value[0]]

  else:
    # collect the matching rows for each requested value, then combine them
    parts = []
    for i in col_value:
      parts.append(chunk.loc[chunk[id_colname] == i])
    return pd.concat(parts).drop_duplicates().reset_index(drop=True)

chunk_preprocessing(df,"latitude",[32.79,41.84])

#

I want something like this

#

@desert oar if user wants multiple values he can get the rows from single column

#

I only tagged you because you knew about the problem

native lark Aug 2, 2019, 12:27 PM

#

@quartz stream i suggest you stop pinging individual helpers, "desert oar" is not the only person that can help you

quartz stream Aug 2, 2019, 12:28 PM

#

Alright !

#

Nevermind

#

got the answer

#

col_value = [32.82,41.75]
subsetDataFrame = df[(df['latitude'].isin(col_value))]
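
The same `isin` idea extends to the earlier question about several values in several columns. A sketch with a small made-up frame; `rows_matching` and the numbers are assumptions for illustration.

```python
import pandas as pd

# Small made-up frame using the column names from the question.
df = pd.DataFrame({
    "longitude": [-114.56, -114.57, -114.31, -114.57],
    "latitude": [33.69, 32.76, 33.69, 40.00],
    "population": [1015, 1129, 333, 515],
})

def rows_matching(frame, wanted):
    """Keep rows where every listed column holds one of its wanted values."""
    mask = pd.Series(True, index=frame.index)
    for column, values in wanted.items():
        mask &= frame[column].isin(values)
    return frame[mask]

subset = rows_matching(df, {"longitude": [-114.56, -114.57],
                            "latitude": [33.69, 32.76]})
print(subset)
```

All original columns are kept. Exact matching on floats only works when the stored values equal the requested ones exactly, so rounding first may be safer on real data.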

royal mango Aug 2, 2019, 3:09 PM

#

Hi guys

#

Who uses pandas here?

desert oar Aug 2, 2019, 3:19 PM

#

most people

desert oar Aug 2, 2019, 6:18 PM

#

if you already know programming, https://fast.ai is good if you want to get into deep learning right away

desert oar Aug 2, 2019, 7:23 PM

#

err what

#

oh www

#

https://www.fast.ai/

Home

Making neural nets uncool again

grizzled folio Aug 2, 2019, 11:33 PM

#

Hmmph, I suppose just opening ~400 files over network-attached storage is going to take a while no matter what

prime plover Aug 3, 2019, 12:36 AM

#

anyone can recommend a good place to learn how to build AI and machine learning?

desert oar Aug 3, 2019, 12:43 AM

#

@prime plover what is your background? math? programming?

lapis sequoia Aug 3, 2019, 4:12 AM

#

Would you guys recommend the tensorflow library for getting started with AI and machine learning?

desert oar Aug 3, 2019, 4:13 AM

#

i dont think the tensorflow new user experience is very good

#

probably stick with keras tbh

lapis sequoia Aug 3, 2019, 4:13 AM

#

Okay thank you

desert oar Aug 3, 2019, 4:13 AM

#

people tend to like pytorch although i havent used it much at all

lapis sequoia Aug 3, 2019, 4:14 AM

#

What are the real differences between them all and why don’t you recommend tensorflow

desert oar Aug 3, 2019, 4:14 AM

#

tensorflow 2.0 will be a lot saner, but tensorflow 1.0 was really complicated

#

and pardon my language but the docs were/are shit

#

all the information is technically there but good luck finding any of it when you need it

lapis sequoia Aug 3, 2019, 4:15 AM

#

Okay

desert oar Aug 3, 2019, 4:37 AM

#

@lapis sequoia that said, tensorflow is probably still more widely used, and i think a lot of new models come out in TF versions first

surreal nacelle Aug 3, 2019, 5:57 PM

#

https://www.humblebundle.com/books/data-analysis-machine-learning-books?hmb_source=navbar&hmb_medium=product_tile&hmb_campaign=tile_index_4
Humble book bundle, bunch of really good o'reilly books for next to nothing 😃

Humble Bundle

Humble Book Bundle: Data Analysis & Machine Learning by O'Reilly

Pay what you want for awesome ebooks and support charity!

silent swan Aug 3, 2019, 9:39 PM

#

USE. PYTORCH.

#

(unless you need to build super-scalable stuff)

#

(like, really REALLY scalable stuff)

#

(like train on >8 separate machines type of stuff)

silent swan Aug 3, 2019, 10:30 PM

#

even in terms of new models, that's not entirely true

#

models coming out of google, like 95% of them are in TF (conversely, 95% of facebook models come out in pytorch, but also google has a much, much larger research lab)

#

for more broad research though, I would estimate about 65% pytorch 35% TF, from what I've seen (not based on hard statistics)

#

researchers generally like pytorch more, unless they're doing something that really requires scale

#

(I'm aware of the overall project statistics of TF vs pytorch, but github projects =/= new research)

fierce shadow Aug 4, 2019, 1:07 PM

#

hey, when I find the derivative of f(x) = x**2, it turns out to be 2x. When I try to find the derivative of f'(x) = 2x, the derivative again comes out as 2x; however, if I apply the power rule, it turns out to be 2. Am I making a mistake here?

def f(x):
    return x*x

def derivative(x):
    return (f(x+h)-f(x))/h

print (derivative(derivative(100)))

Output : 400.0010.... (cuz i took h as 0.001)

Whereas it should be coming 2 I guess..

desert oar Aug 4, 2019, 2:43 PM

#

@fierce shadow try this:

def derivative(f, h=1e-7):
    def ff(x):
        return (f(x+h) - f(x)) / h
    return ff

def square(x):
    return x**2

print(derivative(square)(4))
print(derivative(derivative(square))(4))
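
A note on the question, as a minimal sketch rather than anything from the thread: `derivative(derivative(100))` in the original snippet feeds the number returned by `derivative(100)` (about 200) back in as x, so it evaluates f' at 200 instead of differentiating twice. Taking a function and returning a function avoids that. The step `h=1e-4` below is an illustrative assumption: nesting a forward difference divides the rounding error by h twice, so a very small step that is fine for the first derivative tends to be too small for the second.

```python
def derivative(f, h=1e-4):
    """Return a function approximating f' with a forward difference."""
    def ff(x):
        return (f(x + h) - f(x)) / h
    return ff


def square(x):
    return x ** 2


first = derivative(square)(4)                # close to 2*x = 8
second = derivative(derivative(square))(4)   # close to 2
print(first, second)
```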

quaint ruin Aug 4, 2019, 8:27 PM

#

Hey, I'm trying to use DecisionTreeClassifier from sklearn. I'm fitting my model, predicting the result, and then trying to get the accuracy:

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree

# some features (e.g. HouseStyle) would suit LabelEncoder better, but they are unlikely to affect
# the target LotFrontage, so we just one-hot encode and drop unwanted columns later
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
                                                       'LotShape', 'LandContour', 'Utilities',
                                                       'LotConfig', 'LandSlope', 'Neighborhood',
                                                       'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
encoded_df = encoded_df[['LotFrontage', 'LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
           'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside']]

# impute LotFrontage with the mean value (we saw a low outlier ratio, so we're going to use this)
encoded_df['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X = encoded_df.drop('LotFrontage', axis=1)
y = encoded_df['LotFrontage'].astype('int32')
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = DecisionTreeRegressor()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
classifier.score(y_test, y_pred)
# print("Accuracy is: ", accuracy_score(y_test, y_pred) * 100)

but I get ValueError: Expected 2D array, got 1D array instead:
I'm not sure why y_pred or y_test needs to be 2D if anyone can clarify but I'm also getting this error after reshaping them with (-1, 1)
anyone got an idea?

silent swan Aug 4, 2019, 8:41 PM

#

hm that doesn't sound right

#

both y_pred and y_test are 1D, and you're sure the error shows up on the last line?

quaint ruin Aug 4, 2019, 8:52 PM

#

yes

#

I changed the score line to: classifier.score(y_test.values.reshape(-1, 1), y_pred)

#

because in the docs its says that the test sample shape needs to be shape = (n_samples, n_features)

#

so now I get y_test shape as (365, 1)

#

and now I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1

silent swan Aug 4, 2019, 10:18 PM

#

It looks like you're misusing .score?

#

for .score you're supposed to supply X_test and y_test

#

it's a shorthand for predicting and computing accuracy in one step
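
A small sketch of that call pattern, assuming scikit-learn is installed. The data here is synthetic and made up for illustration. One detail worth knowing: for a regressor such as `DecisionTreeRegressor`, `.score` returns R² rather than accuracy, and `accuracy_score` is meant for classifiers.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): 200 rows, 9 features.
rng = np.random.RandomState(0)
X = rng.rand(200, 9)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# .score takes the features and the true targets, and predicts internally.
score = model.score(X_test, y_test)

# The same number, computed in two explicit steps.
y_pred = model.predict(X_test)
manual = r2_score(y_test, y_pred)
print(score, manual)
```

Passing `(y_test, y_pred)` instead makes the model treat the targets as a feature matrix, which is where the 1D/2D and feature-count errors come from.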

coral surge Aug 4, 2019, 10:30 PM

#

so i have a question about nlp

#

where should i ask it?

#

(it’s a test question)

coral surge Aug 4, 2019, 10:53 PM

#

anyone have the answer to this

📎 image0.png

coral surge Aug 4, 2019, 11:32 PM

#

how so

silent swan Aug 4, 2019, 11:58 PM

#

I would help, but don't we have rules re: homework/tests?

#

anyway, if we were to do elimination

#

one of those is not a supervised modeling method

#

one of those should not be used with only 1k examples

#

but the other two answers I can find some argument for either

#

so I don't like this question

coral surge Aug 5, 2019, 2:23 AM

#

i dont like it either

grizzled folio Aug 5, 2019, 2:45 AM

#

sigh there's no winning with parallelisation, is there? use threads and get killed by GIL locking, use processes and get killed by serialisation

silent swan Aug 5, 2019, 2:45 AM

#

use GPUs, get killed by bank account

grizzled folio Aug 5, 2019, 2:45 AM

#

GPUs aren't great at opening files...

silent swan Aug 5, 2019, 2:45 AM

#

that's also true

grizzled folio Aug 5, 2019, 2:46 AM

#

also not really, I have a decent GPU in my workstation, and access to batch GPUs otherwise 😉

#

how does the prun's ncalls column work? 3020684/828 for pickle.py:457(save)

#

ah, recursion

supple ferry Aug 5, 2019, 12:01 PM

#

Hey there! Anyone knows alternatives to R t.test() in Python?? Scipy has it, but it does not support alternative hypothesis.

t.test(b$Sepal.Length, mu=5.6, alternative="greater")

#

you can do this in R, but not with Scipy

#

stats.ttest_1samp(df.sepal_length, popmean= 5.6, alternative = "greater")
TypeError: ttest_1samp() got an unexpected keyword argument 'alternative'

random jasper Aug 5, 2019, 2:28 PM

#

Sorry if this is not the correct channel, but does anyone have any experience in quadratic programming? Specifically the quadprog package? I'm trying to convert MATLAB code to Octave/Python; however, there is one MATLAB function, "quadprog", that does not work directly in Octave, and I am receiving buffer errors when trying to use the quadprog package in Python.

desert oar Aug 5, 2019, 3:38 PM

#

it doesn't assume normality as such, but the student-t distribution is symmetric. so yes that works

supple ferry Aug 5, 2019, 5:11 PM

#

@void anvil I am afraid I did not quite understand what you meant. Can you give maybe a code sample for that?
Scipy does test for two sided afaik, not less or greater.

#

So greater and less are just the same thing then? visible confusion it maybe because I worked today more than usual and brain refuses to process

desert oar Aug 5, 2019, 7:29 PM

#

that said, if you have 1-sided test your alternative hypothesis is often a lot saner

#

the problem with NHST is that the null hypotheses aren't "fuzzy"

#

either you reject or you dont

#

im not sure what the solution is. but i havent read anything coherent or useful about making the null hypothesis itself "fuzzy", rather than fudging it by treating p-values in the Fisherian sense of weight-of-evidence-against-the-null, which just fails in large, low-noise samples

supple ferry Aug 5, 2019, 7:37 PM

#

thanks for the clarification!

#

I am still digesting all these

#

can any of you explain me what this code does in the background ?

#

maybe i can understand better knowing the steps behind

#

t.test(a$Sepal.Length, mu = 5)

One Sample t-test

data:  a$Sepal.Length
t = 12.473, df = 149, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 5.709732 5.976934
sample estimates:
mean of x 
 5.843333 

t.test(a$Sepal.Length, mu = 5, alternative = 'greater')

    One Sample t-test

data:  a$Sepal.Length
t = 12.473, df = 149, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
 5.731427      Inf
sample estimates:
mean of x 
 5.843333

#

So, I am using iris dataset and sepal_length as my test subject

#

@void anvil , tagging you because of the discussion above

#

@desert oar , if you can also help, that would be super

desert oar Aug 5, 2019, 7:41 PM

#

R eh?

supple ferry Aug 5, 2019, 7:41 PM

#

yes

#

because my goal is to replicate it in python

desert oar Aug 5, 2019, 7:41 PM

#

well, do you know how a t test is done?

supple ferry Aug 5, 2019, 7:41 PM

#

as alternative hypothesis being mean greater than 5

desert oar Aug 5, 2019, 7:41 PM

#

more generally, do you know how hypothesis testing actually works?

supple ferry Aug 5, 2019, 7:42 PM

#

variance between groups / variance within groups

#

?

desert oar Aug 5, 2019, 7:42 PM

#

hypothesis tests work by constructing a test statistic with a known probability distribution

#

then estimating the distribution

#

and then finally computing the probability of the observed value of the test statistic

#

so you need to do the same:

1. compute the test statistic t
2. estimate the parameters of the correct probability distribution, in this case student-t
3. compute the probability of T >= t where T is distributed according to the student-t distribution you fitted in step 2
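
The three steps can be sketched by hand for a one-sample test, assuming SciPy is available for the Student-t distribution. The sample values and `mu0` below are made up; in the thread the data would be the sepal length column.

```python
import math
import statistics

from scipy import stats

# Hypothetical sample; stands in for the sepal_length column.
sample = [5.1, 4.9, 6.3, 5.8, 6.1, 5.5, 6.7, 5.0, 5.9, 6.4]
mu0 = 5.6  # hypothesised mean under the null

# Step 1: the test statistic.
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
t_stat = (mean - mu0) / (sd / math.sqrt(n))

# Step 2: the reference distribution is Student's t with n - 1 degrees of freedom.
dof = n - 1

# Step 3: tail probabilities for each alternative hypothesis.
p_greater = stats.t.sf(t_stat, dof)             # P(T >= t), alternative "greater"
p_less = stats.t.cdf(t_stat, dof)               # P(T <= t), alternative "less"
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)  # alternative "not equal"

print(t_stat, p_greater, p_less, p_two_sided)
```

Later SciPy releases added an `alternative` keyword to `ttest_1samp`, so it is worth checking the installed version before hand-rolling this.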

supple ferry Aug 5, 2019, 8:04 PM

#

Okay. I will generalize what I understood from technical point of view

#

stats.ttest_1samp(data.sepal_length, popmean= 5.9)
Out[9]: Ttest_1sampResult(statistic=-0.8381239979992521, pvalue=0.40330353059421875)

#

Here my alternative hypothesis is that mean is not equal 5.9

#

if my alternative hypothesis was mean is less than 5.9
it would be 0.4033 / 2 = .2017

#

and if alternative hypothesis was mean is greater than 5.9
it would be 1 - 0.4033 / 2 = .7983

#

is it right ?

desert oar Aug 5, 2019, 8:17 PM

#

scipy might have it

supple ferry Aug 5, 2019, 8:21 PM

#

this was indeed scipy, but it only tests for alternative hypothesis != popmean

solar torrent Aug 6, 2019, 12:17 AM

#

Hey does anyone know how to go about extracting the first observation for each unique value in a column?

#

I've googled and looked through StackOverflow but nothing

#

I've thought about using drop_duplicates on the specified column, but I can't figure out if the function always retains the first value

#

nevermind, figured it out

grizzled folio Aug 6, 2019, 1:34 AM

#

@solar torrent enlighten the rest of us?

solar torrent Aug 6, 2019, 1:42 AM

#

I sorted by a time column as a new df, then called drop_duplicates on it, specifying keep='first'

#

on a specified subset (which is the first arg of the drop_duplicates function)... so in this way you can get the first observation for each unique value
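
A minimal sketch of that approach, assuming pandas. The column names `tag`, `time` and `value` are invented for the example.

```python
import pandas as pd

# Hypothetical detections table; column names are made up for illustration.
df = pd.DataFrame({
    "tag": ["a", "b", "a", "b", "c"],
    "time": pd.to_datetime([
        "2019-08-01 10:05", "2019-08-01 09:00", "2019-08-01 08:30",
        "2019-08-01 11:00", "2019-08-01 07:45",
    ]),
    "value": [1, 2, 3, 4, 5],
})

# Sort by time, then keep the earliest row for each tag.
first_obs = df.sort_values("time").drop_duplicates(subset="tag", keep="first")
print(first_obs)
```

`keep="first"` is also the default, so the sort is what decides which row counts as the first observation.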

grizzled folio Aug 6, 2019, 1:58 AM

#

nice one

surreal nacelle Aug 6, 2019, 11:23 AM

#

Hey, I've been intensively learning maths for the past 2 weeks or so, and until then I aimed at being capable to solve every calculation by hand for linear algebra and calculus. I am now wondering if that is the good strategy to adopt. Do I really need to know how to do everything by hand ? Or is knowing what everything represents and when to use it what really matters in the context of machine learning/deep learning ? I feel like using wolfram alpha should be good enough if I know what to ask it 😄

#

I would add that I want to get a job in the field as soon as possible.

tight sparrow Aug 6, 2019, 5:11 PM

#

anyone here playing with google colab?

#

Im trying to get my gpu to work with it

#

📎 unknown.png

#

as you can see here its working in jupyter but not on colab

#

📎 unknown.png

tulip estuary Aug 6, 2019, 5:13 PM

#

@tight sparrow Your GPU or Google's GPU?

tight sparrow Aug 6, 2019, 5:13 PM

#

my gpu

tulip estuary Aug 6, 2019, 5:13 PM

#

ah, never done that 😃 Just used theirs.

tight sparrow Aug 6, 2019, 5:13 PM

#

i know i can use google's in notebook settings but still no dice

#

my gpu is a 1070 i no longer game with so i might as well use it

tulip estuary Aug 6, 2019, 5:14 PM

#

Got it. Can't help on that, I haven't tried using my local GPU (mostly because I don't have one 😃 )

tight sparrow Aug 6, 2019, 5:38 PM

#

https://stackoverflow.com/questions/57381360/tensorflow-2-0-beta-gpu-running-in-jupyter-notebook-but-not-in-google-colab

Stack Overflow

Tensorflow 2.0 beta GPU running in jupyter notebook, but not in go...

I am working with tensorflow 2.0 beta, and while i managed to get my GPU working on anaconda through a few youtube tutorials I am unable to get my gpu running in google colab. I know google has the

#

posted to stack

olive robin Aug 6, 2019, 7:41 PM

#

Can someone explain why you transpose w in the z = wᵀx + b equation?

#

Is this just a fancy way of applying/multiplying the weights to all the features?

silent swan Aug 6, 2019, 7:45 PM

#

it depends on how the dimensions of the respective vector/matrices are set up

#

but roughly yes

#

good to always know the dimensions though
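
To make the dimension point concrete, here is a small sketch assuming NumPy; the weights, features and bias are made-up numbers.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # one weight per feature (made-up values)
x = np.array([2.0, 1.0, 0.5])    # one sample with three features
b = 0.1

# With 1-D arrays the transpose is implicit: this is sum(w_i * x_i) + b.
z_1d = w @ x + b

# With explicit column vectors of shape (3, 1), w.T has shape (1, 3),
# so w.T @ x is a (1, 1) matrix holding the same number.
w_col = w.reshape(-1, 1)
x_col = x.reshape(-1, 1)
z_col = (w_col.T @ x_col + b).item()

print(z_1d, z_col)
```

The transpose only exists to make the shapes line up so that the product is a single weighted sum.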

olive robin Aug 6, 2019, 10:01 PM

#

What's happening from lines 20 - 75 in this code?

#

https://github.com/pytorch/examples/blob/master/imagenet/main.py

GitHub

pytorch/examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc. - pytorch/examples

#

what are all the parser.add_argument's doing?

grizzled folio Aug 6, 2019, 10:17 PM

#

defining command line arguments, see the argparse library
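
A toy version of what such a block does, using only the standard library. The option names are hypothetical and only loosely modelled on a training script, not copied from the linked file.

```python
import argparse

# Hypothetical options, loosely modelled on a training script.
parser = argparse.ArgumentParser(description="Toy training script")
parser.add_argument("data", help="path to the dataset")
parser.add_argument("--epochs", type=int, default=90, help="number of epochs")
parser.add_argument("--lr", type=float, default=0.1, help="learning rate")
parser.add_argument("--pretrained", action="store_true", help="use a pretrained model")

# Normally parse_args() reads sys.argv; a list is passed here so it runs anywhere.
args = parser.parse_args(["/tmp/images", "--epochs", "5", "--pretrained"])
print(args.data, args.epochs, args.lr, args.pretrained)
```

Each `add_argument` call declares one option, its type and its default, and the parsed values end up as attributes on `args`.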

solar torrent Aug 6, 2019, 10:31 PM

#

does anyone know if you can customize jupyter notebooks so all of the columns of an output df shows up, even if you have to scroll?

#

I hate when it has these ellipsis in the middle of the df

grizzled folio Aug 6, 2019, 10:47 PM

#

@solar torrent http://songhuiming.github.io/pages/2017/04/02/jupyter-and-pandas-display/

#

I don't use pandas, but there seem to be a few options available
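
One of those options, sketched with pandas; the 40-column frame is just a throwaway example.

```python
import pandas as pd

# Show every column instead of collapsing the middle into "...".
pd.set_option("display.max_columns", None)

# Throwaway wide frame to check the effect.
wide = pd.DataFrame([range(40)], columns=[f"c{i}" for i in range(40)])
print(wide)
```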

solar torrent Aug 6, 2019, 10:55 PM

#

@grizzled folio thank you soooooo much

solar torrent Aug 6, 2019, 11:30 PM

#

hey can you guys help me conceptualize something real quick

#

I need to figure out the speed traveled by comparing detection points at the coordinates of different towers (imagine a bird flying and pinging different towers as it goes)

#

so I have this df...

#

📎 unknown.png

#

I need to isolate, by tagID (first col) - when the detections change from being detected at one tower versus another (tower names in the second column)... so in this way I could calculate the distance traveled over a certain amount of time

#

but I'm stuck because one tagID can be hit by several different towers before it switches to another. I'm not sure where to go next or what sort of function I could use

#

btw the df is sorted by the time column

grizzled folio Aug 6, 2019, 11:42 PM

#

"can be hit by several different towers before it switches to another" not sure what you mean by that

solar torrent Aug 6, 2019, 11:43 PM

#

see rows 5 and 6 as an example - we see that the bird (by tagID) is detected by two different towers at that time (tower names and coordinate points change, shown in columns 2, 4-5 respectively)

#

does that make sense?

#

actually wait

#

see rows 3 and 4

#

sorry

#

📎 Screen_Shot_2019-08-06_at_7.44.42_PM.png

#

I think I figured it out.

#

maybe I could drop_duplicates on the second column and keep the last values

#

why not drop duplicates by timestamp?

#

because some of the towers still give hits at the same location, even at a different time

#

the towers signify a change in place - so I need to observe difference over space

#

Ok next dilemma

#

📎 Screen_Shot_2019-08-06_at_8.05.56_PM.png

#

how would I go about calculating speed by these matching tags...?

#

like throughout a whole df.

#

googling for functions now...

#

oooooh you meant sort by ID and time.... gotcha
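
A rough sketch of the sort-by-ID-and-time idea, assuming pandas and NumPy. The column names (`tagID`, `tower`, `time`, `lat`, `lon`) and the data are invented since the real frame is only visible in the screenshots, and using the first detection at each new tower as the arrival time is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))


# Hypothetical detections: tag 1 is seen twice at tower A before reaching B.
df = pd.DataFrame({
    "tagID": [1, 1, 1, 2, 2],
    "tower": ["A", "A", "B", "A", "B"],
    "time": pd.to_datetime([
        "2019-08-06 10:00", "2019-08-06 10:20", "2019-08-06 11:00",
        "2019-08-06 09:00", "2019-08-06 11:00",
    ]),
    "lat": [45.0, 45.0, 46.0, 45.0, 46.0],
    "lon": [-75.0, -75.0, -75.0, -75.0, -75.0],
})

df = df.sort_values(["tagID", "time"])

# Keep only rows where the tower differs from the tag's previous detection.
prev_tower = df.groupby("tagID")["tower"].shift()
moves = df[df["tower"] != prev_tower].copy()

# Compare each kept row with the previous kept row of the same tag.
prev = moves.groupby("tagID")[["time", "lat", "lon"]].shift()
moves["km"] = haversine_km(prev["lat"], prev["lon"], moves["lat"], moves["lon"])
moves["hours"] = (moves["time"] - prev["time"]).dt.total_seconds() / 3600
moves["km_per_h"] = moves["km"] / moves["hours"]
print(moves)
```

The first row of each tag has no previous detection, so its distance and speed are NaN.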

lapis sequoia Aug 7, 2019, 12:57 AM

#

📎 20190807_084505.jpg

#

whats the answer pls

grizzled folio Aug 7, 2019, 4:15 AM

#

if I have an xarray DataArray, say with dimensions (T: 168, Y: 520, Xp1: 881), can I somehow drop all the data, and make it (T: 0, Y: 520, Xp1: 881) (or something to that effect). I think pandas indexing might be similar? -- I really just want to exploit xarray to get me metadata, and then put some different data in its place

#

maybe this is the wrong way to be doing this..

frigid elk Aug 7, 2019, 5:38 AM

#

what are the thoughts on DataQuest?

#

considering giving it a go just to build some foundations

dreamy tartan Aug 7, 2019, 8:11 AM

#

Hi,
I have a problem with the unbalanced dataset. I have labelled text data and im trying to do classification. The dataset has 3 label and labels are not equal to each other. (label 1: 2.3k, label 2: 1.2k, label 3: 0.5k) I can say results are fine for label 1 and 2 but label 3 is very bad in the confusion matrix. What can I do to improve the results?

polar acorn Aug 7, 2019, 8:14 AM

#

Depending on what you are using for classification you may be able to do something like sample so that the classes are evenly distributed across each batch or weighting class 3 higher.

#

Some quick googling gives me these two articles. They might provide a good starting point.
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/dealing-with-imbalanced-classes-in-machine-learning-d43d6fa19d2

Machine Learning Mastery

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Da...

Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn! This is an example of an imbalanced...

Medium

Dealing with Imbalanced Classes in Machine Learning

Introduction

dreamy tartan Aug 7, 2019, 8:18 AM

#

Thank you. I will try sampling; if it doesn't work, then I'll try to create synthetic data.

dreamy tartan Aug 7, 2019, 8:56 AM

#

@void anvil i was thinking about it but i'll lose so much data in case of undersampling thats why oversampling can be solution for me maybe. Im searching about it.

desert oar Aug 7, 2019, 11:25 AM

#

@dreamy tartan i've seen it done where you oversample and then undersample to reduce the size of the data set back to what you originally had

#

Also depending on the classifier you are using sometimes you can re-weight the loss function with class weights

#

Eg naive bayes and logistic regression can do that
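
As a sketch of the class-weight idea, assuming scikit-learn: the label counts below mirror the 2.3k / 1.2k / 0.5k split from the question, but the array itself is synthetic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts from the question: 2.3k, 1.2k and 0.5k examples.
y = np.array([1] * 2300 + [2] * 1200 + [3] * 500)
classes = np.array([1, 2, 3])

# "balanced" uses n_samples / (n_classes * count) for each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The resulting dict can be passed as `class_weight=` to estimators that accept it, such as `LogisticRegression`, or the string `"balanced"` can be passed directly.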

dreamy tartan Aug 7, 2019, 11:28 AM

#

I used SMOTE for over sampling and its looks like results are acceptable. Maybe i can improve the results with working on pre-processing section.

hollow quartz Aug 7, 2019, 12:22 PM

#

Hi, do you know a library for visualising a decision tree from pyspark?

charred onyx Aug 7, 2019, 3:27 PM

#

Hey guys, should I skip some machine learning concepts and learn deep learning, or learn the entire thing?

desert oar Aug 7, 2019, 3:43 PM

#

@charred onyx depends on what you want to achieve. if you are interested in computer vision, speech, etc. then yes you can skip stuff like gradient boosting and more advanced probability/stats and start learning deep learning (although you should 100% plan on covering the probability/stats material later)

olive robin Aug 7, 2019, 4:07 PM

#

what does it mean to seed training?

desert oar Aug 7, 2019, 4:11 PM

#

where did you see that phrase

olive robin Aug 7, 2019, 4:12 PM

#

https://github.com/pytorch/examples/blob/master/imagenet/main.py

#

line 80

desert oar Aug 7, 2019, 4:13 PM

#

https://pytorch.org/docs/stable/torch.html#torch.manual_seed

#

Sets the seed for generating random numbers. Returns a torch._C.Generator object.

olive robin Aug 7, 2019, 4:16 PM

#

wait so why would this slow down training

#

and if it does, why would you do it?

desert oar Aug 7, 2019, 4:22 PM

#

probably has to do with CUDA

#

im not sure how random numbers are implemented in that

olive robin Aug 7, 2019, 4:42 PM

#

oh ok, thank you

silent swan Aug 7, 2019, 6:25 PM

#

seeding does not slow down training

#

it's setting it to deterministic that does

#

basically there's a mode where CUDA forces its computations to be deterministic (same input = same output always), but that comes at a computational cost

#

important if you're numerically reproducing work, but not so important if you're just trying to train a model

olive robin Aug 7, 2019, 6:36 PM

#

what does same input = same output mean?

silent swan Aug 7, 2019, 6:46 PM

#

if you applied the same computation to the same inputs you will always get the same result

#

you may think "wait, why would applying the same computation ever lead to different results?"

#

one example (may not be the only case, but worth keeping in mind), is that GPUs essentially do parallel computation, and then combine the results

#

because of limited floating point precision, adding the same numbers in different orders can lead to different results

#

so putting cuDNN into deterministic mode would force the numbers to always be added in the same order, which makes it deterministic but is also slower
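
The order-of-addition effect is easy to see even on a CPU with plain Python floats; this tiny sketch is only an illustration of the principle, not of what a GPU does internally.

```python
# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left, right, left == right)
```

In PyTorch the deterministic switch discussed here is `torch.backends.cudnn.deterministic = True`.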

olive robin Aug 7, 2019, 6:57 PM

#

ok so it's a method of making results more accurate?

silent swan Aug 7, 2019, 7:02 PM

#

it's for exactly replicating results

#

there're different specific reasons for doing so, e.g. debugging, comparing models, etc

#

I'd say that unless you know you need it to be deterministic, you don't need to worry about it

#

good to know that numerically your results may vary though

#

agreed for algorithm, not so much for academia imo

#

if you're getting within the same ballpark of results, people don't really care that it's numerically the same

#

if anything, being "robust" to initialization is seen as a plus, in some areas

polar acorn Aug 7, 2019, 7:13 PM

#

I thought setting the seed was mostly about getting the same random initialisation in a deep model.

#

How much does the other sources of randomness influence the model?

silent swan Aug 7, 2019, 7:15 PM

#

for common models, it should matter very little, but there're some models that are especially brittle

polar acorn Aug 7, 2019, 7:15 PM

#

With dropout it might make a difference though.

silent swan Aug 7, 2019, 7:16 PM

#

dropout is a separate thing, it's intended randomness

#

and that should also be controlled by the seed

polar acorn Aug 7, 2019, 7:18 PM

#

Also if you're using mini batch or even SGD it will probably be significant.

silent swan Aug 7, 2019, 7:19 PM

#

that would also be controlled by the seed
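
A standard-library analogy for seed-controlled batch sampling. It uses `random.Random` instead of the framework's generator, and the `batches` helper is hypothetical, so treat it as an illustration of the idea only.

```python
import random


def batches(seed, n_items=10, batch_size=4):
    """Shuffle indices with a seeded generator and split them into batches."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_items, batch_size)]


run_a = batches(seed=0)
run_b = batches(seed=0)
print(run_a)
print(run_a == run_b)  # same seed, same batch order
```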

polar acorn Aug 7, 2019, 7:22 PM

#

Of course. What I'm commenting on is that the first thing you mentioned when asked why to set the seed was randomness stemming from parallel computing. I would assume the initialisation and the batch sampling add significantly more randomness to training, and those would be the first things a beginner should think about when learning about setting the seed.

#

Though I haven't really done any testing to compare how much each of these sources contribute randomness to the training, so what do I know.

silent swan Aug 7, 2019, 7:23 PM

#

ah, specifically I was commenting on "why does setting the seed cause slowdown", which led to discussion of the non-seed randomness

#

good info going all around

olive robin Aug 7, 2019, 7:27 PM

#

makes a lot of sense now

#

thanks for all the info @silent swan and @void anvil !

exotic cedar Aug 7, 2019, 7:40 PM

#

is anyone familiar with colab

#

it seems to be having trouble recognizing my local gpu

olive robin Aug 7, 2019, 7:42 PM

#

wait holy crap

#

you can just import the loss and optimization functions from torch?

#

You don't have to build anything yourself?

#

https://github.com/pytorch/examples/blob/master/imagenet/main.py#L168 ; lines 168 - 173

exotic cedar Aug 7, 2019, 7:43 PM

#

yes

#

all modern deep learning frameworks have that kind of implementation doe

#

import torch
from torch import nn

# `model` is assumed to be an existing nn.Module
criterion = nn.CrossEntropyLoss()

# create optimizer object
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

#

something like that

silent swan Aug 7, 2019, 7:50 PM

#

pytorch is great

#

disregard tf/keras, acquire torch skills

polar acorn Aug 7, 2019, 7:53 PM

#

lol work with whatever gets you where you want. Disregard fanboyism 🙃

silent swan Aug 7, 2019, 7:54 PM

#

naw, fanboyism can be a great impetus for work

#

"you think language X is so great because it has lib Y? well I'm gonna port lib Y to my favorite language Z. take THAT."

polar acorn Aug 7, 2019, 7:56 PM

#

Sure sure it might have its use. But one shouldn't limit oneself by it needlessly.

exotic cedar Aug 7, 2019, 8:00 PM

#

anyone used colab before :3

tame merlin Aug 8, 2019, 1:38 AM

#

o.o

solar torrent Aug 8, 2019, 2:02 AM

#

I use colab all the time

gaunt blade Aug 8, 2019, 5:11 AM

#

Do I need math knowledge to use tensorflow for AI? 🙃

sullen wing Aug 8, 2019, 5:12 AM

#

well for machine learning algorithm, yes you need math

#

apart from that no, you only need to know python syntax

eternal cargo Aug 8, 2019, 10:14 AM

#

hey, does anyone know of a package for emotion analysis on text?

wicked flare Aug 8, 2019, 10:19 AM

#

http://www.nltk.org/howto/sentiment.html This perhaps?

eternal cargo Aug 8, 2019, 10:26 AM

#

@wicked flare hey thanks for it. but it just gives us 3 basic emotions. i want more elaborate like joy,anger fear. like -ve emotion can be fear or anger.

#

wanted to classify that

wicked flare Aug 8, 2019, 10:26 AM

#

That sounds like a very challenging problem.

eternal cargo Aug 8, 2019, 10:39 AM

#

yeah

#

im just into ML , so cant even train a model as of now

lapis sequoia Aug 8, 2019, 10:40 AM

#

Mobile legends

eternal cargo Aug 8, 2019, 11:31 AM

#

okay, so now like i have to detect a text if its a question or a general text or some sort of order to someone. any idea how to detect that in a text?

polar acorn Aug 8, 2019, 11:50 AM

#

You could gather a lot of labeled text, find a pre-trained model and do some transfer learning. But if you're just getting into ml this might be a big project. Why are you doing this again? Is it for your own amusement/learning or are you trying to solve a real life problem?

eternal cargo Aug 8, 2019, 11:59 AM

#

like just an application im working on

#

you're saying it would require intro to ML

polar acorn Aug 8, 2019, 12:04 PM

#

I mean with enough time and strong enough will you could probably make something that does this (somewhat poorly probably) without an intro to ML. But if this is something you intend to use and it needs to fulfil some criteria then you should take some time to get the basics of ml down before or while you're doing this.

#

Or you could just slap together some regex based heuristics or whatever.
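
For a sense of what the regex route might look like, here is a deliberately crude sketch using only the standard library. The word lists and the `classify` helper are hand-picked assumptions and will misfire on plenty of real sentences.

```python
import re

# Hand-picked word lists; purely illustrative and far from complete.
QUESTION_START = re.compile(
    r"^(who|what|when|where|why|how|which|is|are|do|does|did|"
    r"can|could|should|would|will)\b",
    re.IGNORECASE,
)
COMMAND_START = re.compile(
    r"^(please|go|stop|open|close|send|give|tell|show|bring|turn|make)\b",
    re.IGNORECASE,
)


def classify(sentence):
    """Label a sentence as 'question', 'command' or 'statement' by crude rules."""
    text = sentence.strip()
    if text.endswith("?") or QUESTION_START.match(text):
        return "question"
    if COMMAND_START.match(text):
        return "command"
    return "statement"


examples = [
    "how do I track my order",
    "Please send me the report.",
    "The order arrived today.",
]
for s in examples:
    print(s, "->", classify(s))
```

Rules like these are cheap to write and easy to debug, but a trained classifier is likely to generalise better once labeled data is available.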

desert oar Aug 8, 2019, 12:21 PM

#

@eternal cargo detecting if it's a question isn't a sentiment analysis task, but there are probably some pre-trained models you can start with

#

spacy for example ships a model with part of speech detection

eternal cargo Aug 8, 2019, 12:52 PM

#

@polar acorn okay cool . so basically i need to learn ML to deliver a somewhat accurate model for that

#

@desert oar like detecting a sentence , if its question or normal speech or just an order to someone. i've searched in spacy but didnt find any. can you help me with a link i guess?

desert oar Aug 8, 2019, 1:19 PM

#

@eternal cargo spacy doesnt have it out of the box. but it has the ability to let you train a text classifier based on it https://spacy.io/usage/training#textcat

Training spaCy’s Statistical Models

Training spaCy’s Statistical Models · spaCy Usage Documentation

spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

#

so you can label a bunch of sentences and classify them accordingly

#

stanford corenlp might have something useful as well

#

there are also lots of "traditional" NLP or linguistics tools available:

https://stackoverflow.com/q/3573872/2954547
https://stackoverflow.com/q/4083060/2954547

Stack Overflow

How to find out if a sentence is a question (interrogative)?

Is there an open source Java library/algorithm for finding if a particular piece of text is a question or not?

I am working on a question answering system that needs to analyze if the text input b...

Stack Overflow

Determine if a sentence is an inquiry

How can I detect if a search query is in the form of a question?

For example, a customer might search for "how do I track my order" (notice no question mark).

I'm guessing most direct questions w...

eternal cargo Aug 8, 2019, 1:23 PM

#

wow. i think i have to start learning ML. its pretty great. thanks a lot btw

surreal nacelle Aug 8, 2019, 2:21 PM

#

Hey, I'm working on implementing machine learning algos from scratch, and I'm currently doing linear regression, I'm using a formula from the course on multivariate calculus I followed on coursera, but I can't find the name of the thing! Does that speak to any of you guys ?

📎 Screenshot_2019-08-08_launchcode01dl_mathematics-for-machine-learning-cousera.png

fierce shadow Aug 8, 2019, 3:19 PM

#

Hey, can anyone recommend a good course for machine learning? I am learning it from scratch..

silent swan Aug 8, 2019, 4:20 PM

#

that's your regression coefficient

desert oar Aug 8, 2019, 4:37 PM

#

@surreal nacelle that's the ordinary least squares estimate of m

surreal nacelle Aug 8, 2019, 4:39 PM

#

Thank you, the normal equation should give similar result right ?
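
On that question, a quick numerical check, assuming NumPy. The formula in the screenshot is not visible here, so this assumes it is the usual single-feature OLS slope; the data points are made up.

```python
import numpy as np

# Made-up data roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form OLS slope and intercept for a single feature.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Normal equation: solve (X^T X) theta = X^T y, with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(m, c, theta)
```

Under that assumption both routes solve the same least-squares problem, so `theta` should match `(c, m)` up to floating-point error.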

tight sparrow Aug 8, 2019, 4:40 PM

#

@surreal nacelle https://www.khanacademy.org/math/statistics-probability

https://www.youtube.com/watch?v=hQxRv8DOnts&list=PL2jykFOD1AWazz20_QRfESiJ2rthDF9-Z&index=32

https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr

https://www.youtube.com/playlist?list=PLnvKubj2-I2LhIibS8TOGC42xsD3-liux

Khan Academy

Learn for free about math, art, computer programming, economics, physics, chemistry, biology, medicine, finance, history, and more. Khan Academy is a nonprofit with the mission of providing a free, world-class education for anyone, anywhere.

YouTube

intrigano

Linear Algebra – Wrap up of this linear algebra course

Mathematics for Machine Learning: Linear Algebra, Module 5 Eigenvalues and Eigenvectors Application to Data Problems To get certificate subscribe at: https:/...

YouTube

3Blue1Brown

The Essence of Calculus, Chapter 1

What might it feel like to invent calculus? Brought to you by you: http://3b1b.co/eoc1-thanks Home page: https://www.3blue1brown.com/ In this first video of ...

YouTube

Machine Learning Course MIT OpenCourseWare - YouTube

#

oh and download this

#

https://chrome.google.com/webstore/detail/video-speed-controller/nffaoalbilbmmfgbnbgppjihopabppdk?hl=en

Video Speed Controller

Speed up, slow down, advance and rewind any HTML5 video with quick shortcuts.

surreal nacelle Aug 8, 2019, 4:41 PM

#

Thank you for this, I actually followed the coursera LA and MC courses, and watched part of the 3b1b series, haven't done much statistics tho

desert oar Aug 8, 2019, 4:41 PM

#

dont neglect stats too long

#

you can kinda fudge past it at first, but youll quickly start to feel very lost if you neglect it

surreal nacelle Aug 8, 2019, 4:42 PM

#

The next course is on PCA, which I don't know anything about, is that stats ?

desert oar Aug 8, 2019, 4:42 PM

#

that, or you wont feel lost, but you will start having problems and not understanding the problems

#

PCA is a traditional stats technique. you can learn and understand it on a mechanical level without stats, but there is a statistical perspective to PCA that is useful to understand
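To make the two views concrete, here is a rough sketch of PCA done by hand with NumPy. The data is randomly generated for illustration; the mechanical part is the eigen-decomposition, and the statistical part is reading each eigenvalue as the variance captured by its component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data where the second column is correlated with the first
a = rng.normal(size=200)
X = np.column_stack([a, 0.5 * a + 0.1 * rng.normal(size=200)])

# Mechanical view: centre the data and eigen-decompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Statistical view: each eigenvalue is the variance captured by its component
explained_ratio = eigvals / eigvals.sum()
scores = Xc @ eigvecs                    # data expressed in principal components
```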

surreal nacelle Aug 8, 2019, 4:43 PM

#

Alright, just finished the two courses on coursera after 2 intensive weeks, thought I could allow myself to take a little break from mathematics 😄

desert oar Aug 8, 2019, 4:43 PM

#

logistic regression is also a stats thing. it's a probability model, that's where the loss function comes from. you dont strictly need stats to understand logistic regression, but it will make more sense and you'll feel more empowered if you understand the stats behind it
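As a small illustration of "that's where the loss function comes from": the usual log loss is the negative Bernoulli log-likelihood of the labels under the model's predicted probabilities. A sketch with made-up data and arbitrary weights (`w`, `b`, `X`, `y` are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, b, X, y):
    """Mean negative Bernoulli log-likelihood, i.e. the usual log loss."""
    p = sigmoid(X @ w + b)          # model: P(y = 1 | x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data for illustration only
X = np.array([[0.5], [1.5], [-1.0], [2.0]])
y = np.array([0, 1, 0, 1])
w, b = np.array([1.0]), -0.5

loss = negative_log_likelihood(w, b, X, y)
```

Minimising this loss over `w` and `b` is the same as maximising the likelihood of the observed labels.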

#

fortunately there is a lot of stats that isn't "heavy math", it requires some basic algebra but there are some important concepts that aren't necessarily that complicated from a mathematical perspective

surreal nacelle Aug 8, 2019, 4:45 PM

#

I remember liking stats in highschool

silent swan Aug 8, 2019, 4:45 PM

#

stats is great

surreal nacelle Aug 8, 2019, 4:45 PM

#

more abstract than calculus and co

#

I'll start soon then 😄

silent swan Aug 8, 2019, 5:41 PM

#

https://twitter.com/wesmckinn/status/1159508306001637377

Wes McKinney (@wesmckinn)

Please help pandas by taking the 2019 User Survey! #pydata

https://t.co/1Mn1FB9Dni

polar acorn Aug 8, 2019, 6:50 PM

#

👍 answered.

silent swan Aug 8, 2019, 7:13 PM

#

I'm gonna go in and suggest deprecating access to columns as attributes

#

wait

#

Pandas is capitalized wot

#

ok I'm assuming it's a mistake, everywhere else it's all lower case

desert oar Aug 8, 2019, 8:05 PM

#

I mean, it's a proper noun. Stylizing the names of computer programs with lowercase is a 40 year meme that should probably go away

solar torrent Aug 8, 2019, 11:37 PM

#

hey can someone tell me if there's a function available for this before I embark on trying to write a function

#

I'm trying to calculate the distance and speed through a succession of points

📎 motus_df2.jpg

#

initially I set up the variables using this

#

df_over1['lon0'] = df_over1.groupby('motusTagID')['recvLon'].transform(lambda x: x.iat[0])
df_over1['t0'] = df_over1.groupby('motusTagID')['ts.h'].transform(lambda x: x.iat[0])

#

I'm trying to think of how I can iterate through them, but by ID... so then I can calculate the distance and speed in succession

#

it would be helpful if I could do .apply on, like, a range of rows

#

but I've never seen code like that so I'm guessing there's another way or that I'm thinking about it wrong

#

maybe I could split all of the IDs into separate dfs

desert oar Aug 9, 2019, 1:07 AM

#

@solar torrent what do you mean "in succession"?

#

you want to know the distance and travel time (implying speed) between successive pairs of points?

solar torrent Aug 9, 2019, 1:15 AM

#

@desert oar meaning over a period of time, at each of the different locations

#

as it stands, my code just calculates it based on the original starting point

#

so I was thinking the solution has something to do with assigning the intermediate variables throughout "succession" (over time)... but I can't think of how I can extract a single value by groups since each of the IDs has a different number of rows... without breaking them out into different dataframes

desert oar Aug 9, 2019, 1:19 AM

#

i still dont understand what you're trying to do

solar torrent Aug 9, 2019, 1:19 AM

#

Ok one second

#

For each ID, I want to assign lat0,long0, and t0 - to each of their rows… essentially in this screenshot I’m trying to replace the values on the right with the values on the left (see colored boxes)… so when I calculate the distance across - I get different distances and speed throughout time, in succession

📎 motus_df2.jpg

#

notice how the values for lat0,long0,th0 all match to the very first row

#

so... I'm asking how would I extract the individual values in this way, by groups

desert oar Aug 9, 2019, 1:25 AM

#

so you want to calculate the distance between 18552 and 18553, for example?

solar torrent Aug 9, 2019, 1:25 AM

#

yes

desert oar Aug 9, 2019, 1:25 AM

#

and 18553 -> 18595, etc

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html#pandas.DataFrame.shift

solar torrent Aug 9, 2019, 1:27 AM

#

I've come across this

desert oar Aug 9, 2019, 1:30 AM

#

yeah 1 sec

#

let me cook up an example

#

data = ...  # your dataset (a DataFrame)

data['times'] = data['ts.h'].diff()  # assuming the column is already datetime and not a string

def great_circle_distance(coords):
    # after the join below the order is: current lon/lat, then previous lon/lat
    lon1, lat1, lon0, lat0 = coords
    raise NotImplementedError  # i can't remember the formula off the top of my head

coords = data[['recvLon', 'recvLat']]
coords = coords.join(coords.shift(), rsuffix='_prev')  # adds recvLon_prev, recvLat_prev
data['distances'] = coords.apply(great_circle_distance, axis=1, raw=True)

#

something like that @solar torrent ?

solar torrent Aug 9, 2019, 1:45 AM

#

yes, that looks right! @desert oar

#

I'm gonna check out the .shift calls and "rsuffix" on .join... I'm not familiar

#

I've already got the Haversine formula for the coordinates and everything else. thanks a lot
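Putting the pieces of this exchange together, a rough sketch of the per-tag version might look like the following. The column names follow the ones used above; the data values, the `haversine_km` helper, and the new column names (`dist_km`, `hours`, `speed_kmh`) are made up for illustration. Shifting inside `groupby` keeps the first row of one tag from being paired with the last row of the previous tag.

```python
import numpy as np
import pandas as pd

def haversine_km(lon0, lat0, lon1, lat1):
    """Great-circle distance in km between points given in degrees."""
    lon0, lat0, lon1, lat1 = map(np.radians, (lon0, lat0, lon1, lat1))
    a = (np.sin((lat1 - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat1) * np.sin((lon1 - lon0) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

# Made-up detections; column names follow the ones used above
df = pd.DataFrame({
    'motusTagID': [1, 1, 1, 2, 2],
    'recvLon': [-80.0, -80.0, -79.0, -75.0, -75.0],
    'recvLat': [43.0, 44.0, 44.0, 40.0, 41.0],
    'ts.h': pd.to_datetime(['2019-08-01 00:00', '2019-08-01 02:00',
                            '2019-08-01 03:00', '2019-08-01 00:00',
                            '2019-08-01 01:00']),
}).sort_values(['motusTagID', 'ts.h'])

grouped = df.groupby('motusTagID')
prev_lon = grouped['recvLon'].shift()   # previous row within the same tag
prev_lat = grouped['recvLat'].shift()

df['dist_km'] = haversine_km(prev_lon, prev_lat, df['recvLon'], df['recvLat'])
df['hours'] = grouped['ts.h'].diff().dt.total_seconds() / 3600
df['speed_kmh'] = df['dist_km'] / df['hours']
```

The first row of each tag has no predecessor, so its distance, time, and speed come out as NaN.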

desert oar Aug 9, 2019, 1:48 AM

#

import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, None, 4]})
df['C'] = df['B'] + 100

print(df)
print(df.shift())

#

just for illustration

heavy crow Aug 9, 2019, 3:46 AM

#

Dont really know where else to post this, but it has something to do with science and data is involved haha

#

Im trying to create a multithreaded variant of the A* pathfinding algo.

#

And there are a few ways I can do that

#

But I think im doing it wrong in general.

#

I have been creating a thread for a function call and then letting it run.

#

But ofc the creation of the thread takes time and in the end its around 10x slower than the non threaded version

#

I have heard of worker pools etc and was hoping someone could explain that to me:)

#

Ping me please!

grizzled folio Aug 9, 2019, 4:03 AM

#

@heavy crow what does the simplified pseudocode version of your algorithm look like? i.e. at what level are you multithreading the pathfinding?

lapis sequoia Aug 9, 2019, 4:58 AM

#

Hi. I have some issues with training a deep learning model.

I'm doing binary classification on time series. The problem is that the accuracy (and loss) fluctuates a lot. It will go from 50% to 60% to 98% and then back down to 50% again, and just get stuck there.

lapis sequoia Aug 9, 2019, 11:01 AM

#

There is only 1 feature

heavy crow Aug 9, 2019, 11:26 AM

#

Its just normal A* @grizzled folio

#

And I was going to have a pool of workers all taking from the open heap

#

Its more of a question of how I am supposed to use threading

desert oar Aug 9, 2019, 11:30 AM

#

The GIL will bite you here

#

https://wiki.python.org/moin/GlobalInterpreterLock

#

That said a ThreadPoolExecutor is probably the best option. But again it probably won't make your code run any faster because of the global interpreter lock

#

Can use a ProcessPoolExecutor but then IPC might become a bottleneck

#

!d g concurrent.futures.ThreadPoolExecutor

arctic wedgeBOT Aug 9, 2019, 11:34 AM

#

`concurrent.futures.ThreadPoolExecutor`

class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())

An `Executor` subclass that uses a pool of at most *max_workers* threads to execute calls asynchronously.

*initializer* is an optional callable that is called at the start of each worker thread; *initargs* is a tuple of arguments passed to the initializer. Should *initializer* raise an exception, all currently pending jobs will raise a `BrokenThreadPool`, as well as any attempt to submit more jobs to the pool.... [read more](https://docs.python.org/3.7/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor)

desert oar Aug 9, 2019, 11:34 AM

#

!d g multiprocessing.Pool

arctic wedgeBOT Aug 9, 2019, 11:34 AM

#

Sorry, I could not find any documentation for multiprocessing.Pool.

desert oar Aug 9, 2019, 11:35 AM

#

https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool
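For the "worker pool" idea itself, a minimal sketch: the pool starts its threads once and reuses them, so you submit work items rather than creating a thread per function call. The `expand` function here is a hypothetical stand-in for whatever per-node work the search does, not part of any A* implementation from the thread.

```python
from concurrent.futures import ThreadPoolExecutor

def expand(node):
    """Stand-in for per-node work (e.g. scoring a neighbour); hypothetical."""
    return node, node * node

nodes = list(range(8))

# The pool creates its worker threads once and reuses them for every task,
# instead of paying thread start-up cost per function call.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(expand, nodes))  # results come back in input order
```

`ProcessPoolExecutor` exposes the same `map`/`submit` interface, so swapping it in is mostly a one-line change. As noted above, for CPU-bound pure-Python work the threaded version is unlikely to be faster because of the GIL.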

desert oar Aug 9, 2019, 3:17 PM

#

theres almost always a better way

#

if you're looping over values, use .map

#

if you're looping over columns use .apply on the dataframe
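A tiny illustration of both suggestions, using a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})  # made-up data

# Instead of looping over values: Series.map applies a function per element
doubled = df['a'].map(lambda v: v * 2)

# Instead of looping over columns: DataFrame.apply calls the function once per column
col_ranges = df.apply(lambda col: col.max() - col.min())
```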

desert oar Aug 9, 2019, 3:48 PM

#

why not just do that all in one function?

#

def process_column(y):
    y = y.copy()
    null_frac = y.isnull().mean()
    if null_frac > 0.3:
        pass  # ...
    return y

df_processed = df.apply(process_column)

desert oar Aug 9, 2019, 5:14 PM

#

yeah something like that

#

df[i].isnull().values.any() can be just df[i].isnull().any()

mossy dragon Aug 10, 2019, 5:54 AM

#

O.o

silk forge Aug 10, 2019, 1:05 PM

#

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix , accuracy_score , recall_score , f1_score  , precision_score
from sklearn.model_selection import train_test_split

data =pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/Artifcial intelligence/ML/data/Classifcation/SPAM/spam.csv" , encoding= "ISO-8859-1")
data = data.rename(columns ={"v1":"type" , "v2":"msg"})

x = data.msg


enc = LabelEncoder()
data.type = enc.fit_transform(data.type)
y = data.type
trainx , testx , trainy , testy = train_test_split(x , y , test_size=0.2 , random_state=8)

cv = TfidfVectorizer(min_df=1,stop_words='english')
trainx = cv.fit_transform(trainx)

testx = cv.transform(testx)

clf = MultinomialNB()
clf.fit(trainx,testx)

predy = clf.predict(testx)

print(predy)

#

C:\Users\admin\PycharmProjects\leaarning\venv\Scripts\python.exe "C:/Users/admin/PycharmProjects/leaarning/text class.py"
Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/leaarning/text class.py", line 26, in <module>
    clf.fit(trainx,testx)
  File "C:\Users\admin\PycharmProjects\leaarning\venv\lib\site-packages\sklearn\naive_bayes.py", line 588, in fit
    X, y = check_X_y(X, y, 'csr')
  File "C:\Users\admin\PycharmProjects\leaarning\venv\lib\site-packages\sklearn\utils\validation.py", line 724, in check_X_y
    y = column_or_1d(y, warn=True)
  File "C:\Users\admin\PycharmProjects\leaarning\venv\lib\site-packages\sklearn\utils\validation.py", line 760, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1115, 7458)

#

help

lapis sequoia Aug 10, 2019, 2:08 PM

#

Probably, before your data can be used to predict a response, you have to shape it correctly, maybe using the reshape function? idk.

silk forge Aug 10, 2019, 11:14 PM

#

Damn!!!

#

I’m supposed to fit trainx and trainy !!!

#

My bad @void anvil

maiden phoenix Aug 10, 2019, 11:52 PM

#

How does image recognition work? Is it just training a neural net to recognize similar images?

earnest prawn Aug 10, 2019, 11:55 PM

#

The "just" and the "similar" part are the interesting things about it but yes more or less

#

@maiden phoenix

maiden phoenix Aug 10, 2019, 11:58 PM

#

i meant in a nutshell, yes :P

#

thank you! it's starting to seem interesting to me. kinda surprised it took so long

earnest prawn Aug 11, 2019, 12:09 AM

#

to be honest, image related stuff is one of the few areas where you have more than enough data to do cool stuff with it

maiden phoenix Aug 11, 2019, 12:10 AM

#

~~such as checking if something is a hotdog or not~~

earnest prawn Aug 11, 2019, 12:11 AM

#

in the end you're building a model which tries to output a probability distribution over several classes for your input, or just give an answer to a yes or no question like you just asked. how you build that model and how exactly it's trained is the interesting part

silk forge Aug 11, 2019, 9:05 AM

#

okay so i made this spam/ham classifier

#

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix , accuracy_score , recall_score , f1_score  , precision_score
from sklearn.model_selection import train_test_split

data =pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/Artifcial intelligence/ML/data/Classifcation/SPAM/spam.csv" , encoding= "ISO-8859-1")
data = data.rename(columns ={"v1":"type" , "v2":"msg"})

x = data.msg
cv = TfidfVectorizer(min_df=1,stop_words='english')
x = cv.fit_transform(x.values).toarray()


enc = LabelEncoder()
data.type = enc.fit_transform(data.type)
y = data.type.values



trainx , testx , trainy , testy = train_test_split(x , y , test_size=0.2 , random_state=4)



clf = MultinomialNB()
clf.fit(trainx,trainy)

predy = clf.predict(testx)


cm = confusion_matrix(y_pred=predy , y_true=testy)
prec = precision_score(y_pred=predy,y_true=testy , average="micro")
rcl = recall_score(y_pred= predy , y_true=testy, average="micro")
f1 = f1_score(y_pred=predy,y_true=testy, average="micro")
print(f"confusion matrix \n \n {cm}")
print(f"precision_score : \n {prec}")
print(f"recall_score : \n {rcl}")
print(f"F1 score : \n {f1}")

#

all this seems to work fine

#

this is my output
confusion matrix

[[941 0]
[ 37 137]]
precision_score :
0.9668161434977578
recall_score :
0.9668161434977578
F1 score :
0.9668161434977578

#

but now i want to predict a single message

#

so i defined a function that vectorizes a message and predicts its class

#

def encodetext(message):
    msg= cv.transform(message)
    pred = clf.predict(msg)
    if pred != [0]:
        print("spam")
    else:
        print("ham")

#

encodetext(["FreeMsg Hey there darling it's been 3 week's now and no word back I'd like some fun you up for it still? Tb ok! XxX "])

#

but this function classifies every message as HAM

silk forge Aug 11, 2019, 9:26 AM

#

nvm lol i figured it out
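The thread doesn't say what the fix was. One common way to make single-message prediction less error-prone is to bundle the vectorizer and classifier in a scikit-learn `Pipeline`, so the same fitted vocabulary is always used. A sketch with a tiny made-up corpus (the messages, labels, and the `classify` helper are illustrative, not the spam.csv data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, only to keep the example self-contained
messages = [
    "free prize, claim your free cash now",
    "win cash now, free entry",
    "are we still meeting for lunch today",
    "see you at the office tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline keeps the fitted vectorizer and classifier together, so the
# same vocabulary is used for training and for new messages.
model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
model.fit(messages, labels)

def classify(message):
    """Return the predicted label for a single message string."""
    return model.predict([message])[0]   # predict expects a list of documents

print(classify("claim your free cash prize"))
```

Passing the original string labels also avoids having to remember which integer the `LabelEncoder` assigned to which class.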

lapis sequoia Aug 11, 2019, 9:40 AM

#

import android.app.Activity;
import android.widget.TextView;
import android.os.Bundle;

#

Python needs this

#

It only imports built-in functions

#

I'm too lazy for pip

#

That's why I use C#

#

THIS AIN'T DATA SCIENCE

vapid wren Aug 11, 2019, 11:41 AM

#

why are all weights adjusted during the learning process of a perceptron (w_new = w_old + (Y - y) * x)? Is it not enough to adjust the weight of the bias neuron (threshold value (theta [w0]))?

verbal bison Aug 11, 2019, 4:35 PM

#

Afternoon everyone, would anyone be able to provide/speak to a few Machine Learning terms I'm having difficulty finding definitions for?

earnest prawn Aug 11, 2019, 5:45 PM

#

@verbal bison just ask instead of asking to ask

#

@vapid wren you're adjusting weights, or even have weights, in a perceptron so you can build functions which have many many many parameters; your approach would just throw parameters away, which doesn't exactly make sense.

For example when trying to find a simple function for a line I doubt you can work with

f(x) = x + 2 * t

I'm quite sure you'd love to have a parameter before x as well wouldnt you?

#

(yes the t is supposed to be a metaphor for the weight of constant value of a bias neuron here)
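To see the point numerically, here is a small sketch that trains a perceptron with the rule from the question twice: once updating all weights, once updating only the bias weight. The toy data and the `train_perceptron` helper are made up for illustration; the input weights in the bias-only run are fixed at an arbitrary value of 1.

```python
import numpy as np

def train_perceptron(X, targets, bias_only=False, epochs=20):
    """Perceptron rule w <- w + (Y - y) * x; w[0] is the bias weight (theta)."""
    Xb = np.column_stack([np.ones(len(X)), X])   # x0 = 1 feeds the bias weight
    w = np.zeros(Xb.shape[1])
    if bias_only:
        w[1:] = 1.0            # input weights fixed at an arbitrary value
    for _ in range(epochs):
        for x, Y in zip(Xb, targets):
            y = 1 if x @ w > 0 else 0
            update = (Y - y) * x
            if bias_only:
                update[1:] = 0.0   # only the threshold is allowed to move
            w += update
    preds = (Xb @ w > 0).astype(int)
    return w, preds

# Target depends on the first input only, so the inputs need different weights
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 1, 1])

w_all, preds_all = train_perceptron(X, targets)
w_bias, preds_bias = train_perceptron(X, targets, bias_only=True)
```

With all weights trainable the four points are classified correctly. With only the bias trainable, the inputs `[0, 1]` and `[1, 0]` always get the same weighted sum, so no threshold can separate them.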

onyx moth Aug 12, 2019, 5:46 AM

#

hello guys I want to learn machine learning from A - Z well with that I mean I want to learn from somewhere where they start at A and very basic. Where could I do that? Suggestions to a course/videos?

torn musk Aug 12, 2019, 6:10 AM

#

@onyx moth https://www.khanacademy.org/math/early-math/cc-early-math-counting-topic/cc-early-math-counting/v/counting-with-small-numbers

Khan Academy

Counting with small numbers

Sal counts squirrels and horses.

#

khan academy goes quite advanced and most of its courses are relevant: eg statistics calculus and linear algebra

#

but it also starts very basic

quasi nacelle Aug 12, 2019, 8:01 AM

#

Hi, I have a large dataset, and I wanted to do basic analysis first; it seems a bit overwhelming as it is now.
I have 4272 rows x 104 columns and I would like to plot how the overall data is looking, so I could get a visual representation of which columns are missing the most data - any ideas?

torn musk Aug 12, 2019, 2:30 PM

#

@quasi nacelle https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
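Besides the `count` row that `describe()` reports, the fraction of missing values per column can be computed directly. A minimal sketch on a made-up frame standing in for the real dataset:

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the 4272 x 104 dataset
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of missing values per column, worst columns first
missing_frac = df.isnull().mean().sort_values(ascending=False)
print(missing_frac)

# Optional visual overview (requires matplotlib):
# missing_frac.plot.barh(figsize=(6, 20))
```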

#

How to get this to output 64x64 images? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

storm gate Aug 12, 2019, 3:15 PM

#

Best way to replace an entire row in pandas with a new row by row index?

desert oar Aug 12, 2019, 3:25 PM

#

@storm gate data.loc[row_index] = new_row

storm gate Aug 12, 2019, 3:37 PM

#

thank you!

#

@desert oar im getting a syntax error with that

desert oar Aug 12, 2019, 3:43 PM

#

eh?

#

show your code

storm gate Aug 12, 2019, 3:43 PM

#

matrix = pd.read_excel(path)
d_path = r'D:\AtomProjects\clean_genes\genes_by_index_dupes_with_avereages.txt'
dupes = ast.literal_eval(
    open(d_path, 'r').read())
for gene in dupes:
    matrix.loc(dupes[gene]['rows']) = dupes[gene][new_vals]

desert oar Aug 12, 2019, 3:44 PM

#

() vs []
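A small self-contained illustration of the difference, on a made-up frame:

```python
import pandas as pd

matrix = pd.DataFrame({'g1': [1.0, 2.0, 3.0], 'g2': [4.0, 5.0, 6.0]})  # made-up data

# .loc is indexed with square brackets; calling it with () and then
# assigning to the result is what raises the SyntaxError.
matrix.loc[1] = [20.0, 50.0]          # replace the row labelled 1
```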

storm gate Aug 12, 2019, 3:44 PM

#

Ah

#

thank you

desert oar Aug 12, 2019, 3:44 PM

#

also what data type is dupes

storm gate Aug 12, 2019, 3:44 PM

#

It is a dict but I call a list

desert oar Aug 12, 2019, 3:44 PM

#

i see

#

also why in the world are you storing data as literal python code

#

if its a list of dicts wouldnt you just use json

storm gate Aug 12, 2019, 3:45 PM

#

im storing it as text for now

desert oar Aug 12, 2019, 3:45 PM

#

🤔

#

right

#

why

storm gate Aug 12, 2019, 3:45 PM

#

dude im new to this

desert oar Aug 12, 2019, 3:45 PM

#

ok fair enough

#

use json

storm gate Aug 12, 2019, 3:45 PM

#

should i just save it as a json

#

ok

desert oar Aug 12, 2019, 3:45 PM

#

yeah definitely

#

if it's just dicts and lists of strings, numbers, Nones, etc

#

json is the way to go
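A minimal round trip with the standard-library `json` module, as a sketch. The `dupes` structure and the file name are hypothetical, loosely modelled on the snippet above:

```python
import json
import os
import tempfile

# Hypothetical structure, loosely modelled on the dict in the snippet above
dupes = {"geneA": {"rows": [3, 7], "new_vals": [0.5, 1.25, None]}}

path = os.path.join(tempfile.mkdtemp(), "dupes.json")

with open(path, "w") as f:
    json.dump(dupes, f, indent=2)      # None is written as null

with open(path) as f:
    loaded = json.load(f)              # null comes back as None
```

One thing to watch for: JSON object keys are always strings, so a dict keyed by integers comes back keyed by strings.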

storm gate Aug 12, 2019, 3:46 PM

#

that and firebase

desert oar Aug 12, 2019, 6:10 PM

#

what kind of backend

quasi nacelle Aug 12, 2019, 8:35 PM

#

Hi there...In a dataframe I want to drop everything but columns that begin with MTX
df_MTX_time = df.loc[:, df.columns.str.startswith('MTX')]
but I also want to keep the very first column called UID - can i do that in one line ?

quasi nacelle Aug 12, 2019, 8:58 PM

#

or should I add the column after and if with .loc 0 ? or would that make further problems ??

#

please help me out

desert oar Aug 12, 2019, 9:34 PM

#

option 1:

df_mtx = df[['UID', *df.columns[df.columns.str.startswith('MTX')]]]

option 2:

df_mtx = df.set_index('UID').filter(regex='^MTX')

#

@quasi nacelle ^

silent swan Aug 12, 2019, 10:52 PM

#

oh snap, didn't realize you could do *args for indexing

desert oar Aug 12, 2019, 11:22 PM

#

im not

#

['UID', *df.columns[df.columns.str.startswith('MTX')]] is a list

#

[['UID', *df.columns[df.columns.str.startswith('MTX')]]] is indexing with said list

#

@silent swan

silent swan Aug 12, 2019, 11:28 PM

#

oh, I guess *args for creating the latter part of the list then

desert oar Aug 12, 2019, 11:44 PM

#

yep

grizzled folio Aug 13, 2019, 12:30 AM

#

Pandas and friends feel like they could be such powerful DSLs

desert oar Aug 13, 2019, 12:56 AM

#

@grizzled folio pandas is heavily inspired by R, which is basically a statistics DSL

#

the whole concept of a data frame i think originates with R

grizzled folio Aug 13, 2019, 12:56 AM

#

oh, that's true

#

it is nicer having a general programming language behind things, rather than R

desert oar Aug 13, 2019, 12:57 AM

#

exactly. i think thats a big part of what pushed people away from R and towards Python/Pandas

grizzled folio Aug 13, 2019, 12:57 AM

#

but also I interact mostly through the strange pandas-ish abstraction of xarray

#

afaik R definitely doesn't do that

desert oar Aug 13, 2019, 12:58 AM

#

in R you can have named rows, columns, etc

#

in arbitrary dimensions

#

but xarray is a lot more performant

#

R has data.table for big tabular datasets, but it has no equivalent of xarray for "lower level" performance

#

also multiple competing/incompatible sparse matrix implementations

grizzled folio Aug 13, 2019, 12:59 AM

#

yes, none of that means anything without performance... lazy loading, dask integration, blahblahblah

desert oar Aug 13, 2019, 12:59 AM

#

R really sucks for general scientific computing

grizzled folio Aug 13, 2019, 12:59 AM

#

oh that's fun

#

yeah, I've heard that. if you're doing stats, primo, otherwise not so

desert oar Aug 13, 2019, 12:59 AM

#

its literally only good for stats

#

yeah

grizzled folio Aug 13, 2019, 12:59 AM

#

also has a pretty plotting grammar

desert oar Aug 13, 2019, 12:59 AM

#

its a fun language to hack with

#

honestly, matplotlib gets you up and running faster

#

i used to love R plotting but i think matplotlib is way easier to deal with once you learn the data model

grizzled folio Aug 13, 2019, 1:00 AM

#

matplotlib is more intuitive in most cases, but ggplot can do some very clever things

desert oar Aug 13, 2019, 1:00 AM

#

ggplot is another story

#

its a gem, and the failure to replicate its success in Python is kind of confusing

#

the API is very un-R and something you could easily implement in Python

grizzled folio Aug 13, 2019, 1:00 AM

#

but nobody has?

desert oar Aug 13, 2019, 1:00 AM

#

matplotlib already has a grid-like system

#

there were some port attempts but afaik they all lost steam

#

at least one of them was commercially backed (y-hat)

#

(who also developed rodeo which was like a shitty version of rstudio or spyder)

grizzled folio Aug 13, 2019, 1:03 AM

#

I find that somewhat surprising...then again the hard part is the actual plotting, not so much the grammar

desert oar Aug 13, 2019, 1:21 AM

#

yeah

#

i think maybe because matplotlib is "easy enough"

#

eg. you can .groupby your dataframe and loop over it

#

python people are used to doing things with 50 key strokes when you could use 20

vale hedge Aug 13, 2019, 1:37 AM

#

Does anyone know how to figure out what libraries like BLAS I should use?

grizzled folio Aug 13, 2019, 2:12 AM

#

"libraries like BLAS"

#

@vale hedge what do you mean?

vale hedge Aug 13, 2019, 2:13 AM

#

so i can run stuff like numpy faster

grizzled folio Aug 13, 2019, 2:13 AM

#

@desert oar haha, true. and with lots of extra brackets, splats, etc.

vale hedge Aug 13, 2019, 2:13 AM

#

i think they need binaries for doing Linear algebra etc

grizzled folio Aug 13, 2019, 2:14 AM

#

I suppose you could compile numpy with a different BLAS implementation (I think our HPC uses MKL)

sullen wing Aug 13, 2019, 2:14 AM

#

Hmm, did you try pypy out?

grizzled folio Aug 13, 2019, 2:14 AM

#

intel distributes a numpy build with MKL, I think?

desert oar Aug 13, 2019, 2:14 AM

#

@vale hedge if youre using conda on an intel machine, the MKL version will be installed by default

sullen wing Aug 13, 2019, 2:14 AM

#

It was pretty fast last time I gave it a try

grizzled folio Aug 13, 2019, 2:14 AM

#

I don't think numpy on pypy is as fast as cpython

desert oar Aug 13, 2019, 2:14 AM

#

@sullen wing numpy passes everything off to BLAS anyway

vale hedge Aug 13, 2019, 2:15 AM

#

oh im on AMD do i need to install something else?

sullen wing Aug 13, 2019, 2:15 AM

#

Ah, rip

desert oar Aug 13, 2019, 2:15 AM

#

@vale hedge it should probably detect that and install the openblas version which is noticeably slower

#

but still good enough for most cases

grizzled folio Aug 13, 2019, 2:15 AM

#

are there no atlas versions?

desert oar Aug 13, 2019, 2:15 AM

#

not prebuilt that i know of

#

@grizzled folio i think the issue with pypy is there's more overhead, so if you do a lot of matrix ops in a hot loop it's slower

#

bigger problem with pypy is zero cython support

#

which is fine obviously cause pypy itself is good for that

grizzled folio Aug 13, 2019, 2:16 AM

#

yeah, I was thinking of the overhead. the native stuff is going to be the same regardless

vale hedge Aug 13, 2019, 2:16 AM

#

what does cython support mean?

desert oar Aug 13, 2019, 2:16 AM

#

but if you wanna do fast non-vectorized stuff on a numpy array with pypy idk how youd even do it

#

does numba work on pypy?

#

@vale hedge cython is a python-like language that compiles to a CPython C extension

grizzled folio Aug 13, 2019, 2:17 AM

#

my intuition says no

#

are you actually doing heavy enough linear algebra that this matters?

desert oar Aug 13, 2019, 2:18 AM

#

yeah looks like you need to hack numba to get it to work https://www.embecosm.com/2017/01/19/running-numba-on-pypy/

Embecosm

Running Numba on PyPy

Summary Numba can be modified to run on PyPy with a set of small changes. With these changes, 91.5% of Numba tests pass. Execution speed appears to be similar to using Numba on CPython, with a smal…

vale hedge Aug 13, 2019, 2:18 AM

#

oh default python is cpython and it compiles to C right?

desert oar Aug 13, 2019, 2:18 AM

#

no

grizzled folio Aug 13, 2019, 2:18 AM

#

default python is interpreted, but the interpreter is written in C

desert oar Aug 13, 2019, 2:18 AM

#

^

#

CPython is called CPython because it's written in C

vale hedge Aug 13, 2019, 2:18 AM

#

what are all the pyc files?

desert oar Aug 13, 2019, 2:18 AM

#

they are CPython "bytecode"

vale hedge Aug 13, 2019, 2:18 AM

#

oh ok

desert oar Aug 13, 2019, 2:19 AM

#

cpython is a bytecode interpreter

vale hedge Aug 13, 2019, 2:19 AM

#

so kind of like java?

desert oar Aug 13, 2019, 2:19 AM

#

yes

#

except unlike java the VM isn't part of the spec

#

rather it's an implementation detail
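You can look at the bytecode from within Python using the standard-library `dis` module. A quick sketch (`add_one` is just an example function):

```python
import dis

def add_one(x):
    return x + 1

# CPython compiles the function body to bytecode when the def is executed;
# .pyc files are this same kind of bytecode cached on disk for modules.
raw = add_one.__code__.co_code               # raw bytecode as a bytes object
ops = [ins.opname for ins in dis.get_instructions(add_one)]
print(ops)  # exact opcode names vary between CPython versions
```

That the opcode names change between versions is a direct consequence of the bytecode being an implementation detail rather than part of the language spec.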

#

@grizzled folio the biggest difference i noted between openblas and mkl was in SVD computation time

vale hedge Aug 13, 2019, 2:20 AM

#

I see

desert oar Aug 13, 2019, 2:20 AM

#

MKL was orders of magnitude faster

#

(in some artificial benchmarks i ran)

vale hedge Aug 13, 2019, 2:20 AM

#

how do you guys use numba?

desert oar Aug 13, 2019, 2:20 AM

#

i dont really have a need for it

grizzled folio Aug 13, 2019, 2:20 AM

#

I don't, I just make sure to write vectorised code most of the time

desert oar Aug 13, 2019, 2:21 AM

#

but its for when you want to implement simple iterative algorithms using numpy arrays or python lists

#

eg you can probably implement something like Kmeans with it

#

stuff that doesn't vectorize well but does use uniformly typed numerical data in arrays

vale hedge Aug 13, 2019, 2:21 AM

#

oh ok

grizzled folio Aug 13, 2019, 2:22 AM

#

I wonder how it'd go on this particle advection problem...

desert oar Aug 13, 2019, 2:22 AM

#

what is the problem

grizzled folio Aug 13, 2019, 2:23 AM

#

rk4 advection of particles interpolated on a prescribed velocity field

#

we use it on ocean velocity data we generate offline for analyses

vale hedge Aug 13, 2019, 2:24 AM

#

is advection same thing as convection?

grizzled folio Aug 13, 2019, 2:25 AM

#

convection can be a subset of buoyancy-driven advection

#

but you can also get diffusive convection

desert oar Aug 13, 2019, 2:33 AM

#

heh i dont know what any of that is. what computationally does it entail?

vale hedge Aug 13, 2019, 2:34 AM

#

guessing its a huge matrix of numbers and you have to calculate differentials

grizzled folio Aug 13, 2019, 2:35 AM

#

rk4 is just fancy integration, you could do forward euler like dx/dt = v => x^{n+1} = x^n + v dt

#

so you just need to be able to interpolate velocity at an arbitrary position

#

(rk4 just adds the complexity that you interpolate in time too)
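A rough sketch of the two schemes side by side. To stay self-contained it uses an analytic velocity field (solid-body rotation) in place of interpolated ocean data, so the `velocity` function here is an assumption; in the real problem that function would be the interpolation step, in space and (for RK4's intermediate stages) in time.

```python
import numpy as np

def velocity(pos, t):
    """Prescribed velocity field: solid-body rotation, independent of t.
    A real setup would interpolate gridded velocity data here instead."""
    x, y = pos[..., 0], pos[..., 1]
    return np.stack([-y, x], axis=-1)

def euler_step(pos, t, dt):
    return pos + dt * velocity(pos, t)

def rk4_step(pos, t, dt):
    k1 = velocity(pos, t)
    k2 = velocity(pos + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(pos + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(pos + dt * k3, t + dt)
    return pos + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Two particles on the unit circle, advected for one full revolution
start = np.array([[1.0, 0.0], [0.0, 1.0]])
dt, n_steps = 2 * np.pi / 100, 100
p_euler, p_rk4 = start.copy(), start.copy()
for n in range(n_steps):
    t = n * dt
    p_euler = euler_step(p_euler, t, dt)
    p_rk4 = rk4_step(p_rk4, t, dt)

radius_euler = np.linalg.norm(p_euler, axis=1)  # drifts outward
radius_rk4 = np.linalg.norm(p_rk4, axis=1)      # stays close to 1
```

Both steppers operate on the whole array of particle positions at once, which is the vectorisation over particles mentioned later in the conversation.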

vale hedge Aug 13, 2019, 2:36 AM

#

are you doing implicit or explicit methods?

grizzled folio Aug 13, 2019, 2:36 AM

#

this is explicit, the velocity field is usually generated by a semi-implicit method so it's reasonably stable

vale hedge Aug 13, 2019, 2:40 AM

#

what kind of numerical precision do you need?

grizzled folio Aug 13, 2019, 2:41 AM

#

32-bit is usually fine, depending on how the velocity fields are staggered

vale hedge Aug 13, 2019, 2:44 AM

#

what kind of objects are you modeling around and what are you doing for grid system

grizzled folio Aug 13, 2019, 2:45 AM

#

what do you mean?

vale hedge Aug 13, 2019, 2:46 AM

#

do you need to model fluid flow around islands or other objects?

grizzled folio Aug 13, 2019, 2:46 AM

#

yeah, we'll use realistic ocean bathymetry in a lot of cases

vale hedge Aug 13, 2019, 2:49 AM

#

oh have you done much numerical methods before?

#

do your velocity fields change over time?

#

would numpy work for this project?

grizzled folio Aug 13, 2019, 2:53 AM

#

in this case, the velocity fields do change over time, but they're discrete snapshots. numpy alone wouldn't work, because it can't do the interpolation of velocity

#

I expect the advection alone could be vectorised over arrays of particle positions though

vale hedge Aug 13, 2019, 2:55 AM

#

i think you can just implement it yourself?

#

do you need to vectorize it

grizzled folio Aug 13, 2019, 2:56 AM

#

non-vectorised python is insufferably slow

#

I guess the interpolation could be done in numba, then the rest would be straightforward

vale hedge Aug 13, 2019, 2:58 AM

#

scipy might have runge kutta implementations

grizzled folio Aug 13, 2019, 2:59 AM

#

that's the easy bit really

#

just adding a few numbers together 😉

vale hedge Aug 13, 2019, 2:59 AM

#

oh do you want to animate it in python too

onyx moth Aug 13, 2019, 6:29 AM

#

What is the difference between labels and attributes? and what are labels exactly?

silent swan Aug 13, 2019, 6:51 AM

#

what's the context of that question

onyx moth Aug 13, 2019, 6:53 AM

#

@silent swan well I started on ML and they are talking about labels its not very clear what labels are. Are labels the unknown attribute?

silent swan Aug 13, 2019, 6:54 AM

#

mm well might be clearer given more context

#

but likely

#

the goal of a standard beginner ML task is to "label" or "classify" something

#

so that's likely the variable/information you're trying to predict with a model

onyx moth Aug 13, 2019, 8:02 AM

#

Linear regression only works with stuff which has a correlation with each other right?

#

I passed in some BTC/USDT data and it put out an accuracy of 0.99999 but you could in theory also yourself take the open, close price and calculate the missing piece (the volume that is needed to move the price that much) yourself.

#

Then its more of an algo than machine learning? Or am I mistaken?

#

and with linear regression you cant predict the next days price right? (as you need to pass in all the other values and you dont have any of them)

spark stag Aug 13, 2019, 8:15 AM

#

u probably could if u have enough points cuz it just calculates a gradient basically so if there is a direct relation between x and y then just put in the next value for x and u will (hopefully) get the next y value

#

its just drawing a line of best fit through the data assuming the variables are directly proportional

#

if u calculate gradient and intercept then u can predict future values

onyx moth Aug 13, 2019, 8:19 AM

#

Bitcoin's price and volume differ every day. Can I without any data of tomorrow predict its price today? Im getting the feeling linear regression is used to fill in the blank when u have the rest. for example no one tells me what todays price is but I can let the code tell it to me by filling in the volume, open, high and low price

spark stag Aug 13, 2019, 8:23 AM

#

idk but if u want u can try use this and put in any numbers u got then use the prediction subroutines, just ask if u don't understand it

#

📎 unknown.png

#

Data points is an array of points btw eg [[1, 3], [3, 2], [8, 6]]
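
(The attached screenshot is not preserved in this log, so the following is not spark stag's code. It is only a minimal sketch of the idea described above: fit a gradient and intercept by ordinary least squares, then plug in a new x. The function names `fit_line` and `predict` are made up for illustration, and the points are the example array from the message.)

```python
def fit_line(points):
    """Least-squares fit of y = gradient * x + intercept to [x, y] pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; no unique line")
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def predict(x, gradient, intercept):
    """Evaluate the fitted line at a new x value."""
    return gradient * x + intercept


data_points = [[1, 3], [3, 2], [8, 6]]
gradient, intercept = fit_line(data_points)
print(gradient, intercept, predict(10, gradient, intercept))
```

Note that this only predicts y for an x you already know, which is the limitation onyx moth is pointing at: tomorrow's volume, high and low are not available today.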

lavish agate Aug 13, 2019, 8:25 AM

#

I'm just trying my first steps with machine learning / sklearn. I have a dataset with lots of categorical data (size, color etc. of a product) and sales as the measure. What do you call it if I want to find out which categories have a positive or negative correlation with sales?

#

so "color red has a [number] positive correlation with sales"

#

or... I don't know

spark stag Aug 13, 2019, 8:27 AM

#

not 100% sure, but could u work out the average sales normally then the average sales with the chosen attributes to see if its higher or lower?

lavish agate Aug 13, 2019, 8:32 AM

#

I guess that is the most logical but that is so normal and boring 😅

#

I think I'll try a correlation matrix and see if that's interesting

quasi nacelle Aug 13, 2019, 9:20 AM

#

hi i have a dataframe and want to first (1) make a new df based on a row (patient) and several columns. (2) plot the normal distribution of that df. (3) can i automate it for 200 or more rows ?

desert oar Aug 13, 2019, 12:20 PM

#

@onyx moth yes a regression model (or better yet, a time series regression model) would be the start of something like that

#

@quasi nacelle yes of course, you can use a for loop over .iterrows() or .itertuples() depending on what you need

#

@lavish agate correlation would give you basically the same answer.

there are two approaches here:

1. one-hot-encode the data and take the correlation between each category column (of 1s and 0s) with sales
2. one-hot-encode the data and fit a linear model with the 1s and 0s as features and sales as the target

method 2 lets you use a set of classical statistical techniques called ANOVA, but basically you're getting the same "directional" answer in both cases
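
(A rough, stdlib-only sketch of the first approach, with made-up colors and sales figures. The helpers `one_hot` and `pearson` are hypothetical names written out by hand here; in pandas, `pd.get_dummies` would normally handle the encoding step.)

```python
from math import sqrt


def one_hot(values):
    """Map each category to a column of 1s and 0s, one entry per row."""
    return {
        category: [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up example data: one row per product.
colors = ["red", "blue", "red", "green", "blue", "red"]
sales = [10, 4, 12, 6, 5, 11]

correlations = {
    category: pearson(column, sales)
    for category, column in one_hot(colors).items()
}
for category, r in correlations.items():
    print(f"color {category}: correlation with sales = {r:+.2f}")
```

The sign of each number is the "directional" answer: positive means rows in that category tend to have above-average sales in this toy data.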

#

@onyx moth how to approach that problem depends on what other data you have available

quasi nacelle Aug 13, 2019, 12:26 PM

#

@desert oar thanks - sorry but could you help me .. i am really stuck
```
df.SEX[df.SEX == 'Male'] = 1
df.SEX[df.SEX == 'Female'] = 2

plt.scatter(df.MTX36, df.MTX23,
c = (df.SEX), cmap="cool")
ax = plt.gca()

plt.colorbar(label="")
plt.xlabel("MTX36")
plt.ylabel("MTX23")
```

#

i intended to convert string to int so i could use it in a color bar

desert oar Aug 13, 2019, 12:27 PM

#

you can't use color like that in matplotlib, it's not like ggplot

#

(sadly)

quasi nacelle Aug 13, 2019, 12:28 PM

#

okay - that's too bad

#

@desert oar can i color based on a cutoff value?

> 40 is red, < 40 is blue ?

lavish agate Aug 13, 2019, 12:32 PM

#

@desert oar I did some googling and tried to replicate it

onyx moth Aug 13, 2019, 12:32 PM

#

@desert oar The only data I have is open, close, high, low and volume, but that's like today's data, I have nothing for tomorrow

desert oar Aug 13, 2019, 12:32 PM

#

@quasi nacelle pandas has its own plotting routines. for a cutoff value you would have to make a separate bool column

lavish agate Aug 13, 2019, 12:33 PM

#

so far it looks like this: http://dpaste.com/00YBKXM

desert oar Aug 13, 2019, 12:34 PM

#

@quasi nacelle https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html#pandas.DataFrame.plot.scatter

lavish agate Aug 13, 2019, 12:34 PM

#

source is: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9

Medium

The Search for Categorical Correlation

Exploring the uncharted territories where Pearson's R no longer works

desert oar Aug 13, 2019, 12:35 PM

#

df['thing_above40'] = df['thing'] > 40
df.plot.scatter('MTX36', 'MTX23', c='thing_above40')

@quasi nacelle

#

@lavish agate that's for correlation w/ the entire categorical feature. what i described is for individual categories. also that article is for association between two categoricals

#

that said it's a great article and thank you for sharing it

#

@onyx moth yes if you have that data going back a long time you can use linear regression. you can also consider an AR model where you use yesterday's bitcoin price to predict today's
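
(To make the AR idea concrete, here is a minimal sketch of an AR(1) model: regress each day's price on the previous day's price, then feed in the latest price to get a one-step forecast. The prices are made-up numbers, the names `fit_ar1` and `forecast_next` are hypothetical, and nothing here implies the forecast would be any good on real BTC data.)

```python
def fit_ar1(series):
    """Fit price[t] = slope * price[t-1] + intercept by least squares."""
    previous = series[:-1]   # yesterday's values
    current = series[1:]     # today's values, aligned with `previous`
    n = len(previous)
    mean_p = sum(previous) / n
    mean_c = sum(current) / n
    spp = sum((p - mean_p) ** 2 for p in previous)
    spc = sum((p - mean_p) * (c - mean_c) for p, c in zip(previous, current))
    slope = spc / spp
    intercept = mean_c - slope * mean_p
    return slope, intercept


def forecast_next(series, slope, intercept):
    """One-step-ahead forecast using only the most recent observation."""
    return slope * series[-1] + intercept


# Made-up daily closing prices, oldest first.
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]
slope, intercept = fit_ar1(closes)
print("forecast for tomorrow:", forecast_next(closes, slope, intercept))
```

The point of the lag is that the only input needed for tomorrow's forecast is today's close, which is already known.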

lavish agate Aug 13, 2019, 12:39 PM

#

I will look into it, thanks

swift karma Aug 13, 2019, 1:09 PM

#

how would you describe computational problems to a five year old?

desert oar Aug 13, 2019, 1:11 PM

#

what kind of computational problem

lavish agate Aug 13, 2019, 1:27 PM

#

@desert oar just so I understand the article right: the function cramers_v computes the correlation between two dimensions. Now I would need to do that for each pair and put that in a table of some sort?

#

basically, how did he create this picture:

📎 1kUuEuJu3B1LNAXiFwUvDIw.png

desert oar Aug 13, 2019, 1:53 PM

#

no thats specifically not what V does

compact thistle Aug 13, 2019, 6:53 PM

#

Hey guys, I'm not even sure if it's right to post it here. I'm doing a personal project using the Amazon Customer Reviews dataset (https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt) from the AWS public data registry.
It's a gigantic dataset divided into 53? product categories. If it helps, when I parse it into a pandas dataframe it looks like the attached.
Basically, I want to build a recommendation system with this dataset. I'm not familiar with recommendation systems, but I find them interesting and thought I could probably build one with this massive dataset.
I'm not sure if the dataset has enough features to build one. I'd like to hear what you guys think. Thanks!

📎 unknown.png

silent swan Aug 13, 2019, 7:13 PM

#

you probably could! (not like an amazing one, but good enough for a fun project)

#

https://en.wikipedia.org/wiki/Collaborative_filtering

Collaborative filtering

Collaborative filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one. In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interes...

compact thistle Aug 13, 2019, 7:35 PM

#

@silent swan Thanks man, I'm just starting to read up on the approaches to building recommendation systems, and I'm checking whether my dataset has enough for inter-categorical (since there are 52 more) prediction. I believe collaborative filtering has to do with analyzing other customers' purchase patterns, right?
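
(For illustration, a tiny sketch of user-based collaborative filtering: score products a customer has not rated by looking at the ratings of customers with similar taste. The data layout, a dict of customer to product to star rating, and all names and numbers are assumptions made up for this example, not the actual dataset schema.)

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {product: stars} rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[p] * b[p] for p in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, customer, top_n=3):
    """Rank unseen products by similarity-weighted average star rating."""
    own = ratings[customer]
    weighted_sum = {}
    weight_total = {}
    for other, theirs in ratings.items():
        if other == customer:
            continue
        similarity = cosine_similarity(own, theirs)
        if similarity <= 0:
            continue  # nothing in common, so this customer tells us nothing
        for product, stars in theirs.items():
            if product in own:
                continue  # only recommend products the customer has not rated
            weighted_sum[product] = weighted_sum.get(product, 0.0) + similarity * stars
            weight_total[product] = weight_total.get(product, 0.0) + similarity
    ranked = sorted(
        weighted_sum,
        key=lambda p: weighted_sum[p] / weight_total[p],
        reverse=True,
    )
    return ranked[:top_n]


# Made-up ratings: customer -> {product: stars}.
ratings = {
    "alice": {"kettle": 5, "toaster": 4},
    "bob": {"kettle": 5, "toaster": 5, "blender": 5},
    "carol": {"kettle": 1, "novel": 3},
}
print(recommend(ratings, "alice"))
```

So yes, in this form it relies only on other customers' rating patterns and needs no product features at all.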

prisma verge Aug 13, 2019, 7:47 PM

#

any good courses on basics for deep learning? preferably, not too much in depth and with less math stuff
i wanna get into it as a hobby and wanna find a way where i could dash into it on the basics, and when i get more experienced at math, get into it deeper maybe, haha

#

is http://fast.ai good perhaps?

Home

Making neural nets uncool again

surreal nacelle Aug 13, 2019, 8:06 PM

#

https://www.youtube.com/watch?v=Ul0Gilv5wvY
This is so cool

YouTube

Yoshiboy2

Phase-Functioned Neural Networks for Character Control

We present a real-time character control mechanism using a novel neural network architecture called a Phase-Functioned Neural Network. In this network struct...

desert oar Aug 13, 2019, 8:07 PM

#

@prisma verge yes fast.ai is recommended

prisma verge Aug 13, 2019, 8:08 PM

#

thank you!

onyx moth Aug 13, 2019, 9:46 PM

#

is reinforcement learning already in the NN and deep learning box, or is it still just outside of that in ML?

#

if someone answers this plz @ me as im gonna go sleep now

earnest prawn Aug 13, 2019, 10:36 PM

#

@onyx moth it's certainly not deep learning only, however most modern successes in that area are made using Deep learning, at the moment especially with LSTM based networks

desert oar Aug 14, 2019, 1:56 AM

#

i dont, but Elo is a guy's name

#

it's not an acronym

desert oar Aug 14, 2019, 2:36 AM

#

how would you go + or - then with no win or loss?

#

does the team win or lose as a group?

#

because then you can still implement MMR

#

+5 if you win (as a team), -5 if you lose (as a team)

#

i dont know of any literature on it though, would be curious

desert oar Aug 14, 2019, 3:03 AM

#

interesting

#

so does the "win +5, lose -5" model not work for your case?

desert oar Aug 14, 2019, 3:31 AM

#

so you need 2 scores then, right?

#

for matching players use euclidean distance on score "vector" maybe

#

would have to start tweaking relative scores a lot
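
(A small sketch of matching on a two-score vector with Euclidean distance. The two scores, the player names and the `weights` parameter are all assumptions for illustration; the weights stand in for the "tweaking relative scores" problem, since scores on very different scales would otherwise not contribute equally.)

```python
from math import sqrt


def weighted_distance(a, b, weights):
    """Euclidean distance between score vectors, with per-score weights."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def closest_match(player, candidates, weights):
    """Return the name of the candidate whose scores are nearest to `player`."""
    return min(
        candidates,
        key=lambda name: weighted_distance(player, candidates[name], weights),
    )


# Hypothetical (first score, second score) pairs on very different scales.
me = (1500, 7)
pool = {"x": (1520, 2), "y": (1530, 7)}

print(closest_match(me, pool, weights=(1, 1)))    # first score dominates
print(closest_match(me, pool, weights=(1, 100)))  # second score counts more
```

With equal weights the large-scale score decides the match almost by itself; raising the weight on the second score changes who gets picked.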

#

how does it work in DND AL if someone is a dick

#

hm

#

i was asking about your case specifically though

#

what's this for?

#

yeah i feel like you end up with the same kind of +/- MMR/Elo system

#

but instead the size of the +/- is determined by the gap between your skill and your teammates' skill

#

whereas something like a personality score would be absolute increments
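
(To contrast the two kinds of update mentioned here, a sketch of the standard two-sided Elo formula next to a flat +5/-5 step. This is the textbook Elo update against an opposing rating. How it should map onto teammates in this particular case is not settled in the conversation, and the K-factor of 32 is just a common default.)

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b); score_a is 1 for a win, 0 for a loss."""
    expected_a = expected_score(rating_a, rating_b)
    change = k * (score_a - expected_a)
    # Whatever A gains, B loses, so total rating is conserved.
    return rating_a + change, rating_b - change


def flat_update(rating, won, step=5):
    """Absolute increment: the same step regardless of who was faced."""
    return rating + step if won else rating - step


# An underdog win moves ratings more than a favourite's win.
print(elo_update(1400, 1600, score_a=1))
print(elo_update(1600, 1400, score_a=1))
print(flat_update(1400, won=True))
```

The gap-dependent change is what distinguishes an Elo-style score from something like a personality score, which would use the flat step.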

quartz stream Aug 14, 2019, 11:02 AM

#

Any idea on how to get started with a voice authentication project

#

All I want is a python program which can detect a person based on voice

#

any kick start is appreciated 😛

hollow quartz Aug 14, 2019, 12:00 PM

#

Hi, can I use PostgreSQL in a data science project?

polar acorn Aug 14, 2019, 1:05 PM

#

Sure. Then again you could probably do a data science project with crayons and a piece of paper. So a better question might be if you should.

desert oar Aug 14, 2019, 4:06 PM

#

That's like asking if you can use timing belts to drive to the store

onyx moth Aug 15, 2019, 7:43 AM

#

if I want to do reinforcement learning is Q-learning the way to go? Or is there another way?