#data-science-and-ml

simple fjord
#

so how would I use their placement information

#

no

#

several strain gauges in different positions

#

the strain sensor has a temperature sensor; it's a fiber optic one

small ore
#

You can add relative co-ords as regression variables

simple fjord
#

Can you explain more

#

I'm very beginner

#

Yesterday I tried PCA with regression

#

I got good prediction from 2 sensors out of 15

small ore
#

I am a beginner of some sort too. I am learning. If you take the first couple of lessons from a machine learning course (perhaps Andrew Ng's), you will know how to do multi-variable regression. We can wait for someone else to answer you on PCA. Meanwhile I will try to learn what it is

simple fjord
#

I followed that article

#

maybe its interesting for you

small ore
#

Most certainly interesting. Now I need to read up other methods to reduce variables too 😃

void anvil
#

With Scikit learn, is there a way to specify batches for online learners (like MLP)?

#

e.g.

#
```py
clf = MLPRegressor(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(25, 25, 25), random_state=11)
clf.fit(x_train, y_train)
```

```MLPRegressor(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
            beta_2=0.999, early_stopping=False, epsilon=1e-08,
            hidden_layer_sizes=(25, 25, 25), learning_rate='constant',
            learning_rate_init=0.001, max_iter=200, momentum=0.9,
            nesterovs_momentum=True, power_t=0.5, random_state=11, shuffle=True,
            solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
            warm_start=False)```
#

And what I want to do is:

#
```
Batch 2: trains on 10-150 from x_train
Batch 3: trains on 9,2,159,37 from x_train
etc.
```
#

just hl me if you know
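One way to feed hand-picked batches to scikit-learn's MLP is `partial_fit`, which trains only on the rows it is given. The following is a sketch, not the asker's code: `x_train`, `y_train` and the batch indices are made up, and the solver is switched to `adam` because `partial_fit` is only available for the stochastic solvers (`sgd`, `adam`), not `lbfgs`.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(11)
x_train = rng.rand(200, 3)          # synthetic stand-in data (assumption)
y_train = x_train.sum(axis=1)

# partial_fit needs a stochastic solver; lbfgs does not support it.
clf = MLPRegressor(solver='adam', alpha=1e-5,
                   hidden_layer_sizes=(25, 25, 25), random_state=11)

# Hypothetical batches: each entry is an array of row indices into x_train.
batches = [
    np.arange(0, 10),
    np.arange(10, 151),
    np.array([9, 2, 159, 37]),
]

for idx in batches:
    # Train only on the rows selected for this batch.
    clf.partial_fit(x_train[idx], y_train[idx])

predictions = clf.predict(x_train[:5])
```

Each call continues from the current weights, so the order of the batches matters.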

small ore
simple fjord
#

@small ore Thanks so much, that's a very good book

radiant notch
#

Hello

#

My neural network is taken off of the internet because I'm a beginner and wanted to play around with it

#

But, I have a problem: All the neural network samples like this that I take off the internet use a sigmoid function which apparently means that the output can only ever be between 0 and 1

#

Furthermore, even if the output should be between 0 and 1 it doesn't seem to be giving correct responses

#

For example, [-3,-2,-1] would hopefully give 0

#

but instead it gives somewhere around 0.22

stone oasis
#

what is that list of numbers

#

smoothstep or something? like what you are doing with those three args and how do they interact with ssigmoid

#

oh and the default or 'centered' value concerning sigmoid functions with an input of 0 is .5, not 0
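A small sketch of the point about the sigmoid: its output always lies strictly between 0 and 1, and an input of 0 maps to 0.5. A network with a sigmoid output can therefore never produce a target like 4 unless the targets are rescaled or the output layer is made linear. The function below is a plain textbook sigmoid, not the code from the sample network.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))     # exactly 0.5
print(sigmoid(-6))    # close to 0, but never reaches it
print(sigmoid(6))     # close to 1, but never reaches it
```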

radiant notch
#

Well

#

I'm just testing it at the moment

#

and giving it test runs where [1,2,3] = 4

#

and [3,2,1] = 0

#

stuff like that

stone oasis
#

you know order matters right

#

or wait nvm

#

u just showed me lol

radiant notch
#

I know what that is

#

I've read some of it

#

And already knew anyhow

pine dune
#

I am using this GitHub project and it is not working: https://github.com/ardamavi/Game-Bot. I am getting this error:
Traceback (most recent call last):
  File "create_dataset.py", line 9, in <module>
    from predict import predict
  File "D:\AI\Game\Game-Bot-master\predict.py", line 3, in <module>
    from scipy.misc import imresize
  File "C:\Users\Fidgety\Anaconda3\envs\tensorflow\lib\site-packages\scipy\__init__.py", line 62, in <module>
    from numpy import show_config as show_numpy_config
ImportError: cannot import name 'show_config'

lapis sequoia
#

!t resources

arctic wedgeBOT
#
resources

It can be difficult to know where to begin when you are first starting out with Python. On our website, we have compiled a list of both free and paid resources that we recommend for learning and mastering Python.

It is hard to say exactly where you should start, as everyone will have a different prefered method of learning, but whether you like video tutorials, books or courses, you should find a suitable resource on our resources page

lapis sequoia
#

hey guyss

#

i am trying to learn data science using python

#

i have learnt how to use numpy

#

and some plotting

#

thaT was free course on data camp

#

😉

#

now where should i head?

#

guide me

radiant notch
#

I'm planning on making a neural network but the structure of one is obviously critical

#

So how can I get the neural network to change its own structure? Is this even possible? I don't want the network to be limited by my bad structure... if it is bad.

trail flicker
#

@radiant notch if we could do that well, we'll have cracked the ai problem

radiant notch
#

Surely there's a neural network that adapts in structure?

#

You'd just need to experiment with different structures when backpropagating?

radiant orbit
#

What is difference between SGD and BGD in linear regression..... How are the training examples visited in both cases... Can someone please explain me in brief?

high ocean
#

Sdg = one update per single training pair. BDG = one update per many training pairs (batch)

#

Also my phone is correcting SGD

#

and BGD for some reason

radiant orbit
#

Thankyou for your response.. Lemme clear .. in bgd.. Suppose i have selected the whole training set as a batch so it means that the average of same set of examples in batch would do an update in every epoch? And in sgd.. You mean that in each epoch I will consider only one training example at once.. ?
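The difference in how the examples are visited can be sketched for a one-feature linear regression. With the whole training set as the batch, every epoch makes exactly one update from the gradient averaged over all examples. With SGD, every epoch still visits each example once (shuffled here), but the weights are updated after each one. The data, learning rate and epoch counts below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 0.5                      # noiseless toy data (assumption)

def batch_gd(X, y, lr=0.1, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = w * X[:, 0] + b - y            # errors on the whole training set
        w -= lr * np.mean(err * X[:, 0])     # one update per epoch
        b -= lr * np.mean(err)
    return w, b

def sgd(X, y, lr=0.1, epochs=50, seed=0):
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):   # each example once per epoch
            err = w * X[i, 0] + b - y[i]
            w -= lr * err * X[i, 0]               # one update per example
            b -= lr * err
    return w, b

w_batch, b_batch = batch_gd(X, y)
w_sgd, b_sgd = sgd(X, y)
print(w_batch, b_batch)
print(w_sgd, b_sgd)
```

Mini-batch gradient descent sits between the two: one update per small group of examples.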

junior ore
#

Kindly post the answers for Quizzes 2, 3 and 4 in Applied Social Network Analysis in Python from Coursera
The course in the specialization Applied Data Science in Python is extremely abstract and challenging, the tutor extremely vague. My subscription to the specialization ends in less than 8 hours and my $49 USD will go in drain if I don't secure this specialization. Kindly post the answers for the quiz questions.

Quiz 2: https://www.coursera.org/learn/python-social-network-analysis/exam/tZYRH/module-2-quiz

Quiz 3: https://www.coursera.org/learn/python-social-network-analysis/exam/0qgIf/module-3-quiz

Quiz 4: https://www.coursera.org/learn/python-social-network-analysis/exam/CgIV0/module-4-quiz

trail flicker
#

🤔 asking for course answers?

junior ore
#

without other options

#

because it is inaccessible at least for me in spite of trying to get my head around it for the past 8 hours

trail flicker
#

you do realize we cant see those without logging, right?

junior ore
#

my bad.. would you able to help me if i send you the pictures?

silk acorn
#

You aren't asking for help but for straight up answers

junior ore
#

it would be great if you would be able to help me find the answers as well. understand i have reached this stage without other options

hearty hazel
radiant orbit
#

Can someone please respond to my previous question?

junior ore
#

no issues .. lemme try it would be great if someone can just help me with the questions

hearty hazel
#

You can ask specific questions in one of the designated help channels.

#

!ask

#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

hearty hazel
#

read that page for more info

junior ore
#

the only problem i am grappling with is:

#

the tutor provided juvenile code to perform the functions and the quiz has got a network visual which is really difficult to replicate and also he did not teach the ways in which we can find stuff like node connectivity manually

#

it would help if anyone can provide a workaround so that I can climb this myself

simple crag
#

Carte blanche tutoring giving you code with 8 hours to go still is not the purpose of this server

#

If you have a specific question then you can proceed to an unused help channel as previously directed

small ore
#

I think it is against Coursera honor code to ask anyone on this planet for quiz answers. If you are not going for the certification you could even skip the quiz

polar acorn
#

So he could potentially ask people on the international space station or on Mars for help? Or is in orbit considered on the planet ( I never read the coursera honor code, I just clicked agree)

lapis sequoia
#

Hi from where i can learn sci kit for data science

#

i watched a video

#

but i am not getting it

#

help me

#

@chrome spade i know u

#

u r justin from pramp server 😄

solid chasm
#

@chrome spade no advertising

chrome spade
#

@solid chasm Wasn't trying to advertise. Is there a place on this server I could post about a paid project?

#

@lapis sequoia that is correct 😃 Hello!

solid chasm
#

nothing comes to mind, no, sorry

copper swan
#

@rich yarrow you might find help here about that problem

#

and you can get some resources aswell when you ask

rich yarrow
#

Hello, I am wondering where I can learn more on how to "Use NumPy and Matplotlib to draw a scatterplot of uniform random (x, y) values all drawn from the [0, 1] interval"

#

How do I use both NumPy and Matplotlib to make one scatter plot..

#
```py
import numpy as np
```
rich yarrow
#

What does the following mean?

#

Fixing random state for reproducibility

np.random.seed(19680801)

silk acorn
#

semi-random numbers are based off of a seed

#

If you set the seed, you'll get the same random numbers everytime
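A quick sketch of what setting the seed does: resetting it to the same value replays the same sequence of pseudo-random numbers. The seed value is the one from the question; drawing three numbers is an arbitrary choice.

```python
import numpy as np

np.random.seed(19680801)
first = np.random.rand(3)

np.random.seed(19680801)      # reset to the same seed
second = np.random.rand(3)

print(first)
print(second)
print(np.array_equal(first, second))   # True: both draws are identical
```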

rich yarrow
#

import matplotlib.pyplot as plt
import numpy as np

  1. How to generate a random number in the range [0, 1]
    import random
    x=random.randint(0,1)
    print(x)

  2. How to do that for two dimensions (x/y)

  3. How to show that on a plot
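For the three steps above, note that `random.randint(0, 1)` only ever returns the integers 0 or 1, not a float in between. A minimal sketch of one possible approach follows, assuming 100 points, a NumPy `Generator`, and an output file name (`uniform_scatter.png`) invented for the example; `uniform` draws from the half-open interval [0, 1).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # non-interactive backend, no window needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(19680801)    # seeded so the plot is reproducible
x = rng.uniform(0.0, 1.0, size=100)      # step 1: floats in [0, 1)
y = rng.uniform(0.0, 1.0, size=100)      # step 2: a second coordinate per point

fig, ax = plt.subplots()                 # step 3: draw the points
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("uniform_scatter.png")       # hypothetical file name
```

In an interactive session, `plt.show()` can replace the `savefig` call.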

#

Is anyone able to help me with this?

earnest prawn
#

That sounds extremely assignmentish

simple fjord
#

Does someone has signal processing experience ?

#

Would someone please help me with calculating the time delay between two signals ?

#

the data set is there in the post

#

I can't find the peaks and shift the signal using the max peak index
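A common alternative to matching peaks by hand is to cross-correlate the two signals and take the lag at which the correlation is largest. This is a sketch under stated assumptions, not a solution for the posted data set: the two Gaussian pulses are synthetic, both signals are assumed to share the same sampling rate, and `estimate_delay` is a helper name made up here.

```python
import numpy as np

def estimate_delay(delayed, reference):
    """Return how many samples `delayed` lags behind `reference`."""
    delayed = np.asarray(delayed, dtype=float) - np.mean(delayed)
    reference = np.asarray(reference, dtype=float) - np.mean(reference)
    corr = np.correlate(delayed, reference, mode="full")
    # In 'full' mode, index len(reference) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(reference) - 1)

t = np.arange(200)
reference = np.exp(-0.5 * ((t - 60) / 5.0) ** 2)   # pulse centred at sample 60
delayed = np.exp(-0.5 * ((t - 85) / 5.0) ** 2)     # same pulse, 25 samples later

lag = estimate_delay(delayed, reference)
print(lag)
```

Multiplying the lag by the sampling interval converts it to a time delay; a negative result means the first argument leads the second.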

wary willow
#

Is there a place I can go to learn about making python AIs/machine learning?

placid snow
#

📌 pins should have a few sources iirc

velvet anchor
#

Andrew Ng on coursera 👌

small ore
#

That is not python though

velvet anchor
#

the original message was about AI / ML 😛

tight ibex
#

should i just ask the same here? not well versed in discord etiquette, sorry

#

sorry again, got the meaning of the emoji. used to irc kek

small ore
split zodiac
#

posted in UI too, but I've been working on a new way to do data science stuff with ipython/jupyter using graph nodes to connect ipython notebooks with data sources / control flows

lean ledge
#

That looks super cool, whoa

polar acorn
#

Matrix looking

polar acorn
#

I'm doing a lot of similar Jupyter notebooks for testing some models. I would like to print the same metrics for the results every time, but don't want to copy paste those cells into every notebook. I will of course write an external function and import to all notebooks. But is there a way for this function to output things in several cells? I would like something like this:
cell 1 {
from external_helpers import print_results
print_results(y_hat, y)}
cell 2 {
Print accuracy in %
}
cell 3 {
Plot of y and y_hat
}
cell 4 {
Print a confusion matrix
}
etc. etc.

simple fjord
#

Hi

#

would someone help me with this please ?

proud raven
polar acorn
#

@proud raven Thanks, I'll check it out

viscid aspen
#

Hey, how could I make something like

sensors[any(sensors['Zone'].str.contains(sensor) for sensor in relevant_sensors)]

work in pandas ?
relevant_sensors is a list of strings. (The error I'm getting is about the ambiguity where normally I'd need a bitwise operation instead of a boolean one)

placid snow
#

Maybe something like sensors[sensors['Zone'] in relevant_sensors]? and index is for the first one

#

Without seeing the data, or trying it.

viscid aspen
#

I have to make sure to use str.contains cause I want to match e.g. SENSOR_1.203 while there's only SENSOR_1 in relevant_sensors

placid snow
#

What is your df storing

#

and what are you trying to get with that slice? The first one that is in relevant sensors?

viscid aspen
#

so let's say there's SENSOR_1, SENSOR_2, SENSOR_3 in my relevant_sensors but the values in sensors['Zone'] would be something like SENSOR_1.201, SENSOR_1.202, SENSOR_2.001, SENSOR_5.1234. The result should therefore be a DF sensors where sensors['Zone'] values are only the first 3 (everything except SENSOR_5.1234)

#

does that make sense?

placid snow
#

But are they strings?

#

or some object

viscid aspen
#

yep, sorry

placid snow
#

What do you get if you try my suggestion above then?

viscid aspen
#

I get The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

#

(error)

#

just to clarify... the relevant_sensors list is a primitive python list of strings and the column sensors['Zone'] has values of the pandas object type

#

@placid snow I can do what you're suggesting if I convert relevant_sensors to a np.array and then do sensors[sensors['Zone'].isin(relevant_sensors)] but that gives me only the values that are equal to some string in relevant_sensors, not those that merely contain that value

placid snow
#

Did a bit of testing and googling, but what about something like this (similar to yours) ```py

df = pd.DataFrame()
df["Zone"] = ["SENSOR_1.201", "SENSOR_1.202", "SENSOR_2.001", "SENSOR_5.1234"]
relevant = ["SENSOR_1", "SENSOR_2", "SENSOR_3"]
df[df["Zone"].str.contains("|".join(relevant))]
Zone
0 SENSOR_1.201
1 SENSOR_1.202
2 SENSOR_2.001```

#

but it doesn't use any, and it joins all the results in relevant to an or like regex expression

viscid aspen
#

Well, it gets the job done. Thanks a lot.

#

So .str.contains can be used with a regex?

#

(I assume?)

placid snow
#

That's what i gathered from it

#

That it requires a regex to search with

#

so we just create a regex which is basically this or this or this
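`.str.contains` does treat its pattern as a regular expression by default, so the joined string acts as "this or this or this". A slightly more defensive sketch of the same idea escapes each name first, in case one ever contains a character that is special in a regex (such as `.`). The frame and list reuse the example values from the conversation.

```python
import re
import pandas as pd

sensors = pd.DataFrame({"Zone": ["SENSOR_1.201", "SENSOR_1.202",
                                 "SENSOR_2.001", "SENSOR_5.1234"]})
relevant_sensors = ["SENSOR_1", "SENSOR_2", "SENSOR_3"]

# Escape each name, then join the names into one alternation pattern.
pattern = "|".join(re.escape(name) for name in relevant_sensors)
matches = sensors[sensors["Zone"].str.contains(pattern, regex=True)]

print(matches["Zone"].tolist())
```

Because this is a substring match, `SENSOR_1` would also match a zone such as `SENSOR_10.5`; anchoring the pattern would be needed to rule that out.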

viscid aspen
#

would still be useful to know if I can somehow put my own functions/logic inside the [] of a dataframe, like I wanted to do with any(). This time there was a workaround but sometimes there might not be.

placid snow
#

Not that I'm aware of

#

General logic + some of their methods

#

afaik
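On putting custom logic inside the `[]`: the indexer only needs a boolean Series with one value per row, so one option is to build that mask with `Series.apply` and an ordinary Python function. The original attempt failed because `any()` tried to turn each whole Series into a single `True`/`False`. A sketch using the example values from above, where `is_relevant` is a helper name made up here:

```python
import pandas as pd

sensors = pd.DataFrame({"Zone": ["SENSOR_1.201", "SENSOR_1.202",
                                 "SENSOR_2.001", "SENSOR_5.1234"]})
relevant_sensors = ["SENSOR_1", "SENSOR_2", "SENSOR_3"]

def is_relevant(zone):
    """Plain Python logic applied to one cell at a time."""
    return any(name in zone for name in relevant_sensors)

mask = sensors["Zone"].apply(is_relevant)   # boolean Series, one value per row
filtered = sensors[mask]

print(filtered)
```

This is usually slower than the vectorized string methods, but it accepts any per-row logic.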

viscid aspen
#

alright, thanks for the help 😃

placid snow
#

Anytime

dreamy tapir
#

What is the most easy to use neural network lib? I want something like

```py
import nnetwork as nn
MyNet = nn.layers(input=3, hidden=1, output=2)
MyNet.train([[[0,1,0],[0,1]], [[1,0,0],[1,0]]])
print(MyNet.calculate([0,0,1]))
```
hasty maple
#

keras maybe

dreamy tapir
#

Doesn't seem that easy.

finite solar
#

deepy?

dreamy tapir
#

Actually, pyBrain looks now the easiest I found.

#
```py
>>> net = buildNetwork(2, 3, 1, bias=True, hiddenclass=TanhLayer)
>>> trainer = BackpropTrainer(net, ds)
```
#

It is actually easier than I wanted! Cool!

polar acorn
#

scikit-learn also has a limited selection of networks that is very easy to get started with

velvet anchor
#

@dreamy tapir Keras.

#

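It's really a lot easier than it seems

#

You just throw your layers in, like

from keras.layers import Input, Dense
from keras.models import Model
inputs = Input(shape=(3,))                      # 3 inputs, as in the question
hidden = Dense(4, activation='relu')(inputs)    # hidden size is an arbitrary choice
# ... more layers ...
outputs = Dense(2)(hidden)                      # 2 outputs
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')
#

I also have a wrapper I've been working on to simplify the generation of networks slightly for genetic optimization but it's kind of on hold at the moment

#

Then to train you just write up a generator or use one of the built in ones and pass it into model.fit()

#

iirc anyways, it's been a bit since I've worked with it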

dreamy tapir
#

Really?

#

I'll take a look on the documentation

#

I don't understand... I'm not so advanced...

#

And it looks weird

velvet anchor
dreamy tapir
#

I still don't understand

dreamy tapir
#

But I understand pyBrain

#

And I know synaptic.js perfectly and that's extremely easy and powerful.

dreamy tapir
#

And it's a pleasure the work with that library

#

Omg!

#

It's perfect

dreamy tapir
#

But doesn't work on python3

flat thistle
#

i’m just a beginner in ML and datascience...i wanted to start working on some small projects so that i can inprove my skills..could u guys please suggest me some projects to start or help me how to find a beginner project

keen pivot
#

what's a good graph edge container?

#

i was thinking a dictionary like

#

{0:1, 0:2, 1:2, 2:3}

#

but you can only have unique keys...

#

so can you do {0: [1,2]}

#

or is it better to use {0:(1,2)}
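A dict whose values are collections is the usual adjacency-list shape for this. A minimal sketch using a `defaultdict` of sets, with the edges from the example above: a set avoids duplicate edges and can grow, whereas a tuple is immutable and would have to be rebuilt on every new edge.

```python
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

graph = defaultdict(set)         # node -> set of neighbouring nodes
for src, dst in edges:
    graph[src].add(dst)          # for an undirected graph, also add graph[dst].add(src)

print(dict(graph))
```

A list instead of a set works just as well if edge order or repeated edges matter.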

placid snow
#

What about pandas / numpys ?

keen pivot
#

what do those have?

placid snow
#

Depends what you need, but plotting the example data you gave would look something like

#

but thats 2 different graphs for each column, you could most likely swap the numbers around to get the desired effect from that

#

@keen pivot

velvet anchor
#

Could use a graph DB as well

keen pivot
#

that's not the kind of graph i mean @placid snow

polar acorn
#

A dict would probably do for small things. Any bigger and I'm sure you'll find plenty of packages for that.

keen pivot
#

okay.

#

I'm trying to figure out whether to use a tuple, dict, or list inside the dict as well.

#

I'm creating something that would be described as a multilayered graph

#

with a main node that has edges to other main nodes

#

and within each node, there's a list(or dict?) of embedded nodes that may have edges to one or more other embedded nodes either in the same main node or others.

velvet anchor
#

Tried Neo?

keen pivot
#

I've not

velvet anchor
#

It's a neat graph database

keen pivot
#

this makes a lot of sense.

#

I worry this may be a bit too heavy for what I'm looking to do.

#

I'm trying to make an application that runs concurrently while another larger application is running.

#

and based on what that application writes to a file, manipulate the nodes in my graph.

#

I've been implementing my graph myself purely in python.... I don't imagine there being more than 100 main nodes.

velvet anchor
#

It may be a little overkill but it's neat to learn if you're going to be doing more stuff with graphs in the future

keen pivot
#

Okay.

lean ledge
#

@flat thistle The MNIST database is a classic project everyone does. If you want a more classic ML type, you can look at some basic regression or classification datasets on Kaggle

#

There's a few nice clustering datasets too to flex your DBSCAN and KNN muscles

pliant mantle
#

I have a hard one for you, I want to learn about data synchronization

#

filesystems, binary data, python dictionaries

#

what are the tags / subject terms I should be looking for?

lean ledge
#

Those are some very different subjects. Python dictionaries are hashmaps with dynamically allocated arrays. Filesystems is big enough to be a course on its own but easily searchable. Binary data is a vague term that can mean anything. For any specific file format, it can be differently laid out so you'll have to search those up separately. Executable binaries have their own executable formats too such as ELF for Linux

#

@pliant mantle

trail flicker
#

elf is best binary executable, change my mind

pliant mantle
#

@lean ledge
I'm talking about syncing binary data. Example, recognizing a bit has changed on another system and transmitting, byte 1235623451 has changed to 0b00001111

lean ledge
#

Oof, I don't know about that, can't comment

pliant mantle
#

binary, file, directory, node

lean ledge
#

(@pliant mantle you might also wanna post this somewhere else since that isnt exactly data science)

spark nimbus
#

@pliant mantle read /dev/sdX in binary mode and set up IPC between the two

#

not sure about writing though

trail flicker
#

@spark nimbus ayy

spark nimbus
#

ayy

velvet anchor
#

@pliant mantle maybe researching how ECC ram works? It's not exact but the error correcting there might translate over somehow

thorn river
#

I'm interested in learning more about fairness/bias in machine learning (and trying to account for those things in a ml project on a dataset with text possibly), would anyone happen to have some directions on some good articles about this?

lean ledge
#

ooh on a related note, at a tech conference recently I met the person in charge of the AI ethics report for Australia. Australia's been a bit behind in terms of policy changes to accommodate AI and the ethical challenges it poses, so she was in charge of writing a report built of case studies and possible pitfalls for policy makers with regards to what sort of ML models should and shouldn't be used, how they should be managed etc to take care of fairness and bias, and data privacy of people whose data is used in the models etc. She did her PhD in neuroscience and did work in ethics at the neuroscience research institute, hence why she was in that role

#

just relevant and thought it'd be cool to share

thorn river
#

Oh wow that's interesting!

lapis sequoia
#

and I'm getting the following error: AttributeError: 'NoneType' object has no attribute 'seq'

#

I have no idea what that means lol So on line #12 I create a list from values within a column and on line #23 I use it to set the xticks

polar acorn
#

Which line gives you the error?

placid snow
#

!t traceback

arctic wedgeBOT
#
traceback

Please provide a full traceback to your exception in order for us to identify your issue.

A full traceback could look like:
Traceback (most recent call last):
  File "tiny", line 3, in <module>
    do_something()
  File "tiny", line 2, in do_something
    a = 6 / 0
ZeroDivisionError: integer division or modulo by zero
The best way to read your traceback is bottom to top.
• Identify the exception raised (e.g. ZeroDivisionError)
• Make note of the line number, and navigate there in your program.
• Try to understand why the error occurred.

To read more about exceptions and errors, please refer to the official Python tutorial.

lapis sequoia
#

I was actually able to figure that part out LOL but I do have a question resulting from this bit of code!

When I plot the graph itself, it prints out pretty small and scrunched up. Then I MANUALLY click+drag the bottom corner of the image then Save, and it saves more stretched out; making it easier to read the long x-axis. Why is that, and how do I get it to plot out with a width/length wide enough to accommodate a long axis? I've uploaded the two images.

#

So basically I'd like to use the 'savefig()' option to save the figure already 'stretched-out', without me having to manually print out the plot then click+drag?
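The saved image takes its size from the figure, not from the on-screen window, so one way to get the stretched version directly is to set `figsize` (in inches) when creating the figure. This is a sketch with made-up data and a made-up file name, since the original plotting code is not shown; the 16 by 4 inch size and `dpi=150` are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")                         # non-interactive backend
import matplotlib.pyplot as plt

labels = [f"item {i}" for i in range(40)]     # hypothetical long x-axis
values = list(range(40))

fig, ax = plt.subplots(figsize=(16, 4))       # width, height in inches
ax.plot(range(len(labels)), values)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90)       # rotate long labels so they do not overlap
fig.tight_layout()                            # keep the rotated labels inside the figure
fig.savefig("wide_plot.png", dpi=150)         # hypothetical file name
```

For a figure that already exists, `fig.set_size_inches(16, 4)` before `savefig` has the same effect.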

placid snow
#

What did you not understand with using pandas? It seems to be the way to go with this, both for the reason it's asked for and your data goes well with it?

lapis sequoia
#

"list indices must be integers or slices, not str" posts this error message

#

so I did the first 3 without pandas but can't figure out how to get the year latitude and name of hurricane all in one

placid snow
#

Are you trying to create a DataFrame per hurricane?

#

Seems to be a csv file you're loading. Pandas can create a single df from a csv file with pandas.read_csv(filepath)

#

And you can do most of this logic with the df itself

small ore
#

I don't see a problem description there. I only see some unfilled problems

#

Ah. It is in the picture. Difficult to even see the problem.

lapis sequoia
#

Okay so I retried the pandas approach and did the read_csv and got a data table. Do I add headers now so the data looks more organized?

placid snow
#

Your csv file dont have them already?

#

You can write them in the csv, or in code up to you. If they dont already exist

small ore
#

There is something like a header= argument for that method

lapis sequoia
#

No it looks really messy but im reading a tutorial now on how to clean it it looks a lot cleaner when access df.head()

#

when i print df though it comes out like a badly formatted text file

placid snow
#

Show me

lapis sequoia
#

okay give me a sec

small ore
#

    List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued.
lapis sequoia
placid snow
#

Specify the names param as a list of your headers

#

And see if that fixes it
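A sketch of passing column names to `pandas.read_csv` for a file that has no header row. The two rows and the column names are invented for illustration; the real hurricane file has a different, more complex layout.

```python
import io
import pandas as pd

# Stand-in for a header-less CSV file (made-up rows).
raw = io.StringIO(
    "AL011851,UNNAMED,28.0N,94.8W\n"
    "AL021851,UNNAMED,22.2N,97.6W\n"
)

df = pd.read_csv(raw, header=None, names=["id", "name", "lat", "lon"])
print(df)
```

With a real file, the `io.StringIO` object is replaced by the file path.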

small ore
#

Although, going by just what I see ( I don't see the problem statement for problems 1 to 5, I might be missing something), they already have the data ready for you

lapis sequoia
#

Problem 1 was to find the amount of unique hurricane names problem 2 was most common hurricane name problem 3 was year with most hurricanes and then 4 is most northern hurricane and 5 is hurricane with maximum sustained wind

placid snow
#

Are there only 2 columns to your data?

#

Seemed to have lat and long as well

silk acorn
#

For finding unique names, there exists a data type that can only have unique values in them.

small ore
#

I mean is it required to use pandas? In the codeblocks above they have opened the file in a different way and allowed you to use it. Or is it some different data for those?

silk acorn
#

The length of a set of the names will be the amount of unique names
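The idea in a few lines, with a made-up list of names:

```python
names = ["ALICE", "BOB", "ALICE", "CAROL", "BOB", "ALICE"]   # made-up example data

unique_names = set(names)      # a set keeps only one copy of each value
print(len(unique_names))       # number of distinct names
```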

lapis sequoia
#

Its not required, I did the first 3 problems without pandas very easily I just got lost on how to find the latitude and longtitude values from it

#

So I tried adding more columns but it ends up piling multiple data into one column and none of the columns retrieve the latitude, after column 6 it all starts saying NaN

wary willow
#

Where should I go to learn very basic machine learning, if I'm a quite beginner python programmer who has no idea how machine learning work?

lean ledge
#

@wary willow Check pinned!

small ore
#

@lapis sequoia If you have still not been able to solve your problem, try changing the file open code to:

records = []
with open(local_fname,'r') as f:
    for line in f:
        if line.startswith("AL"):
            record = line.strip()
            reports = []
            records.append((record, reports))
        else:
            reports.append([line.strip()])
small ore
#

scratch that. No need for that change.
You can just do:

print(max([float(rec.split(',')[4].strip()[:-1]) for rec in record[1]]))
hardy drift
#

how could I add a new column to a pandas dataframe, based on other data in each row?

#

say I have a function that takes each row's 'text' column and transforms the data and adds a new column value

#

should i just create a new column , then write a for loop that iterates over each row and fills in the column?

#

not sure if that is the most efficient way
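An explicit loop is usually not needed: `Series.apply` runs a function on every value of a column and returns a new Series that can be assigned as the new column. A sketch with a made-up frame and a made-up transform (`count_words`):

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hello", "pandas apply demo", "data science"]})

def count_words(text):
    """Hypothetical transform: number of whitespace-separated words."""
    return len(text.split())

df["word_count"] = df["text"].apply(count_words)
print(df)
```

If the function needs several columns of the row, `df.apply(func, axis=1)` passes each row instead of a single value.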

small ore
tacit meteor
hardy drift
small ore
#

Link from nowhere without context is suspicious brainmon

lapis sequoia
#

Alright so I did print(max([float(rec.split(',')[4].strip()[:-1]) for rec in record[1]]))

#

and it returned the highest latitude N of that hurricane. now to generate that for all of them, do I need to make a set out of the latitudes?

small ore
#

Well, I believe if you have done the previous problems you can figure out although this involves somewhat more head-scratching. This server does not encourage giving out answers to your assignments afaik. Think and let people here know what you have or where you are stuck. I will see if I can give you more hints tomorrow

lapis sequoia
#

Hello data bois

#

anyone good with matplotlib?

earnest prawn
#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

earnest prawn
#

@lapis sequoia

lapis sequoia
#

lol okay. So some basic setup: I'm trying to plot two sets of data; one pulled from iexfinance and one from quandl

#
```py
China = quandl.get_table('DY/IPA', ticker='000010')
China.sort_values('date', inplace=True)
```
#

so that works. Okay, great. Now I tried graphing them separately.

#
```py
fig, ax = plt.subplots()
ax.plot(China_date, China.close)
ax.set_title('SSE 180 Index daily price')
```
#

and that works. Not gonna lie I'm pretty new to matplotlib / python so that might be a mess

#

and then this works

#

SPY['close'].plot()
plt.show()

#
```py
plt.show()
```
#

So i was looking through the matplotlib tutorials and it seemed to indicate that you could just do two plt.plot(data) and it would set them on the same chart

#

like this:

#
```py
plt.plot([2,3,4,5], [10,9,6,4], label='line2')

plt.xlabel('price')
plt.ylabel('date')
```
#

Which works fine, alright cool

#

SO why doesn't this work?

#

```py
# these will plot separately but not together. I have no idea why.
plt.plot(China_date, China.close)
SPY['close'].plot()

plt.xlabel('Date')
plt.ylabel('Price')

plt.title('China and SPY Comparison')

plt.show()
```
#

I'm fairly sure it's something to do with the "China" data frame having a 'date' column that's converted to an object, while the SPY df has an index column just called date
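If the mismatch between an object-typed `date` column and a datetime index is indeed the cause (that is only the asker's guess), one thing to try is converting both x-axes to real datetimes and drawing both series on the same `Axes` with the same plotting call. The frames below are small made-up stand-ins for the Quandl and iexfinance data, and the file name is invented.

```python
import matplotlib
matplotlib.use("Agg")                         # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2018-01-01", periods=5)

# Stand-in for the China frame: dates stored as strings in a column.
china = pd.DataFrame({"date": dates.astype(str),
                      "close": [10, 11, 12, 11, 13]})
# Stand-in for the SPY frame: dates stored in the index.
spy = pd.DataFrame({"close": [270, 272, 271, 275, 274]}, index=dates)

china["date"] = pd.to_datetime(china["date"])   # strings -> datetime64

fig, ax = plt.subplots()
ax.plot(china["date"], china["close"], label="China")
ax.plot(spy.index, spy["close"], label="SPY")   # same Axes, same call style
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("China and SPY Comparison")
ax.legend()
fig.savefig("comparison.png")                   # hypothetical file name
```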

lapis sequoia
#

@earnest prawn any thoughts?

simple crag
#

What do you mean by “doesn’t work”

silent current
#

Is there a way to increase the size of jupyter inline figures without putting the figure in a scroll box?

#

I'm using plt.rcParams['figure.figsize'] = [15, 5]

#

but no mater what dimensions I make the figsize, I end up getting a scroll box and it's ugly

lapis sequoia
#

@silent current I think it's due to the browser resizing the img

lapis sequoia
#

hey guys, sorry for the extremely long question yesterday. I kind of tabled that issue as it wouldn't really make sense to graph anyway.

#

However, new question, I'm using Quandl to get the following:

#

SPY = get_historical_data('SPY', start=start, end=end, output_format='pandas')

#

and then plotting the close price

#
```py
plt.show()
```
#

but it's not plotting the date, or at least, it's not displaying the date. I'm kind of out of ideas on why / how to get it to actually display the figures

viscid aspen
#

Any idea why conda install jupyterlab won't install the latest version for me ? I'm at 0.32 right now but there's already a 0.34, I can even see it in conda search jupyterlab

serene veldt
#

hello, im having trouble with memory usage on numpy

#

im working with a lot of matrixes, usually 500+, and when they become about 11x190 +- i get a memoryError

#

is there any efficient way to serialize/compress the matrices to a list or similar, so that i can just grab them by index and unserialize/decompress on demand?

#

much appreciated

simple crag
#

What operations are you performing with the matrices? And how much RAM do you have?

serene veldt
#

I have 32 GB of RAM, I'm doing dot products mostly

#

But I managed to fix my memory usage during the operations; it now crashes just by having them stored

simple crag
#

Are you sure you're not growing the matrices? 500 11x190 matrices shouldn't cause OOM issues

serene veldt
#

All elements are tuples, not sure if it influences

#

I'm almost sure but I can double check

simple crag
#

What do you mean all elements are tuples

serene veldt
#

*floats

#

Sorry, idk how it autocorrected

small ore
#

Are you doing dot-products on them all together?

#

Or can you open them from a dump and do it one by one and close as you go on?

serene veldt
#

I was doing them all together

#

Dumping to disk and reading when needed will greatly affect performance

#

That's why I wanted to know what options I have :/

#

Right now I'm checking pytorch, maybe i can use the tensors

simple crag
#

Are you using 64 bit or 32 bit python?

#

Your arrays combined take about 8.4 gigs of memory

serene veldt
#

should be 64

#

ill also double check that

placid snow
#

I forgot what the name is, but there's a module that (with a bit of overhead) splits up big workloads into smaller chunks that can even be run on different hardware

#

Theres a pycon talk about it alongside pandas and numpy iirc, maybe you can find it

serene veldt
#

will also look into that

#

much appreciated

lapis sequoia
#

wtf is data science, analysis.. etc

placid snow
#

Read the channel description

lapis sequoia
#

@placid snow didn't answer my question^^

placid snow
#

It's the work of analysing, categorizing, visualizing and, in general, reading and understanding large sets of data.

#

Say you wanted to create a graph showing the wealth distribution of everyone in the US, that's data science for instance

trail flicker
#

Big Data™ basically

placid snow
#

Or machine learning / AI

#

Pretty much

polar acorn
#

It's the channel for the buzziest of buzz words

tall drum
#

Hi, I was able to create the following pandas dataframe by using groupby. Now I would need to graph the results with bar plot (one bar for one column, divided to 2 parts)

simple crag
tall drum
#

thanks got it

tight dove
#

Hi guys

#

So I'm getting this error while trying to get financial data from Yahoo's API

#
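import datetime as dt

import pandas as pd
from matplotlib import style

# workaround: older pandas_datareader imports is_list_like from its old location
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web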

style.use('ggplot')

start = dt.datetime(2000, 1, 1)
end = dt.datetime(2016, 12, 31)

df = web.DataReader('TSLA', 'yahoo', start, end)

#

Which gives me this error -

#
ImmediateDeprecationError: 
Yahoo Daily has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.
#

Please how can I resolve this?

#

Is there a work around?

#

I'm using python 3.6

trail flicker
#

dont use yahoo

tight dove
#

Okay, which can I use

trail flicker
#

dunno

tight dove
#

*sigh

tight dove
#

I found a fix for this

#
import fix_yahoo_finance as yf

yf.pdr_override()
supple mango
#

I have a pandas dataframe as follows. I want to add a new cost column that is equal to the Shares column * the corresponding price for the current symbol. So for example in the first row (symbol=AAPL) the cost column would have a value of 1500 * 340.99000 = 511485. How would you do that?

placid snow
#
df["newcol"]= df.apply(lambda row: row["Shares"]*row["AAPL"], axis=1)``` perhaps?
#

Can't test it myself on phone, but do try just the apply part first, and see if the result is correct

supple mango
#

the row['AAPL'] is hardcoded there is it not? I would like to use whatever is in the Symbol column and pull the appropriate value from the corresponding price column for that symbol

#

so for example the third row with IBM, should multiply 4000 (Shares) * 144.55000 (the price under the IBM column)

placid snow
#

You could most likely just pass the symbol to the lambda, or write a function instead and have it read which symbol to use beforehand

supple mango
#
orders["newcol"] = orders.apply(lambda row: row["Shares"] * row[row['Symbol']], axis=1)
#

Surprisingly, that seems to be working. Thank you @placid snow !
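
For larger frames, the same per-row lookup can probably be done without `apply`. The sketch below uses a made-up `orders` frame with one price column per symbol (only the AAPL and IBM figures come from the messages above; the rest is assumed), and the output column name `Cost` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one price column per symbol, as described above.
orders = pd.DataFrame({
    "Symbol": ["AAPL", "AAPL", "IBM"],
    "Shares": [1500, 1500, 4000],
    "AAPL": [340.99, 340.99, 340.99],
    "IBM": [144.55, 144.55, 144.55],
})

price_cols = ["AAPL", "IBM"]
prices = orders[price_cols].to_numpy()

# For every row, find the position of the column named in "Symbol".
col_idx = pd.Index(price_cols).get_indexer(orders["Symbol"])
row_idx = np.arange(len(orders))

# Pick one price per row and multiply by the share count.
orders["Cost"] = orders["Shares"].to_numpy() * prices[row_idx, col_idx]
```

`get_indexer` returns -1 for a symbol that has no price column, so it is worth checking for that before trusting the result.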

stone oasis
#

yahoo stopped with the ticker data i thought

coral lichen
#

hopefully a question that can be answered!

#

I have three 1d arrays of data. Two of the arrays are coordinate arrays (x and y). The third array is intensity (z) measurements at each coordinate position. All three arrays are of the same length. This means that the position x[i], y[i] has an intensity of z[i]. Currently, I cannot create a map like a pcolormesh plot to create a sort of heat map due to the data structures. Does anyone know how I can grid this data so that it will have a map of coordinates with intensity values for each point on the map?

polar acorn
#

Are the x and y values placed such that they form a grid?

coral lichen
#

no, thats the thing. they are simple 1d arrays that have x and y values that represent coordinate values along each axis. not until you take the first element of each x and y array do we get a position. hoping that makes sense lol

polar acorn
#

Sure, they are not formatted as a grid. But if you were to plot them would they make a grid? i.e. do you have n unique x values each repeated m times and m unique y values each repeated n times?

coral lichen
#

no, i did try to use np meshgrid, and that created the grid like coordinates. now my issue i guess is assigning the positions with their intensity values

#

I assign intensities to the diagonal of that grid right?

polar acorn
#

Are all your coordinates positive?

#

As in all x and y values

coral lichen
#

theyre both negative actually

polar acorn
#

oh

#
# imports
import numpy as np
import matplotlib.pyplot as plt

# fake data, replace with your own x, y, and z list
x = [-1,-2,-3]
y = [-3,-1,-3]
z = [10,3,4]

# turn negative coordinates into positive indexes
x_pos = np.array(x) + abs(min(x))
y_pos = np.array(y) + abs(min(y))

# create empty grid for z values
zz = np.zeros(shape = (max(x_pos)+1, max(y_pos)+1))

# fill empty grid at indexes corresponding to coordinates
for i in range(len(x)):
    zz[x_pos[i],y_pos[i]] = z[i]

# these are the edges of the colored rectangles (each cell is centered on an integer coordinate)
xx = np.arange(min(x)-1, max(x)+1) +0.5
yy = np.arange(min(y)-1, max(y)+1) +0.5

# plot
plt.pcolormesh(xx, yy, np.transpose(zz))
plt.show()
polar acorn
#

@coral lichen, this might work. It's not very nice looking but it plots the intensity at the right coordinates.

polar acorn
#

Also in case your x and y values are not integers you might want to do a simple coloured scatterplot instead. Like this

x = -5*np.random.rand(1000)
y = -7*np.random.rand(1000)
z = x*y

plt.scatter(x,y, marker='o', c=z, linewidths=5)
plt.show() 
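
If a real grid is still wanted for non-integer coordinates, one option is to bin the points and average the intensity per cell. This is only a sketch: the function name `bin_to_grid` and the bin count are assumptions, and cells with no points come out as NaN.

```python
import numpy as np


def bin_to_grid(x, y, z, bins=20):
    """Average z over a regular bins-by-bins grid covering the x/y range."""
    # Sum of intensities falling into each cell.
    sums, xedges, yedges = np.histogram2d(x, y, bins=bins, weights=z)
    # Number of points falling into each cell (same edges as above).
    counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
    with np.errstate(invalid="ignore", divide="ignore"):
        grid = sums / counts  # NaN where a cell is empty
    return grid, xedges, yedges


# Same fake data as in the earlier example.
x = [-1, -2, -3]
y = [-3, -1, -3]
z = [10, 3, 4]
grid, xedges, yedges = bin_to_grid(x, y, z, bins=3)
```

The result can then be drawn with `plt.pcolormesh(xedges, yedges, grid.T)`, since the first axis of `grid` runs along x.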
coral lichen
#

@polar acorn thanks! I'm going to work on these when i get home and will let you know!

wild hinge
#

hi everyone, I have the following df and I want to create a binary matrix out of it using pandas or any other module to achieve this.

#

I want to find a connection between the series and it seems that the best way to do this is by observing the binary matrix

placid snow
#

What about creating dummy data?

polar acorn
#

If you want binary columns also for values not observed in your columns yet, you can enjoy this monstrous one liner

import pandas as pd

df = pd.DataFrame({'val1':[1,3,4,6], 'val2':[2,2,1,3], 'val3':[5,6,11,2]}, index = ['s1', 's2', 's3', 's4'])

binary_df = pd.concat([pd.concat([pd.get_dummies(df[col]) for col in df], axis=1).groupby(lambda x:x, axis=1).sum(), pd.DataFrame(columns=list(range(1, df.max().max()+1)))]).fillna(0) 

Enjoy pulling it apart 😉 I had too much fun writing that to not share it
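
For anyone who would rather not pull it apart, here is a more readable sketch that should give the same kind of table for this example. It marks presence (0/1) per row instead of counting repeats, which is an assumption about what "binary matrix" means here.

```python
import pandas as pd

df = pd.DataFrame(
    {'val1': [1, 3, 4, 6], 'val2': [2, 2, 1, 3], 'val3': [5, 6, 11, 2]},
    index=['s1', 's2', 's3', 's4'],
)

# Every integer from 1 up to the largest observed value gets a column.
all_values = range(1, int(df.to_numpy().max()) + 1)

# A cell is 1 if that value appears anywhere in the row, else 0.
binary_df = pd.DataFrame(
    {v: df.eq(v).any(axis=1).astype(int) for v in all_values}
)
```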

wild hinge
#

@chibli cheers for the answer, the direction is right
@pptt cheers for the answer, I will give this a run asap but first I want to wrap my head around the formula :)))

limpid pivot
#

Anyone here know stuff about the ASDF (Advanced Scientific Data Format) library?

small ore
#

Guys, if anyone knows, any input regarding my question in #tools-and-devops is appreciated

radiant orbit
#

In lowess.. After minimizing the error how should I define y to be predicted.. if I am writing the algorithm from scratch

lapis sequoia
#

Is there any python library which is able to detect a float as being a fraction including an irrational number?

simple crag
lapis sequoia
#

yeah, but I'm talking about irrationals

#

That library isn't able to detect something as pi, e, or a square root

simple crag
#

SymPy, maybe

#

I'm not really sure what the purpose is of what you're trying to do

lapis sequoia
#

Well, I do a lot of math stuff using python. And when I get a result I would like to know if it's actually a random number or a fraction which includes an irrational

#

found it in sympy, thanks @simple crag
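
For reference, the SymPy function that seems to fit this is `nsimplify`. A float cannot really be "detected" as irrational; `nsimplify` only searches for a simple closed form that matches the number within a tolerance, so treat its answers as guesses. The sample values below are assumptions picked for illustration.

```python
import math

from sympy import Rational, nsimplify, pi

# Constants the search is allowed to use besides rationals and roots.
candidates = [pi]

print(nsimplify(0.75))                       # a plain fraction
print(nsimplify(math.sqrt(2)))               # a square root
print(nsimplify(math.pi / 2, candidates))    # a multiple of pi
```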

lapis sequoia
#

Hi, can anyone suggest a data science course path that would make me a data scientist by the end? The courses should use clean Python and cover all the data science concepts. As far as I can see, there is no good course that takes my data science and Python skills to the next level simultaneously. Can anyone help? Thanks 😃

pseudo sentinel
#

Have you checked out Udacity?

lapis sequoia
#

Yes but, as I remember, there was a payment forcing for nanodegree 😦 . If something, some product is good, or better than others, there is nothing to prove yourself.

#

If course would face 2 face program, maybe I can give this money for this program. But it is online

small ore
#

Check pins

#

But yeah, no python. But there are a lot of free courses, tutorials on the net which will not give you any certificate

simple prism
#

This channel is quiet

lapis sequoia
#

ded server

vestal axle
#

Hello 😃

#

Could someone please help me with this task, I really don't know how to approach it

placid snow
#

I don't know much about the topic, but have you tried breaking it into smaller pieces?

#

a function for the equations for instance, and plan out what happens

lapis sequoia
#

@vestal axle from what I can tell, three parameters are constant

#

a first step *could be rewriting the equations with those values filled in

#

reduce the greek letter salad a bit

small ore
#

It looks straight forward to me. ( Havent tried to work it though). You have sigma0 and y0 given and calculate (sigma1, y1) and so on every (sigmat, yt)using the given formula till you get 100 values (t = 100 or 99 not sure, prolly latter) . You have your simulated data. Now plot the data over time. I think how to do the plot in that specific way is the task at your hand

#

@vestal axle

pliant mantle
#

this a good place to talk about data structures?

small ore
#

Not likely

#

#python-discussion if you want a generic discussion on that topic. Any of the help channels if it is a specific question. #databases if it is regarding databases

vapid lion
#

I'm currently studying ML and I ran into a question. How would you calculate a confusion matrix (in Python) for a multi-label dataset?

pliant mantle
#

What is it called when you have a ridiculously extreme deviation in data?

#

singularity?

trail flicker
#

outlier?

brittle pewter
#

Sounds like outlier

#

@SYMPHONIC DISHARMONY#3195 Row is true label, columns are predicted label

#

compute the counts

#

Use .pivot_table in pandas
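
A small sketch of that layout, assuming each sample has exactly one true and one predicted label (multi-class rather than truly multi-label). It uses `pd.crosstab` instead of `.pivot_table` because it does the counting in one call; the labels are made up.

```python
import pandas as pd

# Hypothetical labels for six samples.
y_true = pd.Series(["cat", "dog", "cat", "bird", "dog", "cat"], name="true")
y_pred = pd.Series(["cat", "cat", "cat", "bird", "dog", "dog"], name="predicted")

# Rows are the true labels, columns the predicted labels, cells are counts.
cm = pd.crosstab(y_true, y_pred)
print(cm)
```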

vestal axle
#

py_noob thank you for your answer, we figured it out! Thanks though 😃

vestal axle
#

Btw, is there anyone here who would be willing to help me out with a task that has to be delivered next week 😃

polar acorn
#

@vestal axle If you have a specific question just ask and someone will answer if they have time.

@SYMPHONIC DISHARMONY#3195 scikit-learn has a nice implementation if you just want to see your results.

radiant orbit
#

Guys which is the best book I should opt for probability and statistics.. ???

turbid bay
#

can i ask a question on data-science that isnt written in python but instead octave?

turbid bay
#

no one?

#

its not a hard question

simple crag
#

This is a Python server, so I'm not sure how well anyone is going to be able to help you with Octave

turbid bay
#

im more worried with the logical aspect of the programming tho

#

not any features that octave may have to offer

#

ill ask anyways

#

anybody with octave knowledge want to talk me through how to calculate the cost of the hypothesis using certain theta/weight variables. I am struggling to get my head around it.

i am using a feature/X matrix (5X4), Weight/theta vector (4X1), correct_label/y vector (5X1)

prediction = sigmoid(X*theta)
cost_1 = (-y) .* log(prediction)
cost_2 = (1-y).*log(1-prediction)

total_cost = cost_1-cost_2
J = sum(total_cost)/m```


J should be the cost but it doesn't calculate it correctly
#

can you @me if you respond thanks 😃

small ore
#

@turbid bay What do you mean by not calculate correctly? Values different from what you should have got or are you getting some errors? Secondly you do not seem to have given the entire code. That m in the last line is not available elsewhere in the code. Thirdly did you check if you get the right answers if you created a vector of 1s instead of just using (1-y).

#

The problem may also be in the sigmoid function if you wrote it instead of a built-in

turbid bay
#

m is the length of the y vector (sorry for not including that) and no the sigmoid function is built in

#

and yes it doesnt give me the expected output

small ore
#

Try (ones(m,1)-y) instead of (1-y) and same change for (1-prediction)

#

You could also do y'*log(prediction) instead of y .* log(prediction) coz that is supposed to be more efficient

turbid bay
#

what does the ones() do exactly. and i tried the Transpose matrix multiplication but it wouldn't give me the right value either so i kept on swapping between the 2

small ore
#

ones(p, q) gives you a matrix of the size pXq with all the values as 1

turbid bay
#

ah yes that will probably work better than using a scalar value then subtracting a matrix

#

when i am next at my computer i will change it and see if that works. thanks

dreamy tartan
#

Hi i want to ask something.
I will try to predict a lifetime period. The dataset has this information in months in a column, and I'll set it as the target column. The lifetime period ranges from 0 to 74. Which method should I use to predict it? I was thinking of using linear regression, but terms like multi-output regression, multi-label classification, etc. confused me.

small ore
#

That is an incomplete question with insufficient information. Either make a clear presentation of your problem keeping in mind the reader or ....

#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

timid zealot
#

Hello! I'm new here, although I'll do my best to make a clear case. I'm new to data science, and basically my problem is I'm trying to make a recommendation system, using the kNN algorithm (k-nearest neighbors) and Euclidean distance, mixed ofc with the pandas and NumPy modules. I'm looking for some guide or cheatsheet or anything at all to get started, since the ones I've found all talk about the same example (the Iris flower test dataset that comes with scikit-learn) and have little (at least to me) adaptability to my case.

#

What I want is this system to recommend users new artists by comparing their answers to other users who had answered in a similar manner. User would submit three artists of their liking and then the program would have to work with that

#

If I'm allowed to, I can attach some files, like the db and the code I have so far
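
To make the idea concrete, here is a minimal sketch of such a recommender using only NumPy. Everything in it is assumed for illustration: the artist names, the tiny 0/1 "likes" matrix, the function name `recommend`, and the choice of k.

```python
import numpy as np

# Hypothetical data: rows are existing users, columns are artists (1 = likes).
artists = ["A", "B", "C", "D", "E"]
likes = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
])


def recommend(new_user, likes, artists, k=2):
    """Suggest artists liked by the k users closest to new_user."""
    # Euclidean distance from the new user to every existing user.
    dists = np.linalg.norm(likes - new_user, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Count how many of the neighbours like each artist.
    scores = likes[nearest].sum(axis=0).astype(float)
    # Never recommend something the user already listed.
    scores[new_user == 1] = -1
    ranked = np.argsort(-scores, kind="stable")
    return [artists[i] for i in ranked if scores[i] > 0]


new_user = np.array([1, 1, 0, 0, 0])  # the user submitted artists A and B
suggestions = recommend(new_user, likes, artists, k=2)
```

With three artists per user and many artists overall, the vectors get very sparse, so it may be worth comparing Euclidean distance against something like cosine or Jaccard similarity.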

small ore
timid zealot
#

there it is

real wigeon
#

@small ore hey I was told you might be the person to speak to regarding selenium

daring bison
#

Hi

#

I have a question, why cooling some metals enables them to become superconductive?

small ore
#

@real wigeon What? You must be mistaken. I dont even know how it looks like. From my limited knowledge, Selenium does not seem remotely related to this channel

real wigeon
#

welp

#

can't tell if I'm being trolled

small ore
#

I am not someone to speak to for any topic for that matter. And in general you should not be looking at individuals in this server for any question/discussion even if they are knowledgeable on a topic

real wigeon
#

ok well thank you, I was in the help channel and someone told me I should speak to you.

small ore
#

I can't tell if you were really trolled or you are trolling me 😄

real wigeon
#

well played sir.. well played

lapis sequoia
#

Ayo wha

#

@quiet gyro ^^

#

[better idea to ping @ Moderators or to ping an online mod?]

small ore
#

Yeah. Need more active mods who look into this channel. I suspect this channel is going dead in ways

quiet gyro
#

Better to ping @moderators

#

@real wigeon That sort of behavior and language isn't tolerated here

small ore
#

I mean. If you could ban them ( I think this is repeat. I searched for them and they pinged the last person who answered earlier too) and preferably also delete all of what they said, I'd be glad

quiet gyro
#

Bit preoccupied right now, I'll look in a bit, thanks for the heads up

real wigeon
#

If you're referring to me, this is the first time I've said anything remotely negative. Py_noob is being difficult and asserting that I should not ask questions. I assumed it was a joke so I responded in kind.

#

Throw the banhammer around if you want, life goes on regardless

lapis sequoia
#

@real wigeon py_noob is being nothing

simple crag
#

Enough

lapis sequoia
#

it just looks like facknoobs misunderstood

#

some suggestion they got to talk to someone?

#

oh, sorry. didn't see I think this is repeat.

turbid bay
#

hello, i am doing logistic regression and am trying to calculate the cost of some theta values. However, I am not getting the expected value I want with the code I have made. I was originally using Octave but have moved to python as I understand how to use it better. But still no luck. I will paste the entire code and would like to ask if anyone can spot any mistakes. The cost is supposed to equal 2.534819. https://pastebin.com/hsG3HcR6

#

thanks in advance for whoever can help me solve this problem

polar acorn
#

@turbid bay Can't really see that you're doing something wrong here. Are you sure you are comparing the correct examples?

spare karma
#

Anyone familiar with making logistic regression models in python? I'm making my first model and can't seem to increase my accuracy_score() with the available variables. Specifically, whenever I include a 3rd, 4th variable, my score goes down. Does this mean that my model's accuracy in predicting outcomes is decreasing as I include more and more variables?

#

I'm trying to make the "best" model with what's available.

polar acorn
#

@spare karma are you testing out of sample? You might be overfitting, consider regularisation.
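
A rough sketch of what testing out of sample looks like with scikit-learn. The data is synthetic and made up for illustration; `C` is the inverse regularisation strength of `LogisticRegression`, so smaller values mean stronger regularisation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
noise = rng.normal(scale=0.5, size=500)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Hold out 25% of the rows; the model never sees them while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(C=1.0)  # smaller C = stronger L2 regularisation
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
```

A training score that sits well above the test score is the usual sign of overfitting.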

spare karma
#

@polar acorn Thank you for the response. I think out of sample? I'm using a baked-in function to test, train and split (my instructor recommended one). I'll look into regularization. Are there any general conventions towards selecting variables? (If interested, code below - kobe bryant data.)```

# fit a logistic regression model and store the predictions

feature_cols = ['combined_shot_type_numeric','season_numeric', 'shot_distance', 'minutes_remaining']
X = kobe[feature_cols]
y = kobe.shot_made_flag

model = Model()
model.fit(X, y)
kobe['pred'] = model.predict(X)

from sklearn.metrics import accuracy_score
accuracy_score(kobe.shot_made_flag, kobe.pred.round())```

#

0.5952834961279527

small ore
#

Could you show more code please? I am unable to know what your Model class is. If regularisation does not work, then consider reviewing the model. Like introducing more variables which could be some powers of the existing variables and do some analysis to determine which variables contribute little to your model

#

Disclaimer.I am not an expert. I am trying to learn from online sources and through forums like these

turbid bay
#

@polar acorn that's why I'm extremely confused too, because I think it's working correctly. I'm using the data from a course on Coursera, and I pretty much copied and pasted it, so I'm 99% sure it's the right data. My only thought is that there are multiple ways of calculating cost, so possibly this is just a different way from what the course expected you to use

dim wolf
#

Hi! I have a probably trivial question, but I lack the proper word to search effectively for information on it. If I do a standard plt.plot(x,y) using matplotlib.pyplot, and y is a numpy array of large numbers, python will automatically remove some appropriate power from the y-tick labels, and write that power in the upper left corner of the plot. Can I control this feature somehow, and tell pyplot which factor should be divided out of the ticks? Most importantly, can this also be done for plt.yscale('log') ie a logarithmic axis? Code example here: https://paste.pythondiscord.com/nomamojofe.py

dim wolf
#

I mean, I realize that I can just divide my y-array with some number and annotate that in the plt.ylabel, but since the automatic function is already there, it would make sense that it could also be controlled by the user.

small ore
#

@turbid bay it is not only efficient but also easier to code and debug if you use those as vectorised stuff.

turbid bay
#

@small ore i am using vectorisation in my octave code. however i dont know how to do vectorisation in python

small ore
#

Using numpy methods. You dont have to code yourself

#

Numpy has arrays, matrices and ndarrays. I am not sure which one is a good fit but I think any of those will work with corresponding methods for transpose and such
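
As a sketch of what the vectorised version could look like with plain NumPy arrays: this is the standard unregularised cross-entropy cost, the same formula as the Octave snippet earlier. The function names and the test values are assumptions; `X` is taken to be an (m, n) array and `theta`, `y` 1-D arrays.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_cost(theta, X, y):
    """Mean cross-entropy cost for logistic regression, no regularisation."""
    m = len(y)
    h = sigmoid(X @ theta)  # predictions, shape (m,)
    # The two dot products replace the element-wise multiply and sum.
    return float(-(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m)


# With all-zero weights every prediction is 0.5, so the cost is ln(2).
X = np.zeros((5, 4))
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
theta = np.zeros(4)
cost = logistic_cost(theta, X, y)
```

If a course expects a different number for the same data, one thing to check is whether its cost includes a regularisation term.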

#

If you have problems in your octave bits I will try to help provided the head guys have no problems

turbid bay
#

ill msg u privately with octave questions

spare karma
#

@dim wolf I'm pretty new, so my opinion isn't worth much but if it were me, and there's a deadline associated to your plot, I'd just extract what I need manually into a separate df.

#

@small ore I can introduce powers to existing variables? That's awesome. I never thought of that. Not sure what you want to see, I'm currently at ~100 lines.

small ore
#

I mean if that model.fit is a fancy method which already automatically calculates what variables/powers are needed and what are not and does a super good fit all of what I said regarding bettering the model makes no sense. I just wanted to see what the Model class is

#

And please note: your problem still stays linear if you introduce powers coz these become the co-eff of the function rather than the 'variable'. So when I said 'variable' to your feature_cols , I was in some sense wrong

spare karma
#

@small ore If only they'd make em' that way, lol. On the flipside, then they wouldn't be that much fun to produce. Anywho i'm using from sklearn.linear_model import LogisticRegression as Model

wispy harbor
#

hey guys i have been trying to upload my ipynb to github. Once they are done uploading I am not able to view them instead it gives me this error

lone mist
#

try clicking on "raw" at the top right

wispy harbor
#

@lone mist yea but how does it help?

lone mist
#

it will show you the contents of the file

#

isn't that what you wanted

small ore
#

from sklearn.linear_model import LogisticRegression as Model . I wonder why these modules use such conventions in examples as well as their code.

knotty nexus
#

as a first assignment I've been asked to perform a classification task based on JIRA user story summary, description and activity data. Sample size is ~100k user stories. Classification is based on which 'customer' is being supported by said user story, eg, regulators, internal IT improvement, actual product improvement, etc. Any idea which model is optimal for such a case? If supervised, how big should my output sample be?

vestal axle
#

List all the odd numbers between 1 and 200 that are divisible by 5. Your result should be a list containing a set of integers satisfying this condition.

#

Anyone who can help me on this?

hollow gulch
#

anyone able to turn this code so that instead of cumcount being inserted into the df, it would use the max count value for each group? Using pandas

twilit bolt
#

@vestal axle something like this maybe?

#
set_of_integers = []

for i in range(1, 201, 2):
    if i % 5 == 0:
        set_of_integers.append(i)
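
The same list can also be built without an explicit loop. Odd multiples of 5 start at 5 and are 10 apart, so a sketch would be:

```python
# Odd multiples of 5 between 1 and 200: start at 5, step by 10.
odd_multiples_of_5 = list(range(5, 201, 10))

# Equivalent list comprehension that spells out both conditions.
same_list = [i for i in range(1, 201) if i % 2 == 1 and i % 5 == 0]
```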
vestal axle
#

Yeah, that works! Thanks FONZ 😃

twilit bolt
#

@hollow gulch why not just use .count() instead of .cumcount()? Or am I misunderstanding your objective?

hollow gulch
#

@twilit bolt for some reasons using count for groupby doesnt work

#

unless i miss something

#

is there anyway to list all the function that I am typing out while using spyder? so I have some idea of what I could do

lyric canopy
#

What are you counting? Values within Address Line 1 or do you want to know the count each group has?

#

@hollow gulch

hollow gulch
#

@lyric canopy thanks for replying, this is what my df looks like

#

not exactly the problem I have but similar and easier to understand

#

I want to create a count column by group to produce result so for example with 'ST', the data in the above example would look like this 2 2 1 1 1

#

the groupby.cumcount + 1 would look like this 1 2 1 1 1

lyric canopy
#

Right, just total count per group on each of the lines of that group

hollow gulch
#

yep ❀

#

here is my code

#

df2['Count']=df2.groupby('Address Line 1').max(df2.groupby('Address Line 1').cumcount()+1)

#

the max function doesnt work 😦 I don't know which function available that I could use so I just test out random syntax hope it would work

#

its different from excel where as you start typing, it suggest a list of function available to use. this one doesn't give suggestion which make it harder to code

lyric canopy
#

Right. Well, it's been a while since I used pandas, so I'm going to experiment for a minute

#

So, df2.groupby('Address Line 1').size() will get you the group sizes

#

Now, you still have to add them to the dataframe

#

This should work:

hollow gulch
#

how do I add it to each line?

lyric canopy
#
df2['Count'] = df2.groupby('GROUPINGTHINGY').transform(len)

should work I think

#

Wait

#

That's the wrong paste

#

Wait, now it doesn't work anymore.

hollow gulch
#

yeah, individual syntax work

#

like print (df2.groupby('Address Line 1').size())

#

but I dont know how to put that value into the data frame for each row based on the 'Address Line 1' value

lyric canopy
#

Yeah, I know, but there's a simple way to do it with one statement, it's just that I can't remember it because it's been months since I touched pandas. I can only hack it currently:

>>> df['count'] = df.groupby('A').size()
>>> df
   A  count
0  a    NaN
1  a    NaN
2  a    NaN
3  b    NaN
4  b    NaN
5  a    NaN
>>> df['count'] = df.groupby('A').transform(len)
>>> df
   A  count
0  a      4
1  a      4
2  a      4
3  b      2
4  b      2
5  a      4
#

But, that should be possible in one line

hollow gulch
#

what do you mean you hack it 😛

twilit bolt
#

If I'm understanding correctly, this should do the trick.: ```df = pd.DataFrame({
'City' : ['BILLINGS', 'LANSING', 'HICKORY', ' HAYWARD', 'NORTH EAST', 'SAN DIMAS'],
'ST' : ['MT', 'MI', 'NC', 'CA', 'MD', 'CA']
})

df

df['Count'] = df.groupby('ST')['ST'].transform('count')```

lyric canopy
#

That's it, yeah.

#

Couldn't get it to work because I tried:

df['Count'] = df.groupby('ST').transform('count')

So, without the ['ST']. I

#

@hollow gulch Did you see the answer above?

hollow gulch
#

what does transform do?

#

let me look it up real quick

#

the syntax looks a bit weird because I dont see groupby () [] next to each other that often

lyric canopy
#

Okay.

hollow gulch
#

way i understand it is group() and () would call function or combination of condition

#

and [] is to apply array

lyric canopy
#

df.groupby('ST') creates a "groupby" object (grouped by 'ST')

#

Next, you select the ['ST'] column of that object and use count on it

#

So, it's equivalent to:

#
grouped_df = df.groupby('ST')
df['Count'] = grouped_df['ST'].transform('count')
hollow gulch
#

and I didnt know you could do transform ('count') to command a function using a string

#

i thought it has to be something like groupby().command here

#

learning so much ❀

#

thanks you the greatest teachers ❀ @lyric canopy and @twilit bolt

hollow gulch
#

so to have a better understanding of how things work, grouped_df['ST'].cumcount() would give cumulative count for each row within a group.

#

grouped_df['ST'].count() assume it would work would give count of the group but it's not in a usable format (wonder why)

#

so we need to use grouped_df['ST'].transform('count') to put it in the right format (I still don't know which format we are in or needed for each of this)

#

i assume the 2 format we dealing with here are array + int. For Array to work, it has to match the size.

#

I am not used to the transform function and what it does
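
A side-by-side sketch may help with the "format" question. All three calls return a pandas Series; the difference is the index. `count()` gives one row per group, while `cumcount()` and `transform('count')` give one row per original row, which is why only those can be assigned straight back as a column. The frame is the one from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['BILLINGS', 'LANSING', 'HICKORY', ' HAYWARD', 'NORTH EAST', 'SAN DIMAS'],
    'ST': ['MT', 'MI', 'NC', 'CA', 'MD', 'CA'],
})

grouped = df.groupby('ST')['ST']

per_group = grouped.count()         # indexed by ST: one value per group
running = grouped.cumcount() + 1    # indexed like df: position within the group
df['Count'] = grouped.transform('count')  # indexed like df: group size on every row
```

Assigning `per_group` to a column would align on the index labels (`'CA'`, `'MD'`, ...) against the row numbers 0 to 5, find no matches, and fill the column with NaN.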

twilit bolt
hollow gulch
#

❀ thanks so much for the resources, I've been trying to find a good source to study from. I want to move my Excel work to pandas but the syntax is what usually throws me off

twilit bolt
#

You're welcome.

vestal axle
#

Anyone who know's how to solve this?

#

I need to run a for loop, but how should I code it?

lyric canopy
#

What have you tried so far?

#

And, do you need to use for-loops? Because you don't need to if you don't

#

Well, I've gtg @vestal axle , but there are plenty of options for calculating the correlation without explicitly using for-loops yourself.

small ore
#

I am curious what course that is

proud raven
lapis sequoia
#

I'm pondering the thought of using machine learning to identify what type of log file I am going to perform some graphing on. Would that be possible? Right now I am using pandas:
```py
if df.columns.tolist() == my_dict.get("type"):
    plottype = type
```
But I feel a little automation there would be cool. Would it be possible?
regal yarrow
#

Hey

placid snow
#

This is not the place for advertisement. @regal yarrow

regal yarrow
#

sorry

lapis sequoia
#

anaconda is a "closed" enviroment right? it wont mess up my already installed python3 stuff?

lapis sequoia
#

hm, installed latest anaconda and started a "conda" project in pycharm, but it said python 3.6. Is it not 3.7 which comes with conda?

vestal axle
#

Vesiculus I managed it

#

py_noob its data analysis in python im on my second year of Msc Finance

#

in

small ore
#

Nice!

minor bolt
#

any help with timeit

lyric canopy
#

Post your question and I'm sure people will help you

hollow gulch
#

Good morning, I am trying to make new columns so that if QTY and MAX QTY are > 0, they stay the same; otherwise QTY = 0 and MAX QTY = 100000. The datatype is object, and I attempted to convert them to numeric but failed. Can anyone help, please?

#

df1[['QTY','MAX QTY']] = df1[['QTY','MAX QTY']].apply(pd.to_numeric)

#

I also tried this pd.to_numeric(df1[['QTY']], errors='coerce').fillna(0).astype(int)

twilit bolt
#

@hollow gulch This is the way I would do it ```df = pd.DataFrame({
'qty' : [1, 2, 450, '--'],
'max_qty' : ['nan', 2, '-', '--']
})

df

df.replace(['nan', '--', '-'],[9999, 9999, 9999 ], inplace = True)
df.loc[(df['qty'] == 9999 ) | (df['max_qty'] == 9999), ['qty', 'max_qty']] = 0, 100000```

hollow gulch
#

Thanks @twilit bolt as always ❀

#

I was hoping there was a better way to capture it instead of manually identify all miscellaneous items

#

it is what it is then ❀

twilit bolt
#

I'm sure there is, using the 're' module. But I'll leave someone else to come up with that solution, as I still can't get my head around working with combinations of strings.
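
One possible way to avoid listing every placeholder by hand, sketched on the same example frame: convert each column separately with `pd.to_numeric(..., errors='coerce')`, which turns anything unparseable into NaN, then apply the rule from the question. Note that `pd.to_numeric` expects a 1-D input, which may be why passing `df1[['QTY']]` failed.

```python
import pandas as pd

df = pd.DataFrame({
    'qty': [1, 2, 450, '--'],
    'max_qty': ['nan', 2, '-', '--'],
})

# Convert column by column; anything that is not a number becomes NaN.
num = df[['qty', 'max_qty']].apply(pd.to_numeric, errors='coerce')

# A row is valid only if both values are present and positive.
valid = (num['qty'] > 0) & (num['max_qty'] > 0)

# Keep valid rows as they are, overwrite the rest with the defaults.
df['qty'] = num['qty'].where(valid, 0)
df['max_qty'] = num['max_qty'].where(valid, 100000)
```

Comparisons with NaN are False, so missing values fall into the "otherwise" branch without any special handling.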

hollow gulch
#

quick and simple solution sometimes better as I do need to get this project move on, thanks for all the help

twilit bolt
#

np

hollow gulch
#

I love python because it has great learning curve and so much to its capability

#

and those are the question I keep on the back of my heads that one day I can come up with a better solution

#

I hope Python is a good solution for my day-to-day data analytics projects and that it will be better than Excel/VBA. I find it more appealing because I'd have to code the same thing in VBA anyway, so I might as well do all of the tasks in Python and save the workload if I need to go back and edit anything

naive hornet
#

this channel is meant to serve somewhat as a data science discussion location, but this is also a Python server first and foremost. might be a question better suited to an off topic channel

foggy junco
#

Is there a lot of math in data science

foggy junco
#

Hello

#

Yo

placid snow
#

Depends what you do.

#

There is also no need to bump your message, this isnt a forum

naive orchid
#

I'm getting a memory error from loading a CSV file (a few GB) into pandas. Is there a way to split up the file and analyze it piece by piece?

lyric canopy
#

@naive orchid Sure.
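
A minimal sketch of one way to do this, assuming the analysis can be expressed as a running aggregate: `pandas.read_csv` accepts a `chunksize` argument and then yields DataFrames piece by piece instead of loading the whole file. The in-memory CSV and the `value` column are made up so the snippet runs on its own; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for a large file on disk (hypothetical columns "id" and "value").
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
rows = 0

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most `chunksize` rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    rows += len(chunk)

mean_value = total / rows
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size.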

olive trench
#

Guys do you know what could trigger the setting with copy warning in pandas? I have 9 columns I'm doing an operation to and I only get the warning on two of them.

#

this is my code

    for column in metadata_part.columns:
        new_col = metadata_part.loc[:, column].astype(CategoricalDtype(categories=unique_vals[column]))
        metadata_part.loc[:, column] = new_col
polar acorn
#

Setting with copy warning is a mysterious beast. But it usually means you are trying to write to something that is a view of a dataframe. In your case where does metadata_part come from? Is it a subset you've taken from a larger data frame?

olive trench
#

It is. And I didn't use .loc for it. I probably should, right?

polar acorn
#

Might not matter actually. If you want to change the parent dataframe you should change that directly. If you want to change only a copy of the subset and work with that leaving the parent df intact you can use .copy() when you subset it i.e. metadata_part = parent_df.loc[].copy()

olive trench
#

I am not working with the source dataframe any further. I might try the .copy() thing. I am mostly confused it warns me only on two of my columns but not on the rest

polar acorn
#

As I said it is a strange warning that sometimes triggers when things are fine and sometimes doesn't even when it should. But in general you should always be fine if you're working on a df that is not a shallow copy (either original or subsetted with .copy()) and using only one .loc or .iloc statement to subset it.
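
A small sketch of the `.copy()` suggestion, under the assumption that the parent frame is not needed afterwards. `parent_df`, its columns, and the way `unique_vals` is built are invented here; only the loop mirrors the code from the question.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical parent frame standing in for the real source data.
parent_df = pd.DataFrame({
    "kind": ["a", "b", "a", "c"],
    "grade": ["s", "m", "s", "l"],
    "n": [1, 2, 3, 4],
})

# Subset with a single .loc call and take an explicit copy, so later writes
# clearly target an independent frame rather than a possible view.
metadata_part = parent_df.loc[parent_df["n"] > 1, ["kind", "grade"]].copy()

# Assumed here: the allowed categories come from the full parent frame.
unique_vals = {col: sorted(parent_df[col].unique()) for col in metadata_part.columns}

for column in metadata_part.columns:
    dtype = CategoricalDtype(categories=unique_vals[column])
    # Plain column assignment replaces the column, including its dtype.
    metadata_part[column] = metadata_part[column].astype(dtype)
```

The parent frame keeps its original dtypes; only the copied subset becomes categorical.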

olive trench
#

Alright, thank you very much for the help. I'll look more into it and see if anything changes!

twilit bolt
#

@hollow gulch Something like this?
df['Count'] = 1
df.loc[(df['Tax Y/N'] != df['Tx Ex']) | (df['Tx Ex'] == 'NaN'), 'Count'] = 2

blazing bolt
#

Has anyone got any good suggestions for creating tables and exporting them automatically using python, in either a .png format or straight onto a powerpoint? Has to be compatible with pandas.

#

???

hallow obsidian
#

Say I partition the unit interval [0, 1] into a grid via u=1/(m-1) and G=(k*u for k in range(m)). I want a function which maps a real x to the value in that grid that's closest to x. Is there a better way than doing a sort of logarithmic walk starting in the middle of G and using abs for comparison? Are there any libraries which do that? I need to do it hundreds of thousands of times in a numerical execution.
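
Because the grid is uniform, no search should be needed: the nearest index is just x/u rounded to an integer and clamped to the valid range. A minimal NumPy sketch of that idea follows; the function name is made up, and exact midpoints between two grid values are resolved by NumPy's round-half-to-even rule, which may or may not be what you want.

```python
import numpy as np


def nearest_grid_value(x, m):
    """Map x (scalar or array) to the closest point of the grid k/(m-1), k = 0..m-1."""
    x = np.asarray(x, dtype=float)
    # x / u with u = 1/(m-1), rounded to the nearest index and clamped to [0, m-1]
    # so that values outside [0, 1] snap to the nearest end of the grid.
    k = np.clip(np.rint(x * (m - 1)), 0, m - 1)
    return k / (m - 1)


# Works on whole arrays at once, which suits hundreds of thousands of lookups.
snapped = nearest_grid_value([0.1, 0.3, 0.6, 0.9, -2.0, 7.0], m=5)
```

With m=5 the grid is 0, 0.25, 0.5, 0.75, 1, so the call above is constant time per element rather than logarithmic.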

twilit bolt
blazing bolt
#

Thanks :')

lyric canopy
#

@blazing bolt You can use Pandas with Matplotlib to create table figures. It's not that difficult to do, but I don't know how difficult it is to make them look pretty. I usually export my table data in LaTeX markup to display it in the style of the document I'm writing, but that may not be suitable for your application (Powerpoint; I mainly use LaTeX/pdf for presentations as well).

#

I've never used it, though, so you maybe have to google a bit for the appropriate approach

#

Anyway, I hope the terminology on that page I linked you helps with that
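
A rough sketch of the Pandas-plus-Matplotlib route, assuming a PNG file is an acceptable output. The data, figure size, and output path are placeholders, and the styling is left at Matplotlib's defaults.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data; any DataFrame would do.
df = pd.DataFrame({"region": ["north", "south"], "sales": [120, 95]})

fig, ax = plt.subplots(figsize=(4, 1.5))
ax.axis("off")  # hide the axes so only the table is drawn

# Draw the frame as a table: one row of column labels plus the cell values.
ax.table(cellText=df.values.tolist(), colLabels=list(df.columns), loc="center")

out_path = os.path.join(tempfile.gettempdir(), "table_example.png")
fig.savefig(out_path, dpi=200, bbox_inches="tight")
plt.close(fig)
```

The resulting image can then be placed on a slide by hand or by whatever PowerPoint tooling you already use.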

copper swan
#

where do i learn the python libraries for data science

#

and machine learning

earnest prawn
#

in their documentation and stack overflow

#

as for many libraries

small ore
#

Most YT videos are not good though. Either they use obsolete python 2 or they use dirty conventions and can be confusing.

#

On their sites, they have their own tutorials alongside the documentation. So that should help too

small ore
lyric canopy
#

You need to calculate the correlation between the first and the second column.

#

In this case, you've only provided one of the two as an argument

#

Try this:

#
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
#

That will return a correlation matrix as an array, I think
@lapis sequoia
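
To make the shape of the result concrete, here is a small sketch with invented data; `np_baseball` is assumed to be a 2D array whose first two columns are the quantities being compared.

```python
import numpy as np

# Invented stand-in for np_baseball: column 0 = height, column 1 = weight.
np_baseball = np.array([
    [74, 180],
    [72, 215],
    [70, 190],
    [76, 225],
    [69, 175],
], dtype=float)

# corrcoef of two 1D arrays returns a 2x2 matrix: ones on the diagonal,
# and the correlation between the two inputs in the off-diagonal entries.
corr_matrix = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])

# The single number usually wanted is the off-diagonal element.
corr = corr_matrix[0, 1]
```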

lapis sequoia
#

OH I GOT IT NOW

#

THANK YOU SO MUCH

young aurora
#

I need help with data cleaning and asked in help-0. Is it kosher for me to ask here, too?

#

Data cleaning/merging, rather.

lean ledge
#

Covariance makes me very happy tbh

#

Non-zero correlation/covariance terms are best

small ore
#

@young aurora It is always best to ask those questions here. This channel is not as active as the help channels but there is a chance your question is lost in the help channels if no one around there knows the answer. I have seen people go through old questions and answer in this channel a week after it was asked

young aurora
#

If I want to combine 35 CSVs that have somewhat different column names, what's the best way to do it?
I'm trying to do it with "usecols" from Pandas, but when there aren't column names that I'm specifying within a CSV, it spits an error.
Here's an example

```
                                                                   'COL2',
                                                                   'COL3',
                                                                   'COL4',
                                                                   'COL5',)
    for f in os.listdir(os.getcwd()) if f.endswith('csv')]

combexample = pandas.concat(example, axis=1, join='inner').sort_index()```

So, some of the files might not have COL4 (it's technically the same info within the row, but the header name is different). How can I get it to combine properly?
I'd love any help I can get.
If it helps, there are 40 total files.
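
One possible approach, sketched under assumptions: read each file without `usecols`, rename the known header variants to a common name, and then select the wanted columns with `reindex`, which fills any still-missing column with NaN instead of raising. The variant header `COLUMN4` and the in-memory files are invented, and rows are stacked here (`axis=0`), which may differ from what the original `axis=1` concat intended.

```python
import io

import pandas as pd

# Hypothetical mapping from header variants to the canonical name.
RENAMES = {"COLUMN4": "COL4"}
WANTED = ["COL1", "COL2", "COL3", "COL4", "COL5"]


def load(source):
    """Read one CSV, harmonise its headers, and keep only the wanted columns."""
    df = pd.read_csv(source).rename(columns=RENAMES)
    # reindex never fails on a missing column; it adds it filled with NaN.
    return df.reindex(columns=WANTED)


# Stand-ins for the real files; with files on disk, pass their paths instead.
files = [
    io.StringIO("COL1,COL2,COL3,COL4,COL5\n1,2,3,4,5\n"),
    io.StringIO("COL1,COL2,COL3,COLUMN4,COL5,EXTRA\n6,7,8,9,10,x\n"),
]

combined = pd.concat([load(f) for f in files], ignore_index=True)
```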
#

There's the question, for what it's worth - thanks, @small ore

lapis sequoia
#

Are there any experts here in Sqlalchemy? I was wanting to have a code review done on a project I'm working on.

trail flicker
#

ill take a look

foggy junco
#

Hello?

lyric canopy
#

Hi

hearty hazel
#

@heavy brook We don't currently have a system for recruitment, we're trying to figure out the best way to handle it right now.

#

We tend to remove recruitment messages for the sake of safety

young aurora
#

So I've got a function I've written that gives me a number as its return value. I'd like to loop it X times, assigning an ID to each loop + the value within the two columns "Test Number" and "Value". How can I do this?
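
A minimal sketch of one way to do it, assuming the results should end up in a pandas DataFrame. `run_test` is a made-up stand-in for the function mentioned in the question.

```python
import random

import pandas as pd


def run_test():
    """Hypothetical stand-in for the real function; returns a single number."""
    return random.random()


n_runs = 5

# Build both columns in one go: IDs 1..n_runs and one function call per run.
results = pd.DataFrame({
    "Test Number": range(1, n_runs + 1),
    "Value": [run_test() for _ in range(n_runs)],
})
```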

heavy brook
#

im currently working on multi gaze tracking/estimation

#

I dont have an ML background, but I am good at piecing things together

#

im learning python along the way

#

anyhow, the current challenge I want to solve right now is whether someone is looking at the screen or not

#

i'll make a tv stand, put a tv and then put a camera on top

#

I have seen some open source code already for head pose estimation

#

my main question now is, is there a formula to determine if someone is looking directly at the screen given a roll, pitch and yaw? plus some extra params? distance from the screen, angle maybe?

heavy brook
#

wow my code is slow AF T_T

fleet ginkgo
#

I have a dataset that has a predictor column with two values: 'Fully Paid' and 'Charged Off'. For data visualization purposes, I split data into two. Unfortunately, the 'Fully Paid' subset has twice as many rows as the 'Charged Off' subset. I made a function to randomly shuffle the 'Fully Paid' subset and match length of the 'Charged Off' subset to see if I can understand the full picture better, but every time I run the function, the plots are way different each time. Is there a way I can work around this problem to better handle keeping the size of the two subsets the same but to somehow include the "entirety" of the larger subset? I hope I was clear enough.

polar acorn
#

Why would they need to be the same length again?
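
Two hedged options, sketched with invented data: fixing `random_state` makes the downsampled subset identical on every run, and comparing within-group proportions uses every row without needing equal sizes at all. The column names `loan_status` and `amount` and the threshold are assumptions.

```python
import pandas as pd

# Invented data with twice as many "Fully Paid" rows as "Charged Off" rows.
df = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 6 + ["Charged Off"] * 3,
    "amount": range(9),
})

paid = df[df["loan_status"] == "Fully Paid"]
charged = df[df["loan_status"] == "Charged Off"]

# Option 1: downsample with a fixed seed so the plots stop changing between runs.
paid_sample = paid.sample(n=len(charged), random_state=0)

# Option 2: compare shares within each group, which keeps all rows
# and does not depend on the group sizes.
share_large = df["amount"].ge(4).groupby(df["loan_status"]).mean()
```

For plots, the second idea corresponds to normalised histograms or densities per group rather than raw counts.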

small ore
#

@heavy brook ML is about finding the prediction formula. I have no expertise in determining the (roll, pitch, yaw) of a certain face, but I imagine it needs a database of several faces, each rotated at various (pitch, yaw, roll), and a formula is determined by number-crunching from that database

hoary terrace
#

yo i wanna start a project using AI. Starting from scratch so does anyone have any ideas of a good project idea to do?

weary ferry
#

sure

#

a good first project is a Mancala Bot

lean ledge
#

@small ore @heavy brook If you have the position and angles of your gaze then calculating whether it's looking at the screen is just simple maths
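
A sketch of what that maths could look like, under loudly stated assumptions: the camera sits at the top centre of the screen and is the origin, x points right, y points up, z points out of the screen toward the viewer, and yaw = pitch = 0 means the head faces the screen plane head-on. Head pose is used as a proxy for gaze, which ignores eye movement, and roll drops out because it does not change the facing direction. The function name and parameters are invented.

```python
import math


def looks_at_screen(head_xyz, yaw_deg, pitch_deg, screen_w, screen_h):
    """Return True if the facing ray from the head hits the screen rectangle."""
    x, y, z = head_xyz
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)

    # Unit direction of the facing ray; (0, 0, -1) points straight at the screen plane.
    dx = math.sin(yaw) * math.cos(pitch)
    dy = math.sin(pitch)
    dz = -math.cos(yaw) * math.cos(pitch)

    if dz >= 0:
        return False  # facing away from (or parallel to) the screen plane

    # Intersect the ray with the plane z = 0.
    t = -z / dz
    hit_x = x + t * dx
    hit_y = y + t * dy

    # The screen hangs below the camera: x in [-w/2, w/2], y in [-h, 0].
    return abs(hit_x) <= screen_w / 2 and -screen_h <= hit_y <= 0


# Head 2 m away, 0.3 m below the camera, 1.0 m x 0.6 m screen.
facing = looks_at_screen((0.0, -0.3, 2.0), 0, 0, 1.0, 0.6)
turned = looks_at_screen((0.0, -0.3, 2.0), 45, 0, 1.0, 0.6)
```

The distance from the screen enters through `z`: the farther away the head is, the smaller the range of angles that still lands on the screen.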

small ore
#

How do you find the position and angle of the gaze? I Understood it as him trying to find the position and angle rather than a binary answer of whether or not he is looking at the screen

#

@lean ledge

lean ledge
#

is there a formula to determine if someone is looking directly given a roll, pitch yaw? plus some extra params

#

hecc, you actually dont even need ML for this

#

you can do some fairly simple CV

#

do some face detection, do some feature extraction, a transform for the eyes, get location, compare it to rest location

small ore
#

Hm. Yes. That final question seems to mean the opposite but if you look at the rest of the question including
im currently working on multi gaze tracking/estimation

foggy junco
#

Is this true lol

#

People are leaving data science jobs now

#

Welp

lean ledge
#

They're not really

#

I'd ignore 80% of stuff that's on medium

small ore
#

And yeah. I did think of transformation too . Wasn't sure if it could be applied

foggy junco
#

A data scientist made the article

#

So this could be accurate

lean ledge
#

could be. It's most likely not. I too am in data science and nobody wants to leave, they earn too much to :P

#

Everything he said is part of all jobs

#

Just worse in jobs where data science isnt the main business of the company

#

In which case I point you to a bunch of other devs who have the same problems in non tech companies

twilit bolt
#

A bit of sample selection bias in those numbers, I think. But, what do I know.

lean ledge
#

He provided like no numbers

twilit bolt
#

The article states, in part: > These data were collected by Stack Overflow in their survey based on 64,000 developers.

foggy junco
#

But he said he was a data scientist

twilit bolt
#

Right, lol

foggy junco
#

Why would they lie about that its a popular article

#

And he proves he knows his stuff

lean ledge
#

The only piece of data is ML/DS people are looking for different jobs, which might as well be because there's lots of new high paying jobs coming up

#

I didnt say he wasnt a data scientist

#

I said being a data scientist doesnt at all make him an authority on the market and motivations of thousands of other data scientists

#

Everything else was an opinion piece from his job working in a non-tech company

foggy junco
#

It was fact based, he proves he knows his stuff lol

#

Unless u can correct his fact mistakes in the article

lean ledge
#

He said no facts lmao, he gave like 1 piece of statistics

foggy junco
#

He said a lot about vector algorithms

lean ledge
#

So do I

foggy junco
#

Everything he said made sense

lean ledge
#

If your reasoning is that he's a data scientist, I am too. I work in a data science company. My coworkers are data scientists. We work with clients who have their own data science teams yet pay us to do things. Every deal we make gets verified by their data scientists and the data scientists in our accountancy firm (Deloitte)

foggy junco
#

So basically u disagree with this guy that is also a data scientist

#

Did he get any of his facts wrong in the article

lean ledge
#

I'm not saying he doesn't make sense, I'm saying his reasoning is personally based, and largely applicable only in scenarios where you're a tech employee in a non tech company, which is far from only true for data scientist and can not at all be generalised to explain the motivations of thousands of other data scientists that are looking for a new job

#

Jesus never mind

foggy junco
#

So is the everyone looking for a new job and leaving data science jobs true or just a troll

foggy junco
#

Well like ur a data scientist is this true lol

lyric canopy
#

So, what I've seen around me is that a lot of people from our masters end up in data science. Most of our students are actually scooped up before they even graduate.

#

So, I'd take that article with a grain of salt

#

Still, there are some decent points made in that article

#

Some that apply to more fields than just data science

late garnet
#

I think that the article had a great take away of managing expectations of the company. Know what you are getting yourself into prior to joining.

#

I'm in a similar situation as what the article describes - a company in its infancy when it comes to DS.

#

This can become frustrating at times when expectations on both sides are not agreed upon.

small ore
#

And a lack of data on their part?

late garnet
#

Depending on the project, yes.

foggy junco
#

I thought most students in data science come out with a bachelors?

lean ledge
#

most people in data science have masters, maybe even PhD

#

the problem with data science market is the number of companies following the hype that hire people without knowing what data science is for

placid snow
#

Can confirm companies scoop up people for hype techs

lean ledge
#

they may think their excel sheet with a hundred rows is Big Data, may hire someone without having anything to give them and then expecting something out, may only hire 1 person (1 is pretty much never enough), etc etc. They dont know how to pick the right person because DS really isnt anything like normal software dev.

#

Data scientists likely arent leaving jobs because data science can be a bad job but because there's a ton of companies, even good companies, that have no reason or way to manage data scientists which makes them miserable and because there's an increasing number of more lucrative opportunities daily

lyric canopy
#

@foggy junco Getting a masters is pretty much standard here in The Netherlands, but our system is a bit different. It's almost impossible to start a Ph.D. without having completed a masters degree first, since, well, that's seen as the "base" degree for university.

#

The only field I've seen Dutch students working on a Ph.D. without getting their masters first a couple of times is medicine.

foggy junco
#

So basically data science is starting to die

lean ledge
#

no

#

it's always been this way

foggy junco
#

Swe might be more secure than data science in job security wise but idk

lean ledge
#

completely different careers with very different skillsets

late garnet
#

@lean ledge there is more overlap between SWE and DS than you would think. 😃

lean ledge
#

I wouldn't say so. Perhaps if they are at a smaller place and take on some responsibilities that would be handled by a data engineer

#

DS requires experimentation, research skills, exploration, the ability to stay up to date on literature + whatever maths and statistics skills you need, which don't really match up with the skills a good SWE needs

late garnet
#

The overlap that I am thinking of is more along the lines of algorithm development - writing unit tests, structuring libraries, etc.

lean ledge
#

Not really the sort of stuff data scientists do from what I've seen. Data scientists I've seen do the exploration etc, come up with the right techniques etc in a notebook, pass the notebook on to someone else who writes it as proper deployable software. It's a norm in data science AFAIK

#

There's too much stuff to do for data scientists to be concerned with implementation

late garnet
#

@lean ledge agreed within enterprise environment.

lean ledge
#

Where else is that not true? Startups?

#

I'm in a startup and it's still the norm here

late garnet
#

Open source?

lean ledge
#

Open source what?

terse pewter
#

Hey, feature vector question I wanted to get clarified

#

Im working on a image classifier using a support vector machine

#

I've extracted various features such as hog, dominant colors, canny edge

#

I stumbled on this on quora:

#
if {a1,a2,a3,a4,a5}, {b1,b2,b3} and {c1,c2,c3,c4} are the features extracted from an object using different feature extraction mechanisms, concatenate them to form a single feature vector as {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4} and use it for classification.
#

So if {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4} is a list of all the features for one image in the training data and I have 500 images, would all of those concatenated vectors be placed into a vector containing: [ {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4}, {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4}, ... ] ?

lean ledge
#

Sure, that works.

terse pewter
#

Thanks Raggy for the confirmation

#

Additionally, I was wondering why in numpy.concatenate, in some of the examples, the array contents were mixing rather than just being added on to the end of the previous array. Why is that?

lean ledge
#

good question, i have no clue
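
A likely explanation, offered as a guess since the examples in question are not shown: for 2D inputs, `np.concatenate` joins along an axis, so `axis=1` places the rows of the second array next to the rows of the first, which can look like mixing when printed. A short sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 (the default): b's rows are appended below a's rows.
rows = np.concatenate([a, b], axis=0)

# axis=1: each row of b is appended to the matching row of a,
# so the printed values appear interleaved.
cols = np.concatenate([a, b], axis=1)

# Flattening first gives the plain "add on to the end" behaviour.
flat = np.concatenate([a.ravel(), b.ravel()])
```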

small ore
#

After days (months?) this channel saw some activity. Not sure if I am happy with the nature of activity

terse pewter
#

lol thanks anyways raggy

lean ledge
#

out of curiosity, what features did you extract from canny edge? just the pixel locations after non-maximal suppression?

terse pewter
#

just the pixel locations, tbh I just looked up NMS because of your comment, I think I should have that haha

lean ledge
#

NMS is part of canny edge. if you're using a library function, it's included.

terse pewter
#

im new to ML and data science and just learning

#

ah

#

Yeah then just the pixel locations

lean ledge
#

cool cool

terse pewter
#

So for dominant colors I have a size of (7,3) and canny edge and hog have a shape of (100,100). Would I need to pad my dominant colors ndarray?

lean ledge
#

shouldnt have to. you should only be comparing relevant features to the same type of features so shape consistency isnt important. you do need to find a way to weigh or normalise differences in feature vectors to get the difference between images

#

assuming you use something like L2 distance between two vectors for measure of "closeness", the distances will be on different scales

terse pewter
#

Makes sense, I asked because functions such as np.concatenate, np.hstack, np.vstack, np.column_stack complain about the shape of the dominant colors. True, I haven't thought about normalizing the data yet, I was going to get the vector filled and then go back to look into normalization

#

Most examples I've seen online, people have the same rows but different columns, but I have yet to see a different number of rows

terse pewter
#

What if I flattened my data and then concatenated it?

#

Oh but that would cause data loss if I have something like a black and white image array as a feature?
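
A sketch of the flatten-then-concatenate idea, using random arrays with the shapes mentioned above, (7, 3) and (100, 100), in place of real features. Flattening keeps every value and only discards the 2D layout, which a classifier fed fixed-length vectors does not use anyway. The column-wise standardisation at the end is one simple way to put the features on a comparable scale and is an addition, not something from the discussion.

```python
import numpy as np

rng = np.random.default_rng(0)


def feature_vector(dominant_colors, edges, hog):
    """Flatten each feature block and join them into one 1D vector."""
    return np.concatenate([dominant_colors.ravel(), edges.ravel(), hog.ravel()])


n_images = 5

# One row per image: shape (n_images, 7*3 + 100*100 + 100*100).
X = np.stack([
    feature_vector(rng.random((7, 3)), rng.random((100, 100)), rng.random((100, 100)))
    for _ in range(n_images)
])

# Standardise each column so no feature block dominates distance computations.
std = X.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns
X_scaled = (X - X.mean(axis=0)) / std
```

`X` is the list of per-image vectors described earlier, just stored as a 2D array.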

hollow gulch
#

anyone knows why it won't iterate?

simple crag
#

len() returns an integer

#

Perhaps you mean to also use range()

hollow gulch
#

oh i see

#

range return an array?

simple crag
#

No, but it's iterable

desert cradle
#

or just for row in df3

hollow gulch
#

got not error

#

new*

desert cradle
#

for i, row in enumerate(df3) if you need the number

#

@hollow gulch it needed to be for row in df3 i think

#

it'd be range(len(df3)) if you were doing what ELA suggested though

hollow gulch
#

still error

#

I tried all 3

lyric canopy
#

This is a different error

#

Your if-statement there is wrong

#

You need to use and instead of &

hollow gulch
#

in what event use & | / 'and ', 'or'

desert cradle
#

& | are bitwise, or memberwise on numpy arrays

#

and/or are short-circuiting and strictly boolean

#

so if you do foo() and bar() and foo returns false, bar won't even be called

#

though, the other issue [and the actual reason yours fails] is that & is tighter-binding than ==/!=

#

anyway, it's partially a style thing (even though the short circuiting does mean a bit for efficiency) - it's very unusual to use &/| for connecting conditions within an if statement.

#

you'd need parentheses around each of the conditions if you did

small ore
#

But the error complains about them being strings, no?

desert cradle
#

yes

#

because it's just doing df4.iloc[row1, 2] & df3.iloc[row, 2]

#

just change the & to and
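
To tie the thread together, a small sketch with an invented `df3` (the real frames were only shown in screenshots). It also illustrates a detail not spelled out above: iterating a DataFrame directly yields column labels, so row-wise loops need `range(len(df3))`, `itertuples()` or similar.

```python
import pandas as pd

# Invented stand-in for df3.
df3 = pd.DataFrame({"item": ["a", "b", "c"], "flag": ["Y", "N", "Y"]})

# Iterating the frame itself gives the column labels, not the rows.
labels = [col for col in df3]

# Row-wise loop with `and`: each comparison is evaluated before being combined.
matches = []
for i in range(len(df3)):
    if df3.iloc[i, 0] != "b" and df3.iloc[i, 1] == "Y":
        matches.append(i)

# Without parentheses, & binds tighter than != and ==, so Python first tries
# "b" & "Y" on two strings and raises TypeError.
try:
    "b" != "b" & "Y" == "Y"
except TypeError as exc:
    error_name = type(exc).__name__

# Element-wise on whole columns, & is the right operator, with parentheses.
mask = (df3["item"] != "b") & (df3["flag"] == "Y")
```

The vectorised mask gives the same answer as the loop and is usually the more idiomatic pandas form.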

lapis sequoia
#

do i need all built functions for data science?

#

in python

lean ledge
#

@sa#8919 What do you mean?

#

@lapis sequoia

#

🤔

viscid aspen
#

Hey, I have some wifi sensor time series accompanied with a schematic of the area where the sensors are located. Can anyone think of any tools that could help me visualize which sensor is pinged on the map over time ? Something where I can give it my schematic (Image/PDF) and map a column in the timeseries to some object on a map that would change color at a certain time in the animation?

hollow gulch
#

anyone know what the right syntax is to do this?

misty sonnet
#

@hollow gulch try using a linter?

fleet flower
#

hey buddies

quiet gyro
#
placid snow
#

Cool extension

fathom zenith
#

Very cool extension! As an avid jupyter notebook and spyder user, I really ought to give it a shot

#

But I am yet to find an editor with such a convenient ipython terminal like spyder's

twilit bolt
#

I would agree. Spyder is, for me, second to none when it comes to iPython implementation in an IDE.

viscid aspen
#

I've never used it, would you mind giving some examples briefly ? What kind of features/workflows do you like the most about it ?

fathom zenith
#

Spyder? I like it because it's got a variable explorer that displays dataframe data in a tabular form and the iPython terminal allows you to play around with commands before you include them in your code. You can also select and run a few lines of your code in the iPython terminal and it'll keep track of the variables created/changed so you can continue experimenting with those adjusted variables

spark summit
#

So I've been trolling in here for a while

#

I have decided that I need to dig into some data-science studies, any recommendation on study materials, or specific topics that I should look into?

viscid aspen
#

@fathom zenith Oh, nice. So it's kinda like R-Studio for Python if you're familiar with it.

fathom zenith
#

@viscid aspen Yup, exactly! Should've just said that lol, R-Studio for Python

#

@spark summit I would definitely recommend Andrew Ng's machine learning course on Coursera. It's not in Python, it's in Octave but it shouldn't be too difficult to translate everything you learn, besides it's more theory-heavy so you'll only be coding to solidify the theoretical concepts.

#

I'm on Week 8 of it and I must say, that man has a knack for simplifying the most complicated concepts

#

And when you start translating it to Python you'll probably just use a library for much of it instead of writing the learning algorithms from scratch. But ground-up knowledge of the ML algorithms will help you optimize them better, adjusting the parameters, performing error analysis, devising ways to improve the performance of your algorithm, creating learning curves for bias-variance trade-off analysis and all that

#

I can't wait to finish it so I can start his Deep Learning course, and the best part is that that's in Python so I can immediately start using what I learn as I learn it

viscid aspen
#

Yeah, I've been doing a lot of work in pandas for uni lately, and jupyter (even jupyterlab) gets annoying quickly. I like VS Code better for how clean it looks, but the variable inspector in spyder looks super useful. I'm assuming it's especially helpful for stuff like pd.Series and basically all numpy structures, since those don't have a nice HTML output in jupyter

fathom zenith
#

Absolutely. That's exactly what it's super helpful for. You can inspect your dataframe variables and arrays like they're in spreadsheets.

#

What are you studying in Uni @viscid aspen ?

spark summit
#

thanks @fathom zenith

#

how much is the course?

fathom zenith
#

you can do it entirely for free

#

but if you want a cert, it's $79

#

well worth it if you ask me

viscid aspen
#

@fathom zenith CompSci (Master's), this semester's project is ML-oriented with a focus on the modelling of a real world dataset (a pretty nasty one if you ask me)

analog musk
#

hi everyone, i'm developing an image classifier with pytorch and i'm getting an accuracy of about 80% on my test set. When i predict the image label, a few times i get totally wrong predictions. Is this normal, or am i supposed to get every prediction right even without having 100% accuracy?

small ore
#

How heavy or light are these spyder and R-Studio? I have an old spyder installation and I cant get it to use jupyter or ipython

lyric canopy
#

R-Studio is an IDE for R. Spyder is one for Python that looks very similar to R-Studio. That's also why a lot of people use it, because they were used to R-Studio.

#

I don't know how heavy they are

small ore
#

I read above something like R-Studio for python

lyric canopy
#

I think what was meant was "Spyder is like R-Studio, but for python"

#

"So it's [Spyder's] kinda like R-Studio for Python if you're familiar with it."

small ore
#

Ahh. Ty for the clarification

polar acorn
#

Coming from R myself Spyder felt very familiar but also a bit unpolished compared to PyCharm which now has a scientific mode. Haven't looked at Spyder for a year now though so it might have improved.

small ore
#

A scientific mode? Is that available on the community edition?

polar acorn
#

Oh I'm sorry only on pro

woven tundra
#

Jupyter is extremely beginner friendly though, you should start off with that

#

I can't wait until jupyterlabs goes off beta into launch, but right now I just don't want to risk it messing up my analyses

lean ledge
#

Seconding Jupyterlabs, it's excellent

#

While we're on the topic, I believe Ng's course is rather bad and extremely mathematically shallow. I wouldn't recommend it. Being MATLAB focused isn't a plus either

#

Sort of lacking in theory

#

Columbia's course on edX is pretty good and is more language agnostic. Gives a much better and stronger grasp at fundamentals

polar acorn
#

While that is true, the course can be very good depending on where you come from and what you want out of it of course. So first of all one should maybe think about ones goals and background.

long gate
#

I want to learn a lot about algorithms and data structures with a focus on python. I have already completed an algorithms and data structures course in C, but want to read up on them with python implementations.
Can anyone recommend a resource for data structures and algorithms with python? I don't really care about the medium, if it's a printed book i'll buy it 👍 Video tutorials work great too 🙂

#

Just want to focus on the algorithms and data structure, for now i want to avoid diffrent modules like numpy and pandas

lean ledge
#

@long gate data science isn't data structures and algorithms but look at CLRS which is all written in pseudocode. There's also elements of programming interviews python edition but I recommend the former.

#

CLRS is 👌

long gate
#

Okey was mistaken then, but is this the book you meant? https://en.wikipedia.org/wiki/Introduction_to_Algorithms

#

@lean ledge

#

Nvm saw in the wikipedia page that it's called CLRS, thanks 👍

fathom zenith
#

@lean ledge I agree that it's mathematically shallow, but it was never meant to be an academic in-depth course on machine learning and was instead meant to introduce people to it and quickly give them a few tools they can practically use in the workplace.

#

Because unless you're in academia or R&D, you won't really be delving deep into the math all of the time. Most libraries abstract away the math after all, but knowing what you're doing instead of treating every model like a black box and choosing the one with the highest accuracy really helps when using ML to solve problems. Especially when you have to explain your solution to your colleagues.

#

But it all comes down on how you like to learn and what you like to learn. If you're mathematically-oriented, it's the wrong course for you because he just explains the intuition behind the formulas and algorithms.
If you like programming though and you want to quickly learn how to do some basic data science then it's absolutely the right course for you.

#

It comes down to your motivations, a really high level of math can be overwhelming to most people. But building little programs and watching them work is an absolute delight.

#

All in all, it's a great primer. So I wouldn't go as far as saying it's "bad". There's a reason more than 100,000 people have done it.

#

@viscid aspen what's the dataset, if you don't mind me asking

viscid aspen
#

@fathom zenith wifi sensor time-series from a large European airport

#

There's a contractor that labels certain passengers with complicated rule engines etc. We're trying to model it and apply some basic ML

lean ledge
#

@fathom zenith I think any decent data scientist will need a good understanding of the maths behind things. There's a reason almost all DS have at least a master's. Perhaps it's a good course for someone in HS or someone who only wants to do ML for fun but if it's a serious career goal, you very very likely should be familiar with the mathematical understanding.

fathom zenith
#

Absolutely @lean ledge, but that's for data scientists. There are technology executives, data analysts, and hobbyists, and programmers with passing interests. Multiple types of users all of whom would find Andrew Ng's course a brilliant introduction to the world of machine learning.

#

But I'd like to ask you, with all of the layers of abstraction today, how vital is it to get to the absolute grass roots of the math behind ML? Especially if what you're primarily doing is commercial work within a company where you're employing models (validated by the academic community) and not cutting edge work (i.e., coming up with your own models)

lean ledge
#

It's very very much vital. Again, most data scientists have a master's or PhD for a reason. Being able to choose an appropriate model for the task and the data requires understanding of both the data, the techniques and their intricacies. If you ask industry, they'll tell you there's not enough data scientists etc. That's not because there's not enough people interested in it. It's because there's too many people with shallow understanding and lack of people who've gained the understanding of ML techniques. This isn't in academia, this is according to industry.

#

Data scientists generally do not make their own cutting edge models

#

They're specialists in choosing them

#

If you don't understand how your data is shaped, how higher dimensionality affects models, how your network's topology affects it, how different architectures differ, etc, most of which you can't understand without a solid grounding in the mathematics, you can't choose the best model

fathom zenith
#

I agree with you

#

But I don't agree that Andrew Ng's course is "bad" simply because its mathematically shallow

#

Different people need different introductions

#

And Andrew Ng's is one type of it

#

You don't need a strong mathematical grasp to get started, to go deeper you do absolutely. But to get started and playing around with it? Absolutely not. Interest is sustained when you can see what you do, not in abstracts. But then again, that's just my opinion

#

As for how vital math is in commercial work, I'd have to agree to disagree there. I work with practitioners in the field, and the pressure on them to stick to deadlines and deliver fast is immense. That means you don't have the time to think through everything, it's all about pushing that MVP out and if it works, it works. Is it good data science? Hell no. Is it effective? Heck yes.

#

I'm not saying you don't need to know any of the math. Of course you do! But you don't need to go incredibly in-depth.

#

A fundamental understanding should be enough to get started

lean ledge
#

The vast majority of companies I know look for mathematical talent. I was hired not because I was a good programmer but because I was great at maths and physics. Many companies finish off their job adverts with "candidates from maths, physics and electrical engineering do the best". Not a single one of my coworkers would agree with you in this case

#

I honestly would suggest someone do a statistics degree rather than a CS or whatever degree if they want to be a data scientist

#

Data science is 99% maths, 1% programming

fathom zenith
#

Well, that depends on the nature of your job of course. But there are far more users of machine learning knowledge than just data scientists, and they don't necessarily need to understand all of the math behind the models.

But in summary my argument is that while that course may not be mathematically dense, it's still not "bad" for many, many people who use that knowledge in various ways.

#

So let's just agree to disagree 😃

ocean crag
#

its popping

small ore
#

Raggy, fyi, Andrew Ng has topics on how to pick models etc. later in his course. Also, many times people who know all the math formulae and derivations fail to use their own intuition. I believe you are in the top 1 percent if you can just look at those formulae, get the intuitions, and apply them yourself. Obviously Andrew Ng sounds frivolous to you. I believe for most people it is less stressful to go his way first and then maybe look into the math. It becomes easier to understand the math when you already know some applications and are looking to understand the nuances

lapis sequoia
#

@lean ledge and @fathom zenith a master's or PhD is not really required to be a data scientist. I spoke to a family friend who is a data scientist and he has a bachelor's in biology. So, yeah.

lean ledge
#

@lapis sequoia I am a data scientist. It is not required, but the vast majority have a higher degree and it makes it much easier to develop the skills you need.

#

@small ore He does not go into detail at all and skips over a lot of important things

#

And can we stop this idea that learning the maths will mean you're somehow incapable of actual practice????

#

I'm not telling you to memorize formulae, that's not learning maths. With maths comes a deeper understanding of what the data and models convey

#

Your mathematical intuition enhances and complements your normal intuition. Learning maths involves learning how things work not memorizing. I believe if you think otherwise that you have a fundamental misunderstanding of what maths involves.

lyric canopy
#

I think the problem @lean ledge is highlighting is a major one in statistics as well. This comment on Reddit to an article on statistical fraud illustrates why: https://old.reddit.com/r/statistics/comments/9toukm/1_in_4_biostatisticians_say_they_were_asked_to/e8ymwn9/?st=jo16k8bw&sh=9da7ca00

What happens a lot in academia is that many of the users of statistics, so your typical empirical researchers, don't actually know all that much about statistics other than the introductory applied stuff they were taught while getting their degree. This leads to tunnel vision: people either shape their problem/research question to fit the model they know ("to a man with a hammer everything's a nail") or, worse, shape their data to fit the model. What's still worse is that, when they do actually apply a model, they don't realize what the limitations are, what violations of the assumptions mean for the interpretation of the results, or even what it means to delete a few "odd" data points from the data set. Some don't even really know why it's bad to choose a technique based on the results it gives you, i.e. preferring model B over model A because model B gives you the results you were looking for (as described in that comment I linked above).

#

At the end of the day, what this means is that the number of Type I errors in academia increases (so, the false positives, one of the reasons behind the replication "crisis") and/or you get models that don't actually do what they're claimed to do. Once it becomes clear that all those results that were once taken as facts turn out to be misleading or wrong, people lose their trust in the methodology and in statistics itself, while it was really the interpretation and the application where stuff went wrong. I think that's a danger for ML as well: if people start using models wrong, claim that they do things they don't actually do, and start suggesting decisions based on misleading results, the confidence in ML will diminish like the confidence in statistics has diminished in some fields of academia.

In a way, this is already happening: the recent amazing AI tool for recruitment that was scrapped because it turned out to be very biased is one of those examples that could lead to diminishing trust in the techniques, while, in fact, the way the model was trained was the problem, not the fact that ML was applied.
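
To make the Type I error point above concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than something taken from the linked comment: the unit-variance z-test, the sample size, and the rule of deleting the one or two "odd" points per group that most oppose the observed difference. Both groups are drawn from the same distribution, so every "significant" result is a false positive.

```python
import math
import random


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    return math.erfc(abs(diff / se) / math.sqrt(2))


def drop_odd_points(a, b, k):
    """Delete the k points per group that most oppose the observed difference."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    if sum(a) / len(a) >= sum(b) / len(b):
        return a_sorted[k:], b_sorted[:-k]  # drop a's lowest, b's highest
    return a_sorted[:-k], b_sorted[k:]      # drop a's highest, b's lowest


def false_positive_rates(trials=2000, n=30, alpha=0.05, seed=0):
    """Return (planned-analysis rate, pick-the-best-analysis rate) under the null."""
    rng = random.Random(seed)
    honest = flexible = 0
    for _ in range(trials):
        # Same distribution for both groups: there is no real effect to find.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        p_planned = z_test_p(a, b)
        # The "flexible" analyst also tries deleting 1 or 2 odd points per
        # group and keeps whichever analysis gives the smallest p-value.
        p_trimmed = [z_test_p(*drop_odd_points(a, b, k)) for k in (1, 2)]
        p_best = min([p_planned] + p_trimmed)
        honest += p_planned < alpha
        flexible += p_best < alpha
    return honest / trials, flexible / trials


if __name__ == "__main__":
    honest_rate, flexible_rate = false_positive_rates()
    print(f"planned analysis:   {honest_rate:.3f}")
    print(f"pick-the-best-of-3: {flexible_rate:.3f}")
```

The planned analysis should land near the nominal 5% rate, while keeping whichever analysis "works" pushes the false positive rate well above it, even though no individual step looks like cheating.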

lean ledge
#

^^^

#

That's just one of the issues with doing ML without having a mathematical background

lyric canopy
#

Yeah, probably. It's just one of the issues with researchers not actually knowing statistics, too.

lean ledge
#

What I said comes down to this. Your models are mathematical models. You can't understand them if you don't understand the maths. The idea that somehow you can know a lot of maths but not have intuition for the model is honestly just ignorance of what data science involves. Data science, including in a commercial setting, involves a lot of maths, and it's simply not possible to consistently guess the right model every time without any understanding of the maths

woven tundra
#

I have a question and maybe I'm going to sound super ignorant but let's assume we've got a dude named Bob who has supposedly taken a whole bunch of Coursera courses in ML and DL and all the other "L"s.

#

Bob builds a recommendation engine for his website hacking together models based on what he's learned. Bob doesn't understand the math to a great level but since there are so many great libraries out there abstracting away a lot of the detail, he just uses them to do his job.

#

And it just so happens that it works and more people are buying things on his website because they're getting good recommendations

#

So really a black box method, he knows what goes in, he has very little knowledge of what's happening inside, but what he's getting, the output, is working for him

#

I know that's not proper data science, but is that in general a bad way of approaching ML?

lean ledge
#

In general, yeah. Now that Bob has made what he's made, he likely doesn't know how to make it better. He doesn't understand why exactly it works, which model he might want to swap out for another, or in what cases his model fails. Were he to try something else, there is no guarantee his previous strategy would work and no guarantee that his knowledge would transfer over.

#

Not being able to make your model better, or not knowing what's holding it back, is a problem. So is not being able to transfer over to another scenario due to lack of understanding.

#

Yes, it has worked apparently well in this scenario and that's great! Bob can continue using this model as it is if it works well enough for him. However, that method tends to not work very well in general

fallow summit
#

Hello!
Can someone recommend how to get started with Machine Learning? I'm the kind of person who likes to start on a practical project and learn while doing it. That's basically how I learnt a lot of programming concepts. For the past few days I've been trying to dip my toes into machine learning, but all the learning resources I've found were hours and hours of theory instead of building something. I'm good at math even though I'm still in high school, and I'll be fine learning concepts that I don't understand yet. I've got no problems at all with programming in Python or something else. Can someone recommend a good way to start my journey with machine learning?

#

I've got no problems learning theory, but I want to get started and see "that something is happening" and learn all the concepts needed while building something.

misty sonnet
#

I'm actually really interested in an answer for this ^^^

lean ledge
#

/r/LML has a few resources for people who don't really care about learning ML but just want to build stuff, listed in their hacker's guide: https://old.reddit.com/r/learnmachinelearning/wiki/getting_into_ml_hackers_guide

#

There's also a high schooler's guide for someone who wants a career in the field https://old.reddit.com/r/learnmachinelearning/wiki/getting_into_ml_high_schoolers_guide

#

@fallow summit @misty sonnet

stable tinsel
#

@fallow summit #1 pick a problem you want to solve w/ ML

#

#2 make a dataset

lean ledge
#

#1 choose a dataset on kaggle
#2 cry

stable tinsel
#

yeah i mean building the dataset is like 90% of the battle

#

thats really all that matters

lean ledge
#

Worst thing about working with non-tech, rather old-fashioned client companies is that the data they have is the biggest pile of crap

stable tinsel
#

"yeah all we got is freeform natural language can you make this work?"

lean ledge
stable tinsel
#

i dont even know what that means

lean ledge
#

I'm not sure anyone but the guy who wrote it does

#

My last client company was based in chile and sent me all their data in spanish too

stable tinsel
#

lol

#

sniadek just say it pls

#

whats on ur mind

fallow summit
#

Thanks for your reply!
Don't get me wrong, I want to learn as much as I can about machine learning and Data Science. I'm not aiming for a job anytime soon. I'm just fascinated by what can be accomplished with it. I just know my strengths, and I know that jumping into a lot of theory at the beginning and not seeing anything happen will push me away from this field. Usually when I'm learning something new I try to build something and learn while building it. I couldn't find any good resources for getting straight into a machine learning project and learning along the way, so I came here to ask. Atm I would like to build something that can play a video game, some really basic game.

#

😄

stable tinsel
#

there's a lot of research on atari being published

#

do you want to start with an agent that plays atari games?

fallow summit
#

Yup!

lean ledge
#

Basic machine learning requires maths that is usually taught in early 2nd year of uni in the US, late first year in other places. It's hard to get into it in high school when you don't understand multivariate calc or linear algebra

#

You can try Andrew Ng's course, but despite it being rather shallow, I still don't recall it going over basic multivariate calc