#data-science-and-ml
no
several strain gauges in different positions
the strain sensor has a temperature sensor, it's a fiber optic one
You can add relative co-ords as regression variables
Can you explain more
I'm very beginner
Yesterday I tried PCA with regression
I got good prediction from 2 sensors out of 15
I am a beginner of some sort too. I am learning. If you take the first couple of lessons from a machine learning course (perhaps Andrew Ng's), you will know how to do multi-variable regression. We can wait for someone else to answer you on PCA. Meanwhile I will try to learn what it is
Learn about Principal Component Regression. A step-to-step tutorial to build a NIR calibration model using Principal Component Regression in Python.
I followed that article
maybe its interesting for you
Most certainly interesting. Now I need to read up on other methods to reduce variables too
With Scikit learn, is there a way to specify batches for online learners (like MLP)?
e.g.
clf = MLPRegressor(solver='lbfgs', alpha = 1e-5, hidden_layer_sizes = (25,25,25,), random_state = 11)
clf.fit(x_train, y_train)```
```MLPRegressor(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(25, 25, 25), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=11, shuffle=True,
solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)```
And what I want to do is:
Batch 2: trains on 10-150 from x_train
Batch 3: trains on 9,2,159,37 from x_train
etc.```
just hl me if you know
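On the batching question: as far as I know, `solver='lbfgs'` is a full-batch optimizer in scikit-learn, so hand-picked batches need one of the stochastic solvers (`'adam'` or `'sgd'`) together with `partial_fit`. A minimal sketch, assuming synthetic data in place of `x_train` / `y_train` and made-up index lists for the batches:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for x_train / y_train (assumption: 200 rows, 3 features).
rng = np.random.default_rng(11)
x_train = rng.random((200, 3))
y_train = x_train.sum(axis=1)

# partial_fit is only available for the stochastic solvers ('sgd', 'adam').
clf = MLPRegressor(solver="adam", alpha=1e-5,
                   hidden_layer_sizes=(25, 25, 25), random_state=11)

# Hypothetical batches: each entry lists the row indices used for one call.
batches = [
    np.arange(0, 10),
    np.arange(10, 150),
    np.array([9, 2, 159, 37]),
]

for idx in batches:
    # Each call trains on exactly these rows and keeps the weights from before.
    clf.partial_fit(x_train[idx], y_train[idx])

predictions = clf.predict(x_train[:5])
```

Calling `fit` instead would restart training from scratch each time (unless `warm_start=True`), which is why `partial_fit` is the usual choice for this kind of manual batching.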
@simple fjord Chapter 3 of https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf seems to have some methods for it
@small ore Thanks so much, that's a very good book
Hello
My neural network is taken off of the internet because I'm a beginner and wanted to play around with it
But, I have a problem: All the neural network samples like this that I take off the internet use a sigmoid function which apparently means that the output can only ever be between 0 and 1
Furthermore, even if the output should be between 0 and 1 it doesn't seem to be giving correct responses
For example, [-3,-2,-1] would hopefully give 0
but instead it gives somewhere around 0.22
what is that list of numbers
smoothstep or something? like what are you doing with those three args and how do they interact with sigmoid
oh and the default or 'centered' value concerning sigmoid functions with an input of 0 is .5, not 0
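To make the sigmoid point concrete, here is a small sketch of the logistic function on its own (not the network from the question, which isn't shown). It only illustrates the output range and the value at 0:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The output rises smoothly from near 0 to near 1 and passes through 0.5 at x = 0.
for x in (-3, -2, -1, 0, 1, 2, 3):
    print(x, round(sigmoid(x), 3))
```

Because a sigmoid output never leaves (0, 1), a target such as 4 cannot be reached directly. The usual options are to scale the targets into that range or to use a linear output layer.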
Well
I'm just testing it at the moment
and giving it test runs where [1,2,3] = 4
and [3,2,1] = 0
stuff like that
you know order matters right
or wait nvm
u just showed me lol
@radiant notch https://en.wikipedia.org/wiki/Vanishing_gradient_problem check this out, rly interestin stuff
I am using this github project and it is not working i am getting this error Traceback (most recent call last):
File "create_dataset.py", line 9, in <module>
from predict import predict
File "D:\AI\Game\Game-Bot-master\predict.py", line 3, in <module>
from scipy.misc import imresize
File "C:\Users\Fidgety\Anaconda3\envs\tensorflow\lib\site-packages\scipy\__init__.py", line 62, in <module>
from numpy import show_config as show_numpy_config
ImportError: cannot import name 'show_config' https://github.com/ardamavi/Game-Bot
!t resources
It can be difficult to know where to begin when you are first starting out with Python. On our website, we have compiled a list of both free and paid resources that we recommend for learning and mastering Python.
It is hard to say exactly where you should start, as everyone will have a different preferred method of learning, but whether you like video tutorials, books or courses, you should find a suitable resource on our resources page
hey guyss
i am trying to learn data science using python
i have learnt how to use numpy
and some plotting
that was a free course on DataCamp
now where should i head?
guide me
I'm planning on making a neural network but the structure of one is obviously critical
So how can I get the neural network to change its own structure? Is this even possible? I don't want the network to be limited by my bad structure... if it is bad.
@radiant notch if we could do that well, we'll have cracked the ai problem
Surely there's a neural network that adapts in structure?
You'd just need to experiment with different structures when backpropagating?
What is difference between SGD and BGD in linear regression..... How are the training examples visited in both cases... Can someone please explain me in brief?
SGD = one update per single training pair. BGD = one update per many training pairs (batch)
Also my phone is correcting SGD
and BGD for some reason
Thank you for your response. Let me clarify: in BGD, suppose I have selected the whole training set as a batch. Does that mean the average over the same set of examples does one update in every epoch? And in SGD, do you mean that in each epoch I will consider only one training example at a time?
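A small sketch may help with how the examples are visited. It assumes a toy one-weight model `y = w * x` with made-up data; the learning rate and epoch count are arbitrary. Batch gradient descent averages the gradient over the whole set and updates once per epoch. Stochastic gradient descent still visits every example in each epoch, but updates after each one:

```python
import random

# Toy data for y = 2 * x (assumption: a single weight w, no bias term).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
lr = 0.05
epochs = 20

def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

# Batch gradient descent: average the gradient over ALL examples, update once per epoch.
w_bgd, bgd_updates = 0.0, 0
for _ in range(epochs):
    g = sum(grad(w_bgd, x, y) for x, y in zip(xs, ys)) / len(xs)
    w_bgd -= lr * g
    bgd_updates += 1

# Stochastic gradient descent: shuffle, then update after every single example.
rng = random.Random(0)
w_sgd, sgd_updates = 0.0, 0
order = list(range(len(xs)))
for _ in range(epochs):
    rng.shuffle(order)
    for i in order:
        w_sgd -= lr * grad(w_sgd, xs[i], ys[i])
        sgd_updates += 1

print(bgd_updates, sgd_updates)  # 20 updates vs 80 updates
```

Both runs see the same data the same number of times. The difference is only how many weight updates happen per pass.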
Kindly post the answers for Quizzes 2, 3 and 4 in Applied Social Network Analysis in Python from Coursera
The course in the specialization Applied Data Science in Python is extremely abstract and challenging, the tutor extremely vague. My subscription to the specialization ends in less than 8 hours and my $49 USD will go in drain if I don't secure this specialization. Kindly post the answers for the quiz questions.
Quiz 2: https://www.coursera.org/learn/python-social-network-analysis/exam/tZYRH/module-2-quiz
Quiz 3: https://www.coursera.org/learn/python-social-network-analysis/exam/0qgIf/module-3-quiz
Quiz 4: https://www.coursera.org/learn/python-social-network-analysis/exam/CgIV0/module-4-quiz
1000+ courses from schools like Stanford and Yale - no application required. Build career skills in data science, computer science, business, and more.
asking for course answers?
without other options
because it is inaccessible at least for me in spite of trying to get my head around it for the past 8 hours
you do realize we cant see those without logging, right?
my bad.. would you be able to help me if i send you the pictures?
You aren't asking for help but for straight up answers
it would be great if you would be able to help me find the answers as well. understand i have reached this stage without other options
As stated in #303906096458891264, not gonna happen.
Can someone please respond to my previous question?
no issues .. lemme try it would be great if someone can just help me with the questions
You can ask specific questions in one of the designated help channels.
!ask
!t ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
read that page for more info
the only problem i am grappling with is:
the tutor provided juvenile code to perform the functions and the quiz has got a network visual which is really difficult to replicate and also he did not teach the ways in which we can find stuff like node connectivity manually
it would help if anyone can provide a workaround so that I can climb this myself
Carte blanche tutoring giving you code with 8 hours to go still is not the purpose of this server
If you have a specific question then you can proceed to an unused help channel as previously directed
I think it is against Coursera honor code to ask anyone on this planet for quiz answers. If you are not going for the certification you could even skip the quiz
So he could potentially ask people on the international space station or on Mars for help? Or is in orbit considered on the planet ( I never read the coursera honor code, I just clicked agree)
Hi from where i can learn sci kit for data science
i watched a video
but i am not getting it
help me
@chrome spade i know u
u r justin from pramp server
@chrome spade no advertising
@solid chasm Wasn't trying to advertise. Is there a place on this server I could post about a paid project?
@lapis sequoia that is correct. Hello!
nothing comes to mind, no, sorry
@rich yarrow you might find help here about that problem
and you can get some resources as well when you ask
Hello, I am wondering where I can learn more on how to "Use NumPy and Matplotlib to draw a scatterplot of uniform random (x, y) values all drawn from the [0, 1] interval"
How do I use both NumPy and Matplotlib to make one scatter plot..
import numpy as np```
What does the following mean?
Fixing random state for reproducibility
np.random.seed(19680801)
semi-random numbers are based off of a seed
If you set the seed, you'll get the same random numbers everytime
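A quick way to see the seed behaviour for yourself (a minimal sketch using the same seed value quoted above):

```python
import numpy as np

np.random.seed(19680801)
first = np.random.rand(3)

np.random.seed(19680801)   # resetting the seed restarts the same sequence
second = np.random.rand(3)

print(np.array_equal(first, second))  # True: identical "random" numbers
```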
import matplotlib.pyplot as plt
import numpy as np
-
How to generate a random number in the range [0, 1]
import random
x=random.randint(0,1)
print(x) -
How to do that for two dimensions (x/y)
-
How to show that on a plot
Is anyone able to help me with this?
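For reference, a rough sketch of how NumPy and Matplotlib fit together here. Note that `random.randint(0, 1)` only ever returns the integers 0 or 1, while NumPy's generator gives floats in the half-open interval [0, 1). The point count and output file name are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(19680801)  # seeded for reproducibility
x = rng.random(100)   # 100 floats drawn uniformly from [0, 1)
y = rng.random(100)   # a second, independent draw for the other axis

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Uniform random points")
fig.savefig("uniform_scatter.png")  # or plt.show() in an interactive session
```

NumPy produces the two coordinate arrays, and Matplotlib only draws them.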
That sounds extremely assignmentish
Does anyone have signal processing experience?
Would someone please help me with calculating the time delay between two signals ?
the data set is there in the post
I can't find the peaks and shift the signal using the max peak index
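One common alternative to peak picking is cross-correlation: slide one signal over the other and take the lag with the highest correlation. A sketch under stated assumptions: the posted data set isn't available here, so the two signals are synthetic, and the sampling rate `fs` is made up:

```python
import numpy as np

# Synthetic stand-ins for the two signals (assumption: both share one sampling rate).
fs = 100.0                                      # samples per second (hypothetical)
t = np.arange(0, 5, 1 / fs)
reference = np.exp(-((t - 1.0) ** 2) / 0.01)    # a pulse centred at t = 1.0 s
true_shift = 37                                 # delay in samples
delayed = np.roll(reference, true_shift)

# Remove the mean so a constant offset does not dominate the correlation.
a = delayed - delayed.mean()
b = reference - reference.mean()

corr = np.correlate(a, b, mode="full")
lags = np.arange(-(len(b) - 1), len(a))         # lag in samples for each entry of corr
lag_samples = lags[np.argmax(corr)]
delay_seconds = lag_samples / fs

print(lag_samples, delay_seconds)
```

A positive lag means the first argument lags behind the second. With noisy real data the peak can be broad, so it is worth plotting `corr` against `lags` before trusting the number.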
Is there a place I can go to learn about making python AIs/machine learning?
pins should have a few sources iirc
Andrew Ng on coursera
That is not python though
the original message was about AI / ML
should i just ask the same here? not well versed in discord etiquette, sorry
sorry again, got the meaning of the emoji. used to irc kek
I like math when explained like this: https://www.youtube.com/watch?v=FgakZw6K1QQ
Principal Component Analysis, is one of the most useful data analysis and machine learning methods out there. It can be used to identify patterns in highly c...
posted in UI too, but I've been working on a new way to do data science stuff with IPython/Jupyter, using graph nodes to connect IPython notebooks with data sources / control flows
That looks super cool, whoa
Matrix looking
I'm doing a lot of similar Jupyter notebooks for testing some models. I would like to print the same metrics for the results every time, but don't want to copy paste those cells into every notebook. I will of course write an external function and import to all notebooks. But is there a way for this function to output things in several cells? I would like something like this:
cell 1 {
from external_helpers import print_results
print_results(y_hat, y)}
cell 2 {
Print accuracy in %
}
cell 3 {
Plot of y and y_hat
}
cell 4 {
Print a confusion matrix
}
etc. etc.
Hi
would someone help me with this please ?
@polar acorn That's a good use case for https://github.com/nteract/papermill
@proud raven Thanks, I'll check it out
Hey, how could I make something like
sensors[any(sensors['Zone'].str.contains(sensor) for sensor in relevant_sensors)]
work in pandas ?
relevant_sensors is a list of strings. (The error I'm getting is about the ambiguity where normally I'd need a bitwise operation instead of a boolean one)
Maybe something like sensors[sensors['Zone'] in relevant_sensors]? and index is for the first one
Without seeing the data, or trying it.
I have to make sure to use str.contains cause I want to match e.g. SENSOR_1.203 while there's only SENSOR_1 in relevant_sensors
What is your df storing
and what are you trying to get with that slice? The first one that is in relevant sensors?
so let's say there's SENSOR_1, SENSOR_2, SENSOR_3 in my relevant_sensors but the values in sensors['Zone'] would be something like SENSOR_1.201, SENSOR_1.202, SENSOR_2.001, SENSOR_5.1234. The result should therefore be a DF sensors where sensors['Zone'] values are only the first 3 (everything except SENSOR_5.1234)
does that make sense?
yep, sorry
What do you get if you try my suggestion above then?
I get The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
(error)
just to clarify... the relevant_sensors list is a primitive python list of strings and the column sensors['Zone'] has values of the pandas object type
@placid snow I can do what you're suggesting if I convert relevant_sensors to a np.array and then do sensors[sensors['Zone'].isin(relevant_sensors)] but that gives me only the values that are equal to some string in relevant_sensors, not those that merely contain that value
Did a bit of testing and googling, but what about something like this (similar to yours) ```py
df = pd.DataFrame()
df["Zone"] = ["SENSOR_1.201", "SENSOR_1.202", "SENSOR_2.001", "SENSOR_5.1234"]
relevant = ["SENSOR_1", "SENSOR_2", "SENSOR_3"]
df[df["Zone"].str.contains("|".join(relevant))]
Zone
0 SENSOR_1.201
1 SENSOR_1.202
2 SENSOR_2.001```
but it doesn't use any, and it joins all the results in relevant to an or like regex expression
Well, it gets the job done. Thanks a lot.
So .str.contains can be used with a regex?
(I assume?)
That's what i gathered from it
That it requires a regex to search with
so we just create a regex which is basically this or this or this
would still be useful to know if I can somehow put my own functions/logic inside the [] of a dataframe, like I wanted to do with any(). This time there was a workaround but sometimes there might not be.
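On putting your own logic inside the `[]`: anything that produces a boolean Series with the same index works as a mask, so `Series.apply` with a plain Python function is one general route. A sketch using the sample values from above, plus the regex variant with `re.escape` so that characters such as `.` are matched literally:

```python
import re
import pandas as pd

sensors = pd.DataFrame(
    {"Zone": ["SENSOR_1.201", "SENSOR_1.202", "SENSOR_2.001", "SENSOR_5.1234"]}
)
relevant_sensors = ["SENSOR_1", "SENSOR_2", "SENSOR_3"]

# Option 1: arbitrary Python per value. apply() returns a boolean Series used as a mask.
mask = sensors["Zone"].apply(
    lambda zone: any(name in zone for name in relevant_sensors)
)
by_apply = sensors[mask]

# Option 2: the regex route, escaping each name before joining with "|".
pattern = "|".join(re.escape(name) for name in relevant_sensors)
by_regex = sensors[sensors["Zone"].str.contains(pattern)]

print(by_apply)
```

Both options are substring matches, so `SENSOR_1` would also match a hypothetical `SENSOR_10.5`. If that matters, compare `zone.split(".")[0]` against the list instead.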
alright, thanks for the help
Anytime
What is the most easy to use neural network lib? I want something like
```py
import nnetwork as nn
MyNet = nn.layers(input=3, hidden=1, output=2)
MyNet.train([[[0,1,0],[0,1]], [[1,0,0],[1,0]]])
print(MyNet.calculate([0,0,1]))```
keras maybe
deepy?
Actually, pyBrain looks now the easiest I found.
>>> net = buildNetwork(2, 3, 1, bias=True, hiddenclass=TanhLayer)
>>> trainer = BackpropTrainer(net, ds)```
It is actually easier than I wanted! Cool!
scikit-learn also has a limited selection of networks that is very easy to get started with
@dreamy tapir Keras.
It's really a lot easier than it seems
You just like throw your layers In like
a=Layer()
B=layer()(a)
....
Out = layer()(last layer)
Model.compile(Out)
I also have a wrapper I've been working on to simplify the generation of networks slightly for genetic optimization but it's kind of on hold at the moment
Then to train you just write up a generator or use one of the built in ones and pass it into model.fit_generator() (or model.fit() in newer Keras versions)
iirc anyways, it's been a bit since I've worked with it
Really?
I'll take a look on the documentation
I don't understand... I'm not so advanced...
And it looks weird
You can look at this https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
Keras is a powerful easy-to-use Python library for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow and allows you to define and train neural network models in a few short lines of code. In this post...
I still don't understand
But I understand pyBrain
And I know synaptic.js perfectly and that's extremely easy and powerful.
Just look at it. https://github.com/cazala/synaptic it is so easy and predefined
And it's a pleasure the work with that library
Omg!
I found it http://jon--lee.github.io/neuralpy/
My new website
It's perfect
But doesn't work on python3
i'm just a beginner in ML and data science... i wanted to start working on some small projects so that i can improve my skills. could you guys please suggest some projects to start with, or help me find a beginner project
what's a good graph edge container?
i was thinking a dictionary like
{0:1, 0:2, 1:2, 2:3}
but you can only have unique keys...
so can you do {0: [1,2]}
of is it better to use {0:(1,2)}
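For what it's worth, a dict mapping each node to a collection of neighbours (an adjacency list) is the usual pure-Python answer. A minimal sketch using the edges from the example. It assumes an undirected graph, and the variable names are made up:

```python
from collections import defaultdict

# Repeating a key in a dict literal silently keeps only the last value.
overwritten = {0: 1, 0: 2, 1: 2, 2: 3}
print(overwritten)  # {0: 2, 1: 2, 2: 3}

# Adjacency mapping: node -> set of neighbouring nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
graph = defaultdict(set)
for src, dst in edges:
    graph[src].add(dst)
    graph[dst].add(src)   # assumption: undirected; drop this line for a directed graph

print(dict(graph))
```

A list or set is handy as the value because edges can be added and removed later. A tuple is immutable, so it only fits if the edges never change.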
What about pandas / numpys ?
what do those have?
Depends what you need, but plotting the example data you gave would look something like
but thats 2 different graphs for each column, you could most likely swap the numbers around to get the desired effect from that
@keen pivot
Could use a graph DB as well
that's not the kind of graph i mean @placid snow
A dict would probably do for small things. Any bigger and I'm sure you'll find plenty of packages for that.
okay.
I'm trying to figure out whether to use a tuple, dict, or list inside the dict as well.
I'm creating something that would be described as a multilayered graph
with a main node that has edges to other main nodes
and within each node, there's a list(or dict?) of embedded nodes that may have edges to one or more other embedded nodes either in the same main node or others.
Tried Neo?
I've not
It's a neat graph database
this makes a lot of sense.
I worry this may be a bit too heavy for what I'm looking to do.
I'm trying to make an application that runs concurrently while another larger application is running.
and based on what that application writes to a file, manipulate the nodes in my graph.
I've been implementing my graph myself purely in python.... I don't imagine there being more than 100 main nodes.
It may be a little overkill but it's neat to learn if you're going to be doing more stuff with graphs in the future
Okay.
@flat thistle The MNIST database is a classic project everyone does. If you want a more classic ML type, you can look at some basic regression or classification datasets on Kaggle
There's a few nice clustering datasets too to flex your DBSCAN and KNN muscles
I have a hard one for you, I want to learn about data synchronization
filesystems, binary data, python dictionaries
what are the tags / subject terms I should be looking for?
Those are some very different subjects. Python dictionaries are hashmaps with dynamically allocated arrays. Filesystems is big enough to be a course on its own but easily searchable. Binary data is a vague term that can mean anything. For any specific file format, it can be differently laid out so you'll have to search those up separately. Executable binaries have their own executable formats too such as ELF for Linux
@pliant mantle
elf is best binary executable, change my mind
@lean ledge
I'm talking about syncing binary data. Example, recognizing a bit has changed on another system and transmitting, byte 1235623451 has changed to 0b00001111
Oof, I don't know about that, can't comment
binary, file, directory, node
(@pliant mantle you might also wanna post this somewhere else since that isnt exactly data science)
@pliant mantle read /dev/sdX in binary mode and set up IPC between the two
not sure about writing though
@spark nimbus ayy
ayy
@pliant mantle maybe researching how ECC ram works? It's not exact but the error correcting there might translate over somehow
I'm interested in learning more about fairness/bias in machine learning (and trying to account for those things in a ml project on a dataset with text possibly), would anyone happen to have some directions on some good articles about this?
ooh on a related note, at a tech conference recently I met the person in charge of the AI ethics report for Australia. Australia's been a bit behind in terms of policy changes to accommodate AI and the ethical challenges it poses, so she was in charge of writing a report built from case studies and possible pitfalls for policy makers: what sort of ML models should and shouldn't be used, how they should be managed to take care of fairness and bias, data privacy of the people whose data is used in the models, etc. She did her PhD in neuroscience and worked in ethics at the neuroscience research institute, which is why she was given that role
just relevant and thought it'd be cool to share
Oh wow that's interesting!
Hello Everyone! I'm hoping someone could help me clear up this error I'm receiving? So I've been trying to learn the basics of Pandas and Matplotlib for Data Analysis/Science, and I've done OK so far. But whenever I get stuck its tough to find an answer since I'm new to Python as well lol here's my code:
and I'm getting the following error: AttributeError: 'NoneType' object has no attribute 'seq'
I have no idea what that means lol So on line #12 I create a list from values within a column and on line #23 I use it to set the xticks
Which line gives you the error?
!t traceback
Please provide a full traceback to your exception in order for us to identify your issue.
A full traceback could look like:
Traceback (most recent call last):
  File "tiny", line 3, in
    do_something()
  File "tiny", line 2, in do_something
    a = 6 / 0
ZeroDivisionError: integer division or modulo by zero
The best way to read your traceback is bottom to top.
• Identify the exception raised (e.g. ZeroDivisionError)
• Make note of the line number, and navigate there in your program.
• Try to understand why the error occurred.
To read more about exceptions and errors, please refer to the official Python tutorial.
I was actually able to figure that part out LOL but I do have a question resulting from this bit of code!
When I plot the graph itself, it prints out pretty small and scrunched up. Then I MANUALLY click+drag the bottom corner of the image then Save, and it saves more stretched out; making it easier to read the long x-axis. Why is that, and how do I get it to plot out with a width/length wide enough to accommodate a long axis? I've uploaded the two images.
Here's the pastebin for the code:
So basically I'd like to use the 'savefig()' option to save the figure already 'stretched-out', without me having to manually print out the plot then click+drag?
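The pastebin code isn't visible here, so this is only a sketch of the general approach. The figure size is set when the figure is created (`figsize` is in inches), and `savefig` then writes it at that size. The data, labels and file name below are invented:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a normal session
import matplotlib.pyplot as plt

labels = [f"category_{i}" for i in range(40)]   # hypothetical long x-axis
values = list(range(40))

# figsize=(width, height) in inches: make the figure wide up front.
fig, ax = plt.subplots(figsize=(16, 4))
ax.bar(range(len(labels)), values)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90)

# bbox_inches="tight" trims the margins so the rotated labels are not cut off.
fig.savefig("wide_plot.png", dpi=150, bbox_inches="tight")
```

The window you drag by hand only changes the on-screen size. Setting `figsize` (or calling `fig.set_size_inches`) before `savefig` gives the same result without the manual step.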
What did you not understand with using pandas? It seems to be the way to go with this, both for the reason it's asked for and your data goes well with it?
"list indices must be integers or slices, not str" posts this error message
so I did the first 3 without pandas but can't figure out how to get the year latitude and name of hurricane all in one
Are you trying to create a DataFrame per hurricane?
Seems to be a csv file you're loading. Pandas can create a single df from a csv file with pandas.read_csv(filepath)
And you can do most of this logic with the df itself
I dont see a problem description there. I onlysee some unfilled problems
Ah. It is in the picture. Difficult to even see the problem.
Okay so I retried the pandas approach and did the read_csv and got a data table. Do I add headers now so the data looks more organized?
Your csv file doesn't have them already?
You can write them in the csv, or in code, up to you. If they don't already exist
There is something like a header= argument for that method
No it looks really messy but im reading a tutorial now on how to clean it it looks a lot cleaner when access df.head()
when i print df though it comes out like a badly formatted text file
Show me
okay give me a sec
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued.
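As a small illustration of `header=None` together with `names=`, here is a sketch on an invented three-column CSV (not the hurricane file, whose layout isn't shown here):

```python
import io
import pandas as pd

# Hypothetical headerless CSV: name, year, latitude.
csv_text = "ALPHA,1990,28.0\nBETA,1991,31.5\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                          # the first line is data, not column names
    names=["name", "year", "latitude"],   # supply the column names ourselves
)
print(df)
```

This assumes every line has the same number of fields. A file that mixes summary lines and detail lines of different lengths will not parse cleanly this way and tends to produce NaN-filled columns.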
Although, going by just what I see ( I don't see the problem statement for problems 1 to 5, I might be missing something), they already have the data ready for you
Problem 1 was to find the amount of unique hurricane names problem 2 was most common hurricane name problem 3 was year with most hurricanes and then 4 is most northern hurricane and 5 is hurricane with maximum sustained wind
For finding unique names, there exists a data type that can only have unique values in them.
I mean is it required to use pandas? In the codeblocks above they have opened the file in a different way and allowed you to use it. Or is it some different data for those?
The length of a set of the names will be the amount of unique names
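The set idea in a few lines, on a made-up list of names rather than the assignment data. `collections.Counter` is shown as one standard-library way to count repeats:

```python
from collections import Counter

names = ["ALPHA", "BETA", "ALPHA", "GAMMA", "BETA", "ALPHA"]  # hypothetical names

unique_count = len(set(names))                      # a set keeps one copy of each value
most_common_name, occurrences = Counter(names).most_common(1)[0]

print(unique_count, most_common_name, occurrences)  # 3 ALPHA 3
```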
It's not required, I did the first 3 problems without pandas very easily. I just got lost on how to find the latitude and longitude values from it
So I tried adding more columns but it ends up piling multiple data into one column and none of the columns retrieve the latitude, after column 6 it all starts saying NaN
Where should I go to learn very basic machine learning, if I'm a quite beginner python programmer who has no idea how machine learning work?
@wary willow Check pinned!
@lapis sequoia If you have still not been able to solve your problem, try changing the file open code to:
records = []
with open(local_fname,'r') as f:
    for line in f:
        if line.startswith("AL"):
            record = line.strip()
            reports = []
            records.append((record, reports))
        else:
            reports.append([line.strip()])
scratch that. No need for that change.
You can just do:
print(max([float(rec.split(',')[4].strip()[:-1]) for rec in record[1]]))
how could I add a new column to a pandas dataframe, based on other data in each row?
say I have a function that takes each row's 'text' column and transforms the data and adds a new column value
should i just create a new column , then write a for loop that iterates over each row and fills in the column?
not sure if that is the most efficient way
hmm it seems like https://stackoverflow.com/questions/34962104/pandas-how-can-i-use-the-apply-function-for-a-single-column answers my question
@small ore you Ninja! https://javascript.info/ninja-code
i ended up following the using a loop method described here https://stackoverflow.com/questions/15118111/apply-function-to-each-row-of-pandas-dataframe-to-create-two-new-columns
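For comparison with the loop approach, a sketch of the `apply` route on a tiny invented frame. The `transform` function is a hypothetical stand-in for whatever processing the 'text' column needs:

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hello World", "pandas apply", "row by row"]})

def transform(text: str) -> int:
    """Hypothetical transformation: count the words in the text."""
    return len(text.split())

# Column-wise: call transform on every value of one column.
df["word_count"] = df["text"].apply(transform)

# Row-wise: axis=1 hands each row to the function, so several columns can be combined.
df["summary"] = df.apply(
    lambda row: f"{row['text'][:5]} ({row['word_count']})", axis=1
)
print(df)
```

`apply` still calls the Python function once per row. It is mostly tidier than an explicit loop, not dramatically faster.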
Link from nowhere without context is suspicious 
Alright so I did print(max([float(rec.split(',')[4].strip()[:-1]) for rec in record[1]]))
and it returned the highest latitude N of that hurricane. now to generate that for all of them, do I need to make a set out of the latitudes?
Well, I believe if you have done the previous problems you can figure out although this involves somewhat more head-scratching. This server does not encourage giving out answers to your assignments afaik. Think and let people here know what you have or where you are stuck. I will see if I can give you more hints tomorrow
!t ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
@lapis sequoia
lol okay. So some basic setup: I'm trying to plot two sets of data; one pulled from iexfinance and one from quandl
China = quandl.get_table('DY/IPA', ticker='000010')
China.sort_values('date', inplace=True)```
so that works. Okay, great. Now I tried graphing them separately.
fig, ax = plt.subplots()
ax.plot(China_date, China.close)
ax.set_title('SSE 180 Index daily price')```
and that works. Not gonna lie I'm pretty new to matplotlib / python so that might be a mess
and then this works
SPY['close'].plot()
plt.show()```
So i was looking through the matplotlib tutorials and it seemed to indicate that you could just do two plt.plot(data) and it would set them on the same chart
like this:
plt.plot([2,3,4,5], [10,9,6,4], label='line2')
plt.xlabel('price')
plt.ylabel('date')```
Which works fine, alright cool
SO why doesn't this work?
# these will plot separately but not together. I have no idea why.
plt.plot(China_date, China.close)
SPY['close'].plot()
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('China and SPY Comparison')
plt.show()```
I'm fairly sure it's something to do with the "China" data frame having a 'date' column that's converted to an object, while the SPY df has an index column just called date
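If the guess about the date types is right, one way to test it is to convert the `date` column to real datetimes and draw both series on the same `Axes` explicitly. A sketch with invented numbers standing in for the Quandl and iexfinance frames (the column names follow the snippets above, everything else is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins: China has a 'date' column of strings, SPY is indexed by date.
china = pd.DataFrame({
    "date": ["2018-01-02", "2018-01-03", "2018-01-04"],
    "close": [7800.0, 7850.0, 7820.0],
})
spy = pd.DataFrame(
    {"close": [268.8, 270.5, 271.6]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04"]),
)

# Make both x-axes the same type: real datetimes instead of strings/objects.
china["date"] = pd.to_datetime(china["date"])

fig, ax = plt.subplots()
ax.plot(china["date"], china["close"], label="China")
ax.plot(spy.index, spy["close"], label="SPY")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("China and SPY Comparison")
ax.legend()
fig.savefig("china_vs_spy.png")
```

Since the two price levels differ by more than an order of magnitude, a second y-axis via `ax.twinx()` or normalising both series may make the comparison easier to read.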
@earnest prawn any thoughts?
What do you mean by "doesn't work"
Is there a way to increase the size of jupyter inline figures without putting the figure in a scroll box?
I'm using plt.rcParams['figure.figsize'] = [15, 5]
but no matter what dimensions I make the figsize, I end up getting a scroll box and it's ugly
@silent current I think it's due to the browser resizing the img
hey guys, sorry for the extremely long question yesterday. I kind of tabled that issue as it wouldn't really make sense to graph anyway.
However, new question, I'm using Quandl to get the following:
SPY = get_historical_data('SPY', start=start, end=end, output_format='pandas')
and then plotting the close price
plt.show()```
but it's not plotting the date, or at least, it's not displaying the date. I'm kind of out of ideas on why / how to get it to actually display the figures
Any idea why conda install jupyterlab won't install the latest version for me ? I'm at 0.32 right now but there's already a 0.34, I can even see it in conda search jupyterlab
hello, im having trouble with memory usage on numpy
im working with a lot of matrices, usually 500+, and when they become about 11x190 +- i get a MemoryError
is there any efficient way to serialize/compress the matrices to a list or similar, so that i can just grab them by index and unserialize/decompress on demand?
much appreciated
What operations are you performing with the matrices? And how much RAM do you have?
I have 32 GB of RAM, I'm doing dot products mostly
But I managed to fix my memory usage during the operations. It now crashes just by having them stored
Are you sure you're not growing the matrices? 500 11x190 matrices shouldn't cause OOM issues
All elements are tuples, not sure if it influences
I'm almost sure but I can double check
What do you mean all elements are tuples
Are you doing dot-products on them all together?
Or can you open them from a dump and do it one by one and close as you go on?
I was doing them all together
Dumping to disk and reading when needed will greatly affect performance
That's why I wanted to know what options I have :/
Right now I'm checking pytorch, maybe i can use the tensors
Are you using 64 bit or 32 bit python?
As float64, your arrays combined take about 8.4 MB of memory (500 × 11 × 190 values at 8 bytes each)
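A quick way to check the footprint yourself, assuming plain float64 matrices of the sizes mentioned (500 of shape 11x190). The real arrays aren't shown, so zeros stand in for them:

```python
import numpy as np

# Stand-in data: 500 matrices of shape (11, 190) holding 8-byte floats.
matrices = [np.zeros((11, 190), dtype=np.float64) for _ in range(500)]

# nbytes reports the size of each array's data buffer.
total_bytes = sum(m.nbytes for m in matrices)
print(total_bytes / 1e6, "MB")  # 8.36 MB
```

If the elements really are Python tuples, the arrays have `dtype=object` and `nbytes` only counts the pointers, not the tuples themselves. The true footprint is then several times larger, and converting to a numeric dtype is usually the first thing to try.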
I forgot what the name is, but theres a module that (with a bit of overhead) splits up big work loads into smaller chunks that can be run on different hardware even
Theres a pycon talk about it alongside pandas and numpy iirc, maybe you can find it
wtf is data science, analysis.. etc
Read the channel description
@placid snow didn't answer my question^^
It's the work of analysing, categorizing, visualizing and in general read & understand large sets of data.
Say you wanted to create a graph showing the wealth distribution of everyone in the US, that's data science for instance
Big Data™ basically
It's the channel for the buzziest of buzz words
Hi, I was able to create the following pandas dataframe by using groupby. Now I would need to graph the results with bar plot (one bar for one column, divided to 2 parts)
Something like this? https://stackoverflow.com/questions/23415500/pandas-plotting-a-stacked-bar-chart
thanks got it
Hi guys
So I'm getting this error while trying to get financial data from Yahoo's API
import datetime as dt
import pandas as pd
from matplotlib import style
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
style.use('ggplot')
start = dt.datetime(2000, 1, 1)
end = dt.datetime(2016, 12, 31)
df = web.DataReader('TSLA', 'yahoo', start, end)
Which gives me this error -
ImmediateDeprecationError:
Yahoo Daily has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.
Please how can I resolve this?
Is there a work around?
I'm using python 3.6
dont use yahoo
Okay, which can I use
dunno
*sigh
I have a pandas dataframe as follows. I want to add a new cost column that is equal to the Shares column * the corresponding price for the current symbol. So for example in the first row (symbol=AAPL) the cost column would have a value of 1500 * 340.99000 = 511485. How would you do that?
df["newcol"] = df.apply(lambda row: row["Shares"] * row["AAPL"], axis=1) perhaps?
Can't test it myself on phone, but do try just the apply part first, and see if the result is correct
the row['AAPL'] is hardcoded there is it not? I would like to use whatever is in the Symbol column and pull the appropriate value from the corresponding price column for that symbol
so for example the third row with IBM, should multiply 4000 (Shares) * 144.55000 (the price under the IBM column)
You could most likely just pass the symbol to the lambda, or write a function instead and have it read which symbol to use beforehand
orders["newcol"] = orders.apply(lambda row: row["Shares"] * row[row['Symbol']], axis=1)
Surprisingly, that seems to be working. Thank you @placid snow !
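For a larger frame, the same per-row lookup can be done without `apply`. A minimal sketch, assuming the price columns are named after the symbols as in the question; the data and the `cost` column name are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the question: one price column per symbol.
orders = pd.DataFrame({
    "Symbol": ["AAPL", "AAPL", "IBM"],
    "Shares": [1500, 200, 4000],
    "AAPL": [340.99, 340.99, 340.99],
    "IBM": [144.55, 144.55, 144.55],
})

price_cols = orders[["AAPL", "IBM"]]
# Column position of each row's symbol within the price columns.
col_idx = price_cols.columns.get_indexer(orders["Symbol"])
# NumPy integer indexing picks exactly one price per row.
prices = price_cols.to_numpy()[np.arange(len(orders)), col_idx]
orders["cost"] = orders["Shares"] * prices
```

The `apply` version is easier to read; this one only pays off when the frame is big enough for the row-by-row loop to be slow.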
yahoo stopped with the ticker data i thought
hopefully a question that can be answered!
I have three 1d arrays of data. Two of the arrays are coordinate arrays (x and y). The third array holds intensity (z) measurements at each coordinate position. All three arrays are of the same length, which means that the position x[i], y[i] has an intensity of z[i]. Currently, I cannot create a map like a pcolormesh plot to get a sort of heat map, because of the data structures. Does anyone know how I can grid this data so that it will have a map of coordinates with intensity values for each point on the map?
Are the x and y values placed such that they form a grid?
no, thats the thing. they are simple 1d arrays that have x and y values that represent coordinate values along each axis. not until you take the first element of each x and y array do we get a position. hoping that makes sense lol
Sure, they are not formatted as a grid. But if you were to plot them would they make a grid? i.e. do you have n unique x values each repeated m times and m unique y values each repeated n times?
no, i did try to use np meshgrid, and that created the grid like coordinates. now my issue i guess is assigning the positions with their intensity values
I assign intensities to the diagonal of that grid right?
theyre both negative actually
oh
# imports
import numpy as np
import matplotlib.pyplot as plt
# fake data, replace with your own x, y, and z list
x = [-1,-2,-3]
y = [-3,-1,-3]
z = [10,3,4]
# turn negative coordinates into positive indexes
x_pos = np.array(x) + abs(min(x))
y_pos = np.array(y) + abs(min(y))
# create empty grid for z values
zz = np.zeros(shape = (max(x_pos)+1, max(y_pos)+1))
# fill empty grid at indexes corresponding to coordinates
for i in range(len(x)):
    zz[x_pos[i], y_pos[i]] = z[i]
# these are the cell edges, so each colored rectangle is centred on its coordinate
xx = np.arange(min(x)-1, max(x)+1) +0.5
yy = np.arange(min(y)-1, max(y)+1) +0.5
# plot
plt.pcolormesh(xx, yy, np.transpose(zz))
plt.show()
@coral lichen, this might work. It's not very nice looking but it plots the intensity at the right coordinates.
Also in case your x and y values are not integers you might want to do a simple coloured scatterplot instead. Like this
x = -5*np.random.rand(1000)
y = -7*np.random.rand(1000)
z = x*y
plt.scatter(x,y, marker='o', c=z, linewidths=5)
plt.show()
@polar acorn thanks! I'm going to work on these when i get home and will let you know!
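If the coordinates are not integers but still repeat (so they do fall on a grid), the index arithmetic can be replaced by `np.unique`. A rough sketch with made-up data, assuming missing grid cells should stay empty (NaN):

```python
import numpy as np

# Made-up scattered measurements: z[i] belongs to position (x[i], y[i]).
x = np.array([-1, -2, -3, -1])
y = np.array([-3, -1, -3, -1])
z = np.array([10.0, 3.0, 4.0, 7.0])

# Sorted unique coordinates, plus the index of every point along each axis.
xs, xi = np.unique(x, return_inverse=True)
ys, yi = np.unique(y, return_inverse=True)

# Rows follow y and columns follow x, which is the layout pcolormesh expects.
grid = np.full((len(ys), len(xs)), np.nan)
grid[yi, xi] = z
```

Something like `plt.pcolormesh(xs, ys, grid, shading="nearest")` should then draw it. For truly scattered points that share no coordinates, the scatter plot above or an interpolation step is probably the better fit.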
hi everyone, I have the following df and I want to create a binary matrix out of it using pandas or any other module to achieve this.
the result should be like this
I want to find a connection between the series and it seems that the best way to do this is by observing the binary matrix
If you want binary columns also for values not observed in your columns yet, you can enjoy this monstrous one liner
import pandas as pd
df = pd.DataFrame({'val1':[1,3,4,6], 'val2':[2,2,1,3], 'val3':[5,6,11,2]}, index = ['s1', 's2', 's3', 's4'])
binary_df = pd.concat([pd.concat([pd.get_dummies(df[col]) for col in df], axis=1).groupby(lambda x:x, axis=1).sum(), pd.DataFrame(columns=list(range(1, df.max().max()+1)))]).fillna(0)
Enjoy pulling it apart. I had too much fun writing that to not share it
@chibli cheers for the answer, the direction is right
@pptt cheers for the answer, I will give this a run asap but first I want to wrap my head around the formula :)))
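For anyone pulling that one-liner apart, here is a step-by-step sketch that should give the same kind of 0/1 matrix on the same sample frame. It swaps in `melt` and `pd.crosstab` instead of `get_dummies`, and assumes a value that appears twice in one row should still count as 1:

```python
import pandas as pd

df = pd.DataFrame(
    {"val1": [1, 3, 4, 6], "val2": [2, 2, 1, 3], "val3": [5, 6, 11, 2]},
    index=["s1", "s2", "s3", "s4"],
)

# Long format: one (row label, observed value) pair per line.
long = df.reset_index().melt(id_vars="index")

# Count how often each value occurs per row, then cap the counts at 1.
counts = pd.crosstab(long["index"], long["value"]).clip(upper=1)

# Add all-zero columns for the values that never occur (7 to 10 here).
all_values = range(1, int(df.max().max()) + 1)
binary_df = counts.reindex(columns=all_values, fill_value=0)
```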
Anyone here know stuff about the ASDF (Advanced Scientific Data Format) library?
Guys, if anyone knows, any input regarding my question in #tools-and-devops is appreciated
In lowess.. After minimizing the error how should I define y to be predicted.. if I am writing the algorithm from scratch
Is there any python library which is able to detect a float as being a fraction including an irrational number?
yeah, but I'm talking about irrationals
That library isn't able to detect something as pi, e, or a square root
Well, I do a lot of math stuff using python. And when I get a result I would like to know if it's actually a random number or a fraction which includes an irrational
found it in sympy, thanks @simple crag
Hi, can anyone suggest a data science course path that would make me a data scientist by the end? The courses should teach good Python and cover all the data science concepts. As far as I can see, there is no good course that takes my data science and Python level to the next level simultaneously. Can anyone help? Thanks
Have you checked out Udacity?
Yes, but as I remember, you are forced to pay for the nanodegree. If something, some product is good, or better than others, there is nothing to prove yourself.
If the course were a face-to-face program, maybe I could pay that money for it. But it is online
Check pins
But yeah, no python. But there are a lot of free courses, tutorials on the net which will not give you any certificate
This channel is quiet
ded server
Hello
Could someone please help me with this task, I really don't know how to approach it
I don't know much about the topic, but have you tried breaking it into smaller pieces?
a function for the equations for instance, and plan out what happens
@vestal axle from what I can tell, three parameters are constant
a first step *could be rewriting the equations with those values filled in
reduce the greek letter salad a bit
It looks straight forward to me. ( Havent tried to work it though). You have sigma0 and y0 given and calculate (sigma1, y1) and so on every (sigmat, yt)using the given formula till you get 100 values (t = 100 or 99 not sure, prolly latter) . You have your simulated data. Now plot the data over time. I think how to do the plot in that specific way is the task at your hand
@vestal axle
this a good place to talk about data structures?
Not likely
#python-discussion if you want a generic discussion on that topic. Any of the help channels if it is a specific question. #databases if it is regarding databases
I'm currently studying ML, and I ran into a question. How would you calculate a confusion matrix (in Python) for a multi-label dataset?
What is it called when you have a ridiculously extreme deviation in data?
singularity?
outlier?
Sounds like outlier
@SYMPHONIC DISHARMONY#3195 Row is true label, columns are predicted label
compute the counts
Use .pivot_table in pandas
py_noob thank you for your answer, we figured it out! Thanks though
Btw, is there anyone here who would be willing to help me out with a task that has to be delivered next week
@vestal axle If you have a specific question just ask and someone will answer it they have time.
@SYMPHONIC DISHARMONY#3195 scikit-learn has a nice implementation if you just want to see your results.
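The "rows are true labels, columns are predicted labels, compute the counts" recipe can be sketched in plain Python. The labels below are made up, and this covers the single-label (multi-class) case; a real multi-label problem would need one such table per label:

```python
from collections import Counter

# Made-up true and predicted labels, one label per sample.
y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "cat", "cat", "bird", "dog", "dog"]

labels = sorted(set(y_true) | set(y_pred))

# Count every (true, predicted) pair once.
counts = Counter(zip(y_true, y_pred))

# matrix[i][j] = samples whose true label is labels[i], predicted as labels[j].
matrix = [[counts[(t, p)] for p in labels] for t in labels]
```

As far as I know, `sklearn.metrics.confusion_matrix` gives the same table, and `sklearn.metrics.multilabel_confusion_matrix` is the one aimed at multi-label data.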
Guys which is the best book I should opt for probability and statistics.. ???
can i ask a question on data-science that isnt written in python but instead octave?
This is a Python server, so I'm not sure how well anyone is going to be able to help you with Octave
im more worried with the logical aspect of the programming tho
not any features that octave may have to offer
ill ask anyways
anybody with octave knowledge want to talk me through how to calculate the cost of the hypothesis using certain theta/weight variables. I am struggling to get my head around it.
i am using a feature/X matrix (5X4), Weight/theta vector (4X1), correct_label/y vector (5X1)
prediction = sigmoid(X*theta)
cost_1 = (-y) .* log(prediction)
cost_2 = (1-y).*log(1-prediction)
total_cost = cost_1-cost_2
J = sum(total_cost)/m
J should be the cost but it doesn't calculate it correctly
can you @me if you respond thanks π
@turbid bay What do you mean by not calculate correctly? Values different from what you should have got or are you getting some errors? Secondly you do not seem to have given the entire code. That m in the last line is not available elsewhere in the code. Thirdly did you check if you get the right answers if you created a vector of 1s instead of just using (1-y).
The problem may also be in the sigmoid function if you wrote it instead of a built-in
m is the length of the y vector (sorry for not including that) and no the sigmoid function is built in
and yes it doesnt give me the expected output
Try (ones(m,1)-y) instead of (1-y) and same change for (1-prediction)
You could also do y'*log(prediction) instead of y .* log(prediction) coz that is supposed to be more efficient
what does the ones() do exactly. and i tried the Transpose matrix multiplication but it wouldn't give me the right value either so i kept on swapping between the 2
ones(p, q) gives you a matrix of the size pXq with all the values as 1
ah yes that will probably work better than using a scalar value then subtracting a matrix
when i am next at my computer i will change it and see if that works. thanks
Hi i want to ask something.
I will try to predict lifetime period. Dataset has this information as months in column and i'll set it as target column for predict it. Lifetime period range is from 0 to 74. Which method should i use for predict it? I was thinking to use Linear Regression but these terms made me confused like Multioutput regression, multi label classification etc.
That is an incomplete question with insufficient information. Either make a clear presentation of your problem keeping in mind the reader or ....
!t ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
Hello! I'm new here, although I'll do my best to make a clear case. I'm new to data science, and basically my problem is I'm trying to make a recommendation system, using the knn algorithm (k-nearest neighbor) and Euclidean distance, mixed ofc with the pandas and NumPy modules. I'm looking for some guide or cheatsheet or anything at all to get started, since the ones I've found all talk about the same example (some Iris flower test db that comes with the SciKit module) and have little (at least to me) adaptability to my case.
What I want is this system to recommend users new artists by comparing their answers to other users who had answered in a similar manner. User would submit three artists of their liking and then the program would have to work with that
If I'm allowed to, I can attach some files, like the db and the code I have so far
if the code is small enough to fit, post it here using codeblocks, else https://paste.pythondiscord.com/ is your friend.
@small ore hey I was told you might be the person to speak to regarding selenium
Hi
I have a question, why cooling some metals enables them to become superconductive?
@real wigeon What? You must be mistaken. I dont even know how it looks like. From my limited knowledge, Selenium does not seem remotely related to this channel
I am not someone to speak to for any topic for that matter. And in general you should not be looking at individuals in this server for any question/discussion even if it they are knowledgeable on a topic
ok well thank you, I was in the help channel and someone told me I should speak to you.
I can't tell if you were really trolled or you are trolling me
well played sir.. well played
Ayo wha
@quiet gyro ^^
[better idea to ping @ Moderators or to ping an online mod?]
Yeah. Need more active mods who look into this channel. I suspect this channel is going dead in ways
Better to ping @moderators
@real wigeon That sort of behavior and language isn't tolerated here
I mean. If you could ban them ( I think this is repeat. I searched for them and they pinged the last person who answered earlier too) and preferably also delete all of what they said, I'd be glad
Bit preoccupied right now, I'll look in a bit, thanks for the heads up
If you're referring to me, this is the first time I've said anything remotely negative. Py_noob is being difficult and asserting that I should not ask questions. I assumed it was a joke so I responded in kind.
Throw the banhammer around if you want, life goes on regardless
@real wigeon py_noob is being nothing
Enough
it just looks like facknoobs misunderstood
some suggestion they got to talk to someone?
oh, sorry. didn't see I think this is repeat.
hello, i am doing logistic regression and am trying to calculate the cost of some theta values. However, I am not getting the expected value I want with the code I have made. I was originally using Octave but have moved to python as I understand how to use it better. But still no luck. I will paste the entire code and would like to ask if anyone can spot any mistakes. The cost is supposed to equal 2.534819. https://pastebin.com/hsG3HcR6
thanks in advance for whoever can help me solve this problem
@turbid bay Can't really see that you're doing something wrong here. Are you sure you are comparing the correct examples?
Anyone familiar with making logistic regression models in python? I'm making my first model and can't seem to increase my accuracy_score() with the available variables. Specifically, whenever I include a 3rd, 4th variable, my score goes down. Does this mean that my model's accuracy in predicting outcomes is decreasing as I include more and more variables?
I'm trying to make the "best" model with what's available.
@spare karma are you testing out of sample? You might be overfitting, consider regularisation.
@polar acorn Thank you for the response. I think out of sample? I'm using a baked-in function to test, train and split (my instructor recommended one). I'll look into regularization. Are there any general conventions towards selecting variables? (If interested, code below - kobe bryant data.)```
# fit a logistic regression model and store the predictions
feature_cols = ['combined_shot_type_numeric','season_numeric', 'shot_distance', 'minutes_remaining']
X = kobe[feature_cols]
y = kobe.shot_made_flag
model = Model()
model.fit(X, y)
kobe['pred'] = model.predict(X)
from sklearn.metrics import accuracy_score
accuracy_score(kobe.shot_made_flag, kobe.pred.round())```
0.5952834961279527
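The snippet above scores the model on the same rows it was fitted on, so the number says little about out-of-sample accuracy. A minimal sketch of a held-out check, assuming `Model` is scikit-learn's `LogisticRegression`; the data is synthetic and merely stands in for the kobe frame:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features, only the first one carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Hold out a quarter of the rows; the model never sees them while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
```

Comparing `train_acc` with `test_acc` when a feature is added is one way to see whether the extra variable helps or only fits noise.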
Could you show more code please? I am unable to know what your Model class is. If regularisation does not work, then consider reviewing the model. Like introducing more variables which could be some powers of the existing variables and do some analysis to determine which variables contribute little to your model
Disclaimer.I am not an expert. I am trying to learn from online sources and through forums like these
@polar acorn thats why im extremely confused too. because i think its working correctly. and im using the data from a course on coursera i copied and pasted it pretty much so im 99% sure its the right data. my only thought is that there are multiple ways of calculating cost so possibly this is just a different way compared to what they expected you to use on the course
Hi! I have probably trivial question, but I lack the proper word to search effectively for information on it. If I do a standard plt.plot(x,y) using matplotlib.pyplot, and y is a numpy array of large numbers, python will automatically remove some appropriate power from the y-tick labels, and write that power in the upper left corner of the plot. Can I control this feature somehow, and tell pyplot which factor should be divided out of the ticks? Most importantly, can this also be done for plt.yscale('log') ie a logarithmic axis? Code example here: https://paste.pythondiscord.com/nomamojofe.py
I mean, I realize that I can just divide my y-array with some number and annotate that in the plt.ylabel, but since the automatic function is already there, it would make sense that it could also be controlled by the user.
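One possible answer, sketched with made-up data: on a linear axis `ticklabel_format` with equal `scilimits` should pin the exponent, while on a log axis (which does not use that formatter) a `FuncFormatter` can divide the factor out by hand. The factor `1e6` and the label text are assumptions chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FuncFormatter

x = np.arange(10)
y = (x + 1) * 2.5e6

fig, (ax_lin, ax_log) = plt.subplots(1, 2)

# Linear axis: scilimits=(6, 6) asks for the offset text to always be 1e6.
ax_lin.plot(x, y)
ax_lin.ticklabel_format(axis="y", style="sci", scilimits=(6, 6))

# Log axis: set the scale first, then replace the formatter it installs.
scale = 1e6
ax_log.plot(x, y)
ax_log.set_yscale("log")
ax_log.yaxis.set_major_formatter(FuncFormatter(lambda v, pos: f"{v / scale:g}"))
ax_log.set_ylabel("y (units of 1e6)")
```

With the log axis the factor has to go into the axis label manually, since no offset text is drawn there.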
@turbid bay it is not only efficient but also easier to code and debug if you use those as vectorised stuff.
@small ore i am using vectorisation in my octave code. however i dont know how to do vectorisation in python
Using numpy methods. You dont have to code yourself
Numpy has arrays, matrices and ndarrays. I am not sure which one is a good fit but I think any of those will work with corresponding methods for transpose and such
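A vectorised NumPy version of the logistic cost from the Octave snippet could look like the sketch below. The function names and the toy data are made up; it assumes `X` already contains the intercept column and `y` holds 0/1 labels:

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Mean cross-entropy cost J(theta) for logistic regression."""
    m = y.shape[0]
    h = sigmoid(X @ theta)  # predictions, shape (m,)
    # The dot products sum over the m samples, so no explicit loop is needed.
    return float(-(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m)

# Toy data: 5 samples, 4 features (first column is the intercept).
X = np.array([
    [1.0, 0.5, 1.2, 3.0],
    [1.0, 1.5, 0.2, 1.0],
    [1.0, 2.5, 2.2, 0.5],
    [1.0, 0.1, 0.7, 2.0],
    [1.0, 3.0, 1.1, 1.5],
])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

cost_at_zero = logistic_cost(np.zeros(4), X, y)
```

With all-zero weights every prediction is 0.5, so the cost should come out as log(2), which makes a handy sanity check. In NumPy `1 - y` broadcasts the scalar, so no explicit array of ones is needed.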
If you have problems in your octave bits I will try to help provided the head guys have no problems
ill msg u privately with octave questions
@dim wolf I'm pretty new, so my opinion isn't worth much but if it were me, and there's a deadline associated to your plot, I'd just extract what I need manually into a separate df.
@small ore I can introduce powers to existing variables? That's awesome. I never thought of that. Not sure what you want to see, I'm currently at ~100 lines.
I mean if that model.fit is a fancy method which already automatically calculates what variables/powers are needed and what are not and does a super good fit all of what I said regarding bettering the model makes no sense. I just wanted to see what the Model class is
And please note: your problem still stays linear if you introduce powers coz these become the co-eff of the function rather than the 'variable'. So when I said 'variable' to your feature_cols , I was in some sense wrong
@small ore If only they'd make em' that way, lol. On the flipside, then they wouldn't be that much fun to produce. Anywho i'm using from sklearn.linear_model import LogisticRegression as Model
hey guys i have been trying to upload my ipynb to github. Once they are done uploading I am not able to view them instead it gives me this error
try clicking on "raw" at the top right
@lone mist yea but how does it help?
from sklearn.linear_model import LogisticRegression as Model . I wonder why these modules use such conventions in examples as well as their code.
as a first assignment I've been asked to perform a classification task based on JIRA user story summary, description and activity data. Sample size is ~100k user stories. Classification is based on which 'customer' is being supported by said user story, eg, regulators, internal IT improvement, actual product improvement, etc. Any idea which model is optimal for such a case? If supervised, how big should my output sample be?
List all the odd numbers between 1 and 200 that are divisible by 5. Your result should be a list containing a set of integers satisfiying this condition.
Anyone who can help me on this?
anyone able to turn this code so that instead of cumcount being inserted into the df, it would use the max count value for each group? Using pandas
@vestal axle something like this maybe?
set_of_integers = []
# step through the odd numbers only and keep the multiples of 5
for i in range(1, 201, 2):
    if i % 5 == 0:
        set_of_integers.append(i)
Yeah, that works! Thanks FONZ π
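The same list can also be written as a comprehension, or generated directly since odd multiples of 5 are 10 apart. A small sketch; the variable names are made up:

```python
# Filter version: odd and divisible by 5.
odd_multiples_of_5 = [n for n in range(1, 201) if n % 2 == 1 and n % 5 == 0]

# Direct version: start at 5 and step by 10.
same_list = list(range(5, 200, 10))
```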
@hollow gulch why not just use .count() instead of .cumcount()? Or am I misunderstanding your objective?
@twilit bolt for some reasons using count for groupby doesnt work
unless i miss something
is there anyway to list all the function that I am typing out while using spyder? so I have some idea of what I could do
What are you counting? Values within Address Line 1 or do you want to know the count each group has?
@hollow gulch
@lyric canopy thanks for replying, this is what my df looks like
not exactly the problem I have but similar and easier to understand
I want to create a count column by group to produce result so for example with 'ST', the data in the above example would look like this 2 2 1 1 1
the groupby.cumcount + 1 wold look like this 1 2 1 1 1
Right, just total count per group on each of the lines of that group
yep
here is my code
df2['Count']=df2.groupby('Address Line 1').max(df2.groupby('Address Line 1').cumcount()+1)
the max function doesnt work. I don't know which function available that I could use so I just test out random syntax hope it would work
its different from excel where as you start typing, it suggest a list of function available to use. this one doesn't give suggestion which make it harder to code
Right. Well, it's been a while since I used pandas, so I'm going to experiment for a minute
But, you can find a list of all the groupby functions on this page: https://pandas.pydata.org/pandas-docs/stable/api.html#id39
So, df2.groupby('Address Line 1').size() will get you the group sizes
Now, you still have to add them to the dataframe
This should work:
how do I add it to each line?
df2['Count'] = df2.groupby('GROUPINGTHINGY').transform(len)
should work I think
Wait
That's the wrong paste
Wait, now it doesn't work anymore.
yeah, individual syntax work
like print (df2.groupby('Address Line 1').size())
but I dont know how to put that value into the data frame for each row based on the 'Address Line 1' value
Yeah, I know, but there's a simple way to do it with one statement, it's just that I can't remember it because it's been months since I touched pandas. I can only hack it currently:
>>> df['count'] = df.groupby('A').size()
>>> df
   A  count
0  a    NaN
1  a    NaN
2  a    NaN
3  b    NaN
4  b    NaN
5  a    NaN
>>> df['count'] = df.groupby('A').transform(len)
>>> df
   A  count
0  a      4
1  a      4
2  a      4
3  b      2
4  b      2
5  a      4
But, that should be possible in one line
what do you mean you hack it π
If I'm understanding correctly, this should do the trick.: ```df = pd.DataFrame({
'City' : ['BILLINGS', 'LANSING', 'HICKORY', ' HAYWARD', 'NORTH EAST', 'SAN DIMAS'],
'ST' : ['MT', 'MI', 'NC', 'CA', 'MD', 'CA']
})
df
df['Count'] = df.groupby('ST')['ST'].transform('count')```
That's it, yeah.
Couldn't get it to work because I tried:
df['Count'] = df.groupby('ST').transform('count')
So, without the ['ST']. I
@hollow gulch Did you see the answer above?
what does transform do?
let me look it up real quick
the syntax looks a bit weird because I dont see groupby () [] next to each other that often
Okay.
way i understand it is group() and () would call function or combination of condition
and [] is to apply array
df.groupby('ST') creates a "groupby" object (grouped by 'ST')
Next, you select the ['ST'] column of that object and use count on it
So, it's equivalent to:
grouped_df = df.groupby('ST')
df['Count'] = grouped_df['ST'].transform('count')
and I didnt know you could do transform ('count') to command a function using a string
i thought it has to be something like groupby().command here
learning so much
thanks you the greatest teachers @lyric canopy and @twilit bolt
so to have a better understanding of how things work, grouped_df['ST'].cumcount() would give cumulative count for each row within a group.
grouped_df['ST'].count() assume it would work would give count of the group but it's not in a usable format (wonder why)
so we need to use grouped_df['ST'].transform('count') to put it in the right format (I still don't know which format we are in or needed for each of this)
i assume the 2 format we dealing with here are array + int. For Array to work, it has to match the size.
I am not used to the transform function and what it does
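On the "which format" question: `count()` returns one value per group, indexed by the group key, while `transform('count')` returns one value per original row, indexed like the frame. A short sketch on the same state column as above; the variable names are made up:

```python
import pandas as pd

df = pd.DataFrame({"ST": ["MT", "MI", "NC", "CA", "MD", "CA"]})

# One number per group; the index is the group key (CA, MD, MI, MT, NC).
per_group = df.groupby("ST")["ST"].count()

# One number per row; the index matches df, so it can be assigned directly.
per_row = df.groupby("ST")["ST"].transform("count")
df["Count"] = per_row

# Same result by looking each row's key up in the per-group table.
via_map = df["ST"].map(per_group)
```

Assigning `per_group` straight to a column aligns on the index, and the row labels 0 to 5 do not match the state labels, which would explain the NaN column seen earlier.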
@hollow gulch It seems like you are relatively new to Pandas. May I suggest 10-minutes to Pandas and the Cookbook, short and sweet examples. If you work through the examples therein, you'll be able to tackle about any data wrangling issues you may encounter.
thanks so much for the resources. I've been trying to find a good source to study from. I want to move my excel work to pandas but the syntax is what usually throws me off
You're welcome.
Anyone who know's how to solve this?
I need to run a for loop, but how should i code it?
What have you tried so far?
And, do you need to use for-loops? Because you don't need to if you don't
Well, I've gtg @vestal axle , but there are plenty of options for calculating the correlation without explicitly using for-loops yourself.
I am curious what course that is
I don't know if this was posted elsewhere here but Twitter released a data store for all the accounts related to Russian trolling. Could be a fun dataset to play with. https://about.twitter.com/en_us/values/elections-integrity.html#data
im pondering the thought of using machine learning to identify what type of log file i am going to perform some graphing on. Would that be possible? Right now i am using pandas ```py
if df.columns.tolist() == my_dict.get("type"):
    plottype = type
```
But i feel a little automation there would be cool? would it be possible?
Hey
This is not the place for advertisement. @regal yarrow
sorry
anaconda is a "closed" enviroment right? it wont mess up my already installed python3 stuff?
hm, installed latest anaconda and started a "conda" project in pycharm, but it said python 3.6. Is it not 3.7 wich comes with conda?
Vesiculus I managed it
py_noob its data analysis in python im on my second year of Msc Finance
in
Nice!
any help with timeit
Post your question and I'm sure people will help you
Good morning, I am trying to make new column so that if QTY and MAX QTY >0, they stay the same, otherwise QTY=0 and MAX QTY = 100000. Datatype is object. and I attempted to convert them to numeric but failed. can anyone help please
df1[['QTY','MAX QTY']] = df1[['QTY','MAX QTY']].apply(pd.to_numeric)
I also tried this pd.to_numeric(df1[['QTY']], errors='coerce').fillna(0).astype(int)
@hollow gulch This is the way I would do it ```df = pd.DataFrame({
'qty' : [1, 2, 450, '--'],
'max_qty' : ['nan', 2, '-', '--']
})
df
df.replace(['nan', '--', '-'],[9999, 9999, 9999 ], inplace = True)
df.loc[(df['qty'] == 9999 ) | (df['max_qty'] == 9999), ['qty', 'max_qty']] = 0, 100000```
Thanks @twilit bolt as always
I was hoping there was a better way to capture it instead of manually identify all miscellaneous items
it is what it is then
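One way to avoid listing every placeholder by hand is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN. A sketch on the same sample frame, assuming the rule from the question (keep both values only when both are greater than 0):

```python
import pandas as pd

df = pd.DataFrame({
    "qty": [1, 2, 450, "--"],
    "max_qty": ["nan", 2, "-", "--"],
})

# Column by column: unparseable entries become NaN instead of raising.
num = df[["qty", "max_qty"]].apply(pd.to_numeric, errors="coerce")

# Comparisons with NaN are False, so bad entries fail this check automatically.
valid = (num["qty"] > 0) & (num["max_qty"] > 0)

# Rows that fail get the defaults from the question.
num.loc[~valid, ["qty", "max_qty"]] = [0, 100000]
num = num.astype(int)
```

`pd.to_numeric` expects a 1-d input, which is probably why calling it on `df1[['QTY']]` (a one-column DataFrame) did not work.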
I'm sure there is, using the 're' module. But, I'll leave someone else to come up with that solution. I still can't get my head around working with combinations of strings.
You may want to look into it, https://docs.python.org/3/library/re.html
quick and simple solution sometimes better as I do need to get this project move on, thanks for all the help
np
I love python because it has great learning curve and so much to its capability
and those are the question I keep on the back of my heads that one day I can come up with a better solution
I hope python is a good solution to my data analyics day to day project and it would be better than excel/VBA. I found it more amusing because I'd have to code the same thing in VBA, might as well just do all of the tasks in python and save the workload if I need to go back and edit anything
this channel is meant to serve somewhat as a data science discussion location, but this is also a Python server first and foremost. might be a question better suited to an off topic channel
Is there a lot of math in datascience
I'm getting a memory error from loading a CSV file (a few GB) into pandas. Is there a way to split up the file and analyze it piece by piece?
@naive orchid Sure.
Take a look at the chunk size parameter of https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.read_csv.html
It uses http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking to read a file chunk by chunk
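A tiny sketch of the chunked pattern: with `chunksize` set, `read_csv` yields DataFrames of at most that many rows, so only one piece is in memory at a time. The in-memory CSV and the `value` column are made up so the example runs anywhere; a real file path would go in place of the `StringIO` object:

```python
import io
import pandas as pd

# Stand-in for a large file on disk: a column of the numbers 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
rows = 0
# Each iteration yields a DataFrame holding at most 4 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    rows += len(chunk)
```

This works for anything that can be accumulated piece by piece (sums, counts, filtered subsets); statistics that need all rows at once take more care.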
Guys do you know what could trigger the setting with copy warning in pandas? I have 9 columns I'm doing an operation to and I only get the warning on two of them.
this is my code
from pandas.api.types import CategoricalDtype
for column in metadata_part.columns:
    new_col = metadata_part.loc[:, column].astype(CategoricalDtype(categories=unique_vals[column]))
    metadata_part.loc[:, column] = new_col
Setting with copy warning is a mysterious beast. But it usually means you are trying to write to something that is a view of a dataframe. In your case where does metadata_part come from? Is it a subset you've taken from a larger data frame?
It is. And I didn't use .loc for it. I probably should, right?
Might not matter actually. If you want to change the parent dataframe you should change that directly. If you want to change only a copy of the subset and work with that leaving the parent df intact you can use .copy() when you subset it i.e. metadata_part = parent_df.loc[].copy()
I am not working with the source dataframe any further. I might try the .copy() thing. I am mostly confused it warns me only on two of my columns but not on the rest
As I said it is a strange warning that sometimes triggers when things are fine and sometimes doesn't even when it should. But in general you should always be fine if you're working on a df that is not a shallow copy (either original or subsetted with .copy()) and using only one .loc or .iloc statement to subset it.
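The `.copy()` advice in a small sketch. The frame and column names are made up; the point is only that the subset is made independent before it is modified:

```python
import pandas as pd

parent = pd.DataFrame({"kind": ["a", "b", "a", "c"], "n": [1, 2, 3, 4]})

# One .loc call to subset, then an explicit copy: 'part' owns its data.
part = parent.loc[parent["n"] > 1, ["kind"]].copy()

# Changing the copy cannot touch 'parent', so pandas has nothing to warn about.
part["kind"] = part["kind"].astype(pd.CategoricalDtype(categories=["a", "b", "c"]))
```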
Alright, thank you very much for the help. I'll look more into it and see if anything changes!
@hollow gulch Something like this?
df['Count'] = 1
df.loc[(df['Tax Y/N'] != df['Tx Ex']) | (df['Tx Ex'] == 'NaN'), 'Count'] = 2
Has anyone got any good suggestions for creating tables and exporting them automatically using python, in either a .png format or straight onto a powerpoint? Has to be compatible with pandas.
???
Say I partition the unit interval [0, 1] into a grid via u=1/(m-1) and G=(k*u for k in range(m)). I want a function which maps a real x to the value in that interval that's closest to x. Is there a better way than doing a sort of logarithmic walk starting in the middle of G and using abs for comparison? Any libraries which do that? I need to do it hundreds of thousands of times in a numerical execution.
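Because the grid is evenly spaced, no search should be needed: scaling by `m - 1`, rounding and clamping gives the nearest index directly. A sketch; the function name is made up and it assumes `m >= 2`:

```python
def nearest_grid_point(x, m):
    """Return the point of {k / (m - 1) : k = 0..m-1} closest to x."""
    # Index of the nearest grid point if the grid extended forever.
    k = round(x * (m - 1))
    # Clamp so values outside [0, 1] map to the end points.
    k = min(max(k, 0), m - 1)
    return k / (m - 1)
```

For arrays, something like `np.clip(np.rint(x * (m - 1)), 0, m - 1) / (m - 1)` should do the same for all values in one call.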
@blazing bolt I know statsmodels allows you to create simple tables from data and save them either to html, csv, or LaTeX. https://www.statsmodels.org/dev/generated/statsmodels.iolib.table.SimpleTable.html
Thanks :')
@blazing bolt You can use Pandas with Matplotlib to create table figures. It's not that difficult to do, but I don't know how difficult it is to make them look pretty. I usually export my table data in LaTeX markup to display it in the style of the document I'm writing, but that may not be suitable for your application (Powerpoint; I mainly use LaTeX/pdf for presentations as well).
Here's a short tutorial on plotting tables with graphs, but if you skip the graph part, you should be able to just display/export the table in a graphical form: http://pandas.pydata.org/pandas-docs/stable/visualization.html#plotting-tables
I've never used it, though, so you may have to google a bit for the appropriate approach
Anyway, I hope the terminology on that page I linked you helps with that
Most YT videos are not good though. Either they use obsolete python 2 or they use dirty conventions and can be confusing.
On their sites, they have their own tutorials alongside the documentation. So that should help too
Scroll down to python section and the sections below in https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78
That has some suggestions
You need to calculate the correlation between the first and the second column.
In this case, you've only provided one of the two as an argument
Try this:
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
That will return a correlation matrix as an array, I think
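To make the "correlation matrix" part concrete, a sketch with a made-up two-column array standing in for `np_baseball`:

```python
import numpy as np

# Made-up data: the second column is roughly twice the first.
data = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.2],
])

# 2x2 matrix: ones on the diagonal, the correlation in the off-diagonal cells.
corr_matrix = np.corrcoef(data[:, 0], data[:, 1])

# The single number most exercises ask for.
r = corr_matrix[0, 1]
```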
@lapis sequoia
I need help with data cleaning and asked in help-0. Is it kosher for me to ask here, too?
Data cleaning/merging, rather.
@young aurora It is always best to ask those questions here. This channel is not as active as the help channels but there is a chance your question is lost in the help channels if no one around there knows the answer. I have seen people go through old questions and answer in this channel a week after it was asked
If I want to combine 35 CSVs that have somewhat different column names, what's the best way to do it?
I'm trying to do it with "usecols" from Pandas, but when there aren't column names that I'm specifying within a CSV, it spits an error.
Here's an example
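```
# NOTE: start of snippet was lost in the paste; imports, 'COL1' and read_csv are a best-guess reconstruction
import os
import pandas
cols = ('COL1',
        'COL2',
        'COL3',
        'COL4',
        'COL5',)
example = [pandas.read_csv(f, usecols=cols)
           for f in os.listdir(os.getcwd()) if f.endswith('csv')]
combexample = pandas.concat(example, axis=1, join='inner').sort_index()
```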
So, some of the files might not have COL4 (it's technically the same info within the row, but the header name is different). How can I get it to combine properly?
I'd love any help I can get.
If it helps, there are 40 total files.
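One possible approach to the CSV question above, as a rough sketch: read each file without `usecols`, rename the odd headers to a canonical name, then `reindex` to the columns you want so a missing column becomes NaN instead of an error. The `RENAME` mapping, the `COL4_ALT` header and the in-memory "files" are all made up for illustration, and it assumes the goal is to stack rows from every file (`axis=0`) rather than join them side by side.

```python
import io
import pandas as pd

# Hypothetical mapping from alternate header names to the canonical ones.
RENAME = {"COL4_ALT": "COL4"}
WANTED = ["COL1", "COL2", "COL3", "COL4", "COL5"]

def load_one(source):
    """Read one CSV, normalise its headers, and keep only the wanted columns."""
    df = pd.read_csv(source).rename(columns=RENAME)
    # reindex adds any still-missing wanted column as NaN instead of raising.
    return df.reindex(columns=WANTED)

# Two in-memory "files" standing in for CSVs on disk.
file_a = io.StringIO("COL1,COL2,COL3,COL4,COL5\n1,2,3,4,5\n")
file_b = io.StringIO("COL1,COL2,COL3,COL4_ALT,COL5,EXTRA\n6,7,8,9,10,x\n")

# Stack the files row-wise (axis=0), not side by side.
combined = pd.concat([load_one(f) for f in (file_a, file_b)], ignore_index=True)
print(combined)
```

With real files you would pass the paths (for example from `glob.glob('*.csv')`) to `load_one` instead of the `StringIO` objects.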
There's the question, for what its worth - thanks, @small ore
Are there any experts here in SQLAlchemy? I was wanting to have a code review done on a project I'm working on.
Hello?
Hi
@heavy brook We don't currently have a system for recruitment, we're trying to figure out the best way to handle it right now.
We tend to remove recruitment messages for the sake of safety
So I've got a function I've written that gives me a number as its return value. I'd like to loop it X times, assigning an ID to each loop plus the value, in two columns: "Test Number" and "Value". How can I do this?
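A minimal sketch of one way to do that with pandas, assuming a DataFrame is what is wanted. `run_test` is a hypothetical stand-in for the function that returns a number, and `repeat` is just an illustrative helper name.

```python
import random
import pandas as pd

def run_test():
    """Hypothetical stand-in for the function that returns a number."""
    return random.random()

def repeat(func, times):
    """Call func `times` times and record each result with a 1-based ID."""
    rows = [{"Test Number": i, "Value": func()} for i in range(1, times + 1)]
    return pd.DataFrame(rows, columns=["Test Number", "Value"])

results = repeat(run_test, 5)
print(results)
```

Building a list of dicts first and creating the DataFrame once at the end tends to be simpler than appending rows to a DataFrame inside the loop.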
im currently working on multi gaze tracking/estimation
I dont have an ML background, but I am good at piecing things together
im learning python along the way
anyhow, the current challenge I want to solve right now is whether someone is looking at the screen or not
i'll make a tv stand, put a tv and then put a camera on top
I have seen some open source code already for head pose estimation
my main question now is, is there a formula to determine if someone is looking directly given a roll, pitch yaw? plus some extra params? distance from the screen angle maybe?
wow my code is slow AF T_T
I have a dataset that has a predictor column with two values: 'Fully Paid' and 'Charged Off'. For data visualization purposes, I split data into two. Unfortunately, the 'Fully Paid' subset has twice as many rows as the 'Charged Off' subset. I made a function to randomly shuffle the 'Fully Paid' subset and match length of the 'Charged Off' subset to see if I can understand the full picture better, but every time I run the function, the plots are way different each time. Is there a way I can work around this problem to better handle keeping the size of the two subsets the same but to somehow include the "entirety" of the larger subset? I hope I was clear enough.
Why would they need to be the same length again?
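Two possible ways around the "plots change on every run" problem, sketched on made-up data: fix the random seed so the downsample is repeatable, or keep every row and compare normalised histograms, which do not depend on group size. The column names `loan_status` and `int_rate` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up data: twice as many 'Fully Paid' rows as 'Charged Off'.
df = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 200 + ["Charged Off"] * 100,
    "int_rate": np.concatenate([rng.normal(10, 2, 200), rng.normal(14, 3, 100)]),
})

paid = df[df["loan_status"] == "Fully Paid"]
charged = df[df["loan_status"] == "Charged Off"]

# Option 1: a fixed random_state makes the downsample (and so the plot) repeatable.
paid_small = paid.sample(n=len(charged), random_state=42)
paid_again = paid.sample(n=len(charged), random_state=42)

# Option 2: keep every row and compare normalised histograms, which do not
# depend on how many rows each group has.
bins = np.linspace(df["int_rate"].min(), df["int_rate"].max(), 11)
paid_density, _ = np.histogram(paid["int_rate"], bins=bins, density=True)
charged_density, _ = np.histogram(charged["int_rate"], bins=bins, density=True)
```

Option 2 uses the "entirety" of the larger subset, since nothing is thrown away; only the y-axis is rescaled.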
@heavy brook ML is about finding the prediction formula. I have no expertise in determining the (roll, pitch, yaw) of a face, but I imagine it needs a database of several faces, each rotated at various (roll, pitch, yaw) angles, with the formula determined by number-crunching over that database
yo i wanna start a project using AI. Starting from scratch so does anyone have any ideas of a good project idea to do?
@small ore @heavy brook If you have the position and angles of your gaze then calculating whether it's looking at the screen is just simple maths
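A rough sketch of that "simple maths", under loudly stated assumptions: the camera sits at the top centre of the screen and is the origin, the screen lies in the plane z = 0, x points right, y points up, z points out of the screen toward the viewer, and yaw = pitch = 0 means the head faces the screen head-on. Head pose is used as a proxy for gaze, so eye movement is ignored; roll does not change where the head points, so it is not needed. The function name and parameters are made up for illustration.

```python
import math

def looks_at_screen(head_pos, yaw_deg, pitch_deg, screen_w, screen_h):
    """Return True if the head-direction ray hits the screen rectangle.

    head_pos is (x, y, z) in metres relative to the camera, with z > 0
    in front of the screen. Angles are in degrees.
    """
    x, y, z = head_pos
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)

    # Unit direction the head is pointing in (assumed convention, see above).
    dx = math.sin(yaw) * math.cos(pitch)
    dy = math.sin(pitch)
    dz = -math.cos(yaw) * math.cos(pitch)

    # Facing away from the screen plane, or standing behind it.
    if dz >= 0 or z <= 0:
        return False

    # Ray/plane intersection with z = 0.
    t = -z / dz
    hit_x = x + t * dx
    hit_y = y + t * dy

    # Screen spans screen_w horizontally, centred on the camera,
    # and hangs screen_h below it.
    return abs(hit_x) <= screen_w / 2 and -screen_h <= hit_y <= 0

print(looks_at_screen((0.0, -0.3, 2.0), 0, 0, 1.2, 0.7))
print(looks_at_screen((0.0, -0.3, 2.0), 45, 0, 1.2, 0.7))
```

Getting the head position and angles out of the image in the first place is the harder part; this only covers the final geometric check.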
How do you find the position and angle of the gaze? I Understood it as him trying to find the position and angle rather than a binary answer of whether or not he is looking at the screen
@lean ledge
> is there a formula to determine if someone is looking directly given a roll, pitch yaw? plus some extra params
hecc, you actually dont even need ML for this
you can do some fairly simple CV
do some face detection, do some feature extraction, a transform for the eyes, get location, compare it to rest location
Hm. Yes. That final question seems to mean the opposite but if you look at the rest of the question including
> im currently working on multi gaze tracking/estimation
Is this true lol
People are leaving data science jobs now
Welp
And yeah. I did think of transformation too . Wasn't sure if it could be applied
could be. It's most likely not. I too am in data science and nobody wants to leave, they earn too much to :P
Everything he said is part of all jobs
Just worse in jobs where data science isnt the main business of the company
In which case I point you to a bunch of other devs who have the same problems in non tech companies
A bit of sample selection bias in those numbers, I think. But, what do I know.
He provided like no numbers
The article, states, in part >These data were collected by Stack Overflow in their survey based on 64,000 developers.
But he said he was a data scientist
Right, lol
Why would they lie about that its a popular article
And he proves he knows his stuff
The only piece of data is ML/DS people are looking for different jobs, which might as well be because there's lots of new high paying jobs coming up
I didnt say he wasnt a data scientist
I said being a data scientist doesnt at all make him an authority on the market and motivations of thousands of other data scientists
Everything else was an opinion piece from his job working in a non-tech company
It was fact based, he proves he knows his stuff lol
Unless u can correct his fact mistakes in the article
He said no facts lmao, he gave like 1 piece of statistics
He said a lot about vector algorithms
So do I
Everything he said made sense
If your reasoning is that he's a data scientist, I am too. I work in a data science company. My coworkers are data scientists. We work with clients who have their own data science teams yet pay us to do things. Every deal we make gets verified by their data scientists and the data scientists in our accountancy firm (Deloitte)
So basically u disagree with this guy that is also a data scientist
Did he get any of his facts wrong in the article
I'm not saying he doesn't make sense. I'm saying his reasoning is based on personal experience and largely applies only when you're a tech employee in a non-tech company, which is far from unique to data scientists and cannot at all be generalised to explain the motivations of thousands of other data scientists who are looking for a new job
Jesus never mind
So is the everyone looking for a new job and leaving data science jobs true or just a troll
Well like ur a data scientist is this true lol
So, what I've seen around me is that a lot of people from our masters end up in data science. Most of our students are actually scooped up before they even graduate.
So, I'd take that article with a grain of salt
Still, there are some decent points made in that article
Some that apply to more fields than just data science
I think that the article had a great take away of managing expectations of the company. Know what you are getting yourself into prior to joining.
I'm in a similar situation to what the article describes - a company in its infancy when it comes to DS.
This can become frustrating at times when expectations on both sides are not agreed upon.
And a lack of data on their part?
Depending on the project, yes.
I thought most students in data science come out with a bachelors?
most people in data science have masters, maybe even PhD
the problem with data science market is the number of companies following the hype that hire people without knowing what data science is for
Can confirm companies scoop up people for hype techs
they may think their excel sheet with a hundred rows is Big Data, may hire someone without having anything to give them and then expecting something out, may only hire 1 person (1 is pretty much never enough), etc etc. They dont know how to pick the right person because DS really isnt anything like normal software dev.
Data scientists likely arent leaving jobs because data science can be a bad job but because there's a ton of companies, even good companies, that have no reason or way to manage data scientists which makes them miserable and because there's an increasing number of more lucrative opportunities daily
@foggy junco Getting a masters is pretty much standard here in The Netherlands, but our system is a bit different. It's almost impossible to start a Ph.D. without having completed a masters degree first, since, well, that's seen as the "base" degree for university.
The only field I've seen Dutch students working on a Ph.D. without getting their masters first a couple of times is medicine.
So basically data science is starting to die
Swe might be more secure than data science in job security wise but idk
completely different careers with very different skillsets
@lean ledge there is more overlap between SWE and DS than you would think.
I wouldn't say so. Perhaps if they are at a smaller place and take on some responsibilities that would be handled by a data engineer
DS requires research skills, exploration and experimentation, the ability to stay up to date on the literature, plus whatever maths and statistics skills you need, and those don't really match up with the skills a good SWE needs
The overlap that I am thinking of is more along the lines of algorithm development - writing unit tests, structuring libraries, etc.
Not really the sort of stuff data scientists do from what I've seen. Data scientists I've seen do the exploration etc, come up with the right techniques etc in a notebook, pass the notebook on to someone else who writes it as proper deployable software. It's a norm in data science AFAIK
There's too much stuff to do for data scientists to be concerned with implementation
@lean ledge agreed within enterprise environment.
Open source?
Open source what?
Hey, feature vector question I wanted to get clarified
I'm working on an image classifier using a support vector machine
I've extracted various features such as hog, dominant colors, canny edge
I stumbled on this on quora:
if {a1,a2,a3,a4,a5}, {b1,b2,b3} and {c1,c2,c3,c4} are the features extracted from an object using different feature extraction mechanisms, concatenate them to form a single feature vector as {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4} and use it for classification.
So if {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4} is a list of all the features for one image in the training data and I have 500 images, would all of those concatenated vectors be placed into a vector containing: [ {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4}, {a1,a2,a3,a4,a5,b1,b2,b3,c1,c2,c3,c4}, ... ] ?
Sure, that works.
Thanks Raggy for the confirmation
Additionally, I was wondering why, in some of the numpy.concatenate examples, the array contents were being interleaved rather than just appended to the end of the previous array. Why is that?
good question, i have no clue
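A likely explanation, offered as a guess at which examples were meant: the `axis` argument. For 2-D arrays, `axis=0` (the default) appends whole rows, while `axis=1` extends each row in turn, so the printed result looks "mixed". A small sketch with made-up arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 (default): b's rows are appended after a's rows.
rows_appended = np.concatenate((a, b), axis=0)
# [[1 2] [3 4] [5 6] [7 8]]

# axis=1: each row of b is appended to the matching row of a,
# which reads as interleaved when printed.
cols_appended = np.concatenate((a, b), axis=1)
# [[1 2 5 6] [3 4 7 8]]

# axis=None: both arrays are flattened first, then joined end to end.
flat = np.concatenate((a, b), axis=None)
# [1 2 3 4 5 6 7 8]

print(rows_appended, cols_appended, flat, sep="\n")
```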
After days (months?) this channel saw some activity. Not sure if I am happy with the nature of activity
lol thanks anyways raggy
out of curiosity, what features did you extract from canny edge? just the pixel locations after non-maximal suppression?
just the pixel locations, tbh I just looked up NMS because of your comment, I think I should have that haha
NMS is part of canny edge. if you're using a library function, it's included.
im new to ML and data science and just learning
ah
Yeah then just the pixel locations
cool cool
So for dominant colors I have a size of (7,3) and canny edge and hog have a shape of (100,100). Would I need to pad my dominant colors ndarray?
shouldnt have to. you should only be comparing relevant features to the same type of features so shape consistency isnt important. you do need to find a way to weigh or normalise differences in feature vectors to get the difference between images
assuming you use something like L2 distance between two vectors for measure of "closeness", the distances will be on different scales
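To illustrate the scale point, a minimal sketch using plain z-score standardisation on made-up numbers. This is one common normalisation choice, not necessarily what was meant above; `sklearn.preprocessing.StandardScaler` performs the same per-column operation.

```python
import numpy as np

# Made-up feature matrix: 4 images, 2 features on very different scales.
X = np.array([
    [0.10, 1000.0],
    [0.20, 1100.0],
    [0.90, 1050.0],
    [0.15, 2000.0],
])

# Standardise each column to zero mean and unit variance.
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0          # guard against constant columns
X_scaled = (X - mean) / std

# Before scaling, the second feature dominates the L2 distance;
# after scaling, both features contribute on a comparable scale.
raw_dist = np.linalg.norm(X[0] - X[2])
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[2])
print(raw_dist, scaled_dist)
```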
Makes sense. I asked because functions such as np.concatenate, np.hstack, np.vstack and np.column_stack complain about the shape of the dominant colors. True, I haven't thought about normalizing the data yet; I was going to get the vector filled and then go back to look into normalization
Most examples I've seen online, people have the same rows but different columns, but I have yet to see a different number of rows
What if I flattened my data and then concatenated it?
Oh but that would cause data loss if I have something like a black and white image array as a feature?
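On the flattening question, a sketch under the shapes mentioned above ((7, 3) for dominant colors, (100, 100) for Canny edges and HOG). Flattening only changes the layout, not the values, so the original array can be recovered with a reshape. The random data and the helper names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_features(dominant_colors, edges, hog):
    """Flatten each feature block and join them into one 1-D vector."""
    return np.concatenate([dominant_colors.ravel(), edges.ravel(), hog.ravel()])

def fake_image_blocks():
    """Made-up feature blocks with the shapes from the discussion."""
    return rng.random((7, 3)), rng.integers(0, 2, (100, 100)), rng.random((100, 100))

# One row per image: shape (n_images, 21 + 10000 + 10000).
X = np.vstack([image_features(*fake_image_blocks()) for _ in range(5)])
print(X.shape)

# Flattening loses no values: slice the edge block back out and reshape it.
colors, edges, hog = fake_image_blocks()
vec = image_features(colors, edges, hog)
restored_edges = vec[21:21 + 10000].reshape(100, 100)
```

What does get lost is the explicit 2-D neighbourhood structure, which a plain SVM on a flat vector does not use anyway.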
No, but it's iterable
or just for row in df3
for i, row in enumerate(df3) if you need the number
@hollow gulch it needed to be for row in df3 i think
it'd be range(len(df3)) if you were doing what ELA suggested though
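One caveat worth checking, assuming `df3` is a pandas DataFrame (the later `df3.iloc[...]` suggests so): a plain `for row in df3` walks over the column labels, not the rows. A small sketch of the row-wise options on a made-up frame:

```python
import pandas as pd

df3 = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Iterating the DataFrame itself yields column labels.
column_labels = [item for item in df3]

# Positional access, matching the range(len(df3)) suggestion.
by_position = [df3.iloc[i, 1] for i in range(len(df3))]

# iterrows() yields (index, row-as-Series) pairs.
by_rows = [(idx, row["score"]) for idx, row in df3.iterrows()]

# itertuples() yields one namedtuple per row and is usually faster.
by_tuples = [row.score for row in df3.itertuples(index=False)]

print(column_labels, by_position, by_rows, by_tuples)
```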
This is a different error
Your if-statement there is wrong
You need to use and instead of &
in what case do you use & and | versus 'and' and 'or'?
& | are bitwise, or memberwise on numpy arrays
and/or are short-circuiting and strictly boolean
so if you do foo() and bar() and foo returns false, bar won't even be called
though, the other issue [and the actual reason yours fails] is that & is tighter-binding than ==/!=
anyway, it's partially a style thing (even though the short circuiting does mean a bit for efficiency) - it's very unusual to use &/| for connecting conditions within an if statement.
you'd need parentheses around each of the conditions if you did
But the error complains about them being strings, no?
yes
because it's just doing df4.iloc[row1, 2] & df3.iloc[row, 2]
just change the & to and
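A small self-contained sketch of both points, using made-up values rather than the dataframes from the question: `and` short-circuits, and `&` binds tighter than `==`, so an unparenthesised condition ends up applying `&` to the two strings.

```python
calls = []

def foo():
    calls.append("foo")
    return False

def bar():
    calls.append("bar")
    return True

# `and` short-circuits: foo() is falsy, so bar() is never called.
result = foo() and bar()

a, b = "x", "y"

# Parsed as  a == ("x" & b) == "y"  because & binds tighter than ==,
# so Python tries to compute "x" & "y" and raises TypeError.
try:
    a == "x" & b == "y"
except TypeError as exc:
    error = str(exc)

# Either use `and`, or parenthesise each comparison when using &.
with_and = a == "x" and b == "y"
with_parens = (a == "x") & (b == "y")

print(calls, result, error, with_and, with_parens)
```

The parenthesised `&` form is the one you need for element-wise conditions on numpy arrays or pandas Series, where `and` does not work.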
Hey, I have some wifi sensor time series accompanied with a schematic of the area where the sensors are located. Can anyone think of any tools that could help me visualize which sensor is pinged on the map over time ? Something where I can give it my schematic (Image/PDF) and map a column in the timeseries to some object on a map that would change color at a certain time in the animation?
@hollow gulch try using a linter?
hey buddies
Cool extension
Very cool extension! As an avid jupyter notebook and spyder user, I really ought to give it a shot
But I am yet to find an editor with such a convenient ipython terminal like spyder's
I would agree. Spyder is, for me, second to none when it comes to iPython implementation in an IDE.
I've never used it, would you mind giving some examples briefly ? What kind of features/workflows do you like the most about it ?
Spyder? I like it because it's got a variable explorer that displays dataframe data in a tabular form and the iPython terminal allows you to play around with commands before you include them in your code. You can also select and run a few lines of your code in the iPython terminal and it'll keep track of the variables created/changed so you can continue experimenting with those adjusted variables
So I've been trolling in here for a while
I have decided that I need to dig into some data-science studies, any recommendation on study materials, or specific topics that I should look into?
@fathom zenith Oh, nice. So it's kinda like R-Studio for Python if you're familiar with it.
@viscid aspen Yup, exactly! Should've just said that lol, R-Studio for Python
@spark summit I would definitely recommend Andrew Ng's machine learning course on Coursera. It's not in Python, it's in Octave but it shouldn't be too difficult to translate everything you learn, besides it's more theory-heavy so you'll only be coding to solidify the theoretical concepts.
I'm on Week 8 of it and I must say, that man has a knack for simplifying the most complicated concepts
And when you start translating it to Python you'll probably just use a library for much of it instead of writing the learning algorithms from scratch. But ground-up knowledge of the ML algorithms will help you optimize them better, adjusting the parameters, performing error analysis, devising ways to improve the performance of your algorithm, creating learning curves for bias-variance trade-off analysis and all that
I can't wait to finish it so I can start his Deep Learning course, and the best part is that that's in Python so I can immediately start using what I learn as I learn it
Yeah, I've been doing a lot of work in pandas for uni lately, and jupyter (even jupyterlab) gets annoying quickly. I like VS Code better for how clean it looks, but the variable inspector in spyder looks super useful. I'm assuming it's especially helpful for stuff like pd.Series and basically all numpy structures, since those don't have a nice HTML output in jupyter
Absolutely. That's exactly what it's super helpful for. You can inspect your dataframe variables and arrays like they're in spreadsheets.
What are you studying in Uni @viscid aspen ?
you can do it entirely for free
but if you want a cert, it's $79
well worth it if you ask me
@fathom zenith CompSci (Master's), this semester's project is ML-oriented with a focus on the modelling of a real world dataset (a pretty nasty one if you ask me)
hi everyone, I'm developing an image classifier with PyTorch and I'm getting an accuracy of about 80% on my test set. When I predict image labels, a few times I get totally wrong predictions. Is this normal, or should I expect every prediction to be right even without 100% accuracy?
How heavy or light are these spyder and R-Studio? I have an old spyder installation and I cant get it to use jupyter or ipython
R-Studio is an IDE for R. Spyder is one for Python that looks very similar to R-Studio. That's also why a lot of people use it, because they were used to R-Studio.
I don't know how heavy they are
I read above something like R-Studio for python
I think what was meant was "Spyder is like R-Studio, but for python"
"So it's [Spyder's] kinda like R-Studio for Python if you're familiar with it."
Ahh. Ty for the clarification
Coming from R myself Spyder felt very familiar but also a bit unpolished compared to PyCharm which now has a scientific mode. Haven't looked at Spyder for a year now though so it might have improved.
A scientific mode? Is that available on the community edition?
Oh I'm sorry only on pro
Jupyter is extremely beginner friendly though, you should start off with that
I can't wait until JupyterLab goes out of beta, but right now I just don't want to risk it messing up my analyses
Seconding JupyterLab, it's excellent
While we're on the topic, I believe Ng's course is rather bad and extremely mathematically shallow. I wouldn't recommend it. Being MATLAB focused isn't a plus either
Sort of lacking in theory
Columbia's course on edX is pretty good and is more language agnostic. Gives a much better and stronger grasp at fundamentals
While that is true, the course can be very good depending on where you come from and what you want out of it of course. So first of all one should maybe think about ones goals and background.
I want to learn a lot about algorithms and data structures with a focus on Python. I have already completed an algorithms and data structures course in C, but want to read about them with Python implementations.
Can anyone recommend a resource for data structures and algorithms with Python? Don't really care about the medium; if it's a printed book I'll buy it. Video tutorials work great too.
Just want to focus on the algorithms and data structures; for now I want to avoid different modules like numpy and pandas
@long gate data science isn't data structures and algorithms but look at CLRS which is all written in pseudocode. There's also elements of programming interviews python edition but I recommend the former.
CLRS is π
Okey was mistaken then, but is this the book you meant? https://en.wikipedia.org/wiki/Introduction_to_Algorithms
@lean ledge
Nvm saw in the wikipedia page that it's called CLRS, thanks
@lean ledge I agree that it's mathematically shallow, but it was never meant to be an academic in-depth course on machine learning and was instead meant to introduce people to it and quickly give them a few tools they can practically use in the workplace.
Because unless you're in academia or R&D, you won't really be delving deep into the math all of the time. Most libraries abstract away the math after all, but knowing what you're doing instead of treating every model like a black box and choosing the one with the highest accuracy really helps when using ML to solve problems. Especially when you have to explain your solution to your colleagues.
But it all comes down to how you like to learn and what you like to learn. If you're mathematically-oriented, it's the wrong course for you because he just explains the intuition behind the formulas and algorithms.
If you like programming though and you want to quickly learn how to do some basic data science then it's absolutely the right course for you.
It comes down to your motivations, a really high level of math can be overwhelming to most people. But building little programs and watching them work is an absolute delight.
All in all, it's a great primer. So I wouldn't go as far as saying it's "bad". There's a reason more than 100,000 people have done it.
@viscid aspen what's the dataset, if you don't mind me asking
@fathom zenith wifi sensor time-series from a large European airport
There's a contractor that labels certain passengers with complicated rule engines etc. We're trying to model it and apply some basic ML
@fathom zenith I think any decent data scientist will need a good understanding of the maths behind things. There's a reason almost all DS have at least a master's. Perhaps it's a good course for someone in HS or someone who only wants to do ML for fun but if it's a serious career goal, you very very likely should be familiar with the mathematical understanding.
Absolutely @lean ledge, but that's for data scientists. There are technology executives, data analysts, and hobbyists, and programmers with passing interests. Multiple types of users all of whom would find Andrew Ng's course a brilliant introduction to the world of machine learning.
But I'd like to ask you, with all of the layers of abstraction today, how vital is it to get to the absolute grass roots of the math behind ML? Especially if what you're primarily doing is commercial work within a company where you're employing models (validated by the academic community) and not cutting edge work (i.e., coming up with your own models)
It's very very much vital. Again, most data scientists have a master's or PhD for a reason. Being able to choose an appropriate model for the task and the data requires understanding of both the data, the techniques and their intricacies. If you ask industry, they'll tell you there's not enough data scientists etc. That's not because there's not enough people interested in it. It's because there's too many people with shallow understanding and lack of people who've gained the understanding of ML techniques. This isn't in academia, this is according to industry.
Data scientists generally do not make their own cutting edge models
They're specialists in choosing them
If you don't understand how your data is shaped, how higher dimensionality affects models, how your network's topology affects it, how different architectures differ, etc, most of which you can't understand without a solid grounding in the mathematics, you can't choose the best model
I agree with you
But I don't agree that Andrew Ng's course is "bad" simply because its mathematically shallow
Different people need different introductions
And Andrew Ng's is one type of it
You don't need a strong mathematical grasp to get started, to go deeper you do absolutely. But to get started and playing around with it? Absolutely not. Interest is sustained when you can see what you do, not in abstracts. But then again, that's just my opinion
As for the vitality of math in commercial work, I'd have to agree to disagree there. I work with practitioners in the field, and the pressure on them to stick to deadlines and deliver fast is immense. That means you don't have the time to think through everything, it's all about pushing that MVP out and if it works, it works. Is it good data science? Hell no. Is it effective? Heck yes.
I'm not saying you don't need to know any of the math. Of course you do! But you don't need to go incredibly in-depth.
A fundamental understanding should be enough to get started
The vast majority of companies I know look for mathematical talent. I was hired not because I was a good programmer but because I was great at maths and physics. Many companies finish off their job adverts with "candidates from maths, physics and electrical engineering do the best". Not a single one of my coworkers would agree with you in this case
I honestly would suggest someone do a statistics degree rather than a CS or whatever degree if they want to be a data scientist
Data science is 99% maths, 1% programming
Well that depends on the nature of your job of course. But there are far more users of the knowledge of machine learning than data scientists who don't need to necessarily understand all of the math behind the models.
But in summary my argument is that while that course may not be mathematically dense, it's still not "bad" for many, many people who use that knowledge in various ways.
So let's just agree to disagree
its popping
Raggy, fyi, Andrew Ng has topics on how to pick models etc. later in his course. Also, people who know all the math formulae and derivations often fail to use their own intuition. I believe you are in the top 1 percentile if you can just look at those formulae, get the intuition, and apply it yourself. Obviously Andrew Ng sounds frivolous to you. I believe for most people it is less stressful to go his way first and then maybe look into the math. It becomes easier to understand the math when you already know some applications and are looking to understand the nuances
@lean ledge and @fathom zenith masters or phd are not really required to be a data scientist. I spoke to a family friend who is a data scientist and he has a bachelor in biology. So, yah.
@lapis sequoia I am a data scientist. It is not required but the vast majority have a higher degree and they make it much easier to develop the skills you need.
@small ore He does not go into detail at all and skips over a lot of important things
And can we stop this idea that learning the maths will mean you're somehow incapable of actual practice????
I'm not telling you to memorize formulae, that's not learning maths. With maths comes a deeper understanding of what the data and models convey
Your mathematical intuition enhances and complements your normal intuition. Learning maths involves learning how things work not memorizing. I believe if you think otherwise that you have a fundamental misunderstanding of what maths involves.
I think the problem @lean ledge is highlighting is a major in statistics as well. This comment on Reddit to an article on statistical fraud illustrates why: https://old.reddit.com/r/statistics/comments/9toukm/1_in_4_biostatisticians_say_they_were_asked_to/e8ymwn9/?st=jo16k8bw&sh=9da7ca00
What happens a lot in academia is that many users of statistics, so your typical empirical researchers, don't actually know all that much about statistics other than the introductory applied stuff they've been taught while getting their degree. This leads to tunnel vision: people either shape their problem/research question to fit the model they know ("to a man with a hammer everything's a nail") or, worse, shape their data to fit the model. What's still worse than that is that, when they do actually apply a model, they don't realize what the limitations are, what violations of the assumptions mean for the interpretation of the results, or even what it means to delete a few "odd" data points from your data set. Some don't even really know why it's bad to choose a technique based on the results it gives you, so preferring model B over model A because model B gives you the results you were looking for (as described in that comment I linked above).
At the end of the day, what this means is that the number of Type I errors in academia increases (so, the false positives, one of the reasons behind the replication "crisis") and/or that models get published that don't actually do what they're claimed to do. Once it becomes clear that all those results that were once taken as facts turn out to be misleading or wrong, people lose their trust in the methodology and the statistics itself, while it was really the interpretation and the application where stuff went wrong. I think that's a danger for ML as well: if people start using models wrong, claim that they do things they don't actually do, and start suggesting decisions based on misleading results, the confidence in ML will diminish like the confidence in statistics has diminished in some fields of academia.
In a way, this is already happening: The recent amazing AI tool for recruitment that was scrapped because it turned out to be very biased is one of those examples that could lead to diminishing trust in the techniques, while, in fact, the way the model was trained was the problem, not the fact that ML was applied.
^^^
That's just one of the issues with doing ML without having a mathematical background
Yeah, probably. It's just one of the issues with researchers not actually knowing statistics, too.
What I said comes down to this. Your models are mathematical models. You cant understand them if you dont understand the maths. The idea that somehow you can know a lot of maths but not know intuition for the model is honestly just ignorance of what data science involves. Data science, including in a commercial setting, involves a lot of maths and it's simply not possible to consistently guess what models you need right every time without any understanding of maths
I have a question and maybe I'm going to sound super ignorant but let's assume we've got a dude named Bob who has supposedly taken a whole bunch of Coursera courses in ML and DL and all the other "L"s.
Bob builds a recommendation engine for his website hacking together models based on what he's learned. Bob doesn't understand the math to a great level but since there are so many great libraries out there abstracting away a lot of the detail, he just uses them to do his job.
And it just so happens that it works and more people are buying things on his website because they're getting good recommendations
So really a black box method, he knows what goes in, he has very little knowledge of what's happening inside, but what he's getting, the output, is working for him
I know that's not proper data science, but is that in general a bad way of approaching ML?
In general, yeah. now that Bob has made what he's made, he likely doesnt know how to make it better. He doesnt understand why exactly it works, what model they might want to swap out for another model, in what cases his model fails. Were he to try something else, there is no guarantee his previous strategy would work and no guarantee that his knowledge would transfer over.
Not being able to make your model better or knowing what's stopping it as it is is a problem. So is not being able to transfer over to another scenario due to lack of understanding.
Yes, it has worked apparently well in this scenario and that's great! Bob can continue using this model as it is if it works well enough for him. However, that method tends to not work very well in general
Hello!
Can someone recommend how to get started with Machine Learning? I'm the kind of person who likes to start on a practical project and learn while doing it. Basically that's how I learnt a lot of programming concepts. For the past few days I have been trying to dip my toes into machine learning, but all the learning resources I've found were hours and hours of theory instead of building something. I'm good at math even though I'm still in high school, and I'll be fine learning concepts that I don't understand yet. I've got no problems at all with programming in Python or something else. Can someone recommend a good way to start my journey with machine learning?
I've got no problems learning theory, but I want to get started and see "that something is happening" and learn all the concepts needed while building something.
I'm actually really interested in an answer for this ^^^
/r/LML has a few resources for people who don't really care about learning ML but just want to build stuff, listed in their hacker's guide: https://old.reddit.com/r/learnmachinelearning/wiki/getting_into_ml_hackers_guide
There's also a high schooler's guide for someone who wants a career in the field https://old.reddit.com/r/learnmachinelearning/wiki/getting_into_ml_high_schoolers_guide
@fallow summit @misty sonnet
#1 choose a dataset on kaggle
#2 cry
yeah i mean building the dataset is like 90% of the battle
thats really all that matters
Worst thing about working with non-tech, rather old-fashioned client companies is that the data they have is the biggest pile of crap
"yeah all we got is freeform natural language can you make this work?"
my coworkers have it worse sometimes
i dont even know what that means
I'm not sure anyone but the guy who wrote it does
My last client company was based in Chile and sent me all their data in Spanish too
Thanks for your reply!
Don't get me wrong, I want to learn as much as I can about machine learning and Data Science. I'm not aiming for a job anytime soon. I'm just fascinated by what can be accomplished with it. I just know my strengths, and I know that jumping into a lot of theory at the beginning and not seeing anything happen will push me away from this field. Usually when I'm learning something new I try to build something and learn while building it. I couldn't find any good resources for getting straight into a project with machine learning and learning along the way, so I came here to ask. Atm I would like to build something that can play a video game, some really basic game.
there's a lot of research on atari being published
do you want to start with an agent that plays atari games?
Yup!
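For a sense of what "learn while building" could look like on a really basic game, here is a minimal sketch. It assumes a made-up 1-D corridor game (not Atari) and uses plain tabular Q-learning, which nobody in this chat named as the method. The game, `step`, `train` and `greedy_policy` are all hypothetical names for illustration.

```python
import random

# Hypothetical toy game: positions 0..4 on a line.
# The agent starts at 0 and gets reward 1 for reaching position 4.
N_STATES = 5
ACTIONS = (-1, +1)  # step left, step right


def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore sometimes (and on ties); otherwise take the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


def greedy_policy(q):
    """Best action (-1 or +1) for each non-terminal state."""
    return [ACTIONS[0 if row[0] > row[1] else 1] for row in q[:-1]]


q_table = train()
policy = greedy_policy(q_table)
print(policy)
```

The Atari agents in the research mentioned above generally swap the table for a neural network, but the update inside the loop is the same idea, so this is a small thing to get running first.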
Basic machine learning requires maths that is usually taught in early 2nd year of uni in the US, late first year in other places. It's hard to get into it in high school when you don't understand multivariate calc or linear algebra
You can try Andrew Ng's course, but despite it being rather shallow, I still don't recall it going over basic multivariate calc