#data-science-and-ml
guys want inputs as in material, things to learn.... on cyber security threat analysis-- like where to start.....or any suggestions
was looking into materials from https://medium.com/cyberdefenders/python-for-cybersecurity-lesson-4-network-traffic-analysis-with-python-6321f4c9d3f7
but this does not share any insights on threat analysis.....
do share your inputs
I would hop on over to #cybersecurity @vital cipher
@drifting hemlock that was a lot of words for saying basically nothing
@torn garden Seems to me @drifting hemlock is talking about how to get data science work done when you're dealing with people on your team with very little (if any) technical knowledge? Or am I reading that wrong?
@torn garden @clear jetty I guess I did a bad job of explaining myself. My point is that, in the workplace, it's very difficult to structure a data science team within an organization, to provide them with the tools and the databanks they need in order to concentrate on the tasks they are assigned, and, furthermore, to provide them with the structure to actually deploy machine learning models that work for the business.
This is a problem that you don't actually read about in the myriad of blogs out there. For example, if we are assigned a simple binary classification task and we do it as a hobby, then it's very simple: clean the data set, train a model, deploy a simple Flask server to act as an API.
Is it that simple in a work environment? Hardly, and even less so if you're working with a team. First, you don't get a dataset as in Kaggle; you have to create it from scratch using multiple sources. Then you have to document the process, because you have to inform the stakeholders how the model works and why it predicted a certain value, or you'll risk credibility. You also need a cloud service to work with your teammates, because only using GitHub will not be enough. And you need a whole pipeline to retrain your model every now and then. Not to mention a lot of things in between to ensure that something can be deployed within the organization.
If the last paragraph was confusing, it's because that's how confusing it is to build and manage a data science team. We don't really have a platform that makes this process a lot easier, and if there is one, well, let me know, that'd be great!
I'm participating in a data science course, and I made this visualization to address the question of "what is the correlation between Carbon Monoxide concentration and air quality?"
From the more professional data scientists here, is there any advice on things to improve about the graph?
@scarlet night Well I'm not the best when it comes to visualizations but usually there are a couple questions that you can ask yourself to improve them:
- Who is your audience? If you define how technical your audience is and how much domain language they have then you can either simplify or add more relevant details to it. If it's a broader audience you might even want to add a caption that summarizes what you are trying to say.
- What problem (or benefit) are you trying to pinpoint in this visualization? If it's an all time low, then you might want to add a label with the value of the highest or lowest point. I know that we have it at the left/right side of it, but it can be hard to read.
Hope you nail that course btw :)
Oh and I forgot to say, it looks great for me! I just wanted to give some advice.
Thanks, I've had the wisdom of the average people by asking people what they think in other Discord servers
The audience is between the academic people and the average person, as while having specific and more technical terms and data, its insights and results apply to everyone: pollutants in our air make the air quality worse
im fairly new to data science, and im trying to get started with tensorflow. does anyone have a moment to help me get started?
@drifting hemlock I don't think there's going to be a single solution to everything you mentioned. Data Science is (almost intentionally) an incredibly broad term and covers a ton of different contexts/paradigms, and I think no one system is going to be able to streamline all the different use cases and workflows. So it's not surprising to me that you've not found much meaningful to that end. I think everyone just patches together some workflow that works well for them and their particular problem and sticks with it.
@small nebula what're you trying to do, and is there a reason why tensorflow over pytorch?
well tensorflow is the first thing that came up on my google search, a good while ago. been playing around a bit with the tutorials and haven't really checked out pytorch yet
I am quite biased, but unless you have a strong reason to use TensorFlow (the specific model you want is in TF, or you need to deploy models for business usage), PyTorch would be more recommended
e.g. if you're just learning DL for fun / curiosity, go with PyTorch
@void anvil I find graphing it like that makes the data less meaningful, as it is much harder for people to understand the significance of a roughly 45 degree angle line compared to 2 lines graphed over time lining up
Also, graphing it over time serves the purpose of showing that this is a constant correlation, not just a day overlapping.
(An example of graphing it like a scatter)
Though seeing it, I do see what you mean, but I don't think it helps as much
@silent swan why do you say that?
which statement
If you find TF frustrating but still want to use it, I'd recommend trying out Keras. It's a higher level api for TF (and other platforms) that simplifies dev.
I'd recommend PyTorch over TF/Keras regardless unless you have specific reasons for needing TF
a higher level API can make it harder rather than easier to debug
I agree with that
the latter statement
so far TF/Keras have worked fine for me though, so I guess I personally see no compelling reason to try PyTorch out
A related article talking about the differences between TF and PyTorch:
https://medium.com/@UdacityINDIA/tensorflow-or-pytorch-the-force-is-strong-with-which-one-68226bb7dab4
worth noting that PyTorch does support Tensorboard now
TF also has Eager mode now but from everything I've heard, it's still tedious af to use/debug
I have found Tensorboard to be....annoying.
I get the debugging part. I like the abstraction from the static graph.
The more I have played with Pytorch, the more I want to use it on our next work project.
Does anyone work with Tableau Server or Online?
Has anyone tried this course? Is this course good for someone that is new to ds? https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/
personally I've never used Tensorboard much
I much prefer to log my own metrics and then build my own visualizations
What do you guys use to deploy your experiments?
I'm deciding between Flask or Django. Flask would be easier but I'm concerned about the scalability, but I don't know if Django would be an overkill for a very simple API.
Well definitely I'm not a Flask expert, so I'm open to suggestions, @void anvil do you think it would lag if I set up an API that handles multiple requests?
It's not gonna be a TON of requests, but it would be great to have a responsive API that deals with a decent amount of requests.
@snow junco I'm currently about 20% through the udemy course you linked, and so far it has been fantastic.
it isn't geared towards absolute beginners, but as long as you know the basics you'll be fine
@void anvil I guess I'll try Flask then, thanks!
You guys think that 25% vs 75% would count as an imbalanced dataset for a binary classification exercise?
I would say yes, but whether it's "sufficiently imbalanced" might depend on your context
Yeah that's a good point
eg if you have a ton of data and the problem is simple, and you're using the right metric, 3-1 might not be a big issue
but if it's a hard problem and with some metrics, even 2-1 can be very bad if you don't account for it
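A rough illustration of the point about metrics, assuming a made-up 75/25 label split and a baseline that always predicts the majority class (both are illustrative assumptions, not anyone's actual data). Accuracy looks acceptable while recall on the minority class is zero:

```python
# Hypothetical labels: 75 negatives (0) and 25 positives (1).
labels = [0] * 75 + [1] * 25

# A "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: fraction of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the positive class: fraction of actual positives that were found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(y == 1 for y in labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.75
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

So whether 3-1 counts as "sufficiently imbalanced" depends a lot on which of these numbers you are reporting.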
hey guys - what method would you use for scraping pages that are loaded with javascript?
PhantomJS is deprecated, PyQt4 doesn't work for me, chrome webdriver also does not work for me
or maybe someone can help me fixing the chrome webdriver issue:
```
  File "scraper.py", line 4, in <module>
    driver = webdriver.Chrome("/root/chromedriver")
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
```
hm, I can give it a shot. with chrome webdriver do you mean selenium, or am I just being stupid about another tool here xD
yeah selenium
I downloaded the chromedriver here: https://chromedriver.storage.googleapis.com/80.0.3987.16/chromedriver_linux64.zip
hm, sounds interesting. it often gets the job done for me. what site, and what are you trying to get?
oh, just looked at the error. the driver is not starting.
for learning I want to scrape this site: https://pythonprogramming.net/parsememcparseface/ and the <p> with id "yesnojs"
```python
from selenium import webdriver

url = 'https://pythonprogramming.net/parsememcparseface/'
driver = webdriver.Chrome("/root/chromedriver")
driver.get(url)
dynamicText = driver.find_element_by_id("yesnojs")
message = dynamicText.get_attribute('innerHTML')
print("Dynamic text element says: '" + message + "'.")
```
code looks correct, never used arguments with the driver. just put it in the working directory. but I am pretty sure it should work like you have it.
what chrome version are you running?
80
you downloaded from this site? https://chromedriver.storage.googleapis.com/index.html?path=80.0.3987.16/
yep
wget https://chromedriver.storage.googleapis.com/80.0.3987.16/chromedriver_linux64.zip
then the driver should be correct, hm.
(unknown error: DevToolsActivePort file doesn't exist)
I am trying to launch chrome with an URL, the browser launches and it does nothing after that.
I am seeing the below error after 1 minute:
Unable to open browser with url: 'https://www.google.co...
let's see if adding --disable-dev-shm-usage will help
just going to recommend that link xD, just reading over it.
sounds like something new in a new update
I started seeing this problem on Monday 2018-06-04. Our tests run each weekday. It appears that the only thing that changed was the google-chrome version (which had been updated to current) JVM and Selenium were recent versions on Linux box ( Java 1.8.0_151, selenium 3.12.0, google-chrome 67.0.3396.62, and xvfb-run).
wait, that says chrome 67, not 80.
But when we investigated further, we noticed that the XVFB screen wasn't starting properly, and that was causing this error. After we fixed the XVFB screen, the issue was resolved.
it works
i just removed my path from driver = webdriver.Chrome(chrome_options=chrome_options)
glad you got it working
Anyone experienced with spaCy?
I'm doing some unit testing and it gets convoluted when I've changed a property of the Token or Doc class in another test, so there's no way to anticipate what properties the Doc object in my test will have
I'm wondering if there's a way to reset the Token and Doc classes at the beginning of each test.
Hey. What is the difference between a * b and numpy.dot(a, b)?
Also, what is learning rate and weights in Machine learning?
a*b will do element wise multiplication. np.dot will do a dot product (which in the case of vectors, means element-wise multiplication and then summing)
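A small sketch of the difference for the 1-D case. Plain Python lists stand in for NumPy arrays here so the snippet has no dependencies (that substitution is my assumption; note that `a * b` on two plain lists raises a `TypeError`, it only works element-wise on arrays). The values are made up:

```python
a = [1, 2, 3]
b = [4, 5, 6]

# What `a * b` computes for two 1-D NumPy arrays: one product per position.
elementwise = [x * y for x, y in zip(a, b)]

# What `numpy.dot(a, b)` computes for two 1-D arrays: the sum of those products.
dot = sum(elementwise)

print(elementwise)  # [4, 10, 18]
print(dot)          # 32
```

For 2-D arrays `numpy.dot` performs matrix multiplication instead, which this sketch does not cover.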
weights are the parameters of your model. learning rate is a hyperparameter for your optimizer that influences how big a step it takes each time it updates the weights to fit the data
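To make "weights" and "learning rate" concrete, here is a minimal sketch of gradient descent on a one-weight model `y = w * x`. The toy data, the `fit` helper and the specific learning rates are all illustrative assumptions, not taken from any library:

```python
# Toy data generated from y = 3 * x, so the ideal weight is 3.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def fit(learning_rate, steps=100):
    """Fit the single weight w of the model y = w * x by gradient descent."""
    w = 0.0  # the weight: the one parameter the model learns
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # The learning rate scales how far each update moves the weight.
        w -= learning_rate * grad
    return w


print(fit(0.01))   # ends up very close to 3.0
print(fit(0.001))  # same number of steps, but still well short of 3.0
print(fit(0.2))    # step too large: the weight overshoots and diverges
```

The same data and the same number of steps give very different weights depending only on the learning rate, which is why it usually needs tuning.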
trying to make a seasonal decomposition on a time series but i get some errors.
the series type object
any ideas? I'll post the error as well
always post the error when you're asking why you're getting errors
your series has dtype object, rather than int or float
hm better use quotes i guess
try calling .astype(float) or checking if there're any weird values (e.g. strings) in your series
must be because i dont know well the Series class yet, but iterating and checking each value was indeed int type
anyway, i used .astype and will see if everything goes right now
I have a very large dask array representing an image, how can I store it as a png quickly?
skimage's imsave is quite slow and so is converting the dask array to a PIL image
my recommendation is
don't use deep learning for finance
speaking as someone with experience with both quant finance and deep learning
standard statistics/econometrics problem
basically: don't treat stock prediction as a machine learning problem, treat it as an economics problem that you need to solve with statistics tools
more concretely: don't do stock prediction
My first graph with matplotlib, pandas and seaborn! 
Guys, what are the best resources to learn data science, not just like machine learning but literally all the theory, i'm really overwhelmed
I tried to learn statistics by my own, but i don't see the theories being applied to data science, so i kinda need a practical way to learn? Any suggestion for this?
How to force Y-axis to start from 0 (zero)?
import matplotlib.pyplot as plt

def plot(x_axis, points, title, x_label, y_label):
    plt.title(title)
    plt.xlabel(x_label, fontsize=3)
    plt.ylabel(y_label)
    plt.bar(x_axis, points, width=0.80)
    plt.xticks(rotation=90)
    plt.ylim([0, 10])
    plt.show()
@stray monolith Maybe try this: plt.ylim(bottom=0) (older Matplotlib versions also accepted ymin=0)
I want to use matplotlib.pyplot to plot a double line graph with which I can analyze overfitting/underfitting. I am confused as to which variables I should pass to it https://pastebin.com/yfiZbKGy I've messed around with trying to graph my intended goal for a while before deleting my attempts
Do I put training and validation outputs before or after training or in other words what are the ideal metrics to use to plot for overfitting/underfitting?
@lapis sequoia you can use ML/NLP to generate features for prediction, but unless you're ready to read up a lot about quant finance, don't get anywhere near the stock prediction problem. There're too many ways to do it wrongly and draw the wrong conclusions if you don't know what you're doing.
@lapis sequoia Elements of Statistical Learning, but sooner or later you're going to need to dig in and read a standard probability/statistics textbook
@stray monolith I think you already solved this problem, but if you didn't, then you should do plt.ylim(0). The same can be achieved on the x axis with plt.xlim(0) (but not in your plot)
http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf
To those who do Computer Vision, this paper is very interesting. Geoffrey Hinton is the main author on the paper. His resume is ridiculous.
I'd say stick with convnets for now
Hinton has been pushing capsules for quite a while but as far as I've seen it's still not had any meaningful adoption
@silent swan I have that book, but i think that's more into machine learning rather than the workflow in data science, can you recommend a more general text in Data Science? The thing is that i'm confused on what to really learn and how to connect the dots
Oh...I am def going to stick with Convnets. I just find it funny how he thinks max pooling is stupid, and for some reason it works better than all the other pooling.
Guys I fixed that issue.
plt.ylim() was not working for some reason
Then I found out that the problem was because the points were being plotted as Strings
I converted the data to float using list comprehension and everything is fine now
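A minimal sketch of that fix, with made-up values. As far as I know, Matplotlib treats string values as categories rather than numbers, which would explain why the axis limits seemed to be ignored; the ordering problem with strings is easy to see even without plotting:

```python
# Hypothetical values as they might arrive from a CSV or a scraper: all text.
raw_points = ["3.5", "10", "7.25"]

# Strings compare character by character, so "7.25" counts as the largest.
print(max(raw_points))  # 7.25

# Convert with a list comprehension before plotting.
points = [float(p) for p in raw_points]
print(max(points))  # 10.0
```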
But thanks for your input anyways @heavy thicket @uncut shadow
I need to discretise continuous (and continuous-ish) data for an ID3 decision tree. At the moment, I'm just finding the best possible value to split an attribute in two. Is this a valid solution, or do I need to try and account for cases where I need to split the data into more than two parts?
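For reference, a sketch of the binary-threshold approach described in the question: try the midpoint between each pair of neighbouring values and keep the one with the highest information gain. The helper names and the toy data are assumptions for illustration. As far as I know this is also how C4.5 handles continuous attributes, and since the same attribute can be split again deeper in the tree, a single two-way split per node is usually enough:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best two-way split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        # Weighted entropy of the two halves after the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical attribute values and class labels.
temps = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(temps, play))  # (68.5, 1.0)
```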
Anyone here experienced at webscraping?
i'm having some difficulties regarding scraping a website.
I want to store all the data but, the show-more button at the end of the page, hides info so i only get a bit of the information
How do i get around this..
Use selenium and press the button with that
i was using python
and beautiful soup
is there any way to accomplish this with bs?
it depends on the site, but in some cases you need something like selenium
there's also chromedriver
and the mozilla one, mozilladriver i think
sorry geckodriver
if you want to stick to just beautiful soup, there could be some way to trigger it, but it entirely depends on the site
the site is just best buy
a lot of things will use some backend undocumented api that runs the site
BS has some nice functionality to it
i prefer it in most cases, only use selenium if absolutely necessary
but don't most modern sites have multiple pages and show-more buttons
where is bs applicable if it cannot
sure but a lot of them also usually have backend apis that serve the data
and a lot of those things can still be triggered with the proper request
theres scrapy too
lol ima sound like a noob but
how do u find the backend api requests that you're referring to?
you can use firefox or chrome
the dev tools have a network inspector
you can watch elements of the page load
yes i was doing this
so you look for ones that are jso
i noticed some thing called loadmore
json
thats what most apis use
an api endpoint typically just returns a json document
or graphql
which i think also registers as json in the network inspector
search pages in particular usually have them
do you have an example link?
hm..
now if that json has all the data you need
what is this?
you can change the parameters in that call
its the api that drives their search
if you have the network tab open
go to that page
then click 'show more'
sort by Type
look for json
yeah
Ok, interesting. And now you can scrape this weblink
exactly
and if you look through it, it should give you enough info to filter, and page through the results
and json is easy to work with once you have it
no html parsing
yes, though really you could do it with just requests
import json
import requests
response = requests.get(...)
json_data = json.loads(response.text)
so this data is in json format?
I see, however I am not too familiar with how json is read or even works.
json is easy
gotta read up on it
its dicts and lists
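A tiny sketch of the "dicts and lists" point using only the standard library. The JSON text below is invented for illustration, shaped loosely like a search API response:

```python
import json

# Hypothetical response body; a real API would return far more fields.
text = '{"totalPages": 2, "products": [{"name": "Laptop", "price": 999.99}, {"name": "Mouse", "price": 19.5}]}'

data = json.loads(text)

# A JSON object becomes a dict, a JSON array becomes a list.
print(type(data).__name__)              # dict
print(type(data["products"]).__name__)  # list

# From there it is ordinary indexing and looping.
for product in data["products"]:
    print(product["name"], product["price"])
```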
yeah a lot of times search pages will have like
link and title
so then id use beautiful soup to hit that link
scrape the content that isnt served by the api
that best buy api is a bit crazy in terms of detail
pretty heavy
that's a good thing for my purposes i think
lol i'm just a student who's trynna play with some new technology for my resume
scraping is a good place to start
this project seems so daunting now 0-0
just keep at it, its not too hard
i like to use jupyter to prototype these things
makes it a lot easier to test navigating through the site to build the scraper
jupyter?
its great for prototyping, you can re run small parts of the code
rather than re-running everything
if you hit an error, you just retry
and you can use it to document the exploration
so you can look back later, put it all together
great for learning, especially if you are handling larger datasets
no worries
it's a json right? so like json manipulation or?
yes its json
so if you do something like this
import json
import requests
response = requests.get('https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page=2&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc')
json_data = json.loads(response.text)
json_data is a dict
yup
json_data['products']
is a list of products
its a list of dicts
and if you look at the url, it has 'page=2'
so that list is the list of dicts of products on page 2
iterate through the pages, iterate through the products lists
wow!
that's really cool
How exactly did you know 'products' was the key for this json file
firefox has a nice json navigator
just looked at it, its at the top
if you look at json_data['totalPages'] as well, thats the total number of pages
so youd want to iterate through a list from 1 to that
and change the url each time to set page= that number
just make a list and append each page's product list to that list
how would you navigate through each page like that tho, just a loop through the pages?
do something like
for page in range(1, json_data['totalPages']+1):
    ...  # request & json stuff
and in the request url you'd put that page
where it says page=
so
f'https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={page}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc'
oh ok so then load in the request url
then just append the different 'products' into a list?
yeah exactly
kinda like what you showed above?
extend?
create a new list, then list.extend(new list)
The extend() extends the list by adding all items of a list (passed as an argument) to the end.
# language list
language = ['French', 'English', 'German']
# another list of language
language1 = ['Spanish', 'Portuguese']
language.extend(language1)
# Extended List
print('Language List: ', language)
except in this case it'd be extending with json_data['products']
no problem, good luck
I really appreciate the examples, I am very much an example-based learner, and I could not find too many examples on the topic online.
Do you have any resources you could possibly pass on to someone who just wants to learn some more?
If you don't mind answering, are you also a college student?
if there is a particular library or package you need help with, just search it
no i'm a data engineer
build apis, scrapers, stuff like that
hey that's cool
when'd you get interested in scraping and stuff?
i was pretty young and i loved news, but hated the news sites at the time
i wanted something that would show me just the title and text
and rss wasnt really used at the time
no google news
so i made a scraper that ran and gathered all the news
ive made a few hundred since
damn lol
I'm just a cs student looking for some meat n potatoes on my resume, thought a scraper would be interesting.
its good stuff
yeah pretty much
I was able to find the backend api json thing through the network tools, but i'm still confused as to if it represents every single page or just one
the json is just for one page
like i said above, something like
for page in range(1, json_data['totalPages']+1):
    url = f'https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={page}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc'
    ...  # request & json stuff
then do requests.get(url)
see the page=?
on the top of the page?
oh yea
i see
you'd want to update that
which is what the for loop is accomplishing
incrementing it by 1 eachtime
yup
yeah and since totalPages is in the json dump
youd get the first url
and take the totalPages from that
import json
import requests
response = requests.get('https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page=1&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc')
json_data = json.loads(response.text)
for page in range(1, json_data['totalPages']+1):
    url = f'https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={page}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc'
    ...  # request & json stuff
that way as the # of pages changes, the scraper knows and iterates through the right length list
i mean do it better than that, but that's the basic idea
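One way the basic idea could be fleshed out, as a sketch rather than a finished scraper. The `collect_products` helper and the `fetch_page` callable are hypothetical names; only the `'products'` and `'totalPages'` keys come from the discussion above. Passing the fetching step in as a function lets the loop be tried offline against fake data:

```python
def collect_products(fetch_page):
    """Gather the 'products' lists from every page into one list.

    fetch_page(page_number) must return the decoded JSON (a dict) for that page.
    """
    first = fetch_page(1)
    products = list(first["products"])
    # Page 1 is already fetched, so continue from page 2 up to totalPages.
    for page in range(2, first["totalPages"] + 1):
        products.extend(fetch_page(page)["products"])
    return products


# Stand-in for the real API so the sketch runs without network access.
FAKE_PAGES = {
    1: {"totalPages": 3, "products": [{"sku": "a"}, {"sku": "b"}]},
    2: {"totalPages": 3, "products": [{"sku": "c"}]},
    3: {"totalPages": 3, "products": [{"sku": "d"}, {"sku": "e"}]},
}

all_products = collect_products(FAKE_PAGES.__getitem__)
print([p["sku"] for p in all_products])  # ['a', 'b', 'c', 'd', 'e']
```

Against the real site, `fetch_page` could be something like a small function that formats the page number into the URL and returns `requests.get(url).json()`; adding a short delay between requests would also be polite.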
ok everything is starting to click
what is the f in the url in the loop sir?
f'
its an f string
its the same as doing
"string {0}".format("is a string")
which returns 'string is a string'
so if page = 3
f"page {page}"
would return 'page 3'
f"string {'is a string'}"
in retrospect string is a string is not the best example
uhh
I have not seen this before haha
actually for this something like
you could also do something like
base_url = "https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc"
for page in range(1, json_data['totalPages']+1):
    url = base_url.format(page)
    ...  # json stuff
yeah they are shorter and look better
but they aren't re-usable the way my example above is
but essentially what's going on is they're taking the thing and making them string literals right?
yes the f string evaluates the contents of the {} and returns a string
whereas in the example above, the {} is only evaluated when the string has .format(value)
until you do that, {} is just a part of the string
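A short sketch of that timing difference, with illustrative variable names. The f-string captures the value of `page` at the moment the line runs, while the plain template keeps its `{}` until `.format()` is called:

```python
page = 3

# f-string: {page} is evaluated immediately, when this line runs.
eager = f"page={page}"

# Plain string: {} is just text until .format() is called on it.
template = "page={}"

page = 4
print(eager)                  # page=3
print(template)               # page={}
print(template.format(page))  # page=4
```

That is why the template version can be defined once outside the loop and reused for every page.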
hm... why can't we just toss the url into the loop without the f tho
like what's its data type
you need it to include the contents of page
so if page = 3
you need the url to say page=3
that makes sense
and you got the loop incrementing page
but why does the whole thing have to be a string
the url?
mhm
because you need a url to make a request
and the url needs to be a string
so either
base_url = "https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc"
for page in range(1, json_data['totalPages']+1):
    url = base_url.format(page)
    ...  # json stuff
or
for page in range(1, json_data['totalPages']+1):
    url = f"https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={page}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc"
    ...  # json stuff
id argue the first one with .format looks nicer
I think the second one makes more sense to me
Ofc, yes
but like conceptually
the second one sits easier with me, just a preference
for page in range(1, json_data['totalPages']+1):
    url = "https://www.bestbuy.ca/api/v2/json/sku-collections/17056?categoryid=&currentRegion=BC&include=facets%2C%20redirects&lang=en-CA&page={}&pageSize=24&path=&query=&exp=&sortBy=relevance&sortDir=desc".format(page)
    ...  # json stuff
yeah f strings are far easier to read
i prefer them myself
but its only in i think 3.6+
.format is older
ahh it's a substituent
i see.
page is entered into the {}
ok that makes sense
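A third option, sketched here with the standard library, is to keep the query parameters in a dict and let `urllib.parse.urlencode` build the string. The `page_url` helper is a hypothetical name, and it includes only a subset of the parameters from the URL above; whether the API accepts that reduced set is an assumption I have not checked:

```python
from urllib.parse import urlencode

BASE = "https://www.bestbuy.ca/api/v2/json/sku-collections/17056"


def page_url(page):
    """Build the request URL for one results page."""
    params = {
        "lang": "en-CA",
        "page": page,  # the only value that changes between requests
        "pageSize": 24,
        "sortBy": "relevance",
        "sortDir": "desc",
    }
    return f"{BASE}?{urlencode(params)}"


print(page_url(3))
```

With requests, the same dict can be passed directly as `requests.get(BASE, params=params)`, which does the encoding for you.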
I mean,
i guess i'm bored too,
trynna learn this stuff lmao
to pass time
but it's very cool so, it's a nice pass time haha
Hey did you ever have experience with getting a co-op?
co-op?
internship
oh no
different names different countries
i was in IT by trade, got right into a paid job
that led to dev work, mostly python
which led to this
wow so did you have to get a cs-degree or not even?
not even
yeah i wouldn't recommend my route
its harder
but its the road i took, so what can you do
haha yes i'm in school and planning on staying in it
the thing is, it's almost time to find experience
i'm scared for technicals man
make your own experience
I currently am
come up with some projects for yourself
build scrapers, bots
set them up right
learn how to use docker and deploy things
learn version control
what are these things?
its a tip of an iceberg is what it is
docker is a great platform for virtualization
and delivering software in isolated containers
what it basically does is allow you to specify an environment for your app/script to run in
as an example, i have a docker script that loads a pre-built Ubuntu image
installs all of the required packages
copies the source over
you can run the code within that environment
that pretty much eliminates the odds that changes in your environment will break your script
then you can take that docker image and directly deploy it
so if you had a scraper, for example, you could push it to any server, either hosted at home or something on like AWS
just by pushing the entire image
no setting stuff up, cause its already set up
and that might sound hard
wow, this shit
but the file that does it all
sounds so complex
wow
let me find an example

i did IT for a long time
just learned everything i could
automated everything i could
I just wish i knew some good resources you know?
automate the simple stuff is a great starter
is python a good place to put my money in terms of where to be in cs
i'm learning so much shit
like html, css, c, java, low-level, a bunch of stuff
But i don't know where I wanna become specialized in
python is a nice language, solid, simple, and powerful
language knowledge is pretty transferable
once you have your head around a couple
learning others is easy
that's true.
thats because java came from c
well that'd make a lot of sense
but our school makes us do memory management manually
even tho, we have garb collectors
it be like that
well because a) garbage collectors suck, they draw performance and require a larger runtime and b) manual memory management teaches you something about how things work
hey man, that's true
there is a reason for example many of the libraries you use in data science like tensorflow are written in C++
knowing a little about how computers work @ a low level is really helpful sometimes
you don't need to be able to write a device driver
C++ is super powerful
in fact
c++ is too powerful
c++ is basically c + every object oriented feature mankind has thought of, which makes it a super overloaded language where, if you have a new project, you'll have engineers arguing about which of the versions from C++98 through 11 to 17 they should use
and once they have settled on a version of c++, about what the best design pattern is, because you can achieve the same thing in a million ways
1983 - Bjarne Stroustrup bolts everything he's ever heard of onto C to create C++. The resulting language is so complex that programs must be sent to the future to be compiled by the Skynet artificial intelligence. Build times suffer
a bunch of people at work who do professional c++ in their department have come up with the theory that stroustrup came up with C++ only to make money off the books people will buy to understand it
As someone who is taking c++ nxt semester, these perspectives are very interesting.
i think there is a big difference between learning a language at school where youre limited to the feature set your teachers throw at you
and using it professionally where everyone and their mom has another idea on what design pattern is the best
when did we go to functional programming
i'm still a junior loll
and more interestingly
idek what that is
why is this all happening in #data-science-and-ml
I think it started from the f-string discussion
which was in the context of scraping
Yes! i'm making a scraper
which led to me going on about deploying it
my first real side project
which led to languages
@lapis sequoia do you primarily program python or also do some others from time to time?
interesting, a true python master
So, how much math is involved in data science.
data science is a pretty wide field
typically in a college degree, the ds pathways are more math focused
it depends on what exactly you're doing.
minimally, I would say basic statistics and linear algebra
and calculus
wow calc too?
if you're planning to go down the research scientist path, then a lot more mathematics, probably
I was planning on not doing calc again lol
only if you need to do theory
that's a bit higher-level?
like if you're doing DL research
There's a lot of work for just supporting data science as well
I'd say for the majority of junior data scientists it wouldn't be relevant
As long as you have a grasp of the basic concepts
except insofar as you need it to understand stuff like gradient descent (actually, more or less only gradient descent...)
the world of development is super vast by the sounds of it
almost overwhelmingly vast
there are also things that are data science and not dev work, and vice versa
you can be a very bad coder and still be a fairly good data scientist
Damn i thought development was the thing that tied everything together kinda
well, if you mean software development specifically...then not really? because you can do DS without software at all, though that would be inefficient
however, it's fairly common for data scientists to not-really be developers
i.e. their expertise is in statistical analysis and experimentation
isn't the minimum these days a college degree for a post tho?
and being able to code is just a means to an end
that would depend on your country, really
mine is a little uptight about education, and I don't actually have a CS degree
so I kind of took the alternative route?
not sure how it works over there.
but I'd say that that kind of requirement is becoming more and more common...?
and if you don't fulfil it then you probz need to show competence in some other way
yep
Does anyone have any good resources on art and AI?
Im curious about music generation, lighting, paintings, etc.
Anyone here doing financial time series forecasting? Is there anything open source that you found to work well? Perhaps utilizing RL for forecasting?
Is a 1D convolutional network with attention something that I can get help with? I heard it's good for forecasting but haven't seen any code examples or examples of anyone using it.
Machine learning for stock predictions isn't generally done. You're almost certainly not going to do very well with it
Although as a learning exercise it's probably fine
It's pretty common to team machine learning with options trading. However, it pulls outside data and isn't just trying to predict the future based on movement alone.
Also, the fund managers I have discussed it with use it for marking potential trades not auto trading.
What do you guys think about AutoML? Is it good enough?
My biggest fear with machine learning is people creating models that impact our daily lives, but they dont really have a clue about what is going on.
That said, never used automl
And that's exactly one of the reasons many experiments don't get deployed, I mean if you develop a machine learning model to predict X thing, then people are going to ask why it made such prediction. I feel that with AutoML you don't exactly know what's going on behind the scenes.
I mean it looks great and all, I've tinkered a bit with IBM Watson which is basically an AutoML, but I don't know, I don't trust it lol
I'm concerned about people being able to deploy these models without understanding all the issues and ethical questions
Looking at you COMPAS, can't even see behind the curtain to defend yourself in court
They will use satellite images to track shipments, global warming trends and weather data.. Etc
using DL to generate features
using DL to predict stocks
I was just explicitly agreeing with you
I don't believe anyone can get great results using ML only, but I have a deep understanding of building algorithms that backtest decently already. My algorithms output into an oscillator format. I've only gotten to where I am because I know how other builders think. There's a lot of them and they all make the same errors. With ML I can use more instruments because I can't keep up visually, I'd rather get an alert than watch screens all day, I would lose my mind lol But I agree with you guys about stocks, I know about the satellite parking lot stuff. Evidence of insider trading/commitment is actually where ML should shine. I have some ideas for how to spot that. I want many tools that each look at different things.
I'm trying to build a decision tree model with a bot I am making for Discord. I have no clue what I'm doing in regards to the ML part lol. I try to read tutorials but it gets so insanely heavy into math I zone out and look for another source.
I have a great dataset that is widely used. As long as I receive good accuracy and precision I think I'll be alright..
Before I even deploy the bot I will test the model against real data and if it's misclassifying then I know I'll need to make adjustments. I just play around with parameters and values until the results show. I know it's not methodical but idk .. it works for me
You probably won't have enough data when you are training your discord bot live. You need to train your bot offline and feed it tens/hundreds of thousands of data.
I suggest doing a standalone ML project first and then worry about discord. ML is a really broad field. Some people who work in ML never work on the model but are purely occupied with gathering sensible data and optimizing those further or they are in the field of generating data which they plan to feed into their model. You are trying to make a finished product (the discord bot) but the core of your project (ML) is still at its infancy.
hey, anyone know? I'm trying to read an Excel file using pandas but I've come across a problem with the o/p: it's showing NaN
in terminal o/p?
what's o/p?
output I assume
Yep
copy and paste the actual error
@ocean beacon This is the dataset i am using https://archive.ics.uci.edu/ml/datasets/phishing+websites there is also this one but holy cow, its an enormous file http://www.fcsit.unimas.my/research/legit-phish-set
I have thought about that, "what if my bot is misclassifying too many real samples?" Maybe I could use the decision tree + reinforcement learning or decision tree + a neural network (albeit, the simplest form of neural network. I really dont understand how to implement the logic of an extremely robust neural network)
Complete datasets for phishing and legitimate websites are available for download here. We provide (a) raw samples of phishing websites and legitimate websites, and (b) feature vector data for machine learning based phishing detection.
I was reading a little bit about 2-class neural networks which might work
So your goal would be to take items people are typing, and then capture ones that are phishing attempts?
When users post a link it will go to my model which is being hosted by flask. All links are analyzed
If so, you will need more than just a decision tree. You will need NLP items to process the text, created topics, etc.
I did something similar. I created a py program that extracts features from incoming links
My dataset has 30 attributes so I extract those 30 attributes from urls
those 30 attributes from urls are then put into a new dataset. That new dataset is sent to the model which will determine if its phishing or not
Ok, not sure how well it performs, but you can check for false positives by doing a confusion matrix or plotting AUC
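A minimal sketch of the two checks mentioned here, written in plain Python so it runs anywhere. The labels and scores are made-up toy values (1 = phishing, 0 = legitimate), and the 0.5 threshold is an assumption. For real work, scikit-learn's `confusion_matrix` and `roc_auc_score` in `sklearn.metrics` cover the same ground.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(confusion_counts(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The false positive count is the number to watch for a link checker, since each one is a legitimate link flagged as phishing.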
I havent tested the dataset yet but I do have my decision tree all set up so I can do it whenever i want
thanks
I would try tweaking the decision tree model, measuring overall performance and also check things like AUC before feeding that data to a NN
Thats just the way I would go about it
Definitely. I read good things about boosted trees so I will play around with that as well. What is NN?
Oh right
If i were to use a tree + NN.. how would I consider the algorithm?
Decision tree classifies (assuming it classifies correctly) -> send to NN so that the DT keeps learning -> send results from NN back to DT?
like a constant feedback loop of-sorts
Hey guys. I'm coming from general discussion
I'm doing some data analysis via pandas. I have to do this task periodically like every 15secs. Though I think dataframes are slow, or I'm using them in a wrong way.
Here is my code: https://github.com/gokturksm/oyleler/blob/master/main_hibrit.py
I mean I thought of using python matrices, but thought it would be a lot harder to deal with.
I need to change some specific roads, and update my dataframe etc
I need to get that data as well, and I'm aware that pandas probably won't work for a live demo I guess
But for now, i just want to know whether I'm doing things wrongly, or any other ideas you guys have considering my code? I'm not the best, still learning. So there is that
Pandas becomes real slow when you start appending stuff to a dataframe over and over.
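A small sketch of the usual way around repeated appends: collect plain Python rows in a list and build the DataFrame once at the end. The column names `road` and `speed` and the sample readings are hypothetical, not taken from the linked script.

```python
import pandas as pd


def build_frame(readings):
    """Accumulate rows in a list, then create the DataFrame in one go."""
    rows = []
    for road, speed in readings:
        # Appending to a list is cheap; growing a DataFrame row by row
        # copies the existing data on every step.
        rows.append({"road": road, "speed": speed})
    return pd.DataFrame(rows)


# Hypothetical readings for illustration.
readings = [("A", 50.0), ("B", 42.5), ("A", 55.0)]
df = build_frame(readings)
print(df)
```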
I'm not sure I follow everything in the code, as the comments aren't English.
I could translate them if it would help
I did it nonetheless
This one is the latest one. I tried using apply and update. But I think I'm doing something wrong
I need to update my data and calculate some values, then I decide some pricing stuff. I have not implemented pricing stuff because I need to resolve my issues first I think
basically im just looking for a "simple" solution. When I read about implementation of ML and theyre like this is how you do it:
im like okay i dont know what any of that means lol
i dont have that background
but i can read code and understand that so i have to study actual implementation logic
Search n log n in Google. Codeproject.com has a good explanation
Has anyone used autoencoders to great effect before? I am thinking about using them in a signal processing problem, but all the hoopla I see about them is with MNIST examples.
There are a lot of academic papers using them, but I am not so sure if they work well in the real world.
@delicate carbon that's entropy
basically, you want to create rules that allow you to split a dataset into subsets where each subset has "similar" values of the variable you're trying to predict
higher "similarity" is lower entropy
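To make the entropy idea concrete, here is a short sketch using Shannon entropy in bits. The label lists are invented, and `split_entropy` is a hypothetical helper name for the size-weighted entropy of the two subsets a rule produces.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_entropy(left, right):
    """Size-weighted entropy of the two subsets produced by a split rule."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


mixed = ["phish", "legit", "phish", "legit"]

# A rule that separates the classes perfectly leaves two "pure" subsets.
good_split = split_entropy(["phish", "phish"], ["legit", "legit"])
# A rule that leaves both subsets half and half has learned nothing.
bad_split = split_entropy(["phish", "legit"], ["phish", "legit"])

print(entropy(mixed), good_split, bad_split)
```

A decision tree picks, at each node, the rule whose split lowers this weighted entropy the most.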
I have a csv with the following columns: update_rule, K, N, L, attack, time_taken, and eve_score (where attack can be "none" or "geometric"). I want to do two things:
- average time_taken and eve_score for each row with the same update_rule, K, N, L, and attack
- get rid of the attack column and replace eve_score with eve_score_no_attack and eve_score_geometric
at the end these should be the columns: update_rule, K, N, L, time_taken, eve_score, eve_score_no_attack, eve_score_geometric. does anyone know how I can do this? I've been looking at the pandas api but I don't really care about which library I use
that's basically a groupby mean
I don't see how it can be used in this case. can you explain a little more?
do you know how groupby works?
if I understand you correctly, what you want to do is df.groupby(['update_rule', 'K', 'N', 'L', 'attack']).mean() (for the first part)
ahhh that probably will do what I want
do you know how to approach the second part?
I'm not really sure what you want there
attack isn't numeric which makes it a little harder to work with in analysis. I want to get rid of that column. the start of my processed csv is currently:
update_rule,K,N,L,attack,time_taken,eve_score
anti_hebbian,4.0,4.0,4.0,geometric,12.9164006,79.53125
anti_hebbian,4.0,4.0,4.0,none,11.625815399999999,78.90625
I want it to be:
update_rule,K,N,L,time_taken,eve_score_no_attack,eve_score_geometric
anti_hebbian,4.0,4.0,4.0,12.271108,78.90625,79.53125
note that the updated time_taken comes from the average of the two previous time takens
if I understand correctly, you want to groupby attack then
and take the mean
and then join onto the result of the first part
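One possible way to put the two parts together, sketched on a tiny made-up frame with the same column names as the question. It assumes `attack` only takes the values "none" and "geometric", and that `time_taken` should be the average of the per-attack averages, as the example rows suggest.

```python
import pandas as pd

# Toy data with the question's columns; the numbers are invented.
df = pd.DataFrame({
    "update_rule": ["anti_hebbian"] * 4,
    "K": [4.0] * 4,
    "N": [4.0] * 4,
    "L": [4.0] * 4,
    "attack": ["geometric", "geometric", "none", "none"],
    "time_taken": [13.0, 12.0, 11.0, 12.0],
    "eve_score": [80.0, 79.0, 78.0, 79.0],
})
keys = ["update_rule", "K", "N", "L"]

# Part 1: mean per configuration and attack type.
per_attack = df.groupby(keys + ["attack"])[["time_taken", "eve_score"]].mean()

# Part 2: turn the attack level into one eve_score column per attack value.
scores = per_attack["eve_score"].unstack("attack").rename(
    columns={"none": "eve_score_no_attack", "geometric": "eve_score_geometric"}
)

# time_taken: average of the per-attack averages for each configuration.
times = per_attack["time_taken"].groupby(level=keys).mean()

result = scores.join(times).reset_index()
result = result[keys + ["time_taken", "eve_score_no_attack", "eve_score_geometric"]]
result.columns.name = None
print(result)
```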
Is there a way to use Kite for Colab?
@tawdry rose deep learning is machine learning with certain kinds of "deep" neural networks
hm thanks
In my code I am converting a csv file into pandas then I am adjusting the data in the columns
I have created two excel workbooks before the if statement
I have two issues in my code. The first is that, despite trying a number of solutions, I cannot add the dataframe rows to the Excel sheets properly according to the if statement. The if statement sets the conditions for adding a row to the accepted contacts workbook; rows that don't pass should be added to the rejected contacts workbook. What I am getting instead is that the full set of contacts is appended to the accepted contacts workbook over and over, and appended once to the rejected contacts workbook
for column, row in df.iterrows():
    row_check = row.isna()
    if (not row_check[0] and not row_check[1]) and ((not row_check[2] and not row_check[3]) or (not row_check[2]) and (not row_check[6]) or (not row_check[16]) and (not row_check[12] and not row_check[13] and not row_check[14] and row_check[15]) or (not row_check[8] and not row_check[9] and not row_check[10] and not row_check[11] and (not row_check[7] or not row_check[17]))):
        for r in dataframe_to_rows(df, index=False, header=False):
            ws.append(r)
    else:
        for r in dataframe_to_rows(df, index=False, header=False):
            ws2.append(r)
wb.save("Accepted Contacts.xlsx")
wb2.save("Rejected Contacts.xlsx")
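A likely cause of the symptom described above is that the inner loops write the whole `df` for every row that is tested, instead of writing only the current `row`. A hedged sketch of the per-row version follows. The column names, the sample contacts and the simplified "required columns are filled" test are stand-ins for the real 18-column condition, and plain lists stand in for the two worksheets.

```python
import pandas as pd


def split_rows(df, required):
    """Sort each row into accepted or rejected, one row at a time."""
    accepted, rejected = [], []
    for _, row in df.iterrows():
        # Simplified stand-in for the long condition in the question:
        # accept the row only if every required column is filled in.
        if row[required].notna().all():
            accepted.append(row.tolist())  # only this row, not the whole df
        else:
            rejected.append(row.tolist())
    return accepted, rejected


# Hypothetical contacts for illustration.
contacts = pd.DataFrame({
    "first_name": ["Ann", "Bob", None],
    "last_name": ["Lee", None, "Kim"],
    "email": ["ann@example.com", "bob@example.com", None],
})
accepted, rejected = split_rows(contacts, ["first_name", "last_name"])
print(accepted)
print(rejected)
```

With openpyxl, each list in `accepted` would go to `ws.append(...)` and each list in `rejected` to `ws2.append(...)`, followed by the two `save` calls after the loop.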
some theory help.
I am creating a neural net with a softmax activation function as the output. If I'm doing cross entropy, do I have to one-hot encode the output of the softmax before doing cross entropy? If I do have to one-hot encode, how does one go about doing this?
or does it come out of the softmax as a one-hot encoded vector?
cross entropy is computed between distributions
in practice, your labels will be a length-N vector of class IDs
and your predictions will be an NxK matrix of either prediction probabilities or logits
ok and the output of the softmax will be the x_pred in the cross entropy
with x_actual being the comparison array, right?
correct?
that sounds correct
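A NumPy sketch of the shapes described above: labels as a length-N vector of class IDs, predictions as an N x K matrix. The logits are made-up numbers. The softmax output is not one-hot encoded; the label simply picks out which predicted probability enters the loss.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean cross entropy for an (N, K) probability matrix and N class IDs."""
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]  # probability given to the true class
    return -np.log(picked).mean()


# Invented logits for N = 2 samples and K = 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])

probs = softmax(logits)
loss = cross_entropy(probs, labels)
print(probs)
print(loss)
```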
a little more theory help
@silent swan to update the weights on the backpropagation, do I take the derivative of the activation function (softmax in this scenario) or do I take the derivative of the cost function (cross entropy in this scenario).
the latter.
well, eventually the former too
in terms of backpropagation, there is no difference between an "actual" layer and an activation layer
when would the former be needed?
well, you need to propagate backwards throughout the network, right?
yea
you can think of each layer as applying a function to the previous layer's output
starting with the input
that's the forward pass.
in this case, the last layer is the softmax
the result of that then goes into calculating the loss with reference to the ground truth
as well as the gradient
so now you have loss and gradient
you propagate the gradient backwards through the last layer (there are no weights to update there)
but nevertheless you must apply the chain rule so the gradient's magnitude is correct
your second last layer is probably a FC layer or something
and that has weights to update
i think i've seen this used as a concrete example for back-propagation before
Ok thanks, I'll take a look at that, FC layer?
I thought you started the back prop from output backwards and not from the input layer again
ok, sorry I got confused when you said you start from the input, all good
the point is using the chain rule to get the derivative of the loss function with respect to the current layer, right?
yea
but surely by that logic, you would need the derivative of the softmax?
since that would be the "current layer" when starting the backprop
eventually
but remember
the value of the loss function is derived from the output of the softmax
so the very first gradient you can calculate is of the loss with respect to the softmax output
is that what you mean?
yea
yeah, then you go backwards
so output of softmax is used to calculate the cross entropy, and then I take the derivative of the CE to get the gradient, I compare the previous weights. Do i get the derivative of the sigmoid in the hidden layer then
yes, unless you don't want to train it
whats the point of that then xD
frozen layers?
oh yea sure
I only have 1 hidden layer though
I won't have frozen layers anyway
Ok I'll try and figure it out
so far what I've found is I struggle more with the python implementation than the math/theory
i was running my neural network and i got a DSATray.exe error talking about a "guard page" and the ide said this: Process finished with exit code -1073740791 (0xC0000409). What happened?
@velvet thorn my issue with that article is that the example it gives, it uses squared error and sigmoid in a multi class classification, and the code/math is much simpler for both of those over softmax with cross entropy
For cross entropy, where do I get the actual probabilities? Is that just the target value? Because the target value won't be a vector of the same size as the predicted probabilities
yes
a perfect classifier would predict 1.0 for the actual value
and 0.0 for everything else
oh
so for example if I have 5 options, and the target value is num_1, the vector will be [1, 0, 0, 0, 0]
1.0, 0.0... actually
but not that important
wait, do you mean the target
or the prediction?
target
ok how would I get the target value as vector, like is this where one-hot encoding comes in?
yea
oki got it, I feel stupid af lmao
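For reference, one-hot encoding a vector of class IDs takes only a few lines of NumPy. This is a sketch with an invented label array; `one_hot` is a hypothetical helper name.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn class IDs of shape (N,) into an (N, num_classes) 0/1 matrix."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0  # one 1.0 per row
    return encoded


# Five options; the first sample's target is the first option.
targets = one_hot(np.array([0, 3, 1]), num_classes=5)
print(targets)
```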
gm, question: if I wanted to optimise the gradient descent, is there a way I could use Adam without having to code it myself? would that entail wrapping my model in scikit-learn's wrapper?
fair, maybe I'll just stick with sgd, it doesnt have to be a good classifier, just a classifier
It might be worth having a look at seaborn's source code - no idea how readable or not it is
btw can i show correlations with any other plot
I've implemented a simple NN with one hidden layer, and I think it works, but I'm not sure. I'm just doing the most basic sigmoid/sigmoid_prime stuff - and it doesn't handle xor, should it?
just compute a correlation matrix, and then use plt.imshow
I have a question, this guy calculates cross entropy, but doesn't use it anywhere in his backprop.
https://beckernick.github.io/neural-network-scratch/
Is that wrong or did he get around it somehow that I am unable to see? I thought the point of the cost function was to see how far off the weights were, but how is he training if he isnt using the cost function values
@worn stratus if you have a hidden layer, it should be able to handle nonlinearities
Yeah, it turns out it does/did - but I was just bad and read 0.05 as 0.5
@granite sierra doesn't he do that under "Implementing Backpropagation"
well he does something, but he doesnt use the cost function value
he literally only uses the cost function value for printing the current loss
@velvet thorn do you understand what he is doing, is it correct?
but
why do you say that?
output_error_signal = (output_probs - training_labels) / output_probs.shape[0]
error_signal_hidden = np.dot(output_error_signal, layer2_weights_array.T)
error_signal_hidden[hidden_layer <= 0] = 0
sure but thats not cross entropy?
he has a cross entropy function higher up that he calculates loss
oh
I see your confusion
this is why naming functions well is important
cross_entropy_softmax_loss_array
you mean this, right?
that calculates the "average" loss
yea I think so, let me check
but output_error_signal is what you actually use for backpropagation
because remember
oh
when you apply the loss function to the predictions and target
you should get one value per pair
pretty bad naming TBH
and that's old code
maybe I should do a blog on this
why is it old? (might be a dumb question)
I thought there were more resources available
because it uses xrange
which is a Python 2 function
not a dumb question
I only briefly glanced through it
so he uses loss function just to calculate the average loss and prints that out every epoch
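The link between the two can be checked numerically. For softmax followed by mean cross entropy, the gradient of the loss with respect to the logits works out to `(probs - one_hot_labels) / N`, which is the `output_error_signal` expression quoted above. The sketch below compares that formula against finite differences on made-up numbers; the function names other than `output_error_signal` are my own.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def mean_cross_entropy(logits, onehot):
    """The scalar that gets printed each epoch."""
    return -(onehot * np.log(softmax(logits))).sum(axis=1).mean()


def output_error_signal(logits, onehot):
    """Analytic gradient of the mean cross entropy w.r.t. the logits."""
    return (softmax(logits) - onehot) / logits.shape[0]


def numerical_gradient(logits, onehot, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        up, down = logits.copy(), logits.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (mean_cross_entropy(up, onehot)
                     - mean_cross_entropy(down, onehot)) / (2 * eps)
    return grad


# Invented logits and targets for 2 samples, 3 classes.
logits = np.array([[1.0, -0.5, 0.2],
                   [0.3, 0.8, -1.2]])
onehot = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

analytic = output_error_signal(logits, onehot)
numeric = numerical_gradient(logits, onehot)
print(np.abs(analytic - numeric).max())
```

So the loss value itself never has to appear in the backward pass; only its derivative does.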
Fair, thanks for help, muchacho gracias
Anyone has experience with using Gaussian process for regression problems?
sorry if i posted this in the channel
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
โข Be patient while we're helping you.
You can find a much more detailed explanation on our website.
the most elaborate 0 to DL approach covering theory, math and code https://d2l.ai/
^ can some mod pin this
Thank you very much
hours upon hours of resources
Hey guys, so I have a neural net, multiclass, if I have 3 outputs and 5 hiddens, my softmax is returning 3x5, so like [0.15911229 0.66902292 0.17186479] but 5x
so it gives this
[[0.19122457 0.36888354 0.4398919 ]
[0.19122457 0.36888354 0.4398919 ]
[0.19122457 0.36888354 0.4398919 ]
[0.19122457 0.36888354 0.4398919 ]
[0.19122457 0.36888354 0.4398919 ]]
does anyone know why it might do this?
nvm all good, I fixed it
How can I make my for loops closer to O(logN) efficiency?
Replace them with a binary search?
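A sketch of what that swap looks like for a membership check, using the standard library's `bisect` module. It only applies when the data is already sorted; the sample values are arbitrary.

```python
from bisect import bisect_left


def linear_contains(sorted_values, target):
    """O(N): look at every element until a match is found."""
    for value in sorted_values:
        if value == target:
            return True
    return False


def binary_contains(sorted_values, target):
    """O(log N): halve the search range each step (needs sorted input)."""
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target


values = list(range(0, 100, 3))  # already sorted
print(binary_contains(values, 27), binary_contains(values, 28))
```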
Hey. I have 2 questions which I don't quite understand:
- How to turn words and text into code machine can understand (for chatbot, so bot can create a new meaningful sentence).
- After processing etc. the network will return a number. How can I turn this number into a word again?
I'm trying to make a deep learning chatbot, but these 2 things are impossible for me to solve
if you don't know how to do those
you probably need to do a lot of work.
a lot.
look up tokenisation and embedding @uncut shadow
that's a good start
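A toy sketch of both directions: words to numbers (tokenisation plus an embedding lookup) and numbers back to words (argmax over vocabulary scores). The corpus is invented, the embedding matrix is random rather than trained, and `fake_scores` stands in for a real model's output layer.

```python
import numpy as np

# Tiny made-up corpus; a real chatbot needs far more text.
corpus = ["the cat sat", "the dog sat down"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}


def encode(sentence):
    """Whitespace tokenisation followed by a vocabulary lookup."""
    return [word_to_id[word] for word in sentence.split()]


def decode(ids):
    """Map token IDs back to words."""
    return " ".join(id_to_word[i] for i in ids)


# Embedding table: one vector per vocabulary word (random, i.e. untrained).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))

ids = encode("the dog sat")
vectors = embeddings[ids]  # what the network actually consumes

# Going back: a model's last layer yields one score per vocabulary word,
# and the highest-scoring ID is mapped back to its word.
fake_scores = np.zeros(len(vocab))
fake_scores[word_to_id["sat"]] = 1.0
predicted_word = id_to_word[int(np.argmax(fake_scores))]
print(ids, vectors.shape, predicted_word)
```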
How do I use a gaussianProcessRegressor() for time series forecasting?
Hi guys, I am wondering if I can ask a simple pandas question
Right now I have this small sample dataset with order details:
My task is to find out the ONE order here that "has been processed by mistake".
My train of thought is: since it's a given that it has been processed, I will begin with all the orders with COMPLETED status.
So here is what I got
I know I can see that there is only one row with Inactive status. But to be accurate, I still use groupby to find out that the anomaly can be shown by using groupby(STATUS) on the result of the last step.
And eventually I can get the row I am after with this
In total I got:
c = df_merge[df_merge.ORDER_STATUS_CD == 'S']
c.groupby(['STATUS'])['ORDER_ID'].size()
df_merge[(df_merge.STATUS == 'Inactive') & (df_merge.ORDER_STATUS_CD == 'S')]
I feel like this is way too verbose and I have to use my eye to identify which groupby result group ends up being 1. I am wondering if there is a way I can directly utilize the fact that there will be 1 group with a size of 1 in the 2nd step. And directly get that row I need from there.
Instead of checking the result 'Oh, I should focus on this group as its size is 1' and then used the 3rd line of code to finally get the result.
Hope I made myself clear. Sorry for being verbose here.
Thanks.
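One way to skip the eyeballing step is to attach each row's group size with `groupby(...).transform("size")` and filter on it. This is a sketch on a made-up stand-in for `df_merge`; only the column names come from the question.

```python
import pandas as pd

# Invented stand-in for the merged order data.
df_merge = pd.DataFrame({
    "ORDER_ID": [1, 2, 3, 4, 5],
    "ORDER_STATUS_CD": ["S", "S", "S", "P", "S"],
    "STATUS": ["Active", "Active", "Inactive", "Inactive", "Active"],
})

completed = df_merge[df_merge.ORDER_STATUS_CD == "S"]

# For every row, the size of the STATUS group it belongs to.
group_sizes = completed.groupby("STATUS")["ORDER_ID"].transform("size")

# Keep only rows whose group has exactly one member.
odd_one_out = completed[group_sizes == 1]
print(odd_one_out)
```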
@drowsy grove expect input and desired output please.. rephrase your question
how do you define an order as being 'processed by mistake'
Well, I know what is word tokenization and embedding, but (iirc) it only allows to match 2 words so for example alice, wonderland.
I mean, I'm thinking about how to make a chatbot in Python which would create its own sentences based on sentences it has seen. The problem is how to make sentences. It can check if 2 words have the same (or close) meaning, but idk if it's going to work with sentences.
then you don't understand what embeddings are
let's take a look at two sentences, and tell me what you observe from them
I waited at the signal
vs
He gave me a signal
what's different here?
Well, everything, but signal is different
And the meaning of signal is technically the same in both sentences
the semantics changed depending on the context
embeddings are the same way.. they are a representation
and some embeddings are context aware.. so the word signal when converted to an embedding is not the same and changes with the context
you can hence arrive at sentence level context aware embeddings
understand?
Your objective is to have your chatbot make responses.. one similar system that makes responses based on context is widely called Question Answering..
Gmail smart reply for example
And when you're using something like Gmail's smart compose.. that's something called NLG, that uses context in realtime to update its predictions... returning the highest semantically similar prediction based on its trained corpus
@uncut shadow andddd you disappeared
@lapis sequoia I'm back :P
read and ask questions if you have any
Ok, I'll look into it and I'll try to google something, thanks!
semantic similarity with sentence level embeddings is the concept I just told you about
@lapis sequoia My bad. Reading my question again, it does seem that the basic desired output is missing.
Input is the 15rows * 11columns dataframe.
Desired output is the record/row of the order that was 'processed by mistake'.
As for what is defined as 'processed by mistake', that is a very good question. I honestly do not know. The given information is that there is indeed one order in this dataframe that is processed by mistake.
that's your filter condition
Exactly.
ohk.. so you don't know
So I've been trying to deduce from the given.
Exactly.
Nope...like a small project haha.
It's a merged dataframe.
Oops, that looks awful.
of course it does...
what does active inactive mean
That's exactly where I got stumped.
What I was trying to do was find one record that is anomaly.
And that must be the one.
Basically I'm deducing my filter condition from the result.
let's try to be less verbose.. and more objective
Since there's only 1 inactive record that shows "COMPLETED". I thought it must be it.
That's a huge problem of mine. Verbosity.
IMO, 'processed by mistake' consists of 2 parts: 'processed' and 'by mistake'
let's see what can be your possible conditions based on the fact that you don't know the information about the fields you have
Sure!
im just waiting for you to let me
My bad. Go on.
- Status of User Inactive, but Order Status CD shows as S
- Order Status Cancelled but Order Status CD shows as S or P
- No order date or No Order ID but order status CD shows as P or S
that's it
from what I can gather from this
Makes total sense. I failed to see the possibility of 2 and 3.
it helps to not get in your own way
looking at a problem requires breaking down in to manageable pieces and solving the parts
You are right. Lemme look at the data again.
So I shall create filter for each situation and then compare the results?
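A sketch of that approach: one boolean mask per scenario, then a count of how many rows each mask hits. The data is invented, and the column names `ORDER_STATUS` and `ORDER_DATE` are guesses since the real frame isn't shown; `STATUS`, `ORDER_STATUS_CD` and `ORDER_ID` are the names used earlier in the thread.

```python
import pandas as pd

# Made-up orders; ORDER_STATUS and ORDER_DATE are assumed column names.
orders = pd.DataFrame({
    "ORDER_ID": [1, 2, 3, None],
    "ORDER_DATE": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "ORDER_STATUS": ["COMPLETED", "COMPLETED", "CANCELLED", "COMPLETED"],
    "ORDER_STATUS_CD": ["S", "S", "X", "X"],
    "STATUS": ["Active", "Inactive", "Active", "Active"],
})

processed = orders.ORDER_STATUS_CD.isin(["S", "P"])

# One mask per scenario listed above.
checks = {
    "inactive_user": (orders.STATUS == "Inactive") & (orders.ORDER_STATUS_CD == "S"),
    "cancelled_order": (orders.ORDER_STATUS == "CANCELLED") & processed,
    "missing_id_or_date": (orders.ORDER_ID.isna() | orders.ORDER_DATE.isna()) & processed,
}

hits = {name: int(mask.sum()) for name, mask in checks.items()}
print(hits)
```

Any scenario with a non-zero count can then be inspected with `orders[checks[name]]`.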
i've a script that i wanted to call using a bash alias as
alias s='path/to/script.py'
the issue is that the script requires some libraries that aren't in my base python install and it also calls some folders within the project (from dir import constants) that i think aren't working when trying to use it as an alias like this.
what's the best approach here?
I know.. but you're talking about a bash alias.. seems like they'd be able to help.. or you can ask in a help channel
don't have a specific question - this channel seems ok... i don't think a bash alias is particularly exotic
I've just realised that I'm in the #data-science-and-ml channel not the general python channel, sorry
@lapis sequoia BTW, I ran the queries for all 3 scenarios and only the first scenario returned 1 row.
I think now I can safely assume that 1 is the filter condition for 'processed by mistake' right?
Thanks. This looks also doable with SQL.
Anyone with experience in learning Spark from scratch via PySpark?
Regarding exploratory data analysis: What can be done in the beginning when analyzing a dataset?
I can think of checking dimensions, missing values.
Run a df.info(), plot some visualization. Anything else?
Using Pandas and Numpy at the moment
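A few more first-pass checks, sketched on an invented frame: summary statistics, per-column missing counts, category frequencies, duplicate rows and pairwise correlations of the numeric columns. The column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],
    "income": [31000.0, 52000.0, 47000.0, 60000.0],
    "city": ["Oslo", "Lima", "Lima", None],
})

print(df.shape)                # dimensions
df.info()                      # dtypes and non-null counts
missing = df.isna().sum()      # missing values per column
summary = df.describe()        # count/mean/std/quartiles for numeric columns
city_counts = df["city"].value_counts(dropna=False)  # category frequencies
n_duplicates = int(df.duplicated().sum())            # fully duplicated rows
correlations = df.select_dtypes("number").corr()     # numeric correlations

print(missing, summary, city_counts, n_duplicates, correlations, sep="\n\n")
```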
@raven glacier me
This book a good way?
well
if you like books, maybe? I didn't learn from a book
do you already have experience with pandas?
I was already experienced with pandas
the concepts are more or less the same
so I just started working with it
the main difference is SQL-like queries
and the ML model
okay so you just learnt as you went along by googling etc
fair enough mate
Do you think making my own models from scratch (for example, only with NumPy and Matplotlib) makes sense? I want to jump into machine learning and things like deep learning so do you think implementing my own models from scratch makes sense?
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pandas_datareader as pdr
plt.show()
plt.close("all")
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
the above isn't plotting, using ipython in a terminal (osx), i'm expecting a popup plot but am not getting one, anything obvious I'm missing?
ok, plt.show() just has to be at the end it seems
this seems kinda verbose, if i'm running ts.plot() in an interactive session it seems that i'm going to want to view it immediately, rather than having to enter plt.show() each time after
I was asked to describe "how to make a cup of tea to people who know nothing about it" today in a big data analyst interview.
3 to 4 minutes, don't squeeze the bag
I was a little flabbergasted but I tried my best to break it down into steps.
I know there isn't a correct answer. But what shall I specifically emphasize when answering questions like this.
@jolly briar they didn't say it comes in bags
you can make reasonable assumptions i'm sure
But in my limited experience, loose tea, tea bags, procedure is the same haha.
well, depends on the tea for temp and stuff but sure... i would assume that they're just seeing a problem broken down into steps here tho
That's what I thought. Still, a fun question and thought I would share.
They made me choose
Either this or how to get leaves off my lawn.
unless they were waggling their cup at you when they asked, maybe it was a hint
I never used leaf blower once so I chose
Haha, it would be. But this was on the phone so I wouldn't know.
hope it went ok
closes all the plots i reckon, i've not used matplotlib for ages so it's all a mystery... ggplot > matplotlib
Ugh, I just do df.groupby()....plot() usually
Would I just be using the plot function of pandas?
yea but it's calling matplotlib anyway afaik
Gotcha. Thanks. I learned from Brandon Rhodes's talk.
Most people on Kaggle use sns. And it does make a difference I have to admit.
matplotlib is rank in my opinion
rank as in stink?
yes, i don't like it... i always feel as though i'm having to do more than is necessary when i use it, feels too low level or whatever. I think seaborn helps there, but gg ( for me ) is really nice. Building things up with pipes in dplyr, passing to gg, facet wraps etc, are all quite nice
I like matplotlib because it gives me explicit control over everything
yeah i can get that, i've never really found i need it tho
it makes great plots in a pretty logical manner...
facet wrap enables you to create a grid of plots easily, like facet_wrap(sex ~ .) will create a grid by sex
I've had to make very many different kinds of plots, so sooner or later any premade abstraction fails for me
