#data-science-and-ml
offtopic: srry if wrong channel. i just created a ai/ml starter pack here: https://www.reddit.com/r/ProgrammerHumor/comments/mal4eb/aiml_developer_starter_pack/
hi guys - I'm getting
```
y_pred.shape.assert_is_compatible_with(y_true.shape)
ValueError: Shapes (None, 2) and (None, 1) are incompatible
```
when trying to use more than only 1 metric in my model.compile.. why would that be?
when I only use metrics=['acc'], it works..
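One likely cause, stated as an assumption because the model and labels were not posted: the network ends in a 2-unit softmax, so predictions have shape (None, 2), while the labels are integer class ids of shape (None, 1). Keras resolves the string 'acc' to an accuracy variant that copes with integer labels, but many other metrics compare y_pred and y_true directly and need matching shapes. A minimal plain-Python sketch of converting such labels to one-hot rows (`to_one_hot` is a hypothetical helper; in Keras, `tf.keras.utils.to_categorical` does the same job):

```python
# Assumption: labels are integer class ids shaped (N, 1) and the model
# outputs 2 softmax units, so predictions are shaped (N, 2).

def to_one_hot(labels, num_classes):
    """Turn [[0], [1], ...] integer labels into one-hot rows."""
    rows = []
    for (class_id,) in labels:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

y_true = [[0], [1], [1]]                             # shape (3, 1)
y_true_one_hot = to_one_hot(y_true, num_classes=2)   # shape (3, 2)
print(y_true_one_hot)
```

If the labels become one-hot, the loss would also need to be the non-sparse variant (for example categorical cross-entropy instead of sparse categorical cross-entropy).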
I think that you have particularly strong opinions on some topics that may not necessarily align with reality or with the needs of someone else. For example, Kaggle's mini-courses do teach important aspects for beginners. while it may not be needed for you, in practice most beginners like to start with something small.
https://towardsdatascience.com/kaggles-micro-courses-my-favorite-introduction-to-data-science-f0cc6aeb024c Here, the author lists that Kaggle's micro-courses start with:
- Data visualization
- Pandas
- Basic DL which covers Transfer Learning and Data augmentation
- Intro and Advanced SQL
- Geospatial analysis with GeoPandas
- Basic NLP
- Intro to RL that covers an agent using simple minimax
This differs somewhat from your statement-
In fact, it doesn't teach you ML at all
It gives you a tutorial thing which literally just calls the Decision Tree method without really explaining what it is
Which seems pretty wrong seeing the above evidence.
TBH I really admire your knowledge and apologize if I sound rude or always contradictory (because people in PyDis enjoy arguments a lot) but simply that everyone has an opinion - and if different people provide their perspectives to someone, the person on the other end receives a much better answer of their question overall.
Kaggle is not good enough for beginners
I'm sorry dude
it's designed to show the highlights of ML while they spoonfeed you code
but it'll never give you a strong basis
so I agree with Raggy
did you even do the course?
bc if you didn't I don't think you have any standing to talk about it
I did do the course when I was new to ML 🙂 if you want to see the amount of stuff they have, this is "Intro to DL", where I think the amount of math and code is roughly level with what a beginner needs https://www.kaggle.com/learn/intro-to-deep-learning
You can inspect more courses here https://www.kaggle.com/learn/overview and decide for yourself. as for me, they helped me when I was a beginner so I stand by my opinion
Use TensorFlow and Keras to build and train neural networks for structured data.
Practical data skills you can apply immediately: that's what you'll learn in these free micro-courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
Which metric could I use to evaluate different clusters given that the feature they are based on will differ
yeah sure you can learn the highlights of DS/ML
doesn't mean you actually know what you're doing
@hollow sentinel also, you can execute an exercise where it takes you to a different notebook which tries to explain the concepts learnt with a sample dataset 🤷
especially when they spoonfeed you all the code
lmao
wow an exercise with a sample dataset
with all the code given to you
so innovative
fill in the blanks
lmao
@hollow sentinel I think your perspective is that of someone very new to CS; most people already know the basic coding required to start the courses from scratch
I do know the basic coding to start from scratch
intro to DL is not the first course BTW
its like the 6th or 7th one. before that they teach stuff like visualization and even more basic stuff
micro courses are not enough to build any significant skills
ofc
hence the name "Micro-course"
yeah, but its more than good to give an overview to a beginner
(especially when they are not in college)
I wrote this before ^^ people who usually do the courses have been doing CS since school, not learning it first time in college. they can't cater to everyone
it's just a way to cater excitement
generate hype
it does nothing to teach you
well, then I can't argue with you since you are just fueled on opinion rather than arguments 🙂 have a good day
nice way to concede
maybe help me instead of pointlessly arguing 😆
So I've been trying to use K-means and Spectral Clustering to try and detect this cluster over here. I've read that these are good with even clustering sizes, so does this mean it couldn't pick up on the cluster in the circled area?
Uh. Visually that circle doesn't look like a separate cluster to me
Unless you only mean the little group of points off to its own side
Got it. That is definitely better
So i also know that there is a cluster here
So, K means needs a k upfront, what output did you get with K set to 4?
hol' up lemme go check
Im using sklearn btw, so do you mean like the number of clusters i told it?
As for clustering algorithms in general, the idea is they're usually doing their own thing. Usually you only really want to use them for exploration that leads up to something down the line
And yes, the number of clusters
Ill show you what i get for 4 clusters:
So i'm just using this to show to my supervisor how machine learning can be used to detect clusters. I'm not trying to get anything analytical from it
It looks like the spectral one is handling that cluster a little better, but it's slightly off
Where can I get started on machine learning and ai in general?
#data-science-and-ml message is a book i was recommended in this chat ^^
Anyone can help me figure out why I'm getting incompatible shapes ValueError when using multiple metrics in my model.compile? If I only use 'acc', it works..
I'm curious how dbscan would perform here.
Thank you
oooooh i read about dbscan! Sklearn says it's good for uneven clusters right? But it said its use-case was for "non-flat geometry"?
@charred egret IRL example of genetic algos https://algorithms-tour.stitchfix.com/#new-style-development

found it from a podcast. very nicely done
apparently made with D3

would you have any insight into setting the eps parameter for DBSCAN? It's not like Kmeans or Spectral where you can predefine how many clusters you're expecting
So, dbscan is intended for spatial clustering, so yes on that.
However that shouldn't stop you from just seeing how it performs, since you can treat each feature as an axis in one dimension
For eps, it's simply a param to play around with, I'd say let it do its thing. Higher eps makes fewer clusters iirc
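A common way to narrow down eps instead of guessing is the k-distance heuristic: for every point, take the distance to its k-th nearest neighbour, sort those distances, and look for the "elbow" where they jump. A minimal sketch in plain Python, assuming 2D points and a hypothetical helper name `k_distances` (k is usually set to the min_samples value you plan to use):

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: a tight group plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)
print(curve)
```

Here the first four values are small and the last one is much larger, so an eps slightly above the small values would keep the tight group together and leave the far point as noise. On real data the curve is usually plotted and the elbow read off by eye.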
anyone?
Well i've been playing around with it and I can't really get better than this:
Doesn't seem to even be able to tell the two big clusters apart a lot of the time
Ah. Hmm. Guess that's not the move for this dataset then
RIP
I guess spectral is giving me the best of other options ive tried. I went for MeanShift as well
what's wrong with this results BTW?
The purple overlaps into the red, where the true cluster is just the tiny LHS island of the purple (for the spectral result)
Should look like that
your clusters are not distinct enough to be identified by k-means.
A quick Google search turns up this paper that deals with clusters with high overlap: http://ceur-ws.org/Vol-1455/paper-06.pdf Their recommendation is to use an EM algorithm with an additional CBOvalue score to aid it (and they claim it works better than spectral)
ah thank you for that! I'll go check it out ^^
Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans
what are some good hyperparams for keras neural net?
Why is my SVC f1-score on training data in the first code and 2nd code different?
Hey,
Not sure if this is the right place but im having some issues with matplotlib
this is my graph
it does not show the actual data
and the line should have a smooth increase
only like +100 per hour
but as you can see it bugs significantly
also, is there a way to make the graph fit to size?
Its a bit wide atm
import tkinter as tk
import matplotlib.pyplot as plt
from pandas import DataFrame
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
import random
import datetime
data = {'Price': [1],
        'Years': []
        }
time = datetime.datetime.now()
previous_number = 1
for x in range(1000):
    num = random.randint(1, 100)
    previous_number += num
    data["Price"].append(previous_number)
    data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
    time += datetime.timedelta(hours=1)
data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
df = DataFrame(data, columns = ['Price','Years'])
for x in df.values:
    print(x)
root = tk.Tk()
figure = plt.Figure(figsize=(1000,4), dpi=100)
ax = figure.add_subplot(111)
chart_type = FigureCanvasTkAgg(figure, root)
chart_type.get_tk_widget().pack()
df = df[['Price','Years']].groupby('Years').sum()
df=df.astype(float)
df.plot(kind='line', legend=True, ax=ax)
ax.set_title('Example')
this is my code atm
have you double-checked to see if your data is sorted
yea
```
data = {'Price': [1],
        'Years': []
        }
time = datetime.datetime.now()
previous_number = 1
for x in range(1000):
    num = random.randint(1, 100)
    previous_number += num
    data["Price"].append(previous_number)
    data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
    time += datetime.timedelta(hours=1)
data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
```
is how its generated
it should always be going up
but 100 max
Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans
@misty flint
i got the aspect ratio to fit
but
1s
how can i download with python images from google image search?
and @lapis sequoia just ran
```
before = 0
total = 0
for line in string.splitlines():
    num = int(line.split("[")[1].split(" ")[0])
    if num <= before:
        total += 1
        print(before)
        print(num)
    before = num
```
basic checker to see if there is any anomalies in the data set but it never printed anything
its the graph
its only when showing it in tkinter
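A plausible explanation for the jagged line, offered as a guess from the posted code rather than a confirmed diagnosis: the 'Years' column holds strings in day-first format, and `groupby('Years')` sorts its keys, so the rows get reordered alphabetically rather than chronologically before plotting. The generated data stays monotonic, which would match the checker finding nothing. A small sketch of the effect using only the standard library (the start date is arbitrary):

```python
import datetime

start = datetime.datetime(2021, 3, 25, 0, 0, 0)   # arbitrary date for illustration
times = [start + datetime.timedelta(hours=h) for h in range(0, 24 * 10, 24)]

# Same format string as in the posted code.
labels = [t.strftime("%d/%m/%Y %H:%M:%S") for t in times]

print(sorted(labels)[:3])        # April dates sort ahead of the March ones
print(sorted(labels) == labels)  # False: string order is not time order
print(sorted(times) == times)    # True: datetime objects sort chronologically
```

If this is the cause, keeping real datetime values in the column (or a year-first format such as "%Y-%m-%d %H:%M:%S") should restore the smooth increase. Separately, `figsize` is measured in inches, so `figsize=(1000, 4)` asks for an extremely wide figure.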
Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans
could you elaborate on your problem?
Can anyone suggest some site from which I can get a stream of data / free API to be fed into pypspark
Hello everyone, sorry if this is the wrong channel but I figured this was a basic data science question. I am trying to have a pandas column that will get the percentage change for stock data between yesterday's close and today's open. I cannot seem to find the correct wording to google for this very simple task. Attached is the Excel relation I am trying to reproduce in my Python script. Anything helps, thank you!
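The calculation being described is usually called the overnight (close-to-open) change: (today's open - yesterday's close) / yesterday's close. A minimal sketch in plain Python with made-up prices; the function name and data are illustrative only:

```python
def overnight_change_pct(opens, closes):
    """Percent change from the previous day's close to each day's open."""
    changes = [None]  # the first day has no previous close
    for today_open, prev_close in zip(opens[1:], closes[:-1]):
        changes.append((today_open - prev_close) / prev_close * 100)
    return changes

# Made-up prices for illustration only.
opens = [100.0, 102.0, 99.0]
closes = [101.0, 100.0, 98.0]
overnight = overnight_change_pct(opens, closes)
print(overnight)
```

In pandas the same idea is typically written with `shift`, for example `(df["Open"] - df["Close"].shift(1)) / df["Close"].shift(1) * 100`, assuming the columns are named "Open" and "Close" and the rows are sorted by date.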
I read an account where a person trained a word2vec model on their own dataset and then used those vectors to train the model. that seems strange - can we expect a reasonable boost in accuracy on an embedding model trained on little data and get it to capture the contextual vector for each word? it doesn't seem right to me, but maybe you guys can illuminate this issue
I have a retrained InceptionV3 model on Keras and I'm getting 100% acc and val_acc from first to last epoch.. and ofc it's not accurate when I predict.
what could be wrong?
Does anyone know any good source of data for big datasets? I need at least 4 GB of data for a project where I need to apply clustering through Apache Spark
How brain handles all the noise and still is able to maintain accurate perception: (non-scientific articles) https://neurosciencenews.com/perception-neuron-misfire-16129/
forecasting tool https://facebook.github.io/prophet/
take a sneak peek. the visuals are mind-blowing
🧠
made with d3

they had a full stack data scientist doing it

I'm pretty sure there is an image scraper somewhere in the web.
Do you guys prefer panda over conventional NOSQL? I'm trying to figure out if I should keep learning SQL as a DevOps
I mean, I'm not a DevOps yet but I'm working towards it, lol
So I developed a rudimentary chess AI, but it’s slow as hell. Any way I can speed it up?
that's because brute-forcing takes time
if you're searching the state-space in Python, it's just going to be slow in general
pretty much just because it's the kind of thing Python is slow in - iteration and number-crunching,
python's fine at number crunching using numpy, just iteration and general business logic is slow
In general, though:
- Profile your program. (Every single optimization must start with that step)
- See what the most expensive functions are, and consider if you can't speed them up, or even rewrite them in something like Cython.
yeah, pretty much. The general idea is that the heavy stuff should be in a faster language - numpy is an example of a library doing that under the hood.
but if there isn't a library implementing what you want, your only choice is to learn how to rewrite functions in numba/cython/whatever yourself.
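The "profile first" step can be done with the standard library's cProfile. A minimal sketch, where `slow_sum_of_squares` is a made-up stand-in for whatever the expensive search function is:

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately loop-heavy stand-in for an expensive search function."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The functions at the top of the report are the ones worth optimizing or moving to numba/Cython; anything far down the list is rarely worth touching.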
A guy did try to brute-force in chess and implemented some sophisticated techniques to reduce time. it still wasn't enough (he used C++)
so I doubt python contributes much to it
The problem is that chess just has too many combinations. not as many as Go thankfully, but it's still pretty significant
so the best way is to train an AI
lmao no
I mean, it's true that search is just slow, but also it'd probably be like a hundred times faster in C++ than in Python 😅
more like 200
also, isn't this how classical (non-ML) chess engines work? They can be very advanced.
search engines are sorta different, but basically every game AI you encountered in any game up until the last couple years had no learning
I agree with all your points, but just that with so many combinations there is no reasonable way to speed it up. minimum, it takes 20-30 minutes for each move
oh, you mean the really basic ones?
even state of the art ML doesn't give the best player possible, that's not reasonably searchable as a space
AlphaGo did use that in Go
nope, it's still not the optimal player, just a really really good one
I mean... chess engines exist.
and it beat the world champion twice, so its not much up to debate
also chess engines were beating grand champions decades ago
it's not hard to write a chess engine that is better than all humans
not chess
Chess engines definitely don’t search the whole state space, but they have efficient methods for pruning nodes with bad moves and taking into account transposition
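The pruning idea can be sketched as minimax with alpha-beta cutoffs on a tiny hand-made game tree. This is a toy illustration, not how any particular engine is implemented: the nested lists stand in for positions, the numbers for static evaluations, and there is no transposition table.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of a nested-list game tree, skipping branches that cannot matter."""
    if not isinstance(node, list):       # leaf: a static evaluation score
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:            # the opponent already has a better option elsewhere
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:                # we already have a better option elsewhere
            break
    return best

tree = [[3, 5], [6, 9], [1, 2]]          # made-up evaluations
seen = []
value = alphabeta(tree, maximizing=True, visited=seen)
print(value, seen)                       # 6 [3, 5, 6, 9, 1]
```

The leaf with score 2 is never evaluated: once the third branch is known to be worth at most 1, it cannot beat the 6 already found. Real engines get most of their speed from cutoffs like this combined with good move ordering.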
there's stuff like this which is completely ML-less and still human-level
https://en.wikipedia.org/wiki/Stockfish_(chess)
Stockfish can use up to 512 CPU threads in multiprocessor systems. The maximal size of its transposition table is 32 TB. Stockfish implements an advanced alpha–beta search and uses bitboards. Compared to other engines, it is characterized by its great search depth, due in part to more aggressive pruning, and late move reductions. As of November 2020, Stockfish 12 (4-threaded) achieves an Elo rating of 3516 +24/−20 on the CCRL 40/15 benchmark.
though it does get murdered by ML chess players:
In December 2017, Stockfish 8 was used as a benchmark to test Google division Deepmind's AlphaZero, with each engine supported by different hardware. AlphaZero was trained through self-play for a total of nine hours, and reached Stockfish's level after just four. In 100 games from the normal starting position, AlphaZero won 25 games as White, won 3 as Black, and drew the remaining 72, with 0 losses. AlphaZero also played twelve 100-game matches against Stockfish starting from twelve popular openings for a final score of 290 wins, 886 draws and 24 losses, for a point score of 733:467.
Stockfish is a free and open-source chess engine, available for various desktop and mobile platforms. It is developed by Marco Costalba, Joona Kiiski, Gary Linscott, Tord Romstad, Stéphane Nicolet, Stefan Geschwentner, and Joost VandeVondele, with many contributions from a community of open-source developers.Stockfish is consistently ranked firs...
On this day 21 years ago, the world changed forever when a computer beat the then-chess champion of the world at his own game. On February 10, 1996, Deep Blue beat Garry Kasparov in the first game of a six-game match—the first time a computer had ever beat a human in a formal chess game.
yeah, the best chess player is always an AI hands down
mostly because the "AI" is just a normal chess engine being sped up with a good heuristic
so it's more of a normal engine++
questionable. Are there any model-free-learning-based chess AIs?
there might be but they won't beat stockfish :p
how are you even supposed to google that? "computer beat human in chess with no ai"
model-free RL chess?
thats not a good google search term
try it tho, it might be 😉 I just get 3D models lol
got it lol
In 2016, we introduced AlphaGo, the first artificial intelligence (AI) program to defeat humans at the ancient game of Go. Two years later, its successor - AlphaZero - learned from scratch to master Go, chess and shogi. Now, in a paper in the journal Nature, we describe MuZero, a significant step forward in the pursuit of general-purpose algorit...
its a model....?
MuZero just models aspects that are important to the agent’s decision-making process. After all, knowing an umbrella will keep you dry ....
wtf
I dont think you know what you're talking about enough to actually have a conversation about this. It's not given a model of the game which includes the rules of chess
why are they calling it model-free then
wait, what do they mean by "aspects that are important to the agent’s decision-making process" then?
and why do they use reward, value and policy then? sounds kinda like RL to me
MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.
Yes, it's model-free reinforcement learning
Specifically, MuZero models three elements of the environment that are critical to planning:
The value: how good is the current position? The policy: which action is the best to take? The reward: how good was the last action?
oh, that's just what all RL (or at least everything derived from q-learning) does. Not sure why are they calling it modelling, tbh.
https://paperswithcode.com/method/muzero
MuZero is a model-based reinforcement learning algorithm.
LOLOL
MuZero is a model-based reinforcement learning algorithm. It builds upon AlphaZero's search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure.
The main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observat...
It's learning the game dynamics without being given the rules
Why are you going lololol
This is exactly what I've been saying all the time
You haven't "owned" anyone, you just dont understand the conversation we're having
model-free reinforcement learning
^ that's what you said
MuZero is a model-based reinforcement learning
^ paper
Sorry, I should have been more precise, it has a model of the graph structure, it doesn't have the dynamics of the game
that kinda seems like RL, but you can help me understand the difference
The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward.
anyway, so MuZero is model-free and beats chess engines and even older AIs. So sure, it's easier to just plug an ML heuristic to a search engine, but it surely isn't the only way things can go.
so...the only difference is there in this part
the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.
like you mean it does not model the game or its rules
It's not completely model-free, it just doesn't have the dynamics
yes, it doesn't have a dynamic model of the game which alphazero and other chess engines usually have
I should probably try reading that paper, lol
but if alphazero can generalize over multiple games (chess, Go, shogi) how does not having a dynamic model make a difference as long as it can generalize?
alphazero does have a dynamic model lol, you have to program it in
that's just how nearly all RL works. If I understand it right, the idea is that the transitions between current and possible future states depending on the action are learned by the model (instead of programmed in)... and also not stored at all, I think, just used on each value update?
RL can work entirely on the observables (image, etc) of program (model free), or it can have an idea of the dynamics and/or the decision making structure (model)
This one is partially model free in not having the dynamics but modelled in that it's not working on the direct observables but a parsed structure with decision information
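For contrast, the fully model-free end of that spectrum can be sketched with tabular Q-learning. The learner below only ever sees sampled (state, action, reward, next state) transitions and never reads the environment's rules. The corridor environment, its size, and the hyperparameters are all made up for illustration and have nothing to do with MuZero itself.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical 5-cell corridor, reward at the right end
LEFT, RIGHT = 0, 1

def step(state, action):
    """Environment dynamics. The learner never reads this code, it only sees samples."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):                       # cap the episode length
            action = rng.choice((LEFT, RIGHT))     # purely random exploration
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == GOAL:
                break
    return q

q = q_learning()
policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
print(policy)   # greedy action per non-terminal state
```

A model-based method would instead use `step` (or a learned approximation of it) to look ahead before acting, which is roughly what tree search does in the chess and Go systems discussed above.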
im not much familiar with RL, but wasn't the whole point to learn an environment without any prior hard coding (contrary to the dynamic model that we have to program in it)
Not of AlphaZero
MuZero learns dynamics, sort of, it learns the dynamics of the optimal control based values
Not really, for example, does a human learn without any prior "hard coding"? No, genetics are a thing.
and AZ does not; it has to be hard coded. that seems like a workaround
true
But MuZero is still fundamentally based on a graph search, it's still doing Monte Carlo Tree Search
It's just learning some dynamics alongside the fundamental decision making MDP model
but conventional consumer level stuff can usually solve simple environments without any hard coding, so I assumed that other techniques just scale up the complexity 🤷 sad
not really? RL tends to only solve without hardcoding for simpler mostly low dimensional problems
The chess "AI"'s are kind of fuzzy when it comes to model vs model-free. Of course, one does not need to model everything, only parts of some task could be modeled.
RL is very, very far behind what people think it can do
like that survival simulation one of OpenAi; that's kinda complex but it doesn't require hard-coding (at least according to the narrator)
It's productive to read some summarisation of how ridiculously bad RL can be, even if someone with a few million to spare has been able to "solve" Go
https://arxiv.org/pdf/1806.09460.pdf
https://www.alexirpan.com/2018/02/14/rl-hard.html
Just a pretty naive way to tackle the RL problem: while an agent is randomly searching its space, why can't we inject a pseudo-random action sequence that roughly represents the task we want it to do, and have the model just optimize it for maximum reward (with a well-defined, well-thought-out reward function)?
Wouldn't this allow it to learn complex tasks if we give it a boost at the start (a nudge in the right direction), so that it can more easily find the best way to accomplish a complex task?
it seems to me like you just described what all RL agents that get taught on human records do
like, AlphaZero is Zero because it only got taught on its own games - it wasn't primed by learning on tons of human matches like AlphaGo. As a result, AlphaZero took a lot longer to learn, but ended up better at the end.
umm, I may be misunderstanding you here, but what I meant is just to provide a skeleton of the possible action that the RL algo should take to help it accomplish pretty complex tasks and get a general idea of how it is supposed to solve a particular environment.
does anybody here actually use cupy?
i've tried using it once and it had all sorts of errors
and it seems nice but i've never found it super helpful anyways
@stiff barn @rapid fog yo yo
ayy
google analytics ive heard is a real nice tool
for tracking
does google have AutoML or is that a dif cloud provider?
What's the easiest way to host postgres on the cloud for a discord bot?
Just on the VPS?
Google Cloud has hosted postgres

Is it free?
Can also use digital ocean
Yeah, GCP has auto ml. Most of the clouds do
You get plenty of free credits with both @rapid fog
Digital Ocean is my go-to for smaller projects. Has easy-to-launch VPS, managed databases, Kubernetes, etc.

I'll take a look. Thank you!
No problem
i need to become more familiar with the cloud
maybe this summer
when working with AWS

AWS is still the most popular so a good place to start
i always feel like in this field there is always an endless amount of things to learn

if you want to stay relevant
Haha yeah you can never learn it all.
do you do any testing/unit-testing
people have said i should also learn that
this never-ending bucket list... 
Yeah, I write test cases for every function/method I build.
that sounds like good swe practice
It becomes natural very quickly. Writing tests in Python is pretty intuitive.
just the native unittest library
Next time you go to test some python code manually just try to write a test instead and you might find that it actually makes your life easier.
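A minimal sketch of what that looks like with the built-in unittest library; `is_even` is a made-up function standing in for whatever you would otherwise poke at manually:

```python
import unittest

def is_even(n):
    """Hypothetical function under test."""
    return n % 2 == 0

class IsEvenTests(unittest.TestCase):
    def test_even_numbers(self):
        self.assertTrue(is_even(4))
        self.assertTrue(is_even(0))

    def test_odd_numbers(self):
        self.assertFalse(is_even(7))

# Run the suite programmatically so the snippet also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsEvenTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```

In a real project the tests would normally live in their own file and be run with `python -m unittest`.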
Going well. Working on the finishing touches of a recommendation engine I've been building for a while.
Had to build too many pieces for it haha
Should go with Azure AutoML for totally unbiased reasons
What type?
Somewhat nontraditional. I built a binary classification multi-modal model that takes the apartment images and processes those via a CNN, then structured data as a DNN, then concatenates them. I'm recommending apartments to just myself.
hey, its a nice use case

one guy on a podcast i heard built a neural net just for tinder swipes
for himself
Ah so it's supervised content based recommendation
I was expecting collaborative filtering
Yeah, there is just 1 user so hard to use traditional techniques. Plus I wanted to experiment with multiple input type models.
Sounds efficient haha
How about yours @misty flint?
wait let me see if i can find a link
@stiff barn https://towardsdatascience.com/m2m-day-89-how-i-used-artificial-intelligence-to-automate-tinder-ced91b947e53
💀
beginning
if you come from a non-technical background, you can start with andrew ng's AI for Everybody course on coursera
its a good start
Seems to be the one people gravitate to. Must be good
my learning? im still working on group projects haha
we barely finished a 2-3 week long one where we made a contract analysis app with some basic nlp
its good for non-technical people imo. i like andrew's business perspective on AI
he goes through what kind of business projects AI/ML is good at vs. those that arent good projects
and then walks through how to try to build up a AI/data culture at your company if youre trying to create buy-in/not everyone is onboard with change
lol
this is the kind of project i need
Pretty interesting and quick read
im looking for a course thats more hands-on, being that im already somewhat fluent in python
Sounds pretty useful. That's something a lot of people would skip
yeah but i wouldnt really recommend it to most technical peeps unless theyre interested in going into management or part of a large team
at the very least, you can 2x through the videos and get through the gist of it pretty quickly

i shall patiently await for your results
then you can report back to the class

The problem is, i have no training data
1 sample from my side :v
few dimensions
F
Haha yup, gotta build it up

That's the fun part
honestly that sounds like a hilarious project to have on your resume
def a talking point
thats pretty wild yeah but hes done other crazy things before
I built a client and labeled like 2000 apartments for my project and that took a long long time
oh yeah he did this challenge called 12 months to mastery which is honestly pretty ridic
@stiff barn March was his Tinder bot month
💀
Very interesting haha
Can see a lot of room for improvement but super cool for a fast project
i really need to learn CNN
and OpenCV
the whole analyze images makes for some very damn good portfolio projects
did opencv for one project but nowhere near mastery

pillow is a cool library too
using that for another project
The nice thing in that area is that is where the bulk of the pre-trained models for transfer learning are so you can get good results quickly.
I like pillow. Keeps things simple
I think i want to make a repo with 3 projects. One purely "analytics" and visualization
One with a ML model (not sure of how to show it here)
One with an image NN (havent even started NNs lol)
yeah i feel like projects are now either CV or NLP focused
at least the "interesting" ones
bro i just started checking NLP out and i hate it
though, I actually feel like doing a sentiment analysis miniproject in spanish (my mother language)
well i heard you usually end up choosing one or the other
NLP is weird because you just know GPT-3 is there and you'll never get anywhere near it
but id need to scrape some language
yeah NLP is a dead end with GPT and transformer models lol
that would be cool. i think one of my team members is doing sentiment analysis for russian
truth be told, with how behind most companies are, even a simple KNN implementation would do wonders 
yeah but i think there will still be cool business use-cases, no?
Would be nice if GPT-3 was open so we could use it for transfer learning and such. I have access to the api for it but that's limited to some extent.
maybe it will be open more in the future
bro, no joke, I have a guy in my BU trying to make a "bot" for classifying cases? His idea, REGEX!!

There are still plenty of use cases for NLP.
motherfucker I can do that better with a simple KNN lmao
And GPT-3 is spawning many businesses.
Probably will be closed forever though since Microsoft bought the exclusive license to it.
machine learn your way to the right regexp 
It will be for sure. And you can get access to the api if you ask nicely and wait a long time.
there's literally a startup to make web apps, based on GPT-3, that only needs a rough description of the app

Yeah, the stuff being built with GPT-3 is very interesting
it's pretty rough from what i've seen, but c'mon, can you imagine the chaos of getting rid of half of "full stack devs"?
This was cool as well
Agreed
haha this is great. i like the avocado chair
Avocado chair is pretty good


How much linear algebra do people in machine learning actually use?
35% in pinned messages
hi guys
so i'm an intermediate in python
and i wrote a tutorial on how to make a face recognition program with python
if anyone has any spare time i'd love if you checked it out and tell me if i have any mistakes
Maybe you’ve been watching a lot of techy tv shows,maybe you’re learning how to code and need a project to practice on, or maybe you’re…
nice one! intermediate and already writing tutorials? very cool
what do you guys think should be my k in this case?
You're saying that because of GPT and transformers, there's nothing else to do in NLP?
My coworkers haven't run out of things to do 🤷♂️
I mean, I said it as a tongue-in-cheek kind of joke, but I can see how it definitely didn't come across that way
My hate for NLP is also not subtle :p
There is so much to learn in NLP that I am dying every day
it's like every paper has some different technique and there's a whole flood of them
Puts down banhammer
raises banhammer
😔
what do you find boring about it? (just curious)
I said hate, not boring
what do you hate about it then?
Hiiii, I have a question
If I want to start learning data science
should i learn jupyter, rstudio, watson studio,... or python's numpy, pandas, matplotlib, seaborn... first?
thanks!
is machine learning related to data science?
emacs
it's a kind of irrational hate. Lemmatization (especially of internet l33t speak) is absurdly annoying
yes, it's a subset of data science. and it's an approach to AI
@serene scaffold wow, I thought machine learning and deep learning are both AI
AI is what you say in your resume, deep learning during interview, machine learning during your work
Data science has an intersection with artificial intelligence but is not a subset of artificial intelligence
its mostly opinionated - I for one agree with the above
so is an if-else program to determine whether a number is even or odd artificial intelligence?
I don't even know what l33t is
yes - extremely basic or naive, it tries to mimic artificial intelligence keeping in mind the modern interpretation
it technically counts as logic, but doesn't actually exhibit intelligence, you can't say its AI
Ouch, this hurts. https://en.wikipedia.org/wiki/Leet
Leet (or "1337"), also known as eleet or leetspeak, is a system of modified spellings used primarily on the Internet. It often uses character replacements in ways that play on the similarity of their glyphs via reflection or other resemblance. Additionally, it modifies certain words based on a system of suffixes and alternate meanings. There ar...
i'm currently learning computer science at college as a first year
Again though, most of the definition are up for discussion and opinion 🤷 and you would find plenty of ideas online
is data science a subset of computer science or does it use computer science as a tool?
computer science is a tool for it
you can do data science by hand
but, who wants to do that 😆
Can any1 help me?
I think most DS studies would fall into CS schools these days because the majority of the work is involving CS. But economists use DS all the time, so does business school, agriculture,etc...
whats your question?
I have a file that I wrote, but I'm facing some issues and the file won't run
I made a file in python
you mean you wrote code
can you show some code or tell us what the error is
and i couldn't solve this error
is it specific to AI / Data-science?
^
if its just a general issue then you can claim a help channel #❓|how-to-get-help
it means what it says
invalid syntax
you wrote the code wrong
this doesnt look like it has anything to do with data science though
so you should claim a help channel
it looks like you already have one lol
o;

why are you learning jupyter BTW?
it's included in the course, i didnt specifically pick it
and I recommend you leave roadmaps but rather learn basics and learn what you like rather than following some set path
no, its a screenshot from my laptop
yup
might as well start somewhere. as long you think you can complete it
just know its not comprehensive
tbh I find a mindset that "I have to learn x thing by y time" to be the most unproductive one ever. its not how we really learn things
because there's a course from edx too but it's very different and has things like sql, rstudio, jupyter lab, watson studio,...
I discourage Python learners from touching jupyter until they have more experience with the language, as it makes everything more difficult to debug and encourages you to not think about code re-usability.
I don't get why everyone's like "I started coding and aim to build an app in 1 month and then learn AI at the end of 5th month" after that just milk that 120K. That is such a bad mindset. you end up leaving CS in 2 weeks just because you don't like it.
This isn't your school exams that you force yourself and it doesn't matter much if you forget everything (or worse, just rote memorize and forget). CS takes time, years, decades of work to be good at it. roadmaps are a good indicator as to what amount of knowledge an average person is expected to have at a certain point, but its not a path set in stone to follow for eternity.
yeah honestly imo it's better to just think of an application or project you wanna learn how to do, then learn about that specific topic/project to be able to do it. then after you learn one project and like the basics of it, you can adapt your code to do other stuff as well
so like if you start off with something like "i want to be able to visualize this dataset", then you learn about how to use the different visualization tools, learn about data management and stuff like pandas/numpy, etc
later, you can use that same code with the same dataset, and learn more stuff building on that
by building up on concepts you learn it makes it a lot easier than just learning stuff in order
Literally I love @austere swift 's approach
personally i set myself 3 goals, in increasing difficulty :
visualizing can be an interesting project too - say you have real-world data of the amount of time kids study and what their behavior is. then visualizing is a great project because then I would be genuinely interested in whether kids who study more have a higher chance of depression or not.
you can do almost everything better and learn faster too if you have the motivation 🙂
- visualization project with python
- ML application, simple with python (predict something with an ML model)
- More complex, CV application with CNN
and im basing my learning on that
mostly
I used to do personal projects for learning all the basics - now I have reduced those (because I can't manage the time very well) but I find competitions much more encouraging to explore experimental techniques and somehow apply them to increase my LB score.
Hello Evervybody , I have a question , how to calculate the average of a signal please ??
That's why I encourage beginners to do those simple kaggle competitions (one which have a monthly LB) to learn more
the statement tells me: create a function that evaluates the average of a signal
yeah whenever i wanna learn something i just start a project on it
agreed. its the most fun way 😁
yeah its a lot more interesting than learning it from somewhere online since you can actually see the results of what you did
and satisfying
LB?
Leaderboard - usually referred to the Leaderboard score
oh
@grave frost answering your question. It might not be that i dislike NLP, but mostly that i'm just pissed off at the awful quality of teaching ive had of it so far lol
so i'll probably have to relearn it from scratch if i ever use it
how bad?
(on a scale of 1-10)
2
meaning your teacher was drunk?
awful explanation, the lecturer was as stimulating as a political speech and his explanations and analogies were shit and there was very little code or math along
might as well read wikipedia to learn it
haha lol
and honestly im a bit burnout too i think lol
sad, NLP is kinda interesting. though my interest in AI has been dwindling somewhat lately
I hate it because i loved the ML part and i was actually excited about everything i did, even if i struggled with seemingly basic stuff sometimes
so i went to NLP excited, but this guy killed me in a week lmao
does anyone here have some experience with skimage.transform module?
i'm trying to implement either PolynomialTransform().estimate or PiecewiseAffineTransform().estimate and i'm getting errors that idk how to deal with properly
but any experience at all with skimage.transform would be helpful
@grave frost is this what you meant?
for what?
yeah, that's the LB - the top rankers in a competition
Can I ask here a pandas question?
ask away
Yeah, what is it?
df['invoicepayed'] = df['invoicepayed'].replace([r'\N'], np.nan)
will replace \N with NaN in my dataframe, seems to work okay
if df['invoicepayed'].notnull():
    pd.to_datetime(df['invoicepayed'], format='%Y-%m-%d %H:%M:%S')
format works for not NaN and without the if-statement
so with if ... I try to convert only for not NaN
I get
"The truth value of a {0} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Quick pandas/numpy question: I want to get the mean of the bottom 10% of values in a series. How do I do this?
I checked stackoverflow found some posts but I seem not to understand the underlying problem
df['invoicepayed'].notnull()
this is not what you think it is
that returns a series
where each value in the column is evaluated to be NAN or not NAN
so if you do if df['invoicepayed'].notnull():
you get an error because you are saying
"Is Series True?"
and python cant answer that
@lapis sequoia you're not comparing each object inside the series, you're comparing the series itself there
Hi basically working with data frames with excess of 1 million rows and I’m using pd.groupby, specifically
```py
for (a, b), c in df.groupby(by=['col1', 'col2']):
    ...
```
I’ve noticed this is very very slow and was wondering if anyone had any suggestions for improvement? I’ve tried itertools groupby which slightly improved times, but I think because the column consists of strings maybe somehow converting the columns to an integer value might speed things up ? I have no idea but would love to try some of your guys suggestions 🙂
a million rows shouldn't be that much of a problem for pandas @abstract zealot can you share a screenshot of your df?
Ok. So I need to check for every element in the series not the series itself.
in theory, but check what pandas datetime does to nan values BEFORE trying to apply logic there
if datetime ignores nans, then you can just use it
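To make the point above concrete, here is a minimal sketch, assuming a toy column with the same name and the same `\N` marker as in the question. It shows that `notnull()` gives one boolean per row (so it can't go in a plain `if`), and that `pd.to_datetime` already maps missing values to `NaT`, so no per-row check is needed.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame; only the column name comes from the chat.
df = pd.DataFrame(
    {"invoicepayed": ["2021-03-01 10:15:00", r"\N", "2021-03-05 08:00:00"]}
)

# Replace the literal two-character marker \N with a real missing value.
df["invoicepayed"] = df["invoicepayed"].replace(r"\N", np.nan)

# One True/False per row. `if mask:` would raise the "truth value is ambiguous" error.
mask = df["invoicepayed"].notnull()

# No `if` needed: the whole column is converted and NaN simply becomes NaT.
df["invoicepayed"] = pd.to_datetime(df["invoicepayed"], format="%Y-%m-%d %H:%M:%S")
```

If some rows hold other junk besides `\N`, passing `errors="coerce"` to `pd.to_datetime` turns those into `NaT` as well.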
Sorry my bad it’s 25 million you made me recheck hahaha
oh.... I played in my jupyter notebook and didn't notice.
thats a lot of data
Thank you very much.
itll probably take a long time regardless unless you have access to more processing power
im not sure using a for loop is a good idea there
thats probably what's slowing you down
why are you looping to do a groupby anyways lol
lol
@exotic maple are there any better alternatives to groupby? I unfortunately need to do calculations on Sub data frames returned by groupby for certain values
you can try just aggregating or passing a custom function
IF i get you right
for example, you want the SUM of N values of a row in a groupby
df.groupby("RELEVANT GROUP").agg({"COLUMN TO SUMMARIZE": SUMMARY_FUNCTION})
I’ll definitely try something like this and let you know thank you very much man
you can also try it via pivot tables
but i found pivot tables in pandas...odd. i prefer grouping manually lol
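A rough sketch of the "aggregate instead of looping" idea, assuming made-up column names (`col1`, `col2`, `value`) and a tiny frame in place of the 25 million rows. The second half tries the category dtype for the string keys, which may or may not help on the real data; worth timing.

```python
import pandas as pd

# Tiny stand-in for the real frame; column names are assumptions.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col2": ["x", "x", "y", "y", "z"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# One vectorized pass over all groups instead of a Python-level for loop.
summary = (
    df.groupby(["col1", "col2"], sort=False)["value"]
      .agg(["sum", "mean"])
      .reset_index()
)

# String keys can be stored as integer codes via the category dtype.
df["col1"] = df["col1"].astype("category")
df["col2"] = df["col2"].astype("category")
# observed=True keeps only key combinations that actually occur in the data.
summary_cat = df.groupby(["col1", "col2"], observed=True)["value"].sum()
```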
Another quick question @exotic maple what if in addition to grouping, I wanted to only look at the data frames in intervals from rows 0-20, 20-40, 40-60 etc
Is this possible with the method you describe?
I think so yes
Eh ive never done that but i think you can.
Id try this.
df["splits"] = pd.cut(np.arange(len(df)), 5)
This will create 5 equal-width bins of row positions for splitting
Then id use groupby by that column
Im sure theres a better way but i dont have a sample df nor energy right now lol
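For the fixed intervals asked about (rows 0-20, 20-40, ...), one possible sketch is integer division on the row position. The chunk size of 20 comes from the question; the `value` column and the frame itself are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with 50 rows.
df = pd.DataFrame({"value": np.arange(50, dtype=float)})

chunk_size = 20
# Rows 0-19 get label 0, rows 20-39 get label 1, and so on.
df["chunk"] = np.arange(len(df)) // chunk_size

# The chunk label can then be used like any other grouping column.
chunk_means = df.groupby("chunk")["value"].mean()
```

The label can also be combined with other keys, e.g. `df.groupby(["col1", "chunk"])`, assuming the rows are already in the order you care about.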
Jahahaha no problem thank you very much again
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 60)
any ideas how to fix that?
how can I save cluster mappings of Kmeans aggregation to a dictionary?
I used sklearn if it matters
@tidal bronze test this: https://stackoverflow.com/questions/60858780/dict-of-cluster-and-partition-with-kmeans-python
clust_map = dict(zip(agg_df.index, agg_df["an_vol_cluster"]))
I used this in the end which seems to be quite similar to what you are suggesting 😄
thanks anyway @fading kernel
model = Sequential()
model.add(LSTM(4,input_shape=(940,60),return_sequences=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs = 100, batch_size = 32)
input?
an array of minmaxed values
give me a sec
```
[[1.98211310e-02 2.13912644e-02 2.05622164e-02 ... 7.13034744e-02
  8.30816740e-02 8.40888464e-02]
 [2.13912644e-02 2.05622164e-02 2.05600173e-02 ... 8.30816740e-02
  8.40888464e-02 8.43153503e-02]
 [2.05622164e-02 2.05600173e-02 2.06237903e-02 ... 8.40888464e-02
  8.43153503e-02 8.45660438e-02]
 ...
 [5.82092875e-05 4.71699742e-05 4.45750758e-05 ... 1.26729997e-03
  1.00240043e-03 1.10061074e-03]
 [4.71699742e-05 4.45750758e-05 4.63343289e-05 ... 1.00240043e-03
  1.10061074e-03 1.38178337e-03]
 [4.45750758e-05 4.63343289e-05 4.59165063e-05 ... 1.10061074e-03
  1.38178337e-03 1.59278379e-03]]
```
looks something like this
I mean using the input layer
tf.keras.layer.Input
and there is also a reshape layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/Reshape
Layer that reshapes inputs into the given shape.
idk
im having a brain fart at this point
i tried reshaping it like this X_train = np.reshape(X_train, (940, 60, 1)) which worked before btw, but now it gives out this error ValueError: total size of new array must be unchanged, input_shape = [60, 1], output_shape = [940, 60]
@exotic maple your suggestion works! As does using a lambda function in the groupby 🙂
So, any ideas how to fix that?
what is your array length ?
940 * 60 is not 60 !!!!
you can use this for the second argument of reshape :
(1,len(X_train))
does it work?
but its 940,60…
you want 3D array ?
as far as i know the product of the dimensions should equal your array's total number of elements
Wdym
i mean multiplication of values in second argument of reshape
So what u r saying is multiplication of second arguments should equal 3?
no no
i mean this
i did matrix multiplication by hand the other day
for example you have an array with length 12 okay ? if you want to reshape it to any dimensions, the product of the numbers should be 12
Hey,
I have this def that takes a string and a DataFrame as arguments.
def accuracy_by_species(specie_name, df):
Now, I want to use apply function and pass DataFrame as argument. Is this possible?
Something like:
['a', 'b', 'c'].apply(accuracy_by_species, MyDF)
Any help would be appreciated.
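A plain list has no `.apply`, so one way this could look is wrapping the names in a `pd.Series` and handing the DataFrame over through `args`. The function body and the `species`/`correct` columns below are invented for illustration; only the function signature comes from the question.

```python
import pandas as pd

# Invented data; the real DataFrame will look different.
my_df = pd.DataFrame({
    "species": ["a", "a", "b", "c"],
    "correct": [True, False, True, True],
})

def accuracy_by_species(specie_name, df):
    """Fraction of correct rows for one species (illustrative body)."""
    rows = df[df["species"] == specie_name]
    return rows["correct"].mean()

# Series.apply forwards everything in `args` as extra positional arguments.
names = pd.Series(["a", "b", "c"])
result = names.apply(accuracy_by_species, args=(my_df,))

# Without pandas on the left-hand side, a comprehension does the same job.
as_dict = {name: accuracy_by_species(name, my_df) for name in ["a", "b", "c"]}
```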
for example (2,3,2) or (1,12) or (3,4,1) or ... is valid
R u sure it works that way
yes because it is a rule, you can check it in the numpy docs
glad to be of help man
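The reshape rule can be checked with numpy alone. This sketch assumes `X_train` really has shape (940, 60), as stated earlier in the thread; the zeros are just placeholder data.

```python
import numpy as np

# Placeholder with the shape mentioned in the thread: 940 samples, 60 values each.
X_train = np.zeros((940, 60))

# 940 * 60 * 1 == 940 * 60, so adding a trailing axis of size 1 is allowed.
X_3d = X_train.reshape(940, 60, 1)

# -1 lets numpy work out the first dimension from the total size.
X_auto = X_train.reshape(-1, 60, 1)

# The length-12 example: every target shape must multiply out to 12.
flat = np.arange(12)
ok_shapes = [(2, 3, 2), (1, 12), (3, 4, 1)]
reshaped = [flat.reshape(shape) for shape in ok_shapes]

# 5 * 3 == 15 != 12, so this raises ValueError.
try:
    flat.reshape(5, 3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

If each of the 60 values is meant to be one timestep with a single feature, the 3D array above would go with `input_shape=(60, 1)` on the LSTM layer rather than `(940, 60)`, since the batch dimension is not part of `input_shape`. That reading of the data is an assumption.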
Hey there, I have a pretty common task that I struggle with in python, usually trying to use pandas and matplotlib. I've seen guides online for similar things, but never quite this issue, which I'd think is very common:
I have a series of discrete events broken up into timestamps, like "message sent at timestamp x", and some 10,000 of those. The timestamps span maybe 13 months. All I want to do is bin that data into days, so like "10 messages received on Jan 1st, 12 on Jan 2nd, 7 on Jan 3rd," etc. I'd like to see it on a graph, showing the number of events per day over time, to see trends. I've already converted the timestamps into epoch time (seconds since 1970) and have it in CSV form and as a dataframe in pandas.
Anyone know how I can do this?
how is that bad?
well, first of all
- convert the timestamps to a format understood by pandas.
- extract the date only portion of the timestamp
- perform groupby and aggregate sum
do you have to compulsorily use pandas? it could be easier just a simple iteration + slicing/splitting
#1 and #2 I think are done, as long as I'm telling pandas to read the seconds since 1970 properly
#3 What would I be grouping/aggregating by? I see examples like "number of events per day of week" and such, but not just a graph of events over time summed by day
I don't need pandas, I'm just very unfamiliar with how to properly plot things in Python
if you have your days only, you can use
just took a lot of time
do you mean basically just count 24 hour periods from a start date to an end date, iterating over my sorted data and counting manually in a loop?
df.groupby("date") -> this will groupby the table by the UNIQUE values of the grouping column
then, you can cast an aggregation function
in your case you simply want sum so
its easy for small ones
df.groupby("date").agg({"messages":sum})
you can also do it via pivot table, since pivot table is literally a grouping function as well
ahh...you can then simply use split. can you show a sample of your data?
Sure, just give me a minute to upload it
wouldnt taht create many dfs?
the way i got it he only wants grouped sums
normal string split
ah lol
isnt it easier to convert to timestamp and extract day?
it depends
that's the approach id use
I prefer the shortest route however dirty it might be 😁
ah, a fellow lazyness connoiseur (or wtf that french thing is spelled)
I had a problem to solve to get the mean of nested arrays and this time I decided to do it properly with a class since it would be re-used. took me an hour
@grave frost https://pastebin.com/aPgXA1DE
just to write this piece of shit:
class BPE():
    def bpe_embed(self, arg):
        vec = []
        for _ in arg:
            vec.append(bpemb_ny.embed(_))
        return self.averager(vec)

    def averager(self, sentence_vec):
        averaged_vec = []
        for j in sentence_vec:
            for k in j:
                averaged_vec.append(k)
        avg = np.mean(averaged_vec, axis=0)
        #print("avg:", avg)
        return avg

    def final(self, arg):
        final = []
        for stanza in tqdm(arg):
            final.append(self.bpe_embed(stanza))
        print(final)
        return np.array(final)
so you just want to extract the epoch right?
!e
a_string = '1593316925.431|user1'
print(a_string.split('|')[0])
You are not allowed to use that command here. Please use the #bot-commands channel instead.
oh no, I mean, I'm already reading the CSV file in and can manipulate the data as I want. I just want to go to the next step, to produce a graph that shows me the events over time, per day. So like "10 events on this day"
I can see programatically sorting it and then iterating over it in 24 hour chunks, creating a new table that way
I've also used a pivot table for this in the past somehow
how does epoch and user correlate with day and events?
each timestamp is an event that occurs on some day in time. The user column is mostly useless for this case, it's already filtered by user
and the number at the start?
epoch|author is the row that defines the headers
1593316925.431|user1 is the first row of data, with the first column being an event that occurred on June 28th, 2020, 4:02:05 AM UTC, by user1
aight. so then just extract the day from each entry, put it in a list and then use matplotlib. what do you find difficulty in?
you can use count to then count the number of times it appears
I guess I was expecting that matplotlib or pandas would have a function like hist() or something that would automatically know how to do this
I can extract the day and count programmatically, then plot that
I guess simpler is better. I was hoping for fanciness
I guess there might be some function 🤷 but I don't know
tho tbh you might be surprised how many common things we do not have functions (or libs) for
could numpy help me perhaps? I'll need to sort the data by that column, find the min and max to get the date range, and then iterate through the data to sum up the counts per day. basically binning it manually
Note: if you end up implementing this manually, @numba.njit that function and you'll likely get pretty acceptable speeds.
as for a numpy solution, hmm.
what do you need exactly? Plot the counts of events per day? That seems like a histogram with fixed bin edges to me - if so, you can just use plt.hist or np.histogram.
it's a one-time processing of, worst case, 310,000 rows, so I don't really mind the processing time
matplotlib's hist calls numpy's histogram, even.
it spans over a year, so would a histogram work? Like it'd need to bin hours into a day, for maybe 400 days
nah that would be awful
try weeks
pretty much; you might just need to manually generate the bin edges
you can try getting "week of year" (a number from 1 to 52) and generate a histogram from there
but that's a pretty small function, comparatively; it's only 400 numbers
my goal is to see a trend over time, where counts per day would give me a good indicator of daily activity that may fluctuate over time. Like, picture wanting to do analytics on a website where you see hits per day from one user
wait, it's references all the way down?
basically I'm doing a transformation of an unordered list of discrete events into a summation of hits per day
then plotting that
I see. So yeah, that's just a histogram.
ah
I've looked up guides on histograms but not found something clear about this specific thing
@red yew try this
- separate epoch from author as awesome told you
- convert the epoch to a readable dt format
- extract day / week from dt format
- create histogram
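The steps above can be sketched roughly like this, assuming the `epoch|author` layout described earlier and a handful of made-up rows in place of the real CSV. `resample("D").size()` does the binning per calendar day (UTC here) and keeps empty days as zeros.

```python
import pandas as pd

# A few made-up rows in the epoch|author layout from the chat.
df = pd.DataFrame({
    "epoch": [1593316925.431, 1593320000.0, 1593403325.0, 1593576125.0],
    "author": ["user1"] * 4,
})

# Seconds since 1970 -> pandas datetimes.
df["when"] = pd.to_datetime(df["epoch"], unit="s")

# Count events per calendar day; days with no events show up as 0.
daily = df.set_index("when").resample("D").size()
```

From there `daily.plot()` gives a line over time with real dates on the x-axis, and `daily.rolling(7).mean()` would give the running weekly average mentioned later in the thread.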
like I've tried this:
df = pd.read_csv('output.csv', header=0, delimiter='|', quotechar='^', quoting=csv.QUOTE_MINIMAL)
fig, ax = plt.subplots()
df["timestamp"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.show()
but it resulted in a graph without sufficient bins
As an example, just plt.hist it. The results will be horrible because it will by default choose like 10-20 evenly sized bins, but it should work. To make it right, pass the bins argument to it.
ye that's the road I'm going down, though #4 I don't know how to do yet, or really #3 and how to store that data in a df
ah hm, let me see the bins arg
yup, precisely, you'd have to specify their count and maybe also the precise positions.
current output
i mean...thats what you want
its just the x axis is wrong
because you are using epochs and not human dates
can I format the x-axis with a function that knows how to handle the epoch? Or do I need to change the input format
i'd choose changing input format, much cleaner
and reproducible
but thats up to you
labels = pd.to_datetime(labels) this part of your code doesnt seem to be working
I can put it into ISO8601 or something. I'll need to lookup how matplotlib handles datetimes I suppose
o
we are nothing without standards
I don't know how to convert a column of data that's in a dataframe. I assume there's a transformation function that can be applied over it.
bins=365 improves things already
pd.to_datetime or however the hell it's spelled
buuuuuuuuuuuut
im not sure if datetime converts epochs
should be possible to at least convert it to numpy's datetime type
hmm to_datetime takes an strftime format string, hm
once you know what you want, its easy to google it 😉
pandas was literally built to handle annoying datetime stuff lol id be surprised if it didnt handle it
now, being a bit more..."scientific" why you are looking at that trend like that plasma? I think a more interesting observation could be:
1-) messages by day of week.
2-) seasonality of messages by month / day / week, etc
daily change itself doesnt seem too valuable to me there
in fact, your data has some very noticeable spikes, so there seems to be somehting there
in this particular case I'm going to correlate the trend over time with events that have happened at discrete times
see, the spikes i mentioned :p
like ideally, a vertical line in the graph with a label indicating what happened on that day
again, if it were a website, picture "sale on this day"
you could add the following
compute the mean per day
and make a single horizontal line
to display it
and then color all the bars n-stds away from the mean
that'd be neat
id like to see your data plotted as a normal distribution
tbf it seems like it COULD approximate it
I imagine there'd be some trends in day-of-week, just not interesting in my case
a running weekly average would be interesting too
spikes here probably line up with weekends
what I'm trying to say is: mark the mean. mark 1 std deviation above and below the mean
and color the bars ABOVE the 1-std differently
I'd like to figure that out as an experiment, sure. It'd be neat to see
that's pretty easy to plot :p
and it can visually display your idea of "something different happened here"
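A small sketch of the mean and 1-std idea, assuming a `daily` Series of counts per day (the numbers here are invented). It only computes which days to highlight; the colors list is what would be handed to the plot.

```python
import pandas as pd

# Invented daily counts; in practice this would be the per-day event counts.
daily = pd.Series(
    [10, 12, 7, 11, 9, 40, 10],
    index=pd.date_range("2021-01-01", periods=7, freq="D"),
)

mean = daily.mean()
std = daily.std()

# Days more than one standard deviation above the mean count as spikes.
threshold = mean + std
is_spike = daily > threshold

# One color per bar: spikes in red, everything else in blue.
colors = ["tab:red" if spike else "tab:blue" for spike in is_spike]
```

With matplotlib this would be something like `plt.bar(daily.index, daily.values, color=colors)` followed by `plt.axhline(mean)` and `plt.axhline(threshold, linestyle="--")`.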
I have no idea how to do that currently. both pandas and matplotlib are opaque to me, and most docs seem to be just SO questions/answers, or very verbose API references
Any idea which of those is better in order to count duplicates in a dataframe?
def count_duplicatives(df, col_name=None):
    return df.duplicated(col_name or df.columns.tolist()).sum()
def count_duplicatives(df, col_name=None):
    return df[df.duplicated(col_name or df.columns.tolist())].shape[0]
you can call np.mean(df["value"]) on the column that holds your values
I think I'll start with trying to fix these x-axis labels (still trying), and then drawing vertical lines for important events
in this case my values are computed histograms...Do I have access to that generated data?
thanks
basically yea, and it sounds like the coordinate system is the x-axis by default, so I can specify an epoch time of the event
which I can figure out
matplotlib is really cool but a massive pain in the ...
so I've gathered! Do you have a preferred plotting lib?
I saw pyplot but it seemed very focused on web-based notebooks
cool
and you can still reference matplotlib objects
since seaborn is built on top of matplotlib
thats the kind of plot that id like to see in your data. kde for values basically
if its normally or normal-like distributed, you can easily find outliers mathematically by computing
Z scores
(how many standard devs is the value away from the mean)
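As a quick illustration of the Z score idea: subtract the mean, divide by the standard deviation, and flag whatever lands far from zero. The values and the cutoff of 2 are assumptions picked for the example, not something from the thread.

```python
import numpy as np

# Invented values with one obvious spike.
values = np.array([10.0, 12.0, 7.0, 11.0, 9.0, 40.0, 10.0])

# z = (x - mean) / std, i.e. distance from the mean in standard deviations.
z = (values - values.mean()) / values.std()

# Cutoff of 2 is an arbitrary choice for this sketch; 3 is also common.
outliers = values[np.abs(z) > 2]
```

This only makes sense if the values are roughly normal-like, as noted above; heavily skewed counts may need a different rule.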
but what if the outlier is a trend over time? Like "the user slowly stopped using this service over a period of 1 month"
one could calculate the weekly frequency of usage
that's different. I would have to think it over
but thats not related to population
but to one user
so you'd have to compute it separately
i should be working on NLP but im finding your data more interesting lmao
I guess i like the intersection of marketing, analytics :p
haha. I think I'd rather be working on NLP
uh. I'm 99% sure pandas already has a method
people really need to check documentation more often
xD
speaking of documentation, I'm trying to find out just what subplots() does and how I can go from this default bar graph to a connected line graph
and then I can have multiple lines indicating different users
subplots is for several subfigures on one figure, basically
if you want to plot more than one plot on a figure (like, several lines), this is as simple as plotting them all between getting a new figure and showing it
plt.figure()
plt.plot(...)
plt.plot(...)
plt.plot(...)
plt.show()
#!/usr/bin/python3
import pandas as pd
import csv
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('output.csv', header=0, delimiter='|', quotechar='^', quoting=csv.QUOTE_MINIMAL)
df['timestamps'] = pd.to_datetime(df['timestamp'], unit='s')
fig, ax = plt.subplots()
df["timestamp"].astype(np.int64).plot.hist(ax=ax, bins=75)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.annotate("event 1", (1611925200, 100), color='r')
plt.axvline(x=1611925200, color='r')
plt.show()
in my case I'm not even sure why it's creating a bar graph
I assume it's being set to that by plt.subplots()
bar graph?
ye, currently looks like this
it is, but, can the histogram data be displayed as a line graph?
That's literally the method I'm using...
so you want to count them?
Indeed
can the histogram data be displayed as a line graph?
So, like, same thing but horizontally?
Oh, I got what you mean. You'd need to use np.histogram instead to give you the raw data; then plot it with just plt.plot.
The Trues and that's what I did kinda
ahh, I see
since True has value 1. You can do
len(df) - sum(column)
plt.hist calls np.histogram, so their arguments are pretty much the same.
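Here is a sketch of getting the raw histogram data for a line plot, assuming epoch seconds as input and one bin per UTC day. The four timestamps are made up; the bin edges are built by hand at multiples of 86400 seconds.

```python
import numpy as np

# Made-up epoch timestamps (seconds since 1970).
epochs = np.array([1593316925.431, 1593320000.0, 1593403325.0, 1593576125.0])

day = 86400  # seconds per day

# Bin edges at UTC midnights, covering the first through the last day.
start = np.floor(epochs.min() / day) * day
stop = np.floor(epochs.max() / day) * day + day
edges = np.arange(start, stop + day, day)

# counts[i] is the number of events with edges[i] <= t < edges[i + 1].
counts, edges = np.histogram(epochs, bins=edges)

# Bin centers are handy as x positions for a line plot.
centers = (edges[:-1] + edges[1:]) / 2
```

`plt.plot(centers, counts)` then draws the same data as a connected line, and calling it once per user puts several lines on one figure.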
try that
or something similar
basically
Trues are 1
Both of my codes work... I'm just asking what's better
if you subtract their sum from the number of rows, you get the Falses
I did exactly that by using .sum()
shape should be faster but does it give you the same result?
Both work exactly the same
I get that generally shape is faster
But is it really faster in this implementation even though I create a whole new DF just for that?
unfortunately i cant answer that confidently
so id rather not misinform you
time them
and if shape is faster, it is
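"Time them" could look something like this with the standard `timeit` module. The two functions are the ones from the question, renamed here so both can coexist; the tiny DataFrame is made up, so the timings only mean something when run on the real data.

```python
import timeit

import pandas as pd

# Made-up frame with three duplicated rows.
df = pd.DataFrame({"a": [1, 1, 2, 3, 3, 3], "b": list("xxyzzz")})

def count_by_sum(df, col_name=None):
    # Sums the boolean mask directly.
    return int(df.duplicated(col_name or df.columns.tolist()).sum())

def count_by_shape(df, col_name=None):
    # Builds a filtered DataFrame first, then reads its row count.
    return df[df.duplicated(col_name or df.columns.tolist())].shape[0]

# Total seconds for 200 calls of each variant.
t_sum = timeit.timeit(lambda: count_by_sum(df), number=200)
t_shape = timeit.timeit(lambda: count_by_shape(df), number=200)
```

The shape variant has to materialize an extra filtered DataFrame, so it does more work in principle, but the measured numbers are what count.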
Thanks 🙂
for col in df.select_dtypes(exclude=['int64','float64']):
    most_common = df[col].mode()[0]
    df[col].fillna(most_common, inplace=True)
How would you guys achieve that?
Gives me the following error A value is trying to be set on a copy of a slice from a DataFrame
The SettingWithCopyWarning should just be a warning not an error, and I am actually able to run your code without issue. Another method would be
df.loc[:,col] = df[col].fillna(most_common). Also, using scikit-learn's SimpleImputer (the replacement for the older Imputer) with strategy='most_frequent' may be a more effective way of filling missing data in preprocessing
I was wondering about using Kmeans clustering for a multidimensional dataset.
Being a geometrical method, how can I be sure that the clustering has been done correctly? (Not having visualization feedback)
I got that working:
most_common = 9999999
new_df[new_df.select_dtypes(exclude=['int64','float64']).columns.tolist()] = new_df.select_dtypes(exclude=['int64','float64']).fillna(most_common)
Got any idea how to fetch the most common value for each column without iterating through it?
you could do something like new_df.mode().iloc[0]
I think using an Imputer may make your life easier however: https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.Imputer.html
Actually I did the following:
def replace_missing_values(df, col_to_def_val_dict):
    new_df = df.copy()
    new_df.fillna(col_to_def_val_dict, inplace=True)
    new_df[new_df.select_dtypes(exclude=['int64','float64']).columns.tolist()] = new_df.select_dtypes(exclude=['int64','float64']).fillna(new_df.mode())
    new_df[new_df.select_dtypes(include=['int64','float64']).columns.tolist()] = new_df.select_dtypes(include=['int64','float64']).fillna(new_df.median())
    return new_df
Doesn't always work though
Some of those values stays NaN
If I replace new_df.mode() with a single string it "works"
Otherwise it just stays the same
I dont think there's a way to verify
by definition Kmeans is an unsupervised / descriptive model. and Kmeans requires predetermining the number of clusters you want to use
If you have no idea of how many you have perhaps try using DBSCAN?
YAS!
def replace_missing_values(df, col_to_def_val_dict):
    new_df = df.copy()
    new_df.fillna(col_to_def_val_dict, inplace=True)
    c_df = new_df.select_dtypes(exclude=['int64','float64'])
    new_df[c_df.columns.tolist()] = c_df.fillna(c_df.mode().iloc[0])
    c_df = new_df.select_dtypes(include=['int64','float64'])
    new_df[c_df.columns.tolist()] = c_df.fillna(c_df.median())
    return new_df
Now it works!
Nothing advanced, but for those interested in data / cicd pipelines
https://medium.com/analytics-vidhya/creating-a-data-pipeline-with-easyjobs-fastapi-4e302556f05d
class encoder(nn.Module):
    def __init__(self, n_inputs = 40):
        super(encoder, self).__init__()
        self.n_inputs = n_inputs
        self.N_c = torch.randint(1, n_inputs + 1, (1,)).item()
        self.random_indices = torch.randperm(self.n_inputs)[:self.N_c]
        self.fc_enc1 = nn.Linear(2, 64)
        self.fc_enc2 = nn.Linear(64, 32)
        self.fc_enc3 = nn.Linear(32, 2)
        torch.nn.init.normal_(self.fc_enc1.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc1.bias)
        torch.nn.init.normal_(self.fc_enc2.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc2.bias)
        torch.nn.init.normal_(self.fc_enc3.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc3.bias)

    def forward(self, X, y):
        x_c, y_c = X[:, self.random_indices], y[:, self.random_indices]
        input = torch.cat((x_c, y_c), 2)
        h1_enc_output = F.relu(self.fc_enc1(input))
        h2_enc_output = F.relu(self.fc_enc2(h1_enc_output))
        r_c = F.relu(self.fc_enc3(h2_enc_output))
        return r_c
would this be a possible way for a model to accept an arbitrary number of inputs?
omg tools hell. I have so many useful tools lmao
though, not interested in pipelines...atm
After some searching around this also seems to work
```py
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [np.nan] * 2,
                   'b': [True, True, True, True, True, False, np.nan, np.nan],
                   'c': [1.0, 2.0, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0]})
print(df.head(10))
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['a'] = df['a'].fillna(df['a'].mean())
df['b'] = df['b'].fillna(df['b'].mode().iloc[0])
df['c'] = df['c'].fillna(df['c'].median())
print(df.head(10))
```
@exotic maple pipelines are just one possibility with easy jobs, but the next closest parallel is celery 🙂
That's basically what I did
But I replaced a bunch of columns in each operation while you only did one at a time
Ah, I see. Seems like a good approach
that's not a DS question, but are you looping through those images?
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
well, while does not create an i value for you
you're better off using a for loop
but if you insist on a while
you can do something like
i = 0
while i < len(df):
    # DO SOMETHING HERE
    i += 1
does df stand for dataframe?
Surely there's a better way?
Ofc there is, but I'm playing Minecraft with my daughter 😂 can't think right now
Cute
Thanks 🙂
how old are you?
28
I thought you were somewhere around 22-23 😂
you can pass a dict to fillna to make it more efficient
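Rough sketch of the dict idea, assuming a small made-up frame (column names `a`, `b`, `c` and the choice of mean / mode / median per column are just for illustration). You build one fill value per column and hand the whole dict to a single `fillna` call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
df = pd.DataFrame({
    "a": [1, 1, 2, np.nan],
    "b": ["x", "x", "y", None],
    "c": [1.0, 2.0, np.nan, 9.0],
})

# One fill value per column, collected in a plain dict.
fill_values = {
    "a": df["a"].mean(),
    "b": df["b"].mode().iloc[0],
    "c": df["c"].median(),
}

# A single call fills every listed column with its own value.
filled = df.fillna(fill_values)
print(filled)
```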
hi
@dark sonnet hi. Do you wanna talk about data science?
this seems to be a norm haha
idk why you thought that
growing old doesnt mean you need to be grumpy :v
Anyone have suggestions on repo structure/templates for ml projects? Specifically a project with heavy experimentation but also a deployed production model
man i still dont get how to upload files to google colab
that shit is as cryptic as matplotlib docs...
Pretty sure it's just drag and drop
so you have a repo that has a pretrained model, but you can also use it to train new models?
on the notebook?
Yeah or many different types of models and data preprocessing and everything else that you mess around with while working on a project. Basically what’s the best way to structure a repo to keep track of these experiments but also have a production model.
Looking for something like this: https://github.com/jeremyjordan/data-science-template or this: https://github.com/ml-tooling/ml-project-template but I wanted to see if there was anything else out there
The second one seems like a bit of an anti-pattern, since it's basically the same as splitting the research and production into 2 different repos, which I don't want to do
You have to click the folder icon on the menu on the left side when you’re in a notebook and then you can drag and drop files or there should be an upload button
@fleet hare I'm not sure I understand the issue with having the user-facing code in a separate repository if the research-specific code isn't useful to them.
It's a way of representing documents usually employed in Information Retrieval or learning from texts. Each document (observation) is modelled as a vector of N dimensions, N being the number of words, terms or whatever base unit you are working with. If the document contains a given word, then the corresponding element of the vector is non-zero. It's a generalization of the Standard Boolean Model, where elements of a vector can only take values 0 or 1.
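A tiny sketch of that representation, assuming a made-up two-document corpus, whitespace tokenization and raw term counts as the non-zero values (other weightings are possible). The helper names `to_vector` and `to_boolean` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat on the cat"]

# Vocabulary: one dimension per distinct term, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

def to_vector(doc):
    """Term-count vector: non-zero wherever the document contains the term."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

def to_boolean(doc):
    """Boolean variant: each element is only 0 or 1."""
    return [1 if count > 0 else 0 for count in to_vector(doc)]

vectors = [to_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```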
Hello Gentlemen,
Hope you have a very enormous day,
I am new to AI and stuff. I want to write my very first neural network and just want to get started, but on YouTube I can't find an appropriate video. If you guys can suggest some YouTube videos for an absolute beginner who knows Python programming up to a certain extent (not pro though), that would be great 😁
subconsciously 🙂 🤷
So I'm using an LSTM. I'm using 60 values to predict 1 value, then I append it to those 60, remove the first value, and predict again. But my model doesn't seem to be very effective. Is it worth giving it more data (which is going to take a long time), or do I have to change the model somehow?
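Just to pin down the loop being described, a minimal sketch. `predict_next` is a hypothetical placeholder (a plain average) standing in for the trained LSTM's one-step prediction, and `rolling_forecast` is a made-up helper name. One thing the loop makes visible: every prediction is fed back in as input, so any error can carry into later steps.

```python
from collections import deque

def predict_next(window):
    # Placeholder for a trained model's one-step prediction (assumption:
    # a real setup would call the LSTM here instead of averaging).
    return sum(window) / len(window)

def rolling_forecast(history, steps, window_size=60):
    # Keep only the most recent `window_size` values.
    window = deque(history[-window_size:], maxlen=window_size)
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(window)
        forecasts.append(next_value)
        # Appending to a full deque drops the oldest value automatically.
        window.append(next_value)
    return forecasts

history = [float(i % 10) for i in range(100)]
forecasts = rolling_forecast(history, steps=5)
print(forecasts)
```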
what? are you trying to do time series prediction?
Hi! I've been struggling to understand why we need nonlinearity for neural networks, why we need to use activation functions.. also for example in case of the famous MNIST dataset where the image sizes are 28*28, we construct the weight matrix with dimensions of (28 * 28, 30) (30 is just an example), but the point is that it's bigger than one.. why? 😄 What does it look like when an input flows through a neural network with two layers, (28 * 28, 30) and the second layer (30, 1)?
That's my biggest problem I have no idea what it looks like when the pixels of an image(the input data) flow through the network.
(fastai fastbook chapter: https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb)
yeah kinda
can you use silhouette score to compare different kmeans aggregation that use different features?
why we need nonlinearity for neural networks, why we need to use activation functions..
Because it's easy to prove that if you only use linear activation functions, then your network, no matter how deep or wide it is, is equivalent to just a linear function from inputs to outputs. And, well, linear function do not useful computation make. A linear function classifying images as cat or not would have some pixels with positive weights and some with negative ones - make all the former ones pure white, all the latter ones pure black, and you'll get a "perfect" cat image as far as the network is concerned.
Don't think I understand the rest of your question.
I don't understand either, I just can't imagine what it looks like when our input data "flows" through the network, the features that it constructs.
Furthermore, as the image shows, instead of creating your weights with one column we create 30 columns, maybe that gives you some idea of what I was trying to say.
This is a network with 28*28 inputs, 30 neurons in the first (and only) hidden layer, and 1 output
So the matrix that transforms from first (inputs) to second(first hidden) layer is (28*28) x 30, and the matrix transforming from the first hidden to the outputs is 30 x 1.
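If it helps to see the shapes, a quick NumPy sketch, assuming a batch of 5 random "images", random weights and a ReLU on the hidden layer (all made up, nothing here is trained).

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 flattened 28x28 images, one per row.
batch = rng.random((5, 28 * 28))

# Inputs -> hidden layer: each of the 30 columns holds one neuron's weights.
W1 = rng.normal(size=(28 * 28, 30))
b1 = np.zeros(30)

# Hidden layer -> single output.
W2 = rng.normal(size=(30, 1))
b2 = np.zeros(1)

hidden = np.maximum(batch @ W1 + b1, 0)  # ReLU activation
output = hidden @ W2 + b2

print(batch.shape, hidden.shape, output.shape)
```

So each image becomes 30 numbers after the first matrix, and those 30 numbers become 1 number after the second.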
I see, but Why is it good to have more and more neurons? I mean I know that it will make the network deeper and deeper, and it will perform better, but why?
Well, the more complexity, the more complex relationships the network can approximate.
Check out https://neuralnetworksanddeeplearning.com/chap4.html for a visual explanation
can someone help me with lstm
I've seen many visual explanations, but the thing I don't understand is the relationship between the layers. I thought they do the same things, but I often hear that they have objectives, I guess because of nonlinearity, but I can't see the whole picture
If your layers are just dense like here, there's no meaningful "purpose" of each layer. They just do some stuff that, after training, ends up being involved somehow in calculating the result.
If the layers are different, like how in image classification neural networks the first few layers usually do convolutions and stuff, then you can say that the first few ones do stuff like search for lines, then for angles and more complex details - but even that's mostly a guess.
Generally speaking, neural networks just work - their training adjusts the flow between each layer so that the whole ends up doing the task you're making it do. There's no guarantee you can describe the purpose of any specific layer.
You can try searching for research on that matter though, maybe there are papers about trying to determine the function of parts of trained neural networks.
because I constantly think that the whole matrix multiplication it does can be done by just simply one layer, because they basically multiply the layers with weights and weights, but I guess I can describe why this work like that with Nonlinearity, like ReLU
😄
because I constantly think that the whole matrix multiplication it does can be done by just simply one layer, because they basically multiply the layers with weights and weights
yup, this is precisely why activation functions are needed - without them (or if they were linear), like I said, the entire network can be collapsed into just one layer mapping from inputs to outputs
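You can check the collapse numerically. A minimal sketch with random made-up weights: two layers with no activation equal one layer with weights `W1 @ W2` and bias `b1 @ W2 + b2`, and putting a ReLU in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 784))

W1 = rng.normal(size=(784, 30))
b1 = rng.normal(size=30)
W2 = rng.normal(size=(30, 1))
b2 = rng.normal(size=1)

# Two stacked layers with no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))

# With a ReLU between the layers the single-layer shortcut no longer matches.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```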
Yes, I see but sometimes ReLU just transforms the negative weights to zero, making its gradient zero too
yeah, it's pretty weird, and yet ReLU became popular more recently - stuff like logistic and tanh are the older ones. Apparently ReLU was shown to work better, and I don't think I know enough to understand why.
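For reference, a small sketch of what ReLU does to values and gradients. Note it acts on the pre-activations (the inputs to the activation function) rather than on the weights themselves. The gradient at exactly 0 is taken as 0 here, which is an assumed convention, and `relu_grad` is a hypothetical helper name.

```python
import numpy as np

def relu(z):
    # Negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative with respect to the input: 1 where z > 0, otherwise 0.
    return (z > 0).astype(float)

# Pre-activations of a layer, not weights.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

print(relu(z))
print(relu_grad(z))
```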
Yes, I guess it's just trying to push parameters that are important to be positive and less important ones to be closer to negative 😄
Anyways, thank you for your kind explanation I think I get the whole picture now.
I have a question
what is the best source for learning pytorch
but it should be beginner level
I think the official website is a good place to start https://pytorch.org/tutorials/index.html



