#data-science-and-ml
It’s the Santa hat.
I’m glad that worked because if it didn’t I had no idea where to go
Are you using some tutorial for all the work you’re doing with the Chinese timeline thing?
@velvet anchor No, I'm not.
Also, here's some cool visualizations from that article I just published
And a link to the article, since I actually finished it. I've got a couple others up there too if anyone thinks they're fun to read.
Oh okay cool i'll take a look in a bit
I was just curious since you've been doing so much with the same data
Yeah, I'm working on a massive project where I'm cataloguing every (ruling) dynasty in all of China's history
Also multiple concurrent dynasties that existed during warring periods
It's a buttload of work
I want to scale wide-range data using a base-10 logarithm, but I run into a problem with values of 0. What are some alternatives to the logarithm?
what are you exactly trying to do
arc tangent 😛
arctan might not be steep enough
it's also rather linear so not super useful for scaling
Square root, cubic root or log(x+1) might be worth trying
yeah x+1 was gonna be my suggestion
I'd try that one first 👌 @visual notch
or maybe even just a flat division depending on scale
Actually im going to normalize those data i have into 0-1 scale
Using min-max
And placing the actual 0 value as 0
Instead of ignoring it like now and placing the lowest non zero value as 0
I think ill pick the log x+1
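For reference, the log(x+1) followed by min-max idea could look something like this. It's only a sketch: the function name `log1p_minmax` and the sample values are made up, and it assumes the data is non-negative.

```python
import math

def log1p_minmax(values):
    """Scale non-negative values to [0, 1]: log10(x + 1), then min-max."""
    logged = [math.log10(v + 1) for v in values]  # log10(0 + 1) == 0, so zeros are safe
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # All values identical: avoid dividing by zero
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]

raw = [0, 9, 99, 999, 99999]   # made-up wide-range data
scaled = log1p_minmax(raw)
print(scaled)
```

If the smallest value in the data is 0, it stays at 0 after scaling, which matches the goal of keeping the actual zeros at 0.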
hello, im trying to implement an A star search to find a least cost path, but i dont know where im going wrong
Does it throw an error or is it just performing poorly?
poor performance, no errors
heres a map input ive been trying to get a result on https://paste.pythondiscord.com/ajehojehaz.nginx
When you parse the map you're setting the cost of a node to be based on its w, r, f, h, m type, but you never seem to use cost anywhere. So, despite you updating the nodes' g, they're never set to anything but 0
If I put a little print inside update_node:
def update_node(self, neighbor, node):
    neighbor.parent = node
    neighbor.g = neighbor.g + node.g
    print(neighbor.g)
    neighbor.h = self.get_heuristic(neighbor)
    neighbor.f = neighbor.h + neighbor.g
I get this:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
So if I'm not wrong, g is always 0 and f ends up equal to h, so the algorithm currently judges nodes purely by Manhattan distance, which lets it pick expensive paths
Since g(n) is the true cost of reaching node n, your problem is fixed by replacing this
neighbor.g = neighbor.g + node.g
with this
neighbor.g = neighbor.cost + node.g
@open pecan
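Here's a small standalone sketch of the same idea, in case it helps to see the cost feeding into g. It is not the code from the paste: the terrain costs in `COST`, the grid, and the function names are all made up for illustration.

```python
import heapq

# Hypothetical terrain costs; the real values for w, r, f, h, m are in the original code
COST = {"r": 1, "f": 2, "h": 4, "m": 8, "w": 16}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Return (total_cost, path) for the cheapest 4-connected path, or None."""
    rows, cols = len(grid), len(grid[0])
    g = {start: 0}                 # true cost of the best known path to each node
    parent = {start: None}
    open_heap = [(manhattan(start, goal), start)]
    closed = set()
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return g[goal], path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            neighbor = (nr, nc)
            # The key line: g(neighbor) = g(node) + cost of stepping onto neighbor
            new_g = g[node] + COST[grid[nr][nc]]
            if new_g < g.get(neighbor, float("inf")):
                g[neighbor] = new_g
                parent[neighbor] = node
                f = new_g + manhattan(neighbor, goal)
                heapq.heappush(open_heap, (f, neighbor))
    return None

grid = ["rmr",
        "rmr",
        "rrr"]
cost, path = a_star(grid, (0, 0), (0, 2))
print(cost, path)
```

With the step costs included, the search walks around the two "m" cells instead of cutting straight through them. The Manhattan heuristic stays admissible here only because the cheapest step costs at least 1.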
oh wow, thank you very much, id been staring at it for so long!
Always good to get a fresh pair of eyes 👀
Yeah I had a bug i'd been tracking and rerunning for like 8+ hours yesterday. as soon as i posted the snippet i saw it haha. sometimes it takes just stepping away or having someone else look to solve
@feral lodge best results I was able to get on the obscured dataset was uh
Found 3636 correct faces out of 10049 total images
I'm continually enlarging the images and stuff to see if something weird happens but at 4x normal size that is the highest and it started going down from that point
Looks like dlib's default face detector is built with histogram of oriented gradients (HOG) features! Never heard of it, but from what I can see it's quick and has some other good properties, though it has largely been superseded by CNNs, at least when you have enough data 🤔
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6077989
https://arxiv.org/pdf/1703.05853.pdf
Yeah
It's a really cool concept
actually inspired my technique a fair bit
24hours no hiccups on the genetic algorithm 👌
👍
Hi, I was trying to plot some data in real time at work. I started off with matplotlib, but it is too slow for the data stream. PyQtGraph seems like a hassle to get working for real-time plotting. Does anyone have any recommendations, tips, or tricks?
At risk of appearing like mad scientists, reveling in our latest unholy creation, we proudly introduce you to DeepHack: the open-source hacking AI. This bot ...
Neat video on machine learning
@velvet anchor It depends, but it is sub 300
But it should either be drawn efficiently or be drawn in another thread
And the dots are received at 10-30 fps
And to add some info, it's several different tracks being scatter plotted. So it would be nice if one could scatter plot every track with a unique color.
I think I got something working in pyqtgraph, here's the code for the interested https://github.com/eHammarstrom/pyqtgraph-live
Currently you have to control the clear/update loop externally (which is what I need). Critique is welcome.
Also slandon after a couple days it seems that 40% is the absolute best dlib can get out of the obscured faces
which is honestly pretty good
Does that percentage include false/true positives/negatives? o: like, if it's a data set of 50% non-faces and 50% faces and we guess randomly, we'd correctly flag a face in only 25% of the images
Whereas if the data is all faces, guessing randomly would correctly flag a face in 50% of them
And sorry for not replying to the mention before! I have no idea how to handle live data though, good job getting it working functor
Hi,
How should I decide which of these to use for feature selection? My goal is to choose the best features from my data for predicting the target feature.
Univariate selection
Recursive Feature Elimination (RFE)
Principal Component Analysis (PCA)
Choosing important features (feature importance)
Out of those I've only ever used pca, which is cool but should not be used for data with non linear relationships
As far as I know, univariate selection is also mostly (not sure if always) used for linear data
There are other choices, like MIC and lasso/ridge regression, which can also be used for scoring the features
They all come with pros and cons though, and they're not always useful for the same purposes. PCA will not directly tell you which features best describe the data; rather, it tells you which linear combinations of the features best describe the data. Lasso has the property of often shrinking the coefficients of less useful features to exactly 0, which singles out important features but loses some information
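To make the PCA point concrete, here's a rough sketch with plain NumPy (assuming it's installed). The data is synthetic and made up for illustration: two nearly identical features plus one small noise feature.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([
    t,                                 # feature 0
    t + 0.05 * rng.normal(size=500),   # feature 1, nearly a copy of feature 0
    0.1 * rng.normal(size=500),        # feature 2, small independent noise
])

# PCA via SVD of the centered data; rows of Vt are the principal directions
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)        # share of variance per component
first_pc = Vt[0]

print(np.round(first_pc, 2), np.round(explained, 3))
```

The first component puts roughly equal weight on features 0 and 1 (the overall sign is arbitrary), so it describes a mix of the two rather than picking one of them as "the" important feature.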
Slandon that 40% is just dlib over the dataset as it is
Was just curious how good dlib can detect hella obscured faces. I ran it over my test real faces set and it was 50k/50k
Just curious. Is the learning also done on obscured faces, or is it just the prediction? Because learning from a normal data set and predicting on an obscured data set sounds like a real-life scenario
It’s not learning anything. It’s just using the default predictor of dlib because we were curious how well it worked
But yeah that is a real life scenario and it just highlights how important it is to have a correct data set. If you expect to see obscured faces you need to have your model built to find those too
But I think dlib is just trained on a massive amount of face photos. I’m not entirely sure of the specifics
idk if this is the best channel to ask in, but do you guys have any recommended video series on machine learning w/ python (preferably on udemy)?
not on udemy, no
https://www.youtube.com/watch?v=OGxgnH8y2NM this guys pretty good tho
The objective of this course is to give you a holistic understanding of machine learning, covering theory, application, and inner workings of supervised, uns...
ty i'll check it out
i've started to have a look at andrew ng's course on machine learning, and so far i like what i see. it's pretty much exactly how i'd prefer to learn stuff like this. i feel like it's better to learn what happens behind the scenes, and then use that knowledge to implement it yourself to develop a strong understanding before using magical frameworks. thanks for the recommendation, guys ^^
yeah, that's understandable
and on that team only 1 of them will know the inner workings of stuff
another will be really good at manipulating data, another a math guy, etc
@velvet anchor im watching that guy right now! haha
i suppose i'm just the type of person to want to know how stuff works, regardless of whether i'm going to be using it a lot
yeah and thats fine
but the problem is theres a LOT to know
almost impossible really
yeah, but i at least want a general idea.
Is anyone into Algorithms and Data Structures?
I am trying to get good at Algorithms and Data Structures by solving problems on LeetCode and HackerRank.
I've currently solved 69 questions on LeetCode and around 70 on HackerRank but, 95% of them were easy questions.
It’s really all just practice honestly. As you solve more you’ll start to notice patterns you can use between problems
Should I look for answers if I am not able to solve them under 30 mins? (Medium level questions)
I’d say only look for answers if you just actually can’t figure it out
Idk that there’s a hard and fast rule for like x amount of time
It’s important to understand the answer though. Not just get it working and move on
I’m a senior in college so yeah I’ve had a lot of courses in data structures and algorithms and math and stuff
My college will start from next month ;-;
But, they will teach Algorithms and DS from 2nd year.
Pretty standard
I have a list of movies, their genre and gross
I've grouped movies by their genre and summed up their grosses, so, now I have genre vs gross.
For example, Action, Comedy, Drama | $2 Billion
Comedy | $1.38 Million
Now, I am thinking about making something that will predict the chances of a movie making a billion based on its genre
Can someone help me with that? Like what stuff I should search for? ( I guess that I will have to deal with weightage?)
This is all statistics
You mean I will have to go through the whole
Specifically you’ll want to look up predictive models
Oh, thank you!
Is there anyone who can help me decipher the difference between mean square vs. least square when it comes to ML? I looked up some stuff but couldn't make it out in a simple way
Nevermind. Just came to me hahaha
@lilac shadow Good to know you started on Andrew Ngs course. I kind of got scared and stopped it in the middle when it came to NN. I need a lot more concentration to pass that bit
hey is anybody familiar with Q-Learning
No expert, but I've done a few labs on q-learning! What's the problem? @lapis sequoia
i have some basic Q-Learning code that uses an environment, and now i want to test the brain on my own environment, but i don't know how to attach it. i already have the rewards, states and actions defined in the environment
@feral lodge
the environment in the default code uses tkinter and i don't use that. is that a problem? i don't think so. i studied the code but couldn't find where i should put in the values or how the environment gets read @feral lodge
That's a bit tricky to answer without being with you and checking the code, but if I were you I would check for the piece of code that describes the update rule; this thing
Since it uses all the important structures and values of Q-learning, you can find all relevant variables and functions there
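For anyone following along, the usual tabular update is Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). Below is a minimal sketch of how a custom environment can plug into it. The `CorridorEnv` class, its `reset()`/`step()` interface, and the hyperparameters are all assumptions for illustration, not the code from the linked repo; no tkinter is needed for this.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, start at 0, reward 1 at state 4."""
    n_states = 5
    actions = (0, 1)  # 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def train(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(env.n_states) for a in env.actions}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action choice, breaking ties at random
            if rng.random() < epsilon:
                a = rng.choice(env.actions)
            else:
                best = max(q[(s, b)] for b in env.actions)
                a = rng.choice([b for b in env.actions if q[(s, b)] == best])
            s2, r, done = env.step(a)
            # The update rule: move Q(s, a) toward r + gamma * max Q(s2, .)
            target = r if done else r + gamma * max(q[(s2, b)] for b in env.actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

q = train(CorridorEnv())
policy = [max(CorridorEnv.actions, key=lambda a: q[(s, a)]) for s in range(4)]
print(policy)
```

The learner only ever touches the environment through `reset()` and `step()`, so swapping in a different environment mostly means giving it those two methods plus a list of actions.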
Could you maybe link a hastebin with the tkinter code?
is github good also with all the code?
also can u pm me?
@feral lodge
this is the default code i use https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/2_Q_Learning_maze
so i made my own environment with actions environment and rewards already defined
i tried to modify the code to get it to work on my environment but i don't know how i show the AI my environment i made @feral lodge , also PM would be great?
looks interesting
@velvet anchor Hey, will predictive analysis require advanced statistics?
I've read some basics of predictive analysis
Downloaded a book Predictive Analysis - Eric Siegel
Hi everyone,
I'm trying to classify video with a CNN on a CPU. Each prediction takes 200 frames and approximately 1 minute. I want to reduce the prediction time because this will be a real-time project. Is there a better way to do it? How can I reduce the processing time?
Get a better GPU talat
@wide oxide kind of impossible to answer honestly without knowing the ins and outs. It also just depends on how correlated the data is.
So, the best option is to go for beginner Statistics and then predictive analysis?
I mean predictive analysis is just applied statistics
So yeah it’s better to have a fundamental grasp
Though @dreamy tartan I guess first make sure your gpu is handling the requests and not your cpu. How are you predicting? You’re not training the model for every request are you?
Hello! How can I make a histogram using a list of datetime.datetime?
Not sure what you exactly meant. So posted two links unrelated to each other. Pick whichever or give more details
@bleak ether
Anyone able to assist with helping me figure out how to prepare satellite imagery through this tutorial: https://datacube-core.readthedocs.io/en/latest/ops/prepare_scripts.html#prepare-scripts
hold on did you say satellite imagery ?
does that mean that you can watch my house ?
@verbal cairn
Not yet @lapis sequoia 😦
No, I just want to do some land classification
No interest in houses or individuals
no i mean if i did the same as you will i be able to watch people ? and houses and my school ?
You would have access to imagery of country land elements
But it's not real time so you wouldnt be "Watching"
The board game?
or the right question is
:
can you help me in #help-coconut
?
i really need that help
thanks
@lapis sequoia You're ahead of me, I'm just learning around that same area
what the ????
how the heck am i ahead of you
i'm a beginner class programmer
and you are playing with sats
Yep, but I'm just grabbing the things I need to build the code, you're doing a good job starting from first principles
Satellite imagery is just a bunch of numbers on a matrix grid as I understand
So it would just be finding the patterns that equate to certain things and then implementing a bit of code to find and collate those patterns in an encoded number set
i hate matrixes
I did until recently
Then I just realised I've been looking at them as more difficult than they are
i get stuck on creating them
They're just tables of numbers
yea a replicated one but
No different than a simplified excel
i don't know how to use excel
Microsoft excel?
yea
Ah, this complicates things a little
Hey everyone, I'm trying to learn about statistics and programming at the same-ish time and I could use some guidance with an example:
I have a distribution of families by income in the US in 1973 with pre-counted data as follows:
income level ($1000)   percent
0-1                    1
1-2                    2
2-3                    3
3-4                    4
4-5                    5
5-6                    5
6-7                    5
7-10                   15
10-15                  26
15-25                  26
25-50                  8
>= 50                  1
note that the percents do not add to 100% due to rounding
and the class intervals include the left endpoint but not the right endpoint
the problem is that I want to make a histogram with matplotlib.pyplot but I have no idea how to use this pre-counted data, any suggestions?
well
pair the data before doing the plot
the income level can be made 1 number
1,2,3,4,5,6,7,10,15,25,50+
I'm sorry but I don't understand what you mean?
I kinda get how I can take the income level and turn it into the bins, but I don't know how to turn the percent into the associated height.
https://stackoverflow.com/questions/33497559/display-a-histogram-with-very-non-uniform-bin-widths this could be helpful @lapis sequoia
https://pandas.pydata.org/pandas-docs/stable/visualization.html#area-plot this might be a nicer visualization
Thanks @hasty maple
I managed to solve it differently, kind of hack-ish solution (feels like there must be a better/more pythonic way) but in case anyone is interested in what it should look like:
import matplotlib.pyplot as plt

# given data:
bins = [0, 1, 2, 3, 4, 5, 6, 7, 10, 15, 25, 50]
percent = [1, 2, 3, 4, 5, 5, 5, 15, 26, 26, 8, 1]  # % of population per income class/bin
# solution:
x_steps = [b - a for a, b in zip(bins, bins[1:])]  # width of each bin
weights = [perc / step for perc, step in zip(percent, x_steps)]  # percent per $1000; zip drops the 50+ class
data = bins[:-1]  # left edge of each bin, leaving off the 50+ value for proper length
plt.hist(data, ec='black', bins=bins, weights=weights)  # ec=edgecolor
plt.xlabel('INCOME (THOUSANDS OF DOLLARS)')
plt.ylabel('PERCENT PER THOUSAND DOLLARS')
plt.title('Distribution of families by income in the U.S. in 1973')
locs = range(0, 55, 5)
plt.xticks(locs)
plt.show()
this is exactly what it looks like in the book
if anyone has a better solution (meaning the same histogram from the same given data but with prettier code): I'm all ears
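One possible alternative, as a sketch: since the data is already counted, the bar heights can be computed directly and drawn with `plt.bar` instead of going through `plt.hist`. The helper names `density_heights` and `plot` are made up here, and the open-ended ">= 50" class is left out, same as above.

```python
bins = [0, 1, 2, 3, 4, 5, 6, 7, 10, 15, 25, 50]
percent = [1, 2, 3, 4, 5, 5, 5, 15, 26, 26, 8]  # the open-ended ">= 50" class (1%) is left out

def density_heights(edges, percents):
    """Return (widths, heights), where height is percent per unit of x."""
    widths = [right - left for left, right in zip(edges, edges[1:])]
    heights = [p / w for p, w in zip(percents, widths)]
    return widths, heights

widths, heights = density_heights(bins, percent)

def plot():
    # Imported here so the computation above runs even without matplotlib
    import matplotlib.pyplot as plt
    # align="edge" makes each bar start at its left bin edge
    plt.bar(bins[:-1], heights, width=widths, align="edge", edgecolor="black")
    plt.xlabel("INCOME (THOUSANDS OF DOLLARS)")
    plt.ylabel("PERCENT PER THOUSAND DOLLARS")
    plt.title("Distribution of families by income in the U.S. in 1973")
    plt.xticks(range(0, 55, 5))
    plt.show()
```

Calling `plot()` draws the figure. Each bar's area (height times width) equals the percent in that class, which is what makes it a proper histogram with unequal bin widths.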
hey somebody familiar with Q-Learning? pm me please
nobody?
dunno what that is,
but look how cool data scientists can be: https://www.youtube.com/watch?time_continue=104&v=ndyjFUF2e9Q
lol i dunno i just started it but have some doubts about this program
you can ask your question Jan but we dont really do PM help
Good morning!
it's not like I intentionally chose bing when creating the link
yes. but this question isn't really something for in a group
Is anyone here using an AMD GPU to do deep learning?
I'd like to explore using Python, and specifically scikit-learn, for a project I have at work. I want to do some useful things in machine learning but I'm not really sure how to. I know the basics of machine learning though...
I help out the regional manager for sales so I constantly have metrics about each branch location, including average revenue per day, upsell, and sales for certain upgrades on our products. I'm trying to figure out how to use machine learning in conjunction with these data. I'm not really sure how to do it though because every metric is calculated using an average. For example, average revenue per day is just based on the location average, and it's expected that associates push towards or exceed that average to be competitive... I'm not sure how I can make a machine learning algorithm that can predict or tell me if a certain value or range is acceptable or unacceptable
@lapis sequoia I have at home once or twice
@winter violet From your description, a NN or something trained with scikit-learn may not be the best approach. This seems like a statistical problem; I would probably just take all the revenue ranges you have and plot them out. Find the standard deviation of your data and see if a value falls within that band. Maybe I'm misunderstanding your statement though.
Basically, this seems like a statistics problem I suppose is what I mean
So if you had this context and this data, what would you use machine learning for? What's a practical application of it?
@velvet anchor
Neural networks / deep learning is nice when there are problems without like direct correlations to it, or with a ton of different inputs that can’t easily be singled out. Or for image analysis where you need to identify specific patterns that may be inside a range of different things. Like Tumors inside X-rays. Or like right now my research is identifying deepfakes with them
It’s not that machine learning couldn’t do what you want to use it for. It’s just not needed and you might get a lot of additional noise from it that wouldn’t necessarily be present in a statistical model
Also the sheer amount of data you need for networks is kind of another huge road block that stops it from working on applications where statistics and such excel
@velvet anchor what kind of software did you use to do deep learning on the AMD GPU (because CUDA doesn't support AMD GPUs)?
I think Theano supports OpenCL
and there's a couple tensorflow forks that do too
even keras does too
I'll try here as well I guess. So I'm looking for good ways to use Jupyter notebooks with git. I saw the option for adding jq as a filter to gitconfig and it seems very close to what I want. Anyone have any options they feel are better?
Is anyone familiar with Multi-Input Multi-Output systems? I've been reading about them on Keras and I'm currently looking for resources to build my own (or adapt one from Github)
the normalize=True option in some fitting methods says it subtracts the data mean and divides by the L2-norm. I don't get why... it's calculated by summing all data values squared and taking the square root. But doesn't this value get larger and larger the more data points you have??
is anyone familiar with np.ndarrays?
i have an array like x[][][][] where each dimension has between 10 and 20 values
and i want x[1][2][all][all]
i know that's supposed to be like one command, but i dont fully understand the documentation
thats not a data science question its basic python indexing, look e.g. this tutorial: https://www.youtube.com/watch?v=ktyW-kOqGpY
in your case x[1][2][:][:] or x[1,2,:,:] should both work
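A quick sketch to check this, assuming NumPy and a made-up array shape. Both forms give the same result here, though the comma form is the one that generalizes.

```python
import numpy as np

x = np.arange(10 * 11 * 12 * 13).reshape(10, 11, 12, 13)

a = x[1, 2, :, :]    # explicit: fix the first two axes, keep the last two
b = x[1, 2]          # trailing full slices can be omitted
c = x[1][2][:][:]    # also works here, but only because each [:] is a no-op

print(a.shape)       # (12, 13)

# Chained [:] does not move on to the next axis:
d = x[:][:][1]       # same as x[1]
e = x[:, :, 1]       # index 1 on the third axis
print(d.shape, e.shape)
```

So `x[1][2][:][:]` happens to match `x[1, 2, :, :]`, but `x[:][:][1]` is not the same as `x[:, :, 1]`.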
Hey data scientists - wanted to know what people suggest to use for python google maps API usage. Namely, I have an excel that I've thrown into a Pandas Dataframe and I'd like to spit out a map using location data. Still haven't converted the locations to something usable yet (which are literally city names as of now) but I wanted to see what packages are best for this sort of thing
To be clear, it'd be nice to be able to make a map like this and highlight/fill in a specific province or city
:^)
@young aurora I drew this with pyshp and PIL:
ahh wrong Joseph, cool and good
sorry
No problem
@young aurora that's based on a dataset of county boundaries i had to download myself, and it's an equirectangular projection
I am tired of machine learning library tutorials. Whether it be TensorFlow or Keras, all tutorials don’t explain concepts in depth and use popular datasets like MNIST. Before I classify MNIST images, I want to be able to hand-write a dataset, and train a network on that.
Well the data sets and such are a much more advanced topic
however I agree with you. I'm working on a keras tutorial now actually
That plans to go more in depth on the ML side
Cool. I’m just looking for one that explains what concepts are being programmed.
hey guys
I'm trying to cluster text data based on common themes
I've never really worked with text before, though
so far I've eliminated common stop words, converted everything to lower case, and eliminated as many spelling errors as I can
does anyone have any advice on where to go from there?
I converted each text block into a dictionary with word counts on a lark but I'm not sure what I'd actually do with those
Maybe NLTK could be of use
thanks for the recommendation @velvet anchor, that looks like a really good resource!
@rich grove For your preprocessing, you'll want to do, among other things possibly:
- Lemmatization, ie "normalizing" the text, by morphing certain types of words to their lemmas -- their "root" versions. This includes stuff like converting plural to singular: ponies -> pony, converting conjugated verbs to their indefinite form: {being, are, is} -> be, reverting comparative/superlative to standard: {better, best} -> good.
  https://en.wikipedia.org/wiki/Lemmatisation, https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
- Phrase modeling, ie finding combinations of words that occur often enough that they can be assumed to constitute a phrase. For instance, the words happy and hour have very different meanings by themselves compared to when they're used together as happy hour, so treating happy hour, happy and hour, separately is very important when modelling the themes of their sentences. Phrase modeling is a subtask of something called named entity recognition (for instance, New York or the red happy robot are not only phrases, but they're entities with names that exist in the world -- recognizing this is a larger problem than just recognizing that those word sequences occur together often), so you might find some good resources googling that.
  https://en.wikipedia.org/wiki/Named-entity_recognition
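As a rough illustration of the phrase modeling idea, here's a toy sketch using only the standard library. The scoring rule, the thresholds, and the sample sentences are all made up; real libraries use more careful statistics.

```python
from collections import Counter

def find_phrases(sentences, min_count=2, threshold=0.5):
    """Return bigrams that co-occur often relative to how often their words appear."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # Simplified score: share of the rarer word's occurrences spent in this pair
        score = n_ab / min(unigrams[a], unigrams[b])
        if score >= threshold:
            phrases[(a, b)] = score
    return phrases

docs = [
    "we met at happy hour".split(),
    "happy hour starts at five".split(),
    "the happy dog ran for an hour".split(),
    "new york has a great happy hour".split(),
    "she moved to new york".split(),
]
phrases = find_phrases(docs)
print(phrases)
```

Once a pair is flagged, the two tokens can be joined (for example into happy_hour) before counting words, so the phrase is treated as its own term.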
For topic clustering, Latent Dirichlet Allocation (LDA) is, as far as I know, the standard approach.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
I like this video a lot https://www.youtube.com/watch?v=6zm9NC9uRkk; it works with several libraries that can do these things out-of-the-box, including LDA. The whole video is great, but the parts i mentioned start at 24:00, where he shows lemmatization followed by phrase modelling. LDA is shown starting at 40:20; that whole section is really cool
Microsoft also has something they’ve been talking about in their machine learning package too that might work. They’ve been using it to tell whether a review is positive or negative for example
But I’m not super knowledgeable on it
Hi, i am looking for tips on numba, especially on how to modify code to parallelise it on the GPU. Any expert in this topic around here ?
@balmy moth numpy??
Hi everyone, I want to predict the survival probability of a specific person at a specific time. I've looked around and found this: http://savvastjortjoglou.com/nfl-survival-analysis-kaplan-meier.html
In that project, the survival probability is estimated in general, not for a particular individual. My goal is to predict the probability for each player. What should I do to predict it?
Hi, im still looking for a numba expert for some questions, anyone around here ?
what is the difference b/w fit , fit_transform , transform 😃
could somebody please explain to me and @earnest prawn how B+ trees work? nix is smart, but i have very little understanding of them currently so it would be nice if it can be explained clearly as possible. :D
Now I'm curious as well after reading the wikipedia about it and not understanding bais 😇
Can someone eli5 bias variance trade off ?
@steel glen Don't train your model enough -> underfitting, you're not fitting the data well enough. Train your model too much on the training data -> overfitting, you're trying to perfectly match the training data instead of matching the actual data properties
this essentially
@lean ledge Hmm thanks
@steel glen I'll expand on Rags' explanation, because it's an interesting and central topic worth mulling over!
Bias and variance are statistical properties of (among other things) predictive models like neural networks, support vector machines, polynomials, etc. Both are causes for errors while evaluating the model, and a good model has been trained to balance between them, being not too biased and having not too much variance.
A model being biased means that it makes assumptions about the underlying structure of the data that don't seem to quite line up with the actual data we observe. This leads to errors when testing the model, since what the model predicts is different from the true observed values. You can think of the word "biased" as meaning "the model is biased towards its own idea of what the data should be, rather than what it actually is". Bias may seem like a purely bad property that we would want to reduce to zero. This isn't the case however, and to see why, have a look at this figure:
In that figure we see some scatterplots of some observed data (the green dots), and six different attempts to model this data, using polynomials of power 1, 3, 6, 9, 12 and 15. If you want, you can imagine the graphs as being snapshots of a neural network at various stages of its training -- the effect is pretty much the same. In that case, the first graph is right in the beginning of the network's training, and the last graph is after having trained for many iterations.
Reducing the model's bias to 0 would mean we have created a curve that perfectly touches all of the data points. We can see that the 15th power polynomial (or, the neural network that has been trained for many iterations), is approaching this state -- it's not very biased towards the idea that the data must follow a simple and elegant shape. This immediately seems incorrect to us though, since we know that our data will inevitably contain random fluctuations unrelated to our variables. Just like Rags said, the last model shown has started to model this unimportant random noise, lowering its training error but losing any real understanding of the underlying structure of the data. The 3rd degree polynomial is more likely the "true" one even though its training error is higher and it's more biased. As such, a model being a bit biased towards its own idea of what the data should be is actually a useful and desirable property!
Now, as the bias is reduced, the variance is increased. Compare the first two graphs to the last two; you'll see that the curve "wiggles" up and down a lot more. This leads to an increase in the average discrepancy between the predicted values and their mean -- the variance of the model is increased. The increase in variance is what allows the curve to snugly fit the observed data, but this is only valuable to a certain point; as said, the 3rd degree polynomial looks like the proper model of the data.
So, both high bias and high variance are undesirable features of the model.
- High bias and low variance is common for models that have too little flexibility or that have been trained too little -- underfit models. These models will perform poorly on both training and testing data.
- Low bias and high variance is common for models with too much flexibility, or that have been trained too much -- overfit models. These models will perform very well on training data, but very poorly on testing data, because their high variance makes them model random noise in the training data and they therefore generalize terribly.
The sweet spot for a model is somewhere in between the two extremes.
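If you want to poke at this yourself, here's a small sketch assuming NumPy. The sine curve, noise level, sample sizes, and degrees are all made up for illustration and are not the data behind the figure discussed above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Made-up "true" structure hidden underneath the noise
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 30))
y_train = true_f(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_f(x_test) + rng.normal(0, 0.3, x_test.size)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(degree, results[degree])
```

The training error can only go down as the degree goes up, since each higher-degree fit contains the lower ones. The thing to watch is the test error: the straight line should do badly on both sets, and the high-degree fit typically gives back on the test set what it gained on the training set.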
@analog rampart You're talking about scikit-learn yeah? fit() computes the transformation parameters and saves them on the transformer object. transform() then applies these saved parameters to the input. fit_transform() is a combination; it computes the parameters and applies the transform in one call. For example, the standard scaler http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html scales data by transforming it:
When we work with data it's often a good idea to normalize it before training our model on it, to increase numerical stability or to compensate for different features measured in different magnitudes (like if we have two variables x1, measured in micrometers, and x2, measured in kilometers), or other reasons. There are different ways of normalizing data. One way is to subtract the mean of the data, and then divide by its standard deviation. Ie, doing this:
X' = (X - μ)/σ
and then using X' for training the model. The difference is this:
So you can see we've transformed X into X' using the parameters μ and σ, centering the data and giving it variance 1. With s = StandardScaler(), what scikit's s.fit(X) function does is to compute μ and σ and save them on s. Xprime = s.transform(X) then applies the transformation X' = (X - μ)/σ. We can do this in one step by writing Xprime = s.fit_transform(X)
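To show the split without needing scikit-learn installed, here's a toy stand-in. `TinyStandardScaler` is a made-up class for a single feature, not scikit-learn's implementation; it just mimics which step computes μ and σ and which step uses them.

```python
import statistics

class TinyStandardScaler:
    """Hypothetical stand-in that mimics the fit/transform split for one feature."""

    def fit(self, xs):
        # Compute and store the parameters; the data itself is not changed
        self.mean_ = statistics.fmean(xs)
        self.scale_ = statistics.pstdev(xs)  # population standard deviation
        return self

    def transform(self, xs):
        # Apply the stored parameters: X' = (X - mu) / sigma
        return [(x - self.mean_) / self.scale_ for x in xs]

    def fit_transform(self, xs):
        return self.fit(xs).transform(xs)

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

s = TinyStandardScaler()
train_scaled = s.fit_transform(train)  # learns mu and sigma from train, then scales it
test_scaled = s.transform(test)        # reuses the mu and sigma learned from train
print(train_scaled, test_scaled)
```

This is also why you'd normally call fit (or fit_transform) on the training data only and plain transform on the test data: the test set gets scaled with the parameters learned from training.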
Anybody know any good free online 3d vector plotters?
tf.initialize_all_variables()
lets say I had variables W1,b1,W2,b2,W3,b3
would that function do this?
W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
W2 = tf.get_variable("W2", [12,25], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b2 = tf.get_variable("b2", [12,1], initializer = tf.zeros_initializer())
W3 = tf.get_variable("W3", [6,12], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b3 = tf.get_variable("b3", [6,1], initializer = tf.zeros_initializer())
I notice there is also a tf.initialize_variables function
@honest relic
love and emotion can be emulated
we aren't unique or special, we're just more complex than the machines we create
yes but they won't be real
define real?
um i don't think i can define real
if given enough computational power, our thoughts and feelings can be perfectly replicated
but what is the purpose of an AI?
Humans are really good at conditional statements that can't be like discretely stated
and machines are not
A better question to ask would be "what practical use can AI perform today that a human has issues with."
oh
We aren't planning on replacing ourselves afaik
But AI can definitely exceed in certain tasks that we suck at
such as
Scientists and programmers are using it for a lot of tasks, one that always pops to my mind is emulating fluid simulations.
mass data processing is a big thing that it's just not practical for humans to do
or massive number crunching
Or just processing data instantly
Humans can do some amazing video editing, but it takes days or weeks of work to make even a rough video, an AI could do the same thing almost instantaneously
how the nyculina and myrcea ways are connected to the future universe?
nvm found, through rlk and dr4 ways
Anyone used scrapy extensively here?
Could anyone here help me in help-2?
Hi, was wondering if anyone had any experience with apache spot? I was thinking of parsing some syslogs using their open data model
Heyyo guys
Let's say I've trained a CNN and saved the model in h5 format
How can I use that model to make predictions?
(he confirmed in another channel that this is a Keras question)
keras.models has a load_model() function that reads the .h5 file back into a model
Then you can pass your data into that with .predict
Thx but i figured out
Preprocessing was the part that i had problem
Is it possible to overfit when you train CNN with MaxPooling ?
Cuz my model always classifies every image with the same class
Yes
Though it’s also likely you don’t have the data for the other class. Or your data for the class it’s overfitting is too wide
Well basically I'm using kaggle cats and dogs dataset
I don't think there's a problem with dataset
Can try adjusting windows / filters / activation functions too. Also epochs / steps
I had the same issue at work for the longest time
I think I will add one more Convolutional layer
It’s just all testing and trying stuff honestly. Messing with parameters. Retrain. Tweaking preprocessing done to images
I’ve been working on the same project at work for like almost 6 months now
Training a network to detect deepfakes
Cool!
The escalation war begins.... someone else is training a deep fake to defeat deep fake detection... The internet started because of pron and the AI overloads started with deep fakes...
It’s really cool honestly. Like nvidia is doing kinda the same thing. They made an adversarial network that acts like a filter for photos to defeat facial recognition
It's the golden age right now.. I'm glad I'm learning how it all works.
Nvidia and google both have loads of journal articles to read that are super cool with MM
With ML*. Like every other month it seems they have just some giant leap forward
And some guys at defcon made one to perform sql injections
Yeah im learning via fast.ai for top down to bottom details and simultaneously going from bottom up with Andrew Ng on coursera
@velvet anchor Just wondering,Which GPU do you have for training?
Ugh now I have training set accuracy of 83%
but still it does not predict correctly
How this is a dog 😩
well it is doing a doge-y pose
https://i.pinimg.com/736x/64/fe/9a/64fe9a90e6c8cf8ea4345e783b5cf703--smiling-dogs-shiba-inu.jpg
verily, the resemblance is uncanny
I tried with other cat images and they're all dogs according to my model 
Now It started to predict few cats correctly
@pulsar surge It did correctly predict that doggy
did it predict the doggo as a cat? 😄
Training accuracy is going to be misleading
if one of your classes has a lot more data than the other
like if it's always predicting the larger class, the accuracy is gonna be whatever percent of the data that class makes up
if that makes sense
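To make that concrete, a small sketch with made-up numbers (900 dogs, 100 cats; these counts are an assumption, not the actual dataset). A "model" that always answers with the majority class scores exactly that class's share of the data while never finding a single cat.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced label set: 900 dogs, 100 cats
labels = ["dog"] * 900 + ["cat"] * 100

# A "model" that ignores its input and always predicts the majority class
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Plain accuracy: fraction of predictions that match the label
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for cats: fraction of the actual cats that were predicted as cats
cat_hits = sum(p == y == "cat" for p, y in zip(predictions, labels))
cat_recall = cat_hits / labels.count("cat")

print(majority_class, accuracy, cat_recall)
```

Looking at per-class numbers (or a confusion matrix) next to the overall accuracy is one way to spot a model that has collapsed onto a single class.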
@velvet anchor My model now makes some great predictions with 81% accuracy on test set
I will try to make it like 90%
but first I need to build TensorFlow from source
cuz they don't support CUDA compute capability 3.0 GPUs on stock version
I used to train with CPU 😄
To make training faster I resized images 64x64
If I can build tf-gpu from source then i will train it with resized 128x128 images, hoping to increase test set accuracy
Made a fishing bot for eso, using python and tensorflow
Made it using tensorflow
@velvet anchor This is a bit late but you wouldn't happen to have any resources on that defcon talk about SQLi via machine learning would you? Can't seem to find it myself and I'm curious.
i think i'm in love with matplotlib
Ye it's a pretty good lib
@topaz walrus it’s on YouTube.
@velvet anchor Awesome thanks, though I was able to find it after revising my search terms a bit lol
no prob
Heyyo
I'm getting
ValueError: Negative dimension size caused by subtracting 3 from 2 for 'conv2d_81/convolution' (op: 'Conv2D') with input shapes: [?,2,2,64], [3,3,64,32]
And here's the code
I do use TF not Theano so my (128,3,3) parts are correct
But I don't understand why I get the ValueError
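The message itself says a 3x3 kernel is being applied to a feature map that is only 2x2. A rough sketch of how that can happen, assuming a stack of unpadded 3x3 convolutions each followed by 2x2 max pooling on a 64x64 input; the stack is hypothetical, since the actual code is not shown here.

```python
def valid_output_size(size, window, stride=1):
    """Spatial size after a 'valid' (no padding) conv or pooling layer."""
    if size < window:
        raise ValueError(f"window {window} does not fit into size {size}")
    return (size - window) // stride + 1


# Hypothetical stack: 64x64 input, then repeated Conv 3x3 -> MaxPool 2x2
size = 64
trace = [size]
try:
    while True:
        size = valid_output_size(size, 3)            # convolution, 3x3 kernel
        trace.append(size)
        size = valid_output_size(size, 2, stride=2)  # max pooling, 2x2
        trace.append(size)
except ValueError as err:
    # The last entry in trace is the size the next convolution could not handle
    print(trace, "->", err)
```

If the trace for your own layer list bottoms out like this, fewer pooling layers, padding="same" on the convolutions, or a larger input are the usual ways out.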
Changing dimension ordering in the ~/.keras/keras.json file seems to fix this for many people: https://github.com/keras-team/keras/issues/3945
what slandon said
in the ~ directory?
Yep
put a
from keras import backend
print(backend.image_data_format())
somewhere
and lmk if it says channels first or last
kk
Is there any other platform that I can train DL models with GPU for free
I've tried GCP but they don't accept prepaid cards so 
directy
datagen.flow
same error?
Yeah
Seriously, the Whole fu*ing universe is against me today
GCP,AWS,Colab they all said f*ck off
Is anyone here familiar with web scrapers?
More specifically the BeautifulSoup library
bot.tags.get("ask")
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
@steel thicket
@earnest prawn Alright my bad. It's chill though I figured it out in the end.
No problem
hey guise
general best practice question
i have a network of 30k nodes
any ways of visually making it appealing?
trying to do some network detection
Heyyo guys
If I have 100 output neurons for a classification problem
Let's say i fed my model with some input and i got 83 as a prediction
which is related neuron
How can i get the label of that neuron ?
which framework are you using?
@steel glen ask here please
Ok so afaik you have to reconstruct the labels yourself
In other words. Keras doesn’t care or store the names of your 100 categories. Instead it returns a 100x1 matrix with probabilities of each classification
Yeah
By using np.argmax() you can get the index of the highest element within that matrix
predict_classes() does the same thing
But you’ll have to construct a list or some other data structure that maps each Index to a class
Yeah but predict_classes is getting deprecated soon I believe
But tldr is you can’t get the name. Only the index. You have to map it out yourself
NP
One thing you can do is os.listdir() or some other method to traverse your training set directory and map it that way or can hard code. Doesn’t matter really
also this thread https://stackoverflow.com/a/47944082/9352862
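A minimal sketch of that index-to-label mapping in plain Python. The class names and the probability row are made up, and the sketch assumes the labels were assigned in sorted order of the class (folder) names; if your labels were encoded some other way, build the list in that order instead.

```python
# Hypothetical class names, e.g. the sub-directory names of a training folder.
# Assumption: indices were assigned in sorted order of these names.
class_names = sorted(["cat", "dog", "bird"])  # index 0, 1, 2

# Stand-in for one row of model.predict(...) output: one probability per class
probabilities = [0.10, 0.75, 0.15]

# Index of the largest probability (what np.argmax would return for this row)
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)

# The model only knows the index; the name comes from our own list
predicted_label = class_names[predicted_index]

print(predicted_index, predicted_label)
```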
Hmm thx
Hi, Im trying to create a a numpy array representing an image, but i want every pixel to be the same color
I created an array like this
blue = np.zeros((4608, 2592, 3))
blue[blue < 1] = 120
blue[2] = (0,0,255)
the second line successfully changes it to a fully gray image but the third line fails to change every pixel to blue. Am I doing it wrong? I've searched and searched but all I've found are ways to create fully white or black images in grayscale. But I need different values for each RGB value (3)
Would numpy.full give you what you're looking for? https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.full.html
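A small sketch of two ways to do this, plus what the third line above actually did. The 4x6 size is just a small stand-in for the real image, and dtype=np.uint8 is an assumption (most image libraries expect 8-bit channels).

```python
import numpy as np

height, width = 4, 6  # small stand-in for the 4608 x 2592 image

# Option 1: allocate, then broadcast one RGB triple over every pixel
blue = np.zeros((height, width, 3), dtype=np.uint8)
blue[:] = (0, 0, 255)

# Option 2: np.full broadcasts the fill value the same way
blue_full = np.full((height, width, 3), (0, 0, 255), dtype=np.uint8)

# What blue[2] = (0, 0, 255) does: index 2 on the FIRST axis, i.e. only row 2
row_only = np.zeros((height, width, 3), dtype=np.uint8)
row_only[2] = (0, 0, 255)
blue_pixels_in_row_only = int((row_only[:, :, 2] == 255).sum())

print(np.array_equal(blue, blue_full))
print(blue_pixels_in_row_only)  # only one row's worth of pixels turned blue
```

The last axis holds the channels, so setting a single channel everywhere would be blue[..., 2] = 255.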
Hmm that could work, I'm gonna try it
hi, can someone help me with nns?
I want to adapt this nn to other uses (I want to change its possible inputs and outputs)
it's a mini-game, here is the map that comes with it.
help plz
Some people prefer not to open attachments. So it is better to post your code here in codeblocks if it is small enough for discord or use hastebin
what is hastebin?
Either hastebin.com or the one link that is especially made for this server
I am not able to retrieve that link
Heh. I am not qualified to answer your question. I just wander around here and just told you what people prefer. So please wait for your answer
uhh...
what are some basic strategies on improving a computer vision algorithm?
does your algorithm include a relu function?
we're still talking about ais aren't we?
adapt your algorithm to your situation
what is the use of your ai?
also using temporal difference over simple q values might help a lot
ill be honest im really new to all of this, the only thing ive done in the past is the loan machine learning thing, if you can recommend to me a tutorial on keras that would be great
hmm
so for like MNIST...
are there any
is there any way to integrate one hot encoding to each pixel
AND
what modifications could i perform on the training data
i.e. rotating
what not
ALSO
for the initial layer, what should i use if the input is a 2d numpy array or list
I learned ai on udemy, the name of the course was "artificial intelligence a-z"
I don't use the exact same vocabulary as you do so it would be wise to ask someone who has deeper knowledge than I do
see ya
So very simple question.. Is it possible to give a function as a blackbox in python?
So I have a function, which should be able to use different distance measures. Usually I would make this with a switch and a shit ton of duplicated code, but something like.
def somefunction(similarity_function, some_object, database):
    for database_item in database:
        similarity = similarity_function(some_object, database_item)
Would be cool.
Yeah it is, pretty much exactly as you wrote! https://stackoverflow.com/a/706735 @lapis sequoia
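A runnable sketch of the same idea: functions are ordinary objects in Python, so the distance measure can be passed in like any other argument. The names euclidean, manhattan and closest_item are made up for this example.

```python
import math


def euclidean(a, b):
    # Straight-line distance between two points
    return math.dist(a, b)


def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_item(distance_function, some_object, database):
    """Return the database item with the smallest distance to some_object."""
    return min(database, key=lambda item: distance_function(some_object, item))


database = [(3, 3), (0, 5), (7, 1)]
query = (0, 0)

# Same function body, different measure: no switch, no duplicated code
print(closest_item(euclidean, query, database))
print(closest_item(manhattan, query, database))
```

The two measures disagree on this data, which shows the passed-in function really is the only thing that changes.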
Yeah it is exactly as I wrote. I already got the same answer in general - we had a laugh - fun has now been had. But thanks @feral lodge
yo halp plocks
can someone help me untangle this mess, I want to know how to make this nn usable for other purposes
the first link is the nn and the second is the map of the mini-game
any good ML course
currently doing one from udemy
are you doing the artificial intelligence A-Z course?
no i m doing ml a-z course
@analog rampart
Haven't tried any online courses, but I can put you in the direction of some books, where I learned all the theory from.
not a book guy but i ll give it a shot 😉
Okay. I think
http://www.dataminingbook.info/pmwiki.php <-- By Zaki and Meira is really good, with quite advanced theory (currently doing research in a model proposed here).
https://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367 <-- very good introduction to basically everything you need to know.
http://www.deeplearningbook.org/ <-- fine book for more theory on neural lolworks.
https://www.springer.com/la/book/9780387848570 <-- is great for the statistical foundation 😃
ty
ty
Are you guys studying CS or some related education ?
if by studying you mean school and university then the answer is nah
just finished high school
ai is one of my hobbies
@analog rampart Check out Andrew Ng's ml course on coursera, it's boss 👌 They have a chat room dedicated to it on /r/LearnMachineLearning's discord https://www.reddit.com/r/learnmachinelearning/comments/8smyod/join_study_chat_groups_for_andrew_ngs_coursera/
Did my bachelor's in CS, doing my master's in AI! Hoping to start a doctorate next year 🤓
I then shall wish you success in your studies
Much appreciated 😄
@daring bison that's a pretty solid hobby 😃 .
@feral lodge cool we're the same place then - I also start my doctorate next year in ML.
thanks
Sick 👌 Any idea what your area'll be?
Currently I'm working on a paper about the class imbalance problem, and maybe I'll go more into semi-supervised learning in the coming year. Not sure though. I have some great ideas for class imbalance handling, that I also consider continuing with.
But it's primarily theoretical..
I'll take this opportunity to ask you some questions, I am new when it comes to programming and I'd like to learn how to write the code for a simple nn (I'll go further as I progress), could you recommend me a course that explains in detail how to program a nn or cnn for any kind of usage?
why is everyone ignoring my questions : (
@daring bison
(a) most people don't have unlimited time, those that do have the time might not have the answers
(b) Discord is currently experiencing major outages
@grave wasp if you have a particular question people are more likely to respond if you ask it!
im trying to solve a task about computing n-grams, unigrams and probabilities.. i have some errors in my code and i'd like help from someone with experience in that
i dont want you to give me the solution, just to help me figure out what the problem with my code is
@naive hornet ok I see, I'll be more patient, sorry : (
Sorry Prom there’s only 3-4 of us that regularly answer ML stuff.
Ok so @daring bison as far as courses go. There’s a couple, but Andrew Ngs on Coursera is 👌 and a lot of people consider it the best
@grave wasp post code please. I think I saw it in another channel but I can’t find it
yep sure
Computing Conditional Unigram Probabilities [4 points]
Now we can build the model giving us the conditional probabilities of the next token given the N-1 previous tokens. The implementation of the required class method extract_conditional_probabilities() amounts to building and storing a probability distribution over all possible following tokens for each (N-1)-gram which occurred in the corpus. The lookup structure will be stored in the value of the cond_prob instance variable. For our example with N = 2, looking up how likely the token “viele” is after “sehe” is done via model.cond_prob[("sehe",)]["viele"]. (BTW: the notation (token,) is necessary in the bigram case to enforce that a unary tuple is looked up, because (token) == token in Python.)
The recommended logical structure of your implementation is as follows:
• for each N-gram contained as a key in prob:
  • split the N-gram into the first N-1 tokens (the "mgram") and the final unigram
  • if the mgram is not yet a key in cond_prob, store a new dictionary under that key
  • set the value for the unigram in cond_prob[mgram] to the probability of the N-gram
• for every dictionary in the values of cond_prob:
  • add up the values assigned to all unigram keys
  • divide the value under each unigram by the sum of values
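The steps in the task description can be sketched roughly as follows. This is a standalone illustration, not the class method the assignment asks for: the bigram probabilities in prob are made up, and a defaultdict stands in for the "if the mgram is not yet a key" check.

```python
from collections import defaultdict

# Hypothetical bigram probabilities (N = 2); keys are N-gram tuples
prob = {
    ("ich", "sehe"): 0.2,
    ("sehe", "viele"): 0.3,
    ("sehe", "keine"): 0.1,
    ("viele", "Leute"): 0.4,
}

# Step 1: group the N-gram probabilities by their first N-1 tokens
cond_prob = defaultdict(dict)
for ngram, p in prob.items():
    mgram, unigram = ngram[:-1], ngram[-1]  # first N-1 tokens, final token
    cond_prob[mgram][unigram] = p

# Step 2: normalise so all continuations of one mgram sum to 1
for continuations in cond_prob.values():
    total = sum(continuations.values())
    for unigram in continuations:
        continuations[unigram] /= total

# Lookup in the form used by the task: tuple key first, then the next token
print(cond_prob[("sehe",)]["viele"])
```

Slicing with ngram[:-1] always yields a tuple, so the bigram case ends up with one-element tuple keys such as ("sehe",) without any special handling.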
after all i have a key error.. i dont understand because the dictionary is not empty
so i splitted
i did this on different dictionary
Still using my code from yesterday? In that case you'll need to change
if mgram not in cond_prob:
    cond_prob[mgram] = {}
p = prob[ngram]
cond_prob[mgram][unigram] = p
to
if unigram not in cond_prob:
    cond_prob[unigram] = {}
p = prob[ngram]
cond_prob[unigram][mgram] = p
Since your example usage showed this:
>>> ngram_model.cond_prob[("beobachteten",)]
rather than something like this, which my code was assuming:
>>> ngram_model.cond_prob[("hello", "my", "name", "is")]
ie, you're using unigrams as keys in cond_prob rather than mgrams
So if you are testing this:
>>> ngram_model.cond_prob[("beobachteten",)]
then I would expect a KeyError because ("beobachteten",) does not exist as a key in cond_prob, because we never put it there, because it's a unigram
Btw I'm running back and forth in uni right now, so i'm semi-afk 😄
yes but when i print my dict show me {'alekos' : 1, 'something' :2 }
That's after changing the stuff i showed?
before
sorry im little confused because im all day on this task and my head is really heavy 😃
Try making the change and let's see. might not fix all problems, but my previous code was definitely not compatible with your example usage
No stress friendo
my mistake is on this ?
self.cond_prob[i] = self.prob[key]
That's definitely an issue yeah. What you're doing there is making cond_prob be a dictionary that looks like this:
{("alekos",) : 0.08, ("slandon",) : 0.05}
But you want to make it look like this:
{("alekos",) : {("my", "name", "is") : 0.08}, ("slandon",) : {("my", "name", "is") : 0.02}}
Remember the difference between a "regular" probability p(X) and a conditional probability p(X|Y).
The regular probability p("slandon") of your text will be very small -- it'll be (number of times "slandon" appears in the text) / (total number of words in the text)
The conditional probability p("slandon" | "my name is") will be much higher -- it will be (number of times "slandon" appears after the string "my name is" in the text) / (total number of words that immediately follow the string "my name is" in the text).
So since conditional probabilities are p(X | Y) it makes sense that we would need both the unigram X and the mgram Y to create a dictionary of conditional probabilities. What you're doing is only using X
hmm...
Example:
Say your text has 10000 words. The string "my name is slandon" appears once. The string "my name is alekos" appears twice. No one else says "my name is" in the text, and our names don't appear anywhere else but in those sentences.
Then:
p("slandon") = 1/10000
p("alekos") = 2/10000
p("slandon" | "my name is") = 1/3
p("alekos" | "my name is") = 2/3
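The same numbers can be reproduced by counting on a toy corpus. This is only a sketch: the corpus is padded with a made-up "filler" token to reach 10000 words.

```python
from collections import Counter

# Toy corpus matching the example: 10000 words in total
sentences = ["my name is slandon"] + ["my name is alekos"] * 2
words = " ".join(sentences).split()
words += ["filler"] * (10000 - len(words))

context = ("my", "name", "is")
n = len(context)

# Plain word counts, for the unconditional probabilities
word_counts = Counter(words)

# Count every word that immediately follows the context
followers = Counter(
    words[i + n]
    for i in range(len(words) - n)
    if tuple(words[i:i + n]) == context
)

p_slandon = word_counts["slandon"] / len(words)
p_alekos = word_counts["alekos"] / len(words)
p_slandon_given_context = followers["slandon"] / sum(followers.values())
p_alekos_given_context = followers["alekos"] / sum(followers.values())

print(p_slandon, p_alekos)
print(p_slandon_given_context, p_alekos_given_context)
```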
yes exactly
So that's the difference between your two dictionaries prob and cond_prob
To make a meaningful dictionary for stuff like
p("slandon" | "my name is") = 1/3
p("alekos" | "my name is") = 2/3
we need to use both the unigrams, like "slandon" and the mgrams, like "my name is"
But what you're doing in this snippet
if i not in self.cond_prob:
    self.cond_prob[i] = self.prob[key]
is to only use the unigram
i have to get rid of that?
if key not in self.cond_prob?
self.cond_prob[key] = {} ?
it will print something {('alekos', 'something'): {'somethin': 4.}
@feral lodge I know you're busy so I won't disturb you with my questions, when can I ask them to you?
I didn't forget you! Like Clay said, Andrew Ng's coursera course is a really good place to start. This free book http://neuralnetworksanddeeplearning.com/ is also a wonderful intro to NNs.
After a while checking out libraries like pytorch or keras will make implementing your own nets very simple and intuitive. As you go along you'll find you almost never want to implement the networks yourself, because you'll need to implement really sophisticated algorithms for computing gradients (partial derivatives) for many, many variables. It's really only feasible, imo, to implement your own neural networks if they have max 2-3 layers, so pytorch/keras/etc are more or less necessary. Implementing your own small nets is a good exercise though. Googling "implementing small neural network python" will probably get you some good results
@grave wasp self.cond_prob[i] = {} I think looks better. Remember that cond_prob[key] is a probability. So your snippet would make cond_prob look like this: {0.098 : {}} which is obviously not what you want 😄
@daring bison If you're a beginner at programming as well as NNs, you'll definitely want to learn some basic python before you try to implement nets though 😄 Nothing stops you from learning python and NN theory at the same time however
@feral lodge i get most of the things except for the 2 most important things in programming
1.what is the goal of an argument?
2.and what does the . mean in for example : torch.optim ?
do you know about classes / OOP
the purpose of passing arguments to functions is to do some type of computation and produce a result that is unique to the argument you pass
can you please rephrase in simpler words, I don't fully understand : (
im not really sure how to explain it uhh
you know how in math you have functions right?
like x and f(x)
you know how when you plug a value in x you do some computation to get a result?
oh I see now
you plug an x into a function
programming is closer to math than I thought
thank you
yeah like y= a x +b could have 2 arguments and a constant
sure
but remember theyre not all like math related
also this is kinda going off topic from this channel but
lets say you want a function that makes all the letters in a word capitalized
so you can do
def make_word_uppercase(string):
    return string.upper()
it just uses arguments to do something, which can be produce an output or accomplish a task
I see
I was lacking a key part
now let's jump to the second part
if I say string.upper() does the dot mean it's the fct upper from the string format?
its the method upper from the str class
you should look more into classes / object oriented programming with python
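A short sketch of the places a dot shows up. The Dog class is made up for illustration; the pattern in each case is "look up the name on the right inside the thing on the left".

```python
import math


class Dog:
    def __init__(self, name):
        self.name = name  # attribute stored on this particular object

    def speak(self):  # method: a function that lives on the class
        return f"{self.name} says woof"


rex = Dog("Rex")

print(rex.name)         # attribute of an object
print(rex.speak())      # method of an object
print("hello".upper())  # method upper of the str class
print(math.sqrt(16))    # function sqrt inside the math module
```

torch.optim follows the same pattern: optim is a submodule looked up inside the torch package.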
class ClassName:
    def __init__(self, name):
        self.name = name
when is it always something like that?
hey all, I was redirected here
I've a df
and would like to convert 'Date' to POSIX int
Date Open Close High Low Volume
745 2018-07-17 04:00:00+00:00 6723.000000 6695.600000 6759.1 6679.3 331.857518
746 2018-07-17 08:00:00+00:00 6695.600000 6700.273918 6744.0 6667.4 294.171766
747 2018-07-17 12:00:00+00:00 6700.273918 6695.300000 6726.9 6666.0 421.905261
748 2018-07-17 16:00:00+00:00 6695.300000 7190.200000 7274.0 6695.3 2132.705123
749 2018-07-17 20:00:00+00:00 7190.200000 7310.800000 7483.9 7157.7 3500.710142
could someone help?
thx a lot in advance guys
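One possible way to do this, sketched on a two-row stand-in for the frame above and assuming 'Date' is already a timezone-aware datetime column in UTC: subtract the epoch and floor-divide by one second.

```python
import pandas as pd

# Two-row stand-in for the dataframe above
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-07-17 04:00:00+00:00", "2018-07-17 08:00:00+00:00"], utc=True
    ),
    "Close": [6695.6, 6700.273918],
})

# Seconds since 1970-01-01 00:00:00 UTC, as integers
epoch = pd.Timestamp("1970-01-01", tz="UTC")
df["Posix"] = (df["Date"] - epoch) // pd.Timedelta(seconds=1)

print(df)
```

If the column holds strings rather than datetimes, running it through pd.to_datetime(..., utc=True) first, as in the stand-in, should get it into the right shape.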
somebody familiar with matplotlib
bot.tags['ask']
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
hi
i have a question about visualizing
i wanna call a matplotlib chart
so that i get the chart how we see it
and then give it to the code
how should that be done?
any ideas
nobody....?
Most people familiar with matplotlib are probably asleep / working. Least here it won’t get buried like in a help channel. Prolly take a few hours before someone can take a look
metis sux
Have some previous python experience in uni. Any thoughts on this udemy course for some personal learning? https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/
Not on that course @wraith frigate Andrew NGs course though on coursera I believe is currently one of the best as far as I know
@velvet anchor thanks I appreciate your input
@wraith frigate , I've enrolled and have the material. PM me if you are interested
guys, anyone on dataframe experience?
df['Test'] = df['Close'].ewm(span = 24).mean()
for ['Close'] NaN cells, I would like to keep ['Test'] cells also NaN
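One way this could be done, sketched on a tiny made-up column: compute the EWM mean as before, then mask it with where() so rows whose 'Close' is NaN stay NaN in 'Test'.

```python
import pandas as pd

# Tiny stand-in column with one missing value
df = pd.DataFrame({"Close": [1.0, 2.0, float("nan"), 4.0, 5.0]})

# The EWM mean on its own can still hold a value on the NaN row
ewm_mean = df["Close"].ewm(span=24).mean()

# Keep the EWM value only where Close itself is present
df["Test"] = ewm_mean.where(df["Close"].notna())

print(df)
```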
@lapis sequoia Stop spamming channels with unrelated questions
Hey guys, if you had a tool to collect information from a blockchain. What information would you want it to give back to you?
I'd make a graph with all the info about it
At the moment, I'm only collecting raw data. Just trying to get ideas on what to collect, by asking people what they would like to see
couldnt you just collect everything
data newbie here. Do you guys know how to do a single colorbar for contour subplots? I'm trying to visualize some ocean current data and I'm breaking it up by component (so I have a scalar to contour). The velocity variable is 2D from a netCDF file and depends on time and depth (instrument used collects data at many different binned depths). I found something on stackoverflow that is almost my exact question (https://stackoverflow.com/questions/13784201/matplotlib-2-subplots-1-colorbar) but I can't seem to get imshow to play nice with my data
@prime thistle yep, I could. Still need to format it though
I am just going by ear now and refactoring
Hello!
If I was to create a research bot kind of a thing, what should I start with? The idea would be to gather papers, posts etc that mention in them for example a specific keyword from sources such as scholar.google.com or reddit etc.. At first I was under the impression that web scraping would be the thing to learn, but it seems to be a different thing.
Any insight/ feedback is much appreciated.
Can't you solve this by actually searching google scholar and research gate for instance - and just be subscribed to different authors ?
Are you suggesting to not develop anything for this at all and continue manual research?
So if you do a google scholar search for a keyword, you get a result - why have a bot to do that? (unless you have 1000 keywords), but then you'll get n*1000 papers
There are other sources too, would be nice to have it centralised and then the program could be improved as time goes on
Well I can't really see the benefit of the program.. In the ideal scenario, you auto pull new papers containing the keywords you're interested in. Deciding whether a paper is important takes a ton of time, especially compared to searching, so I can only imagine that such a bot will give you a shitload of papers that you don't have time to read..
So what is it that you're trying to achieve academically by this? Do you want to be right at the edge of a specific research area - looking for new icebergs to take a shot at?
I'm just looking to learn by doing something that will help me.
I'd prefer if it brought up relevant news, articles etc of current innovations on e.g. gas fuels and mof's etc. But perhaps the MVP of the idea wouldn't be of much help
Are you interested in learning data science? 😃
Here’s the problem - by creating what you want to do now you need both access to a google search API and the ability to process keywords to find words that are close but not in your exact keyword set.
You’ll probably want to use NLTK for the keyword handling
Isn't the API available?
No. Google scholar both disallows bots in their robots.txt and disallows use of the service through anything but the interface they provide in the ToS
@serene oar I would probably look at Kaggle.com to get started
Individual journal publishers may have APIs you can use however
Alright. Thank you 😃
Do any of you guys know how to extract the slope at arbitrary points from contour lines from a generated contour plot? I'm trying to align some data along directions defined by a constant water depth (isobaths). If I can get the slope of the contour at any point, I can calculate the angle it makes with the x axis and project my data onto the resultant vector.
That’s just the derivative right?
yes
Like take first derivative then plug in your point. If you have the equations.
I dont unfortunately. I only have a data field that represents the bathymetry, so I'd have to calculate the contours to find the direction of the bathymetry.
Hm. There’s probably a way to do that with a function call but I’m not sure.
I’ll have to research when I get home
I found this @wanton pier https://skemman.is/bitstream/1946/16233/1/final_processingwithpython_dillon.pdf
But I haven’t read through it. It looks promising though
Thanks @velvet anchor I'll read through it
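A rough sketch of one approach that works directly on the gridded field, without extracting contour lines first: the gradient of the depth field is perpendicular to the isobaths, so rotating it by 90 degrees gives the along-isobath direction. The grid, the depth field and the velocity below are all made up for illustration.

```python
import numpy as np

# Hypothetical bathymetry on a regular grid: depth grows towards the north-east
x = np.linspace(0.0, 10.0, 11)
y = np.linspace(0.0, 5.0, 6)
X, Y = np.meshgrid(x, y)  # arrays of shape (len(y), len(x))
depth = X + Y

# np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x)
ddepth_dy, ddepth_dx = np.gradient(depth, y, x)

# Rotate the gradient by 90 degrees to get the along-isobath direction
tangent_x, tangent_y = -ddepth_dy, ddepth_dx
angle_deg = np.degrees(np.arctan2(tangent_y, tangent_x))  # angle to the x axis

# Project a (hypothetical) velocity vector onto the isobath direction
u, v = 1.0, 0.0
norm = np.hypot(tangent_x, tangent_y)
along_isobath = (u * tangent_x + v * tangent_y) / norm

print(angle_deg[0, 0], along_isobath[0, 0])
```

Where the bottom is flat the gradient vanishes and norm is zero, so real data would need those points masked before dividing.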
Hi Guys
I have a dataframe
which , in column form, looks like
Date Store Question
20180701 Store A Q1
20180701 Store A Q2
20180701 Store A Q3
20180702 Store B Q1
20180702 Store B Q2
20180702 Store B Q3
etc
I've written a pretty lengthy app that parses the info as needed
but im trying to expand my pandas knowledge so ive rebuilt it pretty neatly.
however I'm stuck on how to get unique data out with the new sleeker version of the code.
essentially i want to use df to get
Date Store Q1 Q2 Q3
with the store being unique
I've looked at pandas melt, groupby, unique, stack
but i can't seem to pull it off in the format I want.
Any advice?
Currently im thinking to concatenate , Date + Storename + OwnerName + Telephone to get a new column, 'id' , which is essentially a uniqueness checker?
Then use groupby ? to join all rows with the same 'id' so that i can get one row per unique entry?
eh - i might be wording it wrong
i essentially just want to get to
20180701 Store A Q1 Q2 Q3
20180702 Store B Q1 Q2 Q3
Admittedly my pandas knowledge only comes from helping others but
Is GroupBy() what you want?
If not I’ll load up PyCharm and mess around some. Just got home
I'm going to test if Groupby works with the idea I had uptop Date + Storename + OwnerName + Telephone to get a new column, 'id' , which is essentially a uniqueness checker?
but yeah - it seems to maybe be what i need
if there was an easier way to do this that would be great.
i should probably give a better snippet of what the data looks like because i realize i might just run into a different problem
columns: ['SurveyGUID', 'SQNo', 'Data Type', 'Date', 'Description', 'Caption_1', 'Value_1', 'Caption_2', 'Value_2', 'Caption_3', 'Value_3', 'Caption_4', 'Value_4', 'Caption_5', 'Value_5', 'Question', 'Answer', 'Report_Criteria_1', 'Report_Criteria_2', 'Report_Criteria_3', 'Report_Criteria_4', 'Report_Criteria_5', 'Report_Criteria_6', 'Report_Criteria_7', 'GPS Longitude', 'GPS Latitude', 'Clerk', 'Checkbox_Option_1', 'Checkbox_Value_1', 'Checkbox_Option_2', 'Checkbox_Value_2', 'Checkbox_Option_3', 'Checkbox_Value_3', 'Checkbox_Option_4', 'Checkbox_Value_4', 'Checkbox_Option_5', 'Checkbox_Value_5', 'Checkbox_Option_6', 'Checkbox_Value_6']
I extract data from a less than ideal database application ( im going to build my own soon, to replace the current one, with proper normalization )
Whats problematic with group by is the Answers
I'm trying to arrange the multiple rows of each store into one row per store ( but I have to ensure the store isn't a duplicate )
however some Questions have their Answers in the very next column.
but other Questions have their Answers in Checkbox Form
Hmm
Maybe intersect?
It seems kinda similar to what they’re asking here
Different idea but kind of? Similar concept
breaks my brain trying to visualize that into what i need 😛
I dont think its far off however
Pandas can do basically everything but it’s so difficult sometimes
it can be quite frustrating to learn new things.
using stuff im used to however makes things a breeze usually. just new territory thats horrid
i've never used pandas but apparently it has a steep learning curve.
thats a bit of an understatement
but it is one of the most powerful data libraries
@feral lodge knows a bit about data frames I believe, not sure when he'll be around to see it though
hmm - im definitely going to pop back in here - i didnt realize Python Discord had a datascience section.
Yeah
its pretty sparse, theres only 3-4 people who regularly help but we normally don't give up xd
I'd look at intersect though
I'm not sure how to implement it for your df but I know that's the key
yeah, the people who can help here usually help a fuckton
im busy implementing the concatenation + groupby still -
if i can get everything into the same row ,
i can generate a checker that differentiates between Traditional Single Answer Type Questions and Checkbox Answer Type Questions.
then getting the results in the proper format shouldn't be hard.
if it fails il re-evaluate with intersect
all in all - if the database was just built correctly - i would not have these issues. 😛
This too dirty?
import pandas as pd
# Recreate important part of dataframe
d = {'Store' : ["StoreA"]*3 + ["StoreB"]*3,
'Question' : [f"q{j}{i}" for j in 'AB' for i in [1,2,3]]}
df = pd.DataFrame(data=d)
print(df) # have a look
print("------")
# hop(s) returns pd.Series([s, s+3, s+6, ..., nrow])
nrow = df.shape[0]
hop = lambda start : pd.Series(range(start, nrow, 3))
# Drop questions. Keep only first row for each Store
df_new = df.drop('Question', axis=1)
df_new = df_new.loc[hop(0)];
# Extract questions
df_new = df_new.assign(Q1 = df.Question[hop(0)].copy().values)
df_new = df_new.assign(Q2 = df.Question[hop(1)].copy().values)
df_new = df_new.assign(Q3 = df.Question[hop(2)].copy().values)
# Update indices from [0,3] to [0,1]
df_new = df_new.reset_index(drop=True)
print(df_new)
Prints this
Store Question
0 StoreA qA1
1 StoreA qA2
2 StoreA qA3
3 StoreB qB1
4 StoreB qB2
5 StoreB qB3
------
Store Q1 Q2 Q3
0 StoreA qA1 qA2 qA3
1 StoreB qB1 qB2 qB3
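For comparison, a minimal sketch of the same reshape that does not rely on every store having exactly 3 questions in order. It numbers the questions within each store and pivots on that number. The helper column `n` and the name `wide` are made up for this example, and it assumes the same toy frame as above.

```python
import pandas as pd

# Same toy frame as in the snippet above
df = pd.DataFrame({
    "Store": ["StoreA"] * 3 + ["StoreB"] * 3,
    "Question": [f"q{j}{i}" for j in "AB" for i in [1, 2, 3]],
})

# Number the questions within each store: 1, 2, 3, ...
df["n"] = df.groupby("Store").cumcount() + 1

# One row per store, one column per question number
wide = df.pivot(index="Store", columns="n", values="Question")
wide.columns = [f"Q{c}" for c in wide.columns]
wide = wide.reset_index()
print(wide)
```

If a store has fewer questions than the others, the missing cells come out as NaN instead of shifting the rows.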
slandon you're a fucking wizard honestly
My secret is this magic elixir
thanks so much - my swampy brain is going to take a while to understand why you did what you did
I'm going to attempt to MacGyver your code into my own and then add the necessary steps for the results i need
but thanks ! its definitely going to help
will do - nice tunes btw
Andrew Ng's course has a new enrollment on the 23rd btw for anyone who's interested
Are you guys part of other data-science'y discords, that is not only for Python ❤ ❤ ❤
i am @lapis sequoia
also is this diagram accurate for a simple CNN?
do you want an invite to the server or something?
also your drawing might be a convolutional, but it's a little hard to see in that notation what you're convolving.
@lapis sequoia yeah sure
oh nvm i realized it isnt
Hi everyone. I was looking for a tool that'd help me visualize decision boundaries in 2D and 3D.
I know Matplotlib exists, but is there an easy interface to it that'll let me do this?
(I can code many complex algos but when it comes to visualizing stuff I'm kind of useless)
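One common way to do this in 2D with plain Matplotlib is to evaluate the classifier on a dense grid and colour the grid by predicted class. The sketch below is only an illustration: the data is made up, and `predict` is a toy nearest-centroid classifier standing in for whatever model is being visualized (any function mapping an `(n, 2)` array to labels should fit the same slot).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2D data: two Gaussian blobs with labels 0 and 1
X = np.vstack([rng.normal(-1.0, 0.7, size=(50, 2)),
               rng.normal(1.0, 0.7, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def predict(points):
    """Toy nearest-centroid classifier; swap in a real model's predict."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Evaluate the classifier on a dense grid covering the data
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]
zz = predict(grid).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3)                 # coloured decision regions
ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # the training points
# plt.show() or fig.savefig(...) to actually look at the figure
```

For 3D inputs the same grid idea applies, but a filled contour no longer works directly, so slicing through one dimension at a time is probably the simplest starting point.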
Can someone tell me an easy-to-use machine learning lib
Sklearn? http://scikit-learn.org/
Thank you
Sklearn Keras tensorflow and pytorch are the relevant ones @lapis sequoia they all have pros and cons
Keras and sklearn are probably the two easiest
@lapis sequoia Keras is da best
with just a few lines you get an ANN, CNN, RNN etc
With Tensorflow support
Good morning all, I am trying to finish a degree program and ended up in a Data Mining class as my only option... I have not coded in, let's go with, "a long time", and I am really struggling with even simple things. Wondering if anyone has some burnable time today to help walk me through some stuff.
please tell me good resources to learn ML with python!!!
@minor whale @wind wasp you could have a go at https://developers.google.com/machine-learning/crash-course/
@placid snow thanks does it have assignments, exercises?
okay
@feral lodge do you think it's possible to make an android with the current knowledge we have?
like for the walking part we take an ars algorithm
for the eyes just a cnn
for the language we use a chatbot
and etc
i am trying to import my mnist data but when I try it
it gives me that error
even tho that file exists
o_0
@daring bison Things you've said are of course possible but consciousness is decade(s) away
well at least we have the basics : )
@subtle idol U sure it's on right dir?
@minor whale not strictly python related but Andrew Ngs on coursera is the best ML course.
@subtle idol try unzipping it maybe and then loading the pickle file directly
to unzip you write "unsqueeze. "?
I meant like uncompressing the gz part so you have just the .pkl
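For what it's worth, a gzipped pickle can usually also be read without unzipping it first. A minimal sketch, using a made-up file `toy.pkl.gz` in a temp directory so it runs on its own (the real MNIST file name and contents are not shown in the chat, so nothing here assumes them):

```python
import gzip
import os
import pickle
import tempfile

# Write a small stand-in file so the example is self-contained
data = {"train": [[0, 1], [1, 0]], "labels": [3, 7]}
path = os.path.join(tempfile.mkdtemp(), "toy.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(data, f)

# Read it back: gzip.open decompresses on the fly, pickle.load parses it
with gzip.open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["labels"])
```

If the file was pickled under Python 2, which is often the case for older copies of MNIST, `pickle.load(f, encoding="latin1")` is commonly needed on Python 3.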
@velvet anchor it's only the mathematical basis of ml?
Yeah but thats the important part really
the rest kinda transfers over into any language because most of the popular libraries have ports
@minor whale Not much mathematical background required for Andrew Ng's course. It has programming exercises but uses the much easier higher level Octave/Matlab scripts. You can of course implement the same exercises in python with some numpy and scipy.
For a course with thorough mathematical approach grab the one pinned by Rags
@small ore where do i find that course?
Pinned messages upper right 📌
@placid snow ah saw it thanks
Anyone good with pandas profiling
Actually this might just be a returning issue
@steel glen it is in the right dir
@velvet anchor I unzipped the file how do i load it directly?
Nevermind I got it to work
any guide to get started with machine learning?
https://developers.google.com/machine-learning/crash-course/ May be what you're looking for
well idk, i haven't heard too many good things about tensorflow
I used that course to pass my classes at least. If it works for everyone, that's another tale
oh
idk i haven't tried it myself but from what i've heard you have to do things a particular way compared to other ML libraries likes keras
also i'm unexperienced too lol
Andrew Ngs course on coursera @copper swan
Else there's http://scikit-learn.org/stable/tutorial/index.html which is the tutorial for scikit-learn, a library built on top of scipy afaik
scikit learn, keras, tensorflow, pytorch are the big 4
tensorflow is fine its just less abstracted
oh
@velvet anchor and what language does that course on coursera uses?
You can use any
its a ML course not really a language specific one
(disclaimer: i haven't taken it, I am soon though, I just know its the most recommended one and its more on the backbone of ML than implementing it)
oh okay. but i suppose we need to learn about python libraries to program too right?
Yeah but if you know what you're looking for it becomes much easier to transfer between languages
is anyone familiar with python spline interpolation?
so i wrote something to do natural cubic spline interpolation
and the yellow points should be all connected
yet i cant find where i went wrong
Hey there, just came by to ask if any of you guys are bioinformaticians?
if so, do you have any pros/cons for me?
i'm kind of tired of the lab work (which is part of bioinformatics too) but I'm sure programming is more of my topic since I prefer math/statistics over biochemistry
would appreciate it a lot
@tulip cosmos if you enjoy one more than the other, that's pretty important, take that into account.
Numpy 1.15.0 is out and contains a number of breaking changes! https://github.com/numpy/numpy/releases/tag/v1.15.0
breaking as in it would break old code that uses numpy?
Or just a short for 'Ground breaking'?
It'd break old code, of course
Hope it aint too much. Coz most good tutorials will become obsolete
Well array indexing is changing. It will be quite easy to fix, but this may require changes in many simple tutorials.
The hardest point for me will be to adapt my small brain to not use the old syntax.
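As far as I read the 1.15 release notes, the indexing change in question is that multidimensional indexing with a list (or any non-tuple sequence) of slices is deprecated. A small sketch of the fix, with `arr` and `idx` made up for illustration:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# An index built up as a list, e.g. assembled in a loop
idx = [slice(None), 0]

# Old style was arr[idx]; wrapping the list in a tuple is the fix
first_col = arr[tuple(idx)]   # same as arr[:, 0]
print(first_col)
```

Plain integer-list fancy indexing such as `arr[[0, 2]]` is a different case and is not what the deprecation targets.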
@misty sonnet I know that, but the risk is high since I'm new to programming. The M.Sc. I'm doing right now is pretty safe and would be finished in one year. If I switch it will take me 3 years and I don't even know if I'm capable of doing this.. other things such as my girlfriend (it is 2h away from my home) and losing the right to continue my current master's are also points that make me insecure in that regard.
Worst case scenario is that I switch and I'm not smart enough/experienced enough to finish, it could even be that bioinformatics isn't the right thing for me, that's why I was asking
@tulip cosmos looks like alot of negatives to me
No one can tell you whether you will get coding quickly or not
don't know honestly, it's 3 years vs. the rest of my life which makes it an important decision
If you already know some coding and when you look at your course syllabus and perhaps browse what kind of algorithm/math you have to implement and if you feel confident about it, only then go about it
you talk as if you cant do bioinformatics with a biotechnology degree
its very very easy to get experience with coding
pretty simple to teach yourself
Unrelated:
```
NumPy 1.16 will drop support for Python 3.4.
NumPy 1.17 will drop support for Python 2.7.
```
everyone says python is simple, but getting into machine learning etc can be a real pain cant it?
a bit. if you're comfortable with maths/stats, its not too bad
I really find this stuff interesting but if not in university I don't really know where to learn all this stuff apart from youtube celebrities
i'm decent I guess, i'm more of an average guy
yea, I know theres lots of literature
it's just super hard to start from 0 basically
(I guess)
you talking about andrew Ngs books or does he have online courses?
I'll have to look it up, thanks a lot
ok 👌🏻
luckily the university I'm applying for has video records of some of their courses so I can learn this way too
a bit unrelated but what's the best IDE for python in your opinion?
Rags, can you also pin the Andrew Ng course link?
@small ore the pinned messages reveals a link for this course https://courses.edx.org/courses/course-v1:ColumbiaX+CSMM.102x+1T2017/course/ which is purported by Rags to be superior to the Andrew Ng course
is it actualy superior?
@rapid pawn could you post your code for the splines?
@feral lodge i found the problem thank you man 😃
Oh good job! What was the issue if you don't mind? Never saw splines looking like that before
it was one of my coefficient formula
namely the c coefficient
but it was rather strange since when i approx with less than 5 data points my c coefficient was correct
but once i passed that threshold c began to get weird values lol
Weird 🤔 good job finding it 👌
lol it was weird indeed, which is why i couldnt find it at first
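A handy cross-check for a hand-written natural cubic spline is to compare it against `scipy.interpolate.CubicSpline` with `bc_type="natural"` on the same points. The sketch below uses made-up sample data with more than 5 points, since that is where the bug above showed up. It is not the original poster's code, which was never shared.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Made-up sample points (7 knots)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.sin(x)

# "natural" means the second derivative is zero at both ends
spline = CubicSpline(x, y, bc_type="natural")

xs = np.linspace(x[0], x[-1], 200)
ys = spline(xs)                                # values to compare against your own

end_curvature = spline([x[0], x[-1]], 2)       # second derivative at the ends
max_knot_error = np.abs(spline(x) - y).max()   # spline must pass through the knots
print(end_curvature, max_knot_error)
```

`spline.c` holds the per-interval polynomial coefficients, which can be compared directly with hand-computed ones, keeping in mind that SciPy stores them highest power first.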
Have any of you guys used scipy's KD tree for nearest neighbor searches? I have two large datasets (call them A and B) and I need to find for each point in A the corresponding closest point in B. A and B are spatial data (so x and y coordinates) and are not the same size. A is rectangular, B is not. I found this on stackoverflow (https://stackoverflow.com/questions/10818546/finding-index-of-nearest-point-in-numpy-arrays-of-x-and-y-coordinates) and OP answers his own question using a KDtree. I unfortunately don't at all get what's going on with his code, so if any of you guys have experience with this and would help me understand, I'd be very grateful lol
@chilly crest I am just going to believe the edX course pinned by Rags is superior, because I can't assess it myself: it takes a superior intellect like Rags to understand that math straight away without also reading and revising math elsewhere. I therefore prefer Andrew Ng's course on Coursera
Anyone good with pandas /pandas profiling? I keep getting this zero division error: float division by zero
Im looking for ELK channels
does anyone have any experience with Surprise?
I want to know how I can implement my own similarity measures in it
as it doesn't have an implementation for Jaccard similarity
I could use MSD
Which algorithms are recommended for predict customer churn? Im using Xgboost classifier, is it good?
I tried MLPClassifier from sklearn, and Xgboost classifier. Xgboost get better accuracy than MLPClassifier.
@hearty heron https://github.com/NicolasHug/Surprise/blob/master/surprise/similarities.pyx I'd have a look at their similarities module to see how their methods are implemented! The cython stuff might look a bit off putting if you're not used to that, but otherwise it seems straight forward to implement new measures in the same way they did
Thanks, I'll take a look. Yeah haven't done cython before, but I'm sure it won't be too hard to figure out with a background in C
thanks for the pointer in the right direction @feral lodge 👍
Are there any graphing libs such as matplotlib that allow the graph to be modified (style, index, axes) when the plot is already opened?
Thinking of it in regards to a program that reads and plots different CSV's but some might want to graph the data differently.
Matplot has a live option
But I’m not positive that’s exactly what you’re looking for
I will check it out, thanks
lol @small ore, I'd like to compare the difference at some point.
@wanton pier hiho buddy, did you figure it out? This guy https://stackoverflow.com/a/32781737 in the same thread you linked gave a nice example
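In case it helps to see the idea stripped down: build the tree once from B, then query it with all of A. A minimal sketch with tiny made-up point sets (`A`, `B` here are stand-ins, stored as `(n, 2)` arrays of x, y):

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up coordinates; rows are (x, y) points
A = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
B = np.array([[0.1, 0.0], [4.0, 4.5], [1.2, 0.9], [9.0, 9.0]])

tree = cKDTree(B)            # index B once
dist, idx = tree.query(A)    # for each row of A: distance and row index in B

nearest = B[idx]             # the closest B point for every A point
print(idx, dist)
```

If the coordinates live on a 2D grid, something like `np.column_stack([x.ravel(), y.ravel()])` gets them into this shape, and `np.unravel_index` maps the flat indices back to grid positions.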
Dear plotting gurus.
df = pd.read_csv(str)
sharedax = df.ix[:, 0]
a1 = df.ix[:, 1]
a2 = df.ix[:, 2]
print(a1, a2)
ax1 = plt.subplot(211, sharex=sharedax)
plt.plot(sharedax, a1)
ax2 = plt.subplot(212, sharex=sharedax)
plt.plot(sharedax, a2)
plt.show()
TypeError: 'Series' objects are mutable, thus they cannot be hashed.
I can not find the error.
a1 and a2 print out perfectly, but I can't plot them for some reason.
When I'd plot the df before splitting them into subplots it works fine too.
Is there a workaround here or did I approach it from the wrong angle in the first place?
in which line does the error happen @serene oar
when starting to plot, at ax1 =
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html
sharex is a bool, no? Not a sequence
Took inspiration from this
They use sharex=CustomAxisCreatedBefore
Oh, removing the sharex thing makes the window pop up, but it's also not responding..
df = pd.read_csv(str)
x = df.iloc[:, 0]
a1 = df.iloc[:, 1]
a2 = df.iloc[:, 2]
ax1 = plt.subplot(211)
plt.plot(x, a1)
ax2 = plt.subplot(212, sharex=ax1)
plt.plot(x, a2)
plt.show()
How about this?
opens, but crashes. not responding
And still crashes when removing all mentions of sharex?
Yep
Other functions in the same class run well. E.g. plotting right after reading as csv.
Looks like it depends on the file that's being read.
A simple, self-made one runs, whereas a downloaded one doesn't.
Both have 3 lines of data.
Three columns yeah? Try just using a few rows of the downloaded data, like this:
x = df.iloc[3:10, 0]
a1 = df.iloc[3:10, 1]
a2 = df.iloc[3:10, 2]
Maybe there's some trash in the beginning or end of the file
That works
Then I'd try using an increasingly larger range of rows until it stops working 😄
It's weird, cause plotting it without making subplots works
Maybe it's the number of rows. Might be too big?
There are 1200 of em
I have no experience with pyplot so I can't say for sure, but that seems unlikely to me
It'd be a pretty poor excuse for a plotting library in that case imo
And it would also probably throw some custom TooMuchDataException
Can I loop through the list to find the TypeError?
Maybe, I couldn't tell you how though 🤔 the row breaking the plot is probably in the beginning or end though
Hm, the more data I load in, the longer it takes to start it. At 300 lines from all columns it stops responding for a second.
Eventually loads it still
Are there any other plotting libs that allow me to do this subplot view?
Matplotlib is the goto afaik! But this is pretty weird though, 1200 data points isn't all that much. Surely it should be able to handle tens of thousands of points. Maybe this approach works better with dataframes? https://stackoverflow.com/questions/22483588/how-can-i-plot-separate-pandas-dataframes-as-subplots
They show subplots here too, also using df.plot http://pandas.pydata.org/pandas-docs/version/0.13/visualization.html
Looks like that's the intended way to plot
Awesome, thanks. That does the trick
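For anyone landing here later, a rough sketch of the `df.plot` route with stacked subplots and a shared x axis. The frame below is made up (1200 rows to match the file size mentioned above) and the column names `t`, `a1`, `a2` are placeholders, since the real CSV was not posted.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to get a window
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV: first column is the shared x axis
t = np.arange(1200)
df = pd.DataFrame({"t": t, "a1": np.sin(t / 50), "a2": np.cos(t / 50)})

# One subplot per remaining column, all sharing the x axis
axes = df.plot(x="t", subplots=True, sharex=True)
# plt.show() (after importing matplotlib.pyplot as plt) displays the window
```

With the plain pyplot calls from earlier, `sharex` has to be an existing Axes object (like `ax1`), not the data column, which is what triggered the unhashable Series error.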
Hey Everyone, I have 1 homework question left, and I simply can not figure it out. I have to create a correlation matrix with rows from one dataframe (which somehow has to be generated from the values in 1 column) and columns from the other. right now all i can do is either get the column... anyone out there that can help? I have been at this for 5 hours and my brain is in a deathspiral
df.iloc[row_start:row_end]?
the column to make the rows has 25191 rows in it
i need to group it some how first
what is the bracket command to post code here? i cant remember
I feel like some information is missing. What do you mean by "the column to make the rows"? Do you mean turning one of the columns into the index?
Single backticks for inline, groups of 3 for a block.
''' b3 = kdd20['label'].map(itc)
b3.value_counts()
b4 = pd.DataFrame (b3.value_counts())
b4 '''
i did that wrong
b3 = kdd20['label'].map(itc)
b3.value_counts()
b4 = pd.DataFrame (b3.value_counts())
b4
yay
i am trainable
TypeError: cannot concatenate a non-NDFrame object
i can share my book...its on azure notebooks if you all are really interested...lol
did you replace df1 and df2 with your dataframes? :p
result = pd.concat([b3, Basic], axis=1).corr()
Hmm @feral lodge is normally who I turned to with df questions but I think he's AFK rn
I did find this SO page though https://stackoverflow.com/questions/41823728/how-to-perform-correlation-between-two-dataframes-with-different-column-names might have your answer
I'm still unclear on what you want to do. Why do you need to group the rows? Is there a count mismatch?
the question i am trying to answer is: "Create data frames which have intrusions (rows) and features based on the mappings given for Basic and Content features. Then calculate the correlation matrices for each. What is the highest absolute value of correlation (other than 1.0) in the Basic matrix?"
Basic and Content are basically filters against column headers for a CSV file
ex.
Basic = ["duration","protocol_type","service","flag","src_bytes","dst_bytes","land","wrong_fragment","urgent"]
Ya, makes sense.
all the intrusion data existed in one column in the spread sheet named 'label' but was categorized using
itc = {'back':'DOS','land':'DOS','neptune':'DOS','pod':'DOS','smurf':'DOS','teardrop':'DOS','satan':'PROBE','ipsweep':'PROBE','nmap':'PROBE','portsweep':'PROBE','normal':'NORMAL','guess_passwd':'R2L','ftp_write':'R2L','imap':'R2L','phf':'R2L','multihop':'R2L','warezmaster':'R2L','warezclient':'R2L','spy':'R2L','buffer_overflow':'U2R','loadmodule':'U2R','perl':'U2R','rootkit':'U2R'}
kdd20['label'].map(itc)
kdd20 is the variable for the spreadsheet
Ok, this is making a lot more sense now
i am glad it is for someone 😃
I am not a data scientist but I have some Pandas experience
yay 😃
Do you really need a groupby? Are both the labels in Basic and itc features?
i think ...both are features
It looks like it to me. I understand that rows are always observations or data points, so I assume that's the case here too
the closest I have gotten is
```
bcm = bcmd[Basic]
co = bcm.corr()
co
```
and
```
groupme = kdd20[Basic].groupby(kdd20['class'])
groupme.corr()
```
the first set of code is only correlating basic against itself
the second set of code does put the intrusion cats in, but its funky, and is still correlating Basic against itself, almost nested like
Correct, `corr()` computes pairwise correlation of columns, excluding NA/null values.
okay...how?
So you want correlation between the Basic and Content data?
Do the number of rows match properly? Can you concatenate the columns together?
they all come from the same csv so i would assume yes
That is, do the ids (really the indices) match between the two, is what I'm asking
that I am not sure, sorry.
This doesn't directly answer your question, but if I understand the problem statement correctly ("Create data frames which have intrusions (rows) and features based on the mappings given for Basic and Content features. Then calculate the correlation matrices for each. What is the highest absolute value of correlation (other than 1.0) in the Basic matrix?"), you don't need to calculate the correlation between the Basic and Content DataFrames, only within each.
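For the "highest absolute value other than 1.0" part, one way is to blank out the diagonal of the correlation matrix and take the largest absolute entry that remains. A minimal sketch on a made-up three-column frame (`toy` and its columns are placeholders, not the KDD data):

```python
import numpy as np
import pandas as pd

# Made-up numeric data standing in for the Basic feature columns
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Absolute correlations with the self-correlations (diagonal) removed
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, np.nan)

# Position and value of the strongest remaining correlation
row, col = np.unravel_index(np.nanargmax(abs_corr), abs_corr.shape)
best_pair = (corr.index[row], corr.columns[col])
best_value = abs_corr[row, col]
print(best_pair, best_value)
```

This only excludes the diagonal. If two different columns happen to correlate at exactly 1.0, they would still be reported, so that case needs a separate check.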
I need to calculate the correlation matrices for the intrusions and basic and separately intrusions and content
something that looks like this bad ms paint representation
aha
that example is intrusions correlated to Basic
I think one way to do it, though ugly, would be (as I think you said) have each intrusion type be a column label, and be 1 or True for an observation where it's the correct intrusion type, and 0/False otherwise.
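The one-hot idea could be sketched roughly like this. Everything in the example is made up: `toy` stands in for `kdd20`, only two numeric features are used, and the `class` column plays the role of the mapped label. It also assumes the features are numeric, since `corr()` only works on numeric columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-in for kdd20: two numeric features plus the mapped class
toy = pd.DataFrame({
    "duration": rng.integers(0, 100, size=50),
    "src_bytes": rng.integers(0, 1000, size=50),
    "class": rng.choice(["DOS", "PROBE", "NORMAL"], size=50),
})
features = ["duration", "src_bytes"]   # plays the role of the Basic list

# One 0/1 indicator column per intrusion type
dummies = pd.get_dummies(toy["class"]).astype(float)

# Correlate everything, then keep intrusion rows against feature columns
full = pd.concat([dummies, toy[features]], axis=1).corr()
cross = full.loc[dummies.columns, features]
print(cross)
```

Note that `Basic` in the chat is a list of column names, not a DataFrame, so it has to be used to select columns (as in `toy[features]`) rather than being passed to `pd.concat` directly. That would explain the "cannot concatenate a non-NDFrame object" error.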
but i still need the values per intrusion type
```
kdd20['label'].map(itc)
b3 = kdd20['label'].map(itc)
b3.value_counts()
b4 = pd.DataFrame(b3.value_counts())
b4
        label
NORMAL  13449
DOS      9234
PROBE    2289
R2L       209
U2R        11
```
hey, can someone pls help with a webcrawler ?
!tag g ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
any of u good with statistics get how stuff like MSE, NRMSE, PSNR and SSIM work? (or at least the first 2)
mse is mean squared error, nrmse is normalised root mean squared error, psnr is peak signal-to-noise ratio, and ssim is structural similarity (no idea what the IM is)
the functions for them are towards the bottom of that table
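To make the first three concrete, here is a small NumPy sketch. The normalization in `nrmse` (dividing by the range of the reference) is just one common convention, and `data_range=255` in `psnr` assumes 8-bit image data. Both are assumptions for this example, not something taken from the table mentioned above.

```python
import numpy as np

def mse(ref, test):
    """Mean squared error: average of the squared differences."""
    return float(np.mean((ref - test) ** 2))

def nrmse(ref, test):
    """Root of the MSE, divided by the range of the reference values."""
    return float(np.sqrt(mse(ref, test)) / (ref.max() - ref.min()))

def psnr(ref, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to ref."""
    return float(10.0 * np.log10(data_range ** 2 / mse(ref, test)))

# Made-up data: every value is off by exactly 10
ref = np.array([0.0, 50.0, 100.0, 200.0])
noisy = ref + 10.0

print(mse(ref, noisy), nrmse(ref, noisy), psnr(ref, noisy))
```

With every value off by 10, the MSE is 100, the RMSE is 10, and the NRMSE is 10 / 200 = 0.05. SSIM is more involved since it compares local means, variances and covariances over windows, so it is left out here.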