#data-science-and-ml
1 messages · Page 370 of 1
here is input
so you see its not a basic page with full 4 edges
because its a book
so you cant simply use the biggest contour
my approach is was to first use canny canny resulted in lines with BIG gaps
so i use contours with convex haul
i know how to do that and use tesseract i just have a small problem
to fill in the gaps and get the 4th edge
and got this
now what i want to do is use just the outer layer of it as a contour so i can get 4 points so i can warp it up
Heiroglyph-creator xD
but how to do that
yeah looks really cool ngl some stuff in a hacker movie
One sec, I saw a similar tut on Murtaza's Workshop, might help.
my phone may shut down in any second btw
thnx i used his document scanning but sucked tbh when using it on a book
In this video we are going to learn how to create an Augmented reality application using opencv. We will use feature detection to find our Target image and then overlay a video on top of it. We will write the code from scratch going step by step so it is easy to follow.
Code and Text Version:
https://www.murtazahassan.com/courses/opencv-proje...
Maybe something from this project can help do what you're doing, not sure tho. Cropping squares on angles n such
yeah may help me thanks
thats part 1 of 3
Image module of the Python image processing library Pillow (PIL) provides putalpha() for adding an alpha channel to an image.Image Module — Pillow (PIL Fork) 4.4.0.dev0 documentation Here, the following contents will be described.How to use Image.putalpha() Set uniform transparency over the entire ...
What if you could detect the book contour, then replace the background with the same majority color, transparify the outer edge, and re-contour the paragraphs?
But I see the page you showed has red bleeding to the page's edge.
G'day all
i would like to ask if anyone in here have proper reading material about embedding analysis using T-SNE
and how to do proper visualization with the result
for example, let said i have 100 class/user where each user have n embedded data. Should i set the perplexity to the number of user or other else
How do I fix this pytorch error?
0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
row_ids = torch.arange(past_length, input_shape[-1] + past_length,
Failed with forward() missing 1 required positional argument: 'has_cache'```
I keep getting this error, I can't figure out why:
row_ids = torch.arange(past_length, input_shape[-1] + past_length,
Failed with forward() missing 1 required positional argument: 'has_cache'```
The code is in this: https://colab.research.google.com/drive/15vFLeepkSTr1qd4xs31g9kMEiwkWP0sh
How do I get rid of that problem so that it will actually fine tune the AI?
'/usr/lib/libtensorflow_io.so' (no such file) in python3.9 m1 mac
Hi Im starting in AI. Do you know any tutorials or books for me to read?
What is the difference of categorical cross entropy and sparse categorical cross entropy? is about the way images in classes is labeled? Like 1,2,3,... or corn1,corn2...?
what type of ai ure planing to develope?
Reinforcement learning
hm
Straight from the source, " Sutton is considered one of the founders of modern computational reinforcement learning,[1] having several significant contributions to the field, including temporal difference learning and policy gradient methods. "
sorry i slept
anyways good idea im gonna try more than one approach and see the fastest one and most reliable
can someone help me how can it be improve:
https://colab.research.google.com/drive/1j6XRjNwijeEc_6r-emE4-qMq1PI_szVr?usp=sharing
@last echoim there ill try to help
what does food types have to do with hand gestures?
Sorry, I forgot to remove the link to hand gesture dataset. I has nothing to do with food types
yes
i have used something very good before one second
the prediction ends up so bad, I do not even know how to improve
ok nvm what you built i just what i had what i used was a python module that does all the job of the cnn for you just need to work on the data
but first of all how many training and testing images you have?
im borrrrrrreeeeeeed so give me stuff that i can make any sgigestion?????
Is there a code to show image count for each class data sir?
I split the data as I do not know how to use the test folder without class
i was building a automobile classifier and it worked really well
so you need to scrap images from google
resize them all to one size
the dataset source is only from kaggle provided sir.
minimum is 2000 training images for each class and 1000 testing for each class
images need to be in all rotations and angles of the foods you have
Hey guys, I am relatively new to coding and i wanna get into machine learning
where should i start
i have been doing a python course recently
How can I generate and populate the missing image dataset sir? without relying from google images
why cant i iterate over a dataframe?
well i just looked at the data from kaggle it seems fairly enough
do you mind me editing some of your code on the google colab?
yes sir its alright
ill just change the perm
kaggle.json
anyways sir you will need kaggle key for authentication
! cp "/content/drive/My Drive/Colab Notebooks/kaggle.json" ~/.kaggle/```
https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/
Almost every data science aspirant uses Kaggle.In this article, I will explain how to load the datasetsfrom Kaggle to Google Colab notebooks.
lmao attribute errors make sm more sense when you understand oop
omg i remember groupby from sql
I'm having trouble with updating conda
$ conda install -n base conda=4.11.0
UnsatisfiableError: The following specifications were found to be in conflict:
- conda=4.11.0
- nbpresent
Use "conda info <package>" to see the dependencies for each package.```
$ conda update -n base -c defaults conda
==> WARNING: A newer version of conda exists. <==
current version: 4.5.11
latest version: 4.11.0
Please update conda by running
$ conda update -n base -c defaults conda
# All requested packages already installed.```
it's telling my to update conda with the command i just ran...
@young dock are you sure that you have to use anaconda? I don't recommend it to newcomers to Python, even though a lot of data science resources treat it as the default assumption.
i don't have to, no
then I would just use "regular python"
i was using it since i do like jupyter
you can use jupyter without conda
oh
the point of conda is that you can install pre-built binaries, but the need for that is increasingly rare.
I work for a research company, and we're actually moving away from it as much as we can.
Conda (the old version) + jupyter was working fine for me for a while, it's just there was an error when I tried installing pygraphviz and stackoverflow said to update conda.
I will try jupyter without conda, thanks.
you can just install jupyter with pip, start it from the terminal, and open it in your browser
I'm not sure if that's different from the workflow to start it with conda, though
i think the only difference is you start it from the anaconda terminal
Hey guys - So I am trying to develop an automatic way of doing meta text generation. I am scraping a couple of websites, that I have gotten permission to, but the issue is in my data fetching process. Either I:
- Get a ton of text, a lot of it useless like "Previous Page" or "Related Products" because I target all text on a webpage.
- Get very little text, because I only target h1 and h2 tags.
I was wondering if there's a model that can tell whether a specific sentence is informative or not. Does my problem make sense? Otherwise I'll try to explain better
is thier a way i can find our how many vertices are there in a contour?
what do you like to use instead of conda?
oh by the way, i was doing some googling and found a linear algebra course for ML on youtube. should i link it in here?
i found two that seemed decent to me
i also found one for multivariate calculus (calc 3), but unfortunately i'm still at calc 1 and learning as we speak
I used Enthought Canopy for a while before shifting to Conda
just normal virtual environments
sir, every matrix is a linear mapping. Which is what a tensor is. And it is a vector in CS - not physics.
does it have "sharp" corners? or are you talking about some kind of spline approximation?
Good Afternoon Everyone,
I have one more question please, as i am using K-Means for Clustering Documents, i would like to evaluate my cluster results,
my question is which evaluation measure i should use to know the accuracy of my cluster results?
Thanks for your support
you might need to try and build heuristics for the html structure itself. if you have only a few sites and a large number of webpages on each site, you should be able to write some decent custom rules that extract text only from useful sections of each page.
maybe there are some deep learning approaches to identifying "interesting" sections of html, but that seems like a much harder problem and probably requires a huge labeled dataset + clever feature engineering
Yeah, I'll try to extract only information that I know (kinda) will be useful in determining a suitable metatext
you could use umap to plot the documents and color them by k-means cluster. you can also look into metrics like silhouette distance
right. it should be easy enough to filter out "Next page" footer-type content that appears on every page
i wonder if there is a way to encode the html hierarchy in a clever way such that you can compute some kind of tf-idf score on it, and thereby automatically filter out chunks of the page that are high-frequency/low-importance
humans are good at that kind of thing... machines, maybe not yet
I was hoping I'd just be able to teach a BERT model how to understand HTML, but only extract the important parts (human language)
seems like wishful thinking. BERT is for sequences, HTML is a tree (a special case of a DAG)
maybe a graph neural network might be useful for this
maybe if you smash all the text into a single "document", BERT could recognize that sentences like "Next page" are meaningless and learn to ignore them
Exactly what I meant - But I am not sure if this would work. My current approach is to extract a few things that I know have information. Then I use that to generate a meta description (my end goal)
hello everyone i wonder if there is anyone here is with knowldge of deepfake videos 
So I installed pygraphviz by first downloading and running the exe for graphviz and then by running the pip command they said to run in the pygraphviz docs (https://pygraphviz.github.io/documentation/stable/install.html#windows-install), in a venv.
Then I made a jupyter kernel for that venv and tried importing pygraphviz in a notebook with that new kernel, which didn't work.
Then I tried installing pygraphviz through jupyter directly using sys.executable as jakevdp suggested in his blog:
In [7]: !{sys.executable} -m pip install --global-option=build_ext --global-option="-IC:\Program Files\Graphviz\include" --global-option="-LC:\Program Files\Graphviz\lib" pygraphviz
then I tried importing pygraphviz, and got this error
11 # Import the low-level C/C++ module
12 if __package__ or "." in __name__:
---> 13 from . import _graphviz
14 else:
15 import _graphviz
ImportError: DLL load failed while importing _graphviz: The specified module could not be found.```
getting this error trying to import pygraphviz even though both pygraphviz and graphviz appear to be installed
:incoming_envelope: :ok_hand: applied mute to @gloomy holly until <t:1643044011:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
What exactly do you mean by "meta description"?
I don't think the graphviz exe is useful here
did you get any errors during the pip install? normally you would have to run that command in a "developer prompt", which is part of visual studio
Introducing the challenge of optimization and the concepts of derivatives
NNFS series playlist: https://www.youtube.com/playlist?list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3
Neural Networks from Scratch book: https://nnfs.io
Channel membership: https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ/join
Discord: https://discord.gg/sentdex
Reddit...
I just finished this
I'm not sure where to go next
This guy takes over 5 months for a single episode
and I'm not gonna wait
What free resources can I learn neural networks from scratch with
MIT 18.06 Linear Algebra
I do not want to pay 30 pounds for a book, I would but I can't
I have a lin alg course
Also the 100 Page Machine Learning Book by Burkov believe is "pay what you want"
Learning programming from a video is like learning music from a book
Man I just want to understand Neural networks and make my own for something simple
I recommend getting familiar with reading software documentation, and then spend time just doing projects
like weather prediction
Pytorch has decent tutorials in their docs
Meta description <meta name="description" content="Enjoy all your favorite things on youtube bla bla bla">. The stuff that appears on search engines
It's defined in that meta tag
on back propagation and all the neural network stuff
I just need a free resource for neural networks 😩
NNs are one of many tools for doing machine learning and data science. Check out the Burkov ML book i mentioned
I see, so the meta tag content is the output from the model? With the rest of the page content being the input?
I think there is a free option to either download or read online
Yep, exactly
I'm 13 no job cant get one
Also the pytorch docs are free
As is Towards Data Science as long as you open a private browsing tab
fr? Il just check that out ty
y private tab
Hmm maybe, though that's what the instructions for pygraphviz installation (windows) suggested. It had the exe as step 1
Think it’s doable? It’s for my master thesis
No errors, neither when installing it in CMD nor installing in Jupyter with sys.executable
Interesting, maybe you can do whatever people do for image caption generation. Except using some nlp model instead of a cnn on images https://medium.com/swlh/automatic-image-captioning-using-deep-learning-5e899c127387
I think perhaps a discussion with your advisor would be more productive than asking random people on the Internet 😉
did you use a developer command prompt? i would be surprised if a source build worked without it
No, just the normal one
Alright - My apologies for asking you
i'm not sure, but since this is more of a question about "installing python packages on windows" and less about "data science", you might want to ask in #tools-and-devops
sorry, that was meant to be light-hearted
but in all seriousness, you should definitely get a sense from your advisor of what they think your capabilities are
If your advisor is an active machine learning researcher, it might be doable. If not, you might struggle for lack of good resources and help
Because you are kind of inventing something new, even if it means you are composing pre-existing tools together
I think it's a great project idea and it does sound like it should be doable with off-the-shelf machine learning stuff, but if you don't have good support for the project then I would consider scaling down your ambitions
Not because you aren't capable, but because doing things for the first time is always difficult, and having somebody experienced looking "over your shoulder" can help a lot
Consider also that it will take a lot of time to put together the HTML processing heuristics
Yeah - We're attempting to do this. Our supervisor isn't mega technical. We've partnered with a company and are limiting our scope to generating meta descriptions on their site. It doesn't have to necessarily produce any amazing results, we'll just document our approach difficulties etc.
And if you haven't scraped all the webpages that need to scraped yet, that might take several hours, it might need debugging, etc
That's helpful context. I would definitely start by reading up on image caption generation and adapting whatever technique is popular for that task
I don't know what the state of the art is in "supervised" caption generation
That's where a good actively practicing technical supervisor will really help, they can point you in the right direction and steer you away from dead ends
@desert oar can u send me a link to ML by burkov
i was wrong, sorry. it's 20 usd minimum price. https://leanpub.com/theMLbook
maybe it was free when it first came out
I've been able to replicate what we want to do (in a small scale) using GPT-3. But we would like to do it using a BART model
🍞
@lapis sequoia check out OpenIntro textbooks on Statistics, Probability, etc. they don't cover neural networks, but you will benefit from studying stats and related topics
OpenIntro's mission is to make educational products that are free, transparent, and lower barriers to education. We're a registered 501(c)(3) nonprofit.
i believe so yes
do you know the basic algos
if you don't it's not benefical to really jump into neural nets
i wouldn't recommend it
Dude, you need to learn the basics
Yes, every matrix is a linear mapping. A tensor (of full rank, if you want to be specific) is a multilinear mapping. Even in CS, a vector is a tensor of rank one[1] --- a trivial tensor, which should not be representative of the entire set of tensors. My point was/is: if you're saying that it's just a list of scalars, you're throwing out most of the important structure.
I just followed the series from the start and i understand it
what are the basics
Do you know linear algebra? Do you know how a neural network works?
im doing a lin alg course
just started it
but do i actually need to know it before i can make some progress
You can make "progress", but you'll probably be bound by doing simple things which are using pre-existing (or common) models, unless you know the theory around NNs.
as for the second one i do but i just dont know back propagation and what to actually do with the model
Actually, I'm not sure why I'm noting this --- this is true in ML in general, but I'm not a NN expert, so someone else should weigh in on that.
yall can judge the series as you want but seems to me that it covers the basics
im taking edX course
on lin alg
That sounds good to me. As long as you're learnin' towards a goal, sounds good.
issue is with resources for the ML and NN
i dont have those
literally have nothing and i really want to avoid paying
The thing with ML (idk as much as others here about NN) is that it divides pretty hard into two sets of classes: those which are very, very math-heavy (Andrew Ng) and those which are much more light-weight and focus on just doing the code and not worrying too much about the math.
Either one is fine, but I'm assuming the second one is better for you right now, at least?
it does have sharp corners anyways i figured it out
I think I should just do the lin alg course
and study lin alg
I agree, I think Linalg is really important to pick up. I think a good thing might be to take a "lightweight" DS class now, just to see how the things look.
and then convince my dad to buy the 30 pound neural network book which I need
first thing first you should be atleast a 11th grader so you dont struggle with going math heavy
Because he obviously will buy programming resources for me (he got me into coding nearly 2 years ago) but he doesnt think im ready for the maths
Yeah, I'd say at least calculus is required. Linalg will 100% help, and stats should be there as well.
check my about me
lmfao
I'm 13
I'm aware that lin alg is 🥶
I don't remember what grade that is --- it's been a while since I was 13.
and not easy but I got an edX course on lin alg which my dad can help me with
may good be with you you will have to learn about alot of things alone but then when you go to 9th grade or 11th you will find everything easier because you already grinded alone and learned it
The maths isnt easy but il manage it
i think its 8th grade
yeah 8th
Cool. That's a great start if you're interested in this stuff. I think the math in the NN book might be a bit much, but it's prob not bad to start it up, and having help will be useful.
defo
i think as a good start to with text NeuralNetworking since it includes less math or easier math
simply wanted something new so chose ML and nn
if you go with images or audio you will need to learn things you will only need in uni
but why not right?
I'm probably gonna do audio
yeah audio is a little easier than images
not like i have done audio before but thats what i think
Despite workin' in the field for a while, I'm quite new at NNs --- I've used autoencoders for work, but that's really all I've needed --- and I was surprised at how effective they were for some things. I'm messing around with the LSTM models for text and I'm both excited for the results and disappointed that all of it is pretty much a black box.
LIME has not been kind to me so far, I'm going to be checkin' out SHAP either today or tomorrow.
You don't wanna pay now or you don't even have the intention of ever paying for a course? 😀
Well, say no more. Check this thread I made on twitter
https://twitter.com/itsDonmonc/status/1483388714344714245?t=Hd70vHV_oKV_JgCqqp-13g&s=19
Dang, this makes me wanna start one of'em up!
you should work in an Ad Agency @odd meteor
or a Marketing Agency you can nail it
that was smooth 😂😂
JESUS that was smooth
😀 It's well curated tbh
thanks
I want to pay but eh sometimes you cant
rn if i could id get neural networks from scratch book, data science in general book
Lmaooooo 😂 I've once worked as a market research analyst before and I still kinda do business development as a side gig
ty
@odd meteor
Course lectures for MIT Introduction to Deep Learning. http://introtodeeplearning.com
is this good?
its pretty long 😵💫
43 hours
this is math heavy right
You should only watch the first 12 videos though
Cause the rest is the same content from previous years
first 12 are good for learning right
they will teach me how to use back propagation and optimizers and how to train a model
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A there is also this
Do you have a github profile? Might be too difficult for you. Have you already done any machine learning?
Nope first time getting into deep learning, ML and NN
But I'm pretty good at other coding
language developement
good general knowledge of python
i still think "youtube" is not a good goal for learning. videos are an adjunct to reading and practice problems
Well it's the best I can get so far
if i learn lin alg then my dad could buy me the book i need
but until then this will do
I'd do practical machine learning. Just follow a simple sklearn tutorial, get a grasp of fitting models and tweaking hyperparameters
Im trying lin alg lol
is there a difference
fair enough. also keep in mind that you are very young and have literally years of time before you are expected have even started thinking about these topics
as long as a machine learns
expected
if you are wanting to learn about the code, i still think you should look at the pytorch documentation
Yes? There's a difference
also the MIT Deep Learning course publishes its course materials, not just the video lectures
i second this
look up a linear regression model on kaggle and play around with it
if you haven’t already
look at logistic regression as well
learn the basics of how a linear regression model works with stat quest
learn how a linear regression model is limited and cannot handle something like e-mail spam
before everything i just stated get the basics of supervised and unsupervised learning down
you gotta know the alphabet and other stuff before you start reading and that really applies to learning ML… fundamentals are very important
not only that but also you should build a basic understanding of pandas and numpy … corey schafer has excellent vids on that
i never liked youtube videos for learning anything about programming my most favourite sources are medium and towardsdatascience articles
i like to read the documentation after i learn a thing from the video series
and then write down the key points
that’s how it makes sense to me
but medium articles are good
i just read those artcles they usually containe information on how every function works under the hood and if not wikipedia usually explains it for me
true
also sometimes if i still cant get it i take an example code an play around with it and check each time whats the difference in the output
that is good too
as long as you can understand what you've written and you are able to explain it
Hi you all!
I would like to show the most common elements across groups. Does anyone can help?
So I have a time series which I grouped by year and month. In this timeseries there is an id column and I want to veerify which are the ids that "repeat" the most across the groups
TDS can be pretty good, but nowadays there seems to be a lot of content that's just repeating the same tutorial-level material over and over
but yes i agree, video lectures are good for conceptual understanding mostly
A tensor (of full rank, if you want to be specific) is a multilinear mapping
this is where you're wrong - a tensor can represent a single linear mapping, as well as multiple as well as a simple scalar. its just an array that can be attached with graphs and useful for frameworks internally
something as simple as a numpy array can be considered as a tensore as far as i know
Yes. A tensor can represent a single linear mapping, as I noted. It is a rank-two in this case. It can also be a rank-one and represent a vector. It can also be rank-zero and represent (technically not a scalar, but a 1-d vector).
@stone marlin look at this way, torch.tensor([1]) is a scalar, no?
but it is inside a tensor
yep 🙂 and that's a simple numpy array too
So, you're literally saying that you don't like the pictorial representation of what a vector tensor signifies in the mathematics because, when you put it into a programming language, it "looks like" an array.
no...because it is an array in CS
its only in science (physics/math) where it diverges significantly, evolving their own systems of computations
It is the same interpretation if you are using it in CS vs. mathematics --- especially if you're doing work with tensors in NNs.
perhaps it might be - I am not educated in Riemann Tensors and stuff, which is a different branch of mathematics. but in CS, its kept simple
I'm not trying to be pedantic here about things "technically not being scalars because they need to be in a vector space" and other pure-math nonsense, I'm saying that to use a tensor in the sense that tensorflow does, is the same as using a tensor in terms of multilinear mappings in linear algebra.
sorry, gtg. Here, this may clarify things better https://stackoverflow.com/questions/43656736/tensor-what-is-it-and-how-is-it-different-from-a-vector
then a tensor is something that holds values, some kind of table or array
Yes, this is how it is implemented.
i always though that a scalar is a 1D array
no, 1D array can contain an arbitraty no. of elements
A scalar is a 0d array.
you are confusing some concepts
[[128], [1827]]
I strongly disagree that I am confusing concepts here, but we can agree to disagree if you'd rather.
linear, or even non-linear mappings do not have to be vectors per-se. They are written as matrices, (multidimensional vectors, set of vectors, choose your pick) but they can be represented in other forms.
Tensor arised out of simply having a framework-agnostic array which we can backprop through. So in the CS realm, they aren't inspired by physics at all. They're just called that to differentiate from arrays, and vectors/matrices was already taken by many\
tensors are mappings, but they don't have to be. They can store any arbitrary information that even though can be translated to vector space, doesn't matter for our use (unless it can help us in some way, like PCA)
The reason tensors are different is because TF attaches a computation graph to those (which can be detached in pytorch) to keep track of ops for autograd to backpropogate. Which is why you have to convert to numpy for any other use outside its built-in methods
sorry for interrupting but can anyone explain to me what a 2D kernel filter is?
in what context?
image manipulation such as opencv
For the sake of everyone else here, I will say that we are on two different pages. I "get" what you are noting here, but in the specific case of ML and NN, as was above, tensors are made to be mutlilinear maps. If they were not, auto-differentiation would not make sense. I'm not sure where you get that the CS realm, as a whole, wasn't inspired by the (same structure) tensors as in the mathematics, but this is fine.
i'm not sure this is any different from what melat0nin is saying
yes, a tensor must be a multilinear map by definition
I don't think it is, I think that I'm saying that in this case we interpret them as multilinear maps --- I think they note that they are "actually" just arrays of arrays.
it's like how every matrix is a linear mapping + a coordinate basis
just because you didn't have a linear map + basis in mind when you wrote the matrix, doesn't mean it doesn't imply the existence of one
Right. I am simply saying: there exists more structure, and to neglect that structure puts one at a disadvantage for understanding the process vectorally.
that's exactly what I said. A tensor however, doesn't have to have a specific structure as melatonin says
If, for example, you had a "tensor" which was not a valid multilinear mapping (let's say [[0, 1], [1]]), does this still qualify as a tensor?
It is an array of arrays.
because essentially, there is no linear mapping of [4] @desert oar but it still qualifies as a tensor
is it?
[4] * [x] = [y]?
Yeah, that is a linear mapping. Trivial, but linear.
yes, its a scalar then
dumb question but
no, it's a 1x1 matrix
what does trivial mean in linear algebra
4 is a scalar
It's not exactly a scalar.
just curious
.
it's not a linear algebra term, it's a general math term that usually means "the simplest possible case"
i see
Trivial just means "the base case", usually like, "0-" or "1-dimensional" versions of something. It's something mathematicians say a lot that's frustrating to everyone else. I do it a lot, ugh.
"degenerate" might be a better use for something like 0-dimensional
ok bc i heard it in like a lot of strang’s lecture
i had a feeling it meant like simple or something
meanwhile a "trivial" proof is one that is immediately obvious
ok mb
It also may seem like we're using tensors in a slightly different way: in my mind, a tensor is a full-rank tensor (though I should qualify this!) and in the other case, it could be m-rank tensors, for m < n.
So, I think when I was saying tensor it may have been confusing, which I'll take the blame for.
what I meant is that tensors are just numpy arrays with a computation graph attached to it. That's final. Period.
Is np.array([1], [2, 3], [4, 5, 6]) a tensor?
if you do tf.Tensor(a) yes
@stone marlin it's not a tensor, but it's convertible to a tf.Tensor
you have to understand in the context of DL, tensors are just framework specific numpy arrays
i don't think that's true in general. it's true in the jargon specifically of (some) machine learning programming frameworks
If you're familiar with NumPy, tensors are (kind of) like np.arrays. -- tensorflow docs.
because its much more convenient internally to convert everything to one type for autograd, rather than sorting through the mess
oh definitely, which is why I said in context of DL
but that's more general than what i said
man people are screwed in the lin alg spring sem course i’m gonna take
they are going all the way to markov processes and trying to apply lin alg to game theory
Well, I don't know how tensors are used in other places 🤷♂️
They are used in nearly exactly the same way.
what the hell does game theory have to do with lin alg
Except they must be well-defined as multilinear mappings.
good question
i don’t know the answer 😀
apparently there is a direct application
trying to proof his theories with an algorithm?
if you can represent the solution to a problem as a system of linear equations, you can solve it with linear algebra 🙂
oh this makes sense but i though you meant the youtuber game theory
well, yeah i guess it is. but linear algebra more or less exists purely because we often need to solve systems of linear equations
in markov processes, you end up using matrices to represent transition probabilities between states
Wait, can you actually do tf.Tensor(np.array([[1], [2,3], [4, 5, 6]])) and have it be valid? That would be really wild to me.
imagine you have a light that randomly changes color between red, green, and blue. you can set up a matrix with the probability of transitioning from any state to any other state
i noticed something schools in general tend to keep teaching the same thing again and again until about 8th grade they will start flooding the student with new things
Markov is awesome, it's super legit.
woah that’s cool
maybe ragged tensors support a limited subset of operations? i'm sure you can do some differentiable operations with that
this class played right into my hands i mean i wanted to use lin alg for ml anyways
well that's good, because you can't really do ML without linear algebra 🙂
Yeah, I'm not sure if these can actual be "tensors" qua tensors, as we were noting.
Which makes sense. I keep getting shape errors.
yep ^ to what salt rock lamp said
i’m up to A = LU
and i did like the inverse of a matrix product too
That's a good decomp method.
i think i’m in a much better spot than a lot of the other kids in the class
if all things go right
indeed, i don't think they correspond to any multilinear map
I think the one that I use frequently is SVD. That one is pretty cool.
other fun ones: cholesky decomposition, QR decomposition
oh yeah i saw qr decomposition when i was looking up supplemental videos
very important in machine learning. it's a generalization of eigendecomposition to non-square matrices
Oh, I meant that I don't think they can even be used as tensors in the tensorflow sense. Seems like you need to pad'em. Hm. Welp, either way.
I don't think I know Cholesky.
that i'm not sure about. good question / maybe it depends on what you need to do
ah eigenvectors and eigendecomposition
not there yet unfortunately but i will be soon
Yeah, I'm not great with tensorflow, so I just wanted to mess around a bit with it. Who knows.
Huh, Cholesky is pretty neat. What's that usually used for?
Oh, Kalman filters, okay, cool.
i used it in my bayesian stats class to help mcmc converge
it turns out that you can reparameterize the gaussian distribution to use the cholesky decomposition of the covariance matrix instead of the original covariance matrix, which has better numerical properties... or something like that
Huh, never heard of that. That's cool. I don't think I've ever checked convergence "in the math way" for MCMC models.
i did finish the stats course i have been watching for a while
check out Statistical Rethinking. course lectures from 2017 and 2019 are on youtube, and the 2022 course is going on right now. the book is great too and worth buying, reading, and keeping on your bookshelf
so i’m revising stuff
the book is Statistical Rethinking by McElreath
This post provides an example of simulating data in a Multivariate Normal distribution with given parameters, and estimating the parameters based on the simulated data via Cholesky decomposition in stan. Multivariate Normal distribution is a commonly used distribution in various regression models that generalize the Normal distribution into mult...
yea, you can convert any numpy array to a tensor
except that's a ragged tensor I suppose
Interesting. I'll check those both out.
oh yeah, another interesting thing we used in bayesian stats @stone marlin https://jakejing.github.io/posts/2021/08/blog-post-2/ this "LKJ distribution" apparently has good properties as a prior distribution for covariance matrices
Lewandowski-Kurowicka-Joe (LKJ) distribution is a very useful prior distribution for parameter estimation in correlation matrices, and is also tightly related to matrix factorizations such as Cholesky decomposition. For example, when you use Cholesky decomposition to decompose a variance-covariance matrix ($\Sigma$) into the multiplication of 3...
Dang, definitely haven't heard of this one. Though, to be fair, my bayesian stats stuff is fairly weak relative to my other stuff.
yeah i took a bayesian stats course w/ some people who were active early stan contributors
I feel like at every place I work there's always one person REALLY into bayesian stuff, and then everyone else is like "oh... neat."
so i got exposed to a bunch of cool stuff that almost kind of went over my head
@hollow sentinel @stone marlin actually the stats server i am in just started a book club for Statistical Rethinking
i think you both should be able to DM me, i will reply with a link to join if you are interested
lmfao true
in england thats true
Unfortunately, I prob won't have time now. :'[ This is something I will be interested in in the future, though. I've gott'a focus on some devops + timeseries stuff.
fair enough. the book club is pretty casual, you are welcome to listen in
Hello everyone quick question, can I be guided to the best channel to ask questions regarding pandas?
this channel
Perfection
no not at all
Alright, let me send you a message then, salt rock, no harm in listening in.
it's better if you send a code sample, not a screenshot
I think you'll have to add me as a friend or accept messages, Salt. But I'm in.
Okay so i’m trying to identify recurring vs non recurring income. I want to basically add 1s and 0s to the pipeline data frame which will be used in another step in my processing
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
@warm raven ☝️
ahh my bad
this is the best channel to talk in tbh its easy to keep up on one context and nobody is doing 1k messages per minute and its actually active
I keep getting the following error “The truth value of a series is ambiguous. Use a.empty, a.bool, a.any or a.all”
Can you copy your code, so that we can debug it? Using the things salt noted above?
yeah one second
,,, pipeline['rec'] = pipeline.apply(lambda x: '1' if (gfs[(gfs['prod_code_name'] == x['prod_code_name'])
& (gfs['PRODUCT_ID_MAP'] == x['PRODUCT_ID_MAP'])
& (gfs['Sector'] == x['Sector'])
&(((gfs['Product_Code'] == 'USF33') |
(gfs['Product_Code'] == 'USF34') |
(gfs['Product_Code'] == 'US756') |
(gfs['Product_Code'] == 'USF37') |
(gfs['Product_Code'] == 'USF40') |
(gfs['Product_Code'] == 'USF29') |
(gfs['Company_Code'] == 'US05') |
(gfs['Company_Code'] == 'US1B') |
(gfs['Company_Code'] == 'USM6')).any())
]).bool() else '0', axis=1)
,,,
pipeline['rec'] = pipeline.apply(lambda x: '1' if (gfs[(gfs['prod_code_name'] == x['prod_code_name'])
& (gfs['PRODUCT_ID_MAP'] == x['PRODUCT_ID_MAP'])
& (gfs['Sector'] == x['Sector'])
&(((gfs['Product_Code'] == 'USF33') |
(gfs['Product_Code'] == 'USF34') |
(gfs['Product_Code'] == 'US756') |
(gfs['Product_Code'] == 'USF37') |
(gfs['Product_Code'] == 'USF40') |
(gfs['Product_Code'] == 'USF29') |
(gfs['Company_Code'] == 'US05') |
(gfs['Company_Code'] == 'US1B') |
(gfs['Company_Code'] == 'USM6')).any())
]).bool() else '0', axis=1)
Close, the mark isn't a comma it's a backtick: it should be by the ~ key on your keyboard. But this is fine for now.
Yes they both are
Okay, I'm gonna clean this up a bit too. Gimme a sec.
It's hard to test without the dfs, but you can simplify a lot of this by using something like:
relevant_product_codes = ["USF33", "USF34", "US756", "USF37", "USF40", "USF29", "US05", "US1B", "USM6"]
product_code_mask = gfs["Product_Code"].isin(relevant_product_codes).any()
There's a lot to keep track of here, so the more you can debug separately the better.
Sure i understand what you’re saying
I can tell you for sure that I know there’s no problem with the product code name, ID or sector comparisons
I think the error is created by the series of booleans returned in the larger or statements used
Yeah, I think that's why I'd prefer the thing I posted above, since that'll give back a single boolean.
the issue is, some of these are coming from the Product_Code column, and some of them are in the Company_Code column
Oh, I missed that.
Either way, similar deal:
relevant_product_codes = ["USF33", "USF34", "US756", "USF37", "USF40", "USF29"]
product_code_mask = gfs["Product_Code"].isin(relevant_product_codes).any()
relevant_company_codes = ["US05", "US1B", "USM6"]
company_code_mask = gfs["Company_Code"].isin(relevant_company_codes).any()
product_and_company_mask = product_code_mask & company_code_mask
Okay one moment
im imagining the devs of opencv before they released it and they are working in computer vision for a company for example
company: "we want you to build a face recognition app for us"
the dev: "ok sure"
it's possible that it was a research project first
a lot of libraries start as academic projects
so @stone marlin
the company: "what are you goin to use?"
I guess i’m a bit confused how to apply this to my pipeline dataframe
the dev: "of course not your crappy framework im gonna use mine"
yeah and ALOT
the pipeline dataframe has products, IDs and Sector that will have recurring revenue in the future. the gfs dataframe also has those products, IDs and sector. The gfs dataframe has the company and product codes which help identify that. You can go ahead and assume that besides those factors the dataframes are quite different. So my overall goal is just to be able to identify which products do have recurring revenue at the product-sector-id level, using the codes found in the gfs data, so that when I iterate through the pipeline data I can easily identify what would be recurring.
Okay, without going through the whole problem, the way I'd write this to be able to debug it a bit easier is:
def get_rec_value(x):
pcn_mask = gfs["prod_code_name"] == x["prod_code_name"]
pidmap_mask = gfs["PRODUCT_ID_MAP"] == x["PRODUCT_ID_MAP"]
sector_mask = gfs["Sector"] == x["Sector"]
relevant_product_codes = ["USF33", "USF34", "US756", "USF37", "USF40", "USF29"]
product_code_mask = gfs["Product_Code"].isin(relevant_product_codes).any()
relevant_company_codes = ["US05", "US1B", "USM6"]
company_code_mask = gfs["Company_Code"].isin(relevant_company_codes).any()
all_masks = (
pcn_mask & pidmap_mask & sector_mask & product_code_mask & company_code_mask
)
# Wouldn't this just be `all_masks`?
if gfs[all_masks].bool():
return 1
return 0
pipeline["rec"] = pipeline.apply(
lambda x: get_rec_value(x)
axis=1,
)
This is sort'a what I meant. Transferring this to a function means you can put print statements in and see what's the goings on, and it allows you to add stuff / take out stuff easier.
I'm not sure that this'll be perfect, you might still get an error, but you'll be able to see exactly what gets returned for each thing.
For that error, it's usually something is producing either a trivial series or a weird series we didn't expect. So the job is to find that.
Okay see my understanding of the bitwise operators convinced me that it was just a series of trues and falses that the a.any() couldn’t handle
thank you
It's possible! I mess this stuff up constantly, which is why I write it all in this verbose format. Good luck!
hello everyone, I am an amateur programmer and I want to get into data a bit more, and I figure I could do that by analyzing my gameplay in a game called Brawlhalla. The game saves files that allow you to view previous matches, and I basically want to make a program that pulls some simple info from those matches (mainly the gamemode, the legend I was playing, the legend the other person was playing, and whether I won or loss, and maybe down the road things like when I took the most damage and from what). anyone have any idea on how I should get started on that? I'm not quite sure how to approach the problem
Yeah still getting the ambiguity issue
Try printing out all of the "masks" in the function, and see what they return (for a small subset of your pipeline). If they're returning a series, that's bad news.
Pollo: This is a cool idea. First, it might be good to go to a general room and figure out how to pull in your data from the files programmatically. Once you have the data, you should come back here and tell us what you think you wanna do with it, and we can help you analyze it!
Okay so i truncated to a subset of only 2024 rows (just 5) and inserted print statements for each and I got the following error “Can only compare identically labeled Series Objects”
I already tried dropping the index of my shortened pipeline dataframe
Okay actually i needed to sort_index not reset_index
Yeah, reset_index weirdly just makes the index a column, haha.
now when putting it back in the function i’m seeing that there must be a nonetype value in one of the dfs
likely gfs, i’ll check
Nice, that's a good find.
Which topical help chat does face detection and trig fall under?
topical?
Trying to use the green triangles from the center to each face to determine the rotations of the face
As in which chat topic would this fall under
The displayed Pitch and Yaw aren't correct.
i think ML AI
I don't see a chat with Machine Learning in the name. I meant from the list to the left <<
"TOPICAL CHAT/HELP"
would make a good meme ngl
This is prob the chat for it, but I know nothing about CV, so I'll let someone else go at it. :']
i do can you # it for me?
Why do these topical channels phase in and out of existence?
I swear to God discord is quantum software. It refuses to operate within human expectation, and behaves in extradimensional ways to do whatever it likes. Kinda like my ex gf.
Meme it up, bro.
is thier a way you can send me the image with only the yellow convex hull?
gimme a min
thnx
looks funny for some reason, thanks
btw when you werent here i had a question for you
fire away or dm it
aight ill dm it
k
nope my mistake, sorting the index is making the dataframe a nonetype
it took me awhile to figure that out, but it’s because that was the last thing I expected, it doesn’t really make sense to me at least
For anyone interested in my face project, this is my end goal ish
https://www.youtube.com/watch?v=20fbXJ8fayM
This is an augmented reality Photoshop plugin I made. I connected it to a Unity app with Vuforia so I could pull images out of the screen. The plugin uses Adobe Generator Core and Javascript to send each layer of the document to a Unity app with WebSockets (C#). Then we can display the images in AR by combining Vuforia's image tracking with ARki...
man the math for log regression is making my head spin
[excuse the U-word.]
unity
Please join as a member in my channel to get additional benefits like materials in Data Science, live streaming for Members and many more
https://www.youtube.com/channel/UCNU_lfiiWBdtULKOw6X0Dig/join
Please do subscribe my other channel too
https://www.youtube.com/channel/UCjWY5hREA6FFYrthD0rZNIw
Connect with me here:
Twitter: https://twitt...
even after watching this entire thing all i can discern is
lin reg wouldn't work w binary classification
sigmoid transformation is the best way to go
ooh nvm i figured it out
i basically have it down
so cost entropy is a metric used for logistic regression?
it basically combines a linear combination of variables and the bernoul dist so the y values can go from 0 to 1 which would be the logit and then the inverse of that is the sigmoid
so the equation is like a e^b0 + b1x1 + bnxn
Wait, you got it to work?
no no
just understanding the math behind it
wanted a general mathematical intuition behind how exactly it worked
Hello!
May someone help me with a problem?
I have a DataFrame of all 50 states of US and its abbreviations, and another DataFrame where I have a column called "Origin State" where some rows have the full name of the state and others have the abbreviation. So in order to better visualize this column I have to fix this. I want to replace the abbreviations with the full name of the state and I wanted to use the dataframe containing the names and abbreviations of the states using an if-else statement but I haven't been able to do it.
This is what I have
test["State"] = np.where(test["Origin State"] == states.abbreviation, states.state, test["Origin State"])
The image is an example of how the data looks
This is the dataframe of the states with its abbreviations
Is there a better way to do it?
I thought of comparing each row to the abbreviations and if it matched, to replace it for the full name of the state.
But I haven't been able to put that in code
hello, im trying to do feature selection using pearson corelation technique, but just to understnad do i want to remove features with high corelations not related to the label or target? Or do i want to keep features that have high corelatino with target or label this is for linear regression
thanks
lmao, it ended up being really easy, used this:
test["Origin State"].replace(list(states.abbreviation), list(states.state))
that's handy 👍
Hey, so I have a question.
I kind of know python and been working on it for a while, but now that I'm into ML I'd like to know how important is knowing about OOP, because I literally don't even know what a class is. Is it very important or just something to have a little knowledge about?
Obviously related to ML and DL
how can I do the equiv of df["a"] - df.loc[0, "a"] but with df.groupby("s")? groupby doesn't have .loc so unsure what to do here
well its important
you have to know well oop classes inheritance etc
Hey can someone tell me whats the difference and similarities between ML algorithms and pattern recognition?
I am having these subjects and have to decide which one to pick.
I tried doing google search but didn't really understood it.
anyone with m1 chip? and got successful with object detection?
Hello guys,
I am kind of new to ML, and I was wondering if maybe someone can help me with an issue, and maybe suggest some tools.
The situation: I have an EDF file where I am trying to identify the R wave -range-. I have the peak point value.
I was wondering:
How do you approach this situation?
I'll try to explain what i had in mind and I will be happy to get some suggestions, tools, recomendations on new tech (consider me a total beginner):
- The first developer instinct of mine was to go and compare points - write an algorithm that might work for most situations (but there are a whole bunch of edge cases for that, for instance, when the EDF file is a bit fuzzy but enough to work on. when R waves aren't that bold. etc)
- The second thought was "hey there must be an already trained model to identify peaks including ranges in an almost accurate way".
- I'm tried googling some keywords but I got a lot of useless information and result. How do you usually approach this?
- The last thing I thought of is to run all my EFD data in a model and get it to figure it out itself. yet it's not my comfort zone.
https://youtu.be/iyxqcS1u5go
Is this video good for getting idea of the maths used in machine learning ?
This video on Mathematics for Machine Learning will give you the foundation to understand the working of machine learning algorithms. You will learn linear algebra, statistics, probability, and calculus with hands-on demonstrations in Python.
🔥Free Machine Learning Course: https://www.simplilearn.com/learn-machine-learning-basics-skillup?utm_c...
you can get an overview but you're not going to understand it unless you study the content properly
video looks good though, but again, to get a proper understanding you need to study the content
Then from where can i learn them ?
It's good if it teaches lin alg and calculus, along with some statistics and probability
I'm personally learning lin alg from a course and will do the same for calculus
I'm reading a book which teaches those concepts aswell
Does anybody know any (relatively new) good tutorial how to setup a real time object detection with custom images? There are so many of them, but many are outdated and just dont work for me
i saw a latest video about that on youtube
let me grab the link for you
@lapis sequoia
that would be cool
Want to get up to speed on AI powered Object Detection but not sure where to start?
Want to start building your own deep learning Object Detection models?
Need some help detecting stuff for your course, startup or business?
This is the course you need!
In this course, you’ll learn everything you need to know to go from beginner to practition...
it helped me, hope it helps you too @lapis sequoia
thanks, so everything still works? i know this video and i'm on line 1h30 and not sure if i will get another error and won't know what to do
Hi guys, I was wondering if there's anyone who would be willing to give me an opinion or an advice on ML use cases on a little project I'm working on to try develop my skills. It's connected to Food delivery, order data, user data, restaurant data > to be used with python and machine learning. 🙂 Thanks in advance
I wanted to create an object detection algorithm in python for a company to help with their monitoring duties. I spent a bunch of times preparing everything (I am not a computer science student so a lot of self learning) and how it would look like and they suddenly pulled out of the project. Pretty devasted...
Machine Learning Related Question. I have a Generator which is suppose to output a 24x24 output is there way I can arrange the parameters to do do?
Facial Orientation detection, anyone? 😛
I do need a body/pose detector capable of multiple bodies, though.
I think this one can only do a single person
Hi guys, we making a webscrapping project where we take the names and stacks from companies, but the datas comes differente from the source ex:
node.js, nodejs.
what do you guys usually do for that?
does anyone know if there is a nlp focused discord group
Can anyone suggest good projects related to Deep learning for college . But it should not be too comman or too hard as I am still a beginner in DL.
hey yall wanted to know if anyone knew how to use regular expressions for a hashtag word in between non hashtag words, for example:
What I am trying to do is keep the word learning but get rid of the other hashtags at the end
hello Anyone well with pandas here need a help
what help do you need with pandas? please show the DataFrame in a copy-pastable way with print(df.head().to_dict('list')) (no screenshots) and explain what you want to do with it.
first off, do you know how to find a hashtag by itself, without worrying about context?
start there
then you basically want to wrap it in a match for <anything that isn't a hashtag> <whitespace> <hashtag> <whitespace> <anything that isn't a hashtag>
!e ```python
import re
text = "I love #learning. It's so fun #selflearner #selfstarter #learn"
pattern = r"(?:^|\s+)(?P<hashtag>#\S+)\b"
print(re.search(pattern, text)["hashtag"])
@desert oar :white_check_mark: Your eval job has completed with return code 0.
#learning
this is getting a bit beyond what you'd normally want to do with regex btw
you should use it for xml parsing instead
i'd consider tokenizing first, then you can loop over 3-grams and look for (not-hashtag, hashtag, not-hashtag) triples
heh
(do not listen to this terrible advice.)
!e ```python
import re
text = "I love #learning. It's so fun #selflearner #selfstarter #learn"
pattern = r"[^#\s]+\s+(?P<hashtagBetweenWords>#\S+)\s+[^#\s]+"
print(re.search(pattern, text)["hashtagBetweenWords"])
@desert oar :white_check_mark: Your eval job has completed with return code 0.
#learning.
have to make some adjustments for puncutation though
[^#\s]+\s+(?P<hashtag>#\S+?)[^#\w]*\s+[^#\s]+ this is starting to get both messy and fragile
ya it got pretty ugly for me as well
!e compare to tokenizing:
import re
text = "I love #learning. It's so fun #selflearner #selfstarter #learn"
punctuation = re.compile(r"[.']")
whitespace = re.compile(r"\s+")
tokens = whitespace.split(punctuation.sub("", text))
for t1, t2, t3 in zip(tokens, tokens[1:], tokens[2:]):
if not t1.startswith("#") and t2.startswith("#") and not t3.startswith("#"):
print(t1, t2, t3)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
love #learning Its
of course you might also want to handle sentence boundaires etc
interesting thanks a lot
obviously don't copy and paste without considering what it does and adapting it to your needs 🙂
also @velvet ginkgo i recommend https://regex101.com for experimenting with, testing, and debugging regex
don't forget to select "python" mode
I was using this its pretty helpful!
I was just playing around with regex seeing if there was a way of dealing with hashtags that appear as part of the sentence and those that just appear on their own
so bias is basically assumptions a model makes about data to make the target function easier to create
variance is how much the target function changes depending on the training data?
meaning if there is high variance... that means a model performs poorly given new data?
i'm trying to develop the terminology needed for support vector machine
it seems like a support vector machine is a combination of soft margins and cross validation
looks like R^3
if you used something like mass, height, age, and BP then it would be R^4
woah this is so cool i know these lin alg terms
so further than r^4, it's a hyperplane or a "flat affine subspace"
yeah uh i can see why people would be screwed if they tried to do this without basic lin alg knowledge
from my understanding based on little changes in your training set you don't want your model to change a lot
I think I remember this video I think this is the statsquest one?
indeed
just watching that for now, will delve into the math behind the model after
ya this is basically the avg distance between the predictions and the truth
so the total error = variance + bias
not quite. "variance" in this context is how much the model loss and/or predictions change in response to changes in the training data. "bias" in this context is the amount of wrong-ness that is embedded in the model.
for example, adding L2 regularization adds some bias by producing a deliberately suboptimal (underfitted) result, but tends to reduce the sensitivity of the model training process to random variation in how the training set is sampled or constructed.
it doesn't have to do with cross validation. and "margins" are pretty much only relevant for support vector machines
but yes, precisely. more features means more dimensions in the feature space, which means that you are looking for a separating plane between classes in a higher-dimensional space
it's important to realize that higher-dimensional space is "bigger" than 2d or 3d space, in that distances between points are disproportionately larger.
pro: it's easier to separate points when you have higher dimensions
con: it's harder to actually learn a function in higher dimensions, because there's more "empty space" between points, so you need more data points to fill in those gaps
it might be more useful to think of an SVM as the same thing as logistic regression with a special loss function (called "hinge loss"), than to spend too much time studying the traditional presentation
the two models are equivalent, but i think all the stuff about hard margin & introducing slack variables isn't much more than historical trivia nowadays
:incoming_envelope: :ok_hand: applied mute to @kind robin until <t:1643148728:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
can you eli5 what hinge loss is?
oh actually
hinge loss is not hard to understand
If you're from Africa you should join Masakhane. It's an awesome african NLP community.
Hello I'll be participating in a datawarehouse project shortly that consists of grabbing data from different sources and store it in a hadoop cluster. I've been looking into the technologies that I'll probably use and I have a doubt related to the data proccessing. I wonder if I should use pyspark which is way more faster than map reduce so that I can take advantage of cool features like resource management (yarn), hdfs from hadoop combined with a more efficient way of proccessing my data with pyspark
Someone just asked me for this, so I figured I'd share it too with minimal context. This picture (well, one very similar to it) was what made the Kernel Trick click for me.
Possibly relevant to the above discussion of adding dimensionality to allow for "nonlinear" splitting.
The higher the dimensionality the further apart things are and so it's more likely that it can be cut well with a hyperplane.
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming.Dimen...
I never really think about something being "more likely" to be able to be cut with a hyperplane, though, so that prob wouldn't click in my head. Like, even if there was one more dimension but it was extremely small, we could still cut stuff up.
One example of the blessing of dimensionality phenomenon is linear separability of a random point from a large finite random set with high probability even if this set is exponentially large: the number of elements in this random set can grow exponentially with dimension. I think the relevant part of the "Blessing of Dimensionality."
I dunno if this would make dimension-lifting and stuff click for me, but it's def important to know and think about.
thank you for this
Also "common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems." is very important. Because it means that simple solutions can work and simple solutions are often nice in that they can often be scaled up (also run fast), are more interpretable, and can often have other easily understood and desirable properties.
True, for dimensionality-lifting, this is also a good point. For the kernel trick, I think I really needed to see the "picture" of the kernel before I understood what adding that dimensionality "meant".
Glad to help
is that a server or another forum?
Heyya I'm working on an Ai voice assistant
guys i have a doubt , does keras image generator automatically uses prefetch(autotune) and cache or we have to separately make a function to do so ???
Hi, I've been working GA non stop for the last couple of days, I'm a wreck now... would anyone be able to tell me if something like this can be useful? https://github.com/jamespacileo/formula-finder 😄
:incoming_envelope: :ok_hand: applied mute to @glass dagger until <t:1643191750:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
If this is real, I am almighty impressed.
It's a community on Slack. You can DM me if you're interested or you could also do a bit Googling
Is there a rule of thumb of how big of a dataset you'd need to fine-tune a BERT(BART) model?
data.replace("NA", np.nan)
data.head(5)
i tried slipping in an "inplace = True" kwarg
but it won't let me
TypeError Traceback (most recent call last)
<ipython-input-14-88eddfccfdb7> in <module>
----> 1 data.replace("NA", np.nan, inplace= True)
TypeError: replace() got an unexpected keyword argument 'inplace'
idk how to fix this
i've read the documentation for this as well
!e
import pandas as pd
a = {'b': [4]}
df = pd.DataFrame(a)
df.replace(4, '', inplace=True)
print(df)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | b
002 | 0
works in mine, I'm not sure but are you sure its a dataframe?
Series.str.replace does not have inplace, my assumption is that your data is actually a Series
unless i have to swap it with pd.read_csv
in order for it to work
or dask just reads things as series
no no that should work
thats dask.dataframe.replace then(what is dask?)
that will theoritically work yes.
to read the csv
even with modifications of chunk_size
another kwarg from the pandas library used to break the amt of dataframes a csv is split into
I'm not sure what they mean by that, i mean why is it in doc if dask doesnt support.
bc dask is dumb
and lacks a lot of the functionality pandas has
which means i'm dumb
for using dask in the first place
you can save it?
df = df.replace(...)
anywayssss
i can't even get pandas to read this csv
hm, well you can do something like above
df = df.replace(..) is equivalent in terms of output to inplace.
definitely, yes
but i can't do that unless it is read into a dataframe
correct?
uhm well 😄
whoopsie
nope
pandas is reading it as a "Series"
no it is not
ok i feel dumb
😄
there must be something off
syntatically
oh shit i think i figured it out
who knew basic pandas knowledge makes this stuff sm easier 🤣
now you are starting to understand why i tell everyone here to "learn the basics first" and "read the docs" 😉
exactly man
i am getting better at reading docs every day ngl
bc i read it when i learn new attributes and things
helps the problem solving tons
and now you see why i tell people that reading docs is a skill that requires practice!
it sounds like it! i really like seeing this kind of progress
and it's great to acknowledge your own progress too. that way you don't burn out and feel defeated all the time
yeah i keep the vibes positive and i make sure to take breaks
walk away from the computer and exercise play video game what not
i think since summer bc of all the big changes i made in how i do things i made progress
and i'm just glad to see it pay off
ok, so i am trying to replace some NaN values in my dataset
according to this documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html i cannot use .fillna and replace it with a string
this could be an option... i could use it where in the dataframe the value is NaN
so you want to replace NAs with empty strings?
alr.
i see that there is a NaN value in that third row
hm yep
it may just be a string value
!e
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df = df.fillna('missing')
print(df)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | A B C D
002 | 0 missing 2.0 missing 0.0
003 | 1 3.0 4.0 missing 1.0
004 | 2 missing missing missing missing
005 | 3 missing 3.0 missing 4.0
hm good.
ok, so i applied it to my code with
let's assume its some "xyz"
you can do something like
df.col[df.col == 'xyz'] = 'hohoho'
oh smart
just check string equality
yeah, this may not be working simply because it is not an actual NAN value
!e
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, 2, 'yo', 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df.B[df.B == 'yo'] = 'how'
print(df)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | <string>:8: SettingWithCopyWarning:
002 | A value is trying to be set on a copy of a slice from a DataFrame
003 |
004 | See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
005 | A B C D
006 | 0 NaN 2.0 yo 0.0
007 | 1 3.0 4.0 NaN 1.0
008 | 2 NaN NaN NaN NaN
009 | 3 NaN 3.0 NaN 4.0
iloc? or loc
!e
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, 2, 'yo', 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df[df.B == 'yo'].B = 'how'
print(df)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | A B C D
002 | 0 NaN 2.0 yo 0.0
003 | 1 3.0 4.0 NaN 1.0
004 | 2 NaN NaN NaN NaN
005 | 3 NaN 3.0 NaN 4.0
loc would be for cols, no?
maybe it's something like
data = data.loc["phone_num_forward_from].fillna("Missing")```
it could be getting screwed up because i am trying to apply it to the entire dataframe instead of that one specific col with .loc
i am also heavily debating if i even need to clean the NaNs as i don't see how it will be particularly useful
but i do have an idea that can somehow incorporate the values in this specific col and in order to do that i'd definitely need a clean col
!e
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, 2, 'yo', 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df.loc[df.C == 'yo', 'C'] = 'hi'
print(df)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | A B C D
002 | 0 NaN 2.0 hi 0.0
003 | 1 3.0 4.0 NaN 1.0
004 | 2 NaN NaN NaN NaN
005 | 3 NaN 3.0 NaN 4.0
df.loc[df.specific_col_name, == "string_to_replace"] = "string_to_replace_w_"
will try rn
df.loc[condition, "col_name_where_new_value"] = "string_to_replacew"
moreover Series.replace will work too.
that was my next idea
its good to note that loc will give you independence of putting condition over any column.
right
for more weird tasks apply is always my GOTO.
yeah i am looking at the doc for .apply
"KeyError: 'cannot use a single bool to index into setitem'"
it's a good tool but it can make code long when things can be done simply.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-d81d9dfe8788> in <module>
----> 1 data.loc["phone_num_forwarded_from" == "Nan"] = "Missing"
2
3 data.head(5)
4
5 #df.fillna("Missing")
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
668 key = com.apply_if_callable(key, self.obj)
669 indexer = self._get_setitem_indexer(key)
--> 670 self._setitem_with_indexer(indexer, value)
671
672 def _validate_key(self, key, axis: int):
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
869 else:
870
--> 871 indexer, missing = convert_missing_indexer(indexer)
872
873 if missing:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in convert_missing_indexer(indexer)
2339
2340 if isinstance(indexer, bool):
-> 2341 raise KeyError("cannot use a single bool to index into setitem")
2342 return indexer, True
2343
KeyError: 'cannot use a single bool to index into setitem'
you're missing something.
also your condition is....uhm..
its like
1==2
read this again.
oh
my bad sorry
brain got overwhelmed for a second
data.loc[df["phone_num_forwarded_from"] == "Nan"] = "Missing"
data.head(5)
data.loc[df["phone_num_forwarded_from"] == "Nan", "phone_num_forwarded_from"] = "Missing"
Any idea?
i assume something inside your file is messed up, unicode error.
!paste it helps if you post text and not a screenshot. see below:
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
df[col].value_counts("Nan")
https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python
this may help, I'm bit messed up rn so haven't seen details deeply.
is there a pandas function that can count how many times a certain value appears in a col?
Series.value_counts()?
yep that was it
!d pandas.Series.value_counts
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)```
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
you can also do series.loc[series == val].sum() if you just need one value
but might as well do series.value_counts().at[val] otherwise
id 0.0
phone_num_from 0.0
phone_num_forwarded_from 0.0
is_blocked 0.0
created_at 0.0
dtype: float64
good
data.loc[data["phone_num_forwarded_from"] == "Nan"].sum()
code to produce that ^
seems like it worked
i think it might be a good idea to rename these col names too bc they are a hassle to type and i know pandas has something in the docs for it
my next plan of attack is going to be to deal with strings like these "2021-12-01 00:00:00", convert them to datetime format
and try to extract things out of them by reading the pandas doc for datetime stuff along w the schafer vid i watched
Where Can I find, UCY data set?
why do you have "Nan" as a string in an otherwise numeric series? You would usually have it as an actual NaN and use .isna() to get a boolean indexer.
that is a question that has been bothering me too
it's a NaN
man
rip
if it's a NaN, not only will it not be equal to the string "Nan", but it won't be equal to literally anything, not even another NaN.
NaNs are just hard-coded to always return False in comparisons
i see
ok well i used .isna()
and it gave me a dataframe of 52486231 rows × 5 columns of bool values
sounds right.
my next plan of action is to use data.fillna()
remember to make sure if it's in-place or not.
in-place would make your changes permanent, right?
that's what i remember from the vids and the doc
in-place is like list.append
but most pandas methods return an entirely new instance, and it's your job to write over the existing one, if you want that.
thats how they tease us doing
'pd' == 'pd' # true
Number('pd') == Number('pd') # false
i see
and inplace is kinda very less used as much I've seen.
people usually reassign.
which also gives us an advantage of having historical dfs(assuming dfs are not TOO big)
i saw a couple articles
saying to not use inplace in pandas
data.fillna({"phone_num_forwarded_from": "Missing"}, inplace = True)
much better
ok, i'm gonna come back later and do more data cleaning
ty sm for the help so far guys
you usually don't want to change things in-place unless you're sure you never want the original again. Otherwise you might introduce unexpected behavior.
I never use in-place=True
hm
i think i can save it into a new df
with just df =df.fillna({"phone_num_forwarded_from": "Missing"})
i can restore the df somehow i forgot tho
do you see why this is a problem?
you don't "restore the df". you just don't write over the same variable.
for every instance of a null value being filled an entire dataframe is created?
well, what do you think?
i think so
i see your point
so just new_df = df.fillna({"phone_num_forwarded_from": "Missing"})
don't use the same variable name
which gives us historic benifit while using notebooks.
as it's more of a trial and error alot of times
a new DF isn't created for each replaced NaN. One is created in which all the NaNs are replaced.
so i was wrong
the problem was that i used the same variable name
thank you
for pointing that out
fr that's what's so nice about it
I actually encourage people to stay away from notebooks, but maybe I'm the angry old man of data science.
you are the angry old man of data science
description fits pretty well ngl
jk jk jk just banter
i'm curious as to what you recommend instead of notebooks too
and do you dislike google colab as it is just notebooks as well?
hmmmmmm this looks familiar
looks a ton like lin reg
yes
i really don't like rapidminer
not that it has anything to do with notebooks
but seriously rapidminer sucks
notebooks (almost actively) discourage code reusability and modularity
what's modularity
look it up.
what does separating the functionality of a program into modules have to do with notebooks?
oh so basically combining things?
i think i understand now
because with notebooks, you can't even do that.
anything that you want the notebook to display for you has to be done in the global scope
I love notebooks!!! They're so handy if used right!!!
yes, if they're used right. I've used them to deliver results to customers
I mean as long as we understand how scoping and variable management works over there, we don't face issues.
they shouldn't be taught as the default environment to beginners.
what would you recommend instead? pycharm?
that is true. they can be very very confusing to the beginners. totally agreed.
any text editor you'd like to use, with whatever code runner you'd like to use. notebooks aren't an IDE or a text editor--they're an entire way of writing and visualizing code.
pycharm is an IDE. its more like notebooks or python file.
you can also do notebooks in pycharm, if you want
i see
also in vscode(but my experience was horrible with notebooks in vscode.)
vscode never agreed w my mac
i don't know why it just would never cooperate with me
Hello,
I want to learn computer vision, where to start? OpenCV or TensorFlow?
i need some help with time series classification
In this video i cover time series prediction/ forecasting project using LSTM(Long short term memory) neural network in python. LSTM are a variant of RNN(recurrent neural network) and are widely used of for time series projects in forecasting and future predictions.
I cover the complete code of the project and this tutorial is intended for begi...
i'm not sure if this could work
step 1: get your head around the problem. your dataset consists of pairs of (timeseries, label). that is, each data point is an entire time series. so whatever you do, it needs to ultimately encode each time series as a single vector.
so i'm basically forced to use tensorflow or keras
not necessarily! but you might want to consider something like a 1-dimensional CNN
that, or you might want to find other ways to encode a time series as a vector
i was thinking time series regression
it seems like i'm dealing with univariate time series data
the thing is i cannot do linear regression with this as this is a classification problem
so maybe logistic regression is the way to go
this would be the same problem. you need to encode each time series as a "data point" somehow
the encoder-decoder LSTM?
you could use the encoder to produce "time series vectors"
sounds like this
interesting
actually it looks like you can use LSTM layers directly in classification models
https://www.kaggle.com/szaitseff/classification-of-time-series-with-lstm-rnn here's one example
be careful not to confuse "forecasting" with "classification"/"regression"
what's the diff
forecasting means trying to predict the future values of a time series
classification means predicting a label for an entire time series
i see