#data-science-and-ml
a random forest is a bunch of decision trees, each of which is fit to a subset of your training data (basically)
n_estimators = number of trees
thank you
yw
so if i set the value of n_estimators too high, will overfitting like problems reappear?
excellent question
so
why do we fit multiple trees?
what are the characteristics of a single tree? <- start with this
to get better results?
it is a binary tree
that is all i got
heh, no, sorry
what I mean is more like
are you familiar with bias and variance
?
somewhat, go ahead
so
if you fit a decision tree on your data
without any constraints
if possible
it will overfit madly
because
it will identify wrong patterns
yeah
the trees in a random forest are constrained in 2 ways
- they don't see the whole dataset
- they are limited in depth
this serves to limit overfitting
that is awesome!
but does it mean that they can be vulnerable to underfitting then if the value is too low?
what do you think? 🙂
what do you mean
elaborate
if the trees it is making is going to be fixed length with fixed dataset whether n_estimator is 2 or 15, then yes
the trees are always the same depth, IIRC
unless you change the setting
but each tree sees a slightly different subset
it's called bagging (bootstrap aggregation)
so, I guess it should be vulnerable to underfitting?
yup
well, depends on the settings
in general, you have p tiny trees for random forests, so yes
this, and @ some point you don't get much out of it
diminishing returns?
yes
ahh
think about it this way
each tree sees a random subset of the data
but the fitting itself is deterministic-ish
the more trees you have
the higher the probability that two trees will see the same data
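A rough way to see what "each tree sees a random subset" means under bagging: a bootstrap sample draws as many rows as the dataset has, with replacement, so some rows repeat and others are left out. The sketch below uses only the standard library; the dataset size and the seed are arbitrary assumptions, and it does not fit any trees.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)   # fixed seed, only so the run is repeatable
n_rows = 10_000          # hypothetical dataset size

sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows

# Each sample tends to contain about 1 - 1/e (roughly 63%) of the distinct rows.
print(f"distinct rows seen by one tree: {unique_fraction:.3f}")
```

Every extra tree draws from the same pool of rows, which is one way to think about the diminishing returns mentioned above.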
Mathematics is problem solving. Just watching videos does not teach, you have to do tasks. Watching videos and reading is good, but solving problems develops the most.
thanks for your help @velvet thorn
yw 👋 hope you understand better now!
I mention that in my earlier comment.
different people learn differently, too.
I personally don't do videos @ all
see here @lapis sequoia
👍
Very important point!
On a more general note, I'm of the opinion that all learning is self-learning.
Yes. I see too much here in university that some students think they get everything on a tray, even though the intention is to develop in increasingly challenging problem-solving tasks.
I managed to find the answer. I asked about it without showing an example, because I saw this in many ML models and thought it was a general thing.
Here's the answer https://stats.stackexchange.com/questions/153823/what-is-verbose-in-scikit-learn-package-of-python
hello! pandas' resample somehow moves my table columns around. Any idea how to revert it?
original:
---------------------
| time | name | val |
---------------------
after resample it looks like this:
---------------------
| name | val |
| time |
---------------------
Any idea how to revert it back to the original, or prevent it from happening?
that's because
that column
becomes the index
you can turn it back into a column with reset_index()
but
why?
seems like the other values are a bit tricky to access as well
can you elaborate
but i guess thats not really that relevant. What im really trying to do is given the resampled series
time val
0 2021-09-23 13:27:00 1092.307692
1 2021-09-23 13:30:00 1091.789474
2 2021-09-23 13:33:00 1089.692308
3 2021-09-23 13:36:00 1089.000000
4 2021-09-23 13:39:00 1089.200000
5 2021-09-23 13:42:00 1089.400000
6 2021-09-23 13:45:00 1089.333333
7 2021-09-23 13:48:00 1089.666667
8 2021-09-23 13:51:00 1089.000000
9 2021-09-23 13:54:00 1089.000000
10 2021-09-23 13:57:00 1089.666667
and turn it into a "change per hour" using least square. The sklearn's reg.fit expects some training values (which given this case im not sure is a right approach)
I don't understand
and turn it into a "change per hour" using least square.
this part
looking at the data, it seems like val is reduced by 3 in 30 minutes. So thats 6/hr, which is what im trying to fit using least square
so you want
the difference
between successive values?
the real world data is a lot more noisy so a plain diff doesnt work that well
i think ordinary least squares is whats its called
i got a c implementation somewhere, but it's quite a lot of code so i'd rather use a library if possible
like...regression?
how does that relate to taking a diff
yeah i guess. my english math terms are a bit rusty
well i mean you could model the change using a simple diff
or you could use regression
Hey, I am currently doing a research internship in NLP. It's basically handling homonyms and contextual words in sentiment analysis. Does anyone know any nice papers related to this topic?
he's probably taking a diff OF the linear regressed values
no it's fine-- ordinary least squares is a type of linear regression
it's kinda faux pas to just say that all regression is ordinary least squares
your english terms aren't rusty
I would run linear regression with a linear model, a*x+b, of value vs time. Then the rate is the slope of the regression line, the parameter a.
(At least as a first step)
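A minimal sketch of that suggestion, using the eleven rows posted above and a hand-written least-squares slope so nothing beyond the standard library is needed. The helper name `ols_slope` is made up for this example; a library regression fitted on the same x and y should give the same slope.

```python
from datetime import datetime

# (time, val) rows copied from the resampled series posted above.
rows = [
    ("2021-09-23 13:27:00", 1092.307692),
    ("2021-09-23 13:30:00", 1091.789474),
    ("2021-09-23 13:33:00", 1089.692308),
    ("2021-09-23 13:36:00", 1089.000000),
    ("2021-09-23 13:39:00", 1089.200000),
    ("2021-09-23 13:42:00", 1089.400000),
    ("2021-09-23 13:45:00", 1089.333333),
    ("2021-09-23 13:48:00", 1089.666667),
    ("2021-09-23 13:51:00", 1089.000000),
    ("2021-09-23 13:54:00", 1089.000000),
    ("2021-09-23 13:57:00", 1089.666667),
]

def ols_slope(xs, ys):
    """Slope a of the least-squares line y = a*x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Express time as hours since the first row, so the slope is "change per hour".
start = datetime.fromisoformat(rows[0][0])
hours = [(datetime.fromisoformat(t) - start).total_seconds() / 3600 for t, _ in rows]
values = [v for _, v in rows]

change_per_hour = ols_slope(hours, values)
print(f"{change_per_hour:.2f} per hour")
```

On this sample the fitted rate comes out smaller in magnitude than the eyeballed 6/hr, since the later rows are nearly flat. The fit uses all points instead of just the endpoints, which is what makes it less sensitive to noise than a plain diff.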
Hi guys, is there someone with knowledge about recommendation engines? I'm writing my thesis on this and would like to talk to an expert
Every problem is 🥳
change my mind
brb gonna do linear regression to figure out what 5 + x is for any x
every problem is a result of several other problems created by intelligent apes known as humans. In the end, intelligence is just a biophysical process.
create AGI, create intelligence.
solve everything
pretty good startup pitch, eh? 😏
Just saw scikit-learn upgraded to 1.0
Welp, RIP compatibility
Though I think some of us will stick with 0.24 for a while
Hi all, hope all is well. Is this the best channel to discuss MLE and Data Scientist interview questions? I'm looking for a channel/resource to do specifically that
what incompatibilities do you have there?
most things are literally either stable, or things that should be deprecated are deprecated
I've been using the pipeline and GridSearchCV APIs
But i think it shouldn't be a problem
Anyone wanting to join me on an open source project to create a 2d self driving car simulation using NEAT and pygame? ( i have completed most of the code, i just need a few teammates to help with more features and bugs, i can send you the git hub link)
Hello. I would like to try to learn about Reinforcement Learning. Most of the material I find either yada yadas over creating an environment etc. Or are super technical. I am willing do do a deep dive into the technical but would like a happy medium to start with. Anyone have any good resources? Thanks in advance.
I'd love to contribute but I don't know about Reinforcement Learning yet. I'm still learning ML & Deep Learning.
This is actually not reinforcement learning, its pretty simple actually, just a genetic algorithm, comparable to selective breeding.
NeuroEvolution of Augmenting Topologies
NEAT
thats a basic rundown of it^
or actually, i think thats the longest one, there are shorter ones on the website.
I just recently started learning deep learning so I don't have that much experience yet. I'll check out the attached link as well 😊
hello
i was scraping instagram post links using selenium and my script is working fine, but i am able to scrape only 2k links out of 300,000 posts and i don't know why the browser stops loading content
can someone help me?
hi, i spent the whole day fixing many errors, but im stuck on this one, can someone help me? im working on OpenCV for a project, making the harry potter invisibility cloak, if someone knows anything about this error then please help
hi, i spent the whole day fixing many errors, but im stuck on one, can someone help me? im working on OpenCV for a project, making the harry potter invisibility cloak, if someone knows how to use it then dm me please
hey can anyone help me out
can you tell me where do i start in ml or data science
??
take a look at some pins on this channel
like is there some kind of video that can help me out
to learn AI
i found a book in the pinned stuff
University of Udemy
University of Youtube
University of Coursera
If you wanna get a Masters apply for graduate school. I started from buying a ML course on Udemy
oke lemme try udemy
what university of youtube ?
😀 YouTube
yes ik jk
I was taught R in school but I later decided to learn python because that's the programming language used in most of the courses I'm using to learn.
So yeah you can start with Python.
i started with python
ik everything about python
almost
what do you want to give info
yess plz
lol which hand
yeah even i want an example
hehe
yes it is
i dont know ai
so im asking what to do
ye
random question but is pytorch supposed to download painfully slow
the wheel for older python version doesnt work so i have to use conda install
I remember the package repos being on the slow side
tensorflow installs like
instantly
is there a way to speed things up im not sure why the pip installs dont work
Has anyone here worked with pytorch-geometric?
(https://pytorch-geometric.readthedocs.io/en/latest/index.html)
installed it today, I'd usually have 100 mbit down yet it took me what felt like 30 minutes
I'd have some specific question regarding GNNs as I'm not 100% sure if they will work in my usecase
Does someone know an easy way to install opencv with cuda?
on what os
ubuntu
This is why the first thing I do with a new OS install is install my big Python packages so that I can use that sweet pip cache 😆
... Linux breaks sometimes, don't judge
hi guys, can you help me with an assignment that I can't solve?
Yeah ... I thought I still had anaconda and pytorch installed. Turns out that was on my old machine.
It's always a fun weekend activity spending hours on getting the environment set up...
can anyone here guide me for a time series based dataset
i'm having trouble processing it
basically not knowing where to start
Its for an assignment but any guidance would be appreciable
having trouble processing it
what exactly is your problem?
What does your dataset look like?
What do you need to do with it?
Hard giving any suggestion without any infos
Thank u for replying @tough bolt
sorry i was collecting images
Basically this is the data
it has 3M x 3 rows and columns
each of patientid,date of the incident and incident
In general, no one will volunteer to help unless you explain what the question is.
its a dataset of 27k unique patients
having different incidents on diff dates
and our motive is to predict who can survive a new "Target drug" based on their historical incidents
Some of the patients present in the test file are eligible for the drug prescription within a month and some of them are not, using each patient’s historical data predict if he/she is eligible for the “Target Drug”
the problem i'm facing is i dont know on what factor should i configure if someone is eligible for the drug or not
i can pm u the assignment if i couldn't explain the problem properly
Hi,
I am having problem figuring which version of python is on the cluster.
If I type spark-submit --version
2.2.0.cloudera2 Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_131
If I type python --version I get
Python 2.7.6
Reason is that I am trying to use a version of OneHotEncoder, but it changed in 2.4
The error I get using OneHotEncoder is TypeError: __init__() got an unexpected keyword argument 'outputCols'
I cannot import OneHotEncoderEstimator
this probably doesn't help you any, but are you sure you want to use python2? It's quite obsolete
unsupported, etc
also IME it's not super-obvious which version of python that spark will run, if you're submitting a python script
the one time I did that, I made sure to put my preferred python right at the front of PATH
yeah, its not up to me. Else I would have switched
can you submit a simple script that looks like
```py
import sys
print("Hello world! I am python", sys.version)
```
That would tell you which version is running on the cluster
('Hello world! I am python', '2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
With that then I will just work with the idea that I am in 2.7 and find a work around.
👍
Hi, i have some issue with a code, someone have deep understanding in matplotlib that i can pm please ?
Is there an IDE to recommend for someone starting? Something that would help me get to documentation faster, import helps, autofill after a tab, and stuff like that.
VSC is popular
I kinda dig it, kinda
it's not at all specific to data science, that I know of, although there might be some handy plugins
Is it hard to set up to be able to code in PySpark? I created a new file, and you can only get Python.
Sadly, I am not allow to use Pandas atm. I used to know why, but I forgot :S Maybe because distributed environment.
I'd say "no" since I've done it, and I'm dumb 🙂
dunno what you mean by "you can only get Python"
I couldnt load any spark libraries, so it was basically Python only. I am trying to set the environment right now.
how were you trying to "load spark libraries"?
from pyspark import SparkContext
what happened?
when I do that, iirc, it pauses for like 30 seconds, but then returns sucessfully
oh, no, it's when I do wat = SparkContext() that takes forever. (It's spinning up a giant Java process in the background)
imports cannot be resolved by Pylance
Sadly I see that I would also need to create a whole new environment using python 2.7 to be able to work on it.
I am trying to use OneHotEncoder, which I managed to get to work in Docker with Python 3. Sadly in the cluster it is 2.7
So it doesnt accept multiple columns, so I went around it and made each column at a time.
and now it tells me that
"'OneHotEncoder' object has no attribute 'fit'"
Which I dont know how to go around. I need to learn more about how to find documentation (but from 2.7)
pylance might only work with python3, for all I know
I don't know what OneHotEncoder is, but it too might only work with python3
that's the world you're in, you'll have to get used to it
its been on python since 2.4 under that name, and it does work on the 2 previous lines. I just need to figure whats the translation from "fit" to python 2.7
What is the clustering method where the cluster centers are points from the input data? For example, kmeans will output cluster centers that are not, in general, points from the input data. Is there such an algorithm in scikit learn, for example?
so you basically want kmeans except that for whatever clusters it comes up with, the centroids have to be one of the points?
is there any reason that you can't pick the points closest to the centroid determined by kmeans?
yes that's correct
i was thinking that way too. then was reading different scikit learn algos, but don't think any of them choses centers from points. unless i missed something. that's why my question.
i was thinking that calculating kmeans and then finding the closest point would be like duplicating calculations. i am thinking about tweaking kmeans to come up with centers that are in points.
That is the K Medoids algorithm. Exactly like K Means but uses cluster centroids from the actual data.
It is not in the sklearn library but is present in the scikit-learn-extra library (imported as sklearn_extra)
wow. fantastic. didn't know about sklearn extra. thank you mayur7garg!
hey can someone help me with evaluating certain pts in a numpy array without looping
Whats a good way of cheaply ingesting massive amounts of table data into some cloud service (so i can move it around and download specific parts more easily)
would be terabytes as json so i want a proper format ideally but also something that i can easily append to and won't become corrupted if a write messes up etc
(would be time series and frequently written to)
I imagine AWS, Google, Azure, etc have that sort of thing -- Azure's is called "Databricks" iirc
data lakes is a new word to me, looks like i should do some more reading, thanks
i was hoping to avoid ingesting it into an actual database service just cause that makes it harder to download chunks to work with offline
or i assumed it would at least
I have html code and I want to get a value with a regular expression, but I'm not getting it.
<td>Vínculo</td>
<td>CARGO COMISSIONADO</td>
```py
vinculo = re.findall("""<td>Vínculo</td>
<td>([A-Z]+)</td>""", html_detalhes)
```
i presume they want everything within <td> and </td>
oh
I want to get the value "CARGO COMISSIONADO"
i just think your regex is flawed
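One likely culprit is that `[A-Z]+` cannot match the space inside "CARGO COMISSIONADO", and the literal newline in the pattern has to match the page's whitespace exactly. A small sketch that loosens both, assuming the two `<td>` cells really are adjacent as in the sample (the `html_detalhes` string here is just that sample, not the real page):

```python
import re

# The two lines of HTML posted above, used as a stand-in for the real page.
html_detalhes = """<td>Vínculo</td>
<td>CARGO COMISSIONADO</td>"""

# \s* tolerates any whitespace between the cells; [^<]+ captures everything
# up to the next tag, including spaces and accented characters.
pattern = re.compile(r"<td>Vínculo</td>\s*<td>([^<]+)</td>")

vinculo = pattern.findall(html_detalhes)
print(vinculo)
```

For anything beyond a one-off, an HTML parser is usually less brittle than a regex, but the pattern above is enough for this fixed layout.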
code correct!
vinculo = re.findall("""<td>Vínculo</td>
<td>([A-Z]+)</td>""", html_detalhes)
er, this is off-topic for this channel
regex needs a channel of its own 😜
whooooosh
Hi, I am new to Machine learning .
I need someone's guidance on a project which I picked up from "ineuron open data science project".
Hoping for a start to end mentorship.
(Some of the work like data scraping, data preprocessing , pipeline, HLD and LLD documentation).
Hello, can anyone help me understand what this means:
I am not sure what the phi symbol is at all
But the equation is something to do with Gaussian Models
cumulative distribution function?
That worked great. Thanks for sharing this lib. Now, is there a way to calculate inertia in different way? Kmedoid uses sum of distances from cluster medoid to each point in cluster. How to change it to max distance in cluster?
upper-case Φ is often used to represent the gaussian cdf (cumulative distribution function). lower-case φ is often used to represent the gaussian pdf (probability density function).
this looks like a bayesian mixture model. this line states that the likelihood of x, given the full set of parameters Θ, is a weighted sum of two different gaussian likelihoods
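To make the notation concrete, here is a small sketch with the standard library's `statistics.NormalDist`. The equation from the screenshot is not reproduced in the log, so the weight and the two sets of parameters below are made-up placeholders, not the values from the original formula.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)
pdf_at_zero = standard.pdf(0.0)   # lower-case phi: density at 0
cdf_at_zero = standard.cdf(0.0)   # upper-case Phi: P(X <= 0)

def mixture_likelihood(x, weight, first, second):
    """Two-component mixture: weight * p1(x) + (1 - weight) * p2(x)."""
    return weight * first.pdf(x) + (1.0 - weight) * second.pdf(x)

# Hypothetical parameters, only for illustration.
likelihood = mixture_likelihood(
    x=1.0,
    weight=0.3,
    first=NormalDist(mu=0.0, sigma=1.0),
    second=NormalDist(mu=3.0, sigma=2.0),
)
print(pdf_at_zero, cdf_at_zero, likelihood)
```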
having fewer brain cells makes you less clever. models should be clever enough to make generalizations. but if your model gets too clever, it starts to find patterns in the data that don't exist. this is bad. so we make our models less clever in order to make them smarter in the long run.
in other words, dropout helps prevent overfitting.
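In code, the "fewer brain cells" idea can look like the sketch below: during training each activation is zeroed with probability `p` and the survivors are scaled up so the expected total stays the same (often called inverted dropout). This is a plain-Python illustration with made-up inputs, not any framework's implementation.

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)      # at inference time nothing is dropped
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)               # fixed seed, only for repeatability
hidden = [1.0] * 8                    # hypothetical layer activations

train_out = dropout(hidden, p=0.5, rng=rng)
eval_out = dropout(hidden, p=0.5, rng=rng, training=False)
print(train_out)
print(eval_out)
```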
nice
thanks @desert oar
this is pretty good actually
I like it a lot
Haven't actually used K Medoids so I don't know. Maybe you can look into the documentation to see if there is a parameter like that.
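Independent of what the library exposes, the difference between the two criteria is easy to see on a single cluster. The sketch below is plain Python, not the scikit-learn-extra API: `medoid` is a hypothetical helper and the points are made up. It picks the cluster member that minimizes either the sum of distances or the maximum distance.

```python
from math import dist

def medoid(points, cost=sum):
    """Return the member of `points` that minimizes cost(distances to all members)."""
    return min(points, key=lambda c: cost(dist(c, p) for p in points))

# Hypothetical one-dimensional-ish cluster with one far-away point.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

sum_medoid = medoid(cluster)             # minimizes total distance
max_medoid = medoid(cluster, cost=max)   # minimizes the worst-case distance
print(sum_medoid, max_medoid)
```

With the max criterion the chosen point is pulled toward the outlier, which is worth keeping in mind before swapping it in for inertia.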
i'm struggling to create an array, is it possible to create one between 2 aranges?
its not creating the list as intended
is it because they are different sizes?
@outer girder if A and B are different shapes then that won't work, I don't think.
Arrays have to be "rectangular" for whatever number of dimensions they have.
its my first week in coding
the program im trying to write is something like this
Write a program that prints a table converting Fahrenheit degrees to degrees
Celsius. The values must be calculated from 5th to 5th and the maximum and minimum limits must
be chosen by the user.
Why are you using numpy? Is this for a data science class?
we have been using numpy since week one, but im pretty sure im not "forced" to use it
i just dont know anything else besides it xD
Numpy encourages you to think about your data differently than general python usage
If this isn't for a data science class then I wouldn't use it
the class is called " Computation for Geologists" (which is my field)" xD
so i have no clue if its considered data science or not
Then numpy would probably help you
so do you have any clue where i went wrong with the code?
and where i could change the array
sidenote, (havent seen the core question yet) i think you should learn both the builtin datatypes, and numpy, and learn when to use which.
I don't know what you mean by "from 5th to 5th"
that will help you out in the long run
its like 0ºc to 5ºc to 10ºc
i'll look into it!
oh nah nah, i think all that is overkill
I would prefer a hand saw, myself 😔
to me it sounds like a simple task trying to teach you loops
I don't think we can infer the instructors intentions based on what we know
to me, sounds like this assignment is trying to teach range
the instructions are like this: print stuff from blah1 to blah2 in increments of 5
They introduced numpy in week one 🤷♂️
i.. uh... touche.
i think its this yeah
something like that atleast
i'm gonna try to learn range and come back with the results xD
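For reference, a minimal sketch of the `range` approach discussed above, with no numpy. The function name and the sample limits are made up; in the assignment the limits would come from the user, for example via `int(input(...))`.

```python
def fahrenheit_table(low, high, step=5):
    """Return (fahrenheit, celsius) rows from low to high inclusive, in steps of `step`."""
    return [(f, (f - 32) * 5 / 9) for f in range(low, high + 1, step)]

# Hypothetical limits; replace with values read from the user.
rows = fahrenheit_table(32, 52)

for fahrenheit, celsius in rows:
    print(f"{fahrenheit:>5} F = {celsius:6.1f} C")
```

`range(low, high + 1, step)` is used because `range` stops before its end value, so the `+ 1` keeps the upper limit in the table when it falls on a step.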
I am using StringIndexer (python 2.7) and I am trying to understand if all the information is being put on the driver. I am looking at the spark.apache documentation for my version, but I cannot decide where the data is taken from.
This is not the first time that I have this doubt, is there a way/place to see this easier?
Thank you. I have read the docs but it's not clear to me. Possibly not possible without hacking the library.
Hyperparameter tuning is decreasing accuracy....any idea what could be the problem ?
Is anyone here familiar with NetworkX?
I'm not sure where or how to correctly set the relationship between nodes
e.g.
the default graph
where would I define the distance between 3 and 2
or 2 and 1?
anyone that is familiar with spacy?
I need some help, tried various help channels and servers but seems that don't have a lot of people with knowledge on this lib
I've contributed to spacy, but I have to know your specific question before I know if I can answer it.
with spacy 3.0 i'm very confused in some codes that in the past works pretty fine and now it's confusing to understand. I search for the error in stackoverflow but the cases was different from mine.
model = spacy.blank("en")
categories = model.add_pipe("textcat")
categories.add_label("Happy")
categories.add_label("Scared")
model.add_pipe(categories)
historic = []
I need to transform this code
I think this part is the problem
can you copy and paste the error message that you got as text?
sure
ValueError: [E966] nlp.add_pipe now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.textcat.TextCategorizer object at 0x7fc953d7def0> (name: 'None').
alright, let me see.
this is very common while trying to apply code that works in older versions of spacy, but the cases are different
model.add_pipe("textcat") returns the component, and my impression is that add_label mutates the component in-place. so I suspect that your second call to add_pipe is unnecessary.
but, if I don't put the second call how the Happy and Scared gonna stay in model?
add_label puts them in the model, so to speak, yes?
add_label isn't putting in categories?
oh
sorry
just try deleting model.add_pipe(categories) and see if it works.
ok
works
🔥 
I think I messed up because I put categories = model.create_pipe("textcat") in the first time
probably model.add_pipe(categories) was necessary in this case
but this don't work in spacy 3.0 now
:/
oh no
ValueError: [E989] nlp.update() was called with two positional arguments. This may be due to a backwards-incompatible change to the format of the training data in spaCy 3.0 onwards. The 'update' function should now be called with a batch of Example objects, instead of (text, annotation) tuples.
spacy 3.0 please don't
😫
@serene scaffold what means 'with a batch of Example objects'?
I'm not really sure
some spaCy contributor I am, I know
sad
from spacy.training.example import Example
model.begin_training()
for epoch in range(1000):
  random.shuffle(final_data_base)
  losses = {}
  for batch in spacy.util.minibatch(final_data_base, size=30):
    texts = [model(text) for text, entities in batch]
    annotations = [{'cats': entities} for text, entities in batch]
    example = Example.from_dict(texts, annotations)
    model.update([example], losses=losses)
  if epoch % 100 == 0:
    print(losses)
    historic.append(losses)
my code is that
any idea @serene scaffold
?
any idea about what? I would fix the indentation, in either case, so that it's four spaces.
about the error
if you get an error, always copy/paste the text of the whole error.
@serene scaffold
after that I do this code
!traceback
Please provide a full traceback to your exception in order for us to identify your issue.
A full traceback could look like:
Traceback (most recent call last):
  File "tiny", line 3, in <module>
    do_something()
  File "tiny", line 2, in do_something
    a = 6 / 0
ZeroDivisionError: integer division or modulo by zero
The best way to read your traceback is bottom to top.
• Identify the exception raised (e.g. ZeroDivisionError)
• Make note of the line number, and navigate there in your program.
• Try to understand why the error occurred.
To read more about exceptions and errors, please refer to the PyDis Wiki or the official Python tutorial.
Ok so i was trying to create dummy variables of large categorical data
and this is what i got
train["Patient-Uid"] = pd.get_dummies(train["Patient-Uid"],drop_first=True)
Error:
MemoryError: Unable to allocate 81.1 GiB for an array with shape (3220868, 27033) and data type uint8
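The number in the error is just rows times columns at one byte per `uint8` cell, which the short check below reproduces. The second half is a hedged alternative under an assumption about the goal: since the patient ID identifies a row rather than describing it, counting incidents per patient may be a more useful feature than one-hot encoding the ID. The `records` list is made-up sample data.

```python
from collections import Counter

# Shape taken from the error message: one uint8 (1 byte) per cell.
rows, columns = 3_220_868, 27_033
gib_needed = rows * columns / 2**30
print(f"{gib_needed:.1f} GiB")

# Hypothetical (patient_id, incident) records standing in for the real table.
records = [("p1", "A"), ("p1", "B"), ("p2", "A"), ("p1", "A")]

# One small count per (patient, incident) pair instead of a 27k-column dummy matrix.
incident_counts = Counter(records)
print(incident_counts[("p1", "A")])
```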
🙂
Hey, I've been asked to calculate a weighted average for binned data (basically a histogram) using the method mentioned here:
https://stats.stackexchange.com/questions/531794/how-to-calculate-the-mean-from-bin-endpoints-and-frequencies
However, the final bin in my data doesn't have an endpoint for me to calculate a midpoint from.
how should i tackle this
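One common workaround, sketched below, is to close the open-ended bin with an assumed width (here the width of the previous bin, unless another is given) and then take the usual frequency-weighted mean of the midpoints. The bins and counts are invented for illustration, and the choice of width is an assumption that should be stated alongside the result.

```python
def binned_mean(bins, open_width=None):
    """Frequency-weighted mean of bin midpoints.

    `bins` is a list of (lower, upper, count); upper=None marks an open-ended
    last bin, closed with `open_width` or else the previous bin's width.
    """
    total = 0
    weighted = 0.0
    previous_width = None
    for lower, upper, count in bins:
        if upper is None:
            width = open_width if open_width is not None else previous_width
            if width is None:
                raise ValueError("need open_width when the first bin is open-ended")
            upper = lower + width
        previous_width = upper - lower
        weighted += count * (lower + upper) / 2
        total += count
    return weighted / total

# Hypothetical histogram: the last bin is "30 and above".
bins = [(0, 10, 5), (10, 20, 12), (20, 30, 8), (30, None, 3)]

mean_default = binned_mean(bins)               # last bin treated as 30-40
mean_wide = binned_mean(bins, open_width=30)   # last bin treated as 30-60
print(mean_default, mean_wide)
```

Comparing the two results shows how sensitive the mean is to the assumed width of the final bin.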
Whats the best way to learn datascience (maybe from a beginner standpoint of python)
Sounds like you need a more compressed representation of your data (e.g. embedding) or shard / split up your dataset
from random import randint

class Character:
    def __init__(self):
        self.name = ""
        self.health = 1
        self.health_max = 1

    def do_damage(self, enemy):
        # Attacker's roll minus defender's roll, clamped to [0, enemy.health].
        damage = min(
            max(randint(0, self.health) - randint(0, enemy.health), 0),
            enemy.health)
        enemy.health = enemy.health - damage
        if damage == 0:
            print("%s evades %s's attack." % (enemy.name, self.name))
        else:
            print("%s hurts %s!" % (self.name, enemy.name))
        return enemy.health <= 0

class Enemy(Character):
    def __init__(self, player):
        Character.__init__(self)
        self.name = 'a goblin'
        self.health = randint(1, player.health)

class Player(Character):
    def __init__(self):
        Character.__init__(self)
        self.state = 'normal'
        self.health = 10
        self.health_max = 10
def dfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Is that a very long datafffframe?
it's a long sigh of exasperation.
I'm not sure I understand the relevance of this
Does anyone know how I can create this?
I made an example of it using this
ggplot(data = mpg, aes(y = hwy, c = drv)) +
geom_boxplot(fill = 'darkgreen')
but they want me to create a ggplot box plot that show the distribution of Calories in the following Starbucks beverages:
Classic Espresso Drinks
Frappuccino® Blended Coffee
Shaken Iced Beverages
anyone knows how to do that it would help me a lot.
I was fired from Starbucks today.
So triggered rn.
Is this R?
😦
I haven't shown up to work since January so I'm surprised it took this long.
Yep
Are you trying to do it in python
I have to do both ways
R and python
why?
Yea idk how to do that for just the three drinks
Which three?
Classic Espresso Drinks
Frappuccino® Blended Coffee
Shaken Iced Beverages
I have a table but idk how to do it
Those are categories of drinks
Trust me, I worked there from January 2016 until today, coincidentally.
Anyway, you can use loc
And this
I can't be much more helpful at the moment as I'm on my phone. I might remember to check on you later @lapis sequoia
Like I have a table but is long
That's fine
And Idk how to make it as a code
What is the data in? A csv?
This is what I have
Yay
This is R. I can't help with that.
Damn all good thx tho
I did python, I’m having trouble with R
Do you know any discord server that can help me with it?
Let me see
@lapis sequoia if you join this server be sure to check their rules about asking questions https://discord.gg/PD8YMNKB
Thanks
What happened?
Then why are you triggered?
I told them I had to move "because of covid" and that I would come back "soon" and then ghosted them.
I was getting free stuff just for being an employee-on-paper.
Lol I hope it doesn't affect any future job search you may have
my current job doesn't care and no subsequent jobs are going to care about jobs I had before this one.
I have a code that extracts data from html pages and makes some filters until generating this list of dictionaries. I want the "header" information to be the header of a CSV file, but I don't know how to do this correctly. Does anyone have a tip?
I want the file with these columns:
"Matrícula","Referência","Vínculo","Servidor","Cargo","CPF","Lotação","Remuneração","Abono","Eventuais","Desconto","Salário Líquido"
def filtrador():
    informacoes = []
    for item in raspador():
        header = item[1].strip("</td>")
        info = item[2].split("</td>")
        informacoes.append({header: info[0]})
    return informacoes

filtrador()
[output]
[{'Matrícula': '00101105'},
{'Referência': '09 / 2021'},
{'Vínculo': 'CARGO COMISSIONADO'},
{'Servidor': 'DOUGLAS HENRIQUE SANTOS'},
{'Cargo': 'SECRETARIO PARLAMENTAR'},
{'CPF': '***42475'},
{'Lotação': 'COMISSIONADO - GABINETE'},
{'Remuneração': 'R$ 4.800,00'},
{'Abono': 'R$ 0,00'},
{'Eventuais': 'R$ 0,00'},
{'Desconto': 'R$ 849,42'},
{'Salário Líquido': 'R$ 3.950,58'},
{'Matrícula': '00092175'},
{'Referência': '09 / 2021'},
{'Vínculo': 'CARGO COMISSIONADO'},
{'Servidor': 'DULCEANA PALMEIRA DE SA'},
{'Cargo': 'CHEFE DE GABINETE 2oSECRETARIO'},
{'CPF': '***31400'},
{'Lotação': 'MESA'},
{'Remuneração': 'R$ 9.100,00'},
{'Abono': 'R$ 0,00'},
{'Eventuais': 'R$ 0,00'},
{'Desconto': 'R$ 2.178,33'},
{'Salário Líquido': 'R$ 6.921,67'},
{'Matrícula': '00092182'},
{'Referência': '09 / 2021'},
{'Vínculo': 'CARGO COMISSIONADO'},
{'Servidor': 'EDIJANE ALVES SANTOS SILVA'},
{'Cargo': 'CARGOS DE NATUREZA ESPECIAL'},
{'CPF': '***14404'},
{'Lotação': 'MESA'},
{'Remuneração': 'R$ 2.250,00'},
{'Abono': 'R$ 0,00'},
{'Eventuais': 'R$ 0,00'},
{'Desconto': 'R$ 199,30'},
{'Salário Líquido': 'R$ 2.050,70'}]
why is your output a list of dicts with one key-value pair each? this seems like a bad data model.
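One way to act on that: merge the one-key dicts into one dict per person, then let `csv.DictWriter` produce the header row. The sketch below assumes every record starts with "Matrícula", as in the output above, and uses a trimmed three-column sample to stay short. It writes to an in-memory buffer; a real file opened with `newline=""` would work the same way.

```python
import csv
import io

# Trimmed for brevity; the real file would list all 12 column names.
COLUMNS = ["Matrícula", "Vínculo", "Cargo"]

# Two records from the output above, reduced to the three columns.
pairs = [
    {"Matrícula": "00101105"},
    {"Vínculo": "CARGO COMISSIONADO"},
    {"Cargo": "SECRETARIO PARLAMENTAR"},
    {"Matrícula": "00092175"},
    {"Vínculo": "CARGO COMISSIONADO"},
    {"Cargo": "CHEFE DE GABINETE 2oSECRETARIO"},
]

def group_records(pairs, first_column):
    """Merge consecutive one-key dicts into one dict per record."""
    records, current = [], {}
    for pair in pairs:
        # A new first-column key means the previous record is complete.
        if first_column in pair and current:
            records.append(current)
            current = {}
        current.update(pair)
    if current:
        records.append(current)
    return records

records = group_records(pairs, COLUMNS[0])

buffer = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Building one dict per record inside `filtrador()` in the first place would remove the need for the regrouping step.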
That really could be it. Thanks!
Any ideas on how to optimize this piece of code?
dataframe_tokenized_speech_Only = all_data_tokenized_FreqDist_df[["Speech"]]
dataframe_tokenized_speech_Only
for country_year_vector, Speech_dictionary in tqdm(dataframe_tokenized_speech_Only.iterrows()):
    country = country_year_vector[0]
    year = country_year_vector[1]
    for key, value in Speech_dictionary["Speech"].items():
        if key in dataframe_tokenized_speech_Only.columns:
            dataframe_tokenized_speech_Only.loc[country, year][key] = value
        else:
            dataframe_tokenized_speech_Only[key] = 0
            dataframe_tokenized_speech_Only.loc[country, year][key] = value
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Thanks for the heads up
Thanks
I am looking into numba, is that a thing?
Why
I would straighten that out before trying to work around it.
Fire them.
numba works best on numpy operations, it doesn't work well with dataframes
Numba seems to like for loops
yes, on numpy operations
Regardless, if one column of a dataframe contains dicts, make a separate data frame with the same index and expand the dicts into columns.
Yeah that would make sense, but the dictionary is different in every row
Tried that approach too
Different sets of keys?
Yeppity Yippity
Whatever data they contain, think of how you would structure it in a database
I will try to talk them into not giving me that monstrosity
Also your variable names are quite long
But I'll let you do the cost benefit analysis on that.
You might be able to do some kind of join on the two instead of your double for loops. The intersecting columns will be the same for each row
Agree on the variable names, it's generally recommended to not include the type in the variable name anymore (aka Systems hungarian notation) as modern tools make it easy to ascertain the type. EDIT: correction on specific type of hungarian
and it just takes up space, making it harder to understand what's going on. It's certainly better than too short names, though.
A good read is the chapter "Meaningful Names" in the book Clean Code
okay I gotta say this
that's not what Hungarian notation was originally meant to be
"type" was meant in the business case sense, not the formal type sense
I know that is not what it was originally meant to be
which dovetails with this, essentially
but the end result in the code is similar
how is it similar?
The type of the structure is in the variable name
what I am saying is
if you do it as it was meant to be, it isn't
(unless you have, like, refinement types, but most languages don't)
Not really clear on what you're getting at in relation to their code, in general it adds unnecessary verbiage
Do you disagree that they should remove "dataframe_" as a prefix?
no, I don't
That's my only point.
great, and mine is that doing so is one variant of what is called Hungarian notation.
Ok, no disagreements there.
so.. idk if this is the right place to ask, but i recently made an object detection project using opencv. it works fine, except for the fact that it detects things like chairs as toilets, and spectacles as scissors. is there a way to make the model better?
Just used np.savetxt() to store a huge python array into a text file on disk. However, this spiked up my RAM a lot and I didn't store this np.savetxt() into a specific variable to be able to delete the array
Where is this RAM held and how do I release it? (I can't reset the Jupyter notebook because it takes a looong time to re-simulate my stuff)
we can directly tell gc to flush things out. it helped me once(i had similar issue with np)
so i just deleted the array (del arr) and then called gc function
gc.collect()
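Spelled out, that approach looks roughly like this. It is a sketch with a small stand-in array named `arr`; whether the process actually hands memory back to the OS depends on the allocator. `np.savetxt()` returns nothing, so there is no result to delete, only the array that was passed in.

```python
import gc

import numpy as np

arr = np.zeros((1000, 1000))  # stand-in for the large simulation result
size_mb = arr.nbytes / 1e6    # 8 MB of float64 values

del arr        # drop the last reference so the buffer can be freed
gc.collect()   # also clear anything kept alive only by reference cycles
```

In a notebook, IPython may also keep references to displayed results in its output cache, which `%reset -f out` clears.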
Hello
I am using the resample function of pandas.
I have tick-by-tick data
I want that data minute-wise
Can anyone look into this?
see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html for the description on what level means and what it does, and see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#aggregation for example use
My code
for i, chunk in enumerate(pd.read_csv(f'{path}{file_name}{extension}', engine='python', chunksize=250000, iterator=True, names=['Msgtype', 'Activity Type', 'Transaction Time', 'script_name', 'expiry', 'strike_price', 'call/put', 'Exchange', 'Token', 'Buy/Sell', 'Buy Order number', 'Sell order number', 'Price', 'qty', 'price_in_rupees', 'Lot'])):
    chunk['Transaction Time'] = pd.to_datetime(chunk['Transaction Time'], errors='coerce')
    chunk = pd.DataFrame(chunk).set_index('Transaction Time')
    print('chunk1...')
    print(chunk)
    print()
    chunk2 = chunk.resample('T')['price_in_rupees'].agg(['first', 'max', 'min', 'last']).set_axis(['Open', 'High', 'Low', 'Close'], axis=1)
    chunk2 = chunk2[chunk2.Close > 0]
    print('chunk2...')
    print(chunk2)
    print()
    chunk2.to_csv(f'{new_path}{output_file_name}{extension}', mode='a', header=None)
U mean level=m
If that works, then yes
though it should be a str or an int as per the documentation
just be wary
it also must be datetime-like.
See my data this way
Please don't ping me either directly or indirectly, I am right here.
I am getting
Traceback (most recent call last):
File "E:\python files\resample_practice.py", line 26, in <module>
chunk2 = chunk.resample('T', level = 'm')['price_in_rupees'].agg(['first', 'max', 'min', 'last']).set_axis(['Open', 'High', 'Low', 'Close'],axis=1)
File "C:\Users\Admin\anaconda3\lib\site-packages\pandas\core\generic.py", line 8369, in resample
return get_resampler(
File "C:\Users\Admin\anaconda3\lib\site-packages\pandas\core\resample.py", line 1311, in get_resampler
return tg._get_resampler(obj, kind=kind)
File "C:\Users\Admin\anaconda3\lib\site-packages\pandas\core\resample.py", line 1466, in _get_resampler
self._set_grouper(obj)
File "C:\Users\Admin\anaconda3\lib\site-packages\pandas\core\groupby\grouper.py", line 381, in _set_grouper
raise ValueError(f"The level {level} is not valid")
ValueError: The level m is not valid
Above error
Ping me when replying
Have you checked out the links I have attached?
One of them is a comprehensive user guide.
Which link?
Here
I have gone through the first link but I do not get my expected output
Please respect my request.
What have you tried?
Could you share the part of the code where you've made changes?
What is the expected output and what is the output you are getting?
I tried chunk.resample('1T')
But it did not work. As u can see in the screenshot above, I am getting the same output
I am expecting
Open, high, low , close columns for each time
09:15
09:16
09:17
...
15:30
this way
Do u get my point?
I think it's tricky for me to explain without the data in hand. Would you mind sharing the csv of the data?
Can I provide u ss of data i have
Because CSV file is too big
So u can make dummy CSV file in your system by seeing data in ss?
What do you guys think about PyTorch Lightning
maybe chop it to the first 1000 rows?
Can I dm u ?
DMs are reserved for discord friends, sorry.
Can I provide u ss of data
Can u please make dummy CSV from it
Screenshot of data
????
pls help
im using the coco algorithm
Hi Everyone, I have a quick question. I keep getting this error, how can i fix it
Sounds like some issue with the training data
hmm i'll try replacing that and check
It’s a funny issue tbh
yeah lol
look at the data types of df after you do the rename.
not a bad suggestion. i will look into it
hello, my data looks this way:
                              Activity Type script_name  ...  price_in_rupees   Lot
Transaction Time                                         ...
2011-08-28 09:15:02.006138097             N   BANKNIFTY  ...          4734.30  47.0
2011-08-28 09:15:02.707897899             N   BANKNIFTY  ...          2555.95  47.0
2011-08-28 09:15:03.373856246             N   BANKNIFTY  ...          2556.00  20.0
2011-08-28 09:15:04.159525439             N   BANKNIFTY  ...          6071.85  47.0
2011-08-28 09:15:05.213452151             M   BANKNIFTY  ...          2556.05  47.0
...                                     ...         ...  ...              ...   ...
2011-08-28 09:15:20.175758062             N   BANKNIFTY  ...           125.00   1.0
2011-08-28 09:15:20.175804372             M   BANKNIFTY  ...           149.10   1.0
2011-08-28 09:15:20.176193109             M   BANKNIFTY  ...           148.60   1.0
2011-08-28 09:15:20.176239215             M   BANKNIFTY  ...           150.70   8.0
2011-08-28 09:15:20.176248648             M   BANKNIFTY  ...           150.90   8.0
i want the above tick-by-tick data converted into open, high, low, close columns
thanks for sharing the data. using print(df.to_string()) would make sure that no columns are left out. there's no way for us to know how many columns the ... represents.
my code https://paste.pythondiscord.com/nuvukilore.py here
I'm still not clear on what you are trying to do.
Index(['Activity Type', 'script_name', 'expiry', 'strike_price', 'call/put', 'Exchange', 'Token', 'Buy/Sell', 'Buy Order number', 'Sell order number', 'Price', 'qty', 'price_in_rupees', 'Lot'], dtype='object')
this does not help, unfortunately. why don't you do print(df.head().to_csv())?
i have stock market tick by tick data. I want to convert that data in open, high, low, close columns
alright. let me know when you've provided enough data for me to solve this.
what this will do ?
print out the data without any missing columns.
while we're at it, try print(df.head(30).to_csv()) and put it in the paste bin
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Transaction Time,Activity Type,script_name,expiry,strike_price,call/put,Exchange,Token,Buy/Sell,Buy Order number,Sell order number,Price,qty,price_in_rupees,Lot
2011-08-28 09:15:02.006138097,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,SELL,0,1400000000013113,473430,1175,4734.3,47.0
2011-08-28 09:15:02.707897899,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,BUY,1400000000019595,0,255595,1175,2555.95,47.0
2011-08-28 09:15:03.373856246,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,BUY,1400000000027793,0,255600,500,2556.0,20.0
2011-08-28 09:15:04.159525439,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,SELL,0,1400000000034501,607185,1175,6071.85,47.0
2011-08-28 09:15:05.213452151,M,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,BUY,1400000000019595,0,255605,1175,2556.05,47.0
I need enough rows to cover two days.
please put that in the paste bin when you have it.
okay can i share u csv ?
if you can put the whole csv in the paste bin that's fine
let me try
but I just need enough rows to cover two days
please ping me with the URL to the paste bin when you have done this.
okay
can u try with this https://paste.pythondiscord.com/banowanequ.apache
I asked for rows covering at least two days. These are all from one day. I won't be able to help.
@dull turtle could you give us your full csv file please?
see my file is 6.5 gb
i am not able to open that csv file
so i am giving u data read by python
https://paste.pythondiscord.com/ijemerimal.apache plz check here
The index of each row is a timestamp and I asked for enough rows that cover two calendar days worth of timestamps. You can even just include something like five rows for two calendar days (for a total of ten rows).
but the csv file is too big, so i am not able to open it directly
can u please add some dummy data to it so the number of rows gets increased
can u help me with how i can get the data per minute
e.g. i have data in seconds and microseconds, so i want to combine all values within a minute into a single one-minute row
2011-08-28 09:15:02.006138097,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,SELL,0,1400000000013113,473430,1175,4734.3,47.0
2011-08-28 09:15:02.707897899,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,BUY,1400000000019595,0,255595,1175,2555.95,47.0
2011-08-28 09:15:04.159525439,N,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,SELL,0,1400000000034501,607185,1175,6071.85,47.0
2011-08-28 09:15:05.213452151,M,BANKNIFTY,2021-09-02,31500.0,CE,NSEFO,39203,BUY,1400000000019595,0,255605,1175,2556.05,47.0
2011-08-28 09:15:04.404004891,M,BANKNIFTY,2021-09-02,35700.0,CE,NSEFO,39570,BUY,1400000000035647,0,26885,25,268.85,1.0
2011-08-28 09:15:04.405367275,N,BANKNIFTY,2021-09-02,35700.0,CE,NSEFO,39570,BUY,1400000000036502,0,26875,25,268.75,1.0
2011-08-28 09:15:04.405433392,M,BANKNIFTY,2021-09-02,35700.0,CE,NSEFO,39570,BUY,1400000000035647,0,26915,25,269.15,1.0
2011-08-28 09:15:04.405443075,M,BANKNIFTY,2021-09-02,35700.0,CE,NSEFO,39570,BUY,1400000000032054,0,26660,75,266.6,3.0
2011-08-28 09:15:04.405504048,M,BANKNIFTY,2021-09-02,35700.0,CE,NSEFO,39570,BUY,1400000000036395,0,26920,25,269.2,1.0
2011-08-28 09:15:12.178591633,M,BANKNIFTY,2021-09-02,35600.0,CE,NSEFO,39568,SELL,0,1400000000090331,30555,25,305.55,1.0
2011-08-28 09:15:12.178672232,M,BANKNIFTY,2021-09-02,35600.0,CE,NSEFO,39568,SELL,0,1400000000090552,30550,50,305.5,2.0
2011-08-28 09:15:12.178735441,M,BANKNIFTY,2021-09-02,35600.0,CE,NSEFO,39568,BUY,1400000000002103,0,21515,25,215.15,1.0
2011-08-28 09:15:12.17874251
my data is this way, so i want only a single value per minute, this way:
```python
date open high low close
28-08-2011 09:15:00 val1 val2 val3 val4
28-08-2011 09:16:00 val1 val2 val3 val4
```
and so on
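For reference, a minimal sketch of tick-to-minute OHLC bars, assuming a plain DatetimeIndex and a `price_in_rupees` column as in the pasted rows. The tick values below are made up, and `ticks` and `bars` are hypothetical names.

```python
import pandas as pd

# A few made-up ticks in the same shape as the data above
# (timestamp index plus a price_in_rupees column).
ticks = pd.DataFrame(
    {"price_in_rupees": [4734.30, 2555.95, 2556.00, 268.85, 305.55]},
    index=pd.to_datetime(
        [
            "2011-08-28 09:15:02.006138097",
            "2011-08-28 09:15:02.707897899",
            "2011-08-28 09:15:03.373856246",
            "2011-08-28 09:16:04.404004891",
            "2011-08-28 09:16:12.178591633",
        ]
    ),
)

# One row per minute: first/max/min/last of the price in that minute.
# No `level=` is needed because the index is a plain DatetimeIndex.
bars = (
    ticks["price_in_rupees"]
    .resample("1min")
    .agg(["first", "max", "min", "last"])
    .set_axis(["Open", "High", "Low", "Close"], axis=1)
    .dropna()  # drop minutes that contain no ticks
)
print(bars)
```

Two caveats:
- When the file is read in chunks, a minute that straddles a chunk boundary will produce two partial rows in the output.
- The pasted rows mix several strikes and tokens, so grouping by `Token` before resampling is probably what is wanted.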
@eager heath do u get my point what i am trying to do ?
I don't know, I am not a datascience person :D
@serene scaffold see this way i am trying to do , can u ple look into it ?
I believe Steele had to get back to work, but I'm sure someone will come and help you. If not, feel free to ask in a help channel!
Hey @plush leaf!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
just ping me when someone reply
Does anyone know how to extract labels from sklearn's Pipeline object?
I tried after fitting the pipeline, which gives an error
pipe_kmeans = Pipeline([('clustering', KMeans())])
pipe_kmeans.fit(X)
pipe_kmeans.named_steps['clustering'].labels_ #err
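For comparison, a self-contained sketch with made-up data. The error itself was not posted, so this only shows two ways that normally work on a fitted pipeline; `n_clusters`, `n_init` and `random_state` are choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

# Made-up data: two well separated blobs.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

pipe_kmeans = Pipeline([("clustering", KMeans(n_clusters=2, n_init=10, random_state=0))])
pipe_kmeans.fit(X)

# Labels stored on the fitted final step.
labels = pipe_kmeans.named_steps["clustering"].labels_

# Labels for (possibly new) data, routed through every step of the pipeline.
predicted = pipe_kmeans.predict(X)
```

If `labels_` raises an AttributeError, one thing to check is whether `fit` ran on the same pipeline object that is being inspected.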
assuming they are sorted in manner of minutes you can read line by line by csv module. also to look for pandas solution I'd suggest read csv with chunk size param. some people here do know how to use it, i have personally not used it yet.
@lapis sequoia can u check my code here
np.where() is op 
Hi! I have a question: can I build a neural network whose job is to watch a person's actions on their phone and collect that information? Is it possible?
@fiery sedge can you elaborate on why you need a neural network
you mean through app
you can do some thing called clickstream
which will listen users information on phone app like what a user is doing and then
send it to a s3
its pretty easy to setup the backend
but if you have an app in android
they need to pass this triggered event to this api which handles clickstream
if you need i have a snippet of code which can handle this clickstream
so basically you will be creating a data lake which will have all users triggered events
Ok I understand, but I mean in general, not only in an app. I mean knowing the actions on his cell phone, for example which apps he uses the most, or access to the contacts. Is it possible to have this functionality with only a neural network?
with only neural networks
means i think you can use LSTM
long-short term memory
i think i read about this
one moment
i dont think you need a neural net for that. just sounds like sketchy data collection and then regular analysis
that is what i suggested
we can do a data lake for all event triggers by users
@misty flint
generally we can track user behaviour with lstm
so one moment
Clickstream events are small pieces of data that are generated continuously with high speed and volume. Often, clickstream events are generated by user actions, and it is useful to analyze them. For example, you can detect user behavior in a website or application by analyzing the sequence of clicks a user makes, the amount of […]
@fiery sedge
can someone please look at my code? I'm trying to feed multiple csv files in, do some conversion, then save each of them
also where can I paste my code so you can check it
!code
thank you, here is my code
I successfully feed them in, but dont know how to save it in separate file
fname is just any file in side the folder
so what i did was feed each file in separately and then run it through those
This is what I get when I run it
i'm trying to break each one of those into a csv file
Yep and with that many row
so far I added the df.to_csv at the end but it keeps overwriting the same file
nothing, it just give me back 1 csv file instead of 1000 csv file
which name?
df.to_csv('/content/drive/MyDrive/Huy_2/nonfall_2ft_groupby/'+filename+str(plot_numbers) + '.csv', index=False)
cause in this you have mentioned filename
where is this filename is getting updated
?
I dont want it to update, i just need it to save down as a csv file...
yes dude get it can you show me in the code where you are assigning filename?
like the code from before I break that massive csv file into like 1000 csv files ??
I'm not following with your question, tbh. Cause the one I gave you before just taking those csv file in the folder and feed it
df.to_csv('/content/drive/MyDrive/Huy_2/nonfall_2ft_groupby/'+filename+str(plot_numbers) + '.csv', index=False) in this line can u tell me where is filename is getting set
?
agh... that... that's from previous... i see, let me try to remove it
still same issue
only 1 file coming out after I remove it
can you share the code after that
it has this error code
create a folder called nonfall_2ft_groupby
same error code
same error?
let me reset it again
idk what is going on, like I delete it and it still have the same error
keep it still the same
all of it?
I can give you the original 1, then the code to break it. it will be easier
sure
it is running hold on
still 1 coming out
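The usual cause of "only one file comes out" is that every `to_csv` call receives the same path. A minimal sketch of writing one CSV per group follows. It assumes a hypothetical `plot_number` column and uses a temporary folder in place of the Drive path; the real notebook code was not posted.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the big frame; "plot_number" is a hypothetical grouping column.
df = pd.DataFrame({"plot_number": [1, 1, 2, 3, 3], "value": [10, 11, 20, 30, 31]})

out_dir = Path(tempfile.mkdtemp()) / "nonfall_2ft_groupby"
out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

written = []
for plot_number, group in df.groupby("plot_number"):
    # The path must change on every iteration, otherwise each write
    # replaces the previous file and only one CSV survives.
    path = out_dir / f"plot_{plot_number}.csv"
    group.to_csv(path, index=False)
    written.append(path.name)
```

The `to_csv` call has to sit inside the loop, and the part of the filename that varies has to be updated on each pass.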
Hi guys, Can anyone point me to the right resources to start with computer vision video detection problems with transformers. I know NLP but am new to Computer Vision.
Hi
Learn everything you need to know about OpenCV in this full course for beginners. You will learn the very basics (reading images and videos, image transformations) to more advanced concepts (color spaces, edge detection). Towards the end, you'll have hands-on experience building a Deep Computer Vision model to classify between the characters in ...
i think this should be valid source @foggy shuttle
Thank you... Will check it out 👍
Hi ! Anyone know how to do the equivalent of cv2.inrange for HSV color thresholding in PyTorch ?
I wanted to know why is everything related to ai super popular with python in comparison to other languages like c#?
Please @ me, if you don't mind.
because c# is ancient languages for nerds 🦾
all the rich kids use python
it depends on which part of the stack you're talking about
the low-level networking code, everything that handles transactions etc., yes
in those contexts, Python is good for backtesting/experiments, minimally
and perhaps ML
hm I would say it's because a lot of people who work with AI aren't software engineers first
but mathematicians
and CPython, being dynamically typed + interpreted, is generally easier to work with
more or less. the general pattern is: Python bindings for user-friendliness, C/C++/Fortran backend for speed.
a really good example is numpy
in general, debugging numpy issues is simple
compared to going through the underlying BLAS/LAPACK
yeah. C is at least reasonably readable by someone who doesn't know it
given proficiency in other languages
but C++ is a lot more complicated
sometimes I check out CPython source to understand how something works
if it was C++Python I would probably be like 🥴 and then 😔
agreed
isn't it weird
that VB and Python
are basically the same age?
even if you look @ VB after it started being on .NET and Python 2
say, pre-2.7
I would probably use Rust
it's a lot nicer to work with
apart from the immaturity of tooling
yeah, I intend to go take a Master's next year, and then maybe get back into ML
it's a pretty cool language! but unless you are a relatively hardcore engineer it'll probably be irrelevant
law
shrugs
I was a data scientist (nominally, though more like ML engineer)
it was my first job
why not? I didn't know what I wanted to do @ that time
nope
so might as well take a professional degree that is reasonably prestigious
went for a bootcamp, then got approached
it was a really great first job tbh
can you elaborate on that
hm I'm not sure about that
this is probably true, but it was quite intellectually stimulating, and you have better networking opportunities
Singapore
oh, I was near there once
I went to Saudi Arabia to teach data science
pretty interesting experience
way too dry for me though
yeah it wasn't bad! is part of the reason I went overseas to work
you mean in SA?
like, in Saudi Arabia or Singapore?
since you said this
oh
yeah.
but we're small
I'm actually working with a bank right now
shrugs
bootcamps are just a starting point I think
it's more about the marketing than anything else
Thank you two very much, I hope you have a great day. @velvet thorn @olive jackal
hi, I have a question about handling outliers: is it possible to handle outliers by using a Yeo-Johnson transform?
is it possible to use both (scaling and transform) in a pipeline?
Sure, they are just numerical operations
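A short sketch of chaining the two in scikit-learn, on made-up data. `standardize=False` is set on the power transform only so that the separate scaling step is visible; by default `PowerTransformer` already standardizes its output.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Made-up skewed feature with one large outlier.
X = np.array([[1.0], [2.0], [2.5], [3.0], [3.5], [4.0], [250.0]])

pipe = Pipeline(
    [
        ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
        ("scale", StandardScaler()),
    ]
)
X_t = pipe.fit_transform(X)  # transformed, then zero mean and unit variance
```

A power transform compresses the tail, so the outlier has less leverage, but it does not remove the point. Whether that is enough depends on the model.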
Quick question, has anyone done a train test split across all of the csv files in a folder before?
what's the effect if i put an outlier into a model?
can someone tell me why its accuracy is so low?
im AI noob
this is the dataset https://archive.ics.uci.edu/ml/datasets/Chess+(King-Rook+vs.+King)
the quick answer is that your params are not tuned enough to get an acceptable accuracy
ok ill keep tweaking them thanks for the input
Hey guys, for producing sequences with continuous values, what's the norm for determining the length of the output sequence?
with decoder networks used in NLP it's easy because you can have a dedicated embedding for EOS tokens
but you can't do that with continuous values
i got it up to 70% 🙂
well done! what was the change?
i used SVM instead of logistic regression
👏
I only know a little chess, but maybe you can include features like the color of the square (black or white) each piece is on, whether the kings are next to each other, whether there is a fork, and other things like that. It takes some time to understand which features matter.
Yes, there are plenty of reasons to, e.g. splitting on metadata that isn't in the csv to ensure no leaking of test into train
Depends on the model and how it's affected by outliers.
can you elaborate of this?
Some models and loss functions are affected differently, e.g. L1 vs L2. I'd recommend taking a course on ML, e.g. https://www.coursera.org/learn/machine-learning
how to preprocess a categorical column where categories are ranges, like 'x<100', '100 <= x < 500', 'x >= 500'
what is a categorical column? are you talking about pandas?
categorical column is a column which contains categories
I'm talking about ML stuff here.. pandas has nothing to do with it
do you know imputation/standardization/preprocessing ?
To clarify, you have a numerical column that you would like to convert into a categorical column?
Just apply a if/elif/else or lambda function that checks its range
!e
func = lambda x: 1 if x < 100 else (2 if x < 500 else 3)
print([func(i) for i in [50, 150, 550]])
@azure marsh :white_check_mark: Your eval job has completed with return code 0.
[1, 2, 3]
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
numpy.searchsorted(a, v, side='left', sorter=None)
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array *a* such that, if the corresponding elements in *v* were inserted before the indices, the order of *a* would be preserved.
Assuming that *a* is sorted...
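Applied to the three ranges from the question, both functions could be used as in this sketch. The sample values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([50, 100, 150, 499, 500, 550])
labels = ["x<100", "100 <= x < 500", "x >= 500"]

# right=False makes each bin closed on the left: [-inf, 100), [100, 500), [500, inf)
binned = pd.cut(values, bins=[-np.inf, 100, 500, np.inf], right=False, labels=labels)

# The same bin positions as integers 0, 1, 2; side="right" sends a value equal
# to an edge into the upper bin, matching the left-closed intervals above.
codes = np.searchsorted([100, 500], values.to_numpy(), side="right")
```

`pd.cut` is the more readable of the two when labels are wanted; `np.searchsorted` only returns the bin positions.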
Thanks guys for your help but I'm actually not asking help with Python or any specific library
I was only asking a conceptual question on data preprocessing
Thanks for your help in any case. You guys are awesome
You can look up these encodings for categorical data: ordinal, one-hot, dummy variable, embedding.
You could take ordinal further and even convert it back into a lossy numerical column (e.g. taking the midpoints of the bins)
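As a rough illustration of three of those options on a column that already holds range labels. The representative values 50 and 750 for the open-ended ranges are arbitrary picks for this example, not midpoints.

```python
import pandas as pd

col = pd.Series(["x<100", "x >= 500", "100 <= x < 500", "x<100"])
order = ["x<100", "100 <= x < 500", "x >= 500"]

# Ordinal: keep the natural order of the ranges as 0 < 1 < 2.
ordinal = pd.Categorical(col, categories=order, ordered=True).codes

# One-hot: one indicator column per range, no order implied.
one_hot = pd.get_dummies(col)

# Lossy numeric: a representative value per range. The open-ended ranges have
# no midpoint, so 50 and 750 are arbitrary picks for this illustration.
representative = col.map({"x<100": 50, "100 <= x < 500": 300, "x >= 500": 750})
```

Ordinal codes suit models that can exploit the order (trees, for example), while one-hot avoids implying that the gaps between ranges are equal.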
so im using the yolo algorithm, but when i run the code, i always get this error:
cv2.error: OpenCV(4.5.3) C:\Users\runneradmin\AppData\Local\Temp\pip-req-build-sn_xpupm\opencv\modules\dnn\src\darknet\darknet_io.cpp:659: error: (-215:Assertion failed) separator_index < line.size() in function 'cv::dnn::darknet::ReadDarknetFromCfgStream'
how can i fix this?
someone know how i can convert that dataframe(image) to a dataframe where each element be part of a column?
df = pd.read_csv('location of dataset', sep=',', parse_dates=[column name which contains dates]) or something like that, you basically want to use sep and parse_dates parameters. look up on google
but its not from a file . _.
??
Have you tried searching for that obscure error?
Hey folks, I'll bump my question from earlier
For producing sequences with continuous values, what's the norm for determining the length of the output sequence?
with decoder networks used in NLP it's easy because you can have a dedicated embedding for EOS tokens
but you can't do that with continuous values
a concrete example of this would be generating audio waveforms
split by comma and allocate them into their own respective lists then make a new dataframe out of them
though i see one problem being that comma is used as the 1_000 separator
surely you can do this from the source file
i have, but didnt find a fix for it
You've tried using someone else's config file?
You've ensured there's no comments without whitespace after the '#' ?
so i tried verifying if both files were in the specified path, and they werent, as there was a typo. now i fixed that, but i get this:
parse NetParameter file: models/MobileNetSSD_deploy.prototxt in function 'cv::dnn::ReadNetParamsFromTextFileOrDie'
no, there arent any comments in the code
lol
Yeah but how can i do it
Any know about web scraping?
yeah some people do, but this is not the right place, this is #data-science-and-ml
enjoy some nightmare fuel from early stages of my gan
this one was even earlier in training
yo what should i learn if i want to create a machine learning model to classify different types of fish from images? is a convolutional neural network appropriate for it?
how do i know if i am using a good algorithm?
Hello friends, I am currently stuck on my BE project, a helmet detection system using YOLOv3 on Google Colab with Darknet. I have the training code but am not sure it is error-free, and I want proper testing code. I have a custom dataset which is labelled and ready. However, even if I get ready-made YOLOv3 testing code, I don't know what exactly to add or edit in it, since I don't know Python. Can someone please help me with the Python part? I have to present this project to my external and internal faculty. Thank you
Please Feel Free to DM me with help
I have an Array which looks like this:
[[0.7651453003611763, 0.764035690858367, 0.7355304233745091], [0.6948386732214498, 0.15246199920890194, 0.1504548793580838], [0.6948386732214498, 0.15246199920890194, 0.1504548793580838], [0.8455282724710679, 0.84655663488637, 0.8337125981891232]]
How can I multiply the values by 255? (These are converted RGB values)
!e
import numpy as np
arr = np.array([[0.7651453003611763, 0.764035690858367, 0.7355304233745091],
[0.6948386732214498, 0.15246199920890194, 0.1504548793580838],
[0.6948386732214498, 0.15246199920890194, 0.1504548793580838],
[0.8455282724710679, 0.84655663488637, 0.8337125981891232]])
arr2 = arr * 255
print(arr2)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[[195.11205159 194.82910117 187.56025796]
 [177.18386167  38.8778098   38.36599424]
 [177.18386167  38.8778098   38.36599424]
 [215.60970948 215.8719419  212.59671254]]
@gaunt marsh multiplying an array by a numeric type will multiply each element by that value.
Yes CNNs are pretty good for Image classification.
To know whether the model is good, you can use metrics such as f1, accuracy, recall and precision to measure the performance of any classification model. You should also use a holdout validation dataset to compare the performance of the model on the data you trained it on vs the data you didn't train it on.
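To make those metrics concrete, here is a small hand computation on made-up labels, with 1 as the positive class. In practice a library such as scikit-learn provides these, but the arithmetic is short.

```python
# Made-up labels for a held-out set the model never saw during training.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # of predicted positives, how many are real
recall = tp / (tp + fn)              # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```

Computing the same numbers on the training set and on the held-out set, and comparing them, is what exposes overfitting.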
Been there. 😂
Is batch normalization still needed even if we normalize the data beforehand i.e. dividing every pixel by 255.0?
Batch Normalisation is needed to keep the weights and gradients in check, not the inputs. So I would say it is good to have them.
good sir, do you recommend any youtube videos for a beginner in machine learning? my end goal is image classification with machine learning
Hey @plush leaf!
It looks like you tried to attach file type(s) that we do not allow (.zip). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
how do you guys load your datasets? I wanted to load new ones from the drive when needed, but keeping already loaded ones in a list, so that accessing them again is faster
turns out that wasn't a great idea because 70 thousand numpy arrays of shape 64x64x3 are not very fit for my 16 gb of ram 
I worry that if I make it strictly just pull from hdd every time, I will wear out my drive
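Some rough arithmetic on why that list strains 16 GB, as a sketch that ignores Python and list overhead. The dtypes are assumptions: dividing pixels by 255.0 typically yields float64, while raw image bytes are uint8.

```python
import numpy as np

n_images = 70_000
shape = (64, 64, 3)
values_per_image = int(np.prod(shape))  # 12,288 values per image

# Gigabytes needed to hold every image in RAM, per dtype.
sizes_gb = {}
for dtype in (np.uint8, np.float32, np.float64):
    name = np.dtype(dtype).name
    sizes_gb[name] = n_images * values_per_image * np.dtype(dtype).itemsize / 1e9
    print(f"{name:>8}: {sizes_gb[name]:.2f} GB")
```

Keeping the cache as uint8 and converting to float per batch would cut the footprint to roughly an eighth of the float64 figure.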
Check out tutorials on Tensorflow's website
If you are using deep learning via Tensorflow or pytorch, they have mechanisms to create a input data pipeline that brings in data as and when needed
Yeah. Look into tf.data.Dataset
Or if you are using images for image classification, you can also try ImageDataGenerator in keras.
Another way is to subclass the keras Sequence class for full control
I'm training a gan so yea i pull real image samples for the discriminator
hey I got N number of dfs, i wish to merge them.
but i don't know how to merge by comparing a certain column.
example:
df1
a b
1 a
2 b
3 c
df2
a c
1 x
2 y
4 z
the df i want:
a b c
1 a x
2 b y
3 c -
4 - z
thats just an outer join
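With the two example frames above, the outer join could look like this sketch. The missing cells come out as NaN rather than "-", and the `reduce` line is one way to extend it to N frames that share the key column.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
df2 = pd.DataFrame({"a": [1, 2, 4], "c": ["x", "y", "z"]})

# how="outer" keeps keys that appear in either frame; missing cells become NaN.
merged = pd.merge(df1, df2, on="a", how="outer")

# For N frames, fold the same merge over the list.
frames = [df1, df2]
merged_all = reduce(lambda left, right: pd.merge(left, right, on="a", how="outer"), frames)
```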
A Quick Question....
I'm planning to start learning DL so I'd like ask
- PyTorch or TensorFlow or Keras
Which framework is advisable for a complete beginner in Deep Learning to learn 1st.
- Please could you give a reason for your suggestion in Q1.
Are you aware that Keras is part of TensorFlow? In either case, I think it depends on your experience with machine learning in general. Deep learning isn't the be-all-end-all of AI.
If you're not familiar with other approaches to AI, I think you'd find your learning experience more satisfying if you start elsewhere.
Oh I had no idea Keras is part of TF. I've seen quite a few code written with TF where keras was mentioned
yes, Keras is a part of TF that wraps around other parts of TF.
for beginners, pytorch is great too - even though the preference is TF since you dont really need to understand anything for TF
Well, I'm almost done with the online ML course I'm using to learn. There's an introductory segment to TF, Keras and PyTorch however that's just that about it. There's no deep material on deep learning
but I would still recommend using Pytorch and JAX, when you get some more experience
Thanks. Does that mean TF is more customer-friendly? 😀
Well, I'd like to be able to understand what's going on in each line of code I'm writing.
wdym by customers?
how do i get into ai
see pinned comments
This is my first time of hearing JAX. All the JD I've seen so far seem not to mention this framework though. Perhaps it's somewhat new
it is yeah - its for really power users, mostly cutting edge research stuff
but it provides an extreme level of flexibility
'customer-friendly' in that context means that it's perhaps perceived to have a simpler syntax or to be easier
if I want to drop a column from a pandas df with
schedule.drop(list(schedule.filter(regex='DAY')), axis=1, inplace = True)
is there a way to select more than one filter?
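One way to cover several filters is regex alternation, shown here on a made-up frame. The column names are hypothetical.

```python
import pandas as pd

# Made-up schedule; the column names are hypothetical.
schedule = pd.DataFrame(
    {"NAME": ["a"], "DAY_1": [1], "DAY_2": [2], "WEEK_1": [3], "TOTAL": [6]}
)

# "|" is regex alternation, so one pattern can cover several filters.
to_drop = schedule.filter(regex="DAY|WEEK").columns
schedule = schedule.drop(columns=to_drop)
```

Reassigning the result is equivalent to the `inplace=True` version for this purpose.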
syntaxwise, TF requires you to write less code automating what happens underneath - whereas Pytorch requires more code but gives you greater control, better docs and lovely syntax
@olive jackal because I am brand new to pandas 🙂 And dont know all the trix yet
Ultimately I dont need the columns at all, and the file will be written back out to excel
Defeats the purpose as a Pandas exercise 🙂
Hey! I'm currently making a sudoku solver from image with opencv.I've got the initial processing and splitting the image into cells done but im having trouble detecting if a specific cell contains a digit(not classifying the digit). Does anyone know how i can go about solving it?
how to tune hyperparameter for gensim doc2vec even though gensim doc2vec doesnt give any accuracy/loss for training?
Hey guy I keep having this issue. But when I run that file separately, it work just fine
@rigid zodiac what file
frame looks like the index.
I think if you want to groupby the index, the parameter for groupby is level=0
otherwise the problem might be that your datatypes are interpreted as strings
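A small sketch of grouping by the index, assuming a made-up frame whose index is named "frame". The actual data was not posted.

```python
import pandas as pd

# Made-up frame whose index is named "frame".
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0]},
    index=pd.Index([0, 0, 1, 1], name="frame"),
)

# level=0 groups by the first index level instead of by a column.
means = df.groupby(level=0)["x"].mean()

# Quick check for the other suspicion: numeric columns read in as strings
# would show up as "object" here.
column_types = df.dtypes
```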
The file fall 549
what is that
I have literally no idea what you're talking about
Hello, I could really use some help. I've just tried to whip up my first neural network completely from scratch, and I think it's almost working. Would any expert be so kind as to take a few minutes to go through it with me?
@earnest wadi post code, and post errors if there are any. that's the only way to get help.
hmm, okay
https://pastebin.com/kh68Zu6c <- main
https://pastebin.com/qhTy7XWJ <- functions file
My back propagation is off i believe
```
[2.]
 [2.]]
a:\Python\Neural Net testing Stuff\functions.py:8: RuntimeWarning: divide by zero encountered in true_divide
  return 1 / (1-x)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[inf inf inf]
 [inf inf inf]
 [inf inf inf]]
[0. 0.]
[[-0.00109576 -0.00050056]
 [-0.00109576 -0.00050056]
 [-0.00109576 -0.00050056]]
Traceback (most recent call last):
  File "a:/Python/Neural Net testing Stuff/PyNets.py", line 76, in <module>
    nn.fit(training_data, batch_size=8, epochs=250)
  File "a:/Python/Neural Net testing Stuff/PyNets.py", line 27, in fit
    self.backwards_propegate(training_outs[a], output)
  File "a:/Python/Neural Net testing Stuff/PyNets.py", line 46, in backwards_propegate
    layers[i].adjustment = np.dot(layers[i].inputs.T, layers[i].delta)
  File "<__array_function__ internals>", line 5, in dot
ValueError: shapes (2,) and (3,2) not aligned: 2 (dim 0) != 3 (dim 0)
PS A:\Python\Neural Net testing Stuff>
```
I get some division by zero then it all spirals into madness
ok I'm not touching this one.
lol, it's my first time ever having a go from scratch :c
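Two things in that output can be illustrated in isolation. This is only a guess at the intent, since the pastebin code isn't reproduced here: it assumes `functions.py` line 8 is meant to be a sigmoid derivative, and that the layer inputs and deltas are 1-D arrays. The shapes below are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(activated):
    # Takes the sigmoid *output* and stays finite, unlike 1 / (1 - x),
    # which divides by zero as soon as an activation equals 1.
    return activated * (1.0 - activated)

inputs = np.array([0.5, -1.0, 2.0])  # one sample with 3 features (1-D)
delta = np.array([0.1, -0.2])        # error signal for 2 output units (1-D)

# .T does nothing on a 1-D array, so np.dot(inputs.T, delta) cannot line up.
same_shape = inputs.T.shape == inputs.shape

# The outer product gives one weight adjustment per (input, output) pair.
adjustment = np.outer(inputs, delta)
print(same_shape, adjustment.shape)
```

Printing `.shape` for every array right before the failing `np.dot` is usually the quickest way to see which operand is not the shape you expected.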
hello, suppose I want to find P(A|B,C)
using Bayes' theorem
would I treat P(A|B) as something like X and then do P(X|C)?
just confused about how to do this with 2 conditions
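P(A|B) is a number rather than an event, so it can't be conditioned on again. The usual route is to keep C as a condition everywhere: P(A|B,C) = P(B|A,C) P(A|C) / P(B|C), which is the same as P(A,B,C) / P(B,C). A small numeric check on a made-up joint distribution over three binary variables (the probabilities are arbitrary):

```python
import numpy as np

# joint[a, b, c] = P(A=a, B=b, C=c); values are invented and sum to 1.
joint = np.array([0.10, 0.05, 0.15, 0.20,
                  0.05, 0.10, 0.05, 0.30]).reshape(2, 2, 2)

# Direct definition: P(A=1 | B=1, C=1) = P(A=1, B=1, C=1) / P(B=1, C=1)
direct = joint[1, 1, 1] / joint[:, 1, 1].sum()

# Bayes' theorem with C kept as a condition in every term.
p_c = joint[:, :, 1].sum()
p_b_given_ac = joint[1, 1, 1] / joint[1, :, 1].sum()
p_a_given_c = joint[1, :, 1].sum() / p_c
p_b_given_c = joint[:, 1, 1].sum() / p_c
via_bayes = p_b_given_ac * p_a_given_c / p_b_given_c

print(direct, via_bayes)
```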
anyone here good with hardware, need someone to help me pick between 2 hardware choices
for AI
Guys, does anyone know how I can limit processing to 100 items from the list at a time?
The first 100 elements are processed, then the next 100 elements in the list, and so on, so that there is less load on the system in general
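Batching like this can be sketched with a small generator that slices the list in steps of 100. `chunks` is a hypothetical helper name, and the list of 250 integers stands in for the real data:

```python
def chunks(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

items = list(range(250))  # stand-in for the real list

batch_sizes = []
for batch in chunks(items):
    # Process one batch here before moving on to the next.
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Because it is a generator, only one batch is sliced out at a time, and the last batch simply comes out shorter when the list length is not a multiple of 100.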
what hardware are you trying to pick between?
Any recommendations for intro ML books which have good content on time series/LSTM?
Nvidia Quadro 5000 and Radeon P560
but apparently Quadro Sux..
Question: why is ggplot not showing a plot and instead showing a list
Can you provide more context? We have no idea what code you've executed or what its inputs were.
Does Radeon work with ML?
I assume you mean "cuda enabled"?
there is not very good support for non-Nvidia GPUs
Hello. I am in need of some help :((( Do you know those chain supermarket brochures? I'm gonna need to extract manufacturer/title and description info for each product on them.
you might as well go for the Titan RTX at that price point
Finding products/info is easy with object detection, but i don't know how to extract that info
how are you going about converting the content of the brochure to "regular" text?
Object detector + OCR
it is extremely robust
Assume I can convert it to text
I would confirm that you can accurately convert it to text. However spaCy might have a ready-made recognizer for manufacturers.
extracting the description of a given product is going to be more difficult because it's hard to say when a description starts or ends.
Yes exactly :/ I can also find the title and say the rest is description.
Title is basically what the thing is. But I have never done natural language processing at a production level
this might sound like a dumb question, but how do you know what the title is? and what is "the rest"?
Or text classification
I do NLP professionally for some reason.
Title is basically what the product is. Say, a broom from Vileda
Vileda being the manufacturer
so any time "x from y" is one sentence, that is always going to be a product and a manufacturer?
Description has info like say 100 cm length etc
I never do anything with documents that haven't already been converted into ascii/unicode/etc
it never is like that. The title is just written as say "PLC Broom"
My point is, if you don't have carefully constructed training data for this, you will either have to use a classifier that has already been trained or come up with rules
As a human it is easy to extract that info
Yes agreed. I am going with rule based but it is becoming harder and harder unfortunately.
As title and manufacturer are all mixed together
So you may have to go with a pre-built model and accept a certain amount of inaccuracy
Which kind of model would you suggest for such a task?
are you familiar with named entity recognition?
I have heard of it but don't really know it
It's where you recognize words/phrases that belong to a certain category. "product" and "manufacturer" are clear-cut categories, but "description" isn't really.
Is it okay for those models not to include some categories? Sometimes it just writes "Tomato"
Can you name anything that is pretrained etc. for that?
spaCy lets people train and publish models for a lot of different NLP things, so it's great for making NLP accessible to a general programming audience. I would look into what NER models they have for product-related stuff.
You're probably not the first person to want to do this kind of thing. But be warned, I still don't know what to do about the product descriptions.
If I find title and manufacturer and remove them from the whole text, then I will be left with descriptions.
So it is not really a big issue :)
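The "find title and manufacturer, treat the rest as description" idea can be sketched with plain rules before reaching for a trained NER model. Everything here is an assumption for illustration: the manufacturer lexicon, the `split_product_text` helper, and the convention that the first OCR line is the title line:

```python
# Hypothetical lexicon; in practice this would come from a curated list.
KNOWN_MANUFACTURERS = {"vileda", "plc"}

def split_product_text(text, manufacturers=KNOWN_MANUFACTURERS):
    """Split OCR text for one product into manufacturer, title, description."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {"manufacturer": None, "title": "", "description": ""}
    # Assumption: the first line mixes manufacturer and title words.
    words = lines[0].split()
    maker = [w for w in words if w.lower() in manufacturers]
    title = [w for w in words if w.lower() not in manufacturers]
    return {
        "manufacturer": " ".join(maker) or None,
        "title": " ".join(title),
        # Whatever remains after the title line is treated as description.
        "description": " ".join(lines[1:]),
    }

result = split_product_text("PLC Broom\n100 cm length")
print(result)
```

A product like "Tomato" with no manufacturer simply comes back with `manufacturer` set to `None`, which matches the case where a category is missing.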
I will definitely be looking into those, thank you so much :)
Thank you for seeing my question. I gave up on figuring out something I was trying to program. I was asking a very general and broad question. But thank you nonetheless 🙂
so I'm using OpenCV to make an object detection program. When I run the code, I get this error:
```
AttributeError: module 'cv2.cv2' has no attribute 'dnn_DetectionModel'
```
i gotta submit the assignment today, so uhh its kinda urgent...
if you need the module versions:
alright i think i know what caused it now..
im upgrading opencv-contrib-python i'll see if that fixes it
new error:
[ WARN:0] global C:\Users\runneradmin\AppData\Local\Temp\pip-req-build-1i5nllza\opencv\modules\videoio\src\cap_msmf.cpp (438) `anonymous-namespace'::SourceReaderCB::~SourceReaderCB terminating async callback
Now its this:
[ERROR:0] global C:\Users\runneradmin\AppData\Local\Temp\pip-req-build-1i5nllza\opencv\modules\dnn\src\tensorflow\tf_importer.cpp (2805) cv::dnn::dnn4_v20210608::`anonymous-namespace'::TFImporter::parseNode DNN/TF: Can't parse layer for node='Fp\pip-req-build-1i5nllza\opencv\modules\dnn\src\tensorflow\tf_importer.cpp:2478: error: (-2:Unspecified error) Const input blob for weights not found in function 'cv::dnn::dnn4_v20210608::`anonymous-namespace'::TFImporter::getConstBlob'
pls help
will this work if df is sorted by something else, or will the order be wrong?
df['val'] = df.sort_values(by=['time']).loc[:, 'val'].apply(foo)
i guess the question is does pandas use the index when group assigning values
yes
!e ```python
import pandas as pd
df = pd.DataFrame({
'x': [1,2,3],
'y': [4,5,6],
}, index=list('abc'))
print(df)
df['x'] = df['x'].iloc[::-1] + 10
print(df)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
   x  y
a  1  4
b  2  5
c  3  6
    x  y
a  11  4
b  12  5
c  13  6
assignment to a column in a pandas df is actually a join on the index
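Applied to the original question, that means sorting before an order-dependent operation is safe, because the result is matched back by index label and not by position. A small sketch with made-up `time` and `val` columns, using `cumsum` in place of the unspecified `foo`:

```python
import pandas as pd

df = pd.DataFrame({"time": [3, 1, 2], "val": [10, 20, 30]})

# The cumulative sum is computed in time order, then aligned back to df's
# own row order via the index on assignment.
df["running"] = df.sort_values(by="time")["val"].cumsum()
print(df)
```

The row with the latest `time` ends up with the full total even though it is physically first in the frame.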
ah thats nice to know, thanks!
.join, .loc[]=, pd.concat, and pd.merge all perform joins, with overlapping but not entirely redundant options/features
yes, it's incredibly useful and one of the best parts about pandas
people who don't like pandas are usually the same people who don't understand the index system
Haha i still fall in the latter group i think
Especially with multilevel indexes
do you know if there are any nice ways of accessing MultiIndex values?
filtering stuff like df[df['name'] == 'B'] is just so much more convenient than df[df.index.get_level_values('name') == 'B'] or whatever
even with 600k rows and no index, pandas is more than fast enough, so I struggle to see the value of using indexes apart from the various transformations which need indexing
Hello
Do you need Data science for Ai or vice versa
Anyways i cant do ML or Ai rn anyways so
How can i get started with Data science?
Im reading the pinned messages let me check they usually have some good stuff
Use pd.IndexSlice for that
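For the MultiIndex filtering question, here is a short sketch of `pd.IndexSlice` next to two alternatives (`xs` and `query`). The frame and its level names `name` and `run` are made up for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["A", "B"], [1, 2]], names=["name", "run"])
df = pd.DataFrame({"val": [1, 2, 3, 4]}, index=index)

idx = pd.IndexSlice

# All rows where the first level is "B", any value for the second level.
b_rows = df.loc[idx["B", :], :]

# xs selects by level name (and drops that level from the result).
b_xs = df.xs("B", level="name")

# query can refer to index level names as if they were columns.
b_query = df.query("name == 'B'")

print(b_rows, b_xs, b_query, sep="\n\n")
```

Slicing with `IndexSlice` generally wants a sorted index, so calling `df.sort_index()` first is a reasonable habit.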
How would I do early stopping, if the validation dice coefficient is above 0.5?
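The framework isn't stated, so here is a framework-agnostic sketch of threshold-based early stopping. `train_one_epoch` and `evaluate_dice` are hypothetical callables standing in for whatever the real training loop provides:

```python
def train_until_threshold(train_one_epoch, evaluate_dice, max_epochs=100, target=0.5):
    """Train until the validation dice exceeds `target` or epochs run out."""
    dice = None
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        dice = evaluate_dice()
        if dice > target:
            # Stop as soon as the validation metric clears the threshold.
            return epoch, dice
    return max_epochs, dice

# Fake training run: the validation dice improves a little each epoch.
fake_scores = iter([0.20, 0.35, 0.48, 0.55, 0.60])
stopped_at, final_dice = train_until_threshold(
    train_one_epoch=lambda: None,
    evaluate_dice=lambda: next(fake_scores),
    max_epochs=5,
)
print(stopped_at, final_dice)
```

Most training frameworks let you express the same check as a callback that runs at the end of each epoch and sets a "stop training" flag.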
this is actually so profound
and not how I thought of it
it was a bit mindbending tbh
but that’s a good observation
Isn't this why Naive Bayes algorithm (GaussianNB) is mostly preferable for conditional probability?
You can use Naive Bayes to solve this, although there is probably a better approach... You can even do this from scratch
Any cool Data science projects?

