#data-science-and-ml
sorry @wind plume , will look into your problem now
I'm a bit tipsy but let's see what I can do.
@radiant nymph no if important features are not selected in ur model due to one - hot ur model will not perform well. Overfitting won't really happen as long as u limit the tree depth / size, and that will deal with dim explosion as well.
okk.
https://github.com/atulbunkar/Wine-Prediction please review my work and give feedbacks
@wind plume Until LOC 152, your module looks OK to me.
considering your final purpose with this, I still think the whole column replacement issue is not the way to go.
considering that you just want to plot data according to whether they're dried or wet, couldn't you just create two separate dataframes, one for each type, with a common attribute between them? then the code that draws the bar plots can just reference the common attribute
sorry if it's not very clear
I appreciate the tipsy advice @charred blaze, in all seriousness!
What exactly do you mean tho? Are you saying like make a data frame with all wet samples and all dry samples? Rather than have a master dataframe with ALL samples? Wouldn't that require me to sort them at the beginning near where I ask for file inputs? Not sure the best way to go around it tbh.
two separate dataframes
The fact I get NaNs from my input confuses the fuck out of me tho. The code is probably not elegant at all as it's my first real project
Would you recommend making the two dataframes AFTER I have the master sheet? Or have two master sheets
My thought process was having one dataframe that I'd constantly be uploading new data to and then isolating however many columns and graphing them. It's the isolation part that is giving me trouble with the Nan shit
I say before the master sheet
So how do you envision it working? I'd still have to either concat, melt, or do what I am currently struggling with, no?
yes, you would have to join those two dataframes afterwards
Would I join them a different way, tho? I still think I'd use the same or similar function right?
you need a common attribute in each row of those two separate dataframes
But how? Just the same as I tried to make my previous code work?
input wet or dry in a column etc
maybe append a column after removing outliers??
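A minimal sketch of the "common attribute" idea discussed above. The column names (`sample`, `mass`, `condition`) and the values are hypothetical, since the actual data is not shown in the thread.

```python
import pandas as pd

# Hypothetical measurements; the real column names and values are not shown in the thread.
wet = pd.DataFrame({"sample": ["A", "B"], "mass": [10.2, 11.5]})
dry = pd.DataFrame({"sample": ["A", "B"], "mass": [7.9, 8.4]})

# Tag each frame with the common attribute before joining them.
wet = wet.assign(condition="wet")
dry = dry.assign(condition="dry")

# One long "master" frame; plotting code only needs to look at `condition`.
master = pd.concat([wet, dry], ignore_index=True)

# Reshape so each condition becomes its own column, ready for a grouped bar plot.
by_condition = master.pivot(index="sample", columns="condition", values="mass")
print(by_condition)
```

With matplotlib installed, `by_condition.plot.bar()` would then draw the wet and dry bars side by side for each sample.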
Can anyone recommend a book of maths required for data science and ml?
Thanks
can someone help me with pandas plz @ me
@digital lynx what's wrong?
can i print multiple slice objects? using iloc?
I just need to know if Pandas.to_csv(filename) overwrites the current csv file if it has stuff on it
I am making a bot that takes data from a csv file and graphs it and changes the data in the table. I need to delete the first row, and add data to the end. I need to know if the to_csv() method will just overwrite the file because that is what I need, not appending all the data I just changed
Hey ---> can anyone help me install TensorFlow on VSCODE?
- Can anyone here help me with specific code troubleshooting? Basic array and histogram use
wouldnt you just pip install on your machine? @opaque stratus
yo
i have this json
"warnings": [
{
"id": 711390341789646919,
"reason": null
}
how can i remove the obj with that specific id
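One possible way to do this, sketched with the standard `json` module. The surrounding document and the second warning entry are made up for illustration; only the first entry comes from the snippet above.

```python
import json

# Hypothetical document shaped like the snippet above; the second entry is made up.
raw = '{"warnings": [{"id": 711390341789646919, "reason": null}, {"id": 42, "reason": "spam"}]}'
data = json.loads(raw)

target_id = 711390341789646919
# Keep every warning except the one whose id matches.
data["warnings"] = [w for w in data["warnings"] if w["id"] != target_id]

print(json.dumps(data))
```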
@digital lynx it does. the default mode is 'w', which is 'write'.
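A small sketch of the workflow described in the question (drop the first row, add a row at the end, write back). The file path and column names are hypothetical; the point is only that `to_csv` with its default `mode="w"` replaces the file contents.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and columns, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"t": [1, 2, 3], "value": [10, 20, 30]}).to_csv(path, index=False)

df = pd.read_csv(path)
df = df.iloc[1:]  # drop the first row
new_row = pd.DataFrame({"t": [4], "value": [40]})
df = pd.concat([df, new_row], ignore_index=True)  # add data to the end

# Default mode="w": the old file contents are replaced, not appended to.
df.to_csv(path, index=False)

result = pd.read_csv(path)
print(result)
```

Appending instead would need an explicit `mode="a"` (and usually `header=False`).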
Hey
Currently using VSCode's jupyternotebook interface
any idea how I could route the usage to my laptop CPU and GPU?
I suggest using google colab instead, everything runs in the cloud @opaque stratus
Oh ok
Yeah Ik what you mean
So long story short, I'm reading the pandas docs in my spare time. I'm curious what the best way to go about this would be. Should I be reading linearly, or would it be better to pick certain topics (if so which ones)?
@real wigeon I personally prefer to learn smth doing a project. reading docs isnt gonna stick to your mind unless you actually use them in code. so if you go with topics and related classes it would be better. i think Kaggle has a Pandas course. its short and useful
Ok
I have a job with a lot of downtime so I'm trying to read docs
And code when I'm home
@opaque stratus ... use pip install in your command prompt / Conda Prompt to download the necessary modules ... Open VS Code and import those modules .. That will work
I have this problem https://www.reddit.com/r/Python/comments/glgo4s/how_do_i_properly_loopbackdsnoop/
Any ideas?
Hello, I am trying to clean up some panda dataframes using BeautifulSoup. I am unable to apply that to one column. Any help is appreciated.
import pandas as pd
from bs4 import BeautifulSoup
df = pd.DataFrame({"id": [1,2], "a":[ ["<a>Hello</a>"],["<c>Aorld</c>"]], "b":[["<c>World</c>"],["<c>Corld</c>"]]})
df['c'] = df.apply(BeautifulSoup(df['a'].all(), 'html.parser').get_text())
print (df)
@lapis sequoia I'm a little confused... Are you trying to parse an existing webpage and put it into a dataframe? It seems like you're ultimately mixing two distinct data structure here
@echo kelp well, I am trying to clean up the text in column a by removing any html tags.
and I saw that BeautifulSoup can help
you can access the text inside tags returned by beautiful soup by using .get_text()
so if you have a tag already stored as an object, you should be able to return the 'Hello' for example, by using something like tag.get_text()
I'm not exactly sure about the syntax
sorry, I am not sure if I am following you
yeah, no, sorry
well
I don't think it would be best practice to store the raw tag themselves as the data in columns a and b
if you're looking to manipulate those strings, I'd probably try to use something like a regular expression rather than beautifulsoup in this context
beautiful soup can parse tags, but I don't know how applicable it is to iterating over a series of tags in this fashion, particularly when returned from a dataframe
did this point you in this direction?
yeah
I can definitely see how it applies
I do know though, if you are working with pandas dataframes, every action you take should be "vectorized". Ideally, iterating over a dataframe row by row is heavily discouraged by pandas. So, you can definitely construct a solution somehow doing this, maybe someone else might know better than I do.
np, sorry I couldn't find a neat solution
@echo kelp i think i found it...
df['c'] = df['a'].apply(lambda text: BeautifulSoup(''.join(text), 'html.parser').get_text())
this worked for me
thanks again
it was incorrect data type being passed
did anyone else chuckle when they first saw panda's cumulative functions?
@real wigeon
I find that the problem with reading docs is that they don't really have a structure/lesson plan. You just go to learn random tricks, not really see how they fit together.
I've heard a lot of good things about this book, and I plan to read through it myself down the road (It was written by the creator of Pandas).
https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662
yeah i was going to watch a freecodecamp tut
and use the docs to augment the knowledge
while trying a project
HEy
Im new to ML i had some basic question about Linear Regression
I am trying to understand this question
A^TA x = A^T b
If someone could message me I could give some more context I had some question to clarify what even is going on here
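For context, A^T A x = A^T b is the set of "normal equations" of least squares: minimizing ||Ax - b||^2 and setting the gradient 2 A^T (Ax - b) to zero gives exactly that system. A small numerical sketch with made-up data, fitting a straight line:

```python
import numpy as np

# Made-up data: fit y = c0 + c1 * t to four points.
t = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 3.1, 4.9, 7.2])

# Design matrix A: a column of ones (intercept) and a column with t (slope).
A = np.column_stack([np.ones_like(t), t])

# Solve the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)
```

Both routes should agree here; in practice `lstsq` is usually preferred because forming A^T A can be numerically less stable.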
Hi, I'm having troubles with implementing Conv2D backpropagation using Numpy.
This is what I've done for forward propagation:
ch, h, w = x.shape
Hout = (h - self.filters.shape[-2]) // self.stride + 1
Wout = (w - self.filters.shape[-1]) // self.stride + 1
a = np.lib.stride_tricks.as_strided(x, (Hout, Wout, ch, self.filters.shape[2], self.filters.shape[3]),
(x.strides[1] * self.stride, x.strides[2] * self.stride) + (
x.strides[0], x.strides[1], x.strides[2]))
out = np.einsum('ijckl,ackl->aij', a, self.filters)
I tried doing this but it's not working:
F = np.lib.stride_tricks.as_strided(x, (n_filt, size_filt, size_filt, dim_filt, size_filt, size_filt),
(x.strides[0], x.strides[1] * self.stride, x.strides[2] * self.stride) + (
x.strides[0], x.strides[1], x.strides[2]))
F = np.einsum('aijckl,anm->acij', F, dA_prev)
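dF = np.zeros(shape=self.filters.shape)  # shape=[n_filters, ch, h, w]
size_filt = self.filters.shape[-1]
for filt in range(n_filt):
    y_filt = y_out = 0
    while y_filt + size_filt <= size_img:
        x_filt = x_out = 0
        while x_filt + size_filt <= size_img:
            dF[filt] += dA_prev[filt, y_out, x_out] * x[:, y_filt:y_filt + size_filt, x_filt:x_filt + size_filt]
            # advance the window horizontally (step not shown in the pasted snippet)
            x_filt += self.stride
            x_out += 1
        # advance the window vertically (step not shown in the pasted snippet)
        y_filt += self.stride
        y_out += 1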
This is working great but very slow
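A possible vectorized version of the filter gradient, sketched under the assumption that the forward pass is exactly the `einsum('ijckl,ackl->aij', ...)` shown above. It reuses the same strided windows and swaps the einsum subscripts. The function names are hypothetical, and `d_out` plays the role of `dA_prev` (the gradient of the layer output, shape [K, Hout, Wout]).

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def conv_windows(x, kh, kw, stride):
    """View x [C, H, W] as windows of shape [Hout, Wout, C, kh, kw] (no copy)."""
    ch, h, w = x.shape
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    s0, s1, s2 = x.strides
    return as_strided(
        x,
        shape=(h_out, w_out, ch, kh, kw),
        strides=(s1 * stride, s2 * stride, s0, s1, s2),
        writeable=False,
    )


def filter_grad(x, d_out, kh, kw, stride):
    """dF[a, c, k, l] = sum over i, j of d_out[a, i, j] * window[i, j, c, k, l]."""
    return np.einsum("ijckl,aij->ackl", conv_windows(x, kh, kw, stride), d_out)


def filter_grad_loop(x, d_out, kh, kw, stride):
    """Slow reference with explicit loops, used only to check the einsum version."""
    n_filt, h_out, w_out = d_out.shape
    dF = np.zeros((n_filt, x.shape[0], kh, kw))
    for a in range(n_filt):
        for i in range(h_out):
            for j in range(w_out):
                ys, xs = i * stride, j * stride
                dF[a] += d_out[a, i, j] * x[:, ys:ys + kh, xs:xs + kw]
    return dF


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))      # [C, H, W]
d_out = rng.standard_normal((4, 3, 3))  # [K, Hout, Wout] for a 3x3 filter, stride 2
fast = filter_grad(x, d_out, 3, 3, 2)
slow = filter_grad_loop(x, d_out, 3, 3, 2)
print(np.allclose(fast, slow))
```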
!unzip '/content/drive/My Drive/Colab Notebooks/Dataset.zip' not working. The command is run but the images dont show anywhere in my drive. I have a zip file in my drive which I wanna unzip to use for training testing and validation.
Is it ok to webscrape a website if its robots.txt has nearly nothing? It only has 3 lines
It depends what those three lines are and, more importantly, if they have a ToS page
If you paste the website link here I could take a quick look
Looks like a no
You agree:
not to use any manual or automated software, devices or other processes (including but not limited to spiders, robots, scrapers, crawlers, avatars, data mining tools or the like) to "scrape" or download data from any web pages contained in the Website
In the Terms of Service https://www.horoscope.com/us/tos.aspx
@late cargo
Thanks
How should I go about choosing the best imputation method? I don't want to remove data because the dataset is already very small.
@solar oracle there's so many ways to impute values
The best thing to do is to run a grid search on the best imputer
for easy tasks sklearn's SimpleImputer() is really useful
There's also Knn Imputation and MICE Imputation
Interpolation
the list goes on
the best method would be to run a grid search over the imputers
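A rough sketch of what "grid search over the imputers" could look like with scikit-learn. The dataset is synthetic, and the model (`Ridge`) and the candidate list are arbitrary choices for illustration, not something taken from the thread.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the values knocked out at random.
rng = np.random.default_rng(0)
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so it is fitted inside each CV fold.
pipe = Pipeline([("impute", SimpleImputer()), ("model", Ridge())])

# Treat the whole "impute" step as a hyperparameter with several candidates.
param_grid = {
    "impute": [
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
        KNNImputer(n_neighbors=5),
    ]
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_imputer = search.best_params_["impute"]
print(best_imputer, search.best_score_)
```

Keeping the imputer inside the pipeline matters on small datasets: it avoids leaking information from the validation fold into the imputation.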
hello, I want to learn data science or at least improve my understanding of basics in this area, what free materials may you suggest, I'd be also happy if some people will agree for in person help so I will feel no shame asking stupid questions
is a good introduction to machine learning
without too much rigor
then there's Andrew Ng course on Coursera
which is also really good
@raw rapids looks hard but I will note those materials, thank you
@raw rapids because I've checked about deep learning before, it's extremely hard by itself
no its not
its based around simple concepts
the deep learning mit course
is really beginner-friendly intro
I was thinking that first I should basics on learning before going deep but maybe I was wrong
I started of with the course I mentioned above and I'm doing fine
is a really good place to supplement your skills
they have a treasure trove of awesome notebooks
recently I only discovered what is a notebook
I also suggest
https://www.coursera.org/learn/machine-learning
sure I will
ya thats the Andrew Ng course I mentioned earlier
ya definitely
The certificate costs money, but participating is free
good
yup, it requires a lot of time and dedication in my opinion
if you want to retain as much info as possible
so you have to make a commitment
but I agree with @devout sail, that is a very good intro
my English is very bad but I may try it
Hello, I hope this is the correct channel. I am very new to using python and trying to just write a simple dividend yield formula for a single stock, but want to actually see the steps involved. (My code) # NHI dividend yield=dividend per share/market price per share# d=[1.10, 1.11, 1.12] m=47.25 y=dividend/market
(output) Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'list' and 'float' However, if I write the code without a list it works fine >>> d=1.10
m=47.25
y=d/m
y
0.023280423280423283
y*100
2.3280423280423284 What am I doing wrong with my list?
!code
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
Please paste code this way so we can read it properly
d=1.10
m=47.25
y=d/m
y
0.023280423280423283
y*100
2.3280423280423284
d=[1.10, 1.11, 1.12]
m=47.25
y=d/m
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'list' and 'float'
Oh you are doing it in console
'''python
"Should be backticks, not quotes."
Your problem is that you can't divide a list with a float
@somber tapir if you're using the console a lot have a look at ipython console, it's sooo much nicer
thanks I am trying to put the code in chat, but am such a noob and don't want to spam the chat with my bad attempts
print('this')
just use `
and a float refers to the fact that I have decimals correct?
the problem is coming from you trying to divide the LIST with anything, if it was int instead of float it would still raise an error
you need to divide the items inside the list
or use numpy
that too
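Both suggestions above, sketched with the numbers from the question. The variable names are only for illustration.

```python
import numpy as np

dividends = [1.10, 1.11, 1.12]  # dividend per share, from the message above
market_price = 47.25

# Plain Python: divide each item, since a list itself does not support "/".
yields = [d / market_price for d in dividends]

# NumPy: an array divides element-wise.
yields_np = np.array(dividends) / market_price

print([round(y * 100, 2) for y in yields])
```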
Okay, I can see I have a knowledge gap. I am going to do some reading. Thanks for the help!
I think it is actually a fairly intuitive "mistake", but more learning never hurts. Have fun!
Oh holy hell iPython does look way nicer.
What math is used in a self driving car? All the way from the auto pilot code to the electronics that drive the car from the outputs the code gives?
@lapis sequoia arithmetic would be used at all levels i imagine
Yeah I guess so
Hey guys, I'm familiar with manipulating data in Alteryx with GUI but I'm trying my hand at doing it with Pandas. I'm trying to do a cross join with 3 series for a dataframe for every possible combination. Is there an equivalent function in Python/Pandas?
Here's an example I made.
@lapis sequoia , https://selfdrivingcars.mit.edu/ is a good introduction to self driving cars
@storm plume
You could create a array of permutations with sympy and then make it into the dataframe
Nah, I figured it out... you have to create a dummy column and do repeated joins.
df1.assign(foo=1).merge(df2.assign(foo=1), on='foo').drop(columns='foo')
Wish there was a cleaner way to do it, but oh well.
@storm plume you could have used pandas.MultiIndex.from_product, then convert it to dataframe
Ahhhhhhhhhhh! I just looked at the documentation.
That's perfect.
Wish I had known about the existence of it earlier. Thanks!
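A short sketch of the `MultiIndex.from_product` route for three series. The series contents are hypothetical because the original example image is not available. The second variant assumes pandas 1.2 or newer, where `merge` accepts `how="cross"`.

```python
import pandas as pd

# Hypothetical series; the original example image is not available.
colors = pd.Series(["red", "blue"], name="color")
sizes = pd.Series(["S", "M", "L"], name="size")
years = pd.Series([2019, 2020], name="year")

# Every combination of the three series, built from a MultiIndex.
index = pd.MultiIndex.from_product(
    [colors, sizes, years], names=["color", "size", "year"]
)
combos = index.to_frame(index=False)

# Alternative, assuming pandas 1.2 or newer: repeated cross merges.
combos_merge = (
    colors.to_frame()
    .merge(sizes.to_frame(), how="cross")
    .merge(years.to_frame(), how="cross")
)

print(len(combos), len(combos_merge))
```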
Hi guys, is anyone here familiar with tesseract ? I am sort of new to python, but i am familiar with other languages,
when i try to get the text off this image I just get "AN afi" not even a number, i tried to invert the image but still got a similar result, any ideas?
import pytesseract as tess
from PIL import Image
import PIL.ImageOps
# inverted_image = PIL.ImageOps.invert(img)
# inverted_image.save('new_name.png')
img = Image.open("text.png")
text = tess.image_to_string(img)
print(text)
here is the image
why you inverting the image?
@teal turret
@teal turret set "exposure" to -100, this way the text is more clear
I just tried to see if inverting will help
Anyone have recommendations on the best way to learn and be proficient in machine learning. For example courses and books
Typically online would be best
Hey guys, I'm trying to do a left join and keep only the parent table of anything that did not match.
import pandas as pd
data1 = {'NameA':['Tom', 'Nick', 'Krish', 'Jack'],
         'AgeA':[20, 21, 19, 18]}
data2 = {'NameB':['Tom', 'Nick', 'C', 'D'],
         'AgeB':[20, 21, 3, 4]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1 = pd.merge(df1,df2,how='left',left_on=['NameA','AgeA'],right_on=['NameB','AgeB'])
print(df1)
Output:
   NameA  AgeA NameB  AgeB
0    Tom    20   Tom  20.0
1   Nick    21  Nick  21.0
2  Krish    19   NaN   NaN
3   Jack    18   NaN   NaN
Expected:
How should I approach this?
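One common approach, which was not given in the thread, is a left "anti-join" using `indicator=True`. This sketch assumes the expected result is the rows of the parent table (`df1`) that found no match in `df2`, since the expected output itself was not preserved.

```python
import pandas as pd

df1 = pd.DataFrame({"NameA": ["Tom", "Nick", "Krish", "Jack"], "AgeA": [20, 21, 19, 18]})
df2 = pd.DataFrame({"NameB": ["Tom", "Nick", "C", "D"], "AgeB": [20, 21, 3, 4]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = df1.merge(
    df2,
    how="left",
    left_on=["NameA", "AgeA"],
    right_on=["NameB", "AgeB"],
    indicator=True,
)

# Keep only rows of df1 that found no partner in df2, then drop the helper columns.
unmatched = merged.loc[merged["_merge"] == "left_only", df1.columns].reset_index(drop=True)
print(unmatched)
```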
@blazing bridge check out the book by the Keras creator Francois Chollet - Deep Learning with Python. It's a very great resource with lots of code examples and tutorials. Also, have a look at the fast.ai website (https://www.fast.ai) and community forum where you can share code and learn from other's coding
hi everyone i am new to this channel, i have a question regarding an error i am facing using twitterAPI. i have included an image of my question below
guys, a question: if I want to import a script with functions that also contains imported modules (e.g. sys, os) into an empty script in order to reuse my code, is there a way around to use those already imported modules (sys, os) or do I have to reimport them again?
well, I have never tested it but I don't think there is point in doing that
What is the most clean way to use it then: I have functions that rely on imported modules and now I want to import those functions into a new script; do I reimport those already imported modules? I actually see with dir(function.py) that it does import previously imported modules, but their use needs to be function.sys.argv for example which of course is very annoying
What? If you have in a module a function that import sys. You do not need to explicitly import sys along with that function in a different module if that's what you're asking. Just import the function itself and it should work.
@polar acorn thnx :))
I'm trying to find and filter outliers by each column individually, and then make a new dataframe with the outliers completely filtered.
The output has the same values in every cell, except some values are NaN which I assume means the filter worked. The code seems half functional.
for col in df_new.columns:
    Q1 = df_new[col].quantile(0.25)
    Q3 = df_new[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*(IQR)
    upper = Q3 + 1.5*(IQR)
    target = df_new[col]
    df_iqr[col] = df_new[((target > lower) & (target < upper))]
Using pandas BTW. Filtering by quartiles and defining it in the for loop. I assume my problem comes with the last sentence of the code. I'll try and play around. At the moment it is filtering things and not filtering things. It is completely removing a row (index skips from 11 to 13)
If i should be posting this in a help thread let me know
I fixed one part by changing the last line to df_iqr from df_iqr[col] but for some reason it is just blasting index 12 row from every column
@wind plume As long as the row has 1 column where the value is an "outlier", the whole row will be removed. What are you expecting will happen instead?
the dataframe has to remain in a tabular format (you can't just remove a "cell" -- speaking in excel terms)
But other rows have outliers, yet they don't remove. I'm kinda hoping that I will filter through every column individually by the columns IQR. Yet there's no way there's all outliers in row 12
I hope that makes some sense.
I expect it to be replaced with NaN, a sign either a cell was blank or I filtered properly.
what's df_iqr?
That would be the new filtered dataframe
Col would be the individual columns in df_new
So it's saying df iqr is taking the same column names as dfnew, yet using dfnews values and applying some function to it to selectively filter
hmm but df_new is not changing throughout the loop, so the final df_iqr would just be the df_new with the rows with last column's outliers removed
your last line of the loop (after your "fix"), will keep overwriting df_iqr with df_new with 1 column's outliers removed
I don't see how you get that, not saying you're wrong I just don't really understand what the last line of code is doing outside of the conditions I set.
let's make sure we're on the same page first:
is this what you have at the moment?
for col in df_new.columns:
    Q1 = df_new[col].quantile(0.25)
    Q3 = df_new[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*(IQR)
    upper = Q3 + 1.5*(IQR)
    target = df_new[col]
    df_iqr = df_new[((target > lower) & (target < upper))]
okay, so when col == 'B' (second loop), it sees row 2 has an outlier
and df_iqr will be set to a dataframe with row 2 removed
but then the next loop, col=='C', df_iqr gets set back to the full df_new (because there's no outliers), and row 2 appears back in df_iqr again
the final loop with col=='D', row 0 has an outlier, and so df_iqr is set to df_new with row 0 removed --> and that's what you get.
the issue here: df_new is never changing, yet df_iqr is being assigned df_new with a single column filtered (outliers removed)
and every loop df_iqr is being overwritten
So basically it will remove the last row with an outlier? And in this case it happened to be mine
it will remove rows where the last column has an outlier
It sounds like I need to overwrite df_iqr[column] then
so you want, at the end of everything, df_iqr and df_new to have the same shape?
just with the outliers to be replaced by np.nan?
But if I replace df_iqr with df_iqr[column] I get a dataframe where column A is all the other columns but with the filter applied
Correct
I want the filter to apply and remove them. I figured making a new dataframe was the way to go but maybe not
yea that won't be possible. you can't assign a dataframe as a series / column in a dataframe
hey yo
Making a new dataframe out of the filtered values that is.
in the simple example I have above: what is the desired output?
which rows should be in df_iqr?
okay, you're not looking to remove them then. just replacing the outlier values with nan
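If the loop is kept, one way to get "same shape, outliers replaced with NaN" is to filter each column as a Series with `Series.where` rather than filtering rows of the whole dataframe. A sketch with made-up data:

```python
import pandas as pd

# Made-up data; 100 and -50 are planted outliers.
df_new = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 100.0],
    "B": [5.0, 6.0, -50.0, 7.0, 8.0],
})

df_iqr = pd.DataFrame(index=df_new.index)
for col in df_new.columns:
    target = df_new[col]
    q1, q3 = target.quantile(0.25), target.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Series.where keeps values where the condition holds and puts NaN elsewhere,
    # so each column keeps its full length and no rows are dropped.
    df_iqr[col] = target.where((target > lower) & (target < upper))

print(df_iqr)
```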
Yes, I suppose so. When I made this before, I was only working with a one column dataframe so it wasn't bad
But when working with multiple columns and applying individual statistics to each column its getting hard
I feel like my code is very close
I might be super dumb. But I am currently doing predictions on a dataset. I tried with both Forest and Linear regression models, but my R2 score is constantly negative
Hello all, I'm novice in data analysis, I need some help, how can I show number range in x and y starting from 1? Thank you very much for your help.
@wind plume maybe try this
Q1 = df_new.quantile(0.25)
Q3 = df_new.quantile(0.75)
IQR = Q3 - Q1
df_iqr = df_new.query('(@Q1 - 1.5 * @IQR) < @df_new < (@Q3 + 1.5 * @IQR)')
hey there
What does @ do @paper niche
it's a syntax for you to access your python variables within the query string
it's akin to
df_new[(Q1 - 1.5*IQR < df_new) & (df_new < Q3 + 1.5*IQR)]
I don't think I've seen that before but I'm pretty new to coding and pandas. Does it not work if you don't have the @?
Do i use the above code in my for loop? I assume not.
yeah first time I'm seeing this too
is this new.. is it performant
the query method
@wind plume no it doesn't, and no there's no need for a loop. If you have a look at what `Q1 - 1.5*IQR < df_new` is, it's a dataframe of the same shape as `df_new`, with booleans as elements (True if the corresponding element in `df_new` is above the lower cutoff, i.e. not a low outlier; False otherwise). You can use this boolean "mask" to filter `df_new` to get your `df_iqr`
@lapis sequoia I think so, lemme try to pull up a SO thread about this..
I see the last commit was on march 2020.. seems new
I can't seem to find it.. basically it saves you multiple lookups, especially if you're doing things like df[(df.A > 10) & (df.B < 100) & (df.C > 10)]
if I remember correctly
@paper niche woah that worked. I don't really get how. I've seen people use query for things like this. I don't know how it is parsing through every column individually and doing statistics on it. That is insane.
With so little code wtf
the performance gain is not significant if you're not doing multiple lookups like this
I really want to understand this and not just accept this as an answer and move on
Cuz this is confusing to me
@wind plume break the code down into smaller pieces. like I said, have a look at what `Q1 - 1.5*IQR < df_new` is first, then what `(Q1 - 1.5*IQR < df_new) & (df_new < Q3 + 1.5*IQR)` is, then finally what `df_new[...]` looks like. you'll get a better feel of what this code is doing
Like why it is a string. How it is looping through everything. What the @ does when the variables are already well defined
Okay I will
Mind if I try to explain it to you?
go ahead
The @q1-1.5*@iqr and @q3+1.5*iqr is applying a filter. It is saying if any value falls between that, it is now in df_iqr
Sorry for the bootleg discord code
On mobile atm
no worries, yeah I get you
I don't quite get the @ and why it is a string but I can look that up. And I assume the df_new.query is saying "for the values in the dataframe df_new that fit our condition, turn that into a dataframe" I guess it seems redundant and more of a procedure. I assume this wouldn't work if you used a generic df that would be defined earlier?
The grippy part is what the @ in the string is doing, I assume it is shorthand based off what you are saying
Because normally if you use IQR in a formula and you defined IQR before it is no problem
So a query by necessity needs to be a string
are you familiar with server logs, I'm trying to build a parser to end all parsers
wanted to build something useful.. and was looking at this https://github.com/rory/apache-log-parser
if you had a dataframe with a column x
a = 1
df.query("x < @a")
it's just replacing @a with 1 (your python variable)
So how is it applying my filter to EVERY column, individually?
@wind plume yeah. within the query string, normal alphabetic characters/words are interpreted as column names (thus if you had `df.query("x < a")` instead, without the '@', it would try to look for a column in df called 'a')
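A tiny sketch of the difference between `@a` and a bare `a` inside a query string. The dataframe is made up, and it deliberately has a column named `a` so that both forms run.

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 3], "a": [5, 0, 5, 0]})
a = 2  # a Python variable that happens to share its name with column "a"

# "@a" refers to the Python variable: rows where x < 2.
uses_variable = df.query("x < @a")

# Bare "a" refers to the column: rows where x < df["a"].
uses_column = df.query("x < a")

print(uses_variable.index.tolist(), uses_column.index.tolist())
```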
Why is it not taking the average Q1 iqr etc? Of the whole dataframe
okay 1 sec
@lapis sequoia no unfortunately not. haha I don't deal with the logs 😉
me neither..
Apparently there is something called the common log format, and it can have n fields..
but people have been trying to parse these with regexes, with limited success
seems like, they have to add capture groups when the logs change.. so I'm wondering if there's a way to make a parser that will parse the first line to check for all available fields from the entire list of fields defined in common log format
and not throw errors when a field is missing from newer logs
But your IQR and Q1/3 in this case i assume is applied not by column but by dataframe right?
Oh no, apparently not
sorry I misspoke just now
but this expression calculates the upper outlier per column
in a pandas series
and if a new field is introduced, it should be able to account for that too
performing df_new < (series), pandas compares all the elements in column 'A', 'B' and 'C' with the respective outlier value in the (series) -- this is essentially your for-loop
you end up with a dataframe of booleans (called a mask) -- shown in the pic above
But in that example above, q3+1. 5*iqr won't be a series I think
You're just saying do this function
you mean in the query string?
Ah that was your query string in this case? Ok
if the query string syntax is still confusing, we can just discuss this one (it's entirely equivalent):
df_iqr = df_new[(Q1 - 1.5*IQR < df_new) & (df_new < Q3 + 1.5*IQR)]
where Q1, IQR and Q3 are pandas Series holding the respective per-column values
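The masking explanation above can be checked on a small made-up dataframe (100 and -50 are planted outliers); the intermediate `mask` is printed so each step is visible.

```python
import pandas as pd

df_new = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 100.0],
    "B": [5.0, 6.0, -50.0, 7.0, 8.0],
})

# DataFrame.quantile returns one value per column, as a Series indexed by column name.
Q1 = df_new.quantile(0.25)
Q3 = df_new.quantile(0.75)
IQR = Q3 - Q1

# Comparing a DataFrame with such a Series aligns on column labels,
# so every column is checked against its own cutoffs.
mask = (Q1 - 1.5 * IQR < df_new) & (df_new < Q3 + 1.5 * IQR)

# Indexing with a boolean DataFrame keeps the shape and puts NaN where the mask is False.
df_iqr = df_new[mask]
print(mask)
print(df_iqr)
```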
The above makes a ton of sense. You are saying the iqr dataframe is now df_new with the appropriate cutoffs
So my problem was that I used THAT line inside a for loop
yeah exactly. You were confused about how I was achieving this without a for-loop
hopefully it's clear now with the explanation about the masking
And every time I did a for loop, it would throw out the last row that had a outlier
So I guess I need to learn when I need to use a for loop or not haha
@wind plume nono, you would throw out rows with an outlier in that column (the column that you're currently iterating over)
@wind plume rule of thumb: explicit loops in pandas (and numpy) shouldn't be required in most cases (certainly not when the operations being performed are so simple)
I've always thought for loops as if I had a list and I wanted to apply iterations on every item in the list, use a for loop
for ordinary python list, this is true. but not with numpy and pandas. much of the speed improvements from using these packages comes from knowing how to take advantage of vectorized operations.
@lapis sequoia regex isn't known for being flexible tho xD are you planning to do this with regex too? or..?
Gooootcha. So the rule of thumb is if I'm a beginner there shouldn't be any need for me to use a for loop? I realize what I'm trying to do isn't probably mega difficult but it seems like the only way to solve this is to cheat and get help
I am not sure of the direction yet.. I think it'd be a nice tool to make
It's my personal project not for school or anything so it's not cheating but you know what I mean
https://pypi.org/project/apache-log-parser/ I'm trying to find the source for this, but I don't see it from the pypi page
found the github
line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
@wind plume yeah, just try to keep in mind when dealing with numpy/pandas/similar scientific computing packages that if you're explicitly building loops, there's probably a better way of doing it. and don't guilt trip over getting help haha. It's part of the learning process. Reading other people's answers is how you become aware of better solutions to your problems (it beats reading through the entire documentation yourself)
thing is if you see this line, it defines exactly what pattern to expect and where, this won't work if a field is missing in the log or if the fields are in different places
I really appreciate the help dude. those pesky for loops have been fucking me on my pandas project lol
np 🙂
I'll work on a small part of my project first..
so, I'm thinking.. find all the fields available and set everything else to null in a row.. that way I can accept data when the field is introduced in newer logs
something like this https://stackoverflow.com/questions/57528383/normalizing-nested-json-data-with-pandas?rq=1
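A rough sketch of the "missing fields become null" idea, using only the standard library rather than `apache-log-parser`. The regular expression covers the Common Log Format and treats the two extra Combined Log Format fields (referer, user agent) as optional. The field names and sample lines are chosen here for illustration.

```python
import re

# Common Log Format, with the two Combined Log Format fields made optional.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)")?\s*$'
)


def parse_line(line):
    """Return a dict with every known field; fields missing from the line are None."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


lines = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"',
]
rows = [parse_line(line) for line in lines]
print(rows[0]["referer"], rows[1]["user_agent"])
```

Because every row has the same keys, a list like `rows` could be passed straight to `pandas.DataFrame`, where the `None` entries show up as missing values.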
so im a bit new to datascience related projects
i looked up how to combine duplicate values across multiple columns
and am now trying to sort the output from greatest to smallest
data = confirmed.groupby('Country/Region')['5/13/20'].max().apply(lambda g: g.nlargest(20).sum())
im getting an error regarding the .max()
well, it would be better if you would provide the error tho
@real wigeon do all the parts make sense? what is returned by ...max() ? does it make sense to .apply() to that returned value?
line 52, in <module>
total_sum_by_region()
File "/Users/asdkals/Library/Preferences/PyCharmCE2018.3/scratches/scratch_18.py", line 34, in total_sum_by_region
data = confirmed.groupby('Country/Region')['5/13/20'].max().apply(lambda g: g.nlargest(20).sum())
File "/Users/aklsdjals/.local/share/virtualenvs/COVID19-tX0C9oPJ/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "/Users/alksdjals/Library/Preferences/PyCharmCE2018.3/scratches/scratch_18.py", line 34, in <lambda>
data = confirmed.groupby('Country/Region')['5/13/20'].max().apply(lambda g: g.nlargest(20).sum())
AttributeError: 'int' object has no attribute 'nlargest'
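What the traceback is saying: `.max()` on the grouped column already returns one number per country, so `.apply()` hands the lambda a plain `int`, which has no `nlargest`. A sketch of one likely fix, calling `nlargest` on the resulting Series directly. The tiny `confirmed` frame here is invented stand-in data.

```python
import pandas as pd

# Stand-in for the real data: country "A" appears twice (e.g. two provinces).
confirmed = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "C"],
    "5/13/20": [5, 7, 3, 9],
})

# One value per country; this is a Series indexed by country.
per_country = confirmed.groupby("Country/Region")["5/13/20"].max()

# nlargest works on the Series itself and returns it sorted, largest first.
top = per_country.nlargest(20)

# Plain sorting without a cutoff.
sorted_all = per_country.sort_values(ascending=False)
print(top)
```

If the duplicate rows should be added up rather than reduced to their largest value, `.sum()` in place of `.max()` would be the aggregation to use.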
Hi, I tried to implement backpropagation for convolutional layer but for some reason the results are wrong.
I tried to make a full convolution of the filters and the previous layer's gradients.
dA_prev shape : [K, H, W]
w(filters) shape: [K, C, H, W]
x shape: [C, H, W]
dA_dim, dA_h, dA_w = dA_prev.shape # previous layer's gradients
pad_h = dA_h - 1
pad_w = dA_w - 1
ow = np.pad(w, ((0, 0), (0, 0), (pad_h, pad_h), (pad_w, pad_w)), 'constant')
ow = ow[:, :, ::-1, ::-1]
dA = np.lib.stride_tricks.as_strided(ow, (ow.shape[0], x.shape[1], x.shape[2], dA_h, dA_w, ow.shape[1]),
(ow.strides[0], ow.strides[2] * stride[0], ow.strides[3] * stride[1]) + (
ow.strides[2], ow.strides[3], ow.strides[1]))
dA = np.tensordot(dA, dA_prev, axes=[(0, 3, 4), (0, 1, 2)])
Hey I am looking for an uncased POS tagging model (preferably using the Hugging Face transformers framework), does anyone have any resources?
please tag me if you have any info, I am going to be away from discord for a bit and I don't want to miss it 🙂
@sturdy laurel Never used one (dabbled in NLP only) but found this - https://github.com/huggingface/transformers
Maybe Im just putting in for loops when I dont need to, but say I have values (what I call item) in a column 'Sample' in dataframe called df_melt...
I've been stuck on this for a few hours. I am not sure if this is a red flag and it means my fundamentals are screwed or if this is tricky, or if there are millions of examples I can look up online. I hate coming here asking for help
Ultimately what I want to do is look and see if a value in 'Sample' column has one of the case-insensitive keywords, and if it does, make correspond THAT specific row with 'Weathered'. If it does not have the keyword, we assume it is dry, therefore we call it 'Dry'.
An example column would have a value called "X Dry" or "Y weathered"
df_melt['State'] = ''
keywords = ['wet','weathered','weather']
for item in df_melt['Sample']:
    if any(kw.lower() in item.lower() for kw in keywords):
        print(item + ' is wet')
        df_melt['State'] = np.where(df_melt['Sample'].str.contains(item), 'Weathered','Dry')
    else:
        print(item + ' is dry')
My natural instinct is to make for loops if I want to iterate thru a list but as fickletofu said, there's probably ways around using for loops.
Is this where I should build a query?
Fwiw when I do this, I see the correct values labeled as 'is dry' or 'is wet', it's a matter of writing it. Not sure how, or why it's so difficult.
@wind plume Like this?
In [39]: df = pd.DataFrame({'Sample': ['wet', 'WET', 'weathered', 'weather', 'dry']})
In [40]: df['State'] = np.where(df['Sample'].str.contains('wet|weathered|weather', case=False), 'Weathered','Dry')
In [41]: df
Out[41]:
Sample State
0 wet Weathered
1 WET Weathered
2 weathered Weathered
3 weather Weathered
4 dry Dry
@rain palm is there any way to make it totally case insensitive tho? So it could accept WeT, etc. That was my hope with the keywords. Will this also work for something named "720 Wet" or something like that?
Do you know how to use regex?
I don't, is it hard to learn? If this is something that will 100% help I am willing to learn
Finds "720 WET" it seems:
>>> df = pd.DataFrame({'Sample': ['wet', '720 WET', 'weathered', 'weather', 'dry']})
>>> df['State'] = np.where(df['Sample'].str.contains('wet|weathered|weather', case=False), 'Weathered','Dry')
>>> df
Sample State
0 wet Weathered
1 720 WET Weathered
2 weathered Weathered
3 weather Weathered
4 dry Dry
@wind plume It will help you long way with string matching.
I missed the case = false, that is AWESOME.
That clarifies so much
I think then, any time I want to make or search something case insensitive I can do that
Does regex do it better or faster?
It's so counter intuitive to me why you wouldn't use a for loop, but fickletofu was right. Good solution without any for loop nonsense
Idk, if I am struggling with stuff like this is it normal? Or does this mean I really need to sit down and watch some YouTube class
Avoid youtube.
Do you recommend ways to learn this? I was really stuck and was going into for loops and shit when I really didn't need to. Tried a bunch of stuff and spent like hours on it. Then you showed me to use "|", and case = false which was immensely helpful. Doubt I could have found that elsewhere
What I heard was to go make your own program that's genuinely useful for you. That's what I'm doing
I've used this time WFH to learn to code since I am a research scientist and can't be in lab lol
If you want to learn the right way that will help you a lot, look for books in youtube, if you want to find something there.
Also check the sites from the tools that you use, they normally spend time doing tutorials for you.
And experiment with your own projects.
But if you don't know the basics of programming avoid data science totally
I learned the very basics through python crash course, but I'm no master of it. At that point in the book it had me code a game and build a website and work up data. I decided to start my own project that would help automate graphing and data workup (remove outliers etc)
What did you learn?
Dictionaries, lists, list comprehension, user inputs, if else, etc
Then it got into coding a game and I felt like I was copy and pasting and not really learning. And it didn't interest me because it wasn't for work. So I made it work applicable.
Learned pandas was pretty damn solid, then learned how to use pandas with other packages like seaborn and numpy tho in very limited detail
You need to get logic going. Solve some problems.
It will help you find ways to solve problems with the data that you use
You can keep going with data science without doing that, but you can get stuck quite often with for loops, if else
Ahhhh, awesome thank you for the link! Are these insanely hard challenges, or totally doable for a novice and if you can complete it, it's a solid start? If not, return to the python crash course?
They have difficulties, so you can start with the easy ones
I notice I get stuck on things for hours and trying to bash my head on things isn't fun. Sometimes I fix it, other times I post here and am like "ojhhhhhhhhhhh"
If you find yourself hardstuck with the easy ones because you don't know the syntax, you can check the python documentation.
That's better than any video course you can find
Is syntax my issue? A lot of problems I have are because I don't know how to do something I want to do. Not sure if that is logic, or syntax, or literally every coding problem ever.
You probably saw my example above and can probably realize what I was trying to do, but the fact I couldn't do one small thing meant my code didn't work even tho the rest was sound
It doesn't seem like, that's why I'm recommending you to do code challenges and read the documentation
Both things can help you to find solutions (for example, the case=False, it's in the documentation)
Alright, we should stop talking about this on a data science channel, if you want to ask for help #python-discussion
I appreciate it a lot :)
Hello. I have been looking for books on machine learning from scratch (it has to be from scratch, so no frameworks like TF, PyTorch, Theano etc., just numpy, pandas or matplotlib etc.). Unfortunately, I couldn't find any. Does anybody know any good books?
Thanks to everyone helped me with my job scraper! got an interview for a company that creates them next week 😄
Found strange pandas interaction let me share
Make a df
Then get a loc of df to another variable
Change original df
@uncut shadow I can't recommend any books, I had to do a similar project for a uni assignment, there are loads of youtube tutorials, and if you type for example "neural net in python from scratch", loads of examples
Then the locced df is changed
Change locced df, original is not changed
Wth is that
I didn't try to change locced df after I changed original df though
Hi guys,
is anyone around to answer a quick question?
I have a scatter plot showing the relationship between cosine and euclidean distance matrices that looks like this:
https://gyazo.com/372514a67c18132b6364582cfdc6125c
I have been asked to plot a second order polynomial over the data.
Our practicals and lectures don't really cover this so I was wondering if somebody could help explain? 😅
hi everyone, does anyone know about TCN (Temporal Convolutional Network)? I have a project to predict inflation in my country (it's a time series case). I know TCN is an evolution of CNN, which is used for image processing, but I've read that TCN can be used for time series data. I want to implement TCN on my case (inflation) but
I had difficulty getting started. Maybe you have used it or have a reference about it, please tell me
@potent hamlet Check out https://github.com/philipperemy/keras-tcn for a implementation, easy to use and might work for your use case. I've used TCN for time series classification but not for forecasting. Worked great for classification at least.
oh thank you very much
maybe you have repo on github about your TCN time series? can i see?
oh yes one more question, does that mean TCN is not suitable for forecasting (regression)?
@lusty coral If you have time check out this article series https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c especially part 3 and 4. pandas might have changed since 2017 with 1.0 coming out but I still think most of the intuition would be valid. However when and where pandas uses deep or shallow copies have always been sort of opaque.
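Since whether `.loc` hands back a view or a copy is hard to predict, one way to sidestep the question is to ask for a copy explicitly. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .copy() makes the selection independent of the original frame.
subset = df.loc[df["a"] > 1].copy()

# Changing the original afterwards does not touch the copy...
df.loc[1, "b"] = 100
# ...and changing the copy does not touch the original.
subset.loc[2, "b"] = -1

print(df)
print(subset)
```

The cost is the extra memory for the copy, which is usually acceptable for frames of this kind.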
@potent hamlet Sorry that repo is private and not mine I just wrote it 🙂 As I said I haven't tried it for time series forecasting and although it might work fine (Google had at least one article for "TCN time series forecasting") I have a feeling it would be overkill for something like inflation.
haven't looked into TCN's yet, but if its anything similar to CNN's should work on time series regression just fine.
well you know you could always run an ml model on it and get the plot @full flint
it may not be as good as other statistical methods for finding the second order polynomial over the data, but would likely be significantly easier
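For the second order polynomial itself, an ordinary least-squares fit with `np.polyfit` is probably enough. The sketch below uses synthetic `x` and `y` values as stand-ins for the two distance measures; the real data would replace them.

```python
import numpy as np

# Synthetic stand-in data: a noisy quadratic relationship.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = 1.5 * x**2 - 0.5 * x + 0.2 + rng.normal(scale=0.01, size=x.size)

# Fit y = a*x^2 + b*x + c; coefficients come back highest power first.
coeffs = np.polyfit(x, y, deg=2)
curve = np.poly1d(coeffs)

# Evaluate the fitted curve on a dense grid so it draws as a smooth line.
x_line = np.linspace(x.min(), x.max(), 200)
y_line = curve(x_line)

# With matplotlib this could be drawn as:
#   plt.scatter(x, y); plt.plot(x_line, y_line)
print(coeffs)
```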
@uncut shadow are you trying to implement complex models just to learn the code behind it or also the mathematical understanding of it?
Yes
How do I plot 2 columns (one has decimal numbers ranging from 1 to 10 and other has corresponding values to that) using pandas and matplotlib? I want to plot the whole number 1-10 in x axis.
Find a project you can do that would use numpy a lot and then learn what you need for that. Learning by doing often works better than reading through all the documentation or similar.
hello
I made some code for a neural network a little while back
I'm curious what you guys would think of it
Hey @raw raptor!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
!code-blocks
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
Hey guys, when using likert scale type responses for your analysis, how would you handle missing values? for example, in the experiment each participant had to fill out a total of 49 statements on a 7-point scale. Within these responses, sometimes there is an answer missing randomly.
k, here it is
I didn't have any use for a neural network yet, so I didn't program in back propagation or a fitness function
@polar acorn thanks , can u suggest some projects ?
If (and only if) you have some experience in deep learning then making your own simple neural net is nice. Or you can explore the random module by implementing rock, scissors paper vs the computer. Or a simple connect four game. Or find data from something you're interested in sports, finance, dota or whatever, put it in numpy and do some analysis. You would use pandas for this in real life but for learning internals you can use numpy. Or you could google around, you're probably not the first to ask.
Wow, never thought of that, thank you! I've barely used numpy before so I'd have to do some learning with that, but I'll definitely make some of these games to test it out once I get the time.
When a 19 year old intern says that Data Scientist and AI are the same
-.-
I don't see the data in the video below that a data scientist is working on
https://www.youtube.com/watch?v=gn4nRCC9TwQ
Google's artificial intelligence company, DeepMind, has developed an AI that has managed to learn how to walk, run, jump, and climb without any prior guidance. The result is as impressive as it is goofy.
@lapis sequoia https://www.kaggle.com/datasets
Hey guys sorry if this is the wrong chat room, does anyone mind giving me a hand with this error?
@fading drum Its a warning
and it means exactly what it says
your data is prob a dict I can't see what it is but its warning you that in future versions you won't be able to do such thing so change your habits
@uncut shadow thanks
Hello team 👋 , so I'm working on with python on colab on Q&A BERT base model using simple-transformers library (https://simpletransformers.ai/) I have a model which has been trained with squad it works pretty well and all, but! every time i ask a question i have to also provide a context where that question can be subtracted from 🤔 .
Now here is the case, let's say i have a table containing a bunch of paragraphs with specific information about depression. Now let's say someone asks a query like: "what can i do to deal with depression?". What techniques do you guys recommend or know about so that based on the question i can choose the best paragraph where the answer will be taken? 🥴
Thank you for your time guys 🙏
!paste
Anyone used PIL before?
hey
I'm trying to get better with datascience, and currently trying to plot a line chart
trying to do something like this post
this is currently what I have
```python
def plot_sums():
    confirmed.set_index('Country/Region')
    confirmed_date_time = confirmed[3:]
    date_time = pd.to_datetime(confirmed_date_time)
    countries = confirmed
    DF = pd.DataFrame()
    DF['countries'] = countries
    DF.set_index(date_time)
    fig, ax = plt.subplots()
    fig.subplots_adjust(bottom=.3)
    plt.xticks(rotation=90)
    plt.plot()
```
and my traceback is
File "/Applications/PyCharm CE.app/Contents/bin/BankApp/Users/asjkdhask/PycharmProjects/COVID19/Covid.py", line 73, in <module>
plot_sums()
File "/Applications/PyCharm CE.app/Contents/bin/BankApp/Users/asjkdhaskjda/PycharmProjects/COVID19/Covid.py", line 28, in plot_sums
date_time = pd.to_datetime(confirmed_date_time)
File "/Users/aksjdhaksjd/.local/share/virtualenvs/COVID19-tX0C9oPJ/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 731, in to_datetime
result = _assemble_from_unit_mappings(arg, errors, tz)
File "/Users/ajksdhaskhd/.local/share/virtualenvs/COVID19-tX0C9oPJ/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 832, in _assemble_from_unit_mappings
"to assemble mappings requires at least that "
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
am I not allowed to use a slice like that to denote the columns I want to use?
@real wigeon This should work I think
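On the slicing question: `confirmed[3:]` slices rows, not columns, so `pd.to_datetime` receives a whole DataFrame and tries to assemble dates from columns named year, month and day. A sketch of selecting the date columns by position instead, assuming (this is a guess about the layout) that the first three columns are non-date metadata. The frame below is invented.

```python
import pandas as pd

# Invented stand-in with three metadata columns followed by date columns.
confirmed = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "5/12/20": [10, 20],
    "5/13/20": [15, 25],
})

# Column labels from position 3 onwards are the dates.
date_columns = confirmed.columns[3:]

# Sum each date column over all countries: one total per date.
totals = confirmed[date_columns].sum()

# Convert the labels (strings like "5/12/20") into real timestamps.
totals.index = pd.to_datetime(totals.index, format="%m/%d/%y")
print(totals)
```

`totals.plot()` would then give a line chart with dates on the x axis.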
Is there any website which allows me to feed images and it will recognise the digits for me?
Via API calls?
that definitely exists
but idk where
I know adobe has that as a premium feature
I can actually help you with that @twilit onyx
I have a script for stuff like that
thank you @uncut shadow that was an interesting read
yeah the thing is that im getting an unresolved attribute reference for unstack()
Hey all, I'm a new python developer (really a new learner of python)
and I'm looking to get a job doing data mining/munging
does anyone know where i can find information about urllib? like how to use it
Is there any good machine learning course for free that uses Python?
He uses Octave
Andrew Ng
That guy's voice is so....hypnotic
says it's free enrollment, most of the courses on coursera can at least be taken for free without credit
@tranquil crane yeah it looks like it's done in octave, but it probably applies to python as well
Thanks
Hi Everyone, was wondering how you guys run your distributed programming
For instance, if I wanted to put 10 requests in for 10 models all at once, such as an async process, is there some sort of computing power I could dial into using a key and run my model?
What is it you want to do? @cloud ledge feedforward an x into 10 different models for 10 different outputs?
This might not be the best place to ask, so I apologize in advance. I have users from my website submit requests to run pre-defined machine learning models
They purchase X amount of cores/gpus to run the models on, but the problem I am trying to solve is that now, how do I run 10 models, say for 10 users, all at once
Sorry, I'm still not getting it
I can't run 10 models at once using the processing power of 1 server (say it only has 20 available cores)
So I was wondering if I could offload all that work somewhere
You're running your models in a VM or a container right, and now you're trying to optimize your network protocol handling?
So yes, the models are in a container, I just don't have the resources to run them
I might have 100 people need to run a contained model, not sure the best way to be able to do that
I can't really help you much there since I'm not very familiar with distributed computing and database optimization. But, it appears your problem doesn't have to do with with the neural networks themselves since you run them in a container and can treat them as just a piece of software.
You might want to try to ask the folks over at web-development/async/databases for their knowledge
im trying to get a sum per column from a df
which i then am trying to plot on a line chart
just need some help with the summing for now
Well
You should google how to do this in pandas, but I think you can also put it in numpy and then sum columns
i've been googeling how to sum
.sum()
and then you set the index
df['column_name'].sum()
returns sums for all the columns
idk, I guess I should put those new values in a series? and then plot those?
alright yeah sry i thought it was more complicated
and print statements
Those too lol
the fact that it summed it per column but it's just called .sum()
confused me
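To spell out the difference that caused the confusion, a small sketch on a made-up frame: `.sum()` on one column gives a single number, while `.sum()` on the whole DataFrame gives one total per column.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Sum of a single column: a plain number.
one_total = df["a"].sum()

# Sum of the whole frame: a Series with one entry per column (axis=0 is the default).
column_totals = df.sum()

# axis=1 sums across the columns instead, giving one entry per row.
row_totals = df.sum(axis=1)

print(one_total, column_totals, row_totals, sep="\n")
```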
Will using a 4D array of [n, x, x, x] be a lot faster than iterating n times over an array of [x, x, x]?
I'm asking because I'm implementing a CNN using NumPy and I need to improve the performance to make it trainable at all (it's 100x slower than Keras, CPU only)...
You can check it out here if you want:
https://github.com/shafzhr/SimpleConvNet
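A rough sketch of what "batch it into a 4D array" can look like, using a simplified stand-in operation (one filter applied over the full window of each sample) rather than code from the linked repo. Both versions give the same numbers; the batched one pushes the loop over `n` into NumPy, which is usually faster, though it is worth timing on the real shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 3, 5, 5))   # [n, C, H, W]
weights = rng.normal(size=(3, 5, 5))    # one filter covering the whole window

# Python-level loop: one sample at a time.
looped = np.array([np.sum(sample * weights) for sample in batch])

# Same computation for the whole batch in a single call.
vectorized = np.tensordot(batch, weights, axes=([1, 2, 3], [0, 1, 2]))

print(np.allclose(looped, vectorized))
```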
Hi!! I have a question about transposing an array?
I would say I know how, but it's just ... not working?
phi = np.random.uniform(0,math.pi,N)
theta = np.random.uniform(0,2*math.pi,N)
r = 1
x_array = r*np.sin(phi)*np.cos(theta)
y_array = r*np.sin(phi)*np.cos(phi)
z_array = r*np.cos(phi)
skin = [x_array,y_array,z_array] # find a way to flip the rows and columns on this
# print(np.ndim(skin))
# not sure why ndim is giving 2, it ought to be 50x3
sphere = np.asarray(skin)
sphere.transpose
print(np.shape(sphere))
# make a 3D scatterplot with matplotlib
return sphere```
I'm trying to flip the rows and columns in skin. I thought I could do this with .transpose, but it's not working.
@whole roost what are you passing in for N?
```python
import numpy as np
import math

def sample_sphere_polar(N):
    phi = np.random.uniform(0,math.pi,N)
    theta = np.random.uniform(0,2*math.pi,N)
    r = 1
    x_array = r*np.sin(phi)*np.cos(theta)
    y_array = r*np.sin(phi)*np.sin(theta)  # was cos(phi); the y coordinate uses sin(theta)
    z_array = r*np.cos(phi)
    skin = [x_array,y_array,z_array] # find a way to flip the rows and columns on this
    # print(np.ndim(skin))
    # not sure why ndim is giving 2, it ought to be 50x3
    sphere = np.array(skin).T  # .T returns the transposed array, so keep the result
    print(np.shape(skin))
    print(np.shape(sphere))
    # make a 3D scatterplot with matplotlib
    return sphere

small_sphere = sample_sphere_polar(50)
print(small_sphere)
```
@whole roost
Worked for me seems like calling .T in the same line made a difference
Thank you!!
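The likely reason the first version did nothing: `sphere.transpose` without parentheses only looks the method up, and even `sphere.transpose()` returns a new array instead of changing `sphere` in place. A tiny sketch with made-up data:

```python
import numpy as np

skin = np.arange(6).reshape(3, 2)

skin.transpose      # no parentheses: the method is never called
skin.transpose()    # called, but the result is thrown away

# Keeping the returned array is what actually gives the flipped shape.
flipped = skin.T    # same as skin.transpose()

print(skin.shape, flipped.shape)
```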
hi guys, is anyone here familiar with this factorization:
A = UΣV^T
That's the SVD, which also works when A is a rectangular matrix.
When A is a square matrix, does the SVD become Q^T D Q, where D is the diagonal eigenvalue matrix and Q is the orthogonal
eigenvector matrix?
No they are different decompositions.
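You can always decompose a matrix using its SVD; the existence of an orthogonal diagonalization is a stronger condition. A needs a full set of linearly independent eigenvectors to be diagonalizable at all, and for a real matrix they can be chosen orthonormal exactly when A is symmetric.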
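A numerical sketch of the distinction, on random matrices chosen only for illustration. The SVD exists for a rectangular matrix, while the orthogonal eigendecomposition is shown for a symmetric positive semidefinite matrix, where the singular values happen to equal the eigenvalues. NumPy writes the eigendecomposition as Q D Q^T; which side carries the transpose is only a convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# SVD of a rectangular matrix: A = U @ diag(s) @ Vt always exists.
A = rng.normal(size=(4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# A symmetric positive semidefinite matrix built as B @ B.T.
B = rng.normal(size=(3, 3))
S = B @ B.T

# eigh is meant for symmetric matrices and returns orthonormal eigenvectors.
eigvals, Q = np.linalg.eigh(S)
S_rebuilt = Q @ np.diag(eigvals) @ Q.T

# For this special case the singular values match the eigenvalues.
singular_values = np.linalg.svd(S, compute_uv=False)

print(np.allclose(A, A_rebuilt), np.allclose(S, S_rebuilt))
```

For a general square matrix (non-symmetric, or symmetric with negative eigenvalues) the two factorizations no longer coincide.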
what does input_dim mean in keras?
if I had a perceptron like this
and input0 is 1
and input 1 is 0
what would be the input_dim of that layer?
@jagged basin in this example it's (2,). The input dimension is the 'shape' of the data for that layer. If you have used numpy you can think of it like the shape of a numpy array; there can be several dimensions of different sizes
I see
from what I know it's basically the number of features your data has (number of columns)
normally you have data like this (x, y) where x stands for number of samples in your batch and y stands for number of features
(in RNNs you have (x, y, z) where z stands for number of time steps but it's not an RNN)
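A NumPy-only sketch of the two-input perceptron, to make the "number of features" reading concrete. The weights and bias are arbitrary illustration values; in Keras the matching layer argument would be `input_dim=2` (equivalently `input_shape=(2,)`).

```python
import numpy as np

# Four samples, two features each: shape is (samples, features).
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)

# input_dim is the size of the feature axis, not the number of samples.
input_dim = X.shape[1]

# One weight per input plus a bias (arbitrary values).
weights = np.array([0.5, -0.5])
bias = 0.1

# Perceptron output: 1 where the weighted sum is positive, else 0.
outputs = (X @ weights + bias > 0).astype(int)

print(input_dim, outputs)
```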
Noob question here regarding AI:
The google image search (similar image) - I assume it uses AI like the one below, right?
https://deepai.org/machine-learning-model/image-similarity
I was wondering - how does google get the result so quickly? Wouldn't they have to go through billions of pictures on the web?
well
- They have really powerful machines for that
- They probably have some metadata for images so they are not checking all images tho
but I'm not 100% sure about the second one
hey folks..
i need some help
i need to start learning machine learning
well, it will be a problem if you want to make your own models without using any frameworks like Tensorflow, Keras or PyTorch
actually i want to understand the underlying maths behind it
so you need to know maths
mostly linear algebra
calculus
statistics
algorithms
and stuff like that
maths and that plotting stuff
@uncut shadow yes
can you suggest me some good books to start with
including books for maths and ML etc
well, I didn't read any books about this type of stuff so unfortunately, I'm not able to suggest anything
but you should google and search and there probably be many interesting books out there
i've tried that
but here i encounter a problem
if i start to learn from a book on a single library, it keeps referencing the concepts of another lib
here i get stuck
Epoch [56/100] Batch 300/1588 Loss D: 0.6502, loss G: 2.3085 D(x): 0.9010
Epoch [56/100] Batch 400/1588 Loss D: 0.6502, loss G: 2.3040 D(x): 0.9014
Epoch [56/100] Batch 500/1588 Loss D: 0.6502, loss G: 2.3045 D(x): 0.8998
Epoch [56/100] Batch 600/1588 Loss D: 0.6502, loss G: 2.2953 D(x): 0.8995
Epoch [56/100] Batch 700/1588 Loss D: 0.6502, loss G: 2.3021 D(x): 0.9003
Epoch [56/100] Batch 800/1588
Any idea what could cause the D loss to be "stuck" after certain amount of epoch and the G loss being so huge from the very start?
Batch size is currently 8, learning rate is 0.0002
DCGAN
Best source to learn deep learning are?
Hi guys, im trying to compute means for some specific rows in a pd df. i changed the likert scale responses to numbers and got rid of some columns that had text in them. however it still isnt able to compute the row means..
any idea how it returns NaN?
there are some NaN values in there (10 out of 4069) but i also set skipna to true so it should be able to calculate the mean still
suspecting that the nans are not understood as nans
In the original csv, this was just a blank cell
you have to make sure pandas agrees that they are nans
But surely it should at least be able to compute rows where no NaNs are found?
i guess?
well, i dont think so, this is the code that i used to get response to number;
mymap = {'Totaal niet (-3)':-3, 'Niet (-2)':-2, 'Enigszins niet (-1)':-1, 'Neutraal (0)':0, 'Enigszins wel (1)':1,
'Wel (2)':2, 'Helemaal (3)':3, 'Man':0, 'Vrouw':1}
df = df.applymap(lambda s: mymap.get(s) if s in mymap else s)```
i tried some stuff indeed but it doesnt let me
whats the command for that?
to check all the datatypes in the dataframe
temp_df.info() just without any print
temp_df.dtypes?
so it's a series
Whats a series?
dataframes are made of series.
it's a 1d dataframe except it's missing a bunch of stuff.
if you really want, you can do say temp_df.to_frame().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 1 columns):
0 0 non-null float64
dtypes: float64(1)
memory usage: 752.0 bytes
so they're floats
i figured that shouldnt really be a problem for mean calculation tho
i mean for sure
haha
but its super weird, even if they're floats instead of integers, what would be the difference?
nothing
0 is fine
its okay with - afayk?
the raw data (csv)?
in backticks please
Sorry i dont fully understand, the backticks are for code right?
yes but you can post data there too
it looks better and will be easier on to copypaste to test
that or the full csv. i don't care
Ill send the csv if you dont mind. if you want i can send the ipnb as well
Hey @rigid storm!
It looks like you tried to attach file type(s) that we do not allow (.csv). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg.
Feel free to ask in #community-meta if you think this is a mistake.
i don't need your ipynb. there's something wrong with it
How tho?
it's a csv file
that's text
open it up, ctrl+c, put it in back ticks, ctrl+v, enter
such is life.
i mean if you dont feel like it via another way i'd understand that
that thing friendlist is for people i know
but take the simplest solution here you can. just read in like 5 lines from the data
what happens then
should i send the first five?
well you can do that too
if it's somewhat representative of the rest of the data\problem
cant do it. even one line has too many chars.
with that text at the beginning at least
but i mightve fucked up at that mapping part. although it looks like it did convert to floats tho
are you sure the axis is correct
i think if i do axis=0 it will try to do something per row right
and axis =1 would be columns?
does it
all rows show this
@rigid storm this was someting like temp_df.mean(axis=0)
which gave 'NaN' as the mean for each row up to the 84th
sorry it is 1 indeed
yeah that mean(axis=1) gave all rows with na
change the axis to 1 then
let me check
wait
wtf
somethin happened
i actually have means now
i at least dropped the questions themselves (which was row1 - which was text)
are those means correct
they have to be id say. scale goes from -3 to 3
dropping the zero hmm
a lot will be around 0 anyway
the zero's should have no influence on the means right
like it would be as if these responses dont exist
wait maybe it does matter
since your index starts from 1 now
... so what was your index 0?
which was the original question for the likert scale
🤦♂️
and that was causing the problem?
i guess? but i figured it would just give me NaN for 0 and the rest would be calculated
that shouldn't be a row in the data...
yeah true, thas how i got it from qualtrics 😅
the problem is that it's gonna coerce the whole column datatype
into the same datatype
they're all some bs strings now or something
since you left it there
so it couldn't cope with it just because of that being in?
it should be part of the index if it has to be there but i think it shouldn't
well it makes everything a string
i thought it would just calculate row by row, and if a row wouldnt be possible NaN would be the output
did you try calculating row by row
you probably got nan for every single row
actually
it probably didn't even try
since it saw it was a string
it skipped over all the rows because they were all strings
just like it would skip over all the columns
But then those means should be the right ones correct?
the new ones? yes
i mean they look correct to me
they are
and then axis=0 i would get the means for all columns right
yes
you can try also .describe()
for the effort
to get the basic statistics for the axes
it gives you these same numbers and some others too
ah nice
also you would see all the data types if you just do .info() on the dataframe
it would have told you the columns are all objects
yes
you can't calculate a mean on objects
that's the point
they have to be something that counts as a number of some kind
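A small sketch of the situation described above, with invented data: a first row holding the question text forces every column to a text dtype, and dropping that row plus an explicit numeric conversion makes the means work. It also shows the axis convention that came up: `axis=1` gives one mean per row (participant), `axis=0` one per column (question).

```python
import pandas as pd

# Invented stand-in: row 0 is the question text, the rest are responses.
raw = pd.DataFrame({
    "q1": ["Statement 1", "2", "-1"],
    "q2": ["Statement 2", "0", None],
})

# Because of the text row, neither column is numeric.
print(raw.dtypes)

# Drop the text row and convert what is left to numbers;
# anything that cannot be parsed becomes NaN.
answers = raw.drop(index=0).apply(pd.to_numeric, errors="coerce")

# One mean per participant; NaN values are skipped by default.
row_means = answers.mean(axis=1)

# One mean per question.
column_means = answers.mean(axis=0)

print(row_means)
print(column_means)
```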
hey y'all, i got this error: ParserError: Unknown string format: 2020-05-19 10-AM
after i tried to convert it with this df['Date'] = pd.to_datetime(df['Date'])
how can i convert "2020-05-19 10-AM" to datetime?
give to_datetime the correct format parameter
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior look the formatting here
sorry for my ignorance, but do you have a simple example how i pass the format parameters?
i tried this now:
for i in df['Date']:
    i.strptime('%Y-%m-%d %I-%p')
but i just get this error AttributeError: 'str' object has no attribute 'strptime'
date = datetime.strptime(i, '%Y-%m-%d %I-%p')
got it now 🙂 thanks though for your help @lapis sequoia
you can just shove that same format string to pd.to_datetime
and it'll do it automatically'
without any looping
hey @lapis sequoia if you dont mind one last question? Instead of replacing within the column, what would be the easiest way to replace for rows?
df['columnname'].replace(['NaN'], <the number>)
so this for example could be for all values in a column right?
mmm yes
that kinda takes the column as a series out of the dataframe and then does a replace on it
what's the difference between replacing in rows and columns in your case?
df['x'] = df['x'].replace(['NaN',], the number)
do you want to replace full rows?
also when you assign back to a dataframe, you should always do df.loc[:, 'x'] =
ah
so each NaN can just be replaced with same number, but only that new number of that row (the mean of the row)
ok
you want to replace the NaN with the average of that row?
si
for ex. one participant might have 2 NaNs
those two will be replaced with that participant's mean of the rest of his responses
(Each row = one particiapnt with 49 answers)
if they filled in everything
participant 74 has 4 NaNs > 74 mean was 0.222 so those 4 get 0.222
you want
df.fillna(df.mean(axis=1), axis=1)
I think
try that
I hope that works
though I get the feeling those nans are not integers so it won't work
or floats
!e ```py
import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((5,5)), columns=['a', 'b', 'c', 'd', 'e'])
df.iloc[2,2] = np.nan
print(df.fillna(df.mean(axis=1)))```
You are not allowed to use that command here. Please use the #bot-commands channel instead.
i can copy it
and indeed see what it does
ok so looks like the NaN is not filled in this case you sent
i guess you'll have to apply a function on the y-axis
that does the filling by row
you could drop the axis=1 from what's happening here but then it'll fill with the column averages
What about filling the NaNs seperately?
df.apply(lambda row: row.fillna(row.mean()), axis=1)
heh this is apparently not implemented in pandas yet https://stackoverflow.com/questions/33058590/pandas-dataframe-replacing-nan-with-row-average
well that was 2015
but you can see it wasn't implemented in your version either
this seems to work haha
74 had 4 nans, > 0.222
numbers turned into floats as well somehow (at least for the observer)
but yeah that looks right
ofc they were already technically floats right
but i think it should be fine right now
i can check one last time what the data type of the cells is or something
all float64
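For reference, a small sketch of the row-mean fill that ended up working, on made-up data (the frame below is not the survey data). It also shows an alternative without `apply`, assuming the frame is purely numeric: transposing makes `fillna` align the row means on column labels.

```python
import numpy as np
import pandas as pd

# Toy data: 5 participants x 5 answers, one missing answer in row 2.
df = pd.DataFrame(np.ones((5, 5)), columns=list("abcde"))
df.iloc[2, 0] = 4.0
df.iloc[2, 2] = np.nan

# Row by row: each NaN is replaced with the mean of the other answers in that row.
filled = df.apply(lambda row: row.fillna(row.mean()), axis=1)

# Same result without apply: after transposing, each original row is a column,
# and fillna matches the Series of row means to those columns by label.
filled_t = df.T.fillna(df.mean(axis=1)).T

print(filled.iloc[2])  # the NaN became (4 + 1 + 1 + 1) / 4 = 1.75
```

Plain `df.fillna(df.mean(axis=1))` does not do this, because a Series passed to `fillna` is matched against column labels, not row labels.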
hey i got another question. I have dates in a dataframe that look like this: '2020-05-18 11-PM'
i used @broken mortarwakes tipps and was able to convert the times with this function:
but now i realized that '2020-05-18 11-PM' and '2020-05-18 11-AM' both were converted to 2020-05-18 11:00:00
how can I make 11-PM 23:00 and 11-AM 11:00?
Why doesn't it automatically turn AM and PM into distinct times?

When used with the strptime() function, the %p directive only affects the output hour field if the %I directive is used to parse the hour.
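In practice that means parsing the hour with `%I` (12-hour clock) instead of `%H`. A short sketch with made-up sample strings in the same layout as the question, passed straight to `pd.to_datetime` so no loop is needed:

```python
import pandas as pd

# Sample strings in the same layout as the question.
dates = pd.Series(["2020-05-18 11-AM", "2020-05-18 11-PM", "2020-05-19 12-AM"])

# %I reads the hour on a 12-hour clock, so %p can shift it to the right half of the day.
parsed = pd.to_datetime(dates, format="%Y-%m-%d %I-%p")
print(parsed)
```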
you are a walking demigod among us normal humans
it's called using google
thank you though
np. tips fedora
@lapis sequoia are you by any chance familiar with the reddit API or pushshift API?

I need to download an entire subreddit, that has a few posts a day and was created in 2008
that sounds fun
why do you need to
for my thesis i am doing some datascience and want to do some sentiment analysis based on posts and comments of certain subreddits
that sounds.. dated.. but ok
well it is just a little part of the thesis but it needs to be done...
do you have some suggestions on how to do it?
sure.. look up fastblob
there's also semantic context for complex sentences
as i read the reddit API does not provide searching by time anymore. and my attempts with pushshift are unsuccessful...
you mean fastblob for the sentiment analysis?
and you need to search by time, because?
I already created a framework for it, since it isn't in english.
ahh that's cool
i need it by time since I need to get all posts and comments from july 2017 to today and reddit API restricts somehow more than 1000 results or something
so do it in batches
yes, but I can only get the 1000 latest
but 1000 results seems like less than a week
ahh
that sucks..
that means you can't do it
try to look for an existing dataset
or send them a request through your school
the alternative is scraping, which is probably against ToS and a waste of time
well I thought about creating a spider with scrapy
but I also read that with pushshift it should be possible to get results by time but my attempts until now failed. I will paste my question from earlier this day:
hey y'all! Is anyone of you familiar with the pushshift API? I was using psaw and basically used their demo example to grab posts from 2017. But somehow it retrieves only the latest posts and not from the time indicated:
```py
from psaw import PushshiftAPI
import datetime as dt
api = PushshiftAPI()
start_epoch = int(dt.datetime(2017, 1, 1).timestamp())
data = list(api.search_submissions(after=start_epoch, subreddit='neo', filter=['url','author', 'title', 'subreddit', 'num_comments', 'comments'], limit=10))
print(data)
```
This is what the code above returns for me, if I use limit=1 instead of 10:
```
[submission(author='anonboyGR', created_utc=1590054802, num_comments=0, subreddit='NEO', title='Pi Network Cryptocurrency', url='https://www.reddit.com/r/NEO/comments/gntymq/pi_network_cryptocurrency/', created=1590047602.0, d_={'author': 'anonboyGR', 'created_utc': 1590054802, 'num_comments': 0, 'subreddit': 'NEO', 'title': 'Pi Network Cryptocurrency', 'url': 'https://www.reddit.com/r/NEO/comments/gntymq/pi_network_cryptocurrency/', 'created': 1590047602.0})]
```
notice how this is in fact not from 2017...
This is the link to the example I used: https://psaw.readthedocs.io/en/latest/#first-10-submissions-to-r-politics-in-2017-filtering-results-to-url-author-title-subreddit-fields
where's the comment
I see author, date, etc but isn't the comment supposed to be part of the payload
well it's just an example but this post didn't have a comment (num_comments = 0)
also it was immediately deleted.
My problem though is that the post is not from 2017 even though i was sticking exactly to the example in the link
hmm
looks like it's something this person put together to search public posts.. he's not with reddit
maybe you can raise an issue on his github
the last commit was march
yeah... i will probably find a way to make it work
I have issue with my DCGAN where the training basically halts at
Currently I am trying to read on the D/G module and how I can mess around with the activation functions
Hi guys. For emotion detection, which is the most accurate github project with pretrained models provided?
a = [1,2,3,4]
b = np.asfarray(a)
b
array([1., 2., 3., 4.])
Is there a way to change the dtype of np arrays back to int
like without turning it back to a list and converting to int with native python
i want to switch between numpy dtype to another numpy dtype in that framework, so hopefully nothing slows down too much
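One way to do this without leaving NumPy is `ndarray.astype`, which returns a new array with the requested dtype. A minimal sketch (the array values are just the ones from the question):

```python
import numpy as np

b = np.asarray([1, 2, 3, 4], dtype=float)  # array([1., 2., 3., 4.])

# astype makes a converted copy; float -> int drops the fractional part.
c = b.astype(np.int64)
print(c, c.dtype)

# If the floats are not whole numbers, round explicitly before converting.
d = np.rint(np.array([0.6, 1.4, 2.5])).astype(np.int64)
print(d)
```

The conversion runs in compiled code, so it should be much cheaper than going through a Python list.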
RuntimeError: size mismatch, m1: [8192 x 16], m2: [8192 x 16] at
What...
Hello, i am stuck on an error that I don't know what the problem is... see code and error message below. Please let me know what i'm doing wrong! Thanks!
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import Model
input_A = tf.random.normal([4,100],0,1)
input_B = tf.random.normal([4,100],0,1)
X = tf.matmul(input_A, tf.transpose(input_B))
X = tf.keras.layers.Dense(192)(X)
X = tf.keras.layers.Dropout(0.2)(X)
output = tf.keras.layers.Dense(1, activation='softmax')(X)
# print(input_A, input_B, input_C, output)
model = tf.keras.Model(inputs=[input_A, input_B], outputs = output)
ERROR MESSAGE
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-68-4c747a5ac852> in <module>()
13 # print(input_A, input_B, input_C, output)
14
---> 15 model = tf.keras.Model(inputs=[input_A, input_B], outputs = output)
16
6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in op(self)
1111 def op(self):
1112 raise AttributeError(
-> 1113 "Tensor.op is meaningless when eager execution is enabled.")
1114
1115 @property
AttributeError: Tensor.op is meaningless when eager execution is enabled.
@rustic igloo I think this might help you https://github.com/tensorflow/tensorflow/issues/27739
@uncut shadow Thanks !
👍
Hi. I was looking for a real-time emotion detection program written in Python that has the models pre-trained and available. Any suggestions?
I think this might be the right place to ask... I am looking for an API that lets me see weather data. Specifically monthly highs/lows/averages. There's so many things online but they are all about the live weather forecast.
Any ideas?
@lapis sequoia https://github.com/topics/emotion-recognition?l=python - many libraries.
One of the biggest struggles I find related to Data + python is regularly needing to un_nest keys in such a way that the data can be put into tables/CSV form for analysis.
I came across a stackoverflow post a while back which gave me a great function that I tweaked slightly but it's still running into issues with nested lists of dictionaries.
Are there any well known methods of accomplishing this? Or should I keep hacking on what I have?
Example dataset:
```
{
"key2": [
{"nested_key1": "nested_value1",
{"nested_key2": "nested_value2"},
{"nested_key1": "nested_value1",
{"nested_key2": "nested_value2"}
]
}```
This is the function (its 90% what I found on stack overflow with tiny edits from me while testing it)
```py
def flatten_dictionary(d):
result = {}
stack = [iter(d.items())] # Create a list of the dictionary's keys + values as tuples (k, v), (k, v), then put all that into a list
keys = []
while stack:
for k, v in stack[-1]: # Examine the LAST item in the list of tuples
keys.append(k)
if isinstance(v, list):
if len(v) > 0:
for item in v:
if item:
if isinstance(item, dict):
if len(item.keys()) < 1:
result['.'.join(keys)] = 'None'
else:
stack.append(iter(item.items()))
elif isinstance(item, list):
result['.'.join(keys)] = '.'.join(item)
keys.pop() # This may need to be re-commented out
else:
result['.'.join(keys)] = ''.join(str(v))
keys.pop()
break
break
else:
result['.'.join(keys)] = 'None'
keys.pop()
elif isinstance(v, dict):
if len(v.keys()) < 1:
result['.'.join(keys)] = 'None'
keys.pop()
else:
stack.append(iter(v.items()))
break
else:
result['.'.join(keys)] = str(v)
keys.pop()
else:
if keys:
keys.pop()
stack.pop()
return result```
bruh
what are you doing.. don't do this
this is not how you unnest structures..
in your nested structure, what data do you actually hope to use and how do you want it structured
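For comparison, a short recursive sketch of one common way to flatten nested dicts and lists into dotted keys. This is not the Stack Overflow function above, and the sample record is hypothetical. List positions become part of the key, and empty dicts or lists are simply dropped.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf_value}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become path segments
    else:
        return {prefix: obj}    # leaf value: nothing left to unnest

    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical record shaped like the example dataset.
record = {
    "key1": "value1",
    "key2": [
        {"nested_key1": "a", "nested_key2": "b"},
        {"nested_key1": "c", "nested_key2": "d"},
    ],
}
flat = flatten(record)
print(flat)
```

If the goal is a table with one row per list item rather than one wide row, `pandas.json_normalize` with its `record_path` argument may be a better fit, depending on which fields are actually needed.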
any resources on creating a genetic algorithm in keras?
how do i run a command when someone react on a message 🙂
need a project that takes webcam feed and tells in real time if you are happy, sad, surprised etc. It should have pretrained models etc
Any links?
Anyone able to help with a Pandas/Matplotlib question in Help-hydrogen?
so
```py
def plot_sums():
    index_confirmed = confirmed.set_index('Country/Region')
    confirmed_date_time = index_confirmed.iloc[:, 3:]
    summed_values = confirmed_date_time.sum(skipna=True)
    summed_values.plot.line()
```
I'm getting an exit code 0 from this, but my output contains no plot. What gives?
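A likely cause, assuming this runs as a plain script rather than in a notebook: the function only builds the figure, and nothing asks matplotlib to display it (and the function has to be called at all). A sketch with a tiny made-up `confirmed` frame, whose column layout is a guess at the real data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the real data; the column layout is an assumption.
confirmed = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["A", "B"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "1/22/20": [1, 2],
    "1/23/20": [3, 4],
})


def plot_sums(confirmed):
    index_confirmed = confirmed.set_index("Country/Region")
    confirmed_date_time = index_confirmed.iloc[:, 3:]  # keep only the date columns
    summed_values = confirmed_date_time.sum(skipna=True)
    return summed_values.plot.line()  # returns the matplotlib Axes


ax = plot_sums(confirmed)
plt.show()  # opens the window in a script; notebooks render without it
```

On a machine without a display, `ax.figure.savefig("sums.png")` writes the plot to a file instead.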
How can I vectorize this(it’s very slow)?
def backprop(self, dA_prev):
    """
    Back propagation in a max pooling layer
    :param dA_prev: derivative of the cost function with respect to the previous layer(when going backwards)
    :return: the derivative of the cost layer with respect to the current layer
    """
    x = self.cache['X']
    n_batch, ch_x, h_x, w_x = x.shape
    h_poolwindow, w_poolwindow = self.pool_size
    dA = np.zeros(shape=x.shape)  # dC/dA --> gradient of the input
    for n in range(n_batch):
        for ch in range(ch_x):
            curr_y = out_y = 0
            while curr_y + h_poolwindow <= h_x:
                curr_x = out_x = 0
                while curr_x + w_poolwindow <= w_x:
                    window_slice = x[n, ch, curr_y:curr_y + h_poolwindow, curr_x:curr_x + w_poolwindow]
                    i, j = np.unravel_index(np.argmax(window_slice), window_slice.shape)
                    dA[n, ch, curr_y + i, curr_x + j] = dA_prev[n, ch, out_y, out_x]
                    curr_x += self.stride
                    out_x += 1
                curr_y += self.stride
                out_y += 1
    return dA
What kind of derivative is this? I assume you are doing some kind of shooting method but I can’t follow the discretization.
@merry ridge
Max pooling
@merry ridge
That’s how I vectorized the forward propagation:
n_batch, ch_x, h_x, w_x = x.shape
h_poolwindow, w_poolwindow = self.pool_size
out_h = int((h_x - h_poolwindow) / self.stride) + 1
out_w = int((w_x - w_poolwindow) / self.stride) + 1
windows = as_strided(x,
shape=(n_batch, ch_x, out_h, out_w, *self.pool_size),
strides=(x.strides[0], x.strides[1],
self.stride * x.strides[2],
self.stride * x.strides[3],
x.strides[2], x.strides[3])
)
out = np.max(windows, axis=(4, 5))
But I can’t find a way to do so for the back-propagation...
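One possible way to vectorize the backward pass, sketched as a standalone function rather than a method (the name `maxpool_backward` and its arguments are made up here). It reuses the same `as_strided` window view, takes the argmax inside every window at once, and scatters the upstream gradient to those positions. It uses `np.add.at`, which accumulates when overlapping windows share a maximum; the loop version overwrites instead, so the two only agree exactly when windows do not overlap.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def maxpool_backward(x, d_out, pool_size, stride):
    """Route each upstream gradient to the argmax position of its window."""
    n_batch, ch_x, h_x, w_x = x.shape
    h_pool, w_pool = pool_size
    out_h = (h_x - h_pool) // stride + 1
    out_w = (w_x - w_pool) // stride + 1

    # Same window view as in the forward pass.
    s = x.strides
    windows = as_strided(
        x,
        shape=(n_batch, ch_x, out_h, out_w, h_pool, w_pool),
        strides=(s[0], s[1], stride * s[2], stride * s[3], s[2], s[3]),
    )

    # Flat argmax inside every window, converted back to (row, col) offsets.
    flat_idx = windows.reshape(n_batch, ch_x, out_h, out_w, -1).argmax(axis=-1)
    off_y, off_x = np.unravel_index(flat_idx, (h_pool, w_pool))

    # Absolute input coordinates of every window's maximum.
    n_idx, c_idx, oy, ox = np.indices((n_batch, ch_x, out_h, out_w))
    rows = oy * stride + off_y
    cols = ox * stride + off_x

    d_x = np.zeros(x.shape, dtype=d_out.dtype)
    # add.at sums contributions if several windows point at the same input cell.
    np.add.at(d_x, (n_idx, c_idx, rows, cols), d_out)
    return d_x


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
d_out = rng.normal(size=(2, 3, 2, 2))
d_x = maxpool_backward(x, d_out, pool_size=(2, 2), stride=2)
print(d_x.shape)
```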
@uncut shadow Thanks for the link, but I don't still quite understand. The article said this was a bug and was fixed, so why is this still occurring?
If it is suggesting to use tf.Variable for all parameters on moving average function, may I know which of variable in my code need to apply this?
Thanks.
anybody can help about ROC curve?
Guys if anyone here works with numpy pls tell me any advices to learn it and keep motivated(I want to learn it for ML and I started already but sometimes it seems a bit difficult)
Hey guys, I am currently learning and specializing in data science. I am really loving this topic. I am starting to think of ways to make it a my main source of income. Do you guys know what types of people would hire a Data Science Company/Product/Service and which problems are they usually trying to solve? (I am not trying to get hired by a company, but to start my own company that sells data science solutions)
@arctic canopy I'm also learning this. The best way I found is to practice numpy methods with a small set of code. Also worthwhile to read up on difference it has with other packages like pandas (which is also based on numpy).
@rustic igloo Thanks for your reply, I'm reading a book called Python for Data Analysis, so it will take me to pandas after I finish the numpy chapter. I will try to practice it more as you said. Also, can you give me some beginner project?
@arctic canopy try practicing something like this:
https://www.machinelearningplus.com/python/101-numpy-exercises-python/
@rustic igloo Thanks i will check it out
hello guys, I have one doubt: what is meant by
109ms/step - loss: 5.1975e-07 - accuracy: 1.0000 - val_loss: 0.0000e+00 - val_accuracy: 1.0000 this
self.networks[i].get_weights() + self.networks[v].get_weights()
(keras) whenever I try to add the weights of two different networks
it returns an error
is there a way I could bypass this?
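A possible explanation, hedged since the error message isn't shown: `get_weights()` returns a Python list of NumPy arrays, and `+` on two lists concatenates them instead of adding element by element. The sketch below uses plain arrays as stand-ins for the two networks' weights (the shapes are invented) and adds them pairwise:

```python
import numpy as np

# Stand-ins for what get_weights() returns: one list of arrays per network.
weights_a = [np.ones((3, 2)), np.zeros(2)]
weights_b = [np.full((3, 2), 2.0), np.ones(2)]

# list + list glues the lists together: 4 arrays, nothing is summed.
concatenated = weights_a + weights_b

# Pairwise sum (or average) keeps the original structure, layer by layer.
summed = [wa + wb for wa, wb in zip(weights_a, weights_b)]
averaged = [(wa + wb) / 2 for wa, wb in zip(weights_a, weights_b)]

print(len(concatenated), len(summed))
```

A list built this way has the same shapes as the originals, so it can be passed to `set_weights` on a network with the same architecture.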
@merry ridge Do you have any ideas?
Maybe extracting the windows and than summing over a certain axis? I really have no idea...
@arctic canopy
I noticed you were having trouble with Numpy. Check out the channel Coding Matrix. They have beginner friendly content. https://m.youtube.com/channel/UCKaajyjktvduM6mmuBtAOyg
@arctic canopy are you reading the book physically or electronically
Do we divide the gradients by the batch-size in Adam optimizer?
Because I haven’t seen that mentioned at all...
hi i am having following condition
```py
if result2 == 0:
    print("country name: Aba, document type: driving licence")
elif result2 == 1:
    print("country name: Aba, document type: Passport")
```
but always my 1st condition gets true i.e. "if" gets true
but now in my case my elif condition is true i am using passport image then also it is giving licence image as output
when i pass 'passport` image it is predicting 'licence image'
I have a dying question: how on earth do you output Jupyter notebooks to HTML without it looking truly terrible?!
the 'Export Notebook as HTML' option has the most horrific styling ^
how can I get something simple and clean that still has all the syntax highlighting etc without rewriting all the CSS?!
i have my image recognition model it is predicting "passport " as "licence " and viceversa. what can be the issue will be?
- Maybe 1 stands for driving license and 0 for passport in dataset?
- Your model might not trained with enough data (or there is something wrong with your model) which causes this.
- Maybe you should change the threshold for predicting those values?
Can anyone help me with this question?
You are designing a neural network to extract a feature map of size 50 x 50 from a colour image of size 100 x 100 x 3
What is the number of parameters if only one fully connected layer is used?
trying to study for my exam and i dont know where to begin with this question
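One way to start, assuming the 50 x 50 feature map has a single channel and the layer has one bias per output unit (the question does not say either way): a fully connected layer connects every input value to every output value, so the count is inputs times outputs, plus the biases.

```python
# Every input value is connected to every output value.
in_features = 100 * 100 * 3   # flattened colour image
out_features = 50 * 50        # flattened feature map (single channel assumed)

weights = in_features * out_features
biases = out_features         # one bias per output unit (assumed)
total = weights + biases

print(weights)  # 75000000
print(total)    # 75002500
```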
@blazing bridge thanks for the channel, im reading an electronic book
@uncut shadow now my model only predicting for "licence images" only . for "passport" image it is predicting as "licence " only.
could someone explain how numpy
np.swapaxes(..)
and
np.moveaxis(..)
is working, I am having a hard time visualizing it
Examples
x = np.zeros((3, 4, 5))
np.moveaxis(x, 0, -1).shape
(4, 5, 3)
np.moveaxis(x, -1, 0).shape
(5, 3, 4)
x
array([[[0, 1],
        [2, 3]],
       [[4, 5],
        [6, 7]]])
np.swapaxes(x,0,2)
array([[[0, 4],
        [2, 6]],
       [[1, 5],
        [3, 7]]])
I dont understand what is this 0,2 axis, how did that switch those number
uh helloo
ok i understand swapaxes, moveaxes though
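A small sketch that may help with visualizing both: `swapaxes` exchanges two index positions, while `moveaxis` pulls one axis out and reinserts it somewhere else, leaving the others in their original order.

```python
import numpy as np

x = np.arange(8).reshape(2, 2, 2)

# swapaxes(x, 0, 2): the first and last index trade places,
# so swapped[i, j, k] is the same element as x[k, j, i].
swapped = np.swapaxes(x, 0, 2)
print(swapped[1, 0, 0], x[0, 0, 1])  # same element

y = np.zeros((3, 4, 5))

# moveaxis: take axis 0 (length 3) and put it last; 4 and 5 keep their order.
moved = np.moveaxis(y, 0, -1)
print(moved.shape)                   # (4, 5, 3)

# swapaxes on the same pair just exchanges the two ends.
print(np.swapaxes(y, 0, -1).shape)   # (5, 4, 3)

# moveaxis is a shortcut for a transpose with the full axis order spelled out.
print(np.transpose(y, (1, 2, 0)).shape)  # (4, 5, 3)
```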
Can anyone suggest me what to learn for machine learning?
How to select columns that are in english only from a table of different languages?
@lapis sequoia maths and statistics basically... and a language to code in
Hi guys how would you approach comparing two groups that made the same survey, but the groups differ in age (survey filled in with likert scale responses between -3 and 3)
there is 50 likert items per respondent, so are we just checking normality (if even possible?) for each column (question)? that seems incorrect
should we just not check normality and use a nonparametric test?
this is how responses look
however, due to some NaNs, some of these responses were imputed with the mean of the row, which is a continuous value (for example 0.22)
@gusty willow well, if all you have is raw data (for example csv files or sth) then there is no way to do this tho. Computer cannot detect which language is it (I mean, not without machine learning)
@uncut shadow how with ML?
you can technically make a model which could detect what language is it
you would need a dataset for that
but data often doesn't have many columns so the best way would be to choose columns manually
hi
can somone help me
Can we assume normality of data if both our groups are > 30? (according to central limit theorem) ?
datapoints are discrete [-3, -2, -1, 0, 1, 2, 3]
You can assume normality of the mean of the data but not the data itself, that is what the central limit theorem says.
could you elaborate the diff?
Let's for instance say the data is uniformly distributed. Pretty far from a normal distribution right? However you pull many large sample sets from the data and find the mean of each one. Then if each sample set is big enough you will find that if you plot the means they look normally distributed. What does this mean in practice? It means no matter how your data is distributed, if you have enough samples you can treat the mean as normally distributed and do all the stuff you normally would do with a normally distributed value, e.g. hypothesis testing etc. But the data itself is not normally distributed. It's a common misunderstanding though.
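A quick simulation of that point, under made-up assumptions: responses are drawn uniformly from the seven Likert values, in 10,000 hypothetical groups of 30 respondents. The raw responses stay discrete and flat, while the group means cluster around 0 with a spread close to sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
likert = np.arange(-3, 4)  # the seven possible responses

# 10,000 hypothetical groups of 30 respondents, uniformly distributed answers.
samples = rng.choice(likert, size=(10_000, 30))
sample_means = samples.mean(axis=1)

# The raw data only ever takes 7 values: nowhere near a normal distribution.
print(np.unique(samples))

# The means are centred on the true mean (0) ...
print(sample_means.mean())

# ... with a spread close to sigma / sqrt(n) = 2 / sqrt(30), about 0.365.
print(sample_means.std(), 2 / np.sqrt(30))
```

A histogram of `sample_means` would show the familiar bell shape, even though a histogram of `samples` is flat.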