#data-science-and-ml
Whichever has an easier github repo at the time 😛
Where can i find data science projects to do as a beginner?
Like, is there an archive or smth for it
@still delta @covert spire You can try kaggle. It has everything you might need. It has datasets for you to use. It has solutions. It has contests and a lot more.
Is there any team working?
Hi. I need help with ARIMA.
I am using the code below to find p, d and q. (I saw it somewhere; it worked fine with another dataset but not with the one I'm trying currently.) It is to predict the number of daily covid cases in a country.
`model = pm.auto_arima(train, start_p=1, start_q=1,
test='adf', # use adftest to find optimal 'd'
max_p=3, max_q=3, # maximum p and q
m=1, # frequency of series
d=None, # let model determine 'd'
seasonal=False, # No Seasonality
start_P=0,
D=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
print(model.summary())`
It is giving me the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have removed all NaN values tho....
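One possible cause, offered only as a guess since the data isn't shown: the error message also covers infinite values, and `dropna()` does not remove those. A minimal sketch of how to check, using a hypothetical `cases` series where a log transform of a zero count produces `-inf`:

```python
import numpy as np
import pandas as pd

# Hypothetical daily case counts; log(0) gives -inf, which dropna() keeps.
cases = pd.Series([0.0, 3.0, 10.0, np.nan, 25.0])
with np.errstate(divide="ignore"):
    train = np.log(cases)

n_nan = int(train.isna().sum())      # NaN values
n_inf = int(np.isinf(train).sum())   # +inf / -inf values
print(n_nan, n_inf)

# Turn infinities into NaN, then drop every non-finite value.
clean = train.replace([np.inf, -np.inf], np.nan).dropna()
print(np.isfinite(clean.to_numpy()).all())
```

If both counts are zero on the real series, the problem is somewhere else (for example a non-numeric dtype).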
it is very inaccurate rn :
guys, how do i train a cnn with more than 2 classes???
Any ideas on how to plot a NumPy array? I've tried pcolor but that plots a mirrored version of the array and imshow() grid lines dont surround the array values properly
Thanks bro i'll look at it
Thanks! I've found a module which is a pipeline for processing text https://spacy.io/usage/spacy-101#pipelines, this is pretty similar to what I'm trying to achieve but with dataframes instead. https://github.com/explosion/spaCy/blob/master/spacy/pipeline/pipes.pyx --> this is the source code but I'm struggling to understand how they've built and organised the classes 😦
Sounds good. I'm going to focus on researching statistics & probability concepts for awhile. I understand how the comp sci processes work and what the models are trying to do. I could trial and error through it and my vision is reliant on chaos theory initial conditions for emergent fractal simulations not end to end engineered simulations. but Id like to have a decent understanding of statistics and probability concepts nevertheless. so i can help with validations, bias and variables et al.
speaking of which; is probability just another statistic? or are all statistics just a probability assessment?
of course there is an uncertainty principle to factor into this and that considering most of my understanding of statistics/probability comes from casual study of physics so symmetries/asymmetries, entropy, distributions and standard of deviations.
https://medium.com/analytics-vidhya/without-knowing-these-you-cant-be-a-data-scientist-b88deaba9533
it is based on functions of pandas
yup with financial and business intelligence/analytics at the top
Can someone rate my code from a purely engineering perspective?
Thanks!
Hi, I know the basics of machine learning and R-CNN theory. I want to make an object recognition program with Google/custom images. Can you recommend some algorithms and example implementations with TensorFlow? I know that Fast R-CNN and Faster R-CNN are better/harder, but I think there would be more examples, so I'm open to suggestions. I'm trying to work on vest.ai machines because of my computer's specs.
ahh the keyword is Vision Transformer not pixel word
hmm not sure @verbal light I tried to look if there were any APIs to use google image or bing image search engine
i mean something like "Open Images Dataset V6" https://storage.googleapis.com/openimages/web/download.html
ahh thats like a corpus of images
in sets already
not sure man ive been all NLP
whoa thats big data too
one thing NLP can do with character search and semantics of collocates et all is allow for the creation of smaller sub sets of virtual corpora
i suppose you could extract subsets based on this:
idk
heres a good old fashion cnn course
Offered by Coursera Project Network. This guided project course is part of the "Tensorflow for Convolutional Neural Networks" series, and this series presents material that builds on the second course of DeepLearning.AI TensorFlow Developer Professional Certificate, which will help learners reinforce their skills and build more projects with Ten...
Mathematics for Machine Learning
This is a 400-page free book about the mathematics needed for machine learning. It covers the things you need to know in order to get started with machine learning.
https://mml-book.com/
Hi, I have a question that is hard to explain, but I'll try. Can I summarize text from a PDF and then classify the main topic? For example, some text has different topics divided into multiple articles. I want the topics with a summary. I hope that's clear. How can I do it using Python, and which library would be useful? Thanks
Is any prior knowledge assumed?
Does anyone know how to change figure size when using plt.subplot?
It doesn't work the same as plt.subplots()
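A minimal sketch for the `plt.subplot` case: since `plt.subplot()` adds an axes to the current figure, the size is set on the figure itself, either when it is created or afterwards. The plotted data here is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Size the figure first; plt.subplot() then draws into it.
fig = plt.figure(figsize=(10, 4))   # width, height in inches
ax1 = plt.subplot(1, 2, 1)
ax1.plot([0, 1, 2], [0, 1, 4])
ax2 = plt.subplot(1, 2, 2)
ax2.plot([0, 1, 2], [4, 1, 0])

# An existing figure can also be resized later.
fig.set_size_inches(12, 5)
print(fig.get_size_inches())
```

Making the figure taller this way leaves the axis limits untouched.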
Have a read of Who Is the Target Audience? in the book
basically they claim high school maths
hi. Could u help me making a cnn for image classification? All the examples ive seen are about cat/dogs with already image dataset from keras. But i want to use my own data set, and there are more than 2 categories
@up just use a loss other than binary-crossentropy
use categorical with number of categories specified
or sparse-categorical 😛
thats the name of the model i need?
idk, most likely
find any digit recognition tutorial and read the code
it will be most likely something really similar to google one, but with categorical model
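To make the "categorical" vs "sparse-categorical" distinction concrete, here is a NumPy-only sketch (not Keras code) of the two losses. They compute the same value and differ only in the label format: one-hot rows versus integer class ids. In Keras the matching loss names are `categorical_crossentropy` and `sparse_categorical_crossentropy`, with a softmax output layer that has one unit per class. The probabilities below are made up.

```python
import numpy as np

def categorical_crossentropy(one_hot, probs):
    """Mean cross-entropy for one-hot labels of shape (n, classes)."""
    return float(-np.mean(np.sum(one_hot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(labels, probs):
    """Same loss, but labels are integer class ids of shape (n,)."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Hypothetical softmax outputs for 3 images and 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
labels = np.array([0, 1, 3])      # integer labels -> "sparse" variant
one_hot = np.eye(4)[labels]       # one-hot labels -> "categorical" variant

loss_cat = categorical_crossentropy(one_hot, probs)
loss_sparse = sparse_categorical_crossentropy(labels, probs)
print(loss_cat, loss_sparse)
```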
i do have ideas for computer vision but without doing more research i have no way to determine how relevant, viable or outlandish the ideas are. until i finish working with NLP someone run a highly experimental unsupervised learning model with the mandelbrot set zoom as the input image. id love to take a look at what the model sees
and apply it to this:
Clustering is cool
guys my train data seems like this
how can i pass that as train data for a cnn?
i already got the labels, which are basically the name of the folders
but inside each folder there are images
how can i tell the cnn "all the images from this folder correspond to this label"
tf.keras.preprocessing.image_dataset_from_directory turns image files sorted into class-specific folders into a labeled dataset of image tensors.
is this what i want?
well, it says i dont have such a function
yeah that's what you'd want
and you'd set label mode to whatever label mode you'd want to use, by default it's sparse labels
yeah but that method isnt implemented i think
not on tensorflow 2
The specific function (tf.keras.preprocessing.image_dataset_from_directory) is not available under TensorFlow v2.1.x or v2.2.0 yet.
so... could u help me to do it manually?
@trim imp
nltk would be useful
im not sure about summarizing, but you can def do topic modeling
Topic modeling is an unsupervised machine learning technique that's capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.
Hello, Im working on utilizing tripletloss for MNIST. I got something running and the Loss for the Training and Validation is getting smaller as expected every epoch, but the Accuracy is sticking around 18-20% and its just basic MNIST so something is def wrong,
I have a basic 2 conv layer 3 fc layer architecture. I put the anchor, pos and neg through that model, then put those results into the TripletMarginLoss on pytorch
any recommendations on what I can do?
hello, im looking for an A.I server
im interested in A.I and python
any suggestions?
im already in that one, but im looking for a small and you know less active server
That makes absolutely no sense
You would rather a small inactive server over one with 9000+ people
no, this server is very active and often people dont reply to my questions so
thats why i need a small server
use pandas please
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
your file is using another encoding and you have to find out which one it is
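If the encoding is unknown, one rough approach is to try a few candidates in order. This is only a sketch: `read_text` is a made-up helper, the candidate list is an assumption, and `latin-1` is placed last because it decodes any byte sequence (possibly into the wrong characters).

```python
import os
import tempfile

def read_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                return fh.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo file: byte 0xE9 is invalid here as UTF-8, and 0x90 is undefined in CP1252.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(b"caf\xe9 \x90")
    text, used = read_text(path)

print(used)
```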
Thanks 👍
😸
do you know a site to learn about CNNs?
An open source event grp
tell me more
Which happens in October every year
More than 70k Dev's take part in it
It's the biggest open source event
but it's november now
Yep
Any expertise using the orange software for data analysis and visualization ? what is the difference between t-SNE block and manifold t-SNE block? why both results are not the same ?
can someone help me to implement it?
https://stackoverflow.com/questions/54921711/interactive-labeling-of-images-in-jupyter-notebook
Interesting post about image labeling in Python. Maybe you can extract the important part for your function.
Be precise. In which context is it used? Documentation or code?
lf help with periodogram of sinusoidal signals with normalized frequency and dB power
signal:
N = 1024; f1 = 500; f2 = 1200; fs = 8000 #Hz
n = np.arange(N)
Sn = 0.5*np.sin(2*np.pi*n*f1/fs) + np.sin(2*np.pi*n*f2/fs)
I have to generate PSD (power spectral density) with and without Hann window
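A NumPy-only sketch of one way to get the periodogram against normalized frequency in dB, with a rectangular window ("without") and a Hann window. The scaling convention (power per unit of normalized frequency, one-sided) is an assumption; divide by `fs` as well if a per-Hz density is needed.

```python
import numpy as np

N = 1024; f1 = 500; f2 = 1200; fs = 8000  # Hz
n = np.arange(N)
Sn = 0.5*np.sin(2*np.pi*n*f1/fs) + np.sin(2*np.pi*n*f2/fs)

def periodogram_db(x, window):
    """One-sided PSD estimate in dB vs normalized frequency (cycles/sample)."""
    X = np.fft.rfft(x * window)
    psd = np.abs(X) ** 2 / np.sum(window ** 2)   # normalize by window energy
    psd[1:-1] *= 2                               # fold in negative frequencies (N even)
    f_norm = np.fft.rfftfreq(len(x))             # 0 ... 0.5 cycles/sample
    return f_norm, 10 * np.log10(psd + 1e-20)    # small offset avoids log10(0)

f_norm, psd_rect = periodogram_db(Sn, np.ones(N))   # no window
_, psd_hann = periodogram_db(Sn, np.hanning(N))     # Hann window

peak_rect_hz = f_norm[np.argmax(psd_rect)] * fs
peak_hann_hz = f_norm[np.argmax(psd_hann)] * fs
print(peak_rect_hz, peak_hann_hz)
```

Plotting `f_norm` against `psd_rect` and `psd_hann` with matplotlib shows the difference in leakage around the 1200 Hz component, which does not fall exactly on an FFT bin.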
mmm this is not what i was looking for i believe. I have my dataset like this https://gyazo.com/10ad185f8027af44c0e9e2edb9200a6f and each folder has the images. All the images in each folder have the folder name as label. I was planning on using like 80% of the images of each folder as train for that specific label, and the rest for validation. I just dont know how to tell that to keras
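A library-independent sketch of the 80/20 idea: walk the class folders, use each folder name as the label, and split the file list per folder. `split_by_folder` and the extension list are made up for illustration; the resulting `(path, label)` pairs would still need to be loaded into tensors. Keras' `ImageDataGenerator(validation_split=0.2)` with `flow_from_directory(..., subset="training")` accepts the same folder layout, if that is available in your version.

```python
import random
import tempfile
from pathlib import Path

def split_by_folder(root, train_fraction=0.8, seed=0,
                    extensions=(".jpg", ".jpeg", ".png")):
    """Return (train, val) lists of (file_path, label); label = folder name."""
    rng = random.Random(seed)
    train, val = [], []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir()
                       if f.suffix.lower() in extensions
                       and not f.name.startswith("."))   # skip hidden files
        rng.shuffle(files)
        cut = int(len(files) * train_fraction)
        train += [(str(f), class_dir.name) for f in files[:cut]]
        val += [(str(f), class_dir.name) for f in files[cut:]]
    return train, val

# Demo with empty placeholder files in three hypothetical class folders.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("cats", "dogs", "birds"):
        (Path(tmp) / label).mkdir()
        for i in range(10):
            (Path(tmp) / label / f"{i}.jpg").touch()
    train, val = split_by_folder(tmp)

print(len(train), len(val))
```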
for example i run my_array.shape where 'my_array' is a numpy array, and its shape is (102,) and not (102,1) for example, why?
hey
so guys i wanna make a program that does not specify my needs
do any of you have any recommendations django?
or anything else?
Now I understand the question. Okay, I'll make it easy.
An array with shape (500, 1) is 2-dimensional: 500 rows and 1 column.
LIKE --> np.array([[1],[2],[3]...[n]])
An array with shape (500,) is 1-dimensional and has 500 elements.
LIKE --> np.array([1,2,3 ... n])
It just says you have an array with n elements and not an array with n rows and m columns.
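A short sketch of the same point, including how to move between the two shapes with `reshape` and `ravel`:

```python
import numpy as np

my_array = np.arange(102)
print(my_array.shape)              # (102,)  -> 1-dimensional, 102 elements

column = my_array.reshape(-1, 1)   # -1 lets NumPy work out the row count
print(column.shape)                # (102, 1) -> 2-dimensional, 102 rows, 1 column

flat = column.ravel()              # back to 1-dimensional
print(my_array.ndim, column.ndim)  # 1 2
```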
@fallow prism if you need more, i can send you some good StackOverflow links
of course!! thank you @lapis sequoia you were clear
No problem. Better than study today 🙃
hahaha it's worse to study on a tuesday
i found this; https://neo4j.com/
Seems like a lot of adoption for this graph data platform
id converge all your datasets there
aside from learning playgrounds or research and development
all your startups career business medical IT security et all
Hi! I need a quick help regarding group by in pandas
unless there is a better option?
Dataset looks like this^ ...ignore column event time.
I want to groupby the dataset by install time, event name, campaign, and siteid...and sum event revenue and add a new column which counts rows of event name
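A minimal sketch of that groupby, assuming column names such as `install_time`, `event_name`, `campaign`, `siteid` and `event_revenue` (the screenshot isn't available, so both the names and the data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "install_time": ["2020-11-01", "2020-11-01", "2020-11-01", "2020-11-02"],
    "event_name": ["purchase", "purchase", "login", "purchase"],
    "campaign": ["A", "A", "A", "B"],
    "siteid": [1, 1, 1, 2],
    "event_revenue": [5.0, 7.5, 0.0, 3.0],
})

summary = (
    df.groupby(["install_time", "event_name", "campaign", "siteid"], as_index=False)
      .agg(event_revenue=("event_revenue", "sum"),   # total revenue per group
           event_count=("event_revenue", "size"))    # number of rows per group
)
print(summary)
```

`"size"` counts rows including missing revenue; `"count"` would skip rows where the revenue is NaN.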
there is this snag: 12. No Export. You agree and certify that neither the Product nor any other technical data received from Neo4j,~~~~
for a startup that wants to step up and build their own platform after using the graph dataset that might be an issue everything up to that part of the agreement was solid. The question is tho would their platform handle model cache, validation test scores and metrics so data plus set plus model card plus
Hello everyone! Im having a bit of an issue and was wondering if you can help me. Im trying to train a neural network. Im having issues with fitting my model. When it goes into the directory where my images are it prepends "._" before my image name and can't for the life of me figure out why.
Are you using any libraries? Keras / Tensorflow?
Both
oh yeah Neo4j = awesomeness
Can you give a short run down of the model?
I've had this problem before and I think it was something with a conv2d but don't know how I fixed it. Programming 101
Im using a mobilenet I fine tuned by removing the last 6 layers and added a Dense layer at the end as output
Linear activation, adam?
softmax, adam
have you added breakpoints and tried to see where the name changes?
Negative. im a bit of a noob. I'll try that now.
Kk. (Don't worry, I'm no developer, just a High schooler with youtube and LinkedIn Learning, also a noob)
Im not sure what Im looking at. Is it ok to post the error on here?
https://grandstack.io/ yoooo
Build Fullstack GraphQL Applications With Ease
not only is it a multi-code editor on the datagraph side but this architect app lets you build dataset with point and click & code
thats kind of what I was thinking of
wait its
scatterplot with meta waveform over top
that is neat
if you go up one more layer or dimension guess what it is....
the initial start of a statistical fractal
probably sounds more useful than it is or poetic really
I am curious how to make this waveform
its the first example I have seen its like a graph layered over a graph the waveform itself is probably statistical deviation from zero but i saw its relation to the point cloud right away
oh. I see you are decently competent in statistics and maths too, right? I've got a question related to percentiles, though...
considering this graph, I can see a correlation. anyway
I am working with a pandas df in a Jupyter notebook and am trying to drop rows on the condition that df['overall_status'] == 'Recruiting' and df['Fraction accrued'] is NaN. I have tried using the functions .isna(), .isnull(), and also tried df['Fraction accrued'].replace('',np.nan,inplace=True) followed by df['Fraction accrued'] =='True'. I get the error: "unhashable type: 'list'". Here's my full code:
index_names = df_cancer_drop.drop([((df_cancer_drop['overall_status']=='Recruiting') & (df_cancer_drop['Fraction accrued'].isna()))].index)
df_cancer=df_cancer_drop.drop(index_names,inplace=True)
How can I correctly write the logical statement to drop these rows?
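Two things seem to go wrong in the snippet above: the boolean mask is wrapped in a list instead of being used to select rows, and `drop(..., inplace=True)` returns `None`, so assigning its result loses the frame. A minimal sketch of two equivalent fixes, on made-up data with the same column names:

```python
import numpy as np
import pandas as pd

df_cancer_drop = pd.DataFrame({
    "overall_status": ["Recruiting", "Recruiting", "Completed", "Completed"],
    "Fraction accrued": [np.nan, 0.4, np.nan, 1.0],
})

# Rows to remove: status is 'Recruiting' AND 'Fraction accrued' is missing.
mask = (df_cancer_drop["overall_status"] == "Recruiting") & df_cancer_drop["Fraction accrued"].isna()

# Option 1: keep everything that does not match the mask.
df_cancer = df_cancer_drop[~mask]

# Option 2: drop by index label; without inplace=True, drop() returns the new frame.
df_cancer_alt = df_cancer_drop.drop(df_cancer_drop[mask].index)

print(df_cancer)
```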
I have the following array
a = [ 1, -9, -15, -11, -19, 2, -15, 3, 8, -8, -5, -14, -5, 1, -19]
And when I'm computing np.percentile(a, 99)
I get this confusing output: 7.299999999999997
Shouldn't it simply return -19?
yeah there is a correlation im just not sure exactly what to call it perhaps values of y but if data visualization is analytics what is that telling me...
nope its correct.
a = [ 1, -9, -15, -11, -19, 2, -15, 3, 8, -8, -5, -14, -5, 1, -19]
a = sorted(a)
print(np.percentile(a, 99))
to calculate percentiles by hand your data must be sorted as well.
I've learned only basics of percentile, such as 25/50/75 so how is it computed for not 'boring' values?
once your data is sorted you can calculate correct percentiles by hand. But numpy calculates the correct percentiles either way, prolly by sorting the array during calculation.
oh, i got it. after sorting the array I got 8 as the last element and 3 as penultimate. so it explains it now
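For reference, the 7.3 comes from linear interpolation between the two largest values, which is what `np.percentile` does by default. A sketch of the calculation by hand next to NumPy's result:

```python
import numpy as np

a = [1, -9, -15, -11, -19, 2, -15, 3, 8, -8, -5, -14, -5, 1, -19]
q = 99

s = sorted(a)
pos = (q / 100) * (len(s) - 1)     # fractional position: 0.99 * 14 = 13.86
lo = int(np.floor(pos))            # index 13 -> value 3
hi = min(lo + 1, len(s) - 1)       # index 14 -> value 8
by_hand = s[lo] + (pos - lo) * (s[hi] - s[lo])   # 3 + 0.86 * 5

print(by_hand, np.percentile(a, q))   # both approximately 7.3
```

The unsorted and the sorted list give the same answer, so sorting beforehand is only needed for the by-hand version.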
thanks 😄
can numpy cancel out the positive and negative integers?
btw, I grabbed this picture from sklearn site, there was a comparison of different scalers
i just noticed scalar on the graph data platform
happy to help 😄
first example i have seen
i got a bit excited 🙂
if i do cancel them out would that reduce noise or would i lose value?
The data value plotting may show the correlation (there can be a positive correlation if noise is reduced) and the x,y graphs might show the data distributions.
Trying to work on a python script that will save a string as a .sql file, anybody know how to go about this?
sql seems more of a #databases thing to me
try this
with open('your_said_.sql', 'w') as file:
    file.write("stuff you want to write")
    file.close()
you don't need the file.close() since you have the with block
isnt it better to close the file too ?
with automatically closes it thats why
you learn something new everyday haha. niceeeee 👍
thanks i'll try that
yep :)
I assume I'll be able to pass variables as the names of the files?
yes you can.
but you need to concatenate .sql with the variable as well.
for example your_variable + ".sql"
Okay cool. How will that work with the quotation marks?
no i just want the variable to be the names of the file
Okay cool thanks
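Putting the pieces together, a small sketch with the file name coming from a variable. `save_sql` and the `table` variable are made up for the example; an f-string handles the quotation marks without manual concatenation.

```python
import os
import tempfile

def save_sql(name, sql_text, folder="."):
    """Write sql_text to '<folder>/<name>.sql' and return the path."""
    path = os.path.join(folder, f"{name}.sql")
    with open(path, "w", encoding="utf-8") as fh:   # closed automatically on exit
        fh.write(sql_text)
    return path

table = "users"   # hypothetical variable holding the file name
with tempfile.TemporaryDirectory() as tmp:
    path = save_sql(table, "SELECT * FROM users;", folder=tmp)
    with open(path, encoding="utf-8") as fh:
        saved = fh.read()

print(os.path.basename(path))
```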
In a django project where could I manipulate data for data science? any help is much appreciated!
also sorry if this is not the right channel
You could just @/# something here but if your question is about data science I would leave it
ok cool
sure
I know you can make a django website that changes data live when your data changes, I would think you can also change the data on the website too, sounds a little more advanced tho
yea I am kinda a noob
i suggest learning the basics of django before doing that type of stuff, so it will be less complicated when you get there and there will be fewer errors to deal with, will save days of your life probably.
I am sure you knew that but you can never be too sure.
ok, I might try a udemy course or something
Are JSON and JSON-stat different formats? Do I need to use a different library to handle JSON-stat?
@stone tangle You could also check out Streamlit or Flask, it's a bit easier to do things like Machine Learning as a Service (MLaaS)
ok will do
wait this might be a better place for pandas dataframe questions. I want to iterate through a list of dataframes. I have a number of functions where the function takes the dataframe as an argument. But! I want to do different things depending on which dataframe in the list is being put into the function.
```python
#define my functions first
def function1(df):
    if df.name == df1: #doesn't actually work!!!
        #do thing
    elif df.name == df2:
        #do thing differently
#same basic structure for the rest of my functions
df_list = [df1, df2, df3, df4, df5, df6]
for df in df_list:
    df.name = df #(doesn't actually work)
    function1(df=df)
    function2(df=df)
```
is basically what my stuff looks like.
but I can't do it that way.
A pandas *series* can be given a name attribute but not a dataframe.
if df.name == df1: **#doesn't actually work!!!**
*ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any(), or a.all().*
So of course I google the error and check the top SO links. I try to create a dictionary as a top reply suggests.
```python
dfs = {'some_label' : df} #is what they type out
```
but when I try to use df.name = dfs[df] or dfs = {df1 : 'df1' , df2 : 'df2'} I get TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed from both of those. I must not be using a dictionary right or this isn't a good solution.
I would like to be able to keep the inside of my functions along the lines of
if df1 then a
elif df2 then b
elif df3 then c
but, well, the ways I've gone about this are giving me error messages (tried .name, tried making a dict.) help?
like that @spark dirge
oh! muskrat
maybe you can use WHERE? or REPLACE?
numpy.where(condition[, x, y])
Where True, yield x, otherwise yield y.
like, iterate through the column and the condition is true then keep but false replace with nan and then remove nans?
I feel like I had to do something like this once, hang on let me go over my recent projects
df = df.loc[df['marketcap'] <= 1000000000]
@civic fractal what happens when you try that?
scropie that for me?
not sure what you are trying to do. looking for context.
I'm just trying to get some outside eyes on my dataframe problem and then elongatedmuskrat posted after me so I tried to help with their problem and now I'm just chilling here
df_list = [df1, df2, df3, df4, df5, df6]
df_mp = {}
for df in df_list:
    df_mp[df.name] = some_func(df)
print(df_mp)
You just want a list of dataframes sent through a function and in a collection?
I'll try that now. Sorry I went offline for a bit haha
@spark dirge what you tryna do exactly?
the inside of some of my functions look like
if df == df1:
    #do thing
if df == df2:
    #do other thing
but I get that value error. I thought I could assign a name to each df and then instead it's if df.name == '' then do thing
but that hasn't worked
I went to SO and read up a similar problem and the person was advised to make a dictionary and I feel I must be doing something wrong because THAT gives me an error
https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe/31727504#31727504
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
I'm trying to help Silver and scropie. not sure what they are trying to do though.
@snow compass two simple ways
which more or less lead to the same thing
create a list of (df, function) tuples
and iterate through that
I mean, I need all of my dataframes to be put through all of the functions. it's just I need to write to a different row depending on which dataframe is being run, for example.
how to change the timezone in a timestamp column
pandas
im pulling a report from my db
and need to change the time to est (everyone using this app will be on est)
!d pandas.Series.dt.tz_convert
```py
Series.dt.tz_convert(*args, **kwargs)
```
Convert tz-aware Datetime Array/Index from one time zone to another.
Parameters: **tz** (str, pytz.timezone, dateutil.tz.tzfile or None): Time zone for time. Corresponding timestamps would be converted to this time zone of the Datetime Array/Index. A tz of None will convert to UTC and remove the timezone information.
Returns: Array or Index. Raises: TypeError if Datetime Array/Index is tz-naive.
See also
`DatetimeIndex.tz`: A timezone that has a variable offset from UTC.
`DatetimeIndex.tz_localize`: Localize tz-naive DatetimeIndex to a given time zone, or remove timezone from a tz-aware DatetimeIndex.
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html#pandas.Series.dt.tz_convert)
@real wigeon
thats from converting time-zone aware columns
if your column isnt time-zone aware you'd need to make it time-zone aware
!d pandas.Series.dt.tz_localize
```py
Series.dt.tz_localize(*args, **kwargs)
```
Localize tz-naive Datetime Array/Index to tz-aware Datetime Array/Index.
This method takes a time zone (tz) naive Datetime Array/Index object and makes this time zone aware. It does not move the time to another time zone. Time zone localization helps to switch from time zone aware to time zone unaware objects.
Parameters: **tz** (str, pytz.timezone, dateutil.tz.tzfile or None): Time zone to convert timestamps to. Passing `None` will remove the time zone information preserving local time.
**ambiguous** (‘infer’, ‘NaT’, bool array, default ‘raise’): When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
• ‘infer’ will attempt to infer fall dst-transition hours based on order
... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_localize.html#pandas.Series.dt.tz_localize)
Thank you @austere swift
I havent had a chance to research scalar models yet but the meta waveform could be a way to get a glance at the distribution or symmetry between the two axis without having to scope out the numbers
a visual aid layer not a meta or extrapolation or just a visual aid for the deviation from zero or the symmetry of the two almost as if some law of large numbers set
the distribution is symmetrical but the vertical waveform is front whereas the horizontal waveform is in the middle
It seems like you are being sort of particular about how you do this without letting us know why, so expect a lot of solutions that don't quite hit your requirements (because we don't know them / why they exist)
#define my functions first
def function1(df, dfname):
    if dfname == 'df1':
        pass  #do thing
    elif dfname == 'df2':
        pass  #do thing differently
#same basic structure for the rest of my functions
dict_dfs = dict()
dict_dfs['df1'] = df1
dict_dfs['df2'] = df2
dict_dfs['df3'] = df3
dict_dfs['df4'] = df4
for dfname in dict_dfs.keys():
    function1(df=dict_dfs[dfname], dfname=dfname)
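A runnable variation on the same idea, as a sketch only: the name travels with each frame as a dictionary key, and the per-frame difference (here a hypothetical target row to write to) lives in a lookup table rather than an if/elif chain. The frames, `target_row` and `column_total` are all made up.

```python
import pandas as pd

# Hypothetical frames, keyed by name.
frames = {
    "df1": pd.DataFrame({"x": [1, 2, 3]}),
    "df2": pd.DataFrame({"x": [10, 20, 30]}),
}
target_row = {"df1": 0, "df2": 5}   # where each frame's result should be written

def column_total(df):
    """Same calculation for every frame."""
    return int(df["x"].sum())

results = {}
for name, df in frames.items():
    # Only the destination differs per frame, so look it up by name.
    results[name] = {"row": target_row[name], "total": column_total(df)}

print(results)
```

This keeps the functions themselves identical for every frame, which matches the "same math, different destination" description.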
the offset could be the rate of the axis range the vertical increase by 1 so its waveform is always front where as the horizonal increases an even amount along the axis so its waveform is right in the middle
@snow compass Maybe a better way of doing this is to build custom classes where the different functions do different things depending on the class that you pass. But that would only be better if you had multiple dataframes of the same type that should have the same thing done if they are passed to the same function.
id very much like to see how the graph looks at different intervals same axis range but increases that are unusual
the following ideas are highly experimental.
i would like to have 2 more graphs in a grid exact opposites counting down from the max established from before. This creates a min max wave
a nice symmetrical one at that.
id iterate two more times then have the fourth layer be only min/max and target data the rest is truncated as noise.
how could i check if it's aware
```py
Series.dt.tz
```
Return timezone, if any.
Returns: datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None. Returns None when the array is tz-naive.
if that's None then it's not aware
doing this
```py
timestamps = df["upload_timestamp"].dt.tz
print(timestamps)
```
resulted in `None`
i presume im selecting the df column properly
erm im kind of a noob
If i remember correctly pandas is kind of weird
```py
from dateutil import tz  # assuming tz here is dateutil.tz

def convert_timezone(self, x):
    from_zone = tz.gettz('UTC')
    to_zone = tz.gettz('America/New_York')
    return x.replace(tzinfo=from_zone).astimezone(to_zone)

df['Creation Date'] = df['Creation Date'].apply(lambda x:self.convert_timezone(x))
df['Creation Date'] = df['Creation Date'].apply(lambda x:x.tz_localize(None))
```
right but that's just applying that logic to the column
doesn't pandas handle that kind of weird, because the result is a series
and id need that as a part of the df
aren't they two separate entities now
wont doing this result in a series object, which is considered separate from the pandas df?
or is doing df['Creation Date'] applying it to the df, but only to the column Creation Date
i am noob
this will apply on ur date column only
do i need to localize to my time zone, or mark it as UTC
because localize only makes it aware (my data in the db is UTC), it doesn't convert.
I have to make a Bayes classifier for a dataset where each object gets one continuous feature and its class label. But how do you even apply Bayes for continuous data?
binning?
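Besides binning, a common alternative is to assume the feature is normally distributed within each class and plug the Gaussian density into Bayes' rule (Gaussian naive Bayes). A NumPy sketch for a single continuous feature; the normality assumption and the toy data are mine, not from the assignment.

```python
import numpy as np

def fit(x, y):
    """Estimate prior, mean and variance of the feature for each class."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        params[c] = (len(xc) / len(x), xc.mean(), xc.var() + 1e-9)
    return params

def predict(params, x):
    """Pick the class with the largest log prior + Gaussian log-likelihood."""
    classes = list(params)
    scores = np.array([
        np.log(prior) - 0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
        for prior, mean, var in (params[c] for c in classes)
    ])                                   # shape: (n_classes, n_points)
    return np.array(classes)[np.argmax(scores, axis=0)]

# Toy data: class 0 clusters near 1.0, class 1 near 3.0.
x = np.array([1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
params = fit(x, y)
pred = predict(params, np.array([0.9, 3.05]))
print(pred)
```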
thats what the other function does
tz_convert
yes but im asking
```py
make_timestamps_tz_aware = df["upload_timestamp"].dt.tz_localize(tz='UTC', ambiguous='infer')
```
Since my data in the db is `UTC`
or should I set it to my local timezone
i mean you do tz_localize and then tz_convert
yes correct
but do you localize to UTC
or EST
the data is in UTC
i went with UTC
well in the examples it shows you could use est
see how after the localization it shows -5:00
that means that when it localized with est it assumed the original values were utc
so i think you can just use that
I'm not completely sure tho lol
ok cool
i mean it says that it does not convert
.>
alright well, idk how to place that column back into my df
instead of assigning that value to make_timestamps_tz_aware just assign it back to df["upload_timestamp"]
what do you mean
df["upload_timestamp"] = df["upload_timestamp"].dt.tz_localize(tz='UTC', ambiguous='infer')
oh
i was actually going to do this
```py
make_timestamps_tz_aware = df["upload_timestamp"].dt.tz_localize(tz='UTC', ambiguous='infer')
make_timestamps_tz_est = make_timestamps_tz_aware.tz_convert('US/East')
make_timestamps_tz_est.to_excel('location/output.xlsx', index=False)
```
that works too
```py
pandas.to_datetime(arg: DatetimeScalar, errors: str = '...', dayfirst: bool = '...', yearfirst: bool = '...', utc: Optional[bool] = '...', format: Optional[str] = '...', exact: bool = '...', unit: Optional[str] = '...', infer_datetime_format: bool = '...', origin='...', cache: bool = '...') → Union[DatetimeScalar, 'NaTType']
pandas.to_datetime(arg: 'Series', errors: str = '...', dayfirst: bool = '...', yearfirst: bool = '...', utc: Optional[bool] = '...', format: Optional[str] = '...', exact: bool = '...', unit: Optional[str] = '...', infer_datetime_format: bool = '...', origin='...', cache: bool = '...') → 'Series'
pandas.to_datetime(arg: Union[List, Tuple], errors: str = '...', dayfirst: bool = '...', yearfirst: bool = '...', utc: Optional[bool] = '...', format: Optional[str] = '...', exact: bool = '...', unit: Optional[str] = '...', infer_datetime_format: bool = '...', origin='...', cache: bool = '...') → DatetimeIndex
```
Convert argument to datetime.
Parameters: **arg** (int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like): The object to convert to a datetime.
**errors** ({‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’)
• If ‘raise’, then invalid parsing will raise an exception.
• If ‘coerce’, then invalid parsing will be set as NaT.
• If ‘ignore’, then invalid parsing will return the input.
**dayfirst** (bool, default False): Specify a date parse order if arg is str or its list-likes. If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10. Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior).
**yearfirst** (bool, default False): Specify a date parse order if arg is str or its list-likes.
• If True parses dates with the year first, eg 10/11/12 is parsed as 2010-11-12.
• If both dayfirst and yearfirst are True, yearfirst is preceded (same as dateutil).
... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime)
before the localize stuff
no, you can do it on a single column
no, its not a function from the df its from pandas
yeah
so pd.to_datetime(df["upload_timestamp"])
and then you'd need to set the format arg so it can see how to format it
oh oops typo
i put datatime instead of datetime lol
looking up the formatting
its kinda like datetime strptime
im going to try this
```py
df = pd.DataFrame(query_resolution, columns=['upload_timestamp', 'email', 'was_this_a_pandemic_related_call',
                                             'what_was_the_call', 'was_the_inquiry_resolved'])
pd.to_datetime(df["upload_timestamp"], format='mm/dd/yyyy HH:mm:ss')
make_timestamps_tz_aware = df["upload_timestamp"].dt.tz_localize(tz='UTC', ambiguous='infer')
make_timestamps_tz_est = make_timestamps_tz_aware.tz_convert('US/East')
make_timestamps_tz_est.to_excel('location/output.xlsx', index=False)
```
yes
i dont think this is quite correct
pd.to_datetime(df["upload_timestamp"], format='%mm/%dd/%yyyy %HH:%mm:%ss')
no
https://www.programiz.com/python-programming/datetime/strftime scroll down a bit and you'll see a list of all the codes
In this article, you will learn to convert datetime object to its equivalent string in Python with the help of examples. For that, we can use strftime() method. Any object of date, time and datetime can call strftime() to get string from these objects.
thats for datetime strftime but i think it should be the same for pandas
I will come back to this in the morning. I'll need to look up classes and see if that's my solution. I'm performing the same math and using the same pandas functions on each of the dataframes. I'm just writing to different rows or writing to different worksheets in the workbook I'm writing to.
dam it's still the same error @austere swift
make_timestamps_tz_est = make_timestamps_tz_aware.tz_convert('US/East')
it's not an inplace function btw
it returns the output
ohh
so you'd need to assign it back to the original column
or assign it to an intermediary variable that you then use for the other modifications
code?
df = pd.DataFrame(query_resolution, columns=['upload_timestamp', 'email', 'was_this_a_pandemic_related_call',
'what_was_the_call', 'was_the_inquiry_resolved'])
convert_timestamp_to_date_time = pd.to_datetime(df["upload_timestamp"], format="%m/%d/%Y, %H:%M:%S")
make_timestamps_tz_aware = convert_timestamp_to_date_time.dt.tz_localize(tz='UTC', ambiguous='infer')
make_timestamps_tz_est = make_timestamps_tz_aware.tz_convert('US/East')
make_timestamps_tz_est.to_excel('location/output.xlsx', index=False)
also you forgot %p btw, thats for AM and PM
and there shouldnt be a comma in the format
is it :%p
no it would have space %p
k
so basically imagine you're writing your time out, but replace all the actual number values with the % codes
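A minimal sketch of that idea, assuming the timestamps look like `03/15/2021 02:30:00 PM` (the sample values are made up, not taken from the real data). Each written-out part of the time is swapped for its `%` code; note that `%I` (12-hour clock) is the hour code that pairs with `%p`.

```python
import pandas as pd

# Hypothetical sample values in month/day/year, 12-hour format.
raw = pd.Series(["03/15/2021 02:30:00 PM", "12/01/2020 09:05:10 AM"])

# %m=month, %d=day, %Y=4-digit year, %I=12-hour hour, %M=minute, %S=second, %p=AM/PM
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

If the hours are already on a 24-hour clock, `%H` would be the code to use and `%p` can be left out.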
apparently localize and convert only works on the index
yeah same error
hmmmmm that doesnt really help
Hello guys, I know it depends on the problem, but how would you approach to find out the appropriate number layers?
As well as nodes?
Like a baseline number
theres no real way to just figure out how many you need
that's the whole concept of hyperparameter tuning
you just have to test stuff out and see how it goes
I'd recommend trying to use a model that already works for your baseline, like a premade model
then tweak from there
I know hyperparameter with GridSearch when doing classical ML. How would you do it with TensorFlow?
Thank you!
yeah so im still getting the same error
TypeError: index is not a valid DatetimeIndex or PeriodIndex
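That error is consistent with calling `tz_localize` / `tz_convert` on the Series itself, where they act on the index; going through `.dt` applies them to the column values instead. A small sketch with made-up timestamps (only the column name comes from the code above):

```python
import pandas as pd

df = pd.DataFrame({
    "upload_timestamp": pd.to_datetime(["2021-03-15 14:30:00", "2021-01-10 08:00:00"]),
})

# .dt applies the timezone methods to the values, not to the index.
utc = df["upload_timestamp"].dt.tz_localize("UTC")
eastern = utc.dt.tz_convert("America/New_York")

# Drop the timezone again (keeping the local wall time) and assign the result back.
df["upload_timestamp"] = eastern.dt.tz_localize(None)
print(df)
```

None of these calls modify the column in place, so the result has to be assigned back or stored in a new variable.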
progress
syntax stuff
alright well... i managed to download in xls format just the timestamp column..
and I mistakenly stripped the hours/seconds info
Hi. How can I make this matplotlib figure bigger in the y axis without changing the ylim? Since the limit of the values in the y axis is 1.
This is the code used to generate the figure:
import matplotlib.pyplot as plt
import matplotlib.patches as patches
fig = plt.figure(figsize=(10,2))
ax = fig.add_subplot(1,1,1, aspect='equal')
# Low
x = [0,0,9,11]
y = [0,1,1,0]
ax.add_patch(patches.Polygon(xy=list(zip(x,y)), fill=False))
# Medium
x = [10,12,15,17]
y = [0,1,1,0]
ax.add_patch(patches.Polygon(xy=list(zip(x,y)), fill=False))
# High
x = [16,18,20,20]
y = [0,1,1,0]
ax.add_patch(patches.Polygon(xy=list(zip(x,y)), fill=False))
ax.set_xlim([0,20])
ax.set_ylim([0,2])
plt.show()
I'm not exactly sure about your code, but you can set the ticks with plt.yticks(array)
Say your array is range(1, 10, 1), then you can call plt.yticks(range(1, 10, 1))
Don't know if that helps
Increase figsize as well?
I tried increasing the figsize but the height of the figure doesn't change
nvm, it was aspect='equal', I forgot to remove it. Thanks for the help anyway!
beginner question but in numpy rather than creating the matrix from scratch is there a way I can call an empty matrix of specified size?
exp: i could call a 3 x 2 matrix full of 0s with the values to change later
nvm found answer
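For anyone searching later, a minimal sketch of one way to do this with `np.zeros` (the shape and the value written afterwards are just examples):

```python
import numpy as np

m = np.zeros((3, 2))  # 3 rows, 2 columns, every entry 0.0
m[0, 1] = 5.0         # individual entries can be changed later
print(m)
```

`np.empty((3, 2))` also exists, but it leaves the entries uninitialised, so `np.zeros` is the safer default.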
Good night everybody. Does anybody have an idea how to transform this plot so that it shows the shape of the curves better?
Without the need to show two different pictures.
is singular value decomposition (SVD) solely for linear regression or can it perform on other models like the Gradient Descent algorithms can?
nvm
hey guys, in backpropagation, if we're using cross-entropy as the loss function, why is the error term in the output layer computed as [y - (output activation)]? isn't that the partial derivative of a mean squared error loss func with respect to output activation, rather than cross-entropy? i keep seeing it even if the loss function isn't MSE
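A short derivation that may explain this, assuming a softmax output layer with one-hot targets (the symbols $z$ for the pre-activations and $a$ for the output activations are my own notation). The error term is the derivative of the loss with respect to $z$, not with respect to $a$, and for softmax plus cross-entropy the softmax derivative cancels the $1/a_k$ factor:

```latex
\begin{aligned}
a_i &= \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad
L = -\sum_k y_k \log a_k, \qquad \sum_k y_k = 1 \\
\frac{\partial a_k}{\partial z_i} &= a_k\,(\delta_{ki} - a_i) \\
\frac{\partial L}{\partial z_i}
  &= -\sum_k \frac{y_k}{a_k}\, a_k\,(\delta_{ki} - a_i)
   = -y_i + a_i \sum_k y_k
   = a_i - y_i
\end{aligned}
```

So the negative gradient is $y - a$, which looks the same as the MSE error term with a linear output even though the loss is different. The same cancellation happens for a sigmoid output with binary cross-entropy.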
How did I miss this last night?? I saw your second ping and not this one. Is this what gm meant? because now that makes sense.
Sorry I didn't realize I was being particular about this. I think I still don't have the best handle I need on the jargon? like, using words as correctly as possible to their coding definition.
I'm gonna try this out and see if that does the thing. and hopefully have a better understanding of why >.>;;
hi! , anyone can help me with np.trapz for calculate area under the curve ? ive been doing some research but all the examples contains random data and i dont know how to incorporate my data.
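A minimal sketch of using your own data, assuming it can be put into two equal-length arrays of x and y values (the numbers here are made up):

```python
import numpy as np

# Hypothetical measurements: replace these with your own lists or columns.
x = np.array([0.0, 1.0, 2.0, 3.0])   # e.g. time
y = np.array([0.0, 1.0, 4.0, 9.0])   # e.g. value measured at each time

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever one exists.
trapezoid = getattr(np, "trapezoid", None) or np.trapz
area = trapezoid(y, x)   # note the order: y first, then x
print(area)
```

If the data lives in a DataFrame, passing `df["y"].to_numpy()` and `df["x"].to_numpy()` should work the same way.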
I have a dictionary of dictionaries:
{
'A': {'spread': .., 'mid': ..},
'B': {'spread': .., 'mid': ..},
...
}
Where there are usually 3-15 keys. I need the most performant way of finding the minimum spread AND the N largest mids - I've currently got the min spread as best = min(prices.values(), key=lambda x: x['spread']) then best_spread = best['spread']
I'm not sure how to find the N largest mids in the most performant way - but I do put the mids in a numpy array as I need to find their median or mean.
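One possible approach, sketched with made-up prices: `heapq.nlargest` from the standard library picks the N largest values without a full sort. With only 3-15 keys, a plain `sorted(...)[-n:]` is likely just as fast, so this is more about readability than speed.

```python
import heapq

# Hypothetical prices; the real dict has 3-15 keys.
prices = {
    "A": {"spread": 0.4, "mid": 10.0},
    "B": {"spread": 0.1, "mid": 12.5},
    "C": {"spread": 0.3, "mid": 11.0},
    "D": {"spread": 0.2, "mid": 9.5},
}

best_spread = min(v["spread"] for v in prices.values())

n = 2
largest_mids = heapq.nlargest(n, (v["mid"] for v in prices.values()))
print(best_spread, largest_mids)
```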
Well, does anyone know why when we use TPUs, PyTorch uses the System RAM for loading the model rather than the internal TPU Vram or the GPU RAM??
I'm trying to drop rows that contain specific words within a column from my df. I tried creating an index and dropping the index, but I got an error saying that since it included more than 6 items it was too large and couldn't be used. I have just tried the following code, which I adapted from Stack Overflow:
tox = ['toxic','toxicity','toxicities', 'deaths','fatal','patient~ safety','safety issue', 'safety monitoring', 'safety data', 'safety measures', 'safety related', 'safety reasons', 'safety concern', 'safety and efficacy']
df_test1 = df_test1[-df_test1['why_stopped'].isin(tox)]
This doesn't return any errors, but the size of my df_test1 hasn't changed.
How might I get this to successfully drop rows that contain the terms in tox from df_test1?
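A possible explanation: `isin` only matches cells that are exactly equal to one of the terms, so a longer sentence that merely contains "toxicity" never matches. A sketch using `str.contains` instead, with a shortened term list and made-up rows:

```python
import re
import pandas as pd

tox = ["toxic", "toxicity", "fatal", "safety issue"]  # shortened, hypothetical subset
df_test1 = pd.DataFrame({"why_stopped": [
    "Study halted due to toxicity in cohort 2",
    "Low enrollment",
    "Sponsor decision after a safety issue",
    None,
    "Funding withdrawn",
]})

# Build one regex that matches any of the terms anywhere in the text.
pattern = "|".join(re.escape(term) for term in tox)
has_term = df_test1["why_stopped"].str.contains(pattern, case=False, na=False)

df_kept = df_test1[~has_term]   # ~ inverts the boolean mask
print(df_kept)
```

`na=False` treats missing values as "no match", so rows without a reason are kept.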
maybe you can try plot in matrix axs[1]. then axs[2]. etc.... you are going to have the data in sub plots in the same image.
i have a dataset that I manipulate some timezone data on
it manipulates just one column
however I'm trying to output the entire data set, not just the timestamp column, to xls
currently it's just exporting the xls file
im using pandas
df = pd.DataFrame(query_resolution, columns=['upload_timestamp', 'email', 'was_this_a_pandemic_related_call',
'what_was_the_call', 'was_the_inquiry_resolved'])
convert_timestamp_to_date_time = pd.to_datetime(df["upload_timestamp"], format="%m/%d/%Y %H:%M:%S %p")
make_timestamps_tz_aware = convert_timestamp_to_date_time.dt.tz_localize(tz='UTC', ambiguous='infer')
make_timestamps_tz_est = make_timestamps_tz_aware.dt.tz_convert('America/New_York')
remove_time_zone = make_timestamps_tz_est.dt.tz_localize(None)
#remove_time_zone = make_timestamps_tz_est.apply(lambda a: pd.to_datetime(a).date())
remove_time_zone.to_excel('staffDashboard/output.xlsx', index=False)
#print(cursor.mogrify(get_results, (formatted_start_date, formatted_end_date)))
connection.close()
cursor.close()
return send_file('output.xlsx', attachment_filename=f"{formatted_start_date}-{formatted_end_date}_survey_results.xlsx", as_attachment=True)
how do i go from refferencing just the column, to merging it into the dataframe
and then exporting that dataframe
like I said, currently it just export the column
do i just replace the old column
and export the new df
@real wigeon
You're splitting out that column, running it through functions and then exporting just the column.
Add it back to the df with
df["converted_timestamp"] = remove_time_zone
And then export the df
df.to_excel("output.xlsx", index=False)
doing this
df["converted_timestamp"] = remove_time_zone
wont that assign a new column, since the name is different
yes, if you want to replace the upload_timestamp column with the column full of converted info change the name to "upload_timestamp"
sure let me know, ping me if it doesn't work so I get a notification
Well, does anyone know why when we use TPUs, PyTorch uses the System RAM for loading the model rather than the internal TPU Vram or the GPU RAM??
Does anyone know how to convert an array with values dtype = 'timedelta64[ns]' to days?
although it does.... this weird thing... where the query range is like x-y but y wont be included
i don't understand, will need more context
i think it's a mysql thing
oh okay, is it included in the input file?
im thinking it might now be
i dont believe mysql is inclusive i think is the term
like if i ask it to query all data points between 5am and 6am, it will go all the way up to 5:59am, but not include the 6am
if for eg it's stored in column 'datedif' in a dataframe called 'df'
Do df['datedif'].dt.days
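A small sketch covering both the plain NumPy array and the DataFrame column, with made-up durations (1 day, 2.5 days and 1 hour):

```python
import numpy as np
import pandas as pd

arr = np.array([86400, 216000, 3600], dtype="timedelta64[s]").astype("timedelta64[ns]")

whole_days = arr.astype("timedelta64[D]").astype(int)   # truncates to whole days
frac_days = arr / np.timedelta64(1, "D")                # float days, keeps fractions

df = pd.DataFrame({"datedif": arr})
print(whole_days, frac_days, df["datedif"].dt.days.tolist())
```

`.dt.days` gives whole days only; dividing by `np.timedelta64(1, "D")` is the variant to use if fractions of a day matter.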
cool cool
i can't be a lot of help on the mysql front 🤷🏻♂️
thanks
I had thought the same, but given that I will use these pics for a paper, I need to save space because I have more plots with the same problem :/
is this a dumb way of handling this?
dfs = [df1,df2,df3,df4,df5,df6]
fn2 = [3,14,25,36,47,58]
fn3 = [3,14,25,36,47,58]
fn4 = [3,19,35,51,67,83]
fn5 = [1,6,11,16,21,26]
fn6 = [6,7,8,9,10,11]
for n in range(6):
fxn1(dfs[n])
fxn2(dfs[n], fn2[n])
fxn3(dfs[n], fn3[n])
fxn4(dfs[n], fn4[n])
fxn5(dfs[n], fn5[n])
fxn6(dfs[n], fn6[n])
https://www.pythoninformer.com/python-libraries/matplotlib/line-plots/ // and maybe trough this?
if it's a white square, try restarting the notebook
i had that a couple of times in jupyter with plotly
Does anyone know where I would be able to learn time series analysis?
Cost isnt too big of an issue since i would be getting my employer to pay for it.
the size hasn't changed, but what about the data inside?
it's possible that it gets filled with NaN values after the drop
because your dataframe has a fixed size
@fallow prism I'll inspect the data real quick. Give me a sec.
@fallow prism I have examined the df and the cells that I intended to drop remain.
Posting in this channel because my issue includes the use of a dataframe, but please direct me to the correct channel if I posted incorrectly. Can anyone help me fix this error? I don't understand why my list isn't being accepted as column names, even though my variable used is a list with four elements. My list is printed in cmd as ['owner', 'series', 'name', 'image']
is your column just a word or string?
if is just a word try this
df_test1['why_stopped'] = df_test1['why_stopped'].apply(lambda x: x if x not in tox else None)
or make a new column an replace the first column later
New Medium Article Published. Introduction to NumPy in Python. Exploring Operations and Arrays in NumPy, The Numerical Python Library. Let me know what you think! https://medium.com/analytics-vidhya/introduction-to-numpy-in-python-db8aa7ffd91f
@lapis sequoia this is something that you wrote?
@serene scaffold Yes
@lapis sequoia very nice. I'm looking at the section on joining. You mention using .join but it looks like it's np.concatenate that you use
Oops, you are correct, I just published the fixed version. Thank you for that feedback!
cool! 💥
@lobon22 A string.
hey
i keep getting this error
result = self.forward(*input, **kwargs)
File "/Users/ashley/Deeplearning/fresh_vs_rotton.py", line 67, in forward
x = F.max_pool2d(self.relu(self.conv1(x_1)), 2)
File "/Users/ashley/Deeplearning/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/ashley/Deeplearning/venv/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
return self._conv_forward(input, self.weight)
File "/Users/ashley/Deeplearning/venv/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [16, 3, 3, 3], but got 2-dimensional input of size [1176, 512] instead
idk how to fix it
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
self.fc1 = nn.Linear(8 * 8 * 8, 32)
self.fc2 = nn.Linear(32, 2)
self.relu = nn.ReLU()
def forward(self, x_1):
x = F.max_pool2d(self.relu(self.conv1(x_1)), 2)
x = F.max_pool2d(self.relu(self.conv2(x)), 2)
x = x.view(-1, 8 * 8 * 8)
x = self.relu(self.fc1(x))
x = self.fc2(x)
return x
that minus '-' has to be '~'
So I have a dataset made up of 3 columns. Not every column has data for every row, but I'd still like to compute an average for that row, even if it's just using the 1 column. How do I do that?
.mean() is giving me NaN for the rows that have NaN values in one of the columns and I don't remember how to get around this.
Define "get around" -- how do you want Nan to be treated? Ignore the value, i.e., average the non-NaN values, presumably deducting them from the count? Treat them as zero?
ignore the value if NaN
I want triangle functions in numpy that, for a given range, return 1.0 in the middle of the range, 0.0 at the ends, and np.nan outside the range. But I can only find stuff about making n-arrays with that distribution.
I guess I can make them myself with np.vectorize or something
👍
Hi, I'm dumb. .mean() works, I just have an inability to spell
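For reference, a minimal sketch of the row-wise average that skips missing values (column names and numbers are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, np.nan, np.nan],
    "c": [6.0, 4.0, np.nan],
})

# axis=1 averages across the columns of each row; skipna=True (the default) ignores NaN.
df["row_mean"] = df[["a", "b", "c"]].mean(axis=1)
print(df)
```

A row only comes out as NaN if every one of its columns is NaN.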
Don't know if this is helpful since it returns 2D arrays. https://stackoverflow.com/questions/39951392/python-plot-triangular-function-into-2d-arrays
on the examples, when training a nn. What is X_train, Y_train, X_validation and Y_Validation?
X is a list with the training data (images or what ever) and Y another llist of the same size with the labels for each X?
what is this 🥴
wait what do you mean by that
do you have an example
usually not a list
but some sort of container
you train your model on some data, then you check if it's working on other data that your model hasn't seen before. the latter is validation data.
if not a list, what?
I'm semi-afk but I can write a better explanation later.
numpy array, pandas DataFrame (backed by said array), or TensorFlow/PyTorch's containers
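A minimal sketch of what those four variables usually hold, using random arrays as stand-ins for real images (the sizes, labels and the 80/20 split are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 10 grayscale "images" of 4x4 pixels and one label per image.
X = rng.random((10, 4, 4))
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

# Shuffle once, then hold out the last 20% as validation data the model never trains on.
order = rng.permutation(len(X))
split = int(len(X) * 0.8)
train_idx, valid_idx = order[:split], order[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
```

The i-th entry of `y_train` is the label for the i-th entry of `X_train`, which is why both are indexed with the same `train_idx`.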
okey okey
i knew images must be np arrays
thats easy cuz i think opencv loads images as np arrays, right?
Hey guys, I am trying to assign the strike and expiration date from a row to all of the results its values spawn, would I use inheritance to solve this?
I tried .at but it did not work correctly
is this the place to talk about neural nets
yes
what?
I just want to pass the UIDs of a row to the rows it spawns from an api call
what is "spawns"
you're going to need to give more details
Sorry. For each row of my df, it has a unique option. The values are then passed to a robin_stocks method that returns roughly 210k rows to the 1 input. I need all 210k to be directly traceable back to the 1 input
what do you mean by "directly traceable"?
like do you want all the results in one big DataFrame
and have an additional column
to indicate the source?
do you know what a join is?
Yes
left join on that
what do you mean
a = f.at[i,'strike']
c = f.at[i,'xpire']
df4.at[i,'strike'] = a
df4.at[i,'xpire'] = c
is how I did it
huh.
wait
I
actually don't get why you did that
that looks like a loop.
why do you have a loop?
YEs
It was
for i in tqdm(range(len(df2))):
df4 = df4.append(r.options.get_option_historicals(f.loc[i]['symbol'], f.loc[i]['xpire'], f.loc[i]['strike'], 'call', interval='5minute', span='week', bounds='regular', info=None))
a = f.at[i,'strike']
c = f.at[i,'xpire']
df4.at[i,'strike'] = a
df4.at[i,'xpire'] = c
Yes
that's
a pretty weird way to do things
let me think for a bit
show me a bit of df2
in text
not picture form
my gut feel is that you should use df2.apply
Yeah I tried that
df2 is made like this
for i in tqdm(range(len(df))):
df2 = df2.append(r.options.find_tradable_options(df.loc[i]['Symbol'],expirationDate=None, strikePrice=None, optionType=None, info=None))
you should really
avoid append in a loop
probably df.transform would be appropriate
It takes in 3 values, the df2 gen works ok
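A sketch of one way to avoid `append` in the loop and keep every result traceable to its input row. `fetch_historicals` is a hypothetical stand-in for the robin_stocks call and the option data is made up; the idea is to stamp each returned frame with the identifying columns, collect the frames in a list, and concatenate once.

```python
import pandas as pd

# Hypothetical stand-in for the API call; it returns several rows per option.
def fetch_historicals(symbol, xpire, strike):
    return pd.DataFrame({"close_price": [1.0, 1.1, 1.2]})

options = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "xpire": ["2021-01-15", "2021-02-19"],
    "strike": [100.0, 50.0],
})

frames = []
for row in options.itertuples(index=False):
    hist = fetch_historicals(row.symbol, row.xpire, row.strike)
    # Stamp every returned row with the option it came from.
    hist = hist.assign(symbol=row.symbol, xpire=row.xpire, strike=row.strike)
    frames.append(hist)

# One concat at the end instead of DataFrame.append inside the loop.
history = pd.concat(frames, ignore_index=True)
print(history)
```

`DataFrame.append` was removed in pandas 2.0, so the list-plus-`pd.concat` pattern is also the forward-compatible one.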
@velvet thorn native numpy support for this:
class TriangleFunc:
def __init__(self, start, end):
self._start = start
self._end = end
self._mid = ((end - start) / 2) + start
self._slope = 1 / ((end - start) / 2)
def __call__(self, x):
if not (self._start <= x <= self._end):
return np.nan
slope = self._slope if x <= self._mid else -self._slope
return slope * (x - self._start)
except where I get the slope right for the right side of the midpoint
I'm making a fuzzy controller
problem is I don't think you can vectorize methods
looks like vectorizing doesn't improve performance so I guess it's a moot point.
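A sketch of a version that works on whole arrays without `np.vectorize`, written as a plain function rather than the class above (the name `triangle` is my own). It uses the distance from the midpoint, so the right-hand slope comes out correctly without a separate branch:

```python
import numpy as np

def triangle(x, start, end):
    """1.0 at the midpoint of [start, end], 0.0 at both ends, NaN outside."""
    x = np.asarray(x, dtype=float)
    mid = (start + end) / 2
    half = (end - start) / 2
    inside = 1.0 - np.abs(x - mid) / half
    return np.where((x >= start) & (x <= end), inside, np.nan)

print(triangle([0, 2.5, 5, 7.5, 10, 11], 0, 10))
```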
I'd appreciate an answer if possible
@civic fractal the answer that's already given is quite good
it sounds like you're pushing the limits of how numbers are stored on your computer
@serene scaffold I must confess I do not see what this code is meant to do
🥴
@velvet thorn I figured that part out
now I'm just trying to plot everything
and then I'm 1/3 of the way through the assignment
💥 🎆 😢
(took two days to get this far)
(due at 4pm)
this is wrong :((((((((((((((
what is that supposed to be
Hello! Does anyone here know anything about data mining using Python? I have an assignment I have to do.
Here's the kinda stuff we have to cover...
If anyone can help let me know! 😁
Just @lapis sequoia me
And this is using Anaconda if that means anything
Hi @lapis sequoia , your task seems simple and the explanation on what is expected is quite good, let us know if you need any help
Anaconda is just a Python distribution that has the relevant libraries/packages (however you call it) and its dependencies sort of installed
Yeah I think so far it has been pretty straight forward, I suppose I'm just kinda worried that it seems too simple and that it's like a trick question or something?
Like so far this is what I have
The outliers one looks quite fun
Oh that one I have no idea where to even begin honestly
Maybe you could help me with that
I imagine most of the marks are going towards that question
Is this right @torpid cave ?
I would just present one number instead of creating the table though
haha so your correlation is -0.1
You show a correlation table instead of the correlation between 2 variables, that is why that number is repeated
So instead of showing that matrix I would try to get just the -0.109
But it is just a personal preference thb
tbh
I have an assignment to show my understanding of boosting and bagging concepts. The report requires me to provide examples of various examples of boosting and bagging. Do you think it is ethical to use sample code from xgboost or scikit to show how ada boost, xgboost,etc. works?
You'd definitely have to reference it
Don't take stuff from online without referencing it because you're inherently implying it's all your work then
Of course I will reference but shouldnt be an issue after that right
Since the goal is not to improve a given model just to show the understanding of these concepts
I mean I've never heard of xgboost or scikit before, but if the website or your lecturer doesn't declare that you can't do that then I guess it isn't an issue?
Cool the TA references the site and recommends checking it out
Thanks just wanted a second opinion
I don't know what boosting or bagging is but I guess it's not too small or simple to create an example yourself?
Our lecturers say not to reference them
I think it's kinda cringy when you do
When you like quote them from a class...
I am from the school that references ppt slides
Hmmm
Rules were quite strict in grad school
i tend to reference the book used in class thats about it
undergraduate most students here dont cite properly
I think there has to be some sort of line because mostly 99% of everything we know came from somewhere else, and if we were to reference everything it would be kinda tedious...
And I did engineering
I think for the most part your lecturers understand that most of what you're saying came from them anyway
Unless you specify otherwise
In grad school... I did at least 20 references per paper
Yep
Because your work may get a bit more public and attention
And so it's kinda necessary to show your sources
most of my reports have like 5 and 90% of them are from blogs
As opposed to undergrad where your work is really only gonna be seen by your lecturer
Depends on the subject as well I guess
also on the TA. Most cant be bothered to check really
For example I would not reference how to get correlations... but I would reference testing for heteroskedasticity
what major did you do graduate studies if i may ask?
Ahh cool aight thanks guys I should be fine if I reference the samples
Yeah, reference as much as you can, you never lose much and you might impress your lecturer if he cares about that shit
But not referencing could be a serious offense 😬
@torpid cave
This is just a shot in the dark at this point
I have no idea if this is correct or not
For this point ^
Would t-SNE be useful for visualising this dataset (https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease) ? In the documentation it says to use PCA if the data has a large amount of variables, but i'm not too sure what constitutes as a large amount...
you can use either, usually t-SNE would be preferable for extremely high dimensional data
t-SNE/PCA would work fine by the looks of it
Thanks 🙂
is it okay to upload sensitive data as a private dataset on kaggle?
for some reason the TPU on colab doesn't work well while reading data from drive
guys, i got this loop to load the data set:
for pok in os.listdir(datadir):
path = os.path.join(datadir, pok)
images = os.listdir(path)
amount = len(images)
for i in range(amount):
img_array = cv2.imread(os.path.join(path, images[i]), 0)
new_array = cv2.resize(img_array, dimension)
if i < amount * 0.8:
train_data.append([new_array, pok])
train_label.append([pok])
else:
valid_data.append([new_array, pok])
valid_label.append([pok])
but it takes a while to complete. Can i run it once, export it somewhere and somehow, and the next times i just load it?
save the dataset, there are many formats
pickle, npy, npz, you can write it to a text file. If its a numpy array best options for you are npy and npz
numpy array are only the images
new_array
since opencv loads them as numpy array
pok is just a string
also, i am thinking. train_data doesnt need to have the label if train_label exists
or train_label shouldnt exists. Right?
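A minimal sketch of caching the loaded data in one `.npz` file, assuming the images have already been resized to a common shape and stacked into an array (the arrays, label names and file path here are placeholders). Images and labels are kept as two parallel arrays, so the label does not need to be stored inside the data list as well.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for the arrays built by the loading loop.
train_images = np.zeros((4, 8, 8), dtype=np.uint8)
train_labels = np.array(["pikachu", "pikachu", "eevee", "eevee"])

path = os.path.join(tempfile.gettempdir(), "dataset_cache.npz")

# Run once: store images and labels side by side in one compressed file.
np.savez_compressed(path, images=train_images, labels=train_labels)

# Later runs: load the cache instead of re-reading every image.
with np.load(path) as cache:
    images, labels = cache["images"], cache["labels"]
print(images.shape, labels[:2])
```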
Hey ! I'm using matplotlib to display activities with bars and legends, but some text is overlapping, any idea why ?
Well, I know why
but I don't know how to fix it
also, you noticed the hours on the bottom don't exactly display hours from 00:00 to 24:00, do you know how I may be able to fix this ?
i'm very new to matplotlib so don't understand everything in there, I copy pasted a chunk of code from stackoverflow to get the structure
Hey, is someone familiar with opencv and a little machine learning?
Anyone know how to do this? 🤔
And this is coming from a dataset where I have a bunch of values for petal width and length.
iris?
just ask your question directly imo
It's the name of the data set
count for which the condition is met / total number of combinations
^this
Well I have a video and I need to recognize and display the poses left arm up and right arm up and I don't quite know how to do it
- not + right?
try posenet and look for hand gesture recognition/ pose detection models on github
thanks
yes
sorry
and then count them
len(df[df[new_feat] > 1] )
divide that value by the amount of entries in you df
i.e. and then count them
len(df[df[new_feat] > 1] ) / df.shape[0]
🤔
What is the df.shape[0]?
What does that mean?
len(df)
What's the difference?
its the size of your dataframe (i.e how many samples of petals u have)
theres no difference, its the number of entries
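A sketch of two equivalent ways to get that fraction, assuming the new feature is the product of petal length and width (the column names and numbers are made up):

```python
import pandas as pd

# Hypothetical petal measurements; new_feat is the product of length and width.
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.1, 1.3],
    "petal_width": [0.2, 1.4, 1.8, 0.3],
})
new_feat = "petal_area"
df[new_feat] = df["petal_length"] * df["petal_width"]

p_count = len(df[df[new_feat] > 1]) / len(df)   # matching rows / all rows
p_mean = (df[new_feat] > 1).mean()              # mean of a boolean mask is the same fraction
print(p_count, p_mean)
```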
question says two methods HMM
Hmm is right
I took a stab at this question also
but meh
I have no idea if that's right
but in what respect, a difference formula (mathematical approach) or a different way to query the data frame
you could train a logistic regression model which gives probability that your condition is true, given that class 1: product >1 class 2: product <=1
there was also this question
Which I have no idea about
Are outliers judged by their distance from the average?
from the line
And and what point is the max?
What line though?
What is the line
their distance from the linear regression line
This?
I see a lot of lines here...
😳
the furthest one
The question says an outlier is identified as the point with maximum distance
but like what is the max distance?
There can be more than one outlier right?
distance perpendicular to the line
I am working on a linear regression line can anyone help??? please!?
that is true, but for your case the question says that the outlier is the furthest away
Ahhh
honestly it's a sh*t definition for an outlier
ahaha
usually the outliers problem isnt so easy xD
but i think its a training exercise so its ok
But I just need to check each point and get whichever is furthest from the line, right?
exactly
Just for loop through the data set
But
How do I get the distance from the line?
What do I say to get that?
euclidean distance
does anyone kind of understand linear regression because I am stuck
you have your x value (length) and your y value (width)
when you take your x value and put it into your LR eqn ^y = mx + b
you compare the real value y with the predicted value ^y
max(|y - ^y|), do it for all of them and take out the maximum one
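A sketch of that procedure with made-up points, using `np.polyfit` for the line. It computes both the vertical residual and the perpendicular distance mentioned earlier; for a single straight line the two differ only by a constant factor, so they pick the same point.

```python
import numpy as np

# Hypothetical petal data; the last point is deliberately far from the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 6.0])

m, b = np.polyfit(x, y, 1)          # least-squares line y_hat = m*x + b
y_hat = m * x + b

vertical = np.abs(y - y_hat)                    # residual |y - y_hat|
perpendicular = vertical / np.sqrt(m**2 + 1)    # shortest distance to the line

outlier = int(np.argmax(vertical))
print(outlier, x[outlier], y[outlier])
```

No explicit for loop is needed; `np.argmax` returns the position of the largest distance.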
thx
that really helps
np
anyone know this?
yes
not my first choice either but I'm having issues on colab TPU
Is there a way to transform this dataframe:
0 1
0 0.435752 0.0
1 0.296690 0.0
2 0.737365 2.0
3 0.332111 1.0
4 0.030198 1.0
into this:
0 1
0.0 0.435752 0.296690
1.0 0.332111 0.030198
2.0 0.737365
I know it's no longer rectangular data
I thought it might be the pivot method
have you tried groupby
I didn't think that would have plotting functionality. I'm making density distribution plots for three classes.
in pivot you need to have unique indices
I'm thinking groupby column 1 and make a function that returns values having the number, maybe would take some more editing to get the column name in order
let me try and get back to you
Hey everyone, I'm working with time series data and could use some opinions on the best way to format dates. I have to choose between datetime.datetime or numpy.datetime64 objects.
Leaning towards the native datetime library, but I thought that datetime64 may play nicer with certain models? Anyone run into this before?
I'd say if you use numpy for everything, roll with np.datetime64, if pandas, use pandas own datetimes, if mix or not sure, go with datetime.datetime
worst-case-scenario, you can always convert
although a bit old, but mostly still valid SO answer on converting: https://stackoverflow.com/a/13704307
so practically, you can choose anything what you like:) personally, I like better native datetime.datetime, not sure why...
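A short sketch of the conversions between the three types, in case it helps with the choice (the date is arbitrary):

```python
from datetime import datetime
import numpy as np
import pandas as pd

dt = datetime(2021, 3, 15, 14, 30)

ts = pd.Timestamp(dt)            # datetime -> pandas Timestamp
dt64 = np.datetime64(dt)         # datetime -> numpy datetime64

back_from_ts = ts.to_pydatetime()                     # Timestamp -> datetime
back_from_dt64 = pd.Timestamp(dt64).to_pydatetime()   # datetime64 -> datetime (via pandas)
print(ts, dt64, back_from_ts, back_from_dt64)
```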
Yeah I'm a bit spoiled because we were working in R before and the lubridate package made my life so easy haha
could you solve?
I haven't solved that yet
is there a way to convert list to a dataframe row
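One way, sketched with made-up values and column names: wrap the list in another list so pandas treats it as a single row rather than a single column.

```python
import pandas as pd

row = ["pikachu", 25, 0.4]                      # hypothetical values
columns = ["name", "number", "height"]          # hypothetical column names

one_row = pd.DataFrame([row], columns=columns)  # outer list = rows, inner list = one row

# Adding another list as a new row of an existing frame:
other = pd.DataFrame([["eevee", 133, 0.3]], columns=columns)
df = pd.concat([one_row, other], ignore_index=True)
print(df)
```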
I export my env into .yaml and then import to anac nav, then I cd to the path and I find out that my project folder is missing, I go back to my laptop, cp folder, mv to cloud to then cp to path, any idea of how to do this faster than it is or how do you do it?
whats the objective
I work on my laptop most of the time but I decided to work the project on the desktop.
I'm curious about if there is a better or faster way to achieve this.
Could this be done with git?
I'm no expert at these things but we use git (via github) for version control on all our code, perhaps you could just commit your env yaml as well and then pull whenever you want to work on a new machine
I will look into it, I think it could work, thanks for the idea.
i got close but couldnt get exactly the same
nvm that
ended up doing pd.DataFrame({cls: normalized_train[normalized_train[1] == cls].iloc[:, 0] for cls in {0., 1., 2.}})
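For comparison, a sketch of a `groupby` plus `pivot` route that gives one row per class, as in the desired output. The column names `value` and `cls` are my own stand-ins for columns 0 and 1. The idea is to number the rows inside each class and pivot on that counter.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [0.435752, 0.296690, 0.737365, 0.332111, 0.030198],
    "cls": [0.0, 0.0, 2.0, 1.0, 1.0],
})

# Number the rows within each class (0, 1, ...) and use that as the new column.
df["n"] = df.groupby("cls").cumcount()
wide = df.pivot(index="cls", columns="n", values="value")
print(wide)
```

Classes with fewer values than the largest class are padded with NaN, which keeps the result rectangular.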
@umbral oracle We don't allow people to recruit for paid opportunities of any kind here.
Hey guys, did anyone used DataQuest? I'm thinking about paying the pro year, but I wanted to hear from someone that used it (if this is not the place, where can I talk about it?)
Oh sorry!
If you study, you can get the pro version for free. Ask your Prof.
I don't think my college here in Brasil has a partnership with them... But I'll ask anyway
Guys has anyone here classified job seniority based on job descriptions? [NLP]
I am trying to implement an environmental sound classifier using the urban sounds 8k data set but it seems like my validation loss seems to grow with the epochs. Any idea why?
The reference paper I am using gets about 74% accuracy
I just moved to a new computer and am having trouble getting pandas to show my graphs in atom. It says it’s finished but shows nothing
Anything simple I’m missing?
Far as I can tell, I’m just doing df.plot()
Hey guys, so there’s this job opening for “Artificial Intelligence Engineer” role at this company that I am thinking I should apply to... this is the job post ... any tips on how I should prepare for that and what to study... I am fairly new to this
that would be more of a question to ask #career-advice
its alr, just application and job stuff is more in that realm, although one tip i'd give you is to have some sort of example project you could show them
i study data science engineering and I'm not quite sure what an AI engineer is
i would assume that an AI eng would have to know the NLP and be comfortable with algorithms such as A* and be able to figure out constraint satisfaction problems etc but the description for that job seems to be something a data scientist would do?
or maybe not
honestly idk
its more of someone who can make machine learning/deep learning models to run in the field
I think they just mean data science/machine learning
fair enough
Know any good resource for learning some of the maths related to data science
Forgot most of my university maths 😅
it's basically statistics
and machine learning (SVM, LR, DT, RF, ANN, NB, etc.)
brush up on your multivariate analysis and statistics
Hmmm
were you looking for something more specific?
So my only experience with data science was like 2.5-3 years ago at my 5-6 months internship... was getting the hang of it until I stopped and life continued
whats your background?
Computer science degree and currently working in ASP.Net
But I kept using python here and there for automation and scripting
yea data science is mostly scripting
since you're compsci i assume your programming skills are good
so i'd say focus on the math and some info viz
the math you require is, like I said, stats, multivariate analysis and all that ML mumbo jumbo
you've got some pretty neat O'Reilly textbooks that focus on the math behind data science
you can torrent them for free
They say statistics, probability theory, machine learning algorithms and data modeling
In the post
yup sounds about right
And python data science stack, I’ve only used like pandas, numpy and some scikit learn from what I remember at my internship
Is this what they mean with that
idk tbh but it must be
you've got the python software packages that are common throughout all DS: sklearn, numpy, pandas, matplotlib
hello, given data like this in a csv, i would like to create a df with the period of time, in this case: Doctorid1 period 12:00-12:16
and then you have the ML ones like tensorflow/keras and sklearn
using pandas and groupby, does anyone have any clues? 🙂
the info viz stuff: seaborn, yellowbrick, plotly dash etc
the NLP ones like spaCy and NLTK
Uff yeah those I remember from my internship the NLP ones
then more specific ones ... for example if you're dealing with networks you'd use NetworkX, powerlaw etc.
i guess through practice you'll start accumulating knowledge on these libraries
but these ones are a must
thats like your foundation
Right, lets see what I can do... the sucky thing is that my laptop is broken so I only have the PC at work to try and squeeze some learning while no one is looking 😅
good luck there buddi
a couple of good places to start are kaggle.com and https://towardsdatascience.com/
Nice, was looking also at a site called analytics vidhya
Don’t know if they’re good
🤔 didn't know this one, deffo gonna check it out
Havent looked into them much but was reading a medium article by them
those aren't straight quotes (") you're using, so it doesn't see data.csv as a string. it sees it as a variable name with some other type of quote character in front (which causes the invalid character error)
the () should contain the name of the file
the first quote was apparently a "LATIN SMALL LETTER A WITH CIRCUMFLEX "
Problem is I don't understand how it interpreted the quotes like that
remove the (.csv) and use ("data_csv") instead. that should do the trick
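One likely explanation for the "LATIN SMALL LETTER A WITH CIRCUMFLEX", sketched below: a curly opening quote saved as UTF-8 and then read back as Windows-1252 turns into three characters, the first of which is `â`. Which encodings were actually involved is an assumption here, since the original file was not shown.

```python
import unicodedata

# A curly opening quote, as inserted by word processors and some web pages.
curly_open = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Assumed scenario: the file is saved as UTF-8 but decoded as cp1252.
mojibake = curly_open.encode("utf-8").decode("cp1252")

print(repr(mojibake))
print(unicodedata.name(mojibake[0]))
```

Retyping the quotes by hand as plain `"` in the editor, as in `pandas.read_csv("data.csv")`, avoids the problem.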
how can i make dataframe.head() show me all the rows?
why not do print(dataframe) instead?
with this kind of data, using pandas, i want to print the period of time in one row (cell); in this case it should be 12:00 - 12:16. any clues? 😄
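A minimal sketch of one way to do it with groupby, assuming the CSV has one row per timestamp and a doctor identifier. The column names `doctor_id` and `time` and the sample rows are made up here, since the original data was not shown.

```python
import pandas as pd

# Hypothetical data standing in for the CSV.
df = pd.DataFrame({
    "doctor_id": ["Doctorid1", "Doctorid1", "Doctorid1", "Doctorid2", "Doctorid2"],
    "time": ["12:00", "12:08", "12:16", "13:00", "13:30"],
})
df["time"] = pd.to_datetime(df["time"], format="%H:%M")

# Earliest and latest timestamp per doctor.
span = df.groupby("doctor_id")["time"].agg(["min", "max"])

# Join both ends into a single "start-end" cell per doctor.
periods = (
    span["min"].dt.strftime("%H:%M") + "-" + span["max"].dt.strftime("%H:%M")
).rename("period").reset_index()

print(periods)
```

If the real file is read with `pd.read_csv`, the same groupby applies once the time column is parsed with `pd.to_datetime`.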
I had not thought of it
it still cuts it off
my problem is the width, i need more width for each row, or wrapped rows
dataframe.apply(print) and that is all
or Series.apply(print)
thanks !
oh, that doesn't work 😢
those three dots
i don't like them
a['descripcion_del_hecho - Final'][:5].apply(print) that works fine for me i guess the other ways is mor difficult
more*
😅
pd.options.display.max_colwidth=None
that works better
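For reference, a small sketch of the same idea applied temporarily with `pd.option_context`, so the display settings go back to their defaults afterwards. The frame and its `descripcion` column are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with one long text cell (the real data was not shown).
df = pd.DataFrame({"descripcion": ["x" * 120, "short text"]})

# Lift the limits only inside this block; the defaults come back afterwards.
with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
    text = repr(df)  # the same text that print(df) would show
    print(text)
```

Setting `display.max_rows` to `None` also covers the earlier question about showing every row instead of only what `head()` returns.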
anyone here with kaggle competition experience?
are you asking if someone wants to do a kaggle competition with you?
Do I have the right to edit the notebook after a competition deadline in Kaggle is over?
not sure
I found
You can make a submission at any time and as many times as you like, but we will only consider your latest submission before the deadline.
I need help writing a linear regression code can someone help???
@glad mulch here is my code, I want to make a linear regression line
```py
import os

import pandas
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# __file__ (with double underscores) is the path of this script
root = os.path.dirname(os.path.abspath(__file__))
data_dir = os.path.join(root, "data")
fig_dir = os.path.join(data_dir, "figs")


def make_plt(x, y, df):
    x_list = df[x].to_list()
    y_list = df[y].to_list()
    x_train, x_test, y_train, y_test = train_test_split(
        x_list, y_list, test_size=0.2, random_state=42)
    linear = LinearRegression()  # create an instance (not fitted or drawn yet)
    plt.title("Coding Books")
    plt.scatter(x_train, y_train)
    plt.scatter(x_test, y_test)
    plt.legend(["train", "test"])  # after the scatters so the entries exist
    plt.savefig(os.path.join(fig_dir, f"{x}-{y}.png"))
    plt.close()


def main():
    data_raw = os.path.join(data_dir, "prog_book.csv")
    raw_df = pandas.read_csv(data_raw)
    # drop the thousands separators so Reviews can be cast to int
    raw_df["Reviews"] = raw_df["Reviews"].str.replace(",", "")
    raw_df["Reviews"] = raw_df["Reviews"].astype(int)
    # plot price versus rating, step #1: turn columns into lists
    lists = ["Rating", "Reviews", "Number_Of_Pages", "Type", "Price"]
    for col in lists:
        for col2 in lists:
            if col2 != col:
                make_plt(col, col2, raw_df)
    # step #2: use plt.plot to plot the lists
    print(lists)
    print(type(lists[0]))
    # step #3: export the plot to a pdf
    # regression lines


if __name__ == '__main__':
    main()
```
I do not know how to make the line
I know the different equations but other than that I have no idea what I am supposed to do
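A minimal sketch of fitting and drawing the line with scikit-learn, assuming two numeric columns. The data below is synthetic (a stand-in for something like Number_Of_Pages vs Price) and the output filename is made up, so treat it as a starting point rather than a drop-in fix.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two numeric columns of the dataframe.
rng = np.random.default_rng(42)
x = rng.uniform(100, 1000, size=200)
y = 0.05 * x + 20 + rng.normal(0, 5, size=200)

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features).
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()  # the parentheses create an instance
model.fit(X_train, y_train)

# Predict along an evenly spaced grid to get the points of the line.
x_line = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

plt.scatter(X_train[:, 0], y_train, label="train")
plt.scatter(X_test[:, 0], y_test, label="test")
plt.plot(x_line[:, 0], y_line, color="red", label="fit")
plt.legend()
plt.savefig("regression_line.png")
plt.close()

print(model.coef_[0], model.intercept_)
```

With a dataframe, `X` would be `df[[x]]` (double brackets keep it 2-D) and `y` would be `df[y]`. A text column like Type would need encoding before it can be used in a regression.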
thank you so much
you're a big lifesaver
can anyone help me install tensorflow on IDLE? i seem to keep getting callback errors when attempting to import and need it working for a school assignment 😦
ValueError: Input 0 is incompatible with layer conv2d_1: expected ndim=4, found ndim=3
i am getting this error
my images are black and white
img_array = cv2.imread(os.path.join(path, images[i]), 0)
opening them with 0 loads them as grayscale i guess
```py
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=dimension))
```
dimension = (64, 64)
where is the error?
dimension amount is different i guess ¯_(ツ)_/¯
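To spell that out: Conv2D works on 4-D batches of shape (batch, height, width, channels), and grayscale images read with flag 0 come back as 2-D arrays with no channel axis. A small NumPy-only sketch of adding that axis follows; the blank images are placeholders for the real files and the batch size of 10 is arbitrary.

```python
import numpy as np

# Placeholder for 64x64 grayscale images as returned by cv2.imread(path, 0).
images = [np.zeros((64, 64), dtype=np.uint8) for _ in range(10)]

X = np.array(images)             # shape (10, 64, 64): batch, height, width
X = X[..., np.newaxis]           # shape (10, 64, 64, 1): channel axis added
X = X.astype("float32") / 255.0  # common scaling to the [0, 1] range

dimension = X.shape[1:]          # (64, 64, 1), the value for input_shape
print(X.shape, dimension)
```

With `dimension = (64, 64, 1)` passed as `input_shape` and `X` fed to the model, the layer should see the 4 dimensions it expects.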
you could use stack overflow...