#data-science-and-ml
Hello, fellow coders.
I'm putting together a team of python users to make a downloadable AI assistant (kind of like Siri, Cortana or Alexa) that you can download on your computer. All in python.
I think this isn't a one-man project, so I need some team members. Please contact me if you have experience in this area or if you have any questions. I'm very new to this (but I'm a fast learner), and it's a project I definitely want to undertake because it seems like a fun one overall, especially since I'm only a teen.
What I'm hoping for in the final result (I will update it, fix it, and add more features as we go): it should be able to tell the weather and time, do math calculations, run mini-games, search the web, play YouTube music, and read recent news, all using voice commands and speaking in a voice that sounds somewhat natural. I'd also like to add some sort of machine learning so the AI can learn more about you and slightly adapt its questions and statements to fit your personality.
If you think this is impossible or I'm having high hopes and I am a complete idiot, please feel free to tell me, since I'm open to judgement and improvement.
You can DM me at DarkMist#0074.
Note: I'm not offering payment of any kind or anything. I am just hoping that this will be a fun experience to everyone and a wonderful project. I will make like a poster of everyone in the team with their names and contribution and everything to kind of honor them and thank them for their help. This is a TEAM, by the way, not a company or a giant corporation, so I will probably accept a max of 15 members or so.
Thank you for reading. It should have taken a ton of time unless you are Mr Howard Berg. Let me know if you have questions!
DarkMist
you cant get it to learn though - anything else is doable
and there are multiple projects that have made similar things, check them out too
Does anybody know of a workaround to train a model for XGBoost regression on multi-output?
The library XGBoost currently does not support multi-output regression.
that's a problem - tried other things like LightGBM?
nvm they dont have it either
is it for a kaggle comp?
Anyone here involved with AI projects and know a good place to start? Not necessarily learning about what it is, but how to actually make projects involving it.
you could build a classifier for a dataset on Kaggle.
Yea I think Ill do that, it looks neat. Thank you.
I have a function like the following
def myfunc(c, h, alpha, beta, delta):
    # perform some calculations
    return s, t, x, y, z
where input parameters are
c = 0.53
h = 0.07
alpha = 0.6
beta = 1
delta = 0.8
The alpha, beta, delta inputs are initial values in the range from 0 to 1. I would like to adjust these input values such that the outputs s, t, and the sum of x, y, z are close to some values such as
s = 0.34
t = 0.20
sum(x, y, z) = 0.45
Is there an optimization function in SciPy or other Python package that would do something like this?
If you can shift everything to a single objective then yes, I think you can use one of the ready-made ones
If you are doing multi-objective optimization, I'm not too sure if any are ready-made
try just minimizing something like:
c = 0.53
h = 0.07
def cost(alpha, beta, delta):
    s, t, x, y, z = myfunc(c, h, alpha, beta, delta)
    return (s-0.34)**2 + (t-0.20)**2 + (0.45 - (x+y+z))**2
with scipy.optimize for a starter
I think it even has multiobjective ones
@tidal bough So something like this:
from scipy.optimize import minimize
c = 0.53
h = 0.07
def cost(params):
    # minimize passes all parameters as a single array, so unpack it here
    alpha, beta, delta = params
    s, t, x, y, z = myfunc(c, h, alpha, beta, delta)
    return (s-0.34)**2 + (t-0.20)**2 + (0.45 - (x+y+z))**2
x0 = [0.6, 1, 0.8]
res = minimize(cost, x0, method='Nelder-Mead', tol=1e-6)
Yeah, basically
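For anyone who wants to run this end to end, here is a minimal sketch. The body of `myfunc` below is a made-up stand-in (the real calculations are not shown in the chat), and the switch to `L-BFGS-B` with `bounds` is an assumption added to keep alpha, beta and delta inside 0 to 1.

```python
import numpy as np
from scipy.optimize import minimize

c, h = 0.53, 0.07
targets = np.array([0.34, 0.20, 0.45])  # desired s, t, and x + y + z


def myfunc(c, h, alpha, beta, delta):
    # Hypothetical stand-in: the real calculations are not shown in the chat.
    s = c * alpha
    t = h + 0.2 * beta
    x, y, z = 0.4 * delta, 0.1 * alpha, 0.1 * beta
    return s, t, x, y, z


def cost(params):
    # minimize() hands over all free parameters as one array.
    alpha, beta, delta = params
    s, t, x, y, z = myfunc(c, h, alpha, beta, delta)
    residuals = np.array([s, t, x + y + z]) - targets
    return float(np.sum(residuals ** 2))


x0 = [0.6, 1.0, 0.8]
bounds = [(0.0, 1.0)] * 3  # keep alpha, beta, delta inside [0, 1]
res = minimize(cost, x0, method="L-BFGS-B", bounds=bounds)
print(res.x, res.fun)
```

Squaring and summing the three mismatches turns the problem into a single objective. If one target matters more than the others, multiplying its squared term by a weight is a simple way to express that.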
Hi
Hi
Oh yes , you are right . Thank you so much ! @hearty tusk
Is it possible to change the location of the origin in a 2D Matplotlib plot?
what do you mean by "origin"
if you mean in the mathematical sense
look into ax.axis/plt.axis
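A small sketch of both readings of "origin", assuming the question is about where the axes cross and which region is shown. The plotted curve is arbitrary, and the `Agg` backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-3, 3, 200)
fig, ax = plt.subplots()
ax.plot(x, x ** 3 - 2 * x)

# Move the left and bottom spines so they cross at the data point (0, 0).
ax.spines["left"].set_position("zero")
ax.spines["bottom"].set_position("zero")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# ax.axis([xmin, xmax, ymin, ymax]) controls which region is visible.
ax.axis([-3, 3, -10, 10])
fig.canvas.draw()
```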
hello guys can anyone help me, i am trying to run code in google colab on a dataset with 29 lakh (2.9 million) records and fit a random forest classifier to it. when i run it, my session crashes with a memory error, even though i am on 16GB ram with a GPU. any way to run it?
Hi
can you show the code?
Hey @stoic hill!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
send the colab link
random forest doesnt use gpu btw
so try using a non gpu session with maybe more ram
ik even on cpu with 16gb of ram it crashes
i think the only problem is the ram
and i dont have that;(
i have a server with more than enough ram to run that, and it's not currently loaded with much, so if you send me the data i can run it
if you're ok with that
Hello, I'm just getting started with CNN, and I was wondering what would happen if I throw in a (greyscale) image (flattened into an array) to normal feed forward NN. Wouldn't the network learn the weights to classify the image or what would happen?
A lot of people use vsc
it's ultimately up to personal choice, but I like PyCharm.
Anybody know why I get this error when I run the code
Try Googling the "Could not load dynamic library ..." part. That appears to be the salient point.
One of the skills you'll develop is identifying the salient part of error messages. Often you can find an exact solution by googling the salient part.
Mine is way frustrating
pip install tensorflow
Required satisfaction
Import tensorflow
ERROR:
Module "Tensorflow" not found
And thats why i use mobile ide ;D
Yea I wish installing tensorflow was miles easier
It's a roadblock that prevents many beginners from starting to learn
Yea
I mean is google sooo
i need help
This is the first time I'm training a LogisticRegression model on 0.95*1.6 million rows and 0.5 million columns of data with penalty='elasticnet', l1_ratio=0.5, solver='saga'.
How long can it take?
It has already taken 4 hours and is still in progress. I want to track progress to see if it is really working or just stuck...
What's up! Does anyone know quant trading?
System usage stats
Correlation matrix captures linear relation between 2 features in a dataset. how to capture non linear relations between features? And how to address/eliminate them?
hello
how can i separate a pandas dataframe
4
0 02-03-2020 09:19
1 02-03-2020 09:20
2 02-03-2020 09:20
3 02-03-2020 09:21
4 02-03-2020 09:21
5 02-03-2020 09:22
6 02-03-2020 09:22
7 02-03-2020 09:23
8 02-03-2020 09:23
how can i separate date and time in the above data?
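One possible way to do this, sketched on a few of the rows above. It assumes the strings are day-month-year (swap `%d` and `%m` if they are not) and that the column really is named `4`, as in the pasted frame.

```python
import pandas as pd

# Column name "4" mirrors the header in the pasted frame.
df = pd.DataFrame({"4": ["02-03-2020 09:19", "02-03-2020 09:20", "02-03-2020 09:21"]})

# Assumption: the strings are day-month-year.
stamp = pd.to_datetime(df["4"], format="%d-%m-%Y %H:%M")
df["date"] = stamp.dt.date
df["time"] = stamp.dt.time
print(df)
```

Parsing to a real datetime first, instead of splitting the string on the space, keeps the option of sorting or resampling by time later.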
tensorflow works correctly
you just have to get cuda up and running
hello
I got a dataset from Kaggle about water potability and the data set gives potability as 0 or 1 (like True or False)
it has these attributes: ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity, and I wonder how I can apply multiple linear regression to it
actually it doesnt have to be linear regression
but i think i have trouble with 0 or 1 value of potability. When i try to apply multiple linear regression to it, it gives me absurd results
i want to find coefficients of these attributes
https://www.kaggle.com/adityakadiwal/water-potability here is the dataset
I am very new to data science and thank you in advance for your help
oh I found the solution. I think I must use binary classification not regression
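A minimal sketch of that idea with logistic regression, which also gives one coefficient per attribute. The data is synthetic (not the Kaggle file) and only three of the column names are reused, so the numbers mean nothing beyond showing the pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["ph", "Hardness", "Sulfate"]  # a subset, for brevity

# Synthetic stand-in data: NOT the Kaggle file, just the same shape of problem.
X = rng.normal(size=(500, len(feature_names)))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 2]
y = (logits + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling first makes the coefficients comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefs):
    print(f"{name}: {coef:+.3f}")
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

The coefficients describe the change in log-odds of the positive class (here, potability = 1) per standard deviation of each feature, which is why plain linear regression on a 0/1 target tends to give odd-looking results.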
are there any memory leaks in numpy? My pipeline data consists of 40 images in total (about 100 MB as JPEG) and about 1000 objects that hold array views of those 40 images. My memory consumption is dramatically higher than it should be (10-14 gigabytes of pipeline data; with only the libraries loaded it is about 3 GB).
aren't numpy images memory efficient since they use views?
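A quick way to check whether the objects really hold views, sketched on a dummy array (the image size is arbitrary). Basic slicing gives a view, but integer-array indexing and dtype conversions copy. Decoded pixels also take far more room than the JPEG on disk, which may account for part of the gap.

```python
import numpy as np

image = np.zeros((2000, 3000, 3), dtype=np.uint8)  # about 18 MB once decoded

crop = image[100:200, 100:200]             # basic slicing: view, no copy
fancy = image[[0, 1, 2]]                   # integer-array indexing: copy
as_float = image[:100].astype(np.float32)  # dtype change: copy, 4 bytes per value

print(np.shares_memory(image, crop))         # True
print(np.shares_memory(image, fancy))        # False
print(crop.base is image)                    # True: the view keeps the full image alive
print(as_float.nbytes / image[:100].nbytes)  # 4.0
```

Note the third line: a tiny view still pins the whole parent array in memory, so dropping the original image does not free it while any view survives.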
Hi, is it okay to conduct an elbow test using a data frame with no normalized variables (i.e all dummies), or is it better to include the data frame with normalized variables?
Would there be any significant differences? Thank you
Hi
I have made a Jarvis like
To make my things easier like opening an app or listening to songs
Voice recognition
Can someone help me to put apps
def JARVIS(self):
    wish()
    while True:
        self.query = self.STT()
        if 'good bye' in self.query:
            sys.exit()
        elif 'open google' in self.query:
            webbrowser.open('www.google.co.in')
            speak("opening google")
        elif 'open youtube' in self.query:
            webbrowser.open("www.youtube.com")
        elif 'play music' in self.query:
            speak("playing music from pc")
            self.music_dir ="./music"
            self.musics = os.listdir(self.music_dir)
            os.startfile(os.path.join(self.music_dir,self.musics[0]))
How do I put spotify or any other app
Does anybody know of a good example on sklearn.model_selection.TimeSeriesSplit?
I'm trying to use this method along with a model
Check #❓｜how-to-get-help, this channel is about data science and ai
when comparing models should you keep hyperparameters the same or optimise for each
Sorry I was not clear. I'm working on a boosted trees model and using XGBoost implementation for Python.
XGBoost can only predict one target, so I'm using scikit_learn multioutput regression as a wrapper to train a model with 3 target outputs.
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
multioutputregressor = MultiOutputRegressor(xgb.XGBRegressor(max_depth=3, n_estimators=100, n_jobs=2,
                                                             objective='reg:squarederror', booster='gbtree',
                                                             random_state=42, learning_rate=0.05)).fit(x_train, y_train)
Time series must be validated using walk forward validation . I want to use the scikit implementation on my problem but I can't find a good example online on how to implement this validation on my model.
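A sketch of how `TimeSeriesSplit` could be combined with the multi-output wrapper. To keep it runnable with scikit-learn alone, `GradientBoostingRegressor` stands in for `xgb.XGBRegressor` (an assumption, swap it back for the real model), and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                               # rows in time order
Y = X[:, :3] * 2.0 + rng.normal(scale=0.1, size=(200, 3))   # 3 targets

tscv = TimeSeriesSplit(n_splits=5)
fold_mae = []
for train_idx, test_idx in tscv.split(X):
    # Each fold trains only on rows that come before the test rows.
    model = MultiOutputRegressor(
        GradientBoostingRegressor(max_depth=3, n_estimators=100,
                                  learning_rate=0.05, random_state=42)
    )
    model.fit(X[train_idx], Y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mae.append(mean_absolute_error(Y[test_idx], pred))

print(fold_mae)
```

The splitter uses an expanding window by default. The rows must already be sorted by time, since it only looks at positions, not at timestamps.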
Ohh, thanks for letting me know!
I stg I've been trying to setup tensorflow for over a year
Tomorrow is the day I will finally finish
global just does it over the whole input
look it up for a better explanation
doesn't the normal maxpool also do it over the whole input
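A tiny NumPy illustration of the difference, assuming one 4x4 channel and a 2x2 window with stride 2 for the regular pooling. Regular max pooling keeps one value per window, global max pooling keeps one value per channel.

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # one 4x4 channel

# Regular 2x2 max pooling with stride 2: one maximum per window, 2x2 output.
windows = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)
pooled = windows.max(axis=(2, 3))

# Global max pooling: one maximum for the whole channel, a single number.
global_pooled = feature_map.max()

print(pooled)         # [[ 5.  7.] [13. 15.]]
print(global_pooled)  # 15.0
```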
can someone do me a favor and translate this keras nn to pytorch?
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
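A rough PyTorch equivalent, as a sketch. `n_features` and `n_classes` are placeholder values for `len(train_x[0])` and `len(train_y[0])`. One deliberate difference: the last layer outputs raw logits, because `nn.CrossEntropyLoss` applies log-softmax itself, so softmax is only applied when probabilities are needed.

```python
import torch
from torch import nn

n_features, n_classes = 20, 5  # placeholders for len(train_x[0]) and len(train_y[0])

model = nn.Sequential(
    nn.Linear(n_features, 128),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(64, n_classes),  # raw logits; no softmax layer here
)

# nn.CrossEntropyLoss applies log-softmax itself, so it takes the logits directly.
loss_fn = nn.CrossEntropyLoss()

model.eval()  # turn dropout off for inference
with torch.no_grad():
    batch = torch.randn(8, n_features)
    probs = torch.softmax(model(batch), dim=1)  # matches the Keras softmax output
print(probs.shape)
```

If `train_y` is one-hot encoded as in the Keras version, converting it to class indices (for example with `argmax`) is the usual input for `CrossEntropyLoss`.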
!e
import random as r
print(r.randint(1, 100))
#what come out will be decision
@dire echo :white_check_mark: Your eval job has completed with return code 0.
75
there, the most simple AI i can think of
ai involves intelligence
that's random
lol
technically the simplest ai you can do is using one parameter, so linear regression
why do you need to?
why not keep it in keras?
Hey @young valve!
It looks like you tried to attach file type(s) that we do not allow (.html). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
Pyspark:
from pyspark.sql import functions as F
def remove_null_columns(df, label_col, null_threshold=0.8):
    total_rows = df.count()
    cols_to_drop = []
    for c in df.columns:
        if c == label_col:
            continue
        null_values = df.select(F.count(F.when(F.isnull(c), c)).alias(c)).collect()[0][0]
        if null_values / total_rows > null_threshold:
            cols_to_drop.append(c)
    df = df.drop(*cols_to_drop)
    return df
df = remove_null_columns(df, label_col)
print(len(df.columns))
I'm removing columns which have more than 80% null values. How to optimise this code?
null_values = df.select(F.count(F.when(F.isnull(c), c)).alias(c))
this part
is the problem
you're calling collect once per column
you should write a query that selects the null percentage for all columns, filters out the ones that fall above the threshold, and then collect that
then drop
alternatively you can write it as a select but that's a bit more complex
I wouldn't recommend that
Thank you so much!
yw
Why doesn't it print the chart
https://www.kaggle.com/alexisbcook/hello-seaborn This is the tutorial I am following
add plt.show()
omg
thankyou sm
does anyone know how to open NDPI files in python?
hey guys anyone familiar with text generation pipelines? Need quick help to understand it more
can you be more specific @indigo skiff ?
!code in the future, can you please share your code as text with code formatting, instead of a screenshot? see below
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Does anyone know how efficiency of pd.cumsum() scales as compared to np.cumsum()
I found someone on StackOverflow saying pandas was faster, but my understanding was that pandas is just a layer built on top of numpy
I think Numpy is likely to be faster, but actually, just test it out
My dataset I'm currently developing with is really small, but when this goes into prod and uses live data it's going to be thousands of rows
I'll try a test though
Do you suggest %timeit?
Thousands of rows is generally not performance critical to me, unless you are doing >quadratic stuff
But well, you can make fake data with 50000 rows and see
How would I plot data on a United States Map by coordinate?
!e
from timeit import repeat
setup=(
"""
from numpy.random import default_rng
from pandas import DataFrame
x = default_rng().standard_normal(size=(30000, 20))
df = DataFrame(x)
"""
)
print(repeat("x.cumsum(axis=0)", setup, number=10, repeat=5))
print(repeat("df.cumsum()", setup, number=10, repeat=5))
@chilly geyser :white_check_mark: Your eval job has completed with return code 0.
001 | [0.19380801357328892, 0.1688365377485752, 0.17779459105804563, 0.20028469525277615, 0.2744292030110955]
002 | [0.2766013741493225, 0.28942475002259016, 0.25081070279702544, 0.2581129721365869, 0.2731569781899452]
@weak sentinel Minibot 'benchmark' seems to say np is slightly faster.
I also tested with colab, with numpy being also slightly faster
I further tested with C++ with compiler optimizations - it's a lot faster if you go that route, so there's that if what you're doing is somehow performance critical
My goal is to generate product descriptions using language models, however I'm not sure about the pipeline. It would be really helpful to learn more details, or to have a quick chat with someone who has gone through it.
Oh yea, sure. I just sent it as a ss to show the output
Hello
take in the product image from a CNN, identify what it is and generate a desc. I guess?
Hey, anyone know why my OLS trend is so whacky on my plotly scatter plots?
Like, in mid-April it jumps up when all of the observations were actually below average
tensorflow has problems on my computer, pytorch works perfectly. i actually managed to translate this thing yesterday so it's fine
What's the incentive of publishing articles on medium?
"towards data science" blog, to be more specific
@quasi sparrow money and probably reputation
TDS has a lot of clout nowadays
lots of people subscribe to it
otherwise, medium is just a blogging platform with some social media elements
Yeah, it's hard to navigate TDS. Most of the examples are toy programs.
hey how do i get time.sleep(60) to stop all functions and read the script logically so i can make my script stop where i want it to ?
why not fix tensorflow instead?
that's like asking someone to help move your stuff into a new house because your old house has a leak
lol
true
2021-08-02 10:20:52.830264: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-08-02 10:20:52.830756: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
did you install cuda
hey guys any references to intuitively understand json/html/css parsing
so far its a lot of brute force try and try again
there's already prebuilt modules in python for html and json parsing
also, that doesn't really sound like it would belong in this channel
oh sorry - which channel is more appropriate? for context im in a data science course so i thought that this is something a lot of native users could speak about
I have the same issue lol
I gave up
and just used Google Colab
did you install cuda lol
so much easier
cuda toolkit?
there's your issue :)
[Link preview: NVIDIA CUDA Toolkit download page, "Select Target Platform"]
you have to go to the archives to see it
king
the update 1 part doesn't matter
then you'll also need to install cudnn from here https://developer.nvidia.com/cudnn-download-survey (you have to make a developer account to get it, its free though)
Some lstm generated tweets:
Heya really tweeting are understanding upon with are you and of it you here and.
I am murder of rogers time as shorter with burgers and with me and here once blahhh.
Stop crazy twitter very hit turkey little homemade turkey upset his food haircut.
Mite goodmorning because roof with development yay amp twittering a his paparazzi.
Willieday tweet guys cousin are ian getting hopes and i im gonna for with so im here yet xxx.
Even sooo the shame create home food visit hit your massive myself starbucks.
Dats sounded yay hence hopes proud brit of ease you movies pain like you they are on at tomorrow.
Lauren hate with bugs wouldnt yet doing word and there do to about that and cya.
Iranelection alot the rumor recorded torture approach printed are of it even love and he are around yay.
Httptwitpic doing jenna and does perfect news line political newcastle while going on your song and in proud dani
made with an LSTM autoencoder and dcgan
I feel like markov chains with ngrams are better than this, but it's still interesting.
do you have the source code? I'd love to see it.
totes this is more for learning
about gans and seq2seq autoencoders
i'd fine tune gpt-2 if i was like going for quality tweets
sure here's the lstm autoencoder
I have a bad taste in my mouth for anything GPT because I had been researching NLP for two years before I even heard of it, yet people talked about GPT as if it permanently solved all of NLP.
yeah ik but gpt-2 is still probs the best tool i would have for generating tweets coherently
but gpt is abit overbearing rn
the problem is that there's more to NLP than generating text.
right. and it's still very interesting 
yeah classification, recognition, text-to-speech
exactly - great for anything text-generationy, but doesn't do everything
it isn't AGI for NLP
what is AGI?
artificial general intelligence
Replace GPT with BERT ezgame
TDS is very annoying with "subscribe or clap or whatever to read this 20%-of-the-time-useful bit"
I don't even know how it's so high up when stackoverflow is even better, especially when the result is quite pertinent
# Imports
import cv2
print("imported cv2")
# Loading pre-trained data
trainedFaceData = cv2.CascadeClassifier('FaceDetection/haarcascade_frontalface_default.xml')
print("loaded pre-trained data")
# launch webcam (index 1 is the second camera; 0 is usually the default one)
webcam = cv2.VideoCapture(1)
print("Webcam launched")
print("Press any key to exit")
# loop over all frames
while True:
    sucessFrameRead, frame = webcam.read()
    # stop if no frame could be read
    if not sucessFrameRead:
        break
    # Converting to grayscale
    grayscaleImg = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    print("Converted to grayscale")
    # Detecting faces
    faceCoordinates = trainedFaceData.detectMultiScale(grayscaleImg)
    # Print location of face
    print(faceCoordinates)
    # Draw rectangle around face
    for (x, y, w, h) in faceCoordinates:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 1)
    # show image (window name, and what you want to show)
    cv2.imshow('Face Detector app', frame)
    print("Displaying image")
    # wait 1 ms; close by pressing any key
    if cv2.waitKey(1) != -1:
        break
# free the camera and close the window
webcam.release()
cv2.destroyAllWindows()
How do I print the certainty?
well, transformers help
plus they are SOTA, so its hard to argue with that ¯\_(ツ)_/¯
yea i only use lstms cuz they are easier
how about music and AI, there is someone very good at it here?
Perhaps. You're more likely to get answers when you just put the question out there.
i've DONE music and ai
very badly
but basically
encode music as text -> generate text w/ lstm -> decode back to music
carykh has great vid on it
def remove_null_columns(df, label_col, null_threshold=config['null_threshold']):
    all_features = df.columns
    if label_col in all_features:
        all_features.remove(label_col)
    df.createOrReplaceTempView('remove_null_columns')
    query = 'select ' + ', '.join(['count(`%s`) * 1.0 / count(*) as `%s`'%(i, i) \
        for i in all_features]) + ' from remove_null_columns'
    non_null_count = spark.sql(query).collect()[0].asDict()
    columns_to_drop = [k for k, v in non_null_count.items() if (1 - v) > null_threshold]
    df = df.drop(*columns_to_drop)
    return df
new_df = remove_null_columns(df, label_col, 0.1)
print(len(new_df.columns))
Does this look good?
It runs much faster compared to earlier. Should I change anything further?
you forgot your ```py
I feel like... there must be an easier way. Are you just trying to drop columns that contain a NaN or what?
well
I would suggest
you refrain from writing your own SQL
you can do that with the Spark DSL
but that's not a huge problem
columns that have more nulls than a certain percentage
That doesn't work on pyspark
haha
@velvet thorn i imagine the sql version would be a lot faster, no?
otherwise you end up doing a for loop over columns with a collect-ing operation (count) in each iteration of the loop
or is there some pyspark magic i don't know about
no
this is actually
what they did originally
and I said
do it once, with one collect
it's been like 2 years since I worked with PySpark
but
you can defo express it with the Spark DSL
how would you do that with the DSL? you can't count without "collecting"
maybe the scala version lets you do some map/filter stuff over columns
like you write a query that counts the nulls in each column, collect that, then drop
not sure if I'm expressing myself properly
df.select((F.count(F.isnull(F.col(col)) / len(df) < 0.8).alias(col) for col in df.columns)?
something like that?
I don't really remember but that should work
wait as is reserved in Python right
yeah it's .alias in pyspark
I remember there was some cool stuff in Scala not available in Python
F.count
but it's been a LONG time since I did any sort of Spark
that was it
yeah 3.something
def null_frac(df, colname):
    # fraction of nulls = mean of the 0/1 null indicator
    return F.mean(df[colname].isNull().cast("double"))
col_null_fracs = df.select(
    *(null_frac(df, c).alias(c) for c in df.columns)
).first()
bad_columns = [c for c, f in col_null_fracs.asDict().items() if f > 0.8]
df = df.drop(*bad_columns)
@proven sigil
it's probably good to stay fresh on pyspark
i haven't used it in over a year
Natural language...
can we use unused human brain as cpu, LOL
It also come with free big hardrive and built in learning module
That's amazing. Thanks!
Tokenization: Subword Tokenization splits words into smaller parts based on the most commonly occurring sub strings. Word Tokenization splits a sentence on spaces as well as applying language specific rules to try to separate parts of meaning even when there are no spaces. Subword Tokenization provides a way to easily scale between character tokenization i.e. using a small subword vocab and word tokenization i.e using a large subword vocab and handles every human language without needing language specific algorithms to be developed. On my Journey of Machine Learning and Deep Learning, I have read and implemented from the book Deep Learning for Coders with Fastai and PyTorch. Here, I have read about Word Tokenization, Subword Tokenization, Setup Method, Vocabulary, Numericalization with Fastai, Embedding Matrices and few more topics related to the same from here. I have presented the implementation of Subword Tokenization and Numericalization using Fastai and PyTorch here in the snapshot. I hope you will gain some insights and work on the same. I hope you will also spend some time learning the topics from the Book mentioned below. Excited about the days ahead !!
https://www.linkedin.com/posts/thinam-tamang-3b12831a2_300daysofdata-66daysofdata-machinelearning-activity-6828204194089041920-gyVw
@raven steeple Please don't try to ping @everyone or @here. Your message has been removed. If you believe this was a mistake, please let staff know!
Hey folks! I am a data science aspirant and I have been learning SQL for the past month for a data analyst role.
I know JOINs, CASE, CTEs, and aggregate functions, as I have used all of these to solve problems on HackerRank.
What is the next step?
What more do I need to know?
does anyone know how to check if the pixels have more than 8 bits per channel for a given image?
maybe check the size of the image
if it's above what an 8 bit channel img should be
it has 10 bits?
like the size in mem of the img idk
so im using openslide
oh
and the only thing I can get out of it are the dimensions
isn't that a C library?
no python has it too
you cant like... read the size in memory
which are only (width, height)
that's it
so ONLY with the width and height
can you tell if it has more than 8 bits
I converted it to a thumbnail and that turned this to an array and gave 3 dimensions, last being 3 which is colour
yh for thumbnail
not sure what you mean as there's different levels in this library but generally width and height are different for each level
ok
so i dont think your problem can be resolved
with just width, height and channels
look for someone who knows openslide
maybe there's something here
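Width and height alone cannot reveal bit depth, but once the pixels are in a NumPy array the dtype can. A small sketch under that assumption, with two hypothetical helper functions. It does not settle whether OpenSlide exposes the scanner's original bit depth.

```python
import numpy as np


def bits_per_channel(pixels: np.ndarray) -> int:
    """Storage bits per channel, taken from the array dtype."""
    return pixels.dtype.itemsize * 8


def min_bits_needed(pixels: np.ndarray) -> int:
    """Smallest bit depth that can hold the largest value actually present."""
    return max(int(pixels.max()).bit_length(), 1)


eight_bit = np.full((4, 4, 3), 255, dtype=np.uint8)
ten_bit_in_16 = np.full((4, 4, 3), 1023, dtype=np.uint16)  # 10-bit data in a 16-bit container

print(bits_per_channel(eight_bit), min_bits_needed(eight_bit))          # 8 8
print(bits_per_channel(ten_bit_in_16), min_bits_needed(ten_bit_in_16))  # 16 10
```

If the array coming out of the library is already `uint8`, any extra bit depth was lost before this check, so the file's own metadata would be the place to look.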
Can anybody share a good resource on Entity Extraction Model..??
that sounds like a pretty good basis for data analysis. maybe also take a look at window functions. otherwise you know more than enough to get started, and you should start focusing on other things like excel skills, light-duty data processing with python, basic command line stuff, and probability/statistics
Hello, everyone! I'm a beginner python programmer who is coming from a project management background and looking to transition into a data analyst career. Could you help me build a roadmap of which skills to develop? So far my python projects are very basic and I wanted to build projects that get slowly more complex.
So far I know the basics of the language and I've played a little bit with some analytics concepts in my last project. I don't want to write a giant text wall here with all my questions and curiosities, but here is my github and I would really appreciate some recommendations on what to work on: https://github.com/renatolew
I am sorry if this is not the right way to ask this. This is my first time participating in a programming community and I don't know how things work yet
how about if you want to get into machine learning
i see you already built a recipe analyzer
how about an RNN/LSTM to write original recipes?
but for a roadmap
i'd say learn SQL, numpy, pandas, then get into scikit, machine learning, then finally learn tensorflow, keras, deep learning
all while making projects
Thank you. Do you recommend any specific projects for me?
to start - idk its really up to you?
but your recipe thing is a great example
for my getting started with neural nets
One of my issues has been finding good projects, because the last two ones I tried were way more complex than I anticipated and I got stuck pretty quickly
i experimented with the iris dataset
or i built a simple webscraper
an LSTM to generate text
all pretty easy with plenty of examples
Thank you. I'll look into it!
those dont work too well and are kinda difficult
with simple LSTMs
though an idea i had
was to run an LSTM on wall street journal article headlines
and use that to predict how the DOW/ S & P 500 would perform in the next week
if anyone wants to try that
i mean.... you could... but text is time series data
so lstms yeet DNNs
Hello, I want to ask about making a custom matrix with pandas.
I want to make a plot similar to a confusion matrix, but showing the average distance between the actual and predicted class. The data is available here https://paste.pythondiscord.com/ehogipafip.css but I cannot find good advice on how to do this.
i want to make it like this so i can analyze which entities are closely related to each other
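One way this could look with `pivot_table`, as a sketch. The column names `actual`, `predicted` and `distance` are hypothetical (the real data behind the paste link may be laid out differently), and the values are made up.

```python
import pandas as pd

# Hypothetical columns; the real data behind the paste link may be named differently.
df = pd.DataFrame({
    "actual":    ["A", "A", "B", "B", "C", "A"],
    "predicted": ["A", "B", "B", "C", "C", "B"],
    "distance":  [0.1, 0.6, 0.2, 0.7, 0.1, 0.4],
})

# Rows: actual class, columns: predicted class, cells: mean distance.
matrix = df.pivot_table(index="actual", columns="predicted",
                        values="distance", aggfunc="mean")
print(matrix)
```

Pairs that never occur come out as NaN. The resulting frame can be passed to a heatmap function such as seaborn's `heatmap` to get the confusion-matrix look.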
Ok thanks
gl
I am doing a walk forward validation to evaluate a model. To get the most accurate picture of how the model really performs on new data, I retrain the model every timestep. I am testing for 7 timesteps and repeating the walk forward validation 3 times. The model seems to be getting better every time it goes through the 7 days. Could it be possible that the model has past knowledge and is basically "cheating"?
I am working with keras.
Do you have a separate validation set to test against?
no
Its time series data
Thats why I have walk forward validation
So just to be safe, is there a way to restart models completely?
You can still (and definitely should) have a separate validation set
how
Have you got a sample of your dataset?
like 5
-0.10000228881836648
0.6800003051757955
0.5600013732910085
-0.8400001525878906
-0.7400016784667969
0.9099998474121094
Okay gimme a minute, I need to check some code where I have done something similar before
alright
So if you have 131 datapoints, you could only use the first 100 for training and testing
Then at the end of each epoch, validate on the remaining 31
That is not a very accurate way to test though
Walk forward validation is a great way for testing models with time series data because it represents how I would make predictions with the model
So the problem here is not really the way to testing.
I'm looking for a way to completely delete a model
Either way, the same still applies, you train the model on the first 100 or so datapoints, then once you have finished, you can use those 100 to try and predict the final 30 or however many you want to use as a final benchmark, otherwise there is no way to know the actual performance of the model, this would apply to both sliding windows and expanding windows
Not sure what you mean by deleting the model, since if you are saving them, they are usually just saved in a .h5 file
Statement: We cannot determine overfitting based on one hypothesis only
- Why is that? Isn't it the case that if we have a hypothesis with very low E_in and have a high E_out that its a sign of overfitting? Why can't we in virtue of that conclude that H1 is overfitting the data?
how do i start the basics of machine learning
I don't think you are understanding what walk forward validation is
The point is not to get the best model
it is to test to see how well the model performs
To get the best understanding of the variance of predictions, I am running the walk forward validation 3 times. After every time step, I want to make a new model and train it with the data it has. Then make a prediction and compare it with the answer
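A sketch of that loop with a fresh model at every step. `Ridge` and the random-walk series are stand-ins chosen so it runs with scikit-learn alone, and `build_model` and `make_windows` are hypothetical helpers. With Keras the same pattern would mean building the network inside `build_model`, since calling `fit` again on an existing model continues from its current weights.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=131))  # synthetic stand-in series
n_lags, n_test = 5, 7


def make_windows(values, n_lags):
    # Each row holds n_lags past values; the target is the value that follows.
    X = np.array([values[i:i + n_lags] for i in range(len(values) - n_lags)])
    y = values[n_lags:]
    return X, y


def build_model():
    # A brand-new, untrained model on every call: nothing carries over.
    return Ridge(alpha=1.0)


X, y = make_windows(series, n_lags)
errors = []
for step in range(n_test):
    split = len(X) - n_test + step
    model = build_model()
    model.fit(X[:split], y[:split])              # only data before the step
    pred = model.predict(X[split:split + 1])[0]  # one-step-ahead forecast
    errors.append(abs(pred - y[split]))

print(np.mean(errors))
```

If scores keep improving across repeated passes over the same 7 days, a model object reused between passes is a likely suspect, which creating the model inside the loop rules out.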
I recommend the SoloLearn Course which is free
Thanks
np
it does not go into deep learning right away which is good
at the end it does introduce u to neural networks
If logistic regression is classified between two features, and KNN is classified between more than two features, between how many features a decision tree classified?
this is where the tweet lstm is now:
So greece calories shooting tunnel tragedy and lovin crazy touching.
Just dreading full marathon is as or pool rock move spain beers.
Taking up drunken other tragedy high projects is alot less that no if i feel maybe even worry.
Was into a decisions chilled apples lol but take dream that sore workout i.
Just dreading full thunderstorm is fun or pool rock move spain rehearsal.
Here at my appropriate soft more animation today i was a drunken and.
Just off in class is id miss central places meet downtown weeks.
Was up the disc murder a misery comcast blogging pirates.
Chilling reality tonight october possibly on your annoying theme music for flowers.
Iranelection greece murder ideal loud and is i lovin with touching.
better than me can right
i want to ask about the arcface embedding algorithm, specifically the output vector. Are the vector values normalized between 0 and 1 or not? i need to know so i can decide which similarity method to use for inference
is it a yes or a no?
You are probably right, I thought it was similar to k-fold but with time series data, but it sounds like something else, you got a resource I can look it up on?
the lstm is an autoencoder lstm with dcgan if anyone wants to give it a shot themselves
Yeah no worries, ranked siege is way more important
yes, sorry
@acoustic halo https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
Okay it was what I thought it was, nevermind
is there any benefit to over fitting? maybe it can give you some insight into your data like in unsupervised learning
Yes, the ability to over-fit suggests that your model is sufficiently complex to handle the data. If you just do a single variable linear regression on complicated data, you'll never be able to get a proper fit to, let's say, sinusoidal data. So if you are concerned that you don't have enough layers or you didn't choose a complex enough ML model, the ability to over-fit suggests that it is complex enough but you need more data, or data that better predicts the test samples
It can also give you an idea of when to stop training, because if you are over-fitting, you've gone too far
you can tell when you've overfit?
Yes, in super simple terms if your model seems to be doing really well on the training data, but then does worse on new data it has never seen before, it's likely because it has overfit to the training data
what do you mean by doing well on training data? I thought you just kind of feed the training data in and out pops a model
I am sort of seeing on my test data, that most predictions are pretty close to the actual, but a few are far off
I think that points to over fitting possibly
Okay so let's say you train a model on your training data
Then after training you test your model on the training data again and get 99% correct predictions
Then you test your test data in the model, which only predicts correctly 30% of the time
That suggests the model has learnt the training data so well that it doesn't generalize to new data, aka it has overfitted
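A small sketch of that train-vs-test check, using synthetic data and an unrestricted decision tree. The dataset and its settings are assumptions picked to make the gap obvious; they are not anyone's real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 30% of the labels flipped at random (label noise)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until it has memorised the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A big gap between the two numbers is the warning sign. Limiting the tree (for example with `max_depth`) usually narrows it.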
Hi guys, when it comes to faster python code, whats the difference between numba and transonic?
transonic appears to be a "wrapper" around several packages, of which numba is one
If any of you are feeling extra generous tonight, mind joining me in #help-coffee? Need a pandas Q answered, thx
Anyone have some good tips for how I could speed up my yolov5 model? Using 720p images for my input data and only two classes, I have about 3000 training images and 10% of those are for validation
What can I do to speed up inference no matter how small?
So like, I've been trying to figure out how many heights are within one standard deviation for this given set... but I seem to be doing something wrong, in a process I thought was fairly straightforward
code:
from math import sqrt
players = [180, 172, 178, 185, 190, 195, 192, 200, 210, 190]
mean = sum(players) / len(players)
pre_variant = [(number - mean) * (number - mean) for number in players]
variance = sum(pre_variant) / len(pre_variant)
std = sqrt(variance)
valid = [player for player in players if player in range(int(mean-std), int(mean+std))]
print(len(valid))
Find the mean, then the variance which is the average of the squares of the difference of each value and the mean, find the standard_deviation which is the square root of the Variance, and then find all numbers in that range...
Where did I go wrong?
I tried doing this same method with different data, another question that I knew the correct answer for, and it worked fine
Nvm... I got it to work, was a problem with my last list comp
from math import sqrt
data = [180, 172, 178, 185, 190, 195, 192, 200, 210, 190]
mean = sum(data) / len(data)
pre_variance = list(map(lambda number: (number - mean) * (number - mean), data))
variance = sum(pre_variance) / len(pre_variance)
std = sqrt(variance)
result = list(filter(lambda number: number > (mean - std) and number < (mean + std), data))
print(len(result))
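The first version was off because `range(int(mean-std), int(mean+std))` truncates both bounds to whole numbers, so 178 sneaks in even though the real lower bound is about 178.6. A shorter sketch of the same calculation with the standard library's `statistics` module, assuming the population standard deviation is what's wanted (that is what the code above computes):

```python
from statistics import mean, pstdev

players = [180, 172, 178, 185, 190, 195, 192, 200, 210, 190]

mu = mean(players)        # 189.2
sigma = pstdev(players)   # population standard deviation, about 10.56

# Compare against the float bounds directly instead of building a range()
within_one_std = [h for h in players if mu - sigma < h < mu + sigma]
print(len(within_one_std))  # 6
```

If the sample standard deviation is needed instead, `statistics.stdev` divides by n - 1 rather than n.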
The result
What exactly is TensorFlow? And why do many say that it makes it easy to do ML when looking at it makes my head hurt?
Many use something like keras which provides a simpler interface for tensorflow, or pytorch instead
Tensorflow can be difficult to understand if you are not already familiar with the concept of tensors
Where should I start learning the fundamentals of ML so I can build some cool applications with it?
r/learnmachinelearning: A subreddit dedicated to learning machine learning
Theres also a bunch of other resources in the pinned messages
Codecademy has a decent course as well if you have access to that
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv('/content/student-por.csv')
print(data)
X = data.drop(columns=['grade'])
print(X)
y = data['grade']
print(y)
model = DecisionTreeClassifier()
model.fit(X , y)
model.predict([[18, 2, 2, 0, 1, 0, 0, 0, 1, 1, 0, 0, 4, 3, 4, 1, 1, 3, 4]])
What am i doing wrong here
ValueError Traceback (most recent call last)
<ipython-input-33-4fc4f82f40e4> in <module>()
9 print(y)
10 model = DecisionTreeClassifier()
---> 11 model.fit(X , y)
12 model.predict([[18, 2, 2, 0, 1, 0, 0, 0, 1, 1, 0, 0, 4, 3, 4, 1, 1, 3, 4]])
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
875 sample_weight=sample_weight,
876 check_input=check_input,
--> 877 X_idx_sorted=X_idx_sorted)
878 return self
879
/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
171
172 if is_classification:
--> 173 check_classification_targets(y)
174 y = np.copy(y)
175
/usr/local/lib/python3.7/dist-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
167 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
168 'multilabel-indicator', 'multilabel-sequences']:
--> 169 raise ValueError("Unknown label type: %r" % y_type)
170
171
ValueError: Unknown label type: 'continuous'
This works with other data i have but not with this one for some reason
current data : https://paste.pythondiscord.com/pogoqinuko.apache
old data : https://paste.pythondiscord.com/ayazepezoz.apache
Is it that it can only train with 2 parameters
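The number of features isn't the limit here. `Unknown label type: 'continuous'` usually means the target column holds non-integer float values, so scikit-learn refuses to treat it as class labels. A minimal sketch with made-up data (the column names are hypothetical, not the ones from the CSV above): either switch to `DecisionTreeRegressor`, or, if the grades really are categories, cast them to integers first.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up stand-in for the student data; 'grade' is stored as floats
data = pd.DataFrame({
    'age':       [18, 17, 15, 16, 18, 17],
    'studytime': [2, 3, 1, 2, 4, 1],
    'grade':     [11.5, 14.0, 8.5, 12.0, 16.5, 9.0],
})
X = data.drop(columns=['grade'])
y = data['grade']

# Option 1: the target is a number to predict, so use a regressor
regressor = DecisionTreeRegressor(random_state=0).fit(X, y)
print(regressor.predict(X.iloc[[0]]))

# Option 2: the target is really a set of classes, so make the labels integers
classifier = DecisionTreeClassifier(random_state=0).fit(X, y.round().astype(int))
print(classifier.predict(X.iloc[[0]]))
```

Which option is right depends on whether a grade of 11.5 is meaningfully "between" 11 and 12 for the problem at hand.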
Hello, I'm just getting started with keras and was trying out this code:
model = keras.Sequential(name='shit', layers = [ keras.Input(shape=(2,)),
keras.layers.Dense(3, activation='relu'),
keras.layers.Dense(1, activation='sigmoid')
])
But I get this error :
WARNING:tensorflow:Please add `keras.layers.InputLayer` instead of `keras.Input` to Sequential model. `keras.Input` is intended to be used by Functional model.
What does it mean by "Functional model" and why am I getting this error?
Because as it says, you need to use InputLayer, not Input
you could also just do keras.layers.Dense(3, activation='relu', input_shape=(2,))
And skip defining the input layer
Functional models in keras are just models more complex than sequential models, eg:
well the error pretty much explains itself, keras.Input is meant for a different kind of model, and so for sequential models you should replace it with keras.InputLayer
a functional model is a model that is made kind of like this:
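from tensorflow import keras
# layer sizes and activations below are placeholders, pick your own
inputs = keras.Input(shape=(2,))
x = keras.layers.Dense(3, activation='relu')(inputs)
x = keras.layers.Dense(3, activation='relu')(x)
# ...
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)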
if you'd like to learn more about those here's the documentation link for the functional api https://keras.io/guides/functional_api/
Cool, thanks a lot @acoustic halo & @austere swift
Actually got a functional API question myself, could you implement NEAT with it by creating layers with single inputs and outputs to effectively act as single nodes?
Hello, I've got an error while running a NN. Can you please help me?
It's giving me this error
WARNING:tensorflow:Model was constructed with shape (None, 300, 1) for input KerasTensor(type_spec=TensorSpec(shape=(None, 300, 1), dtype=tf.float32, name='gru_9_input'), name='gru_9_input', description="created by layer 'gru_9_input'"), but it was called on an input with incompatible shape (None, 14, 1).
I think it's something to do with the way I've inputted the data, but I cannot figure it out
with the functional API - you can modify your NN any way you want as long as autograd does its job ¯\_(ツ)_/¯
That's what I was thinking, not sure what the performance would be of having potentially hundreds of layers, even if they are small
well, you can always use their profiler to optimize them further if they take too much time
though use pytorch or jax then cuz it would be more controllable
Not that I plan on doing this, it's mostly just a thought experiment
But it also makes me wonder if there are any papers that apply a method like NEAT to layers as opposed to individual nodes
I have a dataset (100k,52) with labelled anomalies and I'm trying iforest on various variations of the dataset. So far none returns anything sensible. When I plot the feature space in 2D with PCA or TSNE with different colours for normal and anomaly points, must I be able to visually confirm anomaly regions? In my case normal and anomaly points are mixed e.g. with PCA normal points form a circle and anomaly points are scattered towards the middle or with TSNE everything seems to be mixed altogether
Hello! So I am not sure if this is the right room to ask my question but I didn't find a more appropriate one. I want help with fitting a curve on data with errors on both the x and y axes. From what I've read scipy's curve_fit cannot deal with x errors (correct me if I'm wrong). I tried using odr but I think that it didn't give me the correct curve. I could be wrong and it could actually be the best fit curve but I would appreciate a second opinion. Thanks!
I was trying to make an ML algorithm that predicts the value of for example f(x) = 2x for a specific input to x (overkill but I'm trying out the power of ML) so that when I fit this array into the model (DecisionTreeRegressor),
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
it should hopefully predict a y value of 40 for x = 20, 60 for 30, 90 for 45, so on and so forth.
However, when I try to predict 6 for example with the model trained on the array above, it returns 10 and not 12. It goes for any number higher than the numbers in the array.
Can another model solve this issue?
import pandas as pd

length = 50
slope = 2
b = 0
step = 1
# Initialize array
data = {'x': [(i * step) for i in range(length)], 'y': [(i * step) * slope + b for i in range(length)]}
df = pd.DataFrame(data)
# Define
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1)
y = df.y
X = df.x
X = X.values.reshape(-1, 1)
# Fit
model.fit(X, y)
# Predict
print(model.predict([[10]]))
Thanks!
Anyone here ever do any roguelike stuff with Python that can be read by a lotta shit
there is nothing against it, but it is really screechy
@nekitdev it's already there for you to call it like a regular pattern..maybe if u can help me grab this guys token he keeps logging on my account on steam just saying it out loud for me to learn python?
im new in python that technically it is for
You are not allowed to use that to run in the background while other code is where it doesnt work now
and the webdriver in your code without seeing the code
lmao i use repl .it
that would be a great choice
You need to use requires some notion of OOP and to be compatible with both, Linux and Windows support multiple IPs on same NIC
How do you become better than him including me
markov chains = somewhat realistic python discord messages
When i look at to get a simple 1 or 0. 1 means it is assigned to a class
None
I don't appreciate the tone you're taking with cepo. Do you understand what you're asking for is .... Twisted indeed .... badum tsss
how to do a project skskkss
Isn't there a nice project to do with said bot, Selenium is a testing tool, not for scraping
how do i call a function inside a for loop breaks
well pandas complains about the python language. This python course i'm auditing just got a quick question if u have a point where all the keys in the dictionary, instead you should loop backward so if someone asked me to teach him, he didn't take it seriously
if i want to do something on button click, not refresh or redirect the page, instead; update the content in the github student program, which is easier to read i think
It'll work for all of these errors
** does anyone know of the styling guidelines
DecisionTreeRegressor is really bad for this kind of task. i'd encourage you to look at what exactly a decision tree is, and try to figure out why it's so bad.
Decision tree, I understand
You're looking at a tree of possibilities, but they're only limited to the trained values or something
also you might want to practice using numpy/pandas for working with data more efficiently:
import numpy as np
import pandas as pd

length = 50
slope = 2
b = 0
step = 1
data = pd.DataFrame({'x': np.arange(0, length, step)})
data['y'] = data['x'] * slope + b
That makes intuitive sense
So the np.arange function kinda functions exactly like the range function with a min, max and step?
more or less, yeah. a tree has a fixed set of "split points". so it fails in 2 areas:
- it can't extrapolate out of range of the training data - this is a problem in every model, but decision trees always predict the same value on out-of-range data
- it can't predict a continuous range of values, unless you have an infinitely deep tree, which isn't possible
and for the sake of the exercise: if you know that the underlying function has the form f(x) = ax + b, what model is definitely the best choice for learning this function f?
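To see the extrapolation problem concretely, here is a small sketch comparing the tree with plain linear regression on the same y = 2x data. The choice of x = 100 as the out-of-range test point is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Same setup as above: y = 2x for x = 0..49
X = np.arange(50).reshape(-1, 1)
y = 2 * X.ravel()

tree = DecisionTreeRegressor(random_state=1).fit(X, y)
line = LinearRegression().fit(X, y)

# x = 100 is far outside the training range
x_new = [[100]]
tree_pred = tree.predict(x_new)[0]
line_pred = line.predict(x_new)[0]
print(tree_pred)   # stuck at the value of the last leaf, 98
print(line_pred)   # about 200
```

The tree can only ever return a value it saw during training, while the linear model has learned the slope and intercept and applies them anywhere.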
!d numpy.arange
numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)
Return evenly spaced values within a given interval.
Values are generated within the half-open interval `[start, stop)` (in other words, the interval including *start* but excluding *stop*). For integer arguments the function is equivalent to the Python built-in *range* function, but returns an ndarray rather than a list.
When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use [`numpy.linspace`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html#numpy.linspace "numpy.linspace") for these cases.
No ML is actually necessary, the function can literally be implemented as a simple function
sure, i'm just checking for understanding
I knew it was overkill, I just wanted to check what ML can do
in that case a tree isn't a great idea either. you'll want something that can learn "complicated" functions - random forest, gradient boosting, neural network
if you have the ability to arbitrarily create test inputs and outputs, you might do even better with a gaussian process model
depends on the situation
So a tree is better for let's say bool values or limited-choice values like gender, eye color, country of origin, etc
yeah. i don't think i've ever seen a single regression tree used in serious work
they used classification trees quite a bit when i worked in insurance
i guess a single regression tree isn't bad if you want to "cut" the target into categories/levels, but you don't know what those categories/levels should be
pretty specific use case
I just started kaggle's intro ML course today and it was the first model in the tutorial
it's definitely better if you have more features: that means you have more splits, and more possible values the tree can predict
i think this tutorial is meant to show off scikit-learn moreso than decision trees
but price is probably an okay situation to use a decision tree: you don't really care about the exact price, and it probably makes sense to group prices into "levels" anyway
i thought the titanic competition was a binary classification task?
i remember doing it in ~2015 when i was first learning python and branching out from "traditional" stats and econometrics
Yeah, the target variable is survival
a classification tree is a sensible intuitive choice for that problem
personally i came from the statistics world, so i didn't even consider it and went right for logistic regression (which is also a sensible choice, for other reasons)
I just wanted to dive right in to doing ML stuff with whatever code I learned
it's also worth considering what the advantages and disadvantages of a decision tree and logistic regression are
I still don't know what kinds of ML models there are
i'd argue that the decision tree is a lot easier to understand. but knowing stats and probability is very valuable for doing serious ML or more general data science work
I know, that's why I've been practicing on Khan Academy
that's good
i think i found a better YT channel for stats.. i will try to find it
there are pretty much 3 main categories of ML models in common use: trees and ensembles of trees, neural networks, and statistical models (especially linear ones). many other types exist, but these are the 3 that you will see over and over.
From what I understand, neural networks have layers and utilize linear algebra for getting data from one layer to the next
Are trees in ML models represented by actual tree data structures?
are you thinking of random forests?
Hmm
yes, my understanding is that they are acyclic directed graphs.
(which is a specific kind of tree--trees can be undirected)
so I trained my model trying to get it to overfit (just to see how it performs), and maybe as expected, it performs well on the training data, and worse but also well on test data (never before seen)
is that ok? like the only thing that matters is how well it does on never before seen data?
I mean, it's okay as long as you are happy with it, but reducing the overfitting may make the result better on the test data
not usually, but how they are stored is a good question. it probably varies across implementations.
a "tree" data structure holds data - a decision tree is an algorithm, and you store various parameters that make the algorithm work
in a lot of real-world tasks, yes, this is the most important thing
actually, I'm adding parameters to reduce overfitting, and it's really not doing anything
huh
reduce parameters, don't increase them
Hmm ok
I meant to ask how ML models store what they get from fitting
Like, it concluded that if sex == male, not survived and if sex == female, survived
the scikit-learn decision tree estimator classes are written in python (the core tree-building code underneath is Cython), you could read through it if you're curious https://github.com/scikit-learn/scikit-learn/blob/82df48934eba1df9a1ed3be98aaace8eada59e6e/sklearn/tree/_classes.py#L445-L494
The __init__ method is shorter than I expected
What is covariance? Can anyone explain with an example? All I know is that if one var increases and the second also increases, that's positive covariance, but what if one moves up and the other down? Can anyone explain simply what covariance is and how it differs from correlation?
correlation is covariance normalized to lie in [-1, 1]
positive covariance between X and Y means that, when X is "high", then Y tends to be "high", and when X is "low", then Y tends to be "low"
Ok... so what if one is high and the other seems to go down then? is it cov or corr?
correlation is computed from covariance
ok
and what about the range ? Like corr range between 0 and 1 so covariance lies in 1,-1 ?
no, correlation ranges from -1 to 1 like i told you
covariance is unbounded
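A quick numeric sketch of the difference, with made-up numbers. When one variable goes up and the other goes down, both covariance and correlation are negative. Rescaling a variable changes the covariance but leaves the correlation alone:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 10 * x       # rises with x
y_down = -10 * x    # falls as x rises

def cov_and_corr(a, b):
    cov = np.cov(a, b)[0, 1]         # sample covariance, unbounded
    corr = np.corrcoef(a, b)[0, 1]   # covariance / (std_a * std_b), in [-1, 1]
    return cov, corr

print(cov_and_corr(x, y_up))        # covariance 25, correlation about 1
print(cov_and_corr(x, y_down))      # covariance -25, correlation about -1
print(cov_and_corr(x, 1000 * x))    # covariance 2500, correlation still about 1
```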
ok
@fading burrow where did you get that diagram?
@fading burrow is this a question about neural networks or about pooling covid tests?
or something else
the calculations for that table appear to be in the source paper https://www.medrxiv.org/content/10.1101/2020.04.06.20052159v1
In the global effort to combat the COVID-19 pandemic, governments and public health agencies are striving to rapidly increase the volume and rate of diagnostic testing. The most common form of testing today employs Polymerase Chain Reaction in order to identify the presence of viral RNA in individual patient samples one by one. This process has ...
i see
i think p in that table is the expected frequency of positive results
skimming the paper, it sounds like they derived this table from numerical simulations
they have an appendix with some derivations though
...but i think the appendix is missing
oh its here https://www.medrxiv.org/content/10.1101/2020.04.06.20052159v1.supplementary-material?versioned=true https://www.medrxiv.org/content/medrxiv/suppl/2020/04/14/2020.04.06.20052159.DC2/2020.04.06.20052159-1.pdf
yeah there are some derivations and formulas in that appendix
anyone using lazypredict on a regular basis? seems like a major time saver but may also mislead if you don't know why one model would be chosen over another https://towardsdatascience.com/lazy-predict-fit-and-evaluate-all-the-models-from-scikit-learn-with-a-single-line-of-code-7fe510c7281
the days, where you need a wrapper for scikit-learn
hello
Anyone knows any good site with project ideas on AI? searching hard for my thesis
Thanks
i'd love to help you brainstorm
but for a phd thesis? i wouldn't look for an idea on a website
what do you want to do?
nlp?
computer vision?
MSc
generative ai?
ok its good to start with an area of machine learning
regression?
neural net architectures?
ye computer vision or sound i guess..I would love to create some easy hardware too, like connect it with arduino
i still wouldn't look too hard for a masters thesis in a python discord server
where would you look?
professors , university resources, prior papers, arxiv, etc.
there are also online communities more specifically focused on that field
that said, some hardware stuff could be interesting. not all theses have to be "implement an AI"
ye...professors gave us some cases, didn't like any too much
your contribution could be "i got AI stuff to run on this tiny arduino and it's something that other people will find useful, here is the source code"
and your thesis wouldn't be "here is my cool machine learning model", it would be "here is how i got XYZ to run on an embedded system"
but it depends on your skillset
if you want an ml thing in audio/computer vision - how about a model that generates audio from a silent video (with lip movements) that's a few shot learner
so you give it a few short clips of someone speaking
and it can figure out the rest
I just have this idea of thesis would be something I like you know..trying to think ideas but I get stuck on implementation side..like how I ll manage to collect data
collecting data is always the hardest part
just an example
nice one..how you come up with this for example?
alright
i thought "what's a cross between audio and computer vision"
lip sync audio generation
ye i know..had to collect and get the metadata through spotipy from 7.5k songs on another project
what's state of the art? "generating new audio from lip sync video after training on a specific speaker"
how could that be improved? "rather than having to train on an individual speaker, make the training few-shot so it could work on anyone"
that's a cool idea
Do you have anything of this coolness for AI and environment I like? I would love to use it for dunno maybe predict wildfires or detect stray animals and try to protect them..seems hard to find data though
The paper "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis"...
i believe this is SOA
The paper "Fire in Paradise: Mesoscale Simulation of Wildfires" is available here:
http://computationalsciences.org/publications/haedrich-2021-wildfires.html
seems like a good starting point
thank you! both!
hello i need help about keras.backend method.
I now try to validate the result of my model and loop through 100 times different combination of train valid test. I wrap my model train and evaluation method in a one function and call it inside for loop with different data
import tensorflow as tf
from tensorflow.keras.models import Sequential

def train_and_test(trainX, trainY, validX, validY, testX, f_leng):
    # drop any state left over from the previous call
    tf.keras.backend.reset_uids()
    tf.keras.backend.clear_session()
    stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, mode='min')
    model = Sequential()
    ...
    model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=["accuracy"])
    print(model.summary())
    with tf.device('/device:GPU:0'):
        history = model.fit(x=trainX,
                            y=trainY,
                            validation_data=(validX, validY),
                            epochs=50,
                            shuffle=True,
                            batch_size=10,
                            callbacks=[stop_early])
    ypred = model.predict(x=testX)
    ypred = ypred.argmax(axis=-1)
    return history, ypred
with the flow like this:
for each wavelet method:
    tf.keras.backend.reset_uids()
    tf.keras.backend.clear_session()
    for each level:
        set train x, train y
        call train_and_test function
        create pandas summary
I'm glad it works well, but I'm pretty suspicious that the result jumps drastically from 70% to 90% at each decomposition level. It's the same for the other kinds of wavelet I used.
lmao i'm pretty happy but suspicious with my CNN result
see, both curves are nearly identical even though the wavelets are different
i'm pretty afraid that the same model keeps learning on top of its previous weights in every loop iteration rather than starting as a fresh untrained model
wait i find the issue
i think the problem lies in how i define the model
dunno if i'm wrong, but does adding each layer with .add versus defining them in a list have a different effect on the model definition?
I know a good test/train split is 80/20, but if I decrease my test size and get better results on my test set, does that mean that a smaller test set is better for my specific dataset?
It's probably just random chance that it's better
Or the fact that now you have more training data
You want a wide range of samples in your test set (hence making it 20% of the data) so that you can get a realistic picture of what it will do in the real world
I'm also confused, say you did all the tuning and stuff, how do you go about generating the final model? do you run it a bunch of times until you get a low scoring metric? Do you train it on 100% of the data or a larger fraction?
how do you shape the best deliverable
I think time is best spent trying a range of model types to get the best result, not fine-tuning the metrics with the same model
The point of the test data is to give you a preview of what it might do in the real world, but if you keep iterating you can get an artificial good result on the test data
Which it sounds like you may have done
That's also why people do test-train-validate, but if you have very limited amounts of data, it's probably not worth doing that
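A minimal sketch of a three-way split with scikit-learn, assuming a 60/20/20 ratio (the ratio and the toy arrays are just for illustration). The idea is to tune against the validation set and only look at the test set once, at the very end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 made-up samples
y = np.arange(100)

# First carve off the 20% test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```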
One place where fine-tuning might be worthwhile is in things like batch-size, as I think that can help eliminate over-fit
I'm getting quite good fit on my training, and worse but still good on my test lol
i use a train/valid/test split even when the data is very limited. the reason is that i need validation data to check that the model's performance tracks closely on both train and valid... in other words, to check whether the model has overfitted or not
It's not the end of the world to have slight overfit, it's typical to have a model perform better on the train than the test. You don't want crazy large differences though
I have large differences, RMSE on train is 2, test is 6.5
when i use a 50-20-30 train/test/valid split with a random configuration, sometimes my model overfits on validation after training, or underfits, and the test result reflects that...
This is what happens in my case after 50 epochs: even when the train acc is almost 100, the validation acc is very low, around 50 or 60 at the lowest val loss, and the test result follows the valid result and underfits... but it doesn't happen often
which scares me every time
i just use 100% of data for training
its been an hour
why?
oh that was a joke
now and then im too lazy to make a validation callback
and regret it later
lmao
i took 10000 images of fake people
10000 images of real people
and its taking 10000 years to finish
its probably running on CPU
BART usually takes about 1-2 years
no gpu i think, the data is huge
an hour for 4 epochs, each epoch takes approx. 827s. You wait around 3308s for 4 epochs (an hour), to reach 50 epochs, 41350s. 38042s you must wait, around 10.5 hours.
You must wait around 10 hours and 30 minutes for this to complete, correct me if i am wrong pwease.
i will definitely
(Since posting)
but i dont know why i put the epochs so high because now its probably going to overfit
Mmm
whats the best way to find a mentor
also quick question
# lets say I have a list
lst = [1, 2, 3, 4, 5]
data = pd.DataFrame({'Numbers': [3, 42, 5, 345, 36]})
I want to filter for the rows that have any of the numbers in lst
how can i do that?
the desired rows that I would want has the numbers, 3 and 5
# My solution
data[data['Numbers'].apply(lambda x: x in lst)]
But i feel like there would be a faster way
data[data['Numbers'].isin(lst)]
ahhhhhhhhhhhh
i seee
thanks
also what are good visualizations to see correlations between two categorical columns?
A table.
im familiar with visualizations such as pointplot, violinplot, barplot, etc but those are all for categorical columns and also numerical columns
huh
(With coloring, red = high)
hmm
i'm thinking countplot with a hue as the other categorical
but is there a better one
you mean cross tab?
yeah so cross tab
but is there like visualizations?
cuz a table isn't a visualization
It is, but there are others
Can do a graph, close nodes are highly correlated.
Graphs are useful when there are many components and you want an overview from a distance
Can zoom out
Isnt a heatmap prolly the best for crosstabs kind of data?
(But a very small cell size in a matrix works too)
Yea, colored table.
Color is pre-cognitive so it helps a lot.
The graph approach can give you insight into clusters.
A couple of things all correlated with each other.
Graph renderings are kinda hard to find though compared to a simple table.
would you first create a cross tab and then call sns.heatmap()
or pivot table with aggfunc=np.count if that will even work
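Crosstab first, then heatmap, is the usual route. A small sketch with made-up columns follows. As far as I know there is no `np.count`, so for the pivot table version `aggfunc='size'` or `aggfunc='count'` would be the thing to try instead:

```python
import pandas as pd

# Made-up survey data with two categorical columns
df = pd.DataFrame({
    'eye_color': ['brown', 'blue', 'brown', 'green', 'blue', 'brown'],
    'country':   ['US', 'US', 'UK', 'UK', 'US', 'UK'],
})

# Counts of each combination; normalize='index' would give row proportions
counts = pd.crosstab(df['eye_color'], df['country'])
print(counts)

# With seaborn installed, the table can be passed straight to a heatmap:
# import seaborn as sns
# sns.heatmap(counts, annot=True, fmt='d')
```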
hey can someone help me out. I'm new to tensorflow and am getting a dimension error on validation data.
basically im using an imagedatagenerator on my training data, when I try to evaluate the model based on the evaluation data however, it throws an error.
here is the error, and im guessing it's to do with the output since it's a 10x1 array output.
ValueError: Shapes (None, 10, 2) and (None, 10) are incompatible
model.fit(datagen.flow(x=x_train, y=y_train, shuffle=True, batch_size=32), epochs=1,
callbacks=[callback], validation_data=(x_test, y_test))
here is my model.fit line, i bet its something here
could someone help
the error occurs at the end of the epoch
โโโpy
Testโโโ
wow ok, im just stupid nvm
What am I doing wrong? Why is the target variable and its column duplicated?
dummies = pd.get_dummies(df2, columns= ['payment_type','category_name'])
df3 = pd.concat([df2,dummies], axis='columns')
X = df3.loc[:,df3.columns != 'price']
# Target
y = df3['price']```
perhaps the same column appears in both df2 and dummies?
also, the definition of X could just be X = df3.drop('price', axis='columns')
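A small sketch of why the columns end up duplicated, using a made-up `df2` (the real data isn't shown, only the column names come from the question). `pd.get_dummies(df2, columns=[...])` already returns every other column of `df2` next to the new indicator columns, so concatenating that result with `df2` again repeats `price`.

```python
import pandas as pd

# Made-up stand-in for df2; only the column names come from the question.
df2 = pd.DataFrame({
    "price": [10.0, 12.5, 8.0],
    "payment_type": ["card", "cash", "card"],
    "category_name": ["food", "toys", "food"],
})

# get_dummies keeps the untouched columns (here: price) and replaces
# the listed ones with indicator columns.
dummies = pd.get_dummies(df2, columns=["payment_type", "category_name"])

# Concatenating with df2 again is what duplicates 'price'.
duplicated = pd.concat([df2, dummies], axis="columns")
print(list(duplicated.columns).count("price"))  # price appears twice

# Using the get_dummies result directly avoids the duplicate.
X = dummies.drop("price", axis="columns")
y = dummies["price"]
print(list(X.columns))
```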
Hey not sure if this is the right place but I've been reading around but couldn't find 1 exact answer.
I am reading large files in python and looking for the fastest way to do so.
are you using the open function? and are you sure that your approach really truly is too slow?
Well I'm just using the normal with open function and then readlines(), pretty much as basic as I can. I was trying to find a faster way to check over it
but how do you know that what you're doing is too slow?
I don't tbh
I've had my code running for 5 hours and it processed 1.85mil lines, every line I am checking a list with a length of 26k if that item in the list is equal to the line of the file
can you do a few lines at a time in parallel?
How would I go about that?
how many cores do you have?
Like multiprocessing?
yes
I got 4 cores
if processing each line doesn't depend on knowing about previous lines, you can do it in parallel
It's using about 25% of my CPU
so maybe you could let it run overnight with three cores 🤷‍♂️
hmm i could look into it
I just don't want it to kill my CPU over time lol
I have no idea if it will I'm kinda new to this
@lapis sequoia if each process uses at most 25% then running three instances will give you some clearance.
But it's up to you to know what the peak memory usage is
alright thx
yup
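Before reaching for multiprocessing, it may be worth checking how the 26k-item lookup is done. This is only a sketch, assuming each line is compared for exact equality against a list of strings: turning the list into a `set` makes each membership test roughly constant time instead of a scan over 26k items, and iterating over the file object avoids loading everything with `readlines()`. The function name and the sample data are hypothetical.

```python
import io

def count_matching_lines(lines, wanted_items):
    """Count lines that exactly match one of the wanted items."""
    wanted = set(wanted_items)      # one-time conversion; lookups become O(1) on average
    matches = 0
    for line in lines:              # a file object yields one line at a time
        if line.rstrip("\n") in wanted:
            matches += 1
    return matches

# Stand-in for `open("big_file.txt")`; any iterable of lines works.
fake_file = io.StringIO("apple\nbanana\ncherry\napple\n")
print(count_matching_lines(fake_file, ["apple", "cherry"]))  # 3
```

If the per-line work is still heavy after that, splitting the file across processes is the next thing to try.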
help me, understand lstm, please!!

ok so the full form is long short term memory mostly its used in generating text messages like quotes or you can also make maths questions with it
and it is smart enough to also understand the grammar used in a sentence once you have enough data
LSTM or long short term memory is a special type of RNN that solves traditional RNN's short term memory problem. In this video I will give a very simple explanation of LSTM using some real life examples so that you can understand this difficult topic easily. Also refer to following blogs to explore math and understand few more details.
http://c...
this video will be more helpful
i have seen it
@ruby patio great!
yeah bro
it does not understand grammar lol - who said that?
yes it doesnt i meant the patterns can make it learn grammar
start with the original LSTM paper (Hochreiter & Schmidhuber, 1997), check out yannic kilcher or stack overflow if you have any doubts - or post them here
karpathy also wrote a ton of blogs on it, you can see them also
Hello, can someone help me a bit with curve fitting? I try fitting a curve on data with x and y errors using odr but it doesn't give me the correct curve. Thanks!
how many GBs of data is recommended for machine learning (I'm choosing between 8 and 16)? I usually use Google colab for ML because of the free GPU, however i realized that in the future if i am doing things with larger datasets, google colab might not work because uploading the dataset to drive takes a really long time sometimes.
16gb data seems like overkill because i probably won't even do machine learning locally for the next few years, so i probably won't get it, but i am just wondering what other people think
use google cloud if you have the bucks
If you guys are looking for implementation of Machine Learning algorithms on python, I've made a github repository which you can follow (https://github.com/vanshhhhh/Hands-On-Machine-Learning)
If you find this helpful please do give it a star on github
Does this mean as n_student increases, posttest decreases?
Hi guys, i have a quick question. if I have a data frame from 1 to 10, and I trying to create some sort of group like
(0,1,2), (1,2,3),(2,3,4) .... (8,9,10). How can I do it ?
There's this
https://stackoverflow.com/questions/54280228/how-to-iterate-n-wise-over-an-iterator-efficiently
what kinds of groups? you want a list of dataframes of 3 rows each?
or do you just need to perform a calculation on 3 rows at a time?
don't make us guess!
idk what is it called, like mathematical term. My goal is for a list of data ranging from 1 to 10, I will create kinda a partition or group like
Group1: (1,2,3)
group2: (2,3,4)
etc.
yes, and what are you trying to do with those groups? how do you want to store them? do you even need to store them, or do you just want to perform a calculation on them?
are these dataframe rows? index values? etc
Once I have those group. I will use that as the number to identify data. Example data['c'].iloc(group1)
idk whether it is possible to do that or not
and what do you want to do with that data?
you can make a list of overlapping triples, like [(0,1,2), (1,2,3), (2,3,4), ...] pretty easily. but you usually don't/shouldn't need to do this explicitly with pandas, see e.g. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html#pandas.DataFrame.rolling and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.apply.html
so i ask again - what are you actually trying to do?
Thank you, I will look in to those documents. Reason why I do that is:
on the fall data that I have, I realize that the fall usually happen on the second smallest data. and the next 14 data of it has to be less than the 2nd smallest, specifically the data #2 up to #14 has to be between 0 and 20.
that's why I want to get the sequence of data then isolate it,
!eval @rigid zodiac if you really just want the overlapping tuples, you can do something like this
from itertools import islice
def infinite_windows(window_size):
    start = 0
    while True:
        yield tuple(range(start, start+window_size))
        start += 1
windows = list(islice(infinite_windows(5), 5))
print(windows)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
[(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), (3, 4, 5, 6, 7), (4, 5, 6, 7, 8)]
but it sounds like you should probably be using DataFrame.rolling, i just don't really understand what you're trying to do
Thank you, I will try both.
!eval @rigid zodiac you might also want your "windows" to be (start, stop) pairs, which you can use with df.iloc[start : stop]:
window_size = 5
n_windows = 5
window_bounds = [(start, start+window_size) for start in range(n_windows)]
print(window_bounds)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
[(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
!eval as opposed to the way i did it before:
window_size = 5
n_windows = 5
window_bounds = [(start, start+window_size) for start in range(n_windows)]
window_indices = [list(range(start, stop)) for start, stop in window_bounds]
print(window_indices)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], [4, 5, 6, 7, 8]]
with this one I dont have to do the def right?
i'm just showing a couple different ways to do the same thing
it's a good exercise to figure out what these different versions do
Are there any other packages besides pyautogui and opencv?
I want a command like locateOnScreen in pyautogui
i tried this but it wouldn't work with DataFrame
So are there any other packages that do this?
!eval here's another one @rigid zodiac :
from collections.abc import Iterator
from itertools import count, islice
def infinite_windows(window_size: int) -> Iterator[tuple[int, int]]:
    # yields (start, stop) bounds forever; islice below cuts it off
    for window_start in count():
        window_stop = window_start + window_size
        yield (window_start, window_stop)
def window_to_indices(window: tuple[int, int]) -> list[int]:
    start, stop = window
    return list(range(start, stop))
window_size = 5
n_windows = 5
windows = infinite_windows(window_size)
windows = islice(windows, n_windows)
windows = map(window_to_indices, windows)
windows = list(windows)
print(windows)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], [4, 5, 6, 7, 8]]
for a dataframe i told you that you probably want to use .rolling. but of course you can write a loop and use .iloc as well
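For comparison, a minimal sketch of the `.rolling` route on a made-up column `c` holding 1 to 10. Each window covers three consecutive rows, matching the (1,2,3), (2,3,4), ... groups described earlier; the `max` aggregation is only a placeholder for whatever calculation is actually needed.

```python
import pandas as pd

data = pd.DataFrame({"c": range(1, 11)})  # made-up data: 1..10

# Rolling windows of 3 rows; the first two rows have no full window yet.
rolling_max = data["c"].rolling(window=3).max()
print(rolling_max.tolist())

# The explicit alternative: slice each window with .iloc[start:stop].
window_size = 3
windows = [
    data["c"].iloc[start:start + window_size].tolist()
    for start in range(len(data) - window_size + 1)
]
print(windows[0], windows[-1])  # [1, 2, 3] [8, 9, 10]
```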
I see it now, thank you so much
hi, im new to jupyter notebook so i'm wondering what's wrong in here
solved it, nvm!
hello, i need a suggestion about this case. When i run my CNN model, the training loss decreases but the validation loss rises every epoch. I'm using 38 samples for training and 12 for validation. I know this is heavily overfitted.
```
1/1 [==============================] - 5s 5s/step - loss: 0.7031 - accuracy: 0.4000 - val_loss: 6.6536 - val_accuracy: 0.5000
Epoch 2/50
1/1 [==============================] - 0s 260ms/step - loss: 0.0332 - accuracy: 1.0000 - val_loss: 15.3232 - val_accuracy: 0.5000
Epoch 3/50
1/1 [==============================] - 0s 239ms/step - loss: 6.5698e-04 - accuracy: 1.0000 - val_loss: 25.3745 - val_accuracy: 0.5000
Epoch 4/50
1/1 [==============================] - 0s 244ms/step - loss: 3.0219e-06 - accuracy: 1.0000 - val_loss: 36.1942 - val_accuracy: 0.5000
Epoch 5/50
1/1 [==============================] - 0s 262ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 47.4359 - val_accuracy: 0.5000
Epoch 6/50
1/1 [==============================] - 0s 257ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 58.8482 - val_accuracy: 0.5000
Epoch 7/50
1/1 [==============================] - 0s 255ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 70.2395 - val_accuracy: 0.5000
Epoch 8/50
1/1 [==============================] - 0s 273ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 81.4671 - val_accuracy: 0.5000
Epoch 9/50
1/1 [==============================] - 0s 252ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 92.4239 - val_accuracy: 0.5000
Epoch 10/50
1/1 [==============================] - 0s 268ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 103.0329 - val_accuracy: 0.5000```
i cannot add more data due limited size of data (actually it is 1D data)
help me find documentation on implementing sentiment analysis with a CNN in pytorch
this seems to be a case of chronic overfitting
if you can't add any more data
then this will keep happening
try using a model with fewer parameters
maybe a simple one-layer perceptron
the fewer parameters
the less the model can simply memorize the training data
or try finetuning a larger network
either way more data helps
maybe not use CNN?
nvm i have check the problem
for one class, the data has the same proportion as the other class (since it's binary classification), but every sample in that problematic class is repeated and identical due to a preprocessing issue
so even with 12 validation samples, there are only 2 distinct samples because of the repetition
Hello, any idea how to extract last date of the week from yyyyww format data?
for example: My dataset has 201501 and I need last date from 1st week of 2015 i.e. 4-1-2015
using python^
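One possible approach, assuming the `yyyyww` values use ISO week numbering (weeks run Monday to Sunday, which matches the 4-1-2015 example): split the value into year and week and ask for ISO weekday 7. `date.fromisocalendar` needs Python 3.8 or later; the function name is made up.

```python
from datetime import date

def last_day_of_week(yyyyww: int) -> date:
    """Return the Sunday that ends the given ISO week (input like 201501)."""
    year, week = divmod(yyyyww, 100)
    return date.fromisocalendar(year, week, 7)  # 7 = Sunday in ISO numbering

print(last_day_of_week(201501))  # 2015-01-04
```

For a pandas column, something like `df["yyyyww"].map(last_day_of_week)` should apply it row by row. If the data uses a different week convention, the result will be off by up to a week.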
how hard is it to learn ML and AI
that depends, what are you using it for?
how to data text activate in CNN ?
I'm getting this error:
ValueError: Input 0 of layer sequential is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: (128, 20, 20)
and I know how to fix it:
currentStates = currentStates.reshape(-1, *env.STATE_SHAPE) # env.STATE_SHAPE = (20, 20, 1)
currentQsList = self.model.predict(currentStates)
and currentStates.shape returns (128, 20, 20, 1) as expected so why is it saying Full shape received: (128, 20, 20)
I'm convinced this is unsolvable
hey anyone know a GOOD text to speech engine that sounds like a google home?
hard
with work defo doable
but you need some math knowledge
some programming knowledge
a dash of natural aptitude doesn't hurt
and most of all persistence
lots and lots of persistence
@earnest herald are you running this in a notebook?
no, my own environment
what is self.model?
is it trying to do some batching or something?
i assume self.model is some sklearn-style wrapper around a keras model, but i have no idea what its predict method does
if you can share what libraries you are using and the full traceback, maybe someone can help
I'm on my phone now, can we speak in DM?
i'd rather not
you can @ me when you get back to a computer
i'm on here most days
can't guarantee an answer though
okay,
self.model is a keras Sequential model with 1 Conv2d layer and has input shape 20,20,1
the predict method takes a batch of states (captures of a game in array form) and returns the output layer's values
but as a batch since currentStates is a batch of states
Can you show the actual code
Was thinking of expanding my knowledge into the machine learning/Ai category. Anyone have any tips beforehand/stuff I should know?
linalg
What do you want to work on?
im not very experienced with unsupervised learning-if input data is not labeled then how do you determine whether a model is accurate or not?
Is keras a wrapper around tensorflow
@grand breach yes
which modules are important to learn data science
- numpy: doing math, especially in large batches
- pandas: manipulating tabular data
- sklearn: lots of data science tools and models to work with
- pytorch or tensorflow: deep learning stuff
- matplotlib: data visualization
but focus on learning data science in general and doing projects. don't try to "learn libraries".
Can I share my kaggle notebook? If anyone could tell me how to improve it and what bad practices I should avoid doing?
It's a simple linear regression problem
you can post the link here, yes
where does the part that you wrote start?
hmm?
or did you write the whole thing? I thought it was like a template or something
no no I wrote the whole thing
ahhh
def score(y_test, y_pred):
    """Helper function for evaluation metrics."""
    print(f"""Explained Variance: {explained_variance_score(y_test, y_pred) * 100:.2f}%
MAE: {round(mean_absolute_error(y_test, y_pred), 2):.2f}""")
I find this difficult to read
maybe make variables and then put those in the f string?
yea, will do that
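A sketch of what the "variables first, then the f-string" version could look like. Returning the two values is an addition that the original helper didn't have, and the sample numbers at the bottom are made up.

```python
from sklearn.metrics import explained_variance_score, mean_absolute_error

def score(y_test, y_pred):
    """Print evaluation metrics and return them for reuse."""
    explained_variance = explained_variance_score(y_test, y_pred) * 100
    mae = mean_absolute_error(y_test, y_pred)
    print(f"Explained Variance: {explained_variance:.2f}%")
    print(f"MAE: {mae:.2f}")
    return explained_variance, mae

# Made-up values just to exercise the helper.
ev, mae = score([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
```

The `:.2f` format already rounds for display, so the extra `round(..., 2)` isn't needed.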
For all the cells where you go over the value counts for each feature, it might be interesting to show both the counts and the percent share
thanks, will do
pretty good, I think
should I also consider the standard deviation or mse in this case?
Thanks
I'm not really sure. Eventually my ignorance about stats is going to catch up with me.
I need some help. I trying to create a loop like the following, but it just keep running forever.
```py
for i in range(len(c)):
    if (c['ay'].iloc[i] == second_smallest(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == second_smallest(c['az'])) and (c['az'].iloc[i] < -20):
        for j in range(1, len(c)):
            if (c['ay'].iloc[j] < abs(c['ay'].iloc[i])) and (c['az'].iloc[j] < abs(c['az'].iloc[i])): # frame 1 after minimum
                for k in range(2,len(c)):
                    if (abs(c['ay'].iloc[k]) < 10 ) and (abs(c['az'].iloc[k]) < 10): # frame 2
                        for n in range(3, len(c)):
                            if (abs(c['ay'].iloc[n]) < 10 ) and (abs(c['az'].iloc[n]) < 10):# frame 3
                                for m in range(4, len(c)):
                                    if (abs(c['ay'].iloc[m]) < 10 ) and (abs(c['az'].iloc[m]) < 10):# frame 4
                                        for b in range(5, len(c)):
                                            if (abs(c['ay'].iloc[b]) < 10 ) and (abs(c['az'].iloc[b]) < 10):# frame 5
                                                for v in range(6, len(c)):
                                                    if (abs(c['ay'].iloc[v]) < 10 ) and (abs(c['az'].iloc[v]) < 10):# frame 6
                                                        for h in range(7, len(c)):
                                                            if (abs(c['ay'].iloc[h]) < 10 ) and (abs(c['az'].iloc[h]) < 10):# frame 7
                                                                c['cat'].iloc[h] = 1
```
what is this supposed to do?
omg
there must be a better way
please don't tell me you had to write this by hand
i use copy and paste
well my logic is if I get the second smallest, then if the next number is less than the second smallest.... and the subsequence 6 more number is ranging between 0 and 20 then the categorical is 1
but god forbid the pc didnt think like I do
like it will follow this logic i >j>k>n>m>b>v>h
I hope so cause I have absolutely no idea how to
are you sure its running forever, and not for a long time? what does len(c) evaluate to?
c has 67 data
is there any better way to do this? I been google my ass off like entire of week
What exactly is it doing?
it will categorize whether object will fall or not
so for the 1st line, I'm trying to say that if there exist a second smallest in y and z, and the one right after that is less than its absolute value, and the absolute value of 6 more frame after that is between 0 and 20. Then cat =1
Like this image
but in the y and z acceleration. I also have to do like 3 more condition, similar with it as a fail proof. Because sometime we dont have a second smallest. we have smallest
So let me see if I understand, you want to label data as 1 if it is a second smallest value, and the following 6 points don't go above 20?
Yeah, second smallest value, and the one right after it is smaller than the abs(second smallest) and the following 6 points don't go above 20?
Okay well your nested for loops are super unnecessary because it repeats itself for example lets just look at a bit of it:
```py
if (abs(c['ay'].iloc[m]) < 10 ) and (abs(c['az'].iloc[m]) < 10):# frame 4
    for b in range(5, len(c)):
        if (abs(c['ay'].iloc[b]) < 10 ) and (abs(c['az'].iloc[b]) < 10):# frame 5
```
Lets say m is 10 and b is 11, on the next m loop, it checks b = 11 again
You could just have a single for loop, starting at second_smallest, and ending at the number of points you want to check
In fact give me a minute and I'll rewrite it to show you
```py
for i in range(len(c)):
if (c['ay'].iloc[i] == second_smallest(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == second_smallest(c['az'])) and (c['az'].iloc[i] < -20)
and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+1] < abs(c['az'].iloc[i])) # frame 1 after minimum
and (abs(c['ay'].iloc[i+2]) < 10 ) and (abs(c['az'].iloc[i+2]) < 10) # frame 2
and (abs(c['ay'].iloc[i+3]) < 10 ) and (abs(c['az'].iloc[i+3]) < 10)# frame 3
and (abs(c['ay'].iloc[i+4]) < 10 ) and (abs(c['az'].iloc[i+4]) < 10)# frame 4
and (abs(c['ay'].iloc[i+5]) < 10 ) and (abs(c['az'].iloc[i+5]) < 10)# frame 5
and (abs(c['ay'].iloc[i+6]) < 10 ) and (abs(c['az'].iloc[i+6]) < 10)# frame 6
and(abs(c['ay'].iloc[i+7]) < 10 ) and (abs(c['az'].iloc[i+7]) < 10):# frame 7
c['cat'].iloc[h] = 1```
You can probably find a way to shorten the condition as well using another loop but i'll leave that to you to figure out, but this gets rid of the nested for loops
And I think should accomplish what you want to achieve
thank you so much let me try it
then you can alter the giant condition with a for j in range 7 to shorten it
I got this
You probably need to fix the indentation / newlines
It may be that the comments are cutting up the condition
```py
for i in range(len(c)):
if (c['ay'].iloc[i] == second_smallest(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == second_smallest(c['az'])) and (c['az'].iloc[i] < -20) and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+1] < abs(c['az'].iloc[i])) and (abs(c['ay'].iloc[i+2]) < 10 ) and (abs(c['az'].iloc[i+2]) < 10) and (abs(c['ay'].iloc[i+3]) < 10 ) and (abs(c['az'].iloc[i+3]) < 10) and (abs(c['ay'].iloc[i+4]) < 10 ) and (abs(c['az'].iloc[i+4]) < 10) and (abs(c['ay'].iloc[i+5]) < 10 ) and (abs(c['az'].iloc[i+5]) < 10) and (abs(c['ay'].iloc[i+6]) < 10 ) and (abs(c['az'].iloc[i+6]) < 10) and(abs(c['ay'].iloc[i+7]) < 10 ) and (abs(c['az'].iloc[i+7]) < 10):
c['cat'].iloc[h] = 1```
it work but it categorize the wrong thing..
Which should be 1?
the ay at -310
is -310 the second smallest in ay?
Okay but look at the conditions
and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])
c['ay'].iloc[i+1] is 207.523
Which is obviously bigger
yeah that's why from the second line I switch it back to its absolute value and compare the rest
Yeah so you need to do the same here then
it has to pair with the az in other to work
looks like those `and`s could get some help from `all`
because some time ay happen, without az or ax, the whole thing will consider as fall
so
```py
if all([
    c['ay'].iloc[i] == second_smallest(c['ay']),
    c['ay'].iloc[i] < -20,
    ...
]):
```
a loop would be better since the conditions are basically the same with incremented indices but yeah
and put the c['ay'].iloc into a separate variable
fn = c['ay'].iloc
fn[i], fn[i+1], ...
this should also help
so what will it look like?
and as spagoose said, it should be a loop
wait a min
I can safely drop a few categorical rows which are null if I have a big data set right? and it doesn't remove any of the unique values
yeah,
you can either remove it or use KNN approximate it
@rigid zodiac I realised the code i sent still has the h variable in it
SO obviously you need to remove that
so just i?
i+7 no?
Thats how you were doing it originally
```py
for i in range(len(c)):
    condition_name = all([c['ay'].iloc[i] == second_smallest(c['ay']),
                          c['ay'].iloc[i] < -20,
                          c['az'].iloc[i] == second_smallest(c['az']),
                          c['az'].iloc[i] < -20])
    for j in range (1, 8):
        condition_name &= (c['ay'].iloc[i+j] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+j] < abs(c['az'].iloc[i]))
    if condition_name:
        c['cat'].iloc[i+7] = 1
```
But it could be i if thats where you wanted the label to be
it said out of bound.
for the previous code that you send it work. I will see whether it work for the other data
```py
for i in range(len(c)):
if (c['ay'].iloc[i] == second_smallest(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == second_smallest(c['az'])) and (c['az'].iloc[i] < -20) and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+1] < abs(c['az'].iloc[i])) and (abs(c['ay'].iloc[i+2]) < 10 ) and (abs(c['az'].iloc[i+2]) < 10) and (abs(c['ay'].iloc[i+3]) < 10 ) and (abs(c['az'].iloc[i+3]) < 10) and (abs(c['ay'].iloc[i+4]) < 10 ) and (abs(c['az'].iloc[i+4]) < 10) and (abs(c['ay'].iloc[i+5]) < 10 ) and (abs(c['az'].iloc[i+5]) < 10) and (abs(c['ay'].iloc[i+6]) < 10 ) and (abs(c['az'].iloc[i+6]) < 10) and(abs(c['ay'].iloc[i+7]) < 10 ) and (abs(c['az'].iloc[i+7]) < 10):
c['cat'].iloc[i] = 1```
I'm just trying to wing it in notepad, so theres bound to be errors, you should easily be able to resolve something like an out of bounds error though
eh can't bother just for a few columns
i have to make a system map on good health and well being i want some idea or examples
i'm not quite sure what you mean
do i have to add a break behind cat? In order to jump to the elif?
rows sorry
You dont have an elif?
well my ultimate goal is just id the initial fall. So i think, if all of the condition is satisfy then the ML algorithm will say object fall
I do, I'm just saying if I want to add the elif with it, should I put the break beneath c['cat'].iloc[i] = 1?
Depends on what your trying to do exactly, a break will exit the for loop
well because sometime fall will happen if we have the second smallest. Occasionally, it will happen if it is a smallest
So I have
```py
c['cat'] = np.nan
for i in range(len(c)):
if (c['ay'].iloc[i] == second_smallest(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == second_smallest(c['az'])) and (c['az'].iloc[i] < -20) and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+1] < abs(c['az'].iloc[i])) and (abs(c['ay'].iloc[i+2]) < 10 ) and (abs(c['az'].iloc[i+2]) < 10) and (abs(c['ay'].iloc[i+3]) < 10 ) and (abs(c['az'].iloc[i+3]) < 10) and (abs(c['ay'].iloc[i+4]) < 10 ) and (abs(c['az'].iloc[i+4]) < 10) and (abs(c['ay'].iloc[i+5]) < 10 ) and (abs(c['az'].iloc[i+5]) < 10) and (abs(c['ay'].iloc[i+6]) < 10 ) and (abs(c['az'].iloc[i+6]) < 10) and(abs(c['ay'].iloc[i+7]) < 10 ) and (abs(c['az'].iloc[i+7]) < 10):
c['cat'].iloc[i] = 1
elif (c['ay'].iloc[i] == min(c['ay'])) and (c['ay'].iloc[i] < -20) and (c['az'].iloc[i] == min(c['az'])) and (c['az'].iloc[i] < -20) and (c['ay'].iloc[i+1] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+1] < abs(c['az'].iloc[i])) and (abs(c['ay'].iloc[i+2]) < 10 ) and (abs(c['az'].iloc[i+2]) < 10) and (abs(c['ay'].iloc[i+3]) < 10 ) and (abs(c['az'].iloc[i+3]) < 10) and (abs(c['ay'].iloc[i+4]) < 10 ) and (abs(c['az'].iloc[i+4]) < 10) and (abs(c['ay'].iloc[i+5]) < 10 ) and (abs(c['az'].iloc[i+5]) < 10) and (abs(c['ay'].iloc[i+6]) < 10 ) and (abs(c['az'].iloc[i+6]) < 10) and(abs(c['ay'].iloc[i+7]) < 10 ) and (abs(c['az'].iloc[i+7]) < 10):
c['cat'].iloc[i] = 1```
I was just wondering do I need a "break" in order for that to jump to the elif
no, no break needed for that
but you have two conditions that have the same result sooo
agh, so it will be (c['ay'].iloc[i] <= second_smallest(c['ay']))
Yeah you need to simplify the conditions, I only suggested that so you could see how I altered it from your original code
because I'm trying to say that if the second smallest. and satisfy those condition that I set. And the one right after that has to be < abs of the secondsmallest, and the next 7 data has to be between 20 and 0... then it has to be fall
Condition that: second smallest value, and the one right after it is smaller than the abs(second smallest) and the following 6 points don't go above 20?
wait you mean my issue or your lol
it is basically
Would be better if you can show an example dataframe
But @rigid zodiac this simplifies what i said originally:
```py
for i in range(len(c)):
    condition_name = all([c['ay'].iloc[i] == second_smallest(c['ay']),
                          c['ay'].iloc[i] < -20,
                          c['az'].iloc[i] == second_smallest(c['az']),
                          c['az'].iloc[i] < -20])
    for j in range (1, 8):
        condition_name &= (c['ay'].iloc[i+j] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+j] < abs(c['az'].iloc[i]))
    if condition_name:
        c['cat'].iloc[i] = 1
```
why is second_smallest a dataframe too
Then you just add any extra conditions to that
I would be tempted to remove the i loop as well, but idk exactly what your labelling rules are
so to add the new condition in there i just need to create new condition_name?
condition_name is just the name of the variable i picked because I don't know what a label of 1 represents, you should rename it
But add it to the same variable with an OR, since the result is the same, the label 1
i'm not quite sure what you mean there
Lets say you add your new alternate condition, the result is the same : c['cat'].iloc[i] = 1 if its true
So why make a new condition, just make the original => original condition OR new condition
You can make a new condition if you want and its easier for you to read, the end result is the same, but it's less code otherwise
idk how to to be honest
@rigid zodiac like this for example:
```py
for i in range(len(c)):
    condition_name = all([
        (c['ay'].iloc[i] == second_smallest(c['ay']) and c['az'].iloc[i] == second_smallest(c['az'])) or (c['ay'].iloc[i] == min(c['ay']) and c['az'].iloc[i] == min(c['az'])),
        c['ay'].iloc[i] < -20,
        c['az'].iloc[i] < -20])
    for j in range (1, 8):
        condition_name &= (c['ay'].iloc[i+j] < abs(c['ay'].iloc[i])) and (c['az'].iloc[i+j] < abs(c['az'].iloc[i]))
    if condition_name:
        c['cat'].iloc[i] = 1
```
so that for ay and az right? what if I want to add the second condition for the ay and ax within that
Yeah, but like I said, if you are not confident in doing it that way, do it however is easiest for you to understand
thank you so so much, you save me from loop hell
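For reference, here is one way the whole rule could be written so it doesn't run out of bounds, as a sketch only. It assumes the rule from the original long `if`: the dip is the second-smallest (or smallest) value on both axes and below -20, the next frame is below the absolute value of the dip, and the six frames after that stay under 10 in absolute value. `second_smallest` is never shown in the chat, so the version here (second-smallest distinct value) is a guess, and `label_falls` with its parameter names is made up.

```python
import numpy as np
import pandas as pd

def second_smallest(values: pd.Series) -> float:
    """Assumed helper: second-smallest distinct value (its real definition isn't shown)."""
    return np.sort(values.unique())[1]

def label_falls(c: pd.DataFrame, drop_limit=-20, quiet_limit=10, n_quiet=6) -> pd.Series:
    """Return a Series that is 1 where the fall pattern starts and NaN elsewhere."""
    ay, az = c["ay"].to_numpy(), c["az"].to_numpy()
    ay_second, az_second = second_smallest(c["ay"]), second_smallest(c["az"])
    ay_min, az_min = ay.min(), az.min()
    cat = np.full(len(c), np.nan)

    # Stop early so i + 1 + n_quiet never runs past the last row.
    for i in range(len(c) - (1 + n_quiet)):
        is_candidate = (
            (ay[i] == ay_second and az[i] == az_second)
            or (ay[i] == ay_min and az[i] == az_min)
        ) and ay[i] < drop_limit and az[i] < drop_limit
        if not is_candidate:
            continue
        # Frame 1: the next sample is below the magnitude of the dip.
        rebound_ok = ay[i + 1] < abs(ay[i]) and az[i + 1] < abs(az[i])
        # Frames 2..7: both axes stay close to zero.
        quiet = slice(i + 2, i + 2 + n_quiet)
        quiet_ok = (np.abs(ay[quiet]) < quiet_limit).all() and (np.abs(az[quiet]) < quiet_limit).all()
        if rebound_ok and quiet_ok:
            cat[i] = 1
    return pd.Series(cat, index=c.index, name="cat")

# Made-up signal: the smallest value is at row 0, the second smallest at row 2.
values = [-100, 50, -60, 30, 3, 2, 1, 0, -1, 2, 5, 5]
c = pd.DataFrame({"ay": values, "az": values})
labels = label_falls(c)
print(labels.tolist())
```

Assigning with `c["cat"] = label_falls(c)` also sidesteps the chained `c['cat'].iloc[i] = 1` assignment, which pandas may warn about. A column with fewer than two distinct values would make the assumed `second_smallest` raise an `IndexError`.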
hello, I tried to google this but am having issues
im trying to use sqlalchemy to query my db
im just trying to get all the values for a specific column
```py
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
model = RidgeCV(alphas=np.arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')
model.fit(X_train, y_train)
print('alpha: %f' % model.alpha_)
```
any better way cuz this takes way too long
I guess my mistake trying to do 0.01
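A possible way to cut the runtime, sketched on made-up data: 100 alphas times 30 splits is about 3000 fits, so a coarser log-spaced grid and fewer splits shrink that a lot. The grid bounds, split counts and synthetic `X_train`/`y_train` below are assumptions, not tuned values. A log-spaced grid also avoids `alpha=0`, which `np.arange(0, 1, 0.01)` includes.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# 13 log-spaced candidates from 0.001 to 10 instead of 100 linear steps.
alphas = np.logspace(-3, 1, 13)
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)  # 10 splits instead of 30

model = RidgeCV(alphas=alphas, cv=cv, scoring="neg_mean_absolute_error")
model.fit(X_train, y_train)
print(f"alpha: {model.alpha_:f}")
```

Once the coarse search lands on a region, a second, finer grid around that value can be searched if more precision is really needed.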
I- great