Dealing with missing data for a time series model | Learn AI Together

pale island Jul 26, 2023, 7:31 AM

As per title. Our model is a time series of rainfall based on three environmental factors (temperature, wind speed, humidity) on an island.

We have data from a period of 6.5 years.

Our data is recorded at 5-minute intervals.

However, over this period not all stations are always active: about 18-23 stations are active at any one data point, and sometimes this number drops to 8 or even 1. Sometimes no station reports any data for a period of 15 minutes.

Our goal in plotting the time series is to train three models: the initial time series, its first derivative to assess data drift, and a third model to analyze concept drift.

What can we do to deal with these discrepancies?

spark wind Jul 26, 2023, 7:54 AM

As someone who worked in causal discovery, I found that a quick fix was to use the value from the same time in the previous year. You can also take the surrounding values and use their average. Another option is LOCF (last observation carried forward), where you reuse the last known value, or the reverse, which I believe is called next observation carried backward (NOCB). If you have time to spend on this, I would look into linear interpolation if you need granular data and want your best chance at getting close to the exact values. I believe this basically calculates each missing value on a straight line between the known data points on either side of the gap. These things are pretty easy to research online, with plenty of examples; it just depends on how much time you have.
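For reference, here is a minimal sketch of the simpler gap-filling options mentioned above (LOCF, NOCB, averaging the surrounding values, and linear interpolation), assuming the data sits in a pandas Series with a 5-minute DatetimeIndex. The series name `rain` and its values are made up for illustration and are not from the original dataset; the previous-year approach is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute rainfall readings with gaps (values are invented).
idx = pd.date_range("2023-01-01 00:00", periods=8, freq="5min")
rain = pd.Series([0.0, 0.2, np.nan, np.nan, 0.8, np.nan, 1.0, 1.2], index=idx)

# LOCF: carry the last known value forward into the gap.
locf = rain.ffill()

# LOCF with a cap: fill at most one consecutive missing step, so long
# outages stay visible as NaN instead of being papered over.
locf_capped = rain.ffill(limit=1)

# NOCB: carry the next known value backward into the gap.
nocb = rain.bfill()

# Average of the nearest known value on each side of the gap.
neighbour_mean = (locf + nocb) / 2

# Linear interpolation, weighted by the actual time between observations.
linear = rain.interpolate(method="time")

print(pd.DataFrame({
    "raw": rain,
    "locf": locf,
    "nocb": nocb,
    "neighbour_mean": neighbour_mean,
    "linear": linear,
}))
```

Which option works best probably depends on how long the gaps are: carrying a value across one missing 5-minute reading is a much smaller assumption than carrying it across hours of outage, which is why capping the fill length may be worth considering.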

pale island Jul 26, 2023, 11:34 AM

What's a good way to approach a time series with lots of data? We have approximately 6.5 years of data at 5-minute intervals; at the bare minimum that's about 57,024 data points, without counting the 8-19 NSWE stations.

We are already planning to do a 70:30 split, but I'm not sure how to structure it so that it covers all the available data.
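One common way to structure a 70:30 split for time series is to cut chronologically rather than shuffle, so the model trains on the earlier 70% and is tested on the later 30%. A minimal sketch, assuming a pandas DataFrame indexed by timestamp; the helper `chronological_split` and the stand-in data are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one day of 5-minute readings.
idx = pd.date_range("2023-01-01", periods=288, freq="5min")
df = pd.DataFrame({"rainfall": np.arange(288, dtype=float)}, index=idx)


def chronological_split(frame, train_fraction=0.7):
    """Split a time-indexed frame into an earlier and a later part."""
    frame = frame.sort_index()               # make sure rows are in time order
    cut = int(len(frame) * train_fraction)   # position of the cut point
    return frame.iloc[:cut], frame.iloc[cut:]


train, test = chronological_split(df)
print(len(train), len(test))
print(train.index.max(), "<", test.index.min())
```

Every row lands in exactly one of the two parts, and no test timestamp precedes a training timestamp. If a single cut feels too coarse, a rolling scheme such as scikit-learn's `TimeSeriesSplit` may be worth a look.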

spark wind Jul 26, 2023, 1:24 PM

Our datasets had hundreds of thousands of data points, so we used OpenMP and C to run all the data manipulation concurrently for each of the techniques above. We did have access to a £5 million supercomputer with 8,000 CPU cores 😂, but you could still just take the hit and let it run. It also depends on your data: you might notice different accuracies depending on the technique, so it's all about what kind of data you have.
