I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in metres at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
DataFrame showing a typical sequence of rainfall and river stage data
And here is an example of what it looks like when two sites have been tested
DataFrame showing abnormal rainfall data due to two sites being tested
I've come across several ways to statistically cluster data and identify outliers, but none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which the sites are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp  Site 1  Site 2  Site 3
09:00:00        0       0       0
10:00:00        0      20       0
11:00:00        0       0       0
20 mm of rainfall at one site, with zero rainfall at the adjacent sites and at the same site for the hour before and the hour after, is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp  Site 1  Site 2  Site 3
09:00:00        6       4       0
10:00:00        0      20       2
11:00:00        0       0      11
This is a less obvious example:
Timestamp  Site 1  Site 2  Site 3
09:00:00        1       0       0
10:00:00        0      20       2
11:00:00        0       3       1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values, and flag the cell as abnormal if the difference is greater than 15 (or some other arbitrary threshold).
The exact criteria will probably change as I discover more about the data; the mechanism for applying those criteria to the DataFrame is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to check values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
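A rough sketch of that neighbourhood comparison (my own illustration; the -inf padding for the edge sites and the fixed threshold of 15 are assumptions):

```python
import numpy as np
import pandas as pd

def flag_spikes(rain: pd.DataFrame, threshold: float = 15.0) -> pd.DataFrame:
    """Flag cells that exceed the max of their 3x3 neighbourhood by `threshold`.

    Columns are assumed to be ordered by site location, so adjacent columns
    are spatial neighbours; adjacent rows are the hour before/after.
    """
    vals = rain.to_numpy(dtype=float)
    # Pad with -inf so edge sites / first and last rows just see fewer neighbours.
    padded = np.pad(vals, 1, mode="constant", constant_values=-np.inf)
    # Collect the 8 surrounding cells for every position.
    neighbours = [
        padded[1 + dr : padded.shape[0] - 1 + dr, 1 + dc : padded.shape[1] - 1 + dc]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if not (dr == 0 and dc == 0)
    ]
    neighbour_max = np.maximum.reduce(neighbours)
    return pd.DataFrame(vals - neighbour_max > threshold,
                        index=rain.index, columns=rain.columns)
```

On the tables above, this flags the 20 at 10:00 in both the obvious case (neighbourhood max 0) and the less obvious one (neighbourhood max 3), but not in the normal rainfall pattern (neighbourhood max 11, difference 9 below the threshold).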
I have a timeseries dataset with 5M rows.
The column has 19.5% missing values and 80% zeroes, so only 0.5% of the data is actually useful, but 0.5% of 5M rows is still enough to work with. Now, I need to impute this column.
Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.
To make it faster, I thought of deleting all the zero-value rows and then carrying out the imputation. But I feel that using KNN naively after this would lead to overestimation (since all the zero values are gone, and with the number of neighbours fixed, the mean is expected to increase).
So, is there a way:

1. to modify the data input to the KNN model, or
2. to carry out the imputation after removing the rows with zeros, so that the imputed values come out the same, or at least close?
To understand the problem more clearly, consider the following dummy dataframe:
DATE VALUE
0 2018-01-01 0.0
1 2018-01-02 8.0
2 2018-01-03 0.0
3 2018-01-04 0.0
4 2018-01-05 0.0
5 2018-01-06 10.0
6 2018-01-07 NaN
7 2018-01-08 9.0
8 2018-01-09 0.0
9 2018-01-10 0.0
Now, if I use KNN (k=3), then with zeros, the value would be the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.
A few rough ideas which I thought of but could not carry through:

1. Modifying the weights (used in the weighted-mean computation) of the KNN imputation process so that the removed 0s are taken into account during imputation.
2. Adding a column which says how many neighbouring zeros a particular row has, and then somehow using it to modify the imputation process.

Points 1 and 2 are just rough ideas that crossed my mind while thinking about the problem, and might help while answering the question.
PS -
Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column, and then using this for imputation.
I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.
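For reference, the kind of date-feature extraction mentioned in the PS might look like this (the DATE column name is taken from the dummy frame above; which features are useful is an open question):

```python
import pandas as pd

df = pd.DataFrame({"DATE": pd.date_range("2018-01-01", periods=5)})

# Expand the timestamp into numeric features KNN can measure distances on.
df["month"] = df["DATE"].dt.month
df["day"] = df["DATE"].dt.day
df["dayofweek"] = df["DATE"].dt.dayofweek
df["dayofyear"] = df["DATE"].dt.dayofyear
```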
Let's think logically and leave the machine learning aside for a moment.
Since we are dealing with a time series, it would be good to impute the data with the average of the values for the same date in different years, say 2-3 years (with 2 years, that's 1 year before and 1 year after the year of the missing value); I would recommend not going beyond 3 years. Call this computed value x.
Further, to keep this computed value x close to the current data, average it with y, where y is the linear interpolation value.
In the above example, y = (10 + 9)/2, i.e. the average of the value before and the value after the one to be imputed.
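A sketch of this recipe in pandas: the linear-interpolation half (y) is computed for real, while the same-date-across-years average (x) is left as a placeholder, since the dummy data only spans one year:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 8, 0, 0, 0, 10, np.nan, 9, 0, 0],
              index=pd.date_range("2018-01-01", periods=10))

# y: linear interpolation between the neighbouring known values.
y = s.interpolate(method="linear")

# x: with multi-year data this would be the mean of the same (month, day)
# in the surrounding years; here it is just a placeholder.
x = y.copy()

imputed = s.fillna((x + y) / 2)
```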
They gave me a table storing sensor readings with a schema [TimeStamp, SensorKey, SensorValue].
TimeStamp Id Value
2019-01-01 00:00:47 1 66.6
2019-01-01 00:00:47 2 0.66
2019-01-01 00:00:57 1 66.7
2019-01-01 00:00:57 2 0.68
2019-01-01 00:00:57 3 166.6
2019-01-01 00:01:07 3 146.6
Note that it only stores changes to sensor readings, with limited precision and sampling rate, and repeats a value every hour after the last change if it doesn't change.
Their queries amount to checking the values of sensor A (and B, and C, and D...) whenever the value of sensor Z passes some condition. And they want to use Python and Spark.
So, to compare the values of different sensors, I get the rows for those sensor keys and pivot the results to a schema [TimeStamp, ValueOfA, ..., ValueOfZ].
df1 = df0.groupBy("TS").pivot("Id", listOfIds).agg(F.last("Value"))
TimeStamp Sensor1 Sensor2 Sensor3
2019-01-01 00:00:47 66.6 0.66 Null
2019-01-01 00:00:57 66.7 0.68 166.6
2019-01-01 00:01:07 Null Null 146.6
Then I fill the gaps (always forwards; if I don't have older data to fill the first rows, I discard them).
window1hour = Window.orderBy('TS').rowsBetween(-360, 0)
# 360 rows = 1 hour at a 0.1 Hz sampling rate (one reading every 10 s).
df2 = df1
for sid in sensorIds:
    df2 = df2\
        .withColumn(sid, F.last(F.col(sid), ignorenulls=True).over(window1hour))\
        .filter(F.col(sid).isNotNull())
The comparisons, column by column, are trivial now.
But compared to doing the same with pandas it's so much slower that it feels like I'm doing something wrong, at least for small queries.
What's happening? And what will happen when it's a large query?
About small and large: I have several thousand different sensors and about a billion records per year. So the data definitely fits on one server, but not in RAM. In fact, they will start with only one server for the data, maybe a second one for a second Spark instance (both multiprocessor and with lots of memory), and hopefully they will invest in more hardware if they see returns. They will start by making the small day-by-day queries, and they want them fast. But later they will want to run queries over several years, and those must not explode.
Ideas/doubts: Is the preprocessing done in a single thread? Should I establish the parallelization myself, or let Spark handle it? Should I break the year-spanning queries into many day-spanning ones (but then why would I want Spark at all)? Should I solve the small queries in pandas and the large ones in Spark (and can I set the threshold beforehand)?
What other improvements can I apply?
It's not uncommon for "small" data to be faster in tools other than Spark. Spark has fairly significant overhead for its parallel functionality (granted, these overheads are very small compared with the old map-reduce paradigm).
Where Spark shines is its ability to scale linearly for "large" data by adding servers. At that point the overhead becomes worth it, as Spark will automatically break the work up among all of the available executors.
I believe letting Spark handle the parallelization is ideal, if only for simplicity's sake. Whether or not to implement the "small" queries in another framework depends entirely on whether you want to maintain two code paths, and whether your customer is comfortable with their speed.
I'm posting here because I couldn't find a solution to my problem anywhere else. Basically, we are learning linear regression using Python at school, and the professor wants us to estimate the price of each ingredient in a sandwich, as well as the fixed profit on each sandwich, based on a CSV table. So far we have only worked with one X variable and one Y variable, so I'm pretty confused about what I should do here. Thank you. Here is the table:
tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75
You have 9 separate variables for regression (tomato ... price), and 13 samples for each of them (the 13 lines).
So the first approach could be doing a regression for "tomato" on the data points
0.05, 0.05, 0.05, 0.05, 0.05, 0, 0.05, 0.05, 0, 0.05, 0.05, 0.05, 0
then doing another one for "lettuce" and the others, up to "price" with
18.4, 16.15, 22.15, 19.4, 18.4, 11.75, 18.15, 18.65, 15.75, 15.4, 17.15, 18.9, 18.75
Online viewer for looking at your CSV data: http://www.convertcsv.com/csv-viewer-editor.htm, but Google SpreadSheet, Excel, etc. can display it nicely too.
SciPy can most likely do the task for you on vectors too (handling the 9 variables together), but the part about having 13 samples in the 13 rows remains.
EDIT: bad news, I was tired and have not answered the full question, sorry about that.
While it is true that you can take the first 8 columns (tomato...ham) as time series, and make individual regressions for them (which is probably the first part of this assignment), the last column (price) is expected to be estimated from the first 8.
Using the notation in Wikipedia, https://en.wikipedia.org/wiki/Linear_regression#Introduction, your y vector is the last column (the prices), the X matrix is the first 8 columns of your data (tomato...ham), extended with a column of 1-s somewhere.
Then pick an estimation method (some are listed in that page too, https://en.wikipedia.org/wiki/Linear_regression#Estimation_methods, but you may want to pick one you have learned about at class). The actual math is there, and NumPy can do the matrix/vector calculations. If you go for "Ordinary least squares", numpy.linalg.lstsq does the same (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq - you may find adding that column of 1-s familiar), so it can be used for verifying the results.
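As a sketch of that setup (the per-unit ingredient prices and the fixed profit below are made-up numbers used only to generate consistent data, not values from the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" per-unit ingredient prices and fixed profit,
# used here only to generate a consistent price column.
true_prices = np.array([2.0, 1.5, 8.0, 3.0, 6.0, 12.0, 4.0, 7.0])
true_profit = 10.0

X = rng.random((13, 8))            # 13 sandwiches, 8 ingredient amounts
y = X @ true_prices + true_profit  # the price column

# Extend X with a column of ones so the intercept (fixed profit) is estimated.
X1 = np.hstack([X, np.ones((13, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

estimated_prices, estimated_profit = coef[:8], coef[8]
```

With the real CSV data, X would be the first 8 columns and y the price column; the coefficients then estimate the ingredient prices and the intercept the fixed profit.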
I'm relatively new to pandas (and Python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that takes a reasonable amount of time.
The data is stored in a data frame called "YTDSales" which has sales per day, per product
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
and what I want to do is simulate different scenarios, using for the rest of the year (from October 15 to year end) the historical distribution that each product had. For example, with the data presented I would like to fill the rest of the year with sales between 20 and 1100.
What I've done is the following
# creates a range of "future dates"
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014, 12, 30)
DatesEOY = pd.date_range(start=last_historical, end=year_end).shift(1)

# function that obtains a random sales number per product, between min and max
f = lambda x: np.random.randint(x.min(), x.max())

# create all the "future" dates and fill them with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
The solution works, but takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way to avoid iterating?
Thanks
Use the size option for np.random.randint to get a sample of the needed size all at once.
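For example, a single call can draw samples for all remaining days and several products at once (the 77 days and 5 products here are arbitrary illustration values):

```python
import numpy as np

# 77 remaining days x 5 products, drawn in one vectorised call.
# Note randint's high bound is exclusive.
samples = np.random.randint(20, 1110, size=(77, 5))
```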
Briefly, one approach I would consider is as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
new_df = pandas.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)

# one batch of samples per product column
f = lambda col: np.random.randint(col.min(), col.max(), num_to_sample)

output = pandas.concat([YTDSales, new_df], axis=0)
output[len(YTDSales):] = np.asarray([f(col) for _, col in YTDSales.items()]).T
Along the way, I choose to make a totally new DataFrame, by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another approach is setting with enlargement, as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can "enlarge" the original data frame with all-NaN rows (at the index values from DatesEOY), and then apply the function above to YTDSales instead of bringing output into it at all.
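One way to do that batch "enlargement" (an untested sketch, with toy stand-ins for YTDSales and DatesEOY) is to reindex with the union of the two indexes and then overwrite the new rows one column at a time:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for YTDSales and DatesEOY.
YTDSales = pd.DataFrame(
    {"Product_A": [1000, 400, 1110, 20], "Product_B": [300, 400, 400, 320]},
    index=pd.date_range("2014-01-01", periods=4),
)
DatesEOY = pd.date_range("2014-01-05", periods=3)

# Enlarge once with NaN rows, then fill the new rows per column in one call each.
out = YTDSales.reindex(YTDSales.index.union(DatesEOY))
for col in YTDSales.columns:
    lo, hi = YTDSales[col].min(), YTDSales[col].max()
    out.loc[DatesEOY, col] = np.random.randint(lo, hi, size=len(DatesEOY))
```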