Remove zeros in pandas dataframe without effecting the imputation result

Remove zeros in pandas dataframe without effecting the imputation result - python

I have a timeseries dataset with 5M rows.
The column has 19.5% missing values, 80% zeroes (don't go by the percentage values - although it means only 0.5% of data is useful but then 0.5% of 5M is enough). Now, I need to impute this column.
Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.
To make it faster, I thought of deleting all the zero values rows and then carry out the imputation process. But I feel that using KNN naively after this would lead to overestimation (since all the zero values are gone and keeping the number of neighbours fixed, the mean is expected to increase).
So, is there a way:
To modify the data input to the KNN model
Carry out the imputation process after removing the rows with zeros so that the values obtained after imputation are the same or at least near
To understand the problem more clearly, consider the following dummy dataframe:
DATE VALUE
0 2018-01-01 0.0
1 2018-01-02 8.0
2 2018-01-03 0.0
3 2018-01-04 0.0
4 2018-01-05 0.0
5 2018-01-06 10.0
6 2018-01-07 NaN
7 2018-01-08 9.0
8 2018-01-09 0.0
9 2018-01-10 0.0
Now, if I use KNN (k=3), then with zeros, the value would be the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.
A few rough ideas which I thought of but could not proceed through were as follows:
Modifying the weights (used in the weighted mean computation) of the KNN imputation process so that the removed 0s are taken into account during the imputation.
Adding a column which says how many neighbouring zeros a particular column has and then somehow use it to modify the imputation process.
Points 1. and 2. are just rough ideas which came across my mind while thinking about how to solve the problem and might help one while answering the answer.
PS -
Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column, and then using this for imputation.
I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.

Let's think logically, leave the machine learning part aside for the moment.
Since we are dealing with time series, it would be good if you impute the data with the average of values for the same date in different years, say 2-3 years ( if we consider 2 years, then 1 year before and 1 year after the missing value year), would recommend not to go beyond 3 years. We have computed x now.
Further to make this computed value x close to the current data, use an average of x and y, y is linear interpolation value.
In the above example, y = (10 + 9)/2, i.e. average of one value before and one value after the data to be imputed.

Related

Efficient storage of time-series data with rarely-changing variables

I must store and analyse time-series data received from several devices.
Each device is emitting data every 20 ms and lots of variables shall be stored (50+).
Some data are changing at each sample, but other discrete data (enums or booleans) are less frequently changing (See example below).
Currently, i'm using TimescaleDb fed by Python (pandas) and data's are split over several tables by grouping data depending of their typical variation rate. Only changes are stored. At the end the data volume is really optimized with this approach.
However i'm having troubles to analyse this data, as typically i must run queries and know the value of all "Data_x" values on a given "timestamp".
Currently, it requires some complex reconstruction process using "Last-Observation-Carried_Forward" etc.
Would there be a better solution ?
Full data set
Timestamp
Data_1
Data_2
Data_3
2022-06-12 17:52:43.000
22.2
0
0
2022-06-12 17:52:44.000
25.4
0
1
2022-06-12 17:52:45.000
29.2
1
0
2022-06-12 17:52:46.000
31.3
1
0
2022-06-12 17:52:47.000
31.4
1
0
2022-06-12 17:52:48.000
33.7
0
1
Data_1 table
Timestamp
Data_1
2022-06-12 17:52:43.000
22.2
2022-06-12 17:52:44.000
25.4
2022-06-12 17:52:45.000
29.2
2022-06-12 17:52:46.000
31.3
2022-06-12 17:52:47.000
31.4
2022-06-12 17:52:48.000
33.7
Data_2_and_3 table (Note that there are only 4 samples in the table, as only the changes are recorded.
Timestamp
Data_2
Data_3
2022-06-12 17:52:43.000
0
0
2022-06-12 17:52:44.000
0
1
2022-06-12 17:52:45.000
1
0
2022-06-12 17:52:48.000
0
1

having troubles to analyse this data...
Would there be a better solution ?
Hard to say. You didn't really give us a description of
the analysis challenges.
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
The central problem seems to be how to boil down "too much data"
into fewer rows.
For the dataframe approach you're currently using,
with 1-second fixed-resolution timestamps across data_{1,2,3},
there's little choice but to synthesize those "carried forward" values.
A bottom-up compression approach, starting with the data,
would ignore irrelevant columns and then suppress any unchanged rows.
This might compress 3600 observations in an hour down to just a few hundred if there's intervals with little change.
A top-down compression approach, from the application side,
would have you focus on just the timestamps of interest
to the app, perhaps at 3-second intervals.
Something could change more quickly, but the app doesn't care.
This might be a better fit for a custom postgres query.
You didn't describe your app. Perhaps it doesn't really require
isochronous samples, and suppressing "too frequent" observations
would suffice. So if you're shooting for nominal 3-second intervals
and there were observations at noon +0, +1, and +5 seconds,
the +1 reading is suppressed and in practice we wind up with a 5-second interval.
When aggregating / suppressing observations,
you are free to choose from
first
last
min
max
median
avg
imputed (interpolated)
according to your app's high-level needs.
Define a loss metric, such as "% error".
Aggregate your data at K-second intervals,
K ranging from maybe .1 to 600,
and evaluate the loss relative to
computing with smallest possible interval.
Having weighed the tradeoffs, choose a value of K
that lets your app compute acceptably fast
with acceptably low loss.
Maybe you are performing a time domain analysis
of data that is more naturally viewed in the frequency domain.
In which case FFT is your friend.
Augmenting each timestamp with "delta time since value last changed"
might go a long way to simplifying your data processing task.

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have a hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
DataFrame showing typical seqeunce rainfall and river stage data
And here is an example of what it looks like when two sites have been tested
DataFrame showing abnormal rainfall data due to two sites being tested
I've come across several ways to statistically cluster data and identify outliers however none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located with the catchment so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp
Site 1
Site 2
Site 3
09:00:00
0
0
0
10:00:00
0
20
0
11:00:00
0
0
0
20mm of rainfall at one site with zero rainfall at the adjacent sites, or at the same site for the hour before and hour after is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp
Site 1
Site 2
Site 3
09:00:00
6
4
0
10:00:00
0
20
2
11:00:00
0
0
11
This is a less obvious example:
Timestamp
Site 1
Site 2
Site 3
09:00:00
1
0
0
10:00:00
0
20
2
11:00:00
0
3
1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism of how to apply that criteria to the dataframe is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across DataFrame?'
An extra complication is how to deal with checking values for Site 1 when there is preceding site to the left, and Site 3 where there is no following site to the right.

Manually interpolating NAs in large dataset in Pandas

I am currently working on a project that uses Pandas, with a large dataset (~42K rows x 1K columns).
My dataset has many omissing values which I want to interpolate to obtain a better result when training an ML model using this data. My method of interpolating the data is by taking the average of the previous and the next value and then considering that the value for any NaN. Example:
TRANSACTION PAYED MONDAY TUESDAY WEDNESDAY
D8Q3ML42DS0 1 123.2 NaN 43.12
So in the above example the NaN would be replaced with the average of the 123.2 and 43.12 which is 83.16. If the value can't be interpolated then a 0 is put. I was able to implement this in a number of ways but I always end up getting into the issue of it taking a very long time to process all of the rows in the dataset despite running it on an Intel Core i9. The following are approaches I've tried and have found out that they take too long:
Interpolating the data and then only replacing the elements that need to be replaced instead of replacing the entire row.
Replacing the entire row with a new pd.Series that has the old and the interpolated values. It seems like my code is able to execute reasonably well on a Numpy Array but the slowness comes from the assignment.
I'm not quite sure why the performance of my code comes nowhere close to df.interpolate() despite it being the same idea. Here is some of my code responsible for the interpolation:
for transaction_id in df.index:
df.loc[transaction_id, 2:] = interpolate(df.loc[transaction_id, 2:])
def interpolate(array:np.array):
arr_len = len(array)
for i in range(array):
if math.isnan(array[i]):
if i == 0 or i == arr_len-1 or math.isnan(array[i-1]) or math.isnan(array[i+1]):
array[i] = 0
else:
statistics.mean([array[i-1], array[i+1]])
return array
My understanding is that Pandas has some sort of parallel techniques and functions that it is able to use to perform that. How can I speed this process up even a little?

df.interpolate(method='linear', limit_direction='forward', axis=0)
Try doing this it might help.

Estimating price with Linear Regression

I'm posting here because I couldn't find any solution to my problem anywhere else. Basically we are learning Linear Regression using python at school and the professor wants us to estimate the price of each ingredient in a sandwich as well as the fixed profit of each sandwich based on a csv table. So far we only messed with one X variable and one Y variable, so I'm pretty confused what should I do here? Thank you. Here is the table:
tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75

You have 9 separate variables for regression (tomato ... price), and 13 samples for each of them (the 13 lines).
So the first approach could be doing a regression for "tomato" on data points
0.05
0.05
0.05
0.05
0.05
0
0.05
0.05
0
0.05
0.05
0.05
0
then doing another one for "lettuce" and the others, up to "price" with
18.4
16.15
22.15
19.4
18.4
11.75
18.15
18.65
15.75
15.4
17.15
18.9
18.75
Online viewer for looking at your CSV data: http://www.convertcsv.com/csv-viewer-editor.htm, but Google SpreadSheet, Excel, etc. can display it nicely too.
SciPy can probably (most likely) do the task for you on vectors too (so handling the 9 variables together), but the part of having 13 samples in the 13 rows, remains.
EDIT: bad news, I was tired and have not answered the full question, sorry about that.
While it is true that you can take the first 8 columns (tomato...ham) as time series, and make individual regressions for them (which is probably the first part of this assignment), the last column (price) is expected to be estimated from the first 8.
Using the notation in Wikipedia, https://en.wikipedia.org/wiki/Linear_regression#Introduction, your y vector is the last column (the prices), the X matrix is the first 8 columns of your data (tomato...ham), extended with a column of 1-s somewhere.
Then pick an estimation method (some are listed in that page too, https://en.wikipedia.org/wiki/Linear_regression#Estimation_methods, but you may want to pick one you have learned about at class). The actual math is there, and NumPy can do the matrix/vector calculations. If you go for "Ordinary least squares", numpy.linalg.lstsq does the same (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq - you may find adding that column of 1-s familiar), so it can be used for verifying the results.

Python - Zero-Order Hold Interpolation (Nearest Neighbor)

I will be shocked if there isn't some standard library function for this especially in numpy or scipy but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example. The 2nd column would be one of the prices - either the buy or sell prices.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price) but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handled multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA

For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It is able to do a zero order hold interpolation if you specify kind="zero". It can also simultaniously interpolate multiple columns of a matrix. You will just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your normalized data by calling f(np.linspace(start_time, end_time, time_step).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.