Optimizing pivoting and filling - python

They gave me a table storing sensor readings with a schema [TimeStamp, SensorKey, SensorValue].
TimeStamp Id Value
2019-01-01 00:00:47 1 66.6
2019-01-01 00:00:47 2 0.66
2019-01-01 00:00:57 1 66.7
2019-01-01 00:00:57 2 0.68
2019-01-01 00:00:57 3 166.6
2019-01-01 00:01:07 3 146.6
Note that it only stores changes to sensor readings, with limited precision and sampling rate, and repeats a value every hour after the last change if it doesn't change.
Their queries amount to checking the value of sensor A (and B, and C, and D...) whenever the value of sensor Z satisfies some condition. And they want to use Python and Spark.
So, to compare the values of different sensors, I get the rows for those sensor keys and pivot the results to a schema [TimeStamp, ValueOfA, ..., ValueOfZ].
df1 = df0.groupBy("TS").pivot("Id", listOfIds).agg(F.last("Value"))
TimeStamp Sensor1 Sensor2 Sensor3
2019-01-01 00:00:47 66.6 0.66 Null
2019-01-01 00:00:57 66.7 0.68 166.6
2019-01-01 00:01:07 Null Null 146.6
Then I fill the gaps (always forwards; if I don't have older data to fill the first rows, I discard them).
window1hour = Window.orderBy('TS').rowsBetween(-360, 0)
# 360 rows = 1 hour at the 0.1 Hz sampling rate.
df2 = df1
for sid in sensorIds:
    df2 = df2\
        .withColumn(sid, F.last(F.column(sid), ignorenulls=True).over(window1hour))\
        .filter(F.column(sid).isNotNull())
The comparisons, column by column, are trivial now.
But when compared to doing the same with pandas it's slower, so much that it feels like I'm doing something wrong. At least for small queries.
What's happening? And what will happen when it's a large query?
About small and large: I have thousands of different sensors and about a billion records per year. So the data definitely fits on one server, but not in RAM. In fact, they will start with only one server for the data, maybe a second for a second Spark instance (both multiprocessor and with lots of memory), and hopefully they will invest in more hardware if they see returns. They will start by making small, day-by-day queries, and they want them fast. But later they will want to run queries over several years, and it must not explode.
Ideas/doubts: Is the preprocessing done in a single thread? Should I establish the parallelization myself, or do I let Spark handle it? Should I break the year-spanning queries into many day-spanning ones (but then why would I want Spark at all)? Do I solve the small queries in pandas and the large ones in Spark (and can I set the threshold beforehand)?
What other improvements can I apply?

It's not uncommon for "small" data to be faster in tools other than Spark. Spark has fairly significant overhead for its parallel functionality (granted, these overheads are very small compared with the old map-reduce paradigm).
Where Spark shines is its ability to scale linearly for "large" data by adding servers. It's at this point that the overhead becomes worth it, as Spark will automatically break the work up among all of the available executors.
I believe letting Spark handle the parallelization is ideal, if only for simplicity's sake. Whether or not to implement the "small" queries in another framework depends entirely on whether you want to maintain two code paths, and whether your customer is comfortable with their speed.
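For the "small queries in another framework" option, a minimal pandas sketch of the same pivot-and-forward-fill pipeline might look like this (the inline sample rows and the >150 threshold are only illustrative):
import pandas as pd

# Hypothetical long-format input with the question's schema: TimeStamp, Id, Value.
df0 = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2019-01-01 00:00:47", "2019-01-01 00:00:47",
                                 "2019-01-01 00:00:57", "2019-01-01 00:00:57",
                                 "2019-01-01 00:00:57", "2019-01-01 00:01:07"]),
    "Id": [1, 2, 1, 2, 3, 3],
    "Value": [66.6, 0.66, 66.7, 0.68, 166.6, 146.6],
})

# Pivot: one column per sensor, keeping the last value per (TimeStamp, Id).
df1 = df0.groupby(["TimeStamp", "Id"])["Value"].last().unstack("Id")

# Forward-fill gaps, but only up to 1 hour (360 samples at 0.1 Hz),
# then drop leading rows that could not be filled.
df2 = df1.ffill(limit=360).dropna()

# Column-by-column comparisons are now trivial, e.g. sensors 1 and 2
# at the timestamps where sensor 3 exceeds 150:
print(df2.loc[df2[3] > 150, [1, 2]])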

Related

Efficient storage of time-series data with rarely-changing variables

I must store and analyse time-series data received from several devices.
Each device emits data every 20 ms, and many variables (50+) must be stored.
Some values change at every sample, but other discrete values (enums or booleans) change less frequently (see the example below).
Currently I'm using TimescaleDB fed by Python (pandas), and the data is split over several tables by grouping variables according to their typical variation rate. Only changes are stored. In the end the data volume is well optimized with this approach.
However, I'm having trouble analysing this data, as I typically must run queries that need the value of every "Data_x" at a given "Timestamp".
Currently this requires a complex reconstruction process using "Last Observation Carried Forward" (LOCF), etc.
Would there be a better solution?
Full data set:
Timestamp                Data_1  Data_2  Data_3
2022-06-12 17:52:43.000  22.2    0       0
2022-06-12 17:52:44.000  25.4    0       1
2022-06-12 17:52:45.000  29.2    1       0
2022-06-12 17:52:46.000  31.3    1       0
2022-06-12 17:52:47.000  31.4    1       0
2022-06-12 17:52:48.000  33.7    0       1
Data_1 table:
Timestamp                Data_1
2022-06-12 17:52:43.000  22.2
2022-06-12 17:52:44.000  25.4
2022-06-12 17:52:45.000  29.2
2022-06-12 17:52:46.000  31.3
2022-06-12 17:52:47.000  31.4
2022-06-12 17:52:48.000  33.7
Data_2_and_3 table (note that there are only 4 samples in this table, as only the changes are recorded):
Timestamp                Data_2  Data_3
2022-06-12 17:52:43.000  0       0
2022-06-12 17:52:44.000  0       1
2022-06-12 17:52:45.000  1       0
2022-06-12 17:52:48.000  0       1
"having troubles to analyse this data... Would there be a better solution?"
Hard to say. You didn't really give us a description of the analysis challenges.
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
The central problem seems to be how to boil down "too much data" into fewer rows.
For the dataframe approach you're currently using, with 1-second fixed-resolution timestamps across Data_{1,2,3}, there's little choice but to synthesize those "carried forward" values.
A bottom-up compression approach, starting with the data, would ignore irrelevant columns and then suppress any unchanged rows. This might compress 3600 observations in an hour down to just a few hundred if there are intervals with little change.
A top-down compression approach, from the application side, would have you focus on just the timestamps of interest to the app, perhaps at 3-second intervals. Something could change more quickly, but the app doesn't care. This might be a better fit for a custom Postgres query.
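For the top-down approach, here is a minimal pandas sketch using the sample tables above: build the grid of timestamps the app cares about, then use as-of joins to carry the last observation forward onto it (whether this runs in pandas or as an equivalent query inside TimescaleDB is a separate choice):
import pandas as pd

# Sparse tables from the question (only changes are stored), sorted by Timestamp.
data_1 = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2022-06-12 17:52:43", "2022-06-12 17:52:44", "2022-06-12 17:52:45",
        "2022-06-12 17:52:46", "2022-06-12 17:52:47", "2022-06-12 17:52:48"]),
    "Data_1": [22.2, 25.4, 29.2, 31.3, 31.4, 33.7],
})
data_2_and_3 = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2022-06-12 17:52:43", "2022-06-12 17:52:44",
        "2022-06-12 17:52:45", "2022-06-12 17:52:48"]),
    "Data_2": [0, 0, 1, 0],
    "Data_3": [0, 1, 0, 1],
})

# Timestamps of interest: a 3-second grid over the period the app cares about.
grid = pd.DataFrame({
    "Timestamp": pd.date_range("2022-06-12 17:52:43",
                               "2022-06-12 17:52:49", freq="3s")
})

# As-of joins carry the last observation forward onto the grid (LOCF).
out = pd.merge_asof(grid, data_1, on="Timestamp")
out = pd.merge_asof(out, data_2_and_3, on="Timestamp")
print(out)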
You didn't describe your app. Perhaps it doesn't really require isochronous samples, and suppressing "too frequent" observations would suffice. So if you're shooting for nominal 3-second intervals and there were observations at noon +0, +1, and +5 seconds, the +1 reading is suppressed and in practice we wind up with a 5-second interval.
When aggregating / suppressing observations, you are free to choose from first, last, min, max, median, avg, or imputed (interpolated) values, according to your app's high-level needs.
Define a loss metric, such as "% error". Aggregate your data at K-second intervals, with K ranging from maybe 0.1 to 600, and evaluate the loss relative to computing with the smallest possible interval. Having weighed the tradeoffs, choose a value of K that lets your app compute acceptably fast with acceptably low loss.
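A rough sketch of that tuning loop in pandas; the synthetic 20 ms signal and the particular loss definition (mean absolute error relative to the raw signal, expressed as a percentage) are only illustrative:
import numpy as np
import pandas as pd

def percent_error(signal: pd.Series, k_seconds: float) -> float:
    """Loss of aggregating at K-second intervals, relative to the raw signal."""
    coarse = signal.resample(pd.Timedelta(seconds=k_seconds)).mean()
    # Carry the coarse values back onto the raw index for comparison.
    approx = coarse.reindex(signal.index, method="ffill")
    return float(np.abs(approx - signal).mean() / np.abs(signal).mean() * 100)

# Hypothetical 20 ms signal standing in for one of the Data_x columns.
idx = pd.date_range("2022-06-12", periods=50_000, freq="20ms")
signal = pd.Series(100 + np.cumsum(np.random.randn(50_000)) * 0.01, index=idx)

for k in [0.1, 1, 5, 30, 60, 600]:
    print(f"K = {k:>5} s -> {percent_error(signal, k):.3f} % error")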
Maybe you are performing a time-domain analysis of data that is more naturally viewed in the frequency domain, in which case the FFT is your friend.
Augmenting each timestamp with "delta time since value last changed" might go a long way toward simplifying your data processing task.
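A small pandas sketch of that last augmentation for one column, using the Data_2 values from the full data set above (the helper column name is made up):
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2022-06-12 17:52:43", "2022-06-12 17:52:44", "2022-06-12 17:52:45",
        "2022-06-12 17:52:46", "2022-06-12 17:52:47", "2022-06-12 17:52:48"]),
    "Data_2": [0, 0, 1, 1, 1, 0],
})

# Label runs of identical values, then measure time since each run started.
change = df["Data_2"].ne(df["Data_2"].shift()).cumsum()
run_start = df.groupby(change)["Timestamp"].transform("first")
df["Data_2_dt_since_change"] = df["Timestamp"] - run_start
print(df)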

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in metres at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
[Image: DataFrame showing a typical sequence of rainfall and river stage data]
And here is an example of what it looks like when two sites have been tested:
[Image: DataFrame showing abnormal rainfall data due to two sites being tested]
I've come across several ways to statistically cluster data and identify outliers, but none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp  Site 1  Site 2  Site 3
09:00:00   0       0       0
10:00:00   0       20      0
11:00:00   0       0       0
20 mm of rainfall at one site, with zero rainfall at the adjacent sites and at the same site for the hour before and the hour after, is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp  Site 1  Site 2  Site 3
09:00:00   6       4       0
10:00:00   0       20      2
11:00:00   0       0       11
This is a less obvious example:
Timestamp  Site 1  Site 2  Site 3
09:00:00   1       0       0
10:00:00   0       20      2
11:00:00   0       3       1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag the cell as anomalous if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism for applying those criteria to the dataframe is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to deal with checking values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
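Here is a rough sketch of that neighbourhood comparison applied across a whole DataFrame, using NumPy shifts rather than a per-cell lambda. Edge cells (first/last hour, first/last site) simply have fewer neighbours; the threshold of 15 and the small example frame come from the text above:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Site 1": [1, 0, 0], "Site 2": [0, 20, 3], "Site 3": [0, 2, 1]},
    index=pd.to_datetime(["2022-01-01 09:00", "2022-01-01 10:00",
                          "2022-01-01 11:00"]),
)

vals = df.to_numpy(dtype=float)
# Pad with -inf so missing neighbours at the edges never win the max.
padded = np.pad(vals, 1, mode="constant", constant_values=-np.inf)

# Max over the 3x3 neighbourhood around each cell, excluding the cell itself.
neighbour_max = np.full_like(vals, -np.inf)
for dr in (-1, 0, 1):
    for dc in (-1, 0, 1):
        if dr == 0 and dc == 0:
            continue
        shifted = padded[1 + dr: 1 + dr + vals.shape[0],
                         1 + dc: 1 + dc + vals.shape[1]]
        neighbour_max = np.maximum(neighbour_max, shifted)

threshold = 15.0
anomalies = pd.DataFrame(vals - neighbour_max > threshold,
                         index=df.index, columns=df.columns)
print(anomalies)   # True only for Site 2 at 10:00 in this example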

Create django model instances from dataframe without server timeout

I am uploading a zipped file that contains a single CSV file. I then unzip it on the server and load it into a dataframe. Then I am creating django objects from it.
This works fine until my dataframe becomes too large. When the dataset gets too big, the Django server shuts down. I assume this happens because every iteration of my loop consumes more memory, and when no memory is left it shuts down. I am working with big datasets and want my code to work no matter how big the dataframe is.
Imagine I have a df like this:
cola fanta sprite libella
2018-01-01 00:00:00 0 12 12 34
2018-01-01 01:00:00 1 34 23 23
2018-01-01 02:00:00 2 4 2 4
2018-01-01 03:00:00 3 24 2 4
2018-01-01 04:00:00 4 2 2 5
Imagine that there could be up to 1000 brand columns and more than half a million rows. Further imagine I have a model that saves this data in a JSONB field. Thus every column becomes a Django object, with the column name as the name; the timestamps combined with the data in that column form the JSONB field.
e.g.: name=fanta, json_data={ "2018-01-01 00:00:00": 12, "2018-01-01 01:00:00": 34 }
My code to unzip, load to df and then create a django instance is this:
df = pd.read_csv(
    file_decompressed.open(name_file),
    index_col=0,
)
for column_name in df:
    Drink.objects.create(
        name=column_name,
        json_data=df[column_name].to_dict(),
    )
As I said, this works, but my loop breaks after having created about 15 elements. Searching the internet I found that bulk_create could make this more efficient. But I have custom signals implemented, so this is not really a solution. I also thought about using to_sql, but since I have to restructure the data of the dataframe I don't think this will work. Maybe not using pandas at all? Maybe chunking the DF somehow?
I am looking for a solution that works independently of the number of columns. Number of rows is maximum of half a million rows. I also tried a while-loop but same problem occurs.
Any ideas how I can make this work? Help is very much appreciated.
My dummy model could be as simple as:
class Juice(models.Model):
    name = CharField(...)
    json_data = JSONField(...)
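One way to keep memory bounded regardless of the number of columns (the "chunking the DF" idea from the question) is to re-read the CSV several times with usecols, so that only a batch of columns is ever in memory at once, while keeping the per-object create() calls so the custom signals still fire. A rough sketch, assuming file_decompressed.open(name_file) can be reopened repeatedly and that CHUNK is tuned to the server's memory:
import pandas as pd

CHUNK = 50  # brand columns to hold in memory at a time; tune to available RAM

# Count the brand columns (everything except the timestamp column at position 0).
n_brands = pd.read_csv(file_decompressed.open(name_file),
                       nrows=1, index_col=0).shape[1]

for start in range(0, n_brands, CHUNK):
    # Positions in the file: 0 is the timestamp column, brands start at 1.
    positions = [0] + list(range(start + 1, min(start + CHUNK, n_brands) + 1))
    df = pd.read_csv(
        file_decompressed.open(name_file),   # reopen the zip member for each batch
        usecols=positions,
        index_col=0,
    )
    for column_name in df:
        Drink.objects.create(                # plain create(), so custom signals still fire
            name=column_name,
            json_data=df[column_name].to_dict(),
        )
    del df                                   # release this batch before the next read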

Remove zeros in pandas dataframe without affecting the imputation result

I have a timeseries dataset with 5M rows.
The column has 19.5% missing values and 80% zeros (don't read too much into the percentages: only about 0.5% of the data is non-zero and non-missing, but 0.5% of 5M rows is still plenty). Now I need to impute this column.
Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.
To make it faster, I thought of deleting all the zero-valued rows and then carrying out the imputation. But I feel that naively using KNN after this would lead to overestimation (since all the zero values are gone and, with the number of neighbours kept fixed, the mean is expected to increase).
So, is there a way to:
1. modify the data input to the KNN model, or
2. carry out the imputation after removing the rows with zeros, so that the values obtained after imputation are the same, or at least close?
To understand the problem more clearly, consider the following dummy dataframe:
DATE VALUE
0 2018-01-01 0.0
1 2018-01-02 8.0
2 2018-01-03 0.0
3 2018-01-04 0.0
4 2018-01-05 0.0
5 2018-01-06 10.0
6 2018-01-07 NaN
7 2018-01-08 9.0
8 2018-01-09 0.0
9 2018-01-10 0.0
Now, if I use KNN (k=3), then with zeros, the value would be the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.
A few rough ideas which I thought of but could not see through were as follows:
1. Modifying the weights (used in the weighted-mean computation) of the KNN imputation process so that the removed 0s are taken into account during the imputation.
2. Adding a column which says how many neighbouring zeros a particular column has, and then somehow using it to modify the imputation process.
Points 1 and 2 are just rough ideas which crossed my mind while thinking about how to solve the problem, and they might help while writing an answer.
PS -
Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column, and then using this for imputation.
I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.
Let's think logically and leave the machine learning part aside for the moment.
Since we are dealing with a time series, it would be good to impute the data with the average of the values for the same date in different years, say 2-3 years (if we consider 2 years, that means 1 year before and 1 year after the year with the missing value); I would recommend not going beyond 3 years. Call this computed value x.
Further, to keep this computed value x close to the current data, use the average of x and y, where y is the linear interpolation value.
In the above example, y = (10 + 9)/2, i.e. the average of the one value before and the one value after the point to be imputed.
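A rough pandas sketch of that recipe (x = same-date average across the other years present, y = linear interpolation, final value = (x + y) / 2). For brevity it averages over all other years rather than strictly one year on each side, and falls back to y alone when no other year has that date:
import numpy as np
import pandas as pd

def impute_seasonal_interp(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["DATE"] = pd.to_datetime(out["DATE"])

    # y: linear interpolation between the nearest observed neighbours.
    y = out["VALUE"].interpolate(method="linear")

    # x: average of the values recorded on the same month/day in other years
    # (the NaN being imputed is ignored by mean()).
    x = out.groupby([out["DATE"].dt.month, out["DATE"].dt.day])["VALUE"].transform("mean")

    # Final estimate: average of the seasonal value and the interpolation,
    # falling back to the interpolation alone if no other year has that date.
    estimate = ((x + y) / 2).fillna(y)
    out["VALUE"] = out["VALUE"].fillna(estimate)
    return out

# With the dummy frame above (a single year, so x falls back to y),
# the NaN on 2018-01-07 becomes (10 + 9) / 2 = 9.5.
dummy = pd.DataFrame({
    "DATE": pd.date_range("2018-01-01", periods=10, freq="D"),
    "VALUE": [0, 8, 0, 0, 0, 10, np.nan, 9, 0, 0],
})
print(impute_seasonal_interp(dummy))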

Pyspark structured streaming window (moving average) over last N data points

I read several data frames from Kafka topics using PySpark Structured Streaming 2.4.4. I would like to add new columns to those data frames that are mainly based on window calculations over the past N data points (for instance, a moving average over the last 20 data points), and as a new data point is delivered, the corresponding value of MA_20 should be calculated immediately.
Data may look like this:
Timestamp | VIX
2020-01-22 10:20:32 | 13.05
2020-01-22 10:25:31 | 14.35
2020-01-23 09:00:20 | 14.12
It is worth mentioning that data will be received from Monday to Friday over an 8-hour period each day.
Thus a moving average calculated on Monday morning should include data from Friday!
I tried different approaches but, still I am not able to achieve what I want.
windows = df_vix \
    .withWatermark("Timestamp", "100 minutes") \
    .groupBy(F.window("Timestamp", "100 minute", "5 minute"))
aggregatedDF = windows.agg(F.avg("VIX"))
The preceding code calculates an MA, but it will consider data from Friday as late, so it will be excluded. Rather than the last 100 minutes, it should be the last 20 points (at 5-minute intervals).
I thought I could use rowsBetween or rangeBetween, but on streaming data frames a window cannot be applied over a non-timestamp column (F.col('Timestamp').cast('long')):
w = Window.orderBy(F.col('Timestamp').cast('long')).rowsBetween(-600, 0)
df = df_vix.withColumn('MA_20', F.avg('VIX').over(w))
But on the other hand, there is no way to specify a time interval within rowsBetween(); using rowsBetween(-minutes(20), 0) throws an error: minutes is not defined (there is no such function in sql.functions).
I found another way, but it doesn't work for streaming data frames either. I don't know why the 'Non-time-based windows are not supported on streaming DataFrames' error is raised (df_vix.Timestamp is of timestamp type):
df.createOrReplaceTempView("df_vix")
df_vix.createOrReplaceTempView("df_vix")
aggregatedDF = spark.sql(
    """SELECT *, mean(VIX) OVER (
        ORDER BY CAST(df_vix.Timestamp AS timestamp)
        RANGE BETWEEN INTERVAL 100 MINUTES PRECEDING AND CURRENT ROW
    ) AS mean FROM df_vix""")
I have no idea what else I could use to calculate a simple moving average. It looks like it is impossible to achieve this in PySpark... Maybe a better solution would be to convert the entire Spark data frame to pandas each time new data comes in and calculate everything in pandas (or append new rows to pandas and calculate the MA)?
I thought that creating new features as new data comes in is the main purpose of Structured Streaming, but as it turns out PySpark is not suited to this, and I am considering giving up PySpark and moving to pandas...
EDIT
The following doesn't work either; although df_vix.Timestamp is of type 'timestamp', it throws the 'Non-time-based windows are not supported on streaming DataFrames' error anyway.
w = Window.orderBy(df_vix.Timestamp).rowsBetween(-20, -1)
aggregatedDF = df_vix.withColumn("MA", F.avg("VIX").over(w))
Have you looked at window operations on event time? window(timestamp, "10 minutes", "5 minutes") will give you 10-minute windows every 5 minutes that you can then run aggregations on, including moving averages.
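A minimal sketch of that suggestion for the df_vix stream from the question. The 3-day watermark is an assumption chosen to keep enough state to bridge the weekend gap, and note this gives time-based windows, not exactly "the last 20 rows":
from pyspark.sql import functions as F

# Sliding event-time windows: 100 minutes long, advancing every 5 minutes,
# which approximates "last 20 points at 5-minute intervals".
ma_20 = (
    df_vix
    .withWatermark("Timestamp", "3 days")   # assumed; long enough to span the weekend
    .groupBy(F.window("Timestamp", "100 minutes", "5 minutes"))
    .agg(F.avg("VIX").alias("MA_20"))
    .select(F.col("window.end").alias("Timestamp"), "MA_20")
)

query = (
    ma_20.writeStream
    .outputMode("update")   # emit refreshed window averages as new points arrive
    .format("console")
    .start()
)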
