Pyspark structured streaming window (moving average) over last N data points - python

I read several data frames from Kafka topics using PySpark Structured Streaming 2.4.4. I would like to add some new columns to those data frames that are mainly based on window calculations over the past N data points (for instance: a moving average over the last 20 data points), and as a new data point is delivered, the corresponding value of MA_20 should be calculated immediately.
Data may look like this:
Timestamp | VIX
2020-01-22 10:20:32 | 13.05
2020-01-22 10:25:31 | 14.35
2020-01-23 09:00:20 | 14.12
It is worth mentioning that data will be received from Monday to Friday, over an 8-hour period each day.
Thus a moving average calculated on Monday morning should include data from Friday!
I tried different approaches, but I am still not able to achieve what I want.
import pyspark.sql.functions as F

windows = df_vix \
    .withWatermark("Timestamp", "100 minutes") \
    .groupBy(F.window("Timestamp", "100 minute", "5 minute"))

aggregatedDF = windows.agg(F.avg("VIX"))
The preceding code calculates a moving average, but it treats the data from Friday as late, so it gets excluded. Rather than the last 100 minutes, the window should really cover the last 20 points (at 5-minute intervals).
I thought I could use rowsBetween or rangeBetween, but on streaming data frames a window cannot be applied over non-timestamp columns (F.col('Timestamp').cast('long')):
from pyspark.sql import Window

w = Window.orderBy(F.col('Timestamp').cast('long')).rowsBetween(-600, 0)
df = df_vix.withColumn('MA_20', F.avg('VIX').over(w))
But on the other hand, there is no way to specify a time interval within rowsBetween(); using rowsBetween(-minutes(20), 0) throws an error because minutes is not defined (there is no such function in sql.functions).
I found another way, but it doesn't work for streaming data frames either. I don't know why the 'Non-time-based windows are not supported on streaming DataFrames' error is raised (df_vix.Timestamp is of timestamp type):
df_vix.createOrReplaceTempView("df_vix")
aggregatedDF = spark.sql(
"""SELECT *, mean(VIX) OVER (
ORDER BY CAST(df_vix.Timestamp AS timestamp)
RANGE BETWEEN INTERVAL 100 MINUTES PRECEDING AND CURRENT ROW
) AS mean FROM df_vix""")
I have no idea what else I could use to calculate a simple moving average. It looks like it is impossible to achieve in PySpark... maybe a better solution would be to convert the entire Spark data frame to pandas each time new data arrives and calculate everything in pandas (or append new rows to a pandas frame and calculate the MA there)?
I thought that creating new features as new data comes in was the main purpose of Structured Streaming, but since PySpark turned out not to be suited to this, I am considering giving up PySpark and moving to pandas...
EDIT
The following doesn't work either; although df_vix.Timestamp is of type 'timestamp', it throws the 'Non-time-based windows are not supported on streaming DataFrames' error anyway.
w = Window.orderBy(df_vix.Timestamp).rowsBetween(-20, -1)
aggregatedDF = df_vix.withColumn("MA", F.avg("VIX").over(w))

Have you looked at window operations on event time? window(timestamp, "10 minutes", "5 minutes") will give you a data frame of 10-minute windows every 5 minutes that you can then run aggregations on, including moving averages.
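A rough sketch of that suggestion against the df_vix streaming frame from the question might look like the following; the watermark, output mode, and console sink are illustrative assumptions, not part of the comment:
import pyspark.sql.functions as F

# 10-minute event-time windows sliding every 5 minutes, averaged per window
windowed_avg = (
    df_vix
    .withWatermark("Timestamp", "10 minutes")
    .groupBy(F.window("Timestamp", "10 minutes", "5 minutes"))
    .agg(F.avg("VIX").alias("avg_vix"))
)

# Stream the running window averages to the console for inspection
query = (
    windowed_avg.writeStream
    .outputMode("update")
    .format("console")
    .start()
)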

Related

Selecting multiple values with a condition in order to get information - Python Pandas

I have two data frames, Df1 and Df2.
In the first data frame there is a column minutes, and in the second data frame there are columns minutes and time interval. I want to take values from both data frames. For the column "time interval" the value 7 is given, and I want to get all values from the minutes column that go up in steps of that interval of 7.
For example time interval 7 Minute 14
or Time interval 7 and minute 49
The real minutes column goes up to 1050, so I don't want to manually write every number that goes up by 7.
Thank you in advance!
I tried to build up my knowledge of SQL in order to solve it in Python pandas.
I read multiple articles, including on Stack Overflow, but I was not able to replicate or understand them.
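A minimal pandas sketch of the selection described above; the frame and column names (df1, df2, "minutes", "time interval") are assumptions based on the description, since no code is shown:
import pandas as pd

# Hypothetical stand-ins for Df1 and Df2 from the question
df1 = pd.DataFrame({"minutes": range(1, 1051)})
df2 = pd.DataFrame({"time interval": [7]})

interval = df2["time interval"].iloc[0]  # 7 in the example

# Keep every minutes value that is a multiple of the interval (7, 14, 21, ...)
selected = df1[df1["minutes"] % interval == 0]
print(selected.head())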

How to take the value after every 10 min?

I am using an LSTM for forecasting stock prices, and I did some feature engineering on the time series data. I have two columns: the first is price and the second is date. Now I want to train the model on the price value taken every ten minutes. Suppose:
date_index date new_price
08:28:31 08:28:31 8222000.0
08:28:35 08:28:35 8222000.0
08:28:44 08:28:44 8200000.0
08:28:50 08:28:50 8210000.0
08:28:56 08:28:56 8060000.0
08:29:00 08:29:00 8110000.0
08:29:00 08:29:00 8110000.0
08:29:05 08:29:05 8010000.0
08:29:24 08:29:24 8222000.0
08:29:28 08:29:28 8210000.0
Let's say the first date is 8:28:31; it takes the corresponding price value, and the next sample should take the corresponding value ten minutes later, i.e. at 8:38:31. Sometimes that exact time is not available in the data. How do I do this? My goal is just to train the model on a value taken every 10 or 15 minutes.
The main keyword you are looking for here is resampling
You have time-series data with unevenly spaced time stamps and want to convert the index to something more regular (i.e. 10 minute intervals).
The question you have to answer is: If the exact 10 minute timestamp is not available in your data, what do you want to do? Take the most recent event instead?
(Let's say the data for 8:38:31 is not available, but there's data for 8:37:25. Do you just want to take that?)
If so, I think something like df.resample(_some_argument_to_set_interval_to_10_minutes).last() should work, where I forgot the exact syntax for setting the interval in the resample method. It might be something like "10min".
See here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling
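A short sketch of that resample suggestion on a few rows shaped like the question's data; the later timestamps are invented only so the 10-minute buckets are visible:
import pandas as pd

# Small frame mimicking the date / new_price columns from the question
df = pd.DataFrame({
    "date": ["08:28:31", "08:28:44", "08:29:05", "08:40:10", "08:52:03"],
    "new_price": [8222000.0, 8200000.0, 8010000.0, 8110000.0, 8060000.0],
})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Last observed price in each 10-minute bucket; forward-fill empty buckets
# so every interval still carries the most recent known price
resampled = df["new_price"].resample("10min").last().ffill()
print(resampled)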

How to use Pandas to split timestamped CSV data into multiple CSVs based on values and continuous time periods

I am trying to analyse a ship's AIS data. I have a CSV with ~20,000 rows, with columns for lat / long / speed / time stamp.
I have loaded the data in a pandas data frame, in a Jupyter notebook.
What I want to do is split the CSV into smaller CSVs based on the time stamp and the speed, so I want an individual CSV for each period of time the vessel speed was less than, say, 2 knots. E.g. if the vessel transited at 10 knots for 6 hrs, then slowed down to 1 knot for 3 hrs, sped back up to 10 knots, then slowed down again to 1 knot for 4 hrs, I would want the output to be two CSVs: one for the 3 hr period and one for the 4 hr period. This is so I can review these periods individually in my mapping software.
I can filter the data easily to show all the periods where it is <1 knot, but I can't break it down to output the continuous periods as separate CSVs / data frames.
EDIT
Here is an example of the data
I've tried to show more clearly what I want to achieve here
Here is something to maybe get you started.
First, filter out all values that meet the criteria (for example, at or below 2):
import pandas as pd

df = pd.DataFrame({'speed': [2, 1, 4, 5, 4, 1, 1, 1, 3, 4, 5, 6],
                   'time': [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_below2 = df[df['speed'] <= 2].reset_index(drop=True)
Now we need to split the frame if there is too long a gap between values in time. For example:
threshold = 2
df_below2['not_continuous'] = df_below2['time'].diff() > threshold
Distinguish between the groups using cumsum():
df_below2['group_id'] = df_below2['not_continuous'].cumsum()
From here it should be easy to split the frame based on the group id.
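Continuing the df_below2 frame built above, a hedged sketch of that final step could simply iterate over the group ids and write each continuous slow period to its own file (the file-name pattern is only an illustration):
# Write each continuous slow period to its own CSV
for group_id, period in df_below2.groupby('group_id'):
    period.to_csv(f"slow_period_{group_id}.csv", index=False)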

Create equidistant data frame with time ranged data with Python

I have a .csv file in which data is stored for date ranges - from and to date columns. However, I would like to create a daily data frame with Python out of it.
The time can be ignored, as a gasday always starts at 6am and ends at 6am.
My idea was to end up with a data frame indexed by date (ranging, for example, from March 1st, 2019 to December 31st, 2019) at a daily granularity.
I would create columns from the unique values of the identifier and place the respective values, or NaN, in them.
The latter I can easily do with pd.pivot_table, but my problem with the time range still remains...
Any ideas of how to cope with that?
time-ranged data frame
It should look like this, just with rows in a daily granularity, considering the to column as well. Maybe with range?
output should look similar to this, just with a different period
You can use pandas and group by the column you want:
import pandas as pd

df = pd.read_csv("yourfile.csv")
groups = df.groupby("periodFrom")
groups.get_group("2019-03-09 06:00")
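The snippet above only groups on the start column; a hedged sketch of actually expanding each from/to range into daily rows and then pivoting (the column names periodFrom, periodTo, identifier and value are assumptions, since the CSV layout is not shown):
import pandas as pd

df = pd.read_csv("yourfile.csv", parse_dates=["periodFrom", "periodTo"])

# One row per gas day covered by each from/to range
df["date"] = df.apply(
    lambda r: list(pd.date_range(r["periodFrom"].normalize(),
                                 r["periodTo"].normalize(), freq="D")),
    axis=1,
)
daily = df.explode("date")

# Pivot so every identifier becomes a column, indexed by day
table = daily.pivot_table(index="date", columns="identifier", values="value")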

Python - Zero-Order Hold Interpolation (Nearest Neighbor)

I will be shocked if there isn't some standard library function for this, especially in numpy or scipy, but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example). The 2nd column would be one of the prices - either the buy or sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there a faster method using one of the built-in libraries. I just can't find it.
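For reference, that loop might look roughly like this; it is only a sketch of the approach described above, not the poster's actual code:
import numpy as np

def crude_zoh(times, values, new_times):
    # Loop-based zero-order hold: each new time gets the value of the most
    # recent original sample at or before it. Assumes new_times starts at or
    # after times[0] and that values is a 2-D array (one column per price).
    out = np.empty((len(new_times), values.shape[1]))
    for i in range(len(times)):
        t0 = times[i]
        t1 = times[i + 1] if i + 1 < len(times) else np.inf
        # Indices of the new grid that fall in [t0, t1)
        idx = np.nonzero((new_times >= t0) & (new_times < t1))[0]
        out[idx] = values[i]
    return out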
Also, for the example above I only use an Nx2 matrix (column 1: times, column 2: price), but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It can do a zero-order hold interpolation if you specify kind="zero", and it can also simultaneously interpolate multiple columns of a matrix; you just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your resampled data by calling f(np.linspace(start_time, end_time, time_step)).
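A small runnable sketch of that suggestion on the example data from the question; the regular grid step of 1 time unit is just an illustration:
import numpy as np
from scipy.interpolate import interp1d

# Irregularly timed observations: column 0 is time, remaining columns are prices
data = np.array([[  1.0,     0.0003234],
                 [  1.01,    0.0003233],
                 [ 10.0004,  0.00033  ],
                 [124.23,    0.0003334]])

# kind='zero' gives a zero-order hold: each value persists until the next sample.
# axis=0 interpolates every price column in one call.
f = interp1d(data[:, 0], data[:, 1:], kind='zero', axis=0)

# Evaluate on a regular grid inside the observed time range
new_times = np.arange(1.0, 124.0, 1.0)
resampled = np.column_stack([new_times, f(new_times)])
print(resampled[:5])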
