How to take the value after every 10 min? - python

I am using an LSTM to forecast stock prices, and I did some feature engineering on the time series data. I have two columns: the first is the price and the second is the date. Now I want to train the model on the price value taken every ten minutes. Suppose:
date_index date new_price
08:28:31 08:28:31 8222000.0
08:28:35 08:28:35 8222000.0
08:28:44 08:28:44 8200000.0
08:28:50 08:28:50 8210000.0
08:28:56 08:28:56 8060000.0
08:29:00 08:29:00 8110000.0
08:29:00 08:29:00 8110000.0
08:29:05 08:29:05 8010000.0
08:29:24 08:29:24 8222000.0
08:29:28 08:29:28 8210000.0
Let's say the first date is 8:28:31; it takes the corresponding price, and the second time it should take the price corresponding to ten minutes later, 8:38:31. Sometimes that exact time is not available in the data. How do I do this? My goal is just to train the model on values taken every 10 minutes or 15 minutes.

The main keyword you are looking for here is resampling.
You have time-series data with unevenly spaced timestamps and want to convert the index to something more regular (i.e. 10-minute intervals).
The question you have to answer is: if the exact 10-minute timestamp is not available in your data, what do you want to do? Take the most recent event instead?
(Let's say the data for 8:38:31 is not available, but there's data for 8:37:25. Do you just want to take that?)
If so, something like df.resample('10min').last() should work; '10min' (or '10T') is the pandas offset alias for a 10-minute interval.
See here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling
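For a concrete illustration, here is a minimal sketch of that approach, assuming the prices are indexed by a DatetimeIndex (the column name and sample values are taken from the question; the forward-fill at the end is one way to handle 10-minute bins that contain no observations):

import pandas as pd

# Toy data modelled on the question: an uneven time index and a price column
df = pd.DataFrame(
    {"new_price": [8222000.0, 8200000.0, 8210000.0, 8060000.0, 8110000.0]},
    index=pd.to_datetime(["08:28:31", "08:31:02", "08:39:44", "08:48:50", "08:58:56"]),
)

# Resample to 10-minute bins and keep the last observed price in each bin;
# forward-fill any bin that happens to contain no trades
every_10_min = df.resample("10min").last().ffill()
print(every_10_min)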

Related

Time series frequency with 5-minute timestamps

I have two large time series data sets. Both are separated by 5-minute interval timestamps. The length of each time series is 3 months (August 1 2014 to October 2014). I'm using R (3.1.1) for forecasting the data. I'd like to know the value of the "frequency" argument in the ts() function in R for each data set. Since most of the examples and cases I've seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated 5-minute intervals.
I would think that it would be either of these:
myts1 <- ts(series, frequency = (60*60*24)/5)
myts2 <- ts(series, deltat = 5/(60*60*24))
In the first, the frequency argument gives the number of times sampled per time unit. If the time unit is the day, there are 60*60*24 = 86400 seconds per day and you're sampling every 5 of them, so you would be sampling 17280 times per day. Alternatively, the second option is the fraction of a day that separates each sample. Here, we would say that every 5/86400 ≈ 5.787037e-05 of a day, a sample is drawn. If the time unit is something different (e.g., the hour), then obviously these would change.

How to split time into 10-minute intervals and make it a separate column using Python

I am working on a data set that has epoch time. I want to create a new column which splits the time into 10-minute intervals.
Suppose
timestamp timeblocks
5:00 1
5:02 1
5:11 2
How can I achieve this using Python?
I tried resampling but I was not able to process it further.
I agree with the comments; you need to provide more detail. Guessing at what you're looking for, you may want histograms, where the intervals are known as bins. You wrote "10 mins time interval" but your example doesn't show 10-minute spacing.
Python's Numpy and matplotlib have histograms for epochs.
Here's an SO answer on histogram for epoch time.
I'm guessing here, but I believe you are trying to 'bin' your time data into 10 min intervals.
Epoch or Unix time is represented as time in seconds (or more commonly nowadays, milliseconds).
First thing you'll need to do is convert each of your epoch time to minutes.
Assuming you have a DataFrame and your epoch are in seconds:
df['min'] = df['epoch'] // 60
Once that is done, you can bin your data using pd.cut:
df['bins'] = pd.cut(df['min'], bins=pd.interval_range(start=df['min'].min()-1, end=df['min'].max(), freq=10))
Notice that the -1 on start shifts the first bin so it begins at the start of each 10-minute interval.
You'll have your 'bins', you can rename them to your liking and you can groupby them.
The solution may not be perfect, but it will possibly get you on the right track.
Good luck!
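Putting the steps above together, here is a minimal sketch of that binning approach (the column name 'epoch' and the sample values are assumptions for illustration; the end of the interval range is padded so the largest minute still falls inside a bin):

import pandas as pd

# Hypothetical epoch timestamps in seconds
df = pd.DataFrame({"epoch": [1570699811, 1570699931, 1570700475, 1570700891]})

# Convert seconds to whole minutes
df['min'] = df['epoch'] // 60

# Bin into 10-minute intervals; start is shifted by 1 so the first value
# falls inside the first bin, and end is padded to cover the last value
df['bins'] = pd.cut(
    df['min'],
    bins=pd.interval_range(start=df['min'].min() - 1,
                           end=df['min'].max() + 10,
                           freq=10),
)
print(df)  # the 'bins' column can now be renamed or used with groupby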

Pyspark structured streaming window (moving average) over last N data points

I read several data frames from Kafka topics using Pyspark Structured Streaming 2.4.4. I would like to add some new columns to those data frames that are mainly based on window calculations over the past N data points (for instance: a moving average over the last 20 data points), and as a new data point is delivered, the corresponding value of MA_20 should be calculated immediately.
Data may look like this:
Timestamp | VIX
2020-01-22 10:20:32 | 13.05
2020-01-22 10:25:31 | 14.35
2020-01-23 09:00:20 | 14.12
It is worth mentioning that data will be received from Monday to Friday over an 8-hour period each day.
Thus a moving average calculated on Monday morning should include data from Friday!
I tried different approaches but, still I am not able to achieve what I want.
windows = df_vix \
    .withWatermark("Timestamp", "100 minutes") \
    .groupBy(F.window("Timestamp", "100 minute", "5 minute"))
aggregatedDF = windows.agg(F.avg("VIX"))
The preceding code calculates a moving average, but it will consider data from Friday as late, so it will be excluded. Rather than the last 100 minutes, it should be the last 20 points (at 5-minute intervals).
I thought I could use rowsBetween or rangeBetween, but with streaming data frames a window cannot be applied over non-timestamp columns (F.col('Timestamp').cast('long')):
w = Window.orderBy(F.col('Timestamp').cast('long')).rowsBetween(-600, 0)
df = df_vix.withColumn('MA_20', F.avg('VIX').over(w))
But on the other hand there is no way to specify a time interval within rowsBetween(); using rowsBetween(-minutes(20), 0) throws an error because minutes is not defined (there is no such function in sql.functions).
I found another way, but it doesn't work for streaming data frames either. I don't know why the 'Non-time-based windows are not supported on streaming DataFrames' error is raised (df_vix.Timestamp is of timestamp type):
df_vix.createOrReplaceTempView("df_vix")
aggregatedDF = spark.sql(
"""SELECT *, mean(VIX) OVER (
ORDER BY CAST(df_vix.Timestamp AS timestamp)
RANGE BETWEEN INTERVAL 100 MINUTES PRECEDING AND CURRENT ROW
) AS mean FROM df_vix""")
I have no idea what else I could use to calculate a simple moving average. It looks like it is impossible to achieve in Pyspark... maybe a better solution would be to convert the entire Spark data frame to Pandas each time new data comes in and calculate everything in Pandas (or append new rows to the Pandas frame and calculate the MA)?
I thought that creating new features as new data comes in was the main purpose of Structured Streaming, but as it turns out Pyspark is not suited to this, so I am considering giving up Pyspark and moving to Pandas...
EDIT
The following doesn't work either; although df_vix.Timestamp is of type 'timestamp', it throws the 'Non-time-based windows are not supported on streaming DataFrames' error anyway:
w = Window.orderBy(df_vix.Timestamp).rowsBetween(-20, -1)
aggregatedDF = df_vix.withColumn("MA", F.avg("VIX").over(w))
Have you looked at window operations on event time? window(timestamp, "10 minutes", "5 minutes") will give you a data frame of 10-minute windows every 5 minutes that you can then do aggregations on, including moving averages.
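As a rough sketch of that suggestion (the rate source, column names, and console sink are placeholders for the Kafka stream described in the question), a sliding event-time window of 100 minutes advanced every 5 minutes covers the last 20 five-minute points:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowed-moving-average").getOrCreate()

# Stand-in streaming source: the built-in "rate" source emits (timestamp, value)
# rows; in the question this would be the parsed Kafka stream
df_vix = (
    spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    .withColumnRenamed("timestamp", "Timestamp")
    .withColumn("VIX", (F.col("value") % 100).cast("double"))
)

# Sliding event-time window: each 100-minute window, advanced every 5 minutes,
# covers the last 20 five-minute points, so its average behaves like an MA_20
windowed = (
    df_vix.withWatermark("Timestamp", "100 minutes")
    .groupBy(F.window("Timestamp", "100 minutes", "5 minutes"))
    .agg(F.avg("VIX").alias("MA_20"))
)

# Emit updated window aggregates to the console as new data arrives
query = (
    windowed.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()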

How to get statistics of one column of a dataframe using data from a second column?

I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three values: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the unix timestamp of that particular trade. I'm having trouble getting any accurate statistical analysis of the data. For example, if I use .median(), the program only looks at the number of values listed rather than the fact that each value may have been traded hundreds of times based on the volume column.
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691 . . .
To better understand the issue, I've also grouped it by price and summed the other columns with groupby('p').sum(). I realize this means the timestamp is useless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171 . . .
Is there any way to use the number from the trade volume column to perform a statistical analysis of the price column? I've considered creating a new dataframe where each price is listed the number of times that it is traded, but am not sure how to do this.
Thanks in advance for any help!
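One way to do what the question describes (a sketch only; the column names price, volume, and ts are assumptions) is to repeat each price row by its trade volume, so that ordinary statistics become volume-weighted:

import pandas as pd

# Hypothetical trade data modelled on the sample above
trades = pd.DataFrame({
    "price": [227.60, 227.40, 227.59, 227.60, 227.36, 227.35],
    "volume": [40, 27, 50, 10, 100, 150],
    "ts": [1570699811183, 1570699821641, 1570699919891,
           1570699919891, 1570699967691, 1570699967691],
})

# Expand each trade so its price appears once per share traded,
# then ordinary statistics account for volume automatically
expanded = trades.loc[trades.index.repeat(trades["volume"]), "price"]
print(expanded.median())   # volume-weighted median

# Equivalent weighted mean without expanding the frame
weighted_mean = (trades["price"] * trades["volume"]).sum() / trades["volume"].sum()
print(weighted_mean)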

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way the output has a shape of (4,145,192) indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season every year. In other words, output should have something with a shape like (40,145,192) (4 values for each year x 10 years)
I've looked into trying to do this with Dataset.resample as well, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to I can alter the dataset so it starts in the correct place, but I was hoping there would be an easier way, considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
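A minimal sketch of that resample (with synthetic data standing in for the netCDF file; the variable and dimension names are assumptions, and 'QS-DEC' anchors the same Dec/Mar/Jun/Sep season boundaries as the answer's 'QS-Mar'):

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the 3-hourly rainfall dataset over ~10 years
times = pd.date_range("2005-01-01", "2014-12-31 21:00", freq="3h")
ds = xr.Dataset(
    {"rainfall": (("time", "lat", "lon"), np.random.rand(times.size, 4, 4))},
    coords={"time": times, "lat": np.arange(4), "lon": np.arange(4)},
)

# One maximum per season per year: quarters anchored at Dec/Mar/Jun/Sep
# correspond to DJF, MAM, JJA, SON
seasonal_max = ds.resample(time="QS-DEC").max("time")
print(seasonal_max.rainfall.shape)  # roughly (4 seasons x 10 years, 4, 4)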
