Resample df to smaller time steps and average the counts - python

I have a dataframe containing counts over time periods (rainfall in 3-minute periods), something like this:
time_stamp, rain_fall_in_mm
2019-01-01 00:03:00, 0.0
2019-01-01 00:06:00, 3.9
2019-01-01 00:09:00, 0.0
2019-01-01 00:12:00, 1.2
I need to upsample the dataframe to time periods of 1 minute, and I would like to average out the counts for the rain, so that there are no NaNs and the total sum of rain remains the same. This is the desired result:
time_stamp, rain_fall_in_mm
2019-01-01 00:01:00, 0.0
2019-01-01 00:02:00, 0.0
2019-01-01 00:03:00, 0.0
2019-01-01 00:04:00, 1.3
2019-01-01 00:05:00, 1.3
2019-01-01 00:06:00, 1.3
2019-01-01 00:07:00, 0.0
2019-01-01 00:08:00, 0.0
2019-01-01 00:09:00, 0.0
2019-01-01 00:10:00, 0.4
2019-01-01 00:11:00, 0.4
2019-01-01 00:12:00, 0.4
I found that I can do something like series.resample('1min').bfill() or series.resample('1min').pad(). These solve the resampling issue, but don't fulfil the desired averaging. Do you have any suggestions? Thanks!

Try this: reindex onto the 1-minute grid, then group every run of NaN minutes together with the observation that ends it (that is what the reversed cumulative sum builds), and spread each value over its group with a mean:
df2 = df.reindex(pd.date_range(start='1/1/2019', periods=13, freq='1min'))
# Walking backwards, each non-NaN row starts a new group, so every group holds
# one observation plus the NaN minutes before it; filling NaNs with 0 and
# taking the group mean spreads the value evenly and preserves the total sum.
df2.fillna(0).groupby((~df2['rain_fall_in_mm'].isna()).iloc[::-1].cumsum()).transform('mean')
Note that this grid also includes 00:00, so the first observation is spread over four minutes instead of three; here that's harmless because it is 0.0.

First, make sure that your index is in datetime format. If it is not, you can set it like this:
df.set_index(pd.date_range(start=df.time_stamp[0], periods=len(df), freq='3min'), inplace=True)
Then use this if you want to upsample only one column:
df_rain_per_minute = df.resample('1min').bfill().rain_fall_in_mm / 3.
If your initial df contains only floats, you can operate on the whole dataframe:
df2 = df.resample('1min').bfill() / 3.
The division by 3. (the ratio old_time_period/new_time_period) is a bit hacky, but I haven't found a simpler, more general solution anywhere.
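If you don't want to hard-code the factor, a small sketch (assuming both period lengths are known up front) derives it from the two Timedeltas:
import pandas as pd
# ratio of the old period to the new one: 3 minutes / 1 minute == 3.0
factor = pd.Timedelta('3min') / pd.Timedelta('1min')
df2 = df.resample('1min').bfill() / factor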

Related

How do I plot a scatter graph comparing two dataframes?

I have two separate DataFrames, which both contain rainfall amounts and dates corresponding to them.
df1:
time tp
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 0.0
3 2013-01-01 03:00:00 0.0
4 2013-01-01 04:00:00 0.0
... ...
8755 2013-12-31 19:00:00 0.0
8756 2013-12-31 20:00:00 0.0
8757 2013-12-31 21:00:00 0.0
8758 2013-12-31 22:00:00 0.0
8759 2013-12-31 23:00:00 0.0
[8760 rows x 2 columns]
df2:
time tp
0 2013-07-18T18:00:01 0.002794
1 2013-07-18T20:00:00 0.002794
2 2013-07-18T21:00:00 0.002794
3 2013-07-18T22:00:00 0.002794
4 2013-07-19T00:00:00 0.000000
... ...
9656 2013-12-30T13:30:00 0.000000
9657 2013-12-30T23:30:00 0.000000
9658 2013-12-31T00:00:00 0.000000
9659 2013-12-31T00:00:00 0.000000
9660 2014-01-01T00:00:00 0.000000
[9661 rows x 2 columns]
I'm trying to plot a scatter graph comparing the two data frames. The way I'm doing it is by choosing a specific date and time and plotting the df1 tp on one axis and df2 tp on the other axis.
For example,
If the date/time on both dataframes = 2013-12-31 19:00:00, then plot tp for df1 onto x-axis, and tp for df2 on the y-axis.
To solve this, I tried using the following:
df1['dates_match'] = np.where(df1['time'] == df2['time'], 'True', 'False')
which will tell me if the dates match, and if they do I can plot. The problem arises because the two dataframes have different numbers of rows, and most methods only allow comparison of dataframes with exactly the same number of rows.
Does anyone know of an alternative method I could use to plot the graph?
Thanks in advance!
The main goal is to plot two time series that apparently don't share the same frequency, so that they can be compared.
Since the main issue here is the differing timestamps, let's tackle that with pandas resample so that we have more uniform timestamps for each observation. To take the sum over 30-minute intervals you can do the following (feel free to change the interval and the aggregation function):
df1["time"] = pd.to_datetime(df1["time"])  # resample needs a datetime index
df2["time"] = pd.to_datetime(df2["time"])
df1.set_index("time", inplace=True)
df2.set_index("time", inplace=True)
df1_resampled = df1.resample("30T").sum() # taking the sum of 30-minute intervals
df2_resampled = df2.resample("30T").sum() # taking the sum of 30-minute intervals
Now that the timestamps are more organized, you can join the resampled dataframes and plot the result:
df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")
df_joined.plot(marker="o", figsize=(12,6))
# df_joined.plot(subplots=True) if you want to plot them separately
Since df1 starts on 2013-01-01 and df2 on 2013-07-18, there will be an initial period where only df1 has data. If you want to plot only the overlapping period, pass how="inner" when joining the two dataframes.
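For the scatter plot the question actually asks for, a minimal sketch on top of the joined frame (the column names tp_1 and tp_2 come from the suffixes in the join above):
df_joined.plot.scatter(x="tp_1", y="tp_2", figsize=(8, 8))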

Converting time series data given in timedelta64[ns] to datetime64[ns]

Suppose I'm given a pandas dataframe that is indexed with timedelta64[ns].
A B C D E
0 days 00:00:00 0.642973 -0.041259 253.377516 0.0
0 days 00:15:00 0.647493 -0.041230 253.309167 0.0
0 days 00:30:00 0.723258 -0.063110 253.416138 0.0
0 days 00:45:00 0.739604 -0.070342 253.305809 0.0
0 days 01:00:00 0.643327 -0.041131 252.967084 0.0
... ... ... ... ...
364 days 22:45:00 0.650392 -0.064805 249.658052 0.0
364 days 23:00:00 0.652765 -0.064821 249.243891 0.0
364 days 23:15:00 0.607198 -0.103190 249.553821 0.0
364 days 23:30:00 0.597602 -0.107975 249.687942 0.0
364 days 23:45:00 0.595224 -0.110376 250.059530 0.0
There does not appear to be any "permitted" way of converting the index to datetimes. Basic operations to convert the index such as:
df.index = pd.DatetimeIndex(df.index)
Or:
test_df.time = pd.to_datetime(test_df.index,format='%Y%m%d%H%M')
Both yield:
TypeError: dtype timedelta64[ns] cannot be converted to datetime64[ns]
Is there any permitted way to do this operation, other than completely reformatting all of these (very numerous) datasets manually? The data is yearly, with 15-minute intervals.
Your issue is that you cannot convert a timedelta object to a datetime object, because the former is the difference between two datetimes. Based on your question it sounds like all these deltas are measured from the same base time, so you need to add that base back in. Example usage below:
In [1]: import datetime
In [2]: now = datetime.datetime.now()
In [3]: delta = datetime.timedelta(minutes=5)
In [4]: print(now, delta + now)
2021-02-22 20:14:37.273444 2021-02-22 20:19:37.273444
You can see above that the second printed datetime is 5 minutes after the now object.
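Applied to the dataframe in the question, a minimal sketch (the base timestamp 2013-01-01 is an assumption here; use whatever start date your dataset actually represents):
import pandas as pd
base = pd.Timestamp('2013-01-01')  # assumed start of the yearly dataset
df.index = base + df.index         # Timestamp + TimedeltaIndex -> DatetimeIndex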

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
IS there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first method of pandas Series. It keeps the values of the calling series and fills the index entries missing from it with values from the other series.
The following code shows a minimal example:
import pandas as pd
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
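If you would rather discard the non-overlapping data, as the question asks, a small sketch (assuming both series have datetime indices) restricts each series to the timestamps they share:
common = ts1.index.intersection(ts2.index)
ts1_matched = ts1.loc[common]
ts2_matched = ts2.loc[common]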

probability of the price having n positive consecutive days

This table is a pandas dataframe. Can someone help me write a function that shows the probability of the price being up for 5 consecutive days, based on the past 1000 days? Then I will know the probability of the price being up tomorrow when the past 4 days' prices have been increasing.
Any help is appreciated.
import ccxt
import pandas as pd

binance = ccxt.binance()

def get_price(pair):
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=1000)  # limit = 30
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "volume"})
    df['date'] = pd.to_datetime(df['date'], unit='ms') + pd.Timedelta(hours=8)
    df.set_index("date", inplace=True)
    return df

df = get_price("BTC/USDT")
df["daily_return"] = df.close.pct_change()
A quick note: based on context I think you're after the empirical probability, in which case this is a simple one-liner using pandas rolling. If that's not the case, you probably need to explain what you mean by "probability", or describe in words what you're after.
df["probability_5"] = (df["daily_return"] > 0).rolling(5).mean()
df[["daily_return", "probability_5"]].head(15)
Output:
daily_return probability_5
date
2017-08-17 08:00:00 NaN NaN
2017-08-18 08:00:00 -0.041238 NaN
2017-08-19 08:00:00 0.007694 NaN
2017-08-20 08:00:00 -0.012969 NaN
2017-08-21 08:00:00 -0.017201 0.2
2017-08-22 08:00:00 0.005976 0.4
2017-08-23 08:00:00 0.018319 0.6
2017-08-24 08:00:00 0.049101 0.6
2017-08-25 08:00:00 -0.008186 0.6
2017-08-26 08:00:00 0.013260 0.8
2017-08-27 08:00:00 -0.006324 0.6
2017-08-28 08:00:00 0.017791 0.6
2017-08-29 08:00:00 0.045773 0.6
2017-08-30 08:00:00 -0.007050 0.6
2017-08-31 08:00:00 0.037266 0.6
Just to frame the question properly, I believe you are trying to calculate the relative frequency of n consecutive positive (or negative) days in a price series.
Some research:
https://medium.com/@mikeharrisny/probability-in-trading-and-finance-96344108e1d9
Please see my implementation below, using a pandas DataFrame:
import pandas as pd

random_prices = [100, 90, 95, 98, 99, 98, 97, 100, 99, 98]
df = pd.DataFrame(random_prices, columns=['price'])

def consecutive_days_proba(pandas_series, n, positive=True):
    # Transform to daily returns
    daily_return = pandas_series.pct_change()
    # Drop the leading NA value; this shortens the series by one
    daily_return.dropna(inplace=True)
    # Count the total number of days in the new series
    total_days = len(daily_return)
    if positive:
        # count the number of n-day runs with positive returns
        consecutive_n = ((daily_return > 0).rolling(n).sum() == n).sum()
    else:
        # count the number of n-day runs with negative returns
        consecutive_n = ((daily_return < 0).rolling(n).sum() == n).sum()
    return ((consecutive_n / total_days) * 100).round(2)

consecutive_days_proba(df['price'], n=3, positive=True)
So this returns 11.11%, which is 1/9. Although the original series has a length of 10, I don't think it makes sense to count the dropped NaN day as part of the base.
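To answer the conditional question the asker actually poses (how likely is an up day tomorrow, given four up days in a row?), here is a hedged sketch on top of the get_price dataframe from the question:
up = (df["daily_return"] > 0).astype(int)
four_up = up.rolling(4).sum() == 4           # past 4 days all positive
next_up = up.shift(-1)                       # tomorrow's outcome (NaN on the last row)
p_up_given_4_up = next_up[four_up].mean()    # empirical conditional probability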

Perform calculations on dataframe column based on index value

I have to normalize the values of one dataframe column, Allocation, month by month.
data=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.001905 9.55 0.0 0.0
2018-11-01 00:15:00 0.001794 9.55 0.0 0.0
2018-11-01 00:30:00 0.001700 9.55 0.0 0.0
2018-11-01 00:45:00 0.001607 9.55 0.0 0.0
This means: in 2018-11, divide Allocation by 11.116, while in 2018-12, divide Allocation by 2473.65, and so on... (These values come from a list Volume, where Volume[0] corresponds to 2018-11 and Volume[7] corresponds to 2019-06.)
Date_From is the index and a timestamp.
data_normalized=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.000171 9.55 0.0 0.0
2018-11-01 00:15:00 0.000097 9.55 0.0 0.0
...
My approach was the use of itertuples:
for row in data.itertuples(index=True, name='index'):
    if row.index == '2018-11':
        data['Allocation'] / Volume[0]
Here, the if statement is never true...
Another approach was
if ((row.index >='2018-11-01 00:00:00') & (row.index<='2018-11-31 23:45:00')):
However, here I get the error TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'str'
Can I solve my problem with this approach, or should I use a different one? I'm happy about any help.
Cheers!
Maybe you can put your list Volume in a dataframe where the date (or index) is the first day of every month.
import pandas as pd
import numpy as np

N = 16
date = pd.date_range(start='2018-01-01', periods=N, freq="15d")
df = pd.DataFrame({"date": date, "Allocation": np.random.randn(N)})

# A dataframe that associates a volume with every month
df_vol = pd.DataFrame({"month": pd.date_range(start="2018-01-01", periods=8, freq="MS"),
                       "Volume": np.arange(8) + 1})

# truncate every date to the beginning of its month
df["month"] = df["date"].dt.to_period("M").dt.to_timestamp()

# merge
df1 = pd.merge(df, df_vol, on="month", how="left")

# Divide Allocation by Volume.
# This is now vectorized, since every date was merged with the matching volume.
df1["norm"] = df1["Allocation"] / df1["Volume"]
