Given previous datetime values in a Pandas DataFrame--either as an index or as values in a column--is there a way to "autofill" remaining time increments based on the previous fixed increments?
For example, given:
import pandas as pd
import numpy as np
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:10'),
                         np.nan,
                         np.nan])
I would like to apply a function to yield:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:05  1.0
2013-01-01 09:00:10  2.0
2013-01-01 09:00:15  NaN
2013-01-01 09:00:20  4.0
Here I have missing timestamps for my last two data points; the timesteps are fixed at 5-second increments.
This will be for thousands of rows. While I might reset_index and then create a function to apply to each row, this seems cumbersome. Is there a slick or built-in way to do this that I'm not finding?
Assuming the first index value is a valid datetime and all the values are spaced 5s apart, you could do the following:
df.index = pd.date_range(df.index[0], periods=len(df), freq='5s')
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:05 1.0
2013-01-01 09:00:10 2.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:20 4.0
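If the increment isn't known up front, a variation (just a sketch, assuming the missing entries parsed as NaT and the valid timestamps are evenly spaced) is to infer the frequency from the part of the index that did parse:
valid = df.index.dropna()   # keep only the timestamps that parsed
freq = pd.infer_freq(valid) # '5s' for this data; needs at least 3 valid stamps
df.index = pd.date_range(valid[0], periods=len(df), freq=freq)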
This solution might also work for you; it uses the reset_index() function:
# 'periods=1000' can be changed to 'periods=len(df.index)'
new_dateindex = pd.Series(pd.date_range(start=pd.Timestamp('20130101 09:00:00'),
                                        periods=1000, freq='5S'),
                          name='Date')
df.reset_index().join(new_dateindex, how='right')
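Note that the join leaves Date as an ordinary column. A small follow-up sketch to restore it as the index (drop=True discards the old, partly-NaN index instead of keeping it as a column):
result = df.reset_index(drop=True).join(new_dateindex, how='right').set_index('Date')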
Related
I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of a pandas Series. Where both series contain the same index entry, it selects the element of the calling object; where only the other series has a value, that value is used.
The following code shows a minimal example:
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
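A minimal sketch of combine (the elementwise function here, max, is only an illustration): it applies a binary function over the union of both indexes, with fill_value standing in wherever one side has no value:
s1 = pd.Series([1, 5, 2], index=['a', 'b', 'c'])
s2 = pd.Series([3, 4], index=['a', 'b'])
s1.combine(s2, max, fill_value=0)  # 'c' exists only in s1, so max(2, 0) is taken
# a    3
# b    5
# c    2
# dtype: int64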
I would like to import the following file which contains data in a weekly format (Thursdays only) and convert it to a daily file with the values from Thursday filled out through the next Wednesday skipping Saturday and Sunday.
https://www.aaii.com/files/surveys/sentiment.xls
I can import it:
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
The import succeeds, but that is as far as I can get. Even the simplest resampling fails with:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I tried df['Date'] = pd.to_datetime(df['Date']) and other methods with no incremental success.
Thoughts as to how to get this done?
You can try it like this:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Your Date column has NaN values, so converting it to datetime fails:
>>> df['Date']
0 NaN
1 1987-06-26 00:00:00
2 1987-07-17 00:00:00
3 1987-07-24 00:00:00
4 1987-07-31 00:00:00
So, to convert to datetime you need to use errors='coerce':
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Now your dates are parsed:
>>> df['Date']
0 NaT
1 1987-06-26
2 1987-07-17
3 1987-07-24
4 1987-07-31
5 1987-08-07
6 1987-08-14
7 1987-08-21
Now set the Date column as the index before resampling, as mentioned in the comments:
>>> df.set_index('Date', inplace=True)
>>> df.head()
Bullish Neutral Bearish Total Mov Avg Spread Average +St. Dev. - St. Dev. High Low Close
Date
NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1987-06-26 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 NaN NaN NaN
1987-07-17 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 314.59 307.63 314.59
1987-07-24 0.36 0.50 0.14 1.0 NaN 0.22 0.382642 0.484295 0.280989 311.39 307.81 309.27
1987-07-31 0.26 0.48 0.26 1.0 NaN 0.00 0.382642 0.484295 0.280989 318.66 310.65 318.66
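With the index set, the remaining step is the resample itself. A sketch (dropping the NaT row first, since it would otherwise get in the way): business-day frequency with a forward fill carries each Thursday's values through the following Wednesday while skipping Saturday and Sunday:
df = df[df.index.notna()]         # drop the row whose Date failed to parse
daily = df.resample('B').ffill()  # 'B' = business days (Mon-Fri); ffill repeats Thursday's values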
I think this is the correct answer: it converts to daily, strips non-trading days, and drops Saturdays and Sundays.
import pandas as pd
from pandas.tseries.offsets import BDay
# read the Excel file, use the SENTIMENT sheet, drop the first three rows,
# parse dates to datetime, index on the Date column
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name='SENTIMENT', skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col='Date')
df = df[3:].asfreq('D', method='ffill')     # skip 3 more rows, then expand to daily and fill forward
df = df[df.index.map(BDay().is_on_offset)]  # strip non-trading weekdays (onOffset in older pandas)
df = df[df.index.dayofweek < 5]             # strip Saturdays and Sundays
print(df.head(250))
There may be a more elegant method, but that gets the job done.
I'd like to calculate a rolling moving average for a data set that is time-stamped in ms but is irregular. For a 2-day dataframe, the irregular data set has ~36K records. If I resample into ms bars, I melt the computer, as that becomes 32M bars.
To be clear, consider the following data set taken from the Pandas docs:
(I've changed the NaN to 0)
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])
df.rolling('2s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 1.5
2013-01-01 09:00:05 0.0
2013-01-01 09:00:06 2.0
But the answer I'd like is:
df.rolling('2s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 0.5
2013-01-01 09:00:03 1.5
2013-01-01 09:00:05 1.0
2013-01-01 09:00:06 2.0
This has the entries rolled forward (ffill style) in order to calculate the mean. I'd like to solve this problem without exploding the memory usage and without just going through it sequentially (which I know I can do).
I had thought that something like:
df.rolling('2s', freq='1s').mean()
would work, but it throws an error expecting 7 rows while having only 5 (ValueError: Shape of passed values is (1,5), indices imply (1,7)).
If I resample into another dataframe using a forward fill and then do a rolling mean, it works:
df2 = df.resample('1s').ffill()  # .pad() in older pandas
df2.rolling('2s').mean()
Is there a built in for this? Or do I just iterate through?
I have a CSV file with data. The link is here. The granularity of the time series is 5 minutes for the year 2013. However, values are missing for some timestamps.
I want to create a time series with a 5-minute interval, with a value of zero for the timestamps that are missing.
Please advise how to do this in either pandas or plain Python.
In pandas, you just join on the index:
from io import StringIO
import numpy as np
import pandas
ts1_string = StringIO("""\
V1,V2
01/01/2013 00:05:00,10
01/01/2013 00:10:00,6
01/01/2013 00:15:00,10
01/01/2013 00:25:00,8
01/01/2013 00:30:00,11
01/01/2013 00:35:00,7""")
ts2_string = StringIO("""\
V1,V2
2013-01-01 00:00:00,0
2013-01-01 00:05:00,0
2013-01-01 00:10:00,0
2013-01-01 00:15:00,0
2013-01-01 00:20:00,0
2013-01-01 00:25:00,0""")
ts1 = pandas.read_csv(ts1_string, parse_dates=True, index_col='V1')
ts2 = pandas.read_csv(ts2_string, parse_dates=True, index_col='V1')
# here's where the join happens
# (suffixes deal with overlapping column names;
#  lsuffix labels the left/calling frame, rsuffix the other)
ts_joined = ts1.join(ts2, lsuffix='_ts1', rsuffix='_ts2')
# and finally
print(ts_joined.head())
Which gives:
V2_ts1 V2_ts2
V1
2013-01-01 00:05:00 10 0
2013-01-01 00:10:00 6 0
2013-01-01 00:15:00 10 0
2013-01-01 00:25:00 8 0
2013-01-01 00:30:00 11 NaN
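As an alternative to the join, reindex can produce the zeros for the missing stamps directly. A sketch (the start and end below are placeholders; substitute the actual span of the 2013 file):
full_index = pandas.date_range('2013-01-01 00:05:00', '2013-01-01 00:35:00', freq='5min')
ts_filled = ts1.reindex(full_index, fill_value=0)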
I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H').agg({'energy_kwh': 'sum', 'average_w': 'mean', 'norm_average_kw/kw': 'mean', 'temperature_degc': 'mean', 'voltage_v': 'mean'})
df
To get a result like:
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV there is a column containing URLs; in the dataset of 100,000 rows there are 3 different URLs (effectively IDs). I want each resampled individually rather than one 'lump' resample of everything (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by URL, as in this minimal example:
In [157]:
df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H')  # create random dataset
df.set_index(df['Datetime'], inplace=True)
del df['Datetime']  # idiomatic form of df.__delitem__('Datetime')
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location').resample('10D').agg({'Val': 'mean'}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
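Applied to the question's data, the same pattern with a per-column aggregation would look something like this sketch ('url' is a stand-in for whatever the URL/ID column is actually called):
hourly = df.groupby('url').resample('1H').agg({
    'energy_kwh': 'sum',
    'average_w': 'mean',
    'norm_average_kw/kw': 'mean',
    'temperature_degc': 'mean',
    'voltage_v': 'mean',
})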