I have two high frequency time series covering 3 months' worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of a pandas Series. Where both series share an index value, it keeps the element of the calling object.
The following code shows a minimal example:
import pandas as pd

# two overlapping hourly ranges
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the following content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use Series.combine.
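Series.combine takes an element-wise function and applies it wherever the two series align; a minimal sketch (my own illustration, keeping the larger of the two values and falling back to 0 where one series has no entry):
# keep the element-wise maximum across the two series;
# fill_value handles the non-overlapping timestamps
ts1.combine(ts2, max, fill_value=0)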
I have one year's worth of data at four-minute intervals. I need to load 24 hours of data at a time and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data between the start and end dates of 2021.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
import pandas as pd

start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame(index=myIndex).reset_index().rename(columns={'index': 'Timestamp'})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
# a Series of the timestamps, indexed by themselves, so we can slice on time
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # all timestamps falling in the 24-hour window starting at e
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
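If the function only needs to run every eight hours rather than at every timestamp, one option (a sketch, assuming the regular 4-minute grid built above, so 120 rows span 8 hours) is to apply it to a strided view of the index:
# evaluate the 24-hour window only at 8-hour offsets (120 * 4 minutes = 8 hours)
eight_hour_starts = s.iloc[::120]
eight_hour_starts.apply(myfunc)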
(Not duplicate / my question is entirely different)
My dataframes look like this:
# [df2] is day based
time time2
2017-01-01, 2017-01-01 00:12:00
2017-01-02, 2017-01-02 03:15:00
2017-01-03, 2017-01-03 01:25:00
2017-01-04, 2017-01-04 04:12:00
2017-01-05, 2017-01-05 00:45:00
....
# [df] is minute based
time value
2017-01-01 00:01:00, 0.1232
2017-01-01 00:02:00, 0.1232
2017-01-01 00:03:00, 0.1232
2017-01-01 00:04:00, 0.1232
2017-01-01 00:05:00, 0.1232
....
I want to create a new column called time_val_min in [df2] that holds the minimum value from [df] within the range specified by df2['time'] and df2['time2'].
What did I do?
I did df2['time_val_min'] = df[df['time'].dt.hour.between(df2['time'], df2['time'])].min() but it does not work.
Could you please let me know how to fix it?
You can merge the two data frames on date and then filter on time:
# create the date from the time column
df['date'] = df['time'].dt.normalize()
# merge
new_df = (df.merge(df2, left_on='date', # left on date
right_on='time', # right on time, if time is purely beginning of days
how='right',
suffixes=['','_y'])
.query('time < time2')
.groupby('date')
['time'].min()
.to_frame(name='time_val_min')
.merge(df2, right_on='time', left_index=True)
)
Output:
time_val_min time time2
0 2017-01-01 00:01:00 2017-01-01 2017-01-01 00:12:00
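As an alternative sketch (my own suggestion, not part of the merge above, and assuming df['time'] is already a datetime column): for a small df2 you can also compute it row by row, which is slower but easier to read:
# for each (time, time2) window in df2, take the earliest df time inside it
df2['time_val_min'] = [
    df.loc[df['time'].between(t, t2), 'time'].min()
    for t, t2 in zip(df2['time'], df2['time2'])
]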
I have a dataframe which contains prices for a security each minute over a long period of time.
I would like to extract a subset of the prices, 1 per day between certain hours.
Here is an example of brute-forcing it (using hourly for brevity):
import datetime
import numpy
import pandas

dates = pandas.date_range('20180101', '20180103', freq='H')
prices = pandas.DataFrame(index=dates,
                          data=numpy.random.rand(len(dates)),
                          columns=['price'])
I now have a DateTimeIndex for the hours within each day I want to extract:
start = datetime.datetime(2018,1,1,8)
end = datetime.datetime(2018,1,1,17)
day1 = pandas.date_range(start, end, freq='H')
start = datetime.datetime(2018,1,2,9)
end = datetime.datetime(2018,1,2,13)
day2 = pandas.date_range(start, end, freq='H')
days = [ day1, day2 ]
I can then use prices.index.isin with each of my DateTimeIndexes to extract the relevant day's prices:
daily_prices = [ prices[prices.index.isin(d)] for d in days]
This works as expected:
daily_prices[0]
daily_prices[1]
The problem is that as the length of each selection DateTimeIndex increases, and the number of days I want to extract increases, my list-comprehension slows down to a crawl.
Since I know each selection DateTimeIndex is fully inclusive of the hours it encompasses, I tried using loc and the first and last element of each index in my list comprehension:
daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
Whilst a bit faster, it is still exceptionally slow when the number of days is very large.
Is there a more efficient way to divide up a dataframe into begin and end time ranges like above?
If the hours are consistent from day to day as it seems like they might be, you can just filter the index, which should be pretty fast:
In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
Out[5]:
price
2018-01-01 08:00:00 0.638051
2018-01-01 09:00:00 0.059258
2018-01-01 10:00:00 0.869144
2018-01-01 11:00:00 0.443970
2018-01-01 12:00:00 0.725146
2018-01-01 13:00:00 0.309600
2018-01-01 14:00:00 0.520718
2018-01-01 15:00:00 0.976284
2018-01-01 16:00:00 0.973313
2018-01-01 17:00:00 0.158488
2018-01-02 08:00:00 0.053680
2018-01-02 09:00:00 0.280477
2018-01-02 10:00:00 0.802826
2018-01-02 11:00:00 0.379837
2018-01-02 12:00:00 0.247583
....
EDIT: To your comment, working directly on the index and then doing a single lookup at the end will still probably be fastest even if it's not always consistent from day to day. Single day frames at the end will be easy with a groupby.
For example:
df = prices.loc[[i for i in prices.index
                 if (i.hour in range(8, 18) and i.day in range(1, 11))
                 or (i.hour in range(2, 4) and i.day in range(11, 32))]]
framelist = [frame for _, frame in df.groupby(df.index.date)]
will give you a list of dataframes with 1 day per list element, and will include 8:00-17:00 for the first 10 days each month and 2:00-3:00 for days 11-31.
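As a further aside (my own addition, not from the answer above): when the window really is the same every day, DataFrame.between_time expresses the same hour filter more directly:
# select 08:00-17:00 (inclusive) on every day straight off the index
subset = prices.between_time('08:00', '17:00')
framelist = [frame for _, frame in subset.groupby(subset.index.date)]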
I just started learning pandas. I came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I have understood what the above data means and I tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import date_range, Series, DataFrame
import time

r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way of creating a data frame?
The next step given is to: Return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each random number generated and store only the sets where the abs diff is < 0.5? Can someone explain how I can do that in pandas?
Also I tried to plot the series as a histogram with:
df_new.diff().hist()
The graph displays the x axis as the random number, with the y axis ranging from 0 to 18 (which I don't understand). Can someone explain this to me as well?
To give you some pointers in addition to @Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by #Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only values that differ by less than 0.5 from the preceding value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can simplify this using .dropna() to get rid of the missing values.
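For instance, a small sketch of what I mean by that:
# drop the leading NaN that diff() always produces, then filter as before
clean = df.diff().dropna()
clean[clean['Random Number Generated'].abs() < 0.5]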
The pandas.Series.hist() docs state that the default number of bins is 10, so that's the number of bars you should expect. In this case the distribution turns out to be roughly symmetric around zero, ranging over roughly [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H').agg({'energy_kwh': np.sum,
                            'average_w': np.mean,
                            'norm_average_kw/kw': np.mean,
                            'temperature_degc': np.mean,
                            'voltage_v': np.mean})
df
To get a result like this (please forgive the column formatting; I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV there is a column containing URLs; in the dataset of 100,000 rows there are 3 different URLs (effectively IDs). I want each to be resampled individually rather than having a 'lumped' resample across all of them (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by the URLs, as in this minimal example:
In [157]:
df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H')  # create a random dataset
df.set_index(df['Datetime'], inplace=True)
del df['Datetime']
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location').resample('10D').agg({'Val': 'mean'}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
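Applied to your own frame, the same pattern should carry over roughly like this (a sketch only; 'url' stands in for whatever your URL/ID column is actually called, and the aggregation dict reuses the columns from your question):
# group by the URL/ID column, then resample each group into hourly bins
hourly = (df.groupby('url')
            .resample('1H')
            .agg({'energy_kwh': 'sum',
                  'average_w': 'mean',
                  'norm_average_kw/kw': 'mean',
                  'temperature_degc': 'mean',
                  'voltage_v': 'mean'}))
hourly.head()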