Pandas.DataFrame.resample inner level of MultiIndex - python

I need to resample a pandas DataFrame with a MultiIndex consisting of two levels. The inner level is a datetime index, which needs to be resampled.
import numpy as np
import pandas as pd
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame(np.random.randint(0, 100, (len(rng), 2)), index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df.set_index(['month', rng], inplace=True)
print(df)
# At this point I need to apply resample. How do I specify the level that I would like to resample?
df = df.resample('M').last() # does not work
# I'm looking for something like this: df = df.resample('M', level=1).last()

Try:
df.groupby('month').resample('M', level=1).last()
Output:
                  sec1  sec2
month date
1     2019-01-31    59    87
2     2019-02-28    70    33
3     2019-03-31    71    38
4     2019-04-30    56    79
Details:
First, group the dataframe on 'month', i.e. level=0 of the index.
Next, use resample with the level parameter, which is meant for resampling a MultiIndex.
The level parameter accepts either a str, the index level name such as 'date' in this case, or the level number.
Lastly, chain an aggregation function such as last.
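For example, the same call can be written with the level name instead of the level number (a sketch against the df built above, whose inner datetime level is named 'date'):
df.groupby(level=0).resample('M', level='date').last()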

Related

How to process data into 20Minute aggregates in Python

I have the following table:
TimeStamp                Name   Marks  Subject
2022-01-01 00:00:02.969  Chris     70  DK
2022-01-01 00:00:04.467  Chris     75  DK
2022-01-01 00:00:05.965  Mark      80  DK
2022-01-01 00:00:08.962  Cuban     60  DK
2022-01-01 00:00:10.461  Cuban     58  DK
I want to aggregate the table into 20-minute windows, computing the min, max, and std values for each Name.
Expected output:
TimeStamp                Subject  Chris_Min  Chris_Max  Chris_STD  Mark_Min  Mark_Max  Mark_STD
2022-01-01 00:00:00.000  DK       70         75
2022-01-01 00:20:00.000  DK       etc        etc
2022-01-01 00:40:00.000  DK       etc        etc
I am having a hard time aggregating the data into the required output.
The aggregation should be dynamic, so that the window can be changed to 10min or 30min.
I tried using bins to do it, but I am not getting the desired results.
Please help.
You could try the following:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
First, set TimeStamp as the index, and group by Name and Subject to get the right chunks to work on.
Then .resample() the groups with the given frequency rule.
Then aggregate the required stats by using .agg() with named aggregation (keyword=(column, function) tuples).
Unstack the first index level (Name) to get it into the columns.
Swap the remaining index levels to get the right order when finally resetting the index.
Result for the given sample:
    TimeStamp Subject   Min              Max               STD
Name                  Chris Cuban Mark Chris Cuban Mark    Chris     Cuban Mark
0  2022-01-01      DK    70    58   80    75    60   80 3.535534  1.414214  NaN
If you want the columns exactly like in your expected output, then you could add the following
result = result[
    list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
to get
TimeStamp Subject Chris_Min Chris_Max ... Cuban_STD Mark_Min Mark_Max Mark_STD
0 2022-01-01 DK 70 75 ... 1.414214 80 80 NaN
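Since the question asks for the window to be adjustable, the whole pipeline can be wrapped in a small function (a sketch; aggregate_marks is a hypothetical helper name, not part of the original answer):
def aggregate_marks(df, rule="20min"):
    # rule can be "10min", "20min", "30min", ...
    return (
        df.set_index("TimeStamp").groupby(["Name", "Subject"])
        .resample(rule)
        .agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
        .unstack(0)
        .swaplevel(0, 1).reset_index()
    )

result = aggregate_marks(df, rule="30min")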
If you're getting the TypeError: aggregate() missing 1 required positional argument... error, it could be that you're working with an older Pandas version that doesn't support named aggregation. You could try the following instead:
rule = "10min"
result = (
    df.set_index("TimeStamp").groupby(["Name", "Subject"])
    .resample(rule)
    .agg({"Marks": ["min", "max", "std"]})
    .droplevel(0, axis=1)
    .unstack(0)
    .swaplevel(0, 1).reset_index()
)
...
Is your table a pandas DataFrame?
If it's a pandas dataframe you can use resample:
# only if timestamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part: you can pass any function to agg, a str for simple
# functions like 'mean', or a list to apply several aggregations at once:
df = df.resample('10Min').agg(['max', 'min'])
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index()
Edit, to get the second table:
# choose aggregation function
agg_functions = ['min', 'max', 'std']
# set_index on time column, resample
resampled_df = df.set_index('TimeStamp').resample('10Min').agg(agg_functions)
# flatten multiindex
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop the TimeStamp index
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
pd.concat([df, resampled_df], axis=1)

Split hourly time-series in pandas DataFrame into specific dates and all other dates

I have a time-series in a pandas DataFrame at hourly frequency:
import pandas as pd
import numpy as np
idx = pd.date_range(freq="h", start="2018-01-01", periods=365*24)
df = pd.DataFrame({'value': np.random.rand(365*24)}, index=idx)
I have a list of dates:
dates = ['2018-03-20', '2018-04-08', '2018-07-14']
I want to end up with two DataFrames: one containing just the data for these dates, and one containing all of the data from the original DataFrame excluding all the data for these dates. In this case, I would have a DataFrame containing three days worth of data (for the days listed in dates), and a DataFrame containing 362 days data (all the data excluding those three days).
What is the best way to do this in pandas?
I can take advantage of nice string-based datetime indexing in pandas to extract each date separately, for example:
df[dates[0]]
and I can use this to put together a DataFrame containing just the specified dates:
to_concat = [df[date] for date in dates]
just_dates = pd.concat(to_concat)
This isn't as 'nice' as it could be, but does the job.
However, I can't work out how to remove those dates from the DataFrame to get the other output that I want. Doing:
df[~dates[0]]
gives a TypeError: bad operand type for unary ~: 'str', and I can't seem to get df.drop to work in this context.
What do you suggest as a nice, Pythonic and 'pandas-like' way to go about this?
Create a boolean mask by numpy.in1d with the dates converted to strings, or by Index.isin for a membership test:
m = np.in1d(df.index.date.astype(str), dates)
m = df.index.to_series().dt.date.astype(str).isin(dates)
Or use DatetimeIndex.strftime for strings:
m = df.index.strftime('%Y-%m-%d').isin(dates)
Another idea is to remove the times with DatetimeIndex.normalize, which keeps a DatetimeIndex in the output:
m = df.index.normalize().isin(dates)
#alternative
#m = df.index.floor('d').isin(dates)
Last, filter by boolean indexing:
df1 = df[m]
And for the second DataFrame, invert the mask with ~:
df2 = df[~m]
print(df1)
value
2018-03-20 00:00:00 0.348010
2018-03-20 01:00:00 0.406394
2018-03-20 02:00:00 0.944569
2018-03-20 03:00:00 0.425583
2018-03-20 04:00:00 0.586190
...
2018-07-14 19:00:00 0.710710
2018-07-14 20:00:00 0.403660
2018-07-14 21:00:00 0.949572
2018-07-14 22:00:00 0.629871
2018-07-14 23:00:00 0.363081
[72 rows x 1 columns]
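As a quick sanity check (a sketch; the counts follow from the sizes stated in the question): df1 should contain 72 rows (3 days × 24 hourly values, matching the output above) and df2 the remaining rows:
assert len(df1) == 72
assert len(df2) == len(df) - 72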
One way to solve this:
df = df.reset_index()
with_date = df[df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##use del with_date.index.name to remove the index name, if required
without_date = df[~df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##with_date
value
index
2018-03-20 00:00:00 0.059623
2018-03-20 01:00:00 0.343513
...
##without_date
value
index
2018-01-01 00:00:00 0.087846
2018-01-01 01:00:00 0.481971
...
Another way to solve this:
Keep your dates in datetime format, for example through a pd.Timestamp:
dates_in_dt_format = [pd.Timestamp(date).date() for date in dates]
Then, keep only the rows where the index's date is not in that group, for example with:
df_without_dates = df.loc[[idx for idx in df.index if idx.date() not in dates_in_dt_format]]
df_with_dates = df.loc[[idx for idx in df.index if idx.date() in dates_in_dt_format]]
or using pandas apply instead of list comprehension:
df_with_dates = df[df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
df_without_dates = df[~df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]

Pandas reindex and interpolate time series efficiently (reindex drops data)

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example
# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data, index)
# I want rounded date-times
desired_index = pandas.date_range("2018-10-08", periods=10, freq="30min")
Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:
# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)
I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following: concatenate the reindexed series with the original before interpolating:
pandas.concat([series, series.reindex(desired_index)]).sort_index().interpolate(method="linear")
This seems very inefficient, concatenating and then sorting the two series. Is there a better way?
The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.
Get an example DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2)
df = (pd.DataFrame()
      .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                         + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
              Value=np.random.randn(337)
              )
      .set_index(['SampleTime'])
      )
Let's see what the data looks like:
df.head()
                        Value
SampleTime
2018-10-01 00:00:03  0.033171
2018-10-01 00:30:03  0.481966
2018-10-01 01:00:01 -0.495496
Get the desired index:
desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')
Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)
                        Value
2018-10-01 00:00:00       NaN
2018-10-01 00:30:00  0.481218
2018-10-01 01:00:00 -0.494952
2018-10-01 01:30:00 -0.103270
As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are a number of ways to deal with this (pad, for example).
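For instance, a minimal sketch (assuming the same df and desired_index as above): since the first desired timestamp lies before the first observation, its NaN cannot be interpolated and can instead be backfilled:
(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
 .bfill()  # fill the leading NaN from the next valid value
)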
My method:
frequency = nyse_trading_dates.rename_axis([None]).index
df = prices.rename_axis([None]).reindex(frequency)
for d in prices.rename_axis([None]).index:
    df.loc[d] = prices.loc[d]
df = df.interpolate(method='linear')
Method 2:
prices = data.loc[~data.index.duplicated(keep='last')]
#prices = data.reset_index()
idx1 = prices.index
idx1 = pd.to_datetime(idx1, errors='coerce')
merged = idx1.union(idx2)  # idx2: the desired target index (defined elsewhere)
s = prices.reindex(merged)
df = s.interpolate(method='linear').dropna(axis=0, how='any')
data = df

Multi-indexing - accessing the last time in every day

New to multiindexing in Pandas. I have data that looks like this
Date        Time      value
2014-01-14  12:00:04  .424
            12:01:12  .342
            12:01:19  .341
            ...
            12:05:49  .23
2014-05-12  ...
            1:02:42   .23
            ...
For now, I want to access the last time for every single date and store the value in some array. I've made a multiindex like this
df = pd.read_csv("df.csv", index_col=0)
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date','Time'])
df = df[~df.index.duplicated(keep='first')]
dates = df.index.get_level_values(0)
So I have the dates saved as an array. I want to iterate through the dates, but I either can't get the syntax right or am accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and direct access (df.loc[dates[i]] or something like that) doesn't work either. Also, the number of time values in each date varies. Is there any way to fix this?
This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):
import pandas as pd
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'], 'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '01:02:42'], 'value': [0.42399999999999999, 0.34200000000000003, 0.34100000000000003, 0.23000000000000001, 0.0, 0.23000000000000001]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index('Time', drop=False)
max_times = df.groupby(level=0)['Time'].max()
print(max_times)
yields
Date
2014-01-14    12:05:49
2014-05-12    01:02:42
Name: Time, dtype: object
If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):
df = pd.DataFrame({'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14', '2014-05-12', '2014-05-12'], 'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49', '01:01:59', '1:02:42'], 'value': [0.42399999999999999, 0.34200000000000003, 0.34100000000000003, 0.23000000000000001, 0.0, 0.23000000000000001]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_timedelta(df['Time'])
df = df.set_index(['Date', 'Time'])
df = df.reset_index()
idx = df.groupby(['Date'])['Time'].idxmax()
print(df.loc[idx])
yields
Date Time value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
I don't see a good way to do this while keeping the MultiIndex.
It is easier to perform the groupby operation before setting the MultiIndex.
Moreover, it is probably preferable to preserve the datetimes as one value instead of splitting it into two parts. Note that given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the Date without making a Date column:
df = pd.DataFrame({'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12', '2014-01-14 12:01:19', '2014-01-14 12:05:49', '2014-05-12 01:01:59', '2014-05-12 01:02:42'], 'value': [0.42399999999999999, 0.34200000000000003, 0.34100000000000003, 0.23000000000000001, 0.0, 0.23000000000000001]})
df['DateTime'] = pd.to_datetime(df['DateTime'])
# df = pd.read_csv('df.csv', parse_dates=[0])
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
result = df.loc[idx]
print(result)
yields
DateTime value
3 2014-01-14 12:05:49 0.23
5 2014-05-12 01:02:42 0.23
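A related pattern (a sketch, not taken from the answers above) is to sort by DateTime and keep the last row per calendar date with groupby(...).tail(1), which avoids idxmax entirely:
last_per_day = (df.sort_values('DateTime')
                  .groupby(df['DateTime'].dt.date)
                  .tail(1))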

How to find out over what N the resample function in pandas did its job?

I use the Python module pandas and its resample function to calculate means of a dataset. I wonder how I can find out over what N the resampling for each day/each month takes place.
In the example given below I calculate means for the three months January, February, and March.
The answer to my question in that case is: N for January = 31, N for February = 29, N for March = 31. Is there a way to get that information about N for more complex data?
import pandas as pd
import numpy as np
#create dates as index
dates = pd.date_range('1/1/2000', periods=91)
index = pd.Index(dates, name='dates')
#create DataFrame df
df = pd.DataFrame(np.random.randn(91, 1), index, columns=['A'])
print(df['A'])
#calculate monthly_mean
monthly_mean = df.resample('M', how='mean')
Thanks in advance.
You could use how='count', IIUC:
>>> df.resample('M', how='count')
2000-01-31 A 31
2000-02-29 A 29
2000-03-31 A 31
dtype: int64
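Note that in modern pandas the how= argument of resample has been removed; the equivalent calls chain the aggregation instead:
monthly_mean = df.resample('M').mean()
monthly_count = df.resample('M').count()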
