How to assign a sequential label to pandas groupby? - python

I start with the following pandas dataframe. I wish to group the rows by day and make a new column called 'label' that labels each group with a sequential number. How do I do this?
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
# df['label'] = df.groupby(pd.TimeGrouper('D')) # what do i do here???
print(df)
output:
val
2016-01-01 00:00:00 10
2016-01-01 12:00:00 40
2016-01-02 00:00:00 30
2016-01-02 12:00:00 10
2016-01-03 00:00:00 11
2016-01-03 12:00:00 13
desired output:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3

Try this:
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
If you just want to group by date:
df['label'] = df.groupby(df.index.date).grouper.group_info[0] + 1
print(df)
To group by time more generally, you can use TimeGrouper:
df['label'] = df.groupby(pd.TimeGrouper('D')).grouper.group_info[0] + 1
print(df)
Both of the above should give you the following:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3
I think this is undocumented (or hard to find, at least). Check out:
Get group id back into pandas dataframe
for more discussion.
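Note that pd.TimeGrouper was later deprecated in favor of pd.Grouper, and .grouper.group_info reaches into pandas internals. On pandas 0.20.2 or newer, the documented GroupBy.ngroup() method should give the same labels; a sketch:
df['label'] = df.groupby(pd.Grouper(freq='D')).ngroup() + 1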

Maybe a simpler and more intuitive approach is this:
df['label'] = df.groupby(df.index.day).keys
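A caveat: .keys here is just the day-of-month array, so it only matches the desired labels because the data starts on the 1st and stays within a single month. A variant that stays sequential across month boundaries, sketched with pd.factorize:
df['label'] = pd.factorize(df.index.date)[0] + 1  # codes are 0-based, hence the + 1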

Related

Python pandas modify dataframe according to date and column adding hours

I have the following dataframe:
;h0;h1;h2;h3;h4;h5;h6;h7;h8;h9;h10;h11;h12;h13;h14;h15;h16;h17;h18;h19;h20;h21;h22;h23
2017-01-01;52.72248155184351;49.2949899678983;46.57492391198069;44.087373768731766;44.14801243124734;42.17606224526609;43.18529986793594;39.58391124876044;41.63499969987035;41.40594457169249;47.58107920806581;46.56963630932529;47.377935483897694;37.99479190229543;38.53347417483357;40.62674178535282;45.81503347748674;49.0590694393733;52.73183568074295;54.37213882189341;54.737087166843295;50.224872755157314;47.874441844531056;47.8848916244788
2017-01-02;49.08874087825248;44.998912615866075;45.92457207636786;42.38001388673675;41.66922093408655;43.02027406525752;49.82151473221541;53.23401784350719;58.33805556091773;56.197239473200206;55.7686948361035;57.03099874898539;55.445563603040405;54.929102019056195;55.85170734639889;57.98929007227575;56.65821961018764;61.01309728212006;63.63384537162659;61.730431501017684;54.40180394585544;50.27375006416599;51.229656340500156;51.22066846069472
2017-01-03;50.07885876956572;47.00180020415448;44.47243045246001;42.62192562660052;40.15465704760352;43.48422695796396;50.01631022884173;54.8674584250141;60.434849010428685;61.47694796693493;60.766557330286844;59.12019178422993;53.97447369962696;51.85242030255539;53.604945764469065;56.48188852869667;59.12301823257856;72.05688032286155;74.61342126987793;70.76845988290785;64.13311592022278;58.7237387203283;55.2422389373486;52.63648285910918
As you can see, the rows are the days and the columns are the hours.
I would like to create a new dataframe with only two columns:
the first holding the day together with the hour, and the second holding the value. Something like the following:
2017-01-01 00:00:00 ; 52.72248
2017-01-01 01:00:00 ; 49.2949899678983
...
I could create a new dataframe and use a loop to fill it. This is what I do now:
icount = 0
for idd in range(0, 365):
    for ih in range(0, 24):
        df.loc[df.index.values[icount]] = ecodf.iloc[idd, ih]
        icount = icount + 1
What do you think?
Thanks
Turn the column names into a new column, convert them to hours, and use pd.to_datetime:
s = df.stack()
pd.concat([
    pd.to_datetime(s.reset_index()
                   .replace({'level_1': r'h(\d+)'}, {'level_1': r'\1:00'}, regex=True)
                   [['level_0', 'level_1']]
                   .apply(' '.join, axis=1)),
    s.reset_index(drop=True)],
    axis=1, sort=False)
0 1
0 2017-01-01 00:00:00 52.722482
1 2017-01-01 01:00:00 49.294990
2 2017-01-01 02:00:00 46.574924
3 2017-01-01 03:00:00 44.087374
4 2017-01-01 04:00:00 44.148012
.. ... ...
67 2017-01-03 19:00:00 70.768460
68 2017-01-03 20:00:00 64.133116
69 2017-01-03 21:00:00 58.723739
70 2017-01-03 22:00:00 55.242239
71 2017-01-03 23:00:00 52.636483
[72 rows x 2 columns]
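Alternatively, assuming the index holds the date strings and the columns are literally named h0 through h23, you could build the timestamps arithmetically rather than with a regex replacement; a sketch:
s = df.stack()
hours = s.index.get_level_values(1).str.lstrip('h').astype(int)  # 'h13' -> 13
idx = pd.to_datetime(s.index.get_level_values(0)) + pd.to_timedelta(hours, unit='h')
result = pd.Series(s.values, index=idx, name='val')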

How to find the datetime difference between rows in a column, based on the condition?

I have the following pandas DataFrame df:
date time val1
2018-12-31 09:00:00 15
2018-12-31 10:00:00 22
2018-12-31 11:00:00 19
2018-12-31 11:30:00 10
2018-12-31 11:45:00 5
2018-12-31 12:00:00 1
2018-12-31 12:05:00 6
I want to find how many minutes lie between the val1 value that is greater than 20 and the val1 value that is lower than or equal to 5.
In this example, the answer is 1 hour and 45 minutes = 105 minutes.
I know how to check the difference between two datetime values:
(df.from_datetime-df.to_datetime).astype('timedelta64[m]')
But how to slice it over the DataFrame, detecting the proper rows?
UPDATE: taking into consideration that the dates might differ.
Convert the date column to a datetime object and the time column to a timedelta object, then combine them to get a full datetime:
df.time = pd.to_timedelta(df.time)
df.date = pd.to_datetime(df.date)
df['date_time'] = df['date'] + df['time']
df
date time val1 date_time
0 2018-12-31 09:00:00 15 2018-12-31 09:00:00
1 2018-12-31 10:00:00 22 2018-12-31 10:00:00
2 2018-12-31 11:00:00 19 2018-12-31 11:00:00
3 2018-12-31 11:30:00 10 2018-12-31 11:30:00
4 2018-12-31 11:45:00 5 2018-12-31 11:45:00
5 2018-12-31 12:00:00 1 2018-12-31 12:00:00
6 2018-12-31 12:05:00 6 2018-12-31 12:05:00
Now we could use one of these two methods:
1) I love lambdas, and this works directly on the values pulled from the Series.
import numpy as np

subtr = lambda d1, d2: abs(d1 - d2) / np.timedelta64(1, 'm')
d20 = df[df.val1 > 20].date_time.iloc[0]
d5 = df[df.val1 <= 5].date_time.iloc[0]
subtr(d20, d5)
105.0
2) Needs a DataFrame object instead of a Series object, which hinders my aesthetics.
d20 = df[df.val1 > 20][['date_time']].iloc[0]
d5 = df[df.val1 <= 5][['date_time']].iloc[0]
abs(d5 - d20).astype('timedelta64[m]')[0]
105.0
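For this particular question you could also skip the helpers; a short sketch reusing the date_time column built above:
t20 = df.loc[df.val1 > 20, 'date_time'].iloc[0]   # first value above 20
t5 = df.loc[df.val1 <= 5, 'date_time'].iloc[0]    # first value at or below 5
abs(t5 - t20) / pd.Timedelta(minutes=1)
105.0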
So this is my approach:
1) Keep only the rows where val1 is >= 20 or <= 5:
df = pd.DataFrame({'date': ['2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31', '2018-12-31'],
                   'time': ['09:00:00', '10:00:00', '11:00:00', '11:30:00', '11:45:00', '12:00:00', '12:05:00'],
                   'val1': [15, 22, 19, 10, 5, 1, 6]})
df2 = df[(df['val1'] >= 20) | (df['val1'] <= 5)].copy()
Then we will do the following code:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(-1) >= 15,
                           df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]'),
                           np.NaN)
Let me go through this.
np.where works like an if statement: where the first argument is True it returns the second, otherwise the third.
df2['val1'] - df2['val1'].shift(-1) >= 15: since we filtered the df, the difference between a row and the next must be greater than or equal to 15 whenever a high value is followed by a low one.
If it is true:
df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]'): we take the later time and subtract the earlier time.
If not true, we just return np.NaN.
We get a df that looks like the following:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 01:45:00
4 2018-12-31 11:45:00 5 NaT
5 2018-12-31 12:00:00 1 NaT
If you want to put the TimeDiff on the end time you can do the following:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(1) <= -15,
                           df2['time'].astype('datetime64[ns]') - df2['time'].astype('datetime64[ns]').shift(),
                           np.NaN)
and you will get:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 NaT
4 2018-12-31 11:45:00 5 01:45:00
5 2018-12-31 12:00:00 1 NaT

Merge daily values into intraday DataFrame

Suppose I have two DataFrames: intraday which has one row per minute, and daily which has one row per day.
How can I add a column intraday['some_val'] where some_val is taken from the daily['some_val'] row where the intraday.index value (date component) equals the daily.index value (date component)?
Given the following setup,
intraday = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='T'))
daily = pd.DataFrame(index=pd.date_range('2016-01-01', '2016-01-07', freq='D'))
daily['some_val'] = np.arange(daily.shape[0])
you can create a column from the date component of both indices, and merge on that column
daily['date'] = daily.index.date
intraday['date'] = intraday.index.date
daily.merge(intraday)
date some_val
0 2016-01-01 0
1 2016-01-01 0
2 2016-01-01 0
3 2016-01-01 0
4 2016-01-01 0
... ... ...
8636 2016-01-06 5
8637 2016-01-06 5
8638 2016-01-06 5
8639 2016-01-06 5
8640 2016-01-07 6
Alternatively, you can take advantage of automatic index alignment, and use fillna.
intraday['some_val'] = daily['some_val']
intraday.fillna(method='ffill', downcast='infer')
some_val
2016-01-01 00:00:00 0
2016-01-01 00:01:00 0
2016-01-01 00:02:00 0
2016-01-01 00:03:00 0
2016-01-01 00:04:00 0
... ...
2016-01-06 23:56:00 5
2016-01-06 23:57:00 5
2016-01-06 23:58:00 5
2016-01-06 23:59:00 5
2016-01-07 00:00:00 6
Note that this only works if the time component of your daily index is 00:00.
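The same forward-fill can also be written with reindex, which avoids the intermediate NaN column; a sketch under the same setup:
intraday['some_val'] = daily['some_val'].reindex(intraday.index, method='ffill')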

pandas: merge conditional on time range

I'd like to merge one data frame with another, where the merge is conditional on the date/time falling in a particular range.
For example, let's say I have the following two data frames.
import pandas as pd
import datetime
# Create main data frame.
data = pd.DataFrame()
time_seq1 = pd.DataFrame(pd.date_range('1/1/2016', periods=3, freq='H'))
time_seq2 = pd.DataFrame(pd.date_range('1/2/2016', periods=3, freq='H'))
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq2, ignore_index=True)
data['myID'] = ['001','001','001','002','002','002','003','003','003','004','004','004']
data.columns = ['Timestamp', 'myID']
# Create second data frame.
data2 = pd.DataFrame()
data2['time'] = [pd.to_datetime('1/1/2016 12:06 AM'), pd.to_datetime('1/1/2016 1:34 AM'), pd.to_datetime('1/2/2016 12:25 AM')]
data2['myID'] = ['002', '003', '004']
data2['specialID'] = ['foo_0', 'foo_1', 'foo_2']
# Show data frames.
data
Timestamp myID
0 2016-01-01 00:00:00 001
1 2016-01-01 01:00:00 001
2 2016-01-01 02:00:00 001
3 2016-01-01 00:00:00 002
4 2016-01-01 01:00:00 002
5 2016-01-01 02:00:00 002
6 2016-01-01 00:00:00 003
7 2016-01-01 01:00:00 003
8 2016-01-01 02:00:00 003
9 2016-01-02 00:00:00 004
10 2016-01-02 01:00:00 004
11 2016-01-02 02:00:00 004
data2
time myID specialID
0 2016-01-01 00:06:00 002 foo_0
1 2016-01-01 01:34:00 003 foo_1
2 2016-01-02 00:25:00 004 foo_2
I would like to construct the following output.
# Desired output.
Timestamp myID special_ID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
In particular, I want to merge special_ID into data such that Timestamp is the first time occurring after the value of time. For example, foo_0 would be in the row corresponding to 2016-01-01 01:00:00 with myID = 002 since that is the next time in data immediately following 2016-01-01 00:06:00 (the time of special_ID = foo_0) among the rows containing myID = 002.
Note, Timestamp is not the index of data and time is not the index of data2. Most other related posts seem to rely on using the datetime object as the index of the data frame.
You can use merge_asof, which is new in Pandas 0.19, to do most of the work. Then, combine loc and duplicated to remove secondary matches:
import numpy as np

# Data needs to be sorted for merge_asof.
data = data.sort_values(by='Timestamp')
# Perform the merge_asof.
df = pd.merge_asof(data, data2, left_on='Timestamp', right_on='time', by='myID').drop('time', axis=1)
# Make the additional matches null.
df.loc[df['specialID'].duplicated(), 'specialID'] = np.nan
# Get the original ordering.
df = df.set_index(data.index).sort_index()
The resulting output:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
Not very beautiful, but I think it works.
data['specialID'] = None
foolist = list(data2['myID'])
for i in data.index:
    if data.myID[i] in foolist:
        if data.Timestamp[i] > list(data2[data2['myID'] == data.myID[i]].time)[0]:
            data.loc[i, 'specialID'] = list(data2[data2['myID'] == data.myID[i]].specialID)[0]
            foolist.remove(list(data2[data2['myID'] == data.myID[i]].myID)[0])
In [95]: data
Out[95]:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 None
1 2016-01-01 01:00:00 001 None
2 2016-01-01 02:00:00 001 None
3 2016-01-01 00:00:00 002 None
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 None
6 2016-01-01 00:00:00 003 None
7 2016-01-01 01:00:00 003 None
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 None
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 None

rolling_sum on business day and return new dataframe with date as index

I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I separate 2016-01-02 00:00:00 to 2016-01-03 12:00:00 is that, those two days are weekends.
So here is what I wish to do:
I wish to rolling_sum with window = 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 NaN
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)), but it seems to just pick one row from each day rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)}, index=pd.date_range(datetime(2016, 1, 1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
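If you do not need the calendar handling of bdate_range, a shorter weekday filter would be (a sketch):
df = df[df.index.dayofweek < 5]  # Monday=0 through Friday=4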
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also [here] for the type conversion and [this question] for the business day extraction.
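On modern pandas, where pd.rolling_sum and the how= argument of resample no longer exist, a rough equivalent (a sketch, with the weekend rows already filtered out as above) is:
daily = df.resample('B').sum()   # one row per business day
daily.rolling(window=2).sum()    # 2-business-day rolling sum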
