I have something like the following dataframe:
d=pd.DataFrame()
d['id']=['a','a','a','b','b','c']
d['version_start']=['2017-01-01','2017-02-12','2017-03-25','2017-01-01','2017-6-15','2017-01-22']
d['version_end']=['2017-02-11','2017-03-24','2017-08-01','2017-06-14','2018-01-01','2018-01-01']
d['version_start']=pd.to_datetime(d.version_start)
d['version_end']=pd.to_datetime(d.version_end)
d['values']=[10,15,20,5,6,200]
print(d)
id version_start version_end values
0 a 2017-01-01 2017-02-11 10
1 a 2017-02-12 2017-03-24 15
2 a 2017-03-25 2017-08-01 20
3 b 2017-01-01 2017-06-14 5
4 b 2017-06-15 2018-01-01 6
5 c 2017-01-22 2018-01-01 200
The version_start and version_end represent, for each ID, the date range over which that row is considered valid. For example, the total value for a given date would be the sum over the records for which that date falls between version_start and version_end.
I am looking to get, for a set of dates (the first of each month in 2017), the sum of the "values" field. I can do this by looping over the months as follows:
df = pd.DataFrame()
for month in pd.date_range('2017-01-01', '2018-01-01', freq='MS'):
    s = d[(d.version_start <= month) & (d.version_end > month)].copy()
    s['month'] = month
    s = s.set_index(['month', 'id'])[['values']]
    df = df.append(s)
print(df.groupby(level='month')['values'].sum())
2017-01-01 15
2017-02-01 215
2017-03-01 220
2017-04-01 225
2017-05-01 225
2017-06-01 225
2017-07-01 226
2017-08-01 206
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
Name: values, dtype: int64
Is there a more elegant / efficient solution that doesn't require looping through this list of dates?
d.version_start = d.version_start + pd.offsets.MonthBegin(0)
d.version_end = d.version_end + pd.offsets.MonthBegin(0)
d['New'] = d[['version_start', 'version_end']].apply(
    lambda x: pd.date_range(start=x.version_start, end=x.version_end, freq='MS').tolist(), axis=1)
(d.set_index(['id', 'version_start', 'version_end', 'values'])
   .New.apply(pd.Series).stack()
   .reset_index('values').groupby(0)['values'].sum())
Out[845]:
0
2017-01-01 15
2017-02-01 215
2017-03-01 230
2017-04-01 240
2017-05-01 225
2017-06-01 225
2017-07-01 231
2017-08-01 226
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
2018-01-01 206
Name: values, dtype: int64
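For what it's worth, the same row-to-months expansion can be sketched with DataFrame.explode (pandas 0.25+). Like the code above, this assumes version_start and version_end have already been rounded forward with MonthBegin(0), so the totals differ from the question's strict version_end > month convention in the same way:
e = d.copy()
e['month'] = e.apply(lambda r: pd.date_range(r.version_start, r.version_end, freq='MS').tolist(), axis=1)
print(e.explode('month').groupby('month')['values'].sum())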
I keep thinking there should be a much more elegant way to do this, but for now:
s = pd.Series(0, index=pd.date_range('2017-01-01', '2018-01-01', freq='MS'))
for _id, start, end, values in d.itertuples(index=False):
    s[start:end] += values
this returns the proper series, and it works with any index of dates for that matter.
If you want the version_end day itself to be excluded (matching the strict d.version_end > month comparison in the question), a quick fix is to step the end dates back by one day before the for cycle:
d.version_end = d.version_end - pd.Timedelta(days=1)
I think the idea of using explicit indexing is cleaner than conditional indexing based on comparisons between dates, which at scale is terribly slow (timestamps are a valid alternative if you are forced to do this on huge arrays).
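If you do need something fully vectorized for the original question (no loop and no per-date comparisons), one sketch is to turn every row into a +value event at the first month on or after version_start and a -value event at the first month on or after version_end, then take a cumulative sum. This assumes d as originally constructed in the question, before any of the in-place adjustments above:
import numpy as np
months = pd.date_range('2017-01-01', '2017-12-01', freq='MS')
# first month >= version_start: where each row starts counting
start_idx = np.searchsorted(months.values, d['version_start'].values, side='left')
# first month >= version_end: where it stops counting (matches the strict version_end > month test)
end_idx = np.searchsorted(months.values, d['version_end'].values, side='left')
delta = np.zeros(len(months) + 1)
np.add.at(delta, start_idx, d['values'].values)
np.add.at(delta, end_idx, -d['values'].values)
print(pd.Series(delta[:-1].cumsum(), index=months))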
I am working on a churn prediction use case and here is part of the dataset (short version):
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2019-05-01
284 327 2019-06-01
285 327 2019-07-01
... ... ...
528 500 2018-01-01
529 500 2018-02-01
The observation period runs, for example, from a start date of 2017-07-01 until 2019-12-01.
First, I have to find all the users whose first date is later than the start date of the observation period (2017-07-01) and then shift all of their rows so that the first row starts at the start of the observation period.
For example, ID 026 is active from the start of the observation period, so no transformation is needed for it.
But IDs 327 and 500 start their activity later than the start of the observation period, so I should shift all of their dates (rows) back accordingly.
After the transformation the dataframe should look like:
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
... ... ...
528 500 2017-07-01
529 500 2017-08-01
IIUC, you can do a groupby.cumcount and offset:
df.Timestamp = pd.to_datetime(df.Timestamp)
df['Timestamp'] = df.Timestamp.min() + pd.DateOffset(months=1) * df.groupby('ID').cumcount()
Output:
ID Timestamp
0 26 2017-07-01
1 26 2017-08-01
2 26 2017-09-01
3 26 2017-10-01
4 26 2017-11-01
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
528 500 2017-07-01
529 500 2017-08-01
This approach sets all the data to consecutive months starting from the minimum date. If you just want to shift the dates, then a groupby().transform('min') would do:
df.Timestamp -= df.groupby('ID')['Timestamp'].transform('min') - df.Timestamp.min()
Try using the min() function to find the lowest date for each ID and work from there.
To get a dictionary of how much you need to shift the timestamps of each ID by:
shift_dict = {}
for id in df.ID.unique():
    shift = min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp'])
    shift_dict[id] = shift
As a dict comprehension:
shift_dict = {id : ( min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp']) ) for id in df.ID.unique()}
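To actually apply the shift (a sketch, assuming as in the example that ID '026' starts at the observation start date, so its own shift is zero):
df['Timestamp'] = df['Timestamp'] - df['ID'].map(shift_dict)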
I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set there are holes in the data. In the current example, there are about 85 days of the year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
This resamples to daily frequency and then fills the gaps. Note that resample needs a DatetimeIndex, so set departure_date as the index first:
dataFrame.set_index('departure_date')['bookings'].resample('D').ffill()
You can find more resampling options on this page (so you can pick the one that best fits your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
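If you would rather follow the plan in the question (missing days as NaN first, then time-based interpolation), a sketch along these lines should work, assuming departure_date is already a datetime column:
s = dataFrame.set_index('departure_date')['bookings'].resample('D').asfreq()  # missing days become NaN
s = s.interpolate(method='time')  # or keep the NaNs and interpolate later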
I have the following pandas dataframe:
count event date
0 1544 'strike' 2016-11-01
1 226 'defense' 2016-11-01
2 1524 'strike' 2016-12-01
3 246 'defense' 2016-12-01
4 1592 'strike' 2017-01-01
5 245 'defense' 2017-01-01
I want to pivot/transform it in such a way the final output looks like this:
event 2016-11-01 2016-12-01 2017-01-01 2017-02-01 2017-03-01
'strike' 1544 1524 1592 1608 1654
'defense' 226 246 245 210 254
but what I'm getting now upon pivoting is this:
count count count count count\
date 2016-11-01 2016-12-01 2017-01-01 2017-02-01 2017-03-01
event
'strike' 1544 1524 1592 1608 1654
'defense' 226 246 245 210 254
Is there any way I could remove the entirely empty row above the event index name, rename the date index name so that event sits there instead, and also remove the unwanted count appearing in the first row of the data frame? The data seems to be transforming correctly; I just want these headers and index names renamed and removed properly. I also don't want the row labels in the desired output.
This is what I've been trying so far:
output = df.pivot(index='event', columns='date')
print(output)
The solution is to add the values parameter to pivot, then add reset_index to turn the index into a column, and rename_axis to remove the columns' axis name:
output = df.pivot(index='event', columns='date', values='count').reset_index().rename_axis(None, axis=1)
print(output)
event 2016-11-01 2016-12-01 2017-01-01
0 'defense' 226 246 245
1 'strike' 1544 1524 1592
What happens if you omit it?
print (df)
count event date count1
0 1544 'strike' 2016-11-01 1
1 226 'defense' 2016-11-01 7
2 1524 'strike' 2016-12-01 8
3 246 'defense' 2016-12-01 3
4 1592 'strike' 2017-01-01 0
5 245 'defense' 2017-01-01 1
pivot uses every column that is not passed as index or columns as a value column, and creates a MultiIndex to distinguish the original columns:
output = df.pivot(index='event', columns='date')
print(output)
count count1
date 2016-11-01 2016-12-01 2017-01-01 2016-11-01 2016-12-01 2017-01-01
event
'defense' 226 246 245 7 3 1
'strike' 1544 1524 1592 1 8 0
I would recommend using the more general version of pd.pivot(), which is pd.pivot_table(), like so:
x = pd.pivot_table(df, index = 'event', columns = 'date', values = 'count')
You will get:
date 01/01/2017 01/11/2016 01/12/2016
event
'defense' 245 226 246
'strike' 1592 1544 1524
Next, you can get rid of the 'date' string by setting:
x.columns.name = ' '
Additionally, if you want to change the order of the events, you might want to set the variable up as a categorical variable, before doing the pivoting:
df.event = df.event.astype('category') # cast to categorical
df.event.cat.set_categories(your_list, inplace = True) # force order
where your_list is the list of your categories, in order.
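For example, with a hypothetical your_list (newer pandas versions have dropped the inplace=True keyword of set_categories, so pd.Categorical is an equivalent way to force the order):
your_list = ["'strike'", "'defense'"]  # hypothetical category order
df['event'] = pd.Categorical(df['event'], categories=your_list, ordered=True)
x = pd.pivot_table(df, index='event', columns='date', values='count')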
Hope this helps.
Assume I have a count of the number of event per hour as follows:
np.random.seed(42)
idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')
df = pd.DataFrame(np.random.choice([1,2,3,4,5,6], size=idx.shape[0]), index=idx, columns=['count'])
df.head()
Out[3]:
count
2017-01-01 00:00:00 4
2017-01-01 01:00:00 5
2017-01-01 02:00:00 3
2017-01-01 03:00:00 5
2017-01-01 04:00:00 5
If I want to know the total number of events per day of the week, I can do either:
df.pivot_table(values='count', index=df.index.dayofweek, aggfunc='sum')
or
df.groupby(df.index.dayofweek).sum()
Both yields:
Out[4]:
count
0 161
1 170
2 164
3 133
4 169
5 98
6 172
However, if I want to compute the average number of events per weekday, the following
df.pivot_table(values='count', index=df.index.dayofweek, aggfunc='mean') # [#1]
is wrong! This approach takes the sum (as computed above) and divides it by the number of hours that fall on each day of the week, i.e. it averages over hourly observations rather than over days.
The workaround I found is:
df_by_day = df.resample('1d').sum()
df_by_day.pivot_table(values='count', index=df_by_day.index.dayofweek, aggfunc='mean')
That is, first resample to days, and then pivot. Somehow the approach in [#1] feels more natural to me. Is there a more pythonic way to achieve what I want? Why is the mean computed incorrectly without resampling first?
Resample first using df.resample and then df.groupby:
df = df.resample('1d').sum()
print(df)
count
2017-01-01 92
2017-01-02 86
2017-01-03 86
2017-01-04 90
2017-01-05 64
2017-01-06 82
2017-01-07 97
2017-01-08 80
2017-01-09 75
2017-01-10 84
2017-01-11 74
2017-01-12 69
2017-01-13 87
2017-01-14 1
out = df.groupby(df.index.dayofweek)['count'].mean()
print(out)
0 80.5
1 85.0
2 82.0
3 66.5
4 84.5
5 49.0
6 86.0
Name: count, dtype: float64
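As a small readability tweak on the answer above, you can group by weekday names instead of 0-6 (a sketch using DatetimeIndex.day_name(); note the result comes back sorted alphabetically by name):
daily = df.resample('D').sum()  # same as the resample step above
print(daily.groupby(daily.index.day_name())['count'].mean())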
I would like to get the 07h00 value every day, from a multiday DataFrame that has 24 hours of minute data in it each day.
import numpy as np
import pandas as pd
aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods = 10000, freq = "1min")
aframe.head()
Out[174]:
0 1
2015-09-01 00:00:00 0 0
2015-09-01 00:01:00 1 2
2015-09-01 00:02:00 2 4
2015-09-01 00:03:00 3 6
2015-09-01 00:04:00 4 8
aframe.tail()
Out[175]:
0 1
2015-09-07 22:35:00 9995 19990
2015-09-07 22:36:00 9996 19992
2015-09-07 22:37:00 9997 19994
2015-09-07 22:38:00 9998 19996
2015-09-07 22:39:00 9999 19998
In this 10 000 row DataFrame spanning 7 days, how would I get the 7am value each day as efficiently as possible? Assume I might have to do this for very large tick databases so I value speed and low memory usage highly.
I know I can index with strings such as:
aframe.loc["2015-09-02 07:00:00"]
Out[176]:
0 1860
1 3720
Name: 2015-09-02 07:00:00, dtype: int64
But what I need is basically a wildcard-style query, for example:
aframe.loc["* 07:00:00"]
You can use indexer_at_time:
>>> locs = aframe.index.indexer_at_time('7:00:00')
>>> aframe.iloc[locs]
0 1
2015-09-01 07:00:00 420 840
2015-09-02 07:00:00 1860 3720
2015-09-03 07:00:00 3300 6600
2015-09-04 07:00:00 4740 9480
2015-09-05 07:00:00 6180 12360
2015-09-06 07:00:00 7620 15240
2015-09-07 07:00:00 9060 18120
There's also indexer_between_time if you need to select all indices that lie between two particular times of day.
Both of these methods return the integer locations of the desired values; the corresponding rows of the Series or DataFrame can be fetched with iloc, as shown above.
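For reference, a minimal sketch of the between-time variant, plus the at_time/between_time convenience methods that return the rows directly instead of integer locations:
locs = aframe.index.indexer_between_time('07:00', '07:30')  # inclusive of both ends by default
aframe.iloc[locs]
aframe.at_time('07:00')               # same rows as the indexer_at_time example above
aframe.between_time('07:00', '07:30')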