Pivoting pandas with removal of some headers and renaming of some indexes - python

I have the following pandas dataframe:
count event date
0 1544 'strike' 2016-11-01
1 226 'defense' 2016-11-01
2 1524 'strike' 2016-12-01
3 246 'defense' 2016-12-01
4 1592 'strike' 2017-01-01
5 245 'defense' 2017-01-01
I want to pivot/transform it in such a way that the final output looks like this:
event 2016-11-01 2016-12-01 2017-01-01 2017-02-01 2017-03-01
'strike' 1544 1524 1592 1608 1654
'defense' 226 246 245 210 254
but what I'm getting now upon pivoting is this:
count count count count count\
date 2016-11-01 2016-12-01 2017-01-01 2017-02-01 2017-03-01
event
'strike' 1544 1524 1592 1608 1654
'defense' 226 246 245 210 254
Is there any way I could remove the empty row above the event index name, rename the date index name to event, and drop the unwanted count labels in the first row of the dataframe? The data itself is transforming correctly; I just want the headers and index names cleaned up. I also don't want the row labels in the desired output.
This is what I've been trying so far:
output = df.pivot(index='event', columns='date')
print(output)

The solution is to add the values parameter to pivot, then call reset_index to turn the index into a column, and rename_axis to remove the columns' name:
output = df.pivot(index='event', columns='date', values='count').reset_index().rename_axis(None, axis=1)
print(output)
event 2016-11-01 2016-12-01 2017-01-01
0 'defense' 226 246 245
1 'strike' 1544 1524 1592
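The question also asked to drop the row labels; those are just the default RangeIndex left behind by reset_index. If you only need them gone for display, a usage note (not part of the original answer):
print(output.to_string(index=False))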
What happens if you omit values?
print (df)
count event date count1
0 1544 'strike' 2016-11-01 1
1 226 'defense' 2016-11-01 7
2 1524 'strike' 2016-12-01 8
3 246 'defense' 2016-12-01 3
4 1592 'strike' 2017-01-01 0
5 245 'defense' 2017-01-01 1
pivot uses every column that is not otherwise specified and creates a MultiIndex to distinguish the original columns:
output = df.pivot(index='event', columns='date')
print(output)
count count1
date 2016-11-01 2016-12-01 2017-01-01 2016-11-01 2016-12-01 2017-01-01
event
'defense' 226 246 245 7 3 1
'strike' 1544 1524 1592 1 8 0
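If you do end up with such a MultiIndex, selecting one top-level column recovers a single-value pivot; a small sketch against the output above:
# picking the 'count' level drops the MultiIndex, leaving only the dates as columns
print(output['count'])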

I would recommend using the more general version of pd.pivot(), which is pd.pivot_table(), like so:
x = pd.pivot_table(df, index = 'event', columns = 'date', values = 'count')
You will get:
date 01/01/2017 01/11/2016 01/12/2016
event
'defense' 245 226 246
'strike' 1592 1544 1524
Next, you can get rid of the 'date' string by setting:
x.columns.name = ' '
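Note that the column order above ('01/01/2017' before '01/11/2016') suggests the date column here holds strings, which sort lexicographically. If that assumption holds for your data, converting to datetimes before pivoting keeps the columns in chronological order:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # dayfirst is an assumption about the string format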
Additionally, if you want to change the order of the events, you might want to set the variable up as a categorical variable, before doing the pivoting:
df.event = df.event.astype('category')             # cast to categorical
df.event = df.event.cat.set_categories(your_list)  # force order (inplace= was removed in pandas 2.0)
where your_list is the list of your categories, in order.
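A minimal sketch of the whole categorical trick, assuming the event labels literally include the quotes as in the sample data:
your_list = ["'strike'", "'defense'"]              # desired order
df.event = df.event.astype('category')             # cast to categorical
df.event = df.event.cat.set_categories(your_list)  # force order
x = pd.pivot_table(df, index='event', columns='date', values='count')
# 'strike' now comes first, since pivot_table sorts the index by category order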
Hope this helps.

Related

Resampling by Date, how to write a condition on the sum column

I want to resample data by datetime with a three-day frequency and sum the weight column, but if the sum of weight is greater than 1000, stop summing. How can I write this condition? I have already written code to resample the data without any condition.
data1 = data.groupby('buyer_id').resample('3D',on='wh_inbound_sg_time').actual_weight.sum()
Table format:
order_sn|buyer_name|buyer_id|ordersn|whs_code|consignment_no|actual_weight|time
Result:
buyer_id time
19051 2021-08-04 32
2021-08-07 71
2021-08-10 0
2021-08-13 0
2021-08-16 0
2021-08-19 18
2021-08-22 0
2021-08-25 174
2021-08-28 0
2021-08-31 266
2021-09-03 0
2021-09-06 0
2021-09-09 372
2021-09-12 0
2021-09-15 192
2021-09-18 436
2021-09-21 456
64155 2021-09-06 1964
2021-09-09 0
2021-09-12 0
2021-09-15 0
2021-09-18 940
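No answer is included here, but one reading of "stop summing once the total exceeds 1000" can be expressed as a custom aggregation over each 3-day bin; a sketch, assuming the column names from the question:
def capped_sum(weights, cap=1000):
    # accumulate until adding the next weight would push the total past the cap
    total = 0
    for w in weights:
        if total + w > cap:
            break
        total += w
    return total

data1 = (data.groupby('buyer_id')
             .resample('3D', on='wh_inbound_sg_time')
             .actual_weight
             .agg(capped_sum))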

How can I give a column the same number every 7 times in a dataframe?

In the last column, 'ww', I want to put the same 1 from 1-21 to 1-27, the same 2 from 1-28 to 2-3, 3 for the next 7 days, and so on. In short, I want to put a number that increases every 7 days, but I am not sure how to write the code.
date people ww
0 2020-01-21 0
1 2020-01-22 0
2 2020-01-23 0
3 2020-01-24 1
4 2020-01-25 0
... ... ...
616 2021-09-28 2289
617 2021-09-29 2883
618 2021-09-30 2564
619 2021-10-01 2484
620 2021-10-02 2247
Since you have daily data, you can do this with simple math:
df["ww"] = (df["date"]-df["date"].min()).dt.days//7+1
>>> df
date ww
0 2021-01-21 1
1 2021-01-22 1
2 2021-01-23 1
3 2021-01-24 1
4 2021-01-25 1
.. ... ..
250 2021-09-28 36
251 2021-09-29 36
252 2021-09-30 37
253 2021-10-01 37
254 2021-10-02 37
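One caveat: the integer division only works when the date column has a datetime dtype; if it was read in as strings, convert it first. A minimal sketch:
import pandas as pd

df["date"] = pd.to_datetime(df["date"])                      # .dt.days needs datetime dtype
df["ww"] = (df["date"] - df["date"].min()).dt.days // 7 + 1  # week number, starting at 1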

How to shift timestamp for the rows in dataframe on the specific date

I am working on a churn prediction use case, and here is a part of the dataset (short version):
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2019-05-01
284 327 2019-06-01
285 327 2019-07-01
... ... ...
528 500 2018-01-01
529 500 2018-02-01
The observation period runs, for example, from the start date 2017-07-01 until 2019-12-01.
First, I have to find all the users whose first date is greater than the start of the observation period (2017-07-01), and then shift all of their rows so that the first row starts at the beginning of the observation period.
For example, ID 026 is active from the start of the observation period, so that one is fine - there is no transformation for it.
But IDs 327 and 500 start their activity later than the start of the observation period, so I should shift all of their dates (rows) from that point. After the transformation the dataframe should look like:
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
... ... ...
528 500 2017-07-01
529 500 2017-08-01
IIUC, you can do a groupby.cumcount and offset:
df.Timestamp = pd.to_datetime(df.Timestamp)
df['Timestamp'] = df.Timestamp.min() + pd.DateOffset(months=1) * df.groupby('ID').cumcount()
Output:
ID Timestamp
0 26 2017-07-01
1 26 2017-08-01
2 26 2017-09-01
3 26 2017-10-01
4 26 2017-11-01
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
528 500 2017-07-01
529 500 2017-08-01
This approach sets all data to continuous months starting from the min date. If you just want to shift the dates, then a groupby().transform('min') would do:
df.Timestamp -= df.groupby('ID')['Timestamp'].transform('min') - df.Timestamp.min()
Try using the min() function to find the lowest date for each ID and work from there.
To get a dictionary of how much you need to shift the timestamps for each ID:
shift_dict = {}
for id in df.ID.unique():
    shift = min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp'])
    shift_dict[id] = shift
As a dict comprehension:
shift_dict = {id: (min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp'])) for id in df.ID.unique()}
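The snippet above stops at building the dictionary; to actually apply the shift, one hedged follow-up (assuming Timestamp is already a datetime column) is to map each ID to its offset and subtract:
# subtract each ID's offset so its first row lands on the observation start
df['Timestamp'] = df['Timestamp'] - df['ID'].map(shift_dict)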

Customised start and end date of the month

I have a data frame which contains dates and values, and I have to compute the sum of the values for each month,
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
But the problem is that in my data set a month starts on the 21st and ends on the 20th. Is there any way to tell pandas to group months from the 21st to the 20th?
Assume my data frame's starting and ending dates are:
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
So far I have tried:
starting_date = df['Date'].min()
ending_date = df['Date'].max()
month_wise_sum = []
while starting_date <= ending_date:
    temp = starting_date + datetime.timedelta(days=31)
    e_y = temp.year
    e_m = temp.month
    e_d = 20
    temp = datetime.datetime(e_y, e_m, e_d)
    month_wise_sum.append(df[df['Date'].between(starting_date, temp)]['Value'].sum())
    starting_date = temp + datetime.timedelta(days=1)
print(month_wise_sum)
My code above does the job, but I'm still waiting for a more pythonic way to achieve it.
My biggest problem is slicing the data frame month-wise, for example:
2015-11-21 to 2015-12-20
Is there any pythonic way to achieve this?
Thanks in advance.
For example, consider this as my dataframe. It contains dates from date_range(datetime.datetime(2017, 1, 21), datetime.datetime(2017, 10, 20)):
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
Iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
Iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
So that I could calculate each month's sum of the Value series:
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope I explained it thoroughly.
For the purpose of testing my solution, I generated some random data at a daily frequency, but it should work for any frequency.
index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here you see that I passed an array of datetimes as the index. Indexing with dates allows for a lot of added functionality in pandas. With your data you should do (if the Date column already contains only datetime values):
df = df.set_index('Date')
Then I would artificially realign your data by subtracting 20 days from the index:
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample the data to a monthly index, summing all data in the same month:
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month (for me, something like this):
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
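Putting those steps together, a minimal end-to-end sketch using the same synthetic data:
import numpy as np
import pandas as pd
from datetime import timedelta

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})

df.index -= timedelta(days=20)    # shift so each 21st-to-20th window falls inside one calendar month
monthly = df.resample('M').sum()  # one row per shifted "month"
print(monthly)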
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np
start_date = "2015-11-21"
# As #ALollz mentioned, the month with the original end_date='2017-11-20' was missing.
# since pd.date_range() only generates dates in the specified range (between start= and end=),
# '2017-11-31'(using freq='M') exceeds the original end='2017-11-20' and thus is cut off.
# the similar situation applies also to start_date (using freq="MS") when start_month might be cut off
# easy fix is just to extend the end_date to a date in the next month or use
# the end-date of its own month '2017-11-30', or replace end= to periods=25
end_date = "2017-12-20"
# create a testing dataframe
df = pd.DataFrame({ "date": pd.date_range(start_date, periods=710, freq='D'), "value": np.random.randn(710)})
# set up bins to include all dates to create expected date ranges
bins = [ d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M") ]
# group and summary using the ranges from the above bins
df.groupby(pd.cut(df.date, bins)).sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()
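If you prefer plain dates over interval labels, the right edge of each interval (the closing 20th) is available on the result's index; a hedged follow-up:
out = df.groupby(pd.cut(df.date, bins)).value.sum()
out.index = [iv.right for iv in out.index]  # label each bin by its closing 20th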

Pandas get unique monthly data based on date range

I have something like the following dataframe:
d = pd.DataFrame()
d['id'] = ['a','a','a','b','b','c']
d['version_start'] = ['2017-01-01','2017-02-12','2017-03-25','2017-01-01','2017-06-15','2017-01-22']
d['version_end'] = ['2017-02-11','2017-03-24','2017-08-01','2017-06-14','2018-01-01','2018-01-01']
d['version_start'] = pd.to_datetime(d.version_start)
d['version_end'] = pd.to_datetime(d.version_end)
d['values'] = [10,15,20,5,6,200]
print(d)
id version_start version_end values
0 a 2017-01-01 2017-02-11 10
1 a 2017-02-12 2017-03-24 15
2 a 2017-03-25 2017-08-01 20
3 b 2017-01-01 2017-06-14 5
4 b 2017-06-15 2018-01-01 6
5 c 2017-01-22 2018-01-01 200
The version_start and version_end represent, for each ID, the date range for which that row is considered valid. For example, the total values for a given date would come from the records for which that date falls between version_start and version_end.
I am looking to get, for a set of dates (the first of each month in 2017), the sum of the "values" field. I can do this by looping through each month as follows:
df = pd.DataFrame()
for month in pd.date_range('2017-01-01','2018-01-01',freq='MS'):
    s = d[(d.version_start <= month) & (d.version_end > month)]
    s['month'] = month
    s = s.set_index(['month','id'])[['values']]
    df = df.append(s)
print(df.groupby(level='month')['values'].sum())
2017-01-01 15
2017-02-01 215
2017-03-01 220
2017-04-01 225
2017-05-01 225
2017-06-01 225
2017-07-01 226
2017-08-01 206
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
Name: values, dtype: int64
Is there a more elegant / efficient solution that doesn't require looping through this list of dates?
d.version_start = d.version_start + pd.offsets.MonthBegin(0)
d.version_end = d.version_end + pd.offsets.MonthBegin(0)
d['New'] = d[['version_start','version_end']].apply(lambda x: pd.date_range(start=x.version_start, end=x.version_end, freq='MS').tolist(), axis=1)
d.set_index(['id','version_start','version_end','values']).New.apply(pd.Series).stack().reset_index('values').groupby(0)['values'].sum()
Out[845]:
0
2017-01-01 15
2017-02-01 215
2017-03-01 230
2017-04-01 240
2017-05-01 225
2017-06-01 225
2017-07-01 231
2017-08-01 226
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
2018-01-01 206
Name: values, dtype: int64
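On newer pandas (>= 0.25) the apply(pd.Series).stack() dance can be replaced by DataFrame.explode; a sketch, reusing the New column of month lists built above:
out = d.explode('New').groupby('New')['values'].sum()
print(out)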
I keep thinking there should be a way more elegant way to do this, but for now:
s = pd.Series(0, index=pd.date_range('2017-01-01','2018-01-01',freq='MS'))
for _id, start, end, values in d.itertuples(index=False):
    s[start:end] += values
this returns the proper series, and works with any series for that matter.
If you want the version_end day to be excluded, a quick fix is to add this line before the for loop (it only works if you are using 'MS' as the frequency):
d.version_end = d.version_end.apply(lambda t: t.replace(day=2))
I think the idea of using explicit indexing is cleaner than conditional indexing based on comparisons between dates, which at scale is terribly slow (timestamps are a valid alternative if you are forced to do this on huge arrays).
