Pandas dataframe : Multiple Time/Date columns to single Date index - python

I have a dataframe with a Product as a first column, and then 12 month of sales (one column per month). I'd like to 'pivot' the dataframe to end up with a single date index.
example data :
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, 1000, size=(2,12)), index=['PrinterBlue', 'PrinterBetter'], columns=pd.date_range('1-1', periods=12, freq='M'))
yielding:
>>> df
2014-01-31 2014-02-28 2014-03-31 2014-04-30 2014-05-31 \
PrinterBlue 176 77 89 279 81
PrinterBetter 801 660 349 608 322
2014-06-30 2014-07-31 2014-08-31 2014-09-30 2014-10-31 \
PrinterBlue 286 831 114 996 904
PrinterBetter 994 374 895 586 646
2014-11-30 2014-12-31
PrinterBlue 458 117
PrinterBetter 366 196
Desired result :
Brand Date Sales
PrinterBlue 2014-01-31 176
2014-02-28 77
2014-03-31 89
[...]
2014-11-30 458
2014-12-31 117
PrinterBetter 2014-01-31 801
2014-02-28 660
2014-03-31 349
[...]
2014-11-30 366
2014-12-31 196
I can imagine getting the result by :
Building 12 sub dataframe, each containing only one month of information
Pivoting each dataframe
Concatenating them
But that seems like an pretty complicated way to make the target transformation. Is there a better / simpler way ?

I think pandas melt provides the functionality you are looking for
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt
import pandas as pd
import numpy as np
from pandas import melt
df = pd.DataFrame(np.random.randint(10, 1000, size=(2,12)), index=['PrinterBlue', 'PrinterBetter'], columns=pd.date_range('1-1', periods=12, freq='M'))
dft = df.T
dft["date"] = dft.index
result = melt(dft, id_vars=["date"])
result.columns = ["date", "brand", "sales"]
print (result)
outputs this:
date brand sales
0 2014-01-31 PrinterBlue 242
1 2014-02-28 PrinterBlue 670
2 2014-03-31 PrinterBlue 142
3 2014-04-30 PrinterBlue 571
4 2014-05-31 PrinterBlue 826
5 2014-06-30 PrinterBlue 515
6 2014-07-31 PrinterBlue 568
7 2014-08-31 PrinterBlue 90
8 2014-09-30 PrinterBlue 652
9 2014-10-31 PrinterBlue 488
10 2014-11-30 PrinterBlue 671
11 2014-12-31 PrinterBlue 767
12 2014-01-31 PrinterBetter 294
13 2014-02-28 PrinterBetter 77
14 2014-03-31 PrinterBetter 59
15 2014-04-30 PrinterBetter 373
16 2014-05-31 PrinterBetter 228
17 2014-06-30 PrinterBetter 708
18 2014-07-31 PrinterBetter 16
19 2014-08-31 PrinterBetter 542
20 2014-09-30 PrinterBetter 577
21 2014-10-31 PrinterBetter 141
22 2014-11-30 PrinterBetter 358
23 2014-12-31 PrinterBetter 290

Related

How to get values for the next month for a selected column from a pandas data frame with date time index

I have the below data frame (date time index, with all working days in us calender)
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1 , how to get values from the same column for the same day of next month? (if value for that exact day is not available (due to weekends or holidays), then should get the value at the next available date. ). I tried using df.n1.shift(21), but its not working as the exact working days at each month differ.
Expected output as below
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # three values below are same, because on Feb 2018, the next working day after 2nd is 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
for December , the next month value should be last value of the data frame ie, value at index 2018-12-31 (20.45).
please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first 31st days):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15

Customised start and end date of the month

I have a data frame which contains date and value. I have to compute sum of the values for each month.
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
But the problem is in my data set starting date of the month is 21 and ending at 20. Is there any way to tell that group the month from 21th day to 20th day to pandas.
Assume my data frame contains starting and ending date is,
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
so far i tried,
starting_date=df['Date'].min()
ending_date=df['Date'].max()
month_wise_sum=[]
while(starting_date<=ending_date):
temp=starting_date+datetime.timedelta(days=31)
e_y=temp.year
e_m=temp.month
e_d=20
temp= datetime.datetime(e_y,e_m,e_d)
month_wise_sum.append(df[df['Date'].between(starting_date,temp)]['Value'].sum())
starting_date=temp+datetime.timedelta(days=1)
print month_wise_sum
My above code does the thing. but still waiting for pythonic way to achieve it.
My biggest problem is slicing data frame for month wise
for example,
2015-11-21 to 2015-12-20
Is there any pythonic way to achieve this?
Thanks in Advance.
For Example consider this as my dataframe. It contains date from date_range(datetime.datetime(2017,01,21),datetime.datetime(2017,10,20))
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
So that i could calculate each month's sum of value series
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope i explained thoroughly.
For the purpose of testing my solution, I generated some random data, frequency is daily but it should work for every frequencies.
index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here you see that I passed as index an array of datetimes. Indexing with dates allow in pandas for a lot of added functionalities. With your data you should do (if the Date column already only contains datetime values) :
df = df.set_index('Date')
Then I would realign artificially your data by substracting 20 days to the index :
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample data to a monthly indexing, summing all data in the same month :
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month (for me something like :
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np
start_date = "2015-11-21"
# As #ALollz mentioned, the month with the original end_date='2017-11-20' was missing.
# since pd.date_range() only generates dates in the specified range (between start= and end=),
# '2017-11-31'(using freq='M') exceeds the original end='2017-11-20' and thus is cut off.
# the similar situation applies also to start_date (using freq="MS") when start_month might be cut off
# easy fix is just to extend the end_date to a date in the next month or use
# the end-date of its own month '2017-11-30', or replace end= to periods=25
end_date = "2017-12-20"
# create a testing dataframe
df = pd.DataFrame({ "date": pd.date_range(start_date, periods=710, freq='D'), "value": np.random.randn(710)})
# set up bins to include all dates to create expected date ranges
bins = [ d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M") ]
# group and summary using the ranges from the above bins
df.groupby(pd.cut(df.date, bins)).sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()
..

How to calculate day's difference between successive pandas dataframe rows with condition

I have a pandas dataframe like following..
item_id date
101 2016-01-05
101 2016-01-21
121 2016-01-08
121 2016-01-22
128 2016-01-19
128 2016-02-17
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
I want to calculate days difference between date column but with respect to item_id column. First I want to sort the dataframe with date grouping on item_id. It should look like this
item_id date
101 2016-01-05
101 2016-01-08
121 2016-01-21
121 2016-01-22
128 2016-01-17
128 2016-02-19
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
Then I want to calculate the difference between dates again grouping on item_id So the output should look like following
item_id date day_difference
101 2016-01-05 0
101 2016-01-08 3
121 2016-01-21 0
121 2016-01-22 1
128 2016-01-17 0
128 2016-02-19 2
131 2016-01-11 0
131 2016-01-23 12
131 2016-01-24 1
131 2016-02-06 13
131 2016-02-07 1
For sorting I used something like this
df.groupby('item_id').apply(lambda x: new_df.sort('date'))
But,it didn't work out. I am able to calculate the difference between consecutive rows by following
(df['date'] - df['date'].shift(1))
But not for grouping with item_id
I think you can use:
df['date'] = df.groupby('item_id')['date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('item_id')['date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
print df
item_id date diff
0 101 2016-01-05 0
1 101 2016-01-21 16
2 121 2016-01-08 0
3 121 2016-01-22 14
4 128 2016-01-19 0
5 128 2016-02-17 29
6 131 2016-01-11 0
7 131 2016-01-23 12
8 131 2016-01-24 1
9 131 2016-02-06 13
10 131 2016-02-07 1
You can also try:
df.date.diff().fillna(pd.Timedelta(seconds=0))
Note: .fillna(0) is no longer supported for timedelta dtype

subsetting pandas dataframe on specific date value

I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04.Datatypes of df dataframe are
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in python
df[df['time'] == '2016-01-04']
But it returns me an empty dataframe. But,when I do
df[df['time'] < '2016-01-05'] it works. Please help
The problem here is that the comparison is being performed for an exact match, as none of the times are '00:00:00' then no matches occur, you'd have to compare just the date components in order for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
IIUC you can use DatetimeIndex Partial String Indexing:
print df
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
print df['2016-01-04']
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82

how to subset pandas dataframe on date

I have a pandas DataFrame like this..
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids present before 6th jan 2016 but not after 6th Jan 2016
so, it should return me buyer_id 79
I am doing following in Python.
df.buyer_id[(df['time'] < '2016-01-06')]
This returns me all the buyer ids before 6th jan 2016 but how to check for the condition if its not present after 6th jan ? Please help
IIUC you could use isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
You could use:
df.groupby('buyer_id').apply(lambda x: True if (x.time < '01-06-2016').any() and not (x.time > '01-06-2016').any() else False)
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool

Categories

Resources