how to subset pandas dataframe on date

I have a pandas DataFrame like this:
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids that are present before 6th Jan 2016 but not after 6th Jan 2016,
so it should return buyer_id 79.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6th Jan 2016, but how do I check the condition that a buyer is not present after 6th Jan? Please help.

IIUC, you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
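The result above still repeats each buyer once per order. If only the distinct buyer ids are needed, the same idea can be written with Python sets. This is a minimal sketch on a copy of the question's data; the names cutoff, before, after and only_before are illustrative, and counting an order placed on 2016-01-06 itself as "after" is an assumption.

```python
import pandas as pd

# Copy of the question's data (only the columns that matter here).
df = pd.DataFrame({
    'buyer_id': [79, 191, 251, 191, 251, 79, 261, 64, 261, 309],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00', '2016-01-04 13:39:00',
        '2016-01-05 10:59:00', '2016-01-05 13:48:00', '2016-01-06 10:39:00',
        '2016-01-06 10:50:00', '2016-01-11 11:58:00', '2016-01-11 12:03:00',
        '2016-01-11 12:08:00',
    ]),
})

# Assumption: midnight at the start of 2016-01-06 is the dividing line.
cutoff = pd.Timestamp('2016-01-06')

before = set(df.loc[df['time'] < cutoff, 'buyer_id'])
after = set(df.loc[df['time'] >= cutoff, 'buyer_id'])

# Buyers seen before the cutoff and never on or after it.
only_before = sorted(int(b) for b in before - after)
print(only_before)  # [191, 251]
```

Buyer 79 drops out because it has an order on 2016-01-06 at 10:39, which falls after the cutoff.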

You could use:
df.groupby('buyer_id').apply(lambda x: True if (x.time < '01-06-2016').any() and not (x.time > '01-06-2016').any() else False)
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
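The same condition can also be checked without a lambda: a buyer is present before the cutoff and absent after it exactly when that buyer's latest order is earlier than the cutoff. The sketch below is an alternative to the apply call, not the answerer's code; cutoff and last_seen are illustrative names, and the cutoff at midnight on 2016-01-06 is an assumption.

```python
import pandas as pd

# Copy of the question's data (only the columns that matter here).
df = pd.DataFrame({
    'buyer_id': [79, 191, 251, 191, 251, 79, 261, 64, 261, 309],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00', '2016-01-04 13:39:00',
        '2016-01-05 10:59:00', '2016-01-05 13:48:00', '2016-01-06 10:39:00',
        '2016-01-06 10:50:00', '2016-01-11 11:58:00', '2016-01-11 12:03:00',
        '2016-01-11 12:08:00',
    ]),
})

cutoff = pd.Timestamp('2016-01-06')

# Latest order time per buyer.
last_seen = df.groupby('buyer_id')['time'].max()

# Buyers whose final order happened before the cutoff.
only_before = last_seen[last_seen < cutoff].index.tolist()
print(only_before)  # [191, 251]
```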

Related

pandas.to_datetime not converting all rows to datetime

A simple transformation to convert a string date column to datetime in a DataFrame is not working. Please see the last column (new_date) from row 990 onwards:
new_df = pd.melt(
frame=df,
id_vars={'Date', 'Day'}
)
new_df['new_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='raise')
Date Day variable value new_date
0 1/5/2015 289 Cases_Guinea 2776.0 2015-01-05
1 1/4/2015 288 Cases_Guinea 2775.0 2015-01-04
2 1/3/2015 287 Cases_Guinea 2769.0 2015-01-03
3 1/2/2015 286 Cases_Guinea NaN 2015-01-02
4 12/31/2014 284 Cases_Guinea 2730.0 2014-12-31
5 12/28/2014 281 Cases_Guinea 2706.0 2014-12-28
6 12/27/2014 280 Cases_Guinea 2695.0 2014-12-27
7 12/24/2014 277 Cases_Guinea 2630.0 2014-12-24
8 12/21/2014 273 Cases_Guinea 2597.0 2014-12-21
9 12/20/2014 272 Cases_Guinea 2571.0 2014-12-20
.. ... ... ... ... ...
990 12/3/2014 256 Deaths_Guinea NaN NaT
991 11/30/2014 253 Deaths_Guinea 1327.0 NaT
992 11/28/2014 251 Deaths_Guinea NaN NaT
993 11/23/2014 246 Deaths_Guinea 1260.0 NaT
994 11/22/2014 245 Deaths_Guinea NaN NaT
995 11/18/2014 241 Deaths_Guinea 1214.0 NaT
996 11/16/2014 239 Deaths_Guinea 1192.0 NaT
997 11/15/2014 238 Deaths_Guinea NaN NaT
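One likely cause, judging only from the code shown, is that the dates are parsed from df['Date'] rather than new_df['Date']. After melt, new_df has more rows than df, and column assignment aligns on the index, so every row whose index label does not exist in df ends up as NaT. The sketch below reproduces this on a tiny frame; the Deaths_Guinea values and the column name misaligned are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the original data; Deaths_Guinea values are placeholders.
df = pd.DataFrame({
    'Date': ['1/5/2015', '1/4/2015', '1/3/2015'],
    'Day': [289, 288, 287],
    'Cases_Guinea': [2776.0, 2775.0, 2769.0],
    'Deaths_Guinea': [1.0, 2.0, 3.0],
})

# A list keeps the id columns in a defined order.
new_df = pd.melt(frame=df, id_vars=['Date', 'Day'])

# Parsing the ORIGINAL frame yields only 3 values (index 0..2);
# rows 3..5 of new_df have no matching label and become NaT.
new_df['misaligned'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Parsing the melted frame's own column covers every row.
new_df['new_date'] = pd.to_datetime(new_df['Date'], format='%m/%d/%Y')

print(new_df[['Date', 'variable', 'misaligned', 'new_date']])
```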

How to get values for the next month for a selected column from a pandas data frame with date time index

I have the data frame below (datetime index, with all working days in the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how do I get the value from the same column for the same day of the next month? (If the value for that exact day is not available due to weekends or holidays, it should take the value at the next available date.) I tried using df.n1.shift(21), but it does not work because the number of working days differs from month to month.
Expected output:
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # three values below are same, because on Feb 2018, the next working day after 2nd is 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next month value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.

This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first 31 days):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
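Note that adding pd.offsets.BDay() moves one business day forward even when the shifted date is already a working day (2018-01-02 maps to 2018-02-05 rather than 2018-02-02). As an alternative sketch, not the answerer's method, the lookup "same day next month, or the next available date" can be done with searchsorted on the sorted index. The deterministic n1 values and the names target and pos are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='2018-01-01', end='2018-12-31', freq=us_bd)

# Deterministic stand-in values (the question uses random numbers).
df = pd.DataFrame({'n1': np.arange(len(dt_rng), dtype=float)}, index=dt_rng)

# Same calendar day in the following month, for every row.
target = df.index + pd.DateOffset(months=1)

# Position of the first index entry on or after each target date.
pos = df.index.searchsorted(target, side='left')

# Targets past the end of the data fall back to the last row,
# which is what the question asks for in December.
pos = np.minimum(pos, len(df) - 1)

df['next_mnth_val'] = df['n1'].to_numpy()[pos]
print(df.head())
```

With this approach 2018-01-03, 2018-01-04 and 2018-01-05 all pick up the value at 2018-02-05, matching the expected output in the question.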

How to calculate day's difference between successive pandas dataframe rows with condition

I have a pandas dataframe like the following:
item_id date
101 2016-01-05
101 2016-01-21
121 2016-01-08
121 2016-01-22
128 2016-01-19
128 2016-02-17
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
I want to calculate the difference in days between consecutive values of the date column, within each item_id. First I want to sort the dataframe by date within each item_id group. It should look like this:
item_id date
101 2016-01-05
101 2016-01-08
121 2016-01-21
121 2016-01-22
128 2016-01-17
128 2016-02-19
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
Then I want to calculate the difference between dates, again grouping on item_id, so the output should look like the following:
item_id date day_difference
101 2016-01-05 0
101 2016-01-08 3
121 2016-01-21 0
121 2016-01-22 1
128 2016-01-17 0
128 2016-02-19 2
131 2016-01-11 0
131 2016-01-23 12
131 2016-01-24 1
131 2016-02-06 13
131 2016-02-07 1
For sorting I used something like this:
df.groupby('item_id').apply(lambda x: new_df.sort('date'))
But it didn't work out. I am able to calculate the difference between consecutive rows with the following:
(df['date'] - df['date'].shift(1))
But not when grouping by item_id.

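I think you can use:
import numpy as np
# sort by date within each item_id
df = df.sort_values(['item_id', 'date'])
# per-group difference between consecutive dates, expressed in days
df['diff'] = df.groupby('item_id')['date'].diff() / np.timedelta64(1, 'D')
# the first row of each group has no predecessor
df['diff'] = df['diff'].fillna(0)
print(df)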
item_id date diff
0 101 2016-01-05 0
1 101 2016-01-21 16
2 121 2016-01-08 0
3 121 2016-01-22 14
4 128 2016-01-19 0
5 128 2016-02-17 29
6 131 2016-01-11 0
7 131 2016-01-23 12
8 131 2016-01-24 1
9 131 2016-02-06 13
10 131 2016-02-07 1

You can also try:
df.date.diff().fillna(pd.Timedelta(seconds=0))
Note: .fillna(0) is no longer supported for timedelta dtype
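Putting the pieces together, a per-item difference in whole days can be sketched as follows. This is a minimal, self-contained example on part of the question's data, deliberately shuffled so the sort step matters; the column name day_difference follows the question, and the use of .dt.days is one possible choice rather than the only one.

```python
import pandas as pd

# Part of the question's data, with rows out of order on purpose.
df = pd.DataFrame({
    'item_id': [101, 101, 121, 121, 128, 128],
    'date': pd.to_datetime([
        '2016-01-21', '2016-01-05',
        '2016-01-22', '2016-01-08',
        '2016-02-17', '2016-01-19',
    ]),
})

# Sort by date within each item_id.
df = df.sort_values(['item_id', 'date']).reset_index(drop=True)

# Difference to the previous row of the same item_id, in whole days.
# The first row of each group has no predecessor, so it is filled with 0.
df['day_difference'] = (
    df.groupby('item_id')['date'].diff().dt.days.fillna(0).astype(int)
)
print(df)
```

Because the fill happens after converting to a number of days, it sidesteps the question of which fill value a timedelta column accepts.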

subsetting pandas dataframe on specific date value

I have a pandas dataframe like this
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to subset this dataframe on date == '2016-01-04'. The datatypes of the df dataframe are:
df.dtypes
Out[1264]:
order_id object
buyer_id object
item_id object
time datetime64[ns]
This is what I am doing in Python:
df[df['time'] == '2016-01-04']
But it returns an empty dataframe. However, when I do
df[df['time'] < '2016-01-05'] it works. Please help.

The problem here is that the comparison is performed as an exact match against '2016-01-04 00:00:00'. As none of the times are '00:00:00', no matches occur. You have to compare just the date components for this to work:
In [20]:
df[df['time'].dt.date == pd.to_datetime('2016-01-04').date()]
Out[20]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
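Two other ways to compare only the date part are to normalize the timestamps to midnight, or to use a half-open range covering the whole day. The sketch below uses a trimmed copy of the question's data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [537, 540, 589],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00', '2016-01-05 10:59:00',
    ]),
})

# Exact comparison: the string is read as 2016-01-04 00:00:00, so nothing matches.
exact = df[df['time'] == '2016-01-04']

# Option 1: drop the time of day before comparing.
day = pd.Timestamp('2016-01-04')
by_day = df[df['time'].dt.normalize() == day]

# Option 2: half-open interval [day, day + 1 day).
in_range = df[(df['time'] >= day) & (df['time'] < day + pd.Timedelta(days=1))]

print(len(exact), len(by_day), len(in_range))  # 0 2 2
```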

IIUC, you can use DatetimeIndex partial string indexing:
print(df)
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
df = df.set_index('time')
# select rows by partial date string via .loc
print(df.loc['2016-01-04'])
order_id buyer_id item_id
time
2016-01-04 10:20:00 537 79 93
2016-01-04 10:30:00 540 191 93
2016-01-04 13:39:00 556 251 82
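A self-contained sketch of partial string indexing, written with .loc because in recent pandas versions a bare df['2016-01-04'] is treated as a column lookup. The data is a trimmed copy of the question's frame, and the names indexed, jan4 and first_days are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [537, 540, 589, 609],
    'time': pd.to_datetime([
        '2016-01-04 10:20:00', '2016-01-04 10:30:00',
        '2016-01-05 10:59:00', '2016-01-06 10:39:00',
    ]),
})

# Partial string indexing needs the timestamps in the index;
# sorting keeps slicing well defined.
indexed = df.set_index('time').sort_index()

# Every row that falls on 2016-01-04, whatever the time of day.
jan4 = indexed.loc['2016-01-04']

# Slices with partial strings include the whole end day.
first_days = indexed.loc['2016-01-04':'2016-01-05']

print(jan4)
print(first_days)
```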

Pandas dataframe : Multiple Time/Date columns to single Date index

I have a dataframe with a Product as the first column, followed by 12 months of sales (one column per month). I'd like to 'pivot' the dataframe to end up with a single date index.
example data :
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, 1000, size=(2,12)), index=['PrinterBlue', 'PrinterBetter'], columns=pd.date_range('1-1', periods=12, freq='M'))
yielding:
>>> df
2014-01-31 2014-02-28 2014-03-31 2014-04-30 2014-05-31 \
PrinterBlue 176 77 89 279 81
PrinterBetter 801 660 349 608 322
2014-06-30 2014-07-31 2014-08-31 2014-09-30 2014-10-31 \
PrinterBlue 286 831 114 996 904
PrinterBetter 994 374 895 586 646
2014-11-30 2014-12-31
PrinterBlue 458 117
PrinterBetter 366 196
Desired result :
Brand Date Sales
PrinterBlue 2014-01-31 176
2014-02-28 77
2014-03-31 89
[...]
2014-11-30 458
2014-12-31 117
PrinterBetter 2014-01-31 801
2014-02-28 660
2014-03-31 349
[...]
2014-11-30 366
2014-12-31 196
I can imagine getting the result by:
Building 12 sub-dataframes, each containing only one month of information
Pivoting each dataframe
Concatenating them
But that seems like a pretty complicated way to make the target transformation. Is there a better / simpler way?

I think pandas melt provides the functionality you are looking for
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt
import pandas as pd
import numpy as np
from pandas import melt
df = pd.DataFrame(np.random.randint(10, 1000, size=(2,12)), index=['PrinterBlue', 'PrinterBetter'], columns=pd.date_range('1-1', periods=12, freq='M'))
dft = df.T
dft["date"] = dft.index
result = melt(dft, id_vars=["date"])
result.columns = ["date", "brand", "sales"]
print (result)
outputs this:
date brand sales
0 2014-01-31 PrinterBlue 242
1 2014-02-28 PrinterBlue 670
2 2014-03-31 PrinterBlue 142
3 2014-04-30 PrinterBlue 571
4 2014-05-31 PrinterBlue 826
5 2014-06-30 PrinterBlue 515
6 2014-07-31 PrinterBlue 568
7 2014-08-31 PrinterBlue 90
8 2014-09-30 PrinterBlue 652
9 2014-10-31 PrinterBlue 488
10 2014-11-30 PrinterBlue 671
11 2014-12-31 PrinterBlue 767
12 2014-01-31 PrinterBetter 294
13 2014-02-28 PrinterBetter 77
14 2014-03-31 PrinterBetter 59
15 2014-04-30 PrinterBetter 373
16 2014-05-31 PrinterBetter 228
17 2014-06-30 PrinterBetter 708
18 2014-07-31 PrinterBetter 16
19 2014-08-31 PrinterBetter 542
20 2014-09-30 PrinterBetter 577
21 2014-10-31 PrinterBetter 141
22 2014-11-30 PrinterBetter 358
23 2014-12-31 PrinterBetter 290
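If the goal is the exact layout from the question, with Brand and Date as a two-level index, DataFrame.stack is another option. This is a sketch of an alternative, not the answerer's approach; it uses deterministic values instead of random ones, and the names dates and sales are illustrative.

```python
import numpy as np
import pandas as pd

# Twelve month-end dates for 2014, built with an explicit offset object.
dates = pd.date_range('2014-01-31', periods=12, freq=pd.offsets.MonthEnd())

# Deterministic stand-in values (the question uses random integers).
df = pd.DataFrame(
    np.arange(24).reshape(2, 12),
    index=['PrinterBlue', 'PrinterBetter'],
    columns=dates,
)

# stack() moves the column labels into an inner index level,
# giving one row per (brand, date) pair.
sales = df.stack()
sales.index.names = ['Brand', 'Date']
sales.name = 'Sales'

print(sales.head())
```

Calling sales.reset_index() would turn the result back into three flat columns, similar to the melt output above.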
