Pandas: count rows in one dataframe conditionally on values in another dataframe - python

I'm trying to solve this issue. I have two dataframes. The first one looks like:
ID   start.date           end.date
272  2007-03-27 10:37:00  2007-03-27 15:09:00
290  2007-04-10 14:12:00  2007-04-10 15:51:00
268  2007-03-23 18:18:00  2007-03-23 18:24:00
264  2007-04-05 06:54:00  2007-04-09 06:45:00
105  2007-04-18 10:51:00  2007-04-18 13:37:00
280  2007-03-30 11:09:00  2007-04-02 06:27:00
99   2007-03-28 12:12:00  2007-03-28 15:22:00
268  2007-03-27 10:41:00  2007-03-27 10:54:00
263  2007-03-28 11:08:00  2007-03-28 12:45:00
264  2007-03-28 07:12:00  2007-03-28 11:08:00
While the second one looks like:
ID   date
266  2007-03-30 17:17:10
272  2007-03-30 14:23:39
268  2007-03-30 09:12:48
264  2007-03-30 18:57:57
276  2007-04-02 14:30:02
106  2007-03-28 11:35:49
276  2007-03-30 13:40:24
82   2007-03-27 17:29:28
104  2007-03-28 17:50:12
264  2007-03-29 14:41:16
I would like to add a column to the first dataframe with the count of the rows in the second dataframe with that ID and with a date value between the start.date and end.date of the first dataframe. How can I do it?

You can try apply on rows:
df1['start.date'] = pd.to_datetime(df1['start.date'])
df1['end.date'] = pd.to_datetime(df1['end.date'])
df2['date'] = pd.to_datetime(df2['date'])
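# For each row of df1, count the df2 rows with the same ID and a date strictly inside the interval
df1['count'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & (row['start.date'] < df2['date']) & (df2['date'] < row['end.date'])).sum(), axis=1)
# or, equivalently, with Series.between (inclusive='neither' excludes both endpoints)
df1['count2'] = df1.apply(lambda row: (df2['ID'].eq(row['ID']) & df2['date'].between(row['start.date'], row['end.date'], inclusive='neither')).sum(), axis=1)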
print(df1)
ID start.date end.date count count2
0 272 2007-03-27 10:37:00 2007-03-27 15:09:00 0 0
1 290 2007-04-10 14:12:00 2007-04-10 15:51:00 0 0
2 268 2007-03-23 18:18:00 2007-03-23 18:24:00 0 0
3 264 2007-04-05 06:54:00 2007-04-09 06:45:00 0 0
4 105 2007-04-18 10:51:00 2007-04-18 13:37:00 0 0
5 280 2007-03-30 11:09:00 2007-04-02 06:27:00 0 0
6 99 2007-03-28 12:12:00 2007-03-28 15:22:00 0 0
7 268 2007-03-27 10:41:00 2007-03-27 10:54:00 0 0
8 263 2007-03-28 11:08:00 2007-03-28 12:45:00 0 0
9 264 2007-03-28 07:12:00 2007-03-28 11:08:00 0 0

This is a good fit for NumPy broadcasting:
id1, start_date, end_date = [df1[[col]].to_numpy() for col in ["ID", "start.date", "end.date"]]
id2, date = [df2[col].to_numpy() for col in ["ID", "date"]]
# Check every row in df1 against every row in df2 for our criteria:
# matching id, and date between start.date and end.date
match = (id1 == id2) & (start_date < date) & (date < end_date)
df1["count"] = match.sum(axis=1)
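To see why the broadcasting works, it may help to look at the array shapes. The sketch below reuses the same idea on small made-up frames (hypothetical data, not the asker's): the df1 columns become column vectors of shape (n, 1), the df2 columns stay flat with shape (m,), and each comparison expands to an (n, m) grid.

```python
import pandas as pd

# Hypothetical illustration data.
df1 = pd.DataFrame({
    "ID": [1, 1, 2],
    "start.date": pd.to_datetime(["2007-03-01", "2007-03-10", "2007-03-01"]),
    "end.date": pd.to_datetime(["2007-03-05", "2007-03-20", "2007-03-31"]),
})
df2 = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3],
    "date": pd.to_datetime(
        ["2007-03-02", "2007-03-03", "2007-03-15", "2007-04-01", "2007-03-02"]
    ),
})

# Double brackets keep a 2-D result: each array has shape (3, 1).
id1, start, end = (df1[[c]].to_numpy() for c in ["ID", "start.date", "end.date"])
# Single brackets give 1-D arrays of shape (5,).
id2, date = (df2[c].to_numpy() for c in ["ID", "date"])

# (3, 1) compared with (5,) broadcasts to (3, 5):
# match[i, j] is True when df2 row j falls inside df1 row i's interval.
match = (id1 == id2) & (start < date) & (date < end)
print(match.shape)

counts = match.sum(axis=1)  # one count per df1 row
print(counts)
```

Note that the intermediate grid holds n * m booleans, so this approach trades memory for speed; for very large frames a merge or chunked approach may be more practical.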

Related

pandas.to_datetime not converting all rows to datetime

A simple transformation to convert a string date to datetime in a DataFrame is not working; see the last column from row 990 onwards:
new_df = pd.melt(
frame=df,
id_vars={'Date', 'Day'}
)
new_df['new_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='raise')
Date Day variable value new_date
0 1/5/2015 289 Cases_Guinea 2776.0 2015-01-05
1 1/4/2015 288 Cases_Guinea 2775.0 2015-01-04
2 1/3/2015 287 Cases_Guinea 2769.0 2015-01-03
3 1/2/2015 286 Cases_Guinea NaN 2015-01-02
4 12/31/2014 284 Cases_Guinea 2730.0 2014-12-31
5 12/28/2014 281 Cases_Guinea 2706.0 2014-12-28
6 12/27/2014 280 Cases_Guinea 2695.0 2014-12-27
7 12/24/2014 277 Cases_Guinea 2630.0 2014-12-24
8 12/21/2014 273 Cases_Guinea 2597.0 2014-12-21
9 12/20/2014 272 Cases_Guinea 2571.0 2014-12-20
.. ... ... ... ... ...
990 12/3/2014 256 Deaths_Guinea NaN NaT
991 11/30/2014 253 Deaths_Guinea 1327.0 NaT
992 11/28/2014 251 Deaths_Guinea NaN NaT
993 11/23/2014 246 Deaths_Guinea 1260.0 NaT
994 11/22/2014 245 Deaths_Guinea NaN NaT
995 11/18/2014 241 Deaths_Guinea 1214.0 NaT
996 11/16/2014 239 Deaths_Guinea 1192.0 NaT
997 11/15/2014 238 Deaths_Guinea NaN NaT
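One likely cause, offered as a hedged guess from the code shown: the converted Series is built from the original df, which has fewer rows than the melted new_df, and pandas aligns on the index when assigning, so rows whose labels do not exist in df get NaT. A minimal sketch with a tiny made-up frame (the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame with the same kind of layout as in the question.
df = pd.DataFrame({
    "Date": ["1/5/2015", "1/4/2015"],
    "Day": [289, 288],
    "Cases_Guinea": [10.0, 9.0],
    "Deaths_Guinea": [5.0, 4.0],
})

# melt stacks the value columns, so new_df has 4 rows (index 0..3).
new_df = pd.melt(frame=df, id_vars=["Date", "Day"])

# Converting df["Date"] yields a Series with index 0..1 only;
# assignment aligns on index, so rows 2 and 3 become NaT.
new_df["misaligned"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Converting the melted frame's own column covers every row.
new_df["new_date"] = pd.to_datetime(new_df["Date"], format="%m/%d/%Y")
print(new_df)
```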

How to get values for the next month for a selected column from a pandas data frame with date time index

I have the data frame below (datetime index, with all working days in the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how can I get the value from the same column for the same day of the next month? (If a value for that exact day is not available because of a weekend or holiday, it should take the value at the next available date.) I tried df.n1.shift(21), but it does not work because the number of working days differs from month to month.
The expected output is below:
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # three values below are same, because on Feb 2018, the next working day after 2nd is 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next-month value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
df.merge(df1, left_index=True, right_on='new_date')
Output (first rows shown):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
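A different way to get at the question's rule (same day next month, else the next available date, else the last row) is to look the target dates up in the existing index with searchsorted. This is an alternative sketch, not the method used in the answer above; the helper name next_month_values and the demo frame are made up for illustration, and the index is assumed to be a sorted DatetimeIndex.

```python
import numpy as np
import pandas as pd

def next_month_values(df, col):
    """Value of `col` one month later, or at the next available index date."""
    # Same calendar day next month (month-end dates are clipped, e.g. Jan 31 -> Feb 28).
    target = df.index + pd.DateOffset(months=1)
    # Position of the first index date that is >= each target date.
    pos = df.index.searchsorted(target, side="left")
    # Targets beyond the end of the data fall back to the last row.
    pos = np.minimum(pos, len(df) - 1)
    return pd.Series(df[col].to_numpy()[pos], index=df.index, name="next_mnth_val")

# Hypothetical demo data: plain business days, values 0, 1, 2, ...
idx = pd.bdate_range("2018-01-01", "2018-03-30")
demo = pd.DataFrame({"n1": np.arange(len(idx), dtype=float)}, index=idx)
demo["next_mnth_val"] = next_month_values(demo, "n1")
print(demo.head())
```

Because the lookup uses whatever dates are actually in the index, it should also respect a custom holiday calendar such as the one in the question, without needing a separate business-day offset.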

How to join DataFrame with multiple conditions on different columns?

I have two data-frames as follows:
mydata1:
ID X1 X2 Date1
002 324 634 2016-01-01
002 334 534 2016-01-14
002 354 834 2016-01-30
004 543 843 2017-02-01
004 923 043 2017-04-15
005 032 212 2015-09-01
005 523 843 2017-09-15
005 212 222 2015-10-1
mydata2:
ID Y1 Y2 Date2
002 1224 234 2016-01-04
002 1254 249 2016-01-28
004 321 212 2016-12-01
005 1121 222 2017-09-13
I want to merge these two data-frames based on ID and date, keeping a match only where the difference between Date1 (in dataframe 1) and Date2 (in dataframe 2) is less than 15 days. So my desired output data-frame should look like this:
ID X1 X2 Date1 Y1 Y2 Date2
002 324 634 2016-01-01 nan nan nan
002 334 534 2016-01-14 1224 234 2016-01-04
002 354 834 2016-01-30 1254 249 2016-01-28
004 543 843 2017-02-01 321 212 2015-12-01
004 923 043 2017-04-15 nan nan nan
005 032 212 2015-09-01 nan nan nan
005 523 843 2015-09-15 1121 222 2017-09-13
005 212 222 2015-10-1 nan nan nan
So your desired output is slightly wrong since one of the values is 2 years older than the joined value.
First we perform a join:
f = df.merge(df1, how='left', on='ID')
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 1224 234 2016-01-04
1 2 334 534 2016-01-14 1224 234 2016-01-04
2 2 354 834 2016-01-30 1224 234 2016-01-04
3 4 543 843 2017-02-01 321 212 2016-12-01
4 4 923 43 2017-04-15 321 212 2016-12-01
5 5 32 212 2015-09-01 1121 222 2015-09-13
6 5 523 843 2015-09-15 1121 222 2015-09-13
7 5 212 222 2015-10-1 1121 222 2015-09-13
Then we create a boolean mask:
mask = (pd.to_datetime(f['Date1'], format='%Y-%m-%d') - pd.to_datetime(f['Date2'], format='%Y-%m-%d')).apply(lambda i: i.days <= 15 and i.days > 0)
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
Then we set it to nan where the condition does not match:
f.loc[~mask, ['Y1', 'Y2', 'Date2']] = np.nan
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 NaN NaN NaN
1 2 334 534 2016-01-14 1224.0 234.0 2016-01-04
2 2 354 834 2016-01-30 NaN NaN NaN
3 4 543 843 2017-02-01 NaN NaN NaN
4 4 923 43 2017-04-15 NaN NaN NaN
5 5 32 212 2015-09-01 NaN NaN NaN
6 5 523 843 2015-09-15 1121.0 222.0 2015-09-13
7 5 212 222 2015-10-1 NaN NaN NaN
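Another option that may fit this kind of "nearest earlier date within a window" join is pd.merge_asof. The sketch below is an alternative to the merge-and-mask steps above, not a restatement of them. It assumes the input tables from the question with IDs kept as strings, and it mirrors the mask's rule (Date2 strictly before Date1, at most 15 days earlier) through direction, tolerance and allow_exact_matches.

```python
import pandas as pd

mydata1 = pd.DataFrame({
    "ID": ["002", "002", "002", "004", "004", "005", "005", "005"],
    "X1": [324, 334, 354, 543, 923, 32, 523, 212],
    "X2": [634, 534, 834, 843, 43, 212, 843, 222],
    "Date1": pd.to_datetime([
        "2016-01-01", "2016-01-14", "2016-01-30", "2017-02-01",
        "2017-04-15", "2015-09-01", "2017-09-15", "2015-10-01",
    ]),
})
mydata2 = pd.DataFrame({
    "ID": ["002", "002", "004", "005"],
    "Y1": [1224, 1254, 321, 1121],
    "Y2": [234, 249, 212, 222],
    "Date2": pd.to_datetime(["2016-01-04", "2016-01-28", "2016-12-01", "2017-09-13"]),
})

# merge_asof needs both frames sorted by their date keys.
result = pd.merge_asof(
    mydata1.sort_values("Date1"),
    mydata2.sort_values("Date2"),
    left_on="Date1",
    right_on="Date2",
    by="ID",                          # only match rows with the same ID
    direction="backward",             # take the latest Date2 before Date1
    tolerance=pd.Timedelta(days=15),  # ...but no more than 15 days earlier
    allow_exact_matches=False,        # Date2 must be strictly earlier
).sort_values(["ID", "Date1"], ignore_index=True)
print(result)
```

Unlike the plain merge on ID, each left row gets at most one partner here, so the row 2016-01-30 is paired with 2016-01-28 rather than with the first row for that ID.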

Create histogram from panda frame

I am trying to create a bar chart that shows the mean of the subjects by group.
My data looks like this:
week 8 exp
Subject Group 1 2 3 Mean
0 255 WT 0 101.8 75.6 84.1 87.166667
1 157 HD 0 92.6 87.8 82.3 87.566667
2 418 WT 0 54.5 47.0 50.8 50.766667
3 300 WT 0 48.1 73.1 72.2 64.466667
4 299 HD 0 71.8 86.0 93.4 83.733333
5 258 WT 0 88.0 98.5 50.2 78.900000
6 173 WT 0 75.4 70.5 83.9 76.600000
7 273 HD 0 103.6 94.2 108.3 102.033333
8 175 WT 0 36.7 30.7 42.2 36.533333
9 172 HD 0 82.6 91.6 73.4 82.533333
10 263 WT 0 110.7 102.4 105.5 106.200000
11 304 1 90.4 90.1 103.4 94.633333
12 305 1 128.6 141.5 123.1 131.066667
13 306 1 52.0 45.6 57.2 51.600000
14 309 0.1 41.3 52.6 79.9 57.933333
15 317 0.1 86.2 95.8 77.1 86.366667
My code is:
frame_data = pd.read_csv('final results.csv', header=[0,1])
data_avg = df.iloc[:, -3:].mean(axis=1)
frame_data[('exp', 'Mean')] = frame_data.iloc[:, -3:].mean(axis=1)
grouped_by_group = frame_data.groupby(['Group',
'Mean']).size().unstack('Mean')
grouped_by_group.plot.bar(title='Grip')
I am getting an error:
KeyError: 'Group'
I checked many times and the column name is spelled exactly as in the file. I do not know what is wrong.
I think you need to flatten the MultiIndex columns, aggregate the mean per group, and then use Series.plot:
frame_data = pd.read_csv('final results.csv', header=[0,1])
frame_data[('exp', 'Mean')] = frame_data.iloc[:, -3:].mean(axis=1)
#flatten MultiIndex to columns
frame_data.columns = frame_data.columns.map('_'.join)
grouped_by_group = frame_data.groupby('8_Group')['exp_Mean'].mean()
print (grouped_by_group)
8_Group
0.1 72.150000
1 92.433333
HD 0 88.966667
WT 0 71.519048
Name: value, dtype: float64
grouped_by_group.plot.bar(title='Grip')
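To illustrate what the flattening step does, here is a small sketch that builds a two-level header by hand instead of reading the CSV. The level names ("8" and "exp") are an assumption inferred from the '8_Group' key used above, and only the first three data rows of the question are included.

```python
import pandas as pd

# Assumed two-level header: top level "8" / "exp", second level the column names.
cols = pd.MultiIndex.from_tuples(
    [("8", "Subject"), ("8", "Group"), ("exp", "1"), ("exp", "2"), ("exp", "3")]
)
frame = pd.DataFrame(
    [
        [255, "WT 0", 101.8, 75.6, 84.1],
        [157, "HD 0", 92.6, 87.8, 82.3],
        [418, "WT 0", 54.5, 47.0, 50.8],
    ],
    columns=cols,
)

# With a MultiIndex, "Group" alone is not a column label, hence the KeyError.
frame[("exp", "Mean")] = frame.iloc[:, -3:].mean(axis=1)

# Join each (level0, level1) tuple into a single string label.
frame.columns = frame.columns.map("_".join)
print(list(frame.columns))

group_means = frame.groupby("8_Group")["exp_Mean"].mean()
print(group_means)
```

After flattening, the columns are ordinary strings, so groupby and plotting calls can refer to them by a single name.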

how to subset pandas dataframe on date

I have a pandas DataFrame like this..
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids that are present before 6th Jan 2016 but not after 6th Jan 2016,
so it should return buyer_id 79.
I am doing the following in Python:
df.buyer_id[(df['time'] < '2016-01-06')]
This returns all the buyer ids before 6th Jan 2016, but how do I check that they are not present after 6th Jan? Please help.
If I understand correctly, you could use the isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
You could also use groupby (assuming the time column has already been converted with pd.to_datetime):
df.groupby('buyer_id').apply(lambda x: (x.time < '2016-01-06').any() and not (x.time > '2016-01-06').any())
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool
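If only the distinct buyer ids are needed, plain Python sets are another way to express "seen before the cutoff but never on or after it". A minimal sketch using the rows from the question; the variable names and the choice of midnight on 2016-01-06 as the cutoff are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [537, 540, 556, 589, 596, 609, 611, 680, 681, 682],
    "buyer_id": [79, 191, 251, 191, 251, 79, 261, 64, 261, 309],
    "item_id": [93, 93, 82, 104, 99, 106, 97, 135, 133, 135],
    "time": pd.to_datetime([
        "2016-01-04 10:20:00", "2016-01-04 10:30:00", "2016-01-04 13:39:00",
        "2016-01-05 10:59:00", "2016-01-05 13:48:00", "2016-01-06 10:39:00",
        "2016-01-06 10:50:00", "2016-01-11 11:58:00", "2016-01-11 12:03:00",
        "2016-01-11 12:08:00",
    ]),
})

cutoff = pd.Timestamp("2016-01-06")  # assumed cutoff: midnight at the start of 6th Jan

before = set(df.loc[df["time"] < cutoff, "buyer_id"])
on_or_after = set(df.loc[df["time"] >= cutoff, "buyer_id"])

# Buyers seen before the cutoff and never again afterwards.
only_before = sorted(before - on_or_after)
print(only_before)
```

With this cutoff, buyer 79 is excluded because it also has an order on 6th Jan, which matches the isin result shown above.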
