How to join DataFrame with multiple conditions on different columns?

How to join DataFrame with multiple conditions on different columns? - python

I have two data-frames as follows:
mydata1:
ID X1 X2 Date1
002 324 634 2016-01-01
002 334 534 2016-01-14
002 354 834 2016-01-30
004 543 843 2017-02-01
004 923 043 2017-04-15
005 032 212 2015-09-01
005 523 843 2017-09-15
005 212 222 2015-10-1
mydata2:
ID Y1 Y2 Date2
002 1224 234 2016-01-04
002 1254 249 2016-01-28
004 321 212 2016-12-01
005 1121 222 2017-09-13
I want to merge these two data-frames based on ID and the Date where the difference between Date1 --dataframe1-- and Date2 --indataframe2--is less than 15. So, my desired data-frame as an output should be like this:
ID X1 X2 Date1. Y1. Y2. Date2
002 324 634 2016-01-01. nan. nan. nan
002 334 534 2016-01-14 1224 234 2016-01-04
002 354 834 2016-01-30. 1254 249 2016-01-28
004 543 843 2017-02-01 321 212 2015-12-01
004 923 043 2017-04-15. nan nan. nan
005 032 212 2015-09-01 nan nan. nan
005 523 843 2015-09-15. 1121 222 2017-09-13
005 212 222 2015-10-1. nan nan. nan

So your desired output is slightly wrong since one of the values is 2 years older than the joined value.
First we perform a join:
f = df.merge(df1, how='left', on='ID')
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 1224 234 2016-01-04
1 2 334 534 2016-01-14 1224 234 2016-01-04
2 2 354 834 2016-01-30 1224 234 2016-01-04
3 4 543 843 2017-02-01 321 212 2016-12-01
4 4 923 43 2017-04-15 321 212 2016-12-01
5 5 32 212 2015-09-01 1121 222 2015-09-13
6 5 523 843 2015-09-15 1121 222 2015-09-13
7 5 212 222 2015-10-1 1121 222 2015-09-13
Then we create a boolean mask:
mask = (pd.to_datetime(f['Date1'], format='%Y-%m-%d') - pd.to_datetime(f['Date2'], format='%Y-%m-%d')).apply(lambda i: i.days <= 15 and i.days > 0)
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
Then we set it to nan where the condition does not match:
f.loc[~mask, ['Y1', 'Y2', 'Date2']] = np.nan
ID X1 X2 Date1 Y1 Y2 Date2
0 2 324 634 2016-01-01 NaN NaN NaN
1 2 334 534 2016-01-14 1224.0 234.0 2016-01-04
2 2 354 834 2016-01-30 NaN NaN NaN
3 4 543 843 2017-02-01 NaN NaN NaN
4 4 923 43 2017-04-15 NaN NaN NaN
5 5 32 212 2015-09-01 NaN NaN NaN
6 5 523 843 2015-09-15 1121.0 222.0 2015-09-13
7 5 212 222 2015-10-1 NaN NaN NaN

Related

pandas.to_datetime not converting all rows to datetime

simple transformation to convert a string date time to datetime in a df not working - please see last column 990 onwards
new_df = pd.melt(
frame=df,
id_vars={'Date', 'Day'}
)
new_df['new_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='raise')
Date Day variable value new_date
0 1/5/2015 289 Cases_Guinea 2776.0 2015-01-05
1 1/4/2015 288 Cases_Guinea 2775.0 2015-01-04
2 1/3/2015 287 Cases_Guinea 2769.0 2015-01-03
3 1/2/2015 286 Cases_Guinea NaN 2015-01-02
4 12/31/2014 284 Cases_Guinea 2730.0 2014-12-31
5 12/28/2014 281 Cases_Guinea 2706.0 2014-12-28
6 12/27/2014 280 Cases_Guinea 2695.0 2014-12-27
7 12/24/2014 277 Cases_Guinea 2630.0 2014-12-24
8 12/21/2014 273 Cases_Guinea 2597.0 2014-12-21
9 12/20/2014 272 Cases_Guinea 2571.0 2014-12-20
.. ... ... ... ... ...
990 12/3/2014 256 Deaths_Guinea NaN NaT
991 11/30/2014 253 Deaths_Guinea 1327.0 NaT
992 11/28/2014 251 Deaths_Guinea NaN NaT
993 11/23/2014 246 Deaths_Guinea 1260.0 NaT
994 11/22/2014 245 Deaths_Guinea NaN NaT
995 11/18/2014 241 Deaths_Guinea 1214.0 NaT
996 11/16/2014 239 Deaths_Guinea 1192.0 NaT
997 11/15/2014 238 Deaths_Guinea NaN NaT

Python: calculate rolling returns over different frequencies

I have the following DataFrame:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I am applying the following to get rolling returns:
periodicity_dict = {1:'daily', 7:'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
for col in df_columns:
df1[col+'_rolling']= np.nan
for i in range(key, len(df1[col][df1[col].first_valid_index():df1[col].last_valid_index()])):
df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i-key])/df[col].iloc[i-key]
But I am getting the following error: KeyError: 'col_A'.
What am I doing wrong? And is there a better way to do this with less loops?

I think you are looking for something like the shift method (no for-loop is needed):
df1['col_A_rolling'] = (df1['col_A'] - df1['col_A'].shift(7)) / df1['col_A'].shift(7)
OUTPUT:
col_A col_B col_C col_A_rolling
2022-01-01 99330 12 122 NaN
2022-01-02 1123 1230 1287 NaN
2022-01-03 123 101 812739 NaN
2022-01-04 1143 1230123 252 NaN
2022-01-05 234 342 4546 NaN
2022-01-06 2445 3453 3457 NaN
2022-01-07 7897 8657 5675 NaN
2022-01-08 46 5675 453 -0.999537
2022-01-09 76 484 3735 -0.932324
2022-01-10 363 93 4568 1.951220
2022-01-11 385 568 367 -0.663167
2022-01-12 458 846 4847 0.957265
2022-01-13 574 45747 658468 -0.765235
2022-01-14 57457 46534 4675 6.275801

What's wrong with this code to conditionally count Pandas dataframe columns?

I have the following data:
Data:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I want to add a new column called CheckCount counting the values in the Vol and Mx columns IF they are greater than 150. I have written the following code:
Code:
import pandas as pd
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['CheckCount'] = (Observations[['Vol', 'Mx']]>150).count(axis=1)
print(Observations)
However, unfortunately it is counting every value (result is always 2) rather than only where the values are >150 - what is wrong with my code?
Current Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,2
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,2
101,2017-01-04,78.0,130,182,2
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,2
101,2017-01-07,52.0,134,183,2
101,2017-01-08,62.0,148,176,2
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,2
Desired Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,1
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,1
101,2017-01-04,78.0,130,182,1
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,1
101,2017-01-07,52.0,134,183,1
101,2017-01-08,62.0,148,176,1
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,0

Are you looking for:
df['CheckCount'] = df[['Vol','Mx']].gt(150).sum(1)
Output:
ObjectID Date Price Vol Mx CheckCount
0 101 2017-01-01 NaN 145 203 1
1 101 2017-01-02 NaN 155 163 2
2 101 2017-01-03 67.0 140 234 1
3 101 2017-01-04 78.0 130 182 1
4 101 2017-01-05 58.0 178 202 2
5 101 2017-01-06 53.0 134 204 1
6 101 2017-01-07 52.0 134 183 1
7 101 2017-01-08 62.0 148 176 1
8 101 2017-01-09 42.0 152 193 2
9 101 2017-01-10 80.0 137 150 0

how to subset pandas dataframe on date

I have a pandas DataFrame like this..
order_id buyer_id item_id time
537 79 93 2016-01-04 10:20:00
540 191 93 2016-01-04 10:30:00
556 251 82 2016-01-04 13:39:00
589 191 104 2016-01-05 10:59:00
596 251 99 2016-01-05 13:48:00
609 79 106 2016-01-06 10:39:00
611 261 97 2016-01-06 10:50:00
680 64 135 2016-01-11 11:58:00
681 261 133 2016-01-11 12:03:00
682 309 135 2016-01-11 12:08:00
I want to get all the buyer_ids present before 6th jan 2016 but not after 6th Jan 2016
so, it should return me buyer_id 79
I am doing following in Python.
df.buyer_id[(df['time'] < '2016-01-06')]
This returns me all the buyer ids before 6th jan 2016 but how to check for the condition if its not present after 6th jan ? Please help

IIUC you could use isin method to achieve what you want:
df.time = pd.to_datetime(df.time)
In [52]: df
Out[52]:
order_id buyer_id item_id time
0 537 79 93 2016-01-04 10:20:00
1 540 191 93 2016-01-04 10:30:00
2 556 251 82 2016-01-04 13:39:00
3 589 191 104 2016-01-05 10:59:00
4 596 251 99 2016-01-05 13:48:00
5 609 79 106 2016-01-06 10:39:00
6 611 261 97 2016-01-06 10:50:00
7 680 64 135 2016-01-11 11:58:00
8 681 261 133 2016-01-11 12:03:00
9 682 309 135 2016-01-11 12:08:00
exclude = df.buyer_id[(df['time'] > '2016-01-06')]
select = df.buyer_id[(df['time'] < '2016-01-06')]
In [53]: select
Out[53]:
0 79
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64
In [54]: exclude
Out[54]:
5 79
6 261
7 64
8 261
9 309
Name: buyer_id, dtype: int64
In [55]: select[~select.isin(exclude)]
Out[55]:
1 191
2 251
3 191
4 251
Name: buyer_id, dtype: int64

You could use:
df.groupby('buyer_id').apply(lambda x: True if (x.time < '01-06-2016').any() and not (x.time > '01-06-2016').any() else False)
buyer_id
64 False
79 False
191 True
251 True
261 False
309 False
dtype: bool

Python - Statsmodels.tsa.seasonal_decompose - missing values in head and tail of dataframe

I have the following dataframe, that I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
decomp=sm.tsa.seasonal_decompose(sales_df.Value)
df=pd.concat([sales_df,decomp.trend],axis=1)
df.columns=['sales','trend']
This is getting me this:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaN's in the start and in the end of the Trend's series.
So I ask, is that right? Why is that happening?

This is expected as seasonal_decompose uses a symmetric moving average by default if the filt argument is not specified (as you did). The frequency is inferred from the time series.
https://searchcode.com/codesearch/view/86129185/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to join DataFrame with multiple conditions on different columns? - python

Related

pandas.to_datetime not converting all rows to datetime

Python: calculate rolling returns over different frequencies

What's wrong with this code to conditionally count Pandas dataframe columns?

how to subset pandas dataframe on date

Python - Statsmodels.tsa.seasonal_decompose - missing values in head and tail of dataframe

Categories

Resources