Pandas test reappearance of a value based on the rolling period - python

I am trying to find a way to check whether the current row's value in df['ColM'] in the dataframe below appeared within a 5-day look-back period:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColN'] = ['AAA', 'AAA', 'AAA', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC']
df['ColM'] = ['XYZ', 'WUV', 'WUV', 'XYZ', 'WUV', 'WUV', 'OPQ', 'XYZ']
df['ColN_dt'] = ['03-12-2018', '03-13-2018', '03-16-2018', '03-18-2018', '03-22-2018', '03-23-2018', '03-26-2018', '03-30-2018']
That is, within each ColN group, I want to check whether the row's ColM value already appeared in the last 5 days, i.e. I am looking to create a flag:
df['flag'] = [0, 0, 1, 0, 0, 1, 0, 0]

I think you can create a flag column using groupby, if your df['ColN_dt'] is a datetime Series:
# Set df['ColN_dt'] to datetime:
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
# Make sure dates are sorted (they are in your example, but just in case)
df.sort_values('ColN_dt', inplace=True)
# Create your flag column
df['flag'] = (df.groupby(['ColN', 'ColM'])['ColN_dt']
                .apply(lambda x: x.diff() < pd.Timedelta(days=5))
                .astype(int))
This returns:
>>> df
ColN ColM ColN_dt flag
0 AAA XYZ 2018-03-12 0
1 AAA WUV 2018-03-13 0
2 AAA WUV 2018-03-16 1
3 ABC XYZ 2018-03-18 0
4 ABC WUV 2018-03-22 0
5 ABC WUV 2018-03-23 1
6 ABC OPQ 2018-03-26 0
7 ABC XYZ 2018-03-30 0
Explanation:
df.groupby(['ColN', 'ColM'])['ColN_dt']
Groups your dataframe by ColN and ColM
.apply(lambda x: x.diff() < pd.Timedelta(days=5))
For each group, checks whether the difference between a row's ColN_dt and the previous row's ColN_dt is less than 5 days. This returns a boolean Series, which you can convert to int with .astype(int).
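If it helps to see the intermediate step, here is a small sketch (my addition, using the example data above) that prints the per-group gaps the flag is derived from:
# Per-group time gap between consecutive appearances of the same ColM value.
# NaT marks the first appearance in each (ColN, ColM) group; comparing the
# gap with pd.Timedelta(days=5) gives the flag before the .astype(int) cast.
gaps = df.groupby(['ColN', 'ColM'])['ColN_dt'].diff()
print(gaps)
print((gaps < pd.Timedelta(days=5)).astype(int))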


pandas: cannot set column with substring extracted from other column

I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd
data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
type base_col derived_col
0 A key=val NaN
1 B other_val NaN
2 A key=val NaN
3 B other_val NaN
mask:
0 True
1 False
2 True
3 False
Name: type, dtype: bool
extraction:
0
0 val
2 val
The boolean mask is as I expect and the extracted substrings on the subset of rows (indexes 0, 2) are also as I expect yet the new derived_col comes out as all NaN. The output I would expect in the derived_col would be 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!
You should assign a Series, not a DataFrame: str.extract returns a DataFrame, so pick its column 0:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
type base_col derived_col
0 A key=val val
1 B other_val NaN
2 A key=val val
3 B other_val NaN
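As an alternative (my sketch, not part of the original answer), passing expand=False to str.extract makes it return a Series directly when there is a single capture group, so no column selection is needed:
# expand=False returns a Series for a single capture group,
# so the index-aligned assignment works without picking column 0.
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)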

Get list of column names of values >0 for specific row (date) in Python

I have a DataFrame df1 and, for a specific date such as 2022-01-04, I want to get a list of all the column names of df1 whose values are > 0, which would be: 01G, 02G, 04G. So far I was only able to count the nonzero values per row, but not get the column names.
This would be a simple example:
df1:
01G 02G 03G 04G
Dates
2022-01-01 0 1 0 1
2022-01-02 1 1 1 0
2022-01-03 0 1 1 1
2022-01-04 1 1 0 1
For reproducibility:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'Dates': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
    '01G': [0, 1, 0, 1],
    '02G': [1, 1, 1, 1],
    '03G': [0, 1, 1, 0],
    '04G': [1, 0, 1, 1]})
df1 = df1.set_index('Dates')
np.count_nonzero(df1, axis=1)
Thanks a lot!
Use DataFrame.loc to select the row by date, compare for greater than 0 with Series.gt, and filter the column names:
print (df1.columns[df1.loc['2022-01-04'].gt(0)].tolist())
['01G', '02G', '04G']
For your special case, it seems we can also filter using the row values directly after changing dtype to bool:
out = df1.columns[df1.loc['2022-01-04'].astype(bool)].tolist()
Output:
['01G', '02G', '04G']
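If you need the list for every date rather than a single one (an assumption beyond the original question), one sketch is to apply the same idea row-wise:
# For each date, collect the names of the columns whose value is > 0.
names_per_date = df1.gt(0).apply(lambda row: row.index[row].tolist(), axis=1)
print(names_per_date.loc['2022-01-04'])  # ['01G', '02G', '04G']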

How to find rows of a dataframe containing a date value?

There is a huge dataframe containing multiple data types in different columns. I want to find rows that contain date values in different columns.
Here is a test dataframe:
import pandas as pd
import numpy as np
from datetime import datetime
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1, 2, 3, 4, 5, 6, 7])
df = pd.DataFrame({"A": dt.values, "B": dt1.values, "C": dt3.values})
Now, I want to create a new dataframe that contains only the rows with date values in both columns A and B, here the 2nd and last rows.
Expected output:
A B C
1 2020-06-01 16:58:17.274311 2020-06-01 17:13:20.391394 2
6 2020-05-05 2020-05-25 7
What is the best way to do that? Thanks.
P.S.> Dates can be in any standard format.
Use:
m = df[['A', 'B']].transform(pd.to_datetime, errors='coerce').isna().any(axis=1)
df = df[~m]
Result:
# print(df)
A B C
1 2020-06-01 17:54:16.377722 2020-06-01 17:54:16.378432 2
6 2020-05-05 2020-05-25 7
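To see why this works (my reading of the answer above, applied to the original df): to_datetime with errors='coerce' turns anything it cannot parse into NaT, so isna().any(axis=1) flags every row where A or B holds a non-date:
# Intermediate step (sketch): non-dates become NaT after coercion,
# and the resulting mask marks the rows to drop.
coerced = df[['A', 'B']].transform(pd.to_datetime, errors='coerce')
print(coerced.isna().any(axis=1))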
A solution that tests only the A and B columns is boolean indexing with DataFrame.notna and DataFrame.all, so any row containing a non-datetime is dropped:
df = df[df[['A','B']].apply(pd.to_datetime, errors='coerce').notna().all(axis=1)]
print (df)
A B C
1 2020-06-01 16:14:35.020855 2020-06-01 16:14:35.021855 2
6 2020-05-05 2020-05-25 7
import pandas as pd
import numpy as np
from datetime import datetime
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1, 2, 3, 4, 5, 6, 7])
df = pd.DataFrame({"A": dt.values, "B": dt1.values, "C": dt3.values})
m = pd.concat([pd.to_datetime(df['A'], errors='coerce'),
               pd.to_datetime(df['B'], errors='coerce')], axis=1).isna().all(axis=1)
print(df[~m])
Prints:
A B C
1 2020-06-01 12:17:51.320286 2020-06-01 12:17:51.320826 2
6 2020-05-05 2020-05-25 7

Separate rows in a dataframe based on custom value?

I have a df with two columns a and b.
import pandas as pd
raw_data = {'a': ['2019145236792', 'abc_def date_1220', '2020124832852', 'jhi_abc this_1219_abc'],
            'b': ['tom', 'john', 'mark', 'jim']}
df = pd.DataFrame(raw_data, columns=['a', 'b'])
df
a b
0 2019145236792 tom
1 abc_def date_1220 john
2 2020124832852 mark
3 jhi_abc this_1219_abc20 jim
I want to separate out only the data that contains 20. The position of 20 won't change.
e.g. 2020124832852 and abc_def date_1220
Expected output:
a b
0 abc_def date_1220 john
1 2020124832852 mark
Use boolean indexing: compare the characters at positions 2-3 with Series.eq, chained by | (bitwise OR) with a second mask that uses Series.str.extract to test the values after date_:
m1 = df['a'].str[2:4].eq('20')
m2 = df['a'].str.extract('date_(.*)', expand=False).str[2:4].eq('20')
df = df[m1 | m2]
print (df)
a b
1 abc_def date_1220 john
2 2020124832852 mark
EDIT: an alternative for the second mask, splitting on underscores instead of extracting:
m2 = df['a'].str.split('_', n=2).str[2].str[2:4].eq('20')
You can use a list comprehension to get the wanted rows, but you have to specify the required start positions of '20':
import re
req_pos = {2, 15}
df[[any(e.start() in req_pos for e in re.finditer('20', s)) for s in df.a]]
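A small check (my addition) of where the '20' substrings actually start in the example values, which is presumably how the positions in req_pos were chosen:
import re
# Print every start position of '20' in each value of column a,
# e.g. '2020124832852' -> [0, 2] and 'abc_def date_1220' -> [15].
for s in df.a:
    print(s, [m.start() for m in re.finditer('20', s)])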

Keep pandas DataFrame rows in df2 for each row in df1 with timedelta

I have two pandas dataframes. I would like to keep all rows in df2 where Type is equal to a Type in df1 AND Date is within that row's Date in df1 minus 1 day or plus 1 day. How can I do this?
df1
IBSN Type Date
0 1 X 2014-08-17
1 1 Y 2019-09-22
df2
IBSN Type Date
0 2 X 2014-08-16
1 2 D 2019-09-22
2 9 X 2014-08-18
3 3 H 2019-09-22
4 3 Y 2019-09-23
5 5 G 2019-09-22
res
IBSN Type Date
0 2 X 2014-08-16 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] - 1
1 9 X 2014-08-18 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] + 1
2 3 Y 2019-09-23 <-- keep because Type = df1[1]['Type'] AND Date = df1[1]['Date'] + 1
This should do it:
import pandas as pd
from datetime import timedelta
# create dummy data
df1 = pd.DataFrame([[1, 'X', '2014-08-17'], [1, 'Y', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df1['Date'] = pd.to_datetime(df1['Date']) # might not be necessary if your Date column already contain datetime objects
df2 = pd.DataFrame([[2, 'X', '2014-08-16'], [2, 'D', '2019-09-22'], [9, 'X', '2014-08-18'], [3, 'H', '2019-09-22'], [3, 'Y', '2014-09-23'], [5, 'G', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date']) # might not be necessary if your Date column already contain datetime objects
# add date boundaries to the first dataframe
df1['Date_from'] = df1['Date'].apply(lambda x: x - timedelta(days=1))
df1['Date_to'] = df1['Date'].apply(lambda x: x + timedelta(days=1))
# merge the date boundaries to df2 on 'Type'. Filter rows where date is between
# data_from and date_to (inclusive). Drop 'date_from' and 'date_to' columns
df2 = df2.merge(df1.loc[:, ['Type', 'Date_from', 'Date_to']], on='Type', how='left')
df2[(df2['Date'] >= df2['Date_from']) & (df2['Date'] <= df2['Date_to'])].\
    drop(['Date_from', 'Date_to'], axis=1)
Note that, according to your logic, row 4 in df2 (3 Y 2014-09-23) should not remain, as its date (year 2014) is not between the given dates in df1 (year 2019).
Assume the Date columns in both dataframes are already of datetime dtype. I would construct an IntervalIndex and assign it as the index of df1, map df2's dates through it to df1's Type, and finally check equality to create the mask for slicing:
iix = pd.IntervalIndex.from_arrays(df1.Date + pd.Timedelta(days=-1),
                                   df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)
s = df2['Date'].map(df1.Type)
df_final = df2[df2.Type == s]
Out[1131]:
IBSN Type Date
0 2 X 2014-08-16
2 9 X 2014-08-18
4 3 Y 2019-09-23
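To make the mapping step concrete, here is a small sketch (my addition) of the intermediate Series for the example data:
# s holds, for each row of df2, the Type of the df1 row whose +/- 1 day
# window contains that row's date (here roughly X, Y, X, Y, Y, Y);
# rows where it equals df2.Type (0, 2 and 4) are the ones kept.
print(s)
print(df2.Type == s)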
