Pandas: df_left.merge(df_right) Summary Statistics

Regarding the Pandas df.merge() method, is there a convenient way to obtain merge summary statistics (such as the number of matched and unmatched rows)? I know these stats depend on the how='inner' flag, but it would be handy to know how much is being 'discarded' when using an inner join. I could simply use:
df = df_left.merge(df_right, on='common_column', how='inner')
set1 = set(df_left['common_column'].unique())
set2 = set(df_right['common_column'].unique())
set1.issubset(set2) #True No Further Analysis Required
set2.issubset(set1) #False
num_shared = len(set2.intersection(set1))
num_diff = len(set2.difference(set1))
# And So on ...
But I thought this might already be implemented. Have I missed it (i.e. something like report=True for merge, which would return the new dataframe plus a report series or dataframe)?
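For reference, a minimal sketch of the closest built-in route, assuming a pandas version that supports the indicator argument to merge (the column name is the one from the question):
import pandas as pd

# indicator=True adds a '_merge' column with values 'left_only', 'right_only', 'both'
merged = df_left.merge(df_right, on='common_column', how='outer', indicator=True)
print(merged['_merge'].value_counts())  # merge summary counts

# keep only the matched (inner-join) rows afterwards, if that is what you need
df = merged[merged['_merge'] == 'both'].drop('_merge', axis=1)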

Try this function... You can then just pass your arguments into it like this:
df = merge_like_stata(df1, df2, mergevars)
Function definition:
def merge_like_stata(master, using, mergevars):
    master['_master_merge_'] = 'master'
    using['_using_merge_'] = 'using'
    df = pd.merge(master, using, on=mergevars, how='outer')
    df['_master_merge_'] = df['_master_merge_'].apply(lambda x: 'miss' if pd.isnull(x) else x)
    df['_using_merge_'] = df['_using_merge_'].apply(lambda x: 'miss' if pd.isnull(x) else x)
    df['_merge'] = df.apply(lambda row: '3 - Matched' if row['_master_merge_'] == 'master' and row['_using_merge_'] == 'using' else None, axis=1)
    df['_merge'] = df.apply(lambda row: '2 - Master Only' if row['_master_merge_'] == 'master' and row['_using_merge_'] == 'miss' else row['_merge'], axis=1)
    df['_merge'] = df.apply(lambda row: '1 - Using Only' if row['_master_merge_'] == 'miss' and row['_using_merge_'] == 'using' else row['_merge'], axis=1)
    df['column'] = 'Count'
    df = df.drop(['_master_merge_', '_using_merge_'], axis=1)
    print(pd.crosstab(df._merge, df.column, margins=True))  # merge summary, Stata-style
    return df

This is what I use so far.
It is part of a function that concords data from one coding system to another.
if report == True:
    report_df = pd.DataFrame(data[match_on].describe(), columns=['left'])
    report_df = report_df.merge(pd.DataFrame(concord[match_on].describe(), columns=['right']), left_index=True, right_index=True)
    set_left = set(data[match_on])
    set_right = set(concord[match_on])
    set_info = pd.DataFrame({'left': set_left.issubset(set_right), 'right': set_right.issubset(set_left)}, index=['subset'])
    report_df = report_df.append(set_info)
    set_info = pd.DataFrame({'left': len(set_left.difference(set_right)), 'right': len(set_right.difference(set_left))}, index=['differences'])
    report_df = report_df.append(set_info)
    # Show up to 5 of the differing values from each side (padded with NaN)
    left_diff = list(set_left.difference(set_right))[0:5]
    if len(left_diff) < 5:
        left_diff = (left_diff + [np.nan] * 5)[0:5]
    right_diff = list(set_right.difference(set_left))[0:5]
    if len(right_diff) < 5:
        right_diff = (right_diff + [np.nan] * 5)[0:5]
    set_info = pd.DataFrame({'left': left_diff, 'right': right_diff}, index=['diff1', 'diff2', 'diff3', 'diff4', 'diff5'])
    report_df = report_df.append(set_info)
Sample Report

Related

Conditional method chaining in pandas

Is there a simple, general way to make a step in a pandas method chain conditional on an if-statement?
Mock example:
df = pd.DataFrame({'A':['one', 'two'], 'B':['one', 'two']})
change_to_numeric = False
df = (df
      .query("A == 'one'")
      .replace('one', 1)  # <-- execute this row only if change_to_numeric == True
      )
Thank you!
You can use pipe:
df = pd.DataFrame({'A':['one', 'two'], 'B':['one', 'two']})
change_to_numeric = False
df = (df
      .query("A == 'one'")
      .pipe(lambda d: d.replace('one', 1) if change_to_numeric else d)
      )
output for change_to_numeric = False:
A B
0 one one
output for change_to_numeric = True:
A B
0 1 1

pandas dataframe vectorize for loop with logical statements

I want to vectorize the following code in Python. The dataframe has 50,000 rows, so this loop is too slow and I need a vectorized version:
for i in range(1, len(df)):
    if df['temperature'].iloc[i] > df['temperature'].iloc[i-1]:
        df['delta'].iloc[i] = df['qty'].iloc[i]
        df['value'].iloc[i] = 1
    elif df['temperature'].iloc[i] < df['temperature'].iloc[i-1]:
        df['delta'].iloc[i] = -1 * df['qty'].iloc[i]
        df['value'].iloc[i] = -1
    elif df['temperature'].iloc[i] == df['temperature'].iloc[i-1]:
        df['delta'].iloc[i] = df['value'].iloc[i-1] * df['qty'].iloc[i]
        df['value'].iloc[i] = df['value'].iloc[i-1]
I expect this will do the job, but without input and expected output to compare with, I can't check:
gt_idx = df['temperature'] > df['temperature'].shift(1)   # current row vs previous row
df.loc[gt_idx, 'delta'] = df.loc[gt_idx, 'qty']
df.loc[gt_idx, 'value'] = 1
lt_idx = df['temperature'] < df['temperature'].shift(1)
df.loc[lt_idx, 'delta'] = df.loc[lt_idx, 'qty'] * -1
df.loc[lt_idx, 'value'] = -1
eq_idx = df['temperature'] == df['temperature'].shift(1)
# note: this reads the previous row's 'value' via shift, so a run of equal temperatures
# will not propagate newly assigned values the way the original loop does
df.loc[eq_idx, 'delta'] = df['value'].shift(1).loc[eq_idx] * df.loc[eq_idx, 'qty']
df.loc[eq_idx, 'value'] = df['value'].shift(1).loc[eq_idx]
You can try using np.select(), as follows:
cond_list = [df['temperature'] > df['temperature'].shift(),
             df['temperature'] < df['temperature'].shift(),
             df['temperature'] == df['temperature'].shift()]

# 'value' is +1 on a rise and -1 on a fall; when the temperature is unchanged the loop
# carries the previous value forward, which ffill() reproduces here
df['value'] = pd.Series(np.select(cond_list[:2], [1, -1], default=np.nan), index=df.index).ffill()

choice_list = [df['qty'],                # rise:      +qty
               -df['qty'],               # fall:      -qty
               df['value'] * df['qty']]  # unchanged: previous value * qty

df['delta'] = np.select(cond_list, choice_list, default=np.nan)
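Since the question does not include sample data, here is a small, purely hypothetical frame that can be used to sanity-check the vectorized approaches against the loop's logic (the values are made up):
import numpy as np
import pandas as pd

# temperature rises, stays flat, then falls
df = pd.DataFrame({'temperature': [20, 21, 21, 19],
                   'qty': [5, 3, 4, 2],
                   'delta': np.nan, 'value': np.nan})

# expected result, following the loop's logic:
#    temperature  qty  delta  value
# 0           20    5    NaN    NaN
# 1           21    3    3.0    1.0
# 2           21    4    4.0    1.0
# 3           19    2   -2.0   -1.0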

Apply for loop in multiple dataframe for multiple columns?

The dataframes are shown below. I want to change a value to 'dead' if the age is more than 100.
import pandas as pd
raw_data = {'age1': [23,45,210],'age2': [10,20,150],'name': ['a','b','c']}
df = pd.DataFrame(raw_data, columns = ['age1','age2','name'])
raw_data = {'age1': [80,90,110],'age2': [70,120,90],'name': ['a','b','c']}
df2 = pd.DataFrame(raw_data, columns = ['age1','age2','name'])
Desired outcome
df=
age1 age2 name
0 23 10 a
1 45 20 b
2 dead dead c
df2=
age1 age2 name
0 80 70 a
1 90 dead b
2 dead 90 c
I was trying something like this:
col_list = ['age1', 'age2']
df_list = [df, df2]

def dead(df):
    for df in df_list:
        if df.columns in col_list:
            if df.columns >= 100:
                return 'dead'
            else:
                return df.columns

df.apply(dead)
Error shown:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am looking for a loop that works on all the dataframes.
Please also correct my function, for future learning :)
With the samples you have shown, please try the following, which uses pandas' filter and NumPy's np.where.
c = df.filter(regex=r'age\d+').columns
df[c] = np.where(df[c].ge(100), 'dead', df[c])
df
Alternative approach with where:
c = df.filter(like='age').columns
df[c] = df[c].where(~df[c].ge(100), 'dead')
Explanation:
Get the columns whose names contain age into the variable c.
Then use np.where to check whether the age columns are greater than or equal to 100; if so, set the value to dead, otherwise keep it as it is.
I did the following:
col_list = ['age1', 'age2']
df_list = [df, df2]
for d in df_list:
    for c in col_list:
        d.loc[d[c] > 100, c] = 'dead'
# inspired by @jib and @ravinder
col_list = ['age1', 'age2']
df_list = [df, df2]
for d in df_list:
    for c in col_list:
        d[c] = np.where(d[c] > 100, 'dead', d[c])
df  # or df2
output:
age1 age2 name
0 23 10 a
1 45 20 b
2 dead dead c
One possible solution is to use Pandas' mask, which is similar to if-else, but vectorized.
def dead(df):
    col_list = ['age1', 'age2']
    df = df.copy()
    temporary = df.filter(col_list)
    temporary = temporary.mask(temporary >= 100, "dead")
    df.loc[:, col_list] = temporary
    return df
Apply function to the dataframe:
df.pipe(dead)
age1 age2 name
0 23 10 a
1 45 20 b
2 dead dead c
You can do:
def check_more_than_100(x):
    v = None
    try:
        v = int(x)
    except:
        pass
    if v is not None:
        return v > 100
    return False
df['age1'] = df['age1'].apply(lambda x : 'dead' if check_more_than_100(x) else x)
df['age2'] = df['age2'].apply(lambda x : 'dead' if check_more_than_100(x) else x)
df2['age1'] = df2['age1'].apply(lambda x : 'dead' if check_more_than_100(x) else x)
df2['age2'] = df2['age2'].apply(lambda x : 'dead' if check_more_than_100(x) else x)
This should take care of non-int values if any.
I used this answer to a similar question. Basically, you can use np.where() from NumPy to set values based on a condition.
import pandas as pd
import numpy as np
raw_data = {'age1': [23,45,210],'age2': [10,20,150],'name': ['a','b','c']}
df = pd.DataFrame(raw_data, columns = ['age1','age2','name'])
raw_data = {'age1': [80,90,110],'age2': [70,120,90],'name': ['a','b','c']}
df2 = pd.DataFrame(raw_data, columns = ['age1','age2','name'])
col_list=['age1','age2']
df_list=[df,df2]
def dead(df_list, col_list):
    for df in df_list:
        for col in col_list:
            df[col] = np.where(df[col] >= 100, "dead", df[col])
    return df_list
dead([df], col_list)
df
Extracting numeric columns and then using numpy where -
df_cols = df._get_numeric_data().columns.values
df2_cols = df2._get_numeric_data().columns.values
df[df_cols] = np.where(df[df_cols].to_numpy() > 100, 'dead', df[df_cols])
df2[df2_cols] = np.where(df2[df2_cols].to_numpy() > 100, 'dead', df2[df2_cols])

Improving data validation efficiency in Pandas

I load data from a CSV into Pandas and do validation on some of the fields like this:
(1.5s) loans['net_mortgage_margin'] = loans['net_mortgage_margin'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['current_interest_rate'] = loans['current_interest_rate'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['net_maximum_interest_rate'] = loans['net_maximum_interest_rate'].map(lambda x: convert_to_decimal(x))
(48s) loans['credit_score'] = loans.apply(lambda row: get_minimum_score(row), axis=1)
(< 1s) loans['loan_age'] = ((loans['factor_date'] - loans['first_payment_date']) / np.timedelta64(+1, 'M')).round() + 1
(< 1s) loans['months_to_roll'] = ((loans['next_rate_change_date'] - loans['factor_date']) / np.timedelta64(+1, 'M')).round() + 1
(34s) loans['first_payment_change_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_change_date', loans.columns), axis=1)
(37s) loans['first_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'first_rate_change_date', loans.columns), axis=1)
(39s) loans['first_payment_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_date', loans.columns), axis=1)
(39s) loans['maturity_date'] = loans.apply(lambda x: validate_date(x, 'maturity_date', loans.columns), axis=1)
(37s) loans['next_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'next_rate_change_date', loans.columns), axis=1)
(36s) loans['first_PI_date'] = loans.apply(lambda x: validate_date(x, 'first_PI_date', loans.columns), axis=1)
(36s) loans['servicer_name'] = loans.apply(lambda row: row['servicer_name'][:40].upper().strip(), axis=1)
(38s) loans['state_name'] = loans.apply(lambda row: str(us.states.lookup(row['state_code'])), axis=1)
(33s) loans['occupancy_status'] = loans.apply(lambda row: get_occupancy_type(row), axis=1)
(37s) loans['original_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'original'), axis=1)
(36s) loans['current_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'current'), axis=1)
(33s) loans['valid_credit_score'] = loans.apply(lambda row: validate_credit_score(row), axis=1)
(60s) loans['origination_year'] = loans['first_payment_date'].map(lambda x: x.year if x.month > 2 else x.year - 1)
(< 1s) loans['number_of_units'] = loans['unit_count'].map(lambda x: '1' if x == 1 else '2-4')
(32s) loans['property_type'] = loans.apply(lambda row: validate_property_type(row), axis=1)
Most of these are functions that use values from the row; a few directly convert an element to something else, but all in all they are run over the entire dataframe row by row. When this code was written, the dataframes were small enough that this was not an issue. The code is now being adapted to take in significantly larger tables, and this part of it takes far too long.
What is the best way to optimize this? My first thought was to go row by row but apply all of these functions/transformations to the row at once (i.e. for each row in df, do func1, func2, ..., func21), though I'm not sure that is the best way to deal with it. Is there a way to avoid lambda and get the same result, since I assume it's the lambda-based apply that takes so long? I'm running Python 2.7, in case that matters.
Edit: most of these calls run at about the same rate per row (a few are pretty fast). This is a dataframe with 277,659 rows, which is in the 80th percentile in terms of size.
Edit2: example of a function:
def validate_date(row, date_type, cols):
    date_element = row[date_type]
    if date_type not in cols:
        return np.nan
    if pd.isnull(date_element) or len(str(date_element).strip()) < 2:  # can be blank, NaN, or "0"
        return np.nan
    if date_element.day == 1:
        return date_element
    else:
        next_month = date_element + relativedelta(months=1)
        return pd.to_datetime(dt.date(next_month.year, next_month.month, 1))
This is similar to the longest call (origination_year), which extracts values from a date object (year, month, etc.). Others, like property_type, are just checking for irregular values (e.g. "N/A", "NULL", etc.) but still take a while to go through each row.
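As a point of comparison, a minimal sketch of what a column-wise (vectorized) version of a few of these steps could look like, assuming the date columns are already parsed as datetimes; the roll-forward mirrors validate_date, but this is not the asker's code:
import numpy as np
import pandas as pd

def validate_date_vectorized(s):
    # blanks and junk become NaT; dates not on the 1st roll forward to the 1st of the next month
    s = pd.to_datetime(s, errors='coerce')
    rolled = s + pd.offsets.MonthBegin(1)
    return rolled.where(s.dt.day != 1, s)

loans['first_payment_date'] = validate_date_vectorized(loans['first_payment_date'])

# string cleanup without apply: .str methods operate on the whole column at once
loans['servicer_name'] = loans['servicer_name'].str[:40].str.upper().str.strip()

# origination_year without apply, via the .dt accessor and np.where
fp = loans['first_payment_date']
loans['origination_year'] = np.where(fp.dt.month > 2, fp.dt.year, fp.dt.year - 1)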
tl;dr: consider distributing the processing. An improvement would be reading the data in chunks and using multiple processes. Source: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
import multiprocessing as mp
import pandas as pd

def process_frame(df):
    # process each chunk here; as a placeholder, just count its rows
    return len(df)

if __name__ == "__main__":
    reader = pd.read_csv(csv_file, chunksize=CHUNKSIZE)  # csv_file and CHUNKSIZE defined elsewhere
    pool = mp.Pool(4)  # use 4 processes
    funclist = []
    for df in reader:
        # process each data frame asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
    print "There are %d rows of data" % (result)
Another option might be to use GNU parallel.
Here is another good example of using GNU parallel.

create new column from conditional statement without mask pandas

I am looking for a better way to do the following:
A
TRDNumber
ALB2008081610 430
ALB200808167 0
ALB200808168 190
I want to create a new column based on the value in another column, using a conditional statement:
A B
TRDNumber
ALB2008081610 430 z
ALB200808167 0 x
ALB200808168 190 y
The following code works but I know that there must be a better way to do this.
mask = df['A'] == 0
df20 = df[mask]
df20['B'] = 'x'
df20
mask2 = ((df.A != 0) & (df.A <= 200))
df21 = df[mask2]
df21['B'] = 'y'
df21
pieces = [df20,df21]
pd.concat(pieces)
I think you want to do the following:
#%%
df = pd.DataFrame()
df['A'] = pd.Series([430,0,190], index=['ALB2008081610', 'ALB200808167', 'ALB200808168'])
print(df)
#%%
df['B'] = None
print(df)
#%%
df.loc[(df.A==0), 'B'] = 'x'
print(df)
#%%
df.loc[(df.A!=0) & (df.A<=200), 'B'] = 'y'
print(df)
An explanation about indexing can be found here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
Tip for next time: provide the code for creating the dataframe. Then we can directly play around with the same dataframe you are using.
You can create a function and apply it to your dataset:
>>> def foo(x):
... if x['A'] == 0:
... return 'x'
... elif x['A'] < 200:
... return 'y'
... else:
... return 'z'
...
>>> df['B'] = df.apply(foo, axis=1)
>>> df
A B
TRDNumber
ALB2008081610 430 z
ALB200808167 0 x
ALB200808168 190 y
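A third option, not from the answers above, is np.select, which keeps the conditions and their labels side by side. A minimal sketch using the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [430, 0, 190]},
                  index=pd.Index(['ALB2008081610', 'ALB200808167', 'ALB200808168'], name='TRDNumber'))

# conditions are checked in order; rows matching none of them get the default ('z')
conditions = [df['A'] == 0, df['A'] <= 200]
df['B'] = np.select(conditions, ['x', 'y'], default='z')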
