Improving data validation efficiency in Pandas - python

I load data from a CSV into Pandas and do validation on some of the fields like this:
(1.5s) loans['net_mortgage_margin'] = loans['net_mortgage_margin'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['current_interest_rate'] = loans['current_interest_rate'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['net_maximum_interest_rate'] = loans['net_maximum_interest_rate'].map(lambda x: convert_to_decimal(x))
(48s) loans['credit_score'] = loans.apply(lambda row: get_minimum_score(row), axis=1)
(< 1s) loans['loan_age'] = ((loans['factor_date'] - loans['first_payment_date']) / np.timedelta64(+1, 'M')).round() + 1
(< 1s) loans['months_to_roll'] = ((loans['next_rate_change_date'] - loans['factor_date']) / np.timedelta64(+1, 'M')).round() + 1
(34s) loans['first_payment_change_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_change_date', loans.columns), axis=1)
(37s) loans['first_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'first_rate_change_date', loans.columns), axis=1)
(39s) loans['first_payment_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_date', loans.columns), axis=1)
(39s) loans['maturity_date'] = loans.apply(lambda x: validate_date(x, 'maturity_date', loans.columns), axis=1)
(37s) loans['next_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'next_rate_change_date', loans.columns), axis=1)
(36s) loans['first_PI_date'] = loans.apply(lambda x: validate_date(x, 'first_PI_date', loans.columns), axis=1)
(36s) loans['servicer_name'] = loans.apply(lambda row: row['servicer_name'][:40].upper().strip(), axis=1)
(38s) loans['state_name'] = loans.apply(lambda row: str(us.states.lookup(row['state_code'])), axis=1)
(33s) loans['occupancy_status'] = loans.apply(lambda row: get_occupancy_type(row), axis=1)
(37s) loans['original_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'original'), axis=1)
(36s) loans['current_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'current'), axis=1)
(33s) loans['valid_credit_score'] = loans.apply(lambda row: validate_credit_score(row), axis=1)
(60s) loans['origination_year'] = loans['first_payment_date'].map(lambda x: x.year if x.month > 2 else x.year - 1)
(< 1s) loans['number_of_units'] = loans['unit_count'].map(lambda x: '1' if x == 1 else '2-4')
(32s) loans['property_type'] = loans.apply(lambda row: validate_property_type(row), axis=1)
Most of these are functions that look up values from the row, and a few directly convert an element to something else, but all in all they run over the entire dataframe row by row. When this code was written, the data frames were small enough that this was not an issue. The code is now being adapted to take in significantly larger tables, so this part of the code takes far too long.
What is the best way to optimize this? My first thought was to go row by row but apply all of these functions/transformations to each row at once (i.e. for each row in df, do func1, func2, ..., func21), but I'm not sure that is the best approach. Is there a way to avoid lambda and get the same result, for example, since I assume the lambdas are what take so long? Running Python 2.7, in case that matters.
Edit: most of these calls run at about the same rate per row (a few are pretty fast). This is a dataframe with 277,659 rows, which is in the 80th percentile in terms of size.
Edit2: example of a function:
def validate_date(row, date_type, cols):
    if date_type not in cols:
        return np.nan
    date_element = row[date_type]
    if pd.isnull(date_element) or len(str(date_element).strip()) < 2:  # can be blank, NaN, or "0"
        return np.nan
    if date_element.day == 1:
        return date_element
    else:
        next_month = date_element + relativedelta(months=1)
        return pd.to_datetime(dt.date(next_month.year, next_month.month, 1))
This is similar to the longest call (origination_year), which extracts values from a date object (year, month, etc.). Others, like property_type, just check for irregular values (e.g. "N/A", "NULL", etc.) but still take a while because they go through every row.
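As a point of comparison, the slowest call above (origination_year) can be expressed without any per-row Python by using the vectorized .dt accessor; a minimal sketch, assuming first_payment_date is already parsed (or parseable) as datetimes:
import numpy as np
import pandas as pd

# vectorized equivalent of the per-row origination_year lambda
fpd = pd.to_datetime(loans['first_payment_date'])
loans['origination_year'] = np.where(fpd.dt.month > 2, fpd.dt.year, fpd.dt.year - 1)
The same idea extends to the other heavy calls: the day/month arithmetic in validate_date can be done column-wise with the .dt accessor, and the irregular-value checks in property_type can use vectorized .str methods or .isin.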

tl;dr: Consider distributing the processing. An improvement would be reading the data in chunks and using multiple processes. Source: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
import multiprocessing as mp
import pandas as pd

CHUNKSIZE = 10 ** 5  # rows per chunk; tune to your memory budget

def process_frame(df):
    # placeholder: run your validations/transformations here
    return len(df)

if __name__ == "__main__":
    reader = pd.read_csv("csv-file", chunksize=CHUNKSIZE)  # path to your CSV
    pool = mp.Pool(4)  # use 4 processes
    funclist = []
    for df in reader:
        # process each chunk asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
    print "There are %d rows of data" % result
Another option might be to use GNU parallel; there is another good example of using GNU parallel for this kind of chunked work.

Related

python: groupby (merge) next lines with previous line if they start with same match pattern in text data

I have a very large file.txt with data groups (AAA-(n)). Many lines between AAA-(n) and AAA-(n+1) share the same tag (e.g. AB), and I want to merge them into one line. For example:
AAA-1
XX-a
AB-a
AB-b
AB-c
numb-a
lime-a
lime-b
XX-b
AB-d
AB-e
lime-c
AAA-2
.
.
AAA-n
.
.
My desired output is:
AAA-1
XX-a
AB-a;b;c
numb-a
lime-a;b
XX-b
AB-d;e
lime-c
AAA-2
.
.
.
AAA-n
.
.
I tried this:
from itertools import groupby, count

counter = count()
with open('file.txt') as f:
    for key, group in groupby(f, lambda s: next(counter) if s.startswith('AAA') or s.startswith('XX') else -1):
        print(';'.join(s.rstrip('\n') for s in group))
Out:
AAA-1
XX-a
AB-a;AB-b;AB-c;numb-a;lime-a;lime-b
XX-b
AB-d;AB-e;lime-c
AAA-2
How can I avoid grouping the other tags in with AB, and remove the repeated tags after grouping?
EDIT: Updated to get correct output
Here is what I came up with:
df = pd.DataFrame.from_dict({'data': dat})
df['data'] = df['data'].str.split('-')
df['tag'] = [x[0] for x in df['data']]
df['tail'] = [x[1] for x in df['data']]
i = 0
while i < (len(df) - 1):
    tails = [df.iat[i, 2]]
    j = 1
    while i + j < len(df) and df.iat[i, 1] == df.iat[i + j, 1]:
        tails.append(df.iat[i + j, 2])
        j += 1
    df.loc[i, 'tails'] = tails
    i += j
df.dropna(how='any', axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.drop(columns=['data', 'tail'], inplace=True)
df['final'] = [df.at[i, 'tag'] + '-' + ';'.join(df.at[i, 'tails']) for i in range(len(df))]
Output:
First approach that comes to mind would be to split the trailing character from the tag and place in a separate column. Assuming you're using Pandas and it's in a DF already:
df['data'] = df['data'].str.split('-')
df['tag'] = [x[0] for x in df['data']]
df['tail'] = [x[1] for x in df['data']]
So now you have a column with the original data, a column with the tag, and a column with the tail.
Now you can group by the tag:
grouped = df.groupby('tag')
From here you can achieve what you want using a lambda function:
out = grouped.agg({'tail': lambda t: ''.join(t)})
From here you can reset the index, join into a single string with a dash, whatever you want.
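If the blocks must keep their original order (as in the desired output), a consecutive grouping with itertools.groupby may be closer to the goal; a minimal sketch, not from the original answer, assuming every data line has the form tag-tail:
from itertools import groupby

def tag(line):
    # the tag is everything before the first '-'
    return line.split('-', 1)[0]

with open('file.txt') as f:
    lines = [ln.rstrip('\n') for ln in f if ln.strip()]

out = []
for key, grp in groupby(lines, key=tag):
    grp = list(grp)
    # keep the first line whole, append only the tails of the following lines
    tails = [g.split('-', 1)[1] for g in grp[1:]]
    out.append(grp[0] + ';' + ';'.join(tails) if tails else grp[0])

print('\n'.join(out))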

How can I split rows without spaces in a txt file to build a dataframe in python?

I have this txt:
1989MaiteyCarlos
2015mamasypadres
And I have code that separates the words and generates different columns for the DataFrame.
The code is:
txt1=pd.read_table(r'C:\Users\TOSHIBA\Desktop\prueba.txt',header=None)
txt1['anno'] = txt1[0].apply(lambda x: x[:4])
txt1['chica'] = txt1[0].apply(lambda x: x[4:9])
txt1['chico'] = txt1[0].apply(lambda x: x[10:])
I need a general function to solve the problem. I tried this code:
def read_txt(df, columnas, rangos):
    for i, j in zip(columnas, rangos):
        for k in j:
            df[i] = df[0].apply(lambda x: x[k])
    return df
But the result was wrong.
How can I write this function?
I solved the problem.
The function I used is:
def read_txt(df, columnas, rangos):
    for i, j in zip(columnas, rangos):
        df[i] = df[0].apply(lambda x: x[j[0]:j[1]])
    return df

data = read_txt(txt1, ['anno', 'chica', 'chico'], [[0, 4], [4, 9], [10, 16]])
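Since the fields are fixed-width, another option (a sketch, not from the original post) is to let pandas.read_fwf do the slicing directly with the same column ranges:
import pandas as pd

# read the fixed-width file directly; each colspec is a half-open (start, end) range
txt1 = pd.read_fwf(r'C:\Users\TOSHIBA\Desktop\prueba.txt',
                   colspecs=[(0, 4), (4, 9), (10, 16)],
                   names=['anno', 'chica', 'chico'],
                   header=None)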

Applying many functions to the same column

I have a column in the dataframe on which I apply many functions. For example,
df[col_name] = df[col_name].apply(lambda x: fun1(x))
df[col_name] = df[col_name].apply(lambda x: fun2(x))
df[col_name] = df[col_name].apply(lambda x: fun3(x))
I have 10 functions that I apply to this column for preprocessing and cleaning. Is there a way I can refactor this code or make the block of code smaller?
How about
def fun(x):
    for f in (fun1, fun2, fun3):
        x = f(x)
    return x

df[col_name] = df[col_name].apply(fun)
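If an inline form is preferred, the same composition can be written with functools.reduce (a sketch along the same lines, not from the original answer):
from functools import reduce

funcs = (fun1, fun2, fun3)  # the cleaning functions, applied left to right
df[col_name] = df[col_name].apply(lambda x: reduce(lambda acc, f: f(acc), funcs, x))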

How to quickly derive new features (pandas) shifted by one period or n periods at the same time? (Performance issue)

I have a dataframe with some continuous features (about 14), and I need to derive 14 more (each shifted by one hour) for up to n hours back.
Suppose I need up to 6 hours back; then I will have 84 more columns (14*6).
For example, prcp (precipitation) produces prcp_1, prcp_2, ..., prcp_6, and the same goes for the other variables.
I was using this function:
def derive_nth_hour_feature(df, feature, N):
    rows = df.shape[0]
    nth_prior_measurements = [None] * N + [df[feature][i - N] for i in range(N, rows)]
    col_name = "{}_{}".format(feature, N)
    df[col_name] = np.nan
    df.loc[:][col_name] = nth_prior_measurements
NON_DER = ['wsid', 'elvt', 'lat', 'lon', 'yr', 'mo', 'da', 'hr']
for feature in dfm.columns:
    if feature not in NON_DER:
        for h in range(1, 7):
            derive_nth_hour_feature(dfm, feature, h)
This works fine when the table has few records, around 100,000, but I have millions of records (~10 million).
The expected result looks like this (deriving prcp, stp and temp over 3 periods):
My main problem is performance; it spends a lot of time processing, roughly 14 * 6 * rows operations, so I need another approach.
Any suggestion?
Here is a partial dataset (with one weather station): wsid_329.csv.zip
I solved my performance issue with a custom function based on DataFrame.shift:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq.
This is my function:
def df_derived_by_shift(df, lag=0, NON_DER=[]):
    df = df.copy()
    if not lag:
        return df
    cols = {}
    for i in range(1, lag + 1):
        for x in list(df.columns):
            if x not in NON_DER:
                if x not in cols:
                    cols[x] = ['{}_{}'.format(x, i)]
                else:
                    cols[x].append('{}_{}'.format(x, i))
    for k, v in cols.items():
        columns = v
        dfn = pd.DataFrame(data=None, columns=columns, index=df.index)
        i = 1
        for c in columns:
            dfn[c] = df[k].shift(periods=i)
            i += 1
        df = pd.concat([df, dfn], axis=1, join_axes=[df.index])
    return df
An example of use:
NON_DER = ['wsid','elvt','lat', 'lon', 'yr', 'mo', 'da', 'hr']
r = df_derived_by_shift(dfm,3,NON_DER)
r.head(3)
The result:
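A more compact variant of the same shift-based idea (a sketch, not the original answer) builds all the lagged columns in one pass and concatenates once:
import pandas as pd

def derive_by_shift(df, lag, non_der=()):
    # build every lagged column with the vectorized Series.shift, then concat once
    lagged = {
        '{}_{}'.format(col, i): df[col].shift(i)
        for col in df.columns if col not in non_der
        for i in range(1, lag + 1)
    }
    return pd.concat([df, pd.DataFrame(lagged, index=df.index)], axis=1)

# usage, mirroring the example above:
# r = derive_by_shift(dfm, 3, NON_DER)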

Pandas: df_left.merge(df_right) Summary Statistics

With regard to the Pandas df.merge() method, is there a convenient way to obtain the merge summary statistics (such as number of rows matched, number not matched, etc.)? I know these stats depend on the how='inner' flag, but it would be handy to know how much is being 'discarded' when using an inner join etc. I could simply use:
df = df_left.merge(df_right, on='common_column', how='inner')
set1 = set(df_left[common_column].unique())
set2 = set(df_right[common_column].unique())
set1.issubset(set2) #True No Further Analysis Required
set2.issubset(set1) #False
num_shared = len(set2.intersection(set1))
num_diff = len(set2.difference(set1))
# And So on ...
But I thought this might be implemented already. Have I missed it (i.e. something like report=True for merge, which would return the new dataframe plus a report series or dataframe)?
Try this function... You can then just pass your arguments into it like this:
df = merge_like_stata(df1, df2, mergevars)
Function definition:
def merge_like_stata(master, using, mergevars):
    master['_master_merge_'] = 'master'
    using['_using_merge_'] = 'using'
    df = pd.merge(master, using, on=mergevars, how='outer')
    df['_master_merge_'] = df['_master_merge_'].apply(lambda x: 'miss' if pd.isnull(x) else x)
    df['_using_merge_'] = df['_using_merge_'].apply(lambda x: 'miss' if pd.isnull(x) else x)
    df['_merge'] = df.apply(lambda row: '3 - Matched' if row['_master_merge_'] == 'master' and row['_using_merge_'] == 'using' else None, axis=1)
    df['_merge'] = df.apply(lambda row: '2 - Master Only' if row['_master_merge_'] == 'master' and row['_using_merge_'] == 'miss' else row['_merge'], axis=1)
    df['_merge'] = df.apply(lambda row: '1 - Using Only' if row['_master_merge_'] == 'miss' and row['_using_merge_'] == 'using' else row['_merge'], axis=1)
    df['column'] = "Count"
    df = df.drop(['_master_merge_', '_using_merge_'], axis=1)
    print(pd.crosstab(df._merge, df.column, margins=True))
    return df
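Note that newer pandas versions (0.17 and later) have this built in: passing indicator=True to merge adds a _merge column with the values 'left_only', 'right_only' and 'both', so the summary can be read off directly:
df = df_left.merge(df_right, on='common_column', how='outer', indicator=True)
print(df['_merge'].value_counts())  # counts of matched and unmatched rows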
This is what I use thus far.
This is part of a function that concords data from one coding system to another coding system.
if report == True:
    report_df = pd.DataFrame(data[match_on].describe(), columns=['left'])
    report_df = report_df.merge(pd.DataFrame(concord[match_on].describe(), columns=['right']), left_index=True, right_index=True)
    set_left = set(data[match_on])
    set_right = set(concord[match_on])
    set_info = pd.DataFrame({'left': set_left.issubset(set_right), 'right': set_right.issubset(set_left)}, index=['subset'])
    report_df = report_df.append(set_info)
    set_info = pd.DataFrame({'left': len(set_left.difference(set_right)), 'right': len(set_right.difference(set_left))}, index=['differences'])
    report_df = report_df.append(set_info)
    # Show up to 5 sample differences from each side
    left_diff = list(set_left.difference(set_right))[0:5]
    if len(left_diff) < 5:
        left_diff = (left_diff + [np.nan] * 5)[0:5]
    right_diff = list(set_right.difference(set_left))[0:5]
    if len(right_diff) < 5:
        right_diff = (right_diff + [np.nan] * 5)[0:5]
    set_info = pd.DataFrame({'left': left_diff, 'right': right_diff}, index=['diff1', 'diff2', 'diff3', 'diff4', 'diff5'])
    report_df = report_df.append(set_info)
Sample Report
