Currently, I have the following line, where I try to do a string match on a column of my pandas DataFrame:
input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]
However, this operation takes a lot of time. The size of the pandas df is: (8098977, 16).
Is there any way to optimize this particular operation?
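As a first, minimal sketch (assuming category is a single literal substring rather than a regex pattern): lowercase the column once and use a plain, non-regex containment check, which avoids both re.IGNORECASE and the regex engine:
col_lower = input_supplier['Category Level - 3'].str.lower()
input_supplier = input_supplier[col_lower.str.contains(category.lower(), regex=False)]
Whether this helps depends on the data, so time it against the original line; the answers that follow benchmark some alternatives.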
As Josh Friedlander said, it should be a little faster to add a column and then filter:
len(df3)
9599904
# Creating a column then filtering
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 6.525546073913574 seconds
just doing a str.contains:
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 11.464462518692017 seconds
It is about twice as fast to create a new column and filter on that than to filter directly with str.contains(). Note, though, that the second run also passes flags=re.IGNORECASE, so part of the difference may come from the case-insensitive match rather than from the approach itself.
Use the fast NumPy where and isin functions after converting the search-column values and the category-list values to lower case. If the column and/or category list contain non-strings, convert them to strings first. Note that isin does exact (whole-value) matching rather than the substring matching of str.contains. Delete the column label in the last line if you want to see all columns of the original dataframe indexed to the search results.
import numpy as np
import pandas as pd
import re
names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])
len(input_supplier)
10000000
category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']
Method 1 (note this method does not ignore case)
%%timeit
input_supplier['search'] = \
    input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]
4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Method 2
%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
    '|'.join(category), flags=re.IGNORECASE)]
5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NumPy method, ignoring case:
%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]
2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NumPy method when the match is case-sensitive:
%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]
623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
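Since np.isin only finds exact (whole-value) matches, if you need the substring semantics of str.contains while still ignoring case, a middle ground worth timing (a sketch, no benchmark here) is to lowercase the column once and search with a lowercase pattern, so the regex no longer needs flags=re.IGNORECASE:
pattern = '|'.join(c.lower() for c in category)
df5 = input_supplier[input_supplier['Category Level - 3'].str.lower().str.contains(pattern)]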
Related
I need a way to vectorize the eval statement that loops through the reference dataframe (df_ref), which holds strings that use iloc to index into the source df (df).
Here's a reprex of the problem:
import pandas as pd
dataValues = ['a','b','c']
df = pd.DataFrame({'values': dataValues})
df_list = ['df.iloc[0,0]','df.iloc[2,0]']
df_ref = pd.DataFrame({'ref':df_list})
# looping 10,000 times just to replicate the number of times this operation
# may run in a typical scenario
def help():
    for i in range(10000):
        for index, row in df_ref.iterrows():
            eval(df_ref.ref[index])
%timeit help()
Outputs:
2.84 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Before you respond with an alternative: my problem will always have some dynamic referencing that I have to replicate in Python, so a more straightforward route may not solve my particular problem.
Thanks for the help!
As commented, extract the indexes first, then slice into the whole thing:
def help1():
    indexes = df_ref['ref'].str.extract(r'iloc\[(\d+),\s*(\d+)\]').astype(int)
    return df.to_numpy()[indexes[0], indexes[1]]
and
help1()
> array(['a', 'c'], dtype=object)
%timeit -n 1000 help1()
> 1.11 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
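For reference, this is the intermediate indexes frame that str.extract produces (the integer column names 0 and 1 come from the two capture groups); slicing df.to_numpy() with those two columns then pulls out all referenced cells in one vectorised step:
df_ref['ref'].str.extract(r'iloc\[(\d+),\s*(\d+)\]').astype(int)
>    0  1
> 0  0  0
> 1  2  0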
I have a pandas DataFrame in which I want to search one column for numbers matching a pattern, extract them, and put them in a new column.
import pandas as pd
import regex as re
import numpy as np
data = {'numbers': ['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE',
                    '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate': [434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, that is possible too; it is only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# build a 40,000-row frame for the timings
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
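One caveat on the pattern itself: in '134.[A-Z]{2,}' the dot matches any character, so a value like '134XABBC' would also be caught. If a literal dot is intended, escape it (the findall call is unchanged):
pattern = r'134\.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)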
Suppose I have data for 50K shoppers and the products they bought. I want to count the number of times each user purchased product "a". value_counts seems to be the fastest way to calculate these types of numbers for a grouped pandas data frame. However, I was surprised at how much slower it was to calculate the purchase frequency for just one specific product (e.g., "a") using agg or apply. I could select a specific column from a data frame created using value_counts but that could be rather inefficient on very large data sets with lots of products.
Below is a simulated example where each customer purchases 10 times from a set of three products. At this size you can already see the speed difference between apply/agg and value_counts. Is there a better/faster way to extract information like this from a grouped pandas data frame?
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "col1": [f'c{j}' for i in range(10) for j in range(50000)],
    "col2": np.random.choice(["a", "b", "c"], size=500000, replace=True)
})
dfg = df.groupby("col1")
# value_counts is fast
dfg["col2"].value_counts().unstack()
# apply and agg are (much) slower
dfg["col2"].apply(lambda x: (x == "a").sum())
dfg["col2"].agg(lambda x: (x == "a").sum())
# much faster to do
dfg["col2"].value_counts().unstack()["a"]
EDIT:
Two great responses to this question. Given the starting point of an already grouped data frame, it seems there may not be a better/faster way to count the number of occurrences of a single level in a categorical variable than using (1) apply or agg with a lambda function or (2) using value_counts to get the counts for all levels and then selecting the one you need.
The groupby/size approach is an excellent alternative to value_counts. With a minor edit to Cainã Max Couto-Silva's answer, this would give:
dfg = df.groupby(['col1', 'col2'])
dfg.size().unstack(fill_value=0)["a"]
I assume there will be a trade-off at some point: if you have many levels, apply/agg or value_counts on an already grouped data frame may be faster than the groupby/size approach, which requires creating a newly grouped data frame. I'll post back when I have some time to look into that.
Thanks for the comments and answers!
This is still faster:
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
Tests:
%%timeit
pd.crosstab(df.col1, df.col2)
# > 712 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 165 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we expand the dataframe to 5 million rows:
df = pd.concat([df for _ in range(10)])
print(f'df.shape = {df.shape}')
# > df.shape = (5000000, 2)
print(f'{df.shape[0]:,} rows.')
# > 5,000,000 rows.
%%timeit
pd.crosstab(df.col1, df.col2)
# > 1.58 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 1.27 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 847 ms ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Filter before value_counts
df.loc[df.col2=='a','col1'].value_counts()['c0']
Also I think crosstab is 'faster' than groupby + value_counts
pd.crosstab(df.col1, df.col2)
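One detail to keep in mind with the filter-first approach: shoppers who never bought "a" drop out of the result entirely. If you need a full vector of counts, a reindex (a sketch reusing the simulated df above) puts them back with zeros:
counts_a = (
    df.loc[df.col2 == 'a', 'col1']
    .value_counts()
    .reindex(df['col1'].unique(), fill_value=0)
)
counts_a['c0']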
I am extracting maximum rainfall intensity for different durations using data with 5-minute rainfall totals. The code produces a list of max rainfall intensity for each duration (DURS). The code works but is slow when using data sets with 1,000,000+ rows. I am new to Pandas and I understand the apply() method is much faster than using a For loop but I do not know how to re-write a For loop using the apply() method.
Example of dataframe:
Value[mm] State of value
Date_Time
2020-01-01 00:00:00 1.0 5
2020-01-01 00:05:00 0.5 5
2020-01-01 00:10:00 4.0 5
2020-01-01 00:15:00 2.0 5
2020-01-01 00:20:00 2.0 5
2020-01-01 00:25:00 0.5 5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
for DUR in DURS:
    x = str(DUR) + ' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1['Value[mm]'].max() / DUR * 60
    print(a)
    lista.append(a)
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I re-write this using the apply() method?
Solution
It looks like apply doesn't suit here, since the functions you are applying to the groups are already vectorised methods from pandas' Essential Basic Functionality. Removing the for loop also doesn't look like a promising route for performance optimization, since there aren't many durations in your DURS list; the main cost is the grouping operation and the calculations on the groups, and there isn't much room for optimization there, at least in my opinion.
Create artificial data
import pandas as pd
df = pd.DataFrame({'Date_Time': ["2020-01-01 00:00:00",
                                 "2020-01-01 00:05:00",
                                 "2020-01-01 00:10:00",
                                 "2020-01-01 00:15:00",
                                 "2020-01-01 00:20:00",
                                 "2020-01-01 00:25:00"],
                   'Value[mm]': [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
                   'State of value': [5, 5, 5, 5, 5, 5]})
df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need to calculate our values:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
Compare the performance of different approaches
Grouping and looping
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR) + ' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max() / DUR * 60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling
The time boost is probably random here, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR) + "Min").sum().max()

l_a = [get_max_by_dur(dur) / dur * 60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling + Dask
Even though there's no way to properly vectorize this, you can still get some parallelization and optimization with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions=1)

def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR) + "Min").sum().max()

l_a = [(get_max_by_dur(dur) / dur * 60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Few words on apply and optimization
Usually you use apply to apply a function along an axis of a DataFrame. It is the substitute for looping through the rows or columns of the DataFrame yourself, but in reality apply is just a glorified loop with some extra functionality. So, when performance matters, you usually want to optimize your code roughly in this order:
1. Vectorization or Essential Basic Functionality (as you did)
2. Cython routines or numba
3. List comprehension
4. Apply method
5. Iteration
Illustration
Let's say you want to get the product of two columns (for this section, df refers to the full DataFrame with both Value[mm] and State of value, not the single series we reduced it to above).
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np
def multiply(Value, State):  # you may use a lambda here as well
    return Value * State

%timeit df["new_column"] = np.vectorize(multiply)(df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2). Cython or numba
It can be very useful if you have already written a loop: you can often just decorate it with @numba.jit and get a significant performance boost. It is also very helpful when you want to compute some iterative value that is difficult to vectorize.
Since the function we chose is a plain multiplication, you won't see any benefit compared to the usual apply (a small numba sketch follows the Cython timing below).
%%cython
cpdef double cython_multiply(double Value, double State):
    return Value * State
%timeit df["new_column"] = df.apply(lambda row: cython_multiply(row["Value[mm]"], row["State of value"]), axis=1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
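For completeness, a minimal numba sketch of the same idea (assuming numba is installed; multiply_arrays is just an illustrative name). It works on the underlying NumPy arrays rather than through apply, and, as with Cython, for a plain multiplication it is unlikely to beat the basic df.prod(axis=1):
import numba
import numpy as np

@numba.njit
def multiply_arrays(values, states):
    # plain elementwise loop, compiled by numba
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = values[i] * states[i]
    return out

df["new_column"] = multiply_arrays(
    df["Value[mm]"].to_numpy(dtype=np.float64),
    df["State of value"].to_numpy(dtype=np.float64),
)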
3). List comprehension.
It's pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping through rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2] * row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1] * row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there an efficient way to concatenate a pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
from io import StringIO

# test data
df = pd.read_csv(StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow the transformation down on a large dataset with lots of columns.
Is there are better way to prefix columns name to values?
Here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
dt = [m[c].apply(lambda x: f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
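In more recent NumPy versions the same functions are also exposed through the public np.char namespace, which avoids importing from numpy.core (the output is assumed identical to the frame above):
import numpy as np

a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(np.char.add(np.char.add(b, '_'), a), df.index, df.columns)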
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.
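For example, applied to the small df from the question it produces the same frame as the other answers:
pd.DataFrame({col: col + "_" + df[col].astype(str) for col in df.columns})
# same output as shown above (date_01/01/2019, value_30, data_data1, ...)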