pandas better runtime, going through dataframe - python

I have a pandas DataFrame where I want to search one column for numbers, find them, and put them in a new column.
import pandas as pd
import regex as re
import numpy as np
data = {'numbers':['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
'rate':[434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?

Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, that is possible too; it is only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
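A small aside: in the pattern 134.[A-Z]{2,} the dot is a regex wildcard, so it would also match strings such as 1340ABBC. If only a literal dot is intended (an assumption about the data here), escaping it is safer and gives the same result for the sample frame:
# escape the dot so it matches only a literal "." (assumption about the intent)
pattern = r'134\.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)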

Related

Best way to execute multiple lines of pandas in parallel? (Speed up)

Basically, I am performing a simple operation and updating 100 columns of my dataframe of size (550 rows and 2700 columns).
I am updating 100 columns like this:
df["col1"] = df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100
This operation is taking 170 ms in my original dataframe. I want to speed up the time. I am doing some real-time thing, so time is important.
You can select all the target columns by the list cols, subtract them from the right-hand side with DataFrame.rsub, and divide by the columns with DataFrame.div:
cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
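If the final multiplication by 100 from the question is also needed, chaining DataFrame.mul should give the same result (the timings below measure the version without it):
# same as above, with the *100 from the question chained on
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)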
Performance:
np.random.seed(2022)
df=pd.DataFrame(np.random.randint(1001, size=(550,2700))).add_prefix('col')
df = df.rename(columns={'col0':'static'})
In [58]: %%timeit
...: for i in range(1, 101):
...:     df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
...:
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [59]: %%timeit
...: cols = [f'col{c}' for c in range(1, 101)]
...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
...:
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Concatenate Pandas column name to column value

Is there an efficient way to concatenate a Pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
from io import StringIO
# test data
df = pd.read_csv(StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow the transformation down on a large dataset with lots of columns.
Is there a better way to prefix column names to values?
Here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
dt = [m[c].apply(lambda x:f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
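In recent NumPy versions the same elementwise string addition is also exposed as np.char.add, which avoids the low-level defchararray import; the behaviour should be identical:
import numpy as np
# np.char.add broadcasts the (n_cols,) prefix array against the (n_rows, n_cols) value array
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(np.char.add(np.char.add(b, '_'), a), df.index, df.columns)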
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.

Optimizing string match in Pandas

Currently, I have the following line, where I try to do a string match on a column of my pandas DataFrame:
input_supplier = input_supplier[input_supplier['Category Level - 3'].str.contains(category, flags=re.IGNORECASE)]
However, this operation takes a lot of time. The size of the pandas df is: (8098977, 16).
Is there any way to optimize this particular operation?
As Josh Friedlander said, it should be a little faster to add a column and then filter on it:
len(df3)
9599904
# Creating a column then filtering
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
df3['search'] = df3['First'].str.contains('|'.join(search))
new_df = df3[df3['search'] == True]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 6.525546073913574 seconds
just doing a str.contains:
start_time = time.time()
search = ['Emma','Ryan','Gerald','Billy','Helen']
input_supplier = df3[df3['First'].str.contains('|'.join(search), flags=re.IGNORECASE)]
end_time = time.time()
print(f'Elapsed time was {(end_time - start_time)} seconds')
Elapsed time was 11.464462518692017 seconds
It is about twice as fast to create a new column and filter on that than to filter with str.contains() directly (note that the second snippet also passes flags=re.IGNORECASE, which accounts for part of the difference).
Use the fast numpy "where" and "isin" functions after converting the search-column values and the category-list values to lower case. If the column and/or category list contain non-strings, convert them to strings first. Delete the column label in the last line if you want to see all columns from the original dataframe indexed to the search-column results.
import numpy as np
import pandas as pd
import re
names = np.array(['Walter', 'Emma', 'Gus', 'Ryan', 'Skylar', 'Gerald',
                  'Saul', 'Billy', 'Jesse', 'Helen'] * 1000000)
input_supplier = pd.DataFrame(names, columns=['Category Level - 3'])
len(input_supplier)
10000000
category = ['Emma', 'Ryan', 'Gerald', 'Billy', 'Helen']
Method 1 (note this method does not ignore case)
%%timeit
input_supplier['search'] = \
input_supplier['Category Level - 3'].str.contains('|'.join(category))
df1 = input_supplier[input_supplier['search'] == True]
4.42 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Method 2
%%timeit
df2 = input_supplier[input_supplier['Category Level - 3'].str.contains(
'|'.join(category), flags=re.IGNORECASE)]
5.45 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numpy method ignoring case:
%%timeit
lcase_vals = [x.lower() for x in input_supplier['Category Level - 3']]
category_lcase = [x.lower() for x in category]
df3 = input_supplier.iloc[np.where(np.isin(lcase_vals, category_lcase))[0]]
2.02 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numpy method if matching case:
%%timeit
col_vals = input_supplier['Category Level - 3'].values
df4 = input_supplier.iloc[np.where(np.isin(col_vals, category))[0]]
623 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
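Note that np.isin matches whole cell values, not substrings, so it is a drop-in replacement for str.contains only when each cell equals one of the search terms (as in the name data above). If substring matching with case ignored is required, a minimal sketch (assuming the column contains only strings; mask and df5 are just illustrative names) is to lower-case the column once and drop the IGNORECASE flag:
# lower-case once, then do a plain (case-sensitive) substring search
pattern = '|'.join(c.lower() for c in category)
mask = input_supplier['Category Level - 3'].str.lower().str.contains(pattern)
df5 = input_supplier[mask]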

slice pandas string in vectorised way [duplicate]

This question already has answers here:
How to slice strings in a column by another column in pandas
(2 answers)
Closed 4 years ago.
I am trying to slice the strings in a vectorized way, but the answer is NaN, although it works OK if the slice index (say str[:1]) is constant. Any help?
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]}) #
df['SUB'] = df['NAME'].str[:df['SEQ']]
The output is
NAME SEQ SUB
0 abc 1 NaN
1 xyz 2 NaN
2 hello 1 NaN
Unfortunately, a vectorized solution does not exist.
Use apply with a lambda function:
df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
Or zip with list comprehension for better performance:
df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
print (df)
NAME SEQ SUB
0 abc 1 a
1 xyz 2 xy
2 hello 1 h
Timings:
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]})
df = pd.concat([df] * 1000, ignore_index=True)
In [270]: %timeit df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
4.23 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [271]: %timeit df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
104 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %timeit df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
785 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using groupby (within each group, g.name is the SEQ value, so the whole group can be sliced at once):
df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
Might make sense if there are few unique values in SEQ.

How to apply a function on every row on a dataframe?

I am new to Python and I am not sure how to solve the following problem.
I have a function:
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
Say I have the dataframe
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
D p
0 10 20
1 20 30
2 30 10
ch=0.2
ck=5
And ch and ck are floats. Now I want to apply the formula to every row of the dataframe and return the result as an extra column 'Q'. An example (that does not work) would be:
df['Q']= map(lambda p, D: EOQ(D,p,ck,ch),df['p'], df['D'])
(this returns only a map object)
I will need this type of processing more often in my project, so I hope to find something that works.
The following should work:
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
ch=0.2
ck=5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df
If all you're doing is calculating the square root of some result, then use np.sqrt; it is vectorised and will be significantly faster:
In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
df
Out[80]:
D p Q
0 10 20 5.000000
1 20 30 5.773503
2 30 10 12.247449
Timings
For a 30k row df:
In [92]:
import math
ch=0.2
ck=5
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop
You can see that the np method is ~1900 X faster
There are a few more ways to apply a function on every row of a DataFrame.
(1) You could modify EOQ a bit by letting it accept a row (a Series object) as an argument and access the relevant elements using the column names inside the function. Moreover, you can pass additional arguments to apply as keyword arguments, e.g. ch or ck:
def EOQ1(row, ck, ch):
    Q = math.sqrt((2*row['D']*ck)/(ch*row['p']))
    return Q
df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
(2) It turns out that apply is often slower than a list comprehension (in the benchmark below, it's 20x slower). To use a list comprehension, you could modify EOQ still further so that it accesses elements by their position. Then call the function in a loop over the df rows converted to lists:
def EOQ2(row, ck, ch):
    Q = math.sqrt((2*row[0]*ck)/(ch*row[1]))
    return Q
df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
(3) As it happens, if the goal is to call a function iteratively, map is usually faster than a list comprehension. So you could convert df into a list, map the function to it; then unpack the result in a list:
df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
(4) As #EdChum notes, it's always better to use vectorized methods where possible instead of applying a function row by row. Pandas offers vectorized methods that rival numpy's. In the case of EOQ, for example, instead of math.sqrt you could use pandas' pow method (in the benchmark below, using pandas vectorized methods is ~20% faster than using numpy):
df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
Output:
D p Q Q_np Q1 Q2a Q2b Q_pd
0 10 20 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
1 20 30 5.773503 5.773503 5.773503 5.773503 5.773503 5.773503
2 30 10 12.247449 12.247449 12.247449 12.247449 12.247449 12.247449
Timings:
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
df = pd.concat([df]*10000)
>>> %timeit df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
623 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
615 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
31.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
26.9 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q_np'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
1.19 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
966 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
