Efficient way to add a pandas Series to a pandas DataFrame - python

I have a very large DataFrame, and I want to add a Series to every row of it.
This is my current way of achieving that:
print(df.shape) # (31676, 3562)
diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
for i in range(len(df)):
    df.iloc[i] = df.iloc[i] + diff
This method takes a lot of time. Is there a more efficient way of doing this?

You can add the scalar and subtract the Series directly for a vectorized operation; subtracting a NumPy array instead should be a bit faster:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.rand(1000, 1000))
In [51]: %timeit df.add(4.1).sub(df.iloc[0])
7.99 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df + 4.1 - df.iloc[0]
8.46 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit df + 4.1 - df.iloc[0].to_numpy()
7.59 ms ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [49]: %%timeit
...: diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
...: for i in range(len(df)):
...:     df.iloc[i] = df.iloc[i] + diff
...:
433 ms ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
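To actually apply the shift rather than just time it, the vectorized result can be assigned back to the DataFrame; a minimal sketch of the fastest variant above:

# .to_numpy() drops the index, so the subtraction broadcasts row-wise
# across the frame instead of aligning on column labels
df = df + 4.1 - df.iloc[0].to_numpy()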

Related

Efficient way to calculate difference from pandas datetime columns based on days

I have a DataFrame with a few million rows, and I want to calculate the difference in days between two columns that are in datetime format.
There are Stack Overflow questions that answer this by computing the difference on a timestamp basis (see here).
Doing it on a timestamp basis felt quite fast:
df["Differnce"] = (df["end_date"] - df["start_date"]).dt.days
But doing it on a daily basis felt quite slow:
df["Differnce"] = (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
I was wondering if there is an easy but faster way to achieve the same result?
Example Code:
import pandas as pd
import numpy as np
data = {'Condition': ["a", "a", "b"],
        'start_date': [pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000')],
        'end_date': [pd.Timestamp('2022-01-02 01:00:00.000000'), pd.Timestamp('2022-02-01 23:00:00.000000'), pd.Timestamp('2022-01-02 01:00:00.000000')]}
df = pd.DataFrame(data)
df["Right_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"].dt.date - df["start_date"].dt.date).dt.days), np.nan)
df["Wrong_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"] - df["start_date"]).dt.days), np.nan)
Use Series.dt.to_period; faster still are Series.dt.normalize or Series.dt.floor:
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [286]: %timeit (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
1.14 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [287]: %timeit df["end_date"].dt.to_period('d').astype('int') - df["start_date"].dt.to_period('d').astype('int')
64.1 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [288]: %timeit (df["end_date"].dt.normalize() - df["start_date"].dt.normalize()).dt.days
27.7 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [289]: %timeit (df["end_date"].dt.floor('d') - df["start_date"].dt.floor('d')).dt.days
27.7 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
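Plugged back into the conditional assignment from the question, the faster normalize-based expression would look like this (a sketch, not part of the original answer):

df["Right_Difference"] = np.where(
    df["Condition"] == "a",
    (df["end_date"].dt.normalize() - df["start_date"].dt.normalize()).dt.days,
    np.nan,
)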

If nth character of a string in a dataframe is a certain character delete that row

I have a dataframe:
import pandas as pd

df1 = {'seq': ["(((...))).(.)", "...((.))", "..(.)..(.)"],
       'a': [1, 3, 5],
       'b': [9, 4, 7],
       'val': [0.01, 0.02, 0.03],
       }
df1 = pd.DataFrame(df1, columns=['seq', 'a', 'b', 'val'])
I want to delete a row if the nth character of 'seq' is a ".", where n is given by the 'a' column.
You can use apply and boolean indexing:
df1 = df1[df1.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
Output:
seq a b val
0 (((...))).(.) 1 9 0.01
2 ..(.)..(.) 5 7 0.03
You can use str.split with '' to split each character into its own column, then use indexing to pick the character at the position given by column a, and finally check where it differs from '.':
import numpy as np

print(df1[df1['seq'].str.split('', expand=True)
          .to_numpy()[np.arange(len(df1)), df1['a']]  # no need for -1 here
          != '.'])
seq a b val
0 (((...))).(.) 1 9 0.01
2 ..(.)..(.) 5 7 0.03
For a small DataFrame of 3 rows, apply is faster, but as the size increases this method becomes interesting. Note that if the strings in seq are longer, there may be an effect on the timing.
%timeit df1[df1.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
# 1.15 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df1[df1['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df1)), df1['a']]!='.']
#1.7 ms ± 95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with 300 rows
df2 = pd.concat([df1]*100)
%timeit df2[df2.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
# 7.45 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df2[df2['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df2)), df2['a']]!='.']
#2.58 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#with 3000 rows
df3 = pd.concat([df1]*1000)
%timeit df3[df3.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
#67.9 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df3[df3['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df3)), df3['a']]!='.']
#11.4 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
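For completeness, a plain-Python alternative that avoids expanding seq into per-character columns is a zip-based boolean mask; this is a sketch that is not benchmarked in the answers above:

mask = [s[a - 1] != '.' for s, a in zip(df1['seq'], df1['a'])]  # 'a' is 1-based
df1[mask]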

Speed up list creation from pandas dataframe

I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd
df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index,].tolist()
        entirerow.append((df.iloc[index, 1], df.iloc[index, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not using df.iterrows() is a little bit faster:
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i,].tolist()
        entirerow.append((df.iloc[i, 1], df.iloc[i, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use a list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
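As a quick sanity check of the output format, the first element produced by the comprehension looks like this (a sketch; note that df.values upcasts the integer PointNo column to float):

Row_list = [x + [(x[1], x[2])] for x in df.values.tolist()]
Row_list[0]
# [1.0, 572548.283, 166424.411, -11.849, -11.512, (572548.283, 166424.411)]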

Concatenate Pandas column name to column value

Is there an efficient way to concatenate a Pandas column name to its values? I would like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import io
import pandas as pd

# test data (io.StringIO is used here; pd.compat.StringIO is not available in newer pandas)
df = pd.read_csv(io.StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that the list comprehension and the copying of data slow the transformation down on a large dataset with lots of columns.
Is there a better way to prefix column names to values?
Here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
...: dt = [m[c].apply(lambda x:f'{c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
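In recent NumPy versions the same function is also exposed as np.char.add (numpy.char is an alias of numpy.core.defchararray), so the explicit defchararray import can be avoided; a minimal sketch on the same small df:

import numpy as np
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(np.char.add(np.char.add(b, '_'), a), df.index, df.columns)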
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.

Get total values_count from a dataframe with Python Pandas

I have a Python pandas DataFrame with several columns. Now I want to copy all values into one single column to get a value_counts result with all values included. In the end I need the total count of string1, string2, and so on. What is the best way to do it?
index  row 1    row 2    ...
0      string1  string3
1      string1  string1
2      string2  string2
...
If performance is an issue try:
from collections import Counter
Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
Or stack it into one Series then use value_counts
df.stack().value_counts()
#string1 3
#string2 2
#string3 1
#dtype: int64
For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:
s = pd.Series()
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
#string1 3.0
#string2 2.0
#string3 1.0
#dtype: float64
Also, there's a numpy solution:
import numpy as np
np.unique(df.to_numpy(), return_counts=True)
#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
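If a Series result is preferred, the np.unique output can be wrapped directly (a small sketch, not part of the original answer):

values, counts = np.unique(df.to_numpy(), return_counts=True)
pd.Series(counts, index=values)
#string1    3
#string2    2
#string3    1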
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

def vc_from_loop(df):
    s = pd.Series()
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s
Small DataFrame
%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Long DataFrame
df = pd.concat([df]*300000, ignore_index=True)
%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
