Pandas MultiIndex creation performance

I ran performance tests that build the same pd.MultiIndex using different class methods:
import pandas as pd
size_mult = 8
d1 = [1]*10**size_mult
d2 = [2]*10**size_mult
pd.__version__
'0.24.2'
Namely from_arrays, from_tuples and from_frame:
# Cell from_arrays
%%time
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
# Cell from_tuples
%%time
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
# Cell from_frame
%%time
df = pd.DataFrame({'a':d1, 'b':d2})
index_frm = pd.MultiIndex.from_frame(df)
Corresponding outputs for cells:
# from_arrays
CPU times: user 1min 15s, sys: 6.58 s, total: 1min 21s
Wall time: 1min 21s
# from_tuples
CPU times: user 26.4 s, sys: 4.99 s, total: 31.4 s
Wall time: 31.3 s
# from_frame
CPU times: user 47.9 s, sys: 5.65 s, total: 53.6 s
Wall time: 53.7 s
Let's check that all three results are the same:
index_arr.difference(index_tup)
index_arr.difference(index_frm)
All lines produce:
MultiIndex(levels=[[1], [2]],
codes=[[], []],
names=['a', 'b'])
So why is there such a big difference? from_arrays is almost 3 times slower than from_tuples. It is even slower than creating a DataFrame and building the index on top of it.
EDIT:
I ran another, more general test and the result was, surprisingly, the opposite:
import numpy as np
np.random.seed(232)
size_mult = 7
d1 = np.random.randint(0, 10**size_mult, 10**size_mult)
d2 = np.random.randint(0, 10**size_mult, 10**size_mult)
start = pd.Timestamp.now()
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=['a', 'b'])
print('ARR done in %f' % (pd.Timestamp.now()-start).total_seconds())
start = pd.Timestamp.now()
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=['a', 'b'])
print('TUP done in %f' % (pd.Timestamp.now()-start).total_seconds())
ARR done in 9.559764
TUP done in 70.457208
So now from_tuples is significantly slower though source data are the same.

Your second example makes more sense to me. Looking at the source code for Pandas, from_tuples actually calls from_arrays, so it makes sense to me that from_arrays will be faster.
from_tuples is also doing some extra steps here that cost more time:
You passed in a zip(d1, d2), which is actually an iterator. from_tuples converts this into a list.
After it was converted to a list of tuples, it goes through an extra step to convert it to a list of numpy arrays
The previous step iterates through the list of tuples twice, making the from_tuples significantly slower than from_arrays, right off the bat.
So overall, I'm not surprised from_tuples is slower, since it has to iterate through your list of tuples an extra two times (and do some extra stuff) before even making it to the from_arrays function (which iterates a couple more times, by the way) that it uses anyways.

from_tuples converts iterators to lists, then lists to arrays, then arrays into lists of arrays, then ultimately calls from_arrays on that.
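To make the relationship concrete, here is a minimal sketch on a tiny, made-up dataset (the values in d1 and d2 are illustrative only, not the benchmark data). It builds the same index three ways, including transposing rows by hand before calling from_arrays, which is roughly the extra work described above.

```python
import pandas as pd

# Small illustrative data (not the benchmark data from the question)
d1 = [1, 1, 2]
d2 = ["x", "y", "x"]

# from_arrays takes one sequence per level
index_arr = pd.MultiIndex.from_arrays([d1, d2], names=["a", "b"])

# from_tuples takes one tuple per row; the zip iterator has to be materialized first
index_tup = pd.MultiIndex.from_tuples(zip(d1, d2), names=["a", "b"])

# Transposing the rows by hand gives from_arrays the per-level input it expects
rows = list(zip(d1, d2))
index_manual = pd.MultiIndex.from_arrays(list(zip(*rows)), names=["a", "b"])

same = index_arr.equals(index_tup) and index_arr.equals(index_manual)
print(same)
```

If the data already exists as separate columns, passing them straight to from_arrays avoids the row-to-column conversion entirely.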

Related

Vectorizing hashing function in pandas

I have the following dataset (with different values, just multiplied same rows).
I need to combine the columns and hash them, specifically with the library hashlib and the algorithm provided.
The problem is that it takes too long, and somehow I have the feeling I could vectorize the function but I am not an expert.
The function is pretty simple and I feel like it can be vectorized, but struggling to implement.
I am working with millions of rows and it takes hours, even if hashing 4 columns values.
import pandas as pd
import hashlib
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* 100000,'second_identifier':['RED413','BLU031']* 100000})
def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)
Using a list comprehension will get you a significant speedup.
First your original:
import pandas as pd
import hashlib
n = 100000
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})
def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()
%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)
1 loop, best of 5: 26.1 s per loop
Then as a list comprehension:
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})
def list_comp(df):
    return pd.Series([ _mutate_hash(row) for row in df.to_numpy() ])
%timeit data['row_hash']=list_comp(data)
1 loop, best of 5: 872 ms per loop
...i.e., a speedup of ~30x.
As a check: You can check that these two methods yield equivalent results by putting the first one in "data2" and the second one in "data3" and then check that they're equal:
data2, data3 = pd.DataFrame([]), pd.DataFrame([])
%timeit data2['row_hash']=data.apply(_mutate_hash,axis=1)
...
%timeit data3['row_hash']=list_comp(data)
...
data2.equals(data3)
True
The easiest performance boost comes from using vectorized string operations. If you do the string prep (lowercasing and encoding) before applying the hash function, your performance is much more reasonable.
data = pd.DataFrame(
{
"first_identifier": ["ALP1x", "RDX2b"] * 1000000,
"second_identifier": ["RED413", "BLU031"] * 1000000,
}
)
def _mutate_hash(row):
    return hashlib.md5(row).hexdigest()
prepped_data = data.apply(lambda col: col.str.lower().str.encode("utf8")).sum(axis=1)
data["row_hash"] = prepped_data.map(_mutate_hash)
I see ~25x speedup with that change.
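The two ideas above can also be combined. The following is only a sketch, and it assumes exactly two string columns as in the question: the concatenation and lowercasing are done on whole columns, and only the hashing itself, which hashlib performs one bytes object at a time, stays in a Python-level loop.

```python
import hashlib
import pandas as pd

# Tiny stand-in for the question's data
data = pd.DataFrame({
    "first_identifier": ["ALP1x", "RDX2b"] * 3,
    "second_identifier": ["RED413", "BLU031"] * 3,
})

# Vectorized prep: concatenate and lowercase whole columns at once
joined = (data["first_identifier"] + data["second_identifier"]).str.lower()

# hashlib works on one bytes object at a time, so this part remains a loop
hashes = [hashlib.md5(s.encode()).hexdigest() for s in joined]
data["row_hash"] = hashes

print(data["row_hash"].nunique())
```

I have not timed this variant, so measure it against the other two on your own data before relying on it.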

Why doesn't dask execute in parallel

Could someone point out what I did wrong with the following dask implementation? It doesn't seem to use multiple cores.
[ Updated with reproducible code]
The code that uses dask :
import time
import dask
import numpy as np
import pandas as pd
bookingID = np.arange(1,10000)
book_data = pd.DataFrame(np.random.rand(1000))
def calculate_feature_stats(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row
calculate_feature_stats = dask.delayed(calculate_feature_stats)
rows = []
for bookid in bookingID.tolist():
    row = calculate_feature_stats(bookid)
    rows.append(row)
start = time.time()
rows = dask.persist(*rows)
end = time.time()
print(end - start) # Execution time = 16s in my machine
Code with normal implementation without dask :
import time
import numpy as np
import pandas as pd
bookingID = np.arange(1,10000)
book_data = pd.DataFrame(np.random.rand(1000))
def calculate_feature_stats_normal(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row
rows = []
start = time.time()
for bookid in bookingID.tolist():
    row = calculate_feature_stats_normal(bookid)
    rows.append(row)
end = time.time()
print(end - start) # Execution time = 4s in my machine
So the version without dask is actually faster. How is that possible?
Answer
Extended comment. You should consider that dask adds about 1 ms of overhead per task (see the docs), so if your computation is shorter than that, dask isn't worth the trouble.
Going to your specific question, I can think of two possible real-world scenarios:
1. A big dataframe with a column called bookingID and another value
2. A different file for every bookingID
In the second case you can play from this answer while for the first case you can proceed as following:
import dask.dataframe as dd
import numpy as np
import pandas as pd
# create dummy df
df = []
for i in range(10_000):
    df.append(pd.DataFrame({"id":i,
                            "value":np.random.rand(1000)}))
df = pd.concat(df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.to_parquet("df.parq")
Pandas
%%time
df = pd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":["min", "max", "std", "mean"]})
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 1.65 s, sys: 316 ms, total: 1.96 s
Wall time: 1.08 s
Dask
%%time
df = dd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":["min", "max", "std", "mean"]}).compute()
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 4.94 s, sys: 427 ms, total: 5.36 s
Wall time: 3.94 s
Final thoughts
In this situation dask starts to make sense if the df doesn't fit in memory.

Select the max row per group - pandas performance issue

I'm selecting one max row per group and I'm using groupby/agg to return index values and select the rows using loc.
For example, to group by "Id" and then select the row with the highest "delta" value:
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]
However, it's so slow this way. Actually, my i7/16G RAM laptop hangs when I'm using this query on 13 million rows.
I have two questions for experts:
How can I make this query run fast in pandas? What am I doing wrong?
Why is this operation so expensive?
[Update]
Thank you so much for @unutbu's analysis!
sort_drop it is! On my i7/32 GB RAM machine, groupby+idxmax hung for nearly 14 hours (and never returned anything), whereas sort_drop handled it in LESS THAN A MINUTE!
I still need to look at how pandas implements each method but problems solved for now! I love StackOverflow.
The fastest option depends not only on length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots which compare a number of ways of finding the maximum in each group:
If there are only a few (large) groups, using_idxmax may be the fastest option:
If there are many (small) groups and the DataFrame is not too large, using_sort_drop may be the fastest option:
Keep in mind, however, that while using_sort_drop, using_sort and using_rank start out looking very fast, as N = len(df) increases, their speed relative to the other options disappears quickly. For large enough N, using_idxmax becomes the fastest option, even if there are many groups.
using_sort_drop, using_sort and using_rank sort the DataFrame (or groups within the DataFrame). Sorting is O(N * log(N)) on average, while the other methods use O(N) operations. This is why methods like using_idxmax beat using_sort_drop for very large DataFrames.
Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, and with test data tailored to your situation.
Based on the perfplots above, using_sort_drop may be an option worth considering for your DataFrame of 13M rows, especially if it has many (small) groups. Otherwise, I would suspect using_idxmax to be the fastest option -- but again, it's important that you check benchmarks on your machine.
Here is the setup I used to make the perfplots:
import numpy as np
import pandas as pd
import perfplot
def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df
def using_idxmax(df):
    return df.loc[df.groupby("Id")['delta'].idxmax()]
def max_mask(s):
    i = np.asarray(s).argmax()
    result = [False]*len(s)
    result[i] = True
    return result
def using_custom_mask(df):
    mask = df.groupby("Id")['delta'].transform(max_mask)
    return df.loc[mask]
def using_isin(df):
    idx = df.groupby("Id")['delta'].idxmax()
    mask = df.index.isin(idx)
    return df.loc[mask]
def using_sort(df):
    df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
    return df.groupby('Id', as_index=False).first()
def using_rank(df):
    mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
    return df.loc[mask]
def using_sort_drop(df):
    # Thanks to jezrael
    # https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
    return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')
def using_apply(df):
    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    return df.loc[selected_idx]
def check(df1, df2):
    df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    return df1.equals(df2)
perfplot.show(
    setup=make_df,
    kernels=[using_idxmax, using_custom_mask, using_isin, using_sort,
             using_rank, using_apply, using_sort_drop],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(df)',
    repeat=75,
    equality_check=check)
Another way to benchmark is to use IPython %timeit:
In [55]: df = make_df(2**20)
In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop
In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop
In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
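Before benchmarking, it can help to confirm on a toy frame that the candidates really select the same rows. The sketch below uses a small hand-made DataFrame (the values are illustrative, with no ties inside a group) and compares the idxmax and sort-then-drop approaches.

```python
import pandas as pd

# Hand-made example: three groups, no tied delta values inside a group
df = pd.DataFrame({"Id": [1, 1, 2, 2, 3], "delta": [5, 9, 7, 3, 4]})

def using_idxmax(df):
    # One pass per group to find the label of the max row, then a label lookup
    return df.loc[df.groupby("Id")["delta"].idxmax()]

def using_sort_drop(df):
    # Sort descending by delta, then keep the first row seen for each Id
    return df.sort_values(by=["delta"], ascending=False, kind="mergesort").drop_duplicates("Id")

a = using_idxmax(df).sort_index()
b = using_sort_drop(df).sort_index()
print(a.equals(b))
```

With tied maxima inside a group the two approaches may or may not pick the same row, so it is worth checking that case separately if it matters for your data.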
Using Numba's jit
from numba import njit
import numpy as np
import pandas as pd
@njit
def nidxmax(bins, k, weights):
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    for i, w in enumerate(weights - (weights.min() - 1)):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return np.sort(out)
def with_numba_idxmax(df):
    f, u = pd.factorize(df.Id)
    return df.iloc[nidxmax(f, len(u), df.delta.values)]
Borrowing from @unutbu
def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df
Prime jit
with_numba_idxmax(make_df(10));
Test
df = make_df(2**20)
%timeit with_numba_idxmax(df)
%timeit using_sort_drop(df)
47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

python dask DataFrame, support for (trivially parallelizable) row apply?

I recently found dask module that aims to be an easy-to-use python parallel processing module. Big selling point for me is that it works with pandas.
After reading a bit on its manual page, I can't find a way to do this trivially parallelizable task:
ts.apply(func) # for pandas series
df.apply(func, axis = 1) # for pandas DF row apply
At the moment, to achieve this in dask, AFAIK,
ddf.assign(A=lambda df: df.apply(func, axis=1)).compute() # dask DataFrame
which is ugly syntax and is actually slower than outright
df.apply(func, axis = 1) # for pandas DF row apply
Any suggestion?
Edit: Thanks @MRocklin for the map function. It seems to be slower than plain pandas apply. Is this related to the pandas GIL-releasing issue, or am I doing it wrong?
import numpy as np
import pandas as pd
import dask.dataframe as dd
s = pd.Series([10000]*120)
ds = dd.from_pandas(s, npartitions = 3)
def slow_func(k):
    A = np.random.normal(size = k) # k = 10000
    s = 0
    for a in A:
        if a > 0:
            s += 1
        else:
            s -= 1
    return s
s.apply(slow_func) # 0.43 sec
ds.map(slow_func).compute() # 2.04 sec
map_partitions
You can apply your function to all of the partitions of your dataframe with the map_partitions function.
df.map_partitions(func, columns=...)
Note that func will be given only part of the dataset at a time, not the entire dataset like with pandas apply (which presumably you wouldn't want if you want to do parallelism.)
map / apply
You can map a function row-wise across a series with map
df.mycolumn.map(func)
You can map a function row-wise across a dataframe with apply
df.apply(func, axis=1)
Threads vs Processes
As of version 0.6.0, dask.dataframe parallelizes with threads. Custom Python functions will not receive much benefit from thread-based parallelism. You could try processes instead:
df = dd.read_csv(...)
df.map_partitions(func, columns=...).compute(scheduler='processes')
But avoid apply
However, you should really avoid apply with custom Python functions, both in Pandas and in Dask. This is often a source of poor performance. If you find a way to do your operation in a vectorized manner, your Pandas code could well be 100x faster and you won't need dask.dataframe at all.
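As an illustration of what "vectorized" could mean for the function in the question, here is a minimal sketch. One assumption to note: both functions take the sample array A as an argument (instead of drawing it inside the function, as the question does) so that they can be compared on the same random draws.

```python
import numpy as np

def slow_func(A):
    # Same element-by-element loop as in the question
    s = 0
    for a in A:
        if a > 0:
            s += 1
        else:
            s -= 1
    return s

def vectorized_func(A):
    # +1 for every positive sample, -1 for everything else, summed in one array operation
    return int(np.where(A > 0, 1, -1).sum())

rng = np.random.default_rng(0)
A = rng.normal(size=10_000)
print(slow_func(A) == vectorized_func(A))
```

The vectorized form does the comparison and the sum inside NumPy, so no Python-level loop runs per element.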
Consider numba
For your particular problem you might consider numba. This significantly improves your performance.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([10000]*120)
In [4]: %paste
def slow_func(k):
    A = np.random.normal(size = k) # k = 10000
    s = 0
    for a in A:
        if a > 0:
            s += 1
        else:
            s -= 1
    return s
## -- End pasted text --
In [5]: %time _ = s.apply(slow_func)
CPU times: user 345 ms, sys: 3.28 ms, total: 348 ms
Wall time: 347 ms
In [6]: import numba
In [7]: fast_func = numba.jit(slow_func)
In [8]: %time _ = s.apply(fast_func) # First time incurs compilation overhead
CPU times: user 179 ms, sys: 0 ns, total: 179 ms
Wall time: 175 ms
In [9]: %time _ = s.apply(fast_func) # Subsequent times are all gain
CPU times: user 68.8 ms, sys: 27 µs, total: 68.8 ms
Wall time: 68.7 ms
Disclaimer, I work for the company that makes both numba and dask and employs many of the pandas developers.
As of v dask.dataframe.apply delegates responsibility to map_partitions:
@insert_meta_param_description(pad=12)
def apply(self, func, convert_dtype=True, meta=no_default, args=(), **kwds):
    """ Parallel version of pandas.Series.apply
    ...
    """
    if meta is no_default:
        msg = ("`meta` is not specified, inferred from partial data. "
               "Please provide `meta` if the result is unexpected.\n"
               " Before: .apply(func)\n"
               " After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result\n"
               " or: .apply(func, meta=('x', 'f8')) for series result")
        warnings.warn(msg)
        meta = _emulate(M.apply, self._meta_nonempty, func,
                        convert_dtype=convert_dtype,
                        args=args, **kwds)
    return map_partitions(M.apply, self, func,
                          convert_dtype, args, meta=meta, **kwds)

What is the most efficient way to loop through dataframes with pandas?

I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *
df = read_csv('table.csv')
for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, adjclose = row
    #now perform analysis on open/close based on date, etc..
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
The newest versions of pandas now include a built-in function for iterating over rows.
for index, row in df.iterrows():
    # do some logic here
Or, if you want it faster use itertuples()
But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close is a 1-d array, and you want the day-over-day percent change,
pct_change = close[1:]/close[:-1] - 1
This computes the entire array of percent changes as one statement, instead of
pct_change = []
for row in close:
    pct_change.append(...)
So try to avoid the Python loop for i, row in enumerate(...) entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
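Here is a small worked sketch of that idea, using the four Close values from the sample CSV in the question. One assumption: the values are reordered oldest to newest, since the file lists the most recent date first.

```python
import numpy as np
import pandas as pd

# Close prices from the sample CSV, reordered oldest to newest
close = np.array([27.27, 26.98, 27.31, 27.13])

# Whole-array arithmetic: each element is (today - yesterday) / yesterday
pct_change = (close[1:] - close[:-1]) / close[:-1]

# pandas has the same calculation built in; its first value is NaN, so drop it
pct_change_pd = pd.Series(close).pct_change().to_numpy()[1:]

print(np.allclose(pct_change, pct_change_pd))
```

Both lines operate on the whole array at once, so there is no Python-level loop over rows.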
As mentioned before, a pandas object is most efficient when you process the whole array at once. However, for those who really need to loop through a pandas DataFrame to do something, like me, I found at least three ways to do it. I ran a short test to see which of the three is the least time-consuming.
import time
import pandas as pd
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time()-A)
C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time()-A)
C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time()-A)
print(B)
Result:
[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]
This is probably not the best way to measure the time consumption but it's quick for me.
Here are some pros and cons IMHO:
.iterrows(): returns index and row items in separate variables, but is significantly slower
.itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
zip: quickest, but no access to the index of the row
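One possible way around the last point, offered as a sketch rather than a benchmarked recommendation: the index can be zipped in explicitly alongside the columns. The small DataFrame and its string index below are made up for illustration.

```python
import pandas as pd

# Made-up frame with a non-default index so the index values are visible
t = pd.DataFrame({"a": range(3), "b": range(10, 13)}, index=["x", "y", "z"])

# iterrows yields (index, Series) pairs
from_iterrows = [(i, r["a"], r["b"]) for i, r in t.iterrows()]

# itertuples yields namedtuples whose first field is the index
from_itertuples = [(ir[0], ir[1], ir[2]) for ir in t.itertuples()]

# zip can carry the index too if it is passed in explicitly
from_zip = list(zip(t.index, t["a"], t["b"]))

print(from_iterrows == from_itertuples == from_zip)
```

All three loops produce the same (index, a, b) triples here; only the cost of producing them differs.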
EDIT 2020/11/10
For what it is worth, here is an updated benchmark with some other alternatives (run on a MacBook Pro, 2.4 GHz 8-core Intel Core i9, 32 GB 2667 MHz DDR4):
import sys
import tqdm
import time
import pandas as pd
B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i,r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time()-A})
    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time()-A})
    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time()-A})
    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time()-A})
    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time()-A})
    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time()-A})
    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time()-A})
print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)
## Output
Python 3.7.9 (default, Oct 13 2020, 10:58:24)
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
                           mean       std
method
zip + to_dict('list')  0.002353  0.000168
zip                    0.003381  0.000250
itertuples             0.007659  0.000728
to_dict('records')     0.025838  0.001458
agg                    0.066391  0.007044
apply                  0.067753  0.006997
iterrows               0.647215  0.019600
You can loop through the rows by transposing and then calling iteritems:
for date, row in df.T.iteritems():
    # do some logic here
I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:
def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo
    n = len(dates)
    for i from 0 <= i < n:
        foo = close[i] - open[i] # will be extremely fast
I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is-- if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.
You have three options:
By index (simplest):
>>> for index in df.index:
...     print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))
With iterrows (most used):
>>> for index, row in df.iterrows():
...     print ("df[" + str(index) + "]['B']=" + str(row['B']))
With itertuples (fastest):
>>> for row in df.itertuples():
...     print ("df[" + str(row.Index) + "]['B']=" + str(row.B))
Three options display something like:
df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12
Source: alphons.io
I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.
There's also iterkv, which iterates through (column, series) tuples.
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
df[b] = df[a].apply(lambda col: do stuff with col here)
As @joris pointed out, itertuples is approximately 100 times faster than iterrows. I tested the speed of both methods on a DataFrame with 5 million records: iterrows ran at about 1200 it/s and itertuples at about 120000 it/s.
If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                   index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2
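As a small addition to the namedtuple example, itertuples also accepts index and name arguments. The sketch below reuses the same two-row DataFrame and shows both the default namedtuples, with their Index field, and plain tuples without the index.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [0.1, 0.2]}, index=["a", "b"])

# Default: namedtuples with an Index field followed by one field per column
named = [(row.Index, row.col1, row.col2) for row in df.itertuples()]

# index=False drops the index; name=None yields plain tuples instead of namedtuples
plain = list(df.itertuples(index=False, name=None))

print(named)
print(plain)
```

Plain tuples are handy when the rows are only unpacked positionally and the field names are not needed.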
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray either via df.values (as you do) or by accessing each column separately df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.
index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values
for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interestk[i]
Not pythonic? Sure. But fast.
If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.
Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows shared characteristics which allowed you to do so.
Look at the last one:
import time
import pandas as pd
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for r in range(len(t)):
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time()-A,5))
C = []
A = time.time()
[C.append((x,y)) for x,y in zip(t['a'], t['b'])]
B.append(round(time.time()-A,5))
B
0.46424
0.00505
0.00245
0.09879
0.00209
I believe the most simple and efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.
For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.
Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.
import pandas as pd
import numpy as np
import numba as nb
df = pd.DataFrame( { 'close':[100,105,95,105] } )
pandas_vectorized = df.close.pct_change()[1:]
x = df.close.to_numpy()
numpy_vectorized = ( x[1:] - x[:-1] ) / x[:-1]
def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng
numpy_loop = test_numpy(df.close.to_numpy())[1:]
@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1,len(x)):
        pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
    return pct_chng
numba_loop = test_numba(df.close.to_numpy())[1:]
And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):
pandas/vectorized 1,130 micro-seconds
numpy/vectorized 382 micro-seconds
numpy/looped 72,800 micro-seconds
numba/looped 455 micro-seconds
Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.
Other details:
Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.
The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.
Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.
The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.
