In many places in our Pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(): it takes each row, does some processing, and returns a value, which we ultimately collect into a new Series.
I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.
What would be the best way to make this usage pattern as efficient as possible?
Can we possibly do it without rewriting most of our code?
Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?
You should apply your function along axis=1. The function will receive a row as its argument, and anything it returns will be collected into a new Series object:
df.apply(your_function, axis=1)
Example:
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
a b
0 0 0.880075
1 1 0.143038
2 2 0.795188
>>> def func(row):
...     return row['a'] + row['b']
>>> df.apply(func, axis=1)
0 0.880075
1 1.143038
2 2.795188
dtype: float64
As for the second part of the question: row-wise operations using pandas apply, even optimised ones, are not the fastest solution there is. They are typically faster than a Python for loop over iterrows(), but not the fastest. You can test that by timing the operations and you'll see the difference.
Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases or other logic that should be performed on each row. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
Or you can try numba. :)
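Before reaching for Cython or numba, it may be worth checking whether a simple branch can still be expressed column-wise. The sketch below is only an illustration: the branching rule in func and the sample data are made up, and np.where is one possible way to vectorize it, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up sample data, only for illustration.
df = pd.DataFrame({"a": [0, 1, 2], "b": [0.5, 0.25, 0.75]})

def func(row):
    # Hypothetical row-wise rule with one branch.
    if row["a"] > 0:
        return row["a"] + row["b"]
    return -row["b"]

# Row-wise version: func is called once per row.
by_row = df.apply(func, axis=1)

# Column-wise version: both branches are computed for whole columns,
# and np.where picks the right one element by element.
by_col = pd.Series(
    np.where(df["a"] > 0, df["a"] + df["b"], -df["b"]),
    index=df.index,
)

print(by_row.equals(by_col))
```

Both versions give the same Series here; the column-wise one avoids calling a Python function per row, which is usually where the time goes.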
I have two datasets from different sources and I would like to filter out redundant entries. So I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, otherwise False.
But "compare_df" is more complex because the information is not formatted the same way, and I check for time window overlaps, so simply checking whether elements match is not possible.
Also, the two dfs have no matching columns.
My current solution is using apply twice like in this code:
result_df = first_df[first_df.apply(lambda x: ~second_df.apply(
compare_df, axis=1,
args=[x, no_end_time, ]).any(), axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to just "break" the second apply function as soon as a "True" value is returned.
Using iterrows should be slower, by the way, because there shouldn't be much redundancy, so the benefit of being able to break out of the loop early probably won't outweigh the faster apply function from pandas.
(I know numba could help, but it seems too complicated for this simple task)
Thanks for suggestions and other hints!
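One possible way to get the early exit asked about above is Python's built-in any() over a generator, which stops at the first True. This is a rough sketch under assumptions: overlaps is a hypothetical stand-in for compare_df, the start/end columns are invented, and whether it beats the nested apply depends on the data, so it should be timed.

```python
import pandas as pd

def overlaps(row_a, row_b):
    # Hypothetical stand-in for compare_df: True when two time windows overlap.
    return row_a.start <= row_b.end and row_b.start <= row_a.end

def drop_matches(first_df, second_df, compare):
    # Materialize the second frame's rows once instead of once per outer row.
    second_rows = list(second_df.itertuples(index=False))
    keep = [
        # any() short-circuits: it stops at the first matching row.
        not any(compare(a, b) for b in second_rows)
        for a in first_df.itertuples(index=False)
    ]
    return first_df[pd.Series(keep, index=first_df.index)]

# Invented sample data, only for illustration.
first_df = pd.DataFrame({"start": [0, 10, 20], "end": [5, 15, 25]})
second_df = pd.DataFrame({"start": [4, 30], "end": [6, 40]})

result_df = drop_matches(first_df, second_df, overlaps)
print(result_df)
```

Only the first row of first_df overlaps a window in second_df, so it is the one that gets dropped.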
Say I have some dask dataframe. I'd like to do some operations on it, then save it to csv and print its len.
As I understand, the following code will make dask to compute df twice, am I right?
df = dd.read_csv('path/to/file', dtype=some_dtypes)
#some operations...
df.to_csv("path/to/out/*")
print(len(df))
Is it possible to avoid computing twice?
Update:
That's what happens when I use the solution by @mdurant,
but there are really almost 6 times fewer rows.
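Yes, you can achieve this. Pass the optional keyword compute=False to to_csv to get a lazy version of the write-to-disk process, and use df.size, which, like len(), can also be computed lazily. Note that df.size counts elements (rows × columns) rather than rows, so divide the result by len(df.columns) to get the row count.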
import dask
futs = df.to_csv("path/to/out/*", compute=False)
_, l = dask.compute(futs, df.size)
This will notice the common work required for the writing and length and not have to read the data twice.
Suppose I have two bitboards represented using a numpy array:
import numpy
a = numpy.zeros(2, dtype=numpy.int64)
Let's say that I want to set the 10th bit of the first bitboard. What's the fastest way to do this?
There are two ways that I can think of. Here's the first way:
numpy.bitwise_or(a[0], numpy.left_shift(1, 10), out=a, where=(True, False))
Here's the second way:
a[0] |= 1 << 10
Which one is faster? Is there any other way to do this? In particular, I'd like to know:
When I access a[0] does numpy return an int64 or a Python long?
If it returns a Python long then I'm assuming that both methods are pretty slow because they work on arbitrary-precision numbers. Am I right in assuming that?
If so then is there any way to get bitwise operations to work on fixed-precision numbers?
Note that I'm using Python version 3.
Which one is faster? Is there any other way to do this?
The second method is faster.
When I access a[0] does numpy return an int64 or a Python long?
It'll return an int64.
If it returns a Python long then I'm assuming that both methods are pretty slow because they work on arbitrary-precision numbers. Am I right in assuming that?
More details in this thread: Slow bitwise operations
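As a quick check of the points above, here is a small sketch of the second method. The helper names set_bit and has_bit are made up for this example, and it assumes bit positions 0 to 62 so the mask fits in a signed 64-bit integer.

```python
import numpy as np

def set_bit(boards, index, bit):
    # In-place OR on a single element of the int64 array (bit assumed in 0..62).
    boards[index] |= 1 << bit

def has_bit(boards, index, bit):
    # True if the given bit of boards[index] is set.
    return bool(boards[index] & (1 << bit))

a = np.zeros(2, dtype=np.int64)
set_bit(a, 0, 10)

print(type(a[0]))        # element access gives a NumPy scalar, not a Python int
print(a)                 # only the first bitboard changed
print(has_bit(a, 0, 10), has_bit(a, 1, 10))
```

The array keeps its int64 dtype throughout, so the stored values stay fixed-precision.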
I have a financial dataset with ~2 million rows. I would like to import it as a pandas dataframe and add additional columns by applying rowwise functions utilizing some of the existing column values. For this purpose I would like to not use any techniques like parallelization, hadoop for python, etc, and so I'm faced with the following:
I am already doing this similar to the example below and performance is poor, ~24 minutes to just get through ~20K rows. Note: this is not the actual function, it is completely made up. For the additional columns I am calculating various financial option metrics. I suspect the slow speed is primarily due to iterating over all the rows, not really the functions themselves as they are fairly simple (e.g. calculating price of an option). I know I can speed up little things in the functions themselves, such as using erf instead of the normal distribution, but for this purpose I want to focus on the holistic problem itself.
def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px
# Method 1 (could also use itertuples here) - this is the one that takes ~24 minutes now
# iterrows() yields (index, row) pairs, so unpack both
for idx, row in df.iterrows():
    df.loc[idx, 'px'] = func(alpha, beta, row['time'], row['vol'])
I have also tried vectorizing this but keep getting an error about 'cannot serialize float' or something like that.
My thought is to try one of the following methods, and I am not sure which one would theoretically be fastest? Are there non-linearities associated with running these, such that a test with 1000 rows would not necessarily indicate which would be fastest across all 2 million rows? Probably a separate question, but should I focus on more efficient ways to manage the dataset rather than just focus on applying the functions?
# Alternative 1 (df.apply with existing function above)
df['px'] = df.apply(lambda row: func(alpha, beta, row['time'], row['vol']), axis=1)
# Alternative 2 (numba & jit)
from numba import jit

@jit
def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px
# Alternative 3 (cython)
def func_cython(double alpha, double beta, double time, double vol):
    cdef double px
    px = (alpha*beta)/time * vol
    return px
In the case of Cython and numba, would I still iterate over all the rows using df.apply? Or is there a more efficient way?
I have referenced the following and found them to be helpful in understanding the various options, but not what the 'best' way is to do this (though I suppose it depends ultimately on the application).
https://lectures.quantecon.org/py/need_for_speed.html
Numpy vs Cython speed
Speeding up a numpy loop in python?
Cython optimization
http://www.devx.com/opensource/improve-python-performance-with-cython.html
How about simply:
df.loc[:, 'px'] = (alpha * beta) / df.loc[:, 'time'] * df.loc[:, 'vol']
By the way, your for-loop/lambda solutions are slow because the overhead for each pandas access is large. So accessing each cell separately (via looping over each row) is much slower than accessing the whole column.
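To make the comparison concrete, here is a small sketch that runs the made-up func from the question both ways. The values of alpha and beta and the sample rows are invented for the example; the point is only that the same function can take whole columns instead of one row at a time.

```python
import numpy as np
import pandas as pd

def func(alpha, beta, time, vol):
    # Same made-up formula as in the question; works on scalars or whole columns.
    return (alpha * beta) / time * vol

# Invented parameters and data, only for illustration.
alpha, beta = 0.5, 2.0
df = pd.DataFrame({"time": [1.0, 2.0, 4.0], "vol": [0.2, 0.4, 0.8]})

# Row-wise: one Python call and several pandas lookups per row.
df["px_apply"] = df.apply(
    lambda row: func(alpha, beta, row["time"], row["vol"]), axis=1
)

# Column-wise: a single call that operates on entire Series.
df["px_vec"] = func(alpha, beta, df["time"], df["vol"])

print(np.allclose(df["px_apply"], df["px_vec"]))
```

Because the arithmetic operators are already element-wise on Series, no change to func is needed here; timing both versions on a larger sample should show the difference.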
I have a pandas Panel that is long, wide, and shallow. In reality it's bigger but for ease of example, let's say it's 2x3x6:
panel=pd.Panel(pd.np.random.rand(2,3,6))
I have a Series that is the length of the shortest dimension - in this case 2:
series=pd.Series([0,1])
I want to multiply the panel by the series, by broadcasting the series across the two other axes.
Using panel.mul doesn't work, because that can only take Panels or DataFrames, I think
panel.mul(series) # returns None
Using panel.apply(lambda x: x.mul(series), axis=0) works, but seems to do the calculation across every combination of series, in this case 3x6=18, but in reality >1m series, and so is extremely slow.
Using pd.np.multiply seems to require a very awkward construction:
pd.np.multiply(panel, pd.np.asarray(series)[:, pd.np.newaxis, pd.np.newaxis])
Is there an easier way?
I don't think there's anything wrong conceptually with your last way of doing it (and I can't think of an easier way). A more idiomatic way to write it would be
import numpy as np
panel.values * (series.values[:,np.newaxis,np.newaxis])
using values to return the underlying numpy arrays of the pandas objects.
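The broadcasting itself can be tried with plain NumPy arrays. In this sketch, values and weights are stand-ins for panel.values and series.values (the names and numbers are made up), so it does not depend on pd.Panel being available.

```python
import numpy as np

# Stand-in for panel.values: shape (2, 3, 6).
values = np.arange(2 * 3 * 6, dtype=float).reshape(2, 3, 6)

# Stand-in for series.values: one weight per item along the first axis.
weights = np.array([0, 1])

# Adding two new axes gives weights the shape (2, 1, 1), which
# broadcasts across the remaining (3, 6) axes.
result = values * weights[:, np.newaxis, np.newaxis]

print(result.shape)
print(result[0].sum(), np.array_equal(result[1], values[1]))
```

The first slice is multiplied by 0 and the second by 1, which makes it easy to see that each weight was applied along the first axis only.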