I have a financial dataset with ~2 million rows. I would like to import it as a pandas DataFrame and add additional columns by applying row-wise functions that use some of the existing column values. For this purpose I would like to avoid techniques like parallelization, Hadoop for Python, etc., and so I'm faced with the following:
I am already doing this similar to the example below and performance is poor, ~24 minutes to just get through ~20K rows. Note: this is not the actual function, it is completely made up. For the additional columns I am calculating various financial option metrics. I suspect the slow speed is primarily due to iterating over all the rows, not really the functions themselves as they are fairly simple (e.g. calculating price of an option). I know I can speed up little things in the functions themselves, such as using erf instead of the normal distribution, but for this purpose I want to focus on the holistic problem itself.
def func(alpha, beta, time, vol):
    px = (alpha * beta) / time * vol
    return px

# Method 1 (could also use itertuples here) - this is the one that takes ~24 minutes now
# Note: iterrows yields (index, Series) pairs, and chained indexing like
# df['px'][row] doesn't reliably write back, so df.loc is used instead.
for i, row in df.iterrows():
    df.loc[i, 'px'] = func(alpha, beta, row['time'], row['vol'])
I have also tried vectorizing this but keep getting an error about 'cannot serialize float' or something like that.
My thought is to try one of the following methods, and I am not sure which one would theoretically be fastest? Are there non-linearities associated with running these, such that a test with 1000 rows would not necessarily indicate which would be fastest across all 2 million rows? Probably a separate question, but should I focus on more efficient ways to manage the dataset rather than just focus on applying the functions?
# Alternative 1 (df.apply with existing function above)
df['px'] = df.apply(lambda row: func(alpha, beta, row['time'], row['vol']), axis=1)
# Alternative 2 (numba & jit)
from numba import jit

@jit
def func(alpha, beta, time, vol):
    px = (alpha * beta) / time * vol
    return px
# Alternative 3 (cython)
def func_cython(double alpha, double beta, double time, double vol):
    cdef double px
    px = (alpha * beta) / time * vol
    return px
In the case of Cython and numba, would I still iterate over all the rows using df.apply? Or is there a more efficient way?
I have referenced the following and found them to be helpful in understanding the various options, but not what the 'best' way is to do this (though I suppose it depends ultimately on the application).
https://lectures.quantecon.org/py/need_for_speed.html
Numpy vs Cython speed
Speeding up a numpy loop in python?
Cython optimization
http://www.devx.com/opensource/improve-python-performance-with-cython.html
How about simply:
df.loc[:, 'px'] = (alpha * beta) / df.loc[:, 'time'] * df.loc[:, 'vol']
By the way, your for-loop/lambda solutions are slow because the overhead for each pandas access is large. So accessing each cell separately (via looping over each row) is much slower than accessing the whole column.
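As a concrete sketch of this vectorized approach, reusing the made-up `func` formula from the question (the parameter values and sample data here are hypothetical):

```python
import pandas as pd

# Hypothetical scalar parameters and a tiny sample frame.
alpha, beta = 1.5, 2.0
df = pd.DataFrame({"time": [1.0, 2.0, 4.0], "vol": [0.1, 0.2, 0.3]})

# One whole-column expression replaces the row-by-row loop:
# pandas/numpy evaluate it over the full arrays in C, not in Python.
df["px"] = (alpha * beta) / df["time"] * df["vol"]
```

On ~2 million rows this kind of expression typically finishes in well under a second, because the per-row Python and pandas access overhead disappears entirely.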
Good evening!
I have a code similar to the one I will paste below, it has a lot more data but the premise is the same. From both DataFrames I have to pull the first five values but when I am dealing with tens of millions of entries I cannot afford waiting sometimes up to an hour for it to compute the whole DataFrame and return me the first five values. I also cannot use simple Pandas DataFrames as they exceed my memory limit. Is there a solution to this?
import random
import pandas
import dask.dataframe as dd
import time
# 10,000,000 random integers between 1 and 1,000,000.
random_pool = [random.randint(1, 1000000) for i in range(10000000)]
random.shuffle(random_pool)
df1 = dd.from_pandas(pandas.DataFrame(random_pool[:100000], columns=["ID"]), npartitions=10)
df2 = dd.from_pandas(pandas.DataFrame(random_pool, columns=["ID"]), npartitions=10)
# Sorting both dataframes.
df1 = df1.sort_values("ID", ascending=True)
df2 = df2.sort_values("ID", ascending=True)
df1_start = time.time()
df1.head(5)
print("DF1 took {:.2f}.".format(time.time() - df1_start))
df2_start = time.time()
df2.head(5)
print("DF2 took {:.2f}.".format(time.time() - df2_start))
The first DataFrame takes around 0.41 seconds meanwhile the second one takes around 1.79.
One thing to keep in mind is that a value in dask is really a serialized stack of operations. Most computation is deferred until you actually ask for the values to be materialized - by calling head, or more generally, .compute().
In the spirit of the general advice on persist, you can try to use .persist() after the sort calls:
It is often ideal to load, filter, and shuffle data once and keep this result in memory. Afterwards, each of the several complex queries can be based off of this in-memory data rather than have to repeat the full load-filter-shuffle process each time. To do this, use the client.persist method [ The .persist() method in this case ].
Take a moment to think about what happens if we don't persist here: the deferred computation that resolves when you call head includes the sort_values call, so you are paying the cost of sorting all your data every time you call head. That explains why getting just the first five items has a cost proportional to the size of the whole dataset - the whole dataset is being sorted.
The answer is that dask is quite fast at getting the first five items. But it might not be so fast to resolve all the computations needed to get there, if they are not already in memory.
You should in general avoid whole-dataset shuffling like in this example - the sort!
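If the end goal is only the five smallest values, another option is to skip the global sort entirely: both pandas and dask.dataframe provide nsmallest, which keeps a small running selection rather than shuffling the whole dataset. A minimal pandas sketch with made-up data (in dask the call looks the same, followed by .compute()):

```python
import pandas as pd

df = pd.DataFrame({"ID": [42, 7, 99, 3, 15, 8, 23]})

# nsmallest avoids sorting the entire column just to read off five values.
top5 = df["ID"].nsmallest(5)
print(top5.tolist())  # [3, 7, 8, 15, 23]
```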
I'm writing a short program for some standard dataframe operations on Pandas but the time complexity of the program is O(n) due to the following piece of code:
criteria = ((cars["Color"] == order["CarColors"]["Include"])
            & (cars["Size"] != order["CarSize"]["Exclude"]))
cars[criteria]
criteria is used to filter the cars dataframe as I only want to include certain colors and exclude certain sizes. I ran the program for an increasingly large cars file and the time complexity increases linearly with the number of points.
I also tried np.isin as below but it actually made the performance worse. Anyone have an idea how I can improve the time complexity? I thought Boolean operators would be quicker than this.
criteria = np.isin(cars["Color"],order["CarColors"]["Include"]) \
& np.isin(cars["Size"],order["CarSize"]["Exclude"], invert=True)
Thanks
You may want to try and evaluate it like this:
car_color = order["CarColors"]["Include"]
car_size = order["CarSize"]["Exclude"]
cars.query("Color == @car_color and Size != @car_size")
When you do boolean indexing, you create some temporary arrays. If these arrays are much bigger than your CPU cache, then the query method could enhance performance a bit.
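A small self-contained sketch of this form, with made-up data (note that pandas uses @, not #, to reference local variables inside query):

```python
import pandas as pd

cars = pd.DataFrame({
    "Color": ["red", "blue", "red", "green"],
    "Size": ["S", "M", "L", "S"],
})
car_color = "red"
car_size = "L"

# query parses the expression string and can evaluate it with numexpr
# when installed, avoiding large temporary boolean arrays.
result = cars.query("Color == @car_color and Size != @car_size")
print(result.index.tolist())  # [0]
```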
Question
How can I create thousands of variables rather than using a DataFrame? Updating elements with
df1.loc[a,b] = df1.loc[a,b] + update_term
is so slow!!!
Current situation
I have 2,500 days of historical prices for 445 U.S. companies in a DataFrame (2500 rows * 445 columns).
What I am trying to do with these stock prices is calculate the 3 parameters of the equation shown below. Since each of the 445 companies has a_k and b_k parameters (445 of each), and each pair of companies has a w_j,k parameter ((445 * 444) / 2 in total), there are very many variables to create.
To hold the variables needed for the parameters above, I have made 3 DataFrames: two of 1 * 445 dimension (1 row, 445 columns) for a_k and b_k, and one of 445 * 445 dimension for w_j,k. A screenshot of this is shown below.
Since I update the parameters for each company using df.loc function like
parameter = parameter + df.loc[date,'company_name']
my codes are so slow!!
A real example from my code is shown below.
A_random_parameter = (df1.loc['row_index_1', company_x] +
                      df2.loc['row_index_2', company_x] *
                      df3.loc[date, 'company_y'])
Any suggestions for creating thousands of variables rather than doing it the DataFrame way?
Use arrays
If you need to run matrix calculations like the one shown in your equation, you want a data structure with fast random element access and a contiguous memory layout. In Python, the standard way to get that is NumPy arrays (https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.array.html).
Furthermore, in any such operations where you care about performance you DO NOT want to do a python loop and access/update each element individually. Not if you're using pandas dataframes, not if you're using numpy arrays, not if you're using tensorflow or anything else. Instead, you'd want to 'vectorize' the operations, i.e. use basic operations that work on whole vectors or matrixes "at once" so that the appropriate libraries can effectively parallelize their execution where possible. NumPy Basics: Arrays and Vectorized Computation may be relevant.
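As an illustrative sketch (the shapes here are made up, not the 2500 x 445 data), one vectorized statement can replace a nested Python loop over days and companies:

```python
import numpy as np

# Hypothetical (days x companies) price matrix.
prices = np.array([[10.0, 20.0],
                   [11.0, 19.0],
                   [12.0, 21.0]])

# One derived parameter per company, computed for all columns at once;
# updating prices[d, k] one element at a time in a loop would be far slower.
a = prices.mean(axis=0)
print(a.tolist())  # [11.0, 20.0]
```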
It turns out that using dictionary is much faster than using DataFrame for storing variables, as John Zwinck suggested. Thank you!
I have a pandas Panel that is long, wide, and shallow. In reality it's bigger, but for ease of example, let's say it's 2x3x6:
panel=pd.Panel(pd.np.random.rand(2,3,6))
I have a Series that is the length of the shortest dimension - in this case 2:
series=pd.Series([0,1])
I want to multiply the panel by the series, by broadcasting the series across the two other axes.
Using panel.mul doesn't work, because that can only take Panels or DataFrames, I think
panel.mul(series) # returns None
Using panel.apply(lambda x: x.mul(series), axis=0) works, but seems to do the calculation across every combination of series, in this case 3x6=18, but in reality >1m series, and so is extremely slow.
Using pd.np.multiply seems to require a very awkward construction:
pd.np.multiply(panel, pd.np.asarray(series)[:, pd.np.newaxis, pd.np.newaxis])
Is there an easier way?
I don't think there's anything wrong conceptually with your last way of doing it (and I can't think of an easier way). A more idiomatic way to write it would be
import numpy as np
panel.values * (series.values[:,np.newaxis,np.newaxis])
using values to return the underlying numpy arrays of the pandas objects.
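A tiny sketch of the same broadcasting trick on plain NumPy arrays (Panel has since been removed from pandas, but the newaxis reshaping is unchanged):

```python
import numpy as np

arr = np.ones((2, 3, 6))        # stand-in for panel.values
series = np.array([0.0, 1.0])   # one scalar per item along axis 0

# Reshape (2,) -> (2, 1, 1) so it broadcasts across the other two axes.
out = arr * series[:, np.newaxis, np.newaxis]
print(out.shape, out[0].sum(), out[1].sum())  # (2, 3, 6) 0.0 18.0
```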
In many places in our pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(), taking each row, doing some processing, and returning a value, which we ultimately collect into a new Series.
I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.
What would be the best way to make this usage pattern as efficient as possible?
Can we possibly do it without rewriting most of our code?
Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?
You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:
df.apply(your_function, axis=1)
Example:
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
a b
0 0 0.880075
1 1 0.143038
2 2 0.795188
>>> def func(row):
...     return row['a'] + row['b']
>>> df.apply(func, axis=1)
0 0.880075
1 1.143038
2 2.795188
dtype: float64
As for the second part of the question: row-wise operations, even optimized ones using pandas apply, are not the fastest solution there is. They are certainly a lot faster than a Python for loop, but not the fastest. You can verify that by timing the operations, and you'll see the difference.
Some operations can be converted to column-oriented ones (the one in my example could easily become just df['a'] + df['b']), but others cannot - especially if you have a lot of branching, special cases, or other logic that must be performed per row. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
Or you can try numba. :)
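Even row logic with branching can often stay column-oriented before reaching for Cython or numba; a hypothetical sketch using np.where in place of an if/else inside apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [10.0, 20.0, 30.0]})

# Row-wise version: return row['b'] if row['a'] > 0 else -row['b']
# Column-oriented equivalent, evaluated once over whole arrays:
df["c"] = np.where(df["a"] > 0, df["b"], -df["b"])
print(df["c"].tolist())  # [10.0, -20.0, 30.0]
```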