I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *
df = read_csv('table.csv', index_col=0, parse_dates=True)

for i, row in enumerate(df.values):
    date = df.index[i]
    open, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc.
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
The newest versions of pandas now include a built-in function for iterating over rows.
for index, row in df.iterrows():
    # do some logic here
Or, if you want it faster, use itertuples().
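For example, a minimal sketch (the column name is taken from the question's CSV):
for row in df.itertuples():
    # row is a namedtuple; row.Index is the index label, the remaining fields are the columns
    print(row.Index, row.Close)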
But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close is a 1-d array, and you want the day-over-day percent change,
pct_change = close[1:] / close[:-1] - 1
This computes the entire array of percent changes as one statement, instead of
pct_change = []
for row in close:
    pct_change.append(...)
So try to avoid the Python loop for i, row in enumerate(...) entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
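For instance, pandas already has a vectorized helper for exactly this calculation (a minimal sketch; 'Close' is the column name from the CSV in the question):
pct_change = df['Close'].pct_change()  # vectorized over the whole column, NaN for the first row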
As mentioned before, a pandas object is most efficient when it processes the whole array at once. However, for those who, like me, really need to loop through a pandas DataFrame to perform something, I found at least three ways to do it. I ran a short test to see which one of the three is the least time consuming.
import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)
Result:
[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]
This is probably not the best way to measure the time consumption but it's quick for me.
Here are some pros and cons IMHO:
.iterrows(): returns the index and the row items in separate variables, but is significantly slower
.itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
zip: quickest, but no access to the index of the row (see the sketch below for a workaround)
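If you do need the index with zip, one workaround (a small sketch, not part of the benchmark above) is to zip the index in as well:
for idx, a, b in zip(t.index, t['a'], t['b']):
    C.append((idx, a, b))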
EDIT 2020/11/10
For what it is worth, here is an updated benchmark with some other alternatives (measured on a MacBook Pro, 2.4 GHz Intel Core i9, 8 cores, 32 GB 2667 MHz DDR4).
import sys
import tqdm
import time
import pandas as pd

B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
    C = []
    A = time.time()
    for i, r in t.iterrows():
        C.append((r['a'], r['b']))
    B.append({"method": "iterrows", "time": time.time() - A})

    C = []
    A = time.time()
    for ir in t.itertuples():
        C.append((ir[1], ir[2]))
    B.append({"method": "itertuples", "time": time.time() - A})

    C = []
    A = time.time()
    for r in zip(t['a'], t['b']):
        C.append((r[0], r[1]))
    B.append({"method": "zip", "time": time.time() - A})

    C = []
    A = time.time()
    for r in zip(*t.to_dict("list").values()):
        C.append((r[0], r[1]))
    B.append({"method": "zip + to_dict('list')", "time": time.time() - A})

    C = []
    A = time.time()
    for r in t.to_dict("records"):
        C.append((r["a"], r["b"]))
    B.append({"method": "to_dict('records')", "time": time.time() - A})

    A = time.time()
    t.agg(tuple, axis=1).tolist()
    B.append({"method": "agg", "time": time.time() - A})

    A = time.time()
    t.apply(tuple, axis=1).tolist()
    B.append({"method": "apply", "time": time.time() - A})

print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
    pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)
## Output
Python 3.7.9 (default, Oct 13 2020, 10:58:24)
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
mean std
method
zip + to_dict('list') 0.002353 0.000168
zip 0.003381 0.000250
itertuples 0.007659 0.000728
to_dict('records') 0.025838 0.001458
agg 0.066391 0.007044
apply 0.067753 0.006997
iterrows 0.647215 0.019600
You can loop through the rows by transposing and then calling iteritems:
for date, row in df.T.iteritems():
    # do some logic here
I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:
def my_algo(ndarray[object] dates, ndarray[float64_t] open,
            ndarray[float64_t] low, ndarray[float64_t] high,
            ndarray[float64_t] close, ndarray[float64_t] volume):
    cdef:
        Py_ssize_t i, n
        float64_t foo

    n = len(dates)
    for i from 0 <= i < n:
        foo = close[i] - open[i]  # will be extremely fast
I would recommend writing the algorithm in pure Python first; make sure it works and see how fast it is. If it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.
You have three options:
By index (simplest):
>>> for index in df.index:
... print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))
With iterrows (most used):
>>> for index, row in df.iterrows():
... print ("df[" + str(index) + "]['B']=" + str(row['B']))
With itertuples (fastest):
>>> for row in df.itertuples():
... print ("df[" + str(row.Index) + "]['B']=" + str(row.B))
All three options display something like:
df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12
Source: alphons.io
I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.
There's also iterkv, which iterates through (column, series) tuples.
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
df['b'] = df['a'].apply(lambda x: do_stuff(x))  # x is each individual value of column 'a'
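A concrete minimal sketch (the columns and the function here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0, 9.0]})
df['b'] = df['a'].apply(np.sqrt)  # applies np.sqrt to each value of column 'a'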
As @joris pointed out, iterrows is much slower than itertuples: itertuples is approximately 100 times faster. I tested the speed of both methods on a DataFrame with 5 million records; iterrows ran at about 1,200 it/s and itertuples at about 120,000 it/s.
If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
...                   index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row.col1, row.col2)
...
1 0.1
2 0.2
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray either via df.values (as you do) or by accessing each column separately df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.
index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values

for i in range(df.shape[0]):
    index_value = index[i]
    ...
    column_value_k = column_of_interestk[i]
Not pythonic? Sure. But fast.
If you want to squeeze more juice out of the loop, you will want to look into Cython, which can give huge speedups (think 10x-100x). For maximum performance, check out Cython memory views.
Another suggestion is to combine groupby with vectorized calculations if subsets of the rows share characteristics that allow you to do so, as sketched below.
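A minimal sketch of that idea (the 'ticker' column and the per-group percent change are illustrative assumptions, not from the question):
import pandas as pd

df = pd.DataFrame({
    'ticker': ['MSFT', 'MSFT', 'AAPL', 'AAPL'],
    'close':  [27.13, 27.31, 140.00, 141.50],
})
# one vectorized calculation per group instead of a Python loop over rows
df['pct_change'] = df.groupby('ticker')['close'].pct_change()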
Look at the last one:
import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
for r in range(len(t)):
    C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time() - A, 5))

C = []
A = time.time()
[C.append((x, y)) for x, y in zip(t['a'], t['b'])]
B.append(round(time.time() - A, 5))
B
0.46424  # iterrows
0.00505  # itertuples
0.00245  # zip
0.09879  # .loc in a loop
0.00209  # list comprehension over zip
I believe the simplest and most efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.
For a test case, let's use the example from @DSM's answer of calculating a percentage change. This is a very simple situation, and as a practical matter you would not write a loop to calculate it, but it provides a reasonable baseline for timing vectorized approaches vs loops.
Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.
import pandas as pd
import numpy as np
import numba as nb
df = pd.DataFrame({'close': [100, 105, 95, 105]})

pandas_vectorized = df.close.pct_change()[1:]

x = df.close.to_numpy()
numpy_vectorized = (x[1:] - x[:-1]) / x[:-1]

def test_numpy(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numpy_loop = test_numpy(df.close.to_numpy())[1:]

@nb.jit(nopython=True)
def test_numba(x):
    pct_chng = np.zeros(len(x))
    for i in range(1, len(x)):
        pct_chng[i] = (x[i] - x[i-1]) / x[i-1]
    return pct_chng

numba_loop = test_numba(df.close.to_numpy())[1:]
And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):
pandas/vectorized      1,130 microseconds
numpy/vectorized         382 microseconds
numpy/looped          72,800 microseconds
numba/looped             455 microseconds
Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.
Other details:
Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.
The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.
Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.
The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.
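For reference, here is one way the 100,000-row timing input from above could be constructed and measured in Jupyter (the random data below is my assumption; the answer does not show its exact setup):
df = pd.DataFrame({'close': np.random.default_rng(0).uniform(90, 110, 100_000)})
# %timeit df.close.pct_change()[1:]                           # pandas/vectorized
# %timeit x = df.close.to_numpy(); (x[1:] - x[:-1]) / x[:-1]  # numpy/vectorized
# %timeit test_numpy(df.close.to_numpy())[1:]                 # numpy/looped
# %timeit test_numba(df.close.to_numpy())[1:]                 # numba/looped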
Related
So I'm using Omnet++, a discrete event network simulator, to simulate different networking scenarios. At some point one can further process Omnet++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalues are stored in a single cell of such .csv file. When imported into a Pandas Dataframe, I get something like this.
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab

def runningAvg(x):
    sigma_x = np.cumsum(x)
    sigma_n = np.arange(1, x.size + 1)
    return sigma_x / sigma_n

for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
... to obtain a plot of the running average for each module (plot not shown here).
My question is: what's best in terms of performance:
use the data as is, meaning using those arrays inside each cell, looping over the DF to plot each array;
convert those arrays to pd.Series; in that case, would it be better to keep the module as the index?
would I benefit from unnesting those arrays into pd.Series?
thanks!
Well, I've wandered around a bit and it seems that converting the Omnet data into pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using Omnet data as is, lists inside Pandas DF.
figure(1)
start = datetime.datetime.now()
for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571 seconds.
2) Converting Omnet data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()

t = df1.vectime
v = df1.vecvalue

t = t.apply(pd.Series)
v = v.apply(pd.Series)

t = t.T
v = v.T

sigma_v = np.cumsum(v)
sigma_n = np.arange(1, v.shape[0] + 1)
sigma = sigma_v.T / sigma_n

plot(t, sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the latter, the total is 0.57266 seconds.
So it seems that I'll stick to method 1, looping over the different rows.
I have around 40k rows, and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250 million (250MM).
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))

for i in range(250000000):
    row_selection = np_arr[filters[i].astype(np.bool_)]  # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum()  # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
To get good performance within Numba, simply avoid masking and therefore the very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb

max_num = 250000  # 250000000
max_num2 = 4000   # 40000

np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0, high=2, size=(max_num, max_num2)).astype(np.bool_)

# Implement your functions like this, avoid masking
# Sum Filter
@nb.njit(fastmath=True)
def sum_filter(filter, arr):
    sum = 0.
    for i in range(filter.shape[0]):
        if filter[i] == True:
            sum += arr[i]
    return sum

# Implement your functions like this, avoid masking
# Prod Filter
@nb.njit(fastmath=True)
def prod_filter(filter, arr):
    prod = 1.
    for i in range(filter.shape[0]):
        if filter[i] == True:
            prod *= arr[i]
    return prod

@nb.njit(parallel=True)
def main_func(np_arr, filters):
    results = np.empty(filters.shape[0])
    for i in nb.prange(filters.shape[0]):
        results[i] = sum_filter(filters[i], np_arr)
        # results[i] = prod_filter(filters[i], np_arr)
    return results
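A short usage sketch for the functions above (same np_arr and filters as defined in the example; the first call includes Numba's compilation time):
results = main_func(np_arr, filters)
print(results[:5])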
One way to improve is to move the astype conversion outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time

max_num = 250000  # 250000000
max_num2 = 4000   # 40000

np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))

start = time.time()
for i in range(max_num):
    row_selection = np_arr[filters[i].astype(np.bool_)]  # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum()  # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12 seconds,
while
import numpy as np
import time

max_num = 250000  # 250000000
max_num2 = 4000   # 40000

np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)

start = time.time()
for i in range(max_num):
    row_selection = np_arr[filters[i]]  # Select rows based on next filter
    # Performing simple calculations such as sum, prod, count on selected rows and saving to result
    results[i] = row_selection.sum()  # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.94 seconds.
I have millions of records, each having an integer (p) and an X*3 matrix of values. For each record, the goal is to find a row from the matrix by selection criteria (see the if-statements in the code).
I'm fairly new to Python and am trying to make use of vectorization in Pandas instead of loops. I have written the program in two versions, one with Pandas+Numpy and one with simple loops.
I was told that using vectorization and Numpy array operations is faster than loops, but so far the loop version is about 10x faster.
Here is the program:
import numpy as np
import pandas as pd
import time
d = {
'values': [np.array([[1400,1400,1800000],[1500,1505,4800000],[1300,1305,5000]]), np.array([[800,900,80000],[1400,1420,50000],[1250,1300,60000]]), np.array([[1700,1750,5000000],[1900,1950,5000000],[1600,1600,3000000]]), np.array([])],
'p': [1300, 1350, 1800, 1400]
}
# The Pandas+Numpy version
def selection_numpy(row):
    try:
        # Select rows where col[0] >= p
        c1 = row['values'][row['values'][:, 0] >= row['p']]
        # Select rows where col[2] > 1000000
        c2 = c1[c1[:, 2] > 1000000]
        # Sort by col[0] and return the lowest row
        return c2[c2[:, 0].argsort()][0]
    except:
        pass

start = time.time()
df = pd.DataFrame(d)
df['result'] = df.apply(selection_numpy, axis=1)
# print(df.head())
print(time.time() - start)
# The loop version:
def selection_loop(values, p):
    lowest_num = 9999999999
    lowest_item = None
    # Iterate through each row in the matrix and replace lowest_item if it's lower than the previous one
    for item in values:
        if item[0] >= p and item[2] > 1000000 and item[0] < lowest_num:
            lowest_num = item[0]
            lowest_item = item
    return lowest_item

start = time.time()
d['result'] = []
for i in range(0, 4):
    result = selection_loop(d['values'][i], d['p'][i])
    d['result'].append(result)
# print(d['result'])
print(time.time() - start)
Both produce the same result values, but the loop version is orders of magnitude faster (for the actual million-record dataset, not for the 4 example records).
I assume there is a simple and elegant solution to find the desired row for each record which uses vectorization and is the fastest. Not sure why the function using Numpy arrays is so slow, but I appreciate any guidance.
New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby; however, that only seems to work for grouped data frames.
My use case is different: I have a list of holidays, and for my current row/date I want to find the number of days before and after this day to the nearest holiday.
This is the function I call via apply:
def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda x: abs(x - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')
How can I speed it up?
Edit
I experimented a bit with Python's pools, but it was neither nice code, nor did I get my computed results.
For the parallel approach this is the answer based on Parallelize apply after pandas groupby:
from joblib import Parallel, delayed
import multiprocessing

def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)
print ('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)
but I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays) work.
I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...
Let's just start with some dates...
import pandas as pd
dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...
from pandas.tseries.holiday import USFederalHolidayCalendar
holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')
This gives us:
DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
'2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
'2016-11-24', '2016-12-26',
...
'2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
'2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
'2030-11-28', '2030-12-25'],
dtype='datetime64[ns]', length=150, freq=None)
Now we find the indices of the next holiday on or after each of the original dates using searchsorted:
indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)
Then take the difference between the two:
next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])
You'll need to be careful about the indices so you don't wrap around, and for the previous holiday, do the calculation with indices - 1 (see the sketch below), but it should act as (I hope) a relatively good base.
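For completeness, a small sketch of the previous-holiday side (assuming no date falls before the first holiday, so indices - 1 never goes below zero):
prev_nearest = holidays[indices - 1]
prev_nearest_diff = pd.to_timedelta(dates.values - prev_nearest.values).days
# array([ 2,  4, 18, 17]) for the example dates above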
I think that the pandarallel package makes it way easier to do this now. Have not looked into it much, but should do the trick.
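A sketch of what that could look like (based on pandarallel's documented parallel_apply; the DataFrame, column, and helper names reuse the ones from the question and are otherwise assumptions):
# pip install pandarallel
from pandarallel import pandarallel

pandarallel.initialize()

# myDates is the datetime column from the question; holidays is a list/DatetimeIndex of holiday dates
datesFrame['nearestHoliday'] = datesFrame.myDates.parallel_apply(
    lambda x: get_nearest_holiday(holidays, x)
)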
You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)
def foo(x):
    """Your awesome function"""
    return np.sqrt(np.sum(x ** 2))
df = pd.DataFrame(np.random.random((1000, 1000)))
%%time
res = df.apply(foo, raw=True)
Wall time: 5.3 s
# p_apply - is parallel analogue of apply method
%%time
res = df.p_apply(foo, raw=True, executor='processes')
Wall time: 1.2 s
I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests.
How do I calculate the work done (which is "the area under the curve")?
I looked at numpy.trapz which seems to do what I need, but I think that I can avoid looping over each column like this:
import numpy as np
import pandas as pd
forces = pd.read_csv(...)
work_done = {}
for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))
I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something might be appropriate but don't know where to start looking.
In short:
Can I avoid the looping?
Can I create a DataFrame of work done directly?
Thanks in advance for any help.
You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:
import numpy as np
import pandas as pd
# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)
# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)
# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col: np.trapz(forces.loc[:, col], forces.index)})
These give the following output:
from pprint import pprint
pprint(work_done.T)
# 0
# a -24.331560
# b -10.347663
# c 4.662212
# d -12.536040
# e -10.276861
# f 3.406740
# g -3.712674
# h -9.508454
# i -1.044931
# j 15.165782
pprint(work_done2)
# {'a': -24.331559643023006,
# 'b': -10.347663159421426,
# 'c': 4.6622123535050459,
# 'd': -12.536039649161403,
# 'e': -10.276861220217308,
# 'f': 3.4067399176289994,
# 'g': -3.7126739591045541,
# 'h': -9.5084536839888187,
# 'i': -1.0449311137294459,
# 'j': 15.165781517623724}
There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.
Edit:
You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:
work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))
However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:
In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop
In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an
intermediate result is being cached
10000 loops, best of 3: 53.4 µs per loop
This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.
Here's how to get the cumulative integral along a dataframe column using the trapezoidal rule. Alternatively, the following creates a pandas.Series method for doing your choice of the trapezoidal, Simpson's, or Romberg's rule (source):
import pandas as pd
from scipy import integrate
import numpy as np
#%% Setup Functions
def integrate_method(self, how='trapz', unit='s'):
    '''Numerically integrate the time series.

    @param how: the method to use (trapz by default)
    @return

    Available methods:
     * trapz - trapezoidal
     * cumtrapz - cumulative trapezoidal
     * simps - Simpson's rule
     * romb - Romberg's rule

    See http://docs.scipy.org/doc/scipy/reference/integrate.html for the method details.
    or the source code
    https://github.com/scipy/scipy/blob/master/scipy/integrate/quadrature.py
    '''
    available_rules = set(['trapz', 'cumtrapz', 'simps', 'romb'])
    if how in available_rules:
        rule = integrate.__getattribute__(how)
    else:
        print('Unsupported integration rule: %s' % (how))
        print('Expecting one of these sample-based integration rules: %s' % (str(list(available_rules))))
        raise AttributeError

    if how == 'cumtrapz':
        result = rule(self.values)
        result = np.insert(result, 0, 0, axis=0)
    else:
        result = rule(self.values)
    return result

pd.Series.integrate = integrate_method
#%% Setup (random) data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
df = pd.DataFrame(x, columns=names)
#%% Cumulative Integral
df_cummulative_integral = df.apply(lambda x: x.integrate('cumtrapz'))
df_integral = df.apply(lambda x: x.integrate('trapz'))
df_do_they_match = df_cummulative_integral.tail(1).round(3) == df_integral.round(3)
if df_do_they_match.all().all():
    print("Trapz produces the last row of cumtrapz")