I have a simulation that uses pandas DataFrames to describe objects in a hierarchy. To achieve this, I have used a MultiIndex to show the route to a child object.
Parent df

         par_val
a b
0 0.0   0.366660
  1.0   0.613888
1 2.0   0.506531
  3.0   0.327356
2 4.0   0.684335
  0.0   0.013800
3 1.0   0.590058
  2.0   0.179399
4 3.0   0.790628
  4.0   0.310662
Child df

            child_val
a b   c
0 0.0 0      0.528217
  1.0 0      0.515479
1 2.0 0      0.719221
  3.0 0      0.785008
2 4.0 0      0.249344
  0.0 0      0.455133
3 1.0 0      0.009394
  2.0 0      0.775960
4 3.0 0      0.639091
  4.0 0      0.150854
0 0.0 1      0.319277
  1.0 1      0.571580
1 2.0 1      0.029063
  3.0 1      0.498197
2 4.0 1      0.424188
  0.0 1      0.572045
3 1.0 1      0.246166
  2.0 1      0.888984
4 3.0 1      0.818633
  4.0 1      0.366697
This implies that objects (0, 0, 0) and (0, 0, 1) in the child DataFrame are both characterised by the value at (0, 0) in the parent DataFrame.
When a function is applied to the child DataFrame for a certain subset of 'a', it may therefore need to grab a value from the parent. My current solution locates the value in the parent DataFrame by index, inside the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()

    # parent: (a, b) MultiIndex with one value column
    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),
         'b': np.append(np.arange(i / 2), np.arange(i / 2)),
         'par_val': np.random.rand(i)
         }).set_index(['a', 'b'])

    # child: two objects (c = 0, 1) for every parent row
    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c'])\
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        # rebuild the parent index (a, b) from the child row's index (a, b, c)
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)
    stop = time.time()
    dt.append(stop - start)

plt.plot(r, dt)
plt.show()
The solution function is becoming very costly for large numbers of iterations in the simulation:
(iterations (x) vs time in seconds (y))
Is there a more efficient method of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid this because the very large number of repetitions reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8 MB of memory (before the OS's Memory Compression feature kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
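If you want to check the actual numbers for your own frames, a quick sketch using pandas' built-in accounting (memory_usage is a standard DataFrame method; the 8-bytes-per-row figure above is just len(df_child) * 8):

# Rough memory check (sketch): current per-column usage of the child frame,
# plus what one extra float64 column (par_val) would add.
print(df_child.memory_usage(deep=True))    # bytes per column, including the index
extra_mb = len(df_child) * 8 / 1e6         # one float64 per child row
print(f"adding par_val would cost roughly {extra_mb:.1f} MB")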
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python-level loop instead of much faster vectorized code. In the SQL world this loop approach is called "row-by-agonizing-row" (RBAR), and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
a_mask = df_child.index.get_level_values('a') == 0
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
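If you want to confirm the vectorized path reproduces the original result before dropping the slow version, a quick check along these lines (a sketch, assuming solution, df_par and df_child from the question are still in scope):

# Sanity check (sketch): compare the vectorized 'solution' column against the
# original apply-based computation on the same masked rows.
expected = df_child.loc[a_mask].apply(solution, df_par=df_par, var=10, axis=1)
pd.testing.assert_series_equal(df_child.loc[a_mask, 'solution'], expected,
                               check_names=False)

The trick is that droplevel(-1) turns the child index (a, b, c) back into the parent index (a, b), so the df_par.loc lookup and the division align element-wise on the index without any Python-level loop.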
Before: (timing plot of the original apply-based version)
After: (timing plot of the vectorized version)
I need to apply a function to a df. I used pandarallel to parallelize the process; however, I have an issue here: I need to give fun_do N rows per call so that I can use vectorization inside that function.
The following calls fun_do on each row. Any idea how to make a single call per batch while keeping the parallelization?
def fun_do(value_col):
    return do(value_col)

df['processed_col'] = df.parallel_apply(lambda row: fun_do(row['col']), axis=1)
A possible solution is to create virtual groups of N rows:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

# Setup MRE
pandarallel.initialize(progress_bar=False)
df = pd.DataFrame({'col1': np.linspace(0, 100, 11)})

def fun_do(sr):
    return sr**2

N = 4  # size of chunk
df['col2'] = (df.groupby(pd.RangeIndex(len(df)) // N)
                .parallel_apply(lambda x: fun_do(x['col1']))
                .droplevel(0))  # <- remove virtual group index
Output:
>>> df
     col1     col2
0     0.0      0.0
1    10.0    100.0
2    20.0    400.0
3    30.0    900.0
4    40.0   1600.0
5    50.0   2500.0
6    60.0   3600.0
7    70.0   4900.0
8    80.0   6400.0
9    90.0   8100.0
10  100.0  10000.0
Note: I don't know why groupby(...)['col'].parallel_apply(fun_do) doesn't work. It seems parallel_apply is not available with SeriesGroupBy.
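For reference, the same chunking can be sanity-checked without pandarallel using plain groupby().apply() (a sketch; depending on your pandas version you may need group_keys=True explicitly so that droplevel(0) has a group level to remove):

# Plain-pandas equivalent (sketch): same virtual groups of N rows, no parallelism.
df['col2_check'] = (df.groupby(pd.RangeIndex(len(df)) // N, group_keys=True)
                      .apply(lambda x: fun_do(x['col1']))
                      .droplevel(0))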
This is the first time I've used pandarallel; I usually use the multiprocessing module.
I have a pandas dataframe with example data:

idx  price  lookback
0        5
1        7         1
2        4         2
3        3         1
4        7         3
5        6         1
Lookback can be positive or negative, but I want to take its absolute value as the number of rows to look back.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx  price  lookback  lb_price
0        5       NaN       NaN
1        7         1       NaN
2        4         2       NaN
3        3         1         7
4        7         3         5
5        6         1         3
I started with what felt like the most obvious way; this did not work:
df['lb_price'] = df['price'].shift(df['lookback'].abs() + 1)
I then tried using a lambda; this did not work, but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x + 1)]))
df['lb_price'] = sbc(df['price'], df['lookback'].abs())
I also tried a loop (which was extremely slow, but worked), but I am sure there is a better way:

lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) <= i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several of the provided solutions worked great (thank you!), but all needed some small tweaks to deal with my potentially negative numbers and the fact that it is lookback + 1, not lookback - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, since improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
    return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
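Since the question also mentions folding the streak-roc calculation into the same step, the vectorized lookup above can be extended in one pass; a sketch built from Bill's edited version plus the formula from the question (not part of any posted answer):

# Sketch: compute lb_price and streak-roc together, reusing the vectorized lookup.
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']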
By getting the row's index inside the df.apply() call using row.name, you can generate the 'lb_price' data relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function:

def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
     price  lookback  lb_price
idx
0        5       NaN       NaN
1        7       1.0       NaN
2        4       2.0       NaN
3        3       1.0       7.0
4        7       3.0       5.0
5        6       1.0       3.0
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, which I store in an array.
The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan
df['lb_price'] = new
Could you please help me solve a specific task? I need to process a pandas DataFrame column row by row. The main point is that "None" values must be turned into "0" or "1" to carry forward the "0" or "1" values that are already in the column. I've done it using a "for" loop, and it works correctly:
for i in np.arange(1, len(df['signal'])):
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 0:
        df['signal'].iloc[i] = 0
    if pd.isnull(df['signal'].iloc[i]) and df['signal'].iloc[i-1] == 1:
        df['signal'].iloc[i] = 1
But iterating over a DataFrame like this is not a good approach.
I tried the "loc" method, but it gives incorrect results because each step does not take the previously assigned results into account, so some "None" values remain unchanged.
df.loc[(pd.isnull(df['signal'])) & (df['signal'].shift(1) == 0), 'signal'] = 0
df.loc[(pd.isnull(df['signal'])) & (df['signal'].shift(1) == 1), 'signal'] = 1
Does anyone have any idea how to implement this task without a "for" loop?
There are vectorized functions for just this purpose that will be much faster:
df = pd.DataFrame(dict(a=[1,1,np.nan, np.nan], b=[0,1,0,np.nan]))
df.ffill()
# df
     a    b
0  1.0  0.0
1  1.0  1.0
2  NaN  0.0
3  NaN  NaN

# output
     a    b
0  1.0  0.0
1  1.0  1.0
2  1.0  0.0
3  1.0  0.0
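Applied to the column from the question, that is simply (assuming the series starts with a non-null 0 or 1):

# Forward-fill: each NaN in 'signal' takes the most recent non-null 0 or 1 above it.
df['signal'] = df['signal'].ffill()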
You can use numpy where:
import numpy as np
df['signal'] = np.where(pd.isnull(df['signal']), df['signal'].shift(1), df['signal'])
I have a dataframe which looks like this:

          1   2
a_value   2   8
a_ref     4   2
b_value   6  10
b_ref     3  15
c_value   7   3
Note that some indices are pairs of name_value and name_ref while others are not.
I want to find those pairs, and for each pair get four rows in my new dataframe: name_value, name_ref, name_ref/name_value, name_value/name_ref so my output dataframe looks like this:
                 1       2
a_value        2.0   8.000
a_ref          4.0   2.000
a_value/a_ref  0.5   4.000
a_ref/a_value  2.0   0.250
b_value        6.0  10.000
b_ref          3.0  15.000
b_value/b_ref  2.0   0.666
b_ref/b_value  0.5   1.500
I currently do it by iterating over the indices, looking for ones that end with _value and then trying to find the matching _ref, but knowing pandas, it seems there should be an easier way, maybe using groupby somehow. So, is there?
This may not be the most elegant solution, but it works. First, lets find the common keys:
import numpy as np
keys = np.intersect1d(df.index.str.extract("(.+)_value").dropna(),
                      df.index.str.extract("(.+)_ref").dropna())
# array(['a', 'b'], dtype=object)
Next, select the matching refs and values:
refs = df.loc[keys + "_ref"]
values = df.loc[keys +"_value"]
Make a copy of each dataframe and assign them the keys as indexes:
values1 = values.copy()
values1.index = keys
refs1 = refs.copy()
refs1.index = keys
Perform the division and update the indexes once again:
ratios = values1 / refs1
ratios.index += "_value" + "/" + ratios.index + "_ref"
ratios1 = refs1 / values1
ratios1.index += "_ref" + "/" + ratios1.index + "_value"
Put everything together and sort:
pd.concat([refs, values, ratios, ratios1]).sort_index()
#                  1          2
# a_ref          4.0   2.000000
# a_ref/a_value  2.0   0.250000
# a_value        2.0   8.000000
# a_value/a_ref  0.5   4.000000
# b_ref          3.0  15.000000
# b_ref/b_value  0.5   1.500000
# b_value        6.0  10.000000
# b_value/b_ref  2.0   0.666667
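Since the question specifically asks about groupby, here is a hedged sketch of that route: group rows by the prefix before the final underscore and emit the ratios for complete pairs only (unpaired names such as c_value are dropped; groupby.apply behaviour can differ slightly between pandas versions):

import pandas as pd

prefix = df.index.str.rsplit('_', n=1).str[0]   # 'a_value' -> 'a', 'b_ref' -> 'b', ...

def with_ratios(g):
    name = g.name
    if not {name + '_value', name + '_ref'}.issubset(g.index):
        return None                              # drop unpaired names (e.g. c_value)
    value, ref = g.loc[name + '_value'], g.loc[name + '_ref']
    ratios = pd.DataFrame({f'{name}_value/{name}_ref': value / ref,
                           f'{name}_ref/{name}_value': ref / value}).T
    return pd.concat([g, ratios])

out = df.groupby(prefix, group_keys=False).apply(with_ratios)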
How do I compare values to the next or previous items in a loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so dfoutput should look like the desired output described below (originally posted as a picture).
This code doesn't work because I can't compare an item to the next one.
Maybe there is another, simple way to do this without looping?
sumrep = 0
df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]  # It will be easier to assign repetitions in the output df - index will be equal to the number of repetitions
dfoutput = pd.DataFrame(0, index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['1','2'])

# example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 == 0:  # can't find a way to check the NEXT val1 (one row below) in column 1 :/
        if sumrep == 0:
            dfoutput.loc[1,1] = dfoutput.loc[1,1] + 1  # count only SINGLE occurrences of values and assign them to row 1 in dfoutput
        if sumrep > 0:
            dfoutput.loc[sumrep,1] = dfoutput.loc[sumrep,1] + 1  # count repeated occurrences greater than 1 and assign them to the proper row in dfoutput
            sumrep = 0
    elif val1 == 1 and df[val1+1] == 1:
        sumrep = sumrep + 1
Desired output table for column 1 (dfoutput) was posted as an image; row n should contain the number of runs of length n.
I don't understand why there isn't any simple method to move around a dataframe, like the Offset function in VBA in Excel :/
You can use the function defined here to perform fast run-length-encoding:
import numpy as np

def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna : bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial

def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
#        1    2
# 0    0.0  0.0
# 1    1.0  2.0
# 2    2.0  1.0
# 3    0.0  0.0
# 4    1.0  1.0
# 5    0.0  0.0
# 6    0.0  0.0
# 7    0.0  0.0
# 8    0.0  0.0
# 9    0.0  0.0
# 10   0.0  0.0
# 11   0.0  0.0
# 12   0.0  0.0
# 13   0.0  0.0
# 14   0.0  0.0
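For comparison, the same frequency table can also be built with plain pandas using the shift/cumsum run-labelling idiom (a sketch, not taken from the answer above):

# Label each run with a cumulative id, measure every run of `value`,
# then count how many runs have each length.
def run_length_counts(col, value=1, index=None):
    run_id = (col != col.shift()).cumsum()               # new id whenever the value changes
    run_lengths = run_id[col == value].value_counts()    # length of each run of `value`
    return run_lengths.value_counts().reindex(index, fill_value=0)

df.apply(lambda c: run_length_counts(c, index=df.index))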