Compute mean and standard deviation for HDF5 data - python

I am currently running 100 simulations, each of which computes 1M values (i.e. one value per episode/iteration).
Main Routine
My main file looks like this:
# Defining the test simulation environment
def test_simulation():
    # use a different local name so the environment class is not shadowed
    env = environment(
        periods = 1000000,
        parameter_x = ...,
        parameter_y = ...
    )
    # Defining the simulation
    env.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The simulation procedure is as follows: within simulation() I generate a value_history that is continuously appended:
def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)
Hence, as a result, for each episode/iteration, I compute one value that is an array, e.g. [1.4 1.9] (player 1 having 1.4 and player 2 having 1.9 in the current episode/iteration).
Storing of Simulation Data
To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.
After running the simulations, I receive the following Keys structure:
Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>
Computing Statistics for Files
Now, the goal is to compute averages and standard deviations for each value in the 100 data files that I run, which means that, in the end, I would have a final_data set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row (for each player) across the 100 simulations).
The goal would thus be to get something like the following structure, [average_player1, average_player2], [std_player1, std_player2]:
episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]
I currently use the following code to extract the data and store it in a list:
def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []
    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"
    with h5py.File(filename, "r") as hf:
        # List all groups
        print("Keys: %s" % hf.keys())
        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])
            for element in data:
                result.append(element)
    return result
The data structure of result looks something like this:
[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]
First Attempt to Compute Means
I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):
mean_result = [np.mean(k) for k in zip(*list(result))]
However, this computes the average of each element in the array across the whole list since I appended each data set to the empty list. My goal, however, would be to compute an average/standard deviation across the 100 data sets defined above (i.e. one value is the average/standard deviation across all 100 data sets).
Is there any way to efficiently accomplish this?

This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)
Outline of the steps in the procedure (after opening the file):
1. Get basic size info from the file: the dataset count and the number of rows per dataset.
2. Use the values above to size the arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier (I created it to simplify the size calculations in step 2). The enumerate() counter i is used to write the episode values of each simulation to the correct column in each player array.
4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the statistic across each row of simulation results.
5. Next, load the data into the result dataset. I created 2 datasets (same data, different dtypes) as described below:
   a. The 'final_data' dataset is a simple float array of shape=(# of episodes,4), where you need to know what value is in each column. (I suggest adding an attribute to document this.)
   b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.
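To make step 4 concrete, here is a minimal sketch on a made-up 3x4 array (3 episodes, 4 simulation runs); the numbers are illustrative only. Note that np.std defaults to the population standard deviation (ddof=0); pass ddof=1 if you want the sample standard deviation across runs.

```python
import numpy as np

# Hypothetical data: rows are episodes, columns are simulation runs.
p1_arr = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 2.0, 2.0, 2.0],
                   [0.0, 4.0, 0.0, 4.0]])

row_mean = np.mean(p1_arr, axis=1)               # one mean per episode
row_std = np.std(p1_arr, axis=1)                 # population std (ddof=0)
row_std_sample = np.std(p1_arr, axis=1, ddof=1)  # sample std (ddof=1)

print(row_mean)
print(row_std)
print(row_std_sample)
```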
A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
Code to read and calculate statistics below.
import h5py
import numpy as np

def ExtractSimData(name, simulation_runs, length):
    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')
        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ ds_names[0] ].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt,sim_cnt))
        p2_arr = np.empty((ep_cnt,sim_cnt))
        for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation
            p1_arr[:,i] = hf[ds][:,0]
            p2_arr[:,i] = hf[ds][:,1]
        ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4),
                                compression='gzip', chunks=True)
        ds1[:,0] = np.mean(p1_arr, axis=1)
        ds1[:,1] = np.std(p1_arr, axis=1)
        ds1[:,2] = np.mean(p2_arr, axis=1)
        ds1[:,3] = np.std(p2_arr, axis=1)
        dt = np.dtype([ ('average_player1',float), ('average_player2',float),
                        ('std_player1',float), ('std_player2',float) ] )
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt,
                                compression='gzip', chunks=True)
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)

### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)
Code to create pseudo-data HDF5 file below.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000
    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods,players)
    if i == 0:
        mode='w'
    else:
        mode='a'
    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
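If holding every run in memory at once is a concern (1M episodes x 100 runs x 2 players is about 1.6 GB as float64), the mean and standard deviation can also be accumulated one dataset at a time. Below is a sketch using Welford's online update, which is a different technique from the stacked arrays above. The function name running_stats and the in-memory runs list are hypothetical stand-ins for iterating over the HDF5 datasets.

```python
import numpy as np

def running_stats(arrays):
    """Welford's online mean/std over an iterable of equally shaped arrays.

    Returns (mean, std), each shaped like one input array. The std uses
    ddof=0 to match np.std's default.
    """
    count = 0
    mean = None
    m2 = None  # running sum of squared deviations from the current mean
    for arr in arrays:
        arr = np.asarray(arr, dtype=float)
        if mean is None:
            mean = np.zeros_like(arr)
            m2 = np.zeros_like(arr)
        count += 1
        delta = arr - mean
        mean += delta / count
        m2 += delta * (arr - mean)
    if count == 0:
        raise ValueError("no arrays supplied")
    return mean, np.sqrt(m2 / count)

# Stand-in for 10 simulation runs of shape (episodes, players) = (1000, 2).
rng = np.random.default_rng(0)
runs = [rng.random((1000, 2)) for _ in range(10)]
mean, std = running_stats(runs)
print(mean.shape, std.shape)
```

With h5py, the iterable could be a generator such as (hf[name][:] for name in ds_names), so only one dataset is resident in memory at a time.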

Related

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
{'a': [list(np.random.rand(11)) for i in range(100)],
'b': [list(np.random.rand(11)) for i in range(100)],
'c': [list(np.random.rand(11)) for i in range(100)],
'd': [list(np.random.rand(11)) for i in range(100)],
'e': [list(np.random.rand(11)) for i in range(100)],
'f': [list(np.random.rand(11)) for i in range(100)]
}
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0,rows):
    for j in range(1,cols):
        lg = linregress(df.iloc[i,0],df.iloc[i,j])
        tstats[i,j-1] = lg.slope/lg.stderr
The code above works just fine and is doing exactly what I need, however as I mentioned above the performance begins to slow down when the # of rows and columns in df increases substantially.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!
I am new to this, but I tried to optimize your original code in two ways:
by using only Python built-in list objects (there is no need for pandas here, and to be honest I cannot find a better way to solve your problem in pandas than your original code :D)
by using numpy, which should be (or at least is claimed to be) faster than Python built-in lists.
You can jump straight to the code; it is in Jupyter notebook format, so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing random lists of length (30,),
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to test_perf in the sample code for the result.
PS: Only one thread was used during the test, so multithreading might improve performance, but that is beyond my ability...
Idea
I think numpy.nditer is suitable for your request, though the gain from this optimization is not that significant. Here is my idea:
Generate the input array
I changed the first part of your script; I think a list comprehension alone is enough to build a matrix of random lists. Refer to get_matrix_from_builtin.
Please note that I stored each random list in a 1-element tuple to keep the same shape as the ndarray generated with numpy.
For comparison, you can also construct such a matrix with numpy. Refer to get_matrix_from_numpy.
Because the numpy.array constructor tries to broadcast list-like objects (and I don't know how to stop it), I have to wrap each list in a tuple to avoid the automatic broadcast. If anyone has a better solution, please let me know, thanks :)
Calculate the result
Your original code uses a pandas.DataFrame and accesses each element by row/col index, but that is not how a DataFrame is meant to be iterated.
Pandas provides some iteration tools for DataFrame: pipe, apply, agg, and applymap (search the API docs for more info), but they do not seem suitable here, because you need the current row and col index during iteration.
I searched and found that numpy.nditer can provide that: it returns an iterator over an ndarray, and the iterator has an attribute multi_index that gives the row/col pair of the current element. See iterating-over-arrays.
Explanation of solve.ipynb
I used Jupyter Notebook to test this, so you might need to install it; here is the installation guide.
I changed your original code to remove the pandas dependency and use only built-in lists. Refer to old_calc_tstat in the sample code.
I also used numpy.nditer to calculate your tstats matrix. Refer to new_calc_tstat in the sample code.
Then I tested whether the results of both methods are equal, using the same input array so that the random data does not affect the test. Refer to test_equal for the result.
Finally, I timed both methods. I am not patient, so I only ran each once; you can increase the repeat count in the test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
    # use builtin list to construct matrix of random list
    # note I put random list inside a tuple to keep it same shape
    # as I later use numpy to do the same thing.
    return [
        [(list(np.random.rand(11)),)
         for col in range(6)]
        for row in range(100)
    ]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
    gen=np.random.rand,
    shape=(1, 1),
    nest_shape=(1, ),
):
    # custom dtype for random lists
    mydtype = [
        ('randonlist', 'f', nest_shape)
    ]
    a = np.empty(shape, dtype=mydtype)
    # [DOC] modifying array values
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
    # enable per-operand flag 'readwrite' to modify elements in the ndarray
    # enable global flag 'refs_ok' to allow use of the callable 'gen' in iteration
    with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
        for x in it:
            # pack the list in a 1-element tuple to prevent numpy from broadcasting it
            x[...] = (gen(nest_shape[0]), )
    return a

def test_get_matrix_from_numpy():
    gen = np.random.rand # generator of random list
    shape = (6, 100) # shape of matrix to hold random lists
    nest_shape = (11, ) # shape of random lists
    return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def old_calc_tstat(a=None):
    if a is None:
        a = get_matrix_from_builtin()
    a = np.array(a)
    rows, cols = a.shape[:2]
    tstats = np.zeros(shape=(rows, cols))
    for i in range(0, rows):
        for j in range(1, cols):
            lg = linregress(a[i][0][0], a[i][j][0])
            tstats[i, j-1] = lg.slope/lg.stderr
    return tstats
# %%
def new_calc_tstat(a=None):
    # read input matrix of random lists
    if a is None:
        gen = np.random.rand
        shape = (6, 100)
        nest_shape = (11, )
        a = get_matrix_from_numpy(gen, shape, nest_shape)
    # construct ndarray for t-stat result
    tstats = np.empty(a.shape)
    # enable global flag 'multi_index' to retrieve the index of the current element
    # [DOC] Tracking an Index or Multi-Index
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
    it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
    # obtain total column count from the shape of tstats
    col = tstats.shape[1]
    for x in it:
        i, j = it.multi_index
        # trick to avoid IndexError: subtract the column count after adding 1 to the index
        j = j + 1 - col
        lg = linregress(
            a[i][0][0],
            a[i][j][0]
        )
        # note: nditer ignores ZeroDivisionError by default and stores np.inf in the element,
        # so you have to override it manually:
        if lg.stderr == 0:
            x[...] = 0
        else:
            x[...] = lg.slope / lg.stderr
    return tstats
# new_calc_tstat()
# %%
def test_equal():
    """Test if the new method has equal output to old one"""
    # use same input list to avoid affect of rand
    a = test_get_matrix_from_numpy()
    old = old_calc_tstat(a)
    new = new_calc_tstat(a)
    print(
        "Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
            old.shape == new.shape, old.shape, new.shape),
    )
    res = (old == new)
    print(
        "Is the result object same?"
    )
    if res.all() == True:
        print("True.")
    else:
        print("False. Difference(new - old) as below:\n")
        print(new - old)
    return old, new

old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you perfer the old method, just add condition in new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
# time.clock was removed in Python 3.8; perf_counter works on all platforms
timer = time.perf_counter

def total(func, *args, _reps=1, **kwargs):
    start = timer()
    for i in range(_reps):
        ret = func(*args, **kwargs)
    elapsed = timer() - start
    return elapsed

def test_perf():
    """Test of performance"""
    # first, get a larger input array
    gen = np.random.rand
    shape = (1000, 100)
    nest_shape = (30, )
    a = get_matrix_from_numpy(gen, shape, nest_shape)
    # repeat how many time for each test
    reps = 1
    # then, time both old and new calculation method
    old = total(old_calc_tstat, a, _reps=reps)
    new = total(new_calc_tstat, a, _reps=reps)
    msg = "Time elapsed to run %d times on %s is %f seconds."
    print(msg % (reps, 'new method', new))
    print(msg % (reps, 'old method', old))

test_perf()
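Since both methods above still call linregress once per cell, most of the time is spent in Python-level calls. As an alternative sketch (not part of the answer's notebook), the slope t-statistic of a simple linear regression can be computed for all cells at once from the correlation coefficient r, using t = r * sqrt((n - 2) / (1 - r^2)), which equals slope / stderr. The function name slope_tstats and the array layout are assumptions made for this illustration.

```python
import numpy as np

def slope_tstats(x, ys):
    """Slope t-statistics for many simple regressions at once.

    x:  array of shape (rows, n), the regressor for each row
    ys: array of shape (rows, k, n), the k responses for each row
    Returns an array of shape (rows, k).
    """
    n = x.shape[-1]
    xc = x - x.mean(axis=-1, keepdims=True)    # centered x, (rows, n)
    yc = ys - ys.mean(axis=-1, keepdims=True)  # centered y, (rows, k, n)
    sxy = np.einsum('rn,rkn->rk', xc, yc)
    sxx = np.einsum('rn,rn->r', xc, xc)[:, None]
    syy = np.einsum('rkn,rkn->rk', yc, yc)
    r = sxy / np.sqrt(sxx * syy)
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

rng = np.random.default_rng(0)
x = rng.random((100, 11))      # stands in for column 'a': 100 rows of length-11 lists
ys = rng.random((100, 5, 11))  # stands in for columns 'b'..'f'
tstats = slope_tstats(x, ys)
print(tstats.shape)
```

Starting from the DataFrame in the question, x could be built with np.array(df['a'].tolist()) and ys by stacking the remaining columns along axis 1, assuming every list has the same length.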

Is there a faster method to calculate implied volatility using mibian module for millions of rows in a csv/xl file?

My situation:
The CSV file has been converted to a data frame df5, and all the columns used in the for loop below are of float type. This code works, but it takes many hours to process just 30,000 rows.
What I want from my situation:
I need to do the same operation on millions of rows and I am looking for fixes/alternate solutions that make it considerably faster.
Below is the code I am using currently:
for row in np.arange(0,len(df5)):
    underlyingPrice = df5.iloc[row]['CLOSE_y']
    strikePrice = df5.iloc[row]['STRIKE_PR']
    interestRate = 10
    dayss = df5.iloc[row]['Days']
    optPrice = df5.iloc[row]['CLOSE_x']
    result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
    df5.iloc[row,df5.columns.get_loc('IV')]= result.impliedVolatility
Your loop takes values from each row to build another column, IV.
This can be done faster with the apply method, which calls a function on each row (axis=1) or column and collects the results, so you avoid the repeated iloc lookups and assignments.
Something like this:
def useBS(row):
    underlyingPrice = row['CLOSE_y']
    strikePrice = row['STRIKE_PR']
    interestRate = 10
    dayss = row['Days']
    optPrice = row['CLOSE_x']
    result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
    return result.impliedVolatility

df5['IV'] = df5.apply(useBS, axis=1)
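Note that apply still calls BS once per row in Python. For millions of rows, one option is to skip mibian and solve for the implied volatility of all rows together. The following is a rough sketch of that idea, using a plain Black-Scholes call formula and bisection in NumPy; it is not mibian's implementation. The function names are hypothetical, and rates and volatilities are decimals with time in years here, so percent and day inputs would need converting first.

```python
import numpy as np
from scipy.special import ndtr  # standard normal CDF

def bs_call(S, K, r, T, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * ndtr(d1) - K * np.exp(-r * T) * ndtr(d2)

def implied_vol_call(price, S, K, r, T, lo=1e-4, hi=5.0, iters=60):
    """Vectorized bisection: find sigma in [lo, hi] matching each call price."""
    price, S, K, T = (np.asarray(a, dtype=float) for a in (price, S, K, T))
    lo = np.full_like(price, lo)
    hi = np.full_like(price, hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # the call price increases with sigma, so move the bracket accordingly
        too_low = bs_call(S, K, r, T, mid) < price
        lo = np.where(too_low, mid, lo)
        hi = np.where(too_low, hi, mid)
    return 0.5 * (lo + hi)

# Round-trip check on made-up inputs: price with a known sigma, then recover it.
S = np.array([100.0, 100.0])
K = np.array([100.0, 110.0])
T = np.array([0.5, 1.0])
true_sigma = np.array([0.20, 0.35])
price = bs_call(S, K, 0.05, T, true_sigma)
iv = implied_vol_call(price, S, K, 0.05, T)
print(iv)
```

Prices that are outside the range attainable for sigma in [lo, hi] will simply converge to one end of the bracket, so those rows should be checked separately.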

Get results in an Earth Engine python script

I'm trying to get NDVI mean in every polygon in a feature collection with earth engine python API.
I think that I succeeded getting the result (a feature collection in a feature collection), but then I don't know how to get data from it.
The data I want is IDs from features and ndvi mean in each feature.
import datetime
import ee
ee.Initialize()
#Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel");
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))
#Image collection
Sentinel_collection1 = (ee.ImageCollection('COPERNICUS/S2')).filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),datetime.datetime(2017, 8, 1))
# NDVI function to use with ee map
def NDVIcalc (image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    #NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer= ee.Reducer.mean(),collection= fc_filtered,scale= 10)
    return (MeansFeatures)
#Result that I don't know to get the information: Features ID and NDVI mean
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are more dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function
return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time-series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(fc_filtered)
    .filterDate(ee.Date('2017-01-01'),ee.Date('2017-08-01')))

def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
            .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
            .set('date', img.date().format("YYYYMMdd")))
    series = Sentinel_collection.map(NDVIcalc)
    # Get the time-series of values as two lists.
    list = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(list).flatten()))

result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))
# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
# merge() is a FeatureCollection method, so wrap the header feature first.
output = ee.FeatureCollection([header]).merge(result)
ee.batch.Export.table.toDrive(...)
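Once getInfo() returns, the result is plain Python data, so no Earth Engine calls are needed to read it. Here is a small sketch of extracting the feature IDs and mean NDVI from a flattened result. The sample dictionary is made up for illustration (IDs, dates, and values are hypothetical), and it assumes the reducer output is stored under the property name 'mean', which is what reduceRegions with a single mean reducer normally produces.

```python
# Hypothetical shape of result.flatten().getInfo(); all values are made up.
info = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "0_12", "geometry": None,
         "properties": {"NUM_DECS": 1, "date": "2017-01-05T10:00:00", "mean": 0.41}},
        {"type": "Feature", "id": "0_13", "geometry": None,
         "properties": {"NUM_DECS": 1, "date": "2017-01-05T10:00:00", "mean": 0.37}},
    ],
}

def features_to_rows(info, value_key="mean"):
    """Flatten a GeoJSON-like FeatureCollection dict into (id, date, value) tuples."""
    rows = []
    for feat in info.get("features", []):
        props = feat.get("properties", {})
        rows.append((feat.get("id"), props.get("date"), props.get(value_key)))
    return rows

rows = features_to_rows(info)
for fid, date, ndvi in rows:
    print(fid, date, ndvi)
```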

Tracking Error on a number of benchmarks

I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file. What I have so far is below (with the idea that arg1 represents all the benchmarks and is then applied using applymap), but it's returning a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value from every column to your function (index_analytics), so arg1 is a single scalar value from your dataframe rather than a column name. data[arg1] will therefore always raise a KeyError unless your values also happen to be column names.
You also shouldn't need apply to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark. Next time, include a sample of your dataframe.
data['Benchmark1_result'] = (data['Fund'] - data['Benchmark1']) / data['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume you have a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
# data[['Fund']] keeps a 2-D shape so it broadcasts against every benchmark column;
# axis=0 takes the standard deviation over time, giving one value per benchmark
np.std((data[['Fund']].values - data[benchmark_columns].values) / data[benchmark_columns].values, axis=0)
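As a worked illustration of the same calculation in pandas, here is a small sketch on made-up data; the column names other than 'Fund' and all the numbers are hypothetical. It follows the question's definition (standard deviation of the percent difference between the fund and each benchmark).

```python
import numpy as np
import pandas as pd

# Hypothetical series; Benchmark2 is exactly half the fund, so its
# percent difference is constant and its tracking error is zero.
data = pd.DataFrame({
    "Fund":       [100.0, 102.0, 101.0, 105.0],
    "Benchmark1": [100.0, 101.0, 102.0, 104.0],
    "Benchmark2": [50.0, 51.0, 50.5, 52.5],
})
benchmark_columns = ["Benchmark1", "Benchmark2"]

# (Fund - benchmark) / benchmark, computed row by row for every benchmark.
pct_diff = (data[benchmark_columns]
            .rsub(data["Fund"], axis=0)
            .div(data[benchmark_columns]))

# One tracking error per benchmark; ddof=0 matches np.std's default.
tracking_err = pct_diff.std(ddof=0)
print(tracking_err)
```

Note that pandas' std defaults to ddof=1, unlike np.std, which is why ddof is passed explicitly here.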
Assuming you're following the definition of Tracking Error below:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")

How to create a pivot table on extremely large dataframes in Pandas

I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0
    frame = frame.append(span)
    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
(df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.
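To show why add with fill_value=0 matters, here is a self-contained sketch that uses small in-memory DataFrames in place of store.select('df', chunksize=...). The chunk contents are made up; the column names shipmentid and qty are borrowed from the question. A group key that is missing from one chunk is treated as 0 rather than turning the sum into NaN.

```python
from functools import reduce

import pandas as pd

# Hypothetical chunks standing in for the chunks read from the HDF store.
chunks = [
    pd.DataFrame({"shipmentid": [1, 2, 1], "qty": [3, 1, 2]}),
    pd.DataFrame({"shipmentid": [2, 3], "qty": [4, 5]}),
    pd.DataFrame({"shipmentid": [3, 3], "qty": [1, 1]}),
]

# Per-chunk partial sums, produced lazily so only one chunk is processed at a time.
partials = (chunk.groupby("shipmentid").sum() for chunk in chunks)

# add(..., fill_value=0) keeps groups that appear in only some of the chunks.
total = reduce(lambda x, y: x.add(y, fill_value=0), partials)
print(total)
```

The result matches a groupby over the concatenated chunks, but without ever holding all chunks in memory together.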
