Monte Carlo DataFrame is Highly Fragmented - python

I'm very new to Python and trying to run a Monte Carlo simulation for a class. I'm receiving the warning below:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
results.loc[i, 'Power1']=Power1
I tried the below code:
for i in range(10000):
    Power1 = np.random.normal(16000, 1000, 1)
    Power2 = np.random.triangular(12000, 15000, 18000)
    Efficiency1 = np.random.normal(.88, .10, 1)
    Efficiency2 = np.random.normal(.85, .05, 1)
    TotalPower = (Power1 * Efficiency1) + (Power2 * Efficiency2)
    results.loc[i, 'Power1'] = Power1
    results.loc[i, 'Efficiency1'] = Efficiency1
    results.loc[i, 'Power2'] = Power2
    results.loc[i, 'Efficiency2'] = Efficiency2
    results.loc[i, 'TotalPower'] = TotalPower

Instead of looping over each row, you might want to try the following.
import numpy as np
import pandas as pd
N = 10000
result = pd.DataFrame(columns=['power1', 'efficiency1', 'power2', 'efficiency2', 'totalpower'])
result['power1'] = np.random.normal(16000, 1000, N)
result['power2'] = np.random.triangular(12000, 15000, 18000, size=N)
result['efficiency1'] = np.random.normal(0.88, 0.10, N)
result['efficiency2'] = np.random.normal(0.85, 0.05, N)
result['totalpower'] = result['power1']*result['efficiency1'] + result['power2']*result['efficiency2']
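If you prefer, you can also build the frame in a single call (a small variation on the answer above, not from the original post); because no columns are inserted after construction, the fragmentation warning cannot occur:
import numpy as np
import pandas as pd

N = 10000
# draw all samples up front, then construct the DataFrame in one shot
power1 = np.random.normal(16000, 1000, N)
power2 = np.random.triangular(12000, 15000, 18000, size=N)
efficiency1 = np.random.normal(0.88, 0.10, N)
efficiency2 = np.random.normal(0.85, 0.05, N)
result = pd.DataFrame({
    'power1': power1,
    'power2': power2,
    'efficiency1': efficiency1,
    'efficiency2': efficiency2,
    'totalpower': power1 * efficiency1 + power2 * efficiency2,
})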


Parallelize nested large `(15e4 * 15e4)` for loop to get a pairwise matrix

I am trying to parallelize the following code, which computes a pairwise result for every pair of rows, as shown below.
import pandas as pd

def get_custom_value(i, j):
    first = df[df['id'] == i]
    second = df[df['id'] == j]
    return int(first['val_1']) * int(second['val_1']) + \
           int(first['val_2']) * int(second['val_2'])

df = pd.DataFrame(
    {
        'id': range(4),
        'val_1': [3, 4, 5, 1],
        'val_2': [2, 3, 1, 1]
    }
)

n = df.shape[0]
result = []
for i in range(n):
    for j in range(i + 1, n):
        temp_value = get_custom_value(i, j)
        result.append([i, j, temp_value])
        if len(result) > 1e5:
            # store it in a local file and reset the result object.
            # Assume some code here that writes to a local file.
            result = []
print(result)
What have I tried? Below is the code. It hangs without raising any error.
import itertools
import multiprocessing
paramlist = list(itertools.combinations(df.id, 2))
pool = multiprocessing.Pool(processes = 2)
result = pool.map(get_custom_value, paramlist)
print(result)
Can I use dask for this?
The actual data has more than 150,000 records, i.e., the final result will have (150,000 * 150,000 * 1/2) pairs/rows. Given the huge size of the result object, I have a condition which, when satisfied, causes the result to be written out. Hence, the actual result object will not exceed my RAM.
The algorithm used is very inefficient. Indeed, both df['id'] == i and df['id'] == j scan the whole id column, which contains 150_000 items in your real-world use case. Thus, your algorithm runs in O(n^3) time and performs roughly 3_375_000_000_000_000 comparisons, while the best algorithm runs in O(n^2) time.
Moreover, CPython loops are very slow and you should avoid using them as much as possible. Fetching Pandas dataframe cells by name is very slow too. Instead, you can use vectorized Pandas/Numpy functions.
Additionally, the output is not efficient either: CPython lists are a bit slow (because of dynamic reference-counted objects) and storing the (i, j) values consumes three times more memory. You can store the result in a matrix, possibly a sparse one, or alternatively in a list of compact Numpy arrays.
Furthermore, bigger data structures are generally slower. If you want a computation to be done very quickly, you generally need to make it fit in the CPU caches (a few MiB). Thus, to process your dataframe efficiently, you certainly need to compute it in place.
Here is a relatively efficient solution using Numpy:
import numpy as np
val_1 = np.ascontiguousarray(df['val_1'].to_numpy())
val_2 = np.ascontiguousarray(df['val_2'].to_numpy())
result = val_1.reshape(-1, 1) * val_1 + val_2.reshape(-1, 1) * val_2
It produces an n² matrix where the (i, j) item can be found using result[i, j]. reshape(-1, 1) turns the horizontal vector into a vertical one so that Numpy broadcasting can compute all pairs at once. Note that you can filter the upper-triangular part using np.triu(result, 1).
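As a quick sanity check (my addition, using the 4-row df from the question), the vectorized matrix reproduces the original get_custom_value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(4), 'val_1': [3, 4, 5, 1], 'val_2': [2, 3, 1, 1]})
val_1 = df['val_1'].to_numpy()
val_2 = df['val_2'].to_numpy()
result = val_1.reshape(-1, 1) * val_1 + val_2.reshape(-1, 1) * val_2

# get_custom_value(0, 1) = 3*4 + 2*3 = 18
assert result[0, 1] == 18
# keep only the pairs with i < j
upper = np.triu(result, 1)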
You can generate the result line by line so as not to allocate a huge array:
val_1 = np.ascontiguousarray(df['val_1'].to_numpy())
val_2 = np.ascontiguousarray(df['val_2'].to_numpy())

for i in range(n - 1):
    first_val_1 = val_1[i]
    first_val_2 = val_2[i]
    line = first_val_1 * val_1[i+1:] + first_val_2 * val_2[i+1:]
    # Store the line if needed, together with the i value, so you know where it is
If you really want to generate an inefficient list from the Numpy array lines, then you can do that with np.vstack((np.repeat(i, n-i-1), np.arange(i+1, n), line)).T.tolist(). But I strongly advise you not to do that (there is certainly no need to use lists). Note that you can load/store Numpy arrays efficiently using np.load and np.save.
Here are the performances of the different approaches on my machine (an i5-9600KF processor, 2 DDR4 channels reaching 40 GiB/s, and a fast NVMe SSD that can practically write big files at 800 MiB/s) on a random Pandas dataframe with 15_000 records:
Initial code: 60500 seconds (estimation)
Numpy matrix: 0.71 second
Numpy line-by-line: 0.24 second
Storing all the lines in a compact way on my SSD: 0.50 second (estimation)
Thus, the Numpy line-by-line solution is about 250_000 times faster than the initial code! All of this without using multiple cores. In fact, using multiple cores will not be much faster in this case because the RAM is a limited shared resource and file storage is not much faster in parallel on most machines (in fact, HDDs are slower when used in parallel because they are inherently sequential). If you really want to parallelize this, then multiprocessing is definitely not the right tool. Please consider using Numba or Cython instead, as sketched below.
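For illustration (my addition, not part of the original answer), here is a minimal Numba sketch of that suggestion: it keeps the line-by-line structure but computes a reduction (the maximum pairwise value) over all i < j pairs in parallel, without materializing the full result. The function name pairwise_max is made up for this example.
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def pairwise_max(val_1, val_2):
    n = val_1.shape[0]
    partial = np.full(n, -np.inf)  # one partial maximum per row i
    for i in nb.prange(n - 1):
        local = -np.inf
        for j in range(i + 1, n):
            v = val_1[i] * val_1[j] + val_2[i] * val_2[j]
            if v > local:
                local = v
        partial[i] = local
    return partial.max()

# val_1 and val_2 are the contiguous Numpy arrays built above
print(pairwise_max(val_1, val_2))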
Yes, you can use dask for this. However, I would recommend adding a reduction to the computation that directly computes the quantity you are interested in (summary statistics [mean, median, ...], max/min over all values, etc.). Otherwise, you are looking at writing something in the ballpark of 250 GB of binary data, which isn't the most efficient use of computing time and disk space (unless your use case demands it).
import pandas as pd
import numpy as np
import dask.array as da

# generate fake data of desired size
num_records = int(150000)
df = pd.DataFrame(
    {
        # no need for an ID column, because you have a range index by default
        "val_1": np.random.randint(0, 500, size=num_records),
        "val_2": np.random.randint(0, 500, size=num_records),
    }
)

# index magic to walk over the upper triangle of a Fortran-contiguous array without the diagonal
# Note: this is a different order than you use in your for loops, but
# order doesn't matter here.
N = len(df) - 1
flat_index = da.arange(N * (N + 1) // 2, chunks=(1e6,))
idx_j = (da.floor((da.sqrt(1 + 8 * flat_index) - 1)) // 2 + 1).astype(int)
idx_i = flat_index - (idx_j * (idx_j - 1) // 2)

val_1 = df.loc[:, "val_1"].to_numpy().astype(int)
val_2 = df.loc[:, "val_2"].to_numpy().astype(int)

custom_value_vector = (
    da.take(val_1, idx_i) * da.take(val_1, idx_j)
    + da.take(val_2, idx_i) * da.take(val_2, idx_j)
)
result_vector = da.stack([idx_i, idx_j, custom_value_vector], axis=-1)
rechunked_result = da.rechunk(result_vector, (5e7, 3))

# now you can reduce to your desired summary statistic
# (saving instead would use da.to_npy_stack)
reduced = da.min(rechunked_result)
The above runs in a few seconds, but that is just setting up the computational graph. You still have to execute it:
from dask.diagnostics import ProgressBar

with ProgressBar():
    value = reduced.compute()
# [########################################] | 100% Completed | 3min 3.8s
If you can't vectorize, you can still use dask to do your computation. However, be aware that things will be significantly slower:
import pandas as pd
import numpy as np
import dask.array as da
from dask.diagnostics import ProgressBar

# generate fake data of desired size
num_records = int(150000)
df = pd.DataFrame(
    {
        "id": np.arange(num_records),
        "val_1": np.random.randint(0, 500, size=num_records),
        "val_2": np.random.randint(0, 500, size=num_records),
    }
)

# index magic to walk over the upper triangle of a Fortran-contiguous array without the diagonal
# Note: this is a different order than you use in your for loops, but
# order doesn't matter here.
N = len(df) - 1
flat_index = da.arange(N * (N + 1) // 2, chunks=(1e6,))
idx_j = (da.floor((da.sqrt(1 + 8 * flat_index) - 1)) // 2).astype(int)
idx_i = flat_index - (idx_j * (idx_j + 1) // 2)
idx_j = idx_j + 1

def get_custom_value(multi_idx):
    i, j = multi_idx  # I added this
    first = df[df['id'] == i]
    second = df[df['id'] == j]
    return int(first['val_1']) * int(second['val_1']) + \
           int(first['val_2']) * int(second['val_2'])

multi_idx = da.stack((idx_i, idx_j), axis=-1)
custom_value = da.apply_along_axis(get_custom_value, -1, multi_idx, dtype=int, shape=tuple())
result = da.stack([idx_i, idx_j, custom_value])
reduced = da.min(result)

with ProgressBar():
    value = reduced.compute(scheduler="processes")
I didn't run this to completion, so I can't share any timings. (I stopped after around 10 minutes. The point here is: it works, but you will need to bring a healthy amount of patience.)
Let's reframe the problem as building and processing the Cartesian product of two datasets (or of a dataset with itself).
In a distributed setting it is usually solved as follows:
Let's call our datasets A and B.
We partition the bigger dataset over the cluster.
We broadcast (replicate) the smaller dataset to each machine in the cluster.
We calculate and process the Cartesian product on each machine.
If the smaller dataset is not "that small" and we cannot broadcast it, we can work in iterations, each time broadcasting another partition of it.
In our case, we have one dataset, and it is small.
So what should be done is:
We should partition our dataframe over the cluster by just creating a dask dataframe with enough partitions. Let's call it "A_Partitioned".
We should broadcast the same dataset over the cluster using the scatter function (http://distributed.dask.org/en/stable/locality.html). Let's call it "A_Broadcasted".
Now we can do map_partitions over A_Partitioned, where it will do nested loops over each partition and A_Broadcasted, as sketched below.
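Here is a minimal sketch of that idea (my own, not part of the original answer; the helper pairwise_block and the 200-row toy dataframe are made up for illustration), using dask.distributed scatter plus map_partitions with the question's val_1/val_2 columns:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

def pairwise_block(partition, broadcast_df):
    # nested loop over the rows of this partition and the broadcast copy
    rows = []
    for i, r1 in partition.iterrows():
        for j, r2 in broadcast_df.iterrows():
            if j <= i:  # keep only the upper triangle (i < j)
                continue
            value = r1['val_1'] * r2['val_1'] + r1['val_2'] * r2['val_2']
            rows.append((i, j, value))
    return pd.DataFrame(rows, columns=['i', 'j', 'value'])

if __name__ == '__main__':
    client = Client()  # local cluster
    df = pd.DataFrame({'val_1': np.random.randint(0, 500, 200),
                       'val_2': np.random.randint(0, 500, 200)})
    a_partitioned = dd.from_pandas(df, npartitions=8)   # "A_Partitioned"
    a_broadcasted = client.scatter(df, broadcast=True)  # "A_Broadcasted"
    pairs = a_partitioned.map_partitions(
        pairwise_block, a_broadcasted,
        meta={'i': 'int64', 'j': 'int64', 'value': 'int64'})
    print(pairs.compute().head())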

parallelized groupby: apply a function to a groupby object simultaneously

I would like to perform a groupby operation and, for every single group, estimate a linear model.
Writing a function and then using a for loop is pretty easy; however, it is kind of slow.
This is a toy example, but it serves the purpose. What is, in your opinion, the "best" way of parallelizing this?
An intuitive example:
import seaborn as sns
import pandas as pd
from statsmodels.formula.api import ols
import time

# Dataset
df = sns.load_dataset("tips")
df.head()

# Groupby the dataset
df_grouped = df.groupby(["day"])

# Some function to be applied for every grouped element
def regression_model(df):
    """
    This function estimates a linear regression model and returns coefs as a dictionary
    """
    model = ols('tip ~ total_bill + C(sex) + size', data=df)
    return dict(model.fit().params)

# Performing the function in the for loop ------ slow. We want to perform it for each grouped element simultaneously.
coefs_dict = {}
for i, j in df_grouped:
    coefs_i = regression_model(j)
    coefs_dict[i] = coefs_i
    # Artificial sleep so we can demonstrate that the "mechanical" for loop is slow....
    time.sleep(2)
In this particular case I am using 'sleep' to make the loop slower, to demonstrate that it will take a lot of time, especially if we were grouping by a much larger number of unique categories.
You can use the multiprocessing module, as suggested by @JérômeRichard, with Pool.starmap applied to the groupby object:
import pandas as pd
import multiprocessing

def regression_model(keys, df):
    print(f'Pool: {keys}')
    # do stuff here
    return df

if __name__ == '__main__':
    data = []
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.starmap(regression_model, df.groupby('day'))
    df2 = pd.concat(data)
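To connect this back to the question (my own adaptation, not part of the original answer), the worker can be the question's regression_model with its signature changed to accept the (key, group) pairs that starmap unpacks:
import multiprocessing
import seaborn as sns
from statsmodels.formula.api import ols

def regression_model(day, group):
    # starmap unpacks each (key, group) pair produced by df.groupby("day")
    model = ols('tip ~ total_bill + C(sex) + size', data=group)
    return day, dict(model.fit().params)

if __name__ == '__main__':
    df = sns.load_dataset("tips")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.starmap(regression_model, df.groupby("day"))
    coefs_dict = dict(results)
    print(coefs_dict)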

How to reduce memory usage in xarray multidimensional rolling aggregation?

Given a multidimensional xarray DataArray, I would like to perform multidimensional rolling aggregation. For example, if I have a DataArray that is m x n x k, I would like to be able to roll the data along the m axis, and aggregate away either the n or k dimension.
I have an approach that gives me the correct answer but seems not to scale at all. If my window sizes are small, it is feasible, but in the case of a 5000 x 2000 x 10 DataArray, rolling along the 5000 length dimension with a long window explodes memory with my current approach.
import xarray as xr
import numpy as np
import pandas as pd
drange = pd.date_range(start='2000-01-01', freq='D', periods=5000)
x = ['x%i' % i for i in range(1, 3001)]
y = ['y%i' % i for i in range(1,11)]
raw_dat = np.random.randn(len(drange), len(x), len(y))
da = xr.DataArray(raw_dat, coords={'time': drange, 'x': x, 'y': y}, dims=['time', 'x', 'y'])
new_da = da.rolling(time=20).construct('window_dim')
final_da = new_da.stack(combo=['x', 'window_dim']).std('combo')
I have also tried the below, it gives the same result but also runs out of memory when the rolling window is large.
new_da = da.rolling(time=20).construct('window_dim')
final_da = new_da.std(['x', 'window_dim'])
The above code works and on my machine takes roughly 35 seconds to perform the stack and aggregation, but as window size increases, memory usage explodes. I am wondering if there is a smarter way to do this type of aggregation.

how do I compute a weighted moving average using pandas

Using pandas I can compute
simple moving average SMA using pandas.stats.moments.rolling_mean
exponential moving average EMA using pandas.stats.moments.ewma
But how do I compute a weighted moving average (WMA) as described in wikipedia http://en.wikipedia.org/wiki/Exponential_smoothing ... using pandas?
Is there a pandas function to compute a WMA?
Using pandas you can calculate a weighted moving average (wma) using:
.rolling() combined with .apply()
Here's an example with 3 weights and window=3:
import numpy as np
import pandas as pd

data = {'colA': np.random.randint(1, 6, 10)}
df = pd.DataFrame(data)

weights = np.array([0.5, 0.25, 0.25])
sum_weights = np.sum(weights)

df['weighted_ma'] = (df['colA']
    .rolling(window=3, center=True)
    .apply(lambda x: np.sum(weights*x) / sum_weights, raw=False)
)
Please note that in .rolling() I have used the argument center=True.
You should check whether this applies to your use case or whether you need center=False.
No, there is no implementation of that exact algorithm. Created a GitHub issue about it here:
https://github.com/pydata/pandas/issues/886
I'd be happy to take a pull request for this-- implementation should be straightforward Cython coding and can be integrated into pandas.stats.moments
If data is a Pandas DataFrame or Series and you want to compute the WMA over the rows, you can do it using
wma = data[::-1].cumsum().sum() * 2 / data.shape[0] / (data.shape[0] + 1)
If you want a rolling WMA of window length n, use
data.rolling(n).apply(lambda x: x[::-1].cumsum().sum() * 2 / n / (n + 1))
since inside the apply, n = x.shape[0]. Note that this solution might be a bit slower than the one by Sander van den Oord, but you don't have to worry about the weights.
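A small numerical check (my addition, not from the original answer) of why the reversed cumulative sum works: summing the reversed cumsum counts the most recent value n times and the oldest once, which is exactly a linear weighting:
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])
n = len(s)

# reversed cumulative sum trick from the answer above
wma_trick = s[::-1].cumsum().sum() * 2 / n / (n + 1)

# explicit linear weights 1, 2, ..., n (newest value weighted most)
weights = np.arange(1, n + 1)
wma_explicit = np.dot(s, weights) / weights.sum()

assert np.isclose(wma_trick, wma_explicit)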
Construct a kernel with the weights, and apply it to your series using numpy.convolve.
import pandas as pd
import numpy as np

def wma(arr, period):
    kernel = np.arange(period, 0, -1)
    kernel = np.concatenate([np.zeros(period - 1), kernel / kernel.sum()])
    return np.convolve(arr, kernel, 'same')

df = pd.DataFrame({'value': np.arange(11)})
df['wma'] = wma(df['value'], 4)
Here I am interpreting WMA according to this page: https://en.wikipedia.org/wiki/Moving_average
For this type of WMA, the weights should be a linear range of n values, adding up to 1.0.
Note that I pad the front of the kernel with zeros. This is because we want a 'one-sided' window function, so that 'future' values in the time series do not affect the moving average.
numpy.convolve is fast, unlike apply()!
You can also use numpy.correlate if you reverse the kernel.
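As a small illustration (my addition, not from the original answer): for a real-valued kernel, reversing it turns the convolution into a correlation, so the wma() above could equivalently use numpy.correlate:
import numpy as np

arr = np.arange(11, dtype=float)
period = 4
kernel = np.arange(period, 0, -1)
kernel = np.concatenate([np.zeros(period - 1), kernel / kernel.sum()])

via_convolve = np.convolve(arr, kernel, 'same')
via_correlate = np.correlate(arr, kernel[::-1], 'same')
assert np.allclose(via_convolve, via_correlate)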
