I would like to perform a groupby operation and estimate a linear model for every single group.
Writing a function and then using a for loop is pretty easy; however, it is also rather slow.
This is a toy example, but it serves the purpose. What is, in your opinion, the "best" way of parallelizing this?
An intuitive example:
import seaborn as sns
import pandas as pd
from statsmodels.formula.api import ols
import time
# Dataset
df = sns.load_dataset("tips")
df.head()
# Groupby the dataset
df_grouped = df.groupby(["day"])
# Some function to be applied for every grouped element
def regression_model(df):
    """
    This function estimates a linear regression model and returns the coefficients as a dictionary.
    """
    model = ols('tip ~ total_bill + C(sex) + size', data=df)
    return dict(model.fit().params)

# Applying the function in a for loop ------ slow. We want to run it for each grouped element simultaneously.
coefs_dict = {}
for i, j in df_grouped:
    coefs_i = regression_model(j)
    coefs_dict[i] = coefs_i
    # Artificial sleep so we can demonstrate that the "mechanical" for loop is slow...
    time.sleep(2)
In this particular case I am using time.sleep to slow things down and demonstrate that the for loop will take a lot of time, especially if we were grouping by a much larger number of unique categories.
You can use the multiprocessing module, as suggested by @JérômeRichard, with Pool.starmap applied to the groupby object:
import pandas as pd
import multiprocessing
def regression_model(keys, df):
    print(f'Pool: {keys}')
    # do stuff here
    return df

if __name__ == '__main__':
    data = []
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.starmap(regression_model, df.groupby('day'))
    df2 = pd.concat(data)
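For the toy dataset from the question, a minimal sketch of how the "do stuff here" part could be filled in so that the pool returns one coefficient dictionary per group; returning a (key, coefs) tuple is just one convenient way to rebuild the dictionary afterwards:
import multiprocessing
import seaborn as sns
from statsmodels.formula.api import ols

def regression_model(key, group):
    # fit the model on a single group and return its key together with the coefficients
    model = ols('tip ~ total_bill + C(sex) + size', data=group)
    return key, dict(model.fit().params)

if __name__ == '__main__':
    df = sns.load_dataset("tips")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.starmap(regression_model, df.groupby("day"))
    coefs_dict = dict(results)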
I have a dataset and I want to select the subset of variables with a VIF (Variance Inflation Factor) smaller than a certain threshold. My idea was to calculate the VIF for every variable, drop the variable with the highest value (if it is higher than the threshold), recalculate the VIF for every remaining variable, and repeat the process until no VIF is higher than the threshold.
There is nothing novel in this approach, but I couldn't get past a certain point in writing a function to automate this process in Python.
x is the dataset with the target variable dropped
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
x_vif = add_constant(x)
vif = pd.DataFrame([variance_inflation_factor(x_vif.values, i) for i in range(x_vif.shape[1])], index=x_vif.columns)
The vif could also be a list. Is there any package that does this automatically, or could you give me an idea of how to create such a function?
I found an R library (thinXwithVIF) that can do it automatically, but I couldn't make rpy2 work with the Python version that I need to use.
Maybe what would make sense is to remove the variable with the highest VIF in each round, subset the dataframe, and stop when all variables are below your threshold. I don't think VIF is the be-all and end-all; you really have to look at the data to decide what to include.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = sm.datasets.get_rdataset('mtcars')
x_vif = data.data.iloc[:, 1:]
y = data.data['mpg']

thres = 10

while True:
    Cols = range(x_vif.shape[1])
    vif = np.array([variance_inflation_factor(x_vif.values, i) for i in Cols])
    if all(vif < thres):
        break
    else:
        Cols = np.delete(Cols, np.argmax(vif))
        x_vif = x_vif.iloc[:, Cols]
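Since the question asks for a reusable function, here is a minimal sketch wrapping the same loop; the function name drop_high_vif and the default threshold are my own choices, and it assumes all columns of x are numeric:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(x, thres=10):
    # iteratively drop the column with the highest VIF until all VIFs are below thres
    x = x.copy()
    while True:
        vif = np.array([variance_inflation_factor(x.values, i) for i in range(x.shape[1])])
        if (vif < thres).all():
            return x
        # drop the single worst offender and recompute on the remaining columns
        x = x.drop(columns=[x.columns[np.argmax(vif)]])

# usage on the mtcars example above:
# x_reduced = drop_high_vif(x_vif, thres=10)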
I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(matrices):
    return reduce(lambda a, b: a + b, matrices)

if __name__ == '__main__':
    num_matrices = 16

    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]

    # Here's the Big Sum
    matrices = calc_sum(matrices)

    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only the matrix generation is)
Note that none of these issues are Dask's fault; it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(a, b):
    return a + b

if __name__ == '__main__':
    num_matrices = 16

    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]

    # This tells us the depth of the calculation portion
    # of the tree we are constructing in the next step
    depth = int(math.log(num_matrices, 2))

    # This is the code I don't want to have to manually write
    for _ in range(depth):
        matrices = [
            calc_sum(matrices[i], matrices[i+1])
            for i in range(0, len(matrices), 2)
        ]

    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that will generate a tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.
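A minimal sketch of that workflow, assuming plain (non-delayed) matrix generation so the bag holds concrete arrays; split_every=2 is what makes the reduction a binary tree, similar to the manual loop above:
import dask.bag as db
import numpy as np

def gen_matrix(_):
    return np.random.rand(1000, 1000)

# one matrix per partition, so the fold reduces across partitions
bag = db.from_sequence(range(16), npartitions=16).map(gen_matrix)

# split_every=2 gives a binary reduction tree instead of one giant sum on a single worker
total = bag.fold(lambda a, b: a + b, split_every=2)

result = total.compute()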
I have a custom workflow, that requires using resample to get to a higher temporal frequency, applying a ufunc, and groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask array. For computation, I'd like to use dask.distributed.
However, when I apply this to the full dataset, the number of tasks skyrockets, overwhelming the client and most likely also the scheduler and workers if submitted.
The xarray docs explain:
Do your spatial and temporal indexing (e.g. .sel() or .isel()) early
in the pipeline, especially before calling resample() or groupby().
Grouping and resampling triggers some computation on all the blocks,
which in theory should commute with indexing, but this optimization
hasn’t been implemented in dask yet.
But I really need to apply this to the full temporal axis.
So how to best implement this?
My approach was to use map_blocks, to apply this function for each chunk individually as to keep the individual xarray sub-datasets small enough.
This seems to work on a small scale, but when I use the full dataset, the workers run out of memory and quickly die.
Looking at the dashboard, the function I'm applying to the array gets executed a multiple of the number of chunks I have. Shouldn't these two numbers line up?
So my questions are:
Is this approach valid?
How could I implement this workflow otherwise, besides manually implementing the resample and groupby part and putting it in a ufunc?
Any ideas regarding the performance issues at scale (specifically the number of executions vs chunks)?
Here's a small example that mimics the workflow and shows the number of executions vs chunks:
from time import sleep
import dask
from dask.distributed import Client, LocalCluster
import numpy as np
import pandas as pd
import xarray as xr
def ufunc(x):
    # computation
    sleep(2)
    return x

def fun(x):
    # upsample to higher res
    x = x.resample(time="1h").asfreq().fillna(0)
    # apply function
    x = xr.apply_ufunc(ufunc, x, input_core_dims=[["time"]], output_core_dims=[["time"]], dask="parallelized")
    # average over dates
    x['time'] = x.time.dt.strftime("%Y-%m-%d")
    x = x.groupby("time").mean()
    return x

def create_xrds(shape):
    ''' helper function to create dataset '''
    x, y, t = shape
    tv = pd.date_range(start="1970-01-01", periods=t)
    ds = xr.Dataset({
        "band": xr.DataArray(
            dask.array.zeros(shape, dtype="int16"),
            dims=['x', 'y', 'time'],
            coords={"x": np.arange(0, x), "y": np.arange(0, y), "time": tv})
    })
    return ds

# set up distributed
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

ds = create_xrds((500, 500, 500)).chunk({"x": 100, "y": 100, "time": -1})

# create template
template = ds.copy()
template['time'] = template.time.dt.strftime("%Y-%m-%d")

# map fun to blocks
ds_out = xr.map_blocks(fun, ds, template=template)

# persist
ds_out.persist()
Using the example above, this is what the dask array (25 chunks) looks like:
But the function fun gets executed 125 times:
Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
This is misleading because of an unfortunate choice made when constructing the graph. The number includes tasks that make a block of the input Dataset (one per variable per chunk) and of the output Dataset, as well as the tasks that actually apply the function. This will get fixed soon (https://github.com/pydata/xarray/pull/5007).
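If you want to check this on your own graph, one rough way is to count only the tasks whose key refers to the mapped function instead of the total task count shown in the dashboard. This is a sketch, not a stable dask API: the substring match on the key name is an assumption about how the map_blocks tasks are labelled.
# ds_out is the result of xr.map_blocks(fun, ds, template=template) from above
graph = dict(ds_out.__dask_graph__())

# keys of the map_blocks tasks usually contain the name of the mapped function;
# this substring check is a heuristic, not a guaranteed naming scheme
n_fun_tasks = sum(1 for key in graph if "fun" in str(key))
print(f"{n_fun_tasks} tasks apply fun, out of {len(graph)} tasks in total")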
I am trying to solve a multiobjective optimization problem with 3 objectives and 2 decision variables using NSGA-II. The pymoo code for the NSGA-II algorithm and the termination criterion is given below. My pop_size is 100 and n_offsprings is 100. The algorithm is iterated over 100 generations. I want to store all 100 values of the decision variables considered in each generation, for all 100 generations, in a dataframe.
NSGA2 implementation in pymoo code:
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation
algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)
from pymoo.factory import get_termination
termination = get_termination("n_gen", 100)
from pymoo.optimize import minimize
res = minimize(MyProblem(),
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)
What I have tried (My reference: stackoverflow question):
import pandas as pd
df2 = pd.DataFrame(algorithm.pop)
df2.head(10)
The result from the above code is blank, and when I run
print(df2)
I get
Empty DataFrame
Columns: []
Index: []
Glad you intend to use pymoo for your research. You have correctly enabled the save_history option, which means you can access the algorithm objects.
To get all solutions from the run, you can combine the offsprings (algorithm.off) from each generation. Don't forget that the Population objects contain Individual objects. With the get method you can retrieve X and F or other values. See the code below.
import pandas as pd
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation, ZDT1
from pymoo.factory import get_termination
from pymoo.model.population import Population
from pymoo.optimize import minimize
problem = ZDT1()
algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)
termination = get_termination("n_gen", 10)
res = minimize(problem,
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)
all_pop = Population()

for algorithm in res.history:
    all_pop = Population.merge(all_pop, algorithm.off)

df = pd.DataFrame(all_pop.get("X"), columns=[f"X{i+1}" for i in range(problem.n_var)])

print(df)
Another way would be to use a callback and fill the data frame each generation. Similar as shown here: https://pymoo.org/interface/callback.html
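A minimal sketch of that callback approach, following the pattern from the linked page and assuming a freshly constructed problem, algorithm, and termination as defined above; the class name GenerationXCollector is my own:
import pandas as pd
from pymoo.model.callback import Callback

class GenerationXCollector(Callback):
    # store the decision variables of the population at every generation
    def __init__(self):
        super().__init__()
        self.data["X"] = []

    def notify(self, algorithm):
        self.data["X"].append(algorithm.pop.get("X"))

callback = GenerationXCollector()
res = minimize(problem,
               algorithm,
               termination,
               seed=1,
               callback=callback,
               verbose=True)

# one row per individual, with a column marking the generation it came from
frames = [pd.DataFrame(X, columns=[f"X{i+1}" for i in range(problem.n_var)]).assign(gen=g)
          for g, X in enumerate(callback.data["X"], start=1)]
df = pd.concat(frames, ignore_index=True)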
For a given Series I want to change the value of each element around its current value and then calculate an arbitrary function (here std), as shown in the following code:
import pandas as pd
import numpy as np
a = pd.Series(np.random.randn(10))
perturb = {}
for item in range(2, len(a)):
    serturb = {}
    for ep in np.arange(-1, 1, 0.1):
        # .ix is removed in recent pandas; .loc keeps the label-based (inclusive) slice,
        # and .copy() ensures the perturbation does not touch the original Series
        temp = a.loc[0:item].copy()
        temp.iloc[-1] += ep
        serturb[ep] = temp.std()
    perturb[item] = pd.Series(serturb)
perturb = pd.DataFrame(perturb).T
The above code becomes too slow for a large amount of data. The above process, when applied to a DataFrame, would return a Panel. Is there an efficient way of doing this, since a lot of the calculations are repeated?
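One way to exploit the repeated work, sketched below under the assumption that only the last element of each prefix is perturbed (as in the loop above), is to keep running prefix sums and update the standard deviation analytically for every epsilon at once:
import numpy as np
import pandas as pd

a = pd.Series(np.random.randn(10))
eps = np.arange(-1, 1, 0.1)

vals = a.to_numpy()
s1 = np.cumsum(vals)        # prefix sums
s2 = np.cumsum(vals ** 2)   # prefix sums of squares

rows = {}
for item in range(2, len(vals)):
    m = item + 1                               # prefix length (inclusive slice above)
    last = vals[item]
    s1_p = s1[item] + eps                      # perturbed sum for every epsilon at once
    s2_p = s2[item] + 2 * eps * last + eps ** 2
    var = (s2_p - s1_p ** 2 / m) / (m - 1)     # sample variance, ddof=1 like pandas .std()
    rows[item] = pd.Series(np.sqrt(var), index=eps)

perturb_fast = pd.DataFrame(rows).T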