Using Pandas Agg Functionality to compute Quantiles - python

I am trying to clean up and streamline my code and recently came across named aggregation in Pandas (see link).
This note is on the page:
If your aggregation functions require additional arguments, partially apply them with functools.partial().
Here is the setup code:
from functools import partial as fpart
import pandas as pd
import numpy as np
inputData = {'groupByVar1': ['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
             'groupByVar2': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4],
             'value': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]}
df = pd.DataFrame(inputData)
upperPercentile = 0.98
lowerPercentile=0.05
Here is the method I have been using
nsgdf = df.groupby(['groupByVar1','groupByVar2'])[['value']].quantile(upperPercentile).rename({'value':'upperPercentileValue'},axis=1).reset_index(drop=False)
nsgdf['lowerPercentileValue'] = df.groupby(['groupByVar1','groupByVar2'])[['value']].quantile(lowerPercentile).values
Here is the method I would like to use:
fpartUpper = fpart(np.quantile,q=upperPercentile)
fpartLower = fpart(np.quantile,q=lowerPercentile)
bdf = df.groupby(['groupByVar1','groupByVar2']).agg(
    upperPercentileValue=pd.NamedAgg(column='value', aggfunc=fpartUpper),
    lowerPercentileValue=pd.NamedAgg(column='value', aggfunc=fpartLower)
)
The following Error is returned from Pandas:
pandas.core.base.SpecificationError: Function names must be unique, found multiple named quantile
However if I execute the following I actually get a result:
fpartUpper([1,2,3,4,5])
Out[16]: 4.92
How can I get this particular method to work with Pandas? What am I missing? Why does Pandas find multiple definitions of quantile, whereas running the bare function causes no issues?
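One plausible explanation (an assumption, not something stated in the docs note): with named aggregation, Pandas derives a display name for each aggregation function, and for a functools.partial it falls back to the name of the wrapped function, so both partials are reported as quantile and the uniqueness check fails before either is ever called. A minimal sketch of two possible workarounds, assuming a reasonably recent Pandas release (the names upperQuantile/lowerQuantile are illustrative):
# Option 1: give each partial its own __name__ (partial objects accept attribute assignment)
fpartUpper = fpart(np.quantile, q=upperPercentile)
fpartLower = fpart(np.quantile, q=lowerPercentile)
fpartUpper.__name__ = 'upperQuantile'
fpartLower.__name__ = 'lowerQuantile'
# Option 2: skip partial and call Series.quantile from small lambdas
# (Pandas 1.0+ deduplicates multiple lambdas in named aggregation)
bdf = df.groupby(['groupByVar1', 'groupByVar2']).agg(
    upperPercentileValue=pd.NamedAgg(column='value', aggfunc=lambda s: s.quantile(upperPercentile)),
    lowerPercentileValue=pd.NamedAgg(column='value', aggfunc=lambda s: s.quantile(lowerPercentile)),
)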

Related

How to import a function from an R package as if it was native Python function and use all its outputs?

There is a function called dea(x, y, *args) in library(Benchmarking) which returns useful objects. I've described 3 key ones below:
crs = dea(mydata_matrix_x, my_data_matrix_y, RTS="IN", ORIENTATION= "in") # both matrixes have N rows
efficiency(crs) # a 'numeric' type object which looks like a 1xN vector
peers(crs) # A matrix: Nx2 (looks to me like a pandas dataframe when run in .ipynb file with R kernel)
lambda(crs) # A matrix: Nx2 of type dbl (also looks like a dataframe)
Now I would like to programmatically vary my_data_matrix_x. This matrix represents my inputs. At first it will be an Nx10 matrix. However, I intend to drop each column sequentially, run dea() on the resulting Nx9 matrix, and then graph the efficiency(crs) scores that come out. The issue is I have no idea how to achieve this in R (amongst other things), and I would rather circumvent the issue by writing all my code in Python and importing this dea() function somehow from an R script.
I believe the best solution available to me will be to read and write from files:
from Benchmarking_script.r import dea

def test_inputs(data, input):
    INPUTS = ['input 1', 'input2', 'input3', 'input4', 'input5']
    OUTPUTS = ['output1', 'output2']
    data_inputs = data.drop(f"{input}", axis=1)
    data_outputs = data[OUTPUTS]
    data_inputs.to_csv("my_inputs.csv")
    data_outputs.to_csv("my_outputs.csv")
    run Benchmarking.dea(data_inputs, data_outputs, RTS="crs", ORIENTATION="in")
Clearly this last line won't work: I am interested to hear flexible (and simple!) ways to run this dea() function idiomatically, as if it were a native Python function.
Related SO questions
The closest answer on SO I've found has been Importing any function from an R package into python
When adapting that code, here is what I've written:
import pandas as pd
data = pd.read_csv("path/to_data.csv")
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
packnames = ('Benchmarking',)  # a trailing comma is needed so this is a tuple, not a plain string
utils.install_packages(StrVector(packnames))
Benchmarking = importr('Benchmarking')
crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
--------------------------------------------------------------
NotImplementedError: Conversion 'py2rpy' not defined for objects of type '<class 'pandas.core.series.Series'>'
So importing the function natively as a Python file hasn't worked
The second approach is the way to go. You need to use a converter context so that Python and R objects are converted automatically. Specifically, try the pandas2ri submodule shipped with rpy2, wrapped in a localconverter context (older rpy2 releases also allow a global pandas2ri.activate()). Something like this:
from rpy2.robjects import default_converter, pandas2ri
from rpy2.robjects.conversion import localconverter
with localconverter(default_converter + pandas2ri.converter):
    crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')
If this doesn't work, update your post with the error.
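Building on that, here is a hedged sketch of the original goal, dropping one input column at a time and collecting the efficiency scores. The column names are hypothetical, and dea() may insist on base-R matrices rather than converted data frames, so treat this as a starting point only:
import pandas as pd
from rpy2.robjects import default_converter, pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

Benchmarking = importr('Benchmarking')
data = pd.read_csv("path/to_data.csv")
INPUTS = ['input1', 'input2', 'input3']   # hypothetical input columns
OUTPUTS = ['output1', 'output2']          # hypothetical output columns

scores = {}
with localconverter(default_converter + pandas2ri.converter):
    for col in INPUTS:
        x = data[[c for c in INPUTS if c != col]]   # drop one input column
        y = data[OUTPUTS]
        crs = Benchmarking.dea(x, y, RTS='crs', ORIENTATION='in')
        # efficiency() returns an R numeric vector; list() turns it into plain floats
        scores[col] = list(Benchmarking.efficiency(crs))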

Meaning of dask delayed

I want to split my df into two dfs.
import dask
import dask.dataframe as dd
df = dd.read_csv(r'D:\Amr\Amr.csv',error_bad_lines=False, engine="python")
import numpy as np
dfs = dask.delayed(np.split)(df,2)
df0=dfs[0]
dask.delayed(df0.to_csv)('file1.csv', header=False, index=False)
The result shows Delayed('to_csv-386b2047-2ed0-4317-bdf7-65e3aa2695af').
What does this mean?
The Dask delayed function decorates your functions so that they operate lazily. Rather than executing your function immediately, it will defer execution, placing the function and its arguments into a task graph.
In other words, this means that the code will only be executed when the results are needed (so when you perform an action using the results). This is also referred to as call-by-need.
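For example (a minimal sketch following the code in the question), nothing touches file1.csv until you explicitly ask for the result:
delayed_write = dask.delayed(df0.to_csv)('file1.csv', header=False, index=False)
print(delayed_write)     # just a Delayed('to_csv-...') placeholder, no work has happened yet
delayed_write.compute()  # this actually runs np.split and writes file1.csv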

Not able to use numpy inside a udf function

I am trying to run some code on a spark kubernetes cluster
"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"
The code I am trying to run is the following
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters -
    row: array with parameters
    subtract: normal value of the parameter
    It outputs the value most distant from the normal
    '''
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())
The dataframe I am passing is as below
However I am getting the following error
ModuleNotFoundError: No module named 'numpy'
When I did a Stack Overflow search I found a similar issue of a numpy import error in Spark on YARN:
ImportError: No module named numpy on spark workers
And the strange part is that I am able to import numpy outside the UDF, and the
import numpy as np
command outside the function does not raise any errors.
Why is this happening? How can I fix this or move forward? Any help is appreciated.
Thank you
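One plausible cause (an assumption, not confirmed in the post): the driver and the executor pods run separate Python environments, and the kublr/spark-py:2.4.0-hadoop-2.6 image used for the executors may simply not ship numpy, so the module imports fine on the driver but not inside the UDF, which runs on the executors. A quick diagnostic sketch, assuming an active SparkSession named spark, that attempts the import on an executor instead of the driver:
def probe_numpy(_):
    # this function runs on an executor, not on the driver
    try:
        import numpy
        return numpy.__version__
    except ImportError as exc:
        return "numpy missing: " + str(exc)

print(spark.sparkContext.parallelize([0], 1).map(probe_numpy).collect())
If the probe comes back with "numpy missing", the fix is to make numpy available to the executors, for example by building a derived container image that runs pip install numpy and pointing spark.kubernetes.container.image at it.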

How to use tqdm with pandas in a jupyter notebook?

I'm doing some analysis with pandas in a jupyter notebook and since my apply function takes a long time I would like to see a progress bar.
Through this post here I found the tqdm library that provides a simple progress bar for pandas operations.
There is also a Jupyter integration that provides a really nice progress bar where the bar itself changes over time.
However, I would like to combine the two and don't quite get how to do that.
Let's just take the same example as in the documentation:
import pandas as pd
import numpy as np
from tqdm import tqdm
df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="my bar!")
# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
df.progress_apply(lambda x: x**2)
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)
It even says "can use tqdm_notebook", but I can't find out how.
I've tried a few things like
tqdm_notebook(tqdm.pandas(desc="my bar!"))
or
tqdm_notebook.pandas
but they don't work.
In the definition it looks to me like
tqdm.pandas(tqdm_notebook(desc="my bar!"))
should work, but the bar doesn't properly show the progress and there is still additional output.
Any other ideas?
My working solution (copied from the documentation):
from tqdm.auto import tqdm
tqdm.pandas()
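For example, a minimal sketch of that solution applied to the DataFrame from the question; tqdm.auto picks the notebook widget bar automatically when running under Jupyter:
from tqdm.auto import tqdm
tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x**2)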
You can use:
tqdm_notebook().pandas(*args, **kwargs)
This is because tqdm_notebook has a delayed adapter, so it's necessary to instantiate it before accessing its methods (including class methods).
In the future (>v5.1), you should be able to use a more uniform API:
tqdm_pandas(tqdm_notebook, *args, **kwargs)
I found that I had to import tqdm_notebook also. A simple example is given below that works in Jupyter notebook.
Suppose you want to map a function over a variable (column) to create a new variable in your pandas dataframe.
# progress bar
from tqdm import tqdm, tqdm_notebook
# instantiate
tqdm.pandas(tqdm_notebook)
# replace map with progress_map
# where df is a pandas dataframe
df['new_variable'] = df['old_variable'].progress_map(some_function)
If you want to use more than 1 CPU for that slow apply step, consider using swifter. As a bonus, swifter automatically enables a tqdm progress bar on the apply step. To customize the bar description, use:
df.swifter.progress_bar(enable=True, desc='bar description').apply(...)
For tqdm versions 4.64.0 and greater:
from tqdm.notebook import tqdm
tqdm.pandas()

How to return multiple values using scipy ndimage.generic_filter in Python?

I'm looking for a way to output multiple values using the generic_filter module in scipy.ndimage like so:
import numpy as np
from scipy import ndimage
a = np.array([range(1,5),range(5,9),range(9,13),range(13,17)])
def summary(a):
    minVal = np.min(a)
    maxVal = np.max(a)
    return [minVal, maxVal]
[arrMin, arrMax] = ndimage.generic_filter(a, summary, footprint=np.ones((3,3)))
But I keep getting the error that a float is expected.
I've played with the 'output' parameter, like so:
arrMin = np.zeros(np.shape(a))
arrMax = np.zeros(np.shape(a))
ndimage.generic_filter(a, summary, footprint=np.ones((3,3)), output = [arrMin, arrMax])
to no avail. I've also tried returning a named tuple, a class, or a dictionary, as per this question, none of which has worked.
Based on the comments, you want to perform multiple filters simultaneously rather than performing them separately.
Unfortunately I do not think this filter works that way. It expects you to return a single filtered output value for each corresponding input value. I looked for a way to do simultaneous filters with numpy/scipy but couldn't find anything.
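A minimal sketch of running the filters separately, reusing the array from the question; each call returns one scalar per window, which is exactly what generic_filter expects:
arrMin = ndimage.generic_filter(a, np.min, footprint=np.ones((3, 3)))
arrMax = ndimage.generic_filter(a, np.max, footprint=np.ones((3, 3)))
For plain min/max specifically, ndimage.minimum_filter and ndimage.maximum_filter do the same job faster, because they avoid calling back into Python for every window.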
If you can manage a data flow that lets you load the image, filter, process, and produce some small result data in separate parallel paths (one for each filter), then you may get some benefit from multiprocessing. But if you use it naively, it's likely to take more time than doing everything sequentially. If you really have a bottleneck that multiprocessing solves, you should also look into sharing your input array rather than loading it in each process.
