Adding new dask column based on a vectorized function - python

I have what I thought would be a straightforward thing to do in python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
def gc(x, y):
    return ge(x, y)

def gdf(df):
    func1 = np.vectorize(gc)
    gh = da.from_array(func1(df.x, df.y))
    df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (like maybe I need map_blocks or map_partitions or even gufunc), or I'm missing something easy where I can set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.

It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))
# or try this
def myfunc(df):
    df['gh'] = func1(df.x, df.y)
    return df
df = df.map_partitions(myfunc)
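Since map_partitions calls the function once per partition, and each partition is an ordinary pandas DataFrame, the partition function can be checked with pandas alone before handing it to Dask. The sketch below does that. Here ge is a hypothetical stand-in for the asker's function, add_gh is an illustrative name, and the two partitions are simulated by slicing.

```python
import numpy as np
import pandas as pd

def ge(x, y):
    # Hypothetical stand-in for the asker's ge(x, y).
    return x + y

func1 = np.vectorize(ge)

def add_gh(partition):
    # Work on a copy so the incoming partition is left unmodified.
    partition = partition.copy()
    partition["gh"] = func1(partition["x"], partition["y"])
    return partition

pdf = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# Simulate two partitions by slicing, then stitch the results back together.
parts = [add_gh(pdf.iloc[:2]), add_gh(pdf.iloc[2:])]
result = pd.concat(parts)

# With a Dask dataframe ddf, the equivalent call would be:
#     ddf = ddf.map_partitions(add_gh)
```

Because the new column is built inside each partition, it always has the same partitioning as the dataframe it is attached to.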

Related

How to apply a function pairwise on rows in a series?

I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want to be able to check whether two consecutive rows are different or identical, returning 1 if the current row differs from the previous one and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() at different places, but I just can not get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features(columns) is called "ID", and several packets have the same ID.
Another column is called "data", its values are 64 bit binary values (strings), i.e., 001011010011001.....10010 (length 64).
I want to create two new features(columns):
Compare the "data" field of the current packet with the data field of the previous packet with the Same ID, and compute:
If they are different (1 or 0)
How different (a figure between 0 and 1)
Hi, I think it is best to forgo the groupby and use shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of the columns you want to be identical. Then you can create your own grouper by
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function you wish
df.groupby(my_grouper).agg({'B':f})
Use DataFrameGroupBy.shift and compare for inequality with Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for correlation between 2 Series use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
Ok, I solved it myself with
def create_dc(df: pd.DataFrame):
    dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
    dc.fillna(1, inplace=True)
    df["dc"] = dc
This does what I want.
Thank you @Arnau for inspiring me to use .shift()!
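For the second requested feature (how different, as a figure between 0 and 1), the same groupby-plus-shift idea can feed any custom pairwise function. The sketch below is only an illustration: hamming_fraction and the how_different column are hypothetical names, 4-bit strings stand in for the 64-bit ones, and the first packet of each ID is assumed to count as fully different, mirroring the fillna(1) above.

```python
import pandas as pd

def hamming_fraction(current, previous):
    """Fraction of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(current, previous)) / len(current)

df = pd.DataFrame({
    "ID":   ["a", "b", "a", "b", "a"],
    "data": ["0000", "1111", "0011", "1111", "0011"],
})

# Previous packet's data within the same ID (NaN for the first packet of an ID).
previous = df.groupby("ID")["data"].shift()
has_previous = previous.notna()

# Assumption: a packet with no predecessor counts as fully different.
df["how_different"] = 1.0
df.loc[has_previous, "how_different"] = [
    hamming_fraction(cur, prev)
    for cur, prev in zip(df.loc[has_previous, "data"], previous[has_previous])
]

# 1 if anything changed compared with the previous packet, else 0.
df["dc"] = (df["how_different"] > 0).astype(int)
```

Any other two-argument comparison can be swapped in for hamming_fraction without touching the grouping logic.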

How to use .apply(lambda x: function) over all the columns of a dataframe

I'm trying to pass every column of a dataframe through a custom function by using the apply(lambda x: function) pattern in Python.
The custom function I have created works on its own, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First is the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis =0,)
Currently I believe this is taking the columns from df, passing them to snr_pd() and appending the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple structure changes like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However, this returns the following error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I understand even less.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
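One plausible reading of the NaN results is that snr_pd treats its argument as wavenumbers, while apply hands it each column's values. A minimal sketch of a per-column version follows. It assumes each column holds one spectrum's intensities and the wavenumbers live in a separate array shared by all columns. The names snr, spectra, and wavenumber are illustrative and the data is random.

```python
import numpy as np
import pandas as pd

def snr(intensity, wavenumber, signal=(1650, 1750), noise=(1750, 1850)):
    """Peak intensity in the signal window divided by the std in the noise window."""
    signal_mask = (wavenumber >= signal[0]) & (wavenumber < signal[1])
    noise_mask = (wavenumber >= noise[0]) & (wavenumber < noise[1])
    return np.max(intensity[signal_mask]) / np.std(intensity[noise_mask])

# Made-up data: two spectra sampled every 10 units from 1600 to 1890.
wavenumber = np.arange(1600, 1900, 10)
rng = np.random.default_rng(0)
spectra = pd.DataFrame(
    rng.random((wavenumber.size, 2)) + 1.0,
    index=wavenumber,
    columns=["a", "b"],
)

# axis=0 passes one column at a time; the result is a Series indexed by column name.
snr_per_column = spectra.apply(lambda col: snr(col.to_numpy(), wavenumber), axis=0)
```

The result has one value per column, so it lines up with a table that has one row per spectrum rather than with the original dataframe's rows.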

how to create multiple variables with similar name in for loop?

I had a problem with for loops earlier, and it was solved thanks to @mak4515. However, there is something else I want to accomplish:
# Use pandas to read in csv file
data_df_0 = pd.read_csv('puget_sound_ctd.csv')
#create data subsets based on specific buoy coordinates
data_df_1 = pd.read_csv('puget_sound_ctd.csv', skiprows=range(9,114))
data_df_2 = pd.read_csv('puget_sound_ctd.csv', skiprows=([i for i in range(1, 8)] + [j for j in range(21, 114)]))
for x in range(0,2):
    for df in [data_df_0, data_df_2]:
        lon_(x) = df['longitude']
        lat_(x) = df['latitude']
This is my current code. I want it to read the different data sets and create different values based on the data set it is reading. However, when I run the code this way I get this error:
File "<ipython-input-66-446aebc48604>", line 21
lon_(x) = df['longitude']
^
SyntaxError: can't assign to function call
What does "can't assign to function call" mean, and how do I fix this?
I think the comment by @Chris is probably a good way to go. I wanted to point out that since you're already using pandas dataframes, an easier way might be to add a column identifying the original dataframe and then concatenate them.
import pandas as pd
data_df_0 = pd.DataFrame({'longitude':range(-125,-120,1),'latitude':range(45,50,1)})
data_df_0['dfi'] = 0
data_df_2 = pd.DataFrame({'longitude':range(-120,-125,-1),'latitude':range(50,45,-1),'dfi':[2]*5})
data_df_2['dfi'] = 2
df = pd.concat([data_df_0,data_df_2])
Then, you could access data from the original frames by filtering on that column, like this:
df[df['dfi'] == 2]
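As for the error itself: lon_(x) is parsed as a call to a function named lon_, and a call cannot appear on the left of an assignment. A dictionary keyed by the dataset number is the usual substitute for numbered variable names. The sketch below uses small made-up frames in place of the CSV subsets, and the names frames, lon, and lat are illustrative.

```python
import pandas as pd

# Made-up stand-ins for data_df_0 and data_df_2.
frames = {
    0: pd.DataFrame({"longitude": [-125, -124], "latitude": [45, 46]}),
    2: pd.DataFrame({"longitude": [-120, -121], "latitude": [50, 49]}),
}

# One entry per dataset instead of variables named lon_0, lon_2, ...
lon = {key: frame["longitude"] for key, frame in frames.items()}
lat = {key: frame["latitude"] for key, frame in frames.items()}
```

lon[2] then plays the role that lon_2 was meant to play.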

Alternative method for two way interpolation

I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I base my interpolation results. The table df contains the new data, which needs factors interpolated from hf.
Right now my workaround is to filter each table by the Ded_amount percentage, perform the interpolation into an empty data frame, and append after each loop.
I feel like this is inefficient and that there is a better way to do it. I'd appreciate feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
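The row-by-row apply can also be avoided by interpolating each deductible group in a single call through a boolean mask. The following is a sketch of that variation rather than the answer's own method. It reuses the test data from the question, and interpolators and mask are illustrative names.

```python
import pandas as pd
from scipy import interpolate

known_data = {
    'AOI': [80000, 100000, 150000, 200000, 300000] * 2,
    'Ded_amount': ['2%'] * 5 + ['3%'] * 5,
    'factor': [0.797, 0.774, 0.739, 0.733, 0.719, 0.745, 0.737, 0.715, 0.711, 0.709],
}
new_data = {
    'AOI': [85000, 120000, 130000, 250000, 310000] * 2,
    'Ded_amount': ['2%'] * 5 + ['3%'] * 5,
}
hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)

# Build one linear interpolator per deductible from the known data.
interpolators = {
    ded: interpolate.interp1d(group['AOI'], group['factor'], fill_value="extrapolate")
    for ded, group in hf.groupby('Ded_amount')
}

# Fill the new column one deductible group at a time, a whole group per call.
df['Factor'] = float('nan')
for ded, f in interpolators.items():
    mask = df['Ded_amount'] == ded
    df.loc[mask, 'Factor'] = f(df.loc[mask, 'AOI'])
```

Each interpolator is called once per deductible rather than once per row, which matters more as df grows.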

Pandas: aggregate when column contains numpy arrays

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.
e.g.
import pandas as pd
import numpy as np
DF = pd.DataFrame([[1,np.array([10,20,30])],
                   [1,np.array([40,50,60])],
                   [2,np.array([20,30,40])],], columns=['category','arraydata'])
This works the way I would expect it to:
DF.groupby('category').agg(sum)
output:
          arraydata
category
1        [50 70 90]
2        [20 30 40]
However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:
g=DF.groupby('category')
g.agg({'arraydata':sum})
Here is another:
g=DF.groupby('category')
g['arraydata'].agg(sum)
Both give the same output:
Exception: must produce aggregated value
However if I have a column that uses numeric rather than array data, it works fine. I can work around this, but it's confusing and I'm wondering if this is a bug, or if I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case and indeed wasn't sure if they were supported. Ideas?
Thanks
One, perhaps more clunky, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:
grouped = DF.groupby("category")
aggregate = list((k, v["arraydata"].sum()) for k, v in grouped)
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")
This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
Diving into the Internals
The problem here is that pandas is checking explicitly that the output not be an ndarray because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named where the error occurs.
def _aggregate_named(self, func, *args, **kwargs):
    result = {}
    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)
    return result
My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:
DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
Pandas doesn't complain, because now you have an array of Python objects. [but this is really just cheating around the typecheck]. And if you want to convert back to array, just apply np.array to it.
result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over GroupBy like I've shown above.
Pandas works much more efficiently if you don't do this (e.g using numeric data, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data.
Saying that, this looks like a bug, the Exception is being raised purely because the result is an array:
Exception: Must produce aggregated value
In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
1510 if isinstance(output, np.ndarray):
-> 1511 raise Exception('Must produce aggregated value')
1512 result[name] = self._try_cast(output, group)
ipdb> output
array([50, 70, 90])
If you were to recklessly remove these two lines from the source code it works as expected:
In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]
Note: They're almost certainly in there for a reason...
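The earlier advice to keep numeric data instead of arrays in cells can be sketched by expanding the array column into one numeric column per element before grouping. This assumes every array has the same length. The names wide and summed are illustrative.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]], columns=['category', 'arraydata'])

# Expand each array into its own row of plain numeric columns (0, 1, 2).
wide = pd.DataFrame(DF['arraydata'].to_list(), index=DF.index)

# An ordinary numeric groupby sum now works without any ndarray type check.
summed = wide.groupby(DF['category']).sum()
```

summed has one row per category and one column per array position, and summed.to_numpy() recovers the per-category arrays if they are needed again.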
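The built-in sum only iterates over the rows, that is, it only sums along the first axis.
You can define an aggregation function:
def mySum(dataframe):
    return np.sum(np.sum(dataframe))
And then pass this function into agg():
DF.groupby('category').agg(mySum)
Note that the inner sum adds the arrays element-wise and the outer sum then adds those elements, so each group is reduced to a single scalar total rather than an array.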
