I have a class that computes a value by comparing different input values. The class is:
class feasible:
    def __init__(self, old_difference, for_value, back_value, fall_back_value):
        self.diff = abs(for_value - back_value)
        for_diff = abs(for_value - fall_back_value)
        back_diff = abs(back_value - fall_back_value)
        if self.diff < old_difference:
            self.value = (for_value + back_value) / 2
        elif for_diff < back_diff:
            self.value = for_value
        else:
            self.value = back_value
How can I apply this class and get back the value when the inputs are columns from different DataFrames?
All the input frames are in the following format:
          x         y       theta
0  0.550236 -4.621542   35.071022
1  5.429449 -0.374795   74.884065
2  4.590866 -4.628868  110.697109
I tried the following, but it returns an error (ValueError: The truth value of a Series is ambiguous) because of the comparisons involved.
feasible_x=feasible(diff_frame.x,for_frame.x,back_frame.x,filler_frame.x)
filler_frame.x=feasible_x.value
Currently, your method expects scalar values, but you are passing pandas Series (i.e., columns of data frames) into it. The if logic therefore has to evaluate the truth of an entire Series (a structure of many same-typed values) rather than a single value, which is why you receive the ambiguous truth value error. Newcomers to pandas coming from general-purpose Python often run into this, since pandas/NumPy follow a different object model than plain Python.
To resolve this, since you are essentially calculating new fields with conditional logic, consider binding all the Series parameters into one data frame. Then replace the general Python if...elif...else construct with numpy.where, which runs the logic element-wise across array-like objects.
import numpy as np
import pandas as pd

class feasible:
    def __init__(self, old_difference, for_value, back_value, fall_back_value):
        # HORIZONTAL MERGE (OUTER JOIN) ON INDEX
        x_frame = (pd.concat([old_difference, for_value, back_value, fall_back_value], axis=1)
                     .set_axis(['old_difference', 'for_value', 'back_value', 'fall_back_value'],
                               axis='columns')
                  )
        # ASSIGN NEW CALCULATED COLUMNS
        x_frame['diff'] = (x_frame['for_value'] - x_frame['back_value']).abs()
        x_frame['for_diff'] = (x_frame['for_value'] - x_frame['fall_back_value']).abs()
        x_frame['back_diff'] = (x_frame['back_value'] - x_frame['fall_back_value']).abs()

        # ASSIGN FINAL SERIES BY NESTED CONDITIONAL LOGIC
        self.value = np.where(x_frame['diff'] < x_frame['old_difference'],
                              (x_frame['for_value'] + x_frame['back_value']) / 2,
                              np.where(x_frame['for_diff'] < x_frame['back_diff'],
                                       x_frame['for_value'],
                                       x_frame['back_value']
                                       )
                              )
Now, depending on the row counts of the four data frames, the result has to be handled differently. Specifically, pd.concat with axis=1 defaults to join='outer', so all rows are retained in the horizontal merge and NaN is filled in for unmatched rows.
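For example, a minimal sketch of that alignment behavior with two made-up Series of unequal length:

import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], name='a')   # index 0, 1, 2
b = pd.Series([10.0, 20.0], name='b')      # index 0, 1
print(pd.concat([a, b], axis=1))
#      a     b
# 0  1.0  10.0
# 1  2.0  20.0
# 2  3.0   NaN   <- no matching row in b, so NaN is filled in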
If filler_frame (the data frame you intend to add the column to) covers all the rows of the merged result, then a simple assignment is enough.
# IF filler_frame CONTAINS THE MOST ROWS (OR EQUIVALENT TO MOST) OF ALL FOUR DFs
feasible_x = feasible(diff_frame.x,for_frame.x,back_frame.x,filler_frame.x)
filler_frame['x_new'] = feasible_x.value
If not, a left join is required to bring in the new column, x_new. The following works in all cases, including the one above.
# IF filler_frame DOES NOT CONTAIN MOST ROWS OF ALL FOUR DFs
feasible_x = feasible(diff_frame.x,for_frame.x,back_frame.x,filler_frame.x)
filler_frame = filler_frame.join(pd.Series(feasible_x.value).rename('x_new'), how = 'left')
I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want to be able to compute whether two consecutive rows are different or identical, and return 1 if the current row is different from the previous one, and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() at different places, but I just can not get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features (columns) is called "ID", and several packets have the same ID.
Another column is called "data", its values are 64 bit binary values (strings), i.e., 001011010011001.....10010 (length 64).
I want to create two new features (columns):
Compare the "data" field of the current packet with the data field of the previous packet with the same ID, and compute:
If they are different (1 or 0)
How different (a figure between 0 and 1)
Hi, I think it is best if you forgo the groupby and use shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of the columns you want to be identical. Then you can create your own grouper with
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function you wish
df.groupby(my_grouper).agg({'B':f})
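A minimal sketch of how those three lines fit together, using a made-up df and X (both are assumptions, not your data):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': [5, 5, 6, 6, 6]})
X = ['A', 'B']

equal_index = (df == df.shift(1))[X].all(axis=1)   # True where a row repeats the previous one
my_grouper = (~equal_index).cumsum()               # new group id whenever a row differs
print(df.groupby(my_grouper).agg({'B': 'count'}))  # e.g. the length of each run of identical rows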
Use DataFrameGroupBy.shift and compare for inequality with Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for correlation between 2 Series use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
Ok, I solved it myself with
def create_dc(df: pd.DataFrame):
    dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
    dc.fillna(1, inplace=True)
    df["dc"] = dc
this does what I want.
Thank you #Arnau for inspiring me to use .shift()!
I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: ...) in Python.
The custom function I have created works on its own, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First is the custom function -
import numpy as np

def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function -
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis =0,)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending them to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple structure changes like using applymap() instead of apply().
sd['s/n'] = fd.applymap(lambda x: snr_pd(x), na_action = 'ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
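For example, a sketch of that selection step (assuming df may also contain non-numeric columns and snr_pd is defined as above):

import numpy as np

numeric_df = df.select_dtypes(include='number')   # keep only numeric columns
snr_per_column = np.apply_along_axis(snr_pd, axis=0, arr=numeric_df)
# snr_per_column is a plain numpy array with one value per column of numeric_df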
I have a DataFrame describing movements of multiple "objects" on a few different "tracks" on a Cartesian 2D universe. I also have their "target location" for each "track". Example data:
objs = ['car', 'bicycle', 'plane']
moves = [f'mov{i}' for i in range(1,11)]
multi = pd.MultiIndex.from_product([objs, moves, range(10)], names=['obj', 'mov', 'time'])
locations = pd.DataFrame(np.random.rand(300,2), columns=['X','Y'], index=multi)
targets = pd.DataFrame(np.random.rand(10,2), columns=['X','Y'], index=moves)
I'm interested in the euclidean-distance between the locations and the targets on each timestamp. Something like
distances = pd.Series(np.random.rand(300), index=multi)
The problem is that I can't use the subtract method, since both objects need to have the same index, and I can't figure out how to get the two DataFrames' indexes to "fit". Does anyone have a nice (efficient) way for me to get those distances?
So apparently, unlike plain subtraction with the - operator, which needs completely matching indexes for self and other, the sub method can take a level as an argument.
So there's a simple one-liner for calculating these euclidean distances:
distances = locations.sub(targets, level=1).pow(2).sum(axis=1).transform(np.sqrt)
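A quick way to convince yourself the broadcasting did what you expect (a sketch that reuses the names from the setup in the question, with numpy imported as np):

# recompute the distance for one ('obj', 'mov', 'time') entry by hand
row = locations.loc[('car', 'mov1', 0)]
tgt = targets.loc['mov1']
manual = np.sqrt((row['X'] - tgt['X'])**2 + (row['Y'] - tgt['Y'])**2)
assert np.isclose(manual, distances.loc[('car', 'mov1', 0)])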
IIUC:
targets.reset_index(inplace=True)  # Reset index
targets.columns = ['mov', 'x', 'y']  # Rename columns
locations.reset_index(inplace=True)  # Reset index
loctar = pd.merge(locations, targets, how='left', on='mov')  # Merge locations and targets into loctar
loctar[['dXx', 'dYY']] = loctar[['X', 'Y']] - loctar[['x', 'y']].values  # Calculate delta x and y
temp = loctar.loc[:, ~loctar.columns.isin(['obj', 'mov', 'time', 'X', 'Y', 'x', 'y'])]  # Temporary dataframe with the deltas
result = ((temp**2).sum(axis=1))**0.5  # Calculate euclidean distance
result = result.reset_index()  # Reset index
# Can merge result with loctar if you want
This is what I am trying to do. I was able to do steps 1 to 4; I need help with step 5 onward.
Basically, for each data point I would like to find the euclidean distance from all the mean vectors, based upon the y column.
1. take data
2. separate out non-numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to numerical dataset and then join non-numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output, and for each row add all the columns. Then join this data back to df_numeric and df_non_numeric.
--------------update1
Added code as below. My questions have changed and the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that it maintains the order in which the rows appear.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
I'm sure there is a better way to do this, but I iterated through depending on the class and followed the exact steps:
Assigned the 'class' as the index.
Rotated so that the 'class' was in the columns.
Performed that operation of means that corresponded with df_numeric
Squared the values.
Summed the rows.
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
#print (df)
df_numeric = df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T
import numpy as np
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterated through the values in means and checked which df_numeric values matched
store = []  # Assigned an empty array
for j in means:
    sto = df_numeric[j]
    if type(sto) == type(pd.Series()):  # If there is a single value it comes out as a pd.Series type
        sto = sto.to_frame()  # Need to convert to dataframe type
    store.append(sto - j)  # append the various values to the array
values = np.array(store)**2  # Squaring the values
# Summing the rows
summed = []
for i in values:
    summed.append(i.sum(axis=1))
df_new = pd.concat(summed, axis=1)
df_new.T
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
df = pandas.DataFrame({"Amount": [numpy.nan,0,numpy.nan,0,0,100,200,50,0,numpy.nan,numpy.nan,100,200,100,0],
"Id": [0,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
"Date": pandas.to_datetime(["2011-11-02","NA","2011-11-03","2011-11-04",
"2011-11-05","NA","2011-11-04","2011-11-04",
"2011-11-06","2011-11-06","2011-11-06","2011-11-06",
"2011-11-08","2011-11-08","2011-11-08"],errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works - but let's say I want a special std, which ignores 0, and regards each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only not grouped anymore... I guess I can calculate all of these into columns of the original DataFrame (in this case df) before I take the representing row of the group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory space since many rows corresponding to the same group would get the same value, and I'd also have to group df twice - once for calculating these statistics, and a second time to get the representing row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
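So, to attach it to the representative-row frame from the question (a sketch using the names defined there):

f = g.first()
f["std"] = g.Amount.agg(get_unique_std)   # one value per Id, index-aligned with f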
You can also pass a dictionary and get each key as a column:
g.Amount.agg({'my_std': get_unique_std, 'numpy_std': pandas.np.std})
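Note that in newer pandas versions the dictionary form above raises SpecificationError ("nested renamer is not supported"), and pandas.np has been removed, so the equivalent there is named aggregation with numpy imported directly, roughly (f_stats is just an illustrative name):

import numpy as np

f_stats = g.Amount.agg(my_std=get_unique_std, numpy_std=np.std)   # columns 'my_std' and 'numpy_std'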