Applying a function using index of row to all columns in dataframe - python

I have this example dataframe:
import pandas as pd

vocab_list = ['running','sitting','stand','walk']
col_list = ['browse','wander','saunter','jogging','prancing']
df = pd.DataFrame(vocab_list, columns=['vocab'])
df.set_index('vocab', inplace=True)
df = df.reindex(col_list, axis=1)
I need to apply a user-defined function to all columns, using values from the index of the dataframe.
Take my user-defined function to be the cosine similarity between each pair of strings from the index and the columns:
import spacy
nlp = spacy.load('en_core_web_lg')
from tqdm import tqdm
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
def func(col):
    print(col.name)   # Will print the strings in vocab_list in each call
    print(col.index)  # Will print an Index object containing the names of columns
    doc = nlp(col.name)
    for i, ind in tqdm(enumerate(col.index), leave=False):
        user = nlp(ind)
        check_lemma = doc[0].lemma_ != user[0].lemma_
        pos_equality = doc[0].pos_ == user[0].pos_
        if check_lemma and pos_equality:
            col.iloc[i] = doc.similarity(user)
        else:
            col.iloc[i] = 0
    return col
df = df.parallel_apply(lambda col: func(col), axis=1)
Is there a way to do this without having a for loop in the user-defined function?
The col in the function is a Series object (one per row, since the apply uses axis=1); I can access the row's index string via col.name.
Also, col.index gives me an Index object for this Series, containing the names of the columns, but how do I go from there to the similarities without a for loop?
NOTE: My actual dataframe has ~3000 columns and ~120,000 index entries, so I would prefer not to have a for loop inside the user-defined function.
EDIT: I have edited the question with the user-defined function currently being used.
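One loop-free direction (a sketch, relying on the fact that for en_core_web_lg, Doc.similarity is the cosine similarity of the document vectors): pre-process every row and column string once with nlp.pipe, compute all pairwise similarities as a single matrix product, and apply the lemma/POS checks as broadcast boolean masks:
import numpy as np
# Sketch only: process each unique string once, then compute every pairwise
# cosine similarity in one matrix product instead of a per-cell loop.
row_docs = list(nlp.pipe(df.index))    # one Doc per row label
col_docs = list(nlp.pipe(df.columns))  # one Doc per column label
row_vecs = np.array([d.vector for d in row_docs])
col_vecs = np.array([d.vector for d in col_docs])
# Cosine similarity matrix, shape (n_rows, n_cols)
norms = (np.linalg.norm(row_vecs, axis=1)[:, None]
         * np.linalg.norm(col_vecs, axis=1)[None, :])
norms = np.where(norms == 0, 1.0, norms)  # guard against all-zero (OOV) vectors
sims = (row_vecs @ col_vecs.T) / norms
# Broadcast the lemma/POS checks instead of testing each pair in Python
row_lemma = np.array([d[0].lemma_ for d in row_docs])
col_lemma = np.array([d[0].lemma_ for d in col_docs])
row_pos = np.array([d[0].pos_ for d in row_docs])
col_pos = np.array([d[0].pos_ for d in col_docs])
mask = (row_lemma[:, None] != col_lemma) & (row_pos[:, None] == col_pos)
df.loc[:, :] = np.where(mask, sims, 0.0)
Note that at ~120,000 rows by ~3,000 columns the similarity matrix alone is roughly 2.9 GB in float64, so in practice the matrix product would need to be computed in row chunks.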

Related

Comparing two cell values and extracting the difference

Trying to compare two cell values. Per topic, I need to check whether the Old countries and New countries lists are exactly the same, or whether countries have been added or deleted in the new one; I then want a third column identifying which countries have been added or deleted.
Here's a working example (for one of your requests). You can alter func to achieve whatever you want. As is, it gets the elements in column 2 that are not in column 1.
import pandas as pd
# some data that is in the same format as your image
df = pd.DataFrame()
df['Col1'] = ['A;E;I;O;U']
df['Col2'] = ['A;B;C;D;E']
# function to find the differences between two sets
def func(string1, string2):
    string1_split = string1.split(';')
    string2_split = string2.split(';')
    return [x for x in set(string2_split) if x not in set(string1_split)]
# apply the function to the two columns
df['Col3'] = df.apply(lambda x: func(x['Col1'], x['Col2']), axis=1)
df['Col3'] outputs:
['B', 'C', 'D']
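The function above only captures additions (elements of Col2 missing from Col1), while the question also asks for deletions. A hedged variant (same semicolon-split format; the 'Added'/'Deleted' column names are invented here) returns both set differences:
import pandas as pd
df = pd.DataFrame({'Col1': ['A;E;I;O;U'], 'Col2': ['A;B;C;D;E']})
def diff_both(old, new):
    # Return (added, deleted) between two semicolon-separated strings
    old_set, new_set = set(old.split(';')), set(new.split(';'))
    return sorted(new_set - old_set), sorted(old_set - new_set)
# zip(*...) splits the (added, deleted) pairs into two column-sized tuples
df['Added'], df['Deleted'] = zip(*df.apply(
    lambda x: diff_both(x['Col1'], x['Col2']), axis=1))
print(df)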

pandas apply function to each group (output is not really an aggregation)

I have a list of time series (= pandas DataFrames) and want to calculate the matrix profile for each time series (one per device).
One option is to iterate over all the devices, which seems to be slow.
A second option would be to group by the devices and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group but the same number of rows as the input.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g]
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrix profile for each time series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
print('***************************')
def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x
# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions applied to each distinct group that return non-scalar values, you need to iterate the method across groups and then compile the results together.
Therefore, consider a list or dict comprehension using groupby(), followed by concat. Be sure the method inputs and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
Indeed, this (see also the link above in the comment) is a way to get it to work in a faster / more desirable way. Perhaps there is an even better alternative:
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously we need to apply the UDF here - not the idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    #print(subframe)
pd.concat(altered, ignore_index=True)
#pd.DataFrame(altered)
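As an aside: when the UDF really does return exactly one value per input row (1:1), GroupBy.transform is the idiomatic route, since it aligns the per-group result back onto the original index. A minimal sketch, with rank() standing in for the real matrix-profile UDF:
# transform expects the function to return a scalar or a sequence the same
# length as each group; the result lines up with df's original index
df['result'] = df.groupby('bar')['baz'].transform(lambda s: s.rank())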

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0  # row index for df_new (not shown in the original snippet)
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter, 'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
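For reference, a minimal runnable check of that one-liner (made-up data; 'Thing' and 'delta' mirror the question's column names):
import pandas as pd
df = pd.DataFrame({
    'Thing': ['a', 'b', 'a', 'b'],
    'delta': [1.0, 2.0, 3.0, 4.0],
})
# group, sum 'delta' within each group, then bump 'Thing' back into a column
newDf = df.groupby('Thing')['delta'].sum().reset_index()
print(newDf)  # Thing a -> 4.0, Thing b -> 6.0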

Pandas DataFrame, how to calculate a new column element based on multiple rows

I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:
DataFrame
I would like to create a new column based on a function that takes into account all the rows of the dataframe that have the same string in the "Template" column.
For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
How would you best approach this problem?
PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!
Here is the formula with which I solved this in Excel:
Excel Chi Squared test
This might be overly general but I would use some sort of function map if different things should be done depending on the template name:
import pandas as pd
import numpy as np
import collections
template_column = list(['are|off', 'are|off', 'comp', 'comp', 'comp|city'])
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n),
                  columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column
# Use a defaultdict so that you can define a default value if a template is
# not defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)
# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
    'are|off': lambda df: df.sum().sum(),
    'comp': lambda df: df.mean().mean(),
    'something else': lambda df: df.mean().max()
})
# The lambda functions are just placeholders. You could do whatever you want
# in these functions... for example:
def do_special_stuff(df):
    """Do something that uses rows and columns...
    you could also do looping or whatever you want as long
    as the result is a scalar, or a sequence with the same
    number of columns as the original template DataFrame
    """
    crazy_stuff = np.prod(np.sum(df.values, axis=1)[:, None] + 2 * df.values, axis=1)
    return crazy_stuff
function_map['comp'] = do_special_stuff
def wrap(f):
    """Wrap a function so that it returns an updated dataframe"""
    def wrapped(df):
        df = df.copy()
        new_column_data = f(df.drop('template', axis=1))
        df['new_column'] = new_column_data
        return df
    return wrapped
# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}
# throw everything back together
new_df = pd.concat([series_function_map[label](group)
                    for label, group in df.groupby('template')],
                   ignore_index=True)
# print your shiny new dataframe
print(new_df)
The result is then something like:
     Clicks  Impressions  Conversions   template  new_column
0  0.959765     0.111648     0.769329    are|off    4.030594
1  0.809917     0.696348     0.683587    are|off    4.030594
2  0.265642     0.656780     0.182373       comp    0.502015
3  0.753788     0.175305     0.978205       comp    0.502015
4  0.269434     0.966951     0.478056  comp|city         NaN
Hope it helps!
OK, so after the groupby you need to apply this formula, so you can do this in pandas as well:
import numpy as np
t = df.groupby("Template")  # this is for the groupby
def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)
df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])
Here b5, b6, etc. are the column names of the cells shown in the image.
This should work for you, though you may need to make some minor changes to the maths there.
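As a further hedged sketch (the real column names are only visible in the image, so the data and names here are invented): pandas can broadcast the per-Template sums back onto each row with groupby().transform('sum'), which mirrors the b5 + b6 and c5 + c6 terms of the Excel formula without any explicit loop:
import pandas as pd
# Invented example data; the actual columns come from the image above
df = pd.DataFrame({
    'Template': ['[Are|Off]', '[Are|Off]', '[Comp]', '[Comp]'],
    'Clicks': [10, 20, 5, 15],
    'Impressions': [100, 200, 50, 150],
})
# transform('sum') returns one value per row, computed over all rows
# that share the same Template (i.e. b5 + b6 and c5 + c6)
clicks_total = df.groupby('Template')['Clicks'].transform('sum')
impr_total = df.groupby('Template')['Impressions'].transform('sum')
# Per-row analogue of b5 / (b5 + b6) * (c5 + c6)
df['result'] = df['Clicks'] / clicks_total * impr_total
print(df)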

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values in every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them. Then set the new column equal to the equation containing the variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...")  # df is a large 2D array
A = df[0]
B = df[1]
def f(A, B):
    ...  # the actual equation goes here
df[3] = f(A, B)
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # This is COLUMN 0-indexed, returned as a Series (icol is long deprecated)
test.columns=(['S','Q']) # Column names are easier to use
test #Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in Python pandas apply on more columns :
def toyfun(df):
    return df[0] - df[1]**2
test['out2']=test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))
