Applying similar functions across multiple columns in python/pandas - python

Problem: Given the dataframe below, I'm trying to come up with the code that will apply a function to three distinct columns without having to write three separate function calls.
The code for the data:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'days': [365, 365, 213, 318, 71],
'spend_30day': [22, 241.5, 0, 27321.05, 345],
'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}
df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df
The function below will essentially annualize spend; if someone has fewer than, say, 365 days in the "days" column, the following function will tell me what the spend would have been if they had 365 days:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
Then I apply the function to the particular column:
df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.
I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:
spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if row.days/float(day) < 1:
            return (row.col)/((row.days)/float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis = 1)
The error:
AttributeError: ("'Series' object has no attribute 'col'")
I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!

Look at your two function definitions:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
and
# col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if row.days/float(day) < 1:
        return (row.col)/((row.days)/float(day))
    else:
        return row.col
See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails: attribute access looks for a field literally named "col", not for the field whose name is stored in the variable col. On top of that, col in your loop holds the column data taken from df, not a column name. Instead, define
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
before your loop and use row[col] inside the function. Also, assigning the result to col inside a loop over col only rebinds the loop variable; it does not write anything back to df.
I'm unfamiliar with pandas.DataFrame.apply, but it's probably possible to use a single function definition, which takes the number of days and the field of interest as input variables:
def annualize_spend(col, day, row):
    if row['days']/float(day) < 1:
        return row[col]/(row['days']/float(day))
    else:
        return row[col]
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    # write the result back into the dataframe column, not into the loop variable
    df[col] = df.apply(lambda row, col=col, day=day: annualize_spend(col, day, row), axis=1)
The lambda binds col and day, so only the row parameter is left free when the function is applied.
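The same loop can also be written without a row-wise apply. The following is a minimal sketch, not the answerer's code: it assumes every value in days is positive, and the names periods and fraction are introduced here only for illustration. Capping days/period at 1 reproduces the if/else branch, because dividing by 1 leaves the spend unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    'days': [365, 365, 213, 318, 71],
    'spend_30day': [22, 241.5, 0, 27321.05, 345],
    'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
    'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54],
})

# Illustrative mapping from each spend column to its period length in days.
periods = {'spend_30day': 30, 'spend_90day': 90, 'spend_365day': 365}

for col, period in periods.items():
    # Fraction of the period actually observed, capped at 1 so that
    # customers with a full period keep their original spend.
    fraction = (df['days'] / period).clip(upper=1)
    df[col] = (df[col] / fraction).round(2)

print(df)
```

Each column is handled in one vectorized operation, so the function body is written once and the loop only supplies the column name and period.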

Related

function to sort rows

I have the following code which filters a row based on the part number.
df= sets[sets['num_parts'] == 11695]
df
Now I want to write a function that I can call with set_num passed as a string and that returns the matching rows.
Here is what I have tried. It gives me an error.
def select_set_row(df):
    return display(df.iloc['set_num'])
iloc is integer-location based indexing. You should use df['set_num'].iloc[0]
This is not exactly a function, but you can create a similar function from it.
Sample data:
df = pd.DataFrame({'id': [1,2,3,4,5,6], 'num_parts': [132, 234, 345, 578, 462, 222]})
print(df)
List of num_parts:
num_parts = df['num_parts'].tolist()
Outputting the dataframe rows for each value of num_parts:
for parts in num_parts:
    print(df[df['num_parts'] == parts])
I was able to select the rows with this function:
def select_set_numparts(num_parts):
    return sets[sets['num_parts'] == num_parts]
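For the original request (pass set_num as a string and get the rows back), the same boolean-mask pattern applies. This is a small sketch under assumptions: the sets frame below is made-up stand-in data, and it assumes the real frame has a column literally named set_num.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `sets` dataframe.
sets = pd.DataFrame({
    'set_num': ['001-1', '002-1', '003-1'],
    'num_parts': [132, 11695, 345],
})

def select_set_row(sets_df, set_num):
    """Return all rows whose set_num equals the given string."""
    return sets_df[sets_df['set_num'] == set_num]

row = select_set_row(sets, '002-1')
print(row)
```

Passing the dataframe as an argument, rather than reading a global, keeps the function reusable on other frames.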

Pandas merging dataframes resulting in x and y suffixes

I am creating my own dataset for a Uni project. I've used the merge function often and it always worked perfectly. This time I get x and y suffixes, which I cannot understand. I know how pandas merges: the rows in the two data frames that match on the specified columns are extracted and joined together, and if there is more than one match, all possible matches contribute one row each. But I really don't get why that produces this result here. I assume it has to do with a warning I got earlier:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
I tried to merge the dataframes on the column 'CustomerID' where they obviously match. I really don't get the error.
Here is my code:
I first want to remove duplicate rows where the relevant columns are CustomerID and WebsiteID
Then I want to apply a function which returns true or false as a string randomly. Up to this point the resulting dataframe looks fine. The only warning I get is the one I described earlier.
And lastly I want to merge them and it results in a dataframe way larger than the original one. I really don't understand that.
import numpy as np
import pandas as pd
from numpy.random import choice
df = pd.DataFrame()
df['AdID'] = np.random.randint(1,1000001, size=100000)
df['CustomerID'] = np.random.randint(1,1001, size=len(df))
df["Datetime"] = choice(pd.date_range('2015-01-01', '2020-12-31'), len(df))
def check_weekday(date):
    res = len(pd.bdate_range(date, date))
    if res == 0:
        result = "Weekend"
    else:
        result = "Working Day"
    return result
df["Weekend"] = df["Datetime"].apply(check_weekday)
def apply_age():
    age = choice([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36],
                 p=[.00009, .00159, .02908, .06829, .09102, .10043, .10609, .10072, .09223, .08018, .06836, .05552,
                    .04549,.03864, .03009, .02439, .01939, .01586, .01280, .01069, .00905])
    return age
def apply_income_class():
    income_class = choice([np.random.randint(50,501),np.random.randint(502,1001), np.random.randint(1002,1501),np.random.randint(1502,2001)],
                          p=[.442, .387, .148, .023])
    return income_class
def apply_gender():
    gender = choice(['male', 'female'], p=[.537, .463])
    return gender
unique_customers = df[['CustomerID']].drop_duplicates(keep="first")
unique_customers['Age'] = [apply_age() for x in unique_customers.index]
unique_customers['Gender'] = [apply_gender() for x in unique_customers.index]
unique_customers['Monthly Income'] = [apply_income_class() for x in unique_customers.index]
unique_customers['Spending Score'] = [np.random.randint(1,101) for x in unique_customers.index]
df = df.merge(unique_customers, on=['CustomerID'], how='left')
df['WebsiteID'] = np.random.randint(1,31, len(df))
df['OfferID'] = np.random.randint(1,2001, len(df))
df['BrandID'] = np.random.randint(1,10, len(df))
unique_offers = df[['OfferID']].drop_duplicates(keep="first")
print(len(unique_offers))
unique_offers['CategoryID'] = [np.random.randint(1,501) for x in unique_offers.index]
unique_offers['NPS'] = [np.random.randint(1, 101) for x in unique_offers.index]
df = df.merge(unique_offers, on=['OfferID'], how='left')
def apply_website_user():
    purchase = np.random.choice(['True', 'False'])
    return purchase
unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
print(unique_website_user.head())
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
#df['PurchaseOnWebsite']= df.groupby(['CustomerID', 'WebsiteID']).apply(apply_website_user)
print(df.head())
# Create the CSV file
# df.to_csv(r'/Users/alina/Desktop/trainingsdaten.csv', sep=',', index=False)
It's better to paste the data, rather than provide images, so this is just guidance as I can't test it. You have a couple of issues and I don't think they are related.
The copy-or-slice warning: you might be able to get rid of this in two ways. One is to rewrite the line:
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
to the format it is suggesting. The other, simpler way that might work is to use .copy() on the line before it. You are dropping duplicates and then modifying the result, and pandas is just warning that you may be modifying a slice or view of the original. Try this:
unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
If you just want to merge back that one column and reduce number of columns, try this:
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
Another alternative would be to use groupby() and apply your True/False function in an apply method. Something like:
df.groupby(['CustomerID']).apply(yourfunctionhere)
This gets rid of creating and merging dataframes. If you post all the code and the actual dataframe, we can be more specific.
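To illustrate why the merged frame can grow and where the _x/_y suffixes come from, here is a minimal sketch on made-up data (the small frame and the names pairs and merged are illustrative, not the asker's data). The lookup frame holds one row per (CustomerID, WebsiteID) pair, so merging on CustomerID alone matches each row of df against every website row of that customer. Separately, when the right-hand frame carries non-key columns that also exist on the left, pandas disambiguates them with _x and _y. Merging on both keys, with only the needed column on the right, avoids both effects:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'CustomerID': rng.integers(1, 11, size=200),
    'WebsiteID': rng.integers(1, 4, size=200),
})

# One row per (CustomerID, WebsiteID) pair, with only the key columns kept.
pairs = df[['CustomerID', 'WebsiteID']].drop_duplicates().copy()
pairs['PurchaseOnWebsite'] = rng.choice(['True', 'False'], size=len(pairs))

# Merging on both keys gives at most one match per row of df.
merged = df.merge(pairs, on=['CustomerID', 'WebsiteID'], how='left')

print(len(df), len(merged))
print(merged.columns.tolist())
```

Comparing len(df) with len(merged) after each merge is a quick way to spot an accidental many-to-many join.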
UPDATE:
Saw your comment that you found your own answer. Also, this is way faster than your call to the weekday function.
df["Weekend"] = df['Datetime'].apply(lambda x: 'Weekend' if (x.weekday() == 5 or x.weekday() == 6) else 'Working Day')
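A vectorized variant of the same check avoids calling a Python function once per row. This is a minimal sketch on a few made-up dates, assuming the Datetime column has a datetime64 dtype so that the .dt accessor is available:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'])
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
df['Weekend'] = np.where(df['Datetime'].dt.dayofweek >= 5, 'Weekend', 'Working Day')

print(df)
```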

Efficient way to loop with if statement

I have sample data that looks like this (the real dataset has more columns):
data = {'stringID':['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA'],'IDct':[1,3,4]}
data = pd.DataFrame(data)
data['Index1'] = [[3,6],[7,9],[5,6]]
data['Index2'] = [[4,8],[10,13],[8,9]]
What I want to achieve is to slice the stringID column based on the second element in Index1 and Index2 (both are lists), but only if the IDct value is bigger than 1; otherwise return NaN.
I tried this, and it produces the Output1 column as expected, but there must be a better way (I mean faster when applied to a large dataset). Please advise, thanks!
data['pos'] = data.Index1.map(lambda x: x[1])
data['pos1'] = data.Index2.map(lambda x: x[1])
def cal(m):
    if m['IDct'] > 1:
        return m['stringID'][m['pos']:m['pos1']]
    else:
        return 'NaN'
data['Output1'] = data.apply(cal,axis=1)
I love pandas - but realistically speaking it's just one of many tools that belong in your tool belt.
pandas and numpy really shine for computation and analysis. It's okay to use pandas to visualize and analyze your data - but that doesn't mean it's the right tool for the job.
This kind of problem is better suited to regular Python. Let's move stringID and IDct out of the dict and back into lists, assuming the data is regular in shape (all lists are of equal length):
StringID = ['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA']
IDct = [1,3,4]
Index1 = [[3,6],[7,9],[5,6]]
Index2 = [[4,8],[10,13],[8,9]]
result = []
for s, ct, i1, i2 in zip(StringID, IDct, Index1, Index2):
    if ct > 1:
        # slice between the second elements of Index1 and Index2
        result.append(s[i1[1]:i2[1]])
    else:
        result.append(None)
You can then blend the result data back in as you see fit.
data = {
'StringID': StringID,
'IDct': IDct,
'Index1': Index1,
'Index2': Index2,
'Result': result
}
pd.DataFrame(data)
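The same idea can be applied directly to the dataframe columns without first pulling them out into separate lists. A minimal sketch, assuming the column names from the question and using np.nan (rather than the string 'NaN') for rows that do not qualify:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'stringID': ['AB CD Efdadasfd', 'RFDS EDSfdsadf dsa', 'FDSADFDSADFFDSA'],
    'IDct': [1, 3, 4],
})
data['Index1'] = [[3, 6], [7, 9], [5, 6]]
data['Index2'] = [[4, 8], [10, 13], [8, 9]]

# Iterate the four columns in lockstep; slice only when IDct > 1.
data['Output1'] = [
    s[i1[1]:i2[1]] if ct > 1 else np.nan
    for s, ct, i1, i2 in zip(data['stringID'], data['IDct'], data['Index1'], data['Index2'])
]

print(data[['stringID', 'IDct', 'Output1']])
```

A list comprehension over zipped columns avoids building a Series object for every row, which is the main cost of apply with axis=1.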

issue in writing function to filter rows data frame

I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df ['Xstart'] = [1,2.5,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
df ['GW'] = [1,1,2,3,4]
def filter(data,Game_week):
    pass_data = data [(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I use manual filter, it works.
pass_data = df [(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows for multiple GW values (1, 2, 3, etc.).
I can do that manually as follows:
pass_data = df [(df['GW'] == [1])|(df['GW'] == [2])|(df['GW'] == [3])]
If I want to pass a list such as [1,2,3] as the function input, how can I write the function so that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:
def filter_vals(data,Game_week):
    return data[data['GW'].isin(Game_week)]
df1 = filter_vals(df,range(1,4))
Your function does not return anything, so the call evaluates to None rather than the desired dataframe. Add a return (note that there is also no need for parentheses inside data[...]):
def filter(data,Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use -
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function -
df1 = filter(df,[1,2])
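The two cases (a single game week and several game weeks) can be combined in one function. This is a sketch rather than any answerer's code: the name filter_game_weeks is made up here to avoid shadowing the built-in filter, and the sample frame keeps only two of the question's columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Xstart': [1, 2.5, 3, 4, 5],
    'GW': [1, 1, 2, 3, 4],
})

def filter_game_weeks(data, game_weeks):
    """Return rows whose GW is in game_weeks (a single value or an iterable)."""
    if np.isscalar(game_weeks):
        game_weeks = [game_weeks]
    return data[data['GW'].isin(list(game_weeks))]

one_week = filter_game_weeks(df, 1)
three_weeks = filter_game_weeks(df, range(1, 4))
print(one_week)
print(three_weeks)
```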

Pass a function with parameters specified for resample() method on a pandas dataframe

I want to pass a function to resample() on a pandas dataframe with certain parameters specified when it is passed (as opposed to defining several separate functions).
This is the function
import itertools
def spell(X, kind='wet', how='mean', threshold=0.5):
    if kind=='wet':
        condition = X>threshold
    else:
        condition = X<=threshold
    length = [sum(1 if x==True else np.nan for x in group) for key,group in itertools.groupby(condition)]
    if not length:
        res = 0
    elif how=='mean':
        res = np.mean(length)
    else:
        res = np.max(length)
    return res
Here is a dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
And here is roughly what I want to do with it:
df.resample('M', how=spell(kind='dry',how='max',threshold=0.7))
But I get the error TypeError: spell() takes at least 1 argument (3 given). I want to be able to pass this function with these parameters specified except for the input array. Is there a way to do this?
EDIT:
X is the input array that is passed to the function when calling the resample method on a dataframe object like so df.resample('M', how=my_func) for a monthly resampling interval.
If I try df.resample('M', how=spell) I get:
0
1960-01-31 1.875000
1960-02-29 1.500000
1960-03-31 1.888889
1960-04-30 3.000000
which is exactly what I want for the default parameters but I want to be able to specify the input parameters to the function before passing it. This might include storing the definition in another variable but I'm not sure how to do this with the default parameters changed.
I think this may be what you're looking for, though it's a little hard to tell. Let me know if this helps. First, the example dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
EDIT- had a greater than instead of less than or equal to originally...
Next, the function:
def spell(df, column='', kind='wet', rule='M', how='mean', threshold=0.5):
    if kind=='wet':
        df = df[df[column] > threshold]
    else:
        df = df[df[column] <= threshold]
    df = df.resample(rule=rule, how=how)
    return df
So, you would call it by:
spell(df, 0)
To get:
0
1960-01-31 0.721519
1960-02-29 0.754054
1960-03-31 0.746341
1960-04-30 0.654872
You can change around the parameters as well:
spell(df, 0, kind='something else', rule='W', how='max', threshold=0.7)
0
1960-01-03 0.570638
1960-01-10 0.529357
1960-01-17 0.565959
1960-01-24 0.682973
1960-01-31 0.676349
1960-02-07 0.379397
1960-02-14 0.680303
1960-02-21 0.654014
1960-02-28 0.546587
1960-03-06 0.699459
1960-03-13 0.626460
1960-03-20 0.611464
1960-03-27 0.685950
1960-04-03 0.688385
1960-04-10 0.697602
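A different route to what the question asks (fixing some parameters before handing the function to resample) is functools.partial from the standard library. The sketch below rests on several assumptions: it targets a recent pandas, where resample(...).apply(func) replaces the older how= argument and pd.date_range replaces the DatetimeIndex(start=...) constructor; it uses the month-start alias 'MS' instead of 'M'; and its spell is a simplified variant of the asker's function that counts only the runs where the condition holds.

```python
import functools
import itertools

import numpy as np
import pandas as pd

def spell(x, kind='wet', how='mean', threshold=0.5):
    """Mean or max length of consecutive runs that satisfy the wet/dry condition."""
    condition = (x > threshold) if kind == 'wet' else (x <= threshold)
    # groupby yields one group per run of equal values; keep only the True runs.
    lengths = [sum(1 for _ in group)
               for key, group in itertools.groupby(condition) if key]
    if not lengths:
        return 0
    return np.mean(lengths) if how == 'mean' else np.max(lengths)

idx = pd.date_range(start='1960-01-01', periods=100, freq='D')
series = pd.Series(np.random.default_rng(0).random(100), index=idx)

# Bind every parameter except the input array.
dry_max = functools.partial(spell, kind='dry', how='max', threshold=0.7)

monthly = series.resample('MS').apply(dry_max)
print(monthly)
```

A lambda such as lambda x: spell(x, kind='dry', how='max', threshold=0.7) would work the same way; partial just gives the bound function a reusable name.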
