I would like to use a function that produces multiple outputs to create multiple new columns in an existing pandas dataframe.
For example, say I have this test function which outputs 2 things:
def testfunc(TranspoId, LogId):
    thing1 = TranspoId + LogId
    thing2 = LogId - TranspoId
    return thing1, thing2
I can give those returned outputs to 2 different variables like so:
Thing1,Thing2 = testfunc(4,28)
print(Thing1)
print(Thing2)
I tried to do this with a dataframe in the following way:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
print(df)
df['thing1','thing2'] = df.apply(lambda row: testfunc(row.TranspoId, row.LogId), axis=1)
print(df)
What I want is something that looks like this:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23], 'Thing1':[13,16,26], 'Thing2':[11,12,20]}
df = pd.DataFrame(data, columns=['Name','TranspoId','LogId','Thing1','Thing2'])
print(df)
In the real world that function is doing a lot of heavy lifting, and I can't afford to run it twice, once for each new variable being added to the df.
I've been hitting myself in the head with this for a few hours. Any insights would be greatly appreciated.
I believe the best way is to change the order and make a function that works with Series.
import pandas as pd
# Create function that deals with series
def testfunc(Series1, Series2):
    Thing1 = Series1 + Series2
    Thing2 = Series1 - Series2
    return Thing1, Thing2
# Create df
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
# Apply function
Thing1,Thing2 = testfunc(df['TranspoId'],df['LogId'])
print(Thing1)
print(Thing2)
# Assign new columns
df = df.assign(Thing1 = Thing1)
df = df.assign(Thing2 = Thing2)
# print df
print(df)
Your function should return a Series that calculates the new columns in one pass. Then you can use DataFrame.apply to add the new fields.
import pandas as pd
df = pd.DataFrame( {'TranspoId':[1,2,3], 'LogId':[4,5,6]})
def testfunc(row):
    new_cols = pd.Series([
        row['TranspoId'] + row['LogId'],
        row['LogId'] - row['TranspoId']])
    return new_cols
df[['thing1','thing2']] = df.apply(testfunc, axis = 1)
print(df)
Output:
TranspoId LogId thing1 thing2
0 1 4 5 3
1 2 5 7 3
2 3 6 9 3
Related
I have a data frame
cat input.csv
dwelling,wall,weather,occ,height,temp
5,2,Ldn,Pen,154.7,23.4
5,4,Ldn,Pen,172.4,28.7
3,4,Ldn,Pen,183.5,21.2
3,4,Ldn,Pen,190.2,30.3
To which I'm trying to apply the following function:
input_df = pd.read_csv('input.csv')
def folder_column(row):
    if row['dwelling'] == 5 and row['wall'] == 2:
        return 'folder1'
    elif row['dwelling'] == 3 and row['wall'] == 4:
        return 'folder2'
    else:
        return 0
I want to run the function on the input dataset and store the output in a separate data frame using something like this:
temp_df = pd.DataFrame()
temp_df = input_df['archetype_folder'] = input_df.apply(folder_column, axis=1)
But when I do this I only get the newly created 'archetype_folder' column in temp_df, when I would like all the original columns from input_df as well. Can anyone help? Note that I don't want to add the new column 'archetype_folder' to the original input_df. I've also tried this:
temp_df = input_df
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
But when I run the second command both input_df and temp_df end up with the new column?
Any help is appreciated!
Use DataFrame.copy:
temp_df = input_df.copy()
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
You need to create a copy of the original DataFrame, then assign the return values of your function to it. Consider the following simple example:
import pandas as pd
def is_odd(row):
    return row.value % 2 == 1
df1 = pd.DataFrame({"value":[1,2,3],"name":["uno","dos","tres"]})
df2 = df1.copy()
df2["odd"] = df1.apply(is_odd,axis=1)
print(df1)
print("=====")
print(df2)
gives output
value name
0 1 uno
1 2 dos
2 3 tres
=====
value name odd
0 1 uno True
1 2 dos False
2 3 tres True
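The underlying issue is that plain assignment like `temp_df = input_df` binds a second name to the same object rather than making a copy. A minimal check, with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = df1          # same object, not a copy
df3 = df1.copy()   # independent copy

df2['b'] = [3, 4]  # this change also appears in df1
print('b' in df1.columns)  # True: df1 and df2 are the same frame
print('b' in df3.columns)  # False: df3 was copied before the change
```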
You don't need apply. Use .loc to be more efficient.
temp_df = input_df.copy()
m1 = (input_df['dwelling'] == 5) & (input_df['wall'] == 2)
m2 = (input_df['dwelling'] == 3) & (input_df['wall'] == 4)
temp_df['archetype_folder'] = 0  # default, matching the function's else branch
temp_df.loc[m1, 'archetype_folder'] = 'folder1'
temp_df.loc[m2, 'archetype_folder'] = 'folder2'
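The two conditions and the default can also be expressed in a single step with numpy.select. A sketch with made-up rows shaped like input.csv; here the string '0' stands in for the function's integer 0 so the column keeps one dtype:

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({'dwelling': [5, 5, 3, 3],
                         'wall':     [2, 4, 4, 4]})

temp_df = input_df.copy()
m1 = (temp_df['dwelling'] == 5) & (temp_df['wall'] == 2)
m2 = (temp_df['dwelling'] == 3) & (temp_df['wall'] == 4)
# conditions are checked in order; rows matching neither get the default
temp_df['archetype_folder'] = np.select([m1, m2], ['folder1', 'folder2'],
                                        default='0')
print(temp_df)
```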
I have a dataframe like the below:
dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}

               Email Ticket_Category  Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com           Tier1                   5           1345.45
1  joblogs@gmail.com           Tier2                   2          10295.88
What I'm trying to do is to create 2 new columns "Tier1_Quantity_Purchased" and "Tier2_Quantity_Purchased" based on the existing dataframe, and sum the total of "Total_Price_Paid" as below:
dummy_dict_desired = {'Email': ['joblogs@gmail.com'],
                      'Tier1_Quantity_Purchased': [5],
                      'Tier2_Quantity_Purchased': [2],
                      'Total_Price_Paid': [11641.33]}

               Email  Tier1_Quantity_Purchased  Tier2_Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com                         5                         2          11641.33
Any help would be greatly appreciated. I know there is an easy way to do this, just can't figure out how without writing some silly for loop!
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
Email              Tier1  Tier2  total
joblogs@gmail.com      5      2  11641.33
For more details on pivoting, take a look at How can I pivot a dataframe?
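Putting those steps together into a self-contained sketch, with the pivot result flattened back into a regular frame and the columns renamed to match the desired output:

```python
import pandas as pd

df = pd.DataFrame({'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                   'Ticket_Category': ['Tier1', 'Tier2'],
                   'Quantity_Purchased': [5, 2],
                   'Total_Price_Paid': [1345.45, 10295.88]})

# one row per email, one column per ticket category
pivot_df = df.pivot(index='Email', columns='Ticket_Category',
                    values='Quantity_Purchased')
# aggregate the price from the original (long) table; indexes align on Email
pivot_df['Total_Price_Paid'] = df.groupby('Email')['Total_Price_Paid'].sum()

# flatten back to an ordinary frame with the desired column names
pivot_df = pivot_df.reset_index().rename(
    columns={'Tier1': 'Tier1_Quantity_Purchased',
             'Tier2': 'Tier2_Quantity_Purchased'})
pivot_df.columns.name = None
print(pivot_df)
```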
import pandas as pd
dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)
df2 = df[['Ticket_Category', 'Quantity_Purchased']]
df_transposed = df2.T
df_transposed.columns = ['Tier1_purchased', 'Tier2_purchased']
df_transposed = df_transposed.iloc[1:]
df_transposed = df_transposed.reset_index()
df_transposed = df_transposed[['Tier1_purchased', 'Tier2_purchased']]
df = df.groupby('Email')[['Total_Price_Paid']].sum()
df = df.reset_index()
df.join(df_transposed)
output
I have my data like this
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,4), columns = list("ABCD"))
And I have equations that I apply to this data:
df["Vchd_D"] = df["A"].diff(1)*0.1
df["Vch_D"] = df["B"].diff(1)*0.20
df["Vecs_D"] = df["C"].diff(1)*0.50
If I have other DataFrames with different values and names, for example:
Data = pd.DataFrame(np.random.rand(5,4), columns = list("ABCD"))
DataWI = pd.DataFrame(np.random.rand(5,4), columns = list("ABCD"))
fex = pd.DataFrame(np.random.rand(5,4), columns = list("ABCD"))
How can I apply my equations to the other DataFrames automatically?
You can first create a function in which you define all your formulas as strings and use DataFrame.eval to execute them. Then we apply this function to each dataframe:
def apply_formulas(df):
    func1 = 'A.diff(1)*0.1'
    func2 = 'B.diff(1)*0.2'
    func3 = 'C.diff(1)*0.5'
    df["Vchd_D"] = df.eval(func1)
    df["Vch_D"] = df.eval(func2)
    df["Vecs_D"] = df.eval(func3)
    return df
dfs = [Data, DataWI, fex]
for df in dfs:
    df = apply_formulas(df)
Then, for example, if we print the Data dataframe:
print(Data)
A B C D Vchd_D Vch_D Vecs_D
0 0.569892 0.799825 0.441034 0.858675 NaN NaN NaN
1 0.681410 0.937648 0.457076 0.612711 0.011152 0.027565 0.008021
2 0.848778 0.491082 0.614710 0.049382 0.016737 -0.089313 0.078817
3 0.067191 0.936427 0.264359 0.710680 -0.078159 0.089069 -0.175176
4 0.377954 0.708957 0.368314 0.797688 0.031076 -0.045494 0.051978
I would like to create dataframes in a loop and then use these dataframes in another loop. I tried the eval() function but it didn't work.
For example :
for i in range(5):
    df_i = df[(df.age == i)]
Here I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])
You can create a dict of DataFrames x and use the loop index as the dict keys:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
    x[i] = df[df['age']==i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
The way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")

def UDF(dfi):
    # do something in user-defined function
    return dfi

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, eval(f"df_{i}")])
Better Way 1
Using a list or a dict to store the dataframe should be a better way since you can access each dataframe by an index or a key.
Since another answer shows the way using dict (@perl), I will show you the way using list.
def UDF(dfi):
    # do something in user-defined function
    return dfi

dfs = [df[df['age']==i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
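A runnable sketch of the list approach, with a stand-in UDF (it just adds a constant column, since the real per-group calculation isn't specified):

```python
import pandas as pd

df = pd.DataFrame({'age': [0, 1, 1, 2, 4]})

def udf(dfi):
    # stand-in for the real per-group calculation
    out = dfi.copy()
    out['age_plus_ten'] = out['age'] + 10
    return out

# dfs[0], dfs[1], ... take the place of df_0, df_1, ...
dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(udf, dfs))
print(final_df)
```

Note that a group with no rows (here age 3) simply contributes an empty frame to the concat, so no special-casing is needed.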
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (probably; it depends on what your calculation actually does).
def UDF(dfi):
    # do something in user-defined function
    return dfi

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
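For instance, if the per-group calculation reduces each group to one row, a plain aggregation can replace the UDF entirely. A sketch where the mean stands in for whatever the real calculation is:

```python
import pandas as pd

df = pd.DataFrame({'age': [0, 1, 1, 2, 4],
                   'score': [10, 20, 30, 40, 50]})

# one row per age group; the mean is a stand-in for the real UDF
final_df = df.groupby('age', as_index=False)['score'].mean()
print(final_df)
```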
I'd like to wrap df.groupby(pd.TimeGrouper(freq='M')).sum() in a function so that I can assign sum(), mean() or count() as arguments in that function. I've asked a similar question earlier here, but I don't think I can use the same technique in this particular case.
Here is a snippet with reproducible input:
# Imports
import pandas as pd
import numpy as np
# Dataframe with 1 or zero
# 100 rows and 4 columns
# Indexed by dates
np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
datelist = pd.date_range('2017-01-01', periods=100).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
print(df.head(10))
Which gives:
With this we can do:
df2 = df.groupby(pd.TimeGrouper(freq='M')).sum()
print(df2)
And get:
Or we can do:
df3 = df.groupby(pd.TimeGrouper(freq='M')).mean()
print(df3)
And get:
Here's part of the procedure wrapped into a function:
# My function
def function1(df):
    df = df.groupby(pd.TimeGrouper(freq='M')).sum()
    return df
# Function1 call
df4 = function1(df = df)
print(df4)
And that works just fine:
The problem occurs when I try to add sum() or mean() as an argument in Function2, like this:
# My function with sum() as an argument
def function2(df, fun):
    df = df.groupby(pd.TimeGrouper(freq='M')).fun
    return df
My first attempt raises a TypeError:
# Function2 test 1
df5 = function2(df = df, fun = sum())
My second attempt raises an attribute error:
# Function2 test 2
df6 = function2(df = df, fun = 'sum()')
Is it possible to make a few adjustments to this setup to get it working? (I tried another version with 'M' as an argument for freq, and that worked just fine). Or is this just not the way these things are done?
Thank you for any suggestions!
Here is the whole mess for an easy copy&paste:
#%%
# Imports
import pandas as pd
import numpy as np
# Dataframe with 1 or zero
# 100 rows across 4 columns
# Indexed by dates
np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
datelist = pd.date_range('2017-01-01', periods=100).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
print(df.head(10))
# Calculate sum per month
df2 = df.groupby(pd.TimeGrouper(freq='M')).sum()
print(df2)
# Or calculate average per month
df3 = df.groupby(pd.TimeGrouper(freq='M')).mean()
print(df3)
# My function
def function1(df):
    df = df.groupby(pd.TimeGrouper(freq='M')).sum()
    return df
# Function1 test
df4 = function1(df = df)
print(df4)
# So far so good
#%%
# My function with sum() as argument
def function2(df, fun):
    print(fun)
    df = df.groupby(pd.TimeGrouper(freq='M')).fun
    return df
# Function2 test 1
# df5 = function2(df = df, fun = sum())
# Function2 test 2
# df6 = function2(df = df, fun = 'sum()')
# Function2 test 3
# df7 = function2(df = df, fun = sum)
You need to use apply:
def function2(df, fun):
    return df.groupby(pd.TimeGrouper(freq='M')).apply(fun)
Just make sure fun is a callable that takes a pd.DataFrame
However, you should probably use agg. If fun reduces columns to a scalar similar to sum or mean, then this should work. Something to consider.
df.groupby(pd.TimeGrouper('M')).agg(['sum', 'mean', fun])
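A quick sketch of the agg route on the question's data. Note this uses pd.Grouper, which replaced the since-removed pd.TimeGrouper in newer pandas (and very recent pandas spells month-end frequency 'ME' rather than 'M'):

```python
import numpy as np
import pandas as pd

np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(0, 2, size=(100, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2017-01-01', periods=100))

# string names (and callables) can be mixed in one agg call;
# 'M' is month-end frequency ('ME' in very new pandas)
result = df.groupby(pd.Grouper(freq='M')).agg(['sum', 'mean'])
print(result)
```

The result has one row per month and a two-level column index: the original column on the first level, the aggregation name on the second.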
Per the comment of @BlackJack, here is a simpler implementation that uses getattr(gb, foo) to get the method foo on the gb groupby object. If such a method does not exist, it raises an AttributeError. Depending on use, you may wish to control which functions can be passed as arguments to the foo parameter (see second example below).
def function(df, foo):
    gb = df.groupby(pd.TimeGrouper(freq='M'))
    try:
        foo = getattr(gb, foo)
    except AttributeError:
        raise AttributeError('{} cannot be performed on this object'.format(foo))
    return foo()
Here is an alternative approach. This uses eval which is evil because of security concerns. However, it first ensures that foo is a known function type that can safely be applied to either a
pd.core.groupby.SeriesGroupBy or pd.core.groupby.DataFrameGroupBy object.
def function2(df, foo):
    safe_functions = ('sum', 'mean', 'count')
    if foo not in safe_functions:
        raise ValueError('foo is not safe')
    gb = df.groupby(pd.TimeGrouper(freq='M'))
    if not isinstance(gb, (pd.core.groupby.SeriesGroupBy, pd.core.groupby.DataFrameGroupBy)):
        raise ValueError('Unexpected groupby result')
    return eval('gb.{}()'.format(foo))
>>> function(df, 'sum')
A B C D
dates
2017-01-31 18 15 14 14
2017-02-28 15 15 12 17
2017-03-31 18 17 16 17
2017-04-30 8 3 3 7