How to aggregate multiple tasks into a single Python function?

I'm working on a dataframe that I have been able to clean by running the following code in separate cells in a Jupyter notebook. However, I need to run these same tasks on several dataframes that are organized exactly the same. How can I write a function that executes tasks 2 through 4 below?
For reference, the data I'm working with is located here.
[1]: df1 = pd.read_csv('202110-divvy-tripdata.csv')
[2]: df1.drop(columns=['start_station_name','start_station_id','end_station_name','end_station_id','start_lat','start_lng','end_lat','end_lng'],inplace=True)
[3]: df1['ride_length'] = pd.to_datetime(df1.ended_at) - pd.to_datetime(df1.started_at)
[4]: df1['day_of_week'] = pd.to_datetime(df1.started_at).dt.day_name()

You can define a function in a cell in Jupyter, run this cell and then call the function:
def process_df(df):
    # tasks 2-4: drop the unused columns, compute the ride length, and label the weekday
    df.drop(columns=['start_station_name','start_station_id','end_station_name','end_station_id','start_lat','start_lng','end_lat','end_lng'], inplace=True)
    df['ride_length'] = pd.to_datetime(df.ended_at) - pd.to_datetime(df.started_at)
    df['day_of_week'] = pd.to_datetime(df.started_at).dt.day_name()
Call the function with each DataFrame:
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
process_df(df1)
process_df(df2)
According to this answer, both DataFrames will be altered in place and there's no need to return a new object from the function.
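If there are more than two such files, the same function can be applied in a loop; a minimal sketch, where the second filename is just a placeholder for whichever monthly files you have:
filenames = ['202110-divvy-tripdata.csv', '202111-divvy-tripdata.csv']  # adjust to your files

frames = []
for fn in filenames:
    df = pd.read_csv(fn)
    process_df(df)           # mutates df in place, as described above
    frames.append(df)

# optionally combine the cleaned months into one DataFrame
all_trips = pd.concat(frames, ignore_index=True)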


Changes to pandas dataframe in for loop is only partially saved

I have two dfs and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df, but other commands like set_index or dropping columns do not stick.
import pandas as pd
import numpy as np
gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]
for x in all_df:
    x['test'] = x[1]+1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, both dfs reflect all the commands perfectly. But when I call either df afterwards, only the new column 'test' is there, and the set_index and dropped column are undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
Here's what's going on:
x is a variable that at the start of each iteration of your for loop initially refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
You could try something like this instead:
for x in all_df:
    x['test'] = x[1]+1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)
print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
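For completeness, a sketch of the same loop without inplace: build new DataFrames inside the loop, write them back into the list, and rebind df1 and df2 afterwards (same toy data as above):
all_df = [df1, df2]
for i, x in enumerate(all_df):
    x = x.assign(test=x[1] + 1)         # returns a new DataFrame with the extra column
    x = x.set_index(0).drop(2, axis=1)  # also new objects, not in-place edits
    all_df[i] = x                       # store the new object back in the list

df1, df2 = all_df
print(df1)
print(df2)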

Export a dataframe into separate Excel sheets in the same Excel file each time the same defined function is used?

I have this defined function which calculates all of the necessary statistics I need (e.g. a two-way ANOVA and multiple comparisons).
def stats(F1_para1,F2_para2,para3,para4,value):
    #MEAN, SEM, COUNT
    msc = df.groupby([F1_para1,F2_para2,para3,para4])[value].agg(['mean','sem','count'])
    msc.reset_index(inplace=True) #converts any columns in the index into columns
    pd.DataFrame(msc)
    #TWO-WAY ANOVA AND MULTICOMP
    df['comb'] = df[F1_para1].map(str) + "+" + df[F2_para2].map(str)
    mod = ols(value+'~'+F1_para1+'+'+F2_para2+'+'+F1_para1+'*'+F2_para2, data = df).fit()
    aov = anova_lm(mod, type=2) #mod needs to be the same text as mod (i.e. mod1,mod2)
    comparison = MultiComparison(df[value], df['comb'])
    tukey_df = pd.read_html(comparison.tukeyhsd().summary().as_html())[0]
    r = tukey_df[tukey_df['reject'] == True]
    df2 = aov.append(r) #combines dataframes of aov and r
So when I use the function as follows:
Water_intake = stats('Time','Drug','Diet','Pre_conditions',value='Water_intake')
food_intake = stats('Time','Drugs','Diet','Pre_conditions',value='Food_intake')
The output dataframes from the ANOVA and multicomparison analysis are combined into a new dataframe, df2. value is the column header of the dependent variable from the main dataframe (df in the code). So every time I use this function with a different dependent variable from the main dataframe (e.g. food intake, water intake, etc.), the statistics summary ends up in the df2 dataframe, which I want to save as a separate sheet in a "statistics" workbook.
I've looked at the solutions here: Save list of DataFrames to multisheet Excel spreadsheet
with ExcelWriter(r"path\statistics.xlsx") as writer:
    for n, df2 in enumerate(df2):
        df2.to_excel(writer, value)
    writer.save()
But I received this error:
AttributeError: 'str' object has no attribute 'to_excel'
Not sure if there is another way to achieve the same goal?
You are iterating with enumerate(df2) over df2 itself, which yields its column names; those are strings, not DataFrames, hence the error. You can check this by running:
for n, df2 in enumerate(df2):
    print(n)
    print(df2)
You're also not changing df2 or calling the function to get df2 in your for loop. I think the whole thing needs re-writing.
Firstly you need to add return df2 at the end of your function, so that you actually get your df2 when it's called.
# assumed imports for ols, anova_lm and MultiComparison (statsmodels)
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import MultiComparison

def stats(F1_para1,F2_para2,para3,para4,value):
    #MEAN, SEM, COUNT
    msc = df.groupby([F1_para1,F2_para2,para3,para4])[value].agg(['mean','sem','count'])
    msc.reset_index(inplace=True) #converts any columns in the index into columns
    pd.DataFrame(msc)
    #TWO-WAY ANOVA AND MULTICOMP
    df['comb'] = df[F1_para1].map(str) + "+" + df[F2_para2].map(str)
    mod = ols(value+'~'+F1_para1+'+'+F2_para2+'+'+F1_para1+'*'+F2_para2, data = df).fit()
    aov = anova_lm(mod, typ=2) #'typ' (not 'type') selects the ANOVA type in statsmodels
    comparison = MultiComparison(df[value], df['comb'])
    tukey_df = pd.read_html(comparison.tukeyhsd().summary().as_html())[0]
    r = tukey_df[tukey_df['reject'] == True]
    df2 = aov.append(r) #combines aov and r; in pandas 2.0+ use pd.concat([aov, r])
    return df2
Then your two function calls in the question will actually return something. To add these to an Excel document, you first call the function:
Water_intake = stats('Time','Drug','Diet','Pre_conditions',value='Water_intake')
food_intake = stats('Time','Drugs','Diet','Pre_conditions',value='Food_intake')
To export these two to Excel on different sheets, you can then do:
writer = pd.ExcelWriter(r"path\statistics.xlsx")
Water_intake.to_excel(writer, sheet_name='Water_intake')
food_intake.to_excel(writer, sheet_name='Food_intake')
writer.save()  # in recent pandas versions use writer.close() or a "with pd.ExcelWriter(...)" block instead
This should give you a spreadsheet with 2 sheets containing the different df2 on each. I don't know how many of these you need, or how you call the function differently for each, but it may be necessary to create a for loop.
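If there end up being many dependent variables, one possible sketch of such a loop (the column names come from the question, the path is still a placeholder, and the with block saves the file on exit):
dependent_vars = ['Water_intake', 'Food_intake']  # extend as needed

with pd.ExcelWriter(r"path\statistics.xlsx") as writer:
    for value in dependent_vars:
        result = stats('Time', 'Drug', 'Diet', 'Pre_conditions', value=value)
        result.to_excel(writer, sheet_name=value)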

Unable to use a Jupyter cell global variable inside a Python function

I am working in a Jupyter notebook using Python.
I have created two dataframes as shown below.
The two dataframes below are declared outside the functions, meaning they are just defined/declared/initialized in Jupyter notebook cells [and I wish to use them inside functions as shown below].
subcols = ["subjid","marks"] #written in jupyter cell 1
subjdf = pd.DataFrame(columns=subcols)

testcolumns = ["testid","testmarks"] #written in jupyter cell 2
testdf = pd.DataFrame(columns=testcolumns)

def fun1(): #written in jupyter cell 3
    ....
    ....
    return df1,df2

def fun2(df1,df2):
    ...
    ...
    return df1,df2,df3

def fun3(df1,df2,df3):
    ...
    subjdf['subid'] = df1['indid']
    ...
    return df1,df2,df3,subjdf

def fun4(df1,df2,df3,subjdf):
    ...
    testdf['testid'] = df2['examid']
    ...
    return df1,df2,df3,subjdf,testdf
The above way of writing throws an error in fun3 as below
UnboundLocalError: local variable 'subjdf' referenced before assignment
but I have already created subjdf outside the function blocks [Refer 1st Jupyter cell]
Two things to note here:
a] I don't get an error if I use global subjdf in fun3.
b] If I use global subjdf, I don't get any error for testdf in fun4. I was expecting testdf to throw a similar error, because I have used it the same way in fun4.
So my question is: why only for subjdf and not for testdf?
Additionally, I have followed a similar approach earlier [without using a global variable, just declaring the df outside the function blocks] and it was working fine. Not sure why it is throwing an error only now.
Can you help me avoid this error, please?
You have created subjdf, but your function fun3 needs it as an argument:
def fun3(subjdf, df1, df2, df3):
    ...
    subjdf['subid'] = df1['indid']
You're not using Python functions properly. You don't need global in your case: either pass the correct argument and return it, or think about creating an instance method using self. You have many solutions, but instance methods are a good fit when you have to handle a pandas.DataFrame within classes and functions.
It's not possible to run your snippet as you show it; too many lines of code are missing.
If you don't want to use a class and you want to keep this chained style of calls, then rebuild your code this way:
subcols = ["subjid","marks"]
subjdf = pd.DataFrame(columns=subcols)

testcolumns = ["testid","testmarks"]
testdf = pd.DataFrame(columns=testcolumns)

def fun1():
    # DO SOMETHING to generate df1 and df2
    return df1, df2

def fun2():
    df1, df2 = fun1()
    # DO SOMETHING to generate df3
    return df1, df2, df3

def fun3(subjdf):
    df1, df2, df3 = fun2()
    subjdf['subid'] = df1['indid']
    return df1, df2, df3, subjdf

def fun4(subjdf, testdf):
    df1, df2, df3, subjdf = fun3(subjdf)
    testdf['testid'] = df2['examid']
    return df1, df2, df3, subjdf, testdf

fun4(subjdf, testdf)
But I repeat: consider building this with a class and instance methods using self.
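For illustration, a minimal sketch of that class-based approach; the column names (subid, indid, testid, examid) and the elided steps come from the question, everything else is an assumption:
import pandas as pd

class Pipeline:
    def __init__(self):
        # the two dataframes live on the instance instead of in the notebook's global scope
        self.subjdf = pd.DataFrame(columns=["subjid", "marks"])
        self.testdf = pd.DataFrame(columns=["testid", "testmarks"])

    def fun3(self, df1, df2, df3):
        # item assignment on an attribute; no global statement or UnboundLocalError involved
        self.subjdf['subid'] = df1['indid']
        return df1, df2, df3

    def fun4(self, df1, df2, df3):
        self.testdf['testid'] = df2['examid']
        return df1, df2, df3

# usage sketch:
# p = Pipeline()
# df1, df2, df3 = p.fun3(df1, df2, df3)
# df1, df2, df3 = p.fun4(df1, df2, df3)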

Proper data manipulation script layout, where all merges, drops, aggregations, renames are easily traceable and visible

I currently have a long script that has one goal: take multiple csv tables of data, merge them into one, while performing various calculations along the way, and then output a final csv table.
I originally had this layout (see LAYOUT A), but found that it made it hard to see which columns were being added or merged, because the cleaning and operations methods are listed below everything, so you have to jump up and down in the file to see how the table gets altered. This was an attempt to follow the keep-things-modular-and-small methodology that I've read about:
# LAYOUT A
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x,y,z):
    #<calculations for performing on table column>

def some_other_operation(a,b):
    #<some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400
    def do_operations_unique_to_table1(df):
        #<operations>
        return df
    df = do_operations_unique_to_table1(df)
    return df

def clean_table_2(fn_2):
    #<similar to clean_table_1>

def clean_table_3(fn_3):
    #<similar to clean_table_1>

if __name__=='__main__':
    main()
My next inclination was to move all the functions inline with the main script, so it's obvious what's being done (see LAYOUT B). This makes it a bit easier to see the linearity of the operations, but it also makes things messier, so you can no longer quickly read through the main function to get an overview of all the operations being done.
# LAYOUT B
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400
        def do_operations_unique_to_table1(df):
            #<operations>
            return df
        df = do_operations_unique_to_table1(df)
        return df
    df1 = clean_table_1('table1.csv')

    def clean_table_2(fn_2):
        #<similar to clean_table_1>
    df2 = clean_table_2('table2.csv')

    def clean_table_3(fn_3):
        #<similar to clean_table_1>
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>
    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()
So then I think: why even have these functions? Would it maybe be easier to follow if everything were at the same level, just as a script, like so (LAYOUT C):
# LAYOUT C
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    df1 = #<operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    df2 = #<operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    df3 = #<operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>
    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()
The crux of the problem is finding a balance between documenting clearly which columns are being updated, changed, dropped, renamed, merged, etc. while still keeping it modular enough to fit the paradigm of "clean code".
Also, in practice, this script and others are much longer, with far more tables being merged into the mix, so it quickly becomes a long list of operations. Should I be breaking the operations up into smaller files and outputting intermediate files, or is that just asking to introduce errors? It's also a matter of being able to see all the assumptions made along the way and how they affect the data in its final state, without having to jump between files or scroll way up and down to follow the data from A to B, if that makes sense.
If anyone has insights on how to best write these types of data cleaning/manipulation scripts, I would love to hear them.
It is a highly subjective topic, but here are my typical approaches/remarks/hints:
as long as possible, optimize for debug/dev time and ease
split the flow into several scripts (e.g. download, preprocess, ... for every table separately, so that for the merge every table is already prepared separately)
try to keep the same order of operations within a script (e.g. type correction, fill na, scaling, new columns, drop columns)
every wrangle script starts with a load and ends with a save
save to pickle (to avoid problems like dates being saved as strings) and to a small csv (to have an easy preview of the results outside of Python)
with such "integration points" between scripts you can easily combine different technologies (caveat: in that case you typically don't use pickle as the output, but csv or another data format)
every script has a clearly defined input/output and can be tested and developed separately; I also use asserts on dataframe shapes
scripts for visualization/EDA use data from the wrangle scripts but are never part of them; typically they are also the bottleneck
combine the scripts with e.g. bash if you want simplicity
keep the length of each script below one page*
*if I have long, convoluted code, then before I encapsulate it in a function I check whether it can be done more simply; in 80% of cases it can, but you need to know more about pandas; you learn something new, the pandas docs are typically better, and your code becomes more declarative and idiomatic
*if there is no easy way to simplify it and you use the function in many places, put it into utils.py; in the docstring put a sample >>>f(input) output and some rationale for the function
*if a function is used across many projects, it is worth making a pandas extension like https://github.com/twopirllc/pandas-ta
*if I have a lot of columns, I think a lot about hierarchy and groupings and keep them in a separate file for every table; for now it is just a py file, but I have started to consider yaml and a way to document table structure
stick to one convention, e.g. I don't use inplace=True at all
chain operations on the dataframe* (see the sketch after this list)
*if you have a good, meaningful name for a subchain result that could be used elsewhere, that could be a good place to split the script
remove the main function; if you keep the script according to the rules above, there is nothing wrong with a global df variable
when I read from csv, I always check what can be done directly with read_csv parameters, e.g. parsing dates
clearly mark 'TEMPORARY HACKS'; in the long term they lead to unexpected side-effects
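For illustration, a small sketch of the "chain operations" and "asserts on dataframe shapes" hints; the file name and column names are placeholders, not taken from the question:
import pandas as pd

df1 = (
    pd.read_csv('table1.csv', parse_dates=['date_col'])    # push work into read_csv parameters
      .rename(columns={'old_name': 'new_name'})
      .assign(some_col1=400)                                # new columns via assign instead of item assignment
      .drop(columns=['unused_col'])
      .set_index('id_col')
)

# cheap sanity check on the shape after wrangling
assert df1.shape[0] > 0, "expected at least one row after cleaning table1"

A named intermediate like df1 here is also a natural point at which to split the flow into a separate wrangle script, as mentioned above.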

Why is it not possible to access other variables from inside the apply function in Python?

Why would the following code not affect the Output DataFrame? (This example is not interesting in itself - it is a convoluted way of 'copying' a DataFrame.)
def getRow(row):
    Output.append(row)

Output = pd.DataFrame()
Input = pd.read_csv('Input.csv')
Input.apply(getRow)
Is there a way of obtaining such functionality using the apply function so that it affects other variables?
What happens
DataFrame.append() returns a new dataframe. It does not modify Output but rather creates a new one every time.
DataFrame.append(self, other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Here:
Output.append(row)
you create a new dataframe but throw it away immediately.
You have access, but you shouldn't use it this way
While this works, I strongly recommend against using global:
from pandas import DataFrame

df = DataFrame([1, 2, 3])
df2 = DataFrame()

def get_row(row):
    global df2
    df2 = df2.append(row)  # note: DataFrame.append was removed in pandas 2.0; use pandas.concat there

df.apply(get_row)
print(df2)
Output:
0 1 2
0 1 2 3
Take it as a demonstration of what happens. Don't use it in your code.
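If the goal really is to carry data out of apply without global, one alternative sketch: let the function close over a list (mutating a list does not rebind any name) and assemble the result once at the end. For a plain copy, Input.copy() is of course the direct way; this only demonstrates the pattern:
import pandas as pd

Input = pd.DataFrame([1, 2, 3])
rows = []

def get_row(row):
    rows.append(row)                  # appending to a list needs no global statement

Input.apply(get_row, axis=1)          # visit the frame row by row
Output = pd.concat(rows, axis=1).T    # build the copied frame once, outside apply
print(Output)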
