Python Munging - Best practice with DataFrame variables

In R, when data munging, I generally do most if not all of my basic munging in one go through piping, e.g.
df_mung = df %>%
  filter(X > 1) %>%
  select(X, Y, Z) %>%
  group_by(X, Y) %>%
  summarise(`sum` = sum(Z))
Which means, in this example, at the end I have two DataFrames:
df (my original DataFrame)
df_mung (my munged DataFrame)
If I was to do this in Python, I would do something like this:
df_filter = df[df['X']>1]
df_select = df_filter[['X', 'Y', 'Z']]
df_sum = df_select.groupby(['X','Y']).sum()
Which leaves me with four DataFrames (double the amount I had in R):
df (my original DataFrame)
df_filter (my filtered DataFrame)
df_select (my selected columns DataFrame)
df_sum (my summed DataFrame)
Now I could copy my DataFrame back on to itself, like this:
df = df[df['X']>1]
df = df[['X', 'Y', 'Z']]
df = df.groupby(['X','Y']).sum()
But given the highly upvoted answer in this post, How to deal with SettingWithCopyWarning in Pandas, this is apparently something I should not be doing.
So my question is, what is the best practice when data munging in Python? Creating a new variable each time I do something, or copying the DataFrame onto itself, or something else?
I am worried that when I do a piece of analysis in Python, I could end up with tens if not hundreds of DataFrame variables, which a) looks messy and b) is confusing to people who take over my code.
Many thanks

I'd just wrap the munging in a function.
the intermediate variables are not in any global scope (not messy)
the munging function does a single, comprehensible thing (not confusing for people reading your code)
the munging function is testable in isolation (good practice).
def munge_df(df):
    df_filter = df[df['X'] > 1]
    df_select = df_filter[['X', 'Y', 'Z']]
    df_sum = df_select.groupby(['X', 'Y']).sum()
    return df_sum
# ...
df_munged = munge_df(df) # or just `df = ...` if you don't need the original

You could avoid the SettingWithCopyWarning by using loc and doing the filtering of rows and columns in one expression. You could also use method chaining, which seems to be what you are doing in the R example.
df.loc[df['X'].gt(1), ['X', 'Y', 'Z']].groupby(['X', 'Y']).sum()
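If you also want to keep a named sum column like the R summarise call, the same chain can be extended with named aggregation. This is just a sketch, assuming the df and column names from your example:
df_mung = (
    df
    .loc[df['X'] > 1, ['X', 'Y', 'Z']]    # filter rows and select columns in one step
    .groupby(['X', 'Y'], as_index=False)
    .agg(sum=('Z', 'sum'))                # named aggregation, like summarise(`sum` = sum(Z))
)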

Related

Updating a list of df variables after modifying a df

I have a list of predictor (X) variables and an outcome (y) variable from my df. There are 100s of variables in my df, so I only care about a few of them below.
X = df[['a', 'b', 'c']]
y = df['d']
I then want to delete all of the rows with missing data for any of my "X" variables, so I ran this:
for i in X:
    df = df[df[i].notna()]
This then leaves me with a modified df with no missing values in the columns of interest. However, my X and y are still populated from the old df, so I cannot use them as inputs to my model. I know I could just copy and paste the code I used to create them in the first place to "refresh" them, but that seems inefficient, and I cannot think of a better way. Thoughts appreciated!
You can use dropna:
X = X.dropna()
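If you also need y to stay aligned with X, one option (a sketch, using the column names from your question) is to drop the rows on df itself with the subset argument and then re-slice:
# drop rows with missing values in the predictor columns only
df = df.dropna(subset=['a', 'b', 'c'])

# re-slice so X and y both reflect the cleaned df and stay row-aligned
X = df[['a', 'b', 'c']]
y = df['d']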

How to avoid excessive lambda functions in pandas DataFrame assign and apply method chains

I am trying to translate a pipeline of manipulations on a dataframe in R over to its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:
library(tidyverse)
calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length
raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))
new_table <- raw_data %>%
  mutate(area = calc_circle_area(diam)) %>%
  mutate(vol = calc_cylinder_vol(area, length)) %>%
  mutate(is_small_vol = vol < 100) %>%
  filter(is_small_vol)
I can replicate this in pandas without too much trouble but find that it involves some nested lambda calls when using assign to do an apply (first where the dataframe caller is an argument, and subsequently with dataframe rows as the argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.
import pandas as pd
import math
calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length
raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})
new_table = (
    raw_data
    .assign(area=lambda df: df.apply(lambda r: calc_circle_area(r.diam), axis=1))
    .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
I am aware that the .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)) but only because the diam column already exists in the original dataframe, which may not always be the case.
I also realize that the calc_... functions here are vectorizable, meaning I could also do things like
.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))
but again, since most functions aren't vectorizable, this wouldn't work in most cases.
TL;DR I am wondering if there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, like in something like:
.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
Are there best practices for this type of application or is this the best one can do within the context of method chaining?
The best practice is to vectorize operations.
The reason for this is performance, because apply is very slow. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. You will find that, because of this performance consideration, most of the functions you need actually are vectorizable.
That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or some other intermediate variable(s), but this doesn't fit the method chaining style you are asking for.
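For example, since both calc_ functions in your post are vectorizable, the whole chain can be written without any inner lambdas, building on the snippet you already have:
new_table = (
    raw_data
    .assign(area=lambda df: calc_circle_area(df.diam))
    .assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)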
There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas does, so keep that in mind if you want to go this route.
Or, if you want to just save a bit of typing, and all the functions will be only over columns, you can create a glue function that unpacks the columns for you and passes them along.
def df_apply(col_fn, *col_names):
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn
Then usage ends up looking something like this:
new_table = (
    raw_data
    .assign(area=df_apply(calc_circle_area, 'diam'))
    .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
It is also possible to write this without taking advantage of vectorization, in case that does come up.
def df_apply_unvec(fn, *col_names):
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn
I used named functions for extra clarity. But it can be condensed with lambdas into something that looks much like your original format, just generic.
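For instance, a condensed equivalent of df_apply_unvec (just a sketch) could be:
# same behaviour as df_apply_unvec, condensed into lambdas
df_apply_unvec = lambda fn, *cols: lambda df: df.apply(
    lambda r: fn(*[r[c] for c in cols]), axis=1)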
As #mcskinner has pointed out, vectorized operations are way better and faster. If, however, your operation cannot be vectorized and you still want to apply a function, you could use the pipe method, which allows for cleaner method chaining:
import math
def area(df):
    df['area'] = math.pi / 4 * df['diam']**2
    return df

def vol(df):
    df['vol'] = df['area'] * df['length']
    return df
new_table = (raw_data
             .pipe(area)
             .pipe(vol)
             .assign(is_small_vol=lambda df: df.vol < 100)
             .loc[lambda df: df.is_small_vol]
             )
new_table
  cylinder_name  length  diam      area        vol  is_small_vol
0             a       3     1  0.785398   2.356194          True
1             b       5     2  3.141593  15.707963          True

Proper data manipulation script layout, where all merges, drops, aggregations, renames are easily traceable and visible

I currently have a long script that has one goal: take multiple csv tables of data, merge them into one, while performing various calculations along the way, and then output a final csv table.
I originally had this layout (see LAYOUT A), but found that this made it hard to see what columns were being added or merged, because the cleaning and operations methods are listed below everything, so you have to go up and down in the file to see how the table gets altered. This was an attempt to follow the whole keep-things-modular-and-small methodology that I've read about:
# LAYOUT A
import pandas as pd
#...

SOME_MAPPER = {'a': 1, 'b': 2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')
    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x, y, z):
    #<calculations for performing on table column>

def some_other_operation(a, b):
    #<some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400
    def do_operations_unique_to_table1(df):
        #<operations>
        return df
    df = do_operations_unique_to_table1(df)
    return df

def clean_table_2(fn_2):
    #<similar to clean_table_1>

def clean_table_3(fn_3):
    #<similar to clean_table_1>

if __name__ == '__main__':
    main()
My next inclination was to move all the functions in-line with the main script, so it's obvious what's being done (see LAYOUT B). This makes it a bit easier to see the linearity of the operations, but also makes it a bit messier, so you can't quickly read through the main function to get an "overview" of all the operations being done.
# LAYOUT B
import pandas as pd
#...

SOME_MAPPER = {'a': 1, 'b': 2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400
        def do_operations_unique_to_table1(df):
            #<operations>
            return df
        df = do_operations_unique_to_table1(df)
        return df
    df1 = clean_table_1('table1.csv')

    def clean_table_2(fn_2):
        #<similar to clean_table_1>
    df2 = clean_table_2('table2.csv')

    def clean_table_3(fn_3):
        #<similar to clean_table_1>
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        #<some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
So then I think, well why even have these functions; would it maybe just be easier to follow if it's all at the same level, just as a script like so (LAYOUT C):
# LAYOUT C
import pandas as pd
#...

SOME_MAPPER = {'a': 1, 'b': 2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    df1 = #<operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    df2 = #<operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    df3 = #<operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        #<calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        #<some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
The crux of the problem is finding a balance between documenting clearly which columns are being updated, changed, dropped, renamed, merged, etc. while still keeping it modular enough to fit the paradigm of "clean code".
Also, in practice, this script and others are much longer with far more tables being merged into the mix, so this quickly becomes a long list of operations. Should I be breaking up the operations into smaller files and outputting intermediate files or is that just asking to introduce errors? It's a matter of also being able to see all the assumptions made along the way and how they affect the data in its final state, without having to jump between files or scroll way up, way down, etc. to follow the data from A to B, if that makes sense.
If anyone has insights on how to best write these types of data cleaning/manipulation scripts, I would love to hear them.
It is a highly subjective topic, but here are my typical approaches/remarks/hints:
for as long as possible, optimize for debug/dev time and ease
split the flow into several scripts (e.g. download, preprocess, ... for every table separately, so that by the time you merge, every table has already been prepared separately)
try to keep the same order of operations within each script (e.g. type correction, fill na, scaling, new columns, drop columns)
every wrangle script starts with a load and ends with a save (see the sketch after this list)
save to pickle (to avoid problems like dates being saved as strings) and a small csv (for an easy preview of the results outside of Python)
with "integration points" being data, you can easily combine different technologies (caveat: in that case you typically don't use pickle as the output, but csv or another data format)
every script has a clearly defined input/output and can be tested and developed separately; I also use asserts on dataframe shapes
scripts for visualization/EDA use data from the wrangle scripts but are never part of them; they are also typically the bottleneck
combine the scripts with e.g. bash if you want simplicity
keep the length of a script below one page*
*if I have long, convoluted code, I check whether it can be done more simply before encapsulating it in a function; in 80% of cases it can, though you need to know more about pandas. You learn something new, the pandas docs are usually better than your own, and your code ends up more declarative and idiomatic
*if there is no easy way to simplify it and you use the function in many places, put it into utils.py; in the docstring include a sample >>>f(input) output and some rationale for the function
*if a function is used across many projects, it is worth making a pandas extension like https://github.com/twopirllc/pandas-ta
*if I have a lot of columns, I think a lot about hierarchy and groupings and keep them in a separate file for every table; for now it is just a .py file, but I have started to consider yaml and a way to document table structure
stick to one convention, e.g. I don't use inplace=True at all
chain operations on the dataframe*
*if you have a good, meaningful name for a sub-chain result that could be used elsewhere, that can be a good place to split the script
remove the main function; if you keep the script according to the rules above, there is nothing wrong with a global df variable
when reading from csv, I always check what can be done directly with read_csv parameters, e.g. parsing dates
clearly mark 'TEMPORARY HACKS'; in the long term they lead to unexpected side-effects
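To make the load-start/save-end pattern concrete, here is a minimal sketch of one per-table wrangle script; the file names, columns, and checks are hypothetical:
# wrangle_table1.py -- hypothetical per-table wrangle script
import pandas as pd

# load-start: check what read_csv can do directly (e.g. parsing dates)
df = pd.read_csv('table1_raw.csv', parse_dates=['created_at'])  # hypothetical file and column

# keep a consistent order of operations: type correction, fill na, new columns, drop columns
df['some_col1'] = df['some_col1'].fillna(0).astype(int)
df['new_col'] = df['some_col1'] * 400

# assert on the dataframe shape so breakage is caught at the script boundary
assert df.shape[0] > 0, 'table1 is empty after cleaning'

# save-end: pickle preserves dtypes; a small csv gives an easy preview outside Python
df.to_pickle('table1_clean.pkl')
df.head(100).to_csv('table1_clean_preview.csv', index=False)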

How to change the sequence of some variables in a Pandas dataframe?

I have a dataframe that has 100 variables, var1--var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[[var40, var20, var30, var1....]]
2: columns= [var40, var20, var30, var1...]
all require specifying all the variables in the dataframe. With 100 variables in my dataframe, how can I do that efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this. Is there an equivalent way in Python too?
Thanks
Consider reindex with a conditional list comprehension:
first_cols = ['var40', 'var20', 'var30']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
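As a quick illustration with a made-up frame, just to show the resulting column order:
import pandas as pd

# hypothetical frame with a handful of the 100 columns
df = pd.DataFrame(columns=['var1', 'var2', 'var20', 'var30', 'var40', 'var50'])

first_cols = ['var40', 'var20', 'var30']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')

print(df.columns.tolist())
# ['var40', 'var20', 'var30', 'var1', 'var2', 'var50']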

Pandas DataFrame naming only 1 column

Is there a way with a pandas DataFrame to name only the first column, or the first and second columns, even if there are 4 columns?
Here
for x in range(1, len(table2_query) + 1):
    if x == 1:
        cursor.execute(table2_query[x])
        df = pd.DataFrame(data=cursor.fetchall(), columns=['Q', col_name[x-1]])
and it gives me this:
AssertionError: 2 columns passed, passed data had 4 columns
Consider the df:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df
then use rename and pass a dictionary with the name changes to the argument columns:
df.rename(columns=dict(A='a', B='b'))
Instantiating a DataFrame while only naming a subset of the columns
When constructing a dataframe with pd.DataFrame, you either don't pass an index/columns argument and let pandas auto-generate the index/columns object, or you pass one in yourself. If you pass it in yourself, it must match the dimensions of your data. Mimicking pandas' auto-generation while overriding just the names you want is not worth the trouble: it is ugly and probably non-performant. In other words, I can't even think of a good reason to do it.
On the other hand, it is super easy to rename the columns/index values. In fact, we can rename just a few. I think below is more in line with the spirit of your question:
df = pd.DataFrame(np.arange(8).reshape(2, 4)).rename(columns=str).rename(columns={'1': 'A', '3': 'F'})
df
