I have two dfs, and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df, but other commands like set_index or dropping columns do not stick.
import pandas as pd
import numpy as np
gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]
for x in all_df:
    x['test'] = x[1]+1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, both dfs show all the commands applied perfectly. But when I call either df afterwards, only the new column 'test' is there; the set_index and the dropped column have been undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
Here's what's going on:
At the start of each iteration of your for loop, x refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element in place, so it does what you want.
However, when you assign something new to x itself, you simply rebind x to that new object without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
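Here is a tiny illustration of the same distinction using plain lists (just a sketch, not your data):

items = [[1], [2]]
for x in items:
    x.append(9)      # mutates the object that the list element refers to -> visible afterwards
    x = [0]          # only rebinds the name x; items is untouched
print(items)         # [[1, 9], [2, 9]]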
You could try something like this instead:
for x in all_df:
    x['test'] = x[1]+1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)
print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
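For example, here is a minimal sketch of that non-inplace approach, assuming the goal is the same three transformations as above (build new DataFrames and rebind the names):

all_df = [
    x.assign(test=x[1] + 1).set_index(0).drop(2, axis=1)
    for x in all_df
]
df1, df2 = all_df
print(df1)
print(df2)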
I currently have a long script that has one goal: take multiple csv tables of data, merge them into one, while performing various calculations along the way, and then output a final csv table.
I originally had this layout (see LAYOUT A), but found that this made it hard to see which columns were being added or merged, because the cleaning and operations functions are defined below everything else, so you have to jump up and down in the file to see how the table gets altered. This was an attempt to follow the whole keep-things-modular-and-small methodology that I've read about:
# LAYOUT A
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x,y,z):
    #<calculations for performing on table column>

def some_other_operation(a,b):
    #<some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400

    def do_operations_unique_to_table1(df):
        #<operations>
        return df

    df = do_operations_unique_to_table1(df)
    return df

def clean_table_2(fn_2):
    #<similar to clean_table_1>

def clean_table_3(fn_3):
    #<similar to clean_table_1>

if __name__=='__main__':
    main()
My next inclination was to move all the functions in-line with the main script, so it's obvious what's being done (see LAYOUT B). This makes it a bit easier to see the linearity of the operations being done, but it also gets a bit messier, so you can't just quickly read through the main function to get the "overview" of all the operations being done.
# LAYOUT B
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400

        def do_operations_unique_to_table1(df):
            #<operations>
            return df

        df = do_operations_unique_to_table1(df)
        return df

    df1 = clean_table_1('table1.csv')

    def clean_table_2(fn_2):
        #<similar to clean_table_1>

    df2 = clean_table_2('table2.csv')

    def clean_table_3(fn_3):
        #<similar to clean_table_1>

    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>

    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>

    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()
So then I think, well why even have these functions; would it maybe just be easier to follow if it's all at the same level, just as a script like so (LAYOUT C):
# LAYOUT C
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    df1 = #<operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    df2 = #<operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    df3 = #<operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>

    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>

    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()
The crux of the problem is finding a balance between documenting clearly which columns are being updated, changed, dropped, renamed, merged, etc. while still keeping it modular enough to fit the paradigm of "clean code".
Also, in practice, this script and others are much longer, with far more tables being merged in, so this quickly becomes a long list of operations. Should I be breaking up the operations into smaller files and outputting intermediate files, or is that just asking to introduce errors? It's also a matter of being able to see all the assumptions made along the way and how they affect the data in its final state, without having to jump between files or scroll way up and down to follow the data from A to B, if that makes sense.
If anyone has insights on how to best write these types of data cleaning/manipulation scripts, I would love to hear them.
This is a highly subjective topic, but here is my typical approach, with remarks and hints:
as long as possible, optimize for debug/dev time and ease
split the flow into several scripts (e.g. download, preprocess, ... for every table separately, so that by the time of merging, every table has already been prepared separately)
try to keep the same order of operations within each script (e.g. type correction, fill na, scaling, new columns, drop columns)
every wrangle script starts with a load and ends with a save
save to pickle (to avoid problems like dates being saved as strings) and to a small csv (to have an easy preview of the results outside of Python)
with such "integration points" being data, you can easily combine different technologies (caveat: in that case you typically don't use pickle as the output, but csv or another data format)
every script has clearly defined input/output and can be tested and developed separately; I also use asserts on dataframe shapes
scripts for visualization/EDA use data from the wrangle scripts, but they are never part of the wrangle scripts; typically they are also the bottleneck
combine the scripts with e.g. bash if you want simplicity
keep the length of a script below one page*
*if I have long, convoluted code, then before I encapsulate it in a function I check whether it can be done more simply; in about 80% of cases it can, but you need to know more pandas. In return you learn something new, the pandas docs are typically better than your own, and your code ends up more declarative and idiomatic
*if there is no easy way to simplify it and you use the function in many places, put it into utils.py; in the docstring put a sample >>>f(input) output and some rationale for the function
*if a function is used across many projects, it is worth making a pandas extension like https://github.com/twopirllc/pandas-ta
*if I have a lot of columns, I think a lot about hierarchy and groupings and keep that in a separate file for every table; for now it is just a py file, but I have started to consider yaml and a way to document table structure
stick to one convention, e.g. I don't use inplace=True at all
chain operations on the dataframe* (see the sketch after this list)
*if you have a good, meaningful name for a sub-chain result that could be used elsewhere, that can be a good place to split the script
remove the main function; if you keep the script according to the rules above, there is nothing wrong with a global df variable
when I read from csv, I always check what can be done directly with read_csv parameters, e.g. parsing dates
clearly mark 'TEMPORARY HACKS'; in the long term they lead to unexpected side-effects
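As a rough sketch of what the chaining style (plus shape asserts and the load-start/save-end rule) can look like; the file name, columns and mapper here are made up purely for illustration:

import pandas as pd

SOME_MAPPER = {'a': 1, 'b': 2}

def add_derived_columns(df):
    # every derived column gets a name in one readable block
    return df.assign(
        new_col1=lambda d: d['x'] + d['y'],
        new_col2=lambda d: d['category'].map(SOME_MAPPER),
    )

df = (
    pd.read_csv('table1.csv', parse_dates=['date'])   # push work into read_csv where possible
      .pipe(add_derived_columns)
      .drop(columns=['x'])
)

assert df.shape[0] > 0                                # shape asserts catch silently empty merges/filters
df.to_pickle('table1_clean.pkl')                      # pickle for the pipeline
df.head(100).to_csv('table1_clean_preview.csv', index=False)   # small csv for a quick preview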
I thought I followed the best practice:
import pandas as pd
df=pd.DataFrame({'cond':['a', 'b', 'a'], 'value':[1.5, 2.5, 3.5]})
mask = df.cond == 'a'
df.loc[mask, 'value'] = df.loc[mask, 'value'] * (-1)
In essence: if the condition is met, replace a value with its negative. What is the better way to perform this operation so that I do not trigger the warning?
The problem turns out not to be with the code as shown. That code example worked fine.
In the real world, df is created as follows: df = df2[['cond', 'value']]. Adding .copy() to this line fixed the warning.
This solution was pointed out by @jezrael in the comments.
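In other words, the fix looks roughly like this (df2 stands in for however the original frame is actually built):

df = df2[['cond', 'value']].copy()   # explicit copy, so df is not a view of df2
mask = df['cond'] == 'a'
df.loc[mask, 'value'] = df.loc[mask, 'value'] * (-1)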
Why would the following code not affect the Output DataFrame? (This example is not interesting in itself - it is a convoluted way of 'copying' a DataFrame.)
import pandas as pd

def getRow(row):
    Output.append(row)

Output = pd.DataFrame()
Input = pd.read_csv('Input.csv')
Input.apply(getRow)
Is there a way of obtaining such functionality using the apply function, so that it affects other variables?
What happens
DataFrame.append() returns a new dataframe. It does not modify Output but rather creates a new one every time.
DataFrame.append(self, other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Here:
Output.append(row)
you create a new dataframe but throw it away immediately.
You have access - but you shouldn't use it this way
While this works, I strongly recommend against using global:
import pandas as pd

df = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame()

def get_row(row):
    global df2
    df2 = df2.append(row)

df.apply(get_row)
print(df2)
Output:
   0  1  2
0  1  2  3
Take it as a demonstration of what happens. Don't use it in your code.
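As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. If you really want the side-effect-through-apply pattern, a more future-proof sketch (names here are illustrative, not the only way to do it) is to fill a plain list and build the frame once at the end:

import pandas as pd

df = pd.DataFrame([1, 2, 3])

rows = []
df.apply(rows.append, axis=1)   # collect each row Series in a local list
df2 = pd.DataFrame(rows)        # build the copy once, at the end

print(df2)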
There are a few issues I am having with Dask DataFrames.
Let's say I have a dataframe with 2 columns, ['a', 'b'].
If I want a new column c = a + b,
in pandas I would do:
df['c'] = df['a'] + df['b']
In dask I am doing the same operation as follows:
df = df.assign(c=(df.a + df.b).compute())
Is it possible to write this operation in a better way, similar to what we do in pandas?
The second question is something that is troubling me more.
In pandas, if I want to change the value of 'a' for rows 2 & 6 to np.pi, I do the following:
df.loc[[2,6],'a'] = np.pi
I have not been able to figure out how to do a similar operation in Dask. My logic selects some rows and I only want to change values in those rows.
Edit: Add new columns
Setitem syntax now works in dask.dataframe
df['z'] = df.x + df.y
Old answer: Add new columns
You're correct that the setitem syntax doesn't work in dask.dataframe.
df['c'] = ... # mutation not supported
As you suggest you should instead use .assign(...).
df = df.assign(c=df.a + df.b)
In your example you have an unnecessary call to .compute(). Generally you want to call compute only at the very end, once you have your final result.
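For example, a minimal sketch (the frame here is made up just so there is something to run against):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': range(6), 'b': range(6)})
df = dd.from_pandas(pdf, npartitions=2)

df = df.assign(c=df.a + df.b)   # lazy; nothing is computed yet
result = df.compute()           # a single compute at the very end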
Change rows
As before, dask.dataframe does not support changing rows in place. Inplace operations are difficult to reason about in parallel codes. At the moment dask.dataframe has no nice alternative operation in this case. I've raised issue #653 for conversation on this topic.
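That said, depending on the logic, you can sometimes avoid in-place row assignment altogether by rewriting the update as a masked column expression. This is only a sketch with an illustrative condition, not an equivalent of .loc row assignment:

import numpy as np

# replace 'a' with np.pi wherever the (illustrative) condition holds
df['a'] = df['a'].mask(df.b > 3, np.pi)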