Pandas assign() has no effect when used in user-defined function - python

When I use the DataFrame.assign() method in my own function foobar, it has no effect on the global DataFrame.
#!/usr/bin/env python3
import pandas as pd

def foobar(df):
    # has no effect on the "global" df
    df.assign(Z = lambda x: x.A + x.B)
    return df

data = {'A': range(3),
        'B': range(3)}
df = pd.DataFrame(data)
df = foobar(df)
# There is no 'Z' column in this df
print(df)
The resulting output:
   A  B
0  0  0
1  1  1
2  2  2
I assume this has something to do with the difference between views and copies in Pandas, but I am not sure how to handle this in the right and elegant Pandas way.

DataFrame.assign() returns a new DataFrame and leaves the original unchanged, so you need to assign the result back to df. Try this:
def foobar(df):
    df = df.assign(Z = lambda x: x.A + x.B)
    return df
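To make the behaviour concrete, here is a minimal sketch showing that assign() leaves the frame it is called on untouched and hands back a new one. The names add_z, original and result are made up for this illustration and do not come from the question.

```python
import pandas as pd

def add_z(frame):
    # assign() builds and returns a new DataFrame, so return that object
    return frame.assign(Z=lambda x: x.A + x.B)

original = pd.DataFrame({"A": range(3), "B": range(3)})
result = add_z(original)

print("Z" in original.columns)  # False: the input frame was not modified
print(result["Z"].tolist())     # [0, 2, 4]
```

Returning the new frame directly, instead of rebinding the parameter first, is only a stylistic choice; both variants behave the same for the caller.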

Related

Pandas replace dataframe value with a variable at a variable row

I want to replace a value in a CSV file with the contents of a variable. The row to change is also given by a variable. The following code is an example:
import pandas as pd
# sample dataframe
df = pd.DataFrame({'A': ['a','b','c'], 'B':['b','c','d']})
print("Original DataFrame:\n", df)
x = 1
y = 12698
df_rep = df.replace([int(x),1], y)
print("\nAfter replacing:\n", df_rep)
This can be done using pandas positional indexing, e.g. df.iloc[row_num, col_num].
# update the cell at row x, column 1 ('B')
df.iloc[x, 1] = y
# print the updated df
print(df)
   A      B
0  a      b
1  b  12698
2  c      d
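As a self-contained sketch of the same idea, the snippet below looks up the column position by name instead of hard-coding 1. The variable names row and new_value are made up here, and dtype=object is an assumption added so that the column can hold both strings and an integer regardless of the pandas version.

```python
import pandas as pd

# object dtype lets a column mix strings and integers
df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["b", "c", "d"]}, dtype=object)

row = 1            # position of the row to change
new_value = 12698  # value to write into that row

# translate the column name into its integer position for iloc
col = df.columns.get_loc("B")
df.iloc[row, col] = new_value

print(df)
```

Note that this modifies df in place; to persist the change to the CSV file, the frame would still need to be written back with to_csv.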

Applying Function to Rows of a Dataframe in Python

I have a dataframe, and one of its columns contains a nested dictionary. I want to create a function that takes a row and a column name and uses json_normalize to turn that column into a dataframe. However, I keep getting the error 'function takes 2 positional arguments, 6 were given'. There are more than 6 columns in the dataframe and more than 6 columns in row[col] (see below), so I am confused as to how 6 arguments are being provided.
import pandas as pd
from pandas.io.json import json_normalize

def fix_row_(row, col):
    if type(row[col]) == list:
        df = json_normalize(row[col])
        df['id'] = row['id']
    else:
        df = pd.DataFrame()
    return df

new_df = data.apply(lambda x: fix_row_(x, 'Items'), axis=1)
So new_df will be a dataframe of dataframes. In the example below, it would just be a dataframe with A,B,C as columns and 1,2,3 as the values.
Quasi-reproducible example:
my_dict = {'A': 1, 'B': 2, 'C': 3}
ids = pd.Series(['id1','id2','id3'],name='ids')
data= pd.DataFrame(ids)
data['my_column']=''
m = data['ids'].eq('id1')
data.loc[m, 'my_column'] = [my_dict] * m.sum()
Apply the function row by row with axis=1, passing in the column's value:
df.apply(lambda x: fix_row_(x['my_column']), axis=1)
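For reference, here is one way the whole task could be sketched end to end on the quasi-reproducible data. It is not the approach from the answer above: it loops over the rows explicitly and combines the pieces with pd.concat, which sidesteps the question of how apply should combine DataFrame-valued results. The helper name normalize_cell is hypothetical, and pd.json_normalize assumes pandas 1.0 or later.

```python
import pandas as pd

def normalize_cell(row, col):
    """Build a DataFrame from row[col] when it holds a dict or a list of dicts."""
    value = row[col]
    if isinstance(value, (dict, list)):
        out = pd.json_normalize(value)
        out["ids"] = row["ids"]  # carry the row identifier along
        return out
    return pd.DataFrame()        # nothing to normalize in this row

my_dict = {"A": 1, "B": 2, "C": 3}
data = pd.DataFrame({"ids": ["id1", "id2", "id3"],
                     "my_column": [my_dict, "", ""]})

# one small frame per row, then drop the empty ones and stack the rest
frames = [normalize_cell(row, "my_column") for _, row in data.iterrows()]
frames = [f for f in frames if not f.empty]
result = pd.concat(frames, ignore_index=True)

print(result)
```

Filtering out the empty frames before concatenating keeps the resulting column dtypes driven only by rows that actually contained data.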

Appending data to Pandas global dataframe variable does not persist

I am trying to use a pandas dataframe global variable. However, the dataframe is empty when I try to reassign or append it to the global variable. Any help appreciated.
import pandas as pd

df = pd.DataFrame()

def my_func():
    global df
    d = pd.DataFrame()
    for i in range(10):
        dct = {
            "col1": i,
            "col2": 'value {}'.format(i)
        }
        d.append(dct, ignore_index=True)
        # df.append(dct, ignore_index=True) # Does not seem to append anything to the global variable
    df = d # does not assign any values to the global variable

my_func()
df.head()
Unlike list.append, pandas.DataFrame.append is not an in-place operation: it returns a new DataFrame, so the result has to be assigned back. With that one change the code works as expected:
import pandas as pd

df = pd.DataFrame()

def my_func():
    global df
    d = pd.DataFrame()
    for i in range(10):
        dct = {
            "col1": i,
            "col2": 'value {}'.format(i)}
        d = d.append(dct, ignore_index=True) # <<< Assignment needed
        # df.append(dct, ignore_index=True) # Would still discard its result without an assignment
    df = d # now assigns the filled DataFrame to the global variable

my_func()
df.head()
Output:
   col1     col2
0   0.0  value 0
1   1.0  value 1
2   2.0  value 2
3   3.0  value 3
4   4.0  value 4
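Because DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, the code above only runs on older versions. A minimal sketch that works without append is to collect the rows as plain dictionaries and build the DataFrame once. The function name build_frame is made up here, and returning the frame instead of using a global is a design choice of this sketch, not something the question requires.

```python
import pandas as pd

def build_frame():
    # collect rows as dictionaries, then construct the DataFrame in one step
    rows = [{"col1": i, "col2": "value {}".format(i)} for i in range(10)]
    return pd.DataFrame(rows)

df = build_frame()
print(df.head())
```

Building the frame from a list in one call also avoids copying the growing DataFrame on every loop iteration.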

Pandas apply function on dataframe over multiple columns

When I run the following code I get a KeyError: ('a', 'occurred at index a'). How can I apply this function, or something similar, over the DataFrame without encountering this issue?
Running python3.6, pandas v0.22.0
import numpy as np
import pandas as pd

def add(a, b):
    return a + b

df = pd.DataFrame(np.random.randn(3, 3),
                  columns = ['a', 'b', 'c'])
df.apply(lambda x: add(x['a'], x['c']))
I think you need the parameter axis=1 so that apply processes the DataFrame row by row:
axis: {0 or 'index', 1 or 'columns'}, default 0
0 or index: apply function to each column
1 or columns: apply function to each row
df = df.apply(lambda x: add(x['a'], x['c']), axis=1)
print (df)
0   -0.802652
1    0.145142
2   -1.160743
dtype: float64
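The difference between the two axis settings can be checked with a small sketch on fixed data (the values are made up so the result is reproducible). With the default axis=0 the lambda receives each column as a Series indexed by row labels, so looking up 'a' fails; with axis=1 it receives each row, indexed by column names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=0 (default): x is a column, indexed by row labels 0 and 1
try:
    df.apply(lambda x: x["a"] + x["c"])
    raised = False
except KeyError:
    raised = True

# axis=1: x is a row, indexed by the column names 'a', 'b', 'c'
row_sums = df.apply(lambda x: x["a"] + x["c"], axis=1)

print(raised)             # True
print(row_sums.tolist())  # [6, 8]
```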
You don't even need apply; you can add the columns directly. The output will be a Series either way:
df = df['a'] + df['c']
For example:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df = df['a'] + df['c']
print(df)
# 0 6
# 1 8
# dtype: int64
You can try this:
import numpy as np
import pandas as pd

def add(df):
    # with axis=1, df is a single row, so df.a and df.b are scalars
    return df.a + df.b

df = pd.DataFrame(np.random.randn(3, 3),
                  columns = ['a', 'b', 'c'])
df.apply(add, axis=1)
where of course you can substitute any function that takes as inputs the columns of df.

Appending to an empty DataFrame in Pandas?

Is it possible to append to an empty data frame that doesn't contain any indices or columns?
I have tried to do this, but keep getting an empty dataframe at the end.
e.g.
import pandas as pd
df = pd.DataFrame()
data = ['some kind of data here' --> I have checked the type already, and it is a dataframe]
df.append(data)
The result looks like this:
Empty DataFrame
Columns: []
Index: []
This should work:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df = df.append(data)
>>> df
   A
0  0
1  1
2  2
Since the append doesn't happen in place, you'll have to store the output if you want it:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data) # without storing
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
   A
0  0
1  1
2  2
And if you want to add a row, you can use a dictionary:
df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)
which gives you:
   age  height name
0    9       2  Zed
You can concatenate the data this way:
InfoDF = pd.DataFrame()
# rows is assumed to be defined elsewhere as a sequence of (id, min_date) records
tempDF = pd.DataFrame(rows,columns=['id','min_date'])
InfoDF = pd.concat([InfoDF,tempDF])
The answers are very useful, but since pandas.DataFrame.append was deprecated (as already mentioned by various users), and the answers using pandas.concat are not "Runnable Code Snippets" I would like to add the following snippet:
import pandas as pd
df = pd.DataFrame(columns =['name','age'])
row_to_append = pd.DataFrame([{'name':"Alice", 'age':"25"},{'name':"Bob", 'age':"32"}])
df = pd.concat([df,row_to_append])
So df is now:
    name age
0  Alice  25
1    Bob  32
pandas.DataFrame.append has been deprecated since version 1.4.0 (and was removed in pandas 2.0): use concat() instead.
Therefore:
df = pd.DataFrame() # empty dataframe
df2 = pd.DataFrame(...) # some dataframe with data
df = pd.concat([df, df2])
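When the data arrives in several pieces, one common pattern is to collect the pieces in a list and call pd.concat a single time at the end, rather than concatenating onto an empty DataFrame inside a loop. The sketch below illustrates this under made-up assumptions: the column names batch and value and the loop producing the pieces are purely illustrative.

```python
import pandas as pd

chunks = []
for i in range(3):
    # each piece is a small DataFrame; here the contents are generated for the demo
    chunks.append(pd.DataFrame({"batch": [i, i],
                                "value": [i * 10, i * 10 + 1]}))

# one concat at the end; ignore_index=True renumbers the rows 0..n-1
df = pd.concat(chunks, ignore_index=True)
print(df)
```

This never needs an empty starting DataFrame at all, which also avoids dtype surprises that can come from concatenating with an empty frame.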
