Why Does this DataFrame Modification within Function Change Global Outside Function? - python

Why does the function below change the global DataFrame named df? Shouldn't it just change a local df within the function, but not the global df?
import pandas as pd

df = pd.DataFrame()

def adding_var_inside_function(df):
    df['value'] = 0

print(df.columns)  # Index([], dtype='object')
adding_var_inside_function(df)
print(df.columns)  # Index([u'value'], dtype='object')

Python passes arguments by assignment: the parameter df inside the function is bound to the very same DataFrame object as the global df, so the in-place column insertion df['value'] = 0 mutates the shared object. From the pandas docs:
Mutability and copying of data
All pandas data structures are value-mutable (the values they contain can be altered) but not always
size-mutable. The length of a Series cannot be changed, but, for
example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data
untouched. In general, though, we like to favor immutability where
sensible.
Here is another example, showing value (cell) mutability:
In [21]: df
Out[21]:
a b c
0 3 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
In [22]: df2 = df
In [23]: df2.loc[0, 'a'] = 100
In [24]: df
Out[24]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
df2 is a reference to df
In [28]: id(df) == id(df2)
Out[28]: True
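The same check holds inside a function body. Here is a minimal sketch (the helper name check_identity is just for illustration):

def check_identity(frame):
    # the parameter is bound to the very same object as the global df
    print(frame is df)

check_identity(df)  # True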
Here is a version of your function that won't mutate the DataFrame passed as the argument:
def adding_var_inside_function(df):
    df = df.copy()
    df['value'] = 0
    return df
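Note that DataFrame.copy defaults to deep=True, so the copy gets its own data and the column insertion no longer touches the caller's frame.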
In [30]: df
Out[30]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
In [31]: adding_var_inside_function(df)
Out[31]:
a b c value
0 100 2 0 0
1 3 3 1 0
2 4 0 0 0
3 2 3 2 0
4 0 4 4 0
In [32]: df
Out[32]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4

Related

ApplyMap function on Multiple columns pandas

I have this dataframe
dd = pd.DataFrame({'a':[1,5,3],'b':[3,2,3],'c':[2,4,5]})
a b c
0 1 3 2
1 5 2 4
2 3 3 5
I just want to replace the numbers in columns a and b that are smaller than the number in column c, row-wise.
I tried this:
dd.applymap(lambda x: 0 if x < x['c'] else x)
but I get this error:
TypeError: 'int' object is not subscriptable
I understand that x is an int, but how do I get the value of column c for that row?
I want this output
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Use DataFrame.mask with DataFrame.lt:
df = dd.mask(dd.lt(dd['c'], axis=0), 0)
print (df)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
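To see what condition is being passed to mask, you can print the intermediate boolean frame (a quick sketch; column c is compared against itself and so is never True):

print(dd.lt(dd['c'], axis=0))
#        a      b      c
# 0   True  False  False
# 1  False   True  False
# 2   True   True  False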
Or you can set the values in place by comparing against column c with broadcasting:
dd[dd < dd['c'].to_numpy()[:, None]] = 0
print (dd)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
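If you prefer plain NumPy, an equivalent sketch uses np.where with the same row-wise comparison (this builds a new frame instead of modifying dd in place; out is just an illustrative name):

import numpy as np

out = pd.DataFrame(np.where(dd.lt(dd['c'], axis=0), 0, dd),
                   columns=dd.columns, index=dd.index)
print(out)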

How to efficiently remove leading rows containing only 0 as value?

I have a pandas dataframe whose first rows contain only zeros as values.
I would like to remove those rows.
Denoting my dataframe df and ['a', 'b', 'c'] its columns, I tried the following code:
df[(df[['a', 'b', 'c']] != 0).any(axis=1)]
But it removes every all-zero row, turning the following dataframe:
a b c
0 0 0
0 0 0
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
Into this one :
a b c
1 0 0
2 3 5
4 5 6
1 1 1
That's not what I want. I just want to remove the leading all-zero rows. So, I would like to have:
a b c
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
It would be great to have a simple and efficient solution using pandas functions. Thanks
General solution that works even if the data contains only 0 rows - first use cumsum for a cumulative sum, then test whether any value per row is True:
df1 = df[(df[['a', 'b', 'c']] != 0).cumsum().any(axis=1)]
print (df1)
a b c
2 1 0 0
3 0 0 0
4 2 3 5
5 4 5 6
6 0 0 0
7 1 1 1
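To see why this works, inspect the intermediate cumulative sum: per column, a row becomes non-zero once any non-zero value has appeared at or above it, so any(axis=1) switches on at the first non-zero row and stays on (a sketch):

print((df[['a', 'b', 'c']] != 0).cumsum())
#    a  b  c
# 0  0  0  0
# 1  0  0  0
# 2  1  0  0
# 3  1  0  0
# 4  2  1  1
# 5  3  2  2
# 6  3  2  2
# 7  4  3  3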
Solution if there is at least one non-zero row in the data - get the position of the first non-zero row with Series.idxmax:
df1 = df.iloc[(df[['a', 'b', 'c']] != 0).any(axis=1).idxmax():]
print (df1)
a b c
2 1 0 0
3 0 0 0
4 2 3 5
5 4 5 6
6 0 0 0
7 1 1 1
Here is an example that finds the first row that is not all zeros and then selects everything from that point on. It should solve the problem you are describing:
ix_first_valid = df[(df != 0).any(axis=1)].index[0]
df[ix_first_valid:]
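A related one-liner (a sketch, equivalent for a default RangeIndex) uses the cumulative maximum of the per-row flag as a boolean mask that turns True at the first non-zero row and stays True:

df1 = df[(df != 0).any(axis=1).cummax()]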

Pandas DataFrame: Spread CSV columns to multiple columns

I have a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
0 1 2
0 a 2 3
1 a,b 5 6
2 c 8 9
I want to spread the first column into n columns (where n is the number of unique, comma-separated values, in this case 3). Each of the resulting columns shall be 1 if the value is present and 0 otherwise. The expected result is:
1 2 a c b
0 2 3 1 0 0
1 5 6 1 0 1
2 8 9 0 1 0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
...     rename(columns=lambda x: re.sub('.*_', '', x))
>>> pd.concat([df.iloc[:, 1:], dfSpread], axis=1)
Is there a built-in function that does just that that I wasn't able to find?
Using Series.str.get_dummies:
df.set_index([1,2])[0].str.get_dummies(',').reset_index()
Out[229]:
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
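In both answers the heavy lifting is done by Series.str.get_dummies, which splits each string on the separator and one-hot encodes the resulting tokens; a minimal sketch on the original first column:

print(df[0].str.get_dummies(sep=','))
#    a  b  c
# 0  1  0  0
# 1  1  1  0
# 2  0  0  1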

Python pandas cumsum with reset everytime there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
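To see why this resets, print the intermediate pieces for column a (a sketch):

a = df['a'] != 0
cs = a.cumsum()                            # running count of non-zeros: 0 1 1 2 3 3
frozen = cs.where(~a).ffill().fillna(0)    # count as of the last zero:  0 0 1 1 1 3
print((cs - frozen).astype(int).tolist())  # [0, 1, 0, 1, 2, 0]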
Try this:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
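The groupId columns simply count the zeros seen so far, so each zero opens a new group and the cumsum restarts inside it; a quick check (sketch):

print(df.a.eq(0).cumsum().tolist())  # [1, 1, 2, 2, 2, 3]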
You may also try the following naive but reliable approach.
For every column, create groups to count within. A new group starts whenever the value changes from the previous row, and it lasts while the value stays constant: (x != x.shift()).cumsum().
For example, the group labels for the sample df are:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate the cumulative sums within those groups per column using DataFrame.apply and groupby, and you get the cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. For a 0/1 column the running sum just before the first zero at position i is exactly i, so the offset cancels it; with several zeros in one column the later offsets overshoot, so this only resets correctly through the first zero (as in column b here):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)[0]  # integer positions of the zeros in column b
df.loc[z, 'b'] = -z            # offset each zero so the cumsum resets there
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2

Drop observations from the data frame in python

How do I delete observations from a data frame in Python? For example, I have a data frame with variables a, b, and c, and I want to delete an observation if variable a is missing or variable c is equal to zero.
You could build a boolean mask using isnull:
mask = (df['a'].isnull()) | (df['c'] == 0)
and then select the desired rows with:
df = df.loc[~mask]
~mask is the boolean inverse of mask, so df.loc[~mask] selects rows where a is not null and c is not 0.
For example,
import numpy as np
import pandas as pd
arr = np.arange(15, dtype='float').reshape(5,3) % 4
arr[arr > 2] = np.nan
df = pd.DataFrame(arr, columns=list('abc'))
# a b c
# 0 0 1 2
# 1 NaN 0 1
# 2 2 NaN 0
# 3 1 2 NaN
# 4 0 1 2
mask = (df['a'].isnull()) | (df['c'] == 0)
df = df.loc[~mask]
yields
a b c
0 0 1 2
3 1 2 NaN
4 0 1 2
Let's say your DataFrame looks like this:
In [1]: import numpy as np

In [2]: data = pd.DataFrame({
   ...:     'a': [1, 2, 3, np.nan, 5],
   ...:     'b': [3, 4, np.nan, 5, 6],
   ...:     'c': [0, 1, 2, 3, 4],
   ...: })
In [3]: data
Out[3]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
3 NaN 5 3
4 5 6 4
To delete rows with missing observations, use:
In [5]: data.dropna()
Out[5]:
a b c
0 1 3 0
1 2 4 1
4 5 6 4
To delete rows where only column 'a' has missing observations, use:
In [6]: data.dropna(subset=['a'])
Out[6]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
4 5 6 4
To delete rows that have either missing observations or zeros, use:
In [18]: data[data.all(axis=1)].dropna()
Out[18]:
a b c
1 2 4 1
4 5 6 4
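Note that data.all(axis=1) treats NaN as truthy, so the NaN rows pass the first filter and are only removed by the trailing dropna(). An equivalent, arguably more explicit sketch builds one combined mask:

mask = data.notna().all(axis=1) & (data != 0).all(axis=1)
print(data[mask])  # keeps only rows 1 and 4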
