I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
As I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three, but they all require method chaining once you have more than two columns:
example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[np.NaN, 2, 4, 5, np.NaN],
'col2':[np.NaN, 5, 1, 0, np.NaN],
'col3':[2, np.NaN, 9, 1, np.NaN],
'col4':[np.NaN, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill along the columns axis (axis=1), we can get the first non-null value per row in a generalized way, even for a large number of columns.
Plus, this also works for string-type columns!
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
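If you only want to coalesce a subset of the columns and leave the rest alone, a minimal sketch (reusing the column names from the example above) is to select those columns first:
cols = ['col1', 'col2', 'col3']  # only these take part in the coalesce
df['coalesce'] = df[cols].bfill(axis=1).iloc[:, 0]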
Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes unworkable as the number of columns grows:
df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
              .combine_first(df['col3'])
              .combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
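If you'd rather keep combine_first but avoid writing the chain out by hand, a functools.reduce sketch (same column names as above) generalizes it to any list of columns:
import functools
cols = ['col1', 'col2', 'col3', 'col4']  # coalesce order, left to right
df['coalesce'] = functools.reduce(
    lambda acc, col: acc.combine_first(df[col]), cols[1:], df[cols[0]])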
Try this too; it's easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option, but there are a couple of others. Below I outline a few more solutions, some applicable to different cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
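For example, a minimal np.where sketch of the same idea (assuming you want a new Series aligned to df's index rather than modifying df) could be:
import numpy as np
coalesced = pd.Series(np.where(df['a'].isnull(), df['b'], df['a']), index=df.index)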
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
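Note that the fillna(0).sum(1) trick relies on the NaNs being mutually exclusive: a row that is NaN in both columns would come out as 0.0 rather than NaN (Series.add with fill_value, by contrast, keeps NaN when both sides are missing). A quick sketch to verify the assumption first:
both_null = df['a'].isnull() & df['b'].isnull()
assert not both_null.any(), "some rows are NaN in both columns"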
I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
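The row-wise apply above is easy to read but can be slow on large frames; a vectorized sketch of the same idea, reusing cols_to_search together with the bfill trick from an earlier answer:
df['A'] = df[cols_to_search].bfill(axis=1).iloc[:, 0]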
I'm thinking a solution like this,
def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
For a more general case, where there are no NaNs but you want the same behavior, see: Merge 'left', but override 'right' values where possible
Good code, but you have a typo for Python 3; the correct version looks like this:
"""coalesce the column information like a SQL coalesce."""
for other in series:
s = s.mask(pd.isnull, other)
return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,np.NaN, 3, 4, 5],
'B':[np.NaN, 2, 3, 4, np.NaN]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0
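SQL's coalesce takes any number of arguments, so the same pattern extends to many columns; a sketch against a made-up three-column frame (the names here are assumptions, not from the question):
df_many = pd.DataFrame({'a': [None, 1.0], 'b': [None, None], 'c': [3.0, 4.0]})
out = duckdb.query("SELECT coalesce(a, b, c) AS d FROM df_many").to_df()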
I have the following df:
df = pd.DataFrame(np.array([[.1, 2, 3], [.4, 5, 6], [7, 8, 9]]),
columns=['col1', 'b', 'c'])
out:
col1 b c
0 0.1 2.0 3.0
1 0.4 5.0 6.0
2 7.0 8.0 9.0
When a value begins with a point ('.'), I want to remove the point, but only in that case.
I've tried the following:
s = df['col1']
df['col1'] = s.mask(df['col1'].str.startswith('.',na=False),s.str.replace(".",""))
desired output:
col1 b c
0 1 2.0 3.0
1 4 5.0 6.0
2 7.0 8.0 9.0
However this does not work. Please help!
Since you have numerical values, you can multiply by 10 and replace with a condition:
df.mul(10).mask(df.ge(1),df)
#df['col1'] = df['col1'].mul(10).mask(df['col1'].ge(1),df['col1']) for 1 column
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
Use boolean masking. First, create a mask:
mask=df['col1'].astype(str).str.startswith('0.')
Finally make use of that mask:
df.loc[mask,'col1']=df.loc[mask,'col1'].astype(str).str.lstrip('0.').astype(float)
Now if you print df you will get your desired output:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
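One caveat: str.lstrip('0.') strips any leading run of '0' and '.' characters rather than the literal '0.' prefix, so a hypothetical value of 0.0 would be stripped to an empty string and the astype(float) call would fail. A regex-anchored sketch (reusing the same mask) avoids that:
df.loc[mask, 'col1'] = (df.loc[mask, 'col1'].astype(str)
                        .str.replace(r'^0\.', '', regex=True)
                        .astype(float))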
Via NumPy's np.where():
df[:] = np.where(df < 1, df * 10, df)
df contents:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value if both of them are not NaN ?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN) ?
We can shift the dataframe forward and backward, add the two shifted frames together, divide by two, and use the result to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
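For the second part of the question (averaging the previous n and the succeeding n values only when all of them are present), a sketch along the same lines; n = 2 here is just an example:
n = 2
# summing shifted copies propagates NaN if any of the n neighbours is missing,
# which gives the "only if they are all not NaN" behavior automatically
prev_mean = sum(df.shift(i) for i in range(1, n + 1)) / n
next_mean = sum(df.shift(-i) for i in range(1, n + 1)) / n
df = df.fillna((prev_mean + next_mean) / 2)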
These two functions seem equivalent to me. You can see that they accomplish the same goal in the code below, as columns c and d are equal. So when should I use one over the other?
Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
df.loc[::2, 'a'] = np.nan
Returns:
a b
0 NaN 4
1 2.0 6
2 NaN 8
3 0.0 4
4 NaN 4
5 0.0 8
6 NaN 7
7 2.0 2
8 NaN 9
9 7.0 2
This is my starting point. Now I will add two columns, one using combine_first and one using fillna, and they will produce the same result:
df['c'] = df.a.combine_first(df.b)
df['d'] = df['a'].fillna(df['b'])
Returns:
a b c d
0 NaN 4 4.0 4.0
1 8.0 7 8.0 8.0
2 NaN 2 2.0 2.0
3 3.0 0 3.0 3.0
4 NaN 0 0.0 0.0
5 2.0 4 2.0 2.0
6 NaN 0 0.0 0.0
7 2.0 6 2.0 2.0
8 NaN 4 4.0 4.0
9 4.0 6 4.0 4.0
Credit to this question for the data set: Combine Pandas data frame column values into new column
combine_first is intended for cases where the two objects' indices do not fully overlap. It fills in nulls and also supplies values for indices and columns that don't exist in the first DataFrame.
dfa = pd.DataFrame([[1, 2, 3], [4, np.nan, 5]], ['a', 'b'], ['w', 'x', 'y'])
w x y
a 1.0 2.0 3.0
b 4.0 NaN 5.0
dfb = pd.DataFrame([[1, 2, 3], [3, 4, 5]], ['b', 'c'], ['x', 'y', 'z'])
x y z
b 1.0 2.0 3.0
c 3.0 4.0 5.0
dfa.combine_first(dfb)
w x y z
a 1.0 2.0 3.0 NaN
b 4.0 1.0 5.0 3.0 # 1.0 filled from `dfb`; 5.0 was in `dfa`; 3.0 new column
c NaN 3.0 4.0 5.0 # whole new index
Notice that all indices and columns are included in the results
Now if we fillna
dfa.fillna(dfb)
w x y
a 1 2.0 3
b 4 1.0 5 # 1.0 filled in from `dfb`
Notice no new columns or indices from dfb are included. We only filled in the null value where dfa shared index and column information.
In your case, you use fillna and combine_first on one column with the same index. These translate to effectively the same thing.
I have a number of similar dataframes where I would like to standardize the NaNs across all of them. For instance, if a NaN exists at df1.loc[0, 'a'], then ALL other dataframes should be set to NaN at the same location.
I am aware that I could group the dataframes to create one big multiindexed dataframe but sometimes I find it easier to work with a group of dataframes of the same structure.
Here is an example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
print(df1)
print(' ')
print(df2)
print(' ')
print(df3)
Output:
a b c
0 0.0 1 2
1 3.0 4 5
2 6.0 7 8
3 NaN 10 11
a b c
0 0 1.0 2
1 3 NaN 5
2 6 7.0 8
3 9 10.0 11
a b c
0 0 1 NaN
1 3 4 5.0
2 6 7 8.0
3 9 10 11.0
However, I would like df1, df2 and df3 to have nans in the same locations:
print(df1)
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
Using the answer provided by piRSquared, I was able to extend it for dataframes of different sizes. Here is the function:
def set_nans_over_every_df(df_list):
    # Find unique index and column values
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([col for df in df_list for col in df.columns]))
    # Ensure that every df has the same indexes and columns
    df_list = [df.reindex(index=complete_index, columns=complete_columns) for df in df_list]
    # Find the nans in each df and set nans in every other df at the same location
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    df_list = [df.mask(mask) for df in df_list]
    return df_list
And an example using different sized dataframes:
df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])
print(df1)
a b c d
0 0.0 1.0 NaN NaN
1 3.0 NaN 5.0 NaN
2 6.0 7.0 8.0 NaN
3 NaN 10.0 11.0 NaN
4 NaN NaN NaN NaN
I'd set up a mask in NumPy, then use it with the pd.DataFrame.mask method:
mask = np.isnan(np.stack([d.values for d in [df1, df2, df3]])).any(0)
print(df1.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df2.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df3.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
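To write the masked result back to every frame in one pass, a small sketch (assuming you want to overwrite the originals):
df1, df2, df3 = (d.mask(mask) for d in (df1, df2, df3))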
You can create a mask and then apply it to all dataframes:
mask = df1.notnull() & df2.notnull() & df3.notnull()
print (mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
You can also build the mask dynamically with functools.reduce:
import functools
masks = [df1.notnull(),df2.notnull(),df3.notnull()]
mask = functools.reduce(lambda x,y: x & y, masks)
print (mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
print (df1[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print (df2[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print (df3[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
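Note that df1[mask] only returns a masked copy; to make the change stick, reassign, for example:
df1, df2, df3 = df1[mask], df2[mask], df3[mask]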
Assuming that all your DFs have the same shape and the same indexes:
In [196]: df2[df1.isnull()] = df3[df1.isnull()] = np.nan
In [197]: df1[df3.isnull()] = df2[df3.isnull()] = np.nan
In [198]: df1[df2.isnull()] = df3[df2.isnull()] = np.nan
In [199]: df1
Out[199]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [200]: df2
Out[200]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [201]: df3
Out[201]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
One simple method is to add the DataFrames together, multiply the result by 0, and then add this zero DataFrame to each of the others individually.
df_zero = (df1 + df2 + df3) * 0
df1 = df1 + df_zero
df2 = df2 + df_zero
df3 = df3 + df_zero
I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns = ['A','B','C','D'])
df = df[df['C'].notnull()]
df
This is just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
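If your frame still comes out empty after df[df.C.notnull()], a sanity-check sketch right after read_csv can help narrow it down (just a debugging suggestion, since the CSV itself isn't shown):
df = pd.read_csv(infile)
print(df.shape)    # confirm rows were actually read
print(df.head())   # confirm column 'C' exists and looks as expected
print(df['C'].isnull().sum(), "nulls in C")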