Fill missing value by averaging previous row value - python

I want to fill each missing value with the average of the values in the previous N rows. An example is shown below:
N=2
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, np.nan]],
columns=list('ABCD'))
DataFrame is like:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN NaN
Result should be:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN (4+2)/2 NaN 5
3 NaN 3.0 NaN (1+5)/2
I am wondering if there is an elegant and fast way to achieve this without a for loop.

rolling + mean + shift
You will need to modify the logic below if the mean should ignore NaN when one of the previous two values is null. As written, rolling(2) returns NaN unless both previous values are present.
df = df.fillna(df.rolling(2).mean().shift())
print(df)
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 3.0 NaN 5.0
3 NaN 3.0 NaN 3.0
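The caveat above (one of the previous two values being null) can be handled with the min_periods argument of rolling. The following is a minimal sketch, assuming that the desired behaviour is to average whichever of the previous N values are present; the names prev_mean and filled are illustrative only. Note that the window is computed on the original values, so a value filled in one row is not reused when filling the next row.

```python
import numpy as np
import pandas as pd

N = 2
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, np.nan]],
                  columns=list('ABCD'))

# min_periods=1: a window yields a mean as soon as it holds one non-NaN value,
# so NaNs inside the window are ignored instead of making the mean NaN.
prev_mean = df.rolling(N, min_periods=1).mean().shift()

# Only the NaN cells of df are replaced; existing values are left untouched.
filled = df.fillna(prev_mean)
print(filled)
```

With this variant, column A is also filled in rows 2 and 3 (from the single available value 3.0), which the stricter rolling(2) version leaves as NaN.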

Related

conditions inside conditions pandas

below is my DF in which I want to create a column based on other columns
test = pd.DataFrame({"Year_2017" : [np.nan, np.nan, np.nan, 4], "Year_2018" : [np.nan, np.nan, 3, np.nan], "Year_2019" : [np.nan, 2, np.nan, np.nan], "Year_2020" : [1, np.nan, np.nan, np.nan]})
Year_2017 Year_2018 Year_2019 Year_2020
0 NaN NaN NaN 1
1 NaN NaN 2 NaN
2 NaN 3 NaN NaN
3 4 NaN NaN NaN
The aim is to create a new column that takes the value of whichever column is not NaN in each row.
Below is what I tried, without success:
test['Final'] = np.where(test.Year_2017.isna(), test.Year_2018,
np.where(test.Year_2018.isna(), test.Year_2019,
np.where(test.Year_2019.isna(), test.Year_2020, test.Year_2019)))
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 NaN
1 NaN NaN 2 NaN NaN
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN NaN
The expected output:
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 1
1 NaN NaN 2 NaN 2
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN 4
You can forward fill or back fill the missing values along the columns and then select the last or first column:
test['Final'] = test.ffill(axis=1).iloc[:, -1]
test['Final'] = test.bfill(axis=1).iloc[:, 0]
If there is only one non-missing value per row and the values are numeric, use any of:
test['Final'] = test.min(1)
test['Final'] = test.max(1)
test['Final'] = test.mean(1)
test['Final'] = test.sum(1, min_count=1)
If you only have a single non-NA value per row, you can use:
test['Final'] = test.max(axis=1)
(or other aggregators)
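The approaches above agree only when each row holds exactly one non-missing value. The sketch below uses a small made-up frame (the variable years and its values are assumptions for illustration) in which the last row has two values, to show how the choices differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has two non-missing values.
years = pd.DataFrame({"Year_2017": [np.nan, np.nan, 3.0],
                      "Year_2018": [np.nan, 2.0, np.nan],
                      "Year_2019": [1.0, np.nan, 5.0]})

# Back fill along the columns, then take the first column:
# the first non-missing value in each row.
first_non_null = years.bfill(axis=1).iloc[:, 0]

# Forward fill along the columns, then take the last column:
# the last non-missing value in each row.
last_non_null = years.ffill(axis=1).iloc[:, -1]

# An aggregator such as max ignores NaN but is unrelated to column order.
row_max = years.max(axis=1)

print(pd.DataFrame({"first": first_non_null,
                    "last": last_non_null,
                    "max": row_max}))
```

Computing the result from a selection of the year columns, rather than from the whole frame, also avoids picking up a previously created Final column.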

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, but only if both of them are not NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if none of them is NaN)?
We can shift the dataframe forward and backwards. Then add these together and divide them by two and use that to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
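The second part of the question (previous n and succeeding n values) is not covered by the snippet above. One possible generalisation is sketched below; the helper name fill_with_neighbor_mean is hypothetical. It relies on the fact that adding NaN to anything gives NaN, so a cell is only filled when all 2n neighbours are present.

```python
import numpy as np
import pandas as pd

def fill_with_neighbor_mean(df, n=1):
    """Hypothetical helper: fill NaNs with the mean of the n previous and
    n following values, only where all 2*n neighbours are non-NaN."""
    shifted = [df.shift(k) for k in range(1, n + 1)]
    shifted += [df.shift(-k) for k in range(1, n + 1)]
    # Plain addition propagates NaN, so an incomplete neighbourhood
    # yields NaN and the cell stays unfilled.
    neighbor_mean = sum(shifted) / (2 * n)
    return df.fillna(neighbor_mean)

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})
filled = fill_with_neighbor_mean(df, n=1)
print(filled)

# With n=2 the middle value is the mean of 1, 2, 4 and 5.
wide = fill_with_neighbor_mean(pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0, 5.0]}), n=2)
print(wide)
```

For n=1 this reproduces the shift-based result shown above.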

How do I merge multiple pandas dataframe columns

I have a dataframe similar to the one seen below.
In[2]: df = pd.DataFrame({'P1': [1, 2, None, None, None, None],'P2': [None, None, 3, 4, None, None],'P3': [None, None, None, None, 5, 6]})
Out[2]:
P1 P2 P3
0 1.0 NaN NaN
1 2.0 NaN NaN
2 NaN 3.0 NaN
3 NaN 4.0 NaN
4 NaN NaN 5.0
5 NaN NaN 6.0
And I am trying to merge all of the columns into a single P column in a new dataframe (see below).
P
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
In my actual code, I have an arbitrary list of columns that should be merged, not necessarily P1, P2, and P3 (between 1 and 5 columns). I've tried something along the following lines:
new_series = pd.Series()
desired_columns = ['P1', 'P2', 'P3']
for col in desired_columns:
    other_series=df[col]
    new_series = new_series.align(other_series)
However, this results in a tuple of Series objects, and neither of them appears to contain the data I need. I could iterate through every row and check each column, but I feel that there is likely an easy pandas solution that I am missing.
If there is only one non-None value per row, forward fill the Nones along the columns and select the last column by position:
df['P'] = df[['P1', 'P2', 'P3']].ffill(axis=1).iloc[:, -1]
print (df)
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0
Another alternative:
If you do not need to pick specific columns of the DataFrame, you can use bfill() across the columns. With axis='columns' (or axis=1), each NaN cell is filled with the next non-NaN value to its right in the same row, so the first column ends up holding the first non-NaN value of each row.
>>> df['P'] = df.bfill(axis=1).iloc[:, 0]
>>> df
P1 P2 P3 P
0 1.0 NaN NaN 1.0
1 2.0 NaN NaN 2.0
2 NaN 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 NaN NaN 5.0 5.0
5 NaN NaN 6.0 6.0
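The question asked for an arbitrary list of columns merged into a single P column in a new DataFrame, rather than a column added to the existing one. A minimal sketch along the lines of the answers above, assuming the list is held in desired_columns as in the question (new_df is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'P1': [1, 2, None, None, None, None],
                   'P2': [None, None, 3, 4, None, None],
                   'P3': [None, None, None, None, 5, 6]})

# Any list of column names can be used here.
desired_columns = ['P1', 'P2', 'P3']

# Back fill across the selected columns, keep the first one,
# and wrap the resulting Series in a one-column DataFrame named P.
new_df = df[desired_columns].bfill(axis=1).iloc[:, 0].to_frame('P')
print(new_df)
```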

Pandas accessing rows with nan

I was working on a dataframe like this.
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]],index=['a','b','c'])
df
0 1 2
a 1.0 NaN 2
b 2.0 3.0 5
c NaN 4.0 6
When I use df.isnull() it gives the output as :
0 1 2
a False True False
b False False False
c True False False
When I use df[df.isnull()] why does it show all elements as nan:
df[df.isnull()]
0 1 2
a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
Can somebody explain why it is happening?
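Indexing with a boolean DataFrame applies it as a mask: every position where the mask is False is replaced with np.nan, and only positions where it is True keep their original value.
For example: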
df[~df.isnull()]
Out[342]:
0 1 2
a 1.0 NaN 2
b 2.0 3.0 5
c NaN 4.0 6
and
df[df==2]
Out[343]:
0 1 2
a NaN NaN 2.0
b 2.0 NaN NaN
c NaN NaN NaN
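Since isnull() returns True only where the value is already np.nan, the positions that survive the mask are themselves NaN.
So after masking, everything is NaN: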
df[df.isnull()]
Out[344]:
0 1 2
a NaN(False, masked to NaN) NaN(True, original NaN kept) NaN
b NaN NaN NaN
c NaN(True, original NaN kept) NaN NaN
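To make the masking behaviour concrete, the sketch below checks that indexing with a boolean DataFrame gives the same result as DataFrame.where, and then shows one way to select the rows that contain a NaN, assuming that is what the title of the question is after. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]], index=['a', 'b', 'c'])

# Boolean-DataFrame indexing keeps values where the mask is True
# and puts NaN everywhere else, just like DataFrame.where.
masked = df[df.isnull()]
same_as_where = df.where(df.isnull())
print(masked.equals(same_as_where))

# To select whole rows that contain at least one NaN, reduce the mask
# to one boolean per row and index with that Series instead.
rows_with_nan = df[df.isnull().any(axis=1)]
print(rows_with_nan)
```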

Combine 2 series pandas - overwriting the NANs [duplicate]

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:
example dataframe:
import numpy as np
import pandas as pd
# np.nan is used instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'col1':[np.nan, 2, 4, 5, np.nan],
                   'col2':[np.nan, 5, 1, 0, np.nan],
                   'col3':[2, np.nan, 9, 1, np.nan],
                   'col4':[np.nan, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we can get the values in a generalized way, even for a large number of columns.
Plus, this also works for string-type columns!
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes unmanageable as the number of columns grows:
df['coalesce'] = (
df['col1'].combine_first(df['col2'])
.combine_first(df['col3'])
.combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
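If combine_first is preferred, the chaining can be written once for any number of columns with functools.reduce. This is only a sketch of that idea, not part of the original answers; cols and coalesced are illustrative names.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})

# Order matters: earlier columns take priority.
cols = ['col1', 'col2', 'col3', 'col4']

# Start from the first column and fold the remaining ones in with combine_first.
coalesced = reduce(lambda acc, name: acc.combine_first(df[name]),
                   cols[1:], df[cols[0]])
print(coalesced)

# For this data the result matches the bfill-based approach.
print(coalesced.equals(df[cols].bfill(axis=1).iloc[:, 0]))
```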
Try this as well; it is easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option. Below I outline a few more solutions, some of which apply only to particular cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
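As a sketch of the two remarks above, here is the np.where spelling of "a first", followed by the "b first" variant obtained by switching the roles of the columns. The names a_first and b_first are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})

# np.where returns a NumPy array, so wrap it in a Series to keep the index.
a_first = pd.Series(np.where(df['a'].notnull(), df['a'], df['b']),
                    index=df.index)

# Combine first on b: keep b where it is non-null, otherwise fall back to a.
b_first = df['b'].where(df['b'].notnull(), df['a'])

print(a_first.tolist())
print(b_first.tolist())
```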
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
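This method works in place, modifying the Series it is called on. This is an efficient option for this use case. Note that with Copy-on-Write enabled (the default from pandas 3.0), calling update on the selection df['b'] does not write back to the DataFrame, so in that setting assign the updated Series back to the column.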
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
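The two sum-based options above depend on the NaNs being mutually exclusive. The sketch below uses a small made-up frame (overlapping is a hypothetical name) to show what happens otherwise: rows with two values are added together instead of coalesced, and fillna(0) turns an all-NaN row into 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has both values, row 2 has none.
overlapping = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                            'b': [5.0, 4.0, np.nan]})

# A true coalesce keeps a where present and falls back to b.
coalesced = overlapping['a'].combine_first(overlapping['b'])

# Addition sums row 0 instead of picking a value.
summed = overlapping['a'].add(overlapping['b'], fill_value=0)

# fillna(0) + sum also reports 0 for the row with no data at all.
zero_filled = overlapping.fillna(0).sum(axis=1)

# min_count=1 keeps the all-NaN row as NaN.
row_sum = overlapping.sum(axis=1, min_count=1)

print(pd.DataFrame({'coalesced': coalesced, 'summed': summed,
                    'zero_filled': zero_filled, 'row_sum': row_sum}))
```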
I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A:
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
I'm thinking of a solution like this:
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        # keep s where it is non-null, otherwise take the value from other
        s = s.mask(pd.isnull, other)
    return s
because given a DataFrame with columns with ['a', 'b', 'c'], you can use it like a SQL coalesce,
df['d'] = coalesce(df.a, df.b, df.c)
For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible
Good code, but you have a typo for Python 3; the corrected body looks like this:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
# np.nan is used instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'A':[1,np.nan, 3, 4, 5],
                   'B':[np.nan, 2, 3, 4, np.nan]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0
