How to impute a column in a pandas DataFrame within each group [duplicate]

This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 6 years ago.
All,
I have a DataFrame with four columns ('key1', 'key2', 'data1', 'data2'). I inserted some NaN values into data1. Now I want to fill each NaN with the most frequently occurring value within its group after grouping with groupby(['key1', 'key2']).
import numpy as np
import pandas as pd

dt = pd.DataFrame({'key1': np.random.choice(['a', 'b'], size=100),
                   'key2': np.random.choice(['c', 'd'], size=100),
                   # stored as float so the column can hold NaN
                   'data1': np.random.randint(5, size=100).astype(float),
                   'data2': np.random.randn(100)},
                  columns=['key1', 'key2', 'data1', 'data2'])
# insert NaN (.ix was removed in pandas 1.0; use .loc instead)
dt.loc[[2, 6, 10], 'data1'] = np.nan
# group by key1 and key2
group = dt.groupby(['key1', 'key2'])['data1']
group.value_counts(dropna=False)
key1 key2 data1
a c 1.0 8
4.0 6
0.0 4
2.0 2
3.0 1
d 0.0 7
1.0 6
4.0 6
2.0 5
NaN 3
3.0 1
b c 0.0 7
2.0 7
1.0 3
3.0 2
4.0 2
d 2.0 11
1.0 10
0.0 3
3.0 3
4.0 3
What I want to do, for this example, is fill the NaN values in the data1 column with 0.0, the most frequent value within the group (key1=a, key2=d).
Thank you very much for your help!

Use .transform(lambda y: y.fillna(y.value_counts().idxmax()))
Before
key1 key2 data1
a c 1.0 6
3.0 5
0.0 4
2.0 3
4.0 3
NaN 1
d 1.0 11
3.0 9
0.0 5
2.0 5
4.0 5
b c 4.0 7
0.0 4
3.0 4
2.0 3
NaN 2
1.0 1
d 4.0 6
1.0 5
2.0 5
3.0 4
0.0 2
Name: data1, dtype: int64
After applying .transform(lambda y: y.fillna(y.value_counts().idxmax()))
dt['nan_filled'] = dt.groupby(['key1', 'key2'])['data1'].transform(lambda y: y.fillna(y.value_counts().idxmax()))
group = dt.groupby(['key1', 'key2'])['nan_filled']
group.value_counts(dropna=False)
key1 key2 nan_filled
a c 1.0 7
3.0 5
0.0 4
2.0 3
4.0 3
d 1.0 11
3.0 9
0.0 5
2.0 5
4.0 5
b c 4.0 9
0.0 4
3.0 4
2.0 3
1.0 1
d 4.0 6
1.0 5
2.0 5
3.0 4
0.0 2
Name: nan_filled, dtype: int64
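The one-liner above can be checked on a small, fixed dataset. The following is a minimal sketch, not the answerer's exact code: the helper name fill_with_group_mode and the sample data are assumptions made for illustration, and the guard for all-NaN groups is an addition (idxmax() raises on an empty value_counts() result).

```python
import numpy as np
import pandas as pd

# Small deterministic sample (hypothetical data, not from the question).
df = pd.DataFrame({
    "key1": ["a", "a", "a", "a", "b", "b", "b"],
    "key2": ["d", "d", "d", "d", "c", "c", "c"],
    "data1": [0.0, 0.0, 1.0, np.nan, 2.0, np.nan, 2.0],
})

def fill_with_group_mode(s):
    """Fill NaN in one group with that group's most frequent value."""
    counts = s.value_counts()  # NaN is excluded by default
    if counts.empty:
        return s  # all-NaN group: nothing to impute from
    return s.fillna(counts.idxmax())

df["data1_filled"] = (
    df.groupby(["key1", "key2"])["data1"].transform(fill_with_group_mode)
)
print(df)
```

Ties between equally frequent values are resolved by whichever value value_counts() lists first, so the result for tied groups should be treated as arbitrary.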

Related

Replace NaN with the average of the last 5 values - Pandas

I want to know how I can replace each NaN in my dataset with the average of the last 5 values.
Column A  Column B
1         2
2         5
3         5
4         2
5         2
NaN       2
NaN       2
1         2
1         2
1         2
1         NaN
1         2
1         2
For example, in this case the first NaN will be the average of (1, 2, 3, 4, 5), and the second NaN will be the average of (2, 3, 4, 5, and the value filled in for the first NaN).
I have tried
df.fillna(df.mean())

As mentioned, this has been answered here, but the updated version for the latest pandas is as follows:
import numpy as np
import pandas as pd

data = {'col1': [1, 2, 3, 4, 5, np.nan, np.nan, 1, 1, 1, 1, 1, 1],
        'col2': [2, 5, 5, 2, 2, 2, 2, 2, 2, 2, np.nan, 2, 2]}
df = pd.DataFrame(data)
window_size = 5
# window_size + 1 so each window spans the current row plus the 5 rows before it
df = df.fillna(df.rolling(window_size + 1, min_periods=1).mean())
Output:
col1 col2
0 1.0 2.0
1 2.0 5.0
2 3.0 5.0
3 4.0 2.0
4 5.0 2.0
5 3.0 2.0
6 3.5 2.0
7 1.0 2.0
8 1.0 2.0
9 1.0 2.0
10 1.0 2.0
11 1.0 2.0
12 1.0 2.0
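Note that the rolling mean only averages observed values, so the second NaN in col1 becomes 3.5 (the mean of 2, 3, 4, 5) rather than including the value filled in for the first NaN, as the question describes. If that sequential behaviour is wanted, a plain loop is one option. This is a minimal sketch; fill_with_trailing_mean is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_with_trailing_mean(series, window=5):
    """Fill each NaN with the mean of up to `window` preceding values,
    where previously filled values count as ordinary values."""
    values = series.astype(float).to_numpy(copy=True)
    for i in range(len(values)):
        if np.isnan(values[i]):
            prev = values[max(0, i - window):i]
            prev = prev[~np.isnan(prev)]  # skip leading NaNs that could not be filled
            if prev.size:
                values[i] = prev.mean()
    return pd.Series(values, index=series.index, name=series.name)

col1 = pd.Series([1, 2, 3, 4, 5, np.nan, np.nan, 1, 1, 1, 1, 1, 1], name="col1")
filled = fill_with_trailing_mean(col1)
print(filled)
```

Here the first NaN becomes 3.0 and the second becomes 3.4, the mean of (2, 3, 4, 5, 3.0). The loop is slower than the vectorized rolling version, which may matter for large frames.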

How to interpolate in Pandas using only previous values?

This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
Group by id, then interpolate using only the values in the rows above, not the rows below.

This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group, hence index 6 will return 0, not 2.)

You can group by id and then loop over the groups to interpolate. For id = 2, the interpolation will not give you the value 2.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
# interpolate each id group separately, then stitch the groups back together
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = pd.concat(data).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN

pandas.Series.interpolate currently does not support what you want, so to achieve your goal you need two groupbys that account for your wish to use only previous rows. The idea is to put each missing value into one group together with only the rows that precede it (this may have limitations if you have several consecutive missing values, but it works well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID", "extrapolate"], each missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do extrapolation (with spline of order=1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.
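The grouping key from the first step of the answer above can be reproduced on its own to see how it works. This is a minimal sketch that swaps in groupby(...).transform in place of apply, on the assumption that transform keeps the original row index across pandas versions; previous_rows_key is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 2, 3],
    "Value": [5.0, 6.0, np.nan, np.nan, 8.0, 4.0, np.nan, 10.0, np.nan],
})

def previous_rows_key(values):
    """Label each row so that a NaN shares its label with the rows before it.

    cumsum() increments at every NaN; shift() delays that increment by one
    row so the NaN keeps the label of the preceding rows; bfill() fills the
    first row of the group, which shift() leaves empty.
    """
    return values.isnull().cumsum().shift().bfill()

df["extrapolate"] = df.groupby("ID")["Value"].transform(previous_rows_key)
print(df)
```

A group consisting of a single NaN row (ID 3 here) has no preceding rows, so its key stays NaN.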

How to add the sum of some column at the end of dataframe

I have a pandas dataframe with 11 columns. I want to add the sums of all values in column 9 and column 10 to the end of the table. So far I have tried two methods:
Assigning the data to the cell with dataframe.iloc[rownumber, 8]. This results in an out-of-bounds error.
Creating a vector with some blanks ('') by using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice as it added the total vector as a vertical vector at the end rather than a horizontal one. What can I do to solve the issue?

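You need to use pandas.DataFrame.append with ignore_index=True (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement),
so use: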
dataframe=dataframe.append(dataframe[['Column 9','Column 10']].sum(),ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
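On pandas 2.0 and later, DataFrame.append is no longer available, so the same totals row has to be added another way. The following is a minimal sketch using pd.concat on the example data from this answer; it is one possible equivalent, not the answerer's own code.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [2, 3, 4, 5],
    "col3": [5, 6, 7, 8],
    "col4": [5, 6, 7, 8],
})

# Column sums as a Series, turned into a one-row DataFrame.
totals = df[["col2", "col3"]].sum().to_frame().T

# Stack the totals row under the data; columns missing from it become NaN.
result = pd.concat([df, totals], ignore_index=True)
print(result)
```

Calling result.fillna('') afterwards blanks out the NaN cells as in the answer, at the cost of turning those columns into object dtype.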

Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'. df1 is the DataFrame that needs to be summed, and df3 is a one-row DataFrame containing only the sums:
data = [[df1.a.sum(),df1.b.sum()]]
df3 = pd.DataFrame(data,columns=['a','b'])
Then append it to the end:
df1.append(df3)

Simply try this (replace test with your dataframe name).
Row-wise sum (which is what you asked for):
test['Total'] = test[['col9','col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9','col10']].sum()
test.fillna('',inplace=True)
print(test)

IIUC, this is what you need (change the numbers 8 and 9 to suit your needs):
df['total']=df.iloc[ : ,[8,9]].sum(axis=1) #horizontal sum
df['total1']=df.iloc[ : ,[8,9]].sum().sum() #Vertical sum
df.loc['total2']=df.iloc[ : ,[8,9]].sum() # vertical sum in rows for only columns 8 & 9
Example
a=np.arange(0, 11, 1)
b=np.random.randint(10, size=(5,11))
df=pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN

pandas filling nans by mean of before and after non-nan values

I would like to fill df's nan with an average of adjacent elements.
Consider a dataframe:
df = pd.DataFrame({'val': [1,np.nan, 4, 5, np.nan, 10, 1,2,5, np.nan, np.nan, 9]})
val
0 1.0
1 NaN
2 4.0
3 5.0
4 NaN
5 10.0
6 1.0
7 2.0
8 5.0
9 NaN
10 NaN
11 9.0
My desired output is:
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0 <<< deadend
10 7.0 <<< deadend
11 9.0
I've looked into other solutions such as Fill cell containing NaN with average of value before and after, but this won't work in case of two or more consecutive np.nans.
Any help is greatly appreciated!

Use ffill + bfill and divide by 2:
df = (df.ffill()+df.bfill())/2
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0
10 7.0
11 9.0
EDIT: If the first or last element is NaN, then use (as suggested by Dark):
df = pd.DataFrame({'val':[np.nan,1,np.nan, 4, 5, np.nan,
10, 1,2,5, np.nan, np.nan, 9,np.nan,]})
df = (df.ffill()+df.bfill())/2
df = df.bfill().ffill()
print(df)
val
0 1.0
1 1.0
2 2.5
3 4.0
4 5.0
5 7.5
6 10.0
7 1.0
8 2.0
9 5.0
10 7.0
11 7.0
12 9.0
13 9.0

Although with multiple consecutive NaNs it doesn't produce the exact output you specified, other users reaching this page may actually prefer the effect of the interpolate() method:
df = df.interpolate()
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 6.3
10 7.7
11 9.0

Generate New DataFrame without NaN Values

I have the following DataFrame:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I am trying to do is generate a new DataFrame without the NaN values.
There is always the same number of NaN values in each row.
The final Dataframe should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.

Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the number of non-NaN values is inconsistent across rows, for example 4 values in the first row and 3 in each of the others, then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here we cannot remove NaN's in last column.

Use the justify function (a helper that has to be defined separately; it is not part of pandas or NumPy) and select the first 3 columns:
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
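The justify function is not defined in this answer. As a stand-in, the following minimal sketch shows one way to push the non-NaN values of each row to the left with NumPy; left_justify is a hypothetical helper written for this illustration and is not the original justify implementation.

```python
import numpy as np
import pandas as pd

def left_justify(values):
    """Move the non-NaN entries of each row to the left, keeping their order."""
    is_nan = np.isnan(values)
    # A stable sort on the NaN mask puts valid entries (False) before NaN
    # entries (True) without reordering the valid entries among themselves.
    order = np.argsort(is_nan, axis=1, kind="stable")
    return np.take_along_axis(values, order, axis=1)

nan = np.nan
df = pd.DataFrame({
    "a": [nan, nan, 1.0, 1.0, nan],
    "b": [2.0, 2.0, nan, 2.0, 2.0],
    "c": [nan, 3.0, 3.0, nan, nan],
    "d": [4.0, nan, 4.0, 4.0, 4.0],
    "e": [5.0, 5.0, nan, nan, 5.0],
})

result = pd.DataFrame(
    left_justify(df.to_numpy())[:, :3].astype(int),
    columns=list("xyz"),
    index=df.index,
)
print(result)
```

The slice [:, :3] relies on every row having exactly three non-NaN values, as stated in the question; with fewer, astype(int) would be applied to NaN.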

If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5

You can use the pandas DataFrame method df.fillna().
This method replaces NaN or NA values with the value you pass in.
df.fillna(value_to_replace_NaN)
import numpy as np
import pandas as pd
data = {
'A':[np.nan, 2.0, np.nan, 4.0, 5.0],
'B':[np.nan, 2.0, 3.0, np.nan, 5.0],
'C':[1.0 , np.nan, 3.0, 4.0, np.nan],
'D':[1.0 , 2.0, np.nan, 4.0, np.nan,],
'E':[np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to a particular column only, the syntax is:
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use the pandas replace() method to replace np.nan
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace(np.nan, 0) # Replacing only in column A
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
