fill missing values based on the last value [duplicate] - python

I am dealing with pandas DataFrames like this:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 NaN
5 2 NaN
6 1 300
7 1 NaN
I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 20
5 2 200
6 1 300
7 1 300
Is there some slick way to do this without manually looping over rows?

You could perform a groupby/forward-fill operation on each group:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)
yields
id x
0 1 10.0
1 1 20.0
2 2 100.0
3 2 200.0
4 1 20.0
5 2 200.0
6 1 300.0
7 1 300.0
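
Note that a pure forward fill leaves a NaN in place when it is the first value in its group, because there is nothing earlier to copy. A minimal sketch of that edge case, using made-up data in the same shape as above:
import numpy as np
import pandas as pd

# 'id' 3 starts with NaN, so ffill alone cannot fill it
df = pd.DataFrame({'id': [3, 3, 1, 1], 'x': [np.nan, 5, 10, np.nan]})
df['x'] = df.groupby('id')['x'].ffill()
print(df)
#    id     x
# 0   3   NaN   <- still missing: no earlier value in group 3
# 1   3   5.0
# 2   1  10.0
# 3   1  10.0
The next answer addresses exactly this case.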

df
id val
0 1 23.0
1 1 NaN
2 1 NaN
3 2 NaN
4 2 34.0
5 2 NaN
6 3 2.0
7 3 NaN
8 3 NaN
df.sort_values(['id','val']).groupby('id').ffill()
id val
0 1 23.0
1 1 23.0
2 1 23.0
4 2 34.0
3 2 34.0
5 2 34.0
6 3 2.0
7 3 2.0
8 3 2.0
Use sort_values, groupby and ffill so that rows whose first value (or first several values) in a group are NaN also get filled: sorting by the value column pushes the NaNs to the end of each group, where the forward fill can reach them.
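One caveat: sorting by the value column changes what "previous" means, since rows are no longer filled in their original order. If you want to keep strict forward-fill semantics and only additionally fill leading NaNs, a per-group ffill followed by a per-group bfill avoids the reordering (a sketch, using the same df as above):
# forward-fill within each id group, then backward-fill
# so leading NaNs take the next value in their group
df['val'] = df.groupby('id')['val'].ffill()
df['val'] = df.groupby('id')['val'].bfill()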

Solution for multi-key problem:
In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.
import os
import pandas as pd

# sort to make indexing faster
df.sort_values(by=['date', 'region', 'type'], inplace=True)
# collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))
# record column names
df_cols = df.columns
# delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass
# steps:
# 1) grab rows with a particular region and type
# 2) forward-fill to fill nulls
# 3) backward-fill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df.ffill(inplace=True)   # fillna(method='ffill') is deprecated
        group_df.bfill(inplace=True)
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)
Checking the result:
# load in the ffill_df; the saved index comes back as the first column
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = ['date'] + list(df_cols)
ffill_df.index = ffill_df.date
ffill_df.drop('date', axis=1, inplace=True)
ffill_df.head()
#compare new and old dataframe
print(df.shape)
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
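If the CSV round trip is not actually needed, the same fill can be done in memory with a single groupby over both keys. A sketch under the same column names as above (not the original answer's approach):
# forward- then backward-fill every (region, type) group in one pass
df = df.sort_values(by=['date', 'region', 'type'])
filled = df.groupby(['region', 'type'], group_keys=False).apply(
    lambda g: g.ffill().bfill())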

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the values from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2.iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with .loc, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, then the indices get messed up (for example, index 0 of tdf['b'] has to be NaN, but it'll have a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
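To put the result back into the frame from the question, the mapped series can be assigned directly (assuming tdf exists with the same index); map looks each non-NaN value of s up in df2['a'] and propagates NaN for the rest:
tdf['b'] = s.map(df2['a'])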

How to fill NaN values based on previous columns

I have an initial column (A) with no missing data but with repeated values. How do I fill the missing data in the next column (B) so that each value on the left is always paired with the same value on the right? I would also like any other columns (C) to remain the same.
For example, this is what I have
A B C
1 1 20 4
2 2 NaN 8
3 3 NaN 2
4 2 30 9
5 3 40 1
6 1 NaN 3
And this is what I want
A B C
1 1 20 4
2 2 30* 8
3 3 40* 2
4 2 30 9
5 3 40 1
6 1 20* 3
Asterisk on filled values.
This needs to be scalable with a very large dataframe.
Additionally, if a value in the left column were paired with more than one value on the right in separate observations, how would I fill with the mean?
You can use groupby on 'A' and use first to find the first corresponding value in 'B' (it will not select NaN).
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# find first 'B' value for each 'A'
lookup = df[['A', 'B']].groupby('A').first()['B']
# only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# replace NaN values in 'B' with lookup values
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)  # .loc on df itself avoids chained assignment
print(df)
Which outputs:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3
If there are many NaN values in 'B' you might want to exclude them before you use groupby.
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# Only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# Find first 'B' value for each 'A'
lookup = df[~nan_mask][['A', 'B']].groupby('A').first()['B']
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)
print(df)
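The question also asks how to fill with the mean when a key in 'A' maps to more than one value in 'B'. A hedged sketch along the same lines (means is a name introduced here): compute the per-key means with groupby, then map them onto the missing rows:
# mean of the non-NaN 'B' values for each 'A' key
means = df.groupby('A')['B'].mean()
df.loc[df['B'].isnull(), 'B'] = df.loc[df['B'].isnull(), 'A'].map(means)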
You could do sort_values first then forward fill column B based on column A. The way to implement this will be:
import pandas as pd
import numpy as np
x = {'A':[1,2,3,2,3,1],
'B':[20,np.nan,np.nan,30,40,np.nan],
'C':[4,8,2,9,1,3]}
df = pd.DataFrame(x)
# sort_values first, then forward-fill column B
# this will get the right values for you while maintaining
# the original order of the dataframe
df['B'] = df.sort_values(by=['A','B'])['B'].ffill()
print (df)
Output will be:
Original data:
A B C
0 1 20.0 4
1 2 NaN 8
2 3 NaN 2
3 2 30.0 9
4 3 40.0 1
5 1 NaN 3
Updated data:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3

Fill missing data based on the other columns same data [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I want to use the keys in columns one and two: where rows share the same keys and column three is not entirely NaN, impute the missing entries with the existing value in column three from a row with the same keys.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that the key pair (1, 3) contains no value, because no existing value is available for that group.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill, which gave me a rather strange result where it forward-filled column two instead. I am using this code for the forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which needs apply with a lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three'] \
                .apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and you need to replace NaN with some constant, e.g. the mean per group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three'] \
                .apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
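An equivalent and usually faster spelling for the mean case uses transform with the built-in 'mean' aggregation instead of a Python-level lambda (a sketch, same df as above):
df['three'] = df['three'].fillna(
    df.groupby(['one','two'], sort=False)['three'].transform('mean'))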
You can sort the data by the column with missing values, then groupby and forward-fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
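Note that sort_values with inplace=True leaves the frame in the sorted order. If the original row order matters, restore it afterwards:
df.sort_index(inplace=True)  # restore the original row order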

Join on a fragment of a dataframe

I am trying to join a fragment of a dataframe with another one. The structure of the dataframe to join is simplified below:
left:
ID f1 TIME
1 10 1
3 10 1
7 10 1
9 10 2
2 10 2
1 10 2
3 10 2
right:
ID f2 f3
1 0 11
7 9 11
I need to select the left dataset by time and attach the right one; the result I would like to have is the following:
left:
ID f1 TIME f2 f3
1 10 1 0 11
3 10 1 nan nan
7 10 1 9 11
9 10 2 nan nan
2 10 2 nan nan
1 10 2 nan nan
3 10 2 nan nan
Currently I am usually joining dataframes in this way:
left = left.join(right.set_index('ID'), on='ID')
In this case I am using:
left[left.TIME == 1] = left[left.TIME == 1].join(right.set_index('ID'), on='ID')
I have also tried with merge, but the result is the left dataframe without any of the other columns.
Finally, the structure of my script needs to do this for every unique TIME in the dataframe, thus:
for t in numpy.unique(left.TIME):
    # do join on the fragment left.TIME == t
If I save the value returned by the join function in a new dataframe, everything works fine, but trying to assign it back into the left dataframe does not work.
EDIT: The IDs of the left dataset can be present multiple times, but not inside the same TIME value.
You can filter first by boolean indexing, then merge, and concat last:
df1 = left[left['TIME']==1]
#alternative
#df1 = left.query('TIME == 1')
df2 = left[left['TIME']!=1]
#alternative
#df2 = left.query('TIME != 1')
df = pd.concat([df1.merge(right, how='left'), df2])
print (df)
ID TIME f1 f2 f3
0 1 1 10 0.0 11.0
1 3 1 10 NaN NaN
2 7 1 10 9.0 11.0
3 9 2 10 NaN NaN
4 2 2 10 NaN NaN
5 1 2 10 NaN NaN
6 3 2 10 NaN NaN
EDIT: merge creates a default index, so a possible solution is to move the original index to a column first and then set it back afterwards:
print (left)
ID f1 TIME
10 1 10 1
11 3 10 1
12 7 10 1
13 9 10 2
14 2 10 2
15 1 10 2
16 3 10 2
#df = left.merge(right, how='left')
df1 = left[left['TIME']==1]
df2 = left[left['TIME']!=1]
df = pd.concat([df1.reset_index().merge(right, how='left').set_index('index'), df2])
print (df)
ID TIME f1 f2 f3
10 1 1 10 0.0 11.0
11 3 1 10 NaN NaN
12 7 1 10 9.0 11.0
13 9 2 10 NaN NaN
14 2 2 10 NaN NaN
15 1 2 10 NaN NaN
16 3 2 10 NaN NaN
EDIT:
After discussion, and after the input data was modified, it is possible to use:
df = left.merge(right, how='left', on=['ID','TIME'])
This is one way:
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
res = pd.concat([left.drop_duplicates('ID').merge(right, how='left'),
                 left[left.duplicated(subset=['ID'])]])
# ID TIME f1 f2 f3
# 0 1 1 10 0.0 11.0
# 1 3 1 10 NaN NaN
# 2 7 1 10 9.0 11.0
# 3 9 2 10 NaN NaN
# 4 2 2 10 NaN NaN
# 5 1 2 10 NaN NaN
# 6 3 2 10 NaN NaN
Note that columns f2 and f3 become float since NaN is considered a float.
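If the integer dtypes of f2 and f3 matter downstream, pandas' nullable integer dtype (available since pandas 0.24) can hold missing values without the float upcast; a sketch, not part of the original answer:
# 'Int64' (capital I) is the nullable integer dtype; missing entries become <NA>
res[['f2', 'f3']] = res[['f2', 'f3']].astype('Int64')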

Randomly assign values to subset of rows in pandas dataframe

I am using Python 2.7.11 with Anaconda.
I understand how to set the value of a subset of rows of a Pandas DataFrame like Modifying a subset of rows in a pandas dataframe, but I need to randomly set these values.
Say I have the dataframe df below. How can I randomly set the values of group == 2 so they are not all equal to 1.0?
import pandas as pd
import numpy as np
df = pd.DataFrame([1,1,1,2,2,2], columns = ['group'])
df['value'] = np.nan
df.loc[df['group'] == 2, 'value'] = np.random.randint(0,5)
print df
group value
0 1 NaN
1 1 NaN
2 1 NaN
3 2 1.0
4 2 1.0
5 2 1.0
df should look something like the below:
print df
group value
0 1 NaN
1 1 NaN
2 1 NaN
3 2 1.0
4 2 4.0
5 2 2.0
You must determine the size of group 2 so that a separate random value is drawn for each row:
g2 = df['group'] == 2
df.loc[g2, 'value'] = np.random.randint(5, size=g2.sum())
print(df)
group value
0 1 NaN
1 1 NaN
2 1 NaN
3 2 3.0
4 2 4.0
5 2 2.0
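
For reproducible draws, the modern NumPy Generator API with a fixed seed is one option (a sketch assuming NumPy >= 1.17; the question itself runs the legacy API under Python 2.7):
rng = np.random.default_rng(seed=42)                      # fixed seed for repeatability
df.loc[g2, 'value'] = rng.integers(0, 5, size=g2.sum())   # one draw per row in group 2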
