I hope you are doing well.
I need help performing a somewhat complex "NaN replace" on my dataframe.
What is the best way to replace NaN values in a pandas column with the mode of that column's other values, filtered by the values in other columns?
Let me illustrate my problem:
import random
import numpy as np
import pandas as pd
data = {'Region': [1,1,1,2,2,2,1,2,2,2,2,1,1,1,2,1],
        'Country': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b'],
        'GDP': [100,100,101,105,105,110,np.nan,np.nan,200,200,100,150,100,150,np.nan,np.nan]}
df = pd.DataFrame.from_dict(data)
df:
Region Country GDP
0 1 a 100.0
1 1 a 100.0
2 1 a 101.0
3 2 a 105.0
4 2 a 105.0
5 2 a 110.0
6 1 a NaN
7 2 a NaN
8 2 b 200.0
9 2 b 200.0
10 2 b 100.0
11 1 b 150.0
12 1 b 100.0
13 1 b 150.0
14 2 b NaN
15 1 b NaN
I would like to replace the NaN values in the GDP column with the mode of the other GDP values for the same Country and Region.
For the NaN at index 6, for example, I wish to replace it with 100, since that is the mode of the GDP values for Region 1 & Country a.
The desired output should look like this:
Region Country GDP
0 1 a 100
1 1 a 100
2 1 a 101
3 2 a 105
4 2 a 105
5 2 a 110
6 1 a 100
7 2 a 105
8 2 b 200
9 2 b 200
10 2 b 100
11 1 b 150
12 1 b 100
13 1 b 150
14 2 b 200
15 1 b 150
Thank you for your help, I hope you have an excellent day!
Pandas' fillna allows filling missing values from another series, so we need a series that contains the mode of each Country/Region group at the corresponding indices.
To get this series, we can use pandas' groupby().transform() operation: it groups the dataframe, applies a function to each group, and broadcasts the results back to the original shape.
If we use mode as-is, transform raises an error, because mode can return multiple values, preventing pandas from broadcasting them back to the original shape. So we force it to return a single value by picking the first one (or the last, or whichever you prefer).
df["GDP"].fillna(
df.groupby(["Country", "Region"])["GDP"].transform(
lambda x: x.mode()[0]
)
)
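For completeness, a runnable sketch with the sample data from the question; note that fillna returns a new series, so the result needs to be assigned back to the column:

import numpy as np
import pandas as pd

data = {'Region': [1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1],
        'Country': ['a'] * 8 + ['b'] * 8,
        'GDP': [100, 100, 101, 105, 105, 110, np.nan, np.nan,
                200, 200, 100, 150, 100, 150, np.nan, np.nan]}
df = pd.DataFrame(data)

# Per-group mode broadcast back to the original index, used as fill values
df['GDP'] = df['GDP'].fillna(
    df.groupby(['Country', 'Region'])['GDP'].transform(lambda x: x.mode()[0])
)
print(df)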
Related
I am dealing with pandas DataFrames like this:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 NaN
5 2 NaN
6 1 300
7 1 NaN
I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:
id x
0 1 10
1 1 20
2 2 100
3 2 200
4 1 20
5 2 200
6 1 300
7 1 300
Is there some slick way to do this without manually looping over rows?
You could perform a groupby/forward-fill operation on each group:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)
yields
id x
0 1 10.0
1 1 20.0
2 2 100.0
3 2 200.0
4 1 20.0
5 2 200.0
6 1 300.0
7 1 300.0
df
id val
0 1 23.0
1 1 NaN
2 1 NaN
3 2 NaN
4 2 34.0
5 2 NaN
6 3 2.0
7 3 NaN
8 3 NaN
df.sort_values(['id','val']).groupby('id').ffill()
id val
0 1 23.0
1 1 23.0
2 1 23.0
4 2 34.0
3 2 34.0
5 2 34.0
6 3 2.0
7 3 2.0
8 3 2.0
Use sort_values, groupby and ffill so that if the first value (or first several values) of a group is NaN, it also gets filled.
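A runnable sketch of the same idea (assuming the id/val frame shown above), assigning the filled values back so the original row order is preserved:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'val': [23, np.nan, np.nan, np.nan, 34, np.nan, 2, np.nan, np.nan]})

# Sort so the non-NaN value comes first within each id, forward fill per group,
# then rely on index alignment to write the values back in the original order.
df['val'] = df.sort_values(['id', 'val']).groupby('id')['val'].ffill()
print(df)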
Solution for multi-key problem:
In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.
import os
import pandas as pd
#sort to make indexing faster
df.sort_values(by=['date','region','type'], inplace=True)
#collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))
#record column names
df_cols = df.columns
#delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass
# steps:
# 1) grab rows with a particular region and type
# 2) use forwardfill to fill nulls
# 3) use backwardfill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df = group_df.ffill()
        group_df = group_df.bfill()
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)
Checking the result:
#load in the ffill_df
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = ['date'] + list(df_cols)  # index column first, then the recorded column names
ffill_df.index = ffill_df.date
ffill_df.drop('date', axis=1, inplace=True)
ffill_df.head()
#compare new and old dataframe
print(df.shape)
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
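If the frame fits in memory, the same ffill-then-bfill per group can usually be done directly with groupby, without the CSV round trip. A minimal sketch under that assumption (the column names here are stand-ins for illustration):

import numpy as np
import pandas as pd

# Small stand-in frame; assumes the real data has 'region' and 'type' key columns
# plus value columns containing NaNs, as in the loop above.
df = pd.DataFrame({
    'region': ['east', 'east', 'east', 'west', 'west'],
    'type':   ['a', 'a', 'a', 'b', 'b'],
    'value':  [1.0, np.nan, 3.0, np.nan, 5.0],
})

keys = ['region', 'type']
value_cols = [c for c in df.columns if c not in keys]

# Forward fill within each (region, type) group, then backward fill within each
# group to catch NaNs at the start of a group.
df[value_cols] = df.groupby(keys)[value_cols].ffill()
df[value_cols] = df.groupby(keys)[value_cols].bfill()
print(df)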
Assume the dataframes df_1 and df_2 below, which I want to merge "left".
df_1= pd.DataFrame({'A': [1,2,3,4,5],
'B': [10,20,30,40,50]})
df_2= pd.DataFrame({'AA': [1,5],
'BB': [10,50],
'CC': [100, 500]})
>>> df_1
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
>>> df_2
AA BB CC
0 1 10 100
1 5 50 500
I want to perform a merging which will result to the following output:
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
So, I tried pd.merge(df_1, df_2, left_on=['A', 'B'], right_on=['AA', 'BB'], how='left') which unfortunately duplicates the columns upon which I merge:
A B AA BB CC
0 1 10 1.0 10.0 100.0
1 2 20 NaN NaN NaN
2 3 30 NaN NaN NaN
3 4 40 NaN NaN NaN
4 5 50 5.0 50.0 500.0
How do I achieve this without needing to drop the columns 'AA' and 'BB'?
Thank you!
You can use rename and join by A, B columns only:
df = pd.merge(df_1, df_2.rename(columns={'AA':'A','BB':'B'}), on=['A', 'B'], how='left')
print (df)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
pd.merge's right_on parameter accepts array-like arguments. From the docs:
Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
df_1.merge(
df_2["CC"], left_on=["A", "B"], right_on=[df_2["AA"], df_2["BB"]], how="left"
)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
Merging against df_2["CC"] brings across only the CC column, so the result does not contain the key columns you would otherwise have to drop.
right_on takes a list of the arrays you want to merge on; [df_2['AA'], df_2['BB']] passes the two key columns as array-likes of the length of the right DataFrame (equivalent to [*df_2[['AA', 'BB']].to_numpy().T]).
IMHO this method is cumbersome. As @jezrael posted, renaming the columns and merging on them is more pythonic/pandorable.
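As a usage note on the rename approach: trimming df_2 to just the key columns plus the extra column makes it explicit which columns are brought across. A minimal sketch with the frames from the question:

import pandas as pd

df_1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                     'B': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'AA': [1, 5],
                     'BB': [10, 50],
                     'CC': [100, 500]})

# Rename the key columns to match df_1 and keep only the columns we need,
# so no duplicated key columns appear in the result.
right = df_2.rename(columns={'AA': 'A', 'BB': 'B'})[['A', 'B', 'CC']]
out = df_1.merge(right, on=['A', 'B'], how='left')
print(out)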
I have an initial column (A) with no missing data but with repeated values. How do I fill the missing values in the next column (B) so that the same value in A is always paired with the same value in B? I would also like any other columns (C) to remain the same.
For example, this is what I have
A B C
1 1 20 4
2 2 NaN 8
3 3 NaN 2
4 2 30 9
5 3 40 1
6 1 NaN 3
And this is what I want
A B C
1 1 20 4
2 2 30* 8
3 3 40* 2
4 2 30 9
5 3 40 1
6 1 20* 3
Asterisk on filled values.
This needs to be scalable with a very large dataframe.
Additionally, if a value in the left column were paired with more than one value on the right in separate observations, how would I fill with the mean instead?
You can use groupby on 'A' and use first to find the first corresponding value in 'B' (it will not select NaN).
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# find first 'B' value for each 'A'
lookup = df[['A', 'B']].groupby('A').first()['B']
# only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# replace NaN values in 'B' with lookup values
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)
print(df)
Which outputs:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3
If there are many NaN values in 'B' you might want to exclude them before you use groupby.
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# Only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# Find first 'B' value for each 'A'
lookup = df[~nan_mask][['A', 'B']].groupby('A').first()['B']
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)
print(df)
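The question also asks how to fill with the mean when the same 'A' value maps to several different 'B' values. A minimal sketch of that variant, using groupby().transform('mean') on the same sample frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 1],
                   'B': [20, None, None, 30, 40, None],
                   'C': [4, 8, 2, 9, 1, 3]})

# Each missing 'B' is filled with the mean of the non-missing 'B' values
# sharing the same 'A' (transform ignores NaN when computing the mean).
df['B'] = df['B'].fillna(df.groupby('A')['B'].transform('mean'))
print(df)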
You could sort_values first, then forward fill column B within each value of column A. The way to implement this is:
import pandas as pd
import numpy as np
x = {'A':[1,2,3,2,3,1],
'B':[20,np.nan,np.nan,30,40,np.nan],
'C':[4,8,2,9,1,3]}
df = pd.DataFrame(x)
# sort_values first, then forward fill column B; assigning back by index
# keeps the original order of the dataframe intact
df['B'] = df.sort_values(by=['A','B'])['B'].ffill()
print (df)
Output will be:
Original data:
A B C
0 1 20.0 4
1 2 NaN 8
2 3 NaN 2
3 2 30.0 9
4 3 40.0 1
5 1 NaN 3
Updated data:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3
I have a large dataframe of 450 columns and 550,000 rows.
Among the columns I have:
73 float columns
30 date columns
the remaining columns are objects
I would like to produce a description of my variables, not only the usual describe() but also other statistics in the same matrix. In the end, I want a description matrix covering all 450 variables, with a detailed description of:
- dtype
- count
- count null values
- % number of null values
- max
- min
- 50%
- 75%
- 25%
- ......
For now, I just have a basic call that describes my data like this:
DataFrame.describe(include='all')
Do you have a function or method to do this more extensive description?
Thanks.
You can compute the extra statistics yourself and append them as rows to the final describe DataFrame:
Notice:
The first row of the final df is count, which counts only non-NaN values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7 1 5 a
1 b NaN 8 3 3 a
2 c NaN 9 5 6 a
3 d 5.0 4 7 9 b
4 e 5.0 2 1 2 b
5 f 4.0 3 0 4 b
df1 = df.describe(include = 'all')
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
# fraction of null values per column (multiply by 100 for a percentage)
df1.loc['% count'] = df.isnull().mean()
print (df1)
A B C D E F
count 6 4 6 6 6 6
unique 6 NaN NaN NaN NaN 2
top e NaN NaN NaN NaN b
freq 1 NaN NaN NaN NaN 3
mean NaN 4.5 5.5 2.83333 4.83333 NaN
std NaN 0.57735 2.88097 2.71416 2.48328 NaN
min NaN 4 2 0 2 NaN
25% NaN 4 3.25 1 3.25 NaN
50% NaN 4.5 5.5 2 4.5 NaN
75% NaN 5 7.75 4.5 5.75 NaN
max NaN 5 9 7 9 NaN
dtype object float64 int64 int64 int64 object
size 6 6 6 6 6 6
% count 0 0.333333 0 0 0 0
In pandas there is no single alternative function to describe(), and by itself it clearly does not display all the values you need, so combine its parameters with your own statistics as shown above.
By default, describe() on a DataFrame only summarizes the numeric columns. If a variable you think is numeric does not show up in describe(), change its type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns to hold the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
describe() on a non-numeric Series will give you some statistics (like count, unique and the most frequently-occurring value).
To call describe() on just the objects (strings) use describe(include = ['O']).
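To illustrate these points, a small sketch on a toy frame (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'price': ['10', '20', '30'],   # numeric values stored as strings
                   'size':  ['S', 'M', 'L'],      # categorical strings
                   'qty':   [1, 2, 3]})

# 1) cast string-typed numbers so describe() picks them up
df['price'] = df['price'].astype(float)

# 2) convert strings to numbers with a dictionary and map()
size_map = {'S': 1, 'M': 2, 'L': 3}
df['size_num'] = df['size'].map(size_map)

# 3) describe only the object (string) columns
print(df.describe(include=['O']))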
I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In rows 1 and 4 the C value is missing (NaN). I want to take it from rows 2 and 5 respectively (the first later occurrence with the same A,B values).
If no matching row is found, just put 0 (as in the last line).
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Using fillna I found bfill ("use NEXT valid observation to fill gap"), but the next observation has to be chosen logically (looking at the A,B values) rather than simply being the next value in column C.
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
You can also group and then call GroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of the NaNs in D, you could do:
df['D'] = df['D'].fillna('')
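For completeness, a runnable sketch of the direct groupby approach, reading the sample CSV from the question:

import io
import pandas as pd

csv_text = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Backward fill C within each (A, B) group, then put 0 where a group has no value at all
df['C'] = df.groupby(['A', 'B'])['C'].bfill().fillna(0).astype(int)
df['D'] = df['D'].fillna('')
print(df)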