Updating column in a dataframe based on multiple columns - python

I have a column named "age" with a few NaN values; the crude logic for deriving a missing age is to take the mean of age within groups formed by 2 key categorical variables - job and gender.
df = pd.DataFrame([[1,2,1,2,3,4,11,12,13,12,11,1,10], [19,23,np.nan,29,np.nan,32,27,48,39,70,29,51,np.nan],
['a','b','c','d','e','a','b','c','d','e','a','b','c'],['M','F','M','F','M','F','M','F','M','M','F','F','F']]).T
df.columns = ['col1','age','job','gender']
df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 NaN e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 NaN c M
df.groupby(['job','gender']).mean().reset_index()
job gender col1 age
0 a F 7.500000 30.5
1 a M 1.000000 19.0
2 b F 1.500000 37.0
3 b M 11.000000 27.0
4 c F NaN NaN
5 c M 7.666667 48.0
6 d F 7.500000 34.0
7 d M NaN NaN
8 e F NaN NaN
9 e M 7.500000 70.0
I want to update the age column with the values derived above. What is the optimal way of doing it? Should I store the result in another dataframe and loop through it to update?
Resultant output should look like this:
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 70.0 c M
Thanks.

Use Series.fillna with GroupBy.transform. Because the sample data contain no non-missing age for the combination (c, M), a NaN remains:
df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F
If you also need to replace that remaining NaN by grouping only by job, chain another fillna:
avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F
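If even the job-level mean can be missing, a final fallback to the overall mean can be chained in the same way; a small sketch extending the pattern above (not part of the original answer):
avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
# last resort: the overall mean of the column
df['age'] = df['age'].fillna(avg1).fillna(avg2).fillna(df['age'].mean())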

Related

Trying to group by using multiple columns

Using pandas, I am trying to group by multiple columns and then fill the DataFrame with rows for the person names that are not present.
For example, this is my DataFrame:
V1 V2 V3 PN
1 10 20 A
2 10 21 A
3 10 20 C
I have a unique person name list = ['A','B','C','D','E']
Expected outcome:
V1 V2 V3 PN
1 10 20 A
1 10 20 B
1 10 20 C
1 10 20 D
1 10 20 E
2 10 21 A
2 10 21 B
2 10 21 C
2 10 21 D
2 10 21 E
3 10 20 A
3 10 20 B
3 10 20 C
3 10 20 D
3 10 20 E
I was thinking of trying a pandas groupby statement, but it didn't work out.
Try this, using pd.MultiIndex with reindex to create additional rows:
import pandas as pd
df = pd.DataFrame({'Version 1': [1, 2, 3],
                   'Version 2': [10, 10, 10],
                   'Version 3': [20, 21, 20],
                   'Person Name': 'A A C'.split(' ')})
p_list = [*'ABCDE']
(df.set_index(['Version 1', 'Person Name'])
   .reindex(pd.MultiIndex.from_product([df['Version 1'].unique(), p_list],
                                        names=['Version 1', 'Person Name']))
   .groupby(level=0, group_keys=False).apply(lambda x: x.ffill().bfill())
   .reset_index())
Output:
Version 1 Person Name Version 2 Version 3
0 1 A 10.0 20.0
1 1 B 10.0 20.0
2 1 C 10.0 20.0
3 1 D 10.0 20.0
4 1 E 10.0 20.0
5 2 A 10.0 21.0
6 2 B 10.0 21.0
7 2 C 10.0 21.0
8 2 D 10.0 21.0
9 2 E 10.0 21.0
10 3 A 10.0 20.0
11 3 B 10.0 20.0
12 3 C 10.0 20.0
13 3 D 10.0 20.0
14 3 E 10.0 20.0
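If every 'Version 1' value carries a single pair of 'Version 2'/'Version 3' values, as in the example, the full grid can also be built with a cross join instead of reindexing; a sketch (not from the original answer; how='cross' requires pandas 1.2+):
people = pd.DataFrame({'Person Name': [*'ABCDE']})
# one row per ('Version 1', person) combination, carrying over Version 2/3
out = (df.drop(columns='Person Name')
         .drop_duplicates('Version 1')
         .merge(people, how='cross')
         .sort_values(['Version 1', 'Person Name'])
         .reset_index(drop=True))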

How to remove certain features that have a low completeness rate in a DataFrame (Python)

I have a DataFrame with more than 450 variables and more than 500,000 rows. However, some variables are over 90% null. I would like to delete the features with more than 90% empty rows.
I built the following description of my variables:
DataFrame:
df = pd.DataFrame({
'A':list('abcdefghij'),
'B':[4,np.nan,np.nan,np.nan,np.nan,np.nan, np.nan, np.nan, np.nan, np.nan],
'C':[7,8,np.nan,4,2,3,6,5, 4, 6],
'D':[1,3,5,np.nan,1,0,10,7, np.nan, 5],
'E':[5,3,6,9,2,4,7,3, 5, 9],
'F':list('aaabbbckfr'),
'G':[np.nan,8,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan, np.nan, np.nan]})
print(df)
A B C D E F G
0 a 4.0 7 1 5 a NaN
1 b NaN 8 3 3 a 8.0
2 c NaN NaN 5 6 a NaN
3 d NaN 4 NaN 9 b NaN
4 e NaN 2 1 2 b NaN
5 f NaN 3 0 4 b NaN
6 g NaN 6 10 7 c NaN
7 h NaN 5 7 3 k NaN
8 i NaN 4 NaN 5 f NaN
9 j NaN 6 5 9 r NaN
Describe:
desc = df.describe(include = 'all')
d1 = desc.loc['varType'] = desc.dtypes
d3 = desc.loc['rowsNull'] = df.isnull().sum()
d4 = desc.loc['%rowsNull'] = round((d3/len(df))*100, 2)
print(desc)
A B C D E F G
count 10 1 10 10 10 10 1
unique 10 NaN NaN NaN NaN 6 NaN
top i NaN NaN NaN NaN b NaN
freq 1 NaN NaN NaN NaN 3 NaN
mean NaN 4 5.4 4.3 5.3 NaN 8
std NaN NaN 2.22111 3.16403 2.45176 NaN NaN
min NaN 4 2 0 2 NaN 8
25% NaN 4 4 1.5 3.25 NaN 8
50% NaN 4 5.5 4.5 5 NaN 8
75% NaN 4 6.75 6.5 6.75 NaN 8
max NaN 4 9 10 9 NaN 8
varType object float64 float64 float64 float64 object float64
rowsNull 0 9 1 2 0 0 9
%rowsNull 0 90 10 20 0 0 90
In this example we have just 2 features to delete: 'B' and 'G'.
But in my case I found 40 variables whose '%rowsNull' is greater than 90%. How can I leave these variables out of my modeling?
I have no idea how to do this.
Please help me.
Thanks.
First compare missing values with isnull and take the mean (this works because True values are counted as 1s), then filter by boolean indexing with loc, since we are removing columns:
df = df.loc[:, df.isnull().mean() <.9]
print (df)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Detail:
print (df.isnull().mean())
A 0.0
B 0.9
C 0.1
D 0.2
E 0.0
F 0.0
G 0.9
dtype: float64
You can find the columns with more than 90% null values and drop them:
cols_to_drop = df.columns[df.isnull().sum()/len(df) >= .90]
df.drop(cols_to_drop, axis = 1, inplace = True)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Based on your code, you could do something like
keepCols = desc.columns[desc.loc['%rowsNull'] < 90]
df = df[keepCols]
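Another option (not in the original answers) is DataFrame.dropna with its thresh argument, which keeps only the columns that have at least a given number of non-null values:
# keep columns that are more than 10% non-null, i.e. drop columns that are 90% or more null
thresh = int(len(df) * 0.1) + 1
df = df.dropna(axis=1, thresh=thresh)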

Pandas split/group dataframe by row values

I have a dataframe of the following form
In [1]: df
Out [1]:
A B C D
1 0 2 6 0
2 6 1 5 2
3 NaN NaN NaN NaN
4 9 3 2 2
...
15 2 12 5 23
16 NaN NaN NaN NaN
17 8 1 5 3
I'm interested in splitting the dataframe into multiple dataframes (or grouping it) by the NaN rows.
So resulting in something as follows
In [2]: df1
Out [2]:
A B C D
1 0 2 6 0
2 6 1 5 2
In [3]: df2
Out [3]:
A B C D
1 9 3 2 2
...
12 2 12 5 23
In [4]: df3
Out [4]:
A B C D
1 8 1 5 3
You could use the compare-cumsum-groupby pattern, where we find the all-null rows, cumulative sum those to get a group number for each subgroup, and then iterate over the groups:
In [114]: breaks = df.isnull().all(axis=1)
In [115]: groups = [group.dropna(how='all') for _, group in df.groupby(breaks.cumsum())]
In [116]: for group in groups:
...: print(group)
...: print("--")
...:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
--
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
--
A B C D
17 8.0 1.0 5.0 3.0
--
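If separate names like in the question are wanted, the list above can simply be unpacked (there are three groups in this example):
df1, df2, df3 = groups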
You can use locals() with groupby to split:
variables = locals()
for x, y in df.dropna(axis=0).groupby(df.isnull().all(1).cumsum()[~df.isnull().all(1)]):
    variables["df{0}".format(x + 1)] = y
df1
Out[768]:
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
df2
Out[769]:
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
I'd use a dictionary with groupby and cumsum:
dictofdfs = {}
for n, g in df.groupby(df.isnull().all(1).cumsum()):
    dictofdfs[n] = g.dropna()
Output:
dictofdfs[0]
A B C D
1 0.0 2.0 6.0 0.0
2 6.0 1.0 5.0 2.0
dictofdfs[1]
A B C D
4 9.0 3.0 2.0 2.0
15 2.0 12.0 5.0 23.0
dictofdfs[2]
A B C D
17 8.0 1.0 5.0 3.0
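The same dictionary can also be built in one line with a dict comprehension (a minor variation on the answers above):
dictofdfs = {n: g.dropna() for n, g in df.groupby(df.isnull().all(1).cumsum())}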

How to preserve column order when calling groupby and shift from pandas?

It seems that the columns get reordered by column index when calling pandas.DataFrame.groupby().shift(). The sort parameter applies only to rows.
Here is an example:
import pandas as pd
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
'E': ['a','b','c','d','e','f'],
'B': [10, 12, 10, 25, 10, 12],
'C': [100, 102, 100, 250, 100, 102],
'D': [1,2,3,4,5,6]
})
df.set_index('A',inplace=True)
df = df[['E','C','D','B']]
df
# E C D B
# A
#group1 a 100 1 10
#group1 b 102 2 12
#group2 c 100 3 10
#group2 d 250 4 25
#group3 e 100 5 10
#group3 f 102 6 12
Going from here, I want to achieve:
# E C D B C_s D_s B_s
# A
#group1 a 100 1 10 102.0 2.0 12.0
#group1 b 102 2 12 NaN NaN NaN
#group2 c 100 3 10 250.0 4.0 25.0
#group2 d 250 4 25 NaN NaN NaN
#group3 e 100 5 10 102.0 6.0 12.0
#group3 f 102 6 12 NaN NaN NaN
But
df[['C_s','D_s','B_s']]= df.groupby(level='A')[['C','D','B']].shift(-1)
Results in:
# E C D B C_s D_s B_s
# A
#group1 a 100 1 10 12.0 102.0 2.0
#group1 b 102 2 12 NaN NaN NaN
#group2 c 100 3 10 25.0 250.0 4.0
#group2 d 250 4 25 NaN NaN NaN
#group3 e 100 5 10 12.0 102.0 6.0
#group3 f 102 6 12 NaN NaN NaN
Introducing an artificial ordering of the columns helps to maintain the intrinsic logical connection of the columns:
df = df.sort_index(axis=1)
df[['B_s','C_s','D_s']]= df.groupby(level='A')[['B','C','D']].shift(-1).sort_index(axis=1)
df
# B C D E B_s C_s D_s
# A
#group1 10 100 1 a 12.0 102.0 2.0
#group1 12 102 2 b NaN NaN NaN
#group2 10 100 3 c 25.0 250.0 4.0
#group2 25 250 4 d NaN NaN NaN
#group3 10 100 5 e 12.0 102.0 6.0
#group3 12 102 6 f NaN NaN NaN
Why are the columns reordered in the first place?
In my opinion it is a bug.
A working alternative is a custom lambda function:
df[['C_s','D_s','B_s']] = df.groupby(level='A', group_keys=False)[['C','D','B']].apply(lambda x: x.shift(-1))
print (df)
E C D B C_s D_s B_s
A
group1 a 100 1 10 102.0 2.0 12.0
group1 b 102 2 12 NaN NaN NaN
group2 c 100 3 10 250.0 4.0 25.0
group2 d 250 4 25 NaN NaN NaN
group3 e 100 5 10 102.0 6.0 12.0
group3 f 102 6 12 NaN NaN NaN
Thank you @cᴏʟᴅsᴘᴇᴇᴅ for another solution:
df[['C_s','D_s','B_s']] = (df.groupby(level='A', group_keys=False)[['C','D','B']]
                           .apply(pd.DataFrame.shift, periods=-1))
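Another way to sidestep the reordering entirely (a sketch, not part of the original answers) is to assign each shifted column by name, so positional order cannot matter:
shifted = df.groupby(level='A')[['C', 'D', 'B']].shift(-1)
# name-based assignment: each new column is taken from the matching shifted column
for col in ['C', 'D', 'B']:
    df[col + '_s'] = shifted[col]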

In Python 3.6, I want to use sklearn tree to classify, but I get ValueError: could not convert string to float: 'NC'

This is my code
import os
import pandas as pd
import numpy as np
import pylab as pl
from sklearn import tree
os.chdir('C:/Users/Shinelon/Desktop/ch13')
w=pd.read_table('cup98lrn.txt',sep=',',low_memory=False)
w1=(w.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG','TARGET_B']]).dropna(how='any')
x=w1.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG']]
y=w1.loc[:,['TARGET_B']]
clf=tree.DecisionTreeClassifier(min_samples_split=1000,min_samples_leaf=400,max_depth=10)
print(w1.head())
clf=clf.fit(x,y)
But I get an error I can't understand, even though I have used sklearn.tree before. Running D:\python3.6\python.exe C:/Users/Shinelon/Desktop/ch13/.idea/13.4.py prints:
AGE AVGGIFT CARDGIFT CARDPM12 CARDPROM CLUSTER2 DOMAIN GENDER \
1 46.0 15.666667 1 6 12 1.0 S1 M
3 70.0 6.812500 7 6 27 41.0 R2 F
4 78.0 6.864865 8 10 43 26.0 S2 F
6 38.0 7.642857 8 4 26 53.0 T2 F
11 75.0 12.500000 2 6 8 23.0 S2 M
GEOCODE2 HIT ... MDMAUD_R MINRAMNT NGIFTALL NUMPRM12 RAMNTALL \
1 A 16 ... X 10.0 3 13 47.0
3 C 2 ... X 2.0 16 14 109.0
4 A 60 ... X 3.0 37 25 254.0
6 D 0 ... X 3.0 14 9 107.0
11 B 3 ... X 10.0 2 12 25.0
RFA_2A RFA_2F STATE TIMELAG TARGET_B
1 G 2 CA 18.0 0
3 E 4 CA 9.0 0
4 F 2 FL 14.0 0
6 E 1 IN 4.0 0
11 F 2 IN 3.0 0
This is the output of print(w1.head()).
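The traceback is not shown, but the error in the title usually means the feature matrix contains strings: columns such as DOMAIN, GENDER, HOMEOWNR, MDMAUD_R, RFA_2A, GEOCODE2 and STATE hold text ('NC' looks like a STATE value), and DecisionTreeClassifier only accepts numeric input. A minimal sketch of one possible fix (an assumption, not part of the original post) is to one-hot encode the object columns before fitting:
# one-hot encode the string-valued columns so every feature is numeric
x_encoded = pd.get_dummies(x)
clf = tree.DecisionTreeClassifier(min_samples_split=1000, min_samples_leaf=400, max_depth=10)
# ravel() turns the single-column y DataFrame into the 1-D array sklearn expects
clf = clf.fit(x_encoded, y.values.ravel())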
