Using pandas, I am trying to group by multiple columns and then fill the DataFrame with a row for every person name that is missing from each group.
For Example this is my Dataframe
V1 V2 V3 PN
1 10 20 A
2 10 21 A
3 10 20 C
I have a unique person name list = ['A','B','C','D','E']
Expected outcome:
V1 V2 V3 PN
1 10 20 A
1 10 20 B
1 10 20 C
1 10 20 D
1 10 20 E
2 10 21 A
2 10 21 B
2 10 21 C
2 10 21 D
2 10 21 E
3 10 20 A
3 10 20 B
3 10 20 C
3 10 20 D
3 10 20 E
I was thinking of using a pandas groupby statement, but it didn't work out.
Try this, using pd.MultiIndex.from_product with reindex to create the additional rows:
import pandas as pd

df = pd.DataFrame({'Version 1': [1, 2, 3],
                   'Version 2': [10, 10, 10],
                   'Version 3': [20, 21, 20],
                   'Person Name': 'A A C'.split(' ')})
p_list = [*'ABCDE']

(df.set_index(['Version 1', 'Person Name'])
   .reindex(pd.MultiIndex.from_product([df['Version 1'].unique(), p_list],
                                       names=['Version 1', 'Person Name']))
   .groupby(level=0, group_keys=False).apply(lambda x: x.ffill().bfill())
   .reset_index())
Output:
Version 1 Person Name Version 2 Version 3
0 1 A 10.0 20.0
1 1 B 10.0 20.0
2 1 C 10.0 20.0
3 1 D 10.0 20.0
4 1 E 10.0 20.0
5 2 A 10.0 21.0
6 2 B 10.0 21.0
7 2 C 10.0 21.0
8 2 D 10.0 21.0
9 2 E 10.0 21.0
10 3 A 10.0 20.0
11 3 B 10.0 20.0
12 3 C 10.0 20.0
13 3 D 10.0 20.0
14 3 E 10.0 20.0
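As an alternative sketch (my own assumption, not from the answer above): if each V1 value maps to a single (V2, V3) pair, as in the sample, you can build the full grid with a cross merge (pandas >= 1.2) instead of reindexing:
import pandas as pd

df = pd.DataFrame({'V1': [1, 2, 3], 'V2': [10, 10, 10],
                   'V3': [20, 21, 20], 'PN': ['A', 'A', 'C']})
p_list = list('ABCDE')

# Cross-join the de-duplicated value rows with the full person list,
# assuming each V1 row already carries its final V2/V3 values.
full = (df[['V1', 'V2', 'V3']].drop_duplicates()
          .merge(pd.DataFrame({'PN': p_list}), how='cross'))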
How can I create DataFrame column(s) holding the index of the next row with a certain value? I know I can find the matching indexes with
b_Index = df[df.Type=='B'].index
c_Index = df[df.Type=='C'].index
but I need a solution that includes the wrap-around case, so that the 'next' index after the final match is the first matching index.
Say I have a DataFrame with a Type column whose values are A, B, or C:
import pandas as pd

d = dict(Type=['A', 'A', 'A', 'C', 'C', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'B', 'A'])
df = pd.DataFrame(d)
Type
0 A
1 A
2 A
3 C
4 C
5 C
6 A
7 A
8 C
9 A
10 B
11 B
12 B
13 A
I'm looking to add NextForwardBIndex and NextForwardCIndex columns such that the result is
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 10 4
4 C 10 5
5 C 10 8
6 A 10 8
7 A 10 8
8 C 10 3
9 A 10 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
You can use a bit of numpy.roll, DataFrame.ffill, and fillna:
import numpy as np

# roll the matched indices so each B/C row points at the next match,
# wrapping the final match around to the first
df.loc[b_Index, 'NextForwardBIndex'] = np.roll(b_Index, -1)
df.loc[c_Index, 'NextForwardCIndex'] = np.roll(c_Index, -1)
# forward-fill, then fill the leading NaNs with the first match
(df.ffill()
   .fillna({'NextForwardBIndex': b_Index[0],
            'NextForwardCIndex': c_Index[0]})
   .astype(int, errors='ignore')
)
Output:
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 10 4
4 C 10 5
5 C 10 8
6 A 10 8
7 A 10 8
8 C 10 3
9 A 10 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
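For reference, np.roll is what supplies the wrap-around; a minimal illustration:
import numpy as np

b_Index = np.array([10, 11, 12])
print(np.roll(b_Index, -1))  # [11 12 10] -- the last match wraps around to the first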
This should work:
df2 = (df['Type'].str.get_dummies()     # one 0/1 indicator column per Type value
         .mul(df.index, axis=0)         # replace each 1 with that row's index
         .shift(-1)                     # look one row ahead
         .where(lambda x: x.ne(0))      # treat 0 (no match) as missing
         .bfill())                      # pull the next matching index backwards
df2.fillna(df2.iloc[0]).rename('NextForward{}Index'.format, axis=1)
Old Answer:
(df.assign(NextForwardBIndex=df.loc[df['Type'].eq('B')].groupby(df['Type'])
               .transform(lambda x: x.index.to_series().shift(-1)),
           NextForwardCIndex=df.loc[df['Type'].eq('C')].groupby(df['Type'])
               .transform(lambda x: x.index.to_series().shift(-1)))
   .fillna({'NextForwardBIndex': df['Type'].eq('B').idxmax(),
            'NextForwardCIndex': df['Type'].eq('C').idxmax()}))
Output (of the first approach):
NextForwardAIndex NextForwardBIndex NextForwardCIndex
0 1.0 10.0 3.0
1 2.0 10.0 3.0
2 6.0 10.0 3.0
3 6.0 10.0 4.0
4 6.0 10.0 5.0
5 6.0 10.0 8.0
6 7.0 10.0 8.0
7 9.0 10.0 8.0
8 9.0 10.0 3.0
9 13.0 10.0 3.0
10 13.0 11.0 3.0
11 13.0 12.0 3.0
12 13.0 10.0 3.0
13 1.0 10.0 3.0
I have a column named "age" with a few NaNs; my crude logic for deriving a missing age is to take the mean age within two key categorical variables: job and gender.
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 1, 2, 3, 4, 11, 12, 13, 12, 11, 1, 10],
                   [19, 23, np.nan, 29, np.nan, 32, 27, 48, 39, 70, 29, 51, np.nan],
                   ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c'],
                   ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'F']]).T
df.columns = ['col1', 'age', 'job', 'gender']
df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 NaN e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 NaN c M
df.groupby(['job','gender']).mean().reset_index()
job gender col1 age
0 a F 7.500000 30.5
1 a M 1.000000 19.0
2 b F 1.500000 37.0
3 b M 11.000000 27.0
4 c F NaN NaN
5 c M 7.666667 48.0
6 d F 7.500000 34.0
7 d M NaN NaN
8 e F NaN NaN
9 e M 7.500000 70.0
I want to update the age with the derived values from above. What is the optimal way of doing this? Should I store the means in another DataFrame and loop through it to update?
Resultant output should look like this:
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 70.0 c M
Thanks.
Use Series.fillna with GroupBy.transform. Note that the sample data has no non-missing values for the combination (c, M), so a NaN remains there:
df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 NaN c M
If you also need to replace the remaining NaN by grouping only by job, chain another fillna:
avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c M
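If even the per-job mean could be missing, the overall mean makes a natural last fallback; a sketch extending the same pattern:
avg1 = df.groupby(['job', 'gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
# fall back from the (job, gender) mean, to the job mean, to the global mean
df['age'] = df['age'].fillna(avg1).fillna(avg2).fillna(df['age'].mean())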
I have two DataFrames . . .
df1 is a table I need to pull values from using index, column pairs retrieved from multiple columns in df2.
I see there is a function get_value, which works perfectly when given an index and a column value, but I am failing when trying to vectorize this function to create a new column...
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df1.columns = list('abcde')
df1.index = ['cat', 'dog', 'fish', 'bird']
a b c d e
cat 0 1 2 3 4
dog 5 6 7 8 9
fish 10 11 12 13 14
bird 15 16 17 18 19
df1.get_value('bird', 'c')
17
Now what I need to do is create an entirely new column on df2 by indexing df1 with the (index, column) pairs from the animal and letter columns of df2, effectively vectorizing the get_value call above.
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df2['animal'] = ['cat', 'dog', 'fish', 'bird']
df2['letter'] = list('abcd')
0 1 2 3 4 animal letter
0 0 1 2 3 4 cat a
1 5 6 7 8 9 dog b
2 10 11 12 13 14 fish c
3 15 16 17 18 19 bird d
resulting in . . .
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Deprecation Notice: lookup was deprecated in v1.2.0
There's a function aptly named lookup that does exactly this.
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
df2
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
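On pandas versions where lookup is gone (deprecated in 1.2.0 and since removed), a sketch of an equivalent using positional indexers; note get_indexer returns -1 for missing labels, so this assumes every (animal, letter) pair exists in df1:
# map each label to its integer position, then fancy-index the value array
rows = df1.index.get_indexer(df2['animal'])
cols = df1.columns.get_indexer(df2['letter'])
df2['looked_up'] = df1.to_numpy()[rows, cols]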
If you're looking for a slightly faster approach, zip will help in the case of a small dataframe, i.e.
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
Output:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
As John suggested you can simplify the code which will be much faster.
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
In case of missing data, use a conditional, i.e.
df2['looked_up'] = [df1.get_value(r, c) if not (pd.isnull(r) or pd.isnull(c)) else np.nan
                    for r, c in zip(df2.animal, df2.letter)]
For small dataframes
%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop
[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop
For a large dataframe
df3 = pd.concat([df2]*10000)
%%timeit
k = list(zip(df3['animal'].values, df3['letter'].values))
df3['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop
df3['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop
df3['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop
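As an aside, get_value itself was later deprecated and removed; df.at is the supported scalar accessor, so the list-comprehension variant would now read (a sketch, same logic):
df2['looked_up'] = [df1.at[r, c] for r, c in zip(df2.animal, df2.letter)]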
lookup and get_value are great answers if your values exist in the lookup dataframe.
However, if you have (row, column) pairs not present in the lookup dataframe and want the looked-up value to be NaN, merge with stack is one way to do it:
In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
                    left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
                    how='left').drop(['level_0', 'level_1'], axis=1)
Out[206]:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Test by adding a non-existent (animal, letter) pair:
In [207]: df22
Out[207]:
0 1 2 3 4 animal letter
0 0.0 1.0 2.0 3.0 4.0 cat a
1 5.0 6.0 7.0 8.0 9.0 dog b
2 10.0 11.0 12.0 13.0 14.0 fish c
3 15.0 16.0 17.0 18.0 19.0 bird d
4 NaN NaN NaN NaN NaN dummy NaN
In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
                     left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
                     how='left').drop(['level_0', 'level_1'], axis=1)
Out[208]:
0 1 2 3 4 animal letter looked_up
0 0.0 1.0 2.0 3.0 4.0 cat a 0.0
1 5.0 6.0 7.0 8.0 9.0 dog b 6.0
2 10.0 11.0 12.0 13.0 14.0 fish c 12.0
3 15.0 16.0 17.0 18.0 19.0 bird d 18.0
4 NaN NaN NaN NaN NaN dummy NaN NaN
This is my code
import os
import pandas as pd
import numpy as np
import pylab as pl
from sklearn import tree
os.chdir('C:/Users/Shinelon/Desktop/ch13')
w=pd.read_table('cup98lrn.txt',sep=',',low_memory=False)
w1=(w.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG','TARGET_B']]).dropna(how='any')
x=w1.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG']]
y=w1.loc[:,['TARGET_B']]
clf=tree.DecisionTreeClassifier(min_samples_split=1000,min_samples_leaf=400,max_depth=10)
print(w1.head())
clf=clf.fit(x,y)
but an error appears that I can't understand, even though I have used sklearn.tree before. The script was run as: D:\python3.6\python.exe C:/Users/Shinelon/Desktop/ch13/.idea/13.4.py
AGE AVGGIFT CARDGIFT CARDPM12 CARDPROM CLUSTER2 DOMAIN GENDER \
1 46.0 15.666667 1 6 12 1.0 S1 M
3 70.0 6.812500 7 6 27 41.0 R2 F
4 78.0 6.864865 8 10 43 26.0 S2 F
6 38.0 7.642857 8 4 26 53.0 T2 F
11 75.0 12.500000 2 6 8 23.0 S2 M
GEOCODE2 HIT ... MDMAUD_R MINRAMNT NGIFTALL NUMPRM12 RAMNTALL \
1 A 16 ... X 10.0 3 13 47.0
3 C 2 ... X 2.0 16 14 109.0
4 A 60 ... X 3.0 37 25 254.0
6 D 0 ... X 3.0 14 9 107.0
11 B 3 ... X 10.0 2 12 25.0
RFA_2A RFA_2F STATE TIMELAG TARGET_B
1 G 2 CA 18.0 0
3 E 4 CA 9.0 0
4 F 2 FL 14.0 0
6 E 1 IN 4.0 0
11 F 2 IN 3.0 0
This is the result of print(w1.head()).
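The error text itself is missing above, but one likely cause (an assumption on my part, since scikit-learn trees require numeric input) is the string columns such as DOMAIN, GENDER, and STATE. A sketch of a common fix using one-hot encoding:
# hypothetical fix, assuming the failure is a string-to-float conversion error:
# one-hot encode the categorical string columns so every feature is numeric
x_encoded = pd.get_dummies(x)
clf = tree.DecisionTreeClassifier(min_samples_split=1000,
                                  min_samples_leaf=400,
                                  max_depth=10)
clf = clf.fit(x_encoded, y.values.ravel())  # ravel avoids the column-vector warning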