Trying to group by using multiple columns - python

Using Pandas I am trying to do group by for multiple columns and then fill the pandas dataframe where a person name is not present
For Example this is my Dataframe
enter image description here
V1 V2 V3 PN
1 10 20 A
2 10 21 A
3 10 20 C
I have a unique person name list = ['A','B','C','D','E']
Expected Outcome:-
enter image description here
V1 V2 V3 PN
1 10 20 A
1 10 20 B
1 10 20 C
1 10 20 D
1 10 20 E
2 10 21 A
2 10 21 B
2 10 21 C
2 10 21 D
2 10 21 E
3 10 20 A
3 10 20 B
3 10 20 C
3 10 20 D
3 10 20 E
I was thinking about trying group by pandas statement but it didnt work out

Try this, using pd.MultiIndex with reindex to create additional rows:
import pandas as pd
df = pd.DataFrame({'Version 1':[1,2,3],
'Version 2':[10,10,10],
'Version 3':[20,21,20],
'Person Name':'A A C'.split(' ')})
p_list = [*'ABCDE']
df.set_index(['Version 1', 'Person Name'])\
.reindex(pd.MultiIndex.from_product([df['Version 1'].unique(), p_list],
names=['Version 1', 'Person Name']))\
.groupby(level=0, group_keys=False).apply(lambda x: x.ffill().bfill())\
.reset_index()
Output:
Version 1 Person Name Version 2 Version 3
0 1 A 10.0 20.0
1 1 B 10.0 20.0
2 1 C 10.0 20.0
3 1 D 10.0 20.0
4 1 E 10.0 20.0
5 2 A 10.0 21.0
6 2 B 10.0 21.0
7 2 C 10.0 21.0
8 2 D 10.0 21.0
9 2 E 10.0 21.0
10 3 A 10.0 20.0
11 3 B 10.0 20.0
12 3 C 10.0 20.0
13 3 D 10.0 20.0
14 3 E 10.0 20.0

Related

Dataframe column with subsequent matching value's index including range wrap-around direction from final index to initial index

How can I create Dataframe column(s) with the subsequent indexes for a certain value? I know I can find the matching indexes with
b_Index = df[df.Type=='B'].index
c_Index = df[df.Type=='C'].index
but I'm in need of a solution which includes the wrap-around case such that the 'next' index after the final match is the first index.
Say I have a dataframe with a Type series. Type includes values A, B or C.
d = dict(Type=['A', 'A', 'A', 'C', 'C', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'B', 'A'])
df = pd.DataFrame(d)
Type
0 A
1 A
2 A
3 C
4 C
5 C
6 A
7 A
8 C
9 A
10 B
11 B
12 B
13 A
I'm looking to add NextForwardBIndex and NextForwardCIndex columns such that the result is
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 10 4
4 C 10 5
5 C 10 8
6 A 10 8
7 A 10 8
8 C 10 3
9 A 10 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
You can use a bit of numpy.roll, pandas.ffill, and pandas.fillna:
# roll indices and assign the next values for B/C rows
df.loc[b_Index, 'NextForwardBIndex'] = np.roll(b_Index,-1)
df.loc[c_Index, 'NextForwardCIndex'] = np.roll(c_Index,-1)
# fill missing values
(df.ffill()
.fillna({'NextForwardBIndex': b_Index[0],
'NextForwardCIndex': c_Index[0]})
.astype(int, errors='ignore')
)
output:
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 4 4
4 C 5 5
5 C 8 8
6 A 8 8
7 A 8 8
8 C 3 3
9 A 3 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
This should work:
df2 = df['Type'].str.get_dummies().mul(s.index,axis=0).shift(-1).where(lambda x: x.ne(0)).bfill()
df2.fillna(df2.iloc[0]).rename('NextForward{}Index'.format,axis=1)
Old Answer:
(df.assign(NextForwardBIndex = df.loc[df['Type'].eq('B')].groupby(df['Type']).transform(lambda x: x.index.to_series().shift(-1)),
NextForwardCIndex = df.loc[df['Type'].eq('C')].groupby(df['Type']).transform(lambda x: x.index.to_series().shift(-1)))
.fillna({'NextForwardBIndex':df['Type'].eq('B').idxmax(),'NextForwardCIndex':df['Type'].eq('C').idxmax()}))
Output:
NextForwardAIndex NextForwardBIndex NextForwardCIndex
0 1.0 10.0 3.0
1 2.0 10.0 3.0
2 6.0 10.0 3.0
3 6.0 10.0 4.0
4 6.0 10.0 5.0
5 6.0 10.0 8.0
6 7.0 10.0 8.0
7 9.0 10.0 8.0
8 9.0 10.0 3.0
9 13.0 10.0 3.0
10 13.0 11.0 3.0
11 13.0 12.0 3.0
12 13.0 10.0 3.0
13 1.0 10.0 3.0

Updating column in a dataframe based on multiple columns

I have a column named "age" with a few NaN; crude logic of deriving the value of the age is finding the mean of age using 2 key categorical variables - job, gender
df = pd.DataFrame([[1,2,1,2,3,4,11,12,13,12,11,1,10], [19,23,np.nan,29,np.nan,32,27,48,39,70,29,51,np.nan],
['a','b','c','d','e','a','b','c','d','e','a','b','c'],['M','F','M','F','M','F','M','F','M','M','F','F','F']]).T
df.columns = ['col1','age','job','gender']
df = df.astype({"col1": int, "age": float})
df['job'] = df.job.astype('category')
df['gender'] = df.gender.astype('category')
df
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 NaN e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 NaN c M
df.groupby(['job','gender']).mean().reset_index()
job gender col1 age
0 a F 7.500000 30.5
1 a M 1.000000 19.0
2 b F 1.500000 37.0
3 b M 11.000000 27.0
4 c F NaN NaN
5 c M 7.666667 48.0
6 d F 7.500000 34.0
7 d M NaN NaN
8 e F NaN NaN
9 e M 7.500000 70.0
I want to update the age to the derived value from above. What is the optimal way of doing it? Should I store it in another dataframe and loop it through for updation?
Resultant output should look like this:
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 70.0 c M
Thanks.
Use Series.fillna with GroupBy.transform, but because in sample data are not data for combination c, M there is NaN:
df['age'] = df['age'].fillna(df.groupby(['job','gender'])['age'].transform('mean'))
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 NaN c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F
If need also replace NaN by groiping only by id add another fillna:
avg1 = df.groupby(['job','gender'])['age'].transform('mean')
avg2 = df.groupby('job')['age'].transform('mean')
df['age'] = df['age'].fillna(avg1).fillna(avg2)
print (df)
col1 age job gender
0 1 19.0 a M
1 2 23.0 b F
2 1 48.0 c M
3 2 29.0 d F
4 3 70.0 e M
5 4 32.0 a F
6 11 27.0 b M
7 12 48.0 c F
8 13 39.0 d M
9 12 70.0 e M
10 11 29.0 a F
11 1 51.0 b F
12 10 48.0 c F

Obtain data from pandas dataframe based on the column containing column names provided at each record [duplicate]

I have two DataFrames . . .
df1 is a table I need to pull values from using index, column pairs retrieved from multiple columns in df2.
I see there is a function get_value which works perfectly when given an index and column value, but when trying to vectorize this function to create a new column I am failing...
df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df1.columns = list('abcde')
df1.index = ['cat', 'dog', 'fish', 'bird']
a b c d e
cat 0 1 2 3 4
dog 5 6 7 8 9
fish 10 11 12 13 14
bird 15 16 17 18 19
df1.get_value('bird, 'c')
17
Now what I need to do is to create an entire new column on df2 -- when indexing df1 based on index, column pairs from the animal, letter columns specified in df2 effectively vectorizing the pd.get_value function above.
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df2['animal'] = ['cat', 'dog', 'fish', 'bird']
df2['letter'] = list('abcd')
0 1 2 3 4 animal letter
0 0 1 2 3 4 cat a
1 5 6 7 8 9 dog b
2 10 11 12 13 14 fish c
3 15 16 17 18 19 bird d
resulting in . . .
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Deprecation Notice: lookup was deprecated in v1.2.0
There's a function aptly named lookup that does exactly this.
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
df2
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
If looking for a bit faster approach then zip will help in case of small dataframe i.e
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
Output:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
As John suggested you can simplify the code which will be much faster.
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
In case of missing data use if else i.e
df2['looked_up'] = [df1.get_value(r, c) if not pd.isnull(c) | pd.isnull(r) else pd.np.nan for r, c in zip(df2.animal, df2.letter) ]
For small dataframes
%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop
[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop
For large dataframe
df3 = pd.concat([df2]*10000)
%%timeit
k = list(zip(df3['animal'].values,df3['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop
df2['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop
lookup and get_value are great answers if your values exist in lookup dataframe.
However, if you've (row, column) pairs not present in the lookup dataframe, and want the lookup value be NaN -- merge and stack is one way to do it
In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[206]:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Test with adding non-existing (animal, letter) pair
In [207]: df22
Out[207]:
0 1 2 3 4 animal letter
0 0.0 1.0 2.0 3.0 4.0 cat a
1 5.0 6.0 7.0 8.0 9.0 dog b
2 10.0 11.0 12.0 13.0 14.0 fish c
3 15.0 16.0 17.0 18.0 19.0 bird d
4 NaN NaN NaN NaN NaN dummy NaN
In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[208]:
0 1 2 3 4 animal letter looked_up
0 0.0 1.0 2.0 3.0 4.0 cat a 0.0
1 5.0 6.0 7.0 8.0 9.0 dog b 6.0
2 10.0 11.0 12.0 13.0 14.0 fish c 12.0
3 15.0 16.0 17.0 18.0 19.0 bird d 18.0
4 NaN NaN NaN NaN NaN dummy NaN NaN

Vectorized lookup on a pandas dataframe

I have two DataFrames . . .
df1 is a table I need to pull values from using index, column pairs retrieved from multiple columns in df2.
I see there is a function get_value which works perfectly when given an index and column value, but when trying to vectorize this function to create a new column I am failing...
df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df1.columns = list('abcde')
df1.index = ['cat', 'dog', 'fish', 'bird']
a b c d e
cat 0 1 2 3 4
dog 5 6 7 8 9
fish 10 11 12 13 14
bird 15 16 17 18 19
df1.get_value('bird, 'c')
17
Now what I need to do is to create an entire new column on df2 -- when indexing df1 based on index, column pairs from the animal, letter columns specified in df2 effectively vectorizing the pd.get_value function above.
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df2['animal'] = ['cat', 'dog', 'fish', 'bird']
df2['letter'] = list('abcd')
0 1 2 3 4 animal letter
0 0 1 2 3 4 cat a
1 5 6 7 8 9 dog b
2 10 11 12 13 14 fish c
3 15 16 17 18 19 bird d
resulting in . . .
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Deprecation Notice: lookup was deprecated in v1.2.0
There's a function aptly named lookup that does exactly this.
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
df2
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
If looking for a bit faster approach then zip will help in case of small dataframe i.e
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
Output:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
As John suggested you can simplify the code which will be much faster.
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
In case of missing data use if else i.e
df2['looked_up'] = [df1.get_value(r, c) if not pd.isnull(c) | pd.isnull(r) else pd.np.nan for r, c in zip(df2.animal, df2.letter) ]
For small dataframes
%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop
[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop
For large dataframe
df3 = pd.concat([df2]*10000)
%%timeit
k = list(zip(df3['animal'].values,df3['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop
df2['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop
lookup and get_value are great answers if your values exist in lookup dataframe.
However, if you've (row, column) pairs not present in the lookup dataframe, and want the lookup value be NaN -- merge and stack is one way to do it
In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[206]:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Test with adding non-existing (animal, letter) pair
In [207]: df22
Out[207]:
0 1 2 3 4 animal letter
0 0.0 1.0 2.0 3.0 4.0 cat a
1 5.0 6.0 7.0 8.0 9.0 dog b
2 10.0 11.0 12.0 13.0 14.0 fish c
3 15.0 16.0 17.0 18.0 19.0 bird d
4 NaN NaN NaN NaN NaN dummy NaN
In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[208]:
0 1 2 3 4 animal letter looked_up
0 0.0 1.0 2.0 3.0 4.0 cat a 0.0
1 5.0 6.0 7.0 8.0 9.0 dog b 6.0
2 10.0 11.0 12.0 13.0 14.0 fish c 12.0
3 15.0 16.0 17.0 18.0 19.0 bird d 18.0
4 NaN NaN NaN NaN NaN dummy NaN NaN

in python 3.6.i want to use sklearn tree to classify but appear ValueError: could not convert string to float: 'NC',

This is my code
import os
import pandas as pd
import numpy as np
import pylab as pl
from sklearn import tree
os.chdir('C:/Users/Shinelon/Desktop/ch13')
w=pd.read_table('cup98lrn.txt',sep=',',low_memory=False)
w1=(w.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG','TARGET_B']]).dropna(how='any')
x=w1.loc[:,['AGE','AVGGIFT','CARDGIFT','CARDPM12','CARDPROM','CLUSTER2','DOMAIN','GENDER','GEOCODE2','HIT',
'HOMEOWNR','HPHONE_D','INCOME','LASTGIFT','MAXRAMNT',
'MDMAUD_F','MDMAUD_R','MINRAMNT','NGIFTALL','NUMPRM12',
'RAMNTALL',
'RFA_2A','RFA_2F','STATE','TIMELAG']]
y=w1.loc[:,['TARGET_B']]
clf=tree.DecisionTreeClassifier(min_samples_split=1000,min_samples_leaf=400,max_depth=10)
print(w1.head())
clf=clf.fit(x,y)
but appear the question I can't understand .because i use sklearn.tree before .D:\python3.6\python.exe C:/Users/Shinelon/Desktop/ch13/.idea/13.4.py
AGE AVGGIFT CARDGIFT CARDPM12 CARDPROM CLUSTER2 DOMAIN GENDER \
1 46.0 15.666667 1 6 12 1.0 S1 M
3 70.0 6.812500 7 6 27 41.0 R2 F
4 78.0 6.864865 8 10 43 26.0 S2 F
6 38.0 7.642857 8 4 26 53.0 T2 F
11 75.0 12.500000 2 6 8 23.0 S2 M
GEOCODE2 HIT ... MDMAUD_R MINRAMNT NGIFTALL NUMPRM12 RAMNTALL \
1 A 16 ... X 10.0 3 13 47.0
3 C 2 ... X 2.0 16 14 109.0
4 A 60 ... X 3.0 37 25 254.0
6 D 0 ... X 3.0 14 9 107.0
11 B 3 ... X 10.0 2 12 25.0
RFA_2A RFA_2F STATE TIMELAG TARGET_B
1 G 2 CA 18.0 0
3 E 4 CA 9.0 0
4 F 2 FL 14.0 0
6 E 1 IN 4.0 0
11 F 2 IN 3.0 0
this is print(w1) result

Categories

Resources