I have two DataFrames . . .
df1 is a table I need to pull values from using index, column pairs retrieved from multiple columns in df2.
I see there is a function get_value which works perfectly when given an index and column value, but when trying to vectorize this function to create a new column I am failing...
df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df1.columns = list('abcde')
df1.index = ['cat', 'dog', 'fish', 'bird']
a b c d e
cat 0 1 2 3 4
dog 5 6 7 8 9
fish 10 11 12 13 14
bird 15 16 17 18 19
df1.get_value('bird', 'c')
17
Now what I need to do is create an entire new column on df2 by indexing into df1 with the index, column pairs taken from the animal and letter columns of df2 -- effectively vectorizing the get_value call above.
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df2['animal'] = ['cat', 'dog', 'fish', 'bird']
df2['letter'] = list('abcd')
0 1 2 3 4 animal letter
0 0 1 2 3 4 cat a
1 5 6 7 8 9 dog b
2 10 11 12 13 14 fish c
3 15 16 17 18 19 bird d
resulting in . . .
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Deprecation Notice: lookup was deprecated in v1.2.0
There's a function aptly named lookup that does exactly this.
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
df2
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
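Since lookup is deprecated (and removed in later pandas versions), one replacement sketch uses positional indexing into the underlying NumPy array; it assumes every (animal, letter) pair actually exists in df1 (get_indexer returns -1 for missing labels, which would silently pick the wrong row):
# positional row/column locations of each pair in df1
rows = df1.index.get_indexer(df2['animal'])
cols = df1.columns.get_indexer(df2['letter'])
# pull all values in one vectorized step
df2['looked_up'] = df1.to_numpy()[rows, cols]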
If you are looking for a slightly faster approach, zip will help in the case of a small dataframe, i.e.
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
Output:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
As John suggested, you can simplify the code further, which is faster still:
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
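Note that get_value itself was later deprecated in favour of .at; an equivalent sketch on recent pandas versions would be:
df2['looked_up'] = [df1.at[r, c] for r, c in zip(df2.animal, df2.letter)]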
In case of missing data, use a conditional expression, i.e.
df2['looked_up'] = [df1.get_value(r, c) if not (pd.isnull(r) or pd.isnull(c)) else np.nan for r, c in zip(df2.animal, df2.letter)]
For small dataframes
%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop
[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop
For a large dataframe
df3 = pd.concat([df2]*10000)
%%timeit
k = list(zip(df3['animal'].values,df3['letter'].values))
df3['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop
df3['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop
df3['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop
lookup and get_value are great answers if your values exist in the lookup dataframe.
However, if you have (row, column) pairs that are not present in the lookup dataframe and want the looked-up value to be NaN, merge and stack is one way to do it:
In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[206]:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Test with adding non-existing (animal, letter) pair
In [207]: df22
Out[207]:
0 1 2 3 4 animal letter
0 0.0 1.0 2.0 3.0 4.0 cat a
1 5.0 6.0 7.0 8.0 9.0 dog b
2 10.0 11.0 12.0 13.0 14.0 fish c
3 15.0 16.0 17.0 18.0 19.0 bird d
4 NaN NaN NaN NaN NaN dummy NaN
In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[208]:
0 1 2 3 4 animal letter looked_up
0 0.0 1.0 2.0 3.0 4.0 cat a 0.0
1 5.0 6.0 7.0 8.0 9.0 dog b 6.0
2 10.0 11.0 12.0 13.0 14.0 fish c 12.0
3 15.0 16.0 17.0 18.0 19.0 bird d 18.0
4 NaN NaN NaN NaN NaN dummy NaN NaN
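An alternative sketch for the missing-pair case builds a MultiIndex from the (animal, letter) pairs and reindexes the stacked frame; pairs not found in df1 come back as NaN (exact handling of NaN keys may vary by pandas version):
# one (animal, letter) key per row of df22
pairs = pd.MultiIndex.from_arrays([df22['animal'], df22['letter']])
df22['looked_up'] = df1.stack().reindex(pairs).to_numpy()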
Related
Using pandas, I am trying to group by multiple columns and then fill in the dataframe for the person names that are not present in each group.
For example, this is my DataFrame:
V1 V2 V3 PN
1 10 20 A
2 10 21 A
3 10 20 C
I have a unique person name list = ['A','B','C','D','E']
Expected outcome:
V1 V2 V3 PN
1 10 20 A
1 10 20 B
1 10 20 C
1 10 20 D
1 10 20 E
2 10 21 A
2 10 21 B
2 10 21 C
2 10 21 D
2 10 21 E
3 10 20 A
3 10 20 B
3 10 20 C
3 10 20 D
3 10 20 E
I was thinking about trying a pandas groupby statement, but it didn't work out.
Try this, using pd.MultiIndex with reindex to create additional rows:
import pandas as pd
df = pd.DataFrame({'Version 1':[1,2,3],
'Version 2':[10,10,10],
'Version 3':[20,21,20],
'Person Name':'A A C'.split(' ')})
p_list = [*'ABCDE']
df.set_index(['Version 1', 'Person Name'])\
.reindex(pd.MultiIndex.from_product([df['Version 1'].unique(), p_list],
names=['Version 1', 'Person Name']))\
.groupby(level=0, group_keys=False).apply(lambda x: x.ffill().bfill())\
.reset_index()
Output:
Version 1 Person Name Version 2 Version 3
0 1 A 10.0 20.0
1 1 B 10.0 20.0
2 1 C 10.0 20.0
3 1 D 10.0 20.0
4 1 E 10.0 20.0
5 2 A 10.0 21.0
6 2 B 10.0 21.0
7 2 C 10.0 21.0
8 2 D 10.0 21.0
9 2 E 10.0 21.0
10 3 A 10.0 20.0
11 3 B 10.0 20.0
12 3 C 10.0 20.0
13 3 D 10.0 20.0
14 3 E 10.0 20.0
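On pandas 1.2+, a cross merge is another way to sketch the same expansion, assuming the value columns are constant within each Version 1 group:
names = pd.DataFrame({'Person Name': p_list})
out = (df.drop(columns='Person Name')
         .drop_duplicates()
         .merge(names, how='cross'))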
I am trying to add rows to a DataFrame interpolating values in a column by group, and fill with missing all other columns. My data looks something like this:
import pandas as pd
import random
random.seed(42)
data = {'group':['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c' ],
'value' : [1, 2, 5, 3, 4, 5, 7, 4, 7, 9],
'other': random.sample(range(1, 100), 10)}
df = pd.DataFrame(data)
print(df)
group value other
0 a 1 82
1 a 2 15
2 a 5 4
3 b 3 95
4 b 4 36
5 b 5 32
6 b 7 29
7 c 4 18
8 c 7 14
9 c 9 87
What I am trying to achieve is something like this:
group value other
a 1 82
a 2 15
a 3 NaN
a 4 NaN
a 5 NaN
b 3 95
b 4 36
b 5 32
b 6 NaN
b 7 29
c 4 18
c 5 NaN
c 6 NaN
c 7 14
c 8 NaN
c 9 87
For example, group a has a range from 1 to 5, b from 3 to 7, and c from 4 to 9.
The issue I'm having is that each group has a different range. I found something that works assuming a single range for all groups; I could use the global min and max and then drop the extra rows in each group, but since my data is fairly large, adding that many rows per group quickly becomes unfeasible.
>>> df.groupby('group').apply(lambda x: x.set_index('value').reindex(np.arange(x['value'].min(), x['value'].max() + 1))).drop(columns='group').reset_index()
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
We group on the group column and then re-index each group with the range from the min to the max of its value column.
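Written out step by step, the same idea looks roughly like this (a sketch, assuming numpy is imported as np):
def fill_range(g):
    # reindex each group on the full integer range of its own 'value' column
    full = np.arange(g['value'].min(), g['value'].max() + 1)
    return g.set_index('value').reindex(full)

out = (df.groupby('group')
         .apply(fill_range)
         .drop(columns='group')
         .reset_index())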
One option is with the complete function from pyjanitor, which can be helpful in exposing explicitly missing rows (and can be helpful as well in abstracting the reshaping process):
# pip install pyjanitor
import pandas as pd
import janitor
new_value = {'value' : lambda df: range(df.min(), df.max()+1)}
# expose the missing values per group via the `by` parameter
df.complete(new_value, by='group', sort = True)
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
How can I create DataFrame column(s) with the subsequent indexes for a certain value? I know I can find the matching indexes with
b_Index = df[df.Type=='B'].index
c_Index = df[df.Type=='C'].index
but I'm in need of a solution which includes the wrap-around case such that the 'next' index after the final match is the first index.
Say I have a dataframe with a Type series. Type includes values A, B or C.
d = dict(Type=['A', 'A', 'A', 'C', 'C', 'C', 'A', 'A', 'C', 'A', 'B', 'B', 'B', 'A'])
df = pd.DataFrame(d)
Type
0 A
1 A
2 A
3 C
4 C
5 C
6 A
7 A
8 C
9 A
10 B
11 B
12 B
13 A
I'm looking to add NextForwardBIndex and NextForwardCIndex columns such that the result is
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 10 4
4 C 10 5
5 C 10 8
6 A 10 8
7 A 10 8
8 C 10 3
9 A 10 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
You can use a bit of numpy.roll, pandas.ffill, and pandas.fillna:
# roll indices and assign the next values for B/C rows
df.loc[b_Index, 'NextForwardBIndex'] = np.roll(b_Index,-1)
df.loc[c_Index, 'NextForwardCIndex'] = np.roll(c_Index,-1)
# fill missing values
(df.ffill()
.fillna({'NextForwardBIndex': b_Index[0],
'NextForwardCIndex': c_Index[0]})
.astype(int, errors='ignore')
)
output:
Type NextForwardBIndex NextForwardCIndex
0 A 10 3
1 A 10 3
2 A 10 3
3 C 4 4
4 C 5 5
5 C 8 8
6 A 8 8
7 A 8 8
8 C 3 3
9 A 3 3
10 B 11 3
11 B 12 3
12 B 10 3
13 A 10 3
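The wrap-around behaviour comes from np.roll itself, which shifts the array and moves the elements that fall off the end back to the front, for example:
import numpy as np
np.roll(np.array([3, 4, 5, 8]), -1)   # array([4, 5, 8, 3])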
This should work:
df2 = df['Type'].str.get_dummies().mul(df.index, axis=0).shift(-1).where(lambda x: x.ne(0)).bfill()
df2.fillna(df2.iloc[0]).rename('NextForward{}Index'.format, axis=1)
Old Answer:
(df.assign(NextForwardBIndex = df.loc[df['Type'].eq('B')].groupby(df['Type']).transform(lambda x: x.index.to_series().shift(-1)),
NextForwardCIndex = df.loc[df['Type'].eq('C')].groupby(df['Type']).transform(lambda x: x.index.to_series().shift(-1)))
.fillna({'NextForwardBIndex':df['Type'].eq('B').idxmax(),'NextForwardCIndex':df['Type'].eq('C').idxmax()}))
Output:
NextForwardAIndex NextForwardBIndex NextForwardCIndex
0 1.0 10.0 3.0
1 2.0 10.0 3.0
2 6.0 10.0 3.0
3 6.0 10.0 4.0
4 6.0 10.0 5.0
5 6.0 10.0 8.0
6 7.0 10.0 8.0
7 9.0 10.0 8.0
8 9.0 10.0 3.0
9 13.0 10.0 3.0
10 13.0 11.0 3.0
11 13.0 12.0 3.0
12 13.0 10.0 3.0
13 1.0 10.0 3.0
Consider the dataframe df:
df = pd.DataFrame(np.arange(25).reshape(5, 5), list('ABCDE'), list('abcde'))
print(df)
a b c d e
A 0 1 2 3 4
B 5 6 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
E 20 21 22 23 24
I want to replace the values in row 'A' with the corresponding values in row 'E', but only where the values in row 'D' are equal to zero mod three.
I create the boolean mask
mask = df.loc['D'] % 3 == 0
Then I make my assignment
df.loc['A'] = df.loc['E', mask]
However, I now have np.nan in some of my columns and my whole dataframe is now float:
print(df)
a b c d e
A 20.0 NaN NaN 23.0 NaN
B 5.0 6.0 7.0 8.0 9.0
C 10.0 11.0 12.0 13.0 14.0
D 15.0 16.0 17.0 18.0 19.0
E 20.0 21.0 22.0 23.0 24.0
How should I go about getting this result?
a b c d e
A 20 1 2 23 4
B 5 6 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
E 20 21 22 23 24
Include mask in your loc for row 'A' instead of row 'E':
df.loc['A', mask] = df.loc['E']
The reason you're seeing NaN values is that you're reassigning all of row 'A' as just the masked version of row 'E'. The masked version of row 'E' is missing entries for some columns, so they get filled with NaN. The dtype for NaN is float, which forces all of the other integer values to be floats. By using mask on row 'A' instead, you're only assigning to the locations you want to update.
The resulting output:
a b c d e
A 20 1 2 23 4
B 5 6 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
E 20 21 22 23 24
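An equivalent sketch builds the full replacement row first with Series.where, so only row 'A' is touched and, since no NaN is involved, the integer dtypes should be preserved:
df.loc['A'] = df.loc['E'].where(mask, df.loc['A'])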
Try this:
In [172]: df.loc['A', df.columns[df.loc['D'] % 3 == 0]] = df.loc['E']
In [173]: df
Out[173]:
a b c d e
A 20 1 2 23 4
B 5 6 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
E 20 21 22 23 24