I'd like to conditionally replace values, row-by-row in a pandas dataframe so that max(row) will remain, while all other values in the row will be set to None.
My intuition goes towards apply() but I am not sure if that's the right choice, or how to do it.
Example (but there may be multiple columns):
tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )
tmp
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
Wanted output:
somemagic(tmp)
A B
0 None 3
1 None 4
2 3 None
3 None 33
4 None 10
5 None 9
6 7 None # on tie I don't really care which one is set to None
7 8 None
8 None 10
9 10 None # on tie I don't really care which one is set to None
Any suggestions on how to achieve that?
You can compare the DataFrame values against the row-wise max with eq:
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
mask = (tmp.eq(tmp.max(axis=1), axis=0))
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True True
7 True False
8 False True
9 True True
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 7.0
7 8.0 NaN
8 NaN 10.0
9 10.0 10.0
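As an aside (not part of the original answer), indexing with a boolean DataFrame is equivalent to DataFrame.where, which states the intent explicitly:
df = tmp.where(mask)  # keep values where mask is True, NaN elsewhere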
and then you can force NaN into one of the columns when the values in a row are tied:
mask = (tmp.eq(tmp.max(axis=1), axis=0))
mask['B'] = mask.B & (tmp.A != tmp.B)
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True False
7 True False
8 False True
9 True False
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 NaN
7 8.0 NaN
8 NaN 10.0
9 10.0 NaN
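The mask['B'] fix above only works for exactly two columns. A minimal sketch that generalizes the tie-breaking to any number of columns (an addition, not from the original answer): idxmax(axis=1) returns the first column holding each row's maximum, so building the mask from it keeps exactly one value per row:
first_max = tmp.idxmax(axis=1)  # first column holding the row max
mask = pd.DataFrame({c: first_max == c for c in tmp.columns})
print (tmp.where(mask))  # exactly one non-NaN value per row, even on ties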
Timings (len(df)=10):
In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
1000 loops, best of 3: 974 µs per loop
In [235]: %timeit (gh(tmp))
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.64 ms per loop
(len(df)=100k):
In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
100 loops, best of 3: 7.42 ms per loop
In [245]: %timeit (gh(t1))
1 loop, best of 3: 8.81 s per loop
Code for timings:
import pandas as pd
tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )
tmp = pd.concat([tmp]*10000).reset_index(drop=True)
t1 = tmp.copy()
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
def top(row):
data = row.tolist()
return [d if d == max(data) else None for d in data]
def gh(tmp1):
return tmp1.apply(top, axis=1)
print (gh(t1))
I would suggest using apply(). You can use it as below:
In [1]: import pandas as pd
In [2]: tmp= pd.DataFrame({
...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
...: } )
In [3]: tmp
Out[3]:
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
In [4]: def top(row):
...: data = row.tolist()
...: return [d if d == max(data) else None for d in data]
...:
In [5]: df2 = tmp.apply(top, axis=1)
In [6]: df2
Out[6]:
A B
0 NaN 3
1 NaN 4
2 3 NaN
3 NaN 33
4 NaN 10
5 NaN 9
6 7 7
7 8 NaN
8 NaN 10
9 10 10
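One hedge for newer pandas versions: when the applied function returns a plain list, apply(axis=1) may return a Series of lists instead of a DataFrame. Returning a Series preserves the column labels in every version; a small sketch:
def top(row):
    # keep the row max, NaN elsewhere; returning a Series keeps the column labels
    return row.where(row == row.max())
df2 = tmp.apply(top, axis=1)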
Related
I want to fill column with True and NaN values
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = np.where(df['col1'].isin(my_list), True, np.NaN)
print (df)
It prints:
col1 col2
0 0 NaN
1 1 1.0
2 2 1.0
3 3 1.0
4 4 1.0
5 5 1.0
6 6 NaN
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
But it is very important for me to print bool value True, not float number 1.0. This column interacts with other columns. They are bool, so it must be bool too.
I know I can change it with replace function. But my DataFrame is very large. I cannot waste time. Is there a simple option to do it?
This code will solve your problem. np.where gives you 1.0 instead of True because NumPy picks a common dtype for the two branches; since NaN is a float, True is cast to the number 1. Using apply keeps Python objects, so True survives:
Code
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = df['col1'].apply(lambda x: True if x in my_list else np.NaN)
print (df)
Results
col1 col2
0 0 NaN
1 1 True
2 2 True
3 3 True
4 4 True
5 5 True
6 6 NaN
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
Use the nullable Boolean data type:
df['col2'] = pd.Series(np.where(df['col1'].isin(my_list), True, np.NaN), dtype='boolean')
print (df)
col1 col2
0 0 <NA>
1 1 True
2 2 True
3 3 True
4 4 True
5 5 True
6 6 <NA>
7 7 <NA>
8 8 <NA>
9 9 <NA>
10 10 <NA>
Or you can convert the 1.0 values back to True afterwards:
df.col2 = df.col2.apply(lambda x: True if x==1.0 else x)
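Another True/NaN route worth sketching (an addition, not from the answers above): call where on the boolean mask itself. Replacing the False values with NaN upcasts the Series to object dtype, so the remaining values stay genuine True:
m = df['col1'].isin(my_list)
df['col2'] = m.where(m)  # False becomes NaN; True values are preserved as bool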
I have two DataFrames.
df1 is a table I need to pull values from, using (index, column) pairs retrieved from multiple columns in df2.
I see there is a function, get_value, which works perfectly when given an index and a column value, but when trying to vectorize this function to create a new column I am failing...
df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df1.columns = list('abcde')
df1.index = ['cat', 'dog', 'fish', 'bird']
a b c d e
cat 0 1 2 3 4
dog 5 6 7 8 9
fish 10 11 12 13 14
bird 15 16 17 18 19
df1.get_value('bird', 'c')
17
Now what I need to do is create an entire new column on df2 by indexing df1 with the (index, column) pairs given in the animal and letter columns of df2, effectively vectorizing the get_value call above.
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))
df2['animal'] = ['cat', 'dog', 'fish', 'bird']
df2['letter'] = list('abcd')
0 1 2 3 4 animal letter
0 0 1 2 3 4 cat a
1 5 6 7 8 9 dog b
2 10 11 12 13 14 fish c
3 15 16 17 18 19 bird d
resulting in . . .
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Deprecation Notice: lookup was deprecated in v1.2.0
There's a function aptly named lookup that does exactly this.
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
df2
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
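Since lookup is deprecated, one straightforward replacement builds on Index.get_indexer; a sketch, assuming unique labels and that every (animal, letter) pair exists in df1 (missing labels come back as -1 and would silently pick the wrong element):
rows = df1.index.get_indexer(df2.animal)
cols = df1.columns.get_indexer(df2.letter)
df2['looked_up'] = df1.values[rows, cols]  # elementwise positional pick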
If you are looking for a slightly faster approach, zip will help in the case of a small dataframe:
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
Output:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
As John suggested, you can simplify the code, which will be much faster:
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
In case of missing data, guard the lookup with a conditional expression:
df2['looked_up'] = [df1.get_value(r, c) if not (pd.isnull(r) or pd.isnull(c)) else np.nan for r, c in zip(df2.animal, df2.letter)]
For small dataframes
%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop
k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop
[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop
For large dataframe
df3 = pd.concat([df2]*10000)
%%timeit
k = list(zip(df3['animal'].values,df3['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop
df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop
df2['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop
lookup and get_value are great answers if your values exist in the lookup dataframe.
However, if you have (row, column) pairs not present in the lookup dataframe and want the looked-up value to be NaN, merge and stack is one way to do it:
In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[206]:
0 1 2 3 4 animal letter looked_up
0 0 1 2 3 4 cat a 0
1 5 6 7 8 9 dog b 6
2 10 11 12 13 14 fish c 12
3 15 16 17 18 19 bird d 18
Test after adding a non-existing (animal, letter) pair:
In [207]: df22
Out[207]:
0 1 2 3 4 animal letter
0 0.0 1.0 2.0 3.0 4.0 cat a
1 5.0 6.0 7.0 8.0 9.0 dog b
2 10.0 11.0 12.0 13.0 14.0 fish c
3 15.0 16.0 17.0 18.0 19.0 bird d
4 NaN NaN NaN NaN NaN dummy NaN
In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
how='left').drop(['level_0', 'level_1'], 1)
Out[208]:
0 1 2 3 4 animal letter looked_up
0 0.0 1.0 2.0 3.0 4.0 cat a 0.0
1 5.0 6.0 7.0 8.0 9.0 dog b 6.0
2 10.0 11.0 12.0 13.0 14.0 fish c 12.0
3 15.0 16.0 17.0 18.0 19.0 bird d 18.0
4 NaN NaN NaN NaN NaN dummy NaN NaN
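A NaN-safe sketch without the merge bookkeeping (an addition, assuming the pandas version tolerates NaN labels in a MultiIndex): reindex the stacked Series with the (animal, letter) pairs, and absent pairs come back as NaN:
s = df1.stack()
pairs = pd.MultiIndex.from_arrays([df22.animal, df22.letter])
df22['looked_up'] = s.reindex(pairs).values  # .values sidesteps index alignment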
I'm trying to do something to a pandas dataframe, as follows:
If say row 2 has a 'nan' value in the 'start' column, then I can replace all row entries with '999999'
if pd.isnull(dfSleep.ix[2,'start']):
dfSleep.ix[2,:] = 999999
The above code works, but I want to do it for every row. I've tried replacing the '2' with a ':', but that does not work:
if pd.isnull(dfSleep.ix[:,'start']):
dfSleep.ix[:,:] = 999999
and I've tried something like this:
for row in df.iterrows():
if pd.isnull(dfSleep.ix[row,'start']):
dfSleep.ix[row,:] = 999999
but again no luck, any ideas?
I think row in your approach is not a row index; it's a full row of the DataFrame, since iterrows() yields (index, row) tuples.
You can use this instead:
for row in df.iterrows():
if pd.isnull(dfSleep.ix[row[0],'start']):
dfSleep.ix[row[0],:] = 999999
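Since .ix is deprecated in later pandas versions (and removed in 1.0), the same fix reads more cleanly with tuple unpacking and .loc; a sketch:
for idx, row in dfSleep.iterrows():
    # iterrows() yields (index, Series) pairs
    if pd.isnull(row['start']):
        dfSleep.loc[idx, :] = 999999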
UPDATE:
In [63]: df
Out[63]:
a b c
0 0 3 NaN
1 3 7 5.0
2 0 5 NaN
3 4 1 6.0
4 7 9 NaN
In [64]: df.ix[df.c.isnull()] = [999999] * len(df.columns)
In [65]: df
Out[65]:
a b c
0 999999 999999 999999.0
1 3 7 5.0
2 999999 999999 999999.0
3 4 1 6.0
4 999999 999999 999999.0
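For pandas versions without .ix, the same vectorized row fill works with .loc; a scalar on the right-hand side broadcasts across all columns of the selected rows:
df.loc[df.c.isnull()] = 999999  # fills every column of the matching rows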
You can use a vectorized approach (the .fillna() method); note that this fills only the single column, not every column of the row:
In [50]: df
Out[50]:
a b c
0 1 8 NaN
1 8 8 6.0
2 5 2 NaN
3 9 4 1.0
4 4 2 NaN
In [51]: df.c = df.c.fillna(999999)
In [52]: df
Out[52]:
a b c
0 1 8 999999.0
1 8 8 6.0
2 5 2 999999.0
3 9 4 1.0
4 4 2 999999.0
In a dataframe there are 4 columns: col_1, col1_id, col_2, col2_id. I want to locate col_2 values in col_1; if there is a match, the respective col1_id should be appended to col2_id.
col_1 col1_id col_2 col2_id
A 1 NaN NaN
B 2 K NaN
D 3 A NaN
J 4 NaN NaN
E 5 H NaN
Z 6 NaN NaN
H 7 H NaN
K 8 Z NaN
Any help? Thanks
Try:
df = df.set_index('col_1')
df['col2_id'] = df.col_2.apply(lambda x: x if pd.isnull(x) else df.loc[x, 'col1_id'])
df = df.reset_index()
df
col_1 col1_id col_2 col2_id
0 A 1 NaN NaN
1 B 2 K 8.0
2 D 3 A 1.0
3 J 4 NaN NaN
4 E 5 H 7.0
5 Z 6 NaN NaN
6 H 7 H 7.0
7 K 8 Z 6.0
There are 2 possible solutions, and the output of the first one looks better.
I think you need map with a dictionary d created from the columns col_1 and col1_id:
d = df[['col_1','col1_id']].set_index('col_1').to_dict()
print d
{'col1_id': {'A': 1, 'B': 2, 'E': 5, 'D': 3, 'H': 7, 'K': 8, 'J': 4, 'Z': 6}}
df['col2_id'] = df.col_2.map(d['col1_id'])
print df
col_1 col1_id col_2 col2_id
0 A 1 NaN NaN
1 B 2 K 8.0
2 D 3 A 1.0
3 J 4 NaN NaN
4 E 5 H 7.0
5 Z 6 NaN NaN
6 H 7 H 7.0
7 K 8 Z 6.0
Or you can use isin with where:
print df.col_1.isin(df.col_2)
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
Name: col_1, dtype: bool
df['col2_id'] = df.col1_id.where(df.col_1.isin(df.col_2))
print df
col_1 col1_id col_2 col2_id
0 A 1 NaN 1.0
1 B 2 K NaN
2 D 3 A NaN
3 J 4 NaN NaN
4 E 5 H NaN
5 Z 6 NaN 6.0
6 H 7 H 7.0
7 K 8 Z 8.0
Timings:
def pil(df):
df = df.set_index('col_1')
df['col2_id'] = df.col_2.apply(lambda x: x if pd.isnull(x) else df.loc[x, 'col1_id'])
return df.reset_index()
def jez(df):
df['col2_id'] = df.col_2.map(df.set_index('col_1').to_dict()['col1_id'])
return df
print pil(df1)
print jez(df)
In [34]: %timeit jez(df)
1000 loops, best of 3: 1.48 ms per loop
In [35]: %timeit pil(df1)
The slowest run took 4.23 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.56 ms per loop
It seems to me that this problem looks like a standard task in an RDBMS, so you can use merge(). Note the column names are col_1 and col_2, not col1 and col2:
df['col2_id'] = pd.merge(df, df[['col_1', 'col1_id']], left_on='col_2', right_on='col_1', how='left')['col1_id_y']
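Spelled out with explicit suffixes, so the overlapping column names are unambiguous (a sketch using the question's column names):
merged = pd.merge(df, df[['col_1', 'col1_id']],
                  left_on='col_2', right_on='col_1',
                  how='left', suffixes=('', '_r'))
df['col2_id'] = merged['col1_id_r'].values  # .values sidesteps index alignment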
I am working with a DataFrame in pandas / Python. Each row has an ID (that is not unique), and I would like to modify the dataframe to add a column with the second name for each row that has multiple matching IDs.
Starting with:
ID Name Rate
0 1 A 65.5
1 2 B 67.3
2 2 C 78.8
3 3 D 65.0
4 4 E 45.3
5 5 F 52.0
6 5 G 66.0
7 6 H 34.0
8 7 I 2.0
Trying to get to:
ID Name Rate Secondname
0 1 A 65.5 None
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 None
4 4 E 45.3 None
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 None
8 7 I 2.0 None
My code:
import numpy as np
import pandas as pd
mydict = {'ID':[1,2,2,3,4,5,5,6,7],
'Name':['A','B','C','D','E','F','G','H','I'],
'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}
df=pd.DataFrame(mydict)
df['Newname']='None'
for i in range(0, df.shape[0]-1):
if df.irow(i)['ID']==df.irow(i+1)['ID']:
df.irow(i)['Newname']=df.irow(i+1)['Name']
Which results in the following error:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.irow(i)['Newname']=df.irow(i+1)['Secondname']
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.loc[key] = value
Any help would be much appreciated.
You can use groupby with a custom function f, which uses shift and combine_first:
def f(x):
#print x
x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
return x
print df.groupby('ID').apply(f)
ID Name Rate Secondname
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
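For completeness, a fully vectorized sketch (an addition, not from the original answer): GroupBy.shift returns a Series aligned to the original frame, so the same shift/combine_first idea works without the per-group Python function:
g = df.groupby('ID')['Name']
df['Secondname'] = g.shift(1).combine_first(g.shift(-1))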
You can avoid groupby: find the duplicated rows, fill the helper columns first and last from column Name via loc, then shift and combine_first, and finally drop the helper columns:
print df.duplicated('ID', keep='first')
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print df.duplicated('ID', keep='last')
0 False
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 False
dtype: bool
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
print df
ID Name Rate first last
0 1 A 65.5 NaN NaN
1 2 B 67.3 NaN B
2 2 C 78.8 C NaN
3 3 D 65.0 NaN NaN
4 4 E 45.3 NaN NaN
5 5 F 52.0 NaN F
6 5 G 66.0 G NaN
7 6 H 34.0 NaN NaN
8 7 I 2.0 NaN NaN
df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
df = df.drop(['first', 'last'], axis=1)
print df
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
TESTING (at the time of testing, the solution of Roman Kh had wrong output):
len(df) = 9:
In [154]: %timeit jez(df1)
100 loops, best of 3: 15 ms per loop
In [155]: %timeit jez2(df2)
100 loops, best of 3: 3.45 ms per loop
In [156]: %timeit rom(df)
100 loops, best of 3: 3.55 ms per loop
len(df) = 90k:
In [158]: %timeit jez(df1)
10 loops, best of 3: 57.1 ms per loop
In [159]: %timeit jez2(df2)
10 loops, best of 3: 36.4 ms per loop
In [160]: %timeit rom(df)
10 loops, best of 3: 40.4 ms per loop
import pandas as pd
mydict = {'ID':[1,2,2,3,4,5,5,6,7],
'Name':['A','B','C','D','E','F','G','H','I'],
'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}
df=pd.DataFrame(mydict)
print df
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def jez(df):
def f(x):
#print x
x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
return x
return df.groupby('ID').apply(f)
def jez2(df):
#print df.duplicated('ID', keep='first')
#print df.duplicated('ID', keep='last')
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
#print df
df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
df = df.drop(['first', 'last'], axis=1)
return df
def rom(df):
# cpIDs = True if the next row has the same ID
df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
# fill in the last row (get rid of NaN)
df.iloc[-1,df.columns.get_loc('cpIDs')] = False
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']
return df
print jez(df1)
print jez2(df2)
print rom(df)
print jez(df1)
ID Name Rate Secondname
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
print jez2(df2)
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
print rom(df)
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 NaN
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 NaN
7 6 H 34.0 NaN
8 7 I 2.0 NaN
EDIT:
If there are more duplicated pairs with the same names, use shift to create the first and last columns:
df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name']
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name']
If your dataframe is sorted by ID, you might add a new column which compares ID of the current row with ID of the next row:
# cpIDs = True if the next row has the same ID
df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
# fill in the last row (get rid of NaN)
df.iloc[-1,df.columns.get_loc('cpIDs')] = False
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']
Of course, you can shorten the code above; I intentionally made it longer but easier to comprehend.
Depending on your dataframe size it might be pretty fast (perhaps the fastest), as it does not use any complicated operations.
P.S. As a side note, try to avoid loops when dealing with dataframes and numpy arrays. Almost always you can find a so-called vectorized solution, which operates on the whole array or large ranges, not on individual cells and rows.