I am looking to update the values in a pandas series that satisfy a certain condition and take the corresponding value from another column.
Specifically, I want to look at the subcluster column and if the value equals 1, I want the record to update to the corresponding value in the cluster column.
For example:
Cluster Subcluster
3 1
3 2
3 1
3 4
4 1
4 2
Should result in this
Cluster Subcluster
3 3
3 2
3 3
3 4
4 4
4 2
I've been trying to use apply and a lambda function, but can't seem to get it to work properly. Any advice would be greatly appreciated. Thanks!
You can use np.where:
import numpy as np
df['Subcluster'] = np.where(df['Subcluster'].eq(1), df['Cluster'], df['Subcluster'])
Output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
In your case, try mask and assign the result back to the column:
df['Subcluster'] = df['Subcluster'].mask(df['Subcluster'].eq(1), df['Cluster'])
df
Out[12]:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
Or
df.loc[df.Subcluster==1,'Subcluster'] = df['Cluster']
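The `.loc` assignment above relies on index alignment: the right-hand side is the whole `Cluster` column, but only the rows selected by the boolean mask are overwritten. A minimal sketch, assuming the frame is built from the question's example table (the name `is_one` is only illustrative):

```python
import pandas as pd

# Data reconstructed from the question's example table
df = pd.DataFrame({
    'Cluster':    [3, 3, 3, 3, 4, 4],
    'Subcluster': [1, 2, 1, 4, 1, 2],
})

# Boolean mask: True where Subcluster equals 1
is_one = df['Subcluster'].eq(1)

# Only the masked rows are overwritten; the right-hand side is aligned on the index
df.loc[is_one, 'Subcluster'] = df['Cluster']

print(df)
```

The rows where `Subcluster` was not 1 keep their original values.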
Really, all you need here is .loc with a mask (you don't need to store the mask in a variable; you can write the condition inline).
import numpy as np
import pandas as pd
# random example data: cluster in 0..9, subcluster in 0..2
df = pd.DataFrame({'cluster': np.random.randint(0, 10, 10),
                   'subcluster': np.random.randint(0, 3, 10)})
df.to_clipboard(sep=',')
df at this point:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,1
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,1
Create and apply the mask (you could do this all in one line):
mask = df.subcluster == 1
df.loc[mask,'subcluster'] = df.loc[mask,'cluster']
df.to_clipboard(sep=',')
final output:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,6
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,3
Here's the lambda you couldn't write. With axis=1, apply passes each row to the lambda as a Series, so x['Cluster'] and x['Subcluster'] refer to that row's values in those columns.
df['Subcluster'] = df.apply(lambda x: x['Cluster'] if x['Subcluster'] == 1 else x['Subcluster'], axis = 1)
And the output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
I would like to drop the [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
Or, to make things more interesting, what if we have two columns and some of the cells contain [] that should be dropped?
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
output:
a b
0 1 2
2 4 1
4 5 6
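If comparing string representations or relying on `isin` feels too implicit, the same filter can be written with an explicit type check. This is only a sketch: `is_empty_list` is a hypothetical helper introduced here, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 4, [], 5], b=[2, [], 1, [], 6]))

def is_empty_list(value):
    """Hypothetical helper: True only for an empty list."""
    return isinstance(value, list) and len(value) == 0

# Apply the check to every cell, column by column
empty_cells = df.apply(lambda col: col.map(is_empty_list))

# Keep the rows in which no cell is an empty list
out = df[~empty_cells.any(axis=1)]
print(out)
```

Note that the original index labels are kept; call `reset_index(drop=True)` on the result if consecutive labels are wanted, as in the question's expected output.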
Using Pandas
I'm trying to determine whether a value in a certain row is greater than the values in all the other columns in the same row.
To do this I'm looping through the rows of a dataframe and using the 'all' function to compare the values in other columns; but it seems this is throwing an error "string indices must be integers"
It seems like this should work: What's wrong with this approach?
for row in dataframe:
    if all(i < row['col1'] for i in [row['col2'], row['col3'], row['col4'], row['col5']]):
        row['newcol'] = 'value'
Build a mask and pass it to loc:
df.loc[df['col1'] > df.loc[:, 'col2':'col5'].max(axis=1), 'newcol'] = 'newvalue'
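As for the error itself: `for row in dataframe` iterates over the column labels, not the rows, so `row` is a string and `row['col1']` raises "string indices must be integers". The sketch below shows this with made-up data (only the column names come from the question) and then performs the intended check without a loop:

```python
import pandas as pd

# Illustrative data; only the column names come from the question
df = pd.DataFrame({
    'col1': [9, 1, 5],
    'col2': [1, 2, 3],
    'col3': [2, 3, 4],
    'col4': [3, 4, 5],
    'col5': [4, 5, 4],
})

# Iterating a DataFrame yields its column labels, which are strings here
labels = [row for row in df]
print(labels)

# Vectorised version of the intended check: col1 greater than every other column
others = ['col2', 'col3', 'col4', 'col5']
is_largest = df[others].lt(df['col1'], axis=0).all(axis=1)

# Rows that fail the check are left as missing values in the new column
df.loc[is_largest, 'newcol'] = 'value'
print(df)
```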
The main problem, in my opinion, is using a loop for vectorisable logic.
Below is an example of how your logic can be implemented using numpy.where.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 9, (5, 10)))
# compare column 1 with the maximum of the *other* columns in each row
df['new_col'] = np.where(df[1] > df.drop(columns=1).max(axis=1),
                         'col1_is_max',
                         'col1_not_max')
Result:
0 1 2 3 4 5 6 7 8 9 new_col
0 4 1 3 8 3 2 5 1 1 2 col1_not_max
1 2 7 1 2 5 3 5 1 8 5 col1_not_max
2 1 8 2 5 7 4 0 3 6 3 col1_is_max
3 6 4 2 1 7 2 0 8 3 2 col1_not_max
4 0 1 3 3 0 3 7 4 4 1 col1_not_max
Say I have this dataframe df:
A B C
0 1 1 2
1 2 2 2
2 1 3 1
3 4 5 2
Say you want to select all rows in which column C is greater than 1. If I do this:
newdf=df['C']>1
I only obtain True or False in the resulting df. Instead, in the example given I want this result:
A B C
0 1 1 2
1 2 2 2
3 4 5 2
What would you do? Do you suggest using iloc?
Use boolean indexing:
newdf=df[df['C']>1]
Or use query:
df.query('C > 1')
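Both answers come down to the same idea: `df['C'] > 1` produces a boolean Series, and that Series is then used to pick rows. A small sketch with the question's data (the variable names `mask` and `same` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 4], 'B': [1, 2, 3, 5], 'C': [2, 2, 1, 2]})

mask = df['C'] > 1          # boolean Series, one entry per row
newdf = df[mask]            # keeps only the rows where the mask is True
same = df.query('C > 1')    # the same selection written as a string expression

print(newdf)
```

`iloc` selects by integer position, so it is not the natural tool for a condition on values.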
I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did is replaced y and z in df_a with the respective columns in df_b based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement? (In the real problem, I have to update a dataset with a new dataset, replacing the values in a fourth column wherever two or three other columns match between the two.)
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
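One point worth noting is that `DataFrame.update` aligns on the index and column labels, not on the first column; in the example it only looks like a match on x because both frames share the default index 0 to 4. A possible approach, sketched under the assumption that the key values are unique in `df_b`, is to move the key column(s) into the index before calling `update`. The names `key` and `value_cols` are hypothetical, and the z values are made up for illustration:

```python
import pandas as pd

df_a = pd.DataFrame({'x': range(5), 'y': range(4, -1, -1),
                     'z': [0.1, 0.2, 0.3, 0.4, 0.5]})
df_b = pd.DataFrame({'x': range(5), 'y': range(5),
                     'z': [10.0, 11.0, 12.0, 13.0, 14.0]})

key = ['y']          # hypothetical: column(s) to match on
value_cols = ['z']   # hypothetical: column(s) to overwrite

# Move the key into the index so that update() aligns on it
updated = df_a.set_index(key)
updated.update(df_b.set_index(key)[value_cols])

# Restore the key as a column and the original column order
updated = updated.reset_index()[df_a.columns]
print(updated)
```

Here x is left untouched and z is replaced wherever y matches. Listing several column names in `key` should extend the same idea to a multi-column match.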
I'm trying to replace a row in a dataframe with the row of another dataframe only if they share a common column.
Here is the first dataframe:
index no foo
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
and the second dataframe:
index no foo
0 2 aaa
1 3 bbb
2 22 3
3 33 4
4 44 5
5 55 6
I'd like my result to be
index no foo
0 0 1
1 1 2
2 2 aaa
3 3 bbb
4 4 5
5 5 6
The result of the inner merge between both dataframes returns the correct rows, but I'm having trouble inserting them at the correct index in the first dataframe
Any help would be greatly appreciated.
Thank you.
This should work as well
df1['foo'] = pd.merge(df1, df2, on='no', how='left').apply(lambda r: r['foo_y'] if r['foo_y'] == r['foo_y'] else r['foo_x'], axis=1)
You could use apply, although there is probably a better way than this (here df is the first dataframe and df1 is the second):
In [67]:
# define a function that takes a row and tries to find a match
def func(x):
    # find if 'no' value matches, test the length of the series
    if len(df1.loc[df1.no == x.no, 'foo']) > 0:
        return df1.loc[df1.no == x.no, 'foo'].values[0]  # return the first array value
    else:
        return x.foo  # no match so return the existing value
# call apply with a lambda row-wise (axis=1 means row-wise)
df.foo = df.apply(lambda row: func(row), axis=1)
df
Out[67]:
index no foo
0 0 0 1
1 1 1 2
2 2 2 aaa
3 3 3 bbb
4 4 4 5
5 5 5 6
[6 rows x 3 columns]
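A loop-free alternative is to look up each `no` value with `Series.map` and fall back to the existing value where nothing matches. This is a sketch, assuming `df1` is the first dataframe, `df2` the second, and the `no` values in `df2` are unique; the names `lookup` and `matched` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'no': [0, 1, 2, 3, 4, 5], 'foo': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'no': [2, 3, 22, 33, 44, 55],
                    'foo': ['aaa', 'bbb', 3, 4, 5, 6]})

# Look up each 'no' in df2; rows without a match become NaN
lookup = df2.set_index('no')['foo']
matched = df1['no'].map(lookup)

# Keep the matched value where there is one, otherwise the original value
df1['foo'] = matched.where(matched.notna(), df1['foo'])
print(df1)
```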