replace values in pandas based on aggregation and condition - python

I have a dataframe like this:
I want to replace values in col1 with a specific value (e.g. "b"). I need to count the records of each group based on col1 and col2. For example, the count of col1 = a, col2 = t is 3 and the count of col1 = a, col2 = u is 1.
If the count is greater than 2, then replace the value of col1 with 'b'. For this example, I want to replace all "a" values with "b" where col2 = t.
I tried the code below, but it did not change all of the "a" values under this condition.
import pandas as pd
df = pd.read_excel('c:/test.xlsx')
df.loc[df[(df['col1'] == 'a') & (df['col2'] == 't')].agg("count")["ID"] >2, 'col1'] = 'b'
I want this result:

You can use numpy.where and check whether all your conditions are satisfied. If yes, replace the values in col1 with b, and otherwise leave the values as is:
import numpy as np

df['col1'] = np.where((df['col1'] == 'a') &
                      (df['col2'] == 't') &
                      (df.groupby('col1')['ID'].transform('count') > 2),
                      'b', df['col1'])
prints:
ID col1 col2
0 1 b t
1 2 b t
2 3 b t
3 4 a u
4 5 b t
5 6 b t
6 7 b u
7 8 c t
8 9 c u
9 10 c w
Using transform('count') checks whether the ID column, grouped by col1, has more than 2 values in the row's group.
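If you want the count taken per (col1, col2) pair, as the question describes, a minimal sketch (assuming the same df with an ID column) would group by both columns instead:

# assumption: count per (col1, col2) pair rather than per col1 alone
counts = df.groupby(['col1', 'col2'])['ID'].transform('count')
df['col1'] = np.where((df['col1'] == 'a') & (counts > 2), 'b', df['col1'])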

Related

Remove duplicates in a row pandas

I have a df
Name Symbol Dummy
A (BO),(BO),(AD),(TR) 2
B (TV),(TV),(TV) 2
C (HY) 2
D (UI) 2
I need df as
Name Symbol Dummy
A (BO),(AD),(TR) 2
B (TV) 2
C (HY) 2
D (UI) 2
I tried drop_duplicates, but it is not working as expected, since it removes duplicate rows rather than duplicate values within a single cell.
Split the strings on the delimiter ',', then dedupe using dict.fromkeys, which also preserves the order of the strings, and finally join on the delimiter ',':
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
Name Symbol Dummy
0 A (BO),(AD),(TR) 2
1 B (TV) 2
2 C (HY) 2
3 D (UI) 2
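A quick illustration of why dict.fromkeys works for this (a minimal sketch on a plain list):

items = ['(BO)', '(BO)', '(AD)', '(TR)']
deduped = list(dict.fromkeys(items))   # dict keys are unique and keep insertion order
print(deduped)                         # ['(BO)', '(AD)', '(TR)']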
Another method
# original DF
index  col1                 col2
0      (BO),(BO),(AD),(TR)  2
df.col1 = df.col1.str.split(',').apply(lambda x: sorted(set(x), key=x.index)).str.join(',')
df
# output
index  col1            col2
0      (BO),(AD),(TR)  2
If the order of the values is not important, you can simply do:
df.col1 = df.col1.str.split(',').apply(lambda x: set(x)).str.join(',')
df
# output
index  col1            col2
0      (AD),(BO),(TR)  2

Pandas - changing multiple values in COL1 based on unique values in COL2 where a condition has been met for a value in COL1

I am having a hard time changing all the values in one column where another column has a unique ID associated with the values that need to be changed. For example...
col1 | col2
a x
a x
a y
a y
b 'none'
b x
b x
b z
b z
I need to be able to check where col2 contains 'none' and then change all of the values in col2 to 'none' where col1 is equal to 'b'. Please bear in mind that the values I provided here in the example are not the real values; they are much longer and there are hundreds of thousands of rows, so checking the names manually is not an option. This would be the desired outcome...
col1 | col2
a x
a x
a y
a y
b 'none'
b 'none'
b 'none'
b 'none'
b 'none'
I am not sure how to even start with this conditional statement in pandas. Your help will be greatly appreciated. I am slowly building my knowledge of Pandas methods.
Use .loc twice:
df.loc[df['col1'].isin(df.loc[df['col2'].eq('none'), 'col1']), 'col2'] = 'none'
print(df)
# Output:
col1 col2
0 a x
1 a x
2 a y
3 a y
4 b none
5 b none
6 b none
7 b none
8 b none
Step-by-step:
# Extract rows where col2 is 'none'
>>> df.loc[df['col2'].eq('none'), 'col1']
4 b
Name: col1, dtype: object
# Create a boolean mask
>>> df['col1'].isin(df.loc[df['col2'].eq('none'), 'col1'])
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
Name: col1, dtype: bool
# Set 'none' to all rows from same "group"
>>> df.loc[df['col1'].isin(df.loc[df['col2'].eq('none'), 'col1']), 'col2'] = 'none'
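An equivalent sketch using groupby/transform (assuming the same df), which flags every col1 group that contains at least one 'none' in col2 and overwrites col2 for those groups:

# mark groups of col1 that contain any 'none' in col2
mask = df['col2'].eq('none').groupby(df['col1']).transform('any')
df.loc[mask, 'col2'] = 'none'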

Pandas Groupby mean and first of multiple columns

My Pandas df is like the following, and I want to apply groupby and then calculate the average and the first value of many columns:
index col1 col2 col3 col4 col5 col6
0 a c 1 2 f 5
1 a c 1 2 f 7
2 a d 1 2 g 9
3 b d 6 2 g 4
4 b e 1 2 g 8
5 b e 1 2 g 2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']:'mean',['col4','col2']:'first'})
Expected output:
col1 col5 col6 col3 col4 col2
a f 6 1 2 c
a g 9 1 2 d
b g 4 3 2 e
but it seems a list is not an option here. In my real dataset I have hundreds of columns of different natures, so I can't pass them individually. Any thoughts on passing them as lists?
If you have one list per aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col:'mean' for col in l_mean},
                                 **{col:'first' for col in l_first}})
The notation **{} unpacks a dictionary: {**{}, **{}} builds one dictionary from two dictionaries (it could be more than two), like a union of dictionaries. And {col:'mean' for col in l_mean} is a dictionary comprehension that creates a dictionary with each column of the list as a key and 'mean' as its value.
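A quick sketch of the merged aggregation dictionary this produces (using the lists above):

agg_dict = {**{col: 'mean' for col in l_mean}, **{col: 'first' for col in l_first}}
print(agg_dict)   # {'col6': 'mean', 'col3': 'mean', 'col4': 'first', 'col2': 'first'}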
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
and call reset_index afterwards to get the expected output.
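Put together, a minimal sketch of the concat variant (assuming the l_mean and l_first lists defined above):

gr = df.groupby(['col1', 'col5'])
out = pd.concat([gr[l_mean].mean(), gr[l_first].first()], axis=1).reset_index()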
(
    df.groupby(['col1','col5'])
      .agg(col6=('col6', 'mean'),
           col3=('col3', 'mean'),
           col4=('col4', 'first'),
           col2=('col2', 'first'))
)
This is an extension of Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
# set list1 and list2
def fil(grp, list1, list2):
    A = grp.mean().filter(list1)
    B = grp.first().filter(list2)
    C = A.join(B)
    return C

grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil, grp1, grp2)
m

Keep the first rows of continuous specific values in a pandas data frame?

I have a data frame like this,
df
col1 col2
1 A
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 A
10 A
11 A
12 A
13 B
14 A
15 B
16 A
17 A
18 A
Now, if there are continuous Bs, or only one row between two Bs, then display the starting row of those Bs.
So final output would look like,
col1 col2
7 B
13 B
I could do this using a for loop by comparing the row values, but the execution time will be huge. I am looking for any pandas shortcut or any other method to do it most efficiently.
You can first replace non-B values with missing values and then forward fill them with limit=1 - so two Bs separated by a single row form one group - and finally get the first value of each B group:
m = df['col2'].where(df['col2'].eq('B')).ffill(limit=1).eq('B')
df = df[ m.ne(m.shift()) & m]
print (df)
col1 col2
6 7 B
12 13 B
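Step by step, the same logic on the original df reads as (a minimal sketch):

only_b = df['col2'].where(df['col2'].eq('B'))   # keep 'B', everything else becomes NaN
m = only_b.ffill(limit=1).eq('B')               # bridge a single-row gap between two Bs
print(df[m.ne(m.shift()) & m])                  # keep only the first row of each True run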
cols = []
for i in range(len(df)):
    if i != 0:
        # start of a run: col2 is 'B' and the previous row is not 'B'
        if df['col2'][i] == 'B' and df['col2'][i-1] != 'B':
            # also require the row two back not to be 'B'
            if i >= 2 and df['col2'][i-2] != 'B':
                cols.append(df['col1'][i])
print(df[df['col1'].isin(cols)])
Output:
col1 col2
7 B
13 B
Find the indexes where col2 is 'B' and neither the i-1 row nor the i-2 row is 'B', then retrieve the matching rows from the data frame using the collected values.
You can use shift and vector logic:
a = df['col2']
mask = (a.shift(1) != a) & ((a.shift(-1) == a) | (a.shift(-2) == a)) & (a == 'B')
df = df[mask]
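On the sample above this keeps the same two rows: the mask requires that the current row is 'B', the previous row is not 'B', and another 'B' follows within the next two rows. A minimal usage sketch (assuming the same df):

print(df)
#    col1 col2
# 6     7    B
# 12   13    B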

Pandas - Count occurrences in a column

I have this file with 19 columns of mixed dtypes. One of the columns contains elements which are separated by spaces. For example:
Col1 Col2
adress1 x
adress2 a b
adress3 x c
adress4 a x d
What I want to do is go over Col2, find out how many times each element occurs, and put the result in a new column along with its corresponding value in Col1.
Note that the above columns have already been loaded into a DataFrame.
I have this, which somewhat gives me the results, but not what I ultimately want:
new_df = pd.DataFrame(old_df.Col2.str.split(' ').tolist(), index=old_df.Col1).stack()
How do I put the results in a new column (replacing Col2) and also keep the remaining columns?
Something like:
Col1 Col2 Col3
adress1 x something
adress2 a something1
adress2 b something1
adress3 x NaN
adress3 c NaN
And also calculate the occurrences of the items in Col2?
We can do the split first, then explode:
s=df.assign(Col2=df.Col2.str.split()).explode('Col2')
s=s.groupby(['Col1','Col2']).size().to_frame('count').reset_index()
Out[48]:
Col1 Col2 count
0 adress1 x 1
1 adress2 a 1
2 adress2 b 1
3 adress3 c 1
4 adress3 x 1
5 adress4 a 1
6 adress4 d 1
7 adress4 x 1
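To also keep the remaining columns alongside the per-item counts, a minimal sketch would compute the count with transform on the exploded frame instead of collapsing it (the other columns, such as the hypothetical Col3 from the question, simply come along):

s = df.assign(Col2=df.Col2.str.split()).explode('Col2')             # one row per item, other columns kept
s['count'] = s.groupby(['Col1', 'Col2'])['Col2'].transform('size')  # per-(Col1, Col2) occurrence count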
