Drop Duplicates and Add Values Pandas - python

I have the dataframe below. I would like to drop the duplicates, but add the duplicated records' E-column values to the record that is kept:
import pandas as pd
import numpy as np

dfp = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
                    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
                    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455', 'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.nan],
                    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo', 'Testing', 'Unicycle', 'Pharma', 'Unicorn']})
print(dfp)
I'm grabbing all the duplicates:
df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 1.0 AA1233445 123456.0 Allign
2 3.0 3.0 rmacy 1234567.0 Hello
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Testing
and would like my outcome to be:
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence, but I'm failing to set the value of the E column to include the other duplicated values.
I'm thinking I need to try something like:
df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']
but my output is:
A B C D E
0 NaN 1.0 AA1233445 123456.0 AssignAssign
2 3.0 3.0 rmacy 1234567.0 HelloHello
4 5.0 0.0 Ab123455 12345.0 AppreciateAppreciate
I'm stumped. Am I overcomplicating it? How can I get the output I'm looking for, so that I can later drop all the duplicates except the first, but 'save' the values of the dropped rows in the E column?

Define the functions to use in agg within groupby. To get groupby to work with NaN keys, I converted A to strings, then back to floats.
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values',
    inplace=False
)
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 4.0 5.0 Idaho Rx 12345678.0 Ugly
3 5.0 0.0 Ab123455 12345.0 Appreciate Undo
4 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
5 6.0 0.0 RX12345 1234567.0 Pharma
6 7.0 0.0 USA Pharma NaN Unicorn
Limiting it to just the duplicated rows:
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)
d2
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 5.0 0.0 Ab123455 12345.0 Appreciate Undo
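For reference, on pandas 1.1 and later the string round-trip is avoidable: groupby accepts dropna=False, which keeps NaN as its own group key. A minimal sketch of the same aggregation (not part of the original answer):
# pandas >= 1.1: keep the NaN group directly instead of casting A to str
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
out = dfp.groupby('A', sort=False, dropna=False).agg(f).reset_index()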

Here is my ugly solution:
In [263]: (dfp.reset_index()
     ...:     .assign(A=dfp.A.fillna(-1))
     ...:     .groupby('A')
     ...:     .filter(lambda x: len(x) > 1)
     ...:     .groupby('A', as_index=False)
     ...:     .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
     ...:     .replace({'A': {-1: np.nan}})
     ...:     .set_index('index'))
Out[263]:
A B C D E
index
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
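A somewhat shorter take on the same idea, using transform to broadcast the joined strings before deduplicating (a sketch with the same -1 sentinel for NaN; not from the answers above):
tmp = dfp.assign(A=dfp.A.fillna(-1))
mask = tmp.duplicated('A', keep=False)
# join E within each duplicated group, then keep only the first row per key
tmp.loc[mask, 'E'] = tmp[mask].groupby('A')['E'].transform(' '.join)
out = tmp.drop_duplicates('A').replace({'A': {-1: np.nan}})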

pandas pivot table where the column contains a string with multiple categories

I have data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
'a' 'b' 'c'
1
2 2
3 3 3
2 2
1
How do I perform this? If I use the pivot command:
df.pivot(columns='cat', values='value')
it yields this result:
'a' 'a,b' 'a,b,c' 'b,c' 'b'
1
2
3
2
1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
df = df.explode('cat').pivot_table(index=df.explode('cat').index, columns='cat', values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then reset, or rename the index if you wish for it to not be named cat.
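For example, one reliable way to clear that label (an illustrative line, not from the answer):
df.columns.name = None  # pivoting leaves the columns axis named 'cat'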
Try str.get_dummies and multiply by the value column (then replace 0 with NaN if necessary):
df['cat'].str.get_dummies(",").mul(df['value'], axis=0).replace(0, np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
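One caveat not raised in the answer: replace(0, np.nan) also wipes out genuine zeros in value. If zeros can occur, a safer sketch is to mask on the dummies themselves:
d = df['cat'].str.get_dummies(',')
out = d.mul(df['value'], axis=0).where(d.astype(bool))  # NaN only where the category is absent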

create new pandas column based on if and else rule

I have this dataframe and I want to create column e:
df
a b c d
1 2 1 2
NaN NaN 3 1
NaN NaN NaN 5
4 5 0 2
I want to create the new column based on these criteria:
1. Take the higher of column a vs column b.
2. If there is no value in columns a and b, then look at column c.
3. If there is no value in column c, then look at column d.
df
a b c d e
1 2 1 2 2
NaN NaN 3 1 3
NaN NaN NaN 5 5
4 5 0 2 5
My idea only covers up to step 2:
def e(x):
    if x['a'] >= x['b']:
        return x['a']
    elif x['a'] <= x['b']:
        return x['b']
    else:
        return x['c']

df['e'] = df.apply(e, axis=1)
IIUC, use pandas.DataFrame.bfill:
df["e"] = df.bfill(1)[["a", "b"]].max(1)
print(df)
Output:
a b c d e
0 1 2 1 2 2.0
1 NaN NaN 3 1 3.0
2 NaN NaN NaN 5 5.0
3 4 5 0 2 5.0
You can always use np.where():
df['e'] = df['d']
df['e'] = np.where((df['a'].isna()) & (df['b'].isna()) & (df['c'].notnull()), df['c'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['a'] > df['b']), df['a'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['b'] > df['a']), df['b'], df['e'])
df
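Note, for the record, a subtle gap in the chain above: when a equals b (both present), neither strict comparison fires and e keeps the d value. Making one comparison inclusive closes it, e.g.:
df['e'] = np.where(df['a'].notnull() & df['b'].notnull() & (df['a'] >= df['b']), df['a'], df['e'])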
First get the maximum of the a and b values and assign it to column a, then back-fill missing values across the row and select the first column; this prioritizes c and then d:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 1 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
If you want to test only the a, b, c, d columns while the frame may contain other columns:
df['e'] = df[['a','b']].max(axis=1).fillna(df.c).fillna(df.d)
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 5 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
If the second row's c, d values are changed to 3, 5, the output is:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 5 3.0 <- changed d=5
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0

Perform arithmetic operations on null values

When I try to do an arithmetic operation involving two or more columns, I run into problems with null values.
One more thing I want to mention: I don't want to fill in the missing/null values.
I actually want something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum, but it didn't work:
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated a single sum over the entire column. You do not want that; you probably want to call np.nansum on the two columns instead, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yield the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1:
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
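For completeness, pandas has this built in as well: Series.add takes a fill_value that stands in for a missing operand, so only rows where both sides are NaN stay NaN (a one-line sketch, not from the answers above):
df['c'] = df['a'].add(df['b'], fill_value=0)  # 3 + NaN -> 3.0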

How to implement sql coalesce in pandas

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D'. Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
...     return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
I think you need bfill with selecting first column by iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
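(fillna(method=...) has since been deprecated in pandas in favor of calling bfill directly, as in the first snippet.)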
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
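Note that DataFrame.lookup itself was later deprecated (pandas 1.2) and removed (2.0); on modern versions, the numpy variant in option 2 below is the drop-in replacement.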
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
naive time test (benchmark plots over the given data and over larger data are not reproduced here)
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just stack them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0

Python Pandas: Get single max numeric value from df with floats and characters/alphas

I want to get the single max numeric value from the df below, which contains a mix of floats and characters/alphas.
Here is my df:
df = pd.DataFrame({'group1': ['a','a','a','b','b','b','c','c','d','d','d','d','d'],
                   'group2': ['c','c','d','d','d','e','f','f','e','d','d','d','e'],
                   'value1': [1.1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4],
                   'value2': [7.1, 8, 9, 10, 11, 12, 43, 12, 34, 5, 6, 2, 3]})
This is what it looks like:
group1 group2 value1 value2
0 a c 1.1 7.1
1 a c 2.0 8.0
2 a d 3.0 9.0
3 b d 4.0 10.0
4 b d 5.0 11.0
5 b e 6.0 12.0
6 c f 7.0 43.0
7 c f 8.0 12.0
8 d e 9.0 34.0
9 d d 1.0 5.0
10 d d 2.0 6.0
11 d d 3.0 2.0
12 d e 4.0 3.0
Expected outcome:
43.0
At the moment I am creating a new df which excludes "group1" and "group2", but surely there is a better way to pull the max numeric value?
Note: this thread is linked to http://goo.gl/ZJoR9V
Thanks
Using pandas 0.14.1, this is a nice clean way:
In [6]: df.select_dtypes(exclude=['object']).max().max()
Out[6]: 43.0
Or
In [6]: df.select_dtypes(exclude=['object']).unstack().max()
Out[6]: 43.0
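On current pandas, the same scalar can also be pulled by selecting numeric dtypes positively rather than by exclusion (an equivalent sketch):
df.select_dtypes(include='number').to_numpy().max()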
