Python Pandas - Deal with duplicates

I want to deal with duplicates in a pandas df:
import pandas as pd

df = pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],'C':[2,2,1,1,1],'D':['a','c','a','c','c']})
df
I want to keep only rows with unique values of A, B, C and create binary columns D_a and D_c, so the result will be something like this, without doing super slow loops on each row:
result = pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})
Thanks a lot

You can use:
df1 = (df.groupby(['A','B','C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)  # clip_upper(1) in older pandas; removed in pandas 1.0
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
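For comparison, pd.crosstab can do the same counting in a single call; a minimal sketch of that variant (same clip trick to turn counts into 0/1 flags; the variable name out is mine):
import pandas as pd

df = pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],
                   'C':[2,2,1,1,1],'D':['a','c','a','c','c']})

# Count occurrences of each D value per unique (A, B, C) combination,
# then cap the counts at 1 so the columns become binary indicators.
out = (pd.crosstab([df['A'], df['B'], df['C']], df['D'])
         .clip(upper=1)
         .add_prefix('D_')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)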

Using get_dummies + sum:
df = df.set_index(['A', 'B', 'C'])\
       .D.str.get_dummies()\
       .groupby(level=[0, 1, 2]).sum()\
       .add_prefix('D_')\
       .reset_index()
df
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
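On recent pandas the same idea works without setting an index first; a hedged sketch (pd.get_dummies on the whole frame, then max per group so a letter repeated within a group still yields a 0/1 flag):
import pandas as pd

df = pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],
                   'C':[2,2,1,1,1],'D':['a','c','a','c','c']})

# One-hot encode D into integer columns D_a and D_c, then take the
# maximum per (A, B, C) group: any occurrence in the group gives a 1.
out = (pd.get_dummies(df, columns=['D'], dtype=int)
         .groupby(['A', 'B', 'C'], as_index=False)
         .max())
print(out)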

You can do something like this:
df.loc[df['D']=='a', 'D_a'] = 1
df.loc[df['D']=='c', 'D_c'] = 1
This will put a 1 in a new column wherever an "a" or "c" appears.
   A  B  C  D  D_a  D_c
0  1  2  2  a  1.0  NaN
1  1  2  2  c  NaN  1.0
2  1  1  1  a  1.0  NaN
3  2  2  1  c  NaN  1.0
4  1  1  1  c  NaN  1.0
but then you have to replace the NaN with a 0.
df = df.fillna(0)
Next you only have to select the columns you need and then drop the duplicates.
df = df[["A","B","C", "D_a", "D_c"]].drop_duplicates()
Hope this is the solution you were looking for.
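One caveat with this approach: after the fillna the indicator columns are floats (1.0/0.0). If you want plain integer flags, a final cast fixes that (a small addition, not in the original answer):
# Convert the float indicator columns to plain ints after the fillna.
df[['D_a', 'D_c']] = df[['D_a', 'D_c']].astype(int)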

Related

How to fill nan column with natural numbers beginning in order?

I have a DataFrame:
   Columns
0      NaN
1      NaN
2      NaN
3      NaN
I want to fill all the NaN values here with natural numbers starting from 1, in increasing order.
Expected output:
   Columns
0        1
1        2
2        3
3        4
Any suggestions to do this?
df['Columns'] = df['Columns'].fillna(??????????)
Solution if you need to replace only the missing values: use DataFrame.loc with Series.cumsum. True values are treated as 1, so the cumulative sum of the boolean mask yields 1, 2, 3, ... at exactly the missing positions:
m = df['Columns'].isna()
# nice solution from @Ch3steR, thank you
df.loc[m, 'Columns'] = m.cumsum()
# alternative:
# df.loc[m, 'Columns'] = range(1, m.sum() + 1)
print (df)
   Columns
0        1
1        2
2        3
3        4
Test with other data:
print (df)
   Columns
0      NaN
1      NaN
2    100.0
3      NaN
m = df['Columns'].isna()
df.loc[m, 'Columns'] = m.cumsum()
print (df)
   Columns
0      1.0
1      2.0
2    100.0
3      3.0
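Note the filled column stays float, because NaN forced a float dtype. If integer-looking output matters, the nullable integer dtype is one option; a hedged sketch, assuming pandas >= 0.24:
# Cast to the nullable Int64 dtype so values print as integers while
# missing data is still representable.
df['Columns'] = df['Columns'].astype('Int64')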
If you need to set the values from a range, so that the original column values are overwritten, use:
df['Columns'] = range(1, len(df) + 1)
print (df)
   Columns
0        1
1        2
2        3
3        4

Remove duplicate column based on a condition in pandas

I have a DataFrame in which I have a duplicate column, namely Weather.
One of the two contains NaN values; that is the one I want to remove from the DataFrame.
I tried this method:
data_cleaned4.drop('Weather', axis=1)
It dropped both columns, as it should. Then I tried to pass a condition to the drop method, but it shows me an error:
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how to remove this column? Remember that the second-to-last column contains the NaN values, not the last one.
A general solution: df.isnull().any(axis=0).values tells you which columns contain any NaN values, and df.columns.duplicated(keep=False) marks every duplicate as True; combined, they flag exactly the columns to drop, so negating the mask retains the rest.
General solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input:
   A  B  C    C    A
0  1  1  1  3.0  NaN
1  1  1  1  2.0  1.0
2  2  3  4  NaN  2.0
3  1  1  1  4.0  1.0
Output:
   A  B  C
0  1  1  1
1  1  1  1
2  2  3  4
3  1  1  1
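To see why this works, here are the intermediate masks for the sample input, spelled out step by step (a sketch; the variable names are mine):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 3.0, np.nan],
                   [1, 1, 1, 2.0, 1.0],
                   [2, 3, 4, np.nan, 2.0],
                   [1, 1, 1, 4.0, 1.0]],
                  columns=['A', 'B', 'C', 'C', 'A'])

has_nan = df.isnull().any(axis=0).values    # [False False False  True  True]
is_dup = df.columns.duplicated(keep=False)  # [ True False  True  True  True]
to_drop = has_nan & is_dup                  # [False False False  True  True]
result = df.loc[:, ~to_drop]                # keeps A, B and the first C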
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
            & (df.columns == 'C'))]
Input:
   A  B  C    C    A
0  1  1  1  3.0  NaN
1  1  1  1  2.0  1.0
2  2  3  4  NaN  2.0
3  1  1  1  4.0  1.0
Output:
   A  B  C    A
0  1  1  1  NaN
1  1  1  1  1.0
2  2  3  4  2.0
3  1  1  1  1.0
Because of the duplicate names you can't drop by label (drop('Weather', axis=1) removes both columns at once), so work by position: check which of the last two columns contains NaNs, then keep every column except that one.
checkone = data_cleaned4.iloc[:, -1].isna().any()   # NaNs in the last column?
checktwo = data_cleaned4.iloc[:, -2].isna().any()   # NaNs in the second-to-last?
if checkone:
    i = len(data_cleaned4.columns) - 1   # drop the last column
elif checktwo:
    i = len(data_cleaned4.columns) - 2   # drop the second-to-last column
else:
    i = len(data_cleaned4.columns) - 2   # fall back to the second-to-last
data_cleaned4 = data_cleaned4.iloc[:, [j for j, c in enumerate(data_cleaned4.columns) if j != i]]
Without a testable sample, and assuming you don't have NaNs anywhere else in your DataFrame,
df = df.dropna(axis=1)
should work.
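If other columns could contain scattered NaNs, dropna(axis=1) would remove those too. Assuming the duplicate Weather column is entirely empty, restricting the drop to all-NaN columns is a safer variant (a sketch, not from the original answer):
# Drop only columns that are entirely NaN; partially filled ones survive.
df = df.dropna(axis=1, how='all')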

How to find sum and count of a column based on a grouping condition on a Pandas dataset?

I have a Pandas dataset with 3 columns. I need to group by the ID column while finding the sum and count of the other two columns. Also, I have to ignore the zeroes in the columns 'A' and 'B'.
The dataset looks like:
ID  A   B
1   0   5
2   10  0
2   20  0
3   0   30
What I need:
ID  A_Count  A_Sum  B_Count  B_Sum
1   0        0      1        5
2   2        30     0        0
3   0        0      1        30
I have tried this using one column but wasn't able to get both aggregations into the final dataset:
(df.groupby('ID').agg({'A':'sum', 'A':'count'}).reset_index().rename(columns = {'A':'A_sum', 'A': 'A_count'}))
If you don't pass it columns specifically, agg will aggregate all the numeric columns by itself. (Your attempt fails because a Python dict can't hold duplicate keys, so {'A':'sum', 'A':'count'} keeps only the count.)
Since you don't want to count zeroes, replace them with NaN first:
import numpy as np

df.replace(0, np.nan, inplace=True)
print(df)
   ID     A     B
0   1   NaN   5.0
1   2  10.0   NaN
2   2  20.0   NaN
3   3   NaN  30.0
df = df.groupby('ID').agg(['count', 'sum'])
print(df)
       A           B
   count   sum count   sum
ID
1      0   0.0     1   5.0
2      2  30.0     0   0.0
3      0   0.0     1  30.0
To remove the MultiIndex columns you can use a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print(df)
    A_count  A_sum  B_count  B_sum
ID
1         0    0.0        1    5.0
2         2   30.0        0    0.0
3         0    0.0        1   30.0
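On pandas 0.25+ the reshaping can be skipped entirely with named aggregation, which produces the flat column names in one go; a hedged sketch of that alternative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3], 'A': [0, 10, 20, 0], 'B': [5, 0, 0, 30]})

# Hide the zeroes so count/sum ignore them, then aggregate with explicit
# output names (named aggregation, available since pandas 0.25).
out = (df.replace(0, np.nan)
         .groupby('ID')
         .agg(A_Count=('A', 'count'), A_Sum=('A', 'sum'),
              B_Count=('B', 'count'), B_Sum=('B', 'sum'))
         .reset_index())
print(out)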

pandas append rows on index with overwrite

For example, two DataFrames are as below:
df1
index  a  b
0      1  1
1      1  1
df2
index  a  b
1      2  2
2      2  2
and I want df1.append(df2) with overwrite, so the result would be as below:
merged df
index  a  b
0      1  1
1      2  2  <= overwritten with the value from df2
2      2  2
Is there any good way to do this in pandas?
Using combine_first:
df1 = df1.set_index('index')
df2 = df2.set_index('index')
df2.combine_first(df1)
Out[279]:
         a    b
index
0      1.0  1.0
1      2.0  2.0
2      2.0  2.0
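combine_first upcasts to float along the way. If keeping the integer dtype matters, concatenating and keeping the last occurrence of each index label is one alternative (a sketch, not from the original answer):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1], 'b': [1, 1]}, index=[0, 1])
df2 = pd.DataFrame({'a': [2, 2], 'b': [2, 2]}, index=[1, 2])

# Stack both frames, then for each duplicated index label keep only the
# last row, i.e. the one that came from df2.
merged = pd.concat([df1, df2])
merged = merged[~merged.index.duplicated(keep='last')].sort_index()
print(merged)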

Display result of multi index array groupby in pandas dataframe

I have a DataFrame which looks like:
   D Type  Value
0  1    A      2
1  1    B      4
2  2    C      1
3  1    A      1
I want to group by D and Type and sum the values.
data = df.groupby(['D','Type']).sum()
print(data)
Which gives me this result:
         Value
D Type
1 A          3
  B          4
2 C          3
But I want it in this format:
D  A    B    C
1  3    4    NaN
2  NaN  NaN  3
UPDATE:
r = df.pivot_table(index=['D'], columns='Type', aggfunc='sum').reset_index()
r.columns = [tup[1] if tup[1] else tup[0] for tup in r.columns]
r.to_csv('c:/temp/out.csv', index=False)
Result:
D,A,B,C
1,3.0,4.0,
2,,,1.0
Original answer:
You can use the pivot_table() method:
In [7]: df.pivot_table(index=['D'], columns='Type', aggfunc='sum', fill_value=0)
Out[7]:
     Value
Type     A  B  C
D
1        3  4  0
2        0  0  1
or with NaNs:
In [8]: df.pivot_table(index=['D'], columns='Type', aggfunc='sum')
Out[8]:
     Value
Type     A    B    C
D
1      3.0  4.0  NaN
2      NaN  NaN  1.0
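The same shape also falls out of a plain groupby followed by unstack, which avoids the extra 'Value' column level from the start; a sketch of that route:
import pandas as pd

df = pd.DataFrame({'D': [1, 1, 2, 1], 'Type': ['A', 'B', 'C', 'A'],
                   'Value': [2, 4, 1, 1]})

# Sum per (D, Type) pair, then pivot the Type level into columns;
# missing combinations become NaN.
out = df.groupby(['D', 'Type'])['Value'].sum().unstack()
print(out)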
P.S. I think you have a typo in your groupby... section:
In [10]: df.groupby(['D','Type']).sum()
Out[10]:
        Value
D Type
1 A         3
  B         4
2 C         1
It should be C --> 1 instead of C --> 3.
