Removing duplicates based on two columns while deleting inconsistent data - python

I have a pandas dataframe like this:
   a  b  c
0  1  1  1
1  1  1  0
2  2  4  1
3  3  5  0
4  3  5  0
where the first two columns ('a' and 'b') are IDs and the last one ('c') is a validation flag (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first two columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. For example, the first two rows are duplicated but inconsistent, so I should remove the entire record, while the last two rows are both duplicated and consistent, so I'd keep one of them. The expected result should be:
   a  b  c
0  2  4  1
1  3  5  0
The real dataframe can have more than two duplicates per group, and as you can see, the index has been reset as well. Thanks.

First filter rows with GroupBy.transform and SeriesGroupBy.nunique to keep only the groups with a single unique value in 'c', using boolean indexing, and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
        .drop_duplicates(['a','b'])
        .reset_index(drop=True))
print (df)
   a  b  c
0  2  4  1
1  3  5  0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0    2
1    2
2    1
3    1
4    1
Name: c, dtype: int64
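An alternative, in case you find GroupBy.filter more readable, is the sketch below; it is typically slower on large frames because it calls a Python function once per group. The sample data is rebuilt here so the snippet is self-contained:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# keep only groups whose 'c' values all agree, then collapse each group to one row
out = (df.groupby(['a', 'b'])
         .filter(lambda g: g['c'].nunique() == 1)
         .drop_duplicates(['a', 'b'])
         .reset_index(drop=True))
print (out)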

Related

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number; I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
   A  B  C  D  E
0  1  3  3  6  0
1  2  2  4  9  1
2  3  1  5  8  4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
   A  E
0  1  0
1  2  1
2  3  4
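If you would rather work with integer positions, an equivalent sketch (assuming unique column labels) finds the boundaries of the range with Index.get_loc:

# find the positional range of columns "B" through "D" and drop it
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D") + 1  # +1 because the slice end is exclusive
df = df.drop(columns=df.columns[start:stop])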

How to find the values of a column such that no value in another column takes a value greater than 3

I want to find the values corresponding to one column such that no value in another column takes a value greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which all the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My code below comes close to it.
df.groupby('a')['c'].max()>3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1, 3].
Is there a better and more efficient way to get this on a very large data frame (with more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
   a  b  c
0  1  4  4
1  2  5  1
2  3  6  5
3  1  4  4
4  2  5  2
5  3  6  5
6  1  4  4
7  2  5  2
8  3  6  3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
   a  b  c
0  1  4  4
3  1  4  4
6  1  4  4

Group: 2
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2

Group: 3
   a  b  c
2  3  6  5
5  3  6  5
8  3  6  3
And now take a look at each group.
Only group 2 meets the condition that each element in c column
is less than 3.
So actually you need groupby and filter, passing through only the groups meeting the above condition.
To get the full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
   a  b  c
1  2  5  1
4  2  5  2
7  2  5  2
But you want only the values from the a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code: df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True groups for which max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
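Since you mention a frame with more than 30 million rows: GroupBy.filter calls a Python lambda once per group, which can become slow when there are many groups. A vectorized sketch of the same "max below 3" condition, using GroupBy.transform:

# mark rows whose group's maximum 'c' is below 3, then collect the group keys
mask = df.groupby('a')['c'].transform('max').lt(3)
result = df.loc[mask, 'a'].unique().tolist()
print(result)   # [2] for the modified DataFrame above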
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})

# Return the first value greater than 3 found in the group (None if there is none)
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column 'a' and aggregate column 'c' with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)

How to count number of records in a group and save them in a csv file?

I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so, it is as below:
A  B
1  1
1  1
1  2
1  4
5  1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the below solution:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column, which holds the number of elements in each group, but it does not include the group columns A and B.
When I print the final dtSize, the repeated values in A are not shown, so it looks like some records from A are missing.
My desired output in the .csv file is as below:
A  B  Number of elements in group
1  1  2
1  2  1
1  4  1
5  1  1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
   A  B  Size
0  1  1     2
1  1  2     1
2  1  4     1
3  5  1     1
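As a side note, in newer pandas versions (1.1 and later, if I remember correctly) as_index=False combined with .size() already returns a DataFrame with a size column, so a sketch of the full pipeline could be:

dtSize = dt.groupby(['A', 'B'], as_index=False).size()
dtSize = dtSize.rename(columns={'size': 'Number of elements in group'})
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)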

Find duplicates for one column with the last row group by one column in Pandas Python

I have 4 columns in my dataframe: user, abcisse, ordonnee, temps (time).
I want to find, for each user, the rows duplicated within that user's rows and keep only the last one, a duplicate row meaning two rows with the same abcisse and ordonnee.
I was thinking of using the df.duplicated function, but I don't know how to combine it with groupby.
entry = pd.DataFrame([[1,0,0,1],[1,3,-2,2],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,1],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
output = pd.DataFrame([[1,0,0,1],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
Use drop_duplicates:
print (entry.drop_duplicates(['user', 'abcisse', 'ordonnee'], keep='last'))
   user  abcisse  ordonnee  temps
0     1        0         0      1
2     1        2         1      3
3     1        3         1      4
4     1        3        -2      5
6     2        1         3      2
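Note that no explicit groupby is needed here: because 'user' is part of the subset, duplicates are only detected within each user. If you also want the fresh 0..n index shown in your expected output, tack on reset_index, e.g.:

out = (entry.drop_duplicates(['user', 'abcisse', 'ordonnee'], keep='last')
            .reset_index(drop=True))
print (out)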

count of unique occurrences of a value pandas python

So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and, for each row, assign the sum of its unique occurrences, so the table would look like:
values  unique_sum
1       3
1       3
1       3
2       2
2       2
I have seen some examples in R, but for Python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts(), and I have tried groupby routines, but cannot fathom it.
Just use map to map your column onto its value_counts:
>>> x
   A
0  1
1  1
2  1
3  2
4  2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
   A  unique
0  1       3
1  1       3
2  1       3
3  2       2
4  2       2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values; you'd have to use x['values'] instead.)
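A groupby-based alternative is transform, which broadcasts the group size back onto every row and skips the intermediate value_counts Series; using the unique_sum column name from your question:

x['unique_sum'] = x.groupby('A')['A'].transform('size')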
