How do I drop duplicates in the following specific way?
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output:
Index B C
3 3 1
5 4 0
So I want to drop duplicates on column B, but keep one sample/record only when C has the same value in every row of the group.
For example, B = 3 for index 3/4, but since C = 1 for both, I do not drop them all (one row survives).
But for B = 5 at index 8/9/10, C is a mix of 1 and 0, so the whole group gets dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
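For reference, here is a minimal self-contained sketch of the same approach (reconstructing the example frame from the question, with Index as the DataFrame index):
import pandas as pd

df = pd.DataFrame({'B': [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
                   'C': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]},
                  index=range(1, 11))
df.index.name = 'Index'

# keep only groups where C has exactly one unique value ...
mask = df.groupby('B')['C'].transform('nunique') == 1
# ... then keep a single row per value of B
print(df[mask].drop_duplicates(subset='B'))  # reproduces the output above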
I'm referring to
https://github.com/pandas-dev/pandas/tree/main/doc/cheatsheet.
As you can see in the cheatsheet, when pivot() is used, all values end up in rows 0 and 1. But when I use pivot() myself, the result is different, as shown below.
[screenshot: DataFrame before pivot()]
[screenshot: DataFrame after pivot()]
Is this result intentional?
In your data, the grey column (index of the row) is missing:
import pandas as pd

df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
print(df)
# Output
variable value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 c 6
7 c 7
8 c 8
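As an aside, this is likely what happened in the question (a guess at the exact call used): pivoting without a grey column keeps the original row index, so each value lands in its own row and everything else becomes NaN:
print(df.pivot(columns='variable', values='value'))
# Output
variable    a    b    c
0         0.0  NaN  NaN
1         1.0  NaN  NaN
2         2.0  NaN  NaN
3         NaN  3.0  NaN
4         NaN  4.0  NaN
5         NaN  5.0  NaN
6         NaN  6.0  NaN
7         NaN  7.0  NaN
8         NaN  8.0  NaN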
Add the grey column:
df['grey'] = df.groupby('variable').cumcount()
print(df)
# Output
variable value grey
0 a 0 0
1 a 1 1
2 a 2 2
3 b 3 0
4 b 4 1
5 b 5 2
6 c 6 0
7 c 7 1
8 c 8 2
Now you can pivot:
df = df.pivot(index='grey', columns='variable', values='value')
print(df)
# Output
variable a b c
grey
0 0 3 6
1 1 4 7
2 2 5 8
Take the time to read "How can I pivot a dataframe?"
Here is a model of the real data:
C S E D
1 1 3 0 0
2 1 5 0 0
3 1 6 0 0
4 2 1 0 0
5 2 3 0 0
6 2 7 0 0
C - category, S - start, E - end, D - delta
Using pandas, I need to fill column E with the value of column S from the next row (id = id + 1) within the same category; for the last row of each category, E should equal the value of column S from that same row.
It should turn out like this:
C S E D
1 1 3 5 0
2 1 5 6 0
3 1 6 6 0
4 2 1 3 0
5 2 3 7 0
6 2 7 7 0
Then subtract S from E and put the difference in D; that part, in principle, is easy. The difficulty is filling in column E.
The final result is this:
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
Use DataFrameGroupBy.shift, replace the last (missing) value of each group with the original S via Series.fillna, and then subtract to fill column D:
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
Or, if using DataFrame.assign is necessary, use a lambda function for D so that it uses the freshly computed values of column E:
df = df.assign(E=df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int),
               D=lambda x: x['E'] - x['S'])
print (df)
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
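To see why fillna is needed, look at the intermediate shift on its own (using the question's data): the last row of each category has no next row, so shift(-1) leaves a NaN there, which fillna then replaces with that row's own S value:
print(df.groupby('C')['S'].shift(-1))
1    5.0
2    6.0
3    NaN
4    3.0
5    7.0
6    NaN
Name: S, dtype: float64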
If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the sums differ from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 values that differ between "B+C" and "D".
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} different values comparing "B+C" and "D"')
#There are 3 different values comparing "B+C" and "D"
df.insert(3, 'B+C', df['B'] + df['C'])
3 is the position at which the new column is inserted
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps of @yatu's answer:
df['B+C'].ne(df['D'])
0     True
1     True
2    False
3     True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
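If you also want to see which rows differ, not just how many, a small follow-up sketch with boolean indexing:
df[df['B+C'].ne(df['D'])]
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
3 2 4 4 8 4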
So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If there are only 0 and 1 values in the is_something column, just use sum with GroupBy.transform to fill a new column with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If multiple values are possible, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('int8').groupby(df['name']).transform('sum')
Or we can just map it:
df['New'] = df.name.map(df.query('is_something == 1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
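One caveat with the map approach: names that never have is_something == 1 are missing from the mapped Series and come out as NaN, so a fillna is a safe follow-up:
df['New'] = df.name.map(
    df.query('is_something == 1').groupby('name')['is_something'].sum()
).fillna(0).astype(int)  # names with no 1s get 0 instead of NaN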
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
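For completeness, since the question explicitly mentions groupby and merge, that route also works (a sketch; transform is usually simpler because it keeps the original rows aligned automatically):
counts = (df[df['is_something'] == 1]
          .groupby('name')
          .size()
          .rename('no_of_times_is_something_is_1')
          .reset_index())
# left join keeps every original row; names with no 1s would get NaN here
df = df.merge(counts, on='name', how='left')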
I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with the grouping columns plus the counts, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is get those counts as a column in the original df. This would mean that if two rows had identical values in columns chr, start, and end, the counts column would be 2 in both rows, but they would not be collapsed into one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
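As a side note, a slightly more direct variant (not from the original answer) selects a single column before transforming, so there is no arbitrary column to pick afterwards:
>>> df['unique_count'] = df.groupby(['a','b','c'])['d'].transform('count')
Since 'count' skips NaN values, use transform('size') instead if the selected column may contain NaNs.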