How do I drop duplicates in the following specific way?
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output:
Index B C
3 3 1
5 4 0
So I want to drop duplicates on column B, but keep one sample/record only when C has the same value in every row of the group.
For example, B = 3 for index 3/4, but since C = 1 for both, I do not drop them all (one row survives).
But for B = 5 at index 8/9/10, C is a mix of 1 and 0, so the whole group gets dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
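For reference, here is a minimal self-contained sketch of the same approach (reconstructing the example frame from the question, with Index as the DataFrame index):
import pandas as pd

df = pd.DataFrame({'B': [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
                   'C': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]},
                  index=range(1, 11))
df.index.name = 'Index'

# keep only groups where C has exactly one unique value ...
mask = df.groupby('B')['C'].transform('nunique') == 1
# ... then keep a single row per value of B
print(df[mask].drop_duplicates(subset='B'))  # reproduces the output above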
I'm referring to
https://github.com/pandas-dev/pandas/tree/main/doc/cheatsheet.
As you can see in the cheatsheet, when pivot() is used, all values end up in rows 0 and 1. But when I use pivot() myself, the result is different, as shown below.
[screenshot: DataFrame before pivot()]
[screenshot: DataFrame after pivot()]
Is this result intentional?
In your data, the grey column (index of the row) is missing:
import pandas as pd

df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
print(df)
# Output
variable value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 c 6
7 c 7
8 c 8
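As an aside, this is likely what happened in the question (a guess at the exact call used): pivoting without a grey column keeps the original row index, so each value lands in its own row and everything else becomes NaN:
print(df.pivot(columns='variable', values='value'))
# Output
variable    a    b    c
0         0.0  NaN  NaN
1         1.0  NaN  NaN
2         2.0  NaN  NaN
3         NaN  3.0  NaN
4         NaN  4.0  NaN
5         NaN  5.0  NaN
6         NaN  6.0  NaN
7         NaN  7.0  NaN
8         NaN  8.0  NaN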
Add the grey column:
df['grey'] = df.groupby('variable').cumcount()
print(df)
# Output
variable value grey
0 a 0 0
1 a 1 1
2 a 2 2
3 b 3 0
4 b 4 1
5 b 5 2
6 c 6 0
7 c 7 1
8 c 8 2
Now you can pivot:
df = df.pivot(index='grey', columns='variable', values='value')
print(df)
# Output
variable a b c
grey
0 0 3 6
1 1 4 7
2 2 5 8
Take the time to read "How can I pivot a dataframe?"
Here is a model of the real data:
C S E D
1 1 3 0 0
2 1 5 0 0
3 1 6 0 0
4 2 1 0 0
5 2 3 0 0
6 2 7 0 0
C - category, S - start, E - end, D - delta
Using pandas, I need to fill column E with the value of column S from the next row (id = id + 1) within the same category; for the last row of each category, E should equal the value of column S from that same row.
It should turn out like this:
C S E D
1 1 3 5 0
2 1 5 6 0
3 1 6 6 0
4 2 1 3 0
5 2 3 7 0
6 2 7 7 0
Then subtract S from E and put the difference in D; that part, in principle, is easy. The difficulty is filling in column E.
The final result is this:
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
Use DataFrameGroupBy.shift, replace the last (missing) value of each group with the original S via Series.fillna, and then subtract to fill column D:
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
Or, if using DataFrame.assign is necessary, use a lambda function for D so that it uses the freshly computed values of column E:
df = df.assign(E=df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int),
               D=lambda x: x['E'] - x['S'])
print (df)
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
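To see why fillna is needed, look at the intermediate shift on its own (using the question's data): the last row of each category has no next row, so shift(-1) leaves a NaN there, which fillna then replaces with that row's own S value:
print(df.groupby('C')['S'].shift(-1))
1    5.0
2    6.0
3    NaN
4    3.0
5    7.0
6    NaN
Name: S, dtype: float64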
If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the sums differ from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 values that differ between "B+C" and "D".
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} different values comparing "B+C" and "D"')
#There are 3 different values comparing "B+C" and "D"
df.insert(3, 'B+C', df['B'] + df['C'])
3 is the position at which the new column is inserted
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps of @yatu's answer:
df['B+C'].ne(df['D'])
0     True
1     True
2    False
3     True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
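If you also want to see which rows differ, not just how many, a small follow-up sketch with boolean indexing:
df[df['B+C'].ne(df['D'])]
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
3 2 4 4 8 4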
So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If there are only 0 and 1 values in the is_something column, just use sum with GroupBy.transform to fill a new column with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If multiple values are possible, first compare with 1, convert to integer, and then use transform with sum:
df['new'] = df['is_something'].eq(1).astype('int8').groupby(df['name']).transform('sum')
Or we can just map it:
df['New'] = df.name.map(df.query('is_something == 1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
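One caveat with the map approach: names that never have is_something == 1 are missing from the mapped Series and come out as NaN, so a fillna is a safe follow-up:
df['New'] = df.name.map(
    df.query('is_something == 1').groupby('name')['is_something'].sum()
).fillna(0).astype(int)  # names with no 1s get 0 instead of NaN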
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
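For completeness, since the question explicitly mentions groupby and merge, that route also works (a sketch; transform is usually simpler because it keeps the original rows aligned automatically):
counts = (df[df['is_something'] == 1]
          .groupby('name')
          .size()
          .rename('no_of_times_is_something_is_1')
          .reset_index())
# left join keeps every original row; names with no 1s would get NaN here
df = df.merge(counts, on='name', how='left')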
I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with the grouping columns plus the counts, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is get those counts as a column in the original df. This would mean that if two rows had identical values in columns chr, start, and end, the counts column would be 2 in both rows, but they would not be collapsed into one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
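As a side note, a slightly more direct variant (not from the original answer) selects a single column before transforming, so there is no arbitrary column to pick afterwards:
>>> df['unique_count'] = df.groupby(['a','b','c'])['d'].transform('count')
Since 'count' skips NaN values, use transform('size') instead if the selected column may contain NaNs.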