Remove duplicates using column value with some ignore condition - python

I have two columns in my Excel file and I want to remove duplicates from column 'A' with an ignore condition. The columns are as follows:
A B
1 10
1 20
2 30
2 40
3 10
3 20
Now, I want it to turn into this:
A B
1 10
2 30
2 40
3 10
So, basically I want to remove all duplicates except when column 'A' has the value 2 (I want to ignore 2). My current code is below, but it does not work for me as it removes the duplicates with the value '2' too.
df = pd.read_excel(save_filename)
df2 = df.drop_duplicates(subset=["A", "B"], keep='first')
df2.to_excel(save_filename, index=False)

You can use two conditions:
df[~df.duplicated(subset="A") | df["A"].eq(2)]
A B
0 1 10
2 2 30
3 2 40
4 3 10
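
If you want to plug this back into the original read/write workflow, a minimal sketch (reusing the save_filename variable and the column names from the question) could look like this:
import pandas as pd

df = pd.read_excel(save_filename)
# Keep a row if it is the first occurrence of its "A" value,
# or if its "A" value is 2 (the value to ignore).
mask = ~df.duplicated(subset="A") | df["A"].eq(2)
df[mask].to_excel(save_filename, index=False)
duplicated(subset="A") marks everything after the first occurrence as True by default, so negating it keeps the first occurrence, matching the answer above.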


Adding multiple row values into one row keeping the index interval as same as the number of row added in python

I have a data frame with many columns (30-40) containing a time series that runs continuously from 1 to 1440 minutes.
df
time colA colB colC.....
1 5 4 3
2 1 2 3
3 5 4 3
4 6 7 3
5 9 0 3
6 4 4 0
..
Now I want to add every two row values into one, but I want to keep the 'time' index interval equal to the number of rows I am adding. The resulting data frame is:
df
time colA colB colC.......
1 6 6 6
3 11 11 6
5 13 4 3
..
Here every two row values are added into one, and the time index interval is likewise 2 rows: 1, 3, 5, ...
Is it possible to achieve that?
Another way would be to group your data set every two rows and aggregate using sum on your 'colX' columns and mean on your time column. Chaining astype(int) will truncate the resulting mean values to integers:
d = {col: 'sum' for col in [c for c in df.columns if c.startswith('col')]}
df.groupby(df.index // 2).agg({**d,'time': 'mean'}).astype(int)
prints back:
colA colB colC time
0 6 6 6 1
1 11 11 6 3
2 13 4 3 5
One way is to do the addition for every column and then fix time afterwards:
df_new = df[1::2].reset_index(drop=True) + df[::2].reset_index(drop=True)
df_new['time'] = df[::2]['time'].values
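
For reference, a small self-contained sketch of the groupby-every-two-rows approach, with the values taken from the example above (this assumes a default RangeIndex so that df.index // 2 pairs up consecutive rows):
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6],
                   'colA': [5, 1, 5, 6, 9, 4],
                   'colB': [4, 2, 4, 7, 0, 4],
                   'colC': [3, 3, 3, 3, 3, 0]})

# Sum every 'col*' column and average 'time' within each pair of rows.
d = {col: 'sum' for col in df.columns if col.startswith('col')}
out = df.groupby(df.index // 2).agg({**d, 'time': 'mean'}).astype(int)
print(out)
#    colA  colB  colC  time
# 0     6     6     6     1
# 1    11    11     6     3
# 2    13     4     3     5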

Getting max values based on sliced column

Let's consider this Dataframe:
$> df
a b
0 6 50
1 2 20
2 9 60
3 4 40
4 5 20
I want to compute column d as, for each row, the max value between the integer 0 and a slice of column b running from that row's index to the end.
So I have created a column c (all zeroes) in my dataframe in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to slice the input values. Expected result would be:
$> df
a b c d
0 6 50 0 60
1 2 20 0 60
2 9 60 0 60
3 4 40 0 40
4 5 20 0 20
So essentially, d at index 3 is computed (pseudo-code) as max(df[3:,"b"], df[3:,"c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row for D without having to loop, as this is slow.
Seems like this could work: reverse "b", find the cummax, then reverse it back and assign it to "d". Then use where on "d" to replace any value below 0 with 0:
df['d'] = df['b'][::-1].cummax()[::-1]
df['d'] = df['d'].where(df['d']>0, 0)
We can replace the last line with the one below using clip (thanks #Either), and drop the second reversal (assuming the indexes match), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
a b d
0 6 50 60
1 2 20 60
2 9 60 60
3 4 40 40
4 5 20 20
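
Put together as a self-contained sketch, with the data taken from the example above:
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5],
                   'b': [50, 20, 60, 40, 20]})

# Reverse 'b', take the running max, and let index alignment put the
# values back on the right rows during assignment; clip enforces the 0 floor.
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)
#    a   b   d
# 0  6  50  60
# 1  2  20  60
# 2  9  60  60
# 3  4  40  40
# 4  5  20  20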

Pandas delete all duplicate rows in one column if values in another column are higher than a threshold

I have a dataframe where there are duplicate values in column A that have different values in column B.
I want to delete all rows for a duplicated value in column A if any of its rows has a value higher than 15 in column B.
Original Dataframe
A Column  B Column
1         10
1         14
2         10
2         20
3         5
3         10
Desired dataframe
A Column  B Column
1         10
1         14
3         5
3         10
This works:
dfnew = df.groupby('A Column').filter(lambda x: x['B Column'].max()<=15 )
dfnew.reset_index(drop=True, inplace=True)
dfnew = dfnew[['A Column','B Column']]
print(dfnew)
output:
A Column B Column
0 1 10
1 1 14
2 3 5
3 3 10
Here is another way, using groupby() and transform():
df.loc[~df['B Column'].gt(15).groupby(df['A Column']).transform('any')]
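Both approaches can be checked side by side on a small frame built from the question's data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'A Column': [1, 1, 2, 2, 3, 3],
                   'B Column': [10, 14, 10, 20, 5, 10]})

# filter(): keep only the groups whose maximum 'B Column' is at most 15.
out1 = df.groupby('A Column').filter(lambda x: x['B Column'].max() <= 15)

# transform(): flag rows above 15, broadcast the flag over each group
# with 'any', and keep the groups where the flag stays False.
out2 = df.loc[~df['B Column'].gt(15).groupby(df['A Column']).transform('any')]

print(out1.reset_index(drop=True))
#    A Column  B Column
# 0         1        10
# 1         1        14
# 2         3         5
# 3         3        10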

Group separated counting values in a pandas dataframe

I have the following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 Nan 80
9 3 5
It consists of repeating sequences from 1-3 separated by a variable number of NaNs. I want to group each of these sequences from 1-3 and get the minimum value of column B within each sequence.
Desired Output something like:
B_min
0 1
6 5
Many thanks beforehand
draj
The idea is to first remove rows with missing values using DataFrame.dropna, then group by a helper Series created by comparing A to 1 with Series.eq followed by Series.cumsum, take the group minimum of B, and finally clean the result up into a one-column DataFrame:
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print(df)
B_min
0 1
1 5
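As a self-contained sketch (assuming the NaN entries in the question are real missing values rather than the string 'Nan'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 3, np.nan, np.nan, 1, 2, np.nan, 3],
                   'B': [10, 20, 5, 1, 2, 3, 10, 50, 80, 5]})

# Every time A equals 1 a new sequence starts, so the cumulative sum of
# (A == 1) labels the sequences: 1 for the first run, 2 for the second, ...
group_id = df['A'].eq(1).cumsum()

out = (df.dropna(subset=['A'])
         .groupby(group_id)['B']
         .min()
         .reset_index(drop=True)
         .to_frame(name='B_min'))
print(out)
#    B_min
# 0      1
# 1      5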
All you need is df.groupby() and min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your groups, you can drop them using df.dropna():
df.dropna().groupby('A')['B'].min()

Comparing dataframes in pandas

I have two separate pandas dataframes (df1 and df2) which have multiple columns, some of which are common to both.
I would like to find every row in df2 that does not have a match in df1. A match between df1 and df2 is defined as having the same values in columns A and B in the same row.
df1
A B C text
45 2 1 score
33 5 2 miss
20 1 3 score
df2
A B D text
45 3 1 shot
33 5 2 shot
10 2 3 miss
20 1 4 miss
Result df (only rows 1 and 3 are returned, as the values of A and B in rows 2 and 4 of df2 have a match in the same row of df1)
A B D text
45 3 1 shot
10 2 3 miss
Is it possible to use the isin method in this scenario?
This works:
# set index (as selecting columns)
df1 = df1.set_index(['A','B'])
df2 = df2.set_index(['A','B'])
# now .isin will work
df2[~df2.index.isin(df1.index)].reset_index()
A B D text
0 45 3 1 shot
1 10 2 3 miss
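
An equivalent sketch that builds the (A, B) index only for the membership test, without reassigning the frames (data taken from the example above):
import pandas as pd

df1 = pd.DataFrame({'A': [45, 33, 20], 'B': [2, 5, 1],
                    'C': [1, 2, 3], 'text': ['score', 'miss', 'score']})
df2 = pd.DataFrame({'A': [45, 33, 10, 20], 'B': [3, 5, 2, 1],
                    'D': [1, 2, 3, 4], 'text': ['shot', 'shot', 'miss', 'miss']})

# Keep the df2 rows whose (A, B) pair does not appear anywhere in df1.
idx1 = df1.set_index(['A', 'B']).index
idx2 = df2.set_index(['A', 'B']).index
result = df2[~idx2.isin(idx1)]
print(result)
#     A  B  D  text
# 0  45  3  1  shot
# 2  10  2  3  miss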
