Conditional drop of identical pairs of columns in pandas - python

I have a fairly large pandas DataFrame (100,000 x 9). The first two columns hold a pair of names and the third column holds a value associated with that pair; each pair appears in both orders. For every pair I want to drop the row with the lower value.
I haven't tried anything yet, because I'm not sure how to tackle this problem. My first impression is that I need to apply a function over the DataFrame, selecting each combination of 'first' and 'second', comparing the two values, and then dropping the row with the lower one.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['John', 'Mary', 5], ['John', 'Mark', 1], ['Mary', 'John', 2],
                            ['Mary', 'Mark', 1], ['Mark', 'John', 3], ['Mark', 'Mary', 5]]),
                  columns=['first', 'second', 'third'])
df
  first second third
0  John   Mary     5
1  John   Mark     1
2  Mary   John     2
3  Mary   Mark     1
4  Mark   John     3
5  Mark   Mary     5
My objective is to get this data frame
df_clean = pd.DataFrame(np.array([['John', 'Mary', 5], ['Mark', 'John', 3], ['Mark', 'Mary', 5]]),
                        columns=['first', 'second', 'third'])
df_clean
  first second third
0  John   Mary     5
1  Mark   John     3
2  Mark   Mary     5
Any ideas?

First we use np.sort to sort the name pair horizontally, then we use groupby with max to get the highest value per unique (first, second) pair:
df[['first', 'second']] = np.sort(df[['first', 'second']], axis=1)
print(df.groupby(['first', 'second']).third.max().reset_index())
  first second third
0  John   Mark     3
1  John   Mary     5
2  Mark   Mary     5
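If you also want to keep each row's original first/second orientation (as in df_clean above), here is a minimal sketch of a variant, starting again from the original df from the question: the sorted pair is only used as a temporary key to detect duplicates, and the higher-valued row is kept. pd.to_numeric is needed because the np.array constructor above turns every column into strings, and the column names a and b are just throwaway key names.

df['third'] = pd.to_numeric(df['third'])  # np.array made this column string-typed
key = pd.DataFrame(np.sort(df[['first', 'second']], axis=1),
                   columns=['a', 'b'], index=df.index)
out = (df.assign(a=key['a'], b=key['b'])
         .sort_values('third')
         .drop_duplicates(['a', 'b'], keep='last')  # keep the higher value per pair
         .drop(columns=['a', 'b'])
         .sort_index())
print(out)

With the sample data this returns the three rows of df_clean in their original orientation.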

Related

Count occurrence of a max value within aggregation

I have a table like this:
Column1  Column2
John     2
John     8
John     8
John     8
Robert   5
Robert   5
Robert   1
Carl     8
Carl     7
Now I want to aggregate this DataFrame by Column1 and get the max value, as well as count how many times that max value occurs in every group.
So the output should look like this:
Column1  Max  Count_of_Max
John     8    3
Robert   5    2
Carl     8    1
I've been trying to do something like this:
def Max_Count(x):
    a = df.loc[x.index]
    return a.loc[a['Column2'] == a['Column2'].max(), 'Column2'].count()

df.groupby(["Column1"]).agg({'Column2': ["max", Max_Count]}).reset_index()
But it's not really working :(
What would be the way to get the desired result?
df.groupby('Column1').agg({
    'Column2': [max, lambda x: (x == max(x)).sum()]
}).rename(columns={'max': 'Max', '<lambda_0>': 'Count_of_Max'})
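If you are on pandas 0.25 or newer, an alternative worth considering is named aggregation, which gives the columns their final names directly and avoids renaming the lambda column afterwards. A sketch, assuming the same df as above:

out = (df.groupby('Column1', sort=False)['Column2']
         .agg(Max='max',
              Count_of_Max=lambda s: (s == s.max()).sum())  # count rows tied with the group max
         .reset_index())
print(out)

sort=False keeps the groups in their order of appearance (John, Robert, Carl), matching the expected output.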

Pandas - dense rank but keep current group numbers

I'm dealing with a pandas DataFrame like:
data = {
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   0
4     Mary   0
5   Andrew   0
6  Michael   2
I'm trying to write code that groups values by the "name" column while keeping the current group numbers. A value of 0 means there is no assignment yet.
For the example above, assign a value of 3 to each occurrence of Andrew and a value of 1 to each occurrence of James. Mary has no assignment, so she should get the next unused number.
The expected output:
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = ( df['name'].rank(method='dense').astype(int))
The issue with the above is that it ignores the records equal to 0, so the numbers come out wrong. I tried removing that part (the values equal to 0), but then the existing numbering is not preserved.
Can you please help me?
Replace the 0 values with missing values, then use GroupBy.transform with 'first' to spread each name's existing id to all of its rows, and finally replace the remaining missing values by Series.rank plus the maximal id, converting back to integers:
df = df.replace({'id': {0: np.nan}})                     # treat 0 as "no assignment"
df['id'] = df.groupby('name')['id'].transform('first')   # propagate each name's existing id
s = df.loc[df['id'].isna(), 'name'].rank(method='dense') + df['id'].max()  # new ids after the current max
df['id'] = df['id'].fillna(s).astype(int)
print(df)
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
IIUC, you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID per name, then assign the names without an ID the next available IDs on the masked data (you can use factorize or rank, as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0]+df['id'].max()+1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense')+df['id'].max()
# optional, if you want integers
df['id']= df['id'].convert_dtypes()
output:
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
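Both answers above assume each name carries at most one distinct non-zero id. Under that same assumption, and starting from the original df from the question, a compact sketch of the same idea using a plain dictionary and map (the names known, start and new are just for illustration):

# build a name -> id mapping from the rows that already have an id
known = df.loc[df['id'].ne(0)].set_index('name')['id'].to_dict()
# hand out fresh ids, starting after the current maximum, to names never assigned one
start = max(known.values(), default=0) + 1
new = {n: start + i
       for i, n in enumerate(df.loc[~df['name'].isin(known.keys()), 'name'].unique())}
df['id'] = df['name'].map({**known, **new})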

Pandas: filter the row according to the value of another column in different group (two columns in aggregate)

I have a dataset like below in pandas dataframe:
Name   Shift  Data  Type
Peter  0      12    A
Peter  0      13    A
Peter  0      14    B
Sam    1      12    A
Sam    1      15    A
Sam    1      16    B
Sam    1      17    B
Mary   2      20    A
Mary   2      21    A
Mary   2      12    A
Can anyone suggest how to get the end result shown below? The logic is: if Shift is 0, pick the 1st item within each group of the "Name" and "Type" columns; if Shift is 1, pick the 2nd item within the group, and so on. I have thought of nth(x), but I don't know how to make x a variable in this case. Any other workaround that produces the same result is fine. Thank you.
Name   Shift  Data  Type
Peter  0      12    A
Peter  0      14    B
Sam    1      15    A
Sam    1      17    B
Mary   2      12    A
You can use groupby.cumcount()
Assuming your data is in a DataFrame called df, I think this should work for you:
df = df[df.groupby(['Name','Type']).cumcount()==df['Shift']]
It compares the cumulative count of rows with the same Name and Type to the values in the Shift column to determine which rows should be kept.
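For reference, a self-contained sketch that rebuilds the sample frame from the question and applies the filter, so you can verify it reproduces the expected output:

import pandas as pd

df = pd.DataFrame({
    'Name':  ['Peter', 'Peter', 'Peter', 'Sam', 'Sam', 'Sam', 'Sam', 'Mary', 'Mary', 'Mary'],
    'Shift': [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    'Data':  [12, 13, 14, 12, 15, 16, 17, 20, 21, 12],
    'Type':  ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A'],
})

out = df[df.groupby(['Name', 'Type']).cumcount() == df['Shift']]
print(out)  # keeps Peter/A 12, Peter/B 14, Sam/A 15, Sam/B 17, Mary/A 12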

How to create a new column based on information in another column?

I'm trying to create a new column in a pandas DataFrame. I have names in one column and want to assign numbers to them in a new column. If a name is repeated consecutively, its rows get the same number; if it appears again after different names, it should get a new number.
For example, my df is like
Name
Stephen
Stephen
Mike
Carla
Carla
Stephen
my new column should be
Numbers
0
0
1
2
2
3
Sorry, I couldn't paste my dataframe here.
Try:
df['Numbers'] = (df['Name'] != df['Name'].shift()).cumsum() - 1
Output:
      Name  Numbers
0  Stephen        0
1  Stephen        0
2     Mike        1
3    Carla        2
4    Carla        2
5  Stephen        3
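To see why this works, here is a small sketch that spells out the intermediate steps, using the Name column from the question:

import pandas as pd

df = pd.DataFrame({'Name': ['Stephen', 'Stephen', 'Mike', 'Carla', 'Carla', 'Stephen']})

changed = df['Name'] != df['Name'].shift()  # True on every row where the name differs from the row above
df['Numbers'] = changed.cumsum() - 1        # running count of those changes, shifted to start at 0
print(df)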

Python/Pandas - Duplicating an index in a new column in a Pandas DataFrame

I have a DataFrame with indexes 1,2,3.
   Name
1   Rob
2  Mark
3  Alex
I want to duplicate that index in a new column so it gets like this:
   Name Number
1   Rob      1
2  Mark      2
3  Alex      3
Any ideas?
EDIT
I forgot one important part: the items in the Number column should be turned into strings.
You can try:
df['Number'] = df.index.astype(str)
   Name Number
1   Rob      1
2  Mark      2
3  Alex      3
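A minimal sketch to double-check that the new column really holds strings, assuming the three-row frame from the question:

import pandas as pd

df = pd.DataFrame({'Name': ['Rob', 'Mark', 'Alex']}, index=[1, 2, 3])
df['Number'] = df.index.astype(str)
print(df['Number'].tolist())  # ['1', '2', '3'] -- string values, as requested in the edit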
