Pandas - dense rank but keep current group numbers - python

I'm dealing with a pandas dataframe and have a frame like:
data = {
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   0
4     Mary   0
5   Andrew   0
6  Michael   2
I'm trying to write code to group values by the "name" column while keeping the current group numbers.
A value of 0 means that there is no assignment.
For the example above, assign a value of 3 for each occurrence of Andrew and a value of 1 for each occurrence of James. Mary has no assignment, so assign her the next unused number.
The expected output:
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = df['name'].rank(method='dense').astype(int)
The issue with the above is that it ignores the records equal to 0, so the numbers are incorrect. When I removed that part (the values equal to 0), the existing numbering was not preserved.
Can you please help me?

Replace the 0 values with missing values, then use GroupBy.transform with 'first' so every row gets its name's existing value; finally fill the remaining missing values with Series.rank plus the maximal id, and convert to integers:
df = df.replace({'id': {0: np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df['id'].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print(df)
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
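Put together as a self-contained sketch (the numpy import is needed for np.nan):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
})

# 0 means "no assignment", so treat it as missing
df = df.replace({"id": {0: np.nan}})
# propagate each name's existing id to all of its rows
# ('first' skips NaN, so James's rows both get 1)
df["id"] = df.groupby("name")["id"].transform("first")
# names still without an id get the next free numbers
s = df.loc[df["id"].isna(), "name"].rank(method="dense") + df["id"].max()
df["id"] = df["id"].fillna(s).astype(int)
print(df["id"].tolist())  # [3, 3, 1, 1, 4, 3, 2]
```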

IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID per name, then assign the next available IDs to the names without an ID, working on the masked data (you can use factorize or rank as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0] + df['id'].max() + 1
# or rank, although factorize is more appropriate for non-numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense') + df['id'].max()
# optional, if you want integers
df['id'] = df['id'].convert_dtypes()
output:
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
 name  value  max
Alice      1    9
Linda      1    1
  Ben      3    5
Alice      4    9
Alice      9    9
  Ben      5    5
Linda      1    1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the rows where name == "Alice", get the max value of those rows, then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this without needing to know the specific names?
groupby and taking the max gives the max by name, which is then merged with the original df:
df.merge(df.groupby(['name'])['value'].max().reset_index(),
         on='name').rename(columns={'value_x': 'value',
                                    'value_y': 'max'})
    name  value  max
0  Alice      1    9
1  Alice      4    9
2  Alice      9    9
3  Linda      1    1
4  Linda      1    1
5    Ben      3    5
6    Ben      5    5
You could use transform or map:
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())
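Both routes give the same column; a quick runnable sketch on the example data to check they agree:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Linda", "Ben", "Alice", "Alice", "Ben", "Linda"],
    "value": [1, 1, 3, 4, 9, 5, 1],
})

# transform broadcasts the per-group max back onto every row
via_transform = df.groupby("name")["value"].transform("max")
# map looks each name up in a Series of per-name maxima
via_map = df["name"].map(df.groupby("name")["value"].max())

df["max"] = via_transform
print(via_transform.tolist())  # [9, 1, 5, 9, 9, 5, 1]
```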

Python merging data frames and renaming column values

In Python, I have a df that looks like this:
 Name  ID
 Anna   1
Sarah   2
  Max   3
And a df that looks like this
  Name  ID
   Dan   1
Hallie   2
   Cam   3
How can I merge the df’s so that the ID column looks like this
  Name  ID
  Anna   1
 Sarah   2
   Max   3
   Dan   4
Hallie   5
   Cam   6
This is just a minimal reproducible example; my actual dataset has thousands of values. I'm basically merging data frames and want the IDs in numerical order (a continuation of the previous data frame) instead of restarting from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset_index and use the new index to assign "ID"s
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
     Name  ID
0    Anna   1
1   Sarah   2
2     Max   3
3     Dan   4
4  Hallie   5
5     Cam   6
You can concat dataframes with ignore_index=True and then set ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
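The approaches coincide on this example, but they are not identical in general: the first offsets df2's IDs past df1's maximum, while the index-based ones renumber every row 1..n, so they only agree when df1's IDs are already a contiguous 1..n. A sketch comparing the two:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Anna", "Sarah", "Max"], "ID": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["Dan", "Hallie", "Cam"], "ID": [1, 2, 3]})

# offset approach: shift df2's IDs past df1's maximum
a = pd.concat([df1, df2.assign(ID=df2["ID"] + df1["ID"].max())],
              ignore_index=True)
# renumber approach: rebuild all IDs from the fresh 0-based index
b = pd.concat([df1, df2], ignore_index=True)
b["ID"] = b.index + 1

print(b["ID"].tolist())  # [1, 2, 3, 4, 5, 6]
```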

Pandas: filter the row according to the value of another column in different group (two columns in aggregate)

I have a dataset like below in pandas dataframe:
 Name  Shift  Data  Type
Peter      0    12     A
Peter      0    13     A
Peter      0    14     B
  Sam      1    12     A
  Sam      1    15     A
  Sam      1    16     B
  Sam      1    17     B
 Mary      2    20     A
 Mary      2    21     A
 Mary      2    12     A
Can anyone suggest how to get the end result below? The logic is: if Shift is 0, pick the 1st item within each group of the "Name" and "Type" columns; if Shift is 1, pick the 2nd value within the group, etc. I have thought of nth(x) but I don't know how to put a variable on x in this case. Another workaround that generates the same result is fine too. Thank you.
 Name  Shift  Data  Type
Peter      0    12     A
Peter      0    14     B
  Sam      1    15     A
  Sam      1    17     B
 Mary      2    12     A
You can use groupby.cumcount()
Assuming your data is in a DataFrame called df, I think this should work for you:
df = df[df.groupby(['Name', 'Type']).cumcount() == df['Shift']]
It compares the cumulative count of rows with the same Name and Type to the values in the Shift column to determine which rows should be kept.
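To see why this works, materialize the cumulative count next to Shift; rows are kept exactly where the two match:

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Peter", "Peter", "Peter", "Sam", "Sam",
              "Sam", "Sam", "Mary", "Mary", "Mary"],
    "Shift": [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "Data":  [12, 13, 14, 12, 15, 16, 17, 20, 21, 12],
    "Type":  ["A", "A", "B", "A", "A", "B", "B", "A", "A", "A"],
})

# position of each row within its (Name, Type) group: 0, 1, 2, ...
pos = df.groupby(["Name", "Type"]).cumcount()
result = df[pos == df["Shift"]]
print(result["Data"].tolist())  # [12, 14, 15, 17, 12]
```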

How to create a new column based on information in another column?

I'm trying to create a new column in a pandas dataframe. I have names in one column and want to assign numbers to them in a new column. If a name is repeated sequentially, those rows get the same number; if it reappears after different names, it should get a new number.
For example, my df is like:
Name
Stephen
Stephen
Mike
Carla
Carla
Stephen
my new column should be
Numbers
0
0
1
2
2
3
Sorry, I couldn't paste my dataframe here.
Try:
df['Numbers'] = (df['Name'] != df['Name'].shift()).cumsum() - 1
Output:
      Name  Numbers
0  Stephen        0
1  Stephen        0
2     Mike        1
3    Carla        2
4    Carla        2
5  Stephen        3
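The trick is that comparing each name to the one above it marks the start of every run; the cumulative sum then numbers the runs. A sketch with the intermediate step:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Stephen", "Stephen", "Mike", "Carla", "Carla", "Stephen"]}
)

# True wherever a new run of identical names begins
new_run = df["Name"] != df["Name"].shift()
# cumsum numbers the runs 1, 2, 3, ...; subtract 1 to start from 0
df["Numbers"] = new_run.cumsum() - 1
print(df["Numbers"].tolist())  # [0, 0, 1, 2, 2, 3]
```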

lookup/search values from data frame to create new column

I am trying to create a new column in a dataframe based on a lookup of data from another column and row. What is the best/fastest method to calculate such a column's values?
I have tried with lambda and an external function without result.
Can someone elaborate a little on the methods to get the final result, and which method is optimal in computation time?
Can we assign a function/lambda which will calculate such values?
Can we implement the data frame in such a way that it keeps a reference to the function calculating a column's value rather than the calculated values themselves? I.e. a dynamic result based on data in other columns/rows.
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Andy', 'Rob', 'Tony', 'John', 'Lui'],
    'M_Name': ['Lui', 'Lui', 'Lui', 'NoData', 'John']
}
df = pd.DataFrame(data)
Original DataFrame:
   ID  M_Name  Name
0   1     Lui  Andy
1   2     Lui   Rob
2   3     Lui  Tony
3   4  NoData  John
4   5    John   Lui
data_after = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Andy', 'Rob', 'Tony', 'John', 'Lui'],
    'M_Name': ['Lui', 'Lui', 'Lui', 'NoData', 'John'],
    'ID_by_M_Name': [5, 5, 5, 'NoData', '4']
}
df1 = pd.DataFrame(data_after)
Processed DataFrame:
   ID ID_by_M_Name  M_Name  Name
0   1            5     Lui  Andy
1   2            5     Lui   Rob
2   3            5     Lui  Tony
3   4       NoData  NoData  John
4   5            4    John   Lui
I have tried two ways to get the ID, but I'm not sure how to use them in assign:
getID = lambda name: df.loc[df['Name'] == name]['ID'].iloc[0]

def mID(name):
    return df.loc[df['Name'] == name]['ID'].iloc[0]
For each row we want to find the ID of the M_Name for that specific Name.
E.g. for Name='Andy' we have M_Name='Lui', and Lui's ID is 5.
For Lui, M_Name is John, and John's ID is 4.
print(getID('Lui'))
print(mID('Lui'))
df['ID'] = df.assign(mID(df['M_Name']), axis=1 )
IndexError: single positional indexer is out-of-bounds
Use Series.replace, or Series.map with Series.fillna:
df['ID_by_M_Name'] = df['M_Name'].replace(df.set_index('Name')['ID'])
# assign alternative
# df = df.assign(ID_by_M_Name=df['M_Name'].replace(df.set_index('Name')['ID']))
df['ID_by_M_Name'] = df['M_Name'].map(df.set_index('Name')['ID']).fillna(df['M_Name'])
# assign alternative
# df = df.assign(ID_by_M_Name=df['M_Name'].map(df.set_index('Name')['ID']).fillna(df['M_Name']))
print(df)
   ID  Name  M_Name ID_by_M_Name
0   1  Andy     Lui            5
1   2   Rob     Lui            5
2   3  Tony     Lui            5
3   4  John  NoData       NoData
4   5   Lui    John            4
If the position of the new column matters, use DataFrame.insert:
df.insert(1, 'ID_by_M_Name', df['M_Name'].replace(df.set_index('Name')['ID']))
print (df)
   ID ID_by_M_Name  Name  M_Name
0   1            5  Andy     Lui
1   2            5   Rob     Lui
2   3            5  Tony     Lui
3   4       NoData  John  NoData
4   5            4   Lui    John
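The difference between the two variants: replace leaves values without a match untouched, while map turns them into NaN, hence the fillna. A sketch of the map route:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Andy", "Rob", "Tony", "John", "Lui"],
    "M_Name": ["Lui", "Lui", "Lui", "NoData", "John"],
})

lookup = df.set_index("Name")["ID"]  # Name -> ID
# "NoData" is not in the lookup, so map yields NaN there;
# fillna restores the original M_Name value in that spot
df["ID_by_M_Name"] = df["M_Name"].map(lookup).fillna(df["M_Name"])
print(df["ID_by_M_Name"].tolist())
```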
