lookup/search values from data frame to create new column - python

I am trying to create a new column in a DataFrame by looking up values from another column and row. What is the best/fastest way to compute such a column?
I have tried a lambda and an external function, without success.
Can someone explain the available methods and which one is optimal in terms of computation time?
Can we assign a function/lambda that will calculate such values?
Can a DataFrame keep a reference to the function that calculates a column's values rather than the calculated values themselves, so that the result stays dynamic based on data in other columns/rows?
import pandas as pd
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Andy', 'Rob', 'Tony', 'John', 'Lui'],
    'M_Name': ['Lui', 'Lui', 'Lui', 'NoData', 'John']
}
df = pd.DataFrame(data)
Original DataFrame:
ID M_Name Name
0 1 Lui Andy
1 2 Lui Rob
2 3 Lui Tony
3 4 NoData John
4 5 John Lui
data_after = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Andy', 'Rob', 'Tony', 'John', 'Lui'],
    'M_Name': ['Lui', 'Lui', 'Lui', 'NoData', 'John'],
    'ID_by_M_Name': [5, 5, 5, 'NoData', '4']
}
df1 = pd.DataFrame(data_after)
Processed DataFrame:
ID ID_by_M_Name M_Name Name
0 1 5 Lui Andy
1 2 5 Lui Rob
2 3 5 Lui Tony
3 4 NoData NoData John
4 5 4 John Lui
I have tried two ways to get the ID, but I am not sure how to use them in assign:
getID = lambda name: df.loc[df['Name'] == name]['ID'].iloc[0]
def mID(name):
    return df.loc[df['Name'] == name]['ID'].iloc[0]
For each row we want to find the ID of the M_Name for a specific Name.
E.g. for Name='Andy' we have M_Name='Lui', and Lui's ID is 5.
For Lui, M_Name is John, and John's ID is 4.
print(getID('Lui'))
print(mID('Lui'))
df['ID'] = df.assign(mID(df['M_Name']), axis=1 )
IndexError: single positional indexer is out-of-bounds

Use Series.replace or Series.map with Series.fillna:
df['ID_by_M_Name'] = df['M_Name'].replace(df.set_index('Name')['ID'])
#assign alternative
#df = df.assign(ID_by_M_Name = df['M_Name'].replace(df.set_index('Name')['ID']))
df['ID_by_M_Name'] = df['M_Name'].map(df.set_index('Name')['ID']).fillna(df['M_Name'])
#assign alternative
#df=df.assign(ID_by_M_Name=df['M_Name'].map(df.set_index('Name')['ID']).fillna(df['M_Name']))
print (df)
ID Name M_Name ID_by_M_Name
0 1 Andy Lui 5
1 2 Rob Lui 5
2 3 Tony Lui 5
3 4 John NoData NoData
4 5 Lui John 4
If the position of the new column matters, use DataFrame.insert:
df.insert(1, 'ID_by_M_Name', df['M_Name'].replace(df.set_index('Name')['ID']))
print (df)
ID ID_by_M_Name Name M_Name
0 1 5 Andy Lui
1 2 5 Rob Lui
2 3 5 Tony Lui
3 4 NoData John NoData
4 5 4 Lui John
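As a self-contained sketch of the map approach above, the snippet below rebuilds the sample data and, as an assumption beyond the original answer, leaves unmatched names as missing values in a nullable integer column instead of filling them with the 'NoData' string. It also hints at why the getID/mID attempt raises IndexError: 'NoData' matches no row in Name, so .iloc[0] has nothing to return. The variable name name_to_id is illustrative, and the lookup assumes the names are unique.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Andy', 'Rob', 'Tony', 'John', 'Lui'],
    'M_Name': ['Lui', 'Lui', 'Lui', 'NoData', 'John'],
})

# Build the Name -> ID lookup once (assumes every Name is unique).
name_to_id = df.set_index('Name')['ID']

# map() looks up every M_Name in one vectorised call; names without a
# match become NaN instead of raising like .iloc[0] on an empty selection.
looked_up = df['M_Name'].map(name_to_id)

# Assumption: keep the column numeric with the nullable Int64 dtype,
# so the unmatched row holds <NA> rather than the string 'NoData'.
df['ID_by_M_Name'] = looked_up.astype('Int64')
print(df)
```

Keeping the column numeric is a design choice: mixing integers and the string 'NoData' gives an object column, which is usually slower to work with afterwards.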

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the rows where name == "Alice", get the max value of these rows, and then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this so that I don't need to know the specific names in advance?
groupby and taking a max gives the max by name, which is then merged with the original df
df.merge(df.groupby(['name'])['value'].max().reset_index(),
         on='name').rename(
    columns={'value_x' : 'value',
             'value_y' : 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5
You could use transform or map
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())
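For reference, here is a minimal runnable sketch of the transform variant on the sample data from the question. The DataFrame is reconstructed from the table shown there; nothing else is assumed.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Linda', 'Ben', 'Alice', 'Alice', 'Ben', 'Linda'],
    'value': [1, 1, 3, 4, 9, 5, 1],
})

# transform('max') computes the max per name and returns a Series aligned
# to the original index, so it can be assigned directly as a column.
df['max'] = df.groupby('name')['value'].transform('max')
print(df)
```

Because the result is aligned to the original index, the rows stay in their original order and no merge or rename step is needed.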

Pandas - dense rank but keep current group numbers

I'm dealing with pandas dataframe and have a frame like:
data = {
"name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
"id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 0
4 Mary 0
5 Andrew 0
6 Michael 2
I'm trying to write code to group values by "name" column. However, I want to keep the current group numbers.
If the value is 0, it means that there is no assignment.
For the example above, assign a value of 3 for each occurrence of Andrew and a value of 1 for each occurrence of James. For Mary, there is no assignment so assign next/unique number.
The expected output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = ( df['name'].rank(method='dense').astype(int))
The issue with the above is that it ignores the records equal to 0, so the numbers are incorrect. I removed that part (the values equal to 0), but then the existing numbering is not preserved.
Can you please help me?
Replace the 0 values with missing values, so that GroupBy.transform with 'first' fills each name with its existing id; then fill the remaining missing values using Series.rank plus the maximal id, and convert to integers:
import numpy as np
df = df.replace({'id':{0:np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df["id"].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print (df)
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID, then complete the names without ID to the next available ID on the masked data (you can use factorize or rank as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0]+df['id'].max()+1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense')+df['id'].max()
# optional, if you want integers
df['id']= df['id'].convert_dtypes()
output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
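The factorize idea can also be sketched end to end without going through missing values, which keeps the column as integers throughout. This is only an illustrative variant on the question's sample data; the names existing, unassigned and codes are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
})

# Largest id per name; it is 0 only when the name has no assignment at all.
existing = df.groupby("name")["id"].transform("max")
unassigned = existing.eq(0)

# Number the unassigned names 0, 1, ... in order of first appearance,
# then shift them past the largest id already in use.
codes, _ = pd.factorize(df.loc[unassigned, "name"])

df["id"] = existing
df.loc[unassigned, "id"] = codes + 1 + existing.max()
print(df)
```

With the sample data, Mary is the only unassigned name, so she receives the next free number after the current maximum of 3.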

Python merging data frames and renaming column values

In python, I have a df that looks like this
Name ID
Anna 1
Sarah 2
Max 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the df’s so that the ID column looks like this
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example. My actual data set has 1000’s of values. I’m basically merging data frames and want the ID’s in numerical order (continuation of previous data frame) instead of repeating from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset_index and use the new index to assign "ID"s
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat dataframes with ignore_index=True and then set ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
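The two ideas in these answers (offsetting the second frame's IDs versus renumbering from the new index) can be compared side by side. The sketch below rebuilds the sample frames from the question; the names offset, out_offset and out_renumbered are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Anna', 'Sarah', 'Max'], 'ID': [1, 2, 3]})
df2 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam'], 'ID': [1, 2, 3]})

# Approach 1: shift df2's IDs past the largest ID in df1, then concatenate.
offset = df1['ID'].max()
out_offset = pd.concat([df1, df2.assign(ID=df2['ID'] + offset)],
                       ignore_index=True)

# Approach 2: concatenate first, then renumber every row from the new index.
out_renumbered = pd.concat([df1, df2], ignore_index=True)
out_renumbered['ID'] = out_renumbered.index + 1

print(out_offset)
print(out_renumbered)
```

On this sample both give IDs 1 to 6. They differ when the existing IDs have gaps or do not start at 1: the offset approach keeps df1's IDs untouched, while renumbering overwrites all of them.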

Select top n items in a pandas groupby and calculate the mean

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with a 8 next to John and 7 next to Jim.
Thanks in advance!
Calculate the mean per group on index level 0 (using groupby(level=0), since the level argument of Series.mean was removed in pandas 2.0), then map the calculated mean values onto the Name column to broadcast the aggregated results.
top2 = df.groupby('Name')['Value'].nlargest(2).groupby(level=0).mean()
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns, for example Name and City, take the mean on level=[Name, City] and map the calculated mean values using MultiIndex.map:
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).groupby(level=c).mean()
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8
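A self-contained sketch of the aggregate-then-map idea on the question's data is shown below, written so that it does not depend on the level argument of mean. The variable n is an illustrative parameter for the number of top items, not something from the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Name': ['John', 'Jim', 'John', 'Jim', 'John',
             'Jim', 'Jim', 'John', 'Jim', 'John'],
})

n = 2  # how many of the largest values to average per name

# nlargest keeps the top n values per name; the result is indexed by
# (Name, original row label), so group on level 0 to average per name.
top_n_mean = df.groupby('Name')['Value'].nlargest(n).groupby(level=0).mean()

# Broadcast the per-name mean back onto every row.
df['Top2Mean'] = df['Name'].map(top_n_mean)
print(df)
```

For John the two largest values are 9 and 7, and for Jim they are 8 and 6, which gives the means 8 and 7 expected in the question.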

How to know the occurrence of a text in row of data frame pandas python

C1
0 John
1 John
2 John
3 Michale
4 Michale
5 Newton
6 Newton
7 John
8 John
9 John
I want to know where each name occurs in consecutive rows. For example, John occurs in rows 0 to 2, so the result should show John from 0 to 2, Michale from 3 to 4, and Newton from 5 to 6.
I want the result in this format:
Start End Name
0 2 John
3 4 Michale
5 6 Newton
7 9 John
Use
In [163]: df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
Out[163]:
start end
C1
John 0 2
Michale 3 4
Newton 5 6
@Zero: Would adding the following to your code help? :)
df_new = df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
df_new.reset_index().rename(columns={'C1':'Name'})
Edit: Maybe something like this..? I am still learning but there is no harm trying. :)
# give each run of consecutive identical values its own label
labels = (df.C1 != df.C1.shift()).cumsum().rename('label')
df1 = pd.concat([df, labels], axis=1)
df_new = df1.reset_index().groupby(['label', 'C1'])['index'].agg(['min', 'max']).rename(
    columns={'min': 'start', 'max': 'end'}).reset_index().rename(columns={'C1': 'Name'})
df_new
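The shift/cumsum trick is what separates the two John runs: a plain groupby on C1 alone would merge rows 0 to 2 and 7 to 9 into one group. Below is a minimal runnable sketch on the question's data that produces the requested Start/End/Name layout; the name run_id and the use of named aggregation are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'C1': ['John', 'John', 'John', 'Michale', 'Michale',
           'Newton', 'Newton', 'John', 'John', 'John'],
})

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives each run its own id.
run_id = df['C1'].ne(df['C1'].shift()).cumsum()

# Group the row labels by run id and take the first/last label per run.
out = (
    df.reset_index()
      .groupby(run_id.to_numpy())
      .agg(Start=('index', 'min'), End=('index', 'max'), Name=('C1', 'first'))
      .reset_index(drop=True)
)
print(out)
```

Passing run_id as a plain array avoids relying on index alignment after reset_index, which matters if the original frame does not have a default integer index.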
