I have a DataFrame that looks like this:
name age occ salary
0 Vinay 22.0 engineer 60000.0
1 Kushal NaN doctor 70000.0
2 Aman 24.0 engineer 80000.0
3 Rahul NaN doctor 65000.0
4 Ramesh 25.0 doctor 70000.0
I'm trying to get all the salary values that correspond to a specific occupation, to then compute the mean salary of that occupation.
Here is an answer in a few steps:
temp_df = df.loc[df['occ'] == 'engineer']
temp_df.salary.mean()
All averages at once:
df_averages = df[['occ', 'salary']].groupby('occ').mean()
salary
occ
doctor 68333.333333
engineer 70000.000000
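Putting the steps above into one runnable sketch (recreating the sample frame from the question, since the original source isn't shown):

```python
import pandas as pd
import numpy as np

# Recreate the sample data from the question
df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Rahul', 'Ramesh'],
    'age': [22.0, np.nan, 24.0, np.nan, 25.0],
    'occ': ['engineer', 'doctor', 'engineer', 'doctor', 'doctor'],
    'salary': [60000.0, 70000.0, 80000.0, 65000.0, 70000.0],
})

# Mean salary for a single occupation
engineer_mean = df.loc[df['occ'] == 'engineer', 'salary'].mean()
print(engineer_mean)  # 70000.0

# All occupation averages at once
averages = df.groupby('occ')['salary'].mean()
print(averages)
```

Note that `df.loc[df['occ'] == 'engineer', 'salary']` selects the rows and the column in one step, avoiding the intermediate `temp_df`.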
I have an input DataFrame like the following:
NAME TEXT START END
Tim Tim Wagner is a teacher. 10 20.5
Tim He is from Cleveland, Ohio. 20.5 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. 62 70
Frank He performed at the Carnegie Hall last year. 70 85
Frank It was fantastic listening to him. 85 90
Frank I really enjoyed 90 93
Want output dataframe as follows:
NAME TEXT START END
Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. 62 85
Frank It was fantastic listening to him. I really enjoyed 85 93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)[['TEXT', 'START', 'END']]\
  .agg({'TEXT': ' '.join, 'START': 'min', 'END': 'max'})\
  .reset_index().drop('group', axis=1)
This combines the last 4 rows into one. Instead, I want to combine at most 2 rows (or, in general, any n rows), even if 'NAME' has the same value.
Appreciate your help on this.
Thanks
You can additionally group by the relative position inside each block (the cumulative count floor-divided by n), which splits every block into chunks of at most n rows:
blocks = df.NAME.ne(df.NAME.shift()).cumsum()
(df.groupby([blocks, df.groupby(blocks).cumcount()//2])
.agg({'NAME':'first', 'TEXT':' '.join,
'START':'min', 'END':'max'})
)
Output:
NAME TEXT START END
NAME
1 0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
2 0 Frank Frank is a musician. 40.0 50.0
3 0 Tim He like to travel with his family 50.0 62.0
4 0 Frank He is a performing artist who plays the cello.... 62.0 85.0
1 Frank It was fantastic listening to him. I really en... 85.0 93.0
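A self-contained version of the answer above, with the chunk size pulled out as `n` and a `reset_index` to flatten the result back to ordinary columns (the frame is recreated from the question's sample):

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['Tim', 'Tim', 'Frank', 'Tim', 'Frank', 'Frank', 'Frank', 'Frank'],
    'TEXT': ['Tim Wagner is a teacher.', 'He is from Cleveland, Ohio.',
             'Frank is a musician.', 'He like to travel with his family',
             'He is a performing artist who plays the cello.',
             'He performed at the Carnegie Hall last year.',
             'It was fantastic listening to him.', 'I really enjoyed'],
    'START': [10, 20.5, 40, 50, 62, 70, 85, 90],
    'END': [20.5, 40, 50, 62, 70, 85, 90, 93],
})

n = 2  # combine at most n consecutive rows sharing the same NAME

# A new block starts whenever NAME changes from the previous row
blocks = df['NAME'].ne(df['NAME'].shift()).cumsum().rename('block')
# Position within each block, floor-divided by n -> chunk number
chunks = df.groupby(blocks).cumcount() // n

out = (df.groupby([blocks, chunks], sort=False)
         .agg({'NAME': 'first', 'TEXT': ' '.join,
               'START': 'min', 'END': 'max'})
         .reset_index(drop=True))
print(out)
```

`sort=False` preserves the original row order, and `reset_index(drop=True)` discards the two helper group keys.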
My first data frame:
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My second data frame has the transactions:
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from the 1st data frame to come into the merged data frame, the common key being 'Product_ID'. Note that for Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the latter one in the merged data set, i.e. 9898.0, since this is the latest price.
Currently my code is not giving the right answer; it returns both prices:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume the row order of the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
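A runnable sketch of the dedupe-before-merge approach (columns not needed for the join are trimmed for brevity; as above, "latest" is assumed to mean "last row per Product_ID"):

```python
import pandas as pd

product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone',
                     'Books', 'Oil', 'Laptop', 'New Watch'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
})
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic',
             'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
})

# Keep only the last (assumed latest) price per Product_ID
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')

# Left merge so every customer row is kept, even without a matching product
merged = customer.merge(latest[['Product_ID', 'Price']],
                        on='Product_ID', how='left')
print(merged)
```

Because `latest` has unique Product_IDs, the left merge can no longer fan out one customer row into two.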
I have a multiindex dataframe like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
John 3 961.0
4 346.0
Bricks James 10 244.0
20 303.0
30 811.0
Fred 40 449.0
James 501 265.0
Sand Donald 15 378.0
800 359.0
How can I slice that df to see only drivers who worked for different companies? So the result should be like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
UPD: My original dataframe is 400k rows long, so I can't just slice it by index. I'm trying to find a general solution to problems like these.
To get the number of unique companies a person has worked for, use groupby and unique:
v = (df.index.get_level_values(0)
.to_series()
.groupby(df.index.get_level_values(1))
.nunique())
# Alternative involving resetting the index, may not be as efficient.
# v = df.reset_index().groupby('Driver').Company.nunique()
v
Driver
Donald 1
Fred 2
James 1
John 1
Name: Company, dtype: int64
Now, you can run a query:
names = v[v.gt(1)].index.tolist()
df.query("Driver in @names")
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
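End to end, the answer above looks like this (the MultiIndex frame is rebuilt from the sample shown in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['Salt', 'Salt', 'Salt', 'Salt', 'Bricks', 'Bricks',
                'Bricks', 'Bricks', 'Bricks', 'Sand', 'Sand'],
    'Driver': ['Fred', 'Fred', 'John', 'John', 'James', 'James',
               'James', 'Fred', 'James', 'Donald', 'Donald'],
    'Document_id': [1, 2, 3, 4, 10, 20, 30, 40, 501, 15, 800],
    'Distance': [592.0, 550.0, 961.0, 346.0, 244.0, 303.0,
                 811.0, 449.0, 265.0, 378.0, 359.0],
}).set_index(['Company', 'Driver', 'Document_id'])

# Number of distinct companies per driver, computed from the index levels
v = (df.index.get_level_values(0)
       .to_series()
       .groupby(df.index.get_level_values(1))
       .nunique())

# Drivers seen at more than one company; query can reference index levels
names = v[v.gt(1)].index.tolist()
result = df.query("Driver in @names")
print(result)
```

Note the `@names` prefix inside `query`, which references the Python variable from the surrounding scope.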
I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working:
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
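A minimal runnable sketch of this fix, using a tiny made-up frame (the full gradedata.csv isn't reproduced here, so the numbers are illustrative only):

```python
import pandas as pd

# Small stand-in for gradedata.csv with just the columns used
df = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'female'],
    'status': ['Pass', 'Pass', 'Fail', 'Pass'],
    'exercise': [3, 4, 2, 5],
})

# Group by a *list* of labels, not two positional arguments
res = df.groupby(['gender', 'status'])['exercise'].mean()

# Extract one group's mean with a (gender, status) tuple key
female_pass_mean = res.get(('female', 'Pass'))
print(female_pass_mean)  # mean of 3 and 5 -> 4.0
```

`res.get(...)` returns `None` instead of raising if the combination is absent, which is handy when some gender/status pairs don't occur in the data.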