Conditional aggregation on dataframe columns with combining 'n' rows into 1 row - python

I have an input dataframe like the following:
NAME   TEXT                                             START  END
Tim    Tim Wagner is a teacher.                         10     20.5
Tim    He is from Cleveland, Ohio.                      20.5   40
Frank  Frank is a musician.                             40     50
Tim    He like to travel with his family                50     62
Frank  He is a performing artist who plays the cello.   62     70
Frank  He performed at the Carnegie Hall last year.     70     85
Frank  It was fantastic listening to him.               85     90
Frank  I really enjoyed                                 90     93
I want the output dataframe to look like this:
NAME   TEXT                                                                                          START  END
Tim    Tim Wagner is a teacher. He is from Cleveland, Ohio.                                          10     40
Frank  Frank is a musician                                                                           40     50
Tim    He like to travel with his family                                                             50     62
Frank  He is a performing artist who plays the cello. He performed at the Carnegie Hall last year.   62     85
Frank  It was fantastic listening to him. I really enjoyed                                           85     93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)[['TEXT', 'START', 'END']]\
  .agg({'TEXT': lambda x: ' '.join(x), 'START': 'min', 'END': 'max'})\
  .reset_index().drop('group', axis=1)
This combines the last four rows into one. Instead, I want to combine at most 2 rows at a time (or, more generally, any n rows), even when 'NAME' has the same value.
Appreciate your help on this.
Thanks

You can group by the block id together with the relative position inside each block (cumcount() // 2), which splits every block into chunks of two rows:
blocks = df.NAME.ne(df.NAME.shift()).cumsum()

(df.groupby([blocks, df.groupby(blocks).cumcount() // 2])
   .agg({'NAME': 'first', 'TEXT': ' '.join,
         'START': 'min', 'END': 'max'})
)
Output:
        NAME                                               TEXT  START   END
NAME
1    0    Tim  Tim Wagner is a teacher. He is from Cleveland,...   10.0  40.0
2    0  Frank                               Frank is a musician.   40.0  50.0
3    0    Tim                  He like to travel with his family   50.0  62.0
4    0  Frank  He is a performing artist who plays the cello....   62.0  85.0
     1  Frank  It was fantastic listening to him. I really en...   85.0  93.0
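To combine chunks of n rows instead of pairs, the same idea generalizes by floor-dividing the within-block counter by n. A minimal sketch, assuming the frame is named df as above and picking n = 3 purely for illustration:

n = 3  # chunk size; n = 2 reproduces the output above

blocks = df.NAME.ne(df.NAME.shift()).cumsum()  # consecutive-NAME blocks
chunks = df.groupby(blocks).cumcount() // n    # 0, 0, 0, 1, 1, ... within each block

out = (df.groupby([blocks, chunks], sort=False)
         .agg({'NAME': 'first', 'TEXT': ' '.join,
               'START': 'min', 'END': 'max'})
         .reset_index(drop=True))
print(out)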

Related

Python pandas merge map with multiple values xlookup

I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue
1 Once Upon a Time in Hollywood 150
2 The Others 200
2 Moulin Rouge 50
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman Moulin Rouge 50
2 Nicole Kidman The Others 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!
One way is to continue with the merge and then filter the result set
movie title that is alphabetically first
# sort by name and movie, then pick the first row per actor
df3.sort_values(['actor_name', 'actor_movie']).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 50
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
movie title with the highest revenue
# sort by name and revenue (descending), group by actor and pick the first row
df3.sort_values(['actor_name', 'movie_revenue'], ascending=[True, False]).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman The Others 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
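If you prefer not to go through groupby().first(), the same two results can be reproduced with drop_duplicates or idxmax. A rough sketch, assuming the merged frame is called df3 and the revenue column is named movie_revenue as in the listings above:

# alphabetically first movie per actor: sort, then keep the first row per actor_id
first_alpha = (df3.sort_values(['actor_id', 'actor_movie'])
                  .drop_duplicates('actor_id', keep='first'))

# highest-revenue movie per actor: pick the row holding each group's maximum revenue
top_revenue = df3.loc[df3.groupby('actor_id')['movie_revenue'].idxmax()]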

Modify Rows With Duplicate Values in a Python Pandas Dataframe

Right now, I am working with this dataframe:
Name   DateSolved  Points
Jimmy  12/3        100
Tim    12/4        50
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
At the moment, if there are duplicate names in the dataset, I just drop the oldest one (by date) with df.sort_values('DateSolved').drop_duplicates('Name', keep='last'), which leaves a dataset like this:
Name   DateSolved  Points
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
However, instead of dropping the oldest one, I wish to keep it but give it a 50% points reduction, something like this:
Name   DateSolved  Points
Jimmy  12/3        50 (-50%)
Tim    12/4        25 (-50%)
Jo     12/5        25
Jonny  12/5        25
Jimmy  12/8        10
Tim    12/8        10
How could I go about doing this? I cannot find a way to both find the duplicates based on 'Name' and then change the value of the 'Points' column in those same rows.
Thank you!
IIUC, use DataFrame.duplicated to select all duplicates except the last, then select the Points column and divide by 2:
df.loc[df.duplicated('Name', keep='last'), 'Points'] /= 2
print(df)
Name DateSolved Points
0 Jimmy 12/3 50.0
1 Tim 12/4 25.0
2 Jo 12/5 25.0
3 Jonny 12/5 25.0
4 Jimmy 12/8 10.0
5 Tim 12/8 10.0
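If the rows are not guaranteed to be in chronological order already, you may want to sort by date first so that keep='last' really keeps the newest attempt. A small sketch under that assumption (note that the 12/3-style strings happen to sort chronologically here; real dates would be safer):

df = df.sort_values('DateSolved')            # newest attempt per name ends up last
older = df.duplicated('Name', keep='last')   # True for every row except the newest per name
df.loc[older, 'Points'] = df.loc[older, 'Points'] / 2
print(df)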

Replacement Values into the integer on dataset columns

   House Number  Street        First Name  Surname   Age  Relationship to Head of House  Marital Status  Gender  Occupation                         Infirmity  Religion
0  1             Smith Radial  Grace       Patel     46   Head                           Widowed         Female  Petroleum engineer                 None       Catholic
1  1             Smith Radial  Ian         Nixon     24   Lodger                         Single          Male    Publishing rights manager          None       Christian
2  2             Smith Radial  Frederick   Read      87   Head                           Divorced        Male    Retired TEFL teacher               None       Catholic
3  3             Smith Radial  Daniel      Adams     58   Head                           Divorced        Male    Therapist, music                   None       Catholic
4  3             Smith Radial  Matthew     Hall      13   Grandson                       NaN             Male    Student                            None       NaN
5  3             Smith Radial  Steven      Fletcher  9    Grandson                       NaN             Male    Student                            None       NaN
6  4             Smith Radial  Alison      Jenkins   38   Head                           Single          Female  Physiotherapist                    None       Catholic
7  4             Smith Radial  Kelly       Jenkins   12   Daughter                       NaN             Female  Student                            None       NaN
8  5             Smith Radial  Kim         Browne    69   Head                           Married         Female  Retired Estate manager/land agent  None       Christian
9  5             Smith Radial  Oliver      Browne    69   Husband                        Married         Male    Retired Merchandiser, retail       None       None
Hello,
I have the dataset you can see above. When I tried to convert Age to int, I got this error: ValueError: invalid literal for int() with base 10: '43.54302670766108'
This means there are float values stored as strings in that column. I tried to replace the '.' with '0' and then convert, but it failed. Could you help me do that?
df['Age'] = df['Age'].replace('.', '0')
df['Age'] = df['Age'].astype('int')
I still got the same error, so I think the replace line is not working. Do you know why?
Thanks
Your replace does nothing because Series.replace matches whole cell values by default, not substrings, so '.' never matches anything. Either use a regex or go through float. Try:
df['Age'] = df['Age'].replace(r'\..*$', '', regex=True).astype(int)
Or, more drastically, turn anything that is not a plain integer into 0:
df['Age'] = df['Age'].replace(r'^(?:.*\D.*)?$', '0', regex=True).astype(int)
Alternatively, you do not need to manipulate the strings at all: convert the values to float first and then to int:
df['Age'] = df['Age'].astype('float').astype('int')

Combining three datasets removing duplicates

I have three datasets:
dataset 1
Customer1       Customer2         Exposures + other columns
Nick McKenzie   Christopher Mill  23450
Nick McKenzie   Stephen Green     23450
Johnny Craston  Mary Shane        12
Johnny Craston  Stephen Green     12
Molly John      Casey Step        1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer          Age
Nick McKenzie     53
Johnny Craston    75
Molly John        34
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
dataset 3
Customer1  Customer2       Exposures + other columns
Mick Sale  Johnny Craston
Mick Sale  Stephen Green
Exposures refers to Customer1 only.
Other columns are omitted for brevity. Dataset 2 is built from the unique values of Customer1 and Customer2, so it contains no duplicates. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2, to get:
Final dataset
Customer          Age  Exposures + other columns
Nick McKenzie     53   23450
Johnny Craston    75   12
Molly John        34   1000021
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
The final dataset should contain every Customer1 and Customer2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows:
result = pd.concat([df2, df1, df3], axis=1)
but the result is not the one I'd expect. Something is wrong in the way I'm concatenating the datasets, and I'd appreciate it if you could let me know what.
You can concatenate dataset 1 and dataset 3 because they have the same columns, for example with pd.concat([df1, df3], ignore_index=True) (ignore_index=True just rebuilds the row index). After removing the duplicates with drop_duplicates(subset=['Customer1']), join the result with df2 on the customer name:
combined.set_index('Customer1').join(df2.set_index('Customer'))
where combined is the concatenated, de-duplicated frame. If the two frames had different columns for the same key, you would join on the key in the same way and then join the age table again. Running this gives the desired result, with the respective join keys specified through the indexes.
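Putting it all together, here is a rough sketch of the whole pipeline. It starts the final merge from df2 so that customers who never appear as Customer1 (e.g. Mary Shane) are kept with an empty exposure; the frame and column names (df1, df2, df3, Customer1, Customer2, Customer) are assumed to match the listings above:

import pandas as pd

combined = pd.concat([df1, df3], ignore_index=True)        # datasets 1 and 3 share the same columns
combined = combined.drop_duplicates(subset=['Customer1'])  # one exposure row per Customer1

final = df2.merge(
    combined.drop(columns=['Customer2']).rename(columns={'Customer1': 'Customer'}),
    on='Customer', how='left')                             # left join keeps every customer in df2
print(final)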

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas; any help in solving this would be appreciated.
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
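For this single combination you can also skip the groupby entirely and filter with a boolean mask (assuming the column is spelled status in lower case, as in the listing above):

mask = (df['gender'] == 'female') & (df['status'] == 'Pass')
print(df.loc[mask, 'exercise'].mean())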
