Combine rows in pandas df as per given condition


I have a pandas df as shown:
Name Subject Score
Rakesh Math 65
Mukesh Science 76
Bhavesh French 87
Rakesh Science 88
Rakesh Hindi 76
Sanjay English 66
Mukesh English 98
Mukesh Marathi 77
I need to build another df that includes only the students who took two or more subjects, with their scores totalled across those subjects.
Hence the resultant df will be as shown:

pandas has a method, explode, that takes a column containing lists and breaks each list apart into separate rows. We can do roughly the opposite by collecting each student's Subject values into a list. I took the idea from another question.
In [1]: df = df.groupby('Name').agg({'Subject': lambda x: x.tolist(), 'Score':'sum'})
In [2]: df
Out[2]:
Subject Score
Name
Bhavesh [French] 87
Mukesh [Science, English, Marathi] 251
Rakesh [Math, Science, Hindi] 229
Sanjay [English] 66
We can then filter on the Subject column for any row where the list has more than one item. This method I lifted from another SO question.
In [3]: df[df['Subject'].str.len() > 1]
Out[3]:
Subject Score
Name
Mukesh [Science, English, Marathi] 251
Rakesh [Math, Science, Hindi] 229
If you want the Subject column to be a string instead of a list, you can use this approach from a third SO answer.
df['Subject'] = df['Subject'].apply(lambda x: ", ".join(x))
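For reference, the steps above can be put together as a small self-contained sketch. The DataFrame is rebuilt from the sample data in the question, and the names grouped and multi are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Rakesh", "Mukesh", "Bhavesh", "Rakesh",
             "Rakesh", "Sanjay", "Mukesh", "Mukesh"],
    "Subject": ["Math", "Science", "French", "Science",
                "Hindi", "English", "English", "Marathi"],
    "Score": [65, 76, 87, 88, 76, 66, 98, 77],
})

# One row per student: subjects collected into a list, scores summed.
grouped = df.groupby("Name").agg({"Subject": list, "Score": "sum"})

# Keep only the students whose subject list has two or more entries.
multi = grouped[grouped["Subject"].str.len() > 1].copy()

# Optionally turn each list into a comma-separated string.
multi["Subject"] = multi["Subject"].apply(", ".join)
print(multi)
```

Assigning the aggregated and filtered frames to new names keeps the original df intact, which makes it easier to re-run individual steps.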

Using groupby, filter and agg we can do it in one line:
(df.groupby('Name')
.filter(lambda g:len(g)>1)
.groupby('Name')
.agg({'Subject': ', '.join, 'Score':'sum'})
)
output
Subject Score
Name
Mukesh Science, English, Marathi 251
Rakesh Math, Science, Hindi 229
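A possible variant, not part of the answer above, replaces the filter call with a boolean mask built from transform('size'). This is only a sketch on the question's sample data; it should give the same rows while avoiding a Python-level lambda per group.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Rakesh", "Mukesh", "Bhavesh", "Rakesh",
             "Rakesh", "Sanjay", "Mukesh", "Mukesh"],
    "Subject": ["Math", "Science", "French", "Science",
                "Hindi", "English", "English", "Marathi"],
    "Score": [65, 76, 87, 88, 76, 66, 98, 77],
})

# Boolean mask: True for rows whose Name occurs at least twice.
mask = df.groupby("Name")["Name"].transform("size") > 1

# Aggregate only the rows that passed the mask.
result = (
    df[mask]
    .groupby("Name")
    .agg({"Subject": ", ".join, "Score": "sum"})
)
print(result)
```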

Related

Pandas - Concatenate rows that are truncated

I found a db online that contains for a series of anonymous users their degrees and the inverse sequence in which they completed them (last degree first).
For each user, I have:
Their UserID
The inverse sequence
The degree title
Basically my dataframe looks like this:
User_ID Sequence Degree
123 1 MSc in Civil
123 1 Engineering
123 2 BSc in Engineering
As you can see, my issue is that degree titles are sometimes truncated and split across two separate rows (user 123 has an MSc in Civil Engineering; notice the same value in Sequence).
Ideally, my dataframe should look like this:
User_ID Sequence Degree
123 1 MSc in Civil Engineering
123 2 BSc in Engineering
I was wondering if anyone could help me out. I will be happy to provide any more insight that may be needed for assistance.
Thanks in advance!

Try with groupby aggregate:
df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
User_ID Sequence Degree
0 123 1 MSc in Civil Engineering
1 123 2 BSc in Engineering
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'User_ID': [123, 123, 123],
'Sequence': [1, 1, 2],
'Degree': ['MSc in Civil', 'Engineering', 'BSc in Engineering']
})
df = df.groupby(['User_ID', 'Sequence'], as_index=False).aggregate(' '.join)
print(df)

How to convert to lowercase all columns except a few specific ones?

I would like to convert to lowercase all columns within a dataframe except two. To convert all the dataframe I usually do
df=df.apply(lambda x: x.astype(str).str.lower())
My dataset is
Time Name Surname Age Notes Comments
12 Mirabel Gutierrez 23 None Already Paid
09 Kim Stuart 45 In debt Should refund 100 EUR
and so on.
I would like to transform into lowercase all the columns except Notes and Comments.
Time Name Surname Age Notes Comments
12 mirabel gutierrez 23 None Already Paid
09 kim stuart 45 In debt Should refund 100 EUR
What can I try?

You probably simply want to create a list of the relevant columns:
lowerify_cols = [col for col in df if col not in ['Notes','Comments']]
Then you can use your code snippet:
df[lowerify_cols] = df[lowerify_cols].apply(lambda x: x.astype(str).str.lower())
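Note that astype(str) also turns numeric columns such as Age into strings. If that is not wanted, one option is to lowercase only the non-numeric columns. The sketch below assumes the sample data from the question; keep_as_is and text_cols are illustrative names.

```python
import pandas as pd

# Sample data copied from the question (Time kept as text to preserve "09").
df = pd.DataFrame({
    "Time": ["12", "09"],
    "Name": ["Mirabel", "Kim"],
    "Surname": ["Gutierrez", "Stuart"],
    "Age": [23, 45],
    "Notes": [None, "In debt"],
    "Comments": ["Already Paid", "Should refund 100 EUR"],
})

keep_as_is = ["Notes", "Comments"]

# Columns to lowercase: everything not excluded and not numeric.
text_cols = [
    col for col in df.columns
    if col not in keep_as_is and not pd.api.types.is_numeric_dtype(df[col])
]

# Lowercase each selected column; numeric columns keep their dtype.
for col in text_cols:
    df[col] = df[col].str.lower()

print(df)
```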

Aggregating rows in a data frame and eliminating duplicates

I want to merge rows in my df so I have one unique row per ID/Name with other values either summed (revenue) or concatenated (subject and product). However, where I am concatenating, I do not want duplicates to appear.
My df is similar to this:
ID Name Revenue Subject Product
123 John 125 Maths A
123 John 75 English B
246 Mary 32 History B
312 Peter 67 Maths A
312 Peter 39 Science A
I am using the following code to aggregate rows in my data frame
def f(x): return ' '.join(list(x))
df.groupby(['ID', 'Name']).agg(
{'Revenue': 'sum', 'Subject': f, 'Product': f}
)
This results in output like this:
ID Name Revenue Subject Product
123 John 200 Maths English A B
246 Mary 32 History B
312 Peter 106 Maths Science A A
How can I amend my code so that duplicates are removed in my concatenation? So in the example above the last row reads A in Product and not A A

You are very close. Apply set to the items before listing and joining them. This returns only the unique items, although a set does not guarantee that they keep their original order.
def f(x): return ' '.join(list(set(x)))
df.groupby(['ID', 'Name']).agg(
{'Revenue': 'sum', 'Subject': f, 'Product': f}
)
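If the order of the joined values matters, a possible alternative to set is dict.fromkeys, which drops repeats while keeping first-seen order. This is a sketch on the question's sample data; join_unique is a hypothetical helper name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [123, 123, 246, 312, 312],
    "Name": ["John", "John", "Mary", "Peter", "Peter"],
    "Revenue": [125, 75, 32, 67, 39],
    "Subject": ["Maths", "English", "History", "Maths", "Science"],
    "Product": ["A", "B", "B", "A", "A"],
})


def join_unique(values):
    """Join values with spaces, dropping repeats but keeping first-seen order."""
    return " ".join(dict.fromkeys(values))


result = df.groupby(["ID", "Name"], as_index=False).agg(
    {"Revenue": "sum", "Subject": join_unique, "Product": join_unique}
)
print(result)
```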

How can I use python to iterate through multiple csv files and if a value is the same, update another value?

I'm trying to iterate through a few .csv files, and if a value is the same, add up the related entries. If a value is not the same, add that as a new row.
My current code is:
import pandas
s1011 = pandas.read_csv('goals/1011.csv', sep=',', usecols=(1,3,5))
s1011.dropna(how="all", inplace=True)
print(s1011)
s1112 = pandas.read_csv('goals/1112.csv', sep=',', usecols=(1,3,5))
s1112.dropna(how="all", inplace=True)
print(s1112)
s1213 = pandas.read_csv('goals/1213.csv', sep=',', usecols=(1,3,5))
s1213.dropna(how="all", inplace=True)
print(s1213)
Currently this doesn't do much, I know. It prints out 3 headings: Team, For & Against. It then prints out 20 football teams, how many goals they've scored, and how many they've conceded. I've tried using merge in pandas, but it isn't suitable as it just makes one big table.
What I'm trying to do is open the multiple csv files, iterate through the list, if the Team is the same, add together goals for and goals against from each file. If Team has not been entered before, add a new row with new entries.
Is this possible using python?
Edit:
It currently prints
Team For Against
Man Utd 86 43
Man City 83 45
Chelsea 89 39
etc for 20 teams. So I want to update the For and Against entries by adding up the number of goals for each team over a number of seasons. Since it won't be the same 20 teams each season, I want to add a new row of entries if a team hasn't been in the league before.

Assume you have the following csv's:
df1
Team For Against
Man Utd 86 43
Man City 83 45
Chelsea 89 39
df2
Team For Against
Man Utd 88 45
Man City 85 47
ICantNameATeam 91 41
You can first stack them using pandas.concat:
df_concat = pandas.concat([df1, df2], axis=0)
which would give you:
df_concat
Team For Against
Man Utd 86 43
Man City 83 45
Chelsea 89 39
Man Utd 88 45
Man City 85 47
ICantNameATeam 91 41
Then you can use dataframe.groupby to take the sum:
df_sum = df_concat.groupby('Team').sum().reset_index()
This will group the dataframe according to unique team names and take the sum of each column.
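As a runnable sketch of the concat-then-groupby idea, the example below reads the two sample tables from in-memory text instead of real files. The io.StringIO buffers stand in for the question's CSV files, and the column names are assumed to be Team, For and Against.

```python
import io

import pandas as pd

# In-memory stand-ins for two season files (data from the example above).
season_1 = io.StringIO(
    "Team,For,Against\nMan Utd,86,43\nMan City,83,45\nChelsea,89,39\n"
)
season_2 = io.StringIO(
    "Team,For,Against\nMan Utd,88,45\nMan City,85,47\nICantNameATeam,91,41\n"
)

# Read every season, then stack the rows into one long table.
frames = [pd.read_csv(source) for source in (season_1, season_2)]
df_concat = pd.concat(frames, ignore_index=True)

# One row per team, with goals for and against summed over all seasons.
df_sum = df_concat.groupby("Team", as_index=False)[["For", "Against"]].sum()
print(df_sum)
```

Teams that appear in only one season simply keep their single-season totals, so no special handling is needed for new entries.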

Here's one approach:
Use Pandas to append the file's identifying number to the For & Against column in each dataframe (i.e. For-1011, For-1112, For-1213).
Use a full outer join on 'Team' to bring in all rows into a new dataframe.
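Sum the columns as you'd like: df['For-total'] = df['For-1011'] + df['For-1112'] + df['For-1213'] (fill missing values with 0 first, since a team absent from a season comes through the outer join as NaN).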
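The three steps above might look like the following sketch. It uses two illustrative in-memory seasons rather than the real files, and the labels 1011 and 1112 are taken from the question's file names. Row-wise sum skips NaN, so teams missing from a season count as 0 for that season.

```python
from functools import reduce

import pandas as pd

# Illustrative data for two seasons; real code would read the CSV files.
seasons = {
    "1011": pd.DataFrame({"Team": ["Man Utd", "Man City", "Chelsea"],
                          "For": [86, 83, 89], "Against": [43, 45, 39]}),
    "1112": pd.DataFrame({"Team": ["Man Utd", "Man City", "ICantNameATeam"],
                          "For": [88, 85, 91], "Against": [45, 47, 41]}),
}

# Step 1: append the season label to the For and Against column names.
renamed = [
    frame.rename(columns={"For": f"For-{label}", "Against": f"Against-{label}"})
    for label, frame in seasons.items()
]

# Step 2: full outer join on Team so that no team is dropped.
merged = reduce(
    lambda left, right: left.merge(right, on="Team", how="outer"), renamed
)

# Step 3: sum across seasons; sum(axis=1) ignores NaN by default.
for_cols = [c for c in merged.columns if c.startswith("For-")]
against_cols = [c for c in merged.columns if c.startswith("Against-")]
merged["For-total"] = merged[for_cols].sum(axis=1)
merged["Against-total"] = merged[against_cols].sum(axis=1)
print(merged[["Team", "For-total", "Against-total"]])
```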

How to merge rows (with strings) based on column value (int) in Pandas dataframe?

I have datasets in the format
df1=
userid movieid tags timestamp
73 130682 b movie 1432523704
73 130682 comedy 1432523704
73 130682 horror 1432523704
77 1199 Trilogy of the Imagination 1163220043
77 2968 Gilliam 1163220138
77 2968 Trilogy of the Imagination 1163220039
77 4467 Trilogy of the Imagination 1163220065
77 4911 Gilliam 1163220167
77 5909 Takashi Miike 1163219591
and I want another dataframe to be in format
df2=
userid tags
73 b movie[1] comedy[1] horror[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takashi Miike[1]
such that I can merge all tags together for word/s count or term frequency.
In short, I want all tags for one userid concatenated with " " (one space), so that I can also count the number of occurrences of each word or phrase. I can count words and their occurrences, but I am unable to concatenate the strings in tags. Any help or advice would be appreciated.

First count and reformat the result of the count per group. Keep it as an intermediate result:
r = df.groupby('userid').apply(lambda g: g.tags.value_counts()).reset_index(level=-1)
r
Out[46]:
level_1 tags
userid
73 b movie 1
73 horror 1
73 comedy 1
77 Trilogy of the Imagination 3
77 Gilliam 2
77 Takashi Miike 1
This simple string manipulation will give you the result per line:
r.level_1+'['+r.tags.astype(str)+']'
Out[49]:
userid
73 b movie[1]
73 horror[1]
73 comedy[1]
77 Trilogy of the Imagination[3]
77 Gilliam[2]
77 Takashi Miike[1]
Because the result is a Series indexed by userid, you can chain one more groupby to join the labels for each user:
(r.level_1+'['+r.tags.astype(str)+']').groupby(level=0).apply(' '.join)
Out[50]:
userid
73 b movie[1] horror[1] comedy[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takas...
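The column names level_1 and tags in the intermediate result depend on the pandas version, so the snippet above may need adjusting on newer releases. A possible alternative that names its columns explicitly is sketched below. It counts with groupby(...).size() instead of value_counts, and the names counts, n and label are illustrative.

```python
import pandas as pd

# Sample data copied from the question (timestamp column omitted).
df = pd.DataFrame({
    "userid": [73, 73, 73, 77, 77, 77, 77, 77, 77],
    "movieid": [130682, 130682, 130682, 1199, 2968, 2968, 4467, 4911, 5909],
    "tags": ["b movie", "comedy", "horror",
             "Trilogy of the Imagination", "Gilliam",
             "Trilogy of the Imagination", "Trilogy of the Imagination",
             "Gilliam", "Takashi Miike"],
})

# Count each (userid, tag) pair; sort=False keeps first-seen order.
counts = (
    df.groupby(["userid", "tags"], sort=False)
    .size()
    .reset_index(name="n")
)

# Build "tag[count]" labels, then join them per user with one space.
counts["label"] = counts["tags"] + "[" + counts["n"].astype(str) + "]"
result = counts.groupby("userid")["label"].agg(" ".join)
print(result)
```

Unlike value_counts, this keeps tags in order of first appearance rather than sorting them by frequency.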
