Pivoting without numerical aggregation/ a numerical column - python

I have a dataframe that looks like this
d = {'Name': ['Sally', 'Sally', 'Sally', 'James', 'James', 'James'], 'Sports': ['Tennis', 'Track & field', 'Dance', 'Dance', 'MMA', 'Crosscountry']}
df = pd.DataFrame(data=d)
Name   Sports
Sally  Tennis
Sally  Track & field
Sally  Dance
James  Dance
James  MMA
James  Crosscountry
It seems that pandas' pivot_table only allows reshaping with numerical aggregation, but I want to reshape it to wide format such that the strings are in the "values":
Name   First_sport  Second_sport   Third_sport
Sally  Tennis       Track & field  Dance
James  Dance        MMA            Crosscountry
Is there a method in pandas that can help me do this? Thanks!

You can do that either with .pivot(), provided each (index, column) pair occurs only once, or with .pivot_table() by passing an aggregation function that also works on strings, e.g. 'first'. Either way, you first need a column that numbers each person's sports:
>>> df['Sport_num'] = 'Sport ' + df.groupby('Name').cumcount().astype(str)
>>> df
    Name         Sports Sport_num
0  Sally         Tennis   Sport 0
1  Sally  Track & field   Sport 1
2  Sally          Dance   Sport 2
3  James          Dance   Sport 0
4  James            MMA   Sport 1
5  James   Crosscountry   Sport 2
>>> df.pivot(index='Name', values='Sports', columns='Sport_num')
Sport_num Sport 0        Sport 1       Sport 2
Name
James       Dance            MMA  Crosscountry
Sally      Tennis  Track & field         Dance
>>> df.pivot_table(index='Name', values='Sports', columns='Sport_num', aggfunc='first')
Sport_num Sport 0        Sport 1       Sport 2
Name
James       Dance            MMA  Crosscountry
Sally      Tennis  Track & field         Dance
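To get the exact column names asked for in the question, the numbering step and the pivot can be combined. This is only a minimal sketch: the `ordinals` list is my own assumption, and it would need to be extended if anyone has more than three sports.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sally", "Sally", "Sally", "James", "James", "James"],
    "Sports": ["Tennis", "Track & field", "Dance", "Dance", "MMA", "Crosscountry"],
})

# Hypothetical label list; extend it if a person can have more than three sports.
ordinals = ["First", "Second", "Third"]

# Number each person's sports 0, 1, 2, ... in order of appearance.
position = df.groupby("Name").cumcount()

wide = (
    df.assign(Sport_num=position.map(lambda i: f"{ordinals[i]}_sport"))
      .pivot(index="Name", columns="Sport_num", values="Sports")
      # pivot sorts column labels alphabetically; restore the ordinal order explicitly.
      .reindex(columns=[f"{o}_sport" for o in ordinals])
      .rename_axis(columns=None)
      .reset_index()
)
print(wide)
```

Note that the rows come back sorted by Name (James before Sally), not in the original order.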

Another solution:
print(
df.groupby("Name")
.agg(list)["Sports"]
.apply(pd.Series)
.rename(columns={0: "First", 1: "Second", 2: "Third"})
.add_suffix("_sport")
.reset_index()
)
Prints:
Name First_sport Second_sport Third_sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance

We can also use groupby cumcount in conjunction with set_index + unstack:
new_df = df.set_index(['Name', df.groupby('Name').cumcount()]).unstack()
new_df:
Sports
0 1 2
Name
James Dance MMA Crosscountry
Sally Tennis Track & field Dance
We can do some additional cleanup by renaming and collapsing the MultiIndex:
new_df = (
df.set_index(['Name', df.groupby('Name').cumcount()])
.unstack()
.rename(columns={0: "First", 1: "Second", 2: "Third",
'Sports': 'Sport'})
)
new_df.columns = new_df.columns.swaplevel().map('_'.join)
new_df = new_df.reset_index()
new_df:
Name First_Sport Second_Sport Third_Sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance
If you want a programmatic conversion from ints to ordinal words, you can use something like the inflect package:
import inflect
new_df = df.set_index([
'Name', df.groupby('Name').cumcount().add(1)
]).unstack()
# Collapse MultiIndex
p = inflect.engine()
new_df.columns = new_df.columns.map(
# Convert to Ordinal Word and Column to singular noun
lambda c: f'{p.number_to_words(p.ordinal(c[1])).capitalize()}_'
f'{p.singular_noun(c[0])}'
)
new_df = new_df.reset_index()
new_df:
Name First_Sport Second_Sport Third_Sport
0 James Dance MMA Crosscountry
1 Sally Tennis Track & field Dance

Related

Fill pandas dataframe with dictionary elements

I have a dataframe df structured as follows:
Name Surname Nationality
Joe Tippy Italian
Adam Wesker American
I would like to create a new record based on a dictionary whose keys correspond to the column names:
new_record = {'Name': 'Jimmy', 'Surname': 'Turner', 'Nationality': 'Australian'}
How can I do that? I tried with a simple:
df = df.append(new_record, ignore_index=True)
but if a value is missing from my record, the dataframe does not fill it with an empty string; it leaves that cell empty (NaN) instead.
If I understand correctly, you can replace the missing values in a following step:
new_record = {'Surname': 'Turner', 'Nationality': 'Australian'}
df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True).fillna('')
print (df)
   Name Surname Nationality
0   Joe   Tippy     Italian
1  Adam  Wesker    American
2        Turner  Australian
Or use DataFrame.reindex:
df = pd.concat([df, pd.DataFrame([new_record]).reindex(df.columns, fill_value='', axis=1)], ignore_index=True)
A simple way if you have a range index:
df.loc[len(df)] = new_record
Updated dataframe:
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Jimmy Turner Australian
If you have a missing key (for example 'Surname'):
Name Surname Nationality
0 Joe Tippy Italian
1 Adam Wesker American
2 Jimmy NaN Australian
If you want empty strings:
df.loc[len(df)] = pd.Series(new_record).reindex(df.columns, fill_value='')
Output:
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy          Australian
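If you append records in several places, the reindex idea can be wrapped in a small helper. A minimal sketch; `append_record` is a hypothetical name of my own, not a pandas function, and it assumes that keys not present in the dataframe's columns should simply be dropped.

```python
import pandas as pd

def append_record(df, record, fill_value=""):
    """Return a new frame with `record` appended; missing keys get `fill_value`."""
    # Build a one-row frame and align it to df's columns.
    # Columns absent from the record are created and filled with fill_value.
    row = pd.DataFrame([record]).reindex(columns=df.columns, fill_value=fill_value)
    return pd.concat([df, row], ignore_index=True)

df = pd.DataFrame({
    "Name": ["Joe", "Adam"],
    "Surname": ["Tippy", "Wesker"],
    "Nationality": ["Italian", "American"],
})

# 'Surname' is deliberately missing from this record.
df = append_record(df, {"Name": "Jimmy", "Nationality": "Australian"})
print(df)
```

Unlike df.loc[len(df)] = ..., this does not depend on the dataframe having a range index.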

Find words and create new value in different column pandas dataframe with regex

Suppose I have a dataframe which contains:
df = pd.DataFrame({'Name':['John', 'Alice', 'Peter', 'Sue'],
'Job': ['Dentist', 'Blogger', 'Cook', 'Cook'],
'Sector': ['Health', 'Entertainment', '', '']})
I want to find all 'cooks', whether capitalized or not, and set their 'Sector' column to 'gastronomy'. How do I do that without overwriting the other entries in the 'Sector' column? Thanks!
Here's one approach:
df.loc[df.Job.str.lower().eq('cook'), 'Sector'] = 'gastronomy'
print(df)
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
Alternatively, use Series.str.match with the inline regex flag (?i) for case-insensitive matching:
df.loc[df['Job'].str.match('(?i)cook'), 'Sector'] = 'gastronomy'
Output
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
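One thing to keep in mind: str.match only anchors the pattern at the start of the string, so a job like 'Cookbook author' would also match. If you want whole-string matches only, Series.str.fullmatch (available in pandas 1.1 and later) may be the safer choice. A small sketch; the extra 'Max' row and the upper-case 'COOK' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alice", "Peter", "Sue", "Max"],  # 'Max' is a hypothetical extra row
    "Job": ["Dentist", "Blogger", "Cook", "COOK", "Cookbook author"],
    "Sector": ["Health", "Entertainment", "", "", ""],
})

# match: the pattern only has to match at the beginning of the string.
starts_with_cook = df["Job"].str.match("cook", case=False)

# fullmatch: the pattern has to cover the entire string.
exactly_cook = df["Job"].str.fullmatch("cook", case=False)

# Only rows whose Job is exactly 'cook' (in any casing) are updated.
df.loc[exactly_cook, "Sector"] = "gastronomy"
print(df)
```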

How to display values of a column as separate columns

I want to display the values in a column along with their count in separate columns
Dataframe is
Date Name SoldItem
15-Jul Joe TV
15-Jul Joe Fridge
15-Jul Joe Washing Machine
15-Jul Joe TV
15-Jul Joe Fridge
15-Jul Mary Chair
15-Jul Mary Fridge
16-Jul Joe Fridge
16-Jul Joe Fridge
16-Jul Tim Washing Machine
17-Jul Joe Washing Machine
17-Jul Jimmy Washing Machine
17-Jul Joe Washing Machine
17-Jul Joe Washing Machine
And I get the output as
Date    Name  Count
15-Jul  Joe   2
        Mary  1
16-Jul  Joe   2
I want the final output to be
Date    Joe  Mary
15-Jul  2    1
16-Jul  2
Below is the code:
import pandas as pd
fields = ['Date', 'Name', 'SoldItem']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# keep only the rows where a fridge was sold
df_fridge = df.loc[df['SoldItem'] == 'Fridge']
# count fridge sales per (Date, Name) pair
df_fridge_grp = df_fridge.groupby(["Date", "Name"]).size()
print(df_fridge_grp)
I would appreciate any pointers. I am guessing it can be done with loc or iloc, but I wonder whether my approach is wrong. Basically, I want to count certain types of items sold per person and then show each person's count in their own column.
Does this work? unstack() moves the inner index level (Name) into the columns:
df_fridge_grp.unstack()
Code:
df_new = df[df['SoldItem'] == 'Fridge'].groupby(['Date', 'Name']).count()
df_new = df_new.unstack().fillna(0).astype(int)
print(df_new)
Output:
SoldItem
Name Joe Mary
Date
15-Jul 2 1
16-Jul 2 0
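Another option that avoids the extra SoldItem column level is pd.crosstab, which counts co-occurrences directly. A minimal sketch on a shortened, hand-typed version of the data from the question (not the real data.csv):

```python
import pandas as pd

# A shortened version of the sales data from the question.
df = pd.DataFrame({
    "Date": ["15-Jul", "15-Jul", "15-Jul", "15-Jul", "16-Jul", "16-Jul", "16-Jul"],
    "Name": ["Joe", "Joe", "Joe", "Mary", "Joe", "Joe", "Tim"],
    "SoldItem": ["TV", "Fridge", "Fridge", "Fridge", "Fridge", "Fridge", "Washing Machine"],
})

# Keep only fridge sales, then count rows per (Date, Name) combination.
fridges = df[df["SoldItem"] == "Fridge"]
counts = pd.crosstab(fridges["Date"], fridges["Name"])
print(counts)
```

Combinations that never occur come out as 0, so no fillna step is needed.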

Matching and Joining Two Inconsistent DataFrames

I have two dataframes that are being queried off two separate databases that share common characteristics, but not always the same characteristics, and I need to find a way to reliably join the two together.
As an example:
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)
Age Location Mothers Name Name Occupation
0 12 Frankfurt Rosy Jose Student
1 23 Maui Amy Katherine Lawyer
2 22 Dallas Monica Larry Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
  Favorite Hobby Mothers Name       Name Occupation
0     Basketball       Monica                 Nurse
1         Sewing         Rosy       Jose
2        Reading               Katherine     Lawyer
I need to figure out a way to reliably join these two dataframes even though the data is not always consistent. To complicate the problem further, the two databases are not always the same length. Any ideas?
You can perform the merge on each possible combination of key columns, concat the resulting dataframes, and then merge that new dataframe onto the first (complete) one:
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
df.merge(df2, on=['Name', 'Occupation']),
df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
Age Location Mothers Name Name Occupation Favorite Hobby
0 12 Frankfurt Rosy Jose Student Sewing
2 23 Maui Amy Katherine Lawyer Reading
4 22 Dallas Monica Larry Nurse Basketball
This assumes that each row's combination of Age and Location is unique.
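It is probably worth checking that uniqueness assumption before relying on the final merge. A minimal sketch using DataFrame.duplicated on the first dataframe from the question; the variable names `key` and `has_duplicate_keys` are my own.

```python
import pandas as pd

df = pd.DataFrame([
    {"Name": "Jose", "Age": 12, "Location": "Frankfurt", "Occupation": "Student", "Mothers Name": "Rosy"},
    {"Name": "Katherine", "Age": 23, "Location": "Maui", "Occupation": "Lawyer", "Mothers Name": "Amy"},
    {"Name": "Larry", "Age": 22, "Location": "Dallas", "Occupation": "Nurse", "Mothers Name": "Monica"},
])

# Columns that the final merge uses as its join key.
key = ["Age", "Location"]

# True if any (Age, Location) pair appears more than once.
has_duplicate_keys = bool(df.duplicated(subset=key).any())

if has_duplicate_keys:
    raise ValueError(f"{key} does not uniquely identify rows; the merge would mix up people")
print("key is unique:", not has_duplicate_keys)
```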

Concatenate a set of column values based on another column in Pandas

Given a Pandas dataframe which has a few labeled series in it, say Name and Villain.
Say the dataframe has values such:
Name: {'Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman'}
Villain: {'Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr Octopus'}
In total the above dataframe has 2 series(or columns) each with six datapoints.
Now, based on the Name, I want to concatenate 3 more columns: FirstName, LastName, LoveInterest to each datapoint.
The result of which adds 'Bruce; Wayne; Catwoman' to every row which has Name as Batman. And 'Peter; Parker; MaryJane' to every row which has Name as Spiderman.
The final result should be a dataframe containing 5 columns(series) and 6 rows each.
This is a classic inner-join scenario. In pandas, use the merge module-level function:
In [13]: df1
Out[13]:
Name Villain
0 Batman Joker
1 Batman Bane
2 Spiderman Green Goblin
3 Spiderman Electro
4 Spiderman Venom
5 Spiderman Dr. Octopus
In [14]: df2
Out[14]:
FirstName LastName LoveInterest Name
0 Bruce Wayne Catwoman Batman
1 Peter Parker MaryJane Spiderman
In [15]: pd.merge(df1, df2, on='Name')
Out[15]:
Name Villain FirstName LastName LoveInterest
0 Batman Joker Bruce Wayne Catwoman
1 Batman Bane Bruce Wayne Catwoman
2 Spiderman Green Goblin Peter Parker MaryJane
3 Spiderman Electro Peter Parker MaryJane
4 Spiderman Venom Peter Parker MaryJane
5 Spiderman Dr. Octopus Peter Parker MaryJane
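If some heroes might be missing from the lookup table, a left join keeps every villain row, and the validate argument of merge can guard against duplicate names in the lookup. A minimal sketch that rebuilds the two frames shown above; whether a left join is what you want depends on your data.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Batman", "Batman", "Spiderman", "Spiderman", "Spiderman", "Spiderman"],
    "Villain": ["Joker", "Bane", "Green Goblin", "Electro", "Venom", "Dr. Octopus"],
})
df2 = pd.DataFrame({
    "Name": ["Batman", "Spiderman"],
    "FirstName": ["Bruce", "Peter"],
    "LastName": ["Wayne", "Parker"],
    "LoveInterest": ["Catwoman", "MaryJane"],
})

# how="left" keeps all rows of df1 even when a Name has no entry in df2.
# validate="many_to_one" raises MergeError if a Name appears twice in df2.
merged = pd.merge(df1, df2, on="Name", how="left", validate="many_to_one")
print(merged.shape)
```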
