This question already has answers here: How can I pivot a dataframe? (5 answers). Closed 8 months ago.
I have survey data that was exported as a vertical (long-format) dataframe: every time a person answered the 3 survey questions, their row is repeated 3 times, identical except for the question text and the answer. I am trying to transpose/pivot this data so that each of the 3 questions becomes its own column holding that person's response, alongside their details like ID, Full Name, Location, etc.
Here is what it looks like currently:
ID     Full Name   Location  Question                                      Multiple Choice Question Answers
12345  John Smith  UK        1. It was easy to report my sickness.         Agree
12345  John Smith  UK        2. I felt ready to return from Quarantine.    Neutral
12345  John Smith  UK        3. I am satisfied with the adjustments made.  Disagree
...    ...         ...       ...                                           ...
67891  Jane Smith  UK        1. It was easy to report my sickness.         Agree
67891  Jane Smith  UK        2. I felt ready to return from Quarantine.    Agree
67891  Jane Smith  UK        3. I am satisfied with the adjustments made.  Agree
and this is how I want it:
ID     Full Name   Location  1. It was easy to report my sickness.  2. I felt ready to return from Quarantine.  3. I am satisfied with the adjustments made.
12345  John Smith  UK        Agree                                  Neutral                                     Disagree
67891  Jane Smith  UK        Agree                                  Agree                                       Agree
Currently I'm trying this code to get my desired output, but I can only get the IDs and Full Names to come out without duplicating; the other columns still show up as individual rows.
column_indices1 = [2, 3, 4]
df5 = df4.pivot_table(index=['ID', 'Full Name'],
                      columns=df4.iloc[:, column_indices1],
                      values='Multiple Choice Question Answer',
                      fill_value=0)
Concept
In this scenario, we should consider using:
pivot(): pivots without aggregation, so it can handle non-numeric data like these answer strings (pivot_table tries to aggregate, which is why it struggles here).
Practice
Prepare data
import pandas as pd

data = {'ID': [12345, 12345, 12345, 67891, 67891, 67891],
        'Full Name': ['John Smith', 'John Smith', 'John Smith', 'Jane Smith', 'Jane Smith', 'Jane Smith'],
        'Location': ['UK', 'UK', 'UK', 'UK', 'UK', 'UK'],
        'Question': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3'],
        'Answers': ['Agree', 'Neutral', 'Disagree', 'Agree', 'Agree', 'Agree']}
df = pd.DataFrame(data=data)
df
Output

      ID   Full Name Location Question   Answers
0  12345  John Smith       UK       Q1     Agree
1  12345  John Smith       UK       Q2   Neutral
2  12345  John Smith       UK       Q3  Disagree
3  67891  Jane Smith       UK       Q1     Agree
4  67891  Jane Smith       UK       Q2     Agree
5  67891  Jane Smith       UK       Q3     Agree
Use pivot()
questionnaire = df.pivot(index=['ID','Full Name','Location'], columns='Question', values='Answers')
questionnaire
Output

Question                     Q1       Q2        Q3
ID    Full Name  Location
12345 John Smith UK       Agree  Neutral  Disagree
67891 Jane Smith UK       Agree    Agree     Agree
Adding reset_index() and rename_axis() gets the format you want:
questionnaire = questionnaire.reset_index().rename_axis(None, axis=1)
questionnaire
Output

      ID   Full Name Location     Q1       Q2        Q3
0  12345  John Smith       UK  Agree  Neutral  Disagree
1  67891  Jane Smith       UK  Agree    Agree     Agree
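One caveat worth noting: pivot() raises a ValueError ("Index contains duplicate entries") if the same (index, columns) pair occurs more than once, e.g. when a respondent answered the same question twice. A sketch of a fallback (column names follow the example data above) using pivot_table with an aggfunc that works on strings:

```python
import pandas as pd

# Duplicate (ID, Question) pair on purpose: plain pivot() would raise here
df = pd.DataFrame({'ID': [12345, 12345, 12345, 67891, 67891],
                   'Question': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2'],
                   'Answers': ['Agree', 'Agree', 'Neutral', 'Agree', 'Agree']})

# aggfunc='first' keeps the first answer per (ID, Question) pair and,
# unlike the default aggfunc='mean', does not choke on string values
wide = (df.pivot_table(index='ID', columns='Question',
                       values='Answers', aggfunc='first')
          .reset_index()
          .rename_axis(None, axis=1))
```

Whether 'first' is the right choice depends on which duplicate answer you want to keep.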
I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
My dataframe:
Passengers
1 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
2 Sally Muller, President, Mark Smith, Vicepresident
3 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 Sally Muller, President, John Doe, Chief of Staff, Peter Parker, Special Effects, Lydia Johnson, Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
So far I have been doing it with an endless hand-made list of regex replacements:
df = df.replace(r'President,','', regex=True)
df = df.replace(r'Vicepresident,','', regex=True)
df = df.replace(r'Chief of Staff,','', regex=True)
df = df.replace(r'Special Effects,','', regex=True)
df = df.replace(r'Vice Chief of Staff,','', regex=True)
...
Is there a more convenient way to do this?
Edit
More accurate example of original df:
Passengers
1 Sally Muller, President, EL Mark Smith, John Doe, Chief of Staff, Peter Gordon, Director of Central Command
2 Sally Muller, President, EL Mark Smith, Vicepresident
3 Sally Muller, President, EL Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Gordon, Dir CC
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 President Sally Muller, John Doe Chief of Staff, Peter Parker, Special Effects, Lydia Johnson , Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe, Peter Gordon
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
So far I have been doing it with an endless hand-made list of regex replacements:
df = df.replace(r'President','', regex=True)
df = df.replace(r'Director of Central Command,','', regex=True)
df = df.replace(r'Dir CC','', regex=True)
df = df.replace(r'Vicepresident','', regex=True)
df = df.replace(r'Chief of Staff','', regex=True)
df = df.replace(r'Special Effects','', regex=True)
df = df.replace(r'Vice Chief of Staff','', regex=True)
...
The messy output looks like:
Passengers
1 Sally Muller, , Mark Smith, John Doe, , Peter Gordon,
2 Sally Muller, Mark Smith,
3 Sally Muller, Mark Smith,, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller,, John Doe, Peter Parker , Lydia Johnson,
...
If every passenger has a title, you can use str.split + explode, select every second item starting from the first, then group by the index and join back:
out = df['Passengers'].str.split(',').explode()[::2].groupby(level=0).agg(', '.join)
or use str.split and apply a lambda that does the selection + join:
out = df['Passengers'].str.split(',').apply(lambda x: ', '.join(x[::2]))
Output:
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
Edit:
If not everyone has a title, you can create a set of titles, then split and filter the titles out. If the order of the names doesn't matter within each row, you can use set difference and cast each set back to a list in a list comprehension:
titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}
out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])
If order matters, then you can use a nested list comprehension:
out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
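For a quick end-to-end check, here is the order-preserving version on a small made-up sample, joining the filtered names back into a comma-separated string (the titles set is assumed from the question):

```python
import pandas as pd

# Made-up sample rows in the same "Name, Title, Name, Title" shape
df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident',
    'Mark Smith, Vicepresident, John Doe, Chief of Staff',
]})
titles = {'President', 'Vicepresident', 'Chief of Staff'}

# Keep only the items that are not titles, preserving their order
out = pd.Series([', '.join(i for i in x.split(', ') if i not in titles)
                 for x in df['Passengers']])
print(out.tolist())  # ['Sally Muller, Mark Smith', 'Mark Smith, John Doe']
```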
This is one case where apply is actually faster than explode:
df2 = df['Passengers'].apply(lambda x: ', '.join(x.split(', ')[::2])) #.to_frame() # if dataframe needed
output:
Passengers
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia Jo...
We can create a full regex pattern that matches every string you need to remove, and replace it.
This also handles situations where a passenger has no title.
df2 = (df['Passengers']
       .str.replace("(President)|(Vicepresident)|(Chief of Staff)|(Special Effects)|(Vice Chief of Staff)", "", regex=True)
       .replace("( ,)", "", regex=True)
       .str.strip()
       .str.rstrip(","))
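If the list of titles keeps growing, the alternation pattern can be built programmatically instead of hand-written. A sketch on made-up sample data: re.escape guards against regex metacharacters in a title, and sorting longest-first ensures a shorter title that is a prefix of a longer one never wins the match:

```python
import re
import pandas as pd

df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident',
    'Mark Smith, Vicepresident, John Doe, Chief of Staff',
]})

titles = ['President', 'Vicepresident', 'Chief of Staff',
          'Special Effects', 'Vice Chief of Staff']
# Longest-first so e.g. 'Vice Chief of Staff' can never lose to a shorter prefix
pattern = '|'.join(re.escape(t) for t in sorted(titles, key=len, reverse=True))

out = (df['Passengers']
       .str.replace(pattern, '', regex=True)            # drop the titles
       .str.replace(r'(,\s*)+', ', ', regex=True)       # collapse leftover commas
       .str.strip(' ,'))                                # trim the edges
```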
I have a report that contains Invoice IDs and approvers. Invoices can have multiple approvers, which results in the IDs being duplicated (this is fine). What I want to do is check each group of Invoice IDs to see if either of 2 approvers is in the list of approvers associated with that ID. If so, I want to keep all of the rows for that ID. I think my question is similar to this one: Drop all rows in a group if none of the rows match a specific condition, but no one has answered that one yet. Below is an example of what I'm going for.
**Invoice Id** **Approver**
149877RV Jane Doe
149877RV Joe Manchin
149877RV Michael Frank
149877RV Kevin Holder
149877RV Michael Frank
149877RV Michael Frank
149877RV James Doe
Michael Frank and Kevin Holder are the names I'm looking for. Since they are both present here (in my scenario it can be either one of them) I want to keep all of these rows.
150210 Jim Halpert
150210 Mike Smith
150210 FP&A
150210 Michael Scott
Since neither Michael Frank nor Kevin Holder are on this list, I want to remove all of these rows.
I haven't been able to find a solution that allows me to keep all rows as I'm describing.
This should work:
import pandas as pd

df = pd.DataFrame({'invoice_ID': ['149877RV', '149877RV', '149877RV', '149877RV', '149877RV', '149877RV', '149877RV'],
                   'Approver': ['Jane Doe', 'Joe Manchin', 'Michael Frank', 'Kevin Holder', 'Michael Frank', 'Michael Frank', 'James Doe']})

# Rows where the approver is one of the two target names
mask = df['Approver'].isin(['Michael Frank', 'Kevin Holder'])
df1 = df.loc[mask]
# One row per matching invoice ID
df1 = df1.drop_duplicates(subset=['invoice_ID'])
# Keep every row whose invoice ID had at least one match
mask2 = df['invoice_ID'].isin(df1['invoice_ID'])
final_list = df.loc[mask2]
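A more compact alternative (a sketch on made-up data, not the only way) is groupby + transform, which broadcasts a per-invoice "any approver matched" flag back onto every row of that invoice:

```python
import pandas as pd

df = pd.DataFrame({
    'invoice_ID': ['149877RV'] * 3 + ['150210'] * 2,
    'Approver': ['Jane Doe', 'Michael Frank', 'Kevin Holder',
                 'Jim Halpert', 'Mike Smith'],
})
targets = ['Michael Frank', 'Kevin Holder']

# For each invoice, check whether ANY of its approvers is a target name,
# then broadcast that group-level flag back to every row with transform
keep = df.groupby('invoice_ID')['Approver'].transform(
    lambda s: s.isin(targets).any())
result = df[keep]
```

All rows of 149877RV survive (it has a matching approver); all rows of 150210 are dropped.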
I think you should check your condition for each available invoice ID, then return the matching rows. I merged your data to make it a little clearer:
import pandas as pd

data = {"invoice_id": ["149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "150210", "150210", "150210", "150210"],
        "approver": ["Jane Doe", "Joe Manchin", "Michael Frank", "Kevin Holder", "Michael Frank", "Michael Frank", "Jane Doe", "Jim Halpert", "Mike Smith", "FP&A", "Michael Scott"]}
df = pd.DataFrame.from_dict(data)

approver_1, approver_2 = "Michael Frank", "Kevin Holder"
results = []
for id in set(df["invoice_id"]):
    approvers_per_id = set(df.loc[df.loc[:, "invoice_id"] == id, "approver"])
    # "either one of them" -> `or`; switch to `and` if both must be present
    if approver_1 in approvers_per_id or approver_2 in approvers_per_id:
        results.append(df.loc[df.loc[:, "invoice_id"] == id])
print(results)
This returns
[ invoice_id approver
0 149877RV Jane Doe
1 149877RV Joe Manchin
2 149877RV Michael Frank
3 149877RV Kevin Holder
4 149877RV Michael Frank
5 149877RV Michael Frank
6 149877RV Jane Doe]
I have a fairly large dataset that I would like to split into separate Excel files based on the names in column A (the "Agent" column in the example provided below). I've provided a rough example of what this dataset looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1
Agent Business Name Business ID Revenue
John Doe Bobs Ice Cream 12234 $400
John Doe Car Repair 445848 $2331
John Doe Corner Store 243123 $213
John Doe Cool Taco Stand 2141244 $8912
Jane Doe Fresh Ice Cream 9271499 $2143
Jane Doe Breezy Air 0123801 $3412
Steve Smith Big Golf Range 12938192 $9912
Steve Smith Iron Gyms 1231233 $4133
Steve Smith Tims Tires 82489233 $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.
I would loop over the groups of names, then save each group to its own excel file:
groups = df.groupby('Agent')
for name, group in groups:
    group.to_excel(f"{name}.xlsx")  # .xlsx: the old .xls writer (xlwt) is no longer supported
Use a list comprehension with groupby on the Agent column:
dfs = [d for _, d in df.groupby('Agent')]
for df in dfs:
    print(df, '\n')
Output
Agent Business Name Business ID Revenue
4 Jane Doe Fresh Ice Cream 9271499 $2143
5 Jane Doe Breezy Air 123801 $3412
Agent Business Name Business ID Revenue
0 John Doe Bobs Ice Cream 12234 $400
1 John Doe Car Repair 445848 $2331
2 John Doe Corner Store 243123 $213
3 John Doe Cool Taco Stand 2141244 $8912
Agent Business Name Business ID Revenue
6 Steve Smith Big Golf Range 12938192 $9912
7 Steve Smith Iron Gyms 1231233 $4133
8 Steve Smith Tims Tires 82489233 $781
Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:
import pandas as pd

# make up some data
ex1 = pd.DataFrame([['A', 1], ['A', 2], ['B', 3], ['B', 4]], columns=['letter', 'number'])

# iterate over the grouped data and export the data frames to excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe, to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)
Use the unique values in the column to subset the data and write it to csv using the name:
import pandas as pd
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
if you need excel:
import pandas as pd
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")
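If you want to inspect the splits before writing any files, one pattern (a sketch, using the example's agent names) is to collect the groups into a dict keyed by agent name:

```python
import pandas as pd

df = pd.DataFrame({
    'Agent': ['John Doe', 'John Doe', 'Jane Doe'],
    'Business Name': ['Bobs Ice Cream', 'Car Repair', 'Fresh Ice Cream'],
    'Revenue': ['$400', '$2331', '$2143'],
})

# One DataFrame per agent, keyed by agent name
frames = {agent: g.reset_index(drop=True) for agent, g in df.groupby('Agent')}

# Each value can then be written out in a second step, e.g.:
# for agent, g in frames.items():
#     g.to_excel(f"{agent}.xlsx", index=False)  # needs openpyxl installed
```

Note that agent names go straight into file names here; if they can contain characters like `/`, sanitize them first.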