pandas sort/average numerical/text data in single column - python

I have data that is in 3 columns (name, question, response) that resulted from judging a student research symposium. There are 2 possible choices for level (graduate/undergraduate), and 5 for college (science, education, etc.)
What I need to do is take the average of the numerical responses for a given name, and sort by average numerical score for that person to output a table containing:
College Level Rank
science graduate 1st
science graduate 2nd
science graduate 3rd
science undergrad 1st
...
education graduate 1st
...
education undergrad 3rd
Here's a sample data table:
name question response
Smith, John Q1 10
Smith, John Q2 7
Smith, John Q3 10
Smith, John Q4 8
Smith, John Q5 10
Smith, John Q8 8
Smith, John level graduate
Smith, John colleg science
Smith, John Q9 4
Jones, Mary Q3 10
Jones, Mary Q2 10
Jones, Mary Q1 10
Jones, Mary Q4 10
Jones, Mary level undergraduate
Jones, Mary college education
Jones, Mary Q6 10
Jones, Mary Q7 10
Jones, Mary Q9 10
A talented student did this for us in excel using pivot tables, but I'm sure this can be done using pandas, and I'm very curious how to do it (I'm quite new to pandas). The tricky part is all the 'useful' information is in that 3rd column.

Convert the response column to numeric (the strings become NaN), then groupby and take the mean:
import pandas as pd
import numpy as np
from io import StringIO

data = '''name;question;response
Smith, John;Q1;10
Smith, John;Q2;7
Smith, John;Q3;10
Smith, John;Q4;8
Smith, John;Q5;10
Smith, John;Q8;8
Smith, John;level;graduate
Smith, John;colleg;science
Smith, John;Q9;4
Jones, Mary;Q3;10
Jones, Mary;Q2;10
Jones, Mary;Q1;10
Jones, Mary;Q4;10
Jones, Mary;level;undergraduate
Jones, Mary;college;education
Jones, Mary;Q6;10
Jones, Mary;Q7;10
Jones, Mary;Q9;10'''
df = pd.read_csv(StringIO(data), delimiter=';')
df['response'] = pd.to_numeric(df['response'], errors='coerce')
# select only the (now numeric) response column before averaging;
# aggregating the whole frame with a mean would fail on the string columns
df.groupby('name', as_index=False)['response'].mean().sort_values(by='response')
output
name response
1 Smith, John 8.142857
0 Jones, Mary 10.000000

You can start by pivoting the dataframe to obtain the college/level information and the question scores in separate dataframes:
pivoted = df.pivot_table(index='name',
                         columns='question',
                         values='response',
                         fill_value=np.nan,  # changing this value to 0 would count
                                             # unanswered questions as a 0 score
                         aggfunc=lambda x: x)
categories = pivoted[['college', 'level']].copy()  # .copy() avoids SettingWithCopyWarning below
questions = pivoted.drop(['college', 'level'], axis=1)
Then assign each student's average question score in the categories dataframe:
categories['points'] = questions.astype(float).mean(axis=1, skipna=True)
The skipna=True combined with fill_value=np.nan means unanswered questions are excluded from the average, so a student whose only answer is a 10 will have a mean of 10. As noted in the comment, fill_value=0 changes this behaviour.
Finally, the values can be sorted using sort_values to get a ranking within each category:
categories.sort_values(['college', 'level', 'points'], ascending=False)
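To turn the sorted scores into the 1st/2nd/3rd labels the question asks for, one possible sketch uses groupby().cumcount() on a per-student summary frame. The frame below is made up for illustration (the names besides those in the question, and the 'points' values, are hypothetical) but has the same shape as the categories frame built above:

```python
import pandas as pd

# Hypothetical per-student summary, same shape as the `categories` frame above
categories = pd.DataFrame({
    'college': ['science', 'science', 'education'],
    'level':   ['graduate', 'graduate', 'undergraduate'],
    'points':  [8.14, 9.5, 10.0],
}, index=['Smith, John', 'Lee, Ann', 'Jones, Mary'])

# Sort so the best score comes first within each (college, level) group
ranked = categories.sort_values(['college', 'level', 'points'],
                                ascending=[True, True, False])
# cumcount() numbers rows 0, 1, 2, ... in their current order within each group
ranked['rank'] = ranked.groupby(['college', 'level']).cumcount() + 1
print(ranked)
```

The numeric rank can then be mapped to '1st'/'2nd'/'3rd' strings however you prefer.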


Pivot - Transpose Vertical Data with repeated rows into Horizontal Data with one row per ID [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 8 months ago.
I have survey data that was exported as a vertical dataframe: every time a person responded to 3 questions in the survey, their row was duplicated 3 times, with only the question text and the answer changing. I am trying to transpose/pivot this data so that each of the 3 questions gets its own column holding the responses, alongside details like ID, Full Name, Location, etc.
Here is what it looks like currently:
ID Full Name Location Question Multiple Choice Question Answers
12345 John Smith UK 1. It was easy to report my sickness. Agree
12345 John Smith UK 2. I felt ready to return from Quarantine. Neutral
12345 John Smith UK 3. I am satisfied with the adjustments made. Disagree
.. ... ... ... ...
67891 Jane Smith UK 1. It was easy to report my sickness. Agree
67891 Jane Smith UK 2. I felt ready to return from Quarantine. Agree
67891 Jane Smith UK 3. I am satisfied with the adjustments made. Agree
and this is how I want it:
ID Full Name Location 1. It was easy to report my sickness. 2. I was satisfied with the support I received. 3. I felt ready to return from Quarantine.
12345 John Smith UK Agree Neutral Disagree
67891 Jane Smith UK Agree Agree Disagree
Currently I'm trying the code below to get my desired output, but I can only get the IDs and Full Names to come out without duplicating; the other columns still show up as individual rows.
column_indices1 = [2, 3, 4]
df5 = df4.pivot_table(index=['ID', 'Full Name'],
                      columns=df4.iloc[:, column_indices1],
                      values='Multiple Choice Question Answer',
                      fill_value=0)
Concept
In this scenario, we should consider using:
pivot(): Pivot without aggregation that can handle non-numeric data.
Practice
Prepare data
data = {'ID': [12345, 12345, 12345, 67891, 67891, 67891],
        'Full Name': ['John Smith', 'John Smith', 'John Smith',
                      'Jane Smith', 'Jane Smith', 'Jane Smith'],
        'Location': ['UK', 'UK', 'UK', 'UK', 'UK', 'UK'],
        'Question': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3'],
        'Answers': ['Agree', 'Neutral', 'Disagree', 'Agree', 'Agree', 'Agree']}
df = pd.DataFrame(data=data)
df
df
Output
      ID   Full Name Location Question   Answers
0  12345  John Smith       UK       Q1     Agree
1  12345  John Smith       UK       Q2   Neutral
2  12345  John Smith       UK       Q3  Disagree
3  67891  Jane Smith       UK       Q1     Agree
4  67891  Jane Smith       UK       Q2     Agree
5  67891  Jane Smith       UK       Q3     Agree
Use pivot()
questionnaire = df.pivot(index=['ID','Full Name','Location'], columns='Question', values='Answers')
questionnaire
Output
Question                     Q1       Q2        Q3
ID    Full Name  Location
12345 John Smith UK       Agree  Neutral  Disagree
67891 Jane Smith UK       Agree    Agree     Agree
Adding reset_index() and rename_axis() gives the format you want:
questionnaire = questionnaire.reset_index().rename_axis(None, axis=1)
questionnaire
Output
      ID   Full Name Location     Q1       Q2        Q3
0  12345  John Smith       UK  Agree  Neutral  Disagree
1  67891  Jane Smith       UK  Agree    Agree     Agree

Remove unwanted parts from strings in Dataframe

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
My dataframe:
Passengers
1 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
2 Sally Muller, President, Mark Smith, Vicepresident
3 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 Sally Muller, President, John Doe, Chief of Staff, Peter Parker, Special Effects, Lydia Johnson, Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with endless handmade copy/paste regex list:
df = df.replace(r'President,','', regex=True)
df = df.replace(r'Vicepresident,','', regex=True)
df = df.replace(r'Chief of Staff,','', regex=True)
df = df.replace(r'Special Effects,','', regex=True)
df = df.replace(r'Vice Chief of Staff,','', regex=True)
...
Is there a more comfortable way to do this?
Edit
More accurate example of original df:
Passengers
1 Sally Muller, President, EL Mark Smith, John Doe, Chief of Staff, Peter Gordon, Director of Central Command
2 Sally Muller, President, EL Mark Smith, Vicepresident
3 Sally Muller, President, EL Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Gordon, Dir CC
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 President Sally Muller, John Doe Chief of Staff, Peter Parker, Special Effects, Lydia Johnson , Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe, Peter Gordon
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with endless handmade copy/paste regex list:
df = df.replace(r'President','', regex=True)
df = df.replace(r'Director of Central Command,','', regex=True)
df = df.replace(r'Dir CC','', regex=True)
df = df.replace(r'Vicepresident','', regex=True)
df = df.replace(r'Chief of Staff','', regex=True)
df = df.replace(r'Special Effects','', regex=True)
df = df.replace(r'Vice Chief of Staff','', regex=True)
...
messy output is like:
Passengers
1 Sally Muller, , Mark Smith, John Doe, , Peter Gordon,
2 Sally Muller, Mark Smith,
3 Sally Muller, Mark Smith,, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller,, John Doe, Peter Parker , Lydia Johnson,
...
If every passenger has their title, then you can use str.split + explode, then select every second item starting from the first item, then groupby the index and join back:
out = df['Passengers'].str.split(', ').explode()[::2].groupby(level=0).agg(', '.join)
or use str.split and apply a lambda that does the selection + join:
out = df['Passengers'].str.split(', ').apply(lambda x: ', '.join(x[::2]))
Output:
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
Edit:
If not everyone has a title, then you can create a set of titles and split and filter out the titles. If the order of the names don't matter in each row, then you can use set difference and cast each set to a list in a list comprehension:
titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}
out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])
If order matters, then you can use a nested list comprehension:
out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
This is one case where apply is actually faster than explode:
df2 = df['Passengers'].apply(lambda x: ', '.join(x.split(', ')[::2])) #.to_frame() # if dataframe needed
output:
Passengers
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia Jo...
We can build a single regex pattern that matches every string you need to remove and replace it in one pass. This also handles passengers who have no title.
df2 = (df['Passengers']
       .str.replace("(President)|(Vicepresident)|(Chief of Staff)|(Special Effects)|(Vice Chief of Staff)",
                    "", regex=True)
       .replace("( ,)", "", regex=True)
       .str.strip()
       .str.rstrip(","))
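As a variation (not from the answers above), the pattern can be built from a plain list of titles with re.escape, anchoring each title to the surrounding commas so no stray separators are left behind. A sketch with two made-up sample rows:

```python
import re
import pandas as pd

df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff',
    'Sally Muller, President, Mark Smith, Vicepresident',
]})

# Hypothetical title list; longest-first so 'Vice Chief of Staff' wins
# over the shorter 'Chief of Staff' in the alternation
titles = ['Vice Chief of Staff', 'Chief of Staff', 'Vicepresident',
          'President', 'Special Effects']

# Each title must be preceded by ', ' and followed by a comma or end of line,
# so the separator is removed together with the title
pattern = ', (?:' + '|'.join(map(re.escape, titles)) + ')(?=,|$)'

out = df['Passengers'].str.replace(pattern, '', regex=True)
print(out.tolist())
```

Adding a new title then only means appending to the list, not editing a regex by hand.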

Python - Keep all rows of a group if a name is contained in any row

I have a report that contains Invoice IDs and approvers. Invoices can have multiple approvers, which results in the IDs being duplicated (this is fine). What I want to do, is check each group of Invoice IDs to see if either of 2 approvers are in the list of approvers associated with that ID. If they are, then I want to keep all of the rows of that ID. I think my question is similar to this one: Drop all rows in a group if none of the rows match a specific condition however no one has answered that one yet. Below is an example of what I'm going for.
**Invoice Id** **Approver**
149877RV Jane Doe
149877RV Joe Manchin
149877RV Michael Frank
149877RV Kevin Holder
149877RV Michael Frank
149877RV Michael Frank
149877RV James Doe
Michael Frank and Kevin Holder are the names I'm looking for. Since they are both present here (in my scenario it can be either one of them) I want to keep all of these rows.
150210 Jim Halpert
150210 Mike Smith
150210 FP&A
150210 Michael Scott
Since neither Michael Frank nor Kevin Holder are on this list, I want to remove all of these rows.
I haven't been able to find a solution that allows me to keep all rows as I'm describing.
This should work:
import pandas as pd

df = pd.DataFrame({'invoice_ID': ['149877RV', '149877RV', '149877RV', '149877RV',
                                  '149877RV', '149877RV', '149877RV'],
                   'Approver': ['Jane Doe', 'Joe Manchin', 'Michael Frank',
                                'Kevin Holder', 'Michael Frank', 'Michael Frank',
                                'James Doe']})
mask = df['Approver'].isin(['Michael Frank', 'Kevin Holder'])
df1 = df.loc[mask]
df1 = df1.drop_duplicates(subset=['invoice_ID'])
mask2 = df['invoice_ID'].isin(df1['invoice_ID'])
final_list = df.loc[mask2]
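A more compact alternative, not shown in the answers here, is groupby().filter(), which keeps or drops whole groups based on a condition. A sketch with a reduced version of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'invoice_ID': ['149877RV'] * 4 + ['150210'] * 3,
    'Approver': ['Jane Doe', 'Michael Frank', 'Kevin Holder', 'James Doe',
                 'Jim Halpert', 'Mike Smith', 'Michael Scott'],
})

targets = ['Michael Frank', 'Kevin Holder']
# filter() keeps every row of a group when the lambda returns True for that group;
# .any() makes one matching approver enough to keep the whole invoice
kept = df.groupby('invoice_ID').filter(lambda g: g['Approver'].isin(targets).any())
print(kept)
```

This matches the "either one of them" requirement directly, since .any() is True as soon as one target approver appears in the group.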
I think you should check your condition for each available invoice id, then collect the matching rows. I merged your data to make it a little clearer:
data = {"invoice_id": ["149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "150210", "150210", "150210", "150210"],
        "approver": ["Jane Doe", "Joe Manchin", "Michael Frank", "Kevin Holder", "Michael Frank", "Michael Frank", "Jane Doe", "Jim Halpert", "Mike Smith", "FP&A", "Michael Scott"]}
df = pd.DataFrame.from_dict(data)

approver_1, approver_2 = "Michael Frank", "Kevin Holder"
results = []
for id in set(df["invoice_id"]):
    approvers_per_id = set(df.loc[df.loc[:, "invoice_id"] == id, "approver"])
    if approver_1 in approvers_per_id and approver_2 in approvers_per_id:
        results.append(df.loc[df.loc[:, "invoice_id"] == id])
print(results)
This returns
[ invoice_id approver
0 149877RV Jane Doe
1 149877RV Joe Manchin
2 149877RV Michael Frank
3 149877RV Kevin Holder
4 149877RV Michael Frank
5 149877RV Michael Frank
6 149877RV Jane Doe]
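If you want a single DataFrame rather than a list of per-invoice frames, pd.concat can stitch the collected pieces back together. A sketch with two small dummy frames standing in for the results list:

```python
import pandas as pd

# Dummy stand-ins for the frames collected in `results` above
results = [
    pd.DataFrame({'invoice_id': ['149877RV', '149877RV'],
                  'approver': ['Michael Frank', 'Kevin Holder']}),
    pd.DataFrame({'invoice_id': ['150210'],
                  'approver': ['Jim Halpert']}),
]

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each
# frame's original index
combined = pd.concat(results, ignore_index=True)
print(combined)
```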

Break up a data-set into separate excel files based on a certain row value in a given column in Pandas?

I have a fairly large dataset that I would like to split into separate excel files based on the names in column A ("Agent" column in the example provided below). I've provided a rough example of what this data-set looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1
Agent Business Name Business ID Revenue
John Doe Bobs Ice Cream 12234 $400
John Doe Car Repair 445848 $2331
John Doe Corner Store 243123 $213
John Doe Cool Taco Stand 2141244 $8912
Jane Doe Fresh Ice Cream 9271499 $2143
Jane Doe Breezy Air 0123801 $3412
Steve Smith Big Golf Range 12938192 $9912
Steve Smith Iron Gyms 1231233 $4133
Steve Smith Tims Tires 82489233 $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.
I would loop over the groups of names, then save each group to its own excel file:
s = df.groupby('Agent')
for name, group in s:
    group.to_excel(f"{name}.xlsx")  # recent pandas can no longer write legacy .xls files
Use a list comprehension with groupby on the Agent column:
dfs = [d for _, d in df.groupby('Agent')]
for df in dfs:
    print(df, '\n')
Output
Agent Business Name Business ID Revenue
4 Jane Doe Fresh Ice Cream 9271499 $2143
5 Jane Doe Breezy Air 123801 $3412
Agent Business Name Business ID Revenue
0 John Doe Bobs Ice Cream 12234 $400
1 John Doe Car Repair 445848 $2331
2 John Doe Corner Store 243123 $213
3 John Doe Cool Taco Stand 2141244 $8912
Agent Business Name Business ID Revenue
6 Steve Smith Big Golf Range 12938192 $9912
7 Steve Smith Iron Gyms 1231233 $4133
8 Steve Smith Tims Tires 82489233 $781
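If you would rather inspect the pieces before (or instead of) writing files, the same groupby can feed a dict keyed by agent name. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Agent': ['John Doe', 'John Doe', 'Jane Doe'],
                   'Business Name': ['Bobs Ice Cream', 'Car Repair', 'Fresh Ice Cream'],
                   'Revenue': ['$400', '$2331', '$2143']})

# One sub-DataFrame per agent, keyed by the agent's name
frames = {name: group for name, group in df.groupby('Agent')}

print(frames['John Doe'])
```

Each value in the dict is a regular DataFrame, so frames[name].to_excel(...) still works when you do want the files.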
Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:
import pandas as pd

# make up some data
ex1 = pd.DataFrame([['A', 1], ['A', 2], ['B', 3], ['B', 4]],
                   columns=['letter', 'number'])

# iterate over the grouped data and export the data frames to excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe,
    # to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)
Use the unique values in the column to subset the data and write it to csv using the name:
import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
if you need excel:
import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")

randomly select n rows from each block - pandas DataFrame [duplicate]

This question already has answers here:
Python: Random selection per group
(11 answers)
Closed 4 years ago.
Let's say I have a pandas DataFrame named df that looks like this
father_name child_name
Robert Julian
Robert Emily
Robert Dan
Carl Jack
Carl Rose
John Lucy
John Mark
John Alysha
Paul Christopher
Paul Thomas
Robert Kevin
Carl Elisabeth
where I know for sure that each father has at least 2 children.
I would like to obtain a DataFrame where each father has exactly 2 of his children, and those two children are selected at random. An example output would be
father_name child_name
Robert Emily
Robert Kevin
Carl Jack
Carl Elisabeth
John Alysha
John Mark
Paul Thomas
Paul Christopher
How can I do that?
You can apply DataFrame.sample on the grouped data. It takes the parameter n which you can set to 2
(df.groupby('father_name').child_name
   .apply(lambda x: x.sample(n=2))
   .reset_index(level=1, drop=True)   # drop the per-group sample index
   .reset_index())
father_name child_name
0 Carl Elisabeth
1 Carl Jack
2 John Mark
3 John Lucy
4 Paul Thomas
5 Paul Christopher
6 Robert Emily
7 Robert Julian
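On pandas 1.1 or newer there is also GroupBy.sample, which draws n rows per group in a single call without the apply/reset_index dance. A sketch with a trimmed version of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'father_name': ['Robert', 'Robert', 'Robert', 'Carl', 'Carl'],
    'child_name':  ['Julian', 'Emily', 'Dan', 'Jack', 'Rose'],
})

# Draw exactly 2 random rows from each father's group;
# random_state is only set here to make the sketch reproducible
two_each = df.groupby('father_name').sample(n=2, random_state=0)
print(two_each)
```

Unlike the apply-based version, the original row index is preserved, so a plain reset_index(drop=True) is enough if you want a clean 0..n index.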
