How to pivot columns correctly [duplicate] - python

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a dataframe like this:
NUMBER NAME
1 000231 John Stockton
2 009456 Karl Malone
3 100000901 John Stockton
4 100008496 Karl Malone
I want to obtain a new dataframe with:
NAME VALUE1 VALUE2
1 John Stockton 000231 100000901
2 Karl Malone 009456 100008496
I think I should use df.groupby(), but I have no function to pass as an aggregator (I don't need to compute any mean(), min(), or max() value). If I just use groupby() without an aggregator, I get:
In[1]: pd.DataFrame(df.groupby(['NAME']))
Out[1]:
0 1
0 John Stockton NAME NUMBER 000231 100000901
1 Karl Malone NAME NUMBER 009456 100008496
What am I doing wrong? Do I need to pivot the dataframe?

Actually, you need a slightly more complicated pipeline:
(df.assign(group=df.groupby('NAME').cumcount().add(1))
   .pivot(index='NAME', columns='group', values='NUMBER')
   .rename_axis(None, axis=1)
   .add_prefix('VALUE')
   .reset_index()
)
output:
NAME VALUE1 VALUE2
0 John Stockton 231 100000901
1 Karl Malone 9456 100008496
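Note that the output shows 231 rather than 000231: the NUMBER column was parsed as integers, which drops the leading zeros. If the zeros matter, keep NUMBER as strings. A minimal runnable sketch of the same pipeline, with the sample data assumed from the question and NUMBER stored as strings:

```python
import pandas as pd

# Assumed sample data; NUMBER is kept as a string so leading zeros survive.
df = pd.DataFrame({
    'NUMBER': ['000231', '009456', '100000901', '100008496'],
    'NAME': ['John Stockton', 'Karl Malone', 'John Stockton', 'Karl Malone'],
})

# Number each occurrence of a NAME (1, 2, ...), then pivot those
# occurrence numbers into VALUE1/VALUE2 columns.
out = (df.assign(group=df.groupby('NAME').cumcount().add(1))
         .pivot(index='NAME', columns='group', values='NUMBER')
         .rename_axis(None, axis=1)
         .add_prefix('VALUE')
         .reset_index())
print(out)
```

When reading from a file, `pd.read_csv(..., dtype={'NUMBER': str})` achieves the same thing.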

Merge rows based on same column value (float type) [duplicate]

This question already has answers here:
pandas join rows/groupby with categorical data and lots of nan values
(3 answers)
Closed 6 months ago.
I have a dataset that looks like the following:
id name phone diagnosis
0 1 archie 12345 healthy
1 2 betty 23456 dead
2 3 clara 34567 NaN
3 3 clara 34567 kidney
4 4 diana 45678 cancer
I want to merge duplicated rows and have a table that looks like this:
id name phone diagnosis
0 1 archie 12345 healthy
1 2 betty 23456 dead
2 3 clara 34567 NaN, kidney
3 4 diana 45678 cancer
In short, I want the entries in the diagnosis column put together so I can have an overview. I have tried running the following, but it throws an error stating that a string was expected but a float was found.
data = data.groupby(['id','name','phone'])['diagnosis'].apply(', '.join).reset_index()
Anyone have any ideas how I can merge the rows?
That is because of the NaN values: you can't concatenate strings with NaN, which is a float. One alternative is to fill the NaNs with the string 'NaN' first:
data.fillna('NaN', inplace=True)
data.groupby(['id', 'name', 'phone']).diagnosis.apply(', '.join).reset_index()
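If you'd rather not see the literal string "NaN" in the result, another option is to drop the missing values inside each group before joining. A sketch, with the sample data assumed from the question:

```python
import pandas as pd

# Assumed sample data mirroring the question.
data = pd.DataFrame({
    'id': [1, 2, 3, 3, 4],
    'name': ['archie', 'betty', 'clara', 'clara', 'diana'],
    'phone': [12345, 23456, 34567, 34567, 45678],
    'diagnosis': ['healthy', 'dead', None, 'kidney', 'cancer'],
})

# dropna() removes the NaN in each group before joining,
# so clara's diagnosis becomes just "kidney".
merged = (data.groupby(['id', 'name', 'phone'])['diagnosis']
              .apply(lambda s: ', '.join(s.dropna()))
              .reset_index())
print(merged)
```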

Pivot - Transpose Vertical Data with repeated rows into Horizontal Data with one row per ID [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 8 months ago.
I have survey data that was exported as a vertical dataframe: each time a person answered the 3 survey questions, their row is repeated 3 times, identical except for the question text and their answer. I am trying to transpose/pivot this data so that each question gets its own column holding that person's response, alongside their details like ID, Full Name, Location, etc...
Here is what it looks like currently:
ID Full Name Location Question Multiple Choice Question Answers
12345 John Smith UK 1. It was easy to report my sickness. Agree
12345 John Smith UK 2. I felt ready to return from Quarantine. Neutral
12345 John Smith UK 3. I am satisfied with the adjustments made. Disagree
.. ... ... ... ...
67891 Jane Smith UK 1. It was easy to report my sickness. Agree
67891 Jane Smith UK 2. I felt ready to return from Quarantine. Agree
67891 Jane Smith UK 3. I am satisfied with the adjustments made. Agree
and this is how I want it:
ID Full Name Location 1. It was easy to report my sickness. 2. I was satisfied with the support I received. 3. I felt ready to return from Quarantine.
12345 John Smith UK Agree Neutral Disagree
67891 Jane Smith UK Agree Agree Disagree
Currently I'm trying to use this code to get my desired output, but I can only get the IDs and Full Names to come out without duplicates; the other columns still show up as individual rows.
column_indices1 = [2,3,4]
df5 = df4.pivot_table(index = ['ID', 'Full Name'], columns = df4.iloc[:, column_indices1], \
values = 'Multiple Choice Question Answer', \
fill_value = 0)
Concept
In this scenario, we should consider using:
pivot(): pivots without aggregation and can handle non-numeric data.
Practice
Prepare data
data = {'ID':[12345,12345,12345,67891,67891,67891],
'Full Name':['John Smith','John Smith','John Smith','Jane Smith','Jane Smith','Jane Smith'],
'Location':['UK','UK','UK','UK','UK','UK'],
'Question':['Q1','Q2','Q3','Q1','Q2','Q3'],
'Answers':['Agree','Neutral','Disagree','Agree','Agree','Agree']}
df = pd.DataFrame(data=data)
df
Output
      ID   Full Name Location Question   Answers
0  12345  John Smith       UK       Q1     Agree
1  12345  John Smith       UK       Q2   Neutral
2  12345  John Smith       UK       Q3  Disagree
3  67891  Jane Smith       UK       Q1     Agree
4  67891  Jane Smith       UK       Q2     Agree
5  67891  Jane Smith       UK       Q3     Agree
Use pivot()
questionnaire = df.pivot(index=['ID','Full Name','Location'], columns='Question', values='Answers')
questionnaire
Output
Question                      Q1       Q2        Q3
ID    Full Name  Location
12345 John Smith UK        Agree  Neutral  Disagree
67891 Jane Smith UK        Agree    Agree     Agree
Adding reset_index() and rename_axis() gives the format you want:
questionnaire = questionnaire.reset_index().rename_axis(None, axis=1)
questionnaire
Output
      ID   Full Name Location     Q1       Q2        Q3
0  12345  John Smith       UK  Agree  Neutral  Disagree
1  67891  Jane Smith       UK  Agree    Agree     Agree
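One caveat: pivot() raises an error if the export ever contains duplicate (ID, Question) rows. In that case, pivot_table() with a string-safe aggregator such as 'first' is a workable alternative. A sketch, with the same assumed sample data as above:

```python
import pandas as pd

# Assumed sample data, matching the prepared dataframe above.
df = pd.DataFrame({
    'ID': [12345, 12345, 67891, 67891],
    'Full Name': ['John Smith', 'John Smith', 'Jane Smith', 'Jane Smith'],
    'Location': ['UK', 'UK', 'UK', 'UK'],
    'Question': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Answers': ['Agree', 'Neutral', 'Agree', 'Agree'],
})

# aggfunc='first' keeps one (string) answer per cell, so duplicate
# (index, column) pairs don't raise the way plain pivot() would.
q = (df.pivot_table(index=['ID', 'Full Name', 'Location'],
                    columns='Question', values='Answers', aggfunc='first')
       .reset_index()
       .rename_axis(None, axis=1))
print(q)
```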

Remove column without headers and data [duplicate]

This question already has answers here:
Remove Unnamed columns in pandas dataframe [duplicate]
(4 answers)
Closed 4 years ago.
I have a CSV file, and when I load it into Python as a dataframe, a new Unnamed: 1 column appears. How can I remove or filter it out?
I need only the Title and Date columns in my dataframe, not column B of the CSV. The dataframe looks like:
Title Unnamed: 1 Date
0 Đồng Nai Province makes it easier for people w... NaN 18/07/2018
1 Ex-NBA forward Washington gets six-year prison... NaN 10/07/2018
2 Helicobacter pylori NaN 10/07/2018
3 Paedophile gets prison term for sexual assault NaN 03/07/2018
4 Immunodeficiency burdens families NaN 28/06/2018
Drop that column from your dataframe:
df.drop(columns=["Unnamed: 1"], inplace=True)
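If the stray column's exact name can vary (e.g. "Unnamed: 2" in another file), a more general sketch filters out every column whose name starts with "Unnamed". The inline CSV here is an assumed stand-in for the file in the question:

```python
import pandas as pd
from io import StringIO

# Assumed CSV with an empty column B, like the file in the question.
csv = "Title,,Date\nHelicobacter pylori,,10/07/2018\n"
df = pd.read_csv(StringIO(csv))  # the empty header comes in as "Unnamed: 1"

# Keep only columns whose name does not start with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]
print(df.columns.tolist())
```

Alternatively, `pd.read_csv(path, usecols=['Title', 'Date'])` avoids loading the column in the first place.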

Pandas dataframe: groupby one column, but concatenate and aggregate by others [duplicate]

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Pandas groupby: How to get a union of strings
(8 answers)
Closed 4 years ago.
How do I turn the below input data (Pandas dataframe fed from Excel file):
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith 100
334014 E&E Tom Smith 200
334014 Real Estate Perspectives Janet Brown 100
334014 E&E Janet Brown 200
into this:
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith, Janet Brown 100
334014 E&E Tom Smith, Janet Brown 200
So basically I want to group by Category, concatenate the Speakers, but not aggregate Price.
I tried different approaches with Pandas dataframe.groupby() and .agg(), but to no avail. Maybe there is a simpler pure-Python solution?
There are two possible solutions. Either aggregate by multiple columns and join:
dataframe.groupby(['ID', 'Category', 'Price'])['Speaker'].apply(', '.join).reset_index()
Or group by Price alone and aggregate the remaining columns with first or last:
dataframe.groupby('Price').agg({'Speaker': ', '.join, 'ID': 'first', 'Price': 'first'})
Try this:
df.groupby(['ID', 'Category'], as_index=False).agg(lambda x: x.iloc[0] if x.dtype == 'int64' else ', '.join(x))
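A runnable variant of the same idea, spelling out per-column aggregators explicitly instead of branching on dtype (sample data assumed from the question):

```python
import pandas as pd

# Assumed sample data from the question.
df = pd.DataFrame({
    'ID': [334014] * 4,
    'Category': ['Real Estate Perspectives', 'E&E'] * 2,
    'Speaker': ['Tom Smith', 'Tom Smith', 'Janet Brown', 'Janet Brown'],
    'Price': [100, 200, 100, 200],
})

# Join the speakers within each (ID, Category) group; Price is constant
# per group, so 'first' simply carries it through unaggregated.
out = (df.groupby(['ID', 'Category'], as_index=False)
         .agg({'Speaker': ', '.join, 'Price': 'first'}))
print(out)
```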

Fill Missing Dates in DataFrame with Duplicate Dates in Groupby

I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work...kinda...if two statuses occurred on the same date, one would not be included in output and consequently some statuses were missing.
I then found the following, it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path thinking that filling in the missing dates and then ffill down the statuses is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might better use pandas features that I'm missing?
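The approach described (fill the missing dates per Job, then ffill the statuses) is sound, and the duplicate-date problem disappears if each day is first collapsed to its last observed status before reindexing. A minimal sketch of that idea, using the question's column names on assumed, simplified data:

```python
import pandas as pd

# Assumed, simplified sample: two jobs, duplicate timestamps on one day.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2011-01-24 10:58', '2011-01-24 10:59',
                            '2011-01-27 09:00', '2011-02-01 08:00',
                            '2011-02-03 12:00']),
    'Job': [1, 1, 1, 2, 2],
    'Status': ['A', 'C', 'B', 'A', 'F'],
})

pieces = []
for job, g in df.groupby('Job'):
    # Collapse duplicate dates first: keep the last status seen each day,
    # so no status silently drops out of the reindex later.
    s = (g.sort_values('Date')
           .assign(Day=lambda x: x['Date'].dt.normalize())
           .groupby('Day')['Status'].last())
    # Then reindex to a full daily range and forward-fill the status.
    full = pd.date_range(s.index.min(), s.index.max(), freq='D')
    pieces.append(s.reindex(full).ffill()
                   .rename_axis('Day').reset_index().assign(Job=job))

daily = pd.concat(pieces, ignore_index=True)

# Daily count of each status across all jobs:
counts = daily.groupby(['Day', 'Status']).size()
print(counts)
```

With ~14k jobs, the per-group loop may be slow; the same per-day collapse can feed a MultiIndex reindex as in the linked "forward-fill in a multi-index dataframe" answer, but the collapse step is the key to handling the duplicates either way.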
