Pandas: Produce alternating rows with melt on multiple columns [duplicate] - python

Say I have a dataframe
import pandas as pd

data_dict = {'Number': {0: 1, 1: 2, 2: 3}, 'mw link': {0: 'SAM3703_2SAM3944 2', 1: 'SAM3720_2SAM4115 2', 2: 'SAM3729_2SAM4121_ 2'}, 'site_a': {0: 'SAM3703', 1: 'SAM3720', 2: 'SAM3729'}, 'name_a': {0: 'Chelak', 1: 'KattakurganATC', 2: 'Payariq'}, 'site_b': {0: 'SAM3944', 1: 'SAM4115', 2: 'SAM4121'}, 'name_b': {0: 'Turkibolo', 1: 'Kattagurgon Sement Zavod', 2: 'Payariq Dehgonobod'}, 'distance km': {0: 3.618, 1: 7.507, 2: 9.478}, 'manufacture': {0: 'ZTE NR 8150/8250', 1: 'ZTE NR 8150/8250', 2: 'ZTE NR 8150/8250'}}
df = pd.DataFrame(data_dict)
There are two columns, site_a and site_b, which I want to melt into rows, but a plain melt stacks all the site_a values and then all the site_b values in series; I want them in an alternating fashion.
Expected Output:
Number mw link distance km manufacture variable value
0 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_a SAM3703
1 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_b SAM3944
2 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_a SAM3720
3 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_b SAM4115
4 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_a SAM3729
5 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_b SAM4121
My Solution:
This is what I have tried:
df1 = pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b'])
which gives me:
   Number              mw link  distance km       manufacture variable    value
0       1   SAM3703_2SAM3944 2        3.618  ZTE NR 8150/8250   site_a  SAM3703
1       2   SAM3720_2SAM4115 2        7.507  ZTE NR 8150/8250   site_a  SAM3720
2       3  SAM3729_2SAM4121_ 2        9.478  ZTE NR 8150/8250   site_a  SAM3729
3       1   SAM3703_2SAM3944 2        3.618  ZTE NR 8150/8250   site_b  SAM3944
4       2   SAM3720_2SAM4115 2        7.507  ZTE NR 8150/8250   site_b  SAM4115
5       3  SAM3729_2SAM4121_ 2        9.478  ZTE NR 8150/8250   site_b  SAM4121
i.e. all the site_a rows first and then all the site_b rows.
You just add sort_values(['Number', 'variable']):
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['Number', 'variable'])
Alternatives:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['mw link', 'variable'])
Or:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['distance km', 'variable'])
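Putting it together (a minimal sketch on the df above), with reset_index(drop=True) added so the row labels run 0-5 as in the expected output:
out = (pd.melt(df,
               id_vars=['Number', 'mw link', 'distance km', 'manufacture'],
               value_vars=['site_a', 'site_b'])
         .sort_values(['Number', 'variable'])
         .reset_index(drop=True))
print(out)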

Related

Column Pair-wise aggregation and reorganization in Pandas

I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
 0   1         2                   3              4                   5              6                   7              8                   9
ID   Net Cost  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount
1    30        Surcharge A         9.5            Discount X          -11.5          Discount Y          -3.25          Surcharge B         2.5
2    40        Discount X          -12.5
3    50        Discount X          -11.5
4    35        Discount X          -5.5           Surcharge B         3.5
5    45        Surcharge A         9.5            Discount X          -10.5          Surcharge B         4.5
The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID  Net Cost  Surcharge A  Surcharge B  Discount X  Discount Y
1   30        9.5          2.5          -11.5       -3.25
2   40        0            0            -12.5       0
3   50        0            0            -11.5       0
4   35        0            3.5          -5.5        0
5   45        9.5          4.5          -10.5       0
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
I used the file you shared, and reset the column names to match the initial dataframe df shared (pandas automatically adds suffixes to repeated columns to make them unique), in order to keep the non-uniqueness:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
The first step is to transform to long form with pivot_longer from pyjanitor. In this case we take advantage of the fact that each Charge Description is followed by its Charge Amount, so we can safely pair them and reshape into two columns. After that is done, we flip back to wide form, getting the Surcharge and Discount values as headers. Thankfully the index is unique, so a pivot works without extras. I used pivot_wider here primarily for convenience; the same can be achieved with pivot, with just a few cleanup steps (under the hood, pivot_wider uses pd.pivot).
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
index = index,
names_to = arr,
names_pattern = arr,
dropna=True)
.pivot_wider(
index=index,
names_from='Charge Description',
values_from='Charge Amount')
.fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
Another option: since the ordering of the data is fairly consistent, you can drop down into numpy, reshape into a two-column array, keep track of the ID and Net Cost columns (ensuring they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can then convert the charge columns from object dtype to numeric.
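A sketch of that conversion, assuming the final wide result above has been assigned to a variable (here called wide, my name):
# `wide` is assumed to hold the unstacked result from the previous step
charge_cols = wide.columns.difference(['ID', 'Net Cost'])
wide[charge_cols] = wide[charge_cols].apply(pd.to_numeric)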
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0, 1, i, i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
# the source data holds strings; make Charge Amount numeric so pivot_table can aggregate it
df['Charge Amount'] = pd.to_numeric(df['Charge Amount'])
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
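To show zeros instead of NaN, as in the desired output, add a final fill:
df1 = df1.fillna(0)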
My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
https://trinket.io/python3/555f860855
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]

pandas: subtract values in two dataframes with identical columns and store the result in a new dataframe

I am trying to create a new dataframe new_df with a column containing the differences obtained by subtracting identical columns in 2 separate dataframes, df1 and df2.
I have tried to use the code new_df.loc['difference'] = df1.loc['s_values'] - df2.loc['s_values']
but I cannot achieve my desired result.
where df1 =
stats s_values
gender year
women 2007 height 40
2007 cigarette use 31
and df2 =
stats s_values
gender year
Men 2007 height 10
2007 cigarette use 11
Desired output (I do not want to include the gender index):
new_df =
stats difference
year
2007 height 30
2007 cigarette use 20
You can try this (full example):
Input:
import pandas as pd
df1 = pd.DataFrame({'gender': {0: 'woman', 1: 'woman'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 40, 1: 31}})
df2 = pd.DataFrame({'gender': {0: 'men', 1: 'men'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 10, 1: 11}})
Code:
df = pd.concat([df1,df2], ignore_index=True)
df['s_values'] = df.groupby(['year', 'stats'])['s_values'].diff().abs()
df.dropna(subset=['s_values']).drop('gender', axis=1)
Output:
year stats s_values
2 2007 height 30.0
3 2007 cigarette use 20.0
Note:
If both dataframes are structured completely identically, it's even shorter:
df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
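That shortcut relies on both frames being row-aligned; a quick sanity-check sketch before subtracting positionally:
# positional subtraction assumes identical row alignment in df1 and df2
assert (df1[['year', 'stats']].values == df2[['year', 'stats']].values).all()
new_df = df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
print(new_df)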
new_df = pd.DataFrame()
new_df["year"] = df1["year"]
new_df["stats"] = df1["stats"]
# subtract row by row, writing the result into a new "difference" column
for i, (val1, val2) in enumerate(zip(df1["s_values"], df2["s_values"])):
    new_df.at[i, "difference"] = val1 - val2

Multiply a specific value with a series of columns based on a Condition in Pandas Dataframe

I have data for a certain country that gives the population of certain age groups as a time series. I am trying to multiply the female population numbers by -1 to display them on the other side of the pyramid graph. I have achieved that for one year, i.e. 1960 (see code below). Now I want to achieve the same result for all the columns from 1960-2020.
PakPopulation.loc[PakPopulation['Gender']=="Female",['1960']]=PakPopulation['1960'].apply(lambda x:-x)
I have also tried the following solution but no luck:
PakPopulation.loc[PakPopulation['Gender']=="Female",[:,['1960':'2019']]=PakPopulation[:,['1960':'2019']].apply(lambda x:-x)
Schema:
Country  Age Group  Gender  1960   1961   1962
XYZ      0-4        Male    5880k  5887k  6998k
XYZ      0-4        Female  5980k  6887k  7998k
You could build a list of years and use that list as part of your selection:
import pandas as pd
PakPopulation = pd.DataFrame({
'Country': {0: 'XYZ', 1: 'ABC'},
'Age Group': {0: '0-4', 1: '0-4'},
'Gender': {0: 'Male', 1: 'Female'},
'1960': {0: 5880, 1: 5980},
'1961': {0: 5887, 1: 6887},
'1962': {0: 6998, 1: 7998},
})
start_year = 1960
end_year = 1962
years_lst = list(map(str, range(start_year, end_year + 1)))
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] = \
PakPopulation[years_lst].apply(lambda x: -x)
print(PakPopulation)
Output:
Country Age Group Gender 1960 1961 1962
0 XYZ 0-4 Male 5880 5887 6998
1 ABC 0-4 Female -5980 -6887 -7998

How to group by multiple columns in python

I want to group my dataframe by UserId, Date, and category to get the frequency of use per day, the max duration per category, and the part of the day when each category is most used, and finally store the result in a .csv file.
name duration UserId category part_of_day Date
Settings 3.436 1 System tool evening 2020-09-10
Calendar 2.167 1 Calendar night 2020-09-11
Calendar 5.705 1 Calendar night 2020-09-11
Messages 7.907 1 Phone_and_SMS night 2020-09-11
Instagram 50.285 9 Social night 2020-09-28
Drive 30.260 9 Productivity night 2020-09-28
df.groupby(["UserId", "Date","category"])["category"].count()
my code's result is:
UserId Date category
1 2020-09-10 System tool 1
2020-09-11 Calendar 8
Clock 2
Communication 86
Health & Fitness 5
But I want this result:
UserId Date category count(category) max-duration
1 2020-09-10 System tool 1 3
2020-09-11 Calendar 2 5
2 2020-09-28 Social 1 50
Productivity 1 30
How can I do that? I cannot find a solution that gives the wanted result.
Use agg:
df.groupby(["UserId", "Date", "category"]).agg({'category': 'count',
                                                'duration': 'max'})
If you want the range rather than the max, replace 'max' with np.ptp (after import numpy as np) or lambda x: x.max() - x.min().
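Equivalently, with named aggregation (a sketch; it assumes the column names have been cleaned, and the output names count_category and max_duration are my choice):
out = (df.groupby(['UserId', 'Date', 'category'])
         .agg(count_category=('category', 'count'),
              max_duration=('duration', 'max')))
print(out)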
Data
df = pd.DataFrame({'name ': {0: 'Settings', 1: 'Calendar', 2: 'Calendar', 3: 'Messages', 4: 'Instagram', 5: 'Drive'}, ' duration': {0: 3.4360000000000004, 1: 2.167, 2: 5.705, 3: 7.907, 4: 50.285, 5: 30.26}, ' UserId': {0: 1, 1: 1, 2: 1, 3: 1, 4: 9, 5: 9}, ' category': {0: ' System tool', 1: ' Calendar', 2: ' Calendar', 3: ' Phone_and_SMS', 4: ' Social', 5: ' Productivity'}, ' part_of_day': {0: ' evening', 1: ' night ', 2: ' night ', 3: 'night ', 4: ' night ', 5: ' night '}, ' Date': {0: ' 2020-09-10', 1: ' 2020-09-11', 2: ' 2020-09-11', 3: ' 2020-09-11', 4: ' 2020-09-28', 5: ' 2020-09-28'}})
df.columns = df.columns.str.strip()
df:
name duration UserId category part_of_day Date
0 Settings 3.436 1 System tool evening 2020-09-10
1 Calendar 2.167 1 Calendar night 2020-09-11
2 Calendar 5.705 1 Calendar night 2020-09-11
3 Messages 7.907 1 Phone_and_SMS night 2020-09-11
4 Instagram 50.285 9 Social night 2020-09-28
5 Drive 30.260 9 Productivity night 2020-09-28
grouping = df.groupby(["UserId", "Date","category"]).agg({"category": 'count', 'duration':max}).rename(columns={"duration" : "max-duration"})
grouping:
category max-duration
UserId Date category
1 2020-09-10 System tool 1 3.436
2020-09-11 Calendar 2 5.705
Phone_and_SMS 1 7.907
9 2020-09-28 Productivity 1 30.260
Social 1 50.285
You can take advantage of pandas.DataFrame.groupby and pandas.DataFrame.aggregate in the following format to generate your desired output in one line:
code:
import pandas as pd
df = pd.DataFrame({'name': ['Settings','Calendar','Calendar', 'Messages', 'Instagram', 'Drive'],
'duration': [3.436, 2.167, 5.7050, 7.907, 50.285, 30.260],
'UserId': [1, 1, 1, 1, 2, 2],
'category' : ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
'part_of_day' : ['evening', 'night','night','night','night','night' ],
'Date' : ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28'] })
df.groupby(['UserId', 'Date', 'category']).aggregate( count_cat = ('category', 'count'), max_duration = ('duration', 'max'))
out:
                                    count_cat  max_duration
UserId Date       category
1      2020-09-10 System_tool               1         3.436
       2020-09-11 Calendar                  2         5.705
                  Phone_and_SMS             1         7.907
2      2020-09-28 Productivity              1        30.260
                  Social                    1        50.285
Aggregate a bunch of different data in a single groupby with multiple columns

I have a large dataframe in Pandas (let's say of courses at a university) looking like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester, and then get a bunch of different aggregate data on it, but all at one time if I can. For example, I want a total count of courses, count of only undergraduate courses, and sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like to get an output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?
Here I added a new Python course and provided the data as a dict to load into a dataframe.
The solution is a combination of the agg() method on a groupby, where the aggregations are provided in a dictionary, and the use of a custom aggregation function for your ugrad requirement:
def my_custom_ugrad_aggregator(arr):
    return sum(arr == 'ugrad')

data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'}, 'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4}, 'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'}, 'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'}, 'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year', 'semester']).agg({'name': ['count'], 'enrolled': ['sum'], 'ugrad/grad': my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8
Use agg with a dictionary specifying how to roll up/aggregate each column:
df_out = df.groupby(['year','semester'])[['enrolled','ugrad/grad']]\
.agg({'ugrad/grad':lambda x: (x=='ugrad').sum(),'enrolled':['sum','size']})\
.set_axis(['Ugrad Count','Total Enrolled','Count Courses'], inplace=False, axis=1)
df_out
Output:
Ugrad Count Total Enrolled Count Courses
year semester
2016 Fall 1 62 1
Spring 1 15 1
2017 Fall 0 8 1
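Note: newer pandas versions (2.0+) removed the inplace argument from set_axis, so on recent versions the same chain is written without it (a sketch):
df_out = (df.groupby(['year', 'semester'])[['enrolled', 'ugrad/grad']]
            .agg({'ugrad/grad': lambda x: (x == 'ugrad').sum(),
                  'enrolled': ['sum', 'size']})
            .set_axis(['Ugrad Count', 'Total Enrolled', 'Count Courses'], axis=1))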
