I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
0   1         2                   3              4                   5              6                   7              8                   9
ID  Net Cost  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount
1   30        Surcharge A         9.5            Discount X          -11.5          Discount Y          -3.25          Surcharge B         2.5
2   40        Discount X          -12.5
3   50        Discount X          -11.5
4   35        Discount X          -5.5           Surcharge B         3.5
5   45        Surcharge A         9.5            Discount X          -10.5          Surcharge B         4.5
The first row contains the headers; the column names Charge Description and Charge Amount form pairs and appear multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID  Net Cost  Surcharge A  Surcharge B  Discount X  Discount Y
1   30        9.5          2.5          -11.5       -3.25
2   40        0            0            -12.5       0
3   50        0            0            -11.5       0
4   35        0            3.5          -5.5        0
5   45        9.5          4.5          -10.5       0
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
I used the file you shared and reset the column names to those of the initial dataframe df, so that the names stay non-unique (when reading the file, pandas automatically adds suffixes to duplicate column names to make them unique):
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
index = index,
names_to = arr,
names_pattern = arr,
dropna=True)
.pivot_wider(
index=index,
names_from='Charge Description',
values_from='Charge Amount')
.fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
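For readers without pyjanitor, the same long-then-wide idea can be sketched with plain pandas. This is a minimal illustration on a cut-down, made-up two-row frame, not the answer's own code: the pairs are picked by position, stacked with concat, and then flipped back with pivot followed by the cleanup steps. Passing a list as the pivot index assumes pandas 1.1 or later.

```python
import pandas as pd

# Small stand-in for the invoice data, with the repeated column names kept as-is.
invoice = pd.DataFrame(
    [[1, 30, "Surcharge A", 9.5, "Discount X", -11.5],
     [2, 40, "Discount X", -12.5, None, None]],
    columns=["ID", "Net Cost",
             "Charge Description", "Charge Amount",
             "Charge Description", "Charge Amount"],
)

index = ["ID", "Net Cost"]

# Long form: take each (description, amount) pair by position and stack them.
pairs = [invoice.iloc[:, [0, 1, i, i + 1]] for i in range(2, invoice.shape[1], 2)]
long = pd.concat(pairs, ignore_index=True).dropna(subset=["Charge Description"])

# Wide form: one column per description, then the cleanup steps.
wide = (
    long.pivot(index=index, columns="Charge Description", values="Charge Amount")
        .fillna(0)
        .rename_axis(columns=None)
        .reset_index()
)
print(wide)
```

The positional pairing relies on the same assumption as the answer: every Charge Description column is immediately followed by its Charge Amount column.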
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
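To see why repeating the index lines up with the reshaped array, here is a toy sketch with made-up labels and values (the names wide, pairs, labels, desc and amount are illustrative only). A row-major reshape keeps each row's pairs next to each other, so each row label has to be repeated once per pair.

```python
import pandas as pd

# Toy frame: two rows, two (description, amount) pairs per row.
wide = pd.DataFrame(
    [["A", 1.0, "B", 2.0],
     ["C", 3.0, "D", 4.0]],
    index=pd.Index([10, 20], name="ID"),
)

# Row-major (C order) reshape keeps each row's pairs together:
# row 10 -> (A, 1.0), (B, 2.0); row 20 -> (C, 3.0), (D, 4.0)
pairs = wide.to_numpy().reshape(-1, 2)

# Each original row produced columns.size // 2 pairs,
# so its label is repeated that many times.
labels = wide.index.repeat(wide.columns.size // 2)

long = pd.DataFrame(pairs, columns=["desc", "amount"], index=labels)
print(long)
```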
The Discount and Surcharge columns come out with object dtype here; you can convert them to numeric.
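One way to do that conversion is pd.to_numeric, applied column by column. The frame below is a hypothetical stand-in for the reshaped result, built with object dtype on purpose to mimic what comes out of the numpy round trip.

```python
import pandas as pd

# Hypothetical stand-in for the reshaped result: every column holds Python objects.
out = pd.DataFrame(
    {
        "ID": [1, 2],
        "Net Cost": [30, 40],
        "Discount X": [-11.5, -12.5],
        "Surcharge A": [9.5, 0],
    },
    dtype=object,
)

# Convert every charge column, leaving the identifier columns alone.
charge_cols = out.columns.difference(["ID", "Net Cost"])
for col in charge_cols:
    out[col] = pd.to_numeric(out[col])

print(out.dtypes)
```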
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
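df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
# the amounts are still strings here; convert them so pivot_table's default mean works on numbers
df['Charge Amount'] = pd.to_numeric(df['Charge Amount'])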
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
My first thought was to read the data out into a list of dictionaries, one per row (taking both the keys and the values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
https://trinket.io/python3/555f860855
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]
I am trying to create a new dataframe new_df with a column containing the difference between the values of identically named columns in two separate dataframes, df1 and df2.
I have tried to use the code new_df.loc['difference'] = df1.loc['s_values'] - df2.loc['s_values']
but I cannot achieve my result.
where df1 =
stats s_values
gender year
women 2007 height 40
2007 cigarette use 31
and df2 =
stats s_values
gender year
Men 2007 height 10
2007 cigarette use 11
Desired output (I do not want to include the gender index):
new_df =
stats difference
year
2007 height 30
2007 cigarette use 20
You can try this (full example):
Input:
import pandas as pd
df1 = pd.DataFrame({'gender': {0: 'woman', 1: 'woman'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 40, 1: 31}})
df2 = pd.DataFrame({'gender': {0: 'men', 1: 'men'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 10, 1: 11}})
Code:
df = pd.concat([df1,df2], ignore_index=True)
df['s_values'] = df.groupby(['year', 'stats'])['s_values'].diff().abs()
df.dropna(subset=['s_values']).drop('gender', axis=1)
Output:
year stats s_values
2 2007 height 30.0
3 2007 cigarette use 20.0
Note:
If both dataframes are structured identically, it's even shorter:
df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
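If the two frames hold the same year/stats combinations but not necessarily in the same row order, a positional subtraction would pair the wrong rows. A minimal sketch that aligns on the keys instead is shown below; the shuffled df2 is an assumption made for illustration, it is not the asker's data layout.

```python
import pandas as pd

df1 = pd.DataFrame({"gender": ["woman", "woman"], "year": [2007, 2007],
                    "stats": ["height", "cigarette use"], "s_values": [40, 31]})
# Same values as df2 in the answer, but with the rows in a different order.
df2 = pd.DataFrame({"gender": ["men", "men"], "year": [2007, 2007],
                    "stats": ["cigarette use", "height"], "s_values": [11, 10]})

keys = ["year", "stats"]

# Indexing both sides by the keys lets pandas match rows by label, not by position.
new_df = (
    df1.set_index(keys)["s_values"]
       .sub(df2.set_index(keys)["s_values"])
       .rename("difference")
       .reset_index()
)
print(new_df)
```

The gender column never enters the result because only s_values is selected after setting the index.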
new_df = pd.DataFrame()
new_df["year"] = df1["year"]
new_df["stats"] = df1["stats"]
for i, (val1, val2) in enumerate(zip(df1["s_values"], df2["s_values"])):
    new_df.at[i, "difference"] = val1 - val2
I want to group my dataframe by UserId, Date and category, and get the frequency of use per day for each category, the max duration per category, and the part of the day when it is most used, and finally store the result in a .csv file.
name duration UserId category part_of_day Date
Settings 3.436 1 System tool evening 2020-09-10
Calendar 2.167 1 Calendar night 2020-09-11
Calendar 5.705 1 Calendar night 2020-09-11
Messages 7.907 1 Phone_and_SMS night 2020-09-11
Instagram 50.285 9 Social night 2020-09-28
Drive 30.260 9 Productivity night 2020-09-28
df.groupby(["UserId", "Date","category"])["category"].count()
my code result is :
UserId Date category
1 2020-09-10 System tool 1
2020-09-11 Calendar 8
Clock 2
Communication 86
Health & Fitness 5
But I want this result:
UserId Date category count(category) max-duration
1 2020-09-10 System tool 1 3
2020-09-11 Calendar 2 5
2 2020-09-28 Social 1 50
Productivity 1 30
How can I do that? I cannot find a solution that gives the wanted result.
Use agg:
df.groupby(["UserId", "Date","category"]).agg({'category':'count',
'Date': np.ptp})
or replace np.ptp with lambda x: x.max() - x.min().
Data
df = pd.DataFrame({'name ': {0: 'Settings', 1: 'Calendar', 2: 'Calendar', 3: 'Messages', 4: 'Instagram', 5: 'Drive'}, ' duration': {0: 3.4360000000000004, 1: 2.167, 2: 5.705, 3: 7.907, 4: 50.285, 5: 30.26}, ' UserId': {0: 1, 1: 1, 2: 1, 3: 1, 4: 9, 5: 9}, ' category': {0: ' System tool', 1: ' Calendar', 2: ' Calendar', 3: ' Phone_and_SMS', 4: ' Social', 5: ' Productivity'}, ' part_of_day': {0: ' evening', 1: ' night ', 2: ' night ', 3: 'night ', 4: ' night ', 5: ' night '}, ' Date': {0: ' 2020-09-10', 1: ' 2020-09-11', 2: ' 2020-09-11', 3: ' 2020-09-11', 4: ' 2020-09-28', 5: ' 2020-09-28'}})
df.columns = df.columns.str.strip()
df:
name duration UserId category part_of_day Date
0 Settings 3.436 1 System tool evening 2020-09-10
1 Calendar 2.167 1 Calendar night 2020-09-11
2 Calendar 5.705 1 Calendar night 2020-09-11
3 Messages 7.907 1 Phone_and_SMS night 2020-09-11
4 Instagram 50.285 9 Social night 2020-09-28
5 Drive 30.260 9 Productivity night 2020-09-28
grouping = df.groupby(["UserId", "Date","category"]).agg({"category": 'count', 'duration': 'max'}).rename(columns={"duration" : "max-duration"})
grouping:
category max-duration
UserId Date category
1 2020-09-10 System tool 1 3.436
2020-09-11 Calendar 2 5.705
Phone_and_SMS 1 7.907
9 2020-09-28 Productivity 1 30.260
Social 1 50.285
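The question also asks for the part of the day that is used most and for a .csv file. A possible sketch using named aggregation follows; the output column names (count, max_duration, top_part_of_day) and the file name mentioned in the comment are placeholders, and ties between parts of the day are resolved by taking the first mode.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Settings", "Calendar", "Calendar", "Messages", "Instagram", "Drive"],
    "duration": [3.436, 2.167, 5.705, 7.907, 50.285, 30.260],
    "UserId": [1, 1, 1, 1, 9, 9],
    "category": ["System tool", "Calendar", "Calendar",
                 "Phone_and_SMS", "Social", "Productivity"],
    "part_of_day": ["evening", "night", "night", "night", "night", "night"],
    "Date": ["2020-09-10", "2020-09-11", "2020-09-11",
             "2020-09-11", "2020-09-28", "2020-09-28"],
})

summary = (
    df.groupby(["UserId", "Date", "category"])
      .agg(
          count=("name", "count"),
          max_duration=("duration", "max"),
          # most frequent part of day; mode() is sorted, so ties take the first value
          top_part_of_day=("part_of_day", lambda s: s.mode().iloc[0]),
      )
      .reset_index()
)

# Passing a path such as "usage_summary.csv" (placeholder name) writes a file;
# with no path, to_csv returns the CSV text instead.
csv_text = summary.to_csv(index=False)
print(csv_text)
```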
You can take advantage of pandas.DataFrame.groupby, pandas.DataFrame.aggregate and named aggregation in the following format to generate your desired output in one line:
code:
import pandas as pd
df = pd.DataFrame({'name': ['Settings','Calendar','Calendar', 'Messages', 'Instagram', 'Drive'],
'duration': [3.436, 2.167, 5.7050, 7.907, 50.285, 30.260],
'UserId': [1, 1, 1, 1, 2, 2],
'category' : ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
'part_of_day' : ['evening', 'night','night','night','night','night' ],
'Date' : ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28'] })
df.groupby(['UserId', 'Date', 'category']).aggregate( count_cat = ('category', 'count'), max_duration = ('duration', 'max'))
out:
I have a large dataframe of data in Pandas (let's say of courses at a university) looking like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester, and then get a bunch of different aggregate data on it, but all at one time if I can. For example, I want a total count of courses, count of only undergraduate courses, and sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like to get an output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?
I added a new course, Python, and provide the data as a dict to load into a dataframe.
The solution combines the agg() method on a groupby, with the aggregations given as a dictionary, and a custom aggregation function for your ugrad requirement:
import pandas as pd

def my_custom_ugrad_aggregator(arr):
    # count the rows in the group whose value is 'ugrad'
    return sum(arr == 'ugrad')

data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'}, 'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4}, 'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'}, 'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'}, 'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year','semester']).agg({'name':['count'],'enrolled':['sum'],'ugrad/grad':my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8
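If flat column names like the ones in the question are preferred over the two-level header above, named aggregation is one option. The sketch below assumes pandas 0.25 or later and uses a helper column, is_ugrad, which is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Math", "History", "Adv Math", "Python"],
    "enrolled": [62, 15, 8, 8],
    "ugrad/grad": ["ugrad", "ugrad", "grad", "ugrad"],
    "year": [2016, 2016, 2017, 2017],
    "semester": ["Fall", "Spring", "Fall", "Spring"],
})

summary = (
    # boolean helper column: summing it counts the undergraduate courses
    df.assign(is_ugrad=df["ugrad/grad"].eq("ugrad"))
      .groupby(["year", "semester"])
      .agg(
          count=("name", "size"),
          count_ugrad=("is_ugrad", "sum"),
          total_enroll=("enrolled", "sum"),
      )
)
print(summary)
```

Summing a boolean column avoids the custom Python function, so the whole aggregation can use built-in reducers.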
Use agg with dictionary on how to rollup/aggregate each column:
df_out = df.groupby(['year','semester'])[['enrolled','ugrad/grad']]\
    .agg({'ugrad/grad':lambda x: (x=='ugrad').sum(),'enrolled':['sum','size']})\
    .set_axis(['Ugrad Count','Total Enrolled','Count Courses'], axis=1)
df_out
Output:
Ugrad Count Total Enrolled Count Courses
year semester
2016 Fall 1 62 1
Spring 1 15 1
2017 Fall 0 8 1