Reshape pandas dataframe to turn categorical columns into individual columns - python

I have data that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[list('ABCDE'),
['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
['Gas', 'Water', 'Water', 'Oil', 'Gas'],
list(np.random.randint(10, 100, 5)),
list(np.random.randint(10, 100, 5))]
).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']
ID Substance1 Substance2 Category1 Category2 Quantity1 Quantity2
0 A Crude Oil Natural Gas Oil Gas 85 14
1 B Natural Gas Salt water Gas Water 95 78
2 C Gasoline Waste water Refined Water 33 25
3 D Diesel Motor oil Refined Oil 49 54
4 E Bitumen Sour Gas Oil Gas 92 86
The Category and Quantity columns correspond to the Substance column with the same number.
I want to expand the Category columns so that each unique category value becomes a new column, with the Quantity value as the cell value. Non-existent categories would be NaN. So the resulting frame would look like this:
ID Oil Gas Water Refined
0 A 85 14 NaN NaN
1 B NaN 95 78 NaN
2 C NaN NaN 25 33
3 D 54 NaN NaN 49
4 E 92 86 NaN NaN
I tried .melt() followed by .pivot_table() but for some reason values get duplicated across the new category columns.

You can use pd.wide_to_long and then groupby:
np.random.seed(0)
df = pd.DataFrame(data=[list('ABCDE'),
['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
['Gas', 'Water', 'Water', 'Oil', 'Gas'],
list(np.random.randint(10, 100, 5)),
list(np.random.randint(10, 100, 5))]
).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']
pd.wide_to_long(df,['Substance','Category','Quantity'], 'ID','Num','','.+')\
.groupby(['ID','Category'])['Quantity'].sum()\
.unstack().reset_index()
Output:
Category ID Gas Oil Refined Water
0 A 19.0 54.0 NaN NaN
1 B 57.0 NaN NaN 93.0
2 C NaN NaN 74.0 31.0
3 D NaN 46.0 77.0 NaN
4 E 97.0 77.0 NaN NaN
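For reference, the melt/pivot_table route the question attempted can also work; the duplication happens when Category and Quantity are melted in two independent calls, which pairs every category with every quantity. A minimal sketch that melts the pairs together, assuming the sample frame above:
# Stack the (Category, Quantity) pairs on top of each other, then pivot.
long = pd.concat([
    df[['ID', f'Category{i}', f'Quantity{i}']]
      .set_axis(['ID', 'Category', 'Quantity'], axis=1)
    for i in (1, 2)
])
long['Quantity'] = long['Quantity'].astype(int)  # the .T construction leaves object dtype
out = (long.pivot_table(index='ID', columns='Category',
                        values='Quantity', aggfunc='sum')
           .reset_index())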

Related

How do you groupby multiple columns in Pandas and add rows for missing groups

Say in my dataset I have 3 nominal/categorical variables - Year (2 possible unique values), Gender (2 possible unique values), Country (2 possible unique values) - and 2 numerical variables - Work exp in years and Salary. Thus we can make 8 (2x2x2) possible combinations of the categorical variables. However, my data does not have all the combinations, but let's say 5 out of 8 (see the data example below).
Example:
Data (obtained after pandas groupby) - 5 group combinations
df.groupby(['Years','Gender','Country'])[['Salary','Work ex']].mean()
df.reset_index(inplace=True)
Years  Gender  Country  Salary  Work ex
2010   Male    USA      50      2
2011   Female  India    30      1
2011   Male    India    10      3
2011   Male    USA      50      2
2011   Female  USA      80      2
Now what I want is to have rows with all the combinations of the categorical variables, i.e. 8 rows. For the 3 new rows the numerical variables will have null values, and the other 5 will keep their values, as shown below:
Years  Gender  Country  Mean Salary  Mean Work ex
2010   Male    USA      50           2
2010   Male    India    NA           NA
2010   Female  USA      NA           NA
2010   Female  India    NA           NA
2011   Female  India    30           1
2011   Male    India    10           3
2011   Male    USA      50           2
2011   Female  USA      80           2
PS: My original data had years, gender, country, salary, and work exp as variables. I grouped on (years, gender, country) and summarised (work ex and salary). That led to the data above, with only 5 of the 8 possible group combinations. Now I want to add the rest of the possible groups (3 groups) with null values.
Assuming you achieved step 1, let's call the result df_grp.
Then create a dataframe with all possible combinations of ['Years', 'Gender', 'Country']:
df_all = pd.MultiIndex.from_product(
    [df_grp['Years'].unique(), df_grp['Gender'].unique(), df_grp['Country'].unique()]
).to_frame().reset_index(drop=True)
df_all.columns = ['Years', 'Gender', 'Country']
Then do an outer merge with df_grp
out = df_all.merge(df_grp, on=['Years', 'Gender', 'Country'], how = 'outer')
print(out):
Years Gender Country Mean Salary Mean Work ex
0 2010 Male India NaN NaN
1 2010 Male USA 50.0 1.5
2 2010 Female India NaN NaN
3 2010 Female USA NaN NaN
4 2011 Male India 10.0 3.0
5 2011 Male USA 50.0 2.0
6 2011 Female India 30.0 1.0
7 2011 Female USA 80.0 2.0
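If you prefer to avoid the merge, an equivalent alternative is to reindex the grouped frame against the full product index; a sketch using the same names:
full = pd.MultiIndex.from_product(
    [df_grp['Years'].unique(), df_grp['Gender'].unique(), df_grp['Country'].unique()],
    names=['Years', 'Gender', 'Country'])
out = (df_grp.set_index(['Years', 'Gender', 'Country'])
             .reindex(full)   # missing combinations appear as NaN rows
             .reset_index())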
Make sure the grouping variables are categorical, then use groupby():
df = pd.DataFrame({'Years': {0: 2010, 1: 2011, 2: 2011, 3: 2011, 4: 2011, 5: 2010},
'Gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male'},
'Country': {0: 'USA', 1: 'India', 2: 'India', 3: 'USA', 4: 'USA', 5: 'USA'},
'Salary': {0: 50, 1: 30, 2: 10, 3: 50, 4: 80, 5: 50},
'Work ex': {0: 2, 1: 1, 2: 3, 3: 2, 4: 2, 5: 1}})
df[['Years', 'Gender', 'Country']] = df[['Years', 'Gender', 'Country']].astype('category')
df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India NaN NaN
1 2010 Female USA NaN NaN
2 2010 Male India NaN NaN
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0
You can also set the missing values to zero by doing:
df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().fillna(0).reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India 0.0 0.0
1 2010 Female USA 0.0 0.0
2 2010 Male India 0.0 0.0
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0
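Note that this behaviour comes from groupby's observed=False default for categorical keys, which emits every category combination whether observed or not. Recent pandas versions warn that this default will change, so it is safer to spell it out:
# Explicit observed=False keeps all (Years, Gender, Country) combinations.
df.groupby(['Years', 'Gender', 'Country'], observed=False)[['Salary', 'Work ex']].mean().reset_index()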

Subtract values with groupby in Pandas dataframe Python

I have a dataframe like this:
Alliance_name    Company_name  TOAD   MBA   Class  EVE   TBD    Sur
Shinva group     HVC corp      8845   1135  0      12    12128  1
Shinva group     LDN corp      11     1243  133    121   113    1
Telegraph group  Freename LLC  5487   223   928    0     0      21
Telegraph group  Grt           0      7543  24     3213  15     21
Zero group       PetZoo crp    5574   0     2      0     6478   1
Zero group       Elephant      48324  0     32     118   4      1
I need to subtract values between cells in a column if they belong to the same Alliance_name.
(It would be ideal not to subtract the last column, Sur, but that is not the main concern.)
I know that for addition we can make something like this:
df = df.groupby('Alliance_name').sum()
But I don't know how to do this with subtraction.
The result should be like this (if we don't subtract the last column):
Alliance_name    Company_name         TOAD    MBA    Class  EVE    TBD    Sur
Shinva group     HVC corp LDN corp    8834    -108   -133   -109   12015  1
Telegraph group  Freename LLC Grt     5487    -7320  904    -3213  -15    21
Zero group       PetZoo crp Elephant  -42750  0      -30    -118   6474   1
Thanks for your help!
You could invert the values to subtract, and then sum them.
df.loc[df.Alliance_name.duplicated(keep="first"), ["TOAD", "MBA", "Class", "EVE", "TBD", "Sur"]] *= -1
df.groupby("Alliance_name").sum()
The .first() and .last() groupby methods can be useful for such tasks.
You can organize the columns you want to skip/compute
>>> df.columns
Index(['Alliance_name', 'Company_name', 'TOAD', 'MBA', 'Class', 'EVE', 'TBD',
'Sur'],
dtype='object')
>>> alliance, company, *cols, sur = df.columns
>>> groups = df.groupby(alliance)
>>> company = groups.first()[[company]]
>>> sur = groups.first()[sur]
>>> groups = groups[cols]
And use .first() - .last() directly:
>>> groups.first() - groups.last()
TOAD MBA Class EVE TBD
Alliance_name
Shinva group 8834 -108 -133 -109 12015
Telegraph group 5487 -7320 904 -3213 -15
Zero group -42750 0 -30 -118 6474
Then .join() the other columns back in
>>> company.join(groups.first() - groups.last()).join(sur).reset_index()
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
1 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
2 Zero group PetZoo crp -42750 0 -30 -118 6474 1
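The difference step can equally be written as a single agg pass over the selected columns, with the same result as .first() - .last():
>>> company.join(groups.agg(lambda s: s.iloc[0] - s.iloc[-1])).join(sur).reset_index()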
Another approach:
>>> df - df.drop(columns=['Company_name', 'Sur']).groupby('Alliance_name').shift(-1)
Alliance_name Class Company_name EVE MBA Sur TBD TOAD
0 NaN -133.0 NaN -109.0 -108.0 NaN 12015.0 8834.0
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 904.0 NaN -3213.0 -7320.0 NaN -15.0 5487.0
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN -30.0 NaN -118.0 0.0 NaN 6474.0 -42750.0
5 NaN NaN NaN NaN NaN NaN NaN NaN
You can then drop the all nan rows and fill the remainder values from the original df.
>>> ((df - df.drop(columns=['Company_name', 'Sur'])
.groupby('Alliance_name').shift(-1)).dropna(how='all')[df.columns].fillna(df))
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
2 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
4 Zero group PetZoo crp -42750 0 -30 -118 6474 1

How to insert concatenated columns into a pivot table in pandas

I have this data frame that I am transforming into a pivot table. I want to add concatenated columns as the values within the pivot.
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
'Age': [27, 23, 21, 23, 24],
'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
'Years' : [2, 4, 3, 3, 4] })
Displays this table
Student Grade Major Age City State Years
0 John Masters Liberal Arts 27 Boston MA 2
1 Boby Graduate Business 23 Brooklyn NY 4
2 Mina Graduate Sciences 21 Camden NJ 3
3 Peter Masters Education 23 Chicago IL 3
4 Nicky Graduate Law 24 Manhattan NY 4
Concatenated Columns
values = pd.concat([df['Age'],df['Years']], axis=1, ignore_index=True)
Displays this result
0 1
0 27 2
1 23 4
2 21 3
3 23 3
4 24 4
I want to add the concatenated columns (values) inside the pivot table so that it displays Age and Years in adjacent columns, not as separate pivot tables:
table = pd.pivot_table(df, values =['Age','Years'], index =['Student','City','State'], columns =['Grade', 'Major'], aggfunc = np.sum)
Grade Graduate Masters
Major Business Law Sciences Education Liberal Arts
Student City State
Boby Brooklyn NY 23.0 NaN NaN NaN NaN
John Boston MA NaN NaN NaN NaN 27.0
Mina Camden NJ NaN NaN 21.0 NaN NaN
Nicky Manhattan NY NaN 24.0 NaN NaN NaN
Peter Chicago IL NaN NaN NaN 23.0 NaN
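One way to get Age and Years adjacent: values=['Age', 'Years'] already produces a single pivot with both measures under an extra outer column level, and moving that level innermost puts them side by side. A minimal sketch (the reorder_levels indices assume the default level order of values, Grade, Major):
table = pd.pivot_table(df, values=['Age', 'Years'],
                       index=['Student', 'City', 'State'],
                       columns=['Grade', 'Major'], aggfunc='sum')
# Move the Age/Years level innermost so they sit adjacent per Grade/Major.
table = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)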

Cumsum with groupby

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works except when 'State' is NaN; then 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
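Since pandas 1.1 you can also keep the NaN State groups directly with dropna=False, which addresses the case the question ran into:
# NaN keys are kept as their own groups instead of being dropped.
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()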

Pandas How to combine two rows in group with complex rules/conditions

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
"ID": ['company A', 'company A', 'company A', 'company B','company B', 'company B', 'company C', 'company C','company C','company C', 'company D', 'company D','company D'],
'Sender': [28, 'remove1', 'flag_source', 56, 28, 312, 'remove2', 'flag_source', 78, 102, 26, 101, 96],
'Receiver': [129, 28, 'remove1', 172, 56, 28, 61, 'remove2', 12, 78, 98, 26, 101],
'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house','house','house', 'temp', 'house'],
'Receiver_type': ['house', 'house', 'temp', 'house','house','house','house', 'temp', 'house','house','house','house','temp'],
'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})
The df is like this below:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A remove1 28 2020-03-20 temp house 50 # combine this row with below
2 company A flag_source remove1 2020-03-20 house temp 47 # combine this row with above
3 company B 56 172 2019-02-11 house house 21
4 company B 28 56 2019-01-31 house house 23
5 company B 312 28 2018-04-02 house house 19
6 company C remove2 61 2020-06-29 temp house 52 # combine this row and below
7 company C flag_source remove2 2020-06-29 house temp 39 # combine this row with above
8 company C 78 12 2019-11-29 house house 12
9 company C 102 78 2019-10-01 house house 22
10 company D 26 98 2020-04-03 house house 61
11 company D 101 26 2020-01-30 temp house 53
12 company D 96 101 2019-10-18 house temp 19
I wish to combine/merge two rows within each 'ID' group (company x) by the following rule: combine the row whose Sender contains 'flag_source' with the row above it into one new row. In this new row: Sender is 'flag_source', Receiver is the row-above value (the two 'remove' values are dropped), Date is the row-above date, Sender_type and Receiver_type are 'house', and Price is the row-above value. Then remove the two original rows. For example, for company A it combines lines 1 and 2 to produce the new row below:
ID Sender Receiver Date Sender_type Receiver_type Price
company A flag_source 28 2020-03-20 house house 50
Then use this new row to replace the previous two lines. The same rule applies to the other groups (in this case it only affects companies A and C). In the end, I wish to have a result like this:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A flag_source 28 2020-03-20 house house 50 # new row
2 company B 56 172 2019-02-11 house house 21
3 company B 28 56 2019-01-31 house house 23
4 company B 312 28 2018-04-02 house house 19
5 company C flag_source 61 2020-06-29 house house 52 # new row
6 company C 78 12 2019-11-29 house house 12
7 company C 102 78 2019-10-01 house house 22
8 company D 26 98 2020-04-03 house house 61
9 company D 101 26 2020-01-30 temp house 53
10 company D 96 101 2019-10-18 house temp 19
Hopefully my explanation of the question is clear.
As this is a brief sample, the real data has many rows like this. I wrote a loop, but it is very slow and unproductive, so please help if you have any ideas or a more effective way. Many thanks for your help!
I believe the following works:
mask = df.Sender == 'flag_source'
df[mask] = df.shift()
df.loc[mask, 'Sender'] = 'flag_source'
df.loc[mask, ['Sender_type','Receiver_type']] = 'house'
df = df[~mask.shift(-1).fillna(False).astype(bool)].reset_index(drop=True)
So the steps are (by line):
make a mask of the rows you need to change
set those rows equal to the previous row with 'shift'
rewrite Sender for those rows to flag_source
also rewrite Sender_type and Receiver_type
remove the previous rows, by using a shift again on the mask. This seems a little convoluted; you could also do something like a loc against rows that don't contain the string remove (see the sketch after the output below)
Output:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32.0
1 company A flag_source 28 2020-03-20 house house 50.0
2 company B 56 172 2019-02-11 house house 21.0
3 company B 28 56 2019-01-31 house house 23.0
4 company B 312 28 2018-04-02 house house 19.0
5 company C flag_source 61 2020-06-29 house house 52.0
6 company C 78 12 2019-11-29 house house 12.0
7 company C 102 78 2019-10-01 house house 22.0
8 company D 26 98 2020-04-03 house house 61.0
9 company D 101 26 2020-01-30 temp house 53.0
10 company D 96 101 2019-10-18 house temp 19.0
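The loc-based alternative mentioned in the last step would look like this, assuming the 'remove*' markers only ever appear in the Sender column at that point:
# After the reassignments above, the rows to drop are exactly those whose
# Sender still starts with 'remove'.
df = df[~df['Sender'].astype(str).str.startswith('remove')].reset_index(drop=True)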
