I have a data frame that I am transforming into a pivot table, and I want to add concatenated columns as the values within the pivot.
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
                   'Age': [27, 23, 21, 23, 24],
                   'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
                   'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
                   'Years': [2, 4, 3, 3, 4]})
This displays the following table:
Student Grade Major Age City State Years
0 John Masters Liberal Arts 27 Boston MA 2
1 Boby Graduate Business 23 Brooklyn NY 4
2 Mina Graduate Sciences 21 Camden NJ 3
3 Peter Masters Education 23 Chicago IL 3
4 Nicky Graduate Law 24 Manhattan NY 4
Concatenated Columns
values = pd.concat([df['Age'],df['Years']], axis=1, ignore_index=True)
This displays the following result:
0 1
0 27 2
1 23 4
2 21 3
3 23 3
4 24 4
I want to add the concatenated columns (values) inside the pivot table so that Age and Years display in adjacent columns, not as separate pivot tables.
table = pd.pivot_table(df, values=['Age', 'Years'],
                       index=['Student', 'City', 'State'],
                       columns=['Grade', 'Major'], aggfunc=np.sum)
Grade Graduate Masters
Major Business Law Sciences Education Liberal Arts
Student City State
Boby Brooklyn NY 23.0 NaN NaN NaN NaN
John Boston MA NaN NaN NaN NaN 27.0
Mina Camden NJ NaN NaN 21.0 NaN NaN
Nicky Manhattan NY NaN 24.0 NaN NaN NaN
Peter Chicago IL NaN NaN NaN 23.0 NaN
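One possible approach, sketched against the df above (not a confirmed answer): passing both columns in values= puts the measures in the outermost column level, and reordering the column levels tucks them inside each Grade/Major pair so Age and Years sit side by side.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
                   'Age': [27, 23, 21, 23, 24],
                   'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
                   'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
                   'Years': [2, 4, 3, 3, 4]})

table = pd.pivot_table(df, values=['Age', 'Years'],
                       index=['Student', 'City', 'State'],
                       columns=['Grade', 'Major'], aggfunc='sum')
# columns are (measure, Grade, Major); move the measure level innermost
# so each Grade/Major pair carries adjacent Age and Years columns
table = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
```

After the reorder, each Grade/Major column pair holds Age next to Years, rather than two separate Age and Years blocks.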
Say my dataset has 3 nominal/categorical variables: Year (2 unique values), Gender (2 unique values), and Country (2 unique values), plus 2 numerical variables, Work exp in years and Salary. We can therefore form 8 (2x2x2) possible combinations of the categorical variables. However, my data does not have all the combinations, only 5 out of 8 (see the data example below).
Example:
Data (obtained after a pandas groupby), 5 group combinations:
df_grp = df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean()
df_grp.reset_index(inplace=True)
Years  Gender  Country  Salary  Work ex
2010   Male    USA      50      2
2011   Female  India    30      1
2011   Male    India    10      3
2011   Male    USA      50      2
2011   Female  USA      80      2
Now what I want is rows for all combinations of the categorical variables, i.e. 8 rows; the 3 new rows should have null values for the numerical variables, while the other 5 keep their values as shown below.
Years  Gender  Country  Mean Salary  Mean Work ex
2010   Male    USA      50           2
2010   Male    India    NA           NA
2010   Female  USA      NA           NA
2010   Female  India    NA           NA
2011   Female  India    30           1
2011   Male    India    10           3
2011   Male    USA      50           2
2011   Female  USA      80           2
PS: My original data had years, gender, country, salary, and work exp as variables. I grouped on (years, gender, country) and summarised (work ex and salary). That led to the data above with only 5 of the 8 possible group combinations. Now I want to add the remaining 3 groups with null values.
Assuming you achieved step 1, let's call the result df_grp.
Then create a dataframe with all possible combination of ['Years', 'Gender', 'Country'] like:
df_all = pd.MultiIndex.from_product(
    [df_grp['Years'].unique(), df_grp['Gender'].unique(), df_grp['Country'].unique()]
).to_frame().reset_index(drop=True)
df_all.columns = ['Years', 'Gender', 'Country']
Then do an outer merge with df_grp:
out = df_all.merge(df_grp, on=['Years', 'Gender', 'Country'], how = 'outer')
print(out)
Years Gender Country Mean Salary Mean Work ex.
0 2010 Male India NaN NaN
1 2010 Male USA 50.0 1.5
2 2010 Female India NaN NaN
3 2010 Female USA NaN NaN
4 2011 Male India 10.0 3.0
5 2011 Male USA 50.0 2.0
6 2011 Female India 30.0 1.0
7 2011 Female USA 80.0 2.0
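An equivalent sketch under the same assumptions (reusing the df_grp name from above): keep df_grp indexed by the three keys and reindex against the full product, which skips the intermediate frame and the merge.

```python
import pandas as pd

# stand-in for the grouped data from the question
df_grp = pd.DataFrame({'Years': [2010, 2011, 2011, 2011, 2011],
                       'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
                       'Country': ['USA', 'India', 'India', 'USA', 'USA'],
                       'Salary': [50, 30, 10, 50, 80],
                       'Work ex': [2, 1, 3, 2, 2]})

full = pd.MultiIndex.from_product(
    [df_grp['Years'].unique(), df_grp['Gender'].unique(), df_grp['Country'].unique()],
    names=['Years', 'Gender', 'Country'])
# reindex fills the 3 missing key combinations with NaN
out = df_grp.set_index(['Years', 'Gender', 'Country']).reindex(full).reset_index()
```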
Make sure the grouping variables are categorical, then use df.groupby(): grouping on categoricals with observed=False produces every category combination (this was the long-time default, but recent pandas versions ask you to pass it explicitly):
df = pd.DataFrame({'Years': {0: 2010, 1: 2011, 2: 2011, 3: 2011, 4: 2011, 5: 2010},
'Gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male'},
'Country': {0: 'USA', 1: 'India', 2: 'India', 3: 'USA', 4: 'USA', 5: 'USA'},
'Salary': {0: 50, 1: 30, 2: 10, 3: 50, 4: 80, 5: 50},
'Work ex': {0: 2, 1: 1, 2: 3, 3: 2, 4: 2, 5: 1}})
df[['Years', 'Gender', 'Country']] = df[['Years', 'Gender', 'Country']].astype('category')
df.groupby(['Years', 'Gender', 'Country'], observed=False)[['Salary', 'Work ex']].mean().reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India NaN NaN
1 2010 Female USA NaN NaN
2 2010 Male India NaN NaN
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0
You can also set the missing values to zero by doing:
df.groupby(['Years', 'Gender', 'Country'], observed=False)[['Salary', 'Work ex']].mean().fillna(0).reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India 0.0 0.0
1 2010 Female USA 0.0 0.0
2 2010 Male India 0.0 0.0
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0
Trying to convert consecutive columns to rows in pandas. The column names are a string stub plus a sequential number, i.e. Key1, Val1, ..., KeyN, ValN. You can use the code below to generate the DataFrame.
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name': ['Aria', 'Penelope', 'Niko'],
                   'Key1': ['test1', 'test2', 'test3'], 'Val1': [28, 4, 7],
                   'Key2': ['test4', 'test5', 'test6'], 'Val2': [82, 45, 76],
                   'Key3': ['test7', 'test8', 'test9'], 'Val3': [4, 76, 9],
                   'Key4': ['test10', 'test11', 'test12'], 'Val4': [97, 66, 10],
                   'Key5': ['test13', 'test14', 'test15'], 'Val5': [4, 10, '']},
                  columns=['City', 'State', 'Name', 'Key1', 'Val1', 'Key2', 'Val2',
                           'Key3', 'Val3', 'Key4', 'Val4', 'Key5', 'Val5'])
I tried the melt function as below:
df.melt(id_vars=['City', 'State'], var_name='Column', value_name='Key')
But that puts every Key and Val into a single value column, so a Key and its matching Val end up on different rows. The expected output keeps each KeyN/ValN pair together on one row.
Use pd.wide_to_long:
pd.wide_to_long(df, ['Key', 'Val'], i=['City', 'State', 'Name'], j='No').reset_index()
Output:
City State Name No Key Val
0 Houston Texas Aria 1 test1 28
1 Houston Texas Aria 2 test4 82
2 Houston Texas Aria 3 test7 4
3 Houston Texas Aria 4 test10 97
4 Houston Texas Aria 5 test13 4
5 Austin Texas Penelope 1 test2 4
6 Austin Texas Penelope 2 test5 45
7 Austin Texas Penelope 3 test8 76
8 Austin Texas Penelope 4 test11 66
9 Austin Texas Penelope 5 test14 10
10 Hoover Alabama Niko 1 test3 7
11 Hoover Alabama Niko 2 test6 76
12 Hoover Alabama Niko 3 test9 9
13 Hoover Alabama Niko 4 test12 10
14 Hoover Alabama Niko 5 test15
You are trying to melt two sets of columns simultaneously; pd.wide_to_long handles exactly this situation.
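An equivalent sketch, assuming the suffixes stay single digits (shown here with just Key1/Val1 and Key2/Val2 for brevity): split each column name into a (stub, number) pair and stack the numeric level.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name': ['Aria', 'Penelope', 'Niko'],
                   'Key1': ['test1', 'test2', 'test3'], 'Val1': [28, 4, 7],
                   'Key2': ['test4', 'test5', 'test6'], 'Val2': [82, 45, 76]})

tmp = df.set_index(['City', 'State', 'Name'])
# split 'Key1' -> ('Key', 1) so the numeric suffix becomes its own column level
tmp.columns = pd.MultiIndex.from_tuples([(c[:-1], int(c[-1])) for c in tmp.columns])
out = tmp.stack(level=1).rename_axis(['City', 'State', 'Name', 'No']).reset_index()
```

For multi-digit suffixes you would need a real split (e.g. a regex) instead of c[:-1], which is why wide_to_long is the safer general tool.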
I need to match the identical values of two columns from two separate dataframes and rewrite the original dataframe based on the other one.
So I have this original df:
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Vienna
3 Toyota Zurich
4 Renault Sydney
5 Ford Toronto
6 BMW Hamburg
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat San Francisco
11 Audi New York City
12 Ferrari Oslo
13 Volkswagen Stockholm
14 Lamborghini Singapore
15 Mercedes Lisbon
16 Jaguar Boston
And this new df:
Car Brand Current City
0 Tesla Amsterdam
1 Renault Paris
2 BMW Munich
3 Fiat Detroit
4 Audi Berlin
5 Ferrari Bruxelles
6 Lamborghini Rome
7 Mercedes Madrid
I need to match the car brands that are identical across the two dataframes and write the new associated city into the original df, so the result should be the one below (for example, Tesla is now Amsterdam instead of Vienna):
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
I tried this code to map the columns and rewrite the field, but it doesn't work and I cannot figure out how to make it work:
original_df['Original City'] = original_df['Car Brand'].map(dict(corrected_df[['Car Brand', 'Current City']]))
How can I make it work? Thanks a lot!
P.S.: Code for df:
cars = ['Daimler', 'Mitsubishi','Tesla', 'Toyota', 'Renault', 'Ford','BMW', 'Audi Sport','Citroen', 'Chevrolet', 'Fiat', 'Audi', 'Ferrari', 'Volkswagen','Lamborghini', 'Mercedes', 'Jaguar']
cities = ['Chicago', 'LA', 'Vienna', 'Zurich', 'Sydney', 'Toronto', 'Hamburg', 'Helsinki', 'Dublin', 'Brisbane', 'San Francisco', 'New York City', 'Oslo', 'Stockholm', 'Singapore', 'Lisbon', 'Boston']
data = {'Original Car Brand': cars, 'Original City': cities}
original_df = pd.DataFrame(data, columns=['Original Car Brand', 'Original City'])
---
cars = ['Tesla', 'Renault', 'BMW', 'Fiat', 'Audi', 'Ferrari', 'Lamborghini', 'Mercedes']
cities = ['Amsterdam', 'Paris', 'Munich', 'Detroit', 'Berlin', 'Bruxelles', 'Rome', 'Madrid']
data = {'Car Brand': cars, 'Current City': cities}
corrected_df = pd.DataFrame(data, columns=['Car Brand', 'Current City'])
Use Series.map, replacing the non-matched values with the original column via Series.fillna:
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
.fillna(original_df['Original City']))
print (original_df)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
Your solution works if you convert both columns to a NumPy array before building the dict:
d = dict(corrected_df[['Car Brand','Current City']].to_numpy())
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
.fillna(original_df['Original City']))
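The same mapping dict can also be built with zip, which may read more clearly than the NumPy detour (shown on trimmed versions of the frames above):

```python
import pandas as pd

corrected_df = pd.DataFrame({'Car Brand': ['Tesla', 'Renault'],
                             'Current City': ['Amsterdam', 'Paris']})
original_df = pd.DataFrame({'Original Car Brand': ['Daimler', 'Tesla', 'Renault'],
                            'Original City': ['Chicago', 'Vienna', 'Sydney']})

# brand -> new city lookup; brands absent from corrected_df map to NaN
d = dict(zip(corrected_df['Car Brand'], corrected_df['Current City']))
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
                                .fillna(original_df['Original City']))
```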
You can use the set_index() and assign() methods:
resultdf=original_df.set_index('Original Car Brand').assign(OriginalCity=corrected_df.set_index('Car Brand'))
Finally use fillna() method and reset_index() method:
resultdf=resultdf['OriginalCity'].fillna(resultdf['Original City']).reset_index()
Let us try update:
df1 = df1.set_index('Original Car Brand')
df1.update(df2.set_index('Car Brand'))
df1 = df1.reset_index()
Merge can do the work as well:
original_df['Original City'] = (
    original_df.merge(corrected_df, left_on='Original Car Brand',
                      right_on='Car Brand', how='left')['Current City']
    .fillna(original_df['Original City'])
)
I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works except when 'State' is NaN; then 'Total' is also NaN.
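An aside on the NaN case (assuming pandas >= 1.1): groupby drops NaN keys by default, which is exactly why Total goes NaN whenever State is NaN; dropna=False keeps those rows as their own group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'State': [np.nan, np.nan, 'California', 'California'],
                   'Country': ['Afghanistan', 'Afghanistan', 'USA', 'USA'],
                   'Date': ['2020-01-22', '2020-01-23'] * 2,
                   'Cases': [0, 5, 5, 10]})

# dropna=False treats NaN as a group key instead of discarding the rows
df['Total'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()
```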
arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
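The same result can be reached more directly on the unsorted frame, sketched under the assumption that the totals should accumulate in date order:

```python
import pandas as pd

df = pd.DataFrame({'State': ['California', 'California', 'Texas', 'Texas'],
                   'Country': ['USA'] * 4,
                   'Date': ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
                   'Cases': [5, 10, 4, 12]})

# sort first so cumsum runs chronologically; assignment realigns on the index
df['Total Cases'] = df.sort_values('Date').groupby(['State', 'Country'])['Cases'].cumsum()
```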
I have data that looks like this:
df = pd.DataFrame(data=[list('ABCDE'),
['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
['Gas', 'Water', 'Water', 'Oil', 'Gas'],
list(np.random.randint(10, 100, 5)),
list(np.random.randint(10, 100, 5))]
).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']
ID Substance1 Substance2 Category1 Category2 Quantity1 Quantity2
0 A Crude Oil Natural Gas Oil Gas 85 14
1 B Natural Gas Salt water Gas Water 95 78
2 C Gasoline Waste water Refined Water 33 25
3 D Diesel Motor oil Refined Oil 49 54
4 E Bitumen Sour Gas Oil Gas 92 86
The Category and Quantity columns correspond to the Substance columns.
I want to expand the Category columns into a new column for each unique value, with the Quantity value as the cell value. Non-existent categories would be NaN. So the resulting frame would look like this:
ID Oil Gas Water Refined
0 A 85 14 NaN NaN
1 B NaN 95 78 NaN
2 C NaN NaN 25 33
3 D 54 NaN NaN 49
4 E 92 86 NaN NaN
I tried .melt() followed by .pivot_table(), but for some reason the values get duplicated across the new category columns.
You can use pd.wide_to_long and then groupby:
np.random.seed(0)
df = pd.DataFrame(data=[list('ABCDE'),
['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
['Gas', 'Water', 'Water', 'Oil', 'Gas'],
list(np.random.randint(10, 100, 5)),
list(np.random.randint(10, 100, 5))]
).T
df.columns =['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']
pd.wide_to_long(df, ['Substance', 'Category', 'Quantity'],
                i='ID', j='Num', sep='', suffix='.+')\
    .groupby(['ID', 'Category'])['Quantity'].sum()\
    .unstack().reset_index()
Output:
Category ID Gas Oil Refined Water
0 A 19.0 54.0 NaN NaN
1 B 57.0 NaN NaN 93.0
2 C NaN NaN 74.0 31.0
3 D NaN 46.0 77.0 NaN
4 E 97.0 77.0 NaN NaN
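A melt-style alternative, sketched under the assumption that each ID's two categories are distinct (as in this data; otherwise pivot_table with an aggfunc would be needed): stack the two Category/Quantity pairs with concat, then pivot.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'ID': list('ABCDE'),
                   'Category1': ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
                   'Category2': ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
                   'Quantity1': np.random.randint(10, 100, 5),
                   'Quantity2': np.random.randint(10, 100, 5)})

# one long frame of (ID, Category, Quantity) rows from both pairs
pairs = pd.concat([
    df[['ID', 'Category1', 'Quantity1']].rename(
        columns={'Category1': 'Category', 'Quantity1': 'Quantity'}),
    df[['ID', 'Category2', 'Quantity2']].rename(
        columns={'Category2': 'Category', 'Quantity2': 'Quantity'}),
])
out = pairs.pivot(index='ID', columns='Category', values='Quantity').reset_index()
```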