Python / Pandas - Merge on index when I have two indexes

I have a DataFrame with a two-level index (a MultiIndex); it looks like this:
bal:
                  ano   unit period
business_id id
9564        302  2012  reais  anual
            303  2011  reais  anual
2361        304  2013  reais  anual
            305  2012  reais  anual
2369        306  2013  reais  anual
            307  2012  reais  anual
I have another dataframe that looks like this:
accounts:
              A             B
id
302  5964168.52  1.097601e+07
303  5774707.15  1.086787e+07
304  3652575.31  6.608469e+06
305   321076.15  6.027066e+06
306  3858137.49  9.733126e+06
I want to merge them so they look like this:
                  ano   unit period           A             B
business_id id
9564        302  2012  reais  anual  5964168.52  1.097601e+07
            303  2011  reais  anual  5774707.15  1.086787e+07
2361        304  2013  reais  anual  3652575.31  6.608469e+06
            305  2012  reais  anual   321076.15  6.027066e+06
2369        306  2013  reais  anual  3858137.49  9.733126e+06
What I'm trying to do is something like this:
bal=bal.merge(accounts,left_on='id', right_index=True)
However, I think the syntax is not correct, since I'm getting a ValueError:
ValueError: len(right_on) must equal the number of levels in the index of "left"
Can anyone help?

At the time this answer was written, it was not possible to join on specific levels of a MultiIndex.
You could only join on the entire index or on columns.
So you'll have to take business_id out of the MultiIndex before you join:
result = (bal.reset_index('business_id').join(accounts, how='inner')
          .set_index(['business_id'], append=True))
import pandas as pd
bal = pd.DataFrame({'ano': [2012, 2011, 2013, 2012, 2013, 2012], 'business_id': [9564, 9564, 2361, 2361, 2369, 2369], 'id': [302, 303, 304, 305, 306, 307], 'period': ['anual', 'anual', 'anual', 'anual', 'anual', 'anual'], 'unit': ['reais', 'reais', 'reais', 'reais', 'reais', 'reais']})
bal = bal.set_index(['business_id', 'id'])
accounts = pd.DataFrame({'A': [5964168.52, 5774707.15, 3652575.31, 321076.15, 3858137.49], 'B': [10976010.0, 10867870.0, 6608469.0, 6027066.0, 9733126.0], 'id': [302, 303, 304, 305, 306]})
accounts = accounts.set_index('id')
result = (bal.reset_index('business_id').join(accounts, how='inner')
          .set_index(['business_id'], append=True))
print(result)
yields
ano period unit A B
id business_id
302 9564 2012 anual reais 5964168.52 10976010.0
303 9564 2011 anual reais 5774707.15 10867870.0
304 2361 2013 anual reais 3652575.31 6608469.0
305 2361 2012 anual reais 321076.15 6027066.0
306 2369 2013 anual reais 3858137.49 9733126.0

Inspired by unutbu's answer, here is the same idea using merge:
bal.reset_index(['business_id','id']).merge(accounts, left_on = 'id', right_index= True).set_index(['id','business_id'])
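Both snippets above leave the index ordered as (id, business_id), while the question asks for (business_id, id). A minimal sketch, assuming the same bal and accounts frames as in the answer above: turn every index level into an ordinary column, merge on the shared id column, and rebuild the index in the requested order.

```python
import pandas as pd

# Same data as in the question
bal = pd.DataFrame({
    'business_id': [9564, 9564, 2361, 2361, 2369, 2369],
    'id': [302, 303, 304, 305, 306, 307],
    'ano': [2012, 2011, 2013, 2012, 2013, 2012],
    'unit': ['reais'] * 6,
    'period': ['anual'] * 6,
}).set_index(['business_id', 'id'])

accounts = pd.DataFrame({
    'id': [302, 303, 304, 305, 306],
    'A': [5964168.52, 5774707.15, 3652575.31, 321076.15, 3858137.49],
    'B': [10976010.0, 10867870.0, 6608469.0, 6027066.0, 9733126.0],
}).set_index('id')

# Move all index levels into regular columns, merge on 'id',
# then restore the two-level index in the original order.
result = (bal.reset_index()
             .merge(accounts.reset_index(), on='id', how='inner')
             .set_index(['business_id', 'id']))
print(result)
```

The inner merge drops id 307, which has no match in accounts, just as the join in the first answer does.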

Related

How to plot bar chart using seaborn after pandas.pivot

I'm having difficulties plotting my bar chart after I pivot my data as it can't seem to detect the column that I'm using for the x-axis.
This is the original data:
import pandas as pd
data = {'year': [2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
'sector': ['Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector',
'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice',
'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice'],
'count': [861, 531, 2, 877, 606, 66, 899, 682, 112, 882, 765, 167, 960, 804, 203, 943, 834, 243, 1016, 876, 237, 1085, 960, 215]}
df = pd.DataFrame(data)
year sector count
0 2014 Public Sector 861
1 2014 Private Sector 531
2 2014 Not in Active Practice 2
3 2015 Public Sector 877
4 2015 Private Sector 606
5 2015 Not in Active Practice 66
6 2016 Public Sector 899
7 2016 Private Sector 682
8 2016 Not in Active Practice 112
9 2017 Public Sector 882
10 2017 Private Sector 765
11 2017 Not in Active Practice 167
12 2018 Public Sector 960
13 2018 Private Sector 804
14 2018 Not in Active Practice 203
15 2019 Public Sector 943
16 2019 Private Sector 834
17 2019 Not in Active Practice 243
18 2020 Public Sector 1016
19 2020 Private Sector 876
20 2020 Not in Active Practice 237
21 2021 Public Sector 1085
22 2021 Private Sector 960
23 2021 Not in Active Practice 215
After pivoting the data:
sector  Not in Active Practice  Private Sector  Public Sector
year
2014                         2             531            861
2015                        66             606            877
2016                       112             682            899
2017                       167             765            882
2018                       203             804            960
2019                       243             834            943
2020                       237             876           1016
2021                       215             960           1085
After tweaking the data to get the columns I want:
sector  Private Sector  Public Sector  Total in Practice
year
2014               531            861               1392
2015               606            877               1483
2016               682            899               1581
2017               765            882               1647
2018               804            960               1764
2019               834            943               1777
2020               876           1016               1892
2021               960           1085               2045
As you can see, after I have pivoted the data, there is an extra row on top of the year called 'sector'.
sns.barplot(data=df3, x='year', y="Total in Practice")
This is the code that I'm using to plot the graph but python returns with:
<Could not interpret input 'year'>
I've tried using 'sector' instead of 'year' but it returns with the same error.
I copied the original data and followed the same process you described:
import pandas as pd
import seaborn as sns
mdic = {'year': [2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017,
2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
'sector': ["Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice"],
'count' : [861, 531, 2, 877, 606, 66, 899, 682, 112, 882, 765, 167, 960, 804, 203, 943, 834, 243, 1016, 876, 237, 1085, 960, 215]}
data = pd.DataFrame(mdic)
data_pivot = data.pivot(index='year', columns='sector', values='count')
df3= data_pivot.drop('Not in Active Practice', axis=1)
df3['Total in Practice'] = df3.sum(axis=1)
This gives the same result as yours:
df3
sector  Private Sector  Public Sector  Total in Practice
year
2014               531            861               1392
2015               606            877               1483
2016               682            899               1581
2017               765            882               1647
2018               804            960               1764
2019               834            943               1777
2020               876           1016               1892
2021               960           1085               2045
You are getting the error because, when you created df3, the column year became the index. Here are three solutions:
The first, as commented by @tdy, is:
sns.barplot(data=df3.reset_index(), x='year', y='Total in Practice')
Second is:
sns.barplot(data=df3, x=df3.index, y="Total in Practice")
Third is when you do the pivoting add reset_index() and do the sum for specified columns:
data_pivot = data.pivot(index='year', columns='sector', values='count').reset_index()
df3= data_pivot.drop('Not in Active Practice', axis=1)
df3['Total in Practice'] = df3[['Public Sector','Private Sector']].sum(axis=1)
Then you can draw the bar plot with your original code:
ax = sns.barplot(data=df3, x='year', y="Total in Practice")
ax.bar_label(ax.containers[0])
This produces a bar chart with the total printed above each bar.
If you're going to plot a pivoted (wide form) dataframe, then plot directly with pandas.DataFrame.plot, which works with 'year' as the index. Leave the data in long form (as specified in the data parameter documentation) when using seaborn. Both pandas and seaborn use matplotlib.
seaborn doesn't recognize 'year' because it's in the dataframe index, it's not a column, as needed by the API.
It's not necessary to calculate a total column because this can be added to the top of stacked bars with matplotlib.pyplot.bar_label.
See this answer for a thorough explanation of using .bar_label.
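To make the index-versus-column point concrete, here is a small sketch using only the 2014 and 2015 public and private rows from the question. The names wide, long_again and tidy are illustrative, not from the original posts. It shows that 'year' is not a column after pivoting, that reset_index() restores it, and that melt() returns the data to the long form seaborn expects.

```python
import pandas as pd

# Subset of the question's data (2014 and 2015, two sectors)
data = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015],
    'sector': ['Public Sector', 'Private Sector'] * 2,
    'count': [861, 531, 877, 606],
})

wide = data.pivot(index='year', columns='sector', values='count')
print('year' in wide.columns)        # 'year' lives in the index here

long_again = wide.reset_index()
print('year' in long_again.columns)  # now it is an ordinary column

# melt() goes back to long form: one row per (year, sector) pair
tidy = long_again.melt(id_vars='year', var_name='sector', value_name='count')
print(tidy)
```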
Manage the DataFrame
# select the data to not include 'Not in Active Practice'
df = df[df.sector.ne('Not in Active Practice')]
Plot long df with seaborn
As shown in this answer, seaborn.histplot, or seaborn.displot with kind='hist', can be used to plot a stacked bar.
# imports needed by the plotting code below
import matplotlib.pyplot as plt
import seaborn as sns
# plot the data in long form
fig, ax = plt.subplots(figsize=(9, 7))
sns.histplot(data=df, x='year', weights='count', hue='sector', multiple='stack', bins=8, discrete=True, ax=ax)
# iterate through the axes containers to add bar labels
for c in ax.containers:
    # add the section label to the middle of each bar
    ax.bar_label(c, label_type='center')
# add the label for the total bar length by adding only the last container to the top of the bar
_ = ax.bar_label(ax.containers[-1])
Plot pivoted (wide) df with pandas.DataFrame.plot
# pivot the dataframe
dfp = df.pivot(index='year', columns='sector', values='count')
# plot the dataframe
ax = dfp.plot(kind='bar', stacked=True, rot=0, figsize=(9, 7))
# add labels
for c in ax.containers:
    ax.bar_label(c, label_type='center')
_ = ax.bar_label(ax.containers[-1])

Merging multiple CSV files to one dataframe using Pandas

I have 26 CSV files with file names similar to the below:
LA 2016
LA 2017
LA 2019
LA 2020
NY 2016
NY 2017
NY 2019
NY 2020
All the files have the same column names. The columns and some sample rows are:
Month A B C Total
Jan 156 132 968 1256
Feb 863 363 657 1883
Mar 142 437 857 1436
I am trying to merge them all on Month. I tried pd.concat but for some reason the dataframe is not merging.
I am using the below code:
list=[]
city=['LA ','NY ','MA ','TX ']
year=['2016','2017','2018', '2019','2020']
for i in city:
    for j in year:
        list.append(i+j+".csv")
df=pd.concat([pd.read_csv(i) for i in list])
Can someone help me with this.
The following should work:
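from functools import reduce
import pandas as pd
# read every CSV file named in `list` (built in the question) into its own DataFrame
list_of_dataframes = []
for i in list:
    list_of_dataframes.append(pd.read_csv(i))
# merge the DataFrames pairwise on the shared 'Month' column
df_final = reduce(lambda left, right: pd.merge(left, right, on='Month'), list_of_dataframes)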
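Because every file shares the same column names, repeated merges on Month rely on automatic suffixes such as A_x and A_y, which become hard to read as more files are added. One alternative, sketched here under assumptions, is to index each frame by Month and concatenate side by side with the file name as a key. The in-memory CSV text stands in for the real files, and the NY values are invented for illustration.

```python
import io
import pandas as pd

# Stand-ins for two of the CSV files; the "NY 2016" numbers are hypothetical
files = {
    "LA 2016": "Month,A,B,C,Total\nJan,156,132,968,1256\nFeb,863,363,657,1883\nMar,142,437,857,1436\n",
    "NY 2016": "Month,A,B,C,Total\nJan,10,20,30,60\nFeb,11,21,31,63\nMar,12,22,32,66\n",
}

# Read each file with Month as the index so rows align on it
frames = {name: pd.read_csv(io.StringIO(text), index_col="Month")
          for name, text in files.items()}

# Passing a dict to concat uses its keys as the outer column level
wide = pd.concat(frames, axis=1)
print(wide)
```

With real files, replacing io.StringIO(text) with the file path should be the only change needed. Each value is then addressed as wide.loc['Jan', ('LA 2016', 'Total')].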

Sorting grouped DataFrame column without changing index sorting

I have a df as below:
I want only the top 5 countries from each year but keeping the year ascending.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the index ascending, instead, it sorted the year index too. How do I get the top 5 countries and keep the year's group ascending?
The CSV file is uploaded HERE .
Your code sorts both year and hydro_total in descending order. Sort year ascending and hydro_total descending instead:
(df.sort_values(['year','hydro_total'],
ascending=[True,False])
.groupby('year').head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
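The same pattern on a tiny, made-up frame (the values below are hypothetical, not taken from the linked CSV) makes the effect of the per-column ascending list easier to see. Here the top 2 rows per year are kept:

```python
import pandas as pd

# Hypothetical toy data, deliberately listed with the later year first
df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'year': [2010, 2010, 2010, 1971, 1971, 1971],
    'hydro_total': [5.0, 9.0, 1.0, 3.0, 2.0, 8.0],
})

# year ascending, hydro_total descending within each year;
# head(2) then keeps the first two rows of every year group
top2 = (df.sort_values(['year', 'hydro_total'], ascending=[True, False])
          .groupby('year')
          .head(2))
print(top2)
```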

How to get top 10 sellers by sales for each country from the table Sellers with columns (Seller_ID, Country, Month, Sales) in Python [duplicate]

Basically this is a sql query task that I am trying to perform in Python.
Is there a way to get Top 10 sellers from each country without creating new DataFrames ?
Table for example:
df = pd.DataFrame(
{
'Seller_ID': [1321, 1245, 1567, 1876, 1345, 1983, 1245, 1623, 1756, 1555, 1424, 1777,
2321, 2245, 2567, 2876, 2345, 2983, 2245, 2623, 2756, 2555, 2424, 2777],
'Country' : ['India','India','India','India','India','India','India','India','India','India','India','India',
'UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
'Month' : ['Jan','Mar','Mar','Feb','May','May','Jun','Aug','Dec','Sep','Apr','Jul',
'Jan','Mar','Mar','Feb','May','May','Jun','Aug','Dec','Sep','Apr','Jul'],
'Sales' : [456, 876, 345, 537, 128, 874, 458, 931, 742, 682, 386, 857,
456, 876, 345, 537, 128, 874, 458, 931, 742, 682, 386, 857]
})
df
Table Output:
Seller_ID Country Month Sales
0 1321 India Jan 456
1 1245 India Mar 876
2 1567 India Mar 345
3 1876 India Feb 537
4 1345 India May 128
5 1983 India May 874
6 1245 India Jun 458
7 1623 India Aug 931
8 1756 India Dec 742
9 1555 India Sep 682
10 1424 India Apr 386
11 1777 India Jul 857
12 2321 UK Jan 456
13 2245 UK Mar 876
14 2567 UK Mar 345
15 2876 UK Feb 537
16 2345 UK May 128
17 2983 UK May 874
18 2245 UK Jun 458
19 2623 UK Aug 931
20 2756 UK Dec 742
21 2555 UK Sep 682
22 2424 UK Apr 386
23 2777 UK Jul 857
I wrote the line below, but it takes the top 20 rows overall rather than the top 10 of each country, so it gives wrong results.
df.loc[df['Country'].isin(['India','UK'])].sort_values(['Sales'], ascending=False)[0:20]
The following code works, but it is not very elegant because it needs to create new DataFrames:
a = pd.DataFrame(df.loc[df['Country'] == 'India'].sort_values(['Sales'], ascending=False)[0:10])
b = pd.DataFrame(df.loc[df['Country'] == 'UK'].sort_values(['Sales'], ascending=False)[0:10])
top10_ofeach = pd.concat([a,b], ignore_index=True)
The most I can think of to improve this is to loop over the countries, but I am looking for a much smarter way to do it overall.
Seems to be duplicate of Pandas get topmost n records within each group
df.sort_values(['Sales'], ascending=False).groupby('Country').head(10)
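Since the question is framed as a SQL task, a variant closer to ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Sales DESC) may also be useful. This is a sketch using the question's data, and the names rank and top10 are illustrative. It ranks sales within each country and filters with a boolean mask, which keeps the original row order and avoids building one DataFrame per country.

```python
import pandas as pd

df = pd.DataFrame({
    'Seller_ID': [1321, 1245, 1567, 1876, 1345, 1983, 1245, 1623, 1756, 1555, 1424, 1777,
                  2321, 2245, 2567, 2876, 2345, 2983, 2245, 2623, 2756, 2555, 2424, 2777],
    'Country': ['India'] * 12 + ['UK'] * 12,
    'Month': ['Jan', 'Mar', 'Mar', 'Feb', 'May', 'May', 'Jun', 'Aug', 'Dec', 'Sep', 'Apr', 'Jul'] * 2,
    'Sales': [456, 876, 345, 537, 128, 874, 458, 931, 742, 682, 386, 857] * 2,
})

# Rank sales within each country; method='first' breaks ties by row order,
# so exactly 10 rows per country receive a rank of 10 or better.
rank = df.groupby('Country')['Sales'].rank(method='first', ascending=False)
top10 = df[rank <= 10]
print(top10)
```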

Creating a multi-index from csv census data

I would like to create a multi indexed dataframe so I can calculate values in a more organized way.
I know a MUCH more elegant solution is out there, but I'm struggling to find it. Most of the stuff I've found involves series and tuples. I'm fairly new to pandas (and programming) and this is my first attempt at using/creating multi-indexes.
After downloading census data as csv and creating dataframe with pertinent fields I have:
county housingunits2010 housingunits2012 occupiedunits2010 occupiedunits2012
8001 120 200 50 100
8002 100 200 75 125
And I want to end up with:
id    Year  housingunits  occupiedunits
8001  2010           120             50
      2012           200            100
8002  2010           100             75
      2012           200            125
And then be able to add columns from calculated values (ie difference between years, %change) and from other dataframes, matching merging by county and year.
I figured out a workaround with the basic methods that I've learned (see below), but...it certainly isn't elegant. Any suggestion would be appreciated.
First creating two diff data frames
df3 = df2[["county_id","housingunits2012"]]
df4 = df2[["county_id","housingunits2010"]]
Adding the year column
df3['year'] = np.array(['2012'] * 7)
df4['year'] = np.array(['2010'] * 7)
df3.columns = ['county_id','housingunits','year']
df4.columns = ['county_id','housingunits','year']
Appending
df5 = df3.append(df4)
Writing to csv
df5.to_csv('/Users/ntapia/df5.csv', index = False)
Reading & sorting
df6 = pd.read_csv('/Users/ntapia/df5.csv', index_col=[0, 2])
df6.sort_index(0)
Result (actual data):
                housingunits
county_id year
8001      2010        163229
          2012        163986
8005      2010        238457
          2012        239685
8013      2010        127115
          2012        128106
8031      2010        285859
          2012        288191
8035      2010        107056
          2012        109115
8059      2010        230006
          2012        230850
8123      2010         96406
          2012         97525
Thanks!
import re
import pandas as pd
df = df.set_index('county')
# split each column name such as 'housingunits2010' into a (label, year) tuple
df = df.rename(columns=lambda x: re.search(r'([a-zA-Z_]+)(\d{4})', x).groups())
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['label', 'year'])
s = df.unstack()
s.name = 'count'
print(s)
gives
label          year  county
housingunits   2010  8001      120
                     8002      100
               2012  8001      200
                     8002      200
occupiedunits  2010  8001       50
                     8002       75
               2012  8001      100
                     8002      125
Name: count, dtype: int64
If you want that in a DataFrame call reset_index():
print(s.reset_index())
yields
label year county count
0 housingunits 2010 8001 120
1 housingunits 2010 8002 100
2 housingunits 2012 8001 200
3 housingunits 2012 8002 200
4 occupiedunits 2010 8001 50
5 occupiedunits 2010 8002 75
6 occupiedunits 2012 8001 100
7 occupiedunits 2012 8002 125
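The output above is in fully long form, while the question asks for one row per (county, year) with housingunits and occupiedunits as columns. As a different technique from the answer above, pandas.wide_to_long can produce that layout directly. This is a sketch assuming column names of the form shown in the question's sample table, and the housing_change column is an illustrative addition:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'county': [8001, 8002],
    'housingunits2010': [120, 100],
    'housingunits2012': [200, 200],
    'occupiedunits2010': [50, 75],
    'occupiedunits2012': [100, 125],
})

# Split each '<stub><year>' column into a stub column plus a 'year' index level
long = pd.wide_to_long(df, stubnames=['housingunits', 'occupiedunits'],
                       i='county', j='year').sort_index()

# Example of a calculated column: change in housing units between years,
# computed separately within each county
long['housing_change'] = long.groupby(level='county')['housingunits'].diff()
print(long)
```

Because county and year form the index, other frames indexed the same way can then be joined on both levels.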
