I'm having difficulties plotting my bar chart after I pivot my data as it can't seem to detect the column that I'm using for the x-axis.
This is the original data:
import pandas as pd
data = {'year': [2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
'sector': ['Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector',
'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice',
'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice', 'Public Sector', 'Private Sector', 'Not in Active Practice'],
'count': [861, 531, 2, 877, 606, 66, 899, 682, 112, 882, 765, 167, 960, 804, 203, 943, 834, 243, 1016, 876, 237, 1085, 960, 215]}
df = pd.DataFrame(data)
year sector count
0 2014 Public Sector 861
1 2014 Private Sector 531
2 2014 Not in Active Practice 2
3 2015 Public Sector 877
4 2015 Private Sector 606
5 2015 Not in Active Practice 66
6 2016 Public Sector 899
7 2016 Private Sector 682
8 2016 Not in Active Practice 112
9 2017 Public Sector 882
10 2017 Private Sector 765
11 2017 Not in Active Practice 167
12 2018 Public Sector 960
13 2018 Private Sector 804
14 2018 Not in Active Practice 203
15 2019 Public Sector 943
16 2019 Private Sector 834
17 2019 Not in Active Practice 243
18 2020 Public Sector 1016
19 2020 Private Sector 876
20 2020 Not in Active Practice 237
21 2021 Public Sector 1085
22 2021 Private Sector 960
23 2021 Not in Active Practice 215
After pivoting the data:
sector Not in Active Practice Private Sector Public Sector
year
2014 2 531 861
2015 66 606 877
2016 112 682 899
2017 167 765 882
2018 203 804 960
2019 243 834 943
2020 237 876 1016
2021 215 960 1085
After tweaking the data to get the columns I want:
sector Private Sector Public Sector Total in Practice
year
2014 531 861 1392
2015 606 877 1483
2016 682 899 1581
2017 765 882 1647
2018 804 960 1764
2019 834 943 1777
2020 876 1016 1892
2021 960 1085 2045
As you can see, after I have pivoted the data, there is an extra row on top of the year called 'sector'.
sns.barplot(data=df3, x='year', y="Total in Practice")
This is the code I'm using to plot the graph, but Python returns:
<Could not interpret input 'year'>
I've tried using 'sector' instead of 'year' but it returns with the same error.
I copied the original data and followed the same process you described:
import pandas as pd
import seaborn as sns
mdic = {'year': [2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017,
2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
'sector': ["Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice",
"Public Sector", "Private Sector", "Not in Active Practice", "Public Sector", "Private Sector", "Not in Active Practice"],
'count' : [861, 531, 2, 877, 606, 66, 899, 682, 112, 882, 765, 167, 960, 804, 203, 943, 834, 243, 1016, 876, 237, 1085, 960, 215]}
data = pd.DataFrame(mdic)
data_pivot = data.pivot(index='year', columns='sector', values='count')
df3= data_pivot.drop('Not in Active Practice', axis=1)
df3['Total in Practice'] = df3.sum(axis=1)
This gives the same result:
df3
sector Private Sector Public Sector Total in Practice
year
2014 531 861 1392
2015 606 877 1483
2016 682 899 1581
2017 765 882 1647
2018 804 960 1764
2019 834 943 1777
2020 876 1016 1892
2021 960 1085 2045
The error occurs because, when you created df3, the year column became the index. Here are three solutions:
The first, as commented by @tdy:
sns.barplot(data=df3.reset_index(), x='year', y='Total in Practice')
Second is:
sns.barplot(data=df3, x=df3.index, y="Total in Practice")
The third is to add reset_index() when pivoting, and to sum only the specified columns:
data_pivot = data.pivot(index='year', columns='sector', values='count').reset_index()
df3= data_pivot.drop('Not in Active Practice', axis=1)
df3['Total in Practice'] = df3[['Public Sector','Private Sector']].sum(axis=1)
Then you can draw the bar plot with your code:
ax = sns.barplot(data=df3, x='year', y="Total in Practice")
ax.bar_label(ax.containers[0])
You get the figure:
If you're going to plot a pivoted (wide-form) dataframe, plot it directly with pandas.DataFrame.plot, which works with 'year' as the index. Leave the data in long form (as specified in the data parameter documentation) when using seaborn. Both pandas and seaborn use matplotlib.
seaborn doesn't recognize 'year' because it's in the dataframe index; it's not a column, which is what the API requires.
It's not necessary to calculate a total column because this can be added to the top of stacked bars with matplotlib.pyplot.bar_label.
See this answer for a thorough explanation of using .bar_label.
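To make the index-versus-column point concrete, here is a minimal sketch using a two-year subset of the question's data. The names long_df, wide and flat are illustrative only and do not appear in the original posts.

```python
import pandas as pd

# two-year subset of the question's data, in long form
long_df = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015],
    'sector': ['Public Sector', 'Private Sector', 'Public Sector', 'Private Sector'],
    'count': [861, 531, 877, 606],
})

# pivoting moves 'year' out of the columns and into the index
wide = long_df.pivot(index='year', columns='sector', values='count')
print('year' in wide.columns)  # False
print(wide.index.name)         # year

# reset_index() turns the index back into an ordinary column
flat = wide.reset_index()
print('year' in flat.columns)  # True
```

A frame like flat can be passed to seaborn with x='year', while a frame like wide suits DataFrame.plot.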
Manage the DataFrame
# select the data to not include 'Not in Active Practice'
df = df[df.sector.ne('Not in Active Practice')]
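If a per-year total is still wanted as its own column, one possible sketch is to aggregate the filtered long-form data with groupby rather than pivoting. The sample frame below is a two-year subset of the question's data, and in_practice and totals are hypothetical names.

```python
import pandas as pd

# two-year subset of the question's data, in long form
df = pd.DataFrame({
    'year': [2014, 2014, 2014, 2015, 2015, 2015],
    'sector': ['Public Sector', 'Private Sector', 'Not in Active Practice'] * 2,
    'count': [861, 531, 2, 877, 606, 66],
})

# drop the rows that should not contribute to the total
in_practice = df[df.sector.ne('Not in Active Practice')]

# as_index=False keeps 'year' as a regular column, which is what seaborn expects
totals = in_practice.groupby('year', as_index=False)['count'].sum()
print(totals)
```

Because 'year' stays a column here, totals could be passed to sns.barplot with x='year' and y='count' without any reset_index step.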
Plot long df with seaborn
As shown in this answer, seaborn.histplot, or seaborn.displot with kind='hist', can be used to plot a stacked bar.
import matplotlib.pyplot as plt
import seaborn as sns

# plot the data in long form
fig, ax = plt.subplots(figsize=(9, 7))
sns.histplot(data=df, x='year', weights='count', hue='sector', multiple='stack', bins=8, discrete=True, ax=ax)
# iterate through the axes containers to add bar labels
for c in ax.containers:
    # add the section label to the middle of each bar
    ax.bar_label(c, label_type='center')
# add the label for the total bar length by adding only the last container to the top of the bar
_ = ax.bar_label(ax.containers[-1])
Plot pivoted (wide) df with pandas.DataFrame.plot
# pivot the dataframe
dfp = df.pivot(index='year', columns='sector', values='count')
# plot the dataframe
ax = dfp.plot(kind='bar', stacked=True, rot=0, figsize=(9, 7))
# add labels
for c in ax.containers:
    ax.bar_label(c, label_type='center')
_ = ax.bar_label(ax.containers[-1])
I am working on extraction of raw data from various sources. After a process, I could form a dataframe that looked like this.
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into relevant columns, in the order the raw data occurs: Price, Year, Mileage, Name, Date
I have tried using df.data.split('-', expand=True) with other delimiter options in sequence, along with some lambda functions, but haven't had much success.
I need help splitting this data into the relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03
Try splitting on '\n' first, then on '-':
df[["Price", "Year-Mileage", "Name", "Date"]] = df.data.str.split('\n', expand=True)
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
df.drop(columns=["data", "Year-Mileage"], inplace=True)
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
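As an alternative sketch, not part of the answer above, a single regular expression with named groups can pull out all five fields at once and leave the year and mileage as integers. The pattern assumes every row follows the exact layout shown in the question (rupee sign, price, newline, year, dash, mileage in km, newline, name, newline, date). The names raw, pattern and cars are illustrative.

```python
import pandas as pd

# two of the question's rows; the strings contain real newline characters
raw = pd.DataFrame({"data": [
    "₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16",
    "₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26",
]})

# one named group per output column; '.' does not match '\n',
# so 'name' and 'date' each stop at the end of their own line
pattern = (
    r"₹\s*(?P<price>[\d,]+)\n"
    r"(?P<year>\d{4})\s*-\s*(?P<mileage>[\d,]+)\s*km\n"
    r"(?P<name>.+)\n"
    r"(?P<date>.+)"
)
cars = raw["data"].str.extract(pattern)

# convert the numeric fields; the price keeps its Indian digit grouping as text
cars["year"] = cars["year"].astype(int)
cars["mileage"] = cars["mileage"].str.replace(",", "", regex=False).astype(int)
print(cars)
```

Rows that do not match the assumed layout would come back as NaN, which would make the integer conversion fail, so this is only suitable when the layout is consistent.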
My table looks something like this:
YEAR RESPONSIBLE DISTRICT
2014 01 - PARIS
2014 01 - PARIS
2014 01 - PARIS
2014 01 - PARIS
2014 01 - PARIS
... ... ...
2017 15 - SAN ANTONIO
2017 15 - SAN ANTONIO
2017 15 - SAN ANTONIO
2017 15 - SAN ANTONIO
2017 15 - SAN ANTONIO
After running
g = df.groupby('YEAR')['RESPONSIBLE DISTRICT'].value_counts()
I got the following:
YEAR RESPONSIBLE DISTRICT
2014 05 - LUBBOCK 12312
15 - SAN ANTONIO 10457
18 - DALLAS 9885
04 - AMARILLO 9617
08 - ABILENE 8730
...
2020 21 - PHARR 5645
25 - CHILDRESS 5625
20 - BEAUMONT 5560
22 - LAREDO 5034
24 - EL PASO 4620
I have 25 districts in total. Now I want to create 25 subplots, so each subplot would represent a single district. For each subplot, I want the year 2014-2020 to be on the x-axis and the value count to be on the y-axis. How could I do that?
Is this what you expect?
import matplotlib.pyplot as plt
fig, axs = plt.subplots(5, 5, sharex=True, sharey=True, figsize=(15, 15))
for ax, (district, sr) in zip(axs.flat, g.groupby('RESPONSIBLE DISTRICT')):
    ax.set_title(district)
    ax.plot(sr.index.get_level_values('YEAR'), sr.values)
fig.tight_layout()
plt.show()
This should work.
import matplotlib.pyplot as plt
import pandas as pd
g = df.groupby('YEAR')['RESPONSIBLE DISTRICT'].value_counts()
fig, axs = plt.subplots(5, 5, constrained_layout=True)
for ax, (district, dfi) in zip(axs.ravel(), g.groupby('RESPONSIBLE DISTRICT')):
    x = dfi.index.get_level_values('YEAR').values
    y = dfi.values
    ax.bar(x, y)
    ax.set_title(district)
plt.show()
The correct way with only pandas is to reshape the dataframe with .pivot, and then use pandas.DataFrame.plot.
Imports & Data
import pandas as pd
import numpy as np # for test data
import seaborn as sns # only for seaborn option
# test data
np.random.seed(365)
rows = 100000
data = {'YEAR': np.random.choice(range(2014, 2021), size=rows),
'RESPONSIBLE DISTRICT': np.random.choice(['05 - LUBBOCK', '15 - SAN ANTONIO', '18 - DALLAS', '04 - AMARILLO', '08 - ABILENE', '21 - PHARR', '25 - CHILDRESS', '20 - BEAUMONT', '22 - LAREDO', '24 - EL PASO'], size=rows)}
df = pd.DataFrame(data)
# get the value count of each district by year and pivot the shape
dfp = df.value_counts(subset=['YEAR', 'RESPONSIBLE DISTRICT']).reset_index(name='VC').pivot(index='YEAR', columns='RESPONSIBLE DISTRICT', values='VC')
# display(dfp)
RESPONSIBLE DISTRICT 04 - AMARILLO 05 - LUBBOCK 08 - ABILENE 15 - SAN ANTONIO 18 - DALLAS 20 - BEAUMONT 21 - PHARR 22 - LAREDO 24 - EL PASO 25 - CHILDRESS
YEAR
2014 1407 1406 1485 1456 1392 1456 1499 1458 1394 1452
2015 1436 1423 1428 1441 1395 1400 1423 1442 1375 1399
2016 1480 1381 1393 1415 1446 1442 1414 1435 1452 1454
2017 1422 1388 1485 1447 1404 1401 1413 1470 1424 1426
2018 1479 1424 1384 1450 1390 1384 1445 1435 1478 1386
2019 1387 1317 1379 1457 1457 1476 1447 1459 1451 1406
2020 1462 1452 1454 1448 1441 1428 1411 1407 1402 1445
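As a side note, the same kind of wide count table can probably also be built in one step with pd.crosstab. This is a sketch with a tiny hypothetical sample rather than the test data above. One difference to be aware of: crosstab fills missing year/district combinations with 0, whereas the value_counts and pivot chain leaves them as NaN.

```python
import pandas as pd

# tiny hypothetical sample using the question's column names
df = pd.DataFrame({
    'YEAR': [2014, 2014, 2014, 2015, 2015],
    'RESPONSIBLE DISTRICT': ['01 - PARIS', '01 - PARIS', '15 - SAN ANTONIO',
                             '01 - PARIS', '15 - SAN ANTONIO'],
})

# rows are years, columns are districts, cells are occurrence counts
wide = pd.crosstab(df['YEAR'], df['RESPONSIBLE DISTRICT'])
print(wide)
```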
pandas.DataFrame.plot
Use kind='line' if a line plot is preferred.
# plot the dataframe; with subplots=True, plot returns an array of Axes rather than a Figure
axes = dfp.plot(kind='bar', subplots=True, layout=(5, 5), figsize=(20, 20), legend=False)
seaborn.catplot
seaborn is a high-level API for matplotlib
This is the easiest way because the dataframe does not need to be reshaped.
p = sns.catplot(kind='count', data=df, col='RESPONSIBLE DISTRICT', col_wrap=5, x='YEAR', height=3.5)
p.set_titles(row_template='{row_name}', col_template='{col_name}')  # shortens the titles
I have a question about DataFrames. I have written code with Selenium to extract a table from a website. However, I am unsure how to convert the Selenium text into a DataFrame and export it to CSV. Below is my code.
import time
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you are using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the dataframe.
Then append each page's dataframe to an initially empty dataframe and export the result to CSV.
import time
from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        # grab the HTML of the table on the current page
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        # wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
        df = pd.read_html(StringIO(table))[0]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        dfbase = pd.concat([dfbase, df], ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        # no enabled "next" button was found, so this is the last page
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
Basically, this is a SQL query task that I am trying to perform in Python.
Is there a way to get the top 10 sellers from each country without creating new DataFrames?
Table for example:
df = pd.DataFrame(
{
'Seller_ID': [1321, 1245, 1567, 1876, 1345, 1983, 1245, 1623, 1756, 1555, 1424, 1777,
2321, 2245, 2567, 2876, 2345, 2983, 2245, 2623, 2756, 2555, 2424, 2777],
'Country' : ['India','India','India','India','India','India','India','India','India','India','India','India',
'UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
'Month' : ['Jan','Mar','Mar','Feb','May','May','Jun','Aug','Dec','Sep','Apr','Jul',
'Jan','Mar','Mar','Feb','May','May','Jun','Aug','Dec','Sep','Apr','Jul'],
'Sales' : [456, 876, 345, 537, 128, 874, 458, 931, 742, 682, 386, 857,
456, 876, 345, 537, 128, 874, 458, 931, 742, 682, 386, 857]
})
df
Table Output:
Seller_ID Country Month Sales
0 1321 India Jan 456
1 1245 India Mar 876
2 1567 India Mar 345
3 1876 India Feb 537
4 1345 India May 128
5 1983 India May 874
6 1245 India Jun 458
7 1623 India Aug 931
8 1756 India Dec 742
9 1555 India Sep 682
10 1424 India Apr 386
11 1777 India Jul 857
12 2321 UK Jan 456
13 2245 UK Mar 876
14 2567 UK Mar 345
15 2876 UK Feb 537
16 2345 UK May 128
17 2983 UK May 874
18 2245 UK Jun 458
19 2623 UK Aug 931
20 2756 UK Dec 742
21 2555 UK Sep 682
22 2424 UK Apr 386
23 2777 UK Jul 857
I wrote the line of code below, but it violates the top-10-per-country condition and gives wrong results.
df.loc[df['Country'].isin(['India','UK'])].sort_values(['Sales'], ascending=False)[0:20]
Here is another version that works, but it doesn't look very smart because it needs to create new dataframes:
a = pd.DataFrame(df.loc[df['Country'] == 'India'].sort_values(['Sales'], ascending=False)[0:10])
b = pd.DataFrame(df.loc[df['Country'] == 'UK'].sort_values(['Sales'], ascending=False)[0:10])
top10_ofeach = pd.concat([a,b], ignore_index=True)
The most I can improve here is to loop over the countries, but I am looking for a much smarter way to do it overall. I can't think of a better approach.
This seems to be a duplicate of "Pandas get topmost n records within each group":
df.sort_values(['Sales'], ascending=False).groupby('Country').head(10)
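To see why this keeps the top rows per group, here is a small sketch with a reduced, made-up sample and n=2 instead of 10. Sorting first puts the largest sales on top, and groupby(...).head(n) then keeps the first n rows of each country in that sorted order. The name top2 is illustrative.

```python
import pandas as pd

# reduced sample: four sellers per country
df = pd.DataFrame({
    'Seller_ID': [1321, 1245, 1567, 1876, 2321, 2245, 2567, 2876],
    'Country': ['India'] * 4 + ['UK'] * 4,
    'Sales': [456, 876, 345, 537, 128, 874, 458, 931],
})

# sort once by Sales, then keep the first 2 rows seen for each country
top2 = df.sort_values('Sales', ascending=False).groupby('Country').head(2)
print(top2)
```

No intermediate per-country dataframes are needed, and the original row labels are preserved in the result.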
I have a dataframe with double index, it looks like this:
bal:
ano unit period
business_id id
9564 302 2012 reais anual
303 2011 reais anual
2361 304 2013 reais anual
305 2012 reais anual
2369 306 2013 reais anual
307 2012 reais anual
I have another dataframe that looks like this:
accounts:
A B
id
302 5964168.52 1.097601e+07
303 5774707.15 1.086787e+07
304 3652575.31 6.608469e+06
305 321076.15 6.027066e+06
306 3858137.49 9.733126e+06
I want to merge them so they look like this:
ano unit period A B
business_id id
9564 302 2012 reais anual 5964168.52 1.097601e+07
303 2011 reais anual 5774707.15 1.086787e+07
2361 304 2013 reais anual 3652575.31 6.608469e+06
305 2012 reais anual 321076.15 6.027066e+06
2369 306 2013 reais anual 3858137.49 9.733126e+06
What I'm trying to do is something like this:
bal=bal.merge(accounts,left_on='id', right_index=True)
However, I think the syntax is not correct, since I'm getting a ValueError:
ValueError: len(right_on) must equal the number of levels in the index of "left"
Can anyone help?
When this answer was written, it was not possible to join on specific levels of a MultiIndex.
You could only join on the entire index or by columns.
So take business_id out of the MultiIndex before you join:
result = (bal.reset_index('business_id').join(accounts, how='inner')
.set_index(['business_id'], append=True))
import pandas as pd
bal = pd.DataFrame({'ano': [2012, 2011, 2013, 2012, 2013, 2012], 'business_id': [9564, 9564, 2361, 2361, 2369, 2369], 'id': [302, 303, 304, 305, 306, 307], 'period': ['anual', 'anual', 'anual', 'anual', 'anual', 'anual'], 'unit': ['reais', 'reais', 'reais', 'reais', 'reais', 'reais']})
bal = bal.set_index(['business_id', 'id'])
accounts = pd.DataFrame({'A': [5964168.52, 5774707.15, 3652575.31, 321076.15, 3858137.49], 'B': [10976010.0, 10867870.0, 6608469.0, 6027066.0, 9733126.0], 'id': [302, 303, 304, 305, 306]})
accounts = accounts.set_index('id')
result = (bal.reset_index('business_id').join(accounts, how='inner')
.set_index(['business_id'], append=True))
print(result)
yields
ano period unit A B
id business_id
302 9564 2012 anual reais 5964168.52 10976010.0
303 9564 2011 anual reais 5774707.15 10867870.0
304 2361 2013 anual reais 3652575.31 6608469.0
305 2361 2012 anual reais 321076.15 6027066.0
306 2369 2013 anual reais 3858137.49 9733126.0
Inspired by ununtbu's answer, here is a version using merge:
bal.reset_index(['business_id','id']).merge(accounts, left_on = 'id', right_index= True).set_index(['id','business_id'])
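In more recent pandas versions (an assumption here; the answers above predate this), DataFrame.join can apparently match a single-level index against the MultiIndex level of the same name, so the reset_index and set_index round trip may not be needed. A minimal sketch using the same sample frames:

```python
import pandas as pd

bal = pd.DataFrame({
    'ano': [2012, 2011, 2013, 2012, 2013, 2012],
    'business_id': [9564, 9564, 2361, 2361, 2369, 2369],
    'id': [302, 303, 304, 305, 306, 307],
    'period': ['anual'] * 6,
    'unit': ['reais'] * 6,
}).set_index(['business_id', 'id'])

accounts = pd.DataFrame({
    'A': [5964168.52, 5774707.15, 3652575.31, 321076.15, 3858137.49],
    'B': [10976010.0, 10867870.0, 6608469.0, 6027066.0, 9733126.0],
    'id': [302, 303, 304, 305, 306],
}).set_index('id')

# the index name 'id' of accounts matches the 'id' level of bal's MultiIndex,
# so the join is performed on that level; 'inner' drops id 307, which has no match
result = bal.join(accounts, how='inner')
print(result)
```

If this raises an error on an older installation, the reset_index approach shown above remains the fallback.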