How to drop subheaders in Wikipedia tables? - python

I am trying to web scrape a Wikipedia table into a dataframe. In the Wikipedia table, I want to drop Population density, Land area, and specifically Population (Rank). In the end I want to keep State or territory and just Population (People).
https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density
Here is my code:
wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wiki)
soup = BeautifulSoup(response.text, 'html.parser')
indiatable=soup.find('table',{'class':"wikitable"})
df=pd.read_html(str(indiatable))
df=pd.DataFrame(df[0])
data = df.drop(["Population density","Population"["Rank"],"Land area"], axis=1)
wikidata = data.rename(columns={"State or territory": "State","Population": "Population"})
print (wikidata.head())
How do I reference that subtable header specifically, to drop the Rank under Population?

Note: There is no expected result in your question, so you may have to adjust the headers. Assuming you want People renamed to Population (rather than keeping Population as is), I changed that.
To achieve your goal, simply set the header parameter while reading the HTML so that only the second header row is used; then you do not need to drop it separately:
df=pd.read_html(str(indiatable),header=1)[0]
df = df.rename(columns={"State or territory": "State","People": "Population"}).drop(['Rank'], axis=1)
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wiki)
soup = BeautifulSoup(response.text, 'html.parser')
indiatable=soup.find('table',{'class':"wikitable"})
df=pd.read_html(str(indiatable),header=1)[0]
df = df.rename(columns={"State or territory": "State","People": "Population"}).drop(['Rank'], axis=1)
Output
State                 Rank(all)  Rank(50 states)  permi2  perkm2  Population  Rank.1  mi2   km2
District of Columbia  1          —                11295   4361    689545      56      61    158
New Jersey            2          1                1263    488     9288994     46      7354  19046.8
Rhode Island          3          2                1061    410     1097379     51      1034  2678
Puerto Rico           4          —                960     371     3285874     49      3515  9103.8
Massachusetts         5          3                901     348     7029917     45      7800  20201.9
Connecticut           6          4                745     288     3605944     48      4842  12540.7
Guam                  7          —                733     283     153836      52      210   543.9
American Samoa        8          —                650     251     49710       55      77    199.4
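To answer the literal question of how to reference the subheader: if you read the table without header=1, pandas builds MultiIndex columns, and you can then drop a subcolumn by its (top, sub) label tuple. A minimal sketch, continuing from the question's code; the exact label pair is an assumption here, so inspect df.columns first:
df = pd.read_html(str(indiatable))[0]
print(df.columns)  # inspect the (top-level, sub-level) label tuples
# assuming the pair is ('Population', 'Rank') - adjust to what df.columns shows
df = df.drop(columns=[('Population', 'Rank')])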

Related

How to regex extract CAR MAKE from URL in pandas df column

I am trying to extract from URL str "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte..."
the entire Make name, i.e. "Mercedes-Benz"
BUT my pattern only returns the first letter, i.e. "M"
Please help me come up with the correct pattern to use on pandas df.
Thank you
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE PATTERNS - the make segment (e.g. "Mercedes-Benz", "Audi") is what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter notebook. Below is the code (the original answer's screenshots are not reproduced here):
I created a dummy pandas dataframe (data_df).
I created a pattern based on the structure of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
I then used the pattern to extract the required data from the URLs and saved it in another column of the same dataframe:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
I hope this is helpful.
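A minimal end-to-end sketch of that approach (data_df here is a hypothetical stand-in for the screenshotted one; the URLs are taken from the question):
import pandas as pd
data_df = pd.DataFrame({'urls': [
    "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm",
    "/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm",
]})
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
print(data_df['Car Maker'].tolist())  # ['Mercedes-Benz', 'Audi']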
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
The code below will give you the Model & VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
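Since the screenshot of the output is not reproduced here, a minimal check of those two patterns on one of the question's example URLs (the one-row data_df is a hypothetical stand-in):
import pandas as pd
data_df = pd.DataFrame({'urls': ["/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm"]})
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)  # "2020-Audi-S8"
data_df['VIN'] = data_df['urls'].str.extract(pattern3)    # the trailing hash
print(data_df)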

How to resolve, list index out of range, from scraping website?

from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
soup=BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
For some reason, the last line with header one gives me an error, "list index out of range". I am not too sure what is causing this error to happen, but I know I need this line. Here is a link to the website I am using for the data, https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. The specific table I want is the one that is below the horizontal bar chart.
Traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-47-67ef2aac7bf3> in <module>
28 data_tables.append(td.findAll('table'))
29
---> 30 header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
31
32 header1
IndexError: list index out of range
Use pandas.read_html
Read HTML tables into a list of DataFrame objects.
This answer side-steps the question to provide a more efficient method for extracting tables from Wikipedia and gives the OP the desired end result.
The following code will more easily get the desired table from the Wikipedia page.
.read_html will return a list of dataframes.
The table you're interested in is at index 4.
Clean the table
Select the rows and columns with valid data.
This method does return the table headers, but the column names are multi-level so we'll rename them.
Before renaming the columns, if you need the original data from the column names, use us_covid_data.columns which will return a list of tuples with all the column name values.
import pandas as pd
# get list of dataframes and select index 4
us_covid_data = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')[4]
# select rows and columns
us_covid_data = us_covid_data.iloc[0:56, 1:6]
# rename columns
us_covid_data.columns = ['state_territory', 'cases', 'deaths', 'recovered', 'hospitalized']
# display(us_covid_data)
state_territory cases deaths recovered hospitalized
0 Alabama 45785 1033 22082 2961
1 Alaska 1184 17 560 78
2 American Samoa 0 0 – –
3 Arizona 116892 2082 – 5272
4 Arkansas 24253 292 17834 1604
5 California 296499 6711 – –
6 Colorado 34316 1704 – 5527
7 Connecticut 46976 4338 – –
8 Delaware 12293 512 6778 –
9 District of Columbia 10569 561 1465 –
10 Florida 244151 4102 – 15150
11 Georgia 111211 2965 – 11919
12 Guam 1272 6 179 –
13 Hawaii 1012 19 746 116
14 Idaho 8222 94 2886 350
15 Illinois 151767 7144 – –
16 Indiana 49560 2698 36788 7139
17 Iowa 31906 725 24242 –
18 Kansas 17618 282 – 1269
19 Kentucky 17526 623 4785 2662
20 Louisiana 66435 3296 43026 –
21 Maine 3440 110 2787 354
22 Maryland 70497 3246 – 10939
23 Massachusetts 111110 8296 88725 10985
24 Michigan 73403 6225 52841 –
25 Minnesota 38606 1511 33907 4112
26 Mississippi 31257 1114 22167 2881
27 Missouri 24985 1077 – –
28 Montana 1249 23 678 89
29 Nebraska 20053 286 14641 1224
30 Nevada 22930 537 – –
31 New Hampshire 5914 382 4684 558
32 New Jersey 174628 15479 31014 –
33 New Mexico 14549 539 6181 2161
34 New York 400299 32307 71371 –
35 North Carolina 81331 1479 55318 –
36 North Dakota 3858 89 3350 218
37 Northern Mariana Islands 31 2 19 –
38 Ohio 57956 2927 – 7292
39 Oklahoma 16362 399 12432 1676
40 Oregon 10402 218 2846 1069
41 Pennsylvania 93876 6880 – –
42 Puerto Rico 8714 157 – –
43 Rhode Island 16991 960 – 1922
44 South Carolina 47214 838 – –
45 South Dakota 7105 97 6062 689
46 Tennessee 51509 646 31020 2860
47 Texas 240111 3013 122996 9610
48 Virgin Islands 112 6 79 –
49 Utah 25563 190 14448 1565
50 Vermont 1251 56 1022 –
51 Virginia 66740 1881 – 9549
52 Washington 38517 1370 – 4463
53 West Virginia 3461 95 2518 –
54 Wisconsin 35318 805 25542 3574
55 Wyoming 1675 20 1172 253
Addressing the original issue:
data is an empty list generated from data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
With data = data_table.tbody.findAll('tr', recursive=False)[1] and then data = [v for v in data.get_text().split('\n') if v], you will get the headers.
The output of data will be ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']
Since data_tables is generated by iterating through data, it is also empty.
header1 is generated by iterating data_tables[0], so the IndexError occurs because data_tables is empty.
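Putting that together, a short sketch (using the variables from the question's code) that recovers the header row:
rows = data_table.tbody.findAll('tr', recursive=False)
header_row = rows[1]
headers = [v for v in header_row.get_text().split('\n') if v]
print(headers)
# ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']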

Plot triple bar graph from a single column grouped by another column using pandas

This is my dataset.
Country Type Disaster Count
0 CHINA P REP Industrial Accident 415
1 CHINA P REP Transport Accident 231
2 CHINA P REP Flood 175
3 INDIA Transport Accident 425
4 INDIA Flood 206
5 INDIA Storm 121
6 UNITED STATES Storm 348
7 UNITED STATES Transport Accident 159
8 UNITED STATES Flood 92
9 PHILIPPINES Storm 249
10 PHILIPPINES Transport Accident 84
11 PHILIPPINES Flood 71
12 INDONESIA Transport Accident 136
13 INDONESIA Flood 110
14 INDONESIA Seismic Activity 77
I would like to make a triple bar chart where the labels are based on the column 'Type' and the bars are grouped based on the column 'Country'.
I have tried using (with df as the DataFrame object of the pandas library),
df.groupby('Country').plot.bar()
but the result came out as multiple bar charts representing each group in the 'Country' column.
The expected output is similar to this (example image not reproduced): one chart with the bars grouped by country and a bar for each disaster type.
What are the codes that I need to run in order to achieve this graph?
There are two ways:
df.set_index('Country').pivot(columns='Type').plot.bar()
df.set_index(['Country','Type']).plot.bar()
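For reference, a minimal runnable sketch of the first approach, using a hypothetical slice of the question's data:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'Country': ['CHINA P REP', 'CHINA P REP', 'CHINA P REP', 'INDIA', 'INDIA', 'INDIA'],
    'Type': ['Industrial Accident', 'Transport Accident', 'Flood',
             'Transport Accident', 'Flood', 'Storm'],
    'Disaster Count': [415, 231, 175, 425, 206, 121],
})
# one cluster of bars per country, one bar per disaster type
df.set_index('Country').pivot(columns='Type').plot.bar(rot=0)
plt.show()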

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html with your URL:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then if necessary remove GRAPH and HISTORY columns and replace NaNs in column # by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
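If you want to keep the BeautifulSoup step instead, remember that pd.read_html always returns a list of DataFrames, so just index into it (table as defined in the question):
df = pd.read_html(str(table))[0]
df.head()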

Rearranging groupings in bar chart from pandas dataframe

I want a grouped bar chart, but the default plot doesn't have the groupings the way I'd like, and I'm struggling to get them rearranged properly.
The dataframe looks like this:
user year cat1 cat2 cat3 cat4 cat5
0 Brad 2014 309 186 119 702 73
1 Brad 2015 280 177 100 625 75
2 Brad 2016 306 148 127 671 74
3 Brian 2014 298 182 131 702 73
4 Brian 2015 295 125 117 607 76
5 Brian 2016 298 137 97 596 75
6 Chris 2014 309 171 111 654 72
7 Chris 2015 251 146 105 559 76
8 Chris 2016 231 130 105 526 75
etc
Elsewhere, the code produces two variables, user1 and user2. I want to produce a bar chart that compares the numbers for those two users over time in cat1, cat2, and cat3. So for example, if user1 and user2 were Brian and Chris, I would want a chart that looks something like this (example image not reproduced):
On an aesthetic note: I'd prefer the year labels be vertical text or a font size that fits on a single line, but it's really the dataframe pivot that's confusing me at the moment.
Select the subset of users you want to plot. Then use pivot_table to transform the DF into the required format for plotting by transposing and unstacking it.
import matplotlib.pyplot as plt
def select_user_plot(user_1, user_2, cats, frame, idx, col):
    frame = frame[(frame[idx[0]] == user_1) | (frame[idx[0]] == user_2)]
    frame_pivot = frame.pivot_table(index=idx, columns=col, values=cats).T.unstack()
    frame_pivot.plot.bar(legend=True, cmap=plt.get_cmap('RdYlGn'), figsize=(8, 8), rot=0)
Finally,
Choose the users and categories to be included in the bar plot.
user_1 = 'Brian'
user_2 = 'Chris'
cats = ['cat1', 'cat2', 'cat3']
select_user_plot(user_1, user_2, cats, frame=df, idx=['user'], col=['year'])
Note: This gives a plot close to the one the OP posted (with Year appearing in the legend instead of as the tick labels).
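If you would rather have the years as the tick labels (closer to the OP's mock-up), a small variation on the same pivot idea, assuming the df from the question:
users = ['Brian', 'Chris']
cats = ['cat1', 'cat2', 'cat3']
subset = df[df['user'].isin(users)]  # df is the dataframe from the question
# index='year' puts years on the x-axis; (category, user) pairs become the bars
subset.pivot_table(index='year', columns='user', values=cats).plot.bar(rot=0)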
