Getting all links from page, receiving javascript.void() error? - python

I am trying to get all the links to the incident reports from this page, in CSV format. However, they don't seem to be "real links" (if you open one in a new tab you just get "about:blank"). They do have their own links, visible in inspect element, so I'm pretty confused. I did find some code online to do this, but it just returned "javascript:void()" for every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/

To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# the report URL is embedded in each link's onclick="window.open('...')"
r = re.compile(r"window\.open\('(.*?)'")

data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv.

Look at the onclick property of these links and get the "real" address from it.
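For example, the address can be pulled out of the onclick text with a regex. The attribute value below is illustrative only (a made-up report URL, not copied from the site); the real value comes from the <a> tag you see in inspect element:
import re

# illustrative onclick value; the real one comes from the page's <a> tag
onclick = "window.open('https://www.avalanche.state.co.us/caic/acc/acc_report.php?acc_id=123','_blank')"
match = re.search(r"window\.open\('(.*?)'", onclick)
print(match.group(1))  # prints the URL inside window.open(...)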

Related

Python / Selenium. How to find relationship between data?

I'm creating an Amazon web scraper which just returns the name and price of all products in the search results. It will filter through a dictionary of strings (products) and collect the titles and pricing for all results. I'm doing this to calculate the average/mean of a product's pricing and also to find the highest and lowest prices for that product found on Amazon.
So making the scraper was easy enough. Here's a snippet so you understand the code I am using.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1")

# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@class="a-size-base-plus a-color-base a-text-normal"]')
shoes_list = []
for s in range(len(shoes)):
    shoes_list.append(shoes[s].text)

# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@class="a-price"]')
price_list = []
for p in range(len(price)):
    price_list.append(price[p].text)

# prices are returned with a newline instead of a decimal
# example: £9\n99 instead of £9.99
# so fixing that
temp_price_list = []
for price in price_list:
    price = price.replace("\n", ".")
    temp_price_list.append(price)
price_list = temp_price_list
So here's the issue. Almost without fail, Amazon have a handful of products with no price, and this really messes with things, because once I've sorted the data into a dataframe
title_and_price = list(zip(shoes_list[0:],price_list[0:]))
df = DataFrame(title_and_price, columns=['Product','Price'])
At some point the data gets mixed up and the price will be sorted next to the wrong product. I have left screenshots below for you to see.
Missing prices on Amazon site
Incorrect data
Unfortunately, when pulling the price data, it does not pull in a 'blank' entry when the price is missing; if it did, I wouldn't need to ask for help, as I could just display a blank price next to the item and everything would stay in order.
Is there any way to alter the code so that it can detect a non-displayed price and therefore keep all the data in order? The data stays in order right up until there's a product with no price, which in every Amazon search there is. I'd really appreciate any insight on this.
To make sure the price is married to the shoe name, you should locate the parent element of both shoe name and price, and add them as a tuple to a list (which is to become a dataframe), as in the example below:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver")  ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

df_list = []
url = 'https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1'
browser.get(url)
shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception as e:
        continue
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]')
    except Exception as e:
        continue
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns=['Shoe', 'Price'])
print(df)
This would return (depending on Amazon's appetite for serving ads in html tags similar to products):
Shoe Price
0 Nike NIKE AIR MAX MOTION 2, Men's Running Shoe... £79\n99
1 Nike Air Max 270 React Se GS Running Trainers ... £69\n99
2 NIKE Women's Air Max Low Sneakers £69\n99
3 NIKE Men's React Miler Running Shoes £109\n99
4 NIKE Men's Revolution 5 Flyease Running Shoe £38\n70
5 NIKE Women's W Revolution 6 Nn Running Shoe £48\n00
6 NIKE Men's Downshifter 10 Running Shoe £54\n99
7 NIKE Women's Court Vision Low Better Basketbal... £30\n00
8 NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac... £20\n72
9 NIKE Men's Air Max Wright Gs Running Shoe £68\n51
10 NIKE Men's Air Max Sc Trainers £54\n99
11 NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof... £134\n95
12 NIKE Women's W Superrep Go 2 Sneakers £54\n00
13 NIKE Boys Tanjun Running Shoes £35\n53
14 NIKE Women's Air Max Bella Tr 4 Gymnastics Sho... £28\n00
15 NIKE Men's Defy All Day Gymnastics Shoe £54\n95
16 NIKE Men's Venture Runner Sneaker £45\n90
17 Nike Nike Court Borough Low 2 (gs), Boy's Bask... £24\n00
18 NIKE Men's Court Royale 2 Better Essential Tra... £25\n81
19 NIKE Men's Quest 4 Running Shoe £38\n00
20 Women Trainers Running Shoes - Air Cushion Sne... £35\n69
21 Men Women Walking Trainers Light Running Breat... £42\n99
22 JSLEAP Mens Running Shoes Fashion Non Slip Ath... £44\n99
[...]
You should pay attention to a couple of things:
I am waiting for the elements to load in the page before trying to locate them; see the imports (WebDriverWait etc.)
Your results may vary, depending on your advertising profile
You can select more details for each item and use different css/xpath/etc. selectors; this is meant to give you a head start only
I found the answer to this question. Thank you to @platipus_on_fire for the answer. You certainly guided me towards the correct place, but I did find that the issue persisted in that prices were still being mismatched with the names.
The solution I found was to find the price first, before the name. If there is no price, then it shouldn't log the name. In your example you had it search for the name first, which was still causing issues, as sometimes the name is there but not the price.
Here is the code that finally works for me:
for shoe in shoes:
    try:
        price_list.append(shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]').text)
        shoe_list.append(shoe.find_element(By.CSS_SELECTOR, ".a-text-normal").text)
    except Exception as e:
        continue
shoe_list and price_list now match regardless of whether some items don't display a price. Those items are now removed from the lists entirely.
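From here the matched lists can be combined into a dataframe, for example (assuming pandas is imported as pd, as in the earlier answer):
# pair each shoe with its price and build the dataframe
df = pd.DataFrame(list(zip(shoe_list, price_list)), columns=['Product', 'Price'])
print(df)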
Thanks again to those who answered.

Web scraping table data using beautiful soup

I am trying to learn web scraping in Python for a project using Beautiful Soup by doing the following:
Scraping the Kansas City Chiefs active team player names with the college attended. This is the url used: https://www.chiefs.com/team/players-roster/.
After running it, I get an error saying "IndexError: list index out of range".
I don't know if my set classes are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_='d3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name, player_university)
TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries
Indexing
The Python Index Operator is represented by opening and closing square brackets: []. The syntax, however, requires you to put a number inside the brackets.
Example:
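A generic list, just to illustrate the operator (not the scraped cells):
items = ['a', 'b', 'c']
items[0]    # 'a'  (index 0 is the first element)
items[2]    # 'c'  (index 2 is the third element)
items[-1]   # 'c'  (negative indices count from the back)
items[7]    # IndexError: list index out of range (there are only 3 elements)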
So [7] applies indexing to the preceding iterable (all found tds), to get the element with index 7. In Python indices are 0-based, so they start with 0 for the first element.
In your statement, you take all found cells (<td> HTML elements with the specified classes) as the iterable and want to get the 8th element by indexing with [7].
row.find_all('td', class_='sorter-lastname selected')[7]
How to avoid index errors?
Are you sure there are any td elements found in the row?
If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raises an IndexError, e.g. in the given script at line 15:
Traceback (most recent call last):
File "<stdin>", line 15, in <module>
player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
Better test on length before indexing:
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

roster_table = soup.find('table', class_='d3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
Element-queries
When the index was fixed, the queried names still came back empty, as None, None.
We need to debug (thus I added the print inside the loop) and adjust the queries:
(1) for the university-name:
If you follow RJ's answer and choose the last cell without any class condition, then a negative index like -1 counts from the back, i.e. here: the last cell. The number of cells should then be at least 1 (greater than 0).
(2) for the player-name:
It appears to be in the first cell (also with a CSS class for sorting), nested either in a link title <a .. title="Player Name"> or in a following sibling as the inner text of span > a.
CSS selectors
You may use CSS selectors for that with bs4's select or select_one functions. Then you can select a path like td > ? > ? > a and get the title.
Note: the ? placeholders are left as a challenging exercise for you.
💡️ Tip: most browsers have an inspector (right click on the element, e.g. the player-name), then choose "inspect element" and an HTML source view opens selecting the element. Right-click again to "Copy" the element as "CSS selector".
Further Reading
About indexing, and the magic of negative numbers like [-1]:
AskPython: Indexing in Python - A Complete Beginners Guide
.. a bit further, about slicing:
Real Python: Indexing and Slicing
Research on Beautiful Soup here:
Using BeautifulSoup to extract the title of a link
Get text with BeautifulSoup CSS Selector
I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. scraping tables is extremely easy in pandas:
import pandas as pd
df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive; for example, print(df[0]):
Player # Pos HT WT Age Exp College
0 Josh Pederson NaN TE 6-5 235 24 R Louisiana-Monroe
1 Brandin Dandridge NaN WR 5-10 180 25 R Missouri Western
2 Justin Watson NaN WR 6-3 215 25 4 Pennsylvania
3 Jonathan Woodard NaN DE 6-5 271 28 3 Central Arkansas
4 Andrew Billings NaN DT 6-1 311 26 5 Baylor
.. ... ... .. ... ... ... .. ...
84 James Winchester 41.0 LS 6-3 242 32 7 Oklahoma
85 Travis Kelce 87.0 TE 6-5 256 32 9 Cincinnati
86 Marcus Kemp 85.0 WR 6-4 208 26 4 Hawaii
87 Chris Jones 95.0 DT 6-6 298 27 6 Mississippi State
88 Harrison Butker 7.0 K 6-4 196 26 5 Georgia Tech
[89 rows x 8 columns]

Appending elements of a list into a multi-dimensional list

Hi, I'm doing some web scraping with NBA data in Python on this page. Some elements of basketball-reference are easy to scrape, but this one is giving me some trouble with my lack of Python knowledge.
I'm able to grab the data and column headers I want, but I end up with 2 lists of data that I need to combine by their index (I think?) so that index 0 of player_injury_info lines up with index 0 of player_names etc., which I don't know how to do.
Below I've pasted some code that you can follow along.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# 2 lists - player_injury_info and player_names. they need to be combined.
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')]
                      for i in range(len(rows))]
player_injury_info = player_injury_info[1:]  # removing first element bc dont need it

player_names = [[th.getText() for th in rows[i].findAll('th')]
                for i in range(len(rows))]
player_names = player_names[1:]  # removing first element bc dont need it

### joining the lists in the correct order - the part i dont know how to do
player_list = player_names.append(player_injury_info)

### this should give me the data frame i want if i can get player_injury_info into the right format.
injury_data = pd.DataFrame(player_injury_info, columns=headers)
There might be an easier way to scrape the data into a single list / data frame? Or maybe it's fine to just join the 2 lists together like I'm trying to do. If anybody was able to follow along and can offer a solution I'd appreciate the help!
Let pandas do the parsing of the table for you.
import pandas as pd
url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]
Output:
print(injury_data)
Player ... Description
0 Onyeka Okongwu ... Out (Shoulder) - The Hawks announced that Okon...
1 Jaylen Brown ... Out (Wrist) - The Celtics announced that Brown...
2 Coby White ... Out (Shoulder) - The Bulls announced that Whit...
3 Taurean Prince ... Out (Ankle) - The Cavaliers announced F Taurea...
4 Jamal Murray ... Out (Knee) - Murray is recovering from a torn ...
5 Klay Thompson ... Out (Right Achilles) - Thompson is on track to...
6 James Wiseman ... Out (Knee) - Wiseman is on track to be ready b...
7 T.J. Warren ... Out (Foot) - Warren underwent foot surgery and...
8 Serge Ibaka ... Out (Back) - The Clippers announced Serge Ibak...
9 Kawhi Leonard ... Out (Knee) - The Clippers announced Kawhi Leon...
10 Victor Oladipo ... Out (Knee) - Oladipo could be cleared for full...
11 Donte DiVincenzo ... Out (Foot) - DiVincenzo suffered a tendon inju...
12 Jarrett Culver ... Out (Ankle) - The Timberwolves announced Culve...
13 Markelle Fultz ... Out (Knee) - Fultz will miss the rest of the s...
14 Jonathan Isaac ... Out (Knee) - Isaac is making progress with his...
15 Dario Šarić ... Out (Knee) - The Suns announced that Sario has...
16 Zach Collins ... Out (Ankle) - The Blazers announced that Colli...
17 Pascal Siakam ... Out (Shoulder) - The Raptors announced Pascal ...
18 Deni Avdija ... Out (Leg) - The Wizards announced that Avdija ...
19 Thomas Bryant ... Out (Left knee) - The Wizards announced that B...
[20 rows x 4 columns]
But if you were to iterate it yourself, I'd simply get at the rows (<tr> tags), then get the player name in the <a> tag, and combine it with that row's <td> tags. Then create your dataframe from the list of those:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
trs = soup.findAll('tr')[1:]
rows = []
for tr in trs:
    player_name = tr.find('a').text
    data = [player_name] + [x.text for x in tr.find_all('td')]
    rows.append(data)
injury_data = pd.DataFrame(rows, columns=headers)
I think you want this (a list of tuples), using zip:
players = ["joe", "bill"]
injuries = ["tooth-ache", "mental break"]
list(zip(players, injuries))
Result:
[('joe', 'tooth-ache'), ('bill', 'mental break')]
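Applied to the question's variables (assuming, as in the scraped table, each player_names entry is a one-element list holding the name and each player_injury_info entry holds the remaining three columns), something like this should line the rows up:
# zip pairs the two lists by index; concatenating gives one 4-column row per player
rows = [name + info for name, info in zip(player_names, player_injury_info)]
injury_data = pd.DataFrame(rows, columns=headers)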

Python dataframe name error: name is not defined

I scraped the link and address of each property on page 1 of a real estate website into a list. I then convert this list of lists, listing_details, into a pandas dataframe by appending the info of each property as a row (20 rows in total). My code is as follows:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
url = "https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1&scat=1%2C7%2C2%2C4%2C6%2C5%2C3%2C50%2C53"
response = requests.get(url, headers=headers)
listings = BeautifulSoup(response.content, "lxml")

listing_details = []
details = listings.findAll('div', attrs={"data-test":"tile"})
for detail in details:
    # get property links
    links = detail.findAll('a', href=True)
    for link in links:
        link = "https://www.realestate.co.nz" + link['href']
    # get addresses
    addresses = detail.findAll('h3')
    for address in addresses:
        address = address.text.strip()

df = df.append(pd.DataFrame(listing_details, columns=['Link','Location']), ignore_index=True)
print(df)
However, I got the following error: NameError: name 'df' is not defined
I changed the last two lines into print(listing_details) to see if there's something wrong with the list but found that I got 20 empty lists.
But when I write print(link) and print(address) I can see that I did scrape the correct information as follows:
https://www.realestate.co.nz/4016546/residential/sale/3436-westgate-drive-westgate
34-36 Westgate Drive, Westgate
https://www.realestate.co.nz/4016545/residential/sale/3436-westgate-drive-westgate
34-36 Westgate Drive, Westgate
https://www.realestate.co.nz/4016519/residential/sale/7-ckaitiaki-drive-clarks-beach
7C Kaitiaki Drive, Clarks Beach
https://www.realestate.co.nz/4016178/residential/sale/6423427-beach-road-mairangi-bay
6/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016177/residential/sale/4423427-beach-road-mairangi-bay
4/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016176/residential/sale/2423427-beach-road-mairangi-bay
2/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016163/residential/sale/303428-dominion-road-mount-eden
303/428 Dominion Road, Mount Eden
https://www.realestate.co.nz/4016162/residential/sale/316428-dominion-road-mount-eden
316/428 Dominion Road, Mount Eden
https://www.realestate.co.nz/4016127/residential/sale/50910-kingdon-street-newmarket
509/10 Kingdon Street, Newmarket
https://www.realestate.co.nz/4016057/residential/sale/3-s99-customs-street-west-auckland-central
3S/99 Customs Street West, Auckland Central
https://www.realestate.co.nz/4016005/residential/sale/80270-daldy-street-wynyard-quarter
802/70 Daldy Street, Wynyard Quarter
https://www.realestate.co.nz/4015970/residential/sale/20-crown-lynn-place-new-lynn
20 Crown Lynn Place, New Lynn
https://www.realestate.co.nz/4015916/residential/sale/3-s15-nelson-street-auckland-central
3S/15 Nelson Street, Auckland Central
https://www.realestate.co.nz/4015773/residential/sale/lot7280-fred-taylor-drive-westgate
Lot 72, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015774/residential/sale/lot4280-fred-taylor-drive-westgate
Lot 42, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015772/residential/sale/lot4580-fred-taylor-drive-massey
Lot 45, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015771/residential/sale/lot6680-fred-taylor-drive-massey
Lot 66, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015759/residential/sale/lot7280-fred-taylor-drive-massey
Lot 72, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015757/residential/sale/lot4780-fred-taylor-drive-westgate
Lot 47, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015758/residential/sale/lot4580-fred-taylor-drive-westgate
Lot 45, 80 Fred Taylor Drive, Westgate
Any ideas on where I went wrong? Thanks a lot!
Currently, you are not appending anything to listing_details. Your for loop should look something like this:
for detail in details:
    # get property links
    links = detail.findAll('a', href=True)
    for link in links:
        link = "https://www.realestate.co.nz" + link['href']
    # get addresses
    addresses = detail.findAll('h3')
    for address in addresses:
        address = address.text.strip()
    listing_details.append([link, address])  # order matches the ['Link', 'Location'] columns; you can decide the order of link and address
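With listing_details populated, the dataframe can then be built once, after the loop, which also sidesteps the NameError since df no longer needs to exist beforehand (a minimal sketch following the question's column names):
df = pd.DataFrame(listing_details, columns=['Link', 'Location'])
print(df)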

scraping data from wikipedia table

I'm just trying to scrape data from a Wikipedia table into a pandas dataframe.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url, 'xml')
print(soup.prettify())

My_table = soup.find('table', {'class':'wikitable sortable'})
My_table

links = My_table.findAll('a')
links

Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)

import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df=pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup

website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'xml')

table = soup.find('table', {'class':'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to its Wikipedia section: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps
