I am a novice at this, but I've been trying to scrape data on a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA) but I keep coming up empty. I've tried BeautifulSoup and Scrapy but I can't get the text out.
Eventually I want to get the row of each individual wine in the table into a dataframe/csv (from all pages) but currently I can't even get the first wine producer name.
If you inspect the webpage, all the details are in <td> tags with no id or class.
My BeautifulSoup attempt
import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()

producer = soup2.find_all('td').get_text()
print(producer)
Which is throwing the error:
producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'
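For what it's worth, the immediate cause of that AttributeError is that prettify() returns a plain string; find_all() has to be called on the soup object, and it returns a list of tags rather than a single element. A minimal corrected version of that part would be something like the sketch below (it will still print an empty list for this page, for the reason explained in the answer further down):

import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.content, "html.parser")

# find_all must be called on the soup (not on the prettified string),
# and it returns a list, so get_text() is called per element
producers = [td.get_text(strip=True) for td in soup.find_all("td")]
print(producers)  # empty here, because the table rows are rendered client-side by JavaScript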
My Scrapy attempt
winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[#id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[#id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(#class,\
            "dwwa-page-link") #href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[#id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
            tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)

        medal = Selector(response=response).xpath('//*[#id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
            td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)
Which produces an empty df.
Any help would be greatly appreciated.
Thank you!
When you load a site you want to scrape, always inspect what it loads with the network monitor. In this case you can see that it loads the data dynamically from an API. This means that you can skip scraping altogether and load the data directly from the API into pandas:
import pandas as pd
df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')
Which gives all 14858 items:
|    | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|----|----------|------|----|-------------|-------|-------|---------|--------|-----------|---------|-------|-------|-----------------|-----------------|-----------------|
| 0 | Yealands Estate Wines | Babydoll Sauvignon Blanc | 706484 | DWWA 2022 | 7 | 88 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 1 | Yealands Estate Wines | Reserve Pinot Gris | 706478 | DWWA 2022 | 7 | 86 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 2 | Yealands Estate Wines | Babydoll Pinot Gris | 706479 | DWWA 2022 | 7 | 87 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 3 | Yealands Estate Wines | Reserve Chardonnay | 705165 | DWWA 2022 | 6 | 90 | New Zealand | Hawke's Bay | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 4 | Yealands Estate Wines | Reserve Sauvignon Blanc | 706486 | DWWA 2022 | 6 | 90 | New Zealand | Marlborough | Awatere Valley | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
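Since the stated goal is a dataframe/csv, that same dataframe can be written straight out; a minimal sketch (the filename is just an example):

import pandas as pd

df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')
df.to_csv('dwwa_2022_wines.csv', index=False)  # example filename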
Try:
import pandas as pd
url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)
# print last items in df:
print(df.tail().to_markdown())
Prints:
|    | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|----|----------|------|----|-------------|-------|-------|---------|--------|-----------|---------|-------|-------|-----------------|-----------------|-----------------|
| 14853 | Telavi Wine Cellar | Marani | 718257 | DWWA 2022 | 7 | 86 | Georgia | Kakheti | Kindzmarauli | 2021 | Red | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14854 | Štrigova | Muškat Žuti | 716526 | DWWA 2022 | 7 | 87 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14855 | Kopjar | Muscat žUti | 717754 | DWWA 2022 | 7 | 86 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022 | 7 | 87 | Germany | Württemberg | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14857 | Winnice Czajkowski | Thoma 8 Grand Selection | 719891 | DWWA 2022 | 6 | 90 | Poland | Not Applicable | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | D | 2022 | DWWA |
Related
I have been trying to scrape all data from the first page to the last page, but it returns only the first page as the output. How can I solve this? Below is my code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
pages = np.arange(2, 1589, 20)
for page in pages:
    page = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page="+str(page))
    sleep(randint(2,10))
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('div', class_="project-card-vertical h-full flex flex-col rounded border-thin border-inactive-blue overflow-hidden pointer")
    for list in lists:
        title = list.find('p', class_="project-location text-body text-base mb-3").text.replace('\n', '').strip()
        location = list.find('span', class_="text-gray-1").text.replace('\n', '').strip()
        status = list.find('span', class_="text-purple-1 font-bold").text.replace('\n', '').strip()
        units = list.find('span', class_="text-body font-semibold").text.replace('\n', '').strip()
        info = [title, location, status, units]
        print(info)
The page is loaded dynamically using an API. Therefore, with a regular GET request, you will always get the first page. You need to study how the page communicates with the server and find the request you need. I wrote an example for review.
import json
import requests

def get_info(page):
    url = f"https://services.estateintel.com/api/v2/properties?type\\[\\]=residential&page={page}"
    headers = {
        'accept': 'application/json',
        'authorization': 'false',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
    }
    response = requests.request("GET", url, headers=headers)
    json_obj = json.loads(response.text)
    for data in json_obj['data']:
        print(data['name'])
        print(data['area'], data['state'])
        print(data['status'])
        print(data['size']['value'], data['size']['unit'])
        print('------')

for page in range(1, 134):
    get_info(page)
You can choose the fields you need; this is just an example. You could also collect the rows into a dataframe instead of printing them (see the sketch after the output below). Output:
Twin Oaks Apartment
Kilimani Nairobi
Completed
0 units
------
Duchess Park
Lavington Nairobi
Completed
62 units
------
Greenvale Apartments
Kileleshwa Nairobi
Completed
36 units
------
The Urban apartments & Suites
Osu Greater Accra
Completed
28 units
------
Chateau Towers
Osu Greater Accra
Completed
120 units
------
Cedar Haus Gardens
Oluyole Oyo
Under Construction
38 units
------
10 Agoro Street
Oluyole Oyo
Completed
1 units
..............
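If you would rather end up with a dataframe than printed lines, here is a sketch along the same lines, using the same endpoint and field names as the example above; the CSV filename is just an example:

import pandas as pd
import requests

headers = {
    'accept': 'application/json',
    'authorization': 'false',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
}

rows = []
for page in range(1, 134):
    url = f"https://services.estateintel.com/api/v2/properties?type[]=residential&page={page}"
    for item in requests.get(url, headers=headers).json()['data']:
        rows.append({
            'name': item['name'],
            'area': item['area'],
            'state': item['state'],
            'status': item['status'],
            'size': f"{item['size']['value']} {item['size']['unit']}",
        })

df = pd.DataFrame(rows)
df.to_csv('estateintel_residential.csv', index=False)  # example filename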
I think it is working well, but it needs the time to sleep. Just in case, you could select your elements more specifically, e.g. with CSS selectors, and store the information in a list of dicts instead of just printing it.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
data = []

for page in range(1, 134):
    print(page)
    page = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page="+str(page))
    sleep(randint(2,10))
    soup = BeautifulSoup(page.content, 'html.parser')
    for item in soup.select('div.project-grid > a'):
        data.append({
            'title': item.h3.text.strip(),
            'location': item.find('span', class_="text-gray-1").text.strip(),
            'status': item.find('span', class_="text-purple-1 font-bold").text.strip(),
            'units': item.find('span', class_="text-body font-semibold").text.strip()
        })

pd.DataFrame(data)
Output
|    | title | location | status | units |
|----|-------|----------|--------|-------|
| 0 | Twin Oaks Apartment | Kilimani, Nairobi | Completed | Size: -- |
| 1 | Duchess Park | Lavington, Nairobi | Completed | Size: 62 units |
| 2 | Greenvale Apartments | Kileleshwa, Nairobi | Completed | Size: 36 units |
| 3 | The Urban apartments & Suites | Osu, Greater Accra | Completed | Size: 28 units |
| 4 | Chateau Towers | Osu, Greater Accra | Completed | Size: 120 units |
| 5 | Cedar Haus Gardens | Oluyole, Oyo | Under Construction | Size: 38 units |
| 6 | 10 Agoro Street | Oluyole, Oyo | Completed | Size: 1 units |
| 7 | Villa O | Oluyole, Oyo | Completed | Size: 2 units |
| 8 | Avenue Road Apartments | Oluyole, Oyo | Completed | Size: 6 units |
| 9 | 15 Alafia Street | Oluyole, Oyo | Completed | Size: 4 units |
| 10 | 12 Saint Mary Street | Oluyole, Oyo | Nearing Completion | Size: 8 units |
| 11 | RATCON Estate | Oluyole, Oyo | Completed | Size: -- |
| 12 | 1 Goodwill Road | Oluyole, Oyo | Completed | Size: 4 units |
| 13 | Anike's Court | Oluyole, Oyo | Completed | Size: 3 units |
| 14 | 9 Adeyemo Quarters | Oluyole, Oyo | Completed | Size: 4 units |
| 15 | Marigold Residency | Nairobi West, Nairobi | Under Construction | Size: -- |
| 16 | Kings Distinction | Kilimani, Nairobi | Completed | Size: -- |
| 17 | Riverview Apartments | Kyumvi, Machakos | Completed | Size: -- |
| 18 | Serene Park | Kyumvi, Machakos | Under Construction | Size: -- |
| 19 | Gitanga Duplexes | Lavington, Nairobi | Under Construction | Size: 36 units |
| 20 | Westpointe Apartments | Upper Hill, Nairobi | Completed | Size: 254 units |
| 21 | 10 Olaoluwa Street | Oluyole, Oyo | Under Construction | Size: 12 units |
| 22 | Rosslyn Grove | Nairobi West, Nairobi | Under Construction | Size: 90 units |
| 23 | 7 Kamoru Ajimobi Street | Oluyole, Oyo | Completed | Size: 2 units |
#pip install trio httpx pandas
import trio
import httpx
import pandas as pd

allin = []
keys1 = ['name', 'area', 'state']
keys2 = ['value', 'unit']

async def scraper(client, page):
    client.params = client.params.merge({'page': page})
    r = await client.get('/properties')
    allin.extend([[i.get(k, 'N/A') for k in keys1] +
                  [i['size'].get(b, 'N/A') for b in keys2]
                  for i in r.json()['data']])

async def main():
    async with httpx.AsyncClient(timeout=None, base_url='https://services.estateintel.com/api/v2') as client, trio.open_nursery() as nurse:
        client.params = {
            'type[]': 'residential'
        }
        for page in range(1, 3):
            nurse.start_soon(scraper, client, page)
    df = pd.DataFrame(allin, columns=[keys1 + keys2])
    print(df)

if __name__ == "__main__":
    trio.run(main)
Output:
0 Cedar Haus Gardens Oluyole Oyo 38 units
1 10 Agoro Street Oluyole Oyo 1 units
2 Villa O Oluyole Oyo 2 units
3 Avenue Road Apartments Oluyole Oyo 6 units
4 15 Alafia Street Oluyole Oyo 4 units
5 12 Saint Mary Street Oluyole Oyo 8 units
6 RATCON Estate Oluyole Oyo 0 units
7 1 Goodwill Road Oluyole Oyo 4 units
8 Anike's Court Oluyole Oyo 3 units
9 9 Adeyemo Quarters Oluyole Oyo 4 units
10 Marigold Residency Nairobi West Nairobi 0 units
11 Riverview Apartments Kyumvi Machakos 0 units
12 Socian Villa Apartments Kileleshwa Nairobi 36 units
13 Kings Pearl Residency Lavington Nairobi 55 units
14 Touchwood Gardens Kilimani Nairobi 32 units
15 Panorama Apartments Upper Hill Nairobi 0 units
16 Gitanga Duplexes Lavington Nairobi 36 units
17 Serene Park Kyumvi Machakos 25 units
18 Kings Distinction Kilimani Nairobi 48 units
19 Twin Oaks Apartment Kilimani Nairobi 0 units
20 Duchess Park Lavington Nairobi 70 units
21 Greenvale Apartments Kileleshwa Nairobi 36 units
22 The Urban apartments & Suites Osu Greater Accra 28 units
23 Chateau Towers Osu Greater Accra 120 units
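One design note on the async version above: the tasks run concurrently, so rows land in allin in whatever order the responses happen to finish, which is why the output does not start at page 1. If page order matters, one option (a sketch, not the original author's code) is to collect results per page and flatten them in order afterwards, passing the page as per-request params instead of mutating the shared client.params:

import trio
import httpx
import pandas as pd

keys1 = ['name', 'area', 'state']
keys2 = ['value', 'unit']
results = {}  # page number -> list of rows, so page order can be restored afterwards

async def scraper(client, page):
    # per-request params instead of mutating the shared client.params
    r = await client.get('/properties', params={'type[]': 'residential', 'page': page})
    results[page] = [[i.get(k, 'N/A') for k in keys1] +
                     [i['size'].get(b, 'N/A') for b in keys2]
                     for i in r.json()['data']]

async def main():
    async with httpx.AsyncClient(timeout=None, base_url='https://services.estateintel.com/api/v2') as client:
        async with trio.open_nursery() as nurse:
            for page in range(1, 3):
                nurse.start_soon(scraper, client, page)
    # flatten in page order once all tasks have finished
    allin = [row for page in sorted(results) for row in results[page]]
    df = pd.DataFrame(allin, columns=keys1 + keys2)
    print(df)

if __name__ == "__main__":
    trio.run(main)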
When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser, the missing resort names only load once you scroll down to them. How do I make requests.get() obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests
url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
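If you want the resorts in a dataframe rather than printed, here is a small sketch on top of the same response; pandas.json_normalize flattens the list of dicts, and the "title" key is the one shown in the loop above:

import pandas as pd
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

df = pd.json_normalize(data["data"])  # one row per resort
print(df["title"].tolist())           # just the resort names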
I'm trying to parse the HTML of a live sports results website, but my code doesn't return every span tag on the site. I saw under inspect that all the matches are in span elements, but my code can't seem to find anything on the website apart from the footer and header. I also tried with the divs; those didn't work either. I'm new to this and kinda lost. This is my code, could someone help me?
I left out the first part of the for loop for more clarity.
#Creating the urls for the different dates
my_url='https://www.livescore.com/en/football/{}'.format(d1)
print(my_url)
today=date.today()-timedelta(days=i)
d1 = today.strftime("%Y-%m-%d/")
#Opening up the connection and grabbing the html
uClient=uReq(my_url)
page_html=uClient.read()
uClient.close()
#HTML parser
page_soup=soup(page_html,"html.parser")
spans=page_soup.findAll("span")
matches=page_soup.findAll("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"})
print(spans)
The page is dynamic and rendered by JS. When you do a request, you are getting the static HTML response before it's rendered. There are a few things you could do to work with this situation:
1. Use something like Selenium, which simulates the browser operations. It'll open a browser, go to the site, and allow the site to render the page. Once the page is rendered, you THEN can get the html of that page, which will have the data. It'll work, but takes longer to process since it literally is simulating the process as you would do it manually.
2. Use the requests-HTML package, which also allows the page to be rendered (I have not tried this package before as it conflicts with my IDE Spyder). This would be similar to Selenium, without the browser actually opening. It's essentially the requests package, but with javascript support.
3. See if the data (in the static html response) is embedded in the <script> tags in json format. Sometimes you'll find it there, but it takes a little work to pull it out and conform/manipulate it into a valid json format to be read in using json.loads().
4. Find out if there is an api of some sort (check XHR in the network tab) and fetch the data directly from there.
The best option is always #4 if it's available. Why? Because the data will be consistently structured. Even if the website changes its structure or CSS (which would change the html you parse), the underlying data feeding into it will rarely change its structure. This site does have an api to access the data:
import requests
import datetime

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}

dates_list = ['20210214', '20210215', '20210216']

for dateStr in dates_list:
    url = f'https://prod-public-api.livescore.com/v1/api/react/date/soccer/{dateStr}/0.00'
    dateStr_alpha = datetime.datetime.strptime(dateStr, '%Y%m%d').strftime('%B %d')
    response = requests.get(url, headers=headers).json()
    stages = response['Stages']
    for stage in stages:
        location = stage['Cnm']
        stageName = stage['Snm']
        events = stage['Events']
        print('\n\n%s - %s\t%s' %(location, stageName, dateStr_alpha))
        print('*'*50)
        for event in events:
            outcome = event['Eps']
            team1Name = event['T1'][0]['Nm']
            if 'Tr1' in event.keys():
                team1Goals = event['Tr1']
            else:
                team1Goals = '?'
            team2Name = event['T2'][0]['Nm']
            if 'Tr2' in event.keys():
                team2Goals = event['Tr2']
            else:
                team2Goals = '?'
            print('%s\t%s %s - %s %s' %(outcome, team1Name, team1Goals, team2Name, team2Goals))
Output:
England - Premier League February 15
********************************************************************************
FT West Ham United 3 - Sheffield United 0
FT Chelsea 2 - Newcastle United 0
Spain - LaLiga Santander February 15
********************************************************************************
FT Cadiz 0 - Athletic Bilbao 4
Germany - Bundesliga February 15
********************************************************************************
FT Bayern Munich 3 - Arminia Bielefeld 3
Italy - Serie A February 15
********************************************************************************
FT Hellas Verona 2 - Parma Calcio 1913 1
Portugal - Primeira Liga February 15
********************************************************************************
FT Sporting CP 2 - Pacos de Ferreira 0
Belgium - Jupiler League February 15
********************************************************************************
FT Gent 4 - Royal Excel Mouscron 0
Belgium - First Division B February 15
********************************************************************************
FT Westerlo 1 - Lommel 1
Turkey - Super Lig February 15
********************************************************************************
FT Genclerbirligi 0 - Besiktas 3
FT Antalyaspor 1 - Yeni Malatyaspor 1
Brazil - Serie A February 15
********************************************************************************
FT Gremio 1 - Sao Paulo 2
FT Ceara 1 - Fluminense 3
FT Sport Recife 0 - Bragantino 0
Italy - Serie B February 15
********************************************************************************
FT Cosenza 2 - Reggina 2
France - Ligue 2 February 15
********************************************************************************
FT Sochaux 2 - Valenciennes 0
FT Toulouse 3 - AC Ajaccio 0
Spain - LaLiga Smartbank February 15
********************************************************************************
FT Castellon 1 - Fuenlabrada 2
FT Real Oviedo 3 - Lugo 1
...
Uganda - Super League February 16
********************************************************************************
FT Busoga United FC 1 - Bright Stars FC 1
FT Kitara FC 0 - Mbarara City 1
FT Kyetume 2 - Vipers SC 2
FT UPDF FC 0 - Onduparaka FC 1
FT Uganda Police 2 - BUL FC 0
Uruguay - Primera División: Clausura February 16
********************************************************************************
FT Boston River 0 - Montevideo City Torque 3
International - Friendlies Women February 16
********************************************************************************
FT Guatemala 3 - Panama 1
Africa - Africa Cup Of Nations U20: Group C February 16
********************************************************************************
FT Ghana U20 4 - Tanzania U20 0
FT Gambia U20 0 - Morocco U20 1
Brazil - Amazonense: Group A February 16
********************************************************************************
Postp. Manaus FC ? - Penarol AC AM ?
Now assuming you have the correct class to scrape, a simple loop would work:
for i in soup.find_all("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    print(i)
Or add it into a list:
teams = []
for i in soup.find_all("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    teams.append(i.text)
print(teams)
If this does not work, run some tests to see if you are actually scraping the correct things, e.g. print a single element.
Also, in your code you are printing "spans" and not "matches", which could also be a problem with your code.
You can also look at this post, which further explains how to do this.
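For completeness, option 3 from the list above (JSON embedded in a <script> tag) usually looks something like the sketch below. The URL, the marker string, and the way the assignment is stripped are all hypothetical and depend entirely on the page you are scraping:

import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-page").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

payload = None
for script in soup.find_all("script"):
    text = script.string or ""
    if "window.__DATA__" in text:  # hypothetical marker; inspect the page source to find the real one
        # strip the JavaScript assignment so only the JSON object remains (page-specific)
        payload = json.loads(text.split("=", 1)[1].rstrip("; \n"))
        break

if payload:
    print(list(payload.keys()))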
New coder here! I am trying to scrape web table data from multiple URLs. Each URL's web page has one table, but that table is split across multiple pages. My code only iterates through the table pages of the first URL and not the rest. So I am able to get pages 1-5 of NBA data for year 2000 only, but it stops there. How do I get my code to pull every year of data? Any help is greatly appreciated.
page = 1
year = 2000

while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year,page)
    response = requests.get(base_URL, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_='tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page += 1
        else:
            print("too many tables")
    else:
        year += 1
        page = 1
I'd consider using pandas here, as 1) its .read_html() function (which uses beautifulsoup under the hood) makes it easier to parse <table> tags, and 2) it can then easily write straight to file.
Also, it's a waste to iterate through 20 pages (for example, the first season you are after only has 4 pages; the rest are blank). So I'd consider adding something that says: once you reach a blank table, move on to the next season.
import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
year = 2000

while year < 2020:
    goToNextPage = True
    page = 1
    while goToNextPage == True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year,page)
        response = requests.get(base_URL, headers)
        if response.status_code == 200:
            temp_df = pd.read_html(base_URL)[0]
            temp_df.columns = list(temp_df.iloc[0,:])
            temp_df = temp_df[temp_df['RK'] != 'RK']

            if len(temp_df) == 0:
                goToNextPage = False
                year += 1
                continue

            print ('Aquiring Season: %s\tPage: %s' %(year, page))
            temp_df['Season'] = '%s-%s' %(year-1, year)
            results = results.append(temp_df, sort=False).reset_index(drop=True)
            page += 1

results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)
Output:
print (results.head(25).to_string())
RK NAME TEAM SALARY Season
0 1 Shaquille O'Neal, C Los Angeles Lakers $17,142,000 1999-2000
1 2 Kevin Garnett, PF Minnesota Timberwolves $16,806,000 1999-2000
2 3 Alonzo Mourning, C Miami Heat $15,004,000 1999-2000
3 4 Juwan Howard, PF Washington Wizards $15,000,000 1999-2000
4 5 Scottie Pippen, SF Portland Trail Blazers $14,795,000 1999-2000
5 6 Karl Malone, PF Utah Jazz $14,000,000 1999-2000
6 7 Larry Johnson, F New York Knicks $11,910,000 1999-2000
7 8 Gary Payton, PG Seattle SuperSonics $11,020,000 1999-2000
8 9 Rasheed Wallace, PF Portland Trail Blazers $10,800,000 1999-2000
9 10 Shawn Kemp, C Cleveland Cavaliers $10,780,000 1999-2000
10 11 Damon Stoudamire, PG Portland Trail Blazers $10,125,000 1999-2000
11 12 Antonio McDyess, PF Denver Nuggets $9,900,000 1999-2000
12 13 Antoine Walker, PF Boston Celtics $9,000,000 1999-2000
13 14 Shareef Abdur-Rahim, PF Vancouver Grizzlies $9,000,000 1999-2000
14 15 Allen Iverson, SG Philadelphia 76ers $9,000,000 1999-2000
15 16 Vin Baker, PF Seattle SuperSonics $9,000,000 1999-2000
16 17 Ray Allen, SG Milwaukee Bucks $9,000,000 1999-2000
17 18 Anfernee Hardaway, SF Phoenix Suns $9,000,000 1999-2000
18 19 Kobe Bryant, SF Los Angeles Lakers $9,000,000 1999-2000
19 20 Stephon Marbury, PG New Jersey Nets $9,000,000 1999-2000
20 21 Vlade Divac, C Sacramento Kings $8,837,000 1999-2000
21 22 Bryant Reeves, C Vancouver Grizzlies $8,666,000 1999-2000
22 23 Tom Gugliotta, PF Phoenix Suns $8,558,000 1999-2000
23 24 Nick Van Exel, PG Denver Nuggets $8,354,000 1999-2000
24 25 Elden Campbell, C Charlotte Hornets $7,975,000 1999-2000
...
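One note on the answer above: DataFrame.append was deprecated and has been removed in newer pandas releases, so on a current install the accumulation step may need to collect the per-page frames in a list and concatenate once at the end, roughly like this toy sketch:

import pandas as pd

frames = []
for page in range(1, 4):
    # in the real scraper this would be the temp_df built from pd.read_html(base_URL)[0]
    frames.append(pd.DataFrame({'RK': [page], 'NAME': ['player %s' % page]}))

results = pd.concat(frames, ignore_index=True, sort=False)
print(results)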
I am having some trouble trying to cleanly iterate through a table of sold property listings using BeautifulSoup.
In this example:
Some rows in the main table are irrelevant (like "set search filters").
The rows have unique IDs.
I have tried getting the rows using a style attribute, but this did not return results.
What would be the best approach to get just the rows for sold properties out of that table?
The end goal is to pluck out the sold price, date of sale, number of bedrooms/bathrooms/car spaces, and land area, and append them into a pandas dataframe.
from bs4 import BeautifulSoup
import requests

# Globals
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
url = 'http://house.ksou.cn/p.php?q=West+Footscray%2C+VIC'

r = requests.get(url, headers=headers)
c = r.content
soup = BeautifulSoup(c, "html.parser")

prop_table = soup.find('table', id="mainT")
#prop_table = soup.find('table', {"font-size" : "13px"})
#prop_table = soup.select('.addr') # Pluck out the listings

rows = prop_table.findAll('tr')
for row in rows:
    print(row.text)
This HTML is tricky to parse, because it doesn't have fixed structure. Unfortunately, I don't have pandas installed, so I only print the data to the screen:
import requests
from bs4 import BeautifulSoup

url = 'http://house.ksou.cn/p.php?q=West+Footscray&p={page}&s=1&st=&type=&count=300&region=West+Footscray&lat=0&lng=0&sta=vic&htype=&agent=0&minprice=0&maxprice=0&minbed=0&maxbed=0&minland=0&maxland=0'

data = []
for page in range(0, 2):  # <-- increase to number of pages you want to crawl
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'html.parser')

    for table in soup.select('table[id^="r"]'):
        name = table.select_one('span.addr').text
        price = table.select_one('span.addr').find_next('b').get_text(strip=True).split()[-1]
        sold = table.select_one('span.addr').find_next('b').find_next_sibling(text=True).replace('in', '').replace('(Auction)', '').strip()

        beds = table.select_one('img[alt="Bed rooms"]')
        beds = beds.find_previous_sibling(text=True).strip() if beds else '-'

        bath = table.select_one('img[alt="Bath rooms"]')
        bath = bath.find_previous_sibling(text=True).strip() if bath else '-'

        car = table.select_one('img[alt="Car spaces"]')
        car = car.find_previous_sibling(text=True).strip() if car else '-'

        land = table.select_one('b:contains("Land size:")')
        land = land.find_next_sibling(text=True).split()[0] if land else '-'

        building = table.select_one('b:contains("Building size:")')
        building = building.find_next_sibling(text=True).split()[0] if building else '-'

        data.append([name, price, sold, beds, bath, car, land, building])

# print the data
print('{:^25} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15}'.format('Name', 'Price', 'Sold', 'Beds', 'Bath', 'Car', 'Land', 'Building'))
for row in data:
    print('{:<25} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15} {:^15}'.format(*row))
Prints:
Name Price Sold Beds Bath Car Land Building
51 Fontein Street $770,000 07 Dec 2019 - - - - -
50 Fontein Street $751,000 07 Dec 2019 - - - - -
9 Wellington Street $1,024,999 Dec 2019 2 1 1 381 -
239 Essex Street $740,000 07 Dec 2019 2 1 1 358 101
677a Barkly Street $780,000 Dec 2019 4 1 - 380 -
23A Busch Street $800,000 30 Nov 2019 3 1 1 215 -
3/2-4 Dyson Street $858,000 Nov 2019 3 2 - 378 119
3/101 Stanhope Street $803,000 30 Nov 2019 2 2 2 168 113
2/4 Rondell Avenue $552,500 30 Nov 2019 2 - - 1,088 -
3/2 Dyson Street $858,000 30 Nov 2019 3 2 2 378 -
9 Vine Street $805,000 Nov 2019 2 1 2 318 -
39 Robbs Road $957,000 23 Nov 2019 2 2 - 231 100
29 Robbs Road $1,165,000 Nov 2019 2 1 1 266 -
5 Busch Street $700,000 Nov 2019 2 1 1 202 -
46 Indwe Street $730,000 16 Nov 2019 3 1 1 470 -
29/132 Rupert Street $216,000 16 Nov 2019 1 1 1 3,640 -
11/10 Carmichael Street $385,000 15 Nov 2019 2 1 1 1,005 -
2/16 Carmichael Street $515,000 14 Nov 2019 2 1 1 112 -
4/26 Beaumont Parade $410,000 Nov 2019 2 1 1 798 -
5/10 Carmichael Street $310,000 Nov 2019 1 1 1 1,004 -
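Since the stated end goal is a pandas dataframe, the data list built in the loop above drops straight into one once pandas is installed; a minimal sketch (the column names mirror the printed header, and the CSV filename is just an example):

import pandas as pd

columns = ['Name', 'Price', 'Sold', 'Beds', 'Bath', 'Car', 'Land', 'Building']
df = pd.DataFrame(data, columns=columns)  # `data` is the list of rows built above
df.to_csv('west_footscray_sold.csv', index=False)  # example filename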