Webscraping a particular element of html - python

I’m having trouble scraping information from government travel advice websites for a research project I’m doing on Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"
The code I'm using is:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()
At the moment this is extracting all the html of the page. Having inspected the website the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr">
<p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks

If you are only interested in the data located inside the govuk-govspeak direction-ltr class, you can try these steps:
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the .select() method of a Tag object or of the BeautifulSoup object itself. Use . for a class and # for an id.
data = soup.select('.govuk-govspeak.direction-ltr')
# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]
# extract p tags
p_tags = data[0].select('p')
print(p_tags)
[<p>The FCO advise against all travel to within 10 ...]
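Putting the steps above together, here is a minimal, self-contained sketch. It uses a small inline HTML snippet that mimics the gov.uk markup (the live page's structure may change, so on the real site you would pass page.content from requests instead):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the live gov.uk page content
html = """
<div class="govuk-govspeak direction-ltr">
  <h3 id="political-situation">Political situation</h3>
  <p>The FCO advise against all travel to within 10 km of the border.</p>
</div>
<div class="other"><p>Unrelated content outside the target div</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Select the element that carries both classes
data = soup.select(".govuk-govspeak.direction-ltr")
h3_texts = [h3.get_text() for h3 in data[0].select("h3")]
p_texts = [p.get_text() for p in data[0].select("p")]
print(h3_texts)
print(p_texts)
```

Note that only tags inside the selected div are returned; the paragraph in the second div is excluded.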

You can find that particular <div> first, and then find the <p> tags under it and get the text like this:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
for p in div.find_all("p"):
    data.append(p.get_text())
data = "\n".join(data)
Now data will contain the whole content, with paragraphs separated by \n.
Note: the above code gives you only the paragraph text; heading content will not be included.
If you want the headings along with the paragraph text, you can extract both <h3> and <p> tags like this:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
for child in div:
    if child.name == "h3":
        data.append(child.get_text() + "\n\n")
    if child.name == "p":
        data.append(child.get_text() + "\n")
data = "".join(data)
Now data will have both headings and paragraphs, where headings are separated by \n\n and paragraphs by \n.

Related

How to scrape a website using selected words if present?

I have used BeautifulSoup to scrape a website. My current code gets me the website content in HTML format. I used soup to check whether the word is present, but I am not able to get the paragraph it belongs to.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get("https://manychat.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_body, page_head)
thirdParty = soup.find(text = 'Facebook')
Usually, the areas you're interested in searching are of a common kind, like <div> with a common class. So, you have Soup return all of the <div>s with that class, and you search the div text for your word.
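A sketch of that approach (sample HTML inlined here; on the real site you would pass page.content to BeautifulSoup, and the class name "section" is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the real page content
html = """
<div class="section"><p>We integrate with Facebook Messenger.</p></div>
<div class="section"><p>Pricing starts at $10.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
word = "Facebook"
# Collect the text of every div of the common class that mentions the word
matches = [div.get_text(strip=True)
           for div in soup.find_all("div", class_="section")
           if word in div.get_text()]
print(matches)
```

This returns the whole paragraph of text surrounding each occurrence, not just the word itself.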

How to use web scraping to get visible text on the webpage?

This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on one of the headings on the page.
I want to get the names of all the places displayed on the webpage, but I'm having trouble with it because the URL doesn't change when the filter is applied.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. bs4 (Beautiful Soup) is a Python module that lets you pull specific things out of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something that is not the text, maybe something with a certain tag you can also use bs4:
soup.find_all('p')  # Getting all p tags
soup.find_all('p', class_='Title')  # getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
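For example, once you have identified the tag and class that the place names share, you can pull them all out like this (the markup and the class name property_title below are assumptions for illustration, not TripAdvisor's actual markup):

```python
from bs4 import BeautifulSoup

# Sample markup; the real site's tag and class names will differ
html = """
<div class="listing"><a class="property_title">Cafe One</a></div>
<div class="listing"><a class="property_title">Restaurant Two</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Every place name sits in an anchor with the shared class
names = [a.get_text() for a in soup.find_all("a", class_="property_title")]
print(names)
```

Inspect the real page in your browser's developer tools to find the actual shared class, then substitute it in.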

Scraping difficult table

I have been trying to scrape a table from here for quite some time but have been unsuccessful. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I am able to scrape one element of that table that I can iterate through the columns I want from the list and eventually end up with a pandas data frame.
Here is my code so far:
from bs4 import BeautifulSoup
import requests
# url that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Lets look at what the request content looks like
print(r.content)
# use BeautifulSoup on content from request
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup)
# using prettify() in Beautiful soup indents HTML like it should be in the web page
# This can make reading the HTML a little be easier
print(soup.prettify())
# get elements within the 'main-content' tag
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)
Any help would be greatly appreciated.
That webpage employs a trick to try to stop search engines and other automated web clients (including scrapers) from finding the table data: the tables are stored in HTML comments:
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="section_heading">
<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2> <div class="section_heading_text">
<ul> <li><small>* Playoff teams</small></li>
</ul>
</div>
</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team-stats-per_game">
<table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>
...
</table>
</div>
</div>
-->
</div>
I note that the opening div has setup_commented and commented classes. JavaScript code included in the page is then executed by your browser; it loads the text from those comments and replaces the placeholder div with the contents as new HTML for the browser to display.
You can extract the comment text here:
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(r.content, 'lxml')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table_soup = BeautifulSoup(comment, 'lxml')
then continue to parse the table HTML.
This specific site has published both terms of use, and a page on data use you should probably read if you are going to use their data. Specifically, their terms state, under section 6. Site Content:
You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.
Scraping the data would fall under that heading.
Just to complete Martijn Pieters's answer (and without lxml)
from bs4 import BeautifulSoup, Comment
import requests
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
soup = BeautifulSoup(r.content, 'html.parser')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table = BeautifulSoup(comment, 'html.parser')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if cells:
        print([cell.text for cell in cells])
Partial output
[u'New Orleans Pelicans', u'71', u'240.0', u'43.6', u'91.7', u'.476', u'10.1', u'29.4', u'.344', u'33.5', u'62.4', u'.537', u'18.1', u'23.9', u'.760', u'11.0', u'36.0', u'47.0', u'27.0', u'7.5', u'5.5', u'14.5', u'21.4', u'115.5']
[u'Milwaukee Bucks*', u'69', u'241.1', u'43.3', u'90.8', u'.477', u'13.3', u'37.9', u'.351', u'30.0', u'52.9', u'.567', u'17.6', u'22.8', u'.773', u'9.3', u'40.1', u'49.4', u'26.0', u'7.4', u'6.0', u'14.0', u'19.8', u'117.6']
[u'Los Angeles Clippers', u'70', u'241.8', u'41.0', u'87.6', u'.469', u'9.8', u'25.2', u'.387', u'31.3', u'62.3', u'.502', u'22.8', u'28.8', u'.792', u'9.9', u'35.7', u'45.6', u'23.4', u'6.6', u'4.7', u'14.5', u'23.5', u'114.6']

Scrape data from HTML pages with sequenced span IDs using Python

I am working with certain HTML pages from which I need to scrape data. The issue is that span ids are numbered.
For example -
ContentPlaceHolder_0, ContentPlaceHolder_1, ContentPlaceHolder_2 ..... ContentPlaceHolder_n
I need to get data from all of these span tags at each page. What would be the best approach to get this data using Beautiful Soup?
You can try CSS selectors built-in within BeautifulSoup. This will select all span whose ids are beginning with ContentPlaceHolder:
soup.select('span[id^=ContentPlaceHolder]')
Example:
from bs4 import BeautifulSoup
html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_2'>2</span>
<span id='ContentPlaceHolder_3'>3</span>
<span id='xxx'>xxx</span>"""
soup = BeautifulSoup(html, 'lxml')
for s in soup.select('span[id^=ContentPlaceHolder]'):
    print(s.text)
Prints:
0
1
2
3

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. This one I managed to extract.
The second text extract immediately follows the closing </h4> tag and is followed by a <br> tag.
The third text extract immediately follows that <br> tag and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
If you don't need each of the 3 elements you are looking for in different variables, you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> tags that have no class using class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is python 3 & bs4
