Scrape data from HTML pages with sequenced span IDs using Python

I am working with HTML pages from which I need to scrape data. The issue is that the span ids are numbered sequentially, for example:
ContentPlaceHolder_0, ContentPlaceHolder_1, ContentPlaceHolder_2 ..... ContentPlaceHolder_n
I need to get the data from all of these span tags on each page. What would be the best approach to get this data using Beautiful Soup?

You can use the CSS selector support built into BeautifulSoup. This selects every span whose id begins with ContentPlaceHolder:
soup.select('span[id^=ContentPlaceHolder]')
Example:
from bs4 import BeautifulSoup
html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_2'>2</span>
<span id='ContentPlaceHolder_3'>3</span>
<span id='xxx'>xxx</span>"""
soup = BeautifulSoup(html, 'lxml')
for s in soup.select('span[id^=ContentPlaceHolder]'):
    print(s.text)
Prints:
0
1
2
3
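
If the numeric order of the ids matters, or ids such as ContentPlaceHolderFoo must be excluded, a regular expression can be passed to find_all() instead. This is a minimal sketch under assumptions: the sample HTML, the id_pattern name and the values_by_index dictionary are illustrative and not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Sample markup (hypothetical); the spans are deliberately out of order.
html = """<span id='ContentPlaceHolder_0'>zero</span>
<span id='ContentPlaceHolder_2'>two</span>
<span id='ContentPlaceHolder_1'>one</span>
<span id='xxx'>xxx</span>"""

soup = BeautifulSoup(html, "html.parser")

# Match only ids of the exact form ContentPlaceHolder_<digits>.
id_pattern = re.compile(r"^ContentPlaceHolder_(\d+)$")

values_by_index = {}
for span in soup.find_all("span", id=id_pattern):
    # Reuse the same pattern to pull out the numeric suffix.
    index = int(id_pattern.match(span["id"]).group(1))
    values_by_index[index] = span.get_text(strip=True)

# Emit the texts ordered by their numeric suffix.
ordered_values = [values_by_index[i] for i in sorted(values_by_index)]
print(ordered_values)
```

Keying on the parsed integer rather than the raw id keeps ContentPlaceHolder_10 after ContentPlaceHolder_9, which a plain string sort would not.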

Related

How to get data from nested HTML using BeautifulSoup in Django

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to use soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in python (Django.)

This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
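
Since the question asks for all such values rather than only the first, one possible approach is a CSS selector that collects every span directly inside a div with that class. The sketch below assumes a page with several counters; the second counter and the "other" div are made-up sample data.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two counters and one unrelated div.
html_doc = """
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>1,204</span></div>
<div class="other"><span>ignored</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Select spans that are direct children of div.maincounter-number,
# and strip the surrounding whitespace from each value.
counters = [
    span.get_text(strip=True)
    for span in soup.select("div.maincounter-number > span")
]
print(counters)
```

Selecting by the div's class avoids depending on the inline style attribute, which is more likely to change between pages.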

getting content inside of all <span> using findall, only get the content that not has \n

I'm trying to extract the content that is inside the span tag under the structure:
<span style="font-weight:bold">xxx</span>
I get a large block of HTML from a web service, and from it I extract the span tags with this structure.
The problem is that if the content of a span contains a \n, it is not extracted.
For example:
print(re.findall(pattern, '<span style="font-weight:bold">AAA\n</span><span style="font-weight:bold">ooo</span>'))
>>[ooo]
#output desired should be [AAA,ooo]
How can I fix this so that the content of the span is extracted whether or not it contains a \n?

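Use BeautifulSoup to parse the HTML elements instead of a regular expression:
from bs4 import BeautifulSoup
h = """<span style="font-weight:bold">xxx</span>"""
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(h, 'html.parser')
spans = soup.find_all("span")
for span in spans:
    print(span.text)
OUTPUT
xxx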
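
If a regular expression has to stay, the likely cause of the missing match is that . does not match a newline unless the re.DOTALL flag is set. The question does not show its pattern, so the pattern below is an assumption used only to illustrate the flag.

```python
import re

html = ('<span style="font-weight:bold">AAA\n</span>'
        '<span style="font-weight:bold">ooo</span>')

# Assumed pattern: the original one is not shown in the question.
# re.DOTALL lets "." match newlines; ".*?" keeps each match inside one span.
pattern = re.compile(r'<span style="font-weight:bold">(.*?)</span>', re.DOTALL)

# Strip the trailing newline captured inside the first span.
contents = [match.strip() for match in pattern.findall(html)]
print(contents)
```

This only holds for markup as regular as the example; nested or differently attributed spans are better handled by the parser-based approach.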

Webscraping a particular element of html

I’m having trouble scraping information from government travel advice websites for a research project I’m doing on Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"
The code I'm using is:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()
At the moment this is extracting all the HTML of the page. Having inspected the website, the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr">
<p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks

If you are only interested in the data inside the element with the classes govuk-govspeak and direction-ltr, you can try these steps:
Beautiful Soup supports the most commonly used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself. Use . for a class and # for an id:
data = soup.select('.govuk-govspeak.direction-ltr')
# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]
# extract p tags
p_tags = data[0].select('p')
print(p_tags)
[<p>The FCO advise against all travel to within 10 ...]

You can find that particular <div>, then find the <p> tags under it and get the data like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
for i in div.find_all("p"):
    # Drop non-ASCII characters, then decode back to str so the pieces can be joined
    data.append(i.get_text().encode("ascii", "ignore").decode("ascii"))
data = "\n".join(data)
Now data will contain the whole content, with paragraphs separated by \n.
Note: the above code gives you only the text of the paragraphs; the headings are not included.
If you want both the headings and the paragraph text, you can extract both <h3> and <p> like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})
data = []
# Iterate over the direct children of the div, in document order
for i in div:
    if i.name == "h3":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n\n")
    if i.name == "p":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n")
data = "".join(data)
Now data will have both headings and paragraphs, where headings are followed by \n\n and paragraphs by \n.
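
Where the paragraphs need to stay attached to their headings, a dictionary keyed by heading may be easier to work with than one joined string. The sketch below runs on a small inline sample rather than the live page; sample_html, sections and current_heading are illustrative names, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
sample_html = """
<div class="govuk-govspeak direction-ltr">
<h3 id="political-situation">Political situation</h3>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h3 id="local-travel">Local travel</h3>
<p>Third paragraph.</p>
</div>
<p>Footer paragraph outside the div.</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.select_one("div.govuk-govspeak.direction-ltr")

sections = {}
current_heading = None
# find_all with a list returns h3 and p tags in document order.
for element in container.find_all(["h3", "p"]):
    text = element.get_text(strip=True)
    if element.name == "h3":
        current_heading = text
        sections[current_heading] = []
    elif current_heading is not None:
        # Paragraphs before the first heading are skipped in this sketch.
        sections[current_heading].append(text)

print(sections)
```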

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. I managed to extract this one.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here is the HTML extract I am working with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings, but none of them work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'

Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    # string=True returns only the text nodes that follow the h4, skipping the <br> tags
    for text in h4.find_next_siblings(string=True):
        print(text.strip())
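
To see this work end to end, here is a self-contained sketch that runs on the HTML extract from the question and also drops the whitespace-only text nodes. The records list and the (name, lines) tuple layout are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

# HTML extract as shown in the question.
html = """<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for h4 in soup.find_all("h4", class_="actorboxLink"):
    name = h4.get_text(strip=True)
    # Keep only the following text nodes that are not pure whitespace.
    lines = [
        text.strip()
        for text in h4.find_next_siblings(string=True)
        if text.strip()
    ]
    records.append((name, lines))

print(records)
```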

If you don't need each of the 3 elements in separate variables, you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> tags without a class by passing class_=False. If you can't isolate the <div> that you are interested in, then this solution won't work for you.
import urllib.request
from bs4 import BeautifulSoup
# url is the address of the page to scrape, as in the question
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
# class_=False matches only <div> tags that have no class attribute
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
This is for Python 3 and bs4.
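
A variation on the same idea that returns the three pieces separately uses stripped_strings on each class-less div. This is a sketch on inline sample markup; the extra div with class "header" is made up to show that it gets skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one div with a class, one without.
html = """<div class="header">Header</div>
<div>
<h4 class="actorboxLink">Decheterie de Bagnols</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>"""

soup = BeautifulSoup(html, "html.parser")

blocks = []
# class_=False keeps only the divs that have no class attribute.
for div in soup.find_all("div", class_=False):
    # stripped_strings yields each text node with whitespace removed,
    # skipping nodes that are empty after stripping.
    blocks.append(list(div.stripped_strings))

print(blocks)
```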

I'm trying to collect the text with BeautifulSoup using python

I want to know how I can collect the desired data with Beautiful Soup. Here is the HTML; I am trying to collect the text "RoSharon1977":
<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>

You have to find the div by its id, then get the next ul element, and keep drilling down until you reach the a element, then get its text:
from bs4 import BeautifulSoup
# The <a> wrapper around the handle is assumed from the description above
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
<a>RoSharon1977</a>
</li></ul>
</div></div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', attrs={'id': 'twitter-view'}).find_next('ul').find_next('li').find_next('a').text)
Or, depending on how the whole webpage looks, you could simply do:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a').text)
And if there are multiple a elements:
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print(a.text)
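
The same drill-down can also be written as a single CSS selector. The sketch below targets the li inside the twitter-view div and reads its text, so it should work whether or not the handle is wrapped in a further element; the handle variable name is illustrative.

```python
from bs4 import BeautifulSoup

# Markup as shown in the question.
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if nothing matches.
item = soup.select_one("#twitter-view li")
handle = item.get_text(strip=True) if item is not None else None
print(handle)
```

Checking for None before calling get_text() avoids an AttributeError on pages where the element is missing.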
