I'm using the following code in Python to capture certain text values from a webpage.
from bs4 import BeautifulSoup
import requests
url="https://example.com/page1.html"
response=requests.get(url)
soup=BeautifulSoup(response.content,'html5lib')
spans=soup.find_all('a',"menu-tags")
for span in spans:
    print(span.text)
It works perfectly when the input HTML page contains the following:
<li class="foodie">
<a href="../../-/british/" class="menu-tags" data-clickstream-city-cuisine-module>British</a>
<span>, </span>
<a href="../../-/indian/" class="menu-tags" data-clickstream-city-cuisine-module>Indian</a>
<span>, </span>
<a href="../../-/french/" class="menu-tags" data-clickstream-city-cuisine-module>French</a>
and correctly produces the following output:
British
Indian
French
However, when I use the following modified code on an input HTML page whose class name contains brackets (), no output is generated.
from bs4 import BeautifulSoup
import requests
url="https://example.com/page1.html"
response=requests.get(url)
soup=BeautifulSoup(response.content,'html5lib')
spans=soup.find_all('span',"Fw(600)")
for span in spans:
    print(span.text)
HTML code input:
<span class="Fw(600)">Pineapple</span><br/><span>Animal</span>: <span class="Fw(600)">Monkey</span><br/><span>
Expected output is
Pineapple
Monkey
But nothing is being generated.
Is it because of the brackets in the class name, and if so, how can I capture it?
Using single or double backslash(es) before brackets doesn't help either:
spans=soup.find_all('span',"Fw\(600\)")
spans=soup.find_all('span',"Fw\\(600\\)")
In addition to the comment by @nigh_anxiety:
You need to specify the class to search for as a keyword argument with the keyword class_
You could also use CSS selectors with escaping (a raw string keeps the backslashes intact):
soup.select(r'.Fw\(600\)')
Example
from bs4 import BeautifulSoup
html = '''<span class="Fw(600)">Pineapple</span><br/><span>Animal</span>: <span class="Fw(600)">Monkey</span><br/><span>'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.select(r'.Fw\(600\)'))
Output
[<span class="Fw(600)">Pineapple</span>, <span class="Fw(600)">Monkey</span>]
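To put the two suggestions side by side, here is a minimal sketch that runs the class_ keyword search and the escaped CSS selector against the sample markup from the question. The attribute selector in the last query is an extra option that is not part of the answer above; it is included only as an assumed way to avoid escaping altogether.

```python
from bs4 import BeautifulSoup

html = ('<span class="Fw(600)">Pineapple</span><br/><span>Animal</span>: '
        '<span class="Fw(600)">Monkey</span><br/>')
soup = BeautifulSoup(html, 'html.parser')

# Keyword form: the class name is compared as a plain string, so no escaping is needed.
by_keyword = [span.text for span in soup.find_all('span', class_='Fw(600)')]

# CSS selector form: the parentheses must be escaped; the raw string keeps the backslashes.
by_selector = [span.text for span in soup.select(r'span.Fw\(600\)')]

# Attribute selector (an addition, not from the answer): the quoted value needs no escaping.
by_attribute = [span.text for span in soup.select('span[class~="Fw(600)"]')]

print(by_keyword)
print(by_selector)
print(by_attribute)
```

If none of these return anything on the live page, the spans may not be present in the HTML that requests downloads, which is worth checking by printing response.text.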
Related
I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to do soup.find_all(...) but I don't know what to put inside. Can anyone help?
I am working in python (Django.)
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
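Since the question asks for all such values rather than just the first, here is a small sketch that loops over every matching div instead of indexing with [0]. The second counter in the sample markup is made up for illustration; only the first comes from the question.

```python
from bs4 import BeautifulSoup

# The second counter value is hypothetical, added to show more than one match.
html_doc = '''
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span style="color:#aaa">1,024 </span></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# get_text(strip=True) removes the surrounding whitespace from each value.
counters = [div.get_text(strip=True)
            for div in soup.find_all('div', class_='maincounter-number')]
print(counters)
```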
I would like to extract the text 'THIS IS THE TEXT I WANT TO EXTRACT' from the snippet below. Does anyone have any suggestions? Thanks!
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
from bs4 import BeautifulSoup
html = """<span class="cw-type__h2 Ingredients-title">Ingredients</span><p>THIS IS THE TEXT I WANT TO EXTRACT</p>"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.text)
Assuming there is likely more HTML on the page, I would use the class of the preceding span with an adjacent sibling combinator and a p type selector to target the appropriate p tag:
from bs4 import BeautifulSoup as bs
html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = bs(html, 'lxml')
print(soup.select_one('.Ingredients-title + p').text.strip())
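To illustrate why the sibling selector is the safer choice, here is a sketch in which the page also contains an earlier paragraph. That extra paragraph is an assumption added for the example, and html.parser is used only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# The leading <p> is hypothetical; it stands in for "more HTML" on a real page.
html = '''
<p>Some unrelated introduction</p>
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# soup.p returns the first <p> in the document, which is the wrong one here.
first_p = soup.p.text.strip()

# The adjacent sibling combinator picks the <p> that directly follows the span.
wanted = soup.select_one('.Ingredients-title + p').text.strip()

print(first_p)
print(wanted)
```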
I am working on a webscraper project and can't get BeautifulSoup to give me the text between the Div. Below is my code. Any suggestions on how to get python to print just the "5x5" without the "Div to /Div" and without the whitespace?
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print (unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use text to retrieve the text, then strip to remove whitespace
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text)
Use a faster css class selector
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
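When scraping a live page the element may be missing, in which case calling .text on the result of find() raises an AttributeError. The sketch below wraps the lookup in a helper; unit_size is a hypothetical name introduced here, not something from the answers above.

```python
from bs4 import BeautifulSoup

def unit_size(soup):
    """Return the stripped text of the first .unit-size element, or None if absent."""
    unit = soup.select_one('.unit-size')
    return unit.get_text(strip=True) if unit is not None else None

source = '''
<div class="unit-size">
5x5 </div>
'''
size = unit_size(BeautifulSoup(source, 'html.parser'))

# A page without the element yields None instead of raising.
missing = unit_size(BeautifulSoup('<p>no units here</p>', 'html.parser'))

print(size)
print(missing)
```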
I have started learning BeautifulSoup. I am trying to remove a line containing a stray </div> from an HTML document.
Most examples in the documentation deal with whole tags (both the opening and closing parts).
Is it possible to modify just one part of a tag?
For example:
</div>
<div >Hello</div>
<div data-foo="value">foo!</div>
How can I remove just the first line of this code?
You can use BeautifulSoup's unwrap() to specify the invalid tag, which will only remove the extra tags that don't have an open/close counterpart, while retaining others:
soup = BeautifulSoup(html_doc, 'html.parser')
invalid_tags = ['</div>']
for tag in invalid_tags:
    for match in soup.findAll(tag):
        match.unwrap()
print(soup)
result:
<div>Hello</div>
<div data-foo="value">foo!</div>
You don't need to do anything; the parser repairs it automatically:
from bs4 import BeautifulSoup
html_doc = '''</div>
<div>World</div>
<div data-foo="value">foo!''' # also invalid, no closing
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)
output
<div>World</div>
<div data-foo="value">foo!</div>
unwrap() is for removing a tag while keeping its contents, not for repairing markup.
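The difference between the two behaviours can be checked with a short sketch. The first part parses the markup from the question with html.parser; the second part uses a made-up snippet to show what unwrap() actually does.

```python
from bs4 import BeautifulSoup

html_doc = '''</div>
<div>Hello</div>
<div data-foo="value">foo!</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')

# The stray closing tag has no matching element, so it is not part of the tree.
divs = soup.find_all('div')
repaired = str(soup)
print(repaired)

# unwrap() removes a tag but keeps its contents (hypothetical snippet).
demo = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
demo.b.unwrap()
unwrapped = str(demo)
print(unwrapped)
```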
I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. I managed to extract this one.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here is the HTML extract I am working with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
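Here is a self-contained sketch of the same idea run against the markup from the question, written for Python 3. Two things are additions rather than part of the answer: whitespace-only siblings are skipped, and the non-breaking spaces (\xa0) seen in the question's output are replaced with ordinary spaces.

```python
from bs4 import BeautifulSoup

# \xa0 mimics the non-breaking spaces shown in the question's output.
html = '''<div>
<h4 class="actorboxLink">
Decheterie\xa0de\xa0Bagnols
</h4>
Route\xa0des\xa04\xa0Vents<br>
63810 Bagnols<br>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

results = []
for h4 in soup.find_all('h4', class_='actorboxLink'):
    name = h4.get_text(strip=True).replace('\xa0', ' ')
    # string=True restricts the siblings to text nodes; blank ones are dropped.
    lines = [s.strip().replace('\xa0', ' ')
             for s in h4.find_next_siblings(string=True)
             if s.strip()]
    results.append((name, lines))

print(results)
```

The AttributeError in the question comes from calling string methods on the ResultSet itself; stripping each item in the loop, as done here, avoids it.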
If you don't need the three values in separate variables, you could just call get_text() on the <div> to get them all in one string. If the other div tags all have classes, you can find every <div> without a class by passing class_=False. If you can't isolate the <div> that you are interested in, then this solution won't work for you.
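from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen(url).read()  # url is the page address from the question
soup = BeautifulSoup(data, "html.parser")
# class_=False matches only <div> tags that have no class attribute
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())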
BTW this is python 3 & bs4
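As a sketch of this approach without a network request, the snippet below parses inline markup and uses stripped_strings so that each value comes back as its own clean item rather than one long string. The div with class "nav" is made up for the example, to show that class_=False skips it.

```python
from bs4 import BeautifulSoup

# The first <div> is hypothetical; it stands in for other, classed divs on the page.
html = '''<div class="nav">Menu</div>
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# One list of cleaned strings per class-less <div>.
blocks = [list(div.stripped_strings)
          for div in soup.find_all('div', class_=False)]
print(blocks)
```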