My code:
<div id="title">
<h2>
My title <span class="subtitle">My Subtitle</span></h2></div>
If I use this code:
title = soup.find('div', id="title").h2.text
print title
>> My title My Subtitle
It matches everything. I want to match My title and My Subtitle as 2 different objects:
print title
>> My title
print subtitle
>> My subtitle
Any help?
You can get the subtitle and its preceding sibling separately:
title = soup.find('div', id="title").h2
subtitle = title.find(class_="subtitle")
print(subtitle.previous_sibling.strip(), subtitle.get_text())
Or, you can locate the text node in a non-recursive mode:
title = soup.find('div', id="title").h2
print(title.find(text=True, recursive=False).strip(),
title.find(class_="subtitle").get_text(strip=True))
Both print (Python 2 output shown):
(u'My title', u'My Subtitle')
One way to do it without using the class attribute is:
h2 = soup.find('div', id="title").h2
subtitle = h2.span.text
title = str(h2.contents[0])
h2.contents[0] returns a NavigableString object here. When printed it behaves the same as its plain-string version, so if you're only going to print it, the str() call isn't necessary.
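A quick check of what contents[0] gives back (a minimal sketch, assuming the HTML from the question):
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup('<div id="title"><h2>My title <span class="subtitle">My Subtitle</span></h2></div>', 'html.parser')
h2 = soup.find('div', id="title").h2
first = h2.contents[0]
print(type(first))                         # <class 'bs4.element.NavigableString'>
print(isinstance(first, NavigableString))  # True
print(first)                               # My title (with its trailing space)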
Check out this example to understand:
from bs4 import BeautifulSoup
#html source
html_source = '''
<div class="test">
<h2>paragraph1</h2>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser')
#find h2 tag
print(soup.h2.string)
Output:
paragraph1
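Note that .string only works when a tag has a single child node; for the markup in the original question, where the h2 also contains a span, it returns None (a quick check, assuming that markup):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div id="title"><h2>My title <span class="subtitle">My Subtitle</span></h2></div>', 'html.parser')
print(soup.h2.string)  # None, because the h2 has more than one child node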
Another solution, using the simplified_scrapy package:
from simplified_scrapy import SimplifiedDoc
html = '''
<div id="title">
<h2>
My title <span class="subtitle">My Subtitle</span></h2></div>
'''
doc = SimplifiedDoc(html)
h2 = doc.select('div#title').h2
print ('title:',h2.firstText())
print ('subtitle:',h2.span.text)
Result:
title: My title
subtitle: My Subtitle
Related
I need help extracting an image link that is stored in the onclick value.
I got this far, but I'm stuck on how to strip everything in onclick except the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );
But only a link is needed.
You just need to change your regular expression.
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
    print(pattern.search(a['onclick']).group('href'))
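For reference, the pattern captures the URL in a named group and uses the (?P=quote) backreference, so it works whether onclick wraps the URL in single or double quotes. A quick standalone check with a made-up, double-quoted example:
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
sample = 'ShowEnlargedImagePreview( "https://example.com/image.png" );'  # hypothetical double-quoted variant
print(pattern.search(sample).group('href'))  # https://example.com/image.png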
I am quite stuck with this:
<span>Alpha<span class="class_xyz">Beta</span></span>
I am trying to scrape only the first span text "Alpha" (excluding the second nested "Beta").
How would you do that?
I am trying to write a function to find all the span tags without a class attribute, but something is not working...
Thanks.
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt,'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
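Keep in mind that decompose() permanently removes the matched tag from the soup, so re-parse the HTML if you still need the full document. A non-destructive sketch (assuming the same markup) that reads only the outer span's direct text node:
from bs4 import BeautifulSoup
txt = '<span>Alpha<span class="class_xyz">Beta</span></span>'
soup = BeautifulSoup(txt, 'html.parser')
# recursive=False limits the search to the outer span's own children, skipping the nested span
print(soup.span.find(text=True, recursive=False))  # Alpha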
Here is another way that gets the text of every span tag without a class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')
# remove every span that has a class attribute
for tag in soup.select("span[class]"):
    tag.decompose()
# the remaining span tags are the ones without a class
out = [tag.text.strip() for tag in soup.select("span")]
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or if you want the whole span tag:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')
# remove every span that has a class attribute
for tag in soup.select("span[class]"):
    tag.decompose()
out = soup.select("span")
print(out)
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
I am trying to scrape multiple web pages to compare the prices of books. Because every site has a different layout (and class names), I want to find the title of the book using regex and then the surrounding elements. An example of the code is given below.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2)
This returns:
Names1: ['Title Book']
Names2: ['Title Book']
Now I want to use this information to find the corresponding price. I know that when an element has been selected using tags and class names, "next_sibling" can be used; however, this doesn't work for an element selected by its text:
select_title = soup1.find('h2', {"class": "title"})
next_sib = select_title.next_sibling
print(next_sib) # returns <p class='price>18.45
# now try the same thing on the element selected by its text; this will result in an error
next_sib = names1.next_sibling
How can I use the same method to find the price when I have found the element using its text?
A similar question can be found here: Find data within HTML tags using Python. However, it still uses the HTML tags.
EDIT: The problem is that I have many pages with different layouts and class names. Because of that I cannot use the tag/class/id names to find the elements, and I have to find the book titles using regex.
To get the price, include the 'h2' tag in find_all() and then use find_next('p').
In the first example the p tag's class attribute was missing its closing quote, so I have corrected it to class='price'.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
Or change string to text
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
EDIT: Use text to get the element without the tag, and next_element to reach the price value.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0])
print('Price1: ', names1[0].next_element.next_element.next_element)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0])
print('Price2: ', names2[0].next_element.next_element.next_element)
Output:
Names1: Title Book
Price1: 18.45
Names2: Title Book
Price2: 18.45
You missed the closing quote for the p.price class attribute in html_page1.
With names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))")) you get NavigableString objects, which is why next_sibling gives you None.
You can find a solution with regex in @Kunduk's answer.
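A quick illustration of why that happens (a minimal sketch assuming html_page1 with the quote fixed): the matched NavigableString is the only child of its h2, so it has no next sibling, but you can still navigate onward from it with find_next():
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
soup1 = BeautifulSoup(html_page1, 'html.parser')
title_text = soup1.find(text=re.compile("Title Book"))
print(title_text.next_sibling)         # None - the string is the h2's only child
print(title_text.find_next('p').text)  # 18.45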
An alternative, clearer and simpler solution that works for both html_page1 and html_page2:
soup = BeautifulSoup(html_page1, 'html.parser')
# or BeautifulSoup(html_page2, 'html.parser')
books = soup.select('div[class*=box]')
for book in books:
    book_title = book.select_one('h2').text
    book_price = book.select_one('p[class*=price]').text
    print(book_title, book_price)
div[class*=box] means a div whose class attribute contains box.
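One design note: *= is a substring match, so it could also catch unrelated classes that merely contain "box" (for example a hypothetical class='checkbox-row'); if your real pages allow it, an ends-with match is tighter. A quick check with made-up markup:
from bs4 import BeautifulSoup
html = "<div class='product-box'></div><div class='checkbox-row'></div>"
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.select("div[class*=box]")))  # 2 - the substring match catches both divs
print(len(soup.select("div[class$=box]")))  # 1 - the ends-with match keeps only product-box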
I'm trying to collect the plain text/business title from the following:
<div class = "business-detail-text>
<h1 class = "business-title" style="position:relative;" itemprop="name">H&H Construction Co.</h1>
What is the best way to do this? The style and itemprop attributes are where I get stuck. I know I can use soup.select but I've had no luck so far.
Here is my code so far:
def bbb_profiles(profile_urls):
    sauce_code = requests.get(profile_urls)
    plain_text = sauce_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for profile_info in soup.findAll("h1", {"class": "business-title"}):
        print(profile_info.string)
Is this what you need?
>>> from bs4 import BeautifulSoup
>>> txt='''<div class = "business-detail-text">
<h1 class = "business-title" style="position:relative;" itemprop="name">H&H Construction Co.</h1></div>'''
>>> soup = BeautifulSoup(txt, "html.parser")
>>> soup.find_all('h1', 'business-title')
[<h1 class="business-title" itemprop="name" style="position:relative;">H&amp;H Construction Co.</h1>]
>>> soup.find_all('h1', 'business-title')[0].text
u'H&H Construction Co.'
I see your HTML is missing a closing " after "business-detail-text", and a </div> at the very end.
I want to get the data (name, city, and address) located in a div tag from an HTML file like this:
<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
city
Address
</div>
</div>
I don't know how I can get the data I want from that specific tag.
Obviously I'm using Python with the BeautifulSoup library.
There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:
from bs4 import BeautifulSoup
html = '''<div class="mainInfoWrapper">
<h4 itemprop="name">
NAME
</h4>
<div>
<a>PROVINCE</a> - <a>CITY</a> ADDRESS
</div>
</div>'''
soup = BeautifulSoup(html)
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
When run against the URL that you provided:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content)
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
>>> print name
بیمارستان حضرت فاطمه (س)
>>> print province
تهران
>>> print city
تهران
>>> print address
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت
I'm not sure that the printed output is correct on my terminal, however, this code should produce the correct text for a properly configured terminal.
You can do it with the lxml.html module:
>>> s="""<div class="mainInfoWrapper">
... <h4 itemprop="name">name</h4>
... <div>
...
... city
...
... Address
... </div>
... </div>"""
>>>
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print document.text_content().split()
['name', 'city', 'Address']
And with BeautifulSoup to get the text between your tags:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> print soup.text
And to get the text from a specific tag, just use soup.find_all:
soup = BeautifulSoup(your_HTML_source)
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print line.text
If h4 is used only once, then you can do this:
name = soup.find('h4', attrs={'itemprop': 'name'})
print name.text
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
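From there, the city and address can be pulled out of cityaddressdiv; a minimal sketch assuming the simplified markup from the question, where they are just whitespace-separated text inside the div:
# assumes <div> city Address </div> as in the question's simplified markup
parts = cityaddressdiv.get_text().split()
city = parts[0]
address = ' '.join(parts[1:])
print(city)
print(address)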