BeautifulSoup extract text from comment html [duplicate] - python

Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table field that's commented out:
<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>
How do I get the part 'views' and 'interaction'?

You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:
from bs4 import BeautifulSoup, Comment
html = """<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>"""
soup = BeautifulSoup(html , 'lxml')
comment = soup.find(text=lambda text:isinstance(text, Comment))
commentsoup = BeautifulSoup(comment , 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction= commentsoup.find('span', {'class': 'interaction'})
print (views.get_text(), interaction['likes'])
Outputs:
1.56M Clicks 0
If the comment is not the first on the page you would need to index it like this:
comment = soup.find_all(text=lambda text:isinstance(text, Comment))[1]
or find it from a parent element.
Updated in response to comment:
You can use the parent 'tr' element for this. The page you supplied had "shares" not "interaction", so I expect you got a NoneType object, which gave you the error you saw. You could add tests in your code for NoneType objects if you need to.
from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html , 'lxml')
for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text:isinstance(text, Comment))
    commentsoup = BeautifulSoup(comment , 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares= commentsoup.find('span', {'class': 'shares'})
    print (views.get_text(), shares['data-shares'])
Outputs:
3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
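The NoneType checks mentioned above could look roughly like this. It is only a sketch: the two-row sample table and the helper name stats_from_row are made up for illustration, and html.parser is used instead of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample: one row with commented-out statistics, one row without.
html = """<table>
<tr><td><!--<p class="statistics"><span class="views" clicks="1564058">1.56M Clicks</span></p>--></td></tr>
<tr><td>no comment here</td></tr>
</table>"""


def stats_from_row(row):
    """Return the text of span.views hidden in the row's first comment, or None."""
    comment = row.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        # This row has no comment at all, so there is nothing to re-parse.
        return None
    inner = BeautifulSoup(comment, "html.parser")
    views = inner.find("span", class_="views")
    # The comment may exist but not contain the span we are looking for.
    return views.get_text(strip=True) if views is not None else None


soup = BeautifulSoup(html, "html.parser")
results = [stats_from_row(tr) for tr in soup.find_all("tr")]
print(results)
```

Rows without a comment (or without the expected span) yield None instead of raising, so the loop can skip them.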

The simplest solution is to use str.replace(): strip the <!-- and --> markers from the HTML before parsing and leave the rest as it is. Note that this uncomments every comment in the document. Take a look at the script below.
from bs4 import BeautifulSoup
htdoc = """
<td>
<span class="release" data-release="1518739200"></span>
<!--<p class="statistics">
<span class="views" clicks="1564058">1.56M Clicks</span>
<span class="interaction" likes="0"></span>
</p>-->
</td>
"""
elem = htdoc.replace("<!--","").replace("-->","")
soup = BeautifulSoup(elem,'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')
Output:
1.56M Clicks
0

If you want only the views then:
views = soup.findAll("span", {"class": "views"})
You also can get the whole paragraph with:
p = soup.findAll("p", {"class": "statistics"})
Then you can get the data from the p.

Related

Can't get text from span with BeautifulSoup

Why can I not get the text 3.7M from those (multiple with same class name) span with below code?:
result_prices_pc = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})
HTML:
<td><span class="pc_color font-weight-bold">3.7M <img alt="c" class="small-coins-icon" src="/design/img/coins_bin.png"></span></td>
I try to get all prices with a for loop:
for price in result_prices_pc:
    print(price.text)
But I can't get the text from it.
The "problem" is the pc_color CSS class. When you load the page, you need to specify which version of the page you want (PS4/XBOX/PC). This is done with the "platform" cookie (or you can use ps4_color instead of pc_color, for example):
import requests
from bs4 import BeautifulSoup
url = "https://www.futbin.com/players"
cookies = {"platform": "pc"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
result_prices_pc = soup.find_all(
    "span", attrs={"class": "pc_color font-weight-bold"}
)
for price in result_prices_pc:
    print(price.text)
Prints:
0
1.15M
3.75M
1.7M
4.19M
1.81M
351.65K
0
1.66M
98K
1.16M
3M
775K
99K
1.62M
187K
280K
245K
220K
1.03M
395K
100K
185K
864.2K
0
1.95M
540K
0
0
89K
These elements actually have multiple class names: pc_color font-weight-bold is really two classes, pc_color and font-weight-bold.
In this case you can pass them as a list (note that a list matches elements carrying any of the listed classes, not only those with both):
result_prices_pc = soup.find_all("span", attrs={"class": ['pc_color', 'font-weight-bold']})
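To make the difference between the class-matching forms concrete, here is a small sketch. The three sample spans are made up (they are not taken from the real page), and the CSS-selector variant is an alternative that is not part of the answers above.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: the same two classes in both orders, plus a one-class span.
html = """
<span class="pc_color font-weight-bold">3.7M</span>
<span class="font-weight-bold pc_color">1.15M</span>
<span class="pc_color">0</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A string containing a space is compared with the exact attribute value, order included.
exact = soup.find_all("span", class_="pc_color font-weight-bold")

# A list matches spans that carry ANY of the listed classes.
any_of = soup.find_all("span", class_=["pc_color", "font-weight-bold"])

# A CSS selector with chained classes requires ALL of them, in any order.
both = soup.select("span.pc_color.font-weight-bold")

exact_text = [s.get_text() for s in exact]
any_text = [s.get_text() for s in any_of]
both_text = [s.get_text() for s in both]
print(exact_text, any_text, both_text)
```

If the goal is "spans that have both classes", the select() form is probably the least surprising of the three.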

How to take link from onclickvalue in BeautifulSoup?

I need help scraping a link to an image that is stored in the onclick attribute.
I tried the following, but I'm stuck on how to strip everything from onclick except the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );
But only a link is needed.
You just need to change your regular expression.
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
    print(pattern.search(a['onclick']).group('href'))
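The regular expression can also be tried on its own, without BeautifulSoup. The sketch below reuses the same pattern on the onclick string from the question; the helper name url_from_onclick is made up for illustration.

```python
import re

# Same idea as the pattern above: a quote, an http(s) URL, then the same quote again.
URL_IN_QUOTES = re.compile(r"""(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)""")


def url_from_onclick(onclick):
    """Return the first quoted URL inside an onclick value, or None if there is none."""
    match = URL_IN_QUOTES.search(onclick or "")
    return match.group("href") if match else None


onclick = (
    "ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/"
    "794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"
)
url = url_from_onclick(onclick)
print(url)
```

The backreference (?P=quote) makes sure the URL ends at the same kind of quote it started with, so both single- and double-quoted arguments work.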

Fetching <td> text next to <th> tag with specific text

I'd like to retrieve information for a couple of players from transfermarkt.de, e.g. Manuel Neuer's birthday.
Here is what the relevant HTML looks like:
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
I know I could get the date by using the following code:
soup = BeautifulSoup(source_code, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
date_of_birth = re.search(r'([0-9]+\.[0-9]+\.[0-9]+)', rows[1].get_text(), re.M)[0]
but that is quite fragile. E.g. for Robert Lewandowski the date of birth is in a different position in the table, because the attributes shown on a player's profile differ. Is there a way to logically do the following?
find the tag with 'Geburtsdatum:' in it
get the text of the tag right after it
The more robust the better :)
BeautifulSoup lets you retrieve the next matching element with find_next():
from bs4 import BeautifulSoup
import re
import requests
response = requests.get('https://www.transfermarkt.de/manuel-neuer/profil/spieler/17259', headers = {'User-Agent': 'Custom'})
soup = BeautifulSoup(response.text, "html.parser")
player_attributes = soup.find("table", class_ = 'auflistung')
rows = player_attributes.find_all('tr')
def get_table_value(rows, table_header):
    # Return the text of the first <td> that follows a matching header label
    for row in rows:
        helpers = row.find_all(text=re.compile(table_header))
        for helper in helpers:
            return helper.find_next('td').get_text(strip=True)
    return None
print(get_table_value(rows, 'Geburtsdatum'))
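The same lookup can be tried offline on a small table. This is a sketch under assumptions: the markup is a made-up sample (only the Geburtsdatum row comes from the question), and it uses find_next_sibling() rather than find_next(), which is a slightly stricter variant that stays inside the same row.

```python
import re
from bs4 import BeautifulSoup

# Sample markup; the first row is invented for illustration.
html = """
<table class="auflistung">
<tr><th>Name:</th><td>Manuel Neuer</td></tr>
<tr>
<th>Geburtsdatum:</th>
<td>
27.03.1986
</td>
</tr>
</table>
"""


def get_table_value(soup, label):
    """Return the text of the <td> next to the <th> whose text matches label."""
    th = soup.find("th", string=re.compile(label))
    if th is None:
        return None
    td = th.find_next_sibling("td")
    return td.get_text(strip=True) if td is not None else None


soup = BeautifulSoup(html, "html.parser")
birthday = get_table_value(soup, "Geburtsdatum")
missing = get_table_value(soup, "Nationalität")
print(birthday, missing)
```

Because the lookup is keyed on the header text, it does not matter in which row the date of birth appears.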

Finding tag of text-searched element in HTML

I am trying to scrape multiple web pages to compare the prices of books. Because every site has a different layout (and class names), I want to find the title of the book using regex and then the surrounding elements. An example of the code is given below.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2)
This returns:
Names1: ['Title Book']
Names2: ['Title Book']
Now I want to use this information to find the corresponding price. I know that when an element has been selected using the tags and class names, "next_sibling" can be used, however this doesn't work for the element selected by text:
select_title = soup1.find('h2', {"class": "title"})
next_sib = select_title.next_sibling
print(next_sib) # returns <p class='price>18.45
# now try the same thing on element selected by name, this will result in an error
next_sib = names1.next_sibling
How can I use the same method to find the price when I have found the element using its text?
A similar question can be found here: Find data within HTML tags using Python. However, it still uses the HTML tags.
EDIT The problem is that I have many pages with different layouts and class names. Because of that I cannot use the tag/class/id name to find the elements and I have to find the book titles using regex.
To get the price, include the 'h2' tag in the find_all() call and then use find_next('p').
In the first example, the closing quote of the p tag's class name was missing, so I have changed it to class='price'.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all('h2',string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
Or drop the tag name and search with text instead of string; find_next('p') also works on the returned strings:
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0].find_next('p').text)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0].find_next('p').text)
EDITED
Use text to get the element without tag and next_element to get the value of price.
from bs4 import BeautifulSoup
import re
html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""
html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find book titles
names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names1: ', names1[0])
print('Price1: ', names1[0].next_element.next_element.next_element)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find book titles
names2 = soup2.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print titles
print('Names2: ', names2[0])
print('Price2: ', names2[0].next_element.next_element.next_element)
Output:
Names1: Title Book
Price1: 18.45
Names2: Title Book
Price2: 18.45
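As an alternative to chaining next_element three times, you can step from the matched string up to its parent tag and take that tag's next sibling. This is only a sketch: the simplified regex Title\s+Book is an assumption made to keep the example short, not the pattern from the question.

```python
import re
from bs4 import BeautifulSoup

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""
soup = BeautifulSoup(html_page2, "html.parser")

# Searching by string only (no tag name) returns a NavigableString, not a Tag.
title = soup.find(string=re.compile(r"Title\s+Book"))

# The string is the only child of its <h2>, so it has no sibling of its own.
sibling_of_string = title.next_sibling

# Step up to the enclosing tag, whatever it is called, then look for the next <p>.
heading = title.parent
price = heading.find_next_sibling("p").get_text(strip=True)
print(heading.name, price)
```

This does not depend on the tag or class names of the title element, only on the price being in a following <p> sibling.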
You missed the closing quote of the class attribute for p.price in html_page1.
With names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))")) you get a list of NavigableString objects. Each one is the only child of its h2, which is why its next_sibling is None.
You can find a solution with regex in @Kunduk's answer.
An alternative, clearer and simpler solution that works for both html_page1 and html_page2:
soup = BeautifulSoup(html_page1, 'html.parser')
# or BeautifulSoup(html_page2, 'html.parser')
books = soup.select('div[class*=box]')
for book in books:
    book_title = book.select_one('h2').text
    book_price = book.select_one('p[class*=price]').text
    print(book_title, book_price)
div[class*=box] means a div whose class attribute contains the substring box.
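Wrapped into a function, the substring selectors can be run against both layouts at once. This is a sketch: the function name extract_books and the conversion of the price to float are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

pages = [
    "<div class='product-box'><h2 class='title'>Title Book</h2><p class='price'>18.45</p></div>",
    "<div class='page-box'><h2 class='orange-heading'>Title Book</h2><p class='blue-price'>18.45</p></div>",
]


def extract_books(html):
    """Return (title, price) pairs for every div whose class contains 'box'."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for box in soup.select("div[class*=box]"):
        title = box.select_one("h2")
        price = box.select_one("p[class*=price]")
        # Skip boxes that lack either element instead of failing on None.
        if title is not None and price is not None:
            books.append((title.get_text(strip=True), float(price.get_text(strip=True))))
    return books


results = [extract_books(page) for page in pages]
print(results)
```

Substring selectors like these are loose on purpose, so on real pages they may also match unrelated elements such as a search-box div.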

How to access text from both <p> using beautifulsoup4?

I want to grab the text from both <p> elements. How do I get that?
My code works for the first <p>, but I haven't been able to get the second <p>.
<p>
<a href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
Emerging online threats changing Homeland Security's role from merely fighting terrorism
</a>
</p>
</hgroup>
</header>
<p>
Homeland Security Secretary Kirstjen Nielsen said Monday that her department may have been founded to combat terrorism, but its mission is shifting to also confront emerging online threats.
China, Iran and other countries are mimicking the approach that Russia used to interfere in the U.S. ...
<a class="more_link" href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
<span class="icon-arrow-2">
</span>
</a>
</p>
My code is:
import ssl
import urllib.request
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = urllib.request.urlopen(article)
soup = BeautifulSoup(page, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date = date.text
headline = article.p.find('a')
headline = headline.text
content = article.p.text
print(date, headline,content)
Use the parent id with a p selector and index into the returned list to get the number of paragraphs you need. You can use the time tag to get the posting date:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/#.XJIQNDj7TX4')
soup = bs(r.content, 'lxml')
posted = soup.select_one('time').text
print(posted)
paras = [item.text.strip() for item in soup.select('#jtarticle p')]
print(paras[:2])
You could use the .find_next(). However, it's not the full article:
from bs4 import BeautifulSoup
import requests
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = requests.get(article)
soup = BeautifulSoup(page.text, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date_text = date.text
headline = article.p.find('a')
headline_text = headline.text
content_text = article.p.find_next('p').text
print(date_text, headline_text ,content_text)
