How do I soup this without it doing nothing? - python

I'm tinkering with BS4 and web scraping, and when I parse this file with BeautifulSoup, the resulting variable is blank.
<!-- example.html -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
import bs4
example = open('example.html')
soup = bs4.BeautifulSoup(example.read())
print(soup) # returns '[]'
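A minimal sketch of one way to load the file, assuming the file on disk is not empty: open it in a `with` block and name the parser explicitly. The temporary file path is hypothetical and only there so the example runs on its own; the HTML is a shortened copy of the example above.

```python
import os
import tempfile

import bs4

# Recreate a small example.html in a temporary directory (hypothetical path,
# used only so this sketch is self-contained).
html_text = """<html><head><title>The Website Title</title></head>
<body>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>"""
path = os.path.join(tempfile.mkdtemp(), "example.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html_text)

# Read the file and name the parser explicitly, so the result does not depend
# on which optional parsers happen to be installed.
with open(path, encoding="utf-8") as example:
    soup = bs4.BeautifulSoup(example.read(), "html.parser")

title = soup.title.string
author = soup.find(id="author").get_text()
print(title)
print(author)
```

If `print(soup)` is still empty after this, it may be worth checking that the file actually has content and that the script runs from the directory you expect.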

Related

Make BeautifulSoup recognize word breaks caused by HTML <li> elements

BeautifulSoup4 does not recognize that it should insert a word break between <li> elements when extracting text:
Demo program:
#!/usr/bin/env python3
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup( HTML, 'html.parser' )
print(soup.find('body').text.strip())
Output:
First ElementSecond element
Desired output:
First Element Second element
I guess I could just globally add a space before all <li> elements. That seems like a hack?
Try iterating over .stripped_strings, which yields each text fragment with surrounding whitespace removed, and join the fragments with a space:
from bs4 import BeautifulSoup
HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup(HTML, 'html.parser')
print(' '.join(soup.body.stripped_strings))
Or extract the text of each <li> element separately and then join the results:
from bs4 import BeautifulSoup
HTML="""
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""
soup = BeautifulSoup( HTML, 'html.parser' )
lis = soup.find_all('li')
text = ' '.join([li.text.strip() for li in lis])
print(text)
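Another option, sketched here with the same demo HTML, is `get_text()` with its `separator` and `strip` arguments, which gives the same result in one call:

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<body>
<ul>
<li>First Element</li><li>Second element</li>
</ul>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")
# separator is placed between text fragments; strip=True trims each fragment
# and drops the ones that are whitespace only.
text = soup.body.get_text(separator=" ", strip=True)
print(text)
```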

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns an empty list. It seems the farthest Beautiful Soup got was the ul tag, with none of the content under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think the issue is the way you select the div by its class attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
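For what it's worth, on static HTML like the snippet in the question, `class_=` and `attrs=` select the same elements. The sketch below checks that on a trimmed copy of the question's markup. If `find_all` still returns nothing on the live page, the list may be filled in by JavaScript after the page loads; that is an assumption suggested by the `js_list` id, not something verified here.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<ul class='list' id='js_list'>
<li class="first">
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Two spellings of the same class filter.
by_keyword = soup.find_all("div", class_="winrate")
by_attrs = soup.find_all("div", attrs={"class": "winrate"})

rates = [div.get_text(strip=True) for div in by_keyword]
print(by_keyword == by_attrs)
print(rates)
```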

Can I accept or ignore the Google privacy notice when webscraping with BeautifulSoup?

I am unable to view the HTML of the Google News page when running the following code from my console. The HTML I see instead is that of the Google privacy notice (the one that starts with "Before you continue").
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.google.com/news", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
Is there a way to prevent the privacy notice from popping up at all?
A snippet of what I get instead:
<title>
Before you continue
</title>
<meta content="initial-scale=1, maximum-scale=5, width=device-width" name="viewport"/>
<link href="//www.google.com/favicon.ico" rel="shortcut icon"/>
</head>
<body>
<div class="signin">
<a class="button" href="https://accounts.google.com/ServiceLogin?hl=en-US&continue=https://news.google.com/topics/CAAqBwgKMKHQ9Qowlc7cAg&gae=cb-">
Sign in
</a>
</div>
<div class="box">
<img alt="Google" height="28" src="//www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_68x28dp.png" srcset="//www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_68x28dp.png 2x" width="68"/>
<div class="productLogoContainer">
<img alt="" aria-hidden="true" class="image" height="100%" src="https://www.gstatic.com/ac/cb/scene_cookie_wall_search_v2.svg" width="100%"/>
</div>
You can set the CONSENT cookie to avoid getting the "Before you continue" page:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0"}
cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
r = requests.get(
    "https://www.google.com/news", headers=headers, cookies=cookies
)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
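To tell whether a response is still the privacy notice, one rough check is to look at the page title, which is "Before you continue" in the snippet from the question. This is only a sketch: `is_consent_page` is a hypothetical helper, and the two HTML strings are stand-ins so the example runs without a network request.

```python
from bs4 import BeautifulSoup


def is_consent_page(html):
    """Return True if the HTML looks like the "Before you continue" notice."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    return title.startswith("Before you continue")


# Stand-in documents (assumptions, not real responses from Google).
notice = "<html><head><title>\nBefore you continue\n</title></head><body></body></html>"
news = "<html><head><title>Google News</title></head><body></body></html>"

print(is_consent_page(notice))
print(is_consent_page(news))
```

In the script above, the same check could be applied to `r.content` before parsing the news page.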

Returning body text using BeautifulSoup

I'm trying to use BeautifulSoup to scrape HTML tags off of something that was returned using ExchangeLib. What I have so far is this:
from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup
credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)
for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)
As is, this will use exchangeLib to grab the first email from my inbox via Exchange, and print specifically the unique_body which contains the body text of the email. Here is a sample of the output from print(soup):
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
My end goal is to have it print:
Hey John,
Here is a test email
From what I'm reading on BeautifulSoup documentation, the process of scraping falls between my "Soup =" line and the final print line.
My issue is that, as far as I can tell, the scraping portion of BeautifulSoup requires a class and tags such as h1, for example: name_box = soup.find('h1', attrs={'class': 'name'}). My HTML has none of this.
As someone who is new to Python, how should I go about doing this?
You can use find_all to get all the font tags and then iterate over them.
from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all('font'):
    print(span.text)
Output:
Hey John,
Here is a test email
You need to print the content of the font tags. You can use the select method and pass it a type selector for the font element.
from bs4 import BeautifulSoup as bs
html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''
soup = bs(html, 'lxml')
textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
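If the goal is just the two lines of body text, a sketch that does not depend on any particular tag is `get_text()` with a newline separator and `strip=True`, which also drops the whitespace-only fragment:

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Join the non-empty text fragments with newlines; strip=True removes
# fragments that contain only whitespace.
body_text = soup.get_text(separator="\n", strip=True)
print(body_text)
```

With the ExchangeLib code from the question, `html` would be `item.unique_body`.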

How can I find a comment with specified text string

I'm using robobrowser to parse some HTML content. It uses BeautifulSoup internally. How can I find a comment that contains a specified string?
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
In fact, I need to get TEXT_1 given that I know ANY_ID.
Thanks
Use the text argument with a function that checks that the node is a Comment and contains the ID. Then, load the comment's contents with BeautifulSoup again and find the desired element by id:
from bs4 import BeautifulSoup
from bs4 import Comment
data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment) and "ANY_ID" in text)
soup_comment = BeautifulSoup(comment, "html.parser")
text = soup_comment.find("div", id="ANY_ID").get_text()
print(text)
Prints TEXT_1.
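The same idea can be wrapped in a small function that also handles the case where no comment matches. This is only a sketch: `text_in_comment` is a hypothetical helper name, and it uses the `string` argument, which is the newer name for `text` in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup, Comment


def text_in_comment(html, element_id):
    """Return the text of the element with element_id found inside an HTML
    comment, or None if no comment contains such an element."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if element_id not in comment:
            continue
        # Parse the comment body as HTML in its own right.
        inner = BeautifulSoup(str(comment), "html.parser")
        match = inner.find(id=element_id)
        if match is not None:
            return match.get_text()
    return None


data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""

found = text_in_comment(data, "ANY_ID")
missing = text_in_comment(data, "MISSING_ID")
print(found)
print(missing)
```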
