BS4: How to edit the content of a <pre> that contains <> (Python)

I have some HTML that contains a pre tag:
<p>Hi!</p><pre><p>Hi!</p></pre>
I'd like to change it to:
<p>Hi!</p><pre><p>Bye!</p></pre>
The naïve thing to do seems to be:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
pre_tag.string = "<p>bye!</p>"
print(str(soup))
but that escapes the markup, giving <p>Hi!</p><pre>&lt;p&gt;bye!&lt;/p&gt;</pre>
In the BS4 docs there's a section on output formatters that gives an example of using cdata:
from bs4.element import CData
soup = BeautifulSoup("<a></a>", 'html.parser')
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="html"))
# <a>
# <![CDATA[one < three]]>
# </a>
Which looks like what's needed, except that it also wraps the unescaped characters in a CDATA section; not good inside a <pre>.
This question: Beautiful Soup replaces < with &lt; looks like it's going in this vague direction, but isn't about the insides of a <pre> tag.
This question: customize BeautifulSoup's prettify by tag seems like overkill, and is also from the BS3 era.
P.S. The example above is indicative of wanting to do all kinds of things to the contents of a <pre>, not just change Hi to Bye (before anyone asks).

Either you can use the API to construct the new contents:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
new_tag = soup.new_tag("p")
new_tag.append("bye!")
pre_tag.clear()
pre_tag.append(new_tag)
print(str(soup))
Or you can provide the HTML to another BeautifulSoup instance and use that:
from bs4 import BeautifulSoup
markup = """<p>Hi!</p><pre><p>Hi!</p></pre>"""
soup = BeautifulSoup(markup, "html.parser")
pre_tag = soup.pre
soup2 = BeautifulSoup("<p>bye!</p>", "html.parser")
pre_tag.clear()
pre_tag.append(soup2)
print(str(soup))
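The second approach can be wrapped in a small reusable helper (set_inner_html is a hypothetical name, not part of the BS4 API) that swaps a tag's contents for freshly parsed HTML:

```python
from bs4 import BeautifulSoup

def set_inner_html(tag, new_html):
    """Replace tag's contents with the parse tree of new_html.
    (Hypothetical helper wrapping the second approach above.)"""
    tag.clear()
    tag.append(BeautifulSoup(new_html, "html.parser"))

soup = BeautifulSoup("<p>Hi!</p><pre><p>Hi!</p></pre>", "html.parser")
set_inner_html(soup.pre, "<p>Bye!</p>")
print(str(soup))  # <p>Hi!</p><pre><p>Bye!</p></pre>
```

The angle brackets survive because the replacement goes through a parser rather than through .string, which would escape them on output.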

Related

Handle angle bracket in pre tag using BeautifulSoup

I have a string like this
html = "<pre>City_<cityname>_001</pre>"
When I try to parse this with BeautifulSoup 4, using the following code,
>>> from bs4 import BeautifulSoup
>>> html = "<pre>City_<cityname>_001</pre>"
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup
<pre>City_<cityname>_001</cityname></pre>
>>> soup.text
City__001
As can be seen, BeautifulSoup treats cityname as a new tag.
Is there any way in which this could be avoided to get the correct text and html?
The contents of comments are not parsed as markup. You could make the content of <pre> a comment before parsing and then extract() the comment later.
import bs4
html = "<pre>City_<cityname>_001</pre>"
soup = bs4.BeautifulSoup(html.replace("<pre>", "<pre><!--").replace("</pre>", "--></pre>"), "lxml")
pre = soup.find('pre')
pre_comment = pre.find(text=lambda text: isinstance(text, bs4.Comment)).extract()
print(pre_comment)
Output:
City_<cityname>_001
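The same hack works without lxml; a minimal self-contained sketch using the stdlib html.parser backend instead:

```python
import bs4

raw = "<pre>City_<cityname>_001</pre>"
# Wrap the <pre> contents in a comment so the parser leaves them alone
wrapped = raw.replace("<pre>", "<pre><!--").replace("</pre>", "--></pre>")
soup = bs4.BeautifulSoup(wrapped, "html.parser")
# Pull the comment back out; its text is the original, unparsed content
pre_comment = soup.pre.find(text=lambda t: isinstance(t, bs4.Comment)).extract()
print(pre_comment)  # City_<cityname>_001
```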
It is a bit of a hack, but you can replace the bracket-wrapped strings with placeholders and then format the parsed text with the captured values afterwards:
import re
from bs4 import BeautifulSoup as soup
html = "<pre>City_<cityname>_001</pre>"
_html, _vals = re.sub(r'(?<=_)<\w+>(?=_)', '{}', html), re.findall(r'(?<=_)<\w+>(?=_)', html)
new_result = soup(_html, 'html.parser').find('pre').text.format(*_vals)
Output:
'City_<cityname>_001'
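Another option along the same lines (a sketch, assuming the <pre>...</pre> boundaries can be matched reliably, and using Python 3's html.escape; cgi.escape is the Python 2 equivalent) is to escape the inner text before parsing, so BeautifulSoup never mistakes the stray brackets for tags:

```python
import re
from html import escape
from bs4 import BeautifulSoup

raw = "<pre>City_<cityname>_001</pre>"
# Escape everything between <pre> and </pre> before handing it to the parser
safe = re.sub(r"<pre>(.*?)</pre>",
              lambda m: "<pre>%s</pre>" % escape(m.group(1)),
              raw, flags=re.S)
soup = BeautifulSoup(safe, "html.parser")
print(soup.pre.text)  # City_<cityname>_001
```

The .text accessor unescapes the entities again, so the original string comes back intact.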

beautifulsoup can't find any tags

I have a script that I've used for several years. One particular page on the site loads and returns soup, but all my find calls return no result. This is old code that has worked on this site in the past. Instead of searching for a specific <div>, I simplified it to look for any table, tr, or td with find or findAll. I've tried various methods of opening the page, including lxml, all with no results.
My interests are in the player_basic and player_records div's
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
import urllib2
url = "http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456"
#html = urllib2.urlopen(url).read()
html = urllib2.urlopen(url,"lxml")
soup = BeautifulSoup(html)
#div = soup.find('div', {"class":"player_basic"})
#div = soup.find('div', {"class":"player_records"})
item = soup.findAll('td')
print item
You're not reading the response. Try this:
import urllib2
url = 'http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456'
response = urllib2.urlopen(url)  # no second argument: urlopen would send it as POST data
html = response.read()
then you can use it with BeautifulSoup. If it still does not work, there are strong reasons to believe that there is malformed HTML in that page (missing closing tags, etc.), since the parsers that BeautifulSoup uses (especially html.parser) are not very tolerant of that.
UPDATE: try using lxml parser:
soup = BeautifulSoup(html, 'lxml')
tds = soup.find_all('td')
print len(tds)
# 142

find_all with camelCase tag names with BeautifulSoup 4

I'm trying to scrape an xml file with BeautifulSoup 4.4.0 that has tag names in camelCase and find_all doesn't seem to be able to find them. Example code:
from bs4 import BeautifulSoup
xml = """
<hello>
world
</hello>
"""
soup = BeautifulSoup(xml, "lxml")
for x in soup.find_all("hello"):
    print x
xml2 = """
<helloWorld>
:-)
</helloWorld>
"""
soup = BeautifulSoup(xml2, "lxml")
for x in soup.find_all("helloWorld"):
    print x
The output I get is:
$ python soup_test.py
<hello>
world
</hello>
What's the correct way to look up camel cased/uppercased tag names?
For any case-sensitive parsing with BeautifulSoup, you want to parse in "xml" mode. The default mode (parsing HTML) ignores case, since HTML tag names are case-insensitive. In your case, instead of "lxml", switch the parser to "xml":
from bs4 import BeautifulSoup
xml2 = """
<helloWorld>
:-)
</helloWorld>
"""
soup = BeautifulSoup(xml2, "xml")
for x in soup.find_all("helloWorld"):
    print x
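The flip side is worth knowing if you must stay in HTML mode: HTML parsers lowercase tag names while parsing, so searching for the lowercased name matches. A minimal sketch with the stdlib parser:

```python
from bs4 import BeautifulSoup

xml2 = "<helloWorld>:-)</helloWorld>"
soup = BeautifulSoup(xml2, "html.parser")
# The parser has stored the tag as "helloworld", so the camelCase
# name finds nothing, while the lowercased name matches:
print(soup.find_all("helloWorld"))   # []
print(soup.find_all("helloworld"))   # [<helloworld>:-)</helloworld>]
```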

How to crawl the description for sfglobe using python

I am trying to use Python and Beautifulsoup to get this page from sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
soup = BeautifulSoup(html)
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?
From the question title, I'm assuming the only thing you want is the description of the article, which can be found in a <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2
import requests
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
tag = soup.find(attrs={'name': 'description'})  # find the meta tag with the description
desc = tag['content']  # a <meta> tag stores its text in the 'content' attribute
print desc
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html)
body = soup.find('span', class_='articletext intro')
# remove script tags
[s.extract() for s in body('script')]
text = ""
# iterate through non-script elements in the content body
for stuff in body.select('*'):
    # get the contents of each tag; .contents returns a list
    content = stuff.contents
    # keep it only if the list holds a single NavigableString, not a tag
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]
print text
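If collecting the bare strings is all that's needed, get_text() does roughly the same walk in one call once the <script> tags are removed. A minimal, self-contained sketch (with made-up markup standing in for the real page):

```python
from bs4 import BeautifulSoup

sample = '<span class="articletext intro">Hello <script>var x;</script><b>world</b></span>'
soup = BeautifulSoup(sample, "html.parser")
body = soup.find('span', class_='articletext intro')
# drop the script tags first so their code doesn't leak into the text
for s in body('script'):
    s.extract()
print(body.get_text())  # Hello world
```

Note the difference: get_text() concatenates every descendant string, while the loop above only keeps strings that are the sole child of a tag.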

bs4 the second comment <!-- > is missing

I am doing Python Challenge level 9 with BeautifulSoup.
url = "http://www.pythonchallenge.com/pc/return/good.html"
bs4.__version__ == '4.3.2'
There are two comments in the page source, so the output of soup should contain both.
However, when BeautifulSoup is applied, the second comment is missing.
It seems kinda weird. Any hint? Thanks!
import requests
from bs4 import BeautifulSoup
url = "http://www.pythonchallenge.com/pc/return/good.html"
page = requests.get(url, auth=("huge", "file")).text
print page
soup = BeautifulSoup(page)
print soup
Beautiful Soup is a wrapper around an HTML parser. The default parser can be strict, and when it encounters malformed HTML it may silently drop the elements it had trouble with.
You should instead install the package 'html5lib' and use that as your parser, like so:
soup = BeautifulSoup(page, 'html5lib')
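Once the page parses cleanly, both comments can be located. A minimal sketch (with stand-in markup, not the actual challenge page) of pulling Comment nodes out of the tree:

```python
import bs4

page = "<html><!-- first hint --><body><p>good</p><!-- second hint --></body></html>"
soup = bs4.BeautifulSoup(page, "html.parser")
# Comment is a NavigableString subclass, so filter text nodes by type
comments = soup.find_all(text=lambda t: isinstance(t, bs4.Comment))
print([c.strip() for c in comments])  # ['first hint', 'second hint']
```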
