xml.etree.ElementTree: How to replace like "innerHTML"? - python

I want to replace the content of the <h1> tag of an HTML page.
But the new content of the heading can be HTML (not just a plain string).
I want to insert foo <b>bold</b> bar
input:
start
<h1 class="myclass">bar <i>italic</i></h1>
end
Desired output:
start
<h1 class="myclass">foo <b>bold</b> bar</h1>
end
How to solve this with Python?

Using htql:
import htql

page = """start
<h1 class="myclass">bar <i>italic</i></h1>
end
"""
x = htql.query(page, "<h1>:tx &replace('foo <b>bold</b> bar') ")[0][0]
You get:
>>> x
'start \n<h1 class="myclass">foo <b>bold</b> bar</h1>\nend\n'

This uses html5lib to build an xml.etree tree, then splices the parsed replacement fragment into the heading:
from html5lib import HTMLParser
from xml.etree import ElementTree

parser = HTMLParser(namespaceHTMLElements=False)
etree = parser.parse('start <h1 class="myclass">bar <i>italic</i></h1> end')
for h1 in etree.findall('.//h1'):
    # Remove the existing children of the heading (iterate over a copy,
    # since we are mutating the element while looping).
    for sub in list(h1):
        h1.remove(sub)
    # Parse the replacement fragment and copy the body's children and text
    # into the heading.
    html = parser.parse('foo <b>bold</b> bar')
    body = html.find('.//body')
    for sub in body:
        h1.append(sub)
    h1.text = body.text
print(ElementTree.tostring(etree))
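For comparison, here is a hedged sketch of the same swap with BeautifulSoup (my own addition, not part of the answer above): clear the <h1> and move the children of a parsed fragment into it.
from bs4 import BeautifulSoup

page = 'start\n<h1 class="myclass">bar <i>italic</i></h1>\nend'
soup = BeautifulSoup(page, 'html.parser')

h1 = soup.find('h1', class_='myclass')
h1.clear()  # drop "bar <i>italic</i>"

# Parse the new inner HTML and move its top-level nodes into the heading.
fragment = BeautifulSoup('foo <b>bold</b> bar', 'html.parser')
for child in list(fragment.contents):
    h1.append(child)

print(soup)
# start
# <h1 class="myclass">foo <b>bold</b> bar</h1>
# end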

Related

BeautifulSoup: Search and replace in the text parts of HTML

I want to do a search and replace on the textual part of the content of the HTML elements.
E.g., replacing foo with <b>bar</b> in
<div id="foo">foo <i>foo</i> hi foo hi</div>
should result in
<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>
I already have a working version in Perl, but the HTML parser there is buggy:
#!/usr/bin/env perl
##
use strict;
use warnings;
use v5.34.0;
use Mojo::DOM;
##
my $input = do { local $/; <STDIN> };
my $dom = Mojo::DOM->new($input);
$dom->descendant_nodes->grep(sub { $_->type eq 'text' })
    ->each(sub {
        $_->replace(s/(sth)/<span class="todo at_tag">$1<\/span>/gr)
    });
say $dom;
Search all text nodes containing foo
Create a b element
Replace the text with the new element
Insert the desired text into the b
from bs4 import BeautifulSoup
import re

htmlString = '''
<div id="foo">foo <i>foo</i> hi foo hi</div>
'''
soup = BeautifulSoup(htmlString, "html.parser")
for n in soup.find_all(text=re.compile('foo')):
    # Replace the whole text node with a new <b>bar</b> element.
    bold = soup.new_tag("b")
    n.replace_with(bold)
    bold.insert(0, 'bar')
print(soup)
Output:
<div id="foo"><b>bar</b><i><b>bar</b></i><b>bar</b></div>
It's not recommended to use string-manipulation functions such as .replace or regex on HTML strings. Since you are looking for a solution in that area, here is one anyway; normally this should be done with BeautifulSoup.
html = """<div id="foo">foo <i>foo</i> hi foo hi</div>"""
# Replace every "foo", then undo the first replacement, which landed inside the id attribute.
res = html.replace("foo", "<b>bar</b>").replace("<b>bar</b>", "foo", 1)
print(res)
Output:
<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>
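For completeness, a hedged BeautifulSoup sketch (my own addition, not taken from the answers above) that keeps the surrounding text by rewriting each text node and splicing the parsed result back in:
from bs4 import BeautifulSoup
import re

htmlString = '<div id="foo">foo <i>foo</i> hi foo hi</div>'
soup = BeautifulSoup(htmlString, "html.parser")

# Only text nodes are touched, so the id="foo" attribute is left alone.
for node in soup.find_all(text=re.compile('foo')):
    new_markup = node.replace('foo', '<b>bar</b>')
    node.replace_with(BeautifulSoup(new_markup, "html.parser"))

print(soup)
# <div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>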

How to wrap initial of each word in a specific tag with a <b>?

I am trying to use the BeautifulSoup module with Python to do the following:
Within a div for HTML, for each paragraph tag, I want to add a bold tag to the first letter of each word within the paragraph. For example:
<div class="body">
<p>The quick brown fox</p>
</div>
which would read: The quick brown fox
would then become
<div class="body">
<p><b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox</p>
</div>
that would read: The quick brown fox
Using bs4 I've been unable to find a good solution to do this and am open to ideas.
You could use replace_with() combined with a list comprehension: extract the text/string from the tag (bs4 object), process it as plain text, and then replace the tag with a new bs4 object:
soup.p.replace_with(
    BeautifulSoup(
        ' '.join([f'<b>{s[0]}</b>{s[1:]}' for s in soup.p.string.split(' ')]),
        'html.parser'
    )
)
Example
from bs4 import BeautifulSoup
html = '''
<div class="body">
<p>The quick brown fox</p>
</div>'''
soup = BeautifulSoup(html,'html.parser')
soup.p.replace_with(
    BeautifulSoup(
        ' '.join([f'<b>{s[0]}</b>{s[1:]}' for s in soup.p.string.split(' ')]),
        'html.parser'
    )
)
soup
Output
<div class="body">
<b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox
</div>
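Note that replace_with() on the <p> removes the tag itself, which is why the output above has no <p>. A hedged variant (my assumption, not from the answer) that replaces only the string inside the <p> keeps the wrapper, reusing html from the example above:
soup = BeautifulSoup(html, 'html.parser')
soup.p.string.replace_with(
    BeautifulSoup(
        ' '.join([f'<b>{s[0]}</b>{s[1:]}' for s in soup.p.string.split(' ')]),
        'html.parser'
    )
)
print(soup)
# <div class="body">
# <p><b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox</p>
# </div>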
I don't know much about how Python parses HTML in detail, but I can provide you with some ideas.
To find <p> tags, you can use RegEx <p.*?>.*?</p> or use str.find("<p>") and walk until </p>.
To add <b> tags, perhaps this code will work:
def add_bold(s: str) -> str:
    ret = ""
    isFirstLet = True
    for i in s:
        if isFirstLet:
            ret += "<b>" + i + "</b>"
            isFirstLet = False
        else:
            ret += i
        if i == " ":
            isFirstLet = True
    return ret
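As a hedged sketch of how the two ideas might fit together (the <p>-matching regex plus add_bold; an illustration of mine, not a tested recipe from the answer):
import re

html = '<div class="body">\n<p>The quick brown fox</p>\n</div>'

# Bold the first letter of each word inside every <p>...</p>.
result = re.sub(
    r'(<p.*?>)(.*?)(</p>)',
    lambda m: m.group(1) + add_bold(m.group(2)) + m.group(3),
    html,
    flags=re.DOTALL,
)
print(result)
# <div class="body">
# <p><b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox</p>
# </div>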

Modifying a BeautifulSoup .string with line breaks

I am trying to change the content of an HTML file with BeautifulSoup. This content will be coming from Python-based text, so it will have \n newlines...
newContent = """This is my content \n with a line break."""
newContent = newContent.replace("\n", "<br>")
htmlFile.find_all("div", "product").p.string = newContent
When I do this, the HTML file's <p> text is changed to this:
This is my content <br> with a line break.
How do I change a string within a BeautifulSoup object and keep <br> breaks? If the string just contains \n, then it'll create an actual line break.
You need to create separate elements; there isn't one piece of text contained in the <p> tag, but a series of text and <br/> elements.
Rather than replace \n newlines with the text <br/> (which will be escaped), split the text on newlines and insert extra elements in between:
parent = htmlFile.find_all("div", "product")[0].p
lines = newContent.splitlines()
parent.append(htmlFile.new_string(lines[0]))
for line in lines[1:]:
    parent.append(htmlFile.new_tag('br'))
    parent.append(htmlFile.new_string(line))
This uses the Element.append() method to add new elements to the tree, and BeautifulSoup.new_string() and BeautifulSoup.new_tag() to create those extra elements.
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlFile = BeautifulSoup('<p></p>')
>>> newContent = """This is my content \n with a line break."""
>>> parent = htmlFile.p
>>> lines = newContent.splitlines()
>>> parent.append(htmlFile.new_string(lines[0]))
>>> for line in lines[1:]:
...     parent.append(htmlFile.new_tag('br'))
...     parent.append(htmlFile.new_string(line))
...
>>> print htmlFile.prettify()
<html>
 <head>
 </head>
 <body>
  <p>
   This is my content
   <br/>
   with a line break.
  </p>
 </body>
</html>
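If this comes up often, the pattern can be wrapped in a small helper; set_text_with_breaks below is a hypothetical name of mine, not a BeautifulSoup API, and this is only a sketch assuming bs4:
from bs4 import BeautifulSoup

def set_text_with_breaks(soup, tag, text):
    # Replace the tag's contents with the text, turning newlines into <br/> elements.
    tag.clear()
    lines = text.splitlines()
    if not lines:
        return
    tag.append(soup.new_string(lines[0]))
    for line in lines[1:]:
        tag.append(soup.new_tag('br'))
        tag.append(soup.new_string(line))

soup = BeautifulSoup('<p></p>', 'html.parser')
set_text_with_breaks(soup, soup.p, "This is my content \n with a line break.")
print(soup)
# <p>This is my content <br/> with a line break.</p>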

How to find spans with a specific class containing specific text using beautiful soup and re?

How can I find all spans with a class of 'blue' that contain text in the format:
04/18/13 7:29pm
which could therefore be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
In terms of constructing the logic to do this, this is what I have got so far:
new_content = original_content.find_all('span', {'class' : 'blue'})  # using Beautiful Soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>')  # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result
I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try and figure out a way to do this, but the above is all i have got so far.
edit:
To clarify the scenario, there are spans with:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
and note I only need 04/18/13 7:29pm, not the rest of the content.
edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result
and got error:
'TypeError: expected string or buffer'
import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html
soup = BeautifulSoup(html_doc, 'html.parser')
# find a list of all span elements with class "blue"
spans = soup.find_all('span', {'class' : 'blue'})
# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]
# collect the dates from the list of lines using a regex matching group
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))
# print the dates we collected
for date in found_dates:
    print(date)
output:
04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
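As an alternative (a hedged sketch of mine, not part of the answer above), BeautifulSoup can filter the spans by their text directly via the string argument (called text in older versions), assuming each span's text is a single string:
import re
from bs4 import BeautifulSoup

date_re = re.compile(r'\d{2}/\d{2}/\d{2} \d+:\d+[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc as defined above
for span in soup.find_all('span', class_='blue', string=date_re):
    print(date_re.search(span.get_text()).group(0))
# 04/18/13 7:29pm
# 04/19/13 7:30pm
# 04/20/13 10:31pm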
This is a flexible regex that you can use:
"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])"
Example:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
...       for line in lines
...       for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])', line),)
...       if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
...     print i
04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
This pattern seems to satisfy what you're looking for:
>>> pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)
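Regarding the TypeError in the question: re.findall() was handed a bs4 Tag object rather than a string. A hedged fix (a sketch, reusing pattern and new_content from the question) is to stringify each element first:
for span in new_content:
    # str(span) gives the element's markup, which the pattern can be run against
    result = re.findall(pattern, str(span))
    print(result)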

Parse HTML and preserve original content

I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):
$('.header .title').text('my new content')
on the following HTML document:
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
and have the following result:
<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:
<html>
<head></head>
<body>
<div class=header><span class=title>my new content</span></div>
<p>1</p><p>2</p>
<table><tbody><tr><td>1</td></tr></tbody></table>
</body>
</html>
E.g. they add:
html, head and body elements
closing p tags
tbody
Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.
I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the extremely reliable lxml package, a Python binding to libxml2.
I believe this does exactly what you want, with a quite familiar interface.
from pyquery import PyQuery as pq
html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''
doc = pq(html)
doc('.header .title').text('my new content')
print doc
Output:
<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>
The closing p tag can't be helped. lxml only keeps the values from the original document, not the vagaries of the original. Paragraphs can be made two ways, and it chooses the more standard way when doing serialization. I don't believe you'll find a (bug-free) parser that does better.
Note: I'm on Python 3.
This will only handle a subset of CSS selectors, but it may be enough for your purposes.
from html.parser import HTMLParser

class AttrQuery():
    def __init__(self):
        self.repl_text = ""
        self.selectors = []

    def add_css_sel(self, seltext):
        sels = seltext.split(" ")
        for selector in sels:
            if selector[:1] == "#":
                self.add_selector({"id": selector[1:]})
            elif selector[:1] == ".":
                self.add_selector({"class": selector[1:]})
            elif "." in selector:
                html_tag, html_class = selector.split(".")
                self.add_selector({"html_tag": html_tag, "class": html_class})
            else:
                self.add_selector({"html_tag": selector})

    def add_selector(self, selector_dict):
        self.selectors.append(selector_dict)

    def match_test(self, tagwithattrs_list):
        for selector in self.selectors:
            for condition in selector:
                condition_value = selector[condition]
                if not self._condition_test(tagwithattrs_list, condition, condition_value):
                    return False
        return True

    def _condition_test(self, tagwithattrs_list, condition, condition_value):
        for tagwithattrs in tagwithattrs_list:
            try:
                if condition_value == tagwithattrs[condition]:
                    return True
            except KeyError:
                pass
        return False

class HTMLAttrParser(HTMLParser):
    def __init__(self, html, **kwargs):
        super().__init__(**kwargs)
        self.tagwithattrs_list = []
        self.queries = []
        self.matchrepl_list = []
        self.html = html

    def handle_starttag(self, tag, attrs):
        tagwithattrs = dict(attrs)
        tagwithattrs["html_tag"] = tag
        self.tagwithattrs_list.append(tagwithattrs)
        if debug:
            print("push\t", end="")
            for attrname in tagwithattrs:
                print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
            print("")

    def handle_endtag(self, tag):
        try:
            while True:
                tagwithattrs = self.tagwithattrs_list.pop()
                if debug:
                    print("pop \t", end="")
                    for attrname in tagwithattrs:
                        print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                    print("")
                if tag == tagwithattrs["html_tag"]:
                    break
        except IndexError:
            raise IndexError("Found a close-tag for a non-existent element.")

    def handle_data(self, data):
        if self.tagwithattrs_list:
            for query in self.queries:
                if query.match_test(self.tagwithattrs_list):
                    line, position = self.getpos()
                    length = len(data)
                    match_replace = (line - 1, position, length, query.repl_text)
                    self.matchrepl_list.append(match_replace)

    def addquery(self, query):
        self.queries.append(query)

    def transform(self):
        split_html = self.html.split("\n")
        self.matchrepl_list.reverse()
        if debug:
            print("\nreversed list of matches (line, position, len, repl_text):\n{}\n".format(self.matchrepl_list))
        for line, position, length, repl_text in self.matchrepl_list:
            oldline = split_html[line]
            newline = oldline[:position] + repl_text + oldline[position + length:]
            split_html = split_html[:line] + [newline] + split_html[line + 1:]
        return "\n".join(split_html)
See the example usage below.
html_test = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>1</div></td></tr></table>"""
debug = False
parser = HTMLAttrParser(html_test)
query = AttrQuery()
query.repl_text = "Bar"
query.add_selector({"html_tag": "div", "class": "header"})
query.add_selector({"class": "title"})
parser.addquery(query)
query = AttrQuery()
query.repl_text = "InTable"
query.add_css_sel("table tr td.hi #there")
parser.addquery(query)
parser.feed(html_test)
transformed_html = parser.transform()
print("transformed html:\n{}".format(transformed_html))
Output:
transformed html:
<div class=header><span class=title>Bar</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>InTable</div></td></tr></table>
OK, I have done this in a few languages, and I have to say the best parser I have seen that preserves whitespace and even HTML comments is:
Jericho, which is unfortunately Java.
That is, Jericho knows how to parse and preserve fragments.
Yes, I know it's Java, but you could easily make a RESTful service with a tiny bit of Java that would take the payload and convert it. In the Java REST service you could use JRuby, Jython, Rhino JavaScript, etc. to coordinate with Jericho.
You can use Nokogiri HTML Fragment for this:
fragment = Nokogiri::HTML.fragment('<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>')
fragment.css('.title').children.first.replace(Nokogiri::XML::Text.new('HEY', fragment))
fragment.to_s #=> "<div class=\"header\"><span class=\"title\">HEY</span></div>\n<p>1</p><p>2\n</p><table><tr><td>1</td></tr></table>"
The problem with the p tag persists, because it is invalid HTML, but this should return your document without the html, head, body, or tbody tags.
With Python, using lxml.html is fairly straightforward:
(It meets points 1 and 3, but I don't think much can be done about 2; it also handles the unquoted class= attributes.)
import lxml.html

fragment = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
"""
page = lxml.html.fromstring(fragment)
for span in page.cssselect('.header .title'):
    span.text = 'my new content'
print lxml.html.tostring(page, pretty_print=True)
Result:
<div>
<div class="header"><span class="title">my new content</span></div>
<p>1</p>
<p>2
</p>
<table><tr><td>1</td></tr></table>
</div>
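If the extra <div> wrapper in that result is a problem, a hedged variant (my assumption, not from the answer) is to parse the input as a list of fragments instead, so nothing gets wrapped:
import lxml.html

pieces = lxml.html.fragments_fromstring(fragment)
for piece in pieces:
    # fragments_fromstring can yield leading text as a plain string
    if not isinstance(piece, str):
        for span in piece.cssselect('.header .title'):
            span.text = 'my new content'

print(''.join(
    piece if isinstance(piece, str) else lxml.html.tostring(piece).decode()
    for piece in pieces
))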
This is a slightly separate solution but if this is only for a few simple instances then perhaps CSS is the answer.
Generated Content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
  <style type="text/css">
    #header.title1:first-child:before {
      content: "This is your title!";
      display: block;
      width: 100%;
    }
    #header.title2:first-child:before {
      content: "This is your other title!";
      display: block;
      width: 100%;
    }
  </style>
</head>
<body>
  <div id="header" class="title1">
    <span class="non-title">Blah Blah Blah Blah</span>
  </div>
</body>
</html>
In this instance you could just have jQuery swap the classes and you'd get the change for free with css. I haven't tested this particular usage but it should work.
We use this for things like outage messages.
If you're running a Node.js app, this module will do exactly what you want, a jQuery-style DOM manipulator: https://github.com/cheeriojs/cheerio
An example from their wiki:
var cheerio = require('cheerio'),
$ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');
$.html();
//=> <h2 class="title welcome">Hello there!</h2>
