Is there a way, using lxml, to insert XML attributes with the right namespace?
For instance, I want to use XLink to insert links in an XML document. All I need to do is insert {http://www.w3.org/1999/xlink}href attributes in some elements. I would like to use the xlink prefix, but lxml generates prefixes like "ns0", "ns1"…
Here is what I tried:
from lxml import etree

#: Name (and namespace) of the *href* attribute used to insert links.
HREF_ATTR = etree.QName("http://www.w3.org/1999/xlink", "href").text

content = """\
<body>
<p>Link to <span>StackOverflow</span></p>
<p>Link to <span>Google</span></p>
</body>
"""
targets = ["https://stackoverflow.com", "https://www.google.fr"]

body_elem = etree.XML(content)
for span_elem, target in zip(body_elem.iter("span"), targets):
    span_elem.attrib[HREF_ATTR] = target
etree.dump(body_elem)
The dump looks like this:
<body>
<p>Link to <span xmlns:ns0="http://www.w3.org/1999/xlink"
ns0:href="https://stackoverflow.com">StackOverflow</span></p>
<p>Link to <span xmlns:ns1="http://www.w3.org/1999/xlink"
ns1:href="https://www.google.fr">Google</span></p>
</body>
I found a way to factorize the namespaces by inserting and deleting an attribute in the root element, like this:
# trick to declare the XLink namespace globally (only one time).
body_elem = etree.XML(content)
body_elem.attrib[HREF_ATTR] = ""
del body_elem.attrib[HREF_ATTR]
targets = ["https://stackoverflow.com", "https://www.google.fr"]
for span_elem, target in zip(body_elem.iter("span"), targets):
    span_elem.attrib[HREF_ATTR] = target
etree.dump(body_elem)
It's ugly, but it works, and I only need to do it once. I get:
<body xmlns:ns0="http://www.w3.org/1999/xlink">
<p>Link to <span ns0:href="https://stackoverflow.com">StackOverflow</span></p>
<p>Link to <span ns0:href="https://www.google.fr">Google</span></p>
</body>
But the problem remains: how can I turn this "ns0" prefix into "xlink"?
Using register_namespace as suggested by @mzjn:
etree.register_namespace("xlink", "http://www.w3.org/1999/xlink")
# trick to declare the XLink namespace globally (only one time).
body_elem = etree.XML(content)
body_elem.attrib[HREF_ATTR] = ""
del body_elem.attrib[HREF_ATTR]
targets = ["https://stackoverflow.com", "https://www.google.fr"]
for span_elem, target in zip(body_elem.iter("span"), targets):
    span_elem.attrib[HREF_ATTR] = target
etree.dump(body_elem)
The result is what I expected:
<body xmlns:xlink="http://www.w3.org/1999/xlink">
<p>Link to <span xlink:href="https://stackoverflow.com">StackOverflow</span></p>
<p>Link to <span xlink:href="https://www.google.fr">Google</span></p>
</body>
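As an alternative to the insert/delete trick, lxml also provides etree.cleanup_namespaces() with a top_nsmap argument, which hoists the namespace declaration to the root under the prefix you choose. This is a sketch, assuming lxml ≥ 3.5 (where top_nsmap was added):

```python
from lxml import etree

XLINK_NS = "http://www.w3.org/1999/xlink"
HREF_ATTR = etree.QName(XLINK_NS, "href").text

content = """\
<body>
<p>Link to <span>StackOverflow</span></p>
<p>Link to <span>Google</span></p>
</body>
"""
targets = ["https://stackoverflow.com", "https://www.google.fr"]

body_elem = etree.XML(content)
for span_elem, target in zip(body_elem.iter("span"), targets):
    span_elem.attrib[HREF_ATTR] = target

# Hoist the per-element declarations up to the root as a single
# xmlns:xlink declaration, then drop the now-redundant ns0/ns1 ones.
etree.cleanup_namespaces(body_elem, top_nsmap={"xlink": XLINK_NS})
etree.dump(body_elem)
```

This avoids mutating and deleting a dummy attribute, at the cost of a post-processing pass over the tree.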
Related
I want to extract information from a div tag which has some specific classes.
Classes are in the format abc def jss238 xyz.
Now, the jss class number keeps changing, so after some time the classes will become abc def jss384 xyz.
What is the best way to extract information so that the code doesn't break if the tags change as well.
The current code that I using is
val = soup.findAll('div', class_="abc def jss328 xyz")
I feel regex could be a good way, but can I also drop the jss class and search using only the other three?
So yes, you can use a regex to find the pattern abc def <three letters and three digits> xyz.
Personally, I would see if you can get the data from the source. When classes change like that, it's usually because the page is rendered through JavaScript, but it needs to put the data in there and get it from somewhere. If you share the URL and what data you are after, I could see if that's the case. But here's the regex version:
from bs4 import BeautifulSoup
import re
html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss384 xyz">jss384 text</div>
<div class="hij klm jss238 xyz">doesn't match the pattern</div>'''
soup = BeautifulSoup(html, 'html.parser')
regex = re.compile(r'abc def \w{3}\d{3} xyz')
specialDivs = soup.find_all('div', {'class': regex})
for each in specialDivs:
    print(f'html: {each}\tText: {each.text}')
Output:
html: <div class="abc def jss238 xyz">jss238 text</div> Text: jss238 text
html: <div class="abc def jss384 xyz">jss384 text</div> Text: jss384 text
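If you'd rather not depend on the jss naming pattern at all, you can also filter on the three stable classes only and ignore the volatile one entirely. A minimal sketch (the sample HTML here is hypothetical):

```python
from bs4 import BeautifulSoup

html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss999 xyz">future jss class</div>
<div class="hij klm jss238 xyz">missing abc/def</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Keep any div whose class list contains all three stable classes,
# no matter what the volatile jss* class is called.
required = {'abc', 'def', 'xyz'}
stable_divs = [div for div in soup.find_all('div')
               if required <= set(div.get('class', []))]

for div in stable_divs:
    print(div.text)
```

This keeps working even if the jss class disappears entirely, which the regex version would reject.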
I want to convert:
<span class = "foo">data-1</span>
<span class = "foo">data-2</span>
<span class = "foo">data-3</span>
to
<span class = "foo"> data-1 data-2 data-3 </span>
Using BeautifulSoup in Python. This HTML part exists in multiple areas of the page body, hence I want to minimize this part and scrape it. Actually the middle span had an em class, hence it was originally separated.
Adapted from this answer to show how this could be used for your span tags:
span_tags = container.find_all('span')
# combine all the text from the span tags
text = ''.join(span.get_text(strip=True) for span in span_tags)
# here you choose a tag you want to preserve and update its text
span_main = span_tags[0]  # you can target it however you want, I just take the first one from the list
span_main.string = text  # replace the text
for tag in span_tags:
    if tag is not span_main:
        tag.decompose()
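Since the snippet above assumes you already have a container element, here is a self-contained sketch (the wrapper div and its id are hypothetical, and the spans are joined with spaces to match the asked-for output):

```python
from bs4 import BeautifulSoup

# hypothetical wrapper around the three spans from the question
html = '''<div id="wrap"><span class="foo">data-1</span>
<span class="foo">data-2</span>
<span class="foo">data-3</span></div>'''

soup = BeautifulSoup(html, 'html.parser')
container = soup.find(id='wrap')
span_tags = container.find_all('span')

# join with spaces so the result reads "data-1 data-2 data-3"
text = ' '.join(span.get_text(strip=True) for span in span_tags)
span_main = span_tags[0]   # keep the first span
span_main.string = text    # replace its text with the merged text
for tag in span_tags[1:]:  # drop the rest
    tag.decompose()

print(container)
```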
I want to be able to include a custom "HTML" tag in a string, such as: "This is a <photo id="4" /> string".
In this case the custom tag is <photo id="4" />. I would also be fine changing this custom tag to be written differently if it makes it easier, ie [photo id:4] or something.
I want to be able to pass this string to a function that will extract the tag <photo id="4" />, and allow me to transform this to some more complicated template like <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.
I'm imagining it working something like this:
>>> content = 'This is a <photo id="4" /> string'
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', string)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now that I know I need to render a photo with ID 4, so I can pass that to some sort of template thing
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string
I think this is a fairly common pattern, for something like putting a photo in a blog post, so I'm wondering if there is a library out there that will do something similar to the example parse_tags function above. Or do I need to write it?
This example of the photo tag is just a single example. I would want to have tags with different names. As a different example, maybe I have a database of people and I want a tag like <person name="John Doe" />. In that case the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I can then use the name to look that person up and return a rendered template of the person's vcard or something.
If you're working with HTML5, I would suggest looking into the xml module (etree). It will allow you to parse the whole document into a tree structure and manipulate tags individually (and then turn the result back into an HTML document).
You could also use regular expressions to perform text substitutions. This would likely be faster than loading an XML tree structure if you don't have too many changes to make.
import re

text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
tags = ["photo", "person", "abc"]

patterns = "|".join([f"(<{tag} .*?>)|(<{tag}>)" for tag in tags])
matches = list(re.finditer(patterns, text))
for match in reversed(matches):
    tag = text[match.start():match.end()]
    print(match.start(), match.end(), tag)
    # substitute what you need for that tag
    text = text[:match.start()] + "***" + text[match.end():]
print(text)
This will be printed:
64 88 <person name="John Doe">
39 53 <photo id="4">
22 29 <photo>
<html><body>some text *** and tags *** more text *** yet more text
Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with the substitutions.
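If you also want the dict shape from the question, the same regex idea can be wrapped in a small helper. parse_tags here is a hypothetical sketch: it only handles double-quoted attributes and returns attribute values as strings rather than ints:

```python
import re

def parse_tags(tag, s):
    """Find all <tag ...> / <tag ... /> occurrences and return their
    attributes plus the raw matched text."""
    results = []
    for m in re.finditer(rf'<{tag}\b([^>]*?)/?>', s):
        # pull name="value" pairs out of the tag body
        attrs = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', m.group(1)))
        results.append({'tag': tag, **attrs, 'raw': m.group(0)})
    return results

content = 'This is a <photo id="4" /> string'
print(parse_tags('photo', content))
```

The 'raw' field can then be fed straight into str.replace, as in the question's example.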
For this kind of "surgical" parsing (where you want to isolate specific tags instead of creating a full hierarchical document), pyparsing's makeHTMLTags method can be very useful.
See the annotated script below, showing the creation of the parser, and using it for parseTag and replaceTag methods:
import pyparsing as pp

def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]

    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)

    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)

    return parser

# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)

content = """This is a <photo id="4" /> string"""

tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id: {!r}".format(match.id))

# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)
    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)

print(replaceTag("photo",
                 '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>',
                 content))
Prints:
['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id: '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string
I'm hoping to check whether two HTML documents differ by tags only, without considering the text, and to pick out the differing branch(es).
For example :
html_1 = """
<p>i love it</p>
"""
html_2 = """
<p>i love it really</p>
"""
They share the same tag structure, so they're considered the same. However:
html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""
I'd expect it to return the <div> branch, because the tag structures are different. Could lxml, BeautifulSoup or some other lib achieve this? I'm trying to find a way to actually pick out the different branches.
Thanks
A more reliable approach would be to construct a Tree of tag names out of the document as discussed here:
HTML Parse tree using Python 2.7
Here is an example working solution based on treelib.Tree:
from bs4 import BeautifulSoup
from treelib import Tree

def traverse(parent, tree):
    # create a node for this element, then recurse into its children
    # (treelib identifiers must be unique, so this assumes each tag name
    # appears only once per document)
    tree.create_node(parent.name, parent.name, parent=parent.parent.name if parent.parent else None)
    for node in parent.find_all(recursive=False):
        traverse(node, tree)

def compare(html1, html2):
    tree1 = Tree()
    traverse(BeautifulSoup(html1, "html.parser"), tree1)
    tree2 = Tree()
    traverse(BeautifulSoup(html2, "html.parser"), tree2)
    return tree1.to_json() == tree2.to_json()

print(compare("<p>i love it</p>", "<p>i love it really</p>"))
print(compare("<p>i love it</p>", "<p>i <em>love</em> it</p>"))
Prints:
True
False
Sample code to check whether the tagging structures of two HTML contents are the same or not.
Demo:
from lxml import html as PARSER

def getTagSequence(content):
    """
    Get all Tag Sequence
    """
    root = PARSER.fromstring(content)
    tag_sequence = []
    for elm in root.iter():
        tag_sequence.append(elm.tag)
    return tag_sequence

html_1_tags = getTagSequence(html_1)
html_2_tags = getTagSequence(html_2)
if html_1_tags == html_2_tags:
    print("Tagging structure is same.")
else:
    print("Tagging structure is different.")
    print("HTML 1 Tagging:", html_1_tags)
    print("HTML 2 Tagging:", html_2_tags)
Note:
The above code checks the tag sequence only; it does not check parent/child relationships, i.e.
html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
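Since the question also asks to pick out the differing branches, not just detect a difference, here is a sketch that recursively compares tag structure with lxml and collects the topmost diverging subtrees. diff_branches is a hypothetical helper, and text differences are ignored by design:

```python
from lxml import html

def diff_branches(e1, e2):
    """Compare two elements by tag structure only; collect the topmost
    subtree pairs where the structures diverge."""
    if e1.tag != e2.tag or len(e1) != len(e2):
        return [(e1, e2)]
    diffs = []
    for child1, child2 in zip(e1, e2):
        diffs.extend(diff_branches(child1, child2))
    return diffs

# the question's fragments, wrapped in a single root for fragment parsing
doc1 = html.fragment_fromstring(
    "<div><div><p>i love it</p></div><p>i love it</p></div>")
doc2 = html.fragment_fromstring(
    "<div><div><p>i <em>love</em> it</p></div><p>i love it</p></div>")

for branch1, branch2 in diff_branches(doc1, doc2):
    print(html.tostring(branch1), "differs from", html.tostring(branch2))
```

Note that this reports the innermost element where the structures first diverge (here the <p> that gained an <em> child); walking up .getparent() recovers the enclosing <div> branch if that is what you need.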
I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):
$('.header .title').text('my new content')
on the following HTML document:
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
and have the following result:
<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:
<html>
<head></head>
<body>
<div class=header><span class=title>my new content</span></div>
<p>1</p><p>2</p>
<table><tbody><tr><td>1</td></tr></tbody></table>
</body>
</html>
E.g. they add:
html, head and body elements
closing p tags
tbody
Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.
I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the extremely reliable lxml package, a Python binding to libxml2.
I believe this does exactly what you want, with a quite familiar interface.
from pyquery import PyQuery as pq
html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''
doc = pq(html)
doc('.header .title').text('my new content')
print(doc)
Output:
<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>
The closing p tag can't be helped. lxml only keeps the values from the original document, not the vagaries of the original. Paragraphs can be made two ways, and it chooses the more standard way when doing serialization. I don't believe you'll find a (bug-free) parser that does better.
Note: I'm on Python 3.
This will only handle a subset of CSS selectors, but it may be enough for your purposes.
from html.parser import HTMLParser

class AttrQuery():
    def __init__(self):
        self.repl_text = ""
        self.selectors = []

    def add_css_sel(self, seltext):
        sels = seltext.split(" ")
        for selector in sels:
            if selector[:1] == "#":
                self.add_selector({"id": selector[1:]})
            elif selector[:1] == ".":
                self.add_selector({"class": selector[1:]})
            elif "." in selector:
                html_tag, html_class = selector.split(".")
                self.add_selector({"html_tag": html_tag, "class": html_class})
            else:
                self.add_selector({"html_tag": selector})

    def add_selector(self, selector_dict):
        self.selectors.append(selector_dict)

    def match_test(self, tagwithattrs_list):
        for selector in self.selectors:
            for condition in selector:
                condition_value = selector[condition]
                if not self._condition_test(tagwithattrs_list, condition, condition_value):
                    return False
        return True

    def _condition_test(self, tagwithattrs_list, condition, condition_value):
        for tagwithattrs in tagwithattrs_list:
            try:
                if condition_value == tagwithattrs[condition]:
                    return True
            except KeyError:
                pass
        return False

class HTMLAttrParser(HTMLParser):
    def __init__(self, html, **kwargs):
        super().__init__(**kwargs)
        self.tagwithattrs_list = []
        self.queries = []
        self.matchrepl_list = []
        self.html = html

    def handle_starttag(self, tag, attrs):
        tagwithattrs = dict(attrs)
        tagwithattrs["html_tag"] = tag
        self.tagwithattrs_list.append(tagwithattrs)
        if debug:
            print("push\t", end="")
            for attrname in tagwithattrs:
                print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
            print("")

    def handle_endtag(self, tag):
        try:
            while True:
                tagwithattrs = self.tagwithattrs_list.pop()
                if debug:
                    print("pop \t", end="")
                    for attrname in tagwithattrs:
                        print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                    print("")
                if tag == tagwithattrs["html_tag"]: break
        except IndexError:
            raise IndexError("Found a close-tag for a non-existent element.")

    def handle_data(self, data):
        if self.tagwithattrs_list:
            for query in self.queries:
                if query.match_test(self.tagwithattrs_list):
                    line, position = self.getpos()
                    length = len(data)
                    match_replace = (line-1, position, length, query.repl_text)
                    self.matchrepl_list.append(match_replace)

    def addquery(self, query):
        self.queries.append(query)

    def transform(self):
        split_html = self.html.split("\n")
        self.matchrepl_list.reverse()
        if debug: print("\nreversed list of matches (line, position, len, repl_text):\n{}\n".format(self.matchrepl_list))
        for line, position, length, repl_text in self.matchrepl_list:
            oldline = split_html[line]
            newline = oldline[:position] + repl_text + oldline[position+length:]
            split_html = split_html[:line] + [newline] + split_html[line+1:]
        return "\n".join(split_html)
See the example usage below.
html_test = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>1</div></td></tr></table>"""
debug = False
parser = HTMLAttrParser(html_test)
query = AttrQuery()
query.repl_text = "Bar"
query.add_selector({"html_tag": "div", "class": "header"})
query.add_selector({"class": "title"})
parser.addquery(query)
query = AttrQuery()
query.repl_text = "InTable"
query.add_css_sel("table tr td.hi #there")
parser.addquery(query)
parser.feed(html_test)
transformed_html = parser.transform()
print("transformed html:\n{}".format(transformed_html))
Output:
transformed html:
<div class=header><span class=title>Bar</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>InTable</div></td></tr></table>
OK, I have done this in a few languages, and I have to say the best parser I have seen that preserves whitespace and even HTML comments is:
Jericho, which is unfortunately Java.
That is, Jericho knows how to parse and preserve fragments.
Yes, I know it's Java, but you could easily make a RESTful service with a tiny bit of Java that would take the payload and convert it. In the Java REST service you could use JRuby, Jython, Rhino JavaScript, etc. to coordinate with Jericho.
You can use Nokogiri HTML Fragment for this:
fragment = Nokogiri::HTML.fragment('<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>')
fragment.css('.title').children.first.replace(Nokogiri::XML::Text.new('HEY', fragment))
fragment.to_s #=> "<div class=\"header\"><span class=\"title\">HEY</span></div>\n<p>1</p><p>2\n</p><table><tr><td>1</td></tr></table>"
The problem with the p tag persists, because it is invalid HTML, but this should return your document without html, head or body and tbody tags.
With Python, using lxml.html is fairly straightforward:
(It meets points 1 & 3, but I don't think much can be done about 2, and it handles the unquoted class attributes.)
import lxml.html
fragment = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
"""
page = lxml.html.fromstring(fragment)
for span in page.cssselect('.header .title'):
    span.text = 'my new content'
print(lxml.html.tostring(page, pretty_print=True).decode())
Result:
<div>
<div class="header"><span class="title">my new content</span></div>
<p>1</p>
<p>2
</p>
<table><tr><td>1</td></tr></table>
</div>
This is a slightly separate solution but if this is only for a few simple instances then perhaps CSS is the answer.
Generated Content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<style type="text/css">
#header.title1:first-child:before {
content: "This is your title!";
display: block;
width: 100%;
}
#header.title2:first-child:before {
content: "This is your other title!";
display: block;
width: 100%;
}
</style>
</head>
<body>
<div id="header" class="title1">
<span class="non-title">Blah Blah Blah Blah</span>
</div>
</body>
</html>
In this instance you could just have jQuery swap the classes and you'd get the change for free with css. I haven't tested this particular usage but it should work.
We use this for things like outage messages.
If you're running a Node.js app, this module will do exactly what you want, a JQuery style DOM manipulator: https://github.com/cheeriojs/cheerio
An example from their wiki:
var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');
$.html();
//=> <h2 class="title welcome">Hello there!</h2>