lxml: Get all items but test the next one as well - Python

I'm having trouble trying to parse this XML with lxml. I'm using Python 3.6.9.
The document looks something like this:
<download date="22/05/2020 08:34">
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="y"/>
<subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
<link url="http://xpto" document="z"/>
<subjects number="1"><subject>Text explaining the previous link</subject></subjects>
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="z"/>
</download>
Currently, I'm able to get all the links (which is easy to accomplish) using this code:
import requests
from lxml import html
response = html.fromstring(requests.post(url_post, data=data).content)
links = response.xpath('//link')
As shown in the XML above, the subjects elements, when they exist, explain the previous link. Sometimes there is more than one subject (in the example above, one subjects element has number="2", which means it contains two subject items, while the other subjects has just one). It is a large XML file, so this pattern (many bare links, then a link with an explanation after it) occurs very often.
How can I build a query that gets all these links and, when a subjects element exists next to one (immediately after the link, to be precise), groups it with or inserts it into that link?
My dream would be something like this:
<link url="http://xpto" document="y" subjects="Text explaining the previous link| Another text explaining the thing"/>
A list with both links and subjects together would help a lot as well.
[
[<link url="http://xpto" document="y"/>],
[<link url="http://xpto" document="y"/>, <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>],
[<link url="http://xpto" document="y"/>],
]
Please feel free to suggest something different, of course.
Thank you, folks!

This does what I think it is you need:
from lxml import html
example = """
<link url="some_url" document="a"/>
<link url="some_url" document="b"/>
<subjects><subject>some text</subject></subjects>
<link url="some_url" document="c"/>
<link url="some_url" document="d"/>
<subjects><subject>some text</subject><subject>some more</subject></subjects>
"""
response = html.fromstring(example)
links = response.xpath('//link')
result = []
for link in links:
    result.append([link])
    next_element = link.getnext()
    if next_element is not None and next_element.tag == 'subjects':
        result[-1].append(next_element)
print(result)
Result:
[[<Element link at 0x1a0891e0d60>], [<Element link at 0x1a0891e0db0>, <Element subjects at 0x1a089096360>], [<Element link at 0x1a0891e0e00>], [<Element link at 0x1a0891e0e50>, <Element subjects at 0x1a0891e0d10>]]
Note that the lists still contain lxml Element objects, you can turn them into strings of course, if you need.
The key step is the next_element = link.getnext() line. For an lxml Element, the .getnext() method returns the next sibling in the document. So, although you're looping over link elements matched with .xpath(), link.getnext() will still get you a subjects element if that's the next sibling in the document. If there is no next element (i.e. for the last link, if it's not followed by a subjects), .getnext() will return None, which is why the following lines of code check for is not None.
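If you need the "dream" output from the question, here is a minimal sketch (my addition, not part of the original answer) that copies the joined subject texts into a subjects attribute on each link and serializes it:
for group in result:
    link = group[0]
    if len(group) > 1:  # a subjects element followed this link
        texts = [s.text for s in group[1].findall('subject')]
        link.set('subjects', ' | '.join(texts))
    print(html.tostring(link, with_tail=False, encoding='unicode'))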

This isn't the most elegant way of doing things, but it gets the job done...
subjects= """
<download date="22/05/2020 08:34">
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="y"/>
<subjects number="2">
<subject>First Text explaining the previous link</subject>
<subject>Another text explaining the previous link</subject>
</subjects>
<link url="http://xpto2" document="z"/>
<subjects number="1"><subject>Second Text explaining the previous link</subject></subjects>
<link url="http://xpto3" document="y"/>
<link url="http://xpto4" document="z"/>
</download>
"""
#Note that I changed your html a bit to emphasize the differences between nodes
import lxml.html as lh
import elementpath
doc = lh.fromstring(subjects)
elements = elementpath.select(doc, "//link[following-sibling::*[1][name()='subjects']]/concat('<link url=',./@url, ' document=xxx',@document,'xxx subjects=xxx',string-join(./following-sibling::subjects[1]//subject,' | '),'xxx/>')")
# I needed to use the xxx placeholder because I couldn't find a way to escape the double quote marks inside the expression, and this way is simple to implement
for element in elements:
    print(element.replace('xxx','"'))
Output:
<link url=http://xpto document="y" subjects="First Text explaining the previous link | Another text explaining the previous link"/>
<link url=http://xpto2 document="z" subjects="Second Text explaining the previous link"/>
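As an aside (my addition): the xxx placeholder can be avoided by delimiting the Python string with triple quotes, so the XPath string literals can use single quotes while containing literal double quotes:
elements = elementpath.select(doc, """//link[following-sibling::*[1][name()='subjects']]/concat('<link url="', ./@url, '" document="', @document, '" subjects="', string-join(./following-sibling::subjects[1]//subject, ' | '), '"/>')""")
for element in elements:
    print(element)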

I came up with this solution.
It is a little slower than @Grismar's suggestion, but it achieves the insertion of the 'subjects' into the link. On the other hand, it saved me the need to loop through the list one more time to unpack the '[[link, subjects],]' pairs.
filteredData = response.xpath('//link | //subjects')  # get both link and subjects elements, in document order
for item in list(filteredData):  # iterate over a copy, since the list is mutated below
    if item.tag == 'subjects':
        filteredData[filteredData.index(item) - 1].append(item)  # move the subjects element inside the preceding link
        filteredData.remove(item)
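For reference, a small sketch (my addition) showing how the merged list can then be consumed, since each remaining link now carries its subjects element as a child:
for link in filteredData:
    subjects = link.find('subjects')  # present only when a subjects block followed the link
    texts = [s.text for s in subjects.findall('subject')] if subjects is not None else []
    print(link.get('url'), link.get('document'), ' | '.join(texts))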

Related

How to find the list of all HTML tags which are active on a particular piece of data

I want to parse HTML to convert it to some other format while keeping some of the styles (Bolds, lists, etc).
To better explain what I mean,
Consider the following code:
<html>
<body>
<h2>A Nested List</h2>
<p>List <b>can</b> be nested (lists inside lists):</p>
<ul>
<li>Coffee</li>
<li>Tea
<ul>
<li>Black tea</li>
<li>Green tea</li>
</ul>
</li>
<li>Milk</li>
</ul>
</body>
</html>
Now if I were to select the word "List" at the start of the paragraph, my output should be (html, body, p), since those are the tags active on the word "List".
Another example: if I were to select the words "Black tea", my output should be (html, body, ul, li, ul, li), since it's part of the nested list.
I've seen the Chrome inspector do this, but I'm not sure how I can do this in code using Python.
Here is an image of what the Chrome inspector shows:
[screenshot: Chrome Inspector showing the tag chain]
I've tried parsing the HTML using BeautifulSoup, and while it is amazing for getting data, I was unable to solve my problem using it.
Later I tried Python's HTML parser for this same issue, trying to maintain a stack of all tags seen before a piece of data and popping them as I encountered the corresponding end tags, but I couldn't get it to work either.
As you said in your comment, it may or may not get you what you want, but it may be a start. So I would try it anyway and see what happens:
from lxml import etree
snippet = """[your html above]"""
root = etree.fromstring(snippet)
tree = etree.ElementTree(root)
targets = ['List','nested','Black tea']
for e in root.iter():
    for target in targets:
        if (e.text and target in e.text) or (e.tail and target in e.tail):
            print(target, ' :', tree.getpath(e))
Output is
List : /html/body/h2
List : /html/body/p
nested : /html/body/p/b
Black tea : /html/body/ul/li[2]/ul/li[1]
As you can see, what this does is give you the xpath to the selected text targets. A couple of things to note: first, "List" appears twice because it appears twice in the text. Second, the "Black tea" xpath contains positional values (for example, the [2] in /li[2]) which indicate that the target string appears in the second li element at that level of the snippet. If you don't need that, you may want to strip that information from the output (or use another tool).
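If you only want the bare tag chain, a one-liner (my addition, not part of the original answer) can strip the positional predicates from the path:
import re
path = '/html/body/ul/li[2]/ul/li[1]'
print(re.sub(r'\[\d+\]', '', path))  # prints /html/body/ul/li/ul/li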

Python lxml cant navigate when using namespace [duplicate]

I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
<img class="foo" src="bar.png"/>
</body>
</html>
If I parse the document using lxml.html, I can get the IMG with an xpath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
But that xpath is, again, obviously not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.
So, what am I missing?
The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.
Try this:
>>> tree.getroot().xpath(
... "//xhtml:img",
... namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
... )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
XPath considers all unprefixed names to be in "no namespace".
In particular the spec says:
"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "
See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.
Hope this helped.
Cheers,
Dimitre Novatchev
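An alternative worth knowing (my addition, not from either answer) is to sidestep prefixes entirely by matching on the local name only, which is handy for quick exploration though more verbose:
>>> tree.getroot().xpath("//*[local-name()='img']")
[<Element {http://www.w3.org/1999/xhtml}img at ...>]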
If you are going to use tags from a single namespace only, as I see is the case above, you are much better off using lxml.objectify.
In your case it would be like this:
from lxml import objectify
tree = objectify.parse(url)  # also available: objectify.fromstring
root = tree.getroot()  # the root <html> element
You can access the nodes as attributes:
body = root.body
for img in body.img:  # assuming all img elements are direct children of the body tag
    print(img.get('src'))  # illustrative: do something with each image
While it might not be of great help with HTML, it can be highly useful for well-structured XML.
For more info, check out http://lxml.de/objectify.html

Using Python and lxml to strip only the tags that have certain attributes/values

I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping only those tags (while leaving their contents) that have particular attributes/values.
For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.
Conversely: I'd like a way to strip all "naked" spans or divs from a tree. Meaning only those spans/divs (or any other elements for that matter) that have absolutely no attributes. Leaving those same elements that have attributes (any) untouched.
I feel I'm missing something obvious, but I've been searching without any luck for quite some time.
HTML
lxml's HTML elements have a method drop_tag(), which you can call on any element in a tree parsed by lxml.html.
It acts similarly to strip_tags in that it removes the element but retains its text, and it can be called on the element itself - which means you can easily select the elements you're not interested in with an XPath expression, then loop over them and remove them:
doc.html
<html>
<body>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get <span attr="foo">removed</span> as well.</div>
<div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
<div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
</body>
</html>
strip.py
from lxml import etree
from lxml import html
doc = html.parse(open('doc.html'))
spans_with_attrs = doc.xpath("//span[@attr='foo']")
for span in spans_with_attrs:
    span.drop_tag()
print etree.tostring(doc)
Output:
<html>
<body>
<div>This is some Text.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get removed as well.</div>
<div>Nested elements will <b>be</b> left alone.</div>
<div>Unless they also match.</div>
</body>
</html>
In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.
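For the second case in the question - stripping "naked" elements with no attributes at all - the same drop_tag() loop works with a different XPath; this is a sketch along the same lines (not part of the original answer), where not(@*) matches elements that have zero attributes:
for el in doc.xpath('//span[not(@*)] | //div[not(@*)]'):
    el.drop_tag()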
XML / XHTML
Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document.
So for XML it's a bit more complicated:
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree
def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')
        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')
        parent = node.getparent()
        prev = node.getprevious()
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)
doc = etree.parse(open('doc.xml'))
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)
print etree.tostring(doc)
Output:
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first span should <span>be</span> removed.</node>
</document>
As you can see, this will replace node and all its children with the recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
NOTE: A later edit changed the code in the question.
I just had the same problem, and after some consideration had this rather hacky idea, borrowed from regex-ing markup in Perl one-liners: how about first catching all unwanted elements with all the power that element.iterfind brings, renaming those elements to something unlikely, and then stripping all those elements?
Yes, this isn't absolutely clean and robust, as you might always have a document that actually uses the "unlikely" tag name you've chosen, but the resulting code IS rather clean and easily maintainable. If you really need to be sure that whatever "unlikely" name you've picked doesn't already exist in the document, you can always check for its existence first, and do the renaming only if you can't find any pre-existing tags of that name.
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree
xml = etree.parse("doc.xml")
deltag ="xxyyzzdelme"
for el in xml.iterfind("//span[#attr='foo']"):
el.tag = deltag
etree.strip_tag(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))
Output
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
I have the same problem, but in my case the scenario is a little easier: I have the option of not removing the tags, just clearing them. Our users see rendered HTML, and if I have, for example:
<div>Hello <strong>awesome</strong> World!</div>
I want to clear the strong tag by the CSS selector div > strong while keeping the tail text. In lxml you can't use strip_tags with keep_tail by selector - you can strip only by tag name - which drove me crazy. Moreover, if you just remove the <strong>awesome</strong> node, you also remove its tail, "World!", the text that follows the strong tag.
The output would then be:
<div>Hello</div>
For me this is fine:
<div>Hello <strong></strong> World!</div>
No awesome for the user anymore.
import lxml.html
import lxml.cssselect

doc = lxml.html.fromstring(markup)
selector = lxml.cssselect.CSSSelector('div > strong')
for el in list(selector(doc)):
    if el.tail:
        tail = el.tail
        el.clear()  # clear() also wipes the tail, so save and restore it
        el.tail = tail
    else:
        # if there is no tail, we can safely just remove the node
        el.getparent().remove(el)
You can adapt the code to physically delete the strong tag with element.remove(child) and attach its tail to the parent, but for my case that was overkill.
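For completeness, a sketch (my addition) of that physical-removal variant, reattaching the tail before removing the node so no text is lost:
def remove_keep_tail(el):
    parent = el.getparent()
    prev = el.getprevious()
    tail = el.tail or ''
    if prev is not None:
        prev.tail = (prev.tail or '') + tail
    else:
        parent.text = (parent.text or '') + tail
    parent.remove(el)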

Trouble parsing xml archive using Element Tree

Python + programming noob here, so you may have to bear with me. I have a number of xml files (RSS archives) and I want to extract news article urls from them. I'm using Python 2.7.3 on Windows... and here's an example of the XML I'm looking at:
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!--
Content-type: Preventing XSRF in IE.
-->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>
Specifically I want to extract the "original id" link:
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
I originally tried using BeautifulSoup for this but ran into problems, and from the research I did it looks like Element Tree is the way to go. First off with ET I tried:
import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()
#first_original_id = root[8][0]
parents_of_interest = root[8::]
for elem in parents_of_interest:
    print elem.items()[0][1]
So far as I can work out parents_of_interest does grab the data I want (as a list of dictionaries) but the for loop only returns a bunch of true statements, and after reading the documentation and SO it seems like this is the wrong approach.
I think this has the answer I'm looking for but even though it's a good explanation I can't seem to apply it to my own situation. From that answer I tried:
print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text
But got the error:
__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely
on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
Any help on this would be appreciated... and sorry if that's a verbose question... but I thought I'd detail everything... just in case.
Your xpath expression matches the first id, not the one you're looking for, and original-id is an attribute of the element, so you should write something like this:
idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')
That will find only the first matching id, if you want them all, use findall and iterate over the results.
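For example, a sketch (my addition, assuming Python 2.7's ElementTree, whose find/findall accept a namespaces mapping):
ns = {'atom': 'http://www.w3.org/2005/Atom'}
for entry in tree.findall('.//atom:entry', ns):
    idelem = entry.find('atom:id', ns)
    print(idelem.get('{http://www.google.com/schemas/reader/atom/}original-id'))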

Retrieving first urban dictionary result for a term in python

I have written some pretty simple code to get the first result for any term on urbandictionary.com. I started by writing a simple snippet to see how their page is formatted:
import urllib

def parseudtest(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    for lines in url_info:
        print lines
For a test, I searched for 'cats', and used that as the variable searchurl. The output I receive is of course a gigantic page, but here is the part I care about:
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='He set us up the bomb. Also took all our base.' property='og:description' />
<meta content='cats' property='og:title' />
<meta content="http://static3.urbandictionary.com/rel-1e0b481/images/og_image.png" property="og:image" />
<meta content='Urban Dictionary' property='og:site_name' />
As you can see, the first time the element "meta content" appears on the site, it is the first definition for the search term. So I wrote this code to retrieve it:
import urllib
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    if url_info:
        xmldoc = minidom.parse(url_info)
        if xmldoc:
            definition = xmldoc.getElementsByTagName('meta content')[0].firstChild.data
            print definition
For some reason the parsing doesn't seem to work and encounters an error every time. It is especially confusing since the site appears to use basically the same format as other sites I have successfully retrieved specific data from. If anyone could help me figure out what I am messing up here, it would be greatly appreciated.
As you don't give the traceback for the errors that occur it's hard to be specific, but I assume that although the site claims to be XHTML it's not actually valid XML. You'd be better off using Beautiful Soup as it is designed for parsing HTML and will correctly handle broken markup.
I've never used the minidom parser, but I think the problem is that you call:
xmldoc.getElementsByTagName('meta content')
while the tag name is meta; content is just the first attribute (as shown pretty well by the highlighting of your html code).
Try to replace that bit with:
xmldoc.getElementsByTagName('meta')
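Putting it together, a sketch (my addition) of the corrected lookup - note that the value you want lives in the content attribute, not in a child text node:
metas = xmldoc.getElementsByTagName('meta')
if metas:
    print(metas[0].getAttribute('content'))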
