I noticed that the XML entity &quot; is automatically converted to its literal character:
>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
<elem>"hello world"</elem>
</root>
>>>
I found one related old thread (2009-02-07):
s = cStringIO.StringIO("""&quot;She&apos;s the MAN!&quot;""")
e = etree.parse(s,etree.XMLParser(resolve_entities=False))
Note that there's also etree.fromstring().
etree.tostring(e)
'"She\'s the MAN!"'
I would have expected resolve_entities=False to have prevented the
translation of, e.g., &quot; to ".
The "resolve_entities" option is meant for entities defined in a DTD
of which you want to keep the reference instead of the resolved value.
The entities you mention are part of the XML spec, not of a DTD.
is there another way to prevent this behavior (or, if nothing else,
reverse it after the fact)?
Well, what you get is well-formed XML. May I ask why you need the
entity references in the output?
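The point made in the quoted reply can be checked with the standard library alone. The sketch below uses xml.etree.ElementTree instead of lxml (an assumption made only to keep it dependency-free): &quot; and a literal " parse to identical character data, so nothing in the tree records which spelling the source file used.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same character data.
with_entity = ET.fromstring("<elem>&quot;hello world&quot;</elem>")
with_literal = ET.fromstring('<elem>"hello world"</elem>')

# After parsing, both elements hold exactly the same text.
same_text = with_entity.text == with_literal.text

# Text content is serialized with only &, < and > escaped,
# so the quotes come back as literal characters.
serialized = ET.tostring(with_entity, encoding="unicode")
print(same_text, serialized)
```

Because the distinction is gone once the document is parsed, any fix has to happen at serialization time.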
Still, the response only asks why you would want to do that; there is no direct answer to the problem. I'm quite surprised that the etree parser forces the conversion without offering an option to disable it.
The following example shows why I need a solution. This XML is for the XBMC skinning parser:
>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>Close</onfocus>
<onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
</control>
</controls>
</window>
>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1]) # remove one element, just for demonstration
...
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read() # the file after the change
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
</control>
</controls>
</window>
>>>
As you can see in the onfocus element under id "103" at the end, the &quot; entities are no longer in their original form. This leads to a bug if the "$INFO[VideoPlayer.Album]" variable contains nested quotes and becomes ""test"", which is invalid and causes an error.
So is there any hacky way I can keep &quot; in its original form?
[UPDATE]:
For anyone interested: the other three predefined XML entities (gt, lt and amp) only get converted when using method="html" inside a script tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, share this behavior, which can be confusing:
>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
<script>"'&amp;&gt;&lt;</script>
<p>"'&amp;&gt;&lt;</p>
</root>
>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>
>>>
[UPDATE2]:
The following is the list of all possible html tags:
#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
from lxml import etree as et
for e in acceptable_elements:
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:
        print s
Running this code prints the following:
<area>
<br>
<col>
<hr>
<img>
<input>
As you can see, only the opening tag is printed and the rest disappears into a black hole. I tested all five XML entities and they all behave the same way, which is confusing. This does not happen when using HTMLParser, so I guess there is a bug somewhere between the fromstring step (whose method should default to xml) and the tostring(method="html") step. It also turns out to have nothing to do with entities, because "<img>hello</img>" (without entities) is truncated to <img> as well. The "hello" text is not actually lost, though: it reappears when printing with method="xml".
from xml.sax.saxutils import escape
from lxml import etree

def to_string(xdoc):
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            # escape() handles &, < and >; the mapping adds the two quote entities
            text = escape(elem.text, {"'": "&apos;", "\"": "&quot;"}) if elem.text is not None else ""
            attrs = "".join([' %s="%s"' % (k, v) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            r += "</%s>%s" % (elem.tag, elem.tail if elem.tail else "\n")
    return r

xdoc = etree.fromstring(xml_text)
s = to_string(xdoc)
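The second argument to escape is what restores the entity references in the function above: xml.sax.saxutils.escape always escapes &, < and >, and additionally replaces every key of the mapping it is given. A quick check on made-up sample text (the name QUOTE_ENTITIES and the sample string are illustrative only):

```python
from xml.sax.saxutils import escape

# Extra replacements applied on top of the default &, < and > escaping.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

sample = 'RunScript("a.py","it\'s") & more'
escaped = escape(sample, QUOTE_ENTITIES)
print(escaped)
```

The ampersand is handled before the mapping is applied, so the & inside the newly written &quot; is not escaped a second time.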
Related
I am using the bleach package to strip away invalid html. I am puzzled why the dir attribute is being stripped from my string. Is dir not an attribute, or could it just be that the package does not support dir?
I have included the entire script, so you can run it for your convenience.
import bleach

string = """<p dir="rtl">asdasdasd <span>asdasdasd</span> asdsadasdsad .<br data-mce-bogus="1"></p>"""

def strip_invalid_html(html):
    """ strips invalid tags/attributes """
    allowed_tags = [
        'p', 'a', 'blockquote',
        'h1', 'h2', 'h3', 'h4', 'h5',
        'strong', 'em',
        'br',
        'span',
    ]
    allowed_attributes = {
        'a': ['href', 'title'],
        'dir': ['rtl', 'ltr']
    }
    cleaned_html = bleach.clean(
        html,
        attributes=allowed_attributes,
        strip=True,
        tags=allowed_tags
    )
    print(cleaned_html)

strip_invalid_html(string)
If you pass a dict for attributes, the dict should map tag names to allowed attribute names, not map attribute names to allowed attribute values.
If you want 'dir' to be an allowed attribute for p tags, you need a 'p': ['dir'] entry, not a 'dir': ['rtl', 'ltr'] entry:
allowed_attributes = {
    'a': ['href', 'title'],
    'p': ['dir'],
}
I am trying to parse an xml file with regular expression.
Whichever script tag has "catch" alias, I need to collect "type" and "value".
<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>
I tried this regular expression with multiline and dotall:
>>> re.findall(r'script\s+type=\"(\w+)\".*alias=\"catch\"\s+value=\"(\d+)\"', a, re.MULTILINE|re.DOTALL)
Output which I am getting is:
[('abc', '8')]
Expected output is:
[('abc', '4'), ('xyz', '8')]
Can someone help me in figuring out what I am missing here?
I recommend using BeautifulSoup. You can parse through the tags and, with a little bit of data re-structuring, easily check for the right alias values and store the related attributes of interest. Like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
to_keep = []
for script in soup.find_all("script"):
    t = script["type"]
    attrs = {
        k: v for k, v in [attr.split("=")
                          for attr in script.contents[0].split()
                          if "=" in attr]
    }
    if attrs["alias"] == '"catch"':
        to_keep.append({"type": t, "value": attrs["value"]})

to_keep
# [{'type': 'abc', 'value': '"4"'}, {'type': 'xyz', 'value': '"8"'}]
Data:
data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""
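If an extra dependency is not wanted, a similar result can be sketched with the standard library's xml.etree.ElementTree. This assumes the fragment is well-formed apart from lacking a single root element, so the sketch wraps it in a made-up <root> tag before parsing; snippet simply repeats the sample data.

```python
import xml.etree.ElementTree as ET

snippet = """<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# The fragment has two top-level elements, so wrap it in a hypothetical root.
root = ET.fromstring("<root>" + snippet + "</root>")

# Collect (type, value) for every <line> whose alias is "catch".
pairs = [
    (script.get("type"), line.get("value"))
    for script in root.iter("script")
    for line in script.iter("line")
    if line.get("alias") == "catch"
]
print(pairs)
```

Here the attribute values come back without the surrounding quotes, because the parser reads them as attributes and not as raw text.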
Found the answer. Thanks downvoter. I don't think there was any need to downvote this question.
>>> re.findall(r'script\s+type=\"(\w+)\".*?alias=\"catch\"\s+value=\"(\d+)\".*?\<\/script\>', a, re.MULTILINE|re.DOTALL)
[('abc', '4'), ('xyz', '8')]
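The change that matters is `.*?` in place of `.*`. With re.DOTALL a greedy `.*` runs to the end of the input and backtracks to the last alias="catch", which pairs the first type with the last value; the lazy form stops at the nearest one. A small comparison on the sample data (the variable names are illustrative):

```python
import re

a = """<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# Greedy: .* swallows everything up to the LAST alias="catch".
greedy = re.findall(r'script\s+type="(\w+)".*alias="catch"\s+value="(\d+)"', a, re.DOTALL)

# Lazy: .*? stops at the FIRST alias="catch" after each script tag.
lazy = re.findall(r'script\s+type="(\w+)".*?alias="catch"\s+value="(\d+)"', a, re.DOTALL)

print(greedy)
print(lazy)
```

Note that the lazy pattern can still pair a script tag with a "catch" line from a later script block if its own block has none, which is one reason a real XML parser is the safer choice.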
Folks, I am new (brand new) to Python, so after taking a course I decided to create a script to convert an XML file to CSV. The file in question is 2 GB in size, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse functionality. For reference, the XML file I am looking to convert looks like this:
<Document>
<type></type>
<internal_id></internal_id>
<name></name>
<number></number>
<cadname></cadname>
<version></version>
<iteration></iteration>
**<isLatest></isLatest>**
<modifiedBy>
<username></username>
<email/>
</modifiedBy>
<content>
**<name></name>**
<id></id>
<uploaded></uploaded>
<refSize></refSize>
<storage>
<vault></vault>
<folder></folder>
**<filename></filename>**
<location></location>
**<actualLocation></actualLocation>**
</storage>
<replicatedTo></replicatedTo>
<copies></copies>
<status></status>
</content>
I am using the value of isLatest to determine whether I need to add the items to the CSV file. If the value is "true" I want the data to move to the CSV file. Here is the code that works to a point:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml")
# open a file for writing
csvfile = open('windchill.txt', 'w', encoding="utf-8")
# create the csv writer object
csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:
    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
        document.clear()
csvfile.close()
If I run the code, I get this error:
Traceback (most recent call last):
  File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
    if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'
A file is created that has 91,000 entries that look like this:
plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7
There seem to be two issues with the output.
Some items seem to be duplicated, although the source file has no duplications. The name could be duplicated in the source file, but there can only be one name value that is .
I don't think the file completed. I looked at the last entry of my TXT (CSV) file and it does not match the last line of my source file. I was assuming the iterator was serial in nature.
So, any idea what the error is telling me, and any idea why I may be seeing duplicates? Originally I thought the error may have been related to me not ending gracefully. I am confident the XML is formed properly throughout, but maybe that is a bad assumption.
******UPDATES******
Here is a sample of the elements.
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33709881</internal_id>
<name>bga_13x11p137_0_4_0_8.prt</name>
<number>BGA_13X11P137_0_4_0_8.PRT</number>
<cadname>bga_13x11p137_0_4_0_8.prt</cadname>
<version>A</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>ets027 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>bga_13x11p137_0_4_0_8.prt</name>
<id>5341368</id>
<uploaded>Jan 13, 2006 09:14:41</uploaded>
<refSize>287764</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000505a6</filename>
<location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34570129</internal_id>
<name>d61-2446-02_nest_plate.prt</name>
<number>D61-2446-02_NEST_PLATE.PRT</number>
<cadname>d61-2446-02_nest_plate.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>esb044c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d61-2446-02_nest_plate.prt</name>
<id>5344204</id>
<uploaded>Jan 30, 2006 09:09:24</uploaded>
<refSize>109278</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000518e8</filename>
<location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33512036</internal_id>
<name>d68-2568-07_press_head_adhesive.prt</name>
<number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
<cadname>d68-2568-07_press_head_adhesive.prt</cadname>
<version>-</version>
<iteration>2</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>e3789c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d68-2568-07_press_head_adhesive.prt</name>
<id>5340927</id>
<uploaded>Jan 10, 2006 15:42:31</uploaded>
<refSize>76314</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>0000000005024b</filename>
<location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34715717</internal_id>
<name>dbk_flip_sleeve.prt</name>
<number>DBK_FLIP_SLEEVE.PRT</number>
<cadname>dbk_flip_sleeve.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>EKA014 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>dbk_flip_sleeve.prt</name>
<id>5344969</id>
<uploaded>Feb 01, 2006 12:54:43</uploaded>
<refSize>847210</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>00000000051b54</filename>
<location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
Here is my updated code:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml", events=('start', 'end'))
csvfile = open('windchill.txt', 'w', encoding="utf-8")
csvwriter = csv.writer(csvfile)
for event, document in parser:
    if event=='end' and document.tag=='Document':
        if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
            row = []
            version = document.find('version').text
            row.append(version)
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            # folder = document.find('content').find('storage').find('actualLocation').text
            folder = document.find('content').find('storage').find('folder').text
            row.append(folder)
            csvwriter.writerow(row)
csvfile.close()
I added a check for type: documents of type wt.epm.EPMDocument will have the record. I then want to pull the data out of the content element, specifically name, folder, and filename. I was originally using actualLocation instead of folder, but changed it hoping the shorter value would help with my memory error.
Concerning your first issue: iterparse 'sees' each and every xml element in a document when that element starts and, again, when it closes. This probably explains the duplication that you find. Not only must you filter for the element(s) that you want, you must filter for the appropriate event. You might look at this answer, https://stackoverflow.com/a/46167799/131187, to see how to deal with this.
Concerning the second: When document.find('isLatest') fails to find what you've requested it returns None, rather than an object representing an xml element. None has no properties, including text, therefore, your program croaks at that point, and you receive an incomplete csv file.
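One way to avoid that AttributeError is Element.findtext, which returns None (or a default you supply) when the path is missing instead of handing back an element whose .text you then have to read. A minimal sketch on a made-up two-record sample in which the second Document has no isLatest element; the root name Documents is an assumption, since the real root is not shown in the question:

```python
import io
import xml.etree.ElementTree as ET

sample = b"""<Documents>
  <Document><isLatest>true</isLatest><content><name>a.prt</name></content></Document>
  <Document><content><name>b.prt</name></content></Document>
</Documents>"""

latest_names = []
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag != "Document":
        continue
    # findtext returns None when <isLatest> is absent, so the comparison
    # is simply False and no .text lookup happens on None.
    if elem.findtext("isLatest") == "true":
        latest_names.append(elem.findtext("content/name", default=""))
print(latest_names)
```

findtext also accepts a path such as content/name, which replaces the chained find() calls.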
Edit in answer to comment: This code parses the xml but does not write the csv. csv records would be written in the save_csv_record function, or its equivalent. It appears only once in the code so should be easy to replace.
Called the way it is in this code, iterparse returns only 'end' events and their corresponding XML elements. Therefore, the code watches for the 'end' of a 'Document'. When it sees one, it asks whether the document contains an 'isLatest' of 'true'. If it does, it writes it out; if not, it ignores it and empties document_content. If the code has not seen the 'end' of a document, it simply saves the content of the tag and keeps reading until it does.
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        if document_content['isLatest'] == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
Output:
{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
EDITED FOR LATEST CODE:
Here is the new code that I am using, which still runs out of memory:
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        # parentheses allow the condition to span two lines
        if (document_content['type']=='wt.epm.EPMDocument' and
                document_content['isLatest'] == 'true'):
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
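A likely cause of the memory error is that iterparse keeps every parsed element attached to the tree unless it is explicitly discarded, so a 2 GB file ends up fully in memory. A common pattern, sketched here under the assumption that every Document is a direct child of the root element, is to keep a reference to the root and clear it after each record. The helper name latest_rows, the root name Documents and the sample data are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

def latest_rows(source):
    """Yield [name, filename, folder] for each Document with isLatest == 'true'."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Document":
            if elem.findtext("isLatest") == "true":
                yield [
                    elem.findtext("content/name", default=""),
                    elem.findtext("content/storage/filename", default=""),
                    elem.findtext("content/storage/folder", default=""),
                ]
            # Detach finished records from the root so memory use stays flat.
            root.clear()

sample = b"""<Documents>
  <Document><isLatest>false</isLatest><content><name>x.prt</name>
    <storage><folder>v7</folder><filename>0001</filename></storage></content></Document>
  <Document><isLatest>true</isLatest><content><name>y.prt</name>
    <storage><folder>v7</folder><filename>0002</filename></storage></content></Document>
</Documents>"""

out = io.StringIO()
csv.writer(out).writerows(latest_rows(io.BytesIO(sample)))
print(out.getvalue())
```

For the real file, the source would be the path "windchill.xml" and the writer would wrap a file opened with newline="" as the csv module recommends.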
How to get all text before an element in a etree separated from the text after the element?
from lxml import etree
tree = etree.fromstring('''
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
''')
What do I want? In this example, the <dd> tags are separators and for all of them
for el in tree.findall('.//dd'):
I would like to have all text before and after them:
[
{
el : <Element dd at 0xsomedistinctadress>,
before : 'find the text',
after : 'before and after'
},
{
el : <Element dd at 0xsomeotherdistinctadress>,
before : 'find the text before',
after : 'and after'
}
]
My idea was to use some kind of placeholders in the tree with which I replace the <dd> tags and then cut the string at that placeholder, but I need the correspondence with the actual element.
There might be a simpler way, but I would use the following XPath expressions:
preceding-sibling::*/text()|preceding-sibling::text()
following-sibling::*/text()|following-sibling::text()
Sample implementation (definitely violating the DRY principle):
def get_text_before(element):
    for item in element.xpath("preceding-sibling::*/text()|preceding-sibling::text()"):
        item = item.strip()
        if item:
            yield item

def get_text_after(element):
    for item in element.xpath("following-sibling::*/text()|following-sibling::text()"):
        item = item.strip()
        if item:
            yield item

for el in tree.findall('.//dd'):
    before = " ".join(get_text_before(el))
    after = " ".join(get_text_after(el))
    print {
        "el": el,
        "before": before,
        "after": after
    }
Prints:
{'el': <Element dd at 0x10af81488>, 'after': 'before and after', 'before': 'find the text'}
{'el': <Element dd at 0x10af81200>, 'after': 'and after', 'before': 'find the text before'}
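For comparison, the same split can be sketched without XPath by walking the tree in document order and cutting the stream of text chunks at each separator. This version uses only xml.etree.ElementTree; the helper names walk and split_text_at are made up for illustration, and it is an alternative to the XPath approach above, not a description of how lxml evaluates it.

```python
import xml.etree.ElementTree as ET

def walk(elem):
    """Yield ('start', element) and ('text', string) items in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)

def split_text_at(root, tag):
    """Return one dict per `tag` element with the text before and after it."""
    items = list(walk(root))

    def joined(part):
        # Keep only non-empty text chunks, stripped of surrounding whitespace.
        words = (value.strip() for kind, value in part if kind == "text")
        return " ".join(w for w in words if w)

    return [
        {"el": value, "before": joined(items[:i]), "after": joined(items[i + 1:])}
        for i, (kind, value) in enumerate(items)
        if kind == "start" and value.tag == tag
    ]

tree = ET.fromstring("<a>find<b>the</b>text<dd/><c>before</c><dd/>and after</a>")
results = split_text_at(tree, "dd")
for r in results:
    print(r["before"], "|", r["after"])
```

Unlike the sibling-based XPath, this walk also picks up text nested at any depth, and any text inside a separator element itself would be counted as "after".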
I am having difficulty parsing the XML file below using lxml:
>>> _file = "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>>> from lxml.etree import XMLParser, parse
>>> parsefile = parse(_file)
>>> print parsefile
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the location of the lxml object, whereas I am after the actual file content, i.e.:
Desired output={'document reference'="suspicious-document00500.txt", 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? thanks.
Here's one way of getting the desired outputs:
from lxml import etree

def main():
    doc = etree.parse('qv.xml')
    root = doc.getroot()
    print root.attrib
    for item in root:
        print item.attrib

if __name__ == "__main__":
    main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read this to see how etree represents XML objects.
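If one dictionary per feature is wanted, with the document's reference folded in as the question's desired output suggests, the attribute mappings can be merged. A sketch using the standard library's ElementTree on a shortened copy of the file content; the key name "document reference" follows the question, and the sample keeps only a few of the attributes:

```python
import xml.etree.ElementTree as ET

content = """<document reference="suspicious-document00500.txt">
<feature name="plagiarism" this_offset="128" this_length="2503"/>
<feature name="plagiarism" this_offset="8593" this_length="1582"/>
</document>"""

root = ET.fromstring(content)

# Element.attrib is a plain mapping, so it can be unpacked into a new dict.
records = [
    {**feature.attrib, "document reference": root.get("reference")}
    for feature in root.iter("feature")
]
for record in records:
    print(record)
```

The same attrib and get calls exist on lxml elements, so the comprehension should carry over to the lxml version above unchanged.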