Folks, I am new (brand new) to Python, so after taking a course I decided to create a script to convert an XML file to CSV. The file in question is 2 GB in size, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse functionality. For reference, the XML file I am looking to convert looks like this:
<Document>
<type></type>
<internal_id></internal_id>
<name></name>
<number></number>
<cadname></cadname>
<version></version>
<iteration></iteration>
**<isLatest></isLatest>**
<modifiedBy>
<username></username>
<email/>
</modifiedBy>
<content>
**<name></name>**
<id></id>
<uploaded></uploaded>
<refSize></refSize>
<storage>
<vault></vault>
<folder></folder>
**<filename></filename>**
<location></location>
**<actualLocation></actualLocation>**
</storage>
<replicatedTo></replicatedTo>
<copies></copies>
<status></status>
</content>
</Document>
I am using the value of isLatest to determine whether I need to add the items to the CSV file. If the value is "true" I want the data to move to the CSV file. Here is the code that works to a point:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml")
# open a file for writing
csvfile = open('windchill.txt', 'w', encoding="utf-8")
# create the csv writer object
csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:
    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
    document.clear()
csvfile.close()
If I run the code, I get this error:
Traceback (most recent call last):
  File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
    if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'
A file is created that has 91,000 entries that look like this:
plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7
There seem to be two issues with the output.
Some items seem to be duplicated, although the source file has no duplications. The name could be duplicated in the source file, but there can only be one name value that is latest.
I don't think the run completed. I looked at the last entry of my TXT (CSV) file and it does not match the last line of my source file. I was assuming the iterator was serial in nature.
So, any idea what the error is telling me, and any idea why I may be seeing duplicates? Originally I thought the error may have been related to me not ending gracefully. I am confident the XML is formed properly throughout, but maybe that is a bad assumption.
******UPDATES******
Here is a sample of the elements.
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33709881</internal_id>
<name>bga_13x11p137_0_4_0_8.prt</name>
<number>BGA_13X11P137_0_4_0_8.PRT</number>
<cadname>bga_13x11p137_0_4_0_8.prt</cadname>
<version>A</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>ets027 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>bga_13x11p137_0_4_0_8.prt</name>
<id>5341368</id>
<uploaded>Jan 13, 2006 09:14:41</uploaded>
<refSize>287764</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000505a6</filename>
<location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34570129</internal_id>
<name>d61-2446-02_nest_plate.prt</name>
<number>D61-2446-02_NEST_PLATE.PRT</number>
<cadname>d61-2446-02_nest_plate.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>esb044c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d61-2446-02_nest_plate.prt</name>
<id>5344204</id>
<uploaded>Jan 30, 2006 09:09:24</uploaded>
<refSize>109278</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000518e8</filename>
<location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33512036</internal_id>
<name>d68-2568-07_press_head_adhesive.prt</name>
<number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
<cadname>d68-2568-07_press_head_adhesive.prt</cadname>
<version>-</version>
<iteration>2</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>e3789c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d68-2568-07_press_head_adhesive.prt</name>
<id>5340927</id>
<uploaded>Jan 10, 2006 15:42:31</uploaded>
<refSize>76314</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>0000000005024b</filename>
<location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34715717</internal_id>
<name>dbk_flip_sleeve.prt</name>
<number>DBK_FLIP_SLEEVE.PRT</number>
<cadname>dbk_flip_sleeve.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>EKA014 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>dbk_flip_sleeve.prt</name>
<id>5344969</id>
<uploaded>Feb 01, 2006 12:54:43</uploaded>
<refSize>847210</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>00000000051b54</filename>
<location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
Here is my updated code:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml", events=('start', 'end'))
csvfile = open('windchill.txt', 'w', encoding="utf-8")
csvwriter = csv.writer(csvfile)
for event, document in parser:
    if event == 'end' and document.tag == 'Document':
        if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
            row = []
            version = document.find('version').text
            row.append(version)
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            # folder = document.find('content').find('storage').find('actualLocation').text
            folder = document.find('content').find('storage').find('folder').text
            row.append(folder)
            csvwriter.writerow(row)
csvfile.close()
I added in a check for type; type wt.epm.EPMDocument will have the record. I then want to pull the data out of the storage element, specifically name, folder, and filename. I originally was using actualLocation instead of folder, but changed it hoping the shorter name would help with my memory error.
Concerning your first issue: iterparse 'sees' each and every XML element in a document when that element starts and, again, when it closes. This probably explains the duplication that you found. Not only must you filter for the element(s) that you want, you must also filter for the appropriate event. You might look at this answer, https://stackoverflow.com/a/46167799/131187, to see how to deal with this.
Concerning the second: when document.find('isLatest') fails to find what you've requested, it returns None rather than an object representing an XML element. None has no attributes, including text; therefore your program croaks at that point, and you receive an incomplete CSV file.
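A guarded version of the failing line would sidestep the crash (illustrative only; it papers over whatever made the element missing rather than explaining it):

# check find()'s result before dereferencing .text (illustrative sketch)
is_latest = document.find('isLatest')
if is_latest is not None and is_latest.text == 'true':
    # safe to read the rest of the record here
    ...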
Edit in answer to comment: this code parses the XML but does not write the CSV. CSV records would be written in the save_csv_record function, or its equivalent; it appears only once in the code, so it should be easy to replace.
Called the way it is in this code, iterparse returns only 'end' events and their corresponding XML elements. The code therefore watches for the 'end' of a 'Document'. When it sees one, it asks whether the document contains an 'isLatest' of 'true'; if it does, it writes the record out, and either way it resets document_content. If the code has not seen the 'end' of a document, it simply saves the content of the tag and keeps reading until it does.
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        if document_content['isLatest'] == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
Output:
{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
EDITED FOR LATEST CODE:
Here is the new code that I am using, which still runs out of memory:
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        if (document_content['type'] == 'wt.epm.EPMDocument' and
                document_content['isLatest'] == 'true'):
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
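For what it's worth, the usual cure for memory growth with iterparse is to clear elements once they have been consumed; the iterator still builds the full tree behind the scenes, so nothing is freed unless you free it. A minimal sketch of that pattern follows, assuming (as the samples suggest) a single root element wrapping the <Document> elements; .get() is used so a missing tag cannot raise a KeyError:

from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)

context = iterparse('windchill.xml', events=('start', 'end'))
_, root = next(context)              # the first 'start' event yields the root
document_content = {}
for ev, el in context:
    if ev != 'end':
        continue
    if el.tag == 'Document':
        if (document_content.get('type') == 'wt.epm.EPMDocument'
                and document_content.get('isLatest') == 'true'):
            save_csv_record(document_content)
        document_content = {}
        root.clear()                 # drop finished Documents so memory stays flat
    else:
        document_content[el.tag] = el.text.strip() if el.text else None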
Related
I am trying to run this code, where the data of a dictionary is saved in a separate CSV file.
Here is the dict:
body = {
    'dont-ask-for-email': 0,
    'action': 'submit_user_review',
    'post_id': 76196,
    'email': email_random(),
    'subscribe': 1,
    'previous_hosting_id': prev_hosting_comp_random(),
    'fb_token': '',
    'title': review_title_random(),
    'summary': summary_random(),
    'score_pricing': star_random(),
    'score_userfriendly': star_random(),
    'score_support': star_random(),
    'score_features': star_random(),
    'hosting_type': hosting_type_random(),
    'author': name_random(),
    'social_link': '',
    'site': '',
    'screenshot[image][]': '',
    'screenshot[description][]': '',
    'user_data_process_agreement': 1,
    'user_email_popup': '',
    'subscribe_popup': 1,
    'email_asked': 1
}
Now this is the code to write to a CSV file and finally save it:
import pandas as pd

columns = []
rows = []
chunks = body.split('}')
for chunk in chunks:
    row = []
    if len(chunk) > 1:
        entry = chunk.replace('{', '').strip().split(',')
        for e in entry:
            item = e.strip().split(':')
            if len(item) == 2:
                row.append(item[1])
                if chunks.index(chunk) == 0:
                    columns.append(item[0])
        rows.append(row)

df = pd.DataFrame(rows, columns=columns)
df.head()
df.to_csv('r3edata.csv', index=False, header=True)
but this is the error I get:
Traceback (most recent call last):
  File "codeOffshoreupdated.py", line 125, in <module>
    chunks = body.split('}')
AttributeError: 'dict' object has no attribute 'split'
I know that dict has no attribute named split but how do I fix it?
Edit:
format of the CSV I want:
dont-ask-for-email, action, post_id, email, subscribe, previous_hosting_id, fb_token, title, summary, score_pricing, score_userfriendly, score_support, score_features, hosting_type,author, social_link, site, screenshot[image][],screenshot[description][],user_data_process_agreement,user_email_popup,subscribe_popup,email_asked
0,'submit_user_review',76196,email_random(),1,prev_hosting_comp_random(),,review_title_random(),summary_random(),star_random(),star_random(),star_random(),star_random(),hosting_type_random(),name_random(),,,,,1,,1,1
Note: all these functions mentioned return values
Edit2:
I am picking emails from the email_random() function like this:
import csv
import random

def email_random():
    with open('emaillist.txt') as emails:
        read_emails = csv.reader(emails, delimiter='\n')
        return random.choice(list(read_emails))[0]
and the emaillist.txt is like this:
xyz#gmail.com
xya#gmail.com
xyb#gmail.com
xyc#gmail.com
xyd#gmail.com
Other functions also pick their data from files like this.
Since body is a dictionary, you don't have to do any manual parsing to get it into CSV format.
If you want the function calls (like email_random()) to be written into the CSV as such, you need to wrap them in quotes (as I have done below). If you want them to resolve as function calls and write their results, you can keep them as they are.
import csv

def email_random():
    return "john#example.com"

body = {
    'dont-ask-for-email': 0,
    'action': 'submit_user_review',
    'post_id': 76196,
    'email': email_random(),
    'subscribe': 1,
    'previous_hosting_id': "prev_hosting_comp_random()",
    'fb_token': '',
    'title': "review_title_random()",
    'summary': "summary_random()",
    'score_pricing': "star_random()",
    'score_userfriendly': "star_random()",
    'score_support': "star_random()",
    'score_features': "star_random()",
    'hosting_type': "hosting_type_random()",
    'author': "name_random()",
    'social_link': '',
    'site': '',
    'screenshot[image][]': '',
    'screenshot[description][]': '',
    'user_data_process_agreement': 1,
    'user_email_popup': '',
    'subscribe_popup': 1,
    'email_asked': 1
}

with open('example.csv', 'w') as fhandle:
    writer = csv.writer(fhandle)
    items = body.items()
    writer.writerow([key for key, value in items])
    writer.writerow([value for key, value in items])
What we do here is:
with open('example.csv', 'w') as fhandle:
this opens a new file (named example.csv) with write permission ('w') and stores the reference in the variable fhandle. If with is not familiar to you, you can learn more about it from this PEP.
body.items() returns an iterable of tuples; it is retrieved once and reused so that keys and values are paired up in the same order across both rows. Its output looks like [('dont-ask-for-email', 0), ('action', 'submit_user_review'), ...].
We can then write all the keys to the first row using a list comprehension, and all the values to the next row.
This results in
dont-ask-for-email,action,post_id,email,subscribe,previous_hosting_id,fb_token,title,summary,score_pricing,score_userfriendly,score_support,score_features,hosting_type,author,social_link,site,screenshot[image][],screenshot[description][],user_data_process_agreement,user_email_popup,subscribe_popup,email_asked
0,submit_user_review,76196,john#example.com,1,prev_hosting_comp_random(),,review_title_random(),summary_random(),star_random(),star_random(),star_random(),star_random(),hosting_type_random(),name_random(),,,,,1,,1,1
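As an aside, csv.DictWriter expresses the same two-row idea a bit more directly. An equivalent sketch, reusing the body dict from above (newline='' is the csv module's recommended way to avoid blank lines on Windows):

import csv

with open('example.csv', 'w', newline='') as fhandle:
    writer = csv.DictWriter(fhandle, fieldnames=list(body))
    writer.writeheader()   # the dict's keys become the header row
    writer.writerow(body)  # values are matched to the header by key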
I've got a JSON file with 30-ish blocks of "dicts", where every block has an ID, like this:
{
"ID": "23926695",
"webpage_url": "https://.com",
"logo_url": null,
"headline": "aewafs",
"application_deadline": "2020-03-31T23:59:59",
}
Since my script pulls information in the same way from an API more than once, I would like to append new "blocks" to the JSON file only if the ID doesn't already exist there.
I've got something like this so far:
import os
import json

check_empty = os.stat('pbdb.json').st_size
if check_empty == 0:
    with open('pbdb.json', 'w') as f:
        f.write('[\n]')  # writes '[', a line break, and ']'

output = json.load(open("pbdb.json"))
for i in jobs:
    output.append({
        'ID': job_id,
        'Title': jobtitle,
        'Employer': company,
        'Employment type': emptype,
        'Fulltime': tid,
        'Deadline': deadline,
        'Link': webpage
    })

with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file)
but I would like to only do the "output.append" part if the ID doesn't exist in the Json file.
I am not able to complete the code you provided, but I added an example to show how you can achieve a duplicate-free list of jobs (hopefully it helps):
import json

# suppose `data` is your input data with duplicate ids included
data = [{'id': 1, 'name': 'john'}, {'id': 1, 'name': 'mary'}, {'id': 2, 'name': 'george'}]

# using a dictionary comprehension you can eliminate the duplicates and
# finally get the results by calling the `values` method on the dict
noduplicate = list({itm['id']: itm for itm in data}.values())

with open('pbdb.json', 'w') as job_data_file:
    json.dump(noduplicate, job_data_file)
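If you would rather keep the append-style logic from your question, a sketch along these lines should also work; new_jobs here is a hypothetical stand-in for the records your script builds from the API:

import json

with open('pbdb.json') as f:
    output = json.load(f)

# collect the IDs already on disk, then append only unseen records
existing_ids = {entry['ID'] for entry in output}
for job in new_jobs:  # new_jobs: hypothetical list of freshly built job dicts
    if job['ID'] not in existing_ids:
        output.append(job)
        existing_ids.add(job['ID'])

with open('pbdb.json', 'w') as f:
    json.dump(output, f)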
I'll just go with a database guys, thank you for your time, we can close this thread now
I am a fairly new dev and am trying to parse "id" values from this file. I am running into the issue below.
My python code:
import ast
from pathlib import Path

file = Path.home() / 'AppData' / 'Roaming' / 'user-preferences-prod'
with open(file, 'r') as f:
    contents = f.read()

ids = ast.literal_eval(contents)
profileids = []
for data in ids:
    test = data.get('id')
    profileids.append(test)
print(profileids)
This returns the error: ValueError: malformed node or string: <_ast.Name object at 0x0000023D8DA4D2E8> at ids = ast.literal_eval(contents)
A snippet of the content in my file of interest:
{"settings":{"defaults":{"value1":,"value2":,"value3":null,"value4":null,"proxyid":null,"sites":{},"sizes":[],"value5":false},"value6":true,"value11":,"user":{"value9":"","value8": ,"value7":"","value10":""},"webhook":"},'profiles':[{'billing': {'address1': '', 'address2': '', 'city': '', 'country': 'United States', 'firstName': '', 'lastName': '', 'phone': '', 'postalCode': '', 'province': '', 'usesBillingInformation': False}, 'createdAt': 123231231213212, 'id': '23123123123213, 'name': ''
I need this code to be looped, as there are multiple id values that I am interested in and I need them all entered into a list. Hopefully I explained it all. The file type is "file" according to Windows; I just view its contents with Notepad.
It appears to me that you have a file with a string representation of a dict (dictionary). So, what you need to do is:
string_of_dict → ast.literal_eval() → dict
Open the file and read the text into a string variable. Currently I think this string is going into ids. Then convert the string representation of the dict into a dict using the ast library, as shown below (reference):
import ast
string_of_dict = "{'muffin' : 'lolz', 'foo' : 'kitty'}"
ast.literal_eval(string_of_dict)
Output:
{'muffin': 'lolz', 'foo': 'kitty'}
Solution
Something like this should most likely work. You may have to tweak it a little bit.
import ast

with open(file, 'r') as f:
    contents = f.read()

ids = ast.literal_eval(contents)
profileids = []
for data in ids:
    test = data.get('id')
    profileids.append(test)
print(profileids)
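One caveat: the snippet you posted is neither valid JSON nor a valid Python literal (note fragments like "value1":,), so ast.literal_eval may keep failing on the real file. As a fallback sketch, a regular expression over the raw text can pull out the id values; the pattern below is an assumption based on the quoting in your snippet and may need adjusting:

import re

with open(file, 'r') as f:  # `file` as defined in the question
    contents = f.read()

# hypothetical pattern: single-quoted 'id' keys with single-quoted values
profileids = re.findall(r"'id':\s*'([^']*)'", contents)
print(profileids)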
I am trying to process a CSV file into a dict using a Dataflow template and Python.
As it is a template I have to use ReadFromText from the textio module, to be able to provide the path at runtime.
| beam.io.ReadFromText(contact_options.path)
All I need is to be able to extract the first line of this text/CSV file; I can then use this data in DictReader as the fieldnames.
If I use splitlines it brings back each element of the text file in a list:
return element.splitlines()
or
csv_data = []
split_element = element.split('\n')
for row in split_element:
    csv_data.append(row)
return csv_data
['phone_number', 'cid', 'first_name', 'last_name']
[' ', '101XXXXX', 'MurXXX', 'LevXXXX']
['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX']
['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']
Although if I then use, say, element[0], it just brings everything back without the list brackets. I have also tried splitting by '\n' and then using a for loop to produce a list object, although it produces almost the same result.
I cannot rely on predetermined fieldnames, as the CSV files to be processed will all have different fieldnames, and DictReader will not work effectively without fieldnames given.
EDIT:
The expected output is:
[{'phone_Number': '561XXXXX', 'first_Name': '', 'last_Name': 'BeXXXX', 'cid': '745XXXXX'}, {'phone_Number': '561XXXXX', 'first_Name': 'A', 'last_Name': 'BXXXX', 'cid': '61XXXXX'}]
EDIT:
Element contents:
"phone_Number","cid","first_Name","last_Name"
"5616XXXXX","745XXXX","","BeXXXXX"
"561XXXXXX","61XXXXX","A","BXXXXXXt"
"95XXXXXXX","6XXXXXX","A","BXXXXXX"
"727XXXXXX","98XXXXXX","A","CaXXXXXX"
Use pandas to load the values and use the first line as column headers:
import pandas as pd

a_big_list = [['phone_number', 'cid', 'first_name', 'last_name'],
              [' ', '101XXXXX', 'MurXXX', 'LevXXXX'],
              ['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX'],
              ['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']]

df = pd.DataFrame(a_big_list[1:], columns=a_big_list[0])
df.to_dict('records')

# [{'cid': '101XXXXX',
#   'first_name': 'MurXXX',
#   'last_name': 'LevXXXX',
#   'phone_number': ' '},
#  {'cid': '109XXXXX',
#   'first_name': 'MerXXXX',
#   'last_name': 'CoXXXX',
#   'phone_number': '3052XXXXX'},
#  {'cid': '10XXXXXX',
#   'first_name': 'RoXXXX',
#   'last_name': 'MaXXXXX',
#   'phone_number': '954XXXXX'}]
I was able to figure this problem out with inspiration from #mad_'s answer, but it still didn't give me the correct answer initially, as I first needed to group my pcollection into one element. I found a way of doing this inspired by this answer from Jiayuan Ma, and slightly altered it like so:
import apache_beam as beam
from apache_beam.io import ReadFromText

class Group(beam.DoFn):
    def __init__(self):
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)

    def finish_bundle(self):
        if len(self._buffer) != 0:
            yield list(self._buffer)
            self._buffer = []

lines = (p | 'File reading' >> ReadFromText(known_args.input)
           | 'Group' >> beam.ParDo(Group()))
...
Thus it grouped the entire CSV file as one object, and then I was able to apply mad_'s method to turn it into a dictionary.
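Once the whole file arrives as a single grouped element, csv.DictReader can consume the list of lines directly; it takes the first row as fieldnames on its own when none are supplied. A small standalone sketch (not Dataflow-specific):

import csv

def lines_to_dicts(lines):
    # DictReader accepts any iterable of strings and uses the first row as fieldnames
    return list(csv.DictReader(lines))

rows = lines_to_dicts([
    '"phone_Number","cid","first_Name","last_Name"',
    '"5616XXXXX","745XXXX","","BeXXXXX"',
])
print(rows)
# [{'phone_Number': '5616XXXXX', 'cid': '745XXXX', 'first_Name': '', 'last_Name': 'BeXXXXX'}]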
I am having difficulty parsing the XML _file below using lxml:
>> _file = "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>> from lxml.etree import XMLParser, parse
>> parsefile = parse(_file)
>> print parsefile
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the location of the lxml object, while I am after the actual file content, i.e.
Desired output = {'document reference': 'suspicious-document00500.txt', 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? Thanks.
Here's one way of getting the desired outputs:
from lxml import etree

def main():
    doc = etree.parse('qv.xml')
    root = doc.getroot()
    print root.attrib
    for item in root:
        print item.attrib

if __name__ == "__main__":
    main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read this to see how etree represents XML objects.
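To get something closer to the single merged dict shown in the question, the root's attributes can be folded into each feature's attributes. A small sketch (the merged layout is the questioner's wish, not anything lxml produces by itself):

from lxml import etree

doc = etree.parse('qv.xml')
root = doc.getroot()
for feature in root:
    merged = dict(root.attrib)     # {'reference': 'suspicious-document00500.txt'}
    merged.update(feature.attrib)  # add the feature's own attributes
    print(merged)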