Removing Elements From 300MG Xml In Python / Element Tree

Removing Elements From 300MG Xml In Python / Element Tree - python

I'm trying to parse a 300MB XML in ElementTree, based on advise like Can Python xml ElementTree parse a very large xml file?
from xml.etree import ElementTree as Et
for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):
if elem.tag == 'DescriptorRecord':
for e in elem._children:
if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
e.clear()
elem.remove(e)
print 'removed %s' % e
giving...
removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>
However, this just keeps going, the file isn't getting any smaller, and on inspection the elements are still there. Tried either e.clear() or elem.remove(e), but the same results. Regards
UPDATE
Error code from my first comment on #alexanderlukanin13 s answer:
Traceback (most recent call last): File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch Traceback (most recent call last): File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in globals = debugger.run(setup['file'], None, None) File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run pydev_imports.execfile(file, globals, locals) # execute the script File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in main() File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main return pydev_runfiles.main(configuration) # Note: still doesn't return a proper value. File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main PydevTestRunner(configuration).run_tests() File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests file_and_modules_and_module_name = self.find_modules_from_files(files) File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files mod = self.__get_module_from_str(import_str, print_exception, pyfile) File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str buf_err = pydevd_io.StartRedirect(keep_original_redirection=True, std='stderr') File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in StartRedirect import sys MemoryError

The main problem in your script is that you don't save altered XML back to disk. You need to store reference to root element and then call ElementTree.write:
from xml.etree import ElementTree as Et
context = Et.iterparse('input.xml')
root = None
for event, elem in context:
if elem.tag == 'DescriptorRecord':
for e in list(elem.getchildren()): # Don't use _children, it's a private field
if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
elem.remove(e) # You need remove(), not clear()
root = elem
with open('output.xml', 'wb') as file:
Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)
Note: here I use an awkward (and probably unsafe) way to get a root element - I assume that it's always the last element in iterparse output. If anyone knows a better way, please tell.

Related

With Python 3 and lxml, how to extract the Version number from a SOAP WSDL?

When I test with a subset of the WSDL file, with Name Spaces omitted from the file and code, it works fine.
# for reference, these are the final lines from the WSDL
#
# <wsdl:service name="Shopping">
# <wsdl:documentation>
# <Version>1027</Version>
# </wsdl:documentation>
# <wsdl:port binding="ns:ShoppingBinding" name="Shopping">
# <wsdlsoap:address location="http://open.api.ebay.com/shopping"/>
# </wsdl:port>
# </wsdl:service>
#</wsdl:definitions>
from lxml import etree
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
print(wsdl.findtext('wsdl:.//Version')) # wish this would print 1027
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 "/Users/matecsaj/Google Drive/Projects/collectibles/eBay/figure-it3.py"
Traceback (most recent call last):
File "src/lxml/_elementpath.py", line 79, in lxml._elementpath.xpath_tokenizer (src/lxml/_elementpath.c:2414)
KeyError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/matecsaj/Google Drive/Projects/collectibles/eBay/figure-it3.py", line 14, in <module>
print(wsdl.findtext('wsdl:.//Version')) # wish this would print 1027
File "src/lxml/etree.pyx", line 2230, in lxml.etree._ElementTree.findtext (src/lxml/etree.c:69049)
File "src/lxml/etree.pyx", line 1552, in lxml.etree._Element.findtext (src/lxml/etree.c:60629)
File "src/lxml/_elementpath.py", line 329, in lxml._elementpath.findtext (src/lxml/_elementpath.c:10089)
File "src/lxml/_elementpath.py", line 311, in lxml._elementpath.find (src/lxml/_elementpath.c:9610)
File "src/lxml/_elementpath.py", line 300, in lxml._elementpath.iterfind (src/lxml/_elementpath.c:9282)
File "src/lxml/_elementpath.py", line 277, in lxml._elementpath._build_path_iterator (src/lxml/_elementpath.c:8675)
File "src/lxml/_elementpath.py", line 82, in xpath_tokenizer (src/lxml/_elementpath.c:2542)
SyntaxError: prefix 'wsdl' not found in prefix map
Process finished with exit code 1

The xml has namespaces defined in it, hence to access the element you need to define the link of the namespace. Please see if the code below helps:
wsdlLink = "http://schemas.xmlsoap.org/wsdl/"
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
print(wsdl.findtext('{'+wsdlLink+'}//Version'))

With credit to the kind folks that commented, here is a modified solution that does print the Version number. All I could get working was the wildcard search. Also, the iterator skipped the Version element, so I had to get at it from its parent element. Good enough.
from lxml import etree
wsdlLink = "http://schemas.xmlsoap.org/wsdl/"
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
for element in wsdl.iter('{'+wsdlLink+'}*'):
if 'documentation' in element.tag:
for child in element:
print(child.text)

Python XML Parser: junk after document element

I'm learning Python at work. I've got a large XML file with data similar to this:
testData3.xml File
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
I have copied an XML parser out of one of my Python books that works in gathering the data when the data file contains only one line. As soon as I add a second line of data, the script fails when it runs.
Python script that I'm running (xmlReader.py):
from xml.dom.minidom import parse, Node
xmltree = parse('testData3.xml')
for node1 in xmltree.getElementsByTagName('c'):
for node2 in node1.childNodes:
if node2.nodeType == Node.TEXT_NODE:
print(node2.data)
I'm looking for some help on how to write the loop so that my xmlReader.py continues through the entire file instead of just one line. I get the following errors when I run this script:
Errors during execution:
xxxx#xxxx:~/xxxx/xxxx> python xmlReader.py
Traceback (most recent call last):
File "xmlReader.py", line 2, in <module>
xmltree = parse('testData3.xml')
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
return expatbuilder.parse(file)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 926, in parse
result = builder.parseFile(fp)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: junk after document element: line 2, column 0
xxxx#xxxx:~/xxxx/xxxx>

The problem is that your example data is not valid XML. A valid XML document should have a single root element; this is true for a single line of the file, where <r> is the root element, but not true when you add a second line, because each line is contained within a separate <r> element, but there is no global parent element in the file.
Either construct valid XML, for example:
<root>
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
</root>
or parse the file line by line:
from xml.dom.minidom import parseString
f = open('testData3.xml'):
for line in f:
xmltree = parseString(line)
...
f.close()

parse xml file and output to text file

Trying to parse an xml file (config.xml) with ElementTree and output to a text file. I looked at other similar ques here but none helped me. Using Python 2.7.9
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
notags = ET.tostring(tree,encoding='us-ascii',method='text')
print(notags)
OUTPUT
Traceback (most recent call last):
File "./python_element", line 9, in <module>
notags = ET.tostring(tree,encoding='us-ascii',method='text')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 814, in write
_serialize_text(write, self._root, encoding)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1005, in _serialize_text
for part in elem.itertext():
AttributeError:
> 'ElementTree' object has no attribute 'itertext'

Instead of tree (ElementTree object), pass an Element object. You can get an root element using .getroot() method:
notags = ET.tostring(tree.getroot(), encoding='utf-8',method='text')

Reading password protected Word Documents with zipfile

I am trying to read a password protected word document on Python using zipfile.
The following code works with a non-password protected document, but gives an error when used with a password protected file.
try:
from xml.etree.cElementTree import XML
except ImportError:
from xml.etree.ElementTree import XML
import zipfile
psw = "1234"
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
def get_docx_text(path):
document = zipfile.ZipFile(path, "r")
document.setpassword(psw)
document.extractall()
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.getiterator(PARA):
texts = [node.text
for node in paragraph.getiterator(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)
When running get_docx_text() with a password protected file, I received the following error:
Traceback (most recent call last):
File "<ipython-input-15-d2783899bfe5>", line 1, in <module>
runfile('/Users/username/Workspace/Python/docx2txt.py', wdir='/Users/username/Workspace/Python')
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 680, in runfile
execfile(filename, namespace)
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/Users/username/Workspace/Python/docx2txt.py", line 41, in <module>
x = get_docx_text("/Users/username/Desktop/file.docx")
File "/Users/username/Workspace/Python/docx2txt.py", line 23, in get_docx_text
document = zipfile.ZipFile(path, "r")
File "zipfile.pyc", line 770, in __init__
File "zipfile.pyc", line 811, in _RealGetContents
BadZipfile: File is not a zip file
Does anyone have any advice to get this code to work?

I don't think this is an encryption problem, for two reasons:
Decryption is not attempted when the ZipFile object is created. Methods like ZipFile.extractall, extract, and open, and read take an optional pwd parameter containing the password, but the object constructor / initializer does not.
Your stack trace indicates that the BadZipFile is being raised when you create the ZipFile object, before you call setpassword:
document = zipfile.ZipFile(path, "r")
I'd look carefully for other differences between the two files you're testing: ownership, permissions, security context (if you have that on your OS), ... even filename differences can cause a framework to "not see" the file you're working on.
Also --- the obvious one --- try opening the encrypted zip file with your zip-compatible command of choice. See if it really is a zip file.
I tested this by opening an encrypted zip file in Python 3.1, while "forgetting" to provide a password. I could create the ZipFile object (the variable zfile below) without any error, but got a RuntimeError --- not a BadZipFile exception --- when I tried to read a file without providing a password:
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 13, in checksum_contents
inner_file = zfile.open(inner_filename, "r")
File "/usr/lib64/python3.1/zipfile.py", line 903, in open
"password required for extraction" % name)
RuntimeError: File apache.log is encrypted, password required for extraction
I was also able to raise a BadZipfile exception, once by trying to open an empty file and once by trying to open some random logfile text that I'd renamed to a ".zip" extension. The two test files produced identical stack traces, down to the line numbers.
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 10, in checksum_contents
zfile = zipfile.ZipFile(zipfile_name, "r")
File "/usr/lib64/python3.1/zipfile.py", line 706, in __init__
self._GetContents()
File "/usr/lib64/python3.1/zipfile.py", line 726, in _GetContents
self._RealGetContents()
File "/usr/lib64/python3.1/zipfile.py", line 738, in _RealGetContents
raise BadZipfile("File is not a zip file")
zipfile.BadZipfile: File is not a zip file
While this stack trace isn't exactly the same as yours --- mine has a call to _GetContents, and the pre-3.2 "small f" spelling of BadZipfile --- but they're close enough that I think this is the kind of problem you're dealing with.

tags on an element's children

The following code (or rather, code very much like it) has been doing my head in all day. I'm out of ideas.
import xml.etree.ElementTree as etree
parent = etree.Element(etree.QName('http://www.example.com', tag='parent'))
child_a = etree.Element(etree.QName('http://www.example.com', tag='child'))
child_a.text='Bill'
parent.append(child_a)
child_b = etree.Element(etree.QName('http://www.example.com', tag='child'))
child_b.text='Barry'
parent.append(child_b)
print(etree.tostring(parent))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "Lib\xml\etree\ElementTree.py", line 1120, in tostring
File "Lib\xml\etree\ElementTree.py", line 812, in write
File "Lib\xml\etree\ElementTree.py", line 880, in namespaces
File "Lib\xml\etree\ElementTree.py", line 1046, in _raise_serialization_error
TypeError: cannot serialize <xml.etree.ElementTree.QName object at 0x0000000222B2198> (type QName)
It's not letting me add two children to an element if both children have the same qualified name. Is there any way around this (using etree, because it's too late to swap out my xml writer before tomorrow)?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing Elements From 300MG Xml In Python / Element Tree - python

Related

With Python 3 and lxml, how to extract the Version number from a SOAP WSDL?

Python XML Parser: junk after document element

parse xml file and output to text file

Reading password protected Word Documents with zipfile

tags on an element's children

Categories

Resources