Python: Read a very large XML file

I am trying to read an XML file (a French dictionary) from a Python script, but I get the error below. Is there any way I can fix it?
PS: the file is 158,643 KB.
from xml.dom import minidom
doc = minidom.parse("Dic.xml")
Data = doc.getElementsByTagName("title")[0]
titleData = Data.firstChild.data
print (titleData)
The error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
doc = minidom.parse("Morphalou-2.0.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 204, in parseFile
buffer = file.read(16*1024)
MemoryError
Thanks in advance.
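
One possible approach, offered as a sketch rather than a verified fix for this particular file: minidom builds the entire document tree in memory, which is the likely reason a file of this size ends in MemoryError. The standard library's xml.etree.ElementTree.iterparse reads the file incrementally instead. The helper name first_text is hypothetical, and the sketch assumes the title element carries no XML namespace (otherwise the tag would look like "{namespace-uri}title").

```python
import xml.etree.ElementTree as ET


def first_text(path, tag):
    """Return the text of the first <tag> element, or None if there is none.

    Hypothetical helper: the file is read incrementally instead of being
    loaded into a full DOM tree first.
    """
    with open(path, "rb") as fh:
        # An "end" event fires once an element has been read completely.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == tag:
                return elem.text
            # Drop the text, attributes and children of elements we do not
            # need, so memory use stays far below that of a full tree.
            elem.clear()
    return None
```

With the file from the question this would be called as first_text("Dic.xml", "title"). Cleared elements still leave empty shells attached to their parents, so memory use grows slowly with the number of elements, but much less than with minidom.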

Related

Parsing XML and HTML in the same project

I want to parse both XML and HTML in the same project.
I tried this:
from xml.etree import ElementTree as ET
tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)
and got this error:
Traceback (most recent call last):
File "C:.py", line 55, in
html_file = ET.parse("htmlpath")
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity : line 690, column 78
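
A likely cause, stated as an assumption because the HTML file is not shown: HTML is not XML, and named entities such as &nbsp; are undefined in XML unless a DTD declares them, so ElementTree rejects the file. A minimal sketch that keeps ElementTree for the XML input and uses the standard library's html.parser for the HTML input follows. TextCollector and html_text are hypothetical names, not part of any library.

```python
import xml.etree.ElementTree as ET  # still the right tool for the XML file
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that collects the text nodes of an HTML document."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &nbsp;
        # are decoded to characters before handle_data() sees them.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_text(markup):
    """Return the non-empty text chunks found in an HTML string."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks
```

The XML file can still go through ET.parse(fpath) as in the question; only the HTML file needs the more tolerant parser.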

I am facing this error "xml.parsers.expat.ExpatError: not well-formed (invalid token):" while parsing URL data using minidom

I am getting the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0 while parsing data from a URL using minidom. Can anyone help me with this?
Here is my code:
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse("about_us.xml")
Error:
Traceback (most recent call last):
File "test3.py", line 11, in <module>
doc=minidom.parse("about_us.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 211, in parseFile
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
The above from your traceback indicates to me that your "about_us.xml" file is empty.
You have openurl but you have not shown that you've ever called openurl.read() to actually get at the data.
Nor have you shown where or how you've written said data to your "about_us.xml" file.
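
The missing steps described above could look roughly like the following. This is a sketch under assumptions: it targets Python 3 (where urllib.request.urlopen replaces urllib2.urlopen), and save_then_parse is a hypothetical helper name. It only succeeds if the downloaded body is well-formed XML.

```python
from xml.dom import minidom


def save_then_parse(response, path):
    """Read a response body, write it to `path`, then parse that file.

    `response` is any object with a read() method that returns bytes, for
    example the result of urllib.request.urlopen(url). Hypothetical helper.
    """
    data = response.read()  # actually fetch the bytes
    if not data:
        raise ValueError("response body is empty, nothing to parse")
    with open(path, "wb") as fh:  # write the data before parsing the file
        fh.write(data)
    return minidom.parse(path)


# Intended use (requires network access, so it is left as a comment):
# import urllib.request
# doc = save_then_parse(urllib.request.urlopen(url), "about_us.xml")
```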
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse(openurl)
print doc
gives me
Traceback (most recent call last):
File "main.py", line 5, in <module>
doc=minidom.parse(openurl)
File "/usr/local/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 51, column 81
which indicates that the page you are trying to parse as XML is not well-formed XML. Try using Beautiful Soup instead, which, from memory, is very forgiving of malformed markup.
from BeautifulSoup import BeautifulSoup
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
soup = BeautifulSoup(openurl.read())
for a in soup.findAll('a'):
    print (a.text, a.get('href'))
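BTW, the BeautifulSoup import above is Beautiful Soup 3, which works on Python 2.7. Older releases of Beautiful Soup 4 (from bs4 import BeautifulSoup) also support Python 2.7.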

With Python 3 and lxml, how to extract the Version number from a SOAP WSDL?

When I test with a subset of the WSDL file, with the namespaces omitted from both the file and the code, it works fine.
# for reference, these are the final lines from the WSDL
#
# <wsdl:service name="Shopping">
# <wsdl:documentation>
# <Version>1027</Version>
# </wsdl:documentation>
# <wsdl:port binding="ns:ShoppingBinding" name="Shopping">
# <wsdlsoap:address location="http://open.api.ebay.com/shopping"/>
# </wsdl:port>
# </wsdl:service>
#</wsdl:definitions>
from lxml import etree
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
print(wsdl.findtext('wsdl:.//Version')) # wish this would print 1027
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 "/Users/matecsaj/Google Drive/Projects/collectibles/eBay/figure-it3.py"
Traceback (most recent call last):
File "src/lxml/_elementpath.py", line 79, in lxml._elementpath.xpath_tokenizer (src/lxml/_elementpath.c:2414)
KeyError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/matecsaj/Google Drive/Projects/collectibles/eBay/figure-it3.py", line 14, in <module>
print(wsdl.findtext('wsdl:.//Version')) # wish this would print 1027
File "src/lxml/etree.pyx", line 2230, in lxml.etree._ElementTree.findtext (src/lxml/etree.c:69049)
File "src/lxml/etree.pyx", line 1552, in lxml.etree._Element.findtext (src/lxml/etree.c:60629)
File "src/lxml/_elementpath.py", line 329, in lxml._elementpath.findtext (src/lxml/_elementpath.c:10089)
File "src/lxml/_elementpath.py", line 311, in lxml._elementpath.find (src/lxml/_elementpath.c:9610)
File "src/lxml/_elementpath.py", line 300, in lxml._elementpath.iterfind (src/lxml/_elementpath.c:9282)
File "src/lxml/_elementpath.py", line 277, in lxml._elementpath._build_path_iterator (src/lxml/_elementpath.c:8675)
File "src/lxml/_elementpath.py", line 82, in xpath_tokenizer (src/lxml/_elementpath.c:2542)
SyntaxError: prefix 'wsdl' not found in prefix map
Process finished with exit code 1

The XML declares namespaces, so to access an element you need to supply its namespace URI. Please see if the code below helps:
wsdlLink = "http://schemas.xmlsoap.org/wsdl/"
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
print(wsdl.findtext('{'+wsdlLink+'}//Version'))

With credit to the kind folks that commented, here is a modified solution that does print the Version number. All I could get working was the wildcard search. Also, the iterator skipped the Version element, so I had to get at it from its parent element. Good enough.
from lxml import etree
wsdlLink = "http://schemas.xmlsoap.org/wsdl/"
wsdl = etree.parse('http://developer.ebay.com/webservices/latest/ShoppingService.wsdl')
for element in wsdl.iter('{'+wsdlLink+'}*'):
    if 'documentation' in element.tag:
        for child in element:
            print(child.text)
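
A more targeted variant of the loop above, as a sketch only: it uses the standard library's ElementTree instead of lxml (lxml's find and iterfind accept a namespaces mapping in the same way) and passes a prefix map so that wsdl:documentation can be addressed directly. The excerpt in the question does not show which namespace, if any, the Version element belongs to, so the sketch matches it by local name. NS, local_name and find_version are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Prefix map: the prefix used in the path is resolved through this dict.
NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_version(root):
    """Return the text of the first Version child of any wsdl:documentation."""
    for doc in root.iterfind(".//wsdl:documentation", NS):
        for child in doc:
            # Match on the local name because the namespace of Version
            # is not known from the excerpt.
            if local_name(child.tag) == "Version":
                return child.text
    return None
```

Given the parsed WSDL, find_version(tree.getroot()) would return the version string if the document matches the structure shown in the question's comment block.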

parse xml file and output to text file

I am trying to parse an XML file (config.xml) with ElementTree and write the output to a text file. I looked at other similar questions here, but none helped me. I am using Python 2.7.9.
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
notags = ET.tostring(tree,encoding='us-ascii',method='text')
print(notags)
OUTPUT
Traceback (most recent call last):
File "./python_element", line 9, in <module>
notags = ET.tostring(tree,encoding='us-ascii',method='text')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 814, in write
_serialize_text(write, self._root, encoding)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1005, in _serialize_text
for part in elem.itertext():
AttributeError: 'ElementTree' object has no attribute 'itertext'

Instead of tree (an ElementTree object), pass an Element object. You can get the root element using the .getroot() method:
notags = ET.tostring(tree.getroot(), encoding='utf-8',method='text')
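
To also cover the "output to a text file" part of the question, a minimal sketch follows. It assumes Python 3 (the question uses 2.7.9), where encoding="unicode" makes tostring return a str. The function name dump_text and the output path are hypothetical.

```python
import xml.etree.ElementTree as ET


def dump_text(xml_path, out_path):
    """Write the text content of an XML file (tags stripped) to out_path."""
    root = ET.parse(xml_path).getroot()  # an Element, not an ElementTree
    # method="text" keeps only text nodes; encoding="unicode" returns str.
    text = ET.tostring(root, encoding="unicode", method="text")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return text
```

For the file in the question this would be called as dump_text("config.xml", "config.txt").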

Trying to connect to a SOAP / WSDL service using Python (SOAPPy) - getting ERROR

I'm trying to connect to a SOAP/WSDL server from my Python script:
# based on the tutorial:
# http://www.diveintopython.net/soap_web_services/
import pprint
from SOAPpy import WSDL
WSDLFILE = "https://api.comscore.com/KeyMeasures.asmx?WSDL"
proxy = WSDL.Proxy(WSDLFILE)
proxy.soapserver.config.dumpSOAPIn=1
proxy.soapserver.config.dumpSOAPOut=1
When I run this script I get the following error:
Traceback (most recent call last):
File "/Users/XX/PycharmProjects/Test_Proj/Comscore_connector.py", line 9, in <module>
proxy = WSDL.Proxy(WSDLFILE)
File "/Users/XXpython_projects/lib/python2.7/site-packages/SOAPpy/WSDL.py", line 85, in __init__
self.wsdl = reader.loadFromString(str(wsdlsource))
File "/Users/XX/python_projects/lib/python2.7/site-packages/wstools/WSDLTools.py", line 47, in loadFromString
return self.loadFromStream(StringIO(data))
File "/Users/XX/python_projects/lib/python2.7/site-packages/wstools/WSDLTools.py", line 28, in loadFromStream
document = DOM.loadDocument(stream)
File "/Users/XX/python_projects/lib/python2.7/site-packages/wstools/Utility.py", line 645, in loadDocument
return xml.dom.minidom.parse(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
Process finished with exit code 1
I've looked at the following links for SOAP examples:
http://users.skynet.be/pascalbotte/rcx-ws-doc/python.htm
http://code.activestate.com/recipes/502259-calling-a-web-service-using-soappy/
However, I couldn't find much elsewhere.
I would really appreciate any advice that points me in the right direction.
