I am trying to use the AlchemyAPI Python 0.7 SDK. However, whenever I run one of its methods, e.g. URLGetText(url),
I get this error:
nodes = etree.fromstring(result).xpath(xpathQuery)
File "lxml.etree.pyx", line 2743, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52665)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 19, column 11
This comes from this area of the code:
def GetRequest(self, apiCall, apiPrefix, paramObject):
    endpoint = 'http://' + self._hostPrefix + '.alchemyapi.com/calls/' + apiPrefix + '/' + apiCall
    endpoint += '?apikey=' + self._apiKey + paramObject.getParameterString()
    handle = urllib.urlopen(endpoint)
    result = handle.read()
    handle.close()
    xpathQuery = '/results/status'
    nodes = etree.fromstring(result).xpath(xpathQuery)
    if nodes[0].text != "OK":
        raise Exception, 'Error making API call.'
    return result
Does anyone have any idea what is going wrong?
Thank You
Daniel Kershaw
I looked at the Python urllib docs, and found this page:
http://docs.python.org/library/urllib.html#high-level-interface
which contains this warning about the filehandle object returned by urllib.urlopen():
One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
I think maybe you should ensure that you obtain the entire contents of the file as a Python string before parsing it with the etree.fromstring() API. Something like:
result = ''
while True:
    chunk = handle.read()
    if not chunk:
        break
    result += chunk
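A minimal Python 3 sketch of the same idea, reading in chunks until the stream is exhausted. The helper name read_all and the chunk size are assumptions for illustration, and io.BytesIO stands in here for the handle returned by urlopen:

```python
import io


def read_all(handle, chunk_size=8192):
    """Read from a file-like object until read() returns an empty result."""
    chunks = []
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for a network handle (hypothetical response body)
fake_response = io.BytesIO(b"<results><status>OK</status></results>")
body = read_all(fake_response)
```

Collecting the chunks in a list and joining once avoids repeated string concatenation on large responses.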
A backend endpoint returns a Parquet file as an octet-stream.
In pandas I can do something like this:
result = requests.get("https://..../file.parquet")
df = pd.read_parquet(io.BytesIO(result.content))
Can I do it in Dask somehow?
This code:
dd.read_parquet("https://..../file.parquet")
raises an exception (obviously, because the response is a bytes-like object):
File "to_parquet_dask.py", line 153, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 137, in main
download_parquet(
File "to_parquet_dask.py", line 121, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 148, in _determine_pf_parts
elif fs.isdir(paths[0]):
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 418, in _isdir
return bool(await self._ls(path))
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 195, in _ls
out = await self._ls_real(url, detail=detail, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 150, in _ls_real
text = await r.text()
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1082, in text
return self._body.decode(encoding, errors=errors) # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 7: invalid start byte
UPD
With the fsspec changes from @mdurant's answer, I got this error:
ValueError: Cannot seek streaming HTTP file
So I prepended "simplecache::" to my URL, and now I get the following:
Traceback (most recent call last):
File "to_parquet_dask.py", line 161, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 145, in main
download_parquet(
File "to_parquet_dask.py", line 128, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 185, in _determine_pf_parts
pf = ParquetFile(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fastparquet/api.py", line 127, in __init__
raise ValueError("Opening directories without a _metadata requires"
ValueError: Opening directories without a _metadata requiresa filesystem compatible with fsspec
Temporary workaround
This may be dirty and not optimal, but it more or less works:
import io

import dask
import dask.dataframe as dd
import pandas as pd
import requests

@dask.delayed
def parquet_from_http(url, token):
    # Download the whole file, then let pandas parse it from memory
    result = requests.get(
        url,
        headers={'Authorization': token}
    )
    return pd.read_parquet(io.BytesIO(result.content))

delayed_download = parquet_from_http(url, token)
df = dd.from_delayed(delayed_download, meta=meta)
P.S. The meta argument is necessary in this approach, because otherwise Dask will call the function twice: once to infer meta and then again to compute the result, so two requests will be made.
This is not an answer, but I believe the following change in fsspec will fix your problem. If you would be willing to try and confirm, we can make this a patch.
--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -472,7 +472,10 @@ class HTTPFileSystem(AsyncFileSystem):
     async def _isdir(self, path):
         # override, since all URLs are (also) files
-        return bool(await self._ls(path))
+        try:
+            return bool(await self._ls(path))
+        except (FileNotFoundError, ValueError):
+            return False
(we can put this in a branch, if that makes it easier for you to install)
-edit-
The second problem (which is the same thing in both parquet engines) stems from the server either not providing the size of the file, or not allowing range-gets. The parquet format requires random access to the data to be able to read. The only way to get around this (short of improving the server) is to copy the whole file locally, e.g., by prepending "simplecache::" to your URL.
I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded pdf files. The XML file will be used to import data into a database software, so the schema can not be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
    E.galley(
        E.label('PDF'),
        E.file(
            ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
                   + " encoding=\"base64\" mime_type=\"application/pdf\" >"
                   + str(base64fulltext)
                   + "</embed>")
        ), self.LOCALE(row['language']),
    ), self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line where it says str(base64fulltext) in the code snippet above. The error message is as follows:
(lxml) vboxadmin@linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result was for the base64 string to be written to the XML file.
So far, I could only find that there is a "huge_tree" option in lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether or how I can use it to solve my problem.
As a workaround, I am considering using string replacement to insert the base64 string into the XML after it has been written to file. However, I would be happier with a proper lxml solution if anyone could suggest one. Thanks!
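One possible direction, sketched under assumptions rather than offered as a confirmed fix: lxml's XMLParser also accepts huge_tree=True, and a parser can be passed as the second argument to etree.XML. Alternatively, building the element directly and assigning its text avoids parsing the payload at all. The function names, the sample filename, and the payload below are hypothetical:

```python
from lxml import etree


def embed_via_parser(filename, b64_text):
    """Parse the <embed> markup with the parser's size limits lifted."""
    parser = etree.XMLParser(huge_tree=True)
    # Assumes filename contains no characters that need XML escaping
    markup = ('<embed filename="%s.pdf" encoding="base64" '
              'mime_type="application/pdf">%s</embed>' % (filename, b64_text))
    return etree.XML(markup, parser)


def embed_via_element(filename, b64_text):
    """Build the element directly, so the payload is never parsed."""
    embed = etree.Element("embed", filename=filename + ".pdf",
                          encoding="base64", mime_type="application/pdf")
    embed.text = b64_text
    return embed


# Hypothetical payload larger than 10 million characters
payload = "QUJD" * 3_000_000
parsed = embed_via_parser("galley1", payload)
built = embed_via_element("galley1", payload)
```

Either element could then be placed inside E.file(...) in place of the ET.XML(...) call.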
I am currently learning Python using Python 101, and one of the examples gives me an error that I have no clue how to fix. My code is 100% the same as in the book (I have checked it three times already) and it still outputs this error.
Here is the code:
from lxml import etree

def parseXML(xmlFile):
    """
    Parse the xml
    """
    with open(xmlFile) as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            if not elem.text:
                text = 'None'
            else:
                text = elem.text
            print(elem.tag + ' => ' + text)

if __name__ == '__main__':
    parseXML('example.xml')
And here is the XML file (it's the same as in the book):
<?xml version="1.0" ?>
<zAppointments reminder-"15">
<appointment>
<begin>1181251600</begin>
<uid>0400000008200E000</uid>
<alarmTime>1181572063</alarmTime>
<state></state>
<location></location>
<duration>1800</duration>
<subject>Bring pizza home</subject>
</appointment>
<appointment>
<begin>1234567890</begin>
<duration>1800</duration>
<subject>Check MS office webstie for updates</subject>
<state>dismissed</state>
<location></location>
<uid>502fq14-12551ss-255sf2</uid>
</appointment>
</zAppointments>
EDITED: Sorry, I got so excited about my first post that I forgot to include the error message.
Traceback (most recent call last):
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 21, in <module>
parseXML('example.xml')
File "/home/michal/Desktop/nauka programowania/python 101/parsing_with_lxml.py", line 10, in parseXML
root = etree.fromstring(xml)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: Specification mandate value for attribute reminder-, line 2, column 25
Thanks for help!!
The only error in the XML is here: <zAppointments reminder-"15"> should be <zAppointments reminder="15">.
In the future, you can find useful tools for validating XML online,
for example: https://www.xmlvalidation.com/
The error is probably in
<zAppointments reminder-"15">
Next time, try validating the file with xmllint:
xmllint --valid --noout example.xml
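A quick local well-formedness check can also be sketched in Python with the standard library's xml.etree.ElementTree, whose ParseError carries the (line, column) position of the problem. The helper name find_syntax_error is an assumption for illustration, and the sample document is a shortened version of the one in the question:

```python
import xml.etree.ElementTree as ET


def find_syntax_error(xml_text):
    """Return (line, column, message) for the first syntax error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


broken = ('<?xml version="1.0" ?>\n'
          '<zAppointments reminder-"15">\n'
          '</zAppointments>\n')
fixed = broken.replace('reminder-"15"', 'reminder="15"')

print(find_syntax_error(broken))  # reports a position on line 2
print(find_syntax_error(fixed))   # no error found
```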
I have a .gpx file which is cut off in the middle. When I try to parse it using the gpxpy library, I run into the following error.
Parsing points in track.gpx
ERROR:root:expected '>', line 3125, column 29
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/parser.py", line 209, in parse
self.xml_parser = LXMLParser(self.xml)
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/parser.py", line 107, in __init__
self.dom = mod_etree.XML(self.xml)
File "lxml.etree.pyx", line 2734, in lxml.etree.XML (src/lxml/lxml.etree.c:54411)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
XMLSyntaxError: expected '>', line 3125, column 29
File "gpxscript.py", line 370, in extractpoints
gpx = gpxpy.parse(file)
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/__init__.py", line 28, in parse
raise mod_gpx.GPXException('Error parsing {0}: {1}'.format(xml_or_file[0 : 100], parser.get_error()))
TypeError: 'file' object has no attribute '__getitem__'
These are the relevant commands of the script which produces the error.
368 file = open(filepath)
369 try:
370     gpx = gpxpy.parse(file)
371 except gpxpy.gpx.GPXException:
372     print "GPXException for %s." % filepath
373     return 1
I filed a bug for the library as suggested. I added a sample file to the bug report which produces the syntax error.
This appears to be a bug in gpxpy's error handling.
Looking at the source to parse, when the parser fails without raising an exception, it tries to raise an exception with this:
raise mod_gpx.GPXException('Error parsing {0}: {1}'.format(xml_or_file[0 : 100], parser.get_error()))
This assumes that xml_or_file is an XML string—but, as the name implies, it's allowed to be either a string or a file object. So, what you're doing (giving it a file object) is perfectly legal and should work, and it doesn't, and therefore it's a bug.
So, you should file an issue. The correct patch should be something like:
if not parser.is_valid():
    try:
        fragment = xml_or_file[0 : 100]
    except TypeError:
        xml_or_file.seek(0)
        fragment = xml_or_file.read(100)
    raise mod_gpx.GPXException('Error parsing {0}: {1}'.format(fragment, parser.get_error()))
So, how do you work around this? A few options:
Since it only happens with invalid files anyway, you can just use except Exception or except (gpxpy.gpx.GPXException, TypeError).
Since it only happens when you give it a file object, give it a string instead: gpx = gpxpy.parse(file.read()). This is a bad idea if the file is very large, of course.
Since the buggy function is only 12 lines of trivial code wrapping the real function, just use the real function directly. Or, if you like the wrapper, copy it, fix it, and use your own copy instead.
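The first two workarounds can be combined in a small wrapper. This is only a sketch: parse_func stands in for gpxpy.parse (it is passed in rather than imported so the example runs on its own), and the helper name parse_gpx_file is hypothetical.

```python
def parse_gpx_file(filepath, parse_func):
    """Parse a GPX file, returning None if the file cannot be parsed.

    parse_func is assumed to accept the XML as a string, as gpxpy.parse does.
    """
    with open(filepath) as handle:
        xml_text = handle.read()  # hand over a string, not the file object
    try:
        return parse_func(xml_text)
    except Exception as exc:  # broad on purpose: covers GPXException and TypeError
        print("Could not parse %s: %s" % (filepath, exc))
        return None
```

Assuming gpxpy is installed, a call might look like gpx = parse_gpx_file("track.gpx", gpxpy.parse).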
Meanwhile, given that the very first bit of code I looked at in this library has some obvious red flags (Why xml_or_file[0 : 100] instead of just xml_or_file[:100]? Why catch exceptions, throw them away and just set a flag, and then use that flag to raise a new exception with all the information missing?), if you're not able to debug libraries on your own, I don't think this one is ready for you to use.
I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = "C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?
What you are doing wrong is (1) not checking whether you get the same outcome by using xml.etree on the same file, and (2) not reading the error message, which indicates a syntax error on line 2 of the file, well downstream of any file-opening issue.
I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.
You have a syntax error in your XML Markup. You aren't doing anything wrong.
lxml allows you to load broken XML by creating a parser instance with recover=True:
etree.XMLParser(recover=True)
While this is not ideal, I use it to load XML for schema/DTD/Schematron validation.
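A minimal sketch of how such a parser might be used. The sample document is hypothetical, and what the recovered tree looks like depends on the kind of damage, so the parser's error log is worth inspecting before trusting the result:

```python
from lxml import etree

# Hypothetical broken input: the second <item> is never closed
broken = b"<root><item>first</item><item>second</root>"

parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser)

# The parser keeps a log of the problems it encountered
for entry in parser.error_log:
    print(entry.line, entry.column, entry.message)

item_texts = [item.text for item in root.iter("item")]
```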