Why doesn't Python lxml take my xml? - python

I'm using the Python lxml library to parse my xml, but I'm having a hard time parsing one specific text. Checkout the following code:
>>> print type(raw_text_xml)
<type 'unicode'>
>>> from lxml import etree
>>> article_xml_root = etree.fromstring(raw_text_xml, parser)
Traceback (most recent call last):
File "<input>", line 1, in <module>
article_xml_root = etree.fromstring(raw_text_xml, parser)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
File "parser.pxi", line 1667, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101229)
File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:96139)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
so it says the first character is not a <, which by inspection is true:
>>> print raw_text_xml[:20]
ďťż<?xml version="1.
it has 3 weird characters in front of the xml. So to clean these I tried the following:
>>> article_xml_root = etree.fromstring(raw_text_xml[3:], parser)
Traceback (most recent call last):
File "<input>", line 1, in <module>
article_xml_root = etree.fromstring(raw_text_xml[3:], parser)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1781, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102435)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
And now it suddenly complains about it being a unicode string with encoding declaration, while if you look all the way up to my first line of code, it was Unicode all along.
Does anybody know why after slicing it suddenly gives a whole different error? And most importantly, does anybody know how I can solve this?

why after slicing it suddenly gives a whole different error?
Because after the slicing the first error vanishes and the parsing can progress until the second one is found.
And most importantly, does anybody know how I can solve this?
Maybe the error message is right (it happens) and you can solve it by converting the unicode to bytes. I guess that's better than removing the encoding declaration.
raw_text_xml.encode('utf8')
Or instead of 'utf8' whatever encoding is declared in the xml fragment.

The first error was caused by wrong characters. Once you have fixed it, you fall in second which is that your raw_text_xml is unicode.
You can know what will be a proper encoding (ASCII, latin1, utf8, ...). I cannot without seeing the actual content.
Assuming it is the content of encoding variable, you should be able to do:
article_xml_root = etree.fromstring(raw_text_xml.encode(encoding), parser)
(but I strongly advice you to first control what shows print raw_text_xml[3:160] ...)

Where ever you decoded that original Unicode, it was done incorrectly. It looks like iso-8859-2, where it was originally UTF-8 with BOM signature. The following backs out the bad decoding and re-decodes correctly:
>>> s.encode('iso-8859-2').decode('utf-8-sig')
'<?xml version="1.'

Related

Error when reading a SPSS file that is in Spanish with Pandas (Python)

Good morning!
I am trying to work with a SPSS file (.sav) in Python.
This is my code:
import pandas as pd
df=pd.read_spss('C:/Users/bonif/Documents/CSALUD01.sav')
df.head()
I get this error:
df=pd.read_spss('C:/Users/bonif/Documents/CSALUD01.sav')
File "C:\Users\bonif\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\spss.py", line 44, in read_spss
df, _ = pyreadstat.read_sav(
File "pyreadstat\pyreadstat.pyx", line 342, in pyreadstat.pyreadstat.read_sav
File "pyreadstat\_readstat_parser.pyx", line 1034, in pyreadstat._readstat_parser.run_conversion
File "pyreadstat\_readstat_parser.pyx", line 845, in pyreadstat._readstat_parser.run_readstat_parser
File "pyreadstat\_readstat_parser.pyx", line 775, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)
I figure out that the error may be because there are some words with the letter "ñ" or maybe some words with the following character "á". How may I solve this?
The data base is in this google drive: https://drive.google.com/drive/folders/1P8v5NWE-GdAEJRZdmrp5KiL-DODClmfU?usp=sharing
Thank you so much
as ti7 suggests, use pyreadstat, and you need to specify the encoding, in this case latin1 will do the trick:
>>> import pyreadstat
# This raises an error
>>> df, meta = pyreadstat.read_sav("CSALUD01.sav")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyreadstat/pyreadstat.pyx", line 342, in pyreadstat.pyreadstat.read_sav
File "pyreadstat/_readstat_parser.pyx", line 1034, in pyreadstat._readstat_parser.run_conversion
File "pyreadstat/_readstat_parser.pyx", line 845, in pyreadstat._readstat_parser.run_readstat_parser
File "pyreadstat/_readstat_parser.pyx", line 775, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)
# This is fine
>>> df, meta = pyreadstat.read_sav("CSALUD01.sav", encoding="latin1")
>>>
Pandas calls pyreadstat to read SPSS files src
You may have more luck using it directly, where it has an option to set the encoding
From the docs https://github.com/Roche/pyreadstat#other-options
You can set the encoding of the original file manually. The encoding must be a iconv-compatible encoding. This is absolutely necessary if you are handling old xport files with non-ascii characters. Those files do not have stamped the encoding in the file itself, therefore the encoding must be set manually.
import pyreadstat
df, meta = pyreadstat.read_sav(path, encoding=my_encoding)
It could also be that you simply don't have iconv installed (which it relies on for encodings), but I doubt it (you would get some other error)

How to include long base64 string in XML written by lxml element factory?

I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded pdf files. The XML file will be used to import data into a database software, so the schema can not be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
E.galley(
E.label('PDF'),
E.file(
ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
+ " encoding=\"base64\" mime_type=\"application/pdf\" >"
+ str(base64fulltext)
+ "</embed>")
), self.LOCALE(row['language']),
), self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line where it says str(base64fulltext) in the code snippet above. The error message is as follows:
(lxml) vboxadmin#linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result would have been to have the base64 string to be written to the xml file.
So far, I could only find that there is the option "huge_tree" in lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether/how I can use this to solve my problem.
As a workaround, I am considering using string replace to insert the base64 string to the xml after it has been written to file. However, I would be more happy to use a proper lxml solution if anyone could suggest one. Thanks!

UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Hi I am trying to extract comments on a web page using lxml and xpath. Here is my code:
pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)
cm_pg = tr_pg.xpath('//p[#class="break-word"]/text()')
for cm in cm_pg:
print cm
I got this error
Traceback (most recent call last):
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
cm_pg = tr_pg.xpath('//p[#class="break-word"]/text()')
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte
I know that there is an invalid character in the comments. How do I solve this?
Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).
The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests best guess at what it is. Sometimes web servers/pages aren't configured to explicitly say what encoding they're using then all you can do is guess.
Doesn't help that encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might mixing encodings. Note that your target website can't even validate because the encoding is wrong, and even with that it's rife with errors.
The page have badly encoded characters.
Ex:
Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)
You can avoid them by manually decoding:
>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'

Python: TypeError: 'file' object has no attribute '__getitem__'

I have a .gpx file which is cut off int the middle of the file. When I try to parse it using the gpxpy library I run into the following error.
Parsing points in track.gpx
ERROR:root:expected '>', line 3125, column 29
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/parser.py", line 209, in parse
self.xml_parser = LXMLParser(self.xml)
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/parser.py", line 107, in __init__
self.dom = mod_etree.XML(self.xml)
File "lxml.etree.pyx", line 2734, in lxml.etree.XML (src/lxml/lxml.etree.c:54411)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
XMLSyntaxError: expected '>', line 3125, column 29
File "gpxscript.py", line 370, in extractpoints gpx = gpxpy.parse(file)
File "/usr/local/lib/python2.7/dist-packages/gpxpy-0.8.7-py2.7.egg/gpxpy/__init__.py",
line 28, in parse raise mod_gpx.GPXException('Error parsing {0}: {1}'
.format(xml_or_file[0 : 100], parser.get_error()))
TypeError: 'file' object has no attribute '__getitem__'
These are the relevant commands of the script which produces the error.
368 file = open(filepath)
369 try:
370 gpx = gpxpy.parse(file)
371 except gpxpy.gpx.GPXException:
372 print "GPXException for %s." % filepath
373 return 1
I filed a bug for the library as suggested. I added a sample file to the bug report which produces the syntax error.
This appears to be a bug in gpxpy's error handling.
Looking at the source to parse, when the parser fails without raising an exception, it tries to raise an exception with this:
raise mod_gpx.GPXException('Error parsing {0}: {1}'.format(xml_or_file[0 : 100], parser.get_error()))
This assumes that xml_or_file is an XML string—but, as the name implies, it's allowed to be either a string or a file object. So, what you're doing (giving it a file object) is perfectly legal and should work, and it doesn't, and therefore it's a bug.
So, you should file an issue. The correct patch should be something like:
if not parser.is_valid():
try:
fragment = xml_or_file[0 : 100]
except TypeError:
xml_or_file.seek(0)
fragment = xml_or_file.read(100)
raise mod_gpx.GPXException('Error parsing {0}: {1}'.format(fragment, parser.get_error()))
So, how do you work around this? A few options:
Since it only happens with invalid files anyway, you can just use except Exception or except (gpxpy.gpx.GPXException, TypeError).
Since it only happens when you give it a the file object, give it a string instead: gpx = gpx.parse(file.read()). This is a bad idea if the file is very large, of course.
Since the buggy function is only 12 lines of trivial code wrapping the real function, just use the real function directly. Or, if you like the wrapper, copy it, fix it, and use your own copy instead.
Meanwhile, given that the very first bit of code I looked at in this library has some obvious red flags (Why xml_or_file[0 : 100] instead of just xml_or_file[:100]? Why catch exceptions, throw them away and just set a flag, and then use that flag to raise a new exception with all the information missing?), if you're not able to debug libraries on your own, I don't think this one is ready for you to use.

how to pass an xml file to lxml to parse?

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = "C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4
9590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre
e.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx
ml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
tree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/
lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo
c (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e
tree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr
ee.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?
What you are doing wrong is (1) not checking whether you got the same outcome by using xml.etree on the same file (2) not reading the error message, which indicates a syntax error in line 2 of the file, way down stream from any file-opening issue
I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.
You have a syntax error in your XML Markup. You aren't doing anything wrong.
lxml allows you load a broken xml by creating a parser instance with recover=True
etree.XMLParser(recover=True)
While this is not ideal, I use this to load an xml for schema/dtd/schematron validation.

Categories

Resources