HeaderParseError in python


I get a HeaderParseError if I try to parse this string with decode_header() in Python 2.6.5 (and 2.7). Here is the repr() of the string:
'=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?='
This string comes from a MIME email which contains a JPEG picture. Thunderbird can
decode the filename (which contains German umlauts).
>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/email/header.py", line 101, in decode_header
raise HeaderParseError
email.errors.HeaderParseError

This looks like an incompatibility between the base64 alphabet Python expects and the one the mail agent used:
>>> from email.header import decode_header
>>> a='=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?='
>>> decode_header(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/email/header.py", line 108, in decode_header
raise HeaderParseError
email.errors.HeaderParseError
>>> a1= a.replace('_', '/')
>>> decode_header(a1)
[('Anmeldung Netzanschluss S\xfcdring3p.jpg', 'iso-8859-1')]
>>> print _[0][0].decode(_[0][1])
Anmeldung Netzanschluss Südring3p.jpg
Python uses the base64 alphabet described in the Wikipedia article (i.e. A-Z, a-z, 0-9, +, /). That article also lists some variants, including ones that use the underscore that causes the problem here; however, the underscore's value is ambiguous (it is 62 or 63, depending on the variant).
I don't know what Python can do to guess the intentions of b0rken mail agents, so I suggest you do some appropriate guessing whenever decode_header fails.
I call the mail agent “broken” because there is no need to avoid + or / in a message header: it's not a URL, so why not use the standard alphabet?
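One way to do the "appropriate guessing" suggested above is to map the URL-safe base64 characters back to the standard alphabet before decoding. This is only a sketch: it assumes the sender used the RFC 4648 URL-safe alphabet (- for 62, _ for 63), which matches the replacement shown above, and decode_b_word_lenient is a hypothetical helper name, not part of the email package. It handles a single B-encoded word only.

```python
import base64
import re

# Matches one RFC 2047 "B" (base64) encoded word: =?charset?B?payload?=
_B_WORD = re.compile(r'^=\?([^?]+)\?[bB]\?([^?]*)\?=$')


def decode_b_word_lenient(word):
    """Decode one B-encoded word, tolerating the URL-safe base64 alphabet.

    Hypothetical helper. Assumes '-' stands for 62 and '_' for 63.
    """
    match = _B_WORD.match(word)
    if match is None:
        raise ValueError('not a single B-encoded word: %r' % (word,))
    charset, payload = match.groups()
    # Translate URL-safe characters to the standard alphabet.
    payload = payload.replace('-', '+').replace('_', '/')
    # Restore any padding the sender left off.
    payload += '=' * (-len(payload) % 4)
    return base64.b64decode(payload).decode(charset)


# The header from the question; name ends up as the Unicode filename.
name = decode_b_word_lenient(
    '=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?=')
```

It is probably safest to call something like this only after decode_header has already failed, so that well-formed headers keep going through the standard library.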

Related

This regex is not valid for xsd

I want to validate a 2- or 3-letter ISO code, but also allow it to be empty (so it can be 0, 2, or 3 characters).
'\w{2,3}|'
This works locally and also on http://www.freeformatter.com/xml-validator-xsd.html. However, when I try running it on an Ubuntu machine, I get the following error:
>>> import urllib2
>>> from lxml import etree
>>> xsd_url = 'https://s3-us-west-1.amazonaws.com/premiere-avails/movie.xsd.xml'
>>> xsd_contents = urllib2.urlopen(xsd_url).read()
>>> xmlschema_doc = etree.fromstring(xsd_contents)
>>> xmlschema=etree.XMLSchema(xmlschema_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "xmlschema.pxi", line 102, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:168126)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}pattern':
The value '\w{2,3}|' of the facet 'pattern' is not a valid regular expression., line 58
What would be a better regex pattern for this? (\w{2,3})? also fails with XSD, so it needs to be something else.

python email decode_header raise HeaderParseError

I got a HeaderParseError, as shown below. What's wrong with it?
>>> from email import Header
>>> s= "=?UTF-8?B?6KGM6KGM5ZyI5Li65oKo5o6o6I2Q5Lul5LiL6IGM5L2N77yM?==?UTF-8?B?56Wd5oKo5om+5Yiw5aW95bel5L2c77yB44CQ6KGM6KGM5ZyI44CR?="
>>> src = Header.decode_header(s)
This is the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/email/header.py", line 108, in decode_header
raise HeaderParseError
email.errors.HeaderParseError

You are trying to parse two encoded headers at once, run together with nothing between them:
"=?UTF-8?B?6KGM6KGM5ZyI5Li65oKo5o6o6I2Q5Lul5LiL6IGM5L2N77yM?="
and
"=?UTF-8?B?56Wd5oKo5om+5Yiw5aW95bel5L2c77yB44CQ6KGM6KGM5ZyI44CR?="
Removing one of them will do the job. If you want to parse all of them, you have to split them first.
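The splitting step can be sketched as follows. RFC 2047 expects whitespace between adjacent encoded words, so this sketch puts a space back at every "?==?" boundary before calling decode_header. The helper name decode_joined_words is hypothetical, and treating every "?==?" as a boundary is a naive assumption that may not hold for unusual payloads.

```python
from email.header import decode_header


def decode_joined_words(value):
    """Decode a header value whose encoded words were glued together.

    Hypothetical helper: naively treats every '?==?' as a word boundary.
    """
    # Put back the whitespace that is expected between encoded words.
    spaced = value.replace('?==?', '?= =?')
    pieces = []
    for data, charset in decode_header(spaced):
        # Encoded parts come back as byte strings plus a charset name.
        if isinstance(data, bytes):
            data = data.decode(charset or 'ascii')
        pieces.append(data)
    return ''.join(pieces)


# Two UTF-8 words ("foo" and "bar") joined without a separator.
subject = decode_joined_words('=?UTF-8?B?Zm9v?==?UTF-8?B?YmFy?=')
```

The same call should work on the string from the question; the short example is used here only because its result is easy to check by hand.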

Why do I get IOErrors when writing Unicode to the CMD? (With codepage 65001)

I'm on the CMD in Windows 8 and I've set the codepage to 65001 (chcp 65001). I'm using Python 2.7.2 (ActivePython 2.7.2.5) and I've set the PYTHONSTARTUP environment variable to "bootstrap.py".
bootstrap.py:
import codecs
codecs.register(
    lambda name: name == 'cp65001' and codecs.lookup('UTF-8') or None
)
This lets me print ASCII:
>>> print 'hello'
hello
>>> print u'hello'
hello
But the errors I get when I try to print a Unicode string with non-ASCII characters make no sense to me. Here I try to print a few strings containing Nordic symbols (I added the extra line break between the prints for readability):
>>> print u'æøå'
��øåTraceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> print u'åndalsnes'
��ndalsnes
>>> print u'åndalsnesæ'
��ndalsnesæTraceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument
>>> print u'Øst'
��st
>>> print u'uØst'
uØstTraceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument
>>> print u'ØstÆØÅæøå'
��stÆØÅæøåTraceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument
>>> print u'_ØstÆØÅæøå'
_ØstÆØÅæøåTraceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument
As you can see, it doesn't always raise an error (and doesn't even raise the same error every time), and the Nordic symbols are only occasionally displayed correctly.
Can somebody explain this behavior, or at least help me figure out how to print Unicode to the CMD correctly?

Try this:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
print u'æøå'
Using from __future__ import unicode_literals can also be useful in an interactive Python session.
It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code here does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).
WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once. That's easy to work around by limiting the amount of data passed in a single call.
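The "limit the amount of data per call" workaround can be sketched independently of the console API. In this sketch write_chunk is a placeholder for whatever function ends up calling WriteConsoleW (it is not a real API), and the 10000-character default is an arbitrary value chosen to stay well under the limit mentioned above.

```python
def write_in_chunks(text, write_chunk, max_chars=10000):
    """Pass text to write_chunk in pieces of at most max_chars characters.

    write_chunk is a hypothetical callable taking one string argument.
    """
    for start in range(0, len(text), max_chars):
        write_chunk(text[start:start + max_chars])


# Usage example with a list standing in for the console writer.
written = []
write_in_chunks(u'x' * 25000, written.append)
print([len(piece) for piece in written])
```

Keeping the default far below the reported limit leaves headroom in case the underlying call counts UTF-16 code units rather than Python characters.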
Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.
(For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack.)
I hope it helps.

where is the call to encode the string or force the string to need to be encoded in this file?

I know this may seem rude, mean, or impolite, but I need some help figuring out why I can't call window.loadPvmFile("f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm") exactly like that, with the path as a plain string. Instead of working, it gives me this traceback:
Traceback (most recent call last):
File "F:\Python Apps\pvmViewer_v1_1.py", line 415, in <module>
window.loadPvmFile("f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm")
File "F:\Python Apps\pvmViewer_v1_1.py", line 392, in loadPvmFile
file1 = open(path, "rb")
IOError: [Errno 22] invalid mode ('rb') or filename:
'f:\\games\\#DD.ATC3.Root\\common\\models\x07300\x07mu\\dummy.pvm'
Also notice that the file path in the traceback is different. When I try a path that has no letters in it except for the drive letter and filename, I get this error instead:
Traceback (most recent call last):
File "F:\Python Apps\pvmViewer_v1_1.py", line 416, in <module>
loadPvmFile('f:\0\0\dummy.pvm')
File "F:\Python Apps\pvmViewer_v1_1.py", line 393, in loadPvmFile
file1 = open(path, "r")
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
I have searched for the place where the encode function is called or where the argument is encoded, and I can't find it. Flat out, I am out of ideas, frustrated, and I have nowhere else to go. The source code can be found here: PVM VIEWER
Also note that you will not be able to run this code and load a pvm file, and that I am using Portable Python 2.7.3! Thanks for everyone's time and effort!

\a and \0 are escape sequences. Put an r (or R) prefix on the string literal to mark it as a raw string:
window.loadPvmFile(r"f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm")
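A short illustration of the difference, using a fragment of the path from the question (the variable names are only for this example):

```python
# "\a" is the escape sequence for BEL (0x07) and "\0" is NUL (0x00),
# so an ordinary string literal silently changes the path.
plain = "models\a300\amu"
raw = r"models\a300\amu"        # raw string: backslashes are kept as typed
doubled = "models\\a300\\amu"   # escaping each backslash also works

print(repr(plain))     # each "\a" pair shows up as \x07, as in the traceback
print(repr(raw))
print(raw == doubled)  # True
print(len("\0"))       # 1: a single NUL character
```

Nothing in the script encodes the path. The literal is already altered when Python parses the source file, which is why no call to encode can be found.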

urllib and regular expression substitution error

Why does the following result in an error?
import re
from urllib import quote as q
s = re.compile(r'[^a-zA-Z0-9.: ^*$#!+_?-]')
s.sub(q, "A/B")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/python/python-2.7.1/lib/python2.7/urllib.py", line 1236, in quote
if not s.rstrip(safe):
AttributeError: rstrip
I'd like to call sub on strings that contain forward slashes, and I'm not sure why it results in this error. How can it be fixed so that I can pass strings with '/' characters in them to sub()?
thanks.

Because re.sub calls the repl function with a match object, not with the matched string.
I think you want to use:
s.sub(lambda m: q(m.group()), "A/B")
However, a simpler way of doing this might be to use the safe argument to urllib.quote:
urllib.quote("A/B", safe="/.: ^*$#!+_?-")
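A self-contained version of the callback approach is sketched below. One detail to be aware of: quote treats '/' as safe by default, so the slash is only percent-encoded if safe is overridden. Passing safe='' here is an assumption about what the question intended, and quote_unsafe is a hypothetical helper name.

```python
import re

try:
    from urllib import quote          # Python 2, as in the question
except ImportError:
    from urllib.parse import quote    # Python 3 location of the same function

# Same character class as in the question: anything outside it gets quoted.
UNSAFE = re.compile(r'[^a-zA-Z0-9.: ^*$#!+_?-]')


def quote_unsafe(text):
    # re.sub hands the callback a match object; m.group() is the matched text.
    # safe='' is an assumption: without it, '/' would be left unchanged.
    return UNSAFE.sub(lambda m: quote(m.group(), safe=''), text)


print(quote_unsafe("A/B"))   # A%2FB
print(quote_unsafe("A B"))   # A B (space is in the allowed set)
```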
