urllib and regular expression substitution error - python

Why does the following result in an error?
import re
from urllib import quote as q
s = re.compile(r'[^a-zA-Z0-9.: ^*$#!+_?-]')
s.sub(q, "A/B")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/python/python-2.7.1/lib/python2.7/urllib.py", line 1236, in quote
if not s.rstrip(safe):
AttributeError: rstrip
I'd like to call sub on strings that contain forward slashes, and I'm not sure why it results in this error. How can it be fixed so that I can pass strings with '/' characters in them to sub()?
Thanks.

Because re.sub calls the repl function with a match object, not with the matched string, so quote ends up calling rstrip on something that is not a string.
I think you want to use:
s.sub(lambda m: q(m.group()), "A/B")
However, a simpler way of doing this might be to use the safe argument to urllib.quote:
urllib.quote("A/B", safe="/.: ^*$#!+_?-")
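For readers on Python 3, the callback approach can be sketched roughly as follows. This assumes quote is imported from urllib.parse (its Python 3 location); quote_match is a hypothetical helper name, and safe="" is an assumption made here so that "/" is percent-encoded too, since quote treats "/" as safe by default.

```python
import re
from urllib.parse import quote  # Python 3 home of urllib.quote

pattern = re.compile(r'[^a-zA-Z0-9.: ^*$#!+_?-]')

def quote_match(match):
    # re.sub hands the callback a match object, so pull out the matched
    # text before quoting it. safe="" (an assumption) also encodes "/".
    return quote(match.group(), safe="")

result = pattern.sub(quote_match, "A/B")

# With quote's default safe="/", the slash is matched but left unchanged.
default_result = pattern.sub(lambda m: quote(m.group()), "A/B")

print(result)
print(default_result)
```

Whether the slash should end up encoded depends on what the quoted string is used for, so pick the safe argument accordingly.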

Related

This regex is not valid for xsd

I want to validate a 2- or 3-letter iso code, but also allow it to be empty (so it can be 0, 2, or 3 characters).
'\w{2,3}|'
This works locally and also on http://www.freeformatter.com/xml-validator-xsd.html. However, when I try running it on an Ubuntu machine, I get the following error:
>>> import urllib2
>>> from lxml import etree
>>> xsd_url = 'https://s3-us-west-1.amazonaws.com/premiere-avails/movie.xsd.xml'
>>> xsd_contents = urllib2.urlopen(xsd_url).read()
>>> xmlschema_doc = etree.fromstring(xsd_contents)
>>> xmlschema=etree.XMLSchema(xmlschema_doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "xmlschema.pxi", line 102, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:168126)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}pattern':
The value '\w{2,3}|' of the facet 'pattern' is not a valid regular expression., line 58
What would be a better regex pattern for this? (\w{2,3})? also fails with xsd, so it needs to be something else.

Output compiler error to a txt file in python

I'm a beginner using python. I want to create a regular expression to capture error messages from compiler output in python. How would I do this?
for example, if the compiler output is the following error message:
Traceback (most recent call last):
  File "sample.py", line 1, in <module>
    hello
NameError: name 'hello' is not defined
I want to be able to extract only the following string from the output:
NameError: name 'hello' is not defined
In this case there is only one error; however, I want to extract all the errors the compiler outputs. How do I do this using regular expressions? Or if there is an easier way, I'm open to suggestions.
r'Traceback \(most recent call last\):\n(?:[ ]+.*\n)*(\w+: .*)'
should extract your exception; a traceback contains lines that all start with whitespace except for the exception line.
The above matches the literal text of the traceback first line, 0 or more lines that start with at least one space, and then captures the line following that provided it starts with 1 or more word characters (which fits Python identifiers nicely), a colon, and then the rest up to the end of a line.
Demo:
>>> import re
>>> sample = '''\
... Traceback (most recent call last):
...   File "sample.py", line 1, in <module>
...     hello
... NameError: name 'hello' is not defined
... '''
>>> re.search(r'Traceback \(most recent call last\):\n(?:[ ]+.*\n)*(\w+: .*)', sample).groups()
("NameError: name 'hello' is not defined",)
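Since the question asks for all errors rather than just the first, the same pattern can be applied with findall. This is a minimal sketch under assumptions: the sample output below is made up for illustration, and TRACEBACK_RE is a hypothetical name. Exception names containing dots (for example a module-qualified name) would not match \w+ and would need a wider pattern.

```python
import re

# Same pattern as in the answer: header line, indented lines, exception line.
TRACEBACK_RE = re.compile(
    r'Traceback \(most recent call last\):\n(?:[ ]+.*\n)*(\w+: .*)'
)

# Made-up output containing two tracebacks and one unrelated line.
output = '''\
Traceback (most recent call last):
  File "sample.py", line 1, in <module>
    hello
NameError: name 'hello' is not defined
some unrelated line
Traceback (most recent call last):
  File "other.py", line 3, in <module>
    1 / 0
ZeroDivisionError: division by zero
'''

# With a single capture group, findall returns a list of the captured strings.
errors = TRACEBACK_RE.findall(output)
for error in errors:
    print(error)
```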

where is the call to encode the string or force the string to need to be encoded in this file?

I know this may seem rude or impolite, but I need some help figuring out why I can't call window.loadPvmFile("f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm") with the path written exactly like that as a string. Instead, it gives me a traceback:
Traceback (most recent call last):
File "F:\Python Apps\pvmViewer_v1_1.py", line 415, in <module>
window.loadPvmFile("f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm")
File "F:\Python Apps\pvmViewer_v1_1.py", line 392, in loadPvmFile
file1 = open(path, "rb")
IOError: [Errno 22] invalid mode ('rb') or filename:
'f:\\games\\#DD.ATC3.Root\\common\\models\x07300\x07mu\\dummy.pvm'
Also notice that the file path shown in the traceback is different from the one I passed. When I try a path that has no letters in it except for the drive letter and filename, it throws this error instead:
Traceback (most recent call last):
File "F:\Python Apps\pvmViewer_v1_1.py", line 416, in <module>
loadPvmFile('f:\0\0\dummy.pvm')
File "F:\Python Apps\pvmViewer_v1_1.py", line 393, in loadPvmFile
file1 = open(path, "r")
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
I have searched for the place where the encode function is called or where the argument is encoded, and I can't find it. I am out of ideas and frustrated, and I have nowhere else to go. The source code can be found here: PVM VIEWER
Also note that you will not be able to run this code and load a pvm file and that I am using portable python 2.7.3! Thanks for everyone's time and effort!
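\a and \0 are escape sequences: in an ordinary string literal, \a becomes the BEL character (\x07) and \0 becomes a NUL byte. That is why the first traceback shows \x07300 and \x07mu in the path, and why the second path is rejected for containing NULL bytes. Use r'' (or R'') around the string to mark it as a raw string, so the backslashes are kept literally:
window.loadPvmFile(r"f:\games\#DD.ATC3.Root\common\models\a300\amu\dummy.pvm")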
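The difference is easy to see by comparing the repr of an ordinary literal with that of a raw one. This is a small sketch using shortened, made-up path fragments rather than the full path from the question; the variable names are hypothetical.

```python
# An ordinary literal interprets \a (BEL, 0x07) and \0 (NUL) as escapes.
plain = "models\a300\amu"
nul_path = "f:\0\0"

# A raw literal keeps every backslash as a literal character.
raw = r"models\a300\amu"

print(repr(plain))     # the \a sequences show up as \x07
print(repr(nul_path))  # the \0 sequences show up as \x00
print(repr(raw))       # backslashes are shown doubled, i.e. kept literally
```

Doubling each backslash ("models\\a300\\amu") produces the same string as the raw literal.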

HeaderParseError in python

I get a HeaderParseError if I try to parse this string with decode_header() in Python 2.6.5 (and 2.7). Here is the repr() of the string:
'=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?='
This string comes from a MIME email which contains a JPEG picture. Thunderbird can decode the filename (which contains German umlauts).
>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?=')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/email/header.py", line 101, in decode_header
raise HeaderParseError
email.errors.HeaderParseError
It seems to be an incompatibility between the base64 character set Python expects and the one the mail agent used:
>>> from email.header import decode_header
>>> a='=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?='
>>> decode_header(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/email/header.py", line 108, in decode_header
raise HeaderParseError
email.errors.HeaderParseError
>>> a1= a.replace('_', '/')
>>> decode_header(a1)
[('Anmeldung Netzanschluss S\xfcdring3p.jpg', 'iso-8859-1')]
>>> print _[0][0].decode(_[0][1])
Anmeldung Netzanschluss Südring3p.jpg
Python uses the standard base64 alphabet described in the Wikipedia article (i.e. A-Z, a-z, 0-9, +, /). That article also lists some alternative alphabets, including ones that use the underscore at issue here; however, the underscore's value is ambiguous (it is 62 or 63, depending on the variant).
I don't know what Python can do to guess the intentions of broken mail agents, so I suggest you do some appropriate guessing whenever decode_header fails.
I call the mail agent “broken” because there is no need to avoid either + or / in a message header: it's not a URL, so why not use the standard character set?
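One way to do that guessing is sketched below for Python 3. It is only a sketch under assumptions: it guesses that the agent used the URL-safe alphabet (where "_" stands for "/" and "-" for "+"), which happens to fit this header but is not guaranteed for others. ENCODED_WORD, decode_urlsafe_word and decode_filename are hypothetical names, not part of the email package.

```python
import base64
import re
from email.errors import HeaderParseError
from email.header import decode_header

# Matches a single base64 ("B") encoded word: =?charset?B?payload?=
ENCODED_WORD = re.compile(r'=\?([^?]+)\?[bB]\?([^?]*)\?=')

def decode_urlsafe_word(value):
    """Decode one encoded word, treating '-' and '_' as '+' and '/'."""
    match = ENCODED_WORD.fullmatch(value)
    if match is None:
        raise ValueError("not a single base64 encoded word")
    charset, payload = match.groups()
    # Guess: the agent used the URL-safe alphabet ('_' stands for '/').
    return base64.urlsafe_b64decode(payload).decode(charset)

def decode_filename(value):
    """Try the standard decoder first, then fall back to the guess above."""
    try:
        return "".join(
            part.decode(charset or "ascii") if isinstance(part, bytes) else part
            for part, charset in decode_header(value)
        )
    except HeaderParseError:
        return decode_urlsafe_word(value)

header = '=?iso-8859-1?B?QW5tZWxkdW5nIE5ldHphbnNjaGx1c3MgU_xkcmluZzNwLmpwZw==?='
print(decode_filename(header))
```

Keeping the fallback separate from the standard path means well-formed headers are still decoded by decode_header, and the guess is applied only when that fails.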

python logging into a forum

I've written this to try and log onto a forum (phpBB3).
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
output = page.read()
However, when I run it, it comes up with:
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
TypeError: string indices must be integers
any ideas as to how to solve this?
edit
adding a comma between the string and the data gives this error instead
Traceback (most recent call last):
File "C:/Users/Mike/Documents/python/test urllib2", line 4, in <module>
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login",[logindata])
File "C:\Python25\lib\urllib.py", line 84, in urlopen
return opener.open(url, data)
File "C:\Python25\lib\urllib.py", line 192, in open
return getattr(self, name)(url, data)
File "C:\Python25\lib\urllib.py", line 327, in open_http
h.send(data)
File "C:\Python25\lib\httplib.py", line 711, in send
self.sock.sendall(str)
File "<string>", line 1, in sendall
TypeError: sendall() argument 1 must be string or read-only buffer, not list
edit2
I've changed the code from what it was to;
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
output = page.read()
This doesn't throw any error messages; it just gives 3 blank lines. Is this because I'm trying to read from the login page, which disappears after logging in? If so, how do I get it to display the index, which is what should appear after hitting log in?
Your line
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login"[logindata])
parses, but it indexes the URL string with logindata instead of passing two arguments, hence "string indices must be integers". Presumably you meant
page = urllib.urlopen("http://www.woarl.com/board/ucp.php?mode=login", [logindata])
which has a comma separating the arguments. However, what you actually want is simply
page = urllib2.urlopen("http://www.woarl.com/board/ucp.php?mode=login", logindata)
which passes logindata as a plain string rather than wrapping it in a list, and uses the more up-to-date urlopen from the urllib2 library.
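For Python 3, where urllib2 was folded into urllib.request, the same request can be sketched as below. Assumptions are flagged in the comments: the cookie jar is added here on the guess that the forum tracks the login through cookies, login_and_read is a hypothetical helper, and whether phpBB3 expects additional form fields is not established by anything in this thread. Nothing is sent until login_and_read is actually called.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://www.woarl.com/board/ucp.php?mode=login"

# urlencode returns a str; in Python 3 the request body must be bytes.
logindata = urllib.parse.urlencode({'username': 'x', 'password': 'y'}).encode("ascii")

# Supplying data makes this a POST request.
request = urllib.request.Request(LOGIN_URL, data=logindata)

# Assumption: keep cookies so that later requests can reuse the session.
cookies = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))

def login_and_read(opener, request):
    """Send the login request and return the response body as bytes."""
    with opener.open(request) as response:
        return response.read()
```

Reusing the same opener for a follow-up request to the index page would send back any cookies the login response set.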
How about using a comma between the string,"http:..." and the urlencoded data, [logindata]?
Your URL string shouldn't be
"http://www.woarl.com/board/ucp.php?mode=login"[logindata]
But
"http://www.woarl.com/board/ucp.php?mode=login", logindata
I think this is because [] after a string means indexing, which requires an integer. I might be wrong, because I haven't done a lot of Python.
If you call type() on logindata, you can see that it is a string:
>>> import urllib
>>> logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
>>> type(logindata)
<type 'str'>
Putting it in brackets ([]) wraps it in a list, which isn't what you want.
This would be easier with the high-level "mechanize" module.
