Escaping XPath literal with Python - python

I'm writing a common library to setup an automation test suite with Selenium 2.0 Python's webdriver.
def verify_error_message_present(self, message):
    try:
        self.driver.find_element_by_xpath("//span[@class='error'][contains(.,'%s')]" % message)
        self.assertTrue(True, "Found an error message containing %s" % message)
    except Exception as e:
        self.logger.exception(e)
I would like to escape the message before passing it to XPath query, so it can support if 'message' is something like "The number of memory slots used (32) exceeds the number of memory slots that are available (16)"
Without escaping, the xpath query won't work since it contains '(' and ')'
Which library can we use to do this in Python?
I know that this is a simple question, but I don't have so much experience in Python (just started).
Thanks in advance.
Additional info:
During testing in firebug, the query below will return no result:
//span[@class='error'][contains(.,'The number of memory slots used (32) exceeds the number of memory slots that are available (16)')]
While the query below will return the desired component:
//span[@class='error'][contains(.,'The number of memory slots used \(32\) exceeds the number of memory slots that are available \(16\)')]
Logically, this problem can be solved by replacing ) with \) for this particular string literal, but other characters may still need to be escaped. So is there a library that does this properly?

Parentheses should be fine there. They're inside an XPath string literal delimited by apostrophe, so they do not prematurely end the contains condition.
The problem is what happens when you have apostrophes in your string, since those do end the string literal, breaking the expression. Unfortunately there is no string escaping scheme for XPath string literals, so you have to work around it using expressions to generate the troublesome characters, typically in the form concat('str1', "'", 'str2').
Here's a Python function to do that:
def toXPathStringLiteral(s):
    if "'" not in s: return "'%s'" % s
    if '"' not in s: return '"%s"' % s
    return "concat('%s')" % s.replace("'", "',\"'\",'")

"//span[@class='error'][contains(.,%s)]" % toXPathStringLiteral(message)

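The concat() workaround described above can be sketched in a slightly different way, by splitting on apostrophes instead of using replace(). The name xpath_literal and the sample message are made up for illustration; nothing here is part of Selenium or lxml.

```python
def xpath_literal(s):
    """Return s as an XPath 1.0 string literal (hypothetical helper name)."""
    if "'" not in s:
        return "'%s'" % s
    if '"' not in s:
        return '"%s"' % s
    # Both quote kinds are present: quote each apostrophe-free piece with
    # apostrophes and glue the pieces back together with "'" via concat().
    pieces = s.split("'")
    return "concat(%s)" % ", \"'\", ".join("'%s'" % p for p in pieces)


# Parentheses need no treatment; only the quote characters matter.
message = "can't use (32) slots"
query = "//span[@class='error'][contains(.,%s)]" % xpath_literal(message)
```

For a message with no quotes at all the result is simply the message wrapped in apostrophes, so the common case stays readable in logs.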
Related

why python raise a error :“float argument required, not unicode”

for each_ID, each_Title in zip(Id, Title):
    url="http://www.zjjsggzy.gov.cn/%E6%96%B0%E6%B5%81%E7%A8%8B/%E6%8B%9B%E6%8A%95%E6%A0%87%E4%BF%A1%E6%81%AF/jyxx_1.html?iq=x&type=%E6%8B%9B%E6%A0%87%E5%85%AC%E5%91%8A&tpid=%s&tpTitle=%s"%(each_ID,each_Title)
“each_ID” and “each_Title” are unicode parameters taken from a website, so why does this raise a “float” error? Isn't %s a string placeholder?
You have loads of % formatters in your string. %E formats a float object. You have several of those in your string, including at the start:
"http://www.zjjsggzy.gov.cn/%E6
#                           ^^
You'd need to double up every single % used in a URL character escape:
"http://www.zjjsggzy.gov.cn/%%E6%%96%%B0%%E6%%B5%%81%%E7%%A8%%8B/..."
That'd be a lot of work, you'd be better off using a different string formatting style. Use str.format():
url = (
"http://www.zjjsggzy.gov.cn/"
"%E6%96%B0%E6%B5%81%E7%A8%8B/%E6%8B%9B%E6%8A%95%E6%A0%87%E4%BF%A1%E6%81%AF"
"/jyxx_1.html?iq=x&type=%E6%8B%9B%E6%A0%87%E5%85%AC%E5%91%8A&"
"tpid={}&tpTitle={}".format(
each_ID, each_Title)
)
I broke the string up into multiple chunks to make it easier to read; the {} brackets delineate the placeholders.
Try using the format method on the string. The existing '%' characters conflict with your %s placeholders:
for each_ID, each_Title in zip(Id, Title):
    url="http://www.zjjsggzy.gov.cn/%E6%96%B0%E6%B5%81%E7%A8%8B/%E6%8B%9B%E6%8A%95%E6%A0%87%E4%BF%A1%E6%81%AF/jyxx_1.html?iq=x&type=%E6%8B%9B%E6%A0%87%E5%85%AC%E5%91%8A&tpid={}&tpTitle={}".format(each_ID, each_Title)
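To make the two options from these answers concrete, here is a minimal sketch that builds the same URL both ways. The host example.com, the shortened path, and the function names are assumptions for illustration only.

```python
# Hypothetical, shortened URL; %E6%96%B0 is an already percent-encoded path segment.
TEMPLATE = "http://example.com/%E6%96%B0/jyxx_1.html?tpid={}&tpTitle={}"


def build_url(tp_id, tp_title):
    # str.format() only looks at {} placeholders, so the % escapes pass through.
    return TEMPLATE.format(tp_id, tp_title)


def build_url_percent(tp_id, tp_title):
    # The %-operator equivalent: every literal % must be doubled first.
    doubled = TEMPLATE.replace("%", "%%").replace("{}", "%s")
    return doubled % (tp_id, tp_title)


# Without doubling, %E is read as a float conversion and raises TypeError.
try:
    "http://example.com/%E6?tpid=%s" % ("7",)
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

If the parameter values can contain characters that are not URL-safe, they would also need percent-encoding before being substituted; that step is left out of this sketch.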

Questions concerning using regex in python

I am currently reading in a string that starts with a number and runs up to the next delimiter, and testing whether the string is a float. I have a few questions here: I believe my regex works, but I don't think I am using the proper method when I apply it.
my particular float will be in the format of
d+(.d+)?(E(+|-)?d+)?
r'(\d+(\.\d+)?([E][+|-]?\d+)?'
Above is the regular expression I'm using and it is correct for the specifications I have set up, but my issue is that I will be reading in bad values and I want to print an error that either prints the whole string as bad or prints the part that passed followed by an error with the incorrect part printing. When I try I get the error print
print "ERROR: %s" % m.groups()
TypeError: not all arguments converted during string formatting
I feel like I am missing something simple but I cannot figure out what.
So in summary, I am trying to use the above regular expression to check whether a number string that I read in is in the float form. If the whole string conforms, I want to print it; if there is a bad part, I want to either print the whole string as an error or print the good part followed by the bad part with an error message.
p = re.compile(r'(\d+)(\.\d+)?(([E][+-])?\d+)?')
def is_float(str):
    m = p.match(str)
    if m:
        print (m.groups())
        return True
I have provided the piece of code I am working with perhaps there is an error there
Some sample inputs are:
3#33 //should print 3 then an error with #33 printed
3.435E-10 // should print the whole thing
0.45654 //should print the whole thing
4E-2 //should print the whole thing
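m.groups() returns a tuple, not a string. m.group(0) is the entire match, m.group(1) is the first set of capturing brackets in your regex, and so forth. Because the % operator treats a tuple as a list of separate arguments, a single %s with a multi-element tuple raises the TypeError you are seeing.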
Try:
print(m.groups())
To see the different values at play
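As a small illustration of the difference between group() and groups(), and of one way to format the tuple without the TypeError, consider the sketch below. The pattern and input are simplified stand-ins, not the asker's full regex.

```python
import re

m = re.match(r'(\d+)(\.\d+)?', "3.14abc")

whole = m.group(0)    # the entire matched text
first = m.group(1)    # the first capturing group
groups = m.groups()   # a tuple with one entry per capturing group

# Wrapping the tuple in a one-element tuple makes % see a single argument.
message = "ERROR: %s" % (groups,)
```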
First thing
You're missing a closing bracket. It should be:
(\d+(\.\d+)?([E][+|-]?\d+)?)
Notice the one at the end after the final ?
I then tested it here:
https://regex101.com/r/jF1jX2/1 and it worked.
I have to say, I'd not bother with a regex at all. Given a string that is supposed to represent a float, I'd do
def is_float(str):
    try:
        f = float(str)
        return True
    except ValueError:
        return False
(BTW if the next step was going to be to convert an acceptable str to float,
just put the try .. except inline, use f if no exception is thrown and do whatever is appropriate when the exception is caught )
Also there's a mistake in your regex, in that it doesn't handle a leading "-" for a negative number (or a "+" for a positive one). Try ... except handles anything that you can throw at Python, using Python's rules.
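If the goal is to report the part that parsed and the part that did not, one possible sketch is to match at the start of the string and slice off the remainder. It assumes the unsigned format from the question (digits, optional fraction, optional E exponent); the name split_float is hypothetical.

```python
import re

# Unsigned float as specified in the question: d+(.d+)?(E(+|-)?d+)?
FLOAT_RE = re.compile(r'\d+(\.\d+)?(E[+-]?\d+)?')


def split_float(text):
    """Return (good_prefix, bad_remainder) for text."""
    m = FLOAT_RE.match(text)
    if m is None:
        return "", text
    return m.group(0), text[m.end():]


for sample in ["3#33", "3.435E-10", "0.45654", "4E-2"]:
    good, bad = split_float(sample)
    if bad:
        print("%s ERROR: %s" % (good, bad))
    else:
        print(good)
```

An empty remainder means the whole string conformed; anything else is the offending tail.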

Is it possible to suppress Python's escape sequence processing on a given string without using the raw specifier?

Conclusion: It's impossible to override or disable Python's built-in escape sequence processing, such that, you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone tries designing objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"
    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed. thus resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact thus that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code; there is no such thing at runtime, i.e., r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent the input string as Python literals. If you find it confusing to work with binary strings such as "\x20\x01" then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
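A short sketch of both points: raw and escaped literals produce equal strings, and binascii can round-trip binary data through a plain hex representation. It is written for Python 3, where binary data is a bytes object, whereas the question uses Python 2 str.

```python
import binascii

# The r prefix only changes how the source text is read, not the object.
raw_literal = r"\x20\x01"
escaped_literal = "\\x20\\x01"
same_string = raw_literal == escaped_literal

# Binary data to and from an ASCII hex representation.
packed = b"\x20\x01\x0d\xb8"
hex_text = binascii.hexlify(packed).decode("ascii")
round_trip = binascii.unhexlify(hex_text)
```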
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
Regex engine interprets the string object as a regex pattern that also has its own escape sequences
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.

Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.
I'm not going to post all my code since it may cause confusion.
Brief explanation:
I'm writing a XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.
I open the XML file and use a simple 'for' loop to read line-by-line through the file. If I match a regular expression for an 'application' (XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform a lxml.etree.xpath() query on the title and store it as the value.
After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title.
Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600) in all. I can't figure out why my string comparisons aren't working.
Without further ado, here's the relevant code:
for i in d:
    if i[1:-2] != d[i].get('id'):
        print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there wasn't any kind of whitespace issues.
Here is an example line of output:
X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY
Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.
Edit:
So the problem seems to be the way that I am slicing.
In Python3,
if i[1:-2] != d[i].get('id'):
this comparison works fine.
In Python2,
if i[1:-3] != d[i].get('id'):
I have to change the offset by one.
Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').
Edit 2:
Updated with requested repr() information.
I added a small amount of code to produce the repr() info from the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character.
Python2:
'"9626-2008olympics_Prod-SH"\r\n'
'9626-2008olympics_Prod-SH'
Python3:
'"9626-2008olympics_Prod-SH"\n'
'9626-2008olympics_Prod-SH'
Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?
You are printing i[1:-3] but comparing i[1:-2] in the loop.
Very Important Question
Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!
Russell Borogrove is right.
Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.
In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].
I just added a manual check for the Python major version.
Here is the fixed code:
for i in d:
    # The keys in D contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 char.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))
Thanks for the responses everyone! I appreciate your help.
repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.
Instead of
print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
do
print('%r %r' % (i, d[i].get('id')))
Note leaving off the [1:-3] so that you can see what is in i before you slice it.
Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":
How are you opening the file (two answers please, for Python 2 and 3). Are you running on Windows? Have you tried getting the repr() as I suggested?
Update after actual input finally provided by OP:
If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *nix text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python2. Universal newlines mode is the default in Python3. Note that the Python3 docs say that you can use 'rU' also in Python3; this would save you having to test which Python version you are using.
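As a sketch of universal newlines that does not depend on the 'U' mode flag (which was removed from open() in Python 3.11), io.open with newline=None behaves the same way on Python 2.6+ and Python 3. The temporary file and the name read_lines are only there to make the example self-contained.

```python
import io
import os
import tempfile


def read_lines(path):
    # newline=None turns on universal newlines: "\r\n" and "\r" become "\n".
    with io.open(path, "r", newline=None, encoding="utf-8") as handle:
        return handle.readlines()


# Write a Windows-style line in binary mode so the "\r\n" is preserved on disk.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b'"9626-2008olympics_Prod-SH"\r\n')

lines = read_lines(demo_path)
os.remove(demo_path)
```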
I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?
for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
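Building on the strip() idea, a possible helper that removes the line ending and the surrounding quotes without any fixed slice offsets might look like this. The name normalize_key is hypothetical; the sample values are the ones from the repr() output in the question.

```python
def normalize_key(line):
    # Drop a trailing "\n" or "\r\n", then the surrounding double quotes.
    return line.rstrip("\r\n").strip('"')


windows_key = '"9626-2008olympics_Prod-SH"\r\n'
unix_key = '"9626-2008olympics_Prod-SH"\n'
```

Because nothing depends on how many characters the line ending occupies, the same comparison works on both Python versions.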

Verify CSV against given format

I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits a specified format so that the program that later uses the file doesn't receive unexpected input and so that there are no security concerns (say, some injection attack against the parsing script that does some calculations and a db insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the python csv module but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me to towards the best way to do this? Do I match for illegal characters and reject on that? (eg. no '/' '\' '<' '>' '{' '}' etc.) or match on all legal eg. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes it would just contain a name (ie. first name, last name). And yes I forgot to add they would be double quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting but is a standalone. Decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def check_csv(path):
    dataChecker = csv.reader(open(path))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return
        # int() and float() raise ValueError on malformed input
        my_int = int(row[2])
        my_float = float(row[3])
        if math.isnan(my_int):
            print 'Bad int found'
            return
        if math.isnan(my_float):
            print 'Bad float found'
            return
    print 'All good!'

check_csv('data.csv')
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
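One way to turn the try/except idea into something that reports every bad row, rather than stopping at the first, is sketched below for Python 3. The function name validate_rows and the message wording are assumptions; note that the csv module discards quoting, so this does not check that the first two fields were actually double quoted.

```python
import csv
import io
import math


def validate_rows(text):
    """Return a list of (line_number, problem) tuples; empty means valid."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != 4:
            problems.append((line_no, "expected 4 fields, got %d" % len(row)))
            continue
        try:
            int(row[2])
            number = float(row[3])
        except ValueError as exc:
            problems.append((line_no, str(exc)))
            continue
        # float() accepts "nan" and "inf", so reject those explicitly.
        if not math.isfinite(number):
            problems.append((line_no, "non-finite float"))
    return problems
```

The collected line numbers can then be shown to the user who uploaded the file.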
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
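The whitelist pattern above could be applied to a raw, still-quoted field roughly as follows. The 1 to 40 character limit and the names NAME_FIELD and is_valid_name_field are assumptions chosen for illustration.

```python
import re

# Letters, apostrophes, spaces and dashes only, inside double quotes.
NAME_FIELD = re.compile("\"[a-zA-Z' -]{1,40}\"")


def is_valid_name_field(field):
    # fullmatch() requires the whole field to fit, not just a prefix.
    return NAME_FIELD.fullmatch(field) is not None
```

Using fullmatch() rather than match() or search() matters here: otherwise a valid-looking prefix followed by unexpected characters would still be accepted.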
