Why does regex findall return a weird \x00?

I use a regex to build a list of all key-value pairs present on a line (string).
My key-value syntax matches the following regex:
MyRegex = re.compile(r"\((.*?),(.*?)\)")
Typically I have to parse a string like:
(hex,0x123456)
If I use the interpreter, it works:
>>> str = "(hex,0x123456)"
>>> KeyPair = re.findall(MyRegex, str)
>>> KeyPair
[('hex', '0x123456')]
But when I use that code under Linux to parse command-line output, I get:
[('hex', '0x123456\x00')]
It comes from the following code:
KeyPayList = []
# some code ....
process = subprocess.Popen(self.cmd_line, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False, stdin=subprocess.PIPE)
# here we parse the output
for line in process.stdout:
    if line.startswith(lineStartWith):
        KeyPair = re.findall(MyRegex, line.strip())
        KeyPayList.append(KeyPair)
Do you know why I get that strange \x00 in the second group I captured?
Note that I already tried stripping the string before calling findall.

That's a null byte, and it is present in your original string. You may not have seen it, as your terminal will ignore it when you print the string:
>>> s = "(hex,0x123456\x00)"
>>> print s
(hex,0x123456)
The Python repr() function used for container contents (such as the contents of the tuple you are printing here) does show it:
>>> print repr(s)
'(hex,0x123456\x00)'
Your regular expression is simply returning that null byte because it is present in your original string:
>>> import re
>>> s = "(hex,0x123456\x00)"
>>> yourpattern = re.compile("\((.*?),(.*?)\)")
>>> yourpattern.search(s).groups()
('hex', '0x123456\x00')
If you remove it, the regular expression engine does not return it either:
>>> yourpattern.search(s.replace('\x00', '')).groups()
('hex', '0x123456')

It's simply that, in your case, the strings yielded by the process.stdout iterator contain null bytes.
Without a specific list of characters to remove, strip deletes whitespace characters from both ends of the string. That means tab, linefeed, vertical tab, form feed, carriage return, and space.
Many of those aren't relevant to most applications, but if you want to remove null characters then you must say so explicitly. For instance, if you wanted to remove tabs, spaces, and nulls, then you would write
line.strip('\x00\x09\x20')
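One thing worth checking before relying on strip: it only removes characters from the two ends of the string, while the null in the question sits just before the closing parenthesis. The following is a minimal Python 3 sketch, not the asker's actual code; `PAIR` and `parse_pairs` are hypothetical names, and the sample line is made up to mirror the output shown in the question.

```python
import re

# Same pattern as in the question, written as a raw string.
PAIR = re.compile(r"\((.*?),(.*?)\)")

def parse_pairs(line):
    # replace() drops every null character, wherever it sits;
    # strip() then trims ordinary whitespace from both ends.
    cleaned = line.replace("\x00", "").strip()
    return PAIR.findall(cleaned)

raw = "(hex,0x123456\x00)\n"

# strip() with an explicit character set only touches the ends,
# so the null before ')' survives.
print(repr(raw.strip("\x00\x09\x20\n")))  # '(hex,0x123456\x00)'
print(parse_pairs(raw))                   # [('hex', '0x123456')]
```

In Python 3, process.stdout yields bytes unless the process is opened in text mode, so the same idea would use b"\x00" and a bytes pattern there.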

Related

Read regexes from file and avoid or undo escaping

I want to read regular expressions from a file, where each line contains a regex:
lorem.*
dolor\S*
The following code is supposed to read each and append it to a list of regex strings:
vocabulary = []
with open(path, "r") as vocabularyFile:
    for term in vocabularyFile:
        term = term.rstrip()
        vocabulary.append(term)
This code seems to escape the \ special character in the file as \\. How can I either avoid escaping or unescape the string so that it can be worked with as if I wrote this?
regex = r"dolor\S*"

You are getting confused by echoing the value. The Python interpreter echoes values by printing the repr() function result, and this makes sure to escape any meta characters:
>>> regex = r"dolor\S*"
>>> regex
'dolor\\S*'
regex is still an 8 character string, not 9, and the single character at index 5 is a single backslash:
>>> regex[4]
'r'
>>> regex[5]
'\\'
>>> regex[6]
'S'
Printing the string writes out all characters verbatim, so no escaping takes place:
>>> print(regex)
dolor\S*
The same process is applied to the contents of containers, like a list or a dict:
>>> container = [regex, 'foo\nbar']
>>> print(container)
['dolor\\S*', 'foo\nbar']
Note that I didn't echo there, I printed. str(list_object) produces the same output as repr(list_object) here.
If you were to print individual elements from the list, you get the same unescaped result again:
>>> print(container[0])
dolor\S*
>>> print(container[1])
foo
bar
Note how the \n in the second element was written out as a newline now. It is for that reason that containers use repr() for contents; to make otherwise hard-to-detect or non-printable data visible.
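In other words, your strings do not contain escaped backslashes here; each one holds a single backslash, exactly as if you had written the raw string literal yourself.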
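To convince yourself, you can read the patterns back from a file and compare them with the raw string literal. This is a small Python 3 sketch under stated assumptions: the temporary file stands in for the real vocabulary file, and the variable names are illustrative.

```python
import os
import re
import tempfile

# Write the two example patterns to a temporary file
# (a stand-in for the real vocabulary file).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("lorem.*\n" + "dolor\\S*\n")
    path = tmp.name

vocabulary = []
with open(path, "r") as vocabulary_file:
    for term in vocabulary_file:
        vocabulary.append(term.rstrip("\n"))
os.remove(path)

# The strings compile as regexes directly; no unescaping is needed.
patterns = [re.compile(term) for term in vocabulary]

print(vocabulary[1])                  # dolor\S*
print(len(vocabulary[1]))             # 8
print(vocabulary[1] == r"dolor\S*")   # True
print(patterns[1].match("dolores").group(0))  # dolores
```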

python find if newline is in string

I am trying to check whether a "\n" character is in a string using this:
if "\n" in errors.text:
This works fine for a string like "one\ntwo" but when the newline is at the end of the string like "one\n", it doesn't seem to work. I am using selenium to get this string from a website. Is it possible that it is not catching the newline at the end and simply not including it?
Or could this be the problem?
fixedText = errors.text.split("\n")[0]
I want the fixed text to remove all newlines and only get the first line of text. It works except for the case discussed above

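If you want the fixed text to only be the first line in a string, you can do this:
if errors.text: # skips empty strings
    fixedText = errors.text.split("\n")[0]
This is because split() is reasonably robust (the examples below call split() with no argument, which splits on any run of whitespace, newlines included):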
>>> 'a'.split()[0]
'a'
>>> 'a\n'.split()[0]
'a'
>>> 'a\n1'.split()[0]
'a'
>>> ''.split()
[]
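That last example demonstrates why we check for an empty string before trying to index the resulting list: ''.split() returns an empty list. With an explicit separator, ''.split("\n") returns [''] instead, so indexing it is safe.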
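The differences between the split variants are easy to see side by side. Here is a short Python 3 sketch; `first_line` is a hypothetical helper, not something from Selenium or the question, and it uses splitlines() as one possible way to get the first line.

```python
def first_line(text):
    # splitlines() returns [] for an empty string, so fall back to "".
    lines = text.splitlines()
    return lines[0] if lines else ""

print("\n" in "one\n")            # True: a trailing newline is still found
print("one\n".split("\n"))        # ['one', '']
print("".split("\n"))             # ['']
print("".split())                 # []
print(repr(first_line("one\n")))  # 'one'
```

If the membership test fails on a string that should end with a newline, it is worth printing repr() of the text to check whether the newline is actually there.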

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)

I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this could cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
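For contrast, here is a rough sketch of a locally determinable rule, written with the standard re module rather than pyparsing. It assumes the Python-like convention from the question's update, where a backslash always escapes the next character, so a literal trailing backslash must be written as two. `QUOTED` and `find_quoted` are names made up for this illustration.

```python
import re

# Hypothetical pattern: a backslash always escapes the next character,
# so each quote can be classified without looking further ahead.
QUOTED = re.compile(r"'(?:[^'\\\n\r]|\\.)*'")

def find_quoted(text):
    m = QUOTED.search(text)
    return m.group(0) if m else None

print(find_quoted(r"foo = 'ab\'cd\\'"))  # 'ab\'cd\\'
print(find_quoted(r"'ab\'xxxxxxx"))      # None: the string is never closed
print(find_quoted(r"'ab\'xxxxxxx'"))     # 'ab\'xxxxxxx'
```

With this rule the unterminated input is rejected instead of being silently cut at the escaped quote, which is the behavior the paragraph above argues for.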

What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being reduced to a single "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
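The substitution in that workaround can be tried on its own, without pyparsing. In this Python 3 sketch, `unescape` is a hypothetical helper, and it is applied to the raw text between the quotes rather than to the token pyparsing returns, so it is only an illustration of the regex.

```python
import re

def unescape(body):
    # Replace each backslash-plus-character pair with the character alone.
    return re.sub(r"\\(.)", r"\g<1>", body)

body = r"ab\'cd\\"   # text between the quotes in demo.txt
print(unescape(body))              # ab'cd\
print(unescape(body).count("\\"))  # 1
```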

PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" is a syntax error, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.

Multiple Quotes in String

In Python, how would I write the string '"['BOS']"'?
I tried entering "\"['BOS']\"" but this gives the output '"[\'BOS\']"' with added backslashes in front of the '.

You can use triple quotes:
'''"['BOS']"'''
What you did ("\"['BOS']\"") is fine too. You get the backslashes on output, but they aren't part of the string:
>>> a = "\"['BOS']\""
>>> a
'"[\'BOS\']"' # this is the representation of the string
>>> print a
"['BOS']" # this is the actual content
When you type an expression such as a into the console, it's the same as writing print repr(a). repr(a) returns a string that can be used to reconstruct the original value, hence the quotes around the string and the backslashes.
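A quick way to confirm that the backslashes belong only to the representation is to compare both spellings and count characters. A small Python 3 sketch:

```python
a = "\"['BOS']\""
b = '''"['BOS']"'''

print(a)          # "['BOS']"
print(repr(a))    # '"[\'BOS\']"'
print(a == b)     # True: both literals build the same string
print(len(a))     # 9
print("\\" in a)  # False: no backslash is stored in the string
```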

You should use triple quotes so that you don't need to use backslashes.
'''"['BOS']"'''
The backslashes in your output appear because the Python console displays the repr() of the string:
>>> s = '''"['BOS']"'''
>>> s
'"[\'BOS\']"'
>>>

Enclose the entire string with """ or ''' (you would use ''' if the outermost quotation marks were ") in cases like these to make things simpler.
"""'"['BOS']"'"""

You can build it dynamically as well:
>>> print('"{}"'.format("['BOS']"))
"['BOS']"
>>> print('"' + "['BOS']" + '"')
"['BOS']"

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n, the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.

If I understood you correctly, all you need is the text without a newline at the end of each line, so that you can iterate over it to find a required word. In that case you can try the following:
data = (line for line in text.split('\n') if line.strip())  # yields all non-empty lines, without the '\n' at the end
Now you can search or replace any text you need using list slicing or regex functionality.
Or you can use replace to change every '\n' to whatever you want:
text.replace('\n', '')

My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
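The same comparison as a runnable Python 3 sketch. It uses \Z, which matches only at the very end of the string, as a slightly stricter stand-in for the $ used above; the names `strict` and `lenient` are illustrative.

```python
import re

# Same sample content as above: note there is no trailing newline.
content = "TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2"

strict = re.compile(r"TOTAL:.*?C2\n", re.DOTALL)
lenient = re.compile(r"TOTAL:.*?C2(?:\n|\Z)", re.DOTALL)

print(repr(strict.sub("XXX", content)))   # 'XXXabcXXXdefTOTAL:C2'
print(repr(lenient.sub("XXX", content)))  # 'XXXabcXXXdefXXX'
```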

I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to remove the \n at the end of the regex, unless you need it for something.

Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to use the MULTILINE flag instead of the DOTALL flag. With MULTILINE, the ^ sign matches at the beginning of the string and just after each newline, and the $ sign matches just before each newline and at the end of the string.
e.g.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
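Put together, the two passes might look like the following Python 3 sketch. The sample content and the names `drop_totals` and `squeeze` are made up for illustration; the patterns are the ones shown above.

```python
import re

# Made-up sample: two TOTAL lines, one blank line, no trailing newline.
content = "keep 1\nTOTAL: 5\n\nkeep 2\nTOTAL: 7"

drop_totals = re.compile(r"^TOTAL:.*$", re.MULTILINE)
squeeze = re.compile(r"\n{2,}")

# Pass 1: blank out every line that starts with TOTAL:
without_totals = drop_totals.sub("", content)
# Pass 2: collapse the runs of newlines left behind.
cleaned = squeeze.sub("\n", without_totals)

print(repr(without_totals))  # 'keep 1\n\n\nkeep 2\n'
print(repr(cleaned))         # 'keep 1\nkeep 2\n'
```

Because the last line is matched by $ at the end of the string, it is removed even though the file does not end with a newline.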
