Escaping quotes when isolating strings from input - python

I'm trying to parse a file in which quotation files are used to encapsulate strings. For instance, the file might contain a line like this:
"\"Hello there, my friends,\" the tour guide says." me # swap notify
But it might also contain lines like this:
"I'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify
In that example, the quotes shouldn't be escaped, but a single backslash should remain.
Is there any function I can use to extract that full quoted statement? \n for newline and \r for carriage return also show up on occasion, so I'd like to get those two, but only after I have the full string isolated.

Parse out the string part. You could use a regular expression or string partition
ast.literal_eval the string and assign it to a variable.
Test:
>>> import re
>>> import ast
>>> with open('test.txt.') as f:
... for line in f:
... m = re.match('(.*) \w+ # \w+ \w+', line)
... print ast.literal_eval(m.group(1))
...
"Hello there, my friends," the tour guide says.
I'm a dingus who wants to put a backslash at the end of my statements. \
The regex says "Match anything and store it as group 1, up to a space, a word, a space, #-sign, space and a word". You then retreive the group with the .group(1) syntax. The parenthesis define a group, see regex documentation.
Here's a version that tries to parse the string as greedily as possible, by failing and retrying until a match is found, or no match can be made:
import re
import ast
def match_line(line):
while line:
print "Trying to match:", line
try:
return ast.literal_eval(line)
except SyntaxError, e:
line = line[:e.offset - 1]
except ValueError: # No way it would ever match
break
return None
with open('test.txt.') as f:
for line in f:
match = match_line(line.strip())
print "Matched:", match
print

You could use regex. It's usually not recommended for parsing though, because unless you have fairly simple inputs or inputs that follow strict rules, it's easy to make mistakes.
There is probably some sort of parsing module that handles this better (for example the csv module is fantastic for quote marks in fields & escaping, if you have a csv).
txt1 = r'"\"Hello there, my friends,\" the tour guide says." me # swap notify.'
txt2 = '"I' + "'" + r'm a dingus who wants to put a backslash at the end of my statements. \\" me # swap notify'
import re
print re.findall(r'"(?:[^"\\]|\\.)+"',txt1)[0]
# "\"Hello there, my friends,\" the tour guide says."
print re.findall(r'"(?:[^"\\]|\\.)+"',txt2)[0]
# "I'm a dingus who wants to put a backslash at the end of my statements. \\"
Note I used the r'xxxxx' syntax to avoid having to further escape my backslashes for python (they're already escaped for the regex).
The regex "([^"\\]|\\.)+" says "match anything that's not a " or a backslash, OR match a backslash and whatever is immediately following it."

Related

python re.sub newline multiline dotall

I have this CSV with the next lines written on it (please note the newline /n):
"<a>https://google.com</a>",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Dirección
I am trying to delete all that commas and putting the address one row up. Thus, on Python I am using this:
with open('Reutput.csv') as e, open('Put.csv', 'w') as ee:
text = e.read()
text = str(text)
re.compile('<a/>*D', re.MULTILINE|re.DOTALL)
replace = re.sub('<a/>*D','<a/>",D',text) #arreglar comas entre campos
replace = str(replace)
ee.write(replace)
f.close()
As far as I know, re.multiline and re.dotall are necessary to fulfill /n needs. I am using re.compile because it is the only way I know to add them, but obviously compiling it is not needed here.
How could I finish with this text?
"<a>https://google.com</a>",Dirección
You don't need the compile statement at all, because you aren't using it. You can put either the compiled pattern or the raw pattern in the re.sub function. You also don't need the MULTILINE flag, which has to do with the interpretation of the ^ and $ metacharacters, which you don't use.
The heart of the problem is that you are compiling the flag into a regular expression pattern, but since you aren't using the compiled pattern in your substitute command, it isn't getting recognized.
One more thing. re.sub returns a string, so replace = str(replace) is unnecessary.
Here's what worked for me:
import re
with open('Reutput.csv') as e:
text = e.read()
text = str(text)
s = re.compile('</a>".*D',re.DOTALL)
replace = re.sub(s, '</a>"D',text) #arreglar comas entre campos
print(replace)
If you just call re.sub without compiling, you need to call it like
re.sub('</a>".*D', '</a>"D', text, flags=re.DOTALL)
I don't know exactly what your application is, of course, but if all you want to do is to delete all the commas and newlines, it might be clearer to write
replace = ''.join((c for c in text if c not in ',\n'))
When you use re.compile you need to save the returned Regular Expression object and then call sub on that. You also need to have a .* to match any character instead of matching close html tags. The re.MULTILINE flag is only for the begin and end string symbols (^ and $) so you do not need it in this case.
regex = re.compile('</a>.*D',re.DOTALL)
replace = regex.sub('</a>",D',text)
That should work. You don't need to convert replace to a string since it is already a string.
Alternative you can write a regular expression that doesn't use .
replace = re.sub('"(,|\n)*D','",D',text)
This worked for me using re.sub with multiline texte
#!/usr/bin/env python3
import re
output = open("newFile.txt","w")
input = open("myfile.txt")
file = input.read()
input.close()
text = input.read()
replace = re.sub("value1\n\s +nickname", "value\n\s +name", text, flags=re.DOTALL)
output.write(replace)
output.close()

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backstracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\" is parsed but stays as "\" instead of being an escaped "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.

Python Regex to match YAML Front Matter

I'm having trouble crafting a regex to match YAML Front Matter
This is the front matter I was trying to match:
---
name: me
title: test
cpu: 1
---
This is what I thought would work:
re.search( r'^(---)(.*)(---)$', content, re.MULTILINE)
Any help would be greatly appreciated.
To unpack what you are currently doing with this regular expression:
r'^(---)(.*)(---)$':
r: Treat this as a string literal in Python
^: Start the evaluation at the beginning of a line
(---): Parse --- into an anonymous capture group
(.*): Parse all characters (.) non-greedily (*) until the next expression
(---): As above
$: End at the evaluation of the end of a line
The trouble is this will fail when whitespace is present. You're literally saying: find dashes that occur at the beginning of a line and parse until we find dashes that occur at the end of one. Furthermore, you're creating groups that I believe are not necessary to the useful evaluation of your regular expression, by using parentheses () around the dashes used to find YAML front matter.
A better expression would be:
r'^\s*---(.*)---\s*$'
Which adds the repeating group \s* to capture whitespace characters between the beginning of the first line up to the dashes, adds this again between the second group of dashes to the end of that line, and captures everything between into a single anonymous capture group that you can then use for additional processing. If extracting the contents of the front matter isn't desired, simply replace (.*) with .*.
Consider re.findall for multiple evaluations of this regular expression in a single file, and as mentioned, use re.DOTALL to allow the dot character to match new lines.
I've used something like this regex, re.findall('^---[\s\S]+?---', text):
def extractFrontMatter(markdown):
md = open(markdown, 'r')
text = md.read()
md.close()
# Returns first yaml content, `--- yaml frontmatter ---` from the .md file
# http://regexr.com/3f5la
# https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match
match = re.findall('^---[\s\S]+?---', text)
if match:
# Strips `---` to create a valid yaml object
ymd = match[0].replace('---', '')
try:
return yaml.load(ymd)
except yaml.YAMLError as exc:
print exc
I've also come across python-frontmatter, which has some additional helper functions:
import frontmatter
post = frontmatter.load('/path/to-markdown.md')
print post.metadata, 'meta'
print post.keys(), 'keys'

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If i correctly undestood you and all that you need is to get a text without newline at the end of the each line and then iterate over this text in order to find a required word than you can try to use the following:
data = (line for line in text.split('\n') if line.strip())# gives you all non empty lines without '\n'at the end
Now you can either search/replace any text you need using list slicing or regex functionality.
Or you can use replace in order to replace all '\n' to whenever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to just trim() the regex removing the \n at the end unless you need it for something.
Is the question mark to prevent the regex matching more than one iine at a time? If so then you probably want to be using the MULTILINE flag instead of DOTALL flag. The ^ sign will now match just after a new line or the beginning of a string and the $ sign will now match just before a newline character or the end of a string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines.
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)

regex not matching

I am write a small python script to gather some data from a database, the only problem is when I export data as XML from mysql it includes a \b character in the XML file. I wrote code to remove it, but then realized I didn't need to do that processing everytime, so I put it in a method and am calling it I find a \b in the XML file, only now the regex isnt matching, even though I know the \b is there.
here is what I am doing:
Main program:
'''Program should start here'''
#test the file to see if processing is needed before parsing
for line in xml_file:
p = re.compile("\b")
if(p.match(line)):
print p.match(line)
processing = True
break #only one match needed
if(processing):
print "preprocess"
preprocess(xml_file)
Preprocessing method:
def preprocess(file):
#exporting from MySQL query browser adds a weird
#character to the result set, remove it
#so the XML parser can read the data
print "in preprocess"
lines = []
for line in xml_file:
lines.append(re.sub("\b", "", line))
#go to the beginning of the file
xml_file.seek(0);
#overwrite with correct data
for line in lines:
xml_file.write(line);
xml_file.truncate()
Any help would be great,
Thanks
\b is a flag for the regular expression engine:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
So you will need to escape it to find it with a regex.
Escape it with backslash in regex. Since backslash in Python needs to be escaped as well (unless you use raw strings which you don't want to), you need a total of 3 backslashes:
p = re.compile("\\\b")
This will produce a pattern matching the \b character.
Correct me if i wrong but there is no need to use regEx in order to replace '\b', you can simply use replace method for this purpose:
def preprocess(file):
#exporting from MySQL query browser adds a weird
#character to the result set, remove it
#so the XML parser can read the data
print "in preprocess"
lines = map(lambda line: line.replace("\b", ""), xml_file)
#go to the beginning of the file
xml_file.seek(0)
#overwrite with correct data
for line in lines:
xml_file.write(line)
# OR: xml_file.writelines(lines)
xml_file.truncate()
Note that there is no need in python to use ';' at the end of string

Categories

Resources