splitlines of quote splits '\n' in sub-quote - python

I have a single-quoted string that contains a double-quoted substring with a '\n' in it.
If I call splitlines() on the outer string, the inner quoted part is split too.
double_quote_in_simple_quote = 'v\n"x\ny"\nz'
print(double_quote_in_simple_quote.splitlines())
Resulting output
['v', '"x', 'y"', 'z']
I would have expected the following:
['v', '"x\ny"', 'z']
Because the '\n' is in the scope of the sub-quote.
I was hoping to get an explanation why it behaves as such and if you have any alternative to 'splitlines' at the level of the main quote only?
Thank you

The split function doesn't care about additional levels of quoting; it simply splits on every occurrence of the character you split on. (There isn't really a concept of nested quoting; a string is a string, and may or may not contain literal quotes, which are treated the same as any other character.)
If you want to implement quoting inside of strings, you have to do it yourself.
Perhaps use a regular expression;
import re
tokens = re.findall(r'"[^"]*"|[^"]*', double_quote_in_simple_quote)
splitresult = [
x if x.startswith('"') else x.split('\n')
for x in tokens]
Demo: https://ideone.com/lAgJTb
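The snippet above produces a nested list that also contains some empty strings. As a minimal sketch of one way to get the flat result the question asks for, a single pattern can treat each quoted run as part of a line. This assumes the double quotes are always balanced and that empty lines can be dropped; split_outside_quotes is a made-up helper name, not a standard function.

```python
import re

def split_outside_quotes(text):
    """Split text on newlines that are not inside a double-quoted run.

    Assumption: double quotes are balanced; empty lines are dropped.
    """
    # A "line" is one or more of: a complete "..." run (which may contain
    # newlines), or any single character that is neither a quote nor a newline.
    return re.findall(r'(?:"[^"]*"|[^"\n])+', text)

double_quote_in_simple_quote = 'v\n"x\ny"\nz'
result = split_outside_quotes(double_quote_in_simple_quote)
print(result)  # ['v', '"x\ny"', 'z']
```

If empty lines or unbalanced quotes matter for your data, a small hand-written scanner is probably a safer choice than a regular expression.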

This follows from how escape sequences work in Python.
In a string literal, \n denotes a single newline character. The splitlines() method splits the string at every line break, wherever it occurs, and drops the line breaks from the result. That's why you get a list without newline characters.
Note that keepends=True does not prevent the split; it only keeps the line break at the end of each item:
print(double_quote_in_simple_quote.splitlines(keepends=True))
>>> ['v\n', '"x\n', 'y"\n', 'z']
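To check what splitlines() actually does with the string from the question, here is a small sketch; the variable names are chosen only for illustration.

```python
s = 'v\n"x\ny"\nz'
newline = '\n'

# The escape sequence \n is one character, not a backslash plus an n.
print(len(newline))  # 1

# splitlines() splits at every line break, quoted or not.
plain = s.splitlines()
print(plain)  # ['v', '"x', 'y"', 'z']

# keepends=True still splits; it only keeps the line breaks in the items.
kept = s.splitlines(keepends=True)
print(kept)  # ['v\n', '"x\n', 'y"\n', 'z']
```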

I came up with a hacky workaround that may tide you over while you look for a cleaner way to split the string without splitting inside the quotes.
double_quote_in_simple_quote = '"x\ny"'
double_quote_in_simple_quote = double_quote_in_simple_quote.replace("\n", "$n")
splitted_quote = double_quote_in_simple_quote.splitlines()
print(splitted_quote)
splitted_quote_decoded = [quote.replace('$n', '\n') for quote in splitted_quote]
print(splitted_quote_decoded)
The idea is to replace the \n with a placeholder that does not otherwise occur in the data, and then reverse the substitution afterwards. I used your example, but I'm sure you will be able to tune it to fit your needs. My output was:
['"x$ny"']
['"x\ny"']

If you put double quotes inside a single-quoted string in Python, that doesn't create nested strings. The outermost quotes determine where the string object starts and ends; any quote characters inside are treated as ordinary characters.
>>> print('dog')
dog
>>> print('"dog"')
"dog"
Note how in the second line, the quotes are also printed, because those actual quote-characters are a part of the string. No nesting happening.
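A quick way to convince yourself of this is to count characters. The sketch below only uses len() and indexing; the variable name inner is just for illustration.

```python
inner = '"dog"'

print(len('dog'))  # 3
print(len(inner))  # 5: the two double quotes are part of the string
print(inner[0])    # "
```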

Related

Why use an escape sequence instead of a different quote type?

Why would we want to use escape sequence characters like for example in this Python code:
print('It\'s alright.')
Why are we using this backslash to print a single quote when we can accomplish the same by using:
print("it's alright")
This is useful because you can do:
txt = 'in python you can have \'string\' or "string"'
print(txt)
No matter how many different kinds of quote you have, you may still need an escape mechanism now and then. Consider this:
If you want to use Python's "multiline string literal" you have to begin it and end it with a triple quote, which can be either """ or '''.
To put that into a string literal you are going to have to quote ' or ":
a = 'If you want to use Python\'s "multiline string literal" you have to begin it and end it with a triple quote, which can be either """ or \'\'\'.'
a = "If you want to use Python's \"multiline string literal\" you have to begin it and end it with a triple quote, which can be either \"\"\" or '''."
a = """If you want to use Python's "multiline string literal" you have to begin it and end it with a triple quote, which can be either ""\" or '''."""
Having different quote types is a great programming convenience, making it easier and less error prone to put quotes and apostrophes in the data without having to jump through hoops. But it can't cover every case. If you need to convince yourself of this, experiment with those three lines at a command prompt and see if you can come up with a way to write the text as a single literal without backslashes. You will find you always need at least one.
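As a shortened sketch of the same point, the three literals below use a different delimiter each, need at least one backslash each, and all produce the same string. The text is cut down from the example above to keep the lines readable.

```python
# The same text written with three different delimiters.
a = 'either """ or \'\'\''
b = "either \"\"\" or '''"
c = """either ""\" or '''"""

print(a)
print(a == b == c)  # True
```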
Without further context, I can only take a guess and say that the person who wrote the first example, didn't know or wasn't aware of the fact that it's possible to use double-quotes "" for string literals in Python.
That's just a matter of style. Some people like to use single quotes to create string literals, and therefore they have to escape any single quotes that appear inside their strings (the same goes for double quotes). The following will raise a SyntaxError:
s = 'It's gonna be alright!'
s = "They used to call me "Big" but I was 4ft!"
So you may ask why they don't use " when their string has single quotes and ' when their string has double quotes? Yes, they can, but there are some unavoidable situations, such as a regex that needs both:
regexp = r"["']\w+["']"
Note that neither single nor double quotes alone can delimit this string, since both appear in the regex, so the line above is a SyntaxError as written. One of the two quote characters has to be escaped.
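Here is a sketch of two ways to write that pattern so that it compiles: escaping the double quotes inside a raw string, or switching to a triple-quoted raw string. The sample text is made up for the demonstration.

```python
import re

# Option 1: escape the double quotes. In a raw string the backslash stays in
# the pattern, and the regex engine reads \" as a literal double quote.
escaped = r"[\"']\w+[\"']"

# Option 2: a triple-quoted raw string needs no escaping here.
triple = r'''["']\w+["']'''

text = 'say "hi" or \'yo\''
found_escaped = re.findall(escaped, text)
found_triple = re.findall(triple, text)
print(found_escaped)  # ['"hi"', "'yo'"]
```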
In this case it's not needed, because you have used " " for the print statement.
case 1) use: print(" It's alright.")
case 2) use: print(' It\'s alright.')
Note the parentheses used for the print statements.
You can't use ' directly in case 2, because Python would think the string ends there, causing a SyntaxError.
In the code
txt = 'It\'s alright.'
you need the backslash(\) so python understands that the second apostrophe is a character of the string. Without the backslash, Python would interpret it as the character used to mark the end of the string.
When you use a ' at the start, python looks for a matching ' and considers whatever is present in between these quotes as a string.
But if you use a ' in the middle of the string, Python considers that to be the end of the string. And since there is no matching ' for the ' at the end of the string, that results in a SyntaxError.
The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
Refer to the docs: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

Deleting of 'd' and 'n' character in strip in python

dataframe
string1
Data%2Fxxx
Data%2Ffrance
Data%2Fdenmark
Data%2Fnorway
Code
df['string1'] = [x.strip('Data%2F') for x in df.string1]
output
string1
xxx
france
enmark
orway
So the strip function is removing the first character when it is 'd' or 'n'. Does anyone know why? How can I stop it from removing them? Is this related to '\d' and '\n'?
python version - 3.7.4
The strip() method returns a copy of the string with both leading and trailing characters stripped. According to https://docs.python.org/3/library/stdtypes.html#str.strip, "The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped." Examples from the documentation:
>>> ' spacious '.strip()
'spacious'
>>> 'www.example.com'.strip('cmowz.')
'example'
In other words, x.strip('Data%2F') is directing Python to strip any a's, t's, D's etc. from the beginning and end of the string. This is why "Data%2Faloha".strip("Data%2F") would actually return 'loh' unless you have, say, a space at the end, which is not part of the chars argument in your example. This is my best guess as to what's happening for you.
str.replace() should work perfectly for you.
>>> x.replace('Data%2F', '')
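To make the difference concrete, here is a small sketch using plain lists instead of a dataframe. strip_prefix is a hypothetical helper written for this example; on Python 3.9 and later, str.removeprefix does the same job. Unlike replace, it only touches the start of the string.

```python
def strip_prefix(text, prefix):
    """Remove prefix from the start of text, if present (hypothetical helper)."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

values = ['Data%2Fxxx', 'Data%2Ffrance', 'Data%2Fdenmark', 'Data%2Fnorway']
prefix_removed = [strip_prefix(v, 'Data%2F') for v in values]
print(prefix_removed)  # ['xxx', 'france', 'denmark', 'norway']

# strip() treats its argument as a set of characters, not as a prefix.
char_set_demo = 'Data%2Faloha'.strip('Data%2F')
print(char_set_demo)  # loh
```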
The correct way to proceed is with str.replace()
df['string1'] = [x.replace('Data%2F','') for x in df.string1]
The string.strip() method returns a copy of the string in which all chars have been stripped from the beginning and the end of the string.
When I tested, it gave me a different result but still incorrect.
string.strip() is more used if you want to remove spaces from the start and end of a string for example.
It should be because of \n if it happens with t as well. You should rather use replace because it won't get rid of whitespaces.
string.replace("Data%2F","")

How to get rid of trailing \ while reading a file in python3

I am reading a file in python and getting the lines from it.
However, after printing out the values I get, I realize that after each line there is a trailing \ at the end.
I have looked at "Python strip with \n" and tried everything in it, but nothing has removed the trailing \.
For example
0048\
0051\
0052\
0054\
0056\
0057\
0058\
0059\
How can I get rid of these slashes?
Here is the code I have so far
for line in f:
    line = line.replace('\\n', "")
    line = line.replace('\\n', "")
    print(line)
I've even tried using regex
strings = re.findall(r"\S+", f.read())
But nothing has worked so far.
You're probably confused about what is in the lines, and as a result you're confusing me too. '\n' is a single newline character, as shown using repr() (which is your friend when you want to know what a value is exactly). A line typically ends with that (the exception being the end of file which might not). That does not contain a backslash; that backslash is part of a string literal escape sequence. Your replace argument of '\\n' contains two characters, a backslash followed by the letter n. This wouldn't match a '\n'; the easiest way to remove the newline specifically is to use str.rstrip('\n'). The line reading itself will guarantee that there's only up to one newline, and it is at the end of the string. Frequently we use strip() with no argument instead as we don't want whitespace either.
If your string really does contain backslash, you can process that as well, whether using replace, strip, re or some other string processing. Just keep in mind that it might be used for escape sequences not only at string literal level but at regular expression level too. For instance, re.sub(r'\\$', '', str) will remove a backslash from the end of a string; the backslash itself is doubled to not mean a special sequence in the regular expression, and the string literal is raw to not need another doubling of the backslashes.
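The two cases described above can be told apart with repr(). The sketch below uses made-up sample values rather than a real file: one line that ends with a newline character, and one string that really ends with a backslash.

```python
import re

# Case 1: a line as typically read from a file, ending in a newline character.
line = "0048\n"
print(repr(line))  # '0048\n'

# '\\n' is a backslash followed by the letter n, so it matches nothing here.
unchanged = line.replace('\\n', '')
cleaned = line.rstrip('\n')
print(repr(cleaned))  # '0048'

# Case 2: a string that really ends with a backslash character.
literal = "0048\\"
no_backslash = re.sub(r'\\$', '', literal)
print(repr(no_backslash))  # '0048'
```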

Deal with special characters in regex

I have a list with this format:
var = ['A12232'], '['926596']','787878', '[WA-12333]', '[78888] [78888]']
I need to extract the codes from this list; in this case those would be
A12232,926596,787878,WA-12333,78888 (just the first one)
I haven't found a way to deal with the " [' " at the same time. I have tried to use '\' to escape it, but it only works for the first of them.
Your example is a little odd, since it's not legal Python: '['926596']' has unescaped quotes inside (perhaps you meant "['926596']"?). If you're just trying to strip leading and trailing quotes and/or brackets, you don't need regular expressions; just call str.strip on each piece and join the results:
codes = ','.join(x.strip('[]\'"') for x in var)
That just removes runs of mixed usage of any of [, ], ' or " from the beginning and end of each string, then joins them together with commas.
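The one-liner above does not handle the "just the first one" requirement for entries such as '[78888] [78888]'. A possible sketch follows, assuming the list was meant to be the legal Python shown here (a guess at the intended data) and that codes never contain whitespace:

```python
# Assumed legal version of the list from the question.
var = ["['A12232']", "['926596']", "787878", "[WA-12333]", "[78888] [78888]"]

# Keep only the first whitespace-separated piece, then strip brackets and quotes.
codes = [x.split()[0].strip('[]\'"') for x in var]
print(','.join(codes))  # A12232,926596,787878,WA-12333,78888
```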

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
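The same experiment can be reproduced with only the standard re module. This is a Python 3 sketch that reuses the alternative regex from the question; the variable names are made up for the example.

```python
import re

# The asker's alternative pattern: a lone backslash is allowed as a last resort.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")

unterminated = r"'ab\'xxxxxxx"
terminated = r"'ab\'xxxxxxx'"

# Without a final quote, the engine backtracks and ends the string at \'.
first = rgx.search(unterminated).group(0)
print(first)  # 'ab\'

# With a final quote, the same \' is now read as an escaped quote.
second = rgx.search(terminated).group(0)
print(second)  # 'ab\'xxxxxxx'
```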
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed but stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
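The substitution in that parse action can be tried on its own, without pyparsing. In this sketch, unescape is a hypothetical wrapper around the same re.sub call, and raw_token stands in for the value shown in the output above.

```python
import re

def unescape(token):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r'\\(.)', r'\g<1>', token)

# The token as shown in the output above: ab'cd followed by two backslashes.
raw_token = "ab'cd" + "\\\\"
fixed = unescape(raw_token)
print(fixed)  # ab'cd\

# An escaped quote is unescaped the same way.
print(unescape(r"a\'b"))  # a'b
```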
I'll add this in the next patch release of pyparsing.
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
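The raw-string claim can be checked without crashing the interpreter by compiling the source text at run time. In this sketch, is_valid_literal is a made-up helper, and the literals are assembled from pieces so that the example itself stays valid Python.

```python
def is_valid_literal(source):
    """Return True if source compiles as a Python expression (made-up helper)."""
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True

odd = 'r"foo' + '\\' + '"'      # source text: r"foo\"
even = 'r"bar' + '\\\\' + '"'   # source text: r"bar\\"

print(is_valid_literal(odd))    # False
print(is_valid_literal(even))   # True

value = eval(even)
print(len(value))  # 5: b, a, r and two backslashes
```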
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.
