I'm using Python (and Pytumblr) and trying to extract a certain string from some returned data, but the string I am searching for includes ":" in it. Whenever I run my script I get the error:
File "myfile.py", line 22
if re.search('^ion': u'..', u'b', line) :
^
SyntaxError: invalid syntax
Here is my code:
import pytumblr
import re
returned = client.submission('blog') # get the submissions for a given blog
sch = open('returned')
for line in sch:
    line = line.rstrip()
    if re.search('^ion': u'..', u'b', line) :
        print line
Is there another error in this code or is there a way to escape ":" that I don't know about? I'm pretty new to Python but I didn't think : needed to be escaped.
That's a syntax error because your colon is not part of the string. The single quote ' mark is closing off the string. Your first argument is being parsed as:
'^ion' - String 1: ^ion
: - Syntactical colon
u - The syntactical character u,
indicating you intend for the
following string literal to be
in unicode
'..' - String 2: ..
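One way to check this split yourself is to run the offending line through the standard-library tokenize module, which lexes source text without checking its grammar. This is only a sketch. Note that on Python 3 the u prefix is reported as part of the string token (u'..') rather than as a separate character, but the colon still ends up outside any string.

```python
import io
import tokenize

# The offending call from the question, as plain source text.
source = "re.search('^ion': u'..', u'b', line)\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the string literals the tokenizer found.
strings = [tok.string for tok in tokens if tok.type == tokenize.STRING]
# Keep the operators, to show that the colon is a token of its own.
operators = [tok.string for tok in tokens if tok.type == tokenize.OP]

print(strings)
print(operators)
```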
If you want the single quote after ^ion to be part of the string, either escape it with a backslash ('^ion\': ...') or wrap the whole string in double quotes. Python accepts both single and double quotes as string delimiters, so 'hello' and "hello" mean the same thing, which makes '"hello world"' and "'hello world'" both legal strings.
If the regex is the pain point here, there's lots of literature and tooling out there to help. I recommend regex101
Try to use double quotes:
re.search("^ion': u'..', u'b", line):
Or escape ':
re.search('^ion\': u\'..\', u\'b', line):
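As a quick check that the two spellings above produce the same pattern, here is a small sketch. The sample line is made up for illustration; the data actually returned by Pytumblr may look different.

```python
import re

# Hypothetical sample line, invented for this example.
line = "ion': u'ab', u'b"

pattern_double = "^ion': u'..', u'b"          # double quotes around the pattern
pattern_escaped = '^ion\': u\'..\', u\'b'     # single quotes, inner quotes escaped

# Both literals describe exactly the same string.
print(pattern_double == pattern_escaped)

# Neither ' nor : is special in a regular expression, so no regex escaping is needed.
print(bool(re.search(pattern_double, line)))
```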
program:
d=r'he said,'let's python.''
print(d)
output:
File "<ipython-input-39-bb6666c2121c>", line 1
d=r'he said,'let's python.''
^
SyntaxError: invalid syntax
Enclose the raw string in double quotes. This is one way to handle a quote character that both marks the string boundaries and appears inside the string. In this case, we mark the string boundaries with double quotes, since a single quote (apostrophe) appears in the word let's.
>>> d=r"he said,'let's python."
>>> print(d)
he said,'let's python.
You got a SyntaxError because r'he said,'let's python.'' is not a legal Python literal. Because you used a single ' at each end, this is a shortstring, whose items must be <any source character except "\" or newline or the quote>, and you used the quote inside, so it failed.
You can wrap your message, which includes single quotes ('), with double quotes (") to make the syntax work. In your case:
d = r"he said,'let's python."
print(d)
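If your string includes the same quote character that delimits it (' or "), you must put a backslash in front of it. Note that in a raw string the backslash then stays in the string.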
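To compare the options mentioned in these answers side by side, here is a minimal sketch. The variable names are made up for illustration. It shows that double quotes, backslash escapes, and triple quotes all give the same string, while a raw string keeps the escaping backslashes.

```python
double_quoted = "he said,'let's python."
escaped = 'he said,\'let\'s python.'
triple_quoted = '''he said,'let's python.'''

# All three literals describe the same string.
print(double_quoted == escaped == triple_quoted)

# In a raw string, \' still prevents the quote from ending the literal,
# but the backslash itself remains part of the string.
raw_escaped = r'he said,\'let\'s python.'
print(raw_escaped)
print(len(raw_escaped) - len(double_quoted))
```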
I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
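The behaviour described above can be reproduced with the regular expression from the question alone, without pyparsing. The following is a sketch in Python 3 syntax; the variable names are made up for illustration.

```python
import re

# The alternative pattern from the question: a backslash may escape
# any character, or stand on its own.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")

unterminated = r"'ab\'xxxxxxx"
terminated = r"'ab\'xxxxxxx'"
ambiguous = r"'one string \' 'or two'"

# Adding one quote at the end changes where the *first* string stops.
print(rgx.search(unterminated).group(0))
print(rgx.search(terminated).group(0))

# Here the regex settles on one reading: the \' is taken as an escaped quote,
# so a single string is found and the rest is left over.
matches = rgx.findall(ambiguous)
print(matches)
```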
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
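The substitution used in that workaround can be tried on its own, without pyparsing. This is a sketch; the helper name unescape is made up for illustration, and the input is the value QuotedString returned in the question.

```python
import re

def unescape(text):
    """Replace each backslash-plus-character pair with the character alone."""
    return re.sub(r'\\(.)', r'\g<1>', text)

# The parsed value from the question: ab'cd followed by two backslashes.
parsed = "ab'cd\\\\"

# After unescaping, only one backslash is left at the end.
print(unescape(parsed))
```

Note that . does not match a newline by default, so a backslash directly followed by a newline is left unchanged by this pattern.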
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" is a SyntaxError, while r"bar\\" is accepted and keeps both backslashes in the resulting string.
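The raw-string claim can be checked without crashing the interpreter by compiling the literals from text. This is a small sketch; the helper is_valid_literal is a made-up name.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True

# The source text tested here is: r"foo\"
print(is_valid_literal('r"foo\\"'))
# The source text tested here is: r"bar\\"
print(is_valid_literal('r"bar\\\\"'))

# One common way to get a trailing backslash anyway is to
# append it as an ordinary (non-raw) literal.
trailing = r"foo" + "\\"
print(trailing)
```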
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.
I am learning Python 3.3 on Windows 7. I have two text files, lines.txt and raven.txt, in a folder. Both contain the same text for the first example.
When I try to access raven.txt, using the code below, I get the error -
OSError: [Errno 22] Invalid argument: 'C:\\Python\raven.txt'
I know that the above error can be fixed by using an escape character like this -
C:\\Python\\raven.txt
C:\Python\\raven.txt
Why do both methods work? Strangely, when I access lines.txt in the same folder, I get no error! Why?
import re
def main():
    print('')
    fh = open('C:\Python\lines.txt')
    for line in fh:
        if re.search('(Len|Neverm)ore', line):
            print(line, end = '')
if __name__ == '__main__':main()
Also, when I use the line below, I get a completely different error - TypeError: embedded NUL character. Why ?
fh = open('C:\Python\Exercise Files\09 Regexes\raven.txt')
I can rectify this by using \ before every \ in the file path.
\r is an escape sequence (carriage return), but \l is not. So '\lines' is kept as a backslash followed by lines, while '\raven' is read as a carriage return followed by aven, because the \r is interpreted as an escape.
In [1]: len('\l')
Out[1]: 2
In [2]: len('\r')
Out[2]: 1
You should always escape backslashes as \\. If your string doesn't contain quote characters, you can also use raw strings:
In [9]: len(r'\r')
Out[9]: 2
In [10]: r'\r'
Out[10]: '\\r'
See: https://docs.python.org/3/reference/lexical_analysis.html
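The same rules also account for the two error messages in the question. The sketch below builds the strings explicitly so that it does not itself contain unrecognized escapes. The variable names are made up for illustration.

```python
# What the author meant to write.
intended = 'C:\\Python\\raven.txt'

# What the literal 'C:\Python\raven.txt' actually contains:
# \P is not an escape and is kept as is, but \r becomes a carriage return.
as_typed = 'C:\\Python' + '\r' + 'aven.txt'

print(repr(as_typed))  # same text as in the OSError message
print(r'C:\Python\raven.txt' == intended)

# In '\09', \0 is an octal escape for the NUL character and 9 is an
# ordinary character, which explains "embedded NUL character".
nul_part = '\09'
print(len(nul_part), repr(nul_part))
```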
You can also use a raw string, like this: open(r'C:\Python\Exercise Files\09 Regexes\raven.txt').
From the Python documentation:
When an 'r' or 'R' prefix is present, backslashes are still used to
quote the following character, but all backslashes are left in the
string. For example, the string literal r"\n" consists of two
characters: a backslash and a lowercase `n'. String quotes can be
escaped with a backslash, but the backslash remains in the string; for
example, r"\"" is a valid string literal consisting of two characters:
a backslash and a double quote; r"\" is not a valid string literal
(even a raw string cannot end in an odd number of backslashes).
Specifically, a raw string cannot end in a single backslash (since the
backslash would escape the following quote character). Note also that
a single backslash followed by a newline is interpreted as those two
characters as part of the string, not as a line continuation.
You can actually use forward slashes instead of backslashes; that way you don't have to escape them at all, which saves a lot of headaches. Like this: 'C:/Python/raven.txt'. I can guarantee that it works on Windows.
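To see that both spellings name the same Windows path, the standard-library pathlib module can compare them on any operating system. This is a sketch; it only manipulates the path text and never touches the file system.

```python
from pathlib import PureWindowsPath

forward = PureWindowsPath('C:/Python/raven.txt')
backward = PureWindowsPath(r'C:\Python\raven.txt')

# Both spellings describe the same Windows path.
print(forward == backward)

# The path is rendered with backslashes when converted back to text.
print(str(forward))
print(forward.name)
```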
I was experimenting in the Python shell with the type() function. I noted that:
type('''' string '''')
returns an error which is trouble scanning the string
yet:
type(''''' string ''''')
works fine and responds that a string was found.
What is going on? Does it have to do with the fact that type('''' string '''') is interpreted as type("" "" string "" "") and therefore a meaningless concatenation of empty strings and an undefined variable?
You are ending a string with 3 quotes, plus one extra. This works:
>>> ''''string'''
"'string"
In other words, Python sees 3 quotes, then the string ends at the next 3 quotes. Anything that follows after that is not part of the string anymore.
Python also concatenates strings that are placed one after the other:
>>> 'foo' 'bar'
'foobar'
so '''''string''''' means '''''string''' + '' really; the first string starts right after the opening 3 quotes until it finds 3 closing quotes. Those three closing quotes are then followed by two more quotes forming a separate but empty string:
>>> '''''string'''
"''string"
>>> '''''string'''''
"''string"
>>> '''''string'''' - extra extra! -'
"''string - extra extra! -"
Moral of the story: Python only supports triple or single quoting. Anything deviating from that can only lead to pain.
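The quote counting can be verified in a short script. In this sketch the failing literal is compiled from text so that the rest of the script still runs; the variable names are made up for illustration.

```python
# Three quotes open the string, so the fourth quote is its first character.
one_extra_front = ''''string'''

# Three quotes open, two are content, three close, and the last two
# form a separate empty string that is concatenated on.
two_extra_both = '''''string'''''

print(repr(one_extra_front))
print(repr(two_extra_both))

# With four quotes on each side, the last quote opens a new string
# that is never closed, so the source does not compile.
try:
    compile("'''' string ''''", "<demo>", "eval")
    four_quotes_ok = True
except SyntaxError as exc:
    four_quotes_ok = False
    print("SyntaxError:", exc.msg)
```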
Your supposition seems to be correct, given the following:
a = '''' string ''''
File "<stdin>", line 1
a = '''' string ''''
^
SyntaxError: EOL while scanning string literal
As Martijn says in his answer, Python is trying to concatenate adjacent strings, and fails when it doesn't find the ending '.
I am trying to replace a backslash '\' in a string with the following code
string = "<P style='TEXT-INDENT'>\B7 </P>"
result = string.replace("\",'')
result:
------------------------------------------------------------
File "<ipython console>", line 1
result = string.replace("\",'')
^
SyntaxError: EOL while scanning string literal
I don't need the backslashes here: I am actually parsing an XML file that has a tag in the above format, and when the backslashes are present the parser reports an invalid token.
How can I replace the backslashes with an empty string in Python?
We need to specify that we want to replace a string that contains a single backslash. We cannot write that as "\", because the backslash is escaping the intended closing double-quote. We also cannot use a raw string literal for this: r"\" does not work.
Instead, we simply escape the backslash using another backslash:
result = string.replace("\\","")
The error occurs because you did not escape your '\'. You should write \\ for a backslash (\).
In [147]: foo = "a\c\d" # example string with backslashes
In [148]: foo
Out[148]: 'a\\c\\d'
In [149]: foo.replace('\\', " ")
Out[149]: 'a c d'
In [150]: foo.replace('\\', "")
Out[150]: 'acd'
In Python, as explained in the documentation:
The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
So, in order to replace \ in a string, you need to escape the backslash itself with another backslash, thus:
>>> "this is a \ I want to replace".replace("\\", "?")
'this is a ? I want to replace'
Using regular expressions:
import re
new_string = re.sub("\\\\", "", old_string)
The trick here is that "\\\\" is a string literal describing a string containing two backslashes (each one is escaped), then the regex engine compiles that into a pattern that will match one backslash (doing a separate layer of unescaping).
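The two layers of escaping can be made visible by checking the length of the pattern string. This is a small sketch; old_string is an example value chosen here.

```python
import re

old_string = "a\\c\\d"  # the text a\c\d, containing two backslashes

pattern = "\\\\"        # string literal layer: four typed, two stored
print(len(pattern))

# Regex layer: the two stored backslashes match one literal backslash.
print(re.sub(pattern, "", old_string))

# A raw string removes the first layer, so two typed backslashes are enough.
print(re.sub(r"\\", "", old_string))

# str.replace has no regex layer at all.
print(old_string.replace("\\", ""))
```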
Adding a solution for the case where string='abcd\nop.png'.
result = string.replace("\\","")
The above won't work here, as it gives result='abcd\nop.png' unchanged.
The \n is a single newline character, so the string contains no backslash to replace. We have to convert the string to its escaped form first (where the newline becomes the two characters \ and n) and then replace the backslash:
string = string.encode('unicode_escape').decode('ascii')
result = string.replace("\\", "")
#result=abcdnop.png
You need to escape '\' with one extra backslash to actually compare against \. So you should use '\\'.
See the Python documentation, section 2.4, for all the escape sequences in Python and how you should handle them.
It's August 2020.
Python 3.8.1
Pandas 1.1.0
At this point in time I used both the double \ backslash AND the r.
df.replace([r'\\'], [''], regex=True, inplace=True)
Cheers.