I'm trying to append a string in python and the following code produces
buildVersion =request.values.get("buildVersion", None)
pathToSave = 'processedImages/%s/'%buildVersion
print pathToSave
prints out
processedImages/V41
/
I'm expecting the string to be of format: processedImages/V41/
It doesn't seem to be a new line character.
pathToSave = pathToSave.replace("\n", "")
This didn't really help.
It might not be relevant to the actual question but, in addition to Alex Martelli's answer, I would also check whether buildVersion exists in the first place, because otherwise all the solutions posted here will give you other errors:
import re
# inside the view function that handles the request:
buildVersion = request.values.get('buildVersion')
if buildVersion is not None:
    return 'processedImages/{}/'.format(re.sub(r'\W+', '', buildVersion))
else:
    return None
It might be a \r or other special whitespace character. Just clean up buildVersion of all such whitespace before executing
pathToSave = 'processedImages/%s/' % buildVersion
You can approach the clean-up task in several ways -- for example, if valid characters in buildVersion are only "word characters" (letters, digits, underscore), something like
import re
buildVersion = re.sub(r'\W+', '', buildVersion)
would usefully clean up even whitespace inside the string. It's hard to be more specific without knowing exactly what characters you need to accept in buildVersion, of course.
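One way to confirm which character is responsible is to print the repr() of the value before cleaning it. The sketch below assumes the value arrives with a trailing carriage return and newline; the sample string is made up and merely stands in for request.values.get("buildVersion").

```python
import re

# Hypothetical value standing in for request.values.get("buildVersion");
# the trailing "\r\n" is an assumption about what the client sent.
build_version = "V41\r\n"

# repr() shows invisible characters explicitly instead of rendering them.
print(repr(build_version))

# Option 1: strip leading and trailing whitespace only.
stripped = build_version.strip()

# Option 2: drop every non-word character, as suggested above.
word_only = re.sub(r"\W+", "", build_version)

path_to_save = "processedImages/%s/" % stripped
print(repr(path_to_save))
```

strip() is the gentler choice if buildVersion may legitimately contain dots or dashes; the \W+ substitution would remove those as well.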
I ultimately want to split a string by a certain character. I tried Regex, but it started escaping \, so I want to avoid that with another approach (all the attempts at unescaping the string failed). So, I want to get all positions of a character char in a string that is not within quotes, so I can split them up accordingly.
For example, given the phrase hello-world:la\test, I want to get back 11 if char is :, as that is the only : in the string, and it is at index 11. However, re does split it, but I get ['hello-world,lat\\test'].
EDIT:
@BoarGules made me realize that re didn't actually change anything; it's just how Python displays backslashes.
Here's a function that works:
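import re
def split_by_char(string, char=':'):
    # match runs of characters that are neither the separator nor a quote, or whole quoted chunks
    PATTERN = re.compile(rf'''((?:[^\{char}"']|"[^"]*"|'[^']*')+)''')
    return [string[m.span()[0]:m.span()[1]] for m in PATTERN.finditer(string)]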
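A quick check of the function above, repeated here (slightly condensed) so the snippet runs on its own. It also illustrates the point from the edit: the doubled backslash only shows up in the repr of the string, not in the string itself. The sample inputs are purely illustrative.

```python
import re

def split_by_char(string, char=':'):
    # Match runs of non-separator, non-quote characters, or whole quoted chunks.
    pattern = re.compile(rf'''((?:[^\{char}"']|"[^"]*"|'[^']*')+)''')
    return [m.group(0) for m in pattern.finditer(string)]

parts = split_by_char(r'hello-world:la\test')
print(parts)          # a list is displayed with repr, so the backslash looks doubled
print(parts[1])       # print shows the actual characters
print(len(parts[1]))  # the length counts the backslash once

# A separator inside quotes is not treated as a split point.
quoted = split_by_char('a:"b:c":d')
print(quoted)
```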
string = 'hello-world:la\test'
char = ':'
print(string.find(char))
Prints
11
char_index = string.find(char)
string[:char_index]
Returns
'hello-world'
string[char_index+1:]
Returns
'la\test'
Solution for the case you're likely encountering (a pseudo-CSV format you're hand-rolling a parser for; if you're not in that situation, it's still a likely situation for people finding this question later):
Just use the csv module.
import csv
import io
test_strings = ['field1:field2:field3', 'field1:"field2:with:embedded:colons":field3']
for s in test_strings:
    for row in csv.reader(io.StringIO(s), delimiter=':'):
        print(row)
which outputs:
['field1', 'field2', 'field3']
['field1', 'field2:with:embedded:colons', 'field3']
correctly ignoring the colons within the quoted field, requiring no kludgy, hard-to-verify hand-written regexes.
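If you only have single lines rather than a file, a small wrapper might be handy. This is a sketch, and split_fields is a hypothetical helper name, not part of the csv module; it relies on csv.reader accepting any iterable of strings.

```python
import csv

def split_fields(line, delimiter=":", quotechar='"'):
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line], delimiter=delimiter, quotechar=quotechar))

plain = split_fields('field1:field2:field3')
embedded = split_fields('field1:"field2:with:embedded:colons":field3')
single = split_fields("a:'b:c':d", quotechar="'")
print(plain, embedded, single)
```

Note that csv only supports one quote character per reader, so mixed single and double quoting would still need something else.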
I have a dataframe that I need to write to disk but pyspark doesn't allow any of these characters ,;{}()\\n\\t= to be present in the headers while writing as a parquet file.
So I wrote a simple script to detect if this is happening
import re
for each_header in all_headers:
    print(re.match(",;{}()\\n\\t= ", each_header))
But for each header, None was printed. This is wrong because I know my file has spaces in its headers.
So, I decided to check it out by executing the following couple of lines
a = re.match(",;{}()\\n\\t= ", 'a s')
print(a)
a = re.search(",;{}()\\n\\t= ", 'a s')
print(a)
This too resulted in None getting printed.
I am not sure what I am doing wrong here.
PS: I am using python3.7
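The problem is that {} and () are regex metacharacters with a special meaning, and that your pattern describes the whole sequence of characters rather than any one of them. Perhaps the easiest way to write your logic would be to use a character class:
[ ,;{}()\n\t=]
This says to match any one of the literal characters which PySpark does not allow in the headers. Use re.search rather than re.match, because re.match only matches at the start of the string:
a = re.search("[ ,;{}()\n\t=]", 'a s')
print(a)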
If you wanted to remove these characters, you could try using re.sub:
header = '...'
header = re.sub(r'[,;{}()\n\t=]+', '', header)
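As a sketch of how that could be applied to a whole list of headers, the snippet below replaces each forbidden character with an underscore instead of deleting it. The replacement character, the sample headers and the name sanitize_header are my own assumptions; the space in the character class is based on the list of characters in the question.

```python
import re

# Characters the question says cannot appear in Parquet column names.
FORBIDDEN = re.compile(r"[ ,;{}()\n\t=]")

def sanitize_header(header, replacement="_"):
    # Replace each forbidden character with the replacement string.
    return FORBIDDEN.sub(replacement, header)

headers = ["a s", "total(usd)", "ok"]
clean = [sanitize_header(h) for h in headers]
print(clean)
```

Replacing rather than deleting keeps words separated, but two different headers can still collapse to the same name, so it may be worth checking for duplicates afterwards.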
If you want to check whether a text contains any of the "forbidden"
characters, you have to put them between [ and ].
Another flaw in your regex is that in "normal" strings (not r-strings)
any backslash should be doubled.
So change your regex to:
"[,;{}()\\n\\t= ]"
Or use r-string:
r"[,;{}()\n\t= ]"
Note that I included also a space, which you missed.
One more remark: {} and () have special meaning, but only outside [...].
Between [ and ] they represent themselves, so they do not need to be
escaped with a backslash.
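To illustrate the two spellings above, here is a small check that they produce the same pattern text and that the character class finds the forbidden characters. The sample strings are made up.

```python
import re

doubled = "[,;{}()\\n\\t= ]"  # normal string, backslashes doubled
raw = r"[,;{}()\n\t= ]"       # raw string, backslashes written once

# Both literals contain exactly the same characters.
same_text = doubled == raw

pattern = re.compile(raw)
samples = ["a s", "x\ty", "plain", "f(x)"]
hits = [bool(pattern.search(s)) for s in samples]
print(same_text, hits)
```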
As already explained, you could use a regex to look for forbidden characters. I want to add that you could also do it without a regex, in the following way:
forbidden = ",;{}()\n\t="
def has_forbidden(txt):
    for i in forbidden:
        if i in txt:
            return True
    return False
print(has_forbidden("ok name")) # False
print(has_forbidden("wrong=name")) # True
print(has_forbidden("with\nnewline")) # True
Note that using this approach you do not have to care about escaping special-regex characters, like for example *.
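The same idea can be written a bit more compactly with any(), and str.translate can remove the characters without a regex. This is only a sketch; strip_forbidden is a hypothetical helper, and the character set is the one from the answer above (add a space to it if spaces should be rejected too).

```python
forbidden = ",;{}()\n\t="

def has_forbidden(txt):
    # True as soon as one forbidden character is found in txt.
    return any(ch in txt for ch in forbidden)

def strip_forbidden(txt):
    # str.maketrans with a third argument maps those characters to None (deletion).
    return txt.translate(str.maketrans("", "", forbidden))

print(has_forbidden("ok name"), has_forbidden("wrong=name"))
print(strip_forbidden("a=b(c)"))
```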
I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
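For comparison, here is a sketch of the "locally determinable" rule using only the standard re module rather than pyparsing: a backslash always escapes the next character, wherever it occurs, so the first unescaped quote ends the string. The pattern and the name find_strings are my own illustration, not what pyparsing uses internally.

```python
import re

# Opening quote, then either an ordinary character or a backslash plus any
# one character, repeated, then the closing quote.
STRING_RX = re.compile(r"'(?:[^'\\]|\\.)*'", re.DOTALL)

def find_strings(text):
    # Return every complete quoted string found in text, quotes included.
    return STRING_RX.findall(text)

one = find_strings(r"foo = 'ab\'cd\\'")
two = find_strings(r"'one string \' 'or two'")
print(one)
print(two)
```

Under this rule the second input contains exactly one complete string, followed by leftover text with an unmatched quote, and that decision never depends on what comes later in the document.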
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of becoming a single "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
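The substitution used in that workaround can be tried on its own, without pyparsing. A minimal sketch, with unescape as a hypothetical helper name:

```python
import re

def unescape(body):
    # Replace each backslash-plus-character pair with the character alone.
    return re.sub(r"\\(.)", r"\g<1>", body)

from_file = unescape(r"ab\'cd\\")    # raw text as it appears between the quotes
from_parser = unescape("ab'cd\\\\")  # what QuotedString returned in the output above
print(from_file, from_parser)
```

Both inputs end up as ab'cd followed by a single backslash, which is the value the question expected.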
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.
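The point about raw string literals can be checked without crashing your session by handing the source text to compile(). This is just a sketch; compiles is a hypothetical helper name.

```python
def compiles(source):
    # Try to compile the source text as a single expression.
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

odd = compiles('r"foo\\"')      # the source text is: r"foo\"
even = compiles('r"bar\\\\"')   # the source text is: r"bar\\"
value = eval('r"bar\\\\"')      # evaluates the literal r"bar\\"
print(odd, even, len(value))
```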
I have a verbose python regex string (with lots of whitespace and comments) that I'd like to convert to "normal" style (for export to javascript). In particular, I need this to be quite reliable. If there's any demonstrably correct way to do this, it's what I want. For example, a naive implementation would destroy a regex like r' \# # A literal hash character', which is not OK.
The best way to do this would be to coerce the python re module to give me back a non-verbose representation of my regex, but I don't see a way to do that.
I believe you only need to address these two issues to strip a verbose regex:
delete comments to the end of line
delete unescaped whitespace
Try this, which chains the two steps as separate regex substitutions:
import re
def unverbosify_regex_simple(verbose):
    # (?m) goes first: newer Python versions reject inline flags placed elsewhere in the pattern
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    return re.sub(WS_RX, "\\1", re.sub(CM_RX, "\\1", verbose))
The above is a simplified version that leaves escaped spaces as-is. The resulting output will be a little harder to read but should work for regex platforms.
Alternatively, for a slightly more complex answer that "unescapes" spaces (i.e., '\ ' => ' ') and returns what I think most people would expect:
import re
def unverbosify_regex(verbose):
    # (?m) goes first: newer Python versions reject inline flags placed elsewhere in the pattern
    CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    CM2_RX = r'(\\)?((\\{2})*)(#)'
    WS_RX = r'(\\)?((\\{2})*)(\s)\s*'
    def strip_escapes(match):
        ## if even slashes: delete space and retain slashes
        if match.group(1) is None:
            return match.group(2)
        ## if number of slashes is odd: delete slash and keep space (or 'comment')
        elif match.group(1) == '\\':
            return match.group(2) + match.group(4)
        ## error
        else:
            raise Exception
    not_verbose_regex = re.sub(WS_RX, strip_escapes,
                               re.sub(CM2_RX, strip_escapes,
                                      re.sub(CM1_RX, "\\1", verbose)))
    return not_verbose_regex
UPDATE: added comments to explain even v. odd slash counting. Fixed first group in CM_RX to retain full 'comment' if slash count is odd.
UPDATE 2: Fixed comments regex, which was not dealing with escaped hashes properly. Should handle both "\# #escaped hash" as well as "# comment with \# escaped hash" and "\\# comment"
UPDATE 3: Added a simplified version that doesn't clean up escaped spaces.
UPDATE 4: Further simplification to eliminate variable-length negative lookbehind (and reverse/reverse trick)
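Since the question asks for reliability, it may be worth checking any converted pattern against the original on sample inputs. The sketch below is not a proof of equivalence, only a spot check; same_behaviour, the example pattern and the samples are all hypothetical.

```python
import re

def same_behaviour(verbose_pattern, plain_pattern, samples):
    # Compile the original with re.VERBOSE and the converted one without it,
    # then compare where each one first matches in every sample.
    v = re.compile(verbose_pattern, re.VERBOSE)
    p = re.compile(plain_pattern)
    for s in samples:
        mv, mp = v.search(s), p.search(s)
        if (mv is None) != (mp is None):
            return False
        if mv is not None and mv.span() != mp.span():
            return False
    return True

verbose = r'''
    \d+   # integer part
    \.    # literal dot
    \d*   # optional fraction
'''
plain = r'\d+\.\d*'

ok = same_behaviour(verbose, plain, ['3.14', 'abc', '10.', 'x 2.5 y'])
bad = same_behaviour(verbose, r'\d+', ['7'])
print(ok, bad)
```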
I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
".+\n^.+\n^.+"
I am using re.MULTILINE, but should I be using re.DOTALL?
Thanks
Why does this need a regular expression?
index = text.find('!')
if index > -1:
    text = text[:index]  # or text[:index + 1] to keep the '!', too
So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
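A small check of both patterns on multi-line input, with made-up sample text:

```python
import re

text = "first line\nsecond line\n!\nignored"

# [^!] also matches newlines, so no flags are needed.
before = re.match(r"[^!]*", text).group(0)

# With the lookahead, a string without any ! does not match at all.
required = re.match(r"[^!]*(?=!)", "no separator here")
print(repr(before), required)
```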
Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character not in the string (try it in the command line: "Hola".index("1") will raise ValueError: substring not found). The workflow then assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing if SEPARATOR is in the string); if you fail (the index method raises ValueError) then you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
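For what it's worth, str.partition gives the same result without a try/except, since it returns the whole string as the first element when the separator is missing. A sketch, with before_separator as a hypothetical name:

```python
def before_separator(text, separator="!"):
    # partition splits at the first occurrence and returns (head, sep, tail).
    head, _, _ = text.partition(separator)
    return head

found = before_separator("line one\nline two\n!\nrest")
missing = before_separator("no separator here")
print(repr(found), repr(missing))
```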
Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)
It should do the job.
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)
I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>
re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile("(.*)\!", re.S)
print rExp.search(text).groups()[0]
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
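One thing to watch with re.DOTALL: a greedy (.*) runs up to the last ! in the text, while the non-greedy (.*?) stops at the first one. The difference only shows when the text contains more than one !, as in this made-up sample:

```python
import re

text = "line one\nline two\n!\nmore text!"

greedy = re.search(r"(.*)!", text, re.DOTALL).group(1)
lazy = re.search(r"(.*?)!", text, re.DOTALL).group(1)
print(repr(greedy))
print(repr(lazy))
```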