Regex Findall Hang in Linux - python

I wrote a program to find floats inside a string using re.findall, as follows:
string1 = 'Voltage = 3.0 - 4.0 V'
string2 = '3.66666'
float1 = re.findall('\d+.\d+', string1)
float2 = re.findall('\d+.\d+', string2)
This program runs fine on Windows, but when I run it on Linux it gets stuck on the second re.findall call. Any idea what causes this problem, and how to solve it?
Thank you

You need to define your regex as raw string and also you need to escape the dot. Dot is a special meta character in regex which matches any character except line breaks. Escaping the dot in your regex will match a literal dot.
float1 = re.findall(r'\d+\.\d+', string1)
float2 = re.findall(r'\d+\.\d+', string2)
From the re documentation:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
>>> string1 = 'Voltage = 3.0 - 4.0 V'
>>> string2 = '3.66666'
>>> float1 = re.findall(r'\d+\.\d+', string1)
>>> float2 = re.findall(r'\d+\.\d+', string2)
>>> float1
['3.0', '4.0']
>>> float2
['3.66666']
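To see what the unescaped dot changes, here is a minimal sketch. The sample string is made up for illustration and does not come from the question.

```python
import re

# Hypothetical input containing a non-float "3x14" and a plain integer "100".
sample = 'id 3x14, gain 2.5, count 100'

# Unescaped dot: any character may sit between the two digit runs.
loose = re.findall(r'\d+.\d+', sample)

# Escaped dot: a literal '.' is required between the digit runs.
strict = re.findall(r'\d+\.\d+', sample)

print(loose)   # ['3x14', '2.5', '100']
print(strict)  # ['2.5']
```

With the unescaped dot, even "100" matches, because the middle digit satisfies the "any character" position.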

Related

Split a string and keep the delimiters as part of the split string chunks, not as separate list elements

This is a spin-off from In Python, how do I split a string and keep the separators?
rawByteString = b'\\!\x00\x00\x00\x00\x00\x00\\!\x00\x00\x00\x00\x00\x00'
How can I split this rawByteString into parts using "\\!" as the delimiter without dropping the delimiters, so that I get:
[b'\\!\x00\x00\x00\x00\x00\x00', b'\\!\x00\x00\x00\x00\x00\x00']
I do not want to use [b'\\!' + x for x in rawByteString.split(b'\\!')][1:] as that would use string.split() and is just a workaround, that is why this question is tagged with the "re" module.
You may use
re.split(rb'(?!\A)(?=\\!)', rawByteString)
re.split(rb'(?!^)(?=\\!)', rawByteString)
See a sample regex demo (the string input changed since null bytes cannot be part of a string).
Regex details
(?!^) / (?!\A) / (?<!^) - a position other than start of string
(?=\\!) - a positive lookahead: a position immediately followed by a backslash + !
NOTES
Since you use a byte string, the b prefix is required when defining the pattern string literal
r makes the string literal a raw string literal so that we do not have to double escape backslashes and can use \\ to match a single \ in the string.
See Python demo:
import re
rawByteString = b'\\!\x00\x00\x00\x00\x00\x00\\!\x00\x00\x00\x00\x00\x00'
print ( re.split(rb'(?!\A)(?=\\!)', rawByteString) )
Output:
[b'\\!\x00\x00\x00\x00\x00\x00', b'\\!\x00\x00\x00\x00\x00\x00']
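The same idea can be wrapped in a small helper for arbitrary byte delimiters. This is only a sketch: the function name split_keeping_prefix is hypothetical, and it assumes Python 3.7 or later, where re.split can split on a pattern that matches an empty string.

```python
import re

def split_keeping_prefix(data: bytes, delimiter: bytes) -> list:
    """Split data before each delimiter, leaving the delimiter on the chunk it starts."""
    # re.escape makes the delimiter literal; the negative lookahead for the
    # start of the string avoids an empty first chunk.
    pattern = rb'(?!\A)(?=' + re.escape(delimiter) + rb')'
    return re.split(pattern, data)

chunks = split_keeping_prefix(b'\\!\x00\x00\\!\x01\x01', b'\\!')
print(chunks)  # [b'\\!\x00\x00', b'\\!\x01\x01']
```

Using re.escape means the caller does not have to think about which bytes in the delimiter are regex metacharacters.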

Python regex not working with special characters

SOLVED: it replaced the " symbols in the file with ' (in the data strings)
Do you know a way to only search for 1 or more words (not numbers) between [" and \n?
This works on regexr.com, but not in python
https://regexr.com/3tju7
(?<=\[\")(\D+)(?=\\n)
"S": ["Something\n13/8-2018 09:00 to 11:30
Python code:
re.search('(?<=[\")(\D+)(?=\n)', str(data))
I think \[, \" and \\n is the problem, I have tried to use raw in python
re.search('(?<=\[\")(\D+)(?=\\n)', '"S": ["Something\n13/8-201809:00 to 11:30').group()
This worked but I have to use "data" because I have multiple strings, and it won't let me use .group() on that.
Error: AttributeError: 'NoneType' object has no attribute 'group'
Your problem is that the \n is being interpreted as a newline, instead of the literal characters \ and n. You can use a simpler regex, \["([\w\s]+)$, along with the MULTILINE flag, without modifying the data.
>>> import re
>>> data = '"S": ["Something\n13/8-201809:00 to 11:30'
>>> pattern = r'\["([\w\s]+)$'
>>> m = re.search(pattern, data, re.MULTILINE)
>>> m.group(1)
'Something'
Try putting an r before the pattern string, which marks the string as "raw". This stops Python from interpreting escape sequences before the string is passed to the function:
re.search(r'\search', string)
Or:
rgx = re.compile(r'pattern')
rgx.search(string)
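If the data really contains a backslash followed by the letter n (for example because it came from str() of a larger structure), a raw pattern with \\n matches it. The sketch below assumes exactly that; the data literal is written as a raw string to simulate it, and the None check avoids the AttributeError from the question.

```python
import re

# Raw literal: the data holds a backslash followed by 'n', not a newline.
data = r'"S": ["Something\n13/8-2018 09:00 to 11:30'

# \\n in a raw pattern matches a literal backslash followed by 'n'.
pattern = re.compile(r'(?<=\[")(\D+?)(?=\\n)')

match = pattern.search(data)
# search() returns None when nothing matches, so check before calling group().
word = match.group(1) if match else None
print(word)  # Something
```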

How to find string between '\begin{minipage}' and '\end{minipage}' by python re?

I have tried the following code:
strReFindString = u"\\begin{minipage}"+"(.*?)"
strReFindString += u"\\end{minipage}"
lst = re.findall(strReFindString, strBuffer, re.DOTALL)
But it always returns empty list.
How can I do?
Thanks all.
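As @BrenBarn said, u"\\b" parses as \b, and in a regexp \b means a word boundary, so findall looks for a word boundary followed by egin{minipage} instead of a literal backslash. u"\\\\b" is \\b, which regexp understands as a literal backslash followed by a literal b. You can prevent escape-parsing in the string using raw strings; in Python 2, ur"\\b" is equal to u"\\\\b" (Python 3 does not accept the ur prefix, so write r"\\b" there):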
ur"\\b" == u"\\\\b"
# => True
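One way to sidestep the double escaping altogether is to let re.escape build the literal parts of the pattern. A minimal sketch for Python 3, with a made-up input text:

```python
import re

# Hypothetical LaTeX input; the raw literal keeps the backslashes as typed.
text = r"""intro
\begin{minipage}first block\end{minipage}
\begin{minipage}second
block\end{minipage}"""

# re.escape turns the markers into patterns that match them literally.
pattern = re.compile(
    re.escape(r"\begin{minipage}") + r"(.*?)" + re.escape(r"\end{minipage}"),
    re.DOTALL,  # let .*? span line breaks inside a block
)

blocks = pattern.findall(text)
print(blocks)  # ['first block', 'second\nblock']
```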

How do I escape backslash and single quote or double quote in Python? [duplicate]

How do I escape a backslash and a single quote or double quote in Python?
For example:
Long_string = '''some 'long' string \' and \" some 'escaped' strings'''
value_to_change = re.compile(AN EXPRESSION TO REPRESENT \' and \")
modified = re.sub(value_to_change, 'thevalue', Long_string)
## Desired Output
modified = '''some 'long' string thevalue and thevalue some 'escaped' strings'''
How you did it
If your "long string" is read from a file (as you mention in a comment) then your question is misleading. Since you obviously don't fully understand how escaping works, the question as you wrote it down probably is different from the question you really have.
If these are the contents of your file (51 bytes as shown + maybe one or two end-of-line characters):
some 'long' string \' and \" some 'escaped' strings
then this is what it will look like in python:
>>> s1 = open('data.txt', 'r').read().strip()
>>> s1
'some \'long\' string \\\' and \\" some \'escaped\' strings'
>>> print s1
some 'long' string \' and \" some 'escaped' strings
What you wrote in the question will produce:
>>> s2 = '''some 'long' string \' and \" some 'escaped' strings'''
>>> s2
'some \'long\' string \' and " some \'escaped\' strings'
>>> print s2
some 'long' string ' and " some 'escaped' strings
>>> len(s2)
49
Do you see the difference?
There are no backslashes in s2 because they have special meaning when you use them to write down strings in Python. They have no special meaning when you read them from a file.
If you want to write down a string that afterwards has a backslash in it, you have to protect the backslash you enter. You have to keep Python from thinking it has special meaning. You do that by escaping it - with a backslash.
One way to do this is to use backslashes, but often the easier and less confusing way is to use raw strings:
>>> s3 = r'''some 'long' string \' and \" some 'escaped' strings'''
>>> s3
'some \'long\' string \\\' and \\" some \'escaped\' strings'
>>> print s3
some 'long' string \' and \" some 'escaped' strings
>>> s1 == s3
True
How you meant it
The above was only to show you that your question was confusing.
The actual answer is a bit harder - when you are working with regular expressions, the backslash takes on yet another layer of special meaning. If you want to safely get a backslash through string escaping and through regex escaping to the actual regex, you have to write down multiple backslashes accordingly.
Furthermore, the rules for putting single quotes (') in single-quoted raw strings (r'') are a bit tricky as well, so I will use a raw string with triple single-quotes (r'''...''').
>>> print re.sub(r'''\\['"]''', 'thevalue', s1)
some 'long' string thevalue and thevalue some 'escaped' strings
The two backslashes stay two backslashes throughout string escaping and then become only one backslash without special meaning through regex escaping. In total, the regex says:
"match one backslash followed by either a single-quote or a double-quote."
How it should be done
Now for the pièce de résistance: The previous is really a good demonstration of what jwz meant [1]. If you forget about regex (and know about raw strings), the solution becomes much more obvious:
>>> print s1.replace(r'\"', 'thevalue').replace(r"\'", 'thevalue')
some 'long' string thevalue and thevalue some 'escaped' strings
[1] Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The problem is that in your string \' and \" get converted to ' and ", so on your example as-is, you won't be able to match only \' without matching the single quotes around long.
But my understanding is that this data comes from a file so assuming you have your_file.txt containing
some 'long' string \' and \" some 'escaped' strings
you can replace \' and \" with following code:
import re
from_file = open("your_file.txt", "r").read()
print(re.sub("\\\\(\"|')", "thevalue", from_file))
Note the four backslashes. Since this is a string literal, each \\ gets converted to \ (as the backslash is the string escape character). Then in the regular expression, the remaining \\ gets converted to \ again, as the backslash is also the regular expression escape character. The result matches a single backslash followed by one of the " and ' quotes.
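The two layers of escaping can be checked directly. The sketch below compares the four-backslash literal with its raw-string spelling and applies it to a string that simulates the file contents (the raw literal stands in for reading your_file.txt).

```python
import re

# The same pattern written as a normal literal and as a raw literal.
plain_pattern = "\\\\(\"|')"
raw_pattern = r"""\\("|')"""
print(plain_pattern == raw_pattern)  # True

# Simulated file contents: the backslashes are real characters here.
from_file = r"""some 'long' string \' and \" some 'escaped' strings"""

result = re.sub(raw_pattern, "thevalue", from_file)
print(result)  # some 'long' string thevalue and thevalue some 'escaped' strings
```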
Is this what you want?
import re
Long_string = "some long string \' and \" some escaped strings"
value_to_change = re.compile( "'|\"" )
modified = re.sub(value_to_change , 'thevalue' , Long_string )
print(modified)
I tried this to print a single backslash (Python 3):
single_backslash_str = r'\ '[0]
print(single_backslash_str) #output: \
print(repr(single_backslash_str)) #output: '\\'
Hope this will help!
Keep in mind, all these strings are exactly the same:
Long_string = '''some long string \' and \" some escaped strings'''
Long_string = '''some long string ' and " some escaped strings'''
Long_string = """some long string ' and " some escaped strings"""
Long_string = 'some long string \' and \" some escaped strings'
Long_string = "some long string \' and \" some escaped strings"
Long_string = 'some long string \' and " some escaped strings'
Long_string = "some long string ' and \" some escaped strings"
There is no backslash character in any of them. So the regex you're looking for doesn't need to match a backslash and a quote, just a quote:
modified = re.sub("['\"]", 'thevalue', Long_string)
BTW: You also don't have to compile the regex before you use it, re.sub will accept a string regex as well as a compiled one.

Python regex - r prefix

Can anyone explain why example 1 below works, when the r prefix is not used?
I thought the r prefix must be used whenever escape sequences are used.
Example 2 and example 3 demonstrate this.
# example 1
import re
print (re.sub('\s+', ' ', 'hello there there'))
# prints 'hello there there' - not expected as r prefix is not used
# example 2
import re
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
# prints 'hello there' - as expected as r prefix is used
# example 3
import re
print (re.sub('(\b\w+)(\s+\1\b)+', '\1', 'hello there there'))
# prints 'hello there there' - as expected as r prefix is not used
Because \ begins an escape sequence only when what follows forms a valid escape sequence.
>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print '\n'
>>> print r'\n'
\n
>>> '\s'
'\\s'
>>> r'\s'
'\\s'
>>> print '\s'
\s
>>> print r'\s'
\s
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
Escape Sequence    Meaning
\newline           Ignored
\\                 Backslash (\)
\'                 Single quote (')
\"                 Double quote (")
\a                 ASCII Bell (BEL)
\b                 ASCII Backspace (BS)
\f                 ASCII Formfeed (FF)
\n                 ASCII Linefeed (LF)
\N{name}           Character named name in the Unicode database (Unicode only)
\r                 ASCII Carriage Return (CR)
\t                 ASCII Horizontal Tab (TAB)
\uxxxx             Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx         Character with 32-bit hex value xxxxxxxx (Unicode only)
\v                 ASCII Vertical Tab (VT)
\ooo               Character with octal value ooo
\xhh               Character with hex value hh
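A quick way to check which sequences are interpreted is to look at string lengths. The sketch below is for Python 3 and deliberately avoids writing an unrecognized escape such as '\s' in a non-raw literal, since recent Python 3 versions emit a warning for those.

```python
# Each entry pairs a label with a literal; the labels are only for display.
cases = {
    'newline escape': '\n',           # one character: LF
    'raw backslash-n': r'\n',         # two characters: backslash, n
    'backspace escape': '\b',         # one character: BS
    'raw backslash-b': r'\b',         # two characters: backslash, b
    'escaped backslash + s': '\\s',   # two characters: backslash, s
}

lengths = {label: len(value) for label, value in cases.items()}
for label, value in cases.items():
    print(f'{label}: length {len(value)}, repr {value!r}')
```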
Never rely on raw strings for path literals, as raw strings have some rather peculiar inner workings, known to have bitten people in the ass:
When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
To better illustrate this last point:
>>> r'\'
SyntaxError: EOL while scanning string literal
>>> r'\''
"\\'"
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\''
"'"
>>>
>>> r'\\'
'\\\\'
>>> '\\'
'\\'
>>> print r'\\'
\\
>>> print r'\'
SyntaxError: EOL while scanning string literal
>>> print '\\'
\
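If a string does need to end in a backslash, two common workarounds are to append the backslash from a normal literal or to skip the raw prefix and double every backslash. A small sketch with a made-up Windows-style path:

```python
# Hypothetical path; r'C:\temp\' would be a syntax error.
folder_raw_plus = r'C:\temp' + '\\'   # raw part, then one escaped backslash
folder_escaped = 'C:\\temp\\'         # no raw prefix, every backslash doubled

print(folder_raw_plus)                    # C:\temp\
print(folder_raw_plus == folder_escaped)  # True
```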
The 'r' means that the following is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character.
http://docs.python.org/reference/lexical_analysis.html#literals
so '\n' is a single newline
and r'\n' is two characters - a backslash and the letter 'n'
another way to write it would be '\\n' because the first backslash escapes the second
an equivalent way of writing this
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
is
print (re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello there there'))
Because of the way Python treats characters that are not valid escape characters, not all of those double backslashes are necessary - eg '\s'=='\\s' however the same is not true for '\b' and '\\b'. My preference is to be explicit and double all the backslashes.
Not all sequences involving backslashes are escape sequences. \t and \f are, for example, but \s is not. In a non-raw string literal, any \ that is not part of an escape sequence is seen as just another \:
>>> "\s"
'\\s'
>>> "\t"
'\t'
\b is an escape sequence, however, so example 3 fails. (And yes, some people consider this behaviour rather unfortunate.)
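To make the failure of example 3 concrete, the sketch below compares what the regex engine actually receives in each case. In the non-raw version, \b becomes a backspace character and \1 becomes the character with code 1, so nothing in the text matches. The \w and \s are doubled here only to avoid unrecognized-escape warnings; the variable names are illustrative.

```python
import re

text = 'hello there there'

raw_pattern = r'(\b\w+)(\s+\1\b)+'        # word boundary and backreference
cooked_pattern = '(\b\\w+)(\\s+\1\b)+'    # backspace and chr(1) instead

print('\b' == '\x08', '\1' == '\x01')     # True True

fixed = re.sub(raw_pattern, r'\1', text)
unchanged = re.sub(cooked_pattern, '\1', text)
print(fixed)      # hello there
print(unchanged)  # hello there there
```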
Try this:
a = '\''    # print(a) shows: '
a = r'\''   # print(a) shows: \'
a = "\'"    # print(a) shows: '
a = r"\'"   # print(a) shows: \'
Check below example:
print r"123\n123"
#outputs>>>
123\n123
print "123\n123"
#outputs>>>
123
123
