Python regex not working with special characters - python

SOLVED: it replaced the " symbols in the file with ' (in the data strings)
Do you know a way to only search for 1 or more words (not numbers) between [" and \n?
This works on regexr.com, but not in python
https://regexr.com/3tju7
(?<=\[\")(\D+)(?=\\n)
"S": ["Something\n13/8-2018 09:00 to 11:30
Python code:
re.search('(?<=[\")(\D+)(?=\n)', str(data))
I think \[, \" and \\n are the problem. I have tried to use a raw string in Python:
re.search('(?<=\[\")(\D+)(?=\\n)', '"S": ["Something\n13/8-201809:00 to 11:30').group()
This worked but I have to use "data" because I have multiple strings, and it won't let me use .group() on that.
Error: AttributeError: 'NoneType' object has no attribute 'group'

Your problem is that the \n is being interpreted as a newline, instead of the literal characters \ and n. You can use a simpler regex, \["([\w\s]+)$, along with the MULTILINE flag, without modifying the data.
>>> import re
>>> data = '"S": ["Something\n13/8-201809:00 to 11:30'
>>> pattern = r'\["([\w\s]+)$'
>>> m = re.search(pattern, data, re.MULTILINE)
>>> m.group(1)
'Something'

Try putting an r before the pattern string; that marks the string as "raw". This stops Python from evaluating escape sequences before passing the string to the function:
re.search(r'\search', string)
Or:
rgx = re.compile(r'pattern')
rgx.search(string)
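To make the two answers above concrete, here is a minimal sketch, assuming the data may contain either a real newline or the two literal characters \ and n (which is what str() of a list or dict shows). The names first_words, with_newline and with_literal are made up for illustration. It also guards against re.search returning None, which is what caused the AttributeError in the question.

```python
import re

# Hypothetical inputs: one holds a real newline, the other holds the two
# literal characters backslash + "n".
with_newline = '"S": ["Something\n13/8-2018 09:00 to 11:30'
with_literal = r'"S": ["Something\n13/8-2018 09:00 to 11:30'

# Raw pattern: \[ reaches the regex engine unchanged. In the lookahead,
# \\n means "a backslash followed by n" and \n means "a newline".
pattern = re.compile(r'(?<=\[")(\D+?)(?=\\n|\n)')

def first_words(text):
    """Return the non-digit text after [" or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None  # avoid calling .group() on None

print(first_words(with_newline))
print(first_words(with_literal))
print(first_words("no bracket here"))
```

Checking the match object before calling .group() is what lets this run over many strings without stopping at the first one that does not match.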

Related

How to find string between '\begin{minipage}' and '\end{minipage}' by python re?

I have tried the following code:
strReFindString = u"\\begin{minipage}"+"(.*?)"
strReFindString += u"\\end{minipage}"
lst = re.findall(strReFindString, strBuffer, re.DOTALL)
But it always returns empty list.
How can I do?
Thanks all.
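As @BrenBarn said, u"\\b" is the two-character string \b, and the regex engine reads \b as a word-boundary assertion, not as a literal backslash followed by b, so the pattern never matches the text \begin. u"\\\\b" is the string \\b, which the regex engine reads as a literal backslash followed by a literal b. A raw string avoids the first layer of escape processing: ur"\\b" is equal to u"\\\\b" (the ur prefix exists only in Python 2; in Python 3 write r"\\b"):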
ur"\\b" == u"\\\\b"
# => True
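An alternative that avoids counting backslashes is to let re.escape build the literal parts of the pattern. This is only a sketch: the sample text in buffer is made up, and the variable names are illustrative.

```python
import re

# Hypothetical LaTeX input, written as a raw string so the backslashes
# stay literal.
buffer = r"""
\begin{minipage}
first block
\end{minipage}
text
\begin{minipage}second\end{minipage}
"""

# re.escape turns the delimiters into patterns that match them literally,
# so \b is no longer read as a word boundary.
begin = re.escape(r"\begin{minipage}")
end = re.escape(r"\end{minipage}")

# DOTALL lets . match newlines; .*? keeps each match as short as possible.
blocks = re.findall(begin + r"(.*?)" + end, buffer, re.DOTALL)
print(blocks)
```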

How to replace .. in a string in python

I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
The output I get is: haha hehe hoho
desired output
haha hehe.hoho
What am i doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside [] is an alternative: the class matches any single one of them.
The dot character has a special meaning in a regex (outside a character class); to match a literal dot, prefix it with the escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified range.
. Matches any single character except "\n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = r'\.\.' # If you want to replace exactly 2 dots
pattern = r'\.{2,}' # If you want to replace runs of at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, I find the replace() method much simpler:
s = "haha..hehe.hoho"
print(s.replace('..',' '))
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)
In a regex, [..+]+ means "match any character from the list, one or more times". So it matches the single . as well as the .. in your input. Make the change as below:
s = re.sub('\.\.+',' ', s)
[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)
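Here is what your regex [..+]+ means: a character class containing a literal period and a literal plus sign, repeated one or more times.
So you allow 1 or more literal periods or plus symbols, which is not what you want.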
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print(s)
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
If the input can contain runs of more than 2 periods, use the regex \.{2,} instead.
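To compare the patterns suggested in the answers above side by side, here is a small sketch. The input string is extended with a run of three dots (an addition for illustration, not from the question) so that the difference between \.\. and \.{2,} is visible.

```python
import re

# The question's string plus a hypothetical run of three dots.
s = "haha..hehe.hoho...hihi"

# Character class: any run of '.' or '+' characters, even a single dot.
a = re.sub(r'[..+]+', ' ', s)

# Exactly two literal dots per match; a third dot in a run is left over.
b = re.sub(r'\.\.', ' ', s)

# Two or more literal dots; the whole run is replaced at once.
c = re.sub(r'\.{2,}', ' ', s)

print(a)  # haha hehe hoho hihi
print(b)  # haha hehe.hoho .hihi
print(c)  # haha hehe.hoho hihi
```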

RE match fail in python, confuse with the result on regex101

http://regex101.com/r/oU6eI5/1 : the test there seems to work, but when I run it in Python, it matches the whole string.
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', "\1", str)
I want to get "tweqt/".
You need to use a raw string in the replace:
str = "galley/files/tew/tewt/tweqt/"
re.sub('^.+/+([^/]+/$)', r"\1", str)
#                        ^
Otherwise, you get the escaped character \1. For instance on my console, it's a little smiley.
If you somehow don't want to raw your string, you'll have to escape the backslash:
re.sub('^.+/+([^/]+/$)', "\\1", str)
It is also good practice to use raw strings for regex patterns and to keep quotes consistent, so I would advise using:
re.sub(r'^.+/+([^/]+/$)', r'\1', str)
Other notes
It might be simpler to match (using re.search) instead of using re.sub:
re.search(r'[^/]+/$', str).group()
# => tweqt/
You might also want to use a variable name other than str, because assigning to it shadows the built-in str type.
It would be better if you define the pattern or regex as raw string.
>>> import re
>>> s = "galley/files/tew/tewt/tweqt/"
>>> m = re.sub(r'^.+/+([^/]+/$)', r'\1', s)
               ^                  ^
>>> m
'tweqt/'
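The difference between "\1" and r"\1" can be checked directly. A short sketch, using the path from the question (the variable names are illustrative):

```python
import re

path = "galley/files/tew/tewt/tweqt/"
pattern = r'^.+/+([^/]+/$)'

# Without the r prefix, Python turns "\1" into one character (code point 1)
# before re.sub ever sees it.
print("\1" == "\x01", len("\1"))

# The whole string matches, so it is replaced by that control character.
plain = re.sub(pattern, "\1", path)

# With a raw string, re.sub receives backslash + "1": a group reference.
raw = re.sub(pattern, r"\1", path)

print(repr(plain))
print(repr(raw))
```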

Python regex, remove all punctuation except hyphen for unicode string

I have this code for removing all punctuation from a regex string:
import regex as re
re.sub(ur"\p{P}+", "", txt)
How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.
[^\P{P}-]+
\P is the complement of \p: not punctuation. So this matches anything that is not (not punctuation or a dash), which is all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
If you want a non-convoluted way, an alternative is \p{P}(?<!-): match all punctuation, and then check it wasn't a dash (using negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
Here's how to do it with the re module, in case you have to stick with the standard libraries:
# works in python 2 and 3
import re
import string
remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; re.escape keeps ], \ and ^ literal inside the class
txt = ")*^%{}[]thi's - is - ###!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'
If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
You could either specify the punctuation you want to remove manually, as in [._,] or supply a function instead of the replacement string:
re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
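Note that string.punctuation only covers ASCII. If you need Unicode punctuation but want to stay in the standard library, one option is to test each character's Unicode category. This is a sketch assuming Python 3; the function name strip_punct_keep_hyphen is made up for illustration.

```python
import unicodedata

def strip_punct_keep_hyphen(text):
    """Drop every character whose Unicode category starts with 'P'
    (the punctuation categories), except the ASCII hyphen."""
    return "".join(
        ch for ch in text
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )

# Guillemets, an en dash and "!" are punctuation; the ASCII hyphen is kept.
sample = "\u00abh\u00e9llo\u00bb \u2013 w\u00f6rld-wide!"
print(strip_punct_keep_hyphen(sample))
```

Unlike the string.punctuation approach, this leaves symbols such as $ or + alone, because they belong to the symbol categories, not the punctuation ones.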

python regex issue with underscore

I am trying to do a string search with regular expressions, where I need to print the string only if the [a-zA-Z_] characters are followed by a " " space. But if there is an underscore at the end, it doesn't wait for the space and executes the command anyway.
if re.search(r".*\s\D+\s", string):
    print(string)
if i keep
string = "abc shot0000 "
it works fine, i do need it to execute it only when the string ends with a space \s.
but if i keep
string = "abc shot0000 _"
then it doesn't wait for the space \s and executes the command.
You're using search, and this function, as the name says, searches anywhere in your string for the pattern. The pattern appears in both of your strings.
You should add a $ to your regular expression to search for the end of string:
if re.search(r".*\s\D+\s$", string):
    print(string)
You need to anchor the RE at the end of the string with $:
if re.search(r".*\s\D+\s$", string):
    print(string)
Use a $:
>>> strs = "abc shot0000 "
>>> re.search(r"\s\w+\s$", strs) #use \w: it'll handle A-Za-z_
<_sre.SRE_Match object at 0xa530100>
>>> strs = "abc shot0000 _"
>>> re.search(r"\s\w+\s$", strs)
#None
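To see the effect of the anchor, here is a small comparison of the \w-based pattern from the last answer with and without $. The two test strings are the ones from the question; the variable names are illustrative.

```python
import re

ends_with_space = "abc shot0000 "
ends_with_underscore = "abc shot0000 _"

# Without $, search succeeds as soon as " shot0000 " appears anywhere.
unanchored = re.compile(r"\s\w+\s")
# With $, the final whitespace must sit at the end of the string.
anchored = re.compile(r"\s\w+\s$")

for text in (ends_with_space, ends_with_underscore):
    print(repr(text),
          bool(unanchored.search(text)),
          bool(anchored.search(text)))
```

Keep in mind that $ also matches just before a trailing newline; use \Z if the match must end at the very last character.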
