Regular expressions and Unicode in Python: difference between sub and findall - python

I am having difficulty trying to figure out a bug in my Python (2.7) script. I am getting a difference between sub and findall in how they recognize special characters.
Here is the code:
>>> re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
u'Castaeda'
>>> re.findall(ur"[^-' ().,\w]+", u'Castañeda', re.UNICODE)
[]
When I use findall, it correctly sees ñ as an alphabetic character, but when I use sub it replaces this--treating it as a non-alphabetic character.
I've been able to get the correct functionality using findall with string.replace, but this seems like a bad solution. Also, I want to use re.split, and I'm having the same problems as with re.sub.
Thanks in advance for the help.

The call signature of re.sub is:
re.sub(pattern, repl, string, count=0, flags=0)
So
re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
is setting count to re.UNICODE (which has value 32), not flags, so the flag is never applied.
Try instead:
In [57]: re.sub(ur"(?u)[^-' ().,\w]+", '', u'Castañeda')
Out[57]: u'Casta\xf1eda'
Placing (?u) at the beginning of the regex is an alternate way to specify the re.UNICODE flag in the regex itself. You can also set the other flags (?iLmsux) this way. (For more info, search the re module documentation for "(?iLmsux)".) In Python 2.7 and later you can also pass the flag by keyword: flags=re.UNICODE.
Similarly, the call signature of re.split is:
re.split(pattern, string, maxsplit=0, flags=0)
Here the third positional argument is maxsplit, not flags, so the solution is the same.
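The count-versus-flags mix-up can be reproduced as follows. This is a minimal sketch that assumes Python 3, where str patterns are already Unicode-aware, so re.IGNORECASE is used to make the effect visible. The sample strings (including 'Castañeda!López') are invented for illustration.

```python
import re

text = "aAaAaA"

# What the buggy call effectively does: the flag's integer value lands in
# `count`, so matching stays case-sensitive and stops after that many hits.
as_count = re.sub("a", "X", text, count=int(re.IGNORECASE))  # count=2

# Passing the flag by keyword applies it as a flag.
as_flag = re.sub("a", "X", text, flags=re.IGNORECASE)

print(as_count)  # XAXAaA
print(as_flag)   # XXXXXX

# The same keyword works for re.split, whose third positional slot is maxsplit.
pattern = r"[^-' ().,\w]+"
cleaned = re.sub(pattern, "", "Castañeda", flags=re.UNICODE)
parts = re.split(pattern, "Castañeda!López", flags=re.UNICODE)
print(cleaned)  # Castañeda
print(parts)    # ['Castañeda', 'López']
```

Passing flags by keyword avoids depending on argument order, which differs between re.sub and re.split.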

Related

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Using Python's re.sub, is there a way I can extract the first alphanumeric characters and disregard the rest from a string that starts with a special character and might have special characters in the middle of the string? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.
re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words, it replaces the whole string with just what was matched by the part of the regexp inside the parentheses. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* is greedy (it matches as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
    print(matchobj.group(1))
else:
    print("did not match")
But the question called for the use of re.sub.
Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
...     print(res.group())
...
my
This is not a complete answer. With re.findall, [A-Za-z]+ will give you ['my', 'name'], so take the first element if you only want 'my'.
Use this to further explore: https://regex101.com/
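The answers above can be combined into a small helper. This is only a sketch assuming Python 3; the names ALNUM_RUN and first_alnum_run are hypothetical and do not come from the original posts.

```python
import re

# First run of ASCII letters/digits, as in the re.search answer above.
ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")

def first_alnum_run(text):
    """Return the first run of ASCII letters/digits in text, or None."""
    match = ALNUM_RUN.search(text)
    return match.group() if match else None

print(first_alnum_run("#my,name"))  # my
print(first_alnum_run("#my"))       # my
print(first_alnum_run("#!?"))       # None

# The re.sub form returns the input unchanged when nothing matches.
sub_no_match = re.sub(r".*?([A-Za-z0-9]+).*", r"\1", "#!?")
print(sub_no_match)  # #!?
```

Returning None makes the "no alphanumeric characters" case explicit, whereas the re.sub form silently hands back the original string.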

How to unescape an escaped regex pattern in Python

I'm trying to unescape the escaped regex pattern to apply it to a string.
The pattern is dynamic, so I don't know exactly what it will look like, but during testing I ran into one problem. The string with the escaped regex pattern looks like this:
\\d{4}
I've written a simple regex which replaces every combination of a backslash and a character with just the character, and I'm applying it this way:
sub(r"\\(.)", "\\1", escaped_pattern)
But what it gives me afterwards is d{4}, not \d{4} as I expect.
I've tried using raw strings for repl and escaping/unescaping it, but it still doesn't return what I expect. I would appreciate any help.
EDIT
escaped_pattern = settings.reg_exp
regexp = sub(r"\\(.)", "\\1", escaped_pattern)
search(regexp, string_to_regexp).group()[0]
Based on your update, I'm pretty sure that you would get exactly your desired output if you just stopped trying to unescape it.
>>> import re
>>> s1 = "1234astring"
>>> matches = re.search("\\d{4}", s1)
>>> matches.group(0)
'1234'
>>> matches.group()[0]
'1'
If the string really does contain two backslashes, try r"\\\\(.)" as the search pattern and r'\\\1' (also a raw string) as the substitution pattern.
works OK here: https://regex101.com/r/M3ikqj/1
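The two situations discussed in these answers can be told apart by checking the string's length instead of its repr. The following is a sketch assuming Python 3; the variable names are illustrative only.

```python
import re

# In source code, "\\d{4}" is a five-character string: one backslash, then d{4}.
pattern_text = "\\d{4}"
print(len(pattern_text))   # 5
print(repr(pattern_text))  # '\\d{4}'  (repr doubles the backslash for display)

# It can be used as a regex directly, with no unescaping step.
match = re.search(pattern_text, "1234astring")
whole = match.group(0)     # '1234'
first = match.group()[0]   # '1' (indexing the matched text, not a group)

# Only a string that truly holds two backslashes needs unescaping.
double_escaped = r"\\d{4}"  # six characters
unescaped = re.sub(r"\\\\(.)", r"\\\1", double_escaped)
print(unescaped == r"\d{4}")  # True
```

Note that group()[0] takes the first character of the match, which is probably not what the code in the question intended.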

How to write a regular expression in Python that accepts alphabets, numbers and a few selected special characters(,.-|;!_?)?

I need a regular expression in Python that accepts letters, numbers and only these special characters (,.-|;!_?).
I have tried solving the problem through the following regular expressions but it didn't work:
'([a-zA-Z0-9,.-|;!_?])$'
'([a-zA-Z0-9][.-|;!_?])$'
Can someone please help me write the regular expression.
I think the following should work (tested on RegExr against Foo123,.-|;!_?):
^[\w,.\-|;!_?]*$
In your regular expressions, you forgot to escape the '-' character, so .-| is interpreted as a range of characters to match against (everything from '.' to '|').
Use this for only one character:
'[a-zA-Z0-9,.\-|;!_?]' or '[\w,.\-|;!_?]'
Use this for all characters:
'[a-zA-Z0-9,.\-|;!_?]*' or '[\w,.\-|;!_?]*'
Use this for an equal check:
'^[a-zA-Z0-9,.\-|;!_?]*$' or '^[\w,.\-|;!_?]*$'
Try this (you should escape - like this \-):
^[a-zA-Z0-9,.\-|;!_?]+$
+ to prevent matching empty strings, to allow them, you can use * instead.
Examples:
>>> import re
>>>
>>> re.match(r'^[a-zA-Z0-9,.\-|;!_?]+$', '12.0')
<_sre.SRE_Match object at 0x00000000027EB850>
>>> re.match(r'^[a-zA-Z0-9,.\-|;!_?]+$', '')
>>>
>>> re.match(r'^[a-zA-Z0-9,.\-|;!_?]+$', 'test!?')
<_sre.SRE_Match object at 0x00000000027EB7E8>
You could use \w (bonus: unicode and locale support!):
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
See Python's documentation. Also, you might want to use a raw string when specifying your regular expression pattern:
m = re.match(r'[\w,.\-|;!?]+', your_string)
Notice the use of + (repeat once or more). You also used $ to match the end of the string but I did not include it in mine. YMMV.
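To see why the unescaped hyphen matters, the two character classes can be compared directly. This is a sketch assuming Python 3 (re.fullmatch requires 3.4 or later); the test strings are made up for illustration.

```python
import re

# Unescaped: ".-|" is a range covering every character from '.' to '|'.
unescaped = re.compile(r"[a-zA-Z0-9,.-|;!_?]+")
# Escaped: "\-" is a literal hyphen.
escaped = re.compile(r"[a-zA-Z0-9,.\-|;!_?]+")

print(bool(unescaped.fullmatch("<>@")))  # True: '<', '>' and '@' fall inside the range
print(bool(unescaped.fullmatch("a-b")))  # False: '-' itself sorts just before '.'
print(bool(escaped.fullmatch("<>@")))    # False
print(bool(escaped.fullmatch("a-b")))    # True
print(bool(escaped.fullmatch("Foo123,.-|;!_?")))  # True
```

re.fullmatch is used here in place of the explicit ^...$ anchors; with re.match you would still need the trailing $ to reject strings that merely start with valid characters.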

Regex in python, repeated fragment finding

I am trying to use a regex to find elements in text like these: abs=abs, 1=1, etc.
I wrote it this way:
opis="Some text abs=abs sfsdvc"
wyn=re.search('([\w]*)=\1',opis)
print(wyn.group(0))
This finds nothing, but when I tried the same pattern on websites like www.regexr.com it worked correctly.
Am I doing something wrong with Python's re?
You must specify the regex as a raw string, r'..'
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search(r'([\w]*)=\1',opis)
>>> print wyn.group(0)
abs=abs
From the re documentation:
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
>>> re.match(r"\W(.)\1\W", " ff ")
>>> re.match("\\W(.)\\1\\W", " ff ")
Meaning, if you are not planning to use a raw string, then every \ in the string must be escaped, as in:
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search('([\\w]*)=\\1',opis)
>>> print wyn.group(0)
abs=abs
Change your regex to:
re.search(r'(\w+)=\1', opis).group()
          ↑
Note that you don't really need a character class here; the [ and ] are redundant. Also, it's better to use \w+ if you don't want to match the string "=" (a lonely equals sign).
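The effect of the missing r prefix can be checked directly. This is a sketch assuming Python 3; the second sample string ("abs=abs 1=1 a=b") is invented to show several candidates at once.

```python
import re

opis = "Some text abs=abs sfsdvc"

# In a normal string literal, '\1' is the single control character chr(1),
# so the regex engine never sees a backreference.
print('\1' == chr(1))   # True
print(len(r'\1'))       # 2: a backslash and the digit 1

not_raw = re.search('(\\w+)=' + '\1', opis)  # looks for '=' followed by chr(1)
raw = re.search(r'(\w+)=\1', opis)

print(not_raw)      # None
print(raw.group())  # abs=abs

# finditer collects every "x=x" element; "a=b" is skipped.
pairs = [m.group(0) for m in re.finditer(r'(\w+)=\1', "abs=abs 1=1 a=b")]
print(pairs)  # ['abs=abs', '1=1']
```

Online testers such as regexr receive the pattern text as typed, with no string-literal processing, which is why the same pattern worked there.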

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
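Going slightly beyond the answer above, re.sub also accepts the \g<...> reference syntax and a callable as the replacement, which covers cases where plain \1 is ambiguous or the replacement must be computed. This is a sketch assuming Python 3; the helper name double_number and the group names word and num are hypothetical.

```python
import re

# \g<1> is an unambiguous spelling of \1 (r'lion\10' would mean group 10).
with_zero = re.sub(r'[a-z]*(\d+)', r'lion\g<1>0', 'zebra432')

# Named groups can be referenced as \g<name>.
named = re.sub(r'(?P<word>[a-z]*)(?P<num>\d+)', r'\g<num>-\g<word>', 'zebra432')

# A callable receives the match object and returns the replacement text.
def double_number(match):
    return 'lion' + str(int(match.group(1)) * 2)

doubled = re.sub(r'[a-z]*(\d+)', double_number, 'zebra432')

print(with_zero)  # lion4320
print(named)      # 432-zebra
print(doubled)    # lion864
```

The callable form is the general answer to "use parts of the match in the replacement", since it can apply arbitrary logic to the matched groups.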
