Python Regex find standalone case

I have a string of time for example:
text = '2010; 04/20/2010; 04/2009'
I want to only find the first standalone '2010', but applying the following code:
re.findall(r'\d{4}', text)
will also find the second '2010' embedded in the mm/dd/yyyy format.
Is there a way to achieve this (not using the ';' sign)?

You can use re.search to find only the first occurrence:
>>> import re
>>> text = '2010; 04/20/2010; 04/2009'
>>> re.search(r'\d{4}', text)
<_sre.SRE_Match object; span=(0, 4), match='2010'>
>>> re.search(r'\d{4}', text).group()
'2010'
>>>
From the documentation:
re.search(pattern, string, flags=0)
Scan through string looking for
the first location where the regular expression pattern produces a
match, and return a corresponding match object. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
Emphasis mine.
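Note that re.search only helps here because the standalone year happens to come first in the string. If years embedded in dates should be skipped wherever they occur, one possible approach is a pair of lookarounds that reject a four-digit run with a digit or a slash on either side. This is only a sketch and not part of the answer above; the name standalone_year and the choice of excluded neighbours ([\d/]) are assumptions about the date formats involved:

```python
import re

text = '2010; 04/20/2010; 04/2009'

# Four digits that are neither preceded nor followed by a digit or a slash.
# The characters in [\d/] are an assumption about what "embedded" means here.
standalone_year = re.compile(r'(?<![\d/])\d{4}(?![\d/])')

years = standalone_year.findall(text)
print(years)  # ['2010']
```

This does not rely on the ';' separator, only on what may not sit directly next to the digits.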

I don't know whether you have to use a regex, but str.find() in Python 3 returns the lowest index at which the substring you are looking for starts. Since you presumably know the length of that substring, you can then extract it with a slice in one more line of code. I'm not sure whether it is better or worse than a regex, but it seems a less complex way to do the same thing in this case. See the related Stack Overflow question and the Python documentation for str.find().
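A minimal sketch of that idea, assuming the substring being looked for is known in advance (the variable names are illustrative):

```python
text = '2010; 04/20/2010; 04/2009'
target = '2010'

# str.find returns the lowest index where target starts, or -1 if it is absent.
start = text.find(target)
first = text[start:start + len(target)] if start != -1 else None
print(start, first)  # 0 2010
```

Like re.search, this returns the first occurrence whether or not it is standalone, so it only fits when the standalone value is known to come first.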

Related

Python Regex \pL matching issues

I'm trying to match a list of keywords I have, taking care to include all Latin characters (e.g accented).
Here's an example
import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for "u blah".
I've also tried to use Python's built-in re module, but I cannot get it to work with \pL or \p{Latin} due to "bad escape" errors. I've also tried unicode strings (without the "r"), but despite adding backslashes, as in \\\\pL, I just can't get this to work right.
Note: I'm using Python 3.8
The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never succeed here: (?!\pL) is a negative lookahead that fails the match if the next char is a letter, and the next char to match is the b in blah. Only ^ can ever succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note (?!\pL) already matches a position at the end of string, so ((?!\pL)|$) = (?!\pL).
You should use
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
Note that the re-compatible version of the regex is
(?<![^\W\d_])blah(?![^\W\d_])
See the regex demo.
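As a quick check, the re-compatible pattern can be run against the strings from the question. This is a small sketch using only the standard re module; the names pattern, samples, and results are illustrative:

```python
import re

# [^\W\d_] matches any Unicode letter in Python 3's default (Unicode) mode:
# a word character that is neither a digit nor an underscore.
pattern = re.compile(r'(?<![^\W\d_])blah(?![^\W\d_])')

samples = ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]
results = {s: bool(pattern.search(s)) for s in samples}
for s, matched in results.items():
    print(repr(s), matched)
```

Only "blah u" and "u blah" should report a match, which is the behaviour the question asked for.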

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
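Beyond numbered backreferences, re.sub also accepts \g<name> and \g<number> references, and a function as the replacement. The following sketch illustrates those options on the same input; the group names word and num and the variable names are made up for the example:

```python
import re

# Named groups can be referenced in the replacement with \g<name>.
named = re.sub(r'(?P<word>[a-z]*)(?P<num>\d+)', r'lion\g<num>', 'zebra432')

# \g<1> avoids ambiguity when a digit follows the reference
# (r'\10' would be read as a reference to group 10).
padded = re.sub(r'[a-z]*(\d+)', r'lion\g<1>0', 'zebra432')

# A function receives the match object and returns the replacement string.
doubled = re.sub(r'[a-z]*(\d+)',
                 lambda m: 'lion' + str(int(m.group(1)) * 2),
                 'zebra432')

print(named, padded, doubled)  # lion432 lion4320 lion864
```

The function form is the most general: any computation on the matched text can feed the replacement.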

Python re match at specific point in string

If I have a given string s in Python, is it possible to easily check if a regex matches the string starting at a specific position i in the string?
I would rather not slice the entire string from i to the end as it doesn't seem very scalable (ruling out re.match I think).
re.match doesn't support this directly. However, if you pre-compile your regular expression with re.compile (often a good idea anyway), the match and search methods of the compiled pattern object both take an optional pos parameter:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
Example:
import re
s = 'this is a test 4242 did you get it'
pat = re.compile(r'[a-zA-Z]+ ([0-9]+)')
# Start matching at index 10, where the word 'test' begins
print(pat.match(s, 10).group(0))
Output:
test 4242
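The documented difference between pos and slicing can be seen with a '^' anchor. The sketch below uses a shorter illustrative string and hypothetical variable names:

```python
import re

s = 'abc123def'
digits = re.compile(r'\d+')
anchored = re.compile(r'^\d+')

# match() with pos only succeeds if the match starts exactly at index 3.
m = digits.match(s, 3)
print(m.group(), m.span())  # 123 (3, 6)

# '^' still refers to the real start of the string, so this finds nothing...
print(anchored.search(s, 3))  # None

# ...whereas slicing creates a new string whose start is index 3.
print(anchored.search(s[3:]).group())  # 123
```

Note that the span reported with pos is relative to the original string, which is usually what you want when scanning a large input without copying it.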
Although re.match does not support this, the third-party regex module (originally intended to replace the re module) has a treasure trove of new features, including pos and endpos arguments for search, match, sub, and subn. It is not part of the standard library, but it can be installed with pip; at the time this answer was written it worked for Python versions 2.5 through 3.4. Here's an example:
>>> import regex
>>> regex.match(r'\d+', 'abc123def')
>>> regex.match(r'\d+', 'abc123def', pos=3)
<regex.Match object; span=(3, 6), match='123'>
>>> regex.match(r'\d+', 'abc123def', pos=3, endpos=5)
<regex.Match object; span=(3, 5), match='12'>

Python regex - (\w+) results different output when used with complex expression

I have a question about how Python regexes behave. Here is my sample test:
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x7f51c0033210>
>>> re.match(r'(\w+):(\d+)', 'a-b:1')
>>>
Why doesn't the second regex return a match object, when the first one does for a plain string match, even though the string contains a special character?
As I understand it, \w+ matches [a-z,A-Z,_], so I'm not clear why (\w+) returns a match object for the string 'a-b'. How can I check whether the given string doesn't contain any special characters?
Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched a. The character class \w only matches actual word characters (letters, digits, and the underscore), not separators like dashes. So you can't match a-b using just \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the 1. However it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match there is an implicit ^ at the beginning of the expression that makes it only match from the start. So it actually tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use the character class [\w-], for example:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')
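To tie this back to the last part of the question, here is a small sketch contrasting re.match and re.search, plus one possible way to test that a string contains nothing but word characters using re.fullmatch (Python 3.4+). The helper name only_word_chars is hypothetical, and "special character" is assumed to mean anything outside \w:

```python
import re

s = 'a-b:1'
pattern = r'(\w+):(\d+)'

at_start = re.match(pattern, s)   # anchored at index 0: 'a' is followed by '-', not ':'
anywhere = re.search(pattern, s)  # scans forward and finds 'b:1'

print(at_start)           # None
print(anywhere.groups())  # ('b', '1')

# Hypothetical helper: True only if the whole string consists of \w characters.
def only_word_chars(text):
    return re.fullmatch(r'\w+', text) is not None

print(only_word_chars('a-b'))   # False
print(only_word_chars('ab_1'))  # True
```

re.fullmatch requires the pattern to cover the entire string, so a stray dash anywhere makes the check fail.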
The first matches the a: one or more word chars at the start of the string.
The second requires one or more word chars at the start of the string immediately followed by a :, and there aren't any (a is followed by -).
Inside a character class such as [a-zA-Z0-9_] (the ASCII equivalent of \w), a-z and A-Z are ranges; the hyphen is not a literal hyphen in that context. If you do want a literal hyphen, put it as the first or last character of the character class.
Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
match method will return the matched object if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print(re.match(r'(\w+)', 'a-b').group())
will print
a
In the second case ((\w+):(\d+)), the only part of the string that could match is b:1, which is not at the beginning of the string. That's why it returns None.
How can I check whether the given string doesn't contain any special characters?
I would say the second regular expression you used, together with the match function, should be enough. I insist on match, since there are differences between match and search: http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In Perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-capturing groups, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
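The effect of the capturing group on findall can be seen side by side in the sketch below. The sample To: line and the variable names are invented for illustration, and the last call shows re.IGNORECASE as the counterpart of Perl's /i modifier:

```python
import re

text = 'To: support#company.com, 1234567#tickets.company.com'

# One capturing group: findall returns only that group
# ('' where the optional group did not participate in the match).
grouped = re.findall(r'\w+#(tickets\.)?company\.com', text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\w+#(?:tickets\.)?company\.com', text)

# finditer sidesteps the issue: group(0) is always the whole match.
via_iter = [m.group(0)
            for m in re.finditer(r'\w+#(tickets\.)?company\.com', text)]

# Perl's /i modifier corresponds to the re.IGNORECASE flag.
ci = re.findall(r'\w+#(?:tickets\.)?company\.com',
                'Support#Company.COM', flags=re.IGNORECASE)

print(grouped)  # ['', 'tickets.']
print(whole)    # ['support#company.com', '1234567#tickets.company.com']
```

The empty string in grouped explains why the first address appeared to be "not found": it did match, but findall reported only the empty optional group.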
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
# Non-capturing inner group, so findall returns the whole address
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
