Python: Searching for date using regular expression

I am searching for date information in the format 01-JAN-2023 in extracted text, and the following regular expression didn't work. Can \b and \Y be used this way?
import re
rext = 'This is the testing text with 01-Jan-2023'
match = re.search(r"\d\b\Y", rext)
print(match)

import re
rext = 'This is the testing text with 01-Jan-2023'
match = re.search(r"\d+-\w+-\d+", rext)
print(match)
<re.Match object; span=(30, 41), match='01-Jan-2023'>

You can use this regular expression:
match = re.search(r"\d{2}-[a-zA-Z]{3}-\d{4}", rext)
print(match.group())
\d matches a digit (equivalent to [0-9]).
[a-zA-Z] matches an upper- or lower-case letter.
{n} matches the preceding pattern n times.
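To answer the \Y part directly: \b is a valid word-boundary assertion in re, but \Y is not a recognised escape, so the original pattern fails to compile. Below is a rough sketch of one way to combine the pattern above with a validity check. DATE_RE and find_dates are names made up for this example, and %b assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# \Y is not a valid escape in re, so compiling the original pattern fails.
try:
    re.compile(r"\d\b\Y")
except re.error as exc:
    print("invalid pattern:", exc)

# Two digits, a three-letter month, four digits, bounded by word boundaries.
DATE_RE = re.compile(r"\b\d{2}-[A-Za-z]{3}-\d{4}\b")

def find_dates(text):
    """Return every DD-Mon-YYYY substring that is also a real calendar date."""
    found = []
    for m in DATE_RE.finditer(text):
        try:
            # Assumption: English month abbreviations (default C locale).
            found.append(datetime.strptime(m.group(), "%d-%b-%Y").date())
        except ValueError:
            pass  # looks like a date but is not one, e.g. 99-Foo-2023
    return found

print(find_dates('This is the testing text with 01-Jan-2023 and 99-Foo-2023'))
# [datetime.date(2023, 1, 1)]
```

The regex only checks the shape of the text; strptime is what rejects impossible days or unknown month names.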

Related

Replace a substring between two substrings

How can I replace the substring between page1/ and _type-A with 222.6 in the string l below?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note that you used an empty string as the replacement argument. In the snippet above, the parts before and after .*? are captured; in the replacement pattern, \g<1> refers to the first group's value and \2 to the second's. The unambiguous form \g<1> is needed because a digit follows the backreference immediately: \1 followed directly by 222.6 would be read as a reference to a higher-numbered group.
Since the replacement text contains no backslashes, there is no need to preprocess (escape) anything in it.
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by the strings in the assertions, but the assertions themselves are not consumed. This means re.sub looks for text with page1/ in front of it and _type after it, but only replaces the part in between.
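If the delimiters or the replacement text come from elsewhere and may contain regex metacharacters or backslashes, one option is to escape the delimiters and pass a function as the replacement, which re.sub inserts literally. A minimal sketch; replace_between is a hypothetical helper name, not something from the answers above.

```python
import re

def replace_between(text, left, right, replacement):
    """Replace whatever sits between `left` and `right`, keeping both delimiters."""
    # re.escape makes the delimiters literal; the lookbehind stays fixed-width.
    pattern = "(?<=" + re.escape(left) + ").*?(?=" + re.escape(right) + ")"
    # A function replacement is used as-is, so backslashes in `replacement`
    # are not interpreted as group references.
    return re.sub(pattern, lambda m: replacement, text, flags=re.DOTALL)

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
print(replace_between(l, 'page1/', '_type-A', '222.6'))
# https://homepage.com/home/page1/222.6_type-A/go
```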

How to find all matches with a regex where part of the match overlaps

I have a long .txt file and want to find all overlapping matches with a regex.
For example:
test_str = 'ali. veli. ahmet.'
src = re.finditer(r'(\w+\.\s){1,2}', test_str, re.MULTILINE)
print(*src)
This code returns:
<re.Match object; span=(0, 11), match='ali. veli. '>
I need:
['ali. veli', 'veli. ahmet.']
How can I do that with regex?
The (\w+\.\s){1,2} pattern contains a repeated capturing group. Python re does not store every capture it finds; it only keeps the last one in the group memory buffer. In any case, you do not need the repeated capturing group, because you want to extract multiple occurrences of the pattern from a string, and re.finditer or re.findall will do that for you.
Also, the re.MULTILINE flag is not necessary here since there are no ^ or $ anchors in the pattern.
You may get the expected results using
import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']
The pattern means
(?= - start of a positive lookahead
\b - a word boundary (crucial here, it is necessary to only start capturing at word boundaries)
(\w+\.\s+\w+) - Capturing group 1: 1+ word chars, ., 1+ whitespaces and 1+ word chars
) - end of the lookahead.
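Because the whole pattern sits inside a lookahead, each match is zero-width, which is what lets the next search start one position later and find overlapping pairs. A small sketch with re.finditer that also reports where each captured pair starts (the variable names are only illustrative):

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = re.compile(r'(?=\b(\w+\.\s+\w+))')

# Each match itself is empty; the text of interest lives in group 1.
pairs = [(m.start(1), m.group(1)) for m in pattern.finditer(test_str)]
print(pairs)
# [(0, 'ali. veli'), (5, 'veli. ahmet')]
```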

python regex to find alphanumeric string with at least one letter

I am trying to figure out the syntax for a regular expression that matches 4 alphanumeric characters, at least one of which is a letter. Each match must be wrapped in > and <, but I don't want the angle brackets returned.
For example when using re.findall on string >ABCD<>1234<>ABC1<>ABC2 it should return ['ABCD', 'ABC1'].
1234 - doesn't have a letter
ABC2 - is not wrapped with angle brackets
You may use this lookahead-based regex in Python with findall:
(?i)>((?=\d*[a-z])[a-z\d]{4})<
Code:
>>> regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)
>>> s = ">ABCD<>1234<>ABC1<>ABC2"
>>> print (regex.findall(s))
['ABCD', 'ABC1']
RegEx Details:
re.I: Enable ignore case modifier
>: Match literal character >
(: Start capture group
(?=\d*[a-z]): Lookahead to assert we have at least one letter after 0 or more digits
[a-z\d]{4}: Match 4 alphanumeric characters
): End capture group
<: Match literal character <
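import re
sentence = ">ABCD<>1234<>ABC1<>ABC2"
# The lookahead requires a letter after zero or more digits (not only in first position);
# [a-zA-Z\d]{4} then matches exactly four alphanumeric characters.
pattern = r">((?=\d*[a-zA-Z])[a-zA-Z\d]{4})<"
matches = re.findall(pattern, sentence)
print(matches)
# outputs ['ABCD', 'ABC1']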
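To see what the (?=\d*[a-z]) lookahead buys you, it helps to try inputs where the letter is not the first character, or where the content is not purely alphanumeric. A quick sketch using the compiled regex from the first answer; the samples string is made up for illustration.

```python
import re

regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)

# Letter in various positions, all digits, an underscore, and five characters.
samples = ">1ABC<>12A4<>123Z<>0000<>AB_1<>ABCDE<"
result = regex.findall(samples)
print(result)
# ['1ABC', '12A4', '123Z']
```

0000 fails the lookahead, AB_1 fails the character class, and ABCDE fails because the fifth character is not <.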

Match word but ignore end-of-sentence word

My regex search is matching a word that is at the end of the sentence.
>>> needle = 'miss'
>>> needle_regex = r"\b" + needle + r"\b"
>>> haystack = 'Cleveland, Miss. - This is the article'
>>> re.search(needle_regex, haystack, re.IGNORECASE)
<_sre.SRE_Match object; span=(11, 15), match='Miss'>
In this case, "Miss." is actually short for Mississippi and it's not a match. How do I ignore end-of-sentence words but also ensure that
>>> haystack = "Website Miss.com some more text here"
would still be a match.
As already mentioned, language is fuzzy and regex is not a natural language processing tool. A feasible solution is to exclude matches where the word is followed by a punctuation mark (the \p{P} Unicode category) and then whitespace, e.g.
(?!\bmiss\p{P}\s)\bmiss\b
However, to use Unicode property classes with the \p{} syntax we have to use the third-party regex module (an alternative to the standard re module), which supports that feature.
Code Sample:
import regex as re  # third-party module: pip install regex
regex = r"(?!\bmiss\p{P}\s)\bmiss\b"
test_str = ("Cleveland, Miss. - This is the article\n"
            "Website Miss.com")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE | re.UNICODE)
for match in matches:
    print("Match at {start}-{end}: {match}".format(start=match.start(), end=match.end(), match=match.group()))
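If installing the third-party regex module is not an option, a rough approximation with the standard re module is to treat any character that is neither a word character nor whitespace as punctuation. This is a sketch, not an exact equivalent of \p{P} (it also excludes symbols such as $ or +), and NEEDLE_RE is a name made up here.

```python
import re

# "miss" as a whole word, unless followed by a punctuation-like character
# and then whitespace (as in "Miss. - This is ...").
NEEDLE_RE = re.compile(r"\bmiss\b(?![^\w\s]\s)", re.IGNORECASE)

abbreviation = 'Cleveland, Miss. - This is the article'
website = 'Website Miss.com some more text here'

for haystack in (abbreviation, website):
    m = NEEDLE_RE.search(haystack)
    print(repr(haystack), '->', m.group() if m else None)
# 'Cleveland, Miss. - This is the article' -> None
# 'Website Miss.com some more text here' -> Miss
```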

Why this string matches the regular expression?

Why does this string match the pattern?
pattern = """
^Page \d of \d$|
^Group \d Notes$|
^More word lists and tips at http://wwwmajortests.com/word-lists$|
"""
re.match(pattern, "stackoverflow", re.VERBOSE)
As I understand it, the pattern should only match strings like "Page 1 of 1" or "Group 1 Notes".
In your regular expression, there's a trailing |:
# ^More word lists and tips at http://wwwmajortests.com/word-lists$|
#                                                                  ^
The trailing | adds an empty alternative, and an empty pattern matches any string:
>>> import re
>>> re.match('abc|', 'abc')
<_sre.SRE_Match object at 0x7fc63f3ff3d8>
>>> re.match('abc|', 'bbbb')
<_sre.SRE_Match object at 0x7fc63f3ff440>
So, remove the trailing |.
BTW, you don't need ^ because re.match checks for a match only at the beginning of the string.
Also, I recommend using raw strings (r'....') so that backslashes reach the regex engine unchanged.
ADDITIONAL NOTE
\d matches only a single digit. Use \d+ if you also want to match multiple digits.
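One more thing to watch for with re.VERBOSE: unescaped whitespace inside the pattern is ignored, so "Page \d of \d" would not match the spaces in "Page 1 of 1" unless they are escaped (or written as [ ] or \s). A sketch of the broken and repaired patterns under that assumption; the names broken and fixed are illustrative, and the URL alternative is left out for brevity.

```python
import re

# Trailing "|" leaves an empty alternative, which matches any string.
broken = re.compile(r"""
    ^Page\ \d\ of\ \d$ |
    ^Group\ \d\ Notes$ |
""", re.VERBOSE)

# No trailing "|", spaces escaped for VERBOSE mode, \d+ for multi-digit numbers.
fixed = re.compile(r"""
    ^Page\ \d+\ of\ \d+$ |
    ^Group\ \d+\ Notes$
""", re.VERBOSE)

print(bool(broken.match("stackoverflow")))  # True
print(bool(fixed.match("stackoverflow")))   # False
print(bool(fixed.match("Page 1 of 12")))    # True
print(bool(fixed.match("Group 3 Notes")))   # True
```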
