Match word but ignore end-of-sentence word


My regex search is matching a word that is at the end of a sentence.
>>> import re
>>> needle = 'miss'
>>> needle_regex = r"\b" + needle + r"\b"
>>> haystack = 'Cleveland, Miss. - This is the article'
>>> re.search(needle_regex, haystack, re.IGNORECASE)
<_sre.SRE_Match object; span=(11, 15), match='Miss'>
In this case, "Miss." is short for Mississippi, so it should not be a match. How do I ignore end-of-sentence words while ensuring that
>>> haystack = "Website Miss.com some more text here"
would still be a match?

As already mentioned, language is fuzzy and regex is not a natural language processing tool. A feasible solution is to exclude matches that are immediately followed by a punctuation mark (the \p{P} Unicode category) and then a whitespace character, e.g.
(?!\bmiss\p{P}\s)\bmiss\b
However, to use Unicode code point properties with the \p{} syntax, we have to use the regex module (an alternative to the standard re module), which supports that feature.
Code Sample:
import regex as re  # third-party module: pip install regex

regex = r"(?!\bmiss\p{P}\s)\bmiss\b"
test_str = ("Cleveland, Miss. - This is the article\n"
            "Website Miss.com")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE | re.UNICODE)
for match in matches:
    print("Match at {start}-{end}: {match}".format(start=match.start(), end=match.end(), match=match.group()))
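If installing the third-party regex module is not an option, a rough approximation with the standard re module might look like the sketch below. It assumes that "punctuation" can be approximated by [^\w\s] (any character that is neither a word character nor whitespace, which also covers symbols such as $), and it treats the end of the string like whitespace. The helper name find_word is hypothetical.

```python
import re

def find_word(needle, haystack):
    """Search for needle as a whole word, skipping occurrences that are
    followed by one non-word, non-space character and then whitespace
    or the end of the string (e.g. 'Miss. ')."""
    pattern = r"\b" + re.escape(needle) + r"\b(?![^\w\s](?:\s|$))"
    return re.search(pattern, haystack, re.IGNORECASE)

print(find_word("miss", "Cleveland, Miss. - This is the article"))
print(find_word("miss", "Website Miss.com some more text here"))
```

The first call should find nothing, while the second should still match "Miss" because the period is followed by a letter rather than whitespace.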

Related

Python: Searching for date using regular expression

I am searching for date information in the format 01-JAN-2023 in an extracted text, and the following regular expression didn't work. Can \b and \Y be used this way?
import re
rext = 'This is the testing text with 01-Jan-2023'
match = re.search(r"\d\b\Y", rext)
print(match)

import re
rext = 'This is the testing text with 01-Jan-2023'
match = re.search(r"\d+-\w+-\d+", rext)
print(match)
<re.Match object; span=(30, 41), match='01-Jan-2023'>

You can use this regular expression:
match = re.search(r"\d{2}-[a-zA-Z]{3}-\d{4}", rext)
print(match.group())
\d matches a digit (equivalent to [0-9]).
[a-zA-Z] matches an upper- or lower-case letter.
{n} matches the preceding pattern n times.
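Note that the pattern only checks the shape of the text, so a string such as 99-Abc-2023 would also match. As a sketch of one way to validate the hit, the match can be handed to datetime.strptime. The helper name find_date is hypothetical, and the %b directive depends on the active locale (English month abbreviations are assumed here).

```python
import re
from datetime import datetime

def find_date(text):
    """Return the first DD-Mon-YYYY date in text as a date object,
    or None if nothing date-shaped (and valid) is found."""
    m = re.search(r"\b\d{2}-[A-Za-z]{3}-\d{4}\b", text)
    if m is None:
        return None
    try:
        # %d = day of month, %b = abbreviated month name, %Y = 4-digit year
        return datetime.strptime(m.group(), "%d-%b-%Y").date()
    except ValueError:
        # Matched the shape but is not a real calendar date
        return None

print(find_date('This is the testing text with 01-Jan-2023'))
```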

regex to remove every hyphen except between two words

I am cleaning a text and would like to remove all hyphens and special characters, except for the hyphens between two words, such as tic-tacs or popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured

You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
The regex matches every hyphen that is not both preceded and followed by a word character.
It can be read as follows:
-        match '-'
(?!\w)   the following character is not a word character
|        or
(?<!\w)  the preceding character is not a word character
-        match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
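A minimal sketch of how this pattern could be applied with re.sub follows. The function name strip_stray_hyphens is hypothetical, and the sketch only handles hyphens, not the other special characters mentioned in the question.

```python
import re

def strip_stray_hyphens(text):
    """Remove every hyphen that is not surrounded by word characters
    on both sides."""
    return re.sub(r"-(?!\w)|(?<!\w)-", "", text)

print(strip_stray_hyphens("popcorn-flavoured---"))  # popcorn-flavoured
print(strip_stray_hyphens("-tic-tacs-"))            # tic-tacs
```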

As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs
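One difference between the capture-group approach and a lookaround-based pattern shows up with words that contain more than one hyphen. The sketch below uses a made-up input string to illustrate it: the capture group consumes the word on both sides of a hyphen, so the next hyphen in the chain is matched by the -+ alternative and removed.

```python
import re

text = "state-of-the-art---"

# Capture-group approach: "state-of" is kept, the next "-" is dropped
with_group = re.sub(r"(\w+-\w+)|-+", r"\1", text)

# Lookaround approach: only the hyphens themselves are matched
with_lookarounds = re.sub(r"-(?!\w)|(?<!\w)-", "", text)

print(with_group)        # state-ofthe-art
print(with_lookarounds)  # state-of-the-art
```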

You can use findall() to get that part that matches your criteria.
new_text = re.findall(r'\w+-?\w+', text)[0]
Play around with it with other inputs.

You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This group detects a hyphen between two word characters:
(\b[-]\b)
This alternative detects any other hyphen:
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When a hyphen is found between two words, m.group(1) is set, so the hyphen is kept as it is.
Otherwise the match came from [-], m.group(1) is None, and the hyphen is replaced with "", which removes it.

Delete the repetition of a specific word in a row

For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word (only if they occur in a row). Result:
my example string contains example some text
I tried the following code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.

You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
Pattern details:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
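The dynamic variant can be wrapped in a small helper, sketched below. The name collapse_repeats is hypothetical. The second call uses a "word" made of non-word characters, which is the case where \b would get in the way and the lookarounds still work.

```python
import re

def collapse_repeats(text, word):
    """Collapse consecutive whitespace-separated repetitions of word
    into a single occurrence."""
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)

print(collapse_repeats('my example example string contains example some text', 'example'))
print(collapse_repeats('a c++ c++ c++ b', 'c++'))
```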

Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
\w Matches any word character ([a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode letters and digits)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Group 1.
Python code:
import re
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text

You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
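Another regex-free option, sketched here as an alternative, is itertools.groupby, which groups consecutive equal items. As with the generator above, this collapses runs of any repeated word, not only a specific one, and it normalizes whitespace to single spaces.

```python
from itertools import groupby

my_str = 'my example example string contains example some text'

# groupby yields one (key, group) pair per run of identical adjacent words
result = ' '.join(word for word, _ in groupby(my_str.split()))
print(result)  # my example string contains example some text
```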

Why not use the .replace function:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))

Find string after "task-" in a long substring using regex

I have a list of files matching the patterns sub-*_task-XYZabc_run-*_bold.json and sub-*_task-PQRghu_bold.json, for example:
sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json
I intend to find all different task names from the filename. In the above example, dis and fb are the two tasks.
What kind of regex should I use to find TASKNAME from task-TASKNAME in a given filename?

The following regex should do it:
(?<=task-).*?(?=_)
import re

regex = r"(?<=task-).*?(?=_)"
test_str = """sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json"""
# Print the task name found in each filename
for match in re.finditer(regex, test_str):
    print(match.group())
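Since the question asks for the different task names, the matches can be collected into a set. The sketch below assumes the filenames are available as a list of strings. The helper name task_names is hypothetical, and [^_]+ is used in place of .*?(?=_) as an equivalent way to stop at the next underscore.

```python
import re

def task_names(filenames):
    """Return the sorted, de-duplicated task names found after 'task-'."""
    names = set()
    for name in filenames:
        m = re.search(r"(?<=task-)[^_]+", name)
        if m:
            names.add(m.group())
    return sorted(names)

files = [
    "sub-03_task-dis_run-01_bold.json",
    "sub-03_task-dis_run-02_bold.json",
    "sub-03_task-fb_run-01_bold.json",
    "sub-03_task-fb_run-02_bold.json",
]
print(task_names(files))  # ['dis', 'fb']
```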

My regex works on regex101 but doesn't work in python? [duplicate]

So I need to match strings that are surrounded by |. So, the pattern should simply be r"\|([^\|]*)\|", right? And yet:
>>> pattern = r"\|([^\|]*)\|"
>>> re.match(pattern, "|test|")
<_sre.SRE_Match object at 0x10341dd50>
>>> re.match(pattern, " |test|")
>>> re.match(pattern, "asdf|test|")
>>> re.match(pattern, "asdf|test|1234")
>>> re.match(pattern, "|test|1234")
<_sre.SRE_Match object at 0x10341df30>
It's only matching on strings that begin with |? It works just fine on regex101 and this is python 2.7 if it matters. I'm probably just doing something dumb here so any help would be appreciated. Thanks!

re.match will want to match the string starting at the beginning. In your case, you just need the matching element, correct? In that case you can use something like re.search or re.findall, which will find that match anywhere in the string:
>>> re.search(pattern, " |test|").group(0)
'|test|'
>>> re.findall(pattern, " |test|")
['test']
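The difference can be seen side by side in the short sketch below, which reuses the pattern from the question. re.fullmatch (available since Python 3.4) is included for comparison; it requires the whole string to match.

```python
import re

pattern = r"\|([^|]*)\|"
text = "asdf|test|1234"

at_start = re.match(pattern, text)        # anchored at position 0
anywhere = re.search(pattern, text)       # scans the whole string
whole = re.fullmatch(pattern, "|test|")   # must cover the entire string

print(at_start)            # None
print(anywhere.group(1))   # test
print(whole.group(1))      # test
```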

To reproduce what https://regex101.com/ runs, click Code Generator on the left-hand side. This shows the code the website uses. From there you can experiment with the flags, or with whichever re function you need.
Note:
https://regex101.com/ uses re.MULTILINE as default flag
https://regex101.com/ uses re.finditer as default method
import re

regex = r"where"
test_str = "select * from table where t=3;"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capture group (numbered from 1) of the current match
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
Source: Python documentation for the re module.
