regex to remove every hyphen except between two words - python

I am cleaning a text and would like to remove all hyphens and special characters, except for hyphens between two words, such as tic-tacs and popcorn-flavoured.
I wrote the regex below, but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured

You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
Regex demo | Python demo
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
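A minimal sketch of this substitution in Python, using the sample string from the question. The helper name strip_stray_hyphens and the second test string are hypothetical, chosen only for illustration.

```python
import re

# Matches a hyphen that is NOT both preceded and followed by a word character.
STRAY_HYPHEN = re.compile(r'-(?!\w)|(?<!\w)-')

def strip_stray_hyphens(text):
    """Remove hyphens unless they sit between two word characters."""
    return STRAY_HYPHEN.sub('', text)

print(strip_stray_hyphens('popcorn-flavoured---'))  # popcorn-flavoured
print(strip_stray_hyphens('--tic-tacs--'))          # tic-tacs
```

Note that this removes only hyphens; other special characters would need a separate step.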

As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Regex demo | Python demo
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print(result)
Output
popcorn-flavoured
tic-tacs
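One caveat: (\w+-\w+) covers a single hyphen only, so in a word with several hyphens the second hyphen falls to the -+ branch and is removed. The sketch below illustrates this and tries a variant group, (\w+(?:-\w+)+), which is an assumption added here and not part of the answer above. The test string is made up.

```python
import re

single = r"(\w+-\w+)|-+"        # pattern from the answer above
chained = r"(\w+(?:-\w+)+)|-+"  # hypothetical variant: allows repeated "-word" parts

s = "tic-tac-toe---"

# When the -+ branch matches, group 1 is unmatched and \1 expands to an
# empty string (Python 3.5+), which removes the hyphens.
print(re.sub(single, r"\1", s))   # tic-tactoe
print(re.sub(chained, r"\1", s))  # tic-tac-toe
```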

You can use findall() to get the part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Note that this returns only the first match, so it extracts one token rather than cleaning the whole text. Try it with other inputs.

You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When a hyphen is detected between two words, m.group(1) exists, so the hyphen is kept as is.
else "")
This branch applies when the match came from [-]; the hyphen is then replaced with "", which removes it.

Related

How to match regex in python?

describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do
I am trying to extract sg-ezsrzerzer from it (that is, everything from sg- up to the closing double quote). I'm using Python.
I currently have:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-.*\b', a)
print(test)
output is
['sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do']
How do I only get ['sg-ezsrzerzer']?
The pattern (?<=group_id=\>").+?(?=\") would work nicely if the goal is to extract the group_id value within a given string formatted as in your example.
(?<=group_id=\>") Looks behind for the sub-string group_id=>" before the string to be matched.
.+? Matches one or more of any character lazily.
(?=\") Looks ahead for the character " following the match (effectively making the expression .+ match any character except a closing ").
If you only want to extract sub-strings where the group_id starts with sg- then you can simply add this to the matching part of the pattern as follows (?<=group_id=\>")sg\-.+?(?=\")
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
results = re.findall(r'(?<=group_id=\>").+?(?=\")', s)
print(results)
Output
['sg-ezsrzerzer']
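A short sketch of the variant restricted to values starting with sg-, run against the same sample string. The second input is made up here to show that a group_id not starting with sg- is skipped; the unnecessary escapes are dropped, which does not change what the pattern matches.

```python
import re

# Lookbehind for group_id=>" and then a value starting with sg-,
# matched lazily up to the closing double quote.
pattern = r'(?<=group_id=>")sg-.+?(?=")'

s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
other = 'describe aws_security_group({:group_id=>"xx-123", :vpc_id=>"vpc-zfds54zef4s"}) do'  # hypothetical input

print(re.findall(pattern, s))      # ['sg-ezsrzerzer']
print(re.findall(pattern, other))  # []
```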
Of course you could alternatively use re.search instead of re.findall to find the first instance of a sub-string matching the above pattern in a given string - depends on your use case I suppose.
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
result = re.search(r'(?<=group_id=\>").+?(?=\")', s)
if result:
    result = result.group()
print(result)
Output
sg-ezsrzerzer
If you decide to use re.search, you will find that it returns None if no match is found in your input string and an re.Match object if there is one; hence the if statement and the call to result.group() to extract the matching string, if present, in the above example.
The pattern \bsg-.*\b matches too much because .* is greedy: it first matches until the end of the string and then backtracks to the last word boundary, which is between the o and the end of the string.
If you are using re.findall you can also use a capture group instead of lookarounds and the group value will be in the result.
:group_id=>"(sg-[^"\r\n]+)"
The pattern matches:
:group_id=>" Match literally
(sg-[^"\r\n]+) Capture group 1 match sg- and 1+ times any char except " or a newline
" Match the double quote
See a regex demo or a Python demo
For example
import re
pattern = r':group_id=>"(sg-[^"\r\n]+)"'
s = "describe aws_security_group({:group_id=>\"sg-ezsrzerzer\", :vpc_id=>\"vpc-zfds54zef4s\"}) do"
print(re.findall(pattern, s))
Output
['sg-ezsrzerzer']
Match until the first word boundary with \w+:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-\w+', a)
print(test[0])
See Python proof.
EXPLANATION
\b the boundary between a word char (\w) and something that is not a word char
sg- 'sg-'
\w+ word characters (a-z, A-Z, 0-9, _) (1 or more times, matching as many as possible)
Results: sg-ezsrzerzer

Regex to exclude optional words and return as list

I am trying to extract the name and profession as a list of tuples from the below string using regex.
Input string
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
As you can see, the first item is the name, followed by the profession, and this repeats in a comma-separated fashion. The problem is that I want to get rid of the adjectives that come along with the profession, e.g. "amazing" in the example above.
Expected output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
I stripped out the adjective from the text using "replace" and used the below code using "regex" to get the output. But I am looking for a single regex function to avoid running the string replace. I figured that this has something to do with look ahead in regex but couldn't make it work. Any help would be appreciated.
text = text.replace("amazing ", "")
txt_new = re.findall(r"([\w\s]+),([\w\s]+)", text)
If you only want to use word and whitespace characters, this could be another option:
(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)
Explanation
( Capture group 1
\w+(?:\s+\w+)* Match 1+ word chars and optionally repeat 1+ whitespace chars and 1+ word chars
) Close group 1
\s*,\s* Match a comma between optional whitespace chars
(?:\w+\s+)* Optionally repeat 1+ word and 1+ whitespace chars
(\w+) Capture group 2, match 1+ word chars
Regex demo | Python demo
import re
regex = r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)"
s = ("Mr John,Carpenter,Mrs Liza,amazing painter")
print(re.findall(regex, s))
Output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is one regex approach using re.findall:
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
matches = re.findall(r'\s*([^,]+?)\s*,\s*.*?(\S+)\s*(?![^,])', text)
print(matches)
This prints:
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is an explanation of the regex pattern:
\s* match optional whitespace
([^,]+?) match the name
\s* optional whitespace
, first comma
\s* optional whitespace
.*? consume all content up until
(\S+) the last profession word
\s* optional whitespace
(?![^,]) assert that what follows is either comma or the end of the input

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = "The special code is 034567 in this particular case and not 98675"
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of that number.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your pattern does not match because it requires more characters than the text contains. After code, the literal space followed by \s requires two whitespace characters; \w. then matches exactly two characters (a word character followed by any character other than a line break); and \s followed by a literal space again requires two whitespace characters.
You may simply match one or more whitespace characters with \s+ between words and, instead of \w., use \S+ to match any chunk of non-whitespace characters.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
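Since re.findall returns only the captured strings, the start and end index asked for in the question are not available from it. A minimal sketch using re.finditer with the same pattern, which yields match objects that expose span(); the variable name found is chosen here for illustration.

```python
import re

text = "The special code is 034567 in this particular case and not 98675"

# Collect each captured number together with its (start, end) indices.
found = [
    (m.group(1), m.span(1))
    for m in re.finditer(r'\bspecial code (?:\S+\s+)?(\d+)', text)
]
print(found)  # [('034567', (20, 26))]
```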

Regex similar to a Hearst Pattern in Python

I'm trying to come up with a regex similar to the ones listed here for Hearst Patterns in order to get the following results:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Running re.search(regex, sentence) on each of these sentences, I want to match these 2 groups: NP_The_Eleventh_Air_Force and NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))
Neither sentence contains the (, )? part, but the second one does contain a part in parentheses before is, so you could make that part optional instead.
Also move the last closing parenthesis from )) to directly after the first NP_\w+, giving (NP_\w+), to create the first group.
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
Regex demo
If you don't need the space at the end and the comma and space are not present, your pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
See a regex demo | Python demo
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
    print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
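The example above runs only the first sentence. A brief sketch applying the same pattern to both sentences from the question, to check that the optional parenthesised part is skipped; the variable names are chosen here for illustration.

```python
import re

regex = re.compile(r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)")

sentences = [
    "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
]

results = []
for sentence in sentences:
    m = regex.search(sentence)
    # groups() returns both captures as a tuple; None if nothing matched.
    results.append(m.groups() if m else None)

for pair in results:
    print(pair)  # ('NP_The_Eleventh_Air_Force', 'NP_a_Numbered_Air_Force')
```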
I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b"
You can see how it works in this online tool for writing and testing regexes in multiple languages:
https://regex101.com/r/KKH3D3/1/

regular expression python `r'*(.|:|?|!)*'`

I want to accept words that include these characters: '.:?!'
I tried this as pattern: r'*(.|:|?|!)*' but does not work.
How can I implement this using Python's re module?
The code I'd like to run looks like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
    print('yes')
For example, I want to accept these words: '.flask', 'flask.', '!flask', 'flask!', and so on, including words with non-ASCII characters.
So I want to include words such as .日本語 and 日本語. as well.
That's why I wanted to use the * symbol.
If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
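A variant with word boundaries is also possible:
pattern = r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b'
Note, however, that \b only matches between a word character and a non-word character (or the start or end of the string), so this variant does not match when the punctuation is the first or last character of the string, as in '.flask' or 'flask.'.
Explanation:
[.:?!][^ ]* matches words beginning with one of [.:?!], followed by any characters except a space.
[^ ]*[.:?!] matches words made of any characters except a space and ending with one of [.:?!].
\b matches a word boundary.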
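A minimal sketch checking the first pattern against the words from the question. It uses fullmatch rather than re.match, which is a choice made here so that a word like 'fl.ask', with the punctuation only in the middle, is rejected; the helper name is_accepted is hypothetical. It also shows why the pattern in the question fails: a leading * has nothing to repeat, so re.compile raises re.error.

```python
import re

pattern = re.compile(r'[.:?!][^ ]*|[^ ]*[.:?!]')

def is_accepted(word):
    """True if the whole word starts or ends with one of . : ? !"""
    return pattern.fullmatch(word) is not None

for word in ['.flask', 'flask.', '!flask', 'flask!', '.日本語', '日本語.', 'flask', 'fl.ask']:
    print(word, is_accepted(word))

# The pattern from the question is not a valid regex at all.
try:
    re.compile(r'*(.|:|?|!)*')
except re.error as exc:
    print('invalid pattern:', exc)
```

The first six words are accepted; 'flask' and 'fl.ask' are rejected.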
