Use Regex to replace dashes and keep hyphens?

I want to replace dashes with a full stop (.). If the dash appears as a hyphen, it should be ignored. E.g. -ac-ac should become .ac-ac
I started with the following regex: (?<!\s|\-)\-+|\-+(?!\s|\-)

You can use
\B-|-\B
The pattern matches
\B- - a hyphen that is preceded by a non-word char or is at the start of a string
| - or
-\B - a hyphen that is followed by a non-word char or is at the end of a string.
See the Python demo:
import re
text = "-ac-ac"
print( re.sub(r'\B-|-\B', '.', text) )
# => .ac-ac
If you want to narrow this down to a letter-only context, replace \B with negative lookarounds containing a letter pattern:
(?<![^\W\d_])-|-(?![^\W\d_])
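A minimal sketch of how the two variants differ, assuming inputs where a hyphen sits next to digits or underscores. The sample strings and function names are made up for illustration and do not come from the question.

```python
import re

# \B-based pattern: a hyphen is kept when it sits between two word
# characters (letters, digits, or underscore).
WORD_CONTEXT = re.compile(r'\B-|-\B')

# Letter-only pattern: [^\W\d_] matches a single letter, so a hyphen is
# kept only when it sits between two letters.
LETTER_CONTEXT = re.compile(r'(?<![^\W\d_])-|-(?![^\W\d_])')


def dots_word_context(text):
    return WORD_CONTEXT.sub('.', text)


def dots_letter_context(text):
    return LETTER_CONTEXT.sub('.', text)


# Hypothetical sample inputs: the two patterns agree on the first one
# and differ on the other two.
for sample in ["-ac-ac", "1-2", "a_-_b"]:
    print(sample, dots_word_context(sample), dots_letter_context(sample))
```

Which variant fits depends on whether digits and underscores should count as part of a hyphenated word.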

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the regexr site, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))

Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']

For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', string)
This does not work, however, for the string defined in the question's code, which ends with a question mark. That string is inconsistent with the two examples given in the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make them visible.
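The split-based approach can be sketched as follows. The helper name split_groups is an assumption for illustration; the pattern is the one from the answer above.

```python
import re

# Split on runs of spaces that directly follow ')' or directly precede '('.
SPLIT_RX = re.compile(r'(?<=\)) +| +(?=\()')


def split_groups(s):
    return SPLIT_RX.split(s)


print(split_groups("(so) what (are you trying to say)"))
print(split_groups("what (do you mean)"))
# With a trailing '?', the punctuation stays attached to the last item.
print(split_groups("(so) what (are you trying to say)?"))
```

Because the split only happens next to a parenthesis, several plain words in a row outside parentheses stay together as one item.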

Regex to exclude optional words and return as list

I am trying to extract the name and profession as a list of tuples from the below string using regex.
Input string
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
As you can see, the first field is the name, followed by the profession, and this repeats in a comma-separated fashion. The problem is that I want to get rid of the adjectives that come along with the profession, e.g. "amazing" in the example above.
Expected output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
I stripped out the adjective from the text using "replace" and used the below code using "regex" to get the output. But I am looking for a single regex function to avoid running the string replace. I figured that this has something to do with look ahead in regex but couldn't make it work. Any help would be appreciated.
text = text.replace("amazing ", "")
txt_new = re.findall(r"([\w\s]+),([\w\s]+)", text)

If you only want to use word and whitespace characters, this could be another option:
(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)
Explanation
( Capture group 1
\w+(?:\s+\w+)* Match 1+ word chars and optionally repeat 1+ whitespace chars and 1+ word chars
) Close group 1
\s*,\s* Match a comma between optional whitespace chars
(?:\w+\s+)* Optionally repeat 1+ word and 1+ whitespace chars
(\w+) Capture group 2, match 1+ word chars
import re
regex = r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)"
s = ("Mr John,Carpenter,Mrs Liza,amazing painter")
print(re.findall(regex, s))
Output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]

Here is one regex approach using re.findall:
import re
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
matches = re.findall(r'\s*([^,]+?)\s*,\s*.*?(\S+)\s*(?![^,])', text)
print(matches)
This prints:
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is an explanation of the regex pattern:
\s* match optional whitespace
([^,]+?) match the name
\s* optional whitespace
, first comma
\s* optional whitespace
.*? consume all content up until
(\S+) the last profession word
\s* optional whitespace
(?![^,]) assert that what follows is either comma or the end of the input
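As a cross-check, here is a minimal sketch that compares the first regex above with a plain split-based version. The split-based helper is a different technique, not one proposed in the answers, and it assumes an even number of non-empty comma-separated fields. Both function names are made up for illustration.

```python
import re

PAIR_RX = re.compile(r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)")


def pairs_by_regex(text):
    # Group 1 is the name, group 2 the last word of the profession field.
    return PAIR_RX.findall(text)


def pairs_by_split(text):
    # Assumption: fields alternate name, profession, and none are empty.
    fields = [field.strip() for field in text.split(",")]
    names, professions = fields[0::2], fields[1::2]
    return [(name, prof.split()[-1]) for name, prof in zip(names, professions)]


text = "Mr John,Carpenter,Mrs Liza,amazing painter"
print(pairs_by_regex(text))
print(pairs_by_split(text))
```

Both versions keep only the last word of the profession, so a two-word profession would be shortened as well.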

regex to remove every hyphen except between two words

I am cleaning a text and would like to remove all hyphens and special characters, except for hyphens between two words, such as tic-tacs or popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured

You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
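A minimal runnable sketch of this substitution. The helper name strip_stray_hyphens and the second and third inputs are assumptions for illustration.

```python
import re

# A hyphen is removed unless it has a word character on both sides.
STRAY_HYPHEN = re.compile(r'-(?!\w)|(?<!\w)-')


def strip_stray_hyphens(text):
    return STRAY_HYPHEN.sub('', text)


print(strip_stray_hyphens('popcorn-flavoured---'))
print(strip_stray_hyphens('-tic-tacs-'))
```

This only handles hyphens; other special characters mentioned in the question would need a separate step.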

As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print(result)
Output
popcorn-flavoured
tic-tacs
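One caveat worth checking is sketched below, under the assumption that the input may contain chained hyphenated words such as a-b-c, which the question does not cover. The capture-group pattern consumes "a-b" as one match, so the next hyphen is no longer seen as sitting between two words. The function names are made up for illustration.

```python
import re


def keep_by_capture(text):
    # Pattern from the answer above: keep "word-word", drop other hyphen runs.
    # An unmatched group 1 is replaced with an empty string (Python 3.5+).
    return re.sub(r"(\w+-\w+)|-+", r"\1", text)


def keep_by_lookaround(text):
    # Lookaround pattern from the earlier answer: matches do not consume
    # the neighbouring word characters.
    return re.sub(r"-(?!\w)|(?<!\w)-", "", text)


for sample in ["popcorn-flavoured---", "a-b-c"]:
    print(sample, keep_by_capture(sample), keep_by_lookaround(sample))
```

If chained hyphens can occur in the text being cleaned, the lookaround form is probably the safer choice.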

You can use findall() to get the part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Note that [0] returns only the first match, so test it against your other inputs.

You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When a hyphen is detected between two words, m.group(1) exists, so we keep the hyphen as it is.
else ""
Occurs when the pattern was triggered by [-]; we then substitute "" for the hyphen, removing it.

Match word boundary before non-alphanumerical character

I want to find words starting with a single non-alphanumerical character, say '$', in a string with re.findall
Example of matching words
$Python
$foo
$any_word123
Example of non-matching words
$$Python
foo
foo$bar
Why \b does not work
If the first character were to be alphanumerical, I could do this.
re.findall(r'\bA\w+', s)
But this does not work for a pattern like \b\$\w+ because \b matches the empty string only between a \w and a \W.
# The line below matches only the last '$baz' which is the one that should not be matched
re.findall(r'\b\$\w+', '$foo $bar x$baz')
The above outputs ['$baz'], but the desired pattern should output ['$foo', '$bar'].
I tried replacing \b with a positive lookbehind using the pattern ^|\s, but this does not work because lookbehinds must be fixed-width in Python's re module.
What is the correct way to handle this pattern?

The following will match a word starting with a single non-alphanumerical character.
re.findall(r'''
(?: # start non-capturing group
^ # start of string
| # or
\s # space character
) # end non-capturing group
( # start capturing group
[^\w\s] # character that is not a word or space character
\w+ # one or more word characters
) # end capturing group
''', s, re.X)
or just:
re.findall(r'(?:^|\s)([^\w\s]\w+)', s)
results in:
'$a $b a$c $$d' -> ['$a', '$b']

One way is to use a negative lookbehind with the non-whitespace metacharacter \S.
s = '$Python $foo foo$bar baz'
re.findall(r'(?<!\S)\$\w+', s) # output: ['$Python', '$foo']
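A minimal sketch that checks the negative-lookbehind pattern against the matching and non-matching words listed in the question, and shows why the variable-width lookbehind fails. The function names and the generalised pattern for any single prefix character are assumptions for illustration.

```python
import re

s = '$Python $foo $any_word123 $$Python foo foo$bar'

# The alternatives ^ (width 0) and \s (width 1) differ in width, so the
# re module rejects this lookbehind at compile time.
try:
    re.compile(r'(?<=^|\s)\$\w+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False


def dollar_words(text):
    # (?<!\S) means "not preceded by a non-whitespace character", which
    # covers both the start of the string and a preceding space.
    return re.findall(r'(?<!\S)\$\w+', text)


def single_prefix_words(text):
    # Assumed generalisation: any one non-word, non-space prefix character.
    return re.findall(r'(?<!\S)[^\w\s]\w+', text)


print(lookbehind_ok)
print(dollar_words(s))
print(single_prefix_words('#tag $foo a#b ##x'))
```

The double negative (?<!\S) is a fixed-width way to express "start of string or whitespace", which is why it works where (?<=^|\s) does not.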

Python regex matching all but last occurrence

I have a string such as "./folder/thisisa.test/file.cxx.h". How do I substitute or remove every "." except the last one?

To match all but the last dot with a regex:
'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).

Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'

Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not immediately followed ((?!...)) with any 0+ chars other than . (see [^.]*) up to the end of the string ($).
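A minimal sketch comparing the three approaches from the answers on the string from the question. The function names are made up for illustration.

```python
import re

path = "./folder/thisisa.test/file.cxx.h"


def drop_dots_lookahead(s):
    # Remove a dot only if another dot follows somewhere later.
    return re.sub(r'\.(?=[^.]*\.)', '', s)


def drop_dots_negative(s):
    # Remove a dot unless only non-dot characters remain up to the end.
    return re.sub(r'\.(?![^.]*$)', '', s)


def drop_dots_count(s):
    # No regex: replace the first (count - 1) dots.
    return s.replace('.', '', s.count('.') - 1)


for func in (drop_dots_lookahead, drop_dots_negative, drop_dots_count):
    print(func.__name__, func(path))
```

All three should agree on this input; the regex versions extend more easily to other characters, while the str.replace version avoids regex entirely.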
Generic solution for 1+ chars
If you want to handle more characters than just ., wrap a character class containing the characters you need in a capturing group, then add a positive lookahead with .* and a backreference to the captured value.
Say you need to remove all but the last occurrence of each of [, ], ^, \, /, - or . you may use
([][^\\./-])(?=.*\1)
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
print(text)
Mind the r prefix on the string literals. Note that flags=re.S makes . match line break characters as well.
