Python Regex: Find character within matched string

I want to use a regex in Python to find all words that start with \.
The regex should then look for a [ within each matched word and replace it with an underscore.
Here is an example:
input_string = SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText
output_string = SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText
The following python code replaces the square bracket with the underscore:
output_string = re.sub(<regex>, '_', input_string)
I have written this regex to find words that start with \ :
\\[^\s]+
https://regex101.com/r/d4YO9K/1
But now I don't know how to find the square bracket.
Can someone please contribute some ideas how to solve this problem?

You can define a function that takes the match object and returns the replacement string:
def rep(m):
    return m.group(0).replace("[", "_")
And pass it as the replacement parameter to re.sub:
re.sub(r"\\\S+", rep, "abc \\xyz[0] def")
'abc \\xyz_0] def'
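Applied to the example input from the question, the same approach can be sketched as follows. The helper name replace_bracket is only an illustrative choice, and the input is written as a raw string so that the backslashes stay literal.

```python
import re

def replace_bracket(match):
    # Only the text of a backslash-prefixed word reaches this function,
    # so brackets in other words are never touched.
    return match.group(0).replace("[", "_")

input_string = r"SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText"
# \\ matches a literal backslash, \S+ the rest of the word up to the next whitespace.
output_string = re.sub(r"\\\S+", replace_bracket, input_string)
print(output_string)
# SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText
```

Because the replacement is computed per match, the pattern itself only has to find the word; the bracket handling stays in plain string code.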

You want to match:
r'((?:^| )\\\w*?)\['
And replace with:
r'\1_'
which is whatever is in group 1 followed by a '_'.
( - Start of group 1
(?:^| ) - Match start of string or a space
\\ - Match a backslash
\w*? - Match 0 or more word characters non-greedily
) - End of group 1
\[ - Match a [
Note that you should not be naming your variable input, which is the name of a built-in function.
import re
s = r'SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText'
output = re.sub(r'((?:^| )\\\w*?)\[', r'\1_', s)
print(output)
Prints:
SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText

Related

regex to remove every hyphen except between two words

I am cleaning a text and I would like to remove all the hyphens and special characters. Except for the hyphens between two words such as: tic-tacs, popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured

You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
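A small sketch of this pattern in use. The helper name strip_loose_hyphens and the second sample string are made up for illustration. Note that the pattern only deals with hyphens; other special characters would need a separate step.

```python
import re

# A hyphen that is not followed by a word character,
# or a hyphen that is not preceded by one.
LOOSE_HYPHEN = re.compile(r"-(?!\w)|(?<!\w)-")

def strip_loose_hyphens(text):
    return LOOSE_HYPHEN.sub("", text)

print(strip_loose_hyphens("popcorn-flavoured---"))   # popcorn-flavoured
print(strip_loose_hyphens("-tic-tacs- and --more"))  # tic-tacs and more
```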

As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs

You can use findall() to get the part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Try it with other inputs to check that it fits your data.

You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When a hyphen is detected between two words, m.group(1) exists, so we keep the match as it is.
else "")
This branch applies when the match came from [-]; we substitute "" for the hyphen, which removes it.

python regex to find alphanumeric string with at least one letter

I am trying to figure out the syntax for regular expression that would match 4 alphanumeric characters, where there is at least one letter. Each should be wrapped by: > and < but I wouldn't like to return the angle brackets.
For example when using re.findall on string >ABCD<>1234<>ABC1<>ABC2 it should return ['ABCD', 'ABC1'].
1234 - doesn't have a letter
ABC2 - is not wrapped with angle brackets

You may use this lookahead based regex in python with findall:
(?i)>((?=\d*[a-z])[a-z\d]{4})<
Code:
>>> regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)
>>> s = ">ABCD<>1234<>ABC1<>ABC2"
>>> print (regex.findall(s))
['ABCD', 'ABC1']
RegEx Details:
re.I: Enable ignore case modifier
>: Match literal character >
(: Start capture group
(?=\d*[a-z]): Lookahead to assert we have at least one letter after 0 or more digits
[a-z\d]{4}: Match 4 alphanumeric characters
): End capture group
<: Match literal character <
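To see what the lookahead contributes, here is a small sketch that runs the same pattern on a few additional inputs. The sample strings are made up for illustration.

```python
import re

regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)

# 1ABC and 12a4 contain a letter after some leading digits: accepted.
# 9999 has no letter, AB_1 contains an underscore, abcde is too long: rejected.
samples = ">1ABC<>12a4<>9999<>AB_1<>abcde<"
print(regex.findall(samples))  # ['1ABC', '12a4']
```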

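import re
sentence = ">ABCD<>1234<>ABC1<>ABC2"
# The lookahead requires a letter somewhere in the four characters, not only in first position,
# and the character class restricts the match to alphanumeric characters.
pattern = r">((?=\d*[a-zA-Z])[a-zA-Z\d]{4})<"
m = re.findall(pattern, sentence)
# m is ['ABCD', 'ABC1']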

How to insert space between alphabet characters and numeric character using regex?

I'm trying to insert space between numeric characters and alphabet character so I can convert numeric character to words like :
Input :
subject101
street45
Output :
subject 101
street 45
I tried this one
re.sub('[a-z][\d]|[\d][a-z]',' ','subject101')
but the output was like this :
subjec 01
How can I do it using python?

Try this Regex:
(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)
Replace each match with a space
Explanation:
(?i) - modifier to make the matches case-insensitive
(?<=\d)(?=[a-z]) - finds the position just preceded by a digit and followed by a letter
| - OR
(?<=[a-z])(?=\d) - finds the position just preceded by a letter and followed by a digit
Code output
import re
regex = r"(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)"
test_str = ("subject101\n"
" street45")
subst = " "
result = re.sub(regex, subst, test_str)
if result:
    print(result)

You can use a conditional, (?(group)yes|no), in the regex to check whether the preceding character is a letter or a digit.
Regex: (?<=([a-z])|\d)(?=(?(1)\d|[a-z]))
Python code:
import re
def addSpace(text):
    return re.sub(r'(?<=([a-z])|\d)(?=(?(1)\d|[a-z]))', ' ', text)
Output:
addSpace('subject101')
>>> subject 101
addSpace('101subject')
>>> 101 subject

A way to do this would be to pass a callable to re.sub. This allows you to reuse the matched substring to generate the replacement value.
subject = '101subject101'
s = re.sub(r'[a-zA-Z]\d|\d[a-zA-Z]', lambda m: ' '.join(m.group()), subject )
# s: '101 subject 101'

Delete the repetition of a specific word in a row

For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word (only if they occur in a row). Result:
my example string contains example some text
I tried the following code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.

You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
Pattern details:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
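The dynamic variant can be wrapped in a small reusable function. This is only a sketch: the name collapse_repeats and the second sample string are illustrative, not part of the original answer.

```python
import re

def collapse_repeats(text, word):
    # (?<!\w) and (?!\w) stand in for \b so that "words" that start or
    # end with punctuation (such as ".net") are still handled.
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)

print(collapse_repeats('my example example example string contains example some text', 'example'))
# my example string contains example some text
print(collapse_repeats('a .net .net .net stack', '.net'))
# a .net stack
```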

Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
\w Matches any word character (equal to [a-zA-Z0-9_] with re.ASCII; in Python 3 it also matches Unicode word characters by default)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Group 1.
Python code:
import re
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text

You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'

Why not use the .replace function:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))
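One caveat with the replace approach: a single pass leaves a pair behind when the word occurs three or more times in a row. A minimal sketch that repeats the replacement until nothing changes is shown below; the function name collapse_with_replace is made up for illustration. Unlike the regex answers, it only handles single spaces and does not respect word boundaries.

```python
def collapse_with_replace(text, word):
    doubled = word + " " + word
    # str.replace() works on non-overlapping occurrences, so
    # "w w w" becomes "w w" after one pass; loop until stable.
    while doubled in text:
        text = text.replace(doubled, word)
    return text

print(collapse_with_replace('my example example example string', 'example'))
# my example string
```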

Python regular expression to find letters and numbers

I enter a string and use findall to find the words that consist only of letters and numbers (the number of words to be found is not fixed).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the result is split into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?

Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
The problem with \w is that it includes underscores.
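A minimal sketch of this suggestion in Python, using the input from the question. Note that it would also pick up runs of two or more letters or digits inside longer tokens such as abc=12.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"
# ASCII letters and digits only, at least two in a row.
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)  # ['asdf1234', 'cdef11dfe']
```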

This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
Sample code
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for match in matches:
    print(match.group())
