Python Regex: Find character within matched string - python

I want to use a regex in python to find all words which start with \.
Afterwards, the regex should look for a [ within the matched word and then replace it with an underscore.
Here is an example:
input_string = SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText
output_string = SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText
The following python code replaces the square bracket with the underscore:
output_string = re.sub(<regex>, '_', input_string)
I have written this regex to find words that start with \ :
\\[^\s]+
https://regex101.com/r/d4YO9K/1
But now I don't know how to find the square bracket.
Can someone please contribute some ideas how to solve this problem?

You can define a function that takes the match object and returns the replacement string:
def rep(m):
    return m.group(0).replace("[", "_")
And pass it as the replacement parameter to re.sub:
re.sub(r"\\\S+", rep, "abc \\xyz[0] def")
'abc \\xyz_0] def'
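As a quick check, here is the same idea applied to the input from the question. The helper name replace_bracket is only illustrative, and the third argument to str.replace (limiting the change to the first [ in each matched word) is an assumption about what is wanted when a word contains several brackets.

```python
import re

def replace_bracket(m):
    # m.group(0) is the whole word that starts with a backslash
    return m.group(0).replace("[", "_", 1)

input_string = r"SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText"
output_string = re.sub(r"\\\S+", replace_bracket, input_string)
print(output_string)
```

Words without a leading backslash never reach the function, so their brackets stay untouched.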

You want to match:
r'((?:^| )\\\w*?)\['
And replace with:
r'\1_'
which is whatever is in group 1 followed by a '_'.
( - Start of group 1
(?:^| ) - Match start of string or a space
\\ - Match a backslash
\w*? - Match 0 or more word characters non-greedily
) - End of group 1
\[ - Match a [
See Regex Demo
Note that you should not be naming your variable input, which is the name of a built-in function.
import re
s = r'SomeText \Word[0] Word[3] \SomeText[123] SomeText[10] SomeText'
output = re.sub(r'((?:^| )\\\w*?)\[', r'\1_', s)
print(output)
Prints:
SomeText \Word_0] Word[3] \SomeText_123] SomeText[10] SomeText

Related

regex to remove every hyphen except between two words

I am cleaning a text and I would like to remove all the hyphens and special characters. Except for the hyphens between two words such as: tic-tacs, popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured
You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
Regex demo <¯\_(ツ)_/¯> Python demo
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
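A minimal sketch of that lookaround pattern in use. The function name strip_loose_hyphens and the second sample string are made up for illustration; only hyphens are handled here, not the other special characters mentioned in the question.

```python
import re

# A hyphen is removed when at least one of its neighbours is not a word character.
pattern = re.compile(r"-(?!\w)|(?<!\w)-")

def strip_loose_hyphens(text):
    return pattern.sub("", text)

print(strip_loose_hyphens("popcorn-flavoured---"))  # popcorn-flavoured
print(strip_loose_hyphens("--tic-tacs-"))           # tic-tacs
```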
As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Regex demo | Python demo
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs
You can use findall() to get that part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Play around with it with other inputs.
You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When the hyphen sits between two words, m.group(1) is set, so the hyphen is kept as it is.
else "")
This branch runs when the match came from [-] alone; the hyphen is then replaced with "", which removes it.

python regex to find alphanumeric string with at least one letter

I am trying to figure out the syntax for regular expression that would match 4 alphanumeric characters, where there is at least one letter. Each should be wrapped by: > and < but I wouldn't like to return the angle brackets.
For example when using re.findall on string >ABCD<>1234<>ABC1<>ABC2 it should return ['ABCD', 'ABC1'].
1234 - doesn't have a letter
ABC2 - is not wrapped with angle brackets
You may use this lookahead based regex in python with findall:
(?i)>((?=\d*[a-z])[a-z\d]{4})<
RegEx Demo
Code:
>>> regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)
>>> s = ">ABCD<>1234<>ABC1<>ABC2"
>>> print (regex.findall(s))
['ABCD', 'ABC1']
RegEx Details:
re.I: Enable ignore case modifier
>: Match literal character >
(: Start capture group
(?=\d*[a-z]): Lookahead to assert we have at least one letter after 0 or more digits
[a-z\d]{4}: Match 4 alphanumeric characters
): End capture group
<: Match literal character <
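To see why the lookahead skips over digits first, it may help to try inputs where the letter is not the first character. The sample string below is invented for this sketch and is not from the question.

```python
import re

regex = re.compile(r">((?=\d*[a-z])[a-z\d]{4})<", re.I)

# Letter in a later position, an underscore, and a token that is too short
samples = ">1ABC<>123A<>12_4<>abcd<>AB<"
found = regex.findall(samples)
print(found)  # ['1ABC', '123A', 'abcd']
```

12_4 is rejected because the underscore is neither a letter nor a digit, and AB is rejected because it is shorter than four characters.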
import re
sentence = ">ABCD<>1234<>ABC1<>ABC2"
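# The lookahead requires a letter somewhere among the 4 characters,
# and the character class keeps the match alphanumeric
pattern = r">((?=\d*[a-zA-Z])[a-zA-Z\d]{4})<"
m = re.findall(pattern, sentence)
# outputs ['ABCD', 'ABC1']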

How to insert space between alphabet characters and numeric character using regex?

I'm trying to insert a space between numeric and alphabetic characters so that I can convert the numeric characters to words, like:
Input :
subject101
street45
Output :
subject 101
street 45
I tried this one
re.sub('[a-z][\d]|[\d][a-z]',' ','subject101')
but the output was like this :
subjec 01
How can I do it using python?
Try this Regex:
(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)
Click for Demo
Replace each match with a space
Explanation:
(?i) - modifier to make the matches case-insensitive
(?<=\d)(?=[a-z]) - finds the position just preceded by a digit and followed by a letter
| - OR
(?<=[a-z])(?=\d) - finds the position just preceded by a letter and followed by a digit
Code output
import re
regex = r"(?i)(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)"
test_str = ("subject101\n"
            "street45")
subst = " "
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
You can use a conditional, (?(1)yes|no), in the regex to check whether the preceding character is a letter or a digit.
Regex: (?<=([a-z])|\d)(?=(?(1)\d|[a-z]))
Python code:
def addSpace(text):
    return re.sub(r'(?<=([a-z])|\d)(?=(?(1)\d|[a-z]))', ' ', text)
Output:
addSpace('subject101')
>>> subject 101
addSpace('101subject')
>>> 101 subject
A way to do this would be to pass a callable to re.sub. This allows you to reuse the matched substring to generate the replacement value.
subject = '101subject101'
s = re.sub(r'[a-zA-Z]\d|\d[a-zA-Z]', lambda m: ' '.join(m.group()), subject )
# s: '101 subject 101'
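One thing to keep in mind with the callable approach: each match consumes a letter/digit pair, so a character that sits between two boundaries can only take part in one match. The sketch below compares it with a zero-width lookaround pattern on a short made-up input ('a1b'); the function names are illustrative only.

```python
import re

def space_with_callable(text):
    # Consumes two characters per match, so neighbouring boundaries can be missed
    return re.sub(r'[a-zA-Z]\d|\d[a-zA-Z]', lambda m: ' '.join(m.group()), text)

def space_with_lookarounds(text):
    # Zero-width matches: nothing is consumed, so every boundary is found
    return re.sub(r'(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])', ' ', text)

print(space_with_callable('101subject101'))     # 101 subject 101
print(space_with_lookarounds('101subject101'))  # 101 subject 101
print(space_with_callable('a1b'))               # a 1b
print(space_with_lookarounds('a1b'))            # a 1 b
```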

Delete the repetition of a specific word in a row

For example I have a string:
my_str = 'my example example string contains example some text'
What I want to do is delete all duplicates of a specific word (only if they occur in a row). Result:
my example string contains example some text
I tried next code:
import re
my_str = re.sub(' example +', ' example ', my_str)
or
my_str = re.sub('\[ example ]+', ' example ', my_str)
But it doesn't work.
I know there are a lot of questions about re, but I still can't implement them to my case correctly.
You need to create a group and quantify it:
import re
my_str = 'my example example string contains example some text'
my_str = re.sub(r'\b(example)(?:\s+\1)+\b', r'\1', my_str)
print(my_str) # => my example string contains example some text
# To build the pattern dynamically, if your word is not static
word = "example"
my_str = re.sub(r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word)), r'\1', my_str)
See the Python demo
I added word boundaries as - judging by the spaces in the original code - whole word matches are expected.
See the regex demo here:
\b - word boundary (replaced with (?<!\w) - no word char before the current position is allowed - in the dynamic approach since re.escape might also support "words" like .word. and then \b might stop the regex from matching)
(example) - Group 1 (referred to with \1 from the replacement pattern):
the example word
(?:\s+\1)+ - 1 or more occurrences of
\s+ - 1+ whitespaces
\1 - a backreference to the Group 1 value, that is, an example word
\b - word boundary (replaced with (?!\w) - no word char after the current position is allowed).
Remember that in Python 2.x, you need to use re.U if you need to make \b word boundary Unicode-aware.
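The dynamic variant can be wrapped in a small helper. The name collapse_repeats and the '.word.' sample are assumptions made for this sketch; the last call shows how the \b version finds nothing when the "word" starts and ends with punctuation.

```python
import re

def collapse_repeats(text, word):
    # (?<!\w) and (?!\w) are used instead of \b so that "words" with
    # leading or trailing punctuation still match
    pattern = r'(?<!\w)({})(?:\s+\1)+(?!\w)'.format(re.escape(word))
    return re.sub(pattern, r'\1', text)

print(collapse_repeats('my example example string contains example some text', 'example'))
print(collapse_repeats('a .word. .word. b', '.word.'))

# For comparison: with \b there is no boundary between a space and a dot,
# so nothing is replaced
print(re.sub(r'\b(\.word\.)(?:\s+\1)+\b', r'\1', 'a .word. .word. b'))
```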
Regex: \b(\w+)(?:\s+\1)+\b or \b(example)(?:\s+\1)+\b Substitution: \1
Details:
\b Assert position at a word boundary
\w Matches any word character (equal to [a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode letters and digits by default)
\s Matches any whitespace character
+ Matches between one and unlimited times
\1 Group 1.
Python code:
text = 'my example example string contains example some text'
text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)
Output:
my example string contains example some text
Code demo
You could also do this in pure Python (without a regex), by creating a list of words and then generating a new string - applying your rules.
>>> words = my_str.split()
>>> ' '.join(w for i, w in enumerate(words) if w != words[i-1] or i == 0)
'my example string contains example some text'
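Note that the one-liner above drops any word that repeats its neighbour, not just one specific word. A sketch of a version restricted to a single word follows, together with an itertools.groupby variant that behaves like the one-liner. The function name drop_adjacent_repeats is illustrative; both variants also normalise whitespace because of split and join.

```python
from itertools import groupby

def drop_adjacent_repeats(text, word):
    kept = []
    for w in text.split():
        # Skip the word only when it repeats the word kept just before it
        if w == word and kept and kept[-1] == word:
            continue
        kept.append(w)
    return ' '.join(kept)

my_str = 'my example example string contains example some text'
print(drop_adjacent_repeats(my_str, 'example'))

# groupby collapses every run of equal neighbours, whatever the word
print(' '.join(key for key, _ in groupby(my_str.split())))
```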
Why not use the str.replace method:
my_str = 'my example example string contains example some text'
print(my_str.replace("example example", "example"))
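A single replace call handles one pair per run, so three or more repetitions leave a duplicate behind. A rough sketch of a loop that repeats the replacement until no pair remains (collapse_with_replace is a hypothetical helper name, and the three-word sample is made up):

```python
def collapse_with_replace(text, word):
    pair = word + ' ' + word
    # Repeat until no adjacent pair is left; a single pass leaves one
    # duplicate behind whenever the word occurs three or more times
    while pair in text:
        text = text.replace(pair, word)
    return text

s = 'my example example example string'
print(s.replace('example example', 'example'))  # my example example string
print(collapse_with_replace(s, 'example'))      # my example string
```

Plain substring matching has no notion of word boundaries, so 'counterexample example' would also be collapsed; the regex-based answers avoid that.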

Python regular expression to find letters and numbers

I enter a string and use findall to find the words that consist only of letters and numbers (the number of words to be found is not fixed).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the string is split into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
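In Python the slashes are dropped and the pattern goes into a raw string. A short sketch using the input from the question; note the class is written with A-Z, since a range such as A-z would also admit the punctuation characters that sit between Z and a in ASCII.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"
# Runs of at least two letters or digits; single-character tokens are skipped
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)  # ['asdf1234', 'cdef11dfe']
```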
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
RegEx Demo
Sample code Run online
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print(match.group())
