I would like to remove the dots within a word, so that a.b.c.d becomes abcd, but only under these conditions:
There should be at least 2 dots within the word. For example, a.b remains a.b, but a.b.c is a match.
Each part between the dots must be 1 or 2 letters long. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters).
Here is what I've tried so far:
import re
texts = [
'a.b.c', # Should be: 'abc'
'ab.c.dd.ee', # Should be: 'abcddee'
'a.b' # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)
This selects "any dot followed by 1 or 2 letters", repeated 2 or more times. The selection works fine, but the replacement with the named group keeps only the last repetition and drops the earlier ones.
So, it prints:
ac
abee
a.b
Which is not what I want. I would appreciate any help, thanks.
Starting the match with a dot does not make sure that there is a char a-zA-Z before it.
If you use the named group word in the replacement, it will contain only the value of the last iteration, because a repeated capture group keeps just its final capture.
Instead, you can match 1 or 2 chars a-zA-Z followed by 2 or more repetitions of a dot and 1 or 2 chars a-zA-Z, and remove the dots from the match.
To prevent aaa.b.cc from matching, you could make use of word boundaries \b
\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b
The pattern matches:
\b A word boundary to prevent the word being part of a larger word
[a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
(?: Non capture group
\.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
){2,} Close non capture group and repeat 2 or more times to match at least 2 dots
\b A word boundary
import re
pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
'a.b.c',
'ab.c.dd.ee',
'a.b',
'aaa.b.cc'
]
for s in texts:
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))
Output
abc
abcddee
a.b
aaa.b.cc
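To see why the original attempt printed ac and abee, it can help to inspect the match object directly. The following is a minimal sketch using the asker's pattern; the variable names are only illustrative.

```python
import re

# The asker's original pattern: a repeated group containing a named group.
attempt = re.compile(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}')

m = attempt.search('ab.c.dd.ee')
print(m.group(0))       # the whole match starts at the first dot: .c.dd.ee
print(m.group('word'))  # only the last iteration of the group is kept: ee

# Replacing the whole match with that single capture drops 'c' and 'dd'.
result = attempt.sub(r'\g<word>', 'ab.c.dd.ee')
print(result)           # abee
```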
^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$
You can use this to match the string. If it matches, you can simply remove the dots, for example with str.replace.
See demo.
https://regex101.com/r/BrNBtk/1
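A minimal sketch of that two-step approach in Python might look as follows. The helper name strip_dots is hypothetical, and because the pattern is anchored with ^ and $ and uses [a-z], it assumes each input is a single lowercase word.

```python
import re

# Pattern from the answer above; note it is anchored and lowercase-only.
whole_word = re.compile(r'^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$')

def strip_dots(word):
    """Hypothetical helper: drop the dots only when the whole string matches."""
    return word.replace('.', '') if whole_word.match(word) else word

for w in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(strip_dots(w))
```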
Related
I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire word men, not en.
How do I write the regex in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
If there cannot be anything after the word characters, you can assert a whitespace boundary at the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
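The effect of the whitespace boundary (?!\S) can be seen by comparing both patterns on a string that contains a hyphenated word. This is only a sketch; the sample string is made up for illustration.

```python
import re

sample = "beetles men me-too"  # hypothetical test string

# Without the right-hand boundary, the 'e' of "me-too" is also returned.
loose = re.findall(r"(?<=\b[mb])\w+", sample)
print(loose)   # ['eetles', 'en', 'e']

# With (?!\S), a match must be followed by whitespace or the end of the string.
strict = re.findall(r"(?<=\b[mb])\w+(?!\S)", sample)
print(strict)  # ['eetles', 'en']
```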
You may use
\b[mb](\w+)
See the regex demo.
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
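For the case mentioned in the note, where the prefixes are known only at run time, the regex could be built dynamically along these lines. The prefix list is a hypothetical example, and sorting by length is an added precaution so that a longer prefix is tried before a shorter one it starts with.

```python
import re

prefixes = ["m", "be"]  # hypothetical prefix list

# Escape each prefix and try longer ones first.
alternation = "|".join(re.escape(p) for p in sorted(prefixes, key=len, reverse=True))
rx = re.compile(r'\b(?:' + alternation + r')(\w+)')

found = rx.findall("The words are men and beetles.")
print(found)  # ['en', 'etles']
```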
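(?<=[mb])\w+
You can use the regex above. It matches the word characters that follow an m or b. Note that without a word boundary it also matches after an m or b in the middle of a word.
(?<=[mb]): positive lookbehind asserting that the preceding character is m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+)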
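Whether the lookbehind includes \b makes a visible difference once an m or b appears inside a word. A small comparison, using a made-up sample string:

```python
import re

sample = "number amen men"  # hypothetical test string

# Lookbehind without a word boundary: any m or b qualifies, even mid-word.
no_boundary = re.findall(r"(?<=[mb])\w+", sample)
# Lookbehind with \b: the m or b must be the first character of the word.
with_boundary = re.findall(r"(?<=\b[mb])\w+", sample)

print(no_boundary)    # ['ber', 'en', 'en']
print(with_boundary)  # ['en']
```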
I want to match a regex like this
] prima 1 words 2 words
And not if it's
] prima 1 words 2 words 3 words
My trial is this one:
\]\s*prima\s*1([\w\s]+)\s2([\w\s][^3]+)
But it still matches part of the string that I don't want to match at all. My exclusion is wrong. How can I do this? I need to pass it to re.compile, so it has to be one line.
This pattern will match the example data, but note that \w by itself can also match a digit.
If you want to match 1 or more whitespace characters (which could also match newlines), you could use \s+ instead of a space.
^\] prima 1 \w+ 2 \w+$
If you want to match ] prima followed by 1 and 2, each of which can be followed by 1 or more words that cannot start with a digit:
^] prima 1 [^\W\d]\w*(?: [^\W\d]\w*)* 2 [^\W\d]\w*(?: [^\W\d]\w*)*$
^ Start of string
] prima 1 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
2 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
$ End of string
If the following words cannot consist solely of digits, you can use a negative lookahead (?!\d+\b) that checks for digits only
^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* 2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$
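Since the question asks for something that fits in a single re.compile call, here is a short sketch that compiles the negative-lookahead pattern and checks it against the two example strings from the question. The variable names are only illustrative.

```python
import re

# The negative-lookahead pattern, compiled on a single line.
rx = re.compile(r'^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* 2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$')

wanted = "] prima 1 words 2 words"
unwanted = "] prima 1 words 2 words 3 words"

print(bool(rx.match(wanted)))    # True
print(bool(rx.match(unwanted)))  # False
```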
I am working on a piece of Python code. The problem is that I have a list of words that have to be replaced if they occur in the input string. Below are a few samples of the words to be replaced:
block1 with block 1
block2 with block 2
block3 with block 3
sector1 with sector 1
sector2 with sector 2
phase1 with phase 1
pocket1 with pocket 1
and so on, this goes on for a lot of numbers.
I was wondering if there is a regex that could detect these and replace them. This way I won't need to maintain a dictionary of a lot of words.
Is it possible to do so in python?
For your given example, you could capture what comes before the digit in a capturing group, and in the replacement use group 1 followed by a space, like '\1 '
\b([a-z]+)
Which matches
a word boundary \b
capture one or more lowercase characters in a capturing group ([a-z]+)
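As written, \b([a-z]+) does not itself check that a digit follows, so with a replacement of '\1 ' it would add a space after every lowercase word. One possible refinement, which is an assumption and not part of the answer above, is to add a lookahead (?=\d) so that only letters directly followed by a digit are touched:

```python
import re

text = "block1 and sector22 in phase"  # hypothetical input

# (?=\d) is an added assumption: require a digit right after the letters.
spaced = re.sub(r'\b([a-z]+)(?=\d)', r'\1 ', text)
print(spaced)  # block 1 and sector 22 in phase
```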
Search for this regex
(?i)\b(block|sector|phase|pocket)(\d+)
and replace it with
\1 \2
Regex Breakdown
(?i) #Inline modifier for ignore case
\b #Word boundary
( #First Capturing group
block|sector|phase|pocket #Match any of 4 words
)
( #Second Capturing group
\d+ #Match digits
)
Python Code
>>> import re
>>> re.sub(r'(?i)\b(block|sector|phase|pocket)(\d+)', r'\1 \2', inputString)
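A self-contained version of the same substitution is sketched below. The sample input and the names words and input_string are made up for illustration; the alternation is built from a list so it can be extended without editing the pattern by hand.

```python
import re

# The word list is the one from the answer; extend it as needed.
words = ["block", "sector", "phase", "pocket"]
rx = re.compile(r'\b(' + "|".join(map(re.escape, words)) + r')(\d+)', re.IGNORECASE)

input_string = "Block1 is next to sector22 and pocket3"  # hypothetical input
output_string = rx.sub(r'\1 \2', input_string)
print(output_string)  # Block 1 is next to sector 22 and pocket 3
```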
I want to find whether a particular character occurs consecutively within a word of the string, or whether the word contains only numbers, and remove such words. For example,
df
All aaaaaab the best 8965
US issssss is 123 good
qqqq qwerty 1 poiks
lkjh ggggqwe 1234 aqwe iphone5224s
I want to check two conditions: whether a character repeats more than 3 times in a row within a word, and whether a word contains only numbers. I want to remove a word only when it meets one of these conditions.
the following should be the output,
df
All the best
US is good
qwerty poiks
lkjh aqwe iphone5224s
These are my attempts:
re.sub(r'\w[0-9]\w*', df[i]) for numbers, but this does not remove single-digit numbers. For the repeated characters I tried re.sub(r'\w[a-z A-Z]+[a-z A-Z]+[a-z A-Z]+[a-z A-Z]\w*', df[i]), but this removes every word instead of only those with a repeated letter.
Can anybody help me in solving these problems?
I would suggest
\s*\b(?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b)[a-zA-Z\d]+
See the regex demo
Only alphanumeric words are matched with this pattern:
\s* - zero or more whitespaces
\b - word boundary
(?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b) - there must be at least 4 repeated consecutive letters or digits in the word OR the whole word must consist of only digits
[a-zA-Z\d]+ - a word with 1+ letters or digits.
Python demo:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n") # Split to test lines individually
print([p.sub("", x).strip() for x in strs])
# => ['df', 'All the best', 'US is good', 'qwerty poiks', 'lkjh aqwe iphone5224s']
Note that strip() removes any whitespace left at the start or end of the string.
A similar solution in R with a TRE regex:
x <- c("df", "All aaaaaab the best 8965", "US issssss is 123 good ", "qqqq qwerty 1 poiks", "lkjh ggggqwe 1234 aqwe iphone5224s")
p <- " *\\b(?:[[:alnum:]]*([[:alnum:]])\\1{3}[[:alnum:]]*|[0-9]+)\\b"
gsub(p, "", x)
See a demo
Pattern details and demo:
\s* - 0+ whitespaces
\b - a leading word boundary
(?:[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+) - either of the 2 alternatives:
[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]* - 0+ alphanumerics followed with the same 4 alphanumeric chars, followed with 0+ alphanumerics
| - or
[0-9]+ - 1 or more digits
\b - a trailing word boundary
UPDATE:
To also add an option to remove 1-letter words you may use
R (add [[:alpha:]]| to the alternation group): \s*\b(?:[[:alpha:]]|[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+)\b (see demo)
Python lookaround based regex (add [a-zA-Z]\b| to the lookahead group): *\b(?=[a-zA-Z]\b|\d+\b|[a-zA-Z\d]*([a-zA-Z\d])\1{3})[a-zA-Z\d]+
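A quick check of the updated Python pattern is sketched below. It assumes the pattern begins with \s*, as the first regex in this answer does, and the input string is made up for illustration.

```python
import re

# Assumption: the leading part of the Python pattern is \s*, as in the first regex.
p = re.compile(r'\s*\b(?=[a-zA-Z]\b|\d+\b|[a-zA-Z\d]*([a-zA-Z\d])\1{3})[a-zA-Z\d]+')

s = "a qqqq qwerty 1 poiks x"  # hypothetical input with 1-letter words
cleaned = p.sub("", s).strip()
print(cleaned)  # qwerty poiks
```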
Numbers are easy:
re.sub(r'\d+', '', s)
If you want to remove words where the same letter appears twice, you can use capturing groups (see https://docs.python.org/3/library/re.html):
re.sub(r'\w*(\w)\1\w*', '', s)
Putting those together:
re.sub(r'\d+|\w*(\w)\1\w*', '', s)
For example:
>>> re.sub(r'\d+|\w*(\w)\1\w*', '', 'abc abbc 123 a1')
'abc   a'
You may need to clean up spaces afterwards with something like this:
>>> re.sub(r' +', ' ', 'abc   a')
'abc a'
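The combined pattern above also strips digits inside words and treats any doubled letter as a repeat. A possible adaptation to the question's exact conditions, offered as a sketch and not as the answerer's code, anchors both alternatives with \b and requires four identical characters in a row. The helper name clean is hypothetical.

```python
import re

# Assumed adaptation: whole-number words, or words with a char repeated 4+ times.
rx = re.compile(r'\b\d+\b|\b\w*(\w)\1{3}\w*\b')

def clean(line):
    """Hypothetical helper: remove matching words, then collapse whitespace."""
    return ' '.join(rx.sub('', line).split())

print(clean('All aaaaaab the best 8965'))           # All the best
print(clean('lkjh ggggqwe 1234 aqwe iphone5224s'))  # lkjh aqwe iphone5224s
```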
I have the following:
>>> re.sub('(..)+?/story','\\g<1>','money/story')
'mey'
>>>
Why is capture group 1 the first letter and last two letters of money and not the first two letters?
The first capture group does not contain m at all. What is being matched by (..)+?/story is oney/story.
The (..)+? matches an even number of characters, so the following is matched (spaced out to make it clearer):
m o n e y / s t o r y
  ^-^ ^-^
Then the replacement is the first capture group. Something you might not know is that when you have a repeated capture group (in this case (..)+?), then only the last captured group is kept.
To summarise, oney/story is matched, and replaced with ey, so the result is mey.
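The same behaviour can be checked on the match object itself. The sketch below also shows one hypothetical variant that captures the first pair of the match by moving the repetition into a non-capturing group; note that it still cannot include the leading m, because the match starts at o.

```python
import re

m = re.search(r'(..)+?/story', 'money/story')
print(m.start())   # 1, so the leading 'm' is not part of the match
print(m.group(0))  # oney/story
print(m.group(1))  # ey, the last pair captured by the repeated group

# Hypothetical variant: capture the first pair once, repeat the rest uncaptured.
first_pair = re.sub(r'(..)(?:..)*?/story', r'\1', 'money/story')
print(first_pair)  # mon
```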
Because the string money contains 5 letters (an odd number), the match cannot include the first letter m. (..)+? captures two characters and non-greedily repeats the pattern one or more times. Because the repetition quantifier + follows the capturing group, the group ends up holding only the last two characters matched by (..)+?. So the captured string is ey, not the first pair. Replacing all the matched characters with the contents of group 1, ey, gives you mey.