I am working on a piece of Python code. I have a list of words that have to be replaced if they occur in the input string. Below are a few samples of the words to be replaced:
block1 with block 1
block2 with block 2
block3 with block 3
sector1 with sector 1
sector2 with sector 2
phase1 with phase 1
pocket1 with pocket 1
and so on, this goes on for a lot of numbers.
I was wondering if there is a regex that could detect these and replace them. That way I won't need to maintain a dictionary with a lot of words.
Is it possible to do so in Python?
For your given example, you could capture what comes before the digit in a capturing group, and in the replacement use group 1 followed by a space, i.e. "\1 " (with a trailing space).
\b([a-z]+)(?=\d)
Which matches
a word boundary \b
one or more lowercase characters in a capturing group ([a-z]+)
a positive lookahead (?=\d) asserting that a digit follows directly
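A minimal sketch of this replacement in Python. It assumes the letters must be directly followed by a digit, expressed here with a lookahead (?=\d); the function name split_trailing_number and the sample text are made up for illustration.

```python
import re

# Assumption: a word to fix is lowercase letters directly followed by a digit.
pattern = re.compile(r"\b([a-z]+)(?=\d)")

def split_trailing_number(text):
    # "\1 " re-inserts the captured letters followed by one space.
    return pattern.sub(r"\1 ", text)

print(split_trailing_number("block1 sector22 phase 3 plain"))
# block 1 sector 22 phase 3 plain
```

Words that are not followed by a digit are left untouched, so no keyword dictionary is needed.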
Search for this regex
(?i)\b(block|sector|phase|pocket)(\d+)
and replace it with
\1 \2
Regex Breakdown
(?i) #Inline modifier for ignore case
\b #Word boundary
( #First Capturing group
block|sector|phase|pocket #Match any of 4 words
)
( #Second Capturing group
\d+ #Match digits
)
Python Code
>>> import re
>>> re.sub(r'(?i)\b(block|sector|phase|pocket)(\d+)', r'\1 \2', inputString)
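As a self-contained variant of the snippet above, the alternation could be built from a list instead of being typed by hand. This is only a sketch: KEYWORDS, add_space and the sample sentence are hypothetical names and data.

```python
import re

# Hypothetical keyword list; extend it as needed.
KEYWORDS = ["block", "sector", "phase", "pocket"]

# re.escape guards against keywords that contain regex metacharacters.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")(\d+)",
    re.IGNORECASE,
)

def add_space(input_string):
    # \1 is the keyword as written in the input, \2 the digits.
    return pattern.sub(r"\1 \2", input_string)

print(add_space("Block1, sector22 and rocket3"))
# Block 1, sector 22 and rocket3
```

Because the keyword is captured rather than retyped in the replacement, its original casing is preserved.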
Related
I would like to remove the dots within a word, such that a.b.c.d becomes abcd, but under some conditions:
There should be at least 2 dots within the word. For example, a.b remains a.b, but a.b.c is a match.
This should match on 1 or 2 letters only. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters)
Here is what I've tried so far:
import re
texts = [
    'a.b.c',  # Should be: 'abc'
    'ab.c.dd.ee',  # Should be: 'abcddee'
    'a.b'  # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)
This matches "a dot followed by 1 or 2 letters", repeated 2 or more times. The matching works fine, but the replacement uses only the last repetition of the group; the earlier repetitions are lost.
So, it prints:
ac
abee
a.b
Which is not what I want. I would appreciate any help, thanks.
Starting the match with a dot does not ensure that there is a char a-zA-Z before it.
If you use the named group word in the replacement, it will contain only the value of the last iteration, because a repeated capture group keeps only its last match.
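To see the repeated-group behaviour directly, here is a small sketch that inspects the match object for the pattern from the question. Only the variable names are mine.

```python
import re

# Pattern from the question: a dot plus 1-2 letters, repeated 2 or more times.
pattern = re.compile(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}')

m = pattern.search('a.b.c')
whole_match = m.group()      # everything consumed by all repetitions
last_word = m.group('word')  # only the final repetition is kept

print(whole_match)  # .b.c
print(last_word)    # c
```

Replacing the whole match ".b.c" with "c" is what turns a.b.c into ac.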
You can match 2 or more dots with 1 or 2 times a char a-zA-Z and replace the dots with an empty string when there is a match instead.
To prevent aaa.b.cc from matching, you could make use of word boundaries \b
\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b
The pattern matches:
\b A word boundary to prevent the word being part of a larger word
[a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
(?: Non capture group
\.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
){2,} Close non capture group and repeat 2 or more times to match at least 2 dots
\b A word boundary
import re
pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
    'a.b.c',
    'ab.c.dd.ee',
    'a.b',
    'aaa.b.cc'
]
for s in texts:
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))
Output
abc
abcddee
a.b
aaa.b.cc
^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$
You can use this to match the string. If it matches, you can simply remove the dots with any naive method.
See demo.
https://regex101.com/r/BrNBtk/1
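A rough sketch of that match-then-remove approach. The helper name strip_dots is hypothetical, and the pattern as given accepts lowercase letters only and must match the whole string.

```python
import re

# Regex from the answer above; note the ^ and $ anchors and the [a-z] classes.
pattern = re.compile(r"^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$")

def strip_dots(word):
    # Hypothetical helper: drop the dots only when the whole word qualifies.
    return word.replace(".", "") if pattern.match(word) else word

for word in ["a.b.c", "ab.c.dd.ee", "a.b", "aaa.b.cc"]:
    print(strip_dots(word))
# abc
# abcddee
# a.b
# aaa.b.cc
```

Since the pattern is anchored, this variant works on one word at a time rather than on words embedded in a longer text.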
I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value (100.00 for the second example above). This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
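The precedence issue can be checked with a short sketch that compares both patterns on the Zero Rated line from the question. The variable names are mine.

```python
import re

# Without grouping, | splits the whole regex into two alternatives.
ungrouped = re.compile(r'(.*)Zero.*Rated|%\s*(\S*)')
# With (?: ), the alternation is limited to "Zero.*Rated" or "%".
grouped = re.compile(r'(.*)(?:Zero.*Rated|%)\s*(\S*)')

line = "Unpacking/Unremoval fee Zero Rated 100.00"

ungrouped_groups = ungrouped.search(line).groups()
grouped_groups = grouped.search(line).groups()

print(ungrouped_groups)  # ('Unpacking/Unremoval fee ', None)
print(grouped_groups)    # ('Unpacking/Unremoval fee ', '100.00')
```

Note that on the first example line the grouped pattern captures =126.00 as group 2, because \S* starts right after the %; picking up the final amount needs a more specific pattern.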
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
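The question asks for numeric values, while the output above holds strings. A possible sketch that processes line by line, skips non-matching lines and converts the amounts to float; the "Subtotal" line is a made-up example of a line to skip.

```python
import re

regex = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:,\d{3})*\.\d{2})"
    r"\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
)

lines = [
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00",
    "Unpacking/Unremoval fee Zero Rated 100.00",
    "Subtotal 1,900.00",  # hypothetical line with neither pattern; skipped
]

result = {}
for line in lines:
    m = regex.search(line.strip())
    if m:
        # Remove thousands separators before converting to float.
        result[m.group(1)] = float(m.group(2).replace(",", ""))

print(result)
# {'Handling - Uncrating of 3 crates - USD600 each': 1800.0, 'Unpacking/Unremoval fee': 100.0}
```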
I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
If there cannot be anything after the word characters, you can assert a whitespace boundary on the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
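The effect of the (?!\S) variant shows up when punctuation follows a word. A small sketch with a made-up input:

```python
import re

text = "men, beetles"

# Without the right-hand boundary, "en" is returned although a comma follows.
plain = re.findall(r"(?<=\b[mb])\w+", text)
# With (?!\S), the word characters must be followed by whitespace or the end.
bounded = re.findall(r"(?<=\b[mb])\w+(?!\S)", text)

print(plain)    # ['en', 'eetles']
print(bounded)  # ['eetles']
```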
You may use
\b[mb](\w+)
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print(rx.findall(text)) # => ['en', 'eetles']
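The note above about multi-character prefixes and building the regex dynamically could look roughly like this. The prefix list is hypothetical, and sorting longest-first is my own precaution so that a longer prefix is tried before a shorter one that starts the same way.

```python
import re

# Hypothetical prefix list; "be" shows a multi-character prefix.
prefixes = ["m", "be"]

# Escape each prefix and try longer prefixes first.
alternation = "|".join(
    re.escape(p) for p in sorted(prefixes, key=len, reverse=True)
)
rx = re.compile(r"\b(?:" + alternation + r")(\w+)")

words = rx.findall("The words are men and beetles.")
print(words)  # ['en', 'etles']
```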
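(?<=[mb])\w+
You can use the regex above. It matches the word characters that directly follow an m or b. Note that without a word boundary, the m or b may also sit in the middle of a word.
(?<=[mb]): positive lookbehind asserting that the preceding character is m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+)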
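A quick sketch comparing this lookbehind with the \b variant from the earlier answer, on a made-up input that has an m and b inside a word:

```python
import re

text = "men number beetles"

# No word boundary: the "b" inside "number" (preceded by "m") also triggers a match.
anywhere = re.findall(r"(?<=[mb])\w+", text)
# With \b, only an m or b at the start of a word counts.
word_start = re.findall(r"(?<=\b[mb])\w+", text)

print(anywhere)    # ['en', 'ber', 'eetles']
print(word_start)  # ['en', 'eetles']
```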
I want to match a regex like this
] prima 1 words 2 words
And not if it's
] prima 1 words 2 words 3 words
My attempt is this one:
\]\s*prima\s*1([\w\s]+)\s2([\w\s][^3]+)
But it matches only part of the expression that I don't want to match at all. My exclusion is wrong. How can I do it? I need to pass it to re.compile, so it has to be on one line.
This pattern will match the example data, but note that \w by itself can also match a digit.
If you want to match 1 or more whitespace characters (which could also match newlines), you could use \s+ instead of a space.
^\] prima 1 \w+ 2 \w+$
If you want to match ] prima followed by 1 and 2, each of which can be followed by 1 or more words that cannot start with a digit:
^] prima 1 [^\W\d]\w*(?: [^\W\d]\w*)* 2 [^\W\d]\w*(?: [^\W\d]\w*)*$
^ Start of string
] prima 1 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
2 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
$ End of string
If the following words cannot consist solely of digits, you can use a negative lookahead (?!\d+\b) that checks for digits only
^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* 2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$
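Since the question mentions re.compile, here is a sketch that compiles the lookahead pattern above and checks it against the two sample strings from the question. The variable names are mine, and the pattern is split across two string literals only for readability.

```python
import re

# Pattern from above: words after "1" and "2" may not be pure digits.
rx = re.compile(
    r"^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* "
    r"2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$"
)

samples = [
    "] prima 1 words 2 words",
    "] prima 1 words 2 words 3 words",
]
results = [bool(rx.match(s)) for s in samples]
print(results)  # [True, False]
```

The second string is rejected because the standalone 3 cannot be consumed as a word, so the $ anchor fails.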
I want to get patterns involving complete words, not pieces of words.
E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text and the pattern appears again 1111 123 [word] 555.
This should return
[[12345, 1234567, 123, 1679],[1111, 123, 555]]
I am only tolerating one word between the numbers otherwise the whole string would match.
Also note that it is important to capture that 2 matches were found and so a two-element list was returned.
I am running this in python3.
I have tried:
\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b
but I am not sure how to scale this to an unrestricted number of matches.
re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)
This matches [number] [word] [number] but not any number that might follow with or without a word in between.
Are you expecting re.findall() to return a list of lists? It returns a single flat list (of strings, or of tuples when the pattern has several groups), no matter what regex you use.
One approach is to split your input string into sentences and then loop through them
import re
inputArray = re.split('<pattern>', inputText)
outputArray = []
for item in inputArray:
    # Raw string, so that \b means a word boundary and not a backspace character
    outputArray.append(re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))
The trick is to find a <pattern> to split your input.
You can't do this in one operation with the Python re engine. But you could match the sequence with one match, then extract the digits with another.
This matches the sequence
r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
https://regex101.com/r/73AYLU/1
Explained
(?<! \w ) # Not a word behind
\d+ # Many digits
(?: # Optional word block
(?: # Optional words
[^\S\r\n]+ # Horizontal whitespace
[a-zA-Z] # Starts with a letter
(?: \w* [a-zA-Z] )* # Can be digits in middle, ends with a letter
)? # End words, do once
[^\S\r\n]+ # Horizontal whitespace
\d+ # Many digits
)* # End word block, do many times
(?! \w ) # Not a word ahead
This gets the array of digits from the sequence matched above (use findall)
r"(?<!\S)(\d+)(?!\S)"
https://regex101.com/r/BHov38/1
Explained
(?<! \S ) # Whitespace boundary
( \d+ ) # (1)
(?! \S ) # Whitespace boundary
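Put together, the two steps might look like the sketch below. The input text is a made-up stand-in for the question's example with real words in place of the placeholders, and converting to int is my addition to match the desired output.

```python
import re

# Step 1: one match per run of numbers with at most one word in between.
sequence_rx = re.compile(
    r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
)
# Step 2: pull the standalone numbers out of each matched run.
number_rx = re.compile(r"(?<!\S)(\d+)(?!\S)")

# Hypothetical input modelled on the question's example.
text = ("12345 foo 1234567 bar 123 1679. "
        "Random text and the pattern appears again 1111 123 baz 555.")

result = [
    [int(n) for n in number_rx.findall(m.group())]
    for m in sequence_rx.finditer(text)
]
print(result)  # [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```

A number that stands alone in the text would also be returned as a one-element list; filter on the list length if that is not wanted.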
This is a bit complicated, maybe this expression would be just something to look into:
(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-Za-z\s]+)
and script the rest of the problem for a valid solution.