Using regex to extract characters either side of a match - python

I have a string:
test=' 40 virtual asset service providers law, 2020e section 1 c law 14 of 2020 page 5 cayman islands'
I want to match all occurrences of a digit, then print not just the digit but the three characters either side of the digit.
At the moment, using re I have matched the digits:
print(re.findall(r'\d+', test))
['40', '2020', '1', '14', '2020', '5']
I want it to return:
[' 40 v', 'w, 2020e s', 'aw 14 of', 'of 2020 ', 'ge 5 c']

Use . to capture any character and then {0,3} to capture up to 3 characters on each side
print(re.findall(r'.{0,3}\d+.{0,3}', test))

re.findall(r".{0,3}\d+.{0,3}", test)
The {0,3} "greedy" quantifier matches at most 3 characters.

Here you go:
re.findall('[^0-9]{0,3}[0-9]+[^0-9]{0,3}', test)
[EDIT]
Breaking the pattern down:
'[^0-9]{0,3}' matches up to 3 non-digit characters
'[0-9]+' matches one or more digits
The final pattern '[^0-9]{0,3}[0-9]+[^0-9]{0,3}' matches one or more digits surrounded by up to 3 non-digits on either side.
To reduce confusion, I am in favor of using '[^0-9]{0,3}' instead of '.{0,3}' (as mentioned in other answers) in the pattern, because it states explicitly that non-digits need to be matched. '.' could be confusing because it matches any character (including digits).
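To see the difference in practice, here is a small sketch on a made-up string (the sample text and variable names are assumptions, not taken from the question). When two numbers sit close together, '.' can swallow the second number as context, while '[^0-9]' stops in front of it:

```python
import re

# Hypothetical sample: two numbers separated by a single space.
sample = "x 1 2 y"

dot_matches = re.findall(r".{0,3}\d+.{0,3}", sample)
nondigit_matches = re.findall(r"[^0-9]{0,3}[0-9]+[^0-9]{0,3}", sample)

print(dot_matches)       # ['x 1 2 ']  (the 2 is consumed as trailing context)
print(nondigit_matches)  # ['x 1 ', '2 y']
```

Either way, re.findall returns non-overlapping matches, so context consumed by one match is not available to the next one.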

Related

Remove digits from the string if they are concatenated using Regex

I am trying to remove digits from text only when they are joined to letters or sit between the characters of a word, but not when they are part of dates.
Like
if "21st" then should remain "21st"
But if "alphab24et" should be "alphabet"
but if the digits come separately like "26 alphabets"
then it should remain "26 alphabets" .
I am using the below regex
newString = re.sub(r'[0-9]+', '', newString)
This removes digits wherever they occur, so in the example above it removes 26 as well.
You can match digits that are not enclosed with word boundaries with custom digit boundaries:
import re
newString = 'Like if "21st" then should remain "21st" But if "alphab24et" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .'
print( re.sub(r'\B(?<!\d)[0-9]+\B(?!\d)', '', newString) )
# => Like if "21st" then should remain "21st" But if "alphabet" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .
Details:
\B(?<!\d) - a non-word boundary position with no digit immediately on the left
[0-9]+ - one or more digits
\B(?!\d) - a non-word boundary position with no digit immediately on the right.
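As a quick check of the pattern above, the sketch below wraps it in a helper and runs it on the three cases from the question. The names INNER_DIGITS and strip_inner_digits are hypothetical, chosen only for this illustration:

```python
import re

# Digit runs that have a non-digit word character on each side.
INNER_DIGITS = re.compile(r"\B(?<!\d)[0-9]+\B(?!\d)")

def strip_inner_digits(text):
    """Remove digit runs embedded inside a word and keep everything else."""
    return INNER_DIGITS.sub("", text)

print(strip_inner_digits("alphab24et"))    # alphabet
print(strip_inner_digits("21st"))          # 21st
print(strip_inner_digits("26 alphabets"))  # 26 alphabets
```

Digits at the very end of a word, as in "abc123", are also left alone, because the trailing \B fails at the end of the word.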
I find a way to make my re.sub calls cleaner is to capture the things around my pattern in groups ((...) below), and put them back in the substitute pattern (\1 and \2 below).
In your case you want to catch digit sequences ([0-9]+) that are not surrounded by whitespace (\s, since you want to keep those) or other digits ([0-9]; otherwise the pattern could treat a digit of the number itself as the surrounding character and strip part of a number such as 11st): [^\s0-9]. This gives:
In [1]: re.sub(r"([^\s0-9])[0-9]+([^\s0-9])", r"\1\2", "11 a000b 11 11st x11 11")
Out[1]: '11 ab 11 11st x11 11'
What you should do is add parentheses to define groups and require that the digits be surrounded by characters that are neither whitespace nor digits.
re.sub(r"([^\s\d])\d+([^\s\d])", r'\1\2', newString)
This matches only digits that sit between two characters other than whitespace or digits: that is the [^\s\d] part.

Regex unexpected outcome

import re
pattern =r"[1-9][0-9]{0,2}(?:,\d{3})?(?:,\d{3})?"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a)
Dears, what should I do to get the expected result?
Expected output:
['42', '1,234', '6,368,745']
Actual output:
['42', '1,234', '6,368,745', '12', '34,567', '123', '4']
I was trying to solve this quiz in a book.
How would you write a regex that matches a number with commas for every three digits? It must match the following:
• '42'
• '1,234'
• '6,368,745'
but not like the following:
• '12,34,567' (which has only two digits between the commas)
• '1234' (which lacks commas)
Your help would be much appreciated!
You may use
import re
pattern =r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a) # => ['42', '1,234', '6,368,745']
Regex details
(?<!\d,)(?<!\d) - no digit and no digit + , allowed immediately to the left of the current location
[1-9][0-9]{0,2} - a non-zero digit followed with any zero, one or two digits
(?:,\d{3})* - 0 or more occurrences of a comma and then any three digits
(?!,?\d) - no digit or , + digit allowed immediately to the right of the current location.
You could use the following regular expression.
r'(?<![,\d])[1-9]\d{,2}(?:,\d{3})*(?![,\d])'
with re.findall.
Python's regex engine performs the following operations.
(?<! begin negative lookbehind
[,\d] match ',' or a digit
) end negative lookbehind
[1-9] match a digit other than '0'
\d{0,2} match 0-2 digits
(?: begin non-capture group
,\d{3} match ',' then 3 digits
) end non-capture group
* execute non-capture group 0+ times
(?![,\d]) negative lookahead: the match is not followed by ',' or a digit
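The two patterns from the answers above can be compared side by side. This is only a sketch: the names STRICT and LENIENT and the second sample sentence are assumptions made for illustration. Both give the expected result on the quiz string, but they treat a trailing comma differently:

```python
import re

# Pattern from the answer above: no comma or digit may touch the number.
STRICT = re.compile(r"(?<![,\d])[1-9]\d{0,2}(?:,\d{3})*(?![,\d])")
# Pattern from the earlier answer: a trailing comma is fine unless a digit follows it.
LENIENT = re.compile(r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)")

quiz = "42 1,234 6,368,745 12,34,567 1234"
print(STRICT.findall(quiz))   # ['42', '1,234', '6,368,745']
print(LENIENT.findall(quiz))  # ['42', '1,234', '6,368,745']

# Hypothetical sentence where a comma directly follows the number.
prose = "paid 1,234, then left"
print(STRICT.findall(prose))   # []
print(LENIENT.findall(prose))  # ['1,234']
```

Which behavior is preferable depends on whether the numbers appear in running text, where punctuation may follow them.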

Regex to find continuous characters in the word and remove the word

I want to find the words in a string in which a particular character occurs several times in a row, and the words that contain only numbers, and remove them. For example,
df
All aaaaaab the best 8965
US issssss is 123 good
qqqq qwerty 1 poiks
lkjh ggggqwe 1234 aqwe iphone5224s
I want to check two conditions: whether a character repeats more than 3 times in a row within a word, and whether a word contains only numbers. A word should be removed only if it consists entirely of digits or contains a character repeated more than 3 times in a row.
the following should be the output,
df
All the best
US is good
qwerty poiks
lkjh aqwe iphone5224s
Here is what I tried:
re.sub(r'\w[0-9]\w*', '', df[i]) for numbers, but this does not remove single-digit numbers. For the repeated characters I tried re.sub(r'\w[a-z A-Z]+[a-z A-Z]+[a-z A-Z]+[a-z A-Z]\w*', '', df[i]), but this removes every word instead of only the words with a repeated letter.
Can anybody help me solve these problems?
I would suggest
\s*\b(?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b)[a-zA-Z\d]+
Only alphanumeric words are matched with this pattern:
\s* - zero or more whitespaces
\b - word boundary
(?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b) - there must be at least 4 repeated consecutive letters or digits in the word OR the whole word must consist of only digits
[a-zA-Z\d]+ - a word with 1+ letters or digits.
Python demo:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n") # Split to test lines individually
print([p.sub("", x).strip() for x in strs])
# => ['df', 'All the best', 'US is good', 'qwerty poiks', 'lkjh aqwe iphone5224s']
Note that strip() removes any whitespace left at the start or end of the string.
A similar solution in R with a TRE regex:
x <- c("df", "All aaaaaab the best 8965", "US issssss is 123 good ", "qqqq qwerty 1 poiks", "lkjh ggggqwe 1234 aqwe iphone5224s")
p <- " *\\b(?:[[:alnum:]]*([[:alnum:]])\\1{3}[[:alnum:]]*|[0-9]+)\\b"
gsub(p, "", x)
Pattern details and demo:
\s* - 0+ whitespaces
\b - a leading word boundary
(?:[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+) - either of the 2 alternatives:
[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]* - 0+ alphanumerics followed with the same 4 alphanumeric chars, followed with 0+ alphanumerics
| - or
[0-9]+ - 1 or more digits
\b - a trailing word boundary
UPDATE:
To also add an option to remove 1-letter words you may use
R (add [[:alpha:]]| to the alternation group): \s*\b(?:[[:alpha:]]|[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+)\b (see demo)
Python lookaround based regex (add [a-zA-Z]\b| to the lookahead group): *\b(?=[a-zA-Z]\b|\d+\b|[a-zA-Z\d]*([a-zA-Z\d])\1{3})[a-zA-Z\d]+
Numbers are easy:
re.sub(r'\d+', '', s)
If you want to remove words where the same letter appears twice in a row, you can use capturing groups (see https://docs.python.org/3/library/re.html):
re.sub(r'\w*(\w)\1\w*', '', s)
Putting those together:
re.sub(r'\d+|\w*(\w)\1\w*', '', s)
For example:
>>> re.sub(r'\d+|\w*(\w)\1\w*', '', 'abc abbc 123 a1')
'abc   a'
You may need to clean up spaces afterwards with something like this:
>>> re.sub(r' +', ' ', 'abc   a')
'abc a'
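The idea above can be adapted to the exact conditions in the question (four or more repeats in a row, and digits-only words). The following is a sketch rather than a definitive solution; UNWANTED_WORD and clean_line are hypothetical names, and \w is assumed to be an acceptable definition of a word character:

```python
import re

# Either a digits-only word, or a word containing the same character
# four or more times in a row (\w also covers digits and underscore).
UNWANTED_WORD = re.compile(r"\b\d+\b|\w*(\w)\1{3}\w*")

def clean_line(line):
    without_words = UNWANTED_WORD.sub("", line)
    # split() with no arguments drops the leftover runs of spaces.
    return " ".join(without_words.split())

lines = [
    "All aaaaaab the best 8965",
    "US issssss is 123 good",
    "qqqq qwerty 1 poiks",
    "lkjh ggggqwe 1234 aqwe iphone5224s",
]
for line in lines:
    print(clean_line(line))
# All the best
# US is good
# qwerty poiks
# lkjh aqwe iphone5224s
```

Using \b\d+\b instead of \d+ keeps digits that are part of a longer word, such as iphone5224s.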

Regex for fraction mathematical expressions using python re module

I need a regex to parse a string that contains two fractions and an operation [+, -, *, or /] and return a 5-element tuple containing the numerators, denominators, and operation, using the findall function in the re module.
Example: str = "15/9 + -9/5"
The output should be of the form [("15","9","+","-9","5")]
I was able to come up with this:
pattern = r'-?\d+|\s+\W\s+'
print(re.findall(pattern,str))
Which produces an output of ["15","9"," + ","-9","5"]. But after fiddling with this for some time, I cannot get this into a 5-element tuple and I cannot match the operation without also matching the whitespace around it.
This pattern will work:
(-?\d+)\/(\d+)\s+([+\-*/])\s+(-?\d+)\/(\d+)
#lets walk through it
(-?\d+) #matches any group of digits that may or may not have a `-` sign to group 1
\/ #escape character to match `/`
(\d+) #matches any group of digits to group 2
\s+([+\-*/])\s+ #matches any '+,-,*,/' character and puts only that into group 3 (whitespace is not captured in group)
(-?\d+)\/(\d+) #exactly the same as group 1/2 for groups 4/5
demo for this:
>>> s = "15/9 + -9/5 6/12 * 2/3"
>>> re.findall(r'(-?\d+)\/(\d+)\s([+\-*/])\s(-?\d+)\/(\d+)',s)
[('15', '9', '+', '-9', '5'), ('6', '12', '*', '2', '3')]
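For readability, the same pattern can be written with re.VERBOSE so that the walkthrough lives next to the regex. This is a sketch; FRACTION_EXPR and parse_fractions are hypothetical names introduced here, and the \s+ form from the walkthrough is assumed rather than the single \s used in the demo:

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and # starts a comment.
FRACTION_EXPR = re.compile(
    r"""
    (-?\d+) / (\d+)      # first fraction: numerator (optional minus), denominator
    \s+ ([+\-*/]) \s+    # operator; the surrounding whitespace is not captured
    (-?\d+) / (\d+)      # second fraction, same shape as the first
    """,
    re.VERBOSE,
)

def parse_fractions(text):
    """Return one 5-tuple of strings per fraction expression found."""
    return FRACTION_EXPR.findall(text)

print(parse_fractions("15/9 + -9/5"))
# [('15', '9', '+', '-9', '5')]
```

Because the pattern has five capturing groups, findall returns a list of 5-tuples, one per expression.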
A general way to tokenize a string based on a regexp is this:
import re
pattern = r"\s*(\d+|[/+*-])"
def tokens(x):
    # Collect the captured token from each successive match
    return [m.group(1) for m in re.finditer(pattern, x)]
print(tokens("9 / 4 + 6 "))
Notes:
The regex begins with \s* to pass over any initial whitespace.
The part of the regex which matches the token is enclosed in parens to form a capture.
The different token patterns are in the capture separated by the alternation operation |.
Be careful about using \W since that will also match whitespace.

How should I write this regex in python

I have the string.
st = "12345 hai how r u #3456? Awer12345 7890"
re.findall('([0-9]+)',st)
It should not come like :
['12345', '3456', '12345', '7890']
I should get
['12345','7890']
I should only take the numeric values
and
It should not contain any other chars like alphabets,special chars
No need to use a regular expression:
[i for i in st.split(" ") if i.isdigit()]
Which I think is much more readable than using a regex
Corey's solution is really the right way to go here, but since the question did ask for regex, here is a regex solution that I think is simpler than the others:
re.findall(r'(?<!\S)\d+(?!\S)', st)
And an explanation:
(?<!\S) # Fail if the previous character (if one exists) isn't whitespace
\d+ # Match one or more digits
(?!\S) # Fail if the next character (if one exists) isn't whitespace
Some examples:
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890')
['12345', '7890']
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890123ER%345 234 456 789')
['12345', '234', '456', '789']
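The lookaround pattern and the split-based approach from the earlier answer can be checked against each other. A minimal sketch, with hypothetical function names, assuming the input only uses ASCII digits:

```python
import re

# A digit run with whitespace (or the string edge) on both sides.
STANDALONE_NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")

def numbers_by_regex(text):
    return STANDALONE_NUMBER.findall(text)

def numbers_by_split(text):
    # split() with no argument splits on any run of whitespace.
    return [tok for tok in text.split() if tok.isdigit()]

st = "12345 hai how r u #3456? Awer12345 7890"
print(numbers_by_regex(st))  # ['12345', '7890']
print(numbers_by_split(st))  # ['12345', '7890']
```

The two can disagree on unusual input: str.isdigit() accepts some characters, such as superscript digits, that \d does not match.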
In [21]: re.findall(r'(?:^|\s)(\d+)(?=$|\s)', st)
Out[21]: ['12345', '7890']
Here,
(?:^|\s) is a non-capture group that matches the start of the string, or a space.
(\d+) is a capture group that matches one or more digits.
(?=$|\s) is lookahead assertion that matches the end of the string, or a space, without consuming it.
use this: (^|\s)[0-9]+(\s|$) pattern. (^|\s) means that your number must be at the start of the string or there must be a whitespace character before the number. And (\s|$) means that there must be a whitespace after number or the number is at the end of the string.
As Jan Pöschko said, 456 won't be found in 123 456. If your "bad" parts (#, Awer) are always prefixes, you can use the pattern (^|\s)[0-9]+ and everything will be OK. It will match all numbers that are preceded by whitespace or are at the start of the string. Hope this helps.
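Your expression finds all sequences of digits, regardless of what surrounds them. You need to include a specification of what comes before and after the sequence to get the behavior you want:
re.findall(r"[\D\b](\d+)[\D\b]", st)
In English, it says "match all sequences of one or more digits that have a non-digit character on each side". Note that inside a character class \b means a backspace character rather than a word boundary, so this pattern needs an actual non-digit character on both sides: it skips numbers at the very start or end of the string and still accepts 3456 in #3456?, so the whitespace-based lookarounds in the other answers are the safer choice here.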
