Regex to find repeating numbers even if they are separated - python

I'm trying to create a regular expression that will tell me if I have two or more repeating numbers in a string separated by a comma. For example "10,2,3,4,5,6,7,8,9,10" would return true because there are two tens.
I think I am close. So far I have:
if re.match(r"(\d+),.+\1",line):
Thanks!

You don't need regex for this. Just convert into a list using split, then convert that into a set (which will contain only the unique numbers in the list) and compare the lengths:
line = "10,2,3,4,5,6,7,8,9,10"
lst = line.split(',')
unq = set(lst)
if len(lst) != len(unq):
    # non-unique numbers present
    print('Duplicate numbers present')
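If you also need to know which numbers repeat, one possible extension of the same idea is to count the items. This is only a sketch: the function name find_duplicates is made up for illustration, and it compares the items as strings, so "07" and "7" would be treated as different numbers.

```python
from collections import Counter

def find_duplicates(line):
    """Return the comma-separated items that occur more than once, in first-seen order."""
    counts = Counter(item.strip() for item in line.split(','))
    return [item for item, n in counts.items() if n > 1]

print(find_duplicates("10,2,3,4,5,6,7,8,9,10"))  # ['10']
print(find_duplicates("1,2,3"))                  # []
```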
If you want to use regex, you need to use re.search rather than re.match, because re.match only matches at the start of the string, which would preclude matching 2 in "1,2,3,4,5,6,2,7,8,9,10". You also need to surround your (\d+) with word boundaries (\b), so that the 1 in "1,2,3,4,5,6,2,7,8,9,10" doesn't match against the 1 in 10. This regex will give you the results you want:
import re
m = re.search(r'\b(\d+)\b.*\b\1\b', line)
if m:
    print('Duplicate number ' + m.group(1))
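To see both problems in isolation, the snippet below runs the pattern from the question and the corrected pattern against the same string. It is a small illustration rather than a full test; the sample string is the one used in the explanation above.

```python
import re

line = "1,2,3,4,5,6,2,7,8,9,10"

# The pattern from the question, applied with re.match: it is anchored at the
# start, and without \b the captured "1" is "found again" inside "10".
print(re.match(r"(\d+),.+\1", line).group(1))          # 1 (false positive)

# With word boundaries, re.match finds nothing, because the only real
# duplicate (2) is not at the start of the string.
print(re.match(r"\b(\d+)\b.*\b\1\b", line))            # None

# re.search scans the whole string and finds the real duplicate.
print(re.search(r"\b(\d+)\b.*\b\1\b", line).group(1))  # 2
```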

Related

Select element with special character with regex and Python

From a list of strings ('16','160','1,2','100,11','1','16:','16:00'), I want to keep only the elements that
either have a comma between two digits (e.g. 1,2 or 100,11)
or have two digits (without comma) that are NOT followed by ":" (i.e. followed by nothing: e.g 16, or followed by anything but ":": e.g. 160)
I tried the following code using regex in Python:
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:[\d],[\d]|[\d][\d][^:]*)')
rate = list(filter(pattern_rate.search,string))
print(rate)
Print:
['16', '160', '1,2', '100,11', '16:', '16:00']
To be correct, the script should keep the first three items and reject the rest, but my script fails at rejecting the last two items. I guess I'm using the "[^:]" sign incorrectly.
"To be correct, the script should keep the first three items and reject the rest"
You can match either 2 or more digits, or match 2 digits with a comma in between.
As the list contains only numbers, you could use re.match to start the match at the beginning of the string instead of re.search.
(?:\d{2,}|\d,\d)\Z
Explanation
(?: Non capture group
\d{2,} Match 2 or more digits
| Or
\d,\d Match 2 digits with a comma in between
) Close non capture group
\Z End of string
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:\d{2,}|\d,\d)\Z')
rate = list(filter(pattern_rate.match,string))
print(rate)
Output
['16', '160', '1,2']
I recommend looking a bit deeper into a regex guide.
100 is not a single digit, so a lone \d will not match it. Also, character classes [..] with only one element inside are unnecessary unless you intend to negate or otherwise transform them.
The first condition can be represented by (?:\d+,\d+). It's a non-capturing group that matches two comma-separated numbers, each with a length greater than or equal to one.
Your second pattern matches any two consecutive digits followed by any number (*) of characters that are not colons.
You'll want to use something similar to (?:\d{2,}(?!:)). It's a non-capturing group that matches two or more digits not followed by a colon. ?! designates a negative lookahead.
In your Python code, you'll want to use pattern_rate.match instead of pattern_rate.search, because search may start the match anywhere in the string (for example at the 00 in 16:00), while match only matches at the start of the string. Note that match still allows trailing characters after the match; use pattern_rate.fullmatch if the whole string has to match.
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')
rate = list(filter(pattern_rate.match, string))
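The difference between search, match and fullmatch can be checked directly with the pattern above. This is an illustrative sketch; the list is the one from the question (renamed to items), and '16 apples' is a made-up input added only to show where match and fullmatch disagree.

```python
import re

items = ['16', '160', '1,2', '100,11', '1', '16:', '16:00']
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')

# search may start anywhere, so the "00" in "16:00" is enough to match.
print(list(filter(pattern_rate.search, items)))
# ['16', '160', '1,2', '100,11', '16:00']

# match must start at index 0, but may stop before the end of the string.
print(list(filter(pattern_rate.match, items)))
# ['16', '160', '1,2', '100,11']

# fullmatch must consume the entire string.
print(list(filter(pattern_rate.fullmatch, items)))
# ['16', '160', '1,2', '100,11']

# Hypothetical input: match accepts the leading "16", fullmatch rejects it.
print(bool(pattern_rate.match('16 apples')))      # True
print(bool(pattern_rate.fullmatch('16 apples')))  # False
```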
Not sure you need regex for that:
string = ['16','160','1,2','100,11','1','16:','16:00']
keep = []
for elem in string:
    if ("," in elem and len(elem) == 3) or (":" not in elem and "," not in elem and len(elem) >= 2):
        keep.append(elem)
print(keep)
Output:
['16', '160', '1,2']
Although not as elegant, this tends to be faster than using a regex.

Regular expression to convert given number in the required format

This is my first time using regular expressions, so I need help with a slightly complex one. I have an input list of around 100-150 strings (numbers).
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
Expected output = ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
## I have tried this -
import re
new_list = []
for i in range (0, len(input)):
    new_list.append(re.sub('\d+-\d+-\d+','0000\\1', input[i]))
    ## problem is with second argument '0000\\1'. I know its wrong but unable to solve
print(new_list) ## new_list is the expected output.
As you can see, I need to convert strings of numbers that come in different formats into 15-digit numbers by adding leading zeros to them.
But there is a catch: some numbers, e.g. '000480087800784', are already 15 digits and should be left unchanged. (That's why I cannot use Python's string formatting (.format) option.) A regex has to be used here, which will modify only the required numbers. I have already tried the following answers but have not been able to solve it.
Using regex to add leading zeroes
using regular expression substitution command to insert leading zeros in front of numbers less than 10 in a string of filenames
Regular expression to match defined length number with leading zeros
Your regex does not work as you used \1 in the replacement, but the regex pattern has no corresponding capturing group. \1 refers to the first capturing group in the pattern.
If you want to try your hand at regex, you may use
re.sub(r'^(\d+)-(\d+)-(\d+)$', lambda x: "{}-{}-{}".format(x.group(1).zfill(5), x.group(2).zfill(5), x.group(3).zfill(5)), input[i])
Here, ^(\d+)-(\d+)-(\d+)$ matches a string that starts with 1+ digits, then has -, then 1+ digits, - and again 1+ digits followed by the end of string. There are three capturing groups whose values can be referred to with \1, \2 and \3 backreferences from the replacement pattern. However, since we need to apply .zfill(5) on each captured text, a lambda expression is used as the replacement argument, and the captures are accessed via the match data object group() method.
However, if your strings are already in correct format, you may just split the strings and format as necessary:
new_list = []
for i in range (0, len(input)):
    splits = input[i].split('-')
    if len(splits) == 1:
        new_list.append(input[i])
    else:
        new_list.append("{}-{}-{}".format(splits[0].zfill(5), splits[1].zfill(5), splits[2].zfill(5)))
Both solutions yield
['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
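Either approach can be wrapped in a small helper so that it is easy to apply to the whole list. The following is a sketch under a few assumptions: the name pad_groups and the width parameter are invented for illustration, the list is called numbers to avoid shadowing the built-in input, and any string that is not exactly three dash-separated digit groups is returned unchanged.

```python
import re

def pad_groups(number, width=5):
    """Zero-pad each dash-separated group to `width`; leave other strings as-is."""
    if re.fullmatch(r'\d+-\d+-\d+', number):
        return '-'.join(part.zfill(width) for part in number.split('-'))
    return number

numbers = ['90-10-07457', '000480087800784', '001-713-0926', '9-23-4532']
print([pad_groups(n) for n in numbers])
# ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00009-00023-04532']
```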
How about analysing the string for numbers and dashes, then adding leading zeros?
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
output = []
for inp in input:
    # calculate length of string
    inpLen = len(inp)
    # calculate num of dashes
    inpDashes = inp.count('-')
    # add specific number of leading zeros
    zeros = "0" * (15-(inpLen-inpDashes))
    output.append(zeros + inp)
print(output)
>>> ['00000090-10-07457', '000480087800784', '00000001-713-0926', '00000012-710-8197', '00000001-345-1715', '000000009-23-4532', '000200007100272']

How to add dot separator on different positions of a number in Python?

I am trying to capture a number from a string, which sometimes contains dot separators and sometimes it does not. In any case I need a number with the dot separator.
e.g.:
num = re.findall('\d{3}\.(?:\d{2}\.){4}\d{3}|\d{14}', txt)[0]
will capture both variations:
304.33.44.52.03.002
30433445203002
If it captured the one without dots, I would need to add the dots following this scheme:
AAA.BB.CC.DD.EE.FFF
How can I add those dots with Python?
Solution without regexp.
You can transform it to list and insert dots in required positions, ensuring that value is string.
n = 30433445203002
l = list(str(n))
Add dots at the positions you need:
l.insert(3, '.')
l.insert(6, '.')
l.insert(9, '.')
l.insert(12, '.')
l.insert(15, '.')
If this is a well-defined pattern, you can generalize the insertion above.
After insertion is done, join them back to the string:
num = "".join(l)
Input:
30433445203002
Output:
304.33.44.52.03.002
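One way to generalize the insertion is to describe the layout as a tuple of group widths and slice the digits accordingly. This is a minimal sketch, assuming the AAA.BB.CC.DD.EE.FFF layout from the question; the function name add_dots and the widths parameter are hypothetical.

```python
def add_dots(digits, widths=(3, 2, 2, 2, 2, 3)):
    """Split `digits` into consecutive groups of the given widths, joined by dots."""
    digits = str(digits)
    if len(digits) != sum(widths):
        raise ValueError("expected %d digits, got %d" % (sum(widths), len(digits)))
    groups = []
    start = 0
    for width in widths:
        groups.append(digits[start:start + width])
        start += width
    return ".".join(groups)

print(add_dots(30433445203002))  # 304.33.44.52.03.002
```

Slicing by width avoids having to account for the way each inserted dot shifts the later positions.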
You can capture each "group" of numbers into a capturing group, and refer to it in the replacement string. The dots can be made optional with \.?.
import re
string = "30433445203002"
regex = r"(\d{3})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{3})"
pattern = "\\1.\\2.\\3.\\4.\\5.\\6"
result = re.sub(regex, pattern, string)
For more details, take a look at the re.sub documentation.
Output:
304.33.44.52.03.002
EDIT:
If I have misunderstood you and what you actually want is to get the first 3 numbers, 4th and 5th numbers, 6th and 7th numbers etc, you can use the same regex with search:
re.search(regex, string).group(1) # 304
re.search(regex, string).group(2) # 33
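Because the dots are optional in this regex, the capture and the normalization can be done in one step. The sketch below assumes the number is embedded in some longer text; the names GROUPS and extract_dotted are invented for illustration. Note that this pattern would also accept partially dotted input and can match inside a longer run of digits, so it is looser than the findall pattern in the question.

```python
import re

# The same six groups as in the regex above, with every dot optional.
GROUPS = re.compile(r"(\d{3})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{3})")

def extract_dotted(txt):
    """Return the first number in `txt` in AAA.BB.CC.DD.EE.FFF form, or None."""
    m = GROUPS.search(txt)
    if m is None:
        return None
    return ".".join(m.groups())

print(extract_dotted("ref 30433445203002 ok"))       # 304.33.44.52.03.002
print(extract_dotted("ref 304.33.44.52.03.002 ok"))  # 304.33.44.52.03.002
print(extract_dotted("no number here"))              # None
```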

Find out till where a regex satisfies a sentence

I have a sentence and a regular expression. Is it possible to find out up to which point in the regex my sentence is satisfied? For example, consider the sentence MMMV and the regex M+V?T*Z+. The regex up to M+V? satisfies the sentence, and the remaining part of the regex is T*Z+, which should be my output.
My approach right now is to break the regex in individual parts and store that in a list and then match by concatenating first n parts till sentence matches. For example if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in loop first by M+, second by M+V? and so on till complete match is found and then take the remaining list as output. Below is the code
re_exp = ['M+', 'V?', 'T*', 'Z+']
for n in range(len(re_exp)):
    re_expression = ''.join(re_exp[:n+1])
    if re.match(r'{0}$'.format(re_expression), sentence_language):
        return re_exp[n+1:]
Is there a better approach to achieve this, maybe using a parsing library?
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
import re

def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            string = string[len(g):]
            res, pat = res + g, pat + p
        else:
            break
    return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still have O(n²) for repeatedly slicing the string.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
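A middle ground that avoids the repeated slicing without writing a custom matcher is to keep a position index and use the pos argument of a compiled pattern's match method. The following is a sketch of that variant, not the code from the answer: the name split_satisfied is made up, it makes the same assumption about the pattern syntax, and it also returns the unsatisfied remainder of the regex, which is what the question asks for.

```python
import re

def split_satisfied(pattern, string):
    """Return (satisfied_parts, remaining_parts, matched_text).

    Assumes `pattern` consists only of single word characters, each
    optionally followed by +, * or ?.
    """
    parts = re.findall(r"\w[+*?]?", pattern)
    pos = 0
    for i, part in enumerate(parts):
        # Pattern.match(string, pos) starts matching at `pos`, so no slicing is needed.
        m = re.compile(part).match(string, pos)
        if m is None:
            return "".join(parts[:i]), "".join(parts[i:]), string[:pos]
        pos = m.end()
    return "".join(parts), "", string[:pos]

print(split_satisfied("M+V?T*Z+", "MMMV"))    # ('M+V?T*', 'Z+', 'MMMV')
print(split_satisfied("M+V?T*Z+", "MTTZZZ"))  # ('M+V?T*Z+', '', 'MTTZZZ')
```

Note that T* counts as satisfied because it can match the empty string, so for MMMV the remainder is Z+ rather than the T*Z+ given in the question. The caveat about consecutive parts that share a character applies here as well.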
You can use () to enclose groups in regex. For example: M+V?(T*Z+), the output you want is stored in the first group of the regex.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);

Count for regex is returning 0 instead of current count

If I have a pattern "A*C" and my string is "AKKLSKLCAajaklDAajdklafdC", it should return 2, but it returns 0.
import re
string = input("What is the string?")
pattern = input("What is the pattern?")
print(len(re.findall(pattern, string)))
I've also tried
count = 0
match = re.search(pattern,string)
if match:
    count += 1
That also returned zero.
Two things,
First, the code as given DOES return 2:
string = "AKKLSKLCAajaklDAajdklafdC"
pattern = "A*C"
print (len(re.findall(pattern, string)))
Second, are you sure "A*C" is the pattern you want to match? That pattern asks for zero or more consecutive A's followed immediately by a C, so in the example string you give it matches just the two C's. If you're trying to find an A followed by a C with some arbitrary characters in between, you want "A.*C" instead. That is because, in regular expressions, the '.' character is a wildcard, as opposed to '*', which means zero or more of the preceding element. Keep in mind that .* is greedy, so "A.*C" matches once, from the first A to the last C; the lazy form "A.*?C" stops at the nearest C and gives 2 matches for this string. This differs from how many non-regex search tools work, which use '*' itself as the wildcard character.
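The behaviour of these patterns is easy to check by printing what findall actually returns for the string from the question. This is a small illustration; the lazy variant "A.*?C" is included for comparison.

```python
import re

string = "AKKLSKLCAajaklDAajdklafdC"

# Zero or more A's directly followed by C: only the bare C's qualify.
print(re.findall("A*C", string))    # ['C', 'C']

# Greedy wildcard: one match spanning from the first A to the last C.
print(re.findall("A.*C", string))   # ['AKKLSKLCAajaklDAajdklafdC']

# Lazy wildcard: each match stops at the nearest following C.
print(re.findall("A.*?C", string))  # ['AKKLSKLC', 'AajaklDAajdklafdC']
```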
