Select element with special character with regex and Python

Select element with special character with regex and Python - python

From a list of strings ('16','160','1,2','100,11','1','16:','16:00'), I want to keep only the elements that
either have a comma between two digits (e.g. 1,2 or 100,11)
or have two digits (without comma) that are NOT followed by ":" (i.e. followed by nothing: e.g 16, or followed by anything but ":": e.g. 160)
I tried the following code using regex in Python:
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:[\d],[\d]|[\d][\d][^:]*)')
rate = list(filter(pattern_rate.search,string))
print(rate)
Print:
['16', '160', '1,2','100,11' '16:', '16:00']
To be correct, the script should keep the first three items and reject the rest, but my script fails at rejecting the last two items. I guess I'm using the "[^:]" sign incorrectly.

To be correct, the script should keep the first three items and reject
the rest,
You can match either 2 or more digits, or match 2 digits with a comma in between.
As the list contains only numbers, you could use re.match to start the match at the beginning of the string instead of re.search.
(?:\d{2,}|\d,\d)\Z
Explanation
(?: Non capture group
\d{2,} Match 2 or more digits
| Or
\d,\d Match 2 digits with a comma in between
) Close non capture group
\Z End of string
Regex demo | Python demo
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:\d{2,}|\d,\d)\Z')
rate = list(filter(pattern_rate.match,string))
print(rate)
Output
['16', '160', '1,2']

I recommend looking a bit deeper into a regex guide.
100 is not a digit and will not match \d. Also having groups [..] with one element inside is not necessary if you don't intend to negate or otherwise transform them.
The first query can be represented by (?:\d+,\d+). It's a non-capturing group, that detects comma-separated numbers of length greater equal to one.
Your second query will show anything matching three consecutive digits following any (*) amount of not colons.
You'll want to use something similar to (?:\d{2,}(?!:)). It's a non-capturing group, matching digits with length greater equal to two, that are not followed by a colon. ?! designates a negative lookahead.
In your python code, you'll want to use pattern_rate.match instead of pattern_rate.find as the latter one will return partial matches while the first one only returns full matches.
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')
rate = list(filter(pattern_rate.match, string))

Not sure you need regex for that:
string = ['16','160','1,2','100,11','1','16:','16:00']
keep = []
for elem in string:
if ("," in elem and len(elem) == 3) or ( ":" not in elem and "," not in elem and len(elem) >= 2):
keep.append(elem)
print (keep)
Output:
['16', '160', '1,2']
Although not that much elegant, tends to be faster than using regex.

Related

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'

This is actually impossible to do in a real regular expression with an amount of states polynomial to the alphabet size, because the expression is not a Chomsky level-0 grammar.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*

You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
)(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Regex demo

Though not a regex answer, you could do something like this:
def all_same(string):
return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
the string[1::2] says start at the 2nd character (1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False

This is a bit complicated expression, maybe we would start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.
Demo

Capturing entire repeated string based on a repeated pattern

Following regex matches both 59-59-59 and 59-59-59-59 and outputs only 59
The intent is to match four and only numbers followed by - with the max number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern match that matches exactly 59-59-59-59
and does not match 59--59-59or 59-59-59-59-59

Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This is phrased to match an initial number starting with 0 through 5, followed by any second digit. Then, this is followed by a dash and a number with the same rules, this quantity three times exactly. Note that re.match anchor at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by #Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.

In your pattern, this part -{1,4} matches 1-4 times a hyphen so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
print(res.group())
If there should not be partial matches, you could use an anchors to assert the ^ start and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$

Regular expression to convert given number in the required format

I am first time using regular expression hence need help with one slightly complex regular expression. I have input list of around 100-150 string object(numbers).
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
Expected output = ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
## I have tried this -
import re
new_list = []
for i in range (0, len(input)):
new_list.append(re.sub('\d+-\d+-\d+','0000\\1', input[i]))
## problem is with second argument '0000\\1'. I know its wrong but unable to solve
print(new_list) ## new_list is the expected output.
As you can see, I need to convert string of numbers coming in different formats into 15 digit numbers by adding leading zeros to them.
But there is catch here i.e. some numbers i.e.'000480087800784' are already 15 digits, so should be left unchanged (That's why I cannot use string formatting (.format) option of python) Regex has to be used here, which will modify only required numbers. I have already tried following answers but not been able to solve.
Using regex to add leading zeroes
using regular expression substitution command to insert leading zeros in front of numbers less than 10 in a string of filenames
Regular expression to match defined length number with leading zeros

Your regex does not work as you used \1 in the replacement, but the regex pattern has no corresponding capturing group. \1 refers to the first capturing group in the pattern.
If you want to try your hand at regex, you may use
re.sub(r'^(\d+)-(\d+)-(\d+)$', lambda x: "{}-{}-{}".format(x.group(1).zfill(5), x.group(2).zfill(5), x.group(3).zfill(5)), input[i])
See the Python demo.
Here, ^(\d+)-(\d+)-(\d+)$ matches a string that starts with 1+ digits, then has -, then 1+ digits, - and again 1+ digits followed by the end of string. There are three capturing groups whose values can be referred to with \1, \2 and \3 backreferences from the replacement pattern. However, since we need to apply .zfill(5) on each captured text, a lambda expression is used as the replacement argument, and the captures are accessed via the match data object group() method.
However, if your strings are already in correct format, you may just split the strings and format as necessary:
for i in range (0, len(input)):
splits = input[i].split('-')
if len(splits) == 1:
new_list.append(input[i])
else:
new_list.append("{}-{}-{}".format(splits[0].zfill(5), splits[1].zfill(5), splits[2].zfill(5)))
See another Python demo. Both solutions yield
['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']

How about analysing the string for numbers and dashes, then adding leading zeros?
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
output = []
for inp in input:
# calculate length of string
inpLen = len(inp)
# calculate num of dashes
inpDashes = inp.count('-')
# add specific number of leading zeros
zeros = "0" * (15-(inpLen-inpDashes))
output.append(zeros + inp)
print (output)
>>> ['00000090-10-07457', '000480087800784', '00000001-713-0926', '00000012-710-8197', '00000001-345-1715', '000000009-23-4532', '000200007100272']

Regex in python repetition Error

In my code I Want answer [('22', '254', '15', '36')] but got [('15', '36')]. My regex (?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3} is not run for 3 time may be!
import re
def fun(st):
print(re.findall("(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
ip="22.254.15.36"
print(fun(ip))

Overview
As I mentioned in the comments below your question, most regex engines only capture the last match. So when you do (...){3}, only the last match is captured: E.g. (.){3} used against abc will only return c.
Also, note that changing your regex to (2[0-4]\d|25[0-5]|[01]?\d{1,2}) performs much better and catches full numbers (currently you'll grab 25 instead of 255 on the last octet for example - unless you anchor it to the end).
To give you a fully functional regex for capturing each octet of the IP:
(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})
Personally, however, I'd separate the logic from the validation. The code below first validates the format of the string and then checks whether or not the logic (no octets greater than 255) passes while splitting the string on ..
Code
See code in use here
import re
ip='22.254.15.36'
if re.match(r"(?:\d{1,3}\.){3}\d{1,3}$", ip):
print([octet for octet in ip.split('.') if int(octet) < 256])
Result: ['22', '254', '15', '36']
If you're using this method to extract IPs from an arbitrary string, you can replace re.match() with re.search() or re.findall(). In that case you may want to remove $ and add some logic to ensure you're not matching special cases like 11.11.11.11.11: (?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)

You only have two capturing groups in your regex:
(?: # non-capturing group
( # group 1
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)\.
){3}
( # group 2
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)
That the first group can be repeated 3 times doesn't make it capture 3 times. The regex engine will only ever return 2 groups, and the last match in a given group will fill that group.
If you want to capture each of the parts of an IP address into separate groups, you'll have to explicitly define groups for each:
pattern = (
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)')
def fun(st, p=re.compile(pattern)):
return p.findall(st)
You could avoid that much repetition with a little string and list manipulation:
octet = r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)'
pattern = r'\.'.join([octet] * 4)
Next, the pattern will just as happily match the 25 portion of 255. Better to put matching of the 200-255 range at the start over matching smaller numbers:
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)
This still allows leading 0 digits, by the way, but is
If all you are doing is passing in single IP addresses, then re.findall() is overkill, just use p.match() (matching only at the string start) or p.search(), and return the .groups() result if there is a match;)
def fun(st, p=re.compile(pattern + '$')):
match = p.match(st)
return match and match.groups()
Note that no validation is done on the surrounding data, so if you are trying to extract IP addresses from a larger body of text you can't use re.match(), and can't add the $ anchor and the match could be from a larger number of octets (e.g. 22.22.22.22.22.22). You'd have to add some look-around operators for that:
# only match an IP address if there is no indication that it is part of a larger
# set of octets; no leading or trailing dot or digits
pattern = r'(?<![\.\d])' + pattern + r'(?![\.\d])'

I encountered a very similar issue.
I found two solutions, using the official documentation.
The answer of #ctwheels above did mention the cause of the problem, and I really appreciate it, but it did not provide a solution.
Even when trying the lookbehind and the lookahead, it did not work.
First solution:
re.finditer
re.finditer iterates over match objects !!
You can use each one's 'group' method !
>>> def fun(st):
pr=re.finditer("(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st)
for p in pr:
print(p.group(),end="")
>>> fun(ip)
22.254.15.36
Or !!!
Another solution haha : You can still use findall, but you'll have to make every group a non-capturing group ! (Since the main problem is not with findall, but with the group function that is used by findall (which, we all know, only returns the last match):
"re.findall:
...If one or more groups are present in the pattern, return a list of groups"
(Python 3.8 Manuals)
So:
>>> def fun(st):
print(re.findall("(?:(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
>>> fun(ip)
['22.254.15.36']
Have fun !

Python regex matching only if digit

Given the regex and the word below I want to match the part after the - (which can also be a _ or space) only if the part after the delimiter is a digit and nothing comes after it (I basically want to to be a number and number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping) ?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-

I would use
re.findall('^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ groups can assist in capture. The $ at the end requires all numbers to be captured, but it's optional so the first group can still be captured.

Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.

This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Select element with special character with regex and Python - python

Related

Match strings with alternating characters

Capturing entire repeated string based on a repeated pattern

Regular expression to convert given number in the required format

Regex in python repetition Error

Python regex matching only if digit

Categories

Resources