How to match specific characters only? - python

Here I am trying to match specific characters in a string:
^[23]*$
Here are my cases:
2233 --> Match
22 --> Not Match
33 --> Not Match
2435 --> Not Match
2322 --> Match
323 --> Match
I want a regular expression that matches these strings correctly; that is, cases 1, 5 and 6 should match.
Update:
What if I have more than two digits to match, with patterns like
234 or 43, etc.? How can I match such a pattern against any string?
I want the matching to be dynamic.

How about:
(2+3|3+2)[23]*$
String must either:
start with one or more 2s, contain a 3, followed by any mix of 2 or 3 only
start with one or more 3s, contain a 2, followed by any mix of 2 or 3 only
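A quick check of this pattern against the cases from the question, as a minimal sketch. It assumes re.match is used, which anchors at the start of the string, since the pattern itself has no ^; the cases dictionary is simply the question's list written out.

```python
import re

pat = re.compile(r"(2+3|3+2)[23]*$")

# Cases from the question: True means the string should match.
cases = {"2233": True, "22": False, "33": False,
         "2435": False, "2322": True, "323": True}

# re.match anchors at the start of the string; the pattern's $ anchors the end.
results = {s: bool(pat.match(s)) for s in cases}
print(results == cases)
```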
Update: to parameterize the pattern
To parameterize this pattern, you could do something like:
import re
x = 2
y = 3
pat = re.compile("(%s+%s|%s+%s)[%s%s]*$" % (x,y,y,x,x,y))
pat.match('2233')
Or a bit clearer, but longer:
pat = re.compile("({x}+{y}|{y}+{x})[{x}{y}]*$".format(x=2, y=3))
Or you could use Python template strings
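The template-string variant might look like the sketch below. It uses string.Template from the standard library; the names template and pattern_text are only illustrative. If the characters could have a special meaning in a regex, they would probably need re.escape first.

```python
import re
from string import Template

# $$ is Template's escape for a literal dollar sign (the regex end anchor).
template = Template("(${x}+${y}|${y}+${x})[${x}${y}]*$$")
pattern_text = template.substitute(x=2, y=3)
pat = re.compile(pattern_text)

print(pattern_text)
print(bool(pat.match("2233")))
print(bool(pat.match("22")))
```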
Update: to handle more than two characters:
If you have more than two characters to test, then the regex gets unwieldy and my other answer becomes easier:
def match(s, ch):
    # Python 3 form; Python 2 would use s.translate(None, ch)
    return all([c in s for c in ch]) and len(s.translate(str.maketrans('', '', ch))) == 0
match('223344', '234') # True
match('2233445', '234') # False
Another update: use sets
I wasn't entirely happy with the above solution, as it seemed a bit ad-hoc. Eventually I realized it's just a set comparison - we just want to check that the input consists of a fixed set of characters:
def match(s,ch):
    return set(s) == set(ch)
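Run against the cases from the question, the set comparison behaves as follows. This is a small sketch; the function is named match_chars here only to keep it apart from the other definitions in this thread.

```python
def match_chars(s, ch):
    """Return True if s uses every character in ch and no others."""
    return set(s) == set(ch)

for s in ["2233", "22", "33", "2435", "2322", "323"]:
    print(s, match_chars(s, "23"))
```

Note that an empty s only matches an empty ch, since both sets are then empty.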

If you want to match strings containing both 2 and 3, but no other characters you could use lookaheads combined with what you already have:
^(?=.*2)(?=.*3)[23]*$
The lookaheads (?=.*2) and (?=.*3) assert the presence of 2 and 3, and ^[23]*$ restricts the string to only those two characters.
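For the "dynamic" case in the question's update, the same lookahead idea could be generated from a string of required characters. The following is only a sketch: build_pattern is a hypothetical helper, and it assumes re.fullmatch is used instead of writing ^ and $ into the pattern.

```python
import re

def build_pattern(chars):
    """Compile a regex requiring every character in chars and allowing no others."""
    if not chars:
        raise ValueError("need at least one character")
    unique = sorted(set(chars))
    # One lookahead per required character; re.escape guards regex metacharacters.
    lookaheads = "".join(f"(?=.*{re.escape(c)})" for c in unique)
    allowed = "".join(re.escape(c) for c in unique)
    # DOTALL lets .* cross newlines in case a newline is one of the characters.
    return re.compile(f"{lookaheads}[{allowed}]*", re.DOTALL)

pat = build_pattern("234")
print(bool(pat.fullmatch("223344")))
print(bool(pat.fullmatch("2233445")))
print(bool(pat.fullmatch("2233")))
```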

I know you asked for a solution using regex (which I have posted), but here's an alternative approach:
def match(s):
    # Python 3 form; Python 2 would use s.translate(None, '23')
    return '2' in s and '3' in s and len(s.translate(str.maketrans('', '', '23'))) == 0
We check that the string contains both desired characters, then translate them both to empty strings, then check that there's nothing left (i.e. we only had 2s and 3s).
This approach can easily be extended to handle more than two characters, using the all function, and a list comprehension:
def match(s, ch):
    # Python 3 form; Python 2 would use s.translate(None, ch)
    return all([c in s for c in ch]) and len(s.translate(str.maketrans('', '', ch))) == 0
which would be used as follows:
match('223344', '234') # True
match('2233445', '234') # False

Try this (both 2 and 3 must be present):
^(([2]+[3]+[23]*)|([3]+[2]+[23]*))$

^((2[23]*3[23]*)|(3[23]*2[23]*))$
I think this will do it. Look for either a 2 at the start, then a 3 has to appear somewhere (surrounded by as many other 2s and 3s as needed). Or vice versa, with 3 at the start and a 2 somewhere.
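One detail worth checking with patterns of this shape: | has the lowest precedence in a regex, so ^ and $ only constrain the whole string when the alternation sits inside a group. A small sketch comparing an ungrouped and a grouped form (the names loose and strict are just for illustration):

```python
import re

# Ungrouped: ^ applies only to the first branch, $ only to the second.
loose = re.compile(r"^(2[23]*3[23]*)|(3[23]*2[23]*)$")
# Grouped: both anchors apply to whichever branch matches.
strict = re.compile(r"^(?:2[23]*3[23]*|3[23]*2[23]*)$")

for s in ["2233", "2399", "9932", "22"]:
    print(s, bool(loose.search(s)), bool(strict.search(s)))
```

The ungrouped form accepts 2399 (it starts with 23) and 9932 (it ends with 32), while the grouped form rejects both.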

The string should start with 2 or 3, followed by two or more occurrences of 2 or 3:
^[23][23]{2,}$

Related

Selecting patterns in character sequence using regex

I need to select all the accounts where 3 (or more) consecutive characters are identical and/or the name also includes digits, for example:
Account
aaa12
43qas
42134dfsdd
did
Output
Account
aaa12
43qas
42134dfsdd
I am considering using a regex for this: [a-zA-Z]{3,}, but I am not sure of the approach. Also, this does not cover the and/or condition on the digits. I am interested in selecting accounts with at least one of these:
repeated identical characters,
numbers in the name.
Give this a try
n = 3 #for 3 chars repeating
pat = f'([a-zA-Z])\\1{{{n-1}}}|(\\d)+' #need `{{` to pass a literal `{`
df_final = df[df.Account.str.findall(pat).astype(bool)]
Out[101]:
Account
0 aaa12
1 43qas
2 42134dfsdd
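The same pattern can be tried without pandas, as a rough sketch using only the re module. The accounts list is the sample data from the question, and the digit branch is simplified to a single \d, since one digit is enough to select the name.

```python
import re

n = 3  # minimum run length of an identical letter
# A letter followed by n-1 copies of itself, or any single digit.
pat = re.compile(rf"([a-zA-Z])\1{{{n - 1}}}|\d")

accounts = ["aaa12", "43qas", "42134dfsdd", "did"]
selected = [name for name in accounts if pat.search(name)]
print(selected)  # ['aaa12', '43qas', '42134dfsdd']
```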
Can you try:
x = re.search(r'[a-zA-Z]{3}|\d', string)

Match characters and digits of fixed length and one occurrence in Python

I have a list in Python with values
['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
I want to match only strings where length is 8 and there are 3 characters before underscore and 4 digits after underscore so I eliminate values not required. I am interested only in the MMM_YYYY values from above list.
I tried the code below, but I am not able to filter out values like YTD_TY_1, which have multiple underscores.
for c in col_headers:
    d = re.match(r'^(?=.*\d)(?=.*[A-Z0-9])[A-Z_0-9\d]{8}$', c)
    if d:
        data_period.append(d[0])
Update: based on @WiktorStribiżew's observation that re.match does not require a full string match in Python
The regex I am using is based upon the one that @dvo provided in a comment:
import re
REGEX = '^[A-Z]{3}_[0-9]{4}$'
col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
regex = re.compile(REGEX)
data_period = list(filter(regex.search, col_headers))
Once again, based on a comment made by @WiktorStribiżew, if you do not want to match something like "SXX_0012" or "XYZ_0000", you should use the regex he has provided in a comment (written here with an underscore separator to fit the data):
REGEX = r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
Rather than use regex for this, you should just try to parse it as a date in the first place:
from datetime import datetime
date_fmt = "%b_%Y"
for c in col_headers:
    try:
        d = datetime.strptime(c, date_fmt)
        data_period.append(c) # Or just save the datetime object directly
    except ValueError:
        pass
The part of this code that is actually doing the matching in your solution is this
[A-Z_0-9\d]{8}
The problem with this is that you're asking to find exactly 8 characters, each of which may be A-Z, _, 0-9, or \d. Now, \d is equivalent to 0-9, so you can eliminate that, but that doesn't solve the whole problem. The issue is that you've put the entire pattern inside a single pair of brackets []. As a result, it will match any string that is 8 characters long and consists of the above characters, e.g. A_19_KJ9
What you need to do is specify that you want exactly 3 A-Z characters, then a single _, then 4 \d, see below:
[A-Z]{3}_\d{4}
This will match anything with exactly 3 A-Z characters, then a single _, then 4 \d (any numeric digit).
For a better understanding of regex, I'd encourage you to use an online tool, like regex101
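The difference between the two patterns can be seen in a short sketch. The names loose and strict are illustrative, and ^ and $ are added to both so that only whole strings are compared.

```python
import re

loose = re.compile(r"^[A-Z_0-9]{8}$")     # any 8 characters from the class
strict = re.compile(r"^[A-Z]{3}_\d{4}$")  # 3 letters, underscore, 4 digits

for s in ["JUL_2018", "A_19_KJ9", "YTD_TY_1"]:
    print(s, bool(loose.match(s)), bool(strict.match(s)))
```

All three strings pass the character-class version, but only JUL_2018 passes the structured one.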

Pythonic way to find the last position in a string matching a negative regex

In Python, I try to find the last position in an arbitrary string that does match a given pattern, which is specified as negative character set regex pattern. For example, with the string uiae1iuae200, and the pattern of not being a number (regex pattern in Python for this would be [^0-9]), I would need '8' (the last 'e' before the '200') as result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - 1 - re.search(r'[^0-9]', string[::-1]).start()
I am not satisfied with this for two reasons:
- a) I need to reverse string before using it with [::-1], and
- b) I also need to reverse the resulting position (subtracting it from len(string)) because of having reversed the string before.
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also work in corner cases, like 123 (no position that matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). I'll accept pythonic multi-line solutions to this question for this reason.
You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution behave properly for all kinds of input, here is the updated code. It now takes two lines, because we have to check whether the list is empty: if it is, None is printed, otherwise the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
This prints the following; if no such index is found, None is printed instead of an index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As the OP stated in the post, \d was only an example to start with, which is why I came up with a solution that works with any general regex. But if the problem really only needs \d, there is a better solution that requires no list comprehension at all: a regex that finds the last occurrence of a non-digit character directly. We can use the regex .*(\D) to find the last non-digit and print its index with the following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
This prints each string with the index of its last non-digit character, or None if there is none:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need any list comprehension and is better, as it finds the index with just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.
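A general-pattern version along those lines could look like the following sketch. The function name last_match_start is made up for illustration; it takes the pattern to search for directly (it does not negate anything) and walks re.finditer without building a list.

```python
import re

def last_match_start(pattern, s):
    """Return the start index of the last match of pattern in s, or None."""
    last = None
    for last in re.finditer(pattern, s):
        pass  # keep only the most recent match object
    return last.start() if last is not None else None

for s in ["", "123", "uiae1iuae200", "uiae1iuae200aaaaaaaa"]:
    print(repr(s), "-->", last_match_start(r"\D", s))
```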
To me it seems that you just want the last position which matches a given pattern (in this case the not-a-number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
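One caveat with the leading .* is that . does not match a newline by default, so on multi-line input the greedy prefix stops at the first line break. A sketch of the same function with an optional flags argument (the flags parameter is an addition for illustration, not part of the answer above):

```python
import re

def last_match(pattern, string, flags=0):
    # Greedy .* pushes the captured group as far right as possible.
    match = re.match(fr".*({pattern})", string, flags)
    return match.end(1) - 1 if match else None

text = "a\nb1"
print(last_match(r"[^0-9]", text))             # 1: stops at the newline
print(last_match(r"[^0-9]", text, re.DOTALL))  # 2: the 'b' on the second line
```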
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
def last_match(pattern, string):
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.

Remove strings with repeating characters [Python]

I need to determine if a string is composed of a certain repeating character, for example eeeee, 55555, or !!!.
I know this regex 'e{1,15}' can match eeeee but it obviously can't match 555. I tried [a-z0-9]{1-15} but it matches even the strings I don't need like Hello.
The solution doesn't have to be regex. I just can't think of any other way to do this.
A string consists of a single repeating character if and only if all characters in it are the same. You can easily test that by constructing a set out of the string: set('55555').
All characters are the same if and only if the set has size 1:
>>> len(set('55555')) == 1
True
>>> len(set('Hello')) == 1
False
>>> len(set('')) == 1
False
If you want to allow the empty string as well (set size 0), then use <= 1 instead of == 1.
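Wrapped up as a function, the set test might look like this sketch. The name is_single_repeated_char and the allow_empty parameter are illustrative choices, not an established API.

```python
def is_single_repeated_char(s, allow_empty=False):
    """Return True if every character in s is the same."""
    distinct = len(set(s))
    return distinct <= 1 if allow_empty else distinct == 1

for s in ["eeeee", "55555", "!!!", "Hello", ""]:
    print(repr(s), is_single_repeated_char(s))
```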
Regex solution (via re.search() function):
import re
s = 'eeeee'
print(bool(re.search(r'^(.)\1+$', s))) # True
s = 'ee44e'
print(bool(re.search(r'^(.)\1+$', s))) # False
^(.)\1+$ :
(.) - capture any character
\1+ - backreference to the previously captured group, repeated one or many times
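Because of the + after the backreference, this pattern needs at least two characters, so a one-character string such as e is rejected. If single characters should count, \1* could be used instead. A small sketch comparing the two, using re.fullmatch in place of ^ and $ and re.DOTALL so that . also covers a newline:

```python
import re

two_or_more = re.compile(r"(.)\1+", re.DOTALL)
one_or_more = re.compile(r"(.)\1*", re.DOTALL)

for s in ["eeeee", "!!!", "e", "ee44e", ""]:
    print(repr(s), bool(two_or_more.fullmatch(s)), bool(one_or_more.fullmatch(s)))
```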
You do not have to use regex for this, a test to determine if all characters in the string are the same will produce the desired output:
s = "eee"
assert len(s) > 0
reference = s[0]
result = all([c==reference for c in s])
Or use set as Thomas showed, which is probably a better way.

How do I regex match with grouping with unknown number of groups

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:
...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.
I tried to use this as a regex:
VALUE (?:(\d+)\s)+
This matches the line, but it only captures the last value, so I just get ('119',).
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])
You can use a regular expression to see whether your input line matches your expected format (using the regex in your question), then you can run the above code without having to check for "VALUE" and knowing that the int(x) conversion will always succeed since you've already confirmed that the following character groups are all digits.
>>> import re
>>> reg = re.compile(r'\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']
That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
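Doing the keyword check as a separate step could look like the sketch below, which returns the numbers from the first line that starts with VALUE. The function name first_value_numbers and the sample log are made up for illustration, based on the lines in the question.

```python
import re

log = """...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
..."""

def first_value_numbers(text):
    """Return the digit groups from the first line starting with 'VALUE '."""
    for line in text.splitlines():
        if line.startswith("VALUE "):
            return tuple(re.findall(r"\d+", line))
    return ()  # no VALUE line found

print(first_value_numbers(log))  # ('100', '234', '568', '9233', '119')
```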
Another option not described here is to have a bunch of optional capturing groups.
VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$
This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
You could just run your main match regex, then run a secondary regex on those matches to get the numbers:
matches = Regex.Match(log)
foreach (Match match in matches)
{
submatches = Regex2.Match(match)
}
This is of course also if you don't want to write a full parser.
I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:
VALUE((\s\d+)+)
This should result in three groups: [0] the whole match, [1] the stuff after VALUE, [2] the last space+value.
[0] and [2] can be ignored and then [1] can be used with the following:
\s(\d+)
Note: these regexps were not tested, I hope you get the idea though.
The reason why Greg's answer doesn't work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.
However, I would honestly go with Greg's solution for this question (it's probably way more efficient).
I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
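In Python, the two-regex idea might be sketched as follows. The names outer, inner and value_numbers are hypothetical, and the outer pattern uses [ \t] rather than \s so that a match cannot run on into the next line of the log.

```python
import re

# Step 1: find the first VALUE line and capture everything after the keyword.
outer = re.compile(r"^VALUE((?:[ \t]+\d+)+)[ \t]*$", re.MULTILINE)
# Step 2: pull the individual items out of the captured text.
inner = re.compile(r"\d+")

def value_numbers(text):
    m = outer.search(text)
    return tuple(inner.findall(m.group(1))) if m else ()

log = "...\nVALUE 100 234 568 9233 119\n...\nVALUE 101 124 9223 4329 1559\n..."
print(value_numbers(log))  # ('100', '234', '568', '9233', '119')
```

For more complicated items, only the inner pattern would need to change.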
You can use re.match to check first and call re.split to use a regex as separator to split.
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
>>> result
['100', '234', '568', '9233', '119']
The separator "\s+" can be more complicated.
