regex contains "times" but not "clock"

regex contains "times" but not "clock" - python

Disclaimer: I know "in" and "not in" can be used but due to technical contraints I need to use regex.
I have:
a = "digital clock time fan. Segments featuring digital 24 hour oclock times. For 11+"
b = "nine times ten is ninety"
and I would like to match based on contains "times" but not "oclock", so a and b are put through regex and only b passes
Any ideas?

You can use a negative lookahead for this:
^(?!.*\bo?clock\b).*\btimes\b
Explanation:
^ # starting at the beginning of the string
(?! # fail if
.*\bo?clock\b # we can match 'clock' or 'oclock' anywhere in the string
) # end if
.*\btimes\b # match 'times' anywhere in the string
The \b is for word boundaries, so you would still match a string like 'clocked times' but would fail for a string like 'timeshare'. You can just remove all of the \b in the regex if you don't want this behavior.
Example:
>>> re.match(r'^(?!.*\bo?clock\b).*\btimes\b', a)
>>> re.match(r'^(?!.*\bo?clock\b).*\btimes\b', b)
<_sre.SRE_Match object at 0x7fc1f96cc718>

Related

Python Three Letter Acronyms

I am trying to check if a certain string contains an acronym using regex.
my current regex:
re.search(r'\b[A-Z]{3}', string)
currently it outputs true to USA, NYCs, and NSFW but it should not say true on NSFW because it is a four letter acronym, not three.
How can I readjust the regex to make it not accept NSFW, but still accept NYCs
EDIT: it should also accept NYC,

A negative lookahead assertion: (?!pattern)
re.search(r'\b[A-Z]{3}(?![A-Z])',string)
This requires the triple capital pattern to never be followed by another capital letter, while it doesn't imply other restrictions, like the pattern necessarily be followed by something.
Think "Not followed by P" vs "Followed by not P"
Try:
filter(re.compile(r'\b[A-Z]{3}(?![A-Z])').search, ['.ANS', 'ANSs', 'AANS', 'ANS.'])

>>> import re
>>> rexp = r'(?:\b)([A-Z]{3})(?:$|[^A-Z])'
>>> re.search(rexp, 'USA').groups()
('USA',)
>>> re.search(rexp, 'NSFW') is None
True
>>> re.search(rexp, 'aUSA') is None
True
>>> re.search(rexp, 'NSF,').groups()
('NSF',)

You can use the ? to mean a character is optional, {0,1} would be equivalent.
You can put whatever characters you want to match inside the square brackets [ ] it will match any one of those 0 or 1 times so NYC. or WINs or FOO, will match.
Add the $ to the end to specify no more characters after the match are allowed
re.search(r'\b[A-Z]{3}[s,.]?$', string)

Regular Expressions using Substitution to convert numbers

I'm a Python beginner, so keep in mind my regex skills are level -122.
I need to convert a string with text containing file1 to file01, but not convert file10 to file010.
My program is wrong, but this is the closest I can get, I've tried dozens of combinations but I can't get close:
import re
txt = 'file8, file9, file10'
pat = r"[0-9]"
regexp = re.compile(pat)
print(regexp.sub(r"0\d", txt))
Can someone tell me what's wrong with my pattern and substitution and give me some suggestions?

You could capture the number and check the length before adding 0, but you might be able to use this instead:
import re
txt = 'file8, file9, file10'
pat = r"(?<!\d)(\d)(?=,|$)"
regexp = re.compile(pat)
print(regexp.sub(r"0\1", txt))
regex101 demo
(?<! ... ) is called a negative lookbehind. This prevents (negative) a match if the pattern after it has the pattern in the negative lookbehind matches. For example, (?<!a)b will match all b in a string, except if it has an a before it, meaning bb, cb matches, but ab doesn't match. (?<!\d)(\d) thus matches a digit, unless it has another digit before it.
(\d) is a single digit, enclosed in a capture group, denoted by simple parentheses. The captured group gets stored in the first capture group.
(?= ... ) is a positive lookahead. This matches only if the pattern inside the positive lookahead matches after the pattern before this positive lookahead. In other words, a(?=b) will match all a in a string only if there's a b after it. ab matches, but ac or aa don't.
(?=,|$) is a positive lookahead containing ,|$ meaning either a comma, or the end of the string.
(?<!\d)(\d)(?=,|$) thus matches any digit, as long as there's no digit before it and there's a comma after it, or if that digit is at the end of the string.

how about?
a='file1'
a='file' + "%02d" % int(a.split('file')[1])

This approach uses a regex to find every sequence of digits and str.zfill to pad with zeros:
>>> txt = 'file8, file9, file10'
>>> re.sub(r'\d+', lambda m : m.group().zfill(2), txt)
'file08, file09, file10'

How to match one character word?

How do I match only words of character length one? Or do I have to check the length of the match after I performed the match operation? My filter looks like this:
sw = r'\w+,\s+([A-Za-z]){1}
So it should match
rs =re.match(sw,'Herb, A')
But shouldn't match
rs =re.match(sw,'Herb, Abc')

If you use \b\w\b you will only match one character of type word. So your expression would be
sw = r'\w+,\s+\w\b'
(since \w is preceded by at least one \s you don't need the first \b)
Verification:
>>> sw = r'\w+,\s+\w\b'
>>> print re.match(sw,'Herb, A')
<_sre.SRE_Match object at 0xb7242058>
>>> print re.match(sw,'Herb, Abc')
None

You can use
(?<=\s|^)\p{L}(?=[\s,.!?]|$)
which will match a single letter that is preceded and followed either by a whitespace character or the end of the string. The lookahead is a little augmented by punctuation marks as well ... this all depends a bit on your input data. You could also do a lookahead on a non-letter, but that begs the question whether “a123” is really a one-letter word. Or “I'm”.

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
print ("FOUND")
else:
print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing and also the problem here is I'm not checking for white spaces, tabs and new line characters and also hard-coded the number for integers before and after decimal.

With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) are allowed here. If you combine this with the .match() method you assure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>

Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with '^[0-9][0-9][.] because that will only match strings with exactly two integers at the beginning. A second issue with your regex is at the end: [0-9][-]* - if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
print ("FOUND")
else:
print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?

Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets

What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.

The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))

You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"

Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.

An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False

If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))

To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regex contains "times" but not "clock" - python

Related

Python Three Letter Acronyms

Regular Expressions using Substitution to convert numbers

How to match one character word?

using reg exp to check if test string is of a fixed format

Regular expression for repeating sequence

Categories

Resources