Using regex to check for no characters before another - python

I am trying to select any letter character (a-z) with no numbers in front of it. For example:
2x+a-3p returns a
a+b-c+d returns a,b,c,d
7g+8k returns nothing
I'm attempting to use regex for this so that I can use the expression in python but I can't find the solution.
I am using Python 3.10.4 if that is necessary.

You may need a negative lookbehind:
(?<!\d)[a-z]
Translates exactly into "any letter character (a-z) with no numbers in front of it".
Check the demo here.

A first pass attempt would be
p = re.compile(r'[^0-9]([a-z])')
This would get a lower-case letter preceded by a character which is not a digit. However, You would miss any letters which occurred at the beginning of a line since there would be no character preceding the letter. So, you can instead do
p = re.compile(r'(?:[^0-9]|^)([a-z])')

Related

Regexp to remove specific number of occurrences of character only

In Python re, I have long strings of text with > character chunks of different lengths. One string can have 3 consecutive > chars in the middle, >> in the beginning, or any such combination.
I want to write a regexp that, after splitting the string based on spaces, iterates through each word to only identify those regions with exactly 2 occurrences >>, and I can't be sure if it's at the beginning, middle or end of the whole string, or what characters are before or after it, or if it's even the only 2 characters in the string.
So far I could come up with:
word = re.sub(r'>{2}', '', word)
This ends up removing all occurrences of 2 or more. What regular expression would work for this requirement? Any help is appreciated.
You need to make sure there is no character of your choice both on the left and right using a pair of lookaround, a lookahead and a lookbehind. The general scheme is
(?<!X)X{n}(?!X)
where (?<!X) means no X immediately on the left is allowed, X{n} means n occurrences of X, and (?!X) means no X immediately on the right is allowed.
In this case, use
r'(?<!>)>{2}(?!>)'
See the regex demo.
no need to split on spaces first if dont needs to
try (?<![^ ])[^ >]*>>[^ >]*(?![^ ])
finds segments on space boundry's with only >> in it and no more

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
As a pythonic way for such tasks you can use str.startswith() method, and don't need to use regex.
But about your regex "[>]{1}.*" you don't need {1} after your character class and you can specify the start of your regex with anchor ^.So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally exactly one time (but it denotes {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex expression would be ^[>].*, meaning at the beginning of a string find exactly one > character followed by anything else (and with only one character in the square brackets, you can remove those to simplify it even further: ^>.*

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] is used for character class, and will match any one character inside them unless you put a quantifier: [ ... ]+ for one or more time.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't like the parentheses, to be caught too in the first group, you can change that to:
\w+\.\w+\s+\((on|off)\)
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(one\)) and (\(two\)). The way to do that is with an alternation operation, the pipe symbol: (\(one\)|\(two\)).
I think your problem may be with the square brackets []
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parenthesis.

Regex: Complement a group of characters (Python)

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case, it will match any character, except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh and the | are actually interpreted as pipe characters which just appear multiple time in the character class (where duplicates are just ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print re.search(rx, 'bots') # ok
print re.search(rx, 'boxs') # None
Note that re doesn't support variable-width LBs, therefore you need two of them.
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.

Match to string length by using regex in python

Writing python regex for string. I want the string to be at least 1 symbol and max 30. The problem is that im using 3 sub-blocks in regex letters, so there always must be 3 characters long length.
Is it possible to add that condition in this regex (1-30 characters length):
regex = re.compile("^[a-zA-Z]+[a-zA-Z0-9\.\-]+[a-zA-Z0-9]$")
r = regex.search(login)
Thank you.
Although it is not clear which 1 or 2 length character strings you want to accept I propose the following regex:
regex = re.compile("^[a-zA-Z][a-zA-Z0-9\.\-]{0,28}[a-zA-Z0-9]$")
As the middle set includes all other this will directly match all words with length 3-30 as you wish.
I hope this regex also matches your 2 length strings (I just assumed that the first character must be a letter), you need to add something (using '|') for single letter matches.
In general, this is difficult and doing some work outside of the RE (as suggested in the comment by M. Buettner) is often required. Your problem is easier because it can be reduced to a pattern with only one repeating element.
You have one or more letters, followed by one or more of (letter, digit, dot, hyphen) followed by a single (letter or digit), right? If so, the repetition of the first group is not needed. Leave off the + to get
r"^[a-zA-Z][a-zA-Z0-9\.\-]+[a-zA-Z0-9]$"
and you will match exactly the same set of strings. Any extra leading letters past the first will be matched in the second group instead of the first.
Now, the only variable portion of your RE is the middle section. To limit the overall length to 30, all you need do is limit that middle portion to 28 characters. Change the + to {1,28} to get:
r"^[a-zA-Z][a-zA-Z0-9\.\-]{1,28}[a-zA-Z0-9]$"
You can read more about Python REs at:
http://docs.python.org/2/library/re.html

Categories

Resources