Python Regex for alpha(alpha|digit)* - python

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?

like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the \s* before is optional, you can remove it, like capture groups.
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)

Related

Python matching dashes using Regular Expressions

I am currently new to Regular Expressions and would appreciate if someone can guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and trying to extract the string with dashes (B01234-56-K-9870) using python regular expression. I have following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.
Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches 1 or more words followed by a -
\S+ matches non-space characters
Regex demo!

search substring + integer from a string in python using regular expression

I have a string
str="TMOUT=1800; export TMOUT"
I want to extract only TMOUT=1800 from above string, but 1800 is not constant it can be any integer value. For example TMOUT=18 or TMOUT=201 etc. I'm very new to regular expression.
I tried using code below
re.search("TMOUT=\d",str).
It is not working. Please help
\d matches a single digit. You want to match one or more digits, so you have to add a + quantifier:
re.search("TMOUT=\d+", text)
If you then you want to extract the number you have to create a group using parenthesis ():
match = re.search(r"TMOUT=(\d+)", text)
number = int(match.group(1))
Or you may want to use the named group syntax (?P<name>):
match = re.search(r"TMOUT=(?P<num>\d+)", text)
number = int(match.group("num"))
I suggest you use regex101 to test your regexes and get an explanation of what they do. Also read python's re docs to learn about the methods of the various objects and functions available.

python regular expression replace only in parentheses

I would like to replace the ー to - in a regular expression like \d+(ー)\d+(ー)\d+. I tried re.sub but it will replace all the text including the numbers. Is it possible to replace the word in parentheses only?
e.g.
sub('\d+(ー)\d+(ー)\d+','4ー3ー1','-') returns '4-3-1'. Assume that simple replace cannot be used because there are other ー that do not satisfy the regular expression. My current solution is to split the text and do replacement on the part which satisfy the regular expression.
You may use the Group Reference here.
import re
before = '4ー3ー1ーー4ー31'
after = re.sub(r'(\d+)ー(\d+)ー(\d+)', r'\1-\2-\3', before)
print(after) # '4-3-1ーー4ー31'
Here, r'\1' is the reference to the first group, a.k.a, the first parentheses.
You could use a function for the repl argument in re.sub to only touch the match groups.
import re
s = '1234ー2134ー5124'
re.sub("\d+(ー)\d+(ー)\d+", lambda x: x.group(0).replace('ー', '-'), s)
Using a slightly different pattern, you might be able to take advantage of a lookahead expression which does not consume the part of string it matches to. That is to say, a lookahead/lookbehind will match on a pattern with the condition that it also matches the component in the lookahead/lookbehind expression (rather than the entire pattern.)
re.sub("ー(?=\d+)", "-", s)
If you can live with a fixed-length expression for the part preceding the emdash you can combine the lookahead with a lookbehind to make the regex a little more conservative.
re.sub("(?<=\d)ー(?=\d+)", "-", s)
re.sub('\d+(ー)\d+(ー)\d+','4ー3ー1','-')
Like you pointed out, the output of the regular expression will be '-'. because you are trying to replace the entire pattern with a '-'. to replace the ー to - you can use
import re
input_string = '4ー3ー1'
re.sub('ー','-', input_string)
or you could do a find all on the digits and join the string with a '-'
'-'.join(re.findall('\d+', input_string))
both methods should give you '4-3-1'

python regex match a group or not match it

I want to match the string:
from string as string
It may or may not contain as.
The current code I have is
r'(?ix) from [a-z0-9_]+ [as ]* [a-z0-9_]+'
But this code matches a single a or s. So something like from string a little will also be in the result.
I wonder what is the correct way of doing this.
You may use
(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+
See the regex demo
Note that you use x "verbose" (free spacing) modifier, and all spaces in your pattern became formatting whitespaces that the re engine omits when parsing the pattern. Thus, I suggest using \s+ to match 1 or more whitespaces. If you really want to use single regular spaces, just omit the x modifier and use the regular space. If you need the x modifier to insert comments, escape the regular spaces:
r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+'
Also, to match a sequence of chars, you need to use a grouping construct rather than a character class. Here, (?:as\s+)? defines an optional non-capturing group that matches 1 or 0 occurrences of as + space substring.

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.

Categories

Resources