Find several strings with regular expressions

I'm looking for an OR capability to match on several strings with regular expressions.
# I would like to find either "-hex", "-mos", or "-sig"
# the result would be -hex, -mos, or -sig
# You see I want to get rid of the double quotes around these three strings.
# Other double quoting is OK.
# I'd like something like.
messWithCommandArgs = ' -o {} "-sig" "-r" "-sip" '
messWithCommandArgs = re.sub(
r'"(-[hex|mos|sig])"',
r"\1",
messWithCommandArgs)
This works:
messWithCommandArgs = re.sub(
r'"(-sig)"',
r"\1",
messWithCommandArgs)

Square brackets are for character classes that can only match a single character. If you want to match multiple character alternatives you need to use a group (parentheses instead of square brackets). Try changing your regex to the following:
r'"(-(?:hex|mos|sig))"'
Note that I used a non-capturing group (?:...) because you don't need another capture group, but r'"(-(hex|mos|sig))"' would actually work the same way since \1 would still be everything but the quotes.
Alternatively, you could use r'"-(hex|mos|sig)"' with r"-\1" as the replacement (since the - is no longer a part of the group).
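A minimal sketch of the grouped alternation applied to the string from the question. The variable names here are illustrative and not from the original post:

```python
import re

# Command string from the question; only "-sig" is one of the three targets.
mess_with_command_args = ' -o {} "-sig" "-r" "-sip" '

# Group the alternatives with (?:...) so that whole words are alternated,
# and capture the dash plus the word so \1 restores it without the quotes.
unquoted = re.sub(r'"(-(?:hex|mos|sig))"', r'\1', mess_with_command_args)

print(unquoted)
```

Only "-sig" loses its quotes here; "-r" and "-sip" are not among the alternatives, so they are left as they were.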

You should remove the [] metacharacters so that hex, mos, and sig are matched as whole alternatives: (?:-(hex|mos|sig))

Related

Replace second and last second characters, using re.sub

I have a string "F(foo)", and I'd like to replace that string with "F('foo')". I know we can also use regular expression in the second parameter and do this replacement using re.sub(r"F\(foo\)", r"F\('foo'\)",str). But the problem here is, foo is a dynamic string variable. It is different every time we want to do this replacement. Is it possible by some sort of regex, to do such replacement in a cleaner way?
I remember one way to extract foo using () and then .group(1). But this would require me to define one more temporary variable just to store foo. I'm curious whether there is a way to replace "F(foo)" with "F('foo')" in a single line, or in other words, in a cleaner way.
Examples :
F(name) should be replaced with F('name').
F(id) should be replaced with F('id').
G(name) should not be replaced.
So, the regex would be r"F\((\w)+\)" to find such strings.

Using re.sub
Ex:
import re
s = "F(foo)"
print(re.sub(r"\((.*)\)", r"('\1')", s))
Output:
F('foo')

The following regex encloses valid [Python|C|Java] identifiers after F and in parentheses in single quotation marks:
re.sub(r"F\(([_a-z][_a-z0-9]*)\)", r"F('\1')", s, flags=re.I)
#"F('foo')"

There are several ways, depending on what foo actually is.
If it can't contain ( or ), you can just replace ( with (' and ) with '). Otherwise, try using
re.sub(r"F\((.*)\)", r"F('\1')", yourstring)
where the \1 in the replacement part will reference the (.*) capture group in the search regex

In your pattern F\((\w)+\) you are almost there; you just need to move the quantifier + inside the group, directly after \w, so that the group captures one or more word characters.
If you put it after the capturing group, you repeat the group itself, and the group then holds only the value of its last iteration, which would be the second o in foo.
You could update your expression to:
F\((\w+)\)
And in the replacement refer to the capturing group using \1
F('\1')
For example:
import re
s = "F(foo)"  # avoid naming the variable str, which shadows the built-in
print(re.sub(r"F\((\w+)\)", r"F('\1')", s)) # F('foo')
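As a rough check against the three examples from the question, the sketch below runs the corrected pattern over each of them and also shows what the original (\w)+ form captures. The names pattern and last_only are illustrative:

```python
import re

# Quantifier inside the group: the group captures the whole identifier.
pattern = re.compile(r"F\((\w+)\)")

for text in ["F(name)", "F(id)", "G(name)"]:
    print(text, "->", pattern.sub(r"F('\1')", text))

# Quantifier outside the group: only the last repetition is kept.
last_only = re.search(r"F\((\w)+\)", "F(foo)").group(1)
print(last_only)
```

G(name) passes through unchanged because the pattern requires a literal F before the opening parenthesis.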

python regex match a group or not match it

I want to match the string:
from string as string
It may or may not contain as.
The current code I have is
r'(?ix) from [a-z0-9_]+ [as ]* [a-z0-9_]+'
But this code matches a single a or s. So something like from string a little will also be in the result.
I wonder what is the correct way of doing this.

You may use
(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+
Note that you are using the x ("verbose", free-spacing) modifier, so all unescaped spaces in your pattern become formatting whitespace that the re engine ignores when parsing the pattern. Thus, I suggest using \s+ to match one or more whitespace characters. If you really want to match single regular spaces, either omit the x modifier and use plain spaces or, if you need the x modifier to insert comments, escape the spaces:
r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+'
Also, to match a sequence of chars, you need to use a grouping construct rather than a character class. Here, (?:as\s+)? defines an optional non-capturing group that matches 1 or 0 occurrences of as + space substring.
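Both forms of the pattern can be compared side by side with a small sketch. The sample strings and variable names are assumptions for illustration, and fullmatch is used here only to make the comparison strict; with search, the pattern would still find from string a inside the last sample, because the as part is optional:

```python
import re

# Non-verbose form: \s+ matches the whitespace explicitly.
import_re = re.compile(r'(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+')

# Verbose form: literal spaces must be escaped to survive the x flag.
import_re_verbose = re.compile(r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+')

samples = ["from string as string", "from string string", "from string a little"]
for sample in samples:
    plain_ok = bool(import_re.fullmatch(sample))
    verbose_ok = bool(import_re_verbose.fullmatch(sample))
    print(sample, "->", plain_ok, verbose_ok)
```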

Python regex find all matches

I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'] while I'm expecting something like [('8.25e+07'),(8.26206e+07)]. I've been trying around but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'] so the pattern is matching the second number but I don't get it why it doesn't match both at the same time.

You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
After re.findall traverses your string from the beginning it finds your defined pattern, which then drops the { and the | because of your capturing group (..) and saves this as a match. It then continues but only has 8.26206e+07} left. That now does not satisfy your pattern, because it is missing one "any" character for your first ., and no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
There are two things I would change in your pattern: (1) remove the .s and (2) remove your capturing group ( ), since you have no need for it:
p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
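The difference between the two patterns can be seen with a short sketch on the input from the question. The variable names are illustrative, and print is called with a single argument so the snippet behaves the same on Python 2.7 and Python 3:

```python
import re

text = "{8.25e+07|8.26206e+07}"

# Original pattern: each match consumes one character on either side,
# so the "|" used by the first match is unavailable to the second.
with_dots = re.findall(r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).", text)

# Simplified pattern: no surrounding dots, no capturing group.
without_dots = re.findall(r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+", text)

print(with_dots)
print(without_dots)
```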
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall

Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.

I think you have extra dots.
Try this:
import re
y = re.findall(r"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)", "{8.25e+07|8.26206e+07}")
print(y)

When you use regular expressions with findall, the default behavior is to find all non-overlapping matches. With the dots at both the beginning and the end of the pattern, the matches you want would have to overlap.
r"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work.

python regular expression of a string

I have python string
wrong_data_type is not one of the allowed values `([one_two, two_three, three_four])`
and I have a regexp:
\w+ is not one of the allowed values`\(\[\w,+\)\]`
However, it is not correct? Any help?

The regexp should be
\w+ is not one of the allowed values `\(\[(?:\w+, )*\w+\]\)`
Fixes:
Added space after values.
\]\) at the end instead of \)\].
Inside the brackets, need to allow multiple \w, so it should be \w+.
Need to have a space after ,.
Need a group around \w+, to match multiple comma-separated words using the * quantifier.
Then have to match a single last word with no comma after it.
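A minimal sketch that checks the corrected pattern against the string from the question. The extra capturing group around the list body and the names message, pattern, and allowed are additions for illustration, not part of the answer above:

```python
import re

message = ("wrong_data_type is not one of the allowed values "
           "`([one_two, two_three, three_four])`")

# Same pattern as above, with one added capturing group around the list body
# so the allowed values can be pulled out afterwards.
pattern = re.compile(
    r"\w+ is not one of the allowed values `\(\[((?:\w+, )*\w+)\]\)`"
)

match = pattern.fullmatch(message)
allowed = match.group(1).split(", ") if match else []
print(allowed)
```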

data = re.search(r'\(\[[\w,\s]+\]\)', string).group()

You can use the following:
\w+ is not one of the allowed values `\(\[[\w,\s]+\]\)`

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.

pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the contents are interpreted as single characters or as ranges such as a-z. They are not patterns. So [---] is no different from [-]: it is the range of characters from - to -, inclusive.
The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. In [(---)], the first three characters (-- form the range from ( to -, which covers ( ) * + , and -, and the remaining - and ) are literals, so the class is equivalent to [()*+,-].
Thus the character class [^(---)]+ matches any run of characters that contains no hyphen or parenthesis (nor *, + or a comma):
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
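To confirm that later hyphen lines do not cut the capture short, here is a small sketch using the (?ms)^---(.*)\Z pattern on hypothetical file contents that contain a second --- line. The sample text and the name body are assumptions for illustration:

```python
import re

# Hypothetical file contents with a second "---" line after the first.
text = (
    "Header that should be skipped.\n"
    "---\n"
    "First captured line.\n"
    "---\n"
    "Still captured.\n"
)

# search returns the leftmost match, so the first "---" line is the anchor.
match = re.search(r'(?ms)^---(.*)\Z', text)

# group(1) starts with the newline that ends the "---" line, so strip it.
body = match.group(1).lstrip("\n") if match else ""
print(body)
```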

Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
with open('myfile', 'r') as f:
    for line in f:
        if line.startswith('---'):
            break
    text = list(f)  # the lines after the first line that starts with ---
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.

s = open(myfile).read().split('\n\n---\n\n', 1)
print s[0] # first part
print s[1] # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.
