Replacing only a specific group within a matched expression - python

I'm parsing text in which I would like to make changes, but only to specific lines.
I have a regular expression pattern that catches the entire line if it's a line of interest, and within the expression I have a remembered group of the thing I would actually like to change.
I would like to be able to changed only the specific group within a matched expression, and not replace the entire expression (that would replace the entire line).
For example:
I have a textual file with:
This is a completely silly example.
something something "this should be replaced" bla.
more uninteresting stuff
And I have the regex:
pattern = '.*("[^"]*").*'
Then I catch the second line, but I would to replace only the "this should be replaced" matched group within the line, not the entire line. (so using re.sub(pattern, replacement, string) won't do the job.
Thanks in advance!

What's wrong with
r'"[^"]+"'
Your .* before and after the matched expression match zero-length-string too, so you don't need it at all.
re.sub(r'"[^"]+"', 'DEF', 'abc"def"ghi')
# returns 'abcDEFghi'
and your example text will result into:
'This is a completely silly example.\nsomething something DEF bla.\nmore uninteresting stuff

eumiro answer is best in this very case, but for the sake of completeness, if you really need to perform some more complicated processing of pre, inside, and post text, you can simply use multiple groups, like:
'(.*)("[^"]*")(.*)'
(first group provides the the text before, third the text after, do what you like with them)
Also, you may prefer to forbid " in the pre-part:
'([^"]*)("[^"]*")(.*)'

re.match and re.search return a "match object". (See the python documentation). Supposing you want to replace group 3 in your RE, pull out its start/end indices and replace the substring directly:
mobj = re.match(pattern, line)
start = mobj.start(3)
end = mobj.end(3)
line = line[:start] + replacement + line[end:]

Related

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.
re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing conditional group which tells the regex that if the matched text is followed by a newline, then it should match it, provided that the newline is not followed by another newline or carriage return
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
with open(file) as f:
lines = f.readlines()
for line in lines:
line = line.lstrip()
if line.startswith(key + " ="):
return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
This is a pretty simple regex, you just need a positive lookbehind, and optionally something to remove the comments. (do this by appending ?(//)? to the regex)
r"(?<=description = ).*"
Regex101 demo

Python: regex: find if exists, else ignore

I need help with re module. I have pattern:
pattern = re.compile('''first_condition\((.*)\)
extra_condition\((.*)\)
testing\((.*)\)
other\((.*)\)''', re.UNICODE)
That's what happens if I run regex on the following text:
text = '''first_condition(enabled)
extra_condition(disabled)
testing(example)
other(something)'''
result = pattern.findall(text)
print(result)
[('enabled', 'disabled', 'example', 'something')]
But if one or two lines were missed, regex returns empty list. E.g. my text is:
text = '''first_condition(enabled)
other(other)'''
What I want to get:
[('enabled', '', '', 'something')]
I could do it in several commands, but I think that it will be slower than doing it in one regex. Original code uses sed, so it is very fast. I could do it using sed, but I need cross-platform way to do it. Is it possible to do? Tnanks!
P.S. It will be also great if sequence of strings will be free, not fixed:
text = '''other(other)
first_condition(enabled)'''
must return absolutely the same:
[('enabled', '', '', 'something')]
I would parse it to a dictionary first:
import re
keys = ['first_condition', 'extra_condition', 'testing', 'other']
d = dict(re.findall(r'^(.*)\((.*)\)$', text, re.M))
result = [d.get(key, '') for key in keys]
See it working online: ideone
Use a non-matching group for optional stuff, and make the group optional by putting a question mark after the group.
Example:
pat = re.compile(r'a\(([^)]+)\)(?:b\((?P<bgr>[^)]+)\)?')
Sorry but I can't test this right now.
The above requires a string like a(foo) and grabs the text in parents as group 0.
Then it optionally matches a string like b(foo)and if it is matched it will be saved as a named group with name: bgr
Note that I didn't use .* to match inside the parens but [^)]+. This definitely stops matching when it reaches the closing paren, and requires at least one character. You could use [^)]* if the parens can be empty.
These patterns are getting complicated so you might want to use verbose patterns with comments.
To have several optional patterns that might appear in any order, put them all inside a non-matching group and separate them with vertical bars. You will need to use named match groups because you won't know the order. Put an asterisk after the non-matching group to allow for any number of the alternative patterns to be present (including zero if none are present).

Extracting content BETWEEN a regex python?

Is there a simple method to pull content between a regex? Assume I have the following sample text
SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT
My regex is:
compiledRegex = re.compile('\[.*\] value=("|\').*("|\')')
This will obviously return the entire [SOME MORE TEXT] value="ssss", however I only want ssss to be returned since that's what I'm looking for
I can obviously define a parser function but I feel as if python provides some simple pythonic way to do such a task
This is what capturing groups are designed to do.
compiledRegex = re.compile('\[.*\] value=(?:"|\')(.*)(?:"|\')')
matches = compiledRegex.match(sampleText)
capturedGroup = matches.group(1) # grab contents of first group
The ?: inside the old groups (the parentheses) means that the group is now a non-capturing group; that is, it won't be accessible as a group in the result. I converted them to keep the output simpler, but you can leave them as capturing groups if you prefer (but then you have to use matches.group(2) instead, since the first quote would be the first captured group).
Your original regex is too greedy: r'.*\]' won't stop at the first ']' and the second '.*' won't stop at '"'. To stop at c you could use [^c] or '.*?':
regex = re.compile(r"""\[[^]]*\] value=("|')(.*?)\1""")
Example
m = regex.search("""SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT""")
print m.group(2)

re.match() multiple times in the same string with Python

I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
print tag, ":::", value
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally this is also probably making your program incredibly slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should be much more performant.
A good place to start is there module docs. In addition to re.match (which searches starting explicitly at the beginning of the string), there is re.findall (finds all non-overlapping occurrences of the pattern), and the methods match and search of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]

Categories

Resources