removing block comments but keeping linebreaks - python

I'm removing block comments from python scripts with this regex
re.sub("'''.*?'''", "", string, flags = re.DOTALL)
It removes the complete block comment including line breaks (\n). However I would like to keep the line breaks for further processing of the files. Any way to do this with a regex?

What you're trying to do is find the lines inside each multiline string and replace each of them with just a newline character, instead of removing the whole block. re.sub can take a function or lambda as its second argument, and that is what you should use here. Here is the description and an example from Python's documentation:
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
Using that concept, you first match each triple-quoted block and everything inside it, then pass that match to a function that does its own replacement on the matched text. Inside that function you can reduce every line to just its newline character, for example by replacing "[^\n]*" with an empty string so that only the "\n" characters remain. Make sure the triple quotes themselves end up the way you want: either remove them along with the text, or leave them out of the group you rewrite. The function returns the new text, and re.sub calls it for each match independently.
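A minimal sketch of the idea above, assuming the block comments use ''' only and that the quotes should be removed together with the text. The names strip_block_comments and keep_newlines are made up for this illustration:

```python
import re

def strip_block_comments(source):
    """Remove '''...''' blocks but keep the newlines they contained."""
    def keep_newlines(match):
        # Return one newline for every newline inside the matched block.
        return "\n" * match.group(0).count("\n")
    return re.sub(r"'''.*?'''", keep_newlines, source, flags=re.DOTALL)

code = "a = 1\n'''first\nsecond\n'''\nb = 2\n"
result = strip_block_comments(code)
print(repr(result))
```

Because every newline inside a block is preserved, line numbers in the result still line up with the original file.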

Related

Is there a way to write regex to avoid matching a # inside a string

I am writing Syntax Highlighting for Python, and I want to make comments only be highlighted as comments if they are not included in strings.
What is currently happening:
# this line will be matched
x = '# this line is matched from `#` onwards'
However, I want the # to be matched only if it is not surrounded by ' or ".
My current regex is #[^\n]*, which selects a # and everything after it, but I don't know how to make it check for a surrounding ' first.
To check whether something is surrounded by other characters, you can use a lookbehind and a lookahead:
pattern = r"(?<=['\"]).*#.*(?=['\"])"
(?<=...) is a lookbehind; writing (?<!...) instead makes it negative, which checks that a specific character does not come before.
(?=...) works the same way, except that it looks ahead instead of behind; its negative form is (?!...).
This pattern matches when the # has a ' or a " somewhere before it and somewhere after it on the same line.
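Lookarounds on their own cannot tell whether a quote opens or closes a string, so here is a sketch of a different technique (not the one from the answer above): match string literals and comments in one alternation, and keep only the comment matches. TOKEN and find_comments are hypothetical names, and the pattern assumes single-line strings only (no triple-quoted strings spanning lines).

```python
import re

# String literals are tried first, so a '#' inside a string is consumed
# as part of the string and never starts a comment match.
TOKEN = re.compile(r"""
    (?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')
  | (?P<comment>\#[^\n]*)
""", re.VERBOSE)

def find_comments(source):
    """Return the text of every comment that is not inside a string."""
    return [m.group("comment")
            for m in TOKEN.finditer(source)
            if m.group("comment") is not None]

sample = (
    "# this line will be matched\n"
    "x = '# this line is matched from `#` onwards'\n"
    "y = 1  # trailing\n"
)
comments = find_comments(sample)
print(comments)
```

For highlighting, the match objects' start() and end() positions can be used instead of the text.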

Matching regex pattern where there is \n\r between starting and ending pattern

The red underscore is the desired string I want to match
I would like to match all strings (including \n) between the two strings provided in the example
However, in the first example, where there is a newline, I can't get anything to match
In the second example, the regex expression works. It matches the string highlighted in Green because it resides on a single line
Not sure if there is a notation I need to include for \n\r to be part of the pattern to match
Use this
output = re.search('This(.*?)\n\n(.*?)match', text)
>>> output.group(1)
'is a multiline expression'
>>> output.group(2)
'I would like to '
Try this one as well:
output = re.search(r"This (.+?) match", text, flags=re.DOTALL).group(1).replace('\n', '')
That finds the entire thing as one group (re.DOTALL lets the dot match newlines too), then removes the newlines.
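A small sketch of the single-group approach, assuming a sample text reconstructed from the outputs quoted above (the original screenshot is not available, so the text and the \r\n line endings are assumptions):

```python
import re

# Hypothetical sample text with Windows-style line endings.
text = "This is a multiline expression\r\n\r\nI would like to match"

# re.DOTALL makes '.' match '\r' and '\n' as well.
match = re.search(r"This (.*?) match", text, flags=re.DOTALL)
between = match.group(1)

# Collapse each run of line-break characters into a single space.
flattened = re.sub(r"[\r\n]+", " ", between)
print(repr(flattened))
```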

Python replace - Treat multiple instances as one

I am trying to replace all the carriage returns with a comma for incoming lines. It works fine, except when multiple carriage returns exist. I see no information on Python's str.replace() method about how to treat multiple instances of the same item as though they are one. Is this possible?
For instance, this line:
This is\nA sentence\nwith multiple\nbreaklines\n\npython.
Should end up like this:
This is, A sentence, with multiple, breaklines, python.
But it actually turns into this:
This is, A sentence, with multiple, breaklines, , python.
You can use regex.
In [48]: mystr = "This is\nA sentence\nwith multiple\nbreaklines\n\npython."
In [49]: re.sub(r'\n+', ', ', mystr)
Out[49]: 'This is, A sentence, with multiple, breaklines, python.'
The regex pattern matches one or more \n characters next to each other and replaces each run with a single ", ".
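If the incoming lines may also contain \r\n line endings or a trailing line break, a variant along these lines might help. This is only a sketch; join_lines and its sep parameter are names invented for the example:

```python
import re

def join_lines(text, sep=", "):
    """Replace each run of line breaks with sep, ignoring leading/trailing ones."""
    # Strip line breaks at both ends so no separator is added there.
    trimmed = text.strip("\r\n")
    # (?:\r?\n)+ matches one or more consecutive '\n' or '\r\n' breaks.
    return re.sub(r"(?:\r?\n)+", sep, trimmed)

mystr = "This is\nA sentence\nwith multiple\nbreaklines\n\npython."
joined = join_lines(mystr)
print(joined)
```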

Replacing only a specific group within a matched expression

I'm parsing text in which I would like to make changes, but only to specific lines.
I have a regular expression pattern that catches the entire line if it's a line of interest, and within the expression I have a remembered group of the thing I would actually like to change.
I would like to be able to change only the specific group within a matched expression, and not replace the entire expression (that would replace the entire line).
For example:
I have a textual file with:
This is a completely silly example.
something something "this should be replaced" bla.
more uninteresting stuff
And I have the regex:
pattern = '.*("[^"]*").*'
Then I catch the second line, but I would like to replace only the "this should be replaced" matched group within the line, not the entire line (so using re.sub(pattern, replacement, string) won't do the job).
Thanks in advance!
What's wrong with
r'"[^"]+"'
The .* before and after the quoted expression can also match the empty string, so you don't need them at all.
re.sub(r'"[^"]+"', 'DEF', 'abc"def"ghi')
# returns 'abcDEFghi'
and your example text will result in:
'This is a completely silly example.\nsomething something DEF bla.\nmore uninteresting stuff'
eumiro's answer is best in this very case, but for the sake of completeness, if you really need to perform some more complicated processing of the pre, inside, and post text, you can simply use multiple groups, like:
'(.*)("[^"]*")(.*)'
(the first group provides the text before, the third the text after; do what you like with them)
Also, you may prefer to forbid " in the pre-part:
'([^"]*)("[^"]*")(.*)'
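As a sketch of how the three groups could be used, the replacement string can copy groups 1 and 3 back unchanged and substitute only the middle one. The replacement text "new text" is just a placeholder for this example:

```python
import re

line = 'something something "this should be replaced" bla.'
pattern = r'([^"]*)("[^"]*")(.*)'

# \g<1> and \g<3> copy the text before and after the quoted part unchanged.
result = re.sub(pattern, r'\g<1>"new text"\g<3>', line)
print(result)
```

Lines without a quoted part do not match the pattern, so re.sub leaves them as they are.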
re.match and re.search return a "match object". (See the python documentation). Supposing you want to replace group 3 in your RE, pull out its start/end indices and replace the substring directly:
mobj = re.match(pattern, line)
start = mobj.start(3)
end = mobj.end(3)
line = line[:start] + replacement + line[end:]
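The index-based approach above could be wrapped in a small helper, roughly like this. replace_group is a hypothetical name, and the sketch assumes only the first match per line should be changed:

```python
import re

def replace_group(pattern, line, group, replacement):
    """Replace only the given group of the first match; other text is untouched."""
    mobj = re.search(pattern, line)
    # No match, or the group did not take part in the match: nothing to do.
    if mobj is None or mobj.start(group) == -1:
        return line
    return line[:mobj.start(group)] + replacement + line[mobj.end(group):]

line = 'something something "this should be replaced" bla.'
changed = replace_group(r'.*("[^"]*").*', line, 1, 'DEF')
print(changed)
```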

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n, the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, all you need is the text without the newline at the end of each line, so that you can iterate over it to find a required word. In that case you can try the following:
data = (line for line in text.split('\n') if line.strip())  # all non-empty lines, without the '\n' at the end
Now you can search or replace any text you need, line by line, using string methods or regex functionality.
Or you can use replace to replace every '\n' with whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print(content)
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> import re
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by collapsing multiple newlines into a single newline as you read in the file. Another option might be to drop the trailing \n from the regex, unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, you probably want the MULTILINE flag instead of the DOTALL flag. With MULTILINE, ^ matches at the beginning of the string and just after each newline, and $ matches at the end of the string and just before each newline.
e.g.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
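To compare, here is a sketch that runs the two-step approach above next to a one-pass variant that consumes the line's own newline when there is one. The sample content is made up for the example and deliberately has no trailing newline:

```python
import re

# Hypothetical sample; the last line is not terminated by a newline.
content = "header\nTOTAL: 5\nbody\nTOTAL: 9"

# Two-step approach: blank out the lines, then collapse the empty lines.
two_step = re.sub(r"^TOTAL:.*$", "", content, flags=re.MULTILINE)
two_step = re.sub(r"\n{2,}", "\n", two_step)

# One-pass variant: '\n?' removes the line together with its newline.
one_pass = re.sub(r"^TOTAL:.*\n?", "", content, flags=re.MULTILINE)

print(repr(two_step))
print(repr(one_pass))
```

The one-pass variant has the side benefit of leaving blank lines that were already in the file untouched, whereas the \n{2,} cleanup collapses those as well.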
