I want to do a search and replace, but ignore all my commented lines, and replace only the first occurrence found...
input-file.txt
#replace me
#replace me
replace me
replace me
...like with:
text = text.replace("replace me", "replaced!", 1) # with max. 1 rep.
But I'm not sure how to skip (ignore) those commented lines, so that I get:
#replace me
#replace me
replaced!
replace me
As I see it, the existing solutions have one or more of several problems:
Incomplete (e.g. requiring match on start of line)
Incomplete (e.g. requiring match not containing \n)
Clunky (e.g. looong file-based solutions)
I'm pretty sure a pure-regex solution would require variable-width lookbehinds, which the re module doesn't support (though I think the third-party regex module does; a sketch of that route follows the breakdown below). With a small tweak though, plain re can still provide a fairly clean answer.
import re

i = re.search(r'^([^#\n]?)+?replace me', string_to_replace, re.M).start()
replaced_string = ''.join([
    string_to_replace[:i],
    re.sub(r'replace me', 'replaced!', string_to_replace[i:], count=1, flags=re.M),
])
The idea is that you find the first uncommented line containing the start of your match, and then you replace the first instance of 'replace me' found starting from that line. The ^([^#\n]?)+? bit in the regex says
^ -- Find the start of a line.
([^#\n]?)+? -- Find as few ([^#\n]?) as you can before matching the rest of the expression.
([^#\n]?) -- Find 0 or 1 of [^#\n].
[^#\n] -- Find anything that's not # or \n.
Note that we're using raw strings r'' to prevent double escaping things like backslashes when creating our regex expressions, and we're using re.M to search across line breaks.
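For reference, here's roughly what the pure-lookbehind route mentioned above might look like with the third-party regex module. This is a sketch, not something I'd call definitive: it assumes regex's variable-width lookbehind works as documented, and reuses string_to_replace from above.

import regex  # third-party: pip install regex

# The lookbehind asserts that only non-'#', non-newline characters sit
# between the start of the line and the match position.
replaced_string = regex.sub(
    r'(?<=^[^#\n]*)replace me',
    'replaced!',
    string_to_replace,
    count=1,
    flags=regex.M,
)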
Note that the behavior is a bit weird if the string you're trying to replace contains the pattern \n#. In that case, you'll wind up replacing part or all of one or more commented lines, which may not be what you want. Considering the problems with the alternatives, I'd be inclined to say the alternatives are all wrong approaches.
If that's not what you want, excluding all commented lines gets doubly weird because of some uncertainty in how they'd get merged back together. For example, consider the following input file.
#comment 1
replace
#comment 2
me
replace
me
What happens if you want to replace the string replace\nme? Do you exclude the first match because \n#comment 2 is stuck in between? If you use the first match, where does \n#comment 2 go? Does it go before or after the replacement? Is the replacement multiple lines as well so that it can still get sandwiched in? Do you just delete it?
Have a flag that marks whether you have completed the replacement yet, and then only replace when that flag is still set and the line is not a comment:
not_yet_replaced = True
with open('input-file.txt') as f:
    for l in f:
        if not_yet_replaced and not l.startswith('#') and 'replace me' in l:
            l = l.replace('replace me', 'replaced!', 1)
            not_yet_replaced = False
        print(l, end='')  # lines already end with '\n'
You can use a break after the first occurrence like so:
with open('input.txt', 'r') as f:
    content = f.read().split('\n')

for i in range(len(content)):
    if content[i] == 'replace me':
        content[i] = 'replaced'
        break

with open('input.txt', 'w') as f:
    content = '\n'.join(content)
    f.write(content)
Output:
(xenial)vash#localhost:~/python/stack_overflow$ cat input.txt
#replace me
#replace me
replaced
replace me
If the input file is not very big, you can read it into memory as a list of lines, iterate over the lines replacing the first match, and then write the lines back to the file:
with open('input-file.txt', 'r+') as f:
    lines = f.readlines()
    substr = 'replace me'
    for i in range(len(lines)):
        if lines[i].startswith('#'):
            continue
        if substr in lines[i]:
            lines[i] = lines[i].replace(substr, 'replaced!', 1)
            break
    f.seek(0)
    f.truncate()
    f.writelines(lines)
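If the file might be too big to read into memory comfortably, a line-by-line variant that writes to a temporary file and swaps it into place could look like this. A sketch only: it assumes the temporary file can live in the same directory, so that os.replace stays on one filesystem.

import os
import tempfile

substr = 'replace me'
replaced = False
with open('input-file.txt') as src, tempfile.NamedTemporaryFile(
        'w', dir='.', delete=False) as dst:
    for line in src:
        if not replaced and not line.startswith('#') and substr in line:
            line = line.replace(substr, 'replaced!', 1)
            replaced = True
        dst.write(line)
os.replace(dst.name, 'input-file.txt')  # atomically swap the new file in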
I'm not sure whether you've already managed to get the text out of the file; you can do that with
f = open("input-file.txt", "r")
text = f.read()
f.close()
Then the way I would do this is first split the text into lines like so
lines = text.split("\n")
then do the replacement on each line, checking it does not start with a "#"
for index, line in enumerate(lines):
    if len(line) > 0 and line[0] != "#" and "replace me" in line:
        lines[index] = line.replace("replace me", "replaced!", 1)
        break
then stitch the lines back together.
new_text = "\n".join(lines)
hope this helps :)
The easiest way is to use a multiline regex along with its sub() method, giving it a count of 1. Because the pattern is anchored with ^ and $, the commented lines can never match:
import re

r = re.compile("^replace me$", re.M)
s = """
#replace me
#replace me
replace me
replace me
"""
print(r.sub("replaced!", s, 1))
Gives
#replace me
#replace me
replaced!
replace me
I came up with the below, which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (e.g. [0-9], etc.), but I cannot get this, or variables or regex, to work. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two numbers are present, e.g. foo12 and bar54. Use regex!
import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
Breaking it down:
pattern = r'''(      # start capture group, not needed here, but nice if you want the actual match back
    [a-zA-Z]{3}      # any three letters in a row, any case
    \d{2}            # any two digits
)                    # end capture group
\b                   # a word boundary (whitespace, punctuation, or end of line all count)
'''
(To actually use this spaced-out form, compile it with re.VERBOSE.)
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
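If you want to sanity-check the pattern against the example strings from above, a quick sketch:

import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
tests = ['some line foo12', 'another foo54 line', 'a third line foo', 'something bar123']
for t in tests:
    print(t, '->', bool(re.search(pattern, t)))
# prints True, True, False, False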
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file." The best way to accomplish this is a regex. In Python, the standard regex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)

data = [x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x)]

with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting. I also removed the filtering outside of your IO. In general, each portion of your code should do "one thing" for ease of testing and maintenance.
How can I reduce multiple blank lines in a text file to a single line at each occurrence?
I have read the entire file into a string, because I want to do some replacement across line endings.
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = sourceFile.read()
This doesn't seem to work
while '\n\n\n' in sourceFileContents:
    sourceFileContents = sourceFileContents.replace('\n\n\n', '\n\n')
nor does this
sourceFileContents = re.sub('\n\n\n+', '\n\n', sourceFileContents)
It's easy enough to strip them all, but I want to reduce multiple blank lines to a single one, each time I encounter them.
I feel that I'm close, but just can't get it to work.
This is a reach, but perhaps some of the lines aren't completely blank (i.e. they have only whitespace characters that give the appearance of blankness). You could try removing all possible whitespace between newlines.
re.sub(r'(\n\s*)+\n+', '\n\n', sourceFileContents)
Edit: realized the second '+' was superfluous, as the \s* will catch newlines between the first and last. We just want to make sure the last character is definitely a newline so we don't remove leading whitespace from a line with other content.
re.sub(r'(\n\s*)+\n', '\n\n', sourceFileContents)
Edit 2
re.sub(r'\n\s*\n', '\n\n', sourceFileContents)
Should be an even simpler solution. We really just want to catch any possible space (which includes intermediate newlines) between our two anchor newlines and collapse it down to just the two newlines that make the single blank line.
Your code works for me. Maybe there's a chance that carriage returns (\r) are present:
re.sub(r'[\r\n][\r\n]{2,}', '\n\n', sourceFileContents)
You can use just str methods split and join:
text = "some text\n\n\n\nanother line\n\n"
print("\n".join(item for item in text.split('\n') if item))
A very simple approach using the re module:
import re
text = 'Abc\n\n\ndef\nGhijk\n\nLmnop'
text = re.sub('[\n]+', '\n', text) # Replacing one or more consecutive newlines with single \n
Result:
'Abc\ndef\nGhijk\nLmnop'
If the lines are completely empty, you can use regex positive lookahead to replace them with single lines:
sourceFileContents = re.sub(r'\n+(?=\n)', '\n', sourceFileContents)
If you replace your read statement with the following, then you don't have to worry about whitespace or carriage returns:
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = ''.join([l.rstrip() + '\n' for l in sourceFile])
After doing this, both of the methods you tried in the OP work.
OR
Just write it out in a simple loop.
with open(sourceFileName, 'rt') as sourceFile:
    lines = ['']
    for line in (l.rstrip() for l in sourceFile):
        if line != '' or lines[-1] != '\n':
            lines.append(line + '\n')
    sourceFileContents = "".join(lines)
I guess another option which is longer, but maybe prettier?
with open(sourceFileName, 'rt') as sourceFile:
    last_line = None
    lines = []
    for line in sourceFile:
        # if you want to skip lines with only whitespace, you could add something like:
        # line = line.lstrip(" \t")
        if line != "\n" or last_line != "\n":  # keep anything except a blank line after a blank line
            lines.append(line)
        last_line = line
    contents = "".join(lines)
I was trying to find some clever generator-function way of writing this, but it's been a long week so I can't; a rough sketch of what that might look like is below.
Code untested, but I think it should work?
(edit: One upside is I removed the need for regular expressions which fixes the "now you have two problems" problem :) )
(another edit based on Marc Chiesa's suggestion of lingering whitespace)
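For what it's worth, here is a sketch of the generator version alluded to above, reusing sourceFileName from the question. It's the same last-line check, just factored into a generator:

def collapse_blank_lines(lines):
    """Yield lines, dropping any blank line that directly follows another blank line."""
    last = None
    for line in lines:
        if line != "\n" or last != "\n":
            yield line
        last = line

with open(sourceFileName, 'rt') as sourceFile:
    contents = "".join(collapse_blank_lines(sourceFile))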
For someone who can't do regex like me, if the code to process is Python:
import autopep8
fixed_code = autopep8.fix_code('your_code')
Another quick solution, just in case your code isn't Python:
for x in range(100):
    content = content.replace("  ", " ")  # reduce the number of multiple whitespaces
# then
for x in range(20):
    content = content.replace("\n\n", "\n")  # reduce the number of multiple blank lines
Note that if you have more than 100 consecutive whitespaces or 20 consecutive new lines, you'll want to increase the repetition times.
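If guessing the repetition count bothers you, a small sketch of a variant that simply loops until nothing is left to collapse (the same technique as the OP's own while loop):

while "  " in content:
    content = content.replace("  ", " ")
while "\n\n" in content:
    content = content.replace("\n\n", "\n")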
If decoding from unicode, watch out for non-breaking spaces which show up in cat -vet as M-BM-:
sourceFileContents = sourceFile.read()
# U+00A0 is the non-breaking space; its UTF-8 bytes C2 A0 are what cat -vet shows as M-BM-
sourceFileContents = re.sub(r'\n(\s*\n)+', '\n\n', sourceFileContents.replace("\u00a0", " "))
In a Python function I'm writing, I'm going through a text file line by line to replace each occurrence of a certain string with a (numerical) value. Once I reach the end of the file, I would like to know whether the string appeared in the file at all.
The function str.replace() does not tell you whether anything was replaced, so I find myself having to go over each line twice: once to look for the string and again to replace it.
So far, I've come up with 2 ways to do this.
For each line:
use line.find(...) to look for the string, if it hasn't been found before
if the string is found, mark it as found
newLine = line.replace(...)
(do sth. with newLine ...)
For each line:
do newLine = line.replace(...) first
if newLine != line mark the string as found
(do sth. with newLine ...)
Here's my question:
Is there a better, i.e., more efficient or more pythonic way to do this?
If not, which of the above ways is faster?
I'd do something roughly like
found = False
newlines = []
for line in f:
    if oldstring in line:
        found = True
        newlines.append(line.replace(oldstring, newstring))
    else:
        newlines.append(line)
Because that's the version that's most easily understandable, to me at least.
There may be faster ways, but the best way depends on how often the string will occur in lines. Almost every line or almost no lines, that makes a big difference.
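If the speed difference matters to you, a quick timeit sketch for comparing the two variants yourself (the line below is a made-up example; results will depend on how often the string actually occurs in your data):

import timeit

setup = "line = 'some text with oldstring in it ' * 10; old, new = 'oldstring', 'newstring'"
# variant 1: membership test first, then replace
t1 = timeit.timeit("found = old in line; new_line = line.replace(old, new)", setup=setup)
# variant 2: replace first, then compare the result
t2 = timeit.timeit("new_line = line.replace(old, new); found = new_line != line", setup=setup)
print(t1, t2)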
This example will work with multiple replacements:
replacements = {'string1': ['1', 0], 'string2': ['2', 0]}
with open('somefile.txt') as f:
    for line in f:
        for key, value in replacements.items():
            if key in line:
                line = line.replace(key, value[0])
                replacements[key][1] += 1
        # do something with the rewritten line here

# At the end
for key, value in replacements.items():
    print('Replaced {} with {} {} times'.format(key, *value))
Since we have to go through the string twice anyway, I'd do it as follows:
import re

with open('yourfile.txt', 'r', encoding='utf-8') as f:  # check the encoding
    s = f.read()

oldstr, newstr = 'XXX', 'YYY'
count = len(list(re.finditer(re.escape(oldstr), s)))  # escape in case oldstr contains regex metacharacters
s_new = s.replace(oldstr, newstr)
print(oldstr, 'has been found and replaced by', newstr, count, 'times')
I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
f = open('this.txt', 'r')
g = open('that.txt', 'w')
text = f.read()
matches = re.findall('', text)  # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)
f.close()
g.close()
My issue is I want each matched item on a new line (hence the '\n') but I don't want a blank line at the end of the file.
I guess I need to not have the last item in the file being trailed by a new line character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text)  # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text)  # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))
That should work:
import re  # regex may be the easiest way to split that line

with open(infile) as in_f, open(outfile, 'w') as out_f:
    f = (i for i in in_f if i.rstrip())  # iterate over non-empty lines
    for line in f:
        _, k = line.split('\t', 1)
        x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$', k)
        if not x:
            continue
        out_f.write(' '.join(x[0]) + '\n')
You can use .strip() to remove any whitespace around an item before writing it out; that keeps things a bit cleaner and avoids stray-whitespace issues. Note that split() returns a list, so strip each element.
For example:
b = [s.strip() for s in a.split('chr')]   # no whitespace on either side now
c = [s.strip() for s in b[1].split(':')]  # no whitespace
d = [s.strip() for s in c[1].split('..')]
e = b[0] + '\t' + c[0] + '\t' + d[0] + '\t' + d[1] + '\t' + '\n'
rfh.write(e)
This removes any pre-existing whitespace, leaving only your \t's.
Why not use a regex split?
import re

with open(<infile>) as inf:
    for annot_info in inf:
        split_array = re.split(r'(\W+)(chr\w+):(\d+)\.\.(\d+)', annot_info)
        # do your sql processing here
        # write out to a file if you wish to
For an input line like +chr6:140302505..140302604, this would give you ['', '+', 'chr6', '140302505', '140302604', '']. You can use the same in your current mysql methods.
PS: The regex pattern I've used would give you empty strings at the beginning and end. Modify the regex or change your sql insert to exclude first and last elements of array while pushing.
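For example, one way to drop those empty edge strings before the insert (a sketch, using a sample value inferred from the output shown above):

import re

annot_info = '+chr6:140302505..140302604'  # example value inferred from the output above
split_array = re.split(r'(\W+)(chr\w+):(\d+)\.\.(\d+)', annot_info)
fields = [part for part in split_array if part]  # drop the empty strings at the edges
print(fields)  # ['+', 'chr6', '140302505', '140302604']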