How can I reduce multiple blank lines in a text file to a single line at each occurrence?
I have read the entire file into a string, because I want to do some replacement across line endings.
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = sourceFile.read()
This doesn't seem to work
while '\n\n\n' in sourceFileContents:
    sourceFileContents = sourceFileContents.replace('\n\n\n', '\n\n')
and nor does this
sourceFileContents = re.sub('\n\n\n+', '\n\n', sourceFileContents)
It's easy enough to strip them all, but I want to reduce multiple blank lines to a single one, each time I encounter them.
I feel that I'm close, but just can't get it to work.
This is a reach, but perhaps some of the lines aren't completely blank (i.e. they have only whitespace characters that give the appearance of blankness). You could try removing all possible whitespace between newlines.
re.sub(r'(\n\s*)+\n+', '\n\n', sourceFileContents)
Edit: realized the second '+' was superfluous, as the \s* will catch newlines between the first and last. We just want to make sure the last character is definitely a newline so we don't remove leading whitespace from a line with other content.
re.sub(r'(\n\s*)+\n', '\n\n', sourceFileContents)
Edit 2
re.sub(r'\n\s*\n', '\n\n', sourceFileContents)
Should be an even simpler solution. We really just want to catch any possible whitespace (which includes intermediate newlines) between our two anchor newlines, and collapse the whole match down to just the two newlines that make the single blank line.
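To make the behaviour of that last pattern concrete, here is a minimal sketch. The helper name collapse_blank_lines and the sample text are made up for illustration; only re.sub and the pattern come from the answer above.

```python
import re

def collapse_blank_lines(text):
    """Collapse each run of blank or whitespace-only lines into one blank line."""
    # \n\s*\n: a newline, any whitespace (including more newlines), then a
    # final newline. Backtracking to that final \n leaves the indentation of
    # the following line untouched.
    return re.sub(r'\n\s*\n', '\n\n', text)

sample = "first\n\n\n  \t\n    indented\n\nlast\n"
result = collapse_blank_lines(sample)
print(repr(result))
```

Single blank lines are matched too, but they are replaced by themselves, so running the function twice gives the same result.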
Your code works for me. Perhaps the file also contains carriage returns (\r).
re.sub(r'[\r\n][\r\n]{2,}', '\n\n', sourceFileContents)
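If carriage returns really are the culprit, another option is to normalise the line endings first and collapse afterwards. This is only a sketch: it assumes the text still contains \r characters (Python 3's default text mode already translates \r\n to \n on read, so this mostly matters for binary reads or newline=''), and the helper name is hypothetical.

```python
import re

def collapse_blank_lines_any_eol(text):
    # Normalise Windows (\r\n) and lone \r line endings to \n first.
    normalised = text.replace('\r\n', '\n').replace('\r', '\n')
    # Three or more newlines in a row mean two or more blank lines; keep one.
    return re.sub(r'\n{3,}', '\n\n', normalised)

crlf_text = "a\r\n\r\n\r\n\r\nb\r\n"
collapsed = collapse_blank_lines_any_eol(crlf_text)
print(repr(collapsed))
```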
You can use just str methods split and join:
text = "some text\n\n\n\nanother line\n\n"
print("\n".join(item for item in text.split('\n') if item))
Very simple approach using re module
import re
text = 'Abc\n\n\ndef\nGhijk\n\nLmnop'
text = re.sub('[\n]+', '\n', text) # Replacing one or more consecutive newlines with single \n
Result:
'Abc\ndef\nGhijk\nLmnop'
If the lines are completely empty, you can use regex positive lookahead to replace them with single lines:
sourceFileContents = re.sub(r'\n+(?=\n)', '\n', sourceFileContents)
If you replace your read statement with the following, then you don't have to worry about whitespace or carriage returns:
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = ''.join([l.rstrip() + '\n' for l in sourceFile])
After doing this, both of your methods you tried in the OP work.
OR
Just write it out in a simple loop.
with open(sourceFileName, 'rt') as sourceFile:
    lines = ['']
    for line in (l.rstrip() for l in sourceFile):
        if line != '' or lines[-1] != '\n':
            lines.append(line + '\n')
    sourceFileContents = "".join(lines)
I guess another option which is longer, but maybe prettier?
with open(sourceFileName, 'rt') as sourceFile:
    last_line = None
    lines = []
    for line in sourceFile:
        # if you want to skip lines with only whitespace, you could add something like:
        # line = line.lstrip(" \t")
        # keep the line unless it is a blank line directly after another blank line
        if line != "\n" or last_line != "\n":
            lines.append(line)
        last_line = line
contents = "".join(lines)
I was trying to find some clever generator function way of writing this, but it's been a long week so I can't.
Code untested, but I think it should work?
(edit: One upside is I removed the need for regular expressions which fixes the "now you have two problems" problem :) )
(another edit based on Marc Chiesa's suggestion of lingering whitespace)
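For what it's worth, the generator version hinted at above could look roughly like this. It is a sketch, not the answerer's code: the function name squeeze_blank_lines is invented, and it treats whitespace-only lines as blank.

```python
def squeeze_blank_lines(lines):
    """Yield lines, dropping every blank line that follows another blank line."""
    previous_blank = False
    for line in lines:
        is_blank = not line.strip()  # whitespace-only counts as blank
        if is_blank and previous_blank:
            continue
        # Emit blank lines as a bare newline so stray spaces are dropped.
        yield '\n' if is_blank else line
        previous_blank = is_blank

demo = ["a\n", "\n", "  \n", "\n", "b\n", "\n", "c\n"]
squeezed = list(squeeze_blank_lines(demo))
print(''.join(squeezed))
```

Because it only looks at one line at a time, it can be fed an open file object directly, e.g. ''.join(squeeze_blank_lines(sourceFile)).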
For someone who can't do regex like me, if the code to process is python:
import autopep8
autopep8.fix_code('your_code')
Another quick solution, just in case your code isn't Python:
for x in range(100):
    content = content.replace("  ", " ")  # collapse double spaces; replace() returns a new string
# then
for x in range(20):
    content = content.replace("\n\n", "\n")  # collapse consecutive newlines
Note that if you have more than 100 consecutive whitespaces or 20 consecutive new lines, you'll want to increase the repetition times.
If decoding from unicode, watch out for non-breaking spaces which show up in cat -vet as M-BM-:
sourceFileContents = sourceFile.read()
sourceFileContents = re.sub(r'\n(\s*\n)+','\n\n',sourceFileContents.replace("\xc2\xa0"," "))
I have a complete_list_of_records which has a length of 550
this list would look something like this:
Apples
Pears
Bananas
The issue is that when I use:
with open("recordedlines.txt", "a") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i)
the resulting file is only 393 lines long, and in some places the structure looks like this:
Apples
PearsBananas
Pineapples
I have tried "w" instead of "a" (append) and manually inserted "\n" after each item in the list, but this just creates blank lines on every second row, and some rows still have two items merged into one line.
Has anyone encountered something similar?
From the comments seen so far, I think there are strings in the source list that contain newline characters in positions other than at the end. Also, it seems that some strings end with newline character(s) but not all.
I suggest replacing embedded newlines with some other character - e.g., underscore.
Therefore I suggest this:
with open("recordedlines.txt", "w") as recorded_lines:
    for line in complete_list_of_records:
        line = line.rstrip()  # remove trailing whitespace
        line = line.replace('\n', '_')  # replace any embedded newlines with underscore
        print(line, file=recorded_lines)  # print function will add a newline
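To see why this gives exactly one output line per record, here is a small self-contained check. The record values are invented to mimic the suspected problem (inconsistent trailing newlines plus one embedded newline), and io.StringIO stands in for the real file.

```python
import io

# Made-up data: some items end in '\n', some don't, one has an embedded '\n'.
records = ["Apples\n", "Pears", "Bananas\n", "Pine\napples"]

buffer = io.StringIO()  # stands in for the real output file
for record in records:
    cleaned = record.rstrip().replace('\n', '_')
    print(cleaned, file=buffer)  # print appends exactly one newline

written = buffer.getvalue()
print(written, end='')
```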
You could simply strip all surrounding whitespace in any case and then add a newline by hand, like so:
with open("recordedlines.txt", "a") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i.strip() + "\n")
you need to use
file.writelines(listOfRecords)
but the list values must have '\n'
f = open("demofile3.txt", "a")
li = ["See you soon!", "Over and out."]
li = [i+'\n' for i in li]
f.writelines(li)
f.close()
#open and read the file after the appending:
f = open("demofile3.txt", "r")
print(f.read())
output will be
See you soon!
Over and out.
you can also use for loop with write() having '\n' at each iteration
complete_list_of_records =['1.Apples','2.Pears','3.Bananas','4.Pineapples']
with open("recordedlines.txt", "w") as recorded_lines:
    for i in complete_list_of_records:
        recorded_lines.write(i+"\n")
I think it should work.
Make sure that each item you write is a string.
I have a .txt file with values in it.
The values are listed like so:
Value1
Value2
Value3
Value4
My goal is to put the values in a list. When I do so, the list looks like this:
['Value1\n', 'Value2\n', ...]
The \n is not needed.
Here is my code:
t = open('filename.txt')
contents = t.readlines()
This should do what you want (file contents in a list, by line, without \n)
with open(filename) as f:
    mylist = f.read().splitlines()
I'd do this:
alist = [line.rstrip() for line in open('filename.txt')]
or:
with open('filename.txt') as f:
    alist = [line.rstrip() for line in f]
You can use .rstrip('\n') to only remove newlines from the end of the string:
alist = []
for i in contents:
    alist.append(i.rstrip('\n'))
This leaves all other whitespace intact. If you don't care about whitespace at the start and end of your lines, then the big heavy hammer is called .strip().
However, since you are reading from a file and are pulling everything into memory anyway, better to use the str.splitlines() method; this splits one string on line separators and returns a list of lines without those separators; use this on the file.read() result and don't use file.readlines() at all:
alist = t.read().splitlines()
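A quick side-by-side of the options mentioned so far, as a sketch on made-up data (io.StringIO stands in for the opened file, and the sample deliberately has padded values and no trailing newline on the last line):

```python
import io

raw = "Value1\nValue2\n  Value3  \nValue4"  # last line has no newline

with_newlines = io.StringIO(raw).readlines()
via_splitlines = raw.splitlines()
via_rstrip_nl = [line.rstrip('\n') for line in io.StringIO(raw)]
via_strip = [line.strip() for line in io.StringIO(raw)]

print(with_newlines)   # newlines kept
print(via_splitlines)  # newlines gone, other whitespace kept
print(via_rstrip_nl)   # same as splitlines for this input
print(via_strip)       # surrounding spaces gone as well
```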
After opening the file, list comprehension can do this in one line:
fh=open('filename')
newlist = [line.rstrip() for line in fh.readlines()]
fh.close()
Just remember to close your file afterwards.
I used the strip function to get rid of the newline character, as splitlines was throwing memory errors on a 4 GB file.
Sample Code:
with open('C:\\aapl.csv','r') as apple:
    for apps in apple:  # iterate the file directly so it is never loaded in full
        print(apps.strip())
For each string in your list, use .strip(), which removes whitespace from the beginning and end of the string:
for i in contents:
    alist.append(i.strip())
But depending on your use case, you might be better off using something like numpy.loadtxt or even numpy.genfromtxt if you need a nice array of the data you're reading from the file.
with open('bvc.txt') as f:
    # str.rstrip replaces the Python 2-only string.rstrip function
    alist = list(map(str.rstrip, f))
Nota bene: rstrip() with no argument removes trailing whitespace, that is to say \f, \n, \r, \t, \v and the space character,
but I suppose you're only interested in keeping the significant characters in the lines. In that case map(str.strip, f) fits better, since it removes the leading whitespace too.
If you really want to eliminate only the NL \n and CR \r symbols, do:
with open('bvc.txt') as f:
    alist = f.read().splitlines()
splitlines() called without an argument doesn't keep the NL and CR symbols (Windows writes files with CR LF at the end of lines, at least on my machine) but keeps the other whitespace, notably spaces and tabs.
with open('bvc.txt') as f:
    alist = f.read().splitlines(True)
has the same effect as
with open('bvc.txt') as f:
    alist = f.readlines()
that is to say, the NL and CR are kept.
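One caveat, offered as a hedged aside rather than a correction: for str objects, splitlines() treats more characters as line boundaries than file reading does (form feed \x0c and vertical tab \x0b, among others), so the two calls can disagree on unusual input. A small illustration, with io.StringIO in place of a real file and an invented sample string:

```python
import io

text = "one\x0ctwo\nthree\n"  # \x0c is a form feed inside the first line

from_readlines = io.StringIO(text).readlines()  # splits only on \n
from_splitlines = text.splitlines(True)         # also splits on the form feed

print(from_readlines)
print(from_splitlines)
```

For ordinary text files with only \n or \r\n endings the two give the same list.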
I had the same problem and I found the following solution to be very efficient. I hope that it will help you or anyone else who wants to do the same thing.
First of all, I would start with a "with" statement, as it ensures the file is properly opened and closed.
It should look something like this:
with open("filename.txt", "r+") as f:
    contents = [x.strip() for x in f.readlines()]
If you want to convert those strings (every item in the contents list is a string) to integers or floats, you can do the following:
contents = [float(x) for x in contents]
Use int instead of float if you want to convert to integer.
It's my first answer in SO, so sorry if it's not in the proper formatting.
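If the file may contain blank lines, float('') raises ValueError, so it can help to filter them out while converting. A minimal sketch on invented data (io.StringIO stands in for the opened file):

```python
import io

source = io.StringIO("1.5\n2\n\n  3.25 \n")  # stand-in for the opened file

# float() ignores surrounding whitespace, so only empty lines need skipping.
values = [float(line) for line in source if line.strip()]
print(values)
```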
I recently used this to read all the lines from a file:
alist = open('maze.txt').read().split()
or you can use this for that little bit of extra added safety:
with open('maze.txt') as f:
    alist = f.read().split()
It doesn't work with whitespace in-between text in a single line, but it looks like your example file might not have whitespace splitting the values. It is a simple solution and it returns an accurate list of values, and does not add an empty string: '' for every empty line, such as a newline at the end of the file.
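The trade-off described above is easy to see on a small made-up string: split() breaks on any whitespace and never produces empty strings, while splitlines() keeps each line whole.

```python
text = "Value 1\nValue2\n\nValue3\n"  # invented sample with a space and a blank line

by_whitespace = text.split()
by_line = text.splitlines()

print(by_whitespace)
print(by_line)
```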
with open('D:\\file.txt', 'r') as f1:
    lines = f1.readlines()
lines = [s[:-1] for s in lines]
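The easiest way to do this is to write file.readline()[0:-1]
This returns the line without its last character, which is normally the newline. Note that the final line of a file may have no trailing newline, in which case the slice removes a real character.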
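A short demonstration of where slicing off the last character can go wrong, using an invented list in which the final line has no newline:

```python
lines = ["Value1\n", "Value2\n", "Value3"]  # last line lacks a trailing newline

sliced = [s[:-1] for s in lines]        # always drops one character
safe = [s.rstrip('\n') for s in lines]  # drops only trailing newlines

print(sliced)
print(safe)
```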
I actually want to do a search and replace but ignore all my commented lines, and I also just want to replace only the first found...
input-file.txt
#replace me
#replace me
replace me
replace me
...like with:
text = text.replace("replace me", "replaced!", 1) # with max. 1 rep.
But I'm not sure how to approach(ignore) those comments. So that I get:
#replace me
#replace me
replaced!
replace me
As I see it, the existing solutions have one or more of several problems:
Incomplete (e.g. requiring match on start of line)
Incomplete (e.g. requiring match not containing \n)
Clunky (e.g. looong file-based solutions)
I'm pretty sure a pure-regex solution would require variable-width lookbehinds, which the re module doesn't support (though I think the regex module does). With a small tweak though, regex can still provide a fairly clean answer.
import re
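# Locate the first line that contains 'replace me' with no '#' before it.
match = re.search(r'^([^#\n]?)+replace me', string_to_replace, re.M)
i = match.start() if match else len(string_to_replace)
replaced_string = ''.join([
    string_to_replace[:i],
    re.sub(r'replace me', 'replaced!', string_to_replace[i:], count=1),
])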
The idea is that you find the first uncommented line containing the start of your match, and then you replace the first instance of 'replace me' that you find starting on that line. The ^([^#\n]?)+ bit in the regex says
^ -- Find the start of a line.
([^#\n]?)+ -- Match one or more repetitions of ([^#\n]?), i.e. the run of characters leading up to the rest of the expression.
([^#\n]?) -- Find 0 or 1 of [^#\n].
[^#\n] -- Find anything that's not # or \n.
Note that we're using raw strings r'' to prevent double escaping things like backslashes when creating our regex expressions, and we're using re.M so that ^ matches at the start of every line rather than only at the start of the string.
Note that the behavior is a bit weird if the string you're trying to replace contains the pattern \n#. In that case, you'll wind up replacing part or all of one or more commented lines, which may not be what you want. Considering the problems with the alternatives, I'd be inclined to say the alternatives are all wrong approaches.
If that's not what you want, excluding all commented lines gets doubly weird because of some uncertainty in how they'd get merged back together. For example, consider the following input file.
#comment 1
replace
#comment 2
me
replace
me
What happens if you want to replace the string replace\nme? Do you exclude the first match because \n#comment 2 is stuck in between? If you use the first match, where does \n#comment 2 go? Does it go before or after the replacement? Is the replacement multiple lines as well so that it can still get sandwiched in? Do you just delete it?
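For the simpler case where the search string sits on a single line, a capture group can stand in for the variable-width lookbehind: capture whatever precedes the match on that line and put it back in the replacement. This is a sketch under that single-line assumption, and the function name is invented for illustration.

```python
import re

def replace_first_uncommented_match(text, old, new):
    # ^([^#\n]*) captures everything on the line before the match; the
    # character class guarantees that prefix holds no '#' and no line break.
    pattern = re.compile(r'^([^#\n]*)' + re.escape(old), re.M)
    # A function replacement avoids backslash processing in `new`.
    return pattern.sub(lambda m: m.group(1) + new, text, count=1)

sample = "#replace me\n#replace me\nreplace me\nreplace me\n"
replaced = replace_first_uncommented_match(sample, "replace me", "replaced!")
print(replaced, end='')
```

This also skips occurrences that come after a # in the middle of a line, which may or may not be what you want.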
Have a flag that marks whether you have completed the replacement yet, and only replace while that flag is still true and the line is not a comment:
not_yet_replaced = True
with open('input-file.txt') as f:
    for l in f:
        if not_yet_replaced and not l.startswith('#') and 'replace me' in l:
            l = l.replace('replace me', 'replaced!', 1)  # first match on the line only
            not_yet_replaced = False
        print(l, end='')  # l already ends with a newline
You can use a break after the first occurrence like so:
with open('input.txt', 'r') as f:
    content = f.read().split('\n')
for i in range(len(content)):
    if content[i] == 'replace me':
        content[i] = 'replaced'
        break
with open('input.txt', 'w') as f:
    content = ('\n').join(content)
    f.write(content)
Output :
(xenial)vash#localhost:~/python/stack_overflow$ cat input.txt
#replace me
#replace me
replaced
replace me
If the input file is not very big, you can read it into memory as a list of lines. Then iterate over the lines and replace the first matching one. Then write the lines back to the file:
with open('input-file.txt', 'r+') as f:
    lines = f.readlines()
    substr = 'replace me'
    for i in range(len(lines)):
        if lines[i].startswith('#'):
            continue
        if substr in lines[i]:
            lines[i] = lines[i].replace(substr, 'replaced!', 1)
            break
    f.seek(0)
    f.truncate()
    f.writelines(lines)
I'm not sure whether or not you have managed to get the text out of the file, so you can do that by doing
f = open("input-file.txt", "r")
text = f.read()
f.close()
Then the way I would do this is first split the text into lines like so
lines = text.split("\n")
then do the replacement on each line, checking it does not start with a "#"
for index, line in enumerate(lines):
    if len(line) > 0 and line[0] != "#" and "replace me" in line:
        lines[index] = line.replace("replace me", "replaced!", 1)
        break
then stitch the lines back together.
new_text = "\n".join(lines)
hope this helps :)
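The split, replace and join steps above fit naturally into one function. A minimal sketch: the function name and the comment_prefix parameter are hypothetical, and unlike the snippet above it also treats indented comment lines as comments.

```python
def replace_first_uncommented(text, old, new, comment_prefix='#'):
    """Replace the first `old` found on a line that is not a comment."""
    lines = text.split('\n')
    for index, line in enumerate(lines):
        if line.lstrip().startswith(comment_prefix):
            continue  # skip comment lines, indented or not
        if old in line:
            lines[index] = line.replace(old, new, 1)
            break  # stop after the first replacement
    return '\n'.join(lines)

sample_text = "#replace me\n#replace me\nreplace me\nreplace me\n"
new_text = replace_first_uncommented(sample_text, "replace me", "replaced!")
print(new_text, end='')
```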
Easiest way is to use a multiline regex along with its sub() method and giving it a count of 1:
import re
r = re.compile("^replace me$", re.M)
s = """
#replace me
#replace me
replace me
replace me
"""
r.sub("replaced!", s, 1)
Gives
#replace me
#replace me
replaced!
replace me
That should work:
import re #Regex may be the easiest way to split that line
with open(infile) as in_f, open(outfile,'w') as out_f:
    f = (i for i in in_f if i.rstrip()) #iterate over non empty lines
    for line in f:
        _, k = line.split('\t', 1)
        x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
        if not x:
            continue
        out_f.write(' '.join(x[0]) + '\n')
You can use .strip() to remove any whitespace around an item before entering it. This would allow a bit more clarity and solve any indentation issues.
For example:
b = [part.strip() for part in a.split('chr')]    # split() returns a list, so strip each piece
c = [part.strip() for part in b[1].split(':')]   # no white space
d = [part.strip() for part in c[1].split('..')]
e = b[0]+'\t'+c[0]+'\t'+d[0]+'\t'+d[1]+'\t'+'\n'
rfh.write(e)
What this will have done is remove any existing whitespace, and let only your \t's exist.
Why not use a regex split ?
import re
with open(infile) as inf:  # infile: path to your input file
    for annot_info in inf:
        split_array = re.split(r'(\W+)(chr\w+):(\d+)\.\.(\d+)', annot_info)
        #do your sql processing here.
        #write out to a file if you wish to.
would give you ['', '+', 'chr6', '140302505', '140302604', '']. You can use the same in your current mysql methods.
PS: The regex pattern I've used would give you empty strings at the beginning and end. Modify the regex or change your sql insert to exclude first and last elements of array while pushing.
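If the empty strings at the ends are a nuisance, matching with named groups instead of splitting avoids them altogether. This is only a sketch: the sample string is a guess at the input format, reconstructed from the output shown above, and the group names are made up.

```python
import re

# Strand sign, chromosome name, then start..end coordinates.
ANNOTATION = re.compile(
    r'(?P<strand>[+-])(?P<chrom>chr\w+):(?P<start>\d+)\.\.(?P<end>\d+)'
)

sample = "+chr6:140302505..140302604"  # assumed input, not taken from a real file
match = ANNOTATION.search(sample)
fields = match.groupdict() if match else {}
print(fields)
```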