I have looked around StackOverflow and couldn't find an answer to my specific question so forgive me if I have missed something.
import re
target = open('output.txt', 'w')
for line in open('input.txt', 'r'):
match = re.search(r'Stuff', line)
if match:
match_text = match.group()
target.write(match_text + '\n')
else:
continue
target.close()
The file I am parsing is huge so need to process it line by line.
This (of course) leaves an additional newline at the end of the file.
How should I best change this code so that on the final iteration of the 'if match' loop it doesn't put the extra newline character at the end of the file. Should it look through the file again at the end and remove the last line (seems a bit inefficient though)?
The existing StackOverflow questions I have found cover removing all new lines from a file.
If there is a more pythonic / efficient way to write this code I would welcome suggestions for my own learning also.
Thanks for the help!
Another thing you can do, is to truncate the file. .tell() gives us the current byte number in the file. We then subtract one, and truncate it there to remove the trailing newline.
with open('a.txt', 'w') as f:
f.write('abc\n')
f.write('def\n')
f.truncate(f.tell()-1)
On Linux and MacOS, the -1 is correct, but on Windows it needs to be -2. A more Pythonic method of determining which is to check os.linesep.
import os
remove_chars = len(os.linesep)
with open('a.txt', 'w') as f:
f.write('abc\n')
f.write('def\n')
f.truncate(f.tell() - remove_chars)
kindal's answer is also valid, with the exception that you said it's a large file. This method will let you handle a terabyte sized file on a gigabyte of RAM.
Write the newline of each line at the beginning of the next line. To avoid writing a newline at the beginning of the first line, use a variable that is initialized to an empty string and then set to a newline in the loop.
import re
with open('input.txt') as source, open('output.txt', 'w') as target:
newline = ''
for line in source:
match = re.search(r'Stuff', line)
if match:
target.write(newline + match.group())
newline = '\n'
I also restructured your code a bit (the else: continue is not needed, because what else is the loop going to do?) and changed it to use the with statement so the files are automatically closed.
The shortest path from what you have to what you want is probably to store the results in a list, then join the list with newlines and write that to the file.
import re
target = open('output.txt', 'w')
results = []
for line in open('input.txt', 'r'):
match = re.search(r'Stuff', line)
if match:
results.append(match.group())
target.write("\n".join(results))
target.close()
Voilà, no extra newline at the beginning or end. Might not scale very well of the resulting list is huge. (And like kindall I left out the else)
Since you're performing the same regex over and over, you'd probably want to compile it beforehand.
import re
prog = re.compile(r'Stuff')
I tend to input from and output to stdin and stdout for simplicity. But that's a matter of taste (and specs).
from sys import stdin, stdout
Ignoring the specific requirement about removing the final EOL[1], and just addressing the bit about your own learning, the whole thing could be written like this:
from itertools import imap
stdout.writelines(match.group() for match in imap(prog.match, stdin) if match)
[1] As others have commented, this is a Bad Thing, and it's extremely annoying when someone does this.
Related
This public gist creates a simple scenario where you can turn a text file into a python list line by line.
with open('test.txt', 'r') as listFile:
lines = listFile.read().split("\n")
out = []
for item in lines:
if '"' in item:
out.append('("""' + item + '"""),')
else:
out.append('("' + item + '"),')
with open('out.py', 'a') as outFile:
outFile.write("out = [\n")
for item in out:
outFile.write("\t" + item + "\n")
outFile.write("]")
In text.txt the sixth and seventh lines
'"""'
""
are the ones that produce invalid output. Perhaps you can think of some other examples that would fail to work.
EDIT:
Valid output would look something like this:
out = [
"line1",
"line2",
""" line 3 has """ and "" and " in it """, # but it is a valid string
"last line",
]
The ( and ) characters were an oversight by me they are not needed or wanted...
EDIT: Oh god I'm getting overwhelmed. I'm going to take 5 minutes and post the question again in a better form.
Using a newline character besides \n would also cause the program to fail. In Windows its common to use \r or \r\n.
#abarnert's comment shows a better way to read lines.
A text file is already an iterable of lines.
As with any other iterable, you can convert it to a list by just passing it to the list constructor:
with open('text.txt') as f:
lines = list(f)
Or, if you don't want the newlines on the end of each line:
with open('text.txt') as f:
lines = [line.rstrip('\n') for line in f]
If you want to handle classic Mac and Windows line endings as well as Unix, open the file in universal-newlines mode:
with open('text.txt', 'rU') as f:
… or use the Python 3-style io classes (but note that this will give you unicode strings, not byte strings, which will repr with u prefixes—they're still valid Python literals that way, but they won't look as pretty):
import io
with io.open('text.txt') as f:
Now, it's hard to tell from code that doesn't work and no explanation of what's wrong with it, but it looks like you're trying to figure out how to write that list out as a Python-source-format list display, wrapping it in brackets, adding quotes, escaping any internal quotes, etc. But there's a much easier way to do that too:
with open('out.py', 'a') as f:
f.write(repr(lines))
If you're trying to pretty-print it, there's a pprint module in the stdlib for exactly that purpose, and various bigger/better alternatives on PyPI. Here's an example of the output of pprint.pprint(lines, width=60) with (what I think is) the same input you used for your desired output:
['line1',
'line2',
' line 3 has """ and "" and " in it ',
'last line']
Not exactly the same as your desired output—but, unlike your output, it's a valid Python list display that evaluates to the original input, and it looks pretty readable to me.
I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
if not line.startswith("##"):
clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
if not line.startswith("##"):
clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split this into individual words using string.split() which would give you a nested list per original list element which you could easily index (assuming your text has whitespaces of course).
clean_text[4][7]
would return the 5th line, 8th word.
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = [ ]
with open(.txt, "r") as f: # .txt = file, "r" = read
for line in f:
if line[:2] != "##": #Read until the second character
listoflines.append(line)
print listoflines
If you're feeling brave, you can also do the following, CREDITS GO TO ALEX THORNTON:
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.
I have a program that writes a list to a file.
The list is a list of pipe delimited lines and the lines should be written to the file like this:
123|GSV|Weather_Mean|hello|joe|43.45
122|GEV|temp_Mean|hello|joe|23.45
124|GSI|Weather_Mean|hello|Mike|47.45
BUT it wrote them line this ahhhh:
123|GSV|Weather_Mean|hello|joe|43.45122|GEV|temp_Mean|hello|joe|23.45124|GSI|Weather_Mean|hello|Mike|47.45
This program wrote all the lines into like one line without any line breaks.. This hurts me a lot and I gotta figure-out how to reverse this but anyway, where is my program wrong here? I thought write lines should write lines down the file rather than just write everything to one line..
fr = open(sys.argv[1], 'r') # source file
fw = open(sys.argv[2]+"/masked_"+sys.argv[1], 'w') # Target Directory Location
for line in fr:
line = line.strip()
if line == "":
continue
columns = line.strip().split('|')
if columns[0].find("#") > 1:
looking_for = columns[0] # this is what we need to search
else:
looking_for = "Dummy#dummy.com"
if looking_for in d:
# by default, iterating over a dictionary will return keys
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
else:
new_idx = str(len(d)+1)
d[looking_for] = new_idx
kv = open(sys.argv[3], 'a')
kv.write(looking_for+" "+new_idx+'\n')
kv.close()
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
fw.writelines(line_list)
This is actually a pretty common problem for newcomers to Python—especially since, across the standard library and popular third-party libraries, some reading functions strip out newlines, but almost no writing functions (except the log-related stuff) add them.
So, there's a lot of Python code out there that does things like:
fw.write('\n'.join(line_list) + '\n')
(writing a single string) or
fw.writelines(line + '\n' for line in line_list)
Either one is correct, and of course you could even write your own writelinesWithNewlines function that wraps it up…
But you should only do this if you can't avoid it.
It's better if you can create/keep the newlines in the first place—as in Greg Hewgill's suggestions:
line_list.append(new_line + "\n")
And it's even better if you can work at a higher level than raw lines of text, e.g., by using the csv module in the standard library, as esuaro suggests.
For example, right after defining fw, you might do this:
cw = csv.writer(fw, delimiter='|')
Then, instead of this:
new_line = d[looking_for]+'|'+'|'.join(columns[1:])
line_list.append(new_line)
You do this:
row_list.append(d[looking_for] + columns[1:])
And at the end, instead of this:
fw.writelines(line_list)
You do this:
cw.writerows(row_list)
Finally, your design is "open a file, then build up a list of lines to add to the file, then write them all at once". If you're going to open the file up top, why not just write the lines one by one? Whether you're using simple writes or a csv.writer, it'll make your life simpler, and your code easier to read. (Sometimes there can be simplicity, efficiency, or correctness reasons to write a file all at once—but once you've moved the open all the way to the opposite end of the program from the write, you've pretty much lost any benefits of all-at-once.)
The documentation for writelines() states:
writelines() does not add line separators
So you'll need to add them yourself. For example:
line_list.append(new_line + "\n")
whenever you append a new item to line_list.
As others have noted, writelines is a misnomer (it ridiculously does not add newlines to the end of each line).
To do that, explicitly add it to each line:
with open(dst_filename, 'w') as f:
f.writelines(s + '\n' for s in lines)
writelines() does not add line separators. You can alter the list of strings by using map() to add a new \n (line break) at the end of each string.
items = ['abc', '123', '!##']
items = map(lambda x: x + '\n', items)
w.writelines(items)
As others have mentioned, and counter to what the method name would imply, writelines does not add line separators. This is a textbook case for a generator. Here is a contrived example:
def item_generator(things):
for item in things:
yield item
yield '\n'
def write_things_to_file(things):
with open('path_to_file.txt', 'wb') as f:
f.writelines(item_generator(things))
Benefits: adds newlines explicitly without modifying the input or output values or doing any messy string concatenation. And, critically, does not create any new data structures in memory. IO (writing to a file) is when that kind of thing tends to actually matter. Hope this helps someone!
Credits to Brent Faust.
Python >= 3.6 with format string:
with open(dst_filename, 'w') as f:
f.writelines(f'{s}\n' for s in lines)
lines can be a set.
If you are oldschool (like me) you may add f.write('\n') below the second line.
As we have well established here, writelines does not append the newlines for you. But, what everyone seems to be missing, is that it doesn't have to when used as a direct "counterpart" for readlines() and the initial read persevered the newlines!
When you open a file for reading in binary mode (via 'rb'), then use readlines() to fetch the file contents into memory, split by line, the newlines remain attached to the end of your lines! So, if you then subsequently write them back, you don't likely want writelines to append anything!
So if, you do something like:
with open('test.txt','rb') as f: lines=f.readlines()
with open('test.txt','wb') as f: f.writelines(lines)
You should end up with the same file content you started with.
As we want to only separate lines, and the writelines function in python does not support adding separator between lines, I have written the simple code below which best suits this problem:
sep = "\n" # defining the separator
new_lines = sep.join(lines) # lines as an iterator containing line strings
and finally:
with open("file_name", 'w') as file:
file.writelines(new_lines)
and you are done.
i have some data stored in a .txt file in this format:
----------|||||||||||||||||||||||||-----------|||||||||||
1029450386abcdefghijklmnopqrstuvwxy0293847719184756301943
1020414646canBeFollowedBySpaces 3292532113435532419963
don't ask...
i have many lines of this, and i need a way to add more digits to the end of a particular line.
i've written code to find the line i want, but im stumped as to how to add 11 characters to the end of it. i've looked around, this site has been helpful with some other issues i've run into, but i can't seem to find what i need for this.
it is important that the line retain its position in the file, and its contents in their current order.
using python3.1, how would you turn this:
1020414646canBeFollowedBySpaces 3292532113435532419963
into
1020414646canBeFollowedBySpaces 329253211343553241996301846372998
As a general principle, there's no shortcut to "inserting" new data in the middle of a text file. You will need to make a copy of the entire original file in a new file, modifying your desired line(s) of text on the way.
For example:
with open("input.txt") as infile:
with open("output.txt", "w") as outfile:
for s in infile:
s = s.rstrip() # remove trailing newline
if "target" in s:
s += "0123456789"
print(s, file=outfile)
os.rename("input.txt", "input.txt.original")
os.rename("output.txt", "input.txt")
Check out the fileinput module, it can do sort of "inplace" edits with files. though I believe temporary files are still involved in the internal process.
import fileinput
for line in fileinput.input('input.txt', inplace=1, backup='.orig'):
if line.startswith('1020414646canBeFollowedBySpaces'):
line = line.rstrip() + '01846372998' '\n'
print(line, end='')
The print now prints to the file instead of the console.
You might want to back up your original file before editing.
target_chain = '1020414646canBeFollowedBySpaces 3292532113435532419963'
to_add = '01846372998'
with open('zaza.txt','rb+') as f:
ch = f.read()
x = ch.find(target_chain)
f.seek(x + len(target_chain),0)
f.write(to_add)
f.write(ch[x + len(target_chain):])
In this method it's absolutely obligatory to open the file in binary mode 'b' for some reason linked to the treatment of the end of lines by Python (see Universal Newline, enabled by default)
The mode 'r+' is to allow the writing as well as the reading
In this method, what is before the target_chain in the file remains untouched. And what is after the target_chain is shifted ahead. As said by Greg Hewgill, there is no possibility to move apart bits on a hard drisk to insert new bits in the middle.
Evidently, if the file is very big, reading all of its content in ch could be too much memory consuming and the algorithm should then be changed: reading line after line until the line containing the target_chain, and then reading the next line before inserting, and then continuing to do "reading the next line - re-writing on the current line" until the end of the file in order to shift progressively the content from the line concerned with addition.
You see what I mean...
Copy the file, line by line, to another file. When you get to the line that needs extra chars then add them before writing.
This question already has answers here:
How to read a file line-by-line into a list?
(28 answers)
Closed 8 years ago.
I want to prompt a user for a number of random numbers to be generated and saved to a file. He gave us that part. The part we have to do is to open that file, convert the numbers into a list, then find the mean, standard deviation, etc. without using the easy built-in Python tools.
I've tried using open but it gives me invalid syntax (the file name I chose was "numbers" and it saved into "My Documents" automatically, so I tried open(numbers, 'r') and open(C:\name\MyDocuments\numbers, 'r') and neither one worked).
with open('C:/path/numbers.txt') as f:
lines = f.read().splitlines()
this will give you a list of values (strings) you had in your file, with newlines stripped.
also, watch your backslashes in windows path names, as those are also escape chars in strings. You can use forward slashes or double backslashes instead.
Two ways to read file into list in python (note these are not either or) -
use of with - supported from python 2.5 and above
use of list comprehensions
1. use of with
This is the pythonic way of opening and reading files.
#Sample 1 - elucidating each step but not memory efficient
lines = []
with open("C:\name\MyDocuments\numbers") as file:
for line in file:
line = line.strip() #or some other preprocessing
lines.append(line) #storing everything in memory!
#Sample 2 - a more pythonic and idiomatic way but still not memory efficient
with open("C:\name\MyDocuments\numbers") as file:
lines = [line.strip() for line in file]
#Sample 3 - a more pythonic way with efficient memory usage. Proper usage of with and file iterators.
with open("C:\name\MyDocuments\numbers") as file:
for line in file:
line = line.strip() #preprocess line
doSomethingWithThisLine(line) #take action on line instead of storing in a list. more memory efficient at the cost of execution speed.
the .strip() is used for each line of the file to remove \n newline character that each line might have. When the with ends, the file will be closed automatically for you. This is true even if an exception is raised inside of it.
2. use of list comprehension
This could be considered inefficient as the file descriptor might not be closed immediately. Could be a potential issue when this is called inside a function opening thousands of files.
data = [line.strip() for line in open("C:/name/MyDocuments/numbers", 'r')]
Note that file closing is implementation dependent. Normally unused variables are garbage collected by python interpreter. In cPython (the regular interpreter version from python.org), it will happen immediately, since its garbage collector works by reference counting. In another interpreter, like Jython or Iron Python, there may be a delay.
f = open("file.txt")
lines = f.readlines()
Look over here. readlines() returns a list containing one line per element. Note that these lines contain the \n (newline-character) at the end of the line. You can strip off this newline-character by using the strip()-method. I.e. call lines[index].strip() in order to get the string without the newline character.
As joaquin noted, do not forget to f.close() the file.
Converting strint to integers is easy: int("12").
The pythonic way to read a file and put every lines in a list:
from __future__ import with_statement #for python 2.5
with open('C:/path/numbers.txt', 'r') as f:
lines = f.readlines()
Then, assuming that each lines contains a number,
numbers =[int(e.strip()) for e in lines]
You need to pass a filename string to open. There's an extra complication when the string has \ in it, because that's a special string escape character to Python. You can fix this by doubling up each as \\ or by putting a r in front of the string as follows: r'C:\name\MyDocuments\numbers'.
Edit: The edits to the question make it completely different from the original, and since none of them was from the original poster I'm not sure they're warrented. However it does point out one obvious thing that might have been overlooked, and that's how to add "My Documents" to a filename.
In an English version of Windows XP, My Documents is actually C:\Documents and Settings\name\My Documents. This means the open call should look like:
open(r"C:\Documents and Settings\name\My Documents\numbers", 'r')
I presume you're using XP because you call it My Documents - it changed in Vista and Windows 7. I don't know if there's an easy way to look this up automatically in Python.
hdl = open("C:/name/MyDocuments/numbers", 'r')
milist = hdl.readlines()
hdl.close()
To summarize a bit from what people have been saying:
f=open('data.txt', 'w') # will make a new file or erase a file of that name if it is present
f=open('data.txt', 'r') # will open a file as read-only
f=open('data.txt', 'a') # will open a file for appending (appended data goes to the end of the file)
If you wish have something in place similar to a try/catch
with open('data.txt') as f:
for line in f:
print line
I think #movieyoda code is probably what you should use however
If you have multiple numbers per line and you have multiple lines, you can read them in like this:
#!/usr/bin/env python
from os.path import dirname
with open(dirname(__file__) + '/data/path/filename.txt') as input_data:
input_list= [map(int,num.split()) for num in input_data.readlines()]