sorting file in place with Python on unix system - python

I'm sorting a text file from Python using a custom unix command that takes a filename as input (or reads from stdin) and writes to stdout. I'd like to sort myfile and keep the sorted version in its place. Is the best way to do this from Python to make a temporary file? My current solution is:
inputfile = "myfile"
# inputfile: filename to be sorted
tmpfile = "%s.tmp_file" %(inputfile)
cmd = "mysort %s > %s" %(inputfile, tmpfile)
# rename sorted file to be originally sorted filename
os.rename(tmpfile, inputfile)
Is this the best solution? thanks.

If you don't want to create temporary files, you can use subprocess as in:
import sys
import subprocess
fname = sys.argv[1]
proc = subprocess.Popen(['sort', fname], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
with open(fname, 'w') as f:
    f.write(stdout)

You either create a temporary file, or you'll have to read the whole file into memory and pipe it to your command.
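A minimal sketch of the second option (read into memory and pipe through the command), assuming the custom command is called mysort and reads from stdin, on Python 3.5+:
import subprocess

fname = "myfile"
with open(fname, "rb") as f:
    data = f.read()                        # whole file in memory
# feed the original contents to the sort command on stdin
proc = subprocess.run(["mysort"], input=data, stdout=subprocess.PIPE, check=True)
with open(fname, "wb") as f:
    f.write(proc.stdout)                   # overwrite with the sorted output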

The best solution is to use os.replace because it would work on Windows too.
This is not really what I regard as "in-place sorting" though. Usually, in-place sorting means that you actually exchange single elements in the list without making copies. You are making a copy, since the sorted list has to be built completely before you can overwrite the original. If your files get very large, this obviously won't work anymore. You'd probably need to choose between atomicity and in-place-ity at that point.
If your Python is too old to have os.replace, there are lots of resources in the bug report that added os.replace.
For other uses of temporary files, you can consider using the tempfile module, but I don't think it would gain you much in this case.
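A minimal sketch of the temporary-file-plus-os.replace approach, assuming the sort command is called mysort as in the question:
import os
import subprocess
import tempfile

inputfile = "myfile"
# create the temp file in the same directory so the final replace stays on one filesystem
fd, tmppath = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(inputfile)))
try:
    with os.fdopen(fd, "wb") as tmp:
        subprocess.check_call(["mysort", inputfile], stdout=tmp)
    os.replace(tmppath, inputfile)   # atomic rename, also works on Windows
except Exception:
    os.unlink(tmppath)               # don't leave a half-written temp file behind
    raise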

You could try a truncate-write pattern:
with open(filename, 'r') as f:
    model.read(f)
model.process()
with open(filename, 'w') as f:
    model.write(f)
Note that this is non-atomic: if something goes wrong between the truncating open and the final write, the file is left empty or partially written.
This entry describes some pros/cons of updating files in Python:
http://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/

Related

copying one file's contents to another in python

I've been taught the best way to read a file in python is to do something like:
with open('file.txt', 'r') as f1:
    for line in f1:
        do_something()
But I have been thinking. If my goal is to copy the contents of one file completely to another, are there any dangers of doing this:
with open('file2.txt', 'w+') as output, open('file.txt', 'r') as input:
    output.write(input.read())
Is it possible for this to behave in some way I don't expect?
Along the same lines, how would I handle the problem if the file is a binary file, rather than a text file. In this case, there would be no newline characters, so readline() or for line in file wouldn't work (right?).
EDIT Yes, I know about shutil. There are many better ways to copy a file if that is exactly what I want to do. I want to know about the potential risks, if any, of this approach specifically, because I may need to do more advanced things than simply copying one file to another (such as copying several files into a single one).
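For the "several files into a single one" case mentioned in the edit, a minimal sketch using shutil.copyfileobj (the input file names here are hypothetical):
import shutil

parts = ['part1.bin', 'part2.bin', 'part3.bin']    # hypothetical input files
with open('combined.bin', 'wb') as output:
    for name in parts:
        with open(name, 'rb') as chunk:
            shutil.copyfileobj(chunk, output)      # buffered copy, never reads a whole file into memory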
Please note that the shutil module also contains copyfileobj(), basically implemented like Barmar's answer.
Or, to answer your question:
from shutil import copyfileobj
with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    copyfileobj(input, output)
would be my suggestion. It avoids re-implementing the buffering mechanism and, should the implementation of the standard library improve, your code wins as well.
On Unix, there also is a non-standardised syscall called sendfile. It is used mostly for sending data from an open file to a socket (serving HTTP requests, etc.).
Linux allows using it for copying data between regular files as well, though. Other platforms don't; check the Python docs and your man pages.
By using a syscall the kernel copies the content without the need of copying buffers to and from userland.
The os module offers os.sendfile() since Python 3.3.
You could use it like:
import io
import os
with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    offset = 0  # instructs sendfile to start reading at the start of input
    input_size = input.seek(0, io.SEEK_END)
    os.sendfile(output.fileno(), input.fileno(), offset, input_size)
Otherwise, there is a package on PyPi, pysendfile, implementing the syscall. It works exactly as above, just replace os.sendfile with sendfile.sendfile (and import sendfile).
The only potential problem with your output.write(input.read()) version is if the size of the file is too large to hold all of it in memory. You can use a loop that reads smaller batches.
with open('file2.txt', 'wb+') as output, open('file.txt', 'rb') as input:
    while True:
        data = input.read(100000)
        if not data:  # end of file reached (read returns empty bytes)
            break
        output.write(data)
This will work for both text and binary files. But you need to add the b modifier to the modes for portable operation on binary files.
While this may not completely answer your question, for plain copying without any other processing of the file contents you should consider other means, e.g. the shutil module:
shutil.copy('file.txt', 'file2.txt')

python clear content writing on same file

I am a newbie to Python. I have code in which I must write the contents back to the same file, but when I do it, it clears my content. Please help me fix it.
How should I modify my code such that the contents will be written back on the same file?
My code:
import re
numbers = {}
with open('1.txt') as f,open('11.txt', 'w') as f1:
    for line in f:
        row = re.split(r'(\d+)', line.strip())
        words = tuple(row[::2])
        if words not in numbers:
            numbers[words] = [int(n) for n in row[1::2]]
        numbers[words] = [n+1 for n in numbers[words]]
        row[1::2] = map(str, numbers[words])
        indentation = (re.match(r"\s*", line).group())
        print (indentation + ''.join(row))
        f1.write(indentation + ''.join(row) + '\n')
In general, it's a bad idea to write over a file you're still processing (or change a data structure over which you are iterating). It can be done...but it requires much care, and there is little safety or restart-ability should something go wrong in the middle (an error, a power failure, etc.)
A better approach is to write a clean new file, then rename it to the old name. For example:
import re
import os
filename = '1.txt'
tempname = "temp{0}_{1}".format(os.getpid(), filename)
numbers = {}
with open(filename) as f, open(tempname, 'w') as f1:
    # ... file processing as before
os.rename(tempname, filename)
Here I've dropped filenames (both original and temporary) into variables, so they can be easily referred to multiple times or changed. This also prepares for the moment when you hoist this code into a function (as part of a larger program), as opposed to making it the main line of your program.
You don't strictly need the temporary name to embed the process id, but it's a standard way of making sure the temp file is uniquely named (temp32939_1.txt vs temp_1.txt or tempfile.txt, say).
It may also be helpful to create backups of the files as they were before processing. In which case, before the os.rename(tempname, filename) you can drop in code to move the original data to a safer location or a backup name. E.g.:
backupname = filename + ".bak"
os.rename(filename, backupname)
os.rename(tempname, filename)
While beyond the scope of this question, if you used a read-process-overwrite strategy frequently, it would be possible to create a separate module that abstracted these file-handling details away from your processing code. Here is an example.
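A minimal sketch of such a helper, just to illustrate the idea (this is not the module the link refers to):
import contextlib
import os

@contextlib.contextmanager
def rewrite(filename):
    """Yield a handle to a temp file; replace filename with it only on a clean exit."""
    dirname = os.path.dirname(os.path.abspath(filename))
    tempname = os.path.join(dirname, "temp{0}_{1}".format(os.getpid(), os.path.basename(filename)))
    with open(tempname, 'w') as out:
        yield out                   # caller writes the new contents here
    os.rename(tempname, filename)   # only reached if no exception was raised

# usage: read the original while writing the replacement
with open('1.txt') as f, rewrite('1.txt') as f1:
    for line in f:
        f1.write(line)              # ... transformed as needed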
Use
open('11.txt', 'a')
to append to the file, instead of 'w', which truncates an existing file (or creates a new one).
If you want to read and modify the file in one go, use 'r+' mode.
f = open('/path/to/file.txt', 'r+')
content = f.read()
content = content.replace('oldstring', 'newstring')  # for example, change some substring in the whole file
f.seek(0)       # move to the beginning of the file
f.write(content)
f.truncate()    # clear the file content's "tail" on disk if the new content is shorter than the old
f.close()

python: save list to a file with performance in mind

Let's say we have a list of strings which is so big that if I save it as a normal text file (every element on a separate line), it'll be 1 GB in size.
Currently I use this code to save the list:
savefile = codecs.open("BigList.txt", "w", "utf-8")
savefile.write("\r\n".join(BigList))
savefile.close()
As soon as we reach this part of the code, "\r\n".join(BigList), I can see a huge bump in memory usage, and it also takes considerable time (~1 min) to save the results.
Any tips or suggestions for handling this list better (less memory usage) and saving it to disk more quickly?
The join in:
"\r\n".join(BigList)
is creating a very large string in the memory before writing it down. It will be much more memory efficient if you use a for loop:
for line in BigList:
savefile.write(line + "\r\n")
Another question is: why do you have so many strings in memory in the first place?
for line in BigList:
    savefile.write(line + '\n')
I would do it by iterating.
To save disk-space you could do:
from gzip import GzipFile
with GzipFile('dump.txt', 'w') as fh:
    fh.write('\r\n'.join(BigList))
(also, use the with statement instead of opening and closing by hand).
Combine this with a for loop in order to save memory:
from gzip import GzipFile
with GzipFile('dump.txt', 'w') as fh:
    for item in BigList:
        fh.write(str(item) + '\r\n')
And to do it really quick you could potentially do (saves memory, disk-space and time):
import pickle
from gzip import GzipFile
with GzipFile('dump.pckl', 'wb') as fh:
    pickle.dump(BigList, fh)
Note, however, that this big list of yours would only be accessible to external programs if they understand Python's pickle format.
But assuming you want to re-use the BigList in your own application, pickle is the way to go.
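And reading it back in later is the mirror image of the dump above:
import pickle
from gzip import GzipFile

with GzipFile('dump.pckl', 'rb') as fh:
    BigList = pickle.load(fh)    # restores the original list object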
I noticed a comment about you reading a big text file in order to write to another file. In that case, the above is an approach that would work for you.
If you want to save memory or time when going from one file to another, consider the following instead:
with open('file_one.txt', 'rb') as inp:
    with open('file_two.txt', 'wb') as out:
        for line in inp:
            out.write(do_work(line) + b'\r\n')

Opening And Reading Large Numbers of Files in Python

I have 37 data files that I need to open and analyze using python. Rather than brute force my code with a lot of open() and close() statements, is there a concise way to open and read from a large number of files?
You are going to have to open and close a file handle for each file you are hoping to read from. What is your aversion to doing it this way?
Are you looking for perhaps good way to determine which files need to be read?
Use a dictionary of filenames to file handles and then iterate over the items. Or a list of tuples. Or two-dimensional arrays. Or or or ...
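A minimal sketch of the dictionary-of-handles idea, using contextlib.ExitStack (Python 3.3+) so all 37 files get closed in one place; the file names are hypothetical:
from contextlib import ExitStack

filenames = ["data{0}.dat".format(i) for i in range(1, 38)]   # hypothetical names
with ExitStack() as stack:
    handles = {name: stack.enter_context(open(name)) for name in filenames}
    for name, fh in handles.items():
        for line in fh:
            pass   # analyze each line here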
Use the standard library fileinput module
Pass in the data files on the command line and process like this
import fileinput
for line in fileinput.input():
    process(line)
This iterates over all the lines of all the files passed in on the command line. This module also provides helper functions to let you know which file and line you are on currently.
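A minimal sketch using those helpers (fileinput.filename() and fileinput.filelineno() are part of the module):
import fileinput

for line in fileinput.input():
    # report which file, and which line within that file, we are currently on
    print(fileinput.filename(), fileinput.filelineno(), line.rstrip())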
Use the arcane functionality known as a function.
def slurp(filename):
    """slurp will cleanly read in a file's contents, cleaning up after itself"""
    # Using the 'with' statement will automagically close
    # the file handle when you're done.
    with open(filename, "r") as fh:
        # if the files are too big to keep in memory, then read by chunks
        # instead and process the data into smaller data structures as needed.
        return fh.read()

data = [slurp(filename) for filename in ["data1.dat", "data2.dat", "data3.dat"]]
You can also combine the entire thing:
for filename in ["a.dat", "b.dat", "c.dat"]:
with open(filename,"r") as fh:
for line in fh:
process_line(line)
And so on...

Python Overwriting files after parsing

I'm new to Python, and I need to do a parsing exercise. I have a file, and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, and in the same place on disk, with only the new headers being different.
I tried this code...
for line in open('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        print linepars
..and it does the job, but I don't know how to "overwrite" the file with the new parsing.
The easiest way, but not the most efficient (by far, and especially for long files) would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w') # Name this whatever makes sense
for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(linepars)
    else:
        fw.write(line)
fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line, but only one at a time, writing it to the file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w') # Name this whatever makes sense
with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
            fw.write(linepars)
        else:
            fw.write(line)
fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os
current_cfg = open(...)
parsed_cfg = open(..., 'w')
for line in current_cfg:
    new_line = parse(line)
    print new_line
    parsed_cfg.write(new_line + '\n')
current_cfg.close()
parsed_cfg.close()
os.rename(....) # Rename old file to backup name
os.rename(....) # Rename new file into place
Additionally I'd suggest looking at the tempfile module and use one of its methods for either naming your new file or opening/creating it. Personally I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file named will be guaranteed to either point at the old file or the new file; in no case would it point at a partially written/copied file).
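A minimal sketch of that suggestion, creating the temporary file next to the original with tempfile.NamedTemporaryFile; parse() stands in for the hypothetical per-line transformation:
import os
import tempfile

path = '/home/name/db/str/dir/numbers/str.phy'
with open(path) as current_cfg, \
     tempfile.NamedTemporaryFile('w', dir=os.path.dirname(path), delete=False) as parsed_cfg:
    for line in current_cfg:
        parsed_cfg.write(parse(line))      # parse() is assumed to return the full line, newline included
os.rename(path, path + '.bak')             # keep the original under a backup name
os.rename(parsed_cfg.name, path)           # move the new file into place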
The following code DOES the job.
I mean it DOES overwrite the file onto itself; that's what the OP asked for. That's possible because the transformations only remove characters, so the file pointer fo that writes always stays BEHIND the file pointer fi that reads.
import re
regx = re.compile('\AENS([A-Z]+)0+([0-9]{6})')
with open('bomo.phy','rb+') as fi, open('bomo.phy','rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
I think that the writing isn't performed by the operating system one line at a time but through a buffer. So several lines are read before a pool of transformed lines are written. That's what I think.
newlines = []
for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        newlines.append(linepars)
    else:
        newlines.append(line)
open('/home/name/db/str/dir/numbers/str.phy', 'w').write(''.join(newlines))
(sidenote: Of course if you are working with large files, you should be aware that the level of optimization required may depend on your situation. Python by nature is very non-lazily-evaluated. The following solution is not a good choice if you are parsing large files, such as database dumps or logs, but a few tweaks such as nesting the with clauses and using lazy generators or a line-by-line algorithm can allow O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    newText = ''.join(replaceIfHeader(line) for line in f)

try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where the backup of targetFile is
    raise
edit: thanks to Jeff for suggestion
