copying one file's contents to another in python

I've been taught the best way to read a file in python is to do something like:
with open('file.txt', 'r') as f1:
    for line in f1:
        do_something()
But I have been thinking. If my goal is to copy the contents of one file completely to another, are there any dangers of doing this:
with open('file2.txt', 'w+') as output, open('file.txt', 'r') as input:
    output.write(input.read())
Is it possible for this to behave in some way I don't expect?
Along the same lines, how would I handle the problem if the file is a binary file rather than a text file? In this case, there would be no newline characters, so readline() or for line in file wouldn't work (right?).
EDIT Yes, I know about shutil. There are many better ways to copy a file if that is exactly what I want to do. I want to know about the potential risks, if any, of this approach specifically, because I may need to do more advanced things than simply copying one file to another (such as copying several files into a single one).

Please note that the shutil module also contains copyfileobj(), basically implemented like Barmar's answer.
Or, to answer your question:
from shutil import copyfileobj

with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    copyfileobj(input, output)
would be my suggestion. It avoids re-implementing the buffering mechanism and, should the implementation of the standard library improve, your code wins as well.
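Since you mention possibly copying several files into a single one: copyfileobj() writes at the destination's current position, so calling it repeatedly concatenates the inputs. A minimal sketch (the part file names here are made up for illustration):

from shutil import copyfileobj

# Hypothetical input names; any iterable of paths works the same way.
parts = ['part1.bin', 'part2.bin', 'part3.bin']

with open('combined.bin', 'wb') as output:
    for name in parts:
        with open(name, 'rb') as chunk:
            # Each call appends at output's current position,
            # so successive calls concatenate the inputs.
            copyfileobj(chunk, output)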
On Unix, there also is a non-standardised syscall called sendfile. It is used mostly for sending data from an open file to a socket (serving HTTP requests, etc.).
Linux allows using it for copying data between regular files as well, though other platforms generally don't; check the Python docs and your man pages.
By using this syscall, the kernel copies the content without needing to copy buffers to and from userland.
The os module offers os.sendfile() since Python 3.3.
You could use it like:
import io
import os

with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    offset = 0  # instructs sendfile to start reading at the start of input
    input_size = input.seek(0, io.SEEK_END)
    os.sendfile(output.fileno(), input.fileno(), offset, input_size)
Otherwise, there is a package on PyPI, pysendfile, that implements the syscall. It works exactly as above; just replace os.sendfile with sendfile.sendfile (and import sendfile).
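One caveat: os.sendfile() may transfer fewer bytes than requested in a single call (and very large files can hit per-call limits), so a more defensive sketch loops until everything has been copied. This is an illustration of that idea, not a drop-in replacement:

import io
import os

with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    remaining = input.seek(0, io.SEEK_END)  # total number of bytes to copy
    offset = 0
    while remaining > 0:
        # sendfile reports how many bytes it actually sent, which may be
        # fewer than asked for, so keep going until nothing is left.
        sent = os.sendfile(output.fileno(), input.fileno(), offset, remaining)
        if sent == 0:
            break  # unexpected end of input; nothing more to copy
        offset += sent
        remaining -= sent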

The only potential problem with your output.write(input.read()) version is if the size of the file is too large to hold all of it in memory. You can use a loop that reads smaller batches.
with open('file2.txt', 'wb+') as output, open('file.txt', 'rb') as input:
    while True:
        data = input.read(100000)
        if data == b'':  # end of file reached
            break
        output.write(data)
This will work for both text and binary files. But you need to add the b modifier to the modes for portable operation on binary files.
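For reference, a small sketch of the same chunked loop using the two-argument form of iter() with a sentinel, which stops as soon as read() returns an empty bytes object:

with open('file2.txt', 'wb') as output, open('file.txt', 'rb') as input:
    # iter() with a sentinel calls input.read(100000) repeatedly and
    # stops as soon as it returns b'' (end of file).
    for data in iter(lambda: input.read(100000), b''):
        output.write(data)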

While this may not completely answer your question, for plain copying without any other processing of the file contents you should consider other means, e.g. the shutil module:
shutil.copy('file.txt', 'file2.txt')

Related

How to extract the full path from a file while using the "with" statement?

I'm trying, just for fun, to understand if I can extract the full path of my file while using the with statement (Python 3.8).
I have this simple code:
with open('tmp.txt', 'r') as file:
    print(os.path.basename(file))
But I keep getting an error that it's not a suitable type format.
I've been trying also with the relpath, abspath, and so on.
It says that the input should be a string, but even after casting it into string, I'm getting something that I can't manipulate.
Perhaps there isn't an actual way to extract that full path name, but I think there is. I just can't find it, yet.
You could try:
import os

with open("tmp.txt", "r") as file_handle:
    print(os.path.abspath(file_handle.name))
The functions in os.path accept strings or path-like objects. You are attempting to pass in a file object instead. There are lots of reasons the types aren't interchangeable.
Since you opened the file for text reading, file is an instance of io.TextIOWrapper. This class is just an interface that provides text encoding and decoding for some underlying data. It is not associated with a path in general: the underlying stream can be a file on disk, but also a pipe, a network socket, or an in-memory buffer (like io.StringIO). None of the latter are associated with a path or filename in the way that you are thinking, even though you would interface with them as through normal file objects.
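For example (a small sketch, not from the original answer), an in-memory stream has no name attribute at all, so a defensive lookup with getattr shows the difference:

import io

buffer = io.StringIO("some text\n")  # a file-like object with no backing file
# getattr with a default avoids an AttributeError for streams
# that are not associated with any path.
print(getattr(buffer, 'name', None))  # prints None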
If your file-like object is an instance of io.FileIO, or wraps one as the objects returned by open() on a real file do, it will have a name attribute to keep track of this information for you. Other sources of data will not. Since the example in your question is ultimately backed by a FileIO, you can do
with open('tmp.txt', 'r') as file:
    print(os.path.abspath(file.name))
The full file path is given by os.path.abspath.
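If you prefer pathlib, the same information is available through Path.resolve(), which additionally resolves symlinks; a sketch of the equivalent:

from pathlib import Path

with open('tmp.txt', 'r') as file:
    # Path.resolve() returns an absolute path, following any symlinks.
    print(Path(file.name).resolve())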
That being said, since file objects don't generally care about file names, it is probably better for you to keep track of that info yourself, in case one day you decide to use something else as input. Python 3.8+ allows you to do this without changing your line count using the walrus operator:
with open((filename := 'tmp.txt'), 'r') as file:
    print(os.path.abspath(filename))

python: save list to a file with performance in mind

Let's say we have a list of strings which is so big that if I save it as a normal text file (every element on a separate line) it'll be 1 GB in size.
Currently I use this code to save the list:
savefile = codecs.open("BigList.txt", "w", "utf-8")
savefile.write("\r\n".join(BigList));
savefile.close()
As soon as we reach this part of the code, "\r\n".join(BigList), I can see a huge bump in memory usage, and it also takes considerable time (~1 min) to save the results.
Any tips or suggestions for handling this list with less memory usage and saving it to disk more quickly?
The join in:
"\r\n".join(BigList)
is creating a very large string in memory before writing it out. It will be much more memory efficient to use a for loop:
for line in BigList:
    savefile.write(line + "\r\n")
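A slightly more compact spelling of the same idea (a sketch using the built-in open rather than codecs, and assuming the same BigList) passes a generator expression to writelines(), so the joined string is never built:

with open("BigList.txt", "w", encoding="utf-8") as savefile:
    # writelines() accepts any iterable; the generator streams the
    # lines out without materialising one huge string first.
    savefile.writelines(line + "\r\n" for line in BigList)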
Another question is: why do you have so many strings in memory in the first place?
for line in BigList:
    savefile.write(line + '\n')
I would do it by iterating.
To save disk-space you could do:
from gzip import GzipFile

with GzipFile('dump.txt', 'w') as fh:
    # GzipFile writes bytes, so encode the joined text
    fh.write('\r\n'.join(BigList).encode('utf-8'))
(also use the with statement instead).
Combine this with a for loop in order to save memory:
from gzip import GzipFile

with GzipFile('dump.txt', 'w') as fh:
    for item in BigList:
        fh.write((str(item) + '\r\n').encode('utf-8'))
And to do it really quickly, you could potentially do (saves memory, disk space and time):
import pickle
from gzip import GzipFile
with GzipFile('dump.pckl', 'wb') as fh:
    pickle.dump(BigList, fh)
Note, however, that this big list of yours would only be accessible to external programs if they understand Python's pickle format.
But assuming you want to re-use the BigList in your own application, pickle is the way to go.
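For completeness, reading the pickled list back into your application later is the mirror image (assuming the dump.pckl written above):

import pickle
from gzip import GzipFile

# Decompress and unpickle the list in one go.
with GzipFile('dump.pckl', 'rb') as fh:
    BigList = pickle.load(fh)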
I noticed a comment about you reading a big text file in order to write to another file. In that case, the above is an approach that would work for you. If you want to save memory or time when working with two files, consider the following instead:
with open('file_one.txt', 'rb') as inp:
    with open('file_two.txt', 'wb') as out:
        for line in inp:
            out.write(do_work(line) + b'\r\n')

Opening And Reading Large Numbers of Files in Python

I have 37 data files that I need to open and analyze using python. Rather than brute force my code with a lot of open() and close() statements, is there a concise way to open and read from a large number of files?
You are going to have to open and close a file handle for each file you are hoping to read from. What is your aversion to doing it this way?
Are you looking for perhaps good way to determine which files need to be read?
Use a dictionary of filenames to file handles and then iterate over the items. Or a list of tuples. Or two-dimensional arrays. Or or or ...
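If you do want several handles open at the same time, contextlib.ExitStack keeps the bookkeeping manageable; a sketch (the file names are made up):

from contextlib import ExitStack

filenames = ["data1.dat", "data2.dat", "data3.dat"]  # hypothetical names

with ExitStack() as stack:
    # enter_context() registers each file so they are all closed
    # together when the with block exits.
    handles = {name: stack.enter_context(open(name)) for name in filenames}
    for name, fh in handles.items():
        for line in fh:
            pass  # analyze each line here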
Use the standard library fileinput module
Pass in the data files on the command line and process like this
import fileinput
for line in fileinput.input():
    process(line)
This iterates over all the lines of all the files passed in on the command line. This module also provides helper functions to let you know which file and line you are on currently.
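For example, a small sketch that also reports where each line came from:

import fileinput

for line in fileinput.input():
    # filename() and filelineno() identify the current file and line.
    print(fileinput.filename(), fileinput.filelineno(), line, end='')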
Use the arcane functionality known as a function.
def slurp(filename):
    """slurp will cleanly read in a file's contents, cleaning up after itself"""
    # Using the 'with' statement will automagically close
    # the file handle when you're done.
    with open(filename, "r") as fh:
        # if the files are too big to keep in-memory, then read by chunks
        # instead and process the data into smaller data structures as needed.
        return fh.read()

data = [slurp(filename) for filename in ["data1.dat", "data2.dat", "data3.dat"]]
You can also combine the entire thing:
for filename in ["a.dat", "b.dat", "c.dat"]:
    with open(filename, "r") as fh:
        for line in fh:
            process_line(line)
And so on...

sorting file in place with Python on unix system

I'm sorting a text file from Python using a custom unix command that takes a filename as input (or reads from stdin) and writes to stdout. I'd like to sort myfile and keep the sorted version in its place. Is the best way to do this from Python to make a temporary file? My current solution is:
inputfile = "myfile"
# inputfile: filename to be sorted
tmpfile = "%s.tmp_file" %(inputfile)
cmd = "mysort %s > %s" %(inputfile, tmpfile)
# rename sorted file to be originally sorted filename
os.rename(tmpfile, inputfile)
Is this the best solution? Thanks.
If you don't want to create temporary files, you can use subprocess as in:
import sys
import subprocess
fname = sys.argv[1]
proc = subprocess.Popen(['sort', fname], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
with open(fname, 'wb') as f:  # stdout from the pipe is bytes, so write in binary mode
    f.write(stdout)
You either create a temporary file, or you'll have to read the whole file into memory and pipe it to your command.
The best solution is to use os.replace because it would work on Windows too.
This is not really what I regards as "in-place sorting" though. Usually, in-place sorting means that you actually exchange single elements in the lists without doing copies. You are making a copy since the sorted list has to get completely built before you can overwrite the original. If your files get very large, this obviously won't work anymore. You'd probably need to choose between atomicity and in-place-ity at that point.
If your Python is too old to have os.replace, there is plenty of background in the bug report that added os.replace.
For other uses of temporary files, you can consider using the tempfile module, but I don't think it would gain you much in this case.
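Putting the earlier points together, a hedged sketch of the question's approach with subprocess plus an atomic os.replace at the end (mysort is the custom command from the question):

import os
import subprocess

inputfile = "myfile"
tmpfile = "%s.tmp_file" % inputfile

# Sort into a temporary file in the same directory, then atomically
# swap it over the original.
with open(tmpfile, "w") as out:
    subprocess.check_call(["mysort", inputfile], stdout=out)
os.replace(tmpfile, inputfile)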
You could try a truncate-write pattern:
with open(filename, 'r') as f:
    model.read(f)

model.process()

with open(filename, 'w') as f:
    model.write(f)
Note that this is non-atomic.
This entry describes some pros/cons of updating files in Python:
http://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/

Python Overwriting files after parsing

I'm new to Python, and I need to do a parsing exercise. I have a file, and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, and in the same place on disk, only with the new headers.
I tried this code...
for line in open('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        print(linepars)
..and it does the job, but I don't know how to "overwrite" the file with the new parsing.
The easiest way, but not the most efficient (by far, and especially for long files) would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense

for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(linepars)
    else:
        fw.write(line)

fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line, but only one at a time, writing it to file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense

with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
            fw.write(linepars)
        else:
            fw.write(line)

fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os

current_cfg = open(...)
parsed_cfg = open(..., 'w')

for line in current_cfg:
    new_line = parse(line)
    print(new_line)
    parsed_cfg.write(new_line + '\n')

current_cfg.close()
parsed_cfg.close()

os.rename(....)  # Rename old file to backup name
os.rename(....)  # Rename new file into place
Additionally I'd suggest looking at the tempfile module and use one of its methods for either naming your new file or opening/creating it. Personally I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file named will be guaranteed to either point at the old file or the new file; in no case would it point at a partially written/copied file).
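A minimal sketch of that idea, using tempfile.NamedTemporaryFile in the same directory and os.replace, and skipping the backup step for brevity:

import os
import re
import tempfile

source = '/home/name/db/str/dir/numbers/str.phy'

# Write to a temporary file in the same directory so the final rename
# stays on one filesystem and is atomic.
with open(source) as fr, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(source), delete=False) as fw:
    for line in fr:
        if line.startswith('ENS'):
            line = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(line)

os.replace(fw.name, source)  # swap the new file into place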
The following code DOES the job.
I mean it DOES overwrite the file on itself; that's what the OP asked for. That's possible because the transformations only remove characters, so the writing file pointer fo is always BEHIND the reading file pointer fi.
import re
regx = re.compile(r'\AENS([A-Z]+)0+([0-9]{6})')

with open('bomo.phy', 'rb+') as fi, open('bomo.phy', 'rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
I think that the writing isn't performed by the operating system one line at a time but through a buffer. So several lines are read before a pool of transformed lines are written. That's what I think.
newlines = []

for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        newlines.append(linepars)

open('/home/name/db/str/dir/numbers/str.phy', 'w').write('\n'.join(newlines))
(sidenote: Of course if you are working with large files, you should be aware that the level of optimization required may depend on your situation. Python by nature is very non-lazily-evaluated. The following solution is not a good choice if you are parsing large files, such as database dumps or logs, but a few tweaks such as nesting the with clauses and using lazy generators or a line-by-line algorithm can allow O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    newText = '\n'.join(replaceIfHeader(line) for line in f)

try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where backup of targetFile is
    raise
edit: thanks to Jeff for suggestion
