I have 37 data files that I need to open and analyze using python. Rather than brute force my code with a lot of open() and close() statements, is there a concise way to open and read from a large number of files?
You are going to have to open and close a file handle for each file you are hoping to read from. What is your aversion to doing it this way?
Or are you perhaps looking for a good way to determine which files need to be read?
Use a dictionary of filenames to file handles and then iterate over the items. Or a list of tuples. Or two-dimensional arrays. Or or or ...
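For what it's worth, a minimal sketch of the dictionary idea might look like this (the file names are made up, and process() stands in for whatever per-line work you need):

filenames = ["data1.dat", "data2.dat", "data3.dat"]
handles = dict((name, open(name)) for name in filenames)
try:
    for name, fh in handles.items():
        for line in fh:
            process(line)   # per-line work, keyed by name if necessary
finally:
    for fh in handles.values():
        fh.close()          # make sure every handle gets closed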
Use the standard library fileinput module
Pass the data files in on the command line and process them like this:
import fileinput
for line in fileinput.input():
    process(line)
This iterates over all the lines of all the files passed in on the command line. This module also provides helper functions to let you know which file and line you are on currently.
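A small sketch of those helpers, just as an illustration:

import fileinput
import sys

for line in fileinput.input():
    # filename() and filelineno() report the current file and the
    # line number within that file.
    sys.stdout.write("%s:%d: %s" % (fileinput.filename(),
                                    fileinput.filelineno(), line))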
Use the arcane functionality known as a function.
def slurp(filename):
    """slurp will cleanly read in a file's contents, cleaning up after itself"""
    # Using the 'with' statement will automagically close
    # the file handle when you're done.
    with open(filename, "r") as fh:
        # If the files are too big to keep in memory, read in chunks
        # instead and process the data into smaller structures as needed.
        return fh.read()
data = [slurp(filename) for filename in ["data1.dat", "data2.dat", "data3.dat"]]
You can also combine the entire thing:
for filename in ["a.dat", "b.dat", "c.dat"]:
with open(filename,"r") as fh:
for line in fh:
process_line(line)
And so on...
Related
In Python 2.6 is there a more efficient way of searching a file line by line (for a string) and after finding it, inserting some lines into that file? So the output file would just be the same as the input file with a few lines added in between. Also, I'd rather not read these files into a buffer because the files can be very large.
Right now, I'm reading the file line by line and writing it into a temp file until I find the line I'm looking for, then inserting the extra data into the temp file. Then I write the rest of the data into the temp file. After I'm done processing the file, I overwrite the old file with the new temp file.
Something like this:
import os
import re

with open(file_in_read, 'r') as inFile:
    if os.path.exists(file_in_write):
        os.remove(file_in_write)
    with open(file_in_write, 'a') as outFile:
        for line in inFile:
            if re.search(r'<search_string>', line):
                write_some_data(outFile)
                outFile.write(line)
            else:
                outFile.write(line)
os.rename(src, dst)  # overwrite the old file with the temp file
I was just wondering if I can speed it up somehow.
It looks like using the fileinput module in the standard library is the way to go. You can simplify your code to:
import fileinput
import re
import sys
regex = re.compile(r'<pattern>')
for line in fileinput.input(file_in_read, inplace=True):
    sys.stdout.write(line)
    if regex.search(line):
        sys.stdout.write(additional_lines)
You can seek to some point of the file with file.seek and write there, but that way the data will sit at a fixed offset in the file, which is generally not what you want.
If the data needs to go after some other data that has no fixed offset or size, then there is no way around it: you need to read the file to find out that data's offset and size.
You may have an XY problem: you think you can solve X by doing Y, so you ask for help with Y instead of asking for help with X. If you share what you are trying to achieve with these files, other people may suggest better solutions.
I have used the following code to read a .txt file:
f = os.open(os.path.join(self.dirname, self.filename), os.O_RDONLY)
And when I want to output the content I use this:
os.read(f, 10)
This reads 10 bytes from the beginning of the file, but I need to read the whole content, perhaps by passing some value such as -1. What should I do?
You have two options:
Call os.read() repeatedly.
Open the file using the open() built-in (as opposed to os.open()), and just call f.read() with no arguments.
The second approach carries certain risk, in that you might run into memory issues if the file is very large.
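A rough sketch of the first option, reading fixed-size chunks until os.read() returns an empty string (here `path` stands in for the file name you were opening):

import os

fd = os.open(path, os.O_RDONLY)
chunks = []
while True:
    chunk = os.read(fd, 4096)   # read up to 4096 bytes per call
    if not chunk:               # os.read() returns an empty string at end of file
        break
    chunks.append(chunk)
os.close(fd)
content = b"".join(chunks)

# The second option is usually simpler:
# with open(path, "rb") as f:
#     content = f.read()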
I'm trying to insert text at very specific locations in a text file. This text file can be fairly large (>> 10 GB)
The approach I am currently using to read it:
with open("my_text_file.txt") as f:
while True:
result = f.read(set_number_of_bytes)
x = process_result(result)
if x:
replace_some_characters_that_i_just_read_and write_it_back_to_same_file
However, I am unsure as to how to implement
replace_some_characters_that_i_just_read_and write_it_back_to_same_file
Is there some method I can use to determine where I have read up to in the current file, which I could then use to write back to the file at that position?
Performance-wise, if I was to use the approach above to write to the original file at specific locations, would there be efficiency issues with having to find the write location before writing?
Or would you recommend creating an entirely different file, appending to it on each loop above, and then deleting the original file after the operation is complete? Assume space is not a large concern but performance is.
Use the fileinput module, which handles files correctly when replacing data, with the inplace flag set:
import sys
import fileinput
for line in fileinput.input('my_text_file.txt', inplace=True):
    x = process_result(line)
    if x:
        line = line.replace('something', x)
    sys.stdout.write(line)
When you use the inplace flag, the original file is moved to a backup, and anything you write to sys.stdout is written to the original filename (so, as a new file). Make sure you include all lines, altered or not.
You have to rewrite the complete file whenever your replacement data is not exactly the same number of bytes as the parts that you are replacing.
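To answer the question about knowing where you have read up to: file objects have tell() and seek(). If, and only if, the replacement is exactly the same length as the chunk it replaces, something along these lines would work. set_number_of_bytes is the placeholder from the question, and make_replacement() is a hypothetical function that returns a same-length replacement chunk, or None to leave the chunk alone:

with open("my_text_file.txt", "r+b") as f:        # r+b lets you read and write in place
    while True:
        pos = f.tell()                             # where this chunk starts
        chunk = f.read(set_number_of_bytes)
        if not chunk:
            break
        new_chunk = make_replacement(chunk)        # hypothetical: same-length chunk, or None
        if new_chunk is not None:
            f.seek(pos)                            # jump back to the start of the chunk
            f.write(new_chunk)
            f.seek(pos + set_number_of_bytes)      # reposition before the next read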
I'm new to Python, and I need to do a parsing exercise. I have a file, and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, and at the same place on disk, with only the new headers changed.
I tried this code...
import re

for line in open('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        print linepars
...and it does the job, but I don't know how to "overwrite" the file with the newly parsed lines.
The easiest way, though by far not the most efficient (especially for long files), would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
import re

fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense

for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(linepars)
    else:
        fw.write(line)

fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line, but only one at a time, writing it to file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense
with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
            fw.write(linepars)
        else:
            fw.write(line)
fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os

current_cfg = open(...)
parsed_cfg = open(..., 'w')

for line in current_cfg:
    new_line = parse(line)
    print new_line
    parsed_cfg.write(new_line + '\n')

current_cfg.close()
parsed_cfg.close()

os.rename(....)  # Rename old file to backup name
os.rename(....)  # Rename new file into place
Additionally I'd suggest looking at the tempfile module and use one of its methods for either naming your new file or opening/creating it. Personally I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file named will be guaranteed to either point at the old file or the new file; in no case would it point at a partially written/copied file).
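A rough sketch of that idea, assuming parse() is your per-line transformation and config_path is the file being rewritten (both names are placeholders here):

import os
import tempfile

config_path = '/home/name/db/str/dir/numbers/str.phy'

# Create the temporary file in the same directory so os.rename() stays
# on the same filesystem and replaces the original atomically.
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(config_path))
with os.fdopen(fd, 'w') as tmp:
    with open(config_path) as src:
        for line in src:
            tmp.write(parse(line))   # parse() is whatever per-line rewrite you need
os.rename(tmp_path, config_path)     # swap the new file into place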
The following code DOES the job.
I mean it DOES overwrite the file IN PLACE; that's what the OP asked for. That's possible because the transformation only removes characters, so the writing file pointer fo always stays BEHIND the reading file pointer fi.
import re

regx = re.compile(r'\AENS([A-Z]+)0+([0-9]{6})')

with open('bomo.phy', 'rb+') as fi, open('bomo.phy', 'rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
    fo.truncate()  # the rewritten content is shorter, so drop the leftover bytes
I think the writing isn't performed by the operating system one line at a time but through a buffer, so several lines are read before a batch of transformed lines is written. That is my understanding, at least.
import re

newlines = []
for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        line = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    newlines.append(line)  # keep non-header lines unchanged
# each line already ends in '\n', so join without adding extra newlines
open('/home/name/db/str/dir/numbers/str.phy', 'w').write(''.join(newlines))
(Side note: if you are working with large files, be aware that the level of optimization required may depend on your situation. Python does not evaluate lazily by default, so the following solution is not a good choice for very large files such as database dumps or logs, but a few tweaks, such as nesting the with clauses and using lazy generators or a line-by-line algorithm, can give O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    # each line keeps its trailing '\n', so join without adding more
    newText = ''.join(replaceIfHeader(line) for line in f)

try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where backup of targetFile is
    raise
edit: thanks to Jeff for the suggestion
Hey, I need to split a large file in Python into smaller files that contain only specific lines. How do I do this?
You're probably going to want to do something like this:
big_file = open('big_file', 'r')
small_file1 = open('small_file1', 'w')
small_file2 = open('small_file2', 'w')
for line in big_file:
    if 'Charlie' in line: small_file1.write(line)
    if 'Mark' in line: small_file2.write(line)
big_file.close()
small_file1.close()
small_file2.close()
Opening a file for reading returns an object that allows you to iterate over the lines. You can then check each line (which is just a string of whatever that line contains) for whatever condition you want, then write it to the appropriate file that you opened for writing. It is worth noting that when you open a file with 'w' it will overwrite anything already written to that file. If you want to simply add to the end, you should open it with 'a', to append.
Additionally, if you expect there to be some possibility of error in your reading/writing code, and want to make sure the files are closed, you can use:
with open('big_file', 'r') as big_file:
    <do stuff prone to error>
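Filled in with the same splitting logic from above, just as an illustration (the file names are the same made-up ones):

with open('big_file', 'r') as big_file:
    with open('small_file1', 'w') as small_file1:
        with open('small_file2', 'w') as small_file2:
            # The with blocks guarantee all three files are closed
            # even if an error occurs while processing a line.
            for line in big_file:
                if 'Charlie' in line:
                    small_file1.write(line)
                if 'Mark' in line:
                    small_file2.write(line)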
Do you mean breaking it down into subsections? Like if I had a file with chapter 1, chapter 2, and chapter 3, you want it to be broken down into separate files for each chapter?
The way I've done this is similar to Wilduck's response, but closes the input file as soon as it reads in the data and keeps all the lines read in.
data_file = open('large_file_name', 'r')
lines = data_file.readlines()
data_file.close()
outputFile = open('output_file_one', 'w')
for line in lines:
    if 'SomeName' in line:
        outputFile.write(line)
outputFile.close()
If you wanted to have more than one output file you could either add more loops or open more than one outputFile at a time.
I'd recommend using Wilduck's response, however, as it uses less memory and will take less time with larger files, since it never holds the whole file in memory at once.
How big and does it need to be done in python? If this is on unix, would split/csplit/grep suffice?
First, open the big file for reading.
Second, open all the smaller file names for writing.
Third, iterate through every line. On each iteration, check what kind of line it is, then write it to the appropriate file.
More info on File I/O: http://docs.python.org/tutorial/inputoutput.html
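A hedged sketch of those three steps, using a dictionary of output handles so it scales to many smaller files (the keywords and file names here are made up):

big_file = open('big_file', 'r')
outputs = {
    'Charlie': open('charlie_lines.txt', 'w'),
    'Mark': open('mark_lines.txt', 'w'),
}
for line in big_file:
    for keyword, out in outputs.items():
        if keyword in line:          # decide which kind of line this is
            out.write(line)
big_file.close()
for out in outputs.values():
    out.close()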