Question about splitting a large file - Python

Hey, I need to split a large file in Python into smaller files that contain only specific lines. How do I do this?

You're probably going to want to do something like this:
big_file = open('big_file', 'r')
small_file1 = open('small_file1', 'w')
small_file2 = open('small_file2', 'w')
for line in big_file:
    if 'Charlie' in line:
        small_file1.write(line)
    if 'Mark' in line:
        small_file2.write(line)
big_file.close()
small_file1.close()
small_file2.close()
Opening a file for reading returns an object that allows you to iterate over the lines. You can then check each line (which is just a string of whatever that line contains) for whatever condition you want, then write it to the appropriate file that you opened for writing. It is worth noting that when you open a file with 'w' it will overwrite anything already written to that file. If you want to simply add to the end, you should open it with 'a', to append.
Additionally, if you expect there to be some possibility of error in your reading/writing code, and want to make sure the files are closed, you can use:
with open('big_file', 'r') as big_file:
    <do stuff prone to error>
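For example, the split above could be written with with so all three files are closed automatically even if something fails mid-loop (same placeholder filenames as before):
with open('big_file', 'r') as big_file, \
     open('small_file1', 'w') as small_file1, \
     open('small_file2', 'w') as small_file2:
    for line in big_file:
        if 'Charlie' in line:
            small_file1.write(line)
        if 'Mark' in line:
            small_file2.write(line)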

Do you mean breaking it down into subsections? Like if I had a file with chapter 1, chapter 2, and chapter 3, you want it to be broken down into separate files for each chapter?
The way I've done this is similar to Wilduck's response, but closes the input file as soon as it reads in the data and keeps all the lines read in.
data_file = open('large_file_name', 'r')
lines = data_file.readlines()
data_file.close()
outputFile = open('output_file_one', 'w')
for line in lines:
    if 'SomeName' in line:
        outputFile.write(line)
outputFile.close()
If you wanted to have more than one output file you could either add more loops or open more than one outputFile at a time.
I'd recommend using Wilduck's response, however, as it uses less memory with larger files: it never holds the whole file in memory at once, and a single pass over the file handles all the output files.

How big is it, and does it need to be done in Python? If this is on Unix, would split/csplit/grep suffice?

First, open the big file for reading.
Second, open all the smaller file names for writing.
Third, iterate through every line. On every iteration, check what kind of line it is, then write it to the matching file.
More info on File I/O: http://docs.python.org/tutorial/inputoutput.html
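As a rough sketch of that routing idea (the keywords and filenames here are made up for illustration), a dict can map each condition to its output file:
# Hypothetical example: route lines to output files keyed by keyword.
keywords = {'Charlie': open('charlie.txt', 'w'),
            'Mark': open('mark.txt', 'w')}
with open('big_file', 'r') as big:
    for line in big:
        for word, out in keywords.items():
            if word in line:
                out.write(line)
for out in keywords.values():
    out.close()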

Related

Python split text file into different text files when line starts with a specific character

I have the following txt file which is generated by an instrument:
https://www.dropbox.com/s/01rx9fk5e64y4b1/Test.txt?dl=0
Every time I run an experiment, the software of the instrument appends a header which starts with "//", followed by the acquired data, to the same txt file. Therefore the txt file above contains many different experiments separated by headers. I need to divide the file above into different txt files, each one containing only one experiment and including the header.
Would the best strategy be to read the file line by line and generate a new txt file every time the programme encounters a line that starts with "//"?
Maybe there is a better way using Pandas?
Any suggestions would be very much appreciated.
I examined your file and it seems every header is preceded by an empty line, therefore I propose exploiting that fact in the following way:
fileno = 0
inputfile = open('Test.txt', 'r')
outfile = open(f'out{fileno}.txt', 'w')
for line in inputfile:
    if not line.strip():
        fileno += 1
        outfile.close()
        outfile = open(f'out{fileno}.txt', 'w')
    else:
        outfile.write(line)
outfile.close()
inputfile.close()
Explanation: the file is processed line by line, so there is no need to load the whole file into memory; when an empty line is encountered, the current outfile is closed and a new outfile with the next number is opened. Disclaimer: I used so-called f-strings, so Python 3.6 or newer is required.
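If you would rather split on the "//" header itself, as you originally suggested, a minimal sketch (untested against your exact file) could be:
fileno = 0
outfile = None
with open('Test.txt', 'r') as inputfile:
    for line in inputfile:
        if line.startswith('//'):  # a header starts a new experiment
            if outfile is not None:
                outfile.close()
            fileno += 1
            outfile = open(f'out{fileno}.txt', 'w')
        if outfile is not None:  # skips anything before the first header
            outfile.write(line)
if outfile is not None:
    outfile.close()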

Python loop won't iterate on second pass

When I run the following in the Python IDLE Shell:
f = open(r"H:\Test\test.csv", "rb")
for line in f:
    print line
# this works fine
however, when I run the following for a second time:
for line in f:
    print line
# this does nothing
This does not work because you've already read to the end of the file the first time through. You need to rewind (using .seek(0)) or re-open the file.
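For example, a minimal sketch of the rewind:
f = open(r"H:\Test\test.csv", "rb")
for line in f:
    print line  # first pass reads to the end of the file
f.seek(0)  # rewind to the beginning
for line in f:
    print line  # second pass works now
f.close()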
Some other pointers:
Python has a very good csv module. Do not attempt to implement CSV parsing yourself unless doing so as an educational exercise.
You probably want to open your file in 'rU' mode, not 'rb'. 'rU' is universal newline mode, which will deal with source files coming from platforms with different line endings for you.
Use with when working with file objects, since it will cleanup the handles for you even in the case of errors. Ex:
with open(r"H:\Test\test.csv", "rU") as f:
    for line in f:
        ...
You can read the data from the file into a variable, and then you can iterate over this data any number of times you want in your script. This is better than seeking back and forth.
f = open(r"H:\Test\test.csv", "rb")
data = f.readlines()
for line in data:
    print line
for line in data:
    print line
Output:
# This is test.csv
Line1,This is line 1, there are, some numbers here,321423423
Line2,This is line2 , there are some characters here,sdfdsfdsf
# This is test.csv
Line1,This is line 1, there are, some numbers here,321423423
Line2,This is line2 , there are some characters here,sdfdsfdsf
Because you've gone all the way through the CSV file, and the iterator is exhausted. You'll need to re-open it before the second loop.

How to confirm that a file object is empty? [Python]

in a py module, I write:
outFile = open(fileName, mode='w')
if A:
    outFile.write(...)
if B:
    outFile.write(...)
and in these lines, I didn't use the flush or close method.
Then after these lines, I want to check whether this "outFile" object is empty or not. How can I do that?
There are a few problems with your code.
You can't .write to a file that you opened with 'r'. You need to open(fileName, 'w').
If A or B then you've certainly written to the file, so it's not empty!
Barring those, you can get the length of a file with
os.fstat(outFile.fileno()).st_size
EDIT: I'll explain what flush does. Python is often used to do quite large amounts of file reads and writes, which can be slow. It is thus tweaked to make them as fast as possible. One way it does so is to "buffer" such writes and then do them all in one big block: when you write a small string, Python will remember it but won't actually write it to the file until it thinks it should.
This means that if you want to tell whether you have written data to the file by inspecting the file, you have to tell Python to write all the data it's remembering first, or else you might not see it. flush is the command to write all the buffered data.
Of course, if you ask Python whether it's written anything to the file, say by inspecting the position in the file (.tell()), then it will know about the buffering.
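For example, a quick way to see the buffering in action (the exact sizes printed depend on the buffer, so treat this as illustrative):
import os

f = open('/tmp/demo.txt', 'w')
f.write('hello')
print os.fstat(f.fileno()).st_size  # may print 0: 'hello' can still be sitting in the buffer
f.flush()
print os.fstat(f.fileno()).st_size  # prints 5 now that the buffer has been written out
f.close()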
If you've already written to the file, you can use .tell() to check if the current file position is nonzero:
>>> handle = open('/tmp/file.txt', 'w')
>>> handle.write('foo')
>>> handle.tell()
3
This won't work if you .seek() back to the beginning of the file.
You can use os.stat to get file info:
import os
fileSize = os.stat(fileName).st_size
with open("filename.txt", "r+") as f:
if f.read():
# file isn't empty
f.write("something")
# uncomment this line if you want to delete everything else in the file
# f.truncate()
else:
# file is empty
f.write("somethingelse")
"r+" mode always you to read & write.
"with" will automatically close file

Open a file for input and output in Python

I have the following code which is intended to remove specific lines of a file. When I run it, it prints the two filenames that live in the directory, then deletes all information in them. What am I doing wrong? I'm using Python 3.2 under Windows.
import os
files = [file for file in os.listdir() if file.split(".")[-1] == "txt"]
for file in files:
    print(file)
    input = open(file, "r")
    output = open(file, "w")
    for line in input:
        print(line)
        # if line is good, write it to output
    input.close()
    output.close()
open(file, 'w') wipes the file. To prevent that, open it in r+ mode (read+write/don't wipe), then read it all at once, filter the lines, and write them back out again. Something like
with open(file, "r+") as f:
    lines = f.readlines()  # read entire file into memory
    f.seek(0)  # go back to the beginning of the file
    f.writelines(filter(good, lines))  # dump the filtered lines back
    f.truncate()  # wipe the remains of the old file
I've assumed that good is a function telling whether a line should be kept.
If your file fits in memory, the easiest solution is to open the file for reading, read its contents to memory, close the file, open it for writing and write the filtered output back:
with open(file_name) as f:
    lines = list(f)
# filter lines
with open(file_name, "w") as f:  # this removes the file contents
    f.writelines(lines)
Since you are not intermingling read and write operations, the advanced file modes like "r+" are unnecessary here, and only complicate things.
If the file does not fit into memory, the usual approach is to write the output to a new, temporary file, and move it back to the original file name after processing is finished.
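A sketch of that temporary-file approach, with good again standing in for your filter predicate:
import os

tmp_name = file_name + '.tmp'
with open(file_name) as src, open(tmp_name, 'w') as dst:
    for line in src:
        if good(line):
            dst.write(line)
os.replace(tmp_name, file_name)  # os.replace needs Python 3.3+; on 3.2, remove the old file first and use os.rename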
One way is to use the fileinput stdlib module. Then you don't have to worry about open/closing and file modes etc...
import fileinput
from contextlib import closing
import os

fnames = [fname for fname in os.listdir() if fname.split(".")[-1] == "txt"]  # use os.path.splitext
with closing(fileinput.input(fnames, inplace=True)) as fin:
    for line in fin:
        if 'z' not in line:  # your condition here
            print line,  # trailing comma suppresses the newline; in Python 3: print(line, end='')
When using inplace=True, fileinput redirects stdout to the file currently being processed. A backup of the file with a default '.bak' extension is created, which may come in useful if needed.
jon@minerva:~$ cat testtext.txt
one
two
three
four
five
six
seven
eight
nine
ten
After running the above with a condition of not line.startswith('t'):
jon@minerva:~$ cat testtext.txt
one
four
five
six
seven
eight
nine
You're deleting everything when you open the file to write to it. You can't have the same file open for reading and writing like that at the same time. Use open(file, "r+") instead, and then save all the lines to another variable before writing anything.
You should not open the same file for reading and writing at the same time.
"w" means create an empty file for writing. If the file already exists, its data will be deleted.
So you should use a different file name for writing.

Python Overwriting files after parsing

I'm new to Python, and I need to do a parsing exercise. I have a file, and I need to parse it (just the headers), but after the process I need to keep the file in the same format, with the same extension, at the same place on disk, only with the new headers.
I tried this code...
for line in open('/home/name/db/str/dir/numbers/str.phy'):
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        print linepars
..and it does the job, but I don't know how to "overwrite" the file with the new parsing.
The easiest way, but not the most efficient (by far, and especially for long files) would be to rewrite the complete file.
You could do this by opening a second file handle and rewriting each line, except in the case of the header, you'd write the parsed header. For example,
fr = open('/home/name/db/str/dir/numbers/str.phy')
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense
for line in fr:
    if line.startswith('ENS'):
        linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
        fw.write(linepars)
    else:
        fw.write(line)
fw.close()
fr.close()
EDIT: Note that this does not use readlines(), so it's more memory efficient. It also does not store every output line, but only one at a time, writing each to file immediately.
Just as a cool trick, you could use the with statement on the input file to avoid having to close it (Python 2.5+):
fw = open('/home/name/db/str/dir/numbers/str.phy.parsed', 'w')  # Name this whatever makes sense
with open('/home/name/db/str/dir/numbers/str.phy') as fr:
    for line in fr:
        if line.startswith('ENS'):
            linepars = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
            fw.write(linepars)
        else:
            fw.write(line)
fw.close()
P.S. Welcome :-)
As others are saying here, you want to open a file and use that file object's .write() method.
The best approach would be to open an additional file for writing:
import os
current_cfg = open(...)
parsed_cfg = open(..., 'w')
for line in current_cfg:
    new_line = parse(line)
    print new_line
    parsed_cfg.write(new_line + '\n')
current_cfg.close()
parsed_cfg.close()
os.rename(....)  # Rename old file to backup name
os.rename(....)  # Rename new file into place
Additionally I'd suggest looking at the tempfile module and using one of its methods for either naming your new file or opening/creating it. Personally I'd favor putting the new file in the same directory as the existing file to ensure that os.rename will work atomically (the configuration file name will be guaranteed to point at either the old file or the new file; in no case would it point at a partially written/copied file).
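A sketch of that tempfile idea, creating the new file in the same directory so the final rename stays atomic (parse is a placeholder for your header rewrite, as above):
import os
import tempfile

src_name = '/home/name/db/str/dir/numbers/str.phy'
fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(src_name))
with os.fdopen(fd, 'w') as fw, open(src_name) as fr:
    for line in fr:
        fw.write(parse(line))  # parse is a placeholder, as in the snippet above
os.rename(src_name, src_name + '.bak')  # keep the old file as a backup
os.rename(tmp_name, src_name)  # move the new file into place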
The following code DOES the job.
I mean it DOES overwrite the file IN PLACE; that's what the OP asked for. That's possible because the transformations only remove characters, so the file pointer fo that writes always stays BEHIND the file pointer fi that reads.
import re
regx = re.compile('\AENS([A-Z]+)0+([0-9]{6})')
with open('bomo.phy', 'rb+') as fi, open('bomo.phy', 'rb+') as fo:
    fo.writelines(regx.sub('\\1\\2', line) for line in fi)
I think that the writing isn't performed by the operating system one line at a time but through a buffer, so several lines are read before a pool of transformed lines is written. That's what I think.
import re

newlines = []
for line in open('/home/name/db/str/dir/numbers/str.phy').readlines():
    if line.startswith('ENS'):
        line = re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    newlines.append(line)  # keep non-header lines unchanged
open('/home/name/db/str/dir/numbers/str.phy', 'w').write(''.join(newlines))
(sidenote: Of course, if you are working with large files, you should be aware that the level of optimization required may depend on your situation. Python is not lazily evaluated by nature. The following solution is not a good choice if you are parsing large files, such as database dumps or logs, but a few tweaks such as nesting the with clauses and using lazy generators or a line-by-line algorithm can allow O(1)-memory behavior.)
import re

targetFile = '/home/name/db/str/dir/numbers/str.phy'

def replaceIfHeader(line):
    if line.startswith('ENS'):
        return re.sub('ENS([A-Z]+)0+([0-9]{6})', '\\1\\2', line)
    else:
        return line

with open(targetFile, 'r') as f:
    newText = ''.join(replaceIfHeader(line) for line in f)  # lines keep their '\n'
try:
    # make backup of targetFile
    with open(targetFile, 'w') as f:
        f.write(newText)
except:
    # error encountered, do something to inform user where backup of targetFile is
    raise
edit: thanks to Jeff for suggestion
