File output based on the contents of another file - python

I have an issue with file input and output in Python (it's a continuation of this question: how to extract specific lines from a data file, which has now been solved).
So I have one big file, danish.train, and eleven small files (called danish.test.part-01 and so on), each of them containing a different selection of the data from the danish.train file. Now, for each of the eleven files, I want to create an accompanying file that complements them. This means that for each small file, I want to create a file that contains the contents of danish.train minus the part that is already in the small file.
What I've come up with so far is this:
trainFile = open("danish.train")
for file_number in range(1, 12):
    input = open('danish.test.part-%02d' % file_number, 'r')
    for line in trainFile:
        if line not in input:
            with open('danish.train.part-%02d' % file_number, 'a+') as myfile:
                myfile.write(line)
The problem is that this code only gives output for file_number 1, although I have a loop from 1 to 11. If I change the range, for example to range(2, 3), I do get an output file danish.train.part-02, but it contains a copy of the whole danish.train without leaving out the contents of danish.test.part-02, as I wanted.
I suspect these issues may have something to do with me not completely understanding the with ... as statement, but I'm not sure. Any help would be greatly appreciated.

When you open a file, it returns an iterator over the lines of the file. This is nice, in that it lets you go through the file one line at a time without keeping the whole file in memory at once. In your case it leads to a problem, because you need to iterate through the file multiple times.
Instead, you can read the full training file into memory, and go through it multiple times:
with open("danish.train", 'r') as f:
train_lines = f.readlines()
for file_number in range(1, 12):
with open("danish.test.part-%02d" % file_number, 'r') as f:
test_lines = set(f)
with open("danish.train.part-%02d" % file_number, 'w') as g:
g.writelines(line for line in train_lines if line not in test_lines)
I've simplified the logic a little bit, as well. If you don't care about the order of the lines, you could also consider reading the training lines into a set, and then just use set operations instead of the generator expression I used in the final line.
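For instance, a minimal sketch of that set-based variant (the order of the output lines is not preserved; the filename pattern is the same as above):

# A sketch, assuming the same file layout as above.
with open("danish.train", 'r') as f:
    train_lines = set(f)

for file_number in range(1, 12):
    with open("danish.test.part-%02d" % file_number, 'r') as f:
        test_lines = set(f)
    with open("danish.train.part-%02d" % file_number, 'w') as g:
        # set difference: lines of the training file that are not in this test part
        g.writelines(train_lines - test_lines)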

Related

Selectively replacing csv header names

I have been searching for a solution for this and haven't been able to find one. I have a directory of folders which contain multiple, very-large csv files. I'm looping through each csv in each folder in the directory to replace values of certain headers. I need the headers to be consistent (from file to file) in order to run a different script to process all the data properly.
I found this solution that I thought would work: change first line of a file in python.
However, this is not working as expected. My code:
from_file = open(filepath)
# for line in f:
#     if
data = from_file.readline()
# print(data)
# with open(filepath, "w") as f:
print 'DBG: replacing in file', filepath
# s = s.replace(search_pattern, replacement)
for i in range(len(search_pattern)):
    data = re.sub(search_pattern[i], replacement[i], data)
# data = re.sub(search_pattern, replacement, data)
to_file = open(filepath, mode="w")
to_file.write(data)
shutil.copyfileobj(from_file, to_file)
I want to replace the header values in search_pattern with values in replacement without saving or writing to a different file - I want to modify the file. I have also tried
shutil.copyfileobj(from_file, to_file, -1)
As I understand it that should copy the whole file rather than breaking it up in chunks, but it doesn't seem to have an effect on my output. Is it possible that the csv is just too big?
I haven't been able to determine a different way to do this or make this way work. Any help would be greatly appreciated!
The answer from "change first line of a file in python" that you copied doesn't work on Windows.
On Linux, you can open a file for reading and writing at the same time. The system ensures that there's no conflict, but behind the scenes two different file objects are being handled. And this method is very unsafe: if the program crashes while reading/writing (power off, disk full), the file has a good chance of ending up truncated or corrupt.
On Windows, you cannot open a file for reading and writing at the same time using two handles at all; it just destroys the contents of the file.
So there are two options, which are portable and safe:
Create a new file in the same directory; once it has been written, delete the original file and rename the new one.
Like this:
import os
import shutil

filepath = "test.txt"

with open(filepath) as from_file, open(filepath + ".new", "w") as to_file:
    data = from_file.readline()
    to_file.write("something else\n")
    shutil.copyfileobj(from_file, to_file)

os.remove(filepath)
os.rename(filepath + ".new", filepath)
This doesn't take much longer, because the rename operation is instantaneous. Besides, if the program/computer crashes at any point, one of the files (old or new) is valid, so it's safe.
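Applied to the header-replacement case in the question, a sketch might look like this (search_pattern and replacement are assumed to be parallel lists of strings, as in the original code):

import os
import re
import shutil

# A sketch: rewrite only the first line, copy the rest, then swap the files.
def replace_header(filepath, search_pattern, replacement):
    tmp_path = filepath + ".new"
    with open(filepath) as from_file, open(tmp_path, "w") as to_file:
        header = from_file.readline()
        for pat, rep in zip(search_pattern, replacement):
            header = re.sub(pat, rep, header)
        to_file.write(header)
        # copy the remaining lines unchanged
        shutil.copyfileobj(from_file, to_file)
    os.remove(filepath)
    os.rename(tmp_path, filepath)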
If the replacement has exactly the same length as the original, use read/write mode.
Like this:
filepath = "test.txt"
with open(filepath,"r+") as rw_file:
data = rw_file.readline()
data = "h"*(len(data)-1) + "\n"
rw_file.seek(0)
rw_file.write(data)
Here we read the first line, replace it with the same number of h characters, rewind the file, and write the line back, overwriting the previous contents while keeping the rest of the lines. This is also safe, and even if the file is huge it's very fast. The only constraint is that the replacement must be of the exact same size: otherwise you would either leave remainders of the previous data or overwrite the next line(s), since no data is shifted.

how to read a large compressed file in python without loading it all in memory

I have large log files that are in compressed format, e.g. largefile.gz; these are commonly 4-7 GB each.
Here's the relevant part of the code:
for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path + filename, 'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()
After this, the code does some calculations (basically totaling up the bytes) and writes a line to a file that says "total bytes for x criteria = y". All this works fine on a small file, but on a large file it kills the system.
What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it's trying to put the whole log into memory first.
Question:
How can I read one line from the compressed file, process it, then move on to the next, without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on the activity monitor my guess is that it is trying to put it all in memory.)
Thanks
It wasn't storing the entire content in-memory until you told it to. That is to say -- instead of:
# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()
for line in data:
    parts = line.split()
...use:
# GOOD: iterates a line at a time
for line in oldfile:
    parts = line.split()
...so you aren't storing the entire file in a variable. And obviously, don't store parts anywhere that persists past the one line either.
It's that easy.
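Putting it together for the byte-totaling use case, a sketch might look like this (matches_criteria and the column holding the byte count are assumptions, since that part of the original code isn't shown):

import gzip
import os

path = '/var/log/'   # assumed location of the .gz files
total_bytes = 0

def matches_criteria(parts):
    # hypothetical filter; replace with whatever criteria the real script uses
    return True

for filename in os.listdir(path):
    if not filename.startswith(".") and filename.endswith(".gz"):
        with gzip.open(os.path.join(path, filename), 'rt') as oldfile:
            for line in oldfile:   # one line at a time, never the whole file
                parts = line.split()
                if matches_criteria(parts):
                    total_bytes += int(parts[-1])   # assumes the byte count is the last column

print("total bytes for x criteria =", total_bytes)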

Sort a big file with Python heapq.merge

I'm trying to complete the following job but have run into difficulty:
I have a huge file of texts. Each line is of the format "AGTCCCGGAT filename" where the first part is a DNA thing.
The professor suggests that we break this huge file into many temporary files and use heapq.merge() to sort them. The goal is to have one file at the end which contains every line of the original file and is sorted.
My first try was to break each line into a separate temporary file. The problem is that heapq.merge() reports there are too many files to sort.
My second try was to break it into temporary files of 50000 lines each. The problem is that it seems not to sort by line, but by file. For example, we get something like:
ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename
where the first two lines are from one temp file and the last two lines are from the second file.
My code to sort them is as follows:
for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/' + str(f), 'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
    result.write(line)
result.close()
Your solution is almost correct. However, each partial file must be sorted before you write it to disk. Here's a 2-pass algorithm that demonstrates it: first, iterate over the file in 50k-line chunks, sort the lines within each chunk, and write the sorted chunk to a file. In the second pass, open all these files and merge them into the output file.
from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2;
                                  # need to take care of closing files manually otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break
        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

# merge the sorted chunks into the output file
with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))

Split large files using python

I'm having some trouble trying to split large files (say, around 10 GB). The basic idea is simply to read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" files.
1) The first is to read the WHOLE file at once and turn it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I've asked such questions before.)
In Python, the approaches I've tried for reading the WHOLE file at once include:
input1 = f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I can easily group 40000 lines into one file with slices like list[40000:80000] or list[80000:120000].
The advantage of using a list is that we can easily point to specific lines.
2) The second way is to read line by line, processing each line as it is read. The lines that have been read won't be kept in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that with gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line; so how can I do this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'

with open(filename) as fin:
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            # integer division so the filename index works on Python 3 as well
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)  # buf is a list of lines
        outFile.close()
        fileNumber += 1
The best solution I have found is using the library filesplit.
You only need to specify the input file, the output folder, and the desired size in bytes for the output files; the library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10 GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1) Open the input file.
2) Open the first output file.
3) Read one line from the input file and write it to the output file.
4) Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
5) Repeat steps 3-4 until you've reached the end of the input file.
6) Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space as well. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require this. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file twice in two different forms.
Therefore, I would go with your second solution.
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000
file_count = 0

# note: this still reads the whole file into memory at once
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

for split_start_value in range(0, len(large_file), split_length):
    split_end_value = split_start_value + split_length
    file_content_list = large_file[split_start_value:split_end_value]
    file_content = ''.join(file_content_list)
    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(file_content)
    new_file.close()
    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use the module filesplit with its bylinecount method (version 4.0):
import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP 515 for readable numeric literals

filename = 'myinput.txt'
outdir = 'splitted/'     # to store split files `myinput_1.txt` etc.

os.makedirs(outdir, exist_ok=True)  # the output directory must exist
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer above, which apparently used the outdated version 2.0 to split by size.

question about splitting a large file

Hey, I need to split a large file in Python into smaller files that contain only specific lines. How do I do this?
You're probably going to want to do something like this:
big_file = open('big_file', 'r')
small_file1 = open('small_file1', 'w')
small_file2 = open('small_file2', 'w')

for line in big_file:
    if 'Charlie' in line: small_file1.write(line)
    if 'Mark' in line: small_file2.write(line)

big_file.close()
small_file1.close()
small_file2.close()
Opening a file for reading returns an object that allows you to iterate over the lines. You can then check each line (which is just a string of whatever that line contains) for whatever condition you want, then write it to the appropriate file that you opened for writing. It is worth noting that when you open a file with 'w' it will overwrite anything already written to that file. If you want to simply add to the end, you should open it with 'a', to append.
Additionally, if you expect there to be some possibility of error in your reading/writing code, and want to make sure the files are closed, you can use:
with open('big_file', 'r') as big_file:
    <do stuff prone to error>
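For example, a sketch of the same filtering loop with all three files managed by a single with statement (keeping the hypothetical 'Charlie'/'Mark' conditions from above):

# A sketch using the same hypothetical filenames and conditions as above.
with open('big_file', 'r') as big_file, \
     open('small_file1', 'w') as small_file1, \
     open('small_file2', 'w') as small_file2:
    for line in big_file:
        if 'Charlie' in line:
            small_file1.write(line)
        if 'Mark' in line:
            small_file2.write(line)
# all three files are closed automatically here, even if an error occurred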
Do you mean breaking it down into subsections? Like if I had a file with chapter 1, chapter 2, and chapter 3, you want it to be broken down into separate files for each chapter?
The way I've done this is similar to Wilduck's response, but closes the input file as soon as it reads in the data and keeps all the lines read in.
data_file = open('large_file_name', 'r')
lines = data_file.readlines()
data_file.close()

outputFile = open('output_file_one', 'w')
for line in lines:
    if 'SomeName' in line:
        outputFile.write(line)
outputFile.close()
If you wanted to have more than one output file you could either add more loops or open more than one outputFile at a time.
I'd recommend using Wilduck's response, however, as it uses less space and will take less time with larger files, since the file is read only once.
How big is it, and does it need to be done in Python? If this is on Unix, would split/csplit/grep suffice?
First, open the big file for reading; see the sketch after this list.
Second, open all the smaller files for writing.
Third, iterate through every line. On each iteration, check what kind of line it is, then write it to the appropriate file.
More info on File I/O: http://docs.python.org/tutorial/inputoutput.html
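A sketch of that approach, assuming a hypothetical mapping from a keyword to the file its lines should go to:

# Hypothetical keywords and filenames; adjust to the lines you actually need to route.
routes = {
    'ERROR': open('errors.txt', 'w'),
    'WARNING': open('warnings.txt', 'w'),
}

with open('big_file', 'r') as big_file:
    for line in big_file:
        for keyword, out_file in routes.items():
            if keyword in line:
                out_file.write(line)

for out_file in routes.values():
    out_file.close()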
