Sort a big file with Python heapq.merge

I'm trying to complete the following job but have run into difficulty:
I have a huge text file. Each line is of the format "AGTCCCGGAT filename", where the first part is a DNA sequence.
The professor suggests that we break this huge file into many temporary files and use heapq.merge() to sort them. The goal is to end up with one file that contains every line of the original file and is sorted.
My first try was to put each line into a separate temporary file. The problem is that heapq.merge() reports there are too many files to sort.
My second try was to split it into temporary files of 50000 lines each. The problem is that the result does not seem to be sorted by line, but by file. For example, we get something like:
ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename
where the first two lines are from one temp file and the last two lines are from the second file.
My code to sort them is as follows:
for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/'+str(f),'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
    result.write(line)
result.close()

Your solution is almost correct. However, each partial file must be sorted before you write it to disk. Here's a 2-pass algorithm that demonstrates it: first, iterate over the file in 50k-line chunks, sort the lines in each chunk, and write the sorted chunk to a file. In the second pass, open all of these files and merge them into the output file.
from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2
                                  # need to take care of closing files otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break

        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))
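If the intermediate chunk files are not needed afterwards, they can be removed once the merge has finished; a small optional addition, not part of the original answer:
import os

for name in chunk_names:
    os.remove(name)  # delete the temporary sorted chunk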

Related

Skip first rows from CSV with Python without reading the file

I need to skip the first few lines of a CSV file and save the rest to another file.
The code I currently use to accomplish this task is:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=2)
df.to_csv("usersOutput.csv", index=False)
and it works without issues. The only thing is that this code reads the whole file before saving it. My problem is that I now have to deal with a 4 GB file, and I think this approach will be very time consuming.
Is there a way to skip the first few lines and save the file without reading the whole thing first?
You don't need to use pandas just to filter lines from a file:
with open('users.csv') as users, open('usersOutput.csv', 'w') as output:
    for lineno, line in enumerate(users):
        if lineno > 1:
            output.write(line)
The most efficient way is to use shutil.copyfileobj(fsrc, fdst[, length]):
from shutil import copyfileobj
from itertools import islice

with open('users.csv') as f_old, open('usersOutput.csv', 'w') as f_new:
    list(islice(f_old, 2))  # skip 2 lines
    copyfileobj(f_old, f_new)
From the docs:
... if the current file position of the fsrc object is not 0, only
the contents from the current file position to the end of the file
will be copied.
i.e. the new file will contain the same content except the first 2 lines.

Python - reading huge file

I have the following code that tries to process a huge file containing multiple XML elements.
from shutil import copyfile

files_with_companies_mentions = []

# code that reads the file line by line
def read_the_file(file_to_read):
    list_of_files_to_keep = []
    f = open('huge_file.nml', 'r')
    lines = f.readlines()
    print("2. I GET HERE ")
    len_lines = len(lines)
    for i in range(0, len(lines)):
        j = i
        if '<?xml version="1.0"' in lines[i]:
            next_line = lines[i+1]
            write_f = open('temp_files/myfile_'+str(i)+'.nml', 'w')
            write_f.write(lines[i])
            while '</doc>' not in next_line:
                write_f.write(next_line)
                j = j+1
                next_line = lines[j]
            write_f.write(next_line)
            write_f.close()
            list_of_files_to_keep.append(write_f.name)
    return list_of_files_to_keep
The file is over 700 MB in size, with over 20 million rows. Is there a better way to handle it?
As you can see, I need to reference the previous and the next lines using an index variable such as i.
The problem I am facing is that it is very slow. It takes more than an hour per file, and I have several of these files.
You can use parallel processing to speed things up, using the joblib package. Assuming you have a list of files called files, the structure would be as follows:
import ...
from joblib import Parallel, delayed

def read_the_file(file):
    ...

if __name__ == '__main__':
    n = 8  # number of processors
    Parallel(n_jobs=n)(delayed(read_the_file)(file) for file in files)
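One usage note, not part of the original answer: Parallel returns the functions' return values as a list, so the per-file results (for example the list_of_files_to_keep that read_the_file builds) can be collected afterwards:
results = Parallel(n_jobs=n)(delayed(read_the_file)(file) for file in files)
# results[0] holds read_the_file's return value for files[0], and so on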
First of all, you shouldn't determine the total number of lines yourself or read the whole file at once if you don't need to. Use a loop like the ones shown below and you'll already save some time.
Also consider this post on the use of readlines(): http://stupidpythonideas.blogspot.de/2013/06/readlines-considered-silly.html.
Since you're working with XML elements, consider using a library that makes this easier, especially for the writing.
Suggestion: make use of a context manager:
with open(filename, 'r') as file:
    ...
Suggestion: do the reading and processing chunk-wise (currently you read the file in a single step and only afterwards go over the resulting list "line by line"):
for chunk in iter(lambda: file.read(number_of_bytes_to_read), ''):
    my_function(chunk)
Of course, this way you have to take care of correct XML tag starts/ends yourself.
Alternative: look for an XML parser package. I am quite certain there is one that can process files chunk-wise, with correct tag handling included.
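Putting those suggestions together, here is a minimal streaming sketch of the splitting step, assuming (as in the question) that each embedded document starts with an '<?xml version="1.0"' line and ends with a '</doc>' line; the function name and output paths are illustrative, not from the original answer:
def split_documents(file_to_read, out_dir='temp_files'):
    list_of_files_to_keep = []
    current = None
    with open(file_to_read, 'r') as source:
        for line_number, line in enumerate(source):
            if '<?xml version="1.0"' in line:
                # start a new output file for this document
                current = open('{}/myfile_{}.nml'.format(out_dir, line_number), 'w')
                list_of_files_to_keep.append(current.name)
            if current is not None:
                current.write(line)
                if '</doc>' in line:
                    current.close()
                    current = None
    return list_of_files_to_keep
This reads one line at a time, so the 700 MB file is never held in memory as a list.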

Import the output into a CSV file

Desktop.zip contains multiple text files. fun.py is a Python program which prints the names of the text files in the zip and also the number of lines in each file. Everything is okay up to here. But it should also put this output into a single CSV file. Code:
import zipfile, csv

file = zipfile.ZipFile("Desktop.zip", "r")
inputcsv = input("Enter the name of the CSV file: ")
csvfile = open(inputcsv, 'a')

# list file names
for name in file.namelist():
    print (name)

# do stuff with the file object
for name in file.namelist():
    with open(name) as fh:
        count = 0
        for line in fh:
            count += 1
        print ("File " + name + "line(s) count = " + str(count))
        b = open(inputcsv, 'w')
        a = csv.writer(b)
        data = [name, str(count)]
        a.writerows(data)

file.close()
I am expecting output in CSV file like :-
test1.txt, 25
test2.txt, 10
But I am getting this output in CSV file :-
t,e,s,t,1,.,t,x,t
2,5
t,e,s,t,2,.,t,x,t
1,0
Here, test1.txt and test2.txt are the files in Desktop.zip, and 25 and 10 is the number of lines of those files respectively.
writerows takes an iterable of row-representing iterables. You're passing it a single row, so it treats each item in that row as a row of its own and each character of those strings as a cell. You don't want that. Use writerow rather than writerows.
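For illustration, a minimal sketch of the difference (the filenames and counts here are just placeholders):
import csv

with open('counts.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['test1.txt', 25])     # writes one row
    writer.writerows([['test1.txt', 25],
                      ['test2.txt', 10]])  # writes several rows at once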
I saw a number of issues:
You should open the csv file only once, before the for loop. Opening it inside the for loop overwrites the information from the previous iteration.
icktoofay pointed out that you should use writerow, not writerows.
file is a built-in name, so you should not use it for your variable. Besides, it is not very descriptive.
You get the file names from the archive, but open the files from the directory (not the ones inside the archive). These two sets of files might not be identical.
Here is my approach:
import csv
import zipfile

with open('out.csv', 'wb') as file_handle:
    csv_writer = csv.writer(file_handle)
    archive = zipfile.ZipFile('Desktop.zip')
    for filename in archive.namelist():
        lines = archive.open(filename).read().splitlines()
        line_count = len(lines)
        csv_writer.writerow([filename, line_count])
My approach has a couple of issues, which may or may not matter:
I assume the files in the archive are text files.
I open, read, and split lines in one operation. This might not work well for very large files.
The code in your question has multiple issues, as others have pointed out. The two primary ones are that you're recreating the csv file over and over again for each archive member being processed, and secondly, that you're passing csvwriter.writerows() the wrong data: it interprets each item in the list you're passing as a separate row to be added to the csv file.
One way to fix that would be to open the csv file only once, before entering a for loop that counts the lines in each member of the archive and writes one row to it at a time with a call to csvwriter.writerow().
A slightly different way, shown below, does use writerows(), but passes it a generator expression that processes each member on-the-fly instead of calling writerow() repeatedly. It also processes each member incrementally, so it doesn't need to read the whole thing into memory at one time and then split it up in order to get a line count.
Although you didn't indicate what version of Python you're using, from the code in your question, I'm guessing it's Python 3.x, so the answer below has been written and tested with that (although it wouldn't be hard to make it work in Python 2.7).
import csv
import zipfile

input_zip_filename = 'Desktop.zip'
output_csv_filename = input("Enter the name of the CSV file to create: ")

# Helper function.
def line_count(archive, filename):
    ''' Count the lines in specified ZipFile member. '''
    with archive.open(filename) as member:
        return sum(1 for line in member)

with zipfile.ZipFile(input_zip_filename, 'r') as archive:
    # List files in archive.
    print('Members of {!r}:'.format(input_zip_filename))
    for filename in archive.namelist():
        print('  {}'.format(filename))

    # Create csv with filenames and line counts.
    with open(output_csv_filename, 'w', newline='') as output_csv:
        csv.writer(output_csv).writerows(
            # generator expression
            [filename, line_count(archive, filename)]  # contents of one row
                for filename in archive.namelist())
Sample format of content in csv file created:
test1.txt,25
test2.txt,10

File output based on the contents of another file

I have an issue which has to do with file input and output in Python (it's a continuation from this question: how to extract specific lines from a data file, which has been solved now).
So I have one big file, danish.train, and eleven small files (called danish.test.part-01 and so on), each of them containing a different selection of the data from the danish.train file. Now, for each of the eleven files, I want to create an accompanying file that complements them. This means that for each small file, I want to create a file that contains the contents of danish.train minus the part that is already in the small file.
What I've come up with so far is this:
trainFile = open("danish.train")

for file_number in range(1, 12):
    input = open('danish.test.part-%02d' % file_number, 'r')
    for line in trainFile:
        if line not in input:
            with open('danish.train.part-%02d' % file_number, 'a+') as myfile:
                myfile.write(line)
The problem is that this code only gives output for file_number 1, although I have a loop from 1 to 11. If I change the range, for example to range(2,3), I get an output file danish.train.part-02, but this output contains a copy of the whole danish.train without leaving out the contents of danish.test.part-02, as I wanted.
I suspect that these issues may have something to do with me not completely understanding the with... as operator, but I'm not sure. Any help would be greatly appreciated.
When you open a file, it returns an iterator over the lines of the file. This is nice, in that it lets you go through the file one line at a time without keeping the whole file in memory at once. In your case, though, it leads to a problem: you need to iterate through the file multiple times, and the iterator is exhausted after the first pass, which is why only the first file_number produces output.
Instead, you can read the full training file into memory and go through it multiple times:
with open("danish.train", 'r') as f:
train_lines = f.readlines()
for file_number in range(1, 12):
with open("danish.test.part-%02d" % file_number, 'r') as f:
test_lines = set(f)
with open("danish.train.part-%02d" % file_number, 'w') as g:
g.writelines(line for line in train_lines if line not in test_lines)
I've simplified the logic a little bit, as well. If you don't care about the order of the lines, you could also consider reading the training lines into a set, and then just use set operations instead of the generator expression I used in the final line.
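A sketch of that set-based variant, under the assumption that the order of lines in the output files does not matter (note that duplicate lines in danish.train would also be collapsed):
with open("danish.train", 'r') as f:
    train_lines = set(f)

for file_number in range(1, 12):
    with open("danish.test.part-%02d" % file_number, 'r') as f:
        test_lines = set(f)
    with open("danish.train.part-%02d" % file_number, 'w') as g:
        # keep only the training lines that are not in this test part
        g.writelines(train_lines - test_lines)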

Split large files using python

I'm having some trouble trying to split large files (say, around 10 GB). The basic idea is to simply read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I have asked about this before.)
In Python, the approaches I've tried for reading the WHOLE file at once include:
input1=f.readlines()
input1 = commands.getoutput('zcat ' + file).splitlines(True)
input1 = subprocess.Popen(["cat",file],
stdout=subprocess.PIPE,bufsize=1)
Well, then I can easily group 40000 lines into one file with slices such as input1[40000:80000] or input1[80000:120000].
Another advantage of using a list is that we can easily point to specific lines.
2) The second way is to read line by line and process each line as it is read. Those read lines won't be kept in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that for gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line; so how can I do this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")
    fout.close()
If there's nothing special about having a specific number of lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)
        outFile.close()
        fileNumber += 1
The best solution I have found is using the library filesplit.
You only need to specify the input file, the output folder and the desired size in bytes for output files. Finally, the library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10 GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout: fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space as well. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file in two different forms at once.
Therefore, I would go with your second solution.
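As one possible shape of that second solution (the file name and chunk size are taken from the question; everything else is an illustrative sketch), itertools.islice can pull a fixed number of lines at a time from a file object, which also covers the "point to specific lines" part of the question:
from itertools import count, islice

NUM_OF_LINES = 40000
with open('myinput.txt') as fin:                   # for gzip, use gzip.open(file, 'rt')
    for part in count():
        chunk = list(islice(fin, NUM_OF_LINES))    # at most the next 40000 lines
        if not chunk:
            break
        with open('output%d.txt' % part, 'w') as fout:
            fout.writelines(chunk)
Similarly, islice(fin, 40000, 80000) yields just that line range from the file object without loading the whole file.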
I created this small script to split a large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files, each with 2M lines.
split_length = 2_000_000
file_count = 0
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

for split_start_value in range(0, len(large_file), split_length):
    split_end_value = split_start_value + split_length
    file_content_list = large_file[split_start_value:split_end_value]
    file_content = ''.join(file_content_list)
    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(file_content)
    new_file.close()
    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split
LINES_PER_FILE = 40_000 # see PEP515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/' # to store split-files `myinput_1.txt` etc.
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.
