Implementing an external merge sort - python

I'm trying to learn Python and am working on an external merge sort over an input file of ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)?
import itertools
from itertools import islice
import tempfile
import heapq

# converts heapq.merge to ints
# def merge(*temp_files):
#     return heapq.merge(*[itertools.imap(int, s) for s in temp_files])

with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
            break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))

Your elements are read as strings by default; you have to do something like:
elements = list(islice(f, 1000))
elements = [int(elem) for elem in elements]
so that they would be interpreted as integers instead.
That would also mean that you need to convert them back to strings when writing, e.g.:
temp_file.writelines([str(elem) for elem in elements])
Apart from that, you would need to convert your elements again to int for the final merging. In your case, you probably want to uncomment your merge method (and then convert the result back to strings again, same way as above).
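For example, a minimal sketch of that last step could look like this (assuming Python 3, where map is lazy; on Python 2 use itertools.imap instead, and note that sorted_output.txt is just a placeholder name):
def merge(*temp_files):
    # wrap each sorted run in an int stream so heapq.merge compares numbers, not strings
    return heapq.merge(*[map(int, s) for s in temp_files])

with open("sorted_output.txt", "w") as output_file:
    output_file.writelines("%d\n" % n for n in merge(*temp_files))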

Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files sorting numerically:
import heapq

files = open('a.txt'), open('b.txt')

with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))
First, map(int, ...) turns each file's lines into ints. Then those get merged with heapq.merge. Then map('{}\n'.format, ...) turns each of the integers back into a string, with a newline. Then writelines writes those lines. In other words, you were already close; you just had to convert the ints back to strings before writing them.
A different way to write it (might be clearer for some):
import heapq

files = open('a.txt'), open('b.txt')

with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)
And in any case, do use itertools.imap if you're using Python 2 as otherwise it'll read the whole files into memory at once. In Python 3, you can just use the normal map.
And yes, if you do it right, this will allow you to sort gigantic files with very little memory.
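For reference, here is one way the whole pipeline could be wired together. This is only a sketch under the same assumptions as above (one integer per line; input.txt and output.txt are placeholder paths), not a drop-in replacement for the question's code:
import heapq
import tempfile
from itertools import islice

CHUNK = 1000  # how many lines to hold in memory at once

temp_files = []
with open('input.txt') as f:
    while True:
        chunk = list(islice(f, CHUNK))
        if not chunk:
            break
        chunk.sort(key=int)
        run = tempfile.TemporaryFile('w+')  # one sorted run per temporary file
        run.writelines(line if line.endswith('\n') else line + '\n'
                       for line in chunk)
        run.seek(0)
        temp_files.append(run)

with open('output.txt', 'w') as out:
    out.writelines('{}\n'.format(n)
                   for n in heapq.merge(*(map(int, t) for t in temp_files)))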

You are doing the k-way merge inside the loop, which adds a lot of runtime complexity. Better to store the file handles in a separate list and perform a single k-way merge at the end.
You also don't have to strip the newline and add it back; just sort based on the number:
sorted(temp_files, key=lambda no: int(no.strip()))
The rest is fine.
https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md

Related

Python Large list sorting and storage

I'm currently working with some very large lists (50 to 100 million entries) of information where each item in the list is in the form of [float,(string_1,string_2)]
I'm adding elements to the list in an unsorted manner, and would eventually like to have a list that is sorted by the float value. For example I would have a list that looks like this:
[ [0.5,(A,B)], [-0.15,(B,C)], [0.3,(A,C)], [-0.8,(A,D)] ]
and then sort it to get
[ [0.5,(A,B)], [0.3,(A,C)], [-0.15,(B,C)], [-0.8,(A,D)] ]
Currently I'm using heapq to add items as I go along and then using sorted(heap) to ultimately give me the list I need. My question is: is there a better way to go about adding millions of items to a list and sorting them that won't crash my computer? Holding a list that long and then sorting it is causing some issues with my RAM.
sorted() creates an entirely distinct list, so doubles the RAM needed for the massive list. Use a list's .sort() method instead - that sorts the list in-place.
And unless there's something you haven't told us, leave heapq out of it entirely. Putting the entries in a heap serves no purpose I can think of. Just use a list's .append() method to add new entries, and apply .sort(reverse=True) to the list at end.
If you're still running out of RAM, then you simply can't solve this problem entirely in memory, and will need to craft an approach merging disk files.
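A minimal sketch of that in-memory approach (data_source here is just a stand-in for wherever your items come from):
entries = []
for value, pair in data_source:   # hypothetical iterable yielding (float, (str, str)) items
    entries.append([value, pair])
entries.sort(reverse=True)        # sorts in place: no second copy of the huge list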
LIVING WITH "TOO SMALL" RAM
In the worst case, even the list all by itself can't fit in available memory. You can still create the sorted sequence, but it requires writing sorted chunks to disk and merging them later. For the merging part, heapq is useful. Here's an example:
import pickle
import heapq

MAXPERFILE = 100  # the array will never get bigger than this

def getfname(i):
    return "pickled%d.dat" % i

filenum = 0

def dumptofile(a):  # dump the array to file, as pickled data
    global filenum
    fname = getfname(filenum)
    with open(fname, "wb") as f:
        pickle.dump(len(a), f)
        for x in a:
            pickle.dump(x, f)
    filenum += 1

# generate some random data
import random
a = []
for _ in range(1012):  # 10 "full" files with some leftovers
    a.append(random.random())
    if len(a) == MAXPERFILE:
        a.sort(reverse=True)
        dumptofile(a)
        del a[:]
if a:
    a.sort(reverse=True)
    dumptofile(a)
print("number of files written:", filenum)

# now merge the files together; first a function
# to generate the file contents, one at a time
def feedfile(i):
    fname = getfname(i)
    with open(fname, "rb") as f:
        count = pickle.load(f)
        for _ in range(count):
            yield pickle.load(f)

for x in heapq.merge(*(feedfile(i) for i in range(filenum)),
                     reverse=True):
    print(x)
Max memory use can be made smaller by decreasing MAXPERFILE, although performance will be better the larger MAXPERFILE is. Indeed, if MAXPERFILE is small enough and the total amount of data is large enough, the merging code may die with an OS "too many open files" error.

Python 2.7 Bubble Sort

I am trying to make my program sort the recorded scores that are in a csv file. My plan is to read the csv file into a list, bubble sort the list, then overwrite the csv file with the new list. However, I have encountered a logic error: when I sort the list the result is [[], ['190'], ['200'], ['250'], ['350'], ['90']].
If anyone could help it would be much appreciated. Here is my code for my read and my bubble sort.
import csv

def bubbleSort(scores):
    for length in range(len(scores)-1, 0, -1):
        for i in range(length):
            if scores[i] > scores[i+1]:
                temp = scores[i]
                scores[i] = scores[i+1]
                scores[i+1] = temp

with open("rec_Scores.csv", "rb") as csvfile:
    r = csv.reader(csvfile)
    scores = list(r)

bubbleSort(scores)
print(scores)
This is my first time implementing a sort in python so any help would be great, thanks.
You are comparing strings instead of integers. Use int(scores[i]) to convert the string to an integer.
Upon further inspection, it looks like you are storing your numbers in a list of lists. In that case, to access the first number we must do scores[0][0], the second number would be scores[1][0], and so on. The first index increases by one each time, so we can use int(scores[i][0]).
The second index stays at 0 because it looks like you are only storing a single value in each inner list.
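In other words, assuming the empty row from the blank line has been removed first, the comparison in your bubble sort could become something like:
def bubbleSort(scores):
    for length in range(len(scores) - 1, 0, -1):
        for i in range(length):
            # compare the numeric value held in each single-item row
            if int(scores[i][0]) > int(scores[i + 1][0]):
                scores[i], scores[i + 1] = scores[i + 1], scores[i]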
It appears you are using strings in your scores list. If you want this sort to work correctly you need to convert your values to integers:
int(str_num)
Where str_num is your string value.
This sort should work just fine after you do this conversion.
Also, you can use the built-in timsort to sort your numbers by calling
scores.sort()
Then you don't have to worry about implementing your own algorithm.
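For example, with the list of lists that csv.reader produces, that might look like this (a sketch; the filter drops the empty row caused by the blank line):
scores = [row for row in scores if row]    # drop empty rows
scores.sort(key=lambda row: int(row[0]))   # numeric in-place sort using the built-in Timsort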
you could try this:
import csv

with open("rec_Scores.csv", "rb") as csvfile:
    r = csv.reader(csvfile)
    # each row is a single-item list like ['190']; skip empty rows
    scores = [int(row[0]) for row in r if row]

print(sorted(scores))
scores = list(r)
intscores = []
for row in scores:
    if row:  # skip empty rows
        intscores.append(int(row[0]))
intscores.sort()
This should do it.

More efficient way to select partly records from a big file in Python

I would like to filter records from a big file (a list of lists, 10M+ lines) based on given ids.
selected_id = list()  # 70k+ elements

for line in in_fp:  # input file: 10M+ lines
    id = line.split()[0]  # id (str type), such as '10000872820081804'
    if id in selected_id:
        out_fp.write(line)
The above code is time consuming. An idea came to mind: store selected_id as a dict instead of a list.
Any better solutions?
You've got a few issues, though only the first is really nasty:
(By far the biggest cost in all likelihood) Checking for membership in a list is O(n); for a 70K element list, that's a lot of work. Make it a set/frozenset and lookup is typically O(1), saving thousands of comparisons. If the types are unhashable, you can pre-sort the selected_id list and use the bisect module to do lookups in O(log n) time, which would still get a multiple order of magnitude speedup for such a large list (see the sketch after this list).
If your lines are large, with several runs of whitespace, splitting at all points wastes time; you can specify maxsplit to only split enough to get the ID.
If the IDs are always integer values it may be worth the time to make selected_id store int instead of str and convert on read so the lookup comparisons run a little faster (this would require testing). This probably won't make a major difference, so I'll omit it from the example.
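A rough sketch of that bisect fallback, assuming the IDs are plain, sortable strings:
import bisect

selected_sorted = sorted(selected_id)  # sort once up front

def has_id(sorted_ids, key):
    # binary search: O(log n) per lookup instead of scanning the whole list
    i = bisect.bisect_left(sorted_ids, key)
    return i < len(sorted_ids) and sorted_ids[i] == key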
Combining all suggestions:
selected_id = frozenset(... Your original list of 70k+ str elements ...)

for line in in_fp:  # input file: 10M+ lines
    id, _ = line.split(None, 1)  # id (str type), such as '10000872820081804'
    if id in selected_id:
        out_fp.write(line)
You could even convert the for loop to a single call with a generator expression (though it gets a little overly compact) which pushes more work to the C layer in CPython, reducing Python byte code execution overhead:
out_fp.writelines(x for x in in_fp if x.split(None, 1)[0] in selected_id)
First off, in order to get the first column from your lines, you can read the file with the csv module using the proper delimiter, then use the zip() function (in Python 2, itertools.izip()) together with next() to pull out the first column, and pass the result to set() to keep only the unique values.
import csv

with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = set(next(zip(*spam_reader)))
If you want to preserve the order you can use collections.OrderedDict():
import csv
from collections import OrderedDict

with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = OrderedDict.fromkeys(next(zip(*spam_reader)))

Write lists of different size to csv in columns in Python

I need to write lists that all differ in length to a CSV file in columns.
I currently have:
d = lists
writer = csv.writer(fl)

for values in zip(*d):
    writer.writerow(values)
which only partially works. What I suspect is happening is that zip stops at the length of the shortest list.
Code below is for Python 3. If you use Python 2, import izip_longest instead of zip_longest.
import csv
from itertools import zip_longest

d = [[2,3,4,8],[5,6]]

with open("file.csv", "w+") as f:
    writer = csv.writer(f)
    for values in zip_longest(*d):
        writer.writerow(values)
Result:
2,5
3,6
4,
8,
There's no 'neat' way to write it since, as you said, zip truncates to the length of the shortest iterable.
Probably the simplest way would be to just pad with None or empty strings (Not sure offhand what the behavior of writerow is with None values):
maxlen = max([len(member) for member in d])
[member.extend([""] * (maxlen - len(member))) for member in d]

for values in zip(*d):
    ...
Alternatively, you could just construct each row inside the for loop instead of using zip, which would be more efficient but a lot wordier.
EDIT: corrected example
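For completeness, a sketch of that wordier zip-free version (assuming d and writer as above):
maxlen = max(len(member) for member in d)
for i in range(maxlen):
    # pad missing positions with an empty string so each row has one cell per list
    row = [member[i] if i < len(member) else "" for member in d]
    writer.writerow(row)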

Faster replacing in list with a lot of matches

Just a small problem with lists and replacing some list entries.
Some background on my problem: the idea is really simple. I use the mmap module to read out bigger files. They are FORTRAN files with 7 columns and one million lines. Some values don't fit the format of the FORTRAN output, and I just get ten stars instead. I can't change the output format inside the source code and have to deal with this problem. After loading the file with mmap I use str.split() to convert the data to a list and then I search for the bad values. Look at the following source code:
f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()

for i in range(A.count('********')):
    A[A.index('********')] = '0.0'
I know it's probably not the best solution, but it's quick and dirty. OK, it's quick if A.count('********') is small. That is actually my problem: for some files the replacing doesn't run fast at all. If the count is too big, it takes a lot of time. Is there any other method, or a totally different way, to replace my bad values without wasting a lot of time?
Thanks for any help or any suggestions.
EDIT:
How does the method list.count() work? I could also run through the whole list and do the replacing myself:
for k in range(len(A)):
    if A[k] == '**********': A[k] = '0.0'
This would be faster for many replacements. But would it be faster if I had only one match?
The main problem in your code is the use of A.index inside the loop. The index method walks linearly through your list, from the start up to the next occurrence of '********' - this turns an O(n) problem into O(n²), hence your perceived lack of performance.
When using Python, the most obvious way is usually the best way to do it: walking through your list in a Python for loop will in this case undoubtedly be better than O(n²) loops in C with the count and index methods. The not so obvious part is the recommended use of the built-in function enumerate to get both an item's value and its index from the list in the for loop.
f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()

for i, value in enumerate(A):
    if value == "********":
        A[i] = "0.0"
If you are eventually going to convert this to an array, you might consider using numpy and np.genfromtxt, which has the ability to deal with missing data:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
With a binary file, you can use np.memmap and then use masked arrays to deal with the missing elements.
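For the text-file case, a hedged sketch with genfromtxt might look like this (assuming whitespace-separated columns and that the bad entries are the runs of asterisks):
import numpy as np

data = np.genfromtxt(fname,
                     missing_values='**********',  # treat the asterisk runs as missing data
                     filling_values=0.0)            # and fill them with 0.0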
fin = open(fname, 'r')
fout = open(fname + '_fixed', 'w')

for line in fin:
    # replace 10 asterisks by 7 spaces + '0.0'
    # If you don't mind losing the fixed-column-width format,
    # omit the seven spaces
    line = line.replace('**********', '       0.0')
    fout.write(line)

fin.close()
fout.close()
Alternatively if your file is smallish, replace the loop by this:
fout.write(fin.read().replace('**********', '       0.0'))
If, after converting A to one huge string, you first changed all the bad values with a single call to A.replace('********', '0.0') and then split it, you'd have the same result, likely a lot faster. Something like:
f = open(fname,'r+b')
A = str(mmap.mmap(f.fileno(),0)[:]).replace('********', '0.0').split()
It would use a lot of memory, but that's often the trade-off for speed.
Instead of manipulating A, try using a list comprehension to make a new A:
A = [v if v != '********' else 0.0 for v in A]
I think you'll find this surprisingly fast.
