I'm currently working with some very large lists (50 to 100 million entries) of information where each item in the list is in the form of [float,(string_1,string_2)]
I'm adding elements to the list in an unsorted manner, and would eventually like to have a list that is sorted by the float value. For example I would have a list that looks like this:
[ [0.5,(A,B)], [-0.15,(B,C)], [0.3,(A,C)], [-0.8,(A,D)] ]
and then sort it to get
[ [0.5,(A,B)], [0.3,(A,C)], [-0.15,(B,C)], [-0.8,(A,D)] ]
Currently I'm using heapq to add items as I go along and then using sorted(heap) to ultimately give me the list I need. My question is: is there a better way to go about adding millions of items to a list and sorting them that won't crash my computer? Holding a list that long and then sorting it is causing some issues with my RAM.
sorted() creates an entirely distinct list, so it doubles the RAM needed for the massive list. Use the list's .sort() method instead - that sorts the list in-place.
And unless there's something you haven't told us, leave heapq out of it entirely. Putting the entries in a heap serves no purpose I can think of. Just use the list's .append() method to add new entries, and apply .sort(reverse=True) to the list at the end.
If you're still running out of RAM, then you simply can't solve this problem entirely in memory, and will need to craft an approach merging disk files.
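A minimal sketch of that in-memory approach (here generate_items() is just a hypothetical placeholder for however you produce the entries):

data = []
for score, pair in generate_items():   # hypothetical source yielding e.g. (0.5, ("A", "B"))
    data.append([score, pair])         # plain append; no heap involved
data.sort(reverse=True)                # in-place, ordered by the float first (descending)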
LIVING WITH "TOO SMALL" RAM
In the worst case, even the list all by itself can't fit in available memory. You can still create the sorted sequence, but it requires writing sorted chunks to disk and merging them later. For the merging part, heapq is useful. Here's an example:
import pickle
import heapq

MAXPERFILE = 100 # the array will never get bigger than this

def getfname(i):
    return "pickled%d.dat" % i

filenum = 0

def dumptofile(a): # dump the array to file, as pickled data
    global filenum
    fname = getfname(filenum)
    with open(fname, "wb") as f:
        pickle.dump(len(a), f)
        for x in a:
            pickle.dump(x, f)
    filenum += 1

# generate some random data
import random

a = []
for _ in range(1012): # 10 "full" files with some leftovers
    a.append(random.random())
    if len(a) == MAXPERFILE:
        a.sort(reverse=True)
        dumptofile(a)
        del a[:]
if a:
    a.sort(reverse=True)
    dumptofile(a)
print("number of files written:", filenum)

# now merge the files together; first a function
# to generate the file contents, one at a time
def feedfile(i):
    fname = getfname(i)
    with open(fname, "rb") as f:
        count = pickle.load(f)
        for _ in range(count):
            yield pickle.load(f)

for x in heapq.merge(*(feedfile(i) for i in range(filenum)),
                     reverse=True):
    print(x)
Max memory use can be made smaller by decreasing MAXPERFILE, although performance will be better the larger MAXPERFILE is. Indeed, if MAXPERFILE is small enough and the total amount of data is large enough, the merging code may die with an OS "too many open files" error.
Sorry, this is likely a complete noob question; I'm new to Python and haven't been able to get any of the online suggestions to actually work. I need to decrease the run-time of the code for larger files, so I need to reduce the number of iterations I'm doing.
How do I modify the append_value function below to append only UNIQUE values to dict_obj, and remove the need for another series of iterations to do this later on.
EDIT: Sorry, here is an example input/output
Sample Input:
6
5 6
0 1
1 4
5 4
1 2
4 0
Sample Output:
1
4
I'm attempting to solve:
http://orac.amt.edu.au/cgi-bin/train/problem.pl?problemid=416
Output Result
input_file = open("listin.txt", "r")
output_file = open("listout.txt", "w")

ls = []
n = int(input_file.readline())
for i in range(n):
    a, b = input_file.readline().split()
    ls.append(int(a))
    ls.append(int(b))

def append_value(dict_obj, key, value):  # How to append only UNIQUE values to
    if key in dict_obj:                  # dict_obj?
        if not isinstance(dict_obj[key], list):
            dict_obj[key] = [dict_obj[key]]
        dict_obj[key].append(value)
    else:
        dict_obj[key] = value

mx = []
ls.sort()
Dict = {}
for i in range(len(ls)):
    c = ls.count(ls[i])
    append_value(Dict, int(c), ls[i])
    mx.append(c)

x = max(mx)
lss = []
list_set = set(Dict[x])  # To remove the need for this
unique_list = (list(list_set))
for x in unique_list:
    lss.append(x)
lsss = sorted(lss)
for i in lsss:
    output_file.write(str(i) + "\n")
output_file.close()
input_file.close()
Thank you
The answer to your question, 'how to only append unique values to this container' is fairly simple: change it from a list to a set (as #ShadowRanger suggested in the comments). This isn't really a question about dictionaries, though; you're not appending values to 'dict_obj', only to a list stored in the dictionary.
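For illustration, a minimal sketch of that change (assuming the rest of the code is adjusted to treat each dictionary value as a set rather than a list):

def append_value(dict_obj, key, value):
    # store a set per key; adding an already-present value is a harmless no-op
    if key in dict_obj:
        dict_obj[key].add(value)
    else:
        dict_obj[key] = {value}

With that, the later set(Dict[x]) de-duplication pass is no longer needed.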
Since the source you linked to shows this is a training problem for people newer to coding, you should know that changing the lists to sets might be a good idea, but it's not the cause of the performance issues.
The problem boils down to: given a file containing a list of integers, print the most common integer(s). Your current code iterates over the list, and for each index i, iterates over the entire list to count matches with ls[i] (this is the line c = ls.count(ls[i])).
Some operations are more expensive than others: calling count() is one of the more expensive operations on a Python list, since it reads through the entire list every time it's called. That makes it an O(n) function sitting inside a length-n loop, for O(n^2) time overall. All of the set() filtering for non-unique elements takes only O(n) time total (and is quite fast in practice). Spotting linear-time functions hidden inside loops like this is a frequent theme in optimization, and profiling your code would have pointed straight at this line.
In general, you'll want to use something like the Counter class in Python's standard library for frequency counting. That kind of defeats the whole point of this training problem, though, which is to encourage you to improve on the brute-force algorithm for finding the most frequent element(s) in a list. One possible way to solve this problem is to read the description of Counter, and try to mimic its behavior yourself with a plain Python dictionary.
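One bare-bones sketch of that idea, reusing the question's ls list (variable names here are just illustrative):

counts = {}
for value in ls:                       # single pass; no count() call inside the loop
    counts[value] = counts.get(value, 0) + 1
max_count = max(counts.values())
winners = sorted(v for v, c in counts.items() if c == max_count)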
Answering the question you haven't asked: Your whole approach is overkill.
You don't need to worry about uniqueness; the question prompt guarantees that if you see 2 5, you'll never see 5 2, nor a repeat of 2 5
You don't even care who is friends with who, you just care how many friends an individual has
So don't even bother making the pairs. Just count how many times each player ID appears at all. If you see 2 5, that means 2 has one more friend, and 5 has one more friend, it doesn't matter who they are friends with.
The entire problem can simplify down to a simple exercise in separating the player IDs and counting them all up (because each appearance means one more unique friend), then keeping only the ones with the highest counts.
A fairly idiomatic solution (reading from stdin and writing to stdout; tweaking it to open files is left as an exercise) would be something like:
import sys

from collections import Counter
from itertools import chain, islice

def main():
    numlines = int(next(sys.stdin))
    friend_pairs = map(str.split, islice(sys.stdin, numlines))  # Convert lines to friendship pairs
    counts = Counter(chain.from_iterable(friend_pairs))  # Flatten to friend mentions and count mentions to get friend count
    max_count = max(counts.values())  # Identify maximum friend count
    winners = [pid for pid, cnt in counts.items() if cnt == max_count]
    winners.sort(key=int)  # Sort winners numerically
    print(*winners, sep="\n")

if __name__ == '__main__':
    main()
Technically, it doesn't even require the use of islice nor storing to numlines (the line count at the beginning might be useful to low level languages to preallocate an array for results, but for Python, you can just read line by line until you run out), so the first two lines of main could simplify to:
next(sys.stdin)
friend_pairs = map(str.split, sys.stdin)
But either way, you don't need to uniquify friendships, nor preserve any knowledge of who is friends with whom to figure out who has the most friends, so save yourself some trouble and skip the unnecessary work.
If your intention is to have a list in each value of the dictionary, why not iterate over it the same way you iterated on each key?
if key in dict_obj:
    for elem in dict_obj[key]:           # assuming the value stored under key is a list
        if elem == value:
            break                        # value already present, so do nothing
    else:
        dict_obj[key].append(value)      # append the value to the desired list
else:
    dict_obj[key] = [value]
So I was wondering how to best create a list of blank lists:
[[],[],[]...]
Because of how Python works with lists in memory, this doesn't work:
[[]]*n
This does create [[],[],...] but each element is the same list:
d = [[]]*n
d[0].append(1)
#[[1],[1],...]
Something like a list comprehension works:
d = [[] for x in xrange(0,n)]
But this uses the Python VM for looping. Is there any way to use an implied loop (taking advantage of it being written in C)?
d = []
map(lambda n: d.append([]),xrange(0,10))
This is actually slower. :(
The probably only way which is marginally faster than
d = [[] for x in xrange(n)]
is
from itertools import repeat
d = [[] for i in repeat(None, n)]
It does not have to create a new int object in every iteration and is about 15 % faster on my machine.
Edit: Using NumPy, you can avoid the Python loop using
d = numpy.empty((n, 0)).tolist()
but this is actually 2.5 times slower than the list comprehension.
List comprehensions actually are implemented more efficiently than explicit looping (see the dis output for example functions), and the map way has to invoke an opaque callable object on every iteration, which incurs considerable overhead.
Regardless, [[] for _dummy in xrange(n)] is the right way to do it and none of the tiny (if existent at all) speed differences between various other ways should matter. Unless of course you spend most of your time doing this - but in that case, you should work on your algorithms instead. How often do you create these lists?
Here are two methods, one sweet and simple (and conceptual), the other more formal and extensible to a variety of situations, for example after having read in a dataset.
Method 1: Conceptual
X2=[]
X1=[1,2,3]
X2.append(X1)
X3=[4,5,6]
X2.append(X3)
X2 thus has [[1,2,3],[4,5,6]], i.e. a list of lists.
Method 2 : Formal and extensible
Another way is to store a list of lists of numbers read from a file. (The file here holds the dataset train.)
train is a dataset with, say, 50 rows and 20 columns; i.e. train[0] gives me the 1st row of a csv file, train[1] gives me the 2nd row, and so on. I want to turn the 50-row dataset into a list of lists, excluding column 0, which is my explained variable and so must be removed from the original train dataset. Here's the code that does that.
Note that I read from "1" in the inner loop since I am interested in the explanatory variables only. And I re-initialize X1=[] in the outer loop, otherwise X2.append(X1[0:(len(train[0])-1)]) would keep appending slices of the same ever-growing X1 - besides, it is more memory efficient.
X2 = []
for j in range(0, len(train)):
    X1 = []
    for k in range(1, len(train[0])):
        txt2 = train[j][k]
        X1.append(txt2)
    X2.append(X1[0:(len(train[0])-1)])
To create a list or a list of lists, use the syntax below:
x = [[] for i in range(10)]
This creates a 1-d list (of empty lists); to initialize each inner list with a number, put that number inside the inner brackets ([[number] ...]), and to set the length of the outer list, put the length in range(length).
To create a list of lists, use the syntax below:
x = [[[0] for i in range(3)] for i in range(10)]
This initializes a list of lists with dimensions 10*3, filled with the value 0.
To access/manipulate an element:
x[1][5]=value
So I did some speed comparisons to get the fastest way.
List comprehensions are indeed very fast. The only way to get close is to avoid bytecode being executed during construction of the list.
My first attempt was the following method, which would appear to be faster in principle:
l = [[]]
for _ in range(n): l.extend(map(list,l))
(produces a list of length 2**n, of course)
This construction is twice as slow as the list comprehension, according to timeit, for both short and long (a million) lists.
My second attempt was to use starmap to call the list constructor for me. This construction appears to run the list constructor at top speed, but is still slower, though only by a tiny amount:
from itertools import starmap
l = list(starmap(list,[()]*(1<<n)))
Interestingly enough, the execution time suggests that it is the final list call that makes the starmap solution slow, since its execution time is almost exactly equal to the speed of:
l = list([] for _ in range(1<<n))
My third attempt came when I realized that list(()) also produces a list, so I tried the apparently simple:
l = list(map(list, [()]*(1<<n)))
but this was slower than the starmap call.
Conclusion: for the speed maniacs:
Do use the list comprehension.
Only call functions, if you have to.
Use builtins.
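For reference, a minimal timeit harness along these lines (n and the repeat count are arbitrary) is one way to run such comparisons yourself:

from itertools import starmap
from timeit import timeit

n = 16  # list length is 2**n

print(timeit(lambda: [[] for _ in range(1 << n)], number=100))            # list comprehension
print(timeit(lambda: list(starmap(list, [()] * (1 << n))), number=100))   # starmap construction
print(timeit(lambda: list(map(list, [()] * (1 << n))), number=100))       # plain map construction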
I would like to filter records from a big file (a list of lists, 10M+ lines) based on given ids.
selected_id = list()  # 70k+ elements

for line in in_fp:  # input file: 10M+ lines
    id = line.split()[0]  # id (str type), such as '10000872820081804'
    if id in selected_id:
        out_fp.write(line)
The above code is time consuming. An idea comes to mind: store selected_id as a dict instead of a list.
Any better solutions?
You've got a few issues, though only the first is really nasty:
(By far the biggest cost in all likelihood) Checking for membership in a list is O(n); for a 70K element list, that's a lot of work. Make it a set/frozenset and lookup is typically O(1), saving thousands of comparisons. If the types are unhashable, you can pre-sort the selected_id list and use the bisect module to do lookups in O(log n) time, which would still get a multiple order of magnitude speedup for such a large list (see the sketch at the end of this answer).
If your lines are large, with several runs of whitespace, splitting at all points wastes time; you can specify maxsplit to only split enough to get the ID
If the IDs are always integer values it may be worth the time to make selected_id store int instead of str and convert on read so the lookup comparisons run a little faster (this would require testing). This probably won't make a major difference, so I'll omit it from the example.
Combining all suggestions:
selected_id = frozenset(... Your original list of 70k+ str elements ...)
for line in in_fp:  # input file: 10M+ lines
    id, _ = line.split(None, 1)  # id (str type), such as '10000872820081804'
    if id in selected_id:
        out_fp.write(line)
You could even convert the for loop to a single call with a generator expression (though it gets a little overly compact) which pushes more work to the C layer in CPython, reducing Python byte code execution overhead:
out_fp.writelines(x for x in in_fp if x.split(None, 1)[0] in selected_id)
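And if the elements really were unhashable, the bisect-based fallback mentioned in the first bullet might look roughly like this (a sketch; selected_id_list stands in for your original 70k+ element list):

import bisect

sorted_ids = sorted(selected_id_list)   # pre-sort once

def contains(sorted_seq, value):
    # O(log n) membership test via binary search
    i = bisect.bisect_left(sorted_seq, value)
    return i < len(sorted_seq) and sorted_seq[i] == value

for line in in_fp:
    if contains(sorted_ids, line.split(None, 1)[0]):
        out_fp.write(line)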
First off, in order to get the first column from your lines, you can read your file using the csv module with a proper delimiter, then use the zip() function (in Python 3; in Python 2, itertools.izip()) and the next() function to get the first column, then pass the result to set() to keep only the unique values.
import csv

with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = set(next(zip(*spam_reader)))
If you want to preserve the order you can use collections.OrderedDict():
import csv
from collections import OrderedDict

with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = OrderedDict.fromkeys(next(zip(*spam_reader)))
I'm trying to learn Python and am working on an external merge sort using an input file with ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)?
import itertools
from itertools import islice
import tempfile
import heapq

#converts heapq.merge to ints
#def merge(*temp_files):
#    return heapq.merge(*[itertools.imap(int, s) for s in temp_files])

with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
            break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))
Your elements are read as strings by default; you have to do something like:
elements = list(islice(f, 1000))
elements = [int(elem) for elem in elements]
so that they would be interpreted as integers instead.
That would also mean that you need to convert them back to strings when writing, e.g.:
temp_file.writelines([str(elem) for elem in elements])
Apart from that, you would need to convert your elements again to int for the final merging. In your case, you probably want to uncomment your merge method (and then convert the result back to strings again, same way as above).
Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files sorting numerically:
import heapq

files = open('a.txt'), open('b.txt')

with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))
First the map(int, ...) turns each file's lines into ints. Then those get merged with heapq.merge. Then map('{}\n'.format, ...) turns each of the integers back into a string, with a newline. Then writelines writes those lines. In other words, you were already close; you just had to convert the ints back to strings before writing them.
A different way to write it (might be clearer for some):
import heapq

files = open('a.txt'), open('b.txt')

with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)
And in any case, do use itertools.imap if you're using Python 2 as otherwise it'll read the whole files into memory at once. In Python 3, you can just use the normal map.
And yes, if you do it right, this will allow you to sort gigantic files with very little memory.
You are doing a K-way merge within the loop, which adds a lot of runtime complexity. Better to store the file handles in a separate list and perform a single K-way merge at the end.
You also don't have to remove the newline and add it back; just sort based on the number:
sorted(temp_files, key=lambda no: int(no.strip()))
The rest is fine.
https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md
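A rough sketch of that structure (keeping each sorted chunk in its own temp file and merging once at the end; the file names and the 1000-line chunk size are just illustrative):

import heapq
import tempfile
from itertools import islice

def numeric(line):
    return int(line.strip())   # sort/merge key: compare lines as numbers

chunk_files = []
with open("input.txt") as f:   # "input.txt"/"output.txt" are placeholder names
    while True:
        # normalize so every line ends with a newline, then sort the chunk numerically
        chunk = [line if line.endswith("\n") else line + "\n" for line in islice(f, 1000)]
        if not chunk:
            break
        chunk.sort(key=numeric)
        tf = tempfile.TemporaryFile(mode="w+")
        tf.writelines(chunk)
        tf.seek(0)
        chunk_files.append(tf)   # keep the handle; merge once, after the loop

with open("output.txt", "w") as out:
    out.writelines(heapq.merge(*chunk_files, key=numeric))   # single K-way merge (Python 3.5+)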
I'm writing a python script that does various permutations of characters. Eventually, the script will crash with out of memory error depending on how much depth I want to go for the permutation.
I had initially thought the solution would be to empty out the list and start over, but doing it this way I get an index out of bounds error.
This is my current set up:
for j in range(0, csetlen):
    getJ = None
    for i in range(0, char_set_len):
        getJ = word_list[j] + char_set[i]
        word_list.append(getJ)
    csetlen = csetlen - j
    del word_list[j-1:]
    word_list.append(getJ)
    j = 0
Basically, csetlen can be a very large number (excess of 100,000,000). Of course I do not have enough RAM for this; so I'm trying to find out how to shrink the list in the outer for loop. How does one do this gracefully?
The memory error has to do with word_list. Currently, I am storing millions of different permutations; I need to be able to "recycle" some of the old list values. How does one do this to a python list?
What you want is an iterator that generates the values on demand (and doesn't store them in memory):
from itertools import product
getJ_iterator = product(wordlist[:csetlen], char_set[:char_set_len])
This is equivalent to the following generator function:
def getJ_gen(first_list, second_list):
    for i in first_list:
        for j in second_list:
            yield (i, j)

getJ_iterator = getJ_gen(wordlist[:csetlen], char_set[:char_set_len])
You would iterate over the object like so:
for item in getJ_iterator:
    pass  # do stuff with item here
Note that item in this case would be a tuple of the form (word, char).
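For example, assuming the goal is still to build the concatenated strings from the original loop, a small usage sketch (with made-up inputs) would be:

from itertools import product

word_list = ["ab", "cd"]            # illustrative inputs
char_set = ["x", "y", "z"]

for word, char in product(word_list, char_set):
    candidate = word + char         # same concatenation as word_list[j] + char_set[i]
    print(candidate)                # ...or whatever per-item processing is needed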