Python sort parallel arrays in place?

Is there an easy (meaning without rolling one's own sorting function) way to sort parallel lists without unnecessary copying in Python? For example:
foo = range(5)
bar = range(5, 0, -1)
parallelSort(bar, foo)
print foo # [4,3,2,1,0]
print bar # [1,2,3,4,5]
I've seen the examples using zip but it seems silly to copy all your data from parallel lists to a list of tuples and back again if this can be easily avoided.

Here's an easy way:
perm = sorted(xrange(len(foo)), key=lambda x:foo[x])
This generates a permutation of indices - perm[i] is the index of the ith smallest value in foo. Then, you can access both lists in order:
for p in perm:
    print "%s: %s" % (foo[p], bar[p])
You'd need to benchmark it to find out if it's any more efficient, though - I doubt it makes much of a difference.
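If you actually want the lists reordered rather than just traversed in sorted order, the permutation can be applied afterwards (a small sketch, not part of the original answer; note it still builds one temporary list per target, so it only avoids the intermediate tuples of the zip approach):
foo[:] = [foo[p] for p in perm]   # slice assignment reorders foo in place
bar[:] = [bar[p] for p in perm]   # perm is unchanged, so bar gets the matching order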

Is there an easy way? Yes. Use zip.
Is there an "easy way that doesn't use a zip variant"? No.
If you wanted to elaborate on why you object to using zip, that would be helpful. Either you're "copying" objects, in which case Python only copies references, or you're copying something so lightweight into a lightweight tuple that it's not worth optimizing.
If you really don't care about execution speed but are especially concerned for some reason about memory pressure, you could roll your own bubble sort (or your sort algorithm of choice) on your key list which swaps both the key list and the target lists elements when it does a swap. I would call this the opposite of easy, but it would certainly limit your working set.
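For what it's worth, such a hand-rolled parallel sort might look like this (a sketch only, using insertion sort as the "sort algorithm of choice"; it is O(n^2), but it swaps elements of both lists in place and allocates nothing extra):
def parallel_sort(keys, values):
    # in-place insertion sort on keys, mirroring every swap in values
    for i in range(1, len(keys)):
        j = i
        while j > 0 and keys[j] < keys[j - 1]:
            keys[j], keys[j - 1] = keys[j - 1], keys[j]
            values[j], values[j - 1] = values[j - 1], values[j]
            j -= 1

bar = [5, 4, 3, 2, 1]
foo = [0, 1, 2, 3, 4]
parallel_sort(bar, foo)
print(bar)  # [1, 2, 3, 4, 5]
print(foo)  # [4, 3, 2, 1, 0]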

To achieve this, you would have to implement your own sort.
However: Does the unnecessary copying really hurt your application? Often parts of Python strike me as inefficient, too, but they are efficient enough for what I need.

Any solution I can imagine short of writing a sort from scratch uses indices, or a dict, or something else that is not really going to save you memory. In any event, using zip will only increase memory usage by a constant factor, so it is worth making sure this is really a problem before engineering a solution.
If it does get to be a problem, there may be more effective solutions. Since the elements of foo and bar are so closely related, are you sure the right representation is not a list of tuples? If you are running out of memory, are you sure they should not be in a more compact data structure, such as a numpy array or a database (the latter of which is really good at this kind of manipulation)?
(Also, incidentally, itertools.izip can save you a little bit of memory over zip, though you still end up with the full zipped list in list form as the result of sorted.)
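For reference, this is roughly the zip idiom the question and these answers are discussing (a sketch, not code from the original posts; on Python 2 the inner zip could be itertools.izip to save one intermediate list):
bar = [5, 4, 3, 2, 1]
foo = [0, 1, 2, 3, 4]
pairs = sorted(zip(bar, foo))              # sorts by bar, carrying foo along
bar, foo = (list(t) for t in zip(*pairs))  # unzip back into two new lists
print(bar)  # [1, 2, 3, 4, 5]
print(foo)  # [4, 3, 2, 1, 0]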

Related

Best / most pythonic way to remove duplicates from a list and sort in reverse order

I'm trying to take a list (orig_list below), and return a list (new_list below) which:
does not contain duplicate items (i.e. contains only unique elements)
is sorted in reverse order
Here is what I have so far, which seems... I'm going to say "weird," though I'm sure there is a better way to say that. I'm mostly put off by using list() twice for what seems pretty straightforward, and then I'm wondering about the efficiency of this approach.
new_list = list(reversed(sorted(list(set(orig_list)))))
Question #1 (SO-style question):
Are the following propositions correct?
There is no more efficient way to get unique elements of a list than converting the list to a set and back.
Since sets are unordered in Python, one must (1) remove duplicates by converting to a set before sorting, because otherwise you'd lose the ordering anyway, and (2) convert back to a list before you can sort.
Using list(reversed(sorted(...))) is programmatically equivalent to using list.sort(reverse=True).
Question #2 (bonus):
Are there any ways to achieve the same result more efficiently (in fewer operations), or using a less verbose approach? If so, what is an / are some example(s)?
sorted(set(orig_list), reverse=True)
Shortest in code, more efficient, same result.
Depending on the size, it may or may not be faster to sort first then dedupe in linear time as user2864740 suggests in comments. (The biggest drawback to that approach is it would be entirely in Python, while the above line executes mostly in native code.)
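That sort-first-then-dedupe alternative would look something like this (a sketch, not user2864740's actual code):
new_list = []
for x in sorted(orig_list, reverse=True):
    # equal values are adjacent after sorting, so comparing against the last
    # kept element is enough to drop duplicates in a single linear pass
    if not new_list or x != new_list[-1]:
        new_list.append(x)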
Your questions:
You do not need to convert from set to list and back. sorted accepts any iterable, so set qualifies, and spits out a list, so no post-conversion needed.
reversed(sorted(x)) is not equivalent to sorted(x, reverse=True). You get the same result, but slower: the sort is the same speed whether forward or reverse, so reversed adds an extra operation that is not needed if you sort into the proper order from the start.
You've got a few mildly wasteful steps in here, but your proposition is largely correct. The only real improvements to be made are to get rid of all the unnecessary temporary lists:
new_list = sorted(set(orig_list), reverse=True)
sorted already converts its input to a list (so no need to listify before passing to sorted), and you can have it directly produce the output list sorted in reverse (so no need to produce a list only to make a copy of it in reverse).
The only conceivable improvement on big-O time is if you know the data is already sorted, in which case you can avoid O(n log n) sorting, and uniqify without losing the existing sorted order by using itertools.groupby:
new_list = [key for key, grp in itertools.groupby(orig_list)]
If orig_list is sorted in forward order, you can make the result of this reversed at essentially no cost by changing itertools.groupby(orig_list) to itertools.groupby(reversed(orig_list)).
The groupby solution isn't really practical for initially unsorted inputs, because if duplicates are even remotely common, removing them via uniquification as an O(n) step is almost always worth it, as it reduces the n in the more costly O(n log n) sorting step. groupby is also a relatively slow tool; the nature of the implementation, using a bunch of temporary iterators for each group, internal caching of values, etc., means that it's a slower O(n) in practice than the O(n) uniquification via set, with its primary advantage being the streaming aspect (making it scale to data sets streamed from disk or the network and back without storing anything for the long term, where set must pull everything into memory).
The other reason to use sorted+groupby would be if your data wasn't hashable, but was comparable; in that case, set isn't an option, so the only choice is sorting and grouping.
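For example (with made-up data), deduping and reverse-sorting a list of lists, which are comparable but not hashable, so set() would raise TypeError:
from itertools import groupby

orig_list = [[1, 2], [0, 5], [1, 2], [3]]
new_list = [key for key, grp in groupby(sorted(orig_list, reverse=True))]
print(new_list)  # [[3], [1, 2], [0, 5]]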

Pythonic pattern for building up parallel lists

I am new-ish to Python and I am finding that I am writing the same pattern of code over and over again:
def foo(list):
    results = []
    for n in list:
        # do some or a lot of processing on n and possibly other variables
        nprime = operation(n)
        results.append(nprime)
    return results
I am thinking in particular about the creation of the empty list followed by the append call. Is there a more Pythonic way to express this pattern? append might not have the best performance characteristics, but I am not sure how else I would approach it in Python.
I often know exactly the length of my output, so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up. I am writing a lot of text parsing code that isn't super performance sensitive on any particular loop or piece because all of the performance is really contained in gensim or NLTK code and is in much more capable hands than mine.
Is there a better/more pythonic pattern for doing this type of operation?
First, a list comprehension may be all you need (if all the processing mentioned in your comment occurs in operation).
def foo(list):
    return [operation(n) for n in list]
If a list comprehension will not work in your situation, consider whether foo really needs to build the list and could be a generator instead.
def foo(list):
    for n in list:
        # Processing...
        yield operation(n)
In this case, you can iterate over the sequence, and each value is calculated on demand:
for x in foo(myList):
    ...
or you can let the caller decide whether a full list is needed:
results = list(foo(myList))
If neither of the above is suitable, then building up the return list in the body of the loop as you are now is perfectly reasonable.
[..] so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up.
If you are worried about this, don't be. Python over-allocates whenever a list has to be resized (lists grow dynamically) so that appends are amortized O(1). Whether you call list.append manually or build the list with a list comprehension (which internally also uses .append), the effect, memory-wise, is similar.
The list comprehension just performs a bit better speed-wise; it is optimized for creating lists with specialized byte-code instructions (mainly LIST_APPEND, which calls the list's append directly in C).
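You can see the over-allocation directly with sys.getsizeof (an illustrative sketch; the exact byte counts depend on the Python version and platform):
import sys

lst = []
for i in range(20):
    lst.append(i)
    # the reported size jumps in steps rather than on every append,
    # because the underlying buffer grows ahead of need
    print("%d items -> %d bytes" % (len(lst), sys.getsizeof(lst)))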
Of course, if memory usage is a concern, you could always opt for the generator approach, as highlighted in chepner's answer, to lazily produce your results.
In the end, for loops are still great. They might seem clunky in comparison to comprehensions and maps but they still offer a recognizable and readable way to achieve a goal. for loops deserve our love too.

Efficiency of Python sorted() built-in function vs. list insert() method

I am not new to it, but I do not use Python much and my knowledge is rather broad and not very deep in the language; perhaps someone here more knowledgeable can answer my question. I find myself in the situation where I need to add items to a list and keep it sorted as items are added. A quick way of doing this would be:
list.append(item)  # O(1)
list.sort()        # ??
I would imagine that if this is the only way items are added to the list, the sort would be rather efficient, because the list is kept sorted with each addition. However, there is also this, which works:
inserted = False
for i in range(len(list)):  # O(N)
    if item < list[i]:
        list.insert(i, item)  # ??
        inserted = True
        break
if not inserted:
    list.append(item)
Can anyone tell me if one of these is obviously more efficient? I am leaning toward the second set of statements, however I really have no idea.
What you are looking for is the bisect module, most probably insort_left.
So your expression could be equivalently written as
from
some_list.append(item)  # O(1)
some_list.sort()        # ??
to
bisect.insort_left(some_list, item)
Insertion anywhere except near the end takes O(n) time, as it has to move (copy) all elements which come after the insertion point. On the other hand, all comparison-based sorting algorithms must, on average, make Omega(n log n) comparisons. Many sorts (including timsort, which Python uses) will do significantly better on many inputs, likely including yours (the "almost sorted" case). They still have to move at least as many elements as inserting in the right position right away would, though. They also have to do quite a bit of additional work (inspecting all elements to ensure they're in the right order, plus more complicated logic that often improves performance, but not in your case). For these reasons, sort-after-append is probably slower, at least for large lists.
Due to being written in C (in CPython; but similar reasoning applies for other Pythons), it may still be faster than your Python-written linear scan. That leaves the question of how to find the insertion point. Binary search can do this part in O(log n) time, so it's quite useful here (of course, insertion is still O(n), but there's no way around this if you want a sorted list). Unfortunately, binary search is rather tricky to implement. Fortunately, it's already implemented in the standard library: bisect.
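Putting that together, a minimal sketch (with made-up data) of keeping a list sorted as items arrive:
import bisect

sorted_list = []
for item in [5, 1, 4, 2, 3]:
    # the binary search for the insertion point is O(log n);
    # the insertion itself is still O(n)
    bisect.insort_left(sorted_list, item)
print(sorted_list)  # [1, 2, 3, 4, 5]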

Can you speed up a "for" loop in Python with sorting?

If I have a long unsorted list of 300k elements, will sorting this list first and then doing a "for" loop over it speed up the code? I need to do a "for" loop regardless; I can't use a list comprehension.
sortedL = sorted(myList)   # note: myList.sort() sorts in place and returns None
for i in sortedL:
    if i == somenumber:
        # do some work
How could I signal to Python that sortedL is sorted, so it doesn't have to read the whole list? Is there any benefit to sorting a list? If there is, how can I implement it?
It would appear that you're considering sorting the list so that you could then quickly look for somenumber.
Whether the sorting will be worth it depends on whether you are going to search once, or repeatedly:
If you're only searching once, sorting the list will not speed things up. Just iterate over the list looking for the element, and you're done.
If, on the other hand, you need to search for values repeatedly, by all means pre-sort the list. This will enable you to use bisect to quickly look up values.
The third option is to store elements in a dict. This might offer the fastest lookups, but will probably be less memory-efficient than using a list.
The cost of a for loop in python is not dependent on whether the input data is sorted.
That being said, you might be able to break out of the for loop early, or save computation in other ways at the algorithm level, if you sort first.
If you want to search within a sorted list, you need an algorithm that takes advantage of the sorting.
One possibility is the built-in bisect module. This is a bit of a pain to use, but there's a recipe in the documentation for building simple sorted-list functions on top of it.
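The relevant part of that recipe looks essentially like this (adapted from the bisect module documentation):
from bisect import bisect_left

def index(a, x):
    # locate the leftmost value exactly equal to x in the sorted list a
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError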
With that recipe, you can just write this:
i = index(sortedL, somenumber)
Of course if you're just sorting for the purposes of speeding up a single search, this is a bit silly. Sorting will take O(N log N) time, then searching will take O(log N), for a total time of O(N log N); just doing a linear search will take O(N) time. So, unless you're typically doing log N searches on the same list, this isn't worth doing.
If you don't actually need sorting, just fast lookups, you can use a set instead of a list. This gives you O(1) lookup for all but pathological cases.
Also, if you want to keep a list sorted while continuing to add/remove/etc., consider using something like blist.sortedlist instead of a plain list.

Python dict.setdefault uses more memory? [closed]

I was writing some Python code that involved something like this
values = {}
for element in iterable:
    values.setdefault(element.name, []).append(element)
Because I could have sorted the input previously, I also implemented it like this
values = {}
cur_name = None
cur_list = None
for element in iterable:
    if element.name != cur_name:
        values[cur_name] = cur_list
        cur_name = element.name
        cur_list = []
    cur_list.append(element)
if cur_list:
    values[cur_name] = cur_list
del values[None]
Here the input is already sorted by element.name.
The second approach was much faster than the first approach, and it also used less memory.
What's the reason for this?
Or have I made some sort of mistake in the second approach?
Your original code creates a new list every time round the loop, and most of those lists just get thrown away. It also makes multiple dictionary lookups: looking up the method setdefault is a dictionary lookup, and then the method itself does a dictionary lookup to see whether the key was set and, if it wasn't, does another to store the value. (.name and .append() are also dictionary lookups, but they are still present in the revised code.)
for element in iterable:
    values.setdefault(element.name, []).append(element)
The revised code only looks up the dictionary when the name changes, so it removes two dictionary lookups and a method call from every iteration. That's why it's faster.
As for the memory use, when the list grows it may sometimes have to copy the data but can avoid that if the memory block can just be expanded. My guess would be that creating all of those unused temporary lists may be fragmenting the memory more and forcing more copies. In other words Python isn't actually using more memory, but it may have more allocated but unused memory.
When you feel a need for setdefault consider using collections.defaultdict instead. That avoids creating the list except when it's needed:
from collections import defaultdict

values = defaultdict(list)
for element in iterable:
    values[element.name].append(element)
That will probably still be slower than your second code because it doesn't take advantage of your knowledge that names are all grouped, but for the general case it is better than setdefault.
Another way would be to use itertools.groupby. Something like this:
from itertools import groupby
from operator import attrgetter

values = {name: list(elements) for name, elements in
          groupby(iterable, attrgetter('name'))}
That takes advantage of the ordering and simplifies everything down to a single dictionary comprehension.
I can think of a couple of reasons why the second approach is faster.
values.setdefault(element.name, []).append(element)
Here you're creating an empty list for each element, even if you're never going to use it. You're also calling the setdefault method for every element, and that amounts to one hash table lookup and a possible hash table write, plus the cost of calling the method itself, which is not insignificant in python. Finally, as the others have pointed out after I posted this answer, you're looking up the setdefault attribute once for each element, even though it always references the same method.
In the second example you avoid all of these inefficiencies. You're only creating as many lists as you need, and you do it all without calling any method but the required list.append, interspersed with a smattering of dictionary assignments. You're also in effect replacing the hash table lookup with a simple comparison (element.name != cur_name), and that's another improvement.
I expect you also get cache benefits, since you're not jumping all over the place when adding items to lists (which would cause lots of cache misses), but work on one list at a time. This way, the relevant memory is probably in a cache layer very near the CPU so that the process is faster. This effect should not be underestimated -- getting data from RAM is two orders of magnitude (or ~100 times) slower than reading it from L1 cache (source).
Of course the sorting adds a little time, but python has one of the best and most optimized sorting algorithms in the world, all coded in C, so it doesn't outweigh the benefits listed above.
I don't know why the second solution is more memory efficient, though. As Jiri points out, it might be the unnecessary lists, but my understanding is that these should be collected immediately by the garbage collector, so they should only increase the memory usage by a tiny amount -- the size of a single empty list. Maybe it's because the garbage collector is lazier than I thought.
Your first version has two inefficient parts:
calling and dereferencing values.setdefault in a loop. You can assign values_setdefault = values.setdefault before the loop, and it can speed things up a bit (see the sketch below).
as the other answer suggested, creating a new empty list for every element in your list is slow and memory inefficient. I do not know how to avoid that while still using setdefault.
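A sketch of that method-hoisting trick, reusing the iterable and element names from the question:
values = {}
values_setdefault = values.setdefault        # resolve the attribute once, outside the loop
for element in iterable:
    values_setdefault(element.name, []).append(element)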
