Nested list comprehension against .NET Arrays within Arrays


I have a .NET structure that has arrays with arrays in it. I want to create a list of members of items from a specific array in a specific array using list comprehension in IronPython, if possible.
Here is what I am doing now:
tag_results = [item_result for item_result in results.ItemResults if item_result.ItemId == tag_id][0]
tag_vqts = [vqt for vqt in tag_results.VQTs]
tag_timestamps = [vqt.TimeStamp for vqt in tag_vqts]
So, get the single item result from the results array which matches my condition, then get the vqts arrays from those item results, THEN get all the timestamp members for each VQT in the vqts array.
Is wanting to do this in a single statement overkill? Later on, the timestamps are used in this manner:
vqts_to_write = [vqt for vqt in resampled_vqts if vqt.TimeStamp not in tag_timestamps]
I am not sure if a generator would be appropriate, since I am not really looping through them, I just want a list of all the timestamps for all the item results for this item/tag so that I can test membership in the list.
I have to do this multiple times for different contexts in my script, so I was just wondering if I am doing this in an efficient and pythonic manner. I am refactoring this into a method, which got me thinking about making it easier.
FYI, this is IronPython 2.6, embedded in a fixed environment that does not allow the use of numpy, pandas, etc. It is safe to assume I need a python 2.6 only solution.
My main question is:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Two! My two main questions are:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Is a generator appropriate for testing membership in a list?
Three! My three questions are... Amongst my questions are such diverse queries as...I'll come in again...
(it IS python...)

tag_results = [...][0] builds a whole new list just to get one item. This is what next() on a generator expression is for:
next(item_result for item_result in results.ItemResults if item_result.ItemId == tag_id)
which only iterates just enough to get a first item.
You can inline that, but I'd keep that as a separate expression for readability.
The remainder is easily put into one expression:
tag_results = next(item_result for item_result in results.ItemResults
                   if item_result.ItemId == tag_id)
tag_timestamps = [vqt.TimeStamp for vqt in tag_results.VQTs]
I'd make that a set if you only need to do membership testing:
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)
Sets allow for constant time membership tests; testing against a list takes linear time as the whole list could end up being scanned for each such test.
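Since you mention refactoring this into a method, here is a minimal sketch of how that could look. The namedtuple classes are hypothetical stand-ins for the .NET objects, and timestamps_for_tag is a made-up name; only ItemResults, ItemId, VQTs and TimeStamp come from the question. It sticks to features available in Python 2.6.

```python
from collections import namedtuple

# Hypothetical stand-ins for the .NET types; the real objects come from the host.
VQT = namedtuple("VQT", ["Value", "TimeStamp"])
ItemResult = namedtuple("ItemResult", ["ItemId", "VQTs"])
Results = namedtuple("Results", ["ItemResults"])


def timestamps_for_tag(results, tag_id):
    """Return the set of timestamps of the first item result matching tag_id."""
    tag_results = next(item_result for item_result in results.ItemResults
                       if item_result.ItemId == tag_id)
    return set(vqt.TimeStamp for vqt in tag_results.VQTs)


# Toy data, made up for illustration.
results = Results([
    ItemResult("tag_a", [VQT(1.0, 100), VQT(2.0, 200)]),
    ItemResult("tag_b", [VQT(3.0, 300)]),
])
resampled_vqts = [VQT(1.5, 100), VQT(2.5, 250)]

tag_timestamps = timestamps_for_tag(results, "tag_a")
vqts_to_write = [vqt for vqt in resampled_vqts
                 if vqt.TimeStamp not in tag_timestamps]
```

If no item result matches, next() raises StopIteration; passing a default as a second argument, as in next(..., None), is one way to handle that case explicitly.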

Related

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws up an error TypeError: set() missing 1 required positional argument: 'items' (This is from a website for Python exercises with answers key, so it is supposed to be correct.)

Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).

Looking up an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, which is not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) per lookup). A Python set is a hash set, so a lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average, which is most likely an order of complexity better.
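To make the auxiliary-set idea concrete, here is a small sketch. The function name unique_in_order and the variable names are my own; the input is the list from the question.

```python
def unique_in_order(values):
    """Return the distinct values in the order they first appear."""
    seen = set()     # used only for fast membership tests
    unique = []      # keeps the first-seen order
    for value in values:
        if value not in seen:
            unique.append(value)
            seen.add(value)
    return unique


result = unique_in_order([10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40])
```

The list is still needed because a set does not promise to keep insertion order; the set is there purely so that each "have I seen this?" check does not have to scan the list.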

For membership tests, sets are usually faster than lists. One of these cases is when you look for an item using the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I think your code is also inefficient in that, for each item in the list, you search a larger list, while the second snippet looks up the item in a smaller set. That is not always the case, though: if the list contains only unique items, the two collections are the same size.
Hope it clarifies.

Finding intersections of huge sets with huge dicts

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).

There's no need to do a full intersection, you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
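A quick sanity check on toy data that both forms give the same total. The names kmer_counts and block_kmers and their contents are made up for illustration.

```python
# Toy stand-ins: a small dict of counts and one block's set of kmers.
kmer_counts = {"AAC": 3, "ACG": 5, "CGT": 2}
block_kmers = set(["AAC", "CGT", "TTT"])

# Original approach: intersect first, then index into the dict.
via_intersection = sum([kmer_counts[x]
                        for x in block_kmers.intersection(kmer_counts)])

# Suggested approach: one dict lookup per kmer in the block, 0 if missing.
via_get = sum(kmer_counts.get(x, 0) for x in block_kmers)
```

With get(), the work per block depends only on the size of the block's set, not on the size of the big dict.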

Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.

Efficient use of Python list comprehensions

I have a Python list of objects that could be pretty long. At particular times, I'm interested in all of the elements in the list that have a certain attribute, say flag, that evaluates to False. To do so, I've been using a list comprehension, like this:
objList = list()
# ... populate list
[x for x in objList if not x.flag]
Which seems to work well. After forming the sublist, I have a few different operations that I might need to do:
Subscript the sublist to get the element at index ind.
Calculate the length of the sublist (i.e. the number of elements that have flag == False).
Search the sublist for the first instance of a particular object (i.e. using the list's .index() method).
I've implemented these using the naive approach of just forming the sublist and then using its methods to get at the data I want. I'm wondering if there are more efficient ways to go about these. #1 and #3 at least seem like they could be optimized, because in #1 I only need the first ind + 1 matching elements of the sublist, not necessarily the entire result set, and in #3 I only need to search through the sublist until I find a matching element.
Is there a good Pythonic way to do this? I'm guessing I might be able to use the () syntax in some way to get a generator instead of creating the entire list, but I haven't happened upon the right way yet. I obviously could write loops manually, but I'm looking for something as elegant as the comprehension-based method.

If you need to do any of these operations a couple of times, the overhead of the other methods will be higher, so the list is the best way. It's also probably the clearest, so if memory isn't a problem, then I'd recommend just going with it.
If memory/speed is a problem, then there are alternatives - note that speed-wise, these might actually be slower, depending on the common case for your software.
For your scenarios:
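from itertools import dropwhile, islice

def nth(iterable, n, default=None):
    # nth() recipe from the itertools docs
    return next(islice(iterable, n, None), default)

# value = sublist[n]
value = nth((x for x in objList if not x.flag), n)
# value = len(sublist)
value = sum(not x.flag for x in objList)
# value = sublist.index(target); raises StopIteration if target is absent
value = next(dropwhile(lambda pair: pair[1] != target,
                       enumerate(x for x in objList if not x.flag)))[0]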
Using itertools.dropwhile() and the nth() recipe from the itertools docs.

I'm going to assume you might do any of these three things, and you might do them more than once.
In that case, what you want is basically to write a lazily evaluated list class. It would keep two pieces of data, a real list cache of evaluated items, and a generator of the rest. You could then do ll[10] and it would evaluate up to the 10th item, ll.index('spam') and it would evaluate until it finds 'spam', and then len(ll) and it would evaluate the rest of the list, all the while caching in the real list what it sees so nothing is done more than once.
Constructing it would look like this:
LazyList(x for x in obj_list if not x.flag)
But nothing would actually be computed until you actually start using it as above.
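A rough sketch of what such a class could look like. This is only one possible implementation under my own assumptions: it supports non-negative integer indexes, index() and len(), and nothing else (no slicing, no negative indexes).

```python
class LazyList(object):
    """Sketch of a list that pulls items from an iterable only when needed."""

    def __init__(self, iterable):
        self._cache = []                # items evaluated so far
        self._source = iter(iterable)   # items not yet evaluated

    def _fill_to(self, index):
        # Evaluate until the cache holds position `index` or the source runs out.
        while len(self._cache) <= index:
            try:
                self._cache.append(next(self._source))
            except StopIteration:
                break

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not supported in this sketch")
        self._fill_to(index)
        return self._cache[index]       # IndexError if the source was too short

    def index(self, target):
        # Search what is already cached, then keep evaluating until a match.
        try:
            return self._cache.index(target)
        except ValueError:
            pass
        for item in self._source:
            self._cache.append(item)
            if item == target:
                return len(self._cache) - 1
        raise ValueError("%r is not in list" % (target,))

    def __len__(self):
        # The length is only known once everything has been evaluated.
        self._cache.extend(self._source)
        return len(self._cache)


# Usage: record which source items actually get evaluated.
evaluated = []

def numbers():
    for n in range(10):
        evaluated.append(n)
        yield n

ll = LazyList(n for n in numbers() if n % 2 == 0)
third = ll[2]                           # pulls source items only up to 4
evaluated_after_index = list(evaluated)
position_of_six = ll.index(6)           # pulls two more source items
total = len(ll)                         # evaluates the rest
```

Each source item is evaluated at most once, because everything that has been pulled is kept in the cache.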

Since you commented that your objList can change, if you don't also need to index or search objList itself, then you might be better off just storing two different lists, one with .flag = True and one with .flag = False. Then you can use the second list directly instead of constructing it with a list comprehension each time.
If this works in your situation, it is likely the most efficient way to do it.

Does it pay off to use a generator as input to sorted() instead of a list-comprehension [duplicate]

We all know using generators instead of instantiating lists all the time saves time and memory, especially if we use comprehensions a lot.
Here's a question though, consider the following code:
output = SomeExpensiveCallEgDatabase()
results = [result[0] for result in output]
return sorted(results)
The call to sorted will return a sorted list of the results. Would it be better or worse to declare results as below and then call sorted?
results = (result[0] for result in output)
My guess is the call to sorted() would traverse the generator and instantiate a list itself in order to run quicksort or mergesort on it. So there would be no advantage in using the generator here. Is this assumption correct?

I believe your assumption to be true, since there is no easy way of ordering the collection without first having the whole list in memory (at least certainly not with the default sorting algorithm, TimSort if I'm not mistaken).
Check this out:
sorted() using Generator Expressions Rather Than Lists
To create the new List, the builtin sorted method uses PySequence_List:
PyObject* PySequence_List(PyObject *o) Return value: New reference.
Return a list object with the same contents as the arbitrary sequence
o. The returned list is guaranteed to be new.
Pros and cons of both approaches:
Memory-wise:
The returned list is the one used for the sorted version, so this would mean that in this case, only one list is stored completely in memory at any given time, using the generator version.
This makes the generator version more efficient memory-wise.
Speed:
Here the version with the whole list wins.
To create a new list based on a generator, an empty list must be created (or at best with the first element), and each following element appended to the list, with the possible redimensioning steps this may provoke.
To create a new list based on a previous list, the size of the list is known beforehand, and thus can be allocated at once and each of the entries assigned (possibly, there are other optimizations at work here, but I can't back that up).
So regarding speed, the list wins.
The answer to "what's the best", comes down to the most common answer in any field of engineering... it depends....

No, you are still creating a brand new list with sorted().
output = SomeExpensiveCallEgDatabase()
results = [result[0] for result in output]
results.sort()
return results
would be closer to the generator version.
I believe it's better to use the generator version because some future version of Python may be able to take advantage of this to work more efficiently. It's always nice to get a speed up for free.
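For illustration, the three variants side by side on made-up data (the tuples below stand in for whatever the expensive call returns):

```python
# Stand-in for the result of the expensive call.
output = [("pear", 3), ("apple", 1), ("fig", 2)]

# sorted() accepts any iterable and always returns a new list.
from_list = sorted([result[0] for result in output])
from_generator = sorted(result[0] for result in output)

# list.sort() sorts the existing list in place and returns None.
in_place = [result[0] for result in output]
sort_return_value = in_place.sort()
```

All three end up with the same sorted list; the difference is only in how many intermediate lists get built along the way.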

Yes, you are correct (although I believe the sorting routine is still called tim-sort, after uncle timmy <wink-ly y'rs>)

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    pass  # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.

In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print("Found it")
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print("Found it")

The "if x in thing:" form is strongly preferred, not just because it takes less code, but also because it works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on data types that are stored in a more searchable form, e.g. sets or dictionary keys.

The if thing in somelist is the preferred and fastest way.
Under-the-hood that use of the in-operator translates to somelist.__contains__(thing) whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
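One place where the identity check makes a visible difference is NaN, which never compares equal to anything, including itself. A small sketch (variable names are my own):

```python
nan = float("nan")

equal_to_itself = (nan == nan)                   # NaN is never equal to itself
found_by_in = nan in [nan]                       # same object, so identity matches
found_by_equality = any(x == nan for x in [nan]) # equality alone finds nothing
found_other_nan = float("nan") in [nan]          # different object, not equal
```

So `in` finds the very same NaN object in the list even though an equality-only loop would not.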

for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
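A short sketch of that last point: the same expression works unchanged whichever collection the name is bound to. The helper contains and the sample data are made up for illustration.

```python
def contains(collection, thing):
    """Membership test that works for any container supporting `in`."""
    return thing in collection


names_list = ["ann", "bob"]
names_set = set(names_list)
ages = {"ann": 31, "bob": 27}

in_list = contains(names_list, "ann")   # scans the list item by item
in_set = contains(names_set, "ann")     # hash lookup
in_dict = contains(ages, "ann")         # hash lookup on the keys
value_in_dict = contains(ages, 31)      # values are not searched
```

Note that for a dictionary, `in` looks at the keys only.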
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.
