I have a very large list, and I have to run a lot of lookups for this list.
To be more specific, I am processing a large (> 11 GB) text file, but there are items that appear more than once, and I only have to process them the first time they appear.
If the pattern shows up, I process it and put it in a list. If the item appears again, I check for it in the list, and if it is there, I just skip the processing, like this:
[...]
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.append(threadid)
    elif threadid in closedthreads:
        pass
    else:
[...]
The code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just gets slower and slower.
I think it could help to sort the list (or use a 'sorted list' object) after every append(), but I am not sure about this.
What is the most elegant solution?
You can simply use a set or a hash table which marks whether a given id has already appeared. It should solve your problem with O(1) time complexity for adding and finding an item.
Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.add(threadid)
    elif threadid in closedthreads:
        pass
    else:
Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. An OrderedDict also lets you store a value associated with each key (for example, the processing result), as in the sketch below.
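For instance, a minimal sketch of the OrderedDict variant (mark_closed and the 'parsed' result are just placeholder names for illustration):

from collections import OrderedDict

closedthreads = OrderedDict()

def mark_closed(threadid, result=None):
    # Remember the id in insertion order and store the processing result alongside it.
    if threadid not in closedthreads:
        closedthreads[threadid] = result

mark_closed('thread-42', 'parsed')
print('thread-42' in closedthreads)  # True; membership tests are O(1) on average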
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.
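A rough sketch of the digest idea, assuming the thread ids are text (hashlib is in the standard library; the helper name is made up):

import hashlib

seen_digests = set()

def already_seen(threadid):
    # Keep only a fixed-size 20-byte digest per id instead of the full text.
    digest = hashlib.sha1(threadid.encode('utf-8')).digest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False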
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
if i not in unique_values_list:
unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also raises an error, TypeError: set() missing 1 required positional argument: 'items'. (It is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
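As a rough illustration (a sketch using timeit; exact numbers depend on your machine and Python version):

import timeit

# Worst-case membership test: the value we look for sits at the "end" of the data.
print(timeit.timeit('99999 in data', setup='data = list(range(100000))', number=1000))  # linear scan
print(timeit.timeit('99999 in data', setup='data = set(range(100000))', number=1000))   # hash lookup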
Looking up an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, so that's not your case). So if you use the same list to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so lookup in it should take O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm will only take O(N) time on average, which is an order of complexity better.
In most cases sets are faster than lists. One of those cases is when you look for an item using the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I think your code is also inefficient in that, for each item in the list, you are searching the larger list, while the second snippet looks for the item in the smaller set. That is not always true, though: if the list is all unique items, for example, the two collections end up the same size.
Hope that clarifies things.
I have two arrays, and if an element exists in the array received from a client, the matching element should be deleted from the other array. This works when the client array has just a single element, but not when it has more than one.
This is the code:
projects = ['xmas','easter','mayday','newyear','vacation']
for i in self.get_arguments('del[]'):
    try:
        if i in projects:
            print 'PROJECTS', projects
            print 'DEL', self.get_arguments('del[]')
            projects.remove(i)
    except ValueError:
        pass
self.get_arguments('del[]') returns an array from the client side in the format:
[u'xmas , newyear, mayday']
So it reads as one element, not 3 elements, since only one unicode string is present.
How can I get this to delete multiple elements?
EDIT: I've had to make the list into one with several individual elements.
How about filter?
projects = filter(lambda a: a not in self.get_arguments('del[]'), projects)
You could try something uber-Pythonic like a list comprehension:
new_list = [i for i in projects if i not in array_two]
You'd have to overwrite your original projects list, which isn't the most elegant, but this should work.
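For example, with a plain list standing in for self.get_arguments('del[]'):

projects = ['xmas', 'easter', 'mayday', 'newyear', 'vacation']
to_delete = ['xmas', 'newyear', 'mayday']  # stand-in for the client arguments

projects = [p for p in projects if p not in to_delete]
print projects  # ['easter', 'vacation']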
The reason this doesn't work is that remove just removes the first element that matches. You could fix that by just repeatedly calling remove until it doesn't exist anymore—e.g., by changing your if to a while, like this:
while i in projects:
    print 'PROJECTS', projects
    print 'DEL', self.get_arguments('del[]')
    projects.remove(i)
But in general, using remove is a bad idea—especially when you already searched for the element. Now you're just repeating the search so you can remove it. Besides the obvious inefficiency, there are many cases where you're going to end up trying to delete the third instance of i (because that's the one you found) but actually deleting the first instead. It just makes your code harder to reason about. You can improve both the complexity and the efficiency by just iterating over the list once and removing as you go.
But even this is overly complicated—and still inefficient, because every time you delete from a list, you're moving all the other elements of the list. It's almost always simpler to just build a new list from the values you want to keep, using filter or a list comprehension:
arguments = set(self.get_arguments('del[]'))
projects = [project for project in projects if project not in arguments]
Making arguments into a set isn't essential here, but it's conceptually cleaner—you don't care about the order of the arguments, or need to retain any duplicates—and it's more efficient—sets can test membership instantly instead of by comparing to each element.
I'm working on a script for a piece of software, and it doesn't really give me direct access to the data I need. Instead, I need to ask for each piece of information I need, and build a list of the data I'm getting. For various reasons, I need the list to be sorted. It's very easy to just build the list once, and then sort it, followed by doing stuff with it. However, I assume it would be faster to run through everything once, rather than build the list and then sort it.
So, at the moment I've basically got this:
my_list = []
for item in "query for stuff":
my_list.append("query for %s data" % item)
my_list.sort()
do_stuff(my_list)
The "query for stuff" bit is the query interface with the software, which will give me an iterable. my_list needs to contain a list of data from the contents of said iterable. By doing it like this, I'm querying for the first list, then looping over it to extract the data and put it into my_list. Then I'm sorting it. Lastly, I'm doing stuff to it with the do_stuff() method, which will loop over it and do stuff to each item.
The problem is that I can't do_stuff() to it before it's sorted, as the list order is important for various reasons. I don't think I can get away from having to loop over lists twice — once to build the list and once to do stuff to each item in it, as we won't know in advance if a recently added item at position N will stay at position N after we've added the next item — but it seems cleaner to insert each item in a sorted fashion, rather than just appending them at the end. Kind of like this:
for item in "query for stuff":
my_list.append_sorted(item)
Is it worth bothering trying to do it like this, or should I just stick to building the list, and then sorting it?
Thanks!
The short answer is: it's not worth it.
Have a look at insertion sort. The worst-case running time is O(n^2) (average case is also quadratic). On the other hand, Python's sort (also known as Timsort) will take O(n log n) in the worst case.
Yes, it does "seem" cleaner to keep the list sorted as you're inserting, but that's a fallacy.
There is no real benefit to it. The only time you'd consider using insertion sort is when you need to show the sorted list after every insertion.
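If you want to convince yourself, here is a rough comparison sketch (the data and sizes are made up; timings will vary):

import bisect
import random
import timeit

data = [random.random() for _ in range(10000)]

def append_then_sort():
    out = []
    for x in data:
        out.append(x)
    out.sort()  # one O(n log n) sort at the end
    return out

def insort_each():
    out = []
    for x in data:
        bisect.insort(out, x)  # O(log n) search plus an O(n) shift per insert
    return out

print(timeit.timeit(append_then_sort, number=10))
print(timeit.timeit(insort_each, number=10))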
The two approaches are asymptotically equivalent.
Sorting is O(n lg n) (Python uses Timsort by default, except for very small arrays), and inserting in a sorted list is O(lg n) (using binary search), which you would have to do n times.
In practice, one method or the other may be slightly faster, depending on how much of your data is already sorted.
EDIT: I assumed that inserting in the middle of a sorted list after you've found the insertion point would be constant time (i.e. the list behaved like a linked list, which is the data structure you would use for such an algorithm). This probably isn't the case with Python lists, as pointed out by Sven. This would make the "keep the list sorted" approach O(n^2), i.e. insertion sort.
I say "probably" because some list implementations switch from array to linked list as the list grows, the most notable example being CFArray/NSArray in CoreFoundation/Cocoa. This may or may not be the case with Python.
Take a look at the bisect module. It gives you various tools for maintaining a list order. In your case, you probably want to use bisect.insort.
for item in query_for_stuff():
    bisect.insort(my_list, "query for %s data" % item)
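For example, insort keeps the list sorted as items arrive (though, as noted above, each insert still has to shift elements):

import bisect

my_list = []
for value in [5, 1, 4, 2, 3]:
    bisect.insort(my_list, value)
print(my_list)  # [1, 2, 3, 4, 5]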
I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print "Found it"
the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form. eg. sets or dictionary keys.
The if thing in somelist is the preferred and fastest way.
Under the hood, that use of the in operator translates to somelist.__contains__(thing), whose implementation is equivalent to any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
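A quick demonstration of that equivalence:

somelist = ['a', 'b', 'c']
thing = 'b'

print(thing in somelist)                                # True
print(somelist.__contains__(thing))                     # what the in-operator calls
print(any(x is thing or x == thing for x in somelist))  # the equivalent expansion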
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.
I currently have a requirement to compare strings containing MAC addresses (e.g. "11:22:33:AA:BB:CC") using Python 2.7. At present, I have a preconfigured set containing the MAC addresses, and my script iterates through the set, comparing each new MAC address to those in the list. This works great, but as the set grows, the script massively slows down. With only 100 or so entries, you can notice a massive difference.
Does anybody have any advice on speeding up this process? Is storing them in a set the best way to compare or is it better to store them in a CSV / DB for example?
Sample of the code...
def Detect(p):
    stamgmtstypes = (0,2,4)
    if p.haslayer(Dot11):
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients:
                # This is the set with location_mutex:
                detection = p.addr2 + "\t" + str(datetime.now())
                print type(p.addr2)
                print detection, last_location
                observedclients.append(p.addr2)
First, you need to profile your code to understand where exactly the bottleneck is...
Also, as a generic recommendation, consider psyco, although there are a few times when psyco doesn't help
Once you find a bottleneck, cython may be useful, but you need to be sure that you declare all your variables in the cython source.
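For the profiling step, a minimal sketch with the standard library's cProfile (run_capture here is a made-up placeholder for whatever drives Detect):

import cProfile
import pstats

# run_capture() is hypothetical: substitute the call that feeds packets to Detect().
cProfile.run('run_capture()', 'detect.prof')
pstats.Stats('detect.prof').sort_stats('cumulative').print_stats(10)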
Try using a set. To declare a set, use set(), not [] (the latter creates an empty list).
Lookup in a list has O(n) complexity, which is what is happening in your case as the list grows (the cost grows linearly with n).
Lookup in a set has O(1) complexity on average.
http://wiki.python.org/moin/TimeComplexity
Also, you will need to change part of your code: there is no append method on a set, so you will need to use something like observedclients.add(address).
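A sketch of the adjusted snippet, assuming the same scapy context as the original and dropping the location bookkeeping for brevity:

from datetime import datetime

observedclients = set()  # was a list; a set gives O(1) membership tests

def Detect(p):
    stamgmtstypes = (0, 2, 4)
    if p.haslayer(Dot11):  # Dot11 comes from scapy, as in the original
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients:
                detection = p.addr2 + "\t" + str(datetime.now())
                print detection
                observedclients.add(p.addr2)  # add() instead of append()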
The post mentions "the script iterates through the set comparing each new MAC address to those in the list."
To take full advantage of sets, don't loop over them doing one-by-one comparisons. Instead use set operations like union(), intersection(), and difference():
s = set(list_of_strings_containing_mac_addresses)
t = set(preconfigured_set_of_mac_addresses)
print s - t, 'addresses in the list but not preconfigured'
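The other operations work the same way, e.g.:

print s & t, 'addresses seen that are also preconfigured'  # intersection
print s | t, 'all addresses, seen or preconfigured'        # union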