Checking information in a dataset in Python

I currently have a requirement to compare strings containing MAC addresses (e.g. "11:22:33:AA:BB:CC") using Python 2.7. At present, I have a preconfigured set containing the MAC addresses, and my script iterates through the set comparing each new MAC address to those in the list. This works great, but as the set grows the script slows down massively. With only 100 or so entries you can already notice a big difference.
Does anybody have any advice on speeding up this process? Is storing them in a set the best way to compare or is it better to store them in a CSV / DB for example?
Sample of the code...
def Detect(p):
    stamgmtstypes = (0, 2, 4)
    if p.haslayer(Dot11):
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients:
                # This is the set with location_mutex:
                detection = p.addr2 + "\t" + str(datetime.now())
                print type(p.addr2)
                print detection, last_location
                observedclients.append(p.addr2)

First, you need to profile your code to understand where exactly the bottleneck is...
Also, as a generic recommendation, consider psyco, although there are cases where psyco doesn't help.
Once you have found the bottleneck, Cython may be useful, but you need to be sure to declare all your variables in the Cython source.
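For example, a minimal profiling run might look like the sketch below (main() is a hypothetical entry point standing in for whatever drives your capture loop):
import cProfile

# Profile the whole run and print functions sorted by cumulative time.
# main() is an assumed name, not from the original script.
cProfile.run('main()', sort='cumulative')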

Try using a set. To declare a set, use set(), not [] (the latter declares an empty list).
Lookup in a list is O(n), which is what happens in your case as the list grows: the cost grows linearly with n.
Lookup in a set is O(1) on average.
http://wiki.python.org/moin/TimeComplexity
Also, you will need to change part of your code: there is no append method on a set, so you will need to use something like observedclients.add(address).
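A minimal sketch of that change, keeping the structure of the original Detect() function (assuming observedclients starts out empty):
observedclients = set()          # instead of observedclients = []

def Detect(p):
    stamgmtstypes = (0, 2, 4)
    if p.haslayer(Dot11):
        if p.type == 0 and p.subtype in stamgmtstypes:
            if p.addr2 not in observedclients:    # O(1) on average with a set
                detection = p.addr2 + "\t" + str(datetime.now())
                print detection, last_location
                observedclients.add(p.addr2)      # add(), not append()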

The post mentions "the script iterates through the set comparing each new MAC address to those in the list."
To take full advantage of sets, don't loop over them doing one-by-one comparisons. Instead use set operations like union(), intersection(), and difference():
s = set(list_of_strings_containing_mac_addresses)
t = set(preconfigured_set_of_mac_addresses)
print s - t, 'addresses in the list but not preconfigured'

Related

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2, 2, 4, 7, 7, 8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list, as I have done in the first example?
The code for the alternative solution is the following:
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (It is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually only need to check one element, or a small, constant number of them. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests for sets take "constant time" (i.e. the runtime does not depend on the size of the set).
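To make the difference concrete, here is a rough micro-benchmark sketch (the numbers will vary by machine and collection size; the names are illustrative only):
import timeit

setup = """
import random
data = list(range(100000))
data_set = set(data)
probe = random.randint(0, 99999)
"""

print(timeit.timeit('probe in data', setup=setup, number=1000))      # linear scan of the list
print(timeit.timeit('probe in data_set', setup=setup, number=1000))  # hash lookup in the set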
Looking up an element in a list takes O(N) time (you could find an element in logarithmic time, but only if the list were sorted, which is not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup each). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, the whole algorithm only takes O(N) time on average, which is one order better.
In most cases sets are faster than lists. One of those cases is when you look for an item using the in keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, see the Python wiki's TimeComplexity page (linked above).
I think your code is also inefficient in that, for each item, you are searching a growing list, while the second snippet looks the item up in a set. Note that this size argument does not always hold: if the input is all unique items, the two collections end up the same size, so the real gain is the O(1) set lookup.
Hope that clarifies things.

Speed up lookup item in list (via Python)

I have a very large list, and I have to run a lot of lookups against it.
To be more specific, I am working on a large (> 11 GB) text file for processing, but there are items which appear more than once, and I only process them the first time they appear.
If the pattern shows up, I process it and put it into a list. If the item appears again, I check whether it is in the list, and if it is, I just pass instead of processing it, like this:
[...]
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.append(threadid)
    elif threadid in closedthreads:
        pass
    else:
        [...]
The code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just gets slower and slower.
I think it could help to sort the list (or use a 'sorted list' object) after every append(), but I am not sure about this.
What is the most elegant solution?
You can simply use a set or a hash table which marks whether a given id has already appeared. It should solve your problem with O(1) time complexity for adding and finding an item.
Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.add(threadid)
    elif threadid in closedthreads:
        pass
    else:
Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. An OrderedDict also lets you store a value associated with each key (for example, processing results).
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.
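A rough sketch of the digest idea in Python 2, assuming threadid is a plain string (the names here are illustrative, not from the original code):
import hashlib

seen_digests = set()

def already_seen(threadid):
    # Store a fixed-size SHA-1 digest instead of the full text to save memory.
    digest = hashlib.sha1(threadid).digest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False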

Can I build a list, and sort it at the same time?

I'm working on a script for a piece of software, and it doesn't really give me direct access to the data I need. Instead, I need to ask for each piece of information I need, and build a list of the data I'm getting. For various reasons, I need the list to be sorted. It's very easy to just build the list once, and then sort it, followed by doing stuff with it. However, I assume it would be faster to run through everything once, rather than build the list and then sort it.
So, at the moment I've basically got this:
my_list = []
for item in "query for stuff":
    my_list.append("query for %s data" % item)
my_list.sort()
do_stuff(my_list)
The "query for stuff" bit is the query interface with the software, which will give me an iterable. my_list needs to contain a list of data from the contents of said iterable. By doing it like this, I'm querying for the first list, then looping over it to extract the data and put it into my_list. Then I'm sorting it. Lastly, I'm doing stuff to it with the do_stuff() method, which will loop over it and do stuff to each item.
The problem is that I can't do_stuff() to it before it's sorted, as the list order is important for various reasons. I don't think I can get away from having to loop over lists twice — once to build the list and once to do stuff to each item in it, as we won't know in advance if a recently added item at position N will stay at position N after we've added the next item — but it seems cleaner to insert each item in a sorted fashion, rather than just appending them at the end. Kind of like this:
for item in "query for stuff":
    my_list.append_sorted(item)
Is it worth bothering trying to do it like this, or should I just stick to building the list, and then sorting it?
Thanks!
The short answer is: it's not worth it.
Have a look at insertion sort. The worst-case running time is O(n^2) (average case is also quadratic). On the other hand, Python's sort (also known as Timsort) will take O(n log n) in the worst case.
Yes, it does "seem" cleaner to keep the list sorted as you're inserting, but that's a fallacy.
There is no real benefit to it. The only time you'd consider using insertion sort is when you need to show the sorted list after every insertion.
The two approaches are asymptotically equivalent.
Sorting is O(n lg n) (Python uses Timsort by default, except for very small arrays), and inserting in a sorted list is O(lg n) (using binary search), which you would have to do n times.
In practice, one method or the other may be slightly faster, depending on how much of your data is already sorted.
EDIT: I assumed that inserting in the middle of a sorted list after you've found the insertion point would be constant time (i.e. the list behaved like a linked list, which is the data structure you would use for such an algorithm). This probably isn't the case with Python lists, as pointed out by Sven. This would make the "keep the list sorted" approach O(n^2), i.e. insertion sort.
I say "probably" because some list implementations switch from array to linked list as the list grows, the most notable example being CFArray/NSArray in CoreFoundation/Cocoa. This may or may not be the case with Python.
Take a look at the bisect module. It gives you various tools for maintaining a list order. In your case, you probably want to use bisect.insort.
for item in query_for_stuff():
    bisect.insort(my_list, "query for %s data" % item)
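For a rough sense of the trade-off discussed above, here is a small timing sketch (illustrative only; real numbers depend on the data size and on how pre-sorted the input already is):
import bisect
import random
import timeit

data = [random.random() for _ in range(10000)]

def append_then_sort():
    result = []
    for x in data:
        result.append(x)
    result.sort()                      # one O(n log n) sort at the end
    return result

def insort_each():
    result = []
    for x in data:
        bisect.insort(result, x)       # O(log n) search + O(n) insertion per item
    return result

print(timeit.timeit(append_then_sort, number=10))
print(timeit.timeit(insort_each, number=10))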

How to quickly sort dict() with a huge number of keys?

I keep getting TLE (time limit exceeded) on the SPOJ problem SBANK using Python. To solve it, I have to sort a dict() with a huge number of keys (up to 100000). Using the sorted() function in my code has not helped. Is there any fast solution? Thanks for your help.
My code below:
for j in range(n):  # n is the number of keys
    account = sys.stdin.readline().rstrip()
    dic.setdefault(account, 0)
    dic[account] += 1
sorted(dic)  # **this sort takes a lot of time**
EDIT 1: Following Justin Peel's tips, I updated my code as below, but it still gets TLE. What else can I do?
import sys
import psyco  # import the psyco module to speed things up
psyco.full()
nCase = int(sys.stdin.readline().split()[0])
for i in range(nCase):
    n = int(sys.stdin.readline().split()[0])
    dic = dict()
    lst = list()
    for j in range(n):
        account = sys.stdin.readline().rstrip()
        dic.setdefault(account, 0)
        dic[account] += 1
    sys.stdin.readline()
    lst = dic.keys()  # store the keys in a list
    lst.sort()
    for account in lst:
        sys.stdout.write('%s %s\n' % (account, dic[account]))
dicts are not sorted, which is how they are able to provide O(1) insert and get access. (Internally, they are implemented as hash tables, I believe, though I'm not sure this is required by the Python spec).
If you want to iterate the keys of a dict in sorted order, you can use:
for key in sorted(the_dict.iterkeys()):
    value = the_dict[key]
    # do something
But, as you note, sorting 100,000 elements may take some time.
As an alternative, you can write (or find on the internet) sorted dict implementations that keep an ordered list of keys along with the dictionary, and support fast lookup by key, and iteration in order without having to sort all at once. Of course, to support sorted order, the keys will need to be sorted at insertion time, so inserts will not be O(1).
Edit: Per dsolimano's comment, if you are using Python 2.7 or Python 3.x, there is a built-in OrderedDict class that iterates in insertion order. This keeps insertion fast, but may not give you the order you need (depending on how you want the items sorted).
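A tiny illustration of the difference (insertion order, not sorted order; a Python 2.7 sketch):
from collections import OrderedDict

d = OrderedDict()
d['zebra'] = 1
d['apple'] = 2
print d.keys()   # ['zebra', 'apple'] -- insertion order, not alphabetical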
I was able to solve this problem. Here are some tips:
Use Python 2.5. It is much faster than Python 3.2, which is the other Python option available on SPOJ. Only one person has managed to get a fast enough solution using Python 3.2.
Just use a basic dict for counting. You can get by with the defaultdict from the collections module too, but the basic dict was faster for me.
Sort only the keys of the dict, not the key-item pairs. Forming the key-item pairs takes far too long. Also, use keys = mydict.keys(); keys.sort() because it is the fastest way to go about it.
Use psyco (pretty much always with SPOJ problems in Python)
Learn the fastest ways to do input and output in Python. Hint: it isn't iterating over every line of the input, for example (see the sketch after these tips).
Try submitting after you've added each part (getting input, counting, doing output) to see where you are with time. This is a very valuable thing to do on SPOJ. The SPOJ computer running your code is quite likely much slower than your current computer and it can be hard to determine based on your own computer's running time of the code if it will be fast enough for SPOJ.
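As an illustration of the I/O tip, here is a sketch that reads all of the input at once and writes all of the output in one go. It assumes the SBANK-style format used in the question (a test-case count, then for each case a count line, that many account lines, and a blank separator line); it is not the accepted solution, just the shape of one:
import sys

def main():
    lines = sys.stdin.read().split('\n')    # read everything at once, not line by line
    pos = 0
    out = []
    ncase = int(lines[pos]); pos += 1
    for _ in range(ncase):
        n = int(lines[pos]); pos += 1
        counts = {}
        for account in lines[pos:pos + n]:
            counts[account] = counts.get(account, 0) + 1
        pos += n + 1                        # n account lines plus the blank separator
        keys = counts.keys()
        keys.sort()                         # Python 2 style, as in the question
        for k in keys:
            out.append('%s %d\n' % (k, counts[k]))
    sys.stdout.write(''.join(out))          # emit all output in a single write

main()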
Since Python 3.1 is available, collections.Counter is good for that purpose:
collections.Counter(map(str.rstrip, sys.stdin)).most_common()

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if after splitting the line into a list, is there any better way to assign each list into a var. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance thereof, and then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
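A rough sketch of what the csv and namedtuple suggestions might look like together (the field names and STATS_FILE come from the question; everything else is an assumption, and the field list would need to be extended to cover all ~30 values):
import csv
from collections import namedtuple

# Hypothetical record type covering the first five fields only.
Stats = namedtuple('Stats', 'done rema succ fails size')

with open(STATS_FILE, 'rb') as f:               # 'rb' for the csv module on Python 2
    row = next(csv.reader(f, delimiter='|'))
    stats = Stats(*[int(x) for x in row[:5]])

print stats.done, stats.rema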
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed the buggy part :)
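For reference, the dict-comprehension version mentioned in the edit might look like this (assuming the keys and values names from the snippet above):
somedict = {k: int(v) for k, v in zip(keys, values)}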
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was shorter and sweeter, taking advantage of Python's immensely powerful dict comprehensions.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring would need to take place for that. Not a good time to do it now.
townsean - same as 2.
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea whether list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...
