I have an object with different attributes and a list that contains those objects.
Before adding an object to the list, I'd like to check if an attribute of this new object is present in the list.
This attribute is unique, so this is done to make sure that every object in the list is unique.
I would do something like this:
for post in stream:
    if post.post_id not in post_list:
        post_list.append(post)
    else:
        # Find old post in the list and replace it
But obviously the check on line 2 doesn't work, as I'm comparing a post_id against a list of post objects.
Keep a separate set to which you add the attribute, and against which you can then test the next value:
ids_seen = set()
for post in stream:
    if post.post_id not in ids_seen:
        post_list.append(post)
        ids_seen.add(post.post_id)
Another option is to create an ordered dict first, with the ids as keys:
from collections import OrderedDict

posts = OrderedDict((post.post_id, post) for post in stream)
post_list = list(posts.values())
This keeps the most recently seen post for a given id, but you'll still end up with unique ids only.
If ordering isn't important, just use a regular dictionary comprehension:
posts = {post.post_id: post for post in stream}
post_list = list(posts.values())
If you are using Python 3.6 or newer, then the order will be preserved anyway as the CPython implementation was updated to retain input order, and in Python 3.7 this feature became part of the language specification.
Whatever you do, don't test post.post_id against a separate list, as that takes O(N) time for each membership check, where N is the number of items in your stream. Combined with N such checks, that approach takes O(N**2) quadratic time overall, meaning that for every 10-fold increase in the number of input items, you'd need 100 times more time to process them all.
But when using a set or dictionary, testing if the id is already there only takes O(1) constant time, so checks are cheap. That makes a full processing loop take O(N) linear time, meaning that it'll take time directly proportional to how many input items you have.
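If you also need the question's "find the old post in the list and replace it" behaviour while keeping first-seen order, one way is to track each id's position in the list; this is just a sketch, assuming every post has a post_id attribute:
post_list = []
index_by_id = {}  # post_id -> position of that post in post_list

for post in stream:
    if post.post_id in index_by_id:
        # Replace the previously stored post, keeping its original position
        post_list[index_by_id[post.post_id]] = post
    else:
        index_by_id[post.post_id] = len(post_list)
        post_list.append(post)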
This should work:
for post in stream:
    if post.post_id not in [p.post_id for p in post_list]:
        post_list.append(post)
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2, 2, 4, 7, 7, 8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (This is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Looking up an element in a list takes O(N) time (you could find it in logarithmic time, but only if the list is sorted, which isn't your case). So if you use the same list both to keep the unique elements and to look up newly seen ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup each). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average, which is one order better.
In most cases sets are faster than lists. One of those cases is membership testing with the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
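If you want to see the difference on your own machine, a quick (unscientific) comparison with the timeit module looks something like this; the exact numbers will vary, but the list lookup should be dramatically slower as the collection grows:
import timeit

setup = "items = list(range(10000)); item_set = set(items)"
print(timeit.timeit("9999 in items", setup=setup, number=10000))     # list: linear scan
print(timeit.timeit("9999 in item_set", setup=setup, number=10000))  # set: hash lookup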
If you want to check the time complexity of different Python data structures and operations, you can check this link.
Your first snippet is also inefficient in that, for each item, you search the (usually larger) result list, while the second snippet looks the item up in the set, which is usually smaller. That isn't always the case, though: if the input is all unique items, both collections end up the same size.
Hope that clarifies things.
I have a very large list, and I have to run a lot of lookups for this list.
To be more specific, I'm processing a large (> 11 GB) text file, and some items appear more than once; I only want to process them the first time they appear.
When an item shows up, I process it and put its id into a list. If the item appears again, I check for it in the list, and if it's there, I just skip the processing, like this:
[...]
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.append(threadid)
    elif threadid in closedthreads:
        pass
    else:
        [...]
The code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just gets slower and slower.
I think it could help to sort the list (or use a 'sorted list' object) after every append(), but I am not sure about this.
What is the most elegant solution?
You can simply use a set or a hash table that records whether a given id has already appeared. It should solve your problem with O(1) time complexity for both adding and finding an item.
Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.add(threadid)
    elif threadid in closedthreads:
        pass
    else:
        [...]
Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. An OrderedDict lets you store a value associated with each key as well (for example, the processing result).
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.
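As a rough illustration of the digest idea (the helper names here are made up, and this assumes the thread ids are strings): keep a set of fixed-size SHA-1 digests rather than the raw ids, which caps the memory cost per entry:
import hashlib

closedthreads = set()

def mark_closed(threadid):
    # A 20-byte digest instead of the full thread id string
    closedthreads.add(hashlib.sha1(threadid.encode("utf-8")).digest())

def is_closed(threadid):
    return hashlib.sha1(threadid.encode("utf-8")).digest() in closedthreads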
I have a dictionary that I would like to write in whole to an NDB on App Engine. The problem is that only the last item in the dictionary is being written. I thought perhaps the writes were too fast so I put a sleep timer in with a very long wait of 20 seconds to see what would happen. I continually refreshed the Datastore Viewer and saw the transaction write, and then later get overwritten by the next transaction, etc. The table started out empty and the dictionary keys are unique. A simple example:
class Stats(ndb.Model):
    desc = ndb.StringProperty(required=True)
    count = ndb.IntegerProperty(required=True)
    update = ndb.DateTimeProperty(auto_now_add=True)

class refresh(webapp2.RequestHandler):
    def get(self):
        statsStore = Stats()
        dict = {"test1": 0, "test2": 1, "test3": 2}
        for key in dict:
            statsStore.desc = key
            statsStore.count = dict.get(key)
            statsStore.put()
What will happen above is that only the final dictionary item will remain in the datastore. Again with a sleep timer I can see each being written but then overwritten. I am using this on my local machine with the local development GAE environment.
Appreciate the help.
The problem with your original code is that you're reusing the same entity (model instance).
During the first put(), a datastore key is generated and assigned to that entity. Then, all the following put() calls are using the same key.
Changing it to create a new model instance on each iteration (the solution you mention in your comment) will ensure a new datastore key is generated each time.
Another option would be to clear the key with "statsStore.key = None" before calling put(). But what you did is probably better.
Not sure what you are trying to do, but here are some hopefully helpful pointers.
If you want to save the dict and then re-use it later by reading it back from the database, then change your string property to a text property, import json, and save the dict as a JSON string using json.dumps().
If you want to write an entity for every element in your dict, then move the statsStore creation line inside the for loop, and finish each iteration by appending the new Stats() instance to a list. Once the loop is done, batch put all the entities in that list. This batch approach is much faster than calling put() inside the loop, which is usually a very non-performant design choice.
If you just want to record all the values in the dict for later reference, and you have a value that you can safely use as a delimiter, then create two empty lists before your loop and append each desc and count to the respective list. Once outside the loop, you can save these values to two text properties in your entity by joining the lists with the delimiter string. If you do this, then I strongly suggest using urllib.quote() to escape each desc value when appending it, so as to avoid conflicts with your delimiter.
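For example, here is a minimal sketch of the second option (one entity per dict item, written in a single batch), reusing the Stats model and handler from the question; ndb.put_multi does the batch write:
class refresh(webapp2.RequestHandler):
    def get(self):
        stats = {"test1": 0, "test2": 1, "test3": 2}
        entities = []
        for desc, count in stats.items():
            # A fresh Stats instance per item, so each gets its own datastore key
            entities.append(Stats(desc=desc, count=count))
        # One batch RPC instead of a separate put() per entity
        ndb.put_multi(entities)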
Some final notes: be careful using this type of process with a StringProperty. You might easily exceed the string size limit depending on the number of items and/or the length of your desc values. Also remember that the items in the dict may not come out in the order you intend; consider something like "for k, v in sorted(mydict.items()):". HTH, stevep
I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:
Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
If it is, update the timestamp to "now"
If not, add it to the set with timestamp "now"
Periodically, I wish to remove and return a list of tuples that are older than a specific timestamp (I need to update various other elements when they 'expire'). The timestamp does not have to be anything specific (it can be a unix timestamp, a Python datetime object, or some other easy-to-compare value).
I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.
Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).
Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.
This is a pretty well-trod space. You need two structures. First, you need something to tell you whether your key (the hash, in your case) is known to the collection; for this, a dict is a very good fit, and we'll just map the hash to the timestamp so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap as a tuple of (timestamp, hash).
Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by simply ignoring entries in the heap whose timestamp doesn't match the value stored in the dict.
So here's a place to start; you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:
import heapq

class ExpiringCache(object):
    def __init__(self):
        self._dict = {}
        self._heap = []

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()
Create a cache and add some items:
>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)
Re-add an item with an even later expiry:
>>> xc.add('apples', 4)
Collect everything "older" than two time units:
>>> xc.collect(2)
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).
By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.
Figuring out the details (and implementing them) is left as an exercise for the reader ...
Caveats:
worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp
A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.
Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.
Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.
This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.
Edit:
For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.
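A rough sketch of that OrderedDict approach (the names here are illustrative, and it assumes entries are always touched with the current time, so insertion order matches timestamp order):
from collections import OrderedDict
import time

seen = OrderedDict()  # hash -> timestamp, oldest insertion first

def touch(h, now=None):
    now = time.time() if now is None else now
    if h in seen:
        del seen[h]  # delete and re-insert so the entry moves to the "newest" end
    seen[h] = now

def purge(cutoff):
    expired = []
    for h, ts in list(seen.items()):
        if ts > cutoff:
            break  # everything after this point was touched more recently
        expired.append((h, ts))
        del seen[h]
    return expired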
If you're okay with working around occasional false positives, I think a bloom filter may suit your needs well (it's very, very fast).
http://en.wikipedia.org/wiki/Bloom_filter
and a python implementation: https://github.com/axiak/pybloomfiltermmap
EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.
A simple hash table or dictionary will be O(1) for the check/update/set operations. You can simultaneously store the data in a simple time-ordered list for the purge operations. Keep a head and tail pointer, so that insert is also O(1), and removal is as simple as advancing the head until it reaches the target time and removing all the entries you find from the hash.
The overhead is one extra pointer per stored data item, and the code is dead simple:
insert(key, time, data):
    existing = MyDictionary.find(key)
    if existing:
        existing.mark()   # superseded; skip it during clean()
    node = MyNodeType(key, data, time)   # simple container holding args + 'next' pointer
    node.next = NULL
    MyDictionary.insert(key, node)
    Tail.next = node
    Tail = node
    if Head is NULL: Head = node

clean(olderThan):
    while Head.time < olderThan:
        n = Head.next
        if not Head.isMarked():
            MyDictionary.remove(Head.key)
        # else it was already overwritten
        if Head == Tail: Tail = n
        Head = n
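Here is one way to sketch that idea in actual Python, using a dict plus a collections.deque in place of hand-rolled head/tail pointers; like the pseudocode, it assumes timestamps only ever move forward, and it skips superseded entries lazily instead of marking nodes:
import collections
import time

class TimedSet(object):
    def __init__(self):
        self._latest = {}                  # key -> most recent timestamp
        self._queue = collections.deque()  # (timestamp, key) pairs, oldest first

    def touch(self, key, now=None):
        now = time.time() if now is None else now
        self._latest[key] = now
        self._queue.append((now, key))

    def __contains__(self, key):
        return key in self._latest

    def purge(self, older_than):
        expired = []
        while self._queue and self._queue[0][0] < older_than:
            ts, key = self._queue.popleft()
            if self._latest.get(key) == ts:  # not superseded by a later touch
                del self._latest[key]
                expired.append((key, ts))
            # otherwise a newer entry for this key is still queued; ignore this one
        return expired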
I tend to obsess about expressing code as compactly and succinctly as possible without sacrificing runtime efficiency.
Here's my code:
p_audio = plate.parts.filter(content__iendswith=".mp3")
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")

extra_context.update({
    'p_audio': p_audio and p_audio[0],
    'p_video': p_video and p_video[0],
    'p_swf': p_swf and p_swf[0],
})
Are there any python/django gurus that can drastically shorten this code?
Actually, in your pursuit of compactness and efficiency, you have managed to come up with code that is terribly inefficient. This is because referring to p_audio or not p_audio causes that queryset to be evaluated - and because you haven't sliced it before then, the entire result set is brought back from the database, e.g. all the plate objects whose content ends with mp3, and so on.
You should ensure you do the slice for each query first, before you refer to the value of that query. Since you're concerned with code compactness, you probably want to slice with [:1] first, to get a queryset of a single object:
p_audio = plate.parts.filter(content__iendswith=".mp3")[:1]
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")[:1]
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")[:1]
and the rest can stay the same.
Edit to add: this is because you're only interested in the first element of each queryset, as evidenced by the fact that you only pass [0] from each one into the context. But in your code, not p_audio refers to the original, unsliced queryset: and to determine the true/false value of the queryset, Django has to evaluate it, which fetches all matching elements from the database and converts them into Python objects. Since you don't actually want those objects, you're doing a lot more work than you need to.
Note though that it's not re-running it every time: just the first time, since after the first evaluation the queryset is cached internally. But as I say, that's already more work than you want.
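As an aside, if you're on Django 1.6 or newer, QuerySet.first() expresses the same "give me the first match or nothing" intent directly, which also removes the need for the [0] indexing when building the context (a sketch along the lines of the original code):
p_audio = plate.parts.filter(content__iendswith=".mp3").first()
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv").first()
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf").first()

extra_context.update({'p_audio': p_audio, 'p_video': p_video, 'p_swf': p_swf})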
Besides featuring less redundancy, this is also way easier to extend with new content types.
kinds = (("p_audio", ".mp3"), ("p_video", ".flv"), ("p_swf", ".swf"))

extra_context.update((key, False) for key, _ in kinds)

for key, ext in kinds:
    entries = plate.parts.filter(content__iendswith=ext)
    if entries:
        extra_context[key] = entries[0]
        break
Just adding this as another answer, inspired by Pyroscope's above (as my edit there has to be peer reviewed).
The latest incarnation exploits the fact that the Django template system simply disregards nonexistent context items when they are referenced, so mp3, etc. below do not need to be initialized to False (or 0). So the following meets all the functionality of the code from the OP. The other optimization is that mp3, etc. are used as the key names (instead of "p_audio", etc.).
for key in ['mp3', 'flv', 'swf']:
    entries = plate.parts.filter(content__iendswith=key)[:1]
    extra_context[key] = entries and entries[0]
    if extra_context[key]:
        break