I wanted to learn about Data Structures, so I decided to implement them in Python. I first created a Singly Linked List, which consists of two classes: the actual List and the Node. A List consists of Nodes (or can be empty), and each Node has a "next" reference. When I instantiated a list, it looked like this:
l = LinkedList([1,2])
and this was the pseudocode for the init:
def __init__(self, item=None):
    self.head = None
    if item is None:
        return
    if not isinstance(item, (list, tuple)):  # a single item was given
        self.head = Node(item)
        self.head.next = None
    else:  # multiple items were given
        for i in item:
            if self.head:  # i is not the first item in the list
                new_head = Node(i)
                new_head.next = self.head
                self.head = new_head
            else:  # i is the first item in the list
                self.head = Node(i)
                self.head.next = None
Maybe there is a flaw in the logic above (prepending like this stores the items in reverse order), but hopefully you get how it works more or less. The key thing I noted here was that I did not use any Python list ([]) or dict ({}), because I didn't need to.
Now I am trying to create a MultiSet, but I am stuck on the init part.
It was simple for a Linked List, because when I read articles on Linked Lists, all of them immediately mentioned a List class and a Node class (a List consists of Nodes; a List has a head and a tail; a Node has a "next" value). But when I read articles on MultiSets, they just mention that multisets are sets (or bags) of data in which multiple instances of the same value are allowed.
This is my init for my multiset so far:
def __init__(self, items=None):
    self.multiset = []
    if items is not None:
        try:
            for i in items:  # iterate through multiple items (assuming items is a list of items)
                self.multiset.append(i)
        except TypeError:  # items is only one (non-iterable) item
            self.multiset.append(items)
I don't think I am on the right track though because I'm using a Python list ([]) rather than implementing my own (like how I did with a Linked List using Nodes, list.head, list.tail and node.next).
Can someone point me in the right direction as to how I would create and instantiate my own MultiSet in Python (without using existing Python lists / arrays)? Or am I already on the right track, and is it fine to use Python lists / arrays when creating my own MultiSet data structure?
It looks like you're conflating two things:
data structures - using Python (or any other language, basically), you can implement linked lists, balanced trees, hash tables, etc.
mapping semantics - any container, but an associative container in particular, has a protocol: what does it do when you insert a key that's already in it? Does it have an operation to access all items with a given key? And so on.
So, given your linked list, you can certainly implement a multiset (albeit with not that great performance), because it's mainly a question of the semantics you decide on. One possible way would be:
Upon an insert, append a new node with the key
For a find, iterate through the nodes; return the first key you find, or None if there aren't any
For a find_all, iterate through the nodes and return a list (or your own linked list, for that matter) of all the keys matching it.
Similarly, a linked list, by itself, doesn't dictate whether you have to use it as a set or a dictionary (or anything else). These are orthogonal decisions.
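For instance, here's a rough sketch along those lines (a sketch only: the Node and MultiSet classes below are minimal stand-ins, not anyone's canonical implementation):

class Node:
    def __init__(self, key):
        self.key = key
        self.next = None

class MultiSet:
    def __init__(self, items=None):
        self.head = None
        if items is not None:
            for item in items:
                self.insert(item)

    def insert(self, key):
        # duplicates are allowed in a multiset, so always add a new node
        node = Node(key)
        node.next = self.head
        self.head = node

    def find(self, key):
        # return the first matching key, or None if there isn't one
        node = self.head
        while node is not None:
            if node.key == key:
                return node.key
            node = node.next
        return None

    def find_all(self, key):
        # return a list of all stored keys equal to `key`
        matches = []
        node = self.head
        while node is not None:
            if node.key == key:
                matches.append(node.key)
            node = node.next
        return matches

ms = MultiSet([1, 2, 2, 3])
print(ms.find(2))      # 2
print(ms.find_all(2))  # [2, 2]
print(ms.find(9))      # None

Note that find and find_all are O(n) here, which is the not-that-great performance mentioned above; a hash-based layout would speed them up.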
I have no idea how, but I've managed to write code that is mysteriously adding items to a list.
The code I'm writing is for building a network/graph of actors and co-actors. I have a list of nodes (actors) that I begin with and use two functions that work via an API to extract credit data (movies an actor has been in) and cast data (a list of cast members for a given movie). I need to iterate through the list of nodes/actors, use the API to pull a list of movies the given node/actor has been in, then use the cast API to pull the first three cast members of each of these movies. I then make additional nodes/edges to add to the graph.
As you can see, I have multiple nested loops, so I've included counters and print statements to help track my progress and see if everything is working. I started with the outer loop (commenting out the inner loops) and tested to make sure it was iterating appropriately, then added in the inner loops one at a time to make sure each acted as it should. All loops work as expected until I reach the third-tier loop (for individual in cast_members). For some reason, once the code gets to this loop, it starts adding elements to the 'nodes' list. This essentially creates an infinite loop, since a new element is added on each iteration. I have no idea why it's doing this and can't see anything in the code/that loop that could be causing it.
It may be that the answer is right in front of me, but I've been working on it for a while and can't figure it out. Any help would be much appreciated.
Note: I know the API functions are not the problem; they've been tested multiple times and work as they should.
# ITERATION 1
nodes = graph.nodes
i1_count = 1
for n in nodes:
    actor_credits = tmdb_api_utils.get_movie_credits_for_person(person_id=n[0], vote_avg_threshold=8.0)
    i2_count = 1
    for movie in actor_credits:
        cast_members = tmdb_api_utils.get_movie_cast(movie_id=movie['id'], limit=2)  # list of dictionaries of cast members
        new_node = set()  # create an empty set for storing new nodes added to the graph
        i3_count = 1
        for individual in cast_members:
            graph.add_node(id=individual["id"], name=individual["character"])  # add the cast member as a node
            graph.add_edge(source=n[0], target=individual["id"])  # add an edge between L.F. and cast member
            # if len(new_node) == 0:  # add cast member to new_node if it is empty and skip adding edges (no need)
            #     new_node.add(individual["id"])  # add the cast member's ID to the set of new nodes
            # else:
            #     for co_actor in new_node:
            #         graph.add_edge(source=individual["id"], target=co_actor)  # add an edge between the cast member and every member of new_node
            #     new_node.add(individual["id"])  # once the loop completes, add the current cast member for the next iteration
            print(str(i1_count) + "." + str(i2_count) + "." + str(i3_count))
            print(nodes)
            i3_count += 1
        # print(str(i1_count) + "." + str(i2_count))
        print(nodes)
        i2_count += 1
    i1_count += 1
I just found my error. The nodes list is supposed to be a copy of another list. I assumed that assigning the original list to a new variable would copy it, but that only creates a second name for the same list. Adding the .copy() method solved my problem.
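For anyone who hits the same thing, here is a minimal demonstration of the aliasing behavior (the variable names are just illustrative):

original = [1, 2, 3]
alias = original               # binds a second name to the SAME list object
alias.append(4)
print(original)                # [1, 2, 3, 4] -- the "copy" changed too

independent = original.copy()  # shallow copy: a new, independent list
independent.append(5)
print(original)                # [1, 2, 3, 4] -- unaffected this time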
So basically, I have a list of objects. Let's say each object has two properties: A and B. A is a tuple of 3 integers, (A1, A2, A3), and B is an integer. This list may contain objects that have identical A's, and I want to get rid of those duplicates. However, I want to do so in a way that, among the objects sharing the same A, the one with the lowest B is chosen. In the end, I want a list of objects with all unique A's and the lowest B's.
I thought about it for a while, and I think I could come up with a really janky way to do this with lots of for loops, but I feel like there must be a much better way built into a function in Python or in some library (to do at least part of this). Anyone have any ideas?
Thanks!
edit: For more detail, this is actually for a Tetris AI, for finding all possible moves with a given piece. My objects are nodes in a tree of possible Tetris moves. Each node has two values: A, the (x_position, y_position, rotation) tuple, and B, the number of frames it takes to reach that position. I start with a root node at the starting position. At each step, I expand the tree by creating children via one move to the left, one move to the right, one rotation left, one rotation right, or one soft drop downward, and for each child I update both A (the XYR position) and B (the number of frames it took to get there). I add all of these to a list of potential moves. After this, I merge all nodes that have the same XYR position, keeping the node that takes the fewest frames to reach. In the next step, I expand each node in the list of potential moves and repeat the process. Sorry, I realize this explanation might be confusing, which is why I didn't include it in the original post. I think it's advantageous to do it this way because modern Tetris uses a rather complicated rotation system called SRS (Super Rotation System) that allows you to perform complicated spins with various pieces. Building the pathfinder this way, simulating the piece moving according to SRS, tells you whether a placement was a spin or not (sending more/less damage), and it also gives you the exact inputs to execute the placement (I also store the series of moves needed to reach each position) in the fewest frames. Later, I want to figure out how to hash the states properly so I don't revisit them, but I'm still working that out.
d = {}
for obj in the_list:
    current_lowest = d.setdefault(obj.A, obj)
    if obj.B < current_lowest.B:
        d[obj.A] = obj

# Get the result
desired_list = list(d.values())
We have a dict d whose keys are the tuples (A) and whose values are the objects themselves. The .setdefault ensures that if the A of interest hasn't been seen yet, it is set to the current object obj. If it was seen already, it returns the value (an object) corresponding to that A. Then we compare that object's B with the one at hand and act accordingly. At the end, the desired result lies in the values of d.
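As a quick usage sketch, with a hypothetical stand-in class just to exercise the snippet:

from dataclasses import dataclass

@dataclass
class Move:      # hypothetical stand-in for the asker's objects
    A: tuple     # (x_position, y_position, rotation)
    B: int       # frames needed to reach that position

the_list = [Move((1, 2, 0), 7), Move((1, 2, 0), 4), Move((3, 5, 1), 9)]

d = {}
for obj in the_list:
    current_lowest = d.setdefault(obj.A, obj)
    if obj.B < current_lowest.B:
        d[obj.A] = obj

desired_list = list(d.values())
print(desired_list)  # [Move(A=(1, 2, 0), B=4), Move(A=(3, 5, 1), B=9)]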
I am using a class with static values in Python to represent some constants:
class Constants(object):
PERSON = "person_in_class"
PARENT = "parent_of_person_in_class"
and a lot more, over 30 constants. I am using them as keys, so I am trying to make each value as short as possible (on the other side I have the same file/class, and I am using a Huffman algorithm, which works). I generate pairs like:
elements = [elem for elem in dir(Constants) if not elem.startswith("_")]
but the problem is that dir() always returns the names sorted alphabetically, so when I add a new key to Constants, all of the generated indexes change. I want a constant added at the end to simply get the next index.
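One possible direction (a sketch, and it assumes Python 3.6+): a class body's __dict__ preserves definition order (PEP 520), so iterating over vars(Constants) instead of dir(Constants) yields the names in the order they were written:

# vars()/__dict__ preserves definition order (PEP 520, Python 3.6+),
# unlike dir(), which sorts names alphabetically.
elements = [name for name in vars(Constants) if not name.startswith("_")]
indexes = {name: i for i, name in enumerate(elements)}
print(elements)  # ['PERSON', 'PARENT'] -- definition order

With this, a constant added at the end of the class simply receives the next index.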
I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:
Data comes in; check to see if it's already present in the data structure (hash equality, not timestamp).
If it is, update the timestamp to "now"
If not, add it to the set with timestamp "now"
Periodically, I wish to remove and return a list of tuples that are older than a specific timestamp (I need to update various other elements when they 'expire'). The timestamp does not have to be anything specific (it can be a Unix timestamp, a Python datetime object, or some other easy-to-compare hash/string).
I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.
Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).
Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.
This is pretty well-trodden ground. You need two structures. First, you need something to tell you whether your key (the hash, in your case) is known to the collection; for this, a dict is a very good fit, and we'll just map the hash to the timestamp so each item is easy to look up. Second, iterating over the items in order of timestamp is a task particularly suited to heaps, which are provided by the heapq module; each time we see a key, we'll add it to our heap as a (timestamp, hash) tuple.
Unfortunately, there's no way to reach into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by simply ignoring entries in the heap whose timestamps don't match the value in the dict.
So here's a place to start; you can add methods to the wrapper class to support additional operations, or change the way data is stored:
import heapq

class ExpiringCache(object):
    def __init__(self):
        self._dict = {}   # key -> latest expiry
        self._heap = []   # (expiry, key) pairs, possibly stale

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:  # skip stale heap entries
                del self._dict[key]

    def items(self):
        return self._dict.items()
create a cache and add some items
>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)
re-add an item with an even later expiry
>>> xc.add('apples', 4)
collect everything "older" than two time units
>>> xc.collect(2)
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).
By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.
Figuring out the details (and implementing them) is left as an exercise for the reader ...
Caveats:
worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
any read-only operations (i.e., lookups that don't update the timestamp) will disrupt the relationship between splay-tree layout and timestamp
A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.
Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.
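A minimal sketch of that layout (the class and method names here are hypothetical; it assumes timestamps only ever move forward, so the list stays sorted oldest-to-newest):

import time

class _Node:
    __slots__ = ("key", "stamp", "prev", "next")
    def __init__(self, key, stamp):
        self.key, self.stamp = key, stamp
        self.prev = self.next = None

class TimestampedSet:
    def __init__(self):
        self._nodes = {}                 # hash/key -> node
        self._head = self._tail = None   # head = oldest, tail = newest

    def _unlink(self, node):
        if node.prev: node.prev.next = node.next
        else: self._head = node.next
        if node.next: node.next.prev = node.prev
        else: self._tail = node.prev

    def touch(self, key):
        # insert key with timestamp "now", or refresh an existing key
        node = self._nodes.get(key)
        if node is None:
            node = self._nodes[key] = _Node(key, 0.0)
        else:
            self._unlink(node)           # splice out; re-append at the tail
        node.stamp = time.time()
        node.prev, node.next = self._tail, None
        if self._tail: self._tail.next = node
        else: self._head = node
        self._tail = node

    def purge(self, cutoff):
        # remove and return [(key, stamp)] for everything older than cutoff
        expired = []
        while self._head and self._head.stamp < cutoff:
            node = self._head
            self._unlink(node)
            del self._nodes[node.key]
            expired.append((node.key, node.stamp))
        return expired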
Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.
This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.
Edit:
For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.
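Here's a short sketch of that OrderedDict variant (names are illustrative):

import time
from collections import OrderedDict

class ExpiringSet:
    def __init__(self):
        self._d = OrderedDict()    # key -> timestamp, oldest entry first

    def touch(self, key):
        if key in self._d:
            del self._d[key]       # delete + reinsert moves the key to the end
        self._d[key] = time.time()

    def purge(self, cutoff):
        expired = []
        while self._d:
            key, stamp = next(iter(self._d.items()))  # peek at the oldest entry
            if stamp > cutoff:
                break
            del self._d[key]
            expired.append((key, stamp))
        return expired

On Python 3.2+, OrderedDict.move_to_end(key) performs the same delete-and-reinsert in one call.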
If you're okay with working around occasional false positives, I think a Bloom filter may suit your needs well (it's very, very fast):
http://en.wikipedia.org/wiki/Bloom_filter
and a python implementation: https://github.com/axiak/pybloomfiltermmap
EDIT: Reading your post again, I think this will work, but instead of storing the hashes yourself, just let the Bloom filter create the hashes for you; i.e., I think you just want to use the Bloom filter as a set of timestamps. I'm assuming that your timestamps could basically just be a set, since you are hashing them.
A simple hashtable or dictionary will give O(1) check/update/set operations. You could simultaneously store the data in a simple time-ordered list for the purge operations. Keep a head and a tail pointer, so that insert is also O(1), and removal is as simple as advancing the head until it reaches the target time, removing each entry you pass from the hash.
The overhead is one extra pointer per stored data item, and the code is dead simple:
lookup = {}           # key -> node
head = tail = None    # time-ordered singly linked list

class Node:           # simple container holding the args + a 'next' pointer
    def __init__(self, key, time, data):
        self.key, self.time, self.data = key, time, data
        self.next = None
        self.marked = False    # set when a newer entry supersedes this one

def insert(key, time, data):
    global head, tail
    existing = lookup.get(key)
    if existing:
        existing.marked = True
    node = Node(key, time, data)
    lookup[key] = node
    if head is None:
        head = tail = node
    else:
        tail.next = node
        tail = node

def clean(older_than):
    global head, tail
    while head and head.time < older_than:
        n = head.next
        if not head.marked:
            del lookup[head.key]
        # else it was already overwritten
        if head is tail:
            tail = n
        head = n
Essentially this is what I'm trying to do:
I have a set that I add objects to. These objects have their own equality method, and a set should never contain an element equal to another element in the set. However, when attempting to insert an element, if it is equal to an existing element, I'd like to record a merged version of the two. That is, the objects have an "aux" field that is not considered in their equality method. When I'm done adding things, I would like an element's "aux" field to contain a combination of all of the "aux" fields of equal elements I've tried to add.
My thinking was, okay, before adding an element to the set, check to see if it's already in the set. If so, pull it out of the set, combine the two elements, then put it back in. However, the remove method in Python sets doesn't return anything and the pop method returns an arbitrary element.
Can I do what I'm trying to do with sets in Python, or am I barking up the wrong tree (and if so, what is the right tree)?
Sounds like you want a defaultdict
from collections import defaultdict
D = defaultdict(list)
D[somekey].append(auxfield)
Edit:
To use your merge function, you can combine the code people have given in the comments:
D = {}
for something in yourthings:
    if something.key in D:
        D[something.key] = merge(D[something.key], something.auxfield)
    else:
        D[something.key] = something.auxfield
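For instance, with a toy merge that just concatenates aux lists (every name here is hypothetical):

def merge(old_aux, new_aux):
    return old_aux + new_aux   # toy merge: concatenate the aux lists

class Thing:
    def __init__(self, key, auxfield):
        self.key, self.auxfield = key, auxfield

yourthings = [Thing("a", [1]), Thing("b", [2]), Thing("a", [3])]

D = {}
for something in yourthings:
    if something.key in D:
        D[something.key] = merge(D[something.key], something.auxfield)
    else:
        D[something.key] = something.auxfield

print(D)  # {'a': [1, 3], 'b': [2]}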