How to prevent an object from sorting static fields alphabetically - python

I am using a class with static values in Python to represent some constants:
class Constants(object):
    PERSON = "person_in_class"
    PARENT = "parent_of_person_in_class"
and a lot more, over 30 constants. I use them as keys, so I am trying to make the values as short as possible (on the other side I have the same file/class, and I am using the Huffman algorithm, and it works). I generate the pairs like this:
elements = [elem for elem in dir(Constants) if not elem.startswith("_")]
The problem is that the elements are always sorted alphabetically, so when I add a new key to Constants, all of the pairs change. I want a key added at the end to behave as if it simply got the next index.
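One possible way to get definition order instead of alphabetical order is sketched below. It assumes Python 3.6 or later, where a class namespace remembers the order in which names were assigned, and it reads the names from vars(Constants) rather than dir(Constants), which always returns a sorted list. The index_of mapping is an illustrative name, not something from the question.

```python
class Constants(object):
    PERSON = "person_in_class"
    PARENT = "parent_of_person_in_class"

# vars() returns the class namespace, which keeps definition order on
# Python 3.6+; dir() returns an alphabetically sorted list instead.
elements = [name for name in vars(Constants) if not name.startswith("_")]

# Give each constant a stable index: appending a new constant at the
# end of the class leaves the existing indices unchanged.
index_of = {name: i for i, name in enumerate(elements)}

print(elements)   # ['PERSON', 'PARENT']
print(index_of)   # {'PERSON': 0, 'PARENT': 1}
```

With dir() the same class would yield ['PARENT', 'PERSON'], which is why adding a key used to shift the others.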

Related

Trying to remove duplicates from large list of objects, keep certain one

I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000 but I expect the list length eventually may exceed 1,000,000. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the whole list once for each object) is already too slow and will only get slower, since it is O(n^2), so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it, hashing the objects above on their text and using the set function will return a set where the objects chosen from duplicates are randomly chosen from each group of duplicates (e.g., above, either a set of (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
Change to using a database ...
import sqlite3
db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test",1),("test",2),("asdasd",3)]
# executemany (no underscore) runs the statement once per tuple
db.executemany("INSERT OR REPLACE INTO my_items (text,id) VALUES (?,?)", my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
This may have marginally higher overhead in terms of time ... but you will save on RAM. Note that INSERT OR REPLACE keeps whichever row was inserted last, so the items must be inserted in ascending ID order for the newest one to win.
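If the items cannot be guaranteed to arrive in ascending ID order, one option is an upsert that only overwrites when the incoming ID is higher. This is a sketch, assuming the SQLite library bundled with Python is version 3.24 or newer (the first release with ON CONFLICT ... DO UPDATE); the in-memory database and the sample rows are for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
db.execute("CREATE TABLE my_items (text TEXT PRIMARY KEY, id INTEGER)")

# Deliberately out of order: the older ("test", 1) arrives after ("test", 2).
items = [("test", 2), ("test", 1), ("asdasd", 3)]

# Upsert: insert the row, or, if the text already exists, overwrite the
# stored id only when the incoming id is higher.
db.executemany(
    """
    INSERT INTO my_items (text, id) VALUES (?, ?)
    ON CONFLICT(text) DO UPDATE SET id = excluded.id
    WHERE excluded.id > my_items.id
    """,
    items,
)
db.commit()

rows = db.execute("SELECT text, id FROM my_items ORDER BY id").fetchall()
print(rows)  # [('test', 2), ('asdasd', 3)]
db.close()
```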
You could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
...     if obj.text not in unique or obj.ID > unique[obj.text].ID:
...         unique[obj.text] = obj
...
>>> list(unique.values())
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
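To make the idea concrete, here is a toy hash table. It is a teaching sketch only: the ToyHashTable class and its bucket layout are assumptions made for illustration, and Python's real dict is implemented differently. What it shows is why a lookup is fast: the hash value picks one small bucket, so only that bucket is searched instead of the whole collection.

```python
class ToyHashTable:
    """Illustrative only; Python's real dict is more sophisticated."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash value selects which bucket could contain the key.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only one bucket is scanned, not every stored item.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = ToyHashTable()
table.put("hello", 1)
table.put("something else", 3)
table.put("hello", 2)        # replaces the earlier value for "hello"
print(table.get("hello"))    # 2
print(table.get("missing"))  # None
```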
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as a key in a dictionary. For example, you might store a dictionary that maps the field in question to the entire record, or to a list of records that all share the same field value.
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {}  # Empty dictionary to start with.
while ... something :
    info = read_next_record()  # You figure this out.
    fld = info.fields[some_field]  # The value you care about
    hv = hash(fld)  # Compute hash value for field.
    if hv in seen_before:
        print("This field value has been seen before")
    else:
        seen_before[hv] = True  # Never seen ... until NOW!

How to properly instantiate a MultiSet (created on my own) using Python

I wanted to learn about Data Structures so I decided to create them using Python. I first created a Singly Linked List (which consists of two classes: the actual List and the Node). A List consists of Nodes (or can be empty). Each node had a "next" value. When I instantiated a list, it would look like this:
l = LinkedList([1,2])
and this was the pseudocode for the init:
def __init__(self, item=None):
    head = None
    if a single item was given
        head = Node(item)
        head.next = None
    else if multiple items were given
        for i in item
            if head: # i is not the first item in the list
                new_head = Node(i)
                new_head.next = head
                head = new_head
            else: # i is the first item in the list
                head = Node(i)
                head.next = None
Maybe there is a flaw in the logic above, but hopefully you get how it works more or less. The key thing I noted here was that I did not use any list ([]) or dict ({}) because I didn't need to.
Now, I am trying to create a MultiSet but I am stuck in the init part.
It was simple for a linked list because when I read articles on linked lists, all of the articles immediately mentioned a List class and a Node class (a List consists of Nodes; a List has a head and a tail; a Node has a 'next' value). But when I read articles on MultiSets, they just mention that multisets are sets (or bags) of data where multiple instances of the same data are allowed.
This is my init for my multiset so far:
def __init__(self, items=None):
    self.multiset = []
    if items is not None:
        try:
            for i in items:  # iterate through multiple items (assuming items is an iterable of items)
                self.multiset.append(i)
        except TypeError:  # items is a single, non-iterable item
            self.multiset.append(items)
I don't think I am on the right track though because I'm using a Python list ([]) rather than implementing my own (like how I did with a Linked List using Nodes, list.head, list.tail and node.next).
Can someone point me in the right direction as to how I would create and instantiate my own MultiSet using Python (without using existing Python lists / arrays)? Or am I already on the right track and am I supposed to be using Python lists / arrays when I am creating my own MultiSet data structure?
It looks like you're conflating two things:
data structures - using Python (or any other language, basically), you can implement linked lists, balanced trees, hash tables, etc.
mapping semantics - any container, but an associative container in particular, has a protocol: what does it do when you insert a key that's already in it? does it have an operation to access all items with a given key? etc.
So, given your linked list, you can certainly implement a multiset (albeit with not that great performance), because it's mainly a question of your decision. One possible way would be:
Upon an insert, append a new node with the key
For a find, iterate through the nodes; return the first key you find, or None if there aren't any
For a find_all, iterate through the nodes, and return a list (or your own linked-list, for that matter), of all the keys matching it.
Similarly, a linked-list, by itself, doesn't dictate if you have to use it as a set or a dictionary (or anything else). These are orthogonal decisions.
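The insert, find, and find_all operations described in this answer can be sketched on top of a singly linked list as follows. The class and method names (Node, LinkedMultiSet, count) are hypothetical, and new nodes are prepended at the head rather than appended, purely because that is O(1) without keeping a tail pointer. find_all returns a Python list for brevity; it could equally return another linked list.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class LinkedMultiSet:
    """A multiset backed by a singly linked list; duplicates are allowed."""

    def __init__(self, items=()):
        self.head = None
        for item in items:
            self.insert(item)

    def insert(self, key):
        # Always add a node, even if an equal key is already stored.
        self.head = Node(key, self.head)

    def _keys(self):
        node = self.head
        while node is not None:
            yield node.key
            node = node.next

    def find(self, key):
        # Return the first matching key, or None if there is no match.
        for k in self._keys():
            if k == key:
                return k
        return None

    def find_all(self, key):
        return [k for k in self._keys() if k == key]

    def count(self, key):
        return len(self.find_all(key))


bag = LinkedMultiSet([1, 2, 2, 3])
bag.insert(2)
print(bag.count(2))     # 3
print(bag.find(5))      # None
print(bag.find_all(3))  # [3]
```

Every operation except insert walks the whole list, which is the "not that great performance" the answer mentions.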

Retrieving a specific set element in Python

Essentially this is what I'm trying to do:
I have a set that I add objects to. These objects have their own equality method, and a set should never have an element equal to another element in the set. However, when attempting to insert an element, if it is equal to another element, I'd like to record a merged version of the two elements. That is, the objects have an "aux" field that is not considered in its equality method. When I'm done adding things, I would like an element's "aux" field to contain a combination of all of the "aux" fields of equal elements I've tried to add.
My thinking was, okay, before adding an element to the set, check to see if it's already in the set. If so, pull it out of the set, combine the two elements, then put it back in. However, the remove method in Python sets doesn't return anything and the pop method returns an arbitrary element.
Can I do what I'm trying to do with sets in Python, or am I barking up the wrong tree (what is the right tree?)
Sounds like you want a defaultdict
from collections import defaultdict
D = defaultdict(list)
D[somekey].append(auxfield)
Edit:
To use your merge function, you can combine the code people have given in the comments
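D = {}
for something in yourthings:
    if something.key in D:
        # key already present: merge the new aux field into the stored one
        D[something.key] = merge(D[something.key], something.auxfield)
    else:
        # first time this key is seen: store its aux field as-is
        D[something.key] = something.auxfield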
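If the objects themselves should stay in the container, one alternative is a dict that maps each object to itself, so the stored "equal" element can be retrieved and updated. This is a sketch under assumptions: the Item class, its name and aux attributes, and the set-union merge are all invented for illustration, and it relies on the objects defining __hash__ consistently with __eq__.

```python
class Item:
    def __init__(self, name, aux):
        self.name = name
        self.aux = set(aux)  # not part of equality

    def __eq__(self, other):
        return isinstance(other, Item) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


def add_merging(store, item):
    """Add item to store, merging aux into an equal element if present."""
    existing = store.get(item)  # looks up by equality, returns stored object
    if existing is None:
        store[item] = item
    else:
        existing.aux |= item.aux  # hypothetical merge: union of aux sets
    return store[item]


store = {}
add_merging(store, Item("a", {1}))
add_merging(store, Item("a", {2}))
add_merging(store, Item("b", {3}))

print(len(store))                  # 2
print(store[Item("a", ())].aux)    # {1, 2}
```

Iterating over the dict (or its values) then gives the same elements a set would have held, each with its combined aux field.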

How to access a specific class instance by attribute in python?

Say I have a class Box with two attributes, self.contents and self.number. I have instances of Box in a list called Boxes. Is there any way to access/modify a specific instance by its attribute rather than iterating through Boxes? For example, if I want a box with box.number == 40 (and the list is not sorted), what would be the best way to modify its contents?
If you need to do it more frequently and you have unique numbers, then create a dictionary:
numberedBox = dict((b.number, b) for b in Boxes)
you can then access your boxes directly with numbers:
numberedBox[40]
but if you want to change their number, you will have to modify the numberedBox dictionary too...
Otherwise yes, you have to iterate over the list.
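A minimal sketch of the dictionary approach, with a one-off linear search shown for comparison. The Box class body and the sample data are assumptions made so the example runs on its own; box_by_number is an illustrative name.

```python
class Box:
    def __init__(self, number, contents):
        self.number = number
        self.contents = contents


Boxes = [Box(7, "books"), Box(40, "cables"), Box(12, "mugs")]

# Build the index once (a single pass over the list).
box_by_number = {box.number: box for box in Boxes}

# The dict holds references to the same objects, so modifying a box
# through the index also changes the one in the list.
box_by_number[40].contents = "adapters"
print(Boxes[1].contents)  # adapters

# For a one-off lookup without an index, stop at the first match.
first_match = next((box for box in Boxes if box.number == 12), None)
print(first_match.contents)  # mugs
```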
The most straightforward way is to use a list comprehension:
answer = [box for box in Boxes if box.number == 40]
Be warned though. This actually does iterate over the whole list. Since the list is not sorted, there is no faster method than to iterate over it (and thus do a linear search), unless you want to copy all the data into some other data structure (e.g. dict, set or sort the list).
Use the filter builtin:
wanted_boxes = list(filter(lambda box: box.number == 40, Boxes))
Although not as flexible as using a dictionary, you might be able to get by using a simple lookup table to map box numbers to a particular box in Boxes. For example, if you knew the box numbers could range over 0...MAX_BOX_NUMBER, then the following would be very fast. It requires only one full scan of the Boxes list to set up the table.
MAX_BOX_NUMBER = ...
# set up lookup table
box_number = [None] * (MAX_BOX_NUMBER + 1)
for box in Boxes:
    box_number[box.number] = box
box_number[42] # box in Boxes with given number (or None)
If the box numbers are in some other arbitrary range, some minor arithmetic would have to be applied to them before their use as indices. If the range is very large, but sparsely populated, dictionaries would be the way to go to save memory but would require more computation -- the usual trade-off.

How to rewrite this Dictionary For Loop in Python?

I have a Dictionary of Classes where the classes hold attributes that are lists of strings.
I made this function to find the maximum number of items in any one of those lists for a particular person.
def find_max_var_amt(some_person):  # pass in a patient id number, get back their max number of variables for a type of variable
    max_vars = 0
    for key, value in patients[some_person].__dict__.items():
        challenger = len(value)
        if max_vars < challenger:
            max_vars = challenger
    return max_vars
What I want to do is rewrite it so that I do not have to use the .items() method. This find_max_var_amt function works fine as is, but I am converting my code from using a dictionary to using a database via the dbm module, so typical dictionary methods will no longer work for me even though the syntax for assigning and accessing the key:value pairs will be the same. Thanks for your help!
Since dbm doesn't let you iterate over the values directly, you can iterate over the keys. To do so, you could modify your for loop to look like
for key in patients[some_person].__dict__:
    value = patients[some_person].__dict__[key]
    # then continue as before
I think a bigger issue, though, will be the fact that dbm only stores strings. So you won't be able to store the list directly in the database; you'll have to store a string representation of it. And that means that when you try to compute the length of the list, it won't be as simple as len(value); you'll have to develop some code to figure out the length of the list based on whatever string representation you use. It could just be as simple as len(the_string.split(',')), just be aware that you have to do it.
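As one possible string representation, the list can be serialized with json before it is stored and parsed again after it is read back, so that len() works on a real list. This is a sketch, not the only option: the save_list and load_list helpers, the key name, and the sample values are hypothetical, and dbm.dumb with a temporary directory is used only so the example runs anywhere.

```python
import dbm.dumb
import json
import os
import tempfile


def save_list(db, key, values):
    # dbm stores strings/bytes only, so serialize the list first.
    db[key] = json.dumps(values)


def load_list(db, key):
    # The value comes back as bytes; json.loads accepts bytes directly.
    return json.loads(db[key])


with tempfile.TemporaryDirectory() as tmp:
    db = dbm.dumb.open(os.path.join(tmp, "patients"), "c")
    try:
        save_list(db, "patient42:symptoms", ["cough", "fever", "fatigue"])
        symptoms = load_list(db, "patient42:symptoms")
    finally:
        db.close()

print(symptoms)       # ['cough', 'fever', 'fatigue']
print(len(symptoms))  # 3
```

Unlike splitting on commas, this keeps working when an item itself contains a comma.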
By the way, your existing function could be rewritten using a generator expression, like so:
def find_max_var_amt(some_person):
    return max(len(value) for value in patients[some_person].__dict__.values())
and if you did it that way, the change to iterating over keys would look like
def find_max_var_amt(some_person):
    dct = patients[some_person].__dict__
    return max(len(dct[key]) for key in dct)
