Dynamically named sets or other method? - python

First of all, thank you for taking the time to look at my problem. Rather than simply describing the solution I have in mind, I thought it best to outline the problem as well, so that alternative solutions can be suggested. It is more than likely that there is a better way to achieve this.
The problem I have:
I generate lists of names with associated scores, ranks, and other associated values. These lists are generated daily but have to change as the day progresses, because some names need to be removed. Currently the lists are produced in Excel-based sheets which contain the following data types in the following format:
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value)
For example;
Mrs Dodgsons class
Rosie,1,123.8,5
James,2,122.6,7
Chris,3,120.4,12
Dr Clements class
Hannah,1,126.9,2.56
Gill,2,124.54,6.89
Jack,3,122.04,15.62
Jamie,4,121.09,20.91
Now what I have is a separate list of users who need removing from the above Excel-generated lists (don't worry, the final product of this little project is not to re-save a modified Excel doc). This list is generated by a web scraper which is updated every two minutes. The method I currently see as a potentially viable solution is to use a piece of code which saves each list in the CSV as a set (if this is possible); then, upon finding a unique name in the removal list, it would delete that name from the set(s) in which it occurs.
My questions to the Python forum are:
Is the proposed methodology viable for producing multiple uniquely named sets (up to 60 per day)?
Is there a better method of achieving the same result?
Any help or comments would be greatly appreciated.
Best regards AEA

It will probably be easier for you to use dictionaries rather than sets, as dictionaries, unlike sets, provide a natural way of associating items of data with each member of a collection.
Here's one approach, in which the data for each class is stored in a dictionary, each key of which is a student's name, and the values of which are lists with the scores, etc., of each student:
data = {
    "Mrs Dodgson": {
        "Rosie": [1, 123.8, 5],
        "James": [2, 122.6, 7],
        "Chris": [3, 120.4, 12]
    },
    "Dr Clement": {
        "Hannah": [1, 126.9, 2.56],
        "Gill": [2, 124.54, 6.89],
        "Jack": [3, 122.04, 15.62],
        "Jamie": [4, 121.09, 20.91]
    }
}

to_remove = ["Jamie", "Rosie"]

# Mrs. Dodgson's class data, initially.
print(data["Mrs Dodgson"])

# Now remove the student data.
for cls_data in data.values():
    for student in to_remove:
        try:
            del cls_data[student]
        except KeyError:
            pass

print(data["Mrs Dodgson"])
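If the lists arrive as text in the format shown in the question, one dictionary can be built for the whole file, so up to 60 lists per day become 60 keys rather than 60 dynamically named variables. The following is only a sketch: `parse_lists` and `remove_names` are hypothetical helper names, and it assumes a title line never contains a comma while every data row has exactly four comma-separated fields.

```python
def parse_lists(lines):
    """Group data rows under the most recent title line (a line with no commas)."""
    lists = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if "," not in line:
            # Assumption: a line without commas is a list title.
            current = lists.setdefault(line, {})
            continue
        if current is None:
            raise ValueError("data row before any list title: %r" % line)
        name, rank, score, value = [part.strip() for part in line.split(",")]
        current[name] = [int(rank), float(score), float(value)]
    return lists


def remove_names(lists, names):
    """Delete each name from every list in which it occurs."""
    for entries in lists.values():
        for name in names:
            entries.pop(name, None)  # ignore names that are not in this list


sample = """Mrs Dodgsons class
Rosie,1,123.8,5
James,2,122.6,7
Chris,3,120.4,12
Dr Clements class
Hannah,1,126.9,2.56
Gill,2,124.54,6.89
Jack,3,122.04,15.62
Jamie,4,121.09,20.91
""".splitlines()

lists = parse_lists(sample)
remove_names(lists, ["Jamie", "Rosie"])  # e.g. the names from the web scraper
print(sorted(lists["Mrs Dodgsons class"]))
```

Calling `remove_names` again every two minutes with the latest scraped names is safe, because names that have already been removed are simply skipped.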

Related

Optimal data structure for streaming data

I have a stream of data of the form [id, name, act, value, type].
id is an integer, name a string, act can be 'add', 'update' or 'delete', value is an integer, and type is either L or R. An id can be added only once; it may then be updated multiple times and finally deleted. I am looking for a data structure that will let me insert this data efficiently.
I also need to be able to get the highest L value by name and the lowest R value by name at any moment, as fast as possible.
I believe I will need a heap to get the min and max values by name in constant time. My problem is that I can't find a way to also delete and update existing data at the same time.
The phrasing is a bit unclear here. Let me try and rephrase: you are looking for a good data structure such that, given a stream of operations in the form given above, you can add, delete, or update items (found using their id number). And you'd also like to maintain a few summary statistics about the whole data structure such as highest L and lowest R value.
Does this sound correct?
A dictionary of dictionaries sounds like it's probably the right answer if your id numbers are not confined to a known range, or a list of dictionaries if they are.
Sorting makes this a different sort of problem. So you are instead looking for a way to add and remove data entries in a data structure sorted alphabetically on their string names? One common way to do this is with a binary search tree. A balanced BST will give you an insertion time complexity of O(log(n)) with n elements in the tree. At each element you can store the other data. Then you can separately maintain the highest L and lowest R values and update these each time a value is added that exceeds one of these values. If you remove a value equal to one of these limits, you'll have to traverse the whole data structure to get the new limit value.
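One way to combine the heap idea from the question with deletes and updates is "lazy deletion": keep the current records in a dictionary keyed by id, push a fresh heap entry on every add or update, and discard outdated heap entries only when they reach the top. The sketch below is an illustration rather than a definitive answer; the class name `StreamIndex` and its method names are made up, and it assumes an update never changes the name or type of an id.

```python
import heapq
from collections import defaultdict


class StreamIndex(object):
    def __init__(self):
        self.records = {}               # id -> (name, value, type)
        self.heaps = defaultdict(list)  # (name, type) -> heap of (sort_key, id, value)

    def apply(self, id_, name, act, value, type_):
        if act == "delete":
            # Stale heap entries for this id are dropped later, in best().
            self.records.pop(id_, None)
            return
        # "add" and "update" both store the new value and push a new heap entry.
        self.records[id_] = (name, value, type_)
        sort_key = -value if type_ == "L" else value  # max-heap for L, min-heap for R
        heapq.heappush(self.heaps[(name, type_)], (sort_key, id_, value))

    def best(self, name, type_):
        """Highest value for type 'L', lowest for type 'R'; None if there is none."""
        heap = self.heaps.get((name, type_), [])
        while heap:
            _, id_, value = heap[0]
            if self.records.get(id_) == (name, value, type_):
                return value
            heapq.heappop(heap)  # entry was deleted or superseded by an update
        return None


index = StreamIndex()
index.apply(1, "a", "add", 10, "L")
index.apply(2, "a", "add", 15, "L")
index.apply(2, "a", "update", 5, "L")
print(index.best("a", "L"))
```

Adds and updates cost O(log n) each. A query is O(1) when the top of the heap is current; otherwise it pays for the stale entries it pops, but each entry is popped at most once, so the cost averages out over the stream.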

Get key from a value(list) in a dic

Is there a way to retrieve a key from its value if the values are lists?
def add(self, name, number):
    if name in self.bok.values():
        print 'The name already exists in the telefonbok.'
    else:
        self.bok.update({number: []})
        self.bok[number].append(name)
        print self.bok
This works if I only have one element in the list:
self.bok.keys()[self.bok.values().index(my value i want to get the corresponding key)]
But if I insert more elements, it gives me an error saying that the value isn't in the list.
In case you are wondering, I'm creating a telephone book using a class and a dictionary. I'm supposed to be able to give an alias to a number and also to change the number for a name, after which the alias should also get the new number. I would appreciate any help; sorry if I'm being unclear.
If you find yourself wondering "how do I look up a key by its value?" it usually means that your dictionary is going the wrong way, or at least that you should be keeping two dictionaries. This is especially true if you notice yourself ensuring that the values are unique by hand!
At the moment, your conditional is never true (unless self.bok.values() is updated some other way), because name is (presumably) a string whereas self.bok.values() looks like it's a list of lists. If names should only appear once in the telephone book, that's a good hint that you should have a dictionary going the opposite direction.
Assuming you also need the number-to-name lookup, what I would do is add another dictionary to your class, and update them both whenever you add a new name/number pair.
import collections  # defaultdict is a very nice object

# elsewhere, presumably in the __init__ method
self.name_to_number = {}
self.number_to_names = collections.defaultdict(list)

def add(self, name, number):
    if name in self.name_to_number:
        print('The name already exists in the telefonbok.')
    else:
        self.name_to_number[name] = number
        self.number_to_names[number].append(name)
If you're dead set on doing it the hard way for whatever reason, the findByValue method in Óscar López's answer is what you need.
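To show how the two dictionaries can be kept in sync, including the "change the number and the alias follows" requirement from the question, here is a rough, self-contained sketch. The class name `PhoneBook` and the method `change_number` are invented for illustration, and an alias is assumed to be simply another name stored under the same number.

```python
from collections import defaultdict


class PhoneBook(object):
    def __init__(self):
        self.name_to_number = {}
        self.number_to_names = defaultdict(list)

    def add(self, name, number):
        """Register a name (or an alias) for a number; names must be unique."""
        if name in self.name_to_number:
            return False
        self.name_to_number[name] = number
        self.number_to_names[number].append(name)
        return True

    def change_number(self, name, new_number):
        """Move a name, and every alias sharing its number, to new_number."""
        old_number = self.name_to_number[name]  # raises KeyError for unknown names
        names = self.number_to_names.pop(old_number)
        for shared_name in names:
            self.name_to_number[shared_name] = new_number
        self.number_to_names[new_number].extend(names)


book = PhoneBook()
book.add("Anna", "555")
book.add("Annie", "555")           # alias for the same number
book.change_number("Annie", "999")
print(book.name_to_number["Anna"])
```

Both lookups stay plain dictionary accesses, so there is no need to search the values for a key.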

writing to a file using one of the key as index and other key as value

I am very new to programming as well as to Python.
I have been trying to implement this, but without success, and would like your help.
I have a dictionary with weird keys. I need to use one of the keys as my index number and the second key as the value, along with the value stored in the dictionary as the third column.
For example, if the dictionary is
{'Michael', 'Student<matriculation no>', 'marks obtained' : 40 }
the result should be like this
Name      Admission no        marks obtained
Michael   matriculation no    40
sara      matriculation no    60
where the matriculation no is the value extracted from the second key of the dictionary (different for each entry),
and this goes on for about 100 rows.
Kindly suggest a method to do this.
You don't have multiple keys. In your example, your key is the tuple ('Michael', 'Student', 'marks obtained') (your dictionary syntax is wrong, by the way: it should be {('Michael', 'Student', 'marks obtained') : 40} based on what you're implying).
If you are guaranteed that no two students will have the same name (perhaps you might include last name and middle initial!) then you may use just their names as keys. Then, it would make sense to have the value be a tuple (matriculation, marks obtained). Like so: {"Michael" : ('Student', 40)}.
When you want to print these students, you may say print name, students[name][0], "no", students[name][1], where students is your dictionary and name is a string which is the student's name e.g. 'Michael'.
I'm not sure what else you can have for matriculation besides 'Student' by the way. It seems to me that you don't need to include that, unless you can in fact have other values for that.
A good metaphor here is to think of 'Michael' as having some data associated with him, i.e. his matriculation status and the number of marks received. The state of being matriculated does not have 'Michael' associated with it (particularly) nor does having received 40 marks have 'Michael' associated with it (particularly), because these things can happen to other people. So, the proper key is the student's name. Keys are supposed to be unique: in a dictionary, assigning a second value to an existing key silently replaces the first, so two students who shared a key could not both be stored.
Big edit:
After looking at your edited post it seems that your key should actually be the matriculation number, since it is never the same. So now your dictionary should be {matriculation_no : (name, marks)}. And printing is now print students[matriculation_no][0], "Admission", matriculation_no, students[matriculation_no][1] or something like that. It depends on whether you wanted "Admission" in your string.
Minor edit:
If you want to write to a file, use file.write() instead.
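As a rough illustration of writing the table out, the sketch below takes a dictionary of the form {matriculation_no: (name, marks)} and writes one tab-separated row per student. The function name `write_students` and the matriculation numbers are made up, and an in-memory `io.StringIO` buffer stands in for a real file so the example can be run as it is.

```python
import io


def write_students(students, out):
    """Write a header plus one row per student to a file-like object."""
    out.write("Name\tAdmission no\tMarks obtained\n")
    for matriculation_no in sorted(students):
        name, marks = students[matriculation_no]
        # write() takes a single string, so the row is formatted first.
        out.write("%s\t%s\t%s\n" % (name, matriculation_no, marks))


# Hypothetical data: the matriculation numbers are invented for the example.
students = {"M1001": ("Michael", 40), "M1002": ("sara", 60)}

buffer = io.StringIO()
write_students(students, buffer)
print(buffer.getvalue())
```

To write to disk, pass an opened file instead of the buffer, for example inside `with open("marks.txt", "w") as f:` (the file name is just a placeholder).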

Python search: how to do it efficiently

I have a class with two member variables:
class A:
    fullname = ""
    email = ""
A list of A instances is stored in memory, and I need to search it by fullname or email. The search needs to support fuzzy matching (resembling the SQL 'like' clause); e.g. searching for "abc" should match "dabcd" (it would be even better if exact matches were shown first).
I think I should build an index on 'fullname' and 'email'?
Please suggest, thanks!
EDIT: If I only need exact matches, are two dictionaries keyed by 'fullname' and 'email' the best choice? Some articles I have seen say that lookup is O(1).
2nd EDIT: By 'best' I mean search speed. As I understand it, Python stores only a reference to each object in the dictionaries, so memory use should not be a problem. I have thousands of records.
Look at the sqlite3 module. You can put your data into an in-memory database, index it, and query it with standard SQL.
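A minimal sketch of the sqlite3 suggestion follows. The table name `people`, the sample rows, and the `search` helper are assumptions made up for illustration. Note that a LIKE pattern starting with `%` cannot use an ordinary index, so this query still scans the table; for a few thousand rows that is usually fine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (fullname TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("dabcd", "d@example.com"),
        ("abc", "a@example.com"),
        ("xyz", "x@example.com"),
    ],
)
# These indexes help exact-match lookups (WHERE fullname = ?), not '%...%' searches.
conn.execute("CREATE INDEX idx_fullname ON people (fullname)")
conn.execute("CREATE INDEX idx_email ON people (email)")


def search(conn, term):
    """Substring search on both columns, with exact matches listed first."""
    pattern = "%" + term + "%"
    return conn.execute(
        "SELECT fullname, email FROM people "
        "WHERE fullname LIKE ? OR email LIKE ? "
        "ORDER BY (fullname = ? OR email = ?) DESC, fullname",
        (pattern, pattern, term, term),
    ).fetchall()


print(search(conn, "abc"))
```

A `%` or `_` inside the search term is treated as a wildcard by LIKE, so user-supplied terms may need escaping.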
If I only need exact matches, are two dictionaries keyed by 'fullname' and 'email' the best choice?
If by "Best" you mean "best speed", then yes.
Some articles I have seen say that lookup is O(1).
That's correct.
Two dictionaries would be fast.
If you want "like" clause behavior, it doesn't matter. Most structures are equally slow. A dictionary will work, and will be reasonably fast. A list, however, will be about the same speed.
def find_using_like(some_partial_key, dictionary):
    # Linear scan: return the value of the first key that contains the partial key.
    for k in dictionary:
        if some_partial_key in k:
            return dictionary[k]
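Putting the two cases side by side, the sketch below builds the two exact-match dictionaries and adds a substring scan that returns every match with exact matches first. The helper names `build_indexes` and `find_all_like` are hypothetical, and the sketch assumes that full names and emails are each unique; with duplicates, each dictionary value would need to be a list.

```python
class A(object):
    def __init__(self, fullname, email):
        self.fullname = fullname
        self.email = email


def build_indexes(people):
    """Return (by_fullname, by_email); both map to the same A objects."""
    by_fullname = {}
    by_email = {}
    for person in people:
        by_fullname[person.fullname] = person
        by_email[person.email] = person
    return by_fullname, by_email


def find_all_like(partial_key, index):
    """Linear scan for keys containing partial_key; exact matches sort first."""
    matches = [key for key in index if partial_key in key]
    matches.sort(key=lambda key: (key != partial_key, key))
    return [index[key] for key in matches]


people = [A("dabcd", "d@example.com"), A("abc", "a@example.com"), A("xyz", "x@example.com")]
by_fullname, by_email = build_indexes(people)

print(by_email["x@example.com"].fullname)                   # exact match, one dict lookup
print([p.fullname for p in find_all_like("abc", by_fullname)])
```

The dictionaries hold references to the same objects as the original list, so the extra memory is limited to the dictionaries themselves.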

Python - Nested dict, versus hashing a tuple?

I'm wondering what are the pros/cons of nested dicts versus hashing a tuple in Python?
The context is a small script to assign "workshops" to attendees at a conference.
Each workshop has several attributes (e.g. week, day, subject etc.). We use four of these attributes in order to assign workshops to each attendee - i.e. each delegate will also have a week/day/subject etc.
Currently, I'm using a nested dictionary to store my workshops:
workshops = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(dict))))
with open(input_filename) as input_file:
    workshop_reader = csv.DictReader(input_file, dialect='excel')
    for row in workshop_reader:
        workshops[row['Week ']][row['Stream']][row['Strand']][row['Day']] = row
return workshops
I can then use the above data-structure to assign each attendee their workshop.
The issue is later, I need to iterate through every workshop, and assign an ID (this ID is stored in a different system), which necessitates unwrapping the structure layer by layer.
First question - is there some other way of creating a secondary index to the same values, using a string (workshop name) as a key? I.e. I'd still have the four-level nested dicts, but I could also search for individual entries based on just a name.
Secondly - how can I achieve a similar effect using tuples as the key? Are there any advantages you can think of to this approach? Would it be much cleaner, or easier to use? (This whole unwrapping thing is a bit of a pain, and I don't think it's very approachable.)
Or are there any other data structures that you could recommend that might be superior/easier to access/manipulate?
Thanks,
Victor
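class Workshop(object):
    def __init__(self, week, stream, strand, day, name):
        self.week = week
        self.stream = stream
        self.day = day
        self.strand = strand
        self.name = name

...

for row in workshop_reader:
    # name: the workshop's name taken from this row (not the literal string 'name',
    # which would make every workshop overwrite the previous one).
    workshops[name] = Workshop(...)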
That is only if the name is the unique attribute of the workshops (that is, no workshops with repeating names).
Also, you'll be able to easily assign an ID to each Workshop object in the workshops dictionary.
Generally, if you need to nest dictionaries, you should use classes instead. Nested dictionaries get complicated to keep track of when you are debugging. With classes you can tell different types of dictionaries apart, so it gets easier. :)
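On the two specific questions (tuple keys and a secondary index by name), a flat dictionary keyed by a tuple can sit alongside a second dictionary keyed by workshop name, with both pointing at the same row objects. This is only a sketch: the 'Name' column and the function `load_workshops` are assumptions, since the question does not show the actual column holding the workshop name, and an in-memory CSV stands in for the real input file.

```python
import csv
import io

# Column names as used in the question (note the trailing space in 'Week ').
KEY_FIELDS = ("Week ", "Stream", "Strand", "Day")


def load_workshops(input_file, name_field="Name"):
    """Return (by_key, by_name); both dictionaries share the same row dicts."""
    by_key = {}
    by_name = {}
    for row in csv.DictReader(input_file, dialect="excel"):
        key = tuple(row[field] for field in KEY_FIELDS)
        by_key[key] = row
        by_name[row[name_field]] = row  # secondary index, assumes unique names
    return by_key, by_name


sample = io.StringIO(
    "Week ,Stream,Strand,Day,Name\n"
    "1,Science,Physics,Mon,Optics\n"
    "1,Science,Physics,Tue,Waves\n"
)
by_key, by_name = load_workshops(sample)

# Iterating over every workshop needs no layer-by-layer unwrapping.
for workshop_id, row in enumerate(by_key.values(), start=1):
    row["ID"] = workshop_id

print(by_key[("1", "Science", "Physics", "Tue")]["Name"])
```

Because both indexes reference the same row, an ID assigned through one is visible through the other. The trade-off against nested dicts is that a tuple key makes it harder to ask for "all workshops in week 1" without scanning the keys.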
