merging dictionaries in python - python

Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key value pairs of the form ((word, corpus) -> occurence_count) (everything is an integer) which I am storing in multiple python dictionaries (tuple->int). These values are spread across multiple files on the disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried , I ran into performance issues. I am currently trying to store the values in a DB (mysql), processing multiple dictionaries at a time, since mysql provides row level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries)
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!

A disk-based dictionary-like exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys [[and a caveat for mutable values, but as your values are ints that need not concern you).

Something like this, if I understand your question correctly
from collections import defaultdict
import pickle
result = defaultdict(int)
for fn in filenames:
data_dict = pickle.load(open(fn))
for k,count in data_dict.items():
word,corpus = k
result[k]+=count

If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.

Related

python data structures for storing classes

I am trying to design a system in python where my customers can create an order and it will be stored in an array or similar type of structure that will be able to constantly expand to store more orders as they are placed. What is the best way to do this?
I can think of two ways to do this.
Serialization. Reference
Create two tables, One called Order and other called order_contents. You can join order and Order_contents by order id. Store Order specific data in order table and content specific data in conetnt. All contents can be retrieved with a single SQL query this way OR in python, easily with ORM.
How big would you expect and order to get and how many orders could there be? Also what is stored in an order?
If you use numpy arrays you have the problem that increasing the size of an array is a very expensive process, so doing it many times on large arrays would be problematic. Numpy arrays also are for data that is all the same type, so you would not use this for things that are combinations of strings (name, item), integers (item reference), and floats (cost).
A simple list is likely the easiest and most inclusive choice. You can put whatever you like in a list and increase the size easily since a list is actually just pointers.
A dictionary could be useful if you are expecting to have to search the list often, or have a clear key-item relationship.
It really comes down to your use case. A list is often the choice, but a dictionary could be nice, and numpy arrays are nice if you are doing math with the stored data.

Dictionary, set or frozenset?

I have a large collection of data, about 10 million entries and part of my program required very many membership checks...
if a in data:
return True
return False
right now I have data as dictionary entries with all their values equal to '1'
I also have a program that uses an algorithm to figure out the same information, but for now it is slower then the dictionary method however I expect the size of data to continue growing...
For my current dictionary solution, would type(data) as a frozenset, or set (or something else?) be faster?
And for the future to find out when I need to switch to my program, does anyone know how the speed of checking membership correlated with increasing the size of a type that's hashable?
Is a dictionary with 1 billion entries still fast?
On Principal
If you expect the data to keep growing you can't use a frozenset.
A set would be smaller than a dictionary storage wise for testing if an element exist in it. It would be similar in speed to a dictionary lookup since the keys and items of a set are both hashed for storage and always unique. If you don't need data associated with the username, use a set.
Practically speaking...
When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.
For this amount of data RyPeck is right - a DB will do the job much better.
One more point:
Something seems odd to me in what you've written:
If you use a dictionary to store the objects of the memberships, what the value of said key-value pair in the dictionary is '1'? Shouldn't the key-value pair of the dictionary be: "id of a"-"a" where 'a' is the object.
There are several bytes overhead per entry in a hash-able (whether dictionary or set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+Gb of memory for the application. I would start looking for a fast DB
For frozenset you also need to have all data in memory in some acceptable form at creation time, which probably doubles the amount of mem needed

how to store key-value as well as value-key in Python?

I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by keys. Later on however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this was done somewhat less often (~1/10) than access by key, I naïvely implemented it by simply iterating over all the dicts items(). Which proved a bit sluggish at a few hundred thousand calls per second. It is about 500 times slower.
So my next idea was to just use save the reverse dict, too. This seems to be a rather inelegant solution however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.

Python synchronised reading of sorted files

I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
im thinking importing it into a db (mysql, sqlite, etc) will give better performance than merging it in script. the db typically has optimized routines for loading csv and the join will be probably be as fast or much faster than merging 2 dicts (one being very large) in python.
"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare.
How about reusing external merge logic? If the first field is the key then:
for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
process(entry)
This is a similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this would be what you're called "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
I would suggest pre-merge.
Reading a file takes a lot of processor time. Reading two files, twice as much. Since your program will be dealing with a large input (lots of files, esp in Group A), I think it would be better to get it over with in one file read, and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide to use this approach
Hope this helps
You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you to be optimal) and merge on the fly. Possibly using a custom class to encapsulate the IO; the actual records could then be accessed through the generator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
def __init__(self, env_filenames=[], event_filenames=[]):
# open the files etc.
def next(self):
if self._cache = []:
# take care of reading more records
else:
# return the first record and pop it from the cache
# ... other stuff you need ...

How to store a dynamic List into MySQL column efficiently?

I want to store a list of numbers along with some other fields into MySQL. The number of elements in the list is dynamic (some time it could hold about 60 elements)
Currently I'm storing the list into a column of varchar type and the following operations are done.
e.g. aList = [1234122433,1352435632,2346433334,1234122464]
At storing time, aList is coverted to string as below
aListStr = str(aList)
and at reading time the string is converted back to list as below.
aList = eval(aListStr)
There are about 10 million rows, and since I'm storing as strings, it occupies lot space. What is the most efficient way to do this?
Also what should be the efficient way for storing list of strings instead of numbers?
Since you wish to store integers, an effective way would be to store them in an INT/DECIMAL column.
Create an additional table that will hold these numbers and add an ID column to relate the records to other table(s).
Also what should be the efficient way
for storing list of strings instead of
numbers?
Beside what I said, you can convert them to HEX code which will be very easy & take less space.
Note that a big VARCHAR may influence badly on the performance.
VARCHAR(2) and VARCHAR(50) does matter when actions like sotring are done, since MySQL allocates fixed-size memory slices for them, according to the VARCHAR maximum size.
When those slices are too large to store in memory, MySQL will store them on disk.
MySQL also has a SET type, it works like ENUM but can hold multiple items.
Of course you'd have to have a limited list, currently MySQL only supports up to 64 different items.
I'd be less worried about storage space and more worried about record retrieveal i.e., indexability/searching.
For example, I imagine performing a LIKE or REGEXP in a WHERE clause to find a single item in the list will be quite bit more expensive than if you normalized each list item into a row in a separate table.
However, if you never need to perform such queries agains these columns, then it just won't matter.
Since you are using relational database you should know that storing non-atomic values in individual fields breaks even the first normal form. More likely than not you should follow Don's advice and keep those values in related table. I can't say that for certain because I don't know your problem domain. It may well be that choosing RDBMS for this data was a bad choice altogether.

Categories

Resources