Storing MongoDB ObjectIDs in Pandas - python

Once I've retrieved data from MongoDB and loaded it into a Pandas dataframe, what is the recommended practice for storing hexadecimal ObjectIDs?
I presume that, stored as strings, they take a lot of memory, which can be limiting in very large datasets. Is it a good idea to convert them to integers (from hex to dec)? Wouldn't that decrease memory usage and speed up processing (merges, lookups...)?
And BTW, here's how I'm doing it. Is this the best way? It unfortunately fails with NaN.
tank_hist['id'] = pd.to_numeric(tank_hist['id'].apply(lambda x: int(str(x), base=16)))

First of all, I think you're getting NaN because ObjectIDs are bigger than a 64-bit integer. Python itself can handle arbitrarily large integers, but the underlying pandas/numpy types cannot.
I think you want to use a map to extract some useful fields that you could later do multi-level sorting on. I'm not sure you'll see the performance improvement you're expecting, though.
I'd start by creating new "oid_*" series in your frame and checking your results.
https://docs.mongodb.com/manual/reference/method/ObjectId/
breaks down the ObjectId into its components:
timestamp (4 bytes),
random (5 bytes - used to be host identifier), and
counter (3 bytes)
These are integer sizes that numpy can handle comfortably.
tank_hist['oid_timestamp'] = tank_hist['id'].map(lambda x: int(str(x)[:8], 16))
tank_hist['oid_random'] = tank_hist['id'].map(lambda x: int(str(x)[8:18], 16))
tank_hist['oid_counter'] = tank_hist['id'].map(lambda x: int(str(x)[18:], 16))
This would allow you to sort primarily on the timestamp series, secondarily on some other series in the frame, and third on the counter.
Maps are a super helpful (though slow) way to poke every record in your series. Realize that you are adding compute time here in exchange for saving compute time later.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
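As a hedged follow-up (not part of the original answer): once the components are extracted, they can be downcast to compact integer dtypes, assuming tank_hist['id'] holds standard 24-character hex ObjectId strings.
# Downcast sketch: the 4-byte timestamp and 3-byte counter fit in uint32, the 5-byte random part needs uint64.
tank_hist['oid_timestamp'] = tank_hist['oid_timestamp'].astype('uint32')
tank_hist['oid_random'] = tank_hist['oid_random'].astype('uint64')
tank_hist['oid_counter'] = tank_hist['oid_counter'].astype('uint32')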

Related

python how to convert a very long data string to numpy array?

I'm trying to convert a very long string (like '1,2,3,4,5,6,...,n' with n = 60M) to a numpy.array.
I have tried converting the string to a list, then using numpy.array() to convert the list to an array.
But the problem is that a lot of memory is used when the string is converted to a list (please let me know if you know why; sys.getsizeof(list) is much less than the memory actually used).
I have also tried numpy.fromstring(), but it seems to take a very long time (I waited a long time but still got no result).
Are there any methods that can reduce memory use and be more efficient, other than splitting the string into many pieces?
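For reference, a minimal sketch of the two approaches described above; the variable s and the comma-separated integer format are assumptions based on the question.
import numpy as np
s = ','.join(str(i) for i in range(1000))  # stand-in for the very long '1,2,3,...' string
arr_via_list = np.array(s.split(','), dtype=np.int64)  # list route: builds a large temporary list first
arr_via_fromstring = np.fromstring(s, dtype=np.int64, sep=',')  # fromstring route the question mentions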
When you have a string value and change it in your program, the previous value remains in one part of memory and the changed string is placed in a new part of RAM.
As a result, the old values in RAM remain unused.
The garbage collector is what cleans your RAM of these old, unused values, but that takes time.
You can also trigger this yourself.
You can use the gc module to see the different generations of your objects.
See this:
import gc
print(gc.get_count())  # gc.get_count() reports per-generation object counts; gc.get_threshold() reports the collection thresholds
result:
(596, 2, 1)
In this example, we have 596 objects in our youngest generation, two objects in the next generation, and one object in the oldest generation.
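If you want to force a collection yourself instead of waiting for the thresholds to be hit, a minimal sketch (the list of temporary strings is only illustrative):
import gc
big = [str(i) for i in range(1000000)]  # allocate a lot of temporary strings
del big  # drop the reference
unreachable = gc.collect()  # run a full collection right now
print(unreachable)  # number of unreachable objects the collector found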

Memory-efficient way to merge large list of series into dataframe

I have a large list of pandas.Series that I would like to merge into a single DataFrame. This list is produced asynchronously using multiprocessing.imap_unordered and new pd.Series objects come in every few seconds. My current approach is to call pd.DataFrame on the list of series as follows:
timeseries_lst = []
counter = 0
for timeseries in pool.imap_unordered(CPU_intensive_function, args):
    if counter % 500 == 0:
        logger.debug(f"Finished creating timeseries for {counter} out of {nr_jobs}")
    counter += 1
    timeseries_lst.append(timeseries)
timeseries_df = pd.DataFrame(timeseries_lst)
The problem is that during the last line, my available RAM is all used up (I get an exit code 137 error). Unfortunately it is not possible to provide a runnable example, because the data is several hundred GB in size. Increasing the swap memory is not a feasible option, since the available RAM is already quite large (about 1 TB) and a bit of swap is not going to make much of a difference.
My idea is that one could, at regular intervals of maybe 500 iterations, add the new series to a growing dataframe. This would allow timeseries_lst to be cleared and thereby reduce RAM usage. My question however is: what is the most efficient approach to do so? The options that I can think of are:
Create small dataframes with the new data and merge into the growing dataframe
Concat the growing dataframe and the new series
Does anybody know which of these two would be more efficient? Or maybe have a better idea? I have seen this answer, but it would not really reduce RAM usage, since the small dataframes would need to be held in memory.
Thanks a lot!
Edit: Thanks to Timus, I am one step further
Pandas uses the following code when creating a DataFrame:
elif is_list_like(data):
    if not isinstance(data, (abc.Sequence, ExtensionArray)):
        data = list(data)  # <-- We don't want this
So what would a generator function have to look like to be considered an instance of either abc.Sequence or ExtensionArray? Thanks!
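A minimal sketch of option 1 from the question (periodically flushing the batch into the growing DataFrame); the 500-row flush interval and the names pool, CPU_intensive_function and args come from the question, everything else is an assumption.
import pandas as pd
def flush(growing, batch):
    # build a small frame from the batch and concat it onto the growing frame
    batch_df = pd.DataFrame(batch)
    return batch_df if growing is None else pd.concat([growing, batch_df])
timeseries_df = None
timeseries_lst = []
for counter, timeseries in enumerate(pool.imap_unordered(CPU_intensive_function, args)):
    timeseries_lst.append(timeseries)
    if (counter + 1) % 500 == 0:
        timeseries_df = flush(timeseries_df, timeseries_lst)
        timeseries_lst = []  # clear the batch so the raw Series objects can be freed
if timeseries_lst:  # flush the leftover batch
    timeseries_df = flush(timeseries_df, timeseries_lst)
Note that each concat copies the growing frame, so this trades lower peak memory for extra copying.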

need fast date converter for pytables

I have to convert a lot of csv data to a pytable. I can do the job in 5 hours if I just store the dates as strings. But that's not useful for query operations, so I would like them as integers, or some format that makes searches quicker.
Here's what I have tried:
np.datetime64(date)
This is fast, but pytables will not store it directly, as I write with numpy structured arrays and type 'M8' is not accepted.
Converting to int64 using astype slows the process considerably.
ts = time.strptime(date, '%m/%d/%Y')
calendar.timegm(ts)
Too slow. It causes the total processing time to go up to 15 hours.
I just want some kind of number to represent a day number since 2000. I don't need hours or seconds.
Any ideas?
I wonder if you could improve on that by using the slow method, but caching the results in a dictionary after computation. So: 1) check a (possibly global) dictionary to see if that string exists as a key; if so, use the value for that key; 2) if not, compute the date for the string; 3) add the string/date as a key/value pair in the dictionary for next time. Assuming you have a lot of duplicates, which you must (it sounds like you have a gigantic pile of data, and there aren't that many distinct days between 2000 and now), you will get a fantastic cache hit rate. Fetching from a dictionary is an O(1) operation; that should improve things a lot.
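A minimal sketch of that caching idea, combined with the day-number-since-2000 representation asked for above; the '%m/%d/%Y' format comes from the question, the function and cache names are assumptions.
from datetime import datetime, date
_EPOCH = date(2000, 1, 1).toordinal()
_cache = {}
def day_number(date_str):
    if date_str in _cache:  # 1) reuse the cached value for a string we've already parsed
        return _cache[date_str]
    days = datetime.strptime(date_str, '%m/%d/%Y').date().toordinal() - _EPOCH  # 2) slow parse, done once per distinct string
    _cache[date_str] = days  # 3) remember it for next time
    return days
print(day_number('01/07/2014'))  # small integer: days since 2000-01-01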
This is kind of late, but I've written a fast Cython-based converter precisely for this kind of task:
https://bitbucket.org/mrkafk/fastdateconverter
Essentially, you give it a date format and it generates Cython code that is then compiled into a Python extension. That is what makes it so fast; see the example in date_converter_generator.py:
fdef1 = FunDef('convert_date_fast', '2014/01/07 10:15:08', year_offset=0,
               month_offset=5, day_offset=8, hour_offset=11, minute_offset=14, second_offset=17)
cg = ConverterGenerator([fdef1])
cg.benchmark()

finding a duplicate in a hdf5 pytable with 500e6 rows

Problem
I have a large (> 500e6 rows) dataset that I've put into a pytables database.
Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows that I'm trying to find.
As a starter I've done something like this:
index1 = db.cols.id.create_index()
index2 = db.cols.counts.create_index()
for row in db:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print row
It's a brute force method I'll admit. Any suggestions on improvements?
update
Current brute-force runtime is 8421 minutes.
solution
Thanks for the input everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)
ex.eval()
indexrows = th.cols.hash.create_csindex(filters=filters)
ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']
print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit packing the two columns into a 32 bit int.
Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.
Two obvious techniques come to mind: hashing and sorting.
A) define a hash function to combine ID and Counter into a single, compact value.
B) count how often each hash code occurs
C) select from your data all that has hash collisions (this should be a ''much'' smaller data set)
D) sort this data set to find duplicates.
The hash function in A) needs to be chosen such that it fits into main memory and at the same time provides enough selectivity. Maybe use two bitsets of 2^30 bits or so for this. You can afford to have 5-10% collisions; this should still reduce the data set size enough to allow fast in-memory sorting afterwards.
This is essentially a Bloom filter.
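A minimal sketch of steps A)-C), under the question's assumption that both values fit in 16 bits; the rows iterable, the bitset size, and the helper names are placeholders.
SIZE = 2 ** 30  # bitset size suggested above: 2^30 bits, i.e. 128 MB per bitset
seen_once = bytearray(SIZE // 8)
seen_twice = bytearray(SIZE // 8)
def mark(bits, h):
    bits[h >> 3] |= 1 << (h & 7)
def is_set(bits, h):
    return bits[h >> 3] & (1 << (h & 7))
for row_id, row_count in rows:  # rows is a placeholder for iterating the table
    h = ((row_id << 16) | row_count) % SIZE  # A) compact hash of the ID/counter pair
    if is_set(seen_once, h):
        mark(seen_twice, h)  # B) this hash has now been seen more than once
    else:
        mark(seen_once, h)
# C) a second pass keeps only rows whose hash lands in seen_twice; that much smaller
# set can then be sorted in memory to confirm the true duplicates (D).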
The brute force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.
I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.
In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing the non-default values optlevel=9 and kind='full', or equivalently by generating the index with a call to create_csindex() instead. According to the documentation, you pay a little more upfront by taking a longer time to create a better-optimized index, but it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
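A minimal sketch of that suggestion, reusing the column handles from the question's code (everything else stays the same):
index1 = db.cols.id.create_csindex()  # completely sorted index, equivalent to optlevel=9, kind='full'
index2 = db.cols.counts.create_csindex()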
If optimizing your pytables column indices fails to speed up your code sufficiently, and you just want to perform a massive sort on all of the rows and then search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
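For completeness, a minimal sketch of that chunked external sort, not tied to pytables; the chunk size, the temporary-file handling, and the packed_values name are all assumptions.
import heapq, os, tempfile
def external_sort(values, chunk_size=1000000):
    # Sort an iterable of ints that doesn't fit in memory; yields the values in order.
    paths, chunk = [], []
    def flush():
        chunk.sort()
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'w') as f:
            f.writelines("%d\n" % v for v in chunk)
        paths.append(path)
        chunk.clear()
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()
    files = [open(p) for p in paths]
    try:
        for line in heapq.merge(*files, key=int):  # lazily merge the already-sorted runs
            yield int(line)
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.remove(p)
prev = None
for v in external_sort(packed_values):  # packed_values: the (id << 16) | counter stream from the question
    if v == prev:  # duplicates appear as equal adjacent values in the sorted stream
        print("duplicate id, counts:", v >> 16, v & 0xFFFF)
    prev = v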

using strings as python dictionaries (memory management)

I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict
ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use python, but I'm beginning to think that python is not the best choice for this task. Am I wrong about it? Is there a reasonable way to do it with python?
If I understand correctly, python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a python dictionary allows efficient lookup but is VERY bad in memory usage: if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is explicitly using a hash function (either using the built-in hash function, implementing one, or using the hashlib module) and, instead of ht[s]+=1, inserting:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, dicts store the keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it gives a 16-byte digest, so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
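A minimal sketch of that suggestion, counting by MD5 digest instead of by the full sentence (sentences is assumed to be an iterable of strings):
import hashlib
from collections import defaultdict
ht = defaultdict(int)
for s in sentences:
    key = hashlib.md5(s.encode('utf-8')).digest()  # 16-byte digest key instead of the whole sentence
    ht[key] += 1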
Python dicts are indeed monsters in memory. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code (BITS is 64, 128, 256 or 512 in the measurements below):
import random
d = {}
for x in xrange(5000000):  # it's 5 millions
    d[x] = random.getrandbits(BITS)
For BITS=64 it takes 510 MB of my RAM, for BITS=128 550 MB, for BITS=256 650 MB, for BITS=512 830 MB. Increasing the number of iterations to 10 million increases the memory usage by a factor of 2. However, consider this snippet:
for x in xrange(5000000):  # it's 5 millions
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, use one 128-bit integer, like this:
for x in xrange(5000000):  # it's still 5 millions
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll reduce memory usage by a factor of two.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when using just integers. You have a good idea with hashes, but you probably want to keep a pointer to the sentence, so in case of a collision you can investigate (compare the sentences char by char and possibly print them out). You could create the pointer as an integer, for example by encoding the file number and the offset within it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)  # make it an integer
    hash_value = hash(s)  # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where the key is a pointer to a sentence and the value is an array of pointers it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through md5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter ... Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document with many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
