I have to convert a lot of CSV data to a PyTables table. I can do the job in 5 hours if I just store the dates as strings, but that's not useful for querying, so I would like them as integers, or some format that makes searches quicker.
Here's what I have tried:
np.datetime64(date)
This is fast, but PyTables will not store it directly, as I write with NumPy structured arrays and the 'M8' (datetime64) dtype is not accepted.
Converting to int64 using astype slows the process considerably.
ts = time.strptime(date, '%m/%d/%Y')
calendar.timegm(ts)
Too slow; it pushes total processing time to 15 hours.
I just want some kind of number representing the day number since 2000; I don't need hours or seconds.
Any ideas?
I wonder if you could improve on that by using the slow method but caching the results in a dictionary after computation:
1) check a (possibly global) dictionary to see if that string exists as a key; if so, use the value for that key;
2) if not, compute the date for the string;
3) add the string/date as a key/value pair in the dictionary for next time.
Assuming you have a lot of duplicates, which you must (it sounds like you have a gigantic pile of data, and there aren't that many distinct days between 2000 and now), you will get a fantastic cache hit rate. Fetching from a dictionary is an O(1) operation, so that should improve things a lot.
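A minimal sketch of that caching idea, assuming the '%m/%d/%Y' format from the question (the helper name and the day-number arithmetic are my own illustration):

import calendar
import time

_EPOCH_2000 = calendar.timegm(time.strptime('01/01/2000', '%m/%d/%Y'))
_cache = {}

def day_number(date_str):
    # Days since 2000-01-01 for an 'mm/dd/yyyy' string, memoized per distinct string.
    day = _cache.get(date_str)
    if day is None:
        ts = calendar.timegm(time.strptime(date_str, '%m/%d/%Y'))
        day = (ts - _EPOCH_2000) // 86400
        _cache[date_str] = day
    return day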
This is kind of late, but I've written a fast Cython-based converter precisely for this kind of task:
https://bitbucket.org/mrkafk/fastdateconverter
Essentially, you give it a date format and it generates Cython code that is then compiled into a Python extension, which is what makes it so fast. See the example in date_converter_generator.py:
fdef1 = FunDef('convert_date_fast', '2014/01/07 10:15:08', year_offset=0,
month_offset=5, day_offset=8, hour_offset=11, minute_offset=14, second_offset=17)
cg = ConverterGenerator([fdef1])
cg.benchmark()
I'm making a Python app with mongoengine, where I have a MongoDB database of n users and each user holds n daily records. I have a list of n new records per user that I want to add to my DB.
I want to check whether a record for a certain date already exists for a user before adding a new record to that user.
What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that's an O(n^2) algorithm and took 5 solid seconds for 300 records, which is too long. Below is an abbreviated version of the code.
There's gotta be a better way to query, right? I tried accessing something like user.records.date, but that throws a "not found" error.
import mongoengine

# snippet here is abbreviated and does not run
# zone of interest is in conditional_insert()

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

class User(mongoengine.Document):
    # meta{}
    # account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date):  # I had to compare as strings in my program
            # because Python kept converting the datetime obj to str
            return
    # if no record with a duplicate date was found, insert the new record
    save_record(user, new_record)

def save_record(user, new_record): pass

if __name__ == "__main__":
    lst_to_insert = []  # list of (user, record_to_insert)
    for obj in lst_to_insert:               # O(n)
        conditional_insert(obj[0], obj[1])  # O(n)
    # and I have n lst_to_insert, so in reality I'm currently at O(n^3)
Hi everyone (and future me, who will probably search for the same question 10 years later).
I optimized the code using the idea of a search tree: instead of putting all records in a single list on User, I broke it down by year and month.
class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

# Month is defined before Year so the EmbeddedDocumentListField(Month) reference resolves
class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

class Year(mongoengine.EmbeddedDocument):
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)

class User(mongoengine.Document):
    # meta{}
    # account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)
Because it's MongoDB, I can later partition by decades, heck even centuries, but by that point I don't think this code will still be relevant.
I then group the data to insert by month into separate pandas dataframes and feed each dataframe separately. The data flow thus looks like this (see the sketch after this list):
0) get the monthly df
1) loop through the years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through the months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates
The algorithm to insert with the check is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter much in my use case).
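A rough sketch of that insert-with-check walk (this assumes, beyond the snippet above, that Year and Month also carry their own numeric fields so they can be matched, and that the monthly dataframe has a date column):

# Assumed additions, not shown in the classes above:
#   Year.year   = mongoengine.IntField(required=True)
#   Month.month = mongoengine.IntField(required=True)

def insert_month_df(user, year_num, month_num, month_df):
    # step 1: find (or create) the matching Year
    year = next((y for y in user.yearly_records if y.year == year_num), None)
    if year is None:
        year = Year(year=year_num)
        user.yearly_records.append(year)
    # step 2: find (or create) the matching Month
    month = next((m for m in year.monthly_records if m.month == month_num), None)
    if month is None:
        month = Month(month=month_num)
        year.monthly_records.append(month)
    # step 3: at most 31 daily records to check against
    existing = {r.date for r in month.daily_records}
    for row in month_df.itertuples():
        if row.date not in existing:
            month.daily_records.append(EmbeddedRecord(date=row.date))
            existing.add(row.date)
    user.save()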
Mongo cannot conveniently offer you suitable indexes, very sad.
You frequently iterate over user.records. If you can afford to allocate the memory for 300 users, just iterate once and throw them into a set, which offers O(1) constant-time lookup and RAM speed rather than network latency. When you save a user, also make note of it with cache.add((user_id, str(new_record.date))).
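A minimal sketch of that cache, reusing the User/records/save_record names from the question (building the set up front is my own framing of the answer's idea):

# Build the cache once: a single pass over the existing records.
cache = set()
for user in User.objects:                    # the network round-trips happen here, once
    for r in user.records:
        cache.add((user.id, str(r.date)))

def conditional_insert(user, new_record):
    key = (user.id, str(new_record.date))
    if key in cache:                         # O(1) membership test at RAM speed
        return
    save_record(user, new_record)
    cache.add(key)                           # note the save, as suggested above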
EDIT

If you can't afford the memory for all those (user_id, date) tuples, then sure, a relational database JOIN is fair; it's just an out-of-core merge of sorted records. I have had good results with using sqlalchemy to hit local sqlite (memory or file-backed), or heavier databases like Postgres or MariaDB. Bear in mind that relational databases offer lovely ACID guarantees, but you're paying for those guarantees, and in this application it doesn't sound like you need such properties.

Something as simple as /usr/bin/sort could do an out-of-core ordering operation that puts all of a user's current records right next to his historic records, letting you filter them appropriately. Sleepycat is not an RDBMS, but its B-tree does offer external sorting, sufficient for the problem at hand. (And yes, one can do transactions with sleepycat, but again this problem just needs some pretty pedestrian reporting.)

Bench it and see. Without profiling data for a specific workload, it's pretty hard to tell whether any extra complexity would be worth it. Identify the true memory or CPU bottleneck, and focus just on that. You don't necessarily need ordering, as hashing would suffice, given enough core. Send those tuples to a redis cache, and make it its problem to store them somewhere.
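A sketch of that last idea, assuming the redis-py client and a locally running redis server (the key name and connection details are mine):

import redis

r = redis.Redis(host='localhost', port=6379)

def seen_before(user_id, date_str):
    # Return True if this (user, date) pair was already recorded; otherwise record it.
    member = '%s:%s' % (user_id, date_str)
    if r.sismember('seen_records', member):
        return True
    r.sadd('seen_records', member)
    return False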
I'm trying to convert a very long string (like '1,2,3,4,5,6,...,n' with n = 60M) to a numpy.array.
I have tried converting the string to a list and then using numpy.array() to convert the list to an array. But a lot of memory is used when the string is converted to a list (please let me know if you know why; sys.getsizeof(list) is much less than the memory actually used).
I have also tried numpy.fromstring(), but it seems to take a very long time (I waited a long while and still got no result).
Are there any methods that reduce memory use and are more efficient, other than splitting the string into many pieces?
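For reference, the two attempts described presumably look something like this (my reconstruction, not code from the question):

import numpy as np

s = "1,2,3,4,5"   # in reality ~60M comma-separated numbers

# Attempt 1: build a Python list first, then convert.
# The intermediate list of Python int objects is what eats the memory.
a = np.array([int(x) for x in s.split(',')], dtype=np.int64)

# Attempt 2: let numpy parse the text directly (reported to be very slow here).
b = np.fromstring(s, dtype=np.int64, sep=',')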
When you have a string value and change it in your program, the previous value remains in one part of memory and the changed string is placed in a new part of RAM.
As a result, the old values stay in RAM unused.
The Garbage Collector exists for this purpose and cleans your RAM of old, unused values, but that takes time.
You can also trigger it yourself.
You can use the gc module to inspect the different generations of your objects (note: gc.get_count() reports how many objects are currently tracked in each generation, whereas gc.get_threshold() reports the collection thresholds).
See this:
import gc
print(gc.get_count())
result:
(596, 2, 1)
In this example, we have 596 objects in our youngest generation, two objects in the next generation, and one object in the oldest generation.
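A small sketch of doing it yourself with the gc module (these are standard gc calls; what you collect and when is up to you):

import gc

print(gc.get_count())        # objects currently tracked in each generation
unreachable = gc.collect()   # force a full collection right now
print("collected", unreachable, "unreachable objects")
print(gc.get_count())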
Once I've retrieved data from MongoDB and loaded into a Pandas dataframe, what is the recommended practice with regards to storing hexadecimal ObjectID's?
I presume that, stored as strings, they take a lot of memory, which can be limiting in very large datasets. Is it a good idea to convert them to integers (from hex to dec)? Wouldn't that decrease memory usage and speed up processing (merges, lookups...)?
And BTW, here's how I'm doing it. Is this the best way? It unfortunately fails with NaN.
tank_hist['id'] = pd.to_numeric(tank_hist['id'].apply(lambda x: int(str(x), base=16)))
First of all, I think it's NaN because ObjectIds are bigger than a 64-bit integer. Python can handle that, but the underlying pandas/numpy may not.
I think you want to use a map to extract some useful fields that you could later do multi-level sorting on. I'm not sure you'll see the performance improvement you're expecting, though.
I'd start by adding new "oid_*" series to your frame and checking the results.
https://docs.mongodb.com/manual/reference/method/ObjectId/
breaks down the object id into components with:
timestamp (4 bytes),
random (5 bytes - used to be host identifier), and
counter (3 bytes)
Each of these components fits comfortably in an integer size numpy can deal with, unlike the full 12-byte ObjectId.
tank_hist['oid_timestamp'] = tank_hist['id'].map(lambda x: int(str(x)[:8], 16))
tank_hist['oid_random'] = tank_hist['id'].map(lambda x: int(str(x)[8:18], 16))
tank_hist['oid_counter'] = tank_hist['id'].map(lambda x: int(str(x)[18:], 16))
This would allow you to primary-sort on the timestamp series, secondary-sort on some other series in the frame, and then third-sort on the counter.
Maps are a super helpful (though slow) way to poke every record in your series. Realize that you are adding compute time here in exchange for saving compute time later.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
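For instance, a multi-level sort on the extracted series could then look like this (any column name other than the oid_* ones is just illustrative):

tank_hist = tank_hist.sort_values(by=['oid_timestamp', 'some_other_column', 'oid_counter'])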
In bioinformatics, we do the following transformation an awful lot:
>>> data = {
(90,100):1,
(91,101):1,
(92,102):2,
(93,103):1,
(94,104):1
}
>>> someFunction(data)
{
90:1,
91:2,
92:4,
93:5,
94:6,
95:6,
96:6,
97:6,
98:6,
99:6,
100:6,
101:5,
102:4,
103:2,
104:1
}
Where each tuple in data is always a unique pair.
But there are many ways to do this transform, some significantly better than others. One I have tried is:
newData = {}
for pos, values in data.iteritems():
A,B = pos
for i in xrange(A,B+1):
try: newData[i] += values
except KeyError: newData[i] = values
This has the benefit that it's short and sweet, but I'm not actually sure it is that efficient...
I have a feeling that somehow turning the dict into a list of lists, and then doing the xrange, would save an awful lot of time. We're talking weeks of computational work per experiment. Something like this:
>>> someFunction(data)
[
[90,90,1],
[91,91,2],
[92,92,4],
[93,93,5],
[94,100,6],
[101,101,5],
[102,102,4],
[103,103,2],
[104,104,1]
]
and THEN do the for/xrange loop.
People on #Python have recommended bisect and heapy, but after struggling with bisect all day, I can't come up with a nice algorithm which I can be 100% sure will work all the time. If anyone on here could help or even point me in the right direction, I'd be massively grateful :)
I worked out a solution last night that takes the total run time for one file from roughly 400 minutes to 251 minutes. I would post the code, but it's pretty long and likely to have bugs in the edge cases. For that reason I'll say the 'working' code can be found in the program 'rawSeQL', but the algorithmic improvements that helped the most were:
Looping over the overlapping arrays and flattening them to non-overlapping arrays with a multiplier value made an enormous difference, as xrange() no longer needs to repeat itself.
Using collections.defaultdict(int) made a big difference over the try/except loop above. collections.Counter() and OrderedDict were a LOT slower than the try/except.
I went with bisect_left() to find where to insert the next non-overlapping piece, and it was so-so, but then adding bisect's lo parameter to limit the range of the list it needs to check gave a sizeable reduction in compute time. If you sort the input list, your value for lo is always the last value returned by bisect, which makes this process easy :)
It is possible that heapy would provide even more benefits still, but for now the main algorithmic improvements mentioned above will probably outweigh any compile-time tricks. I have 75 files to process now, which means just these three things save roughly 12,500 days of compute time :)
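A minimal sketch of the collections.defaultdict(int) variant of the original try/except loop (this is only the drop-in counting change, not the full flattening-and-bisect algorithm from rawSeQL):

from collections import defaultdict

def coverage(data):
    # Expand {(start, end): value} intervals into {position: summed value}.
    new_data = defaultdict(int)
    for (a, b), value in data.items():
        for i in range(a, b + 1):
            new_data[i] += value   # missing keys start at 0, so no try/except is needed
    return dict(new_data)

data = {(90, 100): 1, (91, 101): 1, (92, 102): 2, (93, 103): 1, (94, 104): 1}
assert coverage(data)[94] == 6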
I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read on a dual-core with 2Mb RAM, and my files typically have 20+ Mb, which gives a very annoying delay (especially considering another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": badly arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or normal for Python.
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(of course) If anyone suggests some modification to this code, preferably based on previous experience with similar operations.
Thanks for reading
(I have already posted a few questions about this job of mine; I hope they are all conceptually unrelated, and I also hope I'm not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by #eumiro (removed the typoed brackets).
Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I very gladly thank everyone who kindly answered.
Final form, after going with array.fromfile() once and then alternately extending one array per channel by slicing the big array:
from array import array
import os

fullsamples = array('h')
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) // fullsamples.itemsize)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position + samples])
        position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is about 40x faster than struct.unpack from #samplebias's answer.
If the files are only 20-30M, why not read the entire file, decode the numbers in a single call to unpack, and then distribute them among your channels by iterating over the result:
import struct

data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) // 2), data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
A single array.fromfile call is definitely fastest, but it won't work if the data series is interleaved with other value types.
In such cases, another big speed increase that can be combined with the previous struct answers is to precompile a struct.Struct object with the format for each chunk, instead of calling the unpack function multiple times. From the docs:
Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
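One way the end-of-file handling could look, assuming the same interleaved short/float layout as the snippet above (the per-record fallback for the final partial chunk is my own addition, not part of the original answer):

import struct

chunksize = 1000
record = struct.Struct("hf")              # one interleaved short + float
chunk = struct.Struct("hf" * chunksize)   # a full chunk of 1000 such records

values = []
with open("data.bin", "rb") as fileobj:
    while True:
        buf = fileobj.read(chunk.size)
        if len(buf) == chunk.size:
            values.extend(chunk.unpack(buf))   # fast path: whole chunks
            continue
        # final, partial chunk: unpack the remaining complete records one by one
        n_complete = len(buf) // record.size
        for offset in range(0, n_complete * record.size, record.size):
            values.extend(record.unpack_from(buf, offset))
        break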
extend() accepts iterables; that is to say, instead of .extend([...]) you can write .extend(...). It is likely to speed up the program because extend() will then consume a generator rather than a fully built list.
There is an inconsistency in your code: you first define channel_content = {}, and after that you perform channel_content[channel]['recording'].extend(...), which requires the prior existence of the key channel and the subkey 'recording' with a list as its value in order to have anything to extend.
What is the nature of self.channel_content[channel]['nsamples'], such that it can be passed to the int() function?
Where does number_of_intervals come from? What is the nature of the intervals?
In the for rec in xrange(number_of_intervals): loop, I don't see rec used anywhere. So it seems to me that you are repeating the same for channel in channel_names: loop as many times as the number expressed by number_of_intervals. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f?
I read in the docs:
class struct.Struct(format)
Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility of using dict() with a generator as its argument.
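For instance (a generic illustration of the dict()-with-a-generator idiom, not tied to the question's data):

pairs = [('a', 1), ('b', 2), ('c', 3)]
d = dict((key, value * 10) for key, value in pairs)  # the dict is built directly from a generator
# d == {'a': 10, 'b': 20, 'c': 30}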
EDIT
I propose:
channel_content = {}
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(self.channel_content[channel]['nsamples'])
        # make sure the entry exists (see the inconsistency noted above)
        channel_content.setdefault(channel, {'recording': []})
        channel_content[channel]['recording'].extend(struct.unpack(str(N) + "h", f.read(2 * N)))
I don't know how to take J.F. Sebastian's suggestion to use array into account.
Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time, like:
s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)