How to convert a very long data string to a NumPy array in Python?

I'm trying to convert a very long string (like '1,2,3,4,5,6,...,n' with n = 60M) to a numpy.array.
I have tried converting the string to a list and then using numpy.array() to convert the list to an array.
The problem is that converting the string to a list uses a lot of memory (please let me know if you know why; sys.getsizeof(list) reports much less than the memory actually used).
I have also tried numpy.fromstring(), but it seems to take a very long time (I waited a long time and still got no result).
Is there a method that uses less memory and is more efficient, other than separating the string into many pieces?
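One possible way to avoid the intermediate list, sketched under the assumption that the string holds only comma-separated integers that fit in 64 bits: scan the string lazily with re.finditer and append each value to a compact array.array. The helper name parse_ints is made up for this illustration, and nothing here has been timed on a 60M-element input.

```python
import array
import re

def parse_ints(text):
    """Parse comma-separated integers into a compact array of signed 64-bit values."""
    numbers = array.array('q')
    # finditer yields one match at a time, so no list of substrings is ever built.
    numbers.extend(int(m.group()) for m in re.finditer(r'-?\d+', text))
    return numbers

values = parse_ints('1,2,3,40,-5')
print(values.tolist())
```

Because array.array exposes the buffer protocol, numpy.frombuffer(values, dtype=numpy.int64) should be able to wrap the result without another copy.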

Python strings are immutable, so when you change a string value in your program, the previous value remains in one part of memory and the changed string is placed in a new part of RAM.
As a result, the old values stay in RAM unused until they are reclaimed.
The garbage collector exists for this purpose and cleans old, unused values out of RAM, but that takes time.
You can also trigger a collection yourself with gc.collect().
You can use the gc module to inspect the different generations of your objects.
See this:
import gc
print(gc.get_count())
result:
(596, 2, 1)
In this example, the counter for the youngest generation is at 596, the counter for the next generation is at two, and the counter for the oldest generation is at one. (gc.get_threshold(), by contrast, returns the thresholds at which each generation is collected, not the current counts.)
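A minimal sketch of the gc calls involved, for comparison. gc.get_threshold() reports the configured collection thresholds, gc.get_count() reports the current per-generation counters, and gc.collect() forces a full collection. The printed numbers vary between interpreters and runs, so none are shown here.

```python
import gc

# Thresholds that trigger a collection of each generation (configuration values).
print(gc.get_threshold())

# Current collection counters, one per generation.
print(gc.get_count())

# Force a full collection; the return value is the number of unreachable objects found.
unreachable = gc.collect()
print(unreachable)
```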

Related

Why memory space allocation is different for the same objects?

I was experimenting with how Python allocates memory and ran into the same issue as in "Size of list in memory", which Eli describes much better than I could. His answer led me to a new doubt: I compared the combined size of 1 and [] with the size of [1], and they differ, as you can see in the code snippet. If I'm not wrong, the memory allocated should be the same, but that's not the case. Can anyone help me understand this?
>>> import sys
>>> sys.getsizeof(1)
28
>>> sys.getsizeof([])
64
>>> 28 + 64
92
>>> sys.getsizeof([1])
72
What's the minimum information a list needs to function?
Some kind of top-level list object, containing a reference to the class information (methods, type info, etc.) and the list's own instance data.
The actual objects stored in the list.
... that gets you the size you expected. But is it enough?
A fixed-size list object can only track a fixed number of list entries: traditionally just one (head) or two (head and tail).
Adding more entries to the list doesn't change the size of the list object itself, so there must be some extra information: the relationship between list elements.
It's possible to store this information in every object (this is called an intrusive list), but that is very restrictive: each object can only be stored in one list at a time.
Since Python lists clearly don't behave like that, we know this extra information isn't already in the list element, and it can't be inside the fixed-size list object, so it must be stored elsewhere, which increases the total size of the list.
NB. I've kept this argument fairly abstract deliberately. You could implement list a few different ways, but none of them avoid some extra storage for element relationships, even if the representation differs.
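To make the abstract argument concrete, here is a small check that assumes CPython, where a list keeps its elements as an array of object references. Under that assumption sys.getsizeof([1]) is the empty-list size plus one reference, and it does not include the size of the integer 1 itself, which is why the result is smaller than 28 + 64.

```python
import struct
import sys

pointer_size = struct.calcsize("P")      # size of one object reference in bytes

empty = sys.getsizeof([])                # list header only
one = sys.getsizeof([1])                 # header plus storage for one reference
many = sys.getsizeof([None] * 1000)      # header plus storage for 1000 references

# Each extra element costs one reference, whatever the element's own size is.
print(one - empty, pointer_size)
print((many - empty) // 1000, pointer_size)
```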

getsizeof returns the same value for seemingly different lists

I have the following two dimensional bitmap:
num = 521
arr = [i == '1' for i in bin(num)[2:].zfill(n*n)]
board = [arr[n*i:n*i+n] for i in xrange(n)]
Just out of curiosity I wanted to check how much more space it would take if it held integers instead of booleans. So I checked the current size with sys.getsizeof(board) and got 104.
After that I changed the line to
arr = [int(i) for i in bin(num)[2:].zfill(n*n)]
but still got 104.
Then I decided to see how much I would get with just strings:
arr = [i for i in bin(num)[2:].zfill(n*n)]
which still shows 104.
This looks strange, because I expected a list of lists of strings to use far more memory than just booleans.
Apparently I am missing something about how getsizeof calculates the size. Can anyone explain why I get these results?
P.S. Thanks to zehnpard's answer, I see that I can use sum(sys.getsizeof(i) for line in board for i in line) to approximately count the memory (most probably it will not count the lists themselves, which is not that important for me). Now I see the difference in numbers between strings and int/bool (there is no difference between int and boolean).
The docs for the sys module since Python 3.4 are pretty explicit:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
Given that Python lists are effectively arrays of pointers to other Python objects, the number of elements a Python list contains will influence its size in memory (more pointers) but the type of objects contained will not (memory-wise, they aren't contained in the list, just pointed at).
To get the size of all items in a container, you need a recursive solution, and the docs helpfully provide a link to an ActiveState recipe.
http://code.activestate.com/recipes/577504/
Given that this recipe is for Python 2.x, I'm sure this behavior was always standard and has simply been mentioned explicitly in the docs from 3.4 onwards.
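A rough sketch of such a recursive measurement follows. It is not the linked recipe: total_size is a hypothetical helper written for this illustration, it only descends into the built-in containers listed in the code, and the result is an approximation.

```python
import sys

def total_size(obj, seen=None):
    """Approximate size of obj plus everything it references through containers."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

board_bools = [[True, False], [False, True]]
board_strs = [['1', '0'], ['0', '1']]

# The outer lists hold the same number of references, so getsizeof agrees on them.
print(sys.getsizeof(board_bools) == sys.getsizeof(board_strs))
# The recursive totals also count the inner lists and their elements.
print(total_size(board_bools), total_size(board_strs))
```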

need fast date converter for pytables

I have to convert a lot of CSV data to a PyTables table. I can do the job in 5 hours if I just store the dates as strings. But that's not useful for query operations, so I would like to store them as integers, or in some format that makes searches quicker.
Here's what I have tried:
np.datetime64(date)
This is fast, but pytables will not store it directly, as I write with numpy structured arrays and type 'M8' is not accepted.
Converting to int64 using astype slows the process considerably.
ts = time.strptime(date, '%m/%d/%Y')
calendar.timegm(ts)
Too slow: it causes the total processing time to go up to 15 hours.
I just want some kind of number to represent the day number since 2000. I don't need hours or seconds.
Any ideas?
I wonder if you could improve on that by using the slow method, but caching the results in a dictionary after computation. So 1) check a (possibly global) dictionary to see if that string exists as a key; if so, use the value for that key. 2) if not, then compute the date for the string. 3) add the string/date as a key/value in the dictionary for next time. Assuming you have a lot of duplicates, which you must (because it sounds like you have a gigantic pile of data, and there aren't that many distinct days between 2000 and now) then you will get a fantastic cache hit rate. Fetching from a dictionary is an O(1) operation; that should improve things a lot.
This is kind of late, but I've written a fast Cython-based converter precisely for this kind of task:
https://bitbucket.org/mrkafk/fastdateconverter
Essentially, you give it a date format and it generates Cython code that is then compiled into a Python extension, which is what makes it so fast. See the example in date_converter_generator.py:
fdef1 = FunDef('convert_date_fast', '2014/01/07 10:15:08', year_offset=0,
               month_offset=5, day_offset=8, hour_offset=11, minute_offset=14, second_offset=17)
cg = ConverterGenerator([fdef1])
cg.benchmark()

using strings as python dictionaries (memory management)

I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict
ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, a Python dictionary allows efficient lookup but is VERY bad in memory usage: if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is explicitly using a hash function (either the builtin hash function, one I implement myself, or the hashlib module) and, instead of ht[s]+=1, doing:
ht[hash(s)]+=1
This way the key stored in the array is an int (which will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? Any other Pythonic solutions?
Thanks!
Yes, a dict stores the key in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: the digest is 16 bytes, so a collision is unlikely.
Try BerkeleyDB for a disk-based approach.
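The MD5 suggestion might be sketched as follows, keeping only the 16-byte digest of each sentence as the dictionary key. The function name count_by_digest and the UTF-8 encoding are assumptions made for this example. Note that each digest is still a separate bytes object with its own overhead, so the saving depends on how long the sentences are.

```python
import hashlib
from collections import defaultdict

def count_by_digest(sentences):
    """Count sentences by the MD5 digest of their text instead of the text itself."""
    counts = defaultdict(int)
    for s in sentences:
        digest = hashlib.md5(s.encode('utf-8')).digest()   # 16 bytes per key
        counts[digest] += 1
    return counts

counts = count_by_digest(['the cat sat', 'a dog ran', 'the cat sat'])
repeated = [digest for digest, n in counts.items() if n > 1]
print(len(counts), len(repeated))
```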
Python dicts are indeed monsters in memory. You can hardly work with millions of keys when storing anything larger than integers. Consider the following code:
import random
d = {}
for x in xrange(5000000):  # that's 5 million
    d[x] = random.getrandbits(BITS)
With BITS = 64 it takes 510 MB of my RAM, with 128 it takes 550 MB, with 256 it takes 650 MB and with 512 it takes 830 MB. Increasing the number of iterations to 10 million doubles the memory usage. However, consider this snippet:
for x in xrange(5000000):  # that's 5 million
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, use one 128-bit integer, like this:
for x in xrange(5000000):  # still 5 million
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
That roughly halves the memory usage.
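For completeness, a small sketch of packing two 64-bit values into one integer and getting them back again. The helper names pack_pair and unpack_pair are made up here, and both inputs are assumed to be non-negative and below 2**64.

```python
MASK64 = (1 << 64) - 1

def pack_pair(low, high):
    """Combine two unsigned 64-bit integers into a single 128-bit integer."""
    return low | (high << 64)

def unpack_pair(value):
    """Split a packed value back into its (low, high) 64-bit halves."""
    return value & MASK64, value >> 64

packed = pack_pair(123456789, 987654321)
print(unpack_pair(packed))
```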
It depends on your actual memory limit and number of sentences, but you should be safe using dictionaries with 10-20 million keys when storing just integers. Your idea with hashes is good, but you probably want to keep a pointer to the sentence, so that in case of a collision you can investigate (compare the sentences character by character and probably print them out). You could build the pointer as an integer, for example by encoding the file number and the offset in it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)  # make it integer
    hash_value = hash(s)  # make it integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where each key is a pointer to a sentence and each value is a list of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through md5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter ... Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document with many, I would suggest looking into Bloom filters. You can encode the data however you find most efficient for your problem.
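A minimal Bloom filter sketch, to show the shape of the idea. The class, its default sizes and the use of salted SHA-256 for the hash positions are all choices made for this illustration, not a tuned implementation. A hit means "possibly seen before", so candidates still need an exact comparison.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit set that answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode('utf-8')
        for seed in range(self.num_hashes):
            # Prefix the data with the seed to derive several independent positions.
            digest = hashlib.sha256(str(seed).encode('ascii') + b':' + data).digest()
            yield int.from_bytes(digest[:8], 'little') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
candidates = []
for sentence in ['the cat sat', 'a dog ran', 'the cat sat']:
    if sentence in seen:
        candidates.append(sentence)   # possible duplicate, verify exactly later
    else:
        seen.add(sentence)
print(candidates)
```

The memory used is fixed by num_bits regardless of sentence length, at the price of a false-positive rate that grows as more items are added.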

Improve speed of reading and converting from binary file?

I know there have been some questions about file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code that I think is taking too long to run. The file being read is a multichannel data sample recording (short integers), with interleaved intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read on a dual-core machine with 2Mb RAM, and my files are typically 20+ Mb, which gives a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": badly arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or normal to Python, and if reading speed
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(Of course) if anyone can suggest a modification to this code, preferably based on previous experience with similar operations.
Thanks for reading.
(I have already posted a few questions about this job of mine. I hope they are all conceptually unrelated, and I also hope I am not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by @eumiro (removed the mistyped brackets).
Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I gladly thank everyone who kindly answered.
Final form, after calling array.fromfile() once and then extending one array per channel in turn by slicing the big array:
fullsamples = array('h')
# number of items left in the file: remaining bytes divided by the item size
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) // fullsamples.itemsize)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position+samples])
        position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
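A self-contained sketch of the same approach, for readers who want to try it outside the original class. The function names, the offset parameter for a header, and the nsamples mapping (channel name to samples per interval, in on-disk order) are assumptions made for this illustration.

```python
import array
import os

def read_shorts(path, offset=0):
    """Read every 16-bit sample after `offset` bytes with a single fromfile() call."""
    values = array.array('h')
    count = (os.path.getsize(path) - offset) // values.itemsize
    with open(path, 'rb') as f:
        f.seek(offset)
        values.fromfile(f, count)
    return values

def split_channels(values, nrecs, nsamples):
    """Distribute interleaved samples; nsamples maps channel name -> samples per interval."""
    recordings = {name: array.array('h') for name in nsamples}
    position = 0
    for _ in range(nrecs):
        for name, n in nsamples.items():   # dict order is taken as the on-disk channel order
            recordings[name].extend(values[position:position + n])
            position += n
    return recordings
```

A call such as split_channels(read_shorts('data.bin', offset=header_size), nrecs, {'ch1': 256, 'ch2': 128}) would then return one array per channel; the file name, header size and channel sizes in that call are placeholders.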
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is 40 times faster than struct.unpack from @samplebias's answer.
If the files are only 20-30 MB, why not read the entire file, decode the numbers in a single call to unpack, and then distribute them among your channels by iterating over the result:
data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) // 2), data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20 MB file.
A single array fromfile call is definitely fastest, but it won't work if the data series is interleaved with other value types.
In such cases, another big speed increase, which can be combined with the previous struct answers, is to precompile a struct.Struct object with the format for each chunk instead of calling the unpack function multiple times. From the docs:
Creating a Struct object once and calling its methods is more
efficient than calling the struct functions with the same format since
the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
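One way the missing end-of-file handling could be filled in is sketched below. It uses a precompiled Struct for a single record together with iter_unpack, and it assumes a little-endian, unpadded "<hf" layout, which may not match a real file (the native "hf" format inserts padding between the short and the float). The name read_records is hypothetical.

```python
import struct

RECORD = struct.Struct("<hf")   # assumed layout: one short followed by one float, no padding

def read_records(fileobj, chunk_records=1000):
    """Yield (short, float) tuples, reading chunk_records records per read() call."""
    chunk_bytes = RECORD.size * chunk_records
    while True:
        data = fileobj.read(chunk_bytes)
        # Ignore a trailing partial record, if any.
        usable = len(data) - len(data) % RECORD.size
        if usable == 0:
            break
        for record in RECORD.iter_unpack(data[:usable]):
            yield record
        if len(data) < chunk_bytes:   # short read means end of file
            break
```

The format string is still compiled only once, and the last, shorter chunk needs no special-case format.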
extend() accepts any iterable, that is to say, instead of .extend([...]) you can write .extend(...). It is likely to speed up the program because extend() then consumes a generator instead of a fully built list.
There is an inconsistency in your code: you first define channel_content = {}, and after that you call channel_content[channel]['recording'].extend(...), which requires that the key channel and the subkey 'recording' already exist, with a list as the value to extend.
What is the nature of self.channel_content[channel]['nsamples'], such that it can be passed to int()?
Where does number_of_intervals come from? What is the nature of the intervals?
Inside the for rec in xrange(number_of_intervals): loop, rec is never used. So it seems to me that you are repeating the same for channel in channel_names: loop as many times as number_of_intervals says. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f?
I read in the doc:
class struct.Struct(format)
Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility of using dict() with a generator as its argument.
EDIT
I propose
# assumes channel_content[channel]['recording'] already exists as a list
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(channel_content[channel]['nsamples'])
        # read and decode all N shorts of this interval in one call
        channel_content[channel]['recording'].extend(struct.unpack(str(N) + "h", f.read(2 * N)))
I don't know how to take J.F. Sebastian's suggestion of using array into account.
Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time like this:
s = f.read(100)
struct.unpack(str(len(s) // 2) + "h", s)
