Numpy.load under the hood - python

I am using numpy.load at runtime, as my application loads different NumPy arrays in response to external events.
My application is very latency-sensitive and I am struggling with numpy.load.
I noticed that the first time I use numpy.load on a particular array (saved as .npy), the loading time is pretty slow (~0.2-0.3 s), but every time I do it again the time drops dramatically, so by the 2nd or 3rd load it is as low as 0.01 s.
I am using the classic syntax:
data = np.load(name)
Later on, I pass data into some processing function and then overwrite the variable data:
data = None
So my question is: what is happening? And if there is some kind of cache, can I load and discard all the arrays at the beginning of the script so that whenever I load an array later, it's fast? If so, will memory suffer?
Thanks in advance
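
A minimal sketch of the warm-up idea described in the question, assuming the repeated-load speed-up comes from the operating system's file cache; the directory and file names below are hypothetical placeholders:

import glob
import numpy as np

def warm_cache(pattern='arrays/*.npy'):
    # Touch every .npy file once at start-up so later np.load calls are served
    # mostly from the OS page cache instead of from disk.
    for path in glob.glob(pattern):
        arr = np.load(path)
        del arr  # the array itself is discarded, so the Python process stays small

warm_cache()
# later, on an external event (hypothetical file name):
data = np.load('arrays/event_42.npy')

The memory cost is paid by the kernel's page cache rather than by the Python process, and the kernel evicts those cached pages under memory pressure, so the warm-up is best-effort rather than a guarantee.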

Related

How to convert images to TFRecords with tf.data.Dataset in the most efficient way possible

I am absolutely baffled by how many unhelpful error messages I've received while trying to use this supposedly simple API to write TFRecords in a manner that doesn't take 30 minutes every time I have a new dataset.
Task:
I'd like to feed a list of image paths and a list of labels to a tf.data.Dataset, parse them in parallel to read the images and encode as tf.train.Examples, use tf.data.Dataset.shard to distribute them into different TFRecord shards (e.g. train-001-of-010.tfrecord, train-002-of-010.tfrecord, etc.), and for each shard finally write them to the corresponding file.
Since I've been debugging this for hours, I haven't gotten any single particular error to fix, otherwise I would provide it. I've struggled to find any up-to-date tutorial that doesn't either (a) come from 2017 and use queue runners, (b) use a tf.Session (I'm using tensorflow 1.15, but the official docs keep telling me to phase out sessions), (c) conveniently do the record creation in pure Python, which makes for a simple tutorial but is too slow for any actual application, or (d) use already-created TFRecords and just skip the whole process.
If necessary, I can put together an example of what I'm talking about. But since I'm getting stuck at every level of the process, at the moment it seems unhelpful.
TL;DR:
If anyone has utilized tf.data.Dataset to create TFRecord shards in parallel, please point me in a better direction than Google has.
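
For reference, a minimal sketch of the kind of pipeline the question describes, assuming eager execution (TF 2.x, or TF 1.15 with eager enabled); the image paths, labels, shard count and the process-pool parallelism are assumptions for illustration, not a verified recipe:

import glob
import tensorflow as tf
from concurrent.futures import ProcessPoolExecutor

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_shard(args):
    image_paths, labels, shard_index, num_shards = args
    # Dataset.shard keeps every num_shards-th element, so each worker sees a disjoint slice.
    dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).shard(num_shards, shard_index)
    filename = 'train-%03d-of-%03d.tfrecord' % (shard_index + 1, num_shards)
    with tf.io.TFRecordWriter(filename) as writer:
        for path, label in dataset:  # eager iteration over this shard only
            with open(path.numpy().decode(), 'rb') as f:
                image_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': _bytes_feature(image_bytes),
                'label': _int64_feature(int(label.numpy())),
            }))
            writer.write(example.SerializeToString())
    return filename

if __name__ == '__main__':
    num_shards = 10
    image_paths = sorted(glob.glob('images/*.jpg'))  # hypothetical image directory
    labels = [0] * len(image_paths)                  # hypothetical labels
    jobs = [(image_paths, labels, i, num_shards) for i in range(num_shards)]
    with ProcessPoolExecutor() as pool:
        for name in pool.map(write_shard, jobs):
            print('wrote', name)

Parallelism here comes from writing each shard in its own process rather than from tf.data itself; reading and encoding inside the tf.data graph with dataset.map(..., num_parallel_calls=...) is also possible, but building the tf.train.Example protos there requires tf.py_function.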

I want to read a large number of images for deep learning, but what is the solution when there is insufficient memory?

In a deep learning program written in Python, I wanted to store a large amount of image data in a NumPy array at once and extract batch data randomly from that array, but the image data is too large and memory runs out.
How should I deal with such cases? Do I have no choice but to do I/O processing and read image data from storage every time I retrieve a batch?
File I/O would solve the issue, but it will slow down the learning process, since file I/O is a task that takes a long time.
However, you could try to implement a mixture of both using multithreading, e.g.
https://github.com/stratospark/keras-multiprocess-image-data-generator
(I do not know what kind of framework you are using).
Anyhow, back to the basic idea:
Pick some random files, read them, and start training. During training, start a second thread which reads another set of random files. That way your training thread does not have to wait for new data, since the training process might take longer than the reading process (a minimal sketch of this producer/consumer pattern follows the links below).
Some frameworks have this feature already implemented; check out:
https://github.com/fchollet/keras/issues/1627
or:
https://github.com/pytorch/examples/blob/master/mnist_hogwild/train.py
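
A minimal, framework-agnostic sketch of the producer/consumer idea above; the image directory, image size, batch size and train_step() are hypothetical placeholders, and Pillow is assumed to be installed:

import glob
import queue
import random
import threading
import numpy as np
from PIL import Image

def load_image(path):
    # Resize so every image stacks into one array; 224x224 is an arbitrary choice.
    return np.asarray(Image.open(path).convert('RGB').resize((224, 224)), dtype=np.float32) / 255.0

def producer(file_list, batch_size, batch_queue):
    while True:
        paths = random.sample(file_list, batch_size)
        batch = np.stack([load_image(p) for p in paths])
        batch_queue.put(batch)  # blocks once the queue is full, so memory stays bounded

def train_step(batch):
    pass  # placeholder for the actual training code

file_list = glob.glob('images/*.jpg')  # hypothetical image directory
batch_queue = queue.Queue(maxsize=4)   # only a few batches live in memory at any time
threading.Thread(target=producer, args=(file_list, 32, batch_queue), daemon=True).start()

for step in range(1000):
    batch = batch_queue.get()          # disk reads overlap with the previous training step
    train_step(batch)

This is roughly what the generator/DataLoader machinery in the frameworks linked above does for you, which is usually the easier route.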

Fast saving and retrieving of python data structures for an autocorrect program?

So, I have written an autocomplete and autocorrect program in Python 2. I have written the autocorrect program using the approach mentioned in Peter Norvig's blog on how to write a spell checker, link.
Now, I am using a trie data structure implemented using nested lists. I am using a trie as it can give me all words starting with a particular prefix. At the leaf there is a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat and cat would be saved as:
['b', ['a', ['d', ('bad', 4), 't', ('bat', 3)]], 'c', ['a', ['t', ('cat', 4)]]]
where 4, 3, 4 are the number of times the words have been used, i.e. their frequency values. Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now, it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered, its frequency value has to be incremented and then the updated trie has to be saved again. As you can imagine, it would be a big problem to wait 3-4 seconds for the read and then again that much time for the save, every time. I will need to perform a lot of update operations each time the program is run and save them.
Is there a faster or more efficient way to store a large data structure which will be updated repeatedly? How are the data structures of the autocorrect programs in IDEs and on mobile devices saved and retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries starting with a certain character. You can refine this by splitting on a longer prefix. This way the amount of data you need to write is smaller.
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM (memory) and write them out at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations (see the sketch after this list).
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread that does the loading and writing so that it doesn't block everything while it does disk IO. Multi-threading in Python is a bit tricky, but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything, that's a lot of function calls. In the perfect case you would have a memory representation that matches the disk representation exactly. You would then simply read a large string and put it into your custom class (and write that string to disk when you need to). This is a bit more advanced and the benefits will likely not be that huge, especially since Python is not so efficient at playing with bits, but if you need to squeeze the last bit of speed out of it, this is the way to go.
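
A minimal sketch of points 2) and 3) combined, keeping the trie in memory, counting updates, and writing a checkpoint from a background thread every N operations; the trie itself, the file name and the update logic are placeholders:

import pickle
import threading

CHECKPOINT_EVERY = 1000
_lock = threading.Lock()
_updates_since_save = 0
trie = []  # the nested-list trie from the question, loaded once at start-up

def save_checkpoint():
    with _lock:
        snapshot = pickle.dumps(trie, protocol=pickle.HIGHEST_PROTOCOL)
    # Write outside the lock so the slow disk IO never blocks lookups or updates.
    with open('trie.pkl', 'wb') as f:
        f.write(snapshot)

def record_word(word):
    global _updates_since_save
    with _lock:
        # ... update the in-memory trie / bump the frequency of `word` here ...
        _updates_since_save += 1
        due = _updates_since_save >= CHECKPOINT_EVERY
        if due:
            _updates_since_save = 0
    if due:
        threading.Thread(target=save_checkpoint).start()

If the program crashes, at most CHECKPOINT_EVERY updates are lost, which is the trade-off the checkpointing suggestion accepts.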
I would suggest moving serialization to a separate thread and running it periodically. You don't need to re-read your data each time, because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind and the latest updates may be lost if the program crashes, but this shouldn't be a big issue for your use case, I think.
It depends on the particular use case and environment, but I think most programs with local data sets sync them using multi-threading.

SciPy & IPython Notebook: how to save and load results?

Right now I'm using IPython Notebook.
Part of my program needs a long time to produce its result, so I want to save that result and load it the next time I use the script. Otherwise I need to repeat the calculation, which takes a lot of time.
I'm wondering: is there any good practice for saving and loading results that makes it easier to resume the script the next time I need it?
It's easy to save text results, but with SciPy and NumPy the result may be quite complex, e.g. a matrix or a numerical array.
There are several options, such as pickle, which allows you to save almost anything. However, if what you are going to save are numeric NumPy arrays/matrices, np.save and np.load seem more appropriate.
import numpy as np

data = np.arange(10)          # stand-in for the expensive result (my data np array)
np.save('mypath.npy', data)   # np.save appends '.npy' if the file name has no extension
data = np.load('mypath.npy')  # note that np.load needs the full file name, including '.npy'

How should I store simple objects using Python and redis?

Let's suppose that I have a lot of (hundreds of) big Python dictionaries. The pickled file size is about 2 MB. I want to draw a chart using data from these dictionaries, so I have to load them all. What is the most efficient (first in speed, second in memory) way to store my data? Maybe I should use another caching tool? This is how I am solving the task now:
Pickle each dictionary: pickle.dumps(my_dict).
Load the pickled string into redis: redis.set(key, pickled_dict).
When the user needs a chart, I create an array and fill it with the unpickled data from redis, like this:
array = []
for i in range(iteration_count):
    array.append(pickle.loads(redis.get(key)))  # one key per dictionary
Now I have both problems: memory, because my array is very big, but that is not important and easy to solve; and the main problem, speed. A lot of the objects take more than 0.3 seconds to unpickle, and I even hit bottlenecks of more than 1 second of unpickling time. Getting the string from redis is also rather expensive (more than 0.01 s). When I have lots of objects, my user has to wait many seconds.
If it can be assumed that you are asking in the context of a web application and that you are displaying your charts in a browser, I would definitely recommend storing your dictionaries as JSON in redis.
Again, you have not provided too many details about your application, but I have implemented charting over very large data sets before (hundreds of thousands of sensor data points per second over several minutes of time). To help performance when rendering the data sets, I stored each type of data in its own dictionary or 'series'. This strategy allows you to render only portions of the data as required.
Perhaps if you share more about your particular application we may be able to provide more help.
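
A minimal sketch of the JSON-in-redis approach suggested above; the key names, the series data and the localhost connection are hypothetical placeholders:

import json
import redis

r = redis.Redis(host='localhost', port=6379)

def save_series(key, series_dict):
    r.set(key, json.dumps(series_dict))

def load_series(keys):
    # A single MGET fetches all the series in one round trip instead of paying
    # the per-key network cost once per dictionary.
    return [json.loads(raw) for raw in r.mget(keys) if raw is not None]

save_series('series:temperature', {'t': [0, 1, 2], 'value': [20.1, 20.4, 20.3]})
charts = load_series(['series:temperature'])

If the charts are rendered in a browser, the JSON strings can also be handed to the client as-is, which skips the unpickling step entirely.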
