SciPy & IPython Notebook: how to save and load results? - python

I'm using the IPython Notebook.
Part of my program takes a long time to compute its result, so I want to save that result and load it the next time I run the script. Otherwise I have to repeat the calculation, which takes a lot of time.
Is there a good practice for saving and loading results that makes it easier to resume the script the next time I need it?
It's easy to save text results, but with scipy and numpy the result may be more complex, e.g. a matrix or a numerical array.

There are several options, such as pickle, which allows you to save almost anything. However, if what you are going to save are numeric NumPy arrays/matrices, np.save and np.load seem more appropriate.
import numpy as np

data = ...                       # my data, a numpy array
np.save('mypath.npy', data)      # np.save appends .npy if the name lacks it
data = np.load('mypath.npy')
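For results that are not plain arrays (e.g. a dict mixing arrays, scalars and other Python objects), the pickle route mentioned above looks roughly like this; results and the file name are just placeholders:
import pickle

with open('results.pkl', 'wb') as f:
    pickle.dump(results, f)      # 'results' stands in for whatever object you computed

with open('results.pkl', 'rb') as f:
    results = pickle.load(f)     # reload it in a later session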

Related

Efficient Way to Read SAS file with over 100 million rows into pandas

I have a SAS file with roughly 112 million rows. I do not actually have access to SAS software, so I need to get this data into, preferably, a pandas DataFrame or something very similar in the Python family. I just don't know how to do this efficiently: just doing df = pd.read_sas('filename.sas7bdat') takes a few hours. I can use chunk sizes, but that doesn't really solve the underlying problem. Is there any faster way to get this into pandas, or do I just have to eat the multi-hour wait? Additionally, even when I have read in the file, I can barely do anything with it because iterating over the df takes forever as well. It usually just ends up crashing the Jupyter kernel. Thanks in advance for any advice in this regard!
Regarding the first part, I guess there is not much to do as the read_sas options are limited.
For the second part: 1. Iterating manually through rows is slow and not the pandas philosophy; whenever possible, use vectorized operations. 2. Look into specialized solutions for large datasets, like dask, and read up on how to scale to large DataFrames.
Maybe you don't need your entire file to work on it, so you could take a sample, say 10%. You can also change your variable types (dtypes) to reduce memory usage.
If you want to store a DataFrame and reuse it instead of re-importing the entire file each time you want to work on it, you can save it as a pickle file (.pkl) with DataFrame.to_pickle and reopen it with pandas.read_pickle, as in the sketch below.
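A rough sketch of how those suggestions could fit together (read the SAS file once, in chunks, shrink the dtypes, then cache the result as a pickle); the file names, chunk size and downcasting choices are assumptions, not something tested on the asker's data:
import pandas as pd

chunks = []
for chunk in pd.read_sas('filename.sas7bdat', chunksize=1_000_000):
    # downcast float columns to smaller types to cut memory use
    for col in chunk.select_dtypes('float').columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df.to_pickle('filename.pkl')         # one-time cost

# in later sessions:
df = pd.read_pickle('filename.pkl')  # much faster than re-parsing the SAS file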

What is the fastest way to append and read data to a 3D NumPy array?

My code can be summarised as a for loop (M ~ 10^5-10^6 iterations) over some function which sequentially produces data in the form of (W, N)-arrays, where W ~ 500 and N ~ 100, and I need to store these as efficiently as possible. Apart from saving data of this form, I would also like to be able to access it as fast as possible.
So far, I tried:
Creating a np.empty((M,W,N)),
Starting from a (W,N)-array and appending data using np.append, np.vstack, np.hstack.
So far everything seems to be pretty slow.
What is the fastest way to manage this?
Do I need to rely on 3rd-party packages like Dask? If so, which ones?
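No answer is quoted here, but a minimal sketch of the preallocation idea the question already mentions might look like the following. Filling a preallocated block in place avoids the full copy that np.append/np.vstack make on every iteration; because (M, W, N) at these sizes is tens of gigabytes, the block is backed by a .npy file through a memory map. The sizes are the question's rough figures and produce_data is a hypothetical stand-in for the real per-iteration function:
import numpy as np

M, W, N = 100_000, 500, 100                      # rough sizes from the question
out = np.lib.format.open_memmap('results.npy', mode='w+',
                                dtype=np.float32, shape=(M, W, N))
for i in range(M):
    out[i] = produce_data(i)                     # hypothetical (W, N) producer
out.flush()

# later, fast access without loading everything into RAM:
data = np.load('results.npy', mmap_mode='r')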

Python: Can I write to a file without loading its contents in RAM?

I've got a big dataset that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to files without holding the rest of their contents in RAM (I've been using np.save and savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array which leaves the data on disk, only loading it as needed. The complete manual page is here. However, the easiest way to use them is to pass mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the file on disk (see here).
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it using a list, so arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this would otherwise overwrite the data, you'll need to open the files on disk read-only and create copies (using mmap_mode='w+') to put the shuffled data in; a rough sketch follows below.
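A possible sketch of combining the two suggestions above (memory mapping plus advanced indexing) to shuffle a .npy file that doesn't fit in RAM; the file names and block size are assumptions, and the output copy is created here with np.lib.format.open_memmap rather than np.load(..., mmap_mode='w+'):
import numpy as np

src = np.load('data.npy', mmap_mode='r')                  # read-only view on disk
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)

perm = np.random.permutation(len(src))                    # shuffled row order
block = 10_000                                            # rows copied per step
for start in range(0, len(src), block):
    idx = perm[start:start + block]
    dst[start:start + block] = src[idx]                   # advanced indexing on the source
dst.flush()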

Numpy.load under the hood

I am using numpy.load at runtime, as my application loads different numpy arrays based on external events.
My application is very low-latency oriented and I am struggling with numpy.load.
I noticed that the first time I use numpy.load on a particular array (saved as .npy), the loading time is pretty slow (~0.2-0.3 s), but every time I do it again the time drops dramatically, so after the 2nd or 3rd load it is as low as 0.01 s.
I am using the classical syntax
data = np.load(name)
Later on, I pass data into some processing function and then overwrite the variable data:
data = None
So my question is: what is happening? And if there is some kind of cache, can I load and then overwrite all arrays at the beginning of the script so that whenever I load an array later it's fast? If so, will memory use suffer?
Thanks in advance
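No answer is quoted here, but a sketch of the pre-loading idea the question itself raises would be to read every array once at start-up and keep it in a dict, so the latency-critical path never calls np.load at all; the trade-off is that all the arrays then live in RAM for the lifetime of the process. The file names and the on_event function are hypothetical:
import numpy as np

paths = ['event_a.npy', 'event_b.npy', 'event_c.npy']    # hypothetical file list
arrays = {p: np.load(p) for p in paths}                   # cost paid once, up front

def on_event(name):
    data = arrays[name]        # in-memory lookup instead of a disk read
    ...                        # hand data to the processing function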

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma-separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's even halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two, for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF for multi-core processing of your data, exploiting MPI to pass whatever data you need between the cores.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course, all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
If you can't change the file format (i.e. it's not produced by one of your own programs) then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.
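Putting the two answers above together (parse the text once, store it in a binary/HDF container so later loads skip parsing entirely), one possible sketch reads the single line in fixed-size text chunks so the whole 3.3 GB never has to sit in memory at once; the chunk size, file names and dataset layout are assumptions:
import numpy as np
import h5py

CHUNK = 64 * 1024 * 1024                      # ~64 MB of text per read (assumption)

with open('distance_matrix.tmp') as f, h5py.File('distance_matrix.h5', 'w') as h5:
    dset = h5.create_dataset('values', shape=(0,), maxshape=(None,), dtype='f8')
    leftover = ''
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        text, _, leftover = (leftover + block).rpartition(',')   # keep any cut-off value
        if not text:
            continue
        values = np.fromstring(text, sep=',')                    # parse this chunk only
        dset.resize(dset.shape[0] + values.size, axis=0)
        dset[-values.size:] = values
    if leftover.strip():                                         # value after the last comma
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = float(leftover)
After this one-time conversion, the data can be read in slices from the HDF5 file (or re-dumped with np.save) without ever re-parsing the text.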
