how to downsize a pandas DataFrame without making a copy? - python

I have RAM concerns, and I want to downsize the data I loaded (with read_stata() you cannot read only a few rows, sadly). Can I change the code below to use only some of the rows for X and y without making a copy? Even a temporary copy would defeat the purpose: I want to save memory, not add even more to my footprint. Or should I downsize the data first (does `reshape` do that without a copy if you specify a smaller size than the original?) and then pick some columns?
import pandas as pd

data = pd.read_stata('S:/data/controls/notreat.dta')
X = data.iloc[:, 1:]
y = data.iloc[:, 0]

I feel your pain. Pandas is not a memory-friendly library, and 500 MB of data can quickly balloon past 16 GB and shred performance.
However, one thing that has worked for me is memmap. You can use memmap to page in numpy arrays and matrices just about as fast as your data bus permits, and as an added benefit, unused pages may be unloaded.
See here for details. With some work, these memmapped numpy arrays can be used to back a pd.Series or a pd.DataFrame without copying. However, you may find that Pandas later copies your data as you proceed. So, my advice: create a memmap file and stay in numpy-land.
Your other alternative is to use HDFS.
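For illustration, here is a minimal sketch of the memmap route, assuming you have already exported the Stata data to a flat binary file of known dtype and shape (the filename and shape below are placeholders):

import numpy as np

# Placeholder path and shape: the .dat file would be a raw dump of the numeric data.
arr = np.memmap('S:/data/controls/notreat.dat', dtype='float64',
                mode='r', shape=(1_000_000, 20))

# Basic slicing of a memmap returns another memmap view; nothing is copied into RAM here.
X = arr[:, 1:]
y = arr[:, 0]

Pages are only pulled off disk when you actually touch the corresponding rows.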

Related

Python: Can I write to a file without loading its contents in RAM?

I've got a big dataset that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically, and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to files without holding the rest of their contents in RAM (I've been using np.save and savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array that leaves the data on disk, loading it only as needed. The complete manual page is here. The easiest way to use them is to pass mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which keeps the data on disk (see here).
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it using a list: arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this would overwrite the data, you'll need to open the files on disk read-only and create copies (using mmap_mode='w+') to put the shuffled data in.
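A rough sketch of that workflow, assuming the data was previously written with np.save (the filenames and the single-permutation shuffle are placeholders for the multi-pile scheme described above):

import numpy as np

src = np.load('data.npy', mmap_mode='r')  # read-only view of the data on disk

# A writable memmap of the same shape to hold the shuffled copy.
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)

order = np.random.permutation(len(src))  # random destination for each data point
for i, j in enumerate(order):
    dst[i] = src[j]  # only two rows are ever resident at a time

dst.flush()  # make sure everything is written back to disk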

Read only Pandas dataset in Dask Distributed

TL;DR
I want to allow workers to use a scattered Pandas DataFrame, but not allow them to mutate any data. See below for sample code. Is this possible? (Or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq
from dask.distributed import Client, Variable

client = Client()  # connect to the cluster (a local cluster if no address is given)

df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return dataset.groupby(dataset['something'])[column].mean().sort_values(ascending=False)[0:10].to_dict()
a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS against this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes, such as one worker "corrupting" its own copy, which could lead to inconsistencies. Preferably, trying to modify the data would raise an exception.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas DataFrame itself is not immutable (although a NumPy array can be made immutable).
Other ideas:
Copy the dataset on each query, but 2GB copy takes time even in RAM (roughly 1.4 seconds)
Devise a way to hash a dataframe (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the dataframe is the same as expected. Running hash_pandas_object takes roughly 5 seconds.
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
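If you do want an exception on writes, one hedged workaround (not a Dask feature, and dependent on the pandas version not copying the buffer at construction time) is to freeze the underlying NumPy array before building the DataFrame:

import numpy as np
import pandas as pd

arr = np.random.rand(1000, 3)
arr.flags.writeable = False  # mark the buffer read-only

# With a single homogeneous array, pandas can often wrap the buffer without copying.
df = pd.DataFrame(arr, columns=['a', 'b', 'c'], copy=False)

try:
    df.iloc[0, 0] = 42.0  # raises ValueError if the frame still shares the frozen buffer
except ValueError as exc:
    print('write blocked:', exc)

If pandas does copy internally, the assignment will silently succeed, so treat this as a best-effort guard rather than a hard guarantee.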

Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
Eg.
from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())
Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).
The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow and also requires many times the memory of the DataFrame.
One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to pull out subsets of the data for further processing. In practice, this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays not Pandas objects, so it also has some complexity costs.
In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
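A rough sketch of the HDF route (the file name, key, and id filter are placeholders):

import pandas as pd

# Write once, using the queryable 'table' format and indexing the id column.
df.to_hdf('data.h5', key='table', format='table', data_columns=['id'])

# Each worker then reads only the rows it needs instead of receiving the whole frame.
subset = pd.read_hdf('data.h5', key='table', where='id == 123')
subset.apply(another_function)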
Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
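A sketch of that memmap workaround with joblib (the numeric matrix and the body of process are stand-ins for the real query result and per-row work):

import numpy as np
from joblib import Parallel, delayed, dump, load

# Dump the numeric data to disk once, then reopen it memory-mapped so the
# worker processes share pages instead of each unpickling a full copy.
dump(df.to_numpy(), 'df_data.joblib')
shared = load('df_data.joblib', mmap_mode='r')

def process(row_idx, data):
    return data[row_idx].sum()  # placeholder for the real per-row work

results = Parallel(n_jobs=8)(
    delayed(process)(i, shared) for i in range(shared.shape[0]))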

Numpy matrix of arrays without copying possible?

I have a question about numpy and its memory. Is it possible to generate a view or something similar out of multiple numpy arrays without copying them?
import numpy as np

def test_var_args(*inputData):
    dataArray = np.array(inputData)
    print(np.may_share_memory(inputData, dataArray))  # prints False, because no memory is shared

test_var_args(np.arange(32), np.arange(32) * 2)
I've got a C++ application with images and want to do some Python magic. I pass the images row by row to the Python script using the C API and want to combine them without copying them.
I am able to pass the data such that C++ and Python share the same memory. Now I want to arrange that memory into a numpy view/array or something like that.
The images in C++ are not contiguous in memory (I slice them). The rows that I hand over to Python are arranged in one contiguous memory block.
The number of images I pass varies. Maybe I can change that if there is a preallocation trick.
There's a useful discussion in the answer here: Can memmap pandas series. What about a dataframe?
In short:
If you initialize your DataFrame from a single array or matrix, then it may not copy the data.
If you initialize from multiple arrays of the same or different types, your data will be copied.
This is the only behavior permitted by the default BlockManager used by Pandas' DataFrame, which organizes the DataFrame's memory internally.
It's possible to monkey-patch the BlockManager to change this behavior, though, in which case your supplied data will be referenced.
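A quick way to see the difference described above (exact behaviour can vary by pandas version):

import numpy as np
import pandas as pd

rows = [np.arange(32.0), np.arange(32.0) * 2]

# One homogeneous 2D array: pandas can often wrap it without copying.
block = np.vstack(rows)  # this stack is itself a copy, but it happens once, in numpy-land
df = pd.DataFrame(block, copy=False)
print(np.may_share_memory(block, df.values))  # frequently True

# Separate per-column arrays: the data gets consolidated, i.e. copied.
df2 = pd.DataFrame({'a': rows[0], 'b': rows[1]})
print(np.may_share_memory(rows[0], df2['a'].values))  # typically False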

Handling very large netCDF files in python

I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying a slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.
For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:
import netCDF4 as nc
f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']
myslice = myvar[:,:,0]
But the last step takes a really long time (~5 min on my system). If, for example, I save a variable of shape (500, 500, 300) to the netCDF file, then a read operation of the same size takes only a few seconds.
Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But in such a large file this would not be possible to do in memory, and attempting it seems even slower given that a simple read already takes so long. What I would like is a quick way to read a slice of a netCDF file, in the fashion of Fortran's get_vara interface, or some way of efficiently transposing the array.
You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:
http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html
The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a buffer and how much to use for chunk caches, but it's not clear how to split memory optimally between these uses, so you may have to try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it, by specifying chunks that have a lot of data along the two big dimensions of your slice and only a few values along the other dimensions.
This is a comment, not an answer, but I can't comment on the above, sorry.
I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:
for i in range(450):
    myslice = myvar[:, :, i]
    do_something(myslice)
and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? That would be contiguous data, and maybe you can save time that way. You would choose n as large as your memory affords, and then process the next chunk of data, moreslices = myvar[:,:,n:2*n], and so on.
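A sketch of that chunked loop (n and do_something are placeholders to be tuned and replaced):

import netCDF4 as nc

f = nc.Dataset('myfile.ncdf', 'r')
myvar = f.variables['myvar']  # shape (500, 500, 450, 300)

n = 4  # chunk width along the slow axis; pick the largest value that fits in RAM
for start in range(0, myvar.shape[2], n):
    # one larger contiguous read instead of n tiny strided ones
    moreslices = myvar[:, :, start:start + n]
    for i in range(moreslices.shape[2]):
        do_something(moreslices[:, :, i])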
