Fastest Way to Create Large Numpy Arrays?

I'm working on creating a bunch of data (10M rows or more) and am trying to understand more deeply the best way to work with numpy arrays as quickly as possible.
I'll have 1M rows of each class of data, which I read in from different sources (async). When I'm done reading, I want to combine them into a single numpy array. I'll know the final array size (exactly 10M rows) in advance.
I'm thinking I have the following paths:
1. Create a global numpy array of the entire size and copy each source in.
2. Create a global numpy array plus a numpy array per source, and concatenate them all at the end.
3. Create an empty global numpy array and append each row to it (I think this is the slowest).
However, I'm not sure how to do #1: numpy.copyto seems to always start at index 0.
Is there another model I should be going with here?
If I use "views", I'm not clear on how to copy them into the final array. I'm familiar with views for databases, of course, but not for numpy.
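One way to do #1, as a minimal sketch: preallocate the full array once and assign each chunk into a slice of it. numpy.copyto is not limited to index 0 if you hand it a sliced destination. The source count, column count, and read_source helper below are made up for illustration (and scaled down from the question's 1M rows per source):

import numpy as np

N_SOURCES, ROWS_PER_SOURCE, N_COLS = 10, 100_000, 4  # hypothetical sizes

def read_source(i):
    # Stand-in for the real (async) reader; returns one source's rows.
    return np.random.rand(ROWS_PER_SOURCE, N_COLS)

# Preallocate once, then copy each chunk into its slice of the final array.
final = np.empty((N_SOURCES * ROWS_PER_SOURCE, N_COLS))
for i in range(N_SOURCES):
    chunk = read_source(i)
    start = i * ROWS_PER_SOURCE
    final[start:start + ROWS_PER_SOURCE] = chunk
    # equivalently: np.copyto(final[start:start + ROWS_PER_SOURCE], chunk)

Each slice of final is itself a view, so the assignment writes straight into the preallocated memory with no extra concatenation pass.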

Related

Slicing Ragged Array

I have a bunch of sliced images (32 x 32 in shape), each stored with a corresponding string id (file name and such). The slices and their ids are grouped into a number of sub-arrays that make up one final large array. This array is ragged and non-standard, but I'd like to be able to access the slices inside it efficiently.
Let's say that I have 500 slices. On the surface the shape should be (500, 2), but it isn't, because the grouping of a slice with its id is not a standard array with a consistent shape.
What I would like to be able to do is extract the sliced images themselves from the final array. Normally I could collect them all by slicing the big array like big_array[:, 0], but the ragged nesting has made the array appear 1-D, with shape (500,).
The only way around this is to use a clunky for loop, but I'm pretty sure everything I've been doing up until this point has been a terrible way of storing the data.
I need to keep the ids associated with each slice because I'm training a model with this, and if something goes wrong, I'd like to be able to reference the origins of the slice which has undergone some processing.
The only other way around this is to store the ids and slices separately, but that is also a lot of hassle since I have to save them in separate files.
What's the correct way to store this thing?
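One tidy layout, as a minimal sketch under the question's numbers (the field names and file name are made up): a numpy structured array gives each slice a record holding both the 32 x 32 image and its string id, so the images stay addressable as one regular (500, 32, 32) block and everything fits in a single file.

import numpy as np

n_slices = 500
# One record per slice: the image and its id travel together.
record = np.dtype([("image", np.float32, (32, 32)), ("id", "U64")])
data = np.zeros(n_slices, dtype=record)

# Toy fill:
data["image"] = np.random.rand(n_slices, 32, 32)
data["id"] = [f"file_{i:04d}.png" for i in range(n_slices)]

images = data["image"]          # regular (500, 32, 32) array, no loop needed
print(images.shape, data["id"][0])
np.save("slices.npy", data)     # ids and images together in one file

If something goes wrong during training, data["id"][k] still points back to the origin of slice k.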

Downsample numpy array

I have two numpy arrays. One has features and the other the corresponding labels.
There are 3 classes, but the dataset is imbalanced, so I would like to balance it out by downsampling one class. That class has about 10k elements, and I would like to cut it down to around 2k, like the other classes. I tried to do it with a for loop by creating a new array, but I am sure there is a cleaner method.
In the end there should be two numpy arrays: the labels with the oversized class reduced, and the features with the same elements removed so the two stay aligned.
Any idea? Thanks!
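A minimal sketch, assuming the oversized class is labeled 2 and a random subsample is acceptable; the array contents and sizes below are placeholders:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2k each of classes 0 and 1, 10k of class 2.
features = rng.random((14_000, 4))
labels = np.concatenate([np.zeros(2_000, int), np.ones(2_000, int), np.full(10_000, 2)])

target_class, target_size = 2, 2_000
keep = np.flatnonzero(labels != target_class)              # rows of the other classes
sub = rng.choice(np.flatnonzero(labels == target_class),   # random 2k of class 2
                 size=target_size, replace=False)
idx = np.sort(np.concatenate([keep, sub]))                 # keep original row order

features_balanced = features[idx]
labels_balanced = labels[idx]
print(np.bincount(labels_balanced))                        # [2000 2000 2000]

Because both arrays are indexed with the same idx, features and labels stay aligned row for row.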

How to append multiple data arrays into one variable of an xarray dataset?

I am new to this field and I need a little help.
I just want to know: what is the best way to append multiple data arrays to one variable of an xarray dataset?
Each data array has a different time and values, but has the same x, y coordinates as the dataset.
I tried ds[variable_name] = da, but it works only for the first data array.
I want to create a function that takes data arrays, puts them into one variable of the dataset, and updates the time dimension of the dataset.
Thanks for your help.
The best way to do that is to first convert the data arrays to datasets separately, then merge the datasets together (using xr.merge).
Hope this helps others.
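A minimal sketch of that answer, assuming each incoming DataArray carries one time step on the same x/y grid (the variable name, coordinates, and helper function are made up):

import numpy as np
import xarray as xr

def append_arrays(data_arrays, variable_name):
    # Convert each DataArray to its own Dataset, then merge: xr.merge
    # aligns the shared x/y coordinates and unions the time values.
    return xr.merge([da.to_dataset(name=variable_name) for da in data_arrays])

x = y = np.arange(3)
da1 = xr.DataArray(np.random.rand(1, 3, 3), dims=("time", "y", "x"),
                   coords={"time": [np.datetime64("2021-01-01")], "y": y, "x": x})
da2 = xr.DataArray(np.random.rand(1, 3, 3), dims=("time", "y", "x"),
                   coords={"time": [np.datetime64("2021-01-02")], "y": y, "x": x})

ds = append_arrays([da1, da2], "my_var")
print(ds["my_var"].sizes)   # time grows to 2, x and y stay 3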

How to save the n-d numpy array data and read it quickly next time?

Here is my question:
I have a 3-d numpy array Data which is in the shape of (1000, 100, 100).
And I want to save it as a .txt or .csv file. How do I achieve that?
My first attempt was to reshape it into a 1-d array of length 1000*100*100, convert it to a pandas.DataFrame, and then save it as a .csv file.
When I want to load it next time, I reshape it back into a 3-d array.
I think there must be an easier method.
If you need to re-read it quickly into numpy, you could just use the cPickle module (plain pickle in Python 3).
This is going to be much faster than parsing it back from an ASCII dump (but then only Python programs will be able to re-read it). As a bonus, with just one instruction you can dump more than a single matrix (i.e. any data structure built from core Python containers and numpy arrays).
Note that parsing a floating-point value from an ASCII string is a rather complex and slow operation (if implemented correctly down to the ulp).
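A minimal sketch of that approach in today's Python, where cPickle is just pickle (the file name is a placeholder; numpy's own np.save/np.load is a comparable binary alternative):

import pickle
import numpy as np

data = np.random.rand(1000, 100, 100)

# One dump can hold several arrays, or any structure built from
# core-Python containers and numpy arrays.
with open("data.pkl", "wb") as f:
    pickle.dump({"data": data}, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("data.pkl", "rb") as f:
    restored = pickle.load(f)["data"]

assert restored.shape == (1000, 100, 100)   # shape and dtype survive intact

No reshaping on the way out, and no ASCII float parsing on the way back in.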

Numpy matrix of arrays without copying possible?

I have a question about numpy and its memory. Is it possible to generate a view or something similar out of multiple numpy arrays without copying them?
import numpy as np

def test_var_args(*inputData):
    dataArray = np.array(inputData)  # np.array copies its inputs into a new block
    print(np.may_share_memory(inputData, dataArray))  # prints False, b.c. of no shared memory

test_var_args(np.arange(32), np.arange(32) * 2)
I've got a C++ application with images and want to do some Python magic. I pass the images row by row to the Python script using the C API, and I want to combine them without copying them.
I am able to pass the data such that C++ and Python share the same memory. Now I want to arrange that memory into a numpy view/array or something like that.
The images in C++ are not contiguous in memory (I slice them), but the rows I hand over to Python are arranged in one contiguous memory block.
The number of images I pass varies. Maybe I can change that if there is a preallocation trick.
There's a useful discussion in the answer here: Can memmap pandas series. What about a dataframe?
In short:
If you initialize your DataFrame from a single array or matrix, then it may not copy the data.
If you initialize it from multiple arrays, of the same or different types, your data will be copied.
This is the only behavior permitted by the default BlockManager used by pandas' DataFrame, which organizes the DataFrame's memory internally.
It's possible to monkey-patch the BlockManager to change this behavior, though, in which case your supplied data will be referenced.
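A small demonstration of that behavior, as a sketch (exact results can vary by pandas version; copy=False is a request, not a guarantee):

import numpy as np
import pandas as pd

mat = np.arange(12, dtype=float).reshape(4, 3)

# Single homogeneous matrix: pandas can wrap it in one block without copying.
df_single = pd.DataFrame(mat, copy=False)
print(np.may_share_memory(mat, df_single.values))    # typically True

# Several separate arrays: the BlockManager consolidates them into new
# blocks, so the originals get copied.
a, b = np.arange(4.0), np.arange(4.0) * 2
df_multi = pd.DataFrame({"a": a, "b": b})
print(np.may_share_memory(a, df_multi["a"].values))  # typically False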
