I have a MATLAB file containing an array of shape 70x10,000,000 (70 rows, 10,000,000 columns).
What's annoying is that when I run this code, which is supposed to print that chunk of data,
import h5py

f = h5py.File(filepath, 'r')
item = list(f.items())[0][1]
print(item)
it reshapes it into 10,000,000x70 (10,000,000 rows, 70 columns)
Is there a way to keep the original shape?
h5py returns HDF5 data as NumPy arrays, so the key to using h5py is using NumPy methods when needed. You can easily transpose an array with np.transpose(). A simple example is provided below. It creates an HDF5 file with 2 datasets: 1) an array with shape (20,5), and 2) the transposed array with shape (5,20). It then reads the 2 arrays back and uses np.transpose() to switch the row/column order.
import numpy as np
import h5py

with h5py.File('SO_67031436', 'w') as h5w:
    arr = np.arange(100.).reshape(20, 5)
    h5w.create_dataset('ds_1', data=arr)
    h5w.create_dataset('ds_1t', data=np.transpose(arr))

with h5py.File('SO_67031436', 'r') as h5r:
    for name in h5r:
        print(name, ', shape=', h5r[name].shape)
        arr = np.transpose(h5r[name][:])
        print('transposed shape=', arr.shape)
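Applied to the question's file, the same idea would look something like this (a sketch only: filepath is the question's variable, and dset[:] pulls the whole 10,000,000x70 dataset into memory, which is several GB, so you may prefer to read slices):
import h5py

with h5py.File(filepath, 'r') as f:       # filepath as in the question
    dset = list(f.values())[0]            # same dataset the question grabs via f.items()
    print(dset.shape)                     # (10000000, 70) as stored in the file
    arr = dset[:].T                       # read into memory, then transpose -> (70, 10000000)
    chunk = dset[:100000, :].T            # or transpose just a slice: (70, 100000)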
Related
I would like to store multiple 2D ndarrays of different dimensions in a single Pandas dataframe, save that dataframe as a .csv, and load the .csv and retrieve the 2D arrays. I have seen ways of storing a single 2D array and ways of storing multiple 1D arrays, but never multiple 2D arrays. I would accept an answer using the xray package if necessary.
Nx = 10
Nz = 20
N_one = np.random.rand(Nx, Nz)
N_two = np.random.rand(Nx + 1, Nz + 1)
I have tried using a dictionary to store the arrays but I get back an error that says that it only supports 1D arrays.
my_dict = {"N_one": N_one, 'N_two': N_two}
dict_df = pd.DataFrame({key:pd.Series(value) for key,value in my_dict.items()})
For a single 2D array, this is how I would store one of the arrays in a dataframe and export it to a .csv
test_df = pd.DataFrame(N_one)
test_df.to_csv('testdataset.csv',index=False)
Then I would read the .csv like this
location_of_dataset = r'C:\Users\User\testdataset.csv'
new_test_df = pd.read_csv(location_of_dataset)
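One possible workaround (purely a sketch of my own, not from the original post): flatten each 2D array into its own CSV column and keep the shapes separately so the arrays can be rebuilt after reading.
import numpy as np
import pandas as pd

Nx, Nz = 10, 20
N_one = np.random.rand(Nx, Nz)
N_two = np.random.rand(Nx + 1, Nz + 1)

arrays = {"N_one": N_one, "N_two": N_two}
shapes = {name: a.shape for name, a in arrays.items()}   # has to be stored somewhere as well

flat_df = pd.DataFrame({name: pd.Series(a.ravel()) for name, a in arrays.items()})
flat_df.to_csv('testdataset_flat.csv', index=False)

# reading back: drop the NaN padding of the shorter column and reshape
loaded = pd.read_csv('testdataset_flat.csv')
N_one_back = loaded['N_one'].dropna().to_numpy().reshape(shapes['N_one'])
N_two_back = loaded['N_two'].dropna().to_numpy().reshape(shapes['N_two'])
print(np.allclose(N_one, N_one_back), np.allclose(N_two, N_two_back))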
I am trying to read a dataset from an HDF5 file. The h5py data handle is formatted as a 1d array:
with h5py.File(files[0], 'r') as f:
    flux = f['/nsae']
    print(flux.shape)
    print(flux[:5])
(1344,)
[(b'2018-02-01T00:00:00.000Z', b'2018-02-01T00:29:59.000Z', 4.95551669)
(b'2018-02-01T00:30:00.000Z', b'2018-02-01T00:59:59.000Z', 1.65919189)
(b'2018-02-01T01:00:00.000Z', b'2018-02-01T01:29:59.000Z', 0.88306412)
(b'2018-02-01T01:30:00.000Z', b'2018-02-01T01:59:59.000Z', 0.97632175)
(b'2018-02-01T02:00:00.000Z', b'2018-02-01T02:29:59.000Z', 0.9270446)]
I want to end up with a 3d dask array:
[[b'2018-02-01T00:00:00.000Z', b'2018-02-01T00:29:59.000Z', 4.95551669]
[b'2018-02-01T00:30:00.000Z', b'2018-02-01T00:59:59.000Z', 1.65919189]
[b'2018-02-01T01:00:00.000Z', b'2018-02-01T01:29:59.000Z', 0.88306412]
[b'2018-02-01T01:30:00.000Z', b'2018-02-01T01:59:59.000Z', 0.97632175]
[b'2018-02-01T02:00:00.000Z', b'2018-02-01T02:29:59.000Z', 0.9270446]
...]
What is the most efficient way of doing this?
Ultimately I will be putting the data from many similar files (and also data from some csv files) into dask dataframes, concatenating into a single dataframe and then saving as one new hdf5. I am open to skipping the array stage if it makes sense to do so.
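For the compound dtype above, one way to get from the structured array to dask (a sketch of my own; the field names 'start', 'end', 'value' are invented stand-ins for whatever flux.dtype.names actually contains) is to split the fields into columns and build a dask dataframe:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# stand-in for the structured array that f['/nsae'][:] returns
rows = np.array(
    [(b'2018-02-01T00:00:00.000Z', b'2018-02-01T00:29:59.000Z', 4.95551669),
     (b'2018-02-01T00:30:00.000Z', b'2018-02-01T00:59:59.000Z', 1.65919189)],
    dtype=[('start', 'S24'), ('end', 'S24'), ('value', 'f8')])

# split the compound fields into columns and hand the result to dask
pdf = pd.DataFrame({name: rows[name] for name in rows.dtype.names})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.head())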
How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done, e.g., by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the chunks of rows one by one. Could this be done more efficiently with dask?
In particular, assuming one doesn't need all the possible out of core operations on sparse arrays, but just the ability to load row chunks in parallel (each chunk is a CSR array) and apply some function to them (in my case it would be e.g. estimator.predict(X) from scikit-learn).
Besides, is there a file format on disk that would be suitable for this task? Joblib works but I'm not sure about the (parallel) performance of CSR arrays loaded as memory maps; spark.mllib appears to use either some custom sparse storage format (that doesn't seem to have a pure Python parser) or LIBSVM format (the parser in scikit-learn is, in my experience, much slower than joblib.dump)...
Note: I have read documentation, various issues about it on https://github.com/dask/dask/ but I'm still not sure how to best approach this problem.
Edit: to give a more practical example, below is the code that works in dask for dense arrays but fails when using sparse arrays with this error,
import numpy as np
import scipy.sparse
import joblib
import dask.array as da
from sklearn.utils import gen_batches
np.random.seed(42)
joblib.dump(np.random.rand(100000, 1000), 'X_dense.pkl')
joblib.dump(scipy.sparse.random(10000, 1000000, format='csr'), 'X_csr.pkl')
fh = joblib.load('X_dense.pkl', mmap_mode='r')
batch_size = 2000  # row-chunk size (assumed here; it matches the dask chunks below)

# computing the results without dask
results = np.vstack([fh[sl, :].sum(axis=1) for sl in gen_batches(fh.shape[0], batch_size)])
# computing the results with dask
x = da.from_array(fh, chunks=(2000))
results = x.sum(axis=1).compute()
Edit 2: following the discussion below, the example below overcomes the previous error but gets one about IndexError: tuple index out of range in dask/array/core.py:L3413,
import dask
# +imports from the example above

dask.set_options(get=dask.get)  # disable multiprocessing
fh = joblib.load('X_csr.pkl', mmap_mode='r')

def func(x):
    if x.ndim == 0:
        # dask does some heuristics with dummy data, if the x is a 0d array
        # the sum command would fail
        return x
    res = np.asarray(x.sum(axis=1, keepdims=True))
    return res

Xd = da.from_array(fh, chunks=(2000))
results_new = Xd.map_blocks(func).compute()
So I don't know anything about joblib or dask, let alone your application specific data format. But it is actually possible to read sparse matrices from disk in chunks while retaining the sparse data structure.
While the Wikipedia article for the CSR format does a great job explaining how it works, I'll give a short recap:
Some sparse Matrix, e.g.:
1 0 2
0 0 3
4 5 6
is stored by remembering each nonzero-value and the column it resides in:
sparse.data = 1 2 3 4 5 6 # actual values
sparse.indices = 0 2 2 0 1 2 # number of column (0-indexed)
Now we are still missing the rows. The compressed format just stores how many non-zero values there are in each row, instead of storing every single value's row.
Note that the non-zero count is also accumulated, so the following array contains the number of non-zero values up until and including this row. To complicate things even further, the array always starts with a 0 and thus contains num_rows+1 entries:
sparse.indptr = 0 2 3 6
so up until and including the second row there are 3 nonzero values, namely 1, 2 and 3.
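As a quick check, the example matrix can be rebuilt from exactly these three arrays (a small sketch, not part of the original answer):
import numpy as np
import scipy.sparse

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])

# reconstruct the 3x3 example matrix from its CSR pieces
m = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
print(m.toarray())
# [[1 0 2]
#  [0 0 3]
#  [4 5 6]]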
Since we got this sorted out, we can start 'slicing' the matrix. The goal is to construct the data, indices and indptr arrays for some chunks. Assume the original huge matrix is stored in three binary files, which we will incrementally read. We use a generator to repeatedly yield some chunk.
For this we need to know how many non-zero values are in each chunk, and read the corresponding amount of values and column indices. The non-zero count can conveniently be taken from the indptr array: read as many entries from the huge indptr file as the desired chunk size requires, and the last entry of that portion minus the number of non-zero values before the chunk gives the number of non-zeros in the chunk. The chunk's data and indices arrays are then just sliced out of the big data and indices files. The chunk's indptr array needs to be prepended artificially with a zero (that's what the format wants, don't ask me :D).
Then we can just construct a sparse matrix with the chunk data, indices and indptr to get a new sparse matrix.
It has to be noted that the actual matrix size cannot be reconstructed directly from the three arrays alone. The column count is at best the maximum column index seen in the chunk, and if you are unlucky and a chunk contains no data it is undetermined altogether. So we also need to pass the column count.
I probably explained things in a rather complicated way, so just read the following as an opaque piece of code that implements such a generator:
import numpy as np
import scipy.sparse


def gen_batches(batch_size, sparse_data_path, sparse_indices_path,
                sparse_indptr_path, dtype=np.float32, column_size=None):
    data_item_size = dtype().itemsize

    with open(sparse_data_path, 'rb') as data_file, \
            open(sparse_indices_path, 'rb') as indices_file, \
            open(sparse_indptr_path, 'rb') as indptr_file:
        # number of non-zeros before the current chunk (indptr always starts with 0)
        nnz_before = np.frombuffer(indptr_file.read(4), dtype=np.int32)

        while True:
            indptr_batch = np.frombuffer(nnz_before.tobytes() +
                                         indptr_file.read(4 * batch_size), dtype=np.int32)
            if len(indptr_batch) == 1:
                break
            # shift the chunk's indptr so it starts at zero
            batch_indptr = indptr_batch - nnz_before
            nnz_before = indptr_batch[-1]
            batch_nnz = int(batch_indptr[-1])

            batch_data = np.frombuffer(data_file.read(
                data_item_size * batch_nnz), dtype=dtype)
            batch_indices = np.frombuffer(indices_file.read(
                4 * batch_nnz), dtype=np.int32)

            dimensions = (len(indptr_batch) - 1, column_size)
            matrix = scipy.sparse.csr_matrix((batch_data,
                                              batch_indices, batch_indptr), shape=dimensions)
            yield matrix


if __name__ == '__main__':
    sparse = scipy.sparse.random(5, 4, density=0.1, format='csr', dtype=np.float32)
    sparse.data.tofile('sparse.data')        # dtype as specified above (float32)
    sparse.indices.tofile('sparse.indices')  # dtype=int32
    sparse.indptr.tofile('sparse.indptr')    # dtype=int32

    print(sparse.toarray())
    print('========')
    for batch in gen_batches(2, 'sparse.data', 'sparse.indices',
                             'sparse.indptr', column_size=4):
        print(batch.toarray())
The numpy.ndarray.tofile() call just stores raw binary arrays, so you need to remember the data format. scipy.sparse represents the indices and indptr arrays as int32, so that's a limitation for the total matrix size.
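A minimal round trip showing why the dtype has to be remembered (a sketch with a throwaway file name):
import numpy as np

demo = np.array([0, 2, 3, 6], dtype=np.int32)
demo.tofile('indptr_demo.bin')                           # raw bytes, no dtype or shape stored
loaded = np.fromfile('indptr_demo.bin', dtype=np.int32)  # dtype must be supplied again
print(loaded)                                            # [0 2 3 6]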
Also, I benchmarked the code and found that the scipy csr matrix constructor is the bottleneck for small matrices. Your mileage may vary though; this is just a 'proof of principle'.
If there is need for a more sophisticated implementation, or something is too abstruse, just hit me up :)
I am quite new to numpy and Python in general. I am getting a dimension mismatch error when I try to append values, even though I have made sure that both arrays have the same dimension. Another question I have is why numpy creates a one-dimensional array when reading in data from a tab-delimited text file.
import numpy as np
names = ["Angle", "RX_Power", "Frequency"]
data = np.array([0,0,0],float) #experimental
data = np.genfromtxt("rx_power_mode 0.txt", dtype=float, delimiter='\t', names = names, usecols=[0,1,2], skip_header=1)
freq_177 = np.zeros(shape=(data.shape))
print(freq_177.shape) #outputs(315,)
for i in range(len(data)):
    if data[i][2] == 177:
        #np.concatenate(freq_177,data[i]) has same issue
        np.append(freq_177,data[i],0)
The output I am getting is
all the input arrays must have same number of dimensions
Annotated code:
import numpy as np
names = ["Angle", "RX_Power", "Frequency"]
You don't need to 'initialize' an array - unless you are going to assign values to individual elements.
data = np.array([0,0,0],float) #experimental
This data assignment completely overwrites the previous one.
data = np.genfromtxt("rx_power_mode 0.txt", dtype=float, delimiter='\t', names = names, usecols=[0,1,2], skip_header=1)
Look at data at this point. What is data.shape? What is data.dtype? Print it, or at least some elements. With names I'm guessing that this is a 1d array with a 3-field dtype. It's not a 2d array, though with all floats it could be transformed/viewed as such.
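For illustration, here is a small sketch of what genfromtxt returns when names are given (using an in-memory sample, since the real "rx_power_mode 0.txt" isn't available here):
from io import StringIO
import numpy as np

sample = StringIO("Angle\tRX_Power\tFrequency\n"
                  "0.0\t-42.5\t177.0\n"
                  "1.0\t-43.1\t183.0\n")
names = ["Angle", "RX_Power", "Frequency"]
data = np.genfromtxt(sample, dtype=float, delimiter='\t', names=names,
                     usecols=[0, 1, 2], skip_header=1)

print(data.shape)         # (2,)  -- one element per data row, i.e. 1d
print(data.dtype)         # [('Angle', '<f8'), ('RX_Power', '<f8'), ('Frequency', '<f8')]
print(data['Frequency'])  # [177. 183.]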
Why are you making a 1d array of zeros?
freq_177 = np.zeros(shape=(data.shape))
print(freq_177.shape) #outputs(315,)
With a structured array like data, the preferred way to index a given element is by field name and row number, e.g. data['Frequency'][i]. Play with that.
np.append is not the same as the list append. It returns a value; it does not change freq_177 in place. Same for concatenate. I recommend staying away from np.append. It's too easy to use it in the wrong way and place.
for i in range(len(data)):
    if data[i][2] == 177:
        #np.concatenate(freq_177,data[i]) has same issue
        np.append(freq_177,data[i],0)
It looks like you want to collect in freq_177 all the terms of the data array for which the 'Frequency' field is 177.
I = data['Frequency'].astype(int)==177
freq_177 = data[I]
I have used astype(int) because the == test with floats is uncertain. It is best used with integers.
I is a boolean mask, true where the values match; data[I] then is the corresponding elements of data. The dtype will match that of data, that is, it will have 3 fields. You can't append or concatenate it to an array of float zeros (your original freq_177).
If you must iterate and collect values, I suggest using list append, e.g.
alist = []
for row in data:
    if int(row['Frequency'])==177:
        alist.append(row)
freq177 = np.array(alist)
I don't think np.append is discussed much except in its own doc page and text. It comes up periodically in SO questions.
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.append.html
Returns: append : ndarray
A copy of arr with values appended to axis. Note that append does not occur in-place: a new array is allocated and filled.
See also help(np.append) in an interpreter shell.
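A minimal illustration of that copy behaviour (my own sketch, not from the docs):
import numpy as np

freq_177 = np.zeros(3)
np.append(freq_177, [1.0, 2.0])              # returned copy is thrown away; freq_177 unchanged
freq_177 = np.append(freq_177, [1.0, 2.0])   # the result has to be reassigned
print(freq_177)                              # [0. 0. 0. 1. 2.]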
For genfromtxt - it too has docs, and lots of SO discussion. But to understand what it returned in this case, you need to also read about structured arrays and compound dtype. (add links?)
Try loading the data with:
data = np.genfromtxt("rx_power_mode 0.txt", dtype=float, delimiter='\t', usecols=[0,1,2], skip_header=1)
Since you are skipping the header line, and just using columns with floats, data should be a 2d array with 3 columns, (N, 3). In that case you could access the 'frequency' values with data[:,2]
I = data[:,2].astype(int)==177
freq_177 = data[I,:]
freq_177 is now a 3 column array - with a subset of the data rows.
I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element holds 1699 sub-arrays of 1699 values each (i.e. 90x1699x1699).
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob
fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3,len(NIR_band)):
    for j in range(0,len(NIR_band[3])):
        loop_list = NIR_band[i][j]/NIR_band[i,:]
        C_values.append(loop_list)
What it produces is an array of 1699x1699 dimension. Each individual array is the results from the Row calculations. Another complaint is that the code takes ages to run. So, I have two questions, is it possible to create the type of array I'd like to work with? And, is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
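A small sanity check of that broadcast (a sketch; result[i, j, k] equals a[i, k] / a[j, k]):
import numpy as np

a = np.arange(1.0, 13.0).reshape(4, 3)              # a small (rows, columns) array
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)                                 # (4, 4, 3)
print(result[1, 2, 0] == a[1, 0] / a[2, 0])         # True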
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is .shape[0], the size of the 1st dimension. len(NIR_band[3]) is shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of text description.
I would rewrite NIR_band[i][j]/NIR_band[i,:] as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that calculation?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3 or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the 2 versions are broadcast against each other. np.outer does the multiplication version of this calculation for 1d arrays.
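Applied to the question's data, the division version of the same broadcast would look roughly like this (a sketch; the random array just stands in for the loaded NIR_band of shape (93, 1699)):
import numpy as np

NIR_band = np.random.rand(93, 1699)                       # stand-in for the transposed NIR data
C_values = NIR_band[3:, :, None] / NIR_band[3:, None, :]  # drop the first 3 bands, then broadcast
print(C_values.shape)                                     # (90, 1699, 1699), roughly 2 GB as float64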