Fast way to transpose np.arrays and write them to file - python

I have a situation where I'm converting .TDMS files to .txt files. The TDMS file structure has groups, which consist of channels whose data is in numpy array format. I'm trying to figure out the fastest way to loop over the groups and channels and transpose them into "table" format. I have a working solution, but I don't think it's that fast.
Below is a snippet of my code, where the file has already been opened into the tdms variable and groups is the list of groups in that .tdms file. The code loops over the groups and gets the list of channels in each group into the channels variable. To get the numpy array data from a channel you use channels[list index].data. Then I add the channels one by one into "table" format with column_stack and save the array with np.savetxt. Could there be a faster way to do this?
for i in range(len(groups)):
    output = filepath + "\\" + str(groups[i]) + ".txt"
    channels = tdms.group_channels(groups[i])  # list of channels in group i
    data = channels[0].data  # np.array for the first channel in the list
    for j in range(1, len(channels)):
        data = np.column_stack((data, channels[j].data))
    np.savetxt(output, data, delimiter="|", newline="\n")
Channel data is a 1D array of length 6200:
channels[0].data = array([ 12.74204722,  12.74205311,  12.74205884, ...,
                           12.78374288,  12.7837487 ,  13.78375434])

I think you can streamline the column_stack application with:
np.column_stack([chl.data for chl in channels])
I don't think it will save much time, though.
Is this what your data looks like?
In [138]: np.column_stack([[0,1,2],[10,11,12],[20,21,22]])
Out[138]:
array([[ 0, 10, 20],
       [ 1, 11, 21],
       [ 2, 12, 22]])
savetxt iterates through the rows of data, performing a format and write on each.
Since each row of the output consists of one data point from each channel, I think you have to assemble a 2d array like this. And you have to iterate over the channels to do it (assuming they are non-vectorizable objects).
There doesn't appear to be any advantage to looping through the rows of data and doing your own line writes; savetxt is relatively simple Python code.
With 1d arrays that are all the same size, this construction is even simpler, and faster, with the basic np.array:
np.array([[0,1,2],[10,11,12],[20,21,22]]).T
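
For reference, a minimal sketch of the whole loop rewritten with that list comprehension (assuming the tdms, groups and filepath variables from the question, and that every channel in a group has the same length):

import numpy as np

for group in groups:
    channels = tdms.group_channels(group)  # list of channel objects in this group
    # stack each channel's 1D data as one column of a 2D "table"
    data = np.column_stack([ch.data for ch in channels])
    # equivalently, since all channels are the same length:
    # data = np.array([ch.data for ch in channels]).T
    np.savetxt(filepath + "\\" + str(group) + ".txt", data, delimiter="|", newline="\n")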

Related

Python: Convert 2D array to separate arrays

I have read in a numpy.ndarray that looks like this:
[[1.4600e-01 2.9575e+00 6.1580e+02]
 [5.8600e-01 4.5070e+00 8.7480e+02]]
Let's assume that the array I am reading will not always have a length of 2 (e.g. it could have a length of 1, 3, 456, etc.).
I would like to separate this to two separate arrays that look like this:
[[1.4600e-01 2.9575e+00 6.1580e+02]]
[[5.8600e-01 4.5070e+00 8.7480e+02]]
I previously tried searching a solution to this problem but this is not the solution I am looking for: python convert 2d array to 1d array
Since you want to extract the rows, you can just index them. So suppose your array is stored in the variable x. x[0] will give you the first row: [1.4600e-01 2.9575e+00 6.1580e+02], while x[1] will give you the second row: [5.8600e-01 4.5070e+00 8.7480e+02], etc.
You can also iterate over the rows doing something like:
for row in x:
    # Do stuff with the row
If you really want to preserve the outer dimension, you can reshape a row using x[0].reshape((1, -1)), which sets the first dimension to 1 (meaning it has 1 row) and infers the second dimension from the existing data.
Alternatively if you want to split some number of rows into n groups, you can use the numpy.vsplit() function: https://numpy.org/doc/stable/reference/generated/numpy.vsplit.html#numpy.vsplit
However, I would suggest looping over the rows instead of splitting them up unless you really need to split them up.
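
A short sketch tying these options together (x is a hypothetical (2, 3) array like the one in the question):

import numpy as np

x = np.array([[1.4600e-01, 2.9575e+00, 6.1580e+02],
              [5.8600e-01, 4.5070e+00, 8.7480e+02]])

first = x[0]                       # shape (3,): the first row as a 1D array
first_2d = x[0].reshape((1, -1))   # shape (1, 3): keeps the outer dimension
parts = np.vsplit(x, x.shape[0])   # list of (1, 3) arrays, one per row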

Speeding up 3D numpy and dataframe lookup

I currently have a pretty large 3D numpy array (atlasarray - 14M elements with type int64) in which I want to create a duplicate array where every element is a float based on a separate dataframe lookup (organfile).
I'm very much a beginner, so I'm sure that there must be a better (quicker) way to do this. Currently it takes around 90s, which isn't ages but can probably be reduced. Most of the code below is taken from hours of Googling, so it surely isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')
densityarray = atlasarray
densityarray = densityarray.astype(float)
# iterate over every element and overwrite it with its density from the lookup table
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an OrganID. I used pandas to read in the key from an Excel file and generate a 4-column dataframe, from which in this particular case I want to extract the 4th column (which is a float). OrganIDs go up to 142. The first few rows of the key look like this:
| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use loops, they are slow. The following example with 14M elements takes less than 1 second to run:
density = np.random.random(143)                           # one density value per OrganID
atlasarray = np.random.randint(0, 142, (1000, 1000, 14))  # 14M synthetic organ IDs
densityarray = density[atlasarray]                        # vectorized lookup, no Python loop
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)
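
If the OrganID values are not guaranteed to be exactly 0, 1, 2, ... in row order, a slightly more defensive sketch (assuming the atlasarray and organfile variables from the question, with the OrganID and Density columns shown above) builds the lookup array keyed by OrganID first:

import numpy as np

# size the lookup so it covers both the atlas IDs and the OrganID column
size = max(int(atlasarray.max()), int(organfile['OrganID'].max())) + 1
lookup = np.zeros(size, dtype=float)
lookup[organfile['OrganID'].to_numpy()] = organfile['Density'].to_numpy()
densityarray = lookup[atlasarray]   # same vectorized indexing as above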

Numpy np.array() constructor behaving "inconsistently"

I have two Pandas dataframes, say df1 and df2 (shape (10, 15)), and I want to turn them into Numpy arrays, and then construct a Numpy array containing both of them (shape (2, 10, 15)). I'm currently doing this as follows:
data1 = df1.to_numpy()
data2 = df2.to_numpy()
data = np.array([data1, data2])
Now I'm trying to do this for many pairs of dataframes, and the code I'm using breaks when I call data.any() for some of the pairs, raising the "truth value of an array is ambiguous" error that tells me to use any() or all() (which I'm already doing). I started printing data when I saw this happening, and I noticed that the np.array() constructor produces something that looks like either [[[...]]] or [array([[...]])].
The first form works fine, but the second doesn't. The difference isn't random with respect to the dataframes; it breaks for certain ones. But all of these dataframes are preprocessed and processed the same way, and I've manually checked that the ones that don't work don't have any anomalies.
Since I can't provide much explicit code/data (code is pretty bulky, and arrays are 300 entries each), my main question is why the array constructor either gives [[[...]]] or [array([[...]])] forms, and why the second one doesn't like when I call data.any()?
The issue is that after processing the data, some of the dataframes were missing rows (i.e. they had shape (x, 15) where x < 10). The construction of the data array gives a shape of (2,) (an object array) when this happens, so as long as both df1 and df2 had the same number of rows it worked fine.
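
A minimal sketch of that failure mode and a shape guard against it (the (8, 15) frame is a hypothetical stand-in for one of the truncated dataframes):

import numpy as np

data1 = np.zeros((10, 15))
data2 = np.zeros((8, 15))   # a dataframe that lost rows during processing

if data1.shape == data2.shape:
    data = np.array([data1, data2])   # well-formed (2, 10, 15) array
else:
    # ragged inputs make np.array() fall back to a shape-(2,) object array
    # (recent NumPy versions require dtype=object or raise an error instead)
    raise ValueError(f"row count mismatch: {data1.shape} vs {data2.shape}")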

Storing multiple arrays within multiple arrays within an array Python/Numpy

I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element contains 1699 elements of 1699 values each (i.e. a 90 x 1699 x 1699 array).
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob
fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of dimension 1699x1699, where each individual array holds the results of the row calculations. Another complaint is that the code takes ages to run. So I have two questions: is it possible to create the type of array I'd like to work with? And is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
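
A tiny self-contained check of that broadcasting pattern (a 4x3 array standing in for the (1699, 90) data):

import numpy as np

a = np.arange(1.0, 13.0).reshape(4, 3)   # 4 "rows", 3 "columns", no zeros
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]

print(result.shape)     # (4, 4, 3): every row divided by every other row, column-wise
print(result[0, 1, :])  # a[0] / a[1], element-wise per column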
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
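
For example, a hypothetical load that drops the first three columns directly (assuming 93 columns in total, as in the question):

import numpy as np

# read only columns 3..92, skipping the first three non-data columns
NIR_values = np.loadtxt("NIR_data.txt", skiprows=1, usecols=range(3, 93))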
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is shape[0], the size of the 1st dimension. len(NIR_band[3]) is shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
I would rewrite NIR_band[i][j]/NIR_band[i,:] as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that calculation?
As for your subject line, "Storing multiple arrays within multiple arrays within an array" - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the two versions are broadcast against each other. np.outer does the multiplication equivalent of this calculation.
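
Applied to the question's data, a sketch of the division version (using random numbers as a stand-in for the real (93, 1699) NIR_band):

import numpy as np

# hypothetical stand-in for the real data: 93 "columns" of 1699 values each
NIR_band = np.random.random((93, 1699))

# C_values[c, i, j] = NIR_band[3 + c, i] / NIR_band[3 + c, j]
C_values = NIR_band[3:, :, None] / NIR_band[3:, None, :]
print(C_values.shape)   # (90, 1699, 1699), roughly 2 GB as float64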

Appending large amount of data to a tables (HDF5) database where database.numcols != newdata.numcols?

I am trying to append a large dataset (>30Gb) to an existing pytables table. The table is N columns, and the dataset is N-1 columns; one column is calculated after I know the other N-1 columns.
I'm using numpy.fromfile() to read chunks of the dataset into memory before appending it to the database. Ideally, I'd like to stick the data into the database, then calculate the final column, and finish up by using Table.modifyColumn() to complete the operation.
I've considered appending numpy.zeros((len(new_data), N)) to the table, then using Table.modifyColumns() to fill in the new data, but I'm hopeful someone knows a nice way to avoid generating a huge array of empty data for each chunk that I need to append.
If the columns are all the same type, you can use numpy.lib.stride_tricks.as_strided to make the (L, N-1) array you read from the file look like an array of shape (L, N). For example,
In [5]: a = numpy.arange(12).reshape(4,3)
In [6]: a
Out[6]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [7]: a.strides
Out[7]: (24, 8)
In [8]: b = numpy.lib.stride_tricks.as_strided(a, shape=(4, 4), strides=(24, 8))
In [9]: b
Out[9]:
array([[  0,   1,   2,   3],
       [  3,   4,   5,   6],
       [  6,   7,   8,   9],
       [  9,  10,  11, 112]])
Now you can use this array b to fill up the table. The last column of each row will be the same as the first column of the next row, but you'll overwrite them when you can compute the values.
This won't work if a is a record array (i.e. has a compound dtype). For that, you can try numpy.lib.recfunctions.append_fields. As it will copy the data to a new array, it won't save you any significant amount of memory, but it will allow you to do all the writing at once.
You could add the results to another table. Unless there's some compelling reason for the calculated column to be adjacent to the other columns, that's probably the easiest. There's something to be said for separating raw data from calculations anyways.
If you must increase the size of the table, look into using h5py. It provides a more direct interface to the h5 file. Keep in mind that depending on how the data set was created in the h5 file, it may not be possible to simply append a column to the data. See section 1.2.4, "Dataspace" in http://www.hdfgroup.org/HDF5/doc/UG/03_DataModel.html for a discussion regarding the general data format. h5py supports resize if the underlying dataset supports it.
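
For the h5py route, a minimal sketch of a resizable dataset that grows as chunks arrive (this is not the pytables API from the question; the file name, dataset name and chunk source are made up for illustration):

import h5py
import numpy as np

N = 8                                                      # hypothetical column count
chunks = [np.random.random((1000, N)) for _ in range(3)]   # hypothetical (L, N) blocks

with h5py.File("output.h5", "w") as h5file:
    # maxshape=(None, N) makes the first axis resizable
    dset = h5file.create_dataset("data", shape=(0, N), maxshape=(None, N), dtype="f8")
    for chunk in chunks:
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[-chunk.shape[0]:] = chunk                     # write the new block at the end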
You could also use a single buffer to store the input data like so:
z = np.zeros((nrows, N))
while more_data_in_file:
    # Read a data block (np.fromfile returns a 1D array, so reshape it to the block)
    z[:, :N-1] = np.fromfile('your_params').reshape(nrows, N - 1)
    # Set the final column from the other N-1 columns
    z[:, N-1:N] = f(z[:, :N-1])
    # Append the block to the table
    tables_handle.append(z)
