I am fairly new to pandas, and I need to import a 3D array of tuples from a data file. In the file, the data is formatted like so:
[[(1.1, 1.2), (1.3, 1.4)], [(1.5, 1.6), (1.7, 1.8)], [(1.9, 1.10), (1.11, 1.12)], [(1.13, 1.14), (1.15, 1.16)]]
[[(2.1, 2.2), (2.3, 2.4)], [(2.5, 2.6), (2.7, 2.8)], [(2.9, 2.10), (2.11, 2.12)], [(2.13, 2.14), (2.15, 2.16)]]
[[(3.1, 3.2), (3.3, 3.4)], [(3.5, 3.6), (3.7, 3.8)], [(3.9, 3.10), (3.11, 3.12)], [(3.13, 3.14), (3.15, 3.16)]]
I would like to be able to import this into a data frame such that (for this example) the dimensionality would be 3x4x2 (with another x2, if you want to count the dimensions of the tuples, though those don't necessarily need their own dimension, so long as I can access them as tuples).
In actuality, my data set is much larger than this (with dimensions of roughly 13000x2000x2), so I would like to keep any manual editing that might be needed to a minimum, though I should be able to change how the data is formatted in the file with some simple scripts, if a different format would help.
Even though eval is a dangerous tool, it gives a one-liner to collect the data here:
import numpy as np

with open('data.csv') as f:
    a = np.array([eval(x) for x in f.readlines()])
Check:
In [59]: a.shape
Out[59]: (3, 4, 2, 2)
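Since eval will happily execute arbitrary code, a safer variant of the same idea (a sketch, assuming the file contains only Python literals) is ast.literal_eval:
import ast
import numpy as np

with open('data.csv') as f:
    a = np.array([ast.literal_eval(line) for line in f])

print(a.shape)   # (3, 4, 2, 2) for the example data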
There is no such thing as a multidimensional dataframe with pandas.
You could use several dataframes and relate them to each other with an extra id column.
Or you could flatten your 3D array into a dataframe with several columns:
your rows would be the observations, in this case 3
your columns would be the flattened output: 4 x 2 = 8
You could use numpy to reshape:
new_array = numpy.reshape(array, (3,8))
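If you want a DataFrame whose cells are still tuples, a rough sketch (assuming the same 'data.csv' layout, where each line parses into 4 blocks of 2 tuples) is to flatten each line into 8 tuple-valued columns:
import ast
import pandas as pd

with open('data.csv') as f:
    rows = [ast.literal_eval(line) for line in f]

# one row per line of the file, 4 x 2 = 8 columns, each cell holding an (x, y) tuple
df = pd.DataFrame([[t for block in row for t in block] for row in rows])
print(df.shape)   # (3, 8)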
Related
I have two Pandas dataframes, say df1 and df2 (shape (10, 15)), and I want to turn them into Numpy arrays, and then construct a Numpy array containing both of them (shape (2, 10, 15)). I'm currently doing this as follows:
data1 = df1.to_numpy()
data2 = df2.to_numpy()
data = np.array([data1, data2])
Now I'm trying to do this for many pairs of dataframes, and the code I'm using will break when I call data.any() for some of the pairs, giving the truth value error saying to use any() or all() (which I'm already doing). I started printing data when I saw this happening, and I noticed that the np.array() constructor will produce something that looks like [[[...]]] or [array([[...]])].
The first one works fine, but the second doesn't. The difference isn't random with respect to the dataframes; it breaks for certain ones, yet all of these dataframes are preprocessed and processed the same way, and I've manually checked that the ones that don't work don't have any anomalies.
Since I can't provide much explicit code/data (the code is pretty bulky, and the arrays are 300 entries each), my main question is why the array constructor gives either the [[[...]]] or the [array([[...]])] form, and why the second one fails when I call data.any()?
The issue is that after processing the data, some of the dataframes were missing rows (ie. of shape (x, 15) where x<10). The construction of the data array would give a shape of (2,) when this happened, so as long as both df1 and df2 had the same number of rows it worked fine.
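A quick way to catch the bad pairs before stacking (hypothetical shapes standing in for your processed dataframes):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((10, 15)))
df2 = pd.DataFrame(np.zeros((9, 15)))    # a row was dropped during processing

if df1.shape != df2.shape:
    # mismatched shapes would give a ragged object array of shape (2,),
    # whose .any() raises the ambiguous-truth-value error
    raise ValueError(f"shape mismatch: {df1.shape} vs {df2.shape}")
data = np.array([df1.to_numpy(), df2.to_numpy()])   # shape (2, 10, 15) when shapes match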
I'd like to sort a matrix of shape (N, 2) on the first column where N >> system memory.
With in-memory numpy you can do:
x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]
But that appears to require that x[:,0].argsort() fit in memory, which won't work for memmap where N >> system memory (please correct me if this assumption is wrong).
Can I achieve this sort in-place with numpy memmap?
(assume heapsort is used for sorting and simple numeric data types are used)
The solution may be simple: use the order argument to an in-place sort. Of course, order requires field names, so those have to be added first.
d = x.dtype
x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
array([[(2, 10)],
[(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])
The field names are strings, corresponding to the column indices. Sorting can be done in place with
x.sort(order='0', axis=0)
Then convert back to a regular array with the original datatype
x.view(d)
array([[ 1, 20],
[ 2, 10]])
That should work, although you may need to change how the view is taken depending on how the data is stored on disk; see the docs:
For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
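For the memmap case, a minimal sketch of the whole recipe (hypothetical filename, assuming the file holds C-ordered int64 pairs):
import numpy as np

x = np.memmap('big_data.dat', dtype=np.int64, mode='r+').reshape(-1, 2)  # (N, 2) view of the file

d = x.dtype
xs = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])  # (N, 1) structured view
xs.sort(order='0', axis=0, kind='heapsort')  # sorts the file-backed buffer in place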
#user2699 answered the question beautifully. I'm adding this solution as a simplified example in case you don't mind keeping your data as a structured array, which does away with the view.
import numpy as np
filename = '/tmp/test'
x = np.memmap(filename, dtype=[('index', '<f2'),('other1', '<f2'),('other2', '<f2')], mode='w+', shape=(2,))
x[0] = (2, 10, 30)
x[1] = (1, 20, 20)
print(x.shape)
print(x)
x.sort(order='index', axis=0, kind='heapsort')
print(x)
(2,)
[(2., 10., 30.) (1., 20., 20.)]
[(1., 20., 20.) (2., 10., 30.)]
Also the dtype formats are documented here.
I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is effectively a 90x1699x1699 array: for each of the 90 remaining columns, every one of the 1699 elements gets 1699 results.
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob
fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3,len(NIR_band)):
    for j in range(0,len(NIR_band[3])):
        loop_list = NIR_band[i][j]/NIR_band[i,:]
        C_values.append(loop_list)
What it produces is an array of 1699x1699 dimension. Each individual array is the results from the Row calculations. Another complaint is that the code takes ages to run. So, I have two questions, is it possible to create the type of array I'd like to work with? And, is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
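For example, on a small stand-in array you can verify the shape and the indexing convention:
import numpy as np

a = np.arange(1.0, 13.0).reshape(4, 3)            # small stand-in for the (1699, 90) data
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)       # (4, 4, 3)
print(result[1, 2, 0])    # equals a[1, 0] / a[2, 0]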
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this snippet loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is shape[0], the size of the 1st dimension. len(NIR_band[3]) is shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
The NIR_band[i][j]/NIR_band[i,:], I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the two versions are broadcast against each other. np.outer does the multiplication analogue of this calculation.
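A sketch of the division version of that calculation, assuming NIR_band has shape (93, 1699) as in your code (the result is roughly 2 GB of float64):
sub = NIR_band[3:, :]                    # drop the first three columns -> (90, 1699)
C = sub[:, :, None] / sub[:, None, :]    # (90, 1699, 1699); C[c, i, j] = sub[c, i] / sub[c, j]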
I have a situation where I'm writing .TDMS files to .txt files. The TDMS file structure has groups that consist of channels that are in numpy array format. I'm trying to figure out the fastest way to loop over the groups and the channels and transpose them to "table" format. I have a working solution but I don't think it's that fast.
A snippet of my code, where the file has been opened to the tdms variable and the groups variable is a list of groups in that .tdms file. The code loops over the groups and opens the list of channels in each group into the channels variable. To get the numpy array data from a channel you use channels[list index].data. Then with column_stack I add the channels to "table" format one by one and save the array with np.savetxt. Could there be a faster way to do this?
for i in range(len(groups)):
    output = filepath+"\\"+str(groups[i])+".txt"
    channels = tdms.group_channels(groups[i]) #list of channels in i group
    data=channels[0].data #np.array for first index in channels list
    for i in range(1,len(channels)):
        data=np.column_stack((data,channels[i].data))
    np.savetxt(output,data,delimiter="|",newline="\n")
Channel data is 1D array. length = 6200
channels[0].data = array([ 12.74204722, 12.74205311, 12.74205884, ...,
12.78374288, 12.7837487 , 13.78375434])
I think you can streamline the column_stack application with:
np.column_stack([chl.data for chl in channels])
I don't think it will save on time (much)
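Putting that together, the loop might look like this (a sketch, keeping your tdms/group_channels usage as-is):
for group in groups:
    channels = tdms.group_channels(group)
    data = np.column_stack([chl.data for chl in channels])   # shape (6200, number of channels)
    np.savetxt(filepath + "\\" + str(group) + ".txt", data, delimiter="|", newline="\n")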
Is this what your data looks like?
In [138]: np.column_stack([[0,1,2],[10,11,12],[20,21,22]])
Out[138]:
array([[ 0, 10, 20],
[ 1, 11, 21],
[ 2, 12, 22]])
savetxt iterates through the rows of data, performing a format and write on each.
Since each row of the output consists of a data point from each of the channels, I think you have to assemble a 2d array like this. And you have to iterate over the channels to do that (assuming they are non-vectorizable objects).
And there doesn't appear to be any advantage to looping through the rows of data and doing your own line write. savetxt is relatively simple Python code.
With 1d arrays, all the same size, this construction is even simpler, and faster with the basic np.array:
np.array([[0,1,2],[10,11,12],[20,21,22]]).T
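Applied to the channel data (same assumption that every channel has the same length), that would be:
data = np.array([chl.data for chl in channels]).T   # one column per channel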
I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.
Here's my code.
data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)
#Add headers
csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))
Dimension-based diagnostics match what I expect:
print len(csv_names)
>> 108
print data.shape
>> (1652, 108)
"print data.dtype.names" also returns the expected output.
But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...
print data["EDUC"].shape
>> (1652, 108)
... and it appears to contain more missing values than there are rows in the data set.
print np.sum(np.isnan(data["EDUC"]))
>> 27976
Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!
The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.
Here is what you must know about NumPy:
NumPy arrays only contain elements of a single type.
If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).
In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore the same size, so they can be accessed, at a low level, very simply and quickly.
Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).
If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:
data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))
(you were close: you used astype() instead of view()).
You can also check the answers to quite a few Stackoverflow questions, including Converting a 2D numpy array to a structured array and how to convert regular numpy array to record array?.
Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via
data = np.genfromtxt(csv_file, delimiter=',', names=True)
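With names=True, genfromtxt reads the header row itself and returns a structured array, so a field lookup gives a 1-D column (a quick check, assuming your file really has an EDUC column as in your example):
data = np.genfromtxt(csv_file, delimiter=',', names=True)
print(data.shape)           # (1652,) -- one structured record per row
print(data['EDUC'].shape)   # (1652,)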
EDIT:
It seems like adding field names only works when the input is a list of tuples:
data = np.array([tuple(row) for row in data], dtype=[(n, 'float64') for n in csv_names])