Just to give you some context:
I have to translate some MATLAB code into Python 3, but I've run into a little problem.
Matlab:
for i = 1:num_nodes
    for j = 1:num_nodes
        K{i,j} = zeros(3,3);
Which I translated into:
k_topology = [[]]
for i in range(x):
for i in range(x):
k_topology[[i][j]].extend(np.zeros(3,3))
Also, further in the Matlab code there's a third loop:
for k = 1:3
    K{i,j}(k,k) = -1
Which also kind of... Upsets me?
The fact is, I don't really see how I can translate this kind of variable into Python. I also realize that my Python code is kind of "broken" - and I'm not asking any of you to improve it - so I'm just asking: what is the best way to translate Matlab's cells into Python?
I finally found something apparently simple to translate this, using a list comprehension - following kazemakase's answer. The actual Python code now looks like this:
k_topology = [[np.zeros((3,3)) for j in range(self.get_nb_nodes_from_network())]\
for i in range(self.get_nb_nodes_from_network())]
And the output looks something like this:
[[array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])], ..., [array(...)]]
(There are really too many values to paste here, but I think you get the idea.)
The first question you need to ask is "what is a Matlab cell and what could be a suitable corresponding Python type?"
If I remember correctly from my bad old Matlab days, a cell is sort of a container that holds content of mixed types. It is something like a dynamically typed array or matrix. It is multidimensionally indexed.
Python is dynamically typed, so any Python container can basically fulfill this function. Lists in Python are indexed, so nested lists could work - but they are somewhat awkward to set up and access:
K = [[None] * num_nodes for _ in range(num_nodes)]
K[i][j] # need two indices to access elements of a nested list.
For this particular scenario a dictionary better mirrors the Matlab syntax. Although a dictionary takes only one index, we can exploit the fact that tuples can be written without parentheses and that dictionaries can take tuples as keys:
K = {}
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = np.zeros((3, 3))
        for k in range(3):
            K[i, j][k, k] = -1
While the dictionary is syntactically more concise, element access is potentially less performant than with nested lists. Nested lists, on the other hand, look less like the Matlab code. The choice depends on whether you value performance or similarity to the original code - but if performance is an issue there are many more things to consider anyway. In summary: there is no single best way to do it.
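To make that trade-off concrete, here is a rough, hypothetical micro-benchmark of single-element access (not from the original answer; numbers will vary by machine):

import timeit

setup = """
import numpy as np
n = 50
K_dict = {(i, j): np.zeros((3, 3)) for i in range(n) for j in range(n)}
K_list = [[np.zeros((3, 3)) for j in range(n)] for i in range(n)]
"""
# Time a single element lookup in the dictionary vs. the nested list
print(timeit.timeit("K_dict[25, 25]", setup=setup, number=1000000))
print(timeit.timeit("K_list[25][25]", setup=setup, number=1000000))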
Since the OP explicitly asked not to improve the code, I explicitly ask him/her to ignore this part of the answer.
A better way to build such diagonal matrices is to use np.eye instead of looping over the diagonal elements.
K = {}
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = -np.eye(3)
Also, nested lists can be constructed without (much) prior initialization, if that is the preferred approach:
K = []
for i in range(num_nodes):
    K.append([])
    for j in range(num_nodes):
        K[-1].append(-np.eye(3))
Now, for the peace of my soul, let me take apart the OP's code and provide some feedback:
k_topology = [[]]
for i in range(x):
for i in range(x):
k_topology[[i][j]].extend(np.zeros(3,3))
This has nothing to do with the original Matlab code (different variable names)
Both loops use i. j is never defined.
[[i][j]] builds a list with one element i and tries to take the jth element. If j is ever something other than 0 this will cause an error.
list.extend appends all elements of the argument individually to the list - in this case, the individual rows. list.append would be correct here, since the whole 3x3 matrix should be appended as a single element of K.
np.zeros(3, 3) should be np.zeros((3, 3)) (assuming np is an alias for numpy) because the function takes the shape as its first argument, not as multiple arguments.
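Putting those points together, a corrected version of the snippet might look like this (just a sketch; x is assumed to be the number of nodes and is not defined in the original question):

import numpy as np

x = 4  # assumed number of nodes, purely for illustration
k_topology = []
for i in range(x):
    row = []
    for j in range(x):
        row.append(np.zeros((3, 3)))  # append the whole 3x3 block as one element
    k_topology.append(row)

k_topology[1][2]  # access cell (i, j) with two separate indices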
Using the Octave/scipy save/loadmat that I demonstrated in the linked post:
In an Octave session
>> num_nodes=3
num_nodes = 3
>> num_nodes=3;
>> K=cell(num_nodes, num_nodes);
>> for i = 1:num_nodes
for j = 1:num_nodes
K{i,j} = zeros(2,2);
end
end
>> K
K =
{
[1,1] =
0 0
0 0
[2,1] =
0 0
0 0
etc
Access one cell:
>> K{1,2}
ans =
0 0
0 0
Access one element of one cell:
>> K{1,2}(1,1)
ans = 0
>> save -7 kfile.mat K
In Python
In [31]: from scipy import io
In [32]: data = io.loadmat('kfile.mat')
In [34]: data
Out[34]:
{'K': array([[array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]])],
[array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]])],
[array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]]),
array([[ 0., 0.],
[ 0., 0.]])]], dtype=object),
'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file, written by Octave 4.0.0, 2017-02-15 19:05:44 UTC',
'__version__': '1.0'}
In [35]: data['K'].shape
Out[35]: (3, 3)
In [36]: data['K'][0,0].shape
Out[36]: (2, 2)
In [37]: data['K'][0,0][0,0]
Out[37]: 0.0
loadmat treats a cell as a 2d object-dtype array, while regular matrices become 2d numeric arrays. Object arrays are, in many ways, like nested Python lists.
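If you don't actually need the round trip through a .mat file, a sketch of building the equivalent object array directly in numpy (mirroring the cell above) could look like this:

import numpy as np

num_nodes = 3
# Build a 2d object-dtype array whose cells are 2x2 numeric arrays,
# analogous to the Octave cell array above.
K = np.empty((num_nodes, num_nodes), dtype=object)
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = np.zeros((2, 2))

K[0, 1]        # one "cell": a 2x2 array
K[0, 1][0, 0]  # one element of that cell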
I want to change a number of values in my pandas dataframe, where the indices that are indicating the columns may vary in size.
I need something that is faster than a for-loop, because it will be done on a lot of rows, and this turned out to be too slow.
As a simple example, consider this
df = pd.DataFrame(np.zeros((5,5)))
Now, I want to change some of the values in this dataframe to 1. If I e.g. want to change the values in the second and fifth rows for the first two columns, but in the fourth row I want to change all the values, I want something like this to work:
col_indices = np.array([np.arange(2),np.arange(5),np.arange(2)])
row_indices = np.array([1,3,4])
df.loc(row_indices,col_indices) =1
However, this does not work (I suspect it fails because the shape of the data I would be selecting does not conform to a dataframe).
Is there any more flexible way of indexing without having to loop over rows etc.?
A solution that works only for range-like arrays (as above) would solve my current problem - but a general answer would also be nice.
Thanks for any help!
IIUC, here's one approach. Instead of arrays of positions, define the column indices as the number of leading columns you want to fill with 1s in each row, together with the rows where you want to insert them:
col_indices = np.array([2,5,2])
row_indices = np.array([1,3,4])
arr = df.values
And use advanced indexing to set the cells of interest to 1:
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]
array([[0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1.],
[1., 1., 0., 0., 0.]])
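As a minimal end-to-end sketch using the example frame from the question (the result is wrapped back into a DataFrame at the end):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)))

col_indices = np.array([2, 5, 2])  # number of leading columns to set per row
row_indices = np.array([1, 3, 4])  # rows to modify

arr = df.values
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]
df = pd.DataFrame(arr)  # wrap the modified array back into a DataFrame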
For the purpose of this exercise, let's consider a matrix where the element m_{i, j} is given by the rule m_{i, j} = i*j if i == j, and 0 otherwise.
Is there an easy "numpy" way of calculating such a matrix without having to resort to if statements checking for the indices?
You can use the numpy function diag to construct a diagonal matrix if you give it the intended diagonal as a 1D array as input.
So you just need to create that diagonal, e.g. [i**2 for i in range(N)] with N the dimension of the matrix.
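A minimal sketch of that idea (with N chosen arbitrarily here):

import numpy as np

N = 4
m = np.diag([i * i for i in range(N)])  # i*j reduces to i**2 on the diagonal (i == j)
# m is a 4x4 matrix with 0, 1, 4, 9 on the diagonal and zeros elsewhere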
You could use the identity matrix given by numpy.identity(n) and then multiply it by an n-dimensional vector.
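A short sketch of that approach (again with an arbitrary n):

import numpy as np

n = 4
v = np.arange(n) ** 2   # desired diagonal values
m = np.identity(n) * v  # broadcasting puts v on the diagonal, zeros elsewhere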
Assuming you have a square matrix, you can do this:
import numpy as np
ary = np.zeros((4, 4))
_ = [ary.__setitem__((i, i), i**2) for i in range(ary.shape[0])]
print(ary)
# array([[0., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 4., 0.],
# [0., 0., 0., 9.]])
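As a side note (not part of the original answer), the same result can be obtained without relying on a comprehension for its side effects by using np.fill_diagonal:

import numpy as np

ary = np.zeros((4, 4))
np.fill_diagonal(ary, np.arange(4) ** 2)  # writes 0, 1, 4, 9 onto the diagonal in place
print(ary)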
Now I have a 50GB dataset saved as an HDF5 file (opened with h5py), which acts like a dictionary. The dictionary contains keys from 0 to n, and the values are 3-dimensional numpy ndarrays which all have the same shape. For example:
dictionary[0] = np.array([[[...],[...]]...])
I want to concatenate all these np arrays, with code like
sample = np.concatenate(list(dictionary.values))
this operation wastes 100GB of memory! If I use
del dictionary
it will decrease to 50GB of memory. But I want to keep the memory usage at 50GB while loading the data. Another way I tried is this:
sample = np.concatenate(sample,dictionary[key])
It still uses 100GB of memory. I think that in all the cases above, the right-hand side creates a new memory block, which is then assigned to the left-hand side, doubling the memory used during the calculation. So the third way I tried is this:
sample = np.empty(shape)
with h5py.File(...) as dictionary:
for key in dictionary.keys():
sample[key] = dictionary[key]
I thought this code had an advantage: the value dictionary[key] is assigned to one row of sample, and then the memory for dictionary[key] should be cleared. However, I tested it and found that the memory usage is still 100GB. Why?
Are there any good methods to limit the memory usage to 50GB?
Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1 you'll need far less memory at once, but at the cost of losing the dictionary.
import numpy as np
import time
def test1(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
return b
def test2(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.concatenate(list(a.values()))
return b
x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)
Results:
test1 : 0.71 s
test2 : 1.39 s
The first peak is for test1; it's not exactly in-place, but it reduces the memory usage quite a bit.
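As an aside (not from the original answer), if you want to measure the peak memory of the two approaches yourself, something along these lines should work with the memory_profiler package installed:

from memory_profiler import memory_usage

# memory_usage samples the process memory while the callable runs (values in MiB)
peak1 = max(memory_usage((test1, (1000000,), {})))
peak2 = max(memory_usage((test2, (1000000,), {})))
print(peak1, peak2)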
dictionary[key] is a dataset on the file. dictionary[key][...] will be a numpy array - that dataset downloaded.
I imagine
sample[key] = dictionary[key]
is evaluated as
sample[key,...] = dictionary[key][...]
The dataset is downloaded, and then copied to a slice of the sample array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
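One way to sidestep the question of whether that intermediate array gets recycled is to let h5py write straight into the preallocated buffer with Dataset.read_direct. A rough sketch, assuming the datasets are named '0' to 'n' and all share one shape and dtype (the filename here is made up):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    keys = sorted(f.keys(), key=int)  # dataset names '0', '1', ... sorted numerically
    first = f[keys[0]]
    sample = np.empty((len(keys),) + first.shape, dtype=first.dtype)
    for i, k in enumerate(keys):
        f[k].read_direct(sample[i])   # read into the slice without a temporary copy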
You don't want to do the incremental concatenate - that's slow. A single concatenate on the list should be faster. I don't know for sure what
list(dictionary.values)
contains. Will it be references to the datasets, or downloaded arrays? Regardless, concatenate(...) on that list will have to use the downloaded arrays.
One thing puzzles me - how can you use the same key to index the first dimension of sample and dataset in dictionary? h5py keys are supposed to be strings, not integers.
Some testing
Note that I'm using string dataset names:
In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))
Your np.concatenate(list(dictionary.values)) code is missing ():
In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]:
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
<HDF5 dataset "1": shape (2, 3), type "<f8">,
<HDF5 dataset "2": shape (2, 3), type "<f8">,
<HDF5 dataset "3": shape (2, 3), type "<f8">]
So it's just a list of the datasets. The downloading occurs when concatenate does a np.asarray(a) for each element of the list:
In [28]: np.concatenate(list(f.values()))
Out[28]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 1., 1.],
[1., 1., 1.],
[0., 1., 2.],
[3., 4., 5.]])
e.g.:
In [29]: [np.array(a) for a in f.values()]
Out[29]:
[array([[0., 0., 0.],
[0., 0., 0.]]), array([[0., 0., 0.],
[0., 0., 0.]]), array([[1., 1., 1.],
[1., 1., 1.]]), array([[0., 1., 2.],
[3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
....
Let's look at what happens when using your iteration approach:
Make an array that can take one dataset in each 'row':
In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
...: v = d[...]
...: print(v.__array_interface__['data']) # databuffer location
...: samples[i,...] = v
...:
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]:
array([[[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.]],
[[1., 1., 1.],
[1., 1., 1.]],
[[0., 1., 2.],
[3., 4., 5.]]])
In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.
These are small arrays in an interactive ipython session. I don't know whether these observations apply to large cases.
I have a bit of code that loads up a long (100k-1mil) set of lines; each line has an index in the first column followed by 18 values, for a total of 19 floats per line. This is all put into a numpy array.
I need to do some simple processing on the matrix to keep the index column and get out 1s and 0s depending on whether values are positive or negative, but the criterion varies because the columns are sequential pairs of values with different reference values.
The code below goes through the columns 2-19 first by evens then odds to check the values, and then creates a temporary list to put into the array I want to have at the end.
I know there's a simpler way to do this, with list comprehension and possibly lambda, but I'm not proficient enough with this to figure it out. So I'm hoping someone can help me reduce the length of this code into something more compact. More efficient would be great too, but I know that the compact methods don't always increase efficiency. It will however help me better understand list comprehension, with and without numpy.
Sample values for reference:
0.000 72.250 -158.622 86.575 -151.153 85.807 -149.803 84.285 -143.701 77.723 -160.471 96.587 -144.020 75.827 -157.071 87.629 -148.856 100.814 -140.488
10.000 56.224 -174.351 108.309 -154.148 68.564 -155.721 83.634 -132.836 75.030 -177.971 100.623 -146.616 61.856 -150.885 92.147 -150.124 91.841 -153.112
20.000 53.357 -153.537 58.190 -160.235 77.575 176.257 93.771 -150.549 77.789 -161.534 103.589 -146.363 73.623 -159.441 99.315 -129.663 92.842 -138.736
And here is the code snippet:
datain = numpy.loadtxt('testfile.txt')  # load data
dataout = numpy.zeros(datain.shape) # initialize empty processing array
dataout[:, 0] = datain[:, 0] # assign time values from input data to processing array
dataarray = numpy.zeros(len(datain[0]))
phit = numpy.zeros((len(dataarray)-1)//2)
psit = numpy.zeros((len(dataarray)-1)//2)
for i in range(len(datain)):
dataarray = numpy.copy(datain[i])
phit[:] = dataarray[1::2]
psit[:] = dataarray[2::2]
temp = []
for j in range(len(phit)):
if(phit[j] < 0):
temp.append(1)
else:
temp.append(0)
if(psit[j] > 0):
temp.append(1)
else:
temp.append(0)
dataout[i][1:] = temp
Thanks in advance, I know there's a fair number of questions on these topics here; unfortunately I couldn't find one that helped me get to a solution.
As @abarnert mentioned, the solution here is not to write better loops, but (since you're using Numpy) to not loop in Python at all, by understanding how to use Numpy in more advanced ways.
What you have is a matrix like
[ [idx, v0a, v0b, v1a, v1b, ... ], ... ]
And you want a matrix that's basically
[ [idx, 1 if v0a < 0 else 0, 1 if v0b > 0 else 0, ... ], ... ]
We're going to do this in two steps: first, we'll transform the matrix slightly so that the comparisons are all the same; second, we'll apply the comparison in-place.
The only difference between how we handle "even" and "odd" columns is that one is being checked for <0, the other >0. If we modify the second group of columns by multiplying them by -1, then these comparisons both become simply <0:
datain[:, 2::2] *= -1
Now we just want to know, for every value (besides the first column), is that value <0. This is super easy:
datain[:, 1:] < 0
This returns a matrix of boolean values, where each value represents whether or not the corresponding cell in datain[:, 1:] was less than 0. You want these as integers, 1 for True and 0 for False; it turns out, when we assign these boolean values back into our original array (which contains floats), numpy will cast the bools into floats automatically; True will get cast to 1.0, and False will get cast to 0.0.
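A tiny illustration of that casting behaviour (not from the original answer):

import numpy as np

a = np.zeros(3)
a[:] = np.array([True, False, True])  # bools are cast to floats on assignment
# a is now array([1., 0., 1.])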
If you don't want to throw away your original data, simply copy it off first. Here's the complete code:
# If you want to preserve your old data, create a copy for us to modify
dataout = np.array(datain)
# Flip the sign of every second value column so both checks become "< 0"
dataout[:, 2::2] *= -1
# Now assign your integer values into your data array
dataout[:, 1:] = dataout[:, 1:] < 0
For the sample input you provided, the intermediate array (after flipping the signs of the even-numbered value columns) looks like this:
array([[ 0. , 72.25 , 158.622, 86.575, 151.153, 85.807,
149.803, 84.285, 143.701, 77.723, 160.471, 96.587,
144.02 , 75.827, 157.071, 87.629, 148.856, 100.814,
140.488],
[ 10. , 56.224, 174.351, 108.309, 154.148, 68.564,
155.721, 83.634, 132.836, 75.03 , 177.971, 100.623,
146.616, 61.856, 150.885, 92.147, 150.124, 91.841,
153.112],
[ 20. , 53.357, 153.537, 58.19 , 160.235, 77.575,
-176.257, 93.771, 150.549, 77.789, 161.534, 103.589,
146.363, 73.623, 159.441, 99.315, 129.663, 92.842,
138.736]])
This code ends up with the following final result:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.],
[10., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.],
[20., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.]])
Thanks to abarnert for pointing me in the right direction with this, the solution is pretty simple.
datain = numpy.loadtxt('testfile.txt')  # load data
dataout = numpy.empty(datain.shape, dtype=int) # initialize empty processing array
dataout[:, 0] = datain[:, 0] # assign time values from input data to processing array
dataout[:, 1::2] = datain[:, 1::2] < 0
dataout[:, 2::2] = datain[:, 2::2] > 0
That's it! Much shorter, much more readable, and gets me the values I want.
I wish to initialize a symmetric matrix in python and populate it with zeros.
At the moment, I have initialized an array of known dimensions, but this is unsuitable for subsequent input into R as a distance matrix.
Are there any 'simple' methods in numpy to create a symmetric matrix?
Edit
I should clarify - creating the 'symmetric' matrix is fine. However, I am interested in generating only the lower triangular form, i.e.,
ar = numpy.zeros((3, 3))
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
I want:
array([[ 0],
[ 0, 0 ],
[ 0., 0., 0.]])
Is this possible?
I don't think it's feasible to try to work with that kind of triangular array.
So here is for example a straightforward implementation of (squared) pairwise Euclidean distances:
import numpy as np

def pdista(X):
    """Squared pairwise distances between all columns of X."""
    B = np.dot(X.T, X)
    q = np.diag(B)[:, None]
    return q + q.T - 2 * B
Performance-wise it's hard to beat (at the Python level). What would be the main advantage of not using this approach?
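A small usage sketch of the pdista function defined above (with random data, purely illustrative); the lower triangle of the result can then be pulled out if that layout is really needed:

X = np.random.rand(3, 5)       # five 3-D points stored as columns
D = pdista(X)                  # (5, 5) matrix of squared pairwise distances
lower = D[np.tril_indices(5)]  # lower-triangular entries as a flat 1-D array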