I'm working with Python 3 and I would like to load data from several CSV files.
Each CSV file (one measurement) has 3 columns (3 different physical quantities). I want to load the three quantities into three separate variables. For one CSV file this is quite simple; I used:
TIME,CH1,CH2 = loadtxt(file_path,usecols=(3,4,5),delimiter=',',skiprows=2,unpack=True)
and it worked fine. Now I would like to extend this procedure to load several CSV files. Each variable would become a 2D array, with each column representing one CSV file. Instead of several CSV files with three variables each, I will have three 2D arrays, which is much more convenient for data analysis.
I thought I could try something like this:
TIME = matrix(zeros((20480, len(file_path))))  # 20480: length of each column
CH1 = matrix(zeros((20480, len(file_path))))   # len(file_path): number of CSV files
CH2 = matrix(zeros((20480, len(file_path))))
for k in range(len(file_path)):  # read each CSV file
    TIME[:, k], CH1[:, k], CH2[:, k] = loadtxt(file_path[k], usecols=(3, 4, 5), delimiter=',', skiprows=2, unpack=True)
But it gives me:
ValueError: could not broadcast input array from shape (20480) into shape (20480,1)
In the end I would like variables looking like this:
TIME = matrix([[0., 0., 0.],
               [0., 0., 0.],
               [0., 0., 0.],
               ...,
               [0., 0., 0.],
               [0., 0., 0.],
               [0., 0., 0.]])
Each column comes from a different CSV file.
I think this is quite a common problem, but I don't really understand how arrays work in Python. I got the idea from Matlab, where this is quite straightforward, but here I don't know why indexing arrays with TIME[:][:] doesn't work.
Do you have any idea how I could do this?
Thanks.
Use np.array, not np.matrix
I can't emphasize this enough. np.matrix exists only for legacy reasons. See this answer for an explanation of the difference. np.matrix requires 2 dimensions, while np.array permits a single dimension when indexing. This seems to be the source of your error.
Here's a minimal example exhibiting the behaviour you are seeing:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.matrix(A)

print(A[:, 0].shape)  # (2,)
print(B[:, 0].shape)  # (2, 1)
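As an aside, this also explains why TIME[:][:] doesn't behave like Matlab's A(:, :): each [:] is applied in sequence, and A[:] just returns the whole array, so chained indexing selects rows, not columns. Continuing the example above:

print(A[:][0])   # [1 2 3] -- same as A[0]: the first row
print(A[:, 0])   # [1 4]   -- the first column, like Matlab's A(:, 1)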
Therefore, define your resultant arrays as np.array objects:
m = 20480
n = len(file_path)
shape = (m, n)
TIME = np.zeros(shape)
CH1 = np.zeros(shape)
CH2 = np.zeros(shape)
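With those in place, the original loop should work unchanged. Here is a minimal sketch of the whole thing; the file names in file_path are hypothetical, and it assumes, as in the question, that every file has exactly 20480 data rows:

import numpy as np

file_path = ['meas1.csv', 'meas2.csv', 'meas3.csv']  # hypothetical file list

m = 20480  # rows per file, per the question
n = len(file_path)

TIME = np.zeros((m, n))
CH1 = np.zeros((m, n))
CH2 = np.zeros((m, n))

for k, path in enumerate(file_path):
    # unpack=True yields three 1-D arrays of shape (m,), which assign
    # cleanly into the k-th column of each np.array (unlike np.matrix)
    TIME[:, k], CH1[:, k], CH2[:, k] = np.loadtxt(
        path, usecols=(3, 4, 5), delimiter=',', skiprows=2, unpack=True)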
I have a bit of code that loads up a long (100k-1mil lines) file: each line has an index in the first column followed by 18 values, for a total of 19 floats per line. This is all put into a numpy array.
I need to do some simple processing on the matrix to keep the index column and get out 1s and 0s depending on whether values are positive or negative, but the criterion alternates because the columns are sequential pairs of values with different reference values.
The code below goes through columns 2-19, first the evens then the odds, to check the values, and then builds a temporary list that is put into the array I want to have at the end.
I know there's a simpler way to do this, with list comprehension and possibly a lambda, but I'm not proficient enough with this to figure it out. So I'm hoping someone can help me reduce this code to something more compact. More efficient would be great too, but I know the compact methods don't always increase efficiency. It will, however, help me better understand list comprehension, with and without numpy.
Sample values for reference:
0.000 72.250 -158.622 86.575 -151.153 85.807 -149.803 84.285 -143.701 77.723 -160.471 96.587 -144.020 75.827 -157.071 87.629 -148.856 100.814 -140.488
10.000 56.224 -174.351 108.309 -154.148 68.564 -155.721 83.634 -132.836 75.030 -177.971 100.623 -146.616 61.856 -150.885 92.147 -150.124 91.841 -153.112
20.000 53.357 -153.537 58.190 -160.235 77.575 176.257 93.771 -150.549 77.789 -161.534 103.589 -146.363 73.623 -159.441 99.315 -129.663 92.842 -138.736
And here is the code snippet:
import numpy

datain = numpy.loadtxt('testfile.txt')  # load data
dataout = numpy.zeros(datain.shape)     # initialize empty processing array
dataout[:, 0] = datain[:, 0]            # assign time values from input data to processing array
dataarray = numpy.zeros(len(datain[0]))
phit = numpy.zeros((len(dataarray) - 1) // 2)
psit = numpy.zeros((len(dataarray) - 1) // 2)

for i in range(len(datain)):
    dataarray = numpy.copy(datain[i])
    phit[:] = dataarray[1::2]  # odd columns
    psit[:] = dataarray[2::2]  # even columns
    temp = []
    for j in range(len(phit)):
        if phit[j] < 0:
            temp.append(1)
        else:
            temp.append(0)
        if psit[j] > 0:
            temp.append(1)
        else:
            temp.append(0)
    dataout[i][1:] = temp
Thanks in advance, I know there's a fair number of questions on these topics here; unfortunately I couldn't find one that helped me get to a solution.
As #abarnert mentioned, the solution here is not to write better loops, but (since you're using Numpy) to not loop in Python at all by understanding how to use Numpy in more advanced ways.
What you have is a matrix like
[ [idx, v0a, v0b, v1a, v1b, ... ], ... ]
And you want a matrix that's basically
[ [idx, 1 if v0a < 0 else 0, 1 if v0b > 0 else 0, ... ], ... ]
We're going to do this in two steps: first, we'll transform the matrix slightly so that the comparisons are all the same; second, we'll apply the comparison in-place.
The only difference between how we handle "even" and "odd" columns is that one is being checked for <0, the other >0. If we modify the second group of columns by multiplying them by -1, then these comparisons both become simply <0:
datain[:, 2::2] *= -1
Now we just want to know, for every value (besides the first column), is that value <0. This is super easy:
datain[:, 1:] < 0
This returns a matrix of boolean values, where each value represents whether or not the corresponding cell in datain[:, 1:] was less than 0. You want these as integers, 1 for True and 0 for False; it turns out, when we assign these boolean values back into our original array (which contains floats), numpy will cast the bools into floats automatically; True will get cast to 1.0, and False will get cast to 0.0.
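A two-line illustration of that implicit cast:

import numpy as np

a = np.zeros(4)  # a float array
a[:] = np.array([True, False, True, False])
print(a)         # [1. 0. 1. 0.]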
If you don't want to throw away your original data, simply copy it off first. Here's the complete code:
# If you want to preserve your old data, create a copy for us to modify
dataout = np.array(datain)
# Now assign your integer values into your data array
dataout[:, 2::2] *= -1
dataout[:, 1:] = dataout[:, 1:] < 0
Starting from the sample input you provided, after flipping the sign of the even-numbered columns the array looks like this:
array([[   0.   ,   72.25 ,  158.622,   86.575,  151.153,   85.807,
         149.803,   84.285,  143.701,   77.723,  160.471,   96.587,
         144.02 ,   75.827,  157.071,   87.629,  148.856,  100.814,
         140.488],
       [  10.   ,   56.224,  174.351,  108.309,  154.148,   68.564,
         155.721,   83.634,  132.836,   75.03 ,  177.971,  100.623,
         146.616,   61.856,  150.885,   92.147,  150.124,   91.841,
         153.112],
       [  20.   ,   53.357,  153.537,   58.19 ,  160.235,   77.575,
        -176.257,   93.771,  150.549,   77.789,  161.534,  103.589,
         146.363,   73.623,  159.441,   99.315,  129.663,   92.842,
         138.736]])
This code ends up with the following final result:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.],
       [10., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.],
       [20., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.]])
Thanks to abarnert for pointing me in the right direction with this; the solution is pretty simple.
import numpy

datain = numpy.loadtxt('testfile.txt')          # load data
dataout = numpy.empty(datain.shape, dtype=int)  # initialize empty processing array
dataout[:, 0] = datain[:, 0]                    # assign time values from input data to processing array
dataout[:, 1::2] = datain[:, 1::2] < 0
dataout[:, 2::2] = datain[:, 2::2] > 0
That's it! Much shorter, much more readable, and gets me the values I want.
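For anyone who wants to verify this without the original file, here is a self-contained check on a tiny synthetic sample (hypothetical values, same layout: an index column followed by alternating pairs):

import io
import numpy as np

sample = io.StringIO(
    "0.0   1.5 -2.0  -3.0  4.0\n"
    "10.0 -1.0  2.0   5.0 -6.0\n"
)
datain = np.loadtxt(sample)
dataout = np.empty(datain.shape, dtype=int)
dataout[:, 0] = datain[:, 0]
dataout[:, 1::2] = datain[:, 1::2] < 0  # odd columns: flag negatives
dataout[:, 2::2] = datain[:, 2::2] > 0  # even columns: flag positives
print(dataout)
# [[ 0  0  0  1  1]
#  [10  1  1  0  0]]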
Is there a way to override certain operations?
import dask
import numpy as np
a = np.zeros((10,10))
a = dask.delayed(lambda x : x*2)(a)
I would like a[0] to return a number (instead of having to call a[0].compute()).
Is this possible?
The context is that I would like to have a series of images (3D array), and run operations like:
imgs2 = imgs - 1
imgs3 = imgs*mask
and then have an operation like imgs3[0] explicitly run imgs3[0].compute().
However, I now see many drawbacks with this method and I would like to withdraw this post. For one, it is quite limiting. Indexing like imgs3[:, :, 10] (all columns) may also have to end up as a computed result.
This seems to work fine for me:
In [1]: import dask
...: import numpy as np
...: a = np.zeros((10,10))
...: a = dask.delayed(lambda x : x*2)(a)
...:
In [2]: a[0]
Out[2]: Delayed('getitem-4eccd4e43153cac99d8e6d280cc1ad9c')
In [3]: a[0].compute()
Out[3]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
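If you really want indexing itself to trigger the computation, one option is a thin wrapper whose __getitem__ calls .compute() eagerly. This is only a sketch of the idea; the EagerDelayed class is hypothetical, not part of dask:

import dask
import numpy as np

class EagerDelayed:
    """Wraps a dask.delayed object so that indexing computes immediately."""
    def __init__(self, delayed_obj):
        self._delayed = delayed_obj

    def __getitem__(self, key):
        # Build the lazy getitem node, then evaluate it right away
        return self._delayed[key].compute()

a = dask.delayed(lambda x: x * 2)(np.zeros((10, 10)))
ea = EagerDelayed(a)
print(ea[0])  # a row of ten zeros, computed -- not a Delayed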
Just to give you some context:
I have to translate some MATLAB code into Python 3, but I've run into a little problem here.
Matlab:
for i = 1:num_nodes
    for j = 1:num_nodes
        K{i,j} = zeros(3,3);
    end
end
Which I translated into:
k_topology = [[]]
for i in range(x):
    for i in range(x):
        k_topology[[i][j]].extend(np.zeros(3,3))
Also, further in the Matlab code there's a third loop:
for k = 1:3
    K{i,j}(k,k) = -1;
end
Which also kind of... upsets me?
The fact is, I don't really see how I can translate this kind of variable into Python. I also guess my Python code is kind of "broken" - and I'm not asking any of you to improve it - so I'm just asking: what is the best way to translate Matlab's cells into Python?
I finally found something apparently simple to translate this, using list comprehension, following kazemakase's answer. The actual Python code now looks like this:
k_topology = [[np.zeros((3,3)) for j in range(self.get_nb_nodes_from_network())]\
for i in range(self.get_nb_nodes_from_network())]
And the output looks something like this:
[[array([[ 0., 0., 0.],
         [ 0., 0., 0.],
         [ 0., 0., 0.]]),
  array([[ 0., 0., 0.],
         [ 0., 0., 0.],
         [ 0., 0., 0.]]),
  array([[ 0., 0., 0.],
         [ 0., 0., 0.],
         [ 0., 0., 0.]])], ..., [array(...)]]
(There are really too many values to paste here, but I think you get the idea.)
The first question you need to ask is "what is a Matlab cell and what could be a suitable corresponding Python type?"
If I remember correctly from my bad old Matlab days, a cell is a sort of container that holds content of mixed types. It is something like a dynamically typed array or matrix, and it is multidimensionally indexed.
Python is dynamically typed, so any Python container can basically fulfill this function. Lists in Python are indexed, so nested lists could work - but they are somewhat awkward to set up and access:
K = [[None] * num_nodes for _ in range(num_nodes)]
K[i][j] # need two indices to access elements of a nested list.
For this particular scenario a dictionary better mirrors the Matlab syntax. Although a dictionary takes only one index, we can exploit the fact that tuples can be written without parentheses and that dictionaries can use tuples as keys:
K = {}
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = np.zeros((3, 3))
        for k in range(3):
            K[i, j][k, k] = -1
While the dictionary is syntactically more concise, element access is potentially less performant than with nested lists. Nested lists look different from the Matlab code. The choice depends on performance needs or on similarity to the original code. But if performance is an issue there are many more things to consider anyway. In summary: there is no one best way to do it.
Since the OP explicitly asked not to have the code improved, I explicitly ask him/her to ignore this part of the answer.
A better way to build these diagonal matrices is to use np.eye instead of looping over the diagonal elements:
K = {}
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = -np.eye(3)
Also, nested lists can be constructed without (much) prior initialization, if that is the preferred approach:
K = []
for i in range(num_nodes):
    K.append([])
    for j in range(num_nodes):
        K[-1].append(-np.eye(3))
Now, for the peace of my soul, let me provide some feedback on the OP's code:
k_topology = [[]]
for i in range(x):
    for i in range(x):
        k_topology[[i][j]].extend(np.zeros(3,3))
This has nothing to do with the original Matlab code (different variable names).
Both loops use i; j is never defined.
[[i][j]] builds a list with the single element i and tries to take its jth element. If j is ever anything other than 0, this will cause an error.
list.extend appends all elements of its argument individually to the list - in this case the individual rows. list.append would be correct here, as the whole 3x3 matrix should be appended as one element of K.
np.zeros(3, 3) should be np.zeros((3, 3)) (assuming np is an alias for numpy), because the function takes the shape as its first argument, not as multiple arguments.
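For completeness, putting all of those fixes together gives something like this (just a sketch; x is assumed to be the number of nodes, as in the OP's loop bounds):

import numpy as np

x = 4  # hypothetical number of nodes
k_topology = []
for i in range(x):
    row = []
    for j in range(x):
        row.append(np.zeros((3, 3)))  # append each whole 3x3 block
    k_topology.append(row)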
Using the Octave/scipy save/loadmat that I demonstrated in the linked post:
In an Octave session
>> num_nodes=3
num_nodes = 3
>> num_nodes=3;
>> K=cell(num_nodes, num_nodes);
>> for i = 1:num_nodes
for j = 1:num_nodes
K{i,j} = zeros(2,2);
end
end
>> K
K =
{
[1,1] =
0 0
0 0
[2,1] =
0 0
0 0
etc
Access one cell:
>> K{1,2}
ans =
0 0
0 0
Access one element of one cell:
>> K{1,2}(1,1)
ans = 0
>> save -7 kfile.mat K
In Python
In [31]: from scipy import io
In [32]: data = io.loadmat('kfile.mat')
In [34]: data
Out[34]:
{'K': array([[array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]])],
             [array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]])],
             [array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]]),
              array([[ 0., 0.],
                     [ 0., 0.]])]], dtype=object),
 '__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file, written by Octave 4.0.0, 2017-02-15 19:05:44 UTC',
 '__version__': '1.0'}
In [35]: data['K'].shape
Out[35]: (3, 3)
In [36]: data['K'][0,0].shape
Out[36]: (2, 2)
In [37]: data['K'][0,0][0,0]
Out[37]: 0.0
loadmat treats a cell as a 2d object-dtype array, while regular matrices are 2d numeric arrays. Object arrays are, in many ways, like nested Python lists.
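If you want to build the same kind of container directly in NumPy, without going through a .mat file, a 2-D object array mirrors what loadmat produces. A sketch:

import numpy as np

num_nodes = 3
K = np.empty((num_nodes, num_nodes), dtype=object)  # the cell-array analogue
for i in range(num_nodes):
    for j in range(num_nodes):
        K[i, j] = np.zeros((2, 2))

print(K.shape)        # (3, 3)
print(K[0, 0].shape)  # (2, 2)
print(K[0, 0][0, 0])  # 0.0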
Question
After fitting the data with neigh.fit(), I would like to access these data points. How do I do this?
Details
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> samplesy = [80, 60, 40]
>>> from sklearn import neighbors
>>> neigh = neighbors.KNeighborsRegressor(n_neighbors=1)
>>> neigh.fit(samples, samplesy)
>>> print(neigh.kneighbors([[1., 1., 1.]]))
(array([[ 0.5]]), array([[2]]))
So from this I learned that the closest data point is 'samples[2]'.
However, in the case where I no longer have access to the variable 'samples', is there a way to access the data point from 'neigh'? Maybe something like 'neigh[2]'? The data points have to be saved somewhere in the model of 'neigh', right?
Why
I would like to access the 5 closest neighbor data points and calculate a cluster center from them. Then I want to calculate the distance from this cluster center to the new data point, to get an idea of how far the new point is from the original data.
The data used to fit the model are stored in neigh._fit_X:
>>> neigh._fit_X
array([[ 0. , 0. , 0. ],
[ 0. , 0.5, 0. ],
[ 1. , 1. , 0.5]])
However: the leading underscore in the attribute name should be a signal to you that this is supposed to be a private attribute. You shouldn't expect this data to behave in any particular way, or even to exist in future versions of the library. Use it at your own risk.
A better way might be to just keep track of the input data on your own.
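For example, the cluster-center calculation described in the question can be done entirely with the public kneighbors API, as long as you keep the training array around yourself. A sketch (the names X_train/y_train are mine, and k is 2 here only because the toy set has just 3 points; it would be 5 with more data):

import numpy as np
from sklearn import neighbors

X_train = np.array([[0., 0., 0.], [0., .5, 0.], [1., 1., .5]])
y_train = np.array([80, 60, 40])

k = 2
neigh = neighbors.KNeighborsRegressor(n_neighbors=k)
neigh.fit(X_train, y_train)

query = np.array([[1., 1., 1.]])
dist, idx = neigh.kneighbors(query, n_neighbors=k)

center = X_train[idx[0]].mean(axis=0)            # centroid of the k nearest points
center_dist = np.linalg.norm(query[0] - center)  # distance from query to centroid
print(center, center_dist)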