I have a variable W that has:
[[1.]
[2.]
[3.]
[4.]
[5.]]
And another variable X that has:
[[1. 5.1 3.5 1.4 0.2]
[1. 4.9 3. 1.4 0.2]
[1. 4.7 3.2 1.3 0.2]
[1. 4.6 3.1 1.5 0.2]
[1. 5. 3.6 1.4 0.2]
[1. 5.4 3.9 1.7 0.4]
[1. 4.6 3.4 1.4 0.3]
[1. 5. 3.4 1.5 0.2]
[1. 4.4 2.9 1.4 0.2]
[1. 4.9 3.1 1.5 0.1]
[1. 5.4 3.7 1.5 0.2]
...
[1. 5.7 2.8 4.1 1.3]]
I keep guessing and checking to see how to np.dot them together. np.dot(W.T, X.T) seems to work, but the shape is wrong: (1, 100).
What I want to do is multiply like:
1 * 1 + 2 * 5.1 + 3 * 3.5 + 4 * 1.4 + 5 * 0.2 for each row in X. How can I do that?
Matrix multiplication is row by column:
           X
XXXXX      X      .
.....  *   X   =  .
.....      X      .
           X
So:
In [6]: a=np.array([[1, 5.1, 3.5, 1.4, 0.2],
...: [1, 4.9, 3, 1.4, 0.2],
...: [1, 4.7, 3.2, 1.3, 0.2],
...: [1, 4.6, 3.1, 1.5, 0.2],
...: [1, 5, 3.6, 1.4, 0.2],
...: [1, 5.4, 3.9, 1.7, 0.4],
...: [1, 4.6, 3.4, 1.4, 0.3],
...: [1, 5, 3.4, 1.5, 0.2],
...: [1, 4.4, 2.9, 1.4, 0.2],
...: [1, 4.9, 3.1, 1.5, 0.1],
...: [1, 5.4, 3.7, 1.5, 0.2]])
In [8]: b=np.array([[1.],
...: [2.],
...: [3.],
...: [4.],
...: [5.]])
In [25]: np.dot(a,b)
Out[25]:
array([[28.3],
[26.4],
[26.2],
[26.5],
[28.4],
[32.3],
[27.5],
[28.2],
[25.1],
[26.6],
[29.9]])
The last dimension of a must be the same size as the second-to-last dimension of b. That holds for np.dot(a, b): a has shape (11, 5) and b has shape (5, 1), so the result has shape (11, 1).
More references: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.dot.html
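To see why the original attempt came out transposed, here is a minimal shape check (using placeholder data for X, since only the shapes matter):
import numpy as np

W = np.arange(1.0, 6.0).reshape(5, 1)  # shape (5, 1)
X = np.ones((100, 5))                  # shape (100, 5), placeholder for the real data

print(np.dot(X, W).shape)      # (100, 1) -- one weighted sum per row of X
print(np.dot(W.T, X.T).shape)  # (1, 100) -- the same values, transposed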
You could use np.matmul:
W = np.array([[1.],[2.],[3.],[4.],[5.]])
X = np.array([[1., 5.1, 3.5, 1.4, 0.2],
              [1., 4.9, 3. , 1.4, 0.2],
              [1., 4.7, 3.2, 1.3, 0.2],
              [1., 4.6, 3.1, 1.5, 0.2],
              [1., 5. , 3.6, 1.4, 0.2],
              [1., 5.4, 3.9, 1.7, 0.4]])
np.matmul(X,W)
array([[28.3],
[26.4],
[26.2],
[26.5],
[28.4],
[32.3]])
Quick check on the output:
1*1 + 2*5.1 + 3*3.5 + 4*1.4 + 5*0.2 = 28.3
Note that in this case it is equivalent to np.dot given that both inputs are 2-D arrays.
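Since Python 3.5 the @ operator is another spelling of np.matmul, so the same product can also be written as:
X @ W  # identical to np.matmul(X, W) for these 2-D arrays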
I have a set of numpy arrays with different numbers of rows, and I would like to pad them to a fixed number of rows, e.g.
An array "a" with 3 rows:
a = [
[1.1, 2.1, 3.1]
[1.2, 2.2, 3.2]
[1.3, 2.3, 3.3]
]
I would like to convert "a" to an array with 5 rows:
[
[1.1, 2.1, 3.1]
[1.2, 2.2, 3.2]
[1.3, 2.3, 3.3]
[0, 0, 0]
[0, 0, 0]
]
I have tried np.concatenate((a, np.zeros(3)*(5-len(a))), axis=0), but it does not work.
Any help would be appreciated.
You're looking for np.pad. To zero-pad, set mode to 'constant' and give the pad_width you want on the edges of each axis:
np.pad(a, pad_width=((0,2),(0,0)), mode='constant')
array([[1.1, 2.1, 3.1],
[1.2, 2.2, 3.2],
[1.3, 2.3, 3.3],
[0. , 0. , 0. ],
[0. , 0. , 0. ]])
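The original np.concatenate attempt can also be fixed: np.zeros(3)*(5-len(a)) multiplies a length-3 vector of zeros by a scalar, which stays 1-D, instead of building the (2, 3) block of padding rows. A minimal sketch of the corrected version:
np.concatenate((a, np.zeros((5 - len(a), 3))), axis=0)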
I am trying to make a program faster and I found this post and I want to implement a solution that resembles the fourth case given in that question.
Here is the relevant part of the code I am using:
count = 0
hist_dat = np.zeros(r**2)
points = np.zeros((r**2, 2))
for a in range(r):
    for b in range(r):
        for i in range(N):
            for j in range(N):
                hist_dat[count] += retval(a/r, (a+1)/r, data_a[i][j])*retval(b/r, (b+1)/r, data_b[i][j])/N
        points[count][0], points[count][1] = (a+0.5)/r, (b+0.5)/r
        count += 1
What this code does is generate the values of a normalized 2D histogram (with "r" divisions in each direction) and the coordinates for those values as numpy.ndarray.
As you can see in the other question linked, I am currently using the second worst possible solution and it takes several minutes to run.
For starters I want to change what the code is doing for the points array (I think that once I can see how that is done, I can figure something out for hist_dat). Basically, I want to build every pair of values from two 1-D arrays A and B.
In the particular case I am working on, both A and B are the same. So, for example, it would go from array([0, 0.5, 1]) to array([[0,0], [0,0.5], [0,1], [0.5,0], [0.5,0.5], [0.5,1], [1,0], [1,0.5], [1,1]]).
Is there any method of numpy.ndarray, or an operation with np.arange(), that does this without for loops? Or is there any alternative that is as fast as the vectorized solutions shown in the linked post?
You can use np.c_ to combine the result of np.repeat and np.tile:
import numpy as np
start = 0.5
end = 5.5
step = 1.0
points = np.arange(start, end, step)  # [0.5, 1.5, 2.5, 3.5, 4.5]
n_elements = len(points)
output = np.c_[np.repeat(points, n_elements), np.tile(points, n_elements)]
print(output)
Output:
[[0.5 0.5]
[0.5 1.5]
[0.5 2.5]
[0.5 3.5]
[0.5 4.5]
[1.5 0.5]
[1.5 1.5]
[1.5 2.5]
[1.5 3.5]
[1.5 4.5]
[2.5 0.5]
[2.5 1.5]
[2.5 2.5]
[2.5 3.5]
[2.5 4.5]
[3.5 0.5]
[3.5 1.5]
[3.5 2.5]
[3.5 3.5]
[3.5 4.5]
[4.5 0.5]
[4.5 1.5]
[4.5 2.5]
[4.5 3.5]
[4.5 4.5]]
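Note that np.repeat varies the first column slowly (each value is repeated n_elements times) while np.tile varies the second column quickly, which reproduces the row-major order of the nested a/b loops in the question.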
Maybe np.mgrid would help?
import numpy as np
np.mgrid[0:2:.5,0:2:.5].reshape(2,4**2).T
Output:
array([[0. , 0. ],
[0. , 0.5],
[0. , 1. ],
[0. , 1.5],
[0.5, 0. ],
[0.5, 0.5],
[0.5, 1. ],
[0.5, 1.5],
[1. , 0. ],
[1. , 0.5],
[1. , 1. ],
[1. , 1.5],
[1.5, 0. ],
[1.5, 0.5],
[1.5, 1. ],
[1.5, 1.5]])
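Adapted to the points array from the question, where the entries are the cell centers (a+0.5)/r and (b+0.5)/r, a sketch along the same lines:
import numpy as np

r = 4  # number of divisions per axis, as in the question
points = (np.mgrid[0:r, 0:r].reshape(2, -1).T + 0.5) / r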
My current text file that I intend to use for LSTM training in Tensorflow looks like this:
> 0.2, 4.3, 1.2
> 1.1, 2.2, 3.1
> 3.5, 4.1, 1.1, 4300
>
> 1.2, 3.3, 1.2
> 1.5, 2.4, 3.1
> 3.5, 2.1, 1.1, 4400
>
> ...
Each sample consists of 3 sequence steps of 3 features each, with only 1 label. I formatted this text file to be consistent with LSTM training, which requires the time-steps of the sequences; in general, LSTM training requires a 3D tensor of shape (batch, num of time-steps, num of features).
My question: how should I use Numpy or TensorFlow.TextReader to reformat the 3x3 sequence vectors and the singleton labels so they become compatible with Tensorflow?
Edit: I saw many tutorials on reformatting text or CSV files that have vectors and labels, but unfortunately they were for 1-to-1 relationships, e.g.
0.2, 4.3, 1.2, Class1
1.1, 2.2, 3.1, Class2
3.5, 4.1, 1.1, Class3
becomes:
[0.2, 4.3, 1.2, Class1], [1.1, 2.2, 3.1, Class2], [3.5, 4.1, 1.1, Class3]
which Numpy can clearly read and easily build vectors from for simple feed-forward NN tasks. But this procedure doesn't actually build an LSTM-friendly CSV.
EDIT: The TensorFlow tutorial on CSV formats covers only 2D arrays as an example. The features = col1, col2, col3 layout doesn't allow for time-steps within each sequence, hence my question.
I'm a little confused as to whether you are more interested in the numpy array(s) structure, or the csv format.
The np.savetxt csv file writer can't readily produce text like:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
savetxt is not tricky. It opens a file for writing, and then iterates on the input array, writing it one row at a time. Effectively:
for row in arr:
    f.write(fmt % tuple(row))
where fmt has a % field for each element of the row. In the simple case it constructs fmt = delimiter.join(['fmt']*(arr.shape[1])), in other words repeating the single field fmt for the number of columns. Or you can give it a multi-field fmt.
So you could use normal line/file writing methods to write a custom display. The simplest is to construct it using the usual print commands and then redirect those to a file.
But having done that, there's the question of how to read that back into a numpy session. np.genfromtxt can handle missing data, but you still have to include the delimiters. It's also trickier to have it read blocks (3 lines separated by a blank line). It's not impossible, but you have to do some preprocessing.
Of course genfromtxt isn't that tricky either. It reads the file line by line, converts each line into a list of numbers or strings, and collects those lists in a master list. Only at the end is that list converted into an array.
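To give a flavour of that preprocessing (a sketch, not a complete solution, assuming the plain layout shown above lives in data.txt): split the file on blank lines and parse each block by hand, since its rows have unequal lengths:
import numpy as np

with open('data.txt') as f:
    blocks = [b for b in f.read().split('\n\n') if b.strip()]

samples = []
for b in blocks:
    rows = [[float(v) for v in line.split(',')] for line in b.splitlines()]
    label = rows[-1].pop()                   # the 4300/4400 label ends the last row
    samples.append((label, np.array(rows)))  # rows is now a clean (3, 3) block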
I can construct an array like your text with:
In [121]: dt = np.dtype([('lbl',int), ('block', float, (3,3))])
In [122]: A = np.zeros((2,),dtype=dt)
In [123]: A
Out[123]:
array([(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [124]: A['lbl']=[4300,4400]
In [125]: A[0]['block']=np.array([[.2,4.3,1.2],[1.1,2.2,3.1],[3.5,4.1,1.1]])
In [126]: A
Out[126]:
array([(4300, [[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]),
(4400, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [127]: A['block']
Out[127]:
array([[[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]],
[[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ]]])
I can load it from a txt that has all the block values flattened:
In [130]: txt=b"""4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1"""
In [131]: txt
Out[131]: b'4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'
genfromtxt can handle a complex dtype, allocating values in order from the flat line list:
In [133]: data=np.genfromtxt([txt],delimiter=',',dtype=dt)
In [134]: data['lbl']
Out[134]: array(4300)
In [135]: data['block']
Out[135]:
array([[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]])
I'm not sure about writing it back out. I would have to reshape it into a 10-column (or 10-field) array if I want to use savetxt.
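A sketch of that reshaping, assuming A is the structured array built above:
flat = np.column_stack([A['lbl'], A['block'].reshape(len(A), -1)])  # shape (2, 10)
np.savetxt('out.csv', flat, delimiter=', ', fmt='%g')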
UPDATE: an addition to the previous answer:
df.stack().to_csv('d:/temp/1D.csv', index=False)
1D.csv:
0.2
4.3
1.2
4300.0
1.1
2.2
3.1
4300.0
3.5
4.1
1.1
4300.0
1.2
3.3
1.2
4400.0
1.5
2.4
3.1
4400.0
3.5
2.1
1.1
4400.0
OLD answer:
Here is a Pandas solution.
Assume we have the following text file:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
Code:
import pandas as pd
In [95]: fn = r'D:\temp\.data\data.txt'
In [96]: df = pd.read_csv(fn, sep=',', skipinitialspace=True, header=None, names=list('abcd'))
In [97]: df
Out[97]:
a b c d
0 0.2 4.3 1.2 NaN
1 1.1 2.2 3.1 NaN
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 NaN
4 1.5 2.4 3.1 NaN
5 3.5 2.1 1.1 4400.0
In [98]: df.d = df.d.bfill()
In [99]: df
Out[99]:
a b c d
0 0.2 4.3 1.2 4300.0
1 1.1 2.2 3.1 4300.0
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 4400.0
4 1.5 2.4 3.1 4400.0
5 3.5 2.1 1.1 4400.0
Now you can save it back to CSV:
df.to_csv('d:/temp/out.csv', index=False, header=None)
d:/temp/out.csv:
0.2,4.3,1.2,4300.0
1.1,2.2,3.1,4300.0
3.5,4.1,1.1,4300.0
1.2,3.3,1.2,4400.0
1.5,2.4,3.1,4400.0
3.5,2.1,1.1,4400.0
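From here, the 3D tensor the question asks about is one reshape away (a sketch, assuming every sample has exactly 3 time-steps):
features = df[['a', 'b', 'c']].values.reshape(-1, 3, 3)  # (batch, time-steps, features)
labels = df['d'].values[2::3]                            # one label per 3-row sample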
I want to find the unique elements of an array within a certain tolerance.
For instance, for the array/list
[1.1 , 1.3 , 1.9 , 2.0 , 2.5 , 2.9]
the function should return
[1.1 , 1.9 , 2.5 , 2.9]
if the tolerance is 0.3,
a bit like the MATLAB function
http://mathworks.com/help/matlab/ref/uniquetol.html
(but that function uses a relative tolerance; an absolute one is sufficient here).
What is the Pythonic way to implement this? (A numpy solution is preferred.)
With A as the input array and tol as the tolerance value, we could have a vectorized approach with NumPy broadcasting, like so -
A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Sample run -
In [20]: A = np.array([2.1, 1.3 , 1.9 , 1.1 , 2.0 , 2.5 , 2.9])
In [21]: tol = 0.3
In [22]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[22]: array([ 2.1, 1.3, 2.5, 2.9])
Notice that 1.9 is gone because 2.1 was within the tolerance of 0.3; similarly, 1.1 is removed because of 1.3, and 2.0 because of 2.1.
Please note that this would create a unique array with "chained-closeness" check. As an example :
In [91]: A = np.array([ 1.1, 1.3, 1.5, 2. , 2.1, 2.2, 2.35, 2.5, 2.9])
In [92]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[92]: array([ 1.1, 2. , 2.9])
Thus, 1.3 is gone because of 1.1 and 1.5 is gone because of 1.3.
In pure Python 2, I wrote the following:
a = [1.1, 1.3, 1.9, 2.0, 2.5, 2.9]
# Per http://fr.mathworks.com/help/matlab/ref/uniquetol.html
tol = max(map(lambda x: abs(x), a)) * 0.3
a.sort()
results = [a.pop(0), ]
for i in a:
    # Skip items within tolerance.
    if abs(results[-1] - i) <= tol:
        continue
    results.append(i)
print a
print results
Which results in
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 2.0, 2.9]
Which seems to agree with the spec, but isn't consistent with your example.
If I just set the tolerance to 0.3 instead of max(map(lambda x: abs(x), a)) * 0.3, I get:
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 1.9, 2.5, 2.9]
...which is consistent with your example.
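A numpy rendering of the same sorted-scan idea with an absolute tolerance (a sketch; a loop remains because each kept value depends on the previously kept one):
import numpy as np

def uniquetol_abs(a, tol):
    # Sort, then keep a value only if it is farther than tol from the last kept one.
    a = np.sort(np.asarray(a, dtype=float))
    kept = [a[0]]
    for x in a[1:]:
        if abs(x - kept[-1]) > tol:
            kept.append(x)
    return np.array(kept)

uniquetol_abs([1.1, 1.3, 1.9, 2.0, 2.5, 2.9], 0.3)  # array([1.1, 1.9, 2.5, 2.9])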
I have a matrix printed inside a log file this way:
[[ 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
[ 1.9 76.5 1.1 1.6 1.2 0.9 2.4 0.3 5.9 8.2]
[ 4.2 0.3 64.5 5.7 9.1 5.9 6.5 2.7 1. 0.1]
[ 2.2 0.5 9.4 49.4 9.6 15.6 8.5 3.3 1.2 0.3]
[ 1.5 0.1 9. 5.3 71.3 3.5 4.8 3.7 0.8 0. ]
[ 1. 0.4 7.7 15.9 6.5 59.5 4.4 3.7 0.7 0.2]
[ 0.1 0.3 7. 5.6 3.8 2.2 80.3 0.5 0.2 0. ]
[ 1. 0.2 6.3 4.5 11.2 6.2 0.9 69.1 0.2 0.4]
[ 6.2 0.7 1.7 2.3 1.6 0.6 1. 0.4 84. 1.5]
[ 4.3 8.6 1.9 3.7 1.3 1.2 1.7 1.9 4.4 71. ]]
.... some text and then again another matrix ......
[[ 71.9 0.6 8.1 2. 1. 1.9 2. 1.6 8.4 2.5]
[ 1.2 82.9 1.1 1.1 0.5 1.3 1.5 0.7 3.9 5.8]
[ 4.7 0.9 59.6 4.1 7.6 10. 7.3 3.8 1.5 0.5]
[ 2.3 0.7 6.5 43.1 6.8 24.5 7.4 4. 2.6 2.1]
[ 1.7 0.3 5.6 5.4 62.5 7.4 6.5 9.3 1. 0.3]
[ 1.4 0.2 6. 12.7 5.3 64.4 3.3 5.4 0.8 0.5]
[ 0.7 0.6 5.5 4.9 3.9 4.5 78.2 0.6 0.8 0.3]
[ 1.7 0.2 4.5 3.4 4.4 7.6 1. 76. 0.7 0.5]
[ 6.3 1.9 1.7 1.4 0.8 0.9 1.4 0.4 83. 2.2]
[ 3.4 8.6 1.4 1.2 0.8 1.3 0.9 2.4 3.8 76.2]]
I tried doing this to read the first matrix and then append it to a list of matrices:
with open('log.txt') as f:
    for line in f:
        for i in range(N):
            cnf_mline = f.next().strip()
            cnf_mvalue = cnf_mline[cnf_mline.rfind('[')+1:]
            cnf_mtx.append(map(float, (cnf_mvalue,)))
            print cnf_mline
            print cnf_mvalue
and I see the following error:
ValueError: invalid literal for float(): 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
How do I parse this matrix directly from the log file to a list?
Thanks in advance!
import re

final = [[]]
with open("log.txt") as f:
    counter = 0
    for line in f:
        match = re.findall(r"\d+\.\d+|\d+\.", line)
        if match:
            final[counter].append(map(float, match))
        else:
            counter += 1
            final.append([])

print final
[[[73.1, 0.7, 5.3, 3.7, 3.5, 1.0, 0.9, 1.3, 8.8, 1.7], [1.9, 76.5, 1.1, 1.6, 1.2, 0.9, 2.4, 0.3, 5.9, 8.2], [4.2, 0.3, 64.5, 5.7, 9.1, 5.9, 6.5, 2.7, 1.0, 0.1], [2.2, 0.5, 9.4, 49.4, 9.6, 15.6, 8.5, 3.3, 1.2, 0.3], [1.5, 0.1, 9.0, 5.3, 71.3, 3.5, 4.8, 3.7, 0.8, 0.0], [1.0, 0.4, 7.7, 15.9, 6.5, 59.5, 4.4, 3.7, 0.7, 0.2], [0.1, 0.3, 7.0, 5.6, 3.8, 2.2, 80.3, 0.5, 0.2, 0.0], [1.0, 0.2, 6.3, 4.5, 11.2, 6.2, 0.9, 69.1, 0.2, 0.4], [6.2, 0.7, 1.7, 2.3, 1.6, 0.6, 1.0, 0.4, 84.0, 1.5], [4.3, 8.6, 1.9, 3.7, 1.3, 1.2, 1.7, 1.9, 4.4, 71.0]], [[71.9, 0.6, 8.1, 2.0, 1.0, 1.9, 2.0, 1.6, 8.4, 2.5], [1.2, 82.9, 1.1, 1.1, 0.5, 1.3, 1.5, 0.7, 3.9, 5.8], [4.7, 0.9, 59.6, 4.1, 7.6, 10.0, 7.3, 3.8, 1.5, 0.5], [2.3, 0.7, 6.5, 43.1, 6.8, 24.5, 7.4, 4.0, 2.6, 2.1], [1.7, 0.3, 5.6, 5.4, 62.5, 7.4, 6.5, 9.3, 1.0, 0.3], [1.4, 0.2, 6.0, 12.7, 5.3, 64.4, 3.3, 5.4, 0.8, 0.5], [0.7, 0.6, 5.5, 4.9, 3.9, 4.5, 78.2, 0.6, 0.8, 0.3], [1.7, 0.2, 4.5, 3.4, 4.4, 7.6, 1.0, 76.0, 0.7, 0.5], [6.3, 1.9, 1.7, 1.4, 0.8, 0.9, 1.4, 0.4, 83.0, 2.2], [3.4, 8.6, 1.4, 1.2, 0.8, 1.3, 0.9, 2.4, 3.8, 76.2]]]
The problem you are facing is that you have a string "73.1 0.7 ..." and so on, and that cannot be parsed into a float. If you want something parseable, you'll have to split that string (and get rid of the trailing ]):
cnf_mvalues = cnf_mline[cnf_mline.rfind('[')+1:-1].split()
cnf_mtx.append(map(float, cnf_mvalues))
This should fix the exception. (But it is not a complete solution!)
I think you might want to have a more stateful model, because:
a line starting with [[ and ending with ] marks the start
a line starting with [ and ending with ] is a line within the matrix
a line starting with [ and ending with ]] marks the end
(And even this makes a lot of assumptions.) There are really only two states: in matrix and outside of matrix. We may call that variable in_matrix.
Then for each line the logic goes:
"[[...]":
in_matrix := True
start an empty matrix
append the data on the row to the empty matrix
"[...]":
if in_matrix:
append the data on the row to the matrix under collection
else:
stray data, ignore
"[...]]":
if in_matrix:
append the data on the row to the matrix under collection
append the matrix under collection to the list of matrices
in_matrix := False
else:
stray data, ignore
And, naturally, a lot of exception handling with invalid values, etc. Most of the time it is enough to set in_matrix to False if you get bad data.
This state machine should turn into a shortish code, as the checking of the brackets is not very difficult.
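A sketch of that state machine in Python (assuming the log layout shown above; the regex from the other answer could replace the bracket stripping):
import numpy as np

matrices, current, in_matrix = [], [], False

with open('log.txt') as f:
    for line in f:
        line = line.strip()
        if not line.startswith('['):
            in_matrix = False              # stray text between matrices
            continue
        row = [float(v) for v in line.strip('[]').split()]
        if line.startswith('[['):          # "[[..." starts a new matrix
            current, in_matrix = [row], True
        elif in_matrix:
            current.append(row)
            if line.endswith(']]'):        # "...]]" closes the matrix
                matrices.append(np.array(current))
                in_matrix = False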