Best way to parse a matrix from a logfile in Python

I have a matrix printed inside a log file this way:
[[ 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
[ 1.9 76.5 1.1 1.6 1.2 0.9 2.4 0.3 5.9 8.2]
[ 4.2 0.3 64.5 5.7 9.1 5.9 6.5 2.7 1. 0.1]
[ 2.2 0.5 9.4 49.4 9.6 15.6 8.5 3.3 1.2 0.3]
[ 1.5 0.1 9. 5.3 71.3 3.5 4.8 3.7 0.8 0. ]
[ 1. 0.4 7.7 15.9 6.5 59.5 4.4 3.7 0.7 0.2]
[ 0.1 0.3 7. 5.6 3.8 2.2 80.3 0.5 0.2 0. ]
[ 1. 0.2 6.3 4.5 11.2 6.2 0.9 69.1 0.2 0.4]
[ 6.2 0.7 1.7 2.3 1.6 0.6 1. 0.4 84. 1.5]
[ 4.3 8.6 1.9 3.7 1.3 1.2 1.7 1.9 4.4 71. ]]
.... some text and then again another matrix ......
[[ 71.9 0.6 8.1 2. 1. 1.9 2. 1.6 8.4 2.5]
[ 1.2 82.9 1.1 1.1 0.5 1.3 1.5 0.7 3.9 5.8]
[ 4.7 0.9 59.6 4.1 7.6 10. 7.3 3.8 1.5 0.5]
[ 2.3 0.7 6.5 43.1 6.8 24.5 7.4 4. 2.6 2.1]
[ 1.7 0.3 5.6 5.4 62.5 7.4 6.5 9.3 1. 0.3]
[ 1.4 0.2 6. 12.7 5.3 64.4 3.3 5.4 0.8 0.5]
[ 0.7 0.6 5.5 4.9 3.9 4.5 78.2 0.6 0.8 0.3]
[ 1.7 0.2 4.5 3.4 4.4 7.6 1. 76. 0.7 0.5]
[ 6.3 1.9 1.7 1.4 0.8 0.9 1.4 0.4 83. 2.2]
[ 3.4 8.6 1.4 1.2 0.8 1.3 0.9 2.4 3.8 76.2]]
I tried doing this to read the first matrix and then append it to a list of matrices:
with open('log.txt') as f:
    for line in f:
        for i in range(N):
            cnf_mline = f.next().strip()
            cnf_mvalue = cnf_mline[cnf_mline.rfind('[')+1:]
            cnf_mtx.append(map(float, (cnf_mvalue,)))
            print cnf_mline
            print cnf_mvalue
and I see the following error:
ValueError: invalid literal for float(): 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
How do I parse this matrix directly from the log file to a list?
Thanks in advance!

import re

final = [[]]
with open("log.txt") as f:
    counter = 0
    for line in f:
        match = re.findall(r"\d+\.\d+|\d+\.", line)
        if match:
            final[counter].append(map(float, match))
        else:
            counter += 1
            final.append([])
print final
[[[73.1, 0.7, 5.3, 3.7, 3.5, 1.0, 0.9, 1.3, 8.8, 1.7], [1.9, 76.5, 1.1, 1.6, 1.2, 0.9, 2.4, 0.3, 5.9, 8.2], [4.2, 0.3, 64.5, 5.7, 9.1, 5.9, 6.5, 2.7, 1.0, 0.1], [2.2, 0.5, 9.4, 49.4, 9.6, 15.6, 8.5, 3.3, 1.2, 0.3], [1.5, 0.1, 9.0, 5.3, 71.3, 3.5, 4.8, 3.7, 0.8, 0.0], [1.0, 0.4, 7.7, 15.9, 6.5, 59.5, 4.4, 3.7, 0.7, 0.2], [0.1, 0.3, 7.0, 5.6, 3.8, 2.2, 80.3, 0.5, 0.2, 0.0], [1.0, 0.2, 6.3, 4.5, 11.2, 6.2, 0.9, 69.1, 0.2, 0.4], [6.2, 0.7, 1.7, 2.3, 1.6, 0.6, 1.0, 0.4, 84.0, 1.5], [4.3, 8.6, 1.9, 3.7, 1.3, 1.2, 1.7, 1.9, 4.4, 71.0]], [[71.9, 0.6, 8.1, 2.0, 1.0, 1.9, 2.0, 1.6, 8.4, 2.5], [1.2, 82.9, 1.1, 1.1, 0.5, 1.3, 1.5, 0.7, 3.9, 5.8], [4.7, 0.9, 59.6, 4.1, 7.6, 10.0, 7.3, 3.8, 1.5, 0.5], [2.3, 0.7, 6.5, 43.1, 6.8, 24.5, 7.4, 4.0, 2.6, 2.1], [1.7, 0.3, 5.6, 5.4, 62.5, 7.4, 6.5, 9.3, 1.0, 0.3], [1.4, 0.2, 6.0, 12.7, 5.3, 64.4, 3.3, 5.4, 0.8, 0.5], [0.7, 0.6, 5.5, 4.9, 3.9, 4.5, 78.2, 0.6, 0.8, 0.3], [1.7, 0.2, 4.5, 3.4, 4.4, 7.6, 1.0, 76.0, 0.7, 0.5], [6.3, 1.9, 1.7, 1.4, 0.8, 0.9, 1.4, 0.4, 83.0, 2.2], [3.4, 8.6, 1.4, 1.2, 0.8, 1.3, 0.9, 2.4, 3.8, 76.2]]]
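If you'd rather end up with NumPy arrays than nested lists, here is a small follow-up sketch (my own addition, assuming the final list built above; the filter guards against empty blocks left by extra separator lines):
import numpy as np

# Keep only non-empty blocks, then turn each block of rows into a 2-D array.
matrices = [np.array(block) for block in final if block]
print matrices[0].shape  # (10, 10) for the log above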

The problem you are facing is that you have a string like "73.1 0.7 5.3 ..." and that cannot be parsed into a float. If you want something parseable, you'll have to split that string (and get rid of the trailing ]):
cnf_mvalues = cnf_mline[cnf_mline.rfind('[')+1:-1].split()
cnf_mtx.append(map(float, cnf_mvalues))
This should help you with the exception. (But it is not a complete solution!)
I think you might want to have a more stateful model, because:
a line starting with [[ and ending with ] marks the start
a line starting with [ and ending with ] is a line within the matrix
a line starting with [ and ending with ]] marks the end
(And even this makes a lot of assumptions.) There are really only two states: in a matrix and outside of a matrix. We may call that variable in_matrix.
Then for each line the logic goes:
"[[...]":
in_matrix := True
start an empty matrix
append the data on the row to the empty matrix
"[...]":
if in_matrix:
append the data on the row to the matrix under collection
else:
stray data, ignore
"[...]]":
if in_matrix:
append the data on the row to the matrix under collection
append the matrix under collection to the list of matrices
in_matrix := False
else:
stray data, ignore
And, naturally, you need some exception handling for invalid values, etc. Most of the time it is enough to set in_matrix to False if you get bad data.
This state machine should turn into fairly short code, as checking the brackets is not very difficult; a sketch follows below.
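For illustration, here is a minimal sketch of that state machine (parse_matrices is a name of my own; it assumes the bracket conventions above and skips most error handling):
def parse_matrices(path):
    """Collect every [[ ... ]] block in the file as a list of row lists."""
    matrices = []
    current = None  # rows collected so far; None means we are outside a matrix
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('[['):            # start of a new matrix
                current = []
            if current is not None and line.startswith('['):
                # Strip the brackets, then split the numbers on whitespace.
                current.append([float(v) for v in line.strip('[]').split()])
                if line.endswith(']]'):          # end of the current matrix
                    matrices.append(current)
                    current = None
    return matrices

matrices = parse_matrices('log.txt')
print len(matrices)  # 2 for the log above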

Related

Removing incomplete numbers from list Python

I'm given this list
[1.3, 2.2, 2.3, 4.2, 5.1, 3.2, 5.3, 3.3, 2.1, 1.1, 5.2, 3.1]
and I'm supposed to remove the elements 1.3, 4.2 and 1.1 so that it becomes
[2.2, 2.3, 5.1, 3.2, 5.3, 3.3, 2.1, 5.2, 3.1]
I have written this code but it gives the wrong result. What am I doing wrong?
def removeIncomplete(id):
    numbers_buf = id
    idComplete = id[:]
    for ind, item in enumerate(id):
        if item == 1.3 and item == 4.2 and item == 1.1:
            numbers_buf.remove(item)
    return numbers_buf
    #return idComplete

import numpy as np
print(removeIncomplete(np.array([1.3, 2.2, 2.3, 4.2, 5.1,
                                 3.2, 5.3, 3.3, 2.1, 1.1, 5.2, 3.1])))
#Correct output [ 2.2 2.3 5.1 3.2 5.3 3.3 2.1 5.2 3.1]
import numpy as np

def removeIncomplete(id):
    numbers_buf = id
    idComplete = id[:]
    for ind, item in enumerate(id):
        if item == 1.3 or item == 4.2 or item == 1.1:
            numbers_buf = np.delete(numbers_buf, np.where(numbers_buf == item))
    return numbers_buf
    #return idComplete

print(removeIncomplete(np.array([1.3, 2.2, 2.3, 4.2, 5.1,
                                 3.2, 5.3, 3.3, 2.1, 1.1, 5.2, 3.1])))
What about using a list comprehension?
data = [1.3, 2.2, 2.3, 4.2, 5.1, 3.2, 5.3, 3.3, 2.1, 1.1, 5.2, 3.1]
exclude = [1.3, 4.2, 1.1]
out = [val for val in data if val not in exclude]
print(out)
>>>
[2.2, 2.3, 5.1, 3.2, 5.3, 3.3, 2.1, 5.2, 3.1]
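Since the question actually works with a NumPy array, here is a minimal NumPy variant of the same filter (a sketch; exact float comparison is fine here because the values are written literally, but it would be fragile for computed floats):
import numpy as np

data = np.array([1.3, 2.2, 2.3, 4.2, 5.1, 3.2, 5.3, 3.3, 2.1, 1.1, 5.2, 3.1])
exclude = [1.3, 4.2, 1.1]
out = data[~np.isin(data, exclude)]  # boolean mask: keep values not in exclude
print(out)  # [2.2 2.3 5.1 3.2 5.3 3.3 2.1 5.2 3.1]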

How to combine these two numpy arrays?

How would I combine these two arrays:
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
[4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
Into something like this:
xy = [[0.1, [1.0, 1.1, 1.2, 1.3]], [0.2, [2.0, 2.1, 2.2, 2.3]...
Thank you for the assistance!
Someone suggested I post code that I have tried, and I realized I had forgotten to:
xy = np.array(list(zip(x, y)))
This is my current solution, however it is extremely inefficient.
You can use zip to combine them:
[[a,b] for a,b in zip(y,x)]
Out:
[[array([0.1]), array([1. , 1.1, 1.2, 1.3])],
[array([0.2]), array([2. , 2.1, 2.2, 2.3])],
[array([0.3]), array([3. , 3.1, 3.2, 3.3])],
[array([0.4]), array([4. , 4.1, 4.2, 4.3])],
[array([0.5]), array([5. , 5.1, 5.2, 5.3])]]
A pure numpy solution will be much faster than list comprehension for large arrays.
I do have to say your use case makes no sense, as there is no logic in putting these arrays into a single data structure, and I believe you should re-check your design.
As @user2357112 supports Monica subtly implied, this is very likely an XY problem. Check whether this is really what you are trying to solve, and not something else; if you want something else, try asking about that.
I strongly suggest reconsidering what you want to do before moving on, or you will lock yourself into a bad design.
That aside, here's a solution:
import numpy as np
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
[4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
xy = np.hstack([y, x])
print(xy)
prints
[[0.1 1. 1.1 1.2 1.3]
[0.2 2. 2.1 2.2 2.3]
[0.3 3. 3.1 3.2 3.3]
[0.4 4. 4.1 4.2 4.3]
[0.5 5. 5.1 5.2 5.3]]

What are the dimensionality requirements for np.dot?

I have a variable W that has:
[[1.]
[2.]
[3.]
[4.]
[5.]]
And another variable X that has:
[[1. 5.1 3.5 1.4 0.2]
[1. 4.9 3. 1.4 0.2]
[1. 4.7 3.2 1.3 0.2]
[1. 4.6 3.1 1.5 0.2]
[1. 5. 3.6 1.4 0.2]
[1. 5.4 3.9 1.7 0.4]
[1. 4.6 3.4 1.4 0.3]
[1. 5. 3.4 1.5 0.2]
[1. 4.4 2.9 1.4 0.2]
[1. 4.9 3.1 1.5 0.1]
[1. 5.4 3.7 1.5 0.2]
...
[1. 5.7 2.8 4.1 1.3]]
I keep guessing and checking to see how to np.dot them together. np.dot(W.T, X.T) seems to work, but the shape is wrong: (1, 100).
What I want to do is multiply like:
1 * 1 + 2 * 5.1 + 3 * 3.5 + 4 * 1.4 + 5 * 0.2 for each row in X. How can I do that?
Matrix multiplication is row by column:
          X
XXXXX     X     .
.....  *  X  =  .
.....     X     .
          X
So:
In [6]: a=np.array([[1, 5.1, 3.5, 1.4, 0.2],
...: [1, 4.9, 3, 1.4, 0.2],
...: [1, 4.7, 3.2, 1.3, 0.2],
...: [1, 4.6, 3.1, 1.5, 0.2],
...: [1, 5, 3.6, 1.4, 0.2],
...: [1, 5.4, 3.9, 1.7, 0.4],
...: [1, 4.6, 3.4, 1.4, 0.3],
...: [1, 5, 3.4, 1.5, 0.2],
...: [1, 4.4, 2.9, 1.4, 0.2],
...: [1, 4.9, 3.1, 1.5, 0.1],
...: [1, 5.4, 3.7, 1.5, 0.2]])
In [8]: b=np.array([[1.],
...: [2.],
...: [3.],
...: [4.],
...: [5.]])
In [25]: np.dot(a,b)
Out[25]:
array([[28.3],
[26.4],
[26.2],
[26.5],
[28.4],
[32.3],
[27.5],
[28.2],
[25.1],
[26.6],
[29.9]])
For np.dot(a, b), the last dimension of a should be the same size as the second-to-last dimension of b.
More references: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.dot.html
You could use np.matmul:
W = np.array([[1.], [2.], [3.], [4.], [5.]])
X = np.array([[1., 5.1, 3.5, 1.4, 0.2],
              [1., 4.9, 3. , 1.4, 0.2],
              [1., 4.7, 3.2, 1.3, 0.2],
              [1., 4.6, 3.1, 1.5, 0.2],
              [1., 5. , 3.6, 1.4, 0.2],
              [1., 5.4, 3.9, 1.7, 0.4]])
np.matmul(X, W)
array([[28.3],
[26.4],
[26.2],
[26.5],
[28.4],
[32.3]])
Quick check on the output:
1*1 + 2*5.1 + 3*3.5 + 4*1.4 + 5*0.2 = 28.3
Note that in this case it is equivalent to np.dot given that both inputs are 2-D arrays.
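As a side note, since Python 3.5 the @ operator spells the same 2-D product; a small self-contained sketch:
import numpy as np

W = np.array([[1.], [2.], [3.], [4.], [5.]])
X = np.array([[1., 5.1, 3.5, 1.4, 0.2],
              [1., 4.9, 3., 1.4, 0.2]])
result = X @ W  # identical to np.dot(X, W) and np.matmul(X, W) for 2-D arrays
print(result)   # [[28.3] [26.4]]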

Tensorflow: Passing CSV with 3D feature array

My current text file that I intend to use for LSTM training in Tensorflow looks like this:
> 0.2, 4.3, 1.2
> 1.1, 2.2, 3.1
> 3.5, 4.1, 1.1, 4300
>
> 1.2, 3.3, 1.2
> 1.5, 2.4, 3.1
> 3.5, 2.1, 1.1, 4400
>
> ...
There are 3 sequences of 3-feature vectors, with only 1 label for each sample. I formatted the text file this way so it is consistent with LSTM training, which requires the time-steps of the sequences; in general, LSTM training requires a 3D tensor (batch, number of time-steps, number of features).
My question: How should I use Numpy or TensorFlow.TextReader to reformat the 3x3 sequence vectors and the singleton labels so they become compatible with Tensorflow?
Edit: I saw many tutorials on reformatting text or CSV files that have vectors and labels, but unfortunately they were for 1-to-1 relationships, e.g.
0.2, 4.3, 1.2, Class1
1.1, 2.2, 3.1, Class2
3.5, 4.1, 1.1, Class3
becomes:
[0.2, 4.3, 1.2, Class1], [1.1, 2.2, 3.1, Class2], [3.5, 4.1, 1.1, Class3]
which clearly is readable by Numpy, and vectors for simple feed-forward NN tasks can easily be built from it. But this procedure doesn't actually build an LSTM-friendly CSV.
EDIT: The TensorFlow tutorial on CSV formats covers only 2D arrays as an example. The features = col1, col2, col3 layout doesn't allow for time-steps within each sequence array, hence my question.
I'm a little confused as to whether you are more interested in the numpy array(s) structure, or the CSV format.
The np.savetxt csv file writer can't readily produce text like:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
savetxt is not tricky. It opens a file for writing, and then iterates on the input array, writing it one row at a time to the file. Effectively:
for row in arr:
    f.write(fmt % tuple(row))
where fmt has a % field for each element of the row. In the simple case it constructs fmt = delimiter.join(['fmt']*(arr.shape[1])), in other words repeating the single field fmt for the number of columns. Or you can give it a multi-field fmt.
So you could use normal line/file writing methods to write a custom display. The simplest is to construct it using the usual print commands, and then redirect those to a file.
But having done that, there's the question of how to read that back into a numpy session. np.genfromtxt can handle missing data, but you still have to include the delimiters. It's also trickier to have it read blocks (3 lines separated by a blank line). It's not impossible, but you have to do some preprocessing.
Of course genfromtxt isn't that tricky either. It reads the file line by line, converts each line into a list of numbers or strings, and collects those lists in a master list. Only at the end is that list converted into an array.
I can construct an array like your text with:
In [121]: dt = np.dtype([('lbl',int), ('block', float, (3,3))])
In [122]: A = np.zeros((2,),dtype=dt)
In [123]: A
Out[123]:
array([(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [124]: A['lbl']=[4300,4400]
In [125]: A[0]['block']=np.array([[.2,4.3,1.2],[1.1,2.2,3.1],[3.5,4.1,1.1]])
In [126]: A
Out[126]:
array([(4300, [[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]),
(4400, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [127]: A['block']
Out[127]:
array([[[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]],
[[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ]]])
I can load it from a txt that has all the block values flattened:
In [130]: txt=b"""4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1"""
In [131]: txt
Out[131]: b'4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'
genfromtxt can handle a complex dtype, allocating values in order from the flat line list:
In [133]: data=np.genfromtxt([txt],delimiter=',',dtype=dt)
In [134]: data['lbl']
Out[134]: array(4300)
In [135]: data['block']
Out[135]:
array([[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]])
I'm not sure about writing it. I would have to reshape it into a 10-column or multi-field array if I wanted to use savetxt.
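A minimal sketch of that reshape (my own suggestion, continuing from the structured array A built above; blocks.csv is a name I picked):
# Column 0 holds the label, followed by the 9 flattened block values.
flat = np.column_stack([A['lbl'], A['block'].reshape(len(A), -1)])  # shape (2, 10)
np.savetxt('blocks.csv', flat, delimiter=',', fmt='%g')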
UPDATE: an addition to the previous answer:
df.stack().to_csv('d:/temp/1D.csv', index=False)
1D.csv:
0.2
4.3
1.2
4300.0
1.1
2.2
3.1
4300.0
3.5
4.1
1.1
4300.0
1.2
3.3
1.2
4400.0
1.5
2.4
3.1
4400.0
3.5
2.1
1.1
4400.0
OLD answer:
Here is a Pandas solution.
Assume we have the following text file:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
Code:
import pandas as pd
In [95]: fn = r'D:\temp\.data\data.txt'
In [96]: df = pd.read_csv(fn, sep=',', skipinitialspace=True, header=None, names=list('abcd'))
In [97]: df
Out[97]:
a b c d
0 0.2 4.3 1.2 NaN
1 1.1 2.2 3.1 NaN
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 NaN
4 1.5 2.4 3.1 NaN
5 3.5 2.1 1.1 4400.0
In [98]: df.d = df.d.bfill()
In [99]: df
Out[99]:
a b c d
0 0.2 4.3 1.2 4300.0
1 1.1 2.2 3.1 4300.0
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 4400.0
4 1.5 2.4 3.1 4400.0
5 3.5 2.1 1.1 4400.0
now you can save it back to CSV:
df.to_csv('d:/temp/out.csv', index=False, header=None)
d:/temp/out.csv:
0.2,4.3,1.2,4300.0
1.1,2.2,3.1,4300.0
3.5,4.1,1.1,4300.0
1.2,3.3,1.2,4400.0
1.5,2.4,3.1,4400.0
3.5,2.1,1.1,4400.0
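To get from that flat CSV back to the 3D tensor the LSTM expects, a minimal sketch (assuming the out.csv written above; the reshape sizes come from the 3 time-steps and 3 features per sample):
import numpy as np

data = np.genfromtxt('d:/temp/out.csv', delimiter=',')  # shape (6, 4)
features = data[:, :3].reshape(-1, 3, 3)  # (batch, time-steps, features) = (2, 3, 3)
labels = data[:, 3].reshape(-1, 3)[:, 0]  # one label per 3-step sample: 4300., 4400.
print(features.shape, labels)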

uniquify an array/list with a tolerance in python (uniquetol equivalent)

I want to find the unique elements of an array within a certain tolerance.
For instance, for the array/list
[1.1 , 1.3 , 1.9 , 2.0 , 2.5 , 2.9]
the function would return
[1.1 , 1.9 , 2.5 , 2.9]
if the tolerance is 0.3,
a bit like the MATLAB function
http://mathworks.com/help/matlab/ref/uniquetol.html
(but that function uses a relative tolerance; an absolute one can be sufficient).
What is the pythonic way to implement it? (numpy is preferred)
With A as the input array and tol as the tolerance value, we could have a vectorized approach with NumPy broadcasting, like so -
A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Sample run -
In [20]: A = np.array([2.1, 1.3 , 1.9 , 1.1 , 2.0 , 2.5 , 2.9])
In [21]: tol = 0.3
In [22]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[22]: array([ 2.1, 1.3, 2.5, 2.9])
Notice 1.9 being gone because we had 2.1 within the tolerance of 0.3. Then, 1.1 is gone because of 1.3, and 2.0 because of 2.1.
Please note that this would create a unique array with "chained-closeness" check. As an example :
In [91]: A = np.array([ 1.1, 1.3, 1.5, 2. , 2.1, 2.2, 2.35, 2.5, 2.9])
In [92]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[92]: array([ 1.1, 2. , 2.9])
Thus, 1.3 is gone because of 1.1 and 1.5 is gone because of 1.3.
In pure Python 2, I wrote the following:
a = [1.1, 1.3, 1.9, 2.0, 2.5, 2.9]
# Per http://fr.mathworks.com/help/matlab/ref/uniquetol.html
tol = max(map(lambda x: abs(x), a)) * 0.3
a.sort()
results = [a.pop(0), ]
for i in a:
    # Skip items within tolerance.
    if abs(results[-1] - i) <= tol:
        continue
    results.append(i)
print a
print results
Which results in
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 2.0, 2.9]
Which is what the spec seems to agree with, but isn't consistent with your example.
If I just set the tolerance to 0.3 instead of max(map(lambda x: abs(x), a)) * 0.3, I get:
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 1.9, 2.5, 2.9]
...which is consistent with your example.
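For an absolute tolerance, the same sort-and-sweep idea translates directly to NumPy (a sketch; uniquetol_abs is a name of my own, and it keeps a value only when it is more than tol away from the last value kept):
import numpy as np

def uniquetol_abs(a, tol):
    # Greedy sweep over the sorted values.
    a = np.sort(np.asarray(a, dtype=float))
    kept = [a[0]]
    for x in a[1:]:
        if abs(x - kept[-1]) > tol:
            kept.append(x)
    return np.array(kept)

print(uniquetol_abs([1.1, 1.3, 1.9, 2.0, 2.5, 2.9], 0.3))
# [1.1 1.9 2.5 2.9]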
