Tensorflow: Passing CSV with 3D feature array - python

My current text file that I intend to use for LSTM training in Tensorflow looks like this:
> 0.2, 4.3, 1.2
> 1.1, 2.2, 3.1
> 3.5, 4.1, 1.1, 4300
>
> 1.2, 3.3, 1.2
> 1.5, 2.4, 3.1
> 3.5, 2.1, 1.1, 4400
>
> ...
There are 3 sequence vectors of 3 features each, with only 1 label per sample. I formatted the text file this way so it is consistent with LSTM training, as the latter requires the time-steps of the sequences; in general, LSTM training requires a 3D tensor (batch, number of time-steps, number of features).
My question: how should I use NumPy or TensorFlow.TextReader to reformat the 3x3 sequence vectors and the singleton labels so they become compatible with TensorFlow?
Edit: I saw many tutorials on reformatting text or CSV files that have vectors and labels, but unfortunately they were for 1-to-1 relationships, e.g.
0.2, 4.3, 1.2, Class1
1.1, 2.2, 3.1, Class2
3.5, 4.1, 1.1, Class3
becomes:
[0.2, 4.3, 1.2, Class1], [1.1, 2.2, 3.1, Class2], [3.5, 4.1, 1.1, Class3]
which is clearly readable by NumPy, and vectors for simple feed-forward NN tasks can easily be built from it. But this procedure doesn't actually build an LSTM-friendly CSV.
EDIT: The TensorFlow tutorial on CSV formats covers only 2D arrays as an example. The features = col1, col2, col3 setup doesn't assume that there might be time-steps for each sequence array, hence my question.

I'm a little confused as to whether you are more interested in the numpy array(s) structure, or the csv format.
The np.savetxt csv file writer can't readily produce text like:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
savetxt is not tricky. It opens a file for writing, and then iterates over the input array, writing it to the file one row at a time. Effectively:
for row in arr:
    f.write(fmt % tuple(row))
where fmt has a % field for each element of the row. In the simple case it constructs fmt = delimiter.join([fmt] * arr.shape[1]), in other words repeating the single field fmt for the number of columns. Or you can give it a multi-field fmt.
So you could use normal line/file writing methods to write a custom display. The simplest is to construct it using the usual print commands, and then redirect those to a file.
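For example, here's a minimal sketch of writing the question's block format by hand (the sample data below is hypothetical; one (3x3 block, label) pair per sample):
samples = [([[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]], 4300),
           ([[1.2, 3.3, 1.2], [1.5, 2.4, 3.1], [3.5, 2.1, 1.1]], 4400)]
with open('blocks.txt', 'w') as f:
    for block, label in samples:
        for i, row in enumerate(block):
            fields = ['%g' % v for v in row]
            if i == len(block) - 1:        # the label rides on the last row
                fields.append('%d' % label)
            f.write(', '.join(fields) + '\n')
        f.write('\n')                      # blank line between samples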
But having done that, there's the question of how to read that back into a numpy session. np.genfromtxt can handle missing data, but you still have to include the delimiters. It's also trickier to have it read blocks (3 lines separated by a blank line). It's not impossible, but you have to do some preprocessing.
Of course genfromtxt isn't that tricky either. It reads the file line by line, converts each line into a list of numbers or strings, and collects those lists in a master list. Only at the end is that list converted into an array.
I can construct an array like your text with:
In [121]: dt = np.dtype([('lbl', int), ('block', float, (3,3))])
In [122]: A = np.zeros((2,), dtype=dt)
In [123]: A
Out[123]:
array([(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
       (0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
      dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [124]: A['lbl'] = [4300, 4400]
In [125]: A[0]['block'] = np.array([[.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]])
In [126]: A
Out[126]:
array([(4300, [[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]),
       (4400, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
      dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [127]: A['block']
Out[127]:
array([[[ 0.2,  4.3,  1.2],
        [ 1.1,  2.2,  3.1],
        [ 3.5,  4.1,  1.1]],

       [[ 0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ]]])
I can load it from a txt that has all the block values flattened:
In [130]: txt=b"""4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1"""
In [131]: txt
Out[131]: b'4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'
genfromtxt can handle a complex dtype, allocating values in order from the flat line list:
In [133]: data = np.genfromtxt([txt], delimiter=',', dtype=dt)
In [134]: data['lbl']
Out[134]: array(4300)
In [135]: data['block']
Out[135]:
array([[ 0.2,  4.3,  1.2],
       [ 1.1,  2.2,  3.1],
       [ 3.5,  4.1,  1.1]])
I'm not sure about writing it. I would have to reshape it into a 10-column or multi-field array if I wanted to use savetxt.
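A minimal sketch of that reshaping, assuming A is the structured array built above (the file name is just an example):
# flatten each record to 10 plain columns: the label plus the 9 block values
flat = np.column_stack([A['lbl'], A['block'].reshape(len(A), -1)])
np.savetxt('blocks_flat.csv', flat, delimiter=',', fmt='%g')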

UPDATE: an addition to the previous answer:
df.stack().to_csv('d:/temp/1D.csv', index=False)
1D.csv:
0.2
4.3
1.2
4300.0
1.1
2.2
3.1
4300.0
3.5
4.1
1.1
4300.0
1.2
3.3
1.2
4400.0
1.5
2.4
3.1
4400.0
3.5
2.1
1.1
4400.0
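To get from that 1D file back to LSTM-shaped arrays, a plain NumPy reshape is one option. A sketch, assuming the file contains only the 24 numbers shown above (3 time-steps of 3 features plus a repeated label, per sample):
import numpy as np
flat = np.loadtxt('d:/temp/1D.csv')
blocks = flat.reshape(-1, 3, 4)           # (batch, time_steps, features + label)
X, y = blocks[:, :, :3], blocks[:, 0, 3]  # X: (2, 3, 3), y: [4300., 4400.]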
OLD answer:
Here is a Pandas solution.
Assume we have the following text file:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
Code:
import pandas as pd
In [95]: fn = r'D:\temp\.data\data.txt'
In [96]: df = pd.read_csv(fn, sep=',', skipinitialspace=True, header=None, names=list('abcd'))
In [97]: df
Out[97]:
     a    b    c       d
0  0.2  4.3  1.2     NaN
1  1.1  2.2  3.1     NaN
2  3.5  4.1  1.1  4300.0
3  1.2  3.3  1.2     NaN
4  1.5  2.4  3.1     NaN
5  3.5  2.1  1.1  4400.0
In [98]: df.d = df.d.bfill()
In [99]: df
Out[99]:
     a    b    c       d
0  0.2  4.3  1.2  4300.0
1  1.1  2.2  3.1  4300.0
2  3.5  4.1  1.1  4300.0
3  1.2  3.3  1.2  4400.0
4  1.5  2.4  3.1  4400.0
5  3.5  2.1  1.1  4400.0
Now you can save it back to CSV:
df.to_csv('d:/temp/out.csv', index=False, header=None)
d:/temp/out.csv:
0.2,4.3,1.2,4300.0
1.1,2.2,3.1,4300.0
3.5,4.1,1.1,4300.0
1.2,3.3,1.2,4400.0
1.5,2.4,3.1,4400.0
3.5,2.1,1.1,4400.0
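From here, the back-filled frame maps directly onto the 3D tensor the question asks about. A sketch, assuming 3 time-steps per sample as in the question:
X = df[['a', 'b', 'c']].values.reshape(-1, 3, 3)  # (batch, time_steps, features)
y = df['d'].values[2::3]                          # one label per 3-row block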

Related

How to combine these two numpy arrays?

How would I combine these two arrays:
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
                [4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
Into something like this:
xy = [[0.1, [1.0, 1.1, 1.2, 1.3]], [0.2, [2.0, 2.1, 2.2, 2.3]...
Thank you for the assistance!
Someone suggested I post code that I have tried, and I realized I had forgotten to:
xy = np.array(list(zip(x, y)))
This is my current solution, however it is extremely inefficient.
You can use zip to combine them:
[[a,b] for a,b in zip(y,x)]
Out:
[[array([0.1]), array([1. , 1.1, 1.2, 1.3])],
 [array([0.2]), array([2. , 2.1, 2.2, 2.3])],
 [array([0.3]), array([3. , 3.1, 3.2, 3.3])],
 [array([0.4]), array([4. , 4.1, 4.2, 4.3])],
 [array([0.5]), array([5. , 5.1, 5.2, 5.3])]]
A pure numpy solution will be much faster than list comprehension for large arrays.
I do have to say your use case makes little sense, as there is no obvious reason to put these arrays into a single data structure, and I believe you should re-check your design.
As @user2357112 supports Monica was subtly implying, this is very likely an XY problem. See if this is really what you are trying to solve, and not something else. If you want something else, try asking about that.
I strongly suggest checking what you want to do before moving on, as otherwise you may box yourself into a bad design.
That aside, here's a solution
import numpy as np
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
                [4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
xy = np.hstack([y, x])
print(xy)
prints
[[0.1 1.  1.1 1.2 1.3]
 [0.2 2.  2.1 2.2 2.3]
 [0.3 3.  3.1 3.2 3.3]
 [0.4 4.  4.1 4.2 4.3]
 [0.5 5.  5.1 5.2 5.3]]

Problems when using tf.data.Dataset.batch

I want to make clear how tf.data.Dataset.batch works with my dataset. The dataset is as follows:
dataset = tf.convert_to_tensor([[5.1, 3.3, 1.7, 0.5],
                                [5.9, 3.0, 4.2, 1.5],
                                [6.9, 3.1, 5.4, 2.1],
                                [2.3, 1.3, 6.4, 9.3]])
Then I use the batch method:
dataset = dataset.batch(2)
and iterate the dataset once:
x = tfe.Iterator(dataset).next()
As I understand it, the result should be a 2*4 array, but it returns the whole 4*4 dataset.
Could anyone give me some details about how to apply the batch method?
You need to convert your dataset Tensor into a TensorSliceDataset, i.e., tell TensorFlow to slice the tensor and make a dataset of it.
import tensorflow as tf

data = tf.convert_to_tensor([[5.1, 3.3, 1.7, 0.5],
                             [5.9, 3.0, 4.2, 1.5],
                             [6.9, 3.1, 5.4, 2.1],
                             [2.3, 1.3, 6.4, 9.3]])
dataset = tf.data.Dataset.from_tensor_slices(data).batch(2)
batch_iterator = dataset.make_one_shot_iterator().get_next()

sess = tf.InteractiveSession()
batch = sess.run(batch_iterator)
print(batch)
# [[ 5.1  3.3  1.7  0.5]
#  [ 5.9  3.   4.2  1.5]]
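For what it's worth, in TensorFlow 2.x (eager by default) no session or iterator is needed; a sketch assuming TF >= 2.0:
import tensorflow as tf
data = tf.constant([[5.1, 3.3, 1.7, 0.5],
                    [5.9, 3.0, 4.2, 1.5],
                    [6.9, 3.1, 5.4, 2.1],
                    [2.3, 1.3, 6.4, 9.3]])
dataset = tf.data.Dataset.from_tensor_slices(data).batch(2)
for batch in dataset:          # two batches, each of shape (2, 4)
    print(batch.numpy())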

The meaning of the comma inside X[:,0]

If X is an array, what is the meaning of X[:,0]? In fact, this is not the first time I have seen such a thing, and it keeps confusing me, but I can't work out what it means. Could anyone show me an example? A full, clear answer about this comma would be appreciated.
Please see the file https://github.com/lazyprogrammer/machine_learning_examples/blob/master/ann_class/forwardprop.py
The comma inside the brackets separates the rows from the columns you want to slice from your array.
x[row, column]
You can place ":" before or after the row and column values. Before the value it means "until" and after the value it means "from".
For example you have:
x: array([[5.1, 3.5, 1.4, 0.2],
          [4.9, 3. , 1.4, 0.2],
          [4.7, 3.2, 1.3, 0.2],
          [4.6, 3.1, 1.5, 0.2],
          [5. , 3.6, 1.4, 0.2],
          [5.4, 3.9, 1.7, 0.4],
          [4.6, 3.4, 1.4, 0.3],
          [5. , 3.4, 1.5, 0.2],
          [4.4, 2.9, 1.4, 0.2]])
x[:,:] would mean you want every row and every column.
x[3,3] would mean you want the single value at row 3, column 3.
x[:3,:3] would mean you want the rows and columns up to (but not including) index 3.
x[:, 3] would mean you want column 3 and every row.
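A quick runnable demonstration of those rules (the array here is just an example):
import numpy as np
x = np.arange(16).reshape(4, 4)
print(x[:, :])    # the whole 4x4 array
print(x[3, 3])    # 15 -- the single value at row 3, column 3
print(x[:3, :3])  # the upper-left 3x3 block
print(x[:, 3])    # [ 3  7 11 15] -- column 3 from every row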
>>> x = [1, 2, 3]
>>> x[:, 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not tuple
If you see that, then the variable is not a list, but something else. A numpy array, perhaps.
I am creating an example matrix:
import numpy as np

np.random.seed(0)
F = np.random.randint(2, 5, size=(3, 4), dtype='int32')  # a random 3x4 matrix of ints in [2, 5)
F
Slicing matrix rows:
F[0:2]   # rows 0 and 1, all columns
Slicing matrix columns:
F[:,2]   # column 2 from every row
To be straight to the point: it is X[rows, columns], as someone mentioned. But you may ask what a lone colon : means in "X[:,0]": it says "take all".
So X[:,0] would list the elements of column 0 across all rows; since there is just a colon : in the row position, that entire column of the matrix is printed out. Its dimension is [no_of_rows * 1].
Similarly, X[:,1] would list the second column from all rows.
Hope this clarifies things for you.
Pretty clear. Check this out!
Load some data
from sklearn import datasets
iris = datasets.load_iris()
samples = iris.data
Explore the first 10 rows of the 2D array
samples[:10]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
Test our notation
x = samples[:,0]
x[:10]
array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9])
y = samples[:,1]
y[:10]
array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1])
P.S. The length of samples is 150; I've cut it to 10 for clarity.

uniquify an array/list with a tolerance in python (uniquetol equivalent)

I want to find the unique elements of an array within a certain tolerance.
For instance, for an array/list
[1.1 , 1.3 , 1.9 , 2.0 , 2.5 , 2.9]
Function will return
[1.1 , 1.9 , 2.5 , 2.9]
If the tolerance is 0.3
a bit like the MATLAB function
http://mathworks.com/help/matlab/ref/uniquetol.html
(but that function uses a relative tolerance; an absolute one can be sufficient)
What is the Pythonic way to implement it? (NumPy is preferred.)
With A as the input array and tol as the tolerance value, we could have a vectorized approach with NumPy broadcasting, like so -
A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Sample run -
In [20]: A = np.array([2.1, 1.3 , 1.9 , 1.1 , 2.0 , 2.5 , 2.9])
In [21]: tol = 0.3
In [22]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[22]: array([ 2.1, 1.3, 2.5, 2.9])
Notice 1.9 being gone because we had 2.1 within the tolerance of 0.3. Then, 1.1 gone for 1.3 and 2.0 for 2.1.
Please note that this would create a unique array with a "chained-closeness" check. As an example:
In [91]: A = np.array([ 1.1, 1.3, 1.5, 2. , 2.1, 2.2, 2.35, 2.5, 2.9])
In [92]: A[~(np.triu(np.abs(A[:,None] - A) <= tol,1)).any(0)]
Out[92]: array([ 1.1, 2. , 2.9])
Thus, 1.3 is gone because of 1.1 and 1.5 is gone because of 1.3.
In pure Python 2, I wrote the following:
a = [1.1, 1.3, 1.9, 2.0, 2.5, 2.9]
# Per http://fr.mathworks.com/help/matlab/ref/uniquetol.html
tol = max(map(lambda x: abs(x), a)) * 0.3
a.sort()
results = [a.pop(0), ]
for i in a:
    # Skip items within tolerance.
    if abs(results[-1] - i) <= tol:
        continue
    results.append(i)
print a
print results
Which results in
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 2.0, 2.9]
Which agrees with the spec, but isn't consistent with your example.
If I just set the tolerance to 0.3 instead of max(map(lambda x: abs(x), a)) * 0.3, I get:
[1.3, 1.9, 2.0, 2.5, 2.9]
[1.1, 1.9, 2.5, 2.9]
...which is consistent with your example.
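For reference, here is the same greedy idea ported to Python 3 with NumPy (a sketch, assuming an absolute tolerance; the function name is just illustrative):
import numpy as np

def uniquetol(a, tol):
    a = np.sort(np.asarray(a, dtype=float))
    keep = [a[0]]
    for v in a[1:]:
        if v - keep[-1] > tol:  # a is sorted, so no abs() needed
            keep.append(v)
    return np.array(keep)

print(uniquetol([1.1, 1.3, 1.9, 2.0, 2.5, 2.9], 0.3))
# [1.1 1.9 2.5 2.9]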

Best way to parse matrix in python from a logfile

I have a matrix printed inside a log file this way:
[[ 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
[ 1.9 76.5 1.1 1.6 1.2 0.9 2.4 0.3 5.9 8.2]
[ 4.2 0.3 64.5 5.7 9.1 5.9 6.5 2.7 1. 0.1]
[ 2.2 0.5 9.4 49.4 9.6 15.6 8.5 3.3 1.2 0.3]
[ 1.5 0.1 9. 5.3 71.3 3.5 4.8 3.7 0.8 0. ]
[ 1. 0.4 7.7 15.9 6.5 59.5 4.4 3.7 0.7 0.2]
[ 0.1 0.3 7. 5.6 3.8 2.2 80.3 0.5 0.2 0. ]
[ 1. 0.2 6.3 4.5 11.2 6.2 0.9 69.1 0.2 0.4]
[ 6.2 0.7 1.7 2.3 1.6 0.6 1. 0.4 84. 1.5]
[ 4.3 8.6 1.9 3.7 1.3 1.2 1.7 1.9 4.4 71. ]]
.... some text and then again another matrix ......
[[ 71.9 0.6 8.1 2. 1. 1.9 2. 1.6 8.4 2.5]
[ 1.2 82.9 1.1 1.1 0.5 1.3 1.5 0.7 3.9 5.8]
[ 4.7 0.9 59.6 4.1 7.6 10. 7.3 3.8 1.5 0.5]
[ 2.3 0.7 6.5 43.1 6.8 24.5 7.4 4. 2.6 2.1]
[ 1.7 0.3 5.6 5.4 62.5 7.4 6.5 9.3 1. 0.3]
[ 1.4 0.2 6. 12.7 5.3 64.4 3.3 5.4 0.8 0.5]
[ 0.7 0.6 5.5 4.9 3.9 4.5 78.2 0.6 0.8 0.3]
[ 1.7 0.2 4.5 3.4 4.4 7.6 1. 76. 0.7 0.5]
[ 6.3 1.9 1.7 1.4 0.8 0.9 1.4 0.4 83. 2.2]
[ 3.4 8.6 1.4 1.2 0.8 1.3 0.9 2.4 3.8 76.2]]
I tried doing this to read the first matrix and then append it to a list of matrices:
with open('log.txt') as f:
    for line in f:
        for i in range(N):
            cnf_mline = f.next().strip()
            cnf_mvalue = cnf_mline[cnf_mline.rfind('[')+1:]
            cnf_mtx.append(map(float, (cnf_mvalue,)))
            print cnf_mline
            print cnf_mvalue
and I see the following error:
ValueError: invalid literal for float(): 73.1 0.7 5.3 3.7 3.5 1. 0.9 1.3 8.8 1.7]
How do I parse this matrix directly from the log file to a list?
Thanks in advance !
import re

final = [[]]
with open("log.txt") as f:
    counter = 0
    for line in f:
        match = re.findall(r"\d+\.\d+|\d+\.", line)
        if match:
            final[counter].append(map(float, match))
        else:
            counter += 1
            final.append([])
print final
[[[73.1, 0.7, 5.3, 3.7, 3.5, 1.0, 0.9, 1.3, 8.8, 1.7], [1.9, 76.5, 1.1, 1.6, 1.2, 0.9, 2.4, 0.3, 5.9, 8.2], [4.2, 0.3, 64.5, 5.7, 9.1, 5.9, 6.5, 2.7, 1.0, 0.1], [2.2, 0.5, 9.4, 49.4, 9.6, 15.6, 8.5, 3.3, 1.2, 0.3], [1.5, 0.1, 9.0, 5.3, 71.3, 3.5, 4.8, 3.7, 0.8, 0.0], [1.0, 0.4, 7.7, 15.9, 6.5, 59.5, 4.4, 3.7, 0.7, 0.2], [0.1, 0.3, 7.0, 5.6, 3.8, 2.2, 80.3, 0.5, 0.2, 0.0], [1.0, 0.2, 6.3, 4.5, 11.2, 6.2, 0.9, 69.1, 0.2, 0.4], [6.2, 0.7, 1.7, 2.3, 1.6, 0.6, 1.0, 0.4, 84.0, 1.5], [4.3, 8.6, 1.9, 3.7, 1.3, 1.2, 1.7, 1.9, 4.4, 71.0]], [[71.9, 0.6, 8.1, 2.0, 1.0, 1.9, 2.0, 1.6, 8.4, 2.5], [1.2, 82.9, 1.1, 1.1, 0.5, 1.3, 1.5, 0.7, 3.9, 5.8], [4.7, 0.9, 59.6, 4.1, 7.6, 10.0, 7.3, 3.8, 1.5, 0.5], [2.3, 0.7, 6.5, 43.1, 6.8, 24.5, 7.4, 4.0, 2.6, 2.1], [1.7, 0.3, 5.6, 5.4, 62.5, 7.4, 6.5, 9.3, 1.0, 0.3], [1.4, 0.2, 6.0, 12.7, 5.3, 64.4, 3.3, 5.4, 0.8, 0.5], [0.7, 0.6, 5.5, 4.9, 3.9, 4.5, 78.2, 0.6, 0.8, 0.3], [1.7, 0.2, 4.5, 3.4, 4.4, 7.6, 1.0, 76.0, 0.7, 0.5], [6.3, 1.9, 1.7, 1.4, 0.8, 0.9, 1.4, 0.4, 83.0, 2.2], [3.4, 8.6, 1.4, 1.2, 0.8, 1.3, 0.9, 2.4, 3.8, 76.2]]]
The problem you are facing is that you have a string "73.1 0.7 " etc., and that cannot be parsed into a float. If you want to have something parseable, you'll have to split that string (and get rid of that trailing ]):
cnf_mvalues = cnf_mline[cnf_mline.rfind('[')+1:-1].split()
cnf_mtx.append(map(float, cnf_mvalues))
This should help you with the exception. (But it is not a complete solution!)
I think you might want to have a more stateful model, because:
a line starting with [[ and ending with ] marks the start
a line starting with [ and ending with ] is a line within the matrix
a line starting with [ and ending with ]] marks the end
(And even this makes a lot of assumptions.) There are really only two states: in the matrix and outside of the matrix. We may call that variable in_matrix.
Then for each line the logic goes:
"[[...]":
in_matrix := True
start an empty matrix
append the data on the row to the empty matrix
"[...]":
if in_matrix:
append the data on the row to the matrix under collection
else:
stray data, ignore
"[...]]":
if in_matrix:
append the data on the row to the matrix under collection
append the matrix under collection to the list of matrices
in_matrix := False
else:
stray data, ignore
And, naturally, a lot of exception handling with invalid values, etc. Most of the time it is enough to set in_matrix to False if you get bad data.
This state machine should turn into fairly short code, as checking the brackets is not very difficult.
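Here is a sketch of that state machine (Python 3, under the bracket assumptions listed above; the function name is just illustrative):
import numpy as np

def parse_matrices(path):
    matrices, current, in_matrix = [], [], False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith('[['):          # start of a matrix
                in_matrix, current = True, []
            if in_matrix and stripped.startswith('['):
                row = stripped.strip('[]').split()
                current.append([float(v) for v in row])
                if stripped.endswith(']]'):        # end of the matrix
                    matrices.append(np.array(current))
                    in_matrix = False
    return matrices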
