np.loadtxt for a file with many matrices - python

I have a file that looks something like this:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
.. the file continues, with several 3x3 matrices appearing in the same fashion. Each matrix is prefaced by text with a unique ID, though the IDs aren't particularly important to me. I want to create an array of these matrices. Can I use loadtxt to do that?
Here is my best attempt. The 6 in this code could be replaced with an iterating variable starting at 6 and incrementing by the number of rows in the matrix. I thought that skiprows would accept a list, but apparently it only accepts integers.
np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])
TypeError                                 Traceback (most recent call last)
<ipython-input-23-7d82fb7ef14a> in <module>()
----> 1 np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])

/usr/local/lib/python2.7/site-packages/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    932
    933     # Skip the first `skiprows` lines
--> 934     for i in range(skiprows):
    935         next(fh)
    936

Maybe I misunderstand, but if you can match the lines preceding the 3x3 matrices, then you can create a generator to feed to loadtxt:
import numpy as np

def get_matrices(fs):
    # iterate over the file instead of calling next() in a while-loop,
    # so reaching end-of-file ends the generator cleanly
    for line in fs:
        if 'matrix' in line:  # or whatever matches the line before a matrix
            yield next(fs)
            yield next(fs)
            yield next(fs)

with open('matrices.dat') as fs:
    g = get_matrices(fs)
    M = np.loadtxt(g)

M = M.reshape((M.size // 9, 3, 3))
print(M)
If you feed it:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
new matrix 7 6
1 0 1
2 0 3
0 1 2
You get an array of the matrices:
[[[ 3.  2.  4.]
  [ 7.  4.  2.]
  [ 9.  1.  1.]]

 [[ 9.  4.  1.]
  [ 1.  3.  4.]
  [ 4.  3.  1.]]

 [[ 7.  2.  1.]
  [ 1.  3.  4.]
  [ 2.  3.  2.]]

 [[ 1.  0.  1.]
  [ 2.  0.  3.]
  [ 0.  1.  2.]]]
Alternatively, if you just want to yield all lines that look like rows of a 3x3 matrix of integers, match them against a regular expression:
import re

def get_matrices(fs):
    for line in fs:
        if re.match(r'\d+\s+\d+\s+\d+', line):
            yield line
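Putting that filter to work end to end (sample text from the question is inlined here so the sketch is self-contained; the reshape assumes every matched line really belongs to a 3x3 block):

```python
import io
import re
import numpy as np

sample = """some text
the grids are
3 x 3
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
"""

def get_rows(fs):
    # yield every line that looks like three whitespace-separated integers
    for line in fs:
        if re.match(r'\d+\s+\d+\s+\d+\s*$', line):
            yield line

M = np.loadtxt(get_rows(io.StringIO(sample)))
M = M.reshape((M.size // 9, 3, 3))
print(M.shape)  # (2, 3, 3)
```

Note the anchored `$` keeps the header line "3 x 3" from matching.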

You need to break your processing into two steps: first, extract the substrings corresponding to your desired matrices; then call numpy.loadtxt on each one. A good way to do this is:
Find the matrix start and end with re.
Load the matrix within that range.
Reset your range and continue.
Your matrix markers seem to vary, so you could use regular expressions like these:
start = re.compile(r"\w+\s+matrix\s+(\d+)\s+(\d+)\n")
end = re.compile(r"\n\n")
Then you can find start/end pairs and load the text for each matrix:
import io
import re
import numpy as np

# read our data
data = open("/path/to/file.txt").read()

def load_matrix(data, *args):
    # find the start and end bounds of the next matrix
    s = start.search(data)
    if not s:
        # no matrix left over
        return None, data
    e = end.search(data, s.end())
    e_index = e.end() if e else len(data)
    # load the text between the bounds
    buf = io.StringIO(data[s.end():e_index])
    matrix = np.loadtxt(buf, *args)  # add other loadtxt args here
    # return the matrix along with the remaining, unconsumed text;
    # rebinding data inside the function would not affect the caller
    return matrix, data[e_index:]
Idea
In this case, my regular-expression marker for the start of the matrix has capturing groups (\d+) for the matrix dimensions, so you can get the MxN shape of each matrix if you wish. The pattern matches lines containing the word "matrix", with arbitrary leading text and two whitespace-separated numbers at the end.
My match for the end is "\n\n", that is, the blank line after a matrix (if you have Windows line endings, you may need to consider "\r" too).
Automating This
Now that we have a way to find a single matrix, all you need to do is iterate, keeping the leftover text each time, and populate a list of matrices while you still get matches. Note the explicit is None check: testing a NumPy array with not would raise an error.
matrices = []
# read our data
data = open("/path/to/file.txt").read()
while True:
    result, data = load_matrix(data)  # pass other loadtxt arguments if needed
    if result is None:
        break
    matrices.append(result)
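A self-contained sketch of the same two-step idea, with sample text inlined and a hypothetical helper name next_matrix (it assumes a blank line terminates each matrix, as above):

```python
import io
import re
import numpy as np

start = re.compile(r"\w+\s+matrix\s+(\d+)\s+(\d+)\n")
end = re.compile(r"\n\n")

def next_matrix(data):
    """Return (matrix, remaining_data), or (None, data) when no marker is left."""
    s = start.search(data)
    if not s:
        return None, data
    e = end.search(data, s.end())
    e_index = e.end() if e else len(data)
    matrix = np.loadtxt(io.StringIO(data[s.end():e_index]))
    return matrix, data[e_index:]

data = ("some text\n\n"
        "new matrix 2 4\n9 4 1\n1 3 4\n4 3 1\n\n"
        "new matrix 3 3\n7 2 1\n1 3 4\n2 3 2\n")
matrices = []
while True:
    m, data = next_matrix(data)
    if m is None:
        break
    matrices.append(m)
print(len(matrices))  # 2
```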

Related

Iterate through upper triangular matrix in Python

I am using Python and I have an XxY matrix where X=Y, and I want to iterate over the upper triangular part in a specific way: column by column, taking one more row in each successive column, until I reach the last row and column. Therefore, I tried to create a double loop which loops over the columns one by one, and within that loop another loop which loops over the rows, always adding one row. However, I got stuck in defining how to add the next row for every column in the second loop. Here is what I got so far (for simplicity I just created an array of zeros):
import numpy as np
# number of columns
X = 10
# number of rows
Y = X
U = np.zeros((Y, X))
for j in range(X):
    for z in range():  # stuck: what should this range be?
My initial idea was to create an array of Yx1 with y = np.asarray(list(range(0, Y))) and use it for the second loop, but I don't understand how to implement it. Can somebody please help me? Is there maybe a simpler way to define such an iteration?
With Numpy, you can get the indices for the upper triangular matrix with triu_indices_from and index into the array with that:
import numpy as np
a = np.arange(16).reshape([4, 4])
print(a)
#[[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]]
indices = np.triu_indices_from(a)
upper = a[indices]
print(upper)
# [ 0 1 2 3 5 6 7 10 11 15]
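If the column-by-column visiting order from the question matters, note that np.triu_indices_from yields row-major order instead. The inner loop the question was missing is simply range(j + 1), one more row per column. A minimal sketch:

```python
import numpy as np

a = np.arange(16).reshape(4, 4)
# visit the upper triangle column by column, adding one row per column:
# (0,0), then (0,1),(1,1), then (0,2),(1,2),(2,2), ...
visited = []
for j in range(a.shape[1]):
    for i in range(j + 1):
        visited.append(a[i, j])
print(visited)  # [0, 1, 5, 2, 6, 10, 3, 7, 11, 15]
```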

The inverses of some matrices differ between Python and Excel. Which results should I consider?

I tested two 3x3 matrices to find their inverses in Python and Excel, but the results are different. Which should I consider the correct or best result?
These are the matrices I tested:
Matrix 1:
1 0 0
1 2 0
1 2 3
Matrix 2:
1 0 0
4 5 0
7 8 9
The Matrix 1 inverse is the same in Python and Excel, but the Matrix 2 inverse is different.
In Excel I use the MINVERSE(matrix) function, and in Python np.linalg.inv(matrix) (from the NumPy library).
I can't post images yet, so I can't show the results from Excel :c
This is the code I use in Python:
import numpy as np

# Matrix 1
A = np.array([[1, 0, 0],
              [1, 2, 0],
              [1, 2, 3]])
Ainv = np.linalg.inv(A)
print(Ainv)
Result:
[[ 1.          0.          0.        ]
 [-0.5         0.5         0.        ]
 [ 0.         -0.33333333  0.33333333]]
# (This is the same in Excel)

# Matrix 2
B = np.array([[1, 0, 0],
              [4, 5, 0],
              [7, 8, 9]])
Binv = np.linalg.inv(B)
print(Binv)
Result:
[[ 1.00000000e+00  0.00000000e+00 -6.16790569e-18]
 [-8.00000000e-01  2.00000000e-01  1.23358114e-17]
 [-6.66666667e-02 -1.77777778e-01  1.11111111e-01]]
# (This is different in Excel)
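One way to judge which result to trust: multiply each candidate inverse back with the original matrix and check that you get the identity. Entries of size 1e-17 or 1e-18, like those above, are ordinary floating-point round-off and are effectively zero. A quick check:

```python
import numpy as np

B = np.array([[1, 0, 0],
              [4, 5, 0],
              [7, 8, 9]])
Binv = np.linalg.inv(B)

# a correct inverse must satisfy B @ Binv == identity, up to round-off
print(np.allclose(B @ Binv, np.eye(3)))  # True
```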

How to rewrite the node ids stored in a .mtx file

I have a .mtx file that looks like below:
0 435 1
0 544 1
1 344 1
2 410 1
2 471 1
This matrix has a shape of (1000, 1000).
As you can see, the node ids start at 0. I want them to start at 1 instead of 0.
In other words, I need to add 1 to all the numbers in the first and second columns, which represent the node ids.
So I converted the .mtx file to a .txt file and simply tried to add 1 to each row, like below:
import numpy as np

data_path = "my_data_path"
data = np.loadtxt(data_path, delimiter=' ', dtype='int')
for i in data:
    print(data[i]+1)
and result was
[ 1 436 2]
[ 1 545 2]
[ 2 345 2]
[ 3 411 2]
[ 3 472 2]
now I need to subtract 1 from third column, but I have no idea how to implement that.
Can someone help me to do that?
Or if there's an easier way to achieve my goal, please tell me. Thank you in advance.
Why wouldn't you increment just the two node-id columns?
data[:, :2] += 1
You may want to have a look at indexing in NumPy.
Additionally, I don't think the loop in your code ever worked:
for i in data:
    print(data[i]+1)
You are indexing with values from the array which generally is wrong and is surely wrong in this case:
IndexError: index 435 is out of bounds for axis 0 with size 5
You could correct it to print the whole matrix:
print(data + 1)
Giving:
[[  1 436   2]
 [  1 545   2]
 [  2 345   2]
 [  3 411   2]
 [  3 472   2]]
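For the full round trip you could np.loadtxt the file, shift the first two columns, and np.savetxt the result. A minimal sketch with the question's sample rows inlined in place of the file read (the output file name is hypothetical):

```python
import numpy as np

# sample rows from the question, standing in for np.loadtxt("my_data_path")
data = np.array([[0, 435, 1],
                 [0, 544, 1],
                 [1, 344, 1]])
data[:, :2] += 1  # shift the node-id columns from 0-based to 1-based
np.savetxt("edges_1based.txt", data, fmt='%d')  # write the result back out
print(data)
```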

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print frame.unstack().values
which outputs
[[  0.   1.   2.]
 [ nan   4.   5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
# create an array of NaN with the right dimensions
shape = [len(level) for level in frame.index.levels]
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
# (the tuple() is required for recent NumPy versions)
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
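For reference, a self-contained version of this edit, rebuilding the 2-level example frame from the question:

```python
import numpy as np
import pandas as pd

# rebuild the 2-level example from the question
index = pd.MultiIndex.from_product([range(2), range(3)])
frame = pd.DataFrame(list(range(6)), columns=['value'],
                     index=index).drop((1, 0))

# NaN-filled array shaped by the index levels, filled via advanced indexing
shape = [len(level) for level in frame.index.levels]
arr = np.full(shape, np.nan)
arr[tuple(frame.index.codes)] = frame.values.flat
print(arr)
```

Missing index combinations, like the dropped (1, 0), simply stay NaN.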
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = [len(level) for level in frame.index.levels]
print(frame.values.reshape(shape))
which outputs
[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values \
     .reshape([len(level) for level in frame.index.levels])
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!

Creating columns with numpy Python

I have some elements stored in a numpy array. I wish to store them in a ".txt" file. The catch is that the file needs to fit a certain standard, which means each element needs to start at a fixed column of its line.
Example:
numpy.array[0] needs to start in line 1, col 26.
numpy.array[1] needs to start in line 1, col 34.
I use numpy.savetxt() to save the arrays to file.
Later I will implement this in a loop to create a large ".txt" file with coordinates.
Edit: This good example was provided below; it illustrates my struggle:
In [117]: np.savetxt('test.txt', A.T, '%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
The fmt option '%20d %10d' right-justifies each number in its field, so where a number starts depends on its own width. What I need is an option which lets me set the spacing from the left side regardless of the other integers.
The template I need to fit integers into:
XXXXXXXX.XXX YYYYYYY.YYY ZZZZ.ZZZ
Final Edit:
I solved it by writing a check for how many characters the last float used; I was then able to pad the next float with the right number of spaces to fit the template.
Have you played with the fmt of np.savetxt?
Let me illustrate with a concrete example (the sort that you should have given us)
Make a 2 row array:
In [111]: A = np.arange(12).reshape(2, 6)
In [112]: A
Out[112]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
Save it, and get 2 rows, 6 columns
In [113]: np.savetxt('test.txt',A,'%d')
In [114]: cat test.txt
0 1 2 3 4 5
6 7 8 9 10 11
save its transpose, and get 6 rows, 2 columns
In [115]: np.savetxt('test.txt',A.T,'%d')
In [116]: cat test.txt
0 6
1 7
2 8
3 9
4 10
5 11
Put more detail into fmt to space out the columns
In [117]: np.savetxt('test.txt', A.T, '%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
I think you can figure out how to make a fmt string that puts your numbers in the correct columns (pad with spaces, or use left- and right-justification - the usual Python formatting issues).
savetxt also accepts an already-opened file. So you can open a file for writing, write one array, add some filler lines, and write another. savetxt doesn't do anything fancy: it just iterates through the rows of the array and writes each row to a line, e.g.
for row in A:
    f.write(fmt % tuple(row) + '\n')
So if you don't like the control that savetxt gives you, write the file directly.
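For the original column-template problem, left-justified fields (the '-' flag) make each number start at a fixed column regardless of the other integers; the widths below are made-up placeholders:

```python
import io
import numpy as np

A = np.arange(12).reshape(2, 6)
buf = io.StringIO()
# '-' left-justifies, so column 0 starts at char 0 and column 1 at char 20
# (these widths are illustrative, not from the original template)
np.savetxt(buf, A.T, fmt='%-20d%-10d')
print(buf.getvalue())
```

The same format strings work when writing the file directly with fmt % tuple(row).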
