Python numpy method matrix.tofile()

I need help formatting my matrix when I write it to a file. I am using the numpy method called tofile().
It takes 3 args: 1) name of the file, 2) separator (must be a string), 3) format (also a string).
I don't know a lot about formatting, but I am trying to format the file so there is a new line every 9 characters (not including spaces). The output is a 9x9 sudoku game, so I need it to be formatted 9x9.
finished = M.tofile("soduku_solved.txt", " ", "")
Where M is a matrix
My first argument is the name of the file and the second is a space, but I don't know what format argument I need to make it 9x9.

I could be wrong, but I don't think that's possible with the numpy tofile function. I think the format argument only controls how each individual item is formatted; it doesn't consider the items as a group.
You could do something like:
import numpy as np

M = np.random.randint(1, 10, (9, 9))  # randint's upper bound is exclusive, so this gives digits 1-9
each_item_fmt = '{:>3}'
each_row_fmt = ' '.join([each_item_fmt] * 9)
fmt = '\n'.join([each_row_fmt] * 9)
as_string = fmt.format(*M.flatten())
It's not a very nice way to build up the format string, and there's bound to be a better way of doing it. You'll see the final result (print(fmt)) is a big block of '{:>3}', which basically says: put a piece of data here, right-aligned, with a fixed width of 3 characters.
EDIT: Since you're putting it directly into a file, you could write it line by line:
M = np.random.randint(1, 10, (9, 9))
fmt = ('{:>3} ' * 9).strip()
with open('soduku_solved.txt', 'w') as f:
    for m in M:
        f.write(fmt.format(*m) + '\n')
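For what it's worth, numpy's own savetxt writes one array row per line, which may let you skip building the format string by hand entirely. A minimal sketch, assuming M holds small integers:
import numpy as np

M = np.random.randint(1, 10, (9, 9))  # stand-in 9x9 grid of digits 1-9
# one row per line, items separated by a space, each formatted as an integer
np.savetxt('soduku_solved.txt', M, fmt='%d', delimiter=' ')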

Related

creating a numpy array in a loop

I want to create a numpy array by parsing a .txt file. The .txt file consists of features of iris flowers separated by commas. Every line has one flower example with 5 values separated by 4 commas: the first 4 numbers are features and the last one is the name. I parse the .txt in a loop and want to append (probably using numpy.append) every line's parsed data into a numpy array called feature_table.
Here's the code:
import numpy as np
iris_data = open("iris_data.txt", "r")
for line in iris_data:
    currentline = line.split(",")
    #iris_data_parsed = (currentline[0] + " , " + currentline[3] + " , " + currentline[4])
    #sepal_length = numpy.array(currentline[0])
    #petal_width = numpy.array(currentline[3])
    #iris_names = numpy.array(currentline[4])
    feature_table = np.array([currentline[0]],[currentline[3]],[currentline[4]])
    print(feature_table)
    print(feature_table.shape)
So I want to create a numpy array using only the first, fourth and fifth value in every line,
but I can't make it work as I want to. I tried reading the numpy docs but couldn't understand them.
While the people in the comments are right that you are not persisting your data anywhere, your problem, I assume, is the incorrect np.array construction. You should enclose all of the arguments in a single list, like this:
feature_table = np.array([currentline[0],currentline[3],currentline[4]])
and get rid of the redundant [ and ] around the individual arguments.
See the official documentation for more examples. Basically, all of the input data needs to be grouped into a single argument, because Python will treat the extra arguments as different positional arguments.
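On the looping part of the question: np.append copies the whole array on every call, so growing an array row by row is slow. A common pattern, sketched here for the same iris_data.txt layout, is to collect rows in a plain list and convert once at the end:
import numpy as np

rows = []
with open("iris_data.txt", "r") as iris_data:
    for line in iris_data:
        currentline = line.strip().split(",")
        # keep the first, fourth and fifth value of each line
        rows.append([currentline[0], currentline[3], currentline[4]])

feature_table = np.array(rows)  # shape (n_lines, 3); dtype is str here
print(feature_table.shape)
If you only need the two numeric columns, np.loadtxt("iris_data.txt", delimiter=",", usecols=(0, 3)) parses them in one call; the name column is not numeric and would need separate handling.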

python sparse matrix creation: parallelize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json                 # needed for the json.dump call below
import numpy as np
import tables as tb         # needed for tb.open_file / tb.Filters below
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets and takes too long to compute.
How can I improve and accelerate it? Maybe using MapReduce? What in this function makes it so slow?
IO + conversions (from str, to str, even two str conversions of the same variable) + splits + explicit loops. By the way, there is the csv module in Python which can be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). Also, I see you convert element[0] to int/str repeatedly, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). You can also try to implement it in another style, with map or list comprehensions, but experiments are needed; see the sketch below.
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions for everything, and to eliminate as many conversions as you can. Also, if the input file is yours, you can format it with fixed-length fields; that lets you avoid splitting/parsing entirely (only string indexing).
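To make that concrete, here is a sketch of the same loop with the numpy string machinery removed and each value converted exactly once. It is meant as a replacement for the body of sparseCreate's file loop, not a drop-in function; data_matrix and the self attributes come from the question's code:
selected = self.selected_features        # local aliases avoid repeated attribute lookups
col_of = self.index_selected_feature

with open('input_file.txt') as input_data:
    for index_film, line in enumerate(input_data):
        parts = line.split()             # one split per line, plain str objects
        id_film = parts[0]
        self.data_matrix_search_normal[id_film] = index_film
        self.data_matrix_search_reverse[index_film] = id_film
        for pair in parts[1:]:
            feat, score = pair.split(':', 1)
            if int(feat) in selected:    # convert once, not once per lookup
                data_matrix[index_film, col_of[feat]] = float(score)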

How to use a function like MATLAB's 'fread' in Python?

This is a .dat file.
In MATLAB, I can use this code to read it:
lonlatfile='NOM_ITG_2288_2288(0E0N)_LE.dat';
f=fopen(lonlatfile,'r');
lat_fy=fread(f,[2288*2288,1],'float32');
lon_fy=fread(f,[2288*2288,1],'float32')+86.5;
lon=reshape(lon_fy,2288,2288);
lat=reshape(lat_fy,2288,2288);
Here are some results from MATLAB:
(screenshot of the MATLAB output omitted)
How can I do this in Python to get the same result?
PS: my code is this:
def fromfileskip(fid, shape, counts, skip, dtype):
    """
    fid : file object, should be an open binary file.
    shape : tuple of ints, the desired shape of each data block.
            For a 2d array with xdim, ydim = 3000, 2000 and xdim the fastest
            dimension, shape = (2000, 3000).
    counts : int, number of times to read a data block.
    skip : int, number of bytes to skip between reads.
    dtype : np.dtype object, type of each binary element.
    """
    data = np.zeros((counts,) + shape)
    for c in range(counts):
        block = np.fromfile(fid, dtype=np.float32, count=np.product(shape))
        data[c] = block.reshape(shape)
        fid.seek(fid.tell() + skip)
    return data
fid = open(r'NOM_ITG_2288_2288(0E0N)_LE.dat','rb')
data = fromfileskip(fid,(2288,2288),1,0,np.float32)
loncenter = 86.5 #Footpoint of FY2E
latcenter = 0
lon2e = data+loncenter
lat2e = data+latcenter
Lon = lon2e.reshape(2288,2288)
Lat = lat2e.reshape(2288,2288)
But the result is different from that of MATLAB.
You should be able to translate the code directly into Python with little change:
import numpy as np

lonlatfile = 'NOM_ITG_2288_2288(0E0N)_LE.dat'
with open(lonlatfile, 'rb') as f:
    lat_fy = np.fromfile(f, count=2288*2288, dtype='float32')
    lon_fy = np.fromfile(f, count=2288*2288, dtype='float32') + 86.5
lon = lon_fy.reshape([2288, 2288], order='F')
lat = lat_fy.reshape([2288, 2288], order='F')
Normally the numpy reshape would be transposed compared to the MATLAB result, due to the different index order. The order='F' part makes sure the final output has the same layout as the MATLAB version. It is optional; if you keep the different index order in mind, you can leave it off.
The with open() as f: opens the file in a safe manner, making sure it is closed again when you are done even if the program has an error or is cancelled for whatever reason. Strictly speaking it is not needed, but you really should always use it when opening a file.
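If the index-order remark is unclear, a tiny self-contained example shows what order='F' changes; this is generic numpy behaviour, not specific to the file above:
import numpy as np

a = np.arange(6, dtype='float32')    # [0. 1. 2. 3. 4. 5.]
print(a.reshape(2, 3))               # C order (default): fills rows first
# [[0. 1. 2.]
#  [3. 4. 5.]]
print(a.reshape((2, 3), order='F'))  # Fortran/MATLAB order: fills columns first
# [[0. 2. 4.]
#  [1. 3. 5.]]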

Reading several arrays in a binary file with numpy

I'm trying to read a binary file which is composed of several matrices of float numbers, separated by single ints. The code in MATLAB to achieve this is the following:
fid1 = fopen(fname1,'r');
for i = 1:xx
    Rstart = fread(fid1,1,'int32');     % read blank at the beginning
    ZZ1 = fread(fid1,[Nx Ny],'real*4'); % read z
    Rend = fread(fid1,1,'int32');       % read blank at the end
end
As you can see, each matrix size is Nx by Ny. Rstart and Rend are just dummy values. ZZ1 is the matrix I'm interested in.
I am trying to do the same in python, doing the following:
Rstart = np.fromfile(fname1,dtype='int32',count=1)
ZZ1 = np.fromfile(fname1,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
Rend = np.fromfile(fname1,dtype='int32',count=1)
Then, I have to iterate to read the subsequent matrices, but the function np.fromfile doesn't retain the pointer in the file.
Another option:
with open(fname1, 'r') as f:
    ZZ1 = np.memmap(f, dtype='float32', mode='r', offset=4, shape=(Ny1, Nx1))
    plt.pcolor(ZZ1)
This works fine for the first array, but doesn't read the next matrices. Any idea how can I do this?
I searched for similar questions but didn't find a suitable answer.
Thanks
The cleanest way to read all your matrices in a single vectorized statement is to use a struct array:
dtype = [('start', np.int32), ('ZZ', np.float32, (Ny1, Nx1)), ('end', np.int32)]

with open(fname1, 'rb') as fh:
    data = np.fromfile(fh, dtype)

print(data['ZZ'])
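As an optional sanity check: if the file comes from Fortran unformatted writes (the usual source of this int32/data/int32 layout), each marker holds the record's byte count. That is an assumption about the file, not something the question confirms, but when it holds you can verify the read in two lines:
# each record marker should equal the payload size: Ny1*Nx1 float32 values, 4 bytes each
expected = Ny1 * Nx1 * 4
assert (data['start'] == expected).all() and (data['end'] == expected).all()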
There are 2 solutions for this problem.
The first one:
for i in range(x):
    ZZ1 = np.memmap(fname1, dtype='float32', mode='r',
                    offset=4 + 8*i + (Nx1*Ny1)*4*i, shape=(Ny1, Nx1))
where i is the index of the matrix you want to get.
The second one:
fid = open('fname', 'rb')
for i in range(x):
    Rstart = np.fromfile(fid, dtype='int32', count=1)
    ZZ1 = np.fromfile(fid, dtype='float32', count=Ny1*Nx1).reshape(Ny1, Nx1)
    Rend = np.fromfile(fid, dtype='int32', count=1)
So, as morningsun points out, np.fromfile can receive a file object as an argument and will keep track of the file pointer between calls. Notice that you must open the file in binary mode, 'rb'.
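Building on the second solution, once the loop works you can collect every matrix into a single 3-D array. A minimal sketch reusing the question's names (fname1, x, Nx1, Ny1):
import numpy as np

matrices = []
with open(fname1, 'rb') as fid:
    for i in range(x):
        np.fromfile(fid, dtype='int32', count=1)                  # skip Rstart
        zz = np.fromfile(fid, dtype='float32', count=Ny1 * Nx1)
        np.fromfile(fid, dtype='int32', count=1)                  # skip Rend
        matrices.append(zz.reshape(Ny1, Nx1))

ZZ = np.stack(matrices)  # shape (x, Ny1, Nx1)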

Splitting data file columns into separate arrays in Python

I'm new to Python and have been trying to figure this out all day. I have a data file laid out as below:
time I(R_stkb)
Step Information: Temp=0 (Run: 1/11)
0.000000000000000e+000 0.000000e+000
9.999999960041972e-012 8.924141e-012
1.999999992008394e-011 9.623148e-012
3.999999984016789e-011 6.154220e-012
(Note: there is no empty line between the data lines.)
I want to plot the data using matplotlib functions, so I'll need the two separate columns in arrays.
I currently have
def plotdata():
    Xvals=[], Yvals=[]
    i = open(file,'r')
    for line in i:
        Xvals, Yvals = line.split(' ', 1)
        print Xvals, Yvals
But obviously it's completely wrong. Can anyone give me a simple answer to this, with an explanation of what exactly the lines mean? Cheers.
Edit: The first two lines repeat throughout the file.
This is a job for the * operator on the zip function.
>>> asdf
[[1, 2], [3, 4], [5, 6]]
>>> zip(*asdf)
[(1, 3, 5), (2, 4, 6)]
So in the context of your data it might be something like:
handle = open(file,'r')
lines = [line.split() for line in handle if line[:4] not in ('time', 'Step')]
Xvals, Yvals = zip(*lines)
or, if you really need to be able to mutate the data afterwards, you could just call the list constructor on each tuple:
Xvals, Yvals = [list(block) for block in zip(*lines)]
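One caveat: the interactive session above is Python 2, where zip returns a list directly. In Python 3, zip returns an iterator, so wrap it in list() to see the values:
asdf = [[1, 2], [3, 4], [5, 6]]
print(list(zip(*asdf)))  # [(1, 3, 5), (2, 4, 6)]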
One way to do it is:
Xvals = []; Yvals = []
i = open(file, 'r')
for line in i:
    x, y = line.split(' ', 1)
    Xvals.append(float(x))
    Yvals.append(float(y))
print Xvals, Yvals
Note the call to the float function, which will change the string you get from the file into a number.
This is what numpy.loadtxt is designed for. Try:
import numpy as np
import matplotlib.pyplot as plt

# assuming you have time and step information on 2 separate lines
# and you do not want to read them
data = np.loadtxt(file, skiprows=2)
plt.plot(data[:, 0], data[:, 1])
plt.show()
EDIT:
If you have time and step information scattered throughout the file and you want to plot the data for every step, there is the option of reading the whole file into memory (provided it is small enough) and then splitting it on the 'time' strings:
l = open(fname, 'rb').read()
for chunk in l.split('time'):
    data = np.array([s.split() for s in chunk.split('\n')[2:]][:-1], dtype=np.float)
    plt.plot(data[:, 0], data[:, 1])
plt.show()
Or else you could add the # comment sign to the comment lines and use np.loadtxt.
If you want to plot this file with matplotlib, you might want to check out its plotfile function; see the official documentation for details.
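For completeness, a hedged sketch of that call; plotfile has since been removed from recent matplotlib releases, and the delimiter/skiprows values here are assumptions about the file layout:
import matplotlib.pyplot as plt

# old matplotlib only: plot column 1 against column 0, skipping the two header lines
plt.plotfile(file, cols=(0, 1), skiprows=2, delimiter=' ')
plt.show()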
