Matrix manipulation in numpy - python

I wrote the code below:
import os
import csv
import numpy as np

ROOT_PATH = os.path.dirname(os.path.abspath(__file__))  # These two lines give the
path = os.path.join(ROOT_PATH, "0.dat")                 # path to a file on my disk
with open(path, 'r') as f1:
    listme = csv.reader(f1, delimiter="\t")  # I imported the file
    listme2 = list(listme)  # I used this command to make a matrix in the next line
m = np.matrix(listme2)
m2 = np.delete(m, [1, 2], 1)  # I deleted two columns to get a 2 by 2 matrix
print m + m  # It cannot add these two matrices. It also cannot multiply them with np.dot(m, m)
I cannot add the matrix I defined to itself. Please read the comment in the code.
The error returned is:
TypeError: unsupported operand type(s) for +: 'matrix' and 'matrix'

It's not a problem with the + operator; it's that m is a matrix of strings, not numbers. Convert listme2 to a list of numbers before you use it to build m. If you want to do it manually, use a list comprehension:
listme2 = [[float(x) for x in line] for line in listme]  # works in Python 2 and 3; map() returns an iterator in Python 3
Or, when creating the matrix, specify the dtype:
m = np.matrix(listme2, dtype=float)
You can also use np.loadtxt or np.genfromtxt to get the 2D array directly, without open and csv.reader. Read the docs ;)
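A minimal sketch of the loadtxt route, assuming the same tab-delimited numeric file and the path variable from the question:
import numpy as np
m = np.matrix(np.loadtxt(path, delimiter="\t"))  # loadtxt parses the fields as floats
m2 = np.delete(m, [1, 2], 1)  # same column deletion as before
print(m2 + m2)       # element-wise addition now works
print(np.dot(m2, m2))  # so does matrix multiplication of the square matrix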


Making a presence/absence matrix (y axis is file names, x axis is reads in file)

I have multiple files (filenames) with multiple sequence reads in them (each read has a readname that starts with >):
Filename1
>Readname1
>Readname2
Filename2
>Readname1
>Readname3
Given a dictionary that contains all possible readnames like this:
g={}
g['Readname1']=[]
g['Readname2']=[]
g['Readname3']=[]
How could I write code that would iterate over each file and generate the following matrix:
          Filename1  Filename2
Readname1     1          1
Readname2     1          0
Readname3     0          1
The code should scan the contents of each file in the directory. Ideally I could read the dictionary from an input file rather than hard-coding it, so I can generate matrices for different dictionaries. The content of each read (e.g. its gene sequence) is not relevant, just whether the readname is present or absent in that file.
I am just learning python, so a colleague shared their code to get me started. Here they were creating a presence/absence matrix of their dictionary (readnames) in a single specified file (files.txt). I would like to read the dictionary from a second file (so that it's not static in the code) and to iterate over multiple files.
from Bio import SeqIO
import os

dir_path = ""  # directory path
files = os.listdir(path=dir_path)
with open(dir_path + 'files.txt') as f:
    files = f.readlines()
files = [x.strip() for x in files]

g = {}
g['Readname1'] = []
g['Readname2'] = []
g['Readname3'] = []

for i in files:
    a = list(SeqIO.parse(dir_path + i, 'fasta'))
    for j in a:
        g[j.id].append(i)

print('generating counts...')
counts = {}
for i in g.keys():
    counts[i] = []
for i in files:
    for j in g:
        if i in g[j]:
            counts[j].append(1)
        else:
            counts[j].append(0)

print('writing out...')
outfile = open(dir_path + 'core_withLabels.csv', 'w')
outfile2 = open(dir_path + 'core_noLabels.csv', 'w')
temp_string = ''
for i in files:
    outfile.write(',' + i)
    temp_string = temp_string + i + ','
temp_string = temp_string[:-1]
outfile2.write(temp_string + '\n')
outfile.write('\n')
for i in counts:
    outfile.write(i)
    temp_string = ''
    for j in counts[i]:
        outfile.write(',' + str(j))
        temp_string = temp_string + str(j) + ','
    temp_string = temp_string[:-1]
    outfile2.write(temp_string + '\n')
    outfile.write('\n')
outfile.close()
outfile2.close()
By matrices, do you mean a numpy matrix or a List[List[int]]?
If you know the total number of readnames, a numpy matrix is an easy way to go. For a numpy matrix, create a zero matrix of the corresponding size:
matrix = np.zeros((n_filenames, n_readnames), dtype=int)
Alternatively, define
matrix = [[] for _ in range(n_filenames)]
Also, define a dict that maps each readname to its column index in the matrix:
mapping = dict()
next_available_idx = 0
Then, iterate over all files and fill in the corresponding entries with ones.
for i, filename in enumerate(filenames):
    with open(filename) as f:
        for readname in f:
            readname = readname.strip()  # get rid of the trailing newline and extra spaces
            # find the corresponding column
            if readname in mapping:
                col_idx = mapping[readname]
            else:
                col_idx = next_available_idx
                next_available_idx += 1
                mapping[readname] = col_idx
            matrix[i, col_idx] = 1  # for the numpy matrix
            # if you use a list of lists, then instead:
            # matrix[i] += [0] * (col_idx - len(matrix[i])) + [1]
Finally, if you use a list of lists, make sure that all the lists end up the same length: you need to iterate over the rows of matrix one more time, as sketched below.
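A small sketch of that final padding pass, assuming the list-of-lists variant and the names used above:
# Pad every row to the width of the longest one so the matrix is rectangular.
width = max(len(row) for row in matrix)
for row in matrix:
    row.extend([0] * (width - len(row)))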

Inserting values into array without indices

I am trying to parse some .out files to get a value E contained within each file, and then plot these values against theta and r as a 3D surface plot. The values of theta and r are contained in the .out file names: H2O.r{}theta{}.out, i.e. r is given in the first {} and theta in the next {}. In the file names, r is given to 2 d.p. and theta to 1 d.p., e.g. r = 0.90, theta = 190.0.
I am having a hard time iterating through the files and extracting this information into an array E. I have come across this error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.
However, if I change my r array to int to get rid of this error, then all of the values in r become 0. Additionally, my code to extract E from the file will no longer work, as I will be inputting 'H2O.r0.00theta70.out', a file which doesn't exist. Does anybody have any suggestions?
from numpy import *
import matplotlib.pyplot as plt
import os

os.chdir('C:/Users/myName/ex2/all')
theta = arange(70.0, 161.0, 1, dtype=float)
r = arange(0.70, 1.95, 0.05, dtype=float)
r_2dp = ['%.2f' % elem for elem in r]  # string array, rounded to match the file names
E = zeros((theta.shape[0], r.shape[0]))

def extract(filename):  # extract value from file
    filename = open(filename, "r")
    for line in filename:
        if 'SCF Done' in line:
            l = line.split()
            p = float(l[4])
    return p

for i in r_2dp:  # create E array that will allow me to plot E vs r vs theta
    for j in theta:
        for filename in os.listdir('C:/Users/myName/ex2/all'):
            if filename.startswith('H2O'):
                filename = 'H2O.r{}theta{}.out'.format(i, j)
                E[i, j] = extract(filename)
One solution would be to associate an index with the filename which you can accomplish using enumerate; I think you can just change your loop to
for i, fn in enumerate(r_2dp):
    for j, ti in enumerate(theta):
        for filename in os.listdir('C:/Users/myName/ex2/all'):
            if filename.startswith('H2O'):
                filename = 'H2O.r{}theta{}.out'.format(fn, ti)
                E[j, i] = extract(filename)
Please note that I changed E[i,j] to E[j,i] to get the dimensions right; you could also swap the order of the two for-loops, or initialize E the other way round, as sketched below...
Untested, as we cannot access your files, but the general idea should work...
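For instance, a sketch of the "other way round" initialization, which keeps your E[i,j] indexing and also drops the inner os.listdir scan (it is redundant once the filename is built directly); untested against your data:
E = zeros((r.shape[0], theta.shape[0]))  # (n_r, n_theta) instead of (n_theta, n_r)
for i, fn in enumerate(r_2dp):
    for j, ti in enumerate(theta):
        E[i, j] = extract('H2O.r{}theta{}.out'.format(fn, ti))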

python sparse matrix creation paralellize to speed up

I am creating a sparse matrix file by extracting features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; then, in each pair, the value to the left of the colon is a feature ID and the value to the right is that feature's score.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json  # needed for json.dump below; missing from the original snippet
import numpy as np
import scipy
import tables as tb  # PyTables, needed for tb.open_file; missing from the original snippet
from scipy.sparse import coo_matrix, csr_matrix, rand
from Film import Film

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(),
                                  shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets; it takes too long to run.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong with this function that makes it so slow?
IO + conversions (from str, to str, even twice to str for the same variable, etc.) + splits + explicit loops. By the way, there is the csv Python module, which could be used to parse your input file; you can experiment with it (I suppose you use space as the delimiter). I also see you convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (the array?). You could also try another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python bytecode execution and to prefer native/C-implemented functions for everything. And definitely try to cut down all those conversions. Also, if the input file format is under your control, you could use fixed-length fields, which lets you avoid splitting/parsing entirely (only string indexing).
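As a rough sketch of those suggestions: one plain str.split per line, a dict lookup instead of the int()/str() churn, and a single coo_matrix build at the end. Here feature_to_col and n_features are hypothetical stand-ins for the class's selected-feature bookkeeping:
import numpy as np
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
film_ids = []
with open('input_file.txt') as f:
    for row_idx, line in enumerate(f):
        fields = line.split()  # one split per line, no numpy string machinery
        film_ids.append(fields[0])
        for pair in fields[1:]:
            feat, score = pair.split(':', 1)
            col = feature_to_col.get(feat)  # hypothetical dict: feature ID -> column index
            if col is not None:
                rows.append(row_idx)
                cols.append(col)
                vals.append(float(score))
matrix = coo_matrix((vals, (rows, cols)), shape=(len(film_ids), n_features)).tocsr()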

Saving a dictionary of numpy arrays in human-readable format

This is not a duplicate question. I looked around a lot and found this question, but the savez and pickle utilities render the file unreadable by a human. I want to save it in a .txt file which can be loaded back into a python script. So I wanted to know whether there are some utilities in python which can facilitate this task and keep the written file readable by a human.
The dictionary of numpy arrays contains 2D arrays.
EDIT:
According to Craig's answer, I tried the following :
import numpy as np
W = np.arange(10).reshape(2,5)
b = np.arange(12).reshape(3,4)
d = {'W':W, 'b':b}
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
f = open('out.txt', 'r')
d = eval(f.readline())
print(d)
This gave the following error: SyntaxError: unexpected EOF while parsing.
But out.txt did contain the dictionary as expected. How can I load it correctly?
EDIT 2:
Ran into a problem: Craig's answer truncates the array if the size is large. The out.txt shows the first few elements, replaces the middle elements by ..., and shows the last few elements.
Convert the dict to a string using repr() and write that to the text file.
import numpy as np
d = {'a':np.zeros(10), 'b':np.ones(10)}
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
You can read it back in and convert it to a dictionary with eval(). Since repr() of the dict spans multiple lines, read the whole file with read() rather than readline():
import numpy as np
f = open('out.txt', 'r')
data = f.read()
data = data.replace('array', 'np.array')
d = eval(data)
Or, you can directly import array from numpy:
from numpy import array
f = open('out.txt', 'r')
data = f.read()
d = eval(data)
H/T: How can a string representation of a NumPy array be converted to a NumPy array?
Handling large arrays
By default, numpy summarizes arrays longer than 1000 elements. You can change this behavior by calling numpy.set_printoptions(threshold=S) where S is larger than the size of the arrays. For example:
import numpy as np
W = np.arange(10).reshape(2,5)
b = np.arange(12).reshape(3,4)
d = {'W':W, 'b':b}
largest = max(np.prod(a.shape) for a in d.values()) #get the size of the largest array
np.set_printoptions(threshold=largest) #set threshold to largest to avoid summarizing
with open('out.txt', 'w') as outfile:
    outfile.write(repr(d))
np.set_printoptions(threshold=1000) #recommended, but not necessary
H/T: Ellipses when converting list of numpy arrays to string in python 3

Reading several arrays in a binary file with numpy

I'm trying to read a binary file which is composed of several matrices of floats, separated by a single int. The code in Matlab to achieve this is the following:
fid1 = fopen(fname1,'r');
for i=1:xx
    Rstart = fread(fid1,1,'int32');     % read blank at the beginning
    ZZ1 = fread(fid1,[Nx Ny],'real*4'); % read z
    Rend = fread(fid1,1,'int32');       % read blank at the end
end
As you can see, each matrix size is Nx by Ny. Rstart and Rend are just dummy values. ZZ1 is the matrix I'm interested in.
I am trying to do the same in python, doing the following:
Rstart = np.fromfile(fname1,dtype='int32',count=1)
ZZ1 = np.fromfile(fname1,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
Rend = np.fromfile(fname1,dtype='int32',count=1)
Then, I have to iterate to read the subsequent matrices, but the function np.fromfile doesn't retain the pointer in the file.
Another option:
with open(fname1, 'r') as f:
    ZZ1 = np.memmap(f, dtype='float32', mode='r', offset=4, shape=(Ny1, Nx1))
plt.pcolor(ZZ1)
This works fine for the first array, but it doesn't read the next matrices. Any idea how I can do this?
I searched for similar questions but didn't find a suitable answer.
Thanks
The cleanest way to read all your matrices in a single vectorized statement is to use a struct array:
dtype = [('start', np.int32), ('ZZ', np.float32, (Ny1, Nx1)), ('end', np.int32)]
with open(fname1, 'rb') as fh:
    data = np.fromfile(fh, dtype)
print(data['ZZ'])
There are 2 solutions for this problem.
The first one:
for i in range(x):
    ZZ1 = np.memmap(fname1, dtype='float32', mode='r',
                    offset=4 + 8*i + (Nx1*Ny1)*4*i, shape=(Ny1, Nx1))
Here i is the index of the matrix you want to read.
The second one:
fid = open('fname', 'rb')
for i in range(x):
    Rstart = np.fromfile(fid, dtype='int32', count=1)
    ZZ1 = np.fromfile(fid, dtype='float32', count=Ny1*Nx1).reshape(Ny1, Nx1)
    Rend = np.fromfile(fid, dtype='int32', count=1)
So, as morningsun points out, np.fromfile can receive a file object as an argument and will keep track of the file pointer between calls. Note that you must open the file in binary mode, 'rb'.
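If you want to keep every matrix instead of overwriting ZZ1 on each pass, here is a small sketch along the same lines (assuming x, Ny1, Nx1 as above):
import numpy as np

ZZ = np.empty((x, Ny1, Nx1), dtype='float32')
with open(fname1, 'rb') as fid:
    for i in range(x):
        np.fromfile(fid, dtype='int32', count=1)  # skip the leading blank
        ZZ[i] = np.fromfile(fid, dtype='float32', count=Ny1*Nx1).reshape(Ny1, Nx1)
        np.fromfile(fid, dtype='int32', count=1)  # skip the trailing blank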
