Inserting values into array without indices - python

I am trying to parse some .out files to get a value E contained within each file, and then plot these values against theta and r as a 3D surface plot. The values of theta and r are contained in the .out file names: H2O.r{}theta{}.out, i.e. r is given in the first {} and theta in the second {}. In the file names r is given to 2 d.p. and theta to 1 d.p., e.g. r = 0.90, theta = 190.0.
I am having a hard time iterating through the files and extracting this information into an array E. I have come across an error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.
However, if I change my r array to int to get rid of this error, then all of the values in r become 0. Additionally, my code to extract E from the file will no longer work, as I will be passing in 'H2O.r0.00theta70.out', a file which doesn't exist. Does anybody have any suggestions?
from numpy import *
import matplotlib.pyplot as plt
import os
os.chdir('C:/Users/myName/ex2/all')
theta = arange(70.0, 161.0, 1, dtype = float)
r = arange(0.70, 1.95, 0.05, dtype = float)
r_2dp = [ '%.2f' % elem for elem in r ] # string array, with rounding to match the file names
E = zeros((theta.shape[0],r.shape[0]))
def extract(filename): #extract value from file
    filename = open(filename,"r")
    for line in filename:
        if 'SCF Done' in line:
            l = line.split()
            p = float(l[4])
            return p
for i in r_2dp: #create E array that will allow me to plot E vs r vs theta
    for j in theta:
        for filename in os.listdir('C:/Users/myName/ex2/all'):
            if filename.startswith('H2O'):
                filename = 'H2O.r{}theta{}.out'.format(i,j)
                E[i,j] = extract(filename)

One solution would be to associate an index with the filename which you can accomplish using enumerate; I think you can just change your loop to
for i, fn in enumerate(r_2dp):
    for j, ti in enumerate(theta):
        for filename in os.listdir('C:/Users/myName/ex2/all'):
            if filename.startswith('H2O'):
                filename = 'H2O.r{}theta{}.out'.format(fn,ti)
                E[j,i] = extract(filename)
Please note that I changed E[i,j] to E[j,i] to get the dimensions right, since E was created with shape (len(theta), len(r)); you could also swap the order of the two for-loops or initialize E the other way round.
Untested, as we cannot access your file, but the general idea should work...
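The same idea can be sketched end to end without scanning the directory at all, since each file name can be built from r and theta directly. This is only a minimal sketch under assumptions: fill_energy_grid and its directory argument are hypothetical names, plain nested lists stand in for the NumPy array so the block has no dependencies, and the layout of the 'SCF Done' line is inferred from the l[4] used in the question.

```python
import os
import tempfile

def extract(path):
    """Return the energy from the first 'SCF Done' line, or None if absent."""
    with open(path) as handle:
        for line in handle:
            if 'SCF Done' in line:
                return float(line.split()[4])
    return None

def fill_energy_grid(directory, r_values, theta_values):
    """Rows follow theta and columns follow r, like zeros((len(theta), len(r)))."""
    E = [[None] * len(r_values) for _ in theta_values]
    for i, r in enumerate(r_values):
        for j, theta in enumerate(theta_values):
            # Format the floats only for the file name; index with integers i, j.
            name = 'H2O.r{:.2f}theta{:.1f}.out'.format(r, theta)
            path = os.path.join(directory, name)
            if os.path.exists(path):
                E[j][i] = extract(path)
    return E

# Tiny demonstration with two made-up files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for r, theta, energy in [(0.70, 70.0, -76.01), (0.75, 71.0, -76.02)]:
    name = 'H2O.r{:.2f}theta{:.1f}.out'.format(r, theta)
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write(' SCF Done:  E(RHF) =  {}     A.U. after 9 cycles\n'.format(energy))

grid = fill_energy_grid(demo_dir, [0.70, 0.75], [70.0, 71.0])
```

Missing files stay None in this sketch; with NumPy one might instead start from numpy.full(shape, numpy.nan) so that gaps show up as NaN in the surface plot.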

Related

Making a presence/absence matrix (y axis is file names, x axis is reads in file)

I have multiple files (filenames) with multiple sequence reads (Each has a readname that starts with >) in them:
Filename1
>Readname1
>Readname2
Filename2
>Readname1
>Readname3
Given a dictionary that contains all possible readnames like this:
g={}
g['Readname1']=[]
g['Readname2']=[]
g['Readname3']=[]
How could I write code that would iterate each file and generate the following matrix:
          Filename1 Filename2
Readname1 1         1
Readname2 1         0
Readname3 0         1
The code should scan the contents of each file in the directory. Ideally I could read the dictionary from an input file, rather than hard-coded, so I can generate matrices for different dictionaries. The content of each read (e.g. its gene sequence) is not relevant, just whether the readname is present or absent in that file.
I am just learning python, so a colleague shared their code to get me started. Here they were creating a presence/absence matrix of their dictionary (Readnames) in a single specified file (files.txt). I would like to input the dictionary from a second file (so that it's not static in the code) and to iterate over multiple files.
from Bio import SeqIO
import os
dir_path="" #directory path
files=os.listdir(path=dir_path)
with open(dir_path+'files.txt') as f:
    files=f.readlines()
files=[x.strip() for x in files]
g={}
g['Readname1']=[]
g['Readname2']=[]
g['Readname3']=[]
for i in files:
    a = list(SeqIO.parse(dir_path + i, 'fasta'))
    for j in a:
        g[j.id].append(i)
print('generating counts...')
counts={}
for i in g.keys():
    counts[i]=[]
for i in files:
    for j in g:
        if i in g[j]:
            counts[j].append(1)
        else:
            counts[j].append(0)
print('writing out...')
outfile=open(dir_path+'core_withLabels.csv','w')
outfile2=open(dir_path+'core_noLabels.csv','w')
temp_string=''
for i in files:
    outfile.write(','+i)
    temp_string=temp_string+i+','
temp_string=temp_string[:-1]
outfile2.write(temp_string+'\n')
outfile.write('\n')
for i in counts:
    outfile.write(i)
    temp_string=''
    for j in counts[i]:
        outfile.write(','+str(j))
        temp_string=temp_string+str(j)+','
    temp_string=temp_string[:-1]
    outfile2.write(temp_string+'\n')
    outfile.write('\n')
outfile.close()
outfile2.close()
By matrices, do you mean a numpy matrix or a List[List[int]]?
If you know the total number of readnames, a numpy matrix is the easy way to go. For a numpy matrix, create a zero matrix of the corresponding size.
import numpy as np
matrix = np.zeros((n_filenames, n_readnames), dtype=int)
Alternatively, define
matrix = [[] for _ in range(n_filenames)]
Also, define the map that maps readname to idx in the matrix
mapping = dict()
next_available_idx = 0
Then, iterate over all files, and fill out the corresponding entries with ones.
for i, filename in enumerate(filenames):
    with open(filename) as f:
        for line in f:
            if not line.startswith('>'):
                continue  # only header lines carry a readname
            readname = line[1:].strip()  # drop '>' and surrounding whitespace
            # find the corresponding column
            if readname in mapping:
                col_idx = mapping[readname]
            else:
                col_idx = next_available_idx
                next_available_idx += 1
                mapping[readname] = col_idx
            matrix[i, col_idx] = 1 # for numpy matrix
            """
            if you use list of lists, then:
            matrix[i] += [0] * (col_idx + 1 - len(matrix[i]))  # pad the row with zeros
            matrix[i][col_idx] = 1
            """
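Finally, if you use a list of lists, make sure that all rows end up with the same length: iterate over the rows of matrix one more time and pad the shorter ones with zeros. Note that this layout has one row per file and one column per readname, so transpose it if you want readnames as rows.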
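To make the steps above concrete, here is a small dependency-free sketch that builds the matrix with readnames as rows and files as columns, as in the question. The names read_names and presence_matrix are hypothetical, the inputs are made up, and the whole header line after '>' is assumed to be the readname.

```python
import csv
import io

def read_names(lines):
    """Collect readnames from FASTA-style header lines (those starting with '>')."""
    return {line[1:].strip() for line in lines if line.startswith('>')}

def presence_matrix(readnames, files):
    """files maps a label to an iterable of lines; returns one 0/1 row per readname."""
    present = {label: read_names(lines) for label, lines in files.items()}
    labels = list(files)
    rows = [[name] + [int(name in present[label]) for label in labels]
            for name in readnames]
    return labels, rows

# Made-up inputs mirroring the example in the question.
files = {
    'Filename1': ['>Readname1\n', 'ACGT\n', '>Readname2\n'],
    'Filename2': ['>Readname1\n', '>Readname3\n'],
}
wanted = ['Readname1', 'Readname2', 'Readname3']  # could be read from a text file

labels, rows = presence_matrix(wanted, files)

# Write the labelled matrix as CSV (here into memory instead of a file).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow([''] + labels)
writer.writerows(rows)
```

An open file handle is also an iterable of lines, so in real use the values of files could be handles opened from the directory listing, and wanted could be loaded from a second file with one readname per line.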

Error when trying to round values in an ndarray

I am working on a memory-based collaborative filtering algorithm. I am building a matrix that I want to write into CSV file that contains three columns: users, app and ratings.
fid = fopen('pred_ratings.csv','wt');
for i=1:user_num
    for j=1:item_num
        if R(j,i) == 1
            entry = Y(j,i);
        else
            entry = round(P(j,i));
        end
        fprintf(fid,'%d %d %d\n',i,j,entry);
    end
end
fclose(fid);
The above code is a MATLAB implementation of writing a multidimensional matrix into a file having 3 columns. I tried to imitate this in python, using:
n_users=816
n_items=17
f = open("guru.txt","w+")
for i in range(1,n_users):
    for j in range(1,n_items):
        if (i,j)==1 in a:
            entry = data_matrix(j, i)
        else:
            entry = round(user_prediction(j, i))
        print(f, '%d%d%d\n', i, j, entry)
f.close
But this results in the following error:
File "<ipython-input-198-7a444566e1ce>", line 7, in <module>
entry = round(user_prediction(j, i))
TypeError: 'numpy.ndarray' object is not callable
What can be done to fix this?
numpy uses square brackets for indexing; parentheses mean a function call, which is why you get "not callable". Since user_prediction is a numpy array, it should be indexed as
user_prediction[j, i]
The same goes for data_matrix.
You should probably read the Numpy for MATLAB users guide.
Edit:
Also, the
if (i,j)==1 in a:
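line is very dubious. (i, j) is a tuple of two integers, which means it will never be equal to 1. Because Python chains comparison operators, the condition is evaluated as (i, j) == 1 and 1 in a, which is always False, so the else branch runs every time. That is probably not what you want.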
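Putting these points together, a direct translation of the MATLAB loop might look like the sketch below. It is an illustration under assumptions, not the asker's final code: write_ratings is a hypothetical name, R, Y and P are taken from the MATLAB snippet, and the data is made up. The [j][i] form works for nested lists and NumPy arrays alike; with arrays, R[j, i] is the more idiomatic spelling.

```python
import io

def write_ratings(R, Y, P, out):
    """Write 'user item rating' lines, keeping known ratings and rounding predictions."""
    n_items = len(R)
    n_users = len(R[0])
    for i in range(n_users):          # 0-based, unlike MATLAB's 1:user_num
        for j in range(n_items):
            if R[j][i] == 1:
                entry = Y[j][i]
            else:
                entry = round(P[j][i])
            # Report 1-based ids so the file matches the MATLAB output.
            out.write('%d %d %d\n' % (i + 1, j + 1, entry))

# Made-up example with 2 items and 2 users.
R = [[1, 0], [0, 1]]
Y = [[5, 0], [0, 3]]
P = [[4.6, 2.2], [3.7, 1.1]]
buffer = io.StringIO()
write_ratings(R, Y, P, buffer)
```

Two differences from MATLAB are worth keeping in mind: range starts at 0, so range(1, n) would skip the first user or item, and Python 3's round sends exact ties to the nearest even integer, whereas MATLAB's round moves them away from zero.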

Can vtk InsertValue take float arguments?

I have a question about InsertValue
If I understand correctly, it only takes integer arguments. I was wondering if there is a way to have it take float values, or maybe some other function that does the job of InsertValue but takes float values? I know there is InsertNextValue, but I am not sure it will be efficient in my case since my array is very big (~ 100.000 by 120).
Below is my code and in my code I am making the entries of fl values integers to make it work for now but ideally it'll be great if I don't have to do that.
Thanks in advance :)
import vtk
import math
from vtk import vtkStructuredGrid, vtkPoints, vtkFloatArray, vtkXMLStructuredGridWriter
import scipy.io
import numpy
import os
#loading the matlab files
mats = scipy.io.loadmat('/home/lusine/data/3DDA/donut_for_vtk/20130228_050000_3D_E=1.mat')
#x,y,z coordinate, fl flux values
xx = mats['xvect']
yy = mats['yvect']
zz = mats['zvect']
fl = mats['fluxmesh3d'] #3d matrix
nx = xx.shape[1]
ny = yy.shape[1]
nz = zz.shape[1]
fl = numpy.nan_to_num(fl)
inx = numpy.nonzero(fl)
l = len(inx[1])
grid = vtk.vtkStructuredGrid()
grid.SetDimensions(nx,ny,nz) # sets the dimensions of the grid
pts = vtk.vtkPoints() # represents 3D points, The data model for vtkPoints is an array of vx-vy-vz triplets accessible by (point or cell) id.
pts.SetNumberOfPoints(nx*ny*nz) # Specify the number of points for this object to hold.
p=0
for i in range(l):
    pts.InsertPoint(p, xx[0][inx[0][i]], yy[0][inx[1][i]], zz[0][inx[2][i]])
    p = p + 1
SetPoint()
grid.SetPoints(pts)
cdata = vtk.vtkFloatArray()
cdata.SetNumberOfComponents(1)
cdata.SetNumberOfTuples((nx-1)*(ny-1)*(nz-1))
cdata.SetName('cellData')
p=0
for i in range(l-1):
    cdata.InsertValue(p,inx[0][i]+inx[1][i]+inx[2][i])
    p = p+1
grid.GetCellData().SetScalars(cdata)
pdata = vtk.vtkFloatArray()
pdata.SetNumberOfComponents(1)
#Get the number of tuples (a component group) in the array
pdata.SetNumberOfTuples(nx*ny*nz)
#Sets the array name
pdata.SetName('pointData')
for i in range(l):
    pdata.InsertValue(int(fl[inx[0][i]][inx[1][i]][inx[2][i]]), inx[0][i]+inx[1][i]+inx[2][i])
grid.GetPointData().SetScalars(pdata)
writer = vtk.vtkXMLStructuredGridWriter()
writer.SetFileName('new_grid.vts')
#writer.SetInput(grid)
writer.SetInputData(grid)
writer.Update()
print 'end'
The first argument of InsertValue must be an integer because it is the index at which the value is inserted; the value itself (the second argument) can be a float. If instead of a vtkFloatArray pdata you had a numpy array called p, this would be the equivalent of your instruction:
pdata.InsertValue(a,b) becomes p[a]=b
p[0.1] wouldn't make sense; a must be an integer!
But I am a bit lost on the data. What do you mean when you say your array is ~ 100.000 by 120? Do you have 100.000 points, and each point has a vector of 120 components? In that case, your pdata should have 120 components, and for each point point_index you call
pdata.SetTuple(point_index, [v0, v1, ..., v119])
or
pdata.SetComponent(point_index, 0, v0)
...
pdata.SetComponent(point_index, 119, v119)
If not, are you sure that you have to access pdata based on fl values? You would have to be sure that fl is an int, that 0 <= fl < ntuples, and that you are not going to have holes. Check whether you can do the same thing that you do for cdata (by the way, in your code p is always equal to i, so you can just use i).
It's also possible to copy a numpy array directly to vtk, see http://vtk.1045678.n5.nabble.com/vtk-to-numpy-how-to-get-a-vtk-array-tp1244891p1244895.html , but you have to be very careful with the structure of your data.
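The index/value distinction can be illustrated without VTK. The sketch below stores float values at integer positions computed from (i, j, k) grid coordinates; it assumes the x-fastest point ordering of VTK structured grids, and flat_index, values and the flux samples are hypothetical names and made-up data.

```python
def flat_index(i, j, k, nx, ny):
    """Integer point id for grid position (i, j, k), with x varying fastest."""
    return i + nx * (j + ny * k)

nx, ny, nz = 3, 2, 2
values = [0.0] * (nx * ny * nz)      # stand-in for the vtkFloatArray

# Made-up flux samples keyed by their (i, j, k) grid position.
flux = {(0, 0, 0): 1.5, (2, 1, 0): 0.25, (1, 0, 1): 7.75}
for (i, j, k), value in flux.items():
    values[flat_index(i, j, k, nx, ny)] = value   # integer index, float value
```

With the real array, the assignment inside the loop would correspond to pdata.InsertValue(index, value), i.e. the integer goes first and the float flux second, the reverse of the call in the question.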

Calculating and plotting a grow rate in years from a dictionary

I am trying to plot a graph from a CSV file with the following Python code:
import csv
import matplotlib.pyplot as plt
def population_dict(filename):
    """
    Reads the population from a CSV file, containing
    years in column 2 and population / 1000 in column 3.
    #param filename: the filename to read the data from
    #return dictionary containing year -> population
    """
    dictionary = {}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        f.next()
        for row in reader:
            dictionary[row[2]] = row[3]
    return dictionary
dict_for_plot = population_dict('population.csv')
def plot_dict(dict_for_plot):
    x_list = []
    y_list = []
    for data in dict_for_plot:
        x = data
        y = dict_for_plot[data]
        x_list.append(x)
        y_list.append(y)
    plt.plot(x_list, y_list, 'ro')
    plt.ylabel('population')
    plt.xlabel('year')
    plt.show()
plot_dict(dict_for_plot)
def grow_rate(data_dict):
    # fill lists
    growth_rates = []
    x_list = []
    y_list = []
    for data in data_dict:
        x = data
        y = data_dict[data]
        x_list.append(x)
        y_list.append(y)
    # calc grow_rate
    for i in range(0, len(y_list)-1):
        var = float(y_list[i+1]) - float(y_list[i])
        var = var/y_list[i]
        print var
        growth_rates.append(var)
    # growth_rate_dict = dict(zip(years, growth_rates))
grow_rate(dict_for_plot)
However, I'm getting a rather weird error on executing this code:
Traceback (most recent call last):
  File "/home/jharvard/Desktop/pyplot.py", line 71, in <module>
    grow_rate(dict_for_plot)
  File "/home/jharvard/Desktop/pyplot.py", line 64, in grow_rate
    var = var/y_list[i]
TypeError: unsupported operand type(s) for /: 'float' and 'str'
I've been trying different methods to cast the y_list variable, for example casting to int.
How can I solve this problem so that I can get the growth rate in percent through the years and plot it?
Since CSV files are text files, every field is read as a string, so you need to convert the values into numbers. It's easy to fix the type error. Just use
var/float(y_list[i])
Even though that gets rid of the type error, there is a subtler bug that may produce incorrect results under some circumstances. The main reason is that dictionaries in the Python version you are using are not ordered, i.e. the x and y values do not come back in any particular order. The indentation of your program appears to be a bit off on my computer, so I am unable to follow it exactly. But the gist of it appears to be that you are obtaining values from a file (x and y values) and then computing the sequence
var[i] = (y[i+1] - y[i]) / y[i]
Unfortunately, your y_list[i] may not be in the same sequence as in the CSV file because it is being populated from a dictionary.
In the section where you did:
for row in reader:
    dictionary[row[2]] = row[3]
it is better to preserve the order by doing
x, y = zip(*[ ( float(row[2]), float(row[3]) ) for row in reader])
x, y = map(numpy.array, [x, y])
return x, y
or something like this ...
Then, Numpy arrays have methods for handling your problem much more efficiently. You can then simply do:
growth_rates = numpy.diff(y) / y[:-1]
Hope this helps. Let me know if you have any questions.
Finally, if you do go the Numpy route, I would highly recommend its own csv reader. Check it out here: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
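For readers who want to see the whole fix in one place, here is a dependency-free sketch of the order-preserving approach. It is an illustration only: read_population and growth_rates are hypothetical names, the sample data is made up, and the year column is assumed to hold integers.

```python
import csv
import io

def read_population(handle):
    """Return (years, populations) in file order; columns 2 and 3 as in the question."""
    reader = csv.reader(handle)
    next(reader)                      # skip the header row
    years, populations = [], []
    for row in reader:
        years.append(int(row[2]))
        populations.append(float(row[3]))
    return years, populations

def growth_rates(populations):
    """Relative change between consecutive entries: (y[i+1] - y[i]) / y[i]."""
    return [(b - a) / a for a, b in zip(populations, populations[1:])]

# Made-up data; the first two columns are placeholders.
sample = io.StringIO(
    "id,country,year,population\n"
    "1,X,2000,100\n"
    "2,X,2001,110\n"
    "3,X,2002,121\n"
)
years, populations = read_population(sample)
rates = growth_rates(populations)
```

There is one fewer rate than there are years, so for plotting the rates would be paired with years[1:], and multiplied by 100 if percentages are wanted.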

Matrix manipulation in numpy

I wrote the code below:
import os
import csv
import numpy as np
ROOT_PATH = os.path.dirname(os.path.abspath(__file__)) # These two lines give the
path = os.path.join(ROOT_PATH, "0.dat") # path to a file on my disk
with open(path, 'r') as f1:
    listme = csv.reader(f1, delimiter="\t") # I imported the file
    listme2 = list(listme) # I used this command to make a matrix in the next line
m = np.matrix(listme2)
m2 = np.delete(m,[1,2],1) # I deleted two columns to get a 2 by 2 matrix
print m + m # It cannot add these two matrix. It also cannot multiply them by np.dot(m,m)
I cannot add the matrix I defined to itself. Please read the comment in the code.
The error returned is:
TypeError: unsupported operand type(s) for +: 'matrix' and 'matrix'
The problem is not the + operator; it's that m is a matrix of strings, not numbers. Convert listme2 to a list of numbers before you use it to build m. If you want to do it manually, use a list comprehension in place of list(listme):
listme2 = [[float(value) for value in line] for line in listme]
Or, when creating the matrix, specify the dtype:
m = np.matrix(listme2, dtype=float)
You can also use np.loadtxt or np.genfromtxt to get the 2D array directly, without open and csv.reader. Read the docs yourself ;)
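The effect of the conversion can be seen in a small sketch that avoids NumPy altogether. The helper names read_numeric_table and add_tables are hypothetical, and the tab-separated text is a made-up stand-in for the contents of 0.dat.

```python
import csv
import io

def read_numeric_table(handle, drop_columns=()):
    """Parse tab-separated text into rows of floats, skipping the given columns."""
    rows = []
    for line in csv.reader(handle, delimiter='\t'):
        rows.append([float(value) for idx, value in enumerate(line)
                     if idx not in drop_columns])
    return rows

def add_tables(a, b):
    """Element-wise sum of two equally shaped tables."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Made-up stand-in for the contents of 0.dat (2 rows, 4 tab-separated columns).
data = io.StringIO("1\t9\t9\t2\n3\t9\t9\t4\n")
m2 = read_numeric_table(data, drop_columns=(1, 2))
doubled = add_tables(m2, m2)
strings_added = '1' + '1'            # what "adding" unconverted fields would do
```

The last line shows why the conversion matters: csv.reader always yields strings, and adding strings concatenates them instead of doing arithmetic.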
