Adding items to an array in a loop - python

I am trying to add a row to a numpy.array within a loop, and it isn't working, although I don't get any error. My general aim is to compare two files and create a third file summarizing the comparison.
In IPython:
import numpy as np
# my arrays
aList1=np.array([['A','we'],['A','we'],['B','we'],['C','de']])
aList2=np.array([['A'],['B'],['D']])
aResult=np.array(['row1','occurence'])
# my function
def coverageA(array,file1,name1,colum1,file2,name2,colum2):
    x=file1[1:,colum1]
    y=file2[1:,colum2]
    for f in x:
        if f in y:
            array=np.vstack((array,np.array([f,'shared'])))
        else:
            array=np.vstack((array,np.array([f,name1])))
    for f in y:
        if f not in x:
            array=np.vstack((array,np.array([f,name2])))
    return
and I use it this way:
coverageA(aResult,aList1,'list1',0,aList2,'list2',0)
but aResult didn't change:
print(aResult)
# output: ['row1' 'occurence']
wanted
([['row1','occurence'],['A', 'shared'],['B', 'shared'],['C','list1'],['D','list2']])

repaired:
import numpy as np
#my arrays
aList1=np.array([['A','we'],['A','we'],['B','we'],['C','de']])
aList2=np.array([['A'],['B'],['D']])
aResult=np.array(['row1','occurence'])
#my function
def coverageA(array,file1,name1,colum1,file2,name2,colum2):
    x=file1[1:,colum1]
    y=file2[1:,colum2]
    for f in x:
        if f in y:
            array=np.vstack((array,np.array([f,'shared'])))
        else:
            array=np.vstack((array,np.array([f,name1])))
    for f in y:
        if f not in x:
            array=np.vstack((array,np.array([f,name2])))
    print(array)
    return array
#and use it this way
aResult=coverageA(aResult,aList1,'list1',0,aList2,'list2',0)
#now aResult is rebound to the returned, stacked array
print(aResult)
The explanation is that in Python, arguments are passed by assignment, which is explained nicely here. In the line array=np.vstack((array,np.array([f,'shared']))) a new numpy array is created at a new position in memory (the local name array now points to it), but aResult still points to its old position. You can check the memory addresses with print(id(array)).
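To see the rebinding in action, here is a minimal sketch; the rebind function and its arrays are made up for illustration:
import numpy as np

def rebind(arr):
    print(id(arr))                   # same id as the caller's array
    arr = np.vstack((arr, [1, 2]))   # rebinding: arr now names a brand-new array
    print(id(arr))                   # a different id; the caller never sees it

a = np.array([[0, 0]])
rebind(a)
print(a)  # still [[0 0]]; the local rebinding did not affect a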

Related

Use Sympy solver on a list of equations that contain multiple variables

I need to solve for a list (because there are two values for each variable, zipped in the proper order) of exmax, eymax, exymax and Nxmax, given that each of these variables is some combination of the others.
My issue is that the result comes back as a FiniteSet, which won't let me iterate properly.
import math
import numpy as np
from astropy.table import QTable, Table, Column
from collections import Counter
import operator
from sympy import *
exmax= symbols('exmax')
eymax= symbols('eymax')
exymax= symbols('exymax')
Nxmax=symbols('Nxmax')
Stiffnessofplies=[1,1]  # not the actual value, but it is important that this has a length of two for later on
Nxmax=[78.4613527541947*exmax + 8.06201746514537e-15*exymax + 4.07395485454472*eymax,
69.4081197440953*exmax + 1.35798495151491*eymax]
exmax= [{(-1.0275144618526e-16*exymax - 0.0519230769230769*eymax,)},
{(-0.0195652173913043*eymax,)}]
eymax = [{(-0.0284210526315789*exmax + 8.11515424209734e-19*exymax,)},
{(-0.299999999999999*exmax,)}]
exymax = [{(-7.78938885521292e-17*exmax + 1.12391245013323e-18*eymax,)}, {(0,)}]
exmax2=[]
for i in exmax:
    for j in i:
        exmax2.append(j)
eymax2=[]
for i in eymax:
    for j in i:
        eymax2.append(j)
exymax2=[]
for i in exymax:
    for j in i:
        exymax2.append(j)
I wrote these last three loops to try to flatten everything out and make it iterable. Here are other things I have tried:
#Pleasework=[]
#for i in range(0,len(Stiffnessofplies)):
#    linsolve([exmax2[i]], [eymax2[i]], [exymax2[i]], [Nxmax[i]], (exmax, eymax, exymax, Nxmax))
#System= exmax2[0],eymax2[0],exymax2[0]
#linsolve(System, exmax,eymax,exymax,Nxmax)
#Masterlist=list(zip(exmax,eymax,exymax,Nxmax))
I think one of my main issues is the type I'm getting back: a FiniteSet really doesn't work well when trying to iterate over both values in the list.
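For the iteration problem specifically, here is a minimal sketch of unpacking a solver-style FiniteSet of tuples into a plain list; the symbols and the set are made up for illustration:
from sympy import FiniteSet, symbols

x, y = symbols('x y')
s = FiniteSet((x + y,))  # linsolve-style result: a FiniteSet containing one tuple

# each element of the set is a tuple of solution expressions;
# flattening into a list makes the solutions indexable
solutions = [expr for tup in s for expr in tup]
print(solutions)  # [x + y]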

Error when trying to round values in an ndarray

I am working on a memory-based collaborative filtering algorithm. I am building a matrix that I want to write into a CSV file with three columns: user, app and rating.
fid = fopen('pred_ratings.csv','wt');
for i=1:user_num
    for j=1:item_num
        if R(j,i) == 1
            entry = Y(j,i);
        else
            entry = round(P(j,i));
        end
        fprintf(fid,'%d %d %d\n',i,j,entry);
    end
end
fclose(fid);
The above is a MATLAB implementation of writing a matrix into a file with 3 columns. I tried to imitate it in Python, using:
n_users=816
n_items=17
f = open("guru.txt","w+")
for i in range(1,n_users):
    for j in range(1,n_items):
        if (i,j)==1 in a:
            entry = data_matrix(j, i)
        else:
            entry = round(user_prediction(j, i))
        print(f, '%d%d%d\n', i, j, entry)
f.close
But this results in the following error:
File "<ipython-input-198-7a444566e1ce>", line 7, in <module>
entry = round(user_prediction(j, i))
TypeError: 'numpy.ndarray' object is not callable
What can be done to fix this?
NumPy uses square brackets for indexing. Since user_prediction is a numpy array, it should be indexed as
user_prediction[j, i]
The same goes for data_matrix.
You should probably read the Numpy for MATLAB users guide.
Edit:
Also, the line
if (i,j)==1 in a:
is very dubious. Because of Python's comparison chaining, it evaluates as (i, j) == 1 and 1 in a; since a tuple of two integers is never equal to 1, the whole condition is always False, which is probably not what you want.
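Putting both fixes together, a minimal sketch of the loop might look like this; the shapes and the meaning of R, Y and P are assumed from the MATLAB version, and f.write replaces the misused print:
import numpy as np

n_users, n_items = 816, 17
# hypothetical stand-ins mirroring the MATLAB matrices
R = np.zeros((n_items, n_users))           # 1 where a real rating exists
Y = np.zeros((n_items, n_users))           # observed ratings
P = np.random.rand(n_items, n_users) * 5   # predicted ratings

with open("guru.txt", "w") as f:
    for i in range(n_users):               # 0-based, unlike MATLAB's 1:user_num
        for j in range(n_items):
            if R[j, i] == 1:
                entry = Y[j, i]
            else:
                entry = round(P[j, i])
            f.write('%d %d %d\n' % (i, j, entry))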

python sparse matrix creation parallelize to speed up

I am creating a sparse matrix file by extracting features from an input file. Each row of the input file contains one film ID followed by some feature IDs and each feature's score:
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each pair, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json          # used below for json.dump
import numpy as np
import tables as tb  # PyTables, used below as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets and takes too long to run.
How can I improve and accelerate it, maybe using MapReduce? What in this function makes it so slow?
I/O + conversions (from str, to str, even twice to str for the same variable) + splits + explicit loops are what make it slow. By the way, there is a csv module in the Python standard library that could be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see you convert element[0] to int/str repeatedly, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). You could also try another style, with map or list comprehensions, but experiments are needed.
The general idea of Python code optimization is to avoid explicit Python bytecode execution and to prefer native/C Python functions for everything, and to cut down on all those conversions. Also, if the input file format is under your control, you can use fixed-length fields, which lets you avoid splitting/parsing entirely (only string indexing).
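As one concrete direction, here is a minimal sketch that collects all entries first and builds the matrix in one call; the file name and format are taken from the question, but feature IDs are used directly as column indices, unlike the question's index_selected_feature mapping:
import numpy as np
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
film_index = {}
with open('input_file.txt') as fh:
    for row, line in enumerate(fh):
        tokens = line.split()
        film_index[tokens[0]] = row            # film ID -> row number
        for pair in tokens[1:]:
            feat, score = pair.split(':')
            rows.append(row)
            cols.append(int(feat))
            vals.append(float(score))

# one bulk construction instead of many single-element assignments
matrix = coo_matrix((vals, (rows, cols))).tocsr()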

How to use structured arrays in Python?

I'm writing this question because I'm not sure about the use of "structured arrays". I build a matrix from keyboard input with different types (integer, float, etc.) using a "dtype". Then I want to find repeated elements in the columns "p" and "q"; when I have these elements, I want to sum the corresponding elements from column "z". Thanks. This is my Python code:
from numpy import *
from math import *
from cmath import *
from numpy.linalg import *

number_lines_=raw_input("Lines:")
numero_lines=int(number_lines_)
ceros=zeros((numero_lines,1))
dtype=([('p',int),('q',int),('r',float),('x',float),('b',complex),('z',complex),('y',complex)])
#print dtype
leng=len(dtype)
#print leng
yinfo=array(ceros,dtype)
#print shape(yinfo)
if numero_lines>0:
    for i in range(numero_lines):
        p_=raw_input("P: ")
        p=int(p_)
        if p>0:
            yinfo['p'][i]=p
            #print yinfo
        q_=raw_input("Q: ")
        q=int(q_)
        if q>0 and q!=p:
            yinfo['q'][i]=q
        r_=raw_input("R: ")
        r=float(r_)
        yinfo['r'][i]=r
        x_=raw_input("X: ")
        x=float(x_)
        yinfo['x'][i]=x
        b_=raw_input("b:")
        b=complex(b_)
        yinfo['b'][i]=complex(0,b)
        yinfo['z'][i]=complex(yinfo['r'][i],yinfo['x'][i])
        yinfo['y'][i]=1./(yinfo['z'][i])
        # print "\n"
print yinfo
print type(yinfo)
print yinfo.shape
Let me suggest some changes:
import numpy as np # not *
....
numero_lines=int(number_lines_)
...
# don't use numpy function names as variable names
# even the np import
dt=np.dtype([('p',int),('q',int),('r',float),('x',float),('b',complex),('z',complex),('y',complex)])
...
yinfo=np.zeros((numero_lines,),dtype=dt)
# make a zero filled array directly
# also make it 1d, I don't think 2nd dimension helps you
# if numero_lines>0: omit this test; range(0) is empty
# link each of the fields to a variable name
# changing a value of yinfo_p will change a value in yinfo
yinfo_p=yinfo['p']
yinfo_q=yinfo['q']
# etc
for i in range(numero_lines):
    p_=raw_input("P: ")
    p=int(p_)
    if p>0:
        yinfo_p[i]=p
        #print yinfo
    q_=raw_input("Q: ")
    q=int(q_)
    if q>0 and q!=p:
        yinfo_q[i]=q
    r_=raw_input("R: ")
    r=float(r_)
    yinfo_r[i]=r
    x_=raw_input("X: ")
    x=float(x_)
    yinfo_x[i]=x
    b_=raw_input("b:")
    b=complex(b_)
    yinfo_b[i]=complex(0,b)

# don't need to fill in these values row by row;
# perform array operations after you are done with the input loop
yinfo_z[:] = yinfo_r + 1j*yinfo_x
yinfo_y[:] = 1./yinfo_z
Alternatively I could have defined
yinfo_p = np.zeros((numero_lines,), dtype=int)
yinfo_q = ...
After filling in all the yinfo_* values I could assemble them into a structured array - if that's what I need for other uses.
yinfo['p'] = yinfo_p
etc.
This isn't a very good use of structured arrays; as I tried to show with the yinfo_* variables, you are using each field as though it were a separate 1d array.
I haven't actually run these changes, so there might be some errors, but hopefully they will give you ideas that can improve your code.
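As for the stated goal of summing 'z' over repeated ('p','q') pairs, here is a minimal sketch using np.unique; the field names match the question's dtype, but the sample values are made up:
import numpy as np

dt = np.dtype([('p', int), ('q', int), ('z', complex)])
yinfo = np.array([(1, 2, 1 + 1j), (1, 2, 2 + 0j), (3, 4, 5j)], dtype=dt)

pairs = np.stack([yinfo['p'], yinfo['q']], axis=1)   # (n, 2) key array
keys, inverse = np.unique(pairs, axis=0, return_inverse=True)
sums = np.zeros(len(keys), dtype=complex)
np.add.at(sums, inverse, yinfo['z'])                 # accumulate z per key
print(keys)   # unique (p, q) pairs
print(sums)   # summed z for each pair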

Why does scan upcast?

This code to calculate the trace of a matrix (based on an example in the Theano "loop" tutorial) works fine:
import numpy as np
import theano as th
import theano.tensor as T
floatX = 'float32'
X = T.matrix()
results = th.scan(lambda i, j, t_f: T.cast(X[i, j] + t_f, floatX),
                  sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])],
                  outputs_info=np.asarray(0., dtype=floatX))[0]
result = results[-1]
compute_trace = th.function([X], result)
x = np.eye(5, dtype=floatX)
x[0] = np.arange(5, dtype=floatX)
print compute_trace(x)
But if I remove the cast operation from the lambda function like this:
lambda i,j,t_f : X[i,j] + t_f
The following error message is produced:
ValueError: When compiling the inner function of scan the following error has been encountered: The initial state (outputs_info in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 2) has dtype float32, while the result of the inner function (fn) has dtype float64. This can happen if the inner function of scan results in an upcast or downcast.
Why so? X and outputs_info are explicitly set to float32. How does the result of adding them get to be float64?
This is a very late answer, but we're working on a fork of Theano called Aesara, and, since people still run into problems like this, it seems worthwhile to provide a public explanation.
That said, the issue is X = T.matrix(). T.matrix creates a float64 matrix when theano.config.floatX == "float64" (the default), so the sum in the body of scan's inner function is upcast to float64.
If X = T.fmatrix() is used, a float32 matrix is created instead and the problem is no longer present; otherwise, as mentioned in the comments, one can also set theano.config.floatX to "float32".
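For completeness, a minimal sketch of the fixed version, following the answer's reasoning (same code as the question, with T.fmatrix() and no cast):
import numpy as np
import theano as th
import theano.tensor as T

floatX = 'float32'
X = T.fmatrix()  # float32 regardless of theano.config.floatX
results = th.scan(lambda i, j, t_f: X[i, j] + t_f,  # no T.cast needed now
                  sequences=[T.arange(X.shape[0]), T.arange(X.shape[1])],
                  outputs_info=np.asarray(0., dtype=floatX))[0]
compute_trace = th.function([X], results[-1])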
