I have written this code to analyze geological coordinates and search for data points in close proximity. Because I have so many data points, the output in PyCharm was becoming overloaded and unreadable, so I have been trying to solve this by writing the True/False results to separate files on my computer.
The point of this code is to compare each coordinate in file1 against every element in file2 and return any pairs of coordinates that lie within the proximity threshold. As you will see below, I wrote a nested for loop to do this, which I understand is a brute-force tactic, so if anybody has a more elegant solution then I would be happy to learn more.
import numpy as np
import math as ma

filename1 = "C:\Users\Justin\Desktop\file1.data"
data1 = np.genfromtxt(filename1,
                      skip_header=1,
                      usecols=(0, 1))
                      #dtype=[
                      #("x1", "f9"),
                      #("y1", "f9")])
#print "data1", data1

filename2 = "C:\Users\Justin\Desktop\file2.data"
data2 = np.genfromtxt(filename2,
                      skip_header=1,
                      usecols=(0, 1))
                      #dtype=[
                      #("x2", "f9"),
                      #("y2", "f9")])
#print "data2", data2

def d(a, b):
    d = ma.acos(ma.sin(ma.radians(a[1]))*ma.sin(ma.radians(b[1]))
                + ma.cos(ma.radians(a[1]))*ma.cos(ma.radians(b[1]))*(ma.cos(ma.radians((a[0]-b[0])))))
    return d

results = open("results.txt", "w")

for coor1 in data1:
    for coor2 in data2:
        n = 0
        a = [coor1[0], coor1[1]]
        b = [coor2[0], coor2[1]]
        #print "a", a
        #print "b", b

        if d(a, b) < 0.07865:  # if true what happens
            results.write("\t".join([str(coor1), str(coor2), "True", str(d)]) + "\n")
        else:
            results.write("\t".join([str(coor1), str(coor2), "False", str(d)]) + "\n")

        results.close()
This is the error message I get when I run the code:
results.write("\t".join([str(coor1), str(coor2), "False", str(d)]) + "\n")
ValueError: I/O operation on closed file
I think my problem is that I don't understand how I am supposed to write, save and organize the files in a meaningful format on my computer. So, again, if anybody has any advice or suggestions I would be very grateful for the support!
My suggestion: take your code that writes to the file and wrap it in a context manager. E.g. https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
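For instance, a minimal sketch of the writing loop inside a with block, reusing the data1, data2 and d() already defined in your script (it also calls d(a, b) once per pair and writes that number, rather than str(d), which is the function object):

with open("results.txt", "w") as results:
    for coor1 in data1:
        for coor2 in data2:
            a = [coor1[0], coor1[1]]
            b = [coor2[0], coor2[1]]
            dist = d(a, b)              # compute the distance once per pair
            match = dist < 0.07865      # True/False proximity flag
            results.write("\t".join([str(coor1), str(coor2), str(match), str(dist)]) + "\n")
# the file is closed automatically when the with block ends, so no results.close() is needed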
I am a newbie with multiprocessing and I am using that library in Python to parallelize the computation of a parameter for the rows of a dataframe.
The idea is the following:
I have two functions: g for the actual computation and f for filling the dataframe with the computed values. I call the function f with pool.apply_async. The problem is that at the end of pool.apply_async the dataframe has not been updated, even though a print inside f clearly shows that it is saving the values correctly. So I thought of saving the results to a file inside the f function, as shown in my pseudo code below. However, what I get is that the file where I save the results stops being updated after 2 values, and the kernel keeps running even though the terminal shows that the script has computed all the values.
This is my pseudo code:
def g(path_to_image1, path_to_image2):
    # vectorize images
    # does computation
    return value  # value is a float

def f(row, index):
    value = g(row.image1, row.image2)
    df.at[index, 'value'] = value
    df.to_csv('dftest.csv')
    return df

def callbackf(result):
    global results
    results.append(result)

Inside the main:

results = []
pool = mp.Pool(N_CORES)
for index, row in df.iterrows():
    pool.apply_async(f,
                     args=(row, index),
                     callback=callbackf)
I tried to use with get_context("spawn").Pool() as pool inside the main, as suggested by https://pythonspeed.com/articles/python-multiprocessing/, but it didn't solve my problem. What am I doing wrong? Is it possible that vectorizing the images for each row causes problems for the multiprocessing?
In the end I saved the results to a txt file instead of a csv and it worked. I don't know why it didn't work with csv, though.
Here's the code I put instead of the csv and pickle lines:
with open('results.txt', 'a') as f:
    f.write(image1 +
            '\t' + image2 +
            '\t' + str(value) +
            '\n')
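For reference, a minimal sketch of an alternative pattern that avoids writing from the worker processes entirely: have f return (index, value) pairs and let the parent process update the dataframe once the pool has finished. The dataframe, the pool size and g are replaced by toy stand-ins here, so this only illustrates the pattern, not a drop-in fix:

import multiprocessing as mp
import pandas as pd

def g(image1, image2):
    # placeholder for the real computation on the two images
    return float(len(str(image1)) + len(str(image2)))

def f(row, index):
    value = g(row.image1, row.image2)
    return index, value              # return the result instead of mutating df in the worker

if __name__ == '__main__':
    df = pd.DataFrame({'image1': ['a.png', 'b.png'],
                       'image2': ['c.png', 'd.png']})   # toy stand-in for the real dataframe
    results = []
    with mp.get_context("spawn").Pool(2) as pool:
        for index, row in df.iterrows():
            pool.apply_async(f, args=(row, index), callback=results.append)
        pool.close()
        pool.join()
    for index, value in results:     # apply the results in the parent process
        df.at[index, 'value'] = value
    df.to_csv('dftest.csv')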
In Python, using the OpenCV library, I need to create some polylines. The example code for the polylines method shows:
cv2.polylines(img,[pts],True,(0,255,255))
I have all the 'pts' laid out in a text file in the format:
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
How can I read this file and provide the data to the [pts] variable in the method call?
I've tried the np.array(csv.reader(...)) method as well as a few others I've found examples of. I can successfully read the file, but it's not in the format the polylines method wants. (I am a newbie when it comes to Python; if this were C++ or Java, it wouldn't be a problem.)
I would try to use numpy to read the csv as an array.
from numpy import genfromtxt
p = genfromtxt('myfile.csv', delimiter=',')
cv2.polylines(img,p,True,(0,255,255))
You may have to pass a dtype argument to genfromtxt if you need to coerce the data to a specific format.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
In case you know there is a fixed number of items in each row:
import csv

with open('myfile.csv') as csvfile:
    rows = csv.reader(csvfile)
    res = list(zip(*rows))

print(res)
I know it's not pretty and there is probably a MUCH BETTER way to do this, but it works. That being said, if someone could show me a better way, it would be much appreciated.
pointlist = []
f = open(args["slots"])
data = f.read().split()

for row in data:
    tmp = []
    col = row.split(";")
    for points in col:
        xy = points.split(",")
        tmp += [[int(pt) for pt in xy]]
    pointlist += [tmp]

slots = np.asarray(pointlist)
You might need to draw each polyline individually (to expand on @Chris's answer):
from numpy import genfromtxt

lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    cv2.polylines(img, [line.reshape((-1, 2)).astype('int32')], True, (0, 255, 255))
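Putting the two answers together, a rough end-to-end sketch (the file name and the blank canvas are placeholders; adjust them to your own data):

import cv2
import numpy as np

# hypothetical blank canvas; substitute your own image
img = np.zeros((500, 500, 3), dtype=np.uint8)

# each row of the file is x1,y1,x2,y2,x3,y3,x4,y4
rows = np.atleast_2d(np.genfromtxt('myfile.csv', delimiter=','))

# polylines wants a list of integer (N, 2) point arrays
pts = [row.reshape((-1, 2)).astype(np.int32) for row in rows]

cv2.polylines(img, pts, True, (0, 255, 255))
cv2.imwrite('polylines.png', img)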
I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)

    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1

    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets and takes too long to run.
How can I improve and accelerate it, maybe using MapReduce? What is wrong with this function that makes it so slow?
The slowness comes from I/O plus conversions (from str, to str, even calling str twice on the same variable), splits, and explicit loops. By the way, there is a csv module in Python which can be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see you convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed.
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything), and to cut down on all those conversions. Also, if the input file is yours, you can format it with fixed-length fields; that lets you avoid split/parse entirely (only string indexing).
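As a rough illustration of the "fewer conversions, one split per line" idea (the names selected_features and index_selected_feature are borrowed from the question; here selected_features is assumed to hold feature IDs as strings, so no int()/str() round trips are needed):

def parse_line(line, selected_features, index_selected_feature):
    # split once, then parse every feature:score pair in a single comprehension
    tokens = line.split()
    film_id = tokens[0]
    pairs = (tok.split(':') for tok in tokens[1:])
    return film_id, [(index_selected_feature[fid], float(score))
                     for fid, score in pairs
                     if fid in selected_features]

# e.g. parse_line("6729792 4:0.15568 8:0.198796", {"4", "8"}, {"4": 0, "8": 1})
# -> ("6729792", [(0, 0.15568), (1, 0.198796)])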
I have a very weird issue here. I have 2 functions: one which reads an HDF5 file created using h5py and one which creates a new HDF5 file which concatenates the content returned by the former function.
def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        group1 = hf.get('group1')
        group2 = hf.get('group2')
        dataset1 = hf.get('dataset1')
        dataset2 = hf.get('dataset2')
        print group1.attrs['w']  # Works here
    return dataset1, dataset2, group1, group2
And the create file function
def create_chunk(start_index, end_index):
    for i in range(start_index, end_index):
        if i == start_index:
            mergedhf = h5py.File("output.hdf5", 'w')
            mergedhf.create_dataset("dataset1", dtype='float64')
            mergedhf.create_dataset("dataset2", dtype='float64')
            g1 = mergedhf.create_group('group1')
            g2 = mergedhf.create_group('group2')

    rd1, rd2, rg1, rg2 = read_file(filename)

    print rg1.attrs['w']  # gives me <Closed HDF5 group> message

    g1.attrs['w'] = "content"
    g1.attrs['x'] = "content"
    g2.attrs['y'] = "content"
    g2.attrs['z'] = "content"

    print g1.attrs['w']  # Works Here

    return mergedhf.get('dataset1'), mergedhf.get('dataset2'), g1, g2
def calling_function():
    wd1, wd2, wg1, wg2 = create_chunk(start_index, end_index)
    print wg1.attrs['w']  # Works here as well
Now the problem is that I can access the datasets and attribute data of the newly created file, represented by wd1, wd2, wg1 and wg2, but I can't do the same for the objects I have read from the existing file and returned.
Can anyone help me fetch the values of the datasets and groups once I have returned the references to the calling function?
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
There are (at least) two ways to fix this. Firstly, you can open the file like you do in create_chunk:
hf = h5py.File(filename+".hdf5", 'r')
and keep the reference to hf around as long as you need it, before closing it:
hf.close()
The other way is to copy the data from the datasets in read_file and return those instead:
dataset1 = hf.get('dataset1')[:]
dataset2 = hf.get('dataset2')[:]
Note that you can't do this with the groups. The file needs to be open for as long as you need to do things with the groups.
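(If the attribute values are all you need from the groups, copying those into plain dicts works the same way.) For example, a minimal sketch of read_file rewritten to return in-memory copies; the dataset and group names follow the question:

import h5py

def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        dataset1 = hf['dataset1'][:]              # NumPy copies of the datasets
        dataset2 = hf['dataset2'][:]
        group1_attrs = dict(hf['group1'].attrs)   # plain-dict copies of the attributes
        group2_attrs = dict(hf['group2'].attrs)
    # everything returned lives in memory, so closing the file is safe
    return dataset1, dataset2, group1_attrs, group2_attrs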
Adding to #Yossarian's answer
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
For those who come across this and are reading a scalar dataset, make sure to index it using [()]:
scalar_dataset1 = hf['scalar_dataset1'][()]
Preface
I had a similar issue to the OP's, resulting in a return value of <closed hdf5 dataset>. However, I would get a ValueError when attempting to slice my scalar dataset with [:].
"ValueError: Illegal slicing argument for scalar dataspace"
Indexing with [()] along with #Yossarian's answer helped solve my problem.
I am a beginner in Python (and in programming). I have a large file containing a repeating pattern of 3 lines with numbers, then 1 empty line, and so on...
If I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on them, write the result to a new file, and then move on to the next three lines (the next vector). So here is my code (it doesn't work):
import math

inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get "vector" from each 3 lines in my file and put it into blabla.txt output file. Thanks a lot!
My 'code comment' answer:
take care to close all parentheses, in order to match the opened ones! (this is very likely to raise a SyntaxError ;-) )
fin is created as an empty list and is never really filled. Trying to get any value with fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are used but never created; this is very likely to break with a NameError;
b is created as a copy of a, so it is a list. But then you do fin.append(b): what do you expect in this case from appending (not extending) a list?
Hope this helps!
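Putting those comments together, a minimal sketch of one way the loop could look. This assumes the intent is to normalise each block of three numbers with the sqrt(...)+1 factor from your formula; the file names are taken from your code:

import math

with open("data.txt") as inF, open("blabla.txt", "w") as outF:
    block = []
    for line in inF:
        stripped = line.strip()
        if stripped:                      # collect the numbers of the current block
            block.append(float(stripped))
        if len(block) == 3:               # one full vector read: normalise and write it
            h1, k1, l1 = block
            norm = math.sqrt(h1 * h1 + k1 * k1 + l1 * l1) + 1
            outF.write('\n'.join(str(v / norm) for v in block) + '\n')
            block = []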
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing about vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.