Writing a file in Python - python

There is a file format called .xyz that helps visualizing molecular bonds. Basically the format asks for a specific pattern:
At the first line there must be the number of atoms, which in my case is 30.
After that there should be the data where the first line is the name of the atom, in my case they are all carbon. The second line is the x information and the third line is the y information and the last line is the z information which are all 0 in my case. So something like this:
30
C x1 y1 z1
C x2 y2 z2
...
...
...
I generated my data in C++ into a text file like this:
C 2.99996 7.31001e-05 0
C 2.93478 0.623697 0
C 2.74092 1.22011 0
C 2.42702 1.76343 0
C 2.0079 2.22961 0
C 1.50006 2.59812 0
C 0.927076 2.8532 0
C 0.313848 2.98349 0
C -0.313623 2.9837 0
C -0.927229 2.85319 0
C -1.5003 2.5981 0
C -2.00732 2.22951 0
C -2.42686 1.76331 0
C -2.74119 1.22029 0
C -2.93437 0.623802 0
C -2.99992 -5.5509e-05 0
C -2.93416 -0.623574 0
C -2.7409 -1.22022 0
C -2.42726 -1.7634 0
C -2.00723 -2.22941 0
C -1.49985 -2.59809 0
C -0.92683 -2.85314 0
C -0.313899 -2.98358 0
C 0.31363 -2.98356 0
C 0.927096 -2.85308 0
C 1.50005 -2.59792 0
C 2.00734 -2.22953 0
C 2.4273 -1.76339 0
C 2.74031 -1.22035 0
C 2.93441 -0.623647 0
So, now what I'm trying to do is that I want to write this file into a .xyz file. I saw online that people do it with Python in which I almost have no experience. So I checked around and came up with this script:
#!/usr/bin/env/python
text_file = open("output.txt","r")
lines = text_file.readlines()
myfile = open("output.xyz","w")
for line in lines:
atom, x, y, z = line.split()
myfile.write("%s\t %d\t %d\t %d\t" %(atom,x,y,z))
myfile.close()
text_file.close()
However when I run this, it gives the following error: "%d format: a number is required, not str."
It doesn't make sense to me, since as you can see in txt file, they are all numbers apart from the first line. I tried changing my d's into s's but then the program I'll load this data into gave an error.
To summarize:
I have a data file in .txt, I want to change it into .xyz that's been specified but I am running into problems.
Thanks in advance.

A string can represent a number as well. In programming languages, this is called a type. '1' and 1 have different types. Use %s instead for strings.
myfile.write("%s\t %s\t %s\t %s\t" % (atom, x, y, z))
If you want them to be floats, you should do this during the parsing stage:
x, y, z = map(float, (x, y, z))
And btw, % is considered obsolete in python. Please use format instead:
myfile.write("{}\t {}\t {}\t {}\t".format(atom,x,y,z))

Maybe the problem you've faced was because of "\t" in the answer (tab).
The .xyz file uses only spaces to separate data from the same line, as stated here. You could use only one space if you wanted, but to have an easily readable format, like other programs use to format when saving .xyz files, it's better to use the tips from https://pyformat.info/
My working code (in Python 3) to generate .xyz files, using objects of molecules and atoms from the CSD Python API library, is this one, that you can adapt to your reality:
with open("file.xyz", 'w') as xyz_file:
xyz_file.write("%d\n%s\n" % (len(molecule.atoms), title))
for atom in molecule.atoms:
xyz_file.write("{:4} {:11.6f} {:11.6f} {:11.6f}\n".format(
atom.atomic_symbol, atom.coordinates.x, atom.coordinates.y, atom.coordinates.z))
The first two lines are the number of atoms and the title for the xyz file. The other lines are the atoms (atomic symbol and 3D position).
So, the atomic symbol has 4 spaces for it, aligned to the left: {:4}
Then, this happens 3 times: {:11.6f}
That means one space followed by the next coordinate, that uses 11 characters aligned to the right, where 6 are after the decimal point. It's sufficient for numbers from -999.999999 to 9999.999999, that use to be sufficient. Numbers out of this interval only break the format, but keep the mandatory space between data, so the xyz file still works on those cases.
The result is like this:
18
WERPOY
N 0.655321 3.658330 14.594159
C 1.174111 4.551873 13.561374
C 0.703656 3.889113 15.926147
S 1.455530 5.313258 16.574524
C 1.127601 5.061435 18.321297
N 0.146377 2.914698 16.639984
C -0.288067 2.014580 15.736297
C 0.014111 2.441298 14.475693
N -0.266880 1.787085 13.260652
O -0.831165 0.699580 13.319885
O 0.056329 2.322290 12.209641
H 0.439235 4.780025 12.970561
H 1.821597 4.148825 13.137629
H 1.519448 5.312600 13.775525
H 1.522843 5.786000 18.590124
H 0.171557 5.069325 18.423056
H 1.477689 4.168550 18.574936
H -0.665073 1.282125 16.053727

Related

Combining and tabulating several blocks of text

The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
for n, line in enumerate(f):
if (n >= start_section) and (n <= end_section):
if 'eigenvalues' in line:
eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
So attempting several variants of this, and in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
for n, line in enumerate(f):
if (n >= start_section) and (n <= end_section):
for i in range(1, number_of_coefficients+1):
if str(i) in line.split()[0]:
if line.split()[1].isdigit() == False:
if line.split()[3] in ['s', 'px', 'py', 'pz']:
dir[str(i)].append(line.split()[4:])
else:
dir[str(i)].append(line.split()[3:])
Which seemed to get me close, however, I got a strange duplication of numbers in random orders.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing from, s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys
def get_section(fh):
# Takes an open file handle.
# Yields each section of lines having coefficients.
lines = []
start = False
for line in fh:
if 'eigenvalues' in line:
start = True
if lines:
yield lines
lines = []
elif start:
lines.append(line)
if 'px' in line:
start = False
if lines:
yield lines
def main():
coeffs = []
with open(sys.argv[1]) as fh:
for sect in get_section(fh):
# Grab the rows from a section.
rows = [
[float(c) for c in line[20:].split()]
for line in sect
]
# Transpose them. See https://stackoverflow.com/questions/6473679
transposed = list(map(list, zip(*rows)))
# Add to the list-of-lists of coefficients.
coeffs.extend(transposed)
# Check.
for cs in coeffs:
print(cs)
main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]

need help determining the general formatting of text in a text file

This question is related to HECRAS if anyone has experience, but in general it's just a question about writing text files to a very particular format to be read by the HECRAS software.
Basically I'm reading some files and altering some numbers, then writing it back out but I can't seem to match the initial format perfectly.
Here is what the original file looks like:
Type RM Length L Ch R = 1 ,229.41 ,21276,21276,21276
Node Last Edited Time=Oct-17-2019 15:52:28
#Sta/Elev= 452
0 20.097 67.042 9.137 67.43 9.139 68.208 9.073 68.598 9.129
68.986 9.086 70.538 9.071 70.926 9.042 71.984 9.046 72.48 9.025
73.646 9.056 74.368 9.034 75.586 9.042 76.55 9.017 77.138 9.047
78.304 8.989 79.47 9.025 80.19 9.001 81.41 9.003 81.974 8.978
83.83 9.005 85.284 9.079 85.682 9.068 86.97 9.118 88.012 9.223
88.79 9.239 89.65 9.316 90.342 9.324 91.134 9.475 91.966 9.525
92.282 9.589 93.346 9.546 94.222 9.557 94.922 9.594 95.71 9.591
96.546 9.64 97.286 9.574 98.87 9.688 99.258 9.673 99.642 9.712
#Mann= 3 , 0 , 0
0 .09 0 246.4 .028 0 286.4 .09 0
Bank Sta=246.4,286.4
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=1.708,0.1, 500
XS HTab Horizontal Distribution= 5 , 5 , 5
Exp/Cntr=0.3,0.1
I'm interested in the Sta/Elev data...it look's like some right justified tab/space? delimited format in station/elevation pairs of 5 per line..maybe 16 chars per pair??
I've tried a bunch of different things, my current code is:
with open('C:/Users/deden/Desktop/t/test.g01','w') as out:
out.write(txt[:idx[0][0]])
out.write(txt[idx[0][0]:idx[0][0]+bounds[0]])
out.write('#'+raw_SE.split('\n')[0]+'\n')
i = 0
while i <= len(new_SE):
out.write('\t'.join(new_SE[i:i+10])+'\n')
i+=10
out.write(txt[idx[0][0]+bounds[1]:idx[1][0]])
it's a little hacky atm, still trying to work it out, the important part is just:
while i <= len(new_SE):
out.write('\t'.join(new_SE[i:i+10])+'\n')
i+=10
new_SE is just a list of station/elevation:
['0', '30.097', '67.042', '19.137', '67.43', '19.139', '68.208', '19.073', '68.598', '19.128999999999998' ...]
I also tried playing around with justified side with something like:
'%8s %8s' % (tmp[0], tmp[1])
to basically have 8 spaces between the text but right justify them
honestly struggling...if anyone can recreate the original text in between #Sta/Elev= 452 and #Mann I would be eternally grateful, here is the full list if someone wants to give it a go:
new_SE = ['0', '30.097', '67.042', '19.137', '67.43', '19.139', '68.208', '19.073', '68.598', '19.128999999999998', '68.986', '19.086', '70.538', '19.070999999999998', '70.926', '19.042', '71.984', '19.046', '72.48', '19.025', '73.646', '19.055999999999997', '74.368', '19.034', '75.586', '19.042', '76.55', '19.017', '77.138', '19.047', '78.304', '18.989', '79.47', '19.025', '80.19', '19.000999999999998', '81.41', '19.003', '81.974', '18.978', '83.83', '19.005000000000003', '85.284', '19.079', '85.682', '19.067999999999998', '86.97', '19.118000000000002', '88.012', '19.223', '88.79', '19.239', '89.65', '19.316000000000003', '90.342', '19.323999999999998', '91.134', '19.475', '91.966', '19.525', '92.282', '19.589', '93.346', '19.546', '94.222', '19.557000000000002', '94.922', '19.594', '95.71', '19.591', '96.546', '19.64', '97.286', '19.573999999999998', '98.87', '19.688000000000002', '99.258', '19.673000000000002', '99.642', '19.712']
not really sure if I understood correctly - please consider having a look at
# with open('C:/Users/deden/Desktop/t/test.g01','w') as out:
for i in range(0, len(new_SE), 10):
row = [f'{float(v):8.3f}' for v in new_SE[i:i+10]]
out.write(''.join(r) + '\n')
# 0.000 30.097 67.042 19.137 67.430 19.139 68.208 19.073 68.598 19.129
# 68.986 19.086 70.538 19.071 70.926 19.042 71.984 19.046 72.480 19.025
# 73.646 19.056 74.368 19.034 75.586 19.042 76.550 19.017 77.138 19.047
# 78.304 18.989 79.470 19.025 80.190 19.001 81.410 19.003 81.974 18.978
# 83.830 19.005 85.284 19.079 85.682 19.068 86.970 19.118 88.012 19.223
# 88.790 19.239 89.650 19.316 90.342 19.324 91.134 19.475 91.966 19.525
# 92.282 19.589 93.346 19.546 94.222 19.557 94.922 19.594 95.710 19.591
# 96.546 19.640 97.286 19.574 98.870 19.688 99.258 19.673 99.642 19.712

pbm-file created with Python only works when copy&pasted in new text-file

When I create a random image in .pbm-format (code below) and output the result in a .pbm-file, the file appears to be corrupt (cannot be opened e.g. with GIMP). However, if I open the file with a text editor (e.g. Notepad) and copy the content over to a new text file, that was manually created, it works fine.
I also tried outputting the image as .txt and manually changing it to .pbm, which did not work either without manually creating a new text file and copy&pasting the content. I also noticed that the text file created by Python is bigger in size (about double) compared to the manually created one with the same content.
Do you guys have any idea, why and how the .txt-file created in python differs from a manually created one with the same content?
Thanks a lot!
.pbm-file created with the following command in windows command line:
python random_image 20 10 > image.pbm
Code used:
import random
import sys
width = int(sys.argv[1]) #in pixel
height = int(sys.argv[2]) #in pixel
def print_image(width, height):
image=[[" " for r in range(height)] for c in range(width)]
print("P1 {} {}".format(width, height))
for r in range(height):
for c in range(width):
image[c][r]=random.getrandbits(1)
print (image [c][r], end='')
print()
print_image(width, height)
I've tested the code myself. I'm not sure how copy&paste worked for you (it didn't work for me), but it seems that your code produces something similar to this
P1 5 5
01101
01100
11010
01110
11000
But the pbm format expects spaces between the bits, so for the picture to be presented correctly, you have to produce something like
P1 5 5
0 0 1 1 1
1 0 0 1 0
1 1 0 1 0
1 1 0 0 1
1 1 1 1 1
To get this result, just change the print statements of the bits to include a space at the end
print (image [c][r], end=' ') # notice the space

How can I tell MATLAB that the data I am importing is a series of vectors, not just a series of letters?

I have obtained my data using python for a project in MATLAB. I have 3 different matrices of dimensions mxn, mxn+1 and mxn+2. I used this command in python scipy.io.savemat('set1.mat', mdict ={'abc1':abc1}). Each row of the matrix should actually be a row of row vectors (of length p) not scalars, so that the matrices are actually mx(n)*p, mx(n+1)*p and mx(n+2)*p.
As an example, I have defined at the top of the MATLAB file for both cases
A = ones(1,5)
B = 2*ones(1,5)
C = 3*ones(1,5)
Now directly in MATLAB I can write:
abc1 = [A B C]
which strange as though it may seem, gives me the output I want.
abc1 =
Columns 1 through 14
1 1 1 1 1 2 2 2 2 2 3 3 3 3
Column 15
3
Now if I import my data using load I can grab abc1(1,:). This gives me:
ans = A B C
or I could take:
abc1(1,1)
ans = A
How can I get it to recognise that A is the name of a vector?
From what I understand of your question it sounds like you have (in matlab):
A = ones(1,5);
B = 2*ones(1,5);
C = 3*ones(1,5);
load('set1.mat');
And then you want to do something like:
D = [abc1];
and have the result be, for abc1 = 'A B C', the equivalent of [A B C].
There are a number of options for doing this. The first and possibly simplest is to use eval, though I shudder to mention it, since most consider eval to be evil.
In your case this would look like:
D = eval(['[' abc1 ']']);
A nicer solution would be to exploit the dynamic field names trick that can be done with structures:
foo.A = ones(1,5);
foo.B = 2*ones(1,5);
foo.C = 3*ones(1,5);
load('set1.mat');
D = [foo.(abc1(1,1)) foo.(abc1(1,2)) foo.(abc1(1,3))];
Or, if you need to concatenate more than just 3 columns you could do so itteratively, using the cat function. e.g.:
D = [];
for idx = 1:3
D = cat(2, D, foo.(abc1(1,idx)));
end
Or, if you know the length of D before you have created it you can use a slightly more efficient version:
D = zeros(1, num_elements);
ins_idx = 1;
for idx = 1:3
temp_num = length(foo.(abc1(1,idx)));
D(ins_idx:(ins_idx+temp_num-1)) = foo.(abc1(1,idx));
ins_idx = ins_idx + temp_num;
end
Load the data into a structure and use dynamic field indexing:
s = load('yourfile');
s.(abc1(1,1))
However, if you keep structuring your project in the above-mentioned way, you're probably gonna run into eval(), which I always suggest to avoid.

Computing averages of records from multiple files with python

Dear all,
I am beginner in Python. I am looking for the best way to do the following in Python: let's assume I have three text files, each one with m rows and n columns of numbers, name file A, B, and C. For the following, the contents can be indexed as A[i][j], or B[k][l] and so on. I need to compute the average of A[0][0], B[0][0], C[0][0], and writes it to file D at D[0][0]. And the same for the remaining records. For instance, let's assume that :
A:
1 2 3
4 5 6
B:
0 1 3
2 4 5
C:
2 5 6
1 1 1
Therefore, file D should be
D:
1 2.67 4
2.33 3.33 4
My actual files are of course larger than the present ones, of the order of some Mb. I am unsure about the best solution, if reading all the file contents in a nested structure indexed by filename, or trying to read, for each file, each line and computing the mean. After reading the manual, the fileinput module is not useful in this case because it does not read the lines "in parallel", as I need here, but it reads the lines "serially". Any guidance or advice is highly appreciated.
Have a look at numpy. It can read the three files into three arrays (using fromfile), calculate the average and export it to a text file (using tofile).
import numpy as np
a = np.fromfile('A.csv', dtype=np.int)
b = np.fromfile('B.csv', dtype=np.int)
c = np.fromfile('C.csv', dtype=np.int)
d = (a + b + c) / 3.0
d.tofile('D.csv')
Size of "some MB" should not be a problem.
In case of text files, try this:
def readdat(data,sep=','):
step1 = data.split('\n')
step2 = []
for index in step1:
step2.append(float(index.split(sep)))
return step2
def formatdat(data,sep=','):
step1 = []
for index in data:
step1.append(sep.join(str(data)))
return '\n'.join(step1)
and then use these functions to format the text into lists.
Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):
files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat","w")
for rowgrp in files: # e.g.("1 2 3\n", "0 1 3\n", "2 5 6\n")
intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
# [[1,2,3], [0,1,3], [2,5,6]]
intgrps = zip(*intsbyfile) # [(1,0,2), (2,1,5), (3,3,6)]
# use float() to ensure we get true division in Python 2.
averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
outfile.write(" ".join(str(a) for a in averages) + "\n")
In Python 3, zip will only read the files as they are needed. In Python 2, if they're too big to load into memory, use itertools.izip instead.

Categories

Resources