Computing averages of records from multiple files with python

Computing averages of records from multiple files with python - python

Dear all,
I am beginner in Python. I am looking for the best way to do the following in Python: let's assume I have three text files, each one with m rows and n columns of numbers, name file A, B, and C. For the following, the contents can be indexed as A[i][j], or B[k][l] and so on. I need to compute the average of A[0][0], B[0][0], C[0][0], and writes it to file D at D[0][0]. And the same for the remaining records. For instance, let's assume that :
A:
1 2 3
4 5 6
B:
0 1 3
2 4 5
C:
2 5 6
1 1 1
Therefore, file D should be
D:
1 2.67 4
2.33 3.33 4
My actual files are of course larger than the present ones, of the order of some Mb. I am unsure about the best solution, if reading all the file contents in a nested structure indexed by filename, or trying to read, for each file, each line and computing the mean. After reading the manual, the fileinput module is not useful in this case because it does not read the lines "in parallel", as I need here, but it reads the lines "serially". Any guidance or advice is highly appreciated.

Have a look at numpy. It can read the three files into three arrays (using fromfile), calculate the average and export it to a text file (using tofile).
import numpy as np
a = np.fromfile('A.csv', dtype=np.int)
b = np.fromfile('B.csv', dtype=np.int)
c = np.fromfile('C.csv', dtype=np.int)
d = (a + b + c) / 3.0
d.tofile('D.csv')
Size of "some MB" should not be a problem.

In case of text files, try this:
def readdat(data,sep=','):
step1 = data.split('\n')
step2 = []
for index in step1:
step2.append(float(index.split(sep)))
return step2
def formatdat(data,sep=','):
step1 = []
for index in data:
step1.append(sep.join(str(data)))
return '\n'.join(step1)
and then use these functions to format the text into lists.

Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):
files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat","w")
for rowgrp in files: # e.g.("1 2 3\n", "0 1 3\n", "2 5 6\n")
intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
# [[1,2,3], [0,1,3], [2,5,6]]
intgrps = zip(*intsbyfile) # [(1,0,2), (2,1,5), (3,3,6)]
# use float() to ensure we get true division in Python 2.
averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
outfile.write(" ".join(str(a) for a in averages) + "\n")
In Python 3, zip will only read the files as they are needed. In Python 2, if they're too big to load into memory, use itertools.izip instead.

Related

Combining and tabulating several blocks of text

The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
for n, line in enumerate(f):
if (n >= start_section) and (n <= end_section):
if 'eigenvalues' in line:
eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
So attempting several variants of this, and in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
for n, line in enumerate(f):
if (n >= start_section) and (n <= end_section):
for i in range(1, number_of_coefficients+1):
if str(i) in line.split()[0]:
if line.split()[1].isdigit() == False:
if line.split()[3] in ['s', 'px', 'py', 'pz']:
dir[str(i)].append(line.split()[4:])
else:
dir[str(i)].append(line.split()[3:])
Which seemed to get me close, however, I got a strange duplication of numbers in random orders.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing from, s, px, py, pz).

Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys
def get_section(fh):
# Takes an open file handle.
# Yields each section of lines having coefficients.
lines = []
start = False
for line in fh:
if 'eigenvalues' in line:
start = True
if lines:
yield lines
lines = []
elif start:
lines.append(line)
if 'px' in line:
start = False
if lines:
yield lines
def main():
coeffs = []
with open(sys.argv[1]) as fh:
for sect in get_section(fh):
# Grab the rows from a section.
rows = [
[float(c) for c in line[20:].split()]
for line in sect
]
# Transpose them. See https://stackoverflow.com/questions/6473679
transposed = list(map(list, zip(*rows)))
# Add to the list-of-lists of coefficients.
coeffs.extend(transposed)
# Check.
for cs in coeffs:
print(cs)
main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]

Python: How do I read a .txt file of numbers to find the median of said numbers without using median() call?

I am working on an assignment that requires us to read in a .txt file of numbers, then figure out a way to use the length of the list to determine the middle index of both odd/even lengths of number lists to calculate the median without using the median() call. I do not understand how I would go about this, anything helps! (I am also still faily new to Python)
debug = print
# assign name of file to be read
file = "numbers_even.txt"
# open file to be read
def get_median(med):
with open(file, mode='r') as my_file:
# assign middle index values for list
m2 = len(file) // 2
debug("index2", m2)
m1 = m2 -1
value1 = file[m1]
debug(m1, value1)
value2 = file[m2]
middle = (value1 + value2) / 2
debug("val1:", value1, "val2:", value2, "mid", middle)
# end with
# end function
get_median(file)

I would recommend pulling all the numbers into a list. Then you can sort the list and choose the middle index.

Assuming your text file (numbers_even.txt) looks something like this:
1
2
3
4
5
6
7
8
9
10
11
You can do this:
with open('numbers_even.txt','r') as f:
median = f.readlines()
if len(median) % 2 == 0:
print(median[int(len(median)/2-1)])
else:
print((int(median[int(len(median)//2)])+int(median[int(len(median)//2-12)]))/2)
Output:
5.5

python fastest way to match strings with huge data size

I have a huge table data (or record array) with elements:
tbdata[i]['a'], tbdata[i]['b'], tbdata[i]['c']
which are all integers, and i is a random number between 0 and 1 million (the size of the table).
I also have a list called Name whose elements are all names (900 names in total) of files, such as '/Users/Desktop/Data/spe-3588-55184-0228.jpg' (modified), all containing three numbers.
Now I want to select those data from my tbdata whose elements mentioned above all match the three numbers in the names of list Name. Here's the code I originally wrote:
Data = []
for k in range(0, len(tbdata)):
for i in range(0, len(NameA5)):
if Name[i][43:47] == str(tbdata[k]['a']) and\
Name[i][48:53] == str(tbdata[k]['b']) and\
Name[i][55:58] == str(tbdata[k]['c']):
Data.append(tbdata[k])
Python ran for the whole night and still haven't finished, since either the size of data is huge or my algorithm is too slow...I'm wondering what's the fastest way to complete such a task? Thanks!

You can construct a lookup tree like this:
a2b2c = {}
for name in NameA5:
a = int(name[43:47])
b = int(name[48:53])
c = int(name[55:58])
if a not in a2b2c2name:
a2b2c2name[a] = {}
if b not in a2b2c2name[a]:
a2b2c2name[a][b] = {}
a2b2c2name[a][b][c] = True
for k in range(len(tbdata)):
a = tbdata[k]['a']
b = tbdata[k]['b']
c = tbdata[k]['c']
if a in a2b2c2name and b in a2b2c2name[a] and c in a2b2c2name[a][b]:
Data.append(tbdata[k])

Writing a file in Python

There is a file format called .xyz that helps visualizing molecular bonds. Basically the format asks for a specific pattern:
At the first line there must be the number of atoms, which in my case is 30.
After that there should be the data where the first line is the name of the atom, in my case they are all carbon. The second line is the x information and the third line is the y information and the last line is the z information which are all 0 in my case. So something like this:
30
C x1 y1 z1
C x2 y2 z2
...
...
...
I generated my data in C++ into a text file like this:
C 2.99996 7.31001e-05 0
C 2.93478 0.623697 0
C 2.74092 1.22011 0
C 2.42702 1.76343 0
C 2.0079 2.22961 0
C 1.50006 2.59812 0
C 0.927076 2.8532 0
C 0.313848 2.98349 0
C -0.313623 2.9837 0
C -0.927229 2.85319 0
C -1.5003 2.5981 0
C -2.00732 2.22951 0
C -2.42686 1.76331 0
C -2.74119 1.22029 0
C -2.93437 0.623802 0
C -2.99992 -5.5509e-05 0
C -2.93416 -0.623574 0
C -2.7409 -1.22022 0
C -2.42726 -1.7634 0
C -2.00723 -2.22941 0
C -1.49985 -2.59809 0
C -0.92683 -2.85314 0
C -0.313899 -2.98358 0
C 0.31363 -2.98356 0
C 0.927096 -2.85308 0
C 1.50005 -2.59792 0
C 2.00734 -2.22953 0
C 2.4273 -1.76339 0
C 2.74031 -1.22035 0
C 2.93441 -0.623647 0
So, now what I'm trying to do is that I want to write this file into a .xyz file. I saw online that people do it with Python in which I almost have no experience. So I checked around and came up with this script:
#!/usr/bin/env/python
text_file = open("output.txt","r")
lines = text_file.readlines()
myfile = open("output.xyz","w")
for line in lines:
atom, x, y, z = line.split()
myfile.write("%s\t %d\t %d\t %d\t" %(atom,x,y,z))
myfile.close()
text_file.close()
However when I run this, it gives the following error: "%d format: a number is required, not str."
It doesn't make sense to me, since as you can see in txt file, they are all numbers apart from the first line. I tried changing my d's into s's but then the program I'll load this data into gave an error.
To summarize:
I have a data file in .txt, I want to change it into .xyz that's been specified but I am running into problems.
Thanks in advance.

A string can represent a number as well. In programming languages, this is called a type. '1' and 1 have different types. Use %s instead for strings.
myfile.write("%s\t %s\t %s\t %s\t" % (atom, x, y, z))
If you want them to be floats, you should do this during the parsing stage:
x, y, z = map(float, (x, y, z))
And btw, % is considered obsolete in python. Please use format instead:
myfile.write("{}\t {}\t {}\t {}\t".format(atom,x,y,z))

Maybe the problem you've faced was because of "\t" in the answer (tab).
The .xyz file uses only spaces to separate data from the same line, as stated here. You could use only one space if you wanted, but to have an easily readable format, like other programs use to format when saving .xyz files, it's better to use the tips from https://pyformat.info/
My working code (in Python 3) to generate .xyz files, using objects of molecules and atoms from the CSD Python API library, is this one, that you can adapt to your reality:
with open("file.xyz", 'w') as xyz_file:
xyz_file.write("%d\n%s\n" % (len(molecule.atoms), title))
for atom in molecule.atoms:
xyz_file.write("{:4} {:11.6f} {:11.6f} {:11.6f}\n".format(
atom.atomic_symbol, atom.coordinates.x, atom.coordinates.y, atom.coordinates.z))
The first two lines are the number of atoms and the title for the xyz file. The other lines are the atoms (atomic symbol and 3D position).
So, the atomic symbol has 4 spaces for it, aligned to the left: {:4}
Then, this happens 3 times: {:11.6f}
That means one space followed by the next coordinate, that uses 11 characters aligned to the right, where 6 are after the decimal point. It's sufficient for numbers from -999.999999 to 9999.999999, that use to be sufficient. Numbers out of this interval only break the format, but keep the mandatory space between data, so the xyz file still works on those cases.
The result is like this:
18
WERPOY
N 0.655321 3.658330 14.594159
C 1.174111 4.551873 13.561374
C 0.703656 3.889113 15.926147
S 1.455530 5.313258 16.574524
C 1.127601 5.061435 18.321297
N 0.146377 2.914698 16.639984
C -0.288067 2.014580 15.736297
C 0.014111 2.441298 14.475693
N -0.266880 1.787085 13.260652
O -0.831165 0.699580 13.319885
O 0.056329 2.322290 12.209641
H 0.439235 4.780025 12.970561
H 1.821597 4.148825 13.137629
H 1.519448 5.312600 13.775525
H 1.522843 5.786000 18.590124
H 0.171557 5.069325 18.423056
H 1.477689 4.168550 18.574936
H -0.665073 1.282125 16.053727

How can I tell MATLAB that the data I am importing is a series of vectors, not just a series of letters?

I have obtained my data using python for a project in MATLAB. I have 3 different matrices of dimensions mxn, mxn+1 and mxn+2. I used this command in python scipy.io.savemat('set1.mat', mdict ={'abc1':abc1}). Each row of the matrix should actually be a row of row vectors (of length p) not scalars, so that the matrices are actually mx(n)*p, mx(n+1)*p and mx(n+2)*p.
As an example, I have defined at the top of the MATLAB file for both cases
A = ones(1,5)
B = 2*ones(1,5)
C = 3*ones(1,5)
Now directly in MATLAB I can write:
abc1 = [A B C]
which strange as though it may seem, gives me the output I want.
abc1 =
Columns 1 through 14
1 1 1 1 1 2 2 2 2 2 3 3 3 3
Column 15
3
Now if I import my data using load I can grab abc1(1,:). This gives me:
ans = A B C
or I could take:
abc1(1,1)
ans = A
How can I get it to recognise that A is the name of a vector?

From what I understand of your question it sounds like you have (in matlab):
A = ones(1,5);
B = 2*ones(1,5);
C = 3*ones(1,5);
load('set1.mat');
And then you want to do something like:
D = [abc1];
and have the result be, for abc1 = 'A B C', the equivalent of [A B C].
There are a number of options for doing this. The first and possibly simplest is to use eval, though I shudder to mention it, since most consider eval to be evil.
In your case this would look like:
D = eval(['[' abc1 ']']);
A nicer solution would be to exploit the dynamic field names trick that can be done with structures:
foo.A = ones(1,5);
foo.B = 2*ones(1,5);
foo.C = 3*ones(1,5);
load('set1.mat');
D = [foo.(abc1(1,1)) foo.(abc1(1,2)) foo.(abc1(1,3))];
Or, if you need to concatenate more than just 3 columns you could do so itteratively, using the cat function. e.g.:
D = [];
for idx = 1:3
D = cat(2, D, foo.(abc1(1,idx)));
end
Or, if you know the length of D before you have created it you can use a slightly more efficient version:
D = zeros(1, num_elements);
ins_idx = 1;
for idx = 1:3
temp_num = length(foo.(abc1(1,idx)));
D(ins_idx:(ins_idx+temp_num-1)) = foo.(abc1(1,idx));
ins_idx = ins_idx + temp_num;
end

Load the data into a structure and use dynamic field indexing:
s = load('yourfile');
s.(abc1(1,1))
However, if you keep structuring your project in the above-mentioned way, you're probably gonna run into eval(), which I always suggest to avoid.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Computing averages of records from multiple files with python - python

Related

Combining and tabulating several blocks of text

Python: How do I read a .txt file of numbers to find the median of said numbers without using median() call?

python fastest way to match strings with huge data size

Writing a file in Python

How can I tell MATLAB that the data I am importing is a series of vectors, not just a series of letters?

Categories

Resources