Creating arrays while processing big file - python

This is my first post, even though I've been reading SO for a while.
I'm a Python beginner and I need your help.
I'm processing a very big file (more than 2 million lines), but I'll show you a much smaller example (blocks of 24 lines rather than 74513). So let's say I've got 24 lines, each with one floating-point number, followed by one line with 3 numbers, then another 24 lines, another line with 3 numbers, and so on, 29 times.
56.71739
56.67950
56.65762
56.63320
56.61648
56.60323
56.63215
56.74365
56.98378
57.34681
57.78903
58.27959
58.81514
59.38853
59.98271
60.58515
-1.00000
56.09566
56.05496
56.02777
56.00158
55.98341
55.96830
55.99615
1 1 1
56.34692
56.70977
57.15187
57.64234
58.17782
58.75118
59.34534
59.94779
-1.00000
55.47366
55.42963
55.39739
55.36958
55.35020
55.33404
55.36098
55.47148
55.71110
56.07384
56.51588
57.00632
57.54180
58.11517
58.70937
2 1 1
It's quite easy to create an array with the first 24 lines:
import numpy as np

def ttarray_tms(traveltimes):
    '''It defines the 3-D array, organized as I want.'''
    with open(traveltimes, 'r') as file_in:
        newarray = file_in.readlines()
    ttarray = np.array(newarray)
    ttarray.shape = (2, 3, 4)
    ttarray = np.swapaxes(ttarray, 1, 2)
    ttarray = np.swapaxes(ttarray, 0, 2)
    return ttarray
PLEASE NOTE: There's no blank line between the numbers. It's a simple column-vector file. For some reason I had to post it like that.
What I want is basically to get 29 arrays, so I should loop over the 24 lines and get an array, then loop again over the next 24 lines (skipping the line with 3 numbers, which I don't really need) and get another array, and so on. I think my main problem is how to skip the line with the 3 numbers and start a new loop for a new array.
Do you have any good ideas?
Thanks very much!

You can use readline() to read a single line 24 times then use another readline() to skip a line and so on.
With your code:
import numpy as np

def mk_array(elems):
    '''Makes the ndarray from a list of 24 number strings.'''
    ttarray = np.array([float(a) for a in elems])  # convert the text lines to floats
    ttarray.shape = (2, 3, 4)
    ttarray = np.swapaxes(ttarray, 1, 2)
    ttarray = np.swapaxes(ttarray, 0, 2)
    return ttarray

def ttarray_tms(traveltimes):
    '''It defines the 3-D arrays, organized as I want.'''
    arrays = list()
    with open(traveltimes, 'r') as file_in:
        ret = "."  # force the loop
        while ret != "":
            newarray = [file_in.readline() for i in range(24)]
            ret = file_in.readline()  # skip the line with the 3 numbers
            if ret != "":  # avoid an empty array at end of file
                ttarray = mk_array(newarray)
                arrays.append(ttarray)
    return arrays
Not tested.
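Since the code above is untested, here is a minimal sketch of the same read-24-then-skip-one idea that can be checked on a small in-memory sample. The name read_blocks and its block_size parameter are made up for illustration, and the sketch assumes every block is followed by exactly one separator line. It yields plain lists of floats, each of which could then be reshaped with NumPy as in mk_array.

```python
import io
from itertools import islice

def read_blocks(file_in, block_size=24):
    """Yield lists of block_size floats, skipping one separator line after each block."""
    while True:
        # islice pulls at most block_size lines from the open file
        block = [float(line) for line in islice(file_in, block_size)]
        if len(block) < block_size:
            return  # end of file (or an incomplete trailing block)
        yield block
        next(file_in, None)  # skip the line with the 3 numbers

# Tiny stand-in for the real file: blocks of 3 numbers instead of 24.
sample = io.StringIO("1.0\n2.0\n3.0\n1 1 1\n4.0\n5.0\n6.0\n2 1 1\n")
blocks = list(read_blocks(sample, block_size=3))
print(blocks)
```

Because the function is a generator, only one block is held in memory at a time, which may matter for a file with millions of lines.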

The numbers in the three-number lines follow an incrementing pattern. So why not keep track of that pattern by storing the last two numbers in two variables, and if a line's three numbers match the pattern, drop the line and continue? It is a kind of sliding-window approach.
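A simpler alternative to tracking the counter pattern, sketched under the assumption that every data line holds exactly one number and every separator line holds several: decide by the number of whitespace-separated fields on the line. This is a different technique from the sliding window described above, and the function name split_on_separators is hypothetical.

```python
def split_on_separators(lines):
    """Group single-number lines into blocks; a line with several fields ends a block."""
    blocks, current = [], []
    for line in lines:
        fields = line.split()
        if len(fields) == 1:
            current.append(float(fields[0]))
        elif fields:  # separator line such as "1 1 1"
            if current:
                blocks.append(current)
            current = []
    if current:  # trailing block that has no separator after it
        blocks.append(current)
    return blocks

demo_lines = ["56.7", "56.6", "1 1 1", "55.4", "55.3", "2 1 1"]
demo_blocks = split_on_separators(demo_lines)
print(demo_blocks)
```

This version does not need to know the block length in advance, at the cost of trusting the field count to identify separators.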

Related

Combining and tabulating several blocks of text

The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need to extract only the coefficients, and I need to reformat the table so that the coefficients can be read in rows, not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
I attempted several variants of this; the most recent approach I tried was:
dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])
This seemed to get me close; however, I got a strange duplication of numbers in random order.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing from, s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys

def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
        elif start:
            lines.append(line)
            if 'px' in line:
                start = False
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)
    # Check.
    for cs in coeffs:
        print(cs)

main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]
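If the column alignment cannot be relied on (for example after the file has been re-wrapped), one hedged alternative to the fixed slice line[20:] is to take as many trailing fields from each row as there are eigenvalues in that section. The sketch below assumes each coefficient row ends with exactly that many numbers; transpose_section is an illustrative name, not part of the code above.

```python
def transpose_section(eigen_line, coeff_lines):
    """Return one section's coefficients as one row per eigenvalue column."""
    # Fields after the "eigenvalues:" label give the number of coefficient columns.
    n = len(eigen_line.split()) - 1
    # Keep only the last n fields of each row, dropping the variable-width labels.
    rows = [[float(c) for c in line.split()[-n:]] for line in coeff_lines]
    # zip(*rows) turns columns into rows.
    return [list(col) for col in zip(*rows)]

section = [
    "1 H 1 s 0.01433 0.01433 -0.94568",
    "2 H 1 s -0.18881 -0.18881 1.84419",
    "3 N s 0.00813 0.00813 0.00813",
]
result = transpose_section("eigenvalues: 1.592 1.592 2.588", section)
for row in result:
    print(row)
```

Counting from the end of each row sidesteps the changing number of label fields (s, px, py, pz with or without an extra index), which is what made a fixed split index awkward.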

Arranging distinct number of floats in an 2d array

First of all, I am quite a newbie at Python, so please forgive me if I can't see the wood for the trees. My question is about reading a huge file of floats and storing them in an array for fast mathematical postprocessing.
Let's assume the file looks similar to this:
!!
-3.2297390 0.4474691 3.5690145 3.5976372 6.9002712 7.7787466 14.2159269 14.3291490
16.7660723 17.1258704 18.9469059 19.1716808 20.0700721 21.4088414
-3.2045361 0.4123081 3.5625981 3.5936954 6.8901539 7.7543415 14.2764611 14.3623976
16.7955934 17.1560337 18.9527369 19.1251184 20.0700709 21.3515145
-3.2317597 0.4494166 3.5799182 3.6005429 6.8838705 7.7661897 14.2576455 14.3295731
16.7550357 17.0986678 19.0187779 19.1687722 20.0288587 21.3818250
-3.1921346 0.3949598 3.5636878 3.5892085 6.8833690 7.7404542 14.3061281 14.3855389
16.8063645 17.1697110 18.9549920 19.1134580 20.0613223 21.3196066
Here there are 4 (nb) blocks of 14 (nk) floats each. I want them arranged in an array elements[nb][nk] so that I can easily loop over certain floats of the blocks.
Here is what I thought it should look like, but it doesn't work at all:
nb=4
nk=14
with open("datafile") as file:
    elements = []
    n = 0
    while '!!' not in file:
        while n <= (nb-1):
            elements.append([])
            current = map(float,file.read().split()) # here I would need something to assure only 14 (nk) floats are read in
            elements[n].append(current)
            n += 1
print(elements[0][1])
It would be great if you had some ideas and suggestions. Thanks!
EDIT:
Here is a datafile where the numbers follow one another with no clear separator after a block. Here nb=2 and nk=160. How can I split the floats after every 160th number?
!!
-7.2578105433 -7.2578105433 -6.7774609392 -6.7774609392 -6.3343986693 -6.3343986693 -5.8537216826 -5.8537216826
-5.6031029888 -5.6031029888 -2.9103190893 -2.9103190893 -1.7962279174 -1.7962279174 -0.8136720023 -0.8136720023
-0.1418500769 -0.1418500769 2.9923464558 2.9923464558 3.5797768050 3.5797768050 3.8793240270 3.8793240270
4.0774192689 4.0774192689 4.2378755781 4.2378755781 4.2707165126 4.2707165126 4.3290523910 4.3290523910
4.4487102661 4.4487102661 4.5341883539 4.5341883539 4.7946098470 4.7946098470 4.9518205998 4.9518205998
4.9592549825 4.9592549825 5.1648268937 5.1648268937 5.2372127454 5.2372127454 5.9377062691 5.9377062691
6.2971992823 6.2971992823 6.6324702419 6.6324702419 6.7948808733 6.7948808733 7.0835270703 7.0835270703
7.6252686579 7.6252686579 7.7886279100 7.7886279100 7.8514022664 7.8514022664 7.9188180854 7.9188180854
7.9661386138 7.9661386138 8.2830991934 8.2830991934 8.4581462733 8.4581462733 8.5537201519 8.5537201519
10.2738010533 10.2738010533 11.4495306517 11.4495306517 11.4819579346 11.4819579346 11.5788238984 11.5788238984
11.9411469341 11.9411469341 12.5006172267 12.5006172267 12.5055546075 12.5055546075 12.6659410418 12.6659410418
12.8741094000 12.8741094000 12.9560279595 12.9560279595 12.9780521671 12.9780521671 13.2195973082 13.2195973082
13.2339969658 13.2339969658 13.3594047155 13.3594047155 13.4530024795 13.4530024795 13.4556342387 13.4556342387
13.5784994631 13.5784994631 14.6887369915 14.6887369915 14.9019726334 14.9019726334 15.1279383300 15.1279383300
15.1953349879 15.1953349879 15.3209538297 15.3209538297 15.4042612992 15.4042612992 15.4528348692 15.4528348692
15.4542742538 15.4542742538 15.5291462589 15.5291462589 15.5415591416 15.5415591416 16.0741610117 16.0741610117
16.1117432607 16.1117432607 16.3566675522 16.3566675522 17.7569123657 17.7569123657 18.4416346230 18.4416346230
18.9525843134 18.9525843134 19.0591624486 19.0591624486 19.1069867477 19.1069867477 19.1853525353 19.1853525353
19.4020021909 19.4020021909 19.4718240723 19.4718240723 19.6384650104 19.6384650104 19.6919638323 19.6919638323
19.7044699790 19.7044699790 19.8851141335 19.8851141335 20.6132283388 20.6132283388 21.4074471478 21.4074471478
-7.2568288331 -7.2568280628 -6.7765483088 -6.7765429702 -6.3336003082 -6.3334841531 -5.8529872639 -5.8528369047
-5.6024822566 -5.6024743589 -2.9101060346 -2.9100930470 -1.7964872791 -1.7959333994 -0.8153333579 -0.8144924713
-0.1440078470 -0.1421444935 2.9869228390 2.9935342026 3.5661875018 3.5733148387 3.8777649741 3.8828300867
4.0569348321 4.0745074351 4.2152251981 4.2276050415 4.2620483420 4.2649182323 4.3401804124 4.3402590222
4.4446178512 4.4509411587 4.5139270348 4.5526439516 4.7788285567 4.7810706248 4.9282976775 4.9397807768
4.9737752749 4.9900180286 5.1456209436 5.1507667583 5.2528363215 5.2835144984 5.9252188817 5.9670441193
6.2699491148 6.3270140700 6.5912060019 6.6576016532 6.7976670773 6.7982056614 7.0789050974 7.1023337244
7.6182108739 7.6309688587 7.7678148773 7.7874194913 7.8544608005 7.8594983757 7.9019395451 7.9100447766
7.9872550937 7.9902791771 8.2617740182 8.3147140843 8.4533756827 8.4672364683 8.5556163680 8.5558640539
10.2756173692 10.2760227976 11.4344757209 11.4355375519 11.4737803653 11.4760186102 11.5914333288 11.5953932241
11.9369518613 11.9380900159 12.4973099542 12.5002401499 12.5030167542 12.5031963862 12.6629548222 12.6634150863
12.8719844312 12.8728126622 12.9541436501 12.9568445777 12.9762780998 12.9764840239 13.2074024551 13.2108294169
13.2279146175 13.2308902307 13.3780648962 13.3839050348 13.4634576072 13.4650575047 13.4701414823 13.4718238883
13.5901622459 13.5971076111 14.6735704782 14.6840793519 14.8963924604 14.8968395615 15.1163287408 15.1219631271
15.1791724308 15.1817299995 15.2628531102 15.3027136606 15.3755066968 15.3802521520 15.3969012144 15.4139294088
15.5131322524 15.5315039463 15.5465532500 15.5629105034 15.5927166831 15.5966393750 16.0841067052 16.0883417123
16.1224821534 16.1226510159 16.3646268213 16.3665839987 17.7654543366 17.7657216551 18.4305335335 18.4342292730
18.9110142692 18.9215889808 18.9821593138 18.9838270736 19.1633959849 19.1637558341 19.2040877093 19.2056062802
19.3760597529 19.3846323861 19.4323552578 19.4329488797 19.6494790293 19.6813374885 19.6943820824 19.7202356536
19.7381237231 19.7414645409 19.9056461663 19.9197428869 20.6239183178 20.6285756411 21.4127637743 21.4128909767
This should work:
elements = []
with open("datafile") as file:
    next(file)
    for line in file:
        elements.append([float(x) for x in line.split()])
next(file) skips the first line (the !! marker). Then for line in file: iterates over all remaining lines. The list comprehension [float(x) for x in line.split()] goes through all entries in the line, split by whitespace. Finally, elements.append() appends this list to elements, which becomes a list of lists that you can treat as a 2D array. This assumes each block of nk floats sits on a single line of the file.
Access the first entry in the first line:
>>> elements[0][0]
-3.229739
or the last entry in the last line:
>>> elements[3][13]
21.3196066
alternatively:
>>> elements[-1][-1]
21.3196066
Update
This reads the file into a list of lists without taking line breaks as special:
nb = 2
nk = 160
with open("datafile") as fobj:
    all_values = iter(fobj.read().split())
next(all_values)  # skip the leading '!!' marker
elements = []
for x in range(nb):
    elements.append([float(next(all_values)) for counter in range(nk)])
If you like nested list comprehensions:
with open("datafile") as fobj:
    all_values = iter(fobj.read().split())
next(all_values)  # skip the leading '!!' marker
elements = [[float(next(all_values)) for counter in range(nk)] for x in range(nb)]
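The same chunking can also be sketched with list slicing instead of an iterator, which makes it possible to check that the file really contains a whole number of blocks. The helper name chunk_floats is an assumption for illustration, and the sample text is a tiny stand-in for the real datafile, with line breaks deliberately not aligned to the block length.

```python
def chunk_floats(text, nk):
    """Split whitespace-separated floats into consecutive rows of nk values."""
    tokens = text.split()
    if tokens and tokens[0] == "!!":  # drop the marker if present
        tokens = tokens[1:]
    values = [float(tok) for tok in tokens]
    if len(values) % nk != 0:
        raise ValueError("number of floats is not a multiple of nk")
    # Step through the flat list nk values at a time.
    return [values[i:i + nk] for i in range(0, len(values), nk)]

# Two blocks of nk=3 floats, spread over three lines of two floats each.
sample_text = "!!\n1.0 2.0\n3.0 4.0\n5.0 6.0\n"
elements = chunk_floats(sample_text, nk=3)
print(elements)
```

With this variant nb does not have to be known beforehand; it falls out as len(elements).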

read multi-line list from file

I have a file with data like:
POTENTIAL
TYPE 1
-5.19998150116627E+07 -5.09571848744513E+07 -4.99354600752570E+07 -4.89342214499422E+07 -4.79530582388520E+07
-4.69915679183017E+07 -4.60493560354389E+07 -4.51260360464197E+07 -4.42212291578282E+07 -4.33345641712756E+07
-4.24656773311163E+07 -4.16142121752159E+07 -4.07798193887125E+07 -3.99621566607090E+07 -3.91608885438409E+07
-3.83756863166569E+07
-8.99995987594328E+07 -8.81884626368405E+07 -8.64137733336537E+07 -8.46747974037847E+07 -8.29708161608188E+07
-8.13011253809965E+07 -7.96650350121689E+07 -7.80618688886128E+07 -7.64909644515842E+07 -7.49516724754953E+07
-7.34433567996002E+07 -7.19653940650832E+07 -7.05171734574350E+07 -6.90980964540154E+07 -6.77075765766936E+07
-6.63450391494693E+07
Note, as per Nsh's comment: these data are not on a single line. There are always 5 values per line and, in this example, 4 rows, with only one value in the 4th row. So I have 16 floats spread over 4 lines. I always know the total number (i.e. 16 in this case).
My aim is to read them as a list (please let me know if there is something better). The row with the single entry denotes the end of a list (e.g. list[1] ends with -3.83756863166569E+07).
I tried to read it as:
if line.startswith("POTENTIAL"):
    lines = f.readline()
    if lines.startswith("TYPE "):
        lines = f.readline()
        lines = lines.split()
        lines = [float(i) for i in lines]
        pots.append(lines)
        print(pots)
which gives result:
[[-51999815.0116627, -50957184.8744513, -49935460.075257, -48934221.4499422, -47953058.238852]]
i.e. just the first line from the list, and not going any further.
My aim is to get them as separate lists, possibly as:
pots[1]=[-5.19998150116627E+07....-3.83756863166569E+07]
pots[2]=[-8.99995987594328E+07....-6.63450391494693E+07]
I have searched Google extensively (the present state itself is from another SO question), but due to my inexperience I can't solve my problem.
Kindly help.
Use + instead of append.
It will add the elements of lines to pots, rather than appending lines as a nested list.
pots = pots + lines
I didn't see this at the start:
pots = []
It is needed in this case...
ITEMS_PER_LIST = 16
lists = [[]]  # list of lists with initialized first sublist
with open('data.txt') as f:
    for line in f:
        if line.startswith(("POTENTIAL", "TYPE")):
            continue
        if len(lists[-1]) == ITEMS_PER_LIST:
            lists.append([])  # create new list
        lists[-1].extend([float(i) for i in line.split()])
Additional tweaks are required to validate headers.
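One possible shape for those tweaks, as a rough sketch only: ignore everything until a POTENTIAL line has been seen, skip TYPE lines, and then fill lists of a known size. The function name read_potentials and the rule "ignore all lines before POTENTIAL" are assumptions, since the rest of the file format is not shown in the question.

```python
def read_potentials(lines, items_per_list=16):
    """Collect floats following a POTENTIAL header into lists of items_per_list."""
    lists = []
    in_section = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("POTENTIAL"):
            in_section = True  # data starts after this header
            continue
        if not in_section or not stripped or stripped.startswith("TYPE"):
            continue  # skip unrelated lines, blank lines and TYPE headers
        if not lists or len(lists[-1]) == items_per_list:
            lists.append([])  # previous list is full, start a new one
        lists[-1].extend(float(x) for x in stripped.split())
    return lists

# Small stand-in: 3 values per list instead of 16.
demo = ["IGNORED 1 2", "POTENTIAL", "TYPE 1",
        "1.0E+01 2.0E+01", "3.0E+01", "4.0 5.0", "6.0"]
pots = read_potentials(demo, items_per_list=3)
print(pots)
```

Python's float() accepts the E+07 exponent notation directly, so no extra parsing is needed for the values themselves.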

Read file elements into 3 different arrays

I have a space-delimited file with values for x, y, z. I need to visualise the data, so I guess I need to read the file into 3 separate arrays (X, Y, Z) and then plot them. How do I read the file into 3 separate arrays? I have this so far, which removes the whitespace element at the end of every line.
import csv
import sys

def fread(f=None):
    """Reads in test and training CSVs."""
    X = []
    Y = []
    Z = []
    if f is None:
        print("No file given to read, exiting...")
        sys.exit(1)
    read = csv.reader(open(f, 'r'), delimiter=' ')
    for line in read:
        line = line[:-1]
I tried to add something like:
for x,y,z in line:
    X.append(x)
    Y.append(y)
    Z.append(z)
But I get an error like "ValueError: too many values to unpack"
I have done lots of googling but nothing seems to address having to read in a file into a separate array every element.
I should add my data isn't sorted nicely into rows/columns it just looks like this
"107745590026 2 0.02934046648 0.01023879368 3.331810236 2 0.02727724425 0.07867902517 3.319272757 2 0.01784882881"......
Thanks!
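EDIT: If your data isn't actually separated into 3-element lines (and is instead one long space-separated list of values), you could flatten it and use Python list slicing with a stride to make this easier:
values = [v for row in read for v in row if v]  # csv.reader is not sliceable, so flatten it first
X = values[::3]
Y = values[1::3]
Z = values[2::3]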
This error might be happening because some of the lines in read contain more than three space-separated values. It's unclear from your question exactly what you'd want to do in these cases. If you're using Python 3, you could put the first element of a line into X, the second into Y, and all the rest of that line into Z with the following:
for line in read:
    x, y, *z = line
    X.append(x)
    Y.append(y)
    for elem in z:
        Z.append(elem)
If you're not using Python 3, you can perform the same basic logic in a slightly more verbose way:
for i, elem in enumerate(line):
    if i == 0:
        X.append(elem)
    elif i == 1:
        Y.append(elem)
    else:
        Z.append(elem)
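Putting the stride-slicing idea from the EDIT together, here is a small sketch that assumes the whole file is one long whitespace-separated run of x, y, z triples. The name split_xyz is hypothetical, and whether the real data actually follows that layout is not clear from the question.

```python
def split_xyz(text):
    """Split a flat run of numbers into X, Y, Z lists using strides of 3."""
    values = [float(tok) for tok in text.split()]
    # Every third value starting at offset 0, 1 and 2 respectively.
    return values[0::3], values[1::3], values[2::3]

X, Y, Z = split_xyz("1 2 3 4 5 6")
print(X, Y, Z)
```

Using str.split() with no argument collapses repeated spaces and trailing whitespace, so the empty trailing element that csv.reader produced does not appear. If the number of values is not a multiple of 3, the three lists simply end up with different lengths.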

Extraction and processing the data from txt file

I am a beginner in Python (and in programming). I have a large file containing repeating groups of 3 lines with numbers followed by 1 empty line, and so on.
If I print the file, it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on it, write the result to a new file, and move on to the next three lines, i.e. the next vector. Here is my code (it doesn't work):
import math
inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get "vector" from each 3 lines in my file and put it into blabla.txt output file. Thanks a lot!
My 'code comment' answer:
take care to close every parenthesis you open (leaving one open is very likely to raise a SyntaxError ;-) );
fin is created as an empty list, and is never filled. Trying to call any value by fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are not created but used, this is very likely to break with a NameError;
b is created as a copy of a, so is a list. But you do a fin.append(b): what do you expect in this case by appending (not extending) a list?
Hope this helps!
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing about vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.
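For orientation only, here is a minimal sketch of the two building blocks the question describes: grouping non-empty lines into 3-number vectors, and applying the formula from the question's code, component / (sqrt(h*h + k*k + l*l) + 1). The function names are made up, and treating any whitespace-only line as a separator is an assumption about the file.

```python
import math

def vectors_from_lines(lines):
    """Yield lists of three floats, ignoring blank separator lines."""
    vec = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank line between vectors
        vec.append(float(text))
        if len(vec) == 3:
            yield vec
            vec = []

def transform(vec):
    """Divide each component by (vector length + 1), as in the question's formula."""
    h, k, l = vec
    denom = math.sqrt(h * h + k * k + l * l) + 1
    return [h / denom, k / denom, l / denom]

sample_lines = ["3.0\n", "4.0\n", "0.0\n", " \n", "1.0\n", "2.0\n", "2.0\n"]
results = [transform(v) for v in vectors_from_lines(sample_lines)]
# For the first vector the length is 5, so the first component is 3 / (5 + 1) = 0.5.
print(results[0])
```

Writing the results back out would then be a matter of joining str(x) for each component with newlines, as the question's code already attempts.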
