Categorize and calculate something in Python

I have the following input file:
O 2.05151 39.51234 0.00000
O 32.69451 1.48634 8.31300
O 10.53351 21.63634 7.95400
O 30.37451 20.74134 0.99700
Si 8.06451 19.19434 10.21700
Si 32.03251 42.98634 21.23900
O 9.69051 19.06934 16.27200
Si 2.18351 39.67034 11.36500
Si 31.78351 2.38334 1.42300
......
First, I want to categorize these data based on the 4th column into sections such as
[0~1, 1~2, 2~3, ...., max-1 ~ max]
and then count the number of 'Si' and 'O' in each section. After that, I want to do some calculation based on those numbers and print the result. The printing format should be
section1 number_of_Si_in_section1 number_of_O_in_section1 add_two_numbers
...
with the values separated by spaces.
I tried to use nested for loops, but failed.
for i1 in range(total number of lines):
    for j1 in range(each section):
        if (at_name[j1] = 'Si'):
            num_Si = num_Si + 1
        if (at_name[j1] = 'O'):
            num_O = num_O + 1
Something like this, but I got stuck in the middle. I heard that numpy, the csv module, or other tools can do this easily, but I have no idea about them.

You should test small bits of this code line by line in your Python interpreter. You will see the small mistakes (like you used single equals instead of double equals to check equality).
Nothing inside the loop depends on i1, so it looks like this loop will just do the same thing again and again. Also, you should use a dictionary (or better yet, collections.Counter):
import collections

si_counter = collections.Counter()
o_counter = collections.Counter()

with open('myfile.txt') as f:
    for line in f:
        fields = line.split()            # columns are whitespace-separated
        section = int(float(fields[3]))  # bin by the integer part of the 4th column
        si_or_o = fields[0]
        if si_or_o == 'Si':
            si_counter[section] += 1
        elif si_or_o == 'O':
            o_counter[section] += 1
The code is untested and you can improve it.
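For the printing step the question asks about, here is a minimal sketch (untested, assuming the counters built above) of one way to produce the section / count / sum rows:
# Hypothetical follow-up: print one row per section, values separated by spaces.
max_section = max(list(si_counter) + list(o_counter))
for section in range(max_section + 1):
    n_si = si_counter[section]
    n_o = o_counter[section]
    print("%d %d %d %d" % (section, n_si, n_o, n_si + n_o))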

Related

Combining and tabulating several blocks of text

The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need to extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows, not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
which gives:
['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
I attempted several variants of this; in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])
This seemed to get me close; however, I got a strange duplication of numbers in random orders.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing among s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys

def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
        elif start:
            lines.append(line)
            if 'px' in line:
                start = False
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)
    # Check.
    for cs in coeffs:
        print(cs)

main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]

Arranging a distinct number of floats in a 2D array

First of all, I am quite a newbie in Python, so please forgive me if I don't see the wood for the trees. My question is about reading a huge file of float numbers and storing them in an array for fast mathematical post-processing.
Let's assume the file looks similar to this:
!!
-3.2297390 0.4474691 3.5690145 3.5976372 6.9002712 7.7787466 14.2159269 14.3291490
16.7660723 17.1258704 18.9469059 19.1716808 20.0700721 21.4088414
-3.2045361 0.4123081 3.5625981 3.5936954 6.8901539 7.7543415 14.2764611 14.3623976
16.7955934 17.1560337 18.9527369 19.1251184 20.0700709 21.3515145
-3.2317597 0.4494166 3.5799182 3.6005429 6.8838705 7.7661897 14.2576455 14.3295731
16.7550357 17.0986678 19.0187779 19.1687722 20.0288587 21.3818250
-3.1921346 0.3949598 3.5636878 3.5892085 6.8833690 7.7404542 14.3061281 14.3855389
16.8063645 17.1697110 18.9549920 19.1134580 20.0613223 21.3196066
Here there are 4 (nb) blocks of 14 (nk) float numbers each. I want them arranged in an array elements[nb][nk] so that I can easily loop over certain floats of the blocks.
Here is what I thought it should look like, but it doesn't work at all:
nb = 4
nk = 14
with open("datafile") as file:
    elements = []
    n = 0
    while '!!' not in file:
        while n <= (nb-1):
            elements.append([])
            current = map(float, file.read().split())  # here I would need something to assure only 14 (nk) floats are read in
            elements[n].append(current)
            n += 1
print(elements[0][1])
It would be great if you had some ideas and suggestions. Thanks!
EDIT:
Here is a datafile where the numbers follow one another with no clear separator between blocks. Here nb=2 and nk=160. How can I split the floats that are read in after every 160th number?
!!
-7.2578105433 -7.2578105433 -6.7774609392 -6.7774609392 -6.3343986693 -6.3343986693 -5.8537216826 -5.8537216826
-5.6031029888 -5.6031029888 -2.9103190893 -2.9103190893 -1.7962279174 -1.7962279174 -0.8136720023 -0.8136720023
-0.1418500769 -0.1418500769 2.9923464558 2.9923464558 3.5797768050 3.5797768050 3.8793240270 3.8793240270
4.0774192689 4.0774192689 4.2378755781 4.2378755781 4.2707165126 4.2707165126 4.3290523910 4.3290523910
4.4487102661 4.4487102661 4.5341883539 4.5341883539 4.7946098470 4.7946098470 4.9518205998 4.9518205998
4.9592549825 4.9592549825 5.1648268937 5.1648268937 5.2372127454 5.2372127454 5.9377062691 5.9377062691
6.2971992823 6.2971992823 6.6324702419 6.6324702419 6.7948808733 6.7948808733 7.0835270703 7.0835270703
7.6252686579 7.6252686579 7.7886279100 7.7886279100 7.8514022664 7.8514022664 7.9188180854 7.9188180854
7.9661386138 7.9661386138 8.2830991934 8.2830991934 8.4581462733 8.4581462733 8.5537201519 8.5537201519
10.2738010533 10.2738010533 11.4495306517 11.4495306517 11.4819579346 11.4819579346 11.5788238984 11.5788238984
11.9411469341 11.9411469341 12.5006172267 12.5006172267 12.5055546075 12.5055546075 12.6659410418 12.6659410418
12.8741094000 12.8741094000 12.9560279595 12.9560279595 12.9780521671 12.9780521671 13.2195973082 13.2195973082
13.2339969658 13.2339969658 13.3594047155 13.3594047155 13.4530024795 13.4530024795 13.4556342387 13.4556342387
13.5784994631 13.5784994631 14.6887369915 14.6887369915 14.9019726334 14.9019726334 15.1279383300 15.1279383300
15.1953349879 15.1953349879 15.3209538297 15.3209538297 15.4042612992 15.4042612992 15.4528348692 15.4528348692
15.4542742538 15.4542742538 15.5291462589 15.5291462589 15.5415591416 15.5415591416 16.0741610117 16.0741610117
16.1117432607 16.1117432607 16.3566675522 16.3566675522 17.7569123657 17.7569123657 18.4416346230 18.4416346230
18.9525843134 18.9525843134 19.0591624486 19.0591624486 19.1069867477 19.1069867477 19.1853525353 19.1853525353
19.4020021909 19.4020021909 19.4718240723 19.4718240723 19.6384650104 19.6384650104 19.6919638323 19.6919638323
19.7044699790 19.7044699790 19.8851141335 19.8851141335 20.6132283388 20.6132283388 21.4074471478 21.4074471478
-7.2568288331 -7.2568280628 -6.7765483088 -6.7765429702 -6.3336003082 -6.3334841531 -5.8529872639 -5.8528369047
-5.6024822566 -5.6024743589 -2.9101060346 -2.9100930470 -1.7964872791 -1.7959333994 -0.8153333579 -0.8144924713
-0.1440078470 -0.1421444935 2.9869228390 2.9935342026 3.5661875018 3.5733148387 3.8777649741 3.8828300867
4.0569348321 4.0745074351 4.2152251981 4.2276050415 4.2620483420 4.2649182323 4.3401804124 4.3402590222
4.4446178512 4.4509411587 4.5139270348 4.5526439516 4.7788285567 4.7810706248 4.9282976775 4.9397807768
4.9737752749 4.9900180286 5.1456209436 5.1507667583 5.2528363215 5.2835144984 5.9252188817 5.9670441193
6.2699491148 6.3270140700 6.5912060019 6.6576016532 6.7976670773 6.7982056614 7.0789050974 7.1023337244
7.6182108739 7.6309688587 7.7678148773 7.7874194913 7.8544608005 7.8594983757 7.9019395451 7.9100447766
7.9872550937 7.9902791771 8.2617740182 8.3147140843 8.4533756827 8.4672364683 8.5556163680 8.5558640539
10.2756173692 10.2760227976 11.4344757209 11.4355375519 11.4737803653 11.4760186102 11.5914333288 11.5953932241
11.9369518613 11.9380900159 12.4973099542 12.5002401499 12.5030167542 12.5031963862 12.6629548222 12.6634150863
12.8719844312 12.8728126622 12.9541436501 12.9568445777 12.9762780998 12.9764840239 13.2074024551 13.2108294169
13.2279146175 13.2308902307 13.3780648962 13.3839050348 13.4634576072 13.4650575047 13.4701414823 13.4718238883
13.5901622459 13.5971076111 14.6735704782 14.6840793519 14.8963924604 14.8968395615 15.1163287408 15.1219631271
15.1791724308 15.1817299995 15.2628531102 15.3027136606 15.3755066968 15.3802521520 15.3969012144 15.4139294088
15.5131322524 15.5315039463 15.5465532500 15.5629105034 15.5927166831 15.5966393750 16.0841067052 16.0883417123
16.1224821534 16.1226510159 16.3646268213 16.3665839987 17.7654543366 17.7657216551 18.4305335335 18.4342292730
18.9110142692 18.9215889808 18.9821593138 18.9838270736 19.1633959849 19.1637558341 19.2040877093 19.2056062802
19.3760597529 19.3846323861 19.4323552578 19.4329488797 19.6494790293 19.6813374885 19.6943820824 19.7202356536
19.7381237231 19.7414645409 19.9056461663 19.9197428869 20.6239183178 20.6285756411 21.4127637743 21.4128909767
This should work:
elements = []
with open("datafile") as file:
    next(file)
    for line in file:
        elements.append([float(x) for x in line.split()])
next(file) skips the first line (the !! marker). Then for line in file: iterates over all other lines. The list comprehension [float(x) for x in line.split()] goes through all entries in the line, split by whitespace. Finally, elements.append() appends this list to elements, which becomes a list of lists that you can treat as a 2D array.
Access the first entry in the first line:
>>> elements[0][0]
-3.229739
or the last entry in the last line:
>>> elements[3][13]
21.319606
alternatively:
>>> elements[-1][-1]
21.319606
Update
This reads the file into a list of lists without taking line breaks as special:
nb = 2
nk = 160
with open("datafile") as fobj:
    all_values = iter(x for x in fobj.read().split())
    next(all_values)
    elements = []
    for x in range(nb):
        elements.append([float(next(all_values)) for counter in range(nk)])
If you like nested list comprehensions:
with open("datafile") as fobj:
all_values = iter(x for x in fobj.read().split())
next(all_values)
elements = [[float(next(all_values)) for counter in range(nk)] for x in range(nb)]

Extracting and processing data from a txt file

I am a beginner in Python (and in programming in general). I have a large file containing repeating groups of 3 lines with numbers, then 1 empty line, and so on...
If I print the file, it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers (so that it forms one vector), do some math operations with them, write the result to a new file, and move on to the next three lines, i.e. the next vector. Here is my code (it doesn't work):
import math
inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get a "vector" from every 3 lines in my file and put it into the blabla.txt output file. Thanks a lot!
My 'code comment' answer:
take care to close all parentheses, in order to match the opened ones! (this is very likely to raise a SyntaxError ;-) )
fin is created as an empty list and is never filled. Trying to access any value with fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are not created but used, this is very likely to break with a NameError;
b is created as a copy of a, so is a list. But you do a fin.append(b): what do you expect in this case by appending (not extending) a list?
Hope this helps!
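To make those points concrete, here is a minimal sketch (untested, and assuming the input really is groups of three numbers separated by a blank line) that avoids the issues listed above:
import math

with open("data.txt") as inF, open("blabla.txt", "w") as outF:
    block = []
    for line in inF:
        stripped = line.strip()
        if stripped:                         # collect the non-empty lines
            block.append(float(stripped))
        if len(block) == 3:                  # one complete triple
            h1, k1, l1 = block
            norm = math.sqrt(h1*h1 + k1*k1 + l1*l1) + 1
            vector = [h1/norm, k1/norm, l1/norm]
            outF.write(" ".join(str(v) for v in vector) + "\n")
            block = []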
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.

Importing big tecplot block files in python as fast as possible

I want to import into Python some ASCII files (from Tecplot, software for CFD post-processing).
The rules for those files (at least, for those that I need to import) are:
The file is divided into several sections
Each section has two lines as a header, like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variables given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continuous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new to Python and I have written code to import the data into a dictionary, storing the variables as 3D numpy arrays. Those files can be very big (up to GB). How can I make this code faster (or, more generally, how can I import such files as fast as possible)?
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """function"""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
    strVARIABLES = re.findall('VARIABLES', filelist[count])
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone+1
    data[countzone] = {key: [] for key in variables}
    count = count+1
    strI = re.findall('I=....', filelist[count])
    strI = re.findall('\d+', strI[0])
    I = int(strI[0])
    ##
    strJ = re.findall('J=....', filelist[count])
    strJ = re.findall('\d+', strJ[0])
    J = int(strJ[0])
    ##
    strK = re.findall('K=....', filelist[count])
    strK = re.findall('\d+', strK[0])
    K = int(strK[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // NumberCol
    if pr % NumberCol != 0:
        lin = lin+1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count+1
            temp = map(float, filelist[count].split())
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*6+iii]), temp[iii])
        data[countzone][key] = init
    count = count+1
PS: in Python only, no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count+lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((I, J, K), order='F')
    data[countzone][key] = ar
    count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather badly with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
If you use regular expressions here, there are two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...
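For illustration, a hedged sketch of the compiled, grouped regex described above (the function and variable names are made up):
import re

# One compiled regex that captures I, J and K from a ZONE header line.
ijk_re = re.compile(r'I=(\d+),\s*J=(\d+),\s*K=(\d+)')

def parse_ijk(header_line):
    match = ijk_re.search(header_line)
    I, J, K = (int(g) for g in match.groups())
    return I, J, K

# e.g. parse_ijk('ZONE T="...", I=29, J=17, K=25, F=BLOCK') -> (29, 17, 25)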

using python to search extremely large text file

I have a large 40-million-line, 3-gigabyte text file (which probably won't fit in memory) in the following format:
399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......
Each line starts off with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:
Given a number x and a range y, find all the lines whose number is within y of x. For example, if x=20 and y=5, I need to find all lines whose number is between 15 and 25.
Store these lines into another separate file.
What would be an efficient method to do this without having to trawl through the entire file?
If you don't want to generate a database ahead of time for line lengths, you can try this:
import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek to first character c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    # the leading field is a float in the question's data
    return (float(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName)/2.
with open(fileName, 'r') as f:
    while True:
        f.seek(int(step), 1)
        seekTo(f, '\n')
        row = parseRow(f.readline())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                with open('outputFile.txt', 'w') as fo:
                    for row in f:
                        row = parseRow(row)
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step /= 2.
                step = step * -1 if step < 0 else step
        else:
            step /= 2.
            step = step * -1 if step > 0 else step
It starts by performing a binary search on the file until it is near (within maxRowOffset of) the row to find. Then it starts reading every line until it finds one whose number is at least x-y. That line, and every line after it, are written to an output file until a line is found whose number is greater than x+y, at which point the program exits.
I tested this on a 1,000,000 line file and it runs in 0.05 seconds. Compare this to reading every line which took 3.8 seconds.
You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
One solution is to dump the table into a database (such as SQLite) with two columns, one for the number and one for all the other data (assuming that the data is guaranteed to fit into whatever the maximum number of characters allowed in a single column in your database is). Then index the number column and you're good to go.
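A rough sketch of that SQLite idea (untested; the table and column names are made up):
import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (num REAL, rest TEXT)')
with open('longFile.txt') as f:
    for line in f:
        num, rest = line.split(None, 1)
        conn.execute('INSERT INTO rows VALUES (?, ?)', (float(num), rest))
conn.execute('CREATE INDEX IF NOT EXISTS idx_num ON rows (num)')
conn.commit()

# All lines whose number lies within y of x:
x, y = 2000, 25
hits = conn.execute('SELECT num, rest FROM rows WHERE num BETWEEN ? AND ?',
                    (x - y, x + y)).fetchall()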
Without a database, you could read through the file once and create an in-memory data structure containing pairs of (number, line-offset) values. You calculate the line offset by adding up the lengths of each row (including the line ending). Now you can binary search these value pairs on number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later re-use.
This reads the entire file (which you said you don't want to do), but does so only once to build the index. After that you can execute as many requests against the file as you want and they will be very fast.
Note that this second solution is essentially creating a database index on your text file.
Rough code to create the index in the second solution:
import pickle

offset = 0
index = []  # probably a better structure to use than a list
f = open(filename)
for row in f:
    nbr = float(row.split(' ')[0])
    index.append([nbr, offset])
    offset += len(row)  # len(row) already includes the '\n' line ending
pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use
Now, you can perform a binary search on the list. There's probably a much better data structure to use for accruing the index values than a list, but I'd have to read up on the various collection types.
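For completeness, a small sketch of the lookup step (untested; it assumes the (number, offset) index built above, a file with '\n' line endings, and the x/y range from the question):
import bisect
import pickle

index = pickle.load(open('filename.idx', 'rb'))
numbers = [n for n, _ in index]                 # sorted keys for bisect

def find_lines(filename, x, y):
    # Return all lines whose leading number lies within y of x.
    results = []
    start = bisect.bisect_left(numbers, x - y)
    with open(filename, 'rb') as f:             # binary mode so seek() takes byte offsets
        for nbr, offset in index[start:]:
            if nbr > x + y:
                break
            f.seek(offset)
            results.append(f.readline().decode())
    return results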
Since you want to match the first field, you can use gawk:
$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file
Edit: Taking a file with 261,775,557 lines that is 2.5 GiB big, searching for lines 50,010,015 to 50,010,025, this takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz. Sounds good enough to me.
In order to find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find that line. No other way, i.e. all data in the file has to be read and parsed for newline characters.
We have to run this search up to the first line that exceeds your upper limit and stop. Hence, it helps that the file is already sorted. This code will hopefully help:
with open(outpath, 'w') as outfile:
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break
I think theoretically there is no other option.
