I would like to use Python to search a very large text file of SHA-1 hashes that is sorted by hash value. The file is 10 GB and has 500,000,000 lines. Each line looks like this:
000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27
I want to check whether a given hash value occurs in the file. I tried binary search, but it only works with a small test file. With the 10 GB file the search takes far too long, and the process is eventually killed because it exceeds the 16 GB of RAM.
f=open('testfile.txt', 'r')
text=f.readlines()
data=text
#print data
x = '000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27'
def binarySearch(data, l, r, x):
    while l <= r:
        mid = l + (r - l)/2;
        # Check if x is present at mid
        if data[mid] == x:
            return mid
        # If x is greater, ignore left half
        elif data[mid] < x:
            l = mid + 1
            #print l
        # If x is smaller, ignore right half
        else:
            r = mid - 1
            #print r
    # If we reach here, then the element
    # was not present
    return -1
result = binarySearch(data,0, len(data)-1, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
Is there a way to load the 10GB text file once into RAM and access it over and over again? I have 16GB RAM available. That should be enough, right?
Is there anything else I could do to speed up the search? Unfortunately, I am out of ideas.
Take your sample input, saved as input.txt, as shown below:
000000005AD76BD555C1D6D771DE417A4B87E4B4:4
00000000A8DAE4228F821FB418F59826079BF368:3
00000000DD7F2A1C68A35673713783CA390C9E93:630
00000001E225B908BAC31C56DB04D892E47536E0:5
00000006BAB7FC3113AA73DE3589630FC08218E7:2
00000008CD1806EB7B9B46A8F87690B2AC16F617:4
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38:16
0000000CAEF405439D57847A8657218C618160B2:15
0000000FC1C08E6454BED24F463EA2129E254D43:40
Then remove the counts so that your file becomes (in.txt below):
000000005AD76BD555C1D6D771DE417A4B87E4B4
00000000A8DAE4228F821FB418F59826079BF368
00000000DD7F2A1C68A35673713783CA390C9E93
00000001E225B908BAC31C56DB04D892E47536E0
00000006BAB7FC3113AA73DE3589630FC08218E7
00000008CD1806EB7B9B46A8F87690B2AC16F617
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38
0000000CAEF405439D57847A8657218C618160B2
0000000FC1C08E6454BED24F463EA2129E254D43
This ensures that every entry has a fixed size.
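For a 10 GB file, the counts would have to be removed without loading everything into memory. A minimal sketch of one way to do that, streaming line by line; the helper name strip_counts is made up for illustration, and it assumes every line has the form HASH:COUNT:

```python
def strip_counts(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the hash before ':' on each line.

    strip_counts is a hypothetical helper name. The file is processed one
    line at a time, so memory use stays small even for very large inputs.
    """
    with open(src_path, 'r') as src, open(dst_path, 'w', newline='\n') as dst:
        for line in src:
            digest = line.split(':', 1)[0].strip()
            if digest:  # skip blank lines
                dst.write(digest + '\n')
```

Calling strip_counts('input.txt', 'in.txt') would produce the file shown above, with every record exactly 41 bytes long (40 hex characters plus a newline).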
Now you can use an mmap-based file reading approach, as described at https://docs.python.org/3/library/mmap.html
import mmap
FIELD_SIZE = 40 + 1  # 40 hex characters plus the newline separator
def binarySearch(mm, l, r, x):
    while l <= r:
        mid = l + (r - l) // 2  # integer division keeps mid an int
        # Read the fixed-size record at index mid
        mid_slice = mm[mid*FIELD_SIZE:(mid+1)*FIELD_SIZE]
        mid_slice = mid_slice.decode('utf-8').strip()
        # Check if x is present at mid
        if mid_slice == x:
            return mid
        # If x is greater, ignore left half
        elif mid_slice < x:
            l = mid + 1
        # If x is smaller, ignore right half
        else:
            r = mid - 1
    # If we reach here, then the element
    # was not present
    return -1
x = '0000000CAEF405439D57847A8657218C618160B2'
with open('in.txt', 'rb') as f:
    # Map the file read-only; pages are loaded on demand
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Index of the last record (assumes every line ends with a newline)
    last = len(mm) // FIELD_SIZE - 1
    result = binarySearch(mm, 0, last, x)
    mm.close()
if result != -1:
    print("Element is present at index %d" % result)
else:
    print("Element is not present in array")
OUTPUT:
$ python3 find.py
Element is present at index 8
Since the file is not read completely into memory, there won't be out-of-memory errors.
EDIT:
Thanks for fixing it! Unfortunately, it messed up the logic. I'll explain what this program does. It's a solution to a task about a playing-card trick. There are N cards on the table. First and Second are the numbers on the front and back of each card. The trick can only be done if the visible numbers are in non-decreasing order. Someone from the audience can come and swap the places of cards. M is how many swaps will be made. A and B are the positions of the two cards being swapped. The magician can flip any number of cards to show the other side. The program must tell whether the magician can do the trick.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    for _ in range(n):
        first, second = (int(x) for x in data.readline().split(':'))
        first, second = sorted((first, second))
        pairs.append(Pair(first, second)) # add to the list by appending
    m = data.readline()
    m = int(m)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
        p = -1e-9
        ok = True
        for k in range(0, n):
            if pairs[k].first >= p:
                p = pairs[k].first
            elif pairs[k].second >= p:
                p = pairs[k].second
            else:
                ok = False
                break
        if ok:
            results.write("YES\n")
        else:
            results.write("NO\n")
data:
4
2:5
3:4
6:3
2:7
2
3-4
1-3
results:
YES
YES
YES
YES
YES
YES
YES
What should be in results:
NO
YES
The code is full of bugs: you should write and test it incrementally instead of all at once. It seems that you started using readlines (which is a good way of managing this kind of work) but kept the rest of the code in a read-one-line-at-a-time style. If you use readlines, the line for i, line in enumerate(data): should be changed to for i, line in enumerate(lines):.
Anyway, here is a corrected version with some explanation. I hope I did not mess up the logic.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
# The following line created a huge list of "Pairs" types, not instances
# pairs = [Pair] * (2*200*1000+1)
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    # removing the reading of all data...
    # lines = data.readlines()
    # m = lines[n]
    # removed bad for: for i, line in enumerate(data):
    for _ in range(n): # you don't need the index
        first, second = (int(x) for x in data.readline().split(':'))
        # removed unnecessary recasting to int
        # first = int(first)
        # second = int(second)
        # changed the swapping to a more elegant way
        first, second = sorted((first, second))
        pairs.append(Pair(first, second)) # we add to the list by appending
    # removed unnecessary for: once you read all the first and seconds,
    # you reached M
    m = data.readline()
    m = int(m)
    # you don't need the index... indeed you don't need to count (you can read
    # to the end of file, unless it is malformed)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        # removed unnecessary recasting to int
        # a = int(a)
        # b = int(b)
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
        p = -1e-9
        ok = True
        for k in range(0, n):
            if pairs[k].first >= p:
                p = pairs[k].first
            elif pairs[k].second >= p:
                p = pairs[k].second
            else:
                ok = False
                break
        if ok:
            results.write("YES\n")
        else:
            results.write("NO\n")
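To make the logic easier to test separately from the file handling, the check could be pulled out into functions. This is only a sketch: the names can_do_trick and answers_after_swaps are made up for illustration, and it assumes (based on the expected results in the question) that one verdict is wanted after each swap.

```python
def can_do_trick(cards):
    """Return True if the cards can be flipped into non-decreasing order.

    cards is a list of (front, back) tuples. For each card, greedily show
    the smaller side if it is allowed, otherwise the larger side.
    """
    previous = float('-inf')
    for front, back in cards:
        low, high = sorted((front, back))
        if low >= previous:
            previous = low
        elif high >= previous:
            previous = high
        else:
            return False
    return True


def answers_after_swaps(cards, swaps):
    """Apply each 1-based swap (a, b) in turn and record YES/NO after it."""
    cards = list(cards)  # work on a copy
    answers = []
    for a, b in swaps:
        cards[a - 1], cards[b - 1] = cards[b - 1], cards[a - 1]
        answers.append("YES" if can_do_trick(cards) else "NO")
    return answers
```

With the data from the question, answers_after_swaps([(2, 5), (3, 4), (6, 3), (2, 7)], [(3, 4), (1, 3)]) gives ['NO', 'YES'], which matches what should be in results.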
Response prior to the edit
range(1, 1) is empty, so this part of the code:
for i in range (1, 1):
    n = data.readline()
    n = int(n)
does not define n, so when execution gets to line 12 you get an error.
You can remove the for statement, changing those three lines to:
n = data.readline()
n = int(n)
I have an .xyz output file from a molecular simulation program. I need to calculate the distances between two sets of atoms. The first line is the number of atoms (1046); the next line can be treated as a comment, which I don't need. Then come the coordinates of the atoms. The general layout of the file is as follows.
1046
i = 13641, time = 5456.400, E = -6478.1065220464
O -7.4658679231 -8.2711817669 -9.8539631371
H -7.6241360163 -9.2582538006 -9.9522290769
H -8.2358222851 -7.6941601822 -9.9653770757
O -4.9711266650 -4.7190696213 -15.2513827675
H -4.0601366272 -4.8452939622 -14.9451462873
H -5.3574156180 -5.6550789412 -15.1154558067
... ~ 1000 more lines
O -3.7163764338 -18.4917410571 -11.7137020838
H -3.3000068500 -18.5292231200 -12.6331415740
H -4.3493512803 -19.2443154891 -11.6751925772
1046
i = 13642, time = 5456.800, E = -6478.1027892656
O -7.4669935102 -8.2716185134 -9.8549232159
H -7.6152044024 -9.2599276969 -9.9641510528
H -8.2364333010 -7.6943001983 -9.9565217204
O -4.9709831745 -4.7179801609 -15.2530422573
H -4.0670595153 -4.8459686871 -14.9472675802
H -5.3460565517 -5.6569802374 -15.1037050119
...
The indices of the first set of atoms that I want to extract go 0,3,6,9,...,360. To account for the two header lines in the file, I implemented this as
w_index = list(range(2,362,3))
And the other set is only 8 atoms, which I again gave as a list.
c_index = [420,488,586,688,757,847,970,1031]
My idea is to append the corresponding lines to separate lists and then process them with the function 'dist_calculate_with_pbc_correction'.
def open_file(filename):
    global step
    waters = []
    carbons = []
    step=0
    with open(filename, 'r') as infile:
        for index, line in enumerate(infile):
            items = line.split()
            if index % (natoms+2) in w_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                waters.append([kind,x,y,z])
            if (index - 2)% (natoms+2) in c_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                carbons.append([kind,x,y,z])
            if index > 0 and index % (natoms+2) == natoms:
                dist_calculate_with_pbc_correction(carbons,waters)
                carbons, waters = [], []
                step+=1
And this is the function that does all the calculation and writes the results to a file.
def dist_calculate_with_pbc_correction(c,w):
    write_file(output_fn,'%s %r \n' % ('#Step:',step))
    for i in range(len(c)):
        hydration = 0
        min_d = 100
        for j in range(len(w)):
            x_diff, y_diff, z_diff = c[i][1] - w[j][1], c[i][2] - w[j][2], c[i][3] - w[j][3]
            if x_diff > pbc_a/2:
                x_diff -= pbc_a
            elif x_diff < -pbc_a/2:
                x_diff += pbc_a
            if y_diff > pbc_b/2:
                y_diff -= pbc_b
            elif y_diff < -pbc_b/2:
                y_diff += pbc_b
            if z_diff > pbc_c/2:
                z_diff -= pbc_c
            elif z_diff < -pbc_c/2:
                z_diff += pbc_c
            dist = math.sqrt(x_diff**2 + y_diff**2 + z_diff**2)
            if dist < min_d:
                min_d, min_index = dist, w_index[j]-2
            if dist < r_cutoff :
                hydration +=1
        write_file(output_fn,'%d %s %d %d \n' % (c_index[i], round(min_d,3), min_index, hydration))
def write_file(filename,out):
    with open(filename,'a') as g:
        g.write(out)
The problem is that this code works, but not fast enough (read: extremely slow). It takes about 27 minutes to go through the whole file of ~200K steps (~200M lines). I say 'extremely slow' because a colleague of mine wrote her own version in Fortran that does the same thing, if not more, and runs about 10 times faster. Her code processes the entire file in under 3 minutes, even though it computes the distances between all the atoms to find the smallest one, whereas I specify the atom indices up front. I am aware that most of the time is spent in the 'open_file' function, selecting the atoms and adding them to the corresponding lists, but I don't know how to improve that. Any help will be appreciated.
Here is a sample xyz file maker of carbons if you want to have a sample file handy.
import numpy as np
def sample_xyz(steps,natoms):
    with open('sample_xyz.xyz','w') as f:
        for step in range(steps):
            f.write(str(natoms)+'\n')
            f.write('i = {} \n'.format(step))
            for foo in range(natoms):
                line = 'C {} {} {}\n'.format(np.random.random()*10, np.random.random()*10, np.random.random()*10)
                f.write(line)
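One direction that might be worth trying for the selection step is sketched below. In open_file, every line is split and tested with "in" against a list, which is a linear scan per line; a set lookup and a frame-by-frame reader avoid both. The name iter_frames is made up for illustration, the sketch assumes atom indices count from 0 within each frame (after the two header lines), and it has not been timed against the real data.

```python
from itertools import islice


def iter_frames(lines, wanted):
    """Yield one dict per frame: {atom_index: (kind, x, y, z)}.

    lines  -- an iterable of text lines (for example an open file)
    wanted -- atom indices to keep, 0-based within each frame
    Only the wanted lines are split and converted to floats.
    """
    wanted = set(wanted)          # O(1) membership tests
    lines = iter(lines)
    for header in lines:
        if not header.strip():    # tolerate blank lines between frames
            continue
        natoms = int(header)      # first header line: number of atoms
        next(lines)               # second header line: comment, ignored
        frame = {}
        for atom_index, line in enumerate(islice(lines, natoms)):
            if atom_index in wanted:
                kind, x, y, z = line.split()
                frame[atom_index] = (kind, float(x), float(y), float(z))
        yield frame
```

A frame produced this way could then be split into the water and carbon lists and passed to dist_calculate_with_pbc_correction once per step. Keeping the output file open for the whole run, instead of reopening it in write_file for every line, may also help.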
As a starting point, I have a SHA-1 hash value. I want to compare it against a file full of hash values to see whether it is contained in that file.
More precisely:
f1=sha1 #value read in
fobj = open("Hashvalues.txt", "r") #open file with hash values
for f1 in fobj:
    print ("hash value found")
else:
    print("HashValue not found")
fobj.close()
The file is very large (11.1GB)
Is there a useful algorithm to perform the search as fast as possible? The hash values in the hash file are ordered by hashes.
I think comparing this line by line won't be the fastest way, will it?
EDIT:
I changed my code as follows:
f1="9bc34549d565d9505b287de0cd20ac77be1d3f2c" #value read in
with open("pwned-passwords-sha1-ordered-by-hash-v5.txt") as f:
    lineList = [line.rstrip('\n\r') for line in open("pwned-passwords-sha1-ordered-by-hash-v5.txt")]
def binarySearch(arr, l, r, x):
    while l <= r:
        mid = l + (r - l)/2;
        # Check if x is present at mid
        if arr[mid] == x:
            return mid
        # If x is greater, ignore left half
        elif arr[mid] < x:
            l = mid + 1
        # If x is smaller, ignore right half
        else:
            r = mid - 1
    # If we reach here, then the element
    # was not present
    return -1
# Test array
arr = lineList
x = "9bc34549d565d9505b287de0cd20ac77be1d3f2c" #value read in
# Function call
result = binarySearch(arr, 0, len(arr)-1, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
But it isn't as fast as I expected. Is my implementation correct?
EDIT2:
def binarySearch (l, r, x):
    # Check base case
    if r >= l:
        mid = l + (r - l)/2
        # If element is present at the middle itself
        if getLineFromFile(mid) == x:
            return mid
        # If element is smaller than mid, then it
        # can only be present in left subarray
        elif getLineFromFile(mid) > x:
            return binarySearch(l, mid-1, x)
        # Else the element can only be present
        # in right subarray
        else:
            return binarySearch(mid + 1, r, x)
    else:
        # Element is not present in the array
        return -1
x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'
def getLineFromFile(lineNumber):
    with open('testfile.txt') as f:
        for i, line in enumerate(f):
            if i == lineNumber:
                return line
        else:
            print('Not 7 lines in file')
            line = None
# get last Element of List
def tail():
    for line in open('pwned.txt', 'r'):
        pass
    else:
        print line
ausgabetail = tail()
#print ausgabetail
result = binarySearch( 0, ausgabetail, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
My problem now is getting the correct index for the right-hand side of the binary search. I pass the function (l, r, x). The left side starts at the beginning, with 0. The right side should be the end of the file, that is, the last line. I tried to get it with the function tail(), but it doesn't work: if I print r while testing, I get the value "None".
Do you have another idea here?
Looking at the code, I see that you are still reading all the lines from the file; this is indeed the bottleneck.
It's not the binary search.
Assuming that the hashes are sorted,
you can just read the number of lines in the file
and then perform the binary search. Instead of reading the entire file, you can use seek to jump to a particular position in the file (seek works on byte offsets, so this is simplest when every line has the same length); that way you will only be reading about log(n) lines. That should increase the speed.
Example
def binarySearch(l, r, x):
    ....
    #change arr[mid] with getLineFromFile(mid)
    ....
def getLineFromFile(lineNumber):
    with open('xxx.txt') as f:
        for i, line in enumerate(f):
            if i == lineNumber:
                return line
        else:
            print('Not 7 lines in file')
            line = None
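Note that getLineFromFile above still walks the file from the start on every call. A rough sketch of what a seek-based search could look like is given below; it searches over byte offsets instead of line numbers, so the line count is not needed and the lines may have different lengths. The names find_hash and _line_at_or_after are made up for illustration, and the sketch assumes the file is sorted in byte order, each line is HASH or HASH:COUNT, and the target uses the same letter case as the file.

```python
import os


def _line_at_or_after(f, offset):
    """Return the first complete line that starts at or after byte offset."""
    if offset == 0:
        f.seek(0)
    else:
        f.seek(offset - 1)
        f.readline()  # skip to the end of the line containing byte offset-1
    return f.readline()  # b'' at end of file


def find_hash(path, target):
    """Binary search a sorted hash file; return the matching line or None."""
    key = target.encode('ascii')
    with open(path, 'rb') as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has a hash >= key
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or line.split(b':', 1)[0].strip() >= key:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
    if line and line.split(b':', 1)[0].strip() == key:
        return line.decode('ascii').strip()
    return None
```

Each probe reads at most two lines, so a lookup touches only a few dozen lines even in a file with hundreds of millions of entries.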
I have a very large file of hashes and a given hash. I want to check whether the given hash is in the file.
I chose binary search for this. My current problem is finding the correct index for the rightmost element.
def binarySearch (l, r, x):
    # Check base case
    if r >= l:
        mid = l + (r - l)/2
        # If element is present at the middle itself
        if getLineFromFile(mid) == x:
            return mid
        # If element is smaller than mid, then it
        # can only be present in left subarray
        elif getLineFromFile(mid) > x:
            return binarySearch(l, mid-1, x)
        # Else the element can only be present
        # in right subarray
        else:
            return binarySearch(mid + 1, r, x)
    else:
        # Element is not present in the array
        return -1
x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'
def getLineFromFile(lineNumber):
    with open('testfile.txt') as f:
        for i, line in enumerate(f):
            if i == lineNumber:
                return line
        else:
            print('Not 7 lines in file')
            line = None
# get last Element of List
def tail():
    for line in open('pwned.txt', 'r'):
        pass
    else:
        print line
ausgabetail = tail()
#print ausgabetail
result = binarySearch( 0, ausgabetail, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
My problem now is getting the correct index for the right-hand side of the binary search. I pass the function (l, r, x). The left side starts at the beginning, with 0. The right side should be the end of the file, that is, the last line. I tried to get it with the function tail(), but it doesn't work: if I print r while testing, I get the value "None". Do you have another idea here?
Example as requested in comments above
def checkForHash(h, fname):
    with open(fname) as f:
        for i, line in enumerate(f):
            # Strip the trailing newline before comparing
            if h == line.rstrip('\r\n'):
                return i
    return -1
x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'
checkForHash(x, 'testfile.txt')
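As for the "None": tail() in the question prints the last line but has no return statement, so Python returns None from it. If the index of the last line is still wanted as the right-hand bound, it could be computed along these lines (last_line_index is a made-up name, and this still reads the whole file once):

```python
def last_line_index(path):
    """Return the 0-based index of the last line, or -1 for an empty file."""
    count = 0
    with open(path, 'rb') as f:
        # enumerate from 1 so that count ends up as the number of lines
        for count, _ in enumerate(f, start=1):
            pass
    return count - 1
```

The value returned can be passed as r, for example binarySearch(0, last_line_index('testfile.txt'), x).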
I was trying to solve the Broken Necklace problem from USACO and I came across this solution. The problem statement is here: https://train.usaco.org/usacoprob2?S=beads&a=c3sjno1crwH
I am confused about why the person who wrote this solution made 3 copies of the initial string, and about basically the entire for loop.
I have tried looking for other solutions online that might explain it better, but there are only a few Python solutions to this problem and many of them are completely different.
'''
ID: krishpa2
LANG: PYTHON3
TASK: beads
'''
with open('beads.in','r') as fin:
    N = int(fin.readline())
    beads = fin.readline()[:-1]
def canCollect(s):
    return not ('r' in s and 'b' in s)
beads = beads*3
max = 0
for p in range(N, N*2):
    i = p-1
    left = []
    while i > 0:
        if canCollect(left + [beads[i]]):
            left.append(beads[i])
            i -= 1
        else:
            break
    i = p
    right = []
    while i < 3*N - 1:
        if canCollect(right + [beads[i]]):
            right.append(beads[i])
            i+=1
        else:
            break
    result = len(left) + len(right)
    if result >= N:
        max = N
        break
    elif result > max:
        max = result
print(max)
with open('beads.out','w') as fout:
    fout.write(str(max) + '\n')
The program works correctly; I just want to understand why.
I know that this question is pretty old, but I still want to answer it for future readers, so I have made a fully commented version below (based on this answer by joshjq91):
"""
PROG: beads
LANG: PYTHON3
#FILE
"""
# Original file from Github by joshjq91
# (https://github.com/jadeGeist/USACO/blob/master/1.2.4-beads.py)
# Comments by Ayush
with open('beads.in','r') as filein:
    N = int(filein.readline())      # Number of beads
    beads = filein.readline()[:-1]  # Necklace
def canCollect(s):
    # If r and b are not both in s, then you can collect the beads in s.
    return not ('r' in s and 'b' in s)
# Wraparound - r actually can be shown as r r r (wraparound for the front
# and back)
beads = beads*3
max = 0  # The final result
# Loop through the 2nd copy of the bead string (so you can use the
# wraparounds for the front and back)
for p in range(N, N*2):
    i = p-1
    left = []
    while i > 0:  # Check if you can collect beads (left)
        if canCollect(left + [beads[i]]):  # Can collect
            left.append(beads[i])  # Add to left
            i -= 1  # Loop through again
        else:
            break  # Cannot collect more beads - break
    i = p  # You will otherwise have a duplicate bead (left is i=p-1)
    right = []
    # Check if you can collect beads (right) - i is kept below 3*N - 1,
    # the last index of the beads + wraparounds
    while i < 3*N - 1:
        #print("righti",i-N)  # for testing
        if canCollect(right + [beads[i]]):  # Can collect
            right.append(beads[i])  # Add to right
            i+=1  # Loop through again
        else:
            break  # Cannot collect more beads - break
    result = len(left) + len(right)  # Beads collected at this break point
    if result >= N:
        # A result of N or more means that the whole necklace can be
        # collected (EX: rwr)
        max = N
        break  # We don't need to go through again
    elif result > max:  # New best result
        max = result
with open('beads.out','w') as fileout:
    fileout.write(str(max) + '\n')  # Final result
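To see the tripling idea on its own, here is a compact sketch of the same approach written as a function, so it can be tried without beads.in. It is not the original author's code: run_length and max_collectable are names made up for illustration, and each run tracks a single color instead of rebuilding a list. The break point p always lies in the middle copy, so walking left or right from it never needs modular indexing, and the total is capped at N because the two runs can overlap once they wrap around the necklace.

```python
def run_length(tripled, start, step):
    """Count beads collectable from index start, moving by step (+1 or -1)."""
    color = 'w'  # no definite color picked yet; white matches either color
    count = 0
    i = start
    while 0 <= i < len(tripled):
        bead = tripled[i]
        if bead != 'w':
            if color == 'w':
                color = bead      # first colored bead fixes the run's color
            elif bead != color:
                break             # other color reached: stop collecting
        count += 1
        i += step
    return count


def max_collectable(beads):
    """Best number of beads collectable over all break points."""
    n = len(beads)
    tripled = beads * 3           # copy before and copy after the middle one
    best = 0
    for p in range(n, 2 * n):     # break between p-1 and p, in the middle copy
        total = run_length(tripled, p - 1, -1) + run_length(tripled, p, +1)
        best = max(best, min(total, n))  # cannot collect more than n beads
    return best
```

For the sample necklace from the problem statement, max_collectable('wwwbbrwrbrbrrbrbrwrwwrbwrwrrb') returns 11.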