As a starting point, I have a SHA-1 hash value. I want to check whether it is contained in a file full of hash values.
More precisely:
f1 = sha1  # value read in
fobj = open("Hashvalues.txt", "r")  # open file with hash values
for line in fobj:
    if line.strip() == f1:  # compare each line with the hash
        print("hash value found")
        break
else:
    print("HashValue not found")
fobj.close()
The file is very large (11.1 GB).
Is there a useful algorithm to perform the search as fast as possible? The hash values in the file are sorted by hash.
I think comparing line by line won't be the fastest way, will it?
EDIT:
I changed my code as follows:
f1 = "9bc34549d565d9505b287de0cd20ac77be1d3f2c"  # value read in
with open("pwned-passwords-sha1-ordered-by-hash-v5.txt") as f:
    lineList = [line.rstrip('\n\r') for line in f]
def binarySearch(arr, l, r, x):
    while l <= r:
        mid = l + (r - l) // 2
        # Check if x is present at mid
        if arr[mid] == x:
            return mid
        # If x is greater, ignore left half
        elif arr[mid] < x:
            l = mid + 1
        # If x is smaller, ignore right half
        else:
            r = mid - 1
    # If we reach here, then the element
    # was not present
    return -1
# Test array
arr = lineList
x = "9bc34549d565d9505b287de0cd20ac77be1d3f2c"  # value read in
# Function call
result = binarySearch(arr, 0, len(arr)-1, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
But it isn't as fast as I expected. Is my implementation correct?
EDIT2:
def binarySearch (l, r, x):
    # Check base case
    if r >= l:
        mid = l + (r - l)/2
        # If element is present at the middle itself
        if getLineFromFile(mid) == x:
            return mid
        # If element is smaller than mid, then it
        # can only be present in left subarray
        elif getLineFromFile(mid) > x:
            return binarySearch(l, mid-1, x)
        # Else the element can only be present
        # in right subarray
        else:
            return binarySearch(mid + 1, r, x)
    else:
        # Element is not present in the array
        return -1
x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'
def getLineFromFile(lineNumber):
    with open('testfile.txt') as f:
        for i, line in enumerate(f):
            if i == lineNumber:
                return line
        else:
            print('Not 7 lines in file')
            line = None
# get last Element of List
def tail():
    for line in open('pwned.txt', 'r'):
        pass
    else:
        print line
ausgabetail = tail()
#print ausgabetail
result = binarySearch( 0, ausgabetail, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
My problem now is getting the correct index for the right-hand bound of the binary search. I pass (l, r, x) to the function. The left bound starts at 0, the beginning of the file. The right bound should be the end of the file, i.e. the last line. I tried to get it with the function tail(), but it doesn't work: when I print r while testing, I get the value "None".
Do you have another idea here?
Looking at the code, I see that you are still reading all the lines from the file. That is the bottleneck, not the binary search.
Assuming that the hashes are sorted, you can first determine the number of lines in the file and then perform the binary search on line numbers. Instead of reading the entire file, you can use seek to jump to a particular position in the file; that way you only read about log(n) lines, which should increase the speed.
Example
def binarySearch(l, r, x):
    ....
    #change arr[mid] with getLineFromFile(mid)
    ....
def getLineFromFile(lineNumber):
    with open('xxx.txt') as f:
        for i, line in enumerate(f):
            if i == lineNumber:
                return line
        else:
            print('Not 7 lines in file')
            line = None
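The helper above still walks the file from the first line on every call, so each lookup costs a full pass. Here is a minimal sketch of a seek-based variant. It assumes the file is sorted by hash, that every line has the form HASH:count (as in the sample lines), and that the hashes are uppercase hex. The names find_hash_count and _line_at_or_after are hypothetical. It binary-searches over byte offsets rather than line numbers, so the number of lines does not need to be known in advance.

```python
import os

def _line_at_or_after(f, offset):
    """Return the first complete line starting at byte `offset` or later (b"" at EOF)."""
    if offset == 0:
        f.seek(0)
    else:
        # Step back one byte: if `offset` is already a line start, this
        # readline() consumes only the preceding newline.
        f.seek(offset - 1)
        f.readline()
    return f.readline()

def find_hash_count(path, sha1_hex):
    """Binary-search a sorted 'HASH:count' file; return the count or None."""
    target = sha1_hex.strip().upper().encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose following line has a key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or line.split(b":", 1)[0].strip() >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
    key, _, count = line.strip().partition(b":")
    if key != target:
        return None
    return int(count) if count else 0
```

Each probe reads at most two lines, so a lookup touches on the order of log2(file size) lines and memory use stays constant regardless of the file size.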
Related
I have a very large file of hashes and a given hash. I want to check whether the given hash is in the file.
I chose binary search for this. My current problem is finding the correct index for the rightmost element.
Example as requested in the comments above:
def checkForHash(h, fname):
    with open(fname) as f:
        for i, line in enumerate(f):
            # lines keep their trailing newline, so strip it before comparing
            if h == line.rstrip('\n'):
                return i
    return -1
x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'
checkForHash(x, 'testfile.txt')
I would like to use Python to search a very large text file in which SHA-1 hashes are sorted by hash value. The text file is 10 GB in size and has 500,000,000 lines. Each line looks like this:
000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27
I want to check whether a given hash value occurs in the file. I tried a binary search, but it only works with a small test file. With the 10 GB file the search takes far too long, and the process is eventually killed because it exceeds the 16 GB of RAM.
f=open('testfile.txt', 'r')
text=f.readlines()
data=text
#print data
x = '000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27'
def binarySearch(data, l, r, x):
    while l <= r:
        mid = l + (r - l)/2;
        # Check if x is present at mid
        if data[mid] == x:
            return mid
        # If x is greater, ignore left half
        elif data[mid] < x:
            l = mid + 1
            #print l
        # If x is smaller, ignore right half
        else:
            r = mid - 1
            #print r
    # If we reach here, then the element
    # was not present
    return -1
result = binarySearch(data,0, len(data)-1, x)
if result != -1:
    print "Element is present at index % d" % result
else:
    print "Element is not present in array"
Is there a way to load the 10GB text file once into RAM and access it over and over again? I have 16GB RAM available. That should be enough, right?
Is there anything else I could do to speed up the search? I have run out of ideas.
Take your sample input as input.txt, as below:
000000005AD76BD555C1D6D771DE417A4B87E4B4:4
00000000A8DAE4228F821FB418F59826079BF368:3
00000000DD7F2A1C68A35673713783CA390C9E93:630
00000001E225B908BAC31C56DB04D892E47536E0:5
00000006BAB7FC3113AA73DE3589630FC08218E7:2
00000008CD1806EB7B9B46A8F87690B2AC16F617:4
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38:16
0000000CAEF405439D57847A8657218C618160B2:15
0000000FC1C08E6454BED24F463EA2129E254D43:40
Then remove the counts so that your file becomes (in.txt below):
000000005AD76BD555C1D6D771DE417A4B87E4B4
00000000A8DAE4228F821FB418F59826079BF368
00000000DD7F2A1C68A35673713783CA390C9E93
00000001E225B908BAC31C56DB04D892E47536E0
00000006BAB7FC3113AA73DE3589630FC08218E7
00000008CD1806EB7B9B46A8F87690B2AC16F617
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38
0000000CAEF405439D57847A8657218C618160B2
0000000FC1C08E6454BED24F463EA2129E254D43
This will ensure you have fixed size for each entry.
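Removing the counts can itself be done in a streaming fashion, so the 10 GB input never has to fit in memory. This is only a sketch under the assumption that every input line has the form HASH:count; the function name strip_counts is hypothetical.

```python
def strip_counts(src_path, dst_path):
    """Copy src to dst, keeping only the hash part of each 'HASH:count' line."""
    # newline="\n" keeps every output record exactly len(hash) + 1 bytes,
    # even on platforms that would otherwise write "\r\n".
    with open(src_path, "r") as src, open(dst_path, "w", newline="\n") as dst:
        for line in src:
            digest = line.split(":", 1)[0].strip()
            if digest:  # skip blank lines
                dst.write(digest + "\n")
```

For example, strip_counts('input.txt', 'in.txt') would produce the fixed-width file shown above from the sample input.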
Now you can use an mmap-based approach to read the file, as described at https://docs.python.org/3/library/mmap.html
import mmap
import os
FIELD_SIZE = 40 + 1  # 40 hex characters plus the newline separator
def binarySearch(mm, l, r, x):
    while l <= r:
        mid = l + (r - l) // 2
        # Slice out the fixed-width record at index mid
        mid_slice = mm[mid*FIELD_SIZE:(mid+1)*FIELD_SIZE]
        mid_slice = mid_slice.decode('utf-8').strip()
        # Check if x is present at mid
        if mid_slice == x:
            return mid
        # If x is greater, ignore left half
        elif mid_slice < x:
            l = mid + 1
        # If x is smaller, ignore right half
        else:
            r = mid - 1
    # If we reach here, then the element
    # was not present
    return -1
x = '0000000CAEF405439D57847A8657218C618160B2'
with open('in.txt', 'rb') as f:
    # Map the file read-only; pages are loaded on demand, not all at once
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    f.seek(0, os.SEEK_END)
    size = f.tell()
    # Integer division: the record count must be an int to be usable as an index
    result = binarySearch(mm, 0, size // FIELD_SIZE, x)
    if result != -1:
        print("Element is present at index % d" % result)
    else:
        print("Element is not present in array")
OUTPUT:
$ python3 find.py
Element is present at index 8
Since the file is not read completely in memory, there won't be out of memory errors.
I'm a Python newbie and trying to write code that checks if a nested list has a valid set of numbers. Each row and each column have to be valid. I have written a function called check_sequence which validates if a list has a valid set of numbers. How would I call that function from another to check to see if the row is valid? So for example, I need something like this for check_rows:
check_sequence(list):
    checks if list is valid
check_rows(list):
    For each of the rows in the nested list call check_sequence
Here is my code for check_sequence:
def check_sequence(mylist):
    pos = 0
    sequence_counter = 1
    while pos < len(mylist):
        print "The pos is: " + " " + str(pos)
        print "The sequence_counter is:" + " " + str(sequence_counter)
        for number in mylist:
            print "The number is:" + " " + str(number)
            if number == sequence_counter:
                sequence_counter = sequence_counter + 1
                pos = pos + 1
                break
            else:
                # if list is at the last position on the last item
                if sequence_counter not in mylist:
                    print "The pos is:" + " " + str(pos) + " and the last position is:" + " " + str(mylist[len(mylist) - 1])
                    print "False"
                    return False
    print "True"
    return True
So I'd call the main method like below:
check_square([[1, 2, 3],
              [2, 3, 1],
              [3, 1, 2]])
def check_square(list):
    if check_rows() and check_columns() == True:
        return True
    else:
        return False
Here's a solution that'll work for any arbitrary 2D list.
from functools import reduce  # builtin on Python 2, needs this import on Python 3
l = [[1,2,3],[1,2],[1,4,5,6,7]]
try:
    if len([1 for x in reduce(lambda x, y: x + y, l) if type(x) != type(0)]) > 0:
        raise Exception
except Exception:
    pass  # error, do something
The idea is to flatten the list and then check, element by element, that each item's type is int.
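The same flatten-and-check idea can be written without building an intermediate list or raising an exception. The following is a sketch, not the answerer's code; the function name all_ints is hypothetical, and excluding bool mirrors the type(x) != type(0) comparison above.

```python
from itertools import chain

def all_ints(nested):
    """True if every element of every row is an int (bool is excluded)."""
    return all(isinstance(x, int) and not isinstance(x, bool)
               for x in chain.from_iterable(nested))
```

chain.from_iterable walks the rows lazily, and all(..) stops at the first element that fails the check.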
Given that the nested list is row oriented (the rows are the lowest dimension), you can simply use:
def check_rows(matrix):
    return all(check_sequence(sublist) for sublist in matrix)
Here we use the all(..) builtin: it evaluates to True if and only if every element produced by the generator expression is truthy; otherwise the result is False. So as soon as one of the rows is not valid, the matrix is not valid.
If on the other hand the nested list is column oriented (the columns are the lowest dimension), we first need to transpose it using zip:
def check_rows(matrix):
    return all(check_sequence(list(sublist)) for sublist in zip(*matrix))
The zip(*..) transposes the list, and we use list(..) to make sure that check_sequence(..) still receives lists (if any iterable is sufficient, the list(..) part can be omitted). The parameter is named matrix rather than list so that the list(..) builtin stays usable inside the function.
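Putting the row check and the transposed (column) check together gives the check_square the question asks for. This is a sketch: the check_sequence here is a stand-in that assumes "valid" means the row contains each of 1..n exactly once, which may differ from the asker's own function.

```python
def check_sequence(row):
    """Stand-in validity check (an assumption): row holds 1..n exactly once."""
    return sorted(row) == list(range(1, len(row) + 1))

def check_rows(matrix):
    return all(check_sequence(row) for row in matrix)

def check_columns(matrix):
    # zip(*matrix) yields the columns as tuples
    return all(check_sequence(list(col)) for col in zip(*matrix))

def check_square(matrix):
    return check_rows(matrix) and check_columns(matrix)
```

Because of the short-circuiting `and`, the columns are only inspected when every row has already passed.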
Are you looking for an iterative for loop?
def check_sequence(list):
    pass  # your check here
def check_rows(list):
    for row in list:
        if not check_sequence(row):
            return False
    return True
You have to split this into two functions, where the first one combines the results the second one returns for each row:
def check_sequence(lis):
    ret = True
    for row in lis:
        ret = ret and check_rows(row)
    return ret
def check_rows(row):
    ret = True
    for elem in row:
        pass  # do your checking
    return ret
A concrete example could be:
l = [[1,2,3],[1,2],[1,4,5,6,7]]
def check_sequence(lis):
    ret = True
    for row in lis:
        ret = ret and check_rows(row)
    return ret
def check_rows(row):
    return 1 in row  # ask if 1 belongs to the list
check_sequence(l)  # ---> True
check_sequence([[1],[2,3]])  # ---> False
I need to find the first missing number in a list. If there is no number missing, the next number should be the last +1.
It should first check to see if the first number is > 1, and if so then the new number should be 1.
Here is what I tried. The problem is here: if next_value - items > 1:
results in an error because at the end and in the beginning I have a None.
list = [1,2,5]
vlans3=list
for items in vlans3:
    if items in vlans3:
        index = vlans3.index(items)
        previous_value = vlans3[index-1] if index -1 > -1 else None
        next_value = vlans3[index+1] if index + 1 < len(vlans3) else None
        first = vlans3[0]
        last = vlans3[-1]
        #print ("index: ", index)
        print ("prev item:", previous_value)
        print ("-cur item:", items)
        print ("nxt item:", next_value)
        #print ("_free: ", _free)
        #print ("...")
        if next_value - items > 1:
            _free = previous_value + 1
            print ("free: ",_free)
            break
print ("**************")
print ("first item:", first)
print ("last item:", last)
print ("**************")
Another method:
L = vlans3
free = ([x + 1 for x, y in zip(L[:-1], L[1:]) if y - x > 1][0])
This gives the correct number if there is a gap between the numbers, but if there is no gap an error occurs: IndexError: list index out of range. I need to specify somehow that if there is no free number it should give a new number (last + 1). But the code below gives an error and I do not know why.
if free = []:
    print ("no free")
else:
    print ("free: ", free)
To get the smallest integer in the range spanned by vlans3 that is not a member of vlans3:
ints_list = range(min(vlans3), max(vlans3) + 1)
missing_list = [x for x in ints_list if x not in vlans3]
first_missing = min(missing_list)
However, you want to return 1 if the smallest value in your list is greater than 1, and the last value + 1 if there are no missing values, so this becomes:
ints_list = [1] + list(range(min(vlans3), max(vlans3) + 2))
missing_list = [x for x in ints_list if x not in vlans3]
first_missing = min(missing_list)
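The test x not in vlans3 scans the whole list for every candidate, so the comprehension above takes quadratic time on long lists. A sketch of the same rule using a set for membership tests follows. It assumes "first missing" means the smallest positive integer not in the list, which covers all three cases in the question; the name first_missing is reused here as a function name for illustration.

```python
def first_missing(values):
    """Smallest positive integer that does not occur in values."""
    seen = set(values)  # average O(1) membership tests
    candidate = 1
    while candidate in seen:
        candidate += 1
    return candidate
```

For the list from the question, first_missing([1, 2, 5]) gives 3; a list starting above 1 gives 1, and a list without gaps gives last + 1.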
First, avoid using list as a variable name: it is not a reserved word, but it shadows the builtin list type.
Second, use try/except to handle this kind of issue quickly and neatly.
def free(l):
    if l == []: return 0
    if l[0] > 1: return 1
    if l[-1] - l[0] + 1 == len(l): return l[-1] + 1
    for i in range(len(l)):
        try:
            if l[i+1] - l[i] > 1: break
        except IndexError:
            break
    return l[i] + 1
How about a numpy solution? The code below works if your input is a sorted list of distinct positive integers (or is empty).
nekomatic's solution is a bit faster for small inputs, but the difference is a fraction of a second and doesn't really matter. However, it does not scale to large inputs: e.g. list(range(1,100000)) effectively freezes in the list comprehension with its inclusion check. The code below does not have this issue.
import numpy as np
def first_free_id(array):
    # np.int was removed in NumPy 1.24; the builtin int is the replacement
    array = np.concatenate((np.array([-1, 0], dtype=int), np.array(array, dtype=int)))
    where_sequence_breaks = np.where(np.diff(array) > 1)[0]
    return where_sequence_breaks[0] if len(where_sequence_breaks) > 0 else array[-1] + 1
Prepend -1 and 0 to the array so that np.diff works for empty and 1-element lists without breaking the continuity of the existing sequence.
Compute the differences between consecutive values. The discontinuity being sought (the "hole") is where the difference is bigger than 1.
If there are any "holes", return the id of the first one; otherwise return the integer following the last element. Because of the two prepended values, the index of the first break is equal to the first free id.
For example I have a non-ordered list of values [10, 20, 50, 200, 100, 300, 250, 150]
I have this code which returns the next greater value:
def GetNextHighTemp(self, temp, templist):
    target = int(temp)
    list = []
    for t in templist:
        if t != "":
            list.append(int(t))
    return str(min((abs(target - i), i) for i in list)[1])
e.g. If temp = 55, it will return '100'.
But how can I get the lesser of the value? That is how to get it to return '50'?
Thank you.
EDIT - now working
def OnTWMatCurrentIndexChanged(self):
    self.ClearTWSelectInputs()
    material = self.cb_TW_mat.currentText()
    temp = self.txt_design_temp.text()
    if material != "":
        Eref = self.GetMaterialData(material, "25", "elast")
        if Eref and Eref != "":
            Eref = str(float(Eref) / 1000000000)
            self.txt_TW_Eref.setText(Eref)
        else:
            self.txt_TW_Eref.setText("194.8")
            self.ShowMsg("No temperature match found for E<sub>ref</sub> in material data file. Value of 194.8 GPa will be used.", "blue")
    if material != "" and temp != "":
        if self.CheckTWTemp(material, temp):
            dens = self.GetMaterialData(material, temp, "dens")
            self.txt_TW_dens.setText(dens)
            elast = self.GetMaterialData(material, temp, "elast")
            elast = str(float(elast) / 1000000000)
            self.txt_TW_Et.setText(elast)
            stress = self.GetMaterialData(material, temp, "stress")
            stress = str(float(stress) / 1000000)
            self.txt_TW_stress_limit.setText(stress)
        else:
            self.ShowMsg("No temperature match found for " + temp + "° C in material data file. Extrapolated data will be used where possible or add new material data.", "blue")
            dens = self.GetExtrapolatedMaterialData(material, temp, "dens")
            self.txt_TW_dens.setText(dens)
            elast = self.GetExtrapolatedMaterialData(material, temp, "elast")
            elast = str(float(elast) / 1000000000)
            self.txt_TW_Et.setText(elast)
            stress = self.GetExtrapolatedMaterialData(material, temp, "stress")
            stress = str(float(stress) / 1000000)
            self.txt_TW_stress_limit.setText(stress)
    else:
        self.ClearTWSelectInputs()
def CheckTWTemp(self, matvar, tempvar):
    for material in self.materials:
        if material.attrib["name"] == matvar:
            temps = material.getiterator("temp")
            for temp in temps:
                if int(temp.text) == int(tempvar):
                    return True
    return False
def GetMaterialData(self, matvar, tempvar, tag):
    for material in self.materials:
        if material.attrib["name"] == matvar:
            temps = material.getiterator("temp")
            for temp in temps:
                if temp.text == tempvar:
                    value = temp.find(tag)
                    return value.text
def GetExtrapolatedMaterialData(self, matvar, tempvar, tag):
    try:
        templist = QStringList()
        for material in self.materials:
            if material.attrib["name"] == matvar:
                temps = material.getiterator("temp")
                for temp in temps:
                    templist.append(temp.text)
        templist.sort()
        target = int(tempvar)
        x1 = max(int(t) for t in templist if t != '' and int(t) < target)
        x2 = min(int(t) for t in templist if t != '' and int(t) > target)
        y1 = float(self.GetMaterialData(matvar, str(x1), tag))
        y2 = float(self.GetMaterialData(matvar, str(x2), tag))
        x = target
        y = y1 - ((y1 - y2) * (x - x1) / (x2 - x1))
        return str(y)
    except Exception, inst:
        return "0"
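The formula in GetExtrapolatedMaterialData is a linear interpolation between the nearest tabulated temperature below the target and the nearest one above it. Stripped of the GUI and XML handling, the calculation can be sketched as follows. The function name interpolate and the {temperature: value} dictionary are assumptions made for illustration, not part of the original program.

```python
def interpolate(table, x):
    """Linearly interpolate a value at x from a {temperature: value} mapping."""
    if x in table:
        return table[x]
    # max()/min() raise ValueError if x lies outside the tabulated range
    x1 = max(t for t in table if t < x)  # nearest tabulated point below x
    x2 = min(t for t in table if t > x)  # nearest tabulated point above x
    y1, y2 = table[x1], table[x2]
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
```

Unlike the method above, which swallows every exception and returns "0", this sketch lets the ValueError propagate when the target lies below the lowest or above the highest tabulated temperature.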
A cleaner way is to use the bisect module, which does a binary search. You need to sort the list first, so the speed benefit shows up when the same sorted list is queried more than once. Here is a sample usage:
import bisect
mylist = [10, 20, 50, 200, 100, 300, 250, 150]
mylist.sort()
index = bisect.bisect(mylist, 55)
print "Greater than target", mylist[index]
print "Smaller than or equal to target", mylist[index-1]
output:
Greater than target 100
Smaller than or equal to target 50
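You also need to check the returned index: if it is 0, the target is lower than the lowest value in the list (and mylist[index-1] would wrap around to the last element); if it equals len(mylist), the target is greater than or equal to the highest value (and mylist[index] would raise an IndexError).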
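The boundary checks can be folded into a small helper. This is a sketch with a hypothetical function name, neighbours; it returns None on whichever side has no value and uses strict comparisons on both sides.

```python
import bisect

def neighbours(values, target):
    """Return (largest value < target, smallest value > target); None if absent."""
    ordered = sorted(values)
    i = bisect.bisect_left(ordered, target)   # first index with value >= target
    j = bisect.bisect_right(ordered, target)  # first index with value > target
    lower = ordered[i - 1] if i > 0 else None
    higher = ordered[j] if j < len(ordered) else None
    return lower, higher
```

With the list from the question, neighbours([10, 20, 50, 200, 100, 300, 250, 150], 55) returns (50, 100).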
Edit: Ah, I used templist instead of list -- hence the confusion. I didn't mean it to be a one-line function; you still have to do the conversions. (Of course, as Mike DeSimone rightly points out, using list as a variable name is a terrible idea!! So I had a good reason for being confusing. :)
To be more explicit about it, here's a slightly streamlined version of the function (fixed to test properly for an empty list):
def GetNextHighTemp(self, temp, templist):
    templist = (int(t) for t in templist if t != '')
    templist = [t for t in templist if t < int(temp)]
    if templist: return max(templist)
    else: return None  # or raise an error
Thanks to Mike for the suggestion to return None in case of an empty list -- I like that.
You could shorten this even more like so:
def GetNextHighTemp(self, temp, templist):
    try: return str(max(int(t) for t in templist if t != '' and int(t) < int(temp)))
    except ValueError: return None  # or raise a different error
nextHighest = lambda seq,x: min([(i-x,i) for i in seq if x<=i] or [(0,None)])[1]
nextLowest = lambda seq,x: min([(x-i,i) for i in seq if x>=i] or [(0,None)])[1]
Here's how this works: Looking at nextHighest, the argument to min is a list comprehension, that calculates the differences between each value in the list and the input x, but only for those values >= x. Since you want the actual value, then we need the list elements to include both the difference to the value, and the actual value. Tuples are compared value by value, left-to-right, so the tuple for each value i in the sequence becomes (i-x,i) - the min tuple will have the actual value in the [1]'th element.
If the input x value is outside the range of values in seq (or if seq is just empty), then the list comprehension gives us an empty list, which would raise a ValueError in min. To guard against this, we add the or [(0,None)] term inside the argument to min. An empty list evaluates to False, so in that case min instead looks at the sequence containing the single tuple (0,None). Its [1]'th element is None, indicating that there were no elements in seq higher than x.
Here are some test cases:
>>> t = [10, 20, 50, 200, 100, 300, 250, 150]
>>> print nextHighest(t,55)
100
>>> print nextLowest(t,55)
50
>>> print nextHighest([],55)
None
>>> print nextLowest([],55)
None
>>> print nextHighest(t,550)
None
Let the unordered list be myList:
answer = max(x for x in myList if x < temp)
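Note that this one-liner raises a ValueError when no element is smaller than temp. On Python 3.4 or later, the default keyword of max avoids that. A minimal sketch, with a hypothetical function name:

```python
def next_lower(values, target):
    """Greatest value strictly below target, or None if there is none."""
    return max((v for v in values if v < target), default=None)
```

The list does not need to be sorted, since max scans every qualifying element once.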
If I understand you correctly, you want the greatest value that is less than your target; e.g. in your example, if your target is 55, you want 50, but if your target is 35, you want 20. The following function should do that:
def get_closest_less(lst, target):
    lst.sort()
    ret_val = None
    previous = lst[0]
    if previous <= target:
        # Start from the smallest value; this also covers the case
        # where no larger element follows
        ret_val = previous
        # Include the last index so the final element is examined too
        for ndx in range(1, len(lst)):
            if lst[ndx] > target:
                break
            ret_val = lst[ndx]
    return str(ret_val)
If you need to step through these values, you could use a generator to get the values in succession:
def next_lesser(l, target):
    for n in l:
        if n < target:
            yield str(n)
Both of these worked properly from within a simple program.
a=[4,3,8,2,5]
temp=4
def getSmaller(temp,alist):
    alist.sort()
    for i in range(len(alist)):
        if(i>0 and alist[i]==temp):
            print alist[i-1]
        elif(i==0 and alist[i]==temp):
            print alist[i]
getSmaller(temp,a)