I have two text files that are both about 1M lines.
Let's call them f1 and f2.
For each line in f1, I need to find the index of a line in f2 such that the line from f1 is a substring of that line in f2. Since I need to do this for all lines of f1, a nested for loop is too time-costly, and I was wondering whether there is a workaround that could significantly reduce the time.
Thank you in advance for the help.
There certainly are better ways than using two for loops :D That would give you an O(n^2) runtime. Something very useful here is called a rolling hash: it reuses information from the previous window to speed up substring search. It goes something like this:
Say I have a string f1 = "cat" and a long string f2 = "There once was a cat named felix". What you do is define a "hash" based on the letters of your f1 string. The specifics can be found online in various sources, but for this example let's simplify things and say that letters are assigned numbers from 0 up to 25, and we combine each letter's value, weighted by a power of 10, into a decimal number whose number of digits equals the length of the string:
hash("cat") = 10^2 * 2 + 10^1 * 0 + 10^0 * 19
= some value (in Python the "hash" values of letters
are not 0 through 25 but given by the ord function:
ord("a") will give you 97)
Now this next part is super cool. We designate windows the size of our f1 string, so size 3, and hash the f2 string in the same way we did with f1. You start with the first three letters, "The". The hash doesn't match, so we move on. If the hash had matched, we would check that the window really is the same string (sometimes two different strings hash to the same value because of the way we assign hashes, but that's OK as long as we verify).
COOL PART** Instead of simply shifting the window and rehashing letters 2 through 4 of f2, we "roll" the window and don't recalculate the entire hash (which, if f1 is really long, would be a waste of time), since the only letters changing are the first and last! The trick is to subtract the contribution of the outgoing letter (in our example the 'T', i.e. its value times 10^2), then multiply the entire remaining number by ten (because everything moves one place to the left), and add the value of the incoming letter ('r', times 10^0). Check for a match again and move on. If we match, return the index.
Why are we doing this: with the rolling hash, the expected runtime drops to roughly O(len(f1) + len(f2)), i.e. asymptotically linear, instead of rehashing a full window of length len(f1) at every position.
Now, the actual implementation takes time and could get messy, but there are plenty of sources online for this kind of thing. My algorithms class has great course notes online which help with understanding the theory a little better, and there are tons of links with Python implementations. Hope this helps!
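For reference, here is a minimal sketch of the rolling-hash idea in Python (my own illustration, not any particular course's implementation; it uses base 256 with a large prime modulus rather than the base-10 toy hash above):

def rabin_karp(pattern, text, base=256, mod=1_000_000_007):
    # Return the index of the first occurrence of pattern in text, or -1.
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the outgoing (leftmost) character
    p_hash = t_hash = 0
    for i in range(m):                    # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Hashes match -> verify the actual characters to rule out a collision.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                     # roll: drop text[i], bring in text[i + m]
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1

print(rabin_karp("cat", "There once was a cat named felix"))   # 17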
I would like to know why the following code snippet is a bad hashing function.
def computeHash(self, s):
    h = 0
    for ch in s:
        h = h + ord(ch)  # ord gives the ASCII value of character ch
    return h % HASH_TABLE_SIZE
If I dramatically increase the size of the hash table will this make up for the inadequacies of the hash function?
It's a bad hashing function because strings are order-sensitive, but the hash is not; "ab" and "ba" would hash identically, and for longer strings the collisions just get worse; all of "abc", "acb", "bac", "bca", "cab", "cba" would share the same hash.
For an order-insensitive data structure (e.g. frozenset) a strategy like this isn't as bad, but it's still too easy to produce collisions by simply reducing one ordinal by one and increasing another by one, or by simply putting a NUL character in there; frozenset({'\0', 'a'}) would hash identically to just frozenset({'a'}). Typically this is countered by incorporating the length of the collection into the hash in some manner.
Secure hashes (e.g. Python uses SipHash) are the best solution; a randomized seed combined with an algorithm that conceals the seed while incorporating it into the resulting hash makes it not only harder to accidentally create collisions (which simple improvements like hashing the index as well as the ordinal would help with, to make the hash order and length sensitive), but also makes it nigh impossible for malicious data sources to intentionally cause collisions (which the simple solutions don't handle at all).
The other problem with this hash is that it doesn't distribute the bits evenly; short strings mean only low bits are set in the hash. This means that increasing the table size is completely pointless when the strings are all relatively short. If all the strings are ASCII and 100 characters or less, the largest possible raw hash value is 12700; if you need to store a million such strings, you'll average nearly 79 collisions per bucket in the first 12,700 buckets (in practice much more than that for common buckets; there will be humps with many more collisions in the middling values, fewer collisions near the beginning, and almost none at the end, since something like '\x7f' * 100 is the only way to reach that maximum value), and no matter how many more buckets you have, they'll never be used. Technically, an open-addressing hash table might use them, but it would be largely equivalent to separate chaining per bucket, since all indices past 12700 would only be reached by the open-addressing "bounce around" algorithm; if that's badly designed, e.g. linear probing, you might end up linearly scanning the whole table even if no entries actually collide for your particular hash (your bucket was filled by chaining, and it has to scan linearly until it finds an empty slot or the matching element).
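A quick illustration of both problems, with the hash from the question rewritten as a plain function (HASH_TABLE_SIZE is just a made-up value here):

HASH_TABLE_SIZE = 1_000_003

def compute_hash(s):
    return sum(ord(ch) for ch in s) % HASH_TABLE_SIZE

# Order-insensitive: anagrams always collide.
assert compute_hash("ab") == compute_hash("ba")
assert compute_hash("abc") == compute_hash("cba")

# Poor distribution: 100 ASCII characters can never sum past 100 * 127 = 12700,
# so every bucket above 12700 is unreachable no matter how big the table is.
print(max(compute_hash("z" * k) for k in range(1, 101)))   # 12200, nowhere near 1_000_003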
Bad hashing function:
1. 'AC' and 'BB' would give the same result, and for a big string there can be many permutations in which the sum of ASCII values is the same.
2. Even strings of different lengths can give the same hash: 'A ' (A + space) hashes the same as 'a', since 65 + 32 = 97.
3. Any rearrangement of the characters in a string gives the same hash.
This is a bad hashing function. One big problem: re-arranging any or all characters returns the exact same hash.
Increasing HASH_TABLE_SIZE does nothing to adjust for this.
I'm currently trying to learn python.
Suppose there was a number n = 12345.
How would one go about changing every digit: iterating the first position over 1-9 and every other position over 0-9?
I'm sadly still learning Python, so I apologize for all the syntax errors that might follow.
Here are my last few attempts/ideas for a skeleton of the code:
define the function
turn n into a string
start with a for loop: for i in range(1, 10) for the first digit
else range(10) for the rest
Basically, how does one change one digit while keeping the others fixed?
Please don't give a solution, just hints; I enjoy the thinking process.
For example if n =29 the program could check
19,39,49,59,69,79,89,99
and
21,22,23,24,25,26,27,28
Although you are new, the process is far easier than you think.
You want to make that change to every digit of the number (let's say n = 7382). But you cannot iterate over a number, nor change specific digits of it in place: you can only iterate over iterables (like lists). A string is an iterable. If you find the way to do an int-str conversion, you can iterate over every digit and then print the new number.
But how do you change only the digit you are currently on? Again, the way is to do the conversion once (saving it into a variable before the loop would make for good DRY) and get a substring with all the digits except the one you are on. There are two ways of doing this:
You search for that specific value and get its index (bad).
You enumerate the loop (good).
Why is 2 good? Because it gives you the real position of the digit currently being changed (think of 75487 with a 7 being changed: index() would not work well when you get to the last 7, since it always finds the first one). Search for a way to iterate over items in a loop while also getting their index.
The easiest way to get a substring in Python is slicing. You slice twice: once to get all the digits before the current one, and once to get all the digits after it. Then you just join those two strings with the changed digit in between, and you're done.
I hope I didn't make it too easy for you, but that's hard to avoid for a task as simple as this.
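To stay in the spirit of hints only, here is just a generic illustration of the two building blocks mentioned above (enumerate plus slicing) on an unrelated string, not the digit solution itself:

word = "hello"
for i, ch in enumerate(word):                  # i is the real position, ch the character there
    changed = word[:i] + "X" + word[i + 1:]    # everything before + replacement + everything after
    print(i, ch, changed)
# 0 h Xello
# 1 e hXllo
# 2 l heXlo
# ...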
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:
import re

def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        if len(match.group(1)) > 1 and len(match.group(0))/len(match.group(1)) > 4:
            yield (match.group(1), len(match.group(0))/len(match.group(1)))
This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?
Based on information given by nhahtdh in one of the other answers, some things have come to light.
First, the problem you are posing is called finding "tandem repeats" or "squares".
Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.
Also, in order to use this approach you are going to need O(n) space to store the suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 steps of time to build it and on the order of 400,000 bytes to store it.
I am not totally sure what is meant by searching in "real time"; I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here: this algorithm needs to read in the entire input string and process it before you start to get results. In that sense, it is what's called an "off-line" algorithm.
http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).
In light of the above, if that algorithm isn't sufficiently fast or space efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, then given that programming languages (other than assembler) don't have arbitrary "goto" statements, the kinds of cycles that can occur may be limited, and making use of that structure might offer a way to speed things up.
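As a concrete starting point for the "simple-minded approach" timings mentioned above, here is a rough sketch that only checks whether the end of the output has settled into a cycle repeated at least five times (that's my reading of what the simulation test ultimately needs; it is not Gusfield's algorithm and is quadratic in the worst case):

def tail_cycle_length(s, min_repeats=5, max_cycle=400_000):
    # Return the shortest k such that s ends with s[-k:] repeated min_repeats
    # times, or None if no such cycle length up to max_cycle exists.
    limit = min(max_cycle, len(s) // min_repeats)
    for k in range(1, limit + 1):
        if s.endswith(s[-k:] * min_repeats):
            return k
    return None

output = "1234567890" * 40 + "0421" * 6        # 400 digits, then a repeated cycle
print(tail_cycle_length(output))               # 4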
When one algorithm is too slow, switch algorithms.
If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree
This will find common substrings for you in linear time.
EDIT: #nhahtdh, in a comment below, has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes my answer, #nhahtdh should logically get some of the credit.
I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.
I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:
def partial_repeat(string):
    l = len(string)
    for i in range(2, l//2+1):
        s = string[0:i]
        multi = l//i-1
        factor = l//(i-1)
        ls = len(s)
        if s*(multi) == string[:ls*(multi)] and len(string)-len(string[:ls*factor]) <= ls and s*2 in string:
            return s
>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'
In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.
With standard loops and min() instead of comprehensions and sorted():
>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
... if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
... results.add(x)
...
>>> min(results, key=test_string.index)
'123'
I tested these solutions with the test string 'abc123123a' multiplied by n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million character string at 430 seconds, or about seven minutes.
What is the fastest and most efficient way to do this:
word = "dinosaur"
newWord = word[0] + ''.join(sorted(word[1:]))
output:
"dainorsu"
Thoughts:
Would something like converting the word to an array increase performance? I read somewhere that arrays have less overhead than strings because all of their elements are the same data type.
Basically I want to sort everything after the first character in the string as fast as possible. If memory is saved, that would also be a plus. The problem I am trying to solve has a time limit, so I am trying to be as fast as possible. I don't know much about Python efficiency under the hood, so if you explain why your method is fast as well, that would be AWESOME!
Here's how I'd approach it.
Create an array of size 26 (assuming that only lowercase letters are used). Then iterate through each character in the string. For the 1st letter of the alphabet, increment the 1st index of the array; for the 2nd, increment the 2nd. Once you've scanned the whole string (which is of complexity O(n)) you will be able to reconstruct it afterwards by repeating the 'a' array[0] times, 'b' array[1] times, and so on.
This approach would beat a fast sort algorithm like quicksort or partition sort, which have complexity O(n log n).
EDIT: Finally you'd also want to reassemble the final string efficiently. The string concatenation provided by some languages using the + operator can be inefficient, so consider using an efficient string builder class.
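As a rough sketch of that counting idea in Python (assuming lowercase ASCII only, as the answer does; whether it actually beats the heavily optimized built-in sorted() is worth checking with timeit):

def sort_tail_counting(word):
    counts = [0] * 26
    for ch in word[1:]:                        # count every letter after the first
        counts[ord(ch) - ord('a')] += 1
    # Rebuild the tail in alphabetical order from the counts, then prepend the first letter.
    tail = ''.join(chr(ord('a') + i) * counts[i] for i in range(26))
    return word[0] + tail

print(sort_tail_counting("dinosaur"))          # dainorsu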
I have a massive string I'm trying to parse as a series of tokens in string form, and I found a problem: because many of the strings are alike, sometimes doing string.replace() will cause previously replaced characters to be replaced again.
Say the string being replaced is 'goto': it gets replaced by '41' (hex), which gets converted into ASCII ('A'). Later on, the string 'A' is also to be replaced, so the already-converted token gets replaced again, causing problems.
What would be the best way to get each string replaced only once? Breaking each token off the original string and searching for them one at a time takes very long.
This is the code I have now. Although it more or less works, it's not very fast:
# The largest token is 8 ASCII chars long
# 'out' is the string with the final outputs
while len(data) != 0:
    length = 8
    while reverse_search(data[:length]) == None:  # sorry THC4k, I used your code
                                                  # at first, but it didn't work out
                                                  # for this and I was too lazy to
                                                  # change it
        length -= 1
    out += reverse_search(data[:length])
    data = data[length:]
If you're trying to substitute strings at once, you can use a dictionary:
translation = {'PRINT': '32', 'GOTO': '41'}
code = ' '.join(translation[i] if i in translation else i for i in code.split(' '))
which is basically a few linear passes over the code: one split, one dictionary lookup per token (O(1) on average), and one join. Very fast, although memory usage could be quite substantial. Keeping track of substitutions would allow you to solve the problem in linear time, but only if you exclude the cost of looking up previous substitutions. Altogether, the problem seems to be polynomial by nature.
Unless there is a function in Python to translate strings via dictionaries that I don't know about, this seems to be the simplest way of putting it.
It turns
10 PRINT HELLO
20 GOTO 10
into
10 32 HELLO
20 41 10
I hope this has something to do with your problem.
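For what it's worth, the conditional expression can be written slightly more directly with dict.get, which falls back to the token itself when it has no translation (same behaviour, just a stylistic alternative):

translation = {'PRINT': '32', 'GOTO': '41'}
line = "20 GOTO 10"
print(' '.join(translation.get(tok, tok) for tok in line.split(' ')))   # 20 41 10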