Giant set of strings leads to memory error - alternative?

Giant set of strings leads to memory error - alternative? - python

I'm trying to create a really huge list of incrementing 9 digits numbers (worst case). My plan is to have something like this:
['000000001', '000000002' , ..............,'999999999']
I already wrote the code. However, as soon as I run the code, my console prints "Memory Error" message.
Here is my current code:
HUGE_LIST = [''.join(i) for i in product('012345678', repeat = 9)
I know this might not be the best code to produce the list. Thus, can someone help me find a better way to solve this memory issue?
I'm planning to use HUGE_LIST for comparison with user input.
Example: a user enters '12345678' as input, then I want my code to assert that input with the HUGE_LIST.

The best way to solve an issue like this is to avoid a memory-intensive algorithm entirely. In this case, since your goal is to test whether a particular string is in the list, just write a function that checks whether the string satisfies the criteria to be in the list. For example, if your list contains all sequences of 9 digits, then your function just has to check whether a given input is a sequence of 9 digits.
def check(string):
return len(string) == 9 and all(c.isdigit() for c in string)
(in practice, give it a better name than check). Or if you want all sequences of 9 digits in which none of them is a 9, as your current code defining HUGE_LIST suggests, you could write
def check(string):
return len(string) == 9 and all(c.isdigit() and c != '9' for c in string)
Or so on.
If you can't write an algorithm to decide whether a string (or whatever) is in the list or not, the next best thing is to make a generator that will produce the values one at a time. If you already have a list comprehension, like
HUGE_LIST = [<something> for <variable> in <expression>]
then you can turn that into a generator by replacing the square brackets with parentheses:
HUGE_GENERATOR = (<something> for <variable> in <expression>)
Then you can test for membership using string in HUGE_GENERATOR. Note that after doing so, HUGE_GENERATOR will be (at least partially) consumed, so you can't use it for another membership test; you will have to recreate it if you want to test again.

Related

Validating generated strings quickly (+ maths)

I'm trying to validate a large amount of generated strings, pumped out by itertools.permutations
The way I'd like to validate them is checking if every overlapping 2 characters are found in an array I have set up, the string is only valid if every overlapping 2 strings are in the "paths" array
I have the following code to validate:
def valid(s):
matches = re.findall("(?=(..))", s)
for match in matches:
if match not in paths:
return False
return True
Now I'm wondering if this can get any faster since it's too slow for my liking, I assume a non regex solution would be faster
Also I was wondering if it was possible to pre-calculate how many accepted strings I will have, given that: every character in the paths array is in the itertools iterable* (so keyspace is known) and the size of the "paths" array is also known
Edit: Paths currently has 250 combinations
This is the iterable "1920eran876i3om54lstchdkgbvupywjfx"
example valid output:
1920876
1920873
1920875
1920874
1920867
1920863
1920865
1920864
1920834
1920857

You can simplify this by realising that your regex simply matches every two-character pair. You can get this by zipping two different iterators, like so:
def valid(s):
for c, d in zip(s[:-1], s[1:]):
if c + d not in paths:
return False
return True
It might be faster to iterate through indices and use slice on every loop, but you'd have to test it.
I've been assuming that paths is a string, but if it's a Sequence[str] you can make it a set for extra performance.
You can speed this up further by hand-coding Python byte code, but that's probably a bit excessive.

If paths is a list [] it would make things quicker to turn it into a set set() or {items} as a set uses a hashtable and therefore you don't have to check through the whole list if the item you're looking for is at the end.

How to search through each letter in a dictionary and use it properly

If I were to take a dictionary, such as
living_beings= {"Reptile":"Snake","mammal":"whale", "Other":"bird"}
and wished to search for individual characters (such as "a") (e.g.
for i in living_beings:
if "a" in living_beings:
print("a is here")
would there be an efficient- runs fastest- method of doing this?
The input is simply searching as outlined above (although my approach didn't work).
My (failed) code goes as follows:
animals=[]
for row in reader: #'reader' is simply what was in the dictionary
animals.append(row) #I tried to turn it into a list to sort it that way
for i in range(1, len(animals)):
r= animals[i]
for i in r:
if i== "a": #My attempt to find "a". This is obviously False as i= one of the strings in
k=i.replace("'","/") #this is my attempt at the further bit, for a bit of context
test= animals.append(k)
print(test)
In case you were wondering,
The next step would be to insert a character- "/"- before that letter (in this case "a"), although this is a slightly different problem and so not linked with my question and is simply there to give a greater understanding of the problem.
EDIT
I have found another error relating to dictionary. If the dictionary features an apostrophe (') the output is affected as it prints that particular word in quotes ("") rather that the normal apostrophes. EXAMPLE: living_beings= {"Reptile":"Snake's","mammal":"whale", "Other":"bird"} and if you use the following code (which I need to):
new= []
for i in living_beings:
r=living_beings[i]
new.append(r)
then the output is "snake's", 'whale', 'bird' (Note the difference between the first and other outputs). So My question is: How to stop the apostrophes affecting output.

My approach would be to use dict comprehension to map over the dictionary and replace every occurence of 'a' by '/a'.
I don't think there are significant performance improvements that can be done from there. You algorithm will be linear with regard to the total number of characters in the keys and items of the dict as you need to traverse the whole dictionary whatever the input.
living_beings= {"Reptile":"Snake","mammal":"whale", "Other":"bird"}
new_dict = {
kind.replace('a', '/a'): animal.replace('a', '/a') for kind, animal in living_beings.items()
}
# new_dict: {"Reptile":"Sn/ake","m/amm/al":"wh/ale", "Other":"bird"}
You could maybe optimize with a more convoluted solution that loops through the dict to mutate it instead of creating a new one, but in general I recommend not trying to do such things in Python. Just write good code, with good practices, and let Python do the optimization under the hood. After all this is what the Zen of Python tells us: Simple is better than complex.

This can be done quite efficiently using a regular expression match, e.g.:
import re
re_containsA = re.compile(r'.*a.*')
for key, word in worddict.items():
if re_containsA.match(word):
print(key)
The re.match object can then be used to find the location of the matched text.

How to change numbers in a number

I'm currently trying to learn python.
Suppose there was a a number n = 12345.
How would one go about changing every digit starting from the first spot and iterating it between (1-9) and every other spot after (0-9).
I'm sadly currently learning python so I apologize for all the syntax error that might follow.
Here's my last few attempts/idea for skeleton of the code.
define the function
turn n into string
start with a for loop that for i in n range(0,9) for i[1]
else range(10)
Basically how does one fix a number while changing the others?
Please don't give solution just hints I enjoy the thinking process.
For example if n =29 the program could check
19,39,49,59,69,79,89,99
and
21,22,23,24,25,26,27,28

Although you are new, the process seems far easy than you think.
You want to make that change to every digit of the number (let's say n=7382). But you cannot iterate over numbers (not even changing specific digits of it as you want to): only over iterables (like lists). A string is an iterable. If you get the way to do a int-str conversion, you could iterate over every number and then print the new number.
But how do you change only the digit you are iterating to? Again, the way is repeating the conversion (saving it into a var before the loop would make great DRY) and getting a substring that gets all numbers except the one you are. There are two ways of doing this:
You search for that specific value and get its index (bad).
You enumerate the loop (good).
Why 2 is good? Because you have the real position of the actual number being change (think that doing an index in 75487 with 7 as the actual one changing would not work well when you get to the last one). Search for a way to iterate over items in a loop to get its actual index.
The easiest way to get a substring in Python is slicing. You slice two times: one to get all numbers before the actual one, and other to get all after it. Then you just join those two str with the actual variable number and you did it.
I hope I didn't put it easy for you, but is hard for a simple task as that.

How can you parallelize a regex search of one long string? [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
(13 answers)
Closed 7 years ago.
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
if len(match.group(1)) > 1 and len(match.group(0))/len(match.group(1)) > 4:
yield (match.group(1), len(match.group(0))/len(match.group(1)))
This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?

Based on information given by nhahtdh in one of the other answers, some things have come to light.
First, the problem you are posing is called finding "tandem repeats" or "squares".
Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to use parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.
Also, in order to use this approach you are going to need O(n) space to store this suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 time to build and 400,000 bytes to and store this suffix tree.
I am not totally what is meant by searching in "real time", I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here. This algorithm needs to read in the entire input string and processes that before you start to get results. In that sense, it is what's called an "off-line" algorithm,.
http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).
In light of the above, if that algorithm isn't sufficiently fast or space efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, given that programming languages (other than assembler) don't have arbitrary "goto" statements, it's possible that this can narrow the kinds of cycles that can occur and somehow by make use of that structure might offer a way to speed things up.

When one algorithm is too slow, switch algorithms.
If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree
This will find common substrings in for you in linear time.
EDIT: #nhahtdh inb a comment below has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes
my answer, #nhahtdh should logically get some of the credit.
I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.

I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:
def partial_repeat(string):
l = len(string)
for i in range(2, l//2+1):
s = string[0:i]
multi = l//i-1
factor = l//(i-1)
ls = len(s)
if s*(multi) == string[:ls*(multi)] and len(string)-len(string[:ls*factor]) <= ls and s*2 in string:
return s
>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'
In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.
With standard loops and min() instead of comprehensions and sorted():
>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
... if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
... results.add(x)
...
>>> min(results, key=test_string.index)
'123'
I tested these solutions with the test string 'abc123123a' multiplied by (n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million character string at 430 seconds, or about seven minutes.

how to keep count of replaced strings

I have a massive string im trying to parse as series of tokens in string form, and i found a problem: because many of the strings are alike, sometimes doing string.replace()will cause previously replaced characters to be replaced again.
say i have the string being replaced is 'goto' and it gets replaced by '41' (hex) and gets converted into ASCII ('A'). later on, the string 'A' is also to be replaced, so that converted token gets replaced again, causing problems.
what would be the best way to get the strings to be replaced only once? breaking each token off the original string and searching for them one at a time takes very long
This is the code i have now. although it more or less works, its not very fast
# The largest token is 8 ASCII chars long
'out' is the string with the final outputs
while len(data) != 0:
length = 8
while reverse_search(data[:length]) == None:#sorry THC4k, i used your code
#at first, but it didnt work out
#for this and I was too lazy to
#change it
length -= 1
out += reverse_search(data[:length])
data = data[length:]

If you're trying to substitute strings at once, you can use a dictionary:
translation = {'PRINT': '32', 'GOTO': '41'}
code = ' '.join(translation[i] if i in translation else i for i in code.split(' '))
which is basically O(2|S|+(n*|dict|)). Very fast. Although memory usage could be quite substantial. Keeping track of substitutions would allow you to solve the problem in linear time, but only if you exclude the cost of looking up previous substitution. Altogether, the problem seems to be polynomial by nature.
Unless there is a function in python to translate strings via dictionaries that i don't know about, this one seems to be the simplest way of putting it.
it turns
10 PRINT HELLO
20 GOTO 10
into
10 32 HELLO
20 41 10
I hope this has something to do with your problem.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.