How to keep count of replaced strings - Python

I have a massive string I'm trying to parse as a series of tokens in string form, and I found a problem: because many of the strings are alike, sometimes doing string.replace() will cause previously replaced characters to be replaced again.
Say the string being replaced is 'goto', which gets replaced by '41' (hex) and converted into ASCII ('A'). Later on, the string 'A' is also due to be replaced, so the already-converted token gets replaced again, causing problems.
What would be the best way to get the strings replaced only once? Breaking each token off the original string and searching for them one at a time takes very long.
This is the code I have now. Although it more or less works, it's not very fast:
# The largest token is 8 ASCII chars long
# 'out' is the string with the final output
while len(data) != 0:
    length = 8
    # sorry THC4k, I used your code at first, but it didn't work out
    # for this and I was too lazy to change it
    while reverse_search(data[:length]) is None:
        length -= 1
    out += reverse_search(data[:length])
    data = data[length:]

If you're trying to substitute several strings at once, you can use a dictionary:
translation = {'PRINT': '32', 'GOTO': '41'}
code = ' '.join(translation[i] if i in translation else i for i in code.split(' '))
which is roughly O(2|S| + n), since each dictionary lookup is O(1) on average. Very fast, although memory usage could be quite substantial. Keeping track of substitutions would let you solve the problem in linear time, but only if you exclude the cost of looking up previous substitutions. Altogether, the problem seems to be polynomial by nature.
Unless there is a function in Python to translate strings via dictionaries that I don't know about, this seems to be the simplest way of putting it.
It turns
10 PRINT HELLO
20 GOTO 10
into
10 32 HELLO
20 41 10
I hope this has something to do with your problem.
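If the tokens are not separated by spaces, another option is a single pass with re.sub and a callback; here is a minimal sketch reusing the translation dict above:
import re

translation = {'PRINT': '32', 'GOTO': '41'}
# longest keys first, so overlapping tokens prefer the longer match
pattern = re.compile('|'.join(map(re.escape, sorted(translation, key=len, reverse=True))))

code = '10 PRINT HELLO\n20 GOTO 10'
# each position of the input is scanned exactly once, so text produced by a
# replacement is never looked at again
print(pattern.sub(lambda m: translation[m.group(0)], code))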

Related

How to rename PDF files in Python?

I have the following string: "E2017010000000601". The character E is a control character; after it comes the year, then the month, and in the last positions a user code with a maximum of 7 digits. I would like to know how, in Python, I can remove the unnecessary zeros from the middle of the string.
For example, in the string "E2018090001002202", I do not need the three zeros between the 9 and the 1.
In the string "E2017010000000601", I do not need the seven zeros between the 1 and the 6.
I have over 1000 files with this type of name, and renaming them one by one is tedious. I know that in Python I can rename this huge number of files, but the code I wrote doesn't produce the result I described ... Any help?
This is basic string slicing as long as you are sure the structure is identical for each string.
You can use something like:
original_string = "E2017010000000601"
cut_string = str(int(original_string[7:]))
This should work because first you remove the first 7 characters: the control char, the year, and the month.
Then you convert the rest to an integer, which removes all the zeroes at the front, and then back to a string.
Basically the same answer as Alexis, but since I can't comment yet, in a separate answer: since you want to keep the "EYYYYMM" part of the string, the code would be:
>>> original_string = 'E2017010000000601'
>>> cut_string = original_string[:7] + str(int(original_string[7:]))
>>> cut_string
'E201701601'
A quick explanation: we know what the first seven characters of the string will be, and we want to keep those in the string. Then we add the rest of the string, turned into an integer and back into a string, so that all unnecessary zeroes in front are removed.
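To actually rename the files, a hedged sketch using pathlib might look like this (the folder name is a placeholder, and it assumes every file name follows the pattern exactly):
from pathlib import Path

# 'files_to_rename' is a placeholder folder; every file stem is assumed to
# follow the "E" + year + month + zero-padded user code pattern exactly
for pdf in Path('files_to_rename').glob('*.pdf'):
    stem = pdf.stem
    new_stem = stem[:7] + str(int(stem[7:]))
    pdf.rename(pdf.with_name(new_stem + pdf.suffix))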

Came up with a serialization algorithm for strings, but it only works for words that are < 10 letters long.

So I have a question about a serialization algorithm I just came up with; I wanted to know if it already exists and whether there's a better version out there.
We know that typical formats use a delimiter to join the words in a list, but then you have to scan every word for occurrences of the delimiter and escape them, or the serialization isn't robust. I thought a more intuitive approach would be to use a higher-level language like Python, where len() is O(1), and prepend each word's length to it, as in the code I attached below.
Wouldn't this be faster, because instead of going through every letter of every word we just go through every word? And during deserialization we don't have to scan every character looking for the delimiter; we can skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
    return ''.join(str(len(word)) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index])
        next_index = index + word_length + 1
        deserialized_list.append(serialized_list_o_words[index + 1:next_index])
        index = next_index
    return deserialized_list

serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize side, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
    s = format(len(word), '02')
    if len(s) != 2:
        raise ValueError(f"Can't serialize {s}; it's too long")
    return s

def serialize(list_o_words):
    return ''.join(_len(word) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index + 1 < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index:index + 2])
        next_index = index + word_length + 2
        deserialized_list.append(serialized_list_o_words[index + 2:next_index])
        index = next_index
    return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, s encodes as 00000s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
import struct

def _len(word):
    # struct.pack already raises an exception for lengths > 65535
    return struct.pack('>H', len(word))

def serialize(list_o_words):
    utfs8 = (word.encode() for word in list_o_words)
    return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
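A matching deserializer for that 2-byte binary format might look something like this (a sketch, assuming the whole serialized record is available in memory):
import struct

def deserialize(data):
    # read a 2-byte big-endian length, then that many bytes of UTF-8, repeat
    index = 0
    words = []
    while index < len(data):
        (word_length,) = struct.unpack_from('>H', data, index)
        start = index + 2
        words.append(data[start:start + word_length].decode())
        index = start + word_length
    return words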
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
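To make the delimited-length idea concrete, here is a minimal sketch (simplified; real netstrings also put a trailing comma after each payload and use a colon after the length):
def serialize(list_o_words):
    # ASCII length, one comma, then the word; the comma never needs escaping
    # because it only ever terminates the length field
    return ''.join(f'{len(word)},{word}' for word in list_o_words)

def deserialize(serialized):
    index = 0
    words = []
    while index < len(serialized):
        comma = serialized.index(',', index)        # end of the length field
        word_length = int(serialized[index:comma])
        words.append(serialized[comma + 1:comma + 1 + word_length])
        index = comma + 1 + word_length
    return words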
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
    that's the length
else:
    add the low 7 bits of the current byte
    plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
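A minimal Python sketch of that pseudocode (the function names here are made up for illustration):
def encode_length(n):
    # low 7 bits per byte; a set high bit means "more length bytes follow"
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_length(data, index=0):
    # returns (length, index of the first byte after the length)
    byte = data[index]
    if byte < 0x80:
        return byte, index + 1
    rest, next_index = decode_length(data, index + 1)
    return (byte & 0x7F) + 128 * rest, next_index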
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).
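To make the type-code idea a bit more concrete, here is a hedged sketch covering only a few of those codes (pack_value and its handling of long strings are my own simplifications):
import struct

def pack_value(value):
    # one type-code byte, then a fixed-size or length-prefixed payload;
    # only ints, floats and strings shorter than 65536 bytes are handled here
    if isinstance(value, int):
        return b'I' + struct.pack('>i', value)
    if isinstance(value, float):
        return b'f' + struct.pack('>d', value)
    if isinstance(value, str):
        data = value.encode()
        if len(data) < 256:
            return b's' + struct.pack('>B', len(data)) + data
        return b'S' + struct.pack('>H', len(data)) + data
    raise TypeError(f'no type code for {type(value).__name__}')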

Giant set of strings leads to memory error - alternative?

I'm trying to create a really huge list of incrementing 9-digit numbers (worst case). My plan is to have something like this:
['000000001', '000000002' , ..............,'999999999']
I already wrote the code. However, as soon as I run the code, my console prints "Memory Error" message.
Here is my current code:
from itertools import product

HUGE_LIST = [''.join(i) for i in product('012345678', repeat=9)]
I know this might not be the best code to produce the list. Thus, can someone help me find a better way to solve this memory issue?
I'm planning to use HUGE_LIST for comparison with user input.
Example: a user enters '12345678' as input, and I want my code to check that input against HUGE_LIST.
The best way to solve an issue like this is to avoid a memory-intensive algorithm entirely. In this case, since your goal is to test whether a particular string is in the list, just write a function that checks whether the string satisfies the criteria to be in the list. For example, if your list contains all sequences of 9 digits, then your function just has to check whether a given input is a sequence of 9 digits.
def check(string):
    return len(string) == 9 and all(c.isdigit() for c in string)
(in practice, give it a better name than check). Or if you want all sequences of 9 digits in which none of them is a 9, as your current code defining HUGE_LIST suggests, you could write
def check(string):
    return len(string) == 9 and all(c.isdigit() and c != '9' for c in string)
Or so on.
If you can't write an algorithm to decide whether a string (or whatever) is in the list or not, the next best thing is to make a generator that will produce the values one at a time. If you already have a list comprehension, like
HUGE_LIST = [<something> for <variable> in <expression>]
then you can turn that into a generator by replacing the square brackets with parentheses:
HUGE_GENERATOR = (<something> for <variable> in <expression>)
Then you can test for membership using string in HUGE_GENERATOR. Note that after doing so, HUGE_GENERATOR will be (at least partially) consumed, so you can't use it for another membership test; you will have to recreate it if you want to test again.
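For example, a minimal sketch (the generator matches the HUGE_LIST comprehension from the question, including its missing '9'):
from itertools import product

def huge_generator():
    # recreate the generator for every membership test, since iterating
    # it consumes it; note the digit pool has no '9', as in the question
    return (''.join(digits) for digits in product('012345678', repeat=9))

print('000000012' in huge_generator())   # True, found early in the sequence
# a miss (or a late hit) walks through up to 9**9 candidates, which is why
# the check() function above is by far the better option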

How can you parallelize a regex search of one long string? [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:
import re

def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        if len(match.group(1)) > 1 and len(match.group(0)) / len(match.group(1)) > 4:
            yield (match.group(1), len(match.group(0)) / len(match.group(1)))
This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?
Based on information given by nhahtdh in one of the other answers, some things have come to light.
First, the problem you are posing is called finding "tandem repeats" or "squares".
Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.
Also, in order to use this approach you are going to need O(n) space to store the suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 steps to build the suffix tree and 400,000 bytes to store it.
I am not totally sure what is meant by searching in "real time"; I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here. This algorithm needs to read in the entire input string and process it before you start to get results. In that sense, it is what's called an "off-line" algorithm.
http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).
In light of the above, if that algorithm isn't sufficiently fast or space-efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, given that programming languages (other than assembler) don't have arbitrary "goto" statements, it's possible that this narrows the kinds of cycles that can occur, and somehow making use of that structure might offer a way to speed things up.
When one algorithm is too slow, switch algorithms.
If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree
This will find common substrings for you in linear time.
EDIT: #nhahtdh in a comment below has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes my answer, #nhahtdh should logically get some of the credit.
I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.
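If a cruder approach is acceptable, the regex scan from the question can itself be parallelized by splitting the string into overlapping chunks and scanning them in a process pool. A hedged sketch (the chunk and overlap sizes are arbitrary, and a repeat is only detected if its whole span fits inside one chunk):
import re
from multiprocessing import Pool

PATTERN = re.compile(r"(.+?)\1+")

def scan_chunk(chunk):
    # same filter as repetitions(): unit longer than 1 char, > 4 repetitions
    found = []
    for match in PATTERN.finditer(chunk):
        unit, whole = match.group(1), match.group(0)
        if len(unit) > 1 and len(whole) / len(unit) > 4:
            found.append((unit, len(whole) / len(unit)))
    return found

def parallel_repetitions(s, workers=4, chunk=500_000, overlap=100_000):
    # overlapping pieces so a repeat straddling a boundary can still be seen
    pieces = [s[i:i + chunk + overlap] for i in range(0, len(s), chunk)]
    with Pool(workers) as pool:
        return [hit for hits in pool.map(scan_chunk, pieces) for hit in hits]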
I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:
def partial_repeat(string):
    l = len(string)
    for i in range(2, l // 2 + 1):
        s = string[0:i]
        multi = l // i - 1
        factor = l // (i - 1)
        ls = len(s)
        if s * multi == string[:ls * multi] and len(string) - len(string[:ls * factor]) <= ls and s * 2 in string:
            return s
>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'
In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.
With standard loops and min() instead of comprehensions and sorted():
>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
... if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
... results.add(x)
...
>>> min(results, key=test_string.index)
'123'
I tested these solutions with the test string 'abc123123a' multiplied by n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million-character string at 430 seconds, or about seven minutes.

What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
"from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing the strings with is instead of == suppresses the problem completely, and the performance is about 210 ms per 1M comparisons.
Since interning has been mentioned, I made sure to add a white space after each string, which should prevent interning; that doesn't change anything. Is it something else than interning then?
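For reference, a minimal sketch of how v1 and v2 might be built for this kind of test (the length and count are placeholders; building two million random strings takes a while):
import random
import string
import timeit

n = 64             # string length under test (placeholder)
COUNT = 1_000_000  # comparisons per run; reduce this for a quicker experiment

def random_string(length):
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

# two independent lists of random strings of length n
v1 = [random_string(n) for _ in range(COUNT)]
v2 = [random_string(n) for _ in range(COUNT)]

print(timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
                    "from __main__ import COUNT, v1, v2", number=50))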
Python can 'intern' short strings; it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it'll first test whether they are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
I am just making wild guesses, but you asked "what might" rather than "what does", so here are some possibilities:
The CPU cache line size is 64 bytes, and longer strings cause a cache miss.
Python might store strings of up to 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and be able to use very fast SSE2 vector instructions to match two strings.
