Variable Length Needle in Haystack (Python)


I have a function, designed to find errors in an application's search capabilities, which generates a variable-length search string from the non-control UTF-8 possibilities. When I run pytest iterations on this function, the random UTF-8 strings submitted for search generate debug errors roughly once per 500 searches.
As I can grab each of the strings that caused an error, I want to determine what is the minimal sub-series of the characters in those strings which truly provoke the error. In other words, (inside of a pytest loop):
def fumble_towards_ecstasy(string_that_breaks):
    # iterate over both length and content of the string
    nugget = ...  # minimum series of characters that break the search
    return nugget
Should I slice the string in half and whittle down each side and re-submit until it fails, choose random characters from its (len() - 1) and then back up if an error doesn't happen? Brute force combinatorial? What's the best way to step through this?
Thanks.

Splitting the string in half will fail if there is a two character sequence that causes the failure, and that sequence lies exactly in the middle. Each half succeeds, but the combined string fails.
Here's one algorithm that will find a local minimum:
Try removing each character in turn.
If removing the character still causes failure, keep the new shorter string and repeat the algorithm on this new string.
If removing the character no longer causes failure, put it back and try removing the next character. Keep going until there are no more characters left to try. When you reach the end of the string you know that removing any one character causes the search to succeed.
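The steps above can be sketched roughly as follows. This is only an illustration: `shrink_by_single_removal` and the `fails` callback are hypothetical names, and `demo_fails` is a stand-in for re-submitting the string to the real search and checking whether it errors.

```python
def shrink_by_single_removal(text, fails):
    """Greedily drop characters one at a time while fails(text) stays true."""
    i = 0
    while i < len(text):
        candidate = text[:i] + text[i + 1:]
        if fails(candidate):
            # Still failing without this character: keep the shorter
            # string and repeat the algorithm from its start.
            text = candidate
            i = 0
        else:
            # This character is needed: put it back, try the next one.
            i += 1
    return text


def demo_fails(text):
    # Assumption for the demo: the search "breaks" whenever "ab" appears.
    return "ab" in text


print(shrink_by_single_removal("xaxbx", demo_fails))
```

With the demo predicate this prints ab. The result is a local minimum only: removing any single character makes the failure go away, but a different, shorter failing string may still exist.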

I'd use a "whittle from both sides" approach. Splitting the string will always run the risk of breaking up the substring that was causing the error. My approach would be:
Pop as many characters off the left of the string as you can while still ensuring that the string causes an error.
Do the same to the right side.
You're left with - in theory - the minimal substring that causes the error.
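A minimal sketch of this whittling, assuming a hypothetical `fails(text)` callback that returns True when the search errors on `text`:

```python
def whittle_both_sides(text, fails):
    """Trim from the left, then from the right, while fails(text) stays true."""
    # Drop leading characters as long as the remainder still fails.
    while text and fails(text[1:]):
        text = text[1:]
    # Then drop trailing characters under the same condition.
    while text and fails(text[:-1]):
        text = text[:-1]
    return text


def contains_ab(text):
    # Demo stand-in for the real search error.
    return "ab" in text


print(whittle_both_sides("xxabyy", contains_ab))
```

This prints ab for the demo input. Because it only trims the ends, the result is always a contiguous substring; if the error needs two characters that are far apart, everything between them is kept.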
Hope that helps!

First of all it's worth noting that the solution is possibly not unique, i.e. it may be the case that there are two or more broken substrings.
An alternate suggestion (to the good answers by both Xavier and Mark) is to run a recursive approach. Repeat the sampling with the limited subset of strings that caused the error. Once another error is found, repeat until a minimal substring is reached. This approach is robust enough to handle a more complex use case, where the error can exist in two non-adjacent entries. I don't think that is the case here, but it's nice to have a general purpose method.
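One possible reading of this idea, sketched under assumptions: sample a random, smaller subset of the failing string's characters (keeping their order), and whenever the sample still fails, continue from it. The function name, the `rounds` and `seed` parameters, and the `fails` callback are all hypothetical.

```python
import random


def shrink_by_random_sampling(text, fails, rounds=200, seed=0):
    """Repeatedly replace text with a random failing subsequence of itself."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    for _ in range(rounds):
        if len(text) <= 1:
            break
        # Pick a strictly smaller set of positions, preserving their order.
        size = rng.randint(1, len(text) - 1)
        keep = sorted(rng.sample(range(len(text)), size))
        candidate = "".join(text[i] for i in keep)
        if fails(candidate):
            text = candidate  # continue sampling from the smaller string
    return text


def needs_a_and_z(text):
    # Demo error that depends on two non-adjacent characters.
    return "a" in text and "z" in text


result = shrink_by_random_sampling("qqaqqqzqq", needs_a_and_z)
print(repr(result))
```

The returned string is guaranteed to still fail and to be no longer than the input, but it is not guaranteed to be minimal; a deterministic pass such as single-character removal can finish the job.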

Related

Parsing blocks as Python

I am writing a lexer + parser in JFlex + CUP, and I wanted to have Python-like syntax regarding blocks; that is, indentation marks the block level.
I am unsure of how to tackle this, and whether it should be done at the lexical or syntax level.
My current approach is to solve the issue at the lexical level: newlines are parsed as instruction separators, and when one is processed I move the lexer to a special state which checks how many characters are in front of the new line and remembers in which column the last line started, and accordingly introduces an open block or close block character.
However, I am running into all sort of trouble. For example:
JFlex cannot match empty strings, so my instructions need to have at least one blank after every newline.
I cannot close two blocks at the same time with this approach.
Is my approach correct? Should I be doing things differently?

Your approach of handling indents in the lexer rather than the parser is correct. Well, it’s doable either way, but this is usually the easier way, and it’s the way Python itself (or at least CPython and PyPy) does it.
I don’t know much about JFlex, and you haven’t given us any code to work with, but I can explain in general terms.
For your first problem, you're already putting the lexer into a special state after the newline, so that "grab 0 or more spaces" should be doable by escaping from the normal flow of things and just running a regex against the line.
For your second problem, the simplest solution (and the one Python uses) is to keep a stack of indents. I'll demonstrate something a bit simpler than what Python does.
First:
indents = [0]
After each newline, grab a run of 0 or more spaces as spaces. Then:
if len(spaces) == indents[-1]:
    pass
elif len(spaces) > indents[-1]:
    indents.append(len(spaces))
    emit(INDENT_TOKEN)
else:
    while len(spaces) != indents[-1]:
        indents.pop()
        emit(DEDENT_TOKEN)
Now your parser just sees INDENT_TOKEN and DEDENT_TOKEN, which are no different from, say, OPEN_BRACE_TOKEN and CLOSE_BRACE_TOKEN in a C-like language.
Of course, you'd want better error handling: raise some kind of tokenizer error rather than an implicit IndexError, maybe use < instead of != so you can detect that you've gone too far instead of exhausting the stack (for better error recovery if you want to continue to emit further errors instead of bailing at the first one), etc.
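As a rough, self-contained illustration of the indent stack with that kind of error handling, here is a sketch in plain Python rather than JFlex. The generator name and the tuple-based tokens are made up for the example; it assumes space-only indentation and input lines without trailing newlines.

```python
def indent_tokens(lines):
    """Yield ("INDENT" | "DEDENT" | "LINE", text) tuples for indented lines."""
    indents = [0]
    for lineno, line in enumerate(lines, start=1):
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not change the block level
        width = len(line) - len(stripped)
        if width > indents[-1]:
            indents.append(width)
            yield ("INDENT", "")
        else:
            # Using < rather than != stops as soon as we pass the target,
            # so the stack is never exhausted.
            while width < indents[-1]:
                indents.pop()
                yield ("DEDENT", "")
            if width != indents[-1]:
                raise SyntaxError("line %d: inconsistent dedent" % lineno)
        yield ("LINE", stripped)
    # Close any blocks still open at end of input.
    while len(indents) > 1:
        indents.pop()
        yield ("DEDENT", "")


kinds = [kind for kind, _ in indent_tokens(["a", "  b", "    c", "d"])]
print(kinds)
```

For the sample input, the line d closes two blocks at once, so two DEDENT tokens are emitted in a row before its LINE token.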
For real-life example code (with error handling, and tabs as well as spaces, and backslash newline escaping, and handling non-syntactic indentation inside of parenthesized expressions, etc.), see the tokenize docs and source in the stdlib.

iterating a single item list faster than iterating a long string? #Python #Cherrypy

When using Cherrypy, I ran into this comment line. "strings get wrapped in a list because iterating over a single item list is much faster than iterating over every character in a long string."
This is located at
https://github.com/cherrypy/cherrypy/blob/master/cherrypy/lib/encoding.py#L223
I have done some research online but I still don't fully understand the reason to wrap the response.body as [response.body]. Can anyone show me the details behind this design?

I think that code only makes sense if you recognize that prior to the code with that comment, self.body could be either a single string, or an iterable sequence that contains many strings. Other code will use it as the latter (iterating on it and doing string stuff with the items).
While it would technically work to let that later code loop over the characters of the single string, processing the data character by character is likely inefficient. So the code below the comment wraps a list around the single string, letting it get processed all at once.
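A small illustration of the difference, not CherryPy's actual code: `as_chunks` is a hypothetical helper that mimics the wrapping, and the loop counts show how many iterations a consumer would perform in each case.

```python
def as_chunks(body):
    """Wrap a lone string in a list so iterating yields it as one chunk."""
    if isinstance(body, (str, bytes)):
        return [body]
    return body  # already an iterable of strings


page = "x" * 10000

per_character = sum(1 for _ in page)          # one iteration per character
per_chunk = sum(1 for _ in as_chunks(page))   # one iteration in total

print(per_character, per_chunk)
```

This prints 10000 1: code that iterates over the body does its per-item work once instead of ten thousand times.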

wxPython/TextCtrl replacing a character within the first x lines of a string

I've scanned the questions here as well as the web and haven't found my answer, this is my first question and I'm a noobie to (wx)Python so go easy on me.
Using TextCtrl I'm trying to remove a single character within a string, this string will always start with the same set of characters but the rest of the string is freely editable by the user.
e.g
self.text = wx.TextCtrl(panel, -1, "hello world,, today we're asking a question on stackoverflow, what would you ask?")
poor example but how would I find and remove the 11th(',') character so the sentence is more formatted without affecting the rest of the string?
I've tried standard Python indexing but I get an error for that. I can successfully remove chunks of the string from the start outwards or the end inwards, but I need only a single character removed.
Again, sorry for the poor terminology, as I said I'm fairly new to python so some of my terms may be a bit iffy.

self.text.SetValue(self.text.GetValue()[:11] + self.text.GetValue()[12:])
maybe? (In the example string the first comma sits at zero-based index 11, so this drops just that character.)
self.text.SetValue(self.text.GetValue().replace(",,", ","))
maybe?
It's not really clear what you are trying to accomplish here...
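The string handling can be checked on a plain Python string, without wx. In this sketch `remove_char_at` is a hypothetical helper; in the real program the input would come from GetValue() and the result would go back through SetValue().

```python
def remove_char_at(value, index):
    """Return value with the single character at zero-based index removed."""
    return value[:index] + value[index + 1:]


text = "hello world,, today we're asking a question"

comma_at = text.index(",,")             # zero-based position of the first comma
by_slice = remove_char_at(text, comma_at)
by_replace = text.replace(",,", ",", 1)  # third argument limits it to one replacement

print(comma_at, by_slice)
```

This prints 11 followed by the sentence with a single comma. Looking the position up with index() avoids hard-coding it, which matters if the fixed prefix ever changes.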

String slices/substrings/ranges in Python

I'm a newbie in Python and I would like to know something that I found very curious.
Let's say I have this:
s = "hello"
Then:
s[1:4] prints "ell" which makes sense...
and then s[3:-1] prints only 'l', which makes sense too..
But!
s[-1:3] which is same range but backwards returns an empty string ''... and s[1:10] or s[1:-20] is not throwing an error at all.. which.. from my point of view, it should produce an error right? A typical out-of-bounds error.. :S
My conclusion is that the range are always from left to right, I would like to confirm with the community if this is as I'm saying or not.
Thanks!

s[-1:3] returns the empty string because there is nothing in that range. It requests the range from the last character (index 4) up to index 3, moving to the right, but the last character is already past index 3.
Ranges are by default left to right.
There are extended slices which can reverse the step, or change its size. So s[-1:3:-1] will give you just 'o'. The last -1 in that slice is telling you that the slice should move from right to left.
Slices won't throw errors if you request a range that isn't in the string, they just return an empty string for those positions.
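To see how each of the slices from the question is resolved, the built-in slice.indices() method reports the concrete (start, stop, step) Python uses for a sequence of a given length. A short sketch:

```python
s = "hello"

cases = [(1, 4), (3, -1), (-1, 3), (1, 10), (1, -20), (-1, 3, -1)]
for bounds in cases:
    sl = slice(*bounds)
    # indices() converts negative and out-of-range bounds to real positions.
    print(bounds, sl.indices(len(s)), repr(s[sl]))
```

For example, (1, 10) resolves to (1, 5, 1) and yields 'ello', while (-1, 3) resolves to (4, 3, 1): the start is already past the stop with a positive step, so the result is the empty string.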

Ranges are "clamped" to the extent of the string... i.e.
s[:10]
will return the first 10 characters, or less if the string is not long enough.
A negative index means starting counting from the end, so s[-3:] takes the last three characters (or less if the string is shorter).
You can have range backward but you need to use an explicit step, like
s[10:5:-1]
You can also simply get the reverse of a string with
s[::-1]
or the string composed by taking all chars in even position with
s[::2]
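The slices in this answer can be tried on a concrete string; the sample value below is just an assumption for the demonstration.

```python
s = "hello world"  # 11 characters, indices 0..10

first_ten = s[:10]      # clamped: at most the first 10 characters
last_three = s[-3:]     # negative index counts from the end
backward = s[10:5:-1]   # explicit negative step walks right to left
reversed_s = s[::-1]    # whole string reversed
even_chars = s[::2]     # characters at indices 0, 2, 4, ...

print(first_ten, last_three, backward, reversed_s, even_chars, sep="|")
```

This prints hello worl|rld|dlrow|dlrow olleh|hlowrd. On a shorter string such as "hi", s[:10] simply returns "hi" rather than raising an error.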

how to avoid python numeric literals beginning with "0" being treated as octal?

I am trying to write a small Python 2.x API to support fetching a
job by jobNumber, where jobNumber is provided as an integer.
Sometimes the users provide a jobNumber as an integer literal
beginning with 0, e.g. 037537. (This is because they have been
coddled by R, a language that sanely considers 037537==37537.)
Python, however, considers integer literals starting with "0" to
be OCTAL, thus 037537!=37537, instead 037537==16223. This
strikes me as a blatant affront to the principle of least
surprise, and thankfully it looks like this was fixed in Python
3---see PEP 3127.
But I'm stuck with Python 2.7 at the moment. So my users do this:
>>> fetchJob(037537)
and silently get the wrong job (16223), or this:
>>> fetchJob(038537)
File "<stdin>", line 1
fetchJob(038537)
^
SyntaxError: invalid token
where Python is rejecting the octal-incompatible digit.
There doesn't seem to be anything provided via __future__ to
allow me to get the Py3K behavior---it would have to be built-in
to Python in some manner, since it requires a change to the lexer
at least.
Is anyone aware of how I could protect my users from getting the
wrong job in cases like this? At the moment the best I can think
of is to change that API so it takes a string instead of an int.

At the moment the best I can think of is to change that API so it takes a string instead of an int.
Yes, and I think this is a reasonable option given the situation.
Another option would be to make sure that all your job numbers contain at least one digit greater than 7 so that adding the leading zero will give an error immediately instead of an incorrect result, but that seems like a bigger hack than using strings.
A final option could be to educate your users. It will only take five minutes or so to explain not to add the leading zero and what can happen if you do. Even if they forget or accidentally add the zero due to old habits, they are more likely to spot the problem if they have heard of it before.

Perhaps you could take the input as a string, strip leading zeros, then convert back to an int?
test = "001234505"
test = int(test.lstrip("0") or "0") # 1234505; the "or" keeps an all-zero input such as "000" from becoming int("")
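A sketch of a string-based entry point along these lines; `parse_job_number` is a hypothetical helper, not part of any existing API. Note that int() with an explicit base of 10 already accepts leading zeros, so stripping them is optional.

```python
def parse_job_number(raw):
    """Convert a job number supplied as text (e.g. "037537") to an int."""
    text = str(raw).strip()
    if not text.isdigit():
        raise ValueError("job number must contain only digits: %r" % (raw,))
    # Base 10 is explicit, so a leading zero is never read as octal.
    return int(text, 10)


print(parse_job_number("037537"), parse_job_number("038537"))
```

This prints 37537 38537. The argument has to arrive as a string: if a user types the bare literal 037537 in Python 2, it has already become 16223 before any function can inspect it.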
