Efficiently slicing a string in Python3 - python

Since python does slice-by-copy, slicing strings can be very costly.
I have a recursive algorithm that is operating on strings. Specifically, if a function is passed a string a, the function calls itself on a[1:] of the passed string. The hangup is that the strings are so long, the slice-by-copy mechanism is becoming a very costly way to remove the first character.
Is there a way to get around this, or do I need to rewrite the algorithm entirely?

The only way to get around this in general is to make your algorithm uses bytes-like types, either Py2 str or Py3 bytes; views of Py2 unicode/Py3 str are not supported. I provided details on how to do this on my answer to a related question, but the short version is, if you can assume bytes-like arguments (or convert to them), wrapping the argument in a memoryview and slicing is a reasonable solution. Once converted to a memoryview, slicing produces new memoryviews with O(1) cost (in both time and memory), rather than the O(n) time/memory cost of text slicing.

Related

Scientific notation literal with variable exponent in Python

The literal 3e4 represents the float 30000 in python (3.8 at least).
>>> print(3e4)
30000.0
The syntax of the following code is clearly invalid:
x=4
3ex
3ex is not a valid expression, but the example helps me ask my question:
Clearly, the expression 3*10**4 represents the same number, but my question here is purely related to the scientific notation literals. Just for my curiosity, is there a way to use the same syntax with a variable power, better than:
x=4
eval(f"1e{x}")
One subtle difference between 3e4 and 3*10**4 is the type (float and int respectively).
Is there also a difference in execution time perhaps in calculating these two expressions?
To your first question: No, the documentation does not suggest you can.
Is there a way to use the same syntax with a variable power?
When float is instantiated from a string, it calls out to a CPython C library PyOS_string_to_double to which handles making the str locale-aware (. vs ,) before passing the string directly to the C function strtod doc.
Meanwhile the documentation for PyOS_string_to_double does not mention of any special way to configure the exponent.
To your second question about performance, this is easily benchmarked. But, we do not have a candidate to benchmark against. So, this is a moot question.
Is there also a difference in execution time perhaps?
I hope this satiates your curiosity. If not, feel free to dig into the C code that I linked.

java programmer learning python: string split and join

Can someone explain to me why in python when we want to join a string we write:
'delim'.join(list)
and when we want to split a string we write:
str.split('delim')
coming from java it seems that one of these is backwards because in java we write:
//split:
str.split('delim');
//join
list.join('delim');
edit:
you are right. join takes a list. (though it doesnt change the question)
Can someone explain to me the rationale behind this API?
Join only makes sense when joining some sort of iterable. However, since iterables don't necessarily contain all strings, putting join as a method on an iterable doesn't make sense. (what would you expect the result of [1,"baz",my_custom_object,my_list].join("foo") to be?) The only other place to put it is as a string method with the understanding that everything in the iterable is going to be a string. Additionally, putting join as a string method allows it to be used with any iterable -- tuples, lists, generators, custom objects which support iteration or even strings.
Also note that you are completely free to split a string in the same way that you join it:
list_of_strings='this, is , a, string, separated, by , commas.'.split(',')
Of course, the utility here isn't quite as easy to see.
http://docs.python.org/faq/design.html#why-is-join-a-string-method-instead-of-a-list-or-tuple-method
From the source:
join() is a string method because in using it you are telling the
separator string to iterate over a sequence of strings and insert
itself between adjacent elements. This method can be used with any
argument which obeys the rules for sequence objects, including any new
classes you might define yourself.
Because this is a string method it can work for Unicode strings as
well as plain ASCII strings. If join() were a method of the sequence
types then the sequence types would have to decide which type of
string to return depending on the type of the separator.
So that you don't have to reimplement the join operation for every sequence type you create. Python uses protocols, not types, as its main behavioral pattern, and so every sequence can be expected to act the same even though they don't derive from an existing sequence class.

Why does Python not perform type conversion when concatenating strings?

In Python, the following code produces an error:
a = 'abc'
b = 1
print(a + b)
(The error is "TypeError: cannot concatenate 'str' and 'int' objects").
Why does the Python interpreter not automatically try using the str() function when it encounters concatenation of these types?
The problem is that the conversion is ambiguous, because + means both string concatenation and numeric addition. The following question would be equally valid:
Why does the Python interpreter not automatically try using the int() function when it encounters addition of these types?
This is exactly the loose-typing problem that unfortunately afflicts Javascript.
There's a very large degree of ambiguity with such operations. Suppose that case instead:
a = '4'
b = 1
print(a + b)
It's not clear if a should be coerced to an integer (resulting in 5), or if b should be coerced to a string (resulting in '41'). Since type juggling rules are transitive, passing a numeric string to a function expecting numbers could get you in trouble, especially since almost all arithmetic operators have overloaded operations for strings too.
For instance, in Javascript, to make sure you deal with integers and not strings, a common practice is to multiply a variable by one; in Python, the multiplication operator repeats strings, so '41' * 1 is a no-op. It's probably better to just ask the developer to clarify.
The short answer would be because Python is a strongly typed language.
This was a design decision made by Guido. It could have been one way or another really, concatenating str and int to str or int.
The best explanation, is still the one given by guido, you can check it here
The other answers have provided pretty good explanations, but have failed to mention that this feature is known a Strong Typing. Languages that perform implicit conversions are Weakly Typed.
Because Python does not perform type conversion when concatenating strings. This behavior is by design, and you should get in the habit of performing explicit type conversions when you need to coerce objects into strings or numbers.
Change your code to:
a = 'abc'
b = 1
print(a + str(b))
And you'll see the desired result.
Python would have to know what's in the string to do it correctly. There's an ambiguous case: what should '5' + 5 generate? A number or a string? That should certainly throw an error. Now to determine whether that situation holds, python would have to examine the string to tell. Should it do that every time you try to concatenate or add two things? Better to just let the programmer convert the string explicitly.
More generally, implicit conversions like that are just plain confusing! They're hard to predict, hard to read, and hard to debug.
That's just how they decided to design the language. Probably the rationale is that requiring explicit conversions to string reduces the likelihood of unintended behavior (e.g. integer addition if both operands happen to be ints instead of strings).
tell python that the int is a list to disambiguate the '+' operation.
['foo', 'bar'] + [5]
this returns: ['foo', 'bar', 5]

Exploits in Python - manipulating hex strings

I'm quite new to python and trying to port a simple exploit I've written for a stack overflow (just a nop sled, shell code and return address). This isn't for nefarious purposes but rather for a security lecture at a university.
Given a hex string (deadbeef), what are the best ways to:
represent it as a series of bytes
add or subtract a value
reverse the order (for x86 memory layout, i.e. efbeadde)
Any tips and tricks regarding common tasks in exploit writing in python are also greatly appreciated.
In Python 2.6 and above, you can use the built-in bytearray class.
To create your bytearray object:
b = bytearray.fromhex('deadbeef')
To alter a byte, you can reference it using array notation:
b[2] += 7
To reverse the bytearray in place, use b.reverse(). To create an iterator that iterates over it in reverse order, you can use the reversed function: reversed(b).
You may also be interested in the new bytes class in Python 3, which is like bytearray but immutable.
Not sure if this is the best way...
hex_str = "deadbeef"
bytes = "".join(chr(int(hex_str[i:i+2],16)) for i in xrange(0,len(hex_str),2))
rev_bytes = bytes[::-1]
Or might be simpler:
bytes = "\xde\xad\xbe\xef"
rev_bytes = bytes[::-1]
In Python 2.x, regular str values are binary-safe. You can use the binascii module's b2a_hex and a2b_hex functions to convert to and from hexadecimal.
You can use ordinary string methods to reverse or otherwise rearrange your bytes. However, doing any kind of arithmetic would require you to use the ord function to get numeric values for individual bytes, then chr to convert the result back, followed by concatenation to reassemble the modified string.
For mutable sequences with easier arithmetic, use the array module with type code 'B'. These can be initialized from the results of a2b_hex if you're starting from hexadecimal.

A Python buffer that you can truncate from the left?

Right now, I am buffering bytes using strings, StringIO, or cStringIO. But, I often need to remove bytes from the left side of the buffer. A naive approach would rebuild the entire buffer. Is there an optimal way to do this, if left-truncating is a very common operation? Python's garbage collector should actually GC the truncated bytes.
Any sort of algorithm for this (keep the buffer in small pieces?), or an existing implementation, would really help.
Edit:
I tried to use Python 2.7's memoryview for this, but sadly, the data outside the "view" isn't GCed when the original reference is deleted:
# (This will use ~2GB of memory, not 50MB)
memoryview # Requires Python 2.7+
smalls = []
for i in xrange(10):
big = memoryview('z'*(200*1000*1000))
small = big[195*1000*1000:]
del big
smalls.append(small)
print '.',
A deque will be efficient if left-removal operations are frequent (Unlike using a list, string or buffer, it's amortised O(1) for either-end removal). It will be more costly memory-wise than a string however, as you'll be storing each character as its own string object, rather than a packed sequence.
Alternatively, you could create your own implementation (eg. a linked list of string / buffer objects of fixed size), which may store the data more compactly.
Build your buffer as a list of characters or lines and slice the list. Only join as string on output. This is pretty efficient for most types of 'mutable string' behaviour.
The GC will collect the truncated bytes because they are no longer referenced in the list.
UPDATE: For modifying the list head you can simply reverse the list. This sounds like an inefficient thing to do however python's list implementation optimises this internally.
from http://effbot.org/zone/python-list.htm :
Reversing is fast, so temporarily
reversing the list can often speed
things up if you need to remove and
insert a bunch of items at the
beginning of the list:
L.reverse()
# append/insert/pop/delete at far end
L.reverse()

Categories

Resources