Python's immutable strings and their slices

Strings in Python are immutable and support the buffer interface. When slicing or calling .split(), it could be efficient to return not new strings, but buffers pointing into the parts of the old string. However, a new string object is constructed each time. Why? The only reason I see is that it would make garbage collection a bit more difficult.
True: in typical situations the memory overhead is linear and not noticeable, and both copying and allocation are fast. But Python already does a lot of work per operation, so maybe such buffers would be worth the effort?
EDIT:
It seems that forming substrings this way would make memory management much more complicated. A simple example: if only 20% of an arbitrary string is still referenced, the rest of it cannot be deallocated. The memory allocator could be improved so that it can deallocate strings partially, but that would probably be a net loss overall. If memory ever becomes critical, all the standard functions can be emulated with buffer or memoryview anyway. The code wouldn't be as concise, but one has to give up something in order to get something.
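For example, a minimal sketch of the memoryview fallback (it only works on bytes-like data, so text has to be encoded first; the names here are just illustrative):
data = u"some long string".encode("utf-8")   # a plain byte string
view = memoryview(data)
part = view[5:9]                              # shares data's buffer, no copy is made
print(part.tobytes())                         # 'long'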

The underlying string representation is null-terminated, even though it also keeps track of the length, so you cannot have a string object that references a sub-string unless that sub-string is a suffix. This already limits the usefulness of your proposal, since it would add a lot of complexity to handle suffixes and non-suffixes differently (and giving up null-terminated strings brings other consequences).
Allowing references to sub-strings of a string would complicate garbage collection and string handling considerably. For every string you'd have to keep track of how many objects refer to each character, or to each range of indices. That means complicating the struct of string objects and every operation that deals with them, which probably means a big slowdown.
Add the fact that, starting with Python 3, strings have three different internal representations, and things become too messy to be maintainable; your proposal probably doesn't offer enough benefit to be accepted.
Another problem with this kind of "optimization" is when you want to deallocate "big strings":
a = "Some string" * 10 ** 7
b = a[10000]
del a
After these operations you have the substring b, which keeps a, a huge string, from being deallocated. Sure, you could copy small substrings, but what about b = a[:10000] (or some other big number)? 10000 characters looks like a big string which ought to use the optimization to avoid copying, yet it would prevent megabytes of data from being released.
The garbage collector would have to keep checking whether it is worth deallocating a big string object and making copies instead, and all of these operations must be as fast as possible, otherwise you end up hurting time performance.
99% of the time the strings used in programs are "small" (at most ~10k characters), so copying is really fast, while the optimization you propose only starts to pay off with really big strings (e.g. taking substrings of size 100k from huge texts) and would be slower for really small strings, which is the common case, i.e. the case that should be optimized.
If you think it is important, you are free to propose a PEP, show an implementation and the resulting changes in speed/memory usage. If it is really worth the effort it may be included in a future version of Python.

That's how slices work. Slices always perform a shallow copy, allowing you to do things like
>>> x = [1,2,3]
>>> y = x[:]
Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.
See also this question.
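A quick interactive check of the copy behaviour described above (the identity results are CPython-specific):
>>> x = [1, 2, 3]
>>> y = x[:]
>>> y == x, y is x          # a shallow copy: equal, but a distinct list
(True, False)
>>> s = "hello world"
>>> s[2:7] is s[2:7]        # each slice builds a brand-new string object
False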

If you are worried about memory (in the case of really large strings), use a buffer():
>>> a = "12345"
>>> b = buffer(a, 2, 2)
>>> b
<read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
>>> print b
34
>>> print b[:]
34
Knowing about this gives you alternatives to string methods such as split().
If you want to split() a string, but keep the original string object (as you maybe need it), you could do:
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(buffer(s, start, index - start))
        start = index + add
    res.append(buffer(s, start))    # the tail after the last needle
    return res
or, using .index():
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    try:
        while True:
            index = s.index(needle, start)
            res.append(buffer(s, start, index - start))
            start = index + add
    except ValueError:
        res.append(buffer(s, start))    # the tail after the last needle
    return res
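Note that buffer() only exists in Python 2. On Python 3, a similar zero-copy split can be sketched with memoryview over a bytes object (split_view is just an illustrative name, and the data is assumed to be bytes rather than text):
def split_view(data, needle):
    # data and needle are bytes; each result is a memoryview into data.
    view = memoryview(data)
    res = []
    start = 0
    while True:
        index = data.find(needle, start)
        if index < 0:
            break
        res.append(view[start:index])
        start = index + len(needle)
    res.append(view[start:])
    return res

parts = split_view(b"a,b,c", b",")
print([p.tobytes() for p in parts])    # [b'a', b'b', b'c']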


What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
              "from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing strings with is instead of == suppresses the problem completely, and the performance is about 210 ms per 1M comparisons.
Since interning has been mentioned, I made sure to add a whitespace after each string, which should prevent interning; that doesn't change anything. Is it something other than interning, then?
Python can 'intern' short strings; it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it first tests whether the two are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:
    case Py_LE:
    case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
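A small sketch of that identity fast path (the behaviour is CPython-specific; intern() is the Python 2 builtin, sys.intern() in Python 3):
a = "".join(["he", "llo"])     # built at run time, so not interned
b = "".join(["he", "llo"])
print(a == b, a is b)          # equal contents, but two different objects
a = intern(a)                  # force both into the intern cache
b = intern(b)
print(a is b)                  # True: comparison can stop at the pointer check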
I am just making wild guesses, but you asked "what might" rather than what does, so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and use very fast SSE2 vector instructions to match two strings.

Python Remove last char from string and return it

While I know that there is the possibility:
>>> a = "abc"
>>> result = a[-1]
>>> a = a[:-1]
Now I also know that strings are immutable and therefore something like this:
>>> a.pop()
c
is not possible.
But is this really the preferred way?
Strings are "immutable" for good reason: It really saves a lot of headaches, more often than you'd think. It also allows python to be very smart about optimizing their use. If you want to process your string in increments, you can pull out part of it with split() or separate it into two parts using indices:
a = "abc"
a, result = a[:-1], a[-1]
This shows that you're splitting your string in two. If you'll be examining every byte of the string, you can iterate over it (in reverse, if you wish):
for result in reversed(a):
    ...
I should add this seems a little contrived: Your string is more likely to have some separator, and then you'll use split:
ans = "foo,blah,etc."
for a in ans.split(","):
...
Not only is it the preferred way, it's the only reasonable way. Because strings are immutable, in order to "remove" a char from a string you have to create a new string whenever you want a different string value.
You may be wondering why strings are immutable, given that you have to make a whole new string every time you change a character. After all, C strings are just arrays of characters and are thus mutable, and some languages that support strings more cleanly than C allow mutable strings as well. There are two reasons to have immutable strings: security/safety and performance.
Security is probably the most important reason for strings to be immutable. When strings are immutable, you can't pass a string into some library and then have that string change from under your feet when you don't expect it. You may wonder which library would change string parameters, but if you're shipping code to clients you can't control their versions of the standard library, and malicious clients may change out their standard libraries in order to break your program and find out more about its internals. Immutable objects are also easier to reason about, which is really important when you try to prove that your system is secure against particular threats. This ease of reasoning is especially important for thread safety, since immutable objects are automatically thread-safe.
Performance is surprisingly often better for immutable strings. Because a string object can never change, the runtime is free to cache its hash, intern commonly used values, and hand back the existing object where a copy would otherwise be required (in CPython, a full slice like s[:] returns the original string rather than a new one). You get copy semantics without actually copying, which is a real performance win.
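A small, CPython-specific sketch of two of those wins (hash caching is an implementation detail, not a language guarantee):
s = "x" * (10 ** 7)
print(s[:] is s)     # True: a full slice hands back the very same object
d = {s: 1}
hash(s)              # computed by walking all 10 million characters...
hash(s)              # ...but cached on the object, so this call is free
print(d[s])          # 1; the lookup re-uses the cached hash too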
Eric Lippert explains more about the rationale behind the immutability of strings (in C#, not Python) here.
The precise wording of the question makes me think it's impossible.
"Return" to me means you have a function to which you have passed a string as a parameter.
You cannot change this parameter. Assigning to it will only rebind the name within the function; it does not modify the passed-in string. E.g.
>>> def removeAndReturnLastCharacter(a):
...     c = a[-1]
...     a = a[:-1]
...     return c
>>> b = "Hello, Gaukler!"
>>> removeAndReturnLastCharacter(b)
'!'
>>> b    # b has not been changed
'Hello, Gaukler!'
Yes, Python strings are immutable and any modification results in creating a new string. This is how it's usually done, so go ahead with it.
I decided to go with a for loop and just skip the item in question. Is that an acceptable alternative?
new = ''
for index, item in enumerate(s):   # use 's' rather than shadowing the builtin 'str'
    if index == n:                 # skip only the character at position n
        continue
    new += item
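That works, but a simpler sketch of the same idea is to slice around the unwanted index (assuming n is a valid index into s):
new = s[:n] + s[n + 1:]    # everything before index n, plus everything after it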

Python .join or string concatenation

I realise that if you have an iterable you should always use .join(iterable) instead of for x in y: str += x. But if there's only a fixed number of variables that aren't already in an iterable, is using .join() still the recommended way?
For example I have
user = 'username'
host = 'host'
should I do
ret = user + '#' + host
or
ret = '#'.join([user, host])
I'm not so much asking from a performance point of view, since both will be pretty trivial. But I've read people on here say always use .join() and I was wondering if there's any particular reason for that or if it's just generally a good idea to use .join().
If you're creating a string like that, you normally want to use string formatting:
>>> user = 'username'
>>> host = 'host'
>>> '%s#%s' % (user, host)
'username#host'
Python 2.6 added another form, which doesn't rely on operator overloading and has some extra features:
>>> '{0}#{1}'.format(user, host)
'username#host'
As a general guideline, most people will use + on strings only if they're adding two strings right there. For more parts or more complex strings, they either use string formatting, like above, or assemble the elements in a list and join them together (especially if there's any form of looping involved). The reason for using str.join() is that adding strings together means creating a new string (and potentially destroying the old ones) for each addition. Python can sometimes optimize this away, but str.join() quickly becomes clearer, more obvious and significantly faster.
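A minimal sketch of the two patterns being contrasted (the data here is made up for illustration):
pieces = ["chunk%d" % i for i in range(1000)]

# Repeated += may allocate a new string and copy the old contents each time.
out = ""
for p in pieces:
    out += p

# A single join computes the final size once and copies each piece once.
out = "".join(pieces)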
I take the question to mean: "Is it ok to do this:"
ret = user + '#' + host
..and the answer is yes. That is perfectly fine.
You should, of course, be aware of the cool formatting stuff you can do in Python, and you should be aware that for long lists, "join" is the way to go, but for a simple situation like this, what you have is exactly right. It's simple and clear, and performance will not be an issue.
(I'm pretty sure all of the people pointing at string formatting are missing the question entirely.)
Creating a string by constructing an array and joining it is for performance reasons only. Unless you need that performance, or unless it happens to be the natural way to implement it anyway, there's no benefit to doing that rather than simple string concatenation.
Saying '#'.join([user, host]) is unintuitive. It makes me wonder: why is he doing this? Are there any subtleties to it; is there any case where there might be more than one '#'? The answer is no, of course, but it takes more time to come to that conclusion than if it was written in a natural way.
Don't contort your code merely to avoid string concatenation; there's nothing inherently wrong with it. Joining arrays is just an optimization.
I'll just note that I always tended to use in-place concatenation until I reread a portion of PEP 8, the Style Guide for Python Code:
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Pyrex, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. Those statements run more slowly in Jython. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
Going by this, I have been converting to the practice of using joins, so that the habit is already automatic when efficiency is actually critical.
So I'll put in my vote for:
ret = '#'.join([user, host])
I use the following:
ret = '%s#%s' % (user, host)
I recommend join() over concatenation, based on two aspects:
Faster.
More elegant.
Regarding the first aspect, here's an example:
import timeit

s1 = "Flowers"
s2 = "of"
s3 = "War"

def join_concat():
    return s1 + " " + s2 + " " + s3

def join_builtin():
    return " ".join((s1, s2, s3))

print("Join Concatenation: ", timeit.timeit(join_concat))
print("Join Builtin: ", timeit.timeit(join_builtin))
The output:
$ python3 join_test.py
Join Concatenation: 0.40386943198973313
Join Builtin: 0.2666833929979475
That is roughly 0.13-0.14 seconds saved per million calls (timeit runs the statement a million times by default); when you're processing a huge dataset (millions of lines), the difference adds up.
As for the second aspect: it is, indeed, more elegant.

fast string modification in python

This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Does anyone have better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.
Thanks
@gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' content; joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character, as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
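A minimal sketch of that decode / process / encode shape for the dash example in the question (Python 2 syntax; dash_separate is just an illustrative name):
def dash_separate(raw):
    # raw is a UTF-8 encoded byte str; decode so the loop sees characters,
    # not bytes, then encode the result back on the way out.
    text = raw.decode('utf-8')
    out = u''.join(ch + u'-' for ch in text)
    return out.encode('utf-8')

print dash_separate("Nissim")      # N-i-s-s-i-m-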
Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
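For reference, a minimal sketch of that append-only array.array('u') approach (text is assumed to be a unicode object; the function name is illustrative):
from array import array

def dash_separate_array(text):
    # text is a unicode object; each append at the end is amortized O(1),
    # so the whole build stays O(N).
    out = array('u')
    for ch in text:
        out.append(ch)
        out.append(u'-')
    return out.tounicode()

print dash_separate_array(u"Nissim")    # N-i-s-s-i-m-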
Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re

re.sub(r'(.)', r'\1-', u'Nissim')

count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s

re.sub(r'(.)', repl, u'Nissim')
This might be an effective pythonic solution:
s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, try to improve upon it; a C module should be the last option.
Edit: This is also the fastest
import timeit
s1="Nissim"
s2="------"
timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))
N=100000
print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)
output is
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Here are my timings. Note that it's py3.1.
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
Use reduce.
>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'
The same works with numbers too, as long as you know which char maps to which (a dict can be handy).
>>> mapper = dict([(repr(i), chr(i + ord('a'))) for i in range(10)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
string = "™¡™©€"
s1 = unicode(string, "utf-8")
s2 = '-' * len(s1)
result = ''.join(sum(map(list, zip(s1, s2)), [])).encode("utf-8")

A Python buffer that you can truncate from the left?

Right now, I am buffering bytes using strings, StringIO, or cStringIO. But, I often need to remove bytes from the left side of the buffer. A naive approach would rebuild the entire buffer. Is there an optimal way to do this, if left-truncating is a very common operation? Python's garbage collector should actually GC the truncated bytes.
Any sort of algorithm for this (keep the buffer in small pieces?), or an existing implementation, would really help.
Edit:
I tried to use Python 2.7's memoryview for this, but sadly, the data outside the "view" isn't GCed when the original reference is deleted:
# (This will use ~2GB of memory, not 50MB)
memoryview  # Requires Python 2.7+
smalls = []
for i in xrange(10):
    big = memoryview('z' * (200 * 1000 * 1000))
    small = big[195 * 1000 * 1000:]
    del big
    smalls.append(small)
    print '.',
A deque will be efficient if left-removal operations are frequent (unlike a list, string or buffer, it's amortised O(1) for removal at either end). It will be more costly memory-wise than a string, however, as you'll be storing each character as its own string object rather than as a packed sequence.
Alternatively, you could create your own implementation (eg. a linked list of string / buffer objects of fixed size), which may store the data more compactly.
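A minimal sketch of that idea, using a deque of fixed-size chunks rather than single characters to keep the memory overhead down (ChunkBuffer and the chunk size are just illustrative choices):
from collections import deque

class ChunkBuffer(object):
    """Append data on the right, cheaply discard whole chunks from the left."""
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = deque()          # each entry is a string of <= chunk_size bytes

    def append(self, data):
        for i in range(0, len(data), self.chunk_size):
            self.chunks.append(data[i:i + self.chunk_size])

    def popleft_chunk(self):
        # O(1); the chunk becomes garbage as soon as the caller drops it.
        return self.chunks.popleft()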
Build your buffer as a list of characters or lines and slice the list. Only join as string on output. This is pretty efficient for most types of 'mutable string' behaviour.
The GC will collect the truncated bytes because they are no longer referenced in the list.
UPDATE: To modify the list head you can simply reverse the list, work at the (now) far end, then reverse it back. This sounds inefficient, but reversing a list is a fast in-place O(n) operation.
from http://effbot.org/zone/python-list.htm :
Reversing is fast, so temporarily reversing the list can often speed things up if you need to remove and insert a bunch of items at the beginning of the list:
L.reverse()
# append/insert/pop/delete at far end
L.reverse()
