Fast string modification in Python

This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Does anyone have better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 features are available here.
Thanks

@gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' content. Joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded, strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character, as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
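For concreteness, here is a minimal sketch of that decode/process/encode discipline (Python 2.x; the helper name interleave_dashes and the fixed '-' separator are illustrative, not from the question):

def interleave_dashes(utf8_bytes):
    u = utf8_bytes.decode('utf-8')             # iteration now yields characters, not bytes
    out = u''.join(ch + u'-' for ch in u)      # O(N) build via join
    return out.encode('utf-8')                 # back to UTF-8 only at the boundary

print interleave_dashes('Nissim')              # N-i-s-s-i-m-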
Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
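For completeness, here are hedged sketches of those last two O(N) builders (Python 2.x, assuming the input has already been decoded to a unicode object and '-' is the character being inserted; the function names are made up):

from array import array
from cStringIO import StringIO

def via_array(u):
    buf = array('u')                    # array of unicode characters
    for ch in u:
        buf.append(ch)                  # appends only -- amortized O(1) each
        buf.append(u'-')
    return buf.tounicode()

def via_cstringio(u):
    out = StringIO()                    # cStringIO buffers byte strings
    for ch in u:
        out.write(ch.encode('utf-8'))
        out.write('-')
    return out.getvalue().decode('utf-8')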

Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re

re.sub(r'(.)', r'\1-', u'Nissim')        # u'N-i-s-s-i-m-'

count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s

re.sub(r'(.)', repl, u'Nissim')          # u'N1i2s3s4i5m6'

This might be an effective Python solution:
s1 = "Nissim"
s2 = "------"
s3 = ''.join([''.join(list(x)) for x in zip(s1, s2)])

Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, try to improve upon it; a C module should be the last option.
Edit: this is also the fastest:
import timeit
s1 = "Nissim"
s2 = "------"
timeit.f1 = lambda s1, s2: ''.join(sum(map(list, zip(s1, s2)), []))
timeit.f2 = lambda s1, s2: ''.join([''.join(list(x)) for x in zip(s1, s2)])
timeit.f3 = lambda s1, s2: ''.join(i + j for i, j in zip(s1, s2))
N = 100000
print "anurag", timeit.Timer("timeit.f1('Nissim', '------')", "import timeit").timeit(N)
print "dweeves", timeit.Timer("timeit.f2('Nissim', '------')", "import timeit").timeit(N)
print "SilentGhost", timeit.Timer("timeit.f3('Nissim', '------')", "import timeit").timeit(N)
The output is:
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505

Here are my timings. Note that this is Python 3.1:
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844

Use reduce.
>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'
The same works with numbers too, as long as you know which character maps to which (a dict can be handy):
>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(9)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'

string = "™¡™©€"
s1 = unicode(string, "utf-8")
s2 = '-' * len(s1)
''.join(sum(map(list, zip(s1, s2)), [])).encode("utf-8")

Related

Python's immutable strings and their slices

Strings in Python are immutable and support the buffer interface. It could be efficient to return not new strings, but buffers pointing into the old string, when slicing or using the .split() method. However, a new string object is constructed each time. Why? The only reason I see is that it would make garbage collection a bit more difficult.
True: in regular situations the memory overhead is linear and isn't noticeable. Copying is fast, and so is allocation. But there is already so much done in Python that maybe such buffers would be worth the effort?
EDIT:
It seems that forming substrings this way would make memory management much more complicated. The case where only 20% of an arbitrary string is used, and we can't deallocate the rest of the string, is a simple example. We could improve the memory allocator so that it could deallocate strings partially, but that would probably be mostly a net loss. All the standard functions can anyway be emulated with buffer or memoryview if memory becomes critical. The code wouldn't be as concise, but one has to give up something in order to get something.
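For illustration, the buffer-protocol route mentioned above looks roughly like this (a minimal sketch; memoryview exists in Python 2.7+ and 3.x, and the data here is made up):

data = b"abcdefghij" * 1000
view = memoryview(data)
chunk = view[20:30]           # no bytes are copied here, just a new view onto data
print(chunk.tobytes())        # only now are those 10 bytes copied out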
The underlying string representation is null-terminated, even though it keeps track of the length; hence you cannot have a string object that references a sub-string which isn't a suffix. This already limits the usefulness of your proposal, since it would add a lot of complications to handle suffixes and non-suffixes differently (and giving up null-terminated strings brings other consequences).
Allowing references to sub-strings of a string would greatly complicate garbage collection and string handling. For every string you'd have to keep track of how many objects refer to each character, or to each range of indices. This means complicating the struct of string objects and every operation that deals with them considerably, which would probably mean a significant slowdown.
Add the fact that, starting with Python 3, strings have three different internal representations, and things are going to be too messy to be maintainable; your proposal probably doesn't offer enough benefit to be accepted.
Another problem with this kind of "optimization" arises when you want to deallocate "big strings":
a = "Some string" * 10 ** 7
b = a[10000]
del a
After these operations you have the substring b, which prevents a, a huge string, from being deallocated. Surely you could copy small strings, but what if b = a[:10000] (or some other big number)? 10,000 characters looks like a big string, which ought to use the optimization to avoid copying, yet it would prevent megabytes of data from being released.
The garbage collector would have to keep checking whether it is worth deallocating a big string object and making copies or not, and all these operations must be as fast as possible, otherwise you end up hurting time performance.
99% of the time the strings used in programs are "small" (at most ~10k characters), hence copying is really fast, while the optimization you propose starts to pay off only with really big strings (e.g. taking substrings of size 100k from huge texts)
and is slower with really small strings, which is the common case, i.e. the case that should be optimized.
If you think this is important, you are free to propose a PEP, show an implementation, and the resulting changes in speed and memory usage of your proposal. If it is really worth the effort, it may be included in a future version of Python.
That's how slices work. Slices always perform a shallow copy, allowing you to do things like
>>> x = [1,2,3]
>>> y = x[:]
Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.
See also this question.
If you are worried about memory (in the case of really large strings), use a buffer():
>>> a = "12345"
>>> b = buffer(a, 2, 2)
>>> b
<read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
>>> print b
34
>>> print b[:]
34
Knowing about this gives you alternatives to string methods such as split().
If you want to split() a string but keep the original string object (as you may still need it), you could do:
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(buffer(s, start, index - start))
        start = index + add
    return res
or, using .index():
def split_buf(s, needle):
    start = 0
    add = len(needle)
    res = []
    try:
        while True:
            index = s.index(needle, start)
            res.append(buffer(s, start, index - start))
            start = index + add
    except ValueError:
        pass
    return res
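For illustration, a possible call of the versions above (Python 2; note that, as written, both versions drop anything after the last occurrence of needle, hence the trailing separator here):
>>> s = "spam,eggs,ham,"
>>> [str(chunk) for chunk in split_buf(s, ",")]
['spam', 'eggs', 'ham']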

What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
              "from __main__ import COUNT, v1, v2", number=50)
Further note: I've run two extra tests: comparing strings with is instead of == suppresses the problem completely, and the performance is about 210 ms/1M comparisons.
Since interning has been mentioned, I made sure to add a white space after each string, which should prevent interning; that doesn't change anything. So is it something other than interning?
Python can 'intern' short strings; it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it'll first test whether they are the same pointer (i.e. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
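As a rough illustration of the identity-versus-equality distinction (CPython only; interning behaviour is an implementation detail, so the exact results may vary):

n = 65
a = 'x' * n                # built at run time, not interned
b = 'x' * n
print(a == b)              # True  -- equal length, then memcmp
print(a is b)              # False -- two distinct objects
c = 'hello'
d = 'hello'
print(c is d)              # True in CPython: identifier-like literals are interned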
I am just making wild guesses, but you asked "what might" rather than "what does", so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and be able to use very fast SSE2 vector instructions to match two strings.

Python .join or string concatenation

I realise that if you have an iterable you should always use .join(iterable) instead of for x in y: str += x. But if there's only a fixed number of variables that aren't already in an iterable, is using .join() still the recommended way?
For example I have
user = 'username'
host = 'host'
should I do
ret = user + '#' + host
or
ret = '#'.join([user, host])
I'm not so much asking from a performance point of view, since both will be pretty trivial. But I've read people on here say always use .join() and I was wondering if there's any particular reason for that or if it's just generally a good idea to use .join().
If you're creating a string like that, you normally want to use string formatting:
>>> user = 'username'
>>> host = 'host'
>>> '%s#%s' % (user, host)
'username#host'
Python 2.6 added another form, which doesn't rely on operator overloading and has some extra features:
>>> '{0}#{1}'.format(user, host)
'username#host'
As a general guideline, most people will use + on strings only if they're adding two strings right there. For more parts or more complex strings, they either use string formatting, like above, or assemble elements in a list and join them together (especially if there's any form of looping involved.) The reason for using str.join() is that adding strings together means creating a new string (and potentially destroying the old ones) for each addition. Python can sometimes optimize this away, but str.join() quickly becomes clearer, more obvious and significantly faster.
I take the question to mean: "Is it ok to do this:"
ret = user + '#' + host
..and the answer is yes. That is perfectly fine.
You should, of course, be aware of the cool formatting stuff you can do in Python, and you should be aware that for long lists, "join" is the way to go, but for a simple situation like this, what you have is exactly right. It's simple and clear, and performance will not be an issue.
(I'm pretty sure all of the people pointing at string formatting are missing the question entirely.)
Creating a string by constructing an array and joining it is for performance reasons only. Unless you need that performance, or unless it happens to be the natural way to implement it anyway, there's no benefit to doing that rather than simple string concatenation.
Saying '#'.join([user, host]) is unintuitive. It makes me wonder: why is he doing this? Are there any subtleties to it; is there any case where there might be more than one '#'? The answer is no, of course, but it takes more time to come to that conclusion than if it was written in a natural way.
Don't contort your code merely to avoid string concatenation; there's nothing inherently wrong with it. Joining arrays is just an optimization.
I'll just note that I had always tended to use in-place concatenation until I reread a portion of the Python style guide, PEP 8 (Style Guide for Python Code):
Code should be written in a way that does not disadvantage other
implementations of Python (PyPy, Jython, IronPython, Pyrex, Psyco,
and such).
For example, do not rely on CPython's efficient implementation of
in-place string concatenation for statements in the form a+=b or a=a+b.
Those statements run more slowly in Jython. In performance sensitive
parts of the library, the ''.join() form should be used instead. This
will ensure that concatenation occurs in linear time across various
implementations.
Going by this, I have been converting to the practice of using joins, so that the habit is automatic for the occasions when efficiency is truly critical.
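For illustration, the pattern PEP 8 is steering you toward looks roughly like this (a generic sketch with made-up names, not code from the original post):

chunks = ['alpha', 'beta', 'gamma']   # imagine this list is large

# Relies on CPython's in-place optimization; may be quadratic elsewhere:
ret = ''
for chunk in chunks:
    ret += chunk

# Linear across implementations: accumulate the pieces, then join once.
parts = []
for chunk in chunks:
    parts.append(chunk)
ret = ''.join(parts)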
So I'll put in my vote for:
ret = '#'.join([user, host])
I use the following:
ret = '%s#%s' % (user, host)
I recommend join() over concatenation, for two reasons:
It's faster.
It's more elegant.
Regarding the first aspect, here's an example:
import timeit

s1 = "Flowers"
s2 = "of"
s3 = "War"

def join_concat():
    return s1 + " " + s2 + " " + s3

def join_builtin():
    return " ".join((s1, s2, s3))

print("Join Concatenation: ", timeit.timeit(join_concat))
print("Join Builtin: ", timeit.timeit(join_builtin))
The output:
$ python3 join_test.py
Join Concatenation: 0.40386943198973313
Join Builtin: 0.2666833929979475
Considering a huge dataset (millions of lines) and its processing, that difference of roughly 0.14 seconds per million operations adds up quickly.
And as for the second aspect: indeed, it is more elegant.

String concatenation vs. string substitution in Python

In Python, the where and when of using string concatenation versus string substitution eludes me. As string concatenation has seen large boosts in performance, is this (becoming more of) a stylistic decision rather than a practical one?
For a concrete example, how should one handle construction of flexible URIs:
DOMAIN = 'http://stackoverflow.com'
QUESTIONS = '/questions'
def so_question_uri_sub(q_num):
return "%s%s/%d" % (DOMAIN, QUESTIONS, q_num)
def so_question_uri_cat(q_num):
return DOMAIN + QUESTIONS + '/' + str(q_num)
Edit: There have also been suggestions about joining a list of strings and for using named substitution. These are variants on the central theme, which is, which way is the Right Way to do it at which time? Thanks for the responses!
Concatenation is (significantly) faster according to my machine. But stylistically, I'm willing to pay the price of substitution if performance is not critical. Well, and if I need formatting, there's no need to even ask the question... there's no option but to use interpolation/templating.
>>> import timeit
>>> def so_q_sub(n):
... return "%s%s/%d" % (DOMAIN, QUESTIONS, n)
...
>>> so_q_sub(1000)
'http://stackoverflow.com/questions/1000'
>>> def so_q_cat(n):
... return DOMAIN + QUESTIONS + '/' + str(n)
...
>>> so_q_cat(1000)
'http://stackoverflow.com/questions/1000'
>>> t1 = timeit.Timer('so_q_sub(1000)','from __main__ import so_q_sub')
>>> t2 = timeit.Timer('so_q_cat(1000)','from __main__ import so_q_cat')
>>> t1.timeit(number=10000000)
12.166618871951641
>>> t2.timeit(number=10000000)
5.7813972166853773
>>> t1.timeit(number=1)
1.103492206766532e-05
>>> t2.timeit(number=1)
8.5206360154188587e-06
>>> def so_q_tmp(n):
... return "{d}{q}/{n}".format(d=DOMAIN,q=QUESTIONS,n=n)
...
>>> so_q_tmp(1000)
'http://stackoverflow.com/questions/1000'
>>> t3= timeit.Timer('so_q_tmp(1000)','from __main__ import so_q_tmp')
>>> t3.timeit(number=10000000)
14.564135316080637
>>> def so_q_join(n):
... return ''.join([DOMAIN,QUESTIONS,'/',str(n)])
...
>>> so_q_join(1000)
'http://stackoverflow.com/questions/1000'
>>> t4= timeit.Timer('so_q_join(1000)','from __main__ import so_q_join')
>>> t4.timeit(number=10000000)
9.4431309007150048
Don't forget about named substitution:
def so_question_uri_namedsub(q_num):
    domain, questions = DOMAIN, QUESTIONS   # make the names visible to locals()
    return "%(domain)s%(questions)s/%(q_num)d" % locals()
Be wary of concatenating strings in a loop! The cost of string concatenation is proportional to the length of the result. Looping leads you straight to the land of N-squared. Some languages will optimize concatenation to the most recently allocated string, but it's risky to count on the compiler to optimize your quadratic algorithm down to linear. Best to use the primitive (join?) that takes an entire list of strings, does a single allocation, and concatenates them all in one go.
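In Python that primitive is str.join, which takes the whole sequence at once; a minimal illustration (the names here are made up):
pieces = ['spam', 'and', 'eggs']
sentence = ' '.join(pieces)    # one pass over the pieces, one allocation for the result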
"As the string concatenation has seen large boosts in performance..."
If performance matters, this is good to know.
However, performance problems I've seen have never come down to string operations. I've generally gotten in trouble with I/O, sorting and O(n²) operations being the bottlenecks.
Until string operations are the performance limiters, I'll stick with things that are obvious. Mostly, that's substitution when it's one line or less, concatenation when it makes sense, and a template tool (like Mako) when it's large.
What you want to concatenate/interpolate and how you want to format the result should drive your decision.
String interpolation allows you to easily add formatting. In fact, your string interpolation version doesn't do the same thing as your concatenation version; it actually adds an extra forward slash before the q_num parameter. To do the same thing, you would have to write return DOMAIN + QUESTIONS + "/" + str(q_num) in that example.
Interpolation makes it easier to format numerics; "%d of %d (%2.2f%%)" % (current, total, total/current) would be much less readable in concatenation form.
Concatenation is useful when you don't have a fixed number of items to string-ize.
Also, know that Python 2.6 introduces a new version of string interpolation, called string templating:
def so_question_uri_template(q_num):
    return "{domain}/{questions}/{num}".format(domain=DOMAIN,
                                               questions=QUESTIONS,
                                               num=q_num)
String templating is slated to eventually replace %-interpolation, but that won't happen for quite a while, I think.
I was just testing the speed of different string concatenation/substitution methods out of curiosity. A google search on the subject brought me here. I thought I would post my test results in the hope that it might help someone decide.
import timeit

def percent_():
    return "test %s, with number %s" % (1, 2)

def format_():
    return "test {}, with number {}".format(1, 2)

def format2_():
    return "test {1}, with number {0}".format(2, 1)

def concat_():
    return "test " + str(1) + ", with number " + str(2)

def dotimers(func_list):
    # runs a single test for all functions in the list
    for func in func_list:
        tmr = timeit.Timer(func)
        res = tmr.timeit()
        print "test " + func.func_name + ": " + str(res)

def runtests(func_list, runs=5):
    # runs multiple tests for all functions in the list
    for i in range(runs):
        print "----------- TEST #" + str(i + 1)
        dotimers(func_list)
...After running runtests((percent_, format_, format2_, concat_), runs=5), I found that the % method was about twice as fast as the others on these small strings. The concat method was always the slowest (barely). There were very tiny differences when switching the positions in the format() method, but switching positions was always at least .01 slower than the regular format method.
Sample of test results:
test concat_() : 0.62 (0.61 to 0.63)
test format_() : 0.56 (consistently 0.56)
test format2_() : 0.58 (0.57 to 0.59)
test percent_() : 0.34 (0.33 to 0.35)
I ran these because I do use string concatenation in my scripts, and I was wondering what the cost was. I ran them in different orders to make sure nothing was interfering, or getting better performance by being first or last. On a side note, I threw some longer string generators into those functions, like "%s" + ("a" * 1024), and regular concat was almost 3 times as fast (1.1 vs 2.8) as the format and % methods. I guess it depends on the strings and what you are trying to achieve. If performance really matters, it might be better to try different things and test them. I tend to choose readability over speed, unless speed becomes a problem, but that's just me.
Remember, stylistic decisions are practical decisions, if you ever plan on maintaining or debugging your code :-) There's a famous quote from Knuth (possibly quoting Hoare?): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
As long as you're careful not to (say) turn an O(n) task into an O(n²) task, I would go with whichever you find easiest to understand.
I use substitution wherever I can. I only use concatenation if I'm building a string up in say a for-loop.
Actually, the correct thing to do in this case (building paths) is to use os.path.join, not string concatenation or interpolation.
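For example (a hedged sketch: os.path.join targets filesystem paths and uses the platform's separator; for URLs you would more likely reach for urlparse.urljoin):
import os.path
path = os.path.join('/srv', 'questions', str(1000))
print(path)    # /srv/questions/1000 on POSIX systems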

Ignore case in Python strings [duplicate]

What is the easiest way to compare strings in Python, ignoring case?
Of course one can do (str1.lower() <= str2.lower()), etc., but this creates two additional temporary strings (with the obvious allocation/GC overheads).
I guess I'm looking for an equivalent to C's stricmp().
[Some more context requested, so I'll demonstrate with a trivial example:]
Suppose you want to sort a looong list of strings. You simply do theList.sort().
This is O(n * log(n)) string comparisons and no memory management (since all
strings and list elements are some sort of smart pointers). You are happy.
Now, you want to do the same, but ignore the case (let's simplify and say
all strings are ascii, so locale issues can be ignored).
You can do theList.sort(key=lambda s: s.lower()), but then you cause two new
allocations per comparison, plus burden the garbage-collector with the duplicated
(lowered) strings.
Each such memory-management noise is orders-of-magnitude slower than simple string comparison.
Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp)
and it is as fast and as memory-friendly as theList.sort(). You are happy again.
The problem is that any Python-based case-insensitive comparison involves implicit string
duplications, so I was expecting to find a C-based comparison (maybe in module string).
Could not find anything like that, hence the question here.
(Hope this clarifies the question).
Here is a benchmark showing that using str.lower is faster than the accepted answer's proposed method (libc.strcasecmp):
#!/usr/bin/env python2.7
import random
import timeit
from ctypes import *

libc = CDLL('libc.dylib')  # change to 'libc.so.6' on linux

with open('/usr/share/dict/words', 'r') as wordlist:
    words = wordlist.read().splitlines()

random.shuffle(words)
print '%i words in list' % len(words)

setup = 'from __main__ import words, libc; gc.enable()'
stmts = [
    ('simple sort', 'sorted(words)'),
    ('sort with key=str.lower', 'sorted(words, key=str.lower)'),
    ('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
]

for (comment, stmt) in stmts:
    t = timeit.Timer(stmt=stmt, setup=setup)
    print '%s: %.2f msec/pass' % (comment, (1000*t.timeit(10)/10))
typical times on my machine:
235886 words in list
simple sort: 483.59 msec/pass
sort with key=str.lower: 1064.70 msec/pass
sort with cmp=libc.strcasecmp: 5487.86 msec/pass
So, the version with str.lower is not only the fastest by far, but also the most portable and pythonic of all the proposed solutions here.
I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into the libc module doesn't duplicate any strings?
NB: The lower() string method also has the advantage of being locale-dependent. Something you will probably not be getting right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a unicode context.
Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:
Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']
Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need; it avoids the "duplicate" strings created by str.lower or locale.strxfrm.
Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.
The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:
from timeit import Timer
s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"
code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print time / 1000 # 0.00920499992371 on my machine
If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html
I can't find any other built-in way of doing a case-insensitive comparison: the Python Cookbook recipe uses lower().
However, you have to be careful when using lower() for comparisons because of the Turkish I problem. Unfortunately, Python's handling of the Turkish I's is not good: ı is converted to I, but I is not converted to ı; İ is converted to i, but i is not converted to İ.
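A quick, minimal illustration of that asymmetry (CPython; the exact mapping of İ also differs between Python versions):
print(u'I'.lower())                 # 'i' -- not the dotless u'\u0131' Turkish expects
print(u'\u0131'.upper())            # 'I' -- dotless i does uppercase to plain I
print(u'I'.lower() == u'\u0131')    # False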
There's no built-in equivalent to the function you want.
You can write your own function that lowercases one character at a time to avoid duplicating both strings, but I'm sure it will be very CPU-intensive and extremely inefficient.
Unless you are working with extremely long strings (so long that they could cause a memory problem if duplicated), I would keep it simple and use
str1.lower() == str2.lower()
You'll be OK.
This question is asking 2 very different things:
What is the easiest way to compare strings in Python, ignoring case?
I guess I'm looking for an equivalent to C's stricmp().
Since #1 has been answered very well already (i.e. str1.lower() < str2.lower()), I will answer #2.
def strincmp(str1, str2, numchars=None):
    result = 0
    len1 = len(str1)
    len2 = len(str2)
    if numchars is not None:
        minlen = min(len1, len2, numchars)
    else:
        minlen = min(len1, len2)
    #end if
    orda = ord('a')
    ordz = ord('z')
    i = 0
    while i < minlen and 0 == result:
        ord1 = ord(str1[i])
        ord2 = ord(str2[i])
        if ord1 >= orda and ord1 <= ordz:
            ord1 = ord1 - 32
        #end if
        if ord2 >= orda and ord2 <= ordz:
            ord2 = ord2 - 32
        #end if
        result = cmp(ord1, ord2)
        i += 1
    #end while
    if 0 == result and minlen != numchars:
        if len1 < len2:
            result = -1
        elif len2 < len1:
            result = 1
        #end if
    #end if
    return result
#end def
Only use this function when it makes sense to, as in many instances the lowercase technique will be superior.
I only work with ASCII strings; I'm not sure how this will behave with Unicode.
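For illustration, here is how the function above behaves on a few made-up inputs (Python 2, since it relies on cmp):
print strincmp("Hello", "hello")        # 0: equal when case is ignored
print strincmp("apple", "Banana")       # -1: 'a' sorts before 'b'
print strincmp("abcdef", "ABCXYZ", 3)   # 0: only the first 3 characters are compared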
When something isn't supported well in the standard library, I always look for a PyPI package. With virtualization and the ubiquity of modern Linux distributions, I no longer avoid Python extensions. PyICU seems to fit the bill: https://stackoverflow.com/a/1098160/3461
There is now also a pure-Python option that is well tested: https://github.com/jtauber/pyuca
Old answer:
I like the regular expression solution. Here's a function you can copy and paste into any function, thanks to python's block structure support.
def equals_ignore_case(str1, str2):
    import re
    return re.match(re.escape(str1) + r'\Z', str2, re.I) is not None
Since I used match instead of search, I didn't need to add a caret (^) to the regular expression.
Note: This only checks equality, which is sometimes what is needed. I also wouldn't go so far as to say that I like it.
This is how you'd do it with re:
import re
p = re.compile('^hello$', re.I)
p.match('Hello')
p.match('hello')
p.match('HELLO')
The recommended idiom for sorting lists of values using expensive-to-compute keys is the so-called "decorate-sort-undecorate" pattern. It simply consists of building a list of (key, value) tuples from the original list and sorting that list. Then it is trivial to eliminate the keys and get the list of sorted values:
>>> original_list = ['a', 'b', 'A', 'B']
>>> decorated = [(s.lower(), s) for s in original_list]
>>> decorated.sort()
>>> sorted_list = [s[1] for s in decorated]
>>> sorted_list
['A', 'a', 'B', 'b']
Or if you like one-liners:
>>> sorted_list = [s[1] for s in sorted((s.lower(), s) for s in original_list)]
>>> sorted_list
['A', 'a', 'B', 'b']
If you really worry about the cost of calling lower(), you can just store tuples of (lowered string, original string) everywhere. Tuples are the cheapest kind of containers in Python, they are also hashable so they can be used as dictionary keys, set members, etc.
I'm pretty sure you either have to use .lower() or use a regular expression. I'm not aware of a built-in case-insensitive string comparison function.
For occasional or even repeated comparisons, a few extra string objects shouldn't matter as long as this won't happen in the innermost loop of your core code or you don't have enough data to actually notice the performance impact. See if you do: doing things in a "stupid" way is much less stupid if you also do it less.
If you seriously want to keep comparing lots and lots of text case-insensitively, you could somehow keep the lowercase versions of the strings at hand to avoid finalization and re-creation, or normalize the whole data set into lowercase. This of course depends on the size of the data set. If there are relatively few needles and a large haystack, replacing the needles with compiled regexp objects is one solution. It's hard to say without seeing a concrete example.
You could translate each string to lowercase once --- lazily only when you need it, or as a prepass to the sort if you know you'll be sorting the entire collection of strings. There are several ways to attach this comparison key to the actual data being sorted, but these techniques should be addressed in a separate issue.
Note that this technique can be used not only to handle upper/lower case issues, but for other types of sorting such as locale specific sorting, or "Library-style" title sorting that ignores leading articles and otherwise normalizes the data before sorting it.
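A minimal sketch of that prepass (note that list.sort(key=...) already computes the key once per element, not once per comparison, so the lowering happens only n times for n strings):
names = ['banana', 'Apple', 'cherry']
names.sort(key=str.lower)      # each string is lowered exactly once
print(names)                   # ['Apple', 'banana', 'cherry']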
Just use the str().lower() method, unless high-performance is important - in which case write that sorting method as a C extension.
"How to write a Python Extension" seems like a decent intro..
More interestingly, this guide compares using the ctypes library with writing an external C module (ctypes is quite substantially slower than a C extension).
import re
# re.match returns a match object (which is truthy) here:
if re.match('tEXT', 'text', re.IGNORECASE):
    pass  # is True
You could subclass str and create your own case-insensitive string class, but IMHO that would be extremely unwise and create far more trouble than it's worth.
In response to your clarification...
You could use ctypes to execute the C function strcasecmp. ctypes is included in Python 2.5. It provides the ability to call out to DLLs and shared libraries such as libc. Here is a quick example (Python on Linux; see the link for Win32 help):
from ctypes import *
libc = CDLL("libc.so.6")          # see link above for Win32 help
libc.strcasecmp("THIS", "this")   # returns 0
libc.strcasecmp("THIS", "THAT")   # returns 8
You may also want to reference the strcasecmp documentation.
I'm not really sure whether this is any faster or slower (I have not tested), but it's a way to use a C function to do case-insensitive string comparisons.
ActiveState Code - Recipe 194371: Case Insensitive Strings
is a recipe for creating a case-insensitive string class. It might be a bit of overkill for something quick, but it could provide you with a common way of handling case-insensitive strings if you plan on using them often.
