This question already has answers here:
How do I do a case-insensitive string comparison?
(15 answers)
Closed 6 years ago.
What is the easiest way to compare strings in Python, ignoring case?
Of course one can do (str1.lower() <= str2.lower()), etc., but this creates two additional temporary strings (with the obvious alloc/GC overheads).
I guess I'm looking for an equivalent to C's stricmp().
[Some more context requested, so I'll demonstrate with a trivial example:]
Suppose you want to sort a looong list of strings. You simply do theList.sort().
This is O(n * log(n)) string comparisons and no memory management (since all
strings and list elements are some sort of smart pointers). You are happy.
Now, you want to do the same, but ignore the case (let's simplify and say
all strings are ascii, so locale issues can be ignored).
You can do theList.sort(key=lambda s: s.lower()), but then you cause two new
allocations per comparison, plus burden the garbage-collector with the duplicated
(lowered) strings.
All that memory-management noise is orders of magnitude slower than a simple string comparison.
Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp)
and it is as fast and as memory-friendly as theList.sort(). You are happy again.
The problem is that any Python-based case-insensitive comparison involves implicit string
duplications, so I was expecting to find a C-based comparison (maybe in the string module).
I could not find anything like that, hence the question here.
(Hope this clarifies the question).
Here is a benchmark showing that using str.lower is faster than the accepted answer's proposed method (libc.strcasecmp):
#!/usr/bin/env python2.7
import random
import timeit
from ctypes import *

libc = CDLL('libc.dylib')  # change to 'libc.so.6' on Linux

with open('/usr/share/dict/words', 'r') as wordlist:
    words = wordlist.read().splitlines()
random.shuffle(words)
print '%i words in list' % len(words)

# import gc explicitly so gc.enable() in the setup is guaranteed to resolve
setup = 'import gc; from __main__ import words, libc; gc.enable()'
stmts = [
    ('simple sort', 'sorted(words)'),
    ('sort with key=str.lower', 'sorted(words, key=str.lower)'),
    ('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
]
for (comment, stmt) in stmts:
    t = timeit.Timer(stmt=stmt, setup=setup)
    print '%s: %.2f msec/pass' % (comment, (1000 * t.timeit(10) / 10))
Typical times on my machine:
235886 words in list
simple sort: 483.59 msec/pass
sort with key=str.lower: 1064.70 msec/pass
sort with cmp=libc.strcasecmp: 5487.86 msec/pass
So, the version with str.lower is not only the fastest by far, but also the most portable and pythonic of all the proposed solutions here.
I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into the libc module doesn't duplicate any strings?
NB: The lower() string method also has the advantage of being locale-dependent. Something you will probably not be getting right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a unicode context.
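(For what it's worth, Python 3.3+ later added str.casefold(), which is designed for exactly this kind of caseless matching and handles some cases that lower() does not; a small illustration:)
# Requires Python 3.3+ (casefold did not exist when this answer was written).
print("Maße".lower() == "MASSE".lower())        # False: 'ß'.lower() is still 'ß'
print("Maße".casefold() == "MASSE".casefold())  # True: casefold maps 'ß' to 'ss'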
Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:
Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']
Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, avoiding the str.lower or locale.strxfrm "duplicate" strings.
Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.
The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:
from timeit import Timer
s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"
code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print time / 1000 # 0.00920499992371 on my machine
If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html
I can't find any other built-in way of doing case-insensitive comparison: the Python Cookbook recipe uses lower().
However, you have to be careful when using lower() for comparisons because of the Turkish I problem. Unfortunately Python's handling of the Turkish I is not good: ı is converted to I, but I is not converted to ı; İ is converted to i, but i is not converted to İ.
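A small illustration of the asymmetry (Python 3 syntax; use u'' literals on Python 2):
dotless = '\u0131'                      # 'ı', LATIN SMALL LETTER DOTLESS I
print(dotless.upper())                  # 'I'
print('I'.lower())                      # 'i', not 'ı'
print(dotless.lower() == 'I'.lower())   # False, although in Turkish they are the same letter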
There's no built in equivalent to that function you want.
You can write your own function that lowercases one character at a time to avoid duplicating both strings, but I'm sure it will be very CPU-intensive and extremely inefficient.
Unless you are working with extremely long strings (so long that they could cause a memory problem if duplicated), I would keep it simple and use
str1.lower() == str2.lower()
You'll be OK.
This question is asking 2 very different things:
What is the easiest way to compare strings in Python, ignoring case?
I guess I'm looking for an equivalent to C's stricmp().
Since #1 has been answered very well already (ie: str1.lower() < str2.lower()) I will answer #2.
def strincmp(str1, str2, numchars=None):
    result = 0
    len1 = len(str1)
    len2 = len(str2)
    if numchars is not None:
        minlen = min(len1, len2, numchars)
    else:
        minlen = min(len1, len2)
    #end if
    orda = ord('a')
    ordz = ord('z')
    i = 0
    while i < minlen and 0 == result:
        ord1 = ord(str1[i])
        ord2 = ord(str2[i])
        if ord1 >= orda and ord1 <= ordz:
            ord1 = ord1 - 32
        #end if
        if ord2 >= orda and ord2 <= ordz:
            ord2 = ord2 - 32
        #end if
        result = cmp(ord1, ord2)
        i += 1
    #end while
    if 0 == result and minlen != numchars:
        if len1 < len2:
            result = -1
        elif len2 < len1:
            result = 1
        #end if
    #end if
    return result
#end def
Only use this function when it makes sense to, as in many instances the lowercase technique will be superior.
I only work with ASCII strings; I'm not sure how this will behave with unicode.
When something isn't supported well in the standard library, I always look for a PyPI package. With virtualization and the ubiquity of modern Linux distributions, I no longer avoid Python extensions. PyICU seems to fit the bill: https://stackoverflow.com/a/1098160/3461
There is now also an option that is pure Python. It's well tested: https://github.com/jtauber/pyuca
Old answer:
I like the regular expression solution. Here's a function you can copy and paste into any function, thanks to python's block structure support.
def equals_ignore_case(str1, str2):
    import re
    return re.match(re.escape(str1) + r'\Z', str2, re.I) is not None
Since I used match instead of search, I didn't need to add a caret (^) to the regular expression.
Note: This only checks equality, which is sometimes what is needed. I also wouldn't go so far as to say that I like it.
This is how you'd do it with re:
import re
p = re.compile('^hello$', re.I)
p.match('Hello')
p.match('hello')
p.match('HELLO')
The recommended idiom for sorting lists of values using expensive-to-compute keys is the so-called "decorate-sort-undecorate" pattern. It consists simply of building a list of (key, value) tuples from the original list and sorting that list. Then it is trivial to eliminate the keys and get the list of sorted values:
>>> original_list = ['a', 'b', 'A', 'B']
>>> decorated = [(s.lower(), s) for s in original_list]
>>> decorated.sort()
>>> sorted_list = [s[1] for s in decorated]
>>> sorted_list
['A', 'a', 'B', 'b']
Or if you like one-liners:
>>> sorted_list = [s[1] for s in sorted((s.lower(), s) for s in original_list)]
>>> sorted_list
['A', 'a', 'B', 'b']
If you really worry about the cost of calling lower(), you can just store tuples of (lowered string, original string) everywhere. Tuples are the cheapest kind of containers in Python, they are also hashable so they can be used as dictionary keys, set members, etc.
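A minimal sketch of that idea (the names are made up):
names = ['Foo', 'bar', 'Baz']
pairs = [(s.lower(), s) for s in names]   # pay the lower() cost once per string
pairs.sort()                              # comparisons now use the cheap, pre-lowered keys
print([orig for _, orig in pairs])        # ['bar', 'Baz', 'Foo']
index = dict(pairs)                       # the lowered form also works as a dict key
print(index['baz'])                       # 'Baz'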
I'm pretty sure you either have to use .lower() or use a regular expression. I'm not aware of a built-in case-insensitive string comparison function.
For occasional or even repeated comparisons, a few extra string objects shouldn't matter as long as this doesn't happen in the innermost loop of your core code, or you don't have enough data to actually notice the performance impact. Check whether you do: doing things in a "stupid" way is much less stupid if you also do it less often.
If you seriously want to keep comparing lots and lots of text case-insensitively, you could somehow keep the lowercase versions of the strings at hand to avoid finalization and re-creation, or normalize the whole data set into lowercase. This of course depends on the size of the data set. If there are relatively few needles and a large haystack, replacing the needles with compiled regexp objects is one solution. Beyond that, it's hard to say without seeing a concrete example.
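For instance, a rough sketch of the few-needles/large-haystack idea (the data and names are made up):
import re

needles = ["quick", "lazy"]
haystack = ["The Quick Brown Fox", "jumps over", "the LAZY dog"]

# Compile each needle once as a case-insensitive pattern instead of
# lowercasing every haystack line for every comparison.
patterns = [re.compile(re.escape(n), re.IGNORECASE) for n in needles]
matches = [line for line in haystack if any(p.search(line) for p in patterns)]
print(matches)   # ['The Quick Brown Fox', 'the LAZY dog']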
You could translate each string to lowercase once --- lazily only when you need it, or as a prepass to the sort if you know you'll be sorting the entire collection of strings. There are several ways to attach this comparison key to the actual data being sorted, but these techniques are best addressed as a separate question.
Note that this technique can be used not only to handle upper/lower case issues, but for other types of sorting such as locale specific sorting, or "Library-style" title sorting that ignores leading articles and otherwise normalizes the data before sorting it.
Just use the str().lower() method, unless high-performance is important - in which case write that sorting method as a C extension.
"How to write a Python Extension" seems like a decent intro..
More interestingly, This guide compares using the ctypes library vs writing an external C module (the ctype is quite-substantially slower than the C extension).
import re

if re.match('tEXT', 'text', re.IGNORECASE):
    pass  # is True
You could subclass str and create your own case-insensitive string class, but IMHO that would be extremely unwise and create far more trouble than it's worth.
In response to your clarification...
You could use ctypes to execute the C function strcasecmp. ctypes is included in Python 2.5 and up; it provides the ability to call out to DLLs and shared libraries such as libc. Here is a quick example (Python on Linux; see the link for Win32 help):
from ctypes import *
libc = CDLL("libc.so.6")  # see link above for Win32 help
libc.strcasecmp("THIS", "this")  # returns 0
libc.strcasecmp("THIS", "THAT")  # returns 8
You may also want to reference the strcasecmp documentation.
Not really sure this is any faster or slower (have not tested), but it's a way to use a C function to do case insensitive string comparisons.
ActiveState Code - Recipe 194371: Case Insensitive Strings
is a recipe for creating a case-insensitive string class. It might be a bit of overkill for something quick, but it could provide you with a common way of handling case-insensitive strings if you plan on using them often.
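The core of such a class is small; here is a rough sketch of the idea (not the recipe itself, covering equality and hashing only):
class CaseInsensitiveStr(str):
    """Illustrative only: equality and hashing ignore case."""
    def __eq__(self, other):
        return self.lower() == str(other).lower()
    def __ne__(self, other):
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self.lower())

print(CaseInsensitiveStr("Hello") == "hELLo")                        # True
print(CaseInsensitiveStr("Hello") in {CaseInsensitiveStr("HELLO")})  # True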
I have created a program to check whether a string is a substring of another string, with the added condition that the substring is at the end.
def atEnd(first, second):
    if second in first and first.endswith(second):
        return True
    else:
        return False

first, second = input('Enter two strings: ').split()
print(atEnd(first, second))
Is there a way to find the same outcome without using the .endswith() function?
first[-len(second):] == second
Will do the job.
Your atEnd function is completely redundant with str.endswith, which is a built-in native method and therefore will already have a highly efficient implementation.
I would simply write print(first.endswith(second)) -- there's no need to complicate things further.
If you really want a free function rather than a method for some reason, then you can just invoke str.endswith directly: print(str.endswith(first, second)).
If you want to write your own implementation for efficiency reasons, you'll probably be better off using an alternative algorithm (e.g. building a suffix tree). If you want to write your own implementation to understand low-level string operations you really should learn C and read the CPython implementation source code. If you are doing this because a school assignment told you not to use endswith then that seems like a dumb assignment to me -- you should probably ask your teacher for more information.
Try utilizing the re module's findall method:
import re
EndsWith = lambda first, second: re.findall("(" + re.escape(second) + ")$", first) != []
Thanks to David Beazley's tweet, I've recently found out that the new Python 3.6 f-strings can also be nested:
>>> price = 478.23
>>> f"{f'${price:0.2f}':*>20s}"
'*************$478.23'
Or:
>>> x = 42
>>> f'''-{f"""*{f"+{f'.{x}.'}+"}*"""}-'''
'-*+.42.+*-'
While I am surprised that this is possible, I am missing how practical it is. When would nesting f-strings be useful? What use cases can this cover?
Note: The PEP itself does not mention nesting f-strings, but there is a specific test case.
I don't think formatted string literals allowing nesting (by nesting, I take it to mean f'{f".."}') is a result of careful consideration of possible use cases; I'm more convinced it's just allowed in order for them to conform to their specification.
The specification states that they support full Python expressions* inside brackets. It's also stated that a formatted string literal is really just an expression that is evaluated at run-time (See here, and here). As a result, it only makes sense to allow a formatted string literal as the expression inside another formatted string literal, forbidding it would negate the full support for Python expressions.
The fact that you can't find use cases mentioned in the docs (and only find test cases in the test suite) is because this is probably a nice (side) effect of the implementation and not its motivating use case.
* Actually, with two exceptions: an empty expression is not allowed, and a lambda expression must be surrounded by explicit parentheses.
I guess this is to pass formatting parameters in the same line and thus simplify f-strings usage.
For example:
>>> import decimal
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"
'result: 12.35'
Of course, it allows programmers to write absolutely unreadable code, but that's not the purpose :)
I've actually just come across something similar (I think) and thought I'd share.
My specific case is a big, dirty SQL statement where I need to conditionally use some very different values, but some f-strings are the same (and also used in other places).
Here is a quick example of what I mean. The columns I'm selecting are the same regardless (and also used in other queries elsewhere), but the table name depends on the group and is not something I could just do in a loop.
Having to include mycols=mycols in str2 each time felt a little dirty when I have multiple such params.
I was not sure this would work, but was happy it did. As to how Pythonic it is, I'm not really sure, to be honest.
mycols = 'col_a,col_b'
str1 = "select {mycols} from {mytable} where group='{mygroup}'".format(mycols=mycols, mytable='{mytable}', mygroup='{mygroup}')

group = 'group_b'
if group == 'group_a':
    str2 = str1.format(mytable='tbl1', mygroup=group)
elif group == 'group_b':
    str2 = str1.format(mytable='a_very_different_table_name', mygroup=group)

print(str2)
Any basic use case is where you need a string to completely describe the object you want to put inside the f-string braces {}. For example, you need strings to index dictionaries.
So, I ended up using it in an ML project with code like:
scores = dict()
scores[f'{task}_accuracy'] = 100. * n_valid / n_total
print(f'{task}_accuracy: {scores[f"{task}_accuracy"]}')
Working on a pet project, I got sidetracked by writing my own DB library. One thing I discovered was this:
>>> x = dict(a=1, b=2, d=3)
>>> z = f"""
... UPDATE TABLE
... bar
... SET
... {", ".join([f'{k} = ?' for k in x.keys()])} """.strip()
>>> z
'UPDATE TABLE\nbar\nSET\na = ?, b = ?, d = ?'
I was also surprised by this and honestly I am not sure I would ever do something like this in production code BUT I have also said I wouldn't do a lot of other things in production code.
I found nesting to be useful when doing ternaries. Your opinion will vary on readability, but I found this one-liner very useful.
logger.info(f"No program name in subgroups file. Using {f'{prg_num} {prg_orig_date}' if not prg_name else prg_name}")
As such, my tests for nesting would be:
Is the value reused? (Variable for expression re-use)
Is the expression clear? (Not exceeding complexity)
I use it for formatting currencies. Given values like:
a=1.23
b=45.67
I want to format them with a leading $ and with the decimals aligned, e.g.
$1.23
$45.67
formatting with a single f-string f"${value:5.2f}" you can get:
$ 1.23
$45.67
which is fine sometimes but not always. A nested f-string f"{f'${value:.2f}':>6}" gives you the exact format:
$1.23
$45.67
A simple example of when it's useful, together with an example of implementation: sometimes the formatting is also a variable.
num = 3.1415
fmt = ".2f"
print(f"number is {num:{fmt}}")
Nested f-strings vs. evaluated expressions in format specifiers
This question is about use-cases that would motivate using an f-string inside of some evaluated expression of an "outer" f-string.
This is different from the feature that allows evaluated expressions to appear within the format specifier of an f-string. This latter feature is extremely useful and somewhat relevant to this question since (1) it involves nested curly braces so it might be why people are looking at this post and (2) nested f-strings are allowed within the format specifier just as they are within other curly-expressions of an f-string.
F-string nesting can help with one-liners
Although certainly not the motivation for allowing nested f-strings, nesting may be helpful in obscure cases where you need or want a "one-liner" (e.g. lambda expressions, comprehensions, python -c command from the terminal). For example:
print('\n'.join([f"length of {x/3:g}{'.'*(11 - len(f'{x/3:g}'))}{len(f'{x/3:g}')}" for x in range(10)]))
If you do not need a one-liner, any syntactic nesting can be replaced by defining a variable previously and then using the variable name in the evaluated expression of the f-string (and in many if not most cases, the non-nested version would likely be more readable and easier to maintain; however it does require coming up with variable names):
for x in range(10):
to_show = f"{x/3:g}"
string_length = len(to_show)
padding = '.' * (11 - string_length)
print(f"length of {to_show}{padding}{string_length}")
Nested evaluated expressions (i.e. in the format specifier) are useful
In contrast to true f-string nesting, the related feature allowing evaluated expressions within the "format specifier" of an f-string can be extremely useful (as others have pointed out) for several reasons including:
formatting can be shared across multiple f-strings or evaluated expressions
formatting can include computed quantities that can vary from run to run
Here is an example that uses a nested evaluated expression, but not a nested f-string:
import random
results = [[i, *[random.random()] * 3] for i in range(10)]
format = "2.2f"
print("category,precision,recall,f1")
for cat, precision, recall, f1 in results:
print(f"{cat},{precision:{format}},{recall:{format}},{f1:{format}}")
However, even this use of nesting can be replaced with more flexible (and maybe cleaner) code that does not require syntactic nesting:
import random
results = [[i, *[random.random()] * 3] for i in range(10)]
def format(x):
return f"{x:2.2f}"
print("category,precision,recall,f1")
for cat, precision, recall, f1 in results:
print(f"{cat},{format(precision)},{format(recall)},{format(f1)}")
The following nested f-string one-liner does a great job of constructing a command-argument string:
cmd_args = f"""{' '.join([f'--{key} {value}' for key, value in kwargs.items()])}"""
where the input
{'a': 10, 'b': 20, 'c': 30, ....}
gets elegantly converted to
--a 10 --b 20 --c 30 ...
In an f-string the opening and closing braces are reserved characters.
To use an f-string to build a JSON string, you have to escape the literal braces by doubling them.
In your case, only the outer braces:
f"{{{f'${price:0.2f}':*>20s}}}"
If some fancy formatting is needed, such nesting could be, perhaps, useful.
for n in range(10, 1000, 100):
    print(f"{f'n = {n:<3}':<15}| {f'|{n:>5}**2 = {n**2:<7_}'} |")
You could use it for dynamic name lookups. For instance, say you have a variable set to the name of some function:
func = 'my_func'
Then you could write:
f"{f'{func}'()}"
which would be equivalent to:
'{}'.format(locals()[func]())
or, equivalently:
'{}'.format(my_func())
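A small, self-contained illustration (my_func here is just a stand-in):
def my_func():
    return 42

func = 'my_func'
# At module level locals() is globals(), so the name lookup finds my_func.
print(f"{locals()[f'{func}']()}")   # 42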
While I know that there is the possibility:
>>> a = "abc"
>>> result = a[-1]
>>> a = a[:-1]
Now I also know that strings are immutable and therefore something like this:
>>> a.pop()
c
is not possible.
But is this really the preferred way?
Strings are "immutable" for good reason: It really saves a lot of headaches, more often than you'd think. It also allows python to be very smart about optimizing their use. If you want to process your string in increments, you can pull out part of it with split() or separate it into two parts using indices:
a = "abc"
a, result = a[:-1], a[-1]
This shows that you're splitting your string in two. If you'll be examining every byte of the string, you can iterate over it (in reverse, if you wish):
for result in reversed(a):
    ...
I should add this seems a little contrived: Your string is more likely to have some separator, and then you'll use split:
ans = "foo,blah,etc."
for a in ans.split(","):
...
Not only is it the preferred way, it's the only reasonable way. Because strings are immutable, in order to "remove" a char from a string you have to create a new string whenever you want a different string value.
You may be wondering why strings are immutable, given that you have to make a whole new string every time you change a character. After all, C strings are just arrays of characters and are thus mutable, and some languages that support strings more cleanly than C allow mutable strings as well. There are two reasons to have immutable strings: security/safety and performance.
Security is probably the most important reason for strings to be immutable. When strings are immutable, you can't pass a string into some library and then have that string change from under your feet when you don't expect it. You may wonder which library would change string parameters, but if you're shipping code to clients you can't control their versions of the standard library, and malicious clients may change out their standard libraries in order to break your program and find out more about its internals. Immutable objects are also easier to reason about, which is really important when you try to prove that your system is secure against particular threats. This ease of reasoning is especially important for thread safety, since immutable objects are automatically thread-safe.
Performance is surprisingly often better for immutable strings. Because a string's value can never change, the runtime is free to share references to the same string object (and to cache derived data such as its hash) rather than copying it defensively. You get copy semantics without actually copying, which is a real performance win.
Eric Lippert explains more about the rationale behind the immutability of strings (in C#, not Python) here.
The precise wording of the question makes me think it's impossible.
return to me means you have a function, which you have passed a string as a parameter.
You cannot change this parameter. Assigning to it will only change the value of the parameter within the function, not the passed in string. E.g.
>>> def removeAndReturnLastCharacter(a):
...     c = a[-1]
...     a = a[:-1]
...     return c
...
>>> b = "Hello, Gaukler!"
>>> removeAndReturnLastCharacter(b)
'!'
>>> b  # b has not been changed
'Hello, Gaukler!'
Yes, python strings are immutable and any modification will result in creating a new string. This is how it's mostly done.
So, go ahead with it.
I decided to go with a for loop and just skip the item in question. Is it an acceptable alternative?
new = ''
for item in str:
    if item == str[n]:
        continue
    else:
        new += item
I realise that if you have an iterable you should always use .join(iterable) instead of for x in y: str += x. But if there's only a fixed number of variables that aren't already in an iterable, is using .join() still the recommended way?
For example I have
user = 'username'
host = 'host'
should I do
ret = user + '#' + host
or
ret = '#'.join([user, host])
I'm not so much asking from a performance point of view, since both will be pretty trivial. But I've read people on here say always use .join() and I was wondering if there's any particular reason for that or if it's just generally a good idea to use .join().
If you're creating a string like that, you normally want to use string formatting:
>>> user = 'username'
>>> host = 'host'
>>> '%s#%s' % (user, host)
'username#host'
Python 2.6 added another form, which doesn't rely on operator overloading and has some extra features:
>>> '{0}#{1}'.format(user, host)
'username#host'
As a general guideline, most people will use + on strings only if they're adding two strings right there. For more parts or more complex strings, they either use string formatting, like above, or assemble elements in a list and join them together (especially if there's any form of looping involved.) The reason for using str.join() is that adding strings together means creating a new string (and potentially destroying the old ones) for each addition. Python can sometimes optimize this away, but str.join() quickly becomes clearer, more obvious and significantly faster.
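A rough illustration of the "assemble and join" pattern when a loop is involved (the data is made up):
parts = []
for i in range(5):
    parts.append('item%d' % i)   # cheap list appends inside the loop
result = ', '.join(parts)        # a single join builds the final string once
print(result)                    # item0, item1, item2, item3, item4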
I take the question to mean: "Is it ok to do this:"
ret = user + '#' + host
..and the answer is yes. That is perfectly fine.
You should, of course, be aware of the cool formatting stuff you can do in Python, and you should be aware that for long lists, "join" is the way to go, but for a simple situation like this, what you have is exactly right. It's simple and clear, and performance will not be an issue.
(I'm pretty sure all of the people pointing at string formatting are missing the question entirely.)
Creating a string by constructing an array and joining it is for performance reasons only. Unless you need that performance, or unless it happens to be the natural way to implement it anyway, there's no benefit to doing that rather than simple string concatenation.
Saying '#'.join([user, host]) is unintuitive. It makes me wonder: why is he doing this? Are there any subtleties to it; is there any case where there might be more than one '#'? The answer is no, of course, but it takes more time to come to that conclusion than if it was written in a natural way.
Don't contort your code merely to avoid string concatenation; there's nothing inherently wrong with it. Joining arrays is just an optimization.
I'll just note that I always tended to use in-place concatenation until I reread a portion of the Python style guide, PEP 8 (Style Guide for Python Code):
Code should be written in a way that does not disadvantage other
implementations of Python (PyPy, Jython, IronPython, Pyrex, Psyco,
and such).
For example, do not rely on CPython's efficient implementation of
in-place string concatenation for statements in the form a+=b or a=a+b.
Those statements run more slowly in Jython. In performance sensitive
parts of the library, the ''.join() form should be used instead. This
will ensure that concatenation occurs in linear time across various
implementations.
Going by this, I have been converting to the practice of using joins, so that the habit is automatic when efficiency is extra critical.
So I'll put in my vote for:
ret = '#'.join([user, host])
I use the following:
ret = '%s#%s' % (user, host)
I recommend join() over concatenation, based on two aspects:
Faster.
More elegant.
Regarding the first aspect, here's an example:
import timeit
s1 = "Flowers"
s2 = "of"
s3 = "War"
def join_concat():
return s1 + " " + s2 + " " + s3
def join_builtin():
return " ".join((s1, s2, s3))
print("Join Concatenation: ", timeit.timeit(join_concat))
print("Join Builtin: ", timeit.timeit(join_builtin))
The output:
$ python3 join_test.py
Join Concatenation: 0.40386943198973313
Join Builtin: 0.2666833929979475
Keep in mind that timeit runs each statement 1,000,000 times by default, so the difference above is roughly 0.14 microseconds per concatenation; when you are processing a huge dataset (millions of lines), that adds up.
And as for the second aspect: indeed, it is more elegant.
This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Does anyone have better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.
Thanks
@gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' content; joining, regular expressions, and the third option that was not mentioned in the above answers (write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded, strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
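A minimal sketch of that decode/process/encode round trip, written in Python 2 style to match the question's setting:
raw = "Nissim"                             # a byte str, possibly multibyte UTF-8
text = raw.decode('utf-8')                 # unicode object: iteration yields characters
processed = u''.join(ch + u'-' for ch in text)
result = processed.encode('utf-8')         # back to a UTF-8 byte str at the I/O boundary
print result                               # N-i-s-s-i-m-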
Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
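And a rough sketch of that fourth approach, appending to an array.array('u') (again in Python 2 style):
from array import array

text = u"Nissim"               # already a unicode object
buf = array('u')
for ch in text:
    buf.append(ch)             # appends only, so amortized O(1) each
    buf.append(u'-')
print buf.tounicode()          # N-i-s-s-i-m-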
Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re
re.sub(r'(.)', r'\1-', u'Nissim')
count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s

re.sub(r'(.)', repl, u'Nissim')
This might be an effective Python solution:
s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, then try to improve upon it; a C module should be the last option.
Edit: This is also the fastest
import timeit
s1="Nissim"
s2="------"
timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))
N=100000
print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)
output is
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Here are my timings. Note, it's py3.1.
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
Use reduce():
>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'
The same works with numbers too, as long as you know which char maps to which (a dict can be handy):
>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(9)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
string="™¡™©€"
unicode(string,"utf-8")
s2='-'*len(s1)
''.join(sum(map(list,zip(s1,s2)),[])).encode("utf-8")