String concatenation vs. string substitution in Python

In Python, the where and when of using string concatenation versus string substitution eludes me. As the string concatenation has seen large boosts in performance, is this (becoming more) a stylistic decision rather than a practical one?
For a concrete example, how should one handle construction of flexible URIs:
DOMAIN = 'http://stackoverflow.com'
QUESTIONS = '/questions'
def so_question_uri_sub(q_num):
    return "%s%s/%d" % (DOMAIN, QUESTIONS, q_num)
def so_question_uri_cat(q_num):
    return DOMAIN + QUESTIONS + '/' + str(q_num)
Edit: There have also been suggestions about joining a list of strings and for using named substitution. These are variants on the central theme, which is, which way is the Right Way to do it at which time? Thanks for the responses!

Concatenation is (significantly) faster according to my machine. But stylistically, I'm willing to pay the price of substitution if performance is not critical. Well, and if I need formatting, there's no need to even ask the question... there's no option but to use interpolation/templating.
>>> import timeit
>>> def so_q_sub(n):
...     return "%s%s/%d" % (DOMAIN, QUESTIONS, n)
...
>>> so_q_sub(1000)
'http://stackoverflow.com/questions/1000'
>>> def so_q_cat(n):
...     return DOMAIN + QUESTIONS + '/' + str(n)
...
>>> so_q_cat(1000)
'http://stackoverflow.com/questions/1000'
>>> t1 = timeit.Timer('so_q_sub(1000)','from __main__ import so_q_sub')
>>> t2 = timeit.Timer('so_q_cat(1000)','from __main__ import so_q_cat')
>>> t1.timeit(number=10000000)
12.166618871951641
>>> t2.timeit(number=10000000)
5.7813972166853773
>>> t1.timeit(number=1)
1.103492206766532e-05
>>> t2.timeit(number=1)
8.5206360154188587e-06
>>> def so_q_tmp(n):
...     return "{d}{q}/{n}".format(d=DOMAIN, q=QUESTIONS, n=n)
...
>>> so_q_tmp(1000)
'http://stackoverflow.com/questions/1000'
>>> t3 = timeit.Timer('so_q_tmp(1000)', 'from __main__ import so_q_tmp')
>>> t3.timeit(number=10000000)
14.564135316080637
>>> def so_q_join(n):
...     return ''.join([DOMAIN, QUESTIONS, '/', str(n)])
...
>>> so_q_join(1000)
'http://stackoverflow.com/questions/1000'
>>> t4 = timeit.Timer('so_q_join(1000)', 'from __main__ import so_q_join')
>>> t4.timeit(number=10000000)
9.4431309007150048

Don't forget about named substitution:
def so_question_uri_namedsub(q_num):
    domain, questions = DOMAIN, QUESTIONS  # locals() only sees local names
    return "%(domain)s%(questions)s/%(q_num)d" % locals()

Be wary of concatenating strings in a loop! The cost of string concatenation is proportional to the length of the result. Looping leads you straight to the land of N-squared. Some languages will optimize concatenation to the most recently allocated string, but it's risky to count on the compiler to optimize your quadratic algorithm down to linear. Best to use the primitive (join?) that takes an entire list of strings, does a single allocation, and concatenates them all in one go.
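A minimal sketch of the two patterns described above (function and variable names are made up): repeated `+=` may re-copy the growing result on every pass, while `str.join` builds the string in a single linear pass.

```python
def build_by_concat(parts):
    # Each += may copy the entire result so far: O(n^2) overall
    result = ""
    for p in parts:
        result += p
    return result

def build_by_join(parts):
    # One pass, one allocation: O(n) overall
    return "".join(parts)

parts = [str(i) for i in range(1000)]
assert build_by_concat(parts) == build_by_join(parts)
```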

"As the string concatenation has seen large boosts in performance..."
If performance matters, this is good to know.
However, the performance problems I've seen have never come down to string operations. I've generally gotten in trouble with I/O, sorting, and O(n²) operations being the bottlenecks.
Until string operations are the performance limiters, I'll stick with things that are obvious. Mostly, that's substitution when it's one line or less, concatenation when it makes sense, and a template tool (like Mako) when it's large.

What you want to concatenate/interpolate and how you want to format the result should drive your decision.
String interpolation allows you to easily add formatting. Note that in your interpolation version the separating forward slash before the q_num parameter lives inside the format string itself ("/%d"), while the concatenation version has to splice in a separate '/' literal, as in return DOMAIN + QUESTIONS + '/' + str(q_num), which is easy to forget.
Interpolation makes it easier to format numerics; "%d of %d (%2.2f%%)" % (current, total, total/current) would be much less readable in concatenation form.
Concatenation is useful when you don't have a fixed number of items to string-ize.
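To make the readability point concrete, here is the same line written both ways (the values and the percentage computation are made up for illustration):

```python
current, total = 42, 60
pct = 100.0 * current / total

# interpolation: the output format is visible at a glance
via_interpolation = "%d of %d (%2.2f%%)" % (current, total, pct)

# concatenation: every value needs str()/manual formatting
via_concatenation = (str(current) + " of " + str(total)
                     + " (" + "%2.2f" % pct + "%)")

print(via_interpolation)  # 42 of 60 (70.00%)
assert via_interpolation == via_concatenation
```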
Also, know that Python 2.6 introduces a new version of string interpolation, called string templating:
def so_question_uri_template(q_num):
    return "{domain}{questions}/{num}".format(domain=DOMAIN,
                                              questions=QUESTIONS,
                                              num=q_num)
String templating is slated to eventually replace %-interpolation, but that won't happen for quite a while, I think.

I was just testing the speed of different string concatenation/substitution methods out of curiosity. A google search on the subject brought me here. I thought I would post my test results in the hope that it might help someone decide.
import timeit

def percent_():
    return "test %s, with number %s" % (1, 2)

def format_():
    return "test {}, with number {}".format(1, 2)

def format2_():
    return "test {1}, with number {0}".format(2, 1)

def concat_():
    return "test " + str(1) + ", with number " + str(2)

def dotimers(func_list):
    # runs a single test for all functions in the list
    for func in func_list:
        tmr = timeit.Timer(func)
        res = tmr.timeit()
        print "test " + func.func_name + ": " + str(res)

def runtests(func_list, runs=5):
    # runs multiple tests for all functions in the list
    for i in range(runs):
        print "----------- TEST #" + str(i + 1)
        dotimers(func_list)
...After running runtests((percent_, format_, format2_, concat_), runs=5), I found that the % method was about twice as fast as the others on these small strings. The concat method was always the slowest (barely). There were very tiny differences when switching the positions in the format() method, but switching positions was always at least 0.01 s slower than the regular format method.
Sample of test results:
test concat_() : 0.62 (0.61 to 0.63)
test format_() : 0.56 (consistently 0.56)
test format2_() : 0.58 (0.57 to 0.59)
test percent_() : 0.34 (0.33 to 0.35)
I ran these because I do use string concatenation in my scripts, and I was wondering what the cost was. I ran them in different orders to make sure nothing was interfering, or getting better performance by being first or last. On a side note, I threw some longer string generators into those functions, like "%s" + ("a" * 1024), and regular concat was almost 3 times as fast (1.1 vs 2.8) as using the format and % methods. I guess it depends on the strings and what you are trying to achieve. If performance really matters, it might be better to try different things and test them. I tend to choose readability over speed, unless speed becomes a problem, but that's just me. (SO didn't like my copy/paste; I had to put 8 spaces on everything to make it look right. I usually use 4.)

Remember, stylistic decisions are practical decisions, if you ever plan on maintaining or debugging your code :-) There's a famous quote from Knuth (possibly quoting Hoare?): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
As long as you're careful not to (say) turn an O(n) task into an O(n²) task, I would go with whichever you find easiest to understand.

I use substitution wherever I can. I only use concatenation if I'm building a string up in say a for-loop.

Actually, the correct thing to do in this case (building paths) is to use os.path.join, not string concatenation or interpolation.
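A quick sketch of the distinction (the path pieces here are made up): os.path.join is for filesystem paths, whose separator varies by OS, while for URLs the standard library's urljoin handles the slashes explicitly and portably.

```python
import os.path
from urllib.parse import urljoin  # Python 3; in Python 2 it lives in urlparse

# os.path.join is meant for filesystem paths (separator depends on the OS)
config_path = os.path.join("etc", "app", "settings.ini")

# For URLs, urljoin resolves a relative part against a base URL
base = "http://stackoverflow.com/questions/"
url = urljoin(base, "1000")
print(url)  # http://stackoverflow.com/questions/1000
```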


Nested f-strings

Thanks to David Beazley's tweet, I've recently found out that the new Python 3.6 f-strings can also be nested:
>>> price = 478.23
>>> f"{f'${price:0.2f}':*>20s}"
'*************$478.23'
Or:
>>> x = 42
>>> f'''-{f"""*{f"+{f'.{x}.'}+"}*"""}-'''
'-*+.42.+*-'
While I am surprised that this is possible, I am missing how practical it is. When would nesting f-strings be useful? What use cases can this cover?
Note: The PEP itself does not mention nesting f-strings, but there is a specific test case.
I don't think formatted string literals allowing nesting (by nesting, I take it to mean f'{f".."}') is a result of careful consideration of possible use cases, I'm more convinced it's just allowed in order for them to conform with their specification.
The specification states that they support full Python expressions* inside brackets. It's also stated that a formatted string literal is really just an expression that is evaluated at run-time (See here, and here). As a result, it only makes sense to allow a formatted string literal as the expression inside another formatted string literal, forbidding it would negate the full support for Python expressions.
The fact that you can't find use cases mentioned in the docs (and only find test cases in the test suite) is because this is probably a nice (side) effect of the implementation and not its motivating use case.
*Actually, with two exceptions: an empty expression is not allowed, and a lambda expression must be surrounded by explicit parentheses.
I guess this is to pass formatting parameters on the same line and thus simplify f-string usage.
For example:
>>> import decimal
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"
'result: 12.35'
Of course, it allows programmers to write absolutely unreadable code, but that's not the purpose :)
I've actually just come across something similar (I think) and thought I'd share.
My specific case is a big dirty SQL statement where I need to conditionally have some very different values, but some f-strings are the same (and also used in other places).
Here is a quick example of what I mean. The columns I'm selecting are the same regardless (and also used in other queries elsewhere), but the table name depends on the group and is not such that I could just do it in a loop.
Having to include mycols=mycols in str2 each time felt a little dirty when I have multiple such params.
I was not sure this would work, but was happy it did. As to how Pythonic it is, I'm not really sure, to be honest.
mycols = 'col_a,col_b'
str1 = "select {mycols} from {mytable} where group='{mygroup}'".format(
    mycols=mycols, mytable='{mytable}', mygroup='{mygroup}')
group = 'group_b'
if group == 'group_a':
    str2 = str1.format(mytable='tbl1', mygroup=group)
elif group == 'group_b':
    str2 = str1.format(mytable='a_very_different_table_name', mygroup=group)
print(str2)
A basic use case is where you need a string to completely describe the object you want to put inside the f-string braces {}. For example, you need strings to index dictionaries.
So, I ended up using it in an ML project with code like:
scores = dict()
scores[f'{task}_accuracy'] = 100. * n_valid / n_total
print(f'{task}_accuracy: {scores[f"{task}_accuracy"]}')
Working on a pet project I got sidetracked by writing my own DB library. One thing I discovered was this:
>>> x = dict(a=1, b=2, d=3)
>>> z = f"""
UPDATE TABLE
bar
SET
{", ".join([f'{k} = ?' for k in x.keys()])} """.strip()
>>> print(z)
UPDATE TABLE
bar
SET
a = ?, b = ?, d = ?
I was also surprised by this and honestly I am not sure I would ever do something like this in production code BUT I have also said I wouldn't do a lot of other things in production code.
I found nesting to be useful when doing ternaries. Your opinion will vary on readability, but I found this one-liner very useful.
logger.info(f"No program name in subgroups file. Using {f'{prg_num} {prg_orig_date}' if not prg_name else prg_name}")
As such, my tests for nesting would be:
Is the value reused? (Variable for expression re-use)
Is the expression clear? (Not exceeding complexity)
I use it for formatting currencies. Given values like:
a = 1.23
b = 45.67
I want to format them with a leading $ and with the decimals aligned, e.g.:
$1.23
$45.67
formatting with a single f-string f"${value:5.2f}" you can get:
$ 1.23
$45.67
which is fine sometimes but not always. The nested f-string f"{f'${value:.2f}':>6}" gives you the exact format:
 $1.23
$45.67
A simple example of when it's useful, together with an example of implementation: sometimes the formatting is also a variable.
num = 3.1415
fmt = ".2f"
print(f"number is {num:{fmt}}")
Nested f-strings vs. evaluated expressions in format specifiers
This question is about use-cases that would motivate using an f-string inside of some evaluated expression of an "outer" f-string.
This is different from the feature that allows evaluated expressions to appear within the format specifier of an f-string. This latter feature is extremely useful and somewhat relevant to this question since (1) it involves nested curly braces so it might be why people are looking at this post and (2) nested f-strings are allowed within the format specifier just as they are within other curly-expressions of an f-string.
F-string nesting can help with one-liners
Although certainly not the motivation for allowing nested f-strings, nesting may be helpful in obscure cases where you need or want a "one-liner" (e.g. lambda expressions, comprehensions, python -c command from the terminal). For example:
print('\n'.join([f"length of {x/3:g}{'.'*(11 - len(f'{x/3:g}'))}{len(f'{x/3:g}')}" for x in range(10)]))
If you do not need a one-liner, any syntactic nesting can be replaced by defining a variable previously and then using the variable name in the evaluated expression of the f-string (and in many if not most cases, the non-nested version would likely be more readable and easier to maintain; however it does require coming up with variable names):
for x in range(10):
    to_show = f"{x/3:g}"
    string_length = len(to_show)
    padding = '.' * (11 - string_length)
    print(f"length of {to_show}{padding}{string_length}")
Nested evaluated expressions (i.e. in the format specifier) are useful
In contrast to true f-string nesting, the related feature allowing evaluated expressions within the "format specifier" of an f-string can be extremely useful (as others have pointed out) for several reasons including:
formatting can be shared across multiple f-strings or evaluated expressions
formatting can include computed quantities that can vary from run to run
Here is an example that uses a nested evaluated expression, but not a nested f-string:
import random

results = [[i, *[random.random()] * 3] for i in range(10)]
format = "2.2f"
print("category,precision,recall,f1")
for cat, precision, recall, f1 in results:
    print(f"{cat},{precision:{format}},{recall:{format}},{f1:{format}}")
However, even this use of nesting can be replaced with more flexible (and maybe cleaner) code that does not require syntactic nesting:
import random

results = [[i, *[random.random()] * 3] for i in range(10)]
def format(x):
    return f"{x:2.2f}"
print("category,precision,recall,f1")
for cat, precision, recall, f1 in results:
    print(f"{cat},{format(precision)},{format(recall)},{format(f1)}")
The following nested f-string one-liner does a great job in constructing a command argument string
cmd_args = f"""{' '.join([f'--{key} {value}' for key, value in kwargs.items()])}"""
where the input
{'a': 10, 'b': 20, 'c': 30, ....}
gets elegantly converted to
--a 10 --b 20 --c 30 ...
In an f-string, the open brace and close brace are reserved characters.
To use an f-string to build a JSON-style string, you have to escape the brace characters by doubling them (backslash escapes are not allowed inside f-string expressions).
In your case, only the outer braces:
f"{{{f'${price:0.2f}':*>20s}}}"
If some fancy formatting is needed, such nesting could be, perhaps, useful.
for n in range(10, 1000, 100):
    print(f"{f'n = {n:<3}':<15}| {f'|{n:>5}**2 = {n**2:<7_}'} |")
You could use it for dynamicism. For instance, say you have a variable set to the name of some function:
func = 'my_func'
Then you could write:
f"{globals()[func]()}"
which would be equivalent to:
'{}'.format(globals()[func]())
or, equivalently:
'{}'.format(my_func())
(Note that f"{f'{func}'()}" would not work: the inner f-string evaluates to the string 'my_func', and a string is not callable; you have to look the name up first.)

Python way to convert argument list to string

I've inherited some Python code that constructs a string from the string arguments passed into __init__:
self.full_tag = prefix + number + point + suffix
Maybe I'm just over-thinking it but is this the best way to concatenate the arguments? I know it's possible to do something like:
self.full_tag = "".join([prefix, number, point, suffix])
Or just using string format function:
self.full_tag = '{}{}{}{}'.format(prefix, number, point, suffix)
What is the Python way of doing this?
Pythonic way is to follow The Zen of Python, which has several things to say about this case:
Beautiful is better than ugly.
Simple is better than complex.
Readability counts.
Add to that the famous quote by Donald Knuth:
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.
With those in mind, the best of your choices is:
self.full_tag = prefix + number + point + suffix
Although, if number is really a number and point is really a point (a literal dot separator), then this is more explicit:
self.full_tag = "%s%d.%s" % (prefix, number, suffix)
Explicit is better than implicit.
The documentation recommends join for better performance over +:
...For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.
If performance is not too important, it's more of a matter of taste.
Personally, I find "".join cleaner and more readable than all of those braces in the format version.

Python .join or string concatenation

I realise that if you have an iterable you should always use .join(iterable) instead of for x in y: str += x. But if there's only a fixed number of variables that aren't already in an iterable, is using .join() still the recommended way?
For example I have
user = 'username'
host = 'host'
should I do
ret = user + '#' + host
or
ret = '#'.join([user, host])
I'm not so much asking from a performance point of view, since both will be pretty trivial. But I've read people on here say always use .join() and I was wondering if there's any particular reason for that or if it's just generally a good idea to use .join().
If you're creating a string like that, you normally want to use string formatting:
>>> user = 'username'
>>> host = 'host'
>>> '%s#%s' % (user, host)
'username#host'
Python 2.6 added another form, which doesn't rely on operator overloading and has some extra features:
>>> '{0}#{1}'.format(user, host)
'username#host'
As a general guideline, most people will use + on strings only if they're adding two strings right there. For more parts or more complex strings, they either use string formatting, like above, or assemble elements in a list and join them together (especially if there's any form of looping involved.) The reason for using str.join() is that adding strings together means creating a new string (and potentially destroying the old ones) for each addition. Python can sometimes optimize this away, but str.join() quickly becomes clearer, more obvious and significantly faster.
I take the question to mean: "Is it ok to do this:"
ret = user + '#' + host
..and the answer is yes. That is perfectly fine.
You should, of course, be aware of the cool formatting stuff you can do in Python, and you should be aware that for long lists, "join" is the way to go, but for a simple situation like this, what you have is exactly right. It's simple and clear, and performance will not be an issue.
(I'm pretty sure all of the people pointing at string formatting are missing the question entirely.)
Creating a string by constructing an array and joining it is for performance reasons only. Unless you need that performance, or unless it happens to be the natural way to implement it anyway, there's no benefit to doing that rather than simple string concatenation.
Saying '#'.join([user, host]) is unintuitive. It makes me wonder: why is he doing this? Are there any subtleties to it; is there any case where there might be more than one '#'? The answer is no, of course, but it takes more time to come to that conclusion than if it was written in a natural way.
Don't contort your code merely to avoid string concatenation; there's nothing inherently wrong with it. Joining arrays is just an optimization.
I'll just note that I had always tended to use in-place concatenation until I reread a portion of PEP 8, the Style Guide for Python Code:
Code should be written in a way that does not disadvantage other
implementations of Python (PyPy, Jython, IronPython, Pyrex, Psyco,
and such).
For example, do not rely on CPython's efficient implementation of
in-place string concatenation for statements in the form a+=b or a=a+b.
Those statements run more slowly in Jython. In performance sensitive
parts of the library, the ''.join() form should be used instead. This
will ensure that concatenation occurs in linear time across various
implementations.
Going by this, I have been converting to the practice of using joins so that I may retain the habit as a more automatic practice when efficiency is extra critical.
So I'll put in my vote for:
ret = '#'.join([user, host])
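The pattern PEP 8 is steering you toward looks like this in practice (a sketch with made-up names): accumulate the pieces in a list and join once at the end, which is linear time on any Python implementation.

```python
def render_row(values, sep=","):
    # Collect pieces in a list, then join once: a single O(n) pass,
    # unlike repeated += which can be quadratic on some implementations.
    parts = []
    for v in values:
        parts.append(str(v))
    return sep.join(parts)

print(render_row(["user", "host", 8080]))  # user,host,8080
```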
I use the following:
ret = '%s#%s' % (user, host)
I recommend join() over concatenation, based on two aspects:
Faster.
More elegant.
Regarding the first aspect, here's an example:
import timeit

s1 = "Flowers"
s2 = "of"
s3 = "War"

def join_concat():
    return s1 + " " + s2 + " " + s3

def join_builtin():
    return " ".join((s1, s2, s3))

print("Join Concatenation: ", timeit.timeit(join_concat))
print("Join Builtin: ", timeit.timeit(join_builtin))
The output:
$ python3 join_test.py
Join Concatenation: 0.40386943198973313
Join Builtin: 0.2666833929979475
Considering a huge dataset (millions of lines) and its processing, that gap of roughly 0.14 seconds per million operations (timeit's default number of runs) adds up quickly.
And as for the second aspect: indeed, it is more elegant.

better way to pass data to print in python

I was going through http://web2py.com/book/default/chapter/02 and found this:
>>> print 'number is ' + str(3)
number is 3
>>> print 'number is %s' % (3)
number is 3
>>> print 'number is %(number)s' % dict(number=3)
number is 3
It says that the last notation "is more explicit and less error prone, and is to be preferred."
I am wondering what the advantage of using the last notation is. Will it not have a performance overhead?
>>> print 'number is ' + str(3)
number is 3
This is definitely the worst solution and might cause you problems if you make the beginner mistake of "Value of obj: " + obj where obj is not a string or unicode object. For many concatenations it's not readable at all; it's similar to something like echo "<p>Hello ".$username."!</p>"; in PHP (and that can get arbitrarily ugly).
>>> print 'number is %s' % (3)
number is 3
Now that is much better. Instead of a hard-to-read concatenation, you see the output format immediately. Coming back to the beginner mistake of outputting values, you can do print "Value of obj: %r" % obj, for example. I personally prefer this in most cases. But note that you cannot use it in gettext-translated strings if you have multiple format specifiers because the order might change in other languages.
As you forgot to mention it here, you can also use the new string formatting method which is similar:
>>> "number is {0}".format(3)
'number is 3'
Next, dict lookup:
>>> print 'number is %(number)s' % dict(number=3)
number is 3
As said before, gettext-translated strings might change the order of positional format specifiers, so this option is the best when working with translations. The performance drop should be negligible - if your program is not all about formatting strings.
As with the positional formatting, you can also do it in the new style:
>>> "number is {number}".format(number=3)
'number is 3'
It's hard to tell which one to take. I recommend you to use positional arguments with the % notation for simple strings and dict lookup formatting for translated strings.
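A sketch of the translation point above (the German string here is made up for illustration): named placeholders let a translated format string reorder its arguments freely, which positional %s cannot do safely.

```python
# English source string and a hypothetical translation
# whose word order differs
en = "%(name)s has %(count)d new messages"
de = "%(count)d neue Nachrichten hat %(name)s"

values = {"name": "Ana", "count": 3}
print(en % values)  # Ana has 3 new messages
print(de % values)  # 3 neue Nachrichten hat Ana
```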
I can think of a few differences.
The first, to me, is cumbersome if more than one variable is involved. I cannot speak to the performance penalty there; see the additional arguments below.
The second example is position-dependent, and it is easy to swap positions by mistake, causing errors. It also does not tell you anything about the variables.
In the third example, the position of the variables is not important, because you use a dictionary. This makes it elegant, as it does not rely on the positional structuring of the variables.
See the example below:
>>> print 'number is %s %s' % (3,4)
number is 3 4
>>> print 'number is %s %s' % (4,3)
number is 4 3
>>> print 'number is %(number)s %(two)s' % dict(number=3, two=4)
number is 3 4
>>> print 'number is %(number)s %(two)s' % dict(two=4, number=3)
number is 3 4
>>>
Another part of the discussion on this:
"+" is the string concatenation operator.
"%" is string formatting.
In this trivial case, string formatting accomplishes the same result as concatenation. Unlike string formatting, string concatenation only works when everything is already a string. So if you forget to convert your variables to strings, concatenation will raise a TypeError.
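A quick sketch of that failure mode (the variable names are made up):

```python
count = 3
try:
    message = "number is " + count  # TypeError: str + int is not allowed
except TypeError:
    message = "number is %s" % count  # %s converts the value for you
print(message)  # number is 3
```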
[Edit: My answer was biased towards templating since the question came from web2py where templates are so commonly involved]
As Ryan says below, the concatenation is faster than formatting.
My suggestion is:
Use the first form (concatenation) if you are concatenating just two strings.
Use the second form if there are only a few variables; you can easily see their positions and deal with them.
Use the third form when you are doing templating, i.e. formatting a large piece of text with variable data. The dictionary form helps give meaning to the variables inside the large piece of text.
I am wondering what is the advantage of using the last notation...
Hm, as you said, the last notation really is more explicit and actually is less error prone.
will it not have a performance overhead?
It will have a small performance overhead, but it's minor compared with data fetching from a DB or over network connections.
It's a bad, unjustified piece of advice.
The third method is cumbersome, violates DRY, and is error prone, except when:
You are writing a framework which doesn't have control over the format string, for example the logging module, web2py, or gettext.
The format string is extremely long.
The format string is read from a config file.
The problem with the third method should be obvious when you consider that foo appears three times in this code: "%(foo)s" % dict(foo=foo). This is error prone. Most programs should not use the third method, unless they know they need to.
The second method is the simplest method, and is what you generally use in most programs. It is best used when the format string is immediate, e.g. 'values: %s %s %s' % (a, b, c) instead of taken from a variable, e.g. fmt % (a, b, c).
The first method, concatenation, is almost never useful, except perhaps if you're building a string up in a loop:
s = ''
for x in l:
    s += str(x)
However, in that case, it's generally better and faster to use str.join():
s = ''.join(str(x) for x in l)

fast string modification in python

This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Anyone have better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.
Thanks
#gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those object's content; joining, regular expressions, and the third option that was not mentioned in the above answers (write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
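A sketch of that decode/process/encode pattern, written for modern Python 3, where str is already Unicode and only the I/O boundaries need encode/decode (the function name and filler are made up):

```python
def interleave(text, filler="-"):
    # Iterating a decoded (Unicode) string yields characters, not bytes,
    # and join builds the result in a single O(N) pass.
    return "".join(ch + filler for ch in text)

raw = "Nissim".encode("utf-8")   # bytes, as they might arrive from I/O
text = raw.decode("utf-8")       # decode at the boundary
result = interleave(text)
print(result)                    # N-i-s-s-i-m-
out = result.encode("utf-8")     # encode again only when writing out
```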
Edit: forgot to comment on a possible approach you mention: array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re

re.sub(r'(.)', r'\1-', u'Nissim')

count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s
re.sub(r'(.)', repl, u'Nissim')
This might be an effective Python solution:
s1 = "Nissim"
s2 = "------"
s3 = ''.join([''.join(list(x)) for x in zip(s1, s2)])
Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, try to improve upon it; a C module should be the last option.
Edit: This is also the fastest
import timeit

s1 = "Nissim"
s2 = "------"
timeit.f1 = lambda s1, s2: ''.join(sum(map(list, zip(s1, s2)), []))
timeit.f2 = lambda s1, s2: ''.join([''.join(list(x)) for x in zip(s1, s2)])
timeit.f3 = lambda s1, s2: ''.join(i + j for i, j in zip(s1, s2))
N = 100000
print "anurag", timeit.Timer("timeit.f1('Nissim', '------')", "import timeit").timeit(N)
print "dweeves", timeit.Timer("timeit.f2('Nissim', '------')", "import timeit").timeit(N)
print "SilentGhost", timeit.Timer("timeit.f3('Nissim', '------')", "import timeit").timeit(N)
output is
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Here are my timings. Note: it's Python 3.1.
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
Use reduce (the variable is renamed from str to s to avoid shadowing the builtin):
>>> s = "Nissim"
>>> reduce(lambda x, y: x + y + '-', s, '')
'N-i-s-s-i-m-'
The same with numbers too as long as you know which char maps to which. [dict can be handy]
>>> mapper = dict([(repr(i), chr(i + ord('a'))) for i in range(9)])
>>> str1 = '0123'
>>> reduce(lambda x, y: x + y + mapper[y], str1, '')
'0a1b2c3d'
string = "™¡™©€"
s1 = unicode(string, "utf-8")
s2 = '-' * len(s1)
''.join(sum(map(list, zip(s1, s2)), [])).encode("utf-8")
