I am writing Python 2.7.15 code to access the characters inside a word. How can I optimize this process, so that I can also check whether every character of the word is contained in an external list?
I have tried two versions of Python 2 code: version (1) is an expanded version of what my code has to do, whereas version (2) is a compact version of the same code.
chars_array = ['a','b','c']
VERSION (1)
def version1(word):
    chars = [x for x in word]
    count = 0
    for c in chars:
        if not c in chars_array:
            count += 1
    return count
VERSION (2)
def version2(word):
    return sum([1 for c in [x for x in word] if not c in chars_array])
I am analyzing a large corpus and for version1 I obtain an execution time of 8.56 sec, whereas for version2 it is 8.12 sec.
The fastest solution (can be up to 100x faster for an extremely long string):
joined = ''.join(chars_array)

def version3(word):
    return len(word.translate(None, joined))
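(Aside, not from the original answer: this relies on the Python 2 signature of str.translate, where the second argument lists characters to delete. On Python 3 a rough equivalent, with names of my own choosing, would be:)

delete_table = str.maketrans('', '', joined)  # maps every character in joined to None

def version3_py3(word):
    return len(word.translate(delete_table))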
Another slower solution that is approximately the same speed as your code:
from itertools import ifilterfalse

def version4(word):
    return sum(1 for _ in ifilterfalse(set(chars_array).__contains__, word))
Timings (s is a random string):
In [17]: %timeit version1(s)
1000 loops, best of 3: 79.9 µs per loop
In [18]: %timeit version2(s)
10000 loops, best of 3: 98.1 µs per loop
In [19]: %timeit version3(s)
100000 loops, best of 3: 4.12 µs per loop # <- fastest
In [20]: %timeit version4(s)
10000 loops, best of 3: 84.3 µs per loop
With chars_array = ['a', 'e', 'i', 'o', 'u', 'y'] and words equal to a list
of 56048 English words, I measured a number of variants with a command similar to the following at an IPython prompt:
%timeit n = [version1(word) for word in words]
In each case it reported "10 loops, best of 3", and I have shown the time per loop
in comments next to each function definition below:
# OP's originals:
def version1(word):  # 163 ms
    chars = [x for x in word]
    count = 0
    for c in chars:
        if not c in chars_array:
            count += 1
    return count

def version2(word):  # 173 ms
    return sum([1 for c in [x for x in word] if not c in chars_array])
Now let's hit version1 and version2 with three optimizations:
remove the redundant list comprehension and iterate through word directly instead;
use the operator not in rather than negating the result of the in operator;
check for (non-)membership of a set rather than a list.
chars_set = set(chars_array)

def version1a(word):  # 95.5 ms
    count = 0
    for c in word:
        if c not in chars_set:
            count += 1
    return count

def version2a(word):  # 104 ms
    return sum([1 for c in word if c not in chars_set])
So there's actually an advantage for the multi-line code over the list comprehension. This may depend on word length, though: version2a has to allocate a new list the same length as the word, whereas version1a does not. Let's refine version2a further to give it that same advantage, by summing over a generator expression rather than a list comprehension:
def version2b(word):  # 111 ms
    return sum(1 for c in word if c not in chars_set)
To my surprise that was actually slightly counterproductive—but again, that effect may depend on word length.
Finally let's experience the power of .translate():
chars_str = ''.join(chars_set)

def version3(word):  # 40.7 ms
    return len(word.translate(None, chars_str))
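As a quick sanity check (my own example, not part of the original benchmark), with the vowel set above:

print(version3('hello'))  # 3: the non-vowel characters 'h', 'l', 'l' remain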
We have a clear winner.
I was trying to optimize a simple character-counting function. After a few changes I checked the timings, expecting the version using a basic while loop to be faster than the for ... in version.
But to my surprise the while loop was almost 30% slower than for ... in here! Shouldn't a simple while loop, which sits at a lower level of abstraction (doing less internally), be much faster than for ... in?
import timeit
def faster_count_alphabet(filename):
    l = [0] * 128  # all ascii values 0 to 127
    with open(filename) as fh:
        a = fh.read()
        for chars in a:
            l[ord(chars)] += 1
    return l

def faster_count_alphabet2(filename):
    l = [0] * 128  # all ascii values 0 to 127
    with open(filename) as fh:
        a = fh.read()
        i = 0
        size = len(a)
        while i < size:
            l[ord(a[i])] += 1
            i += 1
    return l

if __name__ == "__main__":
    print timeit.timeit("faster_count_alphabet('connect.log')", setup="from __main__ import faster_count_alphabet", number=10)
    print timeit.timeit("faster_count_alphabet2('connect.log')", setup="from __main__ import faster_count_alphabet2", number=10)
Here are the timings I am getting:
7.087787236
9.9472761879
While Loop
Well, in your while loop the interpreter has to check on every iteration whether the loop condition is true, so it has to load both i and size and compare them.
For Loop
The for loop, on the other hand, has no need for that, since the for loop is optimized, as Chris_Rands already pointed out.
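If you want to see the extra per-iteration work for yourself, a minimal sketch (mine, not from the original answer) is to disassemble two tiny loops and compare the bytecode:

import dis

def with_for(a):
    for ch in a:     # FOR_ITER advances the iterator in C on every pass
        pass

def with_while(a):
    i = 0
    size = len(a)
    while i < size:  # i and size are loaded and compared in bytecode on every pass
        i += 1       # plus the explicit index bookkeeping

dis.dis(with_for)
dis.dis(with_while)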
These are the results of my testing with Python 2.7:
For Loop
test1
python -mtimeit -s"d='/Users/xuejiang/go/src/main'.split('/')" "for i in range(len(d)):k=('index:',i,'value:',d[i])"
result:
1000000 loops, best of 3: 0.747 usec per loop
test2:
python -mtimeit -s"d='/Users/xuejiang/go/src/main'.split('/');i=0" "for v in d:k=('index:',i,'value:',v);i+=1"
result:
1000000 loops, best of 3: 0.524 usec per loop
While Loop
test
python -mtimeit -s"d='/Users/xuejiang/go/src/main'.split('/');i=0" "while i <len(d):k=('index:',i,'value:',d[i]);i+=1"
result:
10000000 loops, best of 3: 0.0658 usec per loop
That is, the while loop is much faster.
Your code should represent the same functionality in both cases; I have modified it for you so you can re-run your test.
I can see that you are even using the optimized for-each loop instead of an index-based for loop while you are benchmarking.
def faster_count_for_loop(filename):
    l = [0] * 128  # all ascii values 0 to 127
    with open(filename) as fh:
        a = fh.read()
        size = len(a)
        for i in range(size):
            l[ord(a[i])] += 1
    return l

def faster_count_while_loop(filename):
    l = [0] * 128  # all ascii values 0 to 127
    with open(filename) as fh:
        a = fh.read()
        i = 0
        size = len(a)
        while i < size:
            l[ord(a[i])] += 1
            i += 1
    return l
Given a string, find the first non-repeating character in it and return its index. If it doesn't exist, return -1. You may assume the string contains only lowercase letters.
I define a hash that tracks the occurrence of characters. I traverse the string from left to right and check whether the current character is already in the hash; if it is, I continue. Otherwise I look through the rest of the string to see whether the current character occurs again: if it doesn't, I return the index, and if it does, I update the hash.
def firstUniqChar(s):
    track = {}
    for index, i in enumerate(s):
        if i in track:
            continue
        elif i in s[index+1:]:  # for the last element the slice is '' and the test is False
            track[i] = 1
            continue
        else:
            return index
    return -1

firstUniqChar('timecomplexity')
What's the time complexity (average and worst) of my algorithm?
Your algorithm has a time complexity of O(kn), where k is the number of unique characters in the string. If k is a constant, then it is O(n). Since the problem description clearly bounds the number of possible characters ("assume the string contains only lowercase letters"), k is constant and your algorithm runs in O(n) time on this problem. Even as n grows towards infinity, you will only make O(1) slices of the string and your algorithm will remain O(n). If you removed track, it would become O(n²):
In [36]: s = 'abcdefghijklmnopqrstuvwxyz' * 10000
In [37]: %timeit firstUniqChar(s)
100 loops, best of 3: 18.2 ms per loop
In [38]: s = 'abcdefghijklmnopqrstuvwxyz' * 20000
In [37]: %timeit firstUniqChar(s)
10 loops, best of 3: 36.3 ms per loop
In [38]: s = 'timecomplexity' * 40000 + 'a'
In [39]: %timeit firstUniqChar(s)
10 loops, best of 3: 73.3 ms per loop
It pretty much holds that T(n) is still of O(n) complexity; it scales linearly with the number of characters in the string, even though the last case is the worst-case scenario for your algorithm: there is no single character that is unique.
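To make the O(n²) claim concrete, here is a sketch (mine, not the OP's code) of what dropping track amounts to, where every character triggers a full scan of the string:

def first_uniq_char_no_track(s):
    for index, c in enumerate(s):
        if s.count(c) == 1:  # an O(n) scan for each of the n characters
            return index
    return -1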
I will present a not-that-efficient but simple and smart method here: count the character histogram first with collections.Counter, then iterate over the characters and find the first one whose count is 1.
from collections import Counter

def first_uniq_char_ultra_smart(s):
    counts = Counter(s)
    for i, c in enumerate(s):
        if counts[c] == 1:
            return i
    return -1

first_uniq_char_ultra_smart('timecomplexity')
This has time complexity of O(n); Counter counts the histogram in O(n) time and we need to enumerate the string again for O(n) characters. However in practice I believe my algorithm has low constants, because it uses a standard dictionary for Counter.
And let's make a very stupid brute-force algorithm. Since you can assume that the string contains only lowercase letters, use that assumption:
import string

def first_uniq_char_very_stupid(s):
    indexes = []
    for c in string.ascii_lowercase:
        if s.count(c) == 1:
            indexes.append(s.find(c))
    # default=-1 is Python 3 only
    return min(indexes, default=-1)
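As an aside, if you need to run this on Python 2, where min() has no default keyword, a hypothetical variant (my own, not timed in the benchmark below) could end with an explicit check instead:

import string

def first_uniq_char_very_stupid_py2(s):
    indexes = [s.find(c) for c in string.ascii_lowercase if s.count(c) == 1]
    return min(indexes) if indexes else -1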
Let's test my algorithm and some algorithms found in the other answers, on Python 3.5. I've chosen a case that is pathologically bad for my algorithm:
In [30]: s = 'timecomplexity' * 10000 + 'a'
In [31]: %timeit first_uniq_char_ultra_smart(s)
10 loops, best of 3: 35 ms per loop
In [32]: %timeit karin(s)
100 loops, best of 3: 11.7 ms per loop
In [33]: %timeit john(s)
100 loops, best of 3: 9.92 ms per loop
In [34]: %timeit nicholas(s)
100 loops, best of 3: 10.4 ms per loop
In [35]: %timeit first_uniq_char_very_stupid(s)
1000 loops, best of 3: 1.55 ms per loop
So, my stupid algorithm is the fastest, because it finds the a at the end and bails out, and my smart algorithm is the slowest. One more reason for the bad performance of my algorithm, besides this being its worst case, is that OrderedDict is written in C on Python 3.5, whereas Counter is in Python.
Let's make a better test here:
In [60]: s = string.ascii_lowercase * 10000
In [61]: %timeit nicholas(s)
100 loops, best of 3: 18.3 ms per loop
In [62]: %timeit karin(s)
100 loops, best of 3: 19.6 ms per loop
In [63]: %timeit john(s)
100 loops, best of 3: 18.2 ms per loop
In [64]: %timeit first_uniq_char_very_stupid(s)
100 loops, best of 3: 2.89 ms per loop
So it appears that this "stupid" algorithm of mine isn't all that stupid after all; it exploits the speed of C while minimizing the number of iterations of Python code being run, and wins clearly in this problem.
As others have noted, your algorithm looks O(n²) at first glance because of the nested linear search; but as discovered by @Antti, the OP's algorithm is actually linear, bounded by O(kn) where k is the number of all possible lowercase letters.
My proposition for an O(n) solution:
from collections import OrderedDict
def first_unique_char(string):
    # ordered dict mapping each char to a boolean: has a duplicate been seen?
    duplicated = OrderedDict()
    for s in string:
        duplicated[s] = s in duplicated
    for char, is_duplicate in duplicated.items():
        if not is_duplicate:
            return string.find(char)
    return -1

print(first_unique_char('timecomplexity'))  # 4
Your algorithm is O(n²), because you have a "hidden" iteration over a slice of s inside the loop over s.
A faster algorithm would be:
def first_unique_character(s):
    good = {}    # char: index
    bad = set()  # chars seen more than once
    for index, ch in enumerate(s):
        if ch in bad:
            continue
        if ch in good:  # new repeat
            bad.add(ch)
            del good[ch]
        else:
            good[ch] = index
    if not good:
        return -1
    return min(good.values())
This is O(n) because the in lookups use hash tables, and the number of distinct characters should be much less than len(s).
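For example, on the string from the question:

print(first_unique_character('timecomplexity'))  # 4, the index of 'c'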
What's the most pythonic way to mesh two strings together?
For example:
Input:
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'
Output:
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
For me, the most pythonic* way is the following, which pretty much does the same thing but uses the + operator to concatenate the individual characters in each string:
res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
It is also faster than using two join() calls:
In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000
In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop
In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop
Faster approaches exist, but they often obfuscate the code.
Note: If the two input strings are not the same length then the longer one will be truncated as zip stops iterating at the end of the shorter string. In this case instead of zip one should use zip_longest (izip_longest in Python 2) from the itertools module to ensure that both strings are fully exhausted.
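A minimal sketch of that variant (u2 and l2 are made-up inputs, not from the question):

from itertools import zip_longest  # izip_longest on Python 2

u2, l2 = 'ABCDE', 'abc'
print("".join(i + j for i, j in zip_longest(u2, l2, fillvalue='')))
# 'AaBbCcDE'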
*To take a quote from the Zen of Python: Readability counts.
Pythonic = readability for me; i + j is just visually parsed more easily, at least for my eyes.
Faster Alternative
Another way:
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))
Output:
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
Speed
Looks like it is faster:
%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)
100000 loops, best of 3: 4.75 µs per loop
than the fastest solution so far:
%timeit "".join(list(chain.from_iterable(zip(u, l))))
100000 loops, best of 3: 6.52 µs per loop
Also for the larger strings:
l1 = 'A' * 1000000; l2 = 'a' * 1000000
%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop
%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)
10 loops, best of 3: 92 ms per loop
Python 3.5.1.
Variation for strings with different lengths
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijkl'
Shorter one determines length (zip() equivalent)
min_len = min(len(u), len(l))
res = [''] * min_len * 2
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
print(''.join(res))
Output:
AaBbCcDdEeFfGgHhIiJjKkLl
Longer one determines length (itertools.zip_longest(fillvalue='') equivalent)
min_len = min(len(u), len(l))
res = [''] * min_len * 2
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
res += u[min_len:] + l[min_len:]
print(''.join(res))
Output:
AaBbCcDdEeFfGgHhIiJjKkLlMNOPQRSTUVWXYZ
With join() and zip().
>>> ''.join(''.join(item) for item in zip(u,l))
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
On Python 2, by far the fastest way to do things, at ~3x the speed of list slicing for small strings and ~30x for long ones, is
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
This wouldn't work on Python 3, though. You could implement something like
res = bytearray(len(u) * 2)
res[::2] = u.encode("ascii")
res[1::2] = l.encode("ascii")
res.decode("ascii")
but by then you've already lost the gains over list slicing for small strings (it's still 20x the speed for long strings) and this doesn't even work for non-ASCII characters yet.
FWIW, if you are doing this on massive strings and need every cycle, and for some reason have to use Python strings... here's how to do it:
res = bytearray(len(u) * 4 * 2)
u_utf32 = u.encode("utf_32_be")
res[0::8] = u_utf32[0::4]
res[1::8] = u_utf32[1::4]
res[2::8] = u_utf32[2::4]
res[3::8] = u_utf32[3::4]
l_utf32 = l.encode("utf_32_be")
res[4::8] = l_utf32[0::4]
res[5::8] = l_utf32[1::4]
res[6::8] = l_utf32[2::4]
res[7::8] = l_utf32[3::4]
res.decode("utf_32_be")
Special-casing the common case of smaller types will help too. FWIW, this is only 3x the speed of list slicing for long strings and a factor of 4 to 5 slower for small strings.
Either way I prefer the join solutions, but since timings were mentioned elsewhere I thought I might as well join in.
If you want the fastest way, you can combine itertools with operator.add:
In [36]: from operator import add
In [37]: from itertools import starmap, izip
In [38]: timeit "".join([i + j for i, j in izip(l1, l2)])
1 loops, best of 3: 142 ms per loop
In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop
In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop
In [41]: "".join(starmap(add, izip(l1,l2))) == "".join([i + j for i, j in izip(l1, l2)]) == "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True
But combining izip and chain.from_iterable is faster again:
In [2]: from itertools import chain, izip
In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop
There is also a substantial difference between chain(*...) and chain.from_iterable(...):
In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop
There is no such thing as join consuming a generator lazily; passing one is always going to be slower, because Python will first build a list from its contents. join does two passes over the data, one to figure out the size needed and one to actually do the join, which would not be possible with a generator. From join.h:
/* Here is the general case. Do a pre-pass to figure out the total
* amount of space we'll need (sz), and see whether all arguments are
* bytes-like.
*/
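If you want to verify the list-versus-generator point on your own machine, a rough sketch (timings will vary; not part of the original answer) is:

import timeit

setup = "l1 = 'A' * 1000000; l2 = 'a' * 1000000"
print(timeit.timeit('"".join([i + j for i, j in zip(l1, l2)])', setup=setup, number=10))
print(timeit.timeit('"".join(i + j for i, j in zip(l1, l2))', setup=setup, number=10))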
Also, if you have strings of different lengths and you don't want to lose data, you can use izip_longest:
In [22]: from itertools import izip_longest
In [23]: a,b = "hlo","elworld"
In [24]: "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'
For Python 3 it is called zip_longest.
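For reference, the Python 3 spelling of the same snippet would be roughly:

from itertools import chain, zip_longest

a, b = "hlo", "elworld"
print("".join(chain.from_iterable(zip_longest(a, b, fillvalue=""))))
# 'helloworld'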
But for Python 2, Veedrac's suggestion is by far the fastest:
In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
....:
100 loops, best of 3: 2.68 ms per loop
You could also do this using map and operator.add:
from operator import add
u = 'AAAAA'
l = 'aaaaa'
s = "".join(map(add, u, l))
Output:
'AaAaAaAaAa'
What map does is take each element from the first iterable u and the corresponding element from the second iterable l, and apply the function supplied as the first argument, add. Then join just joins the results.
Jim's answer is great, but here's my favorite option, if you don't mind a couple of imports:
from functools import reduce
from operator import add
reduce(add, map(add, u, l))
A lot of these suggestions assume the strings are of equal length. Maybe that covers all reasonable use cases, but at least to me it seems that you might want to accommodate strings of differing lengths too. Or am I the only one thinking the mesh should work a bit like this:
u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"
One way to do this would be the following:
def mesh(a, b):
    minlen = min(len(a), len(b))
    return "".join(["".join(x + y for x, y in zip(a, b)), a[minlen:], b[minlen:]])
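Which gives the behaviour sketched above:

print(mesh("foobar", "baz"))  # 'fboaozbar'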
I like using two fors; the variable names can give a hint/reminder of what is going on:
"".join(char for pair in zip(u,l) for char in pair)
Just to add another, more basic approach:
st = ""
for i, char in enumerate(u):
    st = "{0}{1}{2}".format(st, char, l[i])  # pair each character with the one at the same index in l
Feels a bit un-pythonic not to consider the double-list-comprehension answer here, to handle n strings with O(1) effort:
"".join(c for cs in itertools.zip_longest(*all_strings, fillvalue='') for c in cs)
where all_strings is a list of the strings you want to interleave. In your case, all_strings = [u, l]. A full use example would look like this:
import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings, fillvalue='') for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
Like many answers here: is it the fastest? Probably not, but it is simple and flexible. Also, without too much added complexity, it is slightly faster than the accepted answer (string addition is, in general, a bit slow in Python):
In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;
In [8]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop
In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop
Potentially faster and shorter than the current leading solution:
from itertools import chain
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'
res = "".join(chain(*zip(u, l)))
Speed-wise, the strategy is to do as much as possible at the C level. The same zip_longest() fix applies for uneven strings, and since it comes out of the same module as chain(), you can't ding me too many points there!
Other solutions I came up with along the way:
res = "".join(u[x] + l[x] for x in range(len(u)))
res = "".join(k + l[i] for i, k in enumerate(u))
You could use iteration_utilities.roundrobin1
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'
from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
or the ManyIterables class from the same package:
from iteration_utilities import ManyIterables
ManyIterables(u, l).roundrobin().as_string()
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
1 This is from a third-party library I have written: iteration_utilities.
I would use zip() to get a readable and easy way:
result = ''
for cha, chb in zip(u, l):
    result += '%s%s' % (cha, chb)
print result
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
I would like to add brackets to each character in a string. So
"HelloWorld"
should become:
"[H][e][l][l][o][W][o][r][l][d]"
I have used this code:
word = "HelloWorld"
newWord = ""
for letter in word:
    newWord += "[%s]" % letter
This is the most straightforward way to do it, but the string concatenations are pretty slow.
Any suggestions for speeding up this code?
>>> s = "HelloWorld"
>>> ''.join('[{}]'.format(x) for x in s)
'[H][e][l][l][o][W][o][r][l][d]'
If the string is huge, then using str.join with a list comprehension will be faster and more memory-efficient than using a generator expression (https://stackoverflow.com/a/9061024/846892):
>>> ''.join(['[{}]'.format(x) for x in s])
'[H][e][l][l][o][W][o][r][l][d]'
From Python performance tips:
Avoid this:
s = ""
for substring in list:
    s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
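Applied to the question's loop, that tip would look something like this (a sketch of the pattern, not a benchmark):

word = "HelloWorld"
parts = []
for letter in word:
    parts.append("[%s]" % letter)
newWord = "".join(parts)  # '[H][e][l][l][o][W][o][r][l][d]'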
The most pythonic way would probably be with a generator comprehension:
>>> s = "HelloWorld"
>>> "".join("[%s]" % c for c in s)
'[H][e][l][l][o][W][o][r][l][d]'
Ashwini Chaudhary's answer is very similar, but uses the modern (Python3) string format function. The old string interpolation with % still works fine and is a bit simpler.
A bit more creatively, inserting ][ between each character, and surrounding it all with []. I guess this might be a bit faster, since it doesn't do as many string interpolations, but speed shouldn't be an issue here.
>>> "[" + "][".join(s) + "]"
'[H][e][l][l][o][W][o][r][l][d]'
If you are concerned about speed and need a fast implementation, try to find an implementation that offloads the iteration to the underlying native code. This is true at least in CPython.
Suggested Implementation
"[{}]".format(']['.join(s))
Output
'[H][e][l][l][o][W][o][r][l][d]'
Comparing with a competing solution
In [12]: s = "a" * 10000
In [13]: %timeit "[{}]".format(']['.join(s))
1000 loops, best of 3: 215 us per loop
In [14]: %timeit ''.join(['[{}]'.format(x) for x in s])
100 loops, best of 3: 3.06 ms per loop
In [15]: %timeit ''.join('[{}]'.format(x) for x in s)
100 loops, best of 3: 3.26 ms per loop