I would like to add brackets to each character in a string. So
"HelloWorld"
should become:
"[H][e][l][l][o][W][o][r][l][d]"
I have used this code:
word = "HelloWorld"
newWord = ""
for letter in word:
newWord += "[%s]" % letter
which is the most straightforward way to do it, but the repeated string concatenation is pretty slow.
Any suggestions on speeding up this code?
>>> s = "HelloWorld"
>>> ''.join('[{}]'.format(x) for x in s)
'[H][e][l][l][o][W][o][r][l][d]'
If the string is huge, then using str.join with a list comprehension will be faster and more memory-efficient than using a generator expression (https://stackoverflow.com/a/9061024/846892):
>>> ''.join(['[{}]'.format(x) for x in s])
'[H][e][l][l][o][W][o][r][l][d]'
From Python performance tips:
Avoid this:
s = ""
for substring in list:
s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
The most Pythonic way would probably be with a generator expression:
>>> s = "HelloWorld"
>>> "".join("[%s]" % c for c in s)
'[H][e][l][l][o][W][o][r][l][d]'
Ashwini Chaudhary's answer is very similar, but uses the modern (Python 3) str.format method. The old string interpolation with % still works fine and is a bit simpler.
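For completeness, on Python 3.6 and later an f-string version (my addition, not from either answer) reads much the same:
>>> "".join(f"[{c}]" for c in s)
'[H][e][l][l][o][W][o][r][l][d]'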
A bit more creatively: insert ][ between the characters and surround the whole thing with [ and ]. I guess this might be a bit faster, since it doesn't do as many string interpolations, but speed shouldn't be an issue here.
>>> "[" + "][".join(s) + "]"
'[H][e][l][l][o][W][o][r][l][d]'
If you are concerned about speed and need a fast implementation, try to find one that offloads the iteration to the underlying native code. This is true at least in CPython.
Suggested Implementation
"[{}]".format(']['.join(s))
Output
'[H][e][l][l][o][W][o][r][l][d]'
Comparing with a competing solution
In [12]: s = "a" * 10000
In [13]: %timeit "[{}]".format(']['.join(s))
1000 loops, best of 3: 215 us per loop
In [14]: %timeit ''.join(['[{}]'.format(x) for x in s])
100 loops, best of 3: 3.06 ms per loop
In [15]: %timeit ''.join('[{}]'.format(x) for x in s)
100 loops, best of 3: 3.26 ms per loop
Related
I'm working on matching a list of regular expressions against a list of strings. The problem is that the lists are very big (about 1 million regexes, about 50T strings). What I've got so far is this:
reg_list = ["domain\.com\/picture\.png", "entry{0,9}"]
y = ["test","string","entry4also_found","entry5"]
for r in reg_list:
for x in y:
if re.findall(r, x):
RESULT_LIST.append(x)
print(x)
This works logically, but it is way too inefficient for those numbers of entries. Is there a better (more efficient) solution?
Use any() to test if any of the regular expressions match, rather than looping over the entire list.
Compile all the regular expressions first, so this doesn't have to be done repeatedly.
reg_list = [re.compile(rx) for rx in reg_list]

for word in y:
    if any(rx.search(word) for rx in reg_list):
        RESULT_LIST.append(word)
The only enhancements that come to mind are:
Stopping the match at the first occurrence, since re.findall searches for all matches, which is not what you are after.
Pre-compiling your regexes.
reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
reg_list = [re.compile(x) for x in reg_list] # Step 1
y = ["test","string","entry4also_found","entry5"]
RESULT_LIST = []
for r in reg_list:
for x in y:
if r.search(x): # Step 2
RESULT_LIST.append(x)
print(x)
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop
$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
So, if you are going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).
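Going beyond pre-compiling, one further option for the original problem, sketched here under the assumption that every individual pattern stays valid inside a non-capturing group: combine the patterns into a single alternation, so each string is scanned only once. Very large pattern sets may need to be split into several combined regexes.
import re

reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
# One combined pattern; the regex engine then scans each string once.
combined = re.compile("|".join("(?:%s)" % rx for rx in reg_list))

y = ["test", "string", "entry4also_found", "entry5"]
RESULT_LIST = [x for x in y if combined.search(x)]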
I had a Python program which reads lines from files and puts them into a dict. Simplified, it looks like this:
data = {'file_name': ''}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'] += line
I found it took hours to finish.
And then, I did a bit change to the program:
data = {'file_name': []}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'].append(line)
data['file_name'] = ''.join(data['file_name'])
It finished in seconds.
I thought it was += that made the program slow, but it seems not. Please take a look at the result of the following test.
I knew we could use list append and join to improve performance when concatenating strings. But I never expected such a performance gap between append-and-join and add-and-assign.
So I decided to do some more tests, and finally found that it's the dict update operation that makes the program insanely slow. Here is a script:
import time

LOOPS = 10000
WORD = 'ABC' * 100

s1 = time.time()
buf1 = []
for i in xrange(LOOPS):
    buf1.append(WORD)
ss = ''.join(buf1)

s2 = time.time()
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD

s3 = time.time()
buf3 = {'1': ''}
for i in xrange(LOOPS):
    buf3['1'] += WORD

s4 = time.time()
buf4 = {'1': []}
for i in xrange(LOOPS):
    buf4['1'].append(WORD)
buf4['1'] = ''.join(buf4['1'])

s5 = time.time()
print s2 - s1, s3 - s2, s4 - s3, s5 - s4
On my laptop (mid-2013 Mac Pro, OS X 10.9.5, CPython 2.7.10), its output is (list append+join, str +=, dict-value +=, dict-value append+join):
0.00299620628357 0.00415587425232 3.49465799332 0.00231599807739
Inspired by juanpa.arrivillaga's comments, I made a small change to the second loop:
trivial_reference = []
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD
    trivial_reference.append(buf2)  # add a trivial reference to avoid the optimization
After the change, the second loop takes 19 seconds to complete. So it seems to be just an optimization issue, as juanpa.arrivillaga said.
+= performs really badly when building large strings, but can be efficient in one case in CPython, mentioned below.
For consistently fast string concatenation, use str.join().
From String Concatenation section under Python Performance Tips:
Avoid this:
s = ""
for substring in list:
s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
Why is s += x faster than s['1'] += x or s[0] += x?
From Note 6:
CPython implementation detail: If s and t are both strings, some
Python implementations such as CPython can usually perform an in-place
optimization for assignments of the form s = s + t or s += t. When
applicable, this optimization makes quadratic run-time much less
likely. This optimization is both version and implementation
dependent. For performance sensitive code, it is preferable to use the
str.join() method which assures consistent linear concatenation
performance across versions and implementations.
The optimization in CPython's case is that if a string has only one reference, it can be resized in place.
/* Note that we don't have to modify *unicode for unshared Unicode
objects, since we can modify them in-place. */
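The resize can only happen when nothing else references the string. A minimal demonstration of the extra reference that defeats the optimization, using CPython's sys.getrefcount (this snippet is my addition, not from the original answer):
import sys

s = ''.join(['ABC'] * 100)  # built dynamically so it isn't constant-folded or interned
r1 = sys.getrefcount(s)
d = {'1': s}                # store it as a dict value
r2 = sys.getrefcount(s)
print(r2 - r1)              # 1: the dict holds an extra reference, so an in-place
                            # resize during d['1'] += ... is no longer safe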
Now the latter two are not simple in-place additions. In fact, these are not in-place additions at all.
s[0] += x
is equivalent to:
temp = s[0]  # extra reference: `s[0]` and `temp` both point to the same string now
temp += x
s[0] = temp
Example:
>>> lst = [1, 2, 3]
>>> def func():
... lst[0] = 90
... return 100
...
>>> lst[0] += func()
>>> print lst
[101, 2, 3] # Not [190, 2, 3]
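You can also see this at the bytecode level (my illustration; exact opcode names vary a little across CPython versions):
import dis

# For `s[0] += x`, CPython emits (roughly): BINARY_SUBSCR to load s[0]
# onto the stack (the extra reference), INPLACE_ADD, then STORE_SUBSCR
# to store the result back -- a load/add/store, not an in-place mutation.
dis.dis(compile("s[0] += x", "<example>", "exec"))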
But in general, never use s += x for concatenating strings; always use str.join on a collection of strings.
Timings
LOOPS = 1000
WORD = 'ABC' * 100

def list_append():
    buf1 = [WORD for _ in xrange(LOOPS)]
    return ''.join(buf1)

def str_concat():
    buf2 = ''
    for i in xrange(LOOPS):
        buf2 += WORD

def dict_val_concat():
    buf3 = {'1': ''}
    for i in xrange(LOOPS):
        buf3['1'] += WORD
    return buf3['1']

def list_val_concat():
    buf4 = ['']
    for i in xrange(LOOPS):
        buf4[0] += WORD
    return buf4[0]

def val_pop_concat():
    buf5 = ['']
    for i in xrange(LOOPS):
        val = buf5.pop()
        val += WORD
        buf5.append(val)
    return buf5[0]

def val_assign_concat():
    buf6 = ['']
    for i in xrange(LOOPS):
        val = buf6[0]
        val += WORD
        buf6[0] = val
    return buf6[0]
>>> %timeit list_append()
1000 loops, best of 3: 1.31 ms per loop
>>> %timeit str_concat()
100 loops, best of 3: 3.09 ms per loop
>>> %run so.py
>>> %timeit list_append()
10000 loops, best of 3: 71.2 us per loop
>>> %timeit str_concat()
1000 loops, best of 3: 276 us per loop
>>> %timeit dict_val_concat()
100 loops, best of 3: 9.66 ms per loop
>>> %timeit list_val_concat()
100 loops, best of 3: 9.64 ms per loop
>>> %timeit val_pop_concat()
1000 loops, best of 3: 556 us per loop
>>> %timeit val_assign_concat()
100 loops, best of 3: 9.31 ms per loop
val_pop_concat is fast here because by using pop() we drop the list's reference to that string, and then CPython can resize it in place (guessed correctly by @niemmi in the comments).
I'm trying to match "necklaces" of symbols in Python by looking up their linear representations, for which I use normal strings. For example, the strings "AABC", "ABCA", "BCAA", "CAAB" all represent the same necklace (pictured).
In order to get an overview, I store only one of the equivalent strings of a given necklace as a "representative". To check whether I've already stored a candidate necklace, I need a function that normalizes any given string representation. As a kind of pseudo-code, I wrote this function in Python:
import collections

def normalized(s):
    q = collections.deque(s)
    l = list()
    l.append(''.join(q))
    for i in range(len(s) - 1):
        q.rotate(1)
        l.append(''.join(q))
    l.sort()
    return l[0]
For all the string representations in the above example necklace, this function returns "AABC", which comes first alphabetically.
Since I'm relatively new to Python, I wonder: if I were to implement an application in Python, would this function already be "good enough" for production code? In other words: would an experienced Python programmer use this function, or are there obvious flaws?
If I understand you correctly, you first need to construct all circular permutations of the input sequence and then determine the (lexicographically) smallest element. That is the root of your symbol loop.
Try this:
def normalized(s):
    L = [s[i:] + s[:i] for i in range(len(s))]
    return sorted(L)[0]
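For example, with the representations from the question:
>>> normalized("CAAB")
'AABC'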
This code works only with strings; there are no conversions between deque and string as in your code. A quick timing test shows it runs in 30-50% of the time.
It would be interesting to know the length of s in your application, since all rotations have to be stored temporarily: len(s)^2 bytes are needed for the temporary list L. Hopefully this is not a constraint in your case.
Edit:
Today I stumbled upon the observation that if you concatenate the original string to itself, it will contain all rotations as substrings. So the code becomes:
def normalized4(s):
    ss = s + s  # contains all rotations of s as substrings
    n = len(s)
    return min(ss[i:i + n] for i in range(n))
This will indeed run faster, as there is only one concatenation left plus n slicings. Using string lengths of 10 to 10**5, the runtime is between 55% and 66% on my machine, compared to the min() version with a generator.
Please note that you trade speed for memory consumption (2x), which doesn't matter here but might in a different setting.
You could use min rather than sorting:
def normalized2(s):
    return min(s[i:] + s[:i] for i in range(len(s)))
But it still needs to copy the string len(s) times. A faster way is to filter the starting indexes of the smallest character until only one is left, effectively searching for the smallest rotation:
def normalized3(s):
    ssize = len(s)
    minchar = min(s)
    minindexes = [i for i in range(ssize) if minchar == s[i]]
    for offset in range(1, ssize):
        if len(minindexes) == 1:
            break
        minchar = min(s[(i + offset) % ssize] for i in minindexes)
        minindexes = [i for i in minindexes if minchar == s[(i + offset) % ssize]]
    return s[minindexes[0]:] + s[:minindexes[0]]
For long strings this is much faster:
In [143]: loop = [ random.choice("abcd") for i in range(100) ]
In [144]: timeit normalized(loop)
1000 loops, best of 3: 237 µs per loop
In [145]: timeit normalized2(loop)
10000 loops, best of 3: 91.3 µs per loop
In [146]: timeit normalized3(loop)
100000 loops, best of 3: 16.9 µs per loop
But if there is a lot of repetition, this method is not efficient:
In [147]: loop = "abcd" * 25
In [148]: timeit normalized(loop)
1000 loops, best of 3: 245 µs per loop
In [149]: timeit normalized2(loop)
100000 loops, best of 3: 18.8 µs per loop
In [150]: timeit normalized3(loop)
1000 loops, best of 3: 612 µs per loop
We can also scan the string forward, but I doubt it could be any faster without some fancy algorithm.
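For the record, such a fancy algorithm does exist: Booth's algorithm finds the lexicographically least rotation in O(n). A sketch of the textbook formulation (this addition is mine; the function names are made up):
def least_rotation(s):
    # Booth's algorithm: index of the lexicographically least rotation, O(n).
    ss = s + s                      # the doubled string contains every rotation
    f = [-1] * len(ss)              # failure function, as in KMP
    k = 0                           # index of the least rotation found so far
    for j in range(1, len(ss)):
        sj = ss[j]
        i = f[j - k - 1]
        while i != -1 and sj != ss[k + i + 1]:
            if sj < ss[k + i + 1]:
                k = j - i - 1
            i = f[i]
        if sj != ss[k + i + 1]:     # here i == -1
            if sj < ss[k]:
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k

def normalized5(s):
    k = least_rotation(s)
    return s[k:] + s[:k]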
How about something like this:
patterns = ['abc', 'bca', 'cab']
normalized = lambda p: ''.join(sorted(p))
normalized_patterns = set(normalized(p) for p in patterns)
Example output:
In [1]: normalized = lambda p: ''.join(sorted(p))
In [2]: normalized('abba')
Out[2]: 'aabb'
In [3]: normalized('CBAE')
Out[3]: 'ABCE'
Hi, I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers. The final string should be only the first 200 characters.
I know of a solution, which is:
string = "Special $#! character's spaces 888323"
string = ''.join(e for e in string if e.isalnum())[:200]
But this will first remove all the unwanted characters and then slice it.
Is there something that will work like a generator, i.e. break as soon as the total reaches 200 characters? I want a Pythonic solution. PS: I know I can achieve it via for loops.
from itertools import islice
"".join(islice((e for e in string if e.isalnum()), 200))
But personally, I think the for loop sounds a lot better to me.
Use a generator expression or function with itertools.islice:
from itertools import islice
s = "Special $#! character's spaces 888323"
gen = (e for e in s if e.isalnum())
new_s = ''.join(islice(gen, 200))
Note that if the strings are not huge and the number n (200 here) is not small compared to the string length, then you should use str.translate with simple slicing, as it is going to be very fast compared to a Python-based for-loop:
>>> from string import whitespace, punctuation
>>> s.translate(None, whitespace+punctuation)[:10]
'Specialcha'
Some timing comparisons for a large string:
>>> s = "Special $#! character's spaces 888323" * 10000
>>> len(s)
390000
# For very small n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 200))
10000 loops, best of 3: 20.2 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:200]
1000 loops, best of 3: 383 µs per loop
# For mid-sized n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 10000))
1000 loops, best of 3: 930 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:10000]
1000 loops, best of 3: 378 µs per loop
# When n is comparable to length of string.
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 100000))
100 loops, best of 3: 9.41 ms per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:100000]
1000 loops, best of 3: 385 µs per loop
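One caveat the timings above don't show: the two-argument form s.translate(None, deletechars) only works on Python 2 byte strings. On Python 3, the equivalent builds a deletion table with str.maketrans:
>>> from string import whitespace, punctuation
>>> table = str.maketrans('', '', whitespace + punctuation)  # maps these chars to None
>>> s.translate(table)[:10]
'Specialcha'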
If regular expressions aren't solving your problem, it could just be that you're not using enough of them yet :-) Here's a one-liner (discounting the import) that limits it to 20 characters (because your test data didn't match your specifications):
>>> import re
>>> string = "Special $#! character's spaces 888323"
>>> re.sub("[^A-Za-z0-9]","",string)[:20]
'Specialcharactersspa'
While not technically a generator, it will work just as well provided you're not having to process truly massive strings.
What it will do is avoid the split and rejoin in your original solution:
''.join(e for e in something)
No doubt there's some cost to the regular expression processing but I'd have a hard time believing it's as high as building a temporary list then tearing it down into a string again. Still, if you're concerned, you should measure, not guess!
If you want an actual generator, it's easy enough to implement one:
class alphanum(object):
    def __init__(self, s, n):
        self.s = s    # source string
        self.n = n    # maximum number of characters to yield
        self.ix = 0   # current position in s

    def __iter__(self):
        return self

    def __next__(self):          # Python 3 iterator protocol
        return self.next()

    def next(self):              # Python 2 iterator protocol
        if self.n <= 0:
            raise StopIteration()
        # skip past any non-alphanumeric characters
        while self.ix < len(self.s) and not self.s[self.ix].isalnum():
            self.ix += 1
        if self.ix == len(self.s):
            raise StopIteration()
        self.ix += 1
        self.n -= 1
        return self.s[self.ix - 1]

    def remainder(self):
        return ''.join([x for x in self])

for x in alphanum("Special $#! chars", 10):
    print x

print alphanum("Special $#! chars", 10).remainder()
which shows how you can use it as a 'character' iterator as well as a string modifier:
S
p
e
c
i
a
l
c
h
a
Specialcha
I don't know how to express this. I want to print:
_1__2__3__4_
With "_%s_" as a substring of that. How to get the main string when I format the substring? (as a shortcut of:
for x in range(1,5):
print "_%s_" % (x)
(Even though this prints multiple lines))
Edit: just in one line
Did you mean something like this?
my_string = "".join(["_%d_" % i for i in xrange(1,5)])
That creates a list of the substrings as requested and then concatenates the items in the list using the empty string as separator (see the str.join() documentation).
Alternatively, you can build up a string through a loop with the += operator, although it is much slower and less efficient:
s = ""
for x in range(1,5):
s += "_%d_" % x
print s
print("_" + "__".join(map(str, xrange(1,5)))) +"_"
_1__2__3__4_
In [9]: timeit ("_" + "__".join(map(str,xrange(1,5)))) +"_"
1000000 loops, best of 3: 1.38 µs per loop
In [10]: timeit "".join(["_%d_" % i for i in xrange(1,5)])
100000 loops, best of 3: 3.19 µs per loop
You can maintain your style if you want to.
If you are using Python 2.7:
from __future__ import print_function

for x in range(1, 5):
    print("_%s_" % x, sep='', end='')
print()
For Python 3.x, the import is not required.
Python doc: https://docs.python.org/2.7/library/functions.html?highlight=print#print
Python 3:
print("_{}_".format("__".join(map(str,range(1,5)))))
_1__2__3__4_
Python 2:
print "_{0}_".format("__".join(map(str,range(1,5))))
_1__2__3__4_