How to produce an arbitrary string with a specific length in python? - python

I need a string that is 100000 characters long. What is the shortest and most efficient way of producing such a string in Python?
The content of the string is not of importance.

Something like:
'x' * 100000 # or,
''.join('x' for x in xrange(100000)) # or,
from itertools import repeat
''.join(repeat('x', times=100000))
Or for a bit of a mixup of letters:
from string import ascii_letters
from random import choice
''.join(choice(ascii_letters) for _ in xrange(100000))
Or, for some random data:
import os
s = os.urandom(100000)

You can simply do
s = 'a' * 100000

Since efficiency is important, here's a quick benchmark for some of the approaches mentioned so far:
$ python -m timeit "" "'a'*100000"
100000 loops, best of 3: 4.99 usec per loop
$ python -m timeit "from itertools import repeat" "''.join(repeat('x', times=100000))"
1000 loops, best of 3: 2.24 msec per loop
$ python -m timeit "import array" "array.array('c',[' ']*100000).tostring()"
100 loops, best of 3: 3.92 msec per loop
$ python -m timeit "" "''.join('x' for x in xrange(100000))"
100 loops, best of 3: 5.69 msec per loop
$ python -m timeit "import os" "os.urandom(100000)"
100 loops, best of 3: 6.17 msec per loop
Not surprisingly, of the ones posted, using string multiplication is the fastest by far.
Also note that it is more efficient to multiply a single char than a multi-char string (to get the same final string length).
$ python -m timeit "" "'a'*100000"
100000 loops, best of 3: 4.99 usec per loop
$ python -m timeit "" "'ab'*50000"
100000 loops, best of 3: 6.02 usec per loop
$ python -m timeit "" "'abcd'*25000"
100000 loops, best of 3: 6 usec per loop
$ python -m timeit "" "'abcdefghij'*10000"
100000 loops, best of 3: 6.03 usec per loop
Tested on Python 2.7.3
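For readers on Python 3, where xrange and the array 'c' typecode no longer exist, a similar comparison can be run from a script with the timeit module. This is only a minimal sketch: the dictionary name candidates and the repetition counts are illustrative choices, not part of the benchmark above, and absolute timings will differ from machine to machine.

```python
import timeit
from itertools import repeat

N = 100000

# Candidate ways to build an N-character string (names are illustrative).
candidates = {
    "multiply": lambda: "a" * N,
    "join_repeat": lambda: "".join(repeat("x", N)),
    "join_genexpr": lambda: "".join("x" for _ in range(N)),
}

# Sanity check input: every candidate should return exactly N characters.
results = {name: func() for name, func in candidates.items()}

# Run each candidate 10 times per measurement, 3 measurements, keep the best.
timings = {
    name: min(timeit.repeat(func, number=10, repeat=3)) / 10
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {seconds * 1e6:.1f} usec per call")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that python -m timeit printed in the runs above.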

Strings can use the multiplication operator:
"a" * 100000

Try making an array of blank characters.
import array
longCharArray = array.array('c',[' ']*100000)
This will allocate an array of ' ' characters of size 100000
longCharArray.tostring()
Will convert to a string.

Just pick some character and repeat it 100000 times:
"a"*100000
Why you would want this is another question. . .

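You can try something like this (Python 2, after import random and import string). Note that it yields 10000 characters; for 100000, use string.lowercase * 3847 and a sample size of 100000: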
"".join(random.sample(string.lowercase * 385,10000))

As a one liner:
''.join([chr(random.randint(32, 126)) for x in range(30)])
Change the range() value to get a different length of string; change the bounds of randint() to get a different set of characters.
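On Python 3.6 or later, the same idea can be written with random.choices, which draws with replacement and takes the length directly. A small sketch, assuming ASCII letters are an acceptable alphabet; the function name random_string is made up for illustration.

```python
import random
import string


def random_string(length, alphabet=string.ascii_letters):
    """Return `length` characters drawn at random from `alphabet`."""
    # random.choices samples with replacement, so `length` may be
    # larger than the number of characters in `alphabet`.
    return "".join(random.choices(alphabet, k=length))


s = random_string(100000)
print(len(s))
```

Pass a different alphabet (for example string.digits) to change the set of characters.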

Related

deque.popleft() vs list.pop(0), performance analysis

According to this question, I checked the performance on my laptop.
Surprisingly, I found that pop(0) from a list is faster than popleft() from a deque structure:
python -m timeit 'l = range(10000)' 'l.pop(0)'
gives:
10000 loops, best of 3: 66 usec per loop
While:
python -m timeit 'import collections' 'l = collections.deque(range(10000))' 'l.popleft()'
gives:
10000 loops, best of 3: 123 usec per loop
Moreover, I checked the performance on jupyter finding the same outcome:
%timeit l = range(10000); l.pop(0)
10000 loops, best of 3: 64.7 µs per loop
from collections import deque
%timeit l = deque(range(10000)); l.popleft()
10000 loops, best of 3: 122 µs per loop
What is the reason?
The problem is that your timeit call also times the deque/list creation, and creating a deque is obviously much slower because of the chaining.
In the command line, you can pass the setup to timeit using the -s option like this:
python -m timeit -s"import collections, time; l = collections.deque(range(10000000))" "l.popleft()"
Also, since the setup runs only once, you get a pop error (empty list) after a while. I haven't changed the default number of iterations, so I created a large deque to compensate, and got
10000000 loops, best of 3: 0.0758 usec per loop
on the other hand with list it's slower:
python -m timeit -s "l = list(range(10000000))" "l.pop(0)"
100 loops, best of 3: 9.72 msec per loop
I have also coded the bench in a script (more convenient), with a setup (to avoid clocking the setup) and 99999 iterations on a 100000-size list:
import timeit
print(timeit.timeit(stmt='l.pop(0)',setup='l = list(range(100000))',number=99999))
print(timeit.timeit(setup='import collections; l = collections.deque(range(100000))', stmt='l.popleft()', number=99999))
no surprise: deque wins:
2.442976927292288 for pop in list
0.007311641921253109 for pop in deque
note that l.pop() for the list runs in 0.011536903686244897 seconds, which is very good when popping the last element, as expected.
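To see how the gap depends on container size, the same setup/stmt split can be repeated for a few sizes. This is a rough sketch rather than a rigorous benchmark: the helper name time_pops, the sizes and the number of pops are arbitrary choices, picked so that the containers never run empty.

```python
import timeit


def time_pops(setup, stmt, pops=1000):
    """Time `pops` executions of stmt; setup runs once and is not timed."""
    return timeit.timeit(stmt=stmt, setup=setup, number=pops)


sizes = [10000, 100000]
rows = []
for size in sizes:
    list_time = time_pops(f"l = list(range({size}))", "l.pop(0)")
    deque_time = time_pops(
        f"import collections; l = collections.deque(range({size}))",
        "l.popleft()",
    )
    rows.append((size, list_time, deque_time))

for size, list_time, deque_time in rows:
    print(f"size={size}: list.pop(0) {list_time:.6f}s, "
          f"deque.popleft() {deque_time:.6f}s")
```

list.pop(0) has to shift every remaining element, so its time should grow with the list size, while deque.popleft() should stay roughly flat.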

Calculating number of characters excluding whitespace - python

Which is the faster way to count the characters in a string excluding whitespace?
Which is more Pythonic?
def sent_length(sentence):
    return sum(1 for c in sentence if c != ' ')
or
def sent_length(sentence):
    return len(sentence.replace(" ", ""))
or
import re
pattern = re.compile(r'\s+')
def sent_length(sentence):
    return len(re.sub(pattern, '', sentence))
You can get timings from python -m timeit:
[matt tmp] python -m timeit "sum(1 for c in 'blah blah blah' if c != ' ')"
100000 loops, best of 3: 2.96 usec per loop
[matt benchmark] python -m timeit -s "import re; pattern = re.compile(r'\s+')" "len(pattern.sub('', 'blah blah blah'))"
100000 loops, best of 3: 2.2 usec per loop
[matt tmp] python -m timeit "len(''.join('blah blah blah'.split()))"
1000000 loops, best of 3: 0.785 usec per loop
[matt tmp] python -m timeit "len('blah blah blah'.replace(' ', ''))"
1000000 loops, best of 3: 0.437 usec per loop
[matt tmp] python -m timeit "len('blah blah blah') - 'blah blah blah'.count(' ')"
1000000 loops, best of 3: 0.384 usec per loop
This will help you determine what is the quickest. More pythonic? I'd go with the fastest as performance is always important.
In terms of faster you can test that yourself.
from datetime import datetime
start = datetime.now()
# some method
end = datetime.now()
diff = end-start
In terms of what is more pythonic, I don't believe any of them are more pythonic. They are each a valid implementation that most people would accept. It's just a slightly different approach. In general regular expressions will take slightly longer to run.
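One thing worth checking before comparing speed is that the variants do not all compute the same thing: the replace and count versions only skip the space character, while the \s+ and split versions skip every kind of whitespace. A small sketch with made-up function names shows the difference.

```python
import re

WHITESPACE = re.compile(r"\s+")


def count_non_space(sentence):
    """Count characters other than the space character ' '."""
    return len(sentence) - sentence.count(" ")


def count_non_whitespace(sentence):
    """Count characters that are not whitespace (tabs, newlines, ...)."""
    return len(WHITESPACE.sub("", sentence))


sample = "blah\tblah\nblah blah"
print(count_non_space(sample))       # tab and newline are still counted
print(count_non_whitespace(sample))  # only the 16 letters are counted
```

For input that contains only plain spaces, both functions return the same number, so the choice only matters when tabs or newlines can appear.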

Python3 vs Python2 list/generator range performance

I have this simple function that partitions a list and returns an index i in the list such that elements at indices less than i are smaller than list[i] and elements at indices greater than i are bigger.
def partition(arr):
    first_high = 0
    pivot = len(arr) - 1
    for i in range(len(arr)):
        if arr[i] < arr[pivot]:
            arr[first_high], arr[i] = arr[i], arr[first_high]
            first_high = first_high + 1
    arr[first_high], arr[pivot] = arr[pivot], arr[first_high]
    return first_high
if __name__ == "__main__":
    arr = [1, 5, 4, 6, 0, 3]
    pivot = partition(arr)
    print(pivot)
The runtime is substantially longer with Python 3.4 than with Python 2.7.6
on OS X:
time python3 partition.py
real 0m0.040s
user 0m0.027s
sys 0m0.010s
time python partition.py
real 0m0.031s
user 0m0.018s
sys 0m0.011s
Same thing on ubuntu 14.04 / virtual box
python3:
real 0m0.049s
user 0m0.034s
sys 0m0.015s
python:
real 0m0.044s
user 0m0.022s
sys 0m0.018s
Is python3 inherently slower than python2.7, or are there specific optimizations that would make the code run as fast as on python2.7?
As mentioned in the comments, you should be benchmarking with timeit rather than with OS tools.
My guess is that the range function performs a little slower in Python 3. In Python 2 it simply returns a list; in Python 3 it returns a range object, which produces its values lazily, more or less like a generator. I did some benchmarking, and this was the result, which may be a hint about what you're experiencing:
python -mtimeit "range(10)"
1000000 loops, best of 3: 0.474 usec per loop
python3 -mtimeit "range(10)"
1000000 loops, best of 3: 0.59 usec per loop
python -mtimeit "range(100)"
1000000 loops, best of 3: 1.1 usec per loop
python3 -mtimeit "range(100)"
1000000 loops, best of 3: 0.578 usec per loop
python -mtimeit "range(1000)"
100000 loops, best of 3: 11.6 usec per loop
python3 -mtimeit "range(1000)"
1000000 loops, best of 3: 0.66 usec per loop
As you can see, when the input provided to range is small, it tends to be faster in Python 2. As the input grows, Python 3's range behaves better.
My suggestion: test the code for larger arrays, with a hundred or a thousand elements.
Actually, I went further and tested a complete iteration through the elements. The results were totally in favor of Python 2:
python -mtimeit "for i in range(1000):pass"
10000 loops, best of 3: 31 usec per loop
python3 -mtimeit "for i in range(1000):pass"
10000 loops, best of 3: 45.3 usec per loop
python -mtimeit "for i in range(10000):pass"
1000 loops, best of 3: 330 usec per loop
python3 -mtimeit "for i in range(10000):pass"
1000 loops, best of 3: 480 usec per loop
My conclusion is that it is probably faster to iterate through a list than through a generator, although the latter is definitely more efficient regarding memory consumption. This is a classic example of the trade-off between speed and memory. The speed difference is not that big per se (well under a millisecond), so you should weigh it against what's better for your program.
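The memory side of that trade-off is easy to observe in Python 3. The sketch below is illustrative only (the variable names and the size are arbitrary); it also shows that a Python 3 range is a lazy sequence rather than a true generator, since it supports len(), indexing and repeated iteration.

```python
import sys

n = 100000
lazy = range(n)          # Python 3: values are computed on demand
eager = list(range(n))   # roughly what Python 2's range(n) returned

# A range supports len(), indexing and membership tests without a list.
print(len(lazy), lazy[10], 99999 in lazy)

# Its memory footprint does not grow with n, unlike the list's.
print(sys.getsizeof(lazy), sys.getsizeof(eager))

# Unlike a generator, a range can be iterated more than once.
first_pass = sum(1 for _ in lazy)
second_pass = sum(1 for _ in lazy)
print(first_pass, second_pass)
```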

Cost of len() function

What is the cost of len() function for Python built-ins? (list/tuple/string/dictionary)
It's O(1) (constant time, independent of the actual length of the element, so very fast) on every type you've mentioned, plus set and others such as array.array.
Calling len() on those data types is O(1) in CPython, the official and most common implementation of the Python language. Here's a link to a table that provides the algorithmic complexity of many different functions in CPython:
TimeComplexity Python Wiki Page
All those objects keep track of their own length. The time to extract the length is small (O(1) in big-O notation) and mostly consists of [rough description, written in Python terms, not C terms]: look up "len" in a dictionary and dispatch it to the built_in len function which will look up the object's __len__ method and call that ... all it has to do is return self.length
The below measurements provide evidence that len() is O(1) for oft-used data structures.
A note regarding timeit: When the -s flag is used and two strings are passed to timeit the first string is executed only once and is not timed.
List:
$ python -m timeit -s "l = range(10);" "len(l)"
10000000 loops, best of 3: 0.0677 usec per loop
$ python -m timeit -s "l = range(1000000);" "len(l)"
10000000 loops, best of 3: 0.0688 usec per loop
Tuple:
$ python -m timeit -s "t = (1,)*10;" "len(t)"
10000000 loops, best of 3: 0.0712 usec per loop
$ python -m timeit -s "t = (1,)*1000000;" "len(t)"
10000000 loops, best of 3: 0.0699 usec per loop
String:
$ python -m timeit -s "s = '1'*10;" "len(s)"
10000000 loops, best of 3: 0.0713 usec per loop
$ python -m timeit -s "s = '1'*1000000;" "len(s)"
10000000 loops, best of 3: 0.0686 usec per loop
Dictionary (dictionary-comprehension available in 2.7+):
$ python -mtimeit -s"d = {i:j for i,j in enumerate(range(10))};" "len(d)"
10000000 loops, best of 3: 0.0711 usec per loop
$ python -mtimeit -s"d = {i:j for i,j in enumerate(range(1000000))};" "len(d)"
10000000 loops, best of 3: 0.0727 usec per loop
Array:
$ python -mtimeit -s"import array;a=array.array('i',range(10));" "len(a)"
10000000 loops, best of 3: 0.0682 usec per loop
$ python -mtimeit -s"import array;a=array.array('i',range(1000000));" "len(a)"
10000000 loops, best of 3: 0.0753 usec per loop
Set (set-comprehension available in 2.7+):
$ python -mtimeit -s"s = {i for i in range(10)};" "len(s)"
10000000 loops, best of 3: 0.0754 usec per loop
$ python -mtimeit -s"s = {i for i in range(1000000)};" "len(s)"
10000000 loops, best of 3: 0.0713 usec per loop
Deque:
$ python -mtimeit -s"from collections import deque;d=deque(range(10));" "len(d)"
100000000 loops, best of 3: 0.0163 usec per loop
$ python -mtimeit -s"from collections import deque;d=deque(range(1000000));" "len(d)"
100000000 loops, best of 3: 0.0163 usec per loop
len is O(1) because, in RAM, lists are stored as tables (series of contiguous addresses). To know where the table stops, the computer needs two things: the length and the start point. That is why len() is O(1): the computer stores the value, so it just needs to look it up.
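The "stored length" idea can be illustrated with a toy container. This is a hypothetical class written for illustration, not how CPython implements its built-ins (those keep the size in a C struct field), but the principle is the same: len() reads a counter instead of walking the elements.

```python
class Bag:
    """Toy container (hypothetical) that keeps its own element count."""

    def __init__(self):
        self._items = []
        self._length = 0  # updated on every insertion

    def add(self, item):
        self._items.append(item)
        self._length += 1

    def __len__(self):
        # No iteration here: just return the stored counter.
        return self._length


bag = Bag()
for value in range(1000):
    bag.add(value)
print(len(bag))
```

The built-in len() dispatches to __len__, so the cost of len(bag) does not depend on how many items were added.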

Remove characters except digits from string using Python?

How can I remove all characters except numbers from string?
Use re.sub, like so:
>>> import re
>>> re.sub(r'\D', '', 'aas30dsa20')
'3020'
\D matches any non-digit character, so the code above essentially replaces every non-digit character with the empty string.
Or you can use filter, like so (in Python 2):
>>> filter(str.isdigit, 'aas30dsa20')
'3020'
In Python 3, filter returns an iterator (in Python 2 it returned a string for string input), so use the following instead:
>>> ''.join(filter(str.isdigit, 'aas30dsa20'))
'3020'
In Python 2.*, by far the fastest approach is the .translate method:
>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>
string.maketrans makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256)) (just faster to make;-). .translate applies the translation table (which here is irrelevant since all essentially means identity) AND deletes characters present in the second argument -- the key part.
.translate works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.
Back to 2.*, the performance difference is impressive...:
$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop
Speeding things up by 7-8 times is hardly peanuts, so the translate method is well worth knowing and using. The other popular non-RE approach...:
$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop
is 50% slower than RE, so the .translate approach beats it by over an order of magnitude.
In Python 3, or for Unicode, you need to pass .translate a mapping (with ordinals, not characters directly, as keys) that returns None for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:
import string
class Del:
    def __init__(self, keep=string.digits):
        self.comp = dict((ord(c),c) for c in keep)
    def __getitem__(self, k):
        return self.comp.get(k)
DD = Del()
x='aaa12333bb445bb54b5b52'
x.translate(DD)
also emits '1233344554552'. However, putting this in xx.py we have...:
$ python3.1 -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop
...which shows the performance advantage disappears, for this kind of "deletion" tasks, and becomes a performance decrease.
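For text known to be ASCII (or at least Latin-1), one possible Python 3 analogue of the 2.* trick is to go through bytes, because bytes.translate still accepts a second argument listing the bytes to delete. This is a sketch under that encoding assumption, with made-up names, and no timing claim is made for it; measure it on your own data.

```python
import string

# All 256 byte values except the ASCII digits: the bytes to delete.
DIGITS = string.digits.encode("ascii")
NON_DIGITS = bytes(range(256)).translate(None, DIGITS)


def keep_digits_ascii(text):
    """Keep only 0-9; assumes every character of `text` fits in Latin-1."""
    raw = text.encode("latin-1")
    # table=None means no remapping; the second argument lists bytes to drop.
    return raw.translate(None, NON_DIGITS).decode("ascii")


x = "aaa12333bb445bb54b5b52"
print(keep_digits_ascii(x))
```

Text containing characters outside Latin-1 raises UnicodeEncodeError here, so the mapping-based str.translate approach is still needed for general Unicode input.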
s=''.join(i for i in s if i.isdigit())
Another generator variant.
You can use filter:
filter(lambda x: x.isdigit(), "dasdasd2313dsa")
On python3.0 you have to join this (kinda ugly :( )
''.join(filter(lambda x: x.isdigit(), "dasdasd2313dsa"))
You can easily do it using Regex
>>> import re
>>> re.sub(r"\D","","£70,000")
'70000'
along the lines of bayer's answer:
''.join(i for i in s if i.isdigit())
The op mentions in the comments that he wants to keep the decimal place. This can be done with the re.sub method (as per the second and IMHO best answer) by explicitly listing the characters to keep e.g.
>>> re.sub("[^0123456789\.]","","poo123.4and5fish")
'123.45'
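If this is applied to many strings, the character class can be compiled once and reused. A minimal sketch, with an illustrative function name; note that it keeps every '.' in the input, not only a single decimal point.

```python
import re

# Matches any character that is neither an ASCII digit nor a dot.
NOT_DIGIT_OR_DOT = re.compile(r"[^0-9.]")


def keep_digits_and_dots(text):
    """Drop every character except ASCII digits and '.'."""
    return NOT_DIGIT_OR_DOT.sub("", text)


print(keep_digits_and_dots("poo123.4and5fish"))
print(keep_digits_and_dots("v1.2.3"))  # all dots survive, not just the first
```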
x.translate(None, string.digits)
will delete all digits from string. To delete letters and keep the digits, do this:
x.translate(None, string.letters)
Use a generator expression:
>>> s = "foo200bar"
>>> new_s = "".join(i for i in s if i in "0123456789")
A fast version for Python 3:
# xx3.py
from collections import defaultdict
import string
_NoneType = type(None)
def keeper(keep):
    table = defaultdict(_NoneType)
    table.update({ord(c): c for c in keep})
    return table
digit_keeper = keeper(string.digits)
Here's a performance comparison vs. regex:
$ python3.3 -mtimeit -s'import xx3; x="aaa12333bb445bb54b5b52"' 'x.translate(xx3.digit_keeper)'
1000000 loops, best of 3: 1.02 usec per loop
$ python3.3 -mtimeit -s'import re; r = re.compile(r"\D"); x="aaa12333bb445bb54b5b52"' 'r.sub("", x)'
100000 loops, best of 3: 3.43 usec per loop
So it's a little bit more than 3 times faster than regex, for me. It's also faster than class Del above, because defaultdict does all its lookups in C, rather than (slow) Python. Here's that version on my same system, for comparison.
$ python3.3 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
100000 loops, best of 3: 13.6 usec per loop
Try:
import re
string = '1abcd2XYZ3'
string_without_letters = re.sub(r'[a-z]', '', string.lower())
this should give:
123
Ugly but works:
>>> s
'aaa12333bb445bb54b5b52'
>>> a = ''.join(filter(lambda x : x.isdigit(), s))
>>> a
'1233344554552'
>>>
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 2.48 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'
100000 loops, best of 3: 2.02 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 2.37 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'
100000 loops, best of 3: 1.97 usec per loop
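I observed that join with findall is faster than sub. (Note that the findall pattern above, "[a-z]+", extracts the letters; use "[0-9]+" to extract the digits.)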
You can read each character. If it is digit, then include it in the answer. The str.isdigit() method is a way to know if a character is digit.
your_input = '12kjkh2nnk34l34'
your_output = ''.join(c for c in your_input if c.isdigit())
print(your_output) # '1223434'
You can use join + filter + lambda:
''.join(filter(lambda s: s.isdigit(), "20 years ago, 2 months ago, 2 days ago"))
Output: '2022'
Not a one-liner but very simple:
buffer = ""
some_str = "aas30dsa20"
for char in some_str:
    if char.isdigit():  # keep only the digits
        buffer += char
print( buffer )
I used this. 'letters' should contain all the characters that you want to get rid of:
Output = Input.translate({ord(i): None for i in 'letters'})
Example:
Input = "I would like 20 dollars for that suit"
Output = Input.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '})
print(Output)
Output:
20
my_string="sdfsdfsdfsfsdf353dsg345435sdfs525436654.dgg("
my_string=''.join((ch if ch in '0123456789' else '') for ch in my_string)
print("output: " + my_string)
output: 353345435525436654
Another one:
import re
re.sub('[^0-9]', '', 'ABC123 456')
Result:
'123456'
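The approaches in this thread agree on ASCII input but can differ on other Unicode text in Python 3: \D keeps any Unicode decimal digit, str.isdigit() additionally accepts characters such as superscript digits, and [^0-9] keeps only the ten ASCII digits. A short sketch with a made-up sample string:

```python
import re

# 'a', '1', SUPERSCRIPT TWO, ARABIC-INDIC DIGIT THREE
sample = "a1\u00b2\u0663"

by_backslash_d = re.sub(r"\D", "", sample)
by_isdigit = "".join(c for c in sample if c.isdigit())
by_ascii_class = re.sub(r"[^0-9]", "", sample)

# ascii() shows non-ASCII characters as escapes so the results are easy to compare.
print(ascii(by_backslash_d))
print(ascii(by_isdigit))
print(ascii(by_ascii_class))
```

If the result is going to be passed to something that expects plain 0-9, the explicit [^0-9] class is the safest of the three.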
