Python map function acting on a string of numbers

I've been playing around with the map function in Python and I was looking for some help in understanding the following behaviour:
foo="12345"
print map(int,foo)
gives you [1, 2, 3, 4, 5]. Obviously int(foo) spits out 12345. So what exactly is happening? Since strings are iterable by character, would the above two lines be synonymous with
print [int(x) for x in foo]
I know they will output the same result but is there anything different going on behind the scenes? Is one more efficient or better than another? Is one more "pythonic"?
Thanks a lot!

map() may be somewhat faster than a list comprehension in some cases, and slower in others.
When map() is used with a built-in function:
python -mtimeit -s'xs=xrange(1000)' 'map(int,"1234567890")'
10000 loops, best of 3: 18.3 usec per loop
python -mtimeit -s'xs=xrange(1000)' '[int(x) for x in "1234567890"]'
100000 loops, best of 3: 20 usec per loop
With a lambda, map() becomes slower:
python -mtimeit -s'xs=xrange(1000)' '[x*10 for x in "1234567890"]'
100000 loops, best of 3: 6.11 usec per loop
python -mtimeit -s'xs=xrange(1000)' 'map(lambda x:x*10,"1234567890")'
100000 loops, best of 3: 11.2 usec per loop
But in Python 3.x, map() returns a map object, i.e. an iterator.
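For example (my own quick illustration, not from the original answer), on Python 3 you have to materialize the iterator yourself:
>>> foo = "12345"
>>> map(int, foo)
<map object at 0x...>
>>> list(map(int, foo))
[1, 2, 3, 4, 5]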

From the documentation for map:
"Apply function to every item of iterable and return a list of the results."
int() attempts to convert what is passed into an integer and will raise a ValueError if you try something silly, like this:
>>> int('Hello')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'Hello'
map() will return a list containing the return value of the function you asked it to call, for each item of the iterable. If your function returns nothing, then you'll get a list of Nones, like this:
>>> def silly(x):
...     pass
...
>>> map(silly,'Hello')
[None, None, None, None, None]
It is the short and efficient way to do something like this:
def verbose_map(some_function, something):
    results = []
    for i in something:
        results.append(some_function(i))
    return results
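For instance (a quick check, not part of the original answer), calling it with the silly() function from above reproduces the map() result:
>>> verbose_map(silly, 'Hello')
[None, None, None, None, None]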

map can be thought of as working like this:
def map(func, iterable):
    answer = []
    for elem in iterable:
        answer.append(func(elem))
    return answer
Basically, it returns a list L such that the ith element of L is the result of computing func on the ith element of your iterable.
So, with int and a string of digits, on each iteration of the for loop the element is a single character, which int converts to an actual integer. The result of calling map on such a string is a list whose elements are the int() values of the corresponding characters in the string.
So yes, if L = "12345", then map(int, L) is synonymous with [int(x) for x in L]
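A quick interactive check (Python 2 shown; on Python 3 wrap the map call in list()):
>>> L = "12345"
>>> map(int, L) == [int(x) for x in L]
True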
Hope this helps

foo="12345"
In [507]: dis.dis('map(int,foo)')
0 <109> 28769
3 STORE_SLICE+0
4 LOAD_ATTR 29806 (29806)
7 <44>
8 BUILD_TUPLE 28527
11 STORE_SLICE+1
def map(func, iterable):
    answer = []
    for elem in iterable:
        answer.append(func(elem))
    return answer
dis.dis('[int(x) for x in foo]')
0 DELETE_NAME 28265 (28265)
3 LOAD_GLOBAL 30760 (30760)
6 STORE_SLICE+1
7 SLICE+2
8 BUILD_TUPLE 29295
11 SLICE+2
12 SETUP_LOOP 26912 (to 26927)
15 JUMP_FORWARD 26144 (to 26162)
18 JUMP_IF_FALSE 23919 (to 23940)
And timing:
In [512]: timeit map(int,foo)
100000 loops, best of 3: 6.89 us per loop
In [513]: def mymap(func, iterable):
     ...:     answer = []
     ...:     for elem in iterable:
     ...:         answer.append(func(elem))
     ...:     return answer
In [514]: timeit mymap(int,foo)
100000 loops, best of 3: 8.29 us per loop
In [515]: timeit [int(x) for x in foo]
100000 loops, best of 3: 7.5 us per loop

"More efficient" is a can of worms. On this computer, it's faster to use map with CPython, but the list comprehension is faster for pypy
$ python -mtimeit 'map(int,"1234567890")'
100000 loops, best of 3: 8.05 usec per loop
$ python -mtimeit '[int(x) for x in "1234567890"]'
100000 loops, best of 3: 9.33 usec per loop
$ pypy -mtimeit 'map(int,"1234567890")'
1000000 loops, best of 3: 1.18 usec per loop
$ pypy -mtimeit '[int(x) for x in "1234567890"]'
1000000 loops, best of 3: 0.938 usec per loop
Python 3 shows map() to be faster even with the extra call to list() that is required:
$ python3 -mtimeit 'list(map(int,"1234567890"))'
100000 loops, best of 3: 11.8 usec per loop
$ python3 -mtimeit '[int(x) for x in "1234567890"]'
100000 loops, best of 3: 13.6 usec per loop
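If you want to re-run these comparisons from a script instead of the shell, here is a minimal harness (my own sketch; numbers will vary by machine and interpreter):
# Sketch: compare map() vs. a list comprehension with the timeit module.
import sys
import timeit

stmts = {
    "map": "list(map(int, '1234567890'))" if sys.version_info[0] >= 3
           else "map(int, '1234567890')",
    "listcomp": "[int(x) for x in '1234567890']",
}

for name, stmt in stmts.items():
    # best of 3 runs of 100000 loops, reported per loop
    per_loop = min(timeit.repeat(stmt, repeat=3, number=100000)) / 100000
    print("{:8s} {:.2f} usec per loop".format(name, per_loop * 1e6))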

Yes, there is a huge difference behind the scenes. If you print(map) you'll see it is a built-in. A built-in function executes faster than one written in Python, and faster than most constructs that rely on how the language is parsed; map uses the fast iteration protocol directly, a list comprehension does not. Other than that, there is no difference.
map(int, '1'*1000000)
vs.
[int(i) for i in '1'*1000000]
Using CPython and the Unix time program, map completes in ~3 seconds and the list comprehension in ~5.
One thing to note: this only applies when the function passed to map is written in C.

Related

Fastest way to check if duplicates exist in a python list / numpy ndarray

I want to determine whether or not my list (actually a numpy.ndarray) contains duplicates in the fastest possible execution time. Note that I don't care about removing the duplicates, I simply want to know if there are any.
Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.
Here are the four ways I thought of doing it.
TL;DR: if you expect very few (less than 1/1000) duplicates:
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False
The first method is an early exit from this answer, which wants to return the unique values; the second is the same idea applied to this answer.
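A quick sanity check of whichever contains_duplicates you pasted in (the sample arrays are mine, not from the answer):
import numpy as np

X_unique = np.arange(10)               # no duplicates
X_dupes = np.array([0, 1, 2, 2, 3])    # one duplicated value

print(contains_duplicates(X_unique))   # False
print(contains_duplicates(X_dupes))    # True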
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list, and not a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
>>>     print("{} duplicates".format(n_duplicates))
>>>     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
>>>     X[duplicate_idx] = 0
>>>     print("terhost_early_exit")
>>>     %timeit terhorst_early_exit(X)
>>>     print("peterbe_early_exit")
>>>     %timeit peterbe_early_exit(X)
>>>     print("set length")
>>>     %timeit len(set(X)) != len(X)
>>>     print("numpy unique length")
>>>     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
Depending on how large your array is, and how likely duplicates are, the answer will be different.
For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.
Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.
Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but none that are numpy-friendly.
So, there's no one best solution for all possible scenarios.
Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:
from bitarray import bitarray # pip3 install bitarray
def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash): # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]: return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    seen = set()
    seen_add = seen.add
    for x in X:
        if check(x) or add(x):
            if x in seen or seen_add(x):
                return True
    return False
This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.
Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).
My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but np.unique is faster for a larger array (length 1000).
Length 8
Timing test iterations: 10000
Function     Min          Avg Sec       Conclusion    p-value
----------   ----------   -----------   ------------  ---------
set_len      0            7.73486e-06   Baseline
unique_len   9.644e-06    2.55573e-05   Slower        0

Length 1000
Timing test iterations: 10000
Function     Min          Avg Sec       Conclusion    p-value
----------   ----------   -----------   ------------  ---------
set_len      0.00011066   0.000270466   Baseline
unique_len   4.3684e-05   8.95608e-05   Faster        0
Then I tried my own implementation, but I think it would require optimized C code to beat set:
def check_items(key_rand, **kwargs):
    for i, vali in enumerate(key_rand):
        for j in range(i+1, len(key_rand)):
            valj = key_rand[j]
            if vali == valj:
                break
Length 8
Timing test iterations: 10000
Function      Min          Avg Sec       Conclusion    p-value
-----------   ----------   -----------   ------------  ---------
set_len       0            6.74221e-06   Baseline
unique_len    0            2.14604e-05   Slower        0
check_items   1.1138e-05   2.16369e-05   Slower        0
(using my randomized compare_time() function from easyinfo)
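For anyone who wants to reproduce this without easyinfo, a rough timeit equivalent (my own sketch; set_len and unique_len mirror the names in the tables above):
import timeit
import numpy as np

def set_len(a):
    return len(set(a)) != len(a)

def unique_len(a):
    return len(np.unique(a)) != len(a)

for length in (8, 1000):
    a = np.random.randint(0, 1000, length)
    for fn in (set_len, unique_len):
        total = timeit.timeit(lambda: fn(a), number=10000)
        print("length={:4d} {:10s} {:.4f} s for 10000 calls".format(length, fn.__name__, total))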

Multiply all items in a list in Python [duplicate]

This question already has answers here:
How can I multiply all items in a list together with Python?
(15 answers)
Closed 6 years ago.
How do I multiply the items in a list ?
For example:
num_list = [1,2,3,4,5]
def multiplyListItems(l):
    # some code here...
The expected calculation and return value is 1 x 2 x 3 x 4 x 5 = 120.
One way is to use reduce:
>>> num_list = [1,2,3,4,5]
>>> reduce(lambda x, y: x*y, num_list)
120
Use functools.reduce, which is faster (see below) and more forward-compatible with Python 3.
import operator
import functools
num_list = [1,2,3,4,5]
accum_value = functools.reduce(operator.mul, num_list)
print(accum_value)
# Output
120
Measuring the execution time for three different approaches:
# Way 1: reduce
$ python -m timeit "reduce(lambda x, y: x*y, [1,2,3,4,5])"
1000000 loops, best of 3: 0.727 usec per loop
# Way 2: np.multiply.reduce
$ python -m timeit -s "import numpy as np" "np.multiply.reduce([1,2,3,4,5])"
100000 loops, best of 3: 6.71 usec per loop
# Way 3: functools.reduce
$ python -m timeit -s "import operator, functools" "functools.reduce(operator.mul, [1,2,3,4,5])"
1000000 loops, best of 3: 0.421 usec per loop
For a bigger list, it is better to use np.multiply.reduce, as mentioned by @MikeMüller:
$ python -m timeit "reduce(lambda x, y: x*y, range(1, int(1e5)))"
10 loops, best of 3: 3.01 sec per loop
$ python -m timeit -s "import numpy as np" "np.multiply.reduce(range(1, int(1e5)))"
100 loops, best of 3: 11.2 msec per loop
$ python -m timeit -s "import operator, functools" "functools.reduce(operator.mul, range(1, int(1e5)))"
10 loops, best of 3: 2.98 sec per loop
A NumPy solution:
>>> import numpy as np
>>> np.multiply.reduce(num_list)
120
Run times for a bit larger list:
In [303]:
from operator import mul
from functools import reduce
import numpy as np

a = list(range(1, int(1e5)))
In [304]:
%timeit np.multiply.reduce(a)
100 loops, best of 3: 8.25 ms per loop
In [305]:
%timeit reduce(lambda x, y: x*y, a)
1 loops, best of 3: 5.04 s per loop
In [306]:
%timeit reduce(mul, a)
1 loops, best of 3: 5.37 s per loop
NumPy is largely implemented in C. Therefore, it can often be one or two orders of magnitude faster than writing loops over Python lists. This works for larger arrays. If an array is small and it is used often from Python, things can be slower than using pure Python. This is because of the overhead of converting between Python objects and C data types. In fact, it is an anti-pattern to write Python for loops to iterate over NumPy arrays.
Here, the list with five numbers causes considerable overhead compared to the gain from the faster numerics.
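To make the anti-pattern concrete, here is a small sketch (my own, not from the answer) timing a Python-level reduce over a NumPy array against NumPy's own reduction:
import timeit
import numpy as np
from functools import reduce
from operator import mul

a = np.random.random(int(1e5))  # the product underflows to 0.0 either way; we only care about speed

# Vectorized: the loop runs in C inside NumPy.
t_numpy = timeit.timeit(lambda: np.multiply.reduce(a), number=10)
# Anti-pattern: reduce() loops in Python and boxes every element.
t_python = timeit.timeit(lambda: reduce(mul, a), number=10)

print("np.multiply.reduce: {:.4f} s for 10 runs".format(t_numpy))
print("reduce(mul, a):     {:.4f} s for 10 runs".format(t_python))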

Efficiency: For x in EmptyList vs if length > 0

Sorry, kind of a hard question to title.
If I want to iterate over a potentially empty list, which is more efficient? I'm expecting the list to be empty the majority of the time.
for x in list:
    dostuff()
OR
if len(list) > 0:
    for x in list:
        dostuff()
Based on timings from the timeit module:
>>> from timeit import timeit
>>> timeit('for x in lst:pass', 'lst=[]')
0.08301091194152832
>>> timeit('if len(lst)>0:\n for x in lst:\n pass', 'lst=[]')
0.09223318099975586
It looks like just doing the for loop will be faster when the list is empty, making it the faster option regardless of the state of the list.
However, there is a significantly faster option:
>>> timeit('if lst:\n for x in lst:\n pass', 'lst=[]')
0.03235578536987305
Using if lst is much faster than either checking the length of the list or always doing the for loop. However, all three methods are quite fast, so if you are trying to optimize your code I would suggest trying to find what the real bottleneck is - take a look at When is optimisation premature?.
You can just use if list:
In [15]: if l:
   ....:     print "hello"
   ....:
In [16]: l1 = [1]
In [17]: if l1:
   ....:     print "hello from l1"
   ....:
hello from l1
In [21]: %timeit for x in l:pass
10000000 loops, best of 3: 54.4 ns per loop
In [22]: %timeit if l:pass
10000000 loops, best of 3: 22.4 ns per loop
If a list is empty, if list will evaluate to False, so there is no need to check len(list).
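A tiny demonstration of that truthiness rule:
>>> bool([]), bool([1, 2, 3])
(False, True)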
Firstly, if len(list) > 0: should be if list: to improve readability. I personally would have thought having the if statement was redundant, but timeit seems to prove me wrong. It seems (unless I've made a silly mistake) that having the check for an empty list makes the code faster (for an empty list):
$ python -m timeit 'list = []' 'for x in list:' ' print x'
10000000 loops, best of 3: 0.157 usec per loop
$ python -m timeit 'list = []' 'if list:' ' for x in list:' ' print x'
10000000 loops, best of 3: 0.0766 usec per loop
The first variant is more efficient, because Python's for loop has to detect the end of the list anyway, so the additional explicit check is just a waste of CPU cycles.
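If you want to see this for yourself, disassembling both variants shows that the bare loop's FOR_ITER already handles the empty case, while the guarded version just adds a truth test in front (a quick sketch of my own; exact opcodes vary by Python version):
import dis

def loop_only(lst):
    for x in lst:
        pass

def loop_with_check(lst):
    if lst:
        for x in lst:
            pass

dis.dis(loop_only)        # GET_ITER / FOR_ITER do all the work
dis.dis(loop_with_check)  # same loop, plus an extra truth test and jump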

Check for multidimensional list in Python

I have some data which is either 1 or 2 dimensional. I want to iterate through every pattern in the data set and perform foo() on it. If the data is 1D then add this value to a list, if it's 2D then take the mean of the inner list and append this value.
I saw this question, and decided to implement it checking for instance of a list. I can't use numpy for this application.
outputs = []
for row in data:
    if isinstance(row, list):
        vals = [foo(window) for window in row]
        outputs.append(sum(vals)/float(len(vals)))
    else:
        outputs.append(foo(row))
Is there a neater way of doing this? On each run, every pattern will have the same dimensionality, so I could make a separate class for 1D/2D but that will add a lot of classes to my code. The datasets can get quite large so a quick solution is preferable.
Your code is already almost as neat and fast as it can be. The only slight improvement is replacing [foo(window) for window in row] with map(foo, row), which can be seen by the benchmarks:
> python -m timeit "foo = lambda x: x+1; list(map(foo, range(1000)))"
10000 loops, best of 3: 132 usec per loop
> python -m timeit "foo = lambda x: x+1; [foo(a) for a in range(1000)]"
10000 loops, best of 3: 140 usec per loop
isinstance() already seems faster than its counterparts hasattr() and type() ==:
> python -m timeit "[isinstance(i, int) for i in range(1000)]"
10000 loops, best of 3: 117 usec per loop
> python -m timeit "[hasattr(i, '__iter__') for i in range(1000)]"
1000 loops, best of 3: 470 usec per loop
> python -m timeit "[type(i) == int for i in range(1000)]"
10000 loops, best of 3: 130 usec per loop
However, if you count short as neat, you can also simplify your code (after replacing the list comprehension with map) to:
mean = lambda x: sum(x)/float(len(x)) #or `from statistics import mean` in python3.4
output = [foo(r) if isinstance(r, int) else mean(map(foo, r)) for r in data]
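A quick check of that one-liner with a made-up foo() (list() added around map so the len() inside mean also works on Python 3):
foo = lambda x: x * 2                      # stand-in for the real foo()
mean = lambda x: sum(x) / float(len(x))

data_1d = [1, 2, 3]
data_2d = [[1, 2], [3, 4]]

print([foo(r) if isinstance(r, int) else mean(list(map(foo, r))) for r in data_1d])  # [2, 4, 6]
print([foo(r) if isinstance(r, int) else mean(list(map(foo, r))) for r in data_2d])  # [3.0, 7.0]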

Python function call speed

I'm really confused by function call speed in Python. The first and second cases show nothing unexpected:
%timeit reduce(lambda res, x: res+x, range(1000))
10000 loops, best of 3: 150 µs per loop
def my_add(res, x):
    return res + x
%timeit reduce(my_add, range(1000))
10000 loops, best of 3: 148 µs per loop
But the third case looks strange to me:
from operator import add
%timeit reduce(add, range(1000))
10000 loops, best of 3: 80.1 µs per loop
At the same time:
%timeit add(10, 100)
%timeit 10 + 100
10000000 loops, best of 3: 94.3 ns per loop
100000000 loops, best of 3: 14.7 ns per loop
So why does the third case give a speed-up of about 50%?
add is implemented in C.
>>> from operator import add
>>> add
<built-in function add>
>>> def my_add(res, x):
...     return res + x
...
>>> my_add
<function my_add at 0x18358c0>
The reason that a straight + is faster is that add still carries the overhead of a function call on top of the same underlying addition, while + is only a BINARY_ADD instruction.
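You can see the difference with dis (a quick sketch; on Python 3.11+ the opcode is BINARY_OP rather than BINARY_ADD):
import dis
from operator import add

def plain(a, b):
    return a + b      # a single BINARY_ADD

def with_operator(a, b):
    return add(a, b)  # loads add and makes a call; the addition itself still happens in C

dis.dis(plain)
dis.dis(with_operator)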
From the Python docs (emphasis mine):
"The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. For example, operator.add(x, y) is equivalent to the expression x+y. The function names are those used for special class methods; variants without leading and trailing __ are also provided for convenience."
The operator module is an efficient (native, I'd assume) implementation. IMHO, calling a native implementation should be quicker than calling a Python function.
You could also try running the interpreter with -O or -OO and checking the timing again.
