Python map/reduce: emit multiple keys/values from a single map lambda

Is there a canonical way to emit multiple keys from a single item in the input sequence so that they form a continuous sequence and I don't need to use a reduce(...) just to flatten the sequence?
e.g. if I wanted to expand each digit in a series of numbers into individual numbers in a sequence
[1,12,123,1234,12345] => [1,1,2,1,2,3,1,2,3,4,1,2,3,4,5]
then I'd write some python that looked a bit like this:
somedata = [1,12,123,1234,12345]
listified = map(lambda x:[int(c) for c in str(x)], somedata)
flattened = reduce(lambda x,y: x+y,listified,[])
but would prefer not to have to call the flattened = reduce(...) if there was a neater (or maybe more efficient) way to express this.

map(func, *iterables) will always call func as many times as the length of the shortest iterable (assuming no Exception is raised). Functions always return a single object. So list(map(func, *iterables)) will always have the same length as the shortest iterable.
Thus list(map(lambda x:[int(c) for c in str(x)], somedata)) will always have the same length as somedata. There is no way around that.
If the desired result (e.g. [1,1,2,1,2,3,1,2,3,4,1,2,3,4,5]) has more items than the input (e.g. [1,12,123,1234,12345]) then something other than map must be used to produce it.
You could, for example, use itertools.chain.from_iterable to flatten 2 levels of nesting:
In [31]: import itertools as IT
In [32]: somedata = [1,12,123,1234,12345]
In [33]: list(map(int, IT.chain.from_iterable(map(str, somedata))))
Out[33]: [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
or, to flatten a list of lists, sum(..., []) suffices:
In [44]: sum(map(lambda x:[int(c) for c in str(x)], somedata), [])
Out[44]: [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
but note that this is much slower than using IT.chain.from_iterable (see below), since sum builds a new list at every step and therefore does a quadratic amount of copying.
Here is a benchmark (using IPython's %timeit) testing the various methods on a list of 10,000 integers from 0 to a million:
In [4]: import random
In [8]: import functools
In [49]: somedata = [random.randint(0, 10**6) for i in range(10**4)]
In [50]: %timeit list(map(int, IT.chain.from_iterable(map(str, somedata))))
100 loops, best of 3: 9.35 ms per loop
In [13]: %timeit [int(i) for i in list(''.join(str(somedata)[1:-1].replace(', ','')))]
100 loops, best of 3: 12.2 ms per loop
In [52]: %timeit [int(j) for i in somedata for j in str(i)]
100 loops, best of 3: 12.3 ms per loop
In [51]: %timeit sum(map(lambda x:[int(c) for c in str(x)], somedata), [])
1 loop, best of 3: 869 ms per loop
In [9]: %timeit listified = map(lambda x:[int(c) for c in str(x)], somedata); functools.reduce(lambda x,y: x+y,listified,[])
1 loop, best of 3: 871 ms per loop

Got two ideas, one with list comprehensions:
print([int(j) for i in somedata for j in list(str(i))])
Something new (from comments): a string is already iterable, so it would be:
print([int(j) for i in somedata for j in str(i)])
The second uses operations on strings and a list comprehension:
print([int(i) for i in list(''.join(str(somedata)[1:-1].replace(', ','')))])
output for both:
[1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]

Here's how the transformation goes:
Convert every item (int) to a string: 12 -> '12'
Convert every item (str) to a list of string: '12' -> ['1', '2']
Flatten every item (list of str): ['1', '2'] -> '1', '2'
Convert every item (str) to an int: '1' -> 1
We can use Pyterator for this:
from pyterator import iterate
(
    iterate([1, 12, 123, 1234, 12345])
    .flat_map(lambda x: list(str(x)))  # Steps 1-3
    .map(int)                          # Step 4
    .to_list()
)
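If you prefer to stay in the standard library, here is a rough sketch of the same flat_map idea as a small generator function (my illustration, not Pyterator's API):
def flat_map(func, iterable):
    # apply func to each item, then yield the pieces of each result one by one
    for item in iterable:
        for piece in func(item):
            yield piece
print([int(c) for c in flat_map(str, [1, 12, 123, 1234, 12345])])
# [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]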


Is there a simpler and faster way to get a dict which contains the indexes of the same elements in a list or a numpy array

Description:
I have a large array with simple integers (positive and not large) like 1, 2, ..., etc. For example: [1, 1, 2, 2, 1, 2]. I want to get a dict that uses each distinct value from the list as a key, and the list of indexes where that value occurs as the corresponding value.
Question:
Is there a simpler and faster way to get the expected results in python? (array can be a list or a numpy array)
Code:
a = [1, 1, 2, 2, 1, 2]
results = indexes_of_same_elements(a)
print(results)
Expected results:
{1:[0, 1, 4], 2:[2, 3, 5]}
You can avoid iteration here using vectorized methods, in particular np.unique + np.argsort:
idx = np.argsort(a)
el, c = np.unique(a, return_counts=True)
out = dict(zip(el, np.split(idx, c.cumsum()[:-1])))
{1: array([0, 1, 4], dtype=int64), 2: array([2, 3, 5], dtype=int64)}
Performance
a = np.random.randint(1, 100, 10000)
In [183]: %%timeit
...: idx = np.argsort(a)
...: el, c = np.unique(a, return_counts=True)
...: dict(zip(el, np.split(idx, c.cumsum()[:-1])))
...:
897 µs ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [184]: %%timeit
...: results = {}
...: for i, k in enumerate(a):
...:     results.setdefault(k, []).append(i)
...:
2.61 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can exploit the fact that the elements are "simple" (i.e. nonnegative and not too large?) integers.
The trick is to construct a sparse matrix with just one element per row and then to transform it to a column wise representation. This is typically faster than argsort because this transform is O(M + N + nnz), if the sparse matrix is MxN with nnz nonzeros.
from scipy import sparse
def use_sprsm():
    x = sparse.csr_matrix((a, a, np.arange(a.size+1))).tocsc()
    idx, = np.where(x.indptr[:-1] != x.indptr[1:])
    return {i: a for i, a in zip(idx, np.split(x.indices, x.indptr[idx[1:]]))}
# for comparison
def use_asort():
    idx = np.argsort(a)
    el, c = np.unique(a, return_counts=True)
    return dict(zip(el, np.split(idx, c.cumsum()[:-1])))
Sample run:
>>> a = np.random.randint(0, 100, (10_000,))
>>>
# sanity check, note that `use_sprsm` returns sorted indices
>>> for k, v in use_asort().items():
...     assert np.array_equal(np.sort(v), use_sprsm()[k])
...
>>> timeit(use_asort, number=1000)
0.8930604780325666
>>> timeit(use_sprsm, number=1000)
0.38419671391602606
It is pretty trivial to construct the dict:
In []:
results = {}
for i, k in enumerate(a):
    results.setdefault(k, []).append(i)  # str(k) if you really need the key to be a str
print(results)
Out[]:
{1: [0, 1, 4], 2: [2, 3, 5]}
You could also use results = collections.defaultdict(list) and then results[k].append(i) instead of results.setdefault(k, []).append(i)
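A minimal sketch of that variant, using the same a as above:
from collections import defaultdict
a = [1, 1, 2, 2, 1, 2]
results = defaultdict(list)
for i, k in enumerate(a):
    results[k].append(i)
print(dict(results))  # {1: [0, 1, 4], 2: [2, 3, 5]}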

Stepping with multiple values while slicing an array in Python

I am trying to get m values while stepping through every n elements of an array. For example, for m = 2 and n = 5, and given
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I want to retrieve
b = [1, 2, 6, 7]
Is there a way to do this using slicing? I can do this using a nested list comprehension, but I was wondering if there was a way to do this using the indices only. For reference, the list comprehension way is:
b = [k for j in [a[i:i+2] for i in range(0,len(a),5)] for k in j]
I agree with wim that you can't do it with just slicing. But you can do it with just one list comprehension:
>>> [x for i,x in enumerate(a) if i%n < m]
[1, 2, 6, 7]
No, that is not possible with slicing. Slicing only supports start, stop, and step - there is no way to represent stepping with "groups" of size larger than 1.
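As a quick illustration (a minimal sketch, not from the original answer), a plain slice can only pick one element per stride, never a group of m:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(a[::5])  # [1, 6] -- the start of each group of 5, but only one element each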
In short, no, you cannot. But you can use itertools to remove the need for intermediary lists:
from itertools import chain, islice
res = list(chain.from_iterable(islice(a, i, i+2) for i in range(0, len(a), 5)))
print(res)
[1, 2, 6, 7]
Borrowing @Kevin's logic, if you want a vectorised solution to avoid a for loop, you can use the 3rd party library numpy:
import numpy as np
m, n = 2, 5
a = np.array(a) # convert to numpy array
res = a[np.where(np.arange(a.shape[0]) % n < m)]
There are other ways to do it, which all have advantages for some cases, but none are "just slicing".
The most general solution is probably to group your input, slice the groups, then flatten the slices back out. One advantage of this solution is that you can do it lazily, without building big intermediate lists, and you can do it to any iterable, including a lazy iterator, not just a list.
import itertools

# from itertools recipes in the docs
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
groups = grouper(a, 5)
truncated = (group[:2] for group in groups)
b = [elem for group in truncated for elem in group]
And you can convert that into a pretty simple one-liner, although you still need the grouper function:
b = [elem for group in grouper(a, 5) for elem in group[:2]]
Another option is to build a list of indices, and use itemgetter to grab all the values. This might be more readable for a more complicated function than just "the first 2 of every 5", but it's probably less readable for something as simple as your use:
import operator

indices = [i for i in range(len(a)) if i % 5 < 2]
b = operator.itemgetter(*indices)(a)
… which can be turned into a one-liner:
b = operator.itemgetter(*[i for i in range(len(a)) if i%5 < 2])(a)
And you can combine the advantages of the two approaches by writing your own version of itemgetter that takes a lazy index iterator—which I won't show, because you can go even better by writing one that takes an index filter function instead:
def indexfilter(pred, a):
    return [elem for i, elem in enumerate(a) if pred(i)]
b = indexfilter((lambda i: i%5<2), a)
(To make indexfilter lazy, just replace the brackets with parens.)
… or, as a one-liner:
b = [elem for i, elem in enumerate(a) if i%5<2]
I think this last one might be the most readable. And it works with any iterable rather than just lists, and it can be made lazy (again, just replace the brackets with parens). But I still don't think it's simpler than your original comprehension, and it's not just slicing.
The question states array, and by that if we are talking about NumPy arrays, we can surely use few obvious NumPy tricks and few not-so obvious ones. We can surely use slicing to get a 2D view into the input under certain conditions.
Now, based on the array length (let's call it l), m, and n, we would have three scenarios:
Scenario #1: l is divisible by n
We can use slicing and reshaping to get a view into the input array and hence get constant runtime.
Verify the view concept :
In [108]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [109]: m = 2; n = 5
In [110]: a.reshape(-1,n)[:,:m]
Out[110]:
array([[1, 2],
[6, 7]])
In [111]: np.shares_memory(a, a.reshape(-1,n)[:,:m])
Out[111]: True
Check timings on a very large array and hence constant runtime claim :
In [118]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [119]: %timeit a.reshape(-1,n)[:,:m]
1000000 loops, best of 3: 563 ns per loop
In [120]: a = np.arange(10000000)
In [121]: %timeit a.reshape(-1,n)[:,:m]
1000000 loops, best of 3: 564 ns per loop
To get the flattened version:
If we have to get a flattened array as output, we just need to use a flattening operation with .ravel(), like so -
In [127]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
In [128]: m = 2; n = 5
In [129]: a.reshape(-1,n)[:,:m].ravel()
Out[129]: array([1, 2, 6, 7])
Timings show that it's not too bad when compared with the other looping and vectorized numpy.where versions from other posts -
In [143]: a = np.arange(10000000)
# @Kevin's soln
In [145]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.23 s per loop
# @jpp's soln
In [147]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 145 ms per loop
In [144]: %timeit a.reshape(-1,n)[:,:m].ravel()
100 loops, best of 3: 16.4 ms per loop
Scenario #2: l is not divisible by n, but the groups end with a complete one at the end
We turn to the not-so-obvious NumPy method np.lib.stride_tricks.as_strided, which allows us to go beyond the memory block bounds (hence we need to be careful here not to write into those) to facilitate a solution using slicing. The implementation would look something like this -
def select_groups(a, m, n):
    a = np.asarray(a)
    strided = np.lib.stride_tricks.as_strided
    # Get params defining the lengths for slicing and output array shape
    nrows = len(a)//n
    add0 = len(a)%n
    s = a.strides[0]
    out_shape = nrows+int(add0!=0), m
    # Finally stride to get the 2D view
    return strided(a, shape=out_shape, strides=(s*n, s))
A sample run to verify that the output is a view -
In [151]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
In [152]: m = 2; n = 5
In [153]: select_groups(a, m, n)
Out[153]:
array([[ 1, 2],
[ 6, 7],
[11, 12]])
In [154]: np.shares_memory(a, select_groups(a, m, n))
Out[154]: True
To get the flattened version, append .ravel().
Let's get some timings comparison -
In [158]: a = np.arange(10000003)
In [159]: m = 2; n = 5
# @Kevin's soln
In [161]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.24 s per loop
# @jpp's soln
In [162]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 148 ms per loop
In [160]: %timeit select_groups(a, m=m, n=n)
100000 loops, best of 3: 5.8 µs per loop
If we need a flattened version, it's still not too bad -
In [163]: %timeit select_groups(a, m=m, n=n).ravel()
100 loops, best of 3: 16.5 ms per loop
Scenario #3: l is not divisible by n, and the groups end with an incomplete one at the end
For this case, we would need an extra slicing at the end on top of what we had in the previous method, like so -
def select_groups_generic(a, m, n):
    a = np.asarray(a)
    strided = np.lib.stride_tricks.as_strided
    # Get params defining the lengths for slicing and output array shape
    nrows = len(a)//n
    add0 = len(a)%n
    lim = m*(nrows) + add0
    s = a.strides[0]
    out_shape = nrows+int(add0!=0), m
    # Finally stride, flatten with reshape and slice
    return strided(a, shape=out_shape, strides=(s*n, s)).reshape(-1)[:lim]
Sample run -
In [166]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [167]: m = 2; n = 5
In [168]: select_groups_generic(a, m, n)
Out[168]: array([ 1, 2, 6, 7, 11])
Timings -
In [170]: a = np.arange(10000001)
In [171]: m = 2; n = 5
# @Kevin's soln
In [172]: %timeit [x for i,x in enumerate(a) if i%n < m]
1 loop, best of 3: 1.23 s per loop
# @jpp's soln
In [173]: %timeit a[np.where(np.arange(a.shape[0]) % n < m)]
10 loops, best of 3: 145 ms per loop
In [174]: %timeit select_groups_generic(a, m, n)
100 loops, best of 3: 12.2 ms per loop
I realize that recursion isn't popular, but would something like this work? Also, uncertain if adding recursion to the mix counts as just using slices.
def get_elements(A, m, n):
    if len(A) < m:
        return A
    else:
        return A[:m] + get_elements(A[n:], m, n)
A is the array, m and n are defined as in the question. The if covers the base case, where you have an array with length less than the number of elements you're trying to retrieve, and the else branch is the recursive case. I'm somewhat new to python, please forgive my poor understanding of the language if this doesn't work properly, though I tested it and it seems to work fine.
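For instance (a quick check of the function above):
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(get_elements(A, 2, 5))  # [1, 2, 6, 7]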
With itertools you could get an iterator with:
from itertools import compress, cycle
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = 5
m = 2
it = compress(a, cycle([1, 1, 0, 0, 0]))
res = list(it)
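If you'd rather not hard-code the selector pattern, here is a small sketch that builds it from m and n (my generalization, not part of the original answer; assumes m <= n):
from itertools import compress, cycle
def take_m_of_every_n(seq, m, n):
    # selector: m ones followed by (n - m) zeros, repeated lazily
    return compress(seq, cycle([1] * m + [0] * (n - m)))
print(list(take_m_of_every_n([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2, 5)))  # [1, 2, 6, 7]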

What is the pythonic way to get a list of cumulative sums? [duplicate]

time_interval = [4, 6, 12]
I want to sum up the numbers like [4, 4+6, 4+6+12] in order to get the list t = [4, 10, 22].
I tried the following:
t1 = time_interval[0]
t2 = time_interval[1] + t1
t3 = time_interval[2] + t2
print(t1, t2, t3) # -> 4 10 22
If you're doing much numerical work with arrays like this, I'd suggest numpy, which comes with a cumulative sum function cumsum:
import numpy as np
a = [4,6,12]
np.cumsum(a)
#array([4, 10, 22])
Numpy is often faster than pure python for this kind of thing; see it in comparison to @Ashwini's accumu:
In [136]: timeit list(accumu(range(1000)))
10000 loops, best of 3: 161 us per loop
In [137]: timeit list(accumu(xrange(1000)))
10000 loops, best of 3: 147 us per loop
In [138]: timeit np.cumsum(np.arange(1000))
100000 loops, best of 3: 10.1 us per loop
But of course if it's the only place you'll use numpy, it might not be worth having a dependence on it.
In Python 2 you can define your own generator function like this:
def accumu(lis):
    total = 0
    for x in lis:
        total += x
        yield total
In [4]: list(accumu([4,6,12]))
Out[4]: [4, 10, 22]
And in Python 3.2+ you can use itertools.accumulate():
In [1]: lis = [4,6,12]
In [2]: from itertools import accumulate
In [3]: list(accumulate(lis))
Out[3]: [4, 10, 22]
I did a bench-mark of the top two answers with Python 3.4 and I found itertools.accumulate is faster than numpy.cumsum under many circumstances, often much faster. However, as you can see from the comments, this may not always be the case, and it's difficult to exhaustively explore all options. (Feel free to add a comment or edit this post if you have further benchmark results of interest.)
Some timings...
For short lists accumulate is about 4 times faster:
from timeit import timeit
def sum1(l):
    from itertools import accumulate
    return list(accumulate(l))
def sum2(l):
    from numpy import cumsum
    return list(cumsum(l))
l = [1, 2, 3, 4, 5]
timeit(lambda: sum1(l), number=100000)
# 0.4243644131347537
timeit(lambda: sum2(l), number=100000)
# 1.7077815784141421
For longer lists accumulate is about 3 times faster:
l = [1, 2, 3, 4, 5]*1000
timeit(lambda: sum1(l), number=100000)
# 19.174508565105498
timeit(lambda: sum2(l), number=100000)
# 61.871223849244416
If the numpy array is not cast to list, accumulate is still about 2 times faster:
from timeit import timeit
def sum1(l):
    from itertools import accumulate
    return list(accumulate(l))
def sum2(l):
    from numpy import cumsum
    return cumsum(l)
l = [1, 2, 3, 4, 5]*1000
print(timeit(lambda: sum1(l), number=100000))
# 19.18597290944308
print(timeit(lambda: sum2(l), number=100000))
# 37.759664884768426
If you put the imports outside of the two functions and still return a numpy array, accumulate is still nearly 2 times faster:
from timeit import timeit
from itertools import accumulate
from numpy import cumsum
def sum1(l):
    return list(accumulate(l))
def sum2(l):
    return cumsum(l)
l = [1, 2, 3, 4, 5]*1000
timeit(lambda: sum1(l), number=100000)
# 19.042188624851406
timeit(lambda: sum2(l), number=100000)
# 35.17324400227517
Try the itertools.accumulate() function.
import itertools
list(itertools.accumulate([1,2,3,4,5]))
# [1, 3, 6, 10, 15]
Behold:
a = [4, 6, 12]
reduce(lambda c, x: c + [c[-1] + x], a, [0])[1:]
Will output (as expected):
[4, 10, 22]
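On Python 3, reduce is no longer a builtin, so the same one-liner needs an import; a quick sketch:
from functools import reduce
a = [4, 6, 12]
print(reduce(lambda c, x: c + [c[-1] + x], a, [0])[1:])  # [4, 10, 22]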
Assignment expressions from PEP 572 (new in Python 3.8) offer yet another way to solve this:
time_interval = [4, 6, 12]
total_time = 0
cum_time = [total_time := total_time + t for t in time_interval]
You can calculate the cumulative sum list in linear time with a simple for loop:
def csum(lst):
    s = lst.copy()
    for i in range(1, len(s)):
        s[i] += s[i-1]
    return s
time_interval = [4, 6, 12]
print(csum(time_interval)) # [4, 10, 22]
The standard library's itertools.accumulate may be a faster alternative (since it's implemented in C):
from itertools import accumulate
time_interval = [4, 6, 12]
print(list(accumulate(time_interval))) # [4, 10, 22]
Since Python 3.8 it's possible to use assignment expressions, so things like this become easier to implement:
nums = list(range(1, 10))
print(f'array: {nums}')
v = 0
cumsum = [v := v + n for n in nums]
print(f'cumsum: {cumsum}')
produces
array: [1, 2, 3, 4, 5, 6, 7, 8, 9]
cumsum: [1, 3, 6, 10, 15, 21, 28, 36, 45]
The same technique can be applied to find the cum product, mean, etc.
p = 1
cumprod = [p := p * n for n in nums]
print(f'cumprod: {cumprod}')
s = 0
c = 0
cumavg = [(s := s + n) / (c := c + 1) for n in nums]
print(f'cumavg: {cumavg}')
results in
cumprod: [1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
cumavg: [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
First, you want a running list of subsequences:
subseqs = (seq[:i] for i in range(1, len(seq)+1))
Then you just call sum on each subsequence:
sums = [sum(subseq) for subseq in subseqs]
(This isn't the most efficient way to do it, because you're adding all of the prefixes repeatedly. But that probably won't matter for most use cases, and it's easier to understand if you don't have to think of the running totals.)
If you're using Python 3.2 or newer, you can use itertools.accumulate to do it for you:
sums = itertools.accumulate(seq)
And if you're using 3.1 or earlier, you can just copy the "equivalent to" source straight out of the docs (except for changing next(it) to it.next() for 2.5 and earlier).
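For reference, a simplified sketch in the spirit of that "equivalent to" code (running sums only; the real recipe also accepts a custom function):
def accumulate(iterable):
    it = iter(iterable)
    try:
        total = next(it)
    except StopIteration:
        return
    yield total
    for element in it:
        total = total + element
        yield total
print(list(accumulate([4, 6, 12])))  # [4, 10, 22]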
If you want a pythonic way without numpy that works in 2.7, this would be my way of doing it:
l = [1,2,3,4]
_d={-1:0}
cumsum=[_d.setdefault(idx, _d[idx-1]+item) for idx,item in enumerate(l)]
now let's try it and test it against all other implementations
import functools
import timeit, sys
L = list(range(10000))
if sys.version_info >= (3, 0):
    reduce = functools.reduce
    xrange = range
def sum1(l):
    cumsum = []
    total = 0
    for v in l:
        total += v
        cumsum.append(total)
    return cumsum
def sum2(l):
    import numpy as np
    return list(np.cumsum(l))
def sum3(l):
    return [sum(l[:i+1]) for i in xrange(len(l))]
def sum4(l):
    return reduce(lambda c, x: c + [c[-1] + x], l, [0])[1:]
def this_implementation(l):
    _d = {-1: 0}
    return [_d.setdefault(idx, _d[idx-1]+item) for idx, item in enumerate(l)]
# sanity check
sum1(L)==sum2(L)==sum3(L)==sum4(L)==this_implementation(L)
>>> True
# PERFORMANCE TEST
timeit.timeit('sum1(L)','from __main__ import sum1,sum2,sum3,sum4,this_implementation,L', number=100)/100.
>>> 0.001018061637878418
timeit.timeit('sum2(L)','from __main__ import sum1,sum2,sum3,sum4,this_implementation,L', number=100)/100.
>>> 0.000829620361328125
timeit.timeit('sum3(L)','from __main__ import sum1,sum2,sum3,sum4,this_implementation,L', number=100)/100.
>>> 0.4606760001182556
timeit.timeit('sum4(L)','from __main__ import sum1,sum2,sum3,sum4,this_implementation,L', number=100)/100.
>>> 0.18932826995849608
timeit.timeit('this_implementation(L)','from __main__ import sum1,sum2,sum3,sum4,this_implementation,L', number=100)/100.
>>> 0.002348129749298096
There could be many answers for this depending on the length of the list and the performance. One very simple way I can think of, without worrying about performance, is this:
a = [1, 2, 3, 4]
a = [sum(a[0:x]) for x in range(1, len(a)+1)]
print(a)
[1, 3, 6, 10]
This uses a list comprehension and may work fairly well; it's just that I am summing over the subarray many times here. You could possibly improve on this and make it simpler!
Cheers to your endeavor!
values = [4, 6, 12]
total = 0
sums = []
for v in values:
    total = total + v
    sums.append(total)
print 'Values: ', values
print 'Sums: ', sums
Running this code gives
Values: [4, 6, 12]
Sums: [4, 10, 22]
Try this:
result = []
acc = 0
for i in time_interval:
    acc += i
    result.append(acc)
l = [1,-1,3]
cum_list = l
def sum_list(input_list):
    index = 1
    for i in input_list[1:]:
        cum_list[index] = i + input_list[index-1]
        index = index + 1
    return cum_list
print(sum_list(l))
In Python 3, to find the cumulative sum of a list where the ith element is the sum of the first i+1 elements from the original list, you may do:
a = [4, 6, 12]
b = []
for i in range(0, len(a)):
    b.append(sum(a[:i+1]))
print(b)
OR you may use list comprehension:
b = [sum(a[:x+1]) for x in range(0,len(a))]
Output
[4,10,22]
lst = [4, 6, 12]
[sum(lst[:i+1]) for i in xrange(len(lst))]
If you are looking for a more efficient solution (bigger lists?) a generator could be a good call (or just use numpy if you really care about performance).
def gen(lst):
    acu = 0
    for num in lst:
        yield num + acu
        acu += num
print list(gen([4, 6, 12]))
In [42]: a = [4, 6, 12]
In [43]: [sum(a[:i+1]) for i in xrange(len(a))]
Out[43]: [4, 10, 22]
This is slightly faster than the generator method above by @Ashwini for small lists:
In [48]: %timeit list(accumu([4,6,12]))
100000 loops, best of 3: 2.63 us per loop
In [49]: %timeit [sum(a[:i+1]) for i in xrange(len(a))]
100000 loops, best of 3: 2.46 us per loop
For larger lists, the generator is the way to go for sure. . .
In [50]: a = range(1000)
In [51]: %timeit [sum(a[:i+1]) for i in xrange(len(a))]
100 loops, best of 3: 6.04 ms per loop
In [52]: %timeit list(accumu(a))
10000 loops, best of 3: 162 us per loop
Somewhat hacky, but seems to work:
def cumulative_sum(l):
    y = [0]
    def inc(n):
        y[0] += n
        return y[0]
    return [inc(x) for x in l]
I did think that the inner function would be able to modify the y declared in the outer lexical scope, but that didn't work, so we play some nasty hacks with structure modification instead. It is probably more elegant to use a generator.
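A sketch of what that generator version might look like (my variant, not from the original answer):
def cumulative_sum_gen(l):
    # same running total, yielded lazily instead of mutating a closed-over list
    total = 0
    for x in l:
        total += x
        yield total
print(list(cumulative_sum_gen([4, 6, 12])))  # [4, 10, 22]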
Without having to use Numpy, you can loop directly over the array and accumulate the sum along the way. For example:
a=range(10)
i=1
while((i>0) & (i<10)):
    a[i]=a[i-1]+a[i]
    i=i+1
print a
Results in:
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
A pure python oneliner for cumulative sum:
cumsum = lambda X: X[:1] + cumsum([X[0]+X[1]] + X[2:]) if X[1:] else X
This is a recursive version inspired by recursive cumulative sums. Some explanations:
The first term X[:1] is a list containing the previous element and is almost the same as [X[0]] (which would complain for empty lists).
The recursive cumsum call in the second term processes the current element X[1] and the remaining list, whose length is reduced by one.
if X[1:] is shorthand for if len(X) > 1.
Test:
cumsum([4,6,12])
#[4, 10, 22]
cumsum([])
#[]
And similarly for the cumulative product:
cumprod = lambda X: X[:1] + cumprod([X[0]*X[1]] + X[2:]) if X[1:] else X
Test:
cumprod([4,6,12])
#[4, 24, 288]
Here's another fun solution. This takes advantage of the locals() dict of a comprehension, i.e. local variables generated inside the list comprehension scope:
>>> [locals().setdefault(i, (elem + locals().get(i-1, 0))) for i, elem in enumerate(time_interval)]
[4, 10, 22]
Here's what the locals() looks for each iteration:
>>> [[locals().setdefault(i, (elem + locals().get(i-1, 0))), locals().copy()][1] for i, elem in enumerate(time_interval)]
[{'.0': <enumerate at 0x21f21f7fc80>, 'i': 0, 'elem': 4, 0: 4},
{'.0': <enumerate at 0x21f21f7fc80>, 'i': 1, 'elem': 6, 0: 4, 1: 10},
{'.0': <enumerate at 0x21f21f7fc80>, 'i': 2, 'elem': 12, 0: 4, 1: 10, 2: 22}]
Performance is not terrible for small lists:
>>> %timeit list(accumulate([4, 6, 12]))
387 ns ± 7.53 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit np.cumsum([4, 6, 12])
5.31 µs ± 67.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit [locals().setdefault(i, (e + locals().get(i-1,0))) for i,e in enumerate(time_interval)]
1.57 µs ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And obviously falls flat for larger lists.
>>> l = list(range(1_000_000))
>>> %timeit list(accumulate(l))
95.1 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np.cumsum(l)
79.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np.cumsum(l).tolist()
120 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit [locals().setdefault(i, (e + locals().get(i-1, 0))) for i, e in enumerate(l)]
660 ms ± 5.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even though the method is ugly and not practical, it sure is fun.
I think the below code is the easiest:
a=[1,1,2,1,2]
b=[a[0]]+[sum(a[0:i]) for i in range(2,len(a)+1)]
def cumulative_sum(list):
    l = []
    for i in range(len(list)):
        new_l = sum(list[:i+1])
        l.append(new_l)
    return l
time_interval = [4, 6, 12]
print(cumulative_sum(time_interval))
Maybe a more beginner-friendly solution.
So you need to make a list of cumulative sums. You can do it by using for loop and .append() method
time_interval = [4, 6, 12]
cumulative_sum = []
new_sum = 0
for i in time_interval:
    new_sum += i
    cumulative_sum.append(new_sum)
print(cumulative_sum)
or, using numpy module
import numpy
time_interval = [4, 6, 12]
c_sum = numpy.cumsum(time_interval)
print(c_sum.tolist())
This would be Haskell-style:
def wrand(vtlg):
    def helpf(lalt, lneu):
        if not lalt == []:
            return helpf(lalt[1::], [lalt[0]+lneu[0]]+lneu)
        else:
            lneu.reverse()
            return lneu[1:]
    return helpf(vtlg, [0])
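For example (assuming the function above):
print(wrand([4, 6, 12]))  # [4, 10, 22]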

How to get n elements of a list not contained in another one?

I have two lists, of different size (either one can be larger than the other one), with some common elements. I would like to get n elements from the first list which are not in the second one.
I see two families of solutions (the example below is for n=3)
a = [i for i in range(2, 10)]
b = [i * 2 for i in range (1, 10)]
# [2, 3, 4, 5, 6, 7, 8, 9] [2, 4, 6, 8, 10, 12, 14, 16, 18]
# solution 1: generate the whole list, then slice
s1 = list(set(a) - set(b))
s2 = [i for i in a if i not in b]
for i in [s1, s2]:
    print(i[:3])
# solution 2: the simple loop solution
c = 0
s3 = []
for i in a:
    if i not in b:
        s3.append(i)
        c += 1
        if c == 3:
            break
print(s3)
All of them are correct; the output is
[9, 3, 5]
[3, 5, 7]
[3, 5, 7]
(the first solution does not give the first 3 ones because set does not preserve the order - but this is OK in my case as I will have unsorted (even explicitly shuffled) lists anyway)
Are there the most pythonic and reasonably optimal ones?
The solution 1 first computes the difference, then slices - which I find quite inefficient (the sizes of my lists will be ~100k elements, I will be looking for the first 100 ones).
The solution 2 looks more optimal but it is ugly (which is a matter of taste, but I learned that when something looks ugly in Python, it means that there are usually more pythonic solution).
I will settle for solution 2 if there are no better alternatives.
I would use set.difference and slice:
print(list(set(a).difference(b))[:3])
[3, 5, 7]
set.difference already gives you elements in a that are not in b:
set([3, 5, 7, 9])
So you just need a slice of that.
Or, instead of calling list on the set, use iter, next and a comprehension:
diff = iter(set(a).difference(b))
n = 3
sli = [next(diff) for _ in range(n)]
print(sli)
.difference does not create a second set so it is a more efficient solution:
In [1]: a = [i for i in range(2, 10000000)]
In [2]: b = [i * 2 for i in range (1, 10000000)]
In [3]: timeit set(a).difference(b)
1 loops, best of 3: 848 ms per loop
In [4]: timeit set(a)- set(b)
1 loops, best of 3: 1.54 s per loop
For the large lists above s2 = [i for i in a if i not in b] would give you enough time to cook a meal before it finished.
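For completeness, a quick sketch (my addition, the same idea the m4 function below uses): turn b into a set once, so each membership test is O(1) on average and the comprehension becomes usable on ~100k-element lists:
bset = set(b)
s2 = [i for i in a if i not in bset]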
Using iter and .difference:
In [11]: %%timeit
diff = iter(set(a).difference(b))
n = 3
sli = [next(diff) for _ in range(n)]
....:
1 loops, best of 3: 797 ms per loop
It might be marginally faster to avoid constructing the full difference if you only need 100, but by how much is going to depend on your dataset.
import random
from itertools import islice
def m1(a, b):
    return list(set(a) - set(b))[:100]
def m2(a, b):
    return list(set(a).difference(b))[:100]
def m3(a, b):
    return list(islice(set(a).difference(b), 100))
def m4(a, b):
    bset = set(b)
    return list(islice((x for x in a if x not in bset), 100))
gives me
>>> a = [random.randint(0, 10**6) for i in range(10**5)]
>>> b = [random.randint(0, 10**6) for i in range(10**5)]
>>> %timeit m1(a,b)
10 loops, best of 3: 121 ms per loop
>>> %timeit m2(a,b)
10 loops, best of 3: 98.7 ms per loop
>>> %timeit m3(a,b)
10 loops, best of 3: 82.3 ms per loop
>>> %timeit m4(a,b)
10 loops, best of 3: 42.8 ms per loop
>>>
>>> a = list(range(10**5))
>>> b = [i*2 for i in a]
>>> %timeit m1(a,b)
10 loops, best of 3: 58.7 ms per loop
>>> %timeit m2(a,b)
10 loops, best of 3: 50.8 ms per loop
>>> %timeit m3(a,b)
10 loops, best of 3: 40.7 ms per loop
>>> %timeit m4(a,b)
10 loops, best of 3: 21.7 ms per loop
With a little more work you could even avoid needing to make the full bset. If you're very likely to find 100 missing if you only look at the first 10^4 or so of the list, for example, it might be worth trying that first. But I'd be surprised if this turned out to be a bottleneck in your code, and so it's probably not worth worrying about.
You could turn b into a set but not a. Set up a generator to exploit laziness, then use a comprehension to get the items you want:
a = [i for i in range(2, 10)]
b = [i * 2 for i in range (1, 10)]
bset = set(b)
agen = (i for i in a if i not in bset)
first3 = [j for (i,j) in enumerate(agen) if i < 3]
print(first3)
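One possible tweak (my note, not part of the original answer): the enumerate-based comprehension still walks all of agen, while itertools.islice stops pulling items as soon as three have been found:
from itertools import islice
agen = (i for i in a if i not in bset)  # fresh generator
first3 = list(islice(agen, 3))
print(first3)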

Efficient way to create an array that is a sequence of variable length ranges in numpy

Suppose I have an array
import numpy as np
x=np.array([5,7,2])
I want to create an array that contains a sequence of ranges stacked together with the
length of each range given by x:
y=np.hstack([np.arange(1,n+1) for n in x])
Is there some way to do this without the speed penalty of a list comprehension or looping. (x could be a very large array)
The result should be
y == np.array([1,2,3,4,5,1,2,3,4,5,6,7,1,2])
You could use accumulation:
import numpy as np

def my_sequences(x):
    x = x[x != 0]  # you can skip this if you do not have 0s in x.
    # Create result array, filled with ones:
    y = np.cumsum(x, dtype=np.intp)
    a = np.ones(y[-1], dtype=np.intp)
    # Set all beginnings to - previous length:
    a[y[:-1]] -= x[:-1]
    # and just add it all up (btw. np.add.accumulate is equivalent):
    return np.cumsum(a, out=a)  # here, in-place should be safe.
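A quick check against the example from the question (assuming the function above):
x = np.array([5, 7, 2])
print(my_sequences(x))  # [1 2 3 4 5 1 2 3 4 5 6 7 1 2]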
(One word of caution: if your result array would be larger than np.iinfo(np.intp).max, this might with some bad luck return wrong results instead of erroring out cleanly...)
And because everyone always wants timings (compared to @Ophion's method):
In [11]: x = np.random.randint(0, 20, 1000000)
In [12]: %timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
1 loops, best of 3: 753 ms per loop
In [13]: %timeit my_sequences(x)
1 loops, best of 3: 191 ms per loop
Of course, the my_sequences function will not perform badly when the values of x get large.
First idea: prevent multiple calls to np.arange, and concatenate should be much faster than hstack:
import numpy as np
x=np.array([5,7,2])
>>> a=np.arange(1,x.max()+1)
>>> np.hstack([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
>>> np.concatenate([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
If there are many nonunique values this seems more efficient:
>>> ua,uind=np.unique(x,return_inverse=True)
>>> a=[np.arange(1,k+1) for k in ua]
>>> np.concatenate(np.take(a,uind))
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])
Some timings for your case:
x=np.random.randint(0,20,1000000)
Original code
#Using hstack
%timeit np.hstack([np.arange(1,n+1) for n in x])
1 loops, best of 3: 7.46 s per loop
#Using concatenate
%timeit np.concatenate([np.arange(1,n+1) for n in x])
1 loops, best of 3: 5.27 s per loop
First code:
#Using hstack
%timeit a=np.arange(1,x.max()+1);np.hstack([a[:k] for k in x])
1 loops, best of 3: 3.03 s per loop
#Using concatenate
%timeit a=np.arange(1,x.max()+1);np.concatenate([a[:k] for k in x])
10 loops, best of 3: 998 ms per loop
Second code:
%timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
10 loops, best of 3: 522 ms per loop
Looks like we gain a 14x speedup with the final code.
Small sanity check:
ua,uind=np.unique(x,return_inverse=True)
a=[np.arange(1,k+1) for k in ua]
out=np.concatenate(np.take(a,uind))
>>> out.shape
(9498409,)
>>> np.sum(x)
9498409
