I have a series of 1D arrays of different lengths greater than 1.
I would like to find the numbers that appear together in more than one array in s, and in how many arrays they appear together.
import numpy as np
import pandas as pd
a=np.array([1,2,3])
b=np.array([])
c=np.array([2,3,4,5,6])
d=np.array([2,3,4,5,6,9,15])
e=np.array([5,6])
s=pd.Series([a,b,c,d,e])
In this example the desired outcome would be something like
{(2,3): 3, (5,6): 3, (2,3,4,5,6): 2}
The expected result does not need to be a dictionary but any structure that contains this information.
Also I would have to do that for >200 series like s, so performance matters to me as well.
I have tried
result=s.value_counts()
but I can't figure out how to proceed.
I think what you are missing here is a way to build the combinations of numbers present in each array, so that you can then count how many times each combination appears. To do that you can use the built-in itertools module:
from itertools import combinations
import numpy as np
a = np.array([1,2,3])
for c in combinations(a, 2):
    print(c)
>>> (1, 2)
>>> (1, 3)
>>> (2, 3)
So using this, you can then build a series for each length and check how many times each combination of length 2 happens, how many times each combination of length 3 happens and so on.
import numpy as np
import pandas as pd
a=np.array([1,2,3])
b=np.array([])
c=np.array([2,3,4,5,6])
d=np.array([2,3,4,5,6,9,15])
e=np.array([5,6])
all_arrays = a, b, c, d, e
maxsize = max(array.size for array in all_arrays)
for length in range(2, maxsize+1):
    length_N_combs = pd.Series(
        x for array in all_arrays if array.size >= length
        for x in combinations(array, length)
    )
    counts = length_N_combs.value_counts()
    print(counts[counts > 1])
From here you can format the output however you like. Note that you have to exclude arrays that are too short. I'm using a generator expression for a slight gain in efficiency, but note that this algorithm will not be cheap in any case: it requires a lot of comparisons. A generator expression is a way to condense a generator function into a one-liner (and much more than that). In this case, the nested expression above is roughly equivalent to defining a generator function that yields from the generators that combinations returns, and calling that function to build the pandas Series. Something like this will give you the same result:
def length_N_combs_generator(arrays, length):
    for array in arrays:
        if array.size >= length:
            yield from combinations(array, length)

for length in range(2, maxsize+1):
    s = pd.Series(length_N_combs_generator(all_arrays, length))
    counts = s.value_counts()
    print(counts[counts > 1])
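If you then want to collect everything into a single structure like the dictionary in the question, here is one possible sketch building on the definitions above (note that it also reports every shorter sub-combination that co-occurs, such as (2, 4), not only the maximal ones):
result = {}
for length in range(2, maxsize + 1):
    combs = pd.Series(length_N_combs_generator(all_arrays, length))
    counts = combs.value_counts()
    # keep only the combinations that appear together in more than one array
    result.update(counts[counts > 1].to_dict())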
You can use set operations:
from itertools import combinations
from collections import Counter
s2 = s.apply(frozenset).sort_values(key=lambda x: x.str.len(), ascending=False)
# count every non-empty pairwise intersection
# (the walrus operator := requires Python 3.8+)
c = Counter(x for a, b in combinations(s2, 2) if len(x := a & b))
# increment all values by 1: a set counted from a single pair of arrays
# already appears together in 2 arrays
for k in c:
    c[k] += 1
dict(c)
Output:
{frozenset({2, 3, 4, 5, 6}): 2, frozenset({2, 3}): 3, frozenset({5, 6}): 3}
Given a tuple T whose integers are all distinct, I want to get all the tuples that result from dropping individual integers from T. I came up with the following code:
def drop(T):
    S = set(T)
    for i in S:
        yield tuple(S.difference({i}))

for t in drop((1,2,3)):
    print(t)
# (2, 3)
# (1, 3)
# (1, 2)
I'm not unhappy with this, but I wonder if there is a better/faster way, because with large tuples difference() needs to look the item up in the set, even though I already know that I'll be removing items sequentially. However, this code is only about 2x faster:
def drop(T):
    for i in range(len(T)):
        yield T[:i] + T[i+1:]
and in any case, neither scales linearly with the size of T.
Instead of looking at it as "remove one item each time", you can look at it as "use all but one", and then with itertools it becomes straightforward:
from itertools import combinations
T = (1, 2, 3, 4)
for t in combinations(T, len(T)-1):
    print(t)
Which gives:
(1, 2, 3)
(1, 2, 4)
(1, 3, 4)
(2, 3, 4)
* Assuming the order doesn't really matter
From your description, you're looking for combinations of the elements of T. With itertools.combinations, you can ask for all r-length tuples, in sorted order, without repeated elements. For example:
import itertools
T = [1,2,3]
for i in itertools.combinations(T, len(T) - 1):
    print(i)
I am trying to use itertools.product to manage the bookkeeping of some nested for loops, where the number of nested loops is not known in advance. Below is a specific example where I have chosen two nested for loops; the choice of two is only for clarity, and what I need is a solution that works for an arbitrary number of loops.
This question provides an extension/generalization of the question appearing here:
Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array
Now I am extending the above technique using an itertools trick I learned here:
Iterating over an unknown number of nested loops in python
Preamble:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [idx1, idx2]
func_table = []
for items in product(*joint):
    f = trivial_functional(*items)
    func_table.append(f)
At the end of the above itertools loop I have func_table, a flat list of 12 functions, each element having been built by trivial_functional.
Question:
Suppose I am given a pair of integers, (i_1, i_2), where these integers are to be interpreted as the indices of idx1 and idx2, respectively. How can I use itertools.product to determine the correct corresponding element of the func_table array?
I know how to hack the answer by writing my own function that mimics the itertools.product bookkeeping, but surely there is a built-in feature of itertools.product that is intended for exactly this purpose?
I don't know of a way of calculating the flat index other than doing it yourself. Fortunately this isn't that difficult:
def product_flat_index(factors, indices):
    if len(factors) == 1:
        return indices[0]
    # the first index strides over the combined size of all remaining factors
    stride = 1
    for factor in factors[1:]:
        stride *= len(factor)
    return indices[0] * stride + product_flat_index(factors[1:], indices[1:])

>>> product_flat_index(joint, (2, 1))
7
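As a quick sanity check that this matches the order in which product generates the tuples (using joint, idx1 and idx2 from the question):
flat = list(product(*joint))
i = product_flat_index(joint, (2, 1))
assert flat[i] == (idx1[2], idx2[1])  # position 7 holds (3, 6)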
An alternative approach is to store the results in a nested array in the first place, making translation unnecessary, though this is more complex:
from functools import reduce
from operator import getitem, setitem, itemgetter

def get_items(container, indices):
    return reduce(getitem, indices, container)

def set_items(container, indices, value):
    c = reduce(getitem, indices[:-1], container)
    setitem(c, indices[-1], value)

def initialize_table(lengths):
    if len(lengths) == 1:
        return [0] * lengths[0]
    # build a fresh subtable for every row so that no sublists are shared
    return [initialize_table(lengths[1:]) for _ in range(lengths[0])]

func_table = initialize_table(list(map(len, joint)))
for items in product(*map(enumerate, joint)):
    f = trivial_functional(*map(itemgetter(1), items))
    set_items(func_table, list(map(itemgetter(0), items)), f)
>>> get_items(func_table, (2, 1)) # same as func_table[2][1]
<function>
Numerous answers were quite useful; thanks to everyone for the solutions.
It turns out that if I recast the problem slightly with Numpy, I can accomplish the same bookkeeping, and solve the problem I was trying to solve with vastly improved speed relative to pure python solutions. The trick is just to use Numpy's reshape method together with the normal multi-dimensional array indexing syntax.
Here's how this works. We just convert func_table into a Numpy array, and reshape it:
import numpy as np

func_table = np.array(func_table)
component_dimensions = [len(idx1), len(idx2)]
func_table = func_table.reshape(component_dimensions)
Now func_table can be used to return the correct function not just for a single 2d point, but for a full array of 2d points:
dim1_pts = [3,1,2,1,3,3,1,3,0]
dim2_pts = [0,1,2,1,2,0,1,2,1]
func_array = func_table[dim1_pts, dim2_pts]
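Each entry of func_array is still an ordinary Python callable (the reshaped array has dtype=object), so the selected functions can be evaluated directly. For instance, the first point (3, 0) selects the function built from (idx1[3], idx2[0]) = (4, 5):
print(func_array[0](10.0))  # (4 + 5) * 10.0 = 90.0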
As usual, Numpy to the rescue!
This is a little messy, but here you go:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [enumerate(idx1), enumerate(idx2)]
func_map = {}
for indexes, items in map(lambda x: zip(*x), product(*joint)):
    f = trivial_functional(*items)
    func_map[indexes] = f
print(func_map[(2, 0)](5)) # 40 = (3+5)*5
I'd suggest using enumerate() in the right place:
from itertools import product
def trivial_functional(i, j): return lambda x : (i+j)*x
idx1 = [1, 2, 3, 4]
idx2 = [5, 6, 7]
joint = [idx1, idx2]
func_table = []
for items in product(*joint):
    f = trivial_functional(*items)
    func_table.append(f)
From what I understood from your comments and your code, func_table is simply indexed by the position at which each input tuple occurs in the product sequence. You can access it again using:
for index, items in enumerate(product(*joint)):
    # because of the append(), index is now the position of the
    # function created from the respective tuple of product(*joint)
    func_table[index](some_value)
I need to generate a lot of random numbers. I've tried using random.random, but this function is quite slow, so I switched to numpy.random.random, which is way faster! So far so good. The generated random numbers are actually used to calculate something (based on each number), so I enumerate over each number and replace its value. This seems to kill all my previously gained speedup. Here are the stats generated with timeit():
test_random - no enumerate
0.133111953735
test_np_random - no enumerate
0.0177130699158
test_random - enumerate
0.269361019135
test_np_random - enumerate
1.22525310516
As you can see, generating the numbers is almost 10 times faster using numpy, but once I enumerate over them, the numpy version ends up even slower than the pure Python one.
Below is the code that I'm using:
import numpy as np
import timeit
import random
NBR_TIMES = 10
NBR_ELEMENTS = 100000
def test_random(do_enumerate=False):
    y = [random.random() for i in range(NBR_ELEMENTS)]
    if do_enumerate:
        for index, item in enumerate(y):
            # overwrite the y value; in reality this will be some function of 'item'
            y[index] = 1 + item

def test_np_random(do_enumerate=False):
    y = np.random.random(NBR_ELEMENTS)
    if do_enumerate:
        for index, item in enumerate(y):
            # overwrite the y value; in reality this will be some function of 'item'
            y[index] = 1 + item
if __name__ == '__main__':
    from timeit import Timer

    t = Timer("test_random()", "from __main__ import test_random")
    print "test_random - no enumerate"
    print t.timeit(NBR_TIMES)

    t = Timer("test_np_random()", "from __main__ import test_np_random")
    print "test_np_random - no enumerate"
    print t.timeit(NBR_TIMES)

    t = Timer("test_random(True)", "from __main__ import test_random")
    print "test_random - enumerate"
    print t.timeit(NBR_TIMES)

    t = Timer("test_np_random(True)", "from __main__ import test_np_random")
    print "test_np_random - enumerate"
    print t.timeit(NBR_TIMES)
What's the best way to speed this up and why does enumerate slow things down so dramatically?
EDIT: the reason I use enumerate is that I need both the index and the value of the current element.
To take full advantage of numpy's speed, you want to create ufuncs whenever possible. Applying vectorize to a function as mgibsonbr suggests is one way to do that, but a better way, if possible, is simply to construct a function that takes advantage of numpy's built-in ufuncs. So something like this:
>>> import numpy
>>> a = numpy.random.random(10)
>>> a + 1
array([ 1.29738145, 1.33004628, 1.45825441, 1.46171177, 1.56863326,
1.58502855, 1.06693054, 1.93304272, 1.66056379, 1.91418473])
>>> (a + 1) * 0.25 / 4
array([ 0.08108634, 0.08312789, 0.0911409 , 0.09135699, 0.09803958,
0.09906428, 0.06668316, 0.12081517, 0.10378524, 0.11963655])
What is the nature of the function you want to apply across the numpy array? If you tell us, perhaps we can help you come up with a version that uses only numpy ufuncs.
It's also possible to generate an array of indices without using enumerate. Numpy provides ndenumerate, an iterator over (index, value) pairs, which is probably slower, but it also provides indices, which is a very quick way to generate the indices corresponding to the values in an array. So...
>>> numpy.indices(a.shape)
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
So to be more explicit, you can use the above and combine them using numpy.rec.fromarrays:
>>> a = numpy.random.random(10)
>>> ind = numpy.indices(a.shape)
>>> numpy.rec.fromarrays([ind[0], a])
rec.array([(0, 0.092473494150913438), (1, 0.20853257641948986),
(2, 0.35141455604686067), (3, 0.12212258656960817),
(4, 0.50986868372639049), (5, 0.0011439325711705139),
(6, 0.50412473457942508), (7, 0.28973489788728601),
(8, 0.20078799423168536), (9, 0.34527678271856999)],
dtype=[('f0', '<i8'), ('f1', '<f8')])
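For comparison, the ndenumerate iterator mentioned above yields the same pairing directly, as index tuples with values. A sketch against the same a as the record array (output abbreviated in the comment):
>>> pairs = list(numpy.ndenumerate(a))  # [((0,), 0.0924...), ((1,), 0.2085...), ...]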
It's starting to sound like your main concern is performing the operation in-place. That's harder to do using vectorize but it's easy with the ufunc approach:
>>> def somefunc(a):
... a += 1
... a /= 15
...
>>> a = numpy.random.random(10)
>>> b = a
>>> somefunc(a)
>>> a
array([ 0.07158446, 0.07052393, 0.07276768, 0.09813235, 0.09429439,
0.08561703, 0.11204622, 0.10773558, 0.11878885, 0.10969279])
>>> b
array([ 0.07158446, 0.07052393, 0.07276768, 0.09813235, 0.09429439,
0.08561703, 0.11204622, 0.10773558, 0.11878885, 0.10969279])
As you can see, numpy performs these operations in-place.
Check numpy.vectorize, it should let you apply arbitrary functions to numpy arrays. For your simple example, you'd do something like this:
from numpy import vectorize

vecFunc = vectorize(lambda x: x + 1)
vecFunc(y)
However, that will create a new numpy array instead of modifying it in-place (which may or may not be a problem in your particular case).
In general, you'll always be better off manipulating numpy structures with numpy functions than iterating with Python functions, since the former are not only optimized but implemented in C, while the latter are always interpreted.
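As a rough illustration of that difference, here is a minimal timeit sketch (absolute numbers will vary by machine):
import timeit
setup = "import numpy as np; y = np.random.random(100000)"
# element-wise Python loop vs. a single vectorized ufunc expression
print(timeit.timeit("for i in range(len(y)): y[i] = 1 + y[i]", setup, number=10))
print(timeit.timeit("y = 1 + y", setup, number=10))  # typically orders of magnitude faster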
I know you can create easily nested lists in python like this:
[[1,2],[3,4]]
But how do I create a 3x3x3 matrix of zeroes?
[[[0] * 3 for i in range(3)] for j in range(3)]
or
[[[0]*3]*3]*3
Doesn't seem right. Is there no way to create it just by passing a list of dimensions to a method? E.g.:
CreateArray([3,3,3])
In case a matrix is actually what you are looking for, consider the numpy package.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html#numpy.zeros
This will give you a 3x3x3 array of zeros:
numpy.zeros((3,3,3))
You also benefit from the convenience features of a module built for scientific computing.
List comprehensions are just syntactic sugar for adding expressiveness to list initialization; in your case, I would not use them at all, and would go for a simple nested loop instead, as sketched below.
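For example, a minimal sketch of that nested-loop approach (the helper name is just for illustration; every sublist is created fresh, so nothing is aliased):
def create_zeros_3d(nx, ny, nz):
    result = []
    for i in range(nx):
        plane = []
        for j in range(ny):
            plane.append([0] * nz)  # a new innermost row each time
        result.append(plane)
    return result

matrix = create_zeros_3d(3, 3, 3)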
On a completely different level: do you think the n-dimensional array of NumPy could be a better approach?
Although you can use lists to implement multi-dimensional matrices, I think they are not the best tool for that goal.
NumPy addresses this problem
>>> from numpy import array
>>> a = array([2, 3, 4])
>>> a
array([2, 3, 4])
>>> type(a)
<type 'numpy.ndarray'>
But if you want to use native Python lists as a matrix, the following helper methods can come in handy:
import copy

def Create(dimensions, item):
    for dimension in dimensions:
        # deep-copy each cell so that no sublists are shared between cells
        item = [copy.deepcopy(item) for _ in range(dimension)]
    return item
def Get(matrix, position):
    for index in position:
        matrix = matrix[index]
    return matrix

def Set(matrix, position, value):
    for index in position[:-1]:
        matrix = matrix[index]
    matrix[position[-1]] = value
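A quick usage sketch of these helpers:
matrix = Create([3, 3, 3], 0)   # 3x3x3 matrix of zeros
Set(matrix, (0, 1, 2), 7)
print(Get(matrix, (0, 1, 2)))   # 7
print(Get(matrix, (1, 1, 2)))   # 0, cells are independent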
Or use the nest function defined here, combined with repeat(0) from the itertools module:
nest(itertools.repeat(0),[3,3,3])
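The nest function itself is only available behind that link; a hypothetical reconstruction consistent with the call above (an assumption, not the linked definition) could look like this:
from itertools import repeat

def nest(iterable, dimensions):
    # hypothetical sketch: pull one value per cell from the (possibly infinite) iterable
    if len(dimensions) == 1:
        return [next(iterable) for _ in range(dimensions[0])]
    return [nest(iterable, dimensions[1:]) for _ in range(dimensions[0])]

matrix = nest(repeat(0), [3, 3, 3])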
Just nest the multiplication syntax:
[[[0] * 3] * 3] * 3
It's therefore simple to express this operation using folds:
from functools import reduce

def zeros(dimensions):
    return reduce(lambda x, d: [x] * d, [0] + dimensions)
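Note that this fold replicates references: every level reuses the same sublist object. A quick demonstration:
m = zeros([3, 3, 3])
m[0][0][0] = 1
print(m[1][0][0])  # also 1, because every plane shares the same sublists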
Or, if you want to avoid reference replication, so that altering one item won't affect any other, you should instead use copies (deep copies, so that sublists are distinct at every nesting level):
import copy

def zeros(dimensions):
    item = 0
    for dimension in dimensions:
        # deep-copy each cell so that no sublist is shared at any level
        item = [copy.deepcopy(item) for _ in range(dimension)]
    return item