I would like to pick n random elements from set1 that are not present in set2, if n such elements exist. If there are not enough, only the differing elements that do exist should be returned, or an empty set in the worst case.
Example 1:
input: n=2, set1={0,1,2,3,4,5,6,7}, set2={0,2,4,6}
example possible output: {1,5} (other possible outputs: {1,3}, {1,7}, {3,5}, {3,7}, {5,7})
Example 2:
input: n=5, set1={0,1,2,3,4,5,6,7}, set2={0,2,4,6}
single possible output: {1,3,5,7} since there are only 4 choices and number of elements to choose is 5
Create a set of unique values and return either n random elements from it, or if n is larger than the population return all elements:
import random

def random_unique(x, y, n):
    '''Return n random elements from set x that are not found in set y.'''
    unique = x - y
    # random.sample requires a sequence in Python 3.11+, so convert the set first
    return set(random.sample(list(unique), min(n, len(unique))))
In action:
x = {0, 1, 2, 3, 4, 5, 6, 7}
y = {0, 2, 4, 6}
random_unique(x, y, 2)
{3, 5}
random_unique(x, y, 10)
{1, 3, 5, 7}
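And the worst case from the question, where every element of x also appears in y, yields an empty set:
random_unique(y, y, 3)
set()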
In two lines (we could make it one, but it's a little ugly), using sets and random.sample:
diff = set(list1).difference(list2)
random.sample(list(diff), min(len(diff), n))  # list(), since sample rejects sets in Python 3.11+
Original solution (preserving duplicates in list1, as well as order, which doesn't really matter for random samples):
diff = [x for x in list1 if x not in set(list2)]
random.sample(diff, min(len(diff), n))
If you don't care about preserving duplicates, then set difference is indeed the way to go. Checking the timing of the implementations with the following:
import numpy as np

list1 = np.arange(10000)
list2 = np.random.randint(0, 10000, 1000)
we get:
set difference: 1.15 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
list comprehension: 1.13 s ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
list comprehension with set pre-defined: 1.47 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's a factor of 1000 faster for sets! There is not much difference between set.difference and the list comprehension once the set has already been built, but the difference that remains appears to be significant (those standard deviations are tiny!).
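For reference, the third timed variant ("list comprehension with set pre-defined") just hoists the set construction out of the loop; a sketch, with s2 as an assumed name:
s2 = set(list2)   # build the lookup set once, not once per element
diff = [x for x in list1 if x not in s2]
random.sample(diff, min(len(diff), n))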
Related
I am wondering if there is an efficient way to compare rows in a matrix and count the number of equal elements in the rows. Say I have a matrix:
[['food', 'food', 'food'],
['food', 'food', 'drink'],
['food', 'food', 'drink']]
I would like to compare the first row with the second row, the first row with the third row, and the second row with the third row. There is no need to compare two rows two times and I don't want to compare a row with itself. I'd like to return a list or array that is as long as the number of comparisons (or similar) and that contains the number of equal elements for each comparison. In this case, I'd get: [2, 2, 3].
I've tried looping through the matrix as follows:
comparisons = [sum(matrix[i]==matrix[j]) for i in range(len(matrix)) for j in range(len(matrix)) if i < j]
I'm worried this solution will be too slow if the size of the matrix grows. Is there a more efficient solution by using e.g. NumPy?
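Note that the comprehension above (and the answer below) assumes matrix is a NumPy array, so that matrix[i] == matrix[j] compares element-wise; a minimal setup would be:
import numpy as np

# the example matrix as a NumPy array of strings
matrix = np.array([
    ['food', 'food', 'food'],
    ['food', 'food', 'drink'],
    ['food', 'food', 'drink'],
])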
By using itertools.chain.from_iterable:
>>> from itertools import chain
>>> list(chain.from_iterable(
...     (matrix[i+1:] == row).sum(1) for i, row in enumerate(matrix[:-1])
... ))
[2, 2, 3]
Timing:
# Method 1 [from the question]
>>> %timeit [sum(matrix[i]==matrix[j]) for i in range(len(matrix)) for j in range(len(matrix)) if i < j]
25.6 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Method 2
>>> %timeit list(chain.from_iterable((matrix[i+1:] == row).sum(1) for i, row in enumerate(matrix[:-1])))
11.8 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I have a dictionary with a very large number of keys (~300k and growing); its values are sets that also contain a large number of items (~20k).
dictionary = {
1: {1, 2, 3},
2: {3, 4},
3: {5, 6},
4: {1, 5, 12, 13},
5: set()
}
What I want to achieve is create two arrays:
keys = [1 1 1 2 2 3 3 4 4 4 4]
items = [1 2 3 3 4 5 6 1 5 12 13]
These basically represent a mapping of each item in each set to its corresponding key.
I tried using numpy for this job, but it still takes a very long time and I want to know if it can be optimized.
numpy code (wrapped in a function with an arbitrary name, so the return statement is valid):
import numpy as np

def to_arrays(dictionary):
    keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))
    items = np.concatenate(list(map(lambda x: list(x), dictionary.values())))
    keys = np.array(keys, dtype=np.uint32)
    items = np.array(items, dtype=np.uint16)
    return keys, items
The second part is an attempt to reduce the memory footprint of those variables by specifying their data types. But I know the arrays will still default to 64-bit values in the first two operations (before the dtype change is applied), so the memory will get allocated and I might run out of RAM.
I'm not sure it will perform much better, but a straightforward way to do it is:
import numpy as np
keys = np.array(list(dictionary.keys()), dtype=np.uint32).repeat([len(s) for s in dictionary.values()])
values = np.concatenate([np.array(list(s), np.uint16) for s in dictionary.values()])
display(keys)     # display() is a Jupyter/IPython helper; use print() outside a notebook
display(values)
For this small sample, a pure list version is considerably faster than the numpy one:
In [14]: timeit list(itertools.chain.from_iterable([[item[0]]*len(item[1]) for item in dictionary.items()]))
2.71 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: timeit np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))
52.2 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and
In [24]: list(itertools.chain.from_iterable(dictionary.values()))
Out[24]: [1, 2, 3, 3, 4, 5, 6, 1, 13, 12, 5]
In [25]: timeit list(itertools.chain.from_iterable(dictionary.values()))
876 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [26]: timeit np.concatenate(list(map(lambda x: list(x), dictionary.values())))
13.8 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And Paul Panzer's version:
In [41]: timeit np.fromiter(itertools.chain.from_iterable(dictionary.values()),'int32')
3.69 µs ± 9.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It's probably better to use np.fromiter here. It is certainly easier on the memory as it avoids creating all those temporaries.
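Applied to the sample dictionary above, a minimal sketch looks like this (sizes are computed first so fromiter can preallocate; it mirrors the pp function in the benchmark below):
import itertools as it
import numpy as np

# element counts per set, computed up front for preallocation
sizes = np.fromiter(map(len, dictionary.values()), int, len(dictionary))
keys = np.fromiter(dictionary.keys(), np.uint32, len(dictionary)).repeat(sizes)
items = np.fromiter(it.chain.from_iterable(dictionary.values()), np.uint16, keys.size)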
Timings:
import numpy as np
import itertools as it
from simple_benchmark import BenchmarkBuilder
B = BenchmarkBuilder()

@B.add_function()
def pp(a):
    szs = np.fromiter(map(len, a.values()), int, len(a))
    ks = np.fromiter(a.keys(), np.uint32, len(a)).repeat(szs)
    vls = np.fromiter(it.chain.from_iterable(a.values()), np.uint16, ks.size)
    return ks, vls

@B.add_function()
def OP(a):
    keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), a.items())))
    items = np.concatenate(list(map(list, a.values())))
    return keys, items

@B.add_function()
def DevKhadka(a):
    keys = np.array(list(a.keys()), dtype=np.uint32).repeat([len(s) for s in a.values()])
    values = np.concatenate([np.array(list(s), np.uint16) for s in a.values()])
    return keys, values

@B.add_function()
def hpaulj(a):
    ks = list(it.chain.from_iterable([[item[0]]*len(item[1]) for item in a.items()]))
    vls = list(it.chain.from_iterable(a.values()))
    return ks, vls

@B.add_arguments('total no items')
def argument_provider():
    for exp in range(1, 12):
        sz = 2**exp
        a = {j: set(np.random.randint(1, 2**16, np.random.randint(1, sz)).tolist())
             for j in range(1, 10*sz)}
        yield sum(map(len, a.values())), a

r = B.run()
r.plot()

import pylab
pylab.savefig('dct2np.png')
I want to know whether I can hard-code extra values for a for loop to hit, in addition to its regular increment.
Currently, I am iterating this way:
Eg:
for i in range(0, 10, 2):
print(i)
The output will be 0, 2, 4, 6, 8.
If I also want the values 5 and 7 along with the increments of 2, how can I do that?
Eg:
for i in range(0, 10, 2, 4, 5, 6, 7, 8):
If I understand your question correctly, this code is for you:
Using a generator expression:
forced_values = [5, 7]
list(i for i in range(0, 10) if i % 2 == 0 or i in forced_values)
# output: [0, 2, 4, 5, 6, 7, 8]
or equivalently:
sorted(list(range(0, 10, 2)) + forced_values)
Comparison of execution times:
Benchmark:
n = 10000000 # size of range values
m = 10000000 # size of forced_value list
1. Solution with generators comprehension:
%%timeit
list(i for i in range(0, n) if i % 2 == 0 or i in range(0, m))
# 3.47 s ± 265 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2. Solution with sorting:
%%timeit
sorted(list(range(0, n, 2)) + list(range(0, m)))
# 1.59 s ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
or with an unordered list, if order doesn't matter:
%%timeit
list(range(0, n, 2)) + list(range(0, m))
# 1.03 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3. Solution proposed by @blhsing with the more_itertools package, specifically the collate function (since deprecated in favor of heapq.merge):
%%timeit
l = []
for i in collate(range(0, n, 2), range(0, m)):
l.append(i)
# 6.89 s ± 886 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The best solution, even on very large lists, seems to be the second one, which is 2 to 4 times faster than the other proposed solutions.
To clarify, here are some relevant comments:
if I need to parse 1 through 1000 with exceptions of increments in between, is there a way other than specifying indexes?
Ex: for i in range(1, 1000, 20); I need i values of 178, 235, 650 in between. Can I do that in a for loop?
The technical answer is: yes and no. Yes, because of course you can do it in a for loop. No, because there is no way around specifying the exceptions. (Otherwise they wouldn't be exceptions, would they?)
You still use a for loop, because Python's for loop is not really about indices or ranges. It's about iterating over arbitrary objects. It so happens that the simple numeric for loop that many other languages have is most directly translated into Python as a loop over a range. But really, a Python for loop is simply of the form
for x in y:
# do stuff here
And the loop iterates over y, no matter what y is, as long as it's iterable, with x taking the value of one element of y on each iteration. That's it.
But, what you seem to be really after is a way to loop over a bunch of numbers that mostly follow a simple pattern. I would probably do it like this:
values = list(range(1, 1000, 20)) + [178, 235, 650]
for i in sorted(values):
print(i)
Or, if you don't mind a longer line:
for i in sorted(list(range(1, 1000, 20)) + [178, 235, 650]):
print(i)
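If you'd rather not materialize and sort the whole list up front, heapq.merge from the standard library lazily merges already-sorted iterables (essentially what more_itertools.collate wrapped); a minimal sketch:
import heapq

# both inputs are already sorted, so merge yields values in order, lazily
for i in heapq.merge(range(1, 1000, 20), [178, 235, 650]):
    print(i)
Note, though, that the benchmark earlier suggests the sorted-list approach is faster in practice; lazy merging mainly helps when the inputs are very large or unbounded.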
You can give the for loop a list:
for i in [2, 4, 5, 6, 7, 8]:
You could also create a custom iterator if it's following an algorithm. That's described in another answer here: Build a Basic Python Iterator.
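As a sketch of that idea, here is a small, hypothetical generator that hard-codes the exceptions from the question:
def custom_values():
    # regular steps of 2, with the hard-coded extras 5 and 7 woven in
    for i in range(0, 10, 2):
        yield i
        if i == 4:
            yield 5
        if i == 6:
            yield 7

print(list(custom_values()))   # [0, 2, 4, 5, 6, 7, 8]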
I have a numpy array, and I need to get (without changing the original) the same array but with the last item moved to the front. Since I use this a lot, I am looking for a clean way of getting it.
So, for example, if my original array is [1,2,3,4], I would like to get [4,1,2,3] without modifying the original array.
I found one solution:
x = [1,2,3,4]
a = np.append(x[-1], x[:-1])
However, I am looking for a more pythonic way. Basically something like this:
x = [1,2,3,4]
a = x[(:1,0)]
However, this of course doesn't work. Is there a better way of doing what I want than using the append() function?
np.roll is easy to use, but not the fastest method. It is general purpose, handling multiple dimensions and arbitrary shifts.
Its action can be simplified to:
def simple_roll(x):
res = np.empty_like(x)
res[0] = x[-1]
res[1:] = x[:-1]
return res
In [90]: np.roll(np.arange(1,5),1)
Out[90]: array([4, 1, 2, 3])
In [91]: simple_roll(np.arange(1,5))
Out[91]: array([4, 1, 2, 3])
time tests:
In [92]: timeit np.roll(np.arange(1001),1)
36.8 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [93]: timeit simple_roll(np.arange(1001))
5.54 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We could also use r_ to construct one index array to do the copy. But it is slower (due to advanced indexing as opposed to slicing):
def simple_roll1(x):
idx = np.r_[-1,0:x.shape[0]-1]
return x[idx]
In [101]: timeit simple_roll1(np.arange(1001))
34.2 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can use np.roll, as from the docs:
Roll array elements along a given axis.
Elements that roll beyond the last position are re-introduced at the
first.
np.roll([1,2,3,4], 1)
# array([4, 1, 2, 3])
To roll in the other direction, use a negative shift:
np.roll([1,2,3,4], -1)
# array([2, 3, 4, 1])
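Either way, np.roll returns a new array, so the original is left untouched, which is what the question asks for:
x = np.array([1, 2, 3, 4])
a = np.roll(x, 1)
print(a)   # [4 1 2 3]
print(x)   # [1 2 3 4] -- unchanged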
I want to sum a 2 dimensional array in python:
Here is what I have:
def sum1(input):
sum = 0
for row in range (len(input)-1):
for col in range(len(input[0])-1):
sum = sum + input[row][col]
return sum
print sum1([[1, 2],[3, 4],[5, 6]])
It displays 4 instead of 21 (1+2+3+4+5+6 = 21). Where is my mistake?
I think this is better:
>>> x=[[1, 2],[3, 4],[5, 6]]
>>> sum(sum(x,[]))
21
You could rewrite that function as,
def sum1(input):
return sum(map(sum, input))
Basically, map(sum, input) yields the sums of all your rows; the outermost sum then adds up those values.
Example:
>>> a=[[1,2],[3,4]]
>>> sum(map(sum, a))
10
Yet another alternative solution:
In [1]: a=[[1, 2],[3, 4],[5, 6]]
In [2]: sum([sum(i) for i in a])
Out[2]: 21
And the numpy solution is just:
import numpy as np
x = np.array([[1, 2],[3, 4],[5, 6]])
Result:
>>> b = np.sum(x)
>>> print(b)
21
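If you ever need row or column totals instead of the grand total, np.sum also accepts an axis argument:
>>> np.sum(x, axis=0)   # column sums
array([ 9, 12])
>>> np.sum(x, axis=1)   # row sums
array([ 3,  7, 11])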
Better still, forget the index counters and just iterate over the items themselves:
def sum1(input):
my_sum = 0
for row in input:
my_sum += sum(row)
return my_sum
print sum1([[1, 2],[3, 4],[5, 6]])
One of the nice (and idiomatic) features of Python is letting it do the counting for you. sum() is a built-in and you should not use names of built-ins for your own identifiers.
This is the issue:
for row in range (len(input)-1):
for col in range(len(input[0])-1):
Try:
for row in range (len(input)):
for col in range(len(input[0])):
Python's range(x) goes from 0..x-1 already
range(...)
range([start,] stop[, step]) -> list of integers
Return a list containing an arithmetic progression of integers.
range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
When step is given, it specifies the increment (or decrement).
For example, range(4) returns [0, 1, 2, 3]. The end point is omitted!
These are exactly the valid indices for a list of 4 elements.
range() in Python excludes the last element. In other words, range(1, 5) covers [1, 5), i.e. 1 through 4. So you should just use len(input) to iterate over the rows/columns.
def sum1(input):
sum = 0
for row in range (len(input)):
for col in range(len(input[0])):
sum = sum + input[row][col]
return sum
Don't put -1 in range(len(input)-1); instead use:
range(len(input))
range already stops one before its stop argument, so there is no need to subtract 1 explicitly.
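A quick interactive check makes the off-by-one concrete for a 3-row matrix:
>>> list(range(3 - 1))
[0, 1]        # skips index 2 -- the last row
>>> list(range(3))
[0, 1, 2]     # visits every row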
def sum1(input):
return sum([sum(x) for x in input])
Quick answer, use...
total = sum(map(sum, array))
where array is your 2-D list.
In Python 3.7
import numpy as np
x = np.array([ [1,2], [3,4] ])
sum(sum(x))
outputs
10
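For what it's worth, the inner sum here adds the rows element-wise (yielding a 1-D array), and the outer sum collapses that to a scalar:
>>> sum(x)        # rows added element-wise
array([4, 6])
>>> sum(sum(x))
10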
The general consensus seems to be that numpy is a complicated solution compared to the simpler algorithms, but for the sake of completeness:
import numpy as np
def addarrays(arr):
    b = np.sum(arr, axis=1)   # row sums; the original sum(np.sum(arr)) fails, since np.sum returns a non-iterable scalar
    return sum(b)
array_1 = [
[1, 2],
[3, 4],
[5, 6]
]
print(addarrays(array_1))
This appears to be the preferred solution:
x=[[1, 2],[3, 4],[5, 6]]
sum(sum(x,[]))
def sum1(input):
sum = 0
for row in input:
for col in row:
sum += col
return sum
print(sum1([[1, 2],[3, 4],[5, 6]]))
Speed comparison
import random
import timeit
import numpy as np
x = [[random.random() for i in range(100)] for j in range(100)]
xnp = np.array(x)
Methods
print("Sum python array:")
%timeit sum(map(sum,x))
%timeit sum([sum(i) for i in x])
%timeit sum(sum(x,[]))
%timeit sum([x[i][j] for i in range(100) for j in range(100)])
print("Convert to numpy, then sum:")
%timeit np.sum(np.array(x))
%timeit sum(sum(np.array(x)))
print("Sum numpy array:")
%timeit np.sum(xnp)
%timeit sum(sum(xnp))
Results
Sum python array:
130 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
149 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.05 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.58 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Convert to numpy, then sum:
1.36 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sum numpy array:
24.6 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
301 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
def sum1(input):
    sum = 0
    for row in range(len(input)):
        for col in range(len(input[0])):
            sum = sum + input[row][col]
    return sum

print(sum1([[1, 2],[3, 4],[5, 6]]))

Besides the range bounds (drop the -1, as explained above), the print call needs parentheses in Python 3, since print is a function there.