input = ['beleriand','mordor','hithlum','eol','morgoth','melian','thingol']
I'm having trouble creating X number of lists of size Y without repeating any elements.
What I have been doing is using:
x = 3
y = 2
import random
output = random.sample(input, y)
# ['mordor', 'thingol']
but if I call this repeatedly, elements will repeat across the sampled lists.
I would like the output to be something like
[['mordor', 'thingol'], ['melian', 'hithlum'], ['beleriand', 'eol']]
since I chose x = 3 (3 lists) of size y = 2 (2 elements per list).
def random_generator(x,y):
....
You can simply shuffle the original list and then successively take n groups of m elements from it. Depending on the list's length, fewer or more than that number of groups may be possible. Note that input is the name of a Python built-in function, so I renamed it words.
import itertools
from pprint import pprint
import random
def random_generator(seq, n, m):
    rand_seq = seq[:]  # make a copy to avoid changing the input argument
    random.shuffle(rand_seq)
    lists = []
    limit = n - 1
    for i, group in enumerate(itertools.izip(*([iter(rand_seq)] * m))):
        lists.append(group)
        if i == limit:
            break  # have enough
    return lists
words = ['beleriand', 'mordor', 'hithlum', 'eol', 'morgoth', 'melian', 'thingol']
pprint(random_generator(words, 3, 2))
Output:
[('mordor', 'hithlum'), ('thingol', 'melian'), ('morgoth', 'beleriand')]
It would be more Pythonic to generate the groups iteratively. The function above can easily be turned into a generator by making it yield each group one-by-one instead of returning them all at once in a list-of-lists:
def random_generator_iterator(seq, n, m):
    rand_seq = seq[:]
    random.shuffle(rand_seq)
    limit = n - 1
    for i, group in enumerate(itertools.izip(*([iter(rand_seq)] * m))):
        yield group
        if i == limit:
            break
pprint([group for group in random_generator_iterator(words, 3, 2)])
Rather than randomly taking two things from your list, just shuffle your list and iterate through it to build the new array with the dimensions you specify:
import random
my_input = ['beleriand','mordor','hithlum','eol','morgoth','melian','thingol']
def random_generator(array, x, y):
    random.shuffle(array)
    result = []
    count = 0
    while count < x:
        section = []
        y1 = y * count
        y2 = y * (count + 1)
        for i in range(y1, y2):
            section.append(array[i])
        result.append(section)
        count += 1
    return result

print random_generator(my_input, 3, 2)
You could use random.sample in combination with the grouper recipe from the itertools documentation.
input = ['beleriand','mordor','hithlum','eol','morgoth','melian','thingol']
import itertools
import random
def grouper(iterable, group_size):
    return itertools.izip(*([iter(iterable)] * group_size))

def random_generator(x, y):
    k = x * y
    sample = random.sample(input, k)
    return list(grouper(sample, y))
print random_generator(3,2)
print random_generator(3,2)
print random_generator(3,2)
print random_generator(3,2)
print random_generator(3,2)
print random_generator(3,2)
For one run, this results in:
[('melian', 'mordor'), ('hithlum', 'eol'), ('thingol', 'morgoth')]
[('hithlum', 'thingol'), ('mordor', 'beleriand'), ('morgoth', 'eol')]
[('morgoth', 'beleriand'), ('melian', 'thingol'), ('hithlum', 'mordor')]
[('beleriand', 'thingol'), ('melian', 'hithlum'), ('eol', 'morgoth')]
[('mordor', 'hithlum'), ('eol', 'beleriand'), ('melian', 'morgoth')]
[('mordor', 'melian'), ('thingol', 'beleriand'), ('morgoth', 'eol')]
And the next run:
[('mordor', 'thingol'), ('eol', 'hithlum'), ('melian', 'beleriand')]
[('eol', 'beleriand'), ('mordor', 'melian'), ('hithlum', 'thingol')]
[('hithlum', 'mordor'), ('thingol', 'morgoth'), ('melian', 'eol')]
[('morgoth', 'eol'), ('mordor', 'thingol'), ('melian', 'beleriand')]
[('melian', 'morgoth'), ('mordor', 'eol'), ('thingol', 'hithlum')]
[('mordor', 'morgoth'), ('hithlum', 'thingol'), ('eol', 'melian')]
Related
I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test if they are N or less characters different from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements fit this criteria and should be returned.
It should be noted that I am only looking at values of the same character length (ASDFGY -> ASDFG matching doesn't work for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in ref and a couple hundred million in Test so efficiency is key.
Using a generator expression with sum:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)

res = [b for a, b in zip(Ref, Test) if comparer(a, b, 1)]
print(res)
['ASDFGH', 'QWERTYU']
Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i, v in zip(Test, Ref):
    c = 0
    for s in difflib.ndiff(i, v):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append(i)
print(result)
Output:
['ASDFGH', 'QWERTYU']
The newer regex module offers a "fuzzy" match possibility:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To have pairs, you could write
pairs = [(origin, difference)
for origin in Test
for rx in [re.compile(rf"({origin}){{s<=3}}")]
for difference in Ref
if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]
I have two lists. One with names, and one with numbers that correspond with a name in the first list (corresponding name and number are at the same index point in each list). I need to reference each name and number in a url that can only handle 25 different names & points at a time.
pointNames = ['name1', 'name2', 'name3']
points = ['1', '2', '3'] #yes, the numbers are meant to be strings
My actual lists have roughly 600 values in each. What I'm trying to do is loop through each list at the same time, but in increments of 25. I'm able to do this with a single list using the following:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for group in chunker(pointNames, 25):
    print(group)
This prints groups of 25 values from the list until it has gone through the entire list. I want to do exactly this, but with two lists at once. I'm able to print both lists in full with for (point, name) in zip(points, pointNames):, but I need the pairs in groups of 25.
I've also tried combining the two lists into a dictionary:
dictionary = dict(zip(points, pointNames))
for group in chunker(dictionary, 25):
    print(group)
but i get the following error:
TypeError: unhashable type: 'slice'
A generator would be more efficient. (Your dict attempt fails because chunker slices its argument, and dictionaries don't support slicing.)
import itertools
def chunker(size, *seq):
    seq = zip(*seq)
    while True:
        val = list(itertools.islice(seq, size))
        if not val:
            break
        yield val

for group in chunker(2, pointNames, points):
    print(group)

gen_groups = chunker(2, pointNames, points, pointNames, points)
group = next(gen_groups)
print(group)
Using *seq allows you to pass any number of lists as parameters.
How about this relatively minimal change to your first function:
def chunker(seq1, seq2, size):
    seq = list(zip(seq1, seq2))
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
Call as follows:
for group in chunker(pointNames, points, 25):
    print(group)
Itertools can slice an iterator (or generator) into chunks, together with a small helper function to keep going until it is done:
import itertools
# helper function, https://stackoverflow.com/a/24527424
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield itertools.chain([first], itertools.islice(iterator, size - 1))
# 600 points and pointNames
points = (str(i) for i in range(600))
pointNames = ('name ' + str(i) for i in range(600))
# work with the chunks
for chunk in chunks(zip(pointNames, points), 25):
    print('-' * 40)
    print(list(chunk))
I'm setting up a simple sentence generator in python, to create as many word combinations as possible to describe a generic set of images involving robots. (It's a long story :D)
It outputs something like this: 'Cyborg Concept Downloadable Illustration'
Amazingly, the random generator I wrote only gets up to 255 unique combinations. Here is the script:
import numpy
from numpy import matrix
from numpy import linalg
import itertools
from pprint import pprint
import random
m = matrix([
    ['Robot', 'Cyborg', 'Andoid', 'Bot', 'Droid'],
    ['Character', 'Concept', 'Mechanical Person', 'Artificial Intelligence', 'Mascot'],
    ['Downloadable', 'Stock', '3d', 'Digital', 'Robotics'],
    ['Clipart', 'Illustration', 'Render', 'Image', 'Graphic'],
])
used = []
i = 0
def make_sentence(m, used):
    sentence = []
    i = 0
    while i <= 3:
        word = m[i, random.randrange(0, 4)]
        sentence.append(word)
        i = i + 1
    return ' '.join(sentence)
def is_used(sentence, used):
    if sentence not in used:
        return False
    else:
        return True
sentences = []
i = 0
while i <= 1000:
    sentence = make_sentence(m, used)
    if is_used(sentence, used):
        continue
    else:
        sentences.append(sentence)
        print str(i) + ' ' + sentence
        used.append(sentence)
        i = i + 1
Using randint instead of randrange, I get up to 624 combinations (instantly), and then it hangs in an infinite loop, unable to create more combos.
I guess the question is: is there a more appropriate way of determining all possible combinations of a matrix?
You can make use of itertools to get all possible combinations of the matrix. Here is an example showing how itertools.product works. (This also explains the limits you hit: randrange(0, 4) only ever picks columns 0-3, so at most 4**4 = 256 sentences are reachable, while randint(0, 4) includes column 4, giving 5**4 = 625.)
import itertools
mx = [
    ['Robot', 'Cyborg', 'Andoid', 'Bot', 'Droid'],
    ['Character', 'Concept', 'Mechanical Person', 'Artificial Intelligence', 'Mascot'],
    ['Downloadable', 'Stock', '3d', 'Digital', 'Robotics'],
    ['Clipart', 'Illustration', 'Render', 'Image', 'Graphic'],
]
for combination in itertools.product(*mx):
    print combination
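As a quick sanity check (a sketch, reusing the mx list above): each of the 4 rows offers 5 choices, so the product contains 5**4 = 625 combinations.
print len(list(itertools.product(*mx)))
# 625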
Your code can make use of recursion. Without itertools, here is one strategy:
def make_sentences(m, choices=[]):
    output = []
    if len(choices) == 4:
        sentence = ""
        i = 0
        # Go through the four rows of the matrix
        # and choose words for the sentence
        for j in choices:
            sentence += " " + m[i][j]
            i += 1
        return [sentence]  # must be returned as a list
    for i in range(0, 5):  # one recursive call per column; range(0, 4) would miss the fifth column
        output += make_sentences(m, choices + [i])
    return output  # this could be changed to a yield statement
This is quite different from your original function.
The choices list keeps track of the index of the column selected for each ROW of m. When the recursive method finds that choices for all four rows have been made, it returns a list with just ONE sentence.
When the choices list doesn't yet have four elements, the method recursively calls itself with five new choices lists, one per column. The results of these recursive calls are added to the output list.
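A quick usage sketch (hypothetical, reusing the mx list-of-lists from the itertools answer above; with five columns per row the recursion yields 5**4 = 625 sentences):
sentences = make_sentences(mx)
print len(sentences)
# 625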
I have a function, for simplicity I'll call it shuffler, and it takes a list, seeds random with 17, and then prints that list shuffled.
def shuffler(n):
    import random
    random.seed(17)
    random.shuffle(n)  # shuffles in place (random.shuffle itself returns None)
    print(n)
How would I create another function called unshuffler that "unshuffles" that list that is returned by shuffler(), bringing it back to the list I inputted into shuffler() assuming that I know the seed?
Just wanted to contribute an answer that's more compatible with functional patterns commonly used with numpy. Ultimately this solution should perform the fastest as it will take advantage of numpy's internal optimizations, which themselves can be further optimized via the use of projects like numba. It ought to be much faster than using conventional loop structures in python.
import numpy as np
original_data = np.array([23, 44, 55, 19, 500, 201]) # Some random numbers to represent the original data to be shuffled
data_length = original_data.shape[0]
# Here we create an array of shuffled indices
shuf_order = np.arange(data_length)
np.random.shuffle(shuf_order)
shuffled_data = original_data[shuf_order] # Shuffle the original data
# Create an inverse of the shuffled index array (to reverse the shuffling operation, or to "unshuffle")
unshuf_order = np.zeros_like(shuf_order)
unshuf_order[shuf_order] = np.arange(data_length)
unshuffled_data = shuffled_data[unshuf_order] # Unshuffle the shuffled data
print(f"original_data: {original_data}")
print(f"shuffled_data: {shuffled_data}")
print(f"unshuffled_data: {unshuffled_data}")
assert np.all(np.equal(unshuffled_data, original_data))
Here are two functions that do what you need:
import random
import numpy as np
def shuffle_forward(l):
    order = range(len(l)); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out
Example
l = range(10000); random.shuffle(l)
l_shuf, order = shuffle_forward(l)
l_unshuffled = shuffle_backward(l_shuf, order)
print l == l_unshuffled
#True
Reseed the random generator with the seed in question and then shuffle the list 1, 2, ..., n. This tells you exactly what ended up where in the shuffle.
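A minimal sketch of that idea, assuming the known seed is 17 as in the question: shuffle the indices 0..n-1 with the same seed to recover the permutation, then invert it.
import random

def unshuffle(shuffled, seed=17):
    n = len(shuffled)
    perm = list(range(n))
    random.seed(seed)
    random.shuffle(perm)  # perm[i] is the original index of the item now at position i
    original = [None] * n
    for i, j in enumerate(perm):
        original[j] = shuffled[i]
    return original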
In Python3:
import random
import numpy as np
def shuffle_forward(l):
    order = list(range(len(l))); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out
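Usage, mirroring the Python 2 example above (a sketch):
l = list(range(10000)); random.shuffle(l)
l_shuf, order = shuffle_forward(l)
l_unshuffled = shuffle_backward(l_shuf, order)
print(l == l_unshuffled)
# True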
This is a follow up to a similar question which asked the best way to write
for item in somelist:
    if determine(item):
        code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:
for item in reversed(somelist):
    if determine(item):
        somelist.remove(item)
However, here list.remove will search for the item, which is O(N) in the length of the list. Maybe we are limited by the fact that a list is represented as an array rather than a linked list, so removing items will need to move everything after them. However, it is suggested here that collections.deque is represented as a doubly-linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit
setup = """
import random
random.seed(1)
b = [(random.random(),random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1]>.45) and (x[1]<.5)
"""
listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""
filt = """
c = filter(tokeep, b)
"""
print "list comp = ", timeit.timeit(listcomp,setup, number = 10000)
print "filtering = ", timeit.timeit(filt,setup, number = 10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so it runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects, not the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove items in O(1), you can use a hash-based container such as a set or dict.
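A tiny sketch of that idea, assuming the items are hashable and that ordering and duplicates don't matter:
somelist = ['a', 'b', 'c', 'd']
someset = set(somelist)
someset.discard('b')  # average O(1) removal
print(someset)  # {'a', 'c', 'd'}, in arbitrary order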
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
    if determine(item):
        del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or a list comprehension first, and optimise later.
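If you do want index-based deletion, a safer variant (a sketch, not from the original answer) walks the indices in reverse, so deletions don't shift the positions that are still to be visited:
for idx in range(len(somelist) - 1, -1, -1):
    if determine(somelist[idx]):
        del somelist[idx]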
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still an O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
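That said, a deque does support O(1) removal at either end, so one way to filter while iterating (a sketch of the rotate-through pattern; the names are illustrative) is to pop each element off the left and append the keepers back on the right:
from collections import deque

def filter_deque(d, keep):
    # one full pass: each popleft/append is O(1), the whole pass is O(n),
    # and no second container is allocated
    for _ in range(len(d)):
        item = d.popleft()
        if keep(item):
            d.append(item)
    return d

d = deque(['a', 'b', 'c', 'd'])
filter_deque(d, lambda x: x not in ['a', 'c'])
print(d)  # deque(['b', 'd'])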
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
    L[write_i] = L[read_i]
    if L[read_i] not in ['a', 'c']:
        write_i += 1
del L[write_i:]
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit, I might be doing this wrong.
import timeit
setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1]>.45) and (x[1]<.5)

# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
    c[:] = [x for x in b if tokeep(x)]
listcomp()

b = setup_b[:]
def filt():
    c = filter(tokeep, b)
filt()

b = setup_b[:]
def forfilt():
    marked = (i for i, x in enumerate(b) if tokeep(x))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfilt()

b = setup_b[:]
def forfiltCheating():
    marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfiltCheating()
"""
listcomp = """
b = setup_b[:]
listcomp()
"""
filt = """
b = setup_b[:]
filt()
"""
forfilt = """
b = setup_b[:]
forfilt()
"""
forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''
psycosetup = '''
import psyco
psyco.full()
'''
print "list comp = ", timeit.timeit(listcomp, setup, number = 10000)
print "filtering = ", timeit.timeit(filt, setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number = 10000)
print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number = 10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number = 10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
Elements are not copied by list comprehension.
This took me a while to figure out. See the example code below to experiment yourself with the different approaches.
code
You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turned out.
import time
import timeit
import numpy as np

def ObjectFactory(time_eval, time_copy):
    """
    Creates a class

    Parameters
    ----------
    time_eval : float
        time to evaluate (True or False, i.e. keep in list or not) an object
    time_copy : float
        time to (shallow-) copy an object. Used by list comprehension.

    Returns
    -------
    New class with defined copy-evaluate performance
    """
    class Object:
        def __init__(self, id_, keep):
            self.id_ = id_
            self._keep = keep

        def __repr__(self):
            return f"Object({self.id_}, {self.keep})"

        @property
        def keep(self):
            time.sleep(time_eval)
            return self._keep

        def __copy__(self):  # list comprehension does not copy the object
            time.sleep(time_copy)
            return self.__class__(self.id_, self._keep)

    return Object

def remove_items_from_list_list_comprehension(lst):
    return [el for el in lst if el.keep]

def remove_items_from_list_new_list(lst):
    new_list = []
    for el in lst:
        if el.keep:
            new_list += [el]
    return new_list

def remove_items_from_list_new_list_by_ind(lst):
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    return [lst[ee] for ee in new_list_inds]

def remove_items_from_list_del_elements(lst):
    """WARNING: Modifies lst"""
    # collect the indices of the elements to drop, then delete them in
    # reverse order so earlier deletions don't shift the later indices
    del_inds = []
    for ee in range(len(lst)):
        if not lst[ee].keep:
            del_inds += [ee]
    for ind in del_inds[::-1]:
        del lst[ind]

if __name__ == "__main__":
    ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
    ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)
    keep_ratio = .8
    n_runs_timeit = int(1e2)
    n_elements_list = int(1e2)

    lsts_to_tests = dict(
        list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
    )

    for lbl, lst in lsts_to_tests.items():
        print()
        for fct in [
            remove_items_from_list_list_comprehension,
            remove_items_from_list_new_list,
            remove_items_from_list_new_list_by_ind,
            remove_items_from_list_del_elements,
        ]:
            lst_loc = lst.copy()
            t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
            print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections
list1 = collections.deque(list1)
for i in list2:
    try:
        list1.remove(i)
    except ValueError:
        pass
Instead of checking whether the element is there, this uses try/except. I guess this is faster.