How to create an alignment matrix given two lists - python

I have two nested lists, a_lis and b_lis, that hold the same data: each sublist of a_lis is the reversed form of the corresponding sublist of b_lis. For example, these two sublists count as the same:
['Berries', 'grapes', 'lemon', 'Orange', 'Apple']
and
['Apple', 'Orange', 'lemon', 'grapes', 'Berries']
The full a_lis and b_lis are:
a_lis = [['Berries', 'grapes', 'lemon', 'Orange', 'Apple'],
['Apricots', 'peach', 'grapes', 'lemon', 'Orange', 'Apple'],
[1, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple'],
['pumpkin', 'avocados', 'strawberries', 'lemon', 'Orange', 'Apple'],
[3, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple']]
And
b_lis = [['Apple', 'Orange', 'lemon', 'grapes', 'Berries'],
['Apple', 'Orange', 'lemon', 'grapes', 'peach', 'Apricots'],
['Apple', 'Orange', 'lemon', 'strawberries', 'Melons', 1],
['Apple', 'Orange', 'lemon', 'strawberries', 'avocados', 'pumpkin'],
['Apple', 'Orange', 'lemon', 'strawberries', 'Melons', 3]]
How can I align them into a two-dimensional nested list with all the possible alignments, if and only if the lists are different? For example, ['Berries', 'grapes', 'lemon', 'Orange', 'Apple'] and ['Apple', 'Orange', 'lemon', 'grapes', 'Berries'] should not be concatenated because they are the same (i.e. the first one is the reversed version of the other). The expected output should look like this (*):
So far, I have first written a function that tells me whether two lists contain the same items regardless of their order:
def sequences_contain_same_items(a, b):
    for item in a:
        try:
            i = b.index(item)
        except ValueError:
            return False
        b = b[:i] + b[i+1:]
    return not b
Then I iterated the lists:
lis = []
for f, b in zip(a_lis, b_lis):
    #print(f, b)
    lis.append(f)
    lis.append(b)
print(lis)
However, I do not see how to produce the aligned output list, and I am not sure whether product is the right operation to apply here. Any idea how to produce (*)?

a_lis = [['Berries', 'grapes', 'lemon', 'Orange', 'Apple'],
['Apricots', 'peach', 'grapes', 'lemon', 'Orange', 'Apple'],
[1, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple'],
['pumpkin', 'avocados', 'strawberries', 'lemon', 'Orange', 'Apple'],
[3, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple']]
reva = [k[-1::-1] for k in a_lis]
m = []
for i, v in enumerate(a_lis):
    for i1,v1 in enumerate(reva):
        if i==i1:
            pass
        else:
            m.append(v)
            m.append(v1)
print(m)
In a more compact way, building the same flat list:
m = sum([[v, v1] for i, v in enumerate(a_lis) for i1,v1 in enumerate(reva) if i!=i1], [])
Or, keeping each pair as its own sublist:
m = [[v, v1] for i, v in enumerate(a_lis) for i1,v1 in enumerate(reva) if i!=i1]
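The question also asks whether product fits here. A minimal sketch, assuming the goal is to pair every sublist of a_lis with every sublist of b_lis and skip the pairs that hold the same items. The helper same_items and the shortened sample data are illustrative and not from the original post:

```python
from collections import Counter
from itertools import product

# Shortened sample data (assumption: same shape as the lists in the question).
a_lis = [['Berries', 'grapes'], ['Apricots', 'peach', 'grapes'], [1, 'Melons']]
b_lis = [sub[::-1] for sub in a_lis]  # each sublist reversed


def same_items(a, b):
    """True when both lists hold the same items, ignoring order."""
    return Counter(a) == Counter(b)


# Cartesian product of the two lists, minus the "same" pairs.
pairs = [[a, b] for a, b in product(a_lis, b_lis) if not same_items(a, b)]

for pair in pairs:
    print(pair)
```

Counter is used instead of sorting because the sublists mix integers and strings, which cannot be ordered against each other.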

Related

Finding most common elements from large amount of lists

I have a lot of lists, and I want to find the groups of elements that most commonly appear together in them.
For example:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
How can I find the groups of elements that appear in these lists? For example:
Group 1: ['apple', 'banana', 'orange'] in l1, l2, appear 2 times
Group 2: ['apple', 'grape', 'kiwi'] in l4, l5, appear 2 times
Group 3: ['apple', 'grape', 'orange'] in l2, l5, appear 2 times
The index of an element is not important. A group should have a minimum of 3 and a maximum of 5 elements.
Lists can have from 3 to 10 elements.
I know that I can do something like this with intersections, but what if I have a totally different list:
l7 = ["x", "y", "z", "k"]
Elements from this list do not appear in any other list.
I think this might help (I gave it a degree of freedom on the number of combinations to extract and the minimum number of elements in each combination):
import itertools
def get_max_rep(all_lists, min_comb_len, num_el_displayed):
    """Returns 'num_el_displayed' number of elements that repeat the most in the given lists"""
    # extract all the unique values of all the lists
    all_elements = set([el for l in all_lists for el in l])
    # build all the possible combinations starting from 'min_comb_len' number of elements
    combinations = [
        el for r in range(min(min_comb_len, len(all_elements)-1),len(all_elements))
        for el in itertools.combinations(all_elements, r)
    ]
    # count the number of repetitions of each combination in the given lists
    out = sorted(
        [(comb, sum([all(fruit in el for fruit in comb) for el in all_lists]))
         for comb in combinations], key=lambda x:x[1], reverse=True
    )[:num_el_displayed]
    return out
To test it out (here I want the first 5 combinations that have the most repetitions and that have a minimum of 2 elements):
# testing ...
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
all_lists = [l1, l2, l3, l4, l5, l6]
print(get_max_rep(all_lists, 2, 5))
output:
[(('grape', 'kiwi'), 3), (('grape', 'apple'), 3), (('orange', 'apple'), 3), (('banana', 'grape'), 2), (('banana', 'orange'), 2)]
The problem you are facing is called Frequent Itemsets Mining. The 2 most popular algorithms (at least that I know) that solve this problem are:
Apriori
FP-Growth
They are very well explained in the book Data Mining. Concepts and Techniques (3rd Edition). A library that implements apriori is apyori, and this is how you could use it:
from apyori import apriori
transactions = [['apple', 'banana', 'orange'],
['apple', 'banana', 'grape', 'lemon', 'orange'],
['banana', 'grape', 'kiwi'],
['apple', 'grape', 'kiwi', 'peach'],
['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
['chery', 'kiwi', 'pear']]
results = list(map(lambda x: list(x.items),
                   filter(lambda x: len(x.items) >= 3,
                          apriori(transactions, min_support=2/len(transactions), max_length=5)
                          )
                   ))
print(results)
It returns:
[['banana', 'apple', 'orange'], ['grape', 'kiwi', 'apple'], ['grape', 'apple', 'orange']]
About the arguments of the apriori call:
min_support is the minimum frequency a certain itemset must have. In your case, it is 2/len(transactions) because the itemset must be present in at least 2 of the transactions
max_length is the maximum length of an itemset
I had to filter the result because otherwise I would also get the itemsets whose length is less than 3. (It is odd that there is no argument for the minimum length too; maybe I missed it, but help(apyori.apriori) does not mention anything like that.)
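If installing a library is not an option, the same three groups can be found for small inputs by brute force. A sketch using only the standard library: it enumerates every 3- to 5-element combination of each list and counts them, so it is plain enumeration and not the Apriori algorithm. The function name frequent_itemsets and its parameters are hypothetical:

```python
from collections import Counter
from itertools import combinations

transactions = [['apple', 'banana', 'orange'],
                ['apple', 'banana', 'grape', 'lemon', 'orange'],
                ['banana', 'grape', 'kiwi'],
                ['apple', 'grape', 'kiwi', 'peach'],
                ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
                ['chery', 'kiwi', 'pear']]


def frequent_itemsets(transactions, min_count=2, min_len=3, max_len=5):
    """Count every itemset of min_len..max_len items and keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # sorted so equal itemsets share one key
        for size in range(min_len, min(max_len, len(items)) + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


result = frequent_itemsets(transactions)
for itemset, n in result.items():
    print(itemset, n)
```

With lists of at most 10 elements the number of combinations per list stays small; for longer lists a real frequent-itemset algorithm is the safer choice.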
You could do something like this using a powerset of the lists, obviously excluding the combinations that are not in the 3-5 range.
Note: for the powerset function I give credit to this other answer from martineu, which uses itertools.chain.from_iterable and itertools.combinations to get the combinations. From the docs:
itertools.chain.from_iterable(iterable)
Alternate constructor for chain(). Gets chained inputs from a single iterable argument that is evaluated lazily.
Using functools.reduce to compute the intersections, I have come up with this code that I think might help. From the docs:
functools.reduce(function, iterable[, initializer])
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
from itertools import chain, combinations
from functools import reduce
from pprint import pprint
lsts = [['apple', 'banana', 'orange'],
['apple', 'banana', 'grape', 'lemon', 'orange'],
['banana', 'grape', 'kiwi'],
['apple', 'grape', 'kiwi', 'peach'],
['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
['chery', 'kiwi', 'pear']]
lsts = tuple(map(set, lsts))
def powerset(iterable, start=0, stop=None): #from answer https://stackoverflow.com/a/40986475/13285707
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable) # allows duplicate elements
    if stop is None:
        stop = len(s)+1
    return chain.from_iterable(combinations(s, r) for r in range(start, stop))
groups = []
for x in powerset(lsts, 3, 5):
    # keep intersections holding 2 to 4 elements (the walrus operator needs Python 3.8+)
    if 1 < len(c:=reduce(set.intersection,x)) < 5:
        groups.append(c)
pprint(groups)
Output:
[{'apple', 'orange'}, {'apple', 'grape'}, {'grape', 'kiwi'}]
My solutions:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
# return the elements common to two lists
def sub_eq_lists(l1, l2):
    return list(set([x for x in l1 + l2 if x in l1 and x in l2]))
all_lists = [l1, l2, l3, l4, l5, l6]
results = []
for j in all_lists:
    for l in all_lists:
        if(l!=j):
            same_elements = sub_eq_lists(j,l)
            if (len(same_elements) > 2) and (len(same_elements) < 11):# your conditions
                results.append(same_elements)
# count occurrences
results = {str(item): results.count(item) for item in results}
results
Output:
{"['apple', 'orange', 'banana']": 2,
"['apple', 'orange', 'grape']": 2,
"['apple', 'kiwi', 'grape']": 2}
You can use A.issubset(B) to check whether all elements of A are present in B; it returns True if they are and False otherwise.
def count1(grp=None, list_num=None):
    return grp.issubset(list_num)
if __name__ == "__main__":
    g1 = {'apple', 'banana', 'orange'}
    g2 = {'apple', 'grape', 'kiwi'}
    g3 = {'apple', 'grape', 'orange'}
    l1 = ['apple', 'banana', 'orange']
    l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
    l3 = ['banana', 'grape', 'kiwi']
    l4 = ['apple', 'grape', 'kiwi', 'peach']
    l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
    l6 = ['chery', 'kiwi', 'pear']
    group_count = {}
    groups = [g1, g2, g3]
    list_check = [l1, l2, l3, l4, l5, l6]
    for index_list, list_num in enumerate(list_check):
        for index, grp in enumerate(groups):
            if count1(grp, list_num):
                group_count.setdefault(f'g{index + 1}',[]).append(f'l{index_list + 1}')
    print(group_count)
This is what I understood from the question.
Output:
{'g1': ['l1', 'l2'], 'g3': ['l2', 'l5'], 'g2': ['l4', 'l5']}

How to delete n elements from an array of two nested lists without losing the array? [closed]

I am working with a nested structure like this:
l=[
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'bannana', 'grapes']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'apricot', 'avocado']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'tomato']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon','pear','strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'apricot', 2]]
]
How can I keep only a given number of elements from each pair of nested lists, counting across both sublists? For example, say I want to keep 5 elements. The expected output should be:
[
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']]
]
Or 9:
[
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'bannana', 'grapes']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon','pear','strawberry', 'strawberry', 'strawberry']]
]
Or 11:
[
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'bannana', 'grapes']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'apricot', 'avocado']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'tomato']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon','pear','strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry']]
]
Alternatively, consider this list:
l2 = [
[['apple'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'apricot', 'avocado']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'tomato']],
[['apple', 'tomato'], ['watermelon','pear','strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'apricot', 2]]
]
If I want 4, the output should look like this:
[
[['apple'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'],[]],
[['apple', 'bannana', 'pear', 'watermelon'],[]],
[['apple', 'tomato'], ['watermelon','pear']]
]
I could iterate over and join each sublist. However, if I do that I might break the inner lists inside the list. Any idea how to remove a number of elements efficiently without losing the [[],[]] structure?
Using for loop:
res = []
n = 4
for li, lj in l2:
    res.append([li[:n], lj[:max(0,n-len(li))]])
res
Output:
[[['apple'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], []],
[['apple', 'bannana', 'pear', 'watermelon'], []],
[['apple', 'tomato'], ['watermelon', 'pear']]]
With l and n=5:
res = []
n = 5
for li, lj in l:
    res.append([li[:n], lj[:max(0,n-len(li))]])
res
Output:
[[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']]]
To cut the input list in-place (using Python's list.clear() feature):
import pprint
def cut_list(lst, n):
    for i, (l1, l2) in enumerate(lst):
        if len(l1 + l2) > n:  # check if there are items to cut
            if len(l1) >= n:  # if the 1st sublist covers the limit
                lst[i][0] = l1[:n]
                lst[i][1].clear()  # clear the 2nd sublist in-place
            else:  # cut the 2nd sublist leaving the 1st one intact
                lst[i][1] = l2[:n - len(l1)]
lst = [
[['apple'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'],
['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'apricot', 'avocado']],
[['apple', 'bannana', 'pear', 'watermelon'],
['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'tomato']],
[['apple', 'tomato'],
['watermelon', 'pear', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry', 'strawberry',
'apricot', 2]]
]
cut_list(lst, 4)
pprint.pprint(lst)
The output:
[[['apple'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], []],
[['apple', 'bannana', 'pear', 'watermelon'], []],
[['apple', 'tomato'], ['watermelon', 'pear']]]
Here's one that works for arbitrarily-sized inner lists:
def truncate_inner(it, keep):
    for x in it:
        yield x[:max(0, keep)]
        keep -= len(x)
Usage for a 3d list such as l2:
for row in [list(truncate_inner(x, 3)) for x in l2]:
    print(row)
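A quick check of the generator above on a row with three inner lists instead of two. The sample row is made up for illustration; the function body is repeated only so that the snippet runs on its own:

```python
def truncate_inner(it, keep):
    """Yield each inner list, cut so that at most `keep` items survive in total."""
    for x in it:
        yield x[:max(0, keep)]
        keep -= len(x)  # remaining budget; may go negative, hence the max() above


row = [['a', 'b'], ['c', 'd', 'e'], ['f']]  # hypothetical sample row
result = list(truncate_inner(row, 4))
print(result)  # [['a', 'b'], ['c', 'd'], []]
```

Inner lists that fall entirely past the limit come back as empty lists, so the outer structure keeps its length.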
Try this?
>>> import json
>>> def shrink( b, keep ) :
...     result = []
...     for bb in b :
...         if keep < 1 : break
...         result.append( bb[:keep] )
...         keep -= len(bb)
...     return result
...
>>> print(json.dumps( [shrink( b, 6 ) for b in l], indent=4))
[[
[
"apple",
"bannana",
"pear",
"watermelon"
],
[
"watermelon",
"pear"
]
],
[
[
"apple",
"bannana",
"pear",
"watermelon"
],
[
"watermelon",
"pear"
]
],
[
[
"apple",
"bannana",
"pear",
"watermelon"
],
[
"watermelon",
"pear"
]
],
[
[
"apple",
"bannana",
"pear",
"watermelon"
],
[
"watermelon",
"pear"
]
]
]
>>>
n = 5  # total number of elements to keep per pair
for arr_2d in l:
    assert len (arr_2d) == 2
    fir_arr = arr_2d[0]
    sec_arr = arr_2d[1]
    arr_2d[1] = sec_arr[0:max(0, n-len(fir_arr))]
It works. I have tested.
n = 5  # total number of elements to keep per pair
for arr_2d in l:  # iterate each 2D array inside l
    assert len (arr_2d) == 2  # make sure the current 2D array has 2 elements
    fir_arr = arr_2d[0]  # assign a variable
    sec_arr = arr_2d[1]  # to each element of this 2D array
    arr_2d[1] = sec_arr[0:max(0, n-len(fir_arr))]  # cut the second element based on the number of items in the first
Output:
l
Out[50]:
[[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']],
[['apple', 'bannana', 'pear', 'watermelon'], ['watermelon']]]

merge a string column to a set of list using Python

I have a Pandas DataFrame like this :
id fruits
01 Apple, Apricot
02 Apple, Banana, Clementine, Pear
03 Orange, Pineapple, Pear
How can I get a list of fruits like this, with the duplicates removed?
['Apple','Apricot','Banana','Clementine','Orange','Pear','Pineapple']
You can flatten the lists created by split, convert the result to a set to keep only unique values, and finally convert it back to a list:
a = list(set([item for sublist in df['fruits'].str.split(', ') for item in sublist]))
print (a)
['Pineapple', 'Clementine', 'Apple', 'Banana', 'Apricot', 'Orange', 'Pear']
Or:
a = df['fruits'].str.split(', ', expand=True).stack().drop_duplicates().tolist()
print (a)
['Apple', 'Apricot', 'Banana', 'Clementine', 'Pear', 'Orange', 'Pineapple']
Thanks to @kabanus for an alternative:
a = list(set(sum(df['fruits'].str.split(', '),[])))
Using str.extractall and drop_duplicates:
df.fruits.str.extractall(r'(\w+)').drop_duplicates()[0].tolist()
outputs:
['Apple', 'Apricot', 'Banana', 'Clementine', 'Pear', 'Orange', 'Pineapple']
try this,
set(', '.join(df['fruits']).split(', '))
Output:
set(['Apple', 'Apricot', 'Pear', 'Pineapple', 'Orange', 'Banana', 'Clementine'])
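For comparison, a sketch of the same idea without pandas, assuming the column has already been pulled out as a plain list of strings (for example via df['fruits'].tolist()). Sorting the set gives the alphabetical order shown in the question:

```python
# Assumption: the values of the 'fruits' column as plain Python strings.
fruits_column = ['Apple, Apricot',
                 'Apple, Banana, Clementine, Pear',
                 'Orange, Pineapple, Pear']

# Split every row, collect the names in a set to drop duplicates, then sort.
unique_fruits = sorted({fruit for row in fruits_column for fruit in row.split(', ')})
print(unique_fruits)
# ['Apple', 'Apricot', 'Banana', 'Clementine', 'Orange', 'Pear', 'Pineapple']
```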

How to implement biased random function?

My question is about the random.choice function. As we know, when we run random.choice(['apple','banana']), it returns either 'apple' or 'banana' with equal probability. What if I want a biased result, for example, returning 'apple' with probability 0.9 and 'banana' with probability 0.1? How can I implement this?
Luckily, in Python 3.6 and later, you can simply use random.choices:
import random
random.choices(a, probability)
#random.choices(population, weights=None, *, cum_weights=None, k=1)
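A small runnable sketch of that call, using the fruits and probabilities from the question. The fixed seed and the sample size of 10,000 are illustrative choices, not part of the original answer:

```python
import random
from collections import Counter

fruits = ['apple', 'banana']
weights = [0.9, 0.1]  # relative weights; they do not have to sum to 1

random.seed(0)  # fixed seed only so the sketch is repeatable

# choices() always returns a list, so take element 0 for a single draw.
single = random.choices(fruits, weights=weights)[0]

# Draw many samples to see the bias in the observed frequencies.
draws = random.choices(fruits, weights=weights, k=10_000)
counts = Counter(draws)
print(single, counts)
```

Roughly 90% of the draws should come out as 'apple'; the exact counts depend on the seed.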
A basic way would be to draw a random number between 0 and 1 and test it:
randNumber = random.random()
if randNumber < 0.9:
    fruit = "apple"
else:
    fruit = "banana"
which can be simplified to: ['apple', 'banana'][random.random()>0.9] (thanks to @falsetru's comment)
The point is to create a new list with more or fewer copies of the element you want to bias.
This ought to do it:
import random
a = ['apple','banana']
probability = [0.1,0.9]
def biase(lst,probability):
    zipped = zip(lst,probability)
    # repeat each item in proportion to its probability, in whole percent
    lst = [[i[0]] * int(round(i[1]*100)) for i in zipped]
    new = [b for i in lst for b in i]
    return new
biased_list = biase(a,probability)
random_word = random.choice(biased_list)
print(random_word)
This code will produce banana in most cases because the string banana makes up 90% of the list and apple only 10%.
I've added a list called probability and zipped it with the items (Python lists are ordered), but a dictionary is more suitable for this sort of task.
And if you go under the hood and print biased_list you'll see something like:
['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana']

Python: Print in rows

Say I have a list
food_list = ['apple', 'pear', 'tomato', 'bean', 'carrot', 'grape']
How would I print the list in rows containing 4 columns, so it would look like:
apple pear tomato bean
carrot grape
food_list = ['apple', 'pear', 'tomato', 'bean', 'carrot', 'grape']
for i in range(0, len(food_list), 4):
    print('\t'.join(food_list[i:i+4]))
Try with this
food_list = ['apple', 'pear', 'tomato', 'bean', 'carrot', 'grape']
size = 4
g = (food_list[i:i+size] for i in range(0, len(food_list), size))
for i in g:
    print(i)
food_list = ['apple', 'pear', 'tomato', 'bean', 'carrot', 'grape']
index = 0
for each_food in food_list:
    if index < 3:
        print(each_food, end=' ')
        index += 1
    else:
        print(each_food)
        index = 0
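If the columns should also line up, one option is to pad each word to the width of the longest one. A minimal sketch; the function name format_rows and the choice of str.ljust padding are assumptions, not something the answers above use:

```python
def format_rows(items, columns=4):
    """Split items into rows of `columns` entries, padded to a common width."""
    if not items:
        return []
    width = max(len(s) for s in items)
    rows = []
    for i in range(0, len(items), columns):
        chunk = items[i:i + columns]
        # ljust pads every cell; rstrip removes the padding after the last cell
        rows.append(' '.join(s.ljust(width) for s in chunk).rstrip())
    return rows


food_list = ['apple', 'pear', 'tomato', 'bean', 'carrot', 'grape']
for line in format_rows(food_list):
    print(line)
```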
