Data processing in lists with duplicates in Python - python

I have two lists: one contains the products and the other one contains their associated prices. The lists can contain an undefined number of products. An example of the lists would be something like:
Products : ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
Prices: ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
I want to be able to remove all the duplicates from the products list and keep only the cheapest price associated with the unique products in the price list. Note that some products might have the same price (in our example the Pineapple).
The desired final lists would be something like:
Products : ['Apple', 'Orange', 'Banana', 'Peach', 'Pineapple']
Prices: ['1.00', '3.00', '0.50', '2.00', '1.00']
I would like to know the most effective way to do so in Python. Thank you

from collections import OrderedDict
products = ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
prices = ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
min_prices = OrderedDict()
for prod, price in zip(products, prices):
    # keep the smaller of the new price and the lowest price seen so far
    min_prices[prod] = min(float(price), min_prices.get(prod, float('inf')))
>>> print(list(min_prices.keys()), list(min_prices.values()))
['Apple', 'Orange', 'Banana', 'Peach', 'Pineapple'] [1.0, 3.0, 0.5, 2.0, 1.0]

Probably the simplest way is to take advantage of dictionaries' enforcement of unique keys:
Products = ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
Prices = ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
# Sort by numeric price, most expensive first, so the cheapest price of each product is written last and wins.
# The key order of the resulting dict follows the sort, not the original list order.
final = dict(sorted(zip(Products, Prices), key=lambda pair: float(pair[1]), reverse=True))
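One caveat with sorting on the price strings themselves: strings compare character by character, so '10.00' sorts before '9.00'. Below is a minimal sketch of the same sort-then-dict idea with a numeric sort key. The sample data and the name `cheapest` are made up for illustration.

```python
# Hypothetical sample data: one price has two digits before the decimal point.
products = ['Apple', 'Apple', 'Pear']
prices = ['10.00', '9.00', '2.50']

# As strings, '10.00' < '9.00' because '1' < '9'; compared as floats the order is correct.
assert min(prices[:2]) == '10.00'
assert min(prices[:2], key=float) == '9.00'

# Sort most expensive first by numeric value; later (cheaper) entries
# overwrite earlier ones when the pairs are fed to dict().
pairs = sorted(zip(products, prices), key=lambda pair: float(pair[1]), reverse=True)
cheapest = dict(pairs)
print(cheapest)  # {'Apple': '9.00', 'Pear': '2.50'}
```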

What about this:
prices = map(float, prices)
r = {}
for k, v in zip(products, prices):
    if v < r.setdefault(k, float('inf')):
        r[k] = v
products, prices = list(r.keys()), list(map(str, r.values()))

Not the shortest solution, but it illustrates the point: Suppose that your lists are products and prices, respectively. Then:
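lookup = dict()
for prod, price in zip(products, prices):
    if prod not in lookup:
        lookup[prod] = price
    else:
        # compare numerically; plain min() on price strings would compare them character by character
        lookup[prod] = min(price, lookup[prod], key=float)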
At this point, the lookup dict contains each of your products, and its minimal price. A dict is certainly a better data structure for this than two lists; if you really want to have this as two separate lists instead, you can do something like this:
new_prods = []
new_prices = []
for prod, price in lookup.items():
    new_prods.append(prod)
    new_prices.append(price)
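Putting the pieces together as a function might look like the following. This is only a sketch, and the name `cheapest_per_product` is made up here. It relies on dicts preserving insertion order (Python 3.7+) and compares prices as `Decimal` values so that the original price strings are kept in the result.

```python
from decimal import Decimal

def cheapest_per_product(products, prices):
    """Return (unique_products, cheapest_prices), in first-seen product order."""
    lookup = {}
    for prod, price in zip(products, prices):
        # Keep the price string whose numeric value is smallest.
        if prod not in lookup or Decimal(price) < Decimal(lookup[prod]):
            lookup[prod] = price
    return list(lookup.keys()), list(lookup.values())

products = ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
prices = ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
new_prods, new_prices = cheapest_per_product(products, prices)
print(new_prods)   # ['Apple', 'Orange', 'Banana', 'Peach', 'Pineapple']
print(new_prices)  # ['1.00', '3.00', '0.50', '2.00', '1.00']
```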

>>> from collections import OrderedDict
>>> products = ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
>>> prices = ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
>>> dic = OrderedDict()
>>> for x,y in zip(products,prices):
...     dic.setdefault(x, []).append(y)
...
>>> list(dic.keys())
['Apple', 'Orange', 'Banana', 'Peach', 'Pineapple']
>>> [min(val, key = float) for val in dic.values()]
['1.00', '3.00', '0.50', '2.00', '1.00']

You can use a dictionary to do this:
Products = ['Apple', 'Apple', 'Apple', 'Orange', 'Banana', 'Banana', 'Peach', 'Pineapple', 'Pineapple']
Prices = ['1.00', '2.00', '1.50', '3.00', '0.50', '1.50', '2.00', '1.00', '1.00']
Prices = [float(price) for price in Prices]
di = {}
for prod, price in zip(Products, Prices):
    di.setdefault(prod, []).append(price)
for key, val in di.items():
    di[key] = min(val)
print(di)
Prints {'Orange': 3.0, 'Pineapple': 1.0, 'Apple': 1.0, 'Peach': 2.0, 'Banana': 0.5} (the key order may differ depending on the Python version)
If you want two lists in the same order, you can do this:
from collections import OrderedDict
new_prod = list(OrderedDict.fromkeys(Products))
new_prices = [di[item] for item in new_prod]
print(new_prod)
print(new_prices)
Prints:
['Apple', 'Orange', 'Banana', 'Peach', 'Pineapple']
[1.0, 3.0, 0.5, 2.0, 1.0]

Related

Finding most common elements from large amount of lists

I have a lot of lists, and I want to find the groups of elements that most commonly appear together in them.
For example:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
How can I find the groups of elements that appear in these lists? For example:
Group 1: ['apple', 'banana', 'orange'] in l1, l2, appear 2 times
Group 2: ['apple', 'grape', 'kiwi'] in l4, l5, appear 2 times
Group 3: ['apple', 'grape', 'orange'] in l2, l5, appear 2 times
Index of the element is not important. Group should have minimum 3 and maximum 5 elements.
Lists can have from 3 to 10 elements.
I know that I can do something like this with intersections, but what if I have a totally different list:
l7 = ["x", "y", "z", "k"]
Elements from this list do not appear in any other list.
I think this might help (I gave it a degree of freedom on the number of combinations to extract and the minimum number of elements in each combination):
import itertools
def get_max_rep(all_lists, min_comb_len, num_el_displayed):
    """Returns 'num_el_displayed' number of elements that repeat the most in the given lists"""
    # extract all the unique values of all the lists
    all_elements = set([el for l in all_lists for el in l])
    # build all the possible combinations starting from 'min_comb_len' number of elements
    combinations = [
        el for r in range(min(min_comb_len, len(all_elements)-1), len(all_elements))
        for el in itertools.combinations(all_elements, r)
    ]
    # count the number of lists that contain every element of each combination
    out = sorted(
        [(comb, sum([all(fruit in el for fruit in comb) for el in all_lists]))
         for comb in combinations], key=lambda x: x[1], reverse=True
    )[:num_el_displayed]
    return out
To test it out (here I want the first 5 combinations that have the most repetitions and that have a minimum of 2 elements):
# testing ...
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
all_lists = [l1, l2, l3, l4, l5, l6]
print(get_max_rep(all_lists, 2, 5))
output:
[(('grape', 'kiwi'), 3), (('grape', 'apple'), 3), (('orange', 'apple'), 3), (('banana', 'grape'), 2), (('banana', 'orange'), 2)]
The problem you are facing is called Frequent Itemsets Mining. The 2 most popular algorithms (at least that I know) that solve this problem are:
Apriori
FP-Growth
They are very well explained in the book Data Mining. Concepts and Techniques (3rd Edition). A library that implements apriori is apyori, and this is how you could use it:
from apyori import apriori
transactions = [['apple', 'banana', 'orange'],
                ['apple', 'banana', 'grape', 'lemon', 'orange'],
                ['banana', 'grape', 'kiwi'],
                ['apple', 'grape', 'kiwi', 'peach'],
                ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
                ['chery', 'kiwi', 'pear']]
results = list(map(lambda x: list(x.items),
                   filter(lambda x: len(x.items) >= 3,
                          apriori(transactions, min_support=2/len(transactions), max_length=5)
                          )
                   ))
print(results)
It returns:
[['banana', 'apple', 'orange'], ['grape', 'kiwi', 'apple'], ['grape', 'apple', 'orange']]
About the arguments of the apriori call:
min_support is the minimum frequency a certain itemset must have. In your case, it is 2/len(transactions) because the itemset must be present in at least 2 of the transactions
max_length is the maximum length of an itemset
I had to filter the result because otherwise I would get also the itemsets whose length is less than 3. (It is weird that there is no argument for the min length too, maybe I did not find it but the help(apyori.apriori) does not mention anything like that).
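For small inputs like the ones in the question, the same groups can also be counted without a library. The following is a brute-force sketch, not Apriori itself: it enumerates every group of 3 to 5 elements inside each list and counts them with `collections.Counter`. The function name `frequent_groups` and its parameters are made up for this example.

```python
from collections import Counter
from itertools import combinations

def frequent_groups(all_lists, min_size=3, max_size=5, min_count=2):
    """Return the groups of min_size..max_size elements found in at least min_count lists."""
    counts = Counter()
    for lst in all_lists:
        items = sorted(set(lst))  # sorted so the same group always yields the same tuple
        for size in range(min_size, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    return {group: n for group, n in counts.items() if n >= min_count}

all_lists = [['apple', 'banana', 'orange'],
             ['apple', 'banana', 'grape', 'lemon', 'orange'],
             ['banana', 'grape', 'kiwi'],
             ['apple', 'grape', 'kiwi', 'peach'],
             ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
             ['chery', 'kiwi', 'pear']]
groups = frequent_groups(all_lists)
for group, n in sorted(groups.items()):
    print(group, n)
# ('apple', 'banana', 'orange') 2
# ('apple', 'grape', 'kiwi') 2
# ('apple', 'grape', 'orange') 2
```

With lists of at most 10 elements this enumerates a few hundred groups per list, which is manageable. For longer lists the number of combinations grows quickly, and a proper frequent-itemset algorithm is the better choice.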
You could do something like this using a powerset of the lists, excluding the combinations that are not in the range of 3-5 lists.
Note: for the powerset function I give credit to this other answer from martineu (linked in the code below). It uses itertools.chain.from_iterable and itertools.combinations to get the combinations. From the docs:
itertools.chain.from_iterable(iterable)
Alternate constructor for chain(). Gets chained inputs from a single iterable argument that is evaluated lazily.
I use functools.reduce as a workaround to compute the intersections. From the docs:
functools.reduce(function, iterable[, initializer])
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
With these, I have come up with this code that I think might help:
from itertools import chain, combinations
from functools import reduce
from pprint import pprint
lsts = [['apple', 'banana', 'orange'],
        ['apple', 'banana', 'grape', 'lemon', 'orange'],
        ['banana', 'grape', 'kiwi'],
        ['apple', 'grape', 'kiwi', 'peach'],
        ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
        ['chery', 'kiwi', 'pear']]
lsts = tuple(map(set, lsts))
def powerset(iterable, start=0, stop=None): #from answer https://stackoverflow.com/a/40986475/13285707
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable) # allows duplicate elements
    if stop is None:
        stop = len(s)+1
    return chain.from_iterable(combinations(s, r) for r in range(start, stop))
groups = []
for x in powerset(lsts, 3, 5):
    # keep the intersections that have between 2 and 4 elements
    if 1 < len(c:=reduce(set.intersection,x)) < 5:
        groups.append(c)
pprint(groups)
Output:
[{'apple', 'orange'}, {'apple', 'grape'}, {'grape', 'kiwi'}]
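To see what `reduce(set.intersection, ...)` does on its own, here is a small sketch using three of the lists from the question. The variable names are chosen just for this example.

```python
from functools import reduce

sets = [{'apple', 'banana', 'orange'},
        {'apple', 'banana', 'grape', 'lemon', 'orange'},
        {'apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'}]

# reduce applies set.intersection pairwise, from left to right:
# (sets[0] & sets[1]) & sets[2]
common = reduce(set.intersection, sets)
print(sorted(common))  # ['apple', 'orange']

# set.intersection also accepts several sets at once, which gives the same result
assert common == set.intersection(*sets)
```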
My solution:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
# return the elements common to two lists
def sub_eq_lists(l1, l2):
    return list(set([x for x in l1 + l2 if x in l1 and x in l2]))
all_lists = [l1, l2, l3, l4, l5, l6]
results = []
for j in all_lists:
    for l in all_lists:
        if l != j:
            same_elements = sub_eq_lists(j, l)
            if (len(same_elements) > 2) and (len(same_elements) < 11):  # your conditions
                results.append(same_elements)
# count occurrences
results = {str(item): results.count(item) for item in results}
results
Output:
{"['apple', 'orange', 'banana']": 2,
"['apple', 'orange', 'grape']": 2,
"['apple', 'kiwi', 'grape']": 2}
You can use A.issubset(B) to check whether all elements of A are present in B; it returns True if they are and False otherwise.
def count1(grp=None, list_num=None):
    return grp.issubset(list_num)
if __name__ == "__main__":
    g1 = {'apple', 'banana', 'orange'}
    g2 = {'apple', 'grape', 'kiwi'}
    g3 = {'apple', 'grape', 'orange'}
    l1 = ['apple', 'banana', 'orange']
    l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
    l3 = ['banana', 'grape', 'kiwi']
    l4 = ['apple', 'grape', 'kiwi', 'peach']
    l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
    l6 = ['chery', 'kiwi', 'pear']
    group_count = {}
    groups = [g1, g2, g3]
    list_check = [l1, l2, l3, l4, l5, l6]
    for index_list, list_num in enumerate(list_check):
        for index, grp in enumerate(groups):
            if count1(grp, list_num):
                group_count.setdefault(f'g{index + 1}', []).append(f'l{index_list + 1}')
    print(group_count)
Output:
{'g1': ['l1', 'l2'], 'g3': ['l2', 'l5'], 'g2': ['l4', 'l5']}
This is what I understood from the question.

Python - Search in a dictionary

I need to look for a specific value within a variable.
market = {'fruit': [{'fruit_id': '25', 'fruit': 'banana', 'weight': 1.00}, {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.50}, {'fruit_id': '5', 'fruit': 'pear', 'weight': 2.00}]}
#print(type(market))
<class 'dict'>
How can I find the fruit whose fruit_id is '15'?
You can iterate over the list stored under the 'fruit' key of the market dict and search it:
for fr in market['fruit']:
    if fr['fruit_id'] == '15':
        ans = fr
print(ans)
# prints: {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.5}
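A variation on the loop above, in case the id might not exist: `next()` with a generator expression stops at the first match and returns a default instead of leaving `ans` undefined. This is just a sketch, and the helper name `find_fruit` is made up for illustration.

```python
market = {'fruit': [{'fruit_id': '25', 'fruit': 'banana', 'weight': 1.00},
                    {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.50},
                    {'fruit_id': '5', 'fruit': 'pear', 'weight': 2.00}]}

def find_fruit(market, fruit_id):
    """Return the first fruit dict with the given id, or None if there is none."""
    return next((fr for fr in market['fruit'] if fr['fruit_id'] == fruit_id), None)

print(find_fruit(market, '15'))  # {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.5}
print(find_fruit(market, '99'))  # None
```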
In addition to the comments, you can create a function to search in your market dictionary:
market = {'fruit': [{'fruit_id': '25', 'fruit': 'banana', 'weight': 1.0},
                    {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.5},
                    {'fruit_id': '5', 'fruit': 'pear', 'weight': 2.0}]}
def search_by_id(fruit_id):
    for fruit in market['fruit']:
        if fruit['fruit_id'] == fruit_id:
            return fruit['fruit']
How to use it:
>>> search_by_id('15')
'apple'
>>> search_by_id('5')
'pear'
If you need to search for fruits by id more than once, consider creating a dict where the id is the key:
>>> market = {'fruit': [
... {'fruit_id': '25', 'fruit': 'banana', 'weight': 1.00},
... {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.50},
... {'fruit_id': '5', 'fruit': 'pear', 'weight': 2.00}
... ]}
>>>
>>> fruits_by_id = {f['fruit_id']: f for f in market['fruit']}
>>> fruits_by_id['15']
{'fruit_id': '15', 'fruit': 'apple', 'weight': 1.5}
Once you have a dict where a particular piece of data is the key, locating that piece of data by the key is easy, both for you and the computer (it's "constant time", aka effectively instantaneous, to locate an item in a dict by its key, whereas iterating through an entire dict takes an amount of time depending on how big the dict is).
If you aren't constrained in how market is defined, and your program is going to be looking up items by their id most of the time, it might make more sense to simply make market['fruit'] a dict up front (keyed on id) rather than having it be a list. Consider the following representation:
>>> market = {'fruit': {
... 25: {'name': 'banana', 'weight': 1.00},
... 15: {'name': 'apple', 'weight': 1.50},
... 5: {'name': 'pear', 'weight': 2.00}
... }}
>>> market['fruit'][15]
{'name': 'apple', 'weight': 1.5}
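If the data arrives in the list form, converting it to the dict form described above could be sketched like this. The name `market_by_id` is made up for this example, and the conversion assumes every 'fruit_id' is a string of digits.

```python
market = {'fruit': [
    {'fruit_id': '25', 'fruit': 'banana', 'weight': 1.00},
    {'fruit_id': '15', 'fruit': 'apple', 'weight': 1.50},
    {'fruit_id': '5', 'fruit': 'pear', 'weight': 2.00},
]}

# Re-key the list on the integer id and rename 'fruit' to 'name'.
market_by_id = {'fruit': {
    int(f['fruit_id']): {'name': f['fruit'], 'weight': f['weight']}
    for f in market['fruit']
}}

print(market_by_id['fruit'][15])      # {'name': 'apple', 'weight': 1.5}
# .get returns None instead of raising KeyError for an unknown id
print(market_by_id['fruit'].get(99))  # None
```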

How to create an alignment matrix given two lists

I have two nested lists, a_lis and b_lis, that contain the same items: each sublist of a_lis is the reversed form of the corresponding sublist of b_lis. For example:
['Berries', 'grapes', 'lemon', 'Orange', 'Apple']
and
['Apple', 'Orange', 'lemon', 'grapes', 'Berries']
The full lists a_lis and b_lis are:
a_lis = [['Berries', 'grapes', 'lemon', 'Orange', 'Apple'],
         ['Apricots', 'peach', 'grapes', 'lemon', 'Orange', 'Apple'],
         [1, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple'],
         ['pumpkin', 'avocados', 'strawberries', 'lemon', 'Orange', 'Apple'],
         [3, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple']]
and
b_lis = [['Apple', 'Orange', 'lemon', 'grapes', 'Berries'],
         ['Apple', 'Orange', 'lemon', 'grapes', 'peach', 'Apricots'],
         ['Apple', 'Orange', 'lemon', 'strawberries', 'Melons', 1],
         ['Apple', 'Orange', 'lemon', 'strawberries', 'avocados', 'pumpkin'],
         ['Apple', 'Orange', 'lemon', 'strawberries', 'Melons', 3]]
How can I align them into a 2-dimensional nested list with all the possible alignments, if and only if the lists are different? For example, ['Berries', 'grapes', 'lemon', 'Orange', 'Apple'] and ['Apple', 'Orange', 'lemon', 'grapes', 'Berries'] should not be concatenated because they are the same (i.e. the first one is the reversed version of the other). The expected output should look like this (*):
So far, I first tried to create a function that tells me whether two lists contain the same items regardless of their positions:
def sequences_contain_same_items(a, b):
    for item in a:
        try:
            i = b.index(item)
        except ValueError:
            return False
        b = b[:i] + b[i+1:]
    return not b
Then I iterated over the lists:
lis = []
for f, b in zip(a_lis, b_lis):
    #print(f, b)
    lis.append(f)
    lis.append(b)
print(lis)
However, I do not get how to produce the alignment output list. What I do not understand is if product is the right operation to apply here. Any idea of how to produce (*)?
a_lis = [['Berries', 'grapes', 'lemon', 'Orange', 'Apple'],
         ['Apricots', 'peach', 'grapes', 'lemon', 'Orange', 'Apple'],
         [1, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple'],
         ['pumpkin', 'avocados', 'strawberries', 'lemon', 'Orange', 'Apple'],
         [3, 'Melons', 'strawberries', 'lemon', 'Orange', 'Apple']]
reva = [k[::-1] for k in a_lis]
m = []
for i, v in enumerate(a_lis):
    for i1, v1 in enumerate(reva):
        # skip the pair where v1 is just v reversed
        if i != i1:
            m.append(v)
            m.append(v1)
print(m)
In a more compact way,
# flat list, same result as the loop above
m = sum([[v, v1] for i, v in enumerate(a_lis) for i1,v1 in enumerate(reva) if i!=i1], [])
# or keep each alignment as a [list, reversed list] pair
m = [[v, v1] for i, v in enumerate(a_lis) for i1,v1 in enumerate(reva) if i!=i1]
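On the question of whether product fits here: `itertools.product` yields every (a, b) pairing, so one possible sketch is to take the product of the two nested lists and skip the pairs where one sublist is just the other reversed. The inputs below are shortened, hypothetical versions of a_lis and b_lis.

```python
from itertools import product

# Shortened, hypothetical inputs: each b sublist is the reverse of the a sublist at the same position.
a_lis = [['Berries', 'grapes'], ['Apricots', 'peach'], [1, 'Melons']]
b_lis = [sub[::-1] for sub in a_lis]

# Pair every sublist of a_lis with every sublist of b_lis,
# skipping pairs where one is just the other reversed.
aligned = [[a, b] for a, b in product(a_lis, b_lis) if a != b[::-1]]
for pair in aligned:
    print(pair)
```

Unlike the index-based loop, this compares the contents, so it also skips a reversed match that sits at a different position in the other list.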

merge a string column to a set of list using Python

I have a Pandas DataFrame like this :
id  fruits
01  Apple, Apricot
02  Apple, Banana, Clementine, Pear
03  Orange, Pineapple, Pear
How can I get a list of fruits like this, with the duplicates removed?
['Apple','Apricot','Banana','Clementine','Orange','Pear','Pineapple']
You can flatten the lists created by split, convert the result to a set to keep only unique values, and finally convert it back to a list:
a = list(set([item for sublist in df['fruits'].str.split(', ') for item in sublist]))
print (a)
['Pineapple', 'Clementine', 'Apple', 'Banana', 'Apricot', 'Orange', 'Pear']
Or:
a = df['fruits'].str.split(', ', expand=True).stack().drop_duplicates().tolist()
print (a)
['Apple', 'Apricot', 'Banana', 'Clementine', 'Pear', 'Orange', 'Pineapple']
Thanks @kabanus for an alternative:
a = list(set(sum(df['fruits'].str.split(', '),[])))
Using str.extractall and drop_duplicates:
df.fruits.str.extractall(r'(\w+)').drop_duplicates()[0].tolist()
Output:
['Apple', 'Apricot', 'Banana', 'Clementine', 'Pear', 'Orange', 'Pineapple']
Try this:
set(', '.join(df['fruits']).split(', '))
Output:
set(['Apple', 'Apricot', 'Pear', 'Pineapple', 'Orange', 'Banana', 'Clementine'])
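A set does not keep any particular order. If first-seen or alphabetical order matters, a plain-Python sketch could look like the following. It assumes the column has already been pulled out as a list of strings (for example with df['fruits'].tolist()), and `rows` is a hypothetical stand-in for that list.

```python
# Hypothetical stand-in for df['fruits'].tolist()
rows = ['Apple, Apricot', 'Apple, Banana, Clementine, Pear', 'Orange, Pineapple, Pear']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_in_order = list(dict.fromkeys(
    fruit for row in rows for fruit in row.split(', ')
))
print(unique_in_order)
# ['Apple', 'Apricot', 'Banana', 'Clementine', 'Pear', 'Orange', 'Pineapple']

# sorted() gives the alphabetical list shown in the question
print(sorted(unique_in_order))
# ['Apple', 'Apricot', 'Banana', 'Clementine', 'Orange', 'Pear', 'Pineapple']
```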

How to get the index of filtered item in list using lambda?

I have a list of fruits: [{'name': 'apple', 'qty': 233}, {'name': 'orange', 'qty': '441'}]
When I filter the list for orange using a lambda, list(filter(lambda x: x['name']=='orange', fruits)), I get the right dict, but I cannot get the index of that dict in the original list. The index should be 1, not 0.
How do I get the right index of the filtered item?
You can use a list comprehension and enumerate() instead:
>>> fruits = [{'name': 'apple', 'qty': 233}, {'name': 'orange', 'qty': '441'}]
>>> [(idx, fruit) for idx, fruit in enumerate(fruits) if fruit['name'] == 'orange']
[(1, {'name': 'orange', 'qty': '441'})]
As @ChrisRands posted in the comments, you could also use filter by creating an enumerate object for your fruits list:
>>> list(filter(lambda fruit: fruit[1]['name'] == 'orange', enumerate(fruits)))
[(1, {'name': 'orange', 'qty': '441'})]
Here are some timings for the two methods:
>>> from timeit import timeit
>>> setup = \
... "fruits = [{'name': 'apple', 'qty': 233}, {'name': 'orange', 'qty': '441'}]"
>>> listcomp = \
... "[(idx, fruit) for idx, fruit in enumerate(fruits) if fruit['name'] == 'orange']"
>>> filter_lambda = \
... "list(filter(lambda fruit: fruit[1]['name'] == 'orange', enumerate(fruits)))"
>>> timeit(setup=setup, stmt=listcomp)
1.0297133629997006
>>> timeit(setup=setup, stmt=filter_lambda)
1.6447856079998928
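If only the index of the first match is needed, `next()` over `enumerate()` avoids building a list at all. This is a small sketch, and the helper name `index_of` is made up for illustration.

```python
fruits = [{'name': 'apple', 'qty': 233}, {'name': 'orange', 'qty': '441'}]

def index_of(fruits, name):
    """Return the index of the first fruit with the given name, or None."""
    return next((idx for idx, fruit in enumerate(fruits) if fruit['name'] == name), None)

print(index_of(fruits, 'orange'))  # 1
print(index_of(fruits, 'kiwi'))    # None
```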
