Find the most common element in a list - python

What is an efficient way to find the most common element in a Python list?
My list items may not be hashable, so I can't use a dictionary.
Also in case of draws the item with the lowest index should be returned. Example:
>>> most_common(['duck', 'duck', 'goose'])
'duck'
>>> most_common(['goose', 'duck', 'duck', 'goose'])
'goose'

A simpler one-liner:
def most_common(lst):
    return max(set(lst), key=lst.count)

Borrowing from here, this can be used with Python 2.7:
from collections import Counter
def Most_Common(lst):
    data = Counter(lst)
    return data.most_common(1)[0][0]
Works around 4-6 times faster than Alex's solutions, and is 50 times faster than the one-liner proposed by newacct.
On CPython 3.6+ (any Python 3.7+) the above will select the first seen element in case of ties. If you're running on older Python, to retrieve the element that occurs first in the list in case of ties you need to do two passes to preserve order:
# Only needed pre-3.6!
def most_common(lst):
    data = Counter(lst)
    return max(lst, key=data.get)

With so many solutions proposed, I'm amazed nobody's proposed what I'd consider an obvious one (for non-hashable but comparable elements) -- itertools.groupby. itertools offers fast, reusable functionality, and lets you delegate some tricky logic to well-tested standard library components. Consider for example:
import itertools
import operator
def most_common(L):
    # get an iterable of (item, iterable) pairs
    SL = sorted((x, i) for i, x in enumerate(L))
    # print 'SL:', SL
    groups = itertools.groupby(SL, key=operator.itemgetter(0))
    # auxiliary function to get "quality" for an item
    def _auxfun(g):
        item, iterable = g
        count = 0
        min_index = len(L)
        for _, where in iterable:
            count += 1
            min_index = min(min_index, where)
        # print 'item %r, count %r, minind %r' % (item, count, min_index)
        return count, -min_index
    # pick the highest-count/earliest item
    return max(groups, key=_auxfun)[0]
This could be written more concisely, of course, but I'm aiming for maximal clarity. The two print statements can be uncommented to better see the machinery in action; for example, with prints uncommented:
print most_common(['goose', 'duck', 'duck', 'goose'])
emits:
SL: [('duck', 1), ('duck', 2), ('goose', 0), ('goose', 3)]
item 'duck', count 2, minind 1
item 'goose', count 2, minind 0
goose
As you see, SL is a list of pairs, each pair an item followed by the item's index in the original list (to implement the key condition that, if the "most common" items with the same highest count are > 1, the result must be the earliest-occurring one).
groupby groups by the item only (via operator.itemgetter). The auxiliary function, called once per grouping during the max computation, receives and internally unpacks a group - a tuple with two items (item, iterable) where the iterable's items are also two-item tuples, (item, original index) [[the items of SL]].
Then the auxiliary function uses a loop to determine both the count of entries in the group's iterable, and the minimum original index; it returns those as combined "quality key", with the min index sign-changed so the max operation will consider "better" those items that occurred earlier in the original list.
This code could be much simpler if it worried a little less about big-O issues in time and space, e.g....:
def most_common(L):
    groups = itertools.groupby(sorted(L))
    def _auxfun(group):
        # unpack explicitly: tuple parameters in a def are Python 2-only syntax
        item, iterable = group
        return len(list(iterable)), -L.index(item)
    return max(groups, key=_auxfun)[0]
same basic idea, just expressed more simply and compactly... but, alas, an extra O(N) auxiliary space (to embody the groups' iterables to lists) and O(N squared) time (to get the L.index of every item). While premature optimization is the root of all evil in programming, deliberately picking an O(N squared) approach when an O(N log N) one is available just goes too much against the grain of scalability!-)
Finally, for those who prefer "oneliners" to clarity and performance, a bonus 1-liner version with suitably mangled names:-).
from itertools import groupby as g
def most_common_oneliner(L):
    return max(g(sorted(L)), key=lambda xv: (len(list(xv[1])), -L.index(xv[0])))[0]

What you want is known in statistics as the mode, and Python's standard library has a function that does exactly that for you:
>>> from statistics import mode
>>> mode([1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6])
3
Note that if there is no single "most common element", such as when the top two are tied, this will raise StatisticsError on Python <= 3.7, and on 3.8 onwards it will return the first one encountered.
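The tie behaviour can be checked directly. This is a small sketch that assumes Python 3.8 or later (where statistics.multimode is also available); the variable names are only illustrative.

```python
from statistics import mode, multimode

data = ['goose', 'duck', 'duck', 'goose']

# On Python 3.8+, mode() returns the first of the tied values it encounters.
first_mode = mode(data)
# multimode() returns every tied value, in first-seen order.
all_modes = multimode(data)

print(first_mode, all_modes)
```

Both functions hash the items, so this only helps when the list elements are hashable.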

Without the requirement about the lowest index, you can use collections.Counter for this:
from collections import Counter
a = [1936, 2401, 2916, 4761, 9216, 9216, 9604, 9801]
c = Counter(a)
print(c.most_common(1)) # the one most common element... 2 would mean the 2 most common
[(9216, 2)] # a list containing one (element, count) tuple: the element and its count in 'a'

If they are not hashable, you can sort them and do a single loop over the result counting the items (identical items will be next to each other). But it might be faster to make them hashable and use a dict.
def most_common(lst):
    cur_length = 0
    max_length = 0
    cur_i = 0
    max_i = 0
    cur_item = None
    max_item = None
    for i, item in sorted(enumerate(lst), key=lambda x: x[1]):
        if cur_item is None or cur_item != item:
            if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
                max_length = cur_length
                max_i = cur_i
                max_item = cur_item
            cur_length = 1
            cur_i = i
            cur_item = item
        else:
            cur_length += 1
    if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
        return cur_item
    return max_item

This is an O(n) solution.
mydict = {}
cnt, itm = 0, ''
for item in reversed(lst):
    mydict[item] = mydict.get(item, 0) + 1
    if mydict[item] >= cnt:
        cnt, itm = mydict[item], item
print(itm)
(reversed is used to make sure that it returns the lowest index item)

Sort a copy of the list and find the longest run. You can decorate the list before sorting it with the index of each element, and then choose the run that starts with the lowest index in the case of a tie.
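A minimal sketch of that idea, assuming the items are mutually comparable (they need not be hashable). The function name most_common_by_runs is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def most_common_by_runs(items):
    """Most common item; ties go to the item that appears first."""
    if not items:
        raise ValueError("empty input")
    # Decorate each item with its original index, then sort by item only.
    # sorted() is stable, so inside each run the indices stay ascending.
    decorated = sorted(enumerate(items), key=itemgetter(1))
    best_item, best_count, best_index = None, 0, len(items)
    for item, run in groupby(decorated, key=itemgetter(1)):
        run = list(run)
        count, first_index = len(run), run[0][0]
        # Longer run wins; equal runs are decided by the lower first index.
        if count > best_count or (count == best_count and first_index < best_index):
            best_item, best_count, best_index = item, count, first_index
    return best_item

print(most_common_by_runs(['goose', 'duck', 'duck', 'goose']))
```

The sort dominates the cost, so this runs in O(n log n) time with O(n) extra space for the decorated copy.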

A one-liner:
def most_common(lst):
    return max(((item, lst.count(item)) for item in set(lst)), key=lambda a: a[1])[0]

I am doing this using the scipy.stats module and a lambda:
import scipy.stats
lst = [1,2,3,4,5,6,7,5]
most_freq_val = lambda x: scipy.stats.mode(x)[0][0]
print(most_freq_val(lst))
Result:
5

# use Decorate, Sort, Undecorate to solve the problem
def most_common(iterable):
    # Make a list with tuples: (item, index)
    # The index will be used later to break ties for most common item.
    lst = [(x, i) for i, x in enumerate(iterable)]
    lst.sort()
    # lst_final will also be a list of tuples: (count, index, item)
    # Sorting on this list will find us the most common item, and the index
    # will break ties so the one listed first wins. Count is negative so
    # largest count will have lowest value and sort first.
    lst_final = []
    # Get an iterator for our new list...
    itr = iter(lst)
    # ...and pop the first tuple off. Setup current state vars for loop.
    count = 1
    tup = next(itr)
    x_cur, i_cur = tup
    # Loop over sorted list of tuples, counting occurrences of item.
    for tup in itr:
        # Same item again?
        if x_cur == tup[0]:
            # Yes, same item; increment count
            count += 1
        else:
            # No, new item, so write previous current item to lst_final...
            t = (-count, i_cur, x_cur)
            lst_final.append(t)
            # ...and reset current state vars for loop.
            x_cur, i_cur = tup
            count = 1
    # Write final item after loop ends
    t = (-count, i_cur, x_cur)
    lst_final.append(t)
    lst_final.sort()
    answer = lst_final[0][2]
    return answer
print(most_common(['x', 'e', 'a', 'e', 'a', 'e', 'e'])) # prints 'e'
print(most_common(['goose', 'duck', 'duck', 'goose'])) # prints 'goose'

Building on Luiz's answer, but satisfying the "in case of draws the item with the lowest index should be returned" condition:
from statistics import mode, StatisticsError
def most_common(l):
    try:
        return mode(l)
    except StatisticsError as e:
        # will only return the first element if no unique mode found
        # (Python <= 3.7; note that l[0] need not be one of the tied modes)
        if 'no unique mode' in e.args[0]:
            return l[0]
        # this is for "StatisticsError: no mode for empty data"
        # after calling mode([])
        raise
Example:
>>> most_common(['a', 'b', 'b'])
'b'
>>> most_common([1, 2])
1
>>> most_common([])
StatisticsError: no mode for empty data

Simple one-line solution:
moc = max([(lst.count(x), x) for x in set(lst)])
It returns the most frequent element together with its frequency, as a (count, element) tuple.

You probably don't need this anymore, but this is what I did for a similar problem. (It looks longer than it is because of the comments.)
itemList = ['hi', 'hi', 'hello', 'bye']
counter = {}
maxItemCount = 0
for item in itemList:
    try:
        # Referencing this will cause a KeyError exception
        # if it doesn't already exist
        counter[item]
        # ... meaning if we get this far it didn't happen so
        # we'll increment
        counter[item] += 1
    except KeyError:
        # If we got a KeyError we need to create the
        # dictionary key
        counter[item] = 1
    # Keep overwriting maxItemCount with the latest number,
    # if it's higher than the existing itemCount
    if counter[item] > maxItemCount:
        maxItemCount = counter[item]
        mostPopularItem = item
print(mostPopularItem)

ans = [1, 1, 0, 0, 1, 1]
all_ans = {ans.count(ans[i]): ans[i] for i in range(len(ans))}
print(all_ans)  # {4: 1, 2: 0}
max_key = max(all_ans.keys())  # 4
print(all_ans[max_key])  # 1

import numpy as np
# This will return the list sorted by frequency:
def orderByFrequency(list):
    listUniqueValues = np.unique(list)
    listQty = []
    listOrderedByFrequency = []
    for i in range(len(listUniqueValues)):
        listQty.append(list.count(listUniqueValues[i]))
    for i in range(len(listQty)):
        index_bigger = np.argmax(listQty)
        for j in range(listQty[index_bigger]):
            listOrderedByFrequency.append(listUniqueValues[index_bigger])
        listQty[index_bigger] = -1
    return listOrderedByFrequency
# And this will return a list with the most frequent values in a list:
def getMostFrequentValues(list):
    if (len(list) <= 1):
        return list
    list_most_frequent = []
    list_ordered_by_frequency = orderByFrequency(list)
    list_most_frequent.append(list_ordered_by_frequency[0])
    frequency = list_ordered_by_frequency.count(list_ordered_by_frequency[0])
    index = 0
    while(index < len(list_ordered_by_frequency)):
        index = index + frequency
        if(index < len(list_ordered_by_frequency)):
            testValue = list_ordered_by_frequency[index]
            testValueFrequency = list_ordered_by_frequency.count(testValue)
            if (testValueFrequency == frequency):
                list_most_frequent.append(testValue)
            else:
                break
    return list_most_frequent
#tests:
print(getMostFrequentValues([]))
print(getMostFrequentValues([1]))
print(getMostFrequentValues([1,1]))
print(getMostFrequentValues([2,1]))
print(getMostFrequentValues([2,2,1]))
print(getMostFrequentValues([1,2,1,2]))
print(getMostFrequentValues([1,2,1,2,2]))
print(getMostFrequentValues([3,2,3,5,6,3,2,2]))
print(getMostFrequentValues([1,2,2,60,50,3,3,50,3,4,50,4,4,60,60]))
Results:
[]
[1]
[1]
[1, 2]
[2]
[1, 2]
[2]
[2, 3]
[3, 4, 50, 60]

Here:
def most_common(l):
    max = 0
    maxitem = None
    for x in set(l):
        count = l.count(x)
        if count > max:
            max = count
            maxitem = x
    return maxitem
I have a vague feeling there is a method somewhere in the standard library that will give you the count of each element, but I can't find it.

This is the obvious slow solution (O(n^2)) if neither sorting nor hashing is feasible, but equality comparison (==) is available:
def most_common(items):
    if not items:
        raise ValueError
    # each entry is [item, count, position of the entry in fitems]
    fitems = []
    best_idx = 0
    for item in items:
        item_missing = True
        i = 0
        for fitem in fitems:
            if fitem[0] == item:
                fitem[1] += 1
                d = fitem[1] - fitems[best_idx][1]
                if d > 0 or (d == 0 and fitems[best_idx][2] > fitem[2]):
                    best_idx = i
                item_missing = False
                break
            i += 1
        if item_missing:
            fitems.append([item, 1, i])
    # best_idx indexes fitems, not items
    return fitems[best_idx][0]
But making your items hashable or sortable (as recommended by other answers) would almost always make finding the most common element faster if the length of your list (n) is large. O(n) on average with hashing, and O(n*log(n)) at worst for sorting.
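As a rough illustration of "making your items hashable", the sketch below counts through a hashable stand-in for each item. It assumes the items are flat lists, so tuple() is a faithful key; the function name and the key parameter are hypothetical, not part of any answer here.

```python
from collections import Counter

def most_common_via_keys(items, key=tuple):
    """Most common item, counted through a hashable stand-in made by `key`."""
    if not items:
        raise ValueError("empty input")
    # Count the hashable stand-ins in one O(n) pass.
    counts = Counter(key(item) for item in items)
    # max() returns the first maximal element it sees, so ties go to the
    # item with the lowest index in the original list.
    return max(items, key=lambda item: counts[key(item)])

print(most_common_via_keys([[1, 2], [3], [3], [1, 2]]))
```

For nested or otherwise unhashable structures a different key function would be needed, and it must map equal items to equal keys for the counts to be right.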

>>> li = ['goose', 'duck', 'duck']
>>> def foo(li):
        st = set(li)
        mx = -1
        for each in st:
            temp = li.count(each)
            if mx < temp:
                mx = temp
                h = each
        return h
>>> foo(li)
'duck'

I needed to do this in a recent program. I'll admit it, I couldn't understand Alex's answer, so this is what I ended up with.
def mostPopular(l):
    mpEl=None
    mpIndex=0
    mpCount=0
    curEl=None
    curCount=0
    for i, el in sorted(enumerate(l), key=lambda x: (x[1], x[0]), reverse=True):
        curCount=curCount+1 if el==curEl else 1
        curEl=el
        if curCount>mpCount \
        or (curCount==mpCount and i<mpIndex):
            mpEl=curEl
            mpIndex=i
            mpCount=curCount
    return mpEl, mpCount, mpIndex
I timed it against Alex's solution and it's about 10-15% faster for short lists, but once you go over 100 elements or more (tested up to 200000) it's about 20% slower.

def most_frequent(List):
    counter = 0
    num = List[0]
    for i in List:
        curr_frequency = List.count(i)
        if(curr_frequency> counter):
            counter = curr_frequency
            num = i
    return num
List = [2, 1, 2, 2, 1, 3]
print(most_frequent(List))

Hi, this is a very simple solution. Note that it runs in quadratic time, not linear: L.count is itself O(n) and is called once per element.
L = ['goose', 'duck', 'duck']
def most_common(L):
    current_winner = 0
    max_repeated = None
    for i in L:
        amount_times = L.count(i)
        if amount_times > current_winner:
            current_winner = amount_times
            max_repeated = i
    return max_repeated
print(most_common(L))
duck
Here max_repeated is the element in the list that is repeated the most.

numbers = [1, 3, 7, 4, 3, 0, 3, 6, 3]
max_repeat_num = max(numbers, key=numbers.count)  # which number occurs most frequently
max_repeat = numbers.count(max_repeat_num)  # how many times
print(f"the number {max_repeat_num} is repeated {max_repeat} times")

def mostCommonElement(list):
    count = {}  # dict holder
    max = 0  # keep track of the count by key
    result = None  # holder when count is greater than max
    for i in list:
        if i not in count:
            count[i] = 1
        else:
            count[i] += 1
        if count[i] > max:
            max = count[i]
            result = i
    return result
mostCommonElement(["a","b","a","c"])  # -> "a"

If the most common element is known to be a majority element, i.e. one that appears more than N/2 times in the array (N being len(array)), the technique below finds it in O(n) time. Note that Counter needs auxiliary space proportional to the number of distinct elements, not O(1).
from collections import Counter
def majorityElement(arr):
    majority_elem = Counter(arr)
    size = len(arr)
    for key, val in majority_elem.items():
        if val > size/2:
            return key
    return -1
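For a majority element specifically, O(n) time with O(1) auxiliary space is achievable with the Boyer-Moore majority vote algorithm. That is a different technique from the Counter-based code above; the sketch below is one way to write it, and it only needs == on the items.

```python
def majority_element(arr):
    """Return the element occurring more than len(arr)/2 times, or None."""
    candidate, balance = None, 0
    # Pass 1: cancel out pairs of differing elements; a true majority
    # element always survives as the final candidate.
    for x in arr:
        if balance == 0:
            candidate, balance = x, 1
        elif x == candidate:
            balance += 1
        else:
            balance -= 1
    # Pass 2: verify, because pass 1 yields a candidate even when no
    # element is a majority.
    if arr and sum(1 for x in arr if x == candidate) > len(arr) / 2:
        return candidate
    return None

print(majority_element([1, 2, 2]))
```

Unlike the general "most common element" problem, this returns None when no element exceeds half the list, for example on [2, 1, 2, 2, 1, 3].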

def most_common(lst):
    if max([lst.count(i)for i in lst]) == 1:
        return False
    else:
        return max(set(lst), key=lst.count)

def popular(L):
    C={}
    for a in L:
        C[a]=L.count(a)
    for b in C.keys():
        if C[b]==max(C.values()):
            return b
L=[2,3,5,3,6,3,6,3,6,3,7,467,4,7,4]
print(popular(L))

Related

What is the most efficient way of getting the intersection of k sorted arrays?

Given k sorted arrays, what is the most efficient way of getting the intersection of these lists?
Example
INPUT:
[[1,3,5,7], [1,1,3,5,7], [1,4,7,9]]
Output:
[1,7]
There is a way to get the union of k sorted arrays in n log k time, based on what I read in the book Elements of Programming Interviews. I was wondering if there is a way to do something similar for the intersection as well.
## merge sorted arrays in nlogk time [ regular appending and merging is nlogn time ]
import heapq
def mergeArys(srtd_arys):
    heap = []
    srtd_iters = [iter(x) for x in srtd_arys]
    # put the first element from each srtd array onto the heap
    for idx, it in enumerate(srtd_iters):
        elem = next(it, None)
        if elem:
            heapq.heappush(heap, (elem, idx))
    res = []
    # collect results in nlogK time
    while heap:
        elem, ary = heapq.heappop(heap)
        it = srtd_iters[ary]
        res.append(elem)
        nxt = next(it, None)
        if nxt:
            heapq.heappush(heap, (nxt, ary))
    return res
EDIT: obviously this is an algorithm question that I am trying to solve, so I cannot use any of the built-in functions like set intersection, etc.
Exploiting sort order
Here is a single pass O(n) approach that doesn't require any special data structures or auxiliary memory beyond the fundamental requirement of one iterator per input.
from itertools import cycle, islice
def intersection(inputs):
    "Yield the intersection of elements from multiple sorted inputs."
    # intersection(['ABBCD', 'BBDE', 'BBBDDE']) --> B B D
    n = len(inputs)
    iters = cycle(map(iter, inputs))
    try:
        candidate = next(next(iters))
        while True:
            for it in islice(iters, n-1):
                while (value := next(it)) < candidate:
                    pass
                if value != candidate:
                    candidate = value
                    break
            else:
                yield candidate
                candidate = next(next(iters))
    except StopIteration:
        return
Here's a sample session:
>>> data = [[1,3,5,7], [1,1,3,5,7], [1,4,7,9]]
>>> list(intersection(data))
[1, 7]
>>> data = [[1,1,2,3], [1,1,4,4]]
>>> list(intersection(data))
[1, 1]
Algorithm in words
The algorithm starts by selecting the next value from the next iterator to be a candidate.
The main loop assumes a candidate has been selected and it loops over the next n - 1 iterators. For each of those iterators, it consumes values until it finds a value that is at least as large as the candidate. If that value is larger than the candidate, that value becomes the new candidate and the main loop starts again. If all n - 1 values are equal to the candidate, then the candidate is emitted and a new candidate is fetched.
When any input iterator is exhausted, the algorithm is complete.
Doing it without libraries (core language only)
The same algorithm works fine (though less beautifully) without using itertools. Just replace cycle and islice with their list based equivalents:
def intersection(inputs):
    "Yield the intersection of elements from multiple sorted inputs."
    # intersection(['ABBCD', 'BBDE', 'BBBDDE']) --> B B D
    n = len(inputs)
    iters = list(map(iter, inputs))
    curr_iter = 0
    try:
        it = iters[curr_iter]
        curr_iter = (curr_iter + 1) % n
        candidate = next(it)
        while True:
            for i in range(n - 1):
                it = iters[curr_iter]
                curr_iter = (curr_iter + 1) % n
                while (value := next(it)) < candidate:
                    pass
                if value != candidate:
                    candidate = value
                    break
            else:
                yield candidate
                it = iters[curr_iter]
                curr_iter = (curr_iter + 1) % n
                candidate = next(it)
    except StopIteration:
        return
Yes, it is possible! I've modified your example code to do this.
My answer assumes that your question is about the algorithm - if you want the fastest-running code using sets, see other answers.
This maintains the O(n log(k)) time complexity: all the code between if lowest != elem or ary != times_seen: and unbench_all = False is O(log(k)). There is a nested loop inside the main loop (for unbenched in range(times_seen):) but this only runs times_seen times, and times_seen is initially 0 and is reset to 0 after every time this inner loop is run, and can only be incremented once per main loop iteration, so the inner loop cannot do more iterations in total than the main loop. Thus, since the code inside the inner loop is O(log(k)) and runs at most as many times as the outer loop, and the outer loop is O(log(k)) and runs n times, the algorithm is O(n log(k)).
This algorithm relies upon how tuples are compared in Python. It compares the first items of the tuples, and if they are equal, it compares the second items (i.e. (x, a) < (x, b) is true if and only if a < b).
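The ordering can be checked on a small heap of (value, sub-list index) pairs. This is only an illustration with made-up data, not part of the algorithm itself.

```python
import heapq

# Tuples compare element by element: first items first, then second items.
assert (2, 0) < (2, 1) < (3, 0)

heap = []
for entry in [(2, 3), (2, 1), (5, 0), (2, 0), (2, 2)]:
    heapq.heappush(heap, entry)

# Equal values come off the heap in ascending sub-list index order.
popped = [heapq.heappop(heap) for _ in range(len(heap))]
print(popped)  # [(2, 0), (2, 1), (2, 2), (2, 3), (5, 0)]
```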
In this algorithm, unlike in the example code in the question, when an item is popped from the heap, it is not necessarily pushed again in the same iteration. Since we need to check if all sub-lists contain the same number, after a number is popped from the heap, its sub-list is what I call "benched", meaning that it is not added back to the heap. This is because we need to check if other sub-lists contain the same item, so adding this sub-list's next item is not needed right now.
If a number is indeed in all sub-lists, then the heap will look something like [(2,0),(2,1),(2,2),(2,3)], with all the first elements of the tuples the same, so heappop will select the one with the lowest sub-list index. This means that first index 0 will be popped and times_seen will be incremented to 1, then index 1 will be popped and times_seen will be incremented to 2 - if ary is not equal to times_seen then the number is not in the intersection of all sub-lists. This leads to the condition if lowest != elem or ary != times_seen:, which decides when a number shouldn't be in the result. The else branch of this if statement is for when it still might be in the result.
The unbench_all boolean is for when all sub-lists need to be removed from the bench - this could be because:
The current number is known to not be in the intersection of the sub-lists
It is known to be in the intersection of the sub-lists
When unbench_all is True, all the sub-lists that were removed from the heap are re-added. It is known that these are the ones with indices in range(times_seen) since the algorithm removes items from the heap only if they have the same number, so they must have been removed in order of index, contiguously and starting from index 0, and there must be times_seen of them. This means that we don't need to store the indices of the benched sub-lists, only the number that have been benched.
import heapq
def mergeArys(srtd_arys):
    heap = []
    srtd_iters = [iter(x) for x in srtd_arys]
    # put the first element from each srtd array onto the heap
    for idx, it in enumerate(srtd_iters):
        elem = next(it, None)
        if elem:
            heapq.heappush(heap, (elem, idx))
    res = []
    # the number of times that the current number has been seen
    times_seen = 0
    # the lowest number from the heap - currently checking if the first numbers in all sub-lists are equal to this
    lowest = heap[0][0] if heap else None
    # collect results in nlogK time
    while heap:
        elem, ary = heap[0]
        unbench_all = True
        if lowest != elem or ary != times_seen:
            if lowest == elem:
                heapq.heappop(heap)
                it = srtd_iters[ary]
                nxt = next(it, None)
                if nxt:
                    heapq.heappush(heap, (nxt, ary))
        else:
            heapq.heappop(heap)
            times_seen += 1
            if times_seen == len(srtd_arys):
                res.append(elem)
            else:
                unbench_all = False
        if unbench_all:
            for unbenched in range(times_seen):
                unbenched_it = srtd_iters[unbenched]
                nxt = next(unbenched_it, None)
                if nxt:
                    heapq.heappush(heap, (nxt, unbenched))
            times_seen = 0
            if heap:
                lowest = heap[0][0]
    return res
if __name__ == '__main__':
    a1 = [[1, 3, 5, 7], [1, 1, 3, 5, 7], [1, 4, 7, 9]]
    a2 = [[1, 1], [1, 1, 2, 2, 3]]
    for arys in [a1, a2]:
        print(mergeArys(arys))
An equivalent algorithm can be written like this, if you prefer:
def mergeArys(srtd_arys):
    heap = []
    srtd_iters = [iter(x) for x in srtd_arys]
    # put the first element from each srtd array onto the heap
    for idx, it in enumerate(srtd_iters):
        elem = next(it, None)
        if elem:
            heapq.heappush(heap, (elem, idx))
    res = []
    # collect results in nlogK time
    while heap:
        elem, ary = heap[0]
        lowest = elem
        keep_elem = True
        for i in range(len(srtd_arys)):
            elem, ary = heap[0]
            if lowest != elem or ary != i:
                if ary != i:
                    heapq.heappop(heap)
                    it = srtd_iters[ary]
                    nxt = next(it, None)
                    if nxt:
                        heapq.heappush(heap, (nxt, ary))
                keep_elem = False
                i -= 1
                break
            heapq.heappop(heap)
        if keep_elem:
            res.append(elem)
        for unbenched in range(i+1):
            unbenched_it = srtd_iters[unbenched]
            nxt = next(unbenched_it, None)
            if nxt:
                heapq.heappush(heap, (nxt, unbenched))
        if len(heap) < len(srtd_arys):
            heap = []
    return res
You can use built-in sets and set intersection:
d = [[1,3,5,7],[1,1,3,5,7],[1,4,7,9]]
result = set(d[0]).intersection(*d[1:])
{1, 7}
You can use reduce:
from functools import reduce
a = [[1,3,5,7],[1,1,3,5,7],[1,4,7,9]]
reduce(lambda x, y: x & set(y), a[1:], set(a[0]))
{1, 7}
I've come up with this algorithm. It doesn't exceed O(nk); I don't know if it's good enough for you. The idea is to keep one index per array (k indexes in total). On each iteration you find the positions of the next element of the intersection, then advance every index, until one index runs past the end of its array and there are no more items in the intersection. The trick is that, since the arrays are sorted, you can compare two elements from two different arrays and, if one is bigger than the other, instantly discard the smaller one, because nothing later in the other array can match it. The worst case is that every index is advanced to the end of its array, which takes kn time, since an index never decreases.
def intersection(arrays):
    k = len(arrays)
    indexes = [0] * k
    inter = []
    while indexes[0] < len(arrays[0]):
        target = arrays[0][indexes[0]]
        matched = True
        for i in range(1, k):
            # skip everything smaller than the current candidate
            while indexes[i] < len(arrays[i]) and arrays[i][indexes[i]] < target:
                indexes[i] += 1
            if indexes[i] >= len(arrays[i]):
                return inter
            if arrays[i][indexes[i]] > target:
                # the candidate is missing from array i: advance array 0 past it and retry
                while indexes[0] < len(arrays[0]) and arrays[0][indexes[0]] < arrays[i][indexes[i]]:
                    indexes[0] += 1
                matched = False
                break
        if matched:
            inter.append(target)
            indexes = [idx + 1 for idx in indexes]
    return inter
You said we can't use sets but how about dicts / hash tables? (yes I know they're basically the same thing) :D
If so, here's a fairly simple approach (please excuse the py2 syntax):
arrays = [[1,3,5,7],[1,1,3,5,7],[1,4,7,9]]
counts = {}
for ar in arrays:
    last = None
    for i in ar:
        if (i != last):
            counts[i] = counts.get(i, 0) + 1
        last = i
N = len(arrays)
intersection = [i for i, n in counts.iteritems() if n == N]
print intersection
Same as Raymond Hettinger's solution but with more basic python code:
def intersection(arrays, unique: bool=False):
    result = []
    if not len(arrays) or any(not len(array) for array in arrays):
        return result
    pointers = [0] * len(arrays)
    target = arrays[0][0]
    start_step = 0
    current_step = 1
    while True:
        idx = current_step % len(arrays)
        array = arrays[idx]
        while pointers[idx] < len(array) and array[pointers[idx]] < target:
            pointers[idx] += 1
        if pointers[idx] < len(array) and array[pointers[idx]] > target:
            target = array[pointers[idx]]
            start_step = current_step
            current_step += 1
            continue
        if unique:
            while (
                pointers[idx] + 1 < len(array)
                and array[pointers[idx]] == array[pointers[idx] + 1]
            ):
                pointers[idx] += 1
        if (current_step - start_step) == len(arrays):
            result.append(target)
            for other_idx, other_array in enumerate(arrays):
                pointers[other_idx] += 1
            if pointers[idx] < len(array):
                target = array[pointers[idx]]
                start_step = current_step
        if pointers[idx] == len(array):
            return result
        current_step += 1
Here's an O(n) answer (where n = sum(len(sublist) for sublist in data)).
from itertools import cycle
def intersection(data):
    result = []
    maxval = float("-inf")
    consecutive = 0
    try:
        for sublist in cycle(iter(sublist) for sublist in data):
            value = next(sublist)
            while value < maxval:
                value = next(sublist)
            if value > maxval:
                maxval = value
                consecutive = 0
                continue
            consecutive += 1
            if consecutive >= len(data)-1:
                result.append(maxval)
                consecutive = 0
    except StopIteration:
        return result
print(intersection([[1,3,5,7], [1,1,3,5,7], [1,4,7,9]]))
[1, 7]
Some of the above methods do not handle inputs that have duplicates in every sub-list. The code below implements that kind of intersection, and it is more efficient when the sub-lists contain many duplicates :) If you are not sure about duplicates, it is recommended to use Counter from collections (from collections import Counter). The custom counter function is there to handle large numbers of duplicates more efficiently. It still cannot beat Raymond Hettinger's implementation, though.
def counter(my_list):
    my_list = sorted(my_list)
    first_val, *all_val = my_list
    p_index = my_list.index(first_val)
    my_counter = {}
    for item in all_val:
        c_index = my_list.index(item)
        diff = abs(c_index-p_index)
        p_index = c_index
        my_counter[first_val] = diff
        first_val = item
    c_index = my_list.index(item)
    diff = len(my_list) - c_index
    my_counter[first_val] = diff
    return my_counter
def my_func(data):
    if not data or not isinstance(data, list):
        return
    # get the first value
    first_val, *all_val = data
    if not isinstance(first_val, list):
        return
    # count items in first value
    p = counter(first_val) # counter({1: 2, 3: 1, 5: 1, 7: 1})
    # collect all common items and calculate the minimum occurrence in intersection
    for val in all_val:
        # collecting common items
        c = counter(val)
        # calculate the minimum occurrence in intersection
        inner_dict = {}
        for inner_val in set(c).intersection(set(p)):
            inner_dict[inner_val] = min(p[inner_val], c[inner_val])
        p = inner_dict
    # >>>p
    # {1: 2, 7: 1}
    # Sort by keys of counter
    sorted_items = sorted(p.items(), key=lambda x:x[0]) # [(1, 2), (7, 1)]
    result=[i[0] for i in sorted_items for _ in range(i[1])] # [1, 1, 7]
    return result
Here are the sample Examples
>>> data = [[1,3,5,7],[1,1,3,5,7],[1,4,7,9]]
>>> my_func(data=data)
[1, 7]
>>> data = [[1,1,3,5,7],[1,1,3,5,7],[1,1,4,7,9]]
>>> my_func(data=data)
[1, 1, 7]
You can do the following using the functions heapq.merge, chain.from_iterable and groupby
from heapq import merge
from itertools import groupby, chain
ls = [[1, 3, 5, 7], [1, 1, 3, 5, 7], [1, 4, 7, 9]]
def index_groups(lst):
    """[1, 1, 3, 5, 7] -> [(1, 0), (1, 1), (3, 0), (5, 0), (7, 0)]"""
    return chain.from_iterable(((e, i) for i, e in enumerate(group)) for k, group in groupby(lst))
iterables = (index_groups(li) for li in ls)
flat = merge(*iterables)
res = [k for (k, _), g in groupby(flat) if sum(1 for _ in g) == len(ls)]
print(res)
Output
[1, 7]
The idea is to give an extra value (using enumerate) to differentiate between equal values within the same list (see the function index_groups).
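The complexity of this algorithm is O(n log k), where n is the sum of the lengths of the input lists and k is the number of lists, because heapq.merge keeps a heap of k entries; for a fixed number of lists this is linear in n.
Note that the output for (an extra 1 in each list):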
ls = [[1, 1, 3, 5, 7], [1, 1, 3, 5, 7], [1, 1, 4, 7, 9]]
is:
[1, 1, 7]
You can use bit-masking with one-hot encoding. The inner lists become maxterms. You AND them together for the intersection and OR them for the union. Then you have to convert back, for which I've used a bit hack.
problem = [[1,3,5,7],[1,1,3,5,8,7],[1,4,7,9]]
debruijn = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
            31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9]
u32 = accum = (1 << 32) - 1
for vec in problem:
    maxterm = 0
    for v in vec:
        maxterm |= 1 << v
    accum &= maxterm
# https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn
result = []
while accum:
    power = accum
    accum &= accum - 1 # Peter Wegner CACM 3 (1960), 322
    power &= ~accum
    result.append(debruijn[((power * 0x077CB531) & u32) >> 27])
print(result)
This uses (simulates) 32-bit integers, so you can only have [0, 31] in your sets.
*I am inexperienced at Python, so I timed it. One should definitely use set.intersection.
Here is the single-pass counting algorithm, a simplified version of what others have suggested.
import itertools
def intersection(iterables):
    target, count = None, 0
    for it in itertools.cycle(map(iter, iterables)):
        for value in it:
            if count == 0 or value > target:
                target, count = value, 1
                break
            if value == target:
                count += 1
                break
        else: # exhausted iterator
            return
        if count >= len(iterables):
            yield target
            count = 0
Binary and exponential search haven't come up yet. They're easily recreated even with the "no builtins" constraint.
In practice, that would be much faster, and sub-linear. In the worst case - where the intersection isn't shrinking - the naive approach would repeat work. But there's a solution for that: integrate the binary search while splitting the arrays in half.
import bisect
import itertools
def intersection(seqs):
    seq = min(seqs, key=len)
    if not seq:
        return
    pivot = seq[len(seq) // 2]
    lows, counts, highs = [], [], []
    for seq in seqs:
        start = bisect.bisect_left(seq, pivot)
        stop = bisect.bisect_right(seq, pivot, start)
        lows.append(seq[:start])
        counts.append(stop - start)
        highs.append(seq[stop:])
    yield from intersection(lows)
    yield from itertools.repeat(pivot, min(counts))
    yield from intersection(highs)
Both handle duplicates. Both guarantee O(N) worst-case time (counting slicing as atomic). The latter will approach O(min_size) speed; by always splitting the smallest in half it essentially can't suffer from the bad luck of uneven splits.
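Exponential search is mentioned above but not shown. Here is a rough sketch of how it could look for two sorted lists; the helper names gallop_left and intersect_two are hypothetical, and the bracket found by doubling is finished off with bisect_left from the standard library.

```python
from bisect import bisect_left

def gallop_left(seq, target, lo=0):
    """Index of the first element >= target in sorted seq, searching from lo.

    The probe distance doubles until it overshoots (exponential search),
    then the bracketed range is bisected.
    """
    step = 1
    hi = lo
    while hi < len(seq) and seq[hi] < target:
        lo = hi + 1      # everything up to hi is known to be < target
        hi += step
        step *= 2
    return bisect_left(seq, target, lo, min(hi, len(seq)))

def intersect_two(a, b):
    """Intersection (with multiplicity) of two sorted sequences."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i = gallop_left(a, b[j], i)   # jump over the smaller run in a
        elif b[j] < a[i]:
            j = gallop_left(b, a[i], j)   # jump over the smaller run in b
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result

print(intersect_two([1, 3, 5, 7], [1, 1, 3, 5, 7]))
```

For k lists, one option is to fold this pairwise, e.g. functools.reduce(intersect_two, lists). The galloping only pays off when long stretches of one list can be skipped.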
I couldn't help but notice that this seems to be a variation on the Welfare Crook problem; see David Gries's book, The Science of Programming. Edsger Dijkstra also wrote an EWD about this, see Ascending Functions and the Welfare Crook.
The Welfare Crook
Suppose we have three long magnetic tapes, each containing a list of names in alphabetical order:
all people working for IBM Yorktown
students at Columbia University
people on welfare in New York City
Practically speaking, all three lists are endless, so no upper bounds are given. It is known that at least one person is on all three lists. Write a program to locate the first such person.
Our intersection of the ordered lists problem is a generalization of the Welfare Crook problem.
Here's a (rather primitive?) Python solution to the Welfare Crook problem:
def find_welfare_crook(f, g, h, i, j, k):
"""f, g, and h are "ascending functions," i.e.,
i <= j implies f[i] <= f[j] or, equivalently,
f[i] < f[j] implies i < j, and the same goes for g and h.
i, j, k define where to start the search in each list.
"""
# This is an implementation of a solution to the Welfare Crook
# problem presented in David Gries's book, The Science of Programming.
# The surprising and beautiful thing is that the guard predicates are
# so few and so simple.
while True:
if f[i] < g[j]:
i += 1
elif g[j] < h[k]:
j += 1
elif h[k] < f[i]:
k += 1
else:
break
return (i,j,k)
# The other remarkable thing is how the negation of the guard
# predicates works out to be: f[i] == g[j] and g[j] == h[k].
Generalization to Intersection of K Lists
This generalizes to K lists, and here's what I devised; I don't know how Pythonic this is, but it is pretty compact:
def findIntersectionLofL(lofl):
"""Generalized findIntersection function which operates on a "list of lists." """
K = len(lofl)
indices = [0 for i in range(K)]
result = []
#
try:
while True:
# idea is to maintain the indices via a construct like the following:
allEqual = True
for i in range(K):
if lofl[i][indices[i]] < lofl[(i+1)%K][indices[(i+1)%K]] :
indices[i] += 1
allEqual = False
# When the above iteration finishes, if all of the list
# items indexed by the indices are equal, then another
# item common to all of the lists must be added to the result.
if allEqual :
result.append(lofl[0][indices[0]])
while lofl[0][indices[0]] == lofl[1][indices[1]]:
indices[0] += 1
except IndexError as e:
# Eventually, the foregoing iteration will advance one of the
# indices past the end of one of the lists, and when that happens
# an IndexError exception will be raised. This means the algorithm
# is finished.
return result
This solution does not keep repeated items. Changing the program to include all of the repeated items by changing what the program does in the conditional at the end of the "while True" loop is an exercise left to the reader.
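One possible way to approach that exercise is sketched below. It is not the author's solution: the function name intersect_keep_duplicates is hypothetical, and it deliberately trades the exception-driven loop for explicit bounds checks so the duplicate handling is easier to follow.

```python
def intersect_keep_duplicates(lofl):
    """Intersection of K sorted lists, keeping repeated items.

    An item that appears at least m times in every list appears
    m times in the result. Hypothetical helper for illustration.
    """
    if not lofl:
        return []
    indices = [0] * len(lofl)
    result = []
    # Stop as soon as any list is exhausted.
    while all(i < len(lst) for i, lst in zip(indices, lofl)):
        heads = [lst[i] for i, lst in zip(indices, lofl)]
        largest = max(heads)
        if all(h == largest for h in heads):
            # Every list currently points at the same value: keep one
            # copy and advance every index by exactly one position.
            result.append(largest)
            indices = [i + 1 for i in indices]
        else:
            # Advance only the lists that are behind the largest head.
            indices = [i + (h < largest) for i, h in zip(indices, heads)]
    return result


print(intersect_keep_duplicates([[1, 2, 2, 3], [2, 2, 2, 3], [2, 2, 3, 4]]))
```

Advancing every index by one on a match (instead of skipping past all copies) is what preserves the repeated items.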
Improved Performance
Comments from @greybeard prompted the refinements shown below: the "array index moduli" (the "(i+1)%K" expressions) are pre-computed, and further investigation led to changes in the inner iteration's structure that remove more overhead:
def findIntersectionLofLunRolled(lofl):
"""Generalized findIntersection function which operates on a "list of lists."
Accepts a list-of-lists, lofl. Each of the lists must be ordered.
Returns the list of each element which appears in all of the lists at least once.
"""
K = len(lofl)
indices = [0] * K
result = []
lt = [ (i, (i+1) % K) for i in range(K) ] # avoids evaluation of index exprs inside the loop
#
try:
while True:
allUnEqual = True
while allUnEqual:
allUnEqual = False
for i,j in lt:
if lofl[i][indices[i]] < lofl[j][indices[j]]:
indices[i] += 1
allUnEqual = True
# Now all of the lofl[i][indices[i]], for all i, are the same value.
# Store that value in the result, and then advance all of the indices
# past that common value:
v = lofl[0][indices[0]]
result.append(v)
for i,j in lt:
while lofl[i][indices[i]] == v:
indices[i] += 1
except IndexError as e:
# Eventually, the foregoing iteration will advance one of the
# indices past the end of one of the lists, and when that happens
# an IndexError exception will be raised. This means the algorithm
# is finished.
return result

How to find first value in a list having no duplicates?

l1 = ['A','B','C','D','A','B']
l2 = []
'C' is the first value in list l1 that has no duplicate. I want to create a function that returns 'C' in l2.
In 3.6 and higher, this is very easy. Now that dicts preserve insertion order, collections.Counter can be used to efficiently count all elements in a single pass, then you can just scan the resulting Counter in order to find the first element with a count of 1:
from collections import Counter
l1 = ['A','B','C','D','A','B']
l2 = [next(k for k, v in Counter(l1).items() if v == 1)]
Work is strictly O(n), with only one pass of the input required (plus a partial pass of the unique values in the Counter itself), and the code is incredibly simple. In modern Python, Counter even has a C accelerator for counting inputs that pushes all the Counter construction work to the C layer, making it impossible to beat. If you want to account for the possibility that no such element exists, just wrap the l2 initialization to make it:
try:
l2 = [next(k for k, v in Counter(l1).items() if v == 1)]
except StopIteration:
l2 = []
# ... whatever else makes sense for your scenario ...
or avoid exception handling with itertools.islice (so l2 is 0-1 items, and it still short-circuits once a hit is found):
from itertools import islice
l2 = list(islice((k for k, v in Counter(l1).items() if v == 1), 1))
You can convert the list to a string and then compare the index of each character from the left and from the right, using the string methods find and rfind. It stops as soon as the first match is found. Note that this only works when every list item is a single-character string.
l1 = ['A','B','C','D','A','B']
def i_list(input):
l1 = ''.join(input)
for i in l1:
if l1.find(i) == l1.rfind(i):
return(i)
print(i_list(l1))
# output
C
An implementation using a defaultdict:
# Initialize
from collections import defaultdict
counts = defaultdict(int)
# Count all occurrences
for item in l1:
counts[item] += 1
# Find the first non-duplicated item
for item in l1:
if counts[item] == 1:
l2 = [item]
break
else:
l2 = []
As a follow-up to ShadowRanger's answer, if you're using an older version of Python, it's not much more complicated to filter the original list so that you don't have to rely on the ordering of the Counter items:
from collections import Counter
l1 = ['A','B','C','D','A','B']
c = Counter(l1)
l2 = [x for x in l1 if c[x] == 1][:1]
print(l2) # ['C']
This is also O(n).
We can do it also with "numpy".
import numpy as np
def find_first_non_duplicate(l1):
indexes_counts = np.asarray(np.unique(l1, return_counts=True, return_index=True)[1:]).T
(not_duplicates,) = np.where(indexes_counts[:, 1] == 1)
if not_duplicates.size > 0:
return [l1[np.min(indexes_counts[not_duplicates, 0])]]
else:
return []
print(find_first_non_duplicate(l1))
# output
['C']
A faster way to get the first unique element in a list:
Check each element one by one and store its count in a dict.
Loop through the dict and return the first element whose count is 1.
def countElement(a):
"""
returns a dict of element and its count
"""
g = {}
for i in a:
if i in g:
g[i] +=1
else:
g[i] =1
return g
#List to be processed - input_list
input_list = [1,1,1,2,2,2,2,3,3,4,5,5,234,23,3,12,3,123,12,31,23,13,2,4,23,42,42,34,234,23,42,34,23,423,42,34,23,423,4,234,23,42,34,23,4,23,423,4,23,4] #Input from user
try:
if input_list: #if list is not empty
print ([i for i,v in countElement(input_list).items() if v == 1][0]) #get the first element whose count is 1
else: #if list is empty
print ("empty list in Input")
except IndexError: #no element has a count of 1
print (f"Only duplicate values in list - {input_list}")

Is there a way to check if a list is a sublist of another list? [duplicate]

I want to write a function that determines if a sublist exists in a larger list.
list1 = [1,0,1,1,1,0,0]
list2 = [1,0,1,0,1,0,1]
#Should return true
sublistExists(list1, [1,1,1])
#Should return false
sublistExists(list2, [1,1,1])
Is there a Python function that can do this?
Let's get a bit functional, shall we? :)
def contains_sublist(lst, sublst):
n = len(sublst)
return any((sublst == lst[i:i+n]) for i in range(len(lst)-n+1))
Note that any() stops at the first match of sublst within lst, or returns False if there is no match, after O(m*n) operations.
If you are sure that your inputs will only contain the single digits 0 and 1 then you can convert to strings:
def sublistExists(list1, list2):
return ''.join(map(str, list2)) in ''.join(map(str, list1))
This creates two strings so it is not the most efficient solution but since it takes advantage of the optimized string searching algorithm in Python it's probably good enough for most purposes.
If efficiency is very important you can look at the Boyer-Moore string searching algorithm, adapted to work on lists.
A naive search has O(n*m) worst case but can be suitable if you cannot use the converting to string trick and you don't need to worry about performance.
No function that I know of
def sublistExists(list, sublist):
for i in range(len(list)-len(sublist)+1):
if sublist == list[i:i+len(sublist)]:
return True #return position (i) if you wish
return False #or -1
As Mark noted, this is not the most efficient search (it's O(n*m)). This problem can be approached in much the same way as string searching.
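To illustrate how a string-searching technique carries over to lists, here is a rough sketch of the Knuth-Morris-Pratt algorithm applied to sequences. This is a swapped-in example rather than anything from the answers here; the name kmp_find is hypothetical, and it only assumes the items support ==.

```python
def kmp_find(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1.

    Knuth-Morris-Pratt search over arbitrary sequences; runs in
    O(len(haystack) + len(needle)) comparisons.
    """
    if not needle:
        return 0
    # fail[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # Scan the haystack once, never moving backwards in it.
    k = 0
    for i, item in enumerate(haystack):
        while k and item != needle[k]:
            k = fail[k - 1]
        if item == needle[k]:
            k += 1
        if k == len(needle):
            return i - k + 1
    return -1


print(kmp_find([1, 0, 1, 1, 1, 0, 0], [1, 1, 1]))
print(kmp_find([1, 0, 1, 0, 1, 0, 1], [1, 1, 1]))
```

Unlike the slicing loop, this never re-examines haystack items, which matters mostly for long lists with many partial matches.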
My favourite simple solution is the following (however, it is brute force, so I don't recommend it for huge data):
>>> l1 = ['z','a','b','c']
>>> l2 = ['a','b']
>>> any(l1[i:i+len(l2)] == l2 for i in range(len(l1)))
True
This code above actually creates all possible slices of l1 with length of l2, and sequentially compares them with l2.
Detailed explanation
Read this explanation only if you don't understand how it works (and you want to know); otherwise there is no need to read it.
Firstly, this is how you can iterate over indexes of l1 items:
>>> [i for i in range(len(l1))]
[0, 1, 2, 3]
So, because i represents the index of an item in l1, you can use it to show the actual item instead of the index number:
>>> [l1[i] for i in range(len(l1))]
['z', 'a', 'b', 'c']
Then create slices (something like a subselection of items from the list) from l1 with the length of l2, which is 2 here:
>>> [l1[i:i+len(l2)] for i in range(len(l1))]
[['z', 'a'], ['a', 'b'], ['b', 'c'], ['c']] #last one is shorter, because there is no next item.
Now you can compare each slice with l2 and you see that second one matched:
>>> [l1[i:i+len(l2)] == l2 for i in range(len(l1))]
[False, True, False, False] #notice that the second one is that matching one
Finally, with function named any, you can check if at least one of booleans is True:
>>> any(l1[i:i+len(l2)] == l2 for i in range(len(l1)))
True
The efficient way to do this is to use the Boyer-Moore algorithm, as Mark Byers suggests. I have done it already here: Boyer-Moore search of a list for a sub-list in Python, but will paste the code here. It's based on the Wikipedia article.
The search() function returns the index of the sub-list being searched for, or -1 on failure.
def search(haystack, needle):
"""
Search list `haystack` for sublist `needle`.
"""
if len(needle) == 0:
return 0
char_table = make_char_table(needle)
offset_table = make_offset_table(needle)
i = len(needle) - 1
while i < len(haystack):
j = len(needle) - 1
while needle[j] == haystack[i]:
if j == 0:
return i
i -= 1
j -= 1
i += max(offset_table[len(needle) - 1 - j], char_table.get(haystack[i], len(needle))) # default shift for items missing from the table; max() with None fails on Python 3
return -1
def make_char_table(needle):
"""
Makes the jump table based on the mismatched character information.
"""
table = {}
for i in range(len(needle) - 1):
table[needle[i]] = len(needle) - 1 - i
return table
def make_offset_table(needle):
"""
Makes the jump table based on the scan offset in which mismatch occurs.
"""
table = []
last_prefix_position = len(needle)
for i in reversed(range(len(needle))):
if is_prefix(needle, i + 1):
last_prefix_position = i + 1
table.append(last_prefix_position - i + len(needle) - 1)
for i in range(len(needle) - 1):
slen = suffix_length(needle, i)
table[slen] = len(needle) - 1 - i + slen
return table
def is_prefix(needle, p):
"""
Is needle[p:end] a prefix of needle?
"""
j = 0
for i in range(p, len(needle)):
if needle[i] != needle[j]:
return 0
j += 1
return 1
def suffix_length(needle, p):
"""
Returns the maximum length of the substring ending at p that is a suffix.
"""
length = 0
j = len(needle) - 1
for i in reversed(range(p + 1)):
if needle[i] == needle[j]:
length += 1
else:
break
j -= 1
return length
Here is the example from the question:
def main():
list1 = [1,0,1,1,1,0,0]
list2 = [1,0,1,0,1,0,1]
index = search(list1, [1, 1, 1])
print(index)
index = search(list2, [1, 1, 1])
print(index)
if __name__ == '__main__':
main()
Output:
2
-1
Here is a way that will work for simple lists that is slightly less fragile than Mark's
def sublistExists(haystack, needle):
def munge(s):
return ", "+format(str(s)[1:-1])+","
return munge(needle) in munge(haystack)
def sublistExists(x, y):
occ = [i for i, a in enumerate(x) if a == y[0]]
for b in occ:
if x[b:b+len(y)] == y:
print('YES-- SUBLIST at : ', b)
return True
if len(occ)-1 == occ.index(b):
print('NO SUBLIST')
return False
list1 = [1,0,1,1,1,0,0]
list2 = [1,0,1,0,1,0,1]
#should return True
sublistExists(list1, [1,1,1])
#Should return False
sublistExists(list2, [1,1,1])
Might as well throw in a recursive version of @NasBanov's solution
def foo(sub, lst):
'''Checks if sub is in lst.
Expects both arguments to be lists
'''
if len(lst) < len(sub):
return False
return sub == lst[:len(sub)] or foo(sub, lst[1:])
def sublist(l1,l2):
if len(l1) < len(l2):
for i in range(0, len(l1)):
for j in range(0, len(l2)):
if l1[i]==l2[j] and j==i+1:
pass
return True
else:
return False
I know this might not be quite relevant to the original question, but it might be a very elegant one-line solution for someone else if the order of items in both lists doesn't matter. The result below is True if all List1 elements are in List2 (regardless of order). If the order matters, don't use this solution.
List1 = [10, 20, 30]
List2 = [10, 20, 30, 40]
result = set(List1).intersection(set(List2)) == set(List1)
print(result)
Output
True
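The same order-insensitive check can be written a little more directly, as in the sketch below. The function names are hypothetical. The second variant is an addition not covered by the answer above: it also respects how many times each item occurs, which the set-based check ignores.

```python
from collections import Counter


def contains_all(small, big):
    """True if every distinct item of small occurs somewhere in big."""
    return set(small).issubset(big)


def contains_all_with_counts(small, big):
    """True if big holds at least as many copies of each item as small."""
    need = Counter(small)
    have = Counter(big)
    return all(have[item] >= count for item, count in need.items())


print(contains_all([10, 20, 30], [10, 20, 30, 40]))
print(contains_all([10, 10], [10, 20]))
print(contains_all_with_counts([10, 10], [10, 20]))
```

Both variants require hashable items and, like the original one-liner, say nothing about whether the items are contiguous or in order.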
If I am understanding this correctly, you have a larger list, like:
list_A= ['john', 'jeff', 'dave', 'shane', 'tim']
Then there are other lists:
list_B= ['sean', 'bill', 'james']
list_C= ['cole', 'wayne', 'jake', 'moose']
Then I append lists B and C to list A:
list_A.append(list_B)
list_A.append(list_C)
So when I print list_A:
print (list_A)
I get the following output:
['john', 'jeff', 'dave', 'shane', 'tim', ['sean', 'bill', 'james'], ['cole', 'wayne', 'jake', 'moose']]
Now, to check whether a sublist exists:
for value in list_A:
# isinstance is a more robust type check than parsing str(type(value))
print(isinstance(value, list))
This prints True for each element of the larger list that is itself a list, and False otherwise.

Finding the mode of a list

Given a list of items, recall that the mode of the list is the item that occurs most often.
I would like to know how to create a function that can find the mode of a list but that displays a message if the list does not have a mode (e.g., all the items in the list only appear once). I want to make this function without importing any functions. I'm trying to make my own function from scratch.
You can use the max function and a key. Have a look at python max function using 'key' and lambda expression.
max(set(lst), key=lst.count)
You can use the Counter supplied in the collections module, which has a mode-esque method:
from collections import Counter
data = Counter(your_list_in_here)
data.most_common() # Returns all unique items and their counts
data.most_common(1) # Returns the highest occurring item
Note: Counter is new in python 2.7 and is not available in earlier versions.
Python 3.4 includes the function statistics.mode, so it is straightforward:
>>> from statistics import mode
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
You can have any type of elements in the list, not just numeric:
>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
Taking a leaf from some statistics software, namely SciPy and MATLAB, these just return the smallest most common value, so if two values occur equally often, the smallest of these is returned. Hopefully an example will help:
>>> from scipy.stats import mode
>>> mode([1, 2, 3, 4, 5])
(array([ 1.]), array([ 1.]))
>>> mode([1, 2, 2, 3, 3, 4, 5])
(array([ 2.]), array([ 2.]))
>>> mode([1, 2, 2, -3, -3, 4, 5])
(array([-3.]), array([ 2.]))
Is there any reason why you can't follow this convention?
There are many simple ways to find the mode of a list in Python such as:
import statistics
statistics.mode([1,2,3,3])
>>> 3
Or, you could find the max by its count
max(array, key = array.count)
The problem with those two methods is that they don't work with multiple modes. On Python versions before 3.8 the first raises a StatisticsError (from 3.8 on it returns the first mode encountered), while the second returns the first mode.
In order to find the modes of a set, you could use this function:
def mode(array):
most = max(list(map(array.count, array)))
return list(set(filter(lambda x: array.count(x) == most, array)))
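The function above calls array.count once per element, so it is quadratic in the list length, and going through set() loses the original order. As a hedged alternative sketch: on Python 3.8 and later, statistics.multimode returns all modes in the order they were first encountered, and a Counter-based helper (the name all_modes is hypothetical) does the same on older versions. Both need hashable items.

```python
from collections import Counter
from statistics import multimode  # available from Python 3.8


def all_modes(data):
    """Return every most-common item, in first-seen order (Python 3.7+ dicts)."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [item for item, count in counts.items() if count == top]


print(multimode([1, 1, 2, 2, 3]))
print(all_modes([1, 1, 2, 2, 3]))
```

Note that neither version treats "every item occurs once" as a special case; that check would have to be added separately if a "no mode" message is wanted.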
Extending the Community answer that will not work when the list is empty, here is working code for mode:
def mode(arr):
if arr==[]:
return None
else:
return max(set(arr), key=arr.count)
In case you are interested in either the smallest, largest or all modes:
def get_small_mode(numbers, out_mode):
counts = {k:numbers.count(k) for k in set(numbers)}
modes = sorted(dict(filter(lambda x: x[1] == max(counts.values()), counts.items())).keys())
if out_mode=='smallest':
return modes[0]
elif out_mode=='largest':
return modes[-1]
else:
return modes
A little longer, but can have multiple modes and can get string with most counts or mix of datatypes.
def getmode(inplist):
'''with list of items as input, returns mode
'''
dictofcounts = {}
listofcounts = []
for i in inplist:
countofi = inplist.count(i) # count items for each item in list
listofcounts.append(countofi) # add counts to list
dictofcounts[i]=countofi # add counts and item in dict to get later
maxcount = max(listofcounts) # get max count of items
if maxcount ==1:
print("There is no mode for this dataset, values occur only once")
else:
modelist = [] # if more than one mode, add to list to print out
for key, item in dictofcounts.items():
if item ==maxcount: # get item from original list with most counts
modelist.append(str(key))
print("The mode(s) are:",' and '.join(modelist))
return modelist
The mode of a data set is the member (or members) that occurs most frequently in the set. If two members appear most often, the same number of times, then the data has two modes; this is called bimodal. If there are more than two modes, the data is called multimodal. If all the members in the data set appear the same number of times, then the data set has no mode at all. The following function modes() can be used to find the mode(s) in a given list of data:
import numpy as np; import pandas as pd
def modes(arr):
df = pd.DataFrame(arr, columns=['Values'])
dat = pd.crosstab(df['Values'], columns=['Freq'])
if len(np.unique((dat['Freq']))) > 1:
mode = list(dat.index[np.array(dat['Freq'] == max(dat['Freq']))])
return mode
else:
print("There is NO mode in the data set")
Output:
# For a list of numbers in x as
In [1]: x = [2, 3, 4, 5, 7, 9, 8, 12, 2, 1, 1, 1, 3, 3, 2, 6, 12, 3, 7, 8, 9, 7, 12, 10, 10, 11, 12, 2]
In [2]: modes(x)
Out[2]: [2, 3, 12]
# For a list of repeated numbers in y as
In [3]: y = [2, 2, 3, 3, 4, 4, 10, 10]
In [4]: modes(y)
Out[4]: There is NO mode in the data set
# For a list of strings/characters in z as
In [5]: z = ['a', 'b', 'b', 'b', 'e', 'e', 'e', 'd', 'g', 'g', 'c', 'g', 'g', 'a', 'a', 'c', 'a']
In [6]: modes(z)
Out[6]: ['a', 'g']
If we do not want to import numpy or pandas to call any function from these packages, then to get this same output, modes() function can be written as:
def modes(arr):
cnt = []
for i in arr:
cnt.append(arr.count(i))
uniq_cnt = []
for i in cnt:
if i not in uniq_cnt:
uniq_cnt.append(i)
if len(uniq_cnt) > 1:
m = []
for i in list(range(len(cnt))):
if cnt[i] == max(uniq_cnt):
m.append(arr[i])
mode = []
for i in m:
if i not in mode:
mode.append(i)
return mode
else:
print("There is NO mode in the data set")
I wrote up this handy function to find the mode.
def mode(nums):
corresponding={}
occurances=[]
for i in nums:
count = nums.count(i)
corresponding.update({i:count})
for i in corresponding:
freq=corresponding[i]
occurances.append(freq)
maxFreq=max(occurances)
keys=list(corresponding.keys())
values=list(corresponding.values())
index_v = values.index(maxFreq)
# return the result directly; "global mode" would overwrite this function
return keys[index_v]
Short, but somehow ugly:
def mode(arr) :
m = max([arr.count(a) for a in arr])
return [x for x in arr if arr.count(x) == m][0] if m>1 else None
Using a dictionary, slightly less ugly:
def mode(arr) :
f = {}
for a in arr : f[a] = f.get(a,0)+1
m = max(f.values())
t = [(x,f[x]) for x in f if f[x]==m]
return t[0][0] if m > 1 else None
This function returns the mode or modes of a list, no matter how many there are, as well as the frequency of the mode or modes in the dataset. If there is no mode (i.e. all items occur only once), the function returns an error string. This is similar to A_nagpal's function above but is, in my humble opinion, more complete, and I think it's easier to understand for any Python novices (such as yours truly) reading this question.
def l_mode(list_in):
count_dict = {}
for e in (list_in):
count = list_in.count(e)
if e not in count_dict.keys():
count_dict[e] = count
max_count = 0
for key in count_dict:
if count_dict[key] >= max_count:
max_count = count_dict[key]
corr_keys = []
for corr_key, count_value in count_dict.items():
if count_dict[corr_key] == max_count:
corr_keys.append(corr_key)
if max_count == 1 and len(count_dict) != 1:
return 'There is no mode for this data set. All values occur only once.'
else:
corr_keys = sorted(corr_keys)
return corr_keys, max_count
For a number to be a mode, it must occur more times than at least one other number in the list, and it must not be the only number in the list. So, I refactored @mathwizurd's answer (to use the difference method) as follows:
def mode(array):
'''
returns a set containing valid modes
returns a message if no valid mode exists
- when all numbers occur the same number of times
- when only one number occurs in the list
- when no number occurs in the list
'''
most = max(map(array.count, array)) if array else None
mset = set(filter(lambda x: array.count(x) == most, array))
return mset if set(array) - mset else "list does not have a mode!"
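With the function as written, each of these comparisons is True, because every one of these inputs yields the "no mode" message:
mode([]) == "list does not have a mode!"
mode([1]) == "list does not have a mode!"
mode([1, 1]) == "list does not have a mode!"
mode([1, 1, 2, 2]) == "list does not have a mode!"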
Here is how you can find mean,median and mode of a list:
import numpy as np
from scipy import stats
#to take input
size = int(input())
numbers = list(map(int, input().split()))
print(np.mean(numbers))
print(np.median(numbers))
print(int(stats.mode(numbers)[0]))
Simple code that finds the mode of the list without any imports:
nums = [1, 2, 2, 3, 3] # placeholder example; your list goes here
nums.sort()
counts = dict()
for i in nums:
counts[i] = counts.get(i, 0) + 1
mode = max(counts, key=counts.get)
In case of multiple modes, it should return the minimum mode.
Why not just
def print_mode (thelist):
counts = {}
for item in thelist:
counts [item] = counts.get (item, 0) + 1
maxcount = 0
maxitem = None
for k, v in counts.items ():
if v > maxcount:
maxitem = k
maxcount = v
if maxcount == 1:
print("All values only appear once")
elif list(counts.values()).count(maxcount) > 1:
print("List has multiple modes")
else:
print("Mode of list:", maxitem)
This doesn't have a few error checks that it should have, but it will find the mode without importing any functions and will print a message if all values appear only once. It will also detect multiple items sharing the same maximum count, although it wasn't clear if you wanted that.
This will return all modes:
def mode(numbers):
largestCount = 0
modes = []
for x in numbers:
if x in modes:
continue
count = numbers.count(x)
if count > largestCount:
del modes[:]
modes.append(x)
largestCount = count
elif count == largestCount:
modes.append(x)
return modes
For those looking for the minimum mode (e.g. in the case of a bi-modal distribution), using numpy. Note that np.bincount only accepts non-negative integers.
import numpy as np
mode = np.argmax(np.bincount(your_list))
OK! The community already has a lot of answers, and some of them use other functions that you don't want.
Let's create our own very simple and easily understandable function.
import numpy as np
#Declare Function Name
def calculate_mode(lst):
The next step is to find the unique elements in the list and their respective frequencies.
unique_elements,freq = np.unique(lst, return_counts=True)
Get mode
max_freq = np.max(freq) #maximum frequency
mode_index = np.where(freq==max_freq) #max freq index
mode = unique_elements[mode_index] #get mode by index
return mode
Example
lst =np.array([1,1,2,3,4,4,4,5,6])
print(calculate_mode(lst))
>>> Output [4]
How my brain decided to do it completely from scratch. Efficient and concise :) (jk lol)
import random
def removeDuplicates(arr):
dupFlag = False
for i in range(len(arr)):
#check if we found a dup, if so, stop
if dupFlag:
break
for j in range(len(arr)):
if ((arr[i] == arr[j]) and (i != j)):
arr.remove(arr[j])
dupFlag = True
break;
#if there was a duplicate repeat the process, this is so we can account for the changing length of the arr
if (dupFlag):
removeDuplicates(arr)
else:
#if no duplicates return the arr
return arr
#currently returns modes and all there occurences... Need to handle dupes
def mode(arr):
numCounts = []
#init numCounts
for i in range(len(arr)):
numCounts += [0]
for i in range(len(arr)):
count = 1
for j in range(len(arr)):
if (arr[i] == arr[j] and i != j):
count += 1
#add the count for that number to the corresponding index
numCounts[i] = count
#find which has the greatest number of occurences
greatestNum = 0
for i in range(len(numCounts)):
if (numCounts[i] > greatestNum):
greatestNum = numCounts[i]
#finally return the mode(s)
modes = []
for i in range(len(numCounts)):
if numCounts[i] == greatestNum:
modes += [arr[i]]
#remove duplicates (using aliasing)
print("modes: ", modes)
removeDuplicates(modes)
print("modes after removing duplicates: ", modes)
return modes
def initArr(n):
arr = []
for i in range(n):
arr += [random.randrange(0, n)]
return arr
#initialize an array of random ints
arr = initArr(1000)
print(arr)
print("_______________________________________________")
modes = mode(arr)
#print result
print("Mode is: ", modes) if (len(modes) == 1) else print("Modes are: ", modes)
def mode(inp_list):
sort_list = sorted(inp_list)
dict1 = {}
for i in sort_list:
count = sort_list.count(i)
if i not in dict1.keys():
dict1[i] = count
maximum = 0 #no. of occurences
max_key = -1 #element having the most occurences
for key in dict1:
if(dict1[key]>maximum):
maximum = dict1[key]
max_key = key
elif(dict1[key]==maximum):
if(key<max_key):
maximum = dict1[key]
max_key = key
return max_key
def mode(data):
lst =[]
hgh=0
for i in range(len(data)):
lst.append(data.count(data[i]))
m= max(lst)
ml = [x for x in data if data.count(x)==m ] #to find most frequent values
mode = []
for x in ml: #to remove duplicates of mode
if x not in mode:
mode.append(x)
return mode
print(mode([1,2,2,2,2,7,7,5,5,5,5]))
Here is a simple function that gets the first mode that occurs in a list. It makes a dictionary with the list elements as keys and number of occurrences and then reads the dict values to get the mode.
def findMode(readList):
numCount={}
highestNum=0
for i in readList:
if i in numCount.keys(): numCount[i] += 1
else: numCount[i] = 1
for i in numCount.keys():
if numCount[i] > highestNum:
highestNum=numCount[i]
mode=i
if highestNum != 1: print(mode)
elif highestNum == 1: print("All elements of list appear once.")
If you want a clear approach, useful for classroom and only using lists and dictionaries by comprehension, you can do:
def mode(my_list):
# Form a new list with the unique elements
unique_list = sorted(list(set(my_list)))
# Use a dictionary comprehension to map each unique element to its count
appearance = {a:my_list.count(a) for a in unique_list}
# Calculate max number of appearances
max_app = max(appearance.values())
# Return the elements of the dictionary that appear that # of times
return {k: v for k, v in appearance.items() if v == max_app}
#function to find mode
def mode(data):
modecnt=0
#for count of number appearing
for i in range(len(data)):
icount=data.count(data[i])
#for storing count of each number in list will be stored
if icount>modecnt:
#this branch runs if the current count is greater than the previous count
mode=data[i]
#here the mode of number is stored
modecnt=icount
#count of the appearance of number is stored
return mode
print(mode(data1))
import numpy as np
def get_mode(xs):
values, counts = np.unique(xs, return_counts=True)
max_count_index = np.argmax(counts) #return the index with max value counts
return values[max_count_index]
print(get_mode([1,7,2,5,3,3,8,3,2]))
Perhaps try the following. It is O(n) and returns a list of floats (or ints). It is thoroughly, automatically tested. It uses collections.defaultdict, but I'd like to think you're not opposed to using that. It can also be found at https://stromberg.dnsalias.org/~strombrg/stddev.html
import collections
import typing
def compute_mode(list_: typing.List[float]) -> typing.List[float]:
"""
Compute the mode of list_.
Note that the return value is a list, because sometimes there is a tie for "most common value".
See https://stackoverflow.com/questions/10797819/finding-the-mode-of-a-list
"""
if not list_:
raise ValueError('Empty list')
if len(list_) == 1:
raise ValueError('Single-element list')
value_to_count_dict: typing.DefaultDict[float, int] = collections.defaultdict(int)
for element in list_:
value_to_count_dict[element] += 1
count_to_values_dict = collections.defaultdict(list)
for value, count in value_to_count_dict.items():
count_to_values_dict[count].append(value)
counts = list(count_to_values_dict)
if len(counts) == 1:
raise ValueError('All elements in list are the same')
maximum_occurrence_count = max(counts)
if maximum_occurrence_count == 1:
raise ValueError('No element occurs more than once')
minimum_occurrence_count = min(counts)
if maximum_occurrence_count <= minimum_occurrence_count:
raise ValueError('Maximum count not greater than minimum count')
return count_to_values_dict[maximum_occurrence_count]

Python: Check the occurrences in a list against a value

lst = [1,2,3,4,1]
I want to know whether 1 occurs twice in this list. Is there an efficient way to do this?
lst.count(1) would return the number of times it occurs. If you're going to be counting items in a list, O(n) is what you're going to get.
The general function on the list is list.count(x), and will return the number of times x occurs in a list.
Are you asking whether every item in the list is unique?
len(set(lst)) == len(lst)
Whether 1 occurs more than once?
lst.count(1) > 1
Note that the above is not maximally efficient, because it won't short-circuit -- even if 1 occurs twice, it will still count the rest of the occurrences. If you want it to short-circuit you will have to write something a little more complicated.
Whether the first element occurs more than once?
lst[0] in lst[1:]
How often each element occurs?
import collections
collections.Counter(lst)
Something else?
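The short-circuiting check mentioned above (for "does 1 occur more than once?") could look something like this sketch. The helper name occurs_at_least is hypothetical. It stops scanning as soon as the n-th match is found instead of counting every occurrence.

```python
from itertools import islice


def occurs_at_least(lst, value, n):
    """True if value occurs at least n times; stops scanning at the n-th hit."""
    matches = (1 for item in lst if item == value)
    # islice pulls at most n matches from the generator, so the list is
    # not scanned further once enough occurrences have been seen.
    return sum(islice(matches, n)) >= n


lst = [1, 2, 3, 4, 1]
print(occurs_at_least(lst, 1, 2))
print(occurs_at_least(lst, 2, 2))
```

The worst case is still O(n) when the value is rare, so the gain only shows up on long lists where the value repeats early.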
For multiple occurrences, this gives you the index of each occurrence:
>>> lst=[1,2,3,4,5,1]
>>> tgt=1
>>> found=[]
>>> for index, suspect in enumerate(lst):
... if(tgt==suspect):
... found.append(index)
...
>>> print(len(found), "found at index:", ", ".join(map(str,found)))
2 found at index: 0, 5
If you want the count of each item in the list:
>>> lst=[1,2,3,4,5,2,2,1,5,5,5,5,6]
>>> count={}
>>> for item in lst:
... count[item]=lst.count(item)
...
>>> count
{1: 2, 2: 3, 3: 1, 4: 1, 5: 5, 6: 1}
def valCount(lst):
res = {}
for v in lst:
try:
res[v] += 1
except KeyError:
res[v] = 1
return res
u = [ x for x,y in valCount(lst).items() if y > 1 ]
u is now a list of all values which appear more than once.
Edit:
@katrielalex: thank you for pointing out collections.Counter, of which I was not previously aware. It can also be written more concisely using a collections.defaultdict, as demonstrated in the following tests. All three methods are roughly O(n) and reasonably close in run-time performance (using collections.defaultdict is in fact slightly faster than collections.Counter).
My intention was to give an easy-to-understand response to what seemed a relatively unsophisticated request. Given that, are there any other senses in which you consider it "bad code" or "done poorly"?
import collections
import random
import time
def test1(lst):
res = {}
for v in lst:
try:
res[v] += 1
except KeyError:
res[v] = 1
return res
def test2(lst):
res = collections.defaultdict(lambda: 0)
for v in lst:
res[v] += 1
return res
def test3(lst):
return collections.Counter(lst)
def rndLst(lstLen):
r = random.randint
return [r(0,lstLen) for i in range(lstLen)]
def timeFn(fn, *args):
st = time.perf_counter() # time.clock() was removed in Python 3.8
res = fn(*args)
return time.perf_counter() - st
def main():
reps = 5000
res = []
tests = [test1, test2, test3]
for t in range(reps):
lstLen = random.randint(10,50000)
lst = rndLst(lstLen)
res.append( [lstLen] + [timeFn(fn, lst) for fn in tests] )
res.sort()
return res
And the results, for random lists containing up to 50,000 items, are as follows:
(Vertical axis is time in seconds, horizontal axis is number of items in list)
Another way to get all items that occur more than once:
lst = [1,2,3,4,1]
d = {}
for x in lst:
d[x] = x in d
print(d[1]) # True
print(d[2]) # False
print([x for x in d if d[x]]) # [1]
You could also sort the list which is O(n*log(n)), then check the adjacent elements for equality, which is O(n). The result is O(n*log(n)). This has the disadvantage of requiring the entire list be sorted before possibly bailing when a duplicate is found.
For a large list with relatively rare duplicates, this could be about the best you can do. The best way to approach this really does depend on the size of the data involved and its nature.
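The sort-then-compare idea described above can be sketched as follows. The function names are hypothetical, and the sketch assumes the items can be ordered against each other; they do not need to be hashable.

```python
def has_duplicate(lst):
    """True if any value occurs more than once (O(n log n) because of the sort)."""
    ordered = sorted(lst)
    # After sorting, equal values sit next to each other.
    return any(a == b for a, b in zip(ordered, ordered[1:]))


def duplicated_values(lst):
    """Sorted list of the distinct values that occur more than once."""
    ordered = sorted(lst)
    dupes = []
    for a, b in zip(ordered, ordered[1:]):
        if a == b and (not dupes or dupes[-1] != a):
            dupes.append(a)
    return dupes


print(has_duplicate([1, 2, 3, 4, 1]))
print(duplicated_values([1, 2, 3, 4, 1, 2, 2]))
```

The adjacent-pair scan can stop at the first hit, but as noted above the full sort has already been paid for by then.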
