I have the following random selection script:
import random
length_of_list = 200
my_list = list(range(length_of_list))
num_selections = 10
numbers = random.sample(my_list, num_selections)
It looks at a list of predetermined size and randomly selects 10 numbers. Is there a way to run this section 500 times and then get the top 10 numbers which were selected the most? I was thinking that I could feed the numbers into a dictionary and then get the top 10 numbers from there. So far, I've done the following:
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        current_number = my_dict.get(number)
        key_number = number
        my_dict.update(number = number+1)
print(my_dict)
Here I want the code to take the current number assigned to that key and then add 1, but I cannot manage to make it work. It seems that the key in the dictionary update has to be written out literally, and I cannot pass a variable. Also, I think having this nested loop might not be very efficient, as I have to run this 500 times 1500 times 23, so I am concerned about performance. If anyone has an idea of what I should try, it would be great! Thanks
SOLUTION:
import random
from collections import defaultdict
from collections import OrderedDict
length_of_list = 50
my_list = list(range(length_of_list))
num_selections = 10
my_dict = dict.fromkeys(my_list)
di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        di[number] += 1
def get_top_numbers(data, n, order=False):
    """Gets the top n numbers from the dictionary"""
    top = sorted(data.items(), key=lambda x: x[1], reverse=True)[:n]
    if order:
        return OrderedDict(top)
    return dict(top)
print(get_top_numbers(di, n=10))
In the line my_dict.update(number = number+1), the name to the left of the = is not looked up as a variable. Python treats it as a keyword argument called number, so update() stores number+1 under the literal string key 'number' instead of incrementing the entry for the number you sampled. No error is raised, which makes the mistake easy to miss.
To update the key held in a variable, pass dict.update a dictionary, or simply assign with my_dict[number] = value. You should read the documentation about this function: https://www.tutorialspoint.com/python3/dictionary_update.htm
It says that dict.update(dict2) takes a dictionary, which it will merge into dict. See the example below:
dict = {'Name': 'Zara', 'Age': 17}
dict2 = {'Gender': 'female' }
dict.update(dict2)
print ("updated dict : ", dict)
Gives as result:
updated dict : {'Gender': 'female', 'Age': 17, 'Name': 'Zara'}
So much for the errors in your code. A good answer has already been given, so I won't repeat it.
Check out defaultdict in the collections module.
Basically, you create a defaultdict with a default value of 0, then iterate over your numbers list and increment the count for each number with += 1:
from collections import defaultdict
di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        di[number] += 1
print(di)
For this task you can use collections.Counter, which supports addition. You use two counters: one that holds the running total and a second that holds the counts for the current sample.
import collections
import random
counter = collections.Counter()
for run in range(500):
    samples = random.sample(my_list, num_selections)
    sample_counter = collections.Counter(samples)
    counter = counter + sample_counter
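To get from the counts to the ten most frequently selected numbers, Counter.most_common is one option. This is only a minimal sketch: it assumes the same list size and num_selections as in the question, and the fixed seed and the name tally are there purely for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed only so the sketch is repeatable (an assumption, not in the question)

my_list = list(range(200))
num_selections = 10

tally = Counter()
for run in range(500):
    # update() adds one count for every number in this sample
    tally.update(random.sample(my_list, num_selections))

# most_common(10) returns (number, count) pairs, highest count first
top_10 = tally.most_common(10)
print(top_10)
```

Calling update() on one counter avoids building a new Counter on every run, which may matter if the loop is repeated many times.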
I am new to Python and can't quite figure out a solution to my problem. I would like to split a list into two lists, based on what each list item starts with. My list looks like this, where each line represents an item (this is not correct list notation, but I'll leave it like this for a better overview):
***
**
.param
+foo = bar
+foofoo = barbar
+foofoofoo = barbarbar
.model
+spam = eggs
+spamspam = eggseggs
+spamspamspam = eggseggseggs
So I want one list that contains all lines starting with a '+' between .param and .model, and another list that contains all lines starting with a '+' after .model until the end.
I have looked at enumerate() and split(), but since I have a list and not a string and am not trying to match whole items in the list, I'm not sure how to implement them.
What I have is this:
paramList = []
for line in newContent:
    while line.startswith('+'):
        paramList.append(line)
        if line.startswith('.'):
            break
This is just my attempt to create the first list. The problem is that the code reads the second block of '+'s as well, because break only exits the while loop, not the for loop.
I hope you can understand my question and thanks in advance for any pointers!
What you want is really a simple task that can be accomplished using list slices and a list comprehension:
data = ['**','***','.param','+foo = bar','+foofoo = barbar','+foofoofoo = barbarbar',
'.model','+spam = eggs','+spamspam = eggseggs','+spamspamspam = eggseggseggs']
# First get the interesting positions.
param_tag_pos = data.index('.param')
model_tag_pos = data.index('.model')
# Get all elements between tags.
params = [param for param in data[param_tag_pos + 1: model_tag_pos] if param.startswith('+')]
models = [model for model in data[model_tag_pos + 1:] if model.startswith('+')]
print(params)
print(models)
Output
['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar']
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']
Answer to comment:
Suppose you have a list containing numbers from 0 up to 5.
l = [0, 1, 2, 3, 4, 5]
Then using list slices you can select a subset of l:
another = l[2:5] # another is [2, 3, 4]
That is what we are doing here:
data[param_tag_pos + 1: model_tag_pos]
And for your last question: ...how does python know param are the lines in data it should iterate over and what exactly does the first param in param for param do?
Python doesn't know; you have to tell it.
The first param is just a variable name I'm using here; it could be x, list_items, or whatever you want.
Here is the line of code translated into plain English:
# Pythonian
params = [param for param in data[param_tag_pos + 1: model_tag_pos] if param.startswith('+')]
# English
params is a list of "things", for each "thing" we can see in the list `data`
from position `param_tag_pos + 1` to position `model_tag_pos`, just if that "thing" starts with the character '+'.
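If the comprehension still feels opaque, it may help to see it unrolled into an ordinary loop. This is only a sketch: the data list is a shortened copy of the one in the question, used here as an assumption to keep the example small.

```python
data = ['**', '***', '.param', '+foo = bar', '+foofoo = barbar',
        '.model', '+spam = eggs']

param_tag_pos = data.index('.param')
model_tag_pos = data.index('.model')

# Same result as the list comprehension, written as an explicit loop
params = []
for param in data[param_tag_pos + 1: model_tag_pos]:
    if param.startswith('+'):
        params.append(param)

print(params)
```

The name after for (here param) is bound to each item of the slice in turn, and the if decides whether that item is appended.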
data = {}
for line in newContent:
    if line.startswith('.'):
        cur_dict = {}
        data[line[1:]] = cur_dict
    elif line.startswith('+'):
        key, value = line[1:].split(' = ', 1)
        cur_dict[key] = value
This creates a dict of dicts:
{'model': {'spam': 'eggs',
'spamspam': 'eggseggs',
'spamspamspam': 'eggseggseggs'},
'param': {'foo': 'bar',
'foofoo': 'barbar',
'foofoofoo': 'barbarbar'}}
I am new to Python
Whoops. Don't bother with my answer then.
I want a list that contains all lines starting with a '+' between
.param and .model and another list that contains all lines starting
with a '+' after model until the end.
import itertools as it
import pprint
data = [
'***',
'**',
'.param',
'+foo = bar',
'+foofoo = barbar',
'+foofoofoo = barbarbar',
'.model',
'+spam = eggs',
'+spamspam = eggseggs',
'+spamspamspam = eggseggseggs',
]
results = [
    list(group) for key, group in it.groupby(data, lambda s: s.startswith('+'))
    if key
]
pprint.pprint(results)
print('-' * 20)
print(results[0])
print('-' * 20)
pprint.pprint(results[1])
--output:--
[['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar'],
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']]
--------------------
['+foo = bar', '+foofoo = barbar', '+foofoofoo = barbarbar']
--------------------
['+spam = eggs', '+spamspam = eggseggs', '+spamspamspam = eggseggseggs']
This thing here:
it.groupby(data, lambda x: x.startswith('+'))
...tells python to create groups from the strings according to their first character. If the first character is a '+', then the string gets put into a True group. If the first character is not a '+', then the string gets put into a False group. However, there are more than two groups because consecutive False strings will form a group, and consecutive True strings will form a group.
Based on your data, the first three strings:
***
**
.param
will create one False group. Then, the next strings:
+foo = bar
+foofoo = barbar
+foofoofoo = barbarbar
will create one True group. Then the next string:
'.model'
will create another False group. Then the next strings:
'+spam = eggs'
'+spamspam = eggseggs'
'+spamspamspam = eggseggseggs'
will create another True group. The result will be something like:
{
False: [strs here],
True: [strs here],
False: [strs here],
True: [strs here]
}
Then it's just a matter of picking out each True group: if key, and then converting the corresponding group to a list: list(group).
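To see the alternating groups for yourself, you can turn each group into a list and print it next to its key. This is a small sketch with a shortened copy of the data (an assumption, to keep the output short):

```python
import itertools as it

data = ['***', '**', '.param', '+foo = bar', '+foofoo = barbar',
        '.model', '+spam = eggs']

# Materialise each group so the (key, group) pairs can be inspected
groups = [(key, list(group))
          for key, group in it.groupby(data, lambda s: s.startswith('+'))]

for key, group in groups:
    print(key, group)
```

The keys come out in the order False, True, False, True, because groupby() starts a new group every time the key changes between neighbouring items.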
Response to comment:
where exactly does python go through data, like how does it know s is
the data it's iterating over?
groupby() works like do_stuff() below:
def do_stuff(items, func):
    for item in items:
        print(func(item))
#Create the arguments for do_stuff():
data = [1, 2, 3]
def my_func(x):
    return x + 100
#Call do_stuff() with the proper argument types:
do_stuff(data, my_func) #Just like when calling groupby(), you provide some data
                        #and a function that you want applied to each item in data
--output:--
101
102
103
Which can also be written like this:
do_stuff(data, lambda x: x + 100)
lambda creates an anonymous function, which is convenient for simple functions which you don't need to refer to by name.
This list comprehension:
[
list(group)
for key, group in it.groupby(data, lambda s: s.startswith('+'))
if key
]
is equivalent to this:
results = []
for key, group in it.groupby(data, lambda s: s.startswith('+') ):
    if key:
        results.append(list(group))
It's clearer to write an explicit for loop; list comprehensions are usually somewhat faster, though. Here is some detail:
[
list(group) #The item you want to be in the results list for the current iteration of the loop here:
for key, group in it.groupby(data, lambda s: s.startswith('+')) #A for loop
if key #Only include the item for the current loop iteration in the results list if key is True
]
I would suggest doing things step by step.
1) Grab every word from the array separately.
2) Grab the first letter of the word.
3) Look if that is a '+' or '.'
Example code:
import re
class Dark():
    def __init__(self):
        # Array
        x = ['+Hello', '.World', '+Hobbits', '+Dwarves', '.Orcs']
        xPlus = []
        xDot = []
        # Values
        i = 0
        # Look through every word in the array one by one.
        while (i != len(x)):
            # Grab every word (s), and convert to string (y).
            s = x[i:i+1]
            y = '\n'.join(s)
            # Print word
            print(y)
            # Grab the first letter.
            letter = y[:1]
            if (letter == '+'):
                xPlus.append(y)
            elif (letter == '.'):
                xDot.append(y)
            else:
                pass
            # Add +1
            i = i + 1
        # Print lists
        print(xPlus)
        print(xDot)
#Run class
Dark()
Given a list of items, recall that the mode of the list is the item that occurs most often.
I would like to know how to create a function that can find the mode of a list but that displays a message if the list does not have a mode (e.g., all the items in the list only appear once). I want to make this function without importing any functions. I'm trying to make my own function from scratch.
You can use the max function and a key. Have a look at python max function using 'key' and lambda expression.
max(set(lst), key=lst.count)
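Since the question also asks for a message when there is no mode, the one-liner can be wrapped in a small function. This is only a sketch without imports; the function name and the message texts are made up for illustration.

```python
def mode_or_message(lst):
    """Return the most frequent item, or a message if there is no mode.

    Hypothetical helper; the name and the message texts are illustrative.
    """
    if not lst:
        return "The list is empty."
    candidate = max(set(lst), key=lst.count)
    # If even the most frequent item appears once, nothing repeats
    if lst.count(candidate) == 1:
        return "No mode: every item appears only once."
    return candidate

print(mode_or_message([1, 2, 2, 3]))
print(mode_or_message([1, 2, 3]))
```

Note that lst.count scans the whole list for every distinct item, so this gets slow for long lists, and when several items tie for the highest count it returns just one of them.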
You can use the Counter supplied in the collections package which has a mode-esque function
from collections import Counter
data = Counter(your_list_in_here)
data.most_common() # Returns all unique items and their counts
data.most_common(1) # Returns the highest occurring item
Note: Counter is new in python 2.7 and is not available in earlier versions.
Python 3.4 includes the function statistics.mode, so it is straightforward:
>>> from statistics import mode
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
You can have any type of elements in the list, not just numeric:
>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
Taking a leaf from some statistics software, namely SciPy and MATLAB: these just return the smallest most common value, so if two values occur equally often, the smallest of these is returned. Hopefully an example will help:
>>> from scipy.stats import mode
>>> mode([1, 2, 3, 4, 5])
(array([ 1.]), array([ 1.]))
>>> mode([1, 2, 2, 3, 3, 4, 5])
(array([ 2.]), array([ 2.]))
>>> mode([1, 2, 2, -3, -3, 4, 5])
(array([-3.]), array([ 2.]))
Is there any reason why you can't follow this convention?
There are many simple ways to find the mode of a list in Python such as:
import statistics
statistics.mode([1,2,3,3])
>>> 3
Or, you could find the max by its count
max(array, key = array.count)
The problem with those two methods is that they don't work with multiple modes. The first raises an error (statistics.StatisticsError; from Python 3.8 on it returns the first mode encountered instead), while the second returns the first mode.
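If you are on Python 3.8 or newer and an import is acceptable, statistics.multimode returns every value that ties for the highest count. A brief sketch of how it behaves on a few made-up inputs:

```python
from statistics import multimode

two_modes = multimode([1, 1, 2, 3, 3])   # two values tie for most frequent
all_once = multimode([1, 2, 3])          # every value ties when all occur once
empty = multimode([])                    # empty input gives an empty list

print(two_modes, all_once, empty)
```

Because a list where everything occurs once comes back in full, you still have to decide yourself whether that case counts as "no mode".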
In order to find the modes of a set, you could use this function:
def mode(array):
    most = max(list(map(array.count, array)))
    return list(set(filter(lambda x: array.count(x) == most, array)))
Extending the Community answer, which does not work when the list is empty, here is working code for mode:
def mode(arr):
    if arr==[]:
        return None
    else:
        return max(set(arr), key=arr.count)
In case you are interested in either the smallest, largest or all modes:
def get_small_mode(numbers, out_mode):
    counts = {k:numbers.count(k) for k in set(numbers)}
    modes = sorted(dict(filter(lambda x: x[1] == max(counts.values()), counts.items())).keys())
    if out_mode=='smallest':
        return modes[0]
    elif out_mode=='largest':
        return modes[-1]
    else:
        return modes
A little longer, but can have multiple modes and can get string with most counts or mix of datatypes.
def getmode(inplist):
    '''with list of items as input, returns mode
    '''
    dictofcounts = {}
    listofcounts = []
    for i in inplist:
        countofi = inplist.count(i) # count items for each item in list
        listofcounts.append(countofi) # add counts to list
        dictofcounts[i]=countofi # add counts and item in dict to get later
    maxcount = max(listofcounts) # get max count of items
    if maxcount ==1:
        print("There is no mode for this dataset, values occur only once")
    else:
        modelist = [] # if more than one mode, add to list to print out
        for key, item in dictofcounts.items():
            if item ==maxcount: # get item from original list with most counts
                modelist.append(str(key))
        print("The mode(s) are:",' and '.join(modelist))
        return modelist
The mode of a data set is the member (or members) that occurs most frequently in the set. If two members appear most often, the same number of times, the data has two modes; this is called bimodal. If there are more than two modes, the data is called multimodal. If all the members of the data set appear the same number of times, the data set has no mode at all. The following function modes() finds the mode(s) in a given list of data:
import numpy as np
import pandas as pd
def modes(arr):
    df = pd.DataFrame(arr, columns=['Values'])
    dat = pd.crosstab(df['Values'], columns=['Freq'])
    if len(np.unique((dat['Freq']))) > 1:
        mode = list(dat.index[np.array(dat['Freq'] == max(dat['Freq']))])
        return mode
    else:
        print("There is NO mode in the data set")
Output:
# For a list of numbers in x as
In [1]: x = [2, 3, 4, 5, 7, 9, 8, 12, 2, 1, 1, 1, 3, 3, 2, 6, 12, 3, 7, 8, 9, 7, 12, 10, 10, 11, 12, 2]
In [2]: modes(x)
Out[2]: [2, 3, 12]
# For a list of repeated numbers in y as
In [3]: y = [2, 2, 3, 3, 4, 4, 10, 10]
In [4]: modes(y)
Out[4]: There is NO mode in the data set
# For a list of strings/characters in z as
In [5]: z = ['a', 'b', 'b', 'b', 'e', 'e', 'e', 'd', 'g', 'g', 'c', 'g', 'g', 'a', 'a', 'c', 'a']
In [6]: modes(z)
Out[6]: ['a', 'g']
If we do not want to import numpy or pandas to call any function from these packages, then to get this same output, modes() function can be written as:
def modes(arr):
    cnt = []
    for i in arr:
        cnt.append(arr.count(i))
    uniq_cnt = []
    for i in cnt:
        if i not in uniq_cnt:
            uniq_cnt.append(i)
    if len(uniq_cnt) > 1:
        m = []
        for i in list(range(len(cnt))):
            if cnt[i] == max(uniq_cnt):
                m.append(arr[i])
        mode = []
        for i in m:
            if i not in mode:
                mode.append(i)
        return mode
    else:
        print("There is NO mode in the data set")
I wrote up this handy function to find the mode.
def mode(nums):
    corresponding={}
    occurances=[]
    for i in nums:
        count = nums.count(i)
        corresponding.update({i:count})
    for i in corresponding:
        freq=corresponding[i]
        occurances.append(freq)
    maxFreq=max(occurances)
    # list() is needed in Python 3, where keys() and values() are views
    keys=list(corresponding.keys())
    values=list(corresponding.values())
    index_v = values.index(maxFreq)
    # return the value directly; rebinding the global name mode would replace this function
    return keys[index_v]
Short, but somehow ugly:
def mode(arr) :
    m = max([arr.count(a) for a in arr])
    return [x for x in arr if arr.count(x) == m][0] if m>1 else None
Using a dictionary, slightly less ugly:
def mode(arr) :
    f = {}
    for a in arr : f[a] = f.get(a,0)+1
    m = max(f.values())
    t = [(x,f[x]) for x in f if f[x]==m]
    return t[0][0] if m > 1 else None
This function returns the mode or modes of a list, no matter how many there are, as well as the frequency of the mode or modes in the dataset. If there is no mode (i.e. all items occur only once), the function returns an error string. This is similar to A_nagpal's function above but is, in my humble opinion, more complete, and I think it's easier to understand for any Python novices (such as yours truly) reading this question.
def l_mode(list_in):
    count_dict = {}
    for e in (list_in):
        count = list_in.count(e)
        if e not in count_dict.keys():
            count_dict[e] = count
    max_count = 0
    for key in count_dict:
        if count_dict[key] >= max_count:
            max_count = count_dict[key]
    corr_keys = []
    for corr_key, count_value in count_dict.items():
        if count_dict[corr_key] == max_count:
            corr_keys.append(corr_key)
    if max_count == 1 and len(count_dict) != 1:
        return 'There is no mode for this data set. All values occur only once.'
    else:
        corr_keys = sorted(corr_keys)
        return corr_keys, max_count
For a number to be a mode, it must occur more times than at least one other number in the list, and it must not be the only number in the list. So, I refactored @mathwizurd's answer (to use the set difference) as follows:
def mode(array):
    '''
    returns a set containing valid modes
    returns a message if no valid mode exists
    - when all numbers occur the same number of times
    - when only one number occurs in the list
    - when no number occurs in the list
    '''
    most = max(map(array.count, array)) if array else None
    mset = set(filter(lambda x: array.count(x) == most, array))
    return mset if set(array) - mset else "list does not have a mode!"
These tests pass successfully:
mode([]) == "list does not have a mode!"
mode([1]) == "list does not have a mode!"
mode([1, 1]) == "list does not have a mode!"
mode([1, 1, 2, 2]) == "list does not have a mode!"
Here is how you can find mean,median and mode of a list:
import numpy as np
from scipy import stats
#to take input
size = int(input())
numbers = list(map(int, input().split()))
print(np.mean(numbers))
print(np.median(numbers))
print(int(stats.mode(numbers)[0]))
Simple code that finds the mode of the list without any imports:
nums = #your_list_goes_here
nums.sort()
counts = dict()
for i in nums:
    counts[i] = counts.get(i, 0) + 1
mode = max(counts, key=counts.get)
In case of multiple modes, it should return the minimum mode.
Why not just
def print_mode (thelist):
    counts = {}
    for item in thelist:
        counts [item] = counts.get (item, 0) + 1
    maxcount = 0
    maxitem = None
    for k, v in counts.items ():
        if v > maxcount:
            maxitem = k
            maxcount = v
    if maxcount == 1:
        print("All values only appear once")
    elif list(counts.values()).count (maxcount) > 1:
        print("List has multiple modes")
    else:
        print("Mode of list:", maxitem)
This doesn't have a few error checks that it should have, but it will find the mode without importing any functions and will print a message if all values appear only once. It will also detect multiple items sharing the same maximum count, although it wasn't clear if you wanted that.
This will return all modes:
def mode(numbers):
    largestCount = 0
    modes = []
    for x in numbers:
        if x in modes:
            continue
        count = numbers.count(x)
        if count > largestCount:
            del modes[:]
            modes.append(x)
            largestCount = count
        elif count == largestCount:
            modes.append(x)
    return modes
For those looking for the minimum mode, e.g. in the case of a bi-modal distribution, using numpy (note that np.bincount only accepts non-negative integers):
import numpy as np
mode = np.argmax(np.bincount(your_list))
Okay! The community already has a lot of answers, and some of them use other functions, which you don't want.
Let's create our own very simple and easily understandable function.
import numpy as np
#Declare Function Name
def calculate_mode(lst):
    #Next step is to find the unique elements in the list and their respective frequencies.
    unique_elements,freq = np.unique(lst, return_counts=True)
    #Get mode
    max_freq = np.max(freq) #maximum frequency
    mode_index = np.where(freq==max_freq) #max freq index
    mode = unique_elements[mode_index] #get mode by index
    return mode
Example
lst =np.array([1,1,2,3,4,4,4,5,6])
print(calculate_mode(lst))
>>> Output [4]
How my brain decided to do it completely from scratch. Efficient and concise :) (jk lol)
import random
def removeDuplicates(arr):
    dupFlag = False
    for i in range(len(arr)):
        #check if we found a dup, if so, stop
        if dupFlag:
            break
        for j in range(len(arr)):
            if ((arr[i] == arr[j]) and (i != j)):
                arr.remove(arr[j])
                dupFlag = True
                break
    #if there was a duplicate repeat the process, this is so we can account for the changing length of the arr
    if (dupFlag):
        removeDuplicates(arr)
    else:
        #if no duplicates return the arr
        return arr
#currently returns modes and all their occurrences... Need to handle dupes
def mode(arr):
    numCounts = []
    #init numCounts
    for i in range(len(arr)):
        numCounts += [0]
    for i in range(len(arr)):
        count = 1
        for j in range(len(arr)):
            if (arr[i] == arr[j] and i != j):
                count += 1
        #add the count for that number to the corresponding index
        numCounts[i] = count
    #find which has the greatest number of occurrences
    greatestNum = 0
    for i in range(len(numCounts)):
        if (numCounts[i] > greatestNum):
            greatestNum = numCounts[i]
    #finally return the mode(s)
    modes = []
    for i in range(len(numCounts)):
        if numCounts[i] == greatestNum:
            modes += [arr[i]]
    #remove duplicates (using aliasing)
    print("modes: ", modes)
    removeDuplicates(modes)
    print("modes after removing duplicates: ", modes)
    return modes
def initArr(n):
    arr = []
    for i in range(n):
        arr += [random.randrange(0, n)]
    return arr
#initialize an array of random ints
arr = initArr(1000)
print(arr)
print("_______________________________________________")
modes = mode(arr)
#print result
print("Mode is: ", modes) if (len(modes) == 1) else print("Modes are: ", modes)
def mode(inp_list):
    sort_list = sorted(inp_list)
    dict1 = {}
    for i in sort_list:
        count = sort_list.count(i)
        if i not in dict1.keys():
            dict1[i] = count
    maximum = 0 #no. of occurrences
    max_key = -1 #element having the most occurrences
    for key in dict1:
        if(dict1[key]>maximum):
            maximum = dict1[key]
            max_key = key
        elif(dict1[key]==maximum):
            if(key<max_key):
                maximum = dict1[key]
                max_key = key
    return max_key
def mode(data):
    lst =[]
    hgh=0
    for i in range(len(data)):
        lst.append(data.count(data[i]))
    m= max(lst)
    ml = [x for x in data if data.count(x)==m ] #to find most frequent values
    mode = []
    for x in ml: #to remove duplicates of mode
        if x not in mode:
            mode.append(x)
    return mode
print(mode([1,2,2,2,2,7,7,5,5,5,5]))
Here is a simple function that gets the first mode that occurs in a list. It makes a dictionary with the list elements as keys and their number of occurrences as values, and then reads the dict values to get the mode.
def findMode(readList):
    numCount={}
    highestNum=0
    for i in readList:
        if i in numCount.keys(): numCount[i] += 1
        else: numCount[i] = 1
    for i in numCount.keys():
        if numCount[i] > highestNum:
            highestNum=numCount[i]
            mode=i
    if highestNum != 1: print(mode)
    elif highestNum == 1: print("All elements of list appear once.")
If you want a clear approach, useful for classroom and only using lists and dictionaries by comprehension, you can do:
def mode(my_list):
    # Form a new list with the unique elements
    unique_list = sorted(list(set(my_list)))
    # Create a dictionary by comprehension with the unique elements and their counts
    appearance = {a:my_list.count(a) for a in unique_list}
    # Calculate max number of appearances
    max_app = max(appearance.values())
    # Return the elements of the dictionary that appear that # of times
    return {k: v for k, v in appearance.items() if v == max_app}
#function to find mode
def mode(data):
    modecnt=0
    #highest count seen so far
    for i in range(len(data)):
        icount=data.count(data[i])
        #count of the current number in the list
        if icount>modecnt:
            #this branch runs if the current count is greater than the previous count
            mode=data[i]
            #here the mode is stored
            modecnt=icount
            #the count of the mode is stored
    return mode
print(mode(data1))
import numpy as np
def get_mode(xs):
    values, counts = np.unique(xs, return_counts=True)
    max_count_index = np.argmax(counts) #return the index with max value counts
    return values[max_count_index]
print(get_mode([1,7,2,5,3,3,8,3,2]))
Perhaps try the following. It is O(n) and returns a list of floats (or ints). It is thoroughly, automatically tested. It uses collections.defaultdict, but I'd like to think you're not opposed to using that. It can also be found at https://stromberg.dnsalias.org/~strombrg/stddev.html
import collections
import typing
def compute_mode(list_: typing.List[float]) -> typing.List[float]:
    """
    Compute the mode of list_.
    Note that the return value is a list, because sometimes there is a tie for "most common value".
    See https://stackoverflow.com/questions/10797819/finding-the-mode-of-a-list
    """
    if not list_:
        raise ValueError('Empty list')
    if len(list_) == 1:
        raise ValueError('Single-element list')
    value_to_count_dict: typing.DefaultDict[float, int] = collections.defaultdict(int)
    for element in list_:
        value_to_count_dict[element] += 1
    count_to_values_dict = collections.defaultdict(list)
    for value, count in value_to_count_dict.items():
        count_to_values_dict[count].append(value)
    counts = list(count_to_values_dict)
    if len(counts) == 1:
        raise ValueError('All elements in list are the same')
    maximum_occurrence_count = max(counts)
    if maximum_occurrence_count == 1:
        raise ValueError('No element occurs more than once')
    minimum_occurrence_count = min(counts)
    if maximum_occurrence_count <= minimum_occurrence_count:
        raise ValueError('Maximum count not greater than minimum count')
    return count_to_values_dict[maximum_occurrence_count]
I have a list of phone numbers that have been dialed (nums_dialed).
I also have a set of phone numbers which are the numbers in a client's office (client_nums).
How do I efficiently figure out how many times I've called a particular client (total)?
For example:
>>>nums_dialed=[1,2,2,3,3]
>>>client_nums=set([2,3])
>>>???
total=4
Problem is that I have a large-ish dataset: len(client_nums) ~ 10^5; and len(nums_dialed) ~10^3.
Which client has 10^5 numbers in his office? Do you do work for an entire telephone company?
Anyway:
print(sum(1 for num in nums_dialed if num in client_nums))
That will give you the number as fast as possible.
If you want to do it for multiple clients, using the same nums_dialed list, then you could cache the data on each number first:
nums_dialed_dict = collections.defaultdict(int)
for num in nums_dialed:
    nums_dialed_dict[num] += 1
Then just sum the ones on each client:
sum(nums_dialed_dict[num] for num in this_client_nums)
That would be a lot quicker than iterating over the entire list of numbers again for each client.
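Put together for several clients, the caching idea might look roughly like the sketch below. The client names and their number sets are made up for illustration, and Counter stands in for the defaultdict above (it behaves the same way for missing keys).

```python
from collections import Counter

nums_dialed = [1, 2, 2, 3, 3, 7]

# Hypothetical clients; the names and numbers are made up for illustration
clients = {
    'client_a': {2, 3},
    'client_b': {7, 8},
}

# Count each dialed number once, up front
dialed_count = Counter(nums_dialed)

# A Counter returns 0 for numbers that were never dialed
totals = {name: sum(dialed_count[num] for num in nums)
          for name, nums in clients.items()}
print(totals)
```

With far more client numbers than dialed numbers, as in the question, it may be cheaper to loop over dialed_count.items() and test each dialed number for membership in the client's set instead.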
>>> client_nums = set([2, 3])
>>> nums_dialed = [1, 2, 2, 3, 3]
>>> count = 0
>>> for num in nums_dialed:
...     if num in client_nums:
...         count += 1
...
>>> count
4
>>>
Should be quite efficient even for the large numbers you quote.
Using collections.Counter from Python 2.7:
dialed_count = collections.Counter(nums_dialed)
count = sum(dialed_count[t] for t in client_nums)
That's a very popular way to combine sorted lists in a single pass:
nums_dialed = [1, 2, 2, 3, 3]
client_nums = [2,3]
nums_dialed.sort()
client_nums.sort()
c = 0
i = iter(nums_dialed)
j = iter(client_nums)
try:
    a = next(i)
    b = next(j)
    while True:
        if a < b:
            a = next(i)
            continue
        if a > b:
            b = next(j)
            continue
        # a == b
        c += 1
        a = next(i) # next dialed
except StopIteration:
    pass
print(c)
This is because a "set" is an unordered collection (I don't know why it uses hashes rather than a binary tree or a sorted list), so it isn't fair to use it here. You can implement your own "set" through "bisect" if you like lists, or through something more complicated that produces an ordered iterator.
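For the bisect variant, a rough sketch could look like this. The function name is made up, and it assumes the client numbers are already available as a sorted list.

```python
from bisect import bisect_left

def count_calls_sorted(nums_dialed, sorted_client_nums):
    """Count dialed numbers that appear in an already sorted list.

    Hypothetical helper; binary search stands in for set membership.
    """
    total = 0
    for num in nums_dialed:
        pos = bisect_left(sorted_client_nums, num)
        # num is present only if the slot found actually holds it
        if pos < len(sorted_client_nums) and sorted_client_nums[pos] == num:
            total += 1
    return total

print(count_calls_sorted([1, 2, 2, 3, 3], sorted([2, 3])))
```

Each lookup is a binary search, so this costs O(log n) per dialed number, compared with the average O(1) of a hash-based set.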
The method I use is to simply convert the set into a list and then use the len() function to count its values (len() also works on the set directly).
set_var = {"abc", "cba"}
print(len(list(set_var)))
Output:
2