Python 'for' loop performance too slow

Python 'for' loop performance too slow - python

I have over 500,000 rows in my dataframe and a number of similar 'for' loops which are causing my code to take over a hour to complete its computation. Is there a more efficient way of writing the following 'for' loop so that things run a lot faster:
col_26 = []
col_27 = []
col_28 = []
for ind in df.index:
if df['A_factor'][ind] > df['B_factor'][ind]:
col_26.append('Yes')
col_27.append('No')
col_28.append(df['A_value'][ind])
elif df['A_factor'][ind] < df['B_factor'][ind]:
col_26.append('No')
col_27.append('Yes')
col_28.append(df['B_value'][ind])
else:
col_26.append('')
col_27.append('')
col_28.append(float('nan'))

You might want to look into the pandas iterrows() function or using apply, you can look at this article aswell: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06

Try column operations:
data = {'A_factor': [1, 2, 3, 4, 5],
'A_value': [10, 20, 30, 40, 50],
'B_factor': [2, 3, 1, 2, 6],
'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)
df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan
mask = df['A_factor'] > df['B_factor']
df.loc[mask, 'col_26'] = 'Yes'
df.loc[~mask, 'col_26'] = 'No'
df.loc[mask, 'col_28'] = df[mask]['A_value']
df.loc[~mask, 'col_27'] = 'Yes'
df.loc[mask, 'col_27'] = 'No'
df.loc[~mask, 'col_28'] = df[~mask]['B_value']

Appending to lists in Python is painfully slow. Initializing the lists before the iteration can speed things up. For example,
def f():
x = []
for ii in range(500000):
x.append(str(x))
def f2():
x = [""] * 500000
for ii in range(500000):
x[ii] = str(x)
timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243
Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping. For example:
a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
col_26 = np.empty(a_factor.shape, dtype='U3') # U3 => string of size 3
col_27 = np.empty(a_factor.shape, dtype='U3')
col_28 = np.empty(a_factor.shape)
a_greater = a_factor > b_factor
b_greater = a_factor < b_factor
both_equal = a_factor == b_factor
col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'
col_27[a_greater] = 'Yes'
col_27[b_greater] = 'No'
col_28[a_greater] = a_factor[a_greater]
col_28[b_greater] = b_factor[b_greater]
col_28[both_equal] = np.nan

append causes python requests for heap memory to get more memory. using append in for loop causes get memory and free it continually to get more memory. so it's better to say to python how many item you need.
col_26 = [True]*500000
col_27 = [False]*500000
col_28 = [float('nan')]*500000
for ind in df.index:
if df['A_factor'][ind] > df['B_factor'][ind]:
col_28[ind] = df['A_value'][ind]
elif df['A_factor'][ind] < df['B_factor'][ind]:
col_26[ind] = False
col_27[ind] = True
col_28[ind] = df['B_value'][ind]
else:
col_26[ind] = ''
col_27[ind] = ''

Related

Python - speed up iterations over permutation list (with multithreading?)

I have generated permutations list with the itertools.permutation (all permutations) or numpy.permuted (part of all permutations) functions in python, depending on how big all permutations are. This part of the code is ok and works well and quickly.
However, the iterator list is big enough (100k or bigger) and I would like to go through it with multiple threads but don't really know how to accomplish that.
Here is what I have so far. The chunk of code is working but in an inefficient way, it takes a long time to complete the task. I have tried to use multiprocessing.Pool() function, but I have not been able to make it work
avg = {'Block': np.arange(1, 17, 1),
'A': np.random.uniform(0,1,16),
'B': np.random.uniform(0,1,16),
'C': np.random.uniform(0,1,16)
}
avg = pd.DataFrame(avg)
seed = 1234
thr = 100000
def permuta (avg, seed, thr):
kpi = []
# Permutations
if len(pd.unique(avg.Block)) > 9:
rng = np.random.default_rng(seed)
perm = rng.permuted(np.tile(pd.unique((avg.Block)-1).astype(int), thr).reshape(thr, (pd.unique(avg.Block)-1).size), axis=1)
aa = list(perm)
aa1 = [tuple(x) for x in aa]
bb = [np.arange(0, len(avg))]
bb1 = tuple(bb[0])
if bb1 not in aa1:
bb.extend(aa)
aa = [tuple(x) for x in bb]
else:
aa = [tuple(x) for x in aa]
else:
perm = permutations(avg.index)
aa = list(perm)
n0 = len(aa)
for i in aa:
if (aa.index(i)+1) % 1000 == 0:
print(' progress: {:.2f}%'.format((aa.index(i)+1)/n0*100))
df = avg.loc[list(i)]
df.reset_index(drop=True, inplace=True)
model_A = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1,1), df.A)
model_B = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1,1), df.B)
model_C = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1,1), df.C)
block_order_id = tuple(x+1 for x in i)
model_kpi = [block_order_id, model_A.coef_[0], model_B.coef_[0], model_C.coef_[0]]
kpi.append(model_kpi)
kpi = pd.DataFrame (kpi, columns = ['Block_ord', 'm_A', 'm_B', 'm_C'])
return kpi
I would be grateful if someone could help me to speed up the code execution, using all cores for calculations, replacing the "for" loop for a more efficient iterator, or a mix of them.
Thanks for your help

I usually use concurrent.futures as described https://docs.python.org/3/library/concurrent.futures.html
Since you mentioned having trouble getting some implementations set up, here's an rough example pasted below that you just need to tweak a bit.
In this example, I'm using ThreadPoolExecutor which you can try out and tune to your machine. However, if you want to start using different processes then you can use ProcessPoolExecutor which has similar syntax and is also explained in the doc link I posted
import concurrent.futures
avg = {'Block': np.arange(1, 17, 1),
'A': np.random.uniform(0, 1, 16),
'B': np.random.uniform(0, 1, 16),
'C': np.random.uniform(0, 1, 16)
}
avg = pd.DataFrame(avg)
seed = 1234
thr = 100000
def permuta(avg, seed, thr):
kpi = []
# Permutations
if len(pd.unique(avg.Block)) > 9:
rng = np.random.default_rng(seed)
perm = rng.permuted(
np.tile(pd.unique((avg.Block) - 1).astype(int), thr).reshape(thr, (pd.unique(avg.Block) - 1).size), axis=1)
aa = list(perm)
aa1 = [tuple(x) for x in aa]
bb = [np.arange(0, len(avg))]
bb1 = tuple(bb[0])
if bb1 not in aa1:
bb.extend(aa)
aa = [tuple(x) for x in bb]
else:
aa = [tuple(x) for x in aa]
else:
perm = permutations(avg.index)
aa = list(perm)
n0 = len(aa)
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
futures = {executor.submit(everything_in_for_loop, i, other_args_needed_for_this_method): i for i in aa}
error_count = 0
results = [] # the results will be placed into this array
for future in futures:
try:
result = future.result()
if result is not None:
results.append(result)
except Exception as exc:
error_count += 1
print(type(exc))
print(exc.args)
kpi = pd.DataFrame(kpi, columns=['Block_ord', 'm_A', 'm_B', 'm_C'])
return kpi
def everything_in_for_loop(i, other_args_needed_for_this_method):
if (aa.index(i) + 1) % 1000 == 0:
print(' progress: {:.2f}%'.format((aa.index(i) + 1) / n0 * 100))
df = avg.loc[list(i)]
df.reset_index(drop=True, inplace=True)
model_A = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1, 1), df.A)
model_B = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1, 1), df.B)
model_C = LinearRegression(fit_intercept=True).fit(df.index.values.reshape(-1, 1), df.C)
block_order_id = tuple(x + 1 for x in i)
model_kpi = [block_order_id, model_A.coef_[0], model_B.coef_[0], model_C.coef_[0]]
kpi.append(model_kpi)

Speeding up a pd.apply() function

I have some example data
import numpy as np
import pandas as pd
users = 5
size = users*6
df = pd.DataFrame(
{'userid': np.random.choice(np.arange(0, users), size),
'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
}
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
For each userid, I need to determine how many times a_time > b_time or vice versa, depending on column focus.
I have a custom function
def some_func(x):
if (x.focus == 'a').all():
a = x.a_time
b = x.b_time
x['changes'] = (b > a).sum()
x['days'] = len(a)
elif (x.focus == 'b').all():
a = x.a_time
b = x.b_time
x['changes'] = (a > b).sum()
x['days'] = len(a)
elif (x.focus == 'both').all():
x['changes'] = 0
x['days'] = len(a)
else:
x['changes'] = None
x['days'] = None
return x
test_dat.groupby(['userid', 'focus']).apply(some_func).reset_index(name = 'n_changes')
that works just fine when the number of userid is small. However, as the number of unique userid increases to >100K, this function is almost unbearably slow.
Is there a way to speed up this fx? My guess is that there might be an alternative to the if-elif-else syntax in some_func() but I'm not sure what that syntax might be. The number of rows for each userid is arbitrarily long.
I'm open to non-pandas options if necessary.

A now-deleted answer from another user helped me get to a reasonably-quick solution. After learning that GroupBy.apply() does operate on dataframes and that this answer suggests that apply() isn't very performant (at least compared to transform()), I thought I'd try dropping apply() entirely. I don't think this solution is very pretty but with 277161 userid values in 1.36MM rows, it runs in about 32 seconds:
# Simulate some data
users = 277161
size = 1360000
df = pd.DataFrame(
{'userid': np.random.choice(np.arange(0, users), size),
'a_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
'b_time': np.random.normal(loc = 1.5, scale = 0.5, size = size),
}
)
df['focus'] = np.where(df.userid % 2 == 0, 'a', 'b')
test_dat = df[['userid', 'focus', 'a_time', 'b_time']].sort_values('userid').copy(deep = True).reset_index(drop = True)
# Evalute time
start = datetime.now()
test_dat['aux'] = np.where(
test_dat.focus == 'a', test_dat.a_time.lt(test_dat.b_time),
np.where(test_dat.focus == 'b', test_dat.a_time.gt(test_dat.b_time), None))
test = test_dat.groupby(['userid', 'focus'])['aux'].agg([np.sum, np.size]).reset_index()
# the row above for some reason doesn't handle single boolean values well and just
# returns the boolean values when the number of rows in a group = 1,
# so I force those remaining boolean values into ints here
test['sum'] = test['sum'].astype(int)
test = test.rename(columns = {'sum': 'changes', 'size': 'days'})
end = datetime.now()
(end - start).seconds
I recognize that this isn't an explicit answer to my question (i.e., "speeding up an apply() function") but the reality is that this is a workable, reasonably-quick solution.

python: processing data so that only constant values remain

I have data from a measurement and I want to process the data so that only the values remain, that are constant. The measured signal consists of parts where the value stays constant for some time then I do a change on the system that causes the value to increase. It takes time for the system to reach the constant value after the adjustment I do.
I wrote a programm that compares every value with the 10 previous values. If it is equal to them within a tolerance it gets saved.
The code works but i feel like this can be done cleaner and more efficient so that it is sutable to process larger amouts of data. But I dont know how to make the code in for-loop more efficient. Do you have any suggestions?
Thank you in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('radiale Steifigkeit_22_04_2022_raw.csv',
sep= ";",
decimal = ',',
skipinitialspace=True,
comment = '\t')
#df = df.drop(df.columns[[0,4]], axis=1)
#print(df.head())
#print(df.dtypes)
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Kraft')
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Weg')
#plt.show()
s = pd.Series(df['Weg'], name = 'Weg')
f = pd.Series(df['Kraft'], name= 'Kraft')
t = pd.Series(df['Time_SYS 01-cDAQ:1_A-In-All_Rec_rel'], name= 'Zeit')
#s_const = pd.Series()
s_const = []
f_const = []
t_const = []
s = np.abs(s)
#plt.plot(s)
#plt.show()
c = 0
#this for-loop compares the value s[i] with the previous 10 measurements.
#If it is equal within a tolerance it is saved into s_const.
for i in range(len(s)):
#for i in range(0,2000):
if i > 10:
si = round(s[i],3)
s1i = round(s[i-1],3)
s2i = round(s[i-2],3)
s3i = round(s[i-3],3)
s4i = round(s[i-4],3)
s5i = round(s[i-5],3)
s6i = round(s[i-6],3)
s7i = round(s[i-7],3)
s8i = round(s[i-8],3)
s9i = round(s[i-9],3)
s10i = round(s[i-10],3)
if si == s1i == s2i == s3i == s4i == s5i== s6i == s7i== s8i == s9i == s10i:
c = c+1
s_const.append(s[i])
f_const.append(f[i])

Here is a very performant implementation using itertools (based on Check if all elements in a list are identical):
from itertools import groupby
def all_equal(iterable):
g = groupby(iterable)
return next(g, True) and not next(g, False)
data = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
window = 3
stable = [i for i in range(len(data) - window + 1) if all_equal(data[i:i+window])]
print(stable) # -> [1, 2, 7, 8, 9, 10, 13]
The algorithm produces a list of indices in your data where a stable period of length window starts.

How do I generate a table from a list

I have a list that contains sublists with 3 values and I need to print out a list that looks like:
I also need to compare the third column values with eachother to tell if they are increasing or decreasing as you go down.
bb = 3.9
lowest = 0.4
#appending all the information to a list
allinfo= []
while bb>=lowest:
everything = angleWithPost(bb,cc,dd,ee)
allinfo.append(everything)
bb-=0.1
I think the general idea for finding out whether or not the third column values are increasing or decreasing is:
#Checking whether or not Fnet are increasing or decreasing
ii=0
while ii<=(10*(bb-lowest)):
if allinfo[ii][2]>allinfo[ii+1][2]:
abc = "decreasing"
elif allinfo[ii][2]<allinfo[ii+1][2]:
abc = "increasing"
ii+=1
Then when i want to print out my table similar to the one above.
jj=0
while jj<=(10*(bb-lowest))
print "%8.2f %12.2f %12.2f %s" %(allinfo[jj][0], allinfo[jj][1], allinfo[jj][2], abc)
jj+=1
here is the angle with part
def chainPoints(aa,DIS,SEG,H):
#xtuple x chain points
n=0
xterms = []
xterm = -DIS
while n<=SEG:
xterms.append(xterm)
n+=1
xterm = -DIS + n*2*DIS/(SEG)
#
#ytuple y chain points
k=0
yterms = []
while k<=SEG:
yterm = H + aa*m.cosh(xterms[k]/aa) - aa*m.cosh(DIS/aa)
yterms.append(yterm)
k+=1
return(xterms,yterms)
#
#
def chainLength(aa,DIS,SEG,H):
xterms, yterms = chainPoints(aa,DIS,SEG,H)# using x points and y points from the chainpoints function
#length of chain
ff=1
Lterm=0.
totallength=0.
while ff<=SEG:
Lterm = m.sqrt((xterms[ff]-xterms[ff-1])**2 + (yterms[ff]-yterms[ff-1])**2)
totallength += Lterm
ff+=1
return(totallength)
#
def angleWithPost(aa,DIS,SEG,H):
xterms, yterms = chainPoints(aa,DIS,SEG,H)
totallength = chainLength(aa,DIS,SEG,H)
#Find the angle
thetaradians = (m.pi)/2. + m.atan(((yterms[1]-yterms[0])/(xterms[1]-xterms[0])))
#Need to print out the degrees
thetadegrees = (180/m.pi)*thetaradians
#finding the net force
Fnet = abs((rho*grav*totallength))/(2.*m.cos(thetaradians))
return(totallength, thetadegrees, Fnet)

Review this Python2 implementation which uses map and an iterator trick.
from itertools import izip_longest, islice
from pprint import pprint
data = [
[1, 2, 3],
[1, 2, 4],
[1, 2, 3],
[1, 2, 5],
]
class AddDirection(object):
def __init__(self):
# This default is used if the series begins with equal values or has a
# single element.
self.increasing = True
def __call__(self, pair):
crow, nrow = pair
if nrow is None or crow[-1] == nrow[-1]:
# This is the last row or the direction didn't change. Just return
# the direction we previouly had.
inc = self.increasing
elif crow[-1] > nrow[-1]:
inc = False
else:
# Here crow[-1] < nrow[-1].
inc = True
self.increasing = inc
return crow + ["Increasing" if inc else "Decreasing"]
result = map(AddDirection(), izip_longest(data, islice(data, 1, None)))
pprint(result)
The output:
pts/1$ python2 a.py
[[1, 2, 3, 'Increasing'],
[1, 2, 4, 'Decreasing'],
[1, 2, 3, 'Increasing'],
[1, 2, 5, 'Increasing']]
Whenever you want to transform the contents of a list (in this case the list of rows), map is a good place where to begin thinking.
When the algorithm requires data from several places of a list, offsetting the list and zipping the needed values is also a powerful technique. Using generators so that the list doesn't have to be copied, makes this viable in real code.
Finally, when you need to keep state between calls (in this case the direction), using an object is the best choice.
Sorry if the code is too terse!

Basically you want to add a 4th column to the inner list and print the results?
#print headers of table here, use .format for consistent padding
previous = 0
for l in outer_list:
if l[2] > previous:
l.append('increasing')
elif l[2] < previous:
l.append('decreasing')
previous = l[2]
#print row here use .format for consistent padding
Update for list of tuples, add value to tuple:
import random
outer_list = [ (i, i, random.randint(0,10),)for i in range(0,10)]
previous = 0
allinfo = []
for l in outer_list:
if l[2] > previous:
allinfo.append(l +('increasing',))
elif l[2] < previous:
allinfo.append(l +('decreasing',))
previous = l[2]
#print row here use .format for consistent padding
print(allinfo)
This most definitely can be optimized and you could reduce the number of times you are iterating over the data.

Finding the mode of a list

Given a list of items, recall that the mode of the list is the item that occurs most often.
I would like to know how to create a function that can find the mode of a list but that displays a message if the list does not have a mode (e.g., all the items in the list only appear once). I want to make this function without importing any functions. I'm trying to make my own function from scratch.

You can use the max function and a key. Have a look at python max function using 'key' and lambda expression.
max(set(lst), key=lst.count)

You can use the Counter supplied in the collections package which has a mode-esque function
from collections import Counter
data = Counter(your_list_in_here)
data.most_common() # Returns all unique items and their counts
data.most_common(1) # Returns the highest occurring item
Note: Counter is new in python 2.7 and is not available in earlier versions.

Python 3.4 includes the method statistics.mode, so it is straightforward:
>>> from statistics import mode
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
You can have any type of elements in the list, not just numeric:
>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'

Taking a leaf from some statistics software, namely SciPy and MATLAB, these just return the smallest most common value, so if two values occur equally often, the smallest of these are returned. Hopefully an example will help:
>>> from scipy.stats import mode
>>> mode([1, 2, 3, 4, 5])
(array([ 1.]), array([ 1.]))
>>> mode([1, 2, 2, 3, 3, 4, 5])
(array([ 2.]), array([ 2.]))
>>> mode([1, 2, 2, -3, -3, 4, 5])
(array([-3.]), array([ 2.]))
Is there any reason why you can 't follow this convention?

There are many simple ways to find the mode of a list in Python such as:
import statistics
statistics.mode([1,2,3,3])
>>> 3
Or, you could find the max by its count
max(array, key = array.count)
The problem with those two methods are that they don't work with multiple modes. The first returns an error, while the second returns the first mode.
In order to find the modes of a set, you could use this function:
def mode(array):
most = max(list(map(array.count, array)))
return list(set(filter(lambda x: array.count(x) == most, array)))

Extending the Community answer that will not work when the list is empty, here is working code for mode:
def mode(arr):
if arr==[]:
return None
else:
return max(set(arr), key=arr.count)

In case you are interested in either the smallest, largest or all modes:
def get_small_mode(numbers, out_mode):
counts = {k:numbers.count(k) for k in set(numbers)}
modes = sorted(dict(filter(lambda x: x[1] == max(counts.values()), counts.items())).keys())
if out_mode=='smallest':
return modes[0]
elif out_mode=='largest':
return modes[-1]
else:
return modes

A little longer, but can have multiple modes and can get string with most counts or mix of datatypes.
def getmode(inplist):
'''with list of items as input, returns mode
'''
dictofcounts = {}
listofcounts = []
for i in inplist:
countofi = inplist.count(i) # count items for each item in list
listofcounts.append(countofi) # add counts to list
dictofcounts[i]=countofi # add counts and item in dict to get later
maxcount = max(listofcounts) # get max count of items
if maxcount ==1:
print "There is no mode for this dataset, values occur only once"
else:
modelist = [] # if more than one mode, add to list to print out
for key, item in dictofcounts.iteritems():
if item ==maxcount: # get item from original list with most counts
modelist.append(str(key))
print "The mode(s) are:",' and '.join(modelist)
return modelist

Mode of a data set is/are the member(s) that occur(s) most frequently in the set. If there are two members that appear most often with same number of times, then the data has two modes. This is called bimodal.If there are more than 2 modes, then the data would be called multimodal. If all the members in the data set appear the same number of times, then the data set has no mode at all. Following function modes() can work to find mode(s) in a given list of data:
import numpy as np; import pandas as pd
def modes(arr):
df = pd.DataFrame(arr, columns=['Values'])
dat = pd.crosstab(df['Values'], columns=['Freq'])
if len(np.unique((dat['Freq']))) > 1:
mode = list(dat.index[np.array(dat['Freq'] == max(dat['Freq']))])
return mode
else:
print("There is NO mode in the data set")
Output:
# For a list of numbers in x as
In [1]: x = [2, 3, 4, 5, 7, 9, 8, 12, 2, 1, 1, 1, 3, 3, 2, 6, 12, 3, 7, 8, 9, 7, 12, 10, 10, 11, 12, 2]
In [2]: modes(x)
Out[2]: [2, 3, 12]
# For a list of repeated numbers in y as
In [3]: y = [2, 2, 3, 3, 4, 4, 10, 10]
In [4]: modes(y)
Out[4]: There is NO mode in the data set
# For a list of strings/characters in z as
In [5]: z = ['a', 'b', 'b', 'b', 'e', 'e', 'e', 'd', 'g', 'g', 'c', 'g', 'g', 'a', 'a', 'c', 'a']
In [6]: modes(z)
Out[6]: ['a', 'g']
If we do not want to import numpy or pandas to call any function from these packages, then to get this same output, modes() function can be written as:
def modes(arr):
cnt = []
for i in arr:
cnt.append(arr.count(i))
uniq_cnt = []
for i in cnt:
if i not in uniq_cnt:
uniq_cnt.append(i)
if len(uniq_cnt) > 1:
m = []
for i in list(range(len(cnt))):
if cnt[i] == max(uniq_cnt):
m.append(arr[i])
mode = []
for i in m:
if i not in mode:
mode.append(i)
return mode
else:
print("There is NO mode in the data set")

I wrote up this handy function to find the mode.
def mode(nums):
corresponding={}
occurances=[]
for i in nums:
count = nums.count(i)
corresponding.update({i:count})
for i in corresponding:
freq=corresponding[i]
occurances.append(freq)
maxFreq=max(occurances)
keys=corresponding.keys()
values=corresponding.values()
index_v = values.index(maxFreq)
global mode
mode = keys[index_v]
return mode

Short, but somehow ugly:
def mode(arr) :
m = max([arr.count(a) for a in arr])
return [x for x in arr if arr.count(x) == m][0] if m>1 else None
Using a dictionary, slightly less ugly:
def mode(arr) :
f = {}
for a in arr : f[a] = f.get(a,0)+1
m = max(f.values())
t = [(x,f[x]) for x in f if f[x]==m]
return m > 1 t[0][0] else None

This function returns the mode or modes of a function no matter how many, as well as the frequency of the mode or modes in the dataset. If there is no mode (ie. all items occur only once), the function returns an error string. This is similar to A_nagpal's function above but is, in my humble opinion, more complete, and I think it's easier to understand for any Python novices (such as yours truly) reading this question to understand.
def l_mode(list_in):
count_dict = {}
for e in (list_in):
count = list_in.count(e)
if e not in count_dict.keys():
count_dict[e] = count
max_count = 0
for key in count_dict:
if count_dict[key] >= max_count:
max_count = count_dict[key]
corr_keys = []
for corr_key, count_value in count_dict.items():
if count_dict[corr_key] == max_count:
corr_keys.append(corr_key)
if max_count == 1 and len(count_dict) != 1:
return 'There is no mode for this data set. All values occur only once.'
else:
corr_keys = sorted(corr_keys)
return corr_keys, max_count

For a number to be a mode, it must occur more number of times than at least one other number in the list, and it must not be the only number in the list. So, I refactored #mathwizurd's answer (to use the difference method) as follows:
def mode(array):
'''
returns a set containing valid modes
returns a message if no valid mode exists
- when all numbers occur the same number of times
- when only one number occurs in the list
- when no number occurs in the list
'''
most = max(map(array.count, array)) if array else None
mset = set(filter(lambda x: array.count(x) == most, array))
return mset if set(array) - mset else "list does not have a mode!"
These tests pass successfully:
mode([]) == None
mode([1]) == None
mode([1, 1]) == None
mode([1, 1, 2, 2]) == None

Here is how you can find mean,median and mode of a list:
import numpy as np
from scipy import stats
#to take input
size = int(input())
numbers = list(map(int, input().split()))
print(np.mean(numbers))
print(np.median(numbers))
print(int(stats.mode(numbers)[0]))

Simple code that finds the mode of the list without any imports:
nums = #your_list_goes_here
nums.sort()
counts = dict()
for i in nums:
counts[i] = counts.get(i, 0) + 1
mode = max(counts, key=counts.get)
In case of multiple modes, it should return the minimum node.

Why not just
def print_mode (thelist):
counts = {}
for item in thelist:
counts [item] = counts.get (item, 0) + 1
maxcount = 0
maxitem = None
for k, v in counts.items ():
if v > maxcount:
maxitem = k
maxcount = v
if maxcount == 1:
print "All values only appear once"
elif counts.values().count (maxcount) > 1:
print "List has multiple modes"
else:
print "Mode of list:", maxitem
This doesn't have a few error checks that it should have, but it will find the mode without importing any functions and will print a message if all values appear only once. It will also detect multiple items sharing the same maximum count, although it wasn't clear if you wanted that.

This will return all modes:
def mode(numbers)
largestCount = 0
modes = []
for x in numbers:
if x in modes:
continue
count = numbers.count(x)
if count > largestCount:
del modes[:]
modes.append(x)
largestCount = count
elif count == largestCount:
modes.append(x)
return modes

For those looking for the minimum mode, e.g:case of bi-modal distribution, using numpy.
import numpy as np
mode = np.argmax(np.bincount(your_list))

Okey! So community has already a lot of answers and some of them used another function and you don't want.
let we create our very simple and easily understandable function.
import numpy as np
#Declare Function Name
def calculate_mode(lst):
Next step is to find Unique elements in list and thier respective frequency.
unique_elements,freq = np.unique(lst, return_counts=True)
Get mode
max_freq = np.max(freq) #maximum frequency
mode_index = np.where(freq==max_freq) #max freq index
mode = unique_elements[mode_index] #get mode by index
return mode
Example
lst =np.array([1,1,2,3,4,4,4,5,6])
print(calculate_mode(lst))
>>> Output [4]

How my brain decided to do it completely from scratch. Efficient and concise :) (jk lol)
import random
def removeDuplicates(arr):
dupFlag = False
for i in range(len(arr)):
#check if we found a dup, if so, stop
if dupFlag:
break
for j in range(len(arr)):
if ((arr[i] == arr[j]) and (i != j)):
arr.remove(arr[j])
dupFlag = True
break;
#if there was a duplicate repeat the process, this is so we can account for the changing length of the arr
if (dupFlag):
removeDuplicates(arr)
else:
#if no duplicates return the arr
return arr
#currently returns modes and all there occurences... Need to handle dupes
def mode(arr):
numCounts = []
#init numCounts
for i in range(len(arr)):
numCounts += [0]
for i in range(len(arr)):
count = 1
for j in range(len(arr)):
if (arr[i] == arr[j] and i != j):
count += 1
#add the count for that number to the corresponding index
numCounts[i] = count
#find which has the greatest number of occurences
greatestNum = 0
for i in range(len(numCounts)):
if (numCounts[i] > greatestNum):
greatestNum = numCounts[i]
#finally return the mode(s)
modes = []
for i in range(len(numCounts)):
if numCounts[i] == greatestNum:
modes += [arr[i]]
#remove duplicates (using aliasing)
print("modes: ", modes)
removeDuplicates(modes)
print("modes after removing duplicates: ", modes)
return modes
def initArr(n):
arr = []
for i in range(n):
arr += [random.randrange(0, n)]
return arr
#initialize an array of random ints
arr = initArr(1000)
print(arr)
print("_______________________________________________")
modes = mode(arr)
#print result
print("Mode is: ", modes) if (len(modes) == 1) else print("Modes are: ", modes)

def mode(inp_list):
sort_list = sorted(inp_list)
dict1 = {}
for i in sort_list:
count = sort_list.count(i)
if i not in dict1.keys():
dict1[i] = count
maximum = 0 #no. of occurences
max_key = -1 #element having the most occurences
for key in dict1:
if(dict1[key]>maximum):
maximum = dict1[key]
max_key = key
elif(dict1[key]==maximum):
if(key<max_key):
maximum = dict1[key]
max_key = key
return max_key

def mode(data):
lst =[]
hgh=0
for i in range(len(data)):
lst.append(data.count(data[i]))
m= max(lst)
ml = [x for x in data if data.count(x)==m ] #to find most frequent values
mode = []
for x in ml: #to remove duplicates of mode
if x not in mode:
mode.append(x)
return mode
print mode([1,2,2,2,2,7,7,5,5,5,5])

Here is a simple function that gets the first mode that occurs in a list. It makes a dictionary with the list elements as keys and number of occurrences and then reads the dict values to get the mode.
def findMode(readList):
numCount={}
highestNum=0
for i in readList:
if i in numCount.keys(): numCount[i] += 1
else: numCount[i] = 1
for i in numCount.keys():
if numCount[i] > highestNum:
highestNum=numCount[i]
mode=i
if highestNum != 1: print(mode)
elif highestNum == 1: print("All elements of list appear once.")

If you want a clear approach, useful for classroom and only using lists and dictionaries by comprehension, you can do:
def mode(my_list):
# Form a new list with the unique elements
unique_list = sorted(list(set(my_list)))
# Create a comprehensive dictionary with the uniques and their count
appearance = {a:my_list.count(a) for a in unique_list}
# Calculate max number of appearances
max_app = max(appearance.values())
# Return the elements of the dictionary that appear that # of times
return {k: v for k, v in appearance.items() if v == max_app}

#function to find mode
def mode(data):
modecnt=0
#for count of number appearing
for i in range(len(data)):
icount=data.count(data[i])
#for storing count of each number in list will be stored
if icount>modecnt:
#the loop activates if current count if greater than the previous count
mode=data[i]
#here the mode of number is stored
modecnt=icount
#count of the appearance of number is stored
return mode
print mode(data1)

import numpy as np
def get_mode(xs):
values, counts = np.unique(xs, return_counts=True)
max_count_index = np.argmax(counts) #return the index with max value counts
return values[max_count_index]
print(get_mode([1,7,2,5,3,3,8,3,2]))

Perhaps try the following. It is O(n) and returns a list of floats (or ints). It is thoroughly, automatically tested. It uses collections.defaultdict, but I'd like to think you're not opposed to using that. It can also be found at https://stromberg.dnsalias.org/~strombrg/stddev.html
def compute_mode(list_: typing.List[float]) -> typing.List[float]:
"""
Compute the mode of list_.
Note that the return value is a list, because sometimes there is a tie for "most common value".
See https://stackoverflow.com/questions/10797819/finding-the-mode-of-a-list
"""
if not list_:
raise ValueError('Empty list')
if len(list_) == 1:
raise ValueError('Single-element list')
value_to_count_dict: typing.DefaultDict[float, int] = collections.defaultdict(int)
for element in list_:
value_to_count_dict[element] += 1
count_to_values_dict = collections.defaultdict(list)
for value, count in value_to_count_dict.items():
count_to_values_dict[count].append(value)
counts = list(count_to_values_dict)
if len(counts) == 1:
raise ValueError('All elements in list are the same')
maximum_occurrence_count = max(counts)
if maximum_occurrence_count == 1:
raise ValueError('No element occurs more than once')
minimum_occurrence_count = min(counts)
if maximum_occurrence_count <= minimum_occurrence_count:
raise ValueError('Maximum count not greater than minimum count')
return count_to_values_dict[maximum_occurrence_count]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python 'for' loop performance too slow - python

You might want to look into the pandas iterrows() function or using apply, you can look at this article aswell: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06

Related

Python - speed up iterations over permutation list (with multithreading?)

Speeding up a pd.apply() function

python: processing data so that only constant values remain

How do I generate a table from a list

Finding the mode of a list

Categories

Resources