Python: categorising a list by orders of magnitude


I have a nested list with values:
list = [
...
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
...]
I want to count values in the second index / column by order of magnitude, starting at the lowest order of magnitude and ending at the largest...e.g.
99.23033109735835 = 10 <= x < 100
142.8576737907048 = 100 <= x < 1000
9432 = 1000 <= x < 10000
The aim is to output a simple char (#) count for how many index values fall in each category, e.g.
10 <= x < 100: ###
100 <= x < 1000: #########
I've started by grabbing the max() and min() values for the index in order to automatically calculate the largest and smallest magnitude categories, but I'm not sure how to associate each value in the column with an order of magnitude. If someone could point me in the right direction or give me some ideas I would be most grateful.

This function will turn your double into an integer order of magnitude:
>>> import math
>>> def magnitude(x):
...     return int(math.log10(x))
...
>>> magnitude(99.23)
1
>>> magnitude(9432)
3
(so 10 ** magnitude(x) <= x < 10 ** (1 + magnitude(x)) for all x >= 1).
Just use the magnitude as a key, and count the occurrences per key. defaultdict may be helpful here.
Note that this magnitude only works for x >= 1, i.e. for non-negative powers of 10, because int() truncates towards zero: int(math.log10(0.5)) is 0, not -1.
Use
def magnitude(x):
    return int(math.floor(math.log10(x)))
instead if this matters for your use case. (Thanks to larsmans for pointing this out).
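A minimal sketch of the counting step suggested above, assuming Python 3 and strictly positive values. The names `rows`, `histogram` and `counts` are illustrative and not from the question, and the sample rows are shortened to the first two columns.

```python
import math
from collections import defaultdict

def magnitude(x):
    # floor() keeps values below 1 in the right bucket (0.5 -> -1)
    return int(math.floor(math.log10(x)))

def histogram(rows, column=1):
    """Count the values in `column` per order of magnitude."""
    counts = defaultdict(int)
    for row in rows:
        counts[magnitude(row[column])] += 1
    return counts

rows = [
    ['Country1', 142.8576737907048],
    ['Country2', 109.33326343550823],
    ['Country3', 99.23033109735835],
]
counts = histogram(rows)
# One line per bucket, lowest order of magnitude first
for mag in sorted(counts):
    print('%d <= x < %d: %s' % (10 ** mag, 10 ** (mag + 1), '#' * counts[mag]))
```

Sorting the keys before printing gives the lowest-to-highest ordering the question asks for. Empty buckets are simply absent from `counts`.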

Extending Useless' answer to all real numbers, you can use:
import math
def magnitude(value):
    if value == 0:
        return 0
    return int(math.floor(math.log10(abs(value))))
Test cases:
In [123]: magnitude(0)
Out[123]: 0
In [124]: magnitude(0.1)
Out[124]: -1
In [125]: magnitude(0.02)
Out[125]: -2
In [126]: magnitude(150)
Out[126]: 2
In [127]: magnitude(-5280)
Out[127]: 3

If x is one of your numbers, what is len(str(int(x))) ?
Or, if you have numbers less than 1, what is int(math.log10(x)) ?
(See also log10's docs. Also note that int() rounding here may not be what you want - see ceil and floor, and note you may need int(ceil(...)) or int(floor(...)) to get an integer answer)
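To make the two hints concrete, here is a small sketch comparing the digit-count idea with the logarithm idea. The function names `digits_magnitude` and `log_magnitude` are made up for this illustration, and the digit-count version is assumed to be used only for x >= 1.

```python
import math

def digits_magnitude(x):
    # Number of digits in the integer part, minus one.
    # Only meaningful for x >= 1; int() drops the fractional part.
    return len(str(int(x))) - 1

def log_magnitude(x):
    # floor() rather than int() so that values below 1 round down.
    return int(math.floor(math.log10(x)))

# Print each value with both magnitudes side by side
for value in (5.38, 99.23033109735835, 142.8576737907048, 9432):
    print(value, digits_magnitude(value), log_magnitude(value))
```

For values of 1 or more the two functions should agree. Below 1 only the logarithm version gives a negative order of magnitude.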

To categorize by the order of magnitude do:
from math import floor, log10
from collections import Counter
counter = Counter(int(floor(log10(x[1]))) for x in list)
Key 1 covers 10 up to (but not including) 100, key 2 covers 100 up to 1000.
print(counter)
Counter({2: 2, 1: 1})
Then it's just a matter of printing it out:
for x in sorted(counter.keys()):
    print("%d <= x < %d: %d" % (10**x, 10**(x+1), counter[x]))

In case you ever want overlapping ranges or ranges with arbitrary bounds (not tied to orders of magnitude, powers of 2 or any other predictable series):
from collections import defaultdict
lst = [
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
]
buckets = {
'10<=x<100': lambda x: 10<=x<100,
'100<=x<1000': lambda x: 100<=x<1000,
}
result = defaultdict(int)
for item in lst:
    second_column = item[1]
    for label, range_check in buckets.items():
        if range_check(second_column):
            result[label] += 1
print(result)

Another option, using bisect
import bisect
from collections import Counter
list0 = [
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135]
]
magnitudes = [10**x for x in range(5)]
c = Counter(bisect.bisect(magnitudes, x[1]) for x in list0)
for x in sorted(c):
    print('%d %s' % (x, '#'*c[x]))

import bisect
from collections import defaultdict
lis1 = [['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
]
lis2 = [0, 100, 1000, 1000]
dic = defaultdict(int)
for x in lis1:
    x = x[1]
    ind = bisect.bisect(lis2, x)
    if not (x >= lis2[-1] or x <= lis2[0]):
        sm, bi = lis2[ind-1], lis2[ind]
        dic["{} <= {} <= {}".format(sm, x, bi)] += 1
for k, v in dic.items():
    print('%s --> %s' % (k, v))
output:
0 <= 99.2303310974 <= 100 --> 1
100 <= 142.857673791 <= 1000 --> 1
100 <= 109.333263436 <= 1000 --> 1

Related

Fill order from smaller packages?

The input is an integer that specifies the amount to be ordered.
There are predefined package sizes that have to be used to create that order.
e.g.
Packs
3 for $5
5 for $9
9 for $16
for an input order 13 the output should be:
2x5 + 1x3
So far I've the following approach:
remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
while remaining_order > 0:
    found = False
    for pack_num in package_numbers:
        if pack_num <= remaining_order:
            required_packages.append(pack_num)
            remaining_order -= pack_num
            found = True
            break
    if not found:
        break
But this will lead to the wrong result:
1x9 + 1x3
remaining: 1

So, you need to fill the order with the packages such that the total price is maximal? This is known as the knapsack problem. In that Wikipedia article you'll find several solutions written in Python.
To be more precise, you need a solution for the unbounded knapsack problem, in contrast to the popular 0/1 knapsack problem (where each item can be packed only once). Here is working code from Rosetta:
from itertools import product
NAME, SIZE, VALUE = range(3)
items = (
# NAME, SIZE, VALUE
('A', 3, 5),
('B', 5, 9),
('C', 9, 16))
capacity = 13
def knapsack_unbounded_enumeration(items, C):
    # find max of any one item
    max1 = [int(C / item[SIZE]) for item in items]
    itemsizes = [item[SIZE] for item in items]
    itemvalues = [item[VALUE] for item in items]
    def totvalue(itemscount):
        # (value, -size) of a combination, or (-1, 0) if it does not fit
        totsize = sum(n * size for n, size in zip(itemscount, itemsizes))
        totval = sum(n * val for n, val in zip(itemscount, itemvalues))
        return (totval, -totsize) if totsize <= C else (-1, 0)
    # Try all combinations of bounty items from 0 up to max1
    bagged = max(product(*[range(n + 1) for n in max1]), key=totvalue)
    numbagged = sum(bagged)
    value, size = totvalue(bagged)
    size = -size
    # convert to 'count x size' strings in item order
    bagged = ['%dx%d' % (n, items[i][SIZE]) for i, n in enumerate(bagged) if n]
    return value, size, numbagged, bagged
if __name__ == '__main__':
    value, size, numbagged, bagged = knapsack_unbounded_enumeration(items, capacity)
    print(value)
    print(bagged)
Output is:
23
['1x3', '2x5']
Keep in mind that this is an NP-hard problem, so the running time will blow up as you enter larger values :)

You can use itertools.product:
import itertools
remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
a = min([x for i in range(1, remaining_order // min(package_numbers) + 1) for x in itertools.product(package_numbers, repeat=i)], key=lambda x: abs(sum(x) - remaining_order))
remaining_order-=sum(a)
print(a)
print(remaining_order)
Output:
(5, 5, 3)
0
This does the following:
Build the combinations of packages and pick the one whose sum is closest to 13.
Then subtract that sum from remaining_order. Note that "closest" can overshoot, in which case remaining_order ends up negative.
If you want it output with 'x':
import itertools
from collections import Counter
remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
a = min([x for i in range(1, remaining_order // min(package_numbers) + 1) for x in itertools.product(package_numbers, repeat=i)], key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(' + '.join(['{0}x{1}'.format(v, k) for k, v in Counter(a).items()]))
print(remaining_order)
Output:
2x5 + 1x3
0

For your problem I tried two implementations, depending on what you want. In both solutions I assumed the remaining order absolutely has to end at 0; otherwise the algorithm returns -1. If you need the other cases, tell me and I can adapt the algorithm.
As the algorithm is implemented via dynamic programming, it handles large inputs well, at least orders of more than 130!
In the first solution, I assumed we fill with the biggest package each time.
In the second solution, I try to minimize the price, but the remaining order should always be 0.
remaining_order = 13
package_numbers = sorted([9,5,3], reverse=True) # To make sure the biggest package is the first element
prices = {9: 16, 5: 9, 3: 5}
required_packages = []
# First solution, using the biggest package each time, and making the total order remaining at 0 each time
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = [0, 0, 0]
for i in range(1, remaining_order + 1):
    for index, package_number in enumerate(package_numbers):
        if i-package_number > -1:
            tmp = ans[i-package_number]
            if tmp != -1:
                ans[i] = [tmp[x] if x != index else tmp[x] + 1 for x in range(len(tmp))]
                break
    else: # Using for else instead of a boolean value `found`
        ans[i] = -1 # -1 marks amounts with no exact combination
print(ans[13]) # [0, 2, 1]
print(ans[9]) # [1, 0, 0]
# Second solution, minimizing the price with order at 0
def price(x):
    return 16*x[0]+9*x[1]+5*x[2]
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = ([0, 0, 0],0) # combination + price
for i in range(1, remaining_order + 1):
    # The not found packages will be (-1, float('inf'))
    minimal_price = float('inf')
    minimal_combinations = -1
    for index, package_number in enumerate(package_numbers):
        if i-package_number > -1:
            tmp = ans[i-package_number]
            if tmp != (-1, float('inf')):
                tmp_price = price(tmp[0]) + prices[package_number]
                if tmp_price < minimal_price:
                    minimal_price = tmp_price
                    minimal_combinations = [tmp[0][x] if x != index else tmp[0][x] + 1 for x in range(len(tmp[0]))]
    ans[i] = (minimal_combinations, minimal_price)
print(ans[13]) # ([0, 2, 1], 23)
print(ans[9]) # ([0, 0, 3], 15) Because the price of three packages of 3 is lower than the price of a package of 9

In case you need a solution for a small number of possible package_numbers but a possibly very big remaining_order, in which case all the other solutions would fail, you can use this to reduce remaining_order:
import numpy as np
remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
prod = int(np.prod(package_numbers))  # plain int, so list repetition below works
sub_max = sum((prod // i - 1) * i for i in package_numbers)
while remaining_order > sub_max:
    remaining_order -= prod
    required_packages.extend([max(package_numbers)] * (prod // max(package_numbers)))
This works because if the packages of size i in a solution add up to more than (np.prod(package_numbers)/i-1)*i, then np.prod(package_numbers)/i of them sum to exactly np.prod(package_numbers) and can be swapped for the block above. In case the package max(package_numbers) isn't the one with the smallest price per unit, take the one with the smallest price per unit instead.
Example:
remaining_order = 100
package_numbers = [5,3]
Any part of remaining_order bigger than 5*2 plus 3*4 = 22 can be sorted out by adding 5 three times to the solution and taking remaining_order - 5*3.
So the remaining order that actually needs to be calculated is 10, which can then be solved as 2 times 5. The rest is filled with 6 times 15, which is 18 times 5.
In case the number of possible package_numbers is bigger than just a handful, I recommend building a lookup table (with the code from one of the other answers) for all numbers below sub_max, which will make this immensely fast for any input.
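The lookup table mentioned above could be built along these lines. This is only a sketch: it assumes the goal is an exact fill at the lowest total price (the question does not state the objective), it takes the prices from the question's example, and `build_table` is a name invented here.

```python
def build_table(package_sizes, prices, limit):
    """table[n] = (cheapest price, packages) for an exact fill of n, or None."""
    table = [None] * (limit + 1)
    table[0] = (0, [])
    for n in range(1, limit + 1):
        for size in package_sizes:
            # Extend the best known fill of n - size by one more package
            prev = table[n - size] if n >= size else None
            if prev is None:
                continue
            candidate = (prev[0] + prices[size], prev[1] + [size])
            if table[n] is None or candidate[0] < table[n][0]:
                table[n] = candidate
    return table

prices = {9: 16, 5: 9, 3: 5}   # prices taken from the question
table = build_table([9, 5, 3], prices, 30)
print(table[13])   # cheapest exact fill for an order of 13
print(table[4])    # None: 4 cannot be filled exactly
```

Once the table covers every amount up to sub_max, any larger order only needs the reduction loop followed by a single table lookup.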

Since no objective function is stated, I assume your goal is to maximize the package value within the pack's capacity.
Explanation: time complexity is fixed. The optimal solution may not be to fill in the highest-valued item as many times as possible, so you have to search all possible combinations. However, you can reuse the candidate solutions you have already found to save time. For example, [5,5,3] is derived from adding 3 to a previous [5,5] try, so the intermediate result can be "cached". You may use either an array or a set to store possible solutions. The code below has the same performance as the Rosetta code but I think it's clearer.
To further optimize, use a priority set for opts.
costs = [3,5,9]
value = [5,9,16]
volume = 130
# solutions
opts = set()
opts.add(tuple([0]))
# calc total value
cost_val = dict(zip(costs, value))
def total_value(opt):
    return sum([cost_val.get(cost,0) for cost in opt])
def possible_solutions():
    solutions = set()
    for opt in opts:
        for cost in costs:
            if cost + sum(opt) > volume:
                continue
            cnt = (volume - sum(opt)) // cost
            for _ in range(1, cnt + 1):
                sol = tuple(list(opt) + [cost] * _)
                solutions.add(sol)
    return solutions
def optimize_max_return(opts):
    if not opts:
        return tuple([])
    cur = list(opts)[0]
    for sol in opts:
        if total_value(sol) > total_value(cur):
            cur = sol
    return cur
while sum(optimize_max_return(opts)) <= volume - min(costs):
    opts = opts.union(possible_solutions())
print(optimize_max_return(opts))
If your requirement is "just fill the pack" it'll be even simpler using the volume for each item instead.

Finding the subsequence that starts and ends with the same number with the maximum sum

I have to make a program that takes as input a list of numbers and returns the sum of the subsequence that starts and ends with the same number which has the maximum sum (including the equal numbers in the beginning and end of the subsequence in the sum). It also has to return the placement of the start and end of the subsequence, that is, their index+1. The problem is that my current code runs smoothly only while the length of list is not that long. When the list length extends to 5000 the program does not give an answer.
The input is the following:
6
3 2 4 3 5 6
The first line is for the length of the list. The second line is the list itself, with list items separated by space. The output will be 12, 1, 4, because as you can see there is 1 equal pair of numbers (3): the first and fourth element, so the sum of elements between them is 3 + 2 + 4 + 3 = 12, and their placement is first and fourth.
Here is my code.
length = int(input())
mass = raw_input().split()
for i in range(length):
    mass[i]=int(mass[i])
value=-10000000000
b = 1
e = 1
for i in range(len(mass)):
    if mass[i:].count(mass[i])!=1:
        for j in range(i,len(mass)):
            if mass[j]==mass[i]:
                f = mass[i:j+1]
                if sum(f)>value:
                    value = sum(f)
                    b = i+1
                    e = j+1
    else:
        if mass[i]>value:
            value = mass[i]
            b = i+1
            e = i+1
print value
print b,e

This should be faster than your current approach.
Rather than searching through mass looking for pairs of matching numbers we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than 2 of the same number we use the first and last, since they will have the greatest sum between them.
from operator import itemgetter
from itertools import groupby
raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
    g = list(g)
    if len(g) > 1:
        u, v = g[0][1], g[-1][1]
        result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))
output
(19, 2, 5)
Note that this code will not necessarily give the maximum sum between equal elements in the list if the list contains negative numbers. It will still work correctly with negative numbers if no group of equal numbers has more than two members. If that's not the case, we need to use a slower algorithm that tests every pair within a group of equal numbers.
Here's a more efficient version. Instead of using the sum function we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list size is large. Eg, for a list of 10,000 elements this approach is about 10 times faster. To test it, I create an array of random positive integers.
from operator import itemgetter
from itertools import groupby
from random import seed, randrange
seed(42)
def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)
num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))
output
(155821402, 21, 24831)
If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.
from itertools import accumulate
def maxsum(seq):
    sums = [0] + list(accumulate(seq))
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)
Here's a faster version, derived from code by Stefan Pochmann, which uses a dict, instead of sorting & groupby. Thanks, Stefan!
def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, [i, i])[1] = i
    return max((sums[j] - sums[i - 1], i, j)
        for i, j in where.values())
If the list contains no duplicate items (and hence no subsequences bound by duplicate items) it returns the maximum item in the list.
Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.
The first version, maxsum_combo, uses itertools.combinations to generate all combinations of a group of equal numbers and thence finds the combination that gives the maximum sum. The second version, maxsum_kadane uses a variation of Kadane's algorithm to find the maximum subsequence within a group.
If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.
from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer
seed(42)
def maxsum_combo(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max((sums[j] - sums[i - 1], i, j)
            for v in where.values() for i, j in combinations(v, 2))
    except ValueError:
        return None
def maxsum_kadane(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max(max_sublist([(sums[j] - sums[i-1], i, j)
                for i, j in zip(v, v[1:])], k)
            for k, v in where.items() if len(v) > 1)
    except ValueError:
        return None
# Kadane's Algorithm to find maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
    max_ending_here = max_so_far = seq[0]
    for x in seq[1:]:
        y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
        max_ending_here = max(x, y)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
def test(num, hi, loops):
    print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
    print('Verifying...')
    for k in range(5):
        mass = [randrange(-hi // 2, hi) for _ in range(num)]
        a = maxsum_combo(mass)
        b = maxsum_kadane(mass)
        print(a, b, a==b)
    print('\nTiming...')
    for func in maxsum_combo, maxsum_kadane:
        t = Timer(lambda: func(mass))
        result = sorted(t.repeat(3, loops))
        result = ', '.join([format(u, '.5f') for u in result])
        print('{0:14} : {1}'.format(func.__name__, result))
loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)
loops = 10
hi = num // 100
test(num, hi, loops)
output
num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True
Timing...
maxsum_combo : 1.72903, 1.73780, 1.81138
maxsum_kadane : 2.17738, 2.22108, 2.22394
num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True
Timing...
maxsum_combo : 7.29226, 7.34213, 7.36688
maxsum_kadane : 1.07532, 1.07695, 1.10525
This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.
If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:
def maxsum_kadane(seq):
    if not seq:
        return None
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    # Find the maximum of the single items
    m_single = max((k, v[0], v[0]) for k, v in where.items())
    # Find the maximum of the subsequences
    try:
        m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
                for i, j in zip(v, v[1:])], k)
            for k, v in where.items() if len(v) > 1)
        return max(m_single, m_subseq)
    except ValueError:
        # No subsequences
        return m_single
I haven't tested it extensively, but it should work. ;)
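For Python 3 only, the `default=None` option mentioned earlier replaces the try/except. The following is a sketch of the combinations variant under that assumption. The name `maxsum_combo_py3` is made up here, and `itertools.accumulate` is used for the cumulative sums.

```python
from itertools import accumulate, combinations

def maxsum_combo_py3(seq):
    # sums[k] is the sum of the first k items, so the 1-based span i..j
    # adds up to sums[j] - sums[i - 1].
    sums = [0] + list(accumulate(seq))
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    # default=None is returned when no value occurs twice
    return max(
        ((sums[j] - sums[i - 1], i, j)
         for v in where.values() for i, j in combinations(v, 2)),
        default=None,
    )

print(maxsum_combo_py3([3, 2, 4, 3, 5, 6]))   # (12, 1, 4)
print(maxsum_combo_py3([1, 2, 3]))            # None
```

The first call uses the sample input from the question, so the result matches the expected answer 12, 1, 4.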

Finding median of list in Python

How do you find the median of a list in Python? The list can be of any size and the numbers are not guaranteed to be in any particular order.
If the list contains an even number of elements, the function should return the average of the middle two.
Here are some examples (sorted for display purposes):
median([1]) == 1
median([1, 1]) == 1
median([1, 1, 2, 4]) == 1.5
median([0, 2, 5, 6, 8, 9, 9]) == 6
median([0, 0, 0, 0, 4, 4, 6, 8]) == 2

Python 3.4 has statistics.median:
Return the median (middle value) of numeric data.
When the number of data points is odd, return the middle data point.
When the number of data points is even, the median is interpolated by taking the average of the two middle values:
>>> median([1, 3, 5])
3
>>> median([1, 3, 5, 7])
4.0
Usage:
import statistics
items = [6, 1, 8, 2, 3]
statistics.median(items)
#>>> 3
It's pretty careful with types, too:
statistics.median(map(float, items))
#>>> 3.0
from decimal import Decimal
statistics.median(map(Decimal, items))
#>>> Decimal('3')

(Works with python-2.x):
def median(lst):
    n = len(lst)
    s = sorted(lst)
    return (s[n//2-1]/2.0+s[n//2]/2.0, s[n//2])[n % 2] if n else None
>>> median([-5, -5, -3, -4, 0, -1])
-3.5
numpy.median():
>>> from numpy import median
>>> median([1, -4, -1, -1, 1, -3])
-1.0
For python-3.x, use statistics.median:
>>> from statistics import median
>>> median([5, 2, 3, 8, 9, -2])
4.0

The sorted() function is very helpful for this. Use the sorted function
to order the list, then simply return the middle value (or average the two middle
values if the list contains an even number of elements).
def median(lst):
    sortedLst = sorted(lst)
    lstLen = len(lst)
    index = (lstLen - 1) // 2
    if (lstLen % 2):
        return sortedLst[index]
    else:
        return (sortedLst[index] + sortedLst[index + 1])/2.0

Of course you can use built-in functions, but if you would like to create your own you can do something like this. The trick here is to use the ~ operator, which flips a positive number to a negative one: for instance ~2 -> -3. A negative index on a list in Python counts items from the end, so if you have mid == 2 it takes the third element from the beginning and the third element from the end.
def median(data):
    data.sort()
    mid = len(data) // 2
    return (data[mid] + data[~mid]) / 2
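A short demonstration of the `~` indexing trick, written as a sketch for Python 3 (where `/` is true division). It sorts a copy instead of calling `data.sort()`, which is a deliberate change from the answer so that the caller's list is left untouched.

```python
def median(data):
    ordered = sorted(data)        # leaves the caller's list untouched
    mid = len(ordered) // 2
    # ~mid == -(mid + 1): the element mirrored from the other end.
    # For odd lengths both indexes hit the same middle element.
    return (ordered[mid] + ordered[~mid]) / 2

print(~2)                      # -3
print(median([5, 1, 3]))       # 3.0
print(median([4, 1, 3, 2]))    # 2.5
```

Because the same expression covers odd and even lengths, the result is always a float here, even when the median is one of the list's own integers.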

Here's a cleaner solution:
def median(lst):
    quotient, remainder = divmod(len(lst), 2)
    if remainder:
        return sorted(lst)[quotient]
    return sum(sorted(lst)[quotient - 1:quotient + 1]) / 2.
Note: Answer changed to incorporate suggestion in comments.

You can try the quickselect algorithm if faster average-case running times are needed. Quickselect has average (and best) case performance O(n), although it can end up O(n²) on a bad day.
Here's an implementation with a randomly chosen pivot:
import random
def select_nth(n, items):
    pivot = random.choice(items)
    lesser = [item for item in items if item < pivot]
    if len(lesser) > n:
        return select_nth(n, lesser)
    n -= len(lesser)
    numequal = items.count(pivot)
    if numequal > n:
        return pivot
    n -= numequal
    greater = [item for item in items if item > pivot]
    return select_nth(n, greater)
You can trivially turn this into a method to find medians:
def median(items):
    if len(items) % 2:
        return select_nth(len(items)//2, items)
    else:
        left = select_nth((len(items)-1) // 2, items)
        right = select_nth((len(items)+1) // 2, items)
        return (left + right) / 2
This is very unoptimised, but it's not likely that even an optimised version will outperform Tim Sort (CPython's built-in sort) because that's really fast. I've tried before and I lost.
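A small self-check for the quickselect idea, assuming Python 3. Here `select_nth` is rewritten as a loop, which is a choice made for this sketch to avoid deep recursion (the version above is recursive), and the results are compared against `statistics.median` on random data.

```python
import random
import statistics

def select_nth(n, items):
    """Return the n-th smallest item (0-based) using a random pivot."""
    while True:
        pivot = random.choice(items)
        lesser = [item for item in items if item < pivot]
        if n < len(lesser):
            items = lesser
            continue
        n -= len(lesser)
        numequal = items.count(pivot)
        if n < numequal:
            return pivot
        n -= numequal
        items = [item for item in items if item > pivot]

def median(items):
    if len(items) % 2:
        return select_nth(len(items) // 2, items)
    left = select_nth(len(items) // 2 - 1, items)
    right = select_nth(len(items) // 2, items)
    return (left + right) / 2

random.seed(0)
for size in (1, 2, 7, 100, 1001):
    data = [random.randrange(1000) for _ in range(size)]
    assert median(data) == statistics.median(data)
print('all sizes agree with statistics.median')
```

The check covers odd and even lengths as well as lists with repeated values, which is where the `numequal` step matters.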

You can use the list.sort to avoid creating new lists with sorted and sort the lists in place.
Also you should not use list as a variable name as it shadows python's own list.
def median(l):
    half = len(l) // 2
    l.sort()
    if not len(l) % 2:
        return (l[half - 1] + l[half]) / 2.0
    return l[half]

def median(x):
    x = sorted(x)
    listlength = len(x)
    num = listlength//2
    if listlength%2==0:
        middlenum = (x[num]+x[num-1])/2
    else:
        middlenum = x[num]
    return middlenum

def median(array):
    """Calculate median of the given list.
    """
    # TODO: use statistics.median in Python 3
    array = sorted(array)
    half, odd = divmod(len(array), 2)
    if odd:
        return array[half]
    return (array[half - 1] + array[half]) / 2.0

A simple function to return the median of the given list:
def median(lst):
    lst = sorted(lst)  # Sort the list first
    if len(lst) % 2 == 0:  # Checking if the length is even
        # Applying formula which is sum of middle two divided by 2
        return (lst[len(lst) // 2] + lst[(len(lst) - 1) // 2]) / 2
    else:
        # If length is odd then get middle value
        return lst[len(lst) // 2]
Some examples with the median function:
>>> median([9, 12, 20, 21, 34, 80]) # Even
20.5
>>> median([9, 12, 80, 21, 34]) # Odd
21
If you want to use library you can just simply do:
>>> import statistics
>>> statistics.median([9, 12, 20, 21, 34, 80]) # Even
20.5
>>> statistics.median([9, 12, 80, 21, 34]) # Odd
21

I posted my solution at Python implementation of "median of medians" algorithm , which is a little bit faster than using sort(). My solution uses 15 numbers per column, for a speed ~5N which is faster than the speed ~10N of using 5 numbers per column. The optimal speed is ~4N, but I could be wrong about it.
Per Tom's request in his comment, I added my code here, for reference. I believe the critical part for speed is using 15 numbers per column, instead of 5.
#!/bin/pypy
#
# TH #stackoverflow, 2016-01-20, linear time "median of medians" algorithm
#
import sys, random
items_per_column = 15
def find_i_th_smallest( A, i ):
    t = len(A)
    if(t <= items_per_column):
        # if A is a small list with at most items_per_column items, then:
        #
        # 1. do sort on A
        # 2. find i-th smallest item of A
        #
        return sorted(A)[i]
    else:
        # 1. partition A into columns of k items each. k is odd, say 5.
        # 2. find the median of every column
        # 3. put all medians in a new list, say, B
        #
        B = [ find_i_th_smallest(k, (len(k) - 1)//2) for k in [A[j:(j + items_per_column)] for j in range(0,len(A),items_per_column)]]
        # 4. find M, the median of B
        #
        M = find_i_th_smallest(B, (len(B) - 1)//2)
        # 5. split A into 3 parts by M, { < M }, { == M }, and { > M }
        # 6. find which above set has A's i-th smallest, recursively.
        #
        P1 = [ j for j in A if j < M ]
        if(i < len(P1)):
            return find_i_th_smallest( P1, i)
        P3 = [ j for j in A if j > M ]
        L3 = len(P3)
        if(i < (t - L3)):
            return M
        return find_i_th_smallest( P3, i - (t - L3))
# How many numbers should be randomly generated for testing?
#
number_of_numbers = int(sys.argv[1])
# create a list of random positive integers
#
L = [ random.randint(0, number_of_numbers) for i in range(0, number_of_numbers) ]
# Show the original list
#
# print L
# This is for validation
#
# print sorted(L)[int((len(L) - 1)/2)]
# This is the result of the "median of medians" function.
# Its result should be the same as the above.
#
print(find_i_th_smallest( L, (len(L) - 1) // 2))

In case you need additional information on the distribution of your list, the percentile method will probably be useful. And a median value corresponds to the 50th percentile of a list:
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9])
median_value = np.percentile(a, 50) # return 50th percentile
print(median_value)

Here's what I came up with during this exercise in Codecademy:
def median(data):
    new_list = sorted(data)
    if len(new_list)%2 > 0:
        return new_list[len(new_list)//2]
    elif len(new_list)%2 == 0:
        return (new_list[(len(new_list)//2)] + new_list[(len(new_list)//2)-1]) /2.0
print(median([1,2,3,4,5,9]))

Just two lines are enough.
def get_median(arr):
    '''
    Calculate the median of a sequence.
    :param arr: list
    :return: int or float
    '''
    arr = sorted(arr)
    return arr[len(arr)//2] if len(arr) % 2 else (arr[len(arr)//2] + arr[len(arr)//2-1])/2

median function
def median(midlist):
    midlist.sort()
    lens = len(midlist)
    if lens % 2 != 0:
        midl = lens // 2
        res = midlist[midl]
    else:
        odd = (lens // 2) - 1
        ev = lens // 2
        res = float(midlist[odd] + midlist[ev]) / float(2)
    return res

I had some problems with lists of float values. I ended up using a code snippet from the Python 3 statistics.median, which works perfectly with float values and needs no imports.
def calculateMedian(values):
    data = sorted(values)
    n = len(data)
    if n == 0:
        return None
    if n % 2 == 1:
        return data[n // 2]
    else:
        i = n // 2
        return (data[i - 1] + data[i]) / 2

def midme(list1):
    list1.sort()
    if len(list1)%2>0:
        x = list1[int((len(list1)/2))]
    else:
        x = ((list1[int((len(list1)/2))-1])+(list1[int(((len(list1)/2)))]))/2
    return x
midme([4,5,1,7,2])

def median(array):
    if len(array) < 1:
        return None
    array = sorted(array)  # the middle elements are only meaningful once sorted
    if len(array) % 2 == 0:
        middle = array[len(array)//2-1: len(array)//2+1]
        return sum(middle) / len(middle)
    else:
        return array[len(array)//2]

I defined a median function for a list of numbers as
def median(numbers):
    ordered = sorted(numbers)
    n = len(ordered)
    # lower and upper middle elements (the same element when n is odd)
    return (ordered[(n - 1) // 2] + ordered[n // 2]) / 2.0

import numpy as np
def get_median(xs):
    mid = len(xs) // 2  # Take the mid of the list
    if len(xs) % 2 == 1:  # check if the len of list is odd
        return sorted(xs)[mid]  # if true then mid will be median after sorting
    else:
        #return 0.5 * sum(sorted(xs)[mid - 1:mid + 1])
        return 0.5 * np.sum(sorted(xs)[mid - 1:mid + 1])  # if false take the avg of mid
print(get_median([7, 7, 3, 1, 4, 5]))
print(get_median([1,2,3, 4,5]))

A more generalized approach for median (and percentiles) would be:
def get_percentile(data, percentile):
    # Get the number of observations
    cnt=len(data)
    # Sort the list
    data=sorted(data)
    # Determine the split point
    i=(cnt-1)*percentile
    # Fractional part of the split point
    diff=i-int(i)
    # The split point falls exactly on an observation (this also covers percentile=1)
    if diff == 0:
        return data[int(i)]
    # Return the weighted average of the value above and below the split point
    return data[int(i)]*(1-diff)+data[int(i)+1]*(diff)
# Data
data=[1,2,3,4,5]
# For the median
print(get_percentile(data=data, percentile=.50))
# > 3
print(get_percentile(data=data, percentile=.75))
# > 4
# Note the weighted average difference when an int is not returned by the percentile
print(get_percentile(data=data, percentile=.51))
# > 3.04
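The split point (cnt-1)*percentile with linear interpolation appears to be the same scheme that statistics.quantiles uses with method="inclusive" (Python 3.8 and later), so the standard library can serve as a cross-check for quartiles. A small sketch; linear_percentile is a hypothetical helper name for a compact restatement of the approach above:

```python
import statistics

def linear_percentile(data, fraction):
    """Interpolate linearly at position (n - 1) * fraction of the sorted data."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)  # stay in range when fraction == 1
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

data = [1, 2, 3, 4, 5]
quartiles = statistics.quantiles(data, n=4, method="inclusive")
print(quartiles)  # [2.0, 3.0, 4.0]
print([linear_percentile(data, f) for f in (0.25, 0.5, 0.75)])  # [2.0, 3.0, 4.0]
```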

Try this:
def find_median(arr):
    arr = sorted(arr)
    mid = len(arr) // 2
    if len(arr) % 2 == 1:
        return arr[mid]
    else:
        # even length: average the two middle values
        return (arr[mid - 1] + arr[mid]) / 2
print(find_median([1, 2, 3, 4, 5, 6, 7, 8]))

Implement it:
def median(numbers):
    """
    Calculate the median of a list of numbers.
    :param numbers: the numbers to take the median of.
    :return: median value of numbers.
    >>> median([1, 3, 3, 6, 7, 8, 9])
    6
    >>> median([1, 2, 3, 4, 5, 6, 8, 9])
    4.5
    >>> import statistics
    >>> import random
    >>> numbers = random.sample(range(-50, 50), k=100)
    >>> statistics.median(numbers) == median(numbers)
    True
    """
    numbers = sorted(numbers)
    mid_index = len(numbers) // 2
    # The parity of the length (not of mid_index) decides whether to average two elements.
    return (
        (numbers[mid_index] + numbers[mid_index - 1]) / 2 if len(numbers) % 2 == 0
        else numbers[mid_index]
    )
if __name__ == "__main__":
    from doctest import testmod
    testmod()

Function median:
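import numpy as np
def median(d):
    d = np.sort(d)
    n2 = len(d) // 2
    if len(d) % 2 == 1:
        # odd length: the single middle element
        med = d[n2]
    else:
        # even length: average the two middle elements
        med = (d[n2 - 1] + d[n2]) / 2
    return med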

Simply create a median function that takes a list of numbers as its argument, then call it.
def median(l):
    l = sorted(l)
    lent = len(l)
    m = lent // 2
    if (lent % 2) == 0:
        # even length: average the two middle values
        result = (l[m - 1] + l[m]) / 2
    else:
        result = l[m]
    return result

What I did was this:
def median(a):
    a = sorted(a)
    if len(a) % 2 != 0:
        return a[len(a) // 2]
    else:
        return (a[len(a) // 2] + a[(len(a) // 2) - 1]) / 2
Explanation: if the number of items in the list is odd, return the middle item. Otherwise, halving an even length gives the index of the upper of the two middle items, and because the list is sorted the lower one sits at the index just before it, so we add the two and divide by 2 to get the median.

Here's the tedious way to find the median without using the median function:
def median(*arg):
    ordered = order(arg)  # use the sorted copy; the argument tuple itself cannot be reordered
    numArg = len(ordered)
    half = numArg // 2
    if numArg % 2 == 0:
        print((ordered[half - 1] + ordered[half]) / 2)
    else:
        print(ordered[half])
def order(tup):
    ordered = list(tup)
    while test(ordered):  # keep making passes until one completes without a swap
        pass
    return ordered
def test(ordered):
    whileloop = 0
    for i in range(len(ordered) - 1):
        if ordered[i] > ordered[i + 1]:
            # swap neighbours that are out of order
            ordered[i], ordered[i + 1] = ordered[i + 1], ordered[i]
            whileloop = 1  # run the loop again if you had to switch values
    return whileloop

It is very simple:
def median(alist):
    # to find the median you have to sort the list first
    sList = sorted(alist)
    first = 0
    last = len(sList) - 1
    midpoint = (first + last) // 2
    # return the middle value, not its index (for an even-length list this is the lower of the two middle values)
    return sList[midpoint]
You can use the return value like this: result = median(anyList)

Find the nth lucky number generated by a sieve in Python

I'm trying to make a program in Python which will generate the nth lucky number according to the lucky number sieve. I'm fairly new to Python so I don't know how to do all that much yet. So far I've figured out how to make a function which determines all lucky numbers below a specified number:
def lucky(number):
    l = list(range(1, number + 1, 2))  # list() is needed on Python 3, where range() is not a list
    i = 1
    while i < len(l):
        del l[l[i] - 1::l[i]]
        i += 1
    return l
Is there a way to modify this so that I can instead find the nth lucky number? I thought about increasing the specified number gradually until a list of the appropriate length to find the required lucky number was created, but that seems like a really inefficient way of doing it.
Edit: I came up with this, but is there a better way?
def lucky(number):
    f = 2
    n = number * f
    while True:
        l = list(range(1, n + 1, 2))  # list() is needed on Python 3, where range() is not a list
        i = 1
        while i < len(l):
            del l[l[i] - 1::l[i]]
            i += 1
        if len(l) >= number:
            return l[number - 1]
        f += 1
        n = number * f
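One cheap variation on the idea of growing the limit is to double it on each retry instead of adding number each time, which keeps the count of re-sieves logarithmic in the final limit. This is only a sketch under assumptions: lucky_below and nth_lucky are hypothetical names, and the starting guess of 4 * n is arbitrary.

```python
def lucky_below(limit):
    """Return all lucky numbers <= limit using the sieve from the question."""
    numbers = list(range(1, limit + 1, 2))
    i = 1
    while i < len(numbers):
        step = numbers[i]
        del numbers[step - 1::step]  # remove every step-th survivor
        i += 1
    return numbers

def nth_lucky(n):
    """Return the nth lucky number (1-based), doubling the limit until it is reached."""
    limit = max(16, 4 * n)  # arbitrary starting guess, an assumption of this sketch
    while True:
        found = lucky_below(limit)
        if len(found) >= n:
            return found[n - 1]
        limit *= 2

print(nth_lucky(10))  # 33
```

The sieve's early results do not depend on where the list ends, so a larger limit never changes the numbers already found.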

I came up with this, but is there a better way?
Truth is, there will always be a better way, the remaining question being: is it good enough for your need?
One possible improvement would be to turn all this into a generator function. That way, you would only compute new values as they are consumed. I came up with this version, which I only validated up to about 60 terms:
import itertools
def _idx_after_removal(removed_indices, value):
    for removed in removed_indices:
        value -= value // removed  # integer division, which is what "/" did on Python 2
    return value
def _should_be_excluded(removed_indices, value):
    for j in range(len(removed_indices) - 1):
        value_idx = _idx_after_removal(removed_indices[:j + 1], value)
        if value_idx % removed_indices[j + 1] == 0:
            return True
    return False
def lucky():
    yield 1
    removed_indices = [2]
    for i in itertools.count(3, 2):
        if not _should_be_excluded(removed_indices, i):
            yield i
            removed_indices.append(i)
            removed_indices = list(set(removed_indices))
            removed_indices.sort()
If you want to extract, for example, the 100th term from this generator, you can use the nth recipe from the itertools documentation (n is zero-based, so the 100th term is at index 99):
def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(itertools.islice(iterable, n, None), default)
print(nth(lucky(), 99))
I hope this works, and there's without any doubt more room for code improvement (but as stated previously, there's always room for improvement!).

With numpy arrays, you can make use of boolean indexing, which may help. For example:
>>> import numpy as np
>>> a = np.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> print(a[a > 3])
[4 5 6 7 8 9]
>>> mask = np.array([True, False, True, False, True, False, True, False, True, False])
>>> print(a[mask])
[0 2 4 6 8]
Here is a lucky number function using numpy arrays:
import numpy as np
class Didnt_Findit(Exception):
    pass
def lucky(n):
    '''Return the nth lucky number.
    n --> int
    returns int
    '''
    # initial seed
    lucky_numbers = [1]
    # how many numbers do you need to get to n?
    candidates = np.arange(1, n*100, 2)
    # use numpy array boolean indexing
    next_lucky = candidates[candidates > lucky_numbers[-1]][0]
    # accumulate lucky numbers till you have n of them
    while next_lucky < candidates[-1]:
        lucky_numbers.append(next_lucky)
        #print(lucky_numbers)
        if len(lucky_numbers) == n:
            return lucky_numbers[-1]
        mask_start = next_lucky - 1
        mask_step = next_lucky
        mask = np.array([True] * len(candidates))
        mask[mask_start::mask_step] = False
        #print(mask)
        candidates = candidates[mask]
        next_lucky = candidates[candidates > lucky_numbers[-1]][0]
    raise Didnt_Findit('n = ', n)
>>> print(lucky(10))
33
>>> print(lucky(50))
261
>>> print(lucky(500))
3975
Checked mine and @icecrime's output for 10, 50 and 500 - they matched.
Yours is much faster than mine and scales better with n.

n = int(input('enter n '))
a = list(range(1, n + 1))  # 1..n inclusive, matching the walkthrough below
x = a[1]
for i in range(1, n):
    del a[x-1::x]
    x = a[i]
    l = len(a)
    if i == l-1:
        break
print("lucky numbers till %d" % n)
print(a)
Let's walk through an example and print the lucky numbers up to 100.
Put n = 100.
First, a = 1, 2, 3, 4, 5, ..., 100
x = a[1] = 2
del a[1::2] leaves
a = 1, 3, 5, 7, ..., 99
Now l = 50
and now x = 3
Then del a[2::3] leaves a = 1, 3, 7, 9, 13, 15, ...
and the loop continues until i == l-1
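To watch those passes happen, here is a small tracing sketch. It is not the code above: sieve_passes is a hypothetical helper that yields the step size and the surviving numbers after each deletion, and it stops once the step exceeds the list length, since no further deletion can occur after that.

```python
def sieve_passes(n):
    """Yield (step, survivors) after each deletion pass of the lucky sieve on 1..n."""
    numbers = list(range(1, n + 1))
    step = 2
    i = 1
    while step <= len(numbers):
        del numbers[step - 1::step]  # drop every step-th survivor
        yield step, list(numbers)
        if i >= len(numbers):
            break
        step = numbers[i]  # the next step is the next surviving number
        i += 1

for step, survivors in sieve_passes(20):
    print(step, survivors)
# 2 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
# 3 [1, 3, 7, 9, 13, 15, 19]
# 7 [1, 3, 7, 9, 13, 15]
```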

Finding all sequences of A, B such that have a specified number of each element

For example, given the two letters A and B, I'd like to generate all strings of length n that have x A's and y B's.
I'd like this to be done efficiently. One way that I've considered is to build a length x list of A's, and then insert y B's into the list every possible way. But insertion into a python list is linear, so this method would suck as the list gets big.
PERFORMANCE GOAL (this may be unreasonable, but it is my hope): Generate all strings of length 20 with equal numbers of A and B in time less than a minute.
EDIT: Using permutations('A' * x + 'B' * y) has been suggested. While not a bad idea, it wastes a lot of work. If x = y = 4, you'd generate the string 'AAAABBBB' many times. Is there a better way that might generate each string only once? I've tried code to the effect of set(permutations('A' * x + 'B' * y)) and it is too slow.

Regarding your concerns about performance, here is an actual generator implementation of your idea (without insert). It finds the positions for the B's and fills the list accordingly.
import itertools
def make_sequences(num_a, num_b):
    b_locations = range(num_a+1)
    for b_comb in itertools.combinations_with_replacement(b_locations, num_b):
        result = []
        result_a = 0
        for b_position in b_comb:
            while b_position > result_a:
                result.append('A')
                result_a += 1
            result.append('B')
        while result_a < num_a:
            result.append('A')
            result_a += 1
        yield ''.join(result)
It does perform better. Comparing with Greg Hewgill's solution (named make_sequences2 here):
In : %timeit list(make_sequences(4,4))
10000 loops, best of 3: 145 us per loop
In : %timeit make_sequences2(4,4)
100 loops, best of 3: 6.08 ms per loop
Edit
A generalized version:
import itertools
def insert_letters(sequence, rest):
    if not rest:
        yield sequence
    else:
        letter, number = rest[0]
        rest = rest[1:]
        possible_locations = range(len(sequence)+1)
        for locations in itertools.combinations_with_replacement(possible_locations, number):
            result = []
            count = 0
            temp_sequence = sequence
            for location in locations:
                while location > count:
                    result.append(temp_sequence[0])
                    temp_sequence = temp_sequence[1:]
                    count += 1
                result.append(letter)
            if temp_sequence:
                result.append(temp_sequence)
            for item in insert_letters(''.join(result), rest):
                yield item
def generate_sequences(*args):
    '''
    arguments : sequence of (letter, number) tuples
    '''
    (letter, number), rest = args[0], args[1:]
    for sequence in insert_letters(letter*number, rest):
        yield sequence
Usage:
for seq in generate_sequences(('A', 2), ('B', 1), ('C', 1)):
    print(seq)
# Outputs
#
# CBAA
# BCAA
# BACA
# BAAC
# CABA
# ACBA
# ABCA
# ABAC
# CAAB
# ACAB
# AACB
# AABC
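A different way to reach the same set of strings is plain backtracking over the remaining letter counts, which also produces each arrangement exactly once. This is a sketch of that alternative technique, not the answer's method; multiset_permutations is a hypothetical name, and the order of the results differs from the output listed above.

```python
from collections import Counter

def multiset_permutations(counts):
    """Yield each distinct arrangement of the letters in `counts` exactly once."""
    remaining = Counter(counts)
    length = sum(remaining.values())
    prefix = []

    def extend():
        if len(prefix) == length:
            yield "".join(prefix)
            return
        for letter in sorted(remaining):
            if remaining[letter] > 0:
                # place the letter, recurse, then undo the choice
                remaining[letter] -= 1
                prefix.append(letter)
                yield from extend()
                prefix.pop()
                remaining[letter] += 1

    yield from extend()

results = list(multiset_permutations({"A": 2, "B": 1, "C": 1}))
print(len(results))  # 12
```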

A simple way to do this would be the following:
import itertools
def make_sequences(x, y):
    # permutations() yields tuples of characters, so join each one back into a string
    return set("".join(p) for p in itertools.permutations("A" * x + "B" * y))
The itertools.permutations() function doesn't take repeated elements in the input into account, so it generates many permutations that duplicate earlier ones. Passing the results to the set() constructor removes those duplicates.
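If generating and then discarding duplicates is too slow, one alternative is to choose which positions hold a B with itertools.combinations, so each string is built exactly once. A minimal sketch; ab_strings is a hypothetical name:

```python
from itertools import combinations

def ab_strings(num_a, num_b):
    """Yield every string with num_a 'A's and num_b 'B's exactly once."""
    length = num_a + num_b
    for b_positions in combinations(range(length), num_b):
        chars = ["A"] * length
        for pos in b_positions:
            chars[pos] = "B"  # overwrite the chosen slots with B
        yield "".join(chars)

print(list(ab_strings(2, 2)))
# ['BBAA', 'BABA', 'BAAB', 'ABBA', 'ABAB', 'AABB']
```

For 10 A's and 10 B's this yields C(20, 10) = 184756 strings, with no duplicates to filter out.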

This should give you the idea (I've included every step so you can see what's going on):
>>> import itertools
>>> x = 2
>>> y = 3
>>> lst_a = ['A'] * x
>>> lst_b = ['B'] * y
>>> print(lst_a, lst_b)
['A', 'A'] ['B', 'B', 'B']
>>> lst_a.extend(lst_b)
>>> lst_a
['A', 'A', 'B', 'B', 'B']
>>> print(list(itertools.permutations(lst_a)))
