Inverse of np.percentile() [duplicate] - python
I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.
For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].
Can anyone please either:
Help me correct my code below? OR
Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?
My current code:
def median(mylist):
    length = len(mylist)
    if not length % 2:
        return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
    return mylist[length / 2]
###############################################################################
# PERCENTILE FUNCTION
###############################################################################
def percentile(x):
    """
    Find the corresponding percentile of each value relative to a list of values.
    where x is the list of values
    Input list should already be sorted!
    """
    # sort the input list
    # list_sorted = x.sort()
    # count the number of elements in the list
    list_elementCount = len(x)
    # obtain set of values from list
    listFromSetFromList = list(set(x))
    # count the number of unique elements in the list
    list_uniqueElementCount = len(set(x))
    # define extreme quantiles
    percentileZero = min(x)
    percentileHundred = max(x)
    # define median quantile
    mdn = median(x)
    # create empty list to hold percentiles
    x_percentile = [0.00] * list_elementCount
    # initialize unique count
    uCount = 0
    for i in range(list_elementCount):
        if x[i] == percentileZero:
            x_percentile[i] = 0.00
        elif x[i] == percentileHundred:
            x_percentile[i] = 1.00
        elif x[i] == mdn:
            x_percentile[i] = 0.50
        else:
            subList_elementCount = 0
            for j in range(i):
                if x[j] < x[i]:
                    subList_elementCount = subList_elementCount + 1
            x_percentile[i] = float(subList_elementCount / list_elementCount)
            # x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
            if i == 0:
                continue
            else:
                if x[i] == x[i-1]:
                    continue
                else:
                    uCount = uCount + 1
    return x_percentile
Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.
I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "percent of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance wikipedia).
With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:
>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]
(I used a dataset containing ties to illustrate what happens in such cases.)
The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.
As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:
>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3, 0.3, 0.7, 0.7, 1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4, 0.4, 0.8, 0.8, 1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. , 0. , 0.4, 0.4, 0.8])
In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
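For completeness, the omitted "mean" variant is just the average of the two results above (a quick sketch continuing the same session):

>>> weak = stats.rankdata(x, 'max')/len(x)
>>> strict = (stats.rankdata(x, 'min')-1)/len(x)
>>> (weak + strict)/2
array([ 0.2, 0.2, 0.6, 0.6, 0.9])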
I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:
In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop
In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop
In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop
With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is 10 times faster than Kevin's list_to_percentiles:
In [18]: x = np.random.randint(0, 10000, 1000)
In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop
In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop
In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop
This advantage will only become more pronounced on larger and larger datasets.
I think you want scipy.stats.percentileofscore
Example:
>>> from scipy.stats import percentileofscore
>>> percentileofscore([1, 2, 3, 4], 3)
75.0
To get a percentile for every element in data:
percentiles = [percentileofscore(data, i) for i in data]
In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.
Here is a solution that takes O(n log n) time.
def list_to_percentiles(numbers):
    pairs = list(zip(numbers, range(len(numbers))))
    pairs.sort(key=lambda p: p[0])
    result = [0 for i in range(len(numbers))]
    for rank in range(len(numbers)):
        original_index = pairs[rank][1]
        result[original_index] = rank * 100.0 / (len(numbers)-1)
    return result
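For reference, on the question's example this returns percentiles on a 0 to 100 scale rather than 0 to 1:

>>> list_to_percentiles([1, 2, 3, 4, 17])
[0.0, 25.0, 50.0, 75.0, 100.0]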
I'm not sure, but I think this is the optimal time complexity you can get. The rough reason I think it's optimal is because the information of all of the percentiles is essentially equivalent to the information of the sorted list, and you can't get better than O(n log n) for sorting.
EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
Pure numpy version of Kevin's solution
As Kevin said, the optimal solution works in O(n log(n)) time. Here is a fast version of his code in numpy, which runs in about the same time as stats.rankdata:
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)
PS. This is one of my favourite tricks in numpy.
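A quick check on the question's input (a minimal sketch; numpy is assumed to be imported under its full name, as in the one-liner above):

import numpy
array = numpy.array([1, 2, 3, 4, 17])
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)
# percentiles is now [0., 25., 50., 75., 100.]

Note that for tied values this assigns distinct ordinal ranks (ties broken by position) rather than the averaged ranks produced by rankdata with 'average'.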
This might look oversimplified, but what about this:

def percentile(x):
    pc = float(1)/(len(x)-1)
    return ["%.2f"%(n*pc) for n, i in enumerate(x)]
EDIT:
def percentile(x):
    unique = sorted(set(x))
    mapping = {}
    pc = float(1)/(len(unique)-1)
    for n, i in enumerate(unique):
        mapping[i] = "%.2f"%(n*pc)
    return [mapping.get(el) for el in x]
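Note that this returns formatted strings rather than floats. With the unique values sorted, the question's input gives:

>>> percentile([1, 2, 3, 4, 17])
['0.00', '0.25', '0.50', '0.75', '1.00']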
I tried Scipy's percentile score but it turned out to be very slow for one of my tasks, so I simply implemented it this way. It can be modified if a weak ranking is needed; one possible weak-ranking variant is sketched after the example below.
import numpy as np

def assign_pct(X):
    mp = {}
    X_tmp = np.sort(X)
    pct = []
    cnt = 0
    # map each unique value (in ascending order) to its rank 0, 1, 2, ...
    for v in X_tmp:
        if v in mp:
            continue
        else:
            mp[v] = cnt
            cnt += 1
    # scale the ranks to [0, 1] by dividing by (number of unique values - 1)
    for v in X:
        pct.append(mp[v] / float(cnt - 1))
    return pct
Calling the function
assign_pct([23,4,1,43,1,6])
Output of function
[0.75, 0.25, 0.0, 1.0, 0.0, 0.5]
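The weak-ranking modification mentioned above could look something like this (a hypothetical sketch, not part of the original answer; assign_pct_weak is an assumed name, and it reproduces scipy's 'weak' kind, i.e. the proportion of values less than or equal to each element):

import numpy as np

def assign_pct_weak(X):
    X = np.asarray(X)
    X_sorted = np.sort(X)
    # side='right' counts how many sorted values are <= each element
    counts = np.searchsorted(X_sorted, X, side='right')
    return (counts / float(len(X))).tolist()

assign_pct_weak([23, 4, 1, 43, 1, 6])
# [0.8333..., 0.5, 0.3333..., 1.0, 0.3333..., 0.6666...]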
If I understand you correctly, all you want to do is define the percentile that each element represents in the array, i.e. how much of the array comes before that element. So [1, 2, 3, 4, 5]
should give [0.0, 0.25, 0.5, 0.75, 1.0].
I believe code such as the following will be enough:
def percentileListEdited(List):
    uniqueList = sorted(set(List))
    increase = 1.0/(len(uniqueList)-1)
    newList = {}
    for index, value in enumerate(uniqueList):
        newList[value] = 0.0 + increase * index
    return [newList[val] for val in List]
For me the best solution is to use QuantileTransformer in sklearn.preprocessing.
import numpy as np
from sklearn.preprocessing import QuantileTransformer

fn = lambda input_list: QuantileTransformer(100).fit_transform(np.array(input_list).reshape([-1, 1])).ravel().tolist()

input_raw = [1, 2, 3, 4, 17]
output_perc = fn(input_raw)
print("Input=", input_raw)
print("Output=", np.round(output_perc, 2))
Here is the output
Input= [1, 2, 3, 4, 17]
Output= [ 0. 0.25 0.5 0.75 1. ]
Note: this function has two salient features:
input raw data is NOT necessarily sorted.
input raw data is NOT necessarily single column.
This version also allows you to pass the exact percentile values used for ranking:

def what_pctl_number_of(x, a, pctls=np.arange(1, 101)):
    return np.argmax(np.sign(np.append(np.percentile(x, pctls), np.inf) - a))

So it's possible to find out which percentile bucket a value falls into for the provided percentiles:
_x = np.random.randn(100, 1)
what_pctl_number_of(_x, 1.6, [25, 50, 75, 100])
Output:
3
so it falls into the 75 ~ 100 range
For a pure Python function that calculates a percentile score for a given item, compared to the population distribution (a list of scores), I pulled this from the scipy source code and removed all references to numpy:
def percentileofscore(a, score, kind='rank'):
    n = len(a)
    if n == 0:
        return 100.0
    left = len([item for item in a if item < score])
    right = len([item for item in a if item <= score])
    if kind == 'rank':
        pct = (right + left + (1 if right > left else 0)) * 50.0/n
        return pct
    elif kind == 'strict':
        return left / n * 100
    elif kind == 'weak':
        return right / n * 100
    elif kind == 'mean':
        pct = (left + right) / n * 50
        return pct
    else:
        raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")
source: https://github.com/scipy/scipy/blob/v1.2.1/scipy/stats/stats.py#L1744-L1835
Calculating percentiles is trickier than one would think, yet far less complicated than pulling in the full scipy/numpy/scikit stack, so this is a good fit for lightweight deployment. The original code does a better job of filtering for only nonzero values, but otherwise the math is the same. The optional kind parameter controls how values that fall between two other values are handled.
For this use case, one can call this function for each item in a list using the map() function.
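For example, a small sketch using the function above:

data = [1, 1, 2, 2, 17]
print(list(map(lambda v: percentileofscore(data, v), data)))
# [30.0, 30.0, 70.0, 70.0, 100.0]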
Related
How to run for loop on float variables in python? [duplicate]
Is there a range() equivalent for floats in Python?

>>> range(0.5,5,1.5)
[0, 1, 2, 3, 4]
>>> range(0.5,5,0.5)
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    range(0.5,5,0.5)
ValueError: range() step argument must not be zero
You can either use:

[x / 10.0 for x in range(5, 50, 15)]

or use lambda / map:

map(lambda x: x/10.0, range(5, 50, 15))
I don't know a built-in function, but writing one like [this](https://stackoverflow.com/a/477610/623735) shouldn't be too complicated.

def frange(x, y, jump):
    while x < y:
        yield x
        x += jump

---

As the comments mention, this could produce unpredictable results like:

>>> list(frange(0, 100, 0.1))[-1]
99.9999999999986

To get the expected result, you can use one of the other answers in this question, or as @Tadhg mentioned, you can use decimal.Decimal as the jump argument. Make sure to initialize it with a string rather than a float.

>>> import decimal
>>> list(frange(0, 100, decimal.Decimal('0.1')))[-1]
Decimal('99.9')

Or even:

import decimal

def drange(x, y, jump):
    while x < y:
        yield float(x)
        x += decimal.Decimal(jump)

And then:

>>> list(drange(0, 100, '0.1'))[-1]
99.9

[editor's note: if you only use a positive jump and integer start and stop (x and y), this works fine. For a more general solution see here.]
I used to use numpy.arange but had some complications controlling the number of elements it returns, due to floating point errors. So now I use linspace, e.g.:

>>> import numpy
>>> numpy.linspace(0, 10, num=4)
array([ 0. , 3.33333333, 6.66666667, 10. ])
Pylab has frange (a wrapper, actually, for matplotlib.mlab.frange):

>>> import pylab as pl
>>> pl.frange(0.5,5,0.5)
array([ 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
Eagerly evaluated (2.x range):

[x * .5 for x in range(10)]

Lazily evaluated (2.x xrange, 3.x range):

itertools.imap(lambda x: x * .5, xrange(10))  # or range(10) as appropriate

Alternately:

itertools.islice(itertools.imap(lambda x: x * .5, itertools.count()), 10)
# without applying the `islice`, we get an infinite stream of half-integers.
using itertools: lazily evaluated floating point range:

>>> from itertools import count, takewhile
>>> def frange(start, stop, step): return takewhile(lambda x: x < stop, count(start, step))
>>> list(frange(0.5, 5, 1.5))
# [0.5, 2.0, 3.5]
I helped add the function numeric_range to the package more-itertools. more_itertools.numeric_range(start, stop, step) acts like the built in function range but can handle floats, Decimal, and Fraction types.

>>> from more_itertools import numeric_range
>>> tuple(numeric_range(.1, 5, 1))
(0.1, 1.1, 2.1, 3.1, 4.1)
There is no such built-in function, but you can use the following (Python 3 code) to do the job as safe as Python allows you to. from fractions import Fraction def frange(start, stop, jump, end=False, via_str=False): """ Equivalent of Python 3 range for decimal numbers. Notice that, because of arithmetic errors, it is safest to pass the arguments as strings, so they can be interpreted to exact fractions. >>> assert Fraction('1.1') - Fraction(11, 10) == 0.0 >>> assert Fraction( 0.1 ) - Fraction(1, 10) == Fraction(1, 180143985094819840) Parameter `via_str` can be set to True to transform inputs in strings and then to fractions. When inputs are all non-periodic (in base 10), even if decimal, this method is safe as long as approximation happens beyond the decimal digits that Python uses for printing. For example, in the case of 0.1, this is the case: >>> assert str(0.1) == '0.1' >>> assert '%.50f' % 0.1 == '0.10000000000000000555111512312578270211815834045410' If you are not sure whether your decimal inputs all have this property, you are better off passing them as strings. String representations can be in integer, decimal, exponential or even fraction notation. >>> assert list(frange(1, 100.0, '0.1', end=True))[-1] == 100.0 >>> assert list(frange(1.0, '100', '1/10', end=True))[-1] == 100.0 >>> assert list(frange('1', '100.0', '.1', end=True))[-1] == 100.0 >>> assert list(frange('1.0', 100, '1e-1', end=True))[-1] == 100.0 >>> assert list(frange(1, 100.0, 0.1, end=True))[-1] != 100.0 >>> assert list(frange(1, 100.0, 0.1, end=True, via_str=True))[-1] == 100.0 """ if via_str: start = str(start) stop = str(stop) jump = str(jump) start = Fraction(start) stop = Fraction(stop) jump = Fraction(jump) while start < stop: yield float(start) start += jump if end and start == stop: yield(float(start)) You can verify all of it by running a few assertions: assert Fraction('1.1') - Fraction(11, 10) == 0.0 assert Fraction( 0.1 ) - Fraction(1, 10) == Fraction(1, 180143985094819840) assert str(0.1) == '0.1' assert '%.50f' % 0.1 == '0.10000000000000000555111512312578270211815834045410' assert list(frange(1, 100.0, '0.1', end=True))[-1] == 100.0 assert list(frange(1.0, '100', '1/10', end=True))[-1] == 100.0 assert list(frange('1', '100.0', '.1', end=True))[-1] == 100.0 assert list(frange('1.0', 100, '1e-1', end=True))[-1] == 100.0 assert list(frange(1, 100.0, 0.1, end=True))[-1] != 100.0 assert list(frange(1, 100.0, 0.1, end=True, via_str=True))[-1] == 100.0 assert list(frange(2, 3, '1/6', end=True))[-1] == 3.0 assert list(frange(0, 100, '1/3', end=True))[-1] == 100.0 Code available on GitHub
As kichik wrote, this shouldn't be too complicated. However this code: def frange(x, y, jump): while x < y: yield x x += jump Is inappropriate because of the cumulative effect of errors when working with floats. That is why you receive something like: >>>list(frange(0, 100, 0.1))[-1] 99.9999999999986 While the expected behavior would be: >>>list(frange(0, 100, 0.1))[-1] 99.9 Solution 1 The cumulative error can simply be reduced by using an index variable. Here's the example: from math import ceil def frange2(start, stop, step): n_items = int(ceil((stop - start) / step)) return (start + i*step for i in range(n_items)) This example works as expected. Solution 2 No nested functions. Only a while and a counter variable: def frange3(start, stop, step): res, n = start, 1 while res < stop: yield res res = start + n * step n += 1 This function will work well too, except for the cases when you want the reversed range. E.g: >>>list(frange3(1, 0, -.1)) [] Solution 1 in this case will work as expected. To make this function work in such situations, you must apply a hack, similar to the following: from operator import gt, lt def frange3(start, stop, step): res, n = start, 0. predicate = lt if start < stop else gt while predicate(res, stop): yield res res = start + n * step n += 1 With this hack you can use these functions with negative steps: >>>list(frange3(1, 0, -.1)) [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3999999999999999, 0.29999999999999993, 0.19999999999999996, 0.09999999999999998] Solution 3 You can go even further with plain standard library and compose a range function for the most of numeric types: from itertools import count from itertools import takewhile def any_range(start, stop, step): start = type(start + step)(start) return takewhile(lambda n: n < stop, count(start, step)) This generator is adapted from the Fluent Python book (Chapter 14. Iterables, Iterators and generators). It will not work with decreasing ranges. You must apply a hack, like in the previous solution. You can use this generator as follows, for example: >>>list(any_range(Fraction(2, 1), Fraction(100, 1), Fraction(1, 3)))[-1] 299/3 >>>list(any_range(Decimal('2.'), Decimal('4.'), Decimal('.3'))) [Decimal('2'), Decimal('2.3'), Decimal('2.6'), Decimal('2.9'), Decimal('3.2'), Decimal('3.5'), Decimal('3.8')] And of course you can use it with float and int as well. Be careful If you want to use these functions with negative steps, you should add a check for the step sign, e.g.: no_proceed = (start < stop and step < 0) or (start > stop and step > 0) if no_proceed: raise StopIteration The best option here is to raise StopIteration, if you want to mimic the range function itself. Mimic range If you would like to mimic the range function interface, you can provide some argument checks: def any_range2(*args): if len(args) == 1: start, stop, step = 0, args[0], 1. elif len(args) == 2: start, stop, step = args[0], args[1], 1. elif len(args) == 3: start, stop, step = args else: raise TypeError('any_range2() requires 1-3 numeric arguments') # here you can check for isinstance numbers.Real or use more specific ABC or whatever ... start = type(start + step)(start) return takewhile(lambda n: n < stop, count(start, step)) I think, you've got the point. You can go with any of these functions (except the very first one) and all you need for them is python standard library.
Why Is There No Floating Point Range Implementation In The Standard Library? As made clear by all the posts here, there is no floating point version of range(). That said, the omission makes sense if we consider that the range() function is often used as an index (and of course, that means an accessor) generator. So, when we call range(0,40), we're in effect saying we want 40 values starting at 0, up to 40, but non-inclusive of 40 itself. When we consider that index generation is as much about the number of indices as it is their values, the use of a float implementation of range() in the standard library makes less sense. For example, if we called the function frange(0, 10, 0.25), we would expect both 0 and 10 to be included, but that would yield a generator with 41 values, not the 40 one might expect from 10/0.25. Thus, depending on its use, an frange() function will always exhibit counter intuitive behavior; it either has too many values as perceived from the indexing perspective or is not inclusive of a number that reasonably should be returned from the mathematical perspective. In other words, it's easy to see how such a function would appear to conflate two very different use cases – the naming implies the indexing use case; the behavior implies a mathematical one. The Mathematical Use Case With that said, as discussed in other posts, numpy.linspace() performs the generation from the mathematical perspective nicely: numpy.linspace(0, 10, 41) array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. , 2.25, 2.5 , 2.75, 3. , 3.25, 3.5 , 3.75, 4. , 4.25, 4.5 , 4.75, 5. , 5.25, 5.5 , 5.75, 6. , 6.25, 6.5 , 6.75, 7. , 7.25, 7.5 , 7.75, 8. , 8.25, 8.5 , 8.75, 9. , 9.25, 9.5 , 9.75, 10. ]) The Indexing Use Case And for the indexing perspective, I've written a slightly different approach with some tricksy string magic that allows us to specify the number of decimal places. # Float range function - string formatting method def frange_S (start, stop, skip = 1.0, decimals = 2): for i in range(int(start / skip), int(stop / skip)): yield float(("%0." + str(decimals) + "f") % (i * skip)) Similarly, we can also use the built-in round function and specify the number of decimals: # Float range function - rounding method def frange_R (start, stop, skip = 1.0, decimals = 2): for i in range(int(start / skip), int(stop / skip)): yield round(i * skip, ndigits = decimals) A Quick Comparison & Performance Of course, given the above discussion, these functions have a fairly limited use case. Nonetheless, here's a quick comparison: def compare_methods (start, stop, skip): string_test = frange_S(start, stop, skip) round_test = frange_R(start, stop, skip) for s, r in zip(string_test, round_test): print(s, r) compare_methods(-2, 10, 1/3) The results are identical for each: -2.0 -2.0 -1.67 -1.67 -1.33 -1.33 -1.0 -1.0 -0.67 -0.67 -0.33 -0.33 0.0 0.0 ... 8.0 8.0 8.33 8.33 8.67 8.67 9.0 9.0 9.33 9.33 9.67 9.67 And some timings: >>> import timeit >>> setup = """ ... def frange_s (start, stop, skip = 1.0, decimals = 2): ... for i in range(int(start / skip), int(stop / skip)): ... yield float(("%0." + str(decimals) + "f") % (i * skip)) ... def frange_r (start, stop, skip = 1.0, decimals = 2): ... for i in range(int(start / skip), int(stop / skip)): ... yield round(i * skip, ndigits = decimals) ... start, stop, skip = -1, 8, 1/3 ... 
""" >>> min(timeit.Timer('string_test = frange_s(start, stop, skip); [x for x in string_test]', setup=setup).repeat(30, 1000)) 0.024284090992296115 >>> min(timeit.Timer('round_test = frange_r(start, stop, skip); [x for x in round_test]', setup=setup).repeat(30, 1000)) 0.025324633985292166 Looks like the string formatting method wins by a hair on my system. The Limitations And finally, a demonstration of the point from the discussion above and one last limitation: # "Missing" the last value (10.0) for x in frange_R(0, 10, 0.25): print(x) 0.25 0.5 0.75 1.0 ... 9.0 9.25 9.5 9.75 Further, when the skip parameter is not divisible by the stop value, there can be a yawning gap given the latter issue: # Clearly we know that 10 - 9.43 is equal to 0.57 for x in frange_R(0, 10, 3/7): print(x) 0.0 0.43 0.86 1.29 ... 8.14 8.57 9.0 9.43 There are ways to address this issue, but at the end of the day, the best approach would probably be to just use Numpy.
A solution without numpy or other dependencies was provided by kichik, but due to floating point arithmetic it often behaves unexpectedly. As noted by me and blubberdiblub, additional elements easily sneak into the result. For example, naive_frange(0.0, 1.0, 0.1) would yield 0.999... as its last value and thus yield 11 values in total. A bit more robust version is provided here:

def frange(x, y, jump=1.0):
    '''Range for floats.'''
    i = 0.0
    x = float(x)  # Prevent yielding integers.
    x0 = x
    epsilon = jump / 2.0
    yield x  # always yield the first value
    while x + epsilon < y:
        i += 1.0
        x = x0 + i * jump
        if x < y:
            yield x

Because of the multiplication, the rounding errors do not accumulate. The use of epsilon takes care of a possible rounding error in the multiplication, even though issues might of course arise at the very small and very large ends. Now, as expected:

> a = list(frange(0.0, 1.0, 0.1))
> a[-1]
0.9
> len(a)
10

And with somewhat larger numbers:

> b = list(frange(0.0, 1000000.0, 0.1))
> b[-1]
999999.9
> len(b)
10000000

The code is also available as a GitHub Gist.
This can be done with numpy.arange(start, stop, stepsize):

import numpy as np
np.arange(0.5, 5, 1.5)
>> [0.5, 2.0, 3.5, 5.0]
# OBS you will sometimes see stuff like this happening,
# so you need to decide whether that's not an issue for you, or how you are going to catch it.
>> [0.50000001, 2.0, 3.5, 5.0]

Note 1: From the discussion in the comment section here, "never use numpy.arange() (the numpy documentation itself recommends against it). Use numpy.linspace as recommended by wim, or one of the other suggestions in this answer."

Note 2: I have read the discussion in a few comments here, but after coming back to this question for the third time now, I feel this information should be placed in a more readable position.
A simpler library-less version

Aw, heck -- I'll toss in a simple library-less version. Feel free to improve on it[*]:

def frange(start=0, stop=1, jump=0.1):
    nsteps = int((stop-start)/jump)
    dy = stop-start
    # f(i) goes from start to stop as i goes from 0 to nsteps
    return [start + float(i)*dy/nsteps for i in range(nsteps)]

The core idea is that nsteps is the number of steps to get you from start to stop and range(nsteps) always emits integers so there's no loss of accuracy. The final step is to map [0..nsteps] linearly onto [start..stop].

edit

If, like alancalvitti, you'd like the series to have exact rational representation, you can always use Fractions:

from fractions import Fraction

def rrange(start=0, stop=1, jump=0.1):
    nsteps = int((stop-start)/jump)
    return [Fraction(i, nsteps) for i in range(nsteps)]

[*] In particular, frange() returns a list, not a generator. But it sufficed for my needs.
Usage # Counting up drange(0, 0.4, 0.1) [0, 0.1, 0.2, 0.30000000000000004, 0.4] # Counting down drange(0, -0.4, -0.1) [0, -0.1, -0.2, -0.30000000000000004, -0.4] To round each step to N decimal places drange(0, 0.4, 0.1, round_decimal_places=4) [0, 0.1, 0.2, 0.3, 0.4] drange(0, -0.4, -0.1, round_decimal_places=4) [0, -0.1, -0.2, -0.3, -0.4] Code def drange(start, end, increment, round_decimal_places=None): result = [] if start < end: # Counting up, e.g. 0 to 0.4 in 0.1 increments. if increment < 0: raise Exception("Error: When counting up, increment must be positive.") while start <= end: result.append(start) start += increment if round_decimal_places is not None: start = round(start, round_decimal_places) else: # Counting down, e.g. 0 to -0.4 in -0.1 increments. if increment > 0: raise Exception("Error: When counting down, increment must be negative.") while start >= end: result.append(start) start += increment if round_decimal_places is not None: start = round(start, round_decimal_places) return result Why choose this answer? Many other answers will hang when asked to count down. Many other answers will give incorrectly rounded results. Other answers based on np.linspace are hit-and-miss, they may or may not work due to difficulty in choosing the correct number of divisions. np.linspace really struggles with decimal increments of 0.1, and the order of divisions in the formula to convert the increment into a number of splits can result in either correct or broken code. Other answers based on np.arange are deprecated. If in doubt, try the four tests cases above.
I do not know if the question is old, but there is an arange function in the NumPy library that can work as a range:

np.arange(0, 1, 0.1)
# out: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
I wrote a function that returns a tuple of a range of double precision floating point numbers without any decimal places beyond the hundredths. It was simply a matter of parsing the range values like strings and splitting off the excess. I use it for displaying ranges to select from within a UI. I hope someone else finds it useful.

def drange(start, stop, step):
    double_value_range = []
    while start < stop:
        a = str(start)
        a.split('.')[1].split('0')[0]
        start = float(str(a))
        double_value_range.append(start)
        start = start + step
    double_value_range_tuple = tuple(double_value_range)
    # print double_value_range_tuple
    return double_value_range_tuple
Whereas integer-based ranges are well defined in that "what you see is what you get", there are things that are not readily seen in floats that cause troubles in getting what appears to be a well defined behavior in a desired range. There are two approaches that one can take: split a given range into a certain number of segment: the linspace approach in which you accept the large number of decimal digits when you select a number of points that does not divide the span well (e.g. 0 to 1 in 7 steps will give a first step value of 0.14285714285714285) give the desired WYSIWIG step size that you already know should work and wish that it would work. Your hopes will often be dashed by getting values that miss the end point that you wanted to hit. Multiples can be higher or lower than you expect: >>> 3*.1 > .3 # 0.30000000000000004 True >>> 3*.3 < 0.9 # 0.8999999999999999 True You will try to avoid accumulating errors by adding multiples of your step and not incrementing, but the problem will always present itself and you just won't get what you expect if you did it by hand on paper -- with exact decimals. But you know it should be possible since Python shows you 0.1 instead of the underlying integer ratio having a close approximation to 0.1: >>> (3*.1).as_integer_ratio() (1351079888211149, 4503599627370496) In the methods offered as answers, the use of Fraction here with the option to handle input as strings is best. I have a few suggestions to make it better: make it handle range-like defaults so you can start from 0 automatically make it handle decreasing ranges make the output look like you would expect if you were using exact arithmetic I offer a routine that does these same sort of thing but which does not use the Fraction object. Instead, it uses round to create numbers having the same apparent digits as the numbers would have if you printed them with python, e.g. 1 decimal for something like 0.1 and 3 decimals for something like 0.004: def frange(start, stop, step, n=None): """return a WYSIWYG series of float values that mimic range behavior by excluding the end point and not printing extraneous digits beyond the precision of the input numbers (controlled by n and automatically detected based on the string representation of the numbers passed). EXAMPLES ======== non-WYSIWYS simple list-comprehension >>> [.11 + i*.1 for i in range(3)] [0.11, 0.21000000000000002, 0.31] WYSIWYG result for increasing sequence >>> list(frange(0.11, .33, .1)) [0.11, 0.21, 0.31] and decreasing sequences >>> list(frange(.345, .1, -.1)) [0.345, 0.245, 0.145] To hit the end point for a sequence that is divisibe by the step size, make the end point a little bigger by adding half the step size: >>> dx = .2 >>> list(frange(0, 1 + dx/2, dx)) [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] """ if step == 0: raise ValueError('step must not be 0') # how many decimal places are showing? if n is None: n = max([0 if '.' not in str(i) else len(str(i).split('.')[1]) for i in (start, stop, step)]) if step*(stop - start) > 0: # a non-null incr/decr range if step < 0: for i in frange(-start, -stop, -step, n): yield -i else: steps = round((stop - start)/step) while round(step*steps + start, n) < stop: steps += 1 for i in range(steps): yield round(start + i*step, n)
def Range(*argSequence):
    if len(argSequence) == 3:
        imin = argSequence[0]; imax = argSequence[1]; di = argSequence[2]
        i = imin; iList = []
        while i <= imax:
            iList.append(i)
            i += di
        return iList
    if len(argSequence) == 2:
        return Range(argSequence[0], argSequence[1], 1)
    if len(argSequence) == 1:
        return Range(1, argSequence[0], 1)

Please note the first letter of Range is capitalized. This naming style is not encouraged for functions in Python. You can change Range to something like drange or frange if you want. The "Range" function behaves just as you want it to. You can check its manual here [ http://reference.wolfram.com/language/ref/Range.html ].
I think that there is a very simple answer that really emulates all the features of range, but for both float and integer. In this solution, you just suppose that your approximation by default is 1e-7 (or the one you choose) and you can change it when you call the function.

def drange(start, stop=None, jump=1, approx=7):  # Approx to 1e-7 by default
    '''This function is equivalent to range but for both float and integer'''
    if not stop:  # If there is no y value: range(x)
        stop = start
        start = 0
    valor = round(start, approx)
    while valor < stop:
        if valor == int(valor):
            yield int(round(valor, approx))
        else:
            yield float(round(valor, approx))
        valor += jump

for i in drange(12):
    print(i)
Talk about making a mountain out of a mole hill. If you relax the requirement to make a float analog of the range function, and just create a list of floats that is easy to use in a for loop, the coding is simple and robust. def super_range(first_value, last_value, number_steps): if not isinstance(number_steps, int): raise TypeError("The value of 'number_steps' is not an integer.") if number_steps < 1: raise ValueError("Your 'number_steps' is less than 1.") step_size = (last_value-first_value)/(number_steps-1) output_list = [] for i in range(number_steps): output_list.append(first_value + step_size*i) return output_list first = 20.0 last = -50.0 steps = 5 print(super_range(first, last, steps)) The output will be [20.0, 2.5, -15.0, -32.5, -50.0] Note that the function super_range is not limited to floats. It can handle any data type for which the operators +, -, *, and / are defined, such as complex, Decimal, and numpy.array: import cmath first = complex(1,2) last = complex(5,6) steps = 5 print(super_range(first, last, steps)) from decimal import * first = Decimal(20) last = Decimal(-50) steps = 5 print(super_range(first, last, steps)) import numpy as np first = np.array([[1, 2],[3, 4]]) last = np.array([[5, 6],[7, 8]]) steps = 5 print(super_range(first, last, steps)) The output will be: [(1+2j), (2+3j), (3+4j), (4+5j), (5+6j)] [Decimal('20.0'), Decimal('2.5'), Decimal('-15.0'), Decimal('-32.5'), Decimal('-50.0')] [array([[1., 2.],[3., 4.]]), array([[2., 3.],[4., 5.]]), array([[3., 4.],[5., 6.]]), array([[4., 5.],[6., 7.]]), array([[5., 6.],[7., 8.]])]
There will of course be some rounding errors, so this is not perfect, but this is what I generally use for applications which don't require high precision. If you wanted to make this more accurate, you could add an extra argument to specify how to handle rounding errors. Perhaps passing a rounding function might make this extensible and allow the programmer to specify how to handle rounding errors.

arange = lambda start, stop, step: [start + step * i for i in range(int((stop - start) / step))]

If I write:

arange(0, 1, 0.1)

It will output:

[0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9]
Is there a range() equivalent for floats in Python? NO. Use this:

def f_range(start, end, step, coef=0.01):
    a = range(int(start/coef), int(end/coef), int(step/coef))
    var = []
    for item in a:
        var.append(item*coef)
    return var
There are several answers here that don't handle simple edge cases like a negative step, wrong start/stop, etc. Here's a version that handles many of these cases correctly, giving the same behaviour as native range():

def frange(start, stop=None, step=1):
    if stop is None:
        start, stop = 0, start
    steps = int((stop-start)/step)
    for i in range(steps):
        yield start
        start += step

Note that this would error out on step=0, just like native range. One difference is that native range returns an object that is indexable and reversible, while the above doesn't. You can play with this code and test cases here.
How to generate a randomized list of a sequence of numbers?
I want a function to generate a list of length n containing an arithmetic sequence of numbers between 0 and 1, but put in a random order. For example, for the function

def randSequence(n):
    ...
    return myList

randSequence(10) returns [0.5, 0.3, 0.9, 0.8, 0.6, 0.2, 0.4, 0.0, 0.1, 0.7] and randSequence(5) returns [0.4, 0.0, 0.2, 0.8, 0.6]

Currently, I have it so it generates the sequence of numbers in one loop, and randomizes it in another, as follows:

def randSequence(n):
    step = 1 / n
    setList = []
    myList = []
    for i in range(n):
        setList.append(i * step)
    for i in range(n):
        index = random.randint(0, len(setList) - 1)
        myList.append(setList.pop(index))
    return myList

Unfortunately, this solution is slow, especially for large numbers (like n > 1,000,000). Is there a better way to write this code, or even better, is there a function that can do this task for me?
@HeapOverflow suggested exchanging the second loop for the shuffle function:

def randSequence(n):
    step = 1 / n
    myList = []
    for i in range(n):
        myList.append(i * step)
    random.shuffle(myList)
    return myList

This is an order of magnitude faster than before. From past experience, I suspect that the pop function on lists is rather slow and was the main bottleneck in the second loop.
First, I'd like to point out that the main reason for the bad performance of your code is this line:

myList.append(setList.pop(index))

The time complexity of list.pop from the middle of a list is roughly O(n), since popping from the middle forces Python to move a bunch of memory around. This makes the net complexity O(n^2). You can drastically improve performance by making changes in place, e.g.:

def randSequenceInplace(n):
    'a.k.a. Fisher-Yates'
    step = 1 / n
    lst = [step * i for i in range(n)]
    for i in range(n-1):
        index = random.randint(i, n - 1)
        lst[i], lst[index] = lst[index], lst[i]  # myList.append(setList.pop(index))
    return lst

For completeness, you can go with a vectorized numpy solution or use the previously mentioned random.shuffle to get even better performance. Timings:

n = 10**6

%time randSequence(n)
# CPU times: user 1min 22s, sys: 33 ms, total: 1min 22s
# Wall time: 1min 22s

%time randSequenceInplace(n)
# CPU times: user 1.33 s, sys: 1.91 ms, total: 1.33 s
# Wall time: 1.33 s

%timeit np.random.permutation(n) / n
# 10 loops, best of 3: 22.4 ms per loop
how to speed up loop in numpy?
I would like to speed up this code:

import numpy as np
import pandas as pd

a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
    u1 = 13 * up + x
    u.append(u1)
    up = u1
for y in np.nditer(downloss[14:]):
    d1 = 13 * down + y
    d.append(d1)
    down = d1

The data below:

0           49.00
1           48.76
2           48.52
3           48.28
...
36785758    13.88
36785759    14.65
36785760    13.19
Name: Clsprc, Length: 36785759, dtype: float64

The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast simple moving average using the cumsum() function taken from the referenced link.

def moving_average(a, n=14):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

If this is not the case, and you really want the function described, you can increase the iteration speed by using the external_loop flag in your iteration. From the numpy documentation:

The nditer will try to provide chunks that are as large as possible to the inner loop. By forcing ‘C’ and ‘F’ order, we get different external loop sizes. This mode is enabled by specifying an iterator flag. Observe that with the default of keeping native memory order, the iterator is able to provide a single one-dimensional chunk, whereas when forcing Fortran order, it has to provide three chunks of two elements each.

for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
    # x now has x[0], x[1], x[2], x[3], x[4], x[5] elements.
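As a quick illustration of the cumsum-based helper above (a minimal check, assuming numpy is imported as np):

import numpy as np

a = np.arange(1, 8)            # [1, 2, 3, 4, 5, 6, 7]
print(moving_average(a, n=3))  # [2. 3. 4. 5. 6.], i.e. the means of [1,2,3], [2,3,4], ...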
In simplified terms, I think this is what the loops are doing: upgain=np.array([.1,.2,.3,.4]) u=[] up=1 for x in upgain: u1=10*up+x u.append(u1) up=u1 producing: [10.1, 101.2, 1012.3, 10123.4] np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't off hand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or other c interface). https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x. To use frompyfunc, define: def foo(x,y):return 10*x+y The loop application (above) would be def loopfoo(upgain,u,u1): for x in upgain: u1=foo(u1,x) u.append(u1) return u The 'vectorized' version would be: vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out vfoo.accumulate(upgain,dtype=object).astype(float) The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155 In [1195]: loopfoo([1,.1,.2,.3,.4],[],0) Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4] In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object) Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object) For this small list, loopfoo is faster (3µs v 21µs) For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster: In [1199]: timeit loopfoo(biggain,[],0) 1000 loops, best of 3: 281 µs per loop In [1200]: timeit vfoo.accumulate(biggain,dtype=object) 10000 loops, best of 3: 57.4 µs per loop For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
Python Custom Zipf Number Generator Performing Poorly
I needed a custom Zipf-like number generator because numpy.random.zipf function doesn't achieve what I need. Firstly, its alpha must be greater than 1.0 and I need an alpha of 0.5. Secondly, its cardinality is directly related to the sample size and I need to make more samples than the cardinality, e.g. make a list of 1000 elements from a Zipfian distribution of only 6 unique values. #stanga posted a great solution to this. import random import bisect import math class ZipfGenerator: def __init__(self, n, alpha): # Calculate Zeta values from 1 to n: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) # Store the translation map: self.distMap = [x / zeta[-1] for x in zeta] def next(self): # Take a uniform 0-1 pseudo-random value: u = random.random() # Translate the Zipf variable: return bisect.bisect(self.distMap, u) - 1 The alpha can be less than 1.0 and the sampling can be infinite for a fixed cardinality n. The problem is that it runs too slow. # Calculate Zeta values from 1 to n: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) These two lines are the culprits. When I choose n=50000 I can generate my list in ~10 seconds. I need to execute this when n=5000000 but it's not feasible. I don't fully understand why this is performing so slow because (I think) it has linear complexity and the floating point operations seem simple. I am using Python 2.6.6 on a good server. Is there an optimization I can make or a different solution altogether that meet my requirements? EDIT: I'm updating my question with a possible solution using modifications recommended by #ev-br . I've simplified it as a subroutine that returns the entire list. #ev-br was correct to suggest changing bisect for searchssorted as the former proved to be a bottleneck as well. def randZipf(n, alpha, numSamples): # Calculate Zeta values from 1 to n: tmp = numpy.power( numpy.arange(1, n+1), -alpha ) zeta = numpy.r_[0.0, numpy.cumsum(tmp)] # Store the translation map: distMap = [x / zeta[-1] for x in zeta] # Generate an array of uniform 0-1 pseudo-random values: u = numpy.random.random(numSamples) # bisect them with distMap v = numpy.searchsorted(distMap, u) samples = [t-1 for t in v] return samples
Let me take a small example first In [1]: import numpy as np In [2]: import math In [3]: alpha = 0.1 In [4]: n = 5 In [5]: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] In [6]: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) In [7]: tmp Out[7]: [1.0, 0.9330329915368074, 0.8959584598407623, 0.8705505632961241, 0.8513399225207846] In [8]: zeta Out[8]: [0, 1.0, 1.9330329915368074, 2.82899145137757, 3.699542014673694, 4.550881937194479] Now, let's try to vectorize it, starting from innermost operations. The reduce call is essentially a cumulative sum: In [9]: np.cumsum(tmp) Out[9]: array([ 1. , 1.93303299, 2.82899145, 3.69954201, 4.55088194]) You want a leading zero, so let's prepend it: In [11]: np.r_[0., np.cumsum(tmp)] Out[11]: array([ 0. , 1. , 1.93303299, 2.82899145, 3.69954201, 4.55088194]) Your tmp array can be constructed in one go as well: In [12]: tmp_vec = np.power(np.arange(1, n+1) , -alpha) In [13]: tmp_vec Out[13]: array([ 1. , 0.93303299, 0.89595846, 0.87055056, 0.85133992]) Now, quick-and-dirty timings In [14]: %%timeit ....: n = 1000 ....: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] ....: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) ....: 100 loops, best of 3: 3.16 ms per loop In [15]: %%timeit ....: n = 1000 ....: tmp_vec = np.power(np.arange(1, n+1) , -alpha) ....: zeta_vec = np.r_[0., np.cumsum(tmp)] ....: 10000 loops, best of 3: 101 µs per loop Now, it gets better with increasing n: In [18]: %%timeit n = 50000 tmp_vec = np.power(np.arange(1, n+1) , -alpha) zeta_vec = np.r_[0, np.cumsum(tmp)] ....: 100 loops, best of 3: 3.26 ms per loop As compared to In [19]: %%timeit n = 50000 tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) ....: 1 loops, best of 3: 7.01 s per loop Down the line, the call to bisect can be replaced by np.searchsorted. EDIT: A couple of comments which are not directly relevant to the original question, and are rather based on my guesses of what can trip you down the line: a random generator should accept a seed. You can rely on numpy's global np.random.seed, but better make it an explicit argument defaulting to None (meaning do not seed it.) samples = [t-1 for t in v] is not needed, just return v-1. best avoid mixing camelCase and pep8_lower_case_with_underscores. note that this is very similar to what scipy.stats.rv_discrete is doing. If you only need sampling, you're fine. If you need a full-fledged distribution, you may look into using it.
Breaking early when computing cumulative products or sums in numpy
Say I have a range r=numpy.array(range(1, 6)) and I am calculating the cumulative sum using numpy.cumsum(r). But instead of returning [1, 3, 6, 10, 15] I would like it to return [1, 3, 6] because of the condition that the cumulative result must be less than 10. If the array is very large, I would like the cumulative sum to break out before it starts calculating values that are redundant and will be thrown away later. Of course, I am trivializing everything here for the sake of the question. Is it possible to break out of cumsum or cumprod early based on a condition?
I don't think this is possible with any function in numpy, since in most cases these are meant for vectorized computations on fixed-length arrays. One obvious way to do what you want is to break out of a standard for-loop in Python (as I assume you know):

def limited_cumsum(x, limit):
    y = []
    sm = 0
    for item in x:
        sm += item
        if sm > limit:
            return y
        y.append(sm)
    return y

But this would obviously be an order of magnitude slower than numpy's cumsum. Since you probably need some very specialized function, the chances are low of finding the exact function you need in numpy. You should probably have a look at Cython, which allows you to implement custom functions that are as flexible as a Python function (and using a syntax that is almost Python), with a speed close to that of C.
Depending on the size of the array you are computing the cumulative sum for, and how quickly you expect the target value to be reached, it may be quicker to calculate the cumulative sum in steps.

import numpy as np

size = 1000000
target = size

def stepped_cumsum():
    arr = np.arange(size)
    out = np.empty(len(arr), dtype=int)
    step = 1000
    last_value = 0
    for i in range(0, len(arr), step):
        np.cumsum(arr[i:i+step], out=out[i:i+step])
        out[i:i+step] += last_value
        last_value = out[i+step-1]
        if last_value >= target:
            break
    else:
        return out
    greater_than_target_index = i + (out[i:i+step] >= target).argmax()
    # .copy() required so rest of backing array can be freed
    return out[:greater_than_target_index].copy()

def normal_cumsum():
    arr = np.arange(size)
    out = np.cumsum(arr)
    return out

stepped_result = stepped_cumsum()
normal_result = normal_cumsum()
assert (stepped_result < target).all()
assert (stepped_result == normal_result[:len(stepped_result)]).all()

Results:

In [60]: %timeit cumsum.stepped_cumsum()
1000 loops, best of 3: 1.22 ms per loop

In [61]: %timeit cumsum.normal_cumsum()
100 loops, best of 3: 3.69 ms per loop