Iterating over numpy arange changes the values - python

I am using a numpy arange.
[In] test = np.arange(0.01, 0.2, 0.02)
[In] test
[Out] array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19])
But then, if I iterate over this array, it iterates over slightly smaller values.
[In] for t in test:
.... print(t)
[Out]
0.01
0.03
0.049999999999999996
0.06999999999999999
0.08999999999999998
0.10999999999999997
0.12999999999999998
0.15
0.16999999999999998
0.18999999999999997
Why is this happening?
To avoid this, I have been rounding the values, but is that the best way to solve the problem?
for t in test:
    print(round(t, 2))

I think the nature of floating-point numbers, mentioned in the comments, is the issue.
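A quick way to see this: numpy rounds the array's printed representation, but each element is a plain float64, and 0.05 has no exact binary representation.
import numpy as np

test = np.arange(0.01, 0.2, 0.02)
print(test)              # rounded for display: [0.01 0.03 0.05 ... 0.19]
print(test[2])           # the stored value: 0.049999999999999996
print(test[2] == 0.05)   # False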
If you are still not comfortable leaving it that way, I suggest that you multiply your numbers by 100 and work with integers:
test = np.arange(1, 20, 2)
print(test)
for t in test:
    print(t / 100)
This gives me the following output:
[ 1 3 5 7 9 11 13 15 17 19]
0.01
0.03
0.05
0.07
0.09
0.11
0.13
0.15
0.17
0.19
Alternatively you can also try the following:
test = np.arange(1, 20, 2) / 100

Did you try:
test = np.arange(0.01, 0.2, 0.02, dtype=np.float32)

Related

How to filter for rows with close values across columns

I have columns of probabilities in a pandas dataframe as an output from multiclass machine learning.
I am looking to filter for rows where the model produced very close probabilities between the classes, and ideally I only care about values that are close to the highest value in that row, but I'm not sure where to start.
For example my data looks like this:
ID class1 class2 class3 class4 class5
row1 0.97 0.2 0.4 0.3 0.2
row2 0.97 0.96 0.4 0.3 0.2
row3 0.7 0.5 0.3 0.4 0.5
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
row6 0.1 0.11 0.3 0.9 0.2
I'd like to filter for rows where at least two probability columns have values that are close to at least one other probability column in that row (e.g., within 0.05). So an example output would filter to:
ID class1 class2 class3 class4 class5
row2 0.97 0.96 0.4 0.3 0.2
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
I don't mind if a filter includes row6, as it also meets the main <0.05 requirement, but ideally I'd prefer to ignore it too, since its small difference isn't with the largest probability.
What can I do to develop a filter like this?
Example data:
Edit: I have increased the size of my example data, as I do not want pairs specifically but any and all rows where two or more probability columns within that row have close values.
d = {'ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
'class1': [0.97, 0.97, 0.7, 0.97, 0.1, 0.1],
'class2': [0.2, 0.96, 0.5, 0.98, 0.2, 0.11],
'class3': [0.4, 0.4, 0.3, 0.2, 0.3, 0.3],
'class4': [0.3, 0.3, 0.4, 0.3, 0.78, 0.9],
'class5': [0.2, 0.2, 0.5, 0.2, 0.8, 0.2]}
df = pd.DataFrame(data=d)
Here is an example using numpy and itertools.combinations to get the pairs of similar rows with at least N values matching within 0.05:
from itertools import combinations
import numpy as np
df2 = df.set_index('ID')
N = 2
out = [(a, b) for a, b in combinations(df2.index, r=2)
       if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N]
Output:
[('row1', 'row2'), ('row1', 'row4'), ('row2', 'row4')]
Follow-up:
My real data is 10,000 rows and I want to filter out all rows that have more than one column of probabilities that are close to each other. Is there a way to do this without specifying pairs?
from itertools import combinations
N = 2
df2 = df.set_index('ID')
keep = set()
seen = set()
for a, b in combinations(df2.index, r=2):
    if {a, b}.issubset(seen):
        continue
    if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N:
        keep.update({a, b})
    seen.update({a, b})
print(keep)
# {'row1', 'row2', 'row4'}
You can do that as follows:
Transpose the dataframe to get each sample as a column and the class probabilities as rows.
Then we only need to check the minimal requirement: whether the difference between the two largest values is less than or equal to 0.05.
df = pd.DataFrame(data=d).set_index("ID").T
result = [col for col in df.columns if np.isclose(*df[col].nlargest(2), atol=0.05)]
Output:
['row2', 'row4', 'row5']
Dataframe after the transpose:
ID row1 row2 row3 row4 row5 row6
class1 0.97 0.97 0.7 0.97 0.10 0.10
class2 0.20 0.96 0.5 0.98 0.20 0.11
class3 0.40 0.40 0.3 0.20 0.30 0.30
class4 0.30 0.30 0.4 0.30 0.78 0.90
class5 0.20 0.20 0.5 0.20 0.80 0.20
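For larger data (the follow-up mentions 10,000 rows), a fully vectorized variant of the same top-two check might look like the sketch below; it rebuilds the frame from the question's dict d and avoids the per-column Python loop.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=d).set_index("ID")

top2 = np.sort(df.to_numpy(), axis=1)[:, -2:]   # two largest probabilities per row
mask = (top2[:, 1] - top2[:, 0]) <= 0.05        # is the runner-up close to the maximum?
print(df.index[mask].tolist())                  # ['row2', 'row4', 'row5']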

convert an array of integers to an array of floats

Given any integer n, convert it to a float 0.n.
#input
[11 22 5 1 68 17 5 4 558]
#output
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Is there a way in numpy to do the following?
import numpy as np
int_=np.array([11,22,5,1,68,17,5,4,558])
float_=np.array([])
for i in range(len(int_)):
float_=np.append(float_,int_[i]/10**(len(str(int_[i]))))
print(float_)
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
For now, the code I have is slow (it takes a lot of time for very large arrays).
One way using numpy.log10:
arr = np.array([11,22,5,1,68,17,5,4,558])
new_arr = arr/np.power(10, np.log10(arr).astype(int) + 1)
print(new_arr)
Output:
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Explanation:
numpy.log10(arr).astype(int) + 1 will give you the number of digits
numpy.power(10, {above}) will give you the required denominator
You can also try a vectorized version of your code:
def chg_to_float(val):
    return val / 10**len(str(val))

v_chg_to_float = np.vectorize(chg_to_float)
v_chg_to_float(arr)
Since you're only inserting a 0. in front of each input integer, you can simply cast them to strings, add the 0., and then cast them to floats.
>>> input_list = [11, 22, 5, 1, 68, 17, 5, 4, 558]
>>> [float(f'0.{str(item)}') for item in input_list]
[0.11, 0.22, 0.5, 0.1, 0.68, 0.17, 0.5, 0.4, 0.558]
Performance could be enhanced by using a generator expression instead of a list comprehension.
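If a numpy array is needed (as in the question), the comprehension can feed np.array directly; a small sketch reusing the question's int_ array:
import numpy as np

int_ = np.array([11, 22, 5, 1, 68, 17, 5, 4, 558])
float_ = np.array([float(f'0.{i}') for i in int_])
print(float_)   # [0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]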

Compute percentile rank relative to a given population

I have "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v < x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
I think pd.cut can do that
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
You can use quantile:
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
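A small illustration of the pct=True option (a sketch); note that ranks are computed within the series itself, so for the original question the test points would still need to be ranked together with the population, as in the concat approach above.
import pandas as pd

s = pd.Series([10, 20, 30, 40])
print(s.rank(pct=True).tolist())   # [0.25, 0.5, 0.75, 1.0]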

Pandas Rank: unexpected behavior for method = 'dense' and pct = True

Suppose I have a series with duplicates:
import pandas as pd
ts = pd.Series([1,2,3,4] * 5)
and I want to calculate percentile ranks of it.
It is always a bit tricky to calculate ranks with multiple matches, but I think I am getting unexpected results:
ts.rank(method = 'dense', pct = True)
Out[112]:
0 0.05
1 0.10
2 0.15
3 0.20
4 0.05
5 0.10
6 0.15
7 0.20
8 0.05
9 0.10
10 0.15
11 0.20
12 0.05
13 0.10
14 0.15
15 0.20
16 0.05
17 0.10
18 0.15
19 0.20
dtype: float64
So I am getting as percentiles [0.05, 0.1, 0.15, 0.2], where I guess the expected output might be [0.25, 0.5, 0.75, 1], i.e. multiplying the output by the number of repeated values.
My guess here is that, in order to calculate percentile ranks, pd.rank is simply dividing by the number of observations, which is wrong for method = 'dense'.
So my questions are:
Do you agree the output is unexpected/wrong?
How can I obtain my expected output, i.e. assign to each duplicate the percentile rank I would get if I didn't have any duplicates in the series?
I have reported the issue on GitHub: https://github.com/pandas-dev/pandas/pull/15639
All that pct=True does is divide by the number of observations, which gives unexpected behavior for method='dense', so this is considered a bug to be fixed in the next major release.
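Until that fix is in place, a common workaround (a sketch, not part of the original answer) is to divide the dense ranks by the number of distinct values yourself:
import pandas as pd

ts = pd.Series([1, 2, 3, 4] * 5)

# dense ranks are 1..4; dividing by the number of distinct values
# gives the expected 0.25, 0.5, 0.75, 1.0 for each group of duplicates
pct_dense = ts.rank(method='dense') / ts.nunique()
print(pct_dense.unique())   # [0.25 0.5  0.75 1.  ]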

Read rows of different sizes into columns in python

I have an input file that looks sort of like this:
0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4
So it starts with some number of columns, and ends with n*2 columns (n is the last line).
I can get the number of rows, say # rows = i. I can also get n.
I want to read this file into a python 2d array (not a list), e.g. Array[i][n*2]. I realize I may need to fill the empty columns with zeros so that it can be read simply as
Array = numpy.loadtxt("data.txt")
But I don't know how to proceed.
Thanks
I don't think any of the built-in missing-value stuff is going to help here, because space-separated columns make it ambiguous which values are missing. (Not ambiguous in your context—you know all the missing columns are on the right—but a general-purpose parser won't.) Hopefully I'm wrong and someone else will provide a simpler answer, but otherwise…
One option is to extend the lines one by one on the fly and feed them into an array. If memory isn't an issue, you can do this with a list comprehension over the row:
import numpy as np

def readrow(row, cols):
    a = np.fromstring(row, sep=' ')
    a.resize((cols,))
    return a

with open(file_path, 'rb') as f:
    a = np.array([readrow(row, 2*n) for row in f])
If you can't afford to waste the memory to create a temporary list of i 1D arrays, you may need to use something like fromiter to generate a 1D array, then reshape it:
import itertools

a = np.fromiter(itertools.chain.from_iterable(
        readrow(row, n*2) for row in f), dtype=float).reshape((-1, n*2))
(Although at this point, using numpy to parse the rows instead of csv or just str.split seems like it might be a bit silly.)
If you want to pad the short lines with 0.0s, here is one way: pad with a full set of 0.0s, then slice only the leading significant part:
data = """0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4""".splitlines()
maxcols = int(data[-1])*2
emptyvalue = 0.0
pad = [emptyvalue]*maxcols
for line in data[:-1]:
    # get the input data values, converted from strings to floats
    vals = list(map(float, line.split()))
    # pad the input with default values, then only take the first maxcols values
    vals = (vals + pad)[:maxcols]
    # show our work in a nice table
    print("[" + ','.join("%s%.2f" % (' ' if v >= 0 else '', v) for v in vals) + "]")
prints
[ 0.10, 0.30, 0.40, 0.30, 0.00, 0.00, 0.00, 0.00]
[ 0.20, 2.00, 1.20,-0.20, 0.00, 0.00, 0.00, 0.00]
[ 0.10,-1.22, 0.12, 9.20, 0.20, 0.20, 0.00, 0.00]
[ 0.30,-1.42, 0.20, 6.20, 0.90, 0.88, 0.00, 0.00]
[ 0.30,-1.42, 0.12, 1.10, 0.10, 0.88, 0.06, 0.14]
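To get the 2-D array the question asks for, the padded rows can be stacked into numpy directly; a brief sketch reusing data, maxcols, and pad from the snippet above:
import numpy as np

rows = [(list(map(float, line.split())) + pad)[:maxcols] for line in data[:-1]]
arr = np.array(rows)
print(arr.shape)   # (5, 8)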
