I have an input file that looks sort of like this:
0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4
So it starts with some number of columns and ends with n*2 columns, where n is given on the last line.
I can get the number of rows (call it i), and I can also get n.
I want to read this file into a Python 2D array (not a list), e.g. Array[i][n*2]. I realize I may need to fill the empty columns with zeros so that it can be read simply as
Array = numpy.loadtxt("data.txt")
But I don't know how to proceed.
Thanks
I don't think any of the built-in missing-value stuff is going to help here, because space-separated columns make it ambiguous which values are missing. (Not ambiguous in your context—you know all the missing columns are on the right—but a general-purpose parser won't.) Hopefully I'm wrong and someone else will provide a simpler answer, but otherwise…
One option is to extend the lines one by one on the fly and feed them into an array. If memory isn't an issue, you can do this with a list comprehension over the rows:
import numpy as np

def readrow(row, cols):
    a = np.fromstring(row, sep=' ')
    a.resize((cols,))   # zero-fills up to cols values
    return a

with open(file_path, 'rb') as f:
    a = np.array([readrow(row, 2*n) for row in f])
If you can't afford to waste the memory to create a temporary list of i 1D arrays, you may need to use something like fromiter to generate a 1D array, then reshape it:
import itertools

a = np.fromiter(itertools.chain.from_iterable(
        readrow(row, n*2) for row in f), dtype=float).reshape((-1, n*2))
(Although at this point, using numpy to parse the rows instead of csv or just str.split seems like it might be a bit silly.)
If you want to pad the short lines with 0.0's, here is one way: pad each row with a full set of 0.0's, then keep only the first maxcols values:
data = """0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4""".splitlines()
maxcols = int(data[-1])*2
emptyvalue = 0.0
pad = [emptyvalue]*maxcols
for line in data[:-1]:
    # get the input data values, converted from strings to floats
    vals = list(map(float, line.split()))
    # pad the input with default values, then only take the first maxcols values
    vals = (vals + pad)[:maxcols]
    # show our work in a nice table
    print("[" + ','.join("%s%.2f" % (' ' if v >= 0 else '', v) for v in vals) + "]")
prints
[ 0.10, 0.30, 0.40, 0.30, 0.00, 0.00, 0.00, 0.00]
[ 0.20, 2.00, 1.20,-0.20, 0.00, 0.00, 0.00, 0.00]
[ 0.10,-1.22, 0.12, 9.20, 0.20, 0.20, 0.00, 0.00]
[ 0.30,-1.42, 0.20, 6.20, 0.90, 0.88, 0.00, 0.00]
[ 0.30,-1.42, 0.12, 1.10, 0.10, 0.88, 0.06, 0.14]
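And if you want the actual i-by-(n*2) array the question asks for, rather than just the printed table, here is a small sketch building on the same padding idea (it reuses data, pad and maxcols from above):
import numpy as np

rows = [(list(map(float, line.split())) + pad)[:maxcols] for line in data[:-1]]
arr = np.array(rows)
print(arr.shape)   # (5, 8), i.e. Array[i][n*2]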
Related
I have columns of probabilities in a pandas dataframe as an output from multiclass machine learning.
I am looking to filter to the rows where the model produced very close probabilities between the classes, and ideally I only care about values that are close to the highest probability in that row, but I'm not sure where to start.
For example my data looks like this:
ID class1 class2 class3 class4 class5
row1 0.97 0.2 0.4 0.3 0.2
row2 0.97 0.96 0.4 0.3 0.2
row3 0.7 0.5 0.3 0.4 0.5
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
row6 0.1 0.11 0.3 0.9 0.2
I'd like to filter for rows where two or more probability columns are close to at least one other probability column in that row (e.g., within 0.05 of each other). So an example output would filter to:
ID class1 class2 class3 class4 class5
row2 0.97 0.96 0.4 0.3 0.2
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
I don't mind if the filter also includes row6, since it meets the main within-0.05 requirement, but ideally I'd prefer to exclude it because its close pair doesn't involve the largest probability in the row.
What can I do to develop a filter like this?
Example data:
Edit: I have increased the size of my example data, as I don't want specific pairs but rather any and all rows in which two or more of the probability columns have close values.
import pandas as pd

d = {'ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
     'class1': [0.97, 0.97, 0.7, 0.97, 0.1, 0.1],
     'class2': [0.2, 0.96, 0.5, 0.98, 0.2, 0.11],
     'class3': [0.4, 0.4, 0.3, 0.2, 0.3, 0.3],
     'class4': [0.3, 0.3, 0.4, 0.3, 0.78, 0.9],
     'class5': [0.2, 0.2, 0.5, 0.2, 0.8, 0.2]}
df = pd.DataFrame(data=d)
Here is an example using numpy and itertools.combinations to get the pairs of similar rows that have at least N values within 0.05 of each other:
from itertools import combinations
import numpy as np
df2 = df.set_index('ID')
N = 2
out = [(a, b) for a, b in combinations(df2.index, r=2)
       if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N]
Output:
[('row1', 'row2'), ('row1', 'row4'), ('row2', 'row4')]
follow-up
My real data is 10,000 rows, and I want to filter all rows that have more than one column of probabilities close to each other. Is there a way to do this without specifying pairs?
from itertools import combinations

N = 2
df2 = df.set_index('ID')

keep = set()
seen = set()
for a, b in combinations(df2.index, r=2):
    # skip pairs where both rows have already been examined
    if {a, b}.issubset(seen):
        continue
    if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N:
        keep.update({a, b})
    seen.update({a, b})

print(keep)
# {'row1', 'row2', 'row4'}
You can do that as follows: transpose the dataframe so that each sample becomes a column and the class probabilities become the rows.
Then we only need to check the minimal requirement: whether the difference between the two largest values is less than or equal to 0.05.
df = pd.DataFrame(data=d).set_index("ID").T
result = [col for col in df.columns if np.isclose(*df[col].nlargest(2), atol=0.05)]
Output:
['row2', 'row4', 'row5']
Dataframe after the transpose:
ID row1 row2 row3 row4 row5 row6
class1 0.97 0.97 0.7 0.97 0.10 0.10
class2 0.20 0.96 0.5 0.98 0.20 0.11
class3 0.40 0.40 0.3 0.20 0.30 0.30
class4 0.30 0.30 0.4 0.30 0.78 0.90
class5 0.20 0.20 0.5 0.20 0.80 0.20
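For the 10,000-row case mentioned in the follow-up, the same top-two check can also be done fully vectorized. This is a sketch of my own (not part of the answer above), and it assumes every non-ID column is a probability column:
import numpy as np
import pandas as pd

df_orig = pd.DataFrame(data=d).set_index('ID')       # the untransposed frame built from d above
top2 = np.sort(df_orig.to_numpy(), axis=1)[:, -2:]   # the two largest probabilities per row
mask = (top2[:, 1] - top2[:, 0]) <= 0.05             # are they within 0.05 of each other?
print(df_orig.index[mask].tolist())                  # ['row2', 'row4', 'row5']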
Given any integer n, convert it to a float 0.n
#input
[11 22 5 1 68 17 5 4 558]
#output
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Is there a way in numpy to do the following?
import numpy as np

int_ = np.array([11, 22, 5, 1, 68, 17, 5, 4, 558])
float_ = np.array([])
for i in range(len(int_)):
    float_ = np.append(float_, int_[i] / 10**(len(str(int_[i]))))
print(float_)
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
For now the code I have is slow (it takes a lot of time for very large arrays).
One way using numpy.log10:
arr = np.array([11,22,5,1,68,17,5,4,558])
new_arr = arr/np.power(10, np.log10(arr).astype(int) + 1)
print(new_arr)
Output:
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Explanation:
numpy.log10(arr).astype(int) + 1 will give you the number of digits
numpy.power(10, {above}) will give you the required denominator
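For example, walking a single value (558) through those two steps:
import numpy as np

x = np.array([558])
digits = np.log10(x).astype(int) + 1   # log10(558) ~ 2.75 -> 2 -> 3 digits
denom = np.power(10, digits)           # 10**3 = 1000
print(x / denom)                       # [0.558]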
You can also try a vectorized version of your code using np.vectorize (note that np.vectorize is essentially a Python-level loop under the hood, so don't expect a big speed-up):
def chg_to_float(val):
    return val / 10**len(str(val))

v_chg_to_float = np.vectorize(chg_to_float)
print(v_chg_to_float(arr))
# [0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Since you're only inserting a 0. in front of each input integer, you can simply cast them to strings, add the 0., and then cast them to floats.
>>> input_list = [11, 22, 5, 1, 68, 17, 5, 4, 558]
>>> [float(f'0.{str(item)}') for item in input_list]
[0.11, 0.22, 0.5, 0.1, 0.68, 0.17, 0.5, 0.4, 0.558]
Performance and memory use could be improved by using a generator expression instead of a list comprehension, for example by streaming it into np.fromiter if you need a NumPy array rather than a list.
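A minimal sketch of that idea (assuming the inputs are non-negative integers):
import numpy as np

input_list = [11, 22, 5, 1, 68, 17, 5, 4, 558]
# stream the generator straight into an array instead of materializing a list first
arr = np.fromiter((float(f'0.{item}') for item in input_list), dtype=float)
print(arr)
# [0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]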
Using Pandas, how can I efficiently add a new column that is true/false if the value in one column (x) is between the values in two other columns (low and high)?
The np.select approach from here works perfectly, but I "feel" like there should be a one-liner way to do this.
Using Python 3.7
import numpy as np
import pandas as pd

fid = [0, 1, 2, 3, 4]
x = [0.18, 0.07, 0.11, 0.3, 0.33]
low = [0.1, 0.1, 0.1, 0.1, 0.1]
high = [0.2, 0.2, 0.2, 0.2, 0.2]
test = pd.DataFrame(data=zip(fid, x, low, high), columns=["fid", "x", "low", "high"])
conditions = [(test["x"] >= test["low"]) & (test["x"] <= test["high"])]
labels = ["True"]
test["between"] = np.select(conditions, labels, default="False")
display(test)
As mentioned by @Brebdan, you can use the built-in Series.between:
test["between"] = test["x"].between(test["low"], test["high"])
output:
fid x low high between
0 0 0.18 0.1 0.2 True
1 1 0.07 0.1 0.2 False
2 2 0.11 0.1 0.2 True
3 3 0.30 0.1 0.2 False
4 4 0.33 0.1 0.2 False
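Note that between includes both endpoints by default, which matches the >= and <= comparisons in the np.select version. If you need strict bounds, newer pandas versions (1.3+) accept an inclusive argument, e.g.:
test["between"] = test["x"].between(test["low"], test["high"], inclusive="neither")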
I am using a numpy arange.
[In] test = np.arange(0.01, 0.2, 0.02)
[In] test
[Out] array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19])
But then, if I iterate over this array, it iterates over slightly smaller values.
[In] for t in test:
.... print(t)
[Out]
0.01
0.03
0.049999999999999996
0.06999999999999999
0.08999999999999998
0.10999999999999997
0.12999999999999998
0.15
0.16999999999999998
0.18999999999999997
Why is this happening?
To avoid this, I have been rounding the values, but is this the best way to solve the problem?
for t in test:
    print(round(t, 2))
I think the issue is the nature of floating point numbers mentioned in the comments: most decimal values (like 0.02) have no exact binary representation, so the computed steps drift slightly.
If you would still rather not leave it that way, I suggest multiplying your numbers by 100 and working with integers:
test = np.arange(1, 20, 2)
print(test)
for t in test:
    print(t / 100)
This gives me the following output:
[ 1 3 5 7 9 11 13 15 17 19]
0.01
0.03
0.05
0.07
0.09
0.11
0.13
0.15
0.17
0.19
Alternatively you can also try the following:
test = np.arange(1, 20, 2) / 100
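To see the underlying floating point behaviour directly, here is a tiny illustration in plain Python (not specific to arange):
print(0.1 + 0.2)         # 0.30000000000000004 -- 0.1 and 0.2 have no exact binary form
print(f"{0.02:.20f}")    # 0.02000000000000000042 -- the 0.02 step is itself inexact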
Did you try:
test = np.arange(0.01, 0.2, 0.02, dtype=np.float32)
(Note that this mainly changes how the values are displayed; float32 is still binary floating point, so it is not exact either.)
I have a pandas dataframe that contains the results of computation and need to:
take the maximum value of a column and for that value find the maximum value of another column
take the minimum value of a column and for that value find the maximum value of another column
Is there a more efficient way to do it?
Setup
from collections import namedtuple
import pandas as pd

metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 = [metrictuple(0.1, 0.4, 0.04), metrictuple(0.2, 0.4, 0.04), metrictuple(0.4, 0.4, 0.1),
      metrictuple(0.7, 0.2, 0.3), metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)
# df
# prob m1 m2
#0 0.1 0.4 0.04
#1 0.2 0.4 0.04
#2 0.4 0.4 0.10
#3 0.7 0.2 0.30
#4 1.0 0.1 0.50
tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.4, 0.4)
tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.2, 0.04)
Pandas isn't ideal for this kind of lookup-heavy numerical work, because there is significant overhead in slicing and selecting data, in this example via df.loc.
The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays.
Below I've defined some helper functions which make the code more readable. Note that numpy indexing uses row and column numbers starting from 0.
arr = df.values

def arr_max(x, col):
    return x[x[:, col] == x[:, col].max()]

def arr_min(x, col):
    return x[x[:, col] == x[:, col].min()]

res1 = arr_max(arr_max(arr, 1), 0)[:, :2]      # array([[ 0.4,  0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:, [0, 2]]  # array([[ 0.2 ,  0.04]])
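For comparison, if you would rather stay in pandas and accept the df.loc overhead, the same two lookups could be written as one-liners. This is my own sketch using DataFrame.nlargest, not part of the answer above:
res1 = df.loc[df.m1 == df.m1.max()].nlargest(1, 'prob')[['prob', 'm1']]   # prob 0.4, m1 0.4
res2 = df.loc[df.m2 == df.m2.min()].nlargest(1, 'prob')[['prob', 'm2']]   # prob 0.2, m2 0.04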