I don't know if there is a name for this algorithm, but basically for a given y, I want to find the maximum x such that:
import numpy as np
np_array = np.random.rand(1000, 1)
np.sum(np_array[np_array > x] - x) >= y
Of course, one search algorithm would be to take the top value n_1 and reduce it to the second largest value, n_2. Stop if n_1 - n_2 > y; else reduce both n_1 and n_2 to n_3, and stop if (n_1 - n_3) + (n_2 - n_3) > y, and so on.
But I feel there must be an algorithm that generates a sequence {x_k} converging to the true value.
Let's use your example from the comments:
a = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])
y = 0.5
First let's sort the data in descending order:
s = np.sort(a)[::-1]  # 0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1
Let's take a look at how the choice of x affects the possible values of the sum r = np.sum(np_array[np_array > x] - x):
If x ≥ 0.6, then no elements are selected and r = 0.0
If 0.6 > x ≥ 0.5, then r = 0.6 - x ⇒ 0.0 < r ≤ 0.1 (where 0.1 = 0.6 - 0.5 × 1)
If 0.5 > x ≥ 0.4, then r = 0.6 - x + 0.5 - x = 1.1 - 2 * x ⇒ 0.1 < r ≤ 0.3 (where 0.3 = 1.1 - 0.4 × 2)
If 0.4 > x ≥ 0.3, then r = 0.6 - x + 0.5 - x + 0.4 - x = 1.5 - 3 * x ⇒ 0.3 < r ≤ 0.6 (where 0.6 = 1.5 - 0.3 × 3)
If 0.3 > x ≥ 0.2, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x = 1.8 - 4 * x ⇒ 0.6 < r ≤ 1.0 (where 1.0 = 1.8 - 0.2 × 4)
If 0.2 > x ≥ 0.1, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x + 0.2 - x + 0.2 - x = 2.2 - 6 * x ⇒ 1.0 < r ≤ 1.6 (where 1.6 = 2.2 - 0.1 × 6)
If 0.1 > x, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x + 0.2 - x + 0.2 - x + 0.1 - x + 0.1 - x = 2.4 - 8 * x ⇒ 1.6 < r < ∞
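These piecewise formulas can be spot-checked directly against the definition of r (a quick sanity check; the sample x values are arbitrary):

```python
import numpy as np

a = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])

def r(x):
    # direct evaluation of np.sum(a[a > x] - x)
    return np.sum(a[a > x] - x)

print(r(0.70))  # x >= 0.6 bin: r = 0.0
print(r(0.45))  # 0.5 > x >= 0.4 bin: r = 1.1 - 2 * 0.45 ≈ 0.2
print(r(0.25))  # 0.3 > x >= 0.2 bin: r = 1.8 - 4 * 0.25 ≈ 0.8
```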
Taken together, the range of r is continuous for r > 0: as x decreases, r grows continuously, so every positive y falls into exactly one of the bins above (while y ≤ 0 never does). Duplicate elements affect the width of the r-range contributed by each value in a, but are otherwise nothing special. We can remove, but also account for, the duplicates by using np.unique instead of np.sort:
s, t = np.unique(a, return_counts=True)
s, t = s[::-1], t[::-1]
w = np.cumsum(t)
If your data can reasonably be expected not to contain duplicates, then use the sorted s shown in the beginning, and set t = np.ones(s.size, dtype=int) and therefore w = np.arange(s.size) + 1.
For s[i] > x ≥ s[i + 1], the bounds of r are given by c[i] - w[i] * s[i] < r ≤ c[i] - w[i] * s[i + 1], where
c = np.cumsum(s * t) # You can use just `np.cumsum(s)` if no duplicates
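As a quick sanity check, these quantities line up with the bin table above for the example array:

```python
import numpy as np

a = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])

s, t = np.unique(a, return_counts=True)  # unique values (ascending) + counts
s, t = s[::-1], t[::-1]                  # flip to descending
w = np.cumsum(t)                         # number of elements > x in each bin
c = np.cumsum(s * t)                     # sum of those elements

print(w)          # [1 2 3 4 6 8]
print(c - w * s)  # lower bound of r in each bin: matches 0, 0.1, 0.3, 0.6, 1.0, 1.6
```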
So finding where y ends up is a matter of placing it between the correct bounds. This can be done with a binary search, e.g., np.searchsorted:
# Left bound. Sum is strictly greater than this
bounds = c - w * s
i = np.searchsorted(bounds[1:], y, 'right')
The first element of bounds is always 0.0, and the resulting index i will point to the upper bound. By truncating off the first element, we shift the result to the lower bound, and ignore the zero.
The solution is found by solving for the location of x in the selected bin:
y = c[i] - w[i] * x
So you have:
x = (c[i] - y) / w[i]
You can write a function:
def dm(a, y, duplicates=False):
    if duplicates:
        s, t = np.unique(a, return_counts=True)
        s, t = s[::-1], t[::-1]
        w = np.cumsum(t)
        c = np.cumsum(s * t)
    else:
        s = np.sort(a)[::-1]
        w = np.arange(s.size) + 1
        c = np.cumsum(s)
    i = np.searchsorted((c - w * s)[1:], y, 'right')
    x = (c[i] - y) / w[i]
    return x
This does not handle the case where y < 0, but it does allow you to enter many y values simultaneously, since searchsorted is pretty well vectorized.
Here is a usage sample:
>>> dm(a, 0.5, True)
0.3333333333333333
>>> dm(a, 0.6, True)
0.3
>>> dm(a, [0.1, 0.2, 0.3, 0.4, 0.5], True)
array([0.5       , 0.45      , 0.4       , 0.36666667, 0.33333333])
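Any returned x can be verified by plugging it back into the original expression; at the returned threshold the sum equals y exactly (a quick check, using the result of the first call above):

```python
import numpy as np

a = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])

x = 0.3333333333333333        # dm(a, 0.5, True)
print(np.sum(a[a > x] - x))   # ≈ 0.5, the requested y
```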
As for whether this algorithm has a name: I am not aware of any. Since I wrote this, I feel that "discrete madness" is an appropriate name. Slips off the tongue nicely too: "Ah yes, I computed the threshold using discrete madness".
This is an answer to the original question, where we find the maximum x s.t. np.sum(np_array[np_array > x]) >= y:
You can accomplish this with sorting and cumulative sum:
s = np.sort(np_array)[::-1]
c = np.cumsum(s)
i = np.argmax(c > y)
result = s[i]
s is the candidates for x in descending order. Comparing the cumulative sum c to y tells you exactly where the sum will exceed y. np.argmax returns the index of the first place that happens. The result is that index extracted from s.
This NumPy computation does more work than strictly necessary, because a loop could short-circuit as soon as the running sum exceeds y, without building a separate mask. The asymptotic complexity is the same, however, since the sort dominates. You could speed up the following loop with numba or cython:
s = np.sort(np_array)[::-1]
c = 0
for i in range(len(s)):
    c += s[i]
    if c > y:
        break
result = s[i]
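Alternatively, staying in NumPy: since c is non-decreasing for non-negative data, the first crossing can also be located with a binary search after the sort (a sketch, assuming all values are non-negative and the total sum actually exceeds y):

```python
import numpy as np

np_array = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])
y = 1.0

s = np.sort(np_array)[::-1]
c = np.cumsum(s)
# first index where the running sum strictly exceeds y
i = np.searchsorted(c, y, side='right')
result = s[i]
print(result)  # 0.5 (0.6 alone is not > 1.0, but 0.6 + 0.5 is)
```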
Python 3.9.5
The first big DataFrame contains points and the second big DataFrame contains square areas. The square areas are bounded by four straight lines that are parallel to the coordinate axes and are completely defined by a set of constraints: y_min, y_max, x_min, x_max. For example:
points = pd.DataFrame({'y':[0.5, 0.5, 1.5, 1.5], 'x':[0.5, 1.5, 1.5, 0.5]})
points
y x
0 0.5 0.5
1 0.5 1.5
2 1.5 1.5
3 1.5 0.5
square_areas = pd.DataFrame({'y_min':[0,1], 'y_max':[1,2], 'x_min':[0,1], 'x_max':[1,2]})
square_areas
y_min y_max x_min x_max
0 0 1 0 1
1 1 2 1 2
How can I get all points that don't belong to any square area, without iterating over the areas in a loop?
Needed Output:
y x
0 0.5 1.5
1 1.5 0.5
I'm not sure how to do this with a merge, but you can iterate over the square_areas dataframe and evaluate its conditions against the points dataframe.
I'm assuming you'll have more than two areas in practice, so this iterative approach should still work: each iteration only looks at points that have not already been matched by a prior square_areas row.
import numpy as np
import pandas as pd

points = pd.DataFrame({'y':[0.5, 0.5, 1.5, 1.5], 'x':[0.5, 1.5, 1.5, 0.5]})
print(points)
# assume everything is outside until it evaluates inside
points['outside'] = 'Y'
square_areas = pd.DataFrame({'y_min':[0,1], 'y_max':[1,2], 'x_min':[0,1], 'x_max':[1,2]})
print(square_areas)
for i in range(square_areas.shape[0]):
    ymin = square_areas.iloc[i]['y_min']
    ymax = square_areas.iloc[i]['y_max']
    xmin = square_areas.iloc[i]['x_min']
    xmax = square_areas.iloc[i]['x_max']
    mask = points['outside'] == 'Y'
    points.loc[mask, 'outside'] = np.where(
        points.loc[mask, 'x'].between(xmin, xmax)
        & points.loc[mask, 'y'].between(ymin, ymax),
        'N',
        points.loc[mask, 'outside'],
    )
points.loc[points['outside'] == 'Y']
Output
y x outside
1 0.50000 1.50000 Y
3 1.50000 0.50000 Y
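If you do want to avoid the Python-level loop entirely, one option is to broadcast a points-by-areas containment test with NumPy and keep the points that fall inside no area. A sketch (memory grows as n_points × n_areas, so this suits moderately sized inputs):

```python
import numpy as np
import pandas as pd

points = pd.DataFrame({'y': [0.5, 0.5, 1.5, 1.5], 'x': [0.5, 1.5, 1.5, 0.5]})
square_areas = pd.DataFrame({'y_min': [0, 1], 'y_max': [1, 2],
                             'x_min': [0, 1], 'x_max': [1, 2]})

# (n_points, n_areas) boolean matrix: does point i fall inside area j?
inside = (
    (points['x'].values[:, None] >= square_areas['x_min'].values)
    & (points['x'].values[:, None] <= square_areas['x_max'].values)
    & (points['y'].values[:, None] >= square_areas['y_min'].values)
    & (points['y'].values[:, None] <= square_areas['y_max'].values)
)
# keep points that are inside no area at all
outside = points[~inside.any(axis=1)]
print(outside)
```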
I have a real-valued numpy array of size (1000,). All values lie between 0 and 1, and I want to convert this to a categorical array. All values less than 0.25 should be assigned to category 0, values between 0.25 and 0.5 to category 1, 0.5 to 0.75 to category 2, and 0.75 to 1 to category 3. Logical indexing doesn't seem to work:
Y[Y < 0.25] = 0
Y[np.logical_and(Y >= 0.25, Y < 0.5)] = 1
Y[np.logical_and(Y >= 0.5, Y < 0.75)] = 2
Y[Y >= 0.75] = 3
Result:
for i in range(4):
    print(f"Y == {i}: {sum(Y == i)}")
Y == 0: 206
Y == 1: 0
Y == 2: 0
Y == 3: 794
What needs to be done instead?
The error is in your conversion logic, not in your indexing. The final statement:
Y[Y >= 0.75] = 3
converts not only the values in the range 0.75 to 1.00, but also the values previously assigned to classes 1 and 2, since the labels 1 and 2 are themselves ≥ 0.75.
You can reverse the assignment order, starting with class 3 and working down to class 0.
You can put an upper limit on the final class (e.g. also require Y <= 1.0), although you still have a boundary problem: data equal to 1.00 collides with the label already written for class 1.
Perhaps best would be to harness the regularity of your divisions, such as:
Y = (4 * Y).astype(int)  # int() fails on arrays; you still have a boundary problem at exactly 1.0
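A way to sidestep the overwrite problem entirely is to compute all categories in one pass with np.digitize, which never re-reads values it has already reassigned. A sketch using the bin edges from the question (the random data here is just a stand-in for your array):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random(1000)  # values in [0, 1)

# interior edges only: [0, 0.25) -> 0, [0.25, 0.5) -> 1,
# [0.5, 0.75) -> 2, [0.75, 1] -> 3
cats = np.digitize(Y, [0.25, 0.5, 0.75])
print(np.bincount(cats))  # roughly 250 per class for uniform data
```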
I am trying to solve an LP problem with two variables and two constraints in SciPy, where one constraint is an inequality and the other is an equality.
To convert the inequality constraint into an equality, I have added a slack variable called A.
Min(z) = 80x + 60y
Constraints:
0.2x + 0.32y <= 0.25
x + y = 1
x, y >= 0
I have replaced the inequality constraint with an equality by adding the extra variable A:
0.2x + 0.32y + A = 0.25
Min(z) = 80x + 60y + 0A
x + y + 0A = 1
from scipy.optimize import linprog
import numpy as np
z = np.array([80, 60, 0])
C = np.array([
    [0.2, 0.32, 1],
    [1, 1, 0]
])
b = np.array([0.25, 1])
x1 = (0, None)
x2 = (0, None)
sol = linprog(-z, A_eq = C, b_eq = b, bounds = (x1, x2), method='simplex')
However, I am getting an error message
Invalid input for linprog with method = 'simplex'. Length of bounds
is inconsistent with the length of c
How can I fix this?
The problem is that you do not provide bounds for A. If you e.g. run
linprog(-z, A_eq = C, b_eq = b, bounds = (x1, x2, (0, None)), method='simplex')
you will obtain:
con: array([0., 0.])
fun: -80.0
message: 'Optimization terminated successfully.'
nit: 3
slack: array([], dtype=float64)
status: 0
success: True
x: array([1. , 0. , 0.05])
As you can see, the constraints are met:
0.2 * 1 + 0.32 * 0.0 + 0.05 = 0.25 # (0.2x + 0.32y + A = 0.25)
and also
1 + 0 + 0 = 1 # (x + y + 0A = 1)
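As a side note, linprog accepts inequality constraints directly through A_ub/b_ub, so the slack variable A doesn't have to be added by hand. A sketch keeping the original -z objective (the solution matches the one above, with the slack handled internally):

```python
from scipy.optimize import linprog
import numpy as np

z = np.array([80, 60])
A_ub = np.array([[0.2, 0.32]])  # 0.2x + 0.32y <= 0.25
b_ub = np.array([0.25])
A_eq = np.array([[1, 1]])       # x + y = 1
b_eq = np.array([1])

sol = linprog(-z, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)])
print(sol.x)  # optimal point, approximately [1, 0] as before
```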
I'm not familiar with applying matrix calculations, and I'm getting nowhere fast in my attempts to apply the following complexity factors to every datapoint in my DataFrame (the values below are all example values). I've tried various combinations of df.apply(), np.dot() and np.matrix(), but can't find a way (let alone a fast way!) to get the output I need.
Matrix to be applied:
             0.6   0.3   0.1   (=1.0)
            |Low  |Med  |High
      ------------------------
0.2  |Low   |1.1  |1.4  |2.0
0.4  |Med   |0.8  |1.0  |1.4
0.4  |High  |0.6  |0.8  |1.1
(=1.0)
...so the calculation I'm trying to apply is as follows (if datapoint was 500, the adjusted result would be 454):
(<datapoint> * (0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0))
+(<datapoint> * (0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4))
+(<datapoint> * (0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1))
DataFrame for matrix to be applied over
The DataFrame for this matrix to be applied over has multi-level columns. Each column is an independent Series which runs across the DataFrame's timeseries index (empty datapoints filled with NaN).
The following code generates the test DataFrame I'm experimenting with:
element=[]
role=[]
#Generate the Series'
element1_part1= pd.Series(abs(np.random.randn(5)), index=pd.date_range('01-01-2018',periods=5,freq='D'))
element.append('Element 1')
role.append('Part1')
element1_part2= pd.Series(abs(np.random.randn(4)), index=pd.date_range('01-02-2018',periods=4,freq='D'))
element.append('Element 1')
role.append('Part2')
element2_part1= pd.Series(abs(np.random.randn(2)), index=pd.date_range('01-04-2018',periods=2,freq='D'))
element.append('Element 2')
role.append('Part1')
element2_part2= pd.Series(abs(np.random.randn(2)), index=pd.date_range('01-02-2018',periods=2,freq='D'))
element.append('Element 2')
role.append('Part2')
element3 = pd.Series(abs(np.random.randn(4)), index=pd.date_range('01-02-2018',periods=4,freq='D'))
element.append('Element 3')
role.append('Only Part')
#Zip the multi-level columns to Tuples
arrays=[element,role]
tuples = list(zip(*arrays))
#Concatenate the Series' and define timeseries
elements=pd.concat([element1_part1, element1_part2, element2_part1, element2_part2, element3], axis=1)
dateseries=elements.index
elements.columns=pd.MultiIndex.from_tuples(tuples, names=['Level-1', 'Level-2'])
If I'm understanding the problem correctly, you want an elementwise operation that updates the elements DataFrame with:
(<datapoint> * [(0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0)])
+(<datapoint> * [(0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4)])
+(<datapoint> * [(0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1)])
For all <datapoint>, this operation has the form (with x = <datapoint>):
[x * (a + b + c)] + [x * (d + e + f)] + [x * (g + h + i)]
= x * (a + ... + i)
= Cx # for some constant C
That means you just need to compute the scalar value C:
row_val = np.array([0.2, 0.4, 0.4])
col_val = np.array([0.6, 0.3, 0.1])
mat_val = np.array([[1.1, 1.4, 2.0],
                    [0.8, 1.0, 1.4],
                    [0.6, 0.8, 1.1]])
apply_mat = np.multiply(np.outer(row_val, col_val), mat_val)
apply_vec = np.sum(apply_mat, axis=1)
C = np.sum(apply_vec)
# 0.908
Or "by hand":
print(((0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0)) +
((0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4)) +
((0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1)))
# 0.908
This value for C matches your example datapoint and expected output:
0.908 * 500 = 454.0
Now you can use mul():
elements.mul(C)
With your example data, this is the output:
Level-1 Element 1 Element 2 Element 3
Level-2 Part1 Part2 Part1 Part2 Only Part
2018-01-01 2.169116 NaN NaN NaN NaN
2018-01-02 0.620286 1.645149 NaN 1.173356 0.277663
2018-01-03 0.782959 1.677798 NaN 0.557048 1.220138
2018-01-04 0.206314 0.773896 0.629524 NaN 0.572183
2018-01-05 1.209667 0.542614 0.666525 NaN 0.579032
I have two data points x and y:
x = 5 (value corresponding to 95%)
y = 17 (value corresponding to 102.5%)
Now I would like to calculate the value for xi which should correspond to 100%.
x = 5 (value corresponding to 95%)
xi = ?? (value corresponding to 100%)
y = 17 (value corresponding to 102.5%)
How should I do this using Python?
Is this what you want?
In [145]: s = pd.Series([5, np.nan, 17], index=[95, 100, 102.5])
In [146]: s
Out[146]:
95.0 5.0
100.0 NaN
102.5 17.0
dtype: float64
In [147]: s.interpolate(method='index')
Out[147]:
95.0 5.0
100.0 13.0
102.5 17.0
dtype: float64
You can use the numpy.interp function to interpolate a value:
import numpy as np
import matplotlib.pyplot as plt
x = [95, 102.5]
y = [5, 17]
x_new = 100
y_new = np.interp(x_new, x, y)
print(y_new)
# 13.0
plt.plot(x, y, "og-", x_new, y_new, "or");
We can easily plot this on a graph, even without Python: drawing a straight line through the two known points shows us what the answer should be (13).
But how do we calculate this? First, we find the gradient with m = (ya - yb) / (xa - xb), treating the values 5 and 17 as X and the percentages 95 and 102.5 as Y.
The numbers substituted into the equation give m = (95 - 102.5) / (5 - 17) = -7.5 / -12 = 0.625.
So we know that for every 0.625 we increase the Y value by, the X value increases by 1.
We've been given that Y is 100. We know that 102.5 relates to 17. 100 - 102.5 = -2.5. -2.5 / 0.625 = -4 and then 17 + -4 = 13.
This also works with the other numbers: 100 - 95 = 5, 5 / 0.625 = 8, 5 + 8 = 13.
We can also go backwards using the reciprocal of the gradient (1 / m).
We've been given that X is 13. We know that 102.5 relates to 17. 13 - 17 = -4. -4 / 1.6 = -2.5 and then 102.5 + -2.5 = 100.
How do we do this in Python?
def findXPoint(xa, xb, ya, yb, yc):
    m = (xa - xb) / (ya - yb)
    xc = (yc - yb) * m + xb
    return xc
And to find a Y point given the X point:
def findYPoint(xa, xb, ya, yb, xc):
    m = (ya - yb) / (xa - xb)
    yc = (xc - xb) * m + yb
    return yc
This function will also extrapolate from the data points.
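Putting it together, a self-contained check of both directions with the numbers from the question:

```python
def findXPoint(xa, xb, ya, yb, yc):
    # linear interpolation/extrapolation: X at a given Y
    m = (xa - xb) / (ya - yb)
    return (yc - yb) * m + xb

def findYPoint(xa, xb, ya, yb, xc):
    # linear interpolation/extrapolation: Y at a given X
    m = (ya - yb) / (xa - xb)
    return (xc - xb) * m + yb

print(findXPoint(5, 17, 95, 102.5, 100))  # 13.0
print(findYPoint(5, 17, 95, 102.5, 13))   # 100.0
```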