Applying an adjustment matrix over each column of a timeseries-indexed DataFrame - python

I'm not familiar with applying matrix calculations, and I'm getting nowhere fast in my attempts to apply the following complexity factors to every datapoint in my DataFrame (the values below are all example variable values). I've tried various combinations of df.apply(), np.dot() and np.matrix(), but can't find a way (let alone a fast way!) to get the output I need.
Matrix to be applied:
             0.6    0.3    0.1   (=1.0)
            |Low   |Med   |High
------------------------------
0.2 |Low    | 1.1  | 1.4  | 2.0
0.4 |Med    | 0.8  | 1.0  | 1.4
0.4 |High   | 0.6  | 0.8  | 1.1
(=1.0)
...so the calculation I'm trying to apply is as follows (if datapoint was 500, the adjusted result would be 454):
(<datapoint> * (0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0))
+(<datapoint> * (0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4))
+(<datapoint> * (0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1))
DataFrame for matrix to be applied over
The DataFrame for this matrix to be applied over has multi-level columns. Each column is an independent Series which runs across the DataFrame's timeseries index (empty datapoints filled with NaN).
The following code generates the test DataFrame I'm experimenting with:
import pandas as pd
import numpy as np

element=[]
role=[]
#Generate the Series'
element1_part1= pd.Series(abs(np.random.randn(5)), index=pd.date_range('01-01-2018',periods=5,freq='D'))
element.append('Element 1')
role.append('Part1')
element1_part2= pd.Series(abs(np.random.randn(4)), index=pd.date_range('01-02-2018',periods=4,freq='D'))
element.append('Element 1')
role.append('Part2')
element2_part1= pd.Series(abs(np.random.randn(2)), index=pd.date_range('01-04-2018',periods=2,freq='D'))
element.append('Element 2')
role.append('Part1')
element2_part2= pd.Series(abs(np.random.randn(2)), index=pd.date_range('01-02-2018',periods=2,freq='D'))
element.append('Element 2')
role.append('Part2')
element3 = pd.Series(abs(np.random.randn(4)), index=pd.date_range('01-02-2018',periods=4,freq='D'))
element.append('Element 3')
role.append('Only Part')
#Zip the multi-level columns to Tuples
arrays=[element,role]
tuples = list(zip(*arrays))
#Concatenate the Series' and define timeseries
elements=pd.concat([element1_part1, element1_part2, element2_part1, element2_part2, element3], axis=1)
dateseries=elements.index
elements.columns=pd.MultiIndex.from_tuples(tuples, names=['Level-1', 'Level-2'])

If I'm understanding the problem correctly, you want an element-wise operation that updates the elements DataFrame with:
(<datapoint> * [(0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0)])
+(<datapoint> * [(0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4)])
+(<datapoint> * [(0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1)])
For all <datapoint>, this operation has the form (with x = <datapoint>):
[x * (a + b + c)] + [x * (d + e + f)] + [x * (g + h + i)]
= x * (a + ... + i)
= Cx # for some constant C
That means you just need to compute the scalar value C:
row_val = np.array([0.2, 0.4, 0.4])
col_val = np.array([0.6, 0.3, 0.1])
mat_val = np.array([[1.1, 1.4, 2.0],
                    [0.8, 1.0, 1.4],
                    [0.6, 0.8, 1.1]])
# Weight each adjustment factor by its row and column probability, then sum everything
apply_mat = np.multiply(np.outer(row_val, col_val), mat_val)
apply_vec = np.sum(apply_mat, axis=1)
C = np.sum(apply_vec)
# 0.908
Or "by hand":
print(((0.2 * 0.6 * 1.1) + (0.2 * 0.3 * 1.4) + (0.2 * 0.1 * 2.0)) +
((0.4 * 0.6 * 0.8) + (0.4 * 0.3 * 1.0) + (0.4 * 0.1 * 1.4)) +
((0.4 * 0.6 * 0.6) + (0.4 * 0.3 * 0.8) + (0.4 * 0.1 * 1.1)))
# 0.908
This value for C matches your example datapoint and expected output:
0.908 * 500 = 454.0
Now you can use mul():
elements.mul(C)
With your example data, this is the output:
Level-1 Element 1 Element 2 Element 3
Level-2 Part1 Part2 Part1 Part2 Only Part
2018-01-01 2.169116 NaN NaN NaN NaN
2018-01-02 0.620286 1.645149 NaN 1.173356 0.277663
2018-01-03 0.782959 1.677798 NaN 0.557048 1.220138
2018-01-04 0.206314 0.773896 0.629524 NaN 0.572183
2018-01-05 1.209667 0.542614 0.666525 NaN 0.579032

Related

How to sum over some columns based on condition in pandas

I have a data frame like this:
mydf = {'p1':[0.1, 0.2, 0.3], 'p2':[0.2, 0.1,0.3], 'p3':[0.1,0.9, 0.01], 'p4':[0.11, 0.2, 0.4], 'p5':[0.3, 0.1,0.5],
'w1':['cancel','hello', 'hi'], 'w2':['good','bad','ugly'], 'w3':['thanks','CUSTOM_MASK','great'],
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible'],'w5':['CUSTOM_MASK','CUSTOM_MASK','job']}
df = pd.DataFrame(mydf)
So what I need to do is to sum up all values in columns p1, p2, p3, p4, p5 if the corresponding values in w1, w2, w3, w4, w5 are not CUSTOM_MASK or CUSTOM_UNKNOWN.
So the result would be to add a column to the data frame like this (0.1 + 0.2 + 0.1 = 0.4 for the first row):
top_p
0.4
0.3
1.51
So my question is: is there a pandas way to do this?
What I have done so far is to loop through the rows and then the columns, check the values (CUSTOM_MASK, CUSTOM_UNKNOWN), and sum them up only if those values are not present in the columns.
You can use mask. The idea is to create a boolean mask with the w columns, and use it to mask the relevant p columns and sum:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Before summing, the output of mask looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5
Here's a way to do this using np.dot():
pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
We first prepare the two sets of column names p1,...,p5 and w1,...,w5.
Then we use apply() to take the dot product of the values in the pN columns with the filtering criteria based on the wN columns (namely include only contributions from pN column values whose corresponding wN column value is not in the list of excluded strings).
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Alternatively, element-wise multiplication and sum across columns can be used like this:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
Here, we needed to rename the columns of one of the 5-column DataFrames to ensure that * (DataFrame.multiply()) can do the element-wise multiplication.
UPDATE: Here are a few timing comparisons on various possible methods for solving this question:
#1. Pandas mask and sum (see answer by #enke):
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
#2. Pandas apply with Numpy dot solution:
pCols, wCols = ['p'+str(i + 1) for i in range(5)], ['w'+str(i + 1)for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
#3. Pandas element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
#4. Numpy element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Timing results:
Timeit results for df with 30000 rows:
method_1 ran in 0.008165133331203833 seconds using 3 iterations
method_2 ran in 13.408894366662329 seconds using 3 iterations
method_3 ran in 0.007688766665523872 seconds using 3 iterations
method_4 ran in 0.006326200003968552 seconds using 3 iterations
Time performance results:
Method #4 (numpy multiply/sum) is about 20% faster than the runners-up.
Methods #1 and #3 (pandas mask/sum vs multiply/sum) are neck-and-neck in second place.
Method #2 (pandas apply/numpy dot) is frightfully slow.
Here's the timeit() test code in case it's of interest:
import pandas as pd
import numpy as np
nListReps = 10000
df = pd.DataFrame({'p1':[0.1, 0.2, 0.3]*nListReps, 'p2':[0.2, 0.1,0.3]*nListReps, 'p3':[0.1,0.9, 0.01]*nListReps, 'p4':[0.11, 0.2, 0.4]*nListReps, 'p5':[0.3, 0.1,0.5]*nListReps,
'w1':['cancel','hello', 'hi']*nListReps, 'w2':['good','bad','ugly']*nListReps, 'w3':['thanks','CUSTOM_MASK','great']*nListReps,
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible']*nListReps,'w5':['CUSTOM_MASK','CUSTOM_MASK','job']*nListReps})
from timeit import timeit
def foo_1(df):
    df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

def foo_2(df):
    pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
    df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
    return df

def foo_3(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
    df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
    return df

def foo_4(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

n = 3
print(f'Timeit results for df with {len(df.index)} rows:')
for foo in ['foo_' + str(i + 1) for i in range(4)]:
    t = timeit(f"{foo}(df.copy())", setup=f"from __main__ import df, {foo}", number=n) / n
    print(f'{foo} ran in {t} seconds using {n} iterations')
Conclusion:
The absolute fastest of these four approaches seems to be Numpy element-wise multiply and sum. However, #enke's Pandas mask and sum is pretty close in performance and is arguably the most aesthetically pleasing of the four candidates.
Perhaps this hybrid of the two (which runs about as fast as #4 above) is worth considering:
df['top_p'] = (df.filter(like='p').to_numpy() * ~df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Filtering out rows based on an IQR score in Pandas DataFrame

df_boyd_out = df_boyd[~((df_boyd['MTTR'] < (Q1 - 1.5 * IQR)) | (df_boyd['MTTR'] > (Q3 + 1.5 * IQR))).any(axis=1)]
The above is my code, which returns: ValueError: No axis named 1 for object type Series
I've tried:
df_boyd_out = df_boyd[~((df_boyd.MTTR < (Q1 - 1.5 * IQR)) | (df_boyd.MTTR > (Q3 + 1.5 * IQR))).any(axis=1)]
You don't need .any(axis=1) since your code already returns a Series of boolean values.
Another point: you can replace:
~((df_boyd.MTTR < (Q1 - 1.5 * IQR)) | (df_boyd.MTTR > (Q3 + 1.5 * IQR)))
by:
df_boyd.MTTR.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
which is probably more readable.
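Putting both points together (assuming Q1, Q3 and IQR are the quartiles and interquartile range you have already computed from df_boyd['MTTR']), the whole filter could then be written as:
df_boyd_out = df_boyd[df_boyd['MTTR'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
Note that Series.between is inclusive of both endpoints by default, which matches negating the two strict inequalities above.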

How to update a matrix of probabilities

I am trying to find/figure out a function that can update probabilities.
Suppose there are three players and each of them gets a fruit out of a basket: ["apple", "orange", "banana"]
I store the probabilities of each player having each fruit in a matrix (like this table):

            apple    orange   banana
Player 1    0.3333   0.3333   0.3333
Player 2    0.3333   0.3333   0.3333
Player 3    0.3333   0.3333   0.3333
The table can be interpreted as the belief of someone (S) who doesn't know who has what. Each row and column sums to 1.0 because each player has one of the fruits and each fruit is at one of the players.
I want to update these probabilities based on some knowledge that S gains. Example information:
Player 1 did X. We know that Player 1 does X with 80% probability if he has an apple. With 50% if he has an orange. With 10% if he has a banana.
This can be written more concisely as [0.8, 0.5, 0.1] and let us call it reach_probability.
A fairly easy to comprehend example is:
probabilities = [
[0.5, 0.5, 0.0],
[0.0, 0.5, 0.5],
[0.5, 0.0, 0.5],
]
# Player 1's
reach_probability = [1.0, 0.0, 1.0]
new_probabilities = [
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
]
The above example can be fairly easily thought through.
Another example:
probabilities = [
[0.25, 0.25, 0.50],
[0.25, 0.50, 0.25],
[0.50, 0.25, 0.25],
]
# Player 1's
reach_probability = [1.0, 0.5, 0.5]
new_probabilities = [
[0.4, 0.2, 0.4],
[0.2, 0.5, 0.3],
[0.4, 0.3, 0.3],
]
In my use case, using a simulation is not an option. My probabilities matrix is big. I'm not sure if the only way to calculate this is with an iterative algorithm, or if there is a better way.
I looked at Bayesian approaches but am not sure how to apply them in this case. Updating it row by row and then spreading out the difference proportionally to the previous probabilities seems promising, but I haven't managed to make it work correctly. Maybe it isn't even possible like that.
Initial condition: p(apple) = p(orange) = p(banana) = 1/3.
Player 1 did X. We know that Player 1 does X with 80% probability if he has an apple. With 50% if he has an orange. With 10% if he has a banana.
p(X | apple) = 0.8
p(X | orange) = 0.5
p(X | banana) = 0.1
Since apple, orange, and banana are all equally likely at 1/3, we have p(X) = (1/3) * (0.8 + 0.5 + 0.1) = 1.4 / 3 ≈ 0.4667.
Recall Bayes' theorem: p(a | b) = p(b | a) * p(a) / p(b)
So p(apple | X) = p(X | apple) * p(apple) / p(X) = 0.8 * (1/3) / 0.4667 ≈ 57.14%,
similarly p(orange | X) = 0.5 * (1/3) / 0.4667 ≈ 35.71%,
and p(banana | X) = 0.1 * (1/3) / 0.4667 ≈ 7.14%.
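As a quick sanity check in code (a minimal sketch; prior and reach_probability are just illustrative names):
import numpy as np

prior = np.array([1/3, 1/3, 1/3])              # Player 1's belief over [apple, orange, banana]
reach_probability = np.array([0.8, 0.5, 0.1])  # p(X | fruit)

posterior = prior * reach_probability
posterior /= posterior.sum()                   # dividing by p(X) normalizes the row
print(posterior)                               # ~[0.5714, 0.3571, 0.0714]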
Taking your example:
probabilities = [
[0.25, 0.25, 0.50],
[0.25, 0.50, 0.25],
[0.50, 0.25, 0.25],
]
# Player 1's
reach_probability = [1.0, 0.5, 0.5]
new_probabilities = [
[0.4, 0.2, 0.4],
[0.2, 0.5, 0.3],
[0.4, 0.3, 0.3],
]
p(x) = 0.25 * 1.0 + 0.25 * 0.5 + 0.5 * 0.5 = 0.625
p(a|x) = p(x|a) * p(a) / p(x) = 1.0 * 0.25 / 0.625 = 0.4
p(b|x) = p(x|b) * p(b) / p(x) = 0.5 * 0.25 / 0.625 = 0.2
p(c|x) = p(x|c) * p(c) / p(x) = 0.5 * 0.50 / 0.625 = 0.4
As desired. The other entries of each column can just be scaled to get a column sum of 1.0.
E.g. in column 1 we multiply the other entries by (1 - 0.4) / (1 - 0.25). This takes 0.25 -> 0.2 and 0.50 -> 0.40. Similarly for the other columns.
new_probabilities = [
[0.4, 0.200, 0.4],
[0.2, 0.533, 0.3],
[0.4, 0.266, 0.3],
]
If then player 2 does y with the same conditional probabilities we get:
p(y) = 0.2 * 1.0 + 0.533 * 0.5 + 0.3 * 0.5 = 0.6165
p(a|y) = p(y|a) * p(a) / p(y) = 1.0 * 0.2 / 0.6165 = 0.3244
p(b|y) = p(y|b) * p(b) / p(y) = 0.5 * 0.533 / 0.6165 = 0.4323
p(c|y) = p(y|c) * p(c) / p(y) = 0.5 * 0.3 / 0.6165 = 0.2433
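Here is a minimal sketch of that whole step in code (Bayes-update the acting player's row, then rescale the other entries of each column so the columns sum to 1 again); the function name update_beliefs is just illustrative:
import numpy as np

def update_beliefs(probabilities, player, reach_probability):
    # Bayes-update the acting player's row, then rescale the remaining rows
    # column-wise so that every column sums to 1 again.
    P = np.array(probabilities, dtype=float)
    row = P[player] * reach_probability
    row /= row.sum()                       # posterior for the acting player
    others = np.arange(P.shape[0]) != player
    old_rest = 1.0 - P[player]             # column mass currently held by the other players
    new_rest = 1.0 - row                   # column mass they should hold afterwards
    scale = np.divide(new_rest, old_rest, out=np.zeros_like(new_rest), where=old_rest > 0)
    P[others] *= scale                     # the per-column factor broadcasts over the other rows
    P[player] = row
    return P

probabilities = [[0.25, 0.25, 0.50],
                 [0.25, 0.50, 0.25],
                 [0.50, 0.25, 0.25]]
print(update_beliefs(probabilities, 0, [1.0, 0.5, 0.5]))
# [[0.4, 0.2, 0.4], [0.2, 0.5333, 0.3], [0.4, 0.2667, 0.3]] (approximately)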
Check this paper:
S. Ganzfried and T. Sandholm, "Endgame Solving in Large Imperfect-Information Games," in Proceedings of the International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2015, pp. 37–45.
Here is how I would approach this (I have not worked through whether this has problems too, but it seems alright in your examples).
Assume each update is of the form "X,Y has probability p'". Mark element X,Y dirty with delta p - p', where p was the old probability. Now redistribute the delta proportionally to all unmarked elements in the row, then the column, marking each of them dirty with its own delta, and marking the first element clean. Continue until no dirty entry remains.
0.5 0.5 0.0
0.0 0.5 0.5
0.5 0.0 0.5
Belief: 2,1 has probability zero.
0.5 0.0* 0.0 update 2,1 and mark dirty
0.0 0.5 0.5 delta is 0.5
0.5 0.0 0.5
1.0* 0.0' 0.0 distribute 0.5 to row & col
0.0 1.0* 0.5 update as dirty, both deltas -0.5
0.5 0.0 0.5
1.0' 0.0' 0.0 distribute -0.5 to rows & cols
0.0 1.0' 0.0* update as dirty, both deltas 0.5
0.0* 0.0 0.5
1.0' 0.0' 0.0 distribute 0.5 to row & col
0.0 1.0' 0.0' update as dirty, delta is -0.5
0.0' 0.0 1.0*
1.0' 0.0' 0.0 distribute on row/col
0.0 1.0' 0.0' no new dirty elements, complete
0.0' 0.0 1.0'
In your first example:
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
Belief: 3,1 has probability 0
1/3 1/3 0* update 3,1 to zero, mark dirty
1/3 1/3 1/3 delta is 1/3
1/3 1/3 1/3
1/2* 1/2* 0' distribute 1/3 proportionally across row then col
1/3 1/3 1/2* delta is -1/6
1/3 1/3 1/2*
1/2' 1/2' 0' distribute -1/6 proportionally across row then col
1/4* 1/4* 1/2' delta is 1/12
1/4* 1/4* 1/2'
1/2' 1/2' 0' distribute proportionally to unmarked entries
1/4' 1/4' 1/2' no new dirty entries, terminate
1/4' 1/4' 1/2'
You can mark entries dirty by inserting them with associated deltas into a queue and a hashset. Entries in both the queue and hash set are dirty. Entries in the hashset only are clean. Process the queue until you run out of entries.
I do not show an example where distribution is uneven, but the key is to distribute proportionally. Entries with 0 can never become non-zero except by a new belief.
Unfortunately there's no known nice solution.
The way that I would apply Bayesian reasoning is to store a likelihood matrix instead of a probability matrix. (Actually I'd store log-likelihoods to prevent underflow, but that's an implementation detail.) We can start with the matrix

     Apple   Orange   Banana
1    1       1        1
2    1       1        1
3    1       1        1
representing no knowledge. You could use the all-1/3 matrix instead, but I've used 1 to emphasize that normalization is not required. To apply an update like Player 1 doing X with conditional probabilities [0.8, 0.5, 0.1], we just multiply the row element-wise:

     Apple   Orange   Banana
1    0.8     0.5      0.1
2    1       1        1
3    1       1        1
If Player 1 does Y independently with the same conditional probabilities, then we get:

     Apple   Orange   Banana
1    0.64    0.25     0.01
2    1       1        1
3    1       1        1
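Here's what that bookkeeping looks like in NumPy (a small sketch; observe is just an illustrative name):
import numpy as np

likelihood = np.ones((3, 3))    # rows: players 1-3, columns: [apple, orange, banana]

def observe(likelihood, player, conditional_probs):
    # Multiply the acting player's row element-wise by p(observation | fruit).
    likelihood[player] *= conditional_probs
    return likelihood

observe(likelihood, 0, [0.8, 0.5, 0.1])   # Player 1 does X
observe(likelihood, 0, [0.8, 0.5, 0.1])   # Player 1 independently does Y
print(likelihood[0])                      # [0.64 0.25 0.01]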
Now, the rub is that these likelihoods don't have a nice relationship to probabilities of specific outcomes. All we know is that the probability of a specific matching is proportional to the product of its matrix entries. As a simple example, with a matrix like

     Apple   Orange   Banana
1    1       0        0
2    0       1        0
3    0       1        1
the entry for Player 3 having Orange is 1, yet this assignment has probability 0, because both possibilities for completing the matching have probability 0.
What we need is the permanent, which sums the likelihood of every matching, and the minor for each matrix entry, which sums the likelihood of every matching that makes the corresponding assignment. Unfortunately we don't know a good exact algorithm for computing the permanent, and experts are skeptical that one exists (the problem is NP-hard, and actually #P-complete). The known approximation employs sampling via Markov chains.
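For a small matrix you can make this concrete by brute force over all matchings, which computes the permanent and the per-assignment sums directly (exponential in the number of players, so this is only a sketch of the idea, not something to run on a big matrix):
import numpy as np
from itertools import permutations

def assignment_probabilities(likelihood):
    # Sum the product of entries over every matching (the permanent), and over
    # every matching that fixes player i -> fruit j, then take the ratio.
    n = likelihood.shape[0]
    totals = np.zeros_like(likelihood, dtype=float)
    permanent = 0.0
    for perm in permutations(range(n)):   # perm[i] = fruit assigned to player i
        weight = np.prod([likelihood[i, perm[i]] for i in range(n)])
        permanent += weight
        for i in range(n):
            totals[i, perm[i]] += weight
    return totals / permanent

L = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 1., 1.]])
print(assignment_probabilities(L))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]   <- Player 3 having Orange comes out at probability 0, as described above.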

numpy sorting and removing top values

I don't know if there is a name for this algorithm, but basically for a given y, I want to find the maximum x such that:
import numpy as np
np_array = np.random.rand(1000, 1)
np.sum(np_array[np_array > x] - x) >= y
Of course, a search algo would be to find the top value n_1, reduce it to the second largest value, n_2. Stop if n_1 - n_2 > y; else reduce both n_1 and n_2 to n_3, stop if (n_1 - n_3) + (n_2 - n_3) > y ...
But I feel there must be an algo to generate a sequence of {xs} that converges to its true value.
Let's use your example from the comments:
a = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.5, 0.2])
y = 0.5
First let's sort the data in descending order:
s = np.sort(a)[::-1] # 0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1
Let's take a look at how the choice of x affects the possible values of the sum r = np.sum(np_array[np_array > x] - x):
If x ≥ 0.6, then no elements are selected and r = 0.0
If 0.6 > x ≥ 0.5, then r = 0.6 - x ⇒ 0.0 < r ≤ 0.1 (where 0.1 = 0.6 - 0.5 × 1)
If 0.5 > x ≥ 0.4, then r = 0.6 - x + 0.5 - x = 1.1 - 2 * x ⇒ 0.1 < r ≤ 0.3 (where 0.3 = 1.1 - 0.4 × 2)
If 0.4 > x ≥ 0.3, then r = 0.6 - x + 0.5 - x + 0.4 - x = 1.5 - 3 * x ⇒ 0.3 < r ≤ 0.6 (where 0.6 = 1.5 - 0.3 × 3)
If 0.3 > x ≥ 0.2, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x = 1.8 - 4 * x ⇒ 0.6 < r ≤ 1.0 (where 1.0 = 1.8 - 0.2 × 4)
If 0.2 > x ≥ 0.1, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x + 0.2 - x + 0.2 - x = 2.2 - 6 * x ⇒ 1.0 < r ≤ 1.6 (where 1.6 = 2.2 - 0.1 × 6)
If 0.1 > x, then r = 0.6 - x + 0.5 - x + 0.4 - x + 0.3 - x + 0.2 - x + 0.2 - x + 0.1 - x + 0.1 - x = 2.4 - 8 * x ⇒ 1.6 < r ≤ ∞
The range of r is continuous over all positive values (r = 0.0 whenever x ≥ max(a), and negative values are unreachable). Duplicate elements affect the range of available r values for each value in a, but otherwise are nothing special. We can remove, but also account for, the duplicates by using np.unique instead of np.sort:
s, t = np.unique(a, return_counts=True)
s, t = s[::-1], t[::-1]
w = np.cumsum(t)
If your data can reasonably be expected not to contain duplicates, then use the sorted s shown in the beginning, and set t = np.ones(s.size, dtype=int) and therefore w = np.arange(s.size) + 1.
For s[i] > x ≥ s[i + 1], the bounds of r are given by c[i] - w[i] * s[i] < r ≤ c[i] - w[i] * s[i + 1], where
c = np.cumsum(s * t) # You can use just `np.cumsum(s)` if no duplicates
So finding where y ends up is a matter of placing it between the correct bounds. This can be done with a binary search, e.g., np.searchsorted:
# Left bound. Sum is strictly greater than this
bounds = c - w * s
i = np.searchsorted(bounds[1:], y, 'right')
The first element of bounds is always 0.0, and the resulting index i will point to the upper bound. By truncating off the first element, we shift the result to the lower bound, and ignore the zero.
The solution is found by solving for the location of x in the selected bin:
y = c[i] - w[i] * x
So you have:
x = (c[i] - y) / w[i]
You can write a function:
def dm(a, y, duplicates=False):
    if duplicates:
        s, t = np.unique(a, return_counts=True)
        s, t = s[::-1], t[::-1]
        w = np.cumsum(t)
        c = np.cumsum(s * t)
        i = np.searchsorted((c - w * s)[1:], y, 'right')
        x = (c[i] - y) / w[i]
    else:
        s = np.sort(a)[::-1]
        c = np.cumsum(s)
        i = np.searchsorted((c - s)[1:], y, 'right')
        x = (c[i] - y) / (i + 1)
    return x
This does not handle the case where y < 0, but it does allow you to enter many y values simultaneously, since searchsorted is pretty well vectorized.
Here is a usage sample:
>>> dm(a, 0.5, True)
0.3333333333333333
>>> dm(a, 0.6, True)
0.3
>>> dm(a, [0.1, 0.2, 0.3, 0.4, 0.5], True)
array([0.5 , 0.45 , 0.4 , 0.36666667, 0.33333333])
As for whether this algorithm has a name: I am not aware of any. Since I wrote this, I feel that "discrete madness" is an appropriate name. Slips off the tongue nicely too: "Ah yes, I computed the threshold using discrete madness".
This is an answer to the original question, where we find the maximum x s.t. np.sum(np_array[np_array > x]) >= y:
You can accomplish this with sorting and cumulative sum:
s = np.sort(np_array)[::-1]
c = np.cumsum(s)
i = np.argmax(c > y)
result = s[i]
s is the candidates for x in descending order. Comparing the cumulative sum c to y tells you exactly where the sum will exceed y. np.argmax returns the index of the first place that happens. The result is that index extracted from s.
This computation in numpy is slower than it needs to be because we can short circuit the sum immediately without computing a separate mask. The complexity is the same, however. You could speed up the following with numba or cython:
s = np.sort(np_array)[::-1]
c = 0
for i in range(len(s)):
    c += s[i]
    if c > y:
        break
result = s[i]

Place x,y coordinates into bins

I have a Pandas dataframe with two of the columns containing x,y coordinates that I plot as below:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.scatter(df.x, df.y, s=1, marker = ".")
plt.xlim(-1.5, 1.5)
plt.ylim(0, 2)
plt.xticks(np.arange(-1.5, 1.6, 0.1))
plt.yticks(np.arange(0, 2.1, 0.1))
plt.grid(True)
plt.show()
I want to split the x and y axes every 0.1 units to give 600 bins (30x20). Then I want to know how many of my points are in each bin and the indices of these points, so I can look them up in my dataframe. I basically want to create 600 new dataframes, one for each bin.
This is what I've tried so far:
df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)]
This will give me part of the dataframe contained within the square (-0.1 ≤ x < 0) & (0.7 ≤ y < 0.8). I want a way to create 600 of these.
I would use the cut function to create the bins and then group by them and count
#create fake data with bounds for x and y
df = pd.DataFrame({'x':np.random.rand(1000) * 3 - 1.5,
'y':np.random.rand(1000) * 2})
# bin the data into equally spaced groups
x_cut = pd.cut(df.x, np.linspace(-1.5, 1.5, 31), right=False)
y_cut = pd.cut(df.y, np.linspace(0, 2, 21), right=False)
# group and count
df.groupby([x_cut, y_cut]).count()
Output
x y
x y
[-1.5, -1.4) [0, 0.1) 3.0 3.0
[0.1, 0.2) 1.0 1.0
[0.2, 0.3) 3.0 3.0
[0.3, 0.4) NaN NaN
[0.4, 0.5) 1.0 1.0
[0.5, 0.6) 3.0 3.0
[0.6, 0.7) 1.0 1.0
[0.7, 0.8) 2.0 2.0
[0.8, 0.9) 2.0 2.0
[0.9, 1) 1.0 1.0
[1, 1.1) 2.0 2.0
[1.1, 1.2) 1.0 1.0
[1.2, 1.3) 2.0 2.0
[1.3, 1.4) 3.0 3.0
[1.4, 1.5) 2.0 2.0
[1.5, 1.6) 3.0 3.0
[1.6, 1.7) 3.0 3.0
[1.7, 1.8) 1.0 1.0
[1.8, 1.9) 1.0 1.0
[1.9, 2) 1.0 1.0
[-1.4, -1.3) [0, 0.1) NaN NaN
[0.1, 0.2) NaN NaN
[0.2, 0.3) 2.0 2.0
And to completely answer your question, you can add the categories to the original dataframe as columns and then do your searching from there, like this.
# add new columns
df['x_cut'] = x_cut
df['y_cut'] = y_cut
print(df.head(15))
x y x_cut y_cut
0 1.239743 1.348838 [1.2, 1.3) [1.3, 1.4)
1 -0.539468 0.349576 [-0.6, -0.5) [0.3, 0.4)
2 0.406346 1.922738 [0.4, 0.5) [1.9, 2)
3 -0.779597 0.104891 [-0.8, -0.7) [0.1, 0.2)
4 1.379920 0.317418 [1.3, 1.4) [0.3, 0.4)
5 0.075020 0.748397 [0, 0.1) [0.7, 0.8)
6 -1.227913 0.735301 [-1.3, -1.2) [0.7, 0.8)
7 -0.866753 0.386308 [-0.9, -0.8) [0.3, 0.4)
8 -1.004893 1.120654 [-1.1, -1) [1.1, 1.2)
9 0.007665 0.865248 [0, 0.1) [0.8, 0.9)
10 -1.072368 0.155731 [-1.1, -1) [0.1, 0.2)
11 0.819917 1.528905 [0.8, 0.9) [1.5, 1.6)
12 0.628310 1.022167 [0.6, 0.7) [1, 1.1)
13 1.002999 0.122493 [1, 1.1) [0.1, 0.2)
14 0.032624 0.426623 [0, 0.1) [0.4, 0.5)
And then to get the combination that you described above, df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)], you can set the index to x_cut and y_cut and do some hierarchical index selection.
df = df.set_index(['x_cut', 'y_cut'])
df.loc[[('[-0.1, 0)', '[0.7, 0.8)')]]
Output
x y
x_cut y_cut
[-0.1, 0) [0.7, 0.8) -0.043397 0.702029
[0.7, 0.8) -0.032508 0.799284
[0.7, 0.8) -0.036608 0.709394
[0.7, 0.8) -0.025254 0.741085
One of many ways to do it.
bins = (df // .1 * .1).round(1).stack().groupby(level=0).apply(tuple)
dict_of_df = {name: group for name, group in df.groupby(bins)}
You can get the dataframe of counts with
df.groupby(bins).size().unstack()
You could transform your coordinates into their respective bin indices (0-29 for x, 0-19 for y) and increment a matrix of zeros:
import numpy as np

shape = [30, 20]
bins = np.zeros(shape, dtype=int)

xmin = np.min(df.x)
xmax = np.max(df.x)
xwidth = xmax - xmin
xind = (((df.x - xmin) / xwidth) * shape[0]).astype(int).clip(0, shape[0] - 1)

ymin = np.min(df.y)
ymax = np.max(df.y)
ywidth = ymax - ymin
yind = (((df.y - ymin) / ywidth) * shape[1]).astype(int).clip(0, shape[1] - 1)

# count points per (x, y) bin
for ind in zip(xind, yind):
    bins[ind] += 1
