matplotlib hist while ignoring a particular no data value - python

I've got a 2D NumPy array with 1.0e6 as the no-data value. I'd like to generate a histogram of the data, and while I've succeeded, this can't be the best way to do it:
from matplotlib import pyplot
import sys

eps = sys.float_info.epsilon
no_data = 1.0e6
e_data = elevation.reshape(elevation.size)
e_data_clean = []
for i in xrange(len(e_data)):
    val = e_data[i]
    # floating-point inequality check: keep val only if it is not
    # (approximately) equal to no_data
    if val < no_data - eps or val > no_data + eps:
        e_data_clean.append(val)
pyplot.hist(e_data_clean, bins=100)
It seems like there should be a clean (and much faster) one-liner for this. Is there?

You can use a boolean array to select the required values:
selected_values = (e_data < (no_data - eps)) | (e_data > (no_data + eps))
pyplot.hist(e_data[selected_values], bins=100)
(e_data > (no_data + eps)) creates a boolean array with the same shape as e_data, set to True at a given index if and only if the value at that index is greater than (no_data + eps). | is the element-wise or operator, so the mask keeps every value that differs from no_data by more than eps.
Alternatively, if no_data is just a convention, I would set those values to numpy.nan instead, and use e_data[numpy.isfinite(e_data)].
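A minimal sketch of that NaN-based approach (assuming elevation is the 2-D array from the question):
import numpy as np
from matplotlib import pyplot

no_data = 1.0e6
e_data = elevation.ravel().astype(float)  # float copy, so NaN fits and elevation is untouched

# Replace the sentinel with NaN; plain == is fine if the sentinel is stored
# exactly, otherwise use np.isclose with a suitable atol
e_data[e_data == no_data] = np.nan

pyplot.hist(e_data[np.isfinite(e_data)], bins=100)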

Related

MATCH smallest value equivalent code in python

I would like the quickest Python equivalent of the Excel MATCH function, which returns the position of the nearest smaller value to "VALUE" in a "RANGE", i.e. Match(VALUE, RANGE, -1). The function should be applied to multiple "VALUE"s, i.e. a vector.
I have an initial solution below, but it is very slow with a large number of elements (100k):
In the example below I want to match each value in "vector" to find the position of its nearest smaller value in "vector_to_match":
import numpy as np
import perfplot

simulLength = 100000
vector = np.random.rand(simulLength)
vector_to_match = np.arange(100000) / 100000

def Match_Smallest(x):
    orderCheck = np.array((vector_to_match < x) * 1)
    x_order = sum(orderCheck) - 1
    return x_order

def A_Finding(x):
    return np.array(list(map(Match_Smallest, x)))

# what I want to get:
vector_position_output = A_Finding(vector)
# but Match_Smallest(x) is really too slow
It takes more than a minute for me to get the output vector A_Finding(vector). I would like to see if there is a quicker way to do it, because Excel is beating Python on speed here.
perfplot.show(
    setup=lambda n: np.random.rand(n),  # or setup=np.random.rand
    kernels=[
        A_Finding,
    ],
    labels=["c_"],
    n_range=[10 ** k for k in range(5)],
    xlabel="len(a)",
)
Use broadcasting:
out = np.sum(np.array((vector_to_match < vector[:, None]) * 1), axis=1) - 1
Or, as mentioned by @donkopotamus in the comments, use searchsorted:
out = vector_to_match.searchsorted(vector) - 1
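As a quick sanity check (a minimal sketch, reusing the names from the question) that both one-liners agree with the original loop on a small input:
import numpy as np

vector_to_match = np.arange(100) / 100  # sorted ascending
vector = np.random.rand(10)

loop = np.array([np.sum(vector_to_match < x) - 1 for x in vector])
bcast = np.sum((vector_to_match < vector[:, None]) * 1, axis=1) - 1
ss = vector_to_match.searchsorted(vector) - 1

assert (loop == bcast).all() and (loop == ss).all()
Note that the broadcasting version materialises an (n, m) boolean matrix, so for two 100k-element vectors searchsorted, which only does a binary search per value, is by far the cheaper option.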

Loop over elements in a numpy array of arbitrary dimension

I have code like the following:
def infball_proj(mu, beta):
    newmu = np.zeros(mu.shape)
    if len(mu.shape) == 2:
        for i in range(mu.shape[0]):
            for j in range(mu.shape[1]):
                if np.abs(mu[i, j]) > beta:
                    newmu[i, j] = np.sign(mu[i, j]) * beta
                else:
                    newmu[i, j] = mu[i, j]
        return newmu
    elif len(mu.shape) == 1:
        for i in range(mu.shape[0]):
            if np.abs(mu[i]) > beta:
                newmu[i] = np.sign(mu[i]) * beta
            else:
                newmu[i] = mu[i]
        return newmu
Is there a smarter way to do this so I don't have to write the 2 different cases? It would be nice if I could have a version that scales to an arbitrary dimension (i.e. numbers of axes).
Something like this should do the job:
newmu = np.where(np.abs(mu) > beta, np.sign(mu) * beta, mu)
Or, if I get the logic right,
newmu = np.minimum(np.abs(mu), beta) * np.sign(mu)
mu[np.abs(mu)>beta] = np.sign(mu[np.abs(mu)>beta]) * beta
np.abs(mu) > beta creates a boolean array which can then be used for boolean indexing.
On the LHS, mu[np.abs(mu) > beta] = ... assigns the RHS values directly to the selected elements, i.e. it modifies mu in place (unlike the np.where versions, which build a new array).
Remember: try to avoid Python-level for-loops over NumPy arrays, as they are very inefficient.
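A small check (a sketch) that the three vectorized suggestions agree; the boolean-assignment version modifies its input in place, hence the copy:
import numpy as np

mu = np.random.randn(4, 5)
beta = 0.5

a = np.where(np.abs(mu) > beta, np.sign(mu) * beta, mu)
b = np.minimum(np.abs(mu), beta) * np.sign(mu)

c = mu.copy()
mask = np.abs(c) > beta
c[mask] = np.sign(c[mask]) * beta

assert np.allclose(a, b) and np.allclose(a, c)
All three work unchanged for any number of axes, which removes the need for the two separate cases in infball_proj.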

Faster alternative to np.where for a sorted array

Given a large array a which is sorted along each row, is there a faster alternative to NumPy's np.where for finding the indices where min_v <= a <= max_v? I would imagine that leveraging the sorted nature of the array should be able to speed things up.
Here's an example of a setup using np.where to find the given indices in a large array.
import numpy as np
# Initialise an example of an array in which to search
r, c = int(1e2), int(1e6)
a = np.arange(r*c).reshape(r, c)
# Set up search limits
min_v = (r*c/2)-10
max_v = (r*c/2)+10
# Find indices of occurrences
idx = np.where(((a >= min_v) & (a <= max_v)))
You can use np.searchsorted:
import numpy as np
r, c = 10, 100
a = np.arange(r*c).reshape(r, c)
min_v = ((r * c) // 2) - 10
max_v = ((r * c) // 2) + 10
# Old method
idx = np.where(((a >= min_v) & (a <= max_v)))
# With searchsorted
i1 = np.searchsorted(a.ravel(), min_v, 'left')
i2 = np.searchsorted(a.ravel(), max_v, 'right')
idx2 = np.unravel_index(np.arange(i1, i2), a.shape)
print((idx[0] == idx2[0]).all() and (idx[1] == idx2[1]).all())
# True
When I use np.searchsorted with the 100 million numbers from the original example under the (admittedly outdated) NumPy version 1.12.1 (I can't speak for newer versions), it is not much faster than np.where:
>>> import timeit
>>> timeit.timeit('np.where(((a >= min_v) & (a <= max_v)))', number=10, globals=globals())
6.685825735330582
>>> timeit.timeit('np.searchsorted(a.ravel(), [min_v, max_v])', number=10, globals=globals())
5.304438766092062
But, although the NumPy docs for searchsorted state that "This function uses the same algorithm as the builtin python bisect.bisect_left and bisect.bisect_right functions", the latter are a lot faster:
>>> import bisect
>>> timeit.timeit('bisect.bisect_left(a.base, min_v), bisect.bisect_right(a.base, max_v)', number=10, globals=globals())
0.002058468759059906
Therefore, I'd use this:
idx = np.unravel_index(range(bisect.bisect_left(a.base, min_v),
                             bisect.bisect_right(a.base, max_v)), a.shape)
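The gap is plausible: np.where compares all r * c elements, while bisect performs only O(log(r * c)) comparisons. A quick check (a sketch) that the bisect variant reproduces the np.where indices from the setup above; a.base works here because a is a reshaped view of a 1-D array, otherwise use a.ravel():
import bisect
import numpy as np

i1 = bisect.bisect_left(a.base, min_v)    # first index with a.base[i] >= min_v
i2 = bisect.bisect_right(a.base, max_v)   # first index with a.base[i] > max_v
idx3 = np.unravel_index(np.arange(i1, i2), a.shape)

print((idx[0] == idx3[0]).all() and (idx[1] == idx3[1]).all())
# True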

Pandas: aggregating by numerical column name gives a shape error on assignment

To smooth out variation introduced at measurement time, I want to aggregate over a specific range.
For example, I want to sum the columns whose names lie within ±0.1 of an integer and assign the result to that integer's column. However, the assignment fails with a shape error.
I think it is caused by converting the type of the columns, but what should I do about it?
Thank you.
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(0, 10000, 1).reshape(100, 100))
df.columns = np.arange(0, 10, 0.1)
print(df.head())

df.columns = df.columns.astype(float)
temp = df.columns.values
for n in np.arange(1, 9, 1):
    l = n - 0.1
    m = n + 0.1
    calc_n = temp[np.where((temp >= l) & (temp <= m))]
    calc = np.sum(df[df.columns.intersection(calc_n)], axis=1)
    n_position = temp[np.where(temp == n)]
    df[n_position] = calc.values
ValueError: shape mismatch: value array of shape (100,) could not be broadcast to indexing result of shape (1,100)
The ValueError is because n_position is an array. So df[n_position] gives you a dataframe instead of a column.
It is usually not a good idea to use floats as an index. And you should be careful when comparing floats: the line calc_n = temp[np.where((temp >= l) & (temp <= m))] won't always give accurate results, because l and m are themselves inexact.
For a start, try:
margin = 0.101  # set your own margin
for n in np.arange(1, 9, 1):
    calc_n = np.where(np.abs(temp - n) < margin)
    df[n] = df.iloc[:, calc_n[0]].sum(axis=1)
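Equivalently (a sketch), np.isclose makes the tolerance explicit instead of hiding it in the 0.101 constant; its default rtol adds only a vanishingly small extra term here:
for n in np.arange(1, 9):
    mask = np.isclose(temp, n, atol=0.1)  # |temp - n| <= 0.1 (up to rtol)
    df[n] = df.iloc[:, mask].sum(axis=1)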

In numpy, Python, how to conditionally rewrite part of an array, when the values I want to set are in an array of a different size?

Let's say I have three arrays:
A[size1] of {0..size1}
B[size2] of {0..size1}
C[size2] of boolean
What I want:
for (int e = 0; e < size2; ++e):
    if C[e] == some_condition, then B[e] = A[B[e]]
Since a plain Python loop over the elements is slow, I have to implement this via NumPy operations on whole arrays. How can I do that?
Example:
import numpy as np

# n, size1 and size2 are assumed to be defined already
A = np.array([np.random.randint(0, n, size1), np.random.randint(0, size1, size1)])
B = np.random.randint(0, size1, size2)
C = np.random.randint(0, n, size2)

# that's the part I want to do in numpy:
for i in range(size2):
    if C[i] > A[0][B[i]]:
        B[i] = A[1][B[i]]
You could simply use boolean indexing:
mask = C > A[0][B] # Create mask to select valid ones from B
B[mask] = A[1][B[mask]] # Use mask to select and assign values
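A quick check (a sketch with placeholder sizes) that this reproduces the loop from the question:
import numpy as np

n, size1, size2 = 50, 20, 30  # placeholder sizes for illustration
A = np.array([np.random.randint(0, n, size1), np.random.randint(0, size1, size1)])
B = np.random.randint(0, size1, size2)
C = np.random.randint(0, n, size2)

B_loop = B.copy()
for i in range(size2):
    if C[i] > A[0][B_loop[i]]:
        B_loop[i] = A[1][B_loop[i]]

B_vec = B.copy()
mask = C > A[0][B_vec]
B_vec[mask] = A[1][B_vec[mask]]

assert (B_loop == B_vec).all()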
