Pandas: Check if row has similar values - python

I'm generating an overlay for a map using pandas and used:
if ((df['latitude'] == new_latitude) & (df['longitude'] == new_longitude)).any():
    continue
to make sure that I don't produce duplicate points. But I am now producing points that differ by only 0.001 (in latitude, longitude, or both) from one already produced. How can I prevent this in a similar manner as above?

IIUC you can subtract from the entire series and then just filter the points:
thresh = 0.001
lat = df.loc[(df['latitude'] - new_latitude).abs() > thresh, 'latitude']
lon = df.loc[(df['longitude'] - new_longitude).abs() > thresh, 'longitude']
This uses abs to take the absolute difference and generate a boolean mask, filtering all the duplicate and near-duplicate values out.
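To actually skip near-duplicates in the question's loop, the two masks can also be combined; a minimal sketch reusing the question's thresh, new_latitude and new_longitude names (not code from the answer):
thresh = 0.001
# a near-duplicate exists if some row is within thresh in both coordinates
near_dup = ((df['latitude'] - new_latitude).abs() <= thresh) & \
           ((df['longitude'] - new_longitude).abs() <= thresh)
if near_dup.any():
    continue  # skip this point, a near-duplicate is already present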

You could use the numpy.isclose function with atol set to your precision:
import numpy as np
prec = 0.001
np.isclose(df['latitude'], new_latitude, atol=prec)
if (np.isclose(df['latitude'], new_latitude, atol=prec) & np.isclose(df['longitude'], new_longitude, atol=prec)).any():
    continue
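As a small usage sketch (my own example data, not from the answer), the combined mask flags rows within prec of the candidate point in both coordinates:
import numpy as np
import pandas as pd

df = pd.DataFrame({'latitude': [48.8566, 40.7128], 'longitude': [2.3522, -74.0060]})
new_latitude, new_longitude = 48.8570, 2.3519  # within 0.001 of the first row
prec = 0.001
mask = np.isclose(df['latitude'], new_latitude, atol=prec) & \
       np.isclose(df['longitude'], new_longitude, atol=prec)
print(mask.any())  # True -> a near-duplicate already exists, so skip this point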

Is there a faster method for iterating over a very big 2D numpy array than using np.where?

I have a huge 2D numpy array filled with integer values. I collect them from a .tif image via gdal.GetRasterBand().
The pixel values of the image represent unique cluster identification numbers, so all pixels inside one cluster have the same value.
In my script I want to check whether each cluster has more pixels than a specific threshold. If the cluster size is bigger than the threshold, I want to keep the cluster and give its pixels the value 1. If a cluster has fewer pixels than the threshold, all pixels of this cluster should get the value 0.
My code so far works, but it is very slow, and because I want to vary the threshold it takes forever.
I would really appreciate your help. Thank you.
# Import GeoTIFF via GDAL and convert to numpy array
data = gdal.Open(image)
raster = data.GetRasterBand(1)
raster = raster.ReadAsArray()
# Different thresholds for iteration
thresh = [0, 10, 25, 50, 100, 1000, 2000]
for threshold in thresh:
    clusteredRaster = np.array(raster.copy(), dtype=int)
    for clump in np.unique(clusteredRaster):  # unique ids of the clusters in the image
        if clusteredRaster[np.where(clusteredRaster == clump)].size >= threshold:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(1)
        else:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(0)
[Cluster image: https://i.stack.imgur.com/miEKg.png]
In the image you can see the cluster image. Each color stands for a specific cluster number. I want to delete the small ones (under a specific size) and just keep the big ones.
There are a number of modifications that can be done to improve performance.
clusteredRaster = np.array(raster.copy(), dtype = int)
can be replaced with
clusteredRaster = raster.astype(int)
which is essentially a copy and a cast in one step, so this operation is faster.
Now, for
clusteredRaster[np.where(clusteredRaster == clump)] = int(1)
you don't need to call np.where; this will work faster:
clusteredRaster[clusteredRaster == clump] = int(1)
The same applies to this part:
clusteredRaster[np.where(clusteredRaster == clump)].size
You can also avoid evaluating clusteredRaster == clump twice, as follows:
for clump in np.unique(clusteredRaster):  # unique ids of the clusters in the image
    indices = clusteredRaster == clump
    if clusteredRaster[indices].size >= threshold:
        clusteredRaster[indices] = int(1)
    else:
        clusteredRaster[indices] = int(0)
I think your function will now run about twice as fast. However, if you want it to run even faster, you have to use smaller datatypes like np.uint8 instead of plain int, provided your image is encoded in RGB and can be represented by 8-bit ints (or maybe np.uint16 if 8 bits is too low?).
This is about as fast as it can get on the Python side.
There are faster methods, like using C modules with OpenMP to multithread your work across multiple cores; this can easily be done with something like numba or cython without having to write C code, but there is a lot of reading to do if you want the best possible performance, such as which threading backend to use (TBB vs OpenMP) and some OS- and device-dependent capabilities.
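For illustration only, here is a minimal numba sketch of the relabelling step under these suggestions; the function name relabel_by_size and the use of np.bincount to precompute per-cluster pixel counts are my own assumptions, not code from the question or the answers:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def relabel_by_size(flat_raster, counts, threshold):
    # flat_raster: 1-D array of non-negative integer cluster ids
    # counts[k]: number of pixels whose cluster id is k
    out = np.empty(flat_raster.size, dtype=np.uint8)
    for i in prange(flat_raster.size):
        out[i] = 1 if counts[flat_raster[i]] >= threshold else 0
    return out

# usage (hypothetical): compute counts once with plain numpy, then reshape back
# counts = np.bincount(raster.ravel())
# clustered = relabel_by_size(raster.ravel(), counts, 100).reshape(raster.shape)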
In addition to the changes suggested by Ahmed Mohamed AEK you can also take the calculation of unique values, indices, and counts outside of the for loops. Plus you don't need to copy raster each time - you can make an array of np.uint8s.
This gives the same results as your original implementation:
data = gdal.Open(image)
raster = data.GetRasterBand(1).ReadAsArray()
# Different thresholds for iteration
thresh = [0, 10, 25, 50, 100, 1000, 2000]
# determine the unique clumps and their frequencies outside of the for loops
clumps, counts = np.unique(raster, return_counts=True)
# only determine the indices once, rather than for each threshold
indices = np.asarray([raster == clump for clump in clumps])
for threshold in thresh:
    clustered_raster = np.zeros_like(raster, dtype=np.uint8)
    for clump_indices, clump_counts in zip(indices, counts):
        clustered_raster[clump_indices] = clump_counts >= threshold
I got an easy solution based on your helpful answers!
The idea is to find the unique values and cluster sizes per threshold and fill in the correct values directly, thus avoiding the inner loop.
It reduces the run time from initially 142 seconds per iteration to 0.52 seconds and reproduces the same results.
data = gdal.Open(image)
raster = data.GetRasterBand(1).ReadAsArray()
thresh = [0, 10, 25, 50, 100, 1000, 2000]
for threshold in thresh:
    # Create new 0-raster with same dimensions as input raster
    clusteredRaster = np.zeros(raster.shape, dtype=np.uint8)
    # Get unique cluster IDs and count the size of each occurrence
    clumps, counts = np.unique(raster, return_counts=True)
    # Get only the clumps which are bigger than the threshold
    biggerClumps = clumps[counts >= threshold]
    # Fill in ones for the relevant cluster IDs
    clusteredRaster[np.isin(raster, biggerClumps)] = 1

Python - Filter local extrema based on relative height

Using fuglede's answer, it's easy to find the local extrema of a DataFrame column:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
which gives the following graph:
I would now like to group those extrema into pairs (a minimum and the neighboring maximum, in this order) and remove the pairs where maximum < minimum + threshold. By removing I mean replacing the corresponding values in df['min'] and df['max'] with NaNs.
This basically filters the irrelevant small extrema.
I've tried find_peaks with various options but none gave the intended results.
Is there an elegant and fast way to do this ?
I think you may have missed the excellent answer from Foad reported here: Pandas finding local max and min
Instead of calculating max and min with a shift of 1, you can set a window (number of neighbors) and find the local min and max of your values. Although there is no single window parameter that will fit perfectly, it reduces the noise substantially.
from scipy.signal import argrelextrema

# Find peaks in the window
n = 10  # window size
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal, order=n)[0]]['data']
I agree with the previous answer, but I think this might be closer to what you are asking for.
threshold = 0.8
points = df.dropna(subset=['min', 'max'], how='all').copy()
ddf = pd.merge(points['min'].dropna().reset_index(),
               points['max'].dropna().reset_index(),
               left_index=True,
               right_index=True)
ddf = ddf[ddf['max'] < (ddf['min'] + threshold)]
# Plot results
plt.scatter(ddf['index_x'], ddf['min'], c='r')
plt.scatter(ddf['index_y'], ddf['max'], c='g')
df.data.plot()
Although I suspect what you want is actually this:
threshold = 3
points = df.dropna(subset=['min', 'max'], how='all').copy()
ddf = pd.merge(points['min'].dropna().reset_index(),
               points['max'].dropna().reset_index(),
               left_index=True,
               right_index=True)
ddf = ddf[ddf['max'] > (ddf['min'] + threshold)]
# Plot results
plt.scatter(ddf['index_x'], ddf['min'], c='r')
plt.scatter(ddf['index_y'], ddf['max'], c='g')
df.data.plot()
To merge this back onto the original dataframe:
df['min'] = df.index.map(ddf.set_index('index_x')['min'])
df['max'] = df.index.map(ddf.set_index('index_y')['max'])

how to make a high pass filter?

I have a 3D data matrix of sea level data (time, y, x) and I found the power spectrum by taking the square of the FFT but there are low frequencies that are really dominant. I want to get rid of those low frequencies by applying a high pass filter... how would I go about doing that?
Example of data set and structure/code is below:
This is the data set and creating the arrays:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import netCDF4 as s  # assumed: the original alias `s` for the netCDF reader

Yearmin = 2018
Yearmax = 2019
year_len = Yearmax - Yearmin + 1.0  # number of years
direcInput = "filepath"
a = s.Dataset(direcInput + "test.nc", mode='r')

# creating arrays
lat = a.variables["latitude"][:]
lon = a.variables["longitude"][:]
time1 = a.variables["time"][:]  # days since Jan 1st 1950
sla = a.variables["sla"][:, :, :]  # t, y, x
time = Yearmin + (year_len * (time1 - np.min(time1)) / (np.max(time1) - np.min(time1)))

# detrending and normalizing data
def standardize(y, detrend=True, normalize=True):
    if detrend == True:
        y = signal.detrend(y, axis=0)
    y = (y - np.mean(y, axis=0))
    if normalize == True:
        y = y / np.std(y, axis=0)
    return y

sla_standard = standardize(sla)
print(sla_standard.shape)  # (710, 81, 320)

# fft
fft = np.fft.rfft(sla_standard, axis=0)
spec = np.square(abs(fft))
frequencies = (0, nyquist, df)  # nyquist and df (frequency step) are defined elsewhere

# plotting the frequencies vs spectrum for a few different spatial locations
plt.plot(frequencies, spec[:, 68, 85])
plt.plot(frequencies, spec[:, 23, 235])
plt.plot(frequencies, spec[:, 39, 178])
plt.plot(frequencies, spec[:, 30, 149])
plt.xlim(0, .05)
plt.show()
My goal is to make a high pass filter of the ORIGINAL time series (sla_standard) to remove the two really big peaks. Which type of filter should I use? Thank you!
Use .axes.Axes.set_ylim to set the y-axis limit.
Axes.set_ylim(self, left=None, right=None, emit=True, auto=False, *, ymin=None, ymax=None)
So in your case ymin=None and you set ymax, for example ymax=60000, before you start plotting.
Thus plt.ylim(ymin=None, ymax=60000).
Taking out data should not be done here because it is "falsifying results". What you actually want is to zoom in on the chart. A person who reads the chart independently of you would interpret the data falsely if not made aware of this in advance. Peaks that go off the chart are fine because everybody understands that.
Or:
Directly replace certain values in an array (arr):
arr[arr > ori] = dest
For example, in your case ori=60000 and dest=1.
All values larger (">") than 60k are replaced by 1.
The different filters: As you state, a filter acts on the frequencies of your signal. Different filter shapes exist, and some of them have complex expressions because they need to be implemented in real-time processing (causal). However, in your case you seem to post-process the data, so you can use the Fourier transform, which requires all the data (non-causal).
The filter to choose: Consequently, you can perform your filtering operation directly in the Fourier domain by applying a mask to your frequencies. If you want to remove frequencies, I recommend you use a binary mask made of 0s and 1s. Why? Because it is the simplest filter you can think of. It is scientifically defensible to state that you completely removed some frequencies (say it and justify it). It is harder to claim that you kept some, attenuated others a little, and chose the attenuation factor arbitrarily...
Python implementation
signal_fft = np.fft.rfft(sla_standard, axis=0)
# the mask must have the same shape as the rfft output (not as sla_standard)
mask = np.ones(signal_fft.shape)
mask[freq_to_filter, ...] = 0.0  # define here the frequencies to filter
filtered_signal = np.fft.irfft(mask * signal_fft, n=sla_standard.shape[0], axis=0)
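For concreteness, a minimal sketch (my own, not part of the answer) of building the mask as a high-pass filter with np.fft.rfftfreq; the sampling interval dt and the cutoff frequency cutoff are assumed values:
n = sla_standard.shape[0]
freqs = np.fft.rfftfreq(n, d=dt)  # frequency axis matching the rfft output along axis 0
signal_fft = np.fft.rfft(sla_standard, axis=0)
mask = np.ones(signal_fft.shape)
mask[freqs < cutoff] = 0.0  # zero out the dominant low frequencies
filtered_signal = np.fft.irfft(mask * signal_fft, n=n, axis=0)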

Numpy symmetric matrix becomes asymmetric when I applied min-max scaling

I have a symmetric matrix (1877 x 1877), here is the matrix file. I try to scale the values to be between 0 and 1. After I apply this method, the matrix is no longer symmetric. Any help is appreciated.
from sklearn import preprocessing

print((dist.transpose() == dist).all())  # this prints 'True'

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler()
    return minmax_scale.fit_transform(X)

sci_dist_scaled = sci_minmax(dist)
(sci_dist_scaled.transpose() == sci_dist_scaled).all()  # this prints 'False'
sci_dist_scaled.dtype, dist.dtype  # (dtype('float64'), dtype('float64'))
Looking at this description, MinMaxScaler works column-by-column, so, naturally, you can't expect it to preserve symmetry.
What's best to do in your case depends a bit on what you are trying to achieve, really. If having the values between 0 and 1 is all you require, you can rescale by hand:
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
but depending on your ultimate problem this may be too simplistic...
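A quick sketch (my own made-up numbers, not from the answer) showing why this global rescaling preserves symmetry: the same affine map is applied to every entry, so the transpose check still passes.
import numpy as np

dist = np.array([[0.0, 2.0, 5.0],
                 [2.0, 0.0, 3.0],
                 [5.0, 3.0, 0.0]])
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
print((dist01.transpose() == dist01).all())  # True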

median-absolute-deviation (MAD) based outlier detection

I wanted to apply median-absolute-deviation (MAD) based outlier detection using the answer from @Joe Kington as given below:
Pythonic way of detecting outliers in one dimensional observation data
However, something is going wrong with my code and I could not figure out how to assign the outliers as NaN values for my data:
import numpy as np

data = np.array([55,32,4,5,6,7,8,9,11,0,2,1,3,4,5,6,7,8,25,25,25,25,10,11,12,25,26,27,28], dtype=float)
median = np.median(data, axis=0)
diff = np.sum((data - median)**2, axis=-1)
diff = np.sqrt(diff)
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
data_without_outliers = data[modified_z_score < 3.5]
# ????? how do I assign the outliers as NaN here?
print(data_without_outliers)
What is the problem with using:
data[modified_z_score > 3.5] = np.nan
Note that this will only work if data is a floating point array (which it should be if you are calculating MAD).
The problem might be the line:
diff = np.sum((data - median)**2, axis=-1)
Applying np.sum() collapses the result to a scalar.
Remove the top-level sum and your code will work.
Another way around it is to ensure that data is at least a 2-D array; you can use numpy.atleast_2d() for that.
In order to assign NaNs, follow answer from https://stackoverflow.com/a/22804327/4989451
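Putting the two answers together, a minimal sketch (mine, not from either answer) of a corrected 1-D version that marks the outliers as NaN:
import numpy as np

data = np.array([55,32,4,5,6,7,8,9,11,0,2,1,3,4,5,6,7,8,25,25,25,25,10,11,12,25,26,27,28], dtype=float)
median = np.median(data)
diff = np.abs(data - median)  # absolute deviation, no sum
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
data[modified_z_score > 3.5] = np.nan  # outliers become NaN
print(data)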
