Get density of messages on distinct IDs - Python

Imagine that there are 10 houses, where each house holds from one to an infinite number of persons. Each of those persons sends some number of messages, each containing their userid and the house number; this can be from 1 to an infinite number of messages. I want to know the average number of messages sent per person for each house, to later plot which house got the largest average number of messages.
Now that I've explained it conceptually: the houses aren't houses but latitude bands, e.g. from -90 to -89 and so on, and a person can send messages from different houses.
So I've got a database with latitude and senderID. I want to plot the density of latitudes per unique senderID:
Number of rows/Number of unique userids at each latitude over an interval
This is a sample input:
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
       40.72, 47.59, 54.42, 63.84, 76.77, 77.43, 78.54]
userid = [5, 7, 6, 6, 6, 6, 5, 2,
          2, 2, 1, 5, 10, 9, 8]
Here are the corresponding densities:
-80 to -90: 1
-40 to -50: 1
-30 to -40: 4
-20 to -30: 1
40 to 50: 2
50 to 60: 1
60 to 70: 1
70 to 80: 1
Another input:
lat = [70,70,70,70,70,80,80,80]
userid = [1,2,3,4,5,1,1,2]
The density for latitude 70 is 1, while the density for latitude 80 is 1.5.
If I were to do this through a database query, in pseudocode it would be something like:
SELECT count(latitude) FROM messages WHERE latitude < 79 AND latitude > 69
SELECT count(distinct userid) FROM messages WHERE latitude < 79 AND latitude > 69
The density would then be count(latitude)/count(distinct userid), also to be interpreted as totalmessagesFromCertainLatitude/distinctUserIds. This would be repeated for intervals from -90 to 90, i.e. -90 < latitude < -89 up to 89 < latitude < 90.
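In plain Python, I imagine the same computation would look roughly like this (an unoptimized, illustrative sketch over the lat/userid lists above):
# rough sketch of the pseudocode above: message count divided by
# distinct-user count, per 1-degree latitude interval
densities = {}
for low in range(-90, 90):
    rows = [(la, u) for la, u in zip(lat, userid) if low <= la < low + 1]
    if rows:
        densities[low] = len(rows) / len({u for _, u in rows})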
To get any help with this is probably a far stretch, but I just can't organize my thoughts to do this in a way I'm confident has no errors. I would be happy for anything. I'm sorry if I was unclear.

Because this packs so neatly into pandas' built-ins, it should also be fast on big datasets.
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
       40.72, 47.59, 54.42, 63.84, 76.77, 77.43, 78.54]
userid = [5, 7, 6, 6, 6, 6, 5, 2,
          2, 2, 1, 5, 10, 9, 8]
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

df = pd.DataFrame(list(zip(userid, lat)), columns=['userid', 'lat'])
zonewidth = 10   # use 1 for one-degree zones
df['zone'] = (df.lat // zonewidth) * zonewidth
dfz = df.groupby('zone')   # a GroupBy object, iterable as (zone, sub-DataFrame) pairs
#for k, v in dfz:          # useful for exploring the GroupBy object
#    print(k, v.userid.values, len(v.userid) / v.userid.nunique())
p = [(k, len(v.userid) / v.userid.nunique()) for k, v in dfz]
# plotting could be tightened up -- PatchCollection?
R = [Rectangle((x, 0), zonewidth, y, facecolor='red', edgecolor='black', fill=True)
     for x, y in p]
fig, ax = plt.subplots()
for r in R:
    ax.add_patch(r)
plt.xlim((-90, 90))
tall = max(r.get_height() for r in R)
plt.ylim((0, tall + 0.5))
plt.show()
For the first set of test data:

I'm not 100% sure I've understood the output you want, but this will produce a stepped, histogram-like plot with the x-axis being latitudes (binned) and the y-axis being the density you define above.
From your sample code, you already have numpy installed and are happy to use it. The approach I would take is to build two data sets rather like those your SQL samples would return, use them to compute the densities, and then plot. Using your existing latitude / userid data format, it might look something like this.
EDIT: Removed the first version of the code from here, along with some comments that became redundant following clarification and question edits from the OP.
Following the comments and the OP's clarification, I think this is what is desired:
import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby
def draw_hist(latitudes, userids):
    min_lat = -90
    max_lat = 90
    binwidth = 1
    bin_range = np.arange(min_lat, max_lat, binwidth)
    binned_latitudes = np.digitize(latitudes, bin_range)
    all_in_bins = sorted(zip(binned_latitudes, userids))
    unique_in_bins = sorted(set(all_in_bins))
    bin_count_all = []
    for bin, group in groupby(all_in_bins, lambda x: x[0]):
        bin_count_all.append((bin, sum(1 for _ in group)))
    bin_count_unique = []
    for bin, group in groupby(unique_in_bins, lambda x: x[0]):
        bin_count_unique.append((bin, sum(1 for _ in group)))
    # bin_count_all and bin_count_unique now contain the data
    # corresponding to the SQL / pseudocode in your question,
    # for each latitude bin
    bin_density = [(bin_range[b - 1], a / u)
                   for (b, a), (_, u) in zip(bin_count_all, bin_count_unique)]
    bin_density = np.array(bin_density).transpose()
    # plot as a standard bar chart -- note you can pass uneven widths
    # as an array-like here if necessary;
    # the * simply unpacks the x and y values from the density
    plt.bar(*bin_density, width=binwidth)
    plt.show()
    # can save away the plot here if desired

latitudes = [-70.5, 5.3, 70.32, 70.43, 5, 32, 80, 80, 87.3]
userids = [1, 1, 2, 2, 4, 5, 1, 1, 2]
draw_hist(latitudes, userids)
Sample output with different bin widths on OP dataset

I think this solves the case, although it isn't efficient at all:
import sqlite3 as lite
import time
import matplotlib.pyplot as plt
from operator import itemgetter

con = lite.connect(databasepath)  # path to your sqlite database
binwidth = 1
latitudes = []
userids = []
info = []
densities = []
with con:
    cur = con.cursor()
    cur.execute('SELECT latitude, userid FROM dynamicMessage')
    print("executed")
    while True:
        tmp = cur.fetchone()
        if tmp is not None:
            info.append([float(tmp[0]), float(tmp[1])])
        else:
            break
info = sorted(info, key=itemgetter(0))
for x in info:
    latitudes.append(x[0])
    userids.append(x[1])
for b in range(int(min(latitudes)), int(max(latitudes)) + 1):
    numlatitudes = sum(i < b for i in latitudes)
    if numlatitudes > 1:
        tempdensities = latitudes[0:numlatitudes]
        latitudes = latitudes[numlatitudes:]
        tempuserids = userids[0:numlatitudes]
        userids = userids[numlatitudes:]
        density = numlatitudes / len(set(tempuserids))
        if density > 1:
            tempdensities = [b] * int(density)
        densities.extend(tempdensities)
plt.hist(densities, bins=len(set(densities)))
plt.savefig('latlongstats' + 't' + str(time.strftime("%H:%M:%S")), format='png')

What follows is not a complete solution in terms of plotting the required histogram, but I think it's nevertheless worth reporting.
The bulk of the solution: we scan the list of tuples and select the ones in the required range, and we count
the number of selected tuples, and
the unique ids, using the trick of building a set (which automatically discards the duplicates) and computing its size;
eventually we return the required ratio, or zero if the count of distinct ids is zero.
def ratio(d, mn, mx):
    tmp = [(lat, uid) for lat, uid in d if mn <= lat < mx]
    nlats, nduids = len(tmp), len({t[1] for t in tmp})
    return 1.0 * nlats / nduids if nduids > 0 else 0
The data is read in and assigned, via zip, to a list of tuples (note the list() call, so the data can be iterated over more than once under Python 3):
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
       -19.00, -12.32, -6.14, -1.11, 4.40, 10.23, 19.40, 31.18,
       40.72, 47.59, 54.42, 63.84, 76.77]
userid = [52500.0, 70100.0, 35310.0, 47776.0, 70100.0, 30991.0, 37328.0, 25575.0,
          37232.0, 6360.0, 52908.0, 52908.0, 52908.0, 77500.0, 345.0, 6360.0,
          3670.0, 36690.0, 3720.0, 2510.0, 2730.0]
data = list(zip(lat, userid))
Preparation of the bins:
extremes = range(-90, 91, 10)
intervals = list(zip(extremes[:-1], extremes[1:]))
The actual computation; the result is a list of floats that can be passed to the relevant pyplot functions:
ratios = [ratio(data, *i) for i in intervals]
print(ratios)
# [1.0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0, 1.0, 1.0, 0]
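Since ratios is index-aligned with intervals, a minimal plotting sketch (assuming matplotlib) could be:
import matplotlib.pyplot as plt

# one bar per 10-degree interval, anchored at the interval's lower edge
plt.bar([lo for lo, hi in intervals], ratios, width=10, align='edge', edgecolor='black')
plt.xlim(-90, 90)
plt.show()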

Related

How to identify which coordinates are within a specific distance of each other

I am trying to identify which coordinates fall within a specific distance of each other. Currently, my code is grouping all points together when it should produce two separate groups.
from sklearn.neighbors import DistanceMetric
import pandas as pd
import numpy as np
from collections import Counter

data = {'Lat': [38.42447, 38.424474, 38.424493, 38.424394, 38.424457, 38.424434],
        'Long': [-77.402199, -77.402228, -77.402186, -77.398625, -77.398602, -77.398459],
        'Name': ['Truck', 'Truck1', 'Truck2', 'Truck3', 'Truck4', 'Truck5']}
df = pd.DataFrame(data)
df['Lat'] = np.radians(df['Lat'])
df['Long'] = np.radians(df['Long'])
dist = DistanceMetric.get_metric('haversine')
final_df = pd.DataFrame(dist.pairwise(df[['Lat', 'Long']].to_numpy()) * 6371000,
                        columns=df.Name.unique(), index=df.Name.unique())
potential_grouping = []
for row, col in final_df.items():
    for item in col:
        if int(item) < 15:
            potential_grouping.append(row)
outside_features = [k for k, v in Counter(potential_grouping).items() if v == 1]
acceptable_features = [k for k, v in Counter(potential_grouping).items() if v > 1]
print(acceptable_features)
current output: ['Truck', 'Truck1', 'Truck2', 'Truck3', 'Truck4', 'Truck5']
desired output: [['Truck', 'Truck1', 'Truck2'],['Truck3', 'Truck4', 'Truck5']]
Here is a crappy picture of what is happening...
The 6 small circles are currently being grouped together (big red circle) but should be separate (the 2 green circles). This is happening because each coordinate (the small brown circles) is within 15 meters of at least one other. How can I ensure that I get my desired output?
Here is one way using DBSCAN:
from sklearn.cluster import DBSCAN
# here Lat and Long are already in radians
X = df[['Lat', 'Long']].to_numpy()
# eps is your max distance, here 15 meters, divided by the earth's radius in meters
clustering = DBSCAN(eps=15/6373000, min_samples=1, metric='haversine').fit(X)
# see groups
print(clustering.labels_)
# [0 0 0 1 1 1]
# get the result as you want
acceptable_features = df['Name'].groupby(clustering.labels_).agg(list).tolist()
print(acceptable_features)
# [['Truck', 'Truck1', 'Truck2'], ['Truck3', 'Truck4', 'Truck5']]
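Note that min_samples=1 means every point is assigned to some cluster, so nothing is labelled as noise (-1); and because the haversine metric works on angles, eps is given as a distance in meters divided by the earth's radius in meters.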

How can I plot a confidence interval in Python?

I recently started to use Python, and I can't understand how to plot a confidence interval for a given datum (or set of data).
I already have a function that computes, given a set of measurements, a higher and lower bound depending on the confidence level that I pass to it, but how can I use those two values to plot a confidence interval?
There are several ways to accomplish what you are asking for:
Using only matplotlib
from matplotlib import pyplot as plt
import numpy as np

# some example data
x = np.linspace(0.1, 9.9, 20)
y = 3.0 * x
# some confidence interval
ci = 1.96 * np.std(y) / np.sqrt(len(x))

fig, ax = plt.subplots()
ax.plot(x, y)
ax.fill_between(x, (y - ci), (y + ci), color='b', alpha=.1)
plt.show()
fill_between does what you are looking for. For more information on how to use this function, see: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.fill_between.html
Output
Alternatively, go for seaborn, which supports this using lineplot or regplot; see: https://seaborn.pydata.org/generated/seaborn.lineplot.html
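For example, a small sketch of the seaborn route (the column names and synthetic data here are just illustrative):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# long-form data: 20 noisy repeats of y for each x value
rng = np.random.default_rng(0)
x = np.tile(np.arange(10), 20)
df = pd.DataFrame({'x': x, 'y': 3.0 * x + rng.normal(0, 2, x.size)})
# lineplot aggregates the repeats per x and shades a 95% CI band by default
sns.lineplot(data=df, x='x', y='y')
plt.show()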
Let's assume that we have three categories and lower and upper bounds of confidence intervals of a certain estimator across these three categories:
import pandas as pd
import matplotlib.pyplot as plt

data_dict = {}
data_dict['category'] = ['category 1', 'category 2', 'category 3']
data_dict['lower'] = [0.1, 0.2, 0.15]
data_dict['upper'] = [0.22, 0.3, 0.21]
dataset = pd.DataFrame(data_dict)
You can plot the confidence interval for each of these categories using the following code:
for lower, upper, y in zip(dataset['lower'], dataset['upper'], range(len(dataset))):
    plt.plot((lower, upper), (y, y), 'o-', color='orange')
plt.yticks(range(len(dataset)), list(dataset['category']))
Resulting in the following graph:
import matplotlib.pyplot as plt
import statistics
from math import sqrt

def plot_confidence_interval(x, values, z=1.96, color='#2187bb', horizontal_line_width=0.25):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    confidence_interval = z * stdev / sqrt(len(values))

    left = x - horizontal_line_width / 2
    top = mean - confidence_interval
    right = x + horizontal_line_width / 2
    bottom = mean + confidence_interval
    plt.plot([x, x], [top, bottom], color=color)
    plt.plot([left, right], [top, top], color=color)
    plt.plot([left, right], [bottom, bottom], color=color)
    plt.plot(x, mean, 'o', color='#f44336')

    return mean, confidence_interval

plt.xticks([1, 2, 3, 4], ['FF', 'BF', 'FFD', 'BFD'])
plt.title('Confidence Interval')
plot_confidence_interval(1, [10, 11, 42, 45, 44])
plot_confidence_interval(2, [10, 21, 42, 45, 44])
plot_confidence_interval(3, [20, 2, 4, 45, 44])
plot_confidence_interval(4, [30, 31, 42, 45, 44])
plt.show()
x: The x value of the input.
values: An array containing the repeated values (usually measured values) of y corresponding to the value of x.
z: The critical value of the z-distribution. Using 1.96 corresponds to the critical value for 95% confidence.
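If you need the critical value for another confidence level, scipy can compute it rather than hard-coding 1.96 (a small sketch):
from scipy.stats import norm

# two-sided critical value, e.g. 0.95 -> ~1.96, 0.99 -> ~2.576
def z_critical(confidence):
    return norm.ppf(1 - (1 - confidence) / 2)

print(z_critical(0.95))   # 1.959963984540054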
Result:
For a confidence interval across categories, building on what omer sagi suggested: say we have a Pandas data frame with one column containing categories (like category 1, category 2, and category 3) and another containing continuous data (some kind of rating). Here's a function using pd.groupby() and scipy.stats to plot the difference in means across groups with confidence intervals:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def plot_diff_in_means(data: pd.DataFrame, col1: str, col2: str):
    """
    Given data, plots difference in means with confidence intervals across groups
    col1: categorical data with groups
    col2: continuous data for the means
    """
    n = data.groupby(col1)[col2].count()
    # n contains a pd.Series with the sample size for each category
    cat = list(data.groupby(col1, as_index=False)[col2].count()[col1])
    # 'cat' has the names of the categories, like 'category 1', 'category 2'
    mean = data.groupby(col1)[col2].agg('mean')
    # the average value of col2 across the categories
    std = data.groupby(col1)[col2].agg(np.std)
    se = std / np.sqrt(n)
    # standard deviation and standard error
    lower = st.t.interval(alpha=0.95, df=n - 1, loc=mean, scale=se)[0]
    upper = st.t.interval(alpha=0.95, df=n - 1, loc=mean, scale=se)[1]
    # calculates the upper and lower bounds using SciPy
    for upper, mean, lower, y in zip(upper, mean, lower, cat):
        plt.plot((lower, mean, upper), (y, y, y), 'b.-')
        # for 'b.-': 'b' means blue, '.' means dot, '-' means solid line
    plt.yticks(
        range(len(n)),
        list(data.groupby(col1, as_index=False)[col2].count()[col1])
    )
Given hypothetical data:
cat = ['a'] * 10 + ['b'] * 10 + ['c'] * 10
a = np.linspace(0.1, 5.0, 10)
b = np.linspace(0.5, 7.0, 10)
c = np.linspace(7.5, 20.0, 10)
rating = np.concatenate([a, b, c])
dat_dict = dict()
dat_dict['cat'] = cat
dat_dict['rating'] = rating
test_dat = pd.DataFrame(dat_dict)
which would look like this (but with more rows, of course):
cat  rating
a    0.10000
a    0.64444
b    0.50000
b    0.12222
c    7.50000
c    8.88889
We can use the function to plot a difference in means with a confidence interval:
plot_diff_in_means(data = test_dat, col1 = 'cat', col2 = 'rating')
which gives us the following graph:

Python Cartopy mapping

I'm trying to map CO2 levels, based on public data from NASA, on a global map, depicting the values as (long, lat, value) in a topographic-style plot. Based on the data and using the Panoply software, this is what my plot should look like:
The data is in .nc4 format and is read correctly, however I can't get the data to plot. I'm using the Cartopy API and following this example: https://scitools.org.uk/cartopy/docs/latest/gallery/waves.html#sphx-glr-gallery-waves-py.
Also, I do not want to use Basemap.
Attempt # 1:
See Python code below:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

def download_oco2_nc4(username, password, filespath):
    """Reads OCO-2 data in .nc4 format, downloaded via the link list in
    "subset_OCO2_L2_ABand_V8_20180929_010345.txt" (all data in the date
    range 2015-09-01 to 2016-01-01). Make sure that you have a valid
    username & password by registering at https://earthdata.nasa.gov/
    Implementation based on http://unidata.github.io/netcdf4-python/#section1"""
    filespath = "C:\\Users\\Desktop\\oco2\\oco2_LtCO2_150831_B8100r_171009083146s.nc4"
    dataset = Dataset(filespath)
    print(dataset.file_format)
    print(dataset.dimensions.keys())
    print(dataset.variables['xco2'])
    XCO2 = dataset.variables['xco2'][:]
    print("->", type(XCO2))
    print(dataset.variables['latitude'])
    LATITUDE = dataset.variables['latitude'][:]
    print(dataset.variables['longitude'])
    LONGITUDE = dataset.variables['longitude'][:]
    return XCO2, LONGITUDE, LATITUDE, dataset

def mapXoco2():
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.Mollweide())
    XCO2, LONGITUDE, LATITUDE, dataset = download_oco2_nc4(1, 2, 3)
    dataset.close()
    # take the first 10 values of each variable (this replaces three
    # identical counter loops in the original)
    XCO2_subset = np.array(XCO2[:10])
    print("XCO2_subset=" + str(len(XCO2_subset)))
    LONGITUDE_subset = np.array(LONGITUDE[:10])
    print("LONGITUDE_subset=" + str(len(LONGITUDE_subset)))
    LATITUDE_subset = np.array(LATITUDE[:10])
    print("LATITUDE_subset=" + str(len(LATITUDE_subset)))
    #LONGITUDE_subset, LATITUDE_subset = np.meshgrid(LONGITUDE_subset, LATITUDE_subset)
    #XCO2_subset, XCO2_subset = np.meshgrid(XCO2_subset, XCO2_subset)
    ax.contourf(LONGITUDE_subset, LATITUDE_subset, XCO2_subset,
                transform=ccrs.Mollweide(central_longitude=0, globe=None),
                cmap='nipy_spectral')
    ax.coastlines()
    ax.set_global()
    plt.show()
    print(XCO2_subset)

mapXoco2()
When I comment out these lines:
#LONGITUDE_subset, LATITUDE_subset = np.meshgrid(LONGITUDE_subset, LATITUDE_subset)
#XCO2_subset,XCO2_subset = np.meshgrid(XCO2_subset,XCO2_subset)
I get an error:
raise TypeError("Input z must be a 2D array.")
TypeError: Input z must be a 2D array.
However, when I do NOT comment out these lines:
LONGITUDE_subset, LATITUDE_subset = np.meshgrid(LONGITUDE_subset, LATITUDE_subset)
XCO2_subset, XCO2_subset = np.meshgrid(XCO2_subset, XCO2_subset)
I get an empty map: I see the continents but no plotted CO2 values.
I believe I'm interpreting the 1D-to-2D transformation of my inputs incorrectly.
Attempt #2 (updated):
Instead of dealing with why/what these 2D transformations in the API are doing, I'm plotting each point one by one in a loop. The issue is, although I can see more data (I'm only plotting about 10% of it), I CAN'T SEE THE MAP/CONTINENTS, I JUST SEE THE VALUES PLOTTED ON A WHITE BACKGROUND. See code:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

filespath = "C:\\Users\\Downloads\\oco2_LtCO2_150830_B7305Br_160712072205s.nc4"

def download_oco2_nc4(filespath):
    """Reads OCO-2 data in .nc4 format, downloaded via the link list in
    "subset_OCO2_L2_ABand_V8_20180929_010345.txt" (all data in the date
    range 2015-09-01 to 2016-01-01). Make sure that you have a valid
    username & password by registering at https://earthdata.nasa.gov/
    Implementation based on http://unidata.github.io/netcdf4-python/#section1"""
    dataset = Dataset(filespath)
    print("file format:" + str(dataset.file_format))
    print("dimensions.keys():" + str(dataset.dimensions.keys()))
    print("variables['xco2']:" + str(dataset.variables['xco2']))
    XCO2 = dataset.variables['xco2'][:]
    print("->", type(XCO2))
    print(dataset.variables['latitude'])
    LATITUDE = dataset.variables['latitude'][:]
    print(dataset.variables['longitude'])
    LONGITUDE = dataset.variables['longitude'][:]
    return XCO2, LONGITUDE, LATITUDE, dataset

def sample_slices(arr):
    """Each array holds over 80,000 points, so mapping everything is too
    slow, and because OCO-2 gathers data along a trajectory, the first 10%
    (or whatever percent) of the rows would not represent the overall data
    well. Instead we sample 1000-point slices from up to eight positions
    across the data. (This replaces the original's eight repeated if-blocks
    per variable; note that '+' on numpy arrays adds element-wise, so
    concatenation is what was actually intended.)"""
    starts = [0, 20000, 30000, 40000, 50000, 60000, 70000, 80000]
    slices = [arr[s:s + 1000] for s in starts if len(arr) >= s + 1000]
    return np.concatenate(slices)

def mapXoco2():
    XCO2, LONGITUDE, LATITUDE, dataset = download_oco2_nc4(filespath)
    dataset.close()
    sampled_xco2 = sample_slices(np.array(XCO2))
    sampled_LONGITUDE = sample_slices(np.array(LONGITUDE))
    sampled_LATITUDE = sample_slices(np.array(LATITUDE))
    ax = plt.axes(projection=ccrs.Mollweide())
    #plt.contourf(LONGITUDE_subset, LATITUDE_subset, XCO2_subset, 60, transform=ccrs.PlateCarree())
    for long, lat, value in zip(sampled_LONGITUDE, sampled_LATITUDE, sampled_xco2):
        if value >= 0 and value < 370:
            color = 'blue'
        elif value >= 370 and value < 390:
            color = 'cyan'
        elif value >= 390 and value < 402:
            color = 'yellow'
        elif value >= 402 and value < 410:
            color = 'orange'
        elif value >= 410 and value < 415:
            color = 'red'
        else:
            color = 'brown'
        ax.plot(long, lat, marker='o', color=color, markersize=1, transform=ccrs.PlateCarree())
    ax.coastlines()
    plt.show()

mapXoco2()
Output:
file format:NETCDF4
dimensions.keys():odict_keys(['sounding_id', 'levels', 'bands', 'vertices', 'epoch_dimension', 'source_files'])
variables['xco2']:
float32 xco2(sounding_id)
    units: ppm
    long_name: XCO2
    missing_value: -999999.0
    comment: Column-averaged dry-air mole fraction of CO2 (includes bias correction)
unlimited dimensions:
current shape = (82776,)
filling on, default _FillValue of 9.969209968386869e+36 used
->
float32 latitude(sounding_id)
    units: degrees_north
    long_name: latitude
    missing_value: -999999.0
    comment: center latitude of the measurement
unlimited dimensions:
current shape = (82776,)
filling on, default _FillValue of 9.969209968386869e+36 used
float32 longitude(sounding_id)
    units: degrees_east
    long_name: longitude
    missing_value: -999999.0
    comment: center longitude of the measurement
unlimited dimensions:
current shape = (82776,)
filling on, default _FillValue of 9.969209968386869e+36 used
1) What happened to the map & continents?
Thanks & any useful help appreciated.
It looks like your transform argument is incorrect. If you have latitude/longitude data the value of the transform argument should be ccrs.PlateCarree(). See this guide in the cartopy documentation for details: https://scitools.org.uk/cartopy/docs/latest/tutorials/understanding_transform.html.
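A minimal runnable sketch of the idea, with synthetic stand-in data since I don't have the OCO-2 file (the variable names here are hypothetical):
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# synthetic stand-in for a gridded XCO2 field
lon2d, lat2d = np.meshgrid(np.linspace(-180, 180, 73), np.linspace(-90, 90, 37))
xco2_2d = 400 + 10 * np.cos(np.radians(lat2d))  # fake ppm values

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.Mollweide())  # how the map is drawn
# transform declares the coordinate system the *data* is in:
# plain longitude/latitude degrees -> ccrs.PlateCarree()
ax.contourf(lon2d, lat2d, xco2_2d, transform=ccrs.PlateCarree(), cmap='nipy_spectral')
ax.coastlines()
ax.set_global()
plt.show()
For the 1D sounding data, ax.scatter(lon, lat, c=xco2, s=1, transform=ccrs.PlateCarree()) should also be much faster than plotting each point individually in a loop.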
I cannot run your example to verify this is the solution. To get the most out of Stack Overflow you should provide a minimal working example that others can run themselves. See https://stackoverflow.com/help/mcve and https://stackoverflow.com/help/how-to-ask for tips.

Recording headers in text file and making plots with subsequent data

I am having trouble parsing a text file that I created with another program. The text file looks something like this:
velocity 4
0 0
0.0800284750334461 0.0702333599787275
0.153911082737118 0.128537103048848
0.222539323234924 0.176328826156044
0.286621942300277 0.21464146333504
0.346732028739683 0.244229944930359
0.403339781262399 0.265638972071027
...
velocity 8
0 0
0.169153136373962 0.124121036173475
0.312016311613761 0.226778846267302
0.435889653693839 0.312371513797743
0.545354054604357 0.383832483710643
0.643486956562741 0.443203331839287
...
I want to grab the number in the same row as velocity (the header) and save it as the title of the plot of the subsequent data. Every other row apart from the header represents the x and y coordinates of a shooting ball.
So if I have five different headers, I would like to see five different lines on a single graph with a legend displaying the different velocities.
Here is my Python code so far. I am close to what I want to get, but I am missing the first set of data (velocity = 4 m/s), and the colors in my legend don't match the line colors.
import matplotlib.pyplot as plt

xPoints = []
yPoints = []
fig, ax = plt.subplots()
with open('artilleryMotion.txt') as inf:
    for line in inf:
        column = line.split()
        if line.startswith("v"):
            velocity = column[1]
            ax.plot(xPoints, yPoints, label='%s m/s' % velocity)
        else:
            xPoints.append(column[0])
            yPoints.append(column[1])
ax.legend()
plt.title("Ping-Pong Ball Artillery Motion")
plt.xlabel("distance")
plt.ylabel("height")
plt.ylim(ymin=0)
ax.set_autoscaley_on(1)
I have been struggling with this for a while.
Edit_1: This is my output at the moment:
Artillery motion plot
Edit_2: I removed the indentation of the last lines of code. The color problem still occurs.
Edit_3: How would I go about saving the x and y points to a new array for each velocity? This may solve my issues.
Edit_4: Thanks to Charles Morris, I was able to create these plots. I just need to determine whether the initial upwards "arcing" motion of the ping-pong ball at the higher velocities is representative of the physics or a limitation of my code.
Artillery Motion Final
Edit: Ignore the old information, and see Solved solution below:
The following code works on an example text file, input.txt:
velocity 4
0 0
0.0800284750334461 0.0702333599787275
0.153911082737118 0.128537103048848
0.222539323234924 0.176328826156044
0.286621942300277 0.21464146333504
0.346732028739683 0.244229944930359
0.403339781262399 0.265638972071027
velocity 8
0 0
0.169153136373962 0.124121036173475
0.312016311613761 0.226778846267302
0.435889653693839 0.312371513797743
0.545354054604357 0.383832483710643
0.643486956562741 0.443203331839287
1) Import our text file
We use np.genfromtxt() for the import. In this case, we can specify dtype=float. This has the effect that numbers are imported as floats, and thus strings (in this case 'velocity') are imported as NaN.
Source:
https://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
import numpy as np
from matplotlib import pyplot as plt
from itertools import groupby

A = np.genfromtxt('input.txt', dtype=float)
>>>
array([[ nan, 4. ],
[ 0. , 0. ],
[ 0.08002848, 0.07023336],
[ 0.15391108, 0.1285371 ],
[ 0.22253932, 0.17632883],
[ 0.28662194, 0.21464146],
[ 0.34673203, 0.24422994],
[ 0.40333978, 0.26563897],
[ nan, 8. ],
[ 0. , 0. ],
[ 0.16915314, 0.12412104],
[ 0.31201631, 0.22677885],
[ 0.43588965, 0.31237151],
[ 0.54535405, 0.38383248],
[ 0.64348696, 0.44320333]])
2) Slice the imported array A
We can slice these arrays into separate X and Y arrays representing our X and Y values. Read up on array slicing in numpy here: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
In this case, we take all values with index = 0 (X) and all values with index 1 (Y):
X = A[:, 0]  # x values
Y = A[:, 1]  # y values
>>> X = array([ nan, 0. , 0.08002848, 0.15391108, 0.22253932,
0.28662194, 0.34673203, 0.40333978, nan, 0. ,
0.16915314, 0.31201631, 0.43588965, 0.54535405, 0.64348696])
>>> Y = array([ 4. , 0. , 0.07023336, 0.1285371 , 0.17632883,
0.21464146, 0.24422994, 0.26563897, 8. , 0. ,
0.12412104, 0.22677885, 0.31237151, 0.38383248, 0.44320333])
3) Split the data for each velocity.
Here we want to separate our X and Y values into those for each velocity. Our X values are separated by nan and our Y values are separated by 4, 8, 16, ....
Thus, for X we split on nan; nan is a result of genfromtxt() parsing 'velocity' as a float and returning nan.
Sources:
numpy: split 1D array of chunks separated by nans into a list of the chunks
Split array at value in numpy
For Y, we split the array on the numbers 4, 8, 16, etc. To do this, we exclude values (other than zero) that, when divided by 4, have zero remainder (using the % Python operator).
Sources:
Split array at value in numpy
How to check if a float value is a whole number
Split NumPy array according to values in the array (a condition)
Find the division remainder of a number
How do I use Python's itertools.groupby()?
XX = [list(v) for k,v in groupby(X,np.isfinite) if k]
YY = [list(v) for k,v in groupby(Y,lambda x: x % 4 != 0 or x == 0) if k]
>>>
XX = [[0.0,
0.080028475033446095,
0.15391108273711801,
0.22253932323492401,
0.28662194230027699,
0.34673202873968301,
0.403339781262399],
[0.0,
0.16915313637396201,
0.31201631161376098,
0.43588965369383897,
0.54535405460435704,
0.64348695656274102]]
>>> YY =
[[0.0,
0.070233359978727497,
0.12853710304884799,
0.17632882615604401,
0.21464146333504,
0.24422994493035899,
0.26563897207102699],
[0.0,
0.124121036173475,
0.22677884626730199,
0.31237151379774297,
0.38383248371064299,
0.44320333183928701]]
4) Extract labels
Using a similar technique as above, we accept values equal to our velocities 4, 8, 16, etc. In this case, we accept only those numbers which, when divided by 4, have zero remainder and are not 0. We then convert them to strings and append ' m/s'.
Ylabels = [list(v) for k,v in groupby(Y,lambda x: x % 4 == 0 and x != 0) if k]
Velocities = [str(i[0]) + ' m/s' for i in Ylabels]
>>> Ylabels = [[4.0], [8.0]]
>>> Velocities = ['4.0 m/s', '8.0 m/s']
5) Plot
Plot values by index for each velocity.
fig, ax = plt.subplots()
for i in range(0, len(XX)):
    plt.plot(XX[i], YY[i], label=Velocities[i])
ax.legend()
plt.title("Ping-Pong Ball Artillery Motion")
plt.xlabel("distance")
plt.ylabel("height")
plt.ylim(ymin=0)
ax.set_autoscaley_on(1)
Code Altogether:
import numpy as np
from matplotlib import pyplot as plt
from itertools import groupby

A = np.genfromtxt('input.txt', dtype=float)
X = A[:, 0]
Y = A[:, 1]
Ylabels = [list(v) for k, v in groupby(Y, lambda x: x % 4 == 0 and x != 0) if k]
Velocities = [str(i[0]) + ' m/s' for i in Ylabels]
XX = [list(v) for k, v in groupby(X, np.isfinite) if k]
YY = [list(v) for k, v in groupby(Y, lambda x: x % 4 != 0 or x == 0) if k]
fig, ax = plt.subplots()
for i in range(0, len(XX)):
    plt.plot(XX[i], YY[i], label=Velocities[i])
ax.legend()
plt.title("Ping-Pong Ball Artillery Motion")
plt.xlabel("distance")
plt.ylabel("height")
plt.ylim(ymin=0)
ax.set_autoscaley_on(1)
plt.show()
Old Answer:
The first time you iterate over all the lines in the file, your xPoints and yPoints arrays are empty. Therefore, when you try to plot values for v = 4, you are plotting an empty array - hence your missing line.
You need to populate the arrays first and then plot them. At the moment, you are plotting the values for v = 4 on the line labelled v = 8, the values for v = 8 on the line labelled v = 16, and so on.
Ignore:
For the array population, try the following:
xPoints = []
yPoints = []
with open('artilleryMotion.txt') as inf:
    # initialize placeholder velocity variable
    velocity = 0
    for line in inf:
        column = line.split()
        if line.startswith("v"):
            velocity = column[1]
        else:
            xPoints.append({velocity: column[0]})
            yPoints.append({velocity: column[1]})
In the above, you save the data as a list of dictionaries (separate lists for x and y points), where the key is equal to the velocity that has been read in most recently, and the values are the x and y coordinates.
As a new velocity is read in, the placeholder variable velocity is updated, and so the x and y values can be identified according to the key that they have.
This allows you to separate your plots by dictionary key (look up D.items(), or D.iteritems() on Python 2) and plot each set of points individually.
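For instance, a minimal sketch of that idea using a defaultdict of lists per velocity instead of a list of dicts (assuming the artilleryMotion.txt format from the question):
from collections import defaultdict
import matplotlib.pyplot as plt

xPoints = defaultdict(list)
yPoints = defaultdict(list)
with open('artilleryMotion.txt') as inf:
    velocity = None
    for line in inf:
        column = line.split()
        if line.startswith('v'):
            velocity = column[1]
        elif column:
            xPoints[velocity].append(float(column[0]))
            yPoints[velocity].append(float(column[1]))

fig, ax = plt.subplots()
for velocity in xPoints:
    ax.plot(xPoints[velocity], yPoints[velocity], label='%s m/s' % velocity)
ax.legend()
plt.show()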

Is there a numpy builtin to reject outliers from a list

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.
import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered
>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2,4,5,1,6,5]
I say 'something like' because the function might allow for varying distributions (poisson, gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).
Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but e.g. the median will be much less so.
Building on eumiro's answer:
def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m]
Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.
Note that for the data[s<m] syntax to work, data must be a numpy array.
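For instance, on the question's data (note how strict the scaled-MAD test is on such a tiny sample):
import numpy as np

d = np.array([2., 4., 5., 1., 6., 5., 40.])
print(reject_outliers(d))   # -> [4. 5. 6. 5.]
# 1, 2 and 40 all lie more than 2 scaled MADs from the median (5),
# so with m=2 all three are rejected on this small sample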
This method is almost identical to yours, just more numpy-ish (also working on numpy arrays only):
def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
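For comparison, on the same small sample this mean/std version keeps everything except 40:
d = np.array([2., 4., 5., 1., 6., 5., 40.])
print(reject_outliers(d))   # -> [2. 4. 5. 1. 6. 5.]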
Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases like the one given in the example below.
def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]
Example:
data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))
Gives:
[10 10 10 17 10 10]  # 17 is not filtered
[10 10 10 10 10]     # 17 is filtered (its scaled distance, 7, is greater than m)
Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:
def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    return sr[(sr - median).abs() <= iqr]
For instance, if you set iq_range=0.6, the percentiles of the interquartile-range would become: 0.20 <--> 0.80, so more outliers will be included.
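A quick check on the question's data (as a pandas Series):
import pandas as pd

sr = pd.Series([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(sr).tolist())                # [4, 5, 6, 5]
print(reject_outliers(sr, iq_range=0.6).tolist())  # [2, 4, 5, 6, 5]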
An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).
As a simple example:
import numpy as np
# Create some random numbers
x = np.random.normal(5, 2, 1000)
# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500
# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)
rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)
The output I get is:
Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694
Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462
Which is close to the expected value of 2.
If we want to remove points above/below 5 standard deviations (with 1000 points we would expect 1 value > 3 standard deviations):
y = x[abs(x - p50) < rSig*5]
# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))
Which gives:
Mean= 4.99755359935
Median= 4.95213030447
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.97692712883
I have no idea which approach is the more efficient/robust one.
I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since if you remove it you change the length, which can mess up plotting (i.e., if you're only removing outliers from one column in a table, but you need it to remain the same length as the other columns so you can plot them against each other).
To do so I used numpy's masking functions:
def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
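Since a masked array is returned, compressed() gives back only the unmasked values (a quick example on the question's data):
data = np.array([2., 4., 5., 1., 6., 5., 40.])
masked = reject_outliers(data)
print(masked.compressed())   # -> [2. 4. 5. 1. 6. 5.]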
I would like to provide two methods in this answer: a solution based on the "z score" and a solution based on the "IQR".
The code provided in this answer works on both single-dim and multi-dim numpy arrays.
Let's import some modules firstly.
import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr
z score based method
This method tests if a number falls outside three standard deviations. Based on this rule, if the value is an outlier, the method returns True; if not, it returns False.
def sd_outlier(x, axis=None, bar=3, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    d_z = stat.zscore(x, axis=axis)
    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar
IQR based method
This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.
def q1(x, axis=None):
    return np.percentile(x, 25, axis=axis)

def q3(x, axis=None):
    return np.percentile(x, 75, axis=axis)

def iqr_outlier(x, axis=None, bar=1.5, side='both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
    d_iqr = iqr(x, axis=axis)
    d_q1 = q1(x, axis=axis)
    d_q3 = q3(x, axis=axis)
    iqr_distance = np.multiply(d_iqr, bar)
    stat_shape = list(x.shape)
    if isinstance(axis, collections.abc.Iterable):  # collections.Iterable before Python 3.10
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1
    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)
    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)
Finally, if you want to filter out the outliers, use a numpy selector.
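For example, on a 1D array (assuming numpy is imported as above):
x = np.array([2., 4., 5., 1., 6., 5., 40.])
keep = ~iqr_outlier(x, axis=0)   # boolean mask of inliers
print(x[keep])                   # -> [2. 4. 5. 1. 6. 5.]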
Have a nice day.
Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.
(Similar to how the average calculation fails in that case and one should rather calculate the median; the standard deviation is even more prone to such errors than the average.)
You could try to iteratively apply your algorithm, or you can filter using the interquartile range:
(here "factor" relates to an n*sigma range, which holds only when your data follows a Gaussian distribution)
import numpy as np

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn if (medData - factor * iqrSigma < x < medData + factor * iqrSigma)]
    return dataOut
So many answers, but I'm adding a new one that can be useful for the author or other users.
You could use the Hampel filter. Note that you need to work with a Series.
The Hampel filter returns the outliers' indices; you can then delete them from the Series and convert it back to a list.
To use Hampel filter, you can easily install the package with pip:
pip install hampel
Usage:
# Imports
from hampel import hampel
import pandas as pd
list_d = [2, 4, 5, 1, 6, 5, 40]
# List to Series
time_series = pd.Series(list_d)
# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)
# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)
filtered_d.values.tolist()
print(f'filtered_d: {filtered_d.values.tolist()}')
And the output will be:
filtered_d: [2, 4, 5, 1, 6, 5]
Here, ts is a pandas Series object and window_size defines the window: the total window size is computed as 2 * window_size + 1.
For this Series I set window_size to 3.
The cool thing about working with Series is being able to generate graphics:
# Imports
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()
And the output will be:
To learn more about Hampel filter, I recommend the following readings:
Python implementation of the Hampel Filter
Outlier Detection with Hampel Filter
Clean-up your time series data with a Hampel Filter
If you want to get the index positions of the outliers, idx_list will return them.
def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else np.zeros(len(d))
    data_range = np.arange(len(data))
    idx_list = data_range[s >= m]
    return data[s < m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])
print(reject_outliers(data_points))
data_points = np.array([8, 10, 35, 17, 73, 77])
print(reject_outliers(data_points))
after rejection: [ 8 10 35 17], index positions of outliers: [4 5]
For a set of images (each image having 3 dimensions), where I wanted to reject outliers for each pixel, I used:
mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)
Then it is possible to compute the mean:
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))
(I use it for Background Subtraction)
Here I find the outliers in x and substitute them with the median of a window of points (win) around them (taking the median deviation from Benjamin Bannier's answer):
def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev: median deviation]
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m * mdev)[0]
    for i in idxs_outliers:
        if i - win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i + win + 1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()
    return x_corr
Trim outliers in a numpy array along an axis and replace them with the min or max values along this axis, whichever is closer. The threshold is a z-score:
def np_z_trim(x, threshold=10, axis=0):
    """ Replace outliers in a numpy ndarray along the given axis with the
    min or max values within the threshold along this axis, whichever is closer."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min = np.nanmin(masked, axis=axis, keepdims=True)
    max = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max) < np.abs(x - min), max, min)
    return np.where(np.isnan(masked), repl, masked)
My solution drops the top and bottom percentiles, keeping values that are equal to the boundary:
def remove_percentile_outliers(data, percent_to_drop=0.001):
    low, high = data.quantile([percent_to_drop / 2, 1 - percent_to_drop / 2])
    return data[(data >= low) & (data <= high)]
My solution sets the outliers equal to the previous value.
test_data = [2, 4, 5, 1, 6, 5, 40, 3]

def reject_outliers(data, m=2):
    mean = np.mean(data)
    std = np.std(data)
    for i in range(len(data)):
        if np.abs(data[i] - mean) > m * std:
            data[i] = data[i - 1]
    return data

reject_outliers(test_data)
Output:
[2, 4, 5, 1, 6, 5, 5, 3]
