Not getting Elevation with Pydeck - python

I have a dataset like the following, and I want to represent the column n_cases in a pydeck layer map:
Table
I am doing it this way, but I never get any elevation in the hexagons. I have tried many elevation ranges and elevation scales:
r = pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=provincias_map['latitude'].mean(),
        longitude=provincias_map['longitude'].mean(),
        zoom=5,
        pitch=45),
    layers=[
        pdk.Layer(
            'HexagonLayer',
            data=provincias_map,
            get_position='[longitude, latitude]',
            get_elevation='n_cases',
            radius=18000,
            elevation_scale=20,
            elevation_range=[0, 1780000],
            pickable=True,
            extruded=True,
            coverage=1,
        )])
map

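HexagonLayer aggregates the points that fall into each hexagon and, by default, derives the elevation from the number of points per hexagon rather than from a per-row get_elevation accessor. Set get_elevation_weight to 'n_cases' so that the elevation of each hexagon is computed from the n_cases values of the points it contains: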
max_range = int(provincias_map['n_cases'].max())
r = pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=provincias_map['latitude'].mean(),
        longitude=provincias_map['longitude'].mean(),
        zoom=5,
        pitch=45),
    layers=[
        pdk.Layer(
            'HexagonLayer',
            data=provincias_map,
            get_position='[longitude, latitude]',
            get_elevation_weight='n_cases',
            radius=18000,
            elevation_scale=1,
            elevation_range=[0, max_range],
            pickable=True,
            extruded=True,
            coverage=1,
        )])
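As a rough illustration of why the original call showed no height differences: as far as I understand the deck.gl documentation, HexagonLayer aggregates a weight per hexagon (the point count when no weight accessor is given), rescales that aggregate from its observed domain onto elevation_range, and multiplies the result by elevation_scale. The sketch below mimics that mapping in plain Python. The function name bin_height, the sample numbers, and the handling of a zero-width domain are assumptions made for illustration; this is not pydeck code.

```python
def bin_height(value, domain, elevation_range, elevation_scale=1.0):
    """Linearly map an aggregated bin value onto elevation_range, then scale it."""
    lo, hi = domain
    out_lo, out_hi = elevation_range
    if hi == lo:
        # Assumption: with a zero-width domain every bin gets the same height.
        return out_lo * elevation_scale
    fraction = (value - lo) / (hi - lo)
    return (out_lo + fraction * (out_hi - out_lo)) * elevation_scale


# With one row per province, each occupied hexagon typically holds one point,
# so counting points yields the same value (and height) for every bin.
counts = [1, 1, 1]
count_heights = [bin_height(c, (min(counts), max(counts)), (0, 1000)) for c in counts]

# Aggregating a weight column (hypothetical n_cases values) gives distinct heights.
n_cases = [120, 4500, 900]
weighted_heights = [bin_height(v, (min(n_cases), max(n_cases)), (0, 1000)) for v in n_cases]
```

Under these assumptions, changing elevation_range or elevation_scale only stretches whatever differences already exist between bins, which would explain why tuning them alone did not help.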


Recognition of a plateau with a slope close to zero

I am writing code to remove plateau outliers from time series data. I followed advice to use np.diff, but that approach fails to recognize a plateau when its values are not exactly equal.
def find_plateaus(F, min_length=200, tolerance=0.75, smoothing=15):
    import numpy as np
    from scipy.ndimage import uniform_filter1d
    # calculate smooth gradients
    smoothF = uniform_filter1d(F, size=smoothing)
    dF = uniform_filter1d(np.gradient(smoothF), size=smoothing)
    d2F = uniform_filter1d(np.gradient(dF), size=smoothing)
    def zero_runs(x):
        iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
        absdiff = np.abs(np.diff(iszero))
        ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
        return ranges
    # Find ranges where second derivative is zero
    # Values under eps are assumed to be zero.
    eps = np.quantile(abs(d2F), tolerance)
    smalld2F = (abs(d2F) <= eps)
    # Find repetitions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
    p = zero_runs(np.diff(smalld2F))
    # np.diff(p) gives the length of each range found.
    # only accept plateaus of min_length
    plateaus = p[(np.diff(p) > min_length).flatten()]
    return plateaus
plateaus = find_plateaus(test, min_length=5, tolerance=0.02, smoothing=11)
plateaus = np.ravel(plateaus, order='A')
plateaus = plateaus.tolist()
print(plateaus)
test2['T&F'] = np.nan
for i in test2.index:
    if i in plateaus:
        test2.loc[i, ['T&F']] = test2.loc[i, 'data']
    else:
        test2.loc[i, ['T&F']] = 0
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(test2.index, test2['data'], color='black', label='time_series')
ax.scatter(test2.index, test2['T&F'], color='red', label='D910')
plt.legend()
plt.show()
Do you know any libraries or methods that can be used?
I want to recognize the parts marked in the picture below.
This is still in progress, but I found an answer.
First, reshape the NumPy array into a multidimensional array of windows.
ex) time_step = 3
.....
Then use np.std() to compute the standard deviation of each window.
After inspecting the values, set a standard deviation threshold; windows whose standard deviation falls within it are recognized as plateaus.
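A minimal sketch of the windowed standard deviation idea described above, assuming overlapping windows and a fixed threshold. The name plateau_mask, the max_std value, and the synthetic signal are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def plateau_mask(series, time_step=3, max_std=0.05):
    """Mark samples that belong to at least one low-variance window."""
    values = np.asarray(series, dtype=float)
    # Shape (n - time_step + 1, time_step): one row per overlapping window.
    windows = sliding_window_view(values, time_step)
    flat = windows.std(axis=1) <= max_std
    mask = np.zeros(values.shape[0], dtype=bool)
    for start in np.flatnonzero(flat):
        mask[start:start + time_step] = True
    return mask


# Synthetic example: a ramp, a slightly noisy plateau, then another ramp.
signal = np.concatenate([
    np.linspace(0.0, 5.0, 10),
    5.0 + np.array([0.01, -0.02, 0.0, 0.02, -0.01, 0.01]),
    np.linspace(6.0, 10.0, 8),
])
mask = plateau_mask(signal, time_step=3, max_std=0.05)
```

Because the test is on the spread within each window rather than on exact equality, small fluctuations around a level still count as a plateau. The threshold has to be tuned to the noise level of the data.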

Plotly express choropleth map custom color_continuous_scale

I am trying to create custom coloring for an animated choropleth map. I am using Plotly Express and my dataframe looks like this.
where I am plotting the values for each region (region code = K_KRAJ, region name = N_KRAJ), and the animation runs over the variables.
The values are percentages expressed as fractions, so the minimum is 0 and the maximum is 1. I want to divide the colors into 6 bands with boundaries exactly as written here in color_continuous_scale:
fig = px.choropleth(
    df_anim,
    locations="K_KRAJ",
    featureidkey="properties.K_KRAJ",
    geojson=regions_json,
    color="value",
    hover_name="N_KRAJ",
    color_continuous_scale=[
        (0.0, "#e5e5e5"), (0.0001, "#e5e5e5"),
        (0.0001, "#ffe5f0"), (0.0075, "#ffe5f0"),
        (0.0075, "#facfdf"), (0.01, "#facfdf"),
        (0.01, "#f3b8ce"), (0.025, "#f3b8ce"),
        (0.025, "#eca2bf"), (0.05, "#eca2bf"),
        (0.05, "#e37fb1"), (1, "#e37fb1"),
    ],
    animation_frame="variable",
)
fig.update_geos(fitbounds="locations", visible=False)
fig.show()
Unfortunately, that creates a wrong map like this
instead of a map like this
The second map, which is almost correct, was created by treating the largest value as 100% and computing the breakpoints relative to it. Even though this is very close to being correct, it is prone to numerical mistakes, and I would rather use the code shown above if it worked correctly.
The almost correct one was created like this (the max value was 0.06821107602623269):
color_continuous_scale=[
    (0.0, "#e5e5e5"), (0.001449275362, "#e5e5e5"),  # 0.01% , 0.0001
    (0.01449275362, "#ffe5f0"), (0.1086956522, "#ffe5f0"),  # 0.75% , 0.0075
    (0.1086956522, "#facfdf"), (0.1449275362, "#facfdf"),  # 1% , 0.01
    (0.1449275362, "#f3b8ce"), (0.3623188406, "#f3b8ce"),  # 2.5% , 0.025
    (0.3623188406, "#eca2bf"), (0.7246376812, "#eca2bf"),  # 5% , 0.05
    (0.7246376812, "#e37fb1"), (1, "#e37fb1"),  # 6.9% , 0.069
],
Even better, does anyone know how to change the numbers on the colorbar shown on the right of the images from fractions to percentages (0.05 -> 5%)?
If I add range_color=(0, 1), the colors are correct, but then the colorbar on the right is useless.
color_continuous_scale is a Plotly Express construct that is not limited to choropleths, so the technique presented here is a general way to build a color scale.
I cannot find a reproducible source of Czech region geometry, so the code below does not work as an MWE unless you have the geometry in your own downloads folder.
Core solution:
since you want six bins, start with pd.cut() to get the bin edges
scale the edges to lie between 0 and 1 so they work with color scales
construct a colorscale with hard edges
edges = pd.cut(df_anim["value"], bins=5, retbins=True)[1]
edges = edges[:-1] / edges[-1]
colors = ["#e5e5e5", "#ffe5f0", "#facfdf", "#f3b8ce", "#eca2bf", "#e37fb1"]
cc_scale = (
    [(0, colors[0])]
    + [(e, colors[(i + 1) // 2]) for i, e in enumerate(np.repeat(edges, 2))]
    + [(1, colors[5])]
)
from pathlib import Path
import geopandas as gpd
import pandas as pd
import numpy as np
import plotly.express as px
# simulate source data
gdf = gpd.read_file(
    list(Path.home().joinpath("Downloads/WGS84").glob("*KRAJ*.shp"))[0]
).set_crs("epsg:4326")
gdf["geometry"] = gdf.to_crs(gdf.estimate_utm_crs()).simplify(2000).to_crs(gdf.crs)
regions_json = gdf.__geo_interface__
df = (
    pd.json_normalize(regions_json["features"])
    .pipe(lambda d: d.loc[:, [c.strip() for c in d.columns if c[0:3] == "pro"]])
    .rename(columns={"properties.ID": "K_KRAJ", "properties.NAZEV_NUTS": "N_KRAJ"})
)
df_anim = df.merge(
    pd.DataFrame(
        {"variable": [f"REL{n1}{n2}" for n1 in range(15, 21) for n2 in ["06", "12"]]}
    ),
    how="cross",
).pipe(lambda d: d.assign(value=np.random.uniform(0, 0.003, len(d))))
# end data simulation
edges = pd.cut(df_anim["value"], bins=5, retbins=True)[1]
edges = edges[:-1] / edges[-1]
colors = ["#e5e5e5", "#ffe5f0", "#facfdf", "#f3b8ce", "#eca2bf", "#e37fb1"]
cc_scale = (
    [(0, colors[0])]
    + [(e, colors[(i + 1) // 2]) for i, e in enumerate(np.repeat(edges, 2))]
    + [(1, colors[5])]
)
fig = px.choropleth(
    df_anim,
    locations="K_KRAJ",
    featureidkey="properties.ID",  ### ! changed !
    geojson=regions_json,
    color="value",
    hover_name="N_KRAJ",
    color_continuous_scale=cc_scale,
    animation_frame="variable",
)
fig.update_geos(fitbounds="locations", visible=False)
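If the band boundaries must sit at fixed data values (0.0001, 0.0075, ...) rather than at equal-width bins, they can be converted to colorscale positions programmatically. Positions in color_continuous_scale are fractions of the color axis range, not data values, which is why the thresholds from the question cannot be passed in directly. The helper below is a sketch; stepped_colorscale and the vmax value are hypothetical names and numbers chosen for illustration.

```python
def stepped_colorscale(thresholds, colors, vmin, vmax):
    """Build a hard-edged Plotly colorscale from absolute thresholds.

    thresholds: increasing interior cut points in data units
    colors: one color per band, so len(colors) == len(thresholds) + 1
    vmin, vmax: the data range the color axis will span (e.g. range_color)
    """
    if len(colors) != len(thresholds) + 1:
        raise ValueError("need exactly one more color than thresholds")
    span = vmax - vmin
    scale = [(0.0, colors[0])]
    for cut, below, above in zip(thresholds, colors[:-1], colors[1:]):
        pos = (cut - vmin) / span
        # Two stops at the same position create a hard edge instead of a gradient.
        scale.append((pos, below))
        scale.append((pos, above))
    scale.append((1.0, colors[-1]))
    return scale


colors = ["#e5e5e5", "#ffe5f0", "#facfdf", "#f3b8ce", "#eca2bf", "#e37fb1"]
thresholds = [0.0001, 0.0075, 0.01, 0.025, 0.05]
vmax = 0.069  # hypothetical upper bound; use the real maximum of the data
cc_scale = stepped_colorscale(thresholds, colors, vmin=0.0, vmax=vmax)
```

Passing color_continuous_scale=cc_scale together with range_color=(0.0, vmax) to px.choropleth should keep the stops aligned with the thresholds. For percentage labels on the colorbar, fig.update_layout(coloraxis_colorbar_tickformat=".1%") is worth trying, since the colorbar accepts d3-style tick formats.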

How to set widgets to link to array for jupyternotebooks

I am trying to set an interactive notebook up that plots some interpolated GPS data. I have the plotting working by itself, but I am trying to use the ipython widgets to make it more interactive for others.
Currently, my plotting looks like this
def create_grid(array, spacing=.01):
    '''
    creates evenly spaced grid from the min and max of an array
    '''
    grid = np.arange(np.amin(array), np.amax(array), spacing)
    return grid
def interpolate(x, y, z, grid_spacing=.01, model='spherical', returngrid=False):
    '''Interpolates z value and uses create_grid to create a grid of values based on min and max of x and y'''
    grid_x = create_grid(x, spacing=grid_spacing)
    grid_y = create_grid(y, spacing=grid_spacing)
    OK = OrdinaryKriging(x, y, z, variogram_model=model, verbose=False,
                         enable_plotting=False, nlags=20)
    z1, ss1 = OK.execute('grid', grid_x, grid_y, mask=False)
    print('Interpolation Complete')
    vals = np.ma.getdata(z1)
    sigma = np.ma.getdata(ss1)
    if returngrid == False:
        return vals, sigma
    else:
        return vals, sigma, grid_x, grid_y
mesh_x, mesh_y = np.meshgrid(grid_x, grid_y)
plot = plt.scatter(mesh_x, mesh_y, c=z1, cmap=cm.hsv)
cb = plt.colorbar(plot)
cb.set_label('Northing Change')
plt.show()
This works currently, but I am trying to set up widgets to change the variogram model used in the kriging interpolation, as well as the field to be interpolated.
Currently, to do that I have:
def update_plot(zfield, variogram):
    plt.clf()
    z1, ss1, grid_x, grid_y = interpolate(lon, lat, zfield, returngrid=True, model=variogram)
    mesh_x, mesh_y = np.meshgrid(grid_x, grid_y)
    plot = plt.scatter(mesh_x, mesh_y, c=z1, cmap=cm.hsv)
    cb = plt.colorbar(plot)
    cb.set_label('Interpolated Value')
variogram = widgets.Dropdown(options=['linear', 'power', 'gaussian', 'spherical', 'exponential', 'hole-effect'],
                             value='spherical', description="Variogram model for interpolation")
zfield = widgets.Dropdown(options={'Delta N': delta_n, 'Delta E': delta_e, 'Delta V': delta_v}, value='Delta N',
                          description='Interpolated value')
widgets.interactive(update_plot, variogram=variogram, zfield=zfield)
Which raises the error
TraitError: Invalid selection: value not found
The values delta_n, delta_e and delta_v are NumPy arrays. I have looked at the documentation, but it is not as detailed as, say, matplotlib's documentation, so I feel like I am flying blind here.
Thank you
In this line, you specify the possible values of the Dropdown as:
zfield = widgets.Dropdown(options = {'Delta N':delta_n, 'Delta E': delta_e,'Delta V':delta_v}
When a mapping is used, the values of the dict are interpreted as the possible options. So value = 'Delta N' causes an error as this is not one of the possible values of the Dropdown (although it is one of the keys in the mapping dict). I believe you want value = delta_n instead.
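An alternative sketch, in case array-valued options turn out to be awkward to work with: keep only string labels in the Dropdown and look the array up inside the callback. The arrays below are hypothetical stand-ins for delta_n, delta_e and delta_v, and select_field is an illustrative helper, not part of ipywidgets.

```python
import numpy as np

# Hypothetical stand-ins for the question's delta_n, delta_e and delta_v arrays.
fields = {
    "Delta N": np.array([0.1, 0.2, 0.3]),
    "Delta E": np.array([1.0, 1.5, 2.0]),
    "Delta V": np.array([-0.5, 0.0, 0.5]),
}


def select_field(label):
    """Translate a dropdown label into the array it stands for."""
    return fields[label]


def update_plot(zfield, variogram):
    z = select_field(zfield)
    # ... interpolate and plot z here, as in the question's update_plot ...
    return z, variogram


# With ipywidgets installed, the dropdown would then hold only strings:
#   zfield = widgets.Dropdown(options=list(fields), value="Delta N")
#   widgets.interactive(update_plot, zfield=zfield, variogram=variogram)
z, model = update_plot("Delta N", "spherical")
```

With this layout, value="Delta N" is a valid selection because the options themselves are the labels.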

Including multiple seasonal terms in Python statsmodels.tsa ARIMA

I am trying to model a time series in python using python 2.7.11 and the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks. Thus, the data has multiple seasonal components, days form a 24 hour period; weeks form a 168 hour period.
At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonality, as they only allow for the specification of one seasonal factor. However, I came across the work of Rob Hyndman on multiple seasonality in R. He advocates modeling the seasonal components of a time series using Fourier series, including in the model one Fourier series for the frequency corresponding to each of the seasonal periods.
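For illustration, Fourier regressors of this kind can be generated directly with NumPy. This is a sketch under assumptions: fourier_terms is a hypothetical helper, and the orders (2 daily harmonics, 1 weekly harmonic) are arbitrary choices, not values taken from Hyndman's work or from the code sample.

```python
import numpy as np


def fourier_terms(n_obs, period, order):
    """Return an (n_obs, 2 * order) array of sine/cosine regressors."""
    t = np.arange(n_obs)
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


n_obs = 1008  # six weeks of hourly observations
exog = np.hstack([
    fourier_terms(n_obs, period=24, order=2),    # daily cycle
    fourier_terms(n_obs, period=168, order=1),   # weekly cycle
])
```

Including both a sine and a cosine column per harmonic lets the regression coefficients absorb the phase, since a*sin(wt) + b*cos(wt) is a single sinusoid with a fitted amplitude and phase. Note that the 7th weekly harmonic has a period of 24 hours, so weekly orders of 7 or more would duplicate the daily terms.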
I've used Welch's method to obtain the power spectral density of the signal in my observed time series, extracted the peaks in the signal which correspond to the frequencies at which I expect my seasonal effects, and used the frequency and amplitude to generate a sine wave pattern corresponding to the seasonal trends I expect in my data. As an aside, I think this allows me to bypass Hyndman's step of selecting the value of k based on the AIC, because I am using the signal inherent in the observed data.
To ensure that the sine waves match the occurrence of the seasonal pattern in the data, I match the peak of both sine wave patterns to the peaks in the observed data by visually selecting a peak within one of the 24-hour periods, and matching the hour of its occurrence to the highest value of the variable representing the sine wave. Prior to this, I have checked that the daily peaks occur at the same hour consistently.
So far, so good it seems - plots of the sine waves constructed with the obtained frequencies and amplitudes roughly correspond to the observed data. I then fit an ARIMA(2,0,0) model, including both of the decomposition-based variables as exogenous variables. At this point, I want to test the predictive utility of the model. However, this is where things get complicated.
When I am using ARIMA from the statsmodels package, the estimates I get from fitting the model form a pattern which replicates the sine waves, but with a range of values matching my observation. There is still a lot of variance in the observations which is not explained by the seasonal trends, leading me to believe that somewhere in the model fitting procedure something is not going the way it is supposed to.
Unfortunately, I am not sufficiently well-versed in the art of time series modeling to know if my unexpected results are due to the nature of exogenous variables I am including, statsmodels functionality that I should be using, but am omitting, or wrongful assumptions about the concept of seasonal trends.
Some concrete questions I have are:
Is it possible to include multiple seasonal trends (i.e. Fourier- or decomposition-based) in an ARIMA model using statsmodels in Python?
Could reconstruction of the seasonal trend using sine waves cause difficulties when the sine waves are included as exogenous variables in the model as specified above and in the code below?
Why does the model specified in the code below not yield predictions which match the observed data more closely?
Any help is much appreciated!
Best wishes, and thanks in advance,
Evert
p.s.: Sorry if my code sample and data file are overly long - as I am not sure what causes the unexpected results I thought I'd post the whole thing. Also, apologies for not following PEP8 at times - I'm still learning :)
Code sample:
import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator
# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#
def plotmean(timeseries, show=0, path=''):
    rolmean = pd.rolling_mean(timeseries, window=12)
    rolstd = pd.rolling_std(timeseries, window=12)
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#
def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD [V**2/Hz]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den
#
# Function which gets amplitude and frequency of n most important periodical cycles, and provides plot
# to visually inspect if they correspond to expected seasonal components.
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of corresponding sine functions
def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = density
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            # sine() takes four arguments; the stray fifth argument was dropped
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
        plt.show()
        plt.clf()
    return fstore, astore
def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    t = np.linspace(0, time_interval, time_interval * rate)
    y = amp * np.sin(w * t)
    return y
#
# Function which adapts the calculated sine waves for the returned sines for k = 1 through k = kmax
# 'dta' = Data set
def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect mean plot, select interval which has clear and representative peak, use to determine index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # peak in sound level at index 1 is matched by peak in sine wave at index 7. Thus, sound level[0] corresponds to
    # sine waves[6]
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog
#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#
def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = dflength * .80
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#
def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput
#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y
#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the ACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the PACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results
#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df
#
# Function use fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'firsttime' = last timestamp in observations
# 'df' = data frame containing data on which the ARIMA model was previously fitted
# 'results' = output of the modeling procedure
# 'freq' = Frequency of seasonal cycle that was used in decomposition
# 'decomp' = Output of the time series decomposition step
# 'mark' = Amount of hours included in the graph prior to prediction. Set at as close to 2 weeks as possible.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#
def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]
dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)
Data file

Find an easier way to cluster 2-d scatter data into grid array data

I have figured out a method to cluster dispersed point data into a structured 2-d array (like a rasterize function), and I hope there are better ways to achieve this.
My work
1. Intro
The 1000 data points have three properties (lon, lat, emission); each represents one factory located at (x, y) that emits a certain amount of CO2 into the atmosphere
grid network: a predefined 2-d array with shape 20x20
http://i4.tietuku.com/02fbaf32d2f09fff.png
The code reproduced here:
#### define the map area
xc1,xc2,yc1,yc2 = 113.49805889531724,115.5030664238035,37.39995194888143,38.789235929357105
map = Basemap(llcrnrlon=xc1,llcrnrlat=yc1,urcrnrlon=xc2,urcrnrlat=yc2)
#### reading the point data and scatter plot by their position
df = pd.read_csv("xxxxx.csv")
px,py = map(df.lon, df.lat)
map.scatter(px, py, color = "red", s= 5,zorder =3)
#### predefine the grid networks
lon_grid,lat_grid = np.linspace(xc1,xc2,21), np.linspace(yc1,yc2,21)
lon_x,lat_y = np.meshgrid(lon_grid,lat_grid)
grids = np.zeros(20*20).reshape(20,20)
plt.pcolormesh(lon_x,lat_y,grids,cmap = 'gray', facecolor = 'none',edgecolor = 'k',zorder=3)
2. My target
Find the nearest grid point for each factory
Add the emission data to that grid point
3. Algorithm realization
3.1 Raster grid
Note: the 20x20 grid points distributed over this area are shown as blue dots.
http://i4.tietuku.com/8548554587b0cb3a.png
3.2 KD-tree
Find the nearest blue dot for each red point
sh = (20*20,2)
grids = np.zeros(20*20*2).reshape(*sh)
sh_emission = (20*20)
grids_em = np.zeros(20*20).reshape(sh_emission)
k = 0
for j in range(0,yy.shape[0],1):
    for i in range(0,xx.shape[0],1):
        grids[k] = np.array([lon_grid[i],lat_grid[j]])
        k+=1
T = KDTree(grids)
x_delta = (lon_grid[2] - lon_grid[1])
y_delta = (lat_grid[2] - lat_grid[1])
R = np.sqrt(x_delta**2 + y_delta**2)
for i in range(0,len(df.lon),1):
    idx = T.query_ball_point([df.lon.iloc[i],df.lat.iloc[i]], r=R)
    # sometimes more than one blue dot is found,
    # so I calculate the distances between the factory (red point)
    # and all blue dots which are listed
    if (idx > 1):
        distance = []
        for k in range(0,len(idx),1):
            distance.append(np.sqrt((df.lon.iloc[i] - grids[k][0])**2 + (df.lat.iloc[i] - grids[k][1])**2))
        pos_index = distance.index(min(distance))
        pos = idx[pos_index]
    # Only find 1 point
    else:
        pos = idx
    grids_em[pos] += df.so2[i]
4. Result
co2 = grids_em.reshape(20,20)
plt.pcolormesh(lon_x,lat_y,co2,cmap =plt.cm.Spectral_r,zorder=3)
http://i4.tietuku.com/6ded65c4ac301294.png
5. My question
Can someone point out drawbacks or errors in this method?
Are there algorithms better suited to my target?
Thanks a lot!
There are many for-loops in your code, which is not the NumPy way.
Make some sample data first:
import numpy as np
import pandas as pd
from scipy.spatial import KDTree
import pylab as pl
xc1, xc2, yc1, yc2 = 113.49805889531724, 115.5030664238035, 37.39995194888143, 38.789235929357105
N = 1000
GSIZE = 20
x, y = np.random.multivariate_normal([(xc1 + xc2)*0.5, (yc1 + yc2)*0.5], [[0.1, 0.02], [0.02, 0.1]], size=N).T
value = np.ones(N)
df_points = pd.DataFrame({"x":x, "y":y, "v":value})
For equally spaced grids you can use hist2d():
pl.hist2d(df_points.x, df_points.y, weights=df_points.v, bins=20, cmap="viridis");
Here is the output:
Here is the code to use KDTree:
X, Y = np.mgrid[x.min():x.max():GSIZE*1j, y.min():y.max():GSIZE*1j]
grid = np.c_[X.ravel(), Y.ravel()]
points = np.c_[df_points.x, df_points.y]
tree = KDTree(grid)
dist, indices = tree.query(points)
grid_values = df_points.groupby(indices).v.sum()
df_grid = pd.DataFrame(grid, columns=["x", "y"])
df_grid["v"] = grid_values
fig, ax = pl.subplots(figsize=(10, 8))
ax.plot(df_points.x, df_points.y, "kx", alpha=0.2)
mapper = ax.scatter(df_grid.x, df_grid.y, c=df_grid.v,
                    cmap="viridis",
                    linewidths=0,
                    s=100, marker="o")
pl.colorbar(mapper, ax=ax);
the output is:
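As a further loop-free variant for the question's fixed 20x20 grid, the per-cell totals can be obtained as an array (instead of a plot) with np.histogram2d and explicit bin edges. This is a sketch with synthetic data; grid_sum is a hypothetical helper, and it sums emissions per grid cell rather than per nearest grid node, which is a slightly different assignment rule from the KDTree version.

```python
import numpy as np


def grid_sum(lon, lat, values, lon_edges, lat_edges):
    """Sum values into the cells defined by the given bin edges."""
    # Passing lat first makes rows follow latitude and columns follow longitude.
    totals, _, _ = np.histogram2d(lat, lon, bins=[lat_edges, lon_edges], weights=values)
    return totals  # shape (len(lat_edges) - 1, len(lon_edges) - 1)


xc1, xc2, yc1, yc2 = 113.49805889531724, 115.5030664238035, 37.39995194888143, 38.789235929357105
lon_edges = np.linspace(xc1, xc2, 21)
lat_edges = np.linspace(yc1, yc2, 21)

rng = np.random.default_rng(0)
lon = rng.uniform(xc1, xc2, 1000)
lat = rng.uniform(yc1, yc2, 1000)
emission = rng.uniform(0.0, 10.0, 1000)  # synthetic emissions, for illustration only

grid = grid_sum(lon, lat, emission, lon_edges, lat_edges)
```

The resulting array has the row and column layout that plt.pcolormesh(lon_x, lat_y, grid) expects when lon_x and lat_y come from np.meshgrid(lon_edges, lat_edges).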
