Get results in an Earth Engine python script - python

I'm trying to get NDVI mean in every polygon in a feature collection with earth engine python API.
I think that I succeeded getting the result (a feature collection in a feature collection), but then I don't know how to get data from it.
The data I want is IDs from features and ndvi mean in each feature.
import datetime
import ee
ee.Initialize()
#Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel");
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))
#Image collection
Sentinel_collection1 = (ee.ImageCollection('COPERNICUS/S2')).filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),datetime.datetime(2017, 8, 1))
# NDVI function to use with ee map
def NDVIcalc(image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    #NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer=ee.Reducer.mean(), collection=fc_filtered, scale=10)
    return MeansFeatures
#Result that I don't know to get the information: Features ID and NDVI mean
result = Sentinel_collection2.map(NDVIcalc)

If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are themselves dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function:
    return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time series of NDVI for each polygon (the most common case), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(fc_filtered)
    .filterDate(ee.Date('2017-01-01'), ee.Date('2017-08-01')))
def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
            .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
            .set('date', img.date().format("YYYYMMdd")))
    series = Sentinel_collection.map(NDVIcalc)
    # Get the time series as a list of [date, NDVI] pairs.
    list = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(list).flatten()))
result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))
# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
# merge() is a FeatureCollection method, so wrap the header feature first.
output = ee.FeatureCollection([header]).merge(result)
ee.batch.Export.table.toDrive(...)
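For the small-result case mentioned at the start of this answer, reading the IDs and means out of the getInfo() dictionary might look roughly like the sketch below. It runs without Earth Engine: sample_info is a hand-written stand-in that assumes the collection was flattened and that each feature carries 'mean' and 'date' properties, and extract_means is a hypothetical helper, not part of the ee API. The real dictionary has more keys, and the IDs will look different.

```python
# Hand-written stand-in for result.flatten().getInfo() (assumed shape, not real output).
sample_info = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "poly_1",
         "properties": {"mean": 0.41, "date": "2017-01-05T10:00:00"}},
        {"type": "Feature", "id": "poly_2",
         "properties": {"mean": 0.63, "date": "2017-01-05T10:00:00"}},
        # Assumption: a feature may come back without the reduced property.
        {"type": "Feature", "id": "poly_1",
         "properties": {"date": "2017-01-15T10:00:00"}},
    ],
}


def extract_means(info, value_key="mean"):
    """Return (feature id, date, value) rows from a getInfo()-style dictionary."""
    rows = []
    for feature in info.get("features", []):
        props = feature.get("properties", {})
        # .get() yields None instead of raising when a property is missing.
        rows.append((feature.get("id"), props.get("date"), props.get(value_key)))
    return rows


rows = extract_means(sample_info)
for row in rows:
    print(row)
```

Using .get() rather than direct indexing keeps the loop from failing on features that lack the property.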

Related

PM4PY is ignoring outliers when calculating the average

I have a csv file which has the following attributes - (id, date, status)
First I store the values in a dataframe and process it
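import pandas as pd
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter
# get the data in a data frame
log_csv = pd.read_csv('HILTGLOBAL.csv', sep=',')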
# processing the data
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv['createdDate'] = pd.to_datetime(log_csv.createdDate)
log_csv['createdDate'] = log_csv['createdDate'].values.astype('datetime64[D]')
log_csv = log_csv.sort_values('createdDate')
After that I rename some columns as required by PM4PY, and get the event log
# renaming
log_csv.rename(columns = {'currentStatus':'concept:name','createdDate':'time:timestamp','candidateId':'case:concept:name'},inplace = True)
# getting the event logs
log = log_converter.apply(log_csv)
Then I try to get the directly follows graph of the above dataframe.
I want the edges to represent the average time between each stage.
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.objects.dfg.retrieval.log import Parameters
from pm4py.visualization.dfg import visualizer as dfg_visualization
dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE,
    parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
gviz = dfg_visualization.apply(dfg, log=log,
    variant=dfg_visualization.Variants.PERFORMANCE,
    parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
dfg_visualization.view(gviz)
However, the outliers are being ignored in calculating the average time.
I do not know how to fix it such that all points are considered.

Compute mean and standard deviation for HDF5 data

I am currently running 100 simulations that each compute 1M values (i.e. there is one value per episode/iteration).
Main Routine
My main file looks like this:
# Defining the test simulation environment
def test_simulation():
    env = environment(
        periods = 1000000,
        parameter_x = ...,
        parameter_y = ...
    )
    # Defining the simulation
    env.simulation()
# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The simulation procedure is as follows: Within simulation() I generate a value_history that is continuously appended:
def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)
Hence, as a result, for each episode/iteration, I compute one value that is an array, e.g. [1.4 1.9] (player 1 having 1.4 and player 2 having 1.9 in the current episode/iteration).
Storing of Simulation Data
To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.
After running the simulations, I receive the following Keys structure:
Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>
Computing Statistics for Files
Now, the goal is to compute averages and standard deviations for each value in the 100 data files that I run, which means that, in the end, I would have a final_data set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row (for each player) across the 100 simulations).
The goal would thus be to get something like the following structure [average_player1, average_player2], [std_player1, std_player2]:
episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]
I currently use the following code to extract the data storing it into an empty list:
def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []
    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"
    with h5py.File(filename, "r") as hf:
        # List all groups
        print("Keys: %s" % hf.keys())
        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])
            for element in data:
                result.append(element)
The data structure of result looks something like this:
[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]
First Attempt to Compute Means
I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):
mean_result = [np.mean(k) for k in zip(*list(result))]
However, this computes the average of each element in the array across the whole list since I appended each data set to the empty list. My goal, however, would be to compute an average/standard deviation across the 100 data sets defined above (i.e. one value is the average/standard deviation across all 100 data sets).
Is there any way to efficiently accomplish this?
This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)
Outline of steps in the procedure summarized below (after opening the file):
1) Get basic size info from the file: the dataset count and the number of dataset rows.
2) Use the values above to size arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
3) Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier. (I created it to simplify the size calculations in step 2.) The enumerate() counter i is used to index episode values for each simulation to the correct column in each player array.
4) To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the mean across each row of simulation results.
5) Next, load the data into the result dataset. I created 2 datasets (same data, different dtypes) as described below:
   a. The 'final_data' dataset is a simple float array of shape=(# of episodes,4), where you need to know what value is in each column. (I suggest adding an attribute to document this.)
   b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.
A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
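As a small illustration of the axis=1 reduction described in the outline, here is a minimal sketch on made-up data that does not touch HDF5 at all. The 3 runs, 4 episodes and 2 players are arbitrary sizes chosen for the example, and stacking everything into one 3-D array is a variation on the two per-player arrays used in this answer, not a requirement.

```python
import numpy as np

# Three hypothetical simulation runs, each with 4 episodes and 2 players.
rng = np.random.default_rng(0)
runs = [rng.random((4, 2)) for _ in range(3)]

# Stack to shape (episodes, simulations, players).
stacked = np.stack(runs, axis=1)

# Reduce over the simulation axis: one value per episode and player.
means = stacked.mean(axis=1)               # shape (4, 2)
stds = stacked.std(axis=1)                 # population std (ddof=0), NumPy's default
sample_stds = stacked.std(axis=1, ddof=1)  # sample std, if that is what you need

print(means.shape, stds.shape)
```

Note that np.std() defaults to the population standard deviation; pass ddof=1 if the 100 simulations should be treated as a sample.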
Code to read and calculate statistics below.
import h5py
import numpy as np
def ExtractSimData(name, simulation_runs, length):
    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')
        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ ds_names[0] ].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt,sim_cnt))
        p2_arr = np.empty((ep_cnt,sim_cnt))
        for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation
            p1_arr[:,i] = hf[ds][:,0]
            p2_arr[:,i] = hf[ds][:,1]
        ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4),
                                compression='gzip', chunks=True)
        ds1[:,0] = np.mean(p1_arr, axis=1)
        ds1[:,1] = np.std(p1_arr, axis=1)
        ds1[:,2] = np.mean(p2_arr, axis=1)
        ds1[:,3] = np.std(p2_arr, axis=1)
        dt = np.dtype([ ('average_player1',float), ('average_player2',float),
                        ('std_player1',float), ('std_player2',float) ] )
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt,
                                compression='gzip', chunks=True)
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)
### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)
Code to create pseudo-data HDF5 file below.
import h5py
import numpy as np
# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000
    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods,players)
    if i == 0:
        mode='w'
    else:
        mode='a'
    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)
# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)

Query point data on a Google Earth Engine Image by specifying lat/long

I am trying to extract point values from a Google Earth Engine Image Collection by specifying lat/long information.
This seems to work perfectly fine when I am working with multiple images and use ee.Image.cat() to join them before I query the image. In the code example below composite = ee.Image.cat().
However, when I change composite (line 3 from the bottom) to one of the image collections (eg. chirps), it does not seem to work.
Please could someone assist me with this.
def getPropertyValue(settings):
    collection = settings['collection']
    fieldName = settings['fieldName']
    dateRange = settings['dateRange']
    geoLocation = settings['geoLocation']
    scale = settings['scale']
    image = ee.ImageCollection(collection).select(fieldName).filterDate(dateRange[0], dateRange[1]).mean()
    point = ee.Geometry.Point(geoLocation)
    mean = image.reduceRegions(point, 'mean', scale)
    valueRef = mean.select([fieldName], ['precipitation'], retainGeometry=True).getInfo()
    value = valueRef[fieldName][0]['properties'][fieldName]
    return value
fieldName = 'LST_AVE';
chirps = ee.ImageCollection("JAXA/GCOM-C/L3/LAND/LST/V2").select(fieldName).filterDate('2020-01-01', '2020-02-01').mean()
point = ee.Geometry.Point([26.8206, 30.8025])
dist_stats = composite.reduceRegions(point, 'mean', 5000)
dist_stats = dist_stats.select([fieldName], [fieldName], retainGeometry=True).getInfo();
print(dist_stats['features'][0]['properties'][fieldName])
Result when using composite
14248.55
Error when replacing composite with a Google Earth Engine Image
EEException: Error in map(ID=0):
Feature.select: Selected a different number of properties (0) than names (1).
reduceRegions names the output column after the reducer, not the field that is being reduced. (Though it's more complicated when you have multiple bands and reducers).
So this:
dist_stats = dist_stats.select([fieldName], [fieldName], retainGeometry=True).getInfo();
should be changed to this
dist_stats = dist_stats.select(['mean'], [fieldName], retainGeometry=True).getInfo();

HTS Prophet Holidays Issue

I am attempting to use the htsprophet package in Python with the example code below, which is taken from https://github.com/CollinRooney12/htsprophet/blob/master/htsprophet/runHTS.py . The error I get is ValueError: "holidays must be a DataFrame with 'ds' and 'holiday' column". Is there a workaround? I clearly have a data frame holidays with the two columns ds and holiday. I believe the error comes from the forecaster file in fbprophet, one of the dependency packages. Is there anything I need to add, or has anyone found a fix?
import pandas as pd
from htsprophet.hts import hts, orderHier, makeWeekly
from htsprophet.htsPlot import plotNode, plotChild, plotNodeComponents
import numpy as np
#%% Random data (Change this to whatever data you want)
date = pd.date_range("2015-04-02", "2017-07-17")
date = np.repeat(date, 10)
medium = ["Air", "Land", "Sea"]
businessMarket = ["Birmingham","Auburn","Evanston"]
platform = ["Stone Tablet","Car Phone"]
mediumDat = np.random.choice(medium, len(date))
busDat = np.random.choice(businessMarket, len(date))
platDat = np.random.choice(platform, len(date))
sessions = np.random.randint(1000,10000,size=(len(date),1))
data = pd.DataFrame(date, columns = ["day"])
data["medium"] = mediumDat
data["platform"] = platDat
data["businessMarket"] = busDat
data["sessions"] = sessions
#%% Run HTS
##
# Make the daily data weekly (optional)
##
data1 = makeWeekly(data)
##
# Put the data in the format to run HTS, and get the nodes input (a list of list that describes the hierarchical structure)
##
data2, nodes = orderHier(data, 1, 2, 3)
##
# load in prophet inputs (Running HTS runs prophet, so all inputs should be gathered beforehand)
# Made up holiday data
##
holidates = pd.date_range("12/25/2013","12/31/2017", freq = 'A')
holidays = pd.DataFrame(["Christmas"]*5, columns = ["holiday"])
holidays["ds"] = holidates
holidays["lower_window"] = [-4]*5
holidays["upper_window"] = [0]*5
##
# Run hts with the CVselect function (this decides which hierarchical aggregation method to use based on minimum mean Mean Absolute Scaled Error)
# h (which is 12 here) - how many steps ahead you would like to forecast. If youre using daily data you don't have to specify freq.
#
# NOTE: CVselect takes a while, so if you want results in minutes instead of half-hours pick a different method
##
myDict = hts(data2, 52, nodes, holidays = holidays, method = "FP", transform = "BoxCox")
##
The problem lies in the htsprophet package, in the 'fitForecast.py' file. The instantiation of the fbprophet Prophet object relies on positional arguments only, but a new argument has been added to the Prophet class, so the arguments no longer line up with the parameters.
You can solve this by hacking the htsprophet module and changing the positional arguments to keyword arguments. Just fixing lines '73-74' should be sufficient to get it running:
Prophet(growth=growth, changepoints=changepoints1, n_changepoints=n_changepoints1, yearly_seasonality=yearly_seasonality, weekly_seasonality=weekly_seasonality, holidays=holidays, seasonality_prior_scale=seasonality_prior_scale,
        holidays_prior_scale=holidays_prior_scale, changepoint_prior_scale=changepoint_prior_scale, mcmc_samples=mcmc_samples, interval_width=interval_width, uncertainty_samples=uncertainty_samples)
I'll submit a bug for this to the creators.
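To see why positional arguments break when a parameter is added, here is a minimal, self-contained sketch. ModelV2 and new_option are hypothetical names made up for the illustration; they are not the real Prophet signature.

```python
class ModelV2:
    """Hypothetical newer signature: new_option was inserted before holidays.

    The older signature is assumed to have been (growth, changepoints, holidays).
    """
    def __init__(self, growth="linear", changepoints=None, new_option=25, holidays=None):
        self.growth = growth
        self.changepoints = changepoints
        self.new_option = new_option
        self.holidays = holidays


holidays = ["Christmas"]

# A call written against the older signature: the third positional value
# now lands in new_option, and holidays silently stays None.
positional = ModelV2("linear", None, holidays)

# Keyword arguments are matched by name, so the inserted parameter is harmless.
keyword = ModelV2(growth="linear", changepoints=None, holidays=holidays)

print("positional:", positional.new_option, positional.holidays)
print("keyword:   ", keyword.new_option, keyword.holidays)
```

With the positional call, the holidays value ends up in the wrong parameter, which is the kind of mismatch that can surface as a confusing validation error further down.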

why is my data a tuple and how can I change this so I can sort the data

I am using rpy2 to do some statistical analyses in R via python. After importing a data file I want to sort the data and do a couple other things with it in R. Once I import the data and try to sort the data I get this error message:
TypeError: 'tuple' object cannot be interpreted as an index
The last 2 lines of my code are where I am trying to sort my data, and the few lines before that are where I import the data.
root = os.getcwd()
dirs = [os.path.abspath(name) for name in os.listdir(".") if os.path.isdir(name)]
for d in dirs:
    os.chdir(d)
    cwd = os.getcwd()
    files_to_analyze = (glob.glob("*.afa"))
    for f in files_to_analyze:
        afa_file = os.path.join(cwd + '/' + f)
        readfasta = robjects.r['read.fasta']
        mydatafasta = readfasta(afa_file)
        names = robjects.r['names']
        IDnames = names(mydatafasta)
        substr = robjects.r['substr']
        ID = substr(IDnames, 1,8)
        #print ID
        readtable = robjects.r['read.table']
        gps_file = os.path.join(root + '/' + "GPS.txt")
        xy = readtable(gps_file, sep="\t")
        #print xy
        order = robjects.r['order']
        gps = xy[order(xy[:,2]),]
I don't understand why my data is a tuple and not a dataframe that I can manipulate further using R. Is there a way to transform this into a workable dataframe that can be used by R?
My xy data look like:
Species AB425882 35.62 -83.4
Species AB425905 35.66 -83.33
Species KC413768 37.35 127.03
Species AB425841 35.33 -82.82
Species JX402724 29.38 -82.2
I want to sort the data alphanumerically by the second column using the order function in R.
There is quite a bit of guesswork here, since the example is not sufficient to reproduce what you have.
In the following, assuming xy is an R data frame, use the .rx method, which is dedicated to R-style subsetting (see the doc):
# Note R indices are 1-based while Python indices are 0-based.
# When using R-style subsetting the indices are 1-based.
gps = xy.rx(order(xy.rx(True, 2)), True)
