I need to process a single variable in a netCDF file that actually contains many attributes and variables.
I think it is not possible to update a netCDF file in place (see the question How to delete a variable in a Scientific.IO.NetCDF.NetCDFFile?).
My approach is the following:
1. get the variable to process from the original file
2. process the variable
3. copy all data from the original netCDF BUT the processed variable to the final file
4. copy the processed variable to the final file
My problem is to code step 3. I started with the following:
import itertools
from Scientific.IO.NetCDF import NetCDFFile

def processing(infile, variable, outfile):
    fileH = NetCDFFile(infile, mode="r")
    data = fileH.variables[variable][:]
    # do processing on data...
    # and now save the result
    outfile = NetCDFFile(outfile, mode='w')
    # build a list of variables without the processed variable
    listOfVariables = list(itertools.ifilter(lambda x: x != variable, fileH.variables.keys()))
    for ivar in listOfVariables:
        # here I need to write each variable and each attribute
How can I save all data and attributes in a handful of code without having to rebuild the whole data structure?
Here's what I just used and it worked: @arne's answer, updated for Python 3 and also to include copying variable attributes:
import netCDF4

toexclude = ['ExcludeVar1', 'ExcludeVar2']

with netCDF4.Dataset("in.nc") as src, netCDF4.Dataset("out.nc", "w") as dst:
    # copy global attributes all at once via dictionary
    dst.setncatts(src.__dict__)
    # copy dimensions
    for name, dimension in src.dimensions.items():
        dst.createDimension(
            name, (len(dimension) if not dimension.isunlimited() else None))
    # copy all file data except for the excluded
    for name, variable in src.variables.items():
        if name not in toexclude:
            x = dst.createVariable(name, variable.datatype, variable.dimensions)
            dst[name][:] = src[name][:]
            # copy variable attributes all at once via dictionary
            dst[name].setncatts(src[name].__dict__)
If you just want to copy the file while picking out variables, nccopy is a great tool, as submitted by @rewfuss.
Here's a Pythonic (and more flexible) solution with python-netcdf4. This allows you to open the data for processing and other calculations before writing to file.
with netCDF4.Dataset(file1) as src, netCDF4.Dataset(file2, "w") as dst:
    for name, dimension in src.dimensions.iteritems():
        dst.createDimension(name, len(dimension) if not dimension.isunlimited() else None)
    for name, variable in src.variables.iteritems():
        # take out the variable you don't want
        if name == 'some_variable':
            continue
        x = dst.createVariable(name, variable.datatype, variable.dimensions)
        dst.variables[name][:] = src.variables[name][:]
This does not take into account variable attributes, such as fill_value. You can do that easily by following the documentation.
Do be careful: netCDF4 files, once written/created this way, cannot be undone. The moment you modify the variable, it is written to the file at the end of the with statement, or when you call .close() on the Dataset.
Of course, if you wish to process the variables before writing them, you have to be careful about which dimensions to create. In a new file, never write to variables without creating them, and never create variables without having defined their dimensions, as noted in python-netcdf4's documentation.
This answer builds on the one from Xavier Ho (https://stackoverflow.com/a/32002401/7666), but with the fixes I needed to complete it:
import netCDF4 as nc

toexclude = ["TO_REMOVE"]

with nc.Dataset("orig.nc") as src, nc.Dataset("filtered.nc", "w") as dst:
    # copy attributes
    for name in src.ncattrs():
        dst.setncattr(name, src.getncattr(name))
    # copy dimensions
    for name, dimension in src.dimensions.iteritems():
        dst.createDimension(
            name, (len(dimension) if not dimension.isunlimited() else None))
    # copy all file data except for the excluded
    for name, variable in src.variables.iteritems():
        if name not in toexclude:
            x = dst.createVariable(name, variable.datatype, variable.dimensions)
            dst.variables[name][:] = src.variables[name][:]
The nccopy utility in C netCDF versions 4.3.0 and later includes an option to list which variables are to be copied (along with their attributes). Unfortunately, it doesn't include an option for which variables to exclude, which is what you need.
However, if the list of (comma-delimited) variables to be included doesn't cause the nccopy command-line to exceed system limits, this would work. There are two variants for this option:
nccopy -v var1,var2,...,varn input.nc output.nc
nccopy -V var1,var2,...,varn input.nc output.nc
The first (-v) includes all the variable definitions, but only the data for the named variables.
The second (-V) omits both the definitions and the data of the variables that are not named.
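If you do want exclusion semantics, one workaround is to build the complement of the exclusion list yourself and hand it to -V. The following is only a sketch, assuming the netCDF4 Python module is installed, nccopy is on the PATH, and the file names are placeholders:

# Sketch: emulate an "exclude" option by passing nccopy -V the complement
# of the variables to drop. Assumes the variable list fits on the command line.
import subprocess
import netCDF4

toexclude = {"ExcludeVar1", "ExcludeVar2"}
with netCDF4.Dataset("in.nc") as src:
    keep = [name for name in src.variables if name not in toexclude]
subprocess.run(["nccopy", "-V", ",".join(keep), "in.nc", "out.nc"], check=True)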
I know this is an old question, but as an alternative you can use the netcdf library together with shutil:
import shutil
from netcdf import netcdf as nc
def processing(infile, variable, outfile):
    shutil.copyfile(infile, outfile)
    with nc.loader(infile) as in_root, nc.loader(outfile) as out_root:
        data = nc.getvar(in_root, variable)
        # do your processing on data and keep the result in memory as "values"...
        values = data[:] * 3
        new_var = nc.getvar(out_root, variable, source=data)
        new_var[:] = values
All of the recipes so far (except for the one from @rewfuss, which works fine but is not exactly a Pythonic one) produce a plain NetCDF3 file, which can be punishing for highly compressed NetCDF4 datasets. Here is an attempt to cope with the issue.
import netCDF4

infname = "Inf.nc"
outfname = "outf.nc"
skiplist = "var1 var2".split()

with netCDF4.Dataset(infname) as src:
    with netCDF4.Dataset(outfname, "w", format=src.file_format) as dst:
        # copy global attributes all at once via dictionary
        dst.setncatts(src.__dict__)
        # copy dimensions
        for name, dimension in src.dimensions.items():
            dst.createDimension(
                name, (len(dimension) if not dimension.isunlimited() else None))
        # copy all file data except for the excluded
        for name, variable in src.variables.items():
            if name in skiplist:
                continue
            createattrs = variable.filters()
            if createattrs is None:
                createattrs = {}
            else:
                chunksizes = variable.chunking()
                print(createattrs)
                if chunksizes == "contiguous":
                    createattrs["contiguous"] = True
                else:
                    createattrs["chunksizes"] = chunksizes
            x = dst.createVariable(name, variable.datatype, variable.dimensions, **createattrs)
            # copy variable attributes all at once via dictionary
            dst[name].setncatts(src[name].__dict__)
            dst[name][:] = src[name][:]
This seems to work fine and stores the variables the way they are in the original file, except that it does not copy some variable attributes that start with an underscore and are not known to the NetCDF library.
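If those underscore attributes matter (e.g. _FillValue, which netCDF4 only accepts at variable-creation time), the last three lines of the loop above could be adapted along these lines. This is only a sketch, reusing the name, variable, createattrs, src and dst names from the code above:

# Sketch: pass _FillValue explicitly at creation time and copy only the
# non-reserved (non-underscore) attributes afterwards.
fill = getattr(variable, "_FillValue", None)
x = dst.createVariable(name, variable.datatype, variable.dimensions,
                       fill_value=fill, **createattrs)
x.setncatts({key: variable.getncattr(key)
             for key in variable.ncattrs() if not key.startswith("_")})
x[:] = variable[:]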
Related
Recently I have been trying to make a makeshift "disk space" reader. I made a library that stores values in a list (the "disk"), and when I start a new script with subprocess to write to the "disk", to see whether the values change on the display, nothing happens. I realized that any time you import a module, the module is essentially cloned into that script only.
I want to be able to have multiple scripts import the same module, so that if one script changes a value, another script can see that value.
Here is my code for the "disk" system
import time
ram = []
space = 256000
lastspace = 0
for i in range(0,space + 1):
ram.append('')
def read(location):
try:
if ram[int(location)] == '':
return "ERR_NO_VALUE"
else:
return ram[int(location)]
except:
return "ERR_OUT_OF_RANGE"
def write(location, value):
try:
ram[int(location)] = value
except:
return "ERR_OUT_OF_RANGE"
def getcontents():
contents = []
for i in range(0, 256001):
contents.append([str(i)+ '- ', ram[i]])
return contents
def getrawcontents():
contents = []
for i in range(0, 256001):
contents.append(ram[i])
return contents
def erasechunk(beg, end):
try:
for i in range(int(beg), int(end) + 1):
ram[i] = ''
except:
return "ERR_OUT_OF_RANGE"
def erase(location):
ram[int(location)] = ''
def reset():
    global ram  # without this, "ram = []" would only rebind a local name
    ram = []
    times = space / 51200
    tc = 0
    for i in range(0, round(times)):
        for x in range(0, 51201):
            ram.append('')
            tc += 1
        print("Byte " + str(tc) + " of " + str(space) + " Bytes")
        for a in range(0, 100):
            print('\a', end='')
    return [len(ram), ' bytes']
def wipe():
for i in range(0,256001):
ram[i] = ''
return "WIPED"
def getspace():
x = 0
for i in range(0,len(ram)):
if ram[i] != "":
x += 1
return [x,256000]
The shortest answer to your question, which I'm understanding as "if I import the same module into two (or more) Python processes, can they interact with each other?", is no. What actually happens when you import a module is that Python uses the source script to 'build' those functions in the namespace you're importing them into; there is no sense of permanence in "where the module came from", since that original module isn't actually running in a Python process anywhere. Each script that imports those functions just creates its own copy of those pseudo-global variables (in your case ram) along with the functions you're importing.
Python import docs: https://docs.python.org/3/reference/import.html
The whole page on Python's data model, including what __globals__ means for functions and modules: https://docs.python.org/3/reference/datamodel.html
Explanation:
To go into a bit more depth, when you import any of the functions from this script (let's assume it's called 'disk.py'), you'll get an object in that function's __globals__ dict called ram, which will indeed work as you expect for these functions in your current namespace:
from disk import read,write
write(13,'thing')
print(read(13)) #prints 'thing'
We might assume, since these functions are accurately accessing our ram object, that the ram object is being modified somehow in the namespace of the original script, which could then be accessed by a different script (a different Python process). Looking at the namespace of our current script using dir() might support that notion, since we only see read and write, and not ram. But the secret is that ram is hidden in those functions' __globals__ dict (mentioned above), which is how the functions are interacting with ram:
from disk import read,write
print(type(write.__globals__['ram'])) #<class 'list'>
print(write.__globals__['ram'] is read.__globals__['ram']) #True
write(13,'thing')
print(read(13)) #'thing'
print(read.__globals__['ram'][13]) #'thing'
As you can see, ram actually is a variable defined in the namespace of our current Python process, hidden in the functions' __globals__ dict, which is actually the exact same dictionary for any function imported from the same module; read.__globals__ is write.__globals__ evaluates to True (even if you don't import them at the same time!).
So, to wrap it all up, ram is contained in the __globals__ dict for the disk module, which is created separately in the namespace of each process you import into:
Python interpreter #1:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139775502955080 139775502955080
Python interpreter #2:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139797009773128 139797009773128
Solution hint:
There are many approaches on how to do this practically that are beyond the scope of this answer, but I will suggest that pickle is the standard way to send objects between Python interpreters using files, and has a really standard interface. You can just write, read, etc your ram object using a pickle file. To write:
import pickle
with open('./my_ram_file.pkl','wb') as ram_f:
pickle.dump(ram,ram_f)
To read:
import pickle
with open('./my_ram_file.pkl','rb') as ram_f:
ram = pickle.load(ram_f)
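As a rough sketch of how that could plug into the disk module from the question (the file path and helper names here are hypothetical, not part of the original code):

# Sketch: persist the module-level ram list with pickle so a second process
# can reload it. Assumes the disk.py layout from the question.
import pickle

RAM_FILE = './my_ram_file.pkl'  # hypothetical shared location

def sync_to_file(ram):
    """Write the current ram list out for other processes to pick up."""
    with open(RAM_FILE, 'wb') as f:
        pickle.dump(ram, f)

def sync_from_file():
    """Reload the ram list last written by any process."""
    with open(RAM_FILE, 'rb') as f:
        return pickle.load(f)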
Admittedly I am not sure how to ask this, as I know how to handle this in R (code execution in a new environment), but equivalent searches for the python solution are not yielding what I was hoping.
In short, I will receive a spreadsheet (or csv) where the contents of a column will contain, hopefully, valid Python code. This could be the equivalent of a script, but just contained in the csv/workbook. For a use case, think of teaching programming, where the output is an LMS.
What I am hoping to do is loop over the file, and for each cell, run the code, and with the results in memory, test to see if certain things exist.
For example: https://docs.google.com/spreadsheets/d/1D-zC10rUTuozfTR5yHfauIGbSNe-PmfrZCkC7UTPH1c/edit?usp=sharing
When evaluating the first response in the spreadsheet above, I would want to test that x, y, and z are all properly defined and have the expected values.
Because there would be multiple rows in the file, one per student, how can I run each row separately, evaluate the results, and ensure that I isolate the results to only that cell? Put simply, when moving on, I should not retain any of the past evaluations.
(I am unaware of tools to do code checking, so I am dealing with it in a very manual way.)
It is possible to use Python's exec() function to execute strings such as the content in the cells.
Ex:
variables = {}
exec("""import os
# a comment
x = 2
y = 6
z = x * y""", variables)
assert variables["z"] == 12
Dealing with the csv file:
import csv
csv_file = open("path_to_csv_file", "rt")
csv_reader = csv.reader(csv_file)
iterator = iter(csv_reader)
next(iterator) # To skip the titles of the columns
for row in iterator:
user = row[0]
answer = row[1]
### Any other code involving the csv file must be put here to work properly,
### that is, before closing csv_file.
csv_file.close() # Remember to close the file.
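As a stylistic variant of the same loop, a with block closes the file for you (just a sketch; the path is still a placeholder):

import csv

with open("path_to_csv_file", "rt", newline="") as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # skip the title row
    for row in csv_reader:
        user, answer = row[0], row[1]
        # per-student checking code goes here, inside the with block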
It won't be able to detect whether some module was imported (because once a module has been imported by one exec() call, it remains in the module cache for the next exec calls). One way to test this would be to 'unimport' the module and test the exec for exceptions.
Ex:
# This piece of code would go before closing the file,
# INSIDE THE FOR LOOP AND INDENTED WITH IT (because you want
# it to run for each student).
try:
    del os  # 'unimporting' os (this doesn't 'unimport' so much as delete a
            # reference to the module, which could be problematic if a 'from
            # module import object' statement was used.)
except NameError:
    # So that trying to delete a module that wasn't imported
    # does not lead to exceptions being raised.
    pass
namespace = dict()
try:
    exec(answer, namespace)
except:
    # Answer code could not be run without raising exceptions, i.e., the code
    # is poorly written.
    pass  # Code you want to run when the answer is wrong.
else:
    # The code hasn't raised exceptions, time to test the variables.
    x, y, z = namespace['x'], namespace['y'], namespace['z']
    if (x == 2) and (y == 6) and (z == x * y):
        pass  # Code you want to run when the answer is right.
    else:
        pass  # Code you want to run when the answer is wrong.
I sense that this is not the best way to do this, but it is certainly an attempt.
I hope this helped.
EDIT: Removed some bad code and added part of Tadhg McDonald-Jensen's comment.
New to python.
I have two feature datasets, Input features (NewDemoFC) and Update Features (ExistingFC) with 5 feature classes each. One set contains demolished features, the other set contains all active features. The objective is to compare the two and wherever the demolished features (from NewDemoFC) overlap with an active feature (from ExistingFC), delete the overlapping active features (from ExistingFC) and output a new feature class.
I want to use a while loop and be able to feed a particular position from the list for both the input features and the update features parameters. I would also like to maintain the same names and order for the output feature class names.
I am trying to achieve the results of the model below for a dataset with multiple files, as the SymDiff_analysis tool doesn't work on more than one feature class as input unless you add each feature class as a line item, specifying input, output and any intermediate temporary files. This is not practical for a dataset with 100-odd feature classes.
CODE IS AS UNDER
# Import arcpy module
import arcpy
# Set environment to generate new input feature class list and count
arcpy.env.workspace = "T:\eALP_Update.gdb\Point_DemoNew"
NewDemoFC = arcpy.ListFeatureClasses()
NewDemoFCCount = len(NewDemoFC)
# Set environment to generate existing feature class list
arcpy.env.workspace = "T:\eALP_Update.gdb\Point_InputExisting"
ExistingFC = arcpy.ListFeatureClasses()
E_PointFeatures_ActiveOnly = []
i = 0
#arcpy.env.workspace = "T:\eALP_Update.gdb\Point_ActiveExisting"
while i < NewDemoFCCount:
# Process: Symmetrical Difference (2)
arcpy.SymDiff_analysis(NewDemoFC[i], ExistingFC[i], E_PointFeatures_ActiveOnly[i], "ALL", "0.01 Feet")
i = i + 1
ERROR I GET IS AS UNDER
Traceback (most recent call last):
File "C:\Python27\ArcGIS10.5\Lib\site-packages\pythonwin\pywin\framework\intpyapp.py", line 345, in OnFileRun
scriptutils.RunScript(None, None, showDlg)
File "C:\Python27\ArcGIS10.5\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 353, in RunScript
del __main__.__file__
AttributeError: __file__
5
[u'Demo_New_UTILITYPOINT', u'Demo_New_ROADPOINT', u'Demo_New_AIRPORTSIGN', u'Demo_New_AIRPORTCONTROLPOINT', u'Demo_New_AIRFIELDLIGHT']
5
[u'UtilityPoint', u'RoadPoint', u'AirportSign', u'AirportControlPoint', u'AirfieldLight']
Traceback (most recent call last):
File "C:\Python27\ArcGIS10.5\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 326, in RunScript
exec codeObject in __main__.__dict__
File "T:\Data\GOAA\eALPUpdates\Point_2-SymmetricalDifferenceOnly.py", line 41, in <module>
arcpy.SymDiff_analysis(NewDemoFC[i], ExistingFC[i], E_PointFeatures_ActiveOnly[i], "ALL", "0.01 Feet")
IndexError: list index out of range
[Dbg]>>>
What you want to do is use a for loop that iterates over each feature class, to avoid the odd indexing process you have going on inside your call to arcpy.SymDiff_analysis. For example, using i to index E_PointFeatures_ActiveOnly (an empty list) as an output path won't work; to do this the way you want, you'll need to generate the output names dynamically. Make sure that the output folder is empty when you do this, to avoid naming conflicts.
The code you have also duplicates everything for each folder, so we can define functions to eliminate that, so that you can re-use it easily. Lastly, you really want to avoid altering global variables like arcpy.env.workspace multiple times; the function below for this is really verbose, but since it's a function, you only have to write it once. I'll assume you have access to ArcGIS version >= 10.1.
The following code is long and untested, but I imagine it should do the trick:
import arcpy
arcpy.env.workspace = "T:\eALP_Update.gdb\Point_ActiveExisting"
def getFCs(folderOne, folderTwo):
    """Get feature classes from two folders"""
    from os.path import join
    x = []
    y = []
    folders = [folderOne, folderTwo]
    for folder in folders:
        for dirpath, dirnames, filenames in arcpy.da.Walk(
                folder,
                topdown=True,
                followlinks=False,
                datatype='FeatureClass',
                type='ALL'):
            for filename in filenames:
                if folder == folders[0]:
                    x.append(join(dirpath, filename))
                else:
                    y.append(join(dirpath, filename))
    return x, y
def batchSymDiff(inFolder, updateFolder, joinAttr, clusterTolerance):
"""Performs SymDiff analysis for every feature in a set of folders"""
inFeatures, updateFeatures = getFCs(inFolder, updateFolder)
for fc1, fc2 in zip(inFeatures, updateFeatures):
output = fc2.replace(".shp", "_sym.shp") # this naming pattern assumes ".shp" ending
arcpy.SymDiff_analysis(fc1, fc2, output, joinAttr, clusterTolerance)
# set variables for batch process
inFolder = "T:\eALP_Update.gdb\Point_DemoNew"
updateFolder = "T:\eALP_Update.gdb\Point_InputExisting"
joinAttr = "ALL"
clusterTolerance = "0.01"
# execute batchSymDiff
batchSymDiff(inFolder, updateFolder, joinAttr, clusterTolerance)
This code is probably more verbose than it has to be, but doing it this way means you can avoid changing global environment variables over and over (a risky business, since the errors it causes are sometimes really difficult to diagnose) and it makes your code reusable. Also note that it eliminates the need for a "manual" counter (i). Hope it helps! I suggest testing the code on test data first.
I was previously able to load a pickle file. Then I saved a new file under a different name, and now I am unable to load either the old or the new file, which is a bummer, as it contains data I have worked hard to scrub.
Here is the code that I use to save:
def pickleStore():
pickle.dump(store, open("...shelf3.p", "wb"))
Here is the code that I use to re-load:
def pickleLoad():
store = pickle.load(open(".../shelf3.p","rb" ) )
The created file exists, and when I run pickleLoad() no errors come up, but no variables show up in the Variable Explorer panel either. If I try to load a non-existent file, I do get an error message.
I am running spyder, python 3.5.
Any suggestions?
If you want to write to a module-level variable from a function, you need to use the global keyword:
store = None
def pickleLoad():
global store
store = pickle.load(open(".../shelf3.p","rb" ) )
...or return the value and perform the assignment from module-level code:
store = None
def pickleLoad():
return pickle.load(open(".../shelf3.p","rb" ) )
store = pickleLoad()
As a general and more versatile approach I would suggest something like this:
def load(file_name):
    with open(file_name, 'rb') as pickle_file:
        return pickle.load(pickle_file)

def save(file_name, data):
    with open(file_name, 'wb') as f:
        pickle.dump(data, f)
I have added this snippet to several projects in order to avoid rewriting the same code over and over.
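For example, a quick usage sketch of those two helpers (the file name is just an example):

# Sketch: round-trip some data through the helpers above.
data = {"scrubbed": [1, 2, 3]}
save("shelf3.p", data)
store = load("shelf3.p")
print(store["scrubbed"])  # [1, 2, 3]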
I am trying to use memmap when certain data doesn't fit in memory, employing memmap's ability to trick code into thinking it is just an ndarray. To extend this way of using memmap further, I was wondering whether it would be possible to overload memmap's dereference operator to delete the memmap file.
So for example:
from tempfile import mkdtemp
import os.path as path
import numpy as np

# a is some pre-existing ndarray
filename = path.join(mkdtemp(), 'tmpfile.dat')
{
out = np.memmap(filename, dtype=a.dtype, mode='w+', shape=a.shape)
}
# At this point out is out of scope, so the overloaded
# dereference function would delete tmpfile.dat
Does this sound feasible/has this been done? Is there something I am not thinking of?
Thank you!
Just delete the file after it has been opened by np.memmap; the file will then be deleted by the system after the last reference to the file descriptor is closed.
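A minimal sketch of that idea on a POSIX system (the file name is just an example; like the code below, it will not work on Windows):

# Sketch: unlink the backing file right after np.memmap has opened it;
# the data stays usable until the mapping itself is released.
import os
import numpy as np

out = np.memmap('tmpfile.dat', dtype=np.float64, mode='w+', shape=(3, 4))
os.unlink('tmpfile.dat')   # directory entry gone; the mapping keeps the data alive
out[:] = 1.0               # still works
del out                    # last reference dropped, space reclaimed by the OS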
Python temporary files work like this, and they can very conveniently be used with the with context manager construct:
import tempfile
import numpy as np

with tempfile.NamedTemporaryFile() as f:
    # open it again by name (will not work on Windows)
    x = np.memmap(f.name, dtype=np.float64, mode='w+', shape=(3, 4))
# the with block has closed f, so the file path is unlinked from the filesystem,
# but x still holds an open reference, so the data still exists and uses space
# (see /proc/<pid>/fd)
del x
# now all references are gone and the file is properly deleted
Here is a case where we do not want to use with and would rather have a class that handles it for us:
import os
import tempfile
import numpy as np


class tempmap(np.memmap):
    """
    Extension of numpy memmap to automatically map to a file stored in a temporary directory.
    Useful as a fast storage option when numpy arrays become large and we just want to do some quick experimental stuff.
    """
    def __new__(subtype, dtype=np.uint8, mode='w+', offset=0,
                shape=None, order='C'):
        ntf = tempfile.NamedTemporaryFile()
        self = np.memmap.__new__(subtype, ntf, dtype, mode, offset, shape, order)
        self.temp_file_obj = ntf
        return self

    def __del__(self):
        if hasattr(self, 'temp_file_obj') and self.temp_file_obj is not None:
            self.temp_file_obj.close()
            del self.temp_file_obj
def np_as_tmp_map(nparray):
tmpmap = tempmap(dtype=nparray.dtype, mode='w+', shape=nparray.shape)
tmpmap[...] = nparray
return tmpmap
def test_memmap():
    """Test that deleting a temp memmap also deletes the file."""
    x = np_as_tmp_map(np.zeros((10, 10), dtype=np.float64))
    name = x.temp_file_obj.name
    del x
    x = None
    assert not os.path.isfile(name)