How do I add a shapefile in ArcGIS via python scripting?

I am trying to automate various tasks in ArcGIS Desktop (using ArcMap generally) with Python, and I keep needing a way to add a shape file to the current map. (And then do stuff to it, but that's another story).
The best I can do so far is to add a layer file to the current map, using the following ("addLayer" is a layer file object):
def AddLayerFromLayerFile(addLayer):
    import arcpy
    # reference the current map document and its first data frame
    mxd = arcpy.mapping.MapDocument("CURRENT")
    df = arcpy.mapping.ListDataFrames(mxd, "Layers")[0]
    # add the layer file object and refresh the display
    arcpy.mapping.AddLayer(df, addLayer, "AUTO_ARRANGE")
    arcpy.RefreshActiveView()
    arcpy.RefreshTOC()
    del mxd, df, addLayer
However, my raw data is always going to be shape files, so I need to be able to open them. (Equivalently: convert a shape file to a layer file without opening it, but I'd prefer not to do that.)

Variable "theShape" is the path of the shape file to be added.
import arcpy
import arcpy.mapping
# get the map document
mxd = arcpy.mapping.MapDocument("CURRENT")
# get the data frame
df = arcpy.mapping.ListDataFrames(mxd,"*")[0]
# create a new layer
newlayer = arcpy.mapping.Layer(theShape)
# add the layer to the map at the bottom of the TOC in data frame 0
arcpy.mapping.AddLayer(df, newlayer,"BOTTOM")
# Refresh things
arcpy.RefreshActiveView()
arcpy.RefreshTOC()
del mxd, df, newlayer

Recently I struggled with a similar task, and initially used the method of identifying the map document, identifying the data frame, creating a layer and adding the layer to the map document. Interestingly enough, this can all be accomplished with the following, provided it is called from within the current map document.
# import modules
import arcpy
# create layer in TOC and reference it in a variable for possible other actions
newLyr = arcpy.MakeFeatureLayer_management(
    in_features,
    out_layer
)[0]
Make Feature Layer requires two inputs, the input features and the output layer. The input features can be any type of feature class or layer. This includes shapefiles. Output layer is the name of the layer to appear in the table of contents.
Also, Make Feature Layer can accept a where clause to create a definition query at creation time. This is typically how I implement it when I need to create a lot of layers with different definition queries quickly.
Finally, in the above snippet, although it is not necessary, I demonstrated how to populate a variable with the result of the tool output so the layer could be manipulated in the table of contents using arcpy.mapping if this is necessary later in the script. Every tool returns a result object. The result object output can be accessed using the getOutput method, but it can also be accessed by using the index of the result property you are interested in, in this case the output located at index 0.
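For example, here is a minimal sketch (the shapefile path, field name and zoning values are hypothetical) of using the where clause to create several layers with different definition queries in one pass:
import arcpy

in_features = r"C:\data\parcels.shp"  # hypothetical shapefile
for zone in ("R1", "R2", "C1"):
    # each layer gets its own definition query at creation time
    newLyr = arcpy.MakeFeatureLayer_management(
        in_features,
        "parcels_{0}".format(zone),
        where_clause="ZONING = '{0}'".format(zone)
    )[0]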

Related

How to copy a dataset object to a different hdf5 file using pytables or h5py?

I have selected specific hdf5 datasets and want to copy them to a new hdf5 file. I could find some tutorials on copying between two files, but what if you have just created a new file and you want to copy datasets to the file? I thought the way below would work, but it doesn't. Are there any simple ways to do this?
>>> dic_oldDataset['old_dataset']
<HDF5 dataset "old_dataset": shape (333217,), type "|V14">
>>> new_file = h5py.File('new_file.h5', 'a')
>>> new_file.create_group('new_group')
>>> new_file['new_group']['new_dataset'] = dic_oldDataset['old_dataset']
RuntimeError: Unable to create link (interfile hard links are not allowed)
Answer 3
Use the copy method of the group class from h5py.
TL;DR
It works on groups and datasets, is recursive (can do deep and shallow copies), and has options for attributes, symbolic links and references.
with h5py.File('destFile.h5', 'w') as f_dest:
    with h5py.File('srcFile.h5', 'r') as f_src:
        f_src.copy(f_src["/path/to/DataSet"], f_dest["/another/path"], "DataSet")
(The file object is also the root group.)
Locations in HDF5
"An HDF5 file is organized as a rooted, directed graph" (source).
HDF5 groups (including the root group) and data sets are related to each other as "locations" (in the C API most functions take a loc_id which identifies a group or data set). These locations are the nodes on the graph; paths describe arcs through the graph to a node. copy takes a source and destination location, not specifically a group or dataset, so it can be applied to both. The source and destination do not need to be in the same file.
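As a quick illustration (file and object paths are hypothetical), the same call copies either a whole group or a single dataset, and the destination can live in a different file:
import h5py

with h5py.File('srcFile.h5', 'r') as f_src, h5py.File('destFile.h5', 'a') as f_dest:
    f_src.copy('/path/to/group', f_dest)                        # copies the group and everything below it
    f_src.copy('/path/to/DataSet', f_dest, name='DataSetCopy')  # copies a single dataset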
Attributes
Attributes are stored within the header of the group or data set they are associated with. Therefore the attributes are also associated with that "location". It follows that copying a group or dataset will include all attributes associated with that "location". However, you can turn this off.
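For instance, copy takes a without_attrs flag (a small sketch, paths hypothetical):
with h5py.File('srcFile.h5', 'r') as f_src, h5py.File('destFile.h5', 'a') as f_dest:
    # copy the dataset but leave its attributes behind
    f_src.copy('/path/to/DataSet', f_dest, name='DataSet', without_attrs=True)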
References
copy offers settings for references, also called object pointers. Object pointers are a data type in HDF5: H5T_STD_REF_OBJ, similar to an integer H5T_STD_I32BE (source), and can be stored in attributes or data sets. References can point to whole objects or regions within a data set. copy only seems to cover object references. Does it break with data set regions (H5T_STD_REF_DSETREG)?
Symbolic links
The "locations" taken by the C api are one level of abstraction which explains why the copy function works on individual datasets. Look at the figure again, it is the edges which are labelled, not the nodes. Under the hood, HDF5 objects are the targets of links, each link (edge) has a name, the objects (nodes) do not have names. There are two types of links: hard links and symbolic links. All HDF5 objects must have at least one hard link, hard links can only target objects within their file. When hard links are created the reference count increases by one, symbolic links do not effect the reference count. Symbolic links may point to objects within the file (soft) or objects in other files (external). copy offers options to expand soft and external symbolic links.
This explains the error message (below) and suggests an alternative to copying your dataset: a symbolic link (an external link, since the target lives in another file) could give the new file access to the data set in the old file.
RuntimeError: Unable to create link (interfile hard links are not allowed)
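A sketch of that alternative (file and dataset names follow the question and are otherwise hypothetical): instead of copying, the new file can hold an external link that resolves to the dataset in the old file:
import h5py

with h5py.File('new_file.h5', 'a') as new_file:
    grp = new_file.require_group('new_group')
    # an external link points at an object in another file; no data is copied
    grp['new_dataset'] = h5py.ExternalLink('old_file.h5', '/old_dataset')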
Answer 1 (using h5py):
This creates a simple structured array to populate the first dataset in the first file.
The data is then read from that dataset and copied to the second file using my_array.
import h5py, numpy as np
arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print (arr.dtype)
h5file1 = h5py.File('test1.h5', 'w')
h5file1.create_dataset('/ex_group1/ex_ds1', data=arr)
print (h5file1)
my_array=h5file1['/ex_group1/ex_ds1']
h5file2 = h5py.File('test2.h5', 'w')
h5file2.create_dataset('/exgroup2/ex_ds2', data=my_array)
print (h5file2)
h5file1.close()
h5file2.close()
Answer 2 (using pytables):
This follows the same process as above with pytables functions. It creates the same simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file using my_array.
import tables, numpy as np
arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print (arr.dtype)
h5file1 = tables.open_file('test1.h5', mode = 'w', title = 'Test file')
my_group = h5file1.create_group('/', 'ex_group1', 'Example Group')
my_table = h5file1.create_table(my_group, 'ex_ds1', None, 'Example dataset', obj=arr)
print (h5file1)
my_array=my_table.read()
h5file2 = tables.open_file('test2.h5', mode = 'w', title = 'Test file')
h5file2.create_table('/exgroup2', 'ex_ds2', createparents=True, obj=my_array)
print (h5file2)
h5file1.close()
h5file2.close()

TensorFlow - tf.data.Dataset reading large HDF5 files

I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable length, each stored as a collection of compressed JPG images (to keep the size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file using custom Python logic is quite easy. For example:
def read_examples_hdf5(filename, label):
    with h5py.File(filename, 'r') as hf:
        # read frames from HDF5 and decode them from JPG
        return frames, label
filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames) # ... can we do this more elegantly?
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        read_examples_hdf5, [filename, label], [tf.uint8, tf.int64]))
)
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works; however, the problem is that tf.py_func seems to handle only one example at a time. As my HDF5 container stores 100 examples, this limitation causes significant overhead, as the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all 100 video examples into the dataset object and then move on to the next HDF5 file (preferably in multiple threads, each thread dealing with its own collection of HDF5 files).
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decoding them from JPG and then feeding them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) the pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() inputs for tf.data.Dataset but that is definitely not going to run in multiple threads it seems. Any suggestions are more than welcome.
I stumbled across this question while dealing with a similar issue. I came up with a solution based on using a Python generator, together with the TF dataset construction method from_generator. Because we use a generator, the HDF5 file should be opened for reading only once and kept open as long as there are entries to read. So it will not be opened, read, and then closed for every single call to get the next data element.
Generator definition
To allow the user to pass in the HDF5 filename as an argument, I generated a class that has a __call__ method since from_generator specifies that the generator has to be callable. This is the generator:
import h5py
import tensorflow as tf
class generator:
    def __init__(self, file):
        self.file = file

    def __call__(self):
        with h5py.File(self.file, 'r') as hf:
            for im in hf["train_img"]:
                yield im
By using a generator, the code should pick up from where it left off at each call from the last time it returned a result, instead of running everything from the beginning again. In this case it is on the next iteration of the inner for loop. So this should skip opening the file again for reading, keeping it open as long as there is data to yield. For more on generators, see this excellent Q&A.
Of course, you will have to replace anything inside the with block to match how your dataset is constructed and what outputs you want to obtain.
Usage example
ds = tf.data.Dataset.from_generator(
    generator(hdf5_path),
    tf.uint8,
    tf.TensorShape([427,561,3]))
value = ds.make_one_shot_iterator().get_next()

# Example on how to read elements
while True:
    try:
        data = sess.run(value)
        print(data.shape)
    except tf.errors.OutOfRangeError:
        print('done.')
        break
Again, in my case I had stored uint8 images of height 427, width 561, and 3 color channels in my dataset, so you will need to modify these in the above call to match your use case.
Handling multiple files
I have a proposed solution for handling multiple HDF5 files. The basic idea is to construct a Dataset from the filenames as usual, and then use the interleave method to process many input files concurrently, getting samples from each of them to form a batch, for example.
The idea is as follows:
ds = tf.data.Dataset.from_tensor_slices(filenames)
# You might want to shuffle() the filenames here depending on the application
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
        generator(filename),
        tf.uint8,
        tf.TensorShape([427,561,3])),
    cycle_length, block_length)
What this does is open cycle_length files concurrently, and produce block_length items from each before moving to the next file - see interleave documentation for details. You can set the values here to match what is appropriate for your application: e.g., do you need to process one file at a time or several concurrently, do you only want to have a single sample at a time from each file, and so on.
Edit: for a parallel version, take a look at tf.contrib.data.parallel_interleave!
Possible caveats
Be aware of the peculiarities of using from_generator if you decide to go with this solution. For TensorFlow 1.6.0, the documentation of from_generator mentions these two notes.
It may be challenging to apply this across different environments or with distributed training:
NOTE: The current implementation of Dataset.from_generator() uses tf.py_func and inherits the same constraints. In particular, it requires the Dataset- and Iterator-related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.
Be careful if the generator depends on external state:
NOTE: If generator depends on mutable global variables or other external state, be aware that the runtime may invoke generator multiple times (in order to support repeating the Dataset) and at any time between the call to Dataset.from_generator() and the production of the first element from the generator. Mutating global variables or external state can cause undefined behavior, and we recommend that you explicitly cache any external state in generator before calling Dataset.from_generator().
It took me a while to figure this out, so I thought I should record it here. Based on mikkola's answer, this is how to handle multiple files:
import h5py
import tensorflow as tf
class generator:
    def __call__(self, file):
        with h5py.File(file, 'r') as hf:
            for im in hf["train_img"]:
                yield im

ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
        generator(),
        tf.uint8,
        tf.TensorShape([427,561,3]),
        args=(filename,)),
    cycle_length, block_length)
The key is that you can't pass filename directly to generator, since it's a Tensor. You have to pass it through args, which TensorFlow evaluates and converts to a regular Python value.
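For what it's worth, the same args pattern should drop straight into the parallel variant mentioned earlier (a sketch assuming TensorFlow 1.x contrib APIs and the generator class above):
ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.apply(tf.contrib.data.parallel_interleave(
    lambda filename: tf.data.Dataset.from_generator(
        generator(),
        tf.uint8,
        tf.TensorShape([427,561,3]),
        args=(filename,)),
    cycle_length=4))  # read from 4 files in parallel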

Parsing osm.pbf data using GDAL/OGR python module

I'm trying to extract data from an osm.pbf file using the Python GDAL/OGR module.
Currently my code looks like this:
import gdal, ogr
osm = ogr.Open('file.osm.pbf')
## Select multipolygon from the layer
layer = osm.GetLayer(3)
# Create list to store pubs
pubs = []
for feat in layer:
    if feat.GetField('amenity') == 'pub':
        pubs.append(feat)
This little bit of code works fine with small .pbf files (15 MB). However, when parsing files larger than 50 MB I get the following error:
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES MODE
When I turn this mode on with:
gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
ogr does not return any features at all anymore, even when parsing small files.
Does anyone know what is going on here?
Thanks to scai's answer I was able to figure it out.
The special reading pattern required for interleaved reading that is mentioned in gdal.org/1.11/ogr/drv_osm.html is translated into the working Python example below.
This is an example of how to extract all features in an .osm.pbf file that have the 'amenity=pub' tag:
import gdal, ogr

gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
osm = ogr.Open('file.osm.pbf')

# Grab available layers in file
nLayerCount = osm.GetLayerCount()

thereIsDataInLayer = True
pubs = []
while thereIsDataInLayer:
    thereIsDataInLayer = False
    # Cycle through available layers
    for iLayer in xrange(nLayerCount):
        lyr = osm.GetLayer(iLayer)
        # Get first feature from layer
        feat = lyr.GetNextFeature()
        while (feat is not None):
            thereIsDataInLayer = True
            # Do something with feature, in this case store them in a list
            if feat.GetField('amenity') == 'pub':
                pubs.append(feat)
            # The destroy method is necessary for interleaved reading
            feat.Destroy()
            feat = lyr.GetNextFeature()
As far as I understand it, a while loop is needed instead of a for loop because, when using the interleaved reading method, it is impossible to obtain the feature count of a collection.
More clarification on why this piece of code works like it does would be greatly appreciated.

PyFITS: hdulist.writeto()

I'm extracting extensions from a multi-extension FITS file, manipulating the data, and saving the data (with the extension's header information) to a new FITS file.
To my knowledge pyfits.writeto() does the task. However, when I give it a data parameter in the form of an array, it gives me the error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Here is a sample of my code:
file = 'hst_11166_54_wfc3_ir_f110w_drz.fits'
hdulist = pyfits.open(dir + file)
sci = hdulist[1].data # science image data
exp = hdulist[5].data # exposure time data
sci = sci*exp # converts electrons/second to electrons
file = 'test_counts.fits'
hdulist.writeto(file,sci,clobber=True)
hdulist.close()
I appreciate any help with this. Thanks in advance.
You're confusing the HDUList.writeto method and the writeto function.
What you're calling is a method on the HDUList object that is returned when you call pyfits.open. You can think of this object as something like a file handle to your original drizzled FITS file. You can manipulate this object in place and either write it out to a new file or save updates in place (if you open the file in mode='update').
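For instance (a sketch built from the snippet in the question), you could scale the science extension in place and write the whole HDUList out, with no data argument at all:
import pyfits

hdulist = pyfits.open('hst_11166_54_wfc3_ir_f110w_drz.fits')
hdulist[1].data = hdulist[1].data * hdulist[5].data  # electrons/second -> electrons
hdulist.writeto('test_counts.fits', clobber=True)    # writes the modified HDUList as-is
hdulist.close()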
The writeto function on the other hand is not tied to any existing file. It's just a high-level function for writing an array out to a file. In your example you could write your array of electron counts out like:
pyfits.writeto(filename, data)
This will create a single-HDU FITS file with the array data in the PRIMARY HDU.
Do be aware of the admonishment at the top of this section of the docs: http://docs.astropy.org/en/v1.0.3/io/fits/index.html#convenience-functions
Functions like pyfits.writeto are there for convenience in interactive work, but are not recommended for use in code that will be run repeatedly, as in a script. Instead, have a look at these instructions to start.
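A minimal sketch of one object-oriented alternative, keeping the science extension's header as the question asks (filenames taken from the question):
import pyfits

hdulist = pyfits.open('hst_11166_54_wfc3_ir_f110w_drz.fits')
counts = hdulist[1].data * hdulist[5].data  # electrons/second -> electrons
hdu = pyfits.PrimaryHDU(data=counts, header=hdulist[1].header)
hdu.writeto('test_counts.fits', clobber=True)
hdulist.close()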
It is probably because you should use hdulist.writeto(file, clobber=True). There is only one required argument:
https://pythonhosted.org/pyfits/api_docs/api_hdulists.html#pyfits.HDUList.writeto
If you give a second argument, it is used for output_verify which should be a string, not a numpy array. This probably explains your AttributeError ....
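In other words, the fix is simply to drop the array argument; the HDUList already carries the data, and output_verify (if you need it) should be a string passed by keyword:
hdulist.writeto('test_counts.fits', clobber=True)  # correct: the data comes from the HDUList itself
# not: hdulist.writeto('test_counts.fits', sci, clobber=True)  (sci would be taken as output_verify)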

create shape file from csv file with python

I am working on a really big script right now where I have a csv file that I have removed rows and columns from, and edited the headers. I need to create one big shapefile for the entire csv file, then create individual shapefiles for the units under one of the headers. I thought the best way to do this would be to use arcpy.MakeXyEventLayer(); I saw in an arcgis sample script to then use arcpy.GetCount() for the output file of the XY event layer, then arcpy.SaveToLayerFile_management() and arcpy.FeatureClassToShapefile_conversion(), but when I run the script only my csv file is getting edited and there is no layer in the output file. Is there a step I am missing, or should this be making my shapefile?
These are the few lines of code I have used, after all of the csv file editing, to do what is described above:
outLyr = sys.argv[3]  # shapefile layer output name
XYLyr.newLyr(csvOut, lyrOutFile, spRef, sys.argv[4], sys.argv[5])  # x coordinate column; y coordinate column
print arcpy.GetCount_management(lyrOutFile)
csv2LYR.saveLYR(lyrOutFile, curDir)
arcpy.SaveToLayerFile_management does not save data to a shapefile or any other kind of featureclass. It only creates a .lyr file, which points to a data source and renders it with saved symbology, etc. You can use arcpy.FeatureClassToShapefile_conversion to create the shapefile from the in-memory feature layer created with arcpy.MakeXyEventLayer. Help for that tool is here.
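A rough sketch of that workflow (the csv path, XY field names, output folder and spatial reference are hypothetical):
import arcpy

csv_table = r"C:\data\points.csv"    # hypothetical input csv
out_folder = r"C:\data\shapefiles"   # hypothetical output folder
spatial_ref = arcpy.SpatialReference(4326)  # WGS 84

# create an in-memory event layer from the XY columns
arcpy.MakeXYEventLayer_management(csv_table, "X", "Y", "points_lyr", spatial_ref)

# write the event layer out as points_lyr.shp in the output folder
arcpy.FeatureClassToShapefile_conversion("points_lyr", out_folder)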

Categories

Resources