I'm trying to extract data from an OSM.PBF file using the Python GDAL/OGR module.
Currently my code looks like this:
import gdal, ogr

osm = ogr.Open('file.osm.pbf')

# Select the multipolygons layer
layer = osm.GetLayer(3)

# Create a list to store the pubs
pubs = []
for feat in layer:
    if feat.GetField('amenity') == 'pub':
        pubs.append(feat)
This little bit of code works fine with small .pbf files (15 MB). However, when parsing files larger than 50 MB I get the following error:
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES MODE
When I turn this mode on with:
gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
ogr does not return any features at all anymore, even when parsing small files.
Does anyone know what is going on here?
Thanks to scai's answer I was able to figure it out.
The special reading pattern required for interleaved reading, as described at gdal.org/1.11/ogr/drv_osm.html, is translated into a working Python example below.
This is an example of how to extract all features in an .osm.pbf file that have the 'amenity=pub' tag:
import gdal, ogr

gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
osm = ogr.Open('file.osm.pbf')

# Grab the number of layers in the file
nLayerCount = osm.GetLayerCount()

thereIsDataInLayer = True
pubs = []

while thereIsDataInLayer:
    thereIsDataInLayer = False
    # Cycle through the available layers
    for iLayer in xrange(nLayerCount):
        lyr = osm.GetLayer(iLayer)
        # Get the first feature from the layer
        feat = lyr.GetNextFeature()
        while feat is not None:
            thereIsDataInLayer = True
            # Do something with the feature; here, keep a copy in a list.
            # Clone it first: Destroy() below would otherwise invalidate the stored reference.
            if feat.GetField('amenity') == 'pub':
                pubs.append(feat.Clone())
            # The Destroy() call is necessary for interleaved reading
            feat.Destroy()
            feat = lyr.GetNextFeature()
As far as I understand it, a while loop is needed instead of a for loop because, with the interleaved reading method, the feature count of a layer cannot be obtained; instead, GetNextFeature() returns None once a layer has nothing more to offer in the current pass, so the layers are cycled through repeatedly until a complete pass yields no features at all.
More clarification on why this piece of code works like it does would be greatly appreciated.
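As a small illustrative sketch (not part of the original answer), the collected features can be read back afterwards; this assumes they were cloned as above so they are still valid, and that the pubs carry a 'name' field:

for pub in pubs:
    geom = pub.GetGeometryRef()
    if geom is not None:
        # Centroid() works for point and polygon geometries alike
        c = geom.Centroid()
        print('%s %.5f %.5f' % (pub.GetField('name'), c.GetX(), c.GetY()))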
I've been working on a music generation model using an LSTM, and I am a little confused about data preprocessing. I am using the music21 library to process the '.mid' files, and I am confused about the difference between these two code snippets:
parts = instrument.partitionByInstrument(midi)
if parts:  # file has instrument parts
    notes_to_parse = parts.parts[0].recurse()
else:  # file has notes in a flat structure
    notes_to_parse = midi.flat.notes
and
songs = instrument.partitionByInstrument(j)
for part in songs.parts:
    pick = part.recurse()
Is the code only considering one instrument in the first case, while in the second case we are taking all the instruments?
Please help me understand this, I am very confused.
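To make the contrast concrete, here is a hedged sketch (the file name is hypothetical). partitionByInstrument() splits the score into one part per instrument; parts.parts[0] keeps only the first of those parts, whereas looping over songs.parts visits every instrument:

from music21 import converter, instrument

midi = converter.parse('example.mid')  # hypothetical file
parts = instrument.partitionByInstrument(midi)

first_instrument_only = parts.parts[0].recurse()      # first snippet: one part
all_instruments = [p.recurse() for p in parts.parts]  # second snippet: every part

So yes: the first snippet parses notes from a single instrument (falling back to the flat note stream when no parts exist), while the second iterates over all instrument parts.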
I am struggling to find a way to retrieve metadata information from a file using GDAL.
Specifically, I would like to retrieve the band names and the order in which they are stored in a given file (be it a GeoTIFF or a NetCDF).
For instance, following the GDAL documentation, there is the "GetMetadata" method of gdal.Dataset (see here and here). Although this method returns a whole set of information regarding the dataset, it does not provide the band names or the order in which they are stored within the given file. As a matter of fact, this appears to be an old problem (from 2015) that has not been solved yet (more info here). The R language has apparently already solved this problem (see here), though Python hasn't.
Just to be thorough here, I know that there are other Python packages that can help in this endeavour (e.g., xarray, rasterio, etc.); nevertheless, it is important to keep the set of packages used in a single script small. Therefore, I would like to know a definitive way to find the band (a.k.a. variable) names and the order in which they are stored within a single file using GDAL.
Please let me know your thoughts in this regard.
Below, I present a starting point for solving this issue, in which a file is opened by GDAL (creating a Dataset object).
from gdal import Dataset
from osgeo import gdal

OpeneddatasetFile = gdal.Open(f'NETCDF:{input}/{file_name}.nc:{var}')

if isinstance(OpeneddatasetFile, Dataset):
    print("File opened successfully")
    # Here is where one should be capable of fetching the variable (a.k.a. band)
    # names of the OpeneddatasetFile.
    # Ideally, it would be most welcome to have some kind of method that could
    # return a dictionary with this information, something like:
    # VariablesWithinFile = OpeneddatasetFile.getVariablesWithinFileAsDictionary()
I have finally found a way to retrieve variable names from the NetCDF file using GDAL, and that is thanks to the comments given by Robert Davy above.
I have organized the code into a set of functions to make it easier to follow. Notice that there is also a function for reading metadata from the NetCDF file, which returns this info in a dictionary format (see the "readInfo" function).
from gdal import Dataset, InfoOptions
from osgeo import gdal


def read_data(filename):
    dataset = gdal.Open(filename)
    if not isinstance(dataset, Dataset):
        raise FileNotFoundError("Impossible to open the netcdf file")
    return dataset


def readInfo(ds, infoFormat="json"):
    "how to: https://gdal.org/python/"
    info = gdal.Info(ds, options=InfoOptions(format=infoFormat))
    return info


def listAllSubDataSets(infoDict: dict):
    subDatasetVariableKeys = [x for x in infoDict["metadata"]["SUBDATASETS"].keys()
                              if "_NAME" in x]
    subDatasetVariableNames = [infoDict["metadata"]["SUBDATASETS"][x]
                               for x in subDatasetVariableKeys]
    formattedSubDatasetVariableNames = []
    for x in subDatasetVariableNames:
        # keep only the variable name at the end of 'NETCDF:"file.nc":variable'
        s = x.replace('"', '').split(":")[-1]
        formattedSubDatasetVariableNames.append(s)
    return formattedSubDatasetVariableNames


if __name__ == "__main__":
    filename = "netcdfFile.nc"
    ds = read_data(filename)
    infoDict = readInfo(ds)
    infoDict["VariableNames"] = listAllSubDataSets(infoDict)
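For reference, a shorter route that avoids parsing the gdal.Info output is a hedged sketch relying on Dataset.GetSubDatasets(), which returns (name, description) tuples for NetCDF subdatasets, and on per-band descriptions for GeoTIFFs ('raster.tif' is a hypothetical file):

from osgeo import gdal

ds = gdal.Open("netcdfFile.nc")
# each subdataset name looks like 'NETCDF:"netcdfFile.nc":variable'
variables = [name.split(":")[-1] for name, description in ds.GetSubDatasets()]
print(variables)

tif = gdal.Open("raster.tif")
band_names = [tif.GetRasterBand(i + 1).GetDescription()  # empty if never set
              for i in range(tif.RasterCount)]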
I am a newbie to TensorFlow, so I would appreciate any constructive help.
I am trying to build a feature extraction and data pipeline with TensorFlow for video processing, where multiple folders hold video files belonging to multiple classes (the JHMDB database), but I am kind of stuck.
I have the features extracted to one folder, at the moment as separate *.npz compressed arrays, with the class name stored in the filename.
First Attempt
First I thought I would use this code from the TF tutorials site, simply reading files from a folder:
jhmdb_path = Path('...')

# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path/'*.npz'))
for f in list_ds.take(5):
    print(f.numpy())

def process_path(file_path):
    labels = tf.strings.split(file_path, '_')[-1]
    features = np.load(file_path)
    return features, labels

labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
    print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
...but this is not working.
Second Attempt
Then I thought OK, I will use generators:
# using a generator
jhmdb_path = Path('...')

def generator():
    for item in jhmdb_path.glob("*.npz"):
        features = np.load(item)
        print(features.files)
        print(features['PAFs'].shape)
        features = features['PAFs']
        yield features

dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))
TypeError: 'FlatMapDataset' object is not an iterator
...not working.
In the first case, the path somehow arrives as a bytes-like Tensor, and I could not change it to str to be able to load it with np.load(). (Strangely, if I point np.load(direct_path) directly at a file, it works.)
In the second case, I am not sure what is wrong.
I looked for hours for a solution to building an iterable dataset from a list of relatively big and numerous 'npz' or 'npy' files, but this seems not to be mentioned anywhere (or is just way too trivial, maybe).
Also, as I could not test the model so far, I am not sure if this is the right way to go, i.e. whether to feed the model with hundreds of files in this way, or to just build one huge 3.5 GB npz (which would still fit in memory) and use that instead. Or to use TFRecords, but that looks more complicated than the usual examples.
What is really annoying here is that the TF tutorials, and in general everything out there, show how to load a ready-made dataset directly, how to load np arrays, or how to load image, text, or dataframe objects, but I could not find any real example of how to process big chunks of data files, e.g. when extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)
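For what it's worth, here is a hedged sketch of how both attempts could be made to work; the 'PAFs' key and the label-in-filename convention come from the question, while everything else (paths, dtypes) is an assumption. In the first attempt, map() hands process_path a string Tensor, so np.load has to be wrapped in tf.py_function; in the second, from_generator needs an explicit output signature, and the dataset is consumed with next(iter(...)) rather than iter(next(...)):

import numpy as np
import tensorflow as tf
from pathlib import Path

jhmdb_path = Path('features')  # hypothetical folder of .npz files

# Fix for the second attempt: yield (features, label) pairs from the generator.
def generator():
    for item in sorted(jhmdb_path.glob('*.npz')):
        data = np.load(item)
        label = item.stem.split('_')[-1]  # class name encoded in the filename
        yield data['PAFs'], label

dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=None, dtype=tf.uint8),  # features, shape left open
        tf.TensorSpec(shape=(), dtype=tf.string),   # label
    ),
)

features, label = next(iter(dataset))
print(features.shape, label.numpy())

# Fix for the first attempt: let np.load run eagerly inside tf.py_function,
# where the incoming Tensor can be converted to a plain Python string.
def load_npz(path):
    path = path.numpy().decode('utf-8')  # the Tensor holds bytes, hence decode
    return np.load(path)['PAFs']

list_ds = tf.data.Dataset.list_files(str(jhmdb_path / '*.npz'))
labeled_ds = list_ds.map(
    lambda p: (tf.py_function(load_npz, [p], tf.uint8),
               tf.strings.split(p, '_')[-1]))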
I am scraping cars and will have many pictures; that part is not a problem. I also want to save the car specifications, and I am wondering about the best way to do this efficiently. Ideally, I would have something like the built-in datasets of many libraries, such as:
print(dataset)
{
    'image': ([255, 203, 145, ...]),
    'info': (['Audi', '355 HP', ...])
}
That way, I could easily extract images and info with dataset['info'] or something similar, and I could easily assign both at once with x, y = dataset.
There are several options, but for structured data like this it's common to store dictionary-like records using HDF5.
See the Python tutorial and full documentation here:
http://docs.h5py.org/en/stable/quick.html
Here's a full Python example. Notice the dictionary-like interface.
import h5py
import numpy as np

#####
# writing the output file
#####
my_file = h5py.File("output.h5", 'w')

my_file['info'] = np.string_("some_random pixels")  # hdf5 needs numpy to store strings
my_file['image'] = np.random.rand(5, 5)

my_file.close()

#####
# reading the input file
#####
loaded_file = h5py.File("output.h5", 'r')

print(np.array(loaded_file['info']))  # hdf5 needs numpy to read strings as well
print(np.array(loaded_file['image']))

loaded_file.close()
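If each car needs both its picture and its specs, one hedged extension (file and group names here are hypothetical) is to give every car its own HDF5 group holding an 'image' and an 'info' dataset:

import h5py
import numpy as np

with h5py.File("cars.h5", 'w') as f:
    car = f.create_group("car_0001")                        # one group per scraped car
    car['image'] = np.zeros((128, 128, 3), dtype=np.uint8)  # placeholder pixels
    car['info'] = np.array([b'Audi', b'355 HP'])            # byte strings for h5py

with h5py.File("cars.h5", 'r') as f:
    x, y = f['car_0001/image'][()], f['car_0001/info'][()]
    print(x.shape, [s.decode() for s in y])

This keeps the x, y = dataset access pattern from the question while letting the file grow car by car.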
I am trying to automate various tasks in ArcGIS Desktop (using ArcMap generally) with Python, and I keep needing a way to add a shape file to the current map. (And then do stuff to it, but that's another story).
The best I can do so far is to add a layer file to the current map, using the following ("addLayer" is a layer file object):
def AddLayerFromLayerFile(addLayer):
    import arcpy
    mxd = arcpy.mapping.MapDocument("CURRENT")
    df = arcpy.mapping.ListDataFrames(mxd, "Layers")[0]
    arcpy.mapping.AddLayer(df, addLayer, "AUTO_ARRANGE")
    arcpy.RefreshActiveView()
    arcpy.RefreshTOC()
    del mxd, df, addLayer
However, my raw data is always going to be shapefiles, so I need to be able to open them. (Equivalently: convert a shapefile to a layer file without opening it, but I'd prefer not to do that.)
The variable "theShape" is the path of the shapefile to be added.
import arcpy
import arcpy.mapping
# get the map document
mxd = arcpy.mapping.MapDocument("CURRENT")
# get the data frame
df = arcpy.mapping.ListDataFrames(mxd,"*")[0]
# create a new layer
newlayer = arcpy.mapping.Layer(theShape)
# add the layer to the map at the bottom of the TOC in data frame 0
arcpy.mapping.AddLayer(df, newlayer,"BOTTOM")
# Refresh things
arcpy.RefreshActiveView()
arcpy.RefreshTOC()
del mxd, df, newlayer
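Incidentally, if the layer-file route from the question is ever needed after all, a layer object can also be saved out to a .lyr file without adding it to the map; a hedged one-liner (the output path is hypothetical):

import arcpy
newlayer = arcpy.mapping.Layer(theShape)  # theShape as in the snippet above
arcpy.SaveToLayerFile_management(newlayer, r"C:\temp\theShape.lyr")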
Recently I struggled with a similar task, and initially used the method of identifying the map document, identifying the data frame, creating a layer, and adding the layer to the map document. Interestingly enough, this can all be accomplished with the following, provided it is called from within the current map document.
# import modules
import arcpy

# create a layer in the TOC and reference it in a variable for possible other actions
newLyr = arcpy.MakeFeatureLayer_management(
    in_features,
    out_layer
)[0]
Make Feature Layer requires two inputs: the input features and the output layer. The input features can be any type of feature class or layer, including shapefiles. The output layer is the name of the layer to appear in the table of contents.
Also, Make Feature Layer can accept a where clause to create a definition query at creation time. This is typically how I implement it when I need to create a lot of layers with different definition queries quickly.
Finally, in the above snippet, although it is not necessary, I demonstrated how to populate a variable with the tool output so the layer can be manipulated in the table of contents using arcpy.mapping later in the script if necessary. Every tool returns a result object. The result object's output can be accessed using the getOutput method, but also by indexing the result property you are interested in, in this case the output located at index 0.
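As a hedged illustration of those last two points (the shapefile name, layer name, and where clause are hypothetical):

import arcpy

# a where clause supplied at creation time becomes a definition query
result = arcpy.MakeFeatureLayer_management(
    "roads.shp", "major_roads_lyr", "CLASS = 'major'")

layer_via_index = result[0]             # index into the Result object
layer_via_method = result.getOutput(0)  # the equivalent explicit call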