I can't seem to open the shapefile from the zip3.zip archive I downloaded from http://www.vdstech.com/usa-data.aspx
Here is my code:
import geopandas as gpd
data = gpd.read_file("data/zip3.shp")
this gives me the error:
CPLE_AppDefinedError: b'Recode from CP437 to UTF-8 failed with the error: "Invalid argument".'
As per my answer on this question, it seems your dataset contains non-UTF-8 characters. If you are facing a similar issue, chances are that passing encoding="utf-8" won't help, as Fiona's open() call will still fail.
If other approaches don't work, here are two solutions that fixed this issue for me:
Open your shapefile in a GIS editor (like QGIS), then save it again, making sure you set the Encoding option to "UTF-8". After this you should have no problem calling gpd.read_file("data/zip3.shp").
You can also achieve this encoding change in Python using GDAL, by reading your shapefile and saving it again. This effectively rewrites the data with UTF-8 encoding, which is the default as indicated in the docs for the CreateDataSource() method. For this, try the following code snippet:
from osgeo import ogr

driver = ogr.GetDriverByName("ESRI Shapefile")
ds = driver.Open("nbac_2016_r2_20170707_1114.shp", 0)  # open your shapefile read-only
layer = ds.GetLayer()  # get its layer
# create a new shapefile to hold the converted data
ds2 = driver.CreateDataSource('convertedShape.shp')
# create a Polygon layer, as the one your shapefile has, keeping its spatial reference
layer2 = ds2.CreateLayer('', layer.GetSpatialRef(), ogr.wkbPolygon)
# copy the attribute field definitions so the attributes survive the conversion
for i in range(layer.GetLayerDefn().GetFieldCount()):
    layer2.CreateField(layer.GetLayerDefn().GetFieldDefn(i))
# iterate over all features of your original shapefile
for feature in layer:
    # and write each feature to the converted shapefile
    layer2.CreateFeature(feature)
# proper closing flushes everything to disk
ds = layer = ds2 = layer2 = None
It looks like this shapefile doesn't have an associated .cpg file specifying the encoding of the .dbf file, and falling back to your default system encoding isn't working either. You should be able to open it with:
data = gpd.read_file("data/zip3.shp", encoding="utf-8")
geopandas relies on fiona for shapefile reading, and you may need to upgrade your fiona version for this to work; see some discussion here.
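If you are unsure which fiona version you have installed, a quick check (nothing here is specific to this dataset):
import fiona
print(fiona.__version__)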
Since you probably have GDAL installed, I recommend converting the file to UTF-8 using the CLI:
ogr2ogr output.shp input.shp -lco ENCODING=UTF-8
Worked like a charm for me. It's much faster than QGIS and can be used in a cluster environment. I also posted this answer here. Specifying the encoding in geopandas did not work for me.
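If you'd rather stay in Python, GDAL exposes the equivalent of ogr2ogr as gdal.VectorTranslate; a minimal sketch of the same conversion (file names are illustrative):
from osgeo import gdal

# equivalent of: ogr2ogr output.shp input.shp -lco ENCODING=UTF-8
gdal.VectorTranslate(
    "output.shp",
    "input.shp",
    format="ESRI Shapefile",
    layerCreationOptions=["ENCODING=UTF-8"],
)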
Maybe the file depends on other files.
A shapefile is actually a collection of files (at minimum .shp, .shx, and .dbf). I faced the same problem, and when I copied the other files this shapefile depends on alongside it, the code ran correctly, but it asked me to install another package called descartes. Once I installed that package, the code ran correctly.
I am struggling to find a way to retrieve metadata information from a FILE using GDAL.
Specifically, I would like to retrieve the band names and the order in which they are stored in a given file (be it a GeoTIFF or a NetCDF).
For instance, if we follow the description in the GDAL documentation, there is the GetMetadata method of gdal.Dataset (see here and here). Although this method returns a whole set of information about the dataset, it does not provide the band names or the order in which they are stored within the given FILE. In fact, this appears to be an old problem (from 2015) that still hasn't been solved (more info here). The R language has apparently already solved it (see here), though Python hasn't.
Just to be thorough here, I know that there are other Python packages that can help in this endeavour (e.g., xarray, rasterio, etc.); nevertheless, it is important to keep the set of packages used in a single script concise. Therefore, I would like to know a definitive way to find the band (a.k.a. variable) names and the order in which they are stored within a single FILE using gdal.
Please let me know your thoughts in this regard.
Below I present a starting point for solving this issue, in which a file is opened by GDAL (creating a Dataset object).
from osgeo import gdal
from osgeo.gdal import Dataset

OpeneddatasetFile = gdal.Open(f'NETCDF:{input}/{file_name}.nc:' + var)
if isinstance(OpeneddatasetFile, Dataset):
    print("File opened successfully")
    # here is where one should be capable of fetching the variable (a.k.a. band) names
    # of the OpeneddatasetFile.
    # Ideally, there would be some kind of method that could return a dictionary
    # with this information, something like:
    # VariablesWithinFile = OpeneddatasetFile.getVariablesWithinFileAsDictionary()
I have finally found a way to retrieve variable names from the NETCDF file using GDAL, thanks to the comments given by Robert Davy above.
I have organized the code into a set of functions to make it easier to follow. Notice that there is also a function for reading metadata from the NETCDF, which returns this info in a dictionary format (see the "readInfo" function).
from osgeo import gdal
from osgeo.gdal import Dataset, InfoOptions


def read_data(filename):
    dataset = gdal.Open(filename)
    if not isinstance(dataset, Dataset):
        raise FileNotFoundError("Impossible to open the netcdf file")
    return dataset


def readInfo(ds, infoFormat="json"):
    """how to: https://gdal.org/python/"""
    info = gdal.Info(ds, options=InfoOptions(format=infoFormat))
    return info


def listAllSubDataSets(infoDict: dict):
    # keys like "SUBDATASET_1_NAME" hold the full subdataset identifiers
    subDatasetVariableKeys = [x for x in infoDict["metadata"]["SUBDATASETS"].keys()
                              if "_NAME" in x]
    subDatasetVariableNames = [infoDict["metadata"]["SUBDATASETS"][x]
                               for x in subDatasetVariableKeys]
    formatedsubDatasetVariableNames = []
    for x in subDatasetVariableNames:
        # keep only the variable name at the end of 'NETCDF:"file.nc":variable'
        s = x.replace('"', '').split(":")[-1]
        formatedsubDatasetVariableNames.append(s)
    return formatedsubDatasetVariableNames


if __name__ == "__main__":
    filename = "netcdfFile.nc"
    ds = read_data(filename)
    infoDict = readInfo(ds)
    infoDict["VariableNames"] = listAllSubDataSets(infoDict)
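As an aside, if you only need the subdataset names, GDAL's Dataset.GetSubDatasets() returns them directly as (name, description) tuples, without going through gdal.Info; a minimal sketch (file name is illustrative):
from osgeo import gdal

ds = gdal.Open("netcdfFile.nc")
# each entry is a (name, description) tuple, e.g.
# ('NETCDF:"netcdfFile.nc":air', '[...] air (32-bit floating-point)')
for name, description in ds.GetSubDatasets():
    print(name.split(":")[-1], "->", description)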
I would like to get the byte contents of a pandas dataframe exported as hdf5, ideally without actually saving the file (i.e., in-memory).
On Python >=3.6, <3.9 (with pandas==1.2.4 and pytables==3.6.1) the following used to work:
import pandas as pd

with pd.HDFStore(
    "in-memory-save-file",
    mode="w",
    driver="H5FD_CORE",
    driver_core_backing_store=0,
) as store:
    store.put("my_key", df, format="table")
    binary_data = store._handle.get_file_image()
Where df is the dataframe to be converted to hdf5, and the last line calls this pytables function.
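As a quick sanity check that the image is a complete HDF5 file, you can dump the bytes to disk and read them back; the file name below is just for illustration:
# round-trip check: write the in-memory image out and re-read it
with open("roundtrip.h5", "wb") as f:
    f.write(binary_data)
restored = pd.read_hdf("roundtrip.h5", "my_key")
assert restored.equals(df)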
However, starting with python 3.9, I get the following error when using the snippet above:
File "tables/hdf5extension.pyx", line 523, in tables.hdf5extension.File.get_file_image
tables.exceptions.HDF5ExtError: Unable to retrieve the size of the buffer for the file image. Plese note that not all drivers provide support for image files.
The error is raised by the same pytables function linked above, apparently due to issues while retrieving the size of the buffer for the file image. I don't understand the ultimate reason for it, though.
I have tried other alternatives such as saving to a BytesIO file-object, so far unsuccessfully.
How can I keep the hdf5 binary of a pandas dataframe in-memory on python 3.9?
The fix was to do conda install -c conda-forge pytables instead of pip install pytables. I still don't understand the ultimate reason behind the error, though.
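If you want to confirm which build you ended up with after switching to the conda-forge package, a quick check (on PyTables 3.x, tables.hdf5_version reports the HDF5 library PyTables was compiled against):
import tables
print(tables.__version__, tables.hdf5_version)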
I need to read OpenAir files in Python.
According to the following vector driver description, GDAL has built-in OpenAir functionality:
https://gdal.org/drivers/vector/openair.html
However there is no example code for reading such OpenAir files.
So far I have tried to read a sample file using the following lines:
from osgeo import gdal
airspace = gdal.Open('export.txt')
However it returns the following error:
ERROR 4: `export.txt' not recognized as a supported file format.
I already looked at vectorio; however, no OpenAir functionality has been implemented there.
Why do I get the error above?
In case anyone wants to reproduce the problem: sample OpenAir files can easily be generated using XContest:
https://airspace.xcontest.org/
Since you're dealing with vector data, you need to use ogr instead of gdal (it's normally packaged along with gdal).
So you can do:
from osgeo import ogr
ds = ogr.Open('export.txt')
layer = ds.GetLayer(0)
featureCount = layer.GetFeatureCount()
print(featureCount)
There's plenty of info out there on using ogr, but this cookbook might be helpful.
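To go beyond a feature count, you can iterate over the layer and inspect each feature; a small sketch using standard OGR calls (items() is used so we don't have to assume which attribute fields the OpenAir driver exposes):
from osgeo import ogr

ds = ogr.Open('export.txt')
layer = ds.GetLayer(0)
for feature in layer:
    # items() returns the feature's attribute fields as a dict
    print(feature.items())
    geom = feature.GetGeometryRef()
    if geom is not None:
        print(geom.GetGeometryName(), geom.GetEnvelope())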
When using xarray open_dataset or open_mfdataset to load a NARR netcdf dataset (e.g. ftp://ftp.cdc.noaa.gov/Datasets/NARR/monolevel/air.2m.2010.nc), xarray returns an error regarding "conflicting _FillValue and missing_value".
Entering:
ds = xarray.open_dataset('air.2m.2010.nc')
yields this error:
ValueError: ('Discovered conflicting _FillValue and missing_value. Considering opening the offending dataset using decode_cf=False, corrected the attributes', 'and decoding explicitly using xray.conventions.decode_cf(ds)')
When using the suggestion to open as such:
ds = xarray.open_dataset('air.2m.2010.nc', decode_cf=False)
The dataset is opened, but the variables, time, coordinates, etc. are not decoded (obviously). Using xarray.decode_cf(ds) explicitly does not seem to help either, as the same error is encountered.
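For reference, my understanding of "correcting the attributes and decoding explicitly" is something like the following (a sketch that assumes the conflict sits on the 'air' variable):
import xarray

ds = xarray.open_dataset('air.2m.2010.nc', decode_cf=False)
# drop one of the two conflicting attributes before decoding explicitly
del ds['air'].attrs['missing_value']
ds = xarray.decode_cf(ds)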
I believe this error arises because the NARR dataset is on a Lambert Conformal grid, so there are missing values due to the shape of the grid as it is opened by xarray, and for some reason these conflict with the fill values.
What is the best way to open and decode this file in xarray?
N.B. I have been able to open and decode the file using netcdf4-python, but I would like to be able to do this in xarray to utilize its out-of-core computation functionality provided by dask.
This issue has been fixed in more recent versions of xarray. Using version 0.12, I get the following:
>>> ds = xr.open_dataset('air.2m.2010.nc')
.../conventions.py:394: SerializationWarning: variable 'air' has multiple fill values {9.96921e+36, -9.96921e+36}, decoding all values to NaN.
In other words, it raises a warning, but not an error, and successfully applies a mask to both missing values.
So your issue can be fixed by upgrading to a more recent version of xarray.
I was able to solve a similar issue I was having with NARR data from the same source and xarray, but only for the time variable; I did not have issues with the other variables.
I am sure there are much easier ways to do this (I am still pretty new to Python + xarray), but I ended up taking the time variable and values from the dataset(s) I was interested in, creating a new dataset, 'decoding' its time, and then updating the time variable and values in my original dataset of interest.
import xarray as xr

# open without decoding so the raw time attributes are preserved
test = xr.open_mfdataset(r'evap*nc', decode_cf=False)
t_unit = test.variables['time']
t_unit.attrs['units']
# u'hours since 1800-1-1 00:00:0.0'
# rewrite the units string in a form the decoder accepts
attrs = {'units': 'hours since 1800-01-01'}
ds = xr.Dataset({'time': ('time', t_unit, attrs)})
ds = xr.decode_cf(ds)
# put the decoded time back into the original dataset
test.update({'time': ('time', ds['time'])})
Please let me know if you find an easier way! I don't have this issue with the study datasets I am currently using from another source, but would be curious as to how others solved this issue with the ESRL NARR data.
I am trying to access an HDF5 file with a compressed image datablock. I use the classic GDAL call:
f = gdal.Open(path+product)
but this does not seem to work: the returned dataset is None, as you can see below:
Starting processing proba database
processing PROBAV_L1C_20131009_092303_2_V001.HDF5
None
processing PROBAV_L1C_20130925_092925_2_V001.HDF5
None
Processing complete
I would like to ask if someone can give me some indication of how to handle HDF5 with GDAL, without using h5py, which does not support compressed datablocks either.
Thanks
It couldn't open the file, either because it couldn't see the path or because you don't have an HDF5 driver for Python. Returning None is the expected behaviour, but GDAL can be told to raise an exception instead when it cannot open a file:
from osgeo import gdal

gdal.UseExceptions()
if not gdal.GetDriverByName('HDF5'):
    raise Exception('HDF5 driver is not available')
I think you're missing the protocol prefix in your Open call.
This works for me with other Proba images:
from osgeo import gdal

path = "PROBAV_L2A_20140321_031709_2_333M_V001.HDF5"
product = "LEVEL2A/GEOMETRY/SAA"
# open a single subdataset using the HDF5:"file"://dataset syntax
f = gdal.Open('HDF5:"{}"://{}'.format(path, product))
f.ReadAsArray()
You could also read the complete subdataset name using GetSubDatasets, which returns a list of tuples:
ds = gdal.Open(path)
subdataset_read = ds.GetSubDatasets()[0]
print("Subdataset: ", subdataset_read)
ds_sub = gdal.Open(subdataset_read[0], gdal.GA_ReadOnly)
ds_sub.ReadAsArray()