How do I use the pydap library to collect THREDDS data?

I have been trying to use the example get_nomads.py module from Will Holgren, which he was nice enough to forward my way. In the code, there is a call to get the THREDDS data as follows:
from pydap.client import open_url
dataset = open_url('https://nomads.ncdc.noaa.gov/thredds/dodsC/gfs-004/201612/20161201/gfs_4_20161201_0000_003.grb2')
This does not work because (apparently) the old THREDDS server has been decommissioned:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm 2018.2.4\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\client.py", line 64, in open_url
dataset = DAPHandler(url, application, session, output_grid).dataset
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\handlers\dap.py", line 51, in __init__
raise_for_status(r)
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\net.py", line 30, in raise_for_status
comment=response.body
webob.exc.HTTPError: 404 Not Found
Looking around, I have not been able to find a THREDDS server that supports this method of data access.
BTW: I am able to get data as follows:
url = 'http://dtvirt5.deltares.nl:8080/thredds/dodsC/opendap/rijkswaterstaat/jarkus/profiles/transect.nc'
dataset = open_url(url)
<DatasetType with children 'id', 'areacode', 'areaname', 'alongshore', 'cross_shore', 'time', 'time_bounds', 'epsg', 'x', 'y', 'lat', 'lon', 'angle', 'mean_high_water', 'mean_low_water', 'max_cross_shore_measurement', 'min_cross_shore_measurement', 'nsources', 'max_altitude_measurement', 'min_altitude_measurement', 'rsp_x', 'rsp_y', 'rsp_lat', 'rsp_lon', 'time_topo', 'time_bathy', 'origin', 'altitude'>
variable = dataset['id']
print(variable[0:10])
[2000100 2000101 2000102 2000103 2000104 2000105 2000106 2000120 2000140
2000160]
I also see that I can manually download the data from https://www.ncei.noaa.gov/thredds/dodsC/gfs-g4-anl-files/201808/20180828/gfsanl_4_20180828_1800_006.grb2.html
But I can't seem to find the argument format to download the data using pydap. I think all I need is a pointer to a real THREDDS server that has the appropriate DDS and DAS files at the same URI location.
Does anyone know how to get the GFS4 GRB files using the pydap client?
Thanks

I haven't tested this with pydap, but I have tested with netCDF4, which works very well with THREDDS. This should work using pydap:
dataset = open_url('http://www.ncei.noaa.gov/thredds/dodsC/gfs-g4-anl-files/201612/20161201/gfsanl_4_20161201_0000_003.grb2')
The THREDDS OPeNDAP form for that one file is here:
The main catalog organized by YYYYMM/ is at:
https://www.ncei.noaa.gov/thredds/catalog/gfs-g4-anl-files/catalog.html
All NCEI GFS datasets with links to TDS access can be seen here:
https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

When I try the link you provided, Eric, I get this error with pydap:
dataset = open_url('http://www.ncei.noaa.gov/thredds/dodsC/gfs-g4-anl-files/201612/20161201/gfsanl_4_20161201_0000_003.grb2')
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm 2018.2.4\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\client.py", line 64, in open_url
dataset = DAPHandler(url, application, session, output_grid).dataset
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\handlers\dap.py", line 64, in __init__
self.dataset = build_dataset(dds)
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\parsers\dds.py", line 161, in build_dataset
return DDSParser(dds).parse()
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\parsers\dds.py", line 49, in parse
self.consume('dataset')
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\parsers\dds.py", line 41, in consume
token = super(DDSParser, self).consume(regexp)
File "C:\Users\pmoran\jira\slf\venv\lib\site-packages\pydap\parsers\__init__.py", line 182, in consume
raise Exception("Unable to parse token: %s" % self.buffer[:10])
Exception: Unable to parse token: <!DOCTYPE
However, per your suggestion, I'm able to get data with netCDF4.
Here's what I did:
>>> import netCDF4
>>> nc = netCDF4.Dataset('http://www.ncei.noaa.gov/thredds/dodsC/gfs-g4-anl-files/201612/20161201/gfsanl_4_20161201_0000_003.grb2')
>>> nc.variables.keys()
odict_keys(['LatLon_Projection', 'lat', 'lon', 'reftime', 'time', 'time_bounds',
...
'v-component_of_wind_altitude_above_msl', 'v-component_of_wind_height_above_ground', 'v-component_of_wind_tropopause', 'v-component_of_wind_sigma'])
That seems to work. Not sure what's wrong with pydap.
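One hedged guess about the pydap failure, based on the "Unable to parse token: <!DOCTYPE" line: the server answered the DDS request with an HTML page rather than a DDS document, which often happens when a plain-http URL gets redirected. Since the browser link above uses https, it may be worth retrying the same dataset over https (untested):

from pydap.client import open_url

# untested: same NCEI dataset as above, but requested over https
dataset = open_url('https://www.ncei.noaa.gov/thredds/dodsC/gfs-g4-anl-files/201612/20161201/gfsanl_4_20161201_0000_003.grb2')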

TypeError: Parameter to MergeFrom() must be instance of same class: expected google.datastore.v1.Key.PathElement got PathElement

I am trying to push data to GCP Datastore. The code snippet below works fine in a Jupyter Notebook, but it throws an error in VS Code.
def load_data_json(self, kind_name, data_with_qp_ID, qp_id):
    # Load the data in JSON format to upload into the DataStore
    data_with_qp_ID_as_JSON = self.convert_DF_to_JSON(data_with_qp_ID, qp_id)
    # Loop to iterate through the JSON format and upload into the GCS Storage
    for data in data_with_qp_ID_as_JSON.keys():
        with self.client.transaction():
            incomplete_key = self.client.key(kind_name)
            task = datastore.Entity(key=incomplete_key)
            task.update(data_with_qp_ID_as_JSON[data])
            self.client.put(task)
    return 'Ingestion Successful - Data Store Repository'
I have defined the name of the bucket in kind_name; data_with_qp_ID is a pandas DataFrame, and qp_id is the name of a column in it. Please see the error message I get below:
Traceback (most recent call last):
File "/Users/ajaykrishnan/Desktop/Projects/Sprint 3/Data Migration/DataMigration_v1.1/main2.py", line 139, in <module>
write_datastore_db.load_data_json(ds_kindname, bookmarks_data_with_qp_ID, qp_id)
File "/Users/ajaykrishnan/Desktop/Projects/Sprint 3/Data Migration/DataMigration_v1.1/pkg/repository/ds_repository.py", line 50, in load_data_json
self.client.put(task)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/client.py", line 597, in put
self.put_multi(entities=[entity], retry=retry, timeout=timeout)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/client.py", line 634, in put_multi
current.put(entity)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/transaction.py", line 315, in put
super(Transaction, self).put(entity)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/batch.py", line 227, in put
_assign_entity_to_pb(entity_pb, entity)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/batch.py", line 373, in _assign_entity_to_pb
bare_entity_pb = helpers.entity_to_protobuf(entity)
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/helpers.py", line 208, in entity_to_protobuf
key_pb = entity.key.to_protobuf()
File "/opt/anaconda3/lib/python3.9/site-packages/google/cloud/datastore/key.py", line 298, in to_protobuf
key.path.append(element)
TypeError: Parameter to MergeFrom() must be instance of same class: expected google.datastore.v1.Key.PathElement got PathElement.
My environment is as follows:
macOS Monterey v12.06
Python 3.9.12 (Conda)
I was able to clear this error. It was an issue with the protobuf library my environment was using: I downgraded protobuf from 4.x.x to 3.20.1 and it worked.
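For reference, assuming a pip-managed environment like the Conda setup above, the downgrade can be pinned with pip install protobuf==3.20.1 (or by adding protobuf==3.20.1 to requirements.txt).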

I am having trouble reading RData using pyreadr

I have installed pyreadr and would like to use it to read .RData files. I have followed the directions given in a previous answer here (Loading RData), but that did not work for me.
When I try the following line (with the path pointing to the right file):
result = pyreadr.read_r('/path/to/file.RData')
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/apps/bioapps/python/Python-3.6.1/lib/python3.6/site-packages/pyreadr/pyreadr.py",
line 80, in list_objects
parser.parse(path)
File "pyreadr/librdata.pyx", line 117, in pyreadr.librdata.Parser.parse
File "pyreadr/librdata.pyx", line 142, in pyreadr.librdata.Parser.parse
pyreadr.custom_errors.LibrdataError: Unable to read from file
Please let me know if you have any suggestions.
Thanks!
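For reference, when read_r does succeed it returns an OrderedDict mapping each R object's name to a pandas DataFrame, so a minimal sketch of the intended usage looks like this (the object name my_df is a placeholder):

import pyreadr

result = pyreadr.read_r('/path/to/file.RData')
print(result.keys())      # names of the R objects stored in the file
df = result['my_df']      # placeholder name; use one of the printed keys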

How can I read Minecraft .mca files so that I can extract individual blocks in Python?

I can't find a way of reading the Minecraft world files that I could use in Python.
I've looked around the internet but can find no tutorials, and only a few libraries that claim they can do this but never actually work.
from nbt import *
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
I expected this to work, but instead I got errors about the file not being compressed, or something of the sort.
Full error:
Traceback (most recent call last):
File "C:\Users\rober\Desktop\MinePy\MinecraftWorldReader.py", line 2, in <module>
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 628, in __init__
self.parse_file()
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 652, in parse_file
type = TAG_Byte(buffer=self.file)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 99, in __init__
self._parse_buffer(buffer)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 105, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 463, in read
if not self._read_gzip_header():
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Use anvil-parser (install with pip install anvil-parser).
Reading
import anvil
region = anvil.Region.from_file('r.0.0.mca')
# You can also provide the region file name instead of the object
chunk = anvil.Chunk.from_region(region, 0, 0)
# If `section` is not provided, will get it from the y coords
# and assume it's global
block = chunk.get_block(0, 0, 0)
print(block) # <Block(minecraft:air)>
print(block.id) # air
print(block.properties) # {}
https://pypi.org/project/anvil-parser/
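As a small follow-up sketch built only from the calls shown above (the chunk coordinates and the y=0 layer are arbitrary choices for illustration), you can tally the block ids of one 16x16 layer:

from collections import Counter

import anvil

region = anvil.Region.from_file('r.0.0.mca')
chunk = anvil.Chunk.from_region(region, 0, 0)

# Count block ids across the 16x16 layer at y=0, using only get_block() and .id
counts = Counter(
    chunk.get_block(x, 0, z).id
    for x in range(16)
    for z in range(16)
)
print(counts)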
According to this page, the .mca file is not simply an NBT file. It begins with an 8 KiB header which contains the offsets of the chunks within the region file and the timestamps of the last updates of those chunks.
I recommend looking at the official announcement and this page for more information.
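If you want to poke at that 8 KiB header yourself, here is a minimal sketch assuming the standard region-file layout (1024 location entries of a 3-byte big-endian sector offset plus a 1-byte sector count, followed by 1024 big-endian 4-byte timestamps); it just prints the raw header fields for a few chunk slots:

import struct

with open('r.0.0.mca', 'rb') as f:
    locations = f.read(4096)   # 1024 x 4-byte chunk location entries
    timestamps = f.read(4096)  # 1024 x 4-byte big-endian timestamps

for i in range(4):  # first few chunk slots, for illustration
    off_hi, off_mid, off_lo, sector_count = locations[4 * i:4 * i + 4]
    sector_offset = (off_hi << 16) | (off_mid << 8) | off_lo  # offset in 4 KiB sectors
    (timestamp,) = struct.unpack('>I', timestamps[4 * i:4 * i + 4])
    print(i, sector_offset * 4096, sector_count, timestamp)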

Can Cloud Dataflow not use the "google.cloud.datastore" package?

I want to write to Datastore with a transaction from Cloud Dataflow.
So I wrote the code below.
def exe_dataflow():
    ....

from google.cloud import datastore

# call from pipeline
def ds_test(content):
    datastore_client = datastore.Client()
    kind = 'test_out'
    name = 'change'
    task_key = datastore_client.key(kind, name)
    for _ in range(3):
        with datastore_client.transaction():
            current_value = datastore_client.get(task_key)
            current_value['v'] += content['v']
            datastore_client.put(current_value)

# pipeline
....
    | 'datastore test' >> beam.Map(ds_test)
But an error occurred, and the log message below was displayed:
(7b75e0ef2db229da): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
op.start()
...(SNIP)...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
return getattr(__import__(module, None, None, [obj]), obj)
AttributeError: 'module' object has no attribute 'datastore'
Can Cloud Dataflow not use the "google.cloud.datastore" package?
Added 2018/2/28:
I added --requirements_file to MyOptions:
options = MyOptions(flags = ["--requirements_file", "./requirements.txt"])
and created requirements.txt:
google-cloud-datastore==1.5.0
But another error occurred:
(366397598dcf7f02): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
op.start()
...(SNIP)...
File "my_dataflow.py", line 66, in to_entity
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/__init__.py", line 60, in <module>
from google.cloud.datastore.batch import Batch
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/batch.py", line 24, in <module>
from google.cloud.datastore import helpers
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/helpers.py", line 29, in <module>
from google.cloud.datastore_v1.proto import datastore_pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/__init__.py", line 17, in <module>
from google.cloud.datastore_v1 import types
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/types.py", line 21, in <module>
from google.cloud.datastore_v1.proto import datastore_pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/proto/datastore_pb2.py", line 17, in <module>
from google.cloud.datastore_v1.proto import entity_pb2 as google_dot_cloud_dot_datastore__v1_dot_proto_dot_entity__pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/proto/entity_pb2.py", line 28, in <module>
dependencies=[google_dot_api_dot_annotations__pb2.DESCRIPTOR,google_dot_protobuf_dot_struct__pb2.DESCRIPTOR,google_dot_protobuf_dot_timestamp__pb2.DESCRIPTOR,google_dot_type_dot_latlng__pb2.DESCRIPTOR,])
File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 824, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "google/cloud/datastore_v1/proto/entity.proto":
google.datastore.v1.PartitionId.project_id: "google.datastore.v1.PartitionId.project_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
...(SNIP)...
google.datastore.v1.Entity.properties: "google.datastore.v1.Entity.PropertiesEntry" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/datastore_v1/proto/entity.proto". To use it here, please add the necessary import.
The recommended way to interact with Cloud Datastore from a Cloud Dataflow pipeline is to use the Datastore I/O API, which is available through the Dataflow SDK and provides methods to read and write data to a Cloud Datastore database.
You can find detailed documentation for the Datastore I/O package for the Dataflow SDK 2.x for Python at this other link. The datastore.v1.datastoreio module is the specific module you want to use. There is plenty of information in the links I am sharing, but in short, it is a connector to Datastore that uses PTransforms to read / write / delete a PCollection from Datastore using the classes ReadFromDatastore() / WriteToDatastore() / DeleteFromDatastore(), respectively.
You should try using it instead of implementing the calls yourself. I suspect this may be the reason for the error you are seeing, as a Datastore implementation already exists in the Dataflow SDK:
"google.datastore.v1.PartitionId.project_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
UPDATE:
It looks like those three classes collect several mutations and execute them in a single transaction. You can check that in the code defining the classes.
If the aim is to retrieve (get()) and then update (put()) a Datastore entity, you can probably work with the write_mutations() function, which is described in the documentation, and perform a full batch of mutations with the operations you are interested in.

DeleteResource from Google Docs with Python

I am trying to delete a spreadsheet in Google Docs with this function:
def f_DeleteResource(xls_name):
    """Delete a resource"""
    client = Auth()
    for e1 in client.GetResources().entry:
        e2 = client.GetResource(e1)
        if xls_name == e2.title.text:
            client.DeleteResource(e2.resource_id.text, True)
And I obtain different errors when I change the first parameter of client.DeleteResource(p1,p2):
client.DeleteResource(e2.resource_id.text,True):
Traceback (most recent call last):
File "C:\xmp\D6GDocsDeleteUpload.py", line 164, in <module> main()
File "C:\xmp\D6GDocsDeleteUpload.py", line 157, in main f_DeleteResource(sys.argv[2])
File "C:\xmp\D6GDocsDeleteUpload.py", line 144, in f_DeleteResource client.DeleteResource(e2.resource_id.text,True)
File "C:\Python27\lib\site-packages\gdata\docs\client.py", line 540, in delete_resource uri = entry.GetEditLink().href
AttributeError: 'str' object has no attribute 'GetEditLink'
client.DeleteResource(e2,True):
Traceback (most recent call last):
File "C:\xmp\D6GDocsDeleteUpload.py", line 164, in <module> main()
File "C:\xmp\D6GDocsDeleteUpload.py", line 157, in main f_DeleteResource(sys.argv[2])
File "C:\xmp\D6GDocsDeleteUpload.py", line 144, in f_DeleteResource client.DeleteResource(e2,True)
File "C:\Python27\lib\site-packages\gdata\docs\client.py", line 543, in delete_resource return super(DocsClient, self).delete(uri, **kwargs)
File "C:\Python27\lib\site-packages\gdata\client.py", line 748, in delete **kwargs)
File "C:\Python27\lib\site-packages\gdata\docs\client.py", line 66, in request return super(DocsClient, self).request(method=method, uri=uri, **kwargs)
File "C:\Python27\lib\site-packages\gdata\client.py", line 319, in request RequestError)
gdata.client.RequestError: Server responded with: 403, <errors xmlns='http://schemas.google.com/g/2005'><error><domain>GData</domain><code>matchHeaderRequired</code><location type='header'>If-Match|If-None-Match</location><internalReason>If-Match or If-None-Match header or entry etag attribute required</internalReason></error></errors>
Can anyone help me?
It seems to be a bug in the Google API Python library. I checked gdata-2.0.16 and noticed that the DeleteResource() function uses only the URL of the resource (gdata/docs/client.py, lines 540-543), but later checks for hasattr(entry_or_uri, 'etag') (gdata/client.py, lines 737-741), and of course a string value (the URI) doesn't have an etag attribute.
You may work around it using the force keyword argument:
import gdata.docs.data
import gdata.docs.client

client = gdata.docs.client.DocsClient()
client.ClientLogin('xxxxxx#gmail.com', 'xxxxxx', 'XxX')
for doc in client.GetAllResources():
    if doc.title.text == 'qpqpqpqpqpqp':
        client.DeleteResource(doc, force=True)
        break
If you want, you may report the error to the library maintainers (if it isn't already reported).
This issue has been fixed in this revision:
http://code.google.com/p/gdata-python-client/source/detail?r=f98fff494fb89fca12deede00c3567dd589e5f97
If you sync your client to the repository, you should be able to delete a resource without having to specify 'force=True'.
