Say I have a json schema called child.json.
"$ref": "file:child.json" will work
"$ref": "file:./child.json" will work
Those are the only two relative paths that worked for me. I am using the Python validator described here: http://sacharya.com/validating-json-using-python-jsonschema/
The issue I have is: if I have 3 schemas (grandpa.json, parent.json, and child.json), where grandpa refers to parent using "$ref": "file:parent.json" and parent refers to child using "$ref": "file:child.json", then the relative paths above no longer work.
Building off the GitHub issue linked by @jruizaranguren, I ended up with the following, which works as expected:
import os
import json
import jsonschema
schema_dir = os.path.abspath('resources')
with open(os.path.join(schema_dir, 'schema.json')) as file_object:
    schema = json.load(file_object)
# Your data
data = {"sample": "woo!"}
# Note that the second parameter does nothing.
resolver = jsonschema.RefResolver('file://' + schema_dir + '/', None)
# This will find the correct validator and instantiate it using the resolver.
# Requires that your schema contains a line like this: "$schema": "http://json-schema.org/draft-04/schema#"
jsonschema.validate(data, schema, resolver=resolver)
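For the nested case from the question (grandpa.json referring to parent.json, which refers to child.json), the same approach appears to work as long as all three files sit in the directory that the resolver's base URI points at. A minimal sketch, assuming that layout and the hypothetical file names from the question:
import os
import json
import jsonschema

schema_dir = os.path.abspath('resources')  # assumed to contain all three schema files

with open(os.path.join(schema_dir, 'grandpa.json')) as file_object:
    schema = json.load(file_object)

# The trailing slash matters: every relative "$ref" (including those inside
# parent.json) is resolved against this base URI.
resolver = jsonschema.RefResolver(base_uri='file://' + schema_dir + '/', referrer=schema)

jsonschema.validate({"sample": "woo!"}, schema, resolver=resolver)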
This is my first time using Google's Vertex AI Pipelines. I checked this codelab as well as this post and this post, on top of some links derived from the official documentation. I decided to put all that knowledge to work in a toy example: I was planning to build a pipeline consisting of 2 components: "get-data" (which reads some .csv file stored in Cloud Storage) and "report-data" (which basically returns the shape of the .csv data read in the previous component). Furthermore, I was careful to include some suggestions provided in this forum. The code I currently have goes as follows:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from google.cloud import aiplatform
# Components section
@component(
    packages_to_install=[
        "google-cloud-storage",
        "pandas",
    ],
    base_image="python:3.9",
    output_component_file="get_data.yaml"
)
def get_data(
    bucket: str,
    url: str,
    dataset: Output[Dataset],
):
    import pandas as pd
    from google.cloud import storage
    storage_client = storage.Client("my-project")
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(url)
    blob.download_to_filename('localdf.csv')
    # path = "gs://my-bucket/program_grouping_data.zip"
    df = pd.read_csv('localdf.csv', compression='zip')
    df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
    df.to_csv(dataset.path + ".csv", index=False, encoding='utf-8-sig')
@component(
    packages_to_install=["pandas"],
    base_image="python:3.9",
    output_component_file="report_data.yaml"
)
def report_data(
    inputd: Input[Dataset],
):
    import pandas as pd
    df = pd.read_csv(inputd.path)
    return df.shape
# Pipeline section
@pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="my-pipeline",
)
def my_pipeline(
    url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
    bucket: str = "my-bucket"
):
    dataset_task = get_data(bucket, url)
    dimensions = report_data(
        dataset_task.output
    )
# Compilation section
compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)
# Running and submitting job
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline_job.json",
    job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
    parameter_values={"url": "test_vertex/pipeline_root/program_grouping_data.zip", "bucket": "my-bucket"},
    enable_caching=True,
)
run1.submit()
I was happy to see that the pipeline compiled with no errors and that I managed to submit the job. However, my happiness was short-lived: when I went to Vertex AI Pipelines, I stumbled upon the following error:
The DAG failed because some tasks failed. The failed tasks are: [get-data].; Job (project_id = my-project, job_id = 4290278978419163136) is failed due to the above error.; Failed to handle the job: {project_number = xxxxxxxx, job_id = 4290278978419163136}
I did not find any related info on the web, nor could I find any logs or anything similar, and I feel a bit overwhelmed that the solution to this (seemingly) easy example is still eluding me.
Quite obviously, I don't know what I am doing wrong or where. Any suggestions?
With some suggestions provided in the comments, I think I managed to make my demo pipeline work. I will first include the updated code:
from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output
from datetime import datetime
from google.cloud import aiplatform
from typing import NamedTuple
# Importing 'COMPONENTS' of the 'PIPELINE'
@component(
    packages_to_install=[
        "google-cloud-storage",
        "pandas",
    ],
    base_image="python:3.9",
    output_component_file="get_data.yaml"
)
def get_data(
    bucket: str,
    url: str,
    dataset: Output[Dataset],
):
    """Reads a csv file from some location in Cloud Storage"""
    import ast
    import pandas as pd
    from google.cloud import storage
    # 'Pulling' demo .csv data from a known location in GCS
    storage_client = storage.Client("my-project")
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(url)
    blob.download_to_filename('localdf.csv')
    # Reading the pulled demo .csv data
    df = pd.read_csv('localdf.csv', compression='zip')
    df['new_skills'] = df['new_skills'].apply(ast.literal_eval)
    df.to_csv(dataset.path + ".csv", index=False, encoding='utf-8-sig')
@component(
    packages_to_install=["pandas"],
    base_image="python:3.9",
    output_component_file="report_data.yaml"
)
def report_data(
    inputd: Input[Dataset],
) -> NamedTuple("output", [("rows", int), ("columns", int)]):
    """From a passed csv file existing in Cloud Storage, returns its dimensions"""
    import pandas as pd
    df = pd.read_csv(inputd.path + ".csv")
    return df.shape
# Building the 'PIPELINE'
@pipeline(
    # i.e. in my case: PIPELINE_ROOT = 'gs://my-bucket/test_vertex/pipeline_root/'
    # Can be overridden when submitting the pipeline
    pipeline_root=PIPELINE_ROOT,
    name="readcsv-pipeline",  # Your own naming for the pipeline.
)
def my_pipeline(
    url: str = "test_vertex/pipeline_root/program_grouping_data.zip",
    bucket: str = "my-bucket"
):
    dataset_task = get_data(bucket, url)
    dimensions = report_data(
        dataset_task.output
    )
# Compiling the 'PIPELINE'
compiler.Compiler().compile(
    pipeline_func=my_pipeline, package_path="pipeline_job.json"
)
# Running the 'PIPELINE'
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
run1 = aiplatform.PipelineJob(
    display_name="my-pipeline",
    template_path="pipeline_job.json",
    job_id="mlmd-pipeline-small-{0}".format(TIMESTAMP),
    parameter_values={
        "url": "test_vertex/pipeline_root/program_grouping_data.zip",
        "bucket": "my-bucket"
    },
    enable_caching=True,
)
# Submitting the 'PIPELINE'
run1.submit()
Now I will add some complementary comments, which, taken together, solved my problem:
First, having the "Logs Viewer" role (roles/logging.viewer) enabled for your user will greatly help to troubleshoot any existing error in your pipeline (note: that role worked for me, but you might want to look for a better-matching role for your own purposes here). Those errors will appear as "Logs", which can be accessed by clicking the corresponding button.
NOTE: When the "Logs" are displayed, it might be helpful to carefully check each log (close to the time when you created your pipeline), as generally each of them corresponds to a single warning or error line.
Second, the output of my pipeline was a tuple. In my original approach, I just returned the plain tuple, but it is advised to return a NamedTuple instead. In general, if you need to input or output one or more "small values" (int or str, for whatever reason), use a NamedTuple to do so (see the short sketch after these notes).
Third, when the connection between your components is Input[Dataset] or Output[Dataset], adding the file extension is needed (and quite easy to forget). Take for instance the output of the get_data component, and notice how the data is recorded by specifically adding the file extension, i.e. dataset.path + ".csv".
Of course, this is a very tiny example, and real projects can easily grow much larger, but as a sort of "Hello, Vertex AI Pipelines" it works well.
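As an illustration of the NamedTuple point above, here is a hypothetical extra component (not part of the original pipeline; the component name is mine) that consumes the named fields of report_data's output by name:
@component(base_image="python:3.9")
def print_dimensions(rows: int, columns: int):
    # Simply logs the two integers received from report_data.
    print(f"The dataset has {rows} rows and {columns} columns")

# Inside my_pipeline, each field of the NamedTuple becomes an addressable output:
#     dimensions = report_data(dataset_task.output)
#     print_dimensions(rows=dimensions.outputs["rows"],
#                      columns=dimensions.outputs["columns"])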
Thank you.
Thanks for your write-up. Very helpful! I had the same error, but it turned out to be for a different reason, so I am noting it here...
In my pipeline definition step I have the following parameters...
'''
def my_pipeline(bq_source_project: str = BQ_SOURCE_PROJECT,
                bq_source_dataset: str = BQ_SOURCE_DATASET,
                bq_source_table: str = BQ_SOURCE_TABLE,
                output_data_path: str = "crime_data.csv"):
'''
My error was that, when I ran the pipeline, I did not pass these same parameters. Below is the fixed version...
'''
job = pipeline_jobs.PipelineJob(
    project=PROJECT_ID,
    location=LOCATION,
    display_name=PIPELINE_NAME,
    job_id=JOB_ID,
    template_path=FILENAME,
    pipeline_root=PIPELINE_ROOT,
    parameter_values={'bq_source_project': BQ_SOURCE_PROJECT,
                      'bq_source_dataset': BQ_SOURCE_DATASET,
                      'bq_source_table': BQ_SOURCE_TABLE}
)
'''
I am trying to figure out how I can tag resources with a Merge operation, like in PowerShell.
Example in PowerShell:
Update-AzTag -ResourceId $s.ResourceId -Tag $mergedTags -Operation Replace
My code in Python:
# Tag the resource groups.
resource_group_client.resource_groups.create_or_update(
    resource_group_name=rg["Resource-group-name"],
    parameters={
        'location': rg['location'],
        'tags': tags_dict,
        'Operation': 'Merge'
    })
As you can see, I am trying my luck by putting 'Operation': 'Merge' in the parameters, but it doesn't work...
Any help here, please?
There is no merge option in the Python create_or_update function; have a look at the documentation on merging tags.
We can use resource_group_params.update(tags={'hello': 'world'}) to update all the tags at once.
Below is the code for updating tags:
resource_group_params = {'location': 'westus'}  # location is required, so add it to the parameters dict passed to create_or_update
resource_group_params.update(tags={'hello': 'world'})  # adding tags (this will remove all the previous tags and replace them with the ones we are passing now)
client.resource_groups.create_or_update('azure-sample-group', resource_group_params)
But the above code from the documentation will remove all the previous tags and replace them with the ones we are currently passing.
Here our requirement is to append/merge tags, so I have created a Python script that appends the new tags to the old ones:
# Import the needed credential and management objects from the libraries.
from azure.identity import AzureCliCredential
from azure.mgmt.resource import ResourceManagementClient

credential = AzureCliCredential()
subscription_id = "SUBSCRIPTION ID"  # Add your subscription ID
resource_group = "RESOURCE_GRP_ID"  # Add your resource group
resource_client = ResourceManagementClient(credential, subscription_id)
resource_list = resource_client.resource_groups.list()
for resource in resource_list:
    if resource.name == resource_group:
        appendtags = resource.tags  # gathering old tags
        newTags = {"Name1": "first"}  # new tags
        appendtags.update(newTags)  # adding my new tags to the old ones
        print(appendtags)
        resource_client.resource_groups.create_or_update(
            resource_group_name=resource_group,
            parameters={"location": "westus2", "tags": appendtags})
I have fixed it with the code below, and it worked perfectly.
(You should work with "update_at_scope".)
resource_group_client = ResourceManagementClient(credential, subscription_id=sub.subscription_id)
body = {
    "operation": "Merge",
    "properties": {
        "tags": tags_dict,
    }
}
resource_group_client.tags.update_at_scope(rg["Resource-id"], body)
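For what it's worth, the same update_at_scope call also accepts the other patch operations defined by the underlying Tags API; a hedged sketch of the variants, reusing the client and resource ID from the code above:
# Replace all existing tags on the resource group with tags_dict:
resource_group_client.tags.update_at_scope(
    rg["Resource-id"],
    {"operation": "Replace", "properties": {"tags": tags_dict}},
)

# Delete the tags listed in tags_dict from the resource group:
resource_group_client.tags.update_at_scope(
    rg["Resource-id"],
    {"operation": "Delete", "properties": {"tags": tags_dict}},
)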
Here's a simplified version of the JSON I am working with:
{
    "libraries": [
        {
            "library-1": {
                "file": {
                    "url": "foobar.com/.../library-1.bin"
                }
            }
        },
        {
            "library-2": {
                "application": {
                    "url": "barfoo.com/.../library-2.exe"
                }
            }
        }
    ]
}
Using json, I can json.loads() this file. I need to be able to find the 'url', download it, and save it to a local folder called libraries. In this case, I'd create two folders within libraries/, one called library-1, the other library-2. Within these folders would be whatever was downloaded from the url.
The issue, however, is being able to get to the url:
my_json = json.loads(...) # get the json
for library in my_json['libraries']:
    file.download(library['file']['url'])  # doesn't access ['application']['url']
Since the JSON I am using uses a variety of accessors, sometimes 'file', other times 'dll', etc., I can't use one specific dictionary key. How can I use multiple keys? Is there a modular way to do this?
Edit: There are numerous accessors, 'file', 'application' and 'dll' are only some examples.
You can just iterate through each level of the dictionary and download the files if you find a url.
urls = []
for library in my_json['libraries']:
    for lib_name, lib_data in library.items():
        for module_name, module_data in lib_data.items():
            url = module_data.get('url')
            if url is not None:
                # create local directory with lib_name
                # download files from url to local directory
                urls.append(url)
# urls = ['foobar.com/.../library-1.bin', 'barfoo.com/.../library-2.exe']
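For completeness, one way to fill in those two commented steps using only the standard library (this assumes the url values are complete, downloadable URLs, which the truncated examples above are not):
import os
import urllib.request

def download_library(lib_name, url, root="libraries"):
    # Create libraries/<lib_name>/ if it does not exist yet.
    target_dir = os.path.join(root, lib_name)
    os.makedirs(target_dir, exist_ok=True)
    # Save the file under its original name inside that directory.
    filename = os.path.join(target_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, filename)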
This should work:
for library in my_json['libraries']:
    for module in library.values():
        for details in module.values():
            file.download(details['url'])
I would suggest doing it like this:
for library in my_json['libraries']:
    library_data = library.popitem()[1].popitem()[1]
    file.download(library_data['url'])
Try this
for library in my_json['libraries']:
    if 'file' in library:
        file.download(library['file']['url'])
    elif 'dll' in library:
        file.download(library['dll']['url'])
It just checks whether your dict (created by parsing the JSON) has a key named 'file'. If so, it uses the 'url' of the dict corresponding to the 'file' key. If not, it tries the same with the 'dll' key.
Edit: If you don't know the key to access the dict containing the url, try this.
for library in my_json['libraries']:
    for key in library:
        if 'url' in library[key]:
            file.download(library[key]['url'])
This iterates over all the keys in your library. Then, for whichever key contains a 'url', it downloads using that key.
I am using the pyArango driver (https://github.com/tariqdaouda/pyArango) for ArangoDB, but I cannot understand how the field validation works. I have set the fields of a collection as in the GitHub example:
import pyArango.Collection as COL
import pyArango.Validator as VAL
from pyArango.theExceptions import ValidationError
import types
class String_val(VAL.Validator):
    def validate(self, value):
        if type(value) is not types.StringType:
            raise ValidationError("Field value must be a string")
        return True

class Humans(COL.Collection):
    _validation = {
        'on_save': True,
        'on_set': True,
        'allow_foreign_fields': True  # allow fields that are not part of the schema
    }
    _fields = {
        'name': Field(validators=[VAL.NotNull(), String_val()]),
        'anything': Field(),
        'species': Field(validators=[VAL.NotNull(), VAL.Length(5, 15), String_val()])
    }
So I was expecting that when I try to add a document to the "Humans" collection, if the 'name' field is not a string, an error would be raised. But it didn't seem to work that easily.
This is how I add documents to the collection:
myjson = json.loads(open('file.json').read())
collection_name = "Humans"
bindVars = {"doc": myjson, '@collection': collection_name}
aql = "FOR d IN @doc INSERT d INTO @@collection LET newDoc = NEW RETURN newDoc"
queryResult = db.AQLQuery(aql, bindVars=bindVars, batchSize=100)
So if 'name' is not a string, I actually don't get any error and the document is uploaded into the collection.
Does someone know how I can check whether a document contains the proper fields for that collection, using the built-in validation of pyArango?
I don't see anything wrong with your validator; it's just that if you're using AQL queries to insert your documents, pyArango has no way of knowing the contents prior to insertion.
Validators only work on pyArango documents if you do:
humans = db["Humans"]
doc = humans.createDocument()
doc["name"] = 101
That should trigger the exception because you've defined:
'on_set': True
ArangoDB as a document store doesn't itself enforce schemas, and neither do the drivers.
If you need schema validation, this can be done on top of the driver or inside of ArangoDB using a Foxx service (via the joi validation library).
One possible solution for doing this is using JSON Schema with its python implementation on top of the driver in your application:
from jsonschema import validate
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "species": {"type": "string"},
    },
}
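You would then validate each document in your application code before running the INSERT query; a short sketch using the schema above:
from jsonschema import ValidationError, validate

doc = {"name": 101, "species": "human"}
try:
    validate(instance=doc, schema=schema)
except ValidationError as e:
    print("document rejected:", e.message)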
Another real life example using JSON Schema is swagger.io, which is also used to document the ArangoDB REST API and ArangoDB Foxx services.
I don't know yet what was wrong with the code I posted, but it now seems to work. However, I had to convert unicode to UTF-8 when reading the JSON file, otherwise it was not able to identify strings. I know ArangoDB itself does not enforce schemas, but pyArango does have built-in validation, which is what I am using.
For those interested in the built-in validation of ArangoDB with Python, visit the pyArango GitHub page.
My code in Python 3.4:
from rdflib import Graph, plugin
import json, rdflib_jsonld
from rdflib.plugin import register, Serializer
register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')
context = {
    "@context": {
        "foaf": "http://xmlns.com/foaf/0.1/",
        "vcard": "http://www.w3.org/2006/vcard/ns#country-name",
        "job": "http://example.org/job",
        "name": {"@id": "foaf:name"},
        "country": {"@id": "vcard:country-name"},
        "profession": {"@id": "job:occupation"},
    }
}
x = [{"name": "bert", "country": "antartica", "profession": "bear"}]
g = Graph()
g.parse(data=json.dumps(x), format='json-ld', context=context)
g.close()
Error:
"No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class'rdflib.parser.Parser'>)
According to the RDFLib documentation, the list of supported plugins does not include the json-ld format. However, I had it working before with format set to json-ld, and there are plenty of examples using the json-ld format, e.g.: https://github.com/RDFLib/rdflib-jsonld/issues/19
I included the import of rdflib_jsonld, although it worked before in another environment (Python 2.7) with only rdflib (I know, it doesn't make any sense).
The register part for json-ld on line 4 isn't helping either.
Does anyone have an idea?
I got it working by adding:
from SPARQLWrapper import SPARQLWrapper
I was looking into the jsonLayer module from RDFLib at http://rdflib.readthedocs.org/en/latest/apidocs/rdflib.plugins.sparql.results.html#module-rdflib.plugins.sparql.results.jsonlayer and noticed the mention of SPARQLWrapper, which I had used in my previous environment where I got the example working, and there it was.
This is a simple syntax you can use:
import rdflib
import json
from collections import Counter
from rdflib import Graph, plugin
from rdflib.serializer import Serializer
g = rdflib.Graph()
g.parse("http://purl.obolibrary.org/obo/go.owl")
j = g.serialize(format='json-ld', indent=4)
with open('ontology.json', 'a+') as f:
    f.write(str(j))
I encountered this PluginException as well, in a Jupyter notebook, after running the following two cells:
! pip install rdflib-jsonld
from rdflib import Graph, plugin
from rdflib.serializer import Serializer
g = Graph()
g.parse(data="""
<some turtle triples>
""", format="turtle")
g.serialize(format="json-ld")
It turned out that it started working after just restarting the notebook and rerunning the code above (so the imports were re-executed after the restart).