Using Schema_Object in GoogleCloudStorageToBigQueryOperator in Airflow - python

I want to load GCS files written in JSON format into a BQ Table through an Airflow DAG.
So I used the GoogleCloudStorageToBigQueryOperator. Additionally, to avoid using the autodetect option, I created a schema JSON file, stored in the same GCS bucket as my raw JSON data files, to be used as the schema_object.
Below is the JSON Schema file:
[{"name": "id", "type": "INTEGER", "mode": "NULLABLE"},{"name": "description", "type": "INTEGER", "mode": "NULLABLE"}]
And my raw JSON data file looks like this (newline-delimited JSON):
{"description":"HR Department","id":9}
{"description":"Restaurant Department","id":10}
Here is what my operator looks like:
gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id=table_name + "_gcs_to_bq",
    bucket=bucket_name,
    bigquery_conn_id="bigquery_default",
    google_cloud_storage_conn_id="google_cloud_storage_default",
    source_objects=[table_name + "/{{ ds_nodash }}/data_json/*.json"],
    schema_object=table_name + "/{{ ds_nodash }}/data_json/schema_file.json",
    allow_jagged_rows=True,
    ignore_unknown_values=True,
    source_format="NEWLINE_DELIMITED_JSON",
    destination_project_dataset_table=project_id
    + "."
    + data_set
    + "."
    + table_name,
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    dag=dag,
)
The error I got is:
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false; Parser terminated before end of string File: schema_file.json
Could you please help me solve this issue?
Thanks in advance.

I see two problems:
Your BigQuery table schema is incorrect: the type of the description column is INTEGER instead of STRING. You have to set it to STRING.
You are using an old Airflow version. In recent versions, the schema object is by default retrieved from the bucket specified in the bucket param. For old versions, I am not sure about this behaviour, so you can set the full path to check whether it solves your issue, for example: schema_object='gs://test-bucket/schema.json'
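For reference, the corrected schema_file.json would then look like this (only the type of the description column changes):
[
    {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "description", "type": "STRING", "mode": "NULLABLE"}
]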

Related

Unable to parse JSON from file using Python3

I'm trying to get the value of ["pooled_metrics"]["vmaf"]["harmonic_mean"] from a JSON file I want to parse using python. This is the current state of my code:
for crf in crf_ranges:
    vmaf_output_crf_list_log = job['config']['output_dir'] + '/' + build_name(stream) + f'/vmaf_{crf}.json'
    # read the vmaf_output_crf_list_log file and get value from ["pooled_metrics"]["vmaf"]["harmonic_mean"]
    with open(vmaf_output_crf_list_log, 'r') as json_vmaf_file:
        # load the json_string["pooled_metrics"] into a python dictionary
        vm = json.loads(json_vmaf_file.read())
        vmaf_values.append((crf, vm["pooled_metrics"]["vmaf"]["harmonic_mean"]))
This will give me back the following error:
AttributeError: 'dict' object has no attribute 'loads'
I always get back the same AttributeError, no matter whether I use "load" or "loads".
I validated the contents of the JSON with various online validators, but still I am not able to load it for further parsing operations.
I expect that I can load a file that contains valid JSON data. The content of the file looks like this:
{
    "frames": [
        {
            "frameNum": 0,
            "metrics": {
                "integer_vif_scale2": 0.997330,
            }
        },
    ],
    "pooled_metrics": {
        "vmaf": {
            "min": 89.617207,
            "harmonic_mean": 99.868023
        }
    },
    "aggregate_metrics": {
    }
}
Can somebody give me some advice on this behavior? Why does it seem so impossible to load this JSON file?
loads is a method of the json module, as the docs say: https://docs.python.org/3/library/json.html#json.loads. Since you are getting an AttributeError on a 'dict' object, you have most likely created another variable named "json" (holding a dictionary) somewhere in your code, so when you call json.loads you are calling that variable instead of the module, and it has no loads method.
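A minimal sketch of the shadowing problem and the fix (the file name vmaf_10.json is just a placeholder):
import json

# If the module name gets shadowed somewhere earlier, e.g.
#     json = some_parsed_dict
# then json.loads(...) raises AttributeError: 'dict' object has no attribute 'loads'.

# With the module left intact, loading the file works as expected:
with open("vmaf_10.json", "r") as json_vmaf_file:  # placeholder file name
    vm = json.load(json_vmaf_file)  # json.load reads directly from the file object
print(vm["pooled_metrics"]["vmaf"]["harmonic_mean"])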

How to test and check JSON files and make a report of the errors

I am currently using jsonschema with Python to check my JSON files. I have created a schema.json that is used to check the JSON files I loop over, and I display whether each file is valid or invalid according to the checks I have added in my schema.json.
Currently I am not able to display and highlight the error in the JSON file being checked. For example, given:
"data_type": {
"type": "string",
"not": { "enum": ["null", "0", "NULL", ""] }
},
if the JSON file has the value "data_type": "null" or anything similar from the enum, it will display that the JSON file is invalid, but I want it to highlight the error and display something like "you added null in the data_type field". Can I do that with jsonschema?
Another way without jsonschema would also work; the main goal is to check multiple JSON files in a folder (which can be looped over), verify whether they are valid according to a set of rules, and display a clean and nice report which specifically tells us the problem.
What I did is use Draft7Validator from jsonschema, which allows you to display the error messages:
validator = jsonschema.Draft7Validator(schema)
# jsonData[a] is the a-th object of the file being checked; report and error_report are open file handles, N is a newline.
errors = sorted(validator.iter_errors(jsonData[a]), key=lambda e: e.path)
for error in errors:
    print(f"{error.message}")
    report.write(f"In {file_name}, object {a}, {error.message}{N}")
    error_report.write(f"In {file_name}, object {a},{error.json_path[2:]} {error.message}{N}")
This prints:
In test1.json, object 0, apy '' is not of type 'number'
In test2.json, object 0, asset 'asset_contract_address' is a required property
A helpful link that helped me achieve this: Handling Validation Errors
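A self-contained sketch along those lines, looping over a folder of JSON files and writing a report (the folder, schema, and report file names are made up):
import json
import os

import jsonschema

json_dir = "json_files"            # hypothetical folder with the files to check
with open("schema.json") as f:     # hypothetical schema path
    schema = json.load(f)

validator = jsonschema.Draft7Validator(schema)

with open("validation_report.txt", "w") as report:
    for file_name in sorted(os.listdir(json_dir)):
        if not file_name.endswith(".json"):
            continue
        with open(os.path.join(json_dir, file_name)) as f:
            data = json.load(f)
        errors = sorted(validator.iter_errors(data), key=lambda e: e.path)
        if not errors:
            report.write(f"{file_name}: valid\n")
        for error in errors:
            # error.json_path points at the offending field, error.message explains which rule failed
            report.write(f"{file_name}: {error.json_path} {error.message}\n")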

Uploading JSON to Bigquery unspecific error

I am just getting started with the Python BigQuery API (https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery) after briefly trying out pandas-gbq (https://github.com/pydata/pandas-gbq) and realizing that pandas-gbq does not support the RECORD type, i.e. no nested fields.
Now I am trying to upload nested data to BigQuery. I managed to create the table with the respective Schema, however I am struggling with the upload of the json data.
from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from google.cloud.bigquery import LoadJobConfig
from google.cloud.bigquery import SchemaField

client = bigquery.Client()  # not shown in the question; presumably created like this

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
    SchemaField('address', 'RECORD', mode='REPEATED', fields=(
        SchemaField('test', 'STRING', mode='NULLABLE'),
        SchemaField('second', 'STRING', mode='NULLABLE')
    ))
]

table_ref = client.dataset('TestApartments').table('Test2')
table = bigquery.Table(table_ref, schema=SCHEMA)
table = client.create_table(table)
When trying to upload a very simple JSON file to BigQuery, I get a rather ambiguous error:
400 Error while reading data, error message: JSON table encountered
too many errors, giving up. Rows: 1; errors: 1. Please look into the
error stream for more details.
Besides the fact that it makes me kind of sad that it's giving up on me :), that error description obviously does not really help...
Please find below how I try to upload the JSON, together with the sample data.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

with open('testjson.json', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location='US',  # Must match the destination dataset location.
        job_config=job_config)  # API request

job.result()  # Waits for table load to complete.

print('Loaded {} rows into {}:{}.'.format(
    job.output_rows, dataset_id, table_id))
This is my JSON object
"[{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}]"
A working JSON example would be wonderful, as this seems to be the only way to upload nested data, if I am not mistaken.
I have been reproducing your scenario using your same code and the JSON content you shared, and I suspect the issue is only that you are defining the JSON content between quotation marks (" or '), while it should not have that format.
The correct format is the one that @ElliottBrossard already shared with you in his answer:
{'full_name':'test', 'age':2, 'address': [{'test':'hi', 'second':'hi2'}]}
If I run your code using that content in the testjson.json file, I get the response Loaded 1 rows into MY_DATASET:MY_TABLE and the content gets loaded in the table. If, otherwise, I use the format below (which you are using according to your question and the comments in the other answer), I get the result google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details.
"{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}"
Additionally, you can go to the Jobs page in the BigQuery UI (following the link https://bigquery.cloud.google.com/jobs/YOUR_PROJECT_ID), and there you will find more information about the failed Load job. As an example, when I run your code with the wrong JSON format, this is what I get:
As you will see, here the error message is more relevant:
error message: JSON parsing error in row starting at position 0: Value encountered without start of object
It indicates that it could not find any valid start of a JSON object (i.e. an opening brace { at the very beginning of the object).
TL;DR: remove the quotation marks in your JSON object and the load job should be fine.
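To avoid the quoting issue altogether, a small sketch that writes the row with json.dumps (reusing the testjson.json file name from the question):
import json

row = {'full_name': 'test', 'age': 2, 'address': [{'test': 'hi', 'second': 'hi2'}]}

# Write one JSON object per line (newline-delimited JSON), with no surrounding quotation marks.
with open('testjson.json', 'w') as f:
    f.write(json.dumps(row) + '\n')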
I think your JSON content should be:
{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}
(No brackets.) As an example using the command-line client:
$ echo "{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}" \
> example.json
$ bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.JsonExample (full_name STRING NOT NULL, age INT64 NOT NULL, address ARRAY<STRUCT<test STRING, second STRING>>);"
$ bq load --source_format=NEWLINE_DELIMITED_JSON \
tmp_elliottb.JsonExample example.json
$ bq head tmp_elliottb.JsonExample
+-----------+-----+--------------------------------+
| full_name | age | address |
+-----------+-----+--------------------------------+
| test | 2 | [{"test":"hi","second":"hi2"}] |
+-----------+-----+--------------------------------+
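If the rows already exist as Python dictionaries, newer versions of the google-cloud-bigquery client also offer load_table_from_json, which serializes them for you; a rough sketch, reusing the dataset and table names from the question:
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('TestApartments').table('Test2')

rows = [{'full_name': 'test', 'age': 2, 'address': [{'test': 'hi', 'second': 'hi2'}]}]

# The client converts each dict into a newline-delimited JSON row before loading.
job = client.load_table_from_json(rows, table_ref)
job.result()  # Waits for the load job to complete.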

How to use relative path for $ref in Json Schema

Say I have a json schema called child.json.
"$ref": "file:child.json" will work
"$ref": "file:./child.json" will work
Those are the only two relative paths that worked for me. I am using the Python validator: http://sacharya.com/validating-json-using-python-jsonschema/
The issue I have is: if I have 3 schemas, grandpa.json, parent.json, and child.json, where grandpa refers to parent using "$ref": "file:parent.json" and parent refers to child using "$ref": "file:child.json", then the relative paths above no longer work.
Building off the GitHub issue linked by @jruizaranguren, I ended up with the following, which works as expected:
import os
import json

import jsonschema

schema_dir = os.path.abspath('resources')

with open(os.path.join(schema_dir, 'schema.json')) as file_object:
    schema = json.load(file_object)

# Your data
data = {"sample": "woo!"}

# Note that the second parameter does nothing.
resolver = jsonschema.RefResolver('file://' + schema_dir + '/', None)

# This will find the correct validator and instantiate it using the resolver.
# Requires that your schema contains a line like this: "$schema": "http://json-schema.org/draft-04/schema#"
jsonschema.validate(data, schema, resolver=resolver)
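Applied to the grandpa/parent/child case from the question, a sketch might look like this (the folder name schemas is an assumption; the point is that base_uri points at the directory holding all three files, so the nested file: refs keep resolving against it):
import os
import json

import jsonschema

schema_dir = os.path.abspath('schemas')  # hypothetical folder containing grandpa.json, parent.json and child.json

with open(os.path.join(schema_dir, 'grandpa.json')) as file_object:
    schema = json.load(file_object)

# All file: refs, including the one inside parent.json, resolve relative to this base URI.
resolver = jsonschema.RefResolver(base_uri='file://' + schema_dir + '/', referrer=schema)

jsonschema.validate({"sample": "woo!"}, schema, resolver=resolver)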

json to geoDjango model

I am building a database of field data using Django, GeoDjango and PostgreSQL. The data includes lats and lons. One of the tasks I have is to ingest data that has already been collected. I would like to use a .json file to define the metadata and write some code to batch process some JSON files.
What I have so far is, a model:
class deployment(models.Model):
    '''
    @brief This is the abstract deployment class.
    '''
    startPosition = models.PointField()
    startTimeStamp = models.DateTimeField()
    endTimeStamp = models.DateTimeField()
    missionAim = models.TextField()
    minDepth = models.FloatField()  # It seems there is no double in Django
    maxDepth = models.FloatField()


class auvDeployment(deployment):
    '''
    @brief AUV meta data
    '''
    #==================================================#
    # StartPosition   : <point>
    # distanceCovered : <double>
    # startTimeStamp  : <dateTime>
    # endTimeStamp    : <dateTime>
    # transectShape   : <>
    # missionAim      : <Text>
    # minDepth        : <double>
    # maxDepth        : <double>
    #--------------------------------------------------#
    # Maybe need to add unique AUV fields here later when
    # we have more deployments
    #==================================================#
    transectShape = models.PolygonField()
    distanceCovered = models.FloatField()
And the function I want to use to ingest the data:
@staticmethod
def importDeploymentFromFile(file):
    '''
    @brief This function reads in a metadata file that includes campaign information. The distinction between deployment types is made on the file name: <type><deployment>.<supported text>, e.g. auvdeployment.json
    @param file The file that holds the metadata. Formats include .json; todo: .xml, .yaml
    '''
    catamiWebPortal.logging.info("Importing metadata from " + file)
    fileName, fileExtension = os.path.splitext(file)
    if fileExtension == '.json':
        if os.path.basename(fileName.upper()) == 'AUVDEPLOYMENT':
            catamiWebPortal.logging.info("Found valid deployment file")
            data = json.load(open(file))
            Model = auvDeployment(**data)
            Model.save()
And this is the file I am trying to read in:
{
    "id": 1,
    "startTimeStamp": "2011-09-09 13:20:00",
    "endTimeStamp": "2011-10-19 14:23:54",
    "missionAim": "for fun times, call luke",
    "minDepth": 10.0,
    "maxDepth": 20.0,
    "startPosition": {{"type": "PointField", "coordinates": [ 5.000000, 23.000000 ] }},
    "distanceCovered": 20.0
}
The error that I am getting is this:
TypeError: cannot set auvDeployment GeometryProxy with value of type: <type 'dict'>
If I remove the geo types from the model and the file, it will read the file and populate the database table.
I would appreciate any advice on how to parse the data file with the geo types.
Thanks
Okay, the solution is as follows. The field format is not the GeoJSON format; it's the GEOS/WKT text format. The .json file should be as follows:
{
    "id": 1,
    "startTimeStamp": "2011-10-19 10:23:54",
    "endTimeStamp": "2011-10-19 14:23:54",
    "missionAim": "for fun times, call luke",
    "minDepth": 10.0,
    "maxDepth": 20.0,
    "startPosition": "POINT(-23.15 113.12)",
    "distanceCovered": 20,
    "transectShape": "POLYGON((-23.15 113.12, -23.53 113.34, -23.67 112.9, -23.25 112.82, -23.15 113.12))"
}
Note that the startPosition syntax has changed.
A quick fix would be to use the GEOS API in GeoDjango to convert the startPosition field from GeoJSON format into a GEOSGeometry object before you save the model. This should allow it to pass validation.
Import GEOSGeometry from Django with:
from django.contrib.gis.geos import GEOSGeometry
...
Model = auvDeployment(**data)
Model.startPosition = GEOSGeometry(str(Model.startPosition))
Model.save()
The GEOS API can construct objects from GeoJSON format, as long as you make it a string first. As it stands, you are loading it as a dictionary instead of a string.
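As a quick illustration, a sketch of constructing a point from a GeoJSON string (note the GeoJSON type must be "Point", not the "PointField" string used in the metadata file, and GeoJSON input requires GDAL to be installed):
from django.contrib.gis.geos import GEOSGeometry

# GEOSGeometry accepts WKT as well as GeoJSON strings.
point = GEOSGeometry('{"type": "Point", "coordinates": [5.0, 23.0]}')
print(point.wkt)  # POINT (5 23)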
I suggest you use the default command for loading fixtures: loaddata
python manage.py loaddata path/to/myfixture.json ...
The structure of your JSON would have to be slightly adjusted, but you could run a simple dumpdata to see what the structure should look like.
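For reference, a Django fixture is a list of objects, each with model, pk and fields keys; a rough sketch for the parent deployment model (the app label catamiPortal is an assumption, and with model inheritance involved, dumpdata is the reliable way to see the exact layout):
[
    {
        "model": "catamiPortal.deployment",
        "pk": 1,
        "fields": {
            "startPosition": "POINT(-23.15 113.12)",
            "startTimeStamp": "2011-10-19 10:23:54",
            "endTimeStamp": "2011-10-19 14:23:54",
            "missionAim": "for fun times, call luke",
            "minDepth": 10.0,
            "maxDepth": 20.0
        }
    }
]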
