Uploading JSON to BigQuery: unspecific error - Python

I am just getting started with the Python BigQuery API (https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery) after briefly trying out pandas-gbq (https://github.com/pydata/pandas-gbq) and realizing that it does not support the RECORD type, i.e. no nested fields.
Now I am trying to upload nested data to BigQuery. I managed to create the table with the respective schema, but I am struggling with the upload of the JSON data.
from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from google.cloud.bigquery import LoadJobConfig
from google.cloud.bigquery import SchemaField

client = bigquery.Client()  # client construction was omitted in the original snippet

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='REQUIRED'),
    SchemaField('age', 'INTEGER', mode='REQUIRED'),
    SchemaField('address', 'RECORD', mode='REPEATED', fields=(
        SchemaField('test', 'STRING', mode='NULLABLE'),
        SchemaField('second', 'STRING', mode='NULLABLE'),
    )),
]

table_ref = client.dataset('TestApartments').table('Test2')
table = bigquery.Table(table_ref, schema=SCHEMA)
table = client.create_table(table)
When trying to upload a very simple JSON file to BigQuery, I get a rather ambiguous error:
400 Error while reading data, error message: JSON table encountered
too many errors, giving up. Rows: 1; errors: 1. Please look into the
error stream for more details.
Besides the fact that it makes me kind of sad that it's giving up on me :), that error description obviously does not really help ...
Please find below how I try to upload the JSON, along with the sample data.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

with open('testjson.json', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location='US',  # Must match the destination dataset location.
        job_config=job_config)  # API request

job.result()  # Waits for table load to complete.

print('Loaded {} rows into {}:{}.'.format(
    job.output_rows, dataset_id, table_id))
This is my JSON object
"[{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}]"
A working JSON example would be wonderful, as this seems to be the only way to upload nested data, if I am not mistaken.

I have reproduced your scenario using the same code and the JSON content you shared, and I suspect the issue is simply that you are wrapping the JSON content in quotation marks (" or '), which it should not have.
The correct format is the one that @ElliottBrossard already shared with you in his answer:
{'full_name':'test', 'age':2, 'address': [{'test':'hi', 'second':'hi2'}]}
If I run your code using that content in the testjson.json file, I get the response Loaded 1 rows into MY_DATASET:MY_TABLE and the content gets loaded into the table. If I instead use the format below (which is what you are using, according to your question and the comments in the other answer), I get the error google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details.
"{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}"
Additionally, you can go to the Jobs page in the BigQuery UI (following the link https://bigquery.cloud.google.com/jobs/YOUR_PROJECT_ID), where you will find more information about the failed load job. When I run your code with the wrong JSON format, the error message shown there is more relevant:
error message: JSON parsing error in row starting at position 0: Value encountered without start of object
It indicates that it could not find any valid start of a JSON object (i.e. an opening brace { at the very beginning of the object).
TL;DR: remove the quotation marks in your JSON object and the load job should be fine.
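As a minimal sketch (reusing the file name and fields from the question, and not necessarily how you generate the file today), serializing each row with json.dumps produces newline-delimited JSON with standard double quotes and no surrounding quotation marks:
import json

rows = [
    {'full_name': 'test', 'age': 2,
     'address': [{'test': 'hi', 'second': 'hi2'}]},
]

# One JSON object per line, no enclosing quotes or outer list.
with open('testjson.json', 'w') as f:
    for row in rows:
        f.write(json.dumps(row) + '\n')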

I think your JSON content should be:
{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}
(No outer square brackets.) As an example using the command-line client:
$ echo "{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}" \
> example.json
$ bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.JsonExample (full_name STRING NOT NULL, age INT64 NOT NULL, address ARRAY<STRUCT<test STRING, second STRING>>);"
$ bq load --source_format=NEWLINE_DELIMITED_JSON \
tmp_elliottb.JsonExample example.json
$ bq head tmp_elliottb.JsonExample
+-----------+-----+--------------------------------+
| full_name | age |            address             |
+-----------+-----+--------------------------------+
| test      |   2 | [{"test":"hi","second":"hi2"}] |
+-----------+-----+--------------------------------+
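As a side note (this is only a sketch, not part of the original answers): recent versions of the google-cloud-bigquery client also expose load_table_from_json, which accepts Python dictionaries directly and handles the newline-delimited serialization for you, avoiding the file-formatting question entirely:
rows = [
    {'full_name': 'test', 'age': 2,
     'address': [{'test': 'hi', 'second': 'hi2'}]},
]

# The client serializes the dicts to newline-delimited JSON and runs a load job.
job = client.load_table_from_json(rows, table_ref)
job.result()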

Related

Using Schema_Object in GoogleCloudStorageToBigQueryOperator in Airflow

I want to load GCS files written in JSON format into a BQ Table through an Airflow DAG.
So, I used the GoogleCloudStorageToBigQueryOperator. Additionally, to avoid using the autodetect option, I created a schema JSON file, stored in the same GCS bucket as my raw JSON data files, to be used as the schema_object.
Below is the JSON Schema file:
[{"name": "id", "type": "INTEGER", "mode": "NULLABLE"},{"name": "description", "type": "INTEGER", "mode": "NULLABLE"}]
And my raw JSON data file looks like this (newline-delimited JSON):
{"description":"HR Department","id":9}
{"description":"Restaurant Department","id":10}
Here is what my operator looks like:
gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id=table_name + "_gcs_to_bq",
    bucket=bucket_name,
    bigquery_conn_id="bigquery_default",
    google_cloud_storage_conn_id="google_cloud_storage_default",
    source_objects=[table_name + "/{{ ds_nodash }}/data_json/*.json"],
    schema_object=table_name + "/{{ ds_nodash }}/data_json/schema_file.json",
    allow_jagged_rows=True,
    ignore_unknown_values=True,
    source_format="NEWLINE_DELIMITED_JSON",
    destination_project_dataset_table=project_id + "." + data_set + "." + table_name,
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    dag=dag,
)
The error I got is:
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false; Parser terminated before end of string File: schema_file.json
Could you please help me solve this issue?
Thanks in advance.
I see 2 problems:
1. Your BigQuery table schema is incorrect: the type of the description column is INTEGER instead of STRING. You have to set it to STRING (see the corrected schema file below).
2. You are using an old Airflow version. In recent versions, the schema_object is resolved by default against the bucket specified in the bucket param; for older versions, I am not sure about this behaviour. You can set the full path to check whether it solves your issue, for example: schema_object='gs://test-bucket/schema.json'
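For reference (only the description type changes; everything else is taken from the question), the corrected schema_file.json would look like this:
[{"name": "id", "type": "INTEGER", "mode": "NULLABLE"},{"name": "description", "type": "STRING", "mode": "NULLABLE"}]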

How to extract the JSON data without square brackets in Python

I suspect I am facing an issue because my JSON data is within square brackets ([]).
I am processing Pub/Sub data which is in the below format.
[{"ltime":"2022-04-12T11:33:00.970Z","cnt":199,"fname":"MYFILENAME","data":[{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}]}]
I am successfully processing/extracting all the fields except 'data'. I need to create a JSON file in Cloud Storage using the 'data' field.
When I use msg['data'] I am getting the below:
[{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}]
Sample code:
pubsub_msg = base64.b64decode(event['data']).decode('utf-8')
pubsub_data = json.loads(pubsub_msg)
json_file = pubsub_data['fname'] + '.json'
json_content = pubsub_data['data']
try:
    << call function >>
except Exception as e:
    << Error >>
Below is the error I am getting:
2022-04-12T17:35:01.761Zmyapp-pipelineeopcm2kjno5h Error in uploading json document to gcs: [{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}] could not be converted to bytes
I am not sure whether the issue is because of the square brackets [].
Correct me if I am wrong, and please help me get the exact JSON data for creating the file.
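A minimal sketch of one possible approach, assuming the error comes from passing the Python list in json_content straight to an upload call that expects a string or bytes (the bucket name below is a placeholder): serialize the list back to JSON text before writing it to Cloud Storage.
import json
from google.cloud import storage

json_content = pubsub_data['data']       # a Python list of dicts at this point, not a string
json_payload = json.dumps(json_content)  # upload_from_string() expects str or bytes

storage_client = storage.Client()
bucket = storage_client.bucket('my-target-bucket')  # placeholder bucket name
bucket.blob(json_file).upload_from_string(json_payload, content_type='application/json')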

Error trying to write CSV file to Google Cloud Storage from Dataflow pipeline

I'm working on building a Dataflow pipeline that reads a CSV file (containing 250,000 rows) from my Cloud Storage bucket, modifies the value of each row and then writes the modified contents to a new CSV in the same bucket. With the code below I'm able to read and modify the contents of the original file, but when I attempt to write the contents of the new file in GCS I get the following error:
google.api_core.exceptions.TooManyRequests: 429 POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart: {
  "error": {
    "code": 429,
    "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "errors": [
      {
        "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ]
  }
}
: ('Request failed with status code', 429, 'Expected one of', <HTTPStatus.OK: 200>) [while running 'Store Output File']
My code in Dataflow:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import traceback
import sys
import pandas as pd
from cryptography.fernet import Fernet
import google.auth
from google.cloud import storage

fernet_secret = 'aD4t9MlsHLdHyuFKhoyhy9_eLKDfe8eyVSD3tu8KzoP='
bucket = 'my-bucket'
inputFile = f'gs://{bucket}/product-codes/test_codes.csv'
outputFile = 'product-codes/URL_test_codes.csv'

# Pipeline Logic
def product_codes_pipeline(project, env, region='us-central1'):
    options = PipelineOptions(
        streaming=False,
        project=project,
        region=region,
        staging_location="gs://my-bucket-dataflows/Templates/staging",
        temp_location="gs://my-bucket-dataflows/Templates/temp",
        template_location="gs://my-bucket-dataflows/Templates/Generate_Product_Codes.py",
        subnetwork='https://www.googleapis.com/compute/v1/projects/{}/regions/us-central1/subnetworks/{}-private'.format(project, env)
    )

    # Transform function
    def genURLs(code):
        f = Fernet(fernet_secret)
        encoded = code.encode()
        encrypted = f.encrypt(encoded)
        decrypted = f.decrypt(encrypted.decode().encode())
        decoded = decrypted.decode()
        if code != decoded:
            print(f'Error: Code {code} and decoded code {decoded} do not match')
            sys.exit(1)
        url = 'https://some-url.com/redeem/product-code=' + encrypted.decode()
        return url

    class WriteCSVFIle(beam.DoFn):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name

        def start_bundle(self):
            self.client = storage.Client()

        def process(self, urls):
            df = pd.DataFrame([urls], columns=['URL'])
            bucket = self.client.get_bucket(self.bucket_name)
            bucket.blob(f'{outputFile}').upload_from_string(df.to_csv(index=False), 'text/csv')

    # End function
    p = beam.Pipeline(options=options)
    (p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
       | 'Map Codes' >> beam.Map(genURLs)
       | 'Store Output File' >> beam.ParDo(WriteCSVFIle(bucket)))
    p.run()
The code produces URL_test_codes.csv in my bucket, but the file only contains one row (not including the 'URL' header) which tells me that my code is writing/overwriting the file as it processes each row. Is there a way to bulk write the contents of the entire file instead of making a series of requests to update the file? I'm new to Python/Dataflow so any help is greatly appreciated.
Let's point out the issues. The evident one is a quota issue on the GCS side, reflected by the 429 error codes. But as you noted, this derives from the underlying issue, which is how you are trying to write your data to the blob.
A Beam pipeline operates on a parallel collection of elements, so each pipeline step is executed once for each element in your PCollection; in other words, your ParDo function tries to write something to your output file once per element.
So, there are some issues with your WriteCSVFIle function. In order to write your PCollection to GCS, it would be better to use a separate pipeline step that writes the whole PCollection, as follows:
First, you can import this Function already included in Apache Beam:
from apache_beam.io import WriteToText
Then, you use it at the end of your pipeline:
| 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket_name, outputFile))
With this option, you don't need to create a storage client or reference a blob; the function just needs the GCS URI where it should write the final result, and you can adjust it using the parameters described in the documentation.
With this, you only need to address the DataFrame created in your WriteCSVFIle function. Each pipeline step creates a new PCollection, so with your current logic a DataFrame-creating step that receives URLs would emit one DataFrame per URL. Since 'URL' is the only column in your DataFrame and you just want to write the results from genURLs, going directly from genURLs to WriteToText may already produce what you're looking for.
Either way, you can adjust your pipeline accordingly (see the sketch below), but at least the WriteToText transform takes care of writing your whole final PCollection to your Cloud Storage bucket.
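A minimal sketch of the adjusted pipeline, reusing genURLs, inputFile, bucket and outputFile from the question (the header parameter simply re-adds the 'URL' header line):
from apache_beam.io import WriteToText

p = beam.Pipeline(options=options)
(p
 | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
 | 'Map Codes' >> beam.Map(genURLs)
 # One write step for the whole PCollection instead of one GCS write per element.
 | 'Write PCollection to Bucket' >> WriteToText(
       'gs://{0}/{1}'.format(bucket, outputFile),
       header='URL'))
p.run()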

Dataflow Streaming using Python SDK: Transform for PubSub Messages to BigQuery Output

I am attempting to use Dataflow to read a Pub/Sub message and write it to BigQuery. I was given alpha access by the Google team and have gotten the provided examples working, but now I need to apply it to my scenario.
Pubsub payload:
Message {
data: {'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}
attributes: {}
}
Big Query Schema:
schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP',
My goal is simply to read the message payload and insert it into BigQuery. I am struggling to get my head around the transformations and how I should map the keys/values to the BigQuery schema.
I am very new to this so any help is appreciated.
Current code: https://codeshare.io/ayqX8w
Thanks!
I was able to successfully parse the Pub/Sub string by defining a function that loads it into a JSON object (see parse_pubsub()). One weird issue I encountered was that I was not able to import json at the global scope; I was receiving "NameError: global name 'json' is not defined" errors, so I had to import json within the function.
See my working code below:
from __future__ import absolute_import
import logging
import argparse
import apache_beam as beam
import apache_beam.transforms.window as window

'''Normalize pubsub string to json object'''
# Lines look like this:
# {'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}
def parse_pubsub(line):
    import json
    record = json.loads(line)
    return (record['mac']), (record['status']), (record['datetime'])

def run(argv=None):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input_topic', required=True,
        help='Input PubSub topic of the form "/topics/<PROJECT>/<TOPIC>".')
    parser.add_argument(
        '--output_table', required=True,
        help=('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
              'or DATASET.TABLE.'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        # Read the pubsub topic into a PCollection.
        lines = (p | beam.io.ReadStringsFromPubSub(known_args.input_topic)
                   | beam.Map(parse_pubsub)
                   # Note: tuple-unpacking lambda parameters are Python 2 only,
                   # which matches the Beam Python SDK used at the time.
                   | beam.Map(lambda (mac_bq, status_bq, datetime_bq): {'mac': mac_bq, 'status': status_bq, 'datetime': datetime_bq})
                   | beam.io.WriteToBigQuery(
                         known_args.output_table,
                         schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP',
                         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                 )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Data written to the Python SDK's BigQuery sink should be in the form of a dictionary where each key of the dictionary names a field of the BigQuery table and the corresponding value gives the value to be written to that field. For a BigQuery RECORD type, the value itself should be a dictionary with the corresponding key/value pairs.
I filed a JIRA to improve documentation of corresponding python module in Beam: https://issues.apache.org/jira/browse/BEAM-3090
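For example, a minimal sketch (field names borrowed from the first question above, not from this pipeline's schema) of what a row dictionary looks like when the table has a RECORD column:
# 'address' is a RECORD column: its value is a nested dictionary.
row = {'full_name': 'test', 'age': 2,
       'address': {'test': 'hi', 'second': 'hi2'}}

# For a REPEATED RECORD column, the value is a list of such dictionaries.
repeated_row = {'full_name': 'test', 'age': 2,
                'address': [{'test': 'hi', 'second': 'hi2'}]}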
I have a similar usecase (reading rows as strings from PubSub, converting them to dicts and then processing them).
I am using ast.literal_eval(), which seems to work for me. This function evaluates the string, but in a safer way than eval(). It should return a dict whose keys are strings and whose values are evaluated to the most likely type (int, str, float...). You may want to make sure the values take the correct type, though.
This would give you a pipeline like this:
import ast

lines = (p | beam.io.ReadStringsFromPubSub(known_args.input_topic)
           | "JSON row to dict" >> beam.Map(lambda s: ast.literal_eval(s))
           | beam.io.WriteToBigQuery( ... )
        )
I have not used BigQuery (yet), so I cannot help you on the last line, but what you wrote seems correct at first glance.
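As a quick illustration using the sample payload from the question, ast.literal_eval turns the single-quoted pseudo-JSON line into a regular dict:
import ast

line = "{'datetime': '2017-07-13T21:15:02Z', 'mac': 'FC:FC:48:AE:F6:94', 'status': 1}"
record = ast.literal_eval(line)
print(record['mac'], record['status'])  # FC:FC:48:AE:F6:94 1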

Pymongo: Bulk update with $setOnInsert error

I am trying to do bulk updates, while at the same time retaining the state of a specific field.
In my code I am either creating a document or adding to the list 'stuff'.
# init bulk
data = [...]
bulkop = col.initialize_ordered_bulk_op()
for d in data:
    bulkop.find({'thing': d}).upsert().update(
        {'$setOnInsert': {'status': 0}, '$push': {'stuff': 'something'}, '$inc': {'seq': 1}})
bulkop.execute()
However, I am getting an error when I try this.
Error: pymongo.errors.BulkWriteError: batch op errors occurred
It works fine without the '$setOnInsert': {'status': 0} addition, but I need it to make sure that the status field does not get updated on existing documents.
I think you are using PyMongo 3.x; the above code works on PyMongo 2.x. Try this...
from pymongo import MongoClient, UpdateMany

client = MongoClient('<connection string>')
db = client['test_db']
col = db['test']

docs = [doc_1, doc_2, doc_3, ...., doc_n]
requests = [UpdateMany({}, {'$setOnInsert': {'status': 0}, '$set': {'$push': docs}}, upsert=True)]
result = col.bulk_write(requests)
print(result.inserted_count)
Here the empty {} filter means the update applies to all documents.
Hope this helps...
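As a hedged alternative sketch (not from the original answer): if the intent is to keep the question's per-document upsert, the same ordered bulk operation can be written in PyMongo 3.x with one UpdateOne request per value, reusing the original update document unchanged. The connection string and data values below are placeholders.
from pymongo import MongoClient, UpdateOne

client = MongoClient('<connection string>')
col = client['test_db']['test']

data = ['thing_a', 'thing_b']  # placeholder values

requests = [
    UpdateOne(
        {'thing': d},
        {'$setOnInsert': {'status': 0},
         '$push': {'stuff': 'something'},
         '$inc': {'seq': 1}},
        upsert=True,
    )
    for d in data
]

result = col.bulk_write(requests, ordered=True)  # ordered=True mirrors initialize_ordered_bulk_op()
print(result.upserted_count, result.modified_count)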
