I am trying to read a table from a Google Spanner database and write it to a text file as a backup, using Google Dataflow with the Python SDK.
I have written the following script:
from __future__ import absolute_import
import argparse
import itertools
import logging
import re
import time
import datetime as dt
import apache_beam as beam
from apache_beam.io import iobase
from apache_beam.io import WriteToText
from apache_beam.io.range_trackers import OffsetRangeTracker, UnsplittableRangeTracker
from apache_beam.metrics import Metrics
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions, SetupOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from google.cloud.spanner.client import Client
from google.cloud.spanner.keyset import KeySet
BUCKET_URL = 'gs://my_bucket'
OUTPUT = '%s/output/' % BUCKET_URL
PROJECT_ID = 'my_project'
INSTANCE_ID = 'my_instance'
DATABASE_ID = 'my_db'
JOB_NAME = 'spanner-backup'
TABLE = 'my_table'
class SpannerSource(iobase.BoundedSource):
def __init__(self):
logging.info('Enter __init__')
self.spannerOptions = {
"id": PROJECT_ID,
"instance": INSTANCE_ID,
"database": DATABASE_ID
}
self.SpannerClient = Client
def estimate_size(self):
logging.info('Enter estimate_size')
return 1
def get_range_tracker(self, start_position=None, stop_position=None):
logging.info('Enter get_range_tracker')
if start_position is None:
start_position = 0
if stop_position is None:
stop_position = OffsetRangeTracker.OFFSET_INFINITY
range_tracker = OffsetRangeTracker(start_position, stop_position)
return UnsplittableRangeTracker(range_tracker)
def read(self, range_tracker): # This is not called when using the DataflowRunner!
logging.info('Enter read')
# instantiate spanner client
spanner_client = self.SpannerClient(self.spannerOptions["id"])
instance = spanner_client.instance(self.spannerOptions["instance"])
database = instance.database(self.spannerOptions["database"])
# read from table
table_fields = database.execute_sql("SELECT t.column_name FROM information_schema.columns AS t WHERE t.table_name = '%s'" % TABLE)
table_fields.consume_all()
self.columns = [x[0] for x in table_fields]
keyset = KeySet(all_=True)
results = database.read(table=TABLE, columns=self.columns, keyset=keyset)
# iterator over rows
results.consume_all()
for row in results:
JSON_row = {
self.columns[i]: row[i] for i in range(len(self.columns))
}
yield JSON_row
def split(self, start_position=None, stop_position=None):
# this should not be called since the source is unsplittable
logging.info('Enter split')
if start_position is None:
start_position = 0
if stop_position is None:
stop_position = 1
# Because the source is unsplittable (for now), only a single source is returned
yield iobase.SourceBundle(
weight=1,
source=self,
start_position=start_position,
stop_position=stop_position)
def run(argv=None):
"""Main entry point"""
pipeline_options = PipelineOptions()
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
#pipeline_options.view_as(StandardOptions).runner = 'DirectRunner'
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
p = beam.Pipeline(options=pipeline_options)
output = p | 'Get Rows from Spanner' >> beam.io.Read(SpannerSource())
iso_datetime = dt.datetime.now().replace(microsecond=0).isoformat()
output | 'Store in GCS' >> WriteToText(file_path_prefix=OUTPUT + iso_datetime + '-' + TABLE, file_name_suffix='') # if this line is commented, job completes but does not do anything
result = p.run()
result.wait_until_finish()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
However, this script runs correctly only with the DirectRunner: when I run it with the DataflowRunner, it runs for a while without any output before exiting with an error:
"Executing failure step failure14 [...] Workflow failed. Causes: [...] The worker lost contact with the service."
Sometimes it just runs forever without producing any output.
Moreover, if I comment out the line 'output = ...', the job completes, but without actually reading the data.
It also appears that the DataflowRunner calls the source's 'estimate_size' function, but not 'read' or 'get_range_tracker'.
Does anyone have any idea what may cause this?
I know there is a (more complete) Java SDK with an experimental Spanner source/sink available, but if possible I'd rather stick with Python.
Thanks
Google has recently added support for backing up Spanner with Dataflow; you can choose the related template when creating a Dataflow job.
For more: https://cloud.google.com/blog/products/gcp/cloud-spanner-adds-import-export-functionality-to-ease-data-movement
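For reference, the Spanner export template can also be launched programmatically. Below is a minimal sketch using the Dataflow templates API via google-api-python-client; the template path and the parameter names (instanceId, databaseId, outputDir) are assumptions based on the "Cloud Spanner to Cloud Storage Avro" template and should be double-checked against its documentation:
from googleapiclient.discovery import build

PROJECT_ID = 'my_project'
# Google-provided template that exports a Spanner database to Avro files on GCS
TEMPLATE_PATH = 'gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro'

dataflow = build('dataflow', 'v1b3')
response = dataflow.projects().templates().launch(
    projectId=PROJECT_ID,
    gcsPath=TEMPLATE_PATH,
    body={
        'jobName': 'spanner-backup',
        # parameter names below are assumptions; check the template's documentation
        'parameters': {
            'instanceId': 'my_instance',
            'databaseId': 'my_db',
            'outputDir': 'gs://my_bucket/output',
        },
    },
).execute()
print(response)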
I have reworked my code following the suggestion to simply use a ParDo instead of the BoundedSource class. For reference, here is my solution; I am sure there are many ways to improve on it, and I would be happy to hear opinions.
In particular, I am surprised that I have to create a dummy PCollection when starting the pipeline (if I don't, I get the error
AttributeError: 'PBegin' object has no attribute 'windowing'
that I could not work around. The dummy PCollection feels a bit like a hack.
from __future__ import absolute_import
import datetime as dt
import logging
import apache_beam as beam
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions, SetupOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from google.cloud.spanner.client import Client
from google.cloud.spanner.keyset import KeySet
BUCKET_URL = 'gs://my_bucket'
OUTPUT = '%s/some_folder/' % BUCKET_URL
PROJECT_ID = 'my_project'
INSTANCE_ID = 'my_instance'
DATABASE_ID = 'my_database'
JOB_NAME = 'my_jobname'
class ReadTables(beam.DoFn):
def __init__(self, project, instance, database):
super(ReadTables, self).__init__()
self._project = project
self._instance = instance
self._database = database
def process(self, element):
# get list of tables in the database
table_names_row = Client(self._project).instance(self._instance).database(self._database).execute_sql('SELECT t.table_name FROM information_schema.tables AS t')
for row in table_names_row:
if row[0] in [u'COLUMNS', u'INDEXES', u'INDEX_COLUMNS', u'SCHEMATA', u'TABLES']: # skip these
continue
yield row[0]
class ReadSpannerTable(beam.DoFn):
def __init__(self, project, instance, database):
super(ReadSpannerTable, self).__init__()
self._project = project
self._instance = instance
self._database = database
def process(self, element):
# first read the columns present in the table
table_fields = Client(self._project).instance(self._instance).database(self._database).execute_sql("SELECT t.column_name FROM information_schema.columns AS t WHERE t.table_name = '%s'" % element)
columns = [x[0] for x in table_fields]
# next, read the actual data in the table
keyset = KeySet(all_=True)
results_streamed_set = Client(self._project).instance(self._instance).database(self._database).read(table=element, columns=columns, keyset=keyset)
for row in results_streamed_set:
JSON_row = { columns[i]: row[i] for i in xrange(len(columns)) }
yield (element, JSON_row) # output pairs of (table_name, data)
def run(argv=None):
"""Main entry point"""
pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).save_main_session = True
pipeline_options.view_as(SetupOptions).requirements_file = "requirements.txt"
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
p = beam.Pipeline(options=pipeline_options)
init = p | 'Begin pipeline' >> beam.Create(["test"]) # have to create a dummy transform to initialize the pipeline, surely there is a better way ?
tables = init | 'Get tables from Spanner' >> beam.ParDo(ReadTables(PROJECT_ID, INSTANCE_ID, DATABASE_ID)) # read the tables in the db
rows = (tables | 'Get rows from Spanner table' >> beam.ParDo(ReadSpannerTable(PROJECT_ID, INSTANCE_ID, DATABASE_ID)) # for each table, read the entries
| 'Group by table' >> beam.GroupByKey()
| 'Formatting' >> beam.Map(lambda (table_name, rows): (table_name, list(rows)))) # have to force to list here (dataflowRunner produces _Unwindowedvalues)
iso_datetime = dt.datetime.now().replace(microsecond=0).isoformat()
rows | 'Store in GCS' >> WriteToText(file_path_prefix=OUTPUT + iso_datetime, file_name_suffix='')
result = p.run()
result.wait_until_finish()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
Related
I am trying to create my first pipeline in Dataflow. The same code runs fine when I execute it using the interactive Beam runner, but on Dataflow I get all sorts of errors, which are not making much sense to me.
I am getting JSON from Pub/Sub in the following format:
{"timestamp":1589992571906,"lastPageVisited":"https://kickassdataprojects.com/simple-and-complete-tutorial-on-simple-linear-regression/","pageUrl":"https://kickassdataprojects.com/","pageTitle":"Helping%20companies%20and%20developers%20create%20awesome%20data%20projects%20%7C%20Data%20Engineering/%20Data%20Science%20Blog","eventType":"Pageview","landingPage":0,"referrer":"direct","uiud":"31af5f22-4cc4-48e0-9478-49787dd5a19f","sessionId":322371}
Here is the code of my pipeline.
from __future__ import absolute_import
import apache_beam as beam
#from apache_beam.runners.interactive import interactive_runner
#import apache_beam.runners.interactive.interactive_beam as ib
import google.auth
from datetime import timedelta
import json
from datetime import datetime
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode, AfterCount
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
import argparse
import logging
from time import mktime
def setTimestamp(elem):
from apache_beam import window
yield window.TimestampedValue(elem, elem['timestamp'])
def createTuples(elem):
yield (elem["sessionId"], elem)
class WriteToBigQuery(beam.PTransform):
"""Generate, format, and write BigQuery table row information."""
def __init__(self, table_name, dataset, schema, project):
"""Initializes the transform.
Args:
table_name: Name of the BigQuery table to use.
dataset: Name of the dataset to use.
schema: Dictionary in the format {'column_name': 'bigquery_type'}
project: Name of the Cloud project containing BigQuery table.
"""
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(WriteToBigQuery, self).__init__()
beam.PTransform.__init__(self)
self.table_name = table_name
self.dataset = dataset
self.schema = schema
self.project = project
def get_schema(self):
"""Build the output table schema."""
return ', '.join('%s:%s' % (col, self.schema[col]) for col in self.schema)
def expand(self, pcoll):
return (
pcoll
| 'ConvertToRow' >>
beam.Map(lambda elem: {col: elem[col]
for col in self.schema})
| beam.io.WriteToBigQuery(
self.table_name, self.dataset, self.project, self.get_schema()))
class ParseSessionEventFn(beam.DoFn):
"""Parses the raw game event info into a Python dictionary.
Each event line has the following format:
username,teamname,score,timestamp_in_ms,readable_time
e.g.:
user2_AsparagusPig,AsparagusPig,10,1445230923951,2015-11-02 09:09:28.224
The human-readable time string is not used here.
"""
def __init__(self):
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(ParseSessionEventFn, self).__init__()
beam.DoFn.__init__(self)
def process(self, elem):
#timestamp = mktime(datetime.strptime(elem["timestamp"], "%Y-%m-%d %H:%M:%S").utctimetuple())
elem['sessionId'] = int(elem['sessionId'])
elem['landingPage'] = int(elem['landingPage'])
yield elem
class AnalyzeSessions(beam.DoFn):
"""Parses the raw game event info into a Python dictionary.
Each event line has the following format:
username,teamname,score,timestamp_in_ms,readable_time
e.g.:
user2_AsparagusPig,AsparagusPig,10,1445230923951,2015-11-02 09:09:28.224
The human-readable time string is not used here.
"""
def __init__(self):
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(AnalyzeSessions, self).__init__()
beam.DoFn.__init__(self)
def process(self, elem, window=beam.DoFn.WindowParam):
sessionId = elem[0]
uiud = elem[1][0]["uiud"]
count_of_events = 0
pageUrl = []
window_end = window.end.to_utc_datetime()
window_start = window.start.to_utc_datetime()
session_duration = window_end - window_start
for rows in elem[1]:
if rows["landingPage"] == 1:
referrer = rows["refererr"]
pageUrl.append(rows["pageUrl"])
return {
"pageUrl":pageUrl,
"eventType":"pageview",
"uiud":uiud,
"sessionId":sessionId,
"session_duration": session_duration,
"window_start" : window_start
}
def run(argv=None, save_main_session=True):
parser = argparse.ArgumentParser()
parser.add_argument('--topic', type=str, help='Pub/Sub topic to read from')
parser.add_argument(
'--subscription', type=str, help='Pub/Sub subscription to read from')
parser.add_argument(
'--dataset',
type=str,
required=True,
help='BigQuery Dataset to write tables to. '
'Must already exist.')
parser.add_argument(
'--table_name',
type=str,
default='game_stats',
help='The BigQuery table name. Should not already exist.')
parser.add_argument(
'--fixed_window_duration',
type=int,
default=60,
help='Numeric value of fixed window duration for user '
'analysis, in minutes')
parser.add_argument(
'--session_gap',
type=int,
default=5,
help='Numeric value of gap between user sessions, '
'in minutes')
parser.add_argument(
'--user_activity_window_duration',
type=int,
default=30,
help='Numeric value of fixed window for finding mean of '
'user session duration, in minutes')
args, pipeline_args = parser.parse_known_args(argv)
session_gap = args.session_gap * 60
options = PipelineOptions(pipeline_args)
# Set the pipeline mode to stream the data from Pub/Sub.
options.view_as(StandardOptions).streaming = True
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=options)
lines = (p
| beam.io.ReadFromPubSub(
subscription="projects/phrasal-bond-274216/subscriptions/rrrr")
| 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
| beam.Map(lambda x: json.loads(x))
| beam.ParDo(ParseSessionEventFn())
)
next = ( lines
| 'AddEventTimestamps' >> beam.Map(setTimestamp)
| 'Create Tuples' >> beam.Map(createTuples)
| beam.Map(print)
| 'Window' >> beam.WindowInto(window.Sessions(15))
| 'group by key' >> beam.GroupByKey()
| 'analyze sessions' >> beam.ParDo(AnalyzeSessions())
| 'WriteTeamScoreSums' >> WriteToBigQuery(
args.table_name,
{
"uiud":'STRING',
"session_duration": 'INTEGER',
"window_start" : 'TIMESTAMP'
},
options.view_as(GoogleCloudOptions).project)
)
next1 = ( next
| 'Create Tuples' >> beam.Map(createTuples)
| beam.Map(print)
)
result = p.run()
# result.wait_till_termination()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
In the following code, I get the error 'generator' object is not subscriptable when I try to create tuples in my pipeline. I get that using yield creates a generator object, but even return doesn't work; it just breaks my pipeline.
apache_beam.coders.coder_impl.SequenceCoderImpl.get_estimated_size_and_observables File "sessiontest1.py", line 23, in createTuples TypeError: 'generator' object is not subscriptable [while running 'generatedPtransform-148']
Here is the code I use to execute the pipeline.
python3 sessiontest1.py --project phrasal-bond-xxxxx --region us-central1 --subscription projects/phrasal-bond-xxxxx/subscriptions/xxxxxx --dataset sessions_beam --runner DataflowRunner --temp_location gs://webevents/sessions --service_account_email=xxxxxxxx-compute@developer.gserviceaccount.com
Any help on this would be appreciated. Thanks; this is my first time working on Dataflow, so I am not sure what I am missing here.
Other errors I was getting before, which are sorted now:
a) I got the error that window is not defined from the line beam.Map(lambda elem: window.TimestampedValue(elem, elem['timestamp'])).
If I use beam.window, then it says beam is not defined; as far as I understand, beam should be provided by Dataflow:
NameError: name 'window' is not defined [while running 'generatedPtransform-3820']
You just need to import the modules in the function itself.
Getting a 'generator' object is not subscriptable error on createTuples indicates that when you try to do elem["sessionID"], the elem is already a generator. The previous transform you do is setTimestamp, which is also using yield and therefore outputting a generator that gets passed as the element to createTuples.
The solution here is to implement setTimestamp and createTuples with return instead of yield. Return the element you want to receive in the following transform.
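For reference, a minimal sketch of those two helpers rewritten with return (matching the element shapes used in the pipeline above) would be:
def setTimestamp(elem):
    from apache_beam import window  # imported inside the function so it is available on the workers
    return window.TimestampedValue(elem, elem['timestamp'])

def createTuples(elem):
    return (elem["sessionId"], elem)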
You should set save_main_session = True in your code (try to uncomment that line in your code). See more about NameError here: https://cloud.google.com/dataflow/docs/resources/faq
I'm not a data engineer and have some doubts about the best approach to follow. My main goal is to have a job that runs with some frequency (daily, for example) and loads some CSV files (from a bucket on GCP) into BigQuery tables.
Here my actual script:
import pandas as pd
from google.oauth2 import service_account
import pandas_gbq  # import pandas_gbq directly rather than through pandas' test suite
from src.uploads import files
KEY_JSON = 'key.json'
PROJECT_ID = "<PROJECT_ID>"
SUFFIX_TABLE_NAME = "<TABLE_SUFFIX>."
def parse_to_pd(file, header, numeric_fields):
df = pd.read_csv(file, skiprows=[1], sep=',', decimal=",", thousands='.')
df.columns = header
for col in numeric_fields:
df[col] = pd.to_numeric(df[col])
return df
def load_data_to_bq(df, table_id):
credentials = service_account. \
Credentials. \
from_service_account_file(KEY_JSON, )
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = PROJECT_ID
name_table = table_id.replace(SUFFIX_TABLE_NAME, "")
pandas_gbq.to_gbq(df, name_table, if_exists="replace")
if __name__ == "__main__":
for table_id, config in files.items():
load_data_to_bq(
parse_to_pd(config.get('file_name'),
config.get('fields'),
config.get('numeric_fields')
), table_id)
This script has worked, but I want to run it on GCP using some managed service in the cloud. Do you have any suggestions?
So, I decided not to use Dataflow and Apache Beam because my files were not that big. Instead, I just scheduled a job in a crontab. Here is the class that I created to do the processing:
import os
import shutil
import pandas as pd
import uuid
from google.cloud import storage
class DataTransformation:
"""A helper class which contains the logic to translate a csv into a
format BigQuery will accept.
"""
def __init__(self, schema, bucket_name, credentials):
""" Here we read the input schema and which file will be transformed. This is used to specify the types
of data to create a pandas dataframe.
"""
self.schema = schema
self.files = []
self.blob_files = []
self.client = storage.Client(credentials=credentials, project="stewardship")
self.bucket = self.client.get_bucket(bucket_name)
def parse_method(self, csv_file):
"""This method translates csv_file in a pandas dataframe which can be loaded into BigQuery.
Args:
csv_file: some.csv
Returns:
A pandas dataframe.
"""
df = pd.read_csv(csv_file,
skiprows=[1],
sep=self.schema.get('sep'),
decimal=self.schema.get('decimal'),
thousands=self.schema.get('thousands')
)
df.columns = self.schema.get('fields')
for col in self.schema.get('numeric_fields'):
df[col] = pd.to_numeric(df[col])
shutil.move(csv_file, "./temp/processed/{0}".format(
os.path.splitext(os.path.basename(csv_file))[0])
)
return df
def process_files(self):
"""This method process all files and concat to a unique dataframe
Returns:
A pandas dataframe contained.
"""
frames = []
for file in self.files:
frames.append(self.parse_method(file))
if frames:
return pd.concat(frames)
else:
return pd.DataFrame([], columns=['a'])
def download_blob(self):
"""Downloads a blob from the bucket."""
for blob_file in self.bucket.list_blobs(prefix="input"):
if self.schema.get("file_name") in blob_file.name:
unique_filename = "{0}_{1}".format(self.schema.get("file_name"), str(uuid.uuid4()))
destination_file = os.path.join("./temp/input", unique_filename + ".csv")
blob_file.download_to_filename(
destination_file
)
self.files.append(destination_file)
self.blob_files.append(blob_file)
return True if len(self.blob_files) > 0 else False
def upload_blob(self, destination_blob_name):
"""Uploads a file to the bucket."""
blob = self.bucket.blob(destination_blob_name)
blob.upload_from_filename(os.path.splitext(os.path.basename(destination_blob_name))[0] +
os.path.splitext(os.path.basename(destination_blob_name))[1])
def move_processed_files(self):
"""Move processed files to processed folder"""
for blob_file in self.blob_files:
self.bucket.rename_blob(blob_file, "processed/" + blob_file.name)
return [b.name for b in self.blob_files]
Then in main I used pandas_gbq to process everything:
import logging
from google.oauth2 import service_account
import pandas_gbq  # import pandas_gbq directly rather than through pandas' test suite
from src.data_transformation import DataTransformation
from src.schemas import schema_files_csv
KEY_JSON = 'KEY.json'
PROJECT_ID = "<PROJECT_NAME>"
SUFFIX_TABLE_NAME = "<TABLE_SUFFIX>"
BUCKET_NAME = "BUCKET_NAME"
def run():
credentials = service_account. \
Credentials. \
from_service_account_file(KEY_JSON, )
# DataTransformation is a class we built in this script to hold the logic for
# transforming the file into a BigQuery table.
for table, schema in schema_files_csv.items():
try:
logging.info("Processing schema for {}".format(schema.get("file_name")))
data_ingestion = DataTransformation(schema, BUCKET_NAME, credentials)
if not data_ingestion.download_blob():
logging.info(" 0 files to process")
continue
logging.info("Downloaded files: {}".format(",".join(data_ingestion.files) or "0 files"))
frame = data_ingestion.process_files()
logging.info("Dataframe created with some {} lines".format(str(frame.shape)))
if not frame.empty:
pandas_gbq.context.project, pandas_gbq.context.credentials = (PROJECT_ID, credentials)
pandas_gbq.to_gbq(frame,
table.replace(SUFFIX_TABLE_NAME, ""),
if_exists="replace"
)
logging.info("Table {} was loaded on Big Query".format(table.replace(SUFFIX_TABLE_NAME, "")))
blob_files = data_ingestion.move_processed_files()
logging.info("Moving files {} to processed folder".format(",".join(blob_files)))
data_ingestion.upload_blob("info.log")
except ValueError as err:
logging.error("csv schema expected are wrong, please ask to Andre Araujo update the schema. "
"Error: {}".format(err.__str__()))
if __name__ == "__main__":
logging.basicConfig(filename='info.log', level=logging.INFO)
run()
For the processing, I use a schema in a dict/JSON like this:
{
"<PROJECT>.<DATASET>.<TABLE_NAME>": {
"file_name": "<NAME_OF_FILE>",
"fields": [
"Project",
"Assignment_Name",
"Request_Id",
"Resource_Grade",
"Resource_Role",
"Record_ID",
"Assignment_ID",
"Resource_Request_Account_Id",
],
"numeric_fields": [],
"sep": ";",
"decimal": ".",
"thousands": ","
},
.... other schema
}
I am trying to merge 2 JSON inputs (this example reads from files, but they will come from Google Pub/Sub later), from these:
orderID.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}
combined.json:
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
{"barcode":"95593","name":"Dog","quantity":6,"orderID":"test2"}
{"barcode":"95594","name":"Scar","quantity":6,"orderID":"test2"}
To something like this (using orderID as the unique and primary key):
output.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95590","name":"Ash","quantity":6}
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95591","name":"Beat","quantity":6}
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95592","name":"Cat","quantity":6}
My code now looks like this, adapted from "join two json in Google Cloud Platform with dataflow":
from __future__ import absolute_import
import argparse
import apache_beam as beam
import json
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from google.api_core import datetime_helpers
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import ServiceUnavailable
from google.api_core.exceptions import TooManyRequests
from google.cloud import bigquery
def run(argv=None):
"""Build and run the pipeline."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--topic',
type=str,
help='Pub/Sub topic to read from')
parser.add_argument(
'--topic2',
type=str,
help='Pub/Sub topic to match with'
)
parser.add_argument(
'--output',
help=('Output local filename'))
args, pipeline_args = parser.parse_known_args(argv)
options = PipelineOptions(pipeline_args)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)
orderID = (p | 'read from text1' >> beam.io.ReadFromText('orderID.json')
#'Read from orderID PubSub' >> beam.io.ReadFromPubSub(topic=args.topic2)
| 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e))
| 'key_orderID' >> beam.Map(lambda orders: (orders['orderID'], orders))
)
orders_si = beam.pvalue.AsDict(orderID)
orderDetails = (p | 'read from text' >> beam.io.ReadFromText('combined.json')
| 'Parse JSON to Dict1' >> beam.Map(lambda e: json.loads(e)))
#'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic))
def join_orderID_orderDetails(order, order_dict):
return order.update(order_dict[order['orderID']])
joined_dicts = orderDetails | beam.Map(join_orderID_orderDetails, order_dict=orders_si)
joined_dicts | beam.io.WriteToText('beam.output')
p.run()
#result.wait_until_finish()
if __name__ == '__main__':
run()
But my output now in beam.output just shows:
None
None
None
Can someone point out to me what I am doing wrong here?
What makes this question different from the reported duplicate post is:
Why are my results "None"?
What am I doing wrong here?
I suspect these are the issues:
the "order" variable - is it correctly referenced in "join_orderID_orderDetails"?
"join_orderID_orderDetails" in "joined_dicts" - is it correctly referenced too?
Try the code below; hope this helps a little.
Here I have used arrays for your order and combined data, instead of using files.
order = [{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}]
combined = [
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"},
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"},
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"},
{"barcode":"95593","name":"Dog","quantity":6,"orderID":"test2"},
{"barcode":"95594","name":"Scar","quantity":6,"orderID":"test2"}
]
def joinjson(repl, tobeCombined):
newarr = []
for data in tobeCombined:
replData = getOrderData(repl,data['orderID'])
if replData is not None:
data.update(replData)
newarr.append(data)
return newarr
def getOrderData(order, orderID):
for data in order:
print("Data OrderID : ",data['orderID'])
if data['orderID'] == orderID:
return data
print(joinjson(order,combined))
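For reference, the None values in the original pipeline come from dict.update(), which mutates the dict in place and returns None. A minimal sketch of the join function returning the merged dict instead (the same idea as the snippet above) would be:
def join_orderID_orderDetails(order, order_dict):
    # dict.update() returns None, so build and return the merged dict instead
    merged = dict(order)
    merged.update(order_dict.get(order['orderID'], {}))
    return merged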
I am trying to write a function that loads a JSON file I have on Google Cloud Storage into a BigQuery dataset; however, even if I pass the schema explicitly, it still says "No schema specified on job or table".
import oauth2client
import uuid
import time
from google.cloud import bigquery as bq
# from oauth2client.client import GoogleCredentials
# Configuration
BILLING_PROJECT_ID = ---
DATASET_NAME = ---
TABLE_NAME = ---
BUCKET_NAME = ---
FILE = ---
SOURCE = 'gs://{}/{}'.format(BUCKET_NAME, FILE)
SCHEMA = [
bq.SchemaField('question_id', 'INTEGER'),
bq.SchemaField('accepted_answer', 'INTEGER'),
bq.SchemaField('answer_count', 'INTEGER')
]
# CREDENTIALS = GoogleCredentials.get_application_default()
client = bq.Client(project=BILLING_PROJECT_ID)
# Dataset
# Check if the dataset exists
def create_datasets(name):
dataset = client.dataset(name)
try:
assert not dataset.exists()
dataset.create()
assert dataset.exists()
print("Dataset {} created".format(name))
except(AssertionError):
pass
def load_data_from_gcs(dataset_name, table_name, source, schema):
'''
Load Data from Google Cloud Storage
'''
dataset = client.dataset(dataset_name)
table = dataset.table(table_name)
table.schema = schema
job_name = str(uuid.uuid4())
job = client.load_table_from_storage(
job_name, table, source)
job.source_format = 'NEWLINE_DELIMITED_JSON'
job.begin()
wait_for_job(job)
print('Loaded {} rows into {}:{}.'.format(
job.output_rows, dataset_name, table_name))
def wait_for_job(job):
while True:
job.reload()
if job.state == 'DONE':
if job.error_result:
raise RuntimeError(job.errors)
return
time.sleep(1)
load_data_from_gcs(dataset_name=DATASET_NAME,
table_name=TABLE_NAME,
source=SOURCE,
schema=SCHEMA)
I have resolved the issue: in this case, I had forgotten to call table.create() before starting the job.
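For reference, a minimal sketch of that fix, assuming the same (older) google-cloud-bigquery client API used above:
def load_data_from_gcs(dataset_name, table_name, source, schema):
    dataset = client.dataset(dataset_name)
    table = dataset.table(table_name)
    table.schema = schema
    if not table.exists():
        table.create()  # create the table (with its schema) before starting the load job
    job = client.load_table_from_storage(str(uuid.uuid4()), table, source)
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()
    wait_for_job(job)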
I am looking to fetch and publish data from Spark Streaming to Cloudant. My code is as follows:
from CloudantPublisher import CloudantPublisher
from CloudantFetcher import CloudantFetcher
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from kafka import KafkaConsumer, KafkaProducer
import json
class SampleFramework():
def __init__(self):
pass
@staticmethod
def messageHandler(m):
return json.loads(m.message)
@staticmethod
def processData(rdd):
if (rdd.isEmpty()):
SampleFramework.logger.info("RDD is empty")
return
# Expand
expanded_rdd = rdd.mapPartitions(CloudantFetcher.fetch)
expanded_rdd.foreachPartition(CloudantPublisher.publish)
def run(self, ssc):
self.ssc = ssc
directKafkaStream = KafkaUtils.createDirectStream(self.ssc, [SUBSCRIBE_QUEUE], \
{"metadata.broker.list": METADATA, \
"bootstrap.servers": BOOTSTRAP_SERVERS}, \
messageHandler= SampleFramework.messageHandler)
directKafkaStream.foreachRDD(SampleFramework.processData)
ssc.start()
ssc.awaitTermination()
Other supporting classes -
from CloudantConnector import CloudantConnector
class CloudantFetcher:
config = Config.createConfig()
cloudantConnector = CloudantConnector(config)
@staticmethod
def fetch(data):
final_data = []
for row in data:
id = row["id"]
if(not CloudantFetcher.cloudantConnector.isReady()):
CloudantFetcher.cloudantConnector.open()
data_json = CloudantFetcher.cloudantConnector.getOne({"id": id})
row["data"] = data_json
final_data.append(row)
CloudantFetcher.cloudantConnector.close()
return final_data
class CloudantPublisher:
config = Config.createConfig()
cloudantConnector = CloudantConnector(config)
@staticmethod
def publish(data):
CloudantPublisher.cloudantConnector.open()
CloudantPublisher.cloudantConnector.postAll(data)
CloudantPublisher.cloudantConnector.close()
from cloudant.client import Cloudant
from cloudant.result import Result
from cloudant.result import QueryResult
from cloudant.document import Document
from cloudant.query import Query
from cloudant.database import CloudantDatabase
import json
class CloudantConnector:
def __init__(self, config, db_name=None): # db_name is optional; it is read from the config below
self.config = config
self.client = Cloudant(self.config["cloudant"]["credentials"]["username"], self.config["cloudant"]["credentials"]["password"], url=self.config["cloudant"]["host"]["full"])
self.initialized = False
self.db_name = self.config["cloudant"]["host"]["db_name"]
def open(self):
try:
self.client.connect()
self.logger.info("Connection to Cloudant established.")
self.initialized = True
except:
raise Exception("Could not connect to Cloudant! Please verify credentials.")
self.database = CloudantDatabase(self.client,self.db_name)
if self.database.exists():
pass
else:
self.database.create()
def isReady(self):
return self.initialized
def close(self):
self.client.disconnect()
def getOne(self, query):
new_filter = query
query = Query(self.database, selector = query, limit=1)
results_string = json.dumps(query.result[0][0])
results_json = json.loads(results_string)
return results_json
def postAll(self, docs):
documents = []
quantum = self.config["cloudant"]["constants"]["bulk_quantum"]
count = 0
for doc in docs:
document = Document(self.database)
document["id"] = doc["id"]
document["data"] = doc["data"]
documents.append(document)
count = count + 1
if(count%quantum==0):
self.database.bulk_docs(documents)
documents = []
if(len(documents)!=0):
self.database.bulk_docs(documents)
self.logger.debug("Uploaded document to the Cloudant database.")
My implementation works, but it is slow compared to what I would expect if I did not initialize the Cloudant connection in each partition, and instead maintained a static pool of connections that each partition could fetch and use.
My Questions are as follows:
Do I need to create a connection pool with the Cloudant 2.0 API in Python? (It seems that one already exists within the API.) If yes, then how should I go about it? The closest implementation I have seen is this - link, but it is based on an outdated Cloudant API and does not work with the new one.
If the answer to the above is 'Yes', how can I make this accessible to the workers? I see references to creating serializable, lazily instantiated connection-client objects here. This would mean that I would make a lazily instantiated Cloudant connection object in the SampleFramework. How can I do this in Python, just like in the Spark documentation? (A rough sketch of what I have in mind is at the end of this post.)
connection = ConnectionPool.getConnection()
for record in iter:
connection.send(record)
ConnectionPool.returnConnection(connection)
If the above is not possible, how do I speed up my operations? The only alternative I can think of is maintaining a single connection on the driver program, collecting the data from all workers, and then fetching/uploading it from there. This would decrease the number of times I need to connect to Cloudant, but would take away the distributed fetching/publishing architecture.
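For illustration, here is a rough, untested sketch of the lazily instantiated pattern I have in mind in Python, reusing the CloudantConnector and Config classes above (a module-level client created at most once per executor process; names are placeholders):
_connector = None  # one connection per executor process, created lazily

def get_connector():
    global _connector
    if _connector is None:
        config = Config.createConfig()
        _connector = CloudantConnector(config)
        _connector.open()
    return _connector

def fetch_partition(rows):
    connector = get_connector()  # reused across all partitions handled by this process
    for row in rows:
        row["data"] = connector.getOne({"id": row["id"]})
        yield row

# in SampleFramework.processData:
# expanded_rdd = rdd.mapPartitions(fetch_partition)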